# Introduction

**Pulmonary fibrosis** occurs when scarred lung tissue complicates and diminishes the ability to breathe. Prognosis varies from patient to patient, and the reasons for these differences are unknown. The objective is to use CT scans provided by the **Open Source Imaging Consortium (OSIC)**, together with patient metadata, to predict the decline in breathing capacity as measured by **Forced Vital Capacity**, or **FVC**.

In this notebook, I perform exploratory data analysis (EDA) on the OSIC Pulmonary Fibrosis Progression dataset. I tried to keep everything accessible to data science newcomers - upvotes and/or comments are greatly appreciated! 😄

---

# Reading in Patient Metadata

In this section, I explore the tabular data containing the patients' demographic and clinical information. First, I import the metadata from the provided CSV files and take a look at its contents and overall structure. For reference, here are the columns contained in `train.csv` and `test.csv`:

* `Patient` - a unique ID for each patient (also the name of the patient's DICOM folder)
* `Weeks` - the relative number of weeks pre/post the baseline CT (may be negative)
* `FVC` - the recorded lung capacity in mL
* `Percent` - a computed field which approximates the patient's FVC as a percent of the typical FVC for a person of similar characteristics
* `Age`
* `Sex`
* `SmokingStatus`

Before diving into the data, we need to import the following packages:

In [None]:
# import libraries
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

To list what is contained in a directory, we can use `os.listdir()`:

In [None]:
os.listdir("../input/osic-pulmonary-fibrosis-progression")

In this notebook, we will only concern ourselves with `train.csv`, which contains the patients' demographic and clinical data, and the `train` subdirectory, which contains the patients' CT scans. Now let's read in `train.csv` and begin to understand our data!

In [None]:
# import metadata csv files as pandas DataFrames (DFs)
train_df = pd.read_csv('../input/osic-pulmonary-fibrosis-progression/train.csv')

# size of the DF
print(train_df.shape)

Our training data consists of 1549 rows and 7 columns. Let's take a look at the first few rows:

In [None]:
train_df.head()

We can get additional information, such as the column data types and whether they contain null values, using the `.info()` method:

In [None]:
train_df.info()

So `Patient`, `Sex` and `SmokingStatus` are of type object (strings); `Weeks`, `FVC` and `Age` are integers; and `Percent` holds float (decimal) values.

Another useful way of getting information from a DF is by using `.describe()`. This method returns summary statistics such as the mean, minimum and maximum - note that by default it returns output only for the numeric columns:

In [None]:
train_df.describe()

We can get quite a bit of insight from above - taking `Age` for example, we can see that the youngest is 42, the oldest is 88 and the average patient is ~67 years old.
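
By default `.describe()` leaves out string columns such as `Sex`. As a minimal sketch, the made-up `toy_df` below (hypothetical values, not taken from `train.csv`) shows that passing `exclude=[np.number]` summarizes the non-numeric columns instead:

```python
import numpy as np
import pandas as pd

# toy data standing in for train.csv (all values are invented for illustration)
toy_df = pd.DataFrame({
    'Patient': ['A', 'A', 'B', 'C'],
    'Sex': ['Male', 'Male', 'Female', 'Male'],
    'Age': [70, 70, 65, 58],
})

# drop the numeric columns so that only the string columns are summarized;
# the result reports count, unique, top (most frequent value) and freq
object_summary = toy_df.describe(exclude=[np.number])
print(object_summary)
```

The same call on `train_df` should give a quick overview of `Patient`, `Sex` and `SmokingStatus`.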

Notice when we returned the first few rows of our DF, it showed the same `Patient` multiple times (for different follow-up visits). Let's find out how many unique patients there are:

In [None]:
# print the number of unique patient IDs
n_patients = train_df['Patient'].nunique()
print(f'There are {n_patients} unique patients in the training dataset')
print('===============================')

# print the number of rows for each unique patient ID
print(train_df['Patient'].value_counts())

So there are 176 unique patients in the train DF. The output of `.value_counts()` also shows that not all patients have the same number of observations. From the output we might get the impression that many patients had 10 clinical visits and only a few had 7 or 6 visits, but to find the exact distribution of the number of visits, we can use `.value_counts().value_counts()` 😎

In [None]:
# get the distribution of number of visits
print(train_df['Patient'].value_counts().value_counts())

This tells us explicitly that there are 9 rows (visits) for the majority of patients.
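
If the double `.value_counts()` looks cryptic, here is a small sketch on a made-up Series (the patient labels `A`, `B`, `C` are hypothetical) that shows what each step returns:

```python
import pandas as pd

# toy Patient column: A and B have 3 rows each, C has 2 rows
visits = pd.Series(['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C'], name='Patient')

# first call: number of rows (visits) per patient
rows_per_patient = visits.value_counts()

# second call: how many patients share each visit count
visit_distribution = rows_per_patient.value_counts()
print(visit_distribution)
```

The index of the final result is the number of visits, and the values are the number of patients with that many visits.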

Since the `Age`, `Sex` and `SmokingStatus` columns are repeated for rows of the same patient, we need to drop these duplicates before exploring these variables. However, the patients have numerous follow-up visits over the course of ~1-2 years, so it's possible that `Age` for the same patient could have different values. Let's check whether `Age` contains different values for each patient:

In [None]:
# for each patient, get the number of unique Age values
print(train_df.groupby('Patient')['Age'].nunique())
print('===============================')

# get unique values of the output from above
print(train_df.groupby('Patient')['Age'].nunique().unique())

`Age` is constant for every patient, great! Now we create a new DF for individual patients and confirm the shape to have 176 rows:

In [None]:
patient_df = train_df[['Patient', 'Age', 'Sex', 'SmokingStatus']].drop_duplicates()
patient_df.shape
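
The same consistency check can be extended to `Sex` and `SmokingStatus` in one go. The sketch below uses a made-up `toy_df` (hypothetical values, with patient `B` deliberately given two different ages) to show how inconsistent patients would be flagged:

```python
import pandas as pd

# toy data: patient B has two different Age values on purpose
toy_df = pd.DataFrame({
    'Patient': ['A', 'A', 'B', 'B'],
    'Age': [70, 70, 65, 66],
    'Sex': ['Male', 'Male', 'Female', 'Female'],
    'SmokingStatus': ['Ex-smoker', 'Ex-smoker', 'Never smoked', 'Never smoked'],
})

demo_cols = ['Age', 'Sex', 'SmokingStatus']

# number of distinct values per patient for each demographic column
n_unique = toy_df.groupby('Patient')[demo_cols].nunique()

# keep only patients where at least one column has more than one value
inconsistent = n_unique[(n_unique > 1).any(axis=1)]
print(inconsistent)
```

An empty `inconsistent` DF would mean `drop_duplicates()` yields exactly one row per patient.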

# EDA: Patient Demographics
Now we can start exploring the patient demographics! Let's look at the distributions of `Age`, `Sex` and `SmokingStatus`.

In [None]:
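# sns.distplot is deprecated in recent seaborn releases; histplot with kde=True
# draws a comparable density histogram with a KDE curve (requires seaborn >= 0.11)
sns.histplot(patient_df['Age'], kde=True, stat='density')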
plt.title('Distribution of Age', size=20)
plt.xlabel('');

`Age` has a roughly normal distribution with a large proportion of patients being 60 to 75 years old.

In [None]:
f, axes = plt.subplots(1,2, figsize=(12, 4))

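# pass the column by keyword: newer seaborn versions treat the first positional argument as `data`
sns.countplot(x='Sex', data=patient_df, ax=axes[0], palette='muted')
axes[0].set_title('Sex', size=20)
axes[0].set_xlabel('')

sns.countplot(x='SmokingStatus', data=patient_df, ax=axes[1], palette=sns.color_palette("Set2"))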
axes[1].set_title('Smoking Status', size=20)
axes[1].set_xlabel('');

The patients are predominantly male and mainly consist of ex-smokers. Patients who never smoked are the second most common `SmokingStatus` group and current smokers make up the smallest group.
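
To put numbers on statements like "predominantly male", we can compute proportions rather than read them off the bars. A minimal sketch on a made-up column (the values are hypothetical, not the real counts):

```python
import pandas as pd

# toy per-patient table: 3 males and 1 female (invented for illustration)
toy_patients = pd.DataFrame({'Sex': ['Male', 'Male', 'Male', 'Female']})

# normalize=True returns the share of each category instead of raw counts
sex_share = toy_patients['Sex'].value_counts(normalize=True)
print(sex_share)
```

Applying the same call to `patient_df['Sex']` or `patient_df['SmokingStatus']` should give the exact split behind the plots.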

We now have a sense of the distributions of each categorical variable. We can also explore them with respect to each other! For example, let's look at how `Sex` is distributed with respect to each `SmokingStatus` group:

In [None]:
g = sns.FacetGrid(patient_df, col="SmokingStatus", hue='Sex', palette='muted', height=4)
g.map(sns.countplot, 'Sex', order=['Male','Female'])
g.axes[0, 0].set_ylabel('Number of Patients', size=14)
g.set_titles(size=14, fontweight='bold', col_template="{col_name}")
g.fig.subplots_adjust(top=0.8)
g.fig.suptitle('Distribution of Sex Across Smoking Groups', fontsize=20);

Ex-smokers and current smokers are predominantly male, while those who never smoked are split roughly evenly by sex. We can also look at how `Age` is distributed across `SmokingStatus` groups:

In [None]:
sns.violinplot(x='SmokingStatus', y='Age', data=patient_df, hue="Sex", palette='muted', split=True)
plt.title('Age Distributions Across Smoking Groups', size=16)
plt.legend(loc='lower left', ncol=2);

Generally, patient `Age` distributions are similar across `SmokingStatus` groups and are centered around 65-70 years. The exceptions are females who currently smoke or used to smoke, whose `Age` distributions tend to be more spread out, but this may be due to the small number of patients within these groups.
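
A quick way to check how many patients sit in each subgroup is a cross-tabulation. The sketch below uses a made-up `toy_patients` DF (hypothetical rows, not the real cohort):

```python
import pandas as pd

# toy per-patient table (values invented for illustration)
toy_patients = pd.DataFrame({
    'SmokingStatus': ['Ex-smoker', 'Ex-smoker', 'Ex-smoker',
                      'Never smoked', 'Never smoked', 'Currently smokes'],
    'Sex': ['Male', 'Male', 'Female', 'Male', 'Female', 'Male'],
})

# rows: smoking groups, columns: sex, cells: number of patients
group_sizes = pd.crosstab(toy_patients['SmokingStatus'], toy_patients['Sex'])
print(group_sizes)
```

Small cell counts in the real data would support reading the wider violins with caution.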

# EDA: Patient Clinical Data
We now have a fairly good understanding of our patients' demographics. Let's investigate the numeric variables or the patients' clinical data, starting with the `Weeks` variable!

In [None]:
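# sns.distplot is deprecated in recent seaborn releases; histplot with kde=True
# draws a comparable density histogram with a KDE curve (requires seaborn >= 0.11)
sns.histplot(train_df['Weeks'], kde=True, stat='density')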
plt.title('Distribution of Weeks', size=20);

The `Weeks` distribution is considerably skewed to the right. We can see that most data points fall within 75 weeks or roughly 1.5 years. Let's shift our attention to the Forced Vital Capacity or `FVC` and `Percent` variables. First, I compare these metrics between the sexes and also across `SmokingStatus` groups:

In [None]:
sns.swarmplot(x='SmokingStatus', y='FVC', data=train_df, hue='Sex', dodge=True, size=3, palette='muted')
plt.title('Males tend to Have Greater FVC than Females', size=18);

Males tend to have greater `FVC`s than females, as one might expect. However, rather surprisingly, FVC does not appear to be associated with `SmokingStatus`.

In [None]:
sns.swarmplot(x='SmokingStatus', y='Percent', data=train_df, hue='Sex', dodge=True, size=3, palette='muted')
plt.title('Percent is Comparable Between\nSex & Across Smoking Groups', size=18);

`Percent` is a calculated metric that quantifies one's FVC with respect to the typical FVC of their demographic. We can see the effect of this normalization above, where the `Percent` distributions of males and females are now similar. There are also a few data points among the current smokers that have markedly higher `Percent` values than the rest of the cohort.
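
Since `Percent` is described as FVC expressed as a percent of the typical FVC, we can back out the implied typical value as `100 * FVC / Percent`. This is only a sketch under that assumption: the exact reference formula is not given in the data description, the column name `Typical_FVC` is hypothetical, and the numbers below are made up:

```python
import pandas as pd

# toy measurements (invented values, FVC in mL)
toy_df = pd.DataFrame({
    'FVC': [2400, 3000, 2000],
    'Percent': [60.0, 75.0, 80.0],
})

# assumption: Percent = 100 * FVC / typical FVC, so typical FVC = 100 * FVC / Percent
toy_df['Typical_FVC'] = 100 * toy_df['FVC'] / toy_df['Percent']
print(toy_df)
```

Two patients with quite different raw FVC values can share the same implied typical FVC, which is why `Percent` is easier to compare across demographics.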

Let's explore the relationships between our numeric variables using the quick and dirty `sns.pairplot()`:

In [None]:
g = sns.pairplot(train_df, hue='SmokingStatus', palette=sns.color_palette('Set2'), corner=True, plot_kws={'alpha': 0.5})
g._legend.set_bbox_to_anchor((0.7, 0.7));

There do not seem to be any apparent associations between these variables aside from `FVC` and `Percent`, which are positively correlated.
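
To back up the visual impression with numbers, we can compute pairwise Pearson correlations with `.corr()`. A minimal sketch on a made-up `toy_df` (hypothetical values in which `Percent` is an exact multiple of `FVC`, so their correlation is 1 by construction):

```python
import pandas as pd

# toy numeric data (invented for illustration)
toy_df = pd.DataFrame({
    'FVC': [2000, 2500, 3000, 3500],
    'Percent': [50.0, 62.5, 75.0, 87.5],
    'Age': [70, 60, 72, 65],
})

# pairwise Pearson correlation between the numeric columns
corr_matrix = toy_df[['FVC', 'Percent', 'Age']].corr()
print(corr_matrix.round(2))
```

On `train_df`, the same call restricted to `Weeks`, `FVC`, `Percent` and `Age` would quantify how strong the `FVC`-`Percent` relationship actually is.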

Tracking a patient's `FVC` over time is central to the objective of this competition. We can visualize the FVC dynamics of an individual patient like so:

In [None]:
sns.pointplot(x='Weeks', y='FVC', data=train_df[train_df['Patient']=='ID00007637202177411956430']);

However, what if we wanted to compare changes in FVC between different patient groups? For example, we can plot the average `FVC` over time of each `SmokingStatus` group:

In [None]:
# subset data for weeks<100 since there are very few observations past this point
sns.lineplot(x='Weeks', y='FVC', data=train_df[train_df['Weeks']<100], hue="SmokingStatus", palette=sns.color_palette('Set2', n_colors=3))
plt.legend(bbox_to_anchor=(1.05, 1))
plt.title('Change in FVC Over Time', size=18);

No differences in FVC progression between smoking groups are apparent. But note that FVC can vary between individuals - this noise could be masking any trend that may be present. We can look at FVC progression in relative terms by calculating the change in FVC relative to each patient's previous visit (note that `.pct_change()` returns a fraction rather than a percentage):

In [None]:
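# fractional change in FVC relative to each patient's previous row (first row per patient set to 0);
# this assumes the rows are ordered by Weeks within each patient
train_df['FVC_%Change'] = train_df.groupby('Patient')['FVC'].pct_change().fillna(0)
sns.lineplot(x='Weeks', y='FVC_%Change', data=train_df[train_df['Weeks']<=100], hue="SmokingStatus", palette=sns.color_palette('Set2', n_colors=3))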
plt.title('% Change in FVC Over Time');

Although the % change in FVC is less variable than FVC itself, we still do not observe any notable differences in averaged FVC progression between smoking groups.
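
The `.pct_change()` call above compares each visit with the previous row. An alternative view, sketched here on a made-up `toy_df`, is the change relative to each patient's first recorded visit. The column name `FVC_pct_from_baseline` is hypothetical, and using the earliest `Weeks` value as the baseline is an assumption of this sketch:

```python
import pandas as pd

# toy visits, deliberately unsorted (values invented for illustration)
toy_df = pd.DataFrame({
    'Patient': ['A', 'A', 'A', 'B', 'B'],
    'Weeks': [10, 0, 20, 5, 15],
    'FVC': [1800, 2000, 1700, 3000, 2700],
})

# order the visits chronologically within each patient
toy_df = toy_df.sort_values(['Patient', 'Weeks']).reset_index(drop=True)

# broadcast each patient's first FVC value to all of that patient's rows
baseline = toy_df.groupby('Patient')['FVC'].transform('first')

# percent change relative to the first visit
toy_df['FVC_pct_from_baseline'] = 100 * (toy_df['FVC'] - baseline) / baseline
print(toy_df)
```

Because every patient starts at 0, this version accumulates decline over time instead of showing visit-to-visit fluctuations.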

---