##Preprocessing Dataset Homework

We are working with the csv file from the Chest X-ray dataset: https://www.kaggle.com/nih-chest-xrays/data

We would like to detect if a patient has some form of chest disease based on the person's characteristics. The objective of this assignment is to extract features and labels and save them as csv files in our Google Drive.

For our assignment, we will omit any image related information and assume that the only information we have is the CSV. 

Note: In real life, the information in the CSV will likely not be enough to detect what chest disease a person has since we are not looking at the X-ray. However, we ignore this fact for the purposes of this exercise.

###Setting up

Import the proper libraries

One library you will need (numpy) is provided.
You will have to import the other library (pandas) yourself.

In [2]:
import numpy as np
import pandas as pd

Mount Google Drive

In [3]:
from google.colab import drive
drive.mount("/content/drive")

Mounted at /content/drive


Find the file path of your CSV file!

In [6]:
!ls "/content/drive/My Drive/MLBootcamp/Week One/Day Five"

Data_Entry_2017.csv  features.csv  HW5.ipynb  labels.csv


###Pandas


Read in the csv! Let us name the variable that stores the dataframe - *data*.

In [7]:
data = pd.read_csv("/content/drive/My Drive/MLBootcamp/Week One/Day Five/Data_Entry_2017.csv")

Take a look at the first few rows of the data!

In [8]:
data.head()

Unnamed: 0,Image Index,Finding Labels,Follow-up #,Patient ID,Patient Age,Patient Gender,View Position,OriginalImage[Width,Height],OriginalImagePixelSpacing[x,y],Unnamed: 11
0,00000001_000.png,Cardiomegaly,0,1,58,M,PA,2682,2749,0.143,0.143,
1,00000001_001.png,Cardiomegaly|Emphysema,1,1,58,M,PA,2894,2729,0.143,0.143,
2,00000001_002.png,Cardiomegaly|Effusion,2,1,58,M,PA,2500,2048,0.168,0.168,
3,00000002_000.png,No Finding,0,2,81,M,PA,2500,2048,0.171,0.171,
4,00000003_000.png,Hernia,0,3,81,F,PA,2582,2991,0.143,0.143,


What columns are in our dataframe?

In [9]:
data.columns

Index(['Image Index', 'Finding Labels', 'Follow-up #', 'Patient ID',
       'Patient Age', 'Patient Gender', 'View Position', 'OriginalImage[Width',
       'Height]', 'OriginalImagePixelSpacing[x', 'y]', 'Unnamed: 11'],
      dtype='object')
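
The `Unnamed: 11` entry in the column list is worth a second look. One common cause (an assumption here, not something verified against the real file) is a trailing comma at the end of each CSV line, which pandas reads as an extra, empty column. A minimal sketch on made-up CSV text:

```python
import io

import pandas as pd

# Hypothetical CSV text with a trailing comma on every line.
csv_text = "a,b,\n1,2,\n3,4,\n"

df = pd.read_csv(io.StringIO(csv_text))
print(list(df.columns))  # ['a', 'b', 'Unnamed: 2']

# Columns that contain nothing but NaN can be dropped in one call.
df_clean = df.dropna(axis=1, how="all")
print(list(df_clean.columns))  # ['a', 'b']
```

In this homework the unnamed column is removed by name in the next section, so the `dropna(axis=1, how="all")` call is only an alternative.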

###Determining Features and Labels

What columns seem useful? What less so? Again, please ignore all image related columns. 

Columns that do not seem useful: Image Index, Follow-up #, Patient ID

Irrelevant columns (image related, plus the unnamed column): View Position, OriginalImage[Width, Height], OriginalImagePixelSpacing[x, y], Unnamed: 11

Drop all columns that are not useful and view the new dataframe.

In [10]:
data = data.drop(['Image Index', 'Follow-up #', 'Patient ID', 'View Position', 'OriginalImage[Width', 'Height]', 'OriginalImagePixelSpacing[x', 'y]', 'Unnamed: 11'], axis = 1)
data

Unnamed: 0,Finding Labels,Patient Age,Patient Gender
0,Cardiomegaly,58,M
1,Cardiomegaly|Emphysema,58,M
2,Cardiomegaly|Effusion,58,M
3,No Finding,81,M
4,Hernia,81,F
...,...,...,...
112115,Mass|Pneumonia,39,M
112116,No Finding,29,M
112117,No Finding,42,F
112118,No Finding,30,F


Which of these remaining columns are features? Which are labels? 

Hint: At this point in your assignment, you should have 3 columns.

`Finding Labels` is the label column. `Patient Age` and `Patient Gender` are the features.

###Examining Dataframe

Let's examine each column. What is the distribution of values in each column? Are there any weird values?

In [12]:
print(data['Finding Labels'].value_counts())

No Finding                                                       60361
Infiltration                                                      9547
Atelectasis                                                       4215
Effusion                                                          3955
Nodule                                                            2705
                                                                 ...  
Atelectasis|Emphysema|Mass|Pleural_Thickening|Pneumothorax           1
Atelectasis|Emphysema|Pneumonia|Pneumothorax                         1
Fibrosis|Infiltration|Nodule|Pleural_Thickening                      1
Atelectasis|Cardiomegaly|Effusion|Fibrosis|Pleural_Thickening        1
Effusion|Emphysema|Nodule|Pleural_Thickening                         1
Name: Finding Labels, Length: 836, dtype: int64


In [13]:
data['Patient Age'].unique()

array([ 58,  81,  74,  75,  76,  77,  78,  79,  80,  82,  69,  70,  73,
        84,  61,  60,  62,  56,  57,  71,  66,  53,  47,  48,  49,  63,
        64,  52,  68,  59,  55,  72,  67,  46,  91,  92,  87,  65,  45,
        54,  50,  51,  44,  83,  33,  42,  25,  31,  94,  89,  90,  40,
        85,  30,  32,  34,  86,  37,  27,  29,  36,  38,  39,  43,  28,
        41,  35,  22,  23,  26,  21,  88,  24,  17,  18,  19,  20,  16,
        13,  14,  11,  12,  15,  93,   9,  10,   8,   6,   7,   4,   5,
         3,   2, 412,   1, 414, 148,  95, 150, 149, 152, 151, 411, 413,
       153, 154, 155])

In [15]:
data['Patient Gender'].value_counts()

M    63340
F    48780
Name: Patient Gender, dtype: int64
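
Scanning a long `unique()` array by eye is error-prone. As a rough sketch, summary statistics plus an explicit plausibility check can surface odd values faster. The ages below are made up, and the cutoff of 110 is an assumption about what counts as an implausible human age:

```python
import pandas as pd

# Made-up ages, including a few implausible ones.
ages = pd.Series([58, 81, 412, 7, 150, 95], name="Patient Age")

# min / max in the summary already hint at a problem.
print(ages.describe())

# Pull out the values above an assumed plausibility cutoff.
suspicious = ages[ages > 110]
print(suspicious.tolist())  # [412, 150]
```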

###Removing rows with weird values

One feature column in particular has some weird values. Let's delete the rows where this column has weird values.

One way to drop weird values is setting the cells that have a weird value to np.nan using the numpy library. NaN means not a number.

Then we drop the rows with NaN values.

Part of the statement has been written out for you. Please fill in the proper value for items inside \$ \$ and delete the \$ \$ afterwards.

Take a look at this page: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html and understand what the loc function does

and this page on instruction how to drop rows with NA values: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html

In the future, I may not give you specific documentation links to help you out. Just as professional coders do, you will be expected to look up the functionality of the libraries and figure out if a method exists in the library that you are using that can make your life easier. 

This is definitely not an easy skill, but googling and Stack Overflow are my go-tos when I don't know something!

First, let's set the weird values to np.nan 

**(Note: You may have to modify the data variable if you have named your dataframe something else)**

In [18]:
data.loc[data['Patient Age'] > 110, 'Patient Age'] = np.nan
data['Patient Age'].unique()

array([58., 81., 74., 75., 76., 77., 78., 79., 80., 82., 69., 70., 73.,
       84., 61., 60., 62., 56., 57., 71., 66., 53., 47., 48., 49., 63.,
       64., 52., 68., 59., 55., 72., 67., 46., 91., 92., 87., 65., 45.,
       54., 50., 51., 44., 83., 33., 42., 25., 31., 94., 89., 90., 40.,
       85., 30., 32., 34., 86., 37., 27., 29., 36., 38., 39., 43., 28.,
       41., 35., 22., 23., 26., 21., 88., 24., 17., 18., 19., 20., 16.,
       13., 14., 11., 12., 15., 93.,  9., 10.,  8.,  6.,  7.,  4.,  5.,
        3.,  2., nan,  1., 95.])

Now, let's drop the rows with the NaN values.

In [19]:
data = data.dropna()
data

Unnamed: 0,Finding Labels,Patient Age,Patient Gender
0,Cardiomegaly,58.0,M
1,Cardiomegaly|Emphysema,58.0,M
2,Cardiomegaly|Effusion,58.0,M
3,No Finding,81.0,M
4,Hernia,81.0,F
...,...,...,...
112115,Mass|Pneumonia,39.0,M
112116,No Finding,29.0,M
112117,No Finding,42.0,F
112118,No Finding,30.0,F
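
The two steps above (mark with NaN, then drop) can be sketched on a toy dataframe, next to a one-step boolean filter that gives the same rows. The dataframe is made up, and the explicit cast to float is an assumption about your pandas version: some newer releases warn when a float NaN is written into an integer column.

```python
import numpy as np
import pandas as pd

# Toy data: two plausible ages and two implausible ones.
toy = pd.DataFrame({
    "Patient Age": [58, 412, 81, 150],
    "Patient Gender": ["M", "F", "M", "F"],
})

# Approach from the text: mark bad cells as NaN, then drop those rows.
marked = toy.copy()
marked["Patient Age"] = marked["Patient Age"].astype(float)  # NaN requires a float column
marked.loc[marked["Patient Age"] > 110, "Patient Age"] = np.nan
cleaned = marked.dropna(subset=["Patient Age"])

# Alternative: keep only the rows that pass the check.
filtered = toy[toy["Patient Age"] <= 110]

print(cleaned.index.tolist())   # [0, 2]
print(filtered.index.tolist())  # [0, 2]
```

Note that the filter keeps the age column as integers, while the NaN route turns it into floats (which is why the output above shows `58.0` instead of `58`). Passing `subset=` to `dropna` limits the check to one column; a bare `dropna()` drops rows with NaN in any column.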


###Transforming features into proper representation

The other feature needs a more appropriate representation that the computer can understand. What is this representation called?

Binary/One-Hot Vector

Now transform that feature into this representation and print the new dataframe.

Part of the statement has been written out for you. Please fill in the proper value for items inside \$ \$ and delete the \$ \$ afterwards.

I have given you a major hint to use get_dummies, but I have not provided you the link to the documentation this time. Please look up the documentation and then complete the statement below.


In [22]:
gen_data = pd.get_dummies(data['Patient Gender'], prefix='Is', drop_first = True) # drop_first=True: with only two possible values, a single column (Is_M) carries all the information
gen_data

Unnamed: 0,Is_M
0,1
1,1
2,1
3,1
4,0
...,...
112115,1
112116,1
112117,0
112118,0
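
To see what `drop_first` changes, here is a small sketch on a made-up gender column. The `.astype(int)` call is there because newer pandas versions return `True`/`False` from `get_dummies` rather than `1`/`0`:

```python
import pandas as pd

# Made-up values for illustration.
gender = pd.Series(["M", "M", "F", "M"], name="Patient Gender")

# Without drop_first: one column per category.
both = pd.get_dummies(gender, prefix="Is")
print(list(both.columns))  # ['Is_F', 'Is_M']

# With drop_first: the first category (F) is dropped, leaving Is_M.
one = pd.get_dummies(gender, prefix="Is", drop_first=True).astype(int)
print(list(one.columns))     # ['Is_M']
print(one["Is_M"].tolist())  # [1, 1, 0, 1]
```

With two categories the dropped column is redundant: `Is_F` is always `1 - Is_M`.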


Concatenate the outputted dataframe with the dataframe that you were working with.

In [23]:
new_data = pd.concat([data, gen_data], axis = 1)
new_data 

Unnamed: 0,Finding Labels,Patient Age,Patient Gender,Is_M
0,Cardiomegaly,58.0,M,1
1,Cardiomegaly|Emphysema,58.0,M,1
2,Cardiomegaly|Effusion,58.0,M,1
3,No Finding,81.0,M,1
4,Hernia,81.0,F,0
...,...,...,...,...
112115,Mass|Pneumonia,39.0,M,1
112116,No Finding,29.0,M,1
112117,No Finding,42.0,F,0
112118,No Finding,30.0,F,0


Delete any unnecessary columns that we will not be using and print out the new dataframe.

In [24]:
new_data = new_data.drop(["Patient Gender"], axis = 1)
new_data

Unnamed: 0,Finding Labels,Patient Age,Is_M
0,Cardiomegaly,58.0,1
1,Cardiomegaly|Emphysema,58.0,1
2,Cardiomegaly|Effusion,58.0,1
3,No Finding,81.0,1
4,Hernia,81.0,0
...,...,...,...
112115,Mass|Pneumonia,39.0,1
112116,No Finding,29.0,1
112117,No Finding,42.0,0
112118,No Finding,30.0,0


###Transforming labels into proper representation

Similarly, labels must be transformed to an encoding that makes sense. 

Please transform them and create a new dataframe.

In [None]:
findings = data["Finding Labels"].str.get_dummies(sep = "|")
findings

Unnamed: 0,Atelectasis,Cardiomegaly,Consolidation,Edema,Effusion,Emphysema,Fibrosis,Hernia,Infiltration,Mass,No Finding,Nodule,Pleural_Thickening,Pneumonia,Pneumothorax
0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0
1,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0
2,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0
4,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
112115,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0
112116,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0
112117,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0
112118,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0
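
Because one X-ray can carry several findings separated by `|`, each row may end up with more than one `1`. This is sometimes called a multi-hot encoding. A minimal sketch on made-up label strings:

```python
import pandas as pd

# Made-up label strings in the same "A|B" format as Finding Labels.
labels_raw = pd.Series(["Cardiomegaly|Emphysema", "No Finding", "Hernia"])

# Split on "|" and create one 0/1 column per distinct finding.
multi_hot = labels_raw.str.get_dummies(sep="|")

print(list(multi_hot.columns))
# ['Cardiomegaly', 'Emphysema', 'Hernia', 'No Finding']

# Number of findings per row.
print(multi_hot.sum(axis=1).tolist())  # [2, 1, 1]
```

Calling `pd.get_dummies` on the raw column instead would treat every distinct combination string as its own category, which is why the `.str` version with `sep` is used here.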


Concatenate the outputted dataframe with the dataframe that you were working with. Drop any unnecessary columns and print the dataframe.

In [None]:
final_data = pd.concat([new_data, findings], axis = 1).drop(["Finding Labels"], axis = 1)
final_data

Unnamed: 0,Patient Age,Is_M,Atelectasis,Cardiomegaly,Consolidation,Edema,Effusion,Emphysema,Fibrosis,Hernia,Infiltration,Mass,No Finding,Nodule,Pleural_Thickening,Pneumonia,Pneumothorax
0,58.0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0
1,58.0,1,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0
2,58.0,1,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0
3,81.0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0
4,81.0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
112115,39.0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0
112116,29.0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0
112117,42.0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0
112118,30.0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0
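
One thing to keep in mind with `pd.concat(..., axis = 1)`: rows are matched by index label, not by position. In this notebook both dataframes come from the same cleaned `data`, so their indexes agree. The sketch below uses two made-up dataframes with different indexes to show what happens when they do not:

```python
import pandas as pd

# Made-up frames: `left` is missing index label 1.
left = pd.DataFrame({"a": [1, 2]}, index=[0, 2])
right = pd.DataFrame({"b": [10, 20, 30]}, index=[0, 1, 2])

# Default (outer) join keeps every label and fills the gaps with NaN.
joined = pd.concat([left, right], axis=1)
print(joined["a"].isna().sum())  # 1

# join="inner" keeps only labels present in both frames.
inner = pd.concat([left, right], axis=1, join="inner")
print(sorted(inner.index.tolist()))  # [0, 2]
```

If unexpected NaN values show up after a concatenation, comparing the two indexes is a good first check.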


###Extract the dataframe into two separate dataframes representing labels and features

First, let's extract the features. Please fill in the proper value for items inside \$ \$ and delete the \$ \$ afterwards. Print the features.

In [None]:
features = final_data[["Patient Age", "Is_M"]]
features

Unnamed: 0,Patient Age,Is_M
0,58.0,1
1,58.0,1
2,58.0,1
3,81.0,1
4,81.0,0
...,...,...
112115,39.0,1
112116,29.0,1
112117,42.0,0
112118,30.0,0


Now extract the labels by dropping the unnecessary columns. Print the labels.

In [None]:
final_data = final_data.drop(["Patient Age", "Is_M"], axis = 1)
labels = final_data
labels

Unnamed: 0,Atelectasis,Cardiomegaly,Consolidation,Edema,Effusion,Emphysema,Fibrosis,Hernia,Infiltration,Mass,No Finding,Nodule,Pleural_Thickening,Pneumonia,Pneumothorax
0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0
1,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0
2,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0
4,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
112115,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0
112116,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0
112117,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0
112118,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0
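
Keeping the feature column names in a single list avoids typing them twice and makes it easy to check that features and labels still line up row by row. A small sketch on a made-up dataframe (`final` and `feature_cols` are illustrative names):

```python
import pandas as pd

# Made-up stand-in for the combined dataframe.
final = pd.DataFrame({
    "Patient Age": [58.0, 81.0],
    "Is_M": [1, 0],
    "Hernia": [0, 1],
    "No Finding": [1, 0],
})

feature_cols = ["Patient Age", "Is_M"]

# Features: select by name. Labels: everything else.
features = final[feature_cols]
labels = final.drop(columns=feature_cols)

# Sanity checks: same rows, no column in both.
print(features.index.equals(labels.index))
print(set(features.columns) & set(labels.columns))
```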


###Save a CSV of features and labels to your Google Drive

In [None]:
features.to_csv("/content/drive/My Drive/MLBootcamp/Week One/Day Five/features.csv")

In [None]:
labels.to_csv("/content/drive/My Drive/MLBootcamp/Week One/Day Five/labels.csv")
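
By default `to_csv` also writes the row index, which shows up as an `Unnamed: 0` column when the file is read back. If you do not need the original row numbers, `index=False` leaves them out. The sketch below writes to a temporary folder rather than Google Drive, and the dataframe is made up:

```python
import os
import tempfile

import pandas as pd

# Made-up features for illustration.
features = pd.DataFrame({"Patient Age": [58.0, 81.0], "Is_M": [1, 0]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "features.csv")

    # index=False: write only the data columns.
    features.to_csv(path, index=False)

    # Read the file back to confirm what was saved.
    restored = pd.read_csv(path)

print(list(restored.columns))  # ['Patient Age', 'Is_M']
```

If you want to keep the index (for example, to trace rows back to the original CSV), leave the default and read the file with `index_col=0`.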