## Data Preprocessing

* The first and often the most time-consuming step
* Data must be in the form that the learning or analysis algorithms expect

![](data_cleaning.jpg)

### Data Preprocessing Steps

* Getting the dataset
* Exploring the dataset
* Missing Values
* Categorical Values
* Splitting the dataset
* Scaling the dataset


In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats

### Exploring the data

In [2]:
aq = pd.read_csv("airquality.csv")
aq.head()
# if you want to see all data, uncomment the line below
#pd.set_option('display.max_rows', None) 
print(aq)
print(aq.iloc[:,2]) # column 0 is Ozone, column 2 is Wind
print(aq.loc[:,'Ozone'])
print(aq.iloc[0:10]) # rows 0 through 9, all columns

     Ozone  Solar.R  Wind  Temp  Month  Day
0     41.0    190.0   7.4    67      5    1
1     36.0    118.0   8.0    72      5    2
2     12.0    149.0  12.6    74      5    3
3     18.0    313.0  11.5    62      5    4
4      NaN      NaN  14.3    56      5    5
..     ...      ...   ...   ...    ...  ...
148   30.0    193.0   6.9    70      9   26
149    NaN    145.0  13.2    77      9   27
150   14.0    191.0  14.3    75      9   28
151   18.0    131.0   8.0    76      9   29
152   20.0    223.0  11.5    68      9   30

[153 rows x 6 columns]
0       7.4
1       8.0
2      12.6
3      11.5
4      14.3
       ... 
148     6.9
149    13.2
150    14.3
151     8.0
152    11.5
Name: Wind, Length: 153, dtype: float64
0      41.0
1      36.0
2      12.0
3      18.0
4       NaN
       ... 
148    30.0
149     NaN
150    14.0
151    18.0
152    20.0
Name: Ozone, Length: 153, dtype: float64
   Ozone  Solar.R  Wind  Temp  Month  Day
0   41.0    190.0   7.4    67      5    1
1   36.0    118.0  
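
The difference between position-based `iloc` and label-based `loc` selection used in the cell above can be seen on a small frame. This is a minimal sketch: the `toy` DataFrame is made up for illustration and only mimics the first rows of `airquality.csv`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the first rows of airquality.csv
toy = pd.DataFrame({
    "Ozone": [41.0, 36.0, 12.0, 18.0, np.nan],
    "Solar.R": [190.0, 118.0, 149.0, 313.0, np.nan],
    "Wind": [7.4, 8.0, 12.6, 11.5, 14.3],
})

by_position = toy.iloc[:, 2]    # third column, selected by integer position
by_label = toy.loc[:, "Wind"]   # same column, selected by name
first_rows = toy.iloc[0:3]      # rows 0, 1, 2 (the end of an iloc slice is excluded)
label_rows = toy.loc[0:3]       # rows 0, 1, 2, 3 (the end of a loc slice is included)

print(by_position.equals(by_label))
print(len(first_rows), len(label_rows))
```

Note that slicing behaves differently in the two accessors, which is an easy source of off-by-one mistakes.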

In [124]:
tips = sns.load_dataset("tips")
tips.head()

   total_bill   tip     sex smoker  day    time  size
0       16.99  1.01  Female     No  Sun  Dinner     2
1       10.34  1.66    Male     No  Sun  Dinner     3
2       21.01  3.50    Male     No  Sun  Dinner     3
3       23.68  3.31    Male     No  Sun  Dinner     2
4       24.59  3.61  Female     No  Sun  Dinner     4


### Missing Values

* Generally indicated by NaN in Python
    - May instead be coded as an extreme sentinel value such as 9999
* Delete the row
    - Not a good idea unless you have lots of repeated measures
* Fill in with the column mean, median, or mode
* Fill in with the mean of neighboring items
* When filling in with a statistic, fit the imputer to the training data only
    - Then transform both the training data and the test data with the fitted imputer
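
The options listed above can be sketched with plain pandas. The readings below are hypothetical, and the assumption that 9999 marks a missing value is only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical readings; 9999 is assumed to be the "missing" sentinel
s = pd.Series([10.0, 9999.0, 14.0, 16.0, 9999.0, 20.0])

s = s.replace(9999.0, np.nan)          # turn the sentinel into NaN
dropped = s.dropna()                   # option 1: delete rows with missing values
mean_filled = s.fillna(s.mean())       # option 2: fill with the column mean
median_filled = s.fillna(s.median())   # option 2 (variant): fill with the median
neighbor_filled = s.interpolate()      # option 3: linear interpolation between neighbors

print(mean_filled.tolist())
print(neighbor_filled.tolist())
```

For a single-row gap, linear interpolation is the same as the mean of the two neighboring items.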

In [5]:
tips.isnull().sum()

total_bill    0
tip           0
sex           0
smoker        0
day           0
time          0
size          0
dtype: int64

In [6]:
aq.isnull().sum()

Ozone      37
Solar.R     7
Wind        0
Temp        0
Month       0
Day         0
dtype: int64

In [7]:
aq.loc[:,'Wind'].isnull().sum()

0

In [8]:
pd.isnull(aq.iloc[:,3]).sum()  #check for missing values in Temp

0

In [4]:
percent_missing = aq.loc[:,'Ozone'].isnull().sum() / aq.shape[0]
print(f'% of missing values in Ozone is {round(percent_missing*100,2)}')

% of missing values in Ozone is 24.18
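
The same percentage can be computed for every column at once. A minimal sketch, using a made-up `toy` frame rather than the air quality file:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a few gaps, for illustration only
toy = pd.DataFrame({
    "Ozone": [41.0, np.nan, 12.0, np.nan],
    "Wind": [7.4, 8.0, 12.6, 11.5],
})

# isnull() gives True/False per cell; the mean of booleans is the fraction of True
percent_missing = (toy.isnull().mean() * 100).round(2)
print(percent_missing)
```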


In [5]:
# impute NaN data with the column mean
import numpy as np
from sklearn.impute import SimpleImputer
imp_mean = SimpleImputer(missing_values = np.nan, strategy='mean') # create the imputer; the mean of each column is used
imp_mean.fit([[np.nan, 2, 3], [5, np.nan, 6], [10, 5, 9]]) # fit it: learn the column means 7.5, 3.5 and 6.0
X = [[np.nan, 2, 3], [4, np.nan, 6], [10, np.nan, 9]]
Y = imp_mean.transform(X)
print(Y)

[[ 7.5  2.   3. ]
 [ 4.   3.5  6. ]
 [10.   3.5  9. ]]
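
The rule of fitting the imputer on the training data and then transforming both sets can be sketched as follows. The single-feature array is hypothetical, and the 75/25 split is an arbitrary choice for the example.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Hypothetical single-feature data with two missing entries
X = np.array([[1.0], [2.0], [np.nan], [4.0], [5.0], [np.nan], [7.0], [8.0]])

X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)

imputer = SimpleImputer(strategy="mean")
imputer.fit(X_train)                         # statistics come from the training rows only
X_train_filled = imputer.transform(X_train)
X_test_filled = imputer.transform(X_test)    # test rows reuse the training mean

print(imputer.statistics_)
```

Fitting on the training rows only keeps information from the test set from leaking into the preprocessing step.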


In [8]:
# the following two blocks of code demonstrate how to use two different imputers


from sklearn.impute import SimpleImputer
X = aq.iloc[:,0:4].values # extract the first 4 columns as a NumPy array
imputer = SimpleImputer(missing_values = np.nan, strategy='mean') # replace NaN with the column mean
imputer = imputer.fit(X[:,0:2]) # fit the imputer on the first two features (Ozone, Solar.R)
X[:,0:2] = imputer.transform(X[:,0:2]) # transform and store back
print(X)

# now check whether any missing values remain in the imputed Ozone column
percent_missing = np.isnan(X[:,0]).sum() / X.shape[0]
print(f'% of missing values in Ozone is {round(percent_missing*100,2)}') # output should be 0.0


# This is an example of K-nearest neighbor (KNN) imputation
# KNN is best known as a classifier algorithm in Machine Learning (We will study this in a later Module)
# The KNN algorithm assumes that similar things exist in close proximity. Thus, it imputes the
# missing values based on the N nearest rows. Here N is set to 3.
from sklearn.impute import KNNImputer
nan = np.nan
X = [[1, 2, nan], [3, 4, 3], [nan, 6, 5], [8, 8, 7]]
imputer = KNNImputer(n_neighbors=3, weights="uniform")
imputer.fit_transform(X)

[[ 41.         190.           7.4         67.        ]
 [ 36.         118.           8.          72.        ]
 [ 12.         149.          12.6         74.        ]
 [ 18.         313.          11.5         62.        ]
 [ 42.12931034 185.93150685  14.3         56.        ]
 [ 28.         185.93150685  14.9         66.        ]
 [ 23.         299.           8.6         65.        ]
 [ 19.          99.          13.8         59.        ]
 [  8.          19.          20.1         61.        ]
 [ 42.12931034 194.           8.6         69.        ]
 [  7.         185.93150685   6.9         74.        ]
 [ 16.         256.           9.7         69.        ]
 [ 11.         290.           9.2         66.        ]
 [ 14.         274.          10.9         68.        ]
 [ 18.          65.          13.2         58.        ]
 [ 14.         334.          11.5         64.        ]
 [ 34.         307.          12.          66.        ]
 [  6.          78.          18.4         57.        ]
 [ 30.    

array([[1., 2., 5.],
       [3., 4., 3.],
       [4., 6., 5.],
       [8., 8., 7.]])
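
To see where a KNN-imputed value comes from, the calculation for one cell can be reproduced by hand. This is a sketch under two assumptions: N is set to 2 here (not 3 as in the cell above) so that the neighbor selection actually matters, and distances are computed with scikit-learn's `nan_euclidean_distances`, which skips missing coordinates.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.metrics.pairwise import nan_euclidean_distances

nan = np.nan
X = np.array([[1, 2, nan], [3, 4, 3], [nan, 6, 5], [8, 8, 7]], dtype=float)

# Distances from row 0 to every row, ignoring coordinates that are missing
dist = nan_euclidean_distances(X[[0]], X)[0]

# Candidate donors: other rows that have a value in column 2
donors = [i for i in range(len(X)) if i != 0 and not np.isnan(X[i, 2])]
nearest = sorted(donors, key=lambda i: dist[i])[:2]   # assumed N = 2 for this sketch

manual = X[nearest, 2].mean()                          # uniform weights: plain average
library = KNNImputer(n_neighbors=2).fit_transform(X)[0, 2]
print(nearest, manual, library)
```

With N = 3 and only four rows, every other row is a neighbor, so the imputed value is simply the mean of the available values in that column.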

### Categorical Data
 
* Labeled data
    - Labels can be strings (e.g., nominal variables) or numbers
    - Gender, species in the iris dataset, number of cylinders in a car
    - Sometimes no ordering is implied
        - Gender, species in the iris dataset
    - Sometimes there is a natural ordering
        - Size (small, medium, or large), number of cylinders in a car
* Can be an independent or a dependent variable
    - When it is the independent variable, it serves as a grouping variable (e.g., in a boxplot)
        - Is there a difference in sepal length by group?
    - As a dependent variable in classification problems, we classify new observations into one of the groups
* Statistical (machine learning) models are based on mathematical equations that require numeric values, not strings
    - Need to encode strings as numbers (label encoding assigns one integer per category; dummy encoding uses 0/1 indicator columns)
    - Will go into more detail when we cover multiple linear regression

* We will use the LabelEncoder class to do label encoding
    


In [9]:
iris = sns.load_dataset("iris") #https://seaborn.pydata.org/generated/seaborn.load_dataset.html
# load it from git, need internet
from sklearn.preprocessing import LabelEncoder #https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html

label_encoder = LabelEncoder()
iris['species'] = label_encoder.fit_transform(iris['species']) # label-encode the species column as integers 0, 1, 2
print(iris.head())
iris.tail()

   sepal_length  sepal_width  petal_length  petal_width  species
0           5.1          3.5           1.4          0.2        0
1           4.9          3.0           1.4          0.2        0
2           4.7          3.2           1.3          0.2        0
3           4.6          3.1           1.5          0.2        0
4           5.0          3.6           1.4          0.2        0


     sepal_length  sepal_width  petal_length  petal_width  species
145           6.7          3.0           5.2          2.3        2
146           6.3          2.5           5.0          1.9        2
147           6.5          3.0           5.2          2.0        2
148           6.2          3.4           5.4          2.3        2
149           5.9          3.0           5.1          1.8        2
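
One detail worth knowing is that `LabelEncoder` numbers the classes in sorted order, which is fine for species but does not respect a natural ordering such as small < medium < large. The sketch below uses made-up size labels; using `pd.Categorical` for the ordered case is one possible approach, not the only one.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

sizes = pd.Series(["small", "large", "medium", "small"])  # hypothetical data

# LabelEncoder numbers the classes in sorted (alphabetical) order
le = LabelEncoder()
alphabetical = le.fit_transform(sizes)
print(list(le.classes_), alphabetical.tolist())

# For an ordinal variable, state the order explicitly instead
order = ["small", "medium", "large"]
ordered = pd.Categorical(sizes, categories=order, ordered=True).codes
print(ordered.tolist())

# inverse_transform recovers the original strings from the integer codes
restored = le.inverse_transform(alphabetical).tolist()
print(restored)
```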


#### One-hot encoding
One-hot encoding represents a categorical variable as a set of binary vectors, one column per category, which makes the representation more expressive than a single integer code.
Many machine learning algorithms cannot work with categorical data directly. The categories must be converted into numbers. This is required for both input and output variables that are categorical.
Label encoding may work where there is a natural ordinal relationship between the categories (attribute values), such as sizes: small, medium, large. But where there is no natural ordinal relationship between the categories, such as 'cat' or 'dog', the artificial ordering can mislead learning algorithms.
Let's use the one-hot encoder in an example.

Example:
Imagine you have 3 categories of foods: apples, chicken, and broccoli. Using label encoding, you would assign each of these a number to categorize them: apples = 1, chicken = 2, and broccoli = 3. But now, if your model internally needs to calculate the average across categories, it might compute (1 + 3) / 2 = 2. This means that according to your model, the average of apples and broccoli is chicken.

Obviously, that line of reasoning will lead the model to get correlations completely wrong.
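
The arithmetic in this example can be checked directly. The sketch below uses the same label codes (apples = 1, chicken = 2, broccoli = 3) and, for contrast, hand-written one-hot vectors; the column order of those vectors is an arbitrary choice.

```python
import numpy as np

# Label codes from the example above
codes = {"apples": 1, "chicken": 2, "broccoli": 3}
label_average = (codes["apples"] + codes["broccoli"]) / 2
print(label_average)        # equals the code for chicken

# One-hot vectors in the order [apples, chicken, broccoli]
one_hot = {
    "apples": np.array([1, 0, 0]),
    "chicken": np.array([0, 1, 0]),
    "broccoli": np.array([0, 0, 1]),
}
one_hot_average = (one_hot["apples"] + one_hot["broccoli"]) / 2
print(one_hot_average)      # half apples, half broccoli, no chicken
```

With one-hot vectors the average no longer lands on an unrelated category.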

The tables below show the data before and after one-hot encoding.

![](encodingtable.jpeg)

Our categories were formerly values in rows, but now they are columns. Our numerical variable, calories, has stayed the same.


In [56]:
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
onehotdata = pd.read_csv('foodcategories.csv')
print('content of the data file\n')
print(onehotdata) #output data from file
X = onehotdata.iloc[:, :-1].values # we care about the first two columns, but all rows
print('\n data from first two columns\n')
print(X)
le = LabelEncoder()
X[:, 0] = le.fit_transform(X[:, 0]) # converting first column's categorical data to numerical
print('\n first two columns are both numerical now, first column values 0-4, but second  column is not encoded. \n')
print(X)
ohe = OneHotEncoder(categories= 'auto')
X = ohe.fit_transform(X).toarray()

print('\n both columns are encoded using one-hot encoding \n')
print(X)

content of the data file

  FoodName  CategoryNumber  Calories
0    Apple               1        80
1     Pear               2        72
2  Chicken               3       220
3   Carrot               4        45
4     Beef               5       418

 data from first two columns

[['Apple' 1]
 ['Pear' 2]
 ['Chicken' 3]
 ['Carrot' 4]
 ['Beef' 5]]

 first two columns are both numerical now, first column values 0-4, but second  column is not encoded. 

[[0 1]
 [4 2]
 [3 3]
 [2 4]
 [1 5]]

 both columns are encoded using one-hot encoding 

[[1. 0. 0. 0. 0. 1. 0. 0. 0. 0.]
 [0. 0. 0. 0. 1. 0. 1. 0. 0. 0.]
 [0. 0. 0. 1. 0. 0. 0. 1. 0. 0.]
 [0. 0. 1. 0. 0. 0. 0. 0. 1. 0.]
 [0. 1. 0. 0. 0. 0. 0. 0. 0. 1.]]


In [37]:
# Data set is from https://archive.ics.uci.edu/ml/datasets/chronic_kidney_disease
# It is given to you in the folder as a CSV file

# load data
df = pd.read_csv('chronic_kidney_disease.csv', header=None, 
 names=['age', 'bp', 'sg', 'al', 'su', 'rbc', 'pc', 'pcc', 'ba', 'bgr', 'bu', 'sc', 'sod', 'pot',
 'hemo', 'pcv', 'wc', 'rc', 'htn', 'dm', 'cad', 'appet', 'pe', 'ane', 'class'])
# head of df
print(df.head(10))

  age   bp     sg al su       rbc        pc         pcc          ba  bgr  ...  \
0  48   80  1.020  1  0         ?    normal  notpresent  notpresent  121  ...   
1   7   50  1.020  4  0         ?    normal  notpresent  notpresent    ?  ...   
2  62   80  1.010  2  3    normal    normal  notpresent  notpresent  423  ...   
3  48   70  1.005  4  0    normal  abnormal     present  notpresent  117  ...   
4  51   80  1.010  2  0    normal    normal  notpresent  notpresent  106  ...   
5  60   90  1.015  3  0         ?         ?  notpresent  notpresent   74  ...   
6  68   70  1.010  0  0         ?    normal  notpresent  notpresent  100  ...   
7  24    ?  1.015  2  4    normal  abnormal  notpresent  notpresent  410  ...   
8  52  100  1.015  3  0    normal  abnormal     present  notpresent  138  ...   
9  53   90  1.020  2  0  abnormal  abnormal     present  notpresent   70  ...   

  pcv     wc   rc  htn   dm cad appet   pe  ane class  
0  44   7800  5.2  yes  yes  no  good   no   no   ck

We notice in the above output that there are missing values (shown as '?'), so we need to handle them before we encode columns like 'rbc', 'pc', 'pcc', etc.
Take care of the missing values first.
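
One possible shortcut is to let `read_csv` convert the '?' markers to NaN while loading, via its `na_values` parameter. The three-row text below is a made-up excerpt in the style of the CKD file, not the real data.

```python
import io
import pandas as pd

# Hypothetical three-row excerpt in the same style as the CKD file
raw = io.StringIO("48,80,?,normal\n7,50,?,normal\n62,?,normal,normal\n")

df_small = pd.read_csv(raw, header=None, names=["age", "bp", "rbc", "pc"],
                       na_values="?")
print(df_small)

missing_counts = df_small.isnull().sum()
print(missing_counts)
```

A side effect is that numeric columns such as `bp` are parsed as floats right away instead of as strings.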

In [18]:
#pd.set_option('display.max_rows', None) 
df.replace('?', np.nan, inplace=True)
print(df.head(10))
#df['dm'] = df['dm'].str.strip()
df['class'] = df['class'].apply(lambda x: 1 if x =='ckd' else 0)
#print(df.head(10))
# numerical columns
num_cols = ['age', 'bp', 'sg', 'al', 'su', 'bgr', 'bu', 'sc', 'sod', 'pot', 'hemo', 'pcv', 'wc', 'rc']
# categorical columns

cate_cols = df.columns.drop('class').drop(num_cols)
# display categorical columns
print(cate_cols)
# convert numerical data 
df[num_cols] = df[num_cols].apply(pd.to_numeric, errors='coerce')    
df.info()    
#print(df.head(10))

# X and y
X = df.drop(columns=['class'])
Y = df['class']
#print(X.head(10))
#print(Y.head(10))

from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values = np.nan, strategy='median') # replace NaN with the column median
imputer = imputer.fit(X[num_cols]) # fit the imputer on the numerical columns
X[num_cols] = imputer.transform(X[num_cols]) # transform and store back
print(X.head(10))

# Imputing categorical data using the most frequent value
imputercat = SimpleImputer(strategy="most_frequent")  
imputercat = imputercat.fit(X[cate_cols])
X[cate_cols] = imputercat.transform(X[cate_cols])
print(X.head(10))

#perform dummy encoding
X = pd.get_dummies(X, prefix_sep='_', drop_first=True)

# X head
print(X.head(10))

    age     bp     sg   al   su       rbc        pc         pcc          ba  \
0  48.0   80.0  1.020  1.0  0.0       NaN    normal  notpresent  notpresent   
1   7.0   50.0  1.020  4.0  0.0       NaN    normal  notpresent  notpresent   
2  62.0   80.0  1.010  2.0  3.0    normal    normal  notpresent  notpresent   
3  48.0   70.0  1.005  4.0  0.0    normal  abnormal     present  notpresent   
4  51.0   80.0  1.010  2.0  0.0    normal    normal  notpresent  notpresent   
5  60.0   90.0  1.015  3.0  0.0       NaN       NaN  notpresent  notpresent   
6  68.0   70.0  1.010  0.0  0.0       NaN    normal  notpresent  notpresent   
7  24.0    NaN  1.015  2.0  4.0    normal  abnormal  notpresent  notpresent   
8  52.0  100.0  1.015  3.0  0.0    normal  abnormal     present  notpresent   
9  53.0   90.0  1.020  2.0  0.0  abnormal  abnormal     present  notpresent   

     bgr  ...   pcv       wc   rc  htn   dm  cad  appet   pe  ane class  
0  121.0  ...  44.0   7800.0  5.2  yes  yes   no   good 
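
The effect of `drop_first=True` in `pd.get_dummies` is easiest to see on a tiny frame. The two columns below are a hypothetical excerpt of the categorical CKD features.

```python
import pandas as pd

# Hypothetical two-column excerpt of categorical CKD features
toy = pd.DataFrame({
    "rbc": ["normal", "abnormal", "normal"],
    "appet": ["good", "poor", "good"],
})

full = pd.get_dummies(toy, prefix_sep="_")                       # one column per category
reduced = pd.get_dummies(toy, prefix_sep="_", drop_first=True)   # first category dropped

print(list(full.columns))
print(list(reduced.columns))
```

Dropping the first level loses no information, because a row of all zeros for a feature identifies the dropped category.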

### Training Set/Test Set Split
 
* Train the model on one set of data (the training set)
* To test how well the model will generalize, we test it on a different set (the test set)
* We do this to guard against overfitting the model
    - An overfit model relies too much on the features of the data in the training set
        - The training set may be an unusual sample
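
One way to reduce the chance of an unusual sample is a stratified split, which keeps the class proportions equal in both subsets. This is a sketch using the copy of iris bundled with scikit-learn; the `stratify` argument is an addition that the cells in these notes do not use.

```python
from collections import Counter
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)   # bundled with scikit-learn, no download needed

# stratify=y keeps the class proportions the same in both subsets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1234, stratify=y)

train_counts = Counter(y_train.tolist())
test_counts = Counter(y_test.tolist())
print(train_counts)
print(test_counts)
```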

#### Validation set
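
A validation set is a third subset, held out from the training data, that is commonly used to tune model settings so that the test set is used only once at the end. One possible way to obtain it is to call `train_test_split` twice. The data and the 60/20/20 proportions below are assumptions for illustration, not a rule.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(50, 2)   # hypothetical data: 50 rows, 2 features
y = np.arange(50) % 2

# First split off 20% as the test set ...
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
# ... then take 25% of the remaining 80% (= 20% overall) for validation
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0)

print(len(X_train), len(X_val), len(X_test))
```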

In [53]:
# Python Code
from sklearn.model_selection import train_test_split
X = iris.iloc[:,0:3]
y = iris.loc[:,'species']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 1234)
print("X Train: ",X_train.head())
print("\ny Train: ",y_train.head()) # species column is already encoded above


X Train:       sepal_length  sepal_width  petal_length
98            5.1          2.5           3.0
126           6.2          2.8           4.8
40            5.0          3.5           1.3
133           6.3          2.8           5.1
77            6.7          3.0           5.0

y Train:  98     1
126    2
40     0
133    2
77     1
Name: species, dtype: int64


### Feature Scaling

* Many learning algorithms perform better if the data is in the range [0, 1]
* Some learning algorithms require normalized data
    - For example, those based on Euclidean distance measures

* Normalization (min-max scaling)
    - Scale all features to [0, 1]

$$ x' = \frac{x - \min(x)}{\max(x) - \min(x)} $$

* Standardization
    - Z-scores
    - mean = 0, standard deviation = 1

$$ z = \frac{x - \operatorname{mean}(x)}{\operatorname{std}(x)} $$
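
Both formulas can be checked by hand against scikit-learn's scalers. A minimal sketch on a hypothetical single feature:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

x = np.array([[2.0], [4.0], [6.0], [10.0]])   # hypothetical single feature

# Min-max normalization by hand
minmax_manual = (x - x.min()) / (x.max() - x.min())

# Standardization by hand; np.std defaults to the population formula (ddof=0),
# which is also what StandardScaler uses
z_manual = (x - x.mean()) / x.std()

minmax_sklearn = MinMaxScaler().fit_transform(x)
z_sklearn = StandardScaler().fit_transform(x)

print(minmax_manual.ravel())
print(np.allclose(minmax_manual, minmax_sklearn), np.allclose(z_manual, z_sklearn))
```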

In [55]:
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)
print("X Train: \n",X_train[0:10,:])



X Train: 
 [[-0.89800581 -1.23462679 -0.41454695]
 [ 0.45053534 -0.57321958  0.6087918 ]
 [-1.02060047  0.9700639  -1.38103354]
 [ 0.57312999 -0.57321958  0.77934826]
 [ 1.06350859 -0.13228144  0.72249611]
 [-1.26578977  0.74959484 -1.03992062]
 [-1.75616837 -0.35275051 -1.32418139]
 [-0.53022186  0.74959484 -1.15362493]
 [-1.51097907  1.19053297 -1.55159   ]
 [-1.02060047 -1.67556493 -0.24399049]]
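
A quick sanity check on a scaled split: the training columns should have mean 0 and standard deviation 1, while the test columns, scaled with the training statistics, are usually only close to that. The sketch below uses the copy of iris bundled with scikit-learn and all four features, so its numbers differ from the output above.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)
X_train, X_test = train_test_split(X, test_size=0.2, random_state=1234)

scaler = StandardScaler().fit(X_train)      # statistics from the training set only
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

print(X_train_s.mean(axis=0).round(6), X_train_s.std(axis=0).round(6))
print(X_test_s.mean(axis=0).round(3))       # close to 0, but usually not exactly 0

# inverse_transform maps scaled values back to the original units
restored = scaler.inverse_transform(X_train_s)
```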
