## Data Preprocessing

* The first and often the most time-consuming step
* Data must be in the form that the learning or analysis algorithms expect

![](data_cleaning.jpg)

### Data Preprocessing Steps

* Getting the dataset
* Exploring the dataset
* Missing Values
* Categorical Values
* Splitting the dataset
* Scaling the dataset


In [3]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats

### Exploring the data

In [13]:
aq = pd.read_csv("airquality.csv")
aq.head()

   Ozone  Solar.R  Wind  Temp  Month  Day
0   41.0    190.0   7.4    67      5    1
1   36.0    118.0   8.0    72      5    2
2   12.0    149.0  12.6    74      5    3
3   18.0    313.0  11.5    62      5    4
4    NaN      NaN  14.3    56      5    5


In [14]:
tips = sns.load_dataset("tips")
tips.head()

   total_bill   tip     sex smoker  day    time  size
0       16.99  1.01  Female     No  Sun  Dinner     2
1       10.34  1.66    Male     No  Sun  Dinner     3
2       21.01  3.50    Male     No  Sun  Dinner     3
3       23.68  3.31    Male     No  Sun  Dinner     2
4       24.59  3.61  Female     No  Sun  Dinner     4


### Missing Values

* Generally indicated by NaN in pandas and NumPy
    - May instead be coded as an extreme sentinel value like 9999
* Delete the row
    - Not a good idea unless you have lots of repeated measures
* Fill in with the column mean, median, or mode
* Fill in with the mean of neighboring items
* When filling in with a statistic, fit the imputer to the training data only
    - Then transform both the training data and the test data with the fitted imputer
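
The options in the list above can be sketched with plain pandas. This is a minimal illustration on a made-up column, not the air quality data: the `toy` frame, its values, and the choice of 9999 as the sentinel are all assumptions for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column; 9999 is assumed to be a sentinel meaning "missing".
toy = pd.DataFrame({"ozone": [41.0, 9999.0, 12.0, np.nan, 30.0]})

# Convert the sentinel to NaN so pandas treats it as missing.
toy["ozone"] = toy["ozone"].replace(9999.0, np.nan)

# Option 1: delete every row that contains a missing value.
dropped = toy.dropna()

# Option 2: fill with a column statistic (here the mean of the observed values).
mean_filled = toy["ozone"].fillna(toy["ozone"].mean())

# Option 3: fill from neighboring items using linear interpolation.
neighbor_filled = toy["ozone"].interpolate(method="linear")

print(dropped)
print(mean_filled)
print(neighbor_filled)
```

Replacing the sentinel first matters: left as 9999, it would be counted as a real observation and distort the mean used in option 2.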

In [19]:
tips.isnull().sum()

total_bill    0
tip           0
sex           0
smoker        0
day           0
time          0
size          0
dtype: int64

In [16]:
aq.isnull().sum()

Ozone      37
Solar.R     7
Wind        0
Temp        0
Month       0
Day         0
dtype: int64

In [17]:
aq.loc[:,'Wind'].isnull().sum()

0

In [18]:
pd.isnull(aq.iloc[:,3]).sum()  #check for missing values in Temp

0

In [23]:
percent_missing = aq.loc[:,'Ozone'].isnull().sum() / aq.shape[0]
print(f'% of missing values in Ozone is {round(percent_missing*100,2)}')

% of missing values in Ozone is 24.18
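
The same percentage can be computed for every column at once, because the mean of a boolean column is the fraction of `True` values. A small sketch on a hypothetical stand-in frame (the `toy` data below is invented for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the air quality data.
toy = pd.DataFrame({
    "Ozone": [41.0, np.nan, 12.0, np.nan],
    "Wind": [7.4, 8.0, 12.6, 11.5],
})

# isnull() yields booleans; the column mean is the fraction that is missing.
percent_missing = (toy.isnull().mean() * 100).round(2)
print(percent_missing)
```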


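In [None]:
# SimpleImputer (sklearn.impute) fills missing values column by column
from sklearn.impute import SimpleImputer
X = aq.iloc[:,0:4].values   # Ozone, Solar.R, Wind, Temp
imputer = SimpleImputer(missing_values = np.nan, strategy='mean')
imputer = imputer.fit(X[:,0:2])   # learn the column means of Ozone and Solar.R
X[:,0:2] = imputer.transform(X[:,0:2])   # replace each NaN with its column mean
df = pd.DataFrame(X)
df.head()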
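
The advice to fit the imputer on the training data and then transform both sets might look like the sketch below. It assumes `SimpleImputer` from `sklearn.impute`, and the two toy arrays are made up for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical toy arrays; rows are observations, columns are features.
X_train_toy = np.array([[1.0, 10.0], [np.nan, 20.0], [3.0, np.nan]])
X_test_toy = np.array([[np.nan, 40.0], [5.0, np.nan]])

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")

# The column means are learned from the training rows only ...
X_train_imp = imputer.fit_transform(X_train_toy)

# ... and the same training means are reused to fill the test rows.
X_test_imp = imputer.transform(X_test_toy)

print(imputer.statistics_)
print(X_test_imp)
```

Reusing the training statistics keeps information from the test set out of the preprocessing step.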

### Categorical Data
 
* Labeled data
    - Labels can be strings (e.g. nominal variables) or numbers
    - Gender, species in the iris dataset, number of cylinders in a car
    - Sometimes no ordering is implied
        - Gender, iris species
    - Sometimes there is a natural ordering
        - Size (small, medium, or large), number of cylinders in a car
* Can be an independent or a dependent variable
    - When it's the independent variable it serves as a grouping variable (e.g. in a boxplot)
        - Is there a difference in sepal length by group?
    - As a dependent variable in classification problems, we classify new observations into one of the groups
* Statistical (machine learning) models are based on mathematical equations that require numeric values, not strings
    - Need to encode strings as integers (called label encoding)
    - Dummy encoding goes one step further and creates a 0/1 indicator column for each category
    - Will go into more detail when we cover multiple linear regression

* We will use the LabelEncoder class to do label encoding
    


In [1]:
iris = sns.load_dataset("iris")

from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()
# Replace the species strings with integer codes 0, 1, 2
iris["species"] = label_encoder.fit_transform(iris["species"])
print(iris.head())
iris.tail()

   sepal_length  sepal_width  petal_length  petal_width  species
0           5.1          3.5           1.4          0.2        0
1           4.9          3.0           1.4          0.2        0
2           4.7          3.2           1.3          0.2        0
3           4.6          3.1           1.5          0.2        0
4           5.0          3.6           1.4          0.2        0


     sepal_length  sepal_width  petal_length  petal_width  species
145           6.7          3.0           5.2          2.3        2
146           6.3          2.5           5.0          1.9        2
147           6.5          3.0           5.2          2.0        2
148           6.2          3.4           5.4          2.3        2
149           5.9          3.0           5.1          1.8        2
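
LabelEncoder numbers the labels in sorted order, which is fine for nominal labels such as species. For a variable with a natural ordering, such as size, alphabetical codes would put large before small. One possible way to keep the intended order is an ordered pandas categorical; the `sizes` column below is a hypothetical example, not part of the datasets used here.

```python
import pandas as pd

# Hypothetical column with a natural ordering.
sizes = pd.Series(["small", "large", "medium", "small"])

# Declare the order explicitly, then read off the integer codes.
ordered = pd.Categorical(
    sizes, categories=["small", "medium", "large"], ordered=True
)
size_codes = ordered.codes  # small -> 0, medium -> 1, large -> 2

print(list(size_codes))
```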


#### One-hot encoding
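
One-hot encoding replaces a categorical column with one 0/1 indicator column per category, so no ordering is implied between the categories. A minimal sketch using `pd.get_dummies` on a made-up frame (the `toy` data is an assumption for illustration):

```python
import pandas as pd

# Hypothetical toy frame with one nominal column.
toy = pd.DataFrame({"species": ["setosa", "versicolor", "virginica", "setosa"]})

# One indicator column per category.
one_hot = pd.get_dummies(toy, columns=["species"])

# drop_first=True keeps k-1 columns; the dropped category becomes the baseline.
dummies = pd.get_dummies(toy, columns=["species"], drop_first=True)

print(one_hot)
print(dummies)
```

Dropping one column is a common choice for linear models, where the full set of indicators would be perfectly collinear with the intercept.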

### Training Set/Test Set Split
 
* Train the model on one set of data (the training set)
* To test how well the model will generalize, we test it on a different set (the test set)
* We do this to guard against overfitting the model
    - The model relies too much on the features of the data in the training set.
        - The training set may be an unusual sample

#### Validation set

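In [None]:
# Python Code
from sklearn.model_selection import train_test_split
X = iris.iloc[:,0:3]          # first three feature columns
y = iris.loc[:,"species"]     # the column name is lowercase in the seaborn dataset
# Hold out 20% of the rows as the test set; random_state makes the split repeatable
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 1234)
print("X Train: ",X_train.head())
print("y Train: ",y_train.head())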
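
A validation set is commonly a third portion of the data, held out from the training rows and used for tuning while the test set stays untouched until the end. One way to get it is to call `train_test_split` twice. The sketch below assumes a 60/20/20 split and uses made-up arrays; both the proportions and the names are illustrative choices.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 observations, 4 features, binary labels.
X_toy = np.arange(400).reshape(100, 4)
y_toy = np.arange(100) % 2

# First hold out 20% of all rows as the test set.
X_rest, X_test_toy, y_rest, y_test_toy = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=1234
)

# Then hold out 25% of the remainder (20% of the total) as the validation set.
X_train_toy, X_val_toy, y_train_toy, y_val_toy = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=1234
)

print(X_train_toy.shape, X_val_toy.shape, X_test_toy.shape)
```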


### Feature Scaling

* Many learning algorithms perform better if the data is in the range [0,1]
* Some learning algorithms require normalized data
    - For example, those that use Euclidean distance measures, where a feature with a large range would otherwise dominate

* Normalization (Min-max scaling)
    - scale all features to [0,1]
        
$$ \frac{x - \min(x)}{\max(x) - \min(x)}$$
        
* Standardization
    - Z-scores 
    - mean =  0, standard deviation = 1
        
$$\frac{x - \operatorname{mean}(x)}{\operatorname{sd}(x)}$$
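
Both formulas can be checked by hand with NumPy. This is a small worked sketch on a made-up feature vector; the values in `x` are arbitrary.

```python
import numpy as np

# Hypothetical feature values.
x = np.array([2.0, 4.0, 6.0, 8.0])

# Min-max scaling: the smallest value maps to 0 and the largest to 1.
x_minmax = (x - x.min()) / (x.max() - x.min())

# Standardization: subtract the mean, divide by the standard deviation.
# np.std defaults to the population formula (ddof=0), which is also what
# scikit-learn's StandardScaler uses.
x_std = (x - x.mean()) / x.std()

print(x_minmax)
print(x_std)
```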

In [None]:
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
# Fit on the training set only, then apply the same means and standard deviations to the test set
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)
print("X Train: ",X_train[0:10,:])

