# **Data Preprocessing**

## **Import the Libraries**

### 1. *numpy*:
NumPy is the fundamental package for scientific computing in Python. ... NumPy arrays facilitate advanced mathematical and other types of operations on large amounts of data. Typically, such operations are executed more efficiently and with less code than is possible using Python's built-in sequences.<br>
```import numpy as np```
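
As a rough illustration of the "less code" point, the sketch below applies the same 10% raise to a plain Python list and to a NumPy array. The salary figures are hypothetical sample values, not taken from the dataset.

```python
import numpy as np

# Hypothetical sample values, chosen only for illustration.
salaries = [72000.0, 48000.0, 54000.0, 61000.0]

# Built-in list: every element has to be visited explicitly.
raised_list = [s * 1.10 for s in salaries]

# NumPy array: the multiplication is applied element-wise in one expression.
salaries_arr = np.array(salaries)
raised_arr = salaries_arr * 1.10

print(np.allclose(raised_list, raised_arr))  # both approaches agree
print(salaries_arr.mean())                   # aggregate methods are built in
```

On arrays of this size the speed difference is negligible; the benefit shows up on large arrays, where the loop runs in compiled code instead of the Python interpreter.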

### 2. *pandas*:
pandas provides fast, high-performance, easy-to-use data structures and data analysis tools for manipulating numeric data and time series. It is built on the NumPy library and can import data from various file formats and sources such as JSON, SQL, and Microsoft Excel.<br>
```import pandas as pd```

### 3. *matplotlib.pyplot*:
matplotlib.pyplot is a collection of functions that make matplotlib work like MATLAB. Each pyplot function makes some change to a figure: e.g., creates a figure, creates a plotting area in a figure, plots some lines in a plotting area, decorates the plot with labels, etc.<br>
```import matplotlib.pyplot as plt```
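
A minimal sketch of that "each function changes the current figure" behaviour is shown below. The data points are hypothetical, and the `Agg` backend is selected only so that the sketch also runs without a display; inside a notebook that line can be dropped.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend for headless runs
import matplotlib.pyplot as plt

# Hypothetical sample values, chosen only for illustration.
ages = [44, 27, 30, 38]
salaries = [72000, 48000, 54000, 61000]

fig = plt.figure()            # creates a figure and makes it the current one
plt.scatter(ages, salaries)   # creates a plotting area in it and draws the points
plt.xlabel("Age")             # decorates the current plotting area
plt.ylabel("Salary")
plt.title("Salary vs. Age")
```

Every call after `plt.figure()` acts on the current figure and axes, which is why no figure object has to be passed around.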

In [137]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

## **Import the dataset**

In [138]:
file = 'Data.csv'
dataset = pd.read_csv(file)
dataset.head()

Unnamed: 0,Country,Age,Salary,Purchased
0,France,44.0,72000.0,No
1,Spain,27.0,48000.0,Yes
2,Germany,30.0,54000.0,No
3,Spain,38.0,61000.0,No
4,Germany,40.0,,Yes


## **Separate dependent and independent variables**

In [139]:
# Purchased is the dependent variable;
# all other columns are independent variables
dep = 'Purchased'
X = dataset.loc[:,dataset.columns != dep]
y = dataset.loc[:,dep]

## **Finding Missing Values**
Missing values are common in real data. Unfortunately, most predictive modeling techniques cannot handle them: many popular models, such as support vector machines, glmnet, and neural networks, cannot tolerate any missing values. The problem therefore has to be addressed before modeling.

In [140]:
missing_values = X.isnull().sum() != 0
col = X.loc[:, missing_values]
num_col = col.loc[:, col.dtypes == np.float64].columns
num_col

Index(['Age', 'Salary'], dtype='object')
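
Before deciding how to fill the gaps, it can help to see how much is missing per column. The sketch below is one way to do that; the small DataFrame is hypothetical and only reuses the column names of the dataset above.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with the same column names as the dataset above.
df = pd.DataFrame({
    "Country": ["France", "Spain", "Germany", "Spain"],
    "Age": [44.0, 27.0, np.nan, 38.0],
    "Salary": [72000.0, np.nan, 54000.0, np.nan],
})

missing_count = df.isnull().sum()    # number of NaN values per column
missing_share = df.isnull().mean()   # fraction of NaN values per column

summary = pd.DataFrame({"missing": missing_count, "share": missing_share})
print(summary[summary["missing"] > 0])  # only columns that need attention
```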

## **Handling Missing Values**
There are two common ways to fill NaN values:
1. ### *fillna()*:
NaN values can be filled with the pandas method fillna().
2. ### *SimpleImputer class*
The scikit-learn library provides the SimpleImputer preprocessing class for replacing missing values. It is a flexible class that lets you specify which value counts as missing (it can be something other than NaN) and the strategy used to replace it (such as mean, median, or mode, which scikit-learn calls most_frequent). By default SimpleImputer returns a NumPy array rather than a DataFrame, so the result has to be assigned back to the DataFrame columns.
### **Difference between fillna() and SimpleImputer**:
I feel the imputer class has its own benefits: you simply name a strategy such as mean or median, whereas with fillna you have to compute and supply the fill values yourself. On the other hand, the imputer has to be fit and then used to transform the dataset, which means more lines of code. It may be faster than fillna, but unless the dataset is really big the difference doesn’t matter.
fillna has one really cool feature, though: it accepts any custom value, which you may sometimes need. This makes fillna better IMHO, even if it may perform slower.
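
To make the comparison concrete, here is a small sketch on a hypothetical two-column DataFrame. It checks that mean imputation gives the same numbers either way and shows the per-column custom fill that fillna allows.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical sample values, chosen only for illustration.
df = pd.DataFrame({
    "Age": [44.0, 27.0, np.nan, 38.0],
    "Salary": [72000.0, np.nan, 54000.0, 61000.0],
})

# fillna(): the column means are computed by hand and passed in.
filled = df.fillna(df.mean())

# SimpleImputer: the strategy is named, then fit and transform are called.
imputer = SimpleImputer(strategy="mean")
imputed = np.asarray(imputer.fit_transform(df))

print(np.allclose(filled.to_numpy(), imputed))  # same result either way

# fillna() with a dict: a different custom value for each column.
custom = df.fillna({"Age": 0.0, "Salary": 50000.0})
print(custom)
```

SimpleImputer also has a `strategy="constant"` option with a `fill_value` argument, so custom fills are possible there too; the dict form of fillna is simply more direct when each column needs its own value.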

In [141]:
# Pandas fillna() method
mean_val = X.loc[:, num_col].mean()
X.fillna(mean_val)

Unnamed: 0,Country,Age,Salary
0,France,44.0,72000.0
1,Spain,27.0,48000.0
2,Germany,30.0,54000.0
3,Spain,38.0,61000.0
4,Germany,40.0,63777.777778
5,France,35.0,58000.0
6,Spain,38.777778,52000.0
7,France,48.0,79000.0
8,Germany,50.0,83000.0
9,France,37.0,67000.0


In [142]:
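# SimpleImputer method
from sklearn.impute import SimpleImputer
values = X.loc[:,num_col].values
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
transformed_values = imputer.fit_transform(values)
# X was selected from dataset with .loc, so pandas may emit a
# SettingWithCopyWarning here; creating X with .copy() avoids it.
X.loc[:,num_col] = transformed_values
X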

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  isetter(loc, value[:, i].tolist())


Unnamed: 0,Country,Age,Salary
0,France,44.0,72000.0
1,Spain,27.0,48000.0
2,Germany,30.0,54000.0
3,Spain,38.0,61000.0
4,Germany,40.0,63777.777778
5,France,35.0,58000.0
6,Spain,38.777778,52000.0
7,France,48.0,79000.0
8,Germany,50.0,83000.0
9,France,37.0,67000.0


### **Finding Categorical Variables**

In [143]:
cat = X.dtypes == 'object'
cat_col = X.loc[:, cat].columns
cat_col

Index(['Country'], dtype='object')

### **Handling Categorical Data**
Many machine learning algorithms cannot operate on label data directly: they require all input and output variables to be numeric.
This means that if your data contains categorical data, you must encode it as numbers before you can fit and evaluate a model.<br>
Some algorithms can work with categorical data directly.
For example, a decision tree can be learned directly from categorical data with no data transform required (this depends on the specific implementation).<br>
The two most popular techniques are **Ordinal Encoding** and **One-Hot Encoding**. This notebook walks through three common approaches for converting ordinal and categorical variables to numerical values:
1. Label Encoding
2. One-Hot Encoding
3. Dummy Variable Encoding
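
Ordinal encoding is mentioned above but not demonstrated in the cells that follow, so here is a minimal sketch. The `Size` column and its ordering are hypothetical; the point is that OrdinalEncoder can be given an explicit category order, whereas LabelEncoder always assigns codes in sorted (alphabetical) order.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Hypothetical ordinal feature with a natural order small < medium < large.
sizes = pd.DataFrame({"Size": ["small", "large", "medium", "small"]})

# OrdinalEncoder with an explicit order: small -> 0, medium -> 1, large -> 2.
ordinal = OrdinalEncoder(categories=[["small", "medium", "large"]])
ordinal_codes = ordinal.fit_transform(sizes)
print(ordinal_codes.ravel())   # [0. 2. 1. 0.]

# LabelEncoder sorts the labels: large -> 0, medium -> 1, small -> 2.
label_codes = LabelEncoder().fit_transform(sizes["Size"])
print(label_codes)             # [2 0 1 2]
```

This is why LabelEncoder is normally reserved for the target variable, as in the next cell, while ordered input features are better served by OrdinalEncoder.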

In [144]:
# Label Encoding
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
y = encoder.fit_transform(y)
y

array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1])

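#### *sparse=False*
Setting sparse=False ensures that the encoded columns are returned as a NumPy array (instead of a sparse matrix). In scikit-learn 1.2 and later this parameter is named sparse_output; the old name sparse was removed in version 1.4.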
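
If the notebook has to run on both older and newer scikit-learn releases, one option is a small wrapper that tries the new parameter name first. `make_dense_encoder` below is a hypothetical helper written for this sketch, not part of scikit-learn.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

def make_dense_encoder(**kwargs):
    """Hypothetical helper: build a OneHotEncoder that returns dense arrays."""
    try:
        # Parameter name used by scikit-learn 1.2 and later.
        return OneHotEncoder(sparse_output=False, **kwargs)
    except TypeError:
        # Older releases only know the original name.
        return OneHotEncoder(sparse=False, **kwargs)

# Hypothetical sample values, chosen only for illustration.
countries = pd.DataFrame({"Country": ["France", "Spain", "Germany", "Spain"]})

encoder = make_dense_encoder()
encoded = encoder.fit_transform(countries)

print(type(encoded).__name__)   # a plain NumPy array, not a sparse matrix
print(encoder.categories_[0])   # column order of the encoded array
print(encoded)
```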

In [145]:
# One Hot Encoding
from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder(sparse=False)
Xnew = X.copy()
Xnew.loc[:,['France','Germany','Spain']] = encoder.fit_transform(pd.DataFrame(Xnew.loc[:,cat_col]))
Xnew =  Xnew.drop(cat_col, axis=1)
Xnew.head()

Unnamed: 0,Age,Salary,France,Germany,Spain
0,44.0,72000.0,1.0,0.0,0.0
1,27.0,48000.0,0.0,0.0,1.0
2,30.0,54000.0,0.0,1.0,0.0
3,38.0,61000.0,0.0,0.0,1.0
4,40.0,63777.777778,0.0,1.0,0.0


#### *drop='first'*
Setting drop='first' drops the first dummy column of each encoded feature. That column is redundant: its value can always be inferred from the remaining dummy columns.
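
The redundancy can be checked directly. The sketch below uses pandas `get_dummies` on a hypothetical `Country` series and reconstructs the dropped column from the ones that remain.

```python
import pandas as pd

# Hypothetical sample values, chosen only for illustration.
countries = pd.Series(["France", "Spain", "Germany", "Spain"], name="Country")

full = pd.get_dummies(countries, dtype=int)                      # France, Germany, Spain
reduced = pd.get_dummies(countries, drop_first=True, dtype=int)  # Germany, Spain

# A row is France exactly when it is neither Germany nor Spain.
recovered_france = 1 - reduced.sum(axis=1)

print(list(reduced.columns))
print((recovered_france == full["France"]).all())
```

Because the three dummy columns always sum to 1, keeping all of them adds a perfectly collinear column, which is what dummy variable encoding avoids.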

In [146]:
# Dummy Variable Encoding
from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder(drop='first' ,sparse=False)
Xnew = X.copy()
Xnew.loc[:,['Germany','Spain']] = encoder.fit_transform(pd.DataFrame(Xnew.loc[:,cat_col]))
Xnew =  Xnew.drop(cat_col, axis=1)
Xnew.head()

Unnamed: 0,Age,Salary,Germany,Spain
0,44.0,72000.0,0.0,0.0
1,27.0,48000.0,0.0,1.0
2,30.0,54000.0,1.0,0.0
3,38.0,61000.0,0.0,1.0
4,40.0,63777.777778,1.0,0.0


#### *Pandas get_dummies Function*

In [147]:
# Pandas get_dummies Function
pd.get_dummies(X)

Unnamed: 0,Age,Salary,Country_France,Country_Germany,Country_Spain
0,44.0,72000.0,1,0,0
1,27.0,48000.0,0,0,1
2,30.0,54000.0,0,1,0
3,38.0,61000.0,0,0,1
4,40.0,63777.777778,0,1,0
5,35.0,58000.0,1,0,0
6,38.777778,52000.0,0,0,1
7,48.0,79000.0,1,0,0
8,50.0,83000.0,0,1,0
9,37.0,67000.0,1,0,0


#### *handle_unknown='ignore'*
We set handle_unknown='ignore' to avoid errors when new data contains categories that were not seen during fitting. Such categories are encoded as all zeros in the one-hot columns.
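
For contrast, the sketch below shows what happens with the default setting. The two small DataFrames are hypothetical; `.toarray()` is used so that the example does not depend on the name of the dense-output parameter.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical training data and new data containing an unseen category.
train = pd.DataFrame({"Country": ["France", "Spain", "Germany"]})
new = pd.DataFrame({"Country": ["Afghanistan", "France"]})

# Default behaviour (handle_unknown='error'): unseen categories raise.
strict = OneHotEncoder()
strict.fit(train)
try:
    strict.transform(new)
    raised = False
except ValueError as err:
    raised = True
    print("strict encoder failed:", err)

# handle_unknown='ignore': the unseen category becomes a row of zeros.
lenient = OneHotEncoder(handle_unknown="ignore")
lenient.fit(train)
encoded = lenient.transform(new).toarray()
print(encoded)
```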

In [148]:
from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder(handle_unknown='ignore' ,sparse=False)
Xnew = X.copy()
Xnew.loc[:,['France','Germany','Spain']] = encoder.fit_transform(pd.DataFrame(Xnew.loc[:,cat_col]))
Xnew =  Xnew.drop(cat_col, axis=1)
Xnew.head()

Unnamed: 0,Age,Salary,France,Germany,Spain
0,44.0,72000.0,1.0,0.0,0.0
1,27.0,48000.0,0.0,0.0,1.0
2,30.0,54000.0,0.0,1.0,0.0
3,38.0,61000.0,0.0,0.0,1.0
4,40.0,63777.777778,0.0,1.0,0.0


In [149]:
data = {
    'Age':[55,10,20,50],
    'Salary':[7500,50,40,88],
    'Country':['Afghanistan','France','Spain','Germany']
}
new_data = pd.DataFrame(data)
new_data

Unnamed: 0,Age,Salary,Country
0,55,7500,Afghanistan
1,10,50,France
2,20,40,Spain
3,50,88,Germany


In [150]:
new_data.loc[:,['France','Germany','Spain']] = encoder.transform(pd.DataFrame(new_data.loc[:,cat_col]))
new_data

Unnamed: 0,Age,Salary,Country,France,Germany,Spain
0,55,7500,Afghanistan,0.0,0.0,0.0
1,10,50,France,1.0,0.0,0.0
2,20,40,Spain,0.0,0.0,1.0
3,50,88,Germany,0.0,1.0,0.0


### **Train Test Split**
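The train-test split procedure is used to estimate the performance of machine learning algorithms when they make predictions on data that was not used to train the model. Here 30% of the rows are held out for testing, and random_state fixes the shuffle so that the split is reproducible.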
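
A small sketch of what the two arguments do, using hypothetical arrays with 10 rows: `test_size=0.3` yields 3 test rows and 7 training rows, and repeating the call with the same `random_state` returns the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 rows, 2 features, binary target.
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1])

split_a = train_test_split(X_demo, y_demo, test_size=0.3, random_state=1)
split_b = train_test_split(X_demo, y_demo, test_size=0.3, random_state=1)

X_tr, X_te, y_tr, y_te = split_a
print(X_tr.shape, X_te.shape)  # (7, 2) (3, 2)

# Same random_state, same split.
same = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same)
```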

In [151]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(Xnew, y, random_state=1, test_size=0.3)
print("X_train: {} X_test {}".format(X_train.shape[0],X_test.shape[0]))

X_train: 7 X_test 3


### **Standardization**
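Many machine learning algorithms perform better when numerical input variables are brought onto a common scale.
This includes algorithms that use a weighted sum of the input, like linear regression, and algorithms that use distance measures, like k-nearest neighbors.
StandardScaler standardizes each feature to zero mean and unit variance. The scaler is fit on the training set only and then applied to the test set, so that no information from the test set leaks into the preprocessing.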

In [152]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train_norm = sc.fit_transform(X_train)
X_test_norm = sc.transform(X_test)
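
What the scaler computes can be reproduced by hand: each value becomes z = (x - mean) / std, where mean and std are taken from the training data only. The sketch below checks this on hypothetical values; note that StandardScaler uses the population standard deviation (ddof=0), which is also NumPy's default.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical training and test rows (Age, Salary), for illustration only.
train = np.array([[44.0, 72000.0],
                  [27.0, 48000.0],
                  [30.0, 54000.0],
                  [38.0, 61000.0]])
test = np.array([[40.0, 65000.0]])

scaler = StandardScaler().fit(train)   # statistics come from the training set

mean = train.mean(axis=0)
std = train.std(axis=0)                # population std (ddof=0)
manual_test = (test - mean) / std      # z = (x - mean) / std

print(np.allclose(scaler.transform(test), manual_test))
print(scaler.transform(train).mean(axis=0))  # approximately 0 per column
```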

In [154]:
X_train_norm

array([[-0.03891021, -0.22960023, -0.8660254 ,  1.58113883, -0.63245553],
       [ 0.50583275,  0.49120535,  1.15470054, -0.63245553, -0.63245553],
       [-0.31128169, -0.47311563, -0.8660254 , -0.63245553,  1.58113883],
       [-1.80932482, -1.6127677 , -0.8660254 , -0.63245553,  1.58113883],
       [ 1.0505757 ,  1.10486416,  1.15470054, -0.63245553, -0.63245553],
       [ 1.32294718,  1.45552633, -0.8660254 ,  1.58113883, -0.63245553],
       [-0.71983891, -0.73611226,  1.15470054, -0.63245553, -0.63245553]])