## Lecture 2: Data Preprocessing

### Data Preprocessing Cycle

1. Getting the dataset
2. Importing libraries
3. Importing the dataset
4. Finding and replacing missing values
5. Encoding categorical data
6. Splitting data into train/test sets
7. Feature scaling

#### Getting the Dataset

There are many ways to obtain data. For practice, you can create a DataFrame from a dictionary.

In [19]:
dict={
    "Countary":["France", "Germany", "Italy", "Germany", "USA", "Italy", "France", "France"],
    "Age":[24,53,None,34,55,None,43,65],
    "Salary":[34000,43000,45000,None,33000,44000,56000,76000],
    "Purchased":["yes","no","yes","yes","yes","no","yes","no"]
}

In [20]:
dict

{'Countary': ['France',
  'Germany',
  'Italy',
  'Germany',
  'USA',
  'Italy',
  'France',
  'France'],
 'Age': [24, 53, None, 34, 55, None, 43, 65],
 'Salary': [34000, 43000, 45000, None, 33000, 44000, 56000, 76000],
 'Purchased': ['yes', 'no', 'yes', 'yes', 'yes', 'no', 'yes', 'no']}

In [21]:
import pandas as pd

##### Converting the dictionary to a DataFrame using the pandas `DataFrame()` constructor

In [22]:
df=pd.DataFrame(dict)

In [23]:
df

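|   | Countary | Age | Salary | Purchased |
|---|----------|-----|--------|-----------|
| 0 | France | 24.0 | 34000.0 | yes |
| 1 | Germany | 53.0 | 43000.0 | no |
| 2 | Italy | NaN | 45000.0 | yes |
| 3 | Germany | 34.0 | NaN | yes |
| 4 | USA | 55.0 | 33000.0 | yes |
| 5 | Italy | NaN | 44000.0 | no |
| 6 | France | 43.0 | 56000.0 | yes |
| 7 | France | 65.0 | 76000.0 | no |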
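
Before replacing anything, it helps to see where the missing values are. The following is a minimal sketch, assuming the same toy data as above; the variable names `data`, `missing_per_column`, and `rows_with_missing` are illustrative and not part of the lecture code.

```python
import pandas as pd

# Same toy data as in the lecture (column spelling kept as in the notebook).
data = {
    "Countary": ["France", "Germany", "Italy", "Germany", "USA", "Italy", "France", "France"],
    "Age": [24, 53, None, 34, 55, None, 43, 65],
    "Salary": [34000, 43000, 45000, None, 33000, 44000, 56000, 76000],
    "Purchased": ["yes", "no", "yes", "yes", "yes", "no", "yes", "no"],
}
df = pd.DataFrame(data)

# Count the missing entries in each column.
missing_per_column = df.isna().sum()
print(missing_per_column)

# Select the rows that contain at least one missing value.
rows_with_missing = df[df.isna().any(axis=1)]
print(rows_with_missing)
```

Here `isna()` marks each missing cell as `True`, so summing per column gives the number of gaps that the imputation step has to fill.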


**Separating input and output variables**

In [58]:
X=df.drop(['Purchased'],axis=1).values

In [59]:
X

array([['France', 24.0, 34000.0],
       ['Germany', 53.0, 43000.0],
       ['Italy', nan, 45000.0],
       ['Germany', 34.0, nan],
       ['USA', 55.0, 33000.0],
       ['Italy', nan, 44000.0],
       ['France', 43.0, 56000.0],
       ['France', 65.0, 76000.0]], dtype=object)

In [60]:
Y=df['Purchased'].values

In [61]:
Y

array(['yes', 'no', 'yes', 'yes', 'yes', 'no', 'yes', 'no'], dtype=object)

**Find and Replace Missing Values**

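This can be done with the `SimpleImputer` class from the `sklearn.impute` module. With `strategy='mean'`, each missing value is replaced by the mean of the non-missing values in its column.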

In [62]:
from sklearn.impute import SimpleImputer
import numpy as np

In [63]:
# create a SimpleImputer object that replaces each NaN with its column mean
imputer=SimpleImputer(missing_values=np.nan, strategy='mean')

In [64]:
X[:,1:3]=imputer.fit_transform(X[:,1:3])

In [65]:
X

array([['France', 24.0, 34000.0],
       ['Germany', 53.0, 43000.0],
       ['Italy', 45.666666666666664, 45000.0],
       ['Germany', 34.0, 47285.71428571428],
       ['USA', 55.0, 33000.0],
       ['Italy', 45.666666666666664, 44000.0],
       ['France', 43.0, 56000.0],
       ['France', 65.0, 76000.0]], dtype=object)
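
To check what the imputer learned, the fitted statistics can be compared with a manual mean. This is a small sketch on the `Age` column only; the names `ages`, `mean_imputer`, and `median_imputer` are illustrative, and the median variant is shown only as an alternative strategy, not as the lecture's choice.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# The Age column from the lecture data, as a 2-D array (one feature).
ages = np.array([[24.0], [53.0], [np.nan], [34.0], [55.0], [np.nan], [43.0], [65.0]])

# Mean imputation, as used in the lecture.
mean_imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
ages_mean = mean_imputer.fit_transform(ages)

# statistics_ stores the fill value learned for each column.
learned_mean = mean_imputer.statistics_[0]
manual_mean = np.nanmean(ages)  # mean that ignores NaN entries
print(learned_mean, manual_mean)

# Alternative strategy: the median is less sensitive to outliers.
median_imputer = SimpleImputer(missing_values=np.nan, strategy="median")
ages_median = median_imputer.fit_transform(ages)
print(ages_median.ravel())
```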

#### Check Categorical Variables

**Convert Categorical Variables into Numbers**

This can be done with the `LabelEncoder` class from the `sklearn.preprocessing` module. It assigns each distinct label an integer code, following the sorted order of the labels.

In [72]:
from sklearn.preprocessing import LabelEncoder

In [73]:
#creating an object of LabelEncoder() class
encoder_x=LabelEncoder()

In [74]:
X[:,0]=encoder_x.fit_transform(X[:,0])

In [75]:
X

array([[0, 24.0, 34000.0],
       [1, 53.0, 43000.0],
       [2, 45.666666666666664, 45000.0],
       [1, 34.0, 47285.71428571428],
       [3, 55.0, 33000.0],
       [2, 45.666666666666664, 44000.0],
       [0, 43.0, 56000.0],
       [0, 65.0, 76000.0]], dtype=object)
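
The mapping between labels and integer codes can be inspected through the encoder's `classes_` attribute. The sketch below uses the country column from the lecture; the names `countries`, `mapping`, and `decoded` are illustrative. Note that scikit-learn documents `LabelEncoder` as intended for target labels; for input features it also provides `OrdinalEncoder` and `OneHotEncoder`.

```python
from sklearn.preprocessing import LabelEncoder

countries = ["France", "Germany", "Italy", "Germany", "USA", "Italy", "France", "France"]

encoder = LabelEncoder()
codes = encoder.fit_transform(countries)

# classes_ lists the labels in sorted order; the position of a label is its code.
mapping = {label: code for code, label in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform recovers the original labels from the codes.
decoded = encoder.inverse_transform(codes)
print(list(decoded))
```

Because the codes are plain integers, a model may treat `USA` (3) as "greater than" `France` (0), which is why nominal columns are usually one-hot encoded instead.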

#### Ordinal and Nominal Categorical Data

Ordinal categorical data has a natural order among its categories, e.g. education level.

Nominal categorical data has no inherent order, e.g. country names or gender.
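
For ordinal data, the integer codes should follow the real order of the categories rather than alphabetical order. A minimal sketch, assuming a hypothetical education-level column (the values and the names `education` and `level_order` are made up for illustration), using `OrdinalEncoder` with an explicit category order:

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical ordinal feature; not part of the lecture dataset.
education = np.array([["Bachelor"], ["High School"], ["Master"], ["PhD"], ["Bachelor"]])

# State the order explicitly, from lowest to highest level.
level_order = ["High School", "Bachelor", "Master", "PhD"]

encoder = OrdinalEncoder(categories=[level_order])
encoded = encoder.fit_transform(education)
print(encoded.ravel())
```

Without the `categories` argument the encoder would sort the labels alphabetically, which would place `Bachelor` below `High School`.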


For nominal data, each category should be encoded as a separate binary (0/1) column, so that no artificial order is introduced by the integer codes.

This can be done with the `OneHotEncoder` class from the `sklearn.preprocessing` module.

In [76]:
from sklearn.preprocessing import OneHotEncoder
encoder=OneHotEncoder()

In [80]:
dummies=encoder.fit_transform(df.Countary.values.reshape(-1,1)).toarray()

In [81]:
dummies

array([[1., 0., 0., 0.],
       [0., 1., 0., 0.],
       [0., 0., 1., 0.],
       [0., 1., 0., 0.],
       [0., 0., 0., 1.],
       [0., 0., 1., 0.],
       [1., 0., 0., 0.],
       [1., 0., 0., 0.]])
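
The dummy columns usually need to be combined with the remaining numeric features. One possible way, sketched here under the assumption that the features are held in a DataFrame, is `ColumnTransformer` with `remainder="passthrough"`; the frame `features` reuses four complete rows of the lecture data, and the other names are illustrative.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Four complete rows from the lecture data (no missing values).
features = pd.DataFrame({
    "Countary": ["France", "Germany", "USA", "France"],
    "Age": [24.0, 53.0, 55.0, 43.0],
    "Salary": [34000.0, 43000.0, 33000.0, 56000.0],
})

# One-hot encode the country column and pass the numeric columns through unchanged.
transformer = ColumnTransformer(
    transformers=[("country", OneHotEncoder(), ["Countary"])],
    remainder="passthrough",
    sparse_threshold=0,  # always return a dense array
)
encoded = transformer.fit_transform(features)

# Categories found by the encoder, in the order of the dummy columns.
categories = list(transformer.named_transformers_["country"].categories_[0])
print(categories)
print(encoded)
```

The dummy columns come first in the result, followed by `Age` and `Salary`.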

In [82]:
label_encoder_y=LabelEncoder()

In [83]:
y=label_encoder_y.fit_transform(Y)

In [84]:
y

array([1, 0, 1, 1, 1, 0, 1, 0])

#### Splitting Data into train/test sets

In [95]:
from sklearn.model_selection import train_test_split

In [96]:
# split the features together with the encoded labels y (not the raw strings in Y)
xtrain,xtest,ytrain,ytest=train_test_split(X,y,test_size=0.2,random_state=0)

In [113]:
xtrain

array([[1, 53.0, 43000.0],
       [0, 65.0, 76000.0],
       [1, 34.0, 47285.71428571428],
       [0, 24.0, 34000.0],
       [2, 45.666666666666664, 44000.0],
       [3, 55.0, 33000.0]], dtype=object)

In [114]:
xtest

array([[0, 43.0, 56000.0],
       [2, 45.666666666666664, 45000.0]], dtype=object)
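
The effect of `test_size` and `random_state` can be checked on a small stand-in array. This is a sketch only: `X_demo` holds placeholder feature values, `y_demo` reuses the encoded labels from above, and all names are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_demo = np.arange(16).reshape(8, 2)          # 8 samples, 2 placeholder features
y_demo = np.array([1, 0, 1, 1, 1, 0, 1, 0])   # encoded labels, one per sample

# test_size=0.2 of 8 samples is rounded up to 2 test samples, leaving 6 for training.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)
print(X_train.shape, X_test.shape)

# The same random_state reproduces exactly the same split.
X_train2, X_test2, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)
same_split = np.array_equal(X_test, X_test2)
print(same_split)
```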

#### Feature Scaling

In [115]:
from sklearn.preprocessing import StandardScaler

In [116]:
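scaler=StandardScaler()

In [117]:
# fit the scaler on the training set only
xtrain_s=scaler.fit_transform(xtrain)

In [118]:
# reuse the training mean and standard deviation on the test set (no refitting)
xtest_s=scaler.transform(xtest)

In [119]:
xtest_s.round(4)

array([[-1.0932, -0.2278,  0.6842],
       [ 0.7809, -0.0325, -0.0849]])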

In [120]:
xtrain_s

array([[-0.15617376,  0.50443194, -0.22473516],
       [-1.09321633,  1.38311983,  2.08254578],
       [-0.15617376, -0.88682389,  0.07491172],
       [-1.09321633, -1.6190638 , -0.85399359],
       [ 0.78086881, -0.032544  , -0.15481755],
       [ 1.71791138,  0.65087992, -0.9239112 ]])
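
Standardization rescales each feature as z = (x - mean) / std. A common convention is to compute the mean and standard deviation on the training set only and reuse them for the test set, so that no information from the test data enters the preprocessing. The sketch below illustrates this on a few Age/Salary rows from the lecture data; the names `train`, `test`, and `manual` are illustrative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# A few (Age, Salary) rows from the lecture data.
train = np.array([[53.0, 43000.0], [65.0, 76000.0], [24.0, 34000.0], [55.0, 33000.0]])
test = np.array([[43.0, 56000.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learn mean/std from the training rows
test_scaled = scaler.transform(test)        # apply the same mean/std to the test row

# Manual check of the formula z = (x - mean) / std using the training statistics.
manual = (test - train.mean(axis=0)) / train.std(axis=0)
print(test_scaled, manual)

# After scaling, each training column has mean 0 and standard deviation 1.
print(train_scaled.mean(axis=0), train_scaled.std(axis=0))
```

Calling `fit_transform` on the test set instead would rescale it with its own statistics, so the train and test features would no longer be on the same scale.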