# Welcome to the Basics of Data Preprocessing

Here, we'll learn about feature engineering, the art that helps achieve good accuracy in Machine Learning.  
There are a number of things we need to keep in mind before we feed our data to ML algorithms.  

The dataset must contain no null values and no strings, but a number of other factors also affect our data. Most professionals spend more than 40% of their time on this step, so do refer to blogs to learn more.  
Let's get started.

In [1]:
import pandas as pd
import numpy as np

In [2]:
data=pd.read_csv('data/Data.csv')
data.head()

Unnamed: 0,City,Experience,Salary,Promotion
0,Delhi,4.0,55000.0,No
1,Mumbai,2.0,20000.0,Yes
2,Agra,3.0,30000.0,No
3,Mumbai,8.0,72000.0,No
4,Agra,4.0,,Yes


If you look at the dataset, you'll see that we have missing values, which you already know how to deal with.  
There are also new methods, which I'll guide you through this week.

In [3]:
data.isna().sum()

City          1
Experience    2
Salary        1
Promotion     0
dtype: int64

In [4]:
# Method 1: drop every row that contains at least one missing value
data1 = data.dropna(how='any', axis=0)
data1.head()

Unnamed: 0,City,Experience,Salary,Promotion
0,Delhi,4.0,55000.0,No
1,Mumbai,2.0,20000.0,Yes
2,Agra,3.0,30000.0,No
3,Mumbai,8.0,72000.0,No
5,Delhi,5.0,60000.0,Yes


In [5]:
# Method 2: fill (impute) the missing values with a statistic of the column
from sklearn.impute import SimpleImputer as Imputer
x = data['Salary'].values.reshape(-1,1)  # the imputer expects a 2-D array

# Replace NaN with the most frequent salary
x_most_frequent = Imputer(missing_values=np.nan,
                          strategy='most_frequent').fit_transform(x)
print("x_most_frequent = ",x_most_frequent)

# Replace NaN with the mean salary
x_mean = Imputer(missing_values=np.nan,
                 strategy='mean').fit_transform(x)
print("x_mean = ",x_mean)

# Replace NaN with the median salary
x_median = Imputer(missing_values=np.nan,
                   strategy='median').fit_transform(x)
print("x_median = ",x_median)

x_most_frequent =  [[55000.]
 [20000.]
 [30000.]
 [72000.]
 [52000.]
 [60000.]
 [52000.]
 [51000.]
 [59000.]
 [31000.]
 [58000.]
 [52000.]
 [79000.]
 [60000.]
 [67000.]]
x_mean =  [[55000.        ]
 [20000.        ]
 [30000.        ]
 [72000.        ]
 [53285.71428571]
 [60000.        ]
 [52000.        ]
 [51000.        ]
 [59000.        ]
 [31000.        ]
 [58000.        ]
 [52000.        ]
 [79000.        ]
 [60000.        ]
 [67000.        ]]
x_median =  [[55000.]
 [20000.]
 [30000.]
 [72000.]
 [56500.]
 [60000.]
 [52000.]
 [51000.]
 [59000.]
 [31000.]
 [58000.]
 [52000.]
 [79000.]
 [60000.]
 [67000.]]
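
The same three strategies can also be written directly in pandas. The sketch below is only an illustration on a small made-up Series (the values are hypothetical, not the real `Salary` column), and the variable names are assumptions chosen for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical salary column with one missing value (not the real Data.csv)
salary = pd.Series([55000.0, 20000.0, np.nan, 52000.0, 52000.0])

# mean() and median() skip NaN by default
filled_mean = salary.fillna(salary.mean())
filled_median = salary.fillna(salary.median())

# mode() returns a Series (there can be ties), so take its first value
filled_mode = salary.fillna(salary.mode().iloc[0])

print(filled_mean.tolist())
print(filled_median.tolist())
print(filled_mode.tolist())
```

Which statistic fits best depends on the column: the mean is sensitive to outliers, while the median and the most frequent value are not.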


Continuing with preprocessing, keep in mind that ML algorithms work on numbers, so we cannot feed them words.  
To solve this, we need to convert the words to numbers.  
We can do this by assigning each one a number, like:
* Agra 0
* Delhi 1
* Mumbai 2
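
As a rough sketch of the numbering in the list above, the mapping can be built by hand with a dictionary. The sample Series is hypothetical, and `mapping` and `encoded` are names made up for this illustration.

```python
import pandas as pd

# Hypothetical sample of city names
cities = pd.Series(["Delhi", "Mumbai", "Agra", "Mumbai"])

# Sort the unique names so the numbering is alphabetical: Agra 0, Delhi 1, Mumbai 2
mapping = {name: code for code, name in enumerate(sorted(cities.unique()))}

# Replace every city name with its number
encoded = cities.map(mapping)

print(mapping)
print(encoded.tolist())
```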

In [6]:
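# Convert the data frame to NumPy arrays: features (X) and target (y)
X = data1.iloc[:, :-1].values   # every column except the last one
y = data1.iloc[:, -1].values    # the last column (Promotion)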

In [7]:
X

array([['Delhi', 4.0, 55000.0],
       ['Mumbai', 2.0, 20000.0],
       ['Agra', 3.0, 30000.0],
       ['Mumbai', 8.0, 72000.0],
       ['Delhi', 5.0, 60000.0],
       ['Delhi', 4.0, 51000.0],
       ['Agra', 5.0, 59000.0],
       ['Delhi', 3.0, 31000.0],
       ['Delhi', 8.0, 79000.0],
       ['Agra', 5.0, 60000.0],
       ['Delhi', 7.0, 67000.0]], dtype=object)

In [8]:
y

array(['No', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'Yes', 'Yes', 'No',
       'Yes'], dtype=object)

In [9]:
from sklearn.preprocessing import LabelEncoder
labelencoder_X = LabelEncoder()

X[:, 0] = labelencoder_X.fit_transform(X[:, 0])
X

array([[1, 4.0, 55000.0],
       [2, 2.0, 20000.0],
       [0, 3.0, 30000.0],
       [2, 8.0, 72000.0],
       [1, 5.0, 60000.0],
       [1, 4.0, 51000.0],
       [0, 5.0, 59000.0],
       [1, 3.0, 31000.0],
       [1, 8.0, 79000.0],
       [0, 5.0, 60000.0],
       [1, 7.0, 67000.0]], dtype=object)

In [10]:
labelencoder_X.classes_

array(['Agra', 'Delhi', 'Mumbai'], dtype=object)

In [11]:
labelencoder_y = LabelEncoder()
y = labelencoder_y.fit_transform(y)

In [12]:
y

array([0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1])

The labels are encoded as:
* No 0
* Yes 1

Note that for the cities, the numbers should not carry any weight: they are only labels.  
But as the number of cities increases, cities with larger numbers will be given more priority by the ML algorithms.  

Because of this, Mumbai will get more importance than Agra.   
Think it over, or google what would happen if we had 100 cities instead of 3!

To overcome this, we have to identify such columns and apply one-hot encoding.
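
To see why integer labels can mislead a model, the sketch below compares the distances between cities under both encodings. It is only an illustration using the three city codes from above; the dictionaries and variable names are made up for this example.

```python
import numpy as np

# Integer labels as assigned above: Agra 0, Delhi 1, Mumbai 2
label = {"Agra": 0, "Delhi": 1, "Mumbai": 2}

# One-hot vectors: one position per city
one_hot = {
    "Agra": np.array([1.0, 0.0, 0.0]),
    "Delhi": np.array([0.0, 1.0, 0.0]),
    "Mumbai": np.array([0.0, 0.0, 1.0]),
}

# With integer labels, Mumbai looks twice as far from Agra as Delhi does
label_agra_delhi = abs(label["Agra"] - label["Delhi"])
label_agra_mumbai = abs(label["Agra"] - label["Mumbai"])

# With one-hot vectors, every pair of cities is equally far apart
onehot_agra_delhi = np.linalg.norm(one_hot["Agra"] - one_hot["Delhi"])
onehot_agra_mumbai = np.linalg.norm(one_hot["Agra"] - one_hot["Mumbai"])

print(label_agra_delhi, label_agra_mumbai)
print(onehot_agra_delhi, onehot_agra_mumbai)
```

The integer labels invent an ordering that the cities do not have, while the one-hot vectors treat all cities alike.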

In [13]:
from sklearn.preprocessing import OneHotEncoder

onehotencoder = OneHotEncoder(categories='auto')   
p = onehotencoder.fit_transform(X[:,0:1]).toarray()
p

array([[0., 1., 0.],
       [0., 0., 1.],
       [1., 0., 0.],
       [0., 0., 1.],
       [0., 1., 0.],
       [0., 1., 0.],
       [1., 0., 0.],
       [0., 1., 0.],
       [0., 1., 0.],
       [1., 0., 0.],
       [0., 1., 0.]])

In [14]:
# The pandas version is easier to read: each column is named after its city
dff = pd.get_dummies(data1['City'])
dff.head()

Unnamed: 0,Agra,Delhi,Mumbai
0,0,1,0
1,0,0,1
2,1,0,0
3,0,0,1
5,0,1,0


Although we converted our variables to the above format, we'll still face one issue. It's called the **Dummy Variable Trap**. We'll discuss it next week.
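
As a small preview, the trap arises because the dummy columns always add up to 1, so any one of them can be worked out from the others. One common remedy, which may or may not be the one covered next week, is to drop one of the columns. The sketch below assumes a hypothetical sample of cities and uses the `drop_first` option of `pd.get_dummies`.

```python
import pandas as pd

# Hypothetical sample of cities
cities = pd.DataFrame({"City": ["Delhi", "Mumbai", "Agra", "Mumbai", "Delhi"]})

# Full one-hot encoding: the three columns in every row sum to 1
full = pd.get_dummies(cities["City"])

# drop_first=True removes the first category (Agra); a row of all zeros now means Agra
reduced = pd.get_dummies(cities["City"], drop_first=True)

print(full.sum(axis=1).tolist())
print(list(reduced.columns))
```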

In [15]:
dff=pd.concat([dff, data1["Experience"],data1["Salary"]], axis=1)
dff

Unnamed: 0,Agra,Delhi,Mumbai,Experience,Salary
0,0,1,0,4.0,55000.0
1,0,0,1,2.0,20000.0
2,1,0,0,3.0,30000.0
3,0,0,1,8.0,72000.0
5,0,1,0,5.0,60000.0
7,0,1,0,4.0,51000.0
8,1,0,0,5.0,59000.0
9,0,1,0,3.0,31000.0
12,0,1,0,8.0,79000.0
13,1,0,0,5.0,60000.0
14,0,1,0,7.0,67000.0


# Normalisation
In the data frame above, we should scale down Salary and Experience, because their values are very large compared with the 0s and 1s of the city columns and would otherwise drown out the relevance of City. There are many methods to achieve this.

In [16]:
X = dff.iloc[:,:].values

# StandardScaler rescales every column to zero mean and unit variance
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
print(sc_X.fit_transform(X))

[[-0.61237244  0.91287093 -0.47140452 -0.47140452  0.10738071]
 [-0.61237244 -1.09544512  2.12132034 -1.50849447 -1.8612657 ]
 [ 1.63299316 -1.09544512 -0.47140452 -0.98994949 -1.29879529]
 [-0.61237244 -1.09544512  2.12132034  1.60277537  1.0635804 ]
 [-0.61237244  0.91287093 -0.47140452  0.04714045  0.38861591]
 [-0.61237244  0.91287093 -0.47140452 -0.47140452 -0.11760745]
 [ 1.63299316 -1.09544512 -0.47140452  0.04714045  0.33236887]
 [-0.61237244  0.91287093 -0.47140452 -0.98994949 -1.24254825]
 [-0.61237244  0.91287093 -0.47140452  1.60277537  1.45730968]
 [ 1.63299316 -1.09544512 -0.47140452  0.04714045  0.38861591]
 [-0.61237244  0.91287093 -0.47140452  1.0842304   0.7823452 ]]
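
Under the hood, standardisation subtracts the column mean and divides by the column standard deviation. A minimal NumPy sketch of that calculation follows; the four salary values are a hypothetical sample rather than the full column, so the numbers will not match the output above.

```python
import numpy as np

# Hypothetical salary values, not the full column from the notebook
salary = np.array([55000.0, 20000.0, 30000.0, 72000.0])

# z = (x - mean) / std, using the population standard deviation
# (ddof=0, which is NumPy's default and also what StandardScaler uses)
z = (salary - salary.mean()) / salary.std()

print(z)
print(z.mean(), z.std())
```

After scaling, the column has a mean of (approximately) 0 and a standard deviation of 1, whatever its original units were.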


In [17]:
X = dff.iloc[:,:].values

# MaxAbsScaler divides every column by its largest absolute value
from sklearn.preprocessing import MaxAbsScaler
m_X = MaxAbsScaler()
print(m_X.fit_transform(X))

[[0.         1.         0.         0.5        0.69620253]
 [0.         0.         1.         0.25       0.25316456]
 [1.         0.         0.         0.375      0.37974684]
 [0.         0.         1.         1.         0.91139241]
 [0.         1.         0.         0.625      0.75949367]
 [0.         1.         0.         0.5        0.64556962]
 [1.         0.         0.         0.625      0.74683544]
 [0.         1.         0.         0.375      0.39240506]
 [0.         1.         0.         1.         1.        ]
 [1.         0.         0.         0.625      0.75949367]
 [0.         1.         0.         0.875      0.84810127]]
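
Max-abs scaling is simple enough to check by hand: each value is divided by the largest absolute value in its column. The sketch below uses the first four Experience values from the data frame above; the variable names are made up for this illustration.

```python
import numpy as np

# First four Experience values from the data frame above
experience = np.array([4.0, 2.0, 3.0, 8.0])

# Divide by the largest absolute value, so the result lies in [-1, 1]
scaled = experience / np.abs(experience).max()

print(scaled)  # [0.5   0.25  0.375 1.   ]
```

Because the one-hot columns already have a maximum of 1, this scaler leaves them unchanged, unlike the standardisation shown earlier.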


## Machine Learning
- [Application](https://www.geeksforgeeks.org/machine-learning-introduction/)
- [Types of ML models](https://www.geeksforgeeks.org/ml-types-learning-supervised-learning/)
- [Difference between Supervised and Unsupervised Learning](https://www.geeksforgeeks.org/difference-between-supervised-and-unsupervised-learning/?ref=rp)
- [Semi-supervised Learning](https://www.geeksforgeeks.org/ml-semi-supervised-learning/?ref=rp)

## Other links to refer to
- [Scikit-Learn](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.preprocessing)
- [GeeksforGeeks](https://www.geeksforgeeks.org/data-preprocessing-machine-learning-python/)
- [Medium](https://medium.com/search?q=preprocessing%20in%20machine%20learning)
- [YouTube](https://www.youtube.com/results?search_query=preprocessing+in+machine+learning)
- [Analytics Vidhya](https://www.analyticsvidhya.com/blog/2016/07/practical-guide-data-preprocessing-python-scikit-learn/)