

Data Preprocessing with ScikitLearn



In [None]:
import pandas as pd 
import numpy as np 
df = pd.read_csv('datapreprocessing.csv')
df

Unnamed: 0,Color,Years,Strength,Height,Weight,Dangerous
0,Green,2.3,210.0,170.0,20 to 30 kg,Yes
1,Red,4.1,100.0,,10 to 20 kg,No
2,Blue,1.4,,412.0,0 to 10 kg,No
3,Green,,313.0,123.0,10 to 20 kg,Yes
4,,5.2,512.0,372.0,0 to 10 kg,Yes


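The dataset has missing values in the Color, Years, Strength, and Height columns. To fill in the numeric ones, we first import the SimpleImputer class.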

In [None]:
#Importing the SimpleImputer class
from sklearn.impute import SimpleImputer



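Next, we instantiate a SimpleImputer object called imp, passing two parameters to the constructor: missing_values, which identifies the placeholder for missing entries (np.nan here), and strategy, which selects how to fill them ('mean' replaces each missing entry with the mean of its column).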



In [None]:
# Instantiating a SimpleImputer object
imp = SimpleImputer(missing_values=np.nan, strategy='mean')

After instantiating the SimpleImputer object, we use it to call the fit() method, passing df[['Years', 'Strength', 'Height']] to the method. This method expects a 2-dimensional array and calculates the means of the columns in the array; it stores the results in the statistics_ attribute.

In [None]:
# Calling the fit() method to calculate the means
imp.fit(df[['Years', 'Strength', 'Height']])

SimpleImputer()

In [None]:
print(imp.statistics_)

[  3.25 283.75 269.25]


3.25 is the mean of the first feature (Years), while 283.75 and 269.25 are the means of the second and third features (Strength and Height), respectively. Missing values are ignored when the means are calculated.
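As a cross-check, the same means can be computed with pandas. The sketch below is only an illustration: it rebuilds the three numeric columns by hand from the table above, and the names numeric, imp_check, and column_means are made up for this example.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Numeric columns copied from the table above (NaN marks a missing value).
numeric = pd.DataFrame({
    'Years':    [2.3, 4.1, 1.4, np.nan, 5.2],
    'Strength': [210.0, 100.0, np.nan, 313.0, 512.0],
    'Height':   [170.0, np.nan, 412.0, 123.0, 372.0],
})

imp_check = SimpleImputer(missing_values=np.nan, strategy='mean')
imp_check.fit(numeric)

# pandas skips NaN by default, so these should match the imputer's statistics.
column_means = numeric.mean().to_numpy()

print(imp_check.statistics_)
print(column_means)
```

Both print statements should show the same three values.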


After calculating the means, we pass df[['Years', 'Strength', 'Height']] to the transform() method to get a new dataset with the missing values replaced. This method uses the means stored in statistics_ to do the replacement and returns a transformed dataset, which we assign back to df[['Years', 'Strength', 'Height']].

In [None]:
# transforming the data
df[['Years', 'Strength', 'Height']] = imp.transform(df[['Years', 'Strength', 'Height']])

In [None]:
df

Unnamed: 0,Color,Years,Strength,Height,Weight,Dangerous
0,Green,2.3,210.0,170.0,20 to 30 kg,Yes
1,Red,4.1,100.0,269.25,10 to 20 kg,No
2,Blue,1.4,283.75,412.0,0 to 10 kg,No
3,Green,3.25,313.0,123.0,10 to 20 kg,Yes
4,,5.2,512.0,372.0,0 to 10 kg,Yes


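The missing values in the Years, Strength, and Height columns have now been replaced with the column means. The Color column still has one missing value.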
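When fit() and transform() are applied to the same data, they can be combined into a single fit_transform() call. The sketch below compares the two routes on the Years values only. The array years and the names two_step and one_step are illustrative, not part of the notebook.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Years values from the table above, as a 2D array with one column.
years = np.array([[2.3], [4.1], [1.4], [np.nan], [5.2]])

# Route 1: fit, then transform.
two_step = SimpleImputer(strategy='mean')
two_step.fit(years)
filled_two_step = two_step.transform(years)

# Route 2: fit and transform in one call.
one_step = SimpleImputer(strategy='mean')
filled_one_step = one_step.fit_transform(years)

print(filled_one_step.ravel())
```

Keeping fit() and transform() separate is still useful when the means learned from one dataset need to be applied to another, for example from training data to test data.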


Your task:

Let us now fill in the missing value in the Color column using the most_frequent strategy.

In [None]:
# Switch the existing imputer to the most_frequent strategy and refit it on Color
imp.set_params(strategy='most_frequent')
imp.fit(df[['Color']])
df[['Color']] = imp.transform(df[['Color']])

In [None]:
df

Unnamed: 0,Color,Years,Strength,Height,Weight,Dangerous
0,Green,2.3,210.0,170.0,20 to 30 kg,Yes
1,Red,4.1,100.0,269.25,10 to 20 kg,No
2,Blue,1.4,283.75,412.0,0 to 10 kg,No
3,Green,3.25,313.0,123.0,10 to 20 kg,Yes
4,Green,5.2,512.0,372.0,0 to 10 kg,Yes
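The most_frequent strategy fills each gap with the value that occurs most often in the column. A rough pandas-only equivalent, shown here as a sanity check rather than as what SimpleImputer does internally, could look like this. The Series colors is rebuilt by hand from the original table.

```python
import numpy as np
import pandas as pd

# Color values from the original table; the last entry is missing.
colors = pd.Series(['Green', 'Red', 'Blue', 'Green', np.nan], name='Color')

# mode() ignores missing values by default and returns the most common value(s).
most_common = colors.mode()[0]
filled = colors.fillna(most_common)

print(most_common)
print(filled.tolist())
```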


Now you are ready to go!

Encoding Categorical Data

scikit-learn provides two classes for encoding categorical data as integers: LabelEncoder and OrdinalEncoder. The two classes are very similar, except that OrdinalEncoder is designed to work with the features of a dataset, while LabelEncoder is designed to work with the labels. Therefore, the fit() method of the LabelEncoder class expects a 1D array, while that of the OrdinalEncoder class expects a 2D array.
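The difference in expected input shape can be seen in a small sketch. The arrays labels and feature are hand-copied from the Dangerous and Weight columns and exist only for this illustration.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# 1D array of labels, shape (5,)
labels = np.array(['Yes', 'No', 'No', 'Yes', 'Yes'])

# 2D array with a single feature column, shape (5, 1)
feature = np.array([['20 to 30 kg'], ['10 to 20 kg'], ['0 to 10 kg'],
                    ['10 to 20 kg'], ['0 to 10 kg']])

encoded_labels = LabelEncoder().fit_transform(labels)
encoded_feature = OrdinalEncoder().fit_transform(feature)

# The output keeps the dimensionality of the input.
print(encoded_labels.shape)
print(encoded_feature.shape)
```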


In [None]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df['Dangerous'] = le.fit_transform(df['Dangerous'])

Here, we first import the LabelEncoder class and instantiate a LabelEncoder object. Next, we use it to call the fit_transform() method, passing a 1D array (df['Dangerous']) to the method. This method returns a NumPy array, which we assign back to the Dangerous column.
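To see which integer stands for which label, the fitted encoder exposes a classes_ attribute, and inverse_transform() maps the codes back. The following is a minimal sketch on a hand-written copy of the Dangerous column. The names le_demo, codes, and decoded are illustrative.

```python
from sklearn.preprocessing import LabelEncoder

le_demo = LabelEncoder()
codes = le_demo.fit_transform(['Yes', 'No', 'No', 'Yes', 'Yes'])

# classes_ lists the labels in sorted order; a label's position is its code.
print(le_demo.classes_)
print(codes)

# inverse_transform() recovers the original labels from the codes.
decoded = le_demo.inverse_transform(codes)
print(decoded)
```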



Next, let’s encode the Weight column (which is a feature) using the OrdinalEncoder class:


In [None]:
from sklearn.preprocessing import OrdinalEncoder
# np.int was removed from recent NumPy releases; the built-in int works instead
oe = OrdinalEncoder(dtype=int)
df[['Weight']] = oe.fit_transform(df[['Weight']])

In [None]:
df

Unnamed: 0,Color,Years,Strength,Height,Weight,Dangerous
0,Green,2.3,210.0,170.0,2,1
1,Red,4.1,100.0,269.25,1,0
2,Blue,1.4,283.75,412.0,0,0
3,Green,3.25,313.0,123.0,1,1
4,Green,5.2,512.0,372.0,0,1


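What if we wanted to encode the Color column?

We could use the OrdinalEncoder class again. However, one problem with such a simple approach is that it suggests a relationship between the categories that may not exist in reality.
For instance, our df DataFrame has a Color column with three values: “Green”, “Red”, and “Blue”. The OrdinalEncoder class encodes categories in alphabetical order. Hence, “Blue” will be encoded as 0, “Green” as 1, and “Red” as 2, which implies an ordering (Blue < Green < Red) that colors do not have. One-hot encoding avoids this by giving each category its own 0/1 column, which is what the OneHotEncoder class does.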
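As a quick check of the alphabetical ordering, here is a minimal sketch that ordinal-encodes the Color values on their own. The names colors and oe_demo are illustrative and not part of the notebook's code.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Color values after imputation, as a 2D array with one column.
colors = np.array([['Green'], ['Red'], ['Blue'], ['Green'], ['Green']])

oe_demo = OrdinalEncoder(dtype=int)
codes = oe_demo.fit_transform(colors)

# categories_ holds the sorted categories; a category's position is its code.
print(oe_demo.categories_[0])
print(codes.ravel())
```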

In [None]:
from sklearn.preprocessing import OneHotEncoder
# dtype=int replaces np.int, which recent NumPy releases no longer provide;
# sparse_output=False (named sparse before scikit-learn 1.2) returns a dense array
ohe = OneHotEncoder(dtype=int, sparse_output=False, drop='first')
# Passing a NumPy array makes the encoder generate the generic x0 prefix for the column names
color_encoded = ohe.fit_transform(df[['Color']].to_numpy())
# get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
df2 = pd.DataFrame(color_encoded, columns=ohe.get_feature_names_out())
df = pd.concat((df, df2), axis=1)

1. Here, we first import the OneHotEncoder class from the sklearn.preprocessing module.
2. We initialize a OneHotEncoder object called ohe. The dtype parameter sets the type of the encoded values, and sparse_output=False makes the encoder return a dense NumPy array instead of a sparse matrix. We also pass drop='first' to the constructor, which drops the column for the first category (“Blue”), so a Blue row is represented by zeros in the remaining columns.
3. After instantiating ohe, we use it to call the fit_transform() method, passing the Color column as a 2D NumPy array (df[['Color']].to_numpy()) as input to the method.
4. Next, we store the resulting array in a variable called color_encoded and convert this array to a DataFrame (df2) on the next line.
5. To name the columns in df2, we use the get_feature_names_out() method in the OneHotEncoder class. This method returns the names of the columns returned by the fit_transform() method.
6. Finally, we add df2 to our original DataFrame using the concat() method in pandas. This method works like the concatenate() function in NumPy.
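To see what drop='first' changes, the sketch below encodes the Color values with and without it. It is an illustration only: the array colors is copied by hand, and the output is converted with toarray() so that the example does not depend on how the dense-output parameter is named in a given scikit-learn version.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

colors = np.array([['Green'], ['Red'], ['Blue'], ['Green'], ['Green']])

# The default output is a SciPy sparse matrix; toarray() makes it dense.
full = OneHotEncoder().fit_transform(colors).toarray()
reduced = OneHotEncoder(drop='first').fit_transform(colors).toarray()

# full has one column per category (Blue, Green, Red);
# reduced omits the first category (Blue).
print(full.shape, reduced.shape)
print(reduced)
```

Dropping one column loses no information, because a row with zeros in the Green and Red columns can only be Blue.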






In [None]:
df

Unnamed: 0,Color,Years,Strength,Height,Weight,Dangerous,x0_Green,x0_Red
0,Green,2.3,210.0,170.0,2,1,1,0
1,Red,4.1,100.0,269.25,1,0,0,1
2,Blue,1.4,283.75,412.0,0,0,0,0
3,Green,3.25,313.0,123.0,1,1,1,0
4,Green,5.2,512.0,372.0,0,1,1,0


**Feature Scaling**

In [None]:
from sklearn.preprocessing import MinMaxScaler
mms = MinMaxScaler()
df[['Years']] = mms.fit_transform(df[['Years']])
df

Unnamed: 0,Color,Years,Strength,Height,Weight,Dangerous,x0_Green,x0_Red
0,Green,0.236842,210.0,170.0,2,1,1,0
1,Red,0.710526,100.0,269.25,1,0,0,1
2,Blue,0.0,283.75,412.0,0,0,0,0
3,Green,0.486842,313.0,123.0,1,1,1,0
4,Green,1.0,512.0,372.0,0,1,1,0
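With its default settings, MinMaxScaler rescales each column to the range 0 to 1 using (x - min) / (max - min). The scaled Years values above can be reproduced by hand with NumPy, as in this small sketch. The array years is copied from the imputed table.

```python
import numpy as np

# Years column after imputation.
years = np.array([2.3, 4.1, 1.4, 3.25, 5.2])

# Min-max scaling: the smallest value maps to 0 and the largest to 1.
scaled = (years - years.min()) / (years.max() - years.min())

print(np.round(scaled, 6))
```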


You can see that the Years column has been transformed. Now, try to transform the Strength and Height columns.

In [None]:
from sklearn.preprocessing import MinMaxScaler
mms = MinMaxScaler()
df[['Strength', 'Height']] = mms.fit_transform(df[['Strength', 'Height']])
df

Unnamed: 0,Color,Years,Strength,Height,Weight,Dangerous,x0_Green,x0_Red
0,Green,0.236842,0.26699,0.16263,2,1,1,0
1,Red,0.710526,0.0,0.506055,1,0,0,1
2,Blue,0.0,0.445995,1.0,0,0,0,0
3,Green,0.486842,0.51699,0.0,1,1,1,0
4,Green,1.0,1.0,0.861592,0,1,1,0


**Model Evaluation**

In [None]:
from sklearn.metrics import accuracy_score

true = ['Cat', 'Cat', 'Dog', 'Dog', 'Cat', 'Dog']
pred = ['Cat', 'Cat', 'Cat', 'Dog', 'Cat', 'Cat']

score = accuracy_score(true, pred)
print(score)

In [None]:
from sklearn.metrics import precision_score, recall_score
true = ['Cat', 'Cat', 'Dog', 'Dog', 'Cat', 'Dog']
pred = ['Cat', 'Cat', 'Cat', 'Dog', 'Cat', 'Cat']

precision = precision_score(true, pred, pos_label = 'Dog')
recall = recall_score(true, pred, pos_label = 'Dog')

print(precision)
print(recall)
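The scores above can be traced back to simple counts. The sketch below recomputes accuracy, precision, and recall in plain Python for the same lists, treating 'Dog' as the positive class. The variable names are illustrative, and the divisions assume that at least one positive prediction and one positive label exist.

```python
y_true = ['Cat', 'Cat', 'Dog', 'Dog', 'Cat', 'Dog']
y_pred = ['Cat', 'Cat', 'Cat', 'Dog', 'Cat', 'Cat']
positive = 'Dog'

pairs = list(zip(y_true, y_pred))

# Count true positives, false positives, and false negatives for 'Dog'.
tp = sum(t == positive and p == positive for t, p in pairs)
fp = sum(t != positive and p == positive for t, p in pairs)
fn = sum(t == positive and p != positive for t, p in pairs)

manual_accuracy = sum(t == p for t, p in pairs) / len(pairs)
manual_precision = tp / (tp + fp)   # share of predicted Dogs that are Dogs
manual_recall = tp / (tp + fn)      # share of actual Dogs that were found

print(tp, fp, fn)
print(manual_accuracy, manual_precision, manual_recall)
```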

In [None]:
from sklearn.metrics import r2_score, mean_squared_error
pred = [2.1, 1.4, 5.6, 7.9]
true = [2.5, 1.6, 5.1, 6.8]
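# Take the square root of the MSE directly; the squared=False argument is deprecated in recent scikit-learn releases
RMSE = mean_squared_error(true, pred) ** 0.5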
r2 = r2_score(true, pred)
print(RMSE)
print(r2)
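Both regression metrics can also be written out with NumPy, which makes their definitions explicit. This is a sketch for the same four values. The names rmse_manual and r2_manual are made up for the example.

```python
import numpy as np

y_true = np.array([2.5, 1.6, 5.1, 6.8])
y_pred = np.array([2.1, 1.4, 5.6, 7.9])

residuals = y_true - y_pred

# RMSE: square root of the mean squared residual.
rmse_manual = np.sqrt(np.mean(residuals ** 2))

# R^2: 1 minus the ratio of residual to total sum of squares.
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(rmse_manual)
print(r2_manual)
```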