
# Importing libraries

We generally import these three libraries:
1.   numpy: for arrays and general mathematics
2.   matplotlib: for plotting charts, via its pyplot module
3.   pandas: for importing and manipulating data



In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing the dataset

Read the input data from a CSV file, then extract the input variables (the features) and the output variable (the dependent variable) from it.

Python also allows you to index from the end of the list using a negative number, where [-1] returns the last element. This is super-useful since it means you don't have to programmatically find out the length of the iterable in order to work with elements at the end of it.
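
As a small illustration of negative indexing, the sketch below uses a made-up list of column names (hypothetical, not read from Data.csv) to pick out the last element and everything before it.

```python
# Hypothetical list of names, used only to illustrate negative indexing.
columns = ['Country', 'Age', 'Salary', 'Purchased']

last = columns[-1]            # last element, same as columns[len(columns) - 1]
all_but_last = columns[:-1]   # every element except the last

print(last)
print(all_but_last)
```

The same two patterns, `-1` and `:-1`, are what separate the dependent variable from the features later in this section.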

'iloc' stands for integer location: it selects rows and columns by position. iloc[x, y] selects the rows in the range x and the columns in the range y, where a range is written as 'l:r'. In Python, a range includes its first term and excludes its last. To take a single row or column instead of a range, we simply enter its index. To take every row or column, we use a bare ':'.
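
The selection rules above can be tried on a tiny in-memory frame. This is a minimal sketch: the three-row DataFrame and its column names are hypothetical stand-ins, not the contents of Data.csv.

```python
import pandas as pd

# Hypothetical three-row frame, used only to demonstrate iloc.
df = pd.DataFrame({
    'Country': ['France', 'Spain', 'Germany'],
    'Age': [44.0, 27.0, 30.0],
    'Purchased': ['No', 'Yes', 'No'],
})

features = df.iloc[:, :-1].values   # all rows, every column except the last
target = df.iloc[:, -1].values      # all rows, last column only
first_two_rows = df.iloc[0:2, :]    # rows 0 and 1; the stop index 2 is excluded
single_cell = df.iloc[1, 0]         # row 1, column 0

print(features.shape)
print(target)
print(single_cell)
```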

In [None]:
dataset = pd.read_csv('Data.csv')
X = dataset.iloc[:, :-1].values     # matrix of input variables (features): every column except the last
Y = dataset.iloc[:, -1].values      # vector of the dependent variable: the last column

In [None]:
print(X)

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 nan]
 ['France' 35.0 58000.0]
 ['Spain' nan 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]


In [None]:
print(Y)

['No' 'Yes' 'No' 'No' 'Yes' 'Yes' 'No' 'Yes' 'No' 'Yes']


# Taking care of missing data

When only a few rows have missing data, we can simply drop those rows. Otherwise, we have to fill in the missing values. A missing numerical value can be replaced by the average of the other values in its column. This is done using the SimpleImputer class in the sklearn.impute module.

The fit() method calculates the statistic for each column (the mean, in this case). It returns the fitted imputer itself rather than any data. The transform() method replaces each nan with that mean and returns the transformed data. The fit_transform() method does both in a single call.
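
To see the two steps separately, here is a minimal sketch on a small made-up array (the numbers are hypothetical, chosen so the column means are easy to check by hand). It reads the learned means from the imputer's statistics_ attribute and then compares the two-step result with fit_transform().

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical numeric data with one missing value in each column.
values = np.array([[44.0, 72000.0],
                   [27.0, np.nan],
                   [np.nan, 54000.0]])

imputer = SimpleImputer(missing_values=np.nan, strategy='mean')

imputer.fit(values)                   # learns one mean per column, ignoring nan
learned_means = imputer.statistics_   # the means computed by fit()
filled = imputer.transform(values)    # replaces each nan with its column mean

# fit_transform() performs both steps in a single call.
filled_one_step = SimpleImputer(strategy='mean').fit_transform(values)

print(learned_means)
print(filled)
```

Here the first column's mean is (44 + 27) / 2 = 35.5 and the second column's is (72000 + 54000) / 2 = 63000, which are the values written into the missing cells.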

In [None]:
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer.fit(X[:, 1:3])
X[:, 1:3] = imputer.transform(X[:, 1:3])

In [None]:
print(X)

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 63777.77777777778]
 ['France' 35.0 58000.0]
 ['Spain' 38.77777777777778 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]


# Encoding independent categorical variables

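If the input matrix contains a categorical variable, we have to express it as numbers. This is achieved by encoding. For an input variable with more than two possible values we use OneHotEncoder, which creates one column of 0s and 1s per category and so does not imply any order between the categories. For a variable with only two possible values, a single column of 0s and 1s is enough; here that is the dependent variable, which we encode with LabelEncoder.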
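
As a rough sketch of what one-hot encoding produces, the example below encodes a single made-up country column (hypothetical data, not Data.csv). The categories_ attribute shows the order of the new columns, which scikit-learn sorts alphabetically for string categories.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical single-column input: one country per row.
countries = np.array([['France'], ['Spain'], ['Germany'], ['Spain']])

encoder = OneHotEncoder()
# The encoder returns a sparse matrix by default; convert it to a dense array to print it.
encoded = encoder.fit_transform(countries).toarray()

print(encoder.categories_[0])   # column order of the encoded output
print(encoded)
```

Each row contains exactly one 1, in the column that matches its country.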

In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
# One-hot encode column 0 (the country) and pass the remaining columns through unchanged.
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])], remainder='passthrough')
X = np.array(ct.fit_transform(X))

In [None]:
print(X)

[[1.0 0.0 0.0 44.0 72000.0]
 [0.0 0.0 1.0 27.0 48000.0]
 [0.0 1.0 0.0 30.0 54000.0]
 [0.0 0.0 1.0 38.0 61000.0]
 [0.0 1.0 0.0 40.0 63777.77777777778]
 [1.0 0.0 0.0 35.0 58000.0]
 [0.0 0.0 1.0 38.77777777777778 52000.0]
 [1.0 0.0 0.0 48.0 79000.0]
 [0.0 1.0 0.0 50.0 83000.0]
 [1.0 0.0 0.0 37.0 67000.0]]


# Encoding dependent categorical variables

In this case, the dependent categorical variable has just two possible values, so we can use LabelEncoder.
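
A minimal sketch of the mapping, using a short made-up label list (hypothetical, not the Y from Data.csv): the classes_ attribute shows which label received which integer, and inverse_transform() recovers the original labels.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels, used only to show the mapping.
labels = ['No', 'Yes', 'No', 'Yes']

demo_encoder = LabelEncoder()
codes = demo_encoder.fit_transform(labels)       # integers assigned in sorted label order
decoded = demo_encoder.inverse_transform(codes)  # back to the original strings

print(demo_encoder.classes_)   # position in this array is the integer code
print(codes)
```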

In [None]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
Y = le.fit_transform(Y)

In [None]:
print(Y)

[0 1 0 0 1 1 0 1 0 1]


# Splitting the dataset into testing and training data

In [None]:
from sklearn.model_selection import train_test_split
# test_size=0.2 holds out 20% of the rows for testing. The rows are shuffled before splitting by default.
# random_state=None gives a different split on every run; a fixed integer always reproduces the same split.
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=1)
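
The effect of a fixed random_state can be checked with a small sketch. The arrays below are hypothetical placeholders with ten rows, not the X and Y built above; the point is only that two calls with the same seed return identical splits and that test_size=0.2 keeps 2 of the 10 rows for testing.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 rows with 2 features each, and 10 targets.
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

split_a = train_test_split(X_demo, y_demo, test_size=0.2, random_state=1)
split_b = train_test_split(X_demo, y_demo, test_size=0.2, random_state=1)

# Compare the four returned arrays pairwise: same seed, same split.
same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))

X_tr, X_te, y_tr, y_te = split_a
print(same_split)
print(X_tr.shape, X_te.shape)
```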

In [None]:
print(X_train)

[[0.0 0.0 1.0 38.77777777777778 52000.0]
 [0.0 1.0 0.0 40.0 63777.77777777778]
 [1.0 0.0 0.0 44.0 72000.0]
 [0.0 0.0 1.0 38.0 61000.0]
 [0.0 0.0 1.0 27.0 48000.0]
 [1.0 0.0 0.0 48.0 79000.0]
 [0.0 1.0 0.0 50.0 83000.0]
 [1.0 0.0 0.0 35.0 58000.0]]


In [None]:
print(Y_train)

[0 1 0 0 1 1 0 1]


In [None]:
print(X_test)

[[0.0 1.0 0.0 30.0 54000.0]
 [1.0 0.0 0.0 37.0 67000.0]]


In [None]:
print(Y_test)

[0 1]
