<a href="https://colab.research.google.com/github/KrishnaPandya-VGEC-IT/Data-Science-/blob/main/Part_1_Data_Preprocessing_in_python.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Preprocessing Tools

### Importing the libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

### Importing the Dataset

In [None]:
dataset = pd.read_csv('/content/Data.csv') # read the CSV file into a DataFrame
x = dataset.iloc[:, :-1].values # features: all rows, every column except the last one
y = dataset.iloc[:, -1:].values # dependent variable: last column, kept 2-D by the slice; .iloc[:, -1] would return a 1-D array instead

In [None]:
print(x) #printing matrix of features

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 nan]
 ['France' 35.0 58000.0]
 ['Spain' nan 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]


In [None]:
print(y) #printing matrix of dependent variable column

[['No']
 ['Yes']
 ['No']
 ['No']
 ['Yes']
 ['Yes']
 ['No']
 ['Yes']
 ['No']
 ['Yes']]
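The two ways of selecting the last column differ only in the shape of the result. A minimal sketch, assuming a small hypothetical DataFrame (`demo`) in place of `Data.csv`:

```python
import pandas as pd

# Hypothetical stand-in for Data.csv with the same column layout
demo = pd.DataFrame({
    "Country": ["France", "Spain", "Germany"],
    "Age": [44.0, 27.0, 30.0],
    "Salary": [72000.0, 48000.0, 54000.0],
    "Purchased": ["No", "Yes", "No"],
})

features = demo.iloc[:, :-1].values   # every column except the last
target_2d = demo.iloc[:, -1:].values  # slice: keeps a column axis
target_1d = demo.iloc[:, -1].values   # integer: plain 1-D array

print(features.shape)   # (3, 3)
print(target_2d.shape)  # (3, 1)
print(target_1d.shape)  # (3,)
```

The 2-D form prints as a column, like `y` above, while most scikit-learn estimators and encoders expect the 1-D form for labels.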


### Taking care of missing Data

In [None]:
# Method 1: Remove the rows that contain missing values

# Method 2: Replace missing values with the mean of the available values in the same column


In [None]:
# Method 2:

from sklearn.impute import SimpleImputer # imputation class from scikit-learn
imputer = SimpleImputer(missing_values = np.nan, strategy = 'mean') # replace every NaN with the mean of its column
imputer.fit(x[:,1:3]) # compute the mean of the Age and Salary columns (indices 1 and 2)
x[:,1:3] = imputer.transform(x[:,1:3]) # fill the missing values and write the result back into x

In [None]:
x

array([['France', 44.0, 72000.0],
       ['Spain', 27.0, 48000.0],
       ['Germany', 30.0, 54000.0],
       ['Spain', 38.0, 61000.0],
       ['Germany', 40.0, 63777.77777777778],
       ['France', 35.0, 58000.0],
       ['Spain', 38.77777777777778, 52000.0],
       ['France', 48.0, 79000.0],
       ['Germany', 50.0, 83000.0],
       ['France', 37.0, 67000.0]], dtype=object)
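Both methods can be compared side by side on a small example. The sketch below uses a hypothetical four-row DataFrame (`toy`), not the notebook's dataset, and uses pandas `dropna` as one possible way to apply Method 1:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data with one missing Age and one missing Salary
toy = pd.DataFrame({
    "Age": [20.0, np.nan, 40.0, 60.0],
    "Salary": [1000.0, 2000.0, np.nan, 6000.0],
})

# Method 1: drop every row that contains at least one NaN
dropped = toy.dropna()

# Method 2: replace each NaN with the mean of its own column
imputed = SimpleImputer(missing_values=np.nan, strategy="mean").fit_transform(toy)

print(len(toy), "->", len(dropped))  # 4 -> 2
print(imputed)
```

Dropping rows halves this tiny dataset, which is why mean imputation is usually the more practical choice when there are only a few rows.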

### Encoding Categorical Data

Categorical data means the columns that hold categories, i.e., non-numeric values. Here these are the Country and Purchased columns.

Among the independent variables, only the Country column is categorical.

In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder # creates one binary column per category
ct = ColumnTransformer(transformers=[('encoder',OneHotEncoder(),[0])],remainder='passthrough') # apply the encoder to column 0 only
x = np.array(ct.fit_transform(x)) # convert the transformed result into a NumPy array

# 'encoder' is just the name of this step. OneHotEncoder turns the Country column into one-hot vectors,
# with categories in alphabetical order -> i.e., France - [1,0,0], Germany - [0,1,0], Spain - [0,0,1]
# [0] is the index of the column we need to encode

"""
remainder = 'passthrough' keeps the columns which are not transformed (Age and Salary).
If we don't use it, the default remainder = 'drop' discards them and only the
encoded country columns are returned. (Try it.)

Note that categorical columns must be converted into numeric values before they can be used in ML
model training. Moreover, rather than integer codes like 0,1,2,3, one-hot vectors like
[1,0,0,0],[0,1,0,0],[0,0,1,0],[0,0,0,1] are used so that the model does not read
an order or a magnitude into the categories.

"""







In [None]:
print(x)

[[1.0 0.0 0.0 44.0 72000.0]
 [0.0 0.0 1.0 27.0 48000.0]
 [0.0 1.0 0.0 30.0 54000.0]
 [0.0 0.0 1.0 38.0 61000.0]
 [0.0 1.0 0.0 40.0 63777.77777777778]
 [1.0 0.0 0.0 35.0 58000.0]
 [0.0 0.0 1.0 38.77777777777778 52000.0]
 [1.0 0.0 0.0 48.0 79000.0]
 [0.0 1.0 0.0 50.0 83000.0]
 [1.0 0.0 0.0 37.0 67000.0]]
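The effect of `remainder` and the order of the one-hot columns can be checked on a small array. This is a sketch on a hypothetical three-row `sample`; `sparse_threshold=0` is added here only to force a dense result and is not part of the notebook's code:

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical three-row sample: Country and Age only
sample = np.array([["France", 44.0], ["Spain", 27.0], ["Germany", 30.0]], dtype=object)

def encode(remainder):
    """One-hot encode column 0 and treat the other columns as `remainder` says."""
    ct = ColumnTransformer(
        transformers=[("encoder", OneHotEncoder(), [0])],
        remainder=remainder,
        sparse_threshold=0,  # assumption for this sketch: always return a dense array
    )
    return ct, np.array(ct.fit_transform(sample))

ct_keep, kept = encode("passthrough")
ct_drop, dropped = encode("drop")

# Categories are sorted alphabetically, which fixes the column order
print(ct_keep.named_transformers_["encoder"].categories_[0])  # ['France' 'Germany' 'Spain']
print(kept.shape)     # (3, 4): three one-hot columns plus Age
print(dropped.shape)  # (3, 3): Age has been dropped
```

Because the categories are sorted, Spain ends up in the third one-hot column, which matches the printed `x` above.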


### Encoding the Dependent Variable

The dependent variable is the Purchased column. (It is also called the label.)

In [None]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y = le.fit_transform(y.ravel()) # ravel() flattens the (10, 1) column into the 1-D array LabelEncoder expects; the result is already a NumPy array


In [None]:
print(y)

[0 1 0 0 1 1 0 1 0 1]
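The mapping that `LabelEncoder` learns can be inspected and reversed. A minimal sketch on hypothetical labels (`labels_2d` and `le_demo` are illustration names, not the notebook's variables):

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels stored as a column vector, like y after the import step
labels_2d = np.array([["No"], ["Yes"], ["No"], ["Yes"]])

le_demo = LabelEncoder()
encoded = le_demo.fit_transform(labels_2d.ravel())  # flatten to 1-D before encoding

print(le_demo.classes_)                    # ['No' 'Yes']
print(encoded)                             # [0 1 0 1]
print(le_demo.inverse_transform(encoded))  # ['No' 'Yes' 'No' 'Yes']
```

The classes are sorted, so 'No' becomes 0 and 'Yes' becomes 1, and `inverse_transform` recovers the original strings when predictions need to be reported.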


### Splitting Dataset into Training Set and Testing Set

We make two separate sets: one for training the model and the other for testing it.

In [None]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state = 0) # hold out 20% (2 of 10 rows) for testing; random_state makes the shuffle reproducible

In [None]:
print(x_train)

[[0.0 1.0 0.0 40.0 63777.77777777778]
 [1.0 0.0 0.0 37.0 67000.0]
 [0.0 0.0 1.0 27.0 48000.0]
 [0.0 0.0 1.0 38.77777777777778 52000.0]
 [1.0 0.0 0.0 48.0 79000.0]
 [0.0 0.0 1.0 38.0 61000.0]
 [1.0 0.0 0.0 44.0 72000.0]
 [1.0 0.0 0.0 35.0 58000.0]]


In [None]:
print(x_test) 

[[0.0 1.0 0.0 30.0 54000.0]
 [0.0 1.0 0.0 50.0 83000.0]]


In [None]:
print(y_train)

[1 1 1 0 1 0 0 1]


In [None]:
print(y_test)

[0 0]
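The role of `test_size` and `random_state` can be checked on synthetic data. A small sketch, assuming a hypothetical 10-row matrix (`features`) whose first column is twice the row's label (`target`), so that row alignment is easy to verify:

```python
import numpy as np
from sklearn.model_selection import train_test_split

features = np.arange(20).reshape(10, 2)  # hypothetical features: row i is [2i, 2i + 1]
target = np.arange(10)                   # hypothetical labels: row i has label i

split_a = train_test_split(features, target, test_size=0.2, random_state=0)
split_b = train_test_split(features, target, test_size=0.2, random_state=0)

f_train, f_test, t_train, t_test = split_a
print(f_train.shape, f_test.shape)  # (8, 2) (2, 2)

# Same seed, same shuffle: both calls pick identical test rows
print(np.array_equal(split_a[3], split_b[3]))  # True

# Features and labels are shuffled together, so rows stay aligned
print(np.array_equal(f_test[:, 0], 2 * t_test))  # True
```

Changing `random_state` selects different rows for the test set, which is why results from two runs are only comparable when the seed is the same.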


### Feature Scaling

Feature scaling puts all features on the same scale. It prevents a feature with large values (such as Salary) from dominating a feature with small values (such as Age), which the machine learning model might otherwise neglect.



*   **Apply feature scaling after splitting the dataset into training and test sets. This prevents information from the test set, which we are not supposed to have until training is done, from leaking into the model.**

*   **Applying feature scaling before the train-test split means computing the mean and the standard deviation of each feature over all rows, including the test data, which we are not supposed to do. The test data must not be used by the machine learning model in any way, not even for feature scaling.**
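The two points above come down to one rule: call `fit` (or `fit_transform`) on the training set only, and plain `transform` on the test set. A minimal sketch on hypothetical one-column data (`train`, `test` and `scaler` are illustration names):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])  # hypothetical training feature
test = np.array([[6.0], [10.0]])                        # hypothetical test feature

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean 3 and std sqrt(2) from train only
test_scaled = scaler.transform(test)        # reuses the training statistics

# Refitting on the test set throws the training scale away
test_refit = StandardScaler().fit_transform(test)

print(scaler.mean_, scaler.scale_)
print(test_scaled.ravel())
print(test_refit.ravel())  # [-1.  1.]
```

Any two distinct test values scaled on their own come out as exactly -1 and 1, whatever they were originally, so a model trained on the training scale would receive meaningless inputs.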

In [None]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
x_train[:,3:] = sc.fit_transform(x_train[:, 3:]) # all rows, last two columns: learn mean and std from the training set, then scale it
x_test[:,3:] = sc.transform(x_test[:, 3:]) # all rows, last two columns: scale with the training mean and std, without refitting

"""

  Note that the first 3 columns together represent the country, e.g., 1 0 0 means France.
  We must not scale the columns created by encoding. Otherwise the 0/1 values
  would be standardized and would no longer read as dummy variables.

"""

In [None]:
"""

The fit method computes the mean and standard deviation of each feature.
The transform method applies the scaling formula using those values.

Feature Scaling methods:

1) Standardization

          x - mean(x)
Xstd  =   -----------
            std(x)


2) Normalization

          x - min(x)
Xnorm = ----------------
        max(x) - min(x)


Standardization works well in most cases, whereas normalization is a better fit
when the values must lie in a fixed range such as [0, 1].

"""

In [None]:
print(x_train)

[[0.0 1.0 0.0 0.2630675731713538 0.1238147854838185]
 [1.0 0.0 0.0 -0.25350147960148617 0.4617563176278856]
 [0.0 0.0 1.0 -1.9753983221776195 -1.5309334063940294]
 [0.0 0.0 1.0 0.05261351463427101 -1.1114197802841526]
 [1.0 0.0 0.0 1.6405850472322605 1.7202971959575162]
 [0.0 0.0 1.0 -0.08131179534387283 -0.16751412153692966]
 [1.0 0.0 0.0 0.9518263102018072 0.9861483502652316]
 [1.0 0.0 0.0 -0.5978808481167128 -0.48214934111933727]]


In [None]:
print(x_test) # the values below are shown rounded to 4 decimal places

[[0.0 1.0 0.0 -1.4588 -0.9017]
 [0.0 1.0 0.0 1.985 2.1398]]
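The standardization and normalization formulas described in this section can be checked by hand against scikit-learn. A short sketch on a hypothetical feature column (`values`); `MinMaxScaler` is scikit-learn's implementation of normalization and is not otherwise used in this notebook:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

values = np.array([[20.0], [30.0], [40.0], [50.0], [60.0]])  # hypothetical feature column

# Standardization: (x - mean) / std, with the population std (NumPy's default, ddof=0)
standardized = (values - values.mean()) / values.std()

# Normalization: (x - min) / (max - min), which maps the column onto [0, 1]
normalized = (values - values.min()) / (values.max() - values.min())

print(np.allclose(standardized, StandardScaler().fit_transform(values)))  # True
print(np.allclose(normalized, MinMaxScaler().fit_transform(values)))      # True
print(normalized.ravel())  # [0.   0.25 0.5  0.75 1.  ]
```

Standardized values are centered on 0 and are not bounded, while normalized values always fall between 0 and 1 on the data the scaler was fitted on.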


**There is a data preprocessing template that provides these functions ready-made. Hence, if needed, we can refer to that template.**
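The template itself is not shown in this notebook. As a rough idea only, the steps above could be collected into a single helper. Everything in this sketch is an assumption: `preprocess` and its parameters are hypothetical names, the input is assumed to have one categorical column first, numeric columns in the middle and the label last, and `sparse_threshold=0` is added only to force a dense array. Like the notebook, it imputes before splitting.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler

def preprocess(df, test_size=0.2, random_state=0):
    """Hypothetical helper that repeats the notebook's steps on a DataFrame."""
    x = df.iloc[:, :-1].values
    y = LabelEncoder().fit_transform(df.iloc[:, -1].values)

    # Fill missing numeric values with the column mean
    x[:, 1:] = SimpleImputer(strategy="mean").fit_transform(x[:, 1:].astype(float))

    # One-hot encode column 0 and keep the numeric columns
    ct = ColumnTransformer([("encoder", OneHotEncoder(), [0])],
                           remainder="passthrough", sparse_threshold=0)
    x = np.array(ct.fit_transform(x), dtype=float)
    n_dummy = x.shape[1] - (df.shape[1] - 2)  # number of one-hot columns

    x_train, x_test, y_train, y_test = train_test_split(
        x, y, test_size=test_size, random_state=random_state)

    # Scale only the numeric columns, using statistics from the training set
    sc = StandardScaler()
    x_train[:, n_dummy:] = sc.fit_transform(x_train[:, n_dummy:])
    x_test[:, n_dummy:] = sc.transform(x_test[:, n_dummy:])
    return x_train, x_test, y_train, y_test
```

With the dataset used here it could be called as `x_train, x_test, y_train, y_test = preprocess(pd.read_csv('/content/Data.csv'))`.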