### Pre-processing steps

- Importing the libraries
- Importing the dataset
- Taking care of missing data
- Encoding categorical data
- Normalizing the data
- Splitting the data into test and train

### Step 1: Importing the libraries


- NumPy: a library for working with arrays. Most machine learning models take arrays as input, so NumPy makes the data easier to handle.
- matplotlib: a library for plotting graphs and charts, which are useful for presenting the results of your model.
- pandas: a library for importing the dataset and working with it as a table (a DataFrame), from which we later extract the matrix of features (the independent variables) and the dependent variable.

In [19]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

### Step 2: Importing the dataset

In [18]:
df = pd.read_csv(r"C:\Users\nikip\Documents\2022\TCE\DataPreprocessing-main\DataPreprocessing-main\Data.csv")
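If the CSV file is not at hand, the same table can be rebuilt in memory. The sketch below is a reconstruction, assuming the ten rows printed in the outputs of this notebook, with the two imputed cells put back as missing values. It is not the original file, and `csv_text` is a name introduced only for this illustration.

```python
import io
import pandas as pd

# Reconstructed from the outputs shown in this notebook; the empty
# fields stand for the two missing values (one Age, one Salary).
csv_text = """Country,Age,Salary,Purchased
France,44,72000,No
Spain,27,48000,Yes
Germany,30,54000,No
Spain,38,61000,No
Germany,40,,Yes
France,35,58000,Yes
Spain,,52000,No
France,48,79000,Yes
Germany,50,83000,No
France,37,67000,Yes
"""

# read_csv accepts any file-like object, so a string buffer works too
df = pd.read_csv(io.StringIO(csv_text))
print(df.shape)
print(df.isna().sum())
```

Reading from a string buffer keeps the rest of the notebook unchanged, since `df` ends up with the same four columns.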

### Step 2 (continued): Understanding the data

In [20]:
df.head()

Unnamed: 0,Country,Age,Salary,Purchased
0,France,44.0,72000.0,No
1,Spain,27.0,48000.0,Yes
2,Germany,30.0,54000.0,No
3,Spain,38.0,61000.0,No
4,Germany,40.0,,Yes


In [21]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Country    10 non-null     object 
 1   Age        9 non-null      float64
 2   Salary     9 non-null      float64
 3   Purchased  10 non-null     object 
dtypes: float64(2), object(2)
memory usage: 448.0+ bytes


In [22]:
df.isna().sum()

Country      0
Age          1
Salary       1
Purchased    0
dtype: int64

In [23]:
df['Country'].value_counts()

France     4
Spain      3
Germany    3
Name: Country, dtype: int64

In [24]:
df['Country'].unique()

array(['France', 'Spain', 'Germany'], dtype=object)

### Step 3: Handling the missing values

- We will use the SimpleImputer class from the scikit-learn library to replace each missing value with the mean of its column.

In [25]:
from sklearn.impute import SimpleImputer

# 'np.nan' signifies that we are targeting missing values
# and the strategy we are choosing is replacing it with 'mean'
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')

imputer.fit(df.iloc[:, 1:3])
df.iloc[:, 1:3] = imputer.transform(df.iloc[:, 1:3])  

# print the dataset
df

Unnamed: 0,Country,Age,Salary,Purchased
0,France,44.0,72000.0,No
1,Spain,27.0,48000.0,Yes
2,Germany,30.0,54000.0,No
3,Spain,38.0,61000.0,No
4,Germany,40.0,63777.777778,Yes
5,France,35.0,58000.0,Yes
6,Spain,38.777778,52000.0,No
7,France,48.0,79000.0,Yes
8,Germany,50.0,83000.0,No
9,France,37.0,67000.0,Yes
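
The two filled-in values (38.777778 for Age and 63777.777778 for Salary) are simply the means of the non-missing entries in each column. A minimal NumPy sketch of the same arithmetic, assuming the column values shown above; the variable names are illustrative only:

```python
import numpy as np

# Age and Salary as they were before imputation (np.nan marks the gaps)
age = np.array([44, 27, 30, 38, 40, 35, np.nan, 48, 50, 37])
salary = np.array([72000, 48000, 54000, 61000, np.nan,
                   58000, 52000, 79000, 83000, 67000])

# nanmean ignores missing entries when averaging
age_mean = np.nanmean(age)
salary_mean = np.nanmean(salary)

# Replace each missing entry with its column mean
age_filled = np.where(np.isnan(age), age_mean, age)
salary_filled = np.where(np.isnan(salary), salary_mean, salary)

print(round(age_mean, 6), round(salary_mean, 6))
```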


### Step 4: Encoding categorical data

- In our case, we have two categorical columns: the Country column and the Purchased column.
- One-hot encoding turns the Country column into three separate columns, one per country, each containing only 0s and 1s.
- Each country therefore gets a unique vector, and the model cannot read a false numeric order into the countries (as it could if they were coded 0, 1, 2).

In [27]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])], remainder='passthrough')
# [0] is the index of the column we are applying the encoding to

df = pd.DataFrame(ct.fit_transform(df))
df

Unnamed: 0,0,1,2,3,4,5
0,1.0,0.0,0.0,44.0,72000.0,No
1,0.0,0.0,1.0,27.0,48000.0,Yes
2,0.0,1.0,0.0,30.0,54000.0,No
3,0.0,0.0,1.0,38.0,61000.0,No
4,0.0,1.0,0.0,40.0,63777.777778,Yes
5,1.0,0.0,0.0,35.0,58000.0,Yes
6,0.0,0.0,1.0,38.777778,52000.0,No
7,1.0,0.0,0.0,48.0,79000.0,Yes
8,0.0,1.0,0.0,50.0,83000.0,No
9,1.0,0.0,0.0,37.0,67000.0,Yes
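
To see where the three new columns come from, here is a rough NumPy sketch of one-hot encoding on the first five Country values. It is not how OneHotEncoder is implemented, only an illustration that reproduces the same vectors; it assumes the categories are taken in sorted order (France, Germany, Spain), which matches the output above.

```python
import numpy as np

countries = np.array(["France", "Spain", "Germany", "Spain", "Germany"])

# Sorted unique categories, plus the category index of every row
categories, codes = np.unique(countries, return_inverse=True)

# Row i of the identity matrix is the one-hot vector for category i
one_hot = np.eye(len(categories))[codes]

print(categories)
print(one_hot)
```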


- The last column, Purchased, is binary: it has only two outcomes, Yes or No. Label encoding is enough here, mapping No to 0 and Yes to 1.

In [28]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df.iloc[:,-1] = le.fit_transform(df.iloc[:,-1])
# 'df.iloc[:,-1]' is used to select the column that we need to be encoded
df

Unnamed: 0,0,1,2,3,4,5
0,1.0,0.0,0.0,44.0,72000.0,0
1,0.0,0.0,1.0,27.0,48000.0,1
2,0.0,1.0,0.0,30.0,54000.0,0
3,0.0,0.0,1.0,38.0,61000.0,0
4,0.0,1.0,0.0,40.0,63777.777778,1
5,1.0,0.0,0.0,35.0,58000.0,1
6,0.0,0.0,1.0,38.777778,52000.0,0
7,1.0,0.0,0.0,48.0,79000.0,1
8,0.0,1.0,0.0,50.0,83000.0,0
9,1.0,0.0,0.0,37.0,67000.0,1
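
Label encoding amounts to numbering the distinct labels in sorted order. A plain-Python sketch of that idea on the first five Purchased values; `mapping` and `encoded` are names made up for this illustration:

```python
labels = ["No", "Yes", "No", "No", "Yes"]

# Sort the distinct labels, then number them from 0
classes = sorted(set(labels))
mapping = {label: code for code, label in enumerate(classes)}

# Look up the integer code for every label
encoded = [mapping[label] for label in labels]

print(mapping)
print(encoded)
```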


### Step 5: Normalizing the dataset

- Feature scaling brings all of the features in the dataset onto the same scale.
- This matters when training a machine learning model because, for some algorithms, features with large numeric ranges (such as Salary) can dominate features with small ranges (such as Age), so the smaller ones are effectively ignored.
- Min-max normalization rescales each column to the range 0 to 1, so no feature dominates purely because of its units.

In [29]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()

df = pd.DataFrame(scaler.fit_transform(df))
df

Unnamed: 0,0,1,2,3,4,5
0,1.0,0.0,0.0,0.73913,0.685714,0.0
1,0.0,0.0,1.0,0.0,0.0,1.0
2,0.0,1.0,0.0,0.130435,0.171429,0.0
3,0.0,0.0,1.0,0.478261,0.371429,0.0
4,0.0,1.0,0.0,0.565217,0.450794,1.0
5,1.0,0.0,0.0,0.347826,0.285714,1.0
6,0.0,0.0,1.0,0.512077,0.114286,0.0
7,1.0,0.0,0.0,0.913043,0.885714,1.0
8,0.0,1.0,0.0,1.0,1.0,0.0
9,1.0,0.0,0.0,0.434783,0.542857,1.0
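
Min-max scaling maps each value x in a column to (x - min) / (max - min), where min and max are taken over that column. The sketch below applies the formula by hand to a few Age values, assuming a sample that includes the column's minimum (27) and maximum (50) so the results line up with the table above.

```python
import numpy as np

# A few Age values, including the column minimum (27) and maximum (50)
age = np.array([44.0, 27.0, 30.0, 50.0])

# (x - min) / (max - min), applied element-wise
scaled = (age - age.min()) / (age.max() - age.min())

print(scaled)
```

For example, an age of 44 becomes (44 - 27) / (50 - 27) = 17 / 23, which is about 0.73913.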


### Step 6: Splitting the dataset

- In machine learning, the larger part of the dataset is used to train the model, and a smaller part is held back to test the trained model and estimate how well it performs on data it has not seen.

In [30]:
X = df.iloc[:, :-1].values
y = df.iloc[:, -1].values

# .values converts the data into NumPy arrays
print("Independent Variable\n")
print(X)
print("\nDependent Variable\n")
print(y)

Independent Variable

[[1.         0.         0.         0.73913043 0.68571429]
 [0.         0.         1.         0.         0.        ]
 [0.         1.         0.         0.13043478 0.17142857]
 [0.         0.         1.         0.47826087 0.37142857]
 [0.         1.         0.         0.56521739 0.45079365]
 [1.         0.         0.         0.34782609 0.28571429]
 [0.         0.         1.         0.51207729 0.11428571]
 [1.         0.         0.         0.91304348 0.88571429]
 [0.         1.         0.         1.         1.        ]
 [1.         0.         0.         0.43478261 0.54285714]]

Dependent Variable

[0. 1. 0. 0. 1. 1. 0. 1. 0. 1.]


In [31]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
# 'test_size=0.2' means 20% test data and 80% train data
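
Conceptually, the split shuffles the row indices and cuts them into two groups. The NumPy sketch below illustrates that idea only; `split_indices` is a hypothetical helper, and rounding the test count up is an assumption of this sketch. Because the shuffle is random, the rows that land in each set change from run to run unless a seed is fixed (with train_test_split, the `random_state` argument plays that role).

```python
import math
import numpy as np

def split_indices(n_samples, test_size=0.2, seed=0):
    """Return (train_idx, test_idx) after shuffling the row indices."""
    rng = np.random.default_rng(seed)        # fixed seed: same split every run
    shuffled = rng.permutation(n_samples)    # random order of 0..n_samples-1
    n_test = math.ceil(test_size * n_samples)
    return shuffled[n_test:], shuffled[:n_test]

train_idx, test_idx = split_indices(10, test_size=0.2, seed=0)
print(len(train_idx), len(test_idx))
```

With 10 rows and a test size of 0.2, this gives 8 training rows and 2 test rows, the same proportions as the split above.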

In [32]:
print(X_train)

[[0.         0.         1.         0.47826087 0.37142857]
 [1.         0.         0.         0.91304348 0.88571429]
 [0.         1.         0.         1.         1.        ]
 [0.         1.         0.         0.56521739 0.45079365]
 [1.         0.         0.         0.43478261 0.54285714]
 [1.         0.         0.         0.73913043 0.68571429]
 [0.         1.         0.         0.13043478 0.17142857]
 [1.         0.         0.         0.34782609 0.28571429]]


In [33]:
print(X_test)

[[0.         0.         1.         0.         0.        ]
 [0.         0.         1.         0.51207729 0.11428571]]
