# 1. Importing libraries

In [12]:
# Import the required libraries up front (or import each one where it is first needed).
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# 2. Importing Dataset

In [13]:
dataset = pd.read_csv("Data.csv")
print("dataset:\n",dataset.head())

# Separate the features from the label, because ML models expect them as separate inputs.
x = dataset.iloc[:, :-1].values # iloc: integer-location based indexing
y = dataset.iloc[:, -1].values
print("\nx:\n", x)
print("\ny:\n", y)

dataset:
    Country   Age   Salary Purchased
0   France  44.0  72000.0        No
1    Spain  27.0  48000.0       Yes
2  Germany  30.0  54000.0        No
3    Spain  38.0  61000.0        No
4  Germany  40.0      NaN       Yes

x:
 [['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 nan]
 ['France' 35.0 58000.0]
 ['Spain' nan 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]

y:
 ['No' 'Yes' 'No' 'No' 'Yes' 'Yes' 'No' 'Yes' 'No' 'Yes']
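
The iloc slicing used above can be tried on a tiny frame to see exactly which columns land in the features and which in the label. This is a minimal sketch; the three-row frame `toy` is a hypothetical stand-in for Data.csv with the same column layout.

```python
import pandas as pd

# Hypothetical three-row stand-in for Data.csv (same column layout).
toy = pd.DataFrame({
    "Country": ["France", "Spain", "Germany"],
    "Age": [44.0, 27.0, 30.0],
    "Salary": [72000.0, 48000.0, 54000.0],
    "Purchased": ["No", "Yes", "No"],
})

features = toy.iloc[:, :-1].values  # every row, every column except the last
labels = toy.iloc[:, -1].values     # every row, last column only

print(features.shape)  # (3, 3)
print(labels.shape)    # (3,)
print(features.dtype)  # object, because strings and floats are mixed
```

The object dtype is why the printed x above shows quoted strings next to floats in the same array.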


# 3. Taking care of missing values

In [14]:
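from sklearn.impute import SimpleImputer
# Replace each NaN with the mean of its column.
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
# Columns 1 and 2 (Age and Salary) are the numeric columns.
imputer.fit(x[:, 1:3])
x[:, 1:3] = imputer.transform(x[:, 1:3])
x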

array([['France', 44.0, 72000.0],
       ['Spain', 27.0, 48000.0],
       ['Germany', 30.0, 54000.0],
       ['Spain', 38.0, 61000.0],
       ['Germany', 40.0, 63777.77777777778],
       ['France', 35.0, 58000.0],
       ['Spain', 38.77777777777778, 52000.0],
       ['France', 48.0, 79000.0],
       ['Germany', 50.0, 83000.0],
       ['France', 37.0, 67000.0]], dtype=object)
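
To check where the filled-in values 38.78 and 63777.78 come from, the column means can be computed by hand and compared with what the imputer learned. This is a sketch that retypes the Age and Salary columns from the printed dataset instead of reading Data.csv; the names `numeric`, `manual_means` and `demo_imputer` are introduced only for this illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Age and Salary columns copied from the dataset printed above.
age = np.array([44, 27, 30, 38, 40, 35, np.nan, 48, 50, 37], dtype=float)
salary = np.array([72000, 48000, 54000, 61000, np.nan, 58000,
                   52000, 79000, 83000, 67000], dtype=float)
numeric = np.column_stack([age, salary])

# Column means computed by hand, skipping the missing entries.
manual_means = np.nanmean(numeric, axis=0)

demo_imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
filled = demo_imputer.fit_transform(numeric)

print(manual_means)              # mean Age and mean Salary over the known values
print(demo_imputer.statistics_)  # the same means, as learned by fit()
print(filled[6, 0], filled[4, 1])  # the two cells that were NaN
```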

# 4. Encoding Categorical Data

Why do we need to encode categorical data?
- Most ML models work only on numerical data.

Two common methods for encoding categorical features:
1. One-Hot Encoding
2. Label Encoding


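## 4.1. One-Hot Encoding
If the feature column has N categories, one-hot encoding creates N binary columns, one per category, so each category is represented by a binary vector. The cell below passes drop="first", which drops the column of the first category (France) and leaves N-1 columns; France is then represented by all zeros.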

In [15]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
ct = ColumnTransformer(transformers=[('encoder',OneHotEncoder(drop="first"),[0])], remainder="passthrough")
x = ct.fit_transform(x)

# Our ML model expects the training data as a NumPy array.
x = np.array(x)
print(x)

[[0.0 0.0 44.0 72000.0]
 [0.0 1.0 27.0 48000.0]
 [1.0 0.0 30.0 54000.0]
 [0.0 1.0 38.0 61000.0]
 [1.0 0.0 40.0 63777.77777777778]
 [0.0 0.0 35.0 58000.0]
 [0.0 1.0 38.77777777777778 52000.0]
 [0.0 0.0 48.0 79000.0]
 [1.0 0.0 50.0 83000.0]
 [0.0 0.0 37.0 67000.0]]
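
The effect of drop="first" is easier to see on the Country column alone. The sketch below encodes a four-row sample with and without the option; the names `countries`, `full` and `reduced` are made up for this illustration.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

countries = np.array([["France"], ["Spain"], ["Germany"], ["Spain"]])

full = OneHotEncoder()                 # one column per category
reduced = OneHotEncoder(drop="first")  # drops the first category (France)

# fit_transform returns a sparse matrix by default; toarray() makes it dense.
full_matrix = full.fit_transform(countries).toarray()
reduced_matrix = reduced.fit_transform(countries).toarray()

print(full.categories_[0])  # categories in sorted order
print(full_matrix)          # 3 columns: France, Germany, Spain
print(reduced_matrix)       # 2 columns: Germany, Spain
```

Dropping one column loses no information, since a row of all zeros can only mean the dropped category.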


## 4.2. Label Encoding
If we encode the classes as

        France = 1
        Germany = 2
        Spain = 3
The problem with this approach is that our model could interpret 2 > 1 as Germany > France, although the countries have no such order. Hence we used one-hot encoding for the categorical features.

Label encoding can be used where the categories are ordinal, i.e. they have a natural order. Ex. Good, Better, Best; here we can do

    Good = 0
    Better = 1
    Best = 2
    
To demonstrate label encoding, we can apply it to the dependent variable, i.e. 'y', since it has only 2 classes, which get encoded as 0 and 1.

In [16]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y = le.fit_transform(y)
y

array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1])
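
One caveat for ordinal features: LabelEncoder numbers the classes in sorted (alphabetical) order, which need not match the intended ranking. A possible alternative, sketched here under the assumption that the order Good < Better < Best is known in advance, is scikit-learn's OrdinalEncoder with an explicit category list. The `ratings` array is hypothetical.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

ratings = np.array(["Good", "Best", "Better", "Good"])  # hypothetical ordinal feature

# LabelEncoder assigns integers in sorted (alphabetical) order.
le_demo = LabelEncoder()
alphabetical = le_demo.fit_transform(ratings)
print(le_demo.classes_)  # ['Best' 'Better' 'Good']
print(alphabetical)      # [2 0 1 2]

# OrdinalEncoder accepts the intended order explicitly.
oe = OrdinalEncoder(categories=[["Good", "Better", "Best"]])
ordered = oe.fit_transform(ratings.reshape(-1, 1)).ravel()
print(ordered)           # [0. 2. 1. 0.]
```

For the two-class label 'y' above the alphabetical order is harmless: 'No' becomes 0 and 'Yes' becomes 1.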

# 5. Splitting the Dataset into training and testing data

In [17]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=1)

In [18]:
print(x_train)

[[0.0 1.0 38.77777777777778 52000.0]
 [1.0 0.0 40.0 63777.77777777778]
 [0.0 0.0 44.0 72000.0]
 [0.0 1.0 38.0 61000.0]
 [0.0 1.0 27.0 48000.0]
 [0.0 0.0 48.0 79000.0]
 [1.0 0.0 50.0 83000.0]
 [0.0 0.0 35.0 58000.0]]


In [19]:
print(x_test)

[[1.0 0.0 30.0 54000.0]
 [0.0 0.0 37.0 67000.0]]
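
With 10 rows and test_size=0.2, the split puts 8 rows in the training set and 2 in the test set, and a fixed random_state makes the split repeatable. The sketch below checks both points on hypothetical data (`demo_x` and `demo_y` are placeholders, not the notebook's arrays).

```python
import numpy as np
from sklearn.model_selection import train_test_split

demo_x = np.arange(20).reshape(10, 2)  # 10 hypothetical samples, 2 features
demo_y = np.array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1])

split_a = train_test_split(demo_x, demo_y, test_size=0.2, random_state=1)
split_b = train_test_split(demo_x, demo_y, test_size=0.2, random_state=1)

xa_train, xa_test, ya_train, ya_test = split_a
print(xa_train.shape, xa_test.shape)  # (8, 2) (2, 2)

# The same seed yields the same partition on every call.
print(np.array_equal(split_a[1], split_b[1]))  # True
```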


# 6. Feature Scaling

**What does it do?**
- Feature scaling puts all our features onto the same scale.

**Why do we need it?**
- For some models, features with large values can dominate features with small values to the point that the model effectively ignores the smaller ones. Scaling prevents this.

**Do we have to use Feature Scaling with every model?**
- No.
- Feature scaling is essential for machine learning algorithms that calculate distances between data. If the data is not scaled, the feature with a higher value range starts dominating when calculating distances.
- In linear regression we do not need scaling, because each variable is multiplied by a coefficient, and while the coefficients are learned they compensate by taking small values for variables with large values.
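
To make the point about distances concrete, the sketch below measures how much of the squared Euclidean distance between two rows comes from Salary, before and after standardization. It uses the Age and Salary values of the eight complete rows of the dataset; the helper `salary_share` is a hypothetical name introduced for this illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Age and Salary for the eight complete rows of the dataset above.
data = np.array([[44, 72000], [27, 48000], [30, 54000], [38, 61000],
                 [35, 58000], [48, 79000], [50, 83000], [37, 67000]], dtype=float)

def salary_share(a, b):
    """Fraction of the squared Euclidean distance contributed by Salary."""
    squared_diff = (a - b) ** 2
    return squared_diff[1] / squared_diff.sum()

standardized = StandardScaler().fit_transform(data)

# Compare the rows (27, 48000) and (50, 83000).
raw_share = salary_share(data[1], data[6])
scaled_share = salary_share(standardized[1], standardized[6])
print(f"raw: {raw_share:.6f}, scaled: {scaled_share:.6f}")
```

On the raw values nearly all of the distance comes from Salary, so the 23-year age gap is practically invisible; after scaling the two features contribute in roughly equal parts.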

**When should we use feature scaling? Before or after the train_test_split()?**
- After splitting the dataset.
- The test set is supposed to represent new, unseen data, so the scaler must be fitted on the training set only and then applied to the test set.
- Fitting the scaler on the test data leaks information about it into training, which can lead to overfitting and an overly optimistic evaluation.


### Standardization
$$ X_{standard} = \frac{X - Mean(X)}{Standard~Deviation(X)} $$
* After standardization each feature has mean 0 and standard deviation 1; most values fall roughly between -3 and +3.
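
The standardization formula can be applied by hand and compared with StandardScaler. This is a minimal sketch on hypothetical age values; note that StandardScaler divides by the population standard deviation (ddof=0), which is also NumPy's default, and that the test values are transformed with the training mean and standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train = np.array([[27.0], [38.0], [44.0], [48.0], [50.0]])  # hypothetical training ages
test = np.array([[30.0], [37.0]])                           # hypothetical test ages

mean = train.mean(axis=0)
std = train.std(axis=0)  # population standard deviation (ddof=0)

manual_train = (train - mean) / std
manual_test = (test - mean) / std  # test data reuses the training statistics

sc_demo = StandardScaler().fit(train)
print(np.allclose(manual_train, sc_demo.transform(train)))  # True
print(np.allclose(manual_test, sc_demo.transform(test)))    # True
print(manual_train.mean(), manual_train.std())  # approximately 0 and 1
```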

### Normalization
$$ X_{Normal} = \frac{X-Min(X)}{Max(X)-Min(X)} $$
* All the features will take values between 0 and 1.
* Normalization does not assume any particular distribution, but it is sensitive to outliers because it depends on the minimum and maximum values.
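
The notebook itself uses standardization, but the normalization formula can be checked in the same way against scikit-learn's MinMaxScaler. The sketch below uses a few salary values from the dataset plus one hypothetical outlier of 1,000,000 to show how a single extreme value affects the result.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

salaries = np.array([[48000.0], [54000.0], [61000.0], [72000.0], [83000.0]])

col_min = salaries.min(axis=0)
col_max = salaries.max(axis=0)
manual = (salaries - col_min) / (col_max - col_min)
normalized = MinMaxScaler().fit_transform(salaries)

print(np.allclose(manual, normalized))  # True
print(normalized.ravel())

# A single hypothetical outlier compresses the remaining values toward 0.
with_outlier = np.vstack([salaries, [[1_000_000.0]]])
squeezed = MinMaxScaler().fit_transform(with_outlier).ravel()
print(squeezed)
```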

In [20]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
x_train[:, 2:] = sc.fit_transform(x_train[:, 2:])
x_test[:, 2:] = sc.transform(x_test[:, 2:])
print(x_train)


[[0.0 1.0 -0.19159184384578545 -1.0781259408412425]
 [1.0 0.0 -0.014117293757057777 -0.07013167641635372]
 [0.0 0.0 0.566708506533324 0.633562432710455]
 [0.0 1.0 -0.30453019390224867 -0.30786617274297867]
 [0.0 1.0 -1.9018011447007988 -1.420463615551582]
 [0.0 0.0 1.1475343068237058 1.232653363453549]
 [1.0 0.0 1.4379472069688968 1.5749910381638885]
 [0.0 0.0 -0.7401495441200351 -0.5646194287757332]]


In [21]:
print(x_test)

[[1.0 0.0 -1.4661817944830124 -0.9069571034860727]
 [0.0 0.0 -0.44973664397484414 0.2056403393225306]]


Now all our values are on the same scale.

Congrats 🎉🎊 your data is ready for model training.