### Data Preprocessing Tools

#### Importing the libraries

In [1]:
import numpy as np 
import matplotlib.pyplot as plt 
import pandas as pd 

#### Importing the dataset

In [2]:
dataset = pd.read_csv('Data.csv')

"""
Usually:
y = dependent variable = last column ---> the value we want to predict
x = independent variables = all other columns ---> the features
"""

# iloc = integer-location indexing --> [rows, columns]
x = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values

In [3]:
print(x)

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 nan]
 ['France' 35.0 58000.0]
 ['Spain' nan 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]


In [4]:
print(y)

['No' 'Yes' 'No' 'No' 'Yes' 'Yes' 'No' 'Yes' 'No' 'Yes']
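
The slicing pattern used above can be tried on a small, made-up frame. The column names and values below are hypothetical and only illustrate how `iloc[:, :-1]` and `iloc[:, -1]` separate features from the target.

```python
import pandas as pd

# Hypothetical toy frame; the last column plays the role of the target.
toy = pd.DataFrame({
    "Country": ["France", "Spain", "Germany"],
    "Age": [44.0, 27.0, 30.0],
    "Purchased": ["No", "Yes", "No"],
})

# iloc selects by integer position: [rows, columns].
features = toy.iloc[:, :-1].values   # every row, every column except the last
target = toy.iloc[:, -1].values      # every row, only the last column

print(features.shape)
print(target.shape)
```

This assumes the target is stored in the last column; if it is not, the column positions need to be adjusted.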


#### Taking care of missing data
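1. Ignore (drop) the rows with missing values --> reasonable only when they make up about 1% of the data or less
2. Replace each missing value with a plausible value --> here, the column average (mean)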

In [5]:
# Option 2: replace missing values with the column mean
"""
The .fit() method in SimpleImputer calculates and stores the statistics (like mean, median, etc., depending on the strategy) needed to fill in missing values based on the provided data. 
It "learns" the values that will be used for imputation when the .transform() method is called later.
The .transform() method modifies the data based on the transformation learned from the .fit() method. It is used to apply the changes, such as replacing missing values, scaling features, or reducing dimensions, to both training and new datasets.
"""
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values = np.nan, strategy='mean')
imputer.fit(x[:, 1:3])
x[:, 1:3] = imputer.transform(x[:, 1:3])

In [6]:
print(x)

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 63777.77777777778]
 ['France' 35.0 58000.0]
 ['Spain' 38.77777777777778 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]
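
To see what `.fit()` actually stores, the sketch below refits an imputer on just the Age and Salary values printed earlier and inspects its `statistics_` attribute. The array name `numeric` is introduced here only for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Age and Salary columns as printed above, including the two missing entries.
numeric = np.array([
    [44.0, 72000.0], [27.0, 48000.0], [30.0, 54000.0], [38.0, 61000.0],
    [40.0, np.nan], [35.0, 58000.0], [np.nan, 52000.0], [48.0, 79000.0],
    [50.0, 83000.0], [37.0, 67000.0],
])

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
imputer.fit(numeric)                  # learns one mean per column, ignoring NaN
print(imputer.statistics_)            # the learned column means

filled = imputer.transform(numeric)   # replaces each NaN with its column mean
print(np.isnan(filled).any())
```

The learned means are the same values that appear in the filled-in rows of the printed output.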


#### Encoding categorical data
1. <b>Label Encoding</b>: Each unique category is assigned a unique integer. For example, if a column contains the values 'Red', 'Blue', and 'Green', label encoding would convert them to 0, 1, and 2, respectively.
    - ordinal data
        - Categories have a meaningful order
    - tree-based algorithms 
        - Decision Trees
        - Random Forests
        - XGBoost
<br><br>

2. <b>One-Hot Encoding</b>: Each unique category is represented by a binary vector where only one element is 1 (indicating the presence of the category), and all others are 0. For example, 'Red', 'Blue', and 'Green' would become [1,0,0], [0,1,0], and [0,0,1].
    - nominal data
        - Categories do not have any natural order

    - models that treat inputs as numeric magnitudes
        - linear regression
        - logistic regression
        - neural networks
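
The two encodings can be compared side by side in a minimal sketch. `OrdinalEncoder` and `pd.get_dummies` are not used elsewhere in this notebook; they are chosen here only because they make the difference easy to see, and the sample values are made up.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Ordinal feature: the order Low < Medium < High carries meaning,
# so the category order is passed explicitly.
sizes = np.array([["Low"], ["High"], ["Medium"], ["Low"]])
ordinal = OrdinalEncoder(categories=[["Low", "Medium", "High"]])
size_codes = ordinal.fit_transform(sizes).ravel()
print(size_codes)

# Nominal feature: no natural order, so each category gets its own 0/1 column.
colors = pd.Series(["Red", "Blue", "Green", "Red"])
one_hot = pd.get_dummies(colors, dtype=int)
print(one_hot)
```

Integer codes imply an order and a distance between categories, which is why they fit ordinal data, while one-hot columns avoid implying any order.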

##### Encoding the Independent Variable
--> one hot encoding

In [7]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])], remainder='passthrough')
x = np.array(ct.fit_transform(x))
print(x)

[[1.0 0.0 0.0 44.0 72000.0]
 [0.0 0.0 1.0 27.0 48000.0]
 [0.0 1.0 0.0 30.0 54000.0]
 [0.0 0.0 1.0 38.0 61000.0]
 [0.0 1.0 0.0 40.0 63777.77777777778]
 [1.0 0.0 0.0 35.0 58000.0]
 [0.0 0.0 1.0 38.77777777777778 52000.0]
 [1.0 0.0 0.0 48.0 79000.0]
 [0.0 1.0 0.0 50.0 83000.0]
 [1.0 0.0 0.0 37.0 67000.0]]
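
The order of the three dummy columns follows the sorted category names. A small sketch, using three of the rows printed earlier, reads that order back from the fitted transformer. The variable names `rows`, `encoded`, and `categories` are introduced here for illustration.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

rows = np.array([
    ["France", 44.0, 72000.0],
    ["Spain", 27.0, 48000.0],
    ["Germany", 30.0, 54000.0],
], dtype=object)

ct = ColumnTransformer(
    transformers=[("encoder", OneHotEncoder(), [0])],
    remainder="passthrough",
)
encoded = np.array(ct.fit_transform(rows))

# The fitted encoder keeps its categories in sorted order;
# dummy column i corresponds to categories[i].
categories = ct.named_transformers_["encoder"].categories_[0]
print(categories)
print(encoded[0])
```

Here the first dummy column stands for France, the second for Germany, and the third for Spain, which matches the printed output.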


##### Encoding the Dependent Variable
--> label encoding

In [8]:
from sklearn.preprocessing import LabelEncoder
# y is a single column, so LabelEncoder needs no arguments
le = LabelEncoder()
# fit_transform already returns a NumPy array, so no np.array() conversion is needed
y = le.fit_transform(y)
print(y)

[0 1 0 0 1 1 0 1 0 1]
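
To check which integer stands for which label, the fitted encoder can be inspected. This is a minimal sketch on a shortened copy of the labels; `labels` and `codes` are names introduced here.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

labels = np.array(["No", "Yes", "No", "No", "Yes"])
le = LabelEncoder()
codes = le.fit_transform(labels)

print(le.classes_)                  # position in this array = assigned integer
print(codes)
print(le.inverse_transform(codes))  # maps the integers back to the labels
```

`classes_` is sorted, so 'No' becomes 0 and 'Yes' becomes 1, in line with the printed output.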


#### Splitting the dataset into the Training set and Test set

In [9]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state = 1)

In [10]:
print(x_train)

[[0.0 0.0 1.0 38.77777777777778 52000.0]
 [0.0 1.0 0.0 40.0 63777.77777777778]
 [1.0 0.0 0.0 44.0 72000.0]
 [0.0 0.0 1.0 38.0 61000.0]
 [0.0 0.0 1.0 27.0 48000.0]
 [1.0 0.0 0.0 48.0 79000.0]
 [0.0 1.0 0.0 50.0 83000.0]
 [1.0 0.0 0.0 35.0 58000.0]]


In [11]:
print(x_test)

[[0.0 1.0 0.0 30.0 54000.0]
 [1.0 0.0 0.0 37.0 67000.0]]


In [12]:
print(y_train)

[0 1 0 0 1 1 0 1]


In [13]:
print(y_test)

[0 1]
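
The effect of `test_size` and `random_state` can be checked on stand-in data. The arrays below are made up and only mirror the shape of the dataset (10 samples).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data: 10 samples with 2 features each, plus 10 labels.
features = np.arange(20).reshape(10, 2)
labels = np.array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1])

split_a = train_test_split(features, labels, test_size=0.2, random_state=1)
split_b = train_test_split(features, labels, test_size=0.2, random_state=1)

x_tr, x_te, y_tr, y_te = split_a
print(x_tr.shape, x_te.shape)

# Same seed, same shuffle: the second call returns an identical split.
print(all(np.array_equal(a, b) for a, b in zip(split_a, split_b)))
```

With 10 samples and `test_size=0.2`, 8 rows go to the training set and 2 to the test set; fixing `random_state` makes the split reproducible.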


You should split the dataset (into training and test sets) before applying feature scaling. This prevents data leakage: the scaling parameters (such as the mean or range) are then computed from the training set only, so the test data stays unseen during training. The same reasoning applies to any step that learns statistics from the data, such as the mean imputation above.
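
A rough sketch of the difference, on a made-up single feature: one scaler is fitted on all rows (leaky), the other on the training rows only. The names `leaky` and `clean` are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Made-up single feature with 8 samples.
values = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0], [7.0], [100.0]])
train, test = train_test_split(values, test_size=0.25, random_state=0)

leaky = StandardScaler().fit(values)   # statistics computed from train AND test rows
clean = StandardScaler().fit(train)    # statistics computed from training rows only

print(leaky.mean_, clean.mean_)

# The test rows are transformed with the training statistics, never refitted.
train_scaled = clean.transform(train)
test_scaled = clean.transform(test)
```

Only the `clean` scaler keeps information about the test rows out of the preprocessing step.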

#### Feature Scaling
<b>Features ---> Independent variables </b>

Particularly needed for algorithms that rely on distance-based metrics or gradient (slope)-based optimization

1. <b>Standardization</b> = (x - mean) / standard deviation
    - works best when the data is roughly Gaussian (normally distributed)
    - the output range is not bounded, though most values typically lie between -3 and +3
    <br>
    
2. <b>Normalization</b> = (x - min) / (max - min)
    - useful when the data is not Gaussian (non-normal distribution)
    - fixed range, [0, 1] with this formula
    <br>

<br>
<b>Gaussian (Normal Distribution)</b>: a bell-shaped distribution that is perfectly symmetric around its mean
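
Both formulas can be written out by hand and compared with scikit-learn's scalers. This is a minimal sketch on a made-up age column; `MinMaxScaler` is not used elsewhere in this notebook and appears here only as the library counterpart of the normalization formula.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

ages = np.array([[27.0], [30.0], [35.0], [38.0], [44.0], [50.0]])

# Standardization: subtract the mean, divide by the (population) standard deviation.
standardized = (ages - ages.mean()) / ages.std()

# Normalization (min-max): map the smallest value to 0 and the largest to 1.
normalized = (ages - ages.min()) / (ages.max() - ages.min())

# Each check prints True when the manual formula matches the scaler.
print(np.allclose(standardized, StandardScaler().fit_transform(ages)))
print(np.allclose(normalized, MinMaxScaler().fit_transform(ages)))
```

Note that `StandardScaler` divides by the population standard deviation, which is also the default of `np.std`.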


In [14]:
from sklearn.preprocessing import StandardScaler

"""
Standardization is not suitable for dummy variables (or indicator variables) because dummy variables are binary by nature (i.e., they take values of 0 or 1). 
Standardization transforms data by centering it around a mean of 0 and scaling it to have a standard deviation of 1, which is designed for continuous variables. This process doesn't make sense for binary data.

- Centering: Subtracts the mean from each data point so that the dataset has a mean of 0.

- Scaling: After centering the data, we scale it by dividing each data point by the standard deviation so that the dataset has a standard deviation of 1.
"""

sc = StandardScaler()

# Use [:, 3:] rather than [:, 1:]: one-hot encoding turned the single country column into 3 dummy columns, so the numeric features now start at index 3.
x_train[:, 3:] = sc.fit_transform(x_train[:, 3:])
x_test[:, 3:] = sc.transform(x_test[:, 3:])

In [15]:
print(x_train)

[[0.0 0.0 1.0 -0.19159184384578545 -1.0781259408412425]
 [0.0 1.0 0.0 -0.014117293757057777 -0.07013167641635372]
 [1.0 0.0 0.0 0.566708506533324 0.633562432710455]
 [0.0 0.0 1.0 -0.30453019390224867 -0.30786617274297867]
 [0.0 0.0 1.0 -1.9018011447007988 -1.420463615551582]
 [1.0 0.0 0.0 1.1475343068237058 1.232653363453549]
 [0.0 1.0 0.0 1.4379472069688968 1.5749910381638885]
 [1.0 0.0 0.0 -0.7401495441200351 -0.5646194287757332]]


In [16]:
print(x_test)

[[0.0 1.0 0.0 -1.4661817944830124 -0.9069571034860727]
 [1.0 0.0 0.0 -0.44973664397484414 0.2056403393225306]]
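
As a sanity check, the sketch below repeats the scaling on just the Age and Salary values printed for the training and test sets and confirms the expected properties. The array names are introduced here for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Age and Salary columns of x_train and x_test as printed before scaling.
train_numeric = np.array([
    [38.77777777777778, 52000.0], [40.0, 63777.77777777778],
    [44.0, 72000.0], [38.0, 61000.0], [27.0, 48000.0],
    [48.0, 79000.0], [50.0, 83000.0], [35.0, 58000.0],
])
test_numeric = np.array([[30.0, 54000.0], [37.0, 67000.0]])

sc = StandardScaler()
train_scaled = sc.fit_transform(train_numeric)  # learn mean/std from train, then scale
test_scaled = sc.transform(test_numeric)        # reuse the training mean/std

print(train_scaled.mean(axis=0))  # close to 0 for each column
print(train_scaled.std(axis=0))   # close to 1 for each column
print(test_scaled.mean(axis=0))   # generally not 0: test uses training statistics

# inverse_transform undoes the scaling and recovers the original values.
restored = sc.inverse_transform(test_scaled)
print(restored)
```

The scaled training columns have mean 0 and standard deviation 1, while the test columns do not, because they are transformed with the statistics learned from the training set.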
