## Machine Learning Day 1: Data Preprocessing

### Importing the libraries

In [3]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

### Importing the dataset

In [5]:
dataset = pd.read_csv('../datasets/data.csv')
X = dataset.iloc[:,:-1].values
y = dataset.iloc[:,-1:].values

In [6]:
print(dataset)

   Country   Age   Salary Purchased
0   France  44.0  72000.0        No
1    Spain  27.0  48000.0       Yes
2  Germany  30.0  54000.0        No
3    Spain  38.0  61000.0        No
4  Germany  40.0      NaN       Yes
5   France  35.0  58000.0       Yes
6    Spain   NaN  52000.0        No
7   France  48.0  79000.0       Yes
8  Germany  50.0  83000.0        No
9   France  37.0  67000.0       Yes


In [7]:
print(X,"\n",y)

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 nan]
 ['France' 35.0 58000.0]
 ['Spain' nan 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]] 
 [['No']
 ['Yes']
 ['No']
 ['No']
 ['Yes']
 ['Yes']
 ['No']
 ['Yes']
 ['No']
 ['Yes']]


### Taking care of missing data

#### 1. `fillna()` – Fill Missing Values with Specific Logic

##### 1.1. Fill with a constant value

```python
df.fillna(0)
df.fillna("Unknown")
```

##### 1.2. Fill with column statistics

```python
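# Assign the result back; calling fillna(..., inplace=True) on a selected
# column is chained assignment and may not update df in recent pandas.
df['Age'] = df['Age'].fillna(df['Age'].mean())
df['Salary'] = df['Salary'].fillna(df['Salary'].median())
df['Gender'] = df['Gender'].fillna(df['Gender'].mode()[0])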
```
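As a small illustration, the sketch below applies statistic-based filling to the `Age` and `Salary` values printed earlier. The DataFrame is rebuilt by hand here (an assumption, so the snippet runs without `data.csv`), and the dictionary form of `fillna` is used to fill each column with its own statistic.

```python
import numpy as np
import pandas as pd

# Age and Salary columns copied from the dataset printed above.
df = pd.DataFrame({
    "Age": [44, 27, 30, 38, 40, 35, np.nan, 48, 50, 37],
    "Salary": [72000, 48000, 54000, 61000, np.nan,
               58000, 52000, 79000, 83000, 67000],
})

# A dict maps each column name to its own fill value.
filled = df.fillna({"Age": df["Age"].mean(),
                    "Salary": df["Salary"].median()})

print(round(float(filled.loc[6, "Age"]), 2))  # 38.78
print(float(filled.loc[4, "Salary"]))         # 61000.0
```

The mean is sensitive to outliers, so the median is often the safer choice for skewed columns such as salaries.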

##### 1.3. Fill using forward fill (propagate previous values)

```python
df.ffill()  # fillna(method='ffill') is deprecated since pandas 2.1
```

##### 1.4. Fill using backward fill (propagate next values)

```python
df.bfill()  # fillna(method='bfill') is deprecated since pandas 2.1
```

##### 1.5. Use limit to restrict fills

```python
df.ffill(limit=1)  # fill at most one consecutive NaN per gap
```
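To make the difference between the directional fills concrete, here is a minimal sketch on a made-up Series (the values are hypothetical, not from the dataset) with a run of two consecutive gaps.

```python
import numpy as np
import pandas as pd

# Hypothetical toy series with two consecutive missing values.
s = pd.Series([1.0, np.nan, np.nan, 4.0])

filled_forward = s.ffill()         # carry the last valid value forward
filled_backward = s.bfill()        # pull the next valid value backward
filled_limited = s.ffill(limit=1)  # fill only the first NaN of each gap

print(filled_forward.tolist())   # [1.0, 1.0, 1.0, 4.0]
print(filled_backward.tolist())  # [1.0, 4.0, 4.0, 4.0]
print(filled_limited.tolist())   # [1.0, 1.0, nan, 4.0]
```

Directional fills assume that row order is meaningful (for example, time-ordered data); on shuffled rows they copy values between unrelated records.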

---

#### 2. `interpolate()` – Estimate Missing Values

##### 2.1. Linear Interpolation (default)

```python
df.interpolate()
```

##### 2.2. Polynomial Interpolation

```python
df.interpolate(method='polynomial', order=2)
```

##### 2.3. Time-based Interpolation (if datetime index)

```python
df.interpolate(method='time')
```
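The sketch below contrasts the default linear method with `method='time'` on made-up values. Note that `method='polynomial'` additionally requires SciPy to be installed, so it is left out of this example.

```python
import numpy as np
import pandas as pd

# Linear interpolation treats rows as equally spaced.
s = pd.Series([10.0, np.nan, np.nan, 40.0])
linear = s.interpolate()
print(linear.tolist())  # [10.0, 20.0, 30.0, 40.0]

# Hypothetical series with unevenly spaced dates.
ts = pd.Series(
    [0.0, np.nan, 4.0],
    index=pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-05"]),
)
by_position = ts.interpolate()           # ignores the dates: midpoint
by_time = ts.interpolate(method="time")  # weights by elapsed time

print(round(float(by_position.iloc[1]), 6))  # 2.0
print(round(float(by_time.iloc[1]), 6))      # 1.0
```

With `method='time'`, the gap is one day into a four-day span, so the estimate is a quarter of the way from 0 to 4.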

---

#### 3. `dropna()` – Remove Missing Data

##### 3.1. Drop rows with any NaN

```python
df.dropna(axis=0)
```

##### 3.2. Drop columns with any NaN

```python
df.dropna(axis=1)
```

##### 3.3. Drop only if all values in row/column are NaN

```python
df.dropna(how='all')
```

##### 3.4. Drop based on specific column(s)

```python
df.dropna(subset=['Age', 'Salary'])
```
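A quick way to compare the `dropna()` variants is to look at the shape each one leaves behind. The frame below is a hypothetical four-row example, not the tutorial dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: row 0 is complete, row 3 is entirely missing.
df = pd.DataFrame({
    "Age": [44.0, np.nan, 30.0, np.nan],
    "Salary": [72000.0, 48000.0, np.nan, np.nan],
    "Country": ["France", "Spain", "Germany", None],
})

rows_any = df.dropna(axis=0)         # keep only fully populated rows
cols_any = df.dropna(axis=1)         # drop every column containing a NaN
rows_all = df.dropna(how="all")      # drop rows where everything is NaN
by_age = df.dropna(subset=["Age"])   # only look at the Age column

print(rows_any.shape)  # (1, 3)
print(cols_any.shape)  # (4, 0)
print(rows_all.shape)  # (3, 3)
print(by_age.shape)    # (2, 3)
```

None of these calls modify `df`; each returns a new DataFrame.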

---

#### 4. `replace()` – Replace Specific Values

```python
df.replace(to_replace=np.nan, value=0)
df.replace(-999, np.nan)
```
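Replacing a sentinel with `np.nan` matters because statistics would otherwise treat it as a real measurement. A minimal sketch, assuming `-999` is the placeholder used for missing entries in a hypothetical frame:

```python
import numpy as np
import pandas as pd

# Hypothetical data where -999 marks a missing entry.
df = pd.DataFrame({"Age": [44, -999, 30], "Salary": [72000, 48000, -999]})

cleaned = df.replace(-999, np.nan)

print(float(df["Age"].mean()))            # -308.3333333333333 (sentinel included)
print(float(cleaned["Age"].mean()))       # 37.0 (NaN is skipped)
print(int(cleaned["Age"].isna().sum()))   # 1
```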

---

#### 5. `sklearn.impute` – Statistic-Based Imputation with scikit-learn

```python
from sklearn.impute import SimpleImputer

# 'mean' and 'median' require X to contain only numeric columns.
imputer = SimpleImputer(strategy='mean')  # or 'median', 'most_frequent', 'constant'
X = imputer.fit_transform(X)
```

In [10]:
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values = np.nan, strategy = 'mean')
imputer.fit(X[:, 1:3])
X[:, 1:3] = imputer.transform(X[:, 1:3])
print(X)

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 63777.77777777778]
 ['France' 35.0 58000.0]
 ['Spain' 38.77777777777778 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]
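The imputed values `38.777...` and `63777.777...` are simply the column means over the nine observed entries. The sketch below checks this on the `Age` and `Salary` columns, retyped by hand so it runs standalone, by comparing the imputer's learned `statistics_` with `np.nanmean`.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Age and Salary columns copied from the dataset printed above.
numeric = np.array([
    [44, 72000], [27, 48000], [30, 54000], [38, 61000], [40, np.nan],
    [35, 58000], [np.nan, 52000], [48, 79000], [50, 83000], [37, 67000],
], dtype=float)

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
imputed = imputer.fit_transform(numeric)

manual_means = np.nanmean(numeric, axis=0)  # mean ignoring NaN entries

print(np.allclose(imputer.statistics_, manual_means))  # True
print(round(float(imputed[6, 0]), 2))  # 38.78
print(round(float(imputed[4, 1]), 2))  # 63777.78
```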


### Encoding the independent variable

In [12]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
ct = ColumnTransformer(transformers = [('encoder', OneHotEncoder(), [0])], remainder = 'passthrough' )
X = np.array(ct.fit_transform(X))
print(X)

[[1.0 0.0 0.0 44.0 72000.0]
 [0.0 0.0 1.0 27.0 48000.0]
 [0.0 1.0 0.0 30.0 54000.0]
 [0.0 0.0 1.0 38.0 61000.0]
 [0.0 1.0 0.0 40.0 63777.77777777778]
 [1.0 0.0 0.0 35.0 58000.0]
 [0.0 0.0 1.0 38.77777777777778 52000.0]
 [1.0 0.0 0.0 48.0 79000.0]
 [0.0 1.0 0.0 50.0 83000.0]
 [1.0 0.0 0.0 37.0 67000.0]]
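In the output above, the first three columns are the one-hot columns for `Country`, ordered alphabetically (France, Germany, Spain). A minimal sketch that inspects this ordering through the encoder's `categories_` attribute, using a short hand-typed sample of the country column:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# A few country values from the dataset, as a single-column 2D array.
countries = np.array([["France"], ["Spain"], ["Germany"], ["Spain"]])

encoder = OneHotEncoder()
# The encoder returns a sparse matrix by default; convert it for printing.
encoded = encoder.fit_transform(countries).toarray()

print(encoder.categories_[0])  # ['France' 'Germany' 'Spain']
print(encoded[1])              # [0. 0. 1.]  -> Spain
```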


### Encoding the dependent variable

In [14]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y = le.fit_transform(y.ravel())
print(y)

[0 1 0 0 1 1 0 1 0 1]
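`LabelEncoder` assigns integers in sorted class order, so `No` becomes 0 and `Yes` becomes 1. The sketch below, on the first five labels of the dataset, shows how to read that mapping from `classes_` and how to map predictions back with `inverse_transform`.

```python
from sklearn.preprocessing import LabelEncoder

# First five Purchased labels from the dataset above.
labels = ["No", "Yes", "No", "No", "Yes"]

le = LabelEncoder()
encoded = le.fit_transform(labels)

print(le.classes_)                    # ['No' 'Yes']
print(encoded)                        # [0 1 0 0 1]
print(le.inverse_transform([1, 0]))   # ['Yes' 'No']
```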


### Splitting the data into a training set and a test set

In [16]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)
print(X_train)
print('\n')
print(X_test)
print('\n')
print(y_train)
print('\n')
print(y_test)

[[1.0 0.0 0.0 35.0 58000.0]
 [1.0 0.0 0.0 44.0 72000.0]
 [1.0 0.0 0.0 48.0 79000.0]
 [0.0 1.0 0.0 30.0 54000.0]
 [1.0 0.0 0.0 37.0 67000.0]
 [0.0 1.0 0.0 40.0 63777.77777777778]
 [0.0 0.0 1.0 38.0 61000.0]
 [0.0 0.0 1.0 38.77777777777778 52000.0]]


[[0.0 1.0 0.0 50.0 83000.0]
 [0.0 0.0 1.0 27.0 48000.0]]


[1 0 1 0 1 1 0 0]


[0 1]
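With 10 rows and `test_size = 0.2`, two rows go to the test set and eight to the training set. Fixing `random_state` makes the shuffle reproducible, which the following sketch checks on plain row indices (used here as a stand-in for `X`).

```python
import numpy as np
from sklearn.model_selection import train_test_split

indices = np.arange(10)  # stand-in for the 10 rows of the dataset

train_a, test_a = train_test_split(indices, test_size=0.2, random_state=42)
train_b, test_b = train_test_split(indices, test_size=0.2, random_state=42)

print(len(train_a), len(test_a))       # 8 2
print(np.array_equal(test_a, test_b))  # True: same seed, same split
```

Omitting `random_state` gives a different split on each run, which makes results harder to compare.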


#### Normalization Formula

The Min-Max Normalization formula is:

$$
\text{Normalized Value} = \frac{X - X_{\text{min}}}{X_{\text{max}} - X_{\text{min}}}
$$

This scales the values to a range between **0** and **1**.
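As a worked example, the formula can be applied directly with NumPy. The ages below are a handful of values taken from the dataset; scikit-learn's `MinMaxScaler` performs the same computation per column.

```python
import numpy as np

# A few Age values from the dataset.
ages = np.array([27.0, 30.0, 38.0, 44.0, 50.0])

# (X - X_min) / (X_max - X_min)
normalized = (ages - ages.min()) / (ages.max() - ages.min())

print(normalized.min(), normalized.max())  # 0.0 1.0
print(round(float(normalized[2]), 4))      # 0.4783  -> (38 - 27) / 23
```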

#### Standardization Formula

The Z-score Standardization formula is:

$$
\text{Standardized Value} = \frac{X - \mu}{\sigma}
$$

Where:
- $X$ = original value
- $\mu$ = mean of the feature
- $\sigma$ = standard deviation of the feature

This scales the values to have a **mean of 0** and a **standard deviation of 1**.
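The same formula written out in NumPy, as a minimal sketch on a few salary values from the dataset. It uses the population standard deviation (`ddof=0`, NumPy's default), which is also what `StandardScaler` uses.

```python
import numpy as np

# A few Salary values from the dataset.
salaries = np.array([48000.0, 54000.0, 61000.0, 72000.0, 83000.0])

mu = salaries.mean()
sigma = salaries.std()  # population standard deviation (ddof=0)

# (X - mu) / sigma
standardized = (salaries - mu) / sigma

print(np.isclose(standardized.mean(), 0.0))  # True
print(np.isclose(standardized.std(), 1.0))   # True
```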


### Feature Scaling

In [20]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train[:, 3:] = sc.fit_transform(X_train[:, 3:])
X_test[:, 3:] = sc.transform(X_test[:, 3:])
print(X_train)
print('\n')
print(X_test)

[[1.0 0.0 0.0 -0.7529426005471072 -0.6260377781240918]
 [1.0 0.0 0.0 1.008453807952985 1.0130429500553495]
 [1.0 0.0 0.0 1.7912966561752484 1.8325833141450703]
 [0.0 1.0 0.0 -1.7314961608249362 -1.0943465576039322]
 [1.0 0.0 0.0 -0.3615211764359756 0.42765697570554906]
 [0.0 1.0 0.0 0.22561095973072184 0.05040823668012247]
 [0.0 0.0 1.0 -0.16581046438040975 -0.27480619351421154]
 [0.0 0.0 1.0 -0.013591021670525094 -1.3285009473438525]]


[[0.0 1.0 0.0 2.1827180802863797 2.3008920936249107]
 [0.0 0.0 1.0 -2.3186282969916334 -1.7968097268236927]]
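Note that the scaler is fitted on the training rows only and then reused on the test rows, so the test set does not influence the mean and standard deviation. The sketch below reproduces this on the `Age` and `Salary` columns, retyped from the split printed above (the two imputed entries are written as `349 / 9` and `574000 / 9`), and shows that `inverse_transform` recovers the original units.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Age and Salary of the training and test rows shown above.
train = np.array([
    [35, 58000], [44, 72000], [48, 79000], [30, 54000],
    [37, 67000], [40, 574000 / 9], [38, 61000], [349 / 9, 52000],
], dtype=float)
test = np.array([[50.0, 83000.0], [27.0, 48000.0]])

sc = StandardScaler()
train_scaled = sc.fit_transform(train)  # learn mean/std from training data
test_scaled = sc.transform(test)        # reuse the training statistics

print(np.allclose(train_scaled.mean(axis=0), 0.0))  # True
print(np.allclose(train_scaled.std(axis=0), 1.0))   # True
print(np.allclose(sc.inverse_transform(test_scaled), test))  # True
```

The scaled test values fall outside the training range (about 2.18 and -2.32 for `Age`) because 50 and 27 are more extreme than any training age. The one-hot columns are left unscaled since they are already 0/1 indicators.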
