### Install this module

In [2]:
pip install -U scikit-learn

Defaulting to user installation because normal site-packages is not writeable
Collecting scikit-learn
  Downloading scikit_learn-1.4.1.post1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (11 kB)
Downloading scikit_learn-1.4.1.post1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (12.1 MB)
Installing collected packages: scikit-learn
  Attempting uninstall: scikit-learn
    Found existing installation: scikit-learn 1.3.2
    Uninstalling scikit-learn-1.3.2:
      Successfully uninstalled scikit-learn-1.3.2
Successfully installed scikit-learn-1.4.1.post1
Note: you may need to restart the kernel to use updated packages.


# **Feature Scaling**

### **Introduction**
>Feature scaling is a preprocessing step in machine learning that rescales the input features of a dataset to a common range or distribution, usually by normalizing or standardizing them. The goal is to keep features with large numeric ranges from dominating features with small ones, so that each feature can contribute comparably to the learning process.

### **Importance of feature scaling in machine learning**


> - **Equal Weight to Features**: Feature scaling ensures that all features are on a similar scale, preventing one feature from having a disproportionately large impact on the model's performance.

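> - **Convergence Speed**: Models trained with gradient descent, such as iteratively fitted linear regression, logistic regression, and neural networks, usually converge faster when features are scaled. With features on comparable scales, the loss surface is better conditioned, so a single learning rate produces similarly sized steps along every feature direction.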

> - **Distance-based Algorithms**: Algorithms such as k-nearest neighbors (KNN) and support vector machines (SVM) rely on distance measures and are therefore sensitive to the scale of features. Without scaling, the feature with the largest numeric range dominates the distance.

> - **Regularization**: L1 and L2 regularization penalize the magnitude of the model coefficients, and a coefficient's magnitude depends on the scale of its feature. Scaling puts the features on a common scale so that the penalty treats them evenly.
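The point about distance-based algorithms can be illustrated with a small sketch. The data below is hypothetical (ages in years, incomes in dollars) and was chosen only to make the effect visible; `dist` is a helper defined here for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical data: each row is [age in years, income in dollars]
X = np.array([
    [25.0, 50_000.0],  # person A
    [60.0, 51_000.0],  # person B: very different age, similar income
    [26.0, 53_000.0],  # person C: similar age, slightly higher income
    [40.0, 90_000.0],  # person D: widens the income spread
])

def dist(a, b):
    """Euclidean distance between two feature vectors."""
    return float(np.linalg.norm(a - b))

# On raw values, income dominates: B looks closer to A than C does
raw_to_b = dist(X[0], X[1])  # about 1000.6
raw_to_c = dist(X[0], X[2])  # about 3000.0

# After standardization, both features contribute on a comparable scale
X_scaled = StandardScaler().fit_transform(X)
scaled_to_b = dist(X_scaled[0], X_scaled[1])
scaled_to_c = dist(X_scaled[0], X_scaled[2])

print(f"raw:    A-B = {raw_to_b:.1f}, A-C = {raw_to_c:.1f}")
print(f"scaled: A-B = {scaled_to_b:.3f}, A-C = {scaled_to_c:.3f}")
```

On the raw data, A's nearest neighbor is B even though B is 35 years older. After scaling, C becomes the nearest neighbor, because the age difference is no longer drowned out by the income column.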

## **Types of Feature Scaling**

## **1. Min-Max Scaling**

**Formula**

<p align="center">
     <img src="https://latex.codecogs.com/svg.latex?X_{\text{norm}} = \frac{X - X_{\text{min}}}{X_{\text{max}} - X_{\text{min}}}" title="Min-Max Scaling" />
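</p>

**Range**
> Transforms data to the range [0, 1] by default, whether or not the data contains negative values. A different target range, such as [-1, 1], can be requested through the `feature_range` parameter of `MinMaxScaler`.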

In [12]:
# code for min-max scaling  
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Sample data
data = {'awen_numbers': [10, 20, 30, 40, 50]}
df = pd.DataFrame(data)
df.head()


Unnamed: 0,awen_numbers
0,10
1,20
2,30
3,40
4,50


In [14]:
# Applying min-max scaling
scaler = MinMaxScaler()
df['awen_numbers_scaled'] = scaler.fit_transform(df[['awen_numbers']])
df.head()


Unnamed: 0,awen_numbers,awen_numbers_scaled
0,10,0.0
1,20,0.25
2,30,0.5
3,40,0.75
4,50,1.0
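As a quick check of the formula and the range, the sketch below uses a hypothetical column that contains negative values. It compares `MinMaxScaler` with the formula applied by hand, then requests the range [-1, 1] through `feature_range`.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical single-feature data that includes negative values
X = np.array([[-20.0], [-10.0], [0.0], [10.0], [20.0]])

# Default behaviour: output lies in [0, 1] even with negative inputs
default_scaled = MinMaxScaler().fit_transform(X)

# The same result from the formula (X - X_min) / (X_max - X_min)
manual_scaled = (X - X.min()) / (X.max() - X.min())

# Requesting a symmetric target range explicitly
symmetric_scaled = MinMaxScaler(feature_range=(-1, 1)).fit_transform(X)

print(default_scaled.ravel())    # values between 0 and 1
print(symmetric_scaled.ravel())  # values between -1 and 1
```

The sign of the input does not change the default output range; only `feature_range` does.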


## **2. Standard Scaler or Z-score Normalization**

**Formula**
<p align="center">
     <img src="https://latex.codecogs.com/svg.latex?X_{\text{std}} = \frac{X - \mu}{\sigma}" title="Standardization" />
</p>

**Range**
> Centers the data around 0 with a standard deviation of 1. Unlike Min-Max scaling, the output has no fixed minimum or maximum.

In [16]:
# code for standardization scaling or z-score normalization
import pandas as pd
from sklearn.preprocessing import StandardScaler
# Sample data
data = {'awen_numbers': [10, 20, 30, 40, 50]}
df = pd.DataFrame(data)
df.head()

Unnamed: 0,awen_numbers
0,10
1,20
2,30
3,40
4,50


In [18]:
# Applying standardization scaling
scaler = StandardScaler()
df['awen_numbers_scaled'] = scaler.fit_transform(df[['awen_numbers']])
df.head()

Unnamed: 0,awen_numbers,awen_numbers_scaled
0,10,-1.414214
1,20,-0.707107
2,30,0.0
3,40,0.707107
4,50,1.414214
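The values above can be reproduced by hand. One detail worth knowing is that `StandardScaler` divides by the population standard deviation (`ddof=0`), whereas pandas' `Series.std()` defaults to the sample standard deviation (`ddof=1`). The sketch below, which reuses the same five numbers, shows the difference.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

values = pd.Series([10, 20, 30, 40, 50], dtype=float, name='awen_numbers')

mu = values.mean()
sigma_population = values.std(ddof=0)  # what StandardScaler uses
sigma_sample = values.std()            # pandas default, ddof=1

z_population = (values - mu) / sigma_population
z_sample = (values - mu) / sigma_sample

sklearn_z = StandardScaler().fit_transform(values.to_frame()).ravel()

print(z_population.round(6).tolist())  # matches the scikit-learn output
print(z_sample.round(6).tolist())      # slightly smaller in magnitude
```

If a hand-computed z-score does not match scikit-learn's, a mismatch in `ddof` is a likely cause.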


## **3. Robust Scaler**

**Formula**
<p align="center">
     <img src="https://latex.codecogs.com/svg.latex?X_{\text{robust}} = \frac{X - \text{median}}{\text{IQR}}" title="Robust Scaling" />
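</p>

**Range**
> Has no fixed range. The data is centered on the median and divided by the interquartile range (IQR), so typical values land near 0 and outliers stay far from it without distorting the scale of the other points.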

In [24]:
# code for robust scaling
import pandas as pd
from sklearn.preprocessing import RobustScaler
# Sample data with outliers
data = {'awen_numbers': [10, 20, 30, 1000, 50]} # 1000 is an outlier in this data
df = pd.DataFrame(data)
# Robust Scaling
scaler = RobustScaler()
df['awen_numbers_scaled'] = scaler.fit_transform(df[['awen_numbers']])
df

Unnamed: 0,awen_numbers,awen_numbers_scaled
0,10,-0.666667
1,20,-0.333333
2,30,0.0
3,1000,32.333333
4,50,0.666667
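To see where these numbers come from, and why the method is called robust, the following sketch recomputes the result from the median and IQR, and then applies Min-Max scaling to the same data for comparison. It assumes the default `RobustScaler` settings (quantile range 25 to 75).

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler

# Same values as above; 1000 is the outlier
X = np.array([[10.0], [20.0], [30.0], [1000.0], [50.0]])

# Manual robust scaling: (X - median) / IQR
median = np.median(X)
q1, q3 = np.percentile(X, [25, 75])
manual_robust = (X - median) / (q3 - q1)

robust_scaled = RobustScaler().fit_transform(X)
minmax_scaled = MinMaxScaler().fit_transform(X)

# Rows 0, 1, 2 and 4 are the non-outlier points
inliers = [0, 1, 2, 4]
print(robust_scaled[inliers].ravel())  # spread between about -0.67 and 0.67
print(minmax_scaled[inliers].ravel())  # all squeezed close to 0
```

With Min-Max scaling the outlier sets the maximum, so the four ordinary points are compressed into a tiny interval near 0. With robust scaling they keep a usable spread.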


## **4. Logarithmic Scaling/Normalization**

**Formula**
<p align="center">
     <img src="https://latex.codecogs.com/svg.latex?X_{\text{log}} = \log(X)" title="Log Transformation" />
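</p>

**Range**
> Has no fixed range. The logarithm compresses large values much more than small ones, which pulls in the long right tail of skewed data. It is defined only for strictly positive values.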

In [25]:
# code for Logarithmic scaling/Normalization
import numpy as np
import pandas as pd

# Sample data with an outlier (1000000)
data = {'awen_numbers': [10000, 20000, 30000, 1000000, 50000]}
df = pd.DataFrame(data)

# Log Transform
df['awen_numbers_log'] = np.log(df['awen_numbers'])
df['awen_numbers_log2'] = np.log2(df['awen_numbers'])
df['awen_numbers_log10'] = np.log10(df['awen_numbers'])
df.head()

Unnamed: 0,awen_numbers,awen_numbers_log,awen_numbers_log2,awen_numbers_log10
0,10000,9.21034,13.287712,4.0
1,20000,9.903488,14.287712,4.30103
2,30000,10.308953,14.872675,4.477121
3,1000000,13.815511,19.931569,6.0
4,50000,10.819778,15.60964,4.69897
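Because the logarithm is undefined at zero, data that contains zeros needs a small adjustment. One common option, shown here as a sketch on hypothetical data, is `np.log1p`, which computes log(1 + x); `np.expm1` reverses it.

```python
import numpy as np
import pandas as pd

# Hypothetical data that includes a zero
df = pd.DataFrame({'awen_numbers': [0, 9, 99, 999, 99999]})

# A plain log maps 0 to -inf (the warning is silenced here on purpose)
with np.errstate(divide='ignore'):
    plain_log = np.log(df['awen_numbers'])

# log1p(x) = log(1 + x) stays finite at 0
df['awen_numbers_log1p'] = np.log1p(df['awen_numbers'])

# expm1 is the inverse of log1p and recovers the original values
df['restored'] = np.expm1(df['awen_numbers_log1p'])

print(plain_log.iloc[0])  # -inf
print(df)
```

Negative values are still out of scope for this approach; a transform such as Yeo-Johnson handles those.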


## **5. Power Transformation**

**Formula**
<p align="center">
     <img src="https://latex.codecogs.com/svg.latex?X_{\text{power}} = X^p" title="Power Transformation" />
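</p>

**Range**
> Has no fixed range. Adjusts the data distribution based on the chosen power, typically making it more symmetric and closer to a normal distribution. scikit-learn's `PowerTransformer` estimates the power for each feature from the data and, by default, also standardizes the result to zero mean and unit variance.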

In [26]:
# code for Power Transformation
import pandas as pd
df = pd.DataFrame({
    'Income': [15000, 1800, 120000, 10000],
    'Age': [25, 18, 42, 51],
    'Department': ['HR','Legal','Marketing','Management']
})
df

Unnamed: 0,Income,Age,Department
0,15000,25,HR
1,1800,18,Legal
2,120000,42,Marketing
3,10000,51,Management


In [27]:
# Apply power transformation
from sklearn.preprocessing import PowerTransformer

# method can be 'box-cox' or 'yeo-johnson' (the scikit-learn default).
# Without going into the details of how each transform works, it is helpful
# to know that Box-Cox works only with strictly positive values, while
# Yeo-Johnson also accepts zero and negative values.
# All our values are positive, so we use the Box-Cox transform.
scaler = PowerTransformer(method='box-cox')

# Transform only the numeric columns; 'Department' is categorical
df_scaled = df.copy()
col_names = ['Income', 'Age']
features = df_scaled[col_names]
df_scaled[col_names] = scaler.fit_transform(features.values)
df_scaled

Unnamed: 0,Income,Age,Department
0,0.125158,-0.597385,HR
1,-1.395497,-1.301984,Legal
2,1.419403,0.681202,Marketing
3,-0.149064,1.218168,Management
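For data that contains zeros or negative values, Box-Cox is not applicable and Yeo-Johnson can be used instead. The sketch below uses hypothetical values; it shows Box-Cox rejecting the data, then fits Yeo-Johnson and reads the estimated power from the `lambdas_` attribute.

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

# Hypothetical right-skewed data with a negative value and a zero
X = np.array([[-5.0], [0.0], [1.0], [3.0], [10.0], [40.0]])

# Box-Cox requires strictly positive input and raises an error otherwise
try:
    PowerTransformer(method='box-cox').fit(X)
    box_cox_failed = False
except ValueError:
    box_cox_failed = True

# Yeo-Johnson accepts the same data
yeo_johnson = PowerTransformer(method='yeo-johnson')
X_transformed = yeo_johnson.fit_transform(X)

# One fitted power (lambda) per feature
fitted_lambda = yeo_johnson.lambdas_[0]

print("Box-Cox failed:", box_cox_failed)
print("Fitted lambda:", fitted_lambda)
print(X_transformed.ravel())
```

Because `standardize=True` is the default, the transformed column has zero mean and unit variance, which is why the Box-Cox output above is also centered around 0.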
