# 🌟 Feature Scaling in Scikit-Learn

Feature scaling is a **crucial preprocessing step** in machine learning. Many algorithms perform poorly when input features have **vastly different scales**. 🧑‍💻

For example, in the **California housing dataset**:

* `total_rooms` ranges from 6 to over 39,000 🏠
* `median_income` ranges from about 0.5 to 15 💵

If you don’t scale these features, many models will effectively **give more weight** to `total_rooms` simply because it has **larger values**, not because it is more informative. ⚖️

## ⚙️ Why Scaling Is Needed

Many models (for example KNN, SVMs, and linear models trained with gradient descent or regularization) work best when features are on a **similar scale**. Without scaling:

* Features with **larger ranges** can dominate model behavior 🏋️‍♂️.
* Gradient-based training can become **unstable** and **slower** to converge 🚶‍♂️.

Scaling helps make training **more stable** and **faster** 🚀.
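
To make the "larger ranges dominate" point concrete, here is a minimal sketch using three made-up districts (the numbers are illustrative, not taken from the dataset). It measures how much of the squared Euclidean distance between two districts comes from `median_income`, before and after standardization.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical districts; columns are [total_rooms, median_income].
X = np.array([
    [2000.0, 2.0],
    [2100.0, 9.0],
    [2300.0, 2.1],
])

def income_share(a, b):
    """Fraction of the squared Euclidean distance contributed by median_income."""
    sq = (a - b) ** 2
    return float(sq[1] / sq.sum())

# Raw features: the distance is driven almost entirely by total_rooms.
raw_share = income_share(X[0], X[1])

# Standardized features: median_income now contributes on comparable terms.
X_scaled = StandardScaler().fit_transform(X)
scaled_share = income_share(X_scaled[0], X_scaled[1])

print(f"income share of distance, raw: {raw_share:.3f}, scaled: {scaled_share:.3f}")
```

Any distance-based model (KNN, for instance) sees the same effect: without scaling, a large income gap barely registers next to a small difference in room count.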

---

## 🔄 Min-Max Scaling (Normalization)

This method rescales each feature to a fixed range, usually **\[0, 1]** or **\[-1, 1]**. 🔢

### Formula:

```plaintext
scaled_value = (x - min) / (max - min)
```
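
The formula above maps values to \[0, 1]. As a small sketch, the same computation can be written in NumPy and stretched to an arbitrary range `[lo, hi]`; the helper name `min_max` and the sample values are assumptions made for illustration. The result is compared against `MinMaxScaler` as a sanity check.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Illustrative single-feature column (values are made up).
x = np.array([[6.0], [1000.0], [39320.0]])

def min_max(x, lo=0.0, hi=1.0):
    """Rescale each column to [lo, hi] using the min-max formula."""
    x01 = (x - x.min(axis=0)) / (x.max(axis=0) - x.min(axis=0))  # maps to [0, 1]
    return x01 * (hi - lo) + lo                                   # stretch to [lo, hi]

manual = min_max(x, -1.0, 1.0)
sk = MinMaxScaler(feature_range=(-1, 1)).fit_transform(x)

print(np.allclose(manual, sk))
```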

### Use Scikit-Learn’s `MinMaxScaler`:

```python
from sklearn.preprocessing import MinMaxScaler

min_max_scaler = MinMaxScaler(feature_range=(-1, 1))
housing_num_min_max_scaled = min_max_scaler.fit_transform(housing_num)
```

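* `feature_range=(-1, 1)` can suit models like **neural networks**, which often train best with inputs centered near zero 🤖.
* **Sensitive to outliers**: a single extreme value stretches the range and squeezes all other values into a narrow band ⚡.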

---

## 📏 Standardization (Z-score Scaling)

This method **centers the data around 0** and scales it based on **standard deviation**. 📐

### Formula:

```plaintext
standardized_value = (x - mean) / std
```
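
As a quick check of the formula, the sketch below standardizes a small made-up column by hand and compares it with `StandardScaler`. One detail worth knowing: `StandardScaler` uses the population standard deviation (`ddof=0`), which is also NumPy's default.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative single-feature column (values are made up).
x = np.array([[1.0], [2.0], [3.0], [4.0], [10.0]])

# (x - mean) / std, with the population standard deviation (ddof=0).
manual = (x - x.mean(axis=0)) / x.std(axis=0)
sk = StandardScaler().fit_transform(x)

print(np.allclose(manual, sk))
print(manual.mean(), manual.std())
```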

### Use Scikit-Learn’s `StandardScaler`:

```python
from sklearn.preprocessing import StandardScaler

std_scaler = StandardScaler()
housing_num_std_scaled = std_scaler.fit_transform(housing_num)
```

* Resulting features have **zero mean** and **unit variance** 📊.
* **Less affected by outliers** than **Min-Max Scaling**, though not fully robust: extreme values still shift the mean and inflate the standard deviation 💪.
* Recommended for most ML algorithms, especially when using **gradient descent** ⛰️.
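
The outlier behavior of the two scalers can be compared with a rough experiment on synthetic data (the data and the helper `inlier_std_ratio` are assumptions for illustration). It fits each scaler with and without one extreme value and measures how much the spread of the ordinary points shrinks.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

rng = np.random.default_rng(0)
inliers = rng.uniform(0.0, 1.0, size=(1000, 1))   # ordinary values
with_outlier = np.vstack([inliers, [[100.0]]])    # same values plus one extreme point

def inlier_std_ratio(scaler_cls):
    """Spread of the scaled inliers when fitted with the outlier, relative to without it."""
    clean = scaler_cls().fit(inliers).transform(inliers).std()
    dirty = scaler_cls().fit(with_outlier).transform(inliers).std()
    return dirty / clean

minmax_ratio = inlier_std_ratio(MinMaxScaler)
standard_ratio = inlier_std_ratio(StandardScaler)

# A ratio near 1 means the outlier barely changed the scaling of ordinary points.
print(f"min-max: {minmax_ratio:.3f}, standardization: {standard_ratio:.3f}")
```

Both ratios fall below 1, so neither method is immune, but on this data the min-max ratio is noticeably smaller, meaning the ordinary values are compressed more.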

---

Scaling is one of the simplest yet most powerful tools to ensure your models work **efficiently** and **accurately**! 💯


In [3]:
import pandas as pd
import numpy as np

In [4]:
data = pd.read_csv("xlsx/California Housing Prices/housing.csv")

In [5]:
data["income_cat"] = pd.cut(data["median_income"], bins=[0, 1.5, 3.0, 4.5, 6.0, np.inf], labels=[1, 2, 3, 4, 5])
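
To see how `pd.cut` assigns these categories, here is a small sketch on a few made-up income values. By default the bins are right-inclusive, so 1.5 falls in category 1 and 4.5 in category 3; a value of exactly 0 would fall outside the first bin and become `NaN`.

```python
import numpy as np
import pandas as pd

# Made-up incomes chosen to sit on and between the bin edges.
incomes = pd.Series([0.5, 1.5, 2.9, 4.5, 7.0, 15.0])

income_cat = pd.cut(
    incomes,
    bins=[0, 1.5, 3.0, 4.5, 6.0, np.inf],  # intervals (0, 1.5], (1.5, 3.0], ...
    labels=[1, 2, 3, 4, 5],
)

print(income_cat.tolist())
```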

In [6]:
import matplotlib.pyplot as plt

In [7]:
from sklearn.model_selection import StratifiedShuffleSplit

In [8]:
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)

In [9]:
for train_index, test_index in split.split(data, data["income_cat"]):
    strat_train_set = data.loc[train_index]
    strat_test_set = data.loc[test_index]
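
A quick way to confirm that a stratified split preserves category proportions is to compare them before and after splitting. The sketch below does this on a synthetic frame (the column names `value` and `cat` are placeholders). Note that `split.split` yields positional indices, so `.iloc` is the safer accessor; `.loc` works in the cell above only because `data` has the default integer index.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

rng = np.random.default_rng(42)
toy = pd.DataFrame({
    "value": rng.normal(size=1000),
    "cat": rng.choice([1, 2, 3], size=1000, p=[0.2, 0.5, 0.3]),
})

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(split.split(toy, toy["cat"]))
train, test = toy.iloc[train_idx], toy.iloc[test_idx]

# Category proportions in the full data vs. the test set.
overall = toy["cat"].value_counts(normalize=True).sort_index()
in_test = test["cat"].value_counts(normalize=True).sort_index()
max_gap = (overall - in_test).abs().max()

print(pd.DataFrame({"overall": overall, "test": in_test}))
```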

In [10]:
# Remove the helper income_cat column now that the stratified split is done
for set_ in (strat_train_set, strat_test_set):
    set_.drop("income_cat", axis=1, inplace=True)

In [11]:
df = strat_train_set.copy()

In [12]:
housing = strat_train_set.drop("median_house_value", axis=1)
housing_labels = strat_train_set["median_house_value"].copy()

In [13]:
from sklearn.impute import SimpleImputer

In [14]:
imputer = SimpleImputer(strategy="median")

In [15]:
housing_num = housing.select_dtypes(include=[np.number])

In [16]:
imputer.fit(housing_num)

In [17]:
imputer.statistics_

array([-118.51   ,   34.26   ,   29.     , 2119.     ,  433.     ,
       1164.     ,  408.     ,    3.54155])

In [18]:
X = imputer.transform(housing_num)
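
As a minimal illustration of what the imputer does, the sketch below fills missing values in a tiny made-up frame. `fit` stores one median per column in `statistics_`, and `transform` replaces each `NaN` with the median of its column.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up values with one missing entry per column.
toy = pd.DataFrame({
    "total_bedrooms": [100.0, np.nan, 300.0, 500.0],
    "households": [90.0, 210.0, np.nan, 400.0],
})

imputer = SimpleImputer(strategy="median")
filled = pd.DataFrame(
    imputer.fit_transform(toy),  # returns a NumPy array, so rebuild the frame
    columns=toy.columns,
    index=toy.index,
)

print(imputer.statistics_)
print(filled)
```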

In [19]:
housing = pd.DataFrame(X, columns=housing_num.columns, index=housing_num.index)

In [20]:
housing["ocean_proximity"] = df["ocean_proximity"]

In [21]:
housing_for_onehot = housing[["ocean_proximity"]]

In [22]:
housing.head()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,ocean_proximity
12655,-121.46,38.52,29.0,3873.0,797.0,2237.0,706.0,2.1736,INLAND
15502,-117.23,33.09,7.0,5320.0,855.0,2015.0,768.0,6.3373,NEAR OCEAN
2908,-119.04,35.37,44.0,1618.0,310.0,667.0,300.0,2.875,INLAND
14053,-117.13,32.75,24.0,1877.0,519.0,898.0,483.0,2.2264,NEAR OCEAN
20496,-118.7,34.28,27.0,3536.0,646.0,1837.0,580.0,4.4964,<1H OCEAN


In [23]:
housing["ocean_proximity"].unique()

array(['INLAND', 'NEAR OCEAN', '<1H OCEAN', 'NEAR BAY', 'ISLAND'],
      dtype=object)

In [24]:
from sklearn.preprocessing import OrdinalEncoder

In [25]:
ordinal_encoder = OrdinalEncoder()

In [26]:
# Encode only the categorical column; passing the whole frame would also
# replace every numeric value with its rank among the unique values.
housing_cat = ordinal_encoder.fit_transform(housing[["ocean_proximity"]])

In [27]:
housing_cat = pd.DataFrame(housing_cat, columns=["ocean_proximity"], index=housing.index)

In [28]:
housing_cat.head()

Unnamed: 0,ocean_proximity
12655,1.0
15502,4.0
2908,1.0
14053,4.0
20496,0.0


In [29]:
from sklearn.preprocessing import OneHotEncoder

In [30]:
one_hot_encoder = OneHotEncoder()

In [31]:
housing_cat = one_hot_encoder.fit_transform(housing_for_onehot)

In [32]:
housing_cat

<16512x5 sparse matrix of type '<class 'numpy.float64'>'
	with 16512 stored elements in Compressed Sparse Row format>

In [33]:
housing_cat.toarray()

array([[0., 1., 0., 0., 0.],
       [0., 0., 0., 0., 1.],
       [0., 1., 0., 0., 0.],
       ...,
       [1., 0., 0., 0., 0.],
       [1., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0.]])

In [34]:
one_hot_encoder.categories_

[array(['<1H OCEAN', 'INLAND', 'ISLAND', 'NEAR BAY', 'NEAR OCEAN'],
       dtype=object)]

In [35]:
# Take the column names from the fitted encoder instead of hard-coding them
housing_cat = pd.DataFrame(housing_cat.toarray(), columns=one_hot_encoder.categories_[0], index=housing.index)

In [36]:
housing_cat

Unnamed: 0,<1H OCEAN,INLAND,ISLAND,NEAR BAY,NEAR OCEAN
12655,0.0,1.0,0.0,0.0,0.0
15502,0.0,0.0,0.0,0.0,1.0
2908,0.0,1.0,0.0,0.0,0.0
14053,0.0,0.0,0.0,0.0,1.0
20496,1.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...
15174,1.0,0.0,0.0,0.0,0.0
12661,0.0,1.0,0.0,0.0,0.0
19263,1.0,0.0,0.0,0.0,0.0
19140,1.0,0.0,0.0,0.0,0.0


In [37]:
df = pd.concat([df, housing_cat], axis=1)

In [38]:
df.head()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,ocean_proximity,<1H OCEAN,INLAND,ISLAND,NEAR BAY,NEAR OCEAN
12655,-121.46,38.52,29.0,3873.0,797.0,2237.0,706.0,2.1736,72100.0,INLAND,0.0,1.0,0.0,0.0,0.0
15502,-117.23,33.09,7.0,5320.0,855.0,2015.0,768.0,6.3373,279600.0,NEAR OCEAN,0.0,0.0,0.0,0.0,1.0
2908,-119.04,35.37,44.0,1618.0,310.0,667.0,300.0,2.875,82700.0,INLAND,0.0,1.0,0.0,0.0,0.0
14053,-117.13,32.75,24.0,1877.0,519.0,898.0,483.0,2.2264,112500.0,NEAR OCEAN,0.0,0.0,0.0,0.0,1.0
20496,-118.7,34.28,27.0,3536.0,646.0,1837.0,580.0,4.4964,238300.0,<1H OCEAN,1.0,0.0,0.0,0.0,0.0


In [39]:
df = df.drop("ocean_proximity", axis=1)

In [40]:
df.head()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,<1H OCEAN,INLAND,ISLAND,NEAR BAY,NEAR OCEAN
12655,-121.46,38.52,29.0,3873.0,797.0,2237.0,706.0,2.1736,72100.0,0.0,1.0,0.0,0.0,0.0
15502,-117.23,33.09,7.0,5320.0,855.0,2015.0,768.0,6.3373,279600.0,0.0,0.0,0.0,0.0,1.0
2908,-119.04,35.37,44.0,1618.0,310.0,667.0,300.0,2.875,82700.0,0.0,1.0,0.0,0.0,0.0
14053,-117.13,32.75,24.0,1877.0,519.0,898.0,483.0,2.2264,112500.0,0.0,0.0,0.0,0.0,1.0
20496,-118.7,34.28,27.0,3536.0,646.0,1837.0,580.0,4.4964,238300.0,1.0,0.0,0.0,0.0,0.0


In [41]:
from sklearn.preprocessing import MinMaxScaler

In [42]:
# Note: df still holds the target and the one-hot columns, so they are rescaled too
scaler = MinMaxScaler(feature_range=(-1, 1))

In [89]:
df_scaled = scaler.fit_transform(df)

In [91]:
df_scaled

array([[-0.42430279,  0.27098831,  0.09803922, ..., -1.        ,
        -1.        , -1.        ],
       [ 0.41832669, -0.88310308, -0.76470588, ..., -1.        ,
        -1.        ,  1.        ],
       [ 0.05776892, -0.39851222,  0.68627451, ..., -1.        ,
        -1.        , -1.        ],
       ...,
       [-0.6752988 ,  0.25398512,  0.84313725, ..., -1.        ,
        -1.        , -1.        ],
       [-0.67131474,  0.22635494, -0.49019608, ..., -1.        ,
        -1.        , -1.        ],
       [-0.55976096,  0.57917109,  0.01960784, ..., -1.        ,
        -1.        , -1.        ]])

In [93]:
df_scaled = pd.DataFrame(df_scaled, columns=df.columns, index=df.index)

In [97]:
df_scaled.head()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,<1H OCEAN,INLAND,ISLAND,NEAR BAY,NEAR OCEAN
12655,-0.424303,0.270988,0.098039,-0.803276,-0.743879,-0.874772,-0.737117,-0.769148,-0.764533,-1.0,1.0,-1.0,-1.0,-1.0
15502,0.418327,-0.883103,-0.764706,-0.729664,-0.725193,-0.887217,-0.713966,-0.194852,0.091134,-1.0,-1.0,-1.0,-1.0,1.0
2908,0.057769,-0.398512,0.686275,-0.917994,-0.900773,-0.962779,-0.888723,-0.672405,-0.720822,-1.0,1.0,-1.0,-1.0,-1.0
14053,0.438247,-0.955367,-0.098039,-0.904818,-0.833441,-0.94983,-0.820388,-0.761865,-0.597936,-1.0,-1.0,-1.0,-1.0,1.0
20496,0.125498,-0.630181,0.019608,-0.82042,-0.792526,-0.897194,-0.784167,-0.448766,-0.079175,1.0,-1.0,-1.0,-1.0,-1.0
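
In the output above, the one-hot columns were rescaled from 0/1 to -1/1 along with everything else. If that is not wanted, one possible approach is a `ColumnTransformer` that scales only the chosen numeric columns and passes the rest through unchanged. The sketch below uses a tiny made-up frame; the choice of which columns to scale is an assumption for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler

# Made-up frame with two numeric features and one one-hot column.
toy = pd.DataFrame({
    "total_rooms": [6.0, 2000.0, 39320.0],
    "median_income": [0.5, 3.5, 15.0],
    "INLAND": [1.0, 0.0, 0.0],
})
num_cols = ["total_rooms", "median_income"]

ct = ColumnTransformer(
    [("num", MinMaxScaler(feature_range=(-1, 1)), num_cols)],
    remainder="passthrough",  # leave the one-hot column untouched
)

# Transformed columns come first, followed by the passthrough columns.
scaled = pd.DataFrame(ct.fit_transform(toy), columns=num_cols + ["INLAND"], index=toy.index)

print(scaled)
```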


In [99]:
from sklearn.preprocessing import StandardScaler

In [101]:
std_scaler = StandardScaler()

In [103]:
df_std_scaled = std_scaler.fit_transform(df)

In [107]:
df_std_scaled

array([[-0.94135046,  1.34743822,  0.02756357, ..., -0.0110063 ,
        -0.3548889 , -0.38421741],
       [ 1.17178212, -1.19243966, -1.72201763, ..., -0.0110063 ,
        -0.3548889 ,  2.60269309],
       [ 0.26758118, -0.1259716 ,  1.22045984, ..., -0.0110063 ,
        -0.3548889 , -0.38421741],
       ...,
       [-1.5707942 ,  1.31001828,  1.53856552, ..., -0.0110063 ,
        -0.3548889 , -0.38421741],
       [-1.56080303,  1.2492109 , -1.1653327 , ..., -0.0110063 ,
        -0.3548889 , -0.38421741],
       [-1.28105026,  2.02567448, -0.13148926, ..., -0.0110063 ,
        -0.3548889 , -0.38421741]])

In [109]:
df_std_scaled = pd.DataFrame(df_std_scaled, columns=df.columns, index=df.index)

In [111]:
df_std_scaled.head()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,<1H OCEAN,INLAND,ISLAND,NEAR BAY,NEAR OCEAN
12655,-0.94135,1.347438,0.027564,0.584777,0.635123,0.732602,0.556286,-0.893647,-1.166015,-0.887683,1.46218,-0.011006,-0.354889,-0.384217
15502,1.171782,-1.19244,-1.722018,1.261467,0.775677,0.533612,0.721318,1.292168,0.627451,-0.887683,-0.68391,-0.011006,-0.354889,2.602693
2908,0.267581,-0.125972,1.22046,-0.469773,-0.545045,-0.674675,-0.524407,-0.525434,-1.074397,-0.887683,1.46218,-0.011006,-0.354889,-0.384217
14053,1.221738,-1.351474,-0.370069,-0.348652,-0.038567,-0.467617,-0.037297,-0.865929,-0.816829,-0.887683,-0.68391,-0.011006,-0.354889,2.602693
20496,0.437431,-0.635818,-0.131489,0.427179,0.269198,0.37406,0.220898,0.325752,0.270486,1.126529,-0.68391,-0.011006,-0.354889,-0.384217
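
The cells above call `fit_transform` on the training data only. A common follow-up, sketched here on synthetic data, is to reuse the already fitted scaler on held-out data with `transform`, so the test set is scaled with the training mean and standard deviation. `inverse_transform` maps scaled values back to the original units. All names and numbers below are placeholders.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Two synthetic features on very different scales.
X = rng.normal(loc=[2000.0, 4.0], scale=[1500.0, 2.0], size=(500, 2))
X_train, X_test = train_test_split(X, test_size=0.2, random_state=42)

scaler = StandardScaler().fit(X_train)   # learn mean and std from training data only
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)      # reuse the training statistics

# Map the scaled training data back to the original units.
X_back = scaler.inverse_transform(X_train_s)

print(X_train_s.mean(axis=0), X_train_s.std(axis=0))
print(X_test_s.mean(axis=0), X_test_s.std(axis=0))
```

The training columns come out with mean 0 and standard deviation 1, while the test columns are only approximately so, which is expected because the scaler's statistics come from the training data.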
