# Feature Scaling

Feature scaling is a crucial preprocessing step in many machine learning workflows. It puts features on comparable scales so that no feature dominates simply because of its units or range, which can significantly improve the performance of many algorithms. Here are the main reasons why feature scaling matters in machine learning:

1. Improving Model Performance
Distance-Based Algorithms: Algorithms like K-Nearest Neighbors (KNN), Support Vector Machines (SVM), and K-Means Clustering rely on distance metrics (e.g., Euclidean distance). If features are on different scales, the feature with the larger range dominates the distance calculation, and the remaining features are effectively ignored.
Gradient Descent Convergence: In models trained with gradient descent, such as linear regression, logistic regression, and neural networks, feature scaling speeds up convergence. When features are on different scales, the loss surface is stretched along some directions, so the algorithm zigzags towards the minimum and needs more iterations.
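
As a rough illustration of the distance problem, the sketch below compares three made-up samples described by age (years) and income (dollars). The values, the assumed feature ranges, and the helper `min_max` are hypothetical and are not taken from the dataset used later.

```python
import numpy as np

# Hypothetical samples: [age in years, income in dollars]
a = np.array([25.0, 50_000.0])
b = np.array([60.0, 50_500.0])   # very different age, similar income
c = np.array([26.0, 52_000.0])   # similar age, somewhat different income

# Raw Euclidean distances: income differences swamp age differences
d_ab_raw = np.linalg.norm(a - b)
d_ac_raw = np.linalg.norm(a - c)

# Assumed feature ranges used for min-max scaling (age 18-80, income 20k-120k)
lo = np.array([18.0, 20_000.0])
hi = np.array([80.0, 120_000.0])


def min_max(x):
    """Map each feature to [0, 1] using the assumed ranges."""
    return (x - lo) / (hi - lo)


d_ab_scaled = np.linalg.norm(min_max(a) - min_max(b))
d_ac_scaled = np.linalg.norm(min_max(a) - min_max(c))

print(d_ab_raw < d_ac_raw)        # raw: b looks closer to a than c does
print(d_ab_scaled > d_ac_scaled)  # scaled: c is closer, because age now counts
```

Before scaling, the nearest neighbor of `a` is decided almost entirely by income; after scaling, the 35-year age gap is no longer drowned out.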
2. Equal Contribution of Features
When features have vastly different scales, the ones with larger ranges can disproportionately influence the model. Scaling puts all features on a comparable footing, so a feature's influence reflects its relationship to the target and not the size of its raw values.
3. Handling Different Units
In real-world datasets, features can be measured in different units (e.g., age in years, income in dollars). Scaling does not convert them to a shared unit; it maps each feature to a common, unitless range, which makes the dataset more uniform and easier to handle.
4. Algorithm Requirements
Certain algorithms assume or perform better when features are on a similar scale:
Principal Component Analysis (PCA): PCA is sensitive to the variances of the features. If features are not scaled, the leading components mostly capture the feature with the largest variance, which is not necessarily the most informative one.
Regularized Models: Models that include regularization (like Lasso and Ridge regression) penalize the size of the coefficients, and the size of a coefficient depends on the scale of its feature. Without scaling, features on small scales need large coefficients and are penalized more heavily, so these models are normally fit on standardized features.
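
The PCA point can be sketched with synthetic data. The two features below are invented and independent by construction, so neither is more "informative" than the other; only their scales differ.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Two independent synthetic features on very different scales
small = rng.normal(loc=0.0, scale=1.0, size=1000)
large = rng.normal(loc=0.0, scale=1000.0, size=1000)
X = np.column_stack([small, large])

# Without scaling, the first component is essentially the large-scale feature
ratio_raw = PCA(n_components=2).fit(X).explained_variance_ratio_

# After standardization both features carry comparable variance
X_std = StandardScaler().fit_transform(X)
ratio_std = PCA(n_components=2).fit(X_std).explained_variance_ratio_

print(ratio_raw)  # first entry very close to 1
print(ratio_std)  # both entries close to 0.5
```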
5. Improving Interpretability
Scaling can make the coefficients of linear models (e.g., linear regression) easier to compare. When features are standardized, each coefficient describes the effect of a one-standard-deviation change in its feature, so coefficient magnitudes give a rough indication of the relative importance of each feature.
6. Improving Numerical Stability and Speed
Scaling does not change the computational complexity of an algorithm, but it can make the numerical calculations better conditioned. In practice this means more stable results and, for iterative solvers, fewer iterations.
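
One way to see the stability effect is the condition number of the matrix X^T X, which appears when solving least-squares problems. The sketch below uses a synthetic design matrix; the feature ranges are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic design matrix: one feature in [0, 1], one in [0, 10000]
X = np.column_stack([
    rng.uniform(0.0, 1.0, size=500),
    rng.uniform(0.0, 10_000.0, size=500),
])

# Standardize each column: subtract the mean, divide by the standard deviation
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Condition number of X^T X: large values indicate an ill-conditioned problem
cond_raw = np.linalg.cond(X.T @ X)
cond_std = np.linalg.cond(X_std.T @ X_std)

print(f"raw: {cond_raw:.3g}, standardized: {cond_std:.3g}")
```

A condition number near 1 means small rounding errors stay small; a very large one means they can be amplified in the solution.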
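7. Improving Data Quality
Scaling can help in identifying outliers and anomalies in the data. After standardization each value is a z-score, so points that lie many standard deviations from the mean are easy to flag. Note that min-max scaling is itself sensitive to outliers, because a single extreme value sets the minimum or maximum and compresses all other values into a narrow band.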
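
A minimal sketch of z-score based flagging, assuming a made-up feature and the common (but arbitrary) threshold of 3 standard deviations:

```python
import numpy as np

# Hypothetical feature: 50 evenly spaced values plus one extreme reading
values = np.append(np.linspace(40.0, 60.0, 50), 200.0)

# Standardize: z = (x - mean) / std
z = (values - values.mean()) / values.std()

# Flag points more than 3 standard deviations from the mean
outlier_idx = np.flatnonzero(np.abs(z) > 3)
print(outlier_idx)
```

Only the injected extreme reading (index 50) exceeds the threshold here. With very small samples a z-score can never get that large, so the threshold should be chosen with the sample size in mind.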
Summary
Feature scaling puts all features on a comparable scale, which leads to improved model performance, faster convergence, and coefficients that are easier to compare. It is especially important for distance-based algorithms, gradient-based optimization, and variance-sensitive methods such as PCA and regularized regression. Scaling changes the range of a feature, not the shape of its distribution, so it does not make data normally distributed.







In [49]:
import numpy as np
import pandas as pd

# Normalization

In [50]:
df=pd.read_csv("global_renewable_energy_production.csv")

In [51]:
df.head()

Unnamed: 0,Year,Country,SolarEnergy,WindEnergy,HydroEnergy,OtherRenewableEnergy,TotalRenewableEnergy
0,2000,USA,437.086107,1435.928598,1544.389701,319.396318,3736.800724
1,2001,USA,240.416776,402.792876,398.742141,439.779266,1481.731059
2,2002,USA,641.003511,1120.494351,334.99364,486.459433,2582.950935
3,2003,USA,849.198377,476.040844,609.102444,132.532029,2066.873694
4,2004,USA,373.818019,882.183361,1034.306532,181.053113,2471.361025


In [52]:
# Drop the non-numeric Country column; the scalers below expect numeric input
df = df.drop(columns="Country")

In [53]:
df.head()

Unnamed: 0,Year,SolarEnergy,WindEnergy,HydroEnergy,OtherRenewableEnergy,TotalRenewableEnergy
0,2000,437.086107,1435.928598,1544.389701,319.396318,3736.800724
1,2001,240.416776,402.792876,398.742141,439.779266,1481.731059
2,2002,641.003511,1120.494351,334.99364,486.459433,2582.950935
3,2003,849.198377,476.040844,609.102444,132.532029,2066.873694
4,2004,373.818019,882.183361,1034.306532,181.053113,2471.361025


In [54]:
df.describe()

Unnamed: 0,Year,SolarEnergy,WindEnergy,HydroEnergy,OtherRenewableEnergy,TotalRenewableEnergy
count,240.0,240.0,240.0,240.0,240.0,240.0
mean,2011.5,528.523858,857.13326,1076.581975,287.127554,2749.366647
std,6.936653,271.183089,375.020314,499.981598,128.460792,695.126957
min,2000.0,104.555425,206.02163,320.662607,54.876943,910.381025
25%,2005.75,284.700505,523.572495,593.796081,176.322725,2250.759951
50%,2011.5,533.436429,882.024084,1046.39038,291.398276,2815.458943
75%,2017.25,766.701662,1160.199295,1495.160715,405.479393,3217.212712
max,2023.0,996.973153,1487.070005,1983.858741,499.872953,4628.164753


In [55]:
from sklearn.preprocessing import MinMaxScaler

In [56]:
scaler=MinMaxScaler()

In [57]:
# Learn each column's min and max, then overwrite the values with their [0, 1] scaled versions
df[:] = scaler.fit_transform(df)

In [58]:
df.head()

Unnamed: 0,Year,SolarEnergy,WindEnergy,HydroEnergy,OtherRenewableEnergy,TotalRenewableEnergy
0,0.0,0.372618,0.960078,0.735768,0.594431,0.760243
1,0.043478,0.15224,0.153602,0.046945,0.864957,0.15368
2,0.086957,0.601118,0.713847,0.008617,0.969857,0.449884
3,0.130435,0.834411,0.21078,0.173425,0.174507,0.31107
4,0.173913,0.301723,0.527819,0.42908,0.283544,0.419868
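
The scaled values above can be reproduced by hand with the min-max formula x' = (x - min) / (max - min). The sketch below plugs in numbers copied from the `df.head()` and `df.describe()` outputs shown earlier; the variable names are only for illustration.

```python
# Values copied from the outputs above (SolarEnergy, first row)
x, x_min, x_max = 437.086107, 104.555425, 996.973153

# Min-max normalization: x' = (x - min) / (max - min)
x_scaled = (x - x_min) / (x_max - x_min)
print(round(x_scaled, 6))  # 0.372618

# Same formula for Year, second row: (2001 - 2000) / (2023 - 2000)
year_scaled = (2001 - 2000) / (2023 - 2000)
print(round(year_scaled, 6))  # 0.043478
```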


In [59]:
df.describe()

Unnamed: 0,Year,SolarEnergy,WindEnergy,HydroEnergy,OtherRenewableEnergy,TotalRenewableEnergy
count,240.0,240.0,240.0,240.0,240.0,240.0
mean,0.5,0.475078,0.508265,0.454498,0.521916,0.494646
std,0.301594,0.303875,0.292745,0.300615,0.288679,0.186973
min,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.25,0.201862,0.247884,0.164222,0.272914,0.360532
50%,0.5,0.480583,0.527695,0.436345,0.531513,0.512423
75%,0.75,0.741969,0.744841,0.706169,0.787878,0.620486
max,1.0,1.0,1.0,1.0,1.0,1.0


In [60]:
df.min(),df.max()

(Year                    0.0
 SolarEnergy             0.0
 WindEnergy              0.0
 HydroEnergy             0.0
 OtherRenewableEnergy    0.0
 TotalRenewableEnergy    0.0
 dtype: float64,
 Year                    1.0
 SolarEnergy             1.0
 WindEnergy              1.0
 HydroEnergy             1.0
 OtherRenewableEnergy    1.0
 TotalRenewableEnergy    1.0
 dtype: float64)

In [61]:
# Map the scaled values back to the original units using the min/max learned during fit
df[:] = scaler.inverse_transform(df)

In [62]:
df.describe()

Unnamed: 0,Year,SolarEnergy,WindEnergy,HydroEnergy,OtherRenewableEnergy,TotalRenewableEnergy
count,240.0,240.0,240.0,240.0,240.0,240.0
mean,2011.5,528.523858,857.13326,1076.581975,287.127554,2749.366647
std,6.936653,271.183089,375.020314,499.981598,128.460792,695.126957
min,2000.0,104.555425,206.02163,320.662607,54.876943,910.381025
25%,2005.75,284.700505,523.572495,593.796081,176.322725,2250.759951
50%,2011.5,533.436429,882.024084,1046.39038,291.398276,2815.458943
75%,2017.25,766.701662,1160.199295,1495.160715,405.479393,3217.212712
max,2023.0,996.973153,1487.070005,1983.858741,499.872953,4628.164753


In [63]:
df.min(),df.max()

(Year                    2000.000000
 SolarEnergy              104.555425
 WindEnergy               206.021630
 HydroEnergy              320.662607
 OtherRenewableEnergy      54.876943
 TotalRenewableEnergy     910.381025
 dtype: float64,
 Year                    2023.000000
 SolarEnergy              996.973153
 WindEnergy              1487.070005
 HydroEnergy             1983.858741
 OtherRenewableEnergy     499.872953
 TotalRenewableEnergy    4628.164753
 dtype: float64)
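
The round trip works because the fitted scaler remembers the per-column minimum and maximum. A minimal sketch on a small hypothetical array (not the dataset above):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Small hypothetical dataset: two features on different scales
X = np.array([[2000.0, 437.1],
              [2001.0, 240.4],
              [2002.0, 641.0],
              [2003.0, 849.2]])

scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)              # each column now spans [0, 1]
X_restored = scaler.inverse_transform(X_scaled)

# The fitted scaler stores the per-column minimum and maximum it learned
print(scaler.data_min_, scaler.data_max_)
print(np.allclose(X, X_restored))  # True
```

The restored values match up to floating-point rounding, which is why the check uses `np.allclose` and not exact equality.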

# Normalization with different feature range

In [36]:
scaler2=MinMaxScaler(feature_range=(1,1.5))

In [37]:
# Rescale every column to the custom range [1, 1.5]
df[:] = scaler2.fit_transform(df)

In [38]:
df.describe()

Unnamed: 0,Year,SolarEnergy,WindEnergy,HydroEnergy,OtherRenewableEnergy,TotalRenewableEnergy
count,240.0,240.0,240.0,240.0,240.0,240.0
mean,1.25,1.237539,1.254132,1.227249,1.260958,1.247323
std,0.150797,0.151937,0.146372,0.150307,0.144339,0.093487
min,1.0,1.0,1.0,1.0,1.0,1.0
25%,1.125,1.100931,1.123942,1.082111,1.136457,1.180266
50%,1.25,1.240292,1.263847,1.218173,1.265757,1.256212
75%,1.375,1.370984,1.372421,1.353085,1.393939,1.310243
max,1.5,1.5,1.5,1.5,1.5,1.5
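
A custom `feature_range=(a, b)` amounts to scaling to [0, 1] first and then stretching and shifting: x' = a + (x - min) / (max - min) * (b - a). The sketch below checks this against `MinMaxScaler` on a hypothetical single-column array.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical single feature
X = np.array([[104.6], [437.1], [641.0], [997.0]])

a, b = 1.0, 1.5  # target range, matching feature_range=(1, 1.5) above

# Manual version: scale to [0, 1] first, then stretch and shift to [a, b]
X01 = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
X_manual = a + X01 * (b - a)

X_sklearn = MinMaxScaler(feature_range=(a, b)).fit_transform(X)
print(np.allclose(X_manual, X_sklearn))  # True

# A value of 0.372618 on the [0, 1] scale maps to about 1.186309 on [1, 1.5]
print(round(a + 0.372618 * (b - a), 6))  # 1.186309
```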


In [39]:
scaler2.get_feature_names_out()


array(['Year', 'SolarEnergy', 'WindEnergy', 'HydroEnergy',
       'OtherRenewableEnergy', 'TotalRenewableEnergy'], dtype=object)

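In [44]:
# inverse_transform returns a new array; the result is not assigned, so df is unchanged.
# Note: the values printed below are still in [1, 1.5], not in the original units. This is
# consistent with scaler2 having been fitted on data that was already in that range
# (the cell numbers in this section suggest the cells were run out of order).
scaler2.inverse_transform(df)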

array([[1.        , 1.18630887, 1.48003924, 1.36788418, 1.29721545,
        1.38012159],
       [1.02173913, 1.07611982, 1.07680086, 1.02347274, 1.4324784 ,
        1.07684014],
       [1.04347826, 1.30055885, 1.35692357, 1.00430828, 1.48492849,
        1.2249418 ],
       ...,
       [1.45652174, 1.43810517, 1.30077529, 1.30069095, 1.33077775,
        1.35275078],
       [1.47826087, 1.08587835, 1.46161953, 1.20780691, 1.1882437 ,
        1.2650115 ],
       [1.5       , 1.25911101, 1.02148011, 1.07876976, 1.36768665,
        1.11868597]])
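
The array above is still in the range [1, 1.5]. One situation that produces exactly this behavior, offered as a possible explanation and not a confirmed diagnosis of this notebook, is fitting the scaler a second time on data that has already been scaled. The sketch below reproduces the effect on a small hypothetical array.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[2000.0, 437.1],
              [2011.0, 240.4],
              [2023.0, 849.2]])

# First fit: learns the original min/max, so inverse_transform restores X
first = MinMaxScaler(feature_range=(1, 1.5))
X_scaled = first.fit_transform(X)
print(np.allclose(first.inverse_transform(X_scaled), X))  # True

# Refit on already-scaled data: the learned min/max are 1 and 1.5,
# so both transform and inverse_transform leave the values unchanged
second = MinMaxScaler(feature_range=(1, 1.5))
second.fit(X_scaled)
print(np.allclose(second.inverse_transform(X_scaled), X_scaled))  # True
```

To get back to the original units, `inverse_transform` has to be called on a scaler that was fitted on the unscaled data.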

In [45]:
df.head()

Unnamed: 0,Year,SolarEnergy,WindEnergy,HydroEnergy,OtherRenewableEnergy,TotalRenewableEnergy
0,1.0,1.186309,1.480039,1.367884,1.297215,1.380122
1,1.021739,1.07612,1.076801,1.023473,1.432478,1.07684
2,1.043478,1.300559,1.356924,1.004308,1.484928,1.224942
3,1.065217,1.417205,1.10539,1.086713,1.087254,1.155535
4,1.086957,1.150861,1.26391,1.21454,1.141772,1.209934


In [42]:
df.min(),df.max()

(Year                    1.0
 SolarEnergy             1.0
 WindEnergy              1.0
 HydroEnergy             1.0
 OtherRenewableEnergy    1.0
 TotalRenewableEnergy    1.0
 dtype: float64,
 Year                    1.5
 SolarEnergy             1.5
 WindEnergy              1.5
 HydroEnergy             1.5
 OtherRenewableEnergy    1.5
 TotalRenewableEnergy    1.5
 dtype: float64)

# Standardization

In [64]:
from sklearn.preprocessing import StandardScaler

In [65]:
std=StandardScaler()

In [66]:
# Standardize every column to zero mean and unit variance (z-scores)
df[:] = std.fit_transform(df)

In [67]:
df.describe()

Unnamed: 0,Year,SolarEnergy,WindEnergy,HydroEnergy,OtherRenewableEnergy,TotalRenewableEnergy
count,240.0,240.0,240.0,240.0,240.0,240.0
mean,-1.361874e-15,2.775558e-16,-2.035409e-16,1.036208e-16,-3.182639e-16,4.810966e-16
std,1.00209,1.00209,1.00209,1.00209,1.00209,1.00209
min,-1.661325,-1.56667,-1.739832,-1.515054,-1.811728,-2.651068
25%,-0.8306624,-0.9009887,-0.891306,-0.9676253,-0.8643602,-0.7187877
50%,1.64313e-14,0.01815319,0.06651064,-0.06051161,0.03331481,0.09527816
75%,0.8306624,0.880127,0.8098212,0.8389379,0.9232325,0.6744434
max,1.661325,1.731038,1.683251,1.818413,1.659573,2.708461


In [68]:
df.min(),df.max()

(Year                   -1.661325
 SolarEnergy            -1.566670
 WindEnergy             -1.739832
 HydroEnergy            -1.515054
 OtherRenewableEnergy   -1.811728
 TotalRenewableEnergy   -2.651068
 dtype: float64,
 Year                    1.661325
 SolarEnergy             1.731038
 WindEnergy              1.683251
 HydroEnergy             1.818413
 OtherRenewableEnergy    1.659573
 TotalRenewableEnergy    2.708461
 dtype: float64)

In [69]:
df.mean()

Year                   -1.361874e-15
SolarEnergy             2.775558e-16
WindEnergy             -2.035409e-16
HydroEnergy             1.036208e-16
OtherRenewableEnergy   -3.182639e-16
TotalRenewableEnergy    4.810966e-16
dtype: float64

In [70]:
df.std()

Year                    1.00209
SolarEnergy             1.00209
WindEnergy              1.00209
HydroEnergy             1.00209
OtherRenewableEnergy    1.00209
TotalRenewableEnergy    1.00209
dtype: float64
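
The standard deviation shows up as 1.00209 and not exactly 1 because `StandardScaler` divides by the population standard deviation (`ddof=0`), while pandas `std()` and `describe()` report the sample standard deviation (`ddof=1`) by default. With 240 rows the ratio between the two is sqrt(240 / 239). The sketch below uses randomly generated data with the same number of rows to show the effect.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.uniform(100.0, 1000.0, size=(240, 1))  # hypothetical data, 240 rows as above

Z = StandardScaler().fit_transform(X)

pop_std = Z.std(ddof=0)     # population std, the convention StandardScaler uses
sample_std = Z.std(ddof=1)  # sample std, the pandas default for DataFrame.std()

print(round(pop_std, 5))             # 1.0
print(round(sample_std, 5))          # 1.00209
print(round(np.sqrt(240 / 239), 5))  # 1.00209
```

Calling `df.std(ddof=0)` on the standardized frame would therefore report values of 1 up to rounding. The means of order 1e-15 or 1e-16 shown above are likewise zero up to floating-point rounding.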