# Standardization

What is **Standardization**? It is the rescaling of each feature so that it has zero mean and unit variance.

* **Features** - variables or attributes in the data.
* **Mean** - average value of a feature across the dataset.
* **Outliers** - observations far outside the normal range.
* **Gaussian** - the normal distribution, with its bell-shaped curve.
  
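  **Benefits:**
  
  **a.** *Limiting the distortion caused by outliers*: standardization centers each feature and does not squeeze it into a fixed range, so one extreme value compresses the remaining values less than range-based scaling does. The mean and standard deviation are still influenced by outliers, so the effect is reduced, not removed.
  
  **b.** *Allowing direct comparison of model coefficients*: once all features are on the same scale, their coefficients can be compared directly.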

  > ***X_standardized = (x - m) / sd***, where *m* is the mean of the feature and *sd* is its standard deviation.
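
To make the formula concrete, here is a minimal sketch that applies it by hand with Python's standard library. The `standardize` helper and the `home_runs` values are hypothetical and exist only for illustration. The sketch assumes the population standard deviation (dividing by n), which is also what scikit-learn's `StandardScaler` uses.

```python
from statistics import mean, pstdev


def standardize(values):
    """Return z-scores (x - m) / sd for a list of numbers."""
    m = mean(values)
    sd = pstdev(values)  # population SD (divides by n): an assumption of this sketch
    return [(x - m) / sd for x in values]


# Hypothetical home-run totals, chosen only to keep the arithmetic easy.
home_runs = [100, 200, 300, 400, 500]
z = standardize(home_runs)

print([round(v, 3) for v in z])  # [-1.414, -0.707, 0.0, 0.707, 1.414]
```

After the transformation the values are centered on zero and their population standard deviation is 1, which is what "zero mean and unit variance" means in practice.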

In [2]:
import pandas as pd
import numpy as np

In [3]:
df = pd.read_csv('/kaggle/input/dataset/500hits.csv',encoding='latin-1')

In [4]:
df.head()

Unnamed: 0,PLAYER,YRS,G,AB,R,H,2B,3B,HR,RBI,BB,SO,SB,CS,BA,HOF
0,Ty Cobb,24,3035,11434,2246,4189,724,295,117,726,1249,357,892,178,0.366,1
1,Stan Musial,22,3026,10972,1949,3630,725,177,475,1951,1599,696,78,31,0.331,1
2,Tris Speaker,22,2789,10195,1882,3514,792,222,117,724,1381,220,432,129,0.345,1
3,Derek Jeter,20,2747,11195,1923,3465,544,66,260,1311,1082,1840,358,97,0.31,1
4,Honus Wagner,21,2792,10430,1736,3430,640,252,101,0,963,327,722,15,0.329,1


In [11]:
df=df.drop(columns=['PLAYER','CS'])

In [12]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 465 entries, 0 to 464
Data columns (total 14 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   YRS     465 non-null    int64  
 1   G       465 non-null    int64  
 2   AB      465 non-null    int64  
 3   R       465 non-null    int64  
 4   H       465 non-null    int64  
 5   2B      465 non-null    int64  
 6   3B      465 non-null    int64  
 7   HR      465 non-null    int64  
 8   RBI     465 non-null    int64  
 9   BB      465 non-null    int64  
 10  SO      465 non-null    int64  
 11  SB      465 non-null    int64  
 12  BA      465 non-null    float64
 13  HOF     465 non-null    int64  
dtypes: float64(1), int64(13)
memory usage: 51.0 KB


In [13]:
df.describe().round(3)

Unnamed: 0,YRS,G,AB,R,H,2B,3B,HR,RBI,BB,SO,SB,BA,HOF
count,465.0,465.0,465.0,465.0,465.0,465.0,465.0,465.0,465.0,465.0,465.0,465.0,465.0,465.0
mean,17.049,2048.699,7511.456,1150.314,2170.247,380.953,78.555,201.049,894.26,783.561,847.471,195.905,0.289,0.329
std,2.765,354.392,1294.066,289.635,424.191,96.483,49.363,143.623,486.193,327.432,489.224,181.846,0.021,0.475
min,11.0,1331.0,4981.0,601.0,1660.0,177.0,3.0,9.0,0.0,239.0,0.0,7.0,0.246,0.0
25%,15.0,1802.0,6523.0,936.0,1838.0,312.0,41.0,79.0,640.0,535.0,436.0,63.0,0.273,0.0
50%,17.0,1993.0,7241.0,1104.0,2076.0,366.0,67.0,178.0,968.0,736.0,825.0,137.0,0.287,0.0
75%,19.0,2247.0,8180.0,1296.0,2375.0,436.0,107.0,292.0,1206.0,955.0,1226.0,285.0,0.3,1.0
max,26.0,3308.0,12364.0,2295.0,4189.0,792.0,309.0,755.0,2297.0,2190.0,2597.0,1406.0,0.366,2.0


In [15]:
X1=df.iloc[:,0:13]

In [16]:
X2=df.iloc[:,0:13]

In [17]:
from sklearn.preprocessing import StandardScaler

In [22]:
scaleStandard = StandardScaler()


In [24]:
X1=scaleStandard.fit_transform(X1)

In [29]:
X1=pd.DataFrame(X1,columns=['YRS','G','AB','R','H','2B','3B','HR','RBI','BB','SO','SB','BA'])
X1['HOF']=None
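
The cell above retypes the column names by hand and fills `HOF` with `None`, so the original labels are not carried into the scaled frame. A possible alternative, sketched below on a small made-up frame, reuses the existing column names and index and copies the real labels across. `df_demo` and its values are hypothetical stand-ins, not rows from the notebook's dataset.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical miniature stand-in for the notebook's df (values are made up).
df_demo = pd.DataFrame({
    "YRS": [24, 22, 20, 18],
    "BA": [0.366, 0.331, 0.310, 0.290],
    "HOF": [1, 1, 1, 0],
})

feature_cols = df_demo.columns.drop("HOF")  # every column except the label
scaler = StandardScaler()

# Reuse the original column names and index instead of retyping them.
scaled = pd.DataFrame(
    scaler.fit_transform(df_demo[feature_cols]),
    columns=feature_cols,
    index=df_demo.index,
)
scaled["HOF"] = df_demo["HOF"]  # carry the real labels over, unscaled

print(scaled.round(3))
```

Keeping the label column out of `fit_transform` matters: the target should stay in its original 0/1 coding and not be standardized along with the features.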

In [30]:
X1.head()

Unnamed: 0,YRS,G,AB,R,H,2B,3B,HR,RBI,BB,SO,SB,BA,HOF
0,2.516295,2.786078,3.034442,3.787062,4.764193,3.559333,4.389485,-0.585841,-0.346449,1.423013,-1.003628,3.832067,3.64829,
1,1.792237,2.760655,2.677044,2.76053,3.444971,3.569709,1.996457,1.909487,2.175837,2.493089,-0.309948,-0.64908,1.996159,
2,1.792237,2.091184,2.075964,2.528955,3.171214,4.264876,2.909053,-0.585841,-0.350567,1.826585,-1.283965,1.299723,2.657012,
3,1.06818,1.972543,2.849554,2.670665,3.055576,1.691719,-0.254611,0.410896,0.858071,0.912434,2.030966,0.892346,1.004881,
4,1.430208,2.099658,2.257758,2.024329,2.972977,2.68778,3.517449,-0.697364,-1.84129,0.548609,-1.065016,2.896201,1.901752,


In [32]:
X1.describe().round(3)

Unnamed: 0,YRS,G,AB,R,H,2B,3B,HR,RBI,BB,SO,SB,BA
count,465.0,465.0,465.0,465.0,465.0,465.0,465.0,465.0,465.0,465.0,465.0,465.0,465.0
mean,-0.0,0.0,0.0,0.0,0.0,0.0,-0.0,-0.0,0.0,0.0,-0.0,0.0,0.0
std,1.001,1.001,1.001,1.001,1.001,1.001,1.001,1.001,1.001,1.001,1.001,1.001,1.001
min,-2.19,-2.027,-1.958,-1.899,-1.204,-2.116,-1.532,-1.339,-1.841,-1.665,-1.734,-1.04,-2.016
25%,-0.742,-0.697,-0.765,-0.741,-0.784,-0.715,-0.762,-0.851,-0.524,-0.76,-0.842,-0.732,-0.742
50%,-0.018,-0.157,-0.209,-0.16,-0.222,-0.155,-0.234,-0.161,0.152,-0.145,-0.046,-0.324,-0.081
75%,0.706,0.56,0.517,0.504,0.483,0.571,0.577,0.634,0.642,0.524,0.775,0.49,0.533
max,3.24,3.557,3.754,3.956,4.764,4.265,4.673,3.861,2.888,4.3,3.58,6.662,3.648
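
The `std` row above reads 1.001 and not exactly 1. That is expected: `StandardScaler` divides by the population standard deviation (denominator n), while pandas `describe()` reports the sample standard deviation (denominator n - 1). The short check below reproduces the 1.001 figure from the row count alone. The `values` list is a hypothetical column used only to show the same effect on a small sample.

```python
import math
from statistics import pstdev, stdev

n = 465  # number of rows reported by df.info()

# Ratio between the sample SD (divide by n - 1) and the population SD (divide by n).
ratio = math.sqrt(n / (n - 1))
print(round(ratio, 3))  # 1.001

# Same effect on a small hypothetical column.
values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
m = sum(values) / len(values)
z = [(x - m) / pstdev(values) for x in values]

print(round(pstdev(z), 3), round(stdev(z), 3))  # 1.0 1.069
```

The gap shrinks as n grows, which is why it only shows up in the third decimal place for 465 rows.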


# Normalization

### What is Normalization?  
Normalization (min-max scaling) rescales each feature to a common range, typically [0, 1], so that features measured on different scales become directly comparable.  

### Key Points:
- **Unequal Influence**:  
  Features with larger numeric ranges (e.g., weight in kilograms next to height in meters) can dominate the model.  
- **Normalization Goal**:  
  Rescales every feature to the same range, typically [0, 1].  
- **Balances Features**:  
  Puts all features on a comparable scale so that none dominates purely because of its units.  
- **Prevents Bias**:  
  Avoids large-range features overpowering smaller ones.  
- **Improves Performance**:  
  Important for distance-based algorithms such as KNN and SVM, and for models trained with gradient descent.  
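
The rescaling described above is usually written as x' = (x - min) / (max - min), applied to each feature separately. The sketch below implements that formula with plain Python. The `min_max_scale` helper is hypothetical, and the inputs are the min, median, and max of `G` and `BA` taken from the `df.describe()` table earlier in the notebook.

```python
def min_max_scale(values, feature_range=(0.0, 1.0)):
    """Rescale values linearly so the minimum maps to low and the maximum to high."""
    low, high = feature_range
    v_min, v_max = min(values), max(values)
    span = v_max - v_min
    if span == 0:
        # Constant feature: this sketch maps every value to the lower bound.
        return [low for _ in values]
    return [low + (x - v_min) / span * (high - low) for x in values]


# Min, median, and max of two features from df.describe().
games = [1331, 1993, 3308]
batting_avg = [0.246, 0.287, 0.366]

print([round(v, 3) for v in min_max_scale(games)])        # [0.0, 0.335, 1.0]
print([round(v, 3) for v in min_max_scale(batting_avg)])  # [0.0, 0.342, 1.0]
```

Before scaling, `G` spans nearly two thousand units while `BA` spans 0.12. After scaling both run from 0 to 1, so neither dominates a distance calculation because of its units alone.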

In [33]:
from sklearn.preprocessing import MinMaxScaler

In [34]:
scaleMinMax=MinMaxScaler(feature_range=(0,1))

In [36]:
X2 = scaleMinMax.fit_transform(X2)

In [37]:
X2=pd.DataFrame(X2,columns=['YRS','G','AB','R','H','2B','3B','HR','RBI','BB','SO','SB','BA'])
X2['HOF']=None

In [42]:
X2.head()

Unnamed: 0,YRS,G,AB,R,H,2B,3B,HR,RBI,BB,SO,SB,BA,HOF
0,0.866667,0.861912,0.874035,0.971074,1.0,0.889431,0.954248,0.144772,0.316064,0.517683,0.137466,0.632595,1.0,
1,0.733333,0.85736,0.811459,0.79575,0.778964,0.891057,0.568627,0.624665,0.849369,0.697078,0.268002,0.050751,0.708333,
2,0.733333,0.737481,0.706217,0.756198,0.733096,1.0,0.715686,0.144772,0.315194,0.585341,0.084713,0.303788,0.825,
3,0.6,0.716237,0.841663,0.780401,0.713721,0.596748,0.205882,0.336461,0.570744,0.432086,0.70851,0.250893,0.533333,
4,0.666667,0.738998,0.738047,0.670012,0.699881,0.752846,0.813725,0.123324,0.0,0.371092,0.125915,0.511079,0.691667,


In [38]:
X2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 465 entries, 0 to 464
Data columns (total 14 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   YRS     465 non-null    float64
 1   G       465 non-null    float64
 2   AB      465 non-null    float64
 3   R       465 non-null    float64
 4   H       465 non-null    float64
 5   2B      465 non-null    float64
 6   3B      465 non-null    float64
 7   HR      465 non-null    float64
 8   RBI     465 non-null    float64
 9   BB      465 non-null    float64
 10  SO      465 non-null    float64
 11  SB      465 non-null    float64
 12  BA      465 non-null    float64
 13  HOF     0 non-null      object 
dtypes: float64(13), object(1)
memory usage: 51.0+ KB


In [40]:
X2.describe().round(3)

Unnamed: 0,YRS,G,AB,R,H,2B,3B,HR,RBI,BB,SO,SB,BA
count,465.0,465.0,465.0,465.0,465.0,465.0,465.0,465.0,465.0,465.0,465.0,465.0,465.0
mean,0.403,0.363,0.343,0.324,0.202,0.332,0.247,0.257,0.389,0.279,0.326,0.135,0.356
std,0.184,0.179,0.175,0.171,0.168,0.157,0.161,0.193,0.212,0.168,0.188,0.13,0.177
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.267,0.238,0.209,0.198,0.07,0.22,0.124,0.094,0.279,0.152,0.168,0.04,0.225
50%,0.4,0.335,0.306,0.297,0.164,0.307,0.209,0.227,0.421,0.255,0.318,0.093,0.342
75%,0.533,0.463,0.433,0.41,0.283,0.421,0.34,0.379,0.525,0.367,0.472,0.199,0.45
max,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
