__Centering and Scaling__

In [18]:
# import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, cross_val_score, KFold
from sklearn.linear_model import LinearRegression, LogisticRegression, Ridge, Lasso
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# suppress warnings
import warnings
warnings.filterwarnings('ignore')

In [19]:
# Load the dataset
df = pd.read_csv("weather_data.csv")
df.head()

Unnamed: 0,Day,Temperature_C,Humidity_pct,Rainfall_mm
0,1,18.2,65.0,0.0
1,2,18.3,64.9,0.0
2,3,18.4,64.8,0.0
3,4,18.5,64.7,0.0
4,5,18.6,64.6,0.0


In [20]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 4 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Day            200 non-null    int64  
 1   Temperature_C  200 non-null    float64
 2   Humidity_pct   200 non-null    float64
 3   Rainfall_mm    200 non-null    float64
dtypes: float64(3), int64(1)
memory usage: 6.4 KB


In [21]:
# drop day column
df = df.drop("Day", axis=1)

In [22]:
# Statistical summary
df.describe()

Unnamed: 0,Temperature_C,Humidity_pct,Rainfall_mm
count,200.0,200.0,200.0
mean,28.15,55.05,0.0
std,5.787918,5.787918,0.0
min,18.2,45.1,0.0
25%,23.175,50.075,0.0
50%,28.15,55.05,0.0
75%,33.125,60.025,0.0
max,38.1,65.0,0.0


Although the value ranges look small numerically, they are large when translated to real-world conditions. For example, the minimum temperature is 18.2 °C (quite cold for sub-Saharan Africa, where it will hardly rain at that temperature), while the maximum temperature is 38.1 °C, which is an average temperature for rain.


__Why scale the data__
1. Many models use some form of distance between data points to inform their predictions
2. Features on larger scales can disproportionately influence the model
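
The two points above can be illustrated with a small sketch. The readings below are hypothetical (they are not taken from `weather_data.csv`), and the pressure column is an assumption added only to give a feature on a much larger scale than temperature. On the raw values, the Euclidean distance is driven almost entirely by pressure; after standardizing each column, a 20 °C gap and a 900 Pa gap contribute about equally.

```python
import numpy as np

# Hypothetical readings (assumption, not from weather_data.csv):
# column 0 is temperature in °C, column 1 is air pressure in Pa.
X_demo = np.array([
    [18.0, 101_000.0],
    [38.0, 101_050.0],
    [19.0, 101_900.0],
])

def euclidean(a, b):
    """Euclidean distance between two 1-D arrays."""
    return float(np.sqrt(np.sum((a - b) ** 2)))

# Raw distances from row 0: the pressure difference swamps temperature.
d_raw_01 = euclidean(X_demo[0], X_demo[1])  # 20 °C and 50 Pa apart
d_raw_02 = euclidean(X_demo[0], X_demo[2])  # 1 °C and 900 Pa apart

# Standardize each column (subtract its mean, divide by its standard deviation).
X_demo_std = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)
d_std_01 = euclidean(X_demo_std[0], X_demo_std[1])
d_std_02 = euclidean(X_demo_std[0], X_demo_std[2])

print(f"raw:    d(0,1)={d_raw_01:.2f}  d(0,2)={d_raw_02:.2f}")
print(f"scaled: d(0,1)={d_std_01:.2f}  d(0,2)={d_std_02:.2f}")
```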

__How to scale data__
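1. Subtract the mean and divide by the standard deviation (standardization), so each feature has mean 0 and standard deviation 1
2. Subtract the minimum and divide by the range (min-max scaling), so each feature has minimum 0 and maximum 1
3. Normalize the data so that it ranges from -1 to +1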
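
The three options above can be written out directly with NumPy. This is a minimal sketch on five temperature values taken from the statistical summary (minimum, quartiles, and maximum); the variable names are illustrative. Note that `np.std` defaults to the population standard deviation, which is also what `StandardScaler` uses, whereas pandas `.std()` defaults to the sample standard deviation.

```python
import numpy as np

# Temperature values from df.describe(): min, 25%, 50%, 75%, max
x = np.array([18.2, 23.175, 28.15, 33.125, 38.1])

# 1. Standardization: subtract the mean, divide by the standard deviation
x_standard = (x - x.mean()) / x.std()

# 2. Min-max scaling: subtract the minimum, divide by the range -> [0, 1]
x_minmax = (x - x.min()) / (x.max() - x.min())

# 3. Rescale the min-max result from [0, 1] to [-1, 1]
x_centered = 2 * x_minmax - 1

print("standardized:", x_standard.round(3))
print("min-max:     ", x_minmax.round(3))
print("[-1, 1]:     ", x_centered.round(3))
```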

In [23]:
# Separate the features (X) from the target (y)
X = df.drop("Rainfall_mm", axis=1)
y = df["Rainfall_mm"]

In [24]:
# Split the data to train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [25]:
# Instantiate scaler
scaler = StandardScaler()

# Fit the scaler on the training set only, then apply the same transform to the test set
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# mean and standard deviation
print(f"Mean_scaled: {X_train_scaled.mean()} Std_scaled: {X_train_scaled.std()}")
print(f"Mean: {X.mean()} Std: {X.std()}")

Mean_scaled: -1.1352030426792225e-15 Std_scaled: 1.0
Mean: Temperature_C    28.15
Humidity_pct     55.05
dtype: float64 Std: Temperature_C    5.787918
Humidity_pct     5.787918
dtype: float64
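
Fitting the scaler on the training set and reusing it on the test set can also be handled by a `Pipeline`, which refits the scaler inside every cross-validation fold so that no held-out data leaks into the scaling statistics. The sketch below is one possible arrangement, not part of the original analysis: the data is synthetic (an assumption, generated to resemble the temperature and humidity columns), and `Ridge` is chosen only because it is already imported at the top of the notebook.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): two features and a noisy linear target
rng = np.random.default_rng(42)
X_demo = np.column_stack([
    rng.normal(28, 6, size=200),   # temperature-like feature
    rng.normal(55, 6, size=200),   # humidity-like feature
])
y_demo = 0.5 * X_demo[:, 0] - 0.2 * X_demo[:, 1] + rng.normal(0, 1, size=200)

# Scaling and the model are chained, so the scaler is fit on each
# training fold only and then applied to that fold's held-out data.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", Ridge(alpha=1.0)),
])

kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipe, X_demo, y_demo, cv=kf)  # R^2 per fold

print("R^2 per fold:", scores.round(3))
print("Mean R^2:", scores.mean().round(3))
```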
