## **Theoretical Task**

### T3.1 - The 𝑅2 score can sometimes be 0, 1, +inf, or negative. Explain why we may get each of these values (1000 words maximum) (5%).

The R² score is a statistical measure used to assess the goodness of fit of a regression model. It gives the proportion of the variance in the dependent variable that is explained by the independent variables. It usually ranges from 0 to 1, but it can also be negative, and it becomes infinite or undefined when the formula breaks down.  
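
For reference, the standard definition of the score can be written as follows, where $y_i$ are the observed values, $\hat{y}_i$ the predictions, $\bar{y}$ the mean of the observed values, SSR the sum of squared residuals, and SST the total sum of squares:

$$
R^2 = 1 - \frac{\mathrm{SSR}}{\mathrm{SST}}, \qquad
\mathrm{SSR} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2, \qquad
\mathrm{SST} = \sum_{i=1}^{n} (y_i - \bar{y})^2
$$

Each value of R² corresponds to a particular relationship between SSR and SST.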

Here is why each of these values can occur:

1. R² = 0 :

	R² is 0 when the model explains none of the variance in the target variable. The model then performs no better than a horizontal line at the mean of the target variable: the sum of squared residuals (SSR) equals the total sum of squares (SST), so R² = 1 - 1 = 0.

2. R²  = 1 :

	R² is 1 when the model fits the data perfectly and explains 100% of the variance, i.e. every residual is zero. This can happen when there is no noise in the data or, on the training set, when the model is overfitted.

3. R²  = +inf:

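	R² becomes infinite when the formula breaks down mathematically. This happens when the target variable has no variance, i.e. the score is evaluated on a constant y: SST is then 0 and the ratio SSR/SST is a division by zero. Strictly, with R² = 1 - SSR/SST the result diverges to -inf when the predictions are not exact and is undefined (0/0) when they are; a value of +inf can only appear under an alternative form such as the explained sum of squares divided by SST. Some implementations, scikit-learn's `r2_score` among them, return a finite value in this case by default.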

4. R² = negative:

	R² is negative when the model is worse than a horizontal line at the mean of the target. This happens when the sum of squared residuals (SSR) exceeds the total sum of squares (SST), for example when the model is extremely poor or is evaluated on data different from the data it was trained on.
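
The four cases can be reproduced with a minimal sketch, assuming the standard definition R² = 1 - SSR/SST computed directly with NumPy. The helper `r2_manual` and the toy arrays are illustrative only and are not part of the assignment data.

```python
import numpy as np

def r2_manual(y_true, y_pred):
    """Compute R^2 = 1 - SSR/SST directly, with no special-casing."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ssr = np.sum((y_true - y_pred) ** 2)         # sum of squared residuals
    sst = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    # Suppress the warnings NumPy raises for x/0 and 0/0 so the raw value is visible.
    with np.errstate(divide="ignore", invalid="ignore"):
        return 1.0 - np.divide(ssr, sst)

y = np.array([1.0, 2.0, 3.0, 4.0])

r2_mean = r2_manual(y, np.full_like(y, y.mean()))  # always predict the mean
r2_perfect = r2_manual(y, y)                       # predictions equal the targets
r2_bad = r2_manual(y, y[::-1])                     # predictions in reversed order

# Constant target: SST is 0, so the ratio SSR/SST is a division by zero.
y_const = np.array([2.0, 2.0, 2.0])
r2_const = r2_manual(y_const, [1.0, 2.0, 3.0])

print(r2_mean, r2_perfect, r2_bad, r2_const)
```

Under this definition the constant-target case evaluates to `-inf` rather than `+inf` when SSR > 0, and to `nan` when SSR is also 0; library implementations may special-case it.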


Conclusion

The 𝑅² score helps evaluate regression models, but extreme values indicate special cases:  
- 𝑅² = 0: The model is no better than predicting the mean.  
- 𝑅² = 1: The model fits perfectly (may indicate overfitting).  
- 𝑅² = +∞: The target variable has no variance, so the formula divides by zero (the score is not meaningful).  
- 𝑅² = negative: The model performs worse than a horizontal line (likely misspecified).  


In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

In [None]:
data = pd.read_csv('CarSharing.csv')

In [None]:
data

Unnamed: 0,id,timestamp,season,holiday,workingday,weather,temp,temp_feel,humidity,windspeed,demand
0,1,2017-01-01 00:00:00,spring,No,No,Clear or partly cloudy,9.84,14.395,81.0,0.0000,2.772589
1,2,2017-01-01 01:00:00,spring,No,No,Clear or partly cloudy,9.02,13.635,80.0,0.0000,3.688879
2,3,2017-01-01 02:00:00,spring,No,No,Clear or partly cloudy,9.02,13.635,80.0,0.0000,3.465736
3,4,2017-01-01 03:00:00,spring,No,No,Clear or partly cloudy,9.84,14.395,75.0,0.0000,2.564949
4,5,2017-01-01 04:00:00,spring,No,No,Clear or partly cloudy,9.84,14.395,75.0,0.0000,0.000000
...,...,...,...,...,...,...,...,...,...,...,...
8703,8704,2018-08-05 00:00:00,fall,No,No,Clear or partly cloudy,30.34,34.850,70.0,19.0012,5.030438
8704,8705,2018-08-05 01:00:00,fall,No,No,Clear or partly cloudy,30.34,34.850,70.0,16.9979,4.465908
8705,8706,2018-08-05 02:00:00,fall,No,No,Clear or partly cloudy,30.34,34.850,70.0,19.9995,4.290459
8706,8707,2018-08-05 03:00:00,fall,No,No,Clear or partly cloudy,29.52,34.850,74.0,16.9979,3.713572


In [None]:
data.isna().sum()

Unnamed: 0,0
id,0
timestamp,0
season,0
holiday,0
workingday,0
weather,0
temp,1202
temp_feel,102
humidity,39
windspeed,200


In [None]:
# Drop every row that has a missing value in any column
data.dropna(inplace=True)

In [None]:
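# Mean-impute the numeric columns. After the dropna() call above no missing
# values remain, so this step currently has no effect.
numeric_cols = ['temp', 'temp_feel', 'humidity', 'windspeed']
data[numeric_cols] = data[numeric_cols].fillna(data[numeric_cols].mean())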

In [None]:
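# Features: every column except the target. This keeps 'id' and 'timestamp';
# 'timestamp' is read as strings by default, so it is one-hot encoded below.
x = data.drop('demand', axis=1)
y = data['demand']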

In [None]:
categorical_cols = x.select_dtypes(include=['object']).columns.tolist()
numerical_cols = x.select_dtypes(include=['int64', 'float64']).columns.tolist()

In [None]:
ct = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numerical_cols),
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_cols)
    ]
)

In [None]:
x

Unnamed: 0,id,timestamp,season,holiday,workingday,weather,temp,temp_feel,humidity,windspeed
0,1,2017-01-01 00:00:00,spring,No,No,Clear or partly cloudy,9.84,14.395,81.0,0.0000
1,2,2017-01-01 01:00:00,spring,No,No,Clear or partly cloudy,9.02,13.635,80.0,0.0000
2,3,2017-01-01 02:00:00,spring,No,No,Clear or partly cloudy,9.02,13.635,80.0,0.0000
3,4,2017-01-01 03:00:00,spring,No,No,Clear or partly cloudy,9.84,14.395,75.0,0.0000
4,5,2017-01-01 04:00:00,spring,No,No,Clear or partly cloudy,9.84,14.395,75.0,0.0000
...,...,...,...,...,...,...,...,...,...,...
8703,8704,2018-08-05 00:00:00,fall,No,No,Clear or partly cloudy,30.34,34.850,70.0,19.0012
8704,8705,2018-08-05 01:00:00,fall,No,No,Clear or partly cloudy,30.34,34.850,70.0,16.9979
8705,8706,2018-08-05 02:00:00,fall,No,No,Clear or partly cloudy,30.34,34.850,70.0,19.9995
8706,8707,2018-08-05 03:00:00,fall,No,No,Clear or partly cloudy,29.52,34.850,74.0,16.9979


In [None]:
# Linear Regression pipeline
lr_pipeline = Pipeline(steps=[
    ('preprocessor', ct),
    ('regressor', LinearRegression())
])

In [None]:
alphas = [0.1, 1.0, 10.0, 50.0, 100.0]
ridge_results = []

In [None]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

In [None]:
lr_pipeline.fit(x_train, y_train)
y_pred_lr = lr_pipeline.predict(x_test)
lr_metrics = {
    'model': 'LinearRegression',
    'MSE': mean_squared_error(y_test, y_pred_lr),
    'MAE': mean_absolute_error(y_test, y_pred_lr),
    'R2': r2_score(y_test, y_pred_lr)
}

In [None]:
for alpha in alphas:
    ridge_pipeline = Pipeline(steps=[
        ('preprocessor', ct),
        ('regressor', Ridge(alpha=alpha))
    ])
    ridge_pipeline.fit(x_train, y_train)
    y_pred_ridge = ridge_pipeline.predict(x_test)
    ridge_results.append({
        'model': f'Ridge(alpha={alpha})',
        'MSE': mean_squared_error(y_test, y_pred_ridge),
        'MAE': mean_absolute_error(y_test, y_pred_ridge),
        'R2': r2_score(y_test, y_pred_ridge)
    })

In [None]:
lr_metrics, ridge_results

({'model': 'LinearRegression',
  'MSE': 1.6475560717435789,
  'MAE': 0.9861450316196507,
  'R2': 0.2954394404839512},
 [{'model': 'Ridge(alpha=0.1)',
   'MSE': 1.647557883446637,
   'MAE': 0.98614439926273,
   'R2': 0.29543866572760624},
  {'model': 'Ridge(alpha=1.0)',
   'MSE': 1.6475815648784593,
   'MAE': 0.986137585195831,
   'R2': 0.295428538604687},
  {'model': 'Ridge(alpha=10.0)',
   'MSE': 1.6477899181093902,
   'MAE': 0.9861448527328377,
   'R2': 0.2953394384693535},
  {'model': 'Ridge(alpha=50.0)',
   'MSE': 1.6492891988721725,
   'MAE': 0.9865680470168231,
   'R2': 0.2946982863342523},
  {'model': 'Ridge(alpha=100.0)',
   'MSE': 1.6515844946897122,
   'MAE': 0.9873135667536121,
   'R2': 0.29371672647526115}])
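
The nested output above is hard to compare by eye. One way to line the models up, sketched here with the metric values copied from that output (the names `results_df` and `best_model` are illustrative), is to collect them in a single DataFrame:

```python
import pandas as pd

# Metric values copied verbatim from the notebook output above.
lr_metrics = {'model': 'LinearRegression', 'MSE': 1.6475560717435789,
              'MAE': 0.9861450316196507, 'R2': 0.2954394404839512}
ridge_results = [
    {'model': 'Ridge(alpha=0.1)', 'MSE': 1.647557883446637,
     'MAE': 0.98614439926273, 'R2': 0.29543866572760624},
    {'model': 'Ridge(alpha=1.0)', 'MSE': 1.6475815648784593,
     'MAE': 0.986137585195831, 'R2': 0.295428538604687},
    {'model': 'Ridge(alpha=10.0)', 'MSE': 1.6477899181093902,
     'MAE': 0.9861448527328377, 'R2': 0.2953394384693535},
    {'model': 'Ridge(alpha=50.0)', 'MSE': 1.6492891988721725,
     'MAE': 0.9865680470168231, 'R2': 0.2946982863342523},
    {'model': 'Ridge(alpha=100.0)', 'MSE': 1.6515844946897122,
     'MAE': 0.9873135667536121, 'R2': 0.29371672647526115},
]

# One row per model, indexed by model name.
results_df = pd.DataFrame([lr_metrics] + ridge_results).set_index('model')

# Model with the highest R2 on the test split.
best_model = results_df['R2'].idxmax()

print(results_df.round(4))
print('Highest R2:', best_model)
```

On these numbers the models are nearly indistinguishable: R² varies only in the third decimal place across all values of alpha, so the regularisation strength has little effect on the test-set fit.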