# Predicting Bike Rental Usage Using Various Regression Models

## Introduction
- In this notebook, we aim to predict the total rental bike usage (`cnt`) using several regression models. We begin by identifying features that have a strong relationship with the target variable (`cnt`) using Pearson correlation. Then, we implement different regression models to predict `cnt` and evaluate their performance.

## Step 1: Loading and Preparing the Dataset
- **Dataset:** We use the `day.csv` dataset, which includes daily counts of rental bikes and various attributes.
- **Dropping Columns:** The columns `registered` and `casual` are dropped because `cnt` is their sum, so keeping them would leak the target into the features.

In [1]:
import pandas as pd

# Load the dataset
df = pd.read_csv("day.csv")

# Drop 'registered' and 'casual' from the DataFrame
df = df.drop(['registered', 'casual'], axis=1)

# Display the modified data
df.head()

Unnamed: 0,instant,dteday,season,yr,mnth,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,cnt
0,1,2011-01-01,1,0,1,0,6,0,2,0.344167,0.363625,0.805833,0.160446,985
1,2,2011-01-02,1,0,1,0,0,0,2,0.363478,0.353739,0.696087,0.248539,801
2,3,2011-01-03,1,0,1,0,1,1,1,0.196364,0.189405,0.437273,0.248309,1349
3,4,2011-01-04,1,0,1,0,2,1,1,0.2,0.212122,0.590435,0.160296,1562
4,5,2011-01-05,1,0,1,0,3,1,1,0.226957,0.22927,0.436957,0.1869,1600


## Step 2: Feature Selection Using Pearson Correlation
- **Numerical Features:** We compute correlations over a hand-picked list of numeric columns plus the target variable `cnt`.
- **Correlation Threshold:** A feature counts as related when the absolute value of its Pearson correlation with `cnt` exceeds `0.3`. This is a rule of thumb, not a test of statistical significance.
- **Related Features:** Only the features that pass the threshold are used to train the multi-feature models.

In [2]:
# List of numerical features (including the target variable 'cnt')
numerical_features = ['instant', 'season', 'holiday', 'weekday', 'workingday', 
                      'weathersit', 'temp', 'atemp', 'hum', 'windspeed', 'cnt']

# Calculate Pearson correlation with 'cnt'
pearson_corr = df[numerical_features].corr()['cnt'].drop('cnt')
print("Pearson Correlation with cnt:\n", pearson_corr)

# Determine whether each attribute is related to 'cnt'
threshold = 0.3  # You can adjust this threshold based on your criteria
related_features = pearson_corr[abs(pearson_corr) > threshold].index.tolist()

print("\nAttributes related to cnt based on Pearson correlation (threshold > 0.3):")
print(related_features)

Pearson Correlation with cnt:
 instant       0.628830
season        0.406100
holiday      -0.068348
weekday       0.067443
workingday    0.061156
weathersit   -0.297391
temp          0.627494
atemp         0.631066
hum          -0.100659
windspeed    -0.234545
Name: cnt, dtype: float64

Attributes related to cnt based on Pearson correlation (threshold > 0.3):
['instant', 'season', 'temp', 'atemp']

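As a rough illustration of what the cell above computes, the sketch below implements the Pearson coefficient by hand and applies the same absolute-value threshold to the correlations printed above (rounded to four decimals). The helper names `pearson_r` and `select_related` are ours for illustration and are not part of pandas.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def select_related(correlations, threshold=0.3):
    """Keep features whose absolute correlation exceeds the threshold."""
    return [name for name, r in correlations.items() if abs(r) > threshold]

# Correlations with 'cnt' as printed above, rounded to four decimals
correlations = {
    "instant": 0.6288, "season": 0.4061, "holiday": -0.0683,
    "weekday": 0.0674, "workingday": 0.0612, "weathersit": -0.2974,
    "temp": 0.6275, "atemp": 0.6311, "hum": -0.1007, "windspeed": -0.2345,
}

related = select_related(correlations)
print(related)  # ['instant', 'season', 'temp', 'atemp']
```

`weathersit` (-0.2974) narrowly misses the 0.3 cutoff, so a slightly lower threshold would add it to the selected features.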

## Step 3: Splitting the Data into Training and Testing Sets
- **Feature Matrix (X):** Excludes the target variable `cnt`.
- **Target Variable (y):** The `cnt` column, representing the total rental bike usage.
- **Train-Test Split:** The data is split into training and testing sets with an 80-20 split.

In [3]:
from sklearn.model_selection import train_test_split

# Define the target variable 'y' and features 'X'
y = df['cnt']

# Exclude 'cnt' from the features to create X
X = df.drop('cnt', axis=1)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Display the shape of the training and testing sets
print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
print("y_train shape:", y_train.shape)
print("y_test shape:", y_test.shape)

X_train shape: (584, 13)
X_test shape: (147, 13)
y_train shape: (584,)
y_test shape: (147,)

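The shapes printed above are consistent with rounding the test share up: 731 rows times 0.2 is 146.2, which gives 147 test rows and 584 training rows. Below is a minimal sketch of that arithmetic, assuming it mirrors how `train_test_split` sizes the split when only `test_size` is given. The helper `split_sizes` is hypothetical.

```python
import math

def split_sizes(n_samples, test_size=0.2):
    """Return (n_train, n_test), rounding the test share up."""
    n_test = math.ceil(test_size * n_samples)
    n_train = n_samples - n_test
    return n_train, n_test

# 731 daily rows, as implied by the shapes printed above (584 + 147)
print(split_sizes(731))  # (584, 147)
```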

## 3.1 Simple Linear Regression
- **Model:** Simple Linear Regression using the feature `atemp` (feeling temperature).
- **Training:** The model is trained on `X_train[['atemp']]`.
- **Evaluation:** The model's performance is evaluated using Mean Squared Error (MSE) and R² score.

In [4]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Simple Linear Regression using 'atemp'
simple_model = LinearRegression()
simple_model.fit(X_train[['atemp']], y_train)

# Predictions
y_pred_simple = simple_model.predict(X_test[['atemp']])

# Evaluate the model
mse_simple = mean_squared_error(y_test, y_pred_simple)
r2_simple = r2_score(y_test, y_pred_simple)
print(f"Simple Linear Regression MSE: {mse_simple}")
print(f"Simple Linear Regression R²: {r2_simple}")

Simple Linear Regression MSE: 2359187.322761967
Simple Linear Regression R²: 0.4116567138444909

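To make the two metrics concrete, here is a small sketch that computes MSE and R² from their definitions. The function names are ours, and the sample values are the first five `cnt` values from the table in Step 1. Taking the square root of the MSE gives an error in the same unit as `cnt` (bikes per day), which is often easier to read.

```python
import math

def mse(y_true, y_pred):
    """Mean of the squared prediction errors."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def r2(y_true, y_pred):
    """1 - (residual sum of squares / total sum of squares)."""
    mean_t = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# First five 'cnt' values from the table in Step 1
y_true = [985, 801, 1349, 1562, 1600]

# Always predicting the mean gives R² = 0; a perfect prediction gives R² = 1
mean_pred = [sum(y_true) / len(y_true)] * len(y_true)
print(r2(y_true, mean_pred), r2(y_true, y_true))

# RMSE of the simple linear model, from the MSE printed above
rmse_simple = math.sqrt(2359187.322761967)
print(f"RMSE: {rmse_simple:.0f} bikes per day")
```

An R² of about 0.41 therefore means the single feature `atemp` removes roughly 41% of the squared error that a constant mean prediction would make on the test set.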

## 3.2 Multi Linear Regression
- **Model:** Multi Linear Regression using all related features identified by Pearson correlation.
- **Training:** The model is trained on the selected related features.
- **Evaluation:** The model's performance is evaluated using MSE and R² score.

In [5]:
# Multi Linear Regression using all related features
multi_model = LinearRegression()
multi_model.fit(X_train[related_features], y_train)

# Predictions
y_pred_multi = multi_model.predict(X_test[related_features])

# Evaluate the model
mse_multi = mean_squared_error(y_test, y_pred_multi)
r2_multi = r2_score(y_test, y_pred_multi)
print(f"Multi Linear Regression MSE: {mse_multi}")
print(f"Multi Linear Regression R²: {r2_multi}")

Multi Linear Regression MSE: 1209657.3093978039
Multi Linear Regression R²: 0.698330967758874


## 3.3 Simple Polynomial Regression
- **Model:** Polynomial Regression (`degree=2`) using the feature `atemp`.
- **Training:** The model is trained on the polynomial features derived from `atemp`.
- **Evaluation:** The model's performance is evaluated using MSE and R² score.

In [6]:
from sklearn.preprocessing import PolynomialFeatures

# Polynomial Regression (degree 2)
poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X_train[['atemp']])
poly_model = LinearRegression()
poly_model.fit(X_poly, y_train)

# Predictions
X_test_poly = poly.transform(X_test[['atemp']])
y_pred_poly = poly_model.predict(X_test_poly)

# Evaluate the model
mse_poly = mean_squared_error(y_test, y_pred_poly)
r2_poly = r2_score(y_test, y_pred_poly)
print(f"Simple Polynomial Regression MSE: {mse_poly}")
print(f"Simple Polynomial Regression R²: {r2_poly}")

Simple Polynomial Regression MSE: 2386473.376001215
Simple Polynomial Regression R²: 0.4048520120414143


## 3.4 Multi Polynomial Regression
- **Model:** Polynomial Regression (`degree=2`) using all related features.
- **Training:** The model is trained on the polynomial features derived from the related features.
- **Evaluation:** The model's performance is evaluated using MSE and R² score.

In [7]:
# Polynomial Regression with multiple features
poly = PolynomialFeatures(degree=2)
X_multi_poly = poly.fit_transform(X_train[related_features])
multi_poly_model = LinearRegression()
multi_poly_model.fit(X_multi_poly, y_train)

# Predictions
X_test_multi_poly = poly.transform(X_test[related_features])
y_pred_multi_poly = multi_poly_model.predict(X_test_multi_poly)

# Evaluate the model
mse_multi_poly = mean_squared_error(y_test, y_pred_multi_poly)
r2_multi_poly = r2_score(y_test, y_pred_multi_poly)
print(f"Multi Polynomial Regression MSE: {mse_multi_poly}")
print(f"Multi Polynomial Regression R²: {r2_multi_poly}")

Multi Polynomial Regression MSE: 1089210.9668534503
Multi Polynomial Regression R²: 0.7283683438901576

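It can help to see which columns a degree-2 expansion produces. The sketch below lists the monomials by name only. It is a plain-Python illustration of the idea behind `PolynomialFeatures`, not a call into scikit-learn, and `polynomial_terms` is a hypothetical helper.

```python
from itertools import combinations_with_replacement

def polynomial_terms(names, degree=2):
    """List the monomials of total degree <= `degree`, constant term first."""
    terms = []
    for d in range(degree + 1):
        for combo in combinations_with_replacement(names, d):
            terms.append("*".join(combo) if combo else "1")
    return terms

# One feature: constant, linear and squared term
print(polynomial_terms(["atemp"]))
# ['1', 'atemp', 'atemp*atemp']

# The four related features: 1 constant + 4 linear + 10 degree-2 terms
print(len(polynomial_terms(["instant", "season", "temp", "atemp"])))
# 15
```

With a single feature the expansion only adds a squared term, whereas with several features it also adds pairwise interaction terms such as `temp*atemp`.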

## 3.5 Lasso Regression
- **Model:** Lasso Regression with `alpha=0.1` using all related features.
- **Training:** The model is trained on the related features with Lasso regularization applied.
- **Evaluation:** The model's performance is evaluated using MSE and R² score.

In [8]:
from sklearn.linear_model import Lasso

lasso_model = Lasso(alpha=0.1)
lasso_model.fit(X_train[related_features], y_train)

# Predictions
y_pred_lasso = lasso_model.predict(X_test[related_features])

# Evaluate the model
mse_lasso = mean_squared_error(y_test, y_pred_lasso)
r2_lasso = r2_score(y_test, y_pred_lasso)
print(f"Lasso Regression MSE: {mse_lasso}")
print(f"Lasso Regression R²: {r2_lasso}")

Lasso Regression MSE: 1209734.492870611
Lasso Regression R²: 0.6983117194450204


## 3.6 Ridge Regression
- **Model:** Ridge Regression with `alpha=0.1` using all related features.
- **Training:** The model is trained on the related features with Ridge regularization applied.
- **Evaluation:** The model's performance is evaluated using MSE and R² score.

In [9]:
from sklearn.linear_model import Ridge

ridge_model = Ridge(alpha=0.1)
ridge_model.fit(X_train[related_features], y_train)

# Predictions
y_pred_ridge = ridge_model.predict(X_test[related_features])

# Evaluate the model
mse_ridge = mean_squared_error(y_test, y_pred_ridge)
r2_ridge = r2_score(y_test, y_pred_ridge)
print(f"Ridge Regression MSE: {mse_ridge}")
print(f"Ridge Regression R²: {r2_ridge}")

Ridge Regression MSE: 1211100.30194647
Ridge Regression R²: 0.6979711086795253

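The Lasso and Ridge scores above are almost identical to plain linear regression, which is what one would expect when `alpha` is small. The sketch below shows the shrinkage for the simplest possible case: one feature, no intercept, and hypothetical toy data. The objectives in the comments follow the form given in the scikit-learn documentation, and the function names are ours.

```python
import math

def ols_coef(x, y):
    """Least-squares slope for y ~ w * x (no intercept)."""
    return sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)

def ridge_coef(x, y, alpha):
    """Minimises sum((y - w*x)^2) + alpha * w^2."""
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    return sxy / (sxx + alpha)

def lasso_coef(x, y, alpha):
    """Minimises (1 / (2n)) * sum((y - w*x)^2) + alpha * |w|."""
    n = len(x)
    rho = sum(a * b for a, b in zip(x, y)) / n
    z = sum(a * a for a in x) / n
    shrunk = max(abs(rho) - alpha, 0.0)  # soft-thresholding
    return math.copysign(shrunk, rho) / z

# Hypothetical toy data, not taken from day.csv
x = [-2.0, -1.0, 0.0, 1.0, 2.0]
y = [-4.1, -1.9, 0.2, 2.1, 3.7]

for alpha in (0.0, 0.1, 1.0, 5.0):
    print(alpha, round(ridge_coef(x, y, alpha), 4), round(lasso_coef(x, y, alpha), 4))
```

Ridge shrinks the coefficient smoothly toward zero as `alpha` grows, while Lasso can set it exactly to zero. At `alpha=0.1` both stay close to the unpenalised slope.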

## 3.7 Decision Tree Regressor
- **Model:** Decision Tree Regressor using all related features.
- **Training:** The model is trained on the related features.
- **Evaluation:** The model's performance is evaluated using MSE and R² score.

In [10]:
from sklearn.tree import DecisionTreeRegressor

tree_model = DecisionTreeRegressor(random_state=42)
tree_model.fit(X_train[related_features], y_train)

# Predictions
y_pred_tree = tree_model.predict(X_test[related_features])

# Evaluate the model
mse_tree = mean_squared_error(y_test, y_pred_tree)
r2_tree = r2_score(y_test, y_pred_tree)
print(f"Decision Tree Regressor MSE: {mse_tree}")
print(f"Decision Tree Regressor R²: {r2_tree}")

Decision Tree Regressor MSE: 1106578.2653061224
Decision Tree Regressor R²: 0.7240372196319422


## 3.8 K-Nearest Neighbors (KNN) Regressor
- **Model:** KNN Regressor with `n_neighbors=5` using all related features.
- **Training:** The model is trained on the related features.
- **Evaluation:** The model's performance is evaluated using MSE and R² score.

In [11]:
from sklearn.neighbors import KNeighborsRegressor

knn_model = KNeighborsRegressor(n_neighbors=5)
knn_model.fit(X_train[related_features], y_train)

# Predictions
y_pred_knn = knn_model.predict(X_test[related_features])

# Evaluate the model
mse_knn = mean_squared_error(y_test, y_pred_knn)
r2_knn = r2_score(y_test, y_pred_knn)
print(f"KNN Regressor MSE: {mse_knn}")
print(f"KNN Regressor R²: {r2_knn}")

KNN Regressor MSE: 836722.388027211
KNN Regressor R²: 0.7913349251150229

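One caveat when reading the KNN result: with the default Euclidean distance, features on a larger numeric scale dominate the neighbour search. Here `instant` runs into the hundreds while `temp` and `atemp` lie between 0 and 1. The sketch below uses two rows from the table in Step 1 to show how much of the squared distance each feature contributes. The helper `squared_distance_shares` is hypothetical.

```python
# Rows 0 and 4 from the table in Step 1, restricted to the related features
row_a = {"instant": 1, "season": 1, "temp": 0.344167, "atemp": 0.363625}
row_b = {"instant": 5, "season": 1, "temp": 0.226957, "atemp": 0.229270}

def squared_distance_shares(a, b):
    """Fraction of the squared Euclidean distance contributed by each feature."""
    squared = {key: (a[key] - b[key]) ** 2 for key in a}
    total = sum(squared.values())
    return {key: value / total for key, value in squared.items()}

shares = squared_distance_shares(row_a, row_b)
for name, share in shares.items():
    print(f"{name:<8} {share:.4%}")
```

For this pair, more than 99% of the squared distance comes from `instant`, so the model is effectively averaging over days that are close in time. Standardising the features first, for example with `StandardScaler` in a `Pipeline`, is one common way to give each feature a comparable weight. We have not run that variant here.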

# Conclusion
## Performance Summary
- We compared the regression models on the same test set using MSE and R².
- **Best Model:** The model with the highest R² and the lowest MSE is taken as the best performer for predicting bike rental usage.

- **Results by model:**

### 1. Simple Linear Regression
- **MSE: 2,359,187.32**
- **R²: 0.4117**

### 2. Multi Linear Regression
- **MSE: 1,209,657.31**
- **R²: 0.6983**

### 3. Simple Polynomial Regression
- **MSE: 2,386,473.38**
- **R²: 0.4049**

### 4. Multi Polynomial Regression
- **MSE: 1,089,210.97**
- **R²: 0.7284**

### 5. Lasso Regression
- **MSE: 1,209,734.49**
- **R²: 0.6983**

### 6. Ridge Regression
- **MSE: 1,211,100.30**
- **R²: 0.6980**

### 7. Decision Tree Regressor
- **MSE: 1,106,578.27**
- **R²: 0.7240**

### 8. KNN Regressor
- **MSE: 836,722.39**
- **R²: 0.7913**

## Best Model
- **`KNN Regressor`** has the lowest MSE (836,722.39) and the highest R² (0.7913), making it the best-performing model among those tested on the selected features.

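The ranking can also be produced programmatically. The sketch below hard-codes the scores reported above and sorts them by R². The variable names are ours.

```python
# (MSE, R²) per model, copied from the results listed above
results = {
    "Simple Linear Regression": (2359187.32, 0.4117),
    "Multi Linear Regression": (1209657.31, 0.6983),
    "Simple Polynomial Regression": (2386473.38, 0.4049),
    "Multi Polynomial Regression": (1089210.97, 0.7284),
    "Lasso Regression": (1209734.49, 0.6983),
    "Ridge Regression": (1211100.30, 0.6980),
    "Decision Tree Regressor": (1106578.27, 0.7240),
    "KNN Regressor": (836722.39, 0.7913),
}

# Sort by R², best first
ranked = sorted(results.items(), key=lambda item: item[1][1], reverse=True)

for name, (mse, r2) in ranked:
    print(f"{name:<30} MSE: {mse:>14,.2f}  R²: {r2:.4f}")

best_by_mse = min(results, key=lambda name: results[name][0])
print("Best by R²:", ranked[0][0], "| best by MSE:", best_by_mse)
```

Because every model is scored on the same test set, ranking by R² and ranking by MSE single out the same best model.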
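# Alternate Approach

- Here a single function fits every regression model through a shared pipeline and prints the best model's R² and MSE.
- The one constraint is that the pipeline needs a preprocessor that handles both numeric and categorical features.
- Every pipeline below receives the full feature set, so "Simple Linear Regression" and "Multi Linear Regression" are the same model here, and the two polynomial entries differ only in degree (2 versus 3).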

In [12]:
import pandas as pd
from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.preprocessing import PolynomialFeatures, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Load the dataset
df = pd.read_csv("day.csv")

# Drop 'registered' and 'casual' from the DataFrame
df = df.drop(['registered', 'casual'], axis=1)

# Extract date-related features (parse 'dteday' once)
# Note: 'year' and 'month' carry the same information as 'yr' and 'mnth'
dates = pd.to_datetime(df['dteday'])
df['year'] = dates.dt.year
df['month'] = dates.dt.month
df['day'] = dates.dt.day

# Drop the original date column
df = df.drop('dteday', axis=1)

# Define the target variable 'y' and features 'X'
y = df['cnt']
X = df.drop('cnt', axis=1)

# One-hot encode categorical variables if any (e.g., 'season', 'weathersit')
categorical_features = ['season', 'weathersit']
numerical_features = X.columns.difference(categorical_features)

# Create a preprocessor to handle both numeric and categorical features
preprocessor = ColumnTransformer(
    transformers=[
        ('num', 'passthrough', numerical_features),
        ('cat', OneHotEncoder(), categorical_features)
    ])

# Function to evaluate and compare different regression models
def evaluate_models(X_train, X_test, y_train, y_test):
    models = {
        "Simple Linear Regression": Pipeline([
            ('preprocessor', preprocessor),
            ('model', LinearRegression())
        ]),
        "Multi Linear Regression": Pipeline([
            ('preprocessor', preprocessor),
            ('model', LinearRegression())
        ]),
        "Simple Polynomial Regression": Pipeline([
            ('preprocessor', preprocessor),
            ('poly', PolynomialFeatures(degree=2)),
            ('model', LinearRegression())
        ]),
        "Multi Polynomial Regression": Pipeline([
            ('preprocessor', preprocessor),
            ('poly', PolynomialFeatures(degree=3)),
            ('model', LinearRegression())
        ]),
        "Lasso Regression": Pipeline([
            ('preprocessor', preprocessor),
            ('model', Lasso())
        ]),
        "Ridge Regression": Pipeline([
            ('preprocessor', preprocessor),
            ('model', Ridge())
        ]),
        "Decision Tree Regressor": Pipeline([
            ('preprocessor', preprocessor),
            ('model', DecisionTreeRegressor())
        ]),
        "KNN Regressor": Pipeline([
            ('preprocessor', preprocessor),
            ('model', KNeighborsRegressor())
        ])
    }

    results = []

    for name, model in models.items():
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        mse = mean_squared_error(y_test, y_pred)
        r2 = r2_score(y_test, y_pred)
        results.append((name, mse, r2))

    # Find the model with the best R² score
    best_model = max(results, key=lambda item: item[2])
    
    # Print all results
    for result in results:
        print(f"{result[0]} MSE: {result[1]:.2f}, R²: {result[2]:.4f}")
    
    print("\nBest Model:")
    print(f"{best_model[0]} with R²: {best_model[2]:.4f} and MSE: {best_model[1]:.2f}")

    return best_model

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Evaluate models and get the best one
best_model = evaluate_models(X_train, X_test, y_train, y_test)

Simple Linear Regression MSE: 642306.02, R²: 0.8398
Multi Linear Regression MSE: 642306.02, R²: 0.8398
Simple Polynomial Regression MSE: 1127402.23, R²: 0.7188
Multi Polynomial Regression MSE: 371743872.03, R²: -91.7069
Lasso Regression MSE: 644793.40, R²: 0.8392
Ridge Regression MSE: 649867.35, R²: 0.8379
Decision Tree Regressor MSE: 696271.07, R²: 0.8264
KNN Regressor MSE: 899537.94, R²: 0.7757

Best Model:
Simple Linear Regression with R²: 0.8398 and MSE: 642306.02

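A likely reason for the very negative R² of the degree-3 pipeline is the number of columns it generates. The sketch below counts polynomial terms with the standard formula C(n + d, d). It assumes the preprocessor outputs 20 columns (13 numeric plus 4 `season` and 3 `weathersit` indicators), which is our reading of the code and not a figure printed by the notebook. `n_polynomial_terms` is a hypothetical helper.

```python
from math import comb

def n_polynomial_terms(n_features, degree):
    """Number of monomials of total degree <= `degree`, including the constant."""
    return comb(n_features + degree, degree)

n_columns = 20  # assumption: 13 numeric + 4 season + 3 weathersit indicators
n_train = 584   # training rows, from the 80-20 split of 731 days

for degree in (1, 2, 3):
    n_terms = n_polynomial_terms(n_columns, degree)
    flag = "more columns than training rows" if n_terms > n_train else "ok"
    print(f"degree {degree}: {n_terms} columns ({flag})")
```

Under that assumption the degree-3 expansion has about three times as many columns as there are training rows, so an unregularised linear model can fit the training data almost exactly and still generalise poorly.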