# ML Pipeline
<br>


```
This notebook builds the pipeline, which includes data ingestion, data
pre-processing, feature engineering, modelling, model evaluation and model inferencing.
```


In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv("bike_sharing.csv")
df.head()

Unnamed: 0,instant,dteday,season,yr,mnth,hr,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt
0,1,1/1/2011,1,0,1,0,0,6,0,1,0.24,0.2879,0.81,0.0,3,13,16
1,2,1/1/2011,1,0,1,1,0,6,0,1,0.22,0.2727,0.8,0.0,8,32,40
2,3,1/1/2011,1,0,1,2,0,6,0,1,0.22,0.2727,0.8,0.0,5,27,32
3,4,1/1/2011,1,0,1,3,0,6,0,1,0.24,0.2879,0.75,0.0,3,10,13
4,5,1/1/2011,1,0,1,4,0,6,0,1,0.24,0.2879,0.75,0.0,0,1,1


# Data Preprocessing
<br>

I am **dropping** the columns below:
<br><br>
**1. instant**
<br>
This column is merely a row sequence number and provides no meaningful information for analysis or modelling.
<br>

**2. dteday**
<br>
The 'dteday' column can be dropped because the dataset already contains columns for the year ('yr') and month ('mnth'), making the specific date redundant for analysis purposes.
<br>

**3. casual and registered**
<br>
The 'casual' and 'registered' columns sum to the total count column ('cnt'), which is the prediction target, so we drop them as well. We can't keep these columns while predicting the total count, because together they determine it and would leak the target into the features.
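
The relationship behind the third point can be checked before dropping anything. The sketch below rebuilds the first five rows shown in the `df.head()` output above as a small frame (the `sample` name is illustrative only) and confirms that 'casual' plus 'registered' equals 'cnt' on them. On the full data the same comparison would be run against `df`.

```python
import pandas as pd

# First five rows of the relevant columns, copied from the df.head() output above
sample = pd.DataFrame({
    "casual":     [3, 8, 5, 3, 0],
    "registered": [13, 32, 27, 10, 1],
    "cnt":        [16, 40, 32, 13, 1],
})

# True only if casual + registered reproduces cnt on every row
is_exact_sum = bool((sample["casual"] + sample["registered"] == sample["cnt"]).all())
print(is_exact_sum)
```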




In [3]:
# Drop the unnecessary columns
df.drop(['instant','dteday','casual','registered'], axis=1, inplace=True)
df.head()

Unnamed: 0,season,yr,mnth,hr,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,cnt
0,1,0,1,0,0,6,0,1,0.24,0.2879,0.81,0.0,16
1,1,0,1,1,0,6,0,1,0.22,0.2727,0.8,0.0,40
2,1,0,1,2,0,6,0,1,0.22,0.2727,0.8,0.0,32
3,1,0,1,3,0,6,0,1,0.24,0.2879,0.75,0.0,13
4,1,0,1,4,0,6,0,1,0.24,0.2879,0.75,0.0,1


**I am encoding these columns:**  
<br>

I am using one-hot encoding on:

1.   mnth (month)
2.   hr   (hour)
3.   weekday (day of the week)
4.   weathersit (values ranging from 1 to 4)




**Reason for encoding the above columns**
1. **Month (mnth)**:
- Since months are represented as integers from 1 to 12, one-hot encoding treats each month as a separate category rather than as a point on a continuous scale.

2. **Hour (hr)**:
- Similar to months, hours are represented as integers from 0 to 23. One-hot encoding this variable allows the model to capture hourly patterns that might affect bikeshare usage, such as office commute hours or late-night usage.

3. **Day of the Week (weekday)**:
- Weekdays are represented as integers from 0 to 6. One-hot encoding this variable helps the model understand the differences between each day of the week, as weekdays may have different usage patterns compared to weekends.

4. **Weather Situation (weathersit)**:
- The weather situation is represented by integers from 1 to 4, each corresponding to a different type of weather condition. One-hot encoding this variable allows the model to differentiate between the different weather conditions and their impact on bikeshare usage.
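
As a minimal sketch of the encoding step, the example below one-hot encodes two of these columns on a toy frame (the values and the `toy` name are made up for illustration). Passing `dtype=int` to `pd.get_dummies` produces 0/1 indicator columns while leaving float columns such as 'temp' untouched, which a blanket `astype(int)` over the whole frame would not.

```python
import pandas as pd

# Toy data with two categorical-like integer columns and one float column
toy = pd.DataFrame({
    "hr":         [0, 1, 2],
    "weathersit": [1, 1, 2],
    "temp":       [0.24, 0.22, 0.22],
    "cnt":        [16, 40, 32],
})

# Encode only the listed columns; dtype=int makes the indicators 0/1 integers
toy_encoded = pd.get_dummies(toy, columns=["hr", "weathersit"], dtype=int)

print(list(toy_encoded.columns))
print(toy_encoded["temp"].tolist())
```

Each distinct value becomes its own indicator column (for example `hr_0`, `hr_1`, `hr_2`), so the model no longer treats hour 2 as "twice" hour 1.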

In [4]:
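# Columns to one-hot encode
# Note: this list encodes 'season' and leaves 'weekday' as an integer column
columns_to_encode = ['season','mnth','hr','weathersit']


# Perform one-hot encoding
df_encoded = pd.get_dummies(df, columns=columns_to_encode)
# Note: astype(int) also truncates the float columns (temp, atemp, hum, windspeed) to 0,
# as the output below shows; casting only the dummy columns would preserve them
df_encoded = df_encoded.astype(int)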

# Display the first few rows of the encoded DataFrame
print(df_encoded.head())


   yr  holiday  weekday  workingday  temp  atemp  hum  windspeed  cnt  \
0   0        0        6           0     0      0    0          0   16   
1   0        0        6           0     0      0    0          0   40   
2   0        0        6           0     0      0    0          0   32   
3   0        0        6           0     0      0    0          0   13   
4   0        0        6           0     0      0    0          0    1   

   season_1  ...  hr_18  hr_19  hr_20  hr_21  hr_22  hr_23  weathersit_1  \
0         1  ...      0      0      0      0      0      0             1   
1         1  ...      0      0      0      0      0      0             1   
2         1  ...      0      0      0      0      0      0             1   
3         1  ...      0      0      0      0      0      0             1   
4         1  ...      0      0      0      0      0      0             1   

   weathersit_2  weathersit_3  weathersit_4  
0             0             0             0  
1           

# Modelling


**Modelling and Model evaluation**

In order to experiment with different types of algorithms, the following models have been chosen:
1. Linear Regression: Traditional linear regression model.

2. Polynomial Regression: Regression model that fits higher-order (nth degree)
   polynomial terms to capture non-linear relationships.

3. SVM (Support Vector Machine): Model that finds the optimal hyperplane for regression in high-dimensional space.

4. Decision Tree Regressor: Simple, interpretable tree-based model that splits data into subsets based on feature values.

5. Random Forest Regressor: Ensemble tree-based model that averages the predictions of many decision trees to improve accuracy and reduce overfitting.

6. XGBoost Regressor: Advanced gradient-boosted tree model known for its high performance and efficiency, particularly with tabular data.

7. KNN Regressor (K-Nearest Neighbors): Instance-based learning model that predicts target values based on the average of k-nearest neighbors; requires no explicit training phase.
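
To make the polynomial regression entry concrete, the sketch below expands two input features into all terms up to degree 2 in plain Python. It only illustrates the kind of feature expansion that precedes the linear fit; the helper name `polynomial_terms` is hypothetical and is not part of scikit-learn.

```python
from itertools import combinations_with_replacement
from math import prod

def polynomial_terms(features, degree=2):
    """Return the bias term plus all products of the inputs up to `degree`."""
    terms = [1.0]  # bias term
    for d in range(1, degree + 1):
        # Every multiset of d features gives one monomial, e.g. (a, b) -> a*b
        for combo in combinations_with_replacement(features, d):
            terms.append(prod(combo))
    return terms

# Two features a=2, b=3 expand to [1, a, b, a*a, a*b, b*b]
expanded = polynomial_terms([2.0, 3.0], degree=2)
print(expanded)
```

A linear model fitted on these expanded terms can then represent curved relationships between the inputs and the target.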

In [5]:
import numpy as np
import pandas as pd
import warnings

from sklearn.model_selection import train_test_split, KFold
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.metrics import mean_absolute_error, r2_score

from sklearn.linear_model import LinearRegression, Ridge
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from sklearn.preprocessing import PolynomialFeatures
from xgboost import XGBRegressor

# Ignore warnings
warnings.filterwarnings("ignore")



def build_model_pipeline(data):

    # Select columns with numeric data types (int64 and float64) excluding the target column 'cnt'
    numeric_features = data.select_dtypes(include=['int64', 'float64']).drop(columns=['cnt']).columns


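    # Create a column transformer for preprocessing.
    # Note: remainder defaults to 'drop', so columns not listed in numeric_features
    # (e.g. dummy columns whose dtype is not int64/float64) are dropped rather than passed through.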
    preprocessor = ColumnTransformer(
        transformers=[
            ('num', StandardScaler(), numeric_features),                            # Standardize numeric features
        ])
    # I have not included the categorical variables as they have already been one hot encoded.


    # Define a dictionary of models to be used
    models = {
        'linear_regression': LinearRegression(),                                     # Linear Regression model
        'svm': SVR(),                                                                # Support Vector Regression model
        'random_forest': RandomForestRegressor(),                                    # Random Forest Regressor model
        'xgboost': XGBRegressor(),                                                   # XGBoost Regressor model
        'decision_tree_regression': DecisionTreeRegressor(),                         # Decision Tree Regressor model
        'knn': KNeighborsRegressor(),                                                # K-Nearest Neighbors Regressor model
        'polynomial_regression': make_pipeline(PolynomialFeatures(), LinearRegression()),  # Polynomial Regression model
    }

    pipelines = {}

    for model_name, model in models.items():

        pipeline = Pipeline(steps=[('preprocessor', preprocessor), ('regressor', model)])   # Create a pipeline for each model with preprocessing and regression steps
        pipelines[model_name] = pipeline                                                    # Store the pipeline in the dictionary with the model name as the key

    return pipelines


def train_evaluate_models_cv_and_test(pipelines, X_train, X_cv, y_train, y_cv, X_test, y_test):
    # Dictionaries to save the results of cross-validation and testing
    # (X_cv and y_cv are accepted for later use but are not used inside this function)
    results_cv = {}
    results_test = {}


    kf = KFold(n_splits=5, shuffle=True, random_state=42)                           # 5-fold cross-validation with shuffling and a fixed random state

    for name, pipeline in pipelines.items():
        # Lists to store the per-fold MAE and R^2 scores for cross-validation
        cv_mae_scores = []
        cv_r2_scores = []

        # Cross-validation
        for train_index, val_index in kf.split(X_train):
            X_tr, X_val = X_train.iloc[train_index], X_train.iloc[val_index]        # Get training and validation data
            y_tr, y_val = y_train.iloc[train_index], y_train.iloc[val_index]        # Get training and validation targets

            pipeline.fit(X_tr, y_tr)
            y_val_pred = pipeline.predict(X_val)                                    # Predict on the validation data

            mae = mean_absolute_error(y_val, y_val_pred)                            # Calculate MAE for the validation set
            r2 = r2_score(y_val, y_val_pred)                                        # Calculate R^2 score for the validation set

            cv_mae_scores.append(mae)
            cv_r2_scores.append(r2)

        # Store the mean cross-validation results for the current model
        results_cv[name] = {'Mean MAE': np.mean(cv_mae_scores), 'Mean R^2': np.mean(cv_r2_scores)}

        # Testing
        pipeline.fit(X_train, y_train)
        predictions = pipeline.predict(X_test)                                    # Predict on the test set

        mae = mean_absolute_error(y_test, predictions)                            #  Calculate MAE for the test set
        r2 = r2_score(y_test, predictions)                                        # Calculate R^2 score for the test set

        # Store the test results for the current model
        results_test[name] = {'MAE': mae, 'R^2': r2}

        # Print results for the current model
        print(f'\n{name}:')
        print(f'Cross-validation - Mean MAE = {results_cv[name]["Mean MAE"]:.6f}, Mean R^2 = {results_cv[name]["Mean R^2"]:.6f}')
        print(f'Testing - MAE = {results_test[name]["MAE"]:.6f}, R^2 = {results_test[name]["R^2"]:.6f}')

    return results_cv, results_test


if __name__ == '__main__':
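    # Note: this rebuilds df_encoded from df, one-hot encoding only 'season';
    # mnth, hr, weekday and weathersit stay as integer columns for the runs below
    df_encoded = pd.get_dummies(df, columns=['season'])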
    X = df_encoded.drop(['cnt'], axis=1)
    y = df_encoded['cnt']

    # First split: 75% training, 25% remaining
    X_train, X_remaining, y_train, y_remaining = train_test_split(X, y, test_size=0.25, random_state=42)

    # Second split: 40% of remaining (which is 10% of original) for CV, 60% of remaining (which is 15% of original) for testing
    X_cv, X_test, y_cv, y_test = train_test_split(X_remaining, y_remaining, test_size=0.6, random_state=42)

    # Print sizes to verify
    print(f'Training set size: {len(X_train)} ({len(X_train)/len(X):.2%})')
    print(f'CV set size: {len(X_cv)} ({len(X_cv)/len(X):.2%})')
    print(f'Testing set size: {len(X_test)} ({len(X_test)/len(X):.2%})')

    # Build pipelines for different regression models
    pipelines = build_model_pipeline(df_encoded)

    # Train and evaluate models using cross-validation and testing
    results_cv, results_test = train_evaluate_models_cv_and_test(pipelines, X_train, X_cv, y_train, y_cv, X_test, y_test)

    # Find the best model based on MAE values in cross-validation
    best_model_cv = min(results_cv, key=lambda x: results_cv[x]['Mean MAE'])
    best_mae_cv = results_cv[best_model_cv]['Mean MAE']
    best_r2_cv = results_cv[best_model_cv]['Mean R^2']

    # Find the best model based on MAE values on testing
    best_model_test = min(results_test, key=lambda x: results_test[x]['MAE'])
    best_mae_test = results_test[best_model_test]['MAE']
    best_r2_test = results_test[best_model_test]['R^2']

    print("-------------------------------------------------------------------------------------------------------------------------------")
    # Print the best model and its performance
    print(f"\nBest model based on cross-validation: {best_model_cv}, Mean MAE: {best_mae_cv:.6f}, Mean R^2: {best_r2_cv:.6f}")
    print(f"Best model based on testing: {best_model_test}, MAE: {best_mae_test:.6f}, R^2: {best_r2_test:.6f}")


Training set size: 13034 (75.00%)
CV set size: 1738 (10.00%)
Testing set size: 2607 (15.00%)

linear_regression:
Cross-validation - Mean MAE = 106.779482, Mean R^2 = 0.384432
Testing - MAE = 102.823157, R^2 = 0.388153

svm:
Cross-validation - Mean MAE = 92.883986, Mean R^2 = 0.375716
Testing - MAE = 87.252426, R^2 = 0.408799

random_forest:
Cross-validation - Mean MAE = 27.667207, Mean R^2 = 0.935657
Testing - MAE = 26.045886, R^2 = 0.938564

xgboost:
Cross-validation - Mean MAE = 28.096184, Mean R^2 = 0.939502
Testing - MAE = 25.862934, R^2 = 0.945160

decision_tree_regression:
Cross-validation - Mean MAE = 37.574605, Mean R^2 = 0.871676
Testing - MAE = 34.931914, R^2 = 0.891672

knn:
Cross-validation - Mean MAE = 73.215535, Mean R^2 = 0.641978
Testing - MAE = 67.378903, R^2 = 0.665108

polynomial_regression:
Cross-validation - Mean MAE = 92.700825, Mean R^2 = 0.544220
Testing - MAE = 88.758463, R^2 = 0.544970
---------------------------------------------------------------------------



---

# Model Evaluation

The evaluation results show a clear separation in performance across the regression models, measured by Mean Absolute Error (MAE) and R-squared (R²) in both cross-validation and testing.

**Linear Regression**:
- Linear Regression, while straightforward, shows relatively high error (test MAE of about 103) and a modest R² (about 0.39), indicating it does not capture the complexity of the data well.

**SVR (Support Vector Regression)**:
- SVR performs better than Linear Regression in terms of MAE but has similar R², suggesting limited improvement in predictive power.

**Random Forest Regressor**:
- Random Forest shows strong performance with low MAE (about 26 on the test set) and high R² (about 0.94), indicating strong predictive accuracy and robustness.

**XGBoost Regressor**:
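- XGBoost also demonstrates excellent performance, slightly outperforming Random Forest on the test set (MAE of 25.86 versus 26.05), which makes it the best model for this data on test performance. Random Forest has the marginally lower cross-validation MAE (27.67 versus 28.10).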

**Decision Tree Regressor**:
- The Decision Tree Regressor trails XGBoost by a clear margin (test MAE of about 34.9 versus 25.9), but it still offers a solid baseline due to its simplicity and interpretability. It performs well because it can capture non-linear relationships and interactions between features.

**K-Nearest Neighbors (KNN) Regressor**:
- KNN performs moderately well but has higher errors compared to ensemble methods, indicating it may struggle with capturing complex patterns.

**Polynomial Regression**:
- Polynomial Regression shows better performance than Linear Regression but still has relatively high errors and moderate R², suggesting limited benefit from polynomial terms.

<br>

Ensemble methods, such as Random Forest and XGBoost, achieve much lower errors, with Mean Absolute Errors (MAE) in the 20s, while the other models apart from the decision tree have MAE values above 65, highlighting the superior performance of ensemble techniques.

The overall inference is that ensemble methods, specifically Random Forest and XGBoost, significantly outperform other models in terms of both accuracy and predictive power. XGBoost marginally outperforms Random Forest on the test set, making it the best model for this dataset.
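
For reference, the two metrics used throughout this comparison can be written out by hand. The sketch below is a plain-Python illustration with made-up counts and predictions; the function names are hypothetical, and the notebook itself relies on `mean_absolute_error` and `r2_score` from scikit-learn.

```python
def mean_absolute_error_manual(y_true, y_pred):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def r2_score_manual(y_true, y_pred):
    """One minus the ratio of the residual to the total sum of squares."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Made-up hourly counts and predictions, for illustration only
actual = [16, 40, 32, 13, 1]
predicted = [20, 35, 30, 15, 5]

mae = mean_absolute_error_manual(actual, predicted)
r2 = r2_score_manual(actual, predicted)
print(f"MAE = {mae:.3f}, R^2 = {r2:.3f}")
```

MAE is in the same units as 'cnt' (bikes per hour), whereas R² is unitless, which is why the two can rank models differently, as with Linear Regression and SVR above.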



# Grid Search CV

In [None]:
from sklearn.metrics import make_scorer, mean_absolute_error
from sklearn.model_selection import GridSearchCV

# Define the parameter grid
param_grid_rf = {
    'regressor__n_estimators': [50, 100, 150],
    'regressor__max_depth': [None, 10, 20],
    'regressor__min_samples_split': [2, 5, 10],
    'regressor__min_samples_leaf': [1, 2, 4]
}

# Perform GridSearchCV, scoring by negated MAE (equivalent to scoring='neg_mean_absolute_error')
scorer = make_scorer(mean_absolute_error, greater_is_better=False)
grid_search_rf = GridSearchCV(pipelines['random_forest'], param_grid_rf, cv=5, scoring=scorer, n_jobs=-1)
grid_search_rf.fit(X_train, y_train)

# Calculate MAE and R^2 for best parameters on CV set
best_rf_model = grid_search_rf.best_estimator_
y_cv_pred_rf = best_rf_model.predict(X_cv)
mae_cv_rf = mean_absolute_error(y_cv, y_cv_pred_rf)
r2_cv_rf = r2_score(y_cv, y_cv_pred_rf)

# Calculate MAE and R^2 for best parameters on testing set
y_test_pred_rf = best_rf_model.predict(X_test)
mae_test_rf = mean_absolute_error(y_test, y_test_pred_rf)
r2_test_rf = r2_score(y_test, y_test_pred_rf)

# Print results
print("Best parameters for Random Forest:", grid_search_rf.best_params_)
print("CV set - MAE: {:.6f}, R^2: {:.6f}".format(mae_cv_rf, r2_cv_rf))
print("Testing set - MAE: {:.6f}, R^2: {:.6f}".format(mae_test_rf, r2_test_rf))


Best parameters for Random Forest: {'regressor__max_depth': 20, 'regressor__min_samples_leaf': 1, 'regressor__min_samples_split': 2, 'regressor__n_estimators': 150}
CV set - MAE: 27.146483, R^2: 0.937409
Testing set - MAE: 25.744048, R^2: 0.940136


In [None]:
from sklearn.metrics import make_scorer, mean_absolute_error
from sklearn.model_selection import GridSearchCV

# Define the parameter grid for XGBoost
param_grid_xgb = {
    'regressor__n_estimators': [50, 100, 150],
    'regressor__max_depth': [3, 5, 7],
    'regressor__learning_rate': [0.1, 0.01, 0.001],
    'regressor__subsample': [0.5,0.7,0.9, 1.0],
    'regressor__colsample_bytree': [0.9, 1.0]
}

# Perform GridSearchCV, scoring by negated MAE (equivalent to scoring='neg_mean_absolute_error')
scorer = make_scorer(mean_absolute_error, greater_is_better=False)
grid_search_xgb = GridSearchCV(pipelines['xgboost'], param_grid_xgb, cv=5, scoring=scorer, n_jobs=-1)
grid_search_xgb.fit(X_train, y_train)

# Calculate MAE and R^2 for best parameters on CV set
best_xgb_model = grid_search_xgb.best_estimator_
y_cv_pred_xgb = best_xgb_model.predict(X_cv)
mae_cv_xgb = mean_absolute_error(y_cv, y_cv_pred_xgb)
r2_cv_xgb = r2_score(y_cv, y_cv_pred_xgb)

# Calculate MAE and R^2 for best parameters on testing set
y_test_pred_xgb = best_xgb_model.predict(X_test)
mae_test_xgb = mean_absolute_error(y_test, y_test_pred_xgb)
r2_test_xgb = r2_score(y_test, y_test_pred_xgb)

# Print results
print("Best parameters for XGBoost:", grid_search_xgb.best_params_)
print("CV set - MAE: {:.6f}, R^2: {:.6f}".format(mae_cv_xgb, r2_cv_xgb))
print("Testing set - MAE: {:.6f}, R^2: {:.6f}".format(mae_test_xgb, r2_test_xgb))


Best parameters for XGBoost: {'regressor__colsample_bytree': 1.0, 'regressor__learning_rate': 0.1, 'regressor__max_depth': 7, 'regressor__n_estimators': 150, 'regressor__subsample': 0.7}
CV set - MAE: 24.358193, R^2: 0.952201
Testing set - MAE: 23.813126, R^2: 0.950036
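
The cost of an exhaustive grid search grows with the product of the grid sizes. As a rough sketch, the snippet below counts the candidate combinations and cross-validation fits implied by `param_grid_rf` and `param_grid_xgb` above; the dictionary and function names are hypothetical and only record how many values each parameter takes.

```python
from math import prod

# Number of values per parameter, copied from param_grid_rf and param_grid_xgb above
rf_grid_sizes = {"n_estimators": 3, "max_depth": 3, "min_samples_split": 3, "min_samples_leaf": 3}
xgb_grid_sizes = {"n_estimators": 3, "max_depth": 3, "learning_rate": 3, "subsample": 4, "colsample_bytree": 2}
cv_folds = 5

def count_cv_fits(grid_sizes, folds):
    """Return (candidate combinations, cross-validation fits) for a full grid."""
    combinations = prod(grid_sizes.values())
    return combinations, combinations * folds

rf_combos, rf_fits = count_cv_fits(rf_grid_sizes, cv_folds)
xgb_combos, xgb_fits = count_cv_fits(xgb_grid_sizes, cv_folds)
print(rf_combos, rf_fits)    # 81 405
print(xgb_combos, xgb_fits)  # 216 1080
```

With scikit-learn's default `refit=True`, each search also fits the best configuration once more on the full training set.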


# Model Process Flow
The process flow of this ML pipeline can be summarized as follows:

1. **Data Preparation**:
   - Drop unnecessary columns ('instant', 'dteday','casual', 'registered').

2. **One-hot encoding**:
   - Define the columns to be one-hot encoded ('mnth', 'hr', 'weekday', 'weathersit') and encode them.

3. **Model Building:**

  - Define a function (build_model_pipeline) to build pipelines for different regression models.
  - Create a column transformer (preprocessor) for preprocessing numeric  features.
  - Define regression models to be used in the pipelines.
  - Create pipelines for each model, including preprocessing and the regression model.

4. **Model Training and Evaluation:**

  - Define a function (train_evaluate_models_cv_and_test) to train and evaluate models using cross-validation and testing. The data is split into training (75%), cross-validation (10%), and testing (15%) sets. 5-fold cross-validation is run on the training set to evaluate the models; the separate 10% cross-validation set is used later to check the grid-searched models.
  - Perform cross-validation for each model and calculate Mean MAE and Mean R^2.
  - Perform testing for each model and calculate MAE and R^2.

5. **Best Model Selection:**
  - Find the best model based on Mean MAE in cross-validation and MAE in testing.
  - Print the best model and its performance metrics.

6. **Grid Search CV:**
  - I have employed grid search to try further hyperparameter combinations on the two best models and reduce the MAE further.
  - By identifying the best combination of hyperparameters within the grid, Grid Search CV can improve the model's performance and generalization ability, leading to more accurate and reliable predictions.
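
The 75% / 10% / 15% split described in step 4 comes from two successive splits (25% held out, then 60% of the held-out rows used for testing). The arithmetic can be sketched as follows, assuming the held-out share of each split is rounded up to a whole row; the function name is hypothetical.

```python
from math import ceil

def two_stage_split_sizes(n_rows, first_test_frac=0.25, second_test_frac=0.6):
    """Sizes of the train / CV / test sets produced by two successive splits."""
    n_remaining = ceil(first_test_frac * n_rows)   # rows held out by the first split
    n_train = n_rows - n_remaining
    n_test = ceil(second_test_frac * n_remaining)  # test share of the held-out rows
    n_cv = n_remaining - n_test
    return n_train, n_cv, n_test

n_rows = 13034 + 1738 + 2607  # total rows implied by the sizes printed above
n_train, n_cv, n_test = two_stage_split_sizes(n_rows)
print(n_train, n_cv, n_test)
```

Under that rounding assumption the result matches the set sizes printed by the training cell.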



---



# Model Inferencing

**Why I think XGBoost and Random Forest gave exceptional results**

1. **Reduction of Overfitting**:
- They combine multiple weak learners to create a stronger overall model.
- By averaging the predictions of multiple trees (in Random Forest) or using boosting, where each new regularized tree targets the errors that remain (in XGBoost), these models reduce the risk of overfitting that individual models may suffer from. This leads to better generalization on unseen (test and CV) data.

2. **Improved Accuracy**:

- Ensemble methods aggregate the strengths of various models, thereby improving accuracy.
- In Random Forest, for instance, the model builds multiple decision trees using different subsets of the data and features. The final prediction is an average of all trees, which often leads to more accurate and stable predictions.
- XGBoost, through its gradient boosting framework, sequentially builds trees where each tree corrects the errors of the previous ones, enhancing the overall accuracy.

3. **Robustness to Noise**:
- Ensemble models are more robust to noise in the training data.
- By combining multiple models, the impact of noisy or erroneous data points is minimized because the overall prediction relies on the consensus of many models rather than a single one. This collective decision-making process tends to filter out noise and result in more reliable predictions.

These reasons collectively contribute to the superior performance of ensemble methods in various machine learning tasks.
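
The averaging argument in points 1 and 3 can be illustrated with a small simulation. This is an idealized sketch, not a model of the actual forests trained above: it assumes every "model" makes an independent Gaussian error around the same true value, whereas real trees in a forest are correlated, so the benefit in practice is smaller. All names and numbers are made up.

```python
import random
import statistics

random.seed(0)
TRUE_VALUE = 100.0   # hypothetical true hourly count
NOISE_STD = 20.0     # hypothetical error of a single model

def noisy_prediction():
    """One model's prediction: the true value plus independent noise."""
    return TRUE_VALUE + random.gauss(0, NOISE_STD)

def averaged_prediction(n_models):
    """Prediction of an ensemble that averages n_models noisy predictions."""
    return statistics.fmean([noisy_prediction() for _ in range(n_models)])

single = [noisy_prediction() for _ in range(2000)]
averaged = [averaged_prediction(50) for _ in range(2000)]

single_spread = statistics.pstdev(single)
averaged_spread = statistics.pstdev(averaged)
print(f"single model spread: {single_spread:.2f}")
print(f"50-model average spread: {averaged_spread:.2f}")
```

Under the independence assumption the spread of the average shrinks roughly with the square root of the number of models, which is the intuition behind the noise-filtering claim.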



***Comparing the Mean Absolute Error (MAE) of approximately 23 with a mean(cnt) of 190 indicates a relatively low error rate. Specifically, the MAE of 23 represents around 12.1% of the mean value (23 / 190 * 100). This suggests that the model's predictions are reasonably close to the actual values, demonstrating a good level of accuracy in the context of the dataset.***
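
The ratio quoted above is straightforward to reproduce. The helper below is a hypothetical convenience function that uses the approximate figures from the text rather than values recomputed from the data.

```python
def relative_mae_percent(mae, mean_target):
    """MAE expressed as a percentage of the mean target value."""
    return mae / mean_target * 100

# Approximate values quoted in the text above
ratio = relative_mae_percent(23, 190)
print(f"{ratio:.1f}%")  # 12.1%
```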



---




