# 📊 ***Data Science, CA3 - Task 2*** 📚

* **Member 1** : [Kasra Kashani, 810101490] 🆔
* **Member 2** : [Borna Foroohari, 810101480] 🆔

📄 **Subjects**: Machine Learning: Regression

## 🔹**Imports**

Import required modules.

In [20]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import GradientBoostingRegressor
from xgboost import XGBRegressor
from catboost import CatBoostRegressor
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error

## 📍 Bike Rental Prediction

In this task, our goal is to build a *regression* model that predicts `total_users`, the number of people who rent bikes on a given day, from the weather conditions, the season, and the day of the week.

### 🧠 Feature Understanding & Analysis

Below is a breakdown of the features in the dataset, including their types, meanings, and the insights they may provide for predicting the total number of bike rentals:

| 🏷️ Feature  Name          | 🧬 Type       | 💡 Description                                                                 | 🔍 Key Insight                                                                 |
|--------------------|------------|-----------------------------------------------------------------------------|------------------------------------------------------------------------------|
| `id`               | numeric    | Unique identifier for each record (starting from 1)                         | Uniquely identifies each record.                                             |
| `date`             | datetime   | Full calendar date (DD-MM-YYYY format)                                      | Used to extract day, weekday name, seasonality. Not directly predictive.     |
| `season_id`        | categorical| Season of the year (1=Spring, 2=Summer, 3=Autumn, 4=Winter)                | Summer and Autumn show higher rental activity. Seasonal variation is strong. |
| `year`             | binary     | Encoded year (0 = 2018, 1 = 2019)                                           | Useful for identifying year-over-year trends or external changes.            |
| `month`            | numeric    | Month number (1–12)                                                         | Rentals peak in warmer months (May–Sept). Strong correlation with season.    |
| `is_holiday`       | binary     | Whether the day is a public holiday (1 = Yes, 0 = No)                       | Holidays show significantly fewer rentals.                                   |
| `weekday`          | numeric    | Day of the week (0 = Monday, ..., 6 = Sunday)                               | Rentals are higher on weekdays due to commuting behavior.                    |
| `is_workingday`    | binary     | Whether it's a working day (1 = Yes, 0 = No)                                | Most predictive: working days have consistently higher demand.               |
| `weather_condition`| categorical| Weather description (e.g., clear, cloudy, rainy)                            | Clear weather boosts usage; rain and fog reduce it.                          |
| `temperature`      | continuous | Actual temperature in Celsius                                               | Rentals increase with mild temperature (15–30°C zone).                       |
| `feels_like_temp`  | continuous | Apparent temperature felt by humans                                         | Difference from actual temperature could reveal discomfort.                  |
| `humidity`         | percentage | Relative humidity level (%)                                                 | Higher humidity correlates with lower rental activity.                       |
| `wind_speed`       | continuous | Wind speed in km/h                                                          | Moderate wind acceptable; high wind reduces usage.                           |
| `total_users`      | target     | Total number of bike rentals (casual + registered)                          | This is the target variable for regression modeling.                         |

First we read and load the train dataset into a Pandas dataframe.

In [21]:
# Load the CSV file into a dataframe
df = pd.read_csv("regression-dataset-train.csv")

In [22]:
# Show the dataframe
df

Unnamed: 0,id,date,season_id,year,month,is_holiday,weekday,is_workingday,weather_condition,temperature,feels_like_temp,humidity,wind_speed,total_users
0,577,31-07-2019,3,1,7,0,2,1,1,29.246653,33.14480,70.4167,11.083475,7216
1,427,03-03-2019,1,1,3,0,6,0,2,16.980847,20.67460,62.1250,10.792293,4066
2,729,30-12-2019,1,1,12,0,0,0,1,10.489153,11.58500,48.3333,23.500518,1796
3,483,28-04-2019,2,1,4,0,6,0,2,15.443347,18.87520,48.9583,8.708325,4220
4,112,22-04-2018,2,0,4,0,5,1,2,13.803347,16.09770,72.9583,14.707907,1683
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
505,579,02-08-2019,3,1,8,0,4,1,1,30.852500,35.35440,65.9583,8.666718,7261
506,54,23-02-2018,1,0,2,0,3,1,1,9.091299,12.28585,42.3043,6.305571,1917
507,351,17-12-2018,4,0,12,0,6,0,2,10.591653,12.46855,56.0833,16.292189,2739
508,80,21-03-2018,2,0,3,0,1,1,2,17.647835,20.48675,73.7391,19.348461,2077


In [23]:
# Convert the date column to datetime format (dates are day-first)
df["date"] = pd.to_datetime(df["date"], errors="coerce", dayfirst=True)
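Because `errors="coerce"` silently turns unparseable dates into `NaT`, it can be worth confirming that nothing was lost during the conversion. The following is a minimal sketch on made-up strings; the sample values and the explicit `format` argument are assumptions for illustration and are not part of the notebook.

```python
import pandas as pd

# Toy date strings in DD-MM-YYYY form; the last entry is deliberately invalid
raw = pd.Series(["31-07-2019", "03-03-2019", "not-a-date"])

# An explicit format avoids any day/month ambiguity; bad values become NaT
parsed = pd.to_datetime(raw, format="%d-%m-%Y", errors="coerce")

# Count the rows that failed to parse
n_failed = int(parsed.isna().sum())
print(f"{n_failed} value(s) could not be parsed")
```

If the count is above zero on the real data, the affected rows would need to be inspected before extracting date features.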

The `df.info()` output below shows that no column has null values, so we can skip the missing-data handling step.

In [24]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 510 entries, 0 to 509
Data columns (total 14 columns):
 #   Column             Non-Null Count  Dtype         
---  ------             --------------  -----         
 0   id                 510 non-null    int64         
 1   date               510 non-null    datetime64[ns]
 2   season_id          510 non-null    int64         
 3   year               510 non-null    int64         
 4   month              510 non-null    int64         
 5   is_holiday         510 non-null    int64         
 6   weekday            510 non-null    int64         
 7   is_workingday      510 non-null    int64         
 8   weather_condition  510 non-null    int64         
 9   temperature        510 non-null    float64       
 10  feels_like_temp    510 non-null    float64       
 11  humidity           510 non-null    float64       
 12  wind_speed         510 non-null    float64       
 13  total_users        510 non-null    int64         
dtypes: datetim

The dataset contains only numeric columns plus one datetime column. The categorical features (`season_id`, `weather_condition`) are already stored as integer codes, so we do not need to encode them at this stage.

We only convert the year column to the actual year instead of a binary value.

In [25]:
# convert the year column to the actual years
df["year"] = df["year"].map({0: 2018, 1: 2019})

In [26]:
# First see the count of unique values for each column
for col in df.columns:
    print(f"{col} -> {df[col].unique().size}")

id -> 510
date -> 510
season_id -> 4
year -> 2
month -> 12
is_holiday -> 2
weekday -> 7
is_workingday -> 2
weather_condition -> 3
temperature -> 386
feels_like_temp -> 491
humidity -> 444
wind_speed -> 467
total_users -> 493


Now we can extract some additional features.

In [27]:
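# Calendar features extracted from the date
df["day"] = df["date"].dt.day

df["dayofyear"] = df["date"].dt.dayofyear

df["week"] = df["date"].dt.isocalendar().week.astype(int)

# Gap between apparent and actual temperature
df["temp_diff"] = df["feels_like_temp"] - df["temperature"]

# Interaction features; the small constant guards against division by zero
df["humidity_temp_ratio"] = df["humidity"] / (df["temperature"] + 1e-3)

df["temp_x_humidity"] = df["temperature"] * df["humidity"]

df["wind_x_holiday"] = df["wind_speed"] * df["is_holiday"]

# NOTE: under the table's encoding (0 = Monday, ..., 6 = Sunday), codes 4 and 5
# are Friday and Saturday, and code 5 is Saturday rather than Sunday. The codes
# are kept as-is so that the outputs reported below stay reproducible.
df["weekend"] = df["weekday"].apply(lambda x: 1 if x in [4, 5] else 0)

df["is_Sunday"] = (df["weekday"] == 5).astype(int)

# 1 when the apparent temperature lies between 18 and 27 degrees Celsius
df["feels_good_zone"] = df["feels_like_temp"].apply(lambda x: 1 if 18 <= x <= 27 else 0)

# Combined weather/holiday category, one-hot encoded in a later cell
df["weather_holiday"] = df["weather_condition"].astype(str) + "_" + df["is_holiday"].astype(str)

# Remove outliers in the target with the 1.5 * IQR rule
q1, q3 = df["total_users"].quantile([0.25, 0.75])
iqr = q3 - q1
df = df[(df["total_users"] >= q1 - 1.5 * iqr) & (df["total_users"] <= q3 + 1.5 * iqr)]

# Delete the date and id columns because we don't need them anymore
df = df.drop(columns=["date", "id"])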
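The last filter in the cell above is the usual 1.5 × IQR rule applied to the target. A small sketch on invented numbers (the values in `toy` are hypothetical, not taken from the dataset) shows which rows such a filter would drop:

```python
import pandas as pd

# Hypothetical rental counts; 50000 is an obvious outlier
toy = pd.Series([1200, 1500, 1800, 2100, 2400, 50000])

# Quartiles and interquartile range
q1, q3 = toy.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Keep only the values inside the [lower, upper] fence
kept = toy[(toy >= lower) & (toy <= upper)]
print(kept.tolist())
```

In the notebook itself, `df.info()` still reports 510 rows after this step, so the filter appears not to have removed any record.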

Then we one-hot encode the combined `weather_holiday` feature.

In [28]:
# Encode the weather_holiday column using one-hot encoding
df = pd.get_dummies(df, columns=["weather_holiday"], drop_first=True)

In [29]:
# Show the dataframe
df

Unnamed: 0,season_id,year,month,is_holiday,weekday,is_workingday,weather_condition,temperature,feels_like_temp,humidity,...,humidity_temp_ratio,temp_x_humidity,wind_x_holiday,weekend,is_Sunday,feels_good_zone,weather_holiday_1_1,weather_holiday_2_0,weather_holiday_2_1,weather_holiday_3_0
0,3,2019,7,0,2,1,1,29.246653,33.14480,70.4167,...,2.407602,2059.452790,0.0,0,0,0,False,False,False,False
1,1,2019,3,0,6,0,2,16.980847,20.67460,62.1250,...,3.658318,1054.935120,0.0,0,0,1,False,True,False,False
2,1,2019,12,0,0,0,1,10.489153,11.58500,48.3333,...,4.607492,506.975379,0.0,0,0,0,False,False,False,False
3,2,2019,4,0,6,0,2,15.443347,18.87520,48.9583,...,3.169982,756.080015,0.0,0,0,1,False,True,False,False
4,2,2018,4,0,5,1,2,13.803347,16.09770,72.9583,...,5.285169,1007.068731,0.0,1,1,0,False,True,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
505,3,2019,8,0,4,1,1,30.852500,35.35440,65.9583,...,2.137790,2034.978451,0.0,1,0,0,False,False,False,False
506,1,2018,2,0,3,1,1,9.091299,12.28585,42.3043,...,4.652762,384.601040,0.0,0,0,0,False,False,False,False
507,4,2018,12,0,6,0,2,10.591653,12.46855,56.0833,...,5.294547,594.014853,0.0,0,0,0,False,True,False,False
508,2,2018,3,0,1,1,2,17.647835,20.48675,73.7391,...,4.178128,1301.335470,0.0,0,0,1,False,True,False,False


In [30]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 510 entries, 0 to 509
Data columns (total 26 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   season_id            510 non-null    int64  
 1   year                 510 non-null    int64  
 2   month                510 non-null    int64  
 3   is_holiday           510 non-null    int64  
 4   weekday              510 non-null    int64  
 5   is_workingday        510 non-null    int64  
 6   weather_condition    510 non-null    int64  
 7   temperature          510 non-null    float64
 8   feels_like_temp      510 non-null    float64
 9   humidity             510 non-null    float64
 10  wind_speed           510 non-null    float64
 11  total_users          510 non-null    int64  
 12  day                  510 non-null    int32  
 13  dayofyear            510 non-null    int32  
 14  week                 510 non-null    int64  
 15  temp_diff            510 non-null    flo

Because our test data is in a separate CSV file, we have to apply the same preprocessing and feature extraction to it, after reading it into another Pandas dataframe.
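One way to keep the two pipelines in sync is to wrap the feature steps in a single function and call it on both dataframes. The sketch below is an assumption about how that could look, not the notebook's code: `add_features` is a hypothetical helper, it covers only a few of the features above, and the two tiny frames are made up.

```python
import pandas as pd

def add_features(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of `frame` with a few derived columns added (hypothetical helper)."""
    out = frame.copy()
    out["date"] = pd.to_datetime(out["date"], format="%d-%m-%Y", errors="coerce")
    out["day"] = out["date"].dt.day
    out["dayofyear"] = out["date"].dt.dayofyear
    out["temp_diff"] = out["feels_like_temp"] - out["temperature"]
    out["temp_x_humidity"] = out["temperature"] * out["humidity"]
    return out.drop(columns=["date"])

# Made-up rows with the same column names as the dataset
train_toy = pd.DataFrame({
    "date": ["31-07-2019", "03-03-2019"],
    "temperature": [29.2, 17.0],
    "feels_like_temp": [33.1, 20.7],
    "humidity": [70.4, 62.1],
    "total_users": [7216, 4066],
})
test_toy = train_toy.drop(columns=["total_users"])

# The same function is applied to both frames
train_feat = add_features(train_toy)
test_feat = add_features(test_toy)

# Apart from the target, both frames now carry identical columns
print(list(test_feat.columns))
```

With a single function, a change to one feature definition automatically applies to both the train and the test data.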

In [31]:
# Load the CSV file into a dataframe
df_test = pd.read_csv("regression-dataset-test-unlabeled.csv")

# Save the id of each row
test_ids = df_test["id"]

In [32]:
# Convert the date column to datetime format (dates are day-first)
df_test["date"] = pd.to_datetime(df_test["date"], errors="coerce", dayfirst=True)

In [33]:
df_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 220 entries, 0 to 219
Data columns (total 13 columns):
 #   Column             Non-Null Count  Dtype         
---  ------             --------------  -----         
 0   id                 220 non-null    int64         
 1   date               220 non-null    datetime64[ns]
 2   season_id          220 non-null    int64         
 3   year               220 non-null    int64         
 4   month              220 non-null    int64         
 5   is_holiday         220 non-null    int64         
 6   weekday            220 non-null    int64         
 7   is_workingday      220 non-null    int64         
 8   weather_condition  220 non-null    int64         
 9   temperature        220 non-null    float64       
 10  feels_like_temp    220 non-null    float64       
 11  humidity           220 non-null    float64       
 12  wind_speed         220 non-null    float64       
dtypes: datetime64[ns](1), float64(4), int64(8)
memory usage: 22.5 KB


In [34]:
# convert the year column to the actual year
df_test["year"] = np.where(
    df_test["year"] == 0,
    2018,
    2019
)

df_test["day"] = df_test["date"].dt.day

df_test["dayofyear"] = df_test["date"].dt.dayofyear

df_test["week"] = df_test["date"].dt.isocalendar().week.astype(int)

df_test["temp_diff"] = df_test["feels_like_temp"] - df_test["temperature"]

df_test["humidity_temp_ratio"] = df_test["humidity"] / (df_test["temperature"] + 1e-3)

df_test["temp_x_humidity"] = df_test["temperature"] * df_test["humidity"]

df_test["wind_x_holiday"] = df_test["wind_speed"] * df_test["is_holiday"]

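# Use the same weekday codes as in the train features (4 and 5 for weekend,
# 5 for is_Sunday) so that train and test columns mean the same thing
df_test["weekend"] = df_test["weekday"].apply(lambda x: 1 if x in [4, 5] else 0)

df_test["is_Sunday"] = (df_test["weekday"] == 5).astype(int)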

df_test["feels_good_zone"] = df_test["feels_like_temp"].apply(lambda x: 1 if 18 <= x <= 27 else 0)

df_test["weather_holiday"] = df_test["weather_condition"].astype(str) + "_" + df_test["is_holiday"].astype(str)

# Delete the date and id columns because we don't need them anymore
df_test = df_test.drop(columns=["date", "id"])

# Encode the weather_holiday column using one-hot encoding
df_test = pd.get_dummies(df_test, columns=["weather_holiday"], drop_first=True)
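Since `pd.get_dummies` is called separately on each dataframe, a category combination that is missing from one of them produces different dummy columns. The sketch below shows one possible way to align them with `reindex`; the toy frames and their category values are invented for illustration.

```python
import pandas as pd

# Toy train/test frames; the test set never sees the "3_0" combination
train = pd.DataFrame({"weather_holiday": ["1_0", "2_0", "3_0", "1_1"]})
test = pd.DataFrame({"weather_holiday": ["1_0", "2_0", "1_1"]})

train_d = pd.get_dummies(train, columns=["weather_holiday"], drop_first=True)
test_d = pd.get_dummies(test, columns=["weather_holiday"], drop_first=True)

# Give the test frame exactly the training columns, filling absent ones with 0
test_aligned = test_d.reindex(columns=train_d.columns, fill_value=0)
print(list(test_aligned.columns))
```

Note that `drop_first=True` removes the first category found in each frame. If that first category differs between train and test, reindexing cannot repair the mismatch, so encoding the combined data is a safer alternative in that case.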

In [35]:
df_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 220 entries, 0 to 219
Data columns (total 25 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   season_id            220 non-null    int64  
 1   year                 220 non-null    int64  
 2   month                220 non-null    int64  
 3   is_holiday           220 non-null    int64  
 4   weekday              220 non-null    int64  
 5   is_workingday        220 non-null    int64  
 6   weather_condition    220 non-null    int64  
 7   temperature          220 non-null    float64
 8   feels_like_temp      220 non-null    float64
 9   humidity             220 non-null    float64
 10  wind_speed           220 non-null    float64
 11  day                  220 non-null    int32  
 12  dayofyear            220 non-null    int32  
 13  week                 220 non-null    int64  
 14  temp_diff            220 non-null    float64
 15  humidity_temp_ratio  220 non-null    flo

In [36]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 510 entries, 0 to 509
Data columns (total 26 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   season_id            510 non-null    int64  
 1   year                 510 non-null    int64  
 2   month                510 non-null    int64  
 3   is_holiday           510 non-null    int64  
 4   weekday              510 non-null    int64  
 5   is_workingday        510 non-null    int64  
 6   weather_condition    510 non-null    int64  
 7   temperature          510 non-null    float64
 8   feels_like_temp      510 non-null    float64
 9   humidity             510 non-null    float64
 10  wind_speed           510 non-null    float64
 11  total_users          510 non-null    int64  
 12  day                  510 non-null    int32  
 13  dayofyear            510 non-null    int32  
 14  week                 510 non-null    int64  
 15  temp_diff            510 non-null    flo

Finally, we train several models, compare their MSE on a held-out validation split, and choose the best one for this regression task. The models are:

- **Gradient Boosting**

- **XGBoost**

- **CatBoost**

For each model, we print its validation MSE and save its test-set predictions to a CSV file. The model with the lowest MSE is the best one.

In [None]:
# Building train data and labels
X = df.drop(columns=["total_users"])
Y = df["total_users"]

# Add any train columns missing from the test dataframe (filled with 0), then match the train column order
for col in X.columns:
    if col not in df_test.columns:
        df_test[col] = 0
df_test = df_test[X.columns]

# Standardize both train and test data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
X_test_scaled = scaler.transform(df_test)

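# Select the best features for our model and drop the others
# (k is capped at the number of available features, which is below 60 here)
selector = SelectKBest(score_func=f_regression, k=min(60, X.shape[1]))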
X_selected = selector.fit_transform(X_scaled, Y)
X_selected_names = X.columns[selector.get_support()]
X_test_selected = X_test_scaled[:, selector.get_support(indices=True)]

# Split the labeled data into train and validation sets
X_train, X_val, Y_train, Y_val = train_test_split(X_selected, Y, test_size=0.2, random_state=42)

# Models and their hyperparameters
models = {
    "Gradient Boosting": GradientBoostingRegressor(
        n_estimators=10000,
        learning_rate=0.002
    ),
    "XGBoost": XGBRegressor(
        n_estimators=10000,
        learning_rate=0.002,
        max_depth=6,
        subsample=0.85,
        colsample_bytree=0.85,
        reg_lambda=1.2,
        reg_alpha=0.3
    ),
    "CatBoost": CatBoostRegressor(
        iterations=10000, 
        learning_rate=0.002,
        depth=6,
        verbose=0
    )
}

# Run all models and record their validation metrics
results = pd.DataFrame(columns=["Model", "MSE", "RMSE", "R2-Score", "MAPE", "MAE"])

for name, model in models.items():
    model.fit(X_train, Y_train)

    y_pred = model.predict(X_val)

    mse = mean_squared_error(Y_val, y_pred)

    print(f"The model {name} has the MSE {mse:.3f}")

    new_result = pd.DataFrame([{
        "Model": name,
        "MSE": f"{mse:.3f}",
        "RMSE": f"{(mse ** 0.5):.3f}",
        "R2-Score": f"{r2_score(Y_val, y_pred):.3f}",
        "MAPE": f"{(np.mean(np.abs((Y_val - y_pred) / Y_val)) * 100):.3f}%",
        "MAE": f"{mean_absolute_error(Y_val, y_pred):.3f}"
    }])
    results = pd.concat([results, new_result], ignore_index=True)

    # Refit on the full labeled data before predicting the unlabeled test set
    model.fit(X_selected, Y)

    final_preds = model.predict(X_test_selected)

    submission = pd.DataFrame({"id": test_ids, "label": final_preds})

    submission.to_csv(f"{name}_predictions.csv", index=False)



The model Gradient Boosting has the MSE 503951.964
The model XGBoost has the MSE 502517.250
The model CatBoost has the MSE 476687.075
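With the three validation MSE values printed above, picking the best model reduces to taking a minimum. A minimal sketch that uses only those reported numbers (the dictionary name `val_mse` is made up for illustration):

```python
# Validation MSE values copied from the output above
val_mse = {
    "Gradient Boosting": 503951.964,
    "XGBoost": 502517.250,
    "CatBoost": 476687.075,
}

# The best model is the one with the smallest validation MSE
best_name = min(val_mse, key=val_mse.get)
best_rmse = val_mse[best_name] ** 0.5

print(f"Best model: {best_name} (RMSE about {best_rmse:.1f})")
```

The gap between the three models is small, so a different random split could plausibly change the ranking.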


Then we calculate and report MSE, RMSE, R2-Score, MAPE and MAE on the validation split held out from the train data.

- **MSE (Mean Squared Error)** -> The average of the squared differences between the actual values and the predicted values. It measures how close the predicted values are to the actual values. Squaring the error magnifies the impact of larger errors, making it sensitive to outliers.

$$ MSE = \dfrac{1}{n} \sum_{i=1}^{n} (y_{i} - \hat{y_{i}})^2 $$

- **RMSE (Root Mean Squared Error)** -> The square root of the MSE. It has the same unit as the target variable, making it more interpretable. Like MSE, it penalizes large errors more heavily than MAE does.

$$ RMSE = \sqrt{MSE} $$

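- **R2-Score** -> This measures the proportion of variance in the target variable that is explained by the model. It is at most 1 (a perfect fit). A score of 0 means the model does no better than always predicting the mean, and the score can be negative for a model that does worse than that.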

$$ R2Score = 1 - \dfrac{\sum_{i=1}^{n} (y_{i} - \hat{y_{i}})^2}{\sum_{i=1}^{n} (y_{i} - \bar{y})^2} $$

- **MAPE (Mean Absolute Percentage Error)** -> The average of the absolute percentage errors between the actual and predicted values. It expresses each error relative to the size of the actual value, which makes it independent of the target's scale, but it is undefined when an actual value is 0.

$$ MAPE = \frac{1}{n} \sum_{i=1}^{n} \left| \frac{y_{i} - \hat{y_{i}}}{y_{i}} \right| \times 100 $$

- **MAE (Mean Absolute Error)** -> The average of the absolute differences between the actual and predicted values. It is less sensitive to outliers compared to MSE and RMSE.

$$ MAE = \frac{1}{n} \sum_{i=1}^{n} \left| y_{i} - \hat{y_{i}} \right| $$

Where $y_{i}$ is the actual value, $\hat{y_{i}}$ is the predicted value, $\bar{y}$ is the mean of actual values and *n* is the number of samples.
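To make the definitions concrete, the sketch below evaluates all five metrics in plain Python on three made-up points. The values in `y_true` and `y_pred` are invented for illustration and have nothing to do with the dataset.

```python
import math

# Made-up actual and predicted values
y_true = [100.0, 200.0, 300.0]
y_pred = [110.0, 190.0, 330.0]

n = len(y_true)
errors = [t - p for t, p in zip(y_true, y_pred)]
y_mean = sum(y_true) / n

# Each line follows the corresponding formula above
mse = sum(e ** 2 for e in errors) / n
rmse = math.sqrt(mse)
mae = sum(abs(e) for e in errors) / n
mape = sum(abs(e / t) for e, t in zip(errors, y_true)) / n * 100
r2 = 1 - sum(e ** 2 for e in errors) / sum((t - y_mean) ** 2 for t in y_true)

print(f"MSE={mse:.3f} RMSE={rmse:.3f} MAE={mae:.3f} MAPE={mape:.3f}% R2={r2:.3f}")
```

The single error of 30 contributes 900 of the 1100 total squared error, which illustrates why MSE and RMSE react more strongly to large errors than MAE.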

In [38]:
# Show the model results on the validation split
results

Unnamed: 0,Model,MSE,RMSE,R2-Score,MAPE,MAE
0,Gradient Boosting,503951.964,709.896,0.837,15.678%,476.927
1,XGBoost,502517.25,708.885,0.837,15.879%,465.663
2,CatBoost,476687.075,690.425,0.846,16.190%,466.122
