# Bike Rental Prediction — Daily Counts 

- Objective: Predict daily bike rentals (`cnt`) using seasonal and environmental features.

- Dataset: `day.csv`


## Bike Rental Demand Prediction
**Objective:** Predict daily bike rental counts (`cnt`) using `day.csv`.  
**Contents:** EDA → Preprocessing → Feature Engineering → Modeling → Evaluation → Model Comparison → Challenges & Conclusion.


In [2]:
# Import the libraries used throughout the notebook
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error


In [3]:
# Load the dataset
df = pd.read_csv('day.csv')

In [4]:
df

Unnamed: 0,instant,dteday,season,yr,mnth,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt
0,1,2011-01-01,1,0,1,0,6,0,2,0.344167,0.363625,0.805833,0.160446,331,654,985
1,2,2011-01-02,1,0,1,0,0,0,2,0.363478,0.353739,0.696087,0.248539,131,670,801
2,3,2011-01-03,1,0,1,0,1,1,1,0.196364,0.189405,0.437273,0.248309,120,1229,1349
3,4,2011-01-04,1,0,1,0,2,1,1,0.200000,0.212122,0.590435,0.160296,108,1454,1562
4,5,2011-01-05,1,0,1,0,3,1,1,0.226957,0.229270,0.436957,0.186900,82,1518,1600
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
726,727,2012-12-27,1,1,12,0,4,1,2,0.254167,0.226642,0.652917,0.350133,247,1867,2114
727,728,2012-12-28,1,1,12,0,5,1,2,0.253333,0.255046,0.590000,0.155471,644,2451,3095
728,729,2012-12-29,1,1,12,0,6,0,2,0.253333,0.242400,0.752917,0.124383,159,1182,1341
729,730,2012-12-30,1,1,12,0,0,0,1,0.255833,0.231700,0.483333,0.350754,364,1432,1796


# Domain Analysis


In [5]:
# Basic Checks
df.shape

(731, 16)

In [6]:
df.columns

Index(['instant', 'dteday', 'season', 'yr', 'mnth', 'holiday', 'weekday',
       'workingday', 'weathersit', 'temp', 'atemp', 'hum', 'windspeed',
       'casual', 'registered', 'cnt'],
      dtype='object')

In [7]:
df.dtypes

instant         int64
dteday         object
season          int64
yr              int64
mnth            int64
holiday         int64
weekday         int64
workingday      int64
weathersit      int64
temp          float64
atemp         float64
hum           float64
windspeed     float64
casual          int64
registered      int64
cnt             int64
dtype: object

In [8]:
df.head()

Unnamed: 0,instant,dteday,season,yr,mnth,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt
0,1,2011-01-01,1,0,1,0,6,0,2,0.344167,0.363625,0.805833,0.160446,331,654,985
1,2,2011-01-02,1,0,1,0,0,0,2,0.363478,0.353739,0.696087,0.248539,131,670,801
2,3,2011-01-03,1,0,1,0,1,1,1,0.196364,0.189405,0.437273,0.248309,120,1229,1349
3,4,2011-01-04,1,0,1,0,2,1,1,0.2,0.212122,0.590435,0.160296,108,1454,1562
4,5,2011-01-05,1,0,1,0,3,1,1,0.226957,0.22927,0.436957,0.1869,82,1518,1600


In [9]:
df.tail()

Unnamed: 0,instant,dteday,season,yr,mnth,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt
726,727,2012-12-27,1,1,12,0,4,1,2,0.254167,0.226642,0.652917,0.350133,247,1867,2114
727,728,2012-12-28,1,1,12,0,5,1,2,0.253333,0.255046,0.59,0.155471,644,2451,3095
728,729,2012-12-29,1,1,12,0,6,0,2,0.253333,0.2424,0.752917,0.124383,159,1182,1341
729,730,2012-12-30,1,1,12,0,0,0,1,0.255833,0.2317,0.483333,0.350754,364,1432,1796
730,731,2012-12-31,1,1,12,0,1,1,2,0.215833,0.223487,0.5775,0.154846,439,2290,2729


In [10]:
df.info()

In [None]:
df.describe()

# Exploratory Data Analysis

## Univariate Analysis

In [None]:
# Univariate: distribution of target
plt.figure(figsize=(6,4))
sns.histplot(df['cnt'], kde=True)
plt.title('Distribution of daily count (cnt)')


### Shape Interpretation

- The distribution looks slightly right-skewed: there are more low-rental days (left side) than very high ones.
- The peak around 4000 suggests that this is the typical daily rental count.
- The small bump near 8000 means there are a few days with unusually high rentals.
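
The visual reading of skew can be cross-checked numerically. The sketch below is only an illustration: `shape_summary` is a hypothetical helper and `toy_counts` holds made-up values, not rows from `day.csv`. On the real data the same check would be `df['cnt'].skew()` together with a comparison of the mean and median.

```python
import pandas as pd

def shape_summary(values):
    """Return mean, median and sample skewness for a sequence of counts."""
    s = pd.Series(values, dtype="float64")
    return {"mean": s.mean(), "median": s.median(), "skew": s.skew()}

# Hypothetical daily counts with one unusually high day (not from day.csv).
toy_counts = [2000, 2500, 3000, 3000, 3500, 4000, 8000]

summary = shape_summary(toy_counts)
print(summary)
```

A positive `skew` and a mean above the median both point to a longer right tail; values near zero would indicate a roughly symmetric distribution.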

In [None]:
# Bivariate examples: draw all three panels in one figure.
# Splitting the subplot calls across cells starts a new figure in each cell.
fig, axes = plt.subplots(1, 3, figsize=(12, 4))
sns.boxplot(x='season', y='cnt', data=df, ax=axes[0])
axes[0].set_title('cnt by season')
sns.scatterplot(x='temp', y='cnt', data=df, ax=axes[1])
axes[1].set_title('cnt vs temp')
sns.boxplot(x='weathersit', y='cnt', data=df, ax=axes[2])
axes[2].set_title('cnt by weather')
plt.tight_layout()

#### Visualize the distribution of cnt and its relationships to season, temp, and weathersit.

Seasonal differences: the summer season has higher ride counts.

Relationship between temperature and counts: likely positive.

Weather effect: bad weather lowers counts.
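
A numeric companion to the boxplots is the per-group median of `cnt`. The sketch below assumes a hypothetical helper `median_by` and a toy frame with made-up values (not taken from `day.csv`); on the real data it would be called as `median_by(df, 'weathersit')` or `median_by(df, 'season')`.

```python
import pandas as pd

def median_by(frame, group_col, target="cnt"):
    """Median of `target` within each level of `group_col`."""
    return frame.groupby(group_col)[target].median()

# Hypothetical values for illustration only (not rows from day.csv).
toy = pd.DataFrame({
    "weathersit": [1, 1, 2, 2, 3, 3],
    "cnt": [5000, 6000, 4000, 4400, 1500, 2100],
})

medians = median_by(toy, "weathersit")
print(medians)
```

If the medians fall as `weathersit` increases, that agrees with the boxplot reading that worse weather lowers counts.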


# Correlation matrix & collinearity check

In [None]:
plt.figure(figsize=(10,8))
corr = df.drop(columns=['dteday']).corr()
sns.heatmap(corr, annot=True, fmt='.2f', cmap='coolwarm')
plt.title('Correlation matrix')


#### Detect correlations; in particular, check temp vs atemp and casual/registered vs cnt
atemp and temp are often highly correlated, so consider dropping one.

casual and registered sum to cnt, so we drop them to avoid target leakage when predicting cnt.
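
Both points can be checked directly rather than read off the heatmap. The sketch below rebuilds only the first five rows shown in the `df.head()` output (a subset of columns), so the correlation it prints is illustrative; on the full data the same two expressions would be applied to `df`.

```python
import pandas as pd

# First five rows of day.csv as displayed in df.head() (subset of columns).
sample = pd.DataFrame({
    "temp": [0.344167, 0.363478, 0.196364, 0.200000, 0.226957],
    "atemp": [0.363625, 0.353739, 0.189405, 0.212122, 0.229270],
    "casual": [331, 131, 120, 108, 82],
    "registered": [654, 670, 1229, 1454, 1518],
    "cnt": [985, 801, 1349, 1562, 1600],
})

# Leakage check: does casual + registered reproduce cnt on every row?
leak = bool((sample["casual"] + sample["registered"] == sample["cnt"]).all())
print("casual + registered == cnt on every row:", leak)

# Collinearity check: Pearson correlation between temp and atemp.
temp_atemp_corr = sample["temp"].corr(sample["atemp"])
print("corr(temp, atemp):", round(temp_atemp_corr, 3))
```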

# Preprocessing: prepare features & target

In [None]:
# Work on a copy
data = df.copy()

In [None]:
# Drop identifiers & leakage targets
data = data.drop(['instant','dteday','casual','registered'], axis=1)


In [None]:
# Convert multi-level categorical variables to dummies
# (holiday, workingday and yr are already binary 0/1, so they are left as-is)
cat_cols = ['season','mnth','weekday','weathersit']
data = pd.get_dummies(data, columns=cat_cols, drop_first=True)

In [None]:
# Feature / target split
X = data.drop('cnt', axis=1)
y = data['cnt']

In [None]:
# Train-test split (80-20)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

##### Explanation:

Drop `instant` and `dteday` (identifiers) and `casual`/`registered` (leakage).

One-hot encode the categorical features so models can use them.

Split into train and test sets (80/20).

### Insights:

We removed `casual` and `registered` because they sum exactly to the target; keeping them would leak the answer to the model.
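
To make the encoding step concrete, here is a minimal sketch on a toy frame. The rows and the names `toy`, `X_toy` and `y_toy` are hypothetical; only the column names follow `day.csv`. With `drop_first=True`, the first level of each encoded column becomes the implicit baseline, so a feature with k levels yields k-1 dummy columns.

```python
import pandas as pd

# Hypothetical rows for illustration; only the column names follow day.csv.
toy = pd.DataFrame({
    "season": [1, 2, 3, 4],
    "weathersit": [1, 2, 1, 3],
    "temp": [0.20, 0.45, 0.70, 0.30],
    "cnt": [1349, 4000, 6000, 2500],
})

# One-hot encode, dropping the first level of each feature as the baseline.
encoded = pd.get_dummies(toy, columns=["season", "weathersit"], drop_first=True)

# Feature / target split, mirroring the cells above.
X_toy = encoded.drop(columns="cnt")
y_toy = encoded["cnt"]
print(list(X_toy.columns))
```

Dropping one level per feature avoids perfectly collinear dummy columns, which matters mostly for unregularized linear models.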

# Scaling (optional, mainly for linear models)

In [None]:
scaler = StandardScaler()
num_cols = ['temp','atemp','hum','windspeed']  # adjust if atemp dropped

X_train_scaled = X_train.copy()
X_test_scaled = X_test.copy()

X_train_scaled[num_cols] = scaler.fit_transform(X_train[num_cols])
X_test_scaled[num_cols] = scaler.transform(X_test[num_cols])


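#### Explanation: StandardScaler centers each numeric feature and scales it to unit variance, which helps regularized linear models and gradient-based models.

Insight: Random forests don't require scaling. Ordinary least-squares predictions are also unaffected by it, but ridge and lasso penalties depend on feature scale, so those models benefit from scaled features.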
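
The effect of scaling on different linear models can be checked on synthetic data. This is a sketch under stated assumptions: the data is randomly generated (not `day.csv`), `max_pred_gap` is a hypothetical helper, and `alpha=10.0` is an arbitrary choice. The scaler is fitted on the training split only, as in the cell above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales (not the bike data).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) * np.array([1.0, 10.0, 100.0])
y = X @ np.array([2.0, 0.5, 0.01]) + rng.normal(scale=0.1, size=200)
X_tr, X_te, y_tr = X[:150], X[150:], y[:150]

# Fit the scaler on the training split only, then transform both splits.
scaler = StandardScaler().fit(X_tr)
X_tr_s, X_te_s = scaler.transform(X_tr), scaler.transform(X_te)

def max_pred_gap(model_cls, **kwargs):
    """Largest absolute difference between raw-fit and scaled-fit predictions."""
    raw = model_cls(**kwargs).fit(X_tr, y_tr).predict(X_te)
    scaled = model_cls(**kwargs).fit(X_tr_s, y_tr).predict(X_te_s)
    return float(np.max(np.abs(raw - scaled)))

ols_gap = max_pred_gap(LinearRegression)
ridge_gap = max_pred_gap(Ridge, alpha=10.0)
print("OLS gap:", ols_gap)
print("Ridge gap:", ridge_gap)
```

The OLS gap should be at the level of floating-point noise, while the ridge gap is clearly larger, because the L2 penalty shrinks coefficients differently depending on each feature's scale.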