# **Project Name**    - Regression - Bike Sharing Demand Prediction



##### **Project Type**    - Regression
##### **Contribution**    - Individual


# **Project Summary -**

Business Context:
Rental bikes have become a popular mode of transportation in many urban cities. The ability to provide rental bikes at the right time and place is crucial for minimizing waiting times and ensuring seamless mobility for users. This project aims to develop a predictive model that can forecast the number of bikes needed at each hour. This model will help maintain a stable supply of rental bikes, reduce operational costs, and enhance the user experience.


Dataset:
The dataset used for this project is the Seoul Bike Sharing Demand dataset, which contains historical hourly data on bike rentals. Its features include date, hour, temperature, humidity, wind speed, visibility, dew point temperature, solar radiation, rainfall, snowfall, season, holiday, and functioning day. Together these provide the information needed to build a predictive model.

# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


Problem Statement:
Predict the bike count required at each hour for the stable supply of rental bikes in urban cities.

#### **Define Your Business Objective?**

Objective:
The objective of this project is to predict the bike count required at each hour for the stable supply of rental bikes in urban cities. This is crucial for minimizing waiting time and ensuring mobility comfort for users.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception handling, production-grade code, and deployment-ready code are a plus and earn additional credits.
     
     These credits give an advantage over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following questions are answered for each chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What insights did you find from the chart?
*   Will the gained insights help create a positive business impact? Are there any insights that lead to negative growth? Justify with a specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the visualization in a structured way while following the "UBM" rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.feature_selection import VarianceThreshold
from sklearn.model_selection import cross_val_score

### Dataset Loading

In [None]:
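# Mount Google Drive (Colab only)
from google.colab import drive
drive.mount('/content/drive')

# Load the dataset as Latin-1 so the '°' in the column names decodes correctly
filepath = "/content/SeoulBikeData.csv"
df = pd.read_csv(filepath, encoding='latin-1')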
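
The guidelines above ask for exception handling and a notebook that runs end to end without errors. As a minimal sketch of what that could look like for the loading step, the hypothetical helper `load_bike_data` (not part of the original notebook) wraps `pd.read_csv` and raises a clear error when the file is missing, unreadable, or empty. The default `latin-1` encoding mirrors the encoding argument used in the loading cell.

```python
from pathlib import Path

import pandas as pd


def load_bike_data(path, encoding="latin-1"):
    """Load the CSV at `path`, failing early with a descriptive error.

    Hypothetical helper for illustration; the name and checks are assumptions.
    """
    csv_path = Path(path)
    if not csv_path.is_file():
        raise FileNotFoundError(f"Dataset not found at {csv_path}")
    try:
        frame = pd.read_csv(csv_path, encoding=encoding)
    except (UnicodeDecodeError, pd.errors.ParserError) as exc:
        # Re-raise with the file name so the failing input is obvious
        raise ValueError(f"Could not parse {csv_path}: {exc}") from exc
    if frame.empty:
        raise ValueError(f"{csv_path} contains no rows")
    return frame
```

In the notebook this would replace the bare `pd.read_csv` call, for example `df = load_bike_data("/content/SeoulBikeData.csv")`.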

### Dataset First View

In [None]:
# Dataset First Look
# Display the first few rows of the dataset
print(df.head())

# Convert the Date column to datetime format (source dates are day/month/year)
df['Date'] = pd.to_datetime(df['Date'], format='%d/%m/%Y')

# Column types and non-null counts (df.info() prints directly and returns None)
df.info()

# Summary statistics
print(df.describe())


### Dataset cleanup

In [None]:
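# Count missing values per column and fully duplicated rows
print(df.isnull().sum())
print(df.duplicated().sum())

# dropna() and drop_duplicates() return new DataFrames, so assign the results back
df = df.dropna()
df = df.drop_duplicates()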

### EDA

In [None]:
# Visualize the distribution of the target variable
sns.histplot(df['Rented Bike Count'], bins=30, kde=True)
plt.title('Distribution of Rented Bike Count')
plt.xlabel('Rented Bike Count')
plt.ylabel('Frequency')
plt.show()

# Boxplot for target variable
plt.figure(figsize=(10, 6))
sns.boxplot(y=df['Rented Bike Count'])
plt.title('Boxplot of Rented Bike Count')
plt.show()

In [None]:
# Pairplot for numerical features
num_features = ['Rented Bike Count', 'Temperature(°C)', 'Humidity(%)', 'Wind speed (m/s)', 'Visibility (10m)',
                'Dew point temperature(°C)', 'Solar Radiation (MJ/m2)', 'Rainfall(mm)', 'Snowfall (cm)']

sns.pairplot(df[num_features])
plt.show()

# Histograms for numerical features
df[num_features].hist(bins=30, figsize=(15, 10))
plt.suptitle('Histograms of Numerical Features')
plt.show()

# Boxplots for numerical features
plt.figure(figsize=(15, 10))
for i, col in enumerate(num_features[1:], 1):
    plt.subplot(3, 3, i)
    sns.boxplot(y=df[col])
    plt.title(f'Boxplot of {col}')
plt.tight_layout()
plt.show()

In [None]:
# Bar plots for categorical features
cat_features = ['Seasons', 'Holiday', 'Functioning Day']

plt.figure(figsize=(15, 5))
for i, col in enumerate(cat_features, 1):
    plt.subplot(1, 3, i)
    sns.countplot(x=df[col])
    plt.title(f'Countplot of {col}')
plt.tight_layout()
plt.show()

# Boxplots of target variable with respect to categorical features
plt.figure(figsize=(15, 5))
for i, col in enumerate(cat_features, 1):
    plt.subplot(1, 3, i)
    sns.boxplot(x=df[col], y=df['Rented Bike Count'])
    plt.title(f'{col} vs Rented Bike Count')
plt.tight_layout()
plt.show()

In [None]:
df.describe()


### Feature Engineering

In [None]:
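# Calendar features derived from the date
df['Day_of_Week'] = df['Date'].dt.dayofweek
df['Month'] = df['Date'].dt.month
df['Year'] = df['Date'].dt.year

# Interaction terms between weather variables
df['Temp_Humidity_Interaction'] = df['Temperature(°C)'] * df['Humidity(%)']
df['Wind_Visibility_Interaction'] = df['Wind speed (m/s)'] * df['Visibility (10m)']

# Demand from the previous 1 to 3 rows (rows are assumed to be in hourly order)
df['Lag_1'] = df['Rented Bike Count'].shift(1)
df['Lag_2'] = df['Rented Bike Count'].shift(2)
df['Lag_3'] = df['Rented Bike Count'].shift(3)

# The first three rows have no lag values, so drop them
df = df.dropna()

# One-hot encode the categorical columns
df = pd.get_dummies(df, columns=['Seasons', 'Holiday', 'Functioning Day'], drop_first=True)

# Drop one feature from each highly correlated pair (|r| > 0.9).
# The target and Date are excluded so the filter only compares features with each other.
feature_cols = df.drop(columns=['Rented Bike Count', 'Date'])
corr_matrix = feature_cols.corr().abs()
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
to_drop = [column for column in upper.columns if any(upper[column] > 0.9)]
df = df.drop(columns=to_drop)

# Separate features and target
X = df.drop(['Rented Bike Count', 'Date'], axis=1)
y = df['Rented Bike Count']

# Standardize the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Hold out 20% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)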
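
The correlation filter in this cell can also be written as a small reusable function that is applied to the feature columns only. The sketch below is illustrative: the name `drop_correlated_features` and the synthetic columns are assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd


def drop_correlated_features(features, threshold=0.9):
    """Drop the later column of every pair whose |correlation| exceeds `threshold`."""
    corr = features.corr(numeric_only=True).abs()
    # Keep only the strict upper triangle so each pair is inspected once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return features.drop(columns=to_drop), to_drop


# Synthetic demo: 'dew_point' is built as a near-copy of 'temp'
rng = np.random.default_rng(0)
base = rng.normal(size=200)
demo = pd.DataFrame({
    "temp": base,
    "dew_point": base + rng.normal(scale=0.01, size=200),
    "wind": rng.normal(size=200),
    "target": rng.normal(size=200),
})

# Pass the features only, so the target can never be removed by the filter
reduced, dropped = drop_correlated_features(demo.drop(columns="target"))
print(dropped)
```

Because only the upper triangle is inspected, the first column of a correlated pair is kept and the later one is dropped, so column order decides which feature survives.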


### Model Implementation

In [None]:
# Initialize models
lr_model = LinearRegression()
rf_model = RandomForestRegressor(random_state=42)
ridge_model = Ridge()

# Train models
lr_model.fit(X_train, y_train)
rf_model.fit(X_train, y_train)
ridge_model.fit(X_train, y_train)

# Predictions
y_pred_lr = lr_model.predict(X_test)
y_pred_rf = rf_model.predict(X_test)
y_pred_ridge = ridge_model.predict(X_test)

# Evaluate models
lr_mse = mean_squared_error(y_test, y_pred_lr)
lr_r2 = r2_score(y_test, y_pred_lr)

rf_mse = mean_squared_error(y_test, y_pred_rf)
rf_r2 = r2_score(y_test, y_pred_rf)

ridge_mse = mean_squared_error(y_test, y_pred_ridge)
ridge_r2 = r2_score(y_test, y_pred_ridge)

print(f'Linear Regression MSE: {lr_mse}, R2: {lr_r2}')
print(f'Random Forest MSE: {rf_mse}, R2: {rf_r2}')
print(f'Ridge Regression MSE: {ridge_mse}, R2: {ridge_r2}')

# Cross-validation
cv_scores_lr = cross_val_score(lr_model, X_scaled, y, cv=5, scoring='neg_mean_squared_error')
cv_scores_rf = cross_val_score(rf_model, X_scaled, y, cv=5, scoring='neg_mean_squared_error')
cv_scores_ridge = cross_val_score(ridge_model, X_scaled, y, cv=5, scoring='neg_mean_squared_error')

print(f'Linear Regression CV MSE: {-cv_scores_lr.mean()}')
print(f'Random Forest CV MSE: {-cv_scores_rf.mean()}')
print(f'Ridge Regression CV MSE: {-cv_scores_ridge.mean()}')
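
Scaling the full feature matrix before cross-validation lets each held-out fold influence the scaler, and shuffled or arbitrary folds can mix past and future hours even though the features include lagged demand. One possible alternative, sketched here on synthetic data, puts the scaler inside a `Pipeline` and evaluates with `TimeSeriesSplit`. `X_demo` and `y_demo` are made-up stand-ins for the real features and target, and the choice of Ridge is only for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import TimeSeriesSplit, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 300 time-ordered rows, 4 features
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(300, 4))
y_demo = X_demo @ np.array([3.0, -2.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=300)

# The scaler is refit on the training part of every fold, so no statistics
# from the held-out rows leak into preprocessing
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", Ridge(alpha=1.0)),
])

# Each split trains on earlier rows and validates on the rows that follow
cv = TimeSeriesSplit(n_splits=5)
scores = cross_val_score(pipeline, X_demo, y_demo, cv=cv,
                         scoring="neg_mean_squared_error")
cv_mse = -scores.mean()
print(f"Ridge pipeline CV MSE: {cv_mse:.4f}")
```

With the real data, the unscaled `X` and `y` would be passed in place of `X_demo` and `y_demo`, provided the rows are still in chronological order.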


### Model Evaluation and Model Improvement

In [None]:
# Feature importance from Random Forest
importances = rf_model.feature_importances_
features = X.columns
importance_df = pd.DataFrame({'Feature': features, 'Importance': importances}).sort_values(by='Importance', ascending=False)

plt.figure(figsize=(12, 8))
sns.barplot(x='Importance', y='Feature', data=importance_df)
plt.title('Feature Importance')
plt.show()

# Evaluating Residuals
residuals = y_test - y_pred_rf
sns.histplot(residuals, bins=30, kde=True)
plt.title('Distribution of Residuals')
plt.show()

# Scatter plot of actual vs predicted values
plt.scatter(y_test, y_pred_rf, alpha=0.3)
plt.plot([y.min(), y.max()], [y.min(), y.max()], 'k--', lw=2)
plt.xlabel('Actual')
plt.ylabel('Predicted')
plt.title('Actual vs Predicted Values')
plt.show()
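
MSE is expressed in squared bikes, which is hard to relate to an hourly supply decision. A small helper (hypothetical name `regression_report`, not part of the original notebook) could report RMSE and MAE alongside it, so the error reads in bikes per hour. The numbers in the example are made up for illustration.

```python
import numpy as np


def regression_report(y_true, y_pred):
    """Return MSE, RMSE and MAE for two equally sized sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred
    mse = float(np.mean(errors ** 2))
    return {
        "MSE": mse,
        "RMSE": float(np.sqrt(mse)),             # same unit as the target
        "MAE": float(np.mean(np.abs(errors))),   # less sensitive to large misses
    }


# Illustrative values only: three hours of actual vs predicted demand
report = regression_report([100, 250, 400], [110, 240, 430])
print(report)
```

In the notebook the same call could be made as `regression_report(y_test, y_pred_rf)` for each model.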


# **Conclusion**

Write the conclusion here.

### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***