# Introduction

Bike sharing systems are a means of renting bicycles where the process of obtaining membership, rental, and bike return is automated via a network of kiosk locations throughout a city. Using these systems, people can rent a bike from one location and return it to a different place on an as-needed basis. 

This dataset contains information about bike rentals, including features such as datetime, weather conditions, and holidays. It provides hourly rental data spanning two years. For this competition, the training set comprises the first 19 days of each month, while the test set covers the 20th to the end of the month.

By harnessing the power of regression algorithms, this project aims to predict the total count of bikes rented during each hour covered by the test set, based on only information available before the rental period. By evaluating the performance of different regression models using the Root Mean Squared Logarithmic Error (RMSLE), we will select the most accurate model that best captures the underlying relationships within the data.

Ultimately, the goal of this project is not only to develop a reliable prediction model but also, through a comprehensive analysis of the dataset, to uncover meaningful insights about the factors influencing bike rentals. These insights could potentially aid stakeholders in making informed decisions to optimize bike availability, marketing strategies, and resource allocation.

### Understanding the Variables

1. datetime: hourly date + timestamp  

2. season:    
           1 = spring   
           2 = summer
           3 = fall
           4 = winter
           
           
3. holiday: whether the day is considered a holiday, 
             1 = holiday 
             0 = not holiday
            

4. workingday: whether the day is neither a weekend nor holiday
                1 = working day 
                0 = not working day
               

5. weather: 
            1 = Clear, Few clouds, Partly cloudy
            2 = Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
            3 = Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
            4 = Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog 
            
            
6. temp: temperature in Celsius

7. atemp: "feels like" temperature in Celsius

8. humidity: relative humidity

9. windspeed: wind speed

10. casual: number of non-registered user rentals initiated

11. registered: number of registered user rentals initiated

12. count: number of total rentals

In [None]:
# To begin, let's import the libraries that we'll be using throughout this notebook:

# Data Manipulation Libraries
import numpy as np 
import pandas as pd 

# Data Visualization Libraries
import seaborn as sns
import matplotlib.pyplot as plt

# Machine Learning Libraries
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import mean_squared_log_error, make_scorer

# Machine Learning Models
from sklearn.linear_model import LinearRegression  
from sklearn.tree import DecisionTreeRegressor  
from sklearn.ensemble import RandomForestRegressor

In [None]:
# List the files in the input directory to find the dataset's file names.
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
# Load the train data.
train = pd.read_csv("/kaggle/input/bike-sharing-demand/train.csv")
train.head()

In [None]:
# Load the test data.
test = pd.read_csv("/kaggle/input/bike-sharing-demand/test.csv")
test.head()

# Preparing the train data

In [None]:
# Seeing the shape of the data.
train.shape

In [None]:
# Check for duplicated rows.
train.duplicated().sum()

In [None]:
# seeing if there are null values.
train.isna().sum()

In [None]:
# Seeing information about data.
train.info()

To enable deeper analysis, we will convert the "datetime" variable from its current "object" type to a proper "datetime" type. This makes it possible to split the timestamp into separate components for more precise temporal analysis, revealing trends across months, days, and hours.

In [None]:
# Convert the 'datetime' column to datetime format
train['datetime'] = pd.to_datetime(train['datetime'])

In [None]:
# Seeing information about data.
train.info()

In [None]:
# Extract the year from the 'datetime' column and create a new 'year' column
train['year'] = train['datetime'].dt.year

# Extract the month as its name from the 'datetime' column and create a new 'month' column
train['month'] = train['datetime'].dt.month_name()

# Extract the day as its name from the 'datetime' column and create a new 'day' column
train['day'] = train['datetime'].dt.day_name()

# Extract the hour from the 'datetime' column and create a new 'hour' column
train['hour'] = train['datetime'].dt.hour

train.head()

To make the dataset easier to explore, we split the datetime variable into distinct temporal components, namely the year, month, day, and hour. The result is a more structured representation of time-related data that simplifies the analysis and visualization of bike rental patterns.

In [None]:
# Replace the values in the 'season' column with corresponding strings
# (assigning the result back avoids pandas' chained in-place assignment warning)
train['season'] = train['season'].replace({1: 'Spring', 2: 'Summer', 3: 'Fall', 4: 'Winter'})

# Replace the values in the 'holiday' column with corresponding strings
train['holiday'] = train['holiday'].replace({1: 'Holiday', 0: 'Not Holiday'})

# Replace the values in the 'workingday' column with corresponding strings
train['workingday'] = train['workingday'].replace({1: 'Workingday', 0: 'Not Workingday'})

# Replace the values in the 'weather' column with corresponding strings
train['weather'] = train['weather'].replace({1: 'Clear', 2: 'Mist', 3: 'Rain', 4: 'Snow'})

train.head()

To improve data clarity and interpretation we replaced numerical values in certain columns with corresponding descriptive strings. This process facilitates a more intuitive understanding of the dataset's categorical attributes and makes it more accessible for analysis and interpretation.

In [None]:
# Categorical columns.
categorical_features = train[['season', 'holiday', 'workingday', 'weather',  'year', 'month', 'day', 'hour']]

for i in categorical_features:
    print(train[i].value_counts())
    print('-' * 50)

We made two observations:

First, the 'Snow' category contains only a single record, which is too few to analyze or model reliably. To mitigate this, we chose to merge it into the 'Rain' category within the same column.

Second, the number of 'Holiday' rows does not match the number of 'Not Workingday' rows. As described in the variable overview, a working day is a day that is neither a weekend nor a holiday, so the dataset contains rows where the day is neither a working day nor a holiday. These rows represent a third case: the weekend. It follows that 'holiday' here refers exclusively to public holidays, which are distinct from weekend days.
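
The three-way split described above can also be stored as a single column. The following is a minimal sketch on made-up rows, assuming the original 0/1 encoding of `holiday` and `workingday`; the column name `day_type` is hypothetical and not part of the dataset.

```python
import numpy as np
import pandas as pd

# Toy rows with the original 0/1 flags (values are made up for illustration)
toy = pd.DataFrame({
    "holiday":    [0, 0, 1, 0],
    "workingday": [1, 0, 0, 1],
})

# Conditions are checked in order; rows matching neither fall through to the default
conditions = [
    toy["workingday"] == 1,   # regular working day
    toy["holiday"] == 1,      # public holiday
]
choices = ["Workingday", "Holiday"]

# 'day_type' is a hypothetical helper column: neither flag set means weekend
toy["day_type"] = np.select(conditions, choices, default="Weekend")

print(toy)
```

A single categorical column like this can make later group-by comparisons shorter than keeping three filtered DataFrames.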

In [None]:
# Define a mapping dictionary to combine the clusters
cluster_mapping = {"Snow" : "Rain"}

# Update the 'weather' column with the merged category labels
train['weather'] = train['weather'].replace(cluster_mapping)

# Check the value_counts for the weather after replacing
train['weather'].value_counts()

In [None]:
# Filter rows where 'workingday' is equal to 'Workingday'
workingDay = train[train['workingday'] == 'Workingday']

# Filter rows where 'holiday' is equal to 'Holiday'
holiDay = train[train['holiday'] == 'Holiday']

# Filter rows where 'holiday' is not 'Holiday' and 'workingday' is not 'Workingday'
weekEnd = train[(train['holiday'] == 'Not Holiday') & (train['workingday'] == 'Not Workingday')]

In [None]:
# Numerical columns.
numerical_features = train[['temp', 'atemp', 'humidity', 'windspeed', 'casual', 'registered', 'count']]

# calculate descriptive statistics for numerical values.
numerical_features.describe()

We have identified an issue: the wind speed and humidity variables both have a minimum value of zero. A relative humidity of exactly zero is physically implausible, and the zero wind speed readings may represent missing measurements rather than perfectly calm air. As a solution, we will replace these zeros with more reasonable values to improve data consistency.

In [None]:
# Get the count of the minimum value
count_of_min_value = train[train['humidity']==0].shape[0]

count_of_min_value

In [None]:
# Filter rows with the minimum value
min_value_rows = train[train['humidity'] == 0]

min_value_rows

In [None]:
# Filter rows where the weather is 'Rain'
rain_weather = train[train['weather'] == 'Rain']

# Calculate the mean humidity for rows with 'Rain' weather
mean_rain_weather_humidity = rain_weather['humidity'].mean()

# Replace 0 values in the 'humidity' column with the calculated mean for 'Rain' weather
train['humidity'] = train['humidity'].replace(0, mean_rain_weather_humidity)

# Check the minimum value in the 'humidity' column after replacing 0 values
train['humidity'].min()

We found 22 instances of zero humidity readings. Intriguingly, these instances were all recorded on the same day, which suggests a recording error in that day's humidity data.

To address this, we filled these instances with an appropriate value. Given that the day was mostly rainy, we calculated an average humidity value for rainy conditions in the dataset and used it to replace the zeros. This approach ensures data consistency and accuracy despite the recording irregularity.
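
One detail worth noting: if the zero readings are themselves labelled as rainy, they are included in the rainy-weather mean and pull it down slightly. The sketch below shows a variant that excludes the zeros before averaging. It runs on made-up values, and it is an alternative to consider rather than the method used in this notebook.

```python
import numpy as np
import pandas as pd

# Toy data: zeros in 'humidity' stand in for the suspected recording errors
toy = pd.DataFrame({
    "weather":  ["Rain", "Rain", "Rain", "Clear", "Rain"],
    "humidity": [90.0, 0.0, 80.0, 40.0, 0.0],
})

# Treat zeros as missing so they do not lower the mean
humidity = toy["humidity"].replace(0, np.nan)

# Mean humidity over the valid 'Rain' rows only (NaN values are skipped)
rain_mean = humidity[toy["weather"] == "Rain"].mean()

# Fill the missing readings with that mean
toy["humidity"] = humidity.fillna(rain_mean)

print(rain_mean)
print(toy)
```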

In [None]:
# Get the count of the minimum value
count_of_min_value = train[train['windspeed']==0].shape[0]

count_of_min_value

In [None]:
# Filter rows with the minimum value
min_value_rows = train[train['windspeed'] == 0]

min_value_rows.sample(10)

In [None]:
# Replace zero 'windspeed' values with the previous non-zero value (or the next one at the start of the data)
train['windspeed'] = train['windspeed'].replace(0, np.nan).ffill().bfill()

# Check the minimum value in the 'windspeed' column after replacing 0 values
train['windspeed'].min()

We've noticed a total of 1313 instances with zero wind speed values scattered throughout the dataset. 

To address this, we're opting to replace these zero values with their adjacent non-zero counterparts. This adjustment aims to approximate more realistic values by maintaining the trend found in the data.
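
To make the fill direction concrete, here is a small sketch on a made-up series. Zeros are first marked as missing, then filled forward from the previous reading, and any zeros at the very start are filled backward from the next reading.

```python
import numpy as np
import pandas as pd

# Made-up wind speed readings; zeros are treated as missing measurements
wind = pd.Series([0.0, 0.0, 7.0, 0.0, 11.0, 0.0])

filled = (
    wind.replace(0, np.nan)  # mark zeros as missing
        .ffill()             # copy the previous valid reading forward
        .bfill()             # leading gaps take the next valid reading
)

print(filled.tolist())
```

Because the forward fill runs first, a zero between two readings always takes the earlier one; the backward fill only affects zeros before the first valid reading.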

# Data Visualization and Analysis

In [None]:
# Calculate the correlation matrix for the selected numerical features in the 'train' DataFrame.
correlation_matrix = train[['temp', 'atemp', 'humidity', 'windspeed', 'count']].corr()

# Plot the correlation matrix as a heatmap
plt.figure(figsize=(18, 6))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', square=True)
plt.title('Correlation Matrix')
plt.show()


As we can see in the heatmap above:

1- There is a high correlation between the 'temp' and 'atemp' columns, so we will remove one of them because the two features carry nearly the same information.

2- There is a weak positive correlation between 'temp' and the target 'count' (0.39), a weak negative correlation between 'humidity' and the target (-0.32), and almost no correlation between 'windspeed' and the target (0.1).
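
Reading redundant pairs off a heatmap works for a handful of features. A programmatic check can be sketched as follows, using synthetic data in which `atemp` is deliberately generated as a noisy copy of `temp`. The 0.9 threshold is an assumption chosen for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic data: 'atemp' is built as a noisy copy of 'temp' (illustration only)
rng = np.random.default_rng(0)
temp = rng.normal(20, 8, 500)
toy = pd.DataFrame({
    "temp": temp,
    "atemp": temp + rng.normal(0, 1, 500),
    "humidity": rng.uniform(20, 100, 500),
})

corr = toy.corr().abs()
threshold = 0.9  # assumed cut-off for "highly correlated"

# Visit each unordered pair of columns once and keep those above the threshold
redundant = [
    (a, b, round(float(corr.loc[a, b]), 3))
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
    if corr.loc[a, b] > threshold
]

print(redundant)
```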

In [None]:
# Calculate counts
counts = [workingDay.shape[0], holiDay.shape[0], weekEnd.shape[0]]
labels = ['Workingday', 'Holiday', 'Weekend']

# Create a bar chart
plt.figure(figsize=(18, 6))
plt.bar(labels, counts)
plt.xlabel('Variable')
plt.ylabel('Count')
plt.title('Counts of Different Variables')
plt.show()

In [None]:
# Calculate the average rental counts by hour of the day
hourly_counts = train.groupby('hour')['count'].mean().reset_index()

# Create a line plot to visualize the average rental counts by hour
plt.figure(figsize=(18, 6))
sns.lineplot(x='hour', y='count', data=hourly_counts)
plt.xlabel('Hour of the Day')
plt.ylabel('Average Rental Counts')
plt.title('Average Rental Counts by Hour of the Day')
plt.xticks(ticks=range(24), labels=range(24))
plt.grid()
plt.show()

In [None]:
# Plot the mean rental counts per hour based on day of the week
plt.figure(figsize=(18, 6))
hour_day_df = train.groupby(["hour", "day"])["count"].mean().to_frame().reset_index()
ax1 = sns.pointplot(x=hour_day_df["hour"], y=hour_day_df["count"], hue=hour_day_df["day"])
ax1.set_ylabel("Mean Count")

In [None]:
# Extracting unique days from the 'day' column of the 'weekEnd' DataFrame
weekEnd['day'].unique()

In [None]:
# Set up the plot
plt.figure(figsize=(18, 6))

# Create a point plot for 'casual' and 'registered' by 'hour'
sns.pointplot(data=train, x='hour', y='casual', color='orange', label='casual')
sns.pointplot(data=train, x='hour', y='registered', color='blue', label='registered')

plt.title('Point Plot of "casual" and "registered" by Hour')
plt.xlabel('Hour')
plt.ylabel('Count')
plt.legend()

plt.show()


In [None]:
# Set up the plot
plt.figure(figsize=(10, 6))

# Create a bar plot to visualize distribution of 'registered' and 'casual' by 'day'
sns.barplot(data=train, x='day', y='registered', color='blue', label='registered')
sns.barplot(data=train, x='day', y='casual', color='orange', label='casual')

plt.title('Distribution of "registered" and "casual" by Day of Week')
plt.xlabel('Day of Week')
plt.ylabel('Count')
plt.legend()

plt.show()


In [None]:
# Calculate the total counts for 'casual' and 'registered'
total_casual = train['casual'].sum()
total_registered = train['registered'].sum()

# Calculate the ratios
ratio_casual = total_casual / (total_casual + total_registered)
ratio_registered = total_registered / (total_casual + total_registered)

# Create a bar plot for the ratios of 'casual' and 'registered'
ratios = [ratio_casual, ratio_registered]
labels = ['casual', 'registered']

plt.figure(figsize=(8, 6))
sns.barplot(x=labels, y=ratios)
plt.title('Ratios of "casual" and "registered"')
plt.ylabel('Ratio')
plt.show()

From the preceding graphs, a clear pattern emerges: workdays witness the highest bike rental frequency, followed by weekends, and subsequently, official holidays. Notably, the peak hours for bike rentals consistently appear at 7 AM and 5 PM on all days except Saturday and Sunday. During weekends, a shift is apparent, with rental surges occurring from 11 AM to 5 PM.

Furthermore, the data underscores that registered users predominantly utilize bikes during workdays, evident in their peak usage hours at 7 AM and 5 PM. In contrast, casual users, often utilizing bikes for leisure, exhibit pronounced activity during weekends. Consequently, weekends witness a different usage pattern compared to workdays. It's noteworthy that weekend rentals overall tend to be less frequent, accompanied by a higher proportion of casual users.

Interestingly, the dataset features a larger count of registered users compared to casual users. This insight reinforces the prevalence of consistent bike usage patterns among registered users, primarily on workdays, in contrast to the more varied casual usage on weekends.

In summary, the analysis unveils a notable interplay between user types, work schedules, and rental peak hours, enhancing our understanding of the dynamics behind bike rentals.
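
The peak hours read from the plots can also be computed directly. Below is a minimal sketch on made-up numbers; the `day_type` column is a hypothetical helper, and the values are not taken from the dataset.

```python
import pandas as pd

# Made-up hourly averages for two day types (illustration only)
toy = pd.DataFrame({
    "day_type":   ["Workingday"] * 4 + ["Weekend"] * 4,
    "hour":       [7, 13, 17, 22] * 2,
    "registered": [300, 120, 400, 60, 80, 150, 140, 50],
    "casual":     [20, 40, 50, 10, 30, 120, 110, 20],
})

# Average rentals per day type and hour
mean_by_hour = (
    toy.groupby(["day_type", "hour"])[["registered", "casual"]]
    .mean()
    .reset_index()
)

# For each user type, find the hour with the highest average per day type
peak_hours = {}
for user_type in ["registered", "casual"]:
    idx = mean_by_hour.groupby("day_type")[user_type].idxmax()
    peaks = mean_by_hour.loc[idx]
    peak_hours[user_type] = dict(zip(peaks["day_type"], peaks["hour"].tolist()))

print(peak_hours)
```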

In [None]:
# Plot the mean rental counts per hour based on season
plt.figure(figsize=(18, 6))
hour_season_df = train.groupby(["hour", "season"])["count"].mean().to_frame().reset_index()
ax2 = sns.pointplot(x=hour_season_df["hour"], y=hour_season_df["count"], hue=hour_season_df["season"])
ax2.set_ylabel("Mean Count")


It's evident that peak hours for bike rentals remain consistent across seasons, likely due to work-related commuting patterns that persist regardless of the time of year. 

Interestingly, despite the consistent peak hour trend, spring stands out with lower bike rental counts. This divergence could be attributed to the prevalence of official holidays during the spring months.

In [None]:
# Count the occurrences of each season in the 'season' column of the 'holiDay' DataFrame
holiDay['season'].value_counts()

The code indicates that the prevalence of official holidays is not significantly higher in the spring season, suggesting it might not be the main driver behind the decline in bike rentals during that period. 

In the context of spring, an alternative reason for the reduction could potentially stem from weather conditions. In the heat map analysis, we observed that bike rental counts exhibit a minor negative correlation with humidity and a positive correlation with temperature. This insight reinforces the notion that weather conditions could be contributing to diminished rental activity during the spring season.

In [None]:
# Define custom colors for each season
season_colors = {
    'Spring': 'green',
    'Summer': 'orange',
    'Fall': 'red',
    'Winter': 'blue'
}

# Set up the plot
plt.figure(figsize=(10, 6))

# Create a kernel density plot of humidity by season; the dict palette maps each season to its own color
ax = sns.kdeplot(data=train, x='humidity', hue='season', common_norm=False,
                 palette=season_colors, hue_order=list(season_colors))
plt.title('Humidity Density by Seasons')
plt.xlabel('Humidity')
plt.ylabel('Density')
# Keep seaborn's own legend so labels and colors stay paired, and only rename its title
ax.get_legend().set_title('Seasons')

plt.show()

In [None]:
# Define custom colors for each season
season_colors = {
    'Spring': 'green',
    'Summer': 'orange',
    'Fall': 'red',
    'Winter': 'blue'
}

# Set up the plot
plt.figure(figsize=(10, 6))

# Create a kernel density plot of temperature by season; the dict palette maps each season to its own color
ax = sns.kdeplot(data=train, x='temp', hue='season', common_norm=False,
                 palette=season_colors, hue_order=list(season_colors))
plt.title('Temperature Density by Seasons')
plt.xlabel('Temperature')
plt.ylabel('Density')
# Keep seaborn's own legend so labels and colors stay paired, and only rename its title
ax.get_legend().set_title('Seasons')

plt.show()

The graphs show that humidity levels during spring are not high enough to significantly reduce bike rentals, and temperatures are not low enough to do so either. In fact, during summer both humidity and temperature are significantly higher, yet this did not reduce bike rentals. Likewise, the remaining seasons show humidity and temperature levels similar to spring, yet only spring has the lowest number of bike rentals.

This indicates that weather conditions, while a contributing factor, may not be the only driver of lower rentals during the spring season. It is plausible that the distribution of the data influences this outcome: spring may have fewer records than the other seasons, which could lead to the observed differences in rental numbers.

In [None]:
# Count the number of hourly records in each season
records_per_season = train['season'].value_counts().sort_index()

# Display the number of hourly records in each season
print("Number of hourly records in each season:")
print(records_per_season)

In [None]:
# Seeing the distribution of 'season' values in the train dataset
train['season'].value_counts().plot(kind = 'pie', autopct = '%1.1f%%')

The data is distributed almost evenly across seasons, so an imbalance in the number of records is unlikely to be the main reason for the lower rental counts in spring.

Consequently, the reduction in rental counts during spring is probably driven by factors we have not yet examined. Further exploration is needed to uncover the underlying causes of this trend.
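
One inexpensive follow-up check is to confirm which calendar months each `season` label actually covers, since a season's rental level depends on the months assigned to it. The sketch below runs on generated dates. The season assignment in it is an assumption made only to have something to tabulate, not a statement about the competition data.

```python
import pandas as pd

# Generated daily timestamps for one year (illustration only)
toy = pd.DataFrame({"datetime": pd.date_range("2011-01-01", "2011-12-31", freq="D")})
toy["month"] = toy["datetime"].dt.month

# ASSUMPTION for this toy example: season codes 1-4 follow calendar quarters
toy["season"] = (toy["month"] - 1) // 3 + 1

# Rows = season code, columns = month number, cells = number of records
season_months = pd.crosstab(toy["season"], toy["month"])

# List the months that occur under each season code
months_per_season = {
    season: [month for month in season_months.columns if season_months.loc[season, month] > 0]
    for season in season_months.index
}

print(months_per_season)
```

Running the same cross-tabulation on the real `train` DataFrame would show whether the rows labelled as spring fall in the months one would expect.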

# Data preprocessing

In [None]:
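# Drop features not used for modeling: 'atemp' duplicates 'temp', 'windspeed' shows almost no correlation
# with the target, and 'casual' and 'registered' sum to the target and are not available in the test set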
train.drop(['datetime', 'atemp', 'windspeed', 'casual', 'registered'], axis=1, inplace=True)

### Encoding and scaling the data

In [None]:
# One-hot encoding.
train = pd.get_dummies(train, columns=['season', 'weather', 'month', 'day'])

# Label Encoding.
label_encoder = LabelEncoder()

for i in ['holiday', 'workingday', 'year']:
    train[i] = label_encoder.fit_transform(train[i])

In [None]:
# List of columns to scale
columns_to_scale = ['temp', 'humidity', 'hour']

# Create a StandardScaler object
scaler = StandardScaler()

# Fit the StandardScaler on the selected columns to calculate mean and standard deviation
scaler.fit(train[columns_to_scale])

# Transform the selected columns using the calculated mean and standard deviation
train[columns_to_scale] = scaler.transform(train[columns_to_scale])


### Split the data

In [None]:
# Split data into x and y.
X = train.drop("count", axis=1)
y = train["count"]

# Split train data into train and test.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Modeling

In [None]:
# Define the Root Mean Squared Logarithmic Error (RMSLE) scorer
def rmsle(y_true, y_pred):
    return np.sqrt(mean_squared_log_error(y_true, np.clip(y_pred, 0, None)))

# Make the RMSLE scorer
rmsle_scorer = make_scorer(rmsle)

The Root Mean Squared Logarithmic Error (RMSLE) is used to assess each model's performance. RMSLE is commonly used in regression tasks with non-negative, skewed targets such as counts. Because it compares the logarithms of predicted and actual values, it measures relative rather than absolute error and penalizes underestimation more heavily than overestimation of the same size, which makes it suitable for this bike rental count prediction task. The lower the RMSLE value, the better the model's predictions align with the actual target values.
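
For reference, the metric is $\sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\log(p_i + 1) - \log(a_i + 1)\right)^2}$, where $p_i$ is the predicted count and $a_i$ the actual count. The short sketch below implements this formula with NumPy only and compares an under-prediction and an over-prediction of the same absolute size. The numbers are made up for illustration.

```python
import numpy as np

def rmsle(y_true, y_pred):
    """RMSLE computed directly from its definition; negative predictions are clipped to 0."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.clip(np.asarray(y_pred, dtype=float), 0, None)
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))

actual = [100]

# Both predictions miss the actual value by 50 rentals
under = rmsle(actual, [50])
over = rmsle(actual, [150])

print(f"under-prediction: {under:.3f}, over-prediction: {over:.3f}")
```

The under-prediction receives the larger error because the logarithm grows more steeply at smaller values.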

In [None]:
# Initialize and evaluate different regression models using cross-validation
models = {
    'Linear Regression': LinearRegression(),
    'Decision Tree': DecisionTreeRegressor(),
    'Random Forest': RandomForestRegressor()
}

In [None]:
# Iterate over each model and perform cross-validation with the RMSLE scorer
for model_name, model in models.items():
    cv_scores = cross_val_score(model, X_train, y_train, scoring=rmsle_scorer, cv=5)
    
    print(f"Model: {model_name}")
    print(f"Average RMSLE: {np.mean(cv_scores)}\n")

After evaluating multiple regression models, we found that the Random Forest algorithm demonstrated the best performance based on the RMSLE metric. Therefore, we selected the Random Forest model to make predictions on our test data. 
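
A note on the scorer: scikit-learn treats scores as "higher is better" by default, which is harmless when the averages are only printed and compared by eye. If the scorer were ever passed to automated model selection, an error metric would normally be declared with `greater_is_better=False`. The sketch below shows that variant on synthetic data. The data, the model settings, and the `neg_rmsle` name are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import make_scorer, mean_squared_log_error
from sklearn.model_selection import cross_val_score

def rmsle(y_true, y_pred):
    return np.sqrt(mean_squared_log_error(y_true, np.clip(y_pred, 0, None)))

# With greater_is_better=False scikit-learn negates the metric internally
neg_rmsle = make_scorer(rmsle, greater_is_better=False)

# Synthetic non-negative "counts" (illustration only)
rng = np.random.default_rng(42)
X = rng.uniform(0, 1, size=(300, 3))
y = np.round(50 + 200 * X[:, 0] * X[:, 1] + rng.normal(0, 5, 300)).clip(min=0)

models = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(n_estimators=30, random_state=0),
}

results = {}
for name, model in models.items():
    # Flip the sign back to report a positive RMSLE
    scores = -cross_val_score(model, X, y, scoring=neg_rmsle, cv=5)
    results[name] = (scores.mean(), scores.std())
    print(f"{name}: {scores.mean():.4f} +/- {scores.std():.4f}")
```

Reporting the standard deviation next to the mean also indicates how stable each model's score is across folds.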

In [None]:
# Fit and evaluate the best model on the test set
the_model = RandomForestRegressor()  
the_model.fit(X_train, y_train)

y_pred = the_model.predict(X_test)
test_rmsle = rmsle(y_test, y_pred)
print(f"Test RMSLE for the best model: {test_rmsle}")

# Preparing the test data

In [None]:
# Check for duplicated rows.
test.duplicated().sum()

In [None]:
# seeing if there are null values.
test.isna().sum()

In [None]:
test.info()

In [None]:
# Convert the 'datetime' column to datetime format
test['datetime'] = pd.to_datetime(test['datetime'])

# Extract the year from the 'datetime' column and create a new 'year' column
test['year'] = test['datetime'].dt.year

# Extract the month as its name from the 'datetime' column and create a new 'month' column
test['month'] = test['datetime'].dt.month_name()

# Extract the day as its name from the 'datetime' column and create a new 'day' column
test['day'] = test['datetime'].dt.day_name()

# Extract the hour from the 'datetime' column and create a new 'hour' column
test['hour'] = test['datetime'].dt.hour

In [None]:
# Replace the values in the 'season' column with corresponding strings
# (assigning the result back avoids pandas' chained in-place assignment warning)
test['season'] = test['season'].replace({1: 'Spring', 2: 'Summer', 3: 'Fall', 4: 'Winter'})

# Replace the values in the 'holiday' column with corresponding strings
test['holiday'] = test['holiday'].replace({1: 'Holiday', 0: 'Not Holiday'})

# Replace the values in the 'workingday' column with corresponding strings
test['workingday'] = test['workingday'].replace({1: 'Workingday', 0: 'Not Workingday'})

# Replace the values in the 'weather' column with corresponding strings
test['weather'] = test['weather'].replace({1: 'Clear', 2: 'Mist', 3: 'Rain', 4: 'Snow'})

In [None]:
# Define a mapping dictionary to combine the clusters
cluster_mapping = {"Snow" : "Rain"}

# Update the 'weather' column with the merged category labels
test['weather'] = test['weather'].replace(cluster_mapping)

In [None]:
# Numerical columns.
numerical_features = test[['temp', 'atemp', 'humidity', 'windspeed']]

# calculate descriptive statistics for numerical values.
numerical_features.describe()

In [None]:
# Replace zero 'windspeed' values with the previous non-zero value (or the next one at the start of the data)
test['windspeed'] = test['windspeed'].replace(0, np.nan).ffill().bfill()

In [None]:
#Store the datetime column in a separate variable.
datetime = test['datetime']

In [None]:
test.drop(['datetime', 'atemp', 'windspeed'], axis=1, inplace=True)

In [None]:
# One-hot encoding.
test = pd.get_dummies(test, columns=['season', 'weather', 'month', 'day'])

# Label Encoding.
label_encoder = LabelEncoder()

for i in ['holiday', 'workingday', 'year']:
    test[i] = label_encoder.fit_transform(test[i])

In [None]:
# List of columns to scale
columns_to_scale = ['temp', 'humidity', 'hour']

# Reuse the StandardScaler that was fitted on the training data, so the test features
# are scaled with the same mean and standard deviation the model was trained on
test[columns_to_scale] = scaler.transform(test[columns_to_scale])


# Prediction and Submission

In [None]:
# Generate predictions for the test data using the fitted RandomForestRegressor.
test_pred = the_model.predict(test)

In [None]:
# Create a submission DataFrame with the 'datetime' column and predicted rental counts.
submission = pd.DataFrame({'datetime': datetime, 'count': test_pred})


In [None]:
# Save the submission DataFrame as a CSV file without including the index column.
submission.to_csv('submission.csv', index=False)
print("Your submission was successfully saved!")

# Conclusion

Our journey of data preparation and analysis has yielded significant insights and successful model utilization. We embarked on a meticulous process of refining the dataset, including datetime format conversion, categorical value replacements, and addressing anomalies in variables like humidity and wind speed. This foundational work set the stage for more accurate analysis and model training.

Through exploratory data analysis, we unveiled crucial trends and patterns. Notably, we observed that weekdays outshine weekends and holidays in bike rental counts, and specific peak hours consistently attract more rentals.

Upon model evaluation, the Random Forest algorithm achieved the lowest RMSLE score of the three models compared.

To sum up, our journey encompassed data refinement, insightful analysis, meticulous model selection, and comprehensive utilization of predictions. This endeavor equips us with valuable information for strategic decision-making.