# **Import Data & Libraries**

In [1]:
#library imports
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from statsmodels.tsa.stattools import adfuller
from pandas.plotting import autocorrelation_plot
from statsmodels.tsa.seasonal import seasonal_decompose
import warnings
# Silence all warnings (this already covers UserWarning)
warnings.filterwarnings("ignore")

In [2]:
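#importing data
# Mount Google Drive only when running on Colab; elsewhere the CSVs are read from the working directory
try:
    from google.colab import drive
    drive.mount('/content/drive')
except ImportError:
    pass  # not running on Colab
df_day = pd.read_csv('day.csv')
df_hour = pd.read_csv('hour.csv')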

In [None]:
df_day.head()

In [None]:
df_hour.head()

# **Basic Data Information**

In [None]:
# printing information about data
df_day.info()
df_hour.info()

The Bike Sharing dataset contains two primary dataframes, one with 731 entries representing daily aggregated data and another with 17,379 entries for hourly data. Both datasets are comprehensive, with all columns showing no missing values, indicating a high level of data completeness.

The data is well-structured, consisting mainly of numeric variables (integers and floats) along with a date column (dteday). The features include essential information about weather conditions, dates, times, and bike usage, categorized into casual and registered users, which can be aggregated to compute the total bike count (cnt).

The data is suitable for time series analysis and regression modeling. Variables such as weather and temporal factors give an opportunity to explore and model the impact of environmental and seasonal patterns on bike rentals. The large dataset size, especially for the hourly data, provides a robust foundation for uncovering both short-term and long-term trends.

Future steps could involve capturing cyclical patterns (such as hourly, weekly, and monthly trends), examining correlations between weather variables and bike usage, and implementing predictive models to forecast demand and optimize bike availability.
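
One common way to capture such cyclical patterns is to encode a periodic feature with sine and cosine terms, so that hour 23 and hour 0 end up close together. The sketch below is an illustration only: the small frame stands in for `df_hour`, and the `hr_sin` and `hr_cos` column names and the `circular_distance` helper are hypothetical.

```python
import numpy as np
import pandas as pd

# Small stand-in for df_hour: one row per hour of a single day
demo = pd.DataFrame({"hr": np.arange(24)})

# Map the hour onto a circle with a 24-hour period
angle = 2 * np.pi * demo["hr"] / 24
demo["hr_sin"] = np.sin(angle)
demo["hr_cos"] = np.cos(angle)

def circular_distance(a, b):
    """Euclidean distance between two hours in the (sin, cos) plane."""
    pa = demo.loc[demo["hr"] == a, ["hr_sin", "hr_cos"]].to_numpy()[0]
    pb = demo.loc[demo["hr"] == b, ["hr_sin", "hr_cos"]].to_numpy()[0]
    return float(np.linalg.norm(pa - pb))

# Hours 23 and 0 are adjacent on the circle, hours 0 and 12 are opposite
print(circular_distance(23, 0), circular_distance(0, 12))
```

The same idea could be applied to "mnth" (period 12) or "weekday" (period 7) if those cycles turn out to matter for a model.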

In [None]:
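# printing basic descriptions of the day and hour datasets
# display() is used because a notebook cell only renders its last expression
from IPython.display import display
display(df_day.describe())
display(df_hour.describe())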

The descriptive statistics for both the day and hour datasets give a comprehensive overview of the distributions and characteristics of the features.

**Temporal Variables:** The "hr" column ranges from 0 to 23, indicating a 24-hour cycle. The "season" and "mnth" columns capture cyclical patterns across different times of the year, ranging from 1 (spring) to 4 (winter) and from 1 (January) to 12 (December), respectively.

**Categorical Features:** The "holiday", "weekday", "workingday", and "weathersit" variables provide information about whether a particular hour falls on a holiday or working day, the day of the week, and the weather situation. The low mean value for holiday (0.0288) indicates that bike sharing data predominantly represents non-holiday hours.

**Weather-Related Features:** The "temp" and "atemp" columns represent the normalized temperature and feels-like temperature, while "hum" (humidity) and "windspeed" also play a role in understanding the impact of weather on bike rentals. The mean values suggest mild weather on average, but the maximum humidity and relatively high wind speeds might impact bike usage during extreme conditions.

**Bike Rental Variables:** The average number of "casual" and "registered" users per hour is 35.68 and 153.79, respectively, leading to a total average count ("cnt") of 189.46 bike rentals per hour. The standard deviations indicate considerable variability in bike usage, likely influenced by temporal, seasonal, and weather factors.


**Data Distribution:** The minimum and maximum values highlight the wide range of bike rentals, from no rentals at certain hours to a peak of 977 rentals in an hour. Additionally, the quartile values show that bike rentals are often concentrated at lower counts, with a significant increase in registered users compared to casual users.

This description will help to identify and model the key features that influence bike demand, explore hourly and daily trends, and forecast future bike usage effectively.

In [None]:
# converting the date column from string to datetime
df_hour['dteday'] = pd.to_datetime(df_hour['dteday'])

# adding new features like day of the year
df_hour['day_of_year'] = df_hour['dteday'].dt.dayofyear

# **Data Visualization and Distribution Analysis**

In [None]:
# Histogram of the "cnt" variable (overall pattern of bike usage)
sns.histplot(df_hour['cnt'], kde=True)
plt.title("Distribution of Bike Rentals")
plt.xlabel("Count of Bike Rentals")
#plt.ylabel("Frequency")
plt.show()

The histogram illustrates the distribution of hourly bike rentals, with the x-axis representing the count of bike rentals and the y-axis representing the number of occurrences. The distribution is heavily right-skewed: a large number of rental counts are clustered around lower values, with progressively fewer occurrences as the rental count increases. In other words, low bike rental counts are much more common than high ones.
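
The skew visible in the histogram can also be quantified. The sketch below uses synthetic, invented counts rather than `df_hour['cnt']`, and shows `log1p` as one possible transform for pulling in a long right tail; whether it helps on the real data would need to be checked.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed counts standing in for df_hour['cnt'] (not the real data)
rng = np.random.default_rng(0)
counts = pd.Series(rng.gamma(shape=1.5, scale=120.0, size=5000).round())

# Positive skewness indicates a long right tail
skew_raw = counts.skew()

# log1p compresses large values and is defined at zero
skew_log = np.log1p(counts).skew()

print(f"skewness raw: {skew_raw:.2f}, after log1p: {skew_log:.2f}")
```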

In [None]:
# Scatterplot of temperature, weather, and bike rentals
sns.scatterplot(data=df_hour, x='temp', y='cnt', hue='weathersit')
plt.title("Temperature vs. Bike Rentals")
plt.xlabel("Temperature")
plt.ylabel("Count of Bike Rentals")
plt.show()

The scatterplot shows the relationship between temperature and the count of bike rentals, where the x-axis represents temperature and the y-axis shows the count of bike rentals. Different hues correspond to weather situations (indicated by values 1 through 4 in the legend), with lighter hues representing more favorable weather conditions and darker hues indicating less favorable weather. According to the scatterplot, bike rentals generally increase with temperature, suggesting a positive correlation, and the variation in rental counts is affected by the weather situation.

# **Correlation Analysis**

In [None]:
# Creating the correlation matrix
corr = df_hour.drop(columns=['dteday']).corr()

# Creating a mask for the upper triangle
mask = np.triu(np.ones_like(corr, dtype=bool))

# Creating the heatmap with the mask
plt.figure(figsize=(10, 8))
sns.heatmap(corr, mask=mask, annot=True, cmap='coolwarm', annot_kws={"size": 8})
plt.xticks(rotation=45, ha='right')
plt.yticks(rotation=0)
plt.title('Triangular Correlation Heatmap')
plt.show()

The triangular correlation heatmap shows the relationships between different variables in the bike-sharing dataset. Let's explore them:


1.   **Temperature and Bike Rentals:** The total count of bike rentals ("cnt") is positively correlated with "temp" (0.40) and "atemp" (0.45). This means that bike rentals tend to increase with higher temperatures.


2.   **Yearly Growth:** The "yr" variable has a strong positive correlation (0.57) with "cnt", indicating that bike rentals increased in the second year compared to the first year.


3.   **Registered vs. Casual Users:** The total count ("cnt") is highly correlated with "registered" users (0.97) and moderately correlated with "casual" users (0.69). Registered users contribute significantly more to total rentals compared to casual users.

4.   **Weather Impact:** Poor weather conditions ("weathersit") and high humidity ("hum") have negative correlations with "cnt" (-0.14 and -0.32, respectively), meaning fewer rentals occur during bad weather or when humidity is high.


5.   **Working Day Effect:** "workingday" is negatively correlated with "casual" users (-0.35), indicating that casual users are less likely to rent bikes on working days, possibly preferring weekends or holidays.


The triangular correlation heatmap reveals several important relationships in the bike-sharing data. Temperature, year, and registered users are key drivers of bike rentals, meaning that rentals increase with higher temperatures, in the second year, and when more registered users are present. On the other hand, adverse weather conditions and high humidity are associated with decreased bike rentals. These insights provide valuable guidance for understanding bike rental trends and can assist in building effective predictive models.
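
Reading individual values off a heatmap can be tedious, so it may help to rank the features by their correlation with the target. This is a minimal sketch on synthetic, invented data; the `demo` frame only mimics the column names of `df_hour`, and its values say nothing about the real dataset.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for df_hour (values are invented for illustration)
rng = np.random.default_rng(1)
n = 2000
temp = rng.uniform(0, 1, n)
hum = rng.uniform(0, 1, n)
cnt = 300 * temp - 150 * hum + rng.normal(0, 40, n)
demo = pd.DataFrame(
    {"temp": temp, "hum": hum, "windspeed": rng.uniform(0, 1, n), "cnt": cnt}
)

# Correlation of every feature with the target, strongest absolute value first
target_corr = demo.corr()["cnt"].drop("cnt")
ranked = target_corr.reindex(target_corr.abs().sort_values(ascending=False).index)
print(ranked.round(2))
```

On the real data, the same two lines applied to `corr` from the heatmap cell would list the drivers of "cnt" in order of strength.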

# **Temporal Analysis**

In [None]:
# showing how the number of rentals varies throughout the day by aggregating the "cnt" by the "hr" column
hourly_trend = df_hour.groupby('hr')['cnt'].mean()
hourly_trend.plot(kind='line', title='Average Bike Rentals by Hour of Day')
plt.xlabel('Hour of Day')
plt.ylabel('Average Count of Rentals')
plt.show()

The line plot shows the average count of bike rentals by hour of the day, where the x-axis represents the hour (ranging from 0 to 23) and the y-axis shows the average count of rentals. The plot reveals two prominent peaks: one in the morning, likely corresponding to commuters, and another in the late afternoon to early evening, which may correspond to people returning home or engaging in evening activities. Bike rentals are lowest overnight, around 3 to 5 AM, and increase as the day progresses, showing distinct usage patterns related to daily routines.
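
If the peaks are commute-driven, the hourly profile should look different on working and non-working days. One way to check this is to group by both "hr" and "workingday". The sketch below uses a synthetic frame with invented counts, so its output only demonstrates the reshaping, not the behaviour of the real data.

```python
import pandas as pd

# Synthetic hourly data: working days peak at commute hours, other days peak midday
rows = []
for workingday in (0, 1):
    for hr in range(24):
        if workingday:
            cnt = 50 + 300 * (hr in (8, 17, 18))
        else:
            cnt = 50 + 200 * (11 <= hr <= 15)
        rows.append({"workingday": workingday, "hr": hr, "cnt": cnt})
demo = pd.DataFrame(rows)

# Mean rentals per hour, one column per day type
profile = demo.groupby(["hr", "workingday"])["cnt"].mean().unstack("workingday")

# Hour with the highest average for each day type (first hour in case of ties)
print(profile.idxmax())
```

Calling `profile.plot()` on the result would draw one line per day type on the same axes.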

In [None]:
# showing how bike rentals vary across weekdays using weekday
weekday_trend = df_hour.groupby('weekday')['cnt'].mean()
sns.barplot(x=weekday_trend.index, y=weekday_trend.values)
plt.title('Average Bike Rentals by Day of the Week')
plt.xlabel('Day of the Week')
plt.ylabel('Average Count of Rentals')
plt.show()

The bar plot shows the average count of bike rentals by day of the week, where the x-axis represents the days of the week, coded numerically from 0 to 6, and the y-axis indicates the average count of rentals. The bars are similar in height, suggesting that total bike rentals remain fairly consistent throughout the week, with only slight variations between days. This implies that overall demand is similar on weekdays and weekends.

In [None]:
# Evaluating monthly trends for the seasonal effects
monthly_trend = df_hour.groupby('mnth')['cnt'].mean()
sns.lineplot(x=monthly_trend.index, y=monthly_trend.values)
plt.title('Average Bike Rentals by Month')
plt.xlabel('Month')
plt.ylabel('Average Count of Rentals')
plt.show()

The line plot illustrates the average count of bike rentals over the months of the year, where the x-axis represents the months, numbered from 1 (January) to 12 (December), and the y-axis indicates the average count of rentals. The plot shows a clear upward trend from the beginning of the year, peaking around the summer months (June to August), before declining towards the end of the year. This pattern suggests that bike rentals are higher during warmer months, likely due to more favorable weather conditions, and decrease in colder months.

# **Categorical Feature Analysis**

In [None]:
# bike rentals across different seasons ("season")
sns.boxplot(x='season', y='cnt', data=df_day)
plt.title('Bike Rentals by Season')
plt.xlabel('Season')
plt.ylabel('Count of Rentals')
plt.show()

This box plot represents the distribution of daily bike rental counts across the seasons, where the x-axis shows the seasons (labeled numerically from 1 to 4) and the y-axis indicates the count of bike rentals. Each box represents the interquartile range (IQR), with the line inside the box indicating the median number of rentals. The whiskers extend to show the range of the data, excluding outliers, which are plotted as individual points. The plot shows that bike rentals tend to be lower in season 1 and higher in seasons 2 and 3, suggesting that seasonal changes significantly impact bike rental activity. Season 4 has a moderate number of rentals but shows some variation, with outliers present.

In [None]:
# Comparing the average rentals during holidays (holiday == 1) vs non-holidays
sns.boxplot(x='holiday', y='cnt', data=df_day)
plt.title('Bike Rentals: Holiday vs Non-Holiday')
plt.xlabel('Holiday')
plt.ylabel('Count of Rentals')
plt.show()

This box plot compares the distribution of bike rental counts on non-holidays (0) versus holidays (1). It shows that bike rentals are generally lower on holidays, with a wider spread and a lower median compared to non-holidays.

# **Seasonality Analysis**

In [None]:
# using seasonal decomposition to break down the time series into trend, seasonal, and residual components

result = seasonal_decompose(df_day['cnt'], model='additive', period=365)
result.plot()
plt.show()

This time series decomposition plot breaks down the "cnt" (count of bike rentals) into three components: "trend", "seasonal", and "residual". This plot suggests that bike rentals have been steadily increasing over time, with noticeable patterns that repeat, likely due to seasons or specific times of the year. However, there are some fluctuations and unexpected variations that can't be fully explained by these patterns, hinting at the influence of random or external factors.
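
For reference, the additive model requested with `model='additive'` treats the observed series as the sum of the three components. In the notation below (the symbols are introduced here for illustration), $y_t$ is "cnt" on day $t$, $T_t$ the trend, $S_t$ the seasonal component, and $R_t$ the residual:

```latex
y_t = T_t + S_t + R_t, \qquad S_{t + 365} = S_t
```

Because `period=365` and the daily data covers 731 days, the seasonal component is estimated from only about two yearly cycles, so it is best read as a rough indication rather than a stable seasonal profile.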

# **Outlier Detection**

In [None]:
# detect any anomalies in the dataset
sns.boxplot(data=df_hour[['temp', 'hum', 'windspeed', 'cnt']])
plt.title('Outlier Detection in Features')
plt.show()

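According to the box plot, "temp", "hum", and "windspeed" have relatively few or minor outliers, while the "cnt" feature shows many high-value outliers, suggesting that there are hours with unusually high bike rental counts compared to the rest. Note that "cnt" is on a much larger scale than the normalized weather features, so on a shared axis their boxes are compressed near zero.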
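
The points a box plot draws as outliers follow the usual 1.5 × IQR rule, which can be applied directly to count them. The helper name `iqr_outlier_mask` and the toy series below are hypothetical and only illustrate the rule.

```python
import pandas as pd

def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag points outside [Q1 - k*IQR, Q3 + k*IQR], the usual box plot whisker rule."""
    q1, q3 = values.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Toy series standing in for df_hour['cnt']: mostly small counts plus two large ones
toy = pd.Series([5, 8, 10, 12, 15, 18, 20, 22, 25, 400, 900])
mask = iqr_outlier_mask(toy)
print(toy[mask].tolist())  # → [400, 900]
```

Applied to a real column, `mask.sum()` gives the number of flagged rows, which is easier to compare across features than reading points off the plot.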

# **Feature Engineering Ideas**

In [None]:
# Creating a new feature combining holiday and weekend indicators
df_hour['day_type'] = np.where((df_hour['holiday'] == 1) | (df_hour['weekday'].isin([0, 6])), 'off_day', 'working_day')

In [None]:
# Visualizing relationships between key features like "temp", "atemp", "hum", "windspeed", and "cnt"
sns.pairplot(df_hour[['temp', 'atemp', 'hum', 'windspeed', 'cnt']])
plt.show()


This pair plot visualizes relationships between the features "temp", "atemp", "hum", "windspeed", and "cnt". It provides insights into potential correlations and distributions within the dataset. A strong positive linear relationship is evident between "temp" and "atemp", indicating these features are closely related. The scatter plots also show a positive correlation between temperature features ("temp" and "atemp") and bike rental counts ("cnt"), suggesting that higher temperatures may be associated with more rentals. Other features, like "windspeed", do not exhibit a clear relationship with "cnt".

# **Time Series Stationarity Check**


In [None]:
# Time series plot for obvious trends or seasonality
df_day['cnt'].plot(title='Bike Rentals Time Series')
plt.show()

The time series plot shows the original bike rental data over time. It has a clear upward trend, indicating that the average number of rentals increases over the period. Additionally, there are visible fluctuations, suggesting some potential seasonality or irregular variations.

In [None]:
# Applying first-order differencing (the first value is NaN because it has no predecessor)
df_day['cnt_diff'] = df_day['cnt'].diff()
df_day['cnt_diff'].plot(title='Differenced Bike Rentals Time Series')
plt.show()

The differenced time series plot displays the data after applying first-order differencing. This transformation has removed the trend, making the series more stationary, as the values now fluctuate around a mean of zero.

# **Seasonal Autocorrelation Plot**

In [None]:
# Check for lag correlation
autocorrelation_plot(df_day['cnt'])
plt.title('Autocorrelation Plot of Bike Rentals')
plt.show()

This autocorrelation plot shows how the bike rental counts are correlated with their own past values over different time lags. It starts high at lag 0 and gradually decreases, suggesting that the series has significant autocorrelation at short lags, which diminishes over time. As the lag increases (e.g., moving from 0 to 100, 200, etc.), the autocorrelation drops, meaning that the relationship between the current and past values weakens. The presence of autocorrelation over several lags indicates that past values of bike rentals are informative for predicting future values, and it may suggest the need for further differencing or the inclusion of autoregressive terms in a forecasting model.
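
To read off the autocorrelation at specific lags rather than from the curve, `Series.autocorr` can be evaluated lag by lag. The series below is synthetic with an invented weekly cycle, so the numbers only illustrate the mechanics.

```python
import numpy as np
import pandas as pd

# Synthetic daily series with a weekly cycle, standing in for df_day['cnt']
rng = np.random.default_rng(7)
days = np.arange(364)
series = pd.Series(
    100 + 30 * np.sin(2 * np.pi * days / 7) + rng.normal(0, 5, days.size)
)

# Series.autocorr(lag) is the Pearson correlation between the series and its lagged copy
acf_by_lag = {lag: series.autocorr(lag=lag) for lag in (1, 3, 7, 14)}
for lag, value in acf_by_lag.items():
    print(f"lag {lag:>2}: {value:+.2f}")
```

A high value at lag 7 relative to neighbouring lags would point to a weekly pattern, which is the kind of structure lag features or seasonal terms are meant to capture.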



# Metrics and Evaluation

In [None]:
import pandas as pd
import numpy as np
from statsmodels.tsa.arima.model import ARIMA
from sklearn.metrics import mean_squared_error, mean_absolute_error
import matplotlib.pyplot as plt

In [None]:
# Hold out the last 30 days so the forecast can be compared with actual values
train = df_day['cnt'][:-30]

# Fit the ARIMA model on the training portion only
optimal_order = (1, 2, 2)  # Best parameters identified through grid search
arima_model = ARIMA(train, order=optimal_order)
arima_fit = arima_model.fit()

# Forecast the 30 held-out days
forecast_arima = arima_fit.forecast(steps=30)
print("ARIMA Forecast:", forecast_arima)

In [None]:
# Random Walk model (baseline)
df_day['random_walk'] = df_day['cnt'].shift(1)

# Use the last 30 data points for Random Walk predictions
random_walk_predictions = df_day['random_walk'][-30:]

In [None]:
# Define true values for the last 30 data points
true_values = df_day['cnt'][-30:]

# Calculate metrics for ARIMA
rmse_arima = np.sqrt(mean_squared_error(true_values, forecast_arima))
mae_arima = mean_absolute_error(true_values, forecast_arima)

# Calculate metrics for Random Walk
rmse_rw = np.sqrt(mean_squared_error(true_values, random_walk_predictions))
mae_rw = mean_absolute_error(true_values, random_walk_predictions)

# Print results
print(f"ARIMA RMSE: {rmse_arima}, MAE: {mae_arima}")
print(f"Random Walk RMSE: {rmse_rw}, MAE: {mae_rw}")
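
For readers less familiar with the two metrics, the sketch below computes them by hand with NumPy on invented toy values. The `rmse` and `mae` helper names are hypothetical; the notebook itself uses the scikit-learn functions.

```python
import numpy as np

def rmse(y_true, y_pred) -> float:
    """Root mean squared error: penalizes large errors more heavily."""
    err = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(err ** 2)))

def mae(y_true, y_pred) -> float:
    """Mean absolute error: the average size of the errors."""
    err = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(err)))

# Toy values (invented): three small errors and one large miss
actual = [100, 120, 130, 150]
predicted = [110, 110, 140, 110]
print(rmse(actual, predicted), mae(actual, predicted))
```

Here the MAE is 17.5 while the RMSE is about 21.8, because the single miss of 40 is squared before averaging. A large gap between the two on real forecasts suggests a few large errors rather than uniformly mediocre predictions.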

In [None]:
# Plot actual values vs ARIMA and Random Walk predictions
plt.figure(figsize=(10, 6))
plt.plot(true_values.index, true_values, label='Actual', color='blue')
plt.plot(true_values.index, forecast_arima, label='ARIMA Predictions', linestyle='--', color='orange')
plt.plot(true_values.index, random_walk_predictions, label='Random Walk Predictions', linestyle='--', color='green')
plt.legend()
plt.title('Actual vs Predicted Bike Rentals')
plt.xlabel('Date')
plt.ylabel('Bike Rentals')
plt.show()

# Check Quality of the Data

In [None]:
# Check for missing values
missing_values = df_day.isnull().sum()
print("Missing Values in Dataset:")
print(missing_values)

# Check summary statistics for variability
print("Summary Statistics for 'cnt':")
print(df_day['cnt'].describe())

# SARIMA Model

In [None]:
from statsmodels.tsa.statespace.sarimax import SARIMAX

# Fit SARIMA model
# Note: on daily data a seasonal period of 12 describes a 12-day cycle; a weekly cycle would use 7
sarima_model = SARIMAX(df_day['cnt'], order=(1, 1, 1), seasonal_order=(1, 1, 1, 12))
sarima_fit = sarima_model.fit()

# Forecast
forecast_sarima = sarima_fit.forecast(steps=30)

# Model summary
print(sarima_fit.summary())

# Plot forecast
import matplotlib.pyplot as plt
plt.plot(df_day['cnt'], label='Actual')
plt.plot(forecast_sarima, label='SARIMA Predictions')
plt.legend()
plt.show()

# Practical Forecasting

In [None]:
# Lagged rentals (previous day and same weekday one week earlier)
df_day['lag_1'] = df_day['cnt'].shift(1)
df_day['lag_7'] = df_day['cnt'].shift(7)

# 'holiday' and 'temp' are already columns of df_day; expose 'hum' under a clearer name
df_day['humidity'] = df_day['hum']

# Interaction term (working day × weather situation)
df_day['working_weather'] = df_day['workingday'] * df_day['weathersit']

# Drop rows with NaN values created by the lag features
df_day.dropna(inplace=True)

# Combine features into a new dataset
X = pd.concat([
    df_day[['temp', 'humidity', 'workingday', 'lag_1', 'lag_7', 'holiday']],
    # Weather dummies from 'weathersit'; the prefix keeps every column name a string
    pd.get_dummies(df_day['weathersit'], prefix='weathersit', drop_first=True)
], axis=1)
y = df_day['cnt']

In [None]:
from sklearn.model_selection import train_test_split

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
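
Because the features include lagged rentals, a shuffled split lets the model train on days that come after some of the test days. A chronological split is a common alternative for time-ordered data. The sketch below is an assumption-laden illustration on an invented toy frame; the `*_demo` names are hypothetical and it is not the split used in this notebook.

```python
import numpy as np
import pandas as pd

# Toy daily frame standing in for X and y (values are invented)
dates = pd.date_range("2011-01-01", periods=100, freq="D")
X_demo = pd.DataFrame({"lag_1": np.arange(100.0)}, index=dates)
y_demo = pd.Series(np.arange(100.0) + 1.0, index=dates)

# Keep the first 80% for training and the final 20% for testing, preserving time order
split_at = int(len(X_demo) * 0.8)
X_train_demo, X_test_demo = X_demo.iloc[:split_at], X_demo.iloc[split_at:]
y_train_demo, y_test_demo = y_demo.iloc[:split_at], y_demo.iloc[split_at:]

print(X_train_demo.index.max(), "<", X_test_demo.index.min())
```

With scikit-learn, passing `shuffle=False` to `train_test_split` gives the same kind of ordered split, and it also keeps the later "Actual vs Predicted" plot in chronological order.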

In [None]:
from sklearn.linear_model import LinearRegression

# Fit the linear regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Predict on test data
y_pred = model.predict(X_test)

In [None]:
from sklearn.metrics import mean_squared_error, mean_absolute_error
import numpy as np

# Evaluation metrics
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
mae = mean_absolute_error(y_test, y_pred)

print(f"Enhanced Model RMSE: {rmse:.2f}, MAE: {mae:.2f}")

In [None]:
import matplotlib.pyplot as plt

# Plot actual vs predicted
plt.figure(figsize=(10, 6))
plt.plot(y_test.values, label='Actual', color='blue')
plt.plot(y_pred, label='Predicted', color='orange')
plt.legend()
plt.title('Actual vs Predicted Bike Rentals')
plt.xlabel('Observations')
plt.ylabel('Bike Rentals')
plt.show()