# Hands On Data Analytics

This data set is based on the Kaggle challenge [Bike Sharing Demand](https://www.kaggle.com/competitions/bike-sharing-demand/overview).

Bike sharing systems are a means of renting bicycles where the process of obtaining membership, rental, and bike return is automated via a network of kiosk locations throughout a city.

- The Capital Bikeshare program in Washington, D.C. collects detailed data.
- The data set used in this project is often used for research. A similar task was posed in a Kaggle challenge.
- The goal of this project is to predict future bike sharing demand (variable `count`).

In [None]:
# Import necessary libraries and set the style of the plots
import asyncio
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error
from IPython.display import HTML

# Logic to keep binder notebook alive
async def keep_me_alive():
    while True:
        await asyncio.sleep(120)
        x = 1

loop = asyncio.get_event_loop()
loop.create_task(keep_me_alive())

# 1. Importing data

In [None]:
df = pd.read_csv("https://raw.githubusercontent.com/JalalMirzayev/kaggle-bike-task/main/data/data.csv", index_col="datetime", parse_dates=True)

In [None]:
df.head()

You are provided hourly rental data spanning two years.

Data Fields
- datetime: hourly date + timestamp  
- season: 1 = spring, 2 = summer, 3 = fall, 4 = winter 
- holiday: whether the day is considered a holiday
- workingday: whether the day is neither a weekend nor holiday
- weather: 
  - 1: Clear, Few clouds, Partly cloudy
  - 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
  - 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
  - 4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog
- temp: temperature in Celsius
- atemp: "feels like" temperature in Celsius
- humidity: relative humidity
- windspeed: wind speed
- count: number of total rentals

# 2. Data exploration & Feature engineering

Print basic statistics for quantitative and categorical variables

In [None]:
df.describe()

The method `value_counts()` is helpful for looking into categorical attributes: it shows how often each unique value occurs. The code to check for unique weather categories is already prepared for you.

Run the following code cell and think about the meaning of the output.

In [None]:
df["weather"].value_counts()
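
To see what `value_counts()` returns without looking at the real data, here is a minimal sketch on a made-up column. The Series `colors` and its values are assumptions for illustration only and are not part of the bike sharing data set.

```python
import pandas as pd

# Hypothetical toy column, not taken from the bike sharing data
colors = pd.Series(["red", "blue", "red", "green", "red", "blue"])

# Count how often each unique value occurs (most frequent value first)
counts = colors.value_counts()
print(counts)

# normalize=True returns relative frequencies instead of absolute counts
shares = colors.value_counts(normalize=True)
print(shares)
```

The result is itself a Series whose index holds the unique values, so a single count can be read with `counts["red"]`.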

📝 **Task**: Determine `value_counts()` for two additional categorical attributes/columns/variables.

In [None]:
# Code for first categorical attribute


In [None]:
# Code for second categorical attribute


📝 **Task:** Try to use `value_counts()` on a continuous column like `windspeed`. What do you notice? Does it make sense to determine value_counts() for continuous attributes?

In [None]:
# Code for continuous attribute (e.g. windspeed)


## Creating insightful plots
Charts and diagrams are a very popular way to explore data and to understand relationships. There are multiple visualization packages for Python (e.g. Matplotlib, Seaborn, Plotly, etc.). These packages can be used to create different types of visualizations (e.g. scatter, bar, line, histogram, boxplot, etc.).

In [None]:
# Plotting the output variable (count) over time gives us a first impression of its distribution
# Plotting is usually done with matplotlib (imported as plt). We define the scatter plot in a few steps:
plt.figure(figsize=(20, 8))
plt.scatter(df.index, df["count"], c=df["temp"], cmap="coolwarm", s=5)
plt.colorbar()
plt.xlabel("Datetime")
plt.ylabel("Number of bike rentals per hour")
plt.title("Bike rentals over the course of two years\n color-coded by temperature")
plt.show()

## Splitting timestamp into features (Feature engineering)

Feature engineering plays a crucial role for the performance of a machine learning algorithm. 

It encompasses various techniques such as 
- handling/removing outliers or missing values
- transformation of variables 
- scaling of variables
- combination of variables or 
- splitting a feature into several ones (e.g. date -> year, month, day)
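
Some of the techniques listed above can be sketched on a tiny made-up column. The frame `toy`, the column name `speed`, and its values are assumptions for illustration. Median filling, a `log1p` transformation, and min-max scaling are just one possible choice per technique, not a recommendation for this data set.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from the bike sharing data set
toy = pd.DataFrame({"speed": [0.0, 7.0, np.nan, 13.0, 50.0]})

# Handling missing values: fill the gap with the column median
toy["speed_filled"] = toy["speed"].fillna(toy["speed"].median())

# Transformation: log(1 + x) compresses large values
toy["speed_log"] = np.log1p(toy["speed_filled"])

# Scaling: min-max scaling maps the values to the range [0, 1]
col = toy["speed_filled"]
toy["speed_scaled"] = (col - col.min()) / (col.max() - col.min())

print(toy)
```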

In [None]:
# The datetime timestamp provides valuable information, we need to extract the individual attributes in order to make better use of it
df["hour"] = [datetime.hour for datetime in df.index]
# 📝 Task: Use the same procedure for day, month, and year
df["day"] = 
df["month"] = 
df["year"] = 

df.head()

We split the date into its components. The data set also already contains other information derived from the date (season, holiday, workingday).

📝 **Task**: Can you think of another potentially valuable attribute that we could try to extract?

In [None]:
df["REPLACE_ATTRIBUTE"] = 

df.head()

## Train - Test split for model training and evaluation

In supervised learning, the training set is labeled, i.e. the correct output is known for every observation. The model is evaluated on previously unseen data. Therefore, we need to split the dataframe representing the full data set beforehand.

In [None]:
# There are multiple ways to create a train - test split. Using the pandas-internal sample() function is straightforward
df_train = df.sample(frac=0.75, random_state=1)
df_test = df.drop(df_train.index)
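
The `sample()` / `drop()` pattern can be checked on a small made-up frame. The names `toy`, `toy_train`, and `toy_test` are assumptions for illustration. Note that `drop()` removes rows by index label, so the sketch assumes the index labels are unique.

```python
import pandas as pd

# Hypothetical toy frame with a unique index
toy = pd.DataFrame({"value": range(8)}, index=list("abcdefgh"))

# Draw 75 % of the rows at random; random_state makes the draw reproducible
toy_train = toy.sample(frac=0.75, random_state=1)

# Everything that was not drawn becomes the test set
toy_test = toy.drop(toy_train.index)

print(len(toy_train), len(toy_test))
# No row may appear in both sets
print(toy_train.index.intersection(toy_test.index).empty)
```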

In [None]:
# We need to break down the train - test sets into independent variables (input features: what we have as input) and dependent variable (target: what we want to predict)
x_train = df_train.drop(["count"], axis=1)
y_train = df_train["count"]

x_test = df_test.drop(["count"], axis=1)
y_test = df_test["count"]

# 3. Supervised Machine Learning with Decision Trees

### Training a decision tree model
Decision-tree-based algorithms are very popular in machine learning. They are easy to set up, capable of capturing non-linearity, good at handling categorical variables, and intuitive to understand. Trees are built from the top down: each branch and its splitting criterion are chosen to improve a given metric. Being greedy algorithms, they pick the locally best split at every step, which does not guarantee a globally optimal tree.
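
The greedy search for one split can be sketched in plain Python. This is a simplified illustration, not scikit-learn's implementation: the helper names `sum_squared_error` and `best_split` are hypothetical, the toy data is made up, and the sketch uses the squared error for simplicity, whereas the model below is configured with `criterion='absolute_error'`.

```python
def sum_squared_error(values):
    """Sum of squared deviations from the mean (0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(x, y):
    """Try every midpoint between sorted neighboring x values and return the
    threshold with the lowest combined error of the two resulting groups."""
    pairs = sorted(zip(x, y))
    best_threshold, best_error = None, float("inf")
    for i in range(1, len(pairs)):
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # no threshold fits between identical x values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [target for _, target in pairs[:i]]
        right = [target for _, target in pairs[i:]]
        error = sum_squared_error(left) + sum_squared_error(right)
        if error < best_error:
            best_threshold, best_error = threshold, error
    return best_threshold, best_error


# Hypothetical toy data: rentals are low at night and high during the day
hours = [1, 2, 3, 8, 9, 10]
rentals = [5, 6, 7, 100, 110, 105]

threshold, error = best_split(hours, rentals)
print(threshold, error)  # 5.5 52.0
```

A full tree repeats this search on each of the two resulting groups until a stopping rule such as `max_depth` is reached.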

In [None]:
# The object of the model with selected parameters has to be created. max_depth specifies how many layers we want in the tree, criterion defines the splitting metric
model1 = DecisionTreeRegressor(criterion='absolute_error', max_depth=3)

# Now the model is trained with fit() using the previously created training data
model1.fit(x_train, y_train)

### Predicting results and visualizing the decision tree


In [None]:
# The trained model is now ready to predict any new input having the same attributes as the training data
y_predictions1 = model1.predict(x_test)

# The output is simply an array of the forecasted bike rentals in the test set
print("y_predictions: ", y_predictions1[:5])

# Print input features
df_test.head()

In [None]:
HTML('<img src="https://raw.githubusercontent.com/JalalMirzayev/kaggle-bike-task/main/images/decision_tree_plot.png" style="width:800px;height:400px;">')

In [None]:
# Let us look at an individual case and follow the path of our decision tree
print('Prediction: ' + str(y_predictions1[2]))
print(df_test.iloc[2])

### Evaluating model performance
Many different evaluation metrics can be used to assess a model's performance.
The following three metrics are often used to evaluate the performance of regression models:

<hr style="border:2px solid gray"> </hr>

**Mean Squared Error**. The Mean Squared Error (MSE) is defined as follows

$$ \text{MSE} = \frac{1}{N}\sum_{n=1}^N[y_\text{n, actual} - y_\text{n,predict}]^2,$$

in which $y_\text{n,predict}$ is the prediction for the $n^\text{th}$ observation. The value $y_\text{n, actual}$ is the actual value of the $n^\text{th}$ observation.

<hr style="border:2px solid gray"> </hr>

**Mean Absolute Error**. The Mean Absolute Error (MAE) is defined as follows

$$ \text{MAE} = \frac{1}{N}\sum_{n=1}^N\left|y_\text{n, actual} - y_\text{n,predict}\right|,$$

in which $y_\text{n,predict}$ is the prediction for the $n^\text{th}$ observation. The value $y_\text{n, actual}$ is the actual value of the $n^\text{th}$ observation.

<hr style="border:2px solid gray"> </hr>

**Coefficient of determination**. The coefficient of determination (also known as $R^2$ or $R$ squared, ger. Bestimmtheitsmaß/Determinationskoeffizient) is defined as follows

$$R^2 = 1 - \dfrac{\sum_{n=1}^N[y_\text{n,actual}-y_\text{n,predict}]^2}{\sum_{n=1}^N[y_\text{n,actual}-\overline{y}]^2 },$$

in which $y_\text{n,predict}$ is the prediction for the $n^\text{th}$ observation. The value $y_\text{n, actual}$ is the actual value of the $n^\text{th}$ observation. Finally, $\overline{y}$ is the mean of all actual observations $y_\text{n, actual}$.
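
To make the three metrics concrete, they can be computed by hand on a few made-up numbers. The functions `mse`, `mae`, and `r2` and the values below are assumptions for illustration. $R^2$ is computed here as one minus the ratio of the residual to the total sum of squares, which is the definition scikit-learn's `r2_score` uses.

```python
def mse(actual, predicted):
    """Mean of the squared differences."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def mae(actual, predicted):
    """Mean of the absolute differences."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def r2(actual, predicted):
    """One minus residual sum of squares divided by total sum of squares."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Hypothetical values, not taken from the bike sharing data
y_actual = [10, 20, 30, 40]
y_predict = [12, 18, 33, 37]

print("MSE:", mse(y_actual, y_predict))           # 6.5
print("MAE:", mae(y_actual, y_predict))           # 2.5
print("R²: ", round(r2(y_actual, y_predict), 3))  # 0.948
```

The errors are -2, 2, -3, and 3, so the squared errors sum to 26 and the absolute errors sum to 10. Dividing by $N = 4$ gives the MSE of 6.5 and the MAE of 2.5.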

In [None]:
print("mean squared error: ", mean_squared_error(y_test, y_predictions1))
print("mean absolute error: ", mean_absolute_error(y_test, y_predictions1))
print("R²: ", r2_score(y_test, y_predictions1))

## Hyperparameter tuning
The Decision Tree has several parameters which can be tuned. It's an important part of machine learning to set appropriate values for these parameters.

**Code as reference**
```python
model2 = DecisionTreeRegressor(criterion="absolute_error", max_depth=3)
model2.fit(x_train, y_train)
y_predictions2 = model2.predict(x_test)
```

📝 **Task**: Try different values for `max_depth` (e.g. 100, 200, and 300) and compare the performance metrics. Use the code above for reference.

In [None]:
# Enter your code here


In [None]:
# Display performance metrics
print("mean squared error: ", mean_squared_error(y_test, y_predictions2))
print("mean absolute error: ", mean_absolute_error(y_test, y_predictions2))
print("R²: ", r2_score(y_test, y_predictions2))
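
One way to compare several settings systematically is a loop that records the training and test error for each value. The following is a minimal sketch on synthetic data, not on the bike sharing data: the noisy sine curve, the variable names, and the chosen depths are assumptions for illustration. It uses scikit-learn's default squared-error criterion.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeRegressor

# Hypothetical synthetic data: a noisy sine curve
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=(300, 1))
y = np.sin(x[:, 0]) + rng.normal(scale=0.3, size=300)
x_tr, y_tr = x[:200], y[:200]
x_te, y_te = x[200:], y[200:]

# Fit one tree per depth and store (training MSE, test MSE)
results = {}
for depth in [1, 3, 5, 10, 20]:
    tree = DecisionTreeRegressor(max_depth=depth, random_state=0)
    tree.fit(x_tr, y_tr)
    results[depth] = (
        mean_squared_error(y_tr, tree.predict(x_tr)),
        mean_squared_error(y_te, tree.predict(x_te)),
    )

for depth, (train_mse, test_mse) in results.items():
    print(f"max_depth={depth:2d}  train MSE={train_mse:.3f}  test MSE={test_mse:.3f}")
```

A typical pattern is that the training error keeps falling as the tree gets deeper, while the test error stops improving at some point. Whether and where this happens depends on the data.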

## Feature Importance: What features are important for prediction?

An interesting byproduct of building a decision tree is the feature importance. The importance of a feature is computed as the (normalized) total reduction of the criterion (e.g. MAE) brought by that feature. It is a crucial measure for interpreting the model.
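
The normalization can be seen on a small synthetic example. The data below is an assumption for illustration: the target depends only on the first of two columns, so that column should receive almost all of the importance.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Hypothetical synthetic data: the target depends only on the first column
rng = np.random.default_rng(0)
features = rng.uniform(0, 1, size=(200, 2))
target = 3.0 * features[:, 0]

tree = DecisionTreeRegressor(max_depth=3, random_state=0)
tree.fit(features, target)

importances = tree.feature_importances_
print(importances)        # one value per input column
print(importances.sum())  # the values are normalized to sum to 1
```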

In [None]:
# We get the feature importance out of the model, calculate relative values and plot them as a bar chart
feature_importance = model2.feature_importances_
feature_importance = 100.0 * (feature_importance / feature_importance.sum())

sorted_idx = np.argsort(feature_importance)
pos = np.arange(sorted_idx.shape[0]) + .5
plt.figure(figsize=(8,6))
plt.barh(pos, feature_importance[sorted_idx], align="center")
plt.yticks(pos, x_train.columns[sorted_idx])
plt.xlabel("Relative Importance")
plt.title("Feature Importance")
plt.show()

### Improving performance with decision tree ensemble methods
Despite being a powerful concept, decision trees have their drawbacks. They tend to overfit the training data, which leads to poor generalization, so a single tree often does not reach high prediction accuracy. Decision tree ensemble methods tackle these issues by building on the concept of randomization. A Random Forest combines many trees in a single model (= forest), but only considers subsets of the training data and input features when building each individual tree. In practice, this method usually achieves significantly higher robustness and performance than a single decision tree. It is a popular choice for a wide range of real-world data science use cases.
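
The idea behind the ensemble can be sketched by hand with individual trees. This is a simplified illustration on synthetic data, not the internals of `RandomForestRegressor`: the data, the number of trees, and the variable names are assumptions. Each tree is fit on a bootstrap sample (rows drawn with replacement), and `max_features=2` lets every split consider only a random subset of the columns.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Hypothetical synthetic data, not the bike sharing data
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=(200, 3))
y = np.sin(x[:, 0]) + 0.1 * x[:, 1] + rng.normal(scale=0.3, size=200)

trees = []
for seed in range(25):
    # Bootstrap sample: draw as many rows as the data has, with replacement
    rows = rng.integers(0, len(x), size=len(x))
    # Each split considers a random subset of 2 of the 3 columns
    tree = DecisionTreeRegressor(max_features=2, random_state=seed)
    tree.fit(x[rows], y[rows])
    trees.append(tree)

# The ensemble prediction is the average over all individual trees
x_new = rng.uniform(0, 10, size=(5, 3))
forest_prediction = np.mean([tree.predict(x_new) for tree in trees], axis=0)
print(forest_prediction)
```

Averaging many trees that each saw slightly different data smooths out the errors of the individual trees, which is why the ensemble tends to be more robust.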

In [None]:
# We need to create an object of the RandomForestRegressor() class. n_estimators defines the number of trees in the forest, n_jobs=-1 activates multicore processing
model3 = RandomForestRegressor(n_estimators=100, n_jobs=-1)
model3.fit(x_train, y_train)
y_predictions3 = model3.predict(x_test)

In [None]:
print("mean squared error: ", mean_squared_error(y_test, y_predictions3))
print("mean absolute error: ", mean_absolute_error(y_test, y_predictions3))
print("R²: ", r2_score(y_test, y_predictions3))