- Author: Your Name
- First Commit: yyyy-mm-dd                      # following ISO 8601 format
- Last Commit: yyyy-mm-dd                       # following ISO 8601 format
- Description: This notebook is used to perform EDA on the "xxxxx" dataset

In [None]:
# Import libraries
import os
import warnings

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import (
    explained_variance_score,
    max_error,
    mean_absolute_error,
    mean_absolute_percentage_error,
    mean_gamma_deviance,
    mean_poisson_deviance,
    mean_squared_error,
    mean_squared_log_error,
    mean_tweedie_deviance,
    median_absolute_error,
    r2_score,
)
from sklearn.model_selection import (
    GridSearchCV,
    RandomizedSearchCV,
    cross_val_score,
    train_test_split,
)
from sklearn.preprocessing import (
    LabelEncoder,
    MinMaxScaler,
    OneHotEncoder,
    StandardScaler,
)

# Silence library warnings to keep the notebook output readable
warnings.filterwarnings('ignore')

# Machine Learning

This Jupyter notebook provides a basic but useful structure for an ML project. It performs the following steps:

**Importing Data**

- Read the "cleaned_df.csv" file into a DataFrame named "cleaned_df".
- Display the head, information, and summary statistics of the "cleaned_df" DataFrame.

**Exploratory Data Analysis**

- Visualize correlations between variables using a heatmap.

**Data Preprocessing**

- Drop a highly correlated column from the DataFrame.

**Data Distribution**

- Plot histograms to visualize the distribution of the data.

**Linearity Analysis**

- Create scatter plots to analyze linearity between variables.

**Encoding Categorical Variables**

- Encode categorical variables using one-hot encoding and create new columns for each category.

**Splitting Data**

- Split the data into train and test sets.

**Scaling Data**

- Scale the data using Min-Max scaling.

**Model Training and Prediction**

- Train a linear regression model on the scaled training data.
- Make predictions on the scaled validation data.

**Model Evaluation**

- Calculate and display evaluation metrics:
  - R2 score
  - Mean Squared Error
  - Mean Absolute Error
  - Root Mean Squared Error
  - Explained Variance Score

**Residual Analysis**

- Plot the residuals between the true and predicted values.

In [None]:
# Import data
cleaned_df = pd.read_csv('../cleaned_df.csv')

# Explore data
display(cleaned_df.head())
cleaned_df.info()  # info() prints its report directly and returns None, so it is not wrapped in display()
display(cleaned_df.describe())

In [None]:
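# Search for correlations between the numeric variables in a heatmap
# numeric_only=True skips text columns, which would otherwise raise an error in recent pandas versions
plt.figure(figsize=(20, 10))
sns.heatmap(cleaned_df.corr(numeric_only=True), annot=True, cmap='RdYlGn')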

In [None]:
# Drop one column of a highly correlated pair
cleaned_df.drop('temp', axis=1, inplace=True)
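
The choice of column to drop is read off the heatmap above. As a rough sketch, the same check can be done programmatically. The toy data, the `atemp` and `humidity` column names, the helper `highly_correlated_pairs`, and the 0.9 threshold are all assumptions made for illustration, not part of the original dataset.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): 'atemp' is almost a copy of 'temp'
rng = np.random.default_rng(0)
temp = rng.normal(20, 5, size=200)
toy_df = pd.DataFrame({
    'temp': temp,
    'atemp': temp + rng.normal(0, 0.1, size=200),
    'humidity': rng.normal(60, 10, size=200),
})

def highly_correlated_pairs(df, threshold=0.9):
    """Return (col_a, col_b, |r|) for every column pair whose absolute correlation exceeds threshold."""
    corr = df.corr(numeric_only=True).abs()
    # Keep only the upper triangle (above the diagonal) so each pair appears once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    pairs = pairs[pairs > threshold]
    return [(a, b, float(r)) for (a, b), r in pairs.items()]

pairs = highly_correlated_pairs(toy_df)
print(pairs)
```

One column of each reported pair is then a candidate for dropping; which one to keep remains a modelling decision.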

In [None]:
# Take a look at the distribution of the data
cleaned_df.hist(figsize=(20, 10))

In [None]:
# Search for linearity between the variables
sns.pairplot(cleaned_df, x_vars=['column1', 'column2', 'column3', 'column4'], y_vars=['column1', 'column2', 'column3', 'column4'], diag_kind='kde')

### Encoding categorical variables

In [None]:
cleaned_df.head()

In [None]:
# Use get_dummies for encoding categorical variables with unique column names
new_month = pd.get_dummies(cleaned_df['month'], prefix='month', drop_first=False)
new_weather = pd.get_dummies(cleaned_df['weather'], prefix='weather', drop_first=False)
new_year = pd.get_dummies(cleaned_df['year'], prefix='year', drop_first=False)

# Drop the old columns
cleaned_df.drop(['month', 'weather', 'year'], axis=1, inplace=True)

# Concatenate the encoded columns with the original dataframe
cleaned_df_encoded = pd.concat([cleaned_df, new_month, new_weather, new_year], axis=1)

In [None]:
# weekday is also a categorical variable.
# LabelEncoder maps each weekday to an integer; OneHotEncoder then expands the integers into indicator columns
le = LabelEncoder()
cleaned_df_encoded['weekday'] = le.fit_transform(cleaned_df_encoded['weekday'])

ohe_weekday = OneHotEncoder()
weekday_encoded = ohe_weekday.fit_transform(cleaned_df_encoded['weekday'].values.reshape(-1, 1)).toarray()
weekday_df = pd.DataFrame(
    weekday_encoded,
    columns=["weekday_" + str(i) for i in range(weekday_encoded.shape[1])],
    index=cleaned_df_encoded.index,  # keep row labels aligned for the concat below
)

cleaned_df_encoded = pd.concat([cleaned_df_encoded, weekday_df], axis=1)
cleaned_df_encoded.drop(['weekday'], axis=1, inplace=True)

cleaned_df_encoded.head()
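
As a possible alternative to the LabelEncoder plus OneHotEncoder steps, `pd.get_dummies` can encode a column in a single call while preserving the row index and naming the new columns after the categories. The sketch below uses a hypothetical toy frame; the weekday labels and counts are invented for illustration.

```python
import pandas as pd

# Toy frame (hypothetical values) with a non-default index, as left behind by row filtering
toy_df = pd.DataFrame(
    {'weekday': ['Mon', 'Tue', 'Mon', 'Sun'], 'count': [10, 20, 15, 5]},
    index=[3, 7, 8, 12],
)

# get_dummies creates one indicator column per category and keeps the original index
weekday_dummies = pd.get_dummies(toy_df['weekday'], prefix='weekday', dtype=int)

# Replace the original column with its indicator columns
toy_encoded = pd.concat([toy_df.drop(columns='weekday'), weekday_dummies], axis=1)
print(toy_encoded)
```

Because the index is preserved, the concatenation cannot misalign rows, which is the main risk when building the indicator DataFrame by hand.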

### Split the data into train and test sets

In [None]:
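# Split the data by year: one year for training, the other for testing
# Note: year_0 == 0 selects the rows that do NOT belong to the first year category
train_df = cleaned_df_encoded[cleaned_df_encoded['year_0'] == 0]
test_df = cleaned_df_encoded[cleaned_df_encoded['year_0'] == 1]

# Drop the year columns (reassign instead of using inplace, since both frames are slices of cleaned_df_encoded)
train_df = train_df.drop(['year_0', 'year_1'], axis=1)
test_df = test_df.drop(['year_0', 'year_1'], axis=1)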

# Split the train data into train and validation sets
X_train = train_df.drop('count', axis=1)
y_train = train_df['count']

X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=42)

# Split the held-out year into a test set and a second validation set
X_test = test_df.drop('count', axis=1)
y_test = test_df['count']

X_test, X_val_test, y_test, y_val_test = train_test_split(X_test, y_test, test_size=0.5, random_state=42)
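
To make the effect of `test_size` and `random_state` concrete, here is a small sketch on hypothetical toy arrays (the names `X_toy` and `y_toy` are not part of the notebook). It checks the resulting sizes and shows that a fixed `random_state` reproduces the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 100 rows, 2 features
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.arange(100)

# test_size=0.2 holds out 20% of the rows for validation
X_tr, X_va, y_tr, y_va = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)
print(len(X_tr), len(X_va))

# The same random_state yields the same partition on every run
X_tr2, X_va2, y_tr2, y_va2 = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)
print(np.array_equal(y_va, y_va2))
```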

In [None]:
train_df.info()

### Scale the data

In [None]:
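# Bring all the values to a uniform range.
# Scaling helps gradient-based optimizers converge faster and keeps coefficients on comparable scales.
# Note that scikit-learn's LinearRegression solves ordinary least squares directly rather than by gradient descent,
# so here scaling mainly matters if the model is later swapped for a gradient-based or regularized one.

scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit on the training data only
X_val_scaled = scaler.transform(X_val)
X_test_scaled = scaler.transform(X_test)
X_val_test_scaled = scaler.transform(X_val_test)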
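
The scaler is fitted on the training data only and then reused for the other sets. A minimal sketch on hypothetical toy arrays shows what that implies: Min-Max scaling computes `(x - min) / (max - min)` with the training minimum and maximum, so values outside the training range land outside [0, 1].

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy data (hypothetical): a single feature
X_train_toy = np.array([[0.0], [5.0], [10.0]])
X_val_toy = np.array([[2.5], [12.0]])  # 12.0 lies above the training maximum

scaler_toy = MinMaxScaler()
train_scaled = scaler_toy.fit_transform(X_train_toy)  # learns min=0, max=10
val_scaled = scaler_toy.transform(X_val_toy)          # reuses the training min and max

# The same transformation written out by hand
train_min = X_train_toy.min(axis=0)
train_max = X_train_toy.max(axis=0)
manual = (X_val_toy - train_min) / (train_max - train_min)

print(train_scaled.ravel())
print(val_scaled.ravel())
```

Fitting the scaler on validation or test data as well would leak information about those sets into the preprocessing step.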

### Train and fit the model and make predictions

In [None]:
# Create the model
model = LinearRegression()

# Fit the model
model.fit(X_train_scaled, y_train)

# Make predictions
y_pred = model.predict(X_val_scaled)

In [None]:
# Create a function that will calculate the metrics
def calculate_metrics(y_true, y_pred):
    print('R2 score: {}'.format(r2_score(y_true, y_pred)))
    print('Mean Squared Error: {}'.format(mean_squared_error(y_true, y_pred)))
    print('Mean Absolute Error: {}'.format(mean_absolute_error(y_true, y_pred)))
    print('Root Mean Squared Error: {}'.format(np.sqrt(mean_squared_error(y_true, y_pred))))
    print('Explained Variance Score: {}'.format(explained_variance_score(y_true, y_pred)))

# calculate the metrics
calculate_metrics(y_val, y_pred)
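
To see what these metrics measure, the sketch below recomputes three of them by hand with NumPy and compares the results with scikit-learn. The four target values and predictions are hypothetical toy numbers.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Toy values (hypothetical)
y_true_toy = np.array([3.0, 5.0, 7.0, 9.0])
y_pred_toy = np.array([2.0, 5.0, 8.0, 10.0])

errors = y_true_toy - y_pred_toy

mse = np.mean(errors ** 2)        # mean of the squared errors
rmse = np.sqrt(mse)               # same units as the target
mae = np.mean(np.abs(errors))     # mean of the absolute errors

# R2 = 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true_toy - y_true_toy.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(mse, rmse, mae, r2)
print(mean_squared_error(y_true_toy, y_pred_toy),
      mean_absolute_error(y_true_toy, y_pred_toy),
      r2_score(y_true_toy, y_pred_toy))
```

For these toy values the squared errors sum to 3 and the total sum of squares is 20, so the MSE is 0.75 and R2 is 0.85.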

In [None]:
# Create a function that will plot the residuals
def plot_residuals(y_true, y_pred):
    residuals = y_true - y_pred
    plt.figure(figsize=(20, 10))
    plt.scatter(y_true, residuals)
    plt.title('Residual plot')
    plt.xlabel('y_true')
    plt.ylabel('residuals')
    plt.show()

# call the function to plot the residuals
plot_residuals(y_val, y_pred)

Note: The metrics calculated above (R2 score, Mean Squared Error, Mean Absolute Error, Root Mean Squared Error, and Explained Variance Score) are regression metrics. If this template is reused for a classification task, a different set of metrics applies. Here are some commonly used ones:

- Accuracy: Accuracy measures the overall correctness of the model's predictions. It is the ratio of the number of correct predictions to the total number of predictions.

- Precision: Precision is the proportion of true positive predictions out of all positive predictions. It measures how reliable the positive predictions are.

- Recall: Recall is the proportion of true positive predictions out of all actual positive instances. It measures the ability of the model to correctly identify positive instances.

- F1 Score: The F1 score is the harmonic mean of precision and recall. It provides a single metric that balances both.

- ROC AUC Score: ROC (Receiver Operating Characteristic) AUC (Area Under the Curve) score is a performance metric for binary classification models. It measures the model's ability to discriminate between positive and negative classes.

- Confusion Matrix: A confusion matrix is a table that shows the counts of true positive, true negative, false positive, and false negative predictions. It provides a detailed breakdown of the model's performance across different classes.

- Classification Report: A classification report provides a summary of precision, recall, F1 score, and support for each class in a multi-class classification problem.

These metrics do not apply to the regression model trained in this notebook, but they give the equivalent insight for classification tasks. They can be calculated with the corresponding functions in scikit-learn's `sklearn.metrics` module.
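
As a minimal sketch of the classification metrics listed above, the example below computes them with scikit-learn on hypothetical toy labels. The labels, the scores, and the 0.5 decision threshold are assumptions made for illustration; none of this applies to the regression model in this notebook.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
)

# Toy binary labels and predicted probabilities (hypothetical)
y_true_cls = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_score_cls = np.array([0.9, 0.2, 0.4, 0.8, 0.3, 0.6, 0.7, 0.1])

# Turn probabilities into class predictions with a 0.5 threshold
y_pred_cls = (y_score_cls >= 0.5).astype(int)

acc = accuracy_score(y_true_cls, y_pred_cls)
prec = precision_score(y_true_cls, y_pred_cls)
rec = recall_score(y_true_cls, y_pred_cls)
f1 = f1_score(y_true_cls, y_pred_cls)
auc = roc_auc_score(y_true_cls, y_score_cls)  # uses the scores, not the thresholded predictions
cm = confusion_matrix(y_true_cls, y_pred_cls)  # rows: true class, columns: predicted class

print(acc, prec, rec, f1, auc)
print(cm)
print(classification_report(y_true_cls, y_pred_cls))
```

Note that ROC AUC is computed from the predicted scores, while the other metrics depend on the chosen threshold.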