In [26]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from imblearn.over_sampling import RandomOverSampler
from sklearn.preprocessing import StandardScaler
import copy # lets us copy things
import seaborn as sns #seaborn is a wrapper for matplotlib
import tensorflow as tf
from sklearn.linear_model import LinearRegression

This dataset is licensed under a Creative Commons Attribution 4.0 International (CC BY 4.0) license.

# MODEL - SIMPLE LINEAR REGRESSION | SUPERVISED LEARNING
The main idea of linear regression is to fit a line (in 2D space), a plane (in 3D space), or a hyperplane (in higher dimensions) that best describes the relationship between the independent (input) and dependent (output) variables. This is accomplished by minimizing the sum of squares of the vertical deviations from each data point to the line (called residuals).
y = b0 + b1*x + e

where:

y is the dependent variable,
x is the independent variable,
b0 is the y-intercept,
b1 is the slope of the line (indicating the effect x has on y), and
e is the error term.

- trying to decrease the residual, "the line of error" for the linear prediction compared to the actual data plot

Training a linear regression model involves finding the values for the coefficients that minimize the sum of the squared residuals (also known as the loss function), usually using methods such as Gradient Descent or the Normal Equation.

# how to evaluate a linear regression model
R-Squared (Coefficient of Determination): This measures the proportion of variance in the dependent variable that can be predicted from the independent variable(s). It takes a value between 0 and 1, where a higher value generally indicates a better fit of the model. However, R-Squared doesn't tell the whole story, as it tends to increase with the addition of more predictors, regardless of whether they are truly meaningful.

Adjusted R-Squared: This is a modification of R-Squared that adjusts for the number of predictors in the model. Unlike R-Squared, the adjusted R-Squared increases only if the new term improves the model more than would be expected by chance.

Mean Squared Error (MSE) or Root Mean Squared Error (RMSE): These metrics represent the average squared difference or square root of that difference respectively between the observed and predicted values. It's a measure of the model's accuracy and lower values are desirable.

Mean Absolute Error (MAE): It represents the average of the absolute differences between the observed and predicted values. It's less sensitive to outliers compared to MSE or RMSE.

Residual Plots: Residuals are the difference between the observed and predicted values. Plotting these can give you an idea about the variance and whether it’s constant or not (homoscedasticity).

F-statistic: It is a statistical test used to compare our model with a reduced model (no predictors, just intercept). The null hypothesis states that the reduced model is better, and if the F-statistic is sufficiently large, this is evidence against the null hypothesis.

T-statistic/P-values: These are used for hypothesis testing on the coefficients. The null hypothesis states that the true coefficient is zero, i.e., the predictor has no effect on the outcome variable. If the p-value associated with a predictor is low (typically < 0.05), we reject the null hypothesis and say that the predictor is statistically significant.

Variance Inflation Factor (VIF): VIF measures the correlation and collinearity between the predictor variables in the model. If the VIF is high for a predictor variable (>5 or >10), it means that variable is highly correlated with the other predictor variables, and it might be more challenging to interpret its impact on the output variable.

In [27]:
dataset_cols = ["Date", 
                "bike-count", 
                "Hour", 
                "Temperature", 
                "Humidity", 
                "Windspeed", 
                "Visibility",
                "Dew",
                "Solar-radiation",
                "Rainfall-mm",
                "Snowfall-cm",
                "Seasons",
                "Holiday",
                "Functional-Day"]
df = pd.read_csv("SeoulBikeData.csv").drop(["Date", "Holiday", "Seasons"], axis=1) # drop certain columns if you want to

In [30]:
print(len(df.columns))
print(len(dataset_cols))

dataset_cols = dataset_cols[:len(df.columns)]
df.columns = dataset_cols # creating better columns

# NB!!!!! NEED TO RUN THESE THREE COMMENTED IF I COME BACK TO THIS
# make our data 0s and 1s
#df["Snowfall-cm"] = (df["Snowfall-cm"] == "Yes").astype(int)

#index on the hour column of the data frame, make all the hours 12 instead
#df = df[df["Hour"] == 12]
#then drop the hour column
#df = df.drop(["Hour"], axis=1)

df.head()

11
11


Unnamed: 0,Date,bike-count,Temperature,Humidity,Windspeed,Visibility,Dew,Solar-radiation,Rainfall-mm,Snowfall-cm
2583,1128,15,37,0.8,879,-2.2,1.09,0.0,0.0,Yes
2808,400,0,76,0.8,346,7.8,0.0,0.0,0.0,Yes
3256,124,16,61,3.6,1374,4.6,0.57,0.0,0.0,Yes
3297,1013,9,39,0.6,1386,-1.5,1.44,0.0,0.0,Yes
3429,24,21,86,2.3,681,9.7,0.0,2.5,0.0,Yes
