# Regression task - Bike sharing 1

Bike sharing systems are new generation of traditional bike rentals where whole process from membership, rental and return has become automatic. Through these systems, a user is able to easily rent a bike from a particular position and return it at another place.

The dataset contains the hourly count of rental bikes between years 2011 and 2012 in the Capital Bikeshare system (Washington DC) with the corresponding weather and seasonal information.

The goal of this task is to train a regressor to predict total counts of bike rentals based on the provided features for a given hour.

## Data source
[http://archive.ics.uci.edu/ml/datasets/Bike+Sharing+Dataset](http://archive.ics.uci.edu/ml/datasets/Bike+Sharing+Dataset)

## Feature description
* **dteday** - date time stamot
* **season** - season (1: spring, 2: summer, 3: fall, 4: winter)
* **yr** - year (0: 2011, 1: 2012)
* **mnth** - month (1 to 12)
* **hr** - hour (0 to 23)
* **holiday** - 1 if the day is a holiday, else 0 (extracted from [holiday schedules](https://dchr.dc.gov/page/holiday-schedules))
* **weekday** - day of the week (0 to 6)
* **workingday** - is 1 if day is neither weekend nor holiday, else 0.
* **weathersit**
    * 1: Clear, Few clouds, Partly cloudy, Partly cloudy
    * 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
    * 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
    * 4: Heavy Rain + Ice Pallets + Thunderstorm + Mist, Snow + Fog
* **temp** - Normalized temperature in degrees of Celsius.
* **atemp** - Normalized feeling temperature in degrees Celsius.
* **hum** - Normalized relative humidity.
* **windspeed** - Normalized wind speed.
* **casual** - Count of casual users.
* **registered** - Count of registered users.
* **cnt** -  Count of total rental bikes including both casual and registered. This is the target value.

In [1]:
import pandas as pd
data = pd.read_csv('https://raw.githubusercontent.com/mlcollege/introduction-to-ml/master/data/bikes.csv', sep=',')
data.head()

Unnamed: 0,dteday,season,yr,mnth,hr,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt
0,2011-01-01,1,0,1,0,0,6,0,1,0.24,0.2879,0.81,0.0,3,13,16
1,2011-01-01,1,0,1,1,0,6,0,1,0.22,0.2727,0.8,0.0,8,32,40
2,2011-01-01,1,0,1,2,0,6,0,1,0.22,0.2727,0.8,0.0,5,27,32
3,2011-01-01,1,0,1,3,0,6,0,1,0.24,0.2879,0.75,0.0,3,10,13
4,2011-01-01,1,0,1,4,0,6,0,1,0.24,0.2879,0.75,0.0,0,1,1


## Simple regressor

Implement a simple regressor based on all reasonable features from the input data set. Notice that some of the features from the input data cannot be used.

### Data preparation

Prepare train and test data sets.

In [3]:
from sklearn.model_selection import train_test_split

X_all = data[['season', 'yr', 'mnth', 'hr', 'holiday', 'weekday', 'workingday', 'weathersit','temp', 'atemp', 'hum', 'windspeed']]
y_all = data['cnt']


X_train, X_test, y_train, y_test = train_test_split(
    X_all,
    y_all,
    random_state=1,
    test_size=0.2)

print('Train size: {}'.format(len(X_train)))
print('Test size: {}'.format(len(X_test)))

Train size: 13903
Test size: 3476


### Training a regressor

Train a regressor using the following models:
* [LinearRegression](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html)
* [Support Vector Machines for regression](http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVR.html) (experiment with different kernels)
* [Gradient Boosted Trees](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingRegressor.html) (Experiment with different depths and number of trees)

In [21]:
from sklearn import svm
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

regr = Pipeline([
                #('std', StandardScaler()),
                 ('linear', LinearRegression())
                 #('svr', svm.SVR(kernel='rbf'))
                 #('svr', svm.SVR(kernel='linear'))
                 #('gbr', GradientBoostingRegressor(n_estimators=100, max_depth=4))
                ])

regr.fit(X_train, y_train)

### Evaluate the models

Measure mean squared error and mean absolute error evaluation metrics on both train and test data sets. Compute the mean and standard deviation of the target values. Decide which model performs best on the given problem.

In [13]:
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import numpy as np

y_pred = regr.predict(X_test)
r_squared = r2_score(y_test, y_pred)
print ("Test mean: {}, std: {}".format(np.mean(y_test), np.std(y_test)))
print("Test Root mean squared error: {:.2f}".format(np.sqrt(mean_squared_error(y_test, y_pred))))
print("Test Mean absolute error: {:.2f}".format(mean_absolute_error(y_test, y_pred)))
print(f"Hodnota R-squared je: {r_squared}")

y_pred = regr.predict(X_train)
r_squared = r2_score(y_train, y_pred)
print("Train Root mean squared error: %.2f"
      % np.sqrt(mean_squared_error(y_train, y_pred)))
print("Train Mean absolute error: %.2f"
      % mean_absolute_error(y_train, y_pred))
print(f"Hodnota R-squared je: {r_squared}")

Test mean: 191.16168009205984, std: 182.65244995043582
Test Root mean squared error: 141.77
Test Mean absolute error: 106.34
Hodnota R-squared je: 0.3975373993285607
Train Root mean squared error: 141.82
Train Mean absolute error: 105.80
Hodnota R-squared je: 0.3865053196103222


In [15]:
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import numpy as np

y_pred = regr.predict(X_test)
r_squared = r2_score(y_test, y_pred)
print ("Test mean: {}, std: {}".format(np.mean(y_test), np.std(y_test)))
print("Test Root mean squared error: {:.2f}".format(np.sqrt(mean_squared_error(y_test, y_pred))))
print("Test Mean absolute error: {:.2f}".format(mean_absolute_error(y_test, y_pred)))
print(f"Hodnota R-squared je: {r_squared}")

y_pred = regr.predict(X_train)
r_squared = r2_score(y_train, y_pred)
print("Train Root mean squared error: %.2f"
      % np.sqrt(mean_squared_error(y_train, y_pred)))
print("Train Mean absolute error: %.2f"
      % mean_absolute_error(y_train, y_pred))
print(f"Hodnota R-squared je: {r_squared}")

Test mean: 191.16168009205984, std: 182.65244995043582
Test Root mean squared error: 144.79
Test Mean absolute error: 92.03
Hodnota R-squared je: 0.37165634702771033
Train Root mean squared error: 144.54
Train Mean absolute error: 91.96
Hodnota R-squared je: 0.3627690704922618


In [17]:
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import numpy as np

y_pred = regr.predict(X_test)
r_squared = r2_score(y_test, y_pred)
print ("Test mean: {}, std: {}".format(np.mean(y_test), np.std(y_test)))
print("Test Root mean squared error: {:.2f}".format(np.sqrt(mean_squared_error(y_test, y_pred))))
print("Test Mean absolute error: {:.2f}".format(mean_absolute_error(y_test, y_pred)))
print(f"Hodnota R-squared je: {r_squared}")

y_pred = regr.predict(X_train)
r_squared = r2_score(y_train, y_pred)
print("Train Root mean squared error: %.2f"
      % np.sqrt(mean_squared_error(y_train, y_pred)))
print("Train Mean absolute error: %.2f"
      % mean_absolute_error(y_train, y_pred))
print(f"Hodnota R-squared je: {r_squared}")

Test mean: 191.16168009205984, std: 182.65244995043582
Test Root mean squared error: 53.91
Test Mean absolute error: 35.63
Hodnota R-squared je: 0.9128994838235442
Train Root mean squared error: 53.24
Train Mean absolute error: 35.55
Hodnota R-squared je: 0.9135280638135063


In [20]:
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import numpy as np

y_pred = regr.predict(X_test)
r_squared = r2_score(y_test, y_pred)
print ("Test mean: {}, std: {}".format(np.mean(y_test), np.std(y_test)))
print("Test Root mean squared error: {:.2f}".format(np.sqrt(mean_squared_error(y_test, y_pred))))
print("Test Mean absolute error: {:.2f}".format(mean_absolute_error(y_test, y_pred)))
print(f"Hodnota R-squared je: {r_squared}")

y_pred = regr.predict(X_train)
r_squared = r2_score(y_train, y_pred)
print("Train Root mean squared error: %.2f"
      % np.sqrt(mean_squared_error(y_train, y_pred)))
print("Train Mean absolute error: %.2f"
      % mean_absolute_error(y_train, y_pred))
print(f"Hodnota R-squared je: {r_squared}")

Test mean: 191.16168009205984, std: 182.65244995043582
Test Root mean squared error: 148.85
Test Mean absolute error: 102.29
Hodnota R-squared je: 0.33586240488200514
Train Root mean squared error: 148.56
Train Mean absolute error: 101.19
Hodnota R-squared je: 0.32681258877123


In [25]:
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
from sklearn.linear_model import LinearRegression
from sklearn.datasets import make_regression

# Výpočet reziduí
residuals = y_train - y_pred

from scipy.stats import shapiro

# Shapiro-Wilkův test
stat, p_value = shapiro(residuals)
print(f"Shapiro-Wilkův test: Statistika = {stat}, p-hodnota = {p_value}")

# Interpretace výsledků
if p_value > 0.05:
    print("Rezidua mají normální rozdělení (nelze zamítnout nulovou hypotézu).")
else:
    print("Rezidua nemají normální rozdělení (zamítáme nulovou hypotézu).")

Shapiro-Wilkův test: Statistika = 0.8892451603262843, p-hodnota = 3.266770924842595e-71
Rezidua nemají normální rozdělení (zamítáme nulovou hypotézu).


### Feature importance

Print coefficients of the linear regression model and decide which features are the most important.


In [26]:
print('Coefficients: \n', regr['linear'].coef_)

Coefficients: 
 [ 1.95349466e+01  8.03539178e+01  1.55303642e-01  7.56516670e+00
 -2.51541124e+01  1.82643866e+00  4.59872419e+00 -3.56677652e+00
  3.52531083e+01  2.81343326e+02 -2.00577249e+02  3.84699474e+01]
