# 0104 Regression Models

Supervised learning covers the methods that train a machine learning model on labelled data. The result is a model that, given a set of features $X$ (the same features used to train it), predicts a value $\hat{y}$ for the target variable:

$$
\hat{y} = f(X)
$$

In general $X = \{x_1, \dots, x_n\}$ is a set of features such that

$$
\hat{y} = f(x_1, \dots , x_n)
$$
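To make $\hat{y} = f(x_1, \dots, x_n)$ concrete, here is a minimal sketch that learns a linear $f$ from labelled data with ordinary least squares. The toy data, the helper names `fit_linear` and `f`, and the choice of a linear model are all assumptions made for illustration; they are not part of the bike-share workflow that follows.

```python
import numpy as np

# Toy labelled data (made up for illustration): two features per observation.
X = np.array([[1.0, 2.0],
              [2.0, 0.0],
              [3.0, 1.0],
              [4.0, 3.0]])
# Targets generated from a known rule, y = 3*x1 + 2*x2 + 1, so the fit can be checked.
y = 3.0 * X[:, 0] + 2.0 * X[:, 1] + 1.0


def fit_linear(X, y):
    """Return least-squares coefficients; the last entry is the intercept."""
    X_aug = np.column_stack([X, np.ones(len(X))])  # append a column of ones for the intercept
    coef, *_ = np.linalg.lstsq(X_aug, y, rcond=None)
    return coef


def f(X, coef):
    """Predict y_hat for each row of X using the fitted coefficients."""
    return X @ coef[:-1] + coef[-1]


coef = fit_linear(X, y)
y_hat = f(np.array([[5.0, 1.0]]), coef)  # prediction for an unseen observation
print(coef, y_hat)
```

Because the targets follow the rule exactly, the fitted coefficients should come out close to 3, 2 and 1, and the prediction for the new observation close to 18.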

In supervised ML there are two main classes of algorithms:
* Regression - the goal of the model is to predict a numeric value
* Classification - the goal of the model is to predict a class

Here we will focus on regression methods.

In [2]:
import pandas as pd

#!wget https://raw.githubusercontent.com/MicrosoftDocs/mslearn-introduction-to-machine-learning/main/Data/ml-basics/daily-bike-share.csv
bike_data = pd.read_csv('/home/giacomo_lini/MLOps/Coursera/azure/Azure_ML_Cloud/01_Create_ML_Models_in_Azure/data/daily-bike-share.csv')
bike_data.head()

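| | instant | dteday | season | yr | mnth | holiday | weekday | workingday | weathersit | temp | atemp | hum | windspeed | rentals |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1/1/2011 | 1 | 0 | 1 | 0 | 6 | 0 | 2 | 0.344167 | 0.363625 | 0.805833 | 0.160446 | 331 |
| 1 | 2 | 1/2/2011 | 1 | 0 | 1 | 0 | 0 | 0 | 2 | 0.363478 | 0.353739 | 0.696087 | 0.248539 | 131 |
| 2 | 3 | 1/3/2011 | 1 | 0 | 1 | 0 | 1 | 1 | 1 | 0.196364 | 0.189405 | 0.437273 | 0.248309 | 120 |
| 3 | 4 | 1/4/2011 | 1 | 0 | 1 | 0 | 2 | 1 | 1 | 0.2 | 0.212122 | 0.590435 | 0.160296 | 108 |
| 4 | 5 | 1/5/2011 | 1 | 0 | 1 | 0 | 3 | 1 | 1 | 0.226957 | 0.22927 | 0.436957 | 0.1869 | 82 |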


The table contains the following columns:

* instant - unique row identifier
* dteday - date of the observation
* season - numerically encoded season (1:winter, 2:spring, 3:summer, 4:fall)
* yr - year of the study (0: 2011, 1: 2012)
* mnth - month (1: January to 12: December)
* holiday - binary value (1 if the day was a holiday)
* weekday - day of the week (0: Sunday to 6: Saturday)
* workingday - binary value (1 if the day was a working day)
* weathersit - categorical value indicating the weather situation (1:clear, 2:mist/cloud, 3:light rain/snow, 4:heavy rain/hail/snow/fog)
* temp - temperature in Celsius (normalized)
* atemp - "feels like" temperature (normalized)
* hum - humidity level (normalized)
* windspeed - wind speed (normalized)
* rentals - target variable, number of rentals on that day
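The encoded columns are easier to read during exploration once the codes are mapped back to labels. The sketch below uses the code-to-label pairs from the list above. The `sample` frame is a made-up stand-in for `bike_data`, and the column names `season_name` and `weather_name` are hypothetical.

```python
import pandas as pd

# Code-to-label mappings copied from the column descriptions above.
season_labels = {1: 'winter', 2: 'spring', 3: 'summer', 4: 'fall'}
weathersit_labels = {
    1: 'clear',
    2: 'mist/cloud',
    3: 'light rain/snow',
    4: 'heavy rain/hail/snow/fog',
}

# Tiny stand-in frame (made-up rows) with the same column names as bike_data.
sample = pd.DataFrame({
    'season': [1, 1, 3],
    'weathersit': [2, 1, 3],
    'rentals': [331, 120, 900],
})

# assign() returns a new frame, so the encoded columns stay untouched for modelling.
decoded = sample.assign(
    season_name=sample['season'].map(season_labels),
    weather_name=sample['weathersit'].map(weathersit_labels),
)
print(decoded)
```

Keeping the numeric codes alongside the labels means the original columns can still be passed to a model later, while the labelled columns serve plots and summaries.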

In [3]:
# add a column for the day of the month (1 to 31)

bike_data['day'] = pd.DatetimeIndex(bike_data['dteday']).day
bike_data.head(32)

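| | instant | dteday | season | yr | mnth | holiday | weekday | workingday | weathersit | temp | atemp | hum | windspeed | rentals | day |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1/1/2011 | 1 | 0 | 1 | 0 | 6 | 0 | 2 | 0.344167 | 0.363625 | 0.805833 | 0.160446 | 331 | 1 |
| 1 | 2 | 1/2/2011 | 1 | 0 | 1 | 0 | 0 | 0 | 2 | 0.363478 | 0.353739 | 0.696087 | 0.248539 | 131 | 2 |
| 2 | 3 | 1/3/2011 | 1 | 0 | 1 | 0 | 1 | 1 | 1 | 0.196364 | 0.189405 | 0.437273 | 0.248309 | 120 | 3 |
| 3 | 4 | 1/4/2011 | 1 | 0 | 1 | 0 | 2 | 1 | 1 | 0.2 | 0.212122 | 0.590435 | 0.160296 | 108 | 4 |
| 4 | 5 | 1/5/2011 | 1 | 0 | 1 | 0 | 3 | 1 | 1 | 0.226957 | 0.22927 | 0.436957 | 0.1869 | 82 | 5 |
| 5 | 6 | 1/6/2011 | 1 | 0 | 1 | 0 | 4 | 1 | 1 | 0.204348 | 0.233209 | 0.518261 | 0.089565 | 88 | 6 |
| 6 | 7 | 1/7/2011 | 1 | 0 | 1 | 0 | 5 | 1 | 2 | 0.196522 | 0.208839 | 0.498696 | 0.168726 | 148 | 7 |
| 7 | 8 | 1/8/2011 | 1 | 0 | 1 | 0 | 6 | 0 | 2 | 0.165 | 0.162254 | 0.535833 | 0.266804 | 68 | 8 |
| 8 | 9 | 1/9/2011 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0.138333 | 0.116175 | 0.434167 | 0.36195 | 54 | 9 |
| 9 | 10 | 1/10/2011 | 1 | 0 | 1 | 0 | 1 | 1 | 1 | 0.150833 | 0.150888 | 0.482917 | 0.223267 | 41 | 10 |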
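The cell above lets pandas infer the date format. As a hedged alternative, the sketch below parses `dteday` with an explicit format, assuming every date is written month/day/year as in the rows shown. The `sample` frame and the extra `month_check` column are hypothetical and exist only to illustrate the idea.

```python
import pandas as pd

# Made-up stand-in for bike_data['dteday'], using the month/day/year layout seen above.
sample = pd.DataFrame({'dteday': ['1/1/2011', '1/2/2011', '2/15/2011']})

# An explicit format avoids ambiguity between day-first and month-first dates.
parsed = pd.to_datetime(sample['dteday'], format='%m/%d/%Y')

sample['day'] = parsed.dt.day            # day of the month, 1 to 31
sample['month_check'] = parsed.dt.month  # can be compared with the existing mnth column
print(sample)
```

Comparing the parsed month with the `mnth` column is a cheap way to confirm that the dates were read in the intended order.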


In [7]:
# take a look at the truly numeric (non-encoded) features

numeric_features = ['temp', 'atemp', 'hum', 'windspeed']

bike_data[numeric_features+['rentals']].describe()


| | temp | atemp | hum | windspeed | rentals |
|---|---|---|---|---|---|
| count | 731.0 | 731.0 | 731.0 | 731.0 | 731.0 |
| mean | 0.495385 | 0.474354 | 0.627894 | 0.190486 | 848.176471 |
| std | 0.183051 | 0.162961 | 0.142429 | 0.077498 | 686.622488 |
| min | 0.05913 | 0.07907 | 0.0 | 0.022392 | 2.0 |
| 25% | 0.337083 | 0.337842 | 0.52 | 0.13495 | 315.5 |
| 50% | 0.498333 | 0.486733 | 0.626667 | 0.180975 | 713.0 |
| 75% | 0.655417 | 0.608602 | 0.730209 | 0.233214 | 1096.0 |
| max | 0.861667 | 0.840896 | 0.9725 | 0.507463 | 3410.0 |
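In the summary above, the mean of `rentals` (about 848) sits above its median (713), which hints at a right-skewed distribution. A minimal sketch of how that check could be made explicit, using a made-up series rather than the real data:

```python
import pandas as pd

# Made-up rental counts with a long right tail, for illustration only.
rentals = pd.Series([50, 120, 300, 700, 1100, 3400], name='rentals')

summary = rentals.describe()
# A mean well above the median (the 50% row) suggests a right-skewed distribution.
mean_minus_median = summary['mean'] - summary['50%']
# Sample skewness; positive values indicate a longer right tail.
skewness = rentals.skew()

print(summary)
print(mean_minus_median, skewness)
```

Running the same two lines on `bike_data['rentals']` would quantify the skew in the actual target.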


In [14]:
# how many missing (NaN) values are in each column?

bike_data.isnull().sum()

instant       0
dteday        0
season        0
yr            0
mnth          0
holiday       0
weekday       0
workingday    0
weathersit    0
temp          0
atemp         0
hum           0
windspeed     0
rentals       0
day           0
dtype: int64
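The dataset has no missing values, so no cleaning is needed here. For comparison, the sketch below shows what the same check looks like when gaps are present, and one possible way to handle them. The `sample` frame is made up, and dropping incomplete rows is only one option, not a recommendation for this dataset.

```python
import numpy as np
import pandas as pd

# Made-up frame with two deliberately missing values.
sample = pd.DataFrame({
    'temp': [0.34, np.nan, 0.20],
    'hum': [0.81, 0.70, np.nan],
    'rentals': [331, 131, 120],
})

missing_per_column = sample.isnull().sum()      # same check as in the cell above
total_missing = int(missing_per_column.sum())   # single number for a quick sanity check

# One simple strategy: keep only the rows with no missing values.
complete_rows = sample.dropna()

print(missing_per_column)
print(total_missing, len(complete_rows))
```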