
2020-7-22

In [None]:
import pandas as pd
import numpy as np

# Load data

Data source: https://archive.ics.uci.edu/ml/datasets/bike+sharing+dataset

In [None]:
!pip install googledrivedownloader #black magic
from google_drive_downloader import GoogleDriveDownloader as gdd
gdd.download_file_from_google_drive(file_id="1AMhUz8pbzu0PXtw-VsORYcXULBJ7nGP3",
                                    dest_path="./bikes_sharing.csv",
                                    unzip=False)

Downloading 1AMhUz8pbzu0PXtw-VsORYcXULBJ7nGP3 into ./bikes_sharing.csv... Done.


In [None]:
df = pd.read_csv('bikes_sharing.csv', header = 0, sep = ',')
df

Unnamed: 0,datetime,season,holiday,workingday,weather,temp,atemp,humidity,windspeed,casual,registered,count
0,2011-01-01 00:00:00,1,0,0,1,9.84,14.395,81,0.0000,3,13,16
1,2011-01-01 01:00:00,1,0,0,1,9.02,13.635,80,0.0000,8,32,40
2,2011-01-01 02:00:00,1,0,0,1,9.02,13.635,80,0.0000,5,27,32
3,2011-01-01 03:00:00,1,0,0,1,9.84,14.395,75,0.0000,3,10,13
4,2011-01-01 04:00:00,1,0,0,1,9.84,14.395,75,0.0000,0,1,1
...,...,...,...,...,...,...,...,...,...,...,...,...
10881,2012-12-19 19:00:00,4,0,1,1,15.58,19.695,50,26.0027,7,329,336
10882,2012-12-19 20:00:00,4,0,1,1,14.76,17.425,57,15.0013,10,231,241
10883,2012-12-19 21:00:00,4,0,1,1,13.94,15.910,61,15.0013,4,164,168
10884,2012-12-19 22:00:00,4,0,1,1,13.94,17.425,61,6.0032,12,117,129


# Simple linear regression

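Simple linear regression uses a single explanatory variable (x) to predict the response variable (y). The model has the form y = ax + b, where a is the slope (the coefficient) and b is the intercept.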
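To make the formula concrete, here is a minimal sketch that estimates a and b with the closed-form least-squares formulas. The toy arrays `x` and `y` are made up for illustration and are not taken from the bike-sharing data.

```python
import numpy as np

# Toy data (hypothetical): y is exactly 2*x + 1
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0

# Least-squares estimates for y = a*x + b
# slope = covariance(x, y) / variance(x); intercept follows from the means
a = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b = y.mean() - a * x.mean()

print(a, b)  # slope and intercept recovered from the toy data
```

sklearn's LinearRegression solves the same least-squares problem, so on this toy data it should report the same slope and intercept.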

In [None]:
df["temp"] #let's use this as our predictor (X)

0         9.84
1         9.02
2         9.02
3         9.84
4         9.84
         ...  
10881    15.58
10882    14.76
10883    13.94
10884    13.94
10885    13.12
Name: temp, Length: 10886, dtype: float64

In [None]:
df["temp"].shape #this is a 1-D series

(10886,)

In [None]:
type(df["temp"]) #it's a pandas series

pandas.core.series.Series

In [None]:
df["temp"].values #this is a 1-D array

array([ 9.84,  9.02,  9.02, ..., 13.94, 13.94, 13.12])

In [None]:
type(df["temp"].values) #it's a numpy array

numpy.ndarray

In [None]:
df["temp"].values.reshape(-1,1) #this is the format we would need for X in sklearn (a 2-D numpy array)

array([[ 9.84],
       [ 9.02],
       [ 9.02],
       ...,
       [13.94],
       [13.94],
       [13.12]])
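A small sketch of what reshape(-1,1) does, using a short array with the first few temp values typed in by hand. The -1 tells NumPy to infer that dimension from the array length, so (-1, 1) means "as many rows as needed, exactly one column". The names `arr`, `col` and `row` are only for illustration.

```python
import numpy as np

# First four "temp" values, typed in by hand for illustration
arr = np.array([9.84, 9.02, 9.02, 9.84])

col = arr.reshape(-1, 1)  # one column: 4 samples, 1 feature
row = arr.reshape(1, -1)  # one row: 1 sample, 4 features

print(arr.shape)  # (4,)
print(col.shape)  # (4, 1)
print(row.shape)  # (1, 4)
```

For a single predictor, the column layout (4, 1) is the one sklearn needs, because each row is treated as one sample.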

In [None]:
from sklearn.linear_model import LinearRegression
# df["temp"] is a 1-D pandas Series, but sklearn expects X to be a 2-D array
# (each row is a sample, each column is a feature), so fit() raises a
# ValueError that tells you to use reshape(-1, 1).
X = df["temp"]
y = df["count"] 
model = LinearRegression().fit(X, y)
print(model.coef_)
print(model.intercept_)

ValueError: ignored

In [None]:
from sklearn.linear_model import LinearRegression
# df["temp"].values is a 1-D numpy array, so it is still the wrong shape
# and fit() raises the same ValueError suggesting reshape(-1, 1).
X = df["temp"].values
y = df["count"]
model = LinearRegression().fit(X, y)
print(model.coef_)
print(model.intercept_)

ValueError: ignored

In [None]:
from sklearn.linear_model import LinearRegression
# After reshape(-1, 1) this is a 2-D numpy array with one column,
# which is the format sklearn expects for X.
X = df["temp"].values.reshape(-1, 1)
y = df["count"]
model = LinearRegression().fit(X, y)
print(model.coef_)
print(model.intercept_)

[9.17054048]
6.046212959616611
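The failure and the fix can be reproduced without the dataset. This is a sketch on made-up toy arrays (`X_1d`, `y_toy` and `model_toy` are illustrative names), assuming a reasonably recent sklearn version that validates the shape of X inside fit().

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data (hypothetical): y_toy is exactly 2*x + 1
X_1d = np.array([1.0, 2.0, 3.0, 4.0])
y_toy = np.array([3.0, 5.0, 7.0, 9.0])

# Fitting on the 1-D array fails with a ValueError
try:
    LinearRegression().fit(X_1d, y_toy)
    failed = False
except ValueError as err:
    failed = True
    print("ValueError:", str(err).splitlines()[0])

# Reshaping to one column (4 samples, 1 feature) fixes it
model_toy = LinearRegression().fit(X_1d.reshape(-1, 1), y_toy)
print(model_toy.coef_, model_toy.intercept_)
```

Note that y can stay 1-D; only X needs the 2-D samples-by-features layout.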


In [None]:
from sklearn.linear_model import LinearRegression
# Double brackets return a 2-D pandas DataFrame with one column, which sklearn
# also accepts as X. This is usually my preferred way of defining X because
# it's simpler.
X = df[["temp"]]
y = df["count"]
model = LinearRegression().fit(X, y)
print(model.coef_)
print(model.intercept_)

[9.17054048]
6.046212959616611


In [None]:
# As we can see, this is a DataFrame with a single column:
# each row is a data sample, and the single column is the single feature.
df[["temp"]]

Unnamed: 0,temp
0,9.84
1,9.02
2,9.02
3,9.84
4,9.84
...,...
10881,15.58
10882,14.76
10883,13.94
10884,13.94


# Multiple linear regression

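Linear regression can use multiple features (dimensions) at the same time. With two features x and y, for example, the model has the form z = ax + by + c, and each additional feature adds one more coefficient.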

In [None]:
from sklearn.linear_model import LinearRegression
# This is a 2-D pandas DataFrame that includes several features
# as predictors at the same time.
X = df[["temp", "humidity", "windspeed"]]
y = df["count"]
model = LinearRegression().fit(X, y)
print(model.coef_)
print(model.intercept_)

[ 8.74290707 -2.70814662  0.36417728]
177.63396231909968


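The coefficient array contains 3 numbers because we now have 3 features (3 dimensions) as our predictors: one coefficient per feature, in the same order as the columns of X.

The intercept is still a single number. It is the predicted value of the response variable when all three features are 0, that is, the point where the fitted 3-D hyperplane crosses the 4th axis (our response variable).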
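As a sketch of how the coefficients and the intercept combine, the prediction for each row is the dot product of its features with coef_, plus intercept_. The data below is randomly generated for illustration (`X_toy`, `y_toy` and `model_toy` are hypothetical names), with the target built from known weights so the result is easy to check.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data (hypothetical): 20 samples, 3 features,
# target = 1*x1 + 2*x2 + 3*x3 + 4 with no noise
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(20, 3))
y_toy = X_toy @ np.array([1.0, 2.0, 3.0]) + 4.0

model_toy = LinearRegression().fit(X_toy, y_toy)

# One coefficient per feature, one intercept overall
print(model_toy.coef_.shape)  # (3,)

# predict() is equivalent to X @ coef_ + intercept_
manual = X_toy @ model_toy.coef_ + model_toy.intercept_
print(np.allclose(manual, model_toy.predict(X_toy)))

# With all features at 0, the prediction is just the intercept
zero_pred = model_toy.predict(np.zeros((1, 3)))
print(zero_pred, model_toy.intercept_)
```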

In [None]:
df[["temp", "humidity", "windspeed"]] #this is our X, a dataframe with 3 columns (3 features)

Unnamed: 0,temp,humidity,windspeed
0,9.84,81,0.0000
1,9.02,80,0.0000
2,9.02,80,0.0000
3,9.84,75,0.0000
4,9.84,75,0.0000
...,...,...,...
10881,15.58,50,26.0027
10882,14.76,57,15.0013
10883,13.94,61,15.0013
10884,13.94,61,6.0032


# Conclusion:

df["temp"].values.reshape(-1,1) has the same shape and the same values as df[["temp"]]; the only difference is that the former is a numpy array and the latter is a pandas dataframe. So the reason we need df["temp"].values.reshape(-1,1) for our X (in the case of simple linear regression) is really just to get the same layout as df[["temp"]], where each row is a data sample and each column is a feature.
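This can be checked directly. The sketch below uses a tiny hand-made dataframe (`toy`, with a few values copied from the first rows shown earlier) rather than the full dataset.

```python
import numpy as np
import pandas as pd

# Tiny stand-in for the real dataframe (values typed in by hand)
toy = pd.DataFrame({"temp": [9.84, 9.02, 9.02], "count": [16, 40, 32]})

as_array = toy["temp"].values.reshape(-1, 1)  # 2-D numpy array
as_frame = toy[["temp"]]                      # 2-D pandas dataframe

print(type(as_array).__name__, as_array.shape)  # ndarray (3, 1)
print(type(as_frame).__name__, as_frame.shape)  # DataFrame (3, 1)

# Same shape and same values, just different container types
same = np.array_equal(as_array, as_frame.to_numpy())
print(same)  # True
```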

The reason we want the columns (instead of the rows) of a 2-D array to be our features boils down to the convention for storing data in a dataframe (the same as in Excel tables etc.): we normally use the rows to record data samples, and the columns to record the attributes of those samples.