# 09.04.01 - Regression

## Purpose

This notebook covers two regression algorithms, which may suit your class project better than classification does.  We'll use the weather data set this time.

## Libraries

* scikit-learn (sklearn)
* pandas

## References/Reading

* LinearRegression: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html
  * https://en.wikipedia.org/wiki/Linear_regression
* Ridge: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html
  * https://en.wikipedia.org/wiki/Ridge_regression

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge
from sklearn.linear_model import LinearRegression

# Part 1: Prepare the data
We've seen all of this already, so we'll do it all in one place.

In [2]:
weather = pd.read_csv("https://raw.githubusercontent.com/TheDarkTrumpet/BAIS-6040-0EXP-spr2021/master/data/weather.csv").dropna()

weather

Unnamed: 0,Date,Location,MinTemp,MaxTemp,Rainfall,Evaporation,Sunshine,WindGustDir,WindGustSpeed,WindDir9am,...,Humidity3pm,Pressure9am,Pressure3pm,Cloud9am,Cloud3pm,Temp9am,Temp3pm,RainToday,RISK_MM,RainTomorrow
0,2007-11-01,Canberra,8.0,24.3,0.0,3.4,6.3,NW,30.0,SW,...,29,1019.7,1015.0,7,7,14.4,23.6,No,3.6,Yes
1,2007-11-02,Canberra,14.0,26.9,3.6,4.4,9.7,ENE,39.0,E,...,36,1012.4,1008.4,5,3,17.5,25.7,Yes,3.6,Yes
2,2007-11-03,Canberra,13.7,23.4,3.6,5.8,3.3,NW,85.0,N,...,69,1009.5,1007.2,8,7,15.4,20.2,Yes,39.8,Yes
3,2007-11-04,Canberra,13.3,15.5,39.8,7.2,9.1,NW,54.0,WNW,...,56,1005.5,1007.0,2,7,13.5,14.1,Yes,2.8,Yes
4,2007-11-05,Canberra,7.6,16.1,2.8,5.6,10.6,SSE,50.0,SSE,...,49,1018.3,1018.5,7,7,11.1,15.4,Yes,0.0,No
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
361,2008-10-27,Canberra,9.0,30.7,0.0,7.6,12.1,NNW,76.0,SSE,...,15,1016.1,1010.8,1,3,20.4,30.0,No,0.0,No
362,2008-10-28,Canberra,7.1,28.4,0.0,11.6,12.7,N,48.0,NNW,...,22,1020.0,1016.9,0,1,17.2,28.2,No,0.0,No
363,2008-10-29,Canberra,12.5,19.9,0.0,8.4,5.3,ESE,43.0,ENE,...,47,1024.0,1022.8,3,2,14.5,18.3,No,0.0,No
364,2008-10-30,Canberra,12.5,26.9,0.0,5.0,7.1,NW,46.0,SSW,...,39,1021.0,1016.2,6,7,15.8,25.9,No,0.0,No


In [3]:
columns = ["MinTemp", "MaxTemp", "Sunshine", "Humidity3pm"]
target = "Rainfall"

X=weather[columns]
y=weather[target]

X

Unnamed: 0,MinTemp,MaxTemp,Sunshine,Humidity3pm
0,8.0,24.3,6.3,29
1,14.0,26.9,9.7,36
2,13.7,23.4,3.3,69
3,13.3,15.5,9.1,56
4,7.6,16.1,10.6,49
...,...,...,...,...
361,9.0,30.7,12.1,15
362,7.1,28.4,12.7,22
363,12.5,19.9,5.3,47
364,12.5,26.9,7.1,39


In [4]:
y

0       0.0
1       3.6
2       3.6
3      39.8
4       2.8
       ... 
361     0.0
362     0.0
363     0.0
364     0.0
365     0.0
Name: Rainfall, Length: 328, dtype: float64

In [5]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)

# Modeling with Linear Regression

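See the Wikipedia article for how this works, but we're essentially fitting a line (a hyperplane, when there are several features) to our data set, chosen to minimize the sum of squared distances between the observed target values and the line's predictions.  It is a very popular starting algorithm in this space.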
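
To make "fitting a line" concrete, here is a minimal sketch on synthetic data. The data and the names `X_demo`, `y_demo`, and `demo_lr` are made up for illustration and are not part of the weather set. It solves the least-squares problem directly with NumPy and checks that `LinearRegression` arrives at the same intercept and coefficients.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (an assumption for illustration): y = 2*x1 - 3*x2 + 5 + noise
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 2))
y_demo = 2 * X_demo[:, 0] - 3 * X_demo[:, 1] + 5 + rng.normal(scale=0.1, size=50)

# Ordinary least squares by hand: add a column of ones for the intercept,
# then find the coefficients that minimize the sum of squared residuals.
A = np.column_stack([np.ones(len(X_demo)), X_demo])
beta, *_ = np.linalg.lstsq(A, y_demo, rcond=None)

# The same fit using scikit-learn
demo_lr = LinearRegression().fit(X_demo, y_demo)

print("by hand:", beta[0], beta[1:])
print("sklearn:", demo_lr.intercept_, demo_lr.coef_)
```

Both lines should print essentially the same numbers, close to the intercept 5 and slopes 2 and -3 used to generate the data.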

In [6]:
lr = LinearRegression()
lr

LinearRegression()

In [7]:
lr.fit(X_train, y_train)

LinearRegression()

In [8]:
lr.score(X_train, y_train)  # What's our score with the training data set?

0.1585020526818326

In [9]:
lr.score(X_test, y_test)    # What's our score with the test data set?

0.22861720735022106

## LR Metrics

In [10]:
import math
from sklearn.metrics import explained_variance_score, mean_absolute_error, r2_score, mean_squared_error

def printMetrics(test, predictions):
    """Print explained variance ("Score"), MAE, RMSE, and R^2 for the predictions."""
    print(f"Score: {explained_variance_score(test, predictions):.2f}")
    print(f"MAE: {mean_absolute_error(test, predictions):.2f}")
    print(f"RMSE: {math.sqrt(mean_squared_error(test, predictions)):.2f}")
    print(f"r2: {r2_score(test, predictions):.2f}")

In [11]:
predictions = lr.predict(X_test)
printMetrics(y_test, predictions)

Score: 0.23
MAE: 1.74
RMSE: 3.03
r2: 0.23
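
Note that "Score" (explained variance) and "r2" come out the same above. As a sketch of what each metric computes, the following recomputes them by hand on a small made-up example and compares the results against scikit-learn. The arrays `y_true` and `y_pred` are illustrative and not taken from the weather data. The two measures coincide whenever the residuals average to zero, and differ (as in this toy example) when the predictions are biased.

```python
import math
import numpy as np
from sklearn.metrics import (explained_variance_score, mean_absolute_error,
                             mean_squared_error, r2_score)

# Made-up values purely for illustration
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])
residuals = y_true - y_pred

mae = np.mean(np.abs(residuals))             # average absolute error
rmse = math.sqrt(np.mean(residuals ** 2))    # square root of the mean squared error
# R^2: 1 - (sum of squared residuals) / (total sum of squares around the mean)
r2 = 1 - np.sum(residuals ** 2) / np.sum((y_true - y_true.mean()) ** 2)
# Explained variance: like R^2, but ignores any constant offset in the residuals
explained_var = 1 - np.var(residuals) / np.var(y_true)

print(f"MAE:   by hand {mae:.4f}, sklearn {mean_absolute_error(y_true, y_pred):.4f}")
print(f"RMSE:  by hand {rmse:.4f}, sklearn {math.sqrt(mean_squared_error(y_true, y_pred)):.4f}")
print(f"r2:    by hand {r2:.4f}, sklearn {r2_score(y_true, y_pred):.4f}")
print(f"Score: by hand {explained_var:.4f}, sklearn {explained_variance_score(y_true, y_pred):.4f}")
```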


## Predict some new samples

Let's define a few new samples.  You can do this as a list, and pass that in, or as a DataFrame.  We'll do it via a DataFrame only, but will randomly generate our sample weather observations.
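
As a small sketch of the two input options mentioned above, the following fits a model on a toy DataFrame (made-up numbers, hypothetical `toy_*` names) and predicts from both a DataFrame and a plain list of lists. The list must follow the same column order used in training. Recent scikit-learn versions may also emit a feature-name warning when a model fitted on a DataFrame is given a plain list.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy training data, made up for illustration only
toy_columns = ["MinTemp", "MaxTemp"]
toy_X = pd.DataFrame({"MinTemp": [5, 8, 12, 15, 10],
                      "MaxTemp": [18, 22, 30, 28, 25]})
toy_y = [0.0, 1.2, 0.0, 4.5, 2.0]
toy_lr = LinearRegression().fit(toy_X, toy_y)

# The same two new samples, once as a list of rows and once as a DataFrame
new_as_list = [[7, 20], [14, 29]]
new_as_frame = pd.DataFrame(new_as_list, columns=toy_columns)

pred_from_frame = toy_lr.predict(new_as_frame)
pred_from_list = toy_lr.predict(new_as_list)  # may warn about missing feature names

print(pred_from_frame)
print(pred_from_list)
```

Both calls should return the same predictions; the DataFrame form has the advantage that the column names travel with the data.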

In [12]:
import random as rnd
rnd.seed(1024)

In [13]:
numElements = 3
sampleWeather = []
for _ in range(numElements):
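    sample = {}  # one generated observation (avoids shadowing the built-in dict)
    for column in X.columns:
        minValue = 0  # the lower bound is fixed at 0 for every column
        maxValue = round(max(weather[column].values))
        sample[column] = rnd.randint(minValue, maxValue)
    sampleWeather.append(sample)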
sampleWeather

[{'MinTemp': 0, 'MaxTemp': 30, 'Sunshine': 13, 'Humidity3pm': 49},
 {'MinTemp': 10, 'MaxTemp': 33, 'Sunshine': 1, 'Humidity3pm': 56},
 {'MinTemp': 16, 'MaxTemp': 23, 'Sunshine': 12, 'Humidity3pm': 92}]

In [14]:
pdSampleWeather = pd.DataFrame.from_dict(sampleWeather)
pdSampleWeather

Unnamed: 0,MinTemp,MaxTemp,Sunshine,Humidity3pm
0,0,30,13,49
1,10,33,1,56
2,16,23,12,92


In [15]:
predictions = lr.predict(pdSampleWeather)
predictions

array([-2.07033737, -2.26829449,  8.25422513])

In [16]:
pdPredictedWeather = pdSampleWeather.copy()
pdPredictedWeather['Predicted'] = predictions
pdPredictedWeather

Unnamed: 0,MinTemp,MaxTemp,Sunshine,Humidity3pm,Predicted
0,0,30,13,49,-2.070337
1,10,33,1,56,-2.268294
2,16,23,12,92,8.254225
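
Two of the predictions above are negative, which cannot happen for rainfall; a linear model has no notion that the target is bounded at zero. One simple option, shown here as a sketch rather than something this notebook itself does, is to clip predictions at 0 before reporting them. The values below are copied from the output above, and the names `raw_predictions` and `clipped_predictions` are illustrative.

```python
import numpy as np

# Predictions copied from the LinearRegression output above
raw_predictions = np.array([-2.07033737, -2.26829449, 8.25422513])

# Replace anything below 0 with 0; leave the other values untouched
clipped_predictions = np.clip(raw_predictions, 0, None)
print(clipped_predictions)
```

Clipping only tidies the output. It does not make the underlying model any better at predicting dry days.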


# Modeling with Ridge Regression

In [17]:
rr = Ridge(solver="svd")
rr

Ridge(solver='svd')

In [18]:
rr.fit(X_train, y_train)

Ridge(solver='svd')

In [19]:
rr.score(X_train, y_train)

0.15850199803975518

In [20]:
rr.score(X_test, y_test)

0.2286000266839302

In [21]:
predictions = rr.predict(X_test)
printMetrics(y_test, predictions)

Score: 0.23
MAE: 1.74
RMSE: 3.03
r2: 0.23
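
These numbers are almost identical to the LinearRegression results. Ridge minimizes the same squared error plus an L2 penalty on the coefficients, scaled by `alpha` (default 1.0), and with this data the default penalty appears too small to change the fit noticeably. The sketch below uses synthetic data (an assumption for illustration, not the weather set) to show how the coefficients shrink toward zero as `alpha` grows.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic data, made up for illustration only
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 4))
true_coef = np.array([1.5, -2.0, 0.5, 0.0])
y_demo = X_demo @ true_coef + rng.normal(scale=0.5, size=100)

alphas = [0.01, 1.0, 100.0, 10000.0]
coef_norms = []
for alpha in alphas:
    model = Ridge(alpha=alpha, solver="svd").fit(X_demo, y_demo)
    coef_norms.append(np.linalg.norm(model.coef_))  # overall size of the coefficients
    print(f"alpha={alpha:>8}: coef={np.round(model.coef_, 3)}")
```

With a tiny `alpha` the coefficients should sit close to an ordinary least-squares fit; with a very large one they are pushed nearly to zero.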


In [22]:
pdSampleWeather

Unnamed: 0,MinTemp,MaxTemp,Sunshine,Humidity3pm
0,0,30,13,49
1,10,33,1,56
2,16,23,12,92


In [23]:
predictions = rr.predict(pdSampleWeather)
predictions

array([-2.06679002, -2.26305673,  8.25172236])

In [24]:
pdPredictedWeather = pdSampleWeather.copy()
pdPredictedWeather['Predicted'] = predictions
pdPredictedWeather


Unnamed: 0,MinTemp,MaxTemp,Sunshine,Humidity3pm,Predicted
0,0,30,13,49,-2.06679
1,10,33,1,56,-2.263057
2,16,23,12,92,8.251722
