# Logistic Regression
## Problem
Given a dataset, predict whether it will rain tomorrow.
1. This is a classification problem.
2. To use regression, the response variable needs to be numeric.
3. One option is to treat the probability of rain tomorrow as the dependent variable.
4. However, a linear model's predicted probability can then be greater than 1 or less than 0.
5. Hence, use the log-odds of rain as the dependent variable instead (logistic regression).
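
The move from a probability to an unbounded dependent variable in the list above can be written as a short derivation. The symbols $p$ (probability of rain tomorrow), $\mathbf{x}$ (input features), $\mathbf{w}$ (weights) and $b$ (bias) are notation assumed here, not taken from the notebook.

$$
\begin{aligned}
\text{odds} &= \frac{p}{1-p}, \qquad p \in (0, 1) \\
\log\frac{p}{1-p} &= \mathbf{w}^\top \mathbf{x} + b \in (-\infty, \infty) \\
p &= \frac{1}{1 + e^{-(\mathbf{w}^\top \mathbf{x} + b)}}
\end{aligned}
$$

The last line is the sigmoid function: whatever value the linear part takes, the recovered probability stays strictly between 0 and 1.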

## Linear Regression vs. Logistic Regression
Logistic regression is suited to classification problems.
### Classification Problem
Assign each input to a class. The dataset contains the 'true' class for each input (hence supervised learning).
Use logistic regression to solve classification problems:
- Take a linear combination of the inputs
- Apply the sigmoid function to the result so that the output lies between 0 and 1
- Use cross-entropy as the loss function
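
The three steps above can be sketched in NumPy as a single forward pass. This is only an illustration: the inputs, weights `w` and bias `b` are made-up values, and the helper names `sigmoid` and `binary_cross_entropy` are hypothetical, not part of the notebook.

```python
import numpy as np

def sigmoid(z):
    """Map any real number to the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Average cross-entropy loss for binary labels (0 or 1)."""
    y_prob = np.clip(y_prob, eps, 1 - eps)  # avoid log(0)
    return -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))

# Made-up inputs: 3 samples with 2 features each
X = np.array([[0.5, 1.2], [-1.0, 0.3], [2.0, -0.7]])
y = np.array([1, 0, 1])
w = np.array([0.8, -0.4])  # hypothetical weights
b = 0.1                    # hypothetical bias

z = X @ w + b                      # 1. linear combination
p = sigmoid(z)                     # 2. squash into (0, 1)
loss = binary_cross_entropy(y, p)  # 3. compare with the true labels
```

The loss shrinks as the predicted probabilities move toward the true labels, which is what the optimizer exploits during training.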
### Regression Problem
Map each input to a continuous value.
Use linear regression to solve regression problems.

## ML Workflow
1. initialize model
2. pass input into model to obtain predictions
3. compare predictions with actual targets using a loss function
4. adjust the model parameters to reduce the loss (optimization)
5. repeat until model is considered to be good enough
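
As a rough sketch of this workflow, the loop below trains a logistic regression model with plain gradient descent. The synthetic data, the learning rate and the fixed step count are all assumptions made for illustration; the notebook itself may rely on a library implementation instead.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (made up): the label is 1 when the feature sum is positive
X = rng.normal(size=(200, 2))
y = (X.sum(axis=1) > 0).astype(float)

# 1. Initialize the model
w = np.zeros(2)
b = 0.0
learning_rate = 0.5
losses = []

for step in range(100):                     # 5. repeat
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # 2. predictions
    p_safe = np.clip(p, 1e-12, 1 - 1e-12)   # avoid log(0)
    # 3. cross-entropy loss between predictions and targets
    loss = -np.mean(y * np.log(p_safe) + (1 - y) * np.log(1 - p_safe))
    losses.append(loss)
    # 4. optimization: one gradient descent update
    w -= learning_rate * (X.T @ (p - y) / len(y))
    b -= learning_rate * np.mean(p - y)

# Accuracy of the predictions from the final iteration
accuracy = np.mean((p > 0.5) == (y == 1))
```

Here "good enough" is simply a fixed number of steps; in practice the stopping rule is usually based on the validation loss.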

## Load Data

In [2]:
# !pip install opendatasets --upgrade --quiet
import opendatasets as od

In [5]:
dataset_url = 'https://www.kaggle.com/jsphyg/weather-dataset-rattle-package'
od.download(dataset_url)

Downloading weather-dataset-rattle-package.zip to .\weather-dataset-rattle-package


To list the files in the Data folder:

In [6]:
import os

In [11]:
os.listdir('Data')

['medical.csv', 'weatherAUS.csv']

To load the data:

In [12]:
import pandas as pd

In [13]:
raw_df = pd.read_csv('Data/weatherAUS.csv')

Check the dimensions and basic info of the dataset:

In [14]:
raw_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 145460 entries, 0 to 145459
Data columns (total 23 columns):
 #   Column         Non-Null Count   Dtype  
---  ------         --------------   -----  
 0   Date           145460 non-null  object 
 1   Location       145460 non-null  object 
 2   MinTemp        143975 non-null  float64
 3   MaxTemp        144199 non-null  float64
 4   Rainfall       142199 non-null  float64
 5   Evaporation    82670 non-null   float64
 6   Sunshine       75625 non-null   float64
 7   WindGustDir    135134 non-null  object 
 8   WindGustSpeed  135197 non-null  float64
 9   WindDir9am     134894 non-null  object 
 10  WindDir3pm     141232 non-null  object 
 11  WindSpeed9am   143693 non-null  float64
 12  WindSpeed3pm   142398 non-null  float64
 13  Humidity9am    142806 non-null  float64
 14  Humidity3pm    140953 non-null  float64
 15  Pressure9am    130395 non-null  float64
 16  Pressure3pm    130432 non-null  float64
 17  Cloud9am       89572 non-null

Some columns have null values. RainTomorrow needs to be treated carefully because it is the target we want to predict. Filling in missing values for this field is not a good solution; it is better to treat null as its own class or simply drop those rows. The same applies to RainToday, as it is likely to be closely related to the response.

In [15]:
raw_df.dropna(subset=['RainToday', 'RainTomorrow'], inplace=True)

For other missing values, there are several options:
- Fill in the mean if the feature is roughly normally distributed
- Drop the rows with nulls if there are not many of them
- Check the feature's correlation with the target and leave the feature out of the model if it is weak
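
The three options can be sketched on a toy DataFrame. The values below are made up, and the choice of which column gets which treatment is purely illustrative, not a recommendation for `weatherAUS.csv`.

```python
import numpy as np
import pandas as pd

# Toy frame (made-up values) with gaps in every column
toy_df = pd.DataFrame({
    'MinTemp':  [10.0, np.nan, 14.0, 12.0],
    'Sunshine': [np.nan, np.nan, np.nan, 8.0],
    'Rainfall': [0.0, 1.2, np.nan, 0.4],
})

# Option 1: fill gaps with the column mean
filled = toy_df.copy()
filled['MinTemp'] = filled['MinTemp'].fillna(filled['MinTemp'].mean())

# Option 2: drop the rows where a mostly complete column is missing
dropped = toy_df.dropna(subset=['Rainfall'])

# Option 3: leave a feature out of the model entirely
reduced = toy_df.drop(columns=['Sunshine'])
```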

## Basic Analysis and Visualization

In [17]:
import plotly.express as px
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

sns.set_style('darkgrid')
matplotlib.rcParams['font.size'] = 14
matplotlib.rcParams['figure.figsize'] = (10, 6)
matplotlib.rcParams['figure.facecolor'] = '#00000000'

In [None]:
px.histogram(raw_df, x='Location', title='Location v.s. Rainy Days', color='RainToday')

In [None]:
px.histogram(raw_df, x='Temp3pm', title='Temperature at 3 pm vs. Rain Tomorrow', color='RainTomorrow')

In [None]:
px.histogram(raw_df, x='RainTomorrow', color='RainToday', title='Rain Tomorrow vs. Rain Today')

In [None]:
px.scatter(raw_df.sample(2000), title='Temp (3 pm) vs. Humidity (3 pm)', x='Temp3pm', y='Humidity3pm',
           color='RainTomorrow')

Location looks roughly uniformly distributed and Temp3pm roughly normally distributed. RainTomorrow seems strongly related to Location, Temp3pm, and RainToday, and Humidity3pm seems positively correlated with RainTomorrow.
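
A visual impression like the one above can be backed by a quick table. The sketch below uses made-up observations (not rows from `weatherAUS.csv`) to show how `pd.crosstab` gives the share of rainy next days within each RainToday group.

```python
import pandas as pd

# Made-up observations, not taken from weatherAUS.csv
toy_df = pd.DataFrame({
    'RainToday':    ['No', 'No', 'No', 'No', 'Yes', 'Yes', 'Yes', 'Yes'],
    'RainTomorrow': ['No', 'No', 'No', 'Yes', 'No', 'Yes', 'Yes', 'Yes'],
})

# Share of RainTomorrow outcomes within each RainToday group (each row sums to 1)
rain_rates = pd.crosstab(toy_df['RainToday'], toy_df['RainTomorrow'], normalize='index')
```

Running the same call on `raw_df` would show how different the two groups really are.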

In [None]:
px.histogram(raw_df, x='Date', title='RainTomorrow v.s. Date', color='RainToday')

RainTomorrow does not appear to be strongly correlated with Date.

In [None]:
px.scatter(raw_df.sample(2000), title='MinTemp vs. MaxTemp', x='MinTemp', y='MaxTemp',
           color='RainTomorrow')

When it rains the next day, MaxTemp tends to be lower.

In [None]:
px.histogram(raw_df, x='Rainfall', title='RainTomorrow v.s. Rainfall', color='RainToday')

Rainfall looks usable, but it needs further transformation because the plot is heavily squeezed around 0.
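
One common way to spread out a feature that piles up near 0 is a log transform. This is an assumption about what the further transformation could be, not something the notebook does; the rainfall amounts below are made up.

```python
import numpy as np
import pandas as pd

# Made-up rainfall amounts in mm: mostly zero with one extreme value
rainfall = pd.Series([0.0, 0.0, 0.2, 1.0, 5.0, 300.0], name='Rainfall')

# log1p(x) = log(1 + x) keeps 0 at 0 and compresses the long right tail
log_rainfall = np.log1p(rainfall)
```

The transform preserves the ordering of the values, so dry days stay at 0 while the extreme values are pulled much closer to the rest.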

In [None]:
px.histogram(raw_df, x='Evaporation', title='Evaporation v.s. Rain Tomorrow', color='RainToday')

In [None]:
px.histogram(raw_df, x='Sunshine', title='Sunshine v.s. Rain Tomorrow', color='RainToday')

Sunshine is expected to be highly correlated with the dependent variable.

In [None]:
sns.violinplot(data=raw_df, x='Sunshine', y='RainTomorrow')

In [None]:
sns.violinplot(data=raw_df, x='WindSpeed9am', y='RainTomorrow')

In [None]:
sns.violinplot(data=raw_df, x='WindSpeed3pm', y='RainTomorrow')

In [None]:
px.histogram(raw_df, x='WindDir9am', title='WindDir v.s. Rain Tomorrow', color='RainToday')

In [None]:
px.histogram(raw_df, x='WindDir3pm', title='WindDir v.s. Rain Tomorrow', color='RainToday')

In [None]:
px.scatter(raw_df.sample(2000), title='Humidity', x='Humidity9am', y='Humidity3pm',
           color='RainTomorrow')

Humidity at either time seems to have a reasonably strong correlation with RainTomorrow.

In [None]:
px.scatter(raw_df.sample(2000), title='Pressure', x='Pressure9am', y='Pressure3pm',
           color='RainTomorrow')

In [None]:
sns.violinplot(data=raw_df, x='Cloud9am', y='RainTomorrow')

In [None]:
sns.violinplot(data=raw_df, x='Cloud3pm', y='RainTomorrow')

Cloud cover also seems to be highly correlated with the target.

## Working with a Sample

In [59]:
use_sample = False

In [60]:
sample_fraction = 0.1

In [61]:
if use_sample:
    raw_df = raw_df.sample(frac=sample_fraction).copy()

## Training, Validation and Test Sets
- Training set: used to train the model
- Validation set: used to evaluate the model during training
- Test set: used to evaluate the model after training, to check whether it has overfit

In [62]:
from sklearn.model_selection import train_test_split

In [63]:
training_val_df, testing_df = train_test_split(raw_df, test_size=0.2, random_state=42)
training_df, val_df = train_test_split(training_val_df, test_size=0.25, random_state=42)
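
The two calls above are chained, so the second `test_size` is a fraction of what remains after the first split: 0.25 of the remaining 80% is 20% of the whole, giving a 60/20/20 split. A quick check on 100 made-up rows:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

toy_df = pd.DataFrame({'value': range(100)})  # 100 made-up rows

# First hold out 20% for testing, then 25% of the remainder for validation
train_val, test = train_test_split(toy_df, test_size=0.2, random_state=42)
train, val = train_test_split(train_val, test_size=0.25, random_state=42)

sizes = (len(train), len(val), len(test))
```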

When working with dates, it is often better to split the training, validation and test sets by time, so that the model trains on data from before the test period.

In [66]:
year = pd.to_datetime(raw_df.Date).dt.year

training_df = raw_df[year < 2015]
val_df = raw_df[year == 2015]
testing_df = raw_df[year > 2015]
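
The same year-based masks can be tried on a toy frame to confirm that every row lands in exactly one split and that the splits are ordered in time. The dates and rainfall values here are made up for illustration.

```python
import pandas as pd

# Made-up rows spanning several years
toy_df = pd.DataFrame({
    'Date': ['2013-05-01', '2014-11-20', '2015-03-15',
             '2015-12-31', '2016-07-04', '2017-01-09'],
    'Rainfall': [0.0, 2.4, 0.0, 10.2, 0.6, 0.0],
})

year = pd.to_datetime(toy_df['Date']).dt.year

train = toy_df[year < 2015]   # everything before 2015
val = toy_df[year == 2015]    # 2015 only
test = toy_df[year > 2015]    # everything after 2015

# Latest training date and earliest validation/test dates (ISO strings sort by time)
last_train, first_val, first_test = train['Date'].max(), val['Date'].min(), test['Date'].min()
```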

## Feature Filtering
For example, as noted above, the Date column can be ignored. The slice below drops the first column (Date) and the last column (RainTomorrow, the target).

In [72]:
x_col = list(training_df.columns)[1:-1]
y_col = 'RainTomorrow'

In [79]:
training_x = training_df[x_col].copy()
training_y = training_df[y_col].copy()

In [80]:
validating_x = val_df[x_col].copy()
validating_y = val_df[y_col].copy()

In [81]:
testing_x = testing_df[x_col].copy()
testing_y = testing_df[y_col].copy()

Determine if the columns are numerical or categorical:

In [82]:
import numpy as np

In [84]:
numeric_cols = training_x.select_dtypes(include=np.number).columns.tolist()
categorical_cols = training_x.select_dtypes('object').columns.tolist()
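
To see what the two `select_dtypes` calls return, here is a small sketch on a toy frame. The rows are illustrative values, and the text columns are built with an explicit `object` dtype (an assumption made so the example behaves the same across pandas versions).

```python
import numpy as np
import pandas as pd

# Toy frame with two text columns and two numeric columns (illustrative values)
toy_x = pd.DataFrame({
    'Location': pd.Series(['Albury', 'Sydney'], dtype=object),
    'MinTemp': [13.4, 17.2],
    'RainToday': pd.Series(['No', 'Yes'], dtype=object),
    'Cloud9am': [8, 2],
})

# Integer and float columns both count as numeric
numeric_cols = toy_x.select_dtypes(include=np.number).columns.tolist()
# Columns stored with the object dtype are treated as categorical
categorical_cols = toy_x.select_dtypes('object').columns.tolist()
```

Both lists keep the column order of the original frame.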

In [85]:
training_x[numeric_cols].describe()

| | MinTemp | MaxTemp | Rainfall | Evaporation | Sunshine | WindGustSpeed | WindSpeed9am | WindSpeed3pm | Humidity9am | Humidity3pm | Pressure9am | Pressure3pm | Cloud9am | Cloud3pm | Temp9am | Temp3pm |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 97674.0 | 97801.0 | 97988.0 | 61657.0 | 57942.0 | 91160.0 | 97114.0 | 96919.0 | 96936.0 | 96872.0 | 88876.0 | 88857.0 | 63000.0 | 61966.0 | 97414.0 | 97392.0 |
| mean | 12.007831 | 23.022202 | 2.372935 | 5.289991 | 7.609004 | 40.215873 | 14.092263 | 18.764608 | 68.628745 | 51.469547 | 1017.513734 | 1015.132352 | 4.302952 | 4.410677 | 16.835126 | 21.540138 |
| std | 6.347175 | 6.984397 | 8.518819 | 3.95201 | 3.788813 | 13.697967 | 8.984203 | 8.872398 | 19.003097 | 20.756113 | 7.07251 | 6.997072 | 2.866634 | 2.69337 | 6.404586 | 6.831612 |
| min | -8.5 | -4.1 | 0.0 | 0.0 | 0.0 | 6.0 | 0.0 | 0.0 | 0.0 | 0.0 | 980.5 | 979.0 | 0.0 | 0.0 | -5.9 | -5.1 |
| 25% | 7.5 | 17.9 | 0.0 | 2.6 | 4.8 | 31.0 | 7.0 | 13.0 | 57.0 | 37.0 | 1012.8 | 1010.4 | 1.0 | 2.0 | 12.2 | 16.6 |
| 50% | 11.8 | 22.4 | 0.0 | 4.6 | 8.5 | 39.0 | 13.0 | 19.0 | 70.0 | 52.0 | 1017.5 | 1015.1 | 5.0 | 5.0 | 16.6 | 20.9 |
| 75% | 16.6 | 27.9 | 0.8 | 7.2 | 10.6 | 48.0 | 19.0 | 24.0 | 83.0 | 66.0 | 1022.3 | 1019.9 | 7.0 | 7.0 | 21.4 | 26.2 |
| max | 33.9 | 48.1 | 371.0 | 82.4 | 14.3 | 135.0 | 87.0 | 87.0 | 100.0 | 100.0 | 1041.0 | 1039.6 | 9.0 | 9.0 | 40.2 | 46.1 |


In [87]:
training_x[categorical_cols].nunique()

Location       49
WindGustDir    16
WindDir9am     16
WindDir3pm     16
RainToday       2
dtype: int64

## Missing Values
