In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline

In [2]:
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

In [3]:
data = pd.read_csv('daily_weather.csv')

# Daily Weather Data Description (60 points)

The file **daily_weather.csv** is a comma-separated file that contains weather data. This data comes from a weather station located in San Diego, California. The weather station is equipped with sensors that capture weather-related measurements such as air temperature, air pressure, and relative humidity. Data was collected for a period of three years, from September 2011 to September 2014, to ensure that sufficient data for different seasons and weather conditions is captured.

Let's now check all the columns in the data.


Each row in daily_weather.csv captures weather data for a separate day. 

Sensor measurements from the weather station were captured at one-minute intervals. These measurements were then processed to generate values that describe daily weather. Since this dataset was created to classify low-humidity days vs. non-low-humidity days (that is, days with normal or high humidity), the variables included are weather measurements in the morning, with one measurement, namely relative humidity, in the afternoon. The idea is to use the morning weather values to predict whether the day will be low-humidity or not, based on the afternoon measurement of relative humidity.
Each row, or sample, consists of the following variables:

number: unique number for each row

air_pressure_9am: air pressure averaged over a period from 8:55am to 9:04am (Unit: hectopascals)

air_temp_9am: air temperature averaged over a period from 8:55am to 9:04am (Unit: degrees Fahrenheit)

avg_wind_direction_9am: wind direction averaged over a period from 8:55am to 9:04am (Unit: degrees, with 0 meaning wind coming from the North, and increasing clockwise)

avg_wind_speed_9am: wind speed averaged over a period from 8:55am to 9:04am (Unit: miles per hour)

max_wind_direction_9am: wind gust direction averaged over a period from 8:55am to 9:10am (Unit: degrees, with 0 being North and increasing clockwise)

max_wind_speed_9am: wind gust speed averaged over a period from 8:55am to 9:04am (Unit: miles per hour)

rain_accumulation_9am: amount of rain accumulated in the 24 hours prior to 9am (Unit: millimeters)

rain_duration_9am: amount of time rain was recorded in the 24 hours prior to 9am (Unit: seconds)

relative_humidity_9am: relative humidity averaged over a period from 8:55am to 9:04am (Unit: percent)

relative_humidity_3pm: relative humidity averaged over a period from 2:55pm to 3:04pm (Unit: percent)

In [5]:
data.columns

Index(['number', 'air_pressure_9am', 'air_temp_9am', 'avg_wind_direction_9am',
       'avg_wind_speed_9am', 'max_wind_direction_9am', 'max_wind_speed_9am',
       'rain_accumulation_9am', 'rain_duration_9am', 'relative_humidity_9am',
       'relative_humidity_3pm'],
      dtype='object')

In [6]:
data.head()

,number,air_pressure_9am,air_temp_9am,avg_wind_direction_9am,avg_wind_speed_9am,max_wind_direction_9am,max_wind_speed_9am,rain_accumulation_9am,rain_duration_9am,relative_humidity_9am,relative_humidity_3pm
0,0,918.06,74.822,271.1,2.080354,295.4,2.863283,0.0,0.0,42.42,36.16
1,1,917.347688,71.403843,101.935179,2.443009,140.471548,3.533324,0.0,0.0,24.328697,19.426597
2,2,923.04,60.638,51.0,17.067852,63.7,22.100967,0.0,20.0,8.9,14.46
3,3,920.502751,70.138895,198.832133,4.337363,211.203341,5.190045,0.0,0.0,12.189102,12.742547
4,4,921.16,44.294,277.8,1.85666,136.5,2.863283,8.9,14730.0,92.41,76.74


In [7]:
del data['number']
data = data.dropna()
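
Before dropping rows, it can help to see how many missing values each column has and how many rows `dropna()` removes. The following is a minimal sketch on a hypothetical toy table (the `toy` frame and its values are made up for illustration and are not read from daily_weather.csv).

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the weather table; values are illustrative only.
toy = pd.DataFrame({
    "air_temp_9am": [74.8, np.nan, 60.6, 70.1],
    "relative_humidity_9am": [42.4, 24.3, np.nan, 12.2],
    "relative_humidity_3pm": [36.2, 19.4, 14.5, 12.7],
})

missing_per_column = toy.isna().sum()   # NaN count for each column
rows_before = len(toy)
toy_clean = toy.dropna()                # drop every row that has at least one NaN
rows_dropped = rows_before - len(toy_clean)

print(missing_per_column)
print("rows dropped:", rows_dropped)
```

The same two checks (`isna().sum()` and the row count before and after) could be run on `data` to confirm how much of the dataset survives cleaning.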

In [8]:
# Binarize relative_humidity_3pm: 1 if above 24.99 percent, otherwise 0.

clean_data = data.copy()
clean_data['high_humidity_label'] = (clean_data['relative_humidity_3pm']>24.99)*1
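
The thresholding step above can be sketched in isolation to check how boundary values are labeled and how balanced the two classes are. The readings below are hypothetical, and `THRESHOLD` is simply a name given here to the 24.99 cutoff used in the cell.

```python
import pandas as pd

THRESHOLD = 24.99  # same cutoff as the cell above

# Hypothetical 3pm humidity readings (percent), not taken from the dataset.
humidity_3pm = pd.Series([36.16, 19.43, 14.46, 12.74, 76.74, 24.99, 25.0])

# astype(int) gives the same 0/1 result as multiplying the boolean mask by 1.
label = (humidity_3pm > THRESHOLD).astype(int)

# Share of each class; a strong imbalance would make plain accuracy misleading.
class_share = label.value_counts(normalize=True).sort_index()

print(label.tolist())
print(class_share)
```

Because the comparison is strict, a reading of exactly 24.99 is labeled 0.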

In [9]:
y = clean_data[['high_humidity_label']]
type(y)
y.shape

(1064, 1)

In [10]:
# Use 9am Sensor Signals as Features to Predict Humidity at 3pm

morning_features = ['air_pressure_9am', 'air_temp_9am', 'avg_wind_direction_9am',
       'avg_wind_speed_9am', 'max_wind_direction_9am', 'max_wind_speed_9am',
       'rain_accumulation_9am', 'rain_duration_9am']

In [11]:
X = clean_data[morning_features]

In [12]:
X.shape

(1064, 8)

In [13]:
X_train, X_test, y_train, y_test=train_test_split(X, y, random_state=23)
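
Note that `y` was selected with double brackets, so it is a one-column DataFrame of shape (1064, 1); passing that to scikit-learn estimators is what triggers the `column_or_1d` warnings seen in later output. A minimal sketch of an alternative, on synthetic data, selects the label as a 1-D Series and also stratifies the split. The names `X_demo`, `y_demo`, and `y_1d` are hypothetical, and `stratify` is an addition that the original split does not use.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-ins for X and y; not the weather data.
X_demo = pd.DataFrame(rng.normal(size=(200, 3)), columns=["f1", "f2", "f3"])
y_demo = pd.DataFrame({"high_humidity_label": (X_demo["f1"] > 0).astype(int)})

# Single brackets return a Series, so estimators receive a 1-D target.
y_1d = y_demo["high_humidity_label"]

# stratify keeps the class proportions similar in the train and test parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_1d, random_state=23, stratify=y_1d
)

print(y_tr.shape, y_te.shape)
```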

**Complete the following tasks**:

- Train a logistic regression model (10 points)
- Train an SVM model by tuning both C and gamma, report the best parameters (15 points)
- Train a decision tree model by tuning the proper parameters, report the best parameters (15 points)
- Use the same parameters to train a random forest model (10 points)
- Compare all the above models' performance (10 points)

In [14]:
# YOUR CODES
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC, SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import roc_auc_score

# Collect one result dict per model for the comparison table
model_Scores = []

# Logistic Regression
lreg = LogisticRegression()
# Train a Logistic Regression model
lreg.fit(X_train, y_train)
y_pred = lreg.predict(X_test)
# Print the coefficients and model performance
cvscores = cross_val_score(lreg, X_train, y_train, cv=3)
print('Logistic Regression:')
print("Cross validation scores: {}".format(cvscores))
print("lr.coef_:", lreg.coef_)
print("lr.intercept_:", lreg.intercept_)
print("Training score: {:.2f}".format(lreg.score(X_train, y_train)))
test_score = lreg.score(X_test,y_test) 
print("Test score {:.2f}".format(test_score))
print("Accuracy Score:", accuracy_score(y_test, y_pred))
lr_auc = roc_auc_score(y_test, lreg.predict_proba(X_test)[:,1])
print("AUC for logistic regression: {:.3f}".format(lr_auc))
## Appending the results 
model_Scores.append({'Model Type' : 'Classification',
                    'Model Name' : 'Logistic Regression',
                    'Parameters' : '-',
                    'Training_score': lreg.score(X_train, y_train),
                    'Test Score': test_score,
                    'AUC':lr_auc})

# LinearSVM
linear_svc = LinearSVC()
# Define a list of parameters
params_svc = {'C': [0.001, 0.01, 0.1, 1, 10, 100]}
grid_svc = GridSearchCV(linear_svc, params_svc, cv=5, n_jobs=2,scoring = 'roc_auc', return_train_score=True)
grid_svc.fit(X_train, y_train)
y_pred = grid_svc.predict(X_test)
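print('\nLinear SVM:')
# With scoring='roc_auc', grid_svc.score returns ROC AUC, not accuracy
print('Training score: ', grid_svc.score(X_train, y_train))
print('Test score: ', grid_svc.score(X_test, y_test))
print("Accuracy Score:", accuracy_score(y_test, y_pred))
# best_score_ is the mean cross-validated AUC on the training folds, not a test-set AUC
linearsvc_auc = grid_svc.best_score_
print("AUC for Linear SVM: {:.3f}".format(linearsvc_auc))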
print('Best Parameters:',grid_svc.best_params_)
## Appending the results 
model_Scores.append({'Model Type' : 'Classification',
                    'Model Name' : 'Linear SVM',
                    'Parameters' : grid_svc.best_params_,
                    'Training_score': grid_svc.score(X_train, y_train),
                    'Test Score': grid_svc.score(X_test, y_test ),
                    'AUC':linearsvc_auc})

# SVM with Radial Kernel
svmRadial = SVC(kernel = 'rbf')
param_Radial_SVM = {'C': [0.001, 0.01, 0.1, 1, 10, 100],'gamma': [0.001, 0.01, 0.1, 1, 10, 100]}
svmRadialGridSV  = GridSearchCV(svmRadial, param_grid = param_Radial_SVM, cv=5, n_jobs=2, scoring='roc_auc', return_train_score=True)
svmRadialGridSV.fit(X_train, y_train)
y_pred = svmRadialGridSV.predict(X_test)
print('\nSVM With Radial Kernel:')
# With scoring='roc_auc', svmRadialGridSV.score returns ROC AUC, not accuracy
print('Training score: ', svmRadialGridSV.score(X_train, y_train))
print('Test score: ', svmRadialGridSV.score(X_test, y_test))
print("Accuracy Score:", accuracy_score(y_test, y_pred))
# best_score_ is the mean cross-validated AUC on the training folds, not a test-set AUC
radialsvc_auc = svmRadialGridSV.best_score_
print("AUC for Radial SVM: {:.3f}".format(radialsvc_auc))
print('Best Parameters:',svmRadialGridSV.best_params_)
## Appending the results 
model_Scores.append({'Model Type' : 'Classification',
                    'Model Name' : 'SVM Radial',
                    'Parameters' : svmRadialGridSV.best_params_,
                    'Training_score': svmRadialGridSV.score(X_train, y_train),
                    'Test Score': svmRadialGridSV.score(X_test, y_test ),
                    'AUC':radialsvc_auc})

# Decision Tree
tree = DecisionTreeClassifier(random_state=0)
tree.fit(X_train, y_train)
y_pred = tree.predict(X_test)
print('\nDecision Tree:')
print('Training score: ',tree.score(X_train, y_train))
print('Test score: ',tree.score(X_test, y_test))
print("Accuracy Score:", accuracy_score(y_test, y_pred))
tree_auc = roc_auc_score(y_test, tree.predict_proba(X_test)[:,1])
print("AUC for Decision Tree: {:.3f}".format(tree_auc))

# Tuning the Decision Tree to determine best parameters
opt_tree = DecisionTreeClassifier(random_state = 0)
param_DT = {"max_depth": range(1,10),
           "min_samples_split": range(2,10,1),
           "max_leaf_nodes": range(2,5)}
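# No scoring argument is given, so the search optimizes accuracy and grid_tree.score reports accuracy
grid_tree = GridSearchCV(opt_tree,param_DT,cv=5)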
grid_tree.fit(X_train,y_train)
y_pred = grid_tree.predict(X_test)
print('\nDecision Tree with Best Parameters:')
print('Training score: ', grid_tree.score(X_train, y_train))
print('Test score: ', grid_tree.score(X_test, y_test))
print("Accuracy Score:", accuracy_score(y_test, y_pred))
gridtree_auc = roc_auc_score(y_test, grid_tree.predict_proba(X_test)[:,1])
print("AUC for Decision Tree: {:.3f}".format(gridtree_auc))
print('Best Parameters:',grid_tree.best_params_)
## Appending the results 
model_Scores.append({'Model Type' : 'Classification',
                    'Model Name' : 'Decision Tree',
                    'Parameters' : grid_tree.best_params_,
                    'Training_score': grid_tree.score(X_train, y_train),
                    'Test Score': grid_tree.score(X_test, y_test ),
                    'AUC':gridtree_auc})

# Random Forest Model, reusing the best parameters found for the decision tree
rnd_clf = RandomForestClassifier(max_leaf_nodes=4, max_depth=2, min_samples_split=2, random_state=0)
rnd_clf.fit(X_train, y_train)
y_pred = rnd_clf.predict(X_test)
print('\nRandom Forest Model:')
print('Training score: ', rnd_clf.score(X_train, y_train))
print('Test score: ', rnd_clf.score(X_test, y_test))
print("Accuracy Score:", accuracy_score(y_test, y_pred))
rndclf_auc = roc_auc_score(y_test, rnd_clf.predict_proba(X_test)[:,1])
print("AUC for Random Forest: {:.3f}".format(rndclf_auc))
## Appending the results 
model_Scores.append({'Model Type' : 'Classification',
                    'Model Name' : 'Random Forest',
                    'Parameters' : '-',
                    'Training_score': rnd_clf.score(X_train, y_train),
                    'Test Score': rnd_clf.score(X_test, y_test ),
                    'AUC':rndclf_auc})

Logistic Regression:
Cross validation scores: [0.69172932 0.68796992 0.7593985 ]
lr.coef_: [[ 4.14333008e-03 -6.41216799e-02 -3.67783043e-04  9.77557212e-01
   8.79378752e-03 -9.13554820e-01  4.67579874e-02  3.84506288e-04]]
lr.intercept_: [0.06572658]
Training score: 0.72
Test score 0.71
Accuracy Score: 0.7142857142857143
AUC for logistic regression: 0.801


  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)



Linear SVM:
Training score:  0.7043507211417659
Test score:  0.730130855945165
Accuracy Score: 0.46616541353383456
AUC for Linear SVM: 0.784
Best Parameters: {'C': 1}


  y = column_or_1d(y, warn=True)



SVM With Radial Kernel:
Training score:  0.9919656766671692
Test score:  0.891803092958704
Accuracy Score: 0.8458646616541353
AUC for Radial SVM: 0.916
Best Parameters: {'C': 1, 'gamma': 0.01}

Decision Tree:
Training score:  1.0
Test score:  0.7857142857142857
Accuracy Score: 0.7857142857142857
AUC for Decision Tree: 0.785

Decision Tree with Best Parameters:
Training score:  0.7694235588972431
Test score:  0.7481203007518797
Accuracy Score: 0.7481203007518797
AUC for Decision Tree: 0.800
Best Parameters: {'max_depth': 2, 'max_leaf_nodes': 4, 'min_samples_split': 2}

Random Forest Model:
Training score:  0.7794486215538847
Test score:  0.7631578947368421
Accuracy Score: 0.7631578947368421
AUC for Random Forest: 0.841
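
Both SVM models above were fit on unscaled features, even though the columns span very different ranges (air pressure near 920, rain duration in the thousands of seconds). SVMs are sensitive to feature scale, so this may be one reason the Linear SVM accuracy is low. A minimal sketch of one possible remedy, on synthetic data, puts a `StandardScaler` in a `Pipeline` ahead of the SVC so that scaling is refit inside each cross-validation fold. The scaler, the step names `scale` and `svc`, and the reduced grid are assumptions for illustration and are not part of the original solution.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic binary classification data standing in for the weather features.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=23)

# The scaler is fit only on the training part of each CV fold.
pipe = Pipeline([("scale", StandardScaler()), ("svc", SVC(kernel="rbf"))])

# Pipeline parameters are addressed as <step name>__<parameter>.
param_grid = {"svc__C": [0.1, 1, 10], "svc__gamma": [0.01, 0.1, 1]}
search = GridSearchCV(pipe, param_grid, cv=5, scoring="roc_auc")
search.fit(X_tr, y_tr)

print("best params:", search.best_params_)
print("mean CV AUC:", round(search.best_score_, 3))
print("test AUC:", round(search.score(X_te, y_te), 3))
```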




In [15]:
modelResult = pd.DataFrame(model_Scores)
modelResult.set_index('Model Name', inplace = True)
modelResult

Model Name,Model Type,Parameters,Training_score,Test Score,AUC
Logistic Regression,Classification,-,0.718045,0.714286,0.800997
Linear SVM,Classification,{'C': 1},0.704351,0.730131,0.783799
SVM Radial,Classification,"{'C': 1, 'gamma': 0.01}",0.991966,0.891803,0.916445
Decision Tree,Classification,"{'max_depth': 2, 'max_leaf_nodes': 4, 'min_sam...",0.769424,0.74812,0.800402
Random Forest,Classification,-,0.779449,0.763158,0.841245
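
The random forest above hardcodes the tuned decision tree settings, and its row in the table lists the parameters as "-". One way to keep the two models in sync is to unpack `best_params_` from the tree search directly into the forest. This is a sketch on synthetic data with a smaller, hypothetical grid; `tree_search` and `forest` are illustrative names.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for the weather features.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# Hypothetical reduced grid; both parameters are also accepted by RandomForestClassifier.
param_grid = {"max_depth": [1, 2, 3], "max_leaf_nodes": [2, 4]}
tree_search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
tree_search.fit(X_demo, y_demo)

# Unpack the tuned settings so the forest always matches the search result.
forest = RandomForestClassifier(random_state=0, **tree_search.best_params_)
forest.fit(X_demo, y_demo)

print(tree_search.best_params_)
print(forest.get_params()["max_depth"], forest.get_params()["max_leaf_nodes"])
```

The same dictionary could then be stored in the results table instead of "-".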


# Model Performance

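The radial-kernel SVM with C=1 and gamma=0.01 is the best-performing model: it has the highest mean cross-validated AUC (0.916), the highest test AUC (0.892), and the highest test accuracy (0.846). Note that the two SVM rows in the table are not directly comparable with the others. Their grid searches used scoring='roc_auc', so their training and test scores are ROC AUC values and their AUC column is the mean cross-validated AUC, whereas the other rows report accuracy and test-set AUC. The gap between the radial SVM's training AUC (0.992) and test AUC (0.892) also suggests some overfitting.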
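
For a like-for-like comparison, every model can be scored with the same two metrics on the same held-out test set. The sketch below does this on synthetic data for three of the model types; the helper `continuous_scores` is a hypothetical name, and the hyperparameters are copied from the results above only as examples.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for the weather features.
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=23)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "SVM Radial": SVC(kernel="rbf", C=1, gamma=0.01),
    "Decision Tree": DecisionTreeClassifier(max_depth=2, random_state=0),
}

def continuous_scores(model, X):
    """Return class-1 probabilities if available, else decision_function values."""
    if hasattr(model, "predict_proba"):
        return model.predict_proba(X)[:, 1]
    return model.decision_function(X)

rows = []
for name, model in models.items():
    model.fit(X_tr, y_tr)
    rows.append({
        "Model Name": name,
        "Test accuracy": accuracy_score(y_te, model.predict(X_te)),
        "Test AUC": roc_auc_score(y_te, continuous_scores(model, X_te)),
    })

comparison = pd.DataFrame(rows).set_index("Model Name")
print(comparison)
```

`SVC` without `probability=True` exposes no `predict_proba`, so the helper falls back to `decision_function`, which `roc_auc_score` accepts as a ranking score.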