## Text Parse Summary

A binary variable was encoded to distinguish a 'cloudy' day from a 'clear' day. The sky-condition encodings below are taken from the NOAA website.

- 0 oktas/0 tenths is defined as CLR (clear sky)
- 1-2 oktas/1-3 tenths is defined as FEW (few clouds)
- 3-4 oktas/4-5 tenths is defined as SCT (scattered clouds)
- 5 to less than 8 oktas/6 to less than 10 tenths is defined as BKN (broken clouds)
- 8 oktas/10 tenths is defined as OVC (overcast)
- Sky obscured by a weather phenomenon is defined as VV (vertical visibility)

A clear day was defined as any observation containing only CLR, FEW, or SCT layers, while a cloudy day was defined as any observation containing BKN, OVC, or VV.
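
The rule above can be sketched on a few sample strings. This is a minimal illustration that reuses the weights and the 0.65 threshold applied later in this notebook; the helper `is_cloudy` and the sample reports are made up for the example and are not taken from the data.

```python
# Weights and threshold mirror the ones used later in this notebook.
WEIGHTS = {"CLR": 0.0, "FEW": 0.2, "SCT": 0.45, "BKN": 0.7, "OVC": 0.9, "VV": 1.0}
THRESHOLD = 0.65

def is_cloudy(report: str) -> int:
    """Return 1 (cloudy) or 0 (clear) for one sky-condition string."""
    # Each code counts once if it appears anywhere in the report.
    score = sum(weight for code, weight in WEIGHTS.items() if code in report)
    return int(score > THRESHOLD)

# Hypothetical reports, written to resemble the HOURLYSKYCONDITIONS format.
samples = [
    "CLR:00",
    "FEW:02 55 SCT:04 120",
    "BKN:07 30",
    "FEW:02 10 OVC:08 25",
    "VV:09 2",
]
labels = [is_cloudy(s) for s in samples]
print(labels)  # [0, 0, 1, 1, 1]
```

A report with both FEW and SCT scores exactly 0.65 and stays clear because the comparison is strict, so only reports containing BKN, OVC, or VV cross the threshold.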

In [None]:
import pandas as pd
import numpy as np
import re
import seaborn as sns
import matplotlib.pyplot as plt
import pickle
%matplotlib inline

In [None]:
d = pd.read_pickle("hourlies.pickle")

In [None]:
d.info()

#### Text parsing was used to assign weights for each sky condition.

In [None]:
d['HOURLYSKYCONDITIONS'].head()

In [None]:
#Replace the ':' with a space to separate the conditions
sky = d['HOURLYSKYCONDITIONS'].str.replace(':', " ")

In [None]:
#Find each sky condition
clr = sky.str.contains('CLR')
few = sky.str.contains('FEW')
sct = sky.str.contains('SCT')
bkn = sky.str.contains('BKN')
ovc = sky.str.contains('OVC')
vv = sky.str.contains('VV')

In [None]:
#Weighted sum of the sky-condition indicators (a cloudiness score, not a true average)
cloudy_average = (0*clr + 0.2*few + 0.45*sct + 0.7*bkn + 0.9*ovc + 1*vv)

In [None]:
#Set a threshold for a clear or cloudy day, where clear=0 and cloudy=1
binary_clouds = (cloudy_average > 0.65).astype(int)

In [None]:
#Add a new column to the data frame
d['Binary_Clouds'] = binary_clouds

## Intent of Logistic Regression

The intent of the logistic model was to predict the probability of a clear or cloudy day, first for all locations combined and then location by location. An 80/20 train-test split was used, and models were cross-validated using accuracy as the scoring metric. After a model was built for all locations, separate models were fit per location to see whether any of them predicted a clear or cloudy day with high accuracy.

In [None]:
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression

In [None]:
#Select the numerical features for logistic regression from the whole set
X = d[['HOURLYDRYBULBTEMPF', 'HOURLYDewPointTempF', 'HOURLYRelativeHumidity', 
               'HOURLYWindSpeed', 'HOURLYAltimeterSetting', 'HOURLYPrecip']]
y = d['Binary_Clouds']

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 42)

In [None]:
logmodel = LogisticRegression()
logmodel.fit(X_train,y_train)

In [None]:
predictions = logmodel.predict(X_test)

In [None]:
from sklearn.metrics import confusion_matrix, classification_report

In [None]:
confusion_matrix(y_test, predictions)

In [None]:
#This model returns an accuracy of around 70 percent
print(classification_report(y_test, predictions))  

In [None]:
#Cross-validation with 10 folds
scores_accuracy = cross_val_score(logmodel, X, y, cv=10, scoring='accuracy')
scores_accuracy

In [None]:
#Mean of the cross validated models
scores_accuracy.mean()
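
An accuracy figure is easier to interpret next to a majority-class baseline. The sketch below is one way to compute such a baseline with scikit-learn's `DummyClassifier`; it runs on synthetic data (`X_toy`, `y_toy` are made up, with roughly 30% of labels set to cloudy), not on the notebook's data frame.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data: labels are unrelated to the features.
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(500, 3))
y_toy = (rng.random(500) < 0.3).astype(int)  # roughly 30% "cloudy"

# A classifier that always predicts the most frequent class.
baseline = DummyClassifier(strategy="most_frequent")
baseline_scores = cross_val_score(baseline, X_toy, y_toy, cv=10, scoring="accuracy")

# The baseline accuracy should track the share of the majority class.
majority_share = max(y_toy.mean(), 1 - y_toy.mean())
print(f"baseline accuracy: {baseline_scores.mean():.3f}")
print(f"majority share:    {majority_share:.3f}")
```

Running the same baseline on `X` and `y` would show how much of the logistic model's accuracy comes from the features rather than from class imbalance.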

In [None]:
#Test logistic regression on all cities.
idx = d.index
locations = np.unique(idx.tolist())

In [None]:
#Run the model for all locations and find means of the cross validated models
means = []

for i in locations:
    place = d.loc[i,]
    X = place[['HOURLYDRYBULBTEMPF', 'HOURLYDewPointTempF', 'HOURLYRelativeHumidity', 
               'HOURLYWindSpeed', 'HOURLYAltimeterSetting']]
    y = place['Binary_Clouds']
    logmodel = LogisticRegression()
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 42)
    logmodel.fit(X_train,y_train)
    scores_accuracy = cross_val_score(logmodel, X, y, cv=10, scoring='accuracy')
    means.append(scores_accuracy.mean())
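
The per-location loop can also be wrapped in a small helper so the same steps are not repeated for each station. The following is a sketch under stated assumptions: the function `mean_cv_accuracy`, the toy frame `toy`, its station names, and its feature columns are all hypothetical, and `max_iter=1000` is an added setting that the notebook's own models do not use.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def mean_cv_accuracy(frame, features, target="Binary_Clouds", cv=10):
    """Mean k-fold accuracy of a logistic regression on one location's rows."""
    model = LogisticRegression(max_iter=1000)  # assumption: raised iteration cap
    scores = cross_val_score(model, frame[features], frame[target],
                             cv=cv, scoring="accuracy")
    return scores.mean()

# Synthetic stand-in for `d`: two made-up stations, indexed by station name.
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame(
    {
        "temp": rng.normal(60, 10, 2 * n),
        "humidity": rng.uniform(10, 100, 2 * n),
    },
    index=["STATION A"] * n + ["STATION B"] * n,
)
# Made-up label: humid hours are more likely to be cloudy.
toy["Binary_Clouds"] = (toy["humidity"] + rng.normal(0, 15, 2 * n) > 60).astype(int)

# One mean score per station, keyed by station name.
means_by_location = {
    name: mean_cv_accuracy(group, ["temp", "humidity"])
    for name, group in toy.groupby(level=0)
}
best = max(means_by_location, key=means_by_location.get)
print(means_by_location, best)
```

Keying the scores by location name avoids having to line up a list of means with a separate array of locations.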

In [None]:
means

In [None]:
print( f'The maximum mean score is {np.max(means)} at {locations[np.argmax(means)]}' )

In [None]:
#Explore the location with the maximum mean score
#Run for Mercury Desert Rock Airport, NV
NV = d.loc['MERCURY DESERT ROCK AIRPORT NV US',]
X = NV[['HOURLYDRYBULBTEMPF', 'HOURLYDewPointTempF', 'HOURLYRelativeHumidity', 
               'HOURLYWindSpeed', 'HOURLYAltimeterSetting']]
y = NV['Binary_Clouds']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 42)
logmodel = LogisticRegression()
logmodel.fit(X_train,y_train)
predictions = logmodel.predict(X_test)
print(classification_report(y_test, predictions))
scores_accuracy = cross_val_score(logmodel, X, y, cv=10, scoring='accuracy')

In [None]:
print( f'The minimum mean score is {np.min(means)} at {locations[np.argmin(means)]}' )

In [None]:
SC = d.loc['CHARLESTON INTL. AIRPORT SC US',]
X = SC[['HOURLYDRYBULBTEMPF', 'HOURLYDewPointTempF', 'HOURLYRelativeHumidity', 
               'HOURLYWindSpeed', 'HOURLYAltimeterSetting']]
y = SC['Binary_Clouds']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 42)
logmodel = LogisticRegression()
logmodel.fit(X_train,y_train)
predictions = logmodel.predict(X_test)
print(classification_report(y_test, predictions))
scores_accuracy = cross_val_score(logmodel, X, y, cv=10, scoring='accuracy')

In [None]:
#Minneapolis out of curiosity
MN = d.loc['MINNEAPOLIS ST PAUL INTERNATIONAL AIRPORT MN US',]
X = MN[['HOURLYDRYBULBTEMPF', 'HOURLYDewPointTempF', 'HOURLYRelativeHumidity', 
               'HOURLYWindSpeed', 'HOURLYAltimeterSetting']]
y = MN['Binary_Clouds']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 42)
logmodel = LogisticRegression()
logmodel.fit(X_train,y_train)
predictions = logmodel.predict(X_test)
print(classification_report(y_test, predictions))
scores_accuracy = cross_val_score(logmodel, X, y, cv=10, scoring='accuracy')

### Brief Conclusion

The results of logistic regression for Minneapolis, Charleston, and Alamosa are interesting. In Minneapolis, it appears to be much easier to predict a cloudy day, while in Alamosa it appears to be easier to predict a clear day.
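
Whether one class is "easier to predict" can be read from per-class recall, which is the diagonal of the confusion matrix divided by its row sums. The labels below are invented purely to show the arithmetic and do not come from any of the locations above.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up labels: 0 = clear, 1 = cloudy.
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 1, 1, 1, 1, 0]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)

# Recall per class: correctly predicted count / number of true members.
recall = cm.diagonal() / cm.sum(axis=1)
print(cm)
print(f"clear recall:  {recall[0]:.2f}")   # 0.50
print(f"cloudy recall: {recall[1]:.2f}")   # 0.83
```

In this toy case the cloudy class is recovered more often than the clear class, which is the pattern described for Minneapolis.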

# Gradient Boosting Model

Let's compare the results of logistic regression to a boosting model.


In [None]:
#Reset X and y
X = d[['HOURLYDRYBULBTEMPF', 'HOURLYDewPointTempF', 'HOURLYRelativeHumidity', 
               'HOURLYWindSpeed', 'HOURLYAltimeterSetting', 'HOURLYPrecip']]
y = d['Binary_Clouds']

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 7)

In [None]:
from sklearn.ensemble import GradientBoostingClassifier

In [None]:
boost = GradientBoostingClassifier(n_estimators = 100, random_state = 7)
fit = boost.fit(X_train, y_train)

In [None]:
predictions = fit.predict(X_test)

In [None]:
scores_accuracy = cross_val_score(fit, X, y, cv=10, scoring='accuracy')

In [None]:
scores_accuracy

In [None]:
scores_accuracy.mean()

In [None]:
#Appears comparable at first glance
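
For a like-for-like comparison, both models can be scored on identical folds. The sketch below assumes synthetic data from `make_classification` and a shared `StratifiedKFold` splitter; the names `X_toy`, `y_toy`, `folds`, and `results` are illustrative, and `max_iter=1000` is an added setting not used elsewhere in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic binary classification problem with six features.
X_toy, y_toy = make_classification(n_samples=300, n_features=6, n_informative=3,
                                   n_clusters_per_class=1, random_state=7)

# One fixed splitter so both models see exactly the same folds.
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=7)

models = {
    "logistic": LogisticRegression(max_iter=1000),
    "boosting": GradientBoostingClassifier(n_estimators=100, random_state=7),
}
results = {
    name: cross_val_score(model, X_toy, y_toy, cv=folds, scoring="accuracy").mean()
    for name, model in models.items()
}
for name, acc in results.items():
    print(f"{name}: {acc:.3f}")
```

Passing the same splitter to both calls removes fold-to-fold variation as a source of difference between the two mean scores.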