## Text Parse Summary

A binary variable was encoded to mark each observation as a 'clear' day or a 'cloudy' day. The sky condition codes below are taken from the NOAA website.

0 oktas/0 tenths is defined as CLR (clear sky) 
1-2 oktas/1-3 tenths is defined as FEW (few clouds) 
3-4 oktas/4-5 tenths is defined as SCT (scattered clouds) 
5 to less than 8 oktas/6 to less than 10 tenths is defined as BKN (broken clouds) 
8 oktas/10 tenths is defined as OVC (overcast) 
Obscured sky due to weather phenomena is defined as VV

A clear day was defined as anything that contained CLR, FEW, or SCT clouds, while a cloudy day was defined as anything containing BKN, OVC, or VV.  
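
The rule above can also be written directly as a pattern match, without intermediate weights. The following is a minimal sketch, not the notebook's own method: the helper name `encode_binary_clouds` is hypothetical, and the last two sample strings are made up to follow the same `CODE:oktas height` layout as the real data.

```python
import pandas as pd

# Codes that make an observation "cloudy"; CLR, FEW and SCT alone count as clear.
CLOUDY_PATTERN = "BKN|OVC|VV"


def encode_binary_clouds(reports: pd.Series) -> pd.Series:
    """Return 1 where a report contains BKN, OVC or VV, otherwise 0."""
    return reports.str.contains(CLOUDY_PATTERN, regex=True).astype(int)


sample = pd.Series([
    "OVC:08 12",
    "BKN:07 7 OVC:08 10",
    "SCT:04 7 BKN:07 14 OVC:08 20",
    "FEW:02 30 SCT:04 50",  # hypothetical report, same layout as the data
    "CLR:00",               # hypothetical report, same layout as the data
])

labels = encode_binary_clouds(sample)
print(labels.tolist())  # [1, 1, 1, 0, 0]
```

A report with several layers is labelled cloudy as soon as any one layer is BKN, OVC, or VV.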

In [35]:
import pandas as pd
import numpy as np
import re
import seaborn as sns
import matplotlib.pyplot as plt
import pickle
%matplotlib inline

In [36]:
d = pd.read_pickle("hourlies.pickle")

In [37]:
d.info()

<class 'pandas.core.frame.DataFrame'>
Index: 999341 entries, AKRON CANTON AIRPORT OH US to WICHITA DWIGHT D. EISENHOWER NATIONAL AIRPORT KS US
Data columns (total 16 columns):
STATION                   999341 non-null object
ELEVATION                 999341 non-null float64
LATITUDE                  999341 non-null float64
LONGITUDE                 999341 non-null float64
DATE                      999341 non-null datetime64[ns]
REPORTTPYE                999341 non-null object
HOURLYSKYCONDITIONS       999341 non-null object
HOURLYDRYBULBTEMPF        999341 non-null float64
HOURLYWETBULBTEMPF        999341 non-null float64
HOURLYDewPointTempF       999341 non-null float64
HOURLYRelativeHumidity    999341 non-null float64
HOURLYWindSpeed           999341 non-null float64
HOURLYWindDirection       999341 non-null object
HOURLYStationPressure     999341 non-null float64
HOURLYPrecip              999341 non-null float64
HOURLYAltimeterSetting    999341 non-null float64
dtypes: datetime64[ns

#### Text parsing was used to assign weights for each sky condition.

In [38]:
d['HOURLYSKYCONDITIONS'].head()

STATION_NAME
AKRON CANTON AIRPORT OH US                       OVC:08 12
AKRON CANTON AIRPORT OH US              BKN:07 7 OVC:08 10
AKRON CANTON AIRPORT OH US                        OVC:08 7
AKRON CANTON AIRPORT OH US                        OVC:08 7
AKRON CANTON AIRPORT OH US    SCT:04 7 BKN:07 14 OVC:08 20
Name: HOURLYSKYCONDITIONS, dtype: object

In [39]:
#Replace the ':' with a space to separate the conditions
sky = d['HOURLYSKYCONDITIONS'].str.replace(':', " ")

In [40]:
#Find each sky condition
clr = sky.str.contains('CLR')
few = sky.str.contains('FEW')
sct = sky.str.contains('SCT')
bkn = sky.str.contains('BKN')
ovc = sky.str.contains('OVC')
vv = sky.str.contains('VV')

In [41]:
#Compute a weighted cloudiness score (a weighted sum of the conditions present, not a true average)
cloudy_average = (0*clr + 0.2*few + 0.45*sct + 0.7*bkn + 0.9*ovc + 1*vv)

In [42]:
#Set a threshold for a clear or cloudy day, where clear=0 and cloudy=1
binary_clouds = (cloudy_average > 0.65).astype(int)

In [43]:
#Add a new column to the data frame
d['Binary_Clouds'] = binary_clouds

## Intent of Logistic Regression

The intent of the logistic model was to predict the probability of a clear or cloudy day, first for all locations together and then location by location. An 80/20 train-test split was used, and the models were cross-validated with 'accuracy' as the scoring metric. After a model was built for all locations, separate models were fitted per location to see whether any of them predicted a clear or cloudy day with high accuracy.

In [44]:
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression

In [45]:
#Select the numerical columns from the whole data set as features for logistic regression
X = d[['HOURLYDRYBULBTEMPF', 'HOURLYDewPointTempF', 'HOURLYRelativeHumidity', 
               'HOURLYWindSpeed', 'HOURLYAltimeterSetting', 'HOURLYPrecip']]
y = d['Binary_Clouds']

In [46]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 42)

In [47]:
logmodel = LogisticRegression()
logmodel.fit(X_train,y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [48]:
predictions = logmodel.predict(X_test)

In [49]:
from sklearn.metrics import confusion_matrix, classification_report

In [50]:
confusion_matrix(y_test, predictions)

array([[60374, 35088],
       [24778, 79629]], dtype=int64)

In [51]:
#This model returns an accuracy of around 70%
print(classification_report(y_test, predictions))  

             precision    recall  f1-score   support

          0       0.71      0.63      0.67     95462
          1       0.69      0.76      0.73    104407

avg / total       0.70      0.70      0.70    199869
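
As a cross-check, the figures in the report can be recomputed by hand from the confusion matrix above. This is a small sketch that only uses the four counts printed by the notebook; rows are taken to be the true classes and columns the predicted classes, which is scikit-learn's convention.

```python
import numpy as np

# Confusion matrix copied from the notebook output:
# rows = true class (0 = clear, 1 = cloudy), columns = predicted class.
cm = np.array([[60374, 35088],
               [24778, 79629]])

accuracy = np.trace(cm) / cm.sum()          # correct predictions / all predictions
recall = np.diag(cm) / cm.sum(axis=1)       # per true class
precision = np.diag(cm) / cm.sum(axis=0)    # per predicted class

print(f"accuracy  = {accuracy:.2f}")
print(f"recall    = clear {recall[0]:.2f}, cloudy {recall[1]:.2f}")
print(f"precision = clear {precision[0]:.2f}, cloudy {precision[1]:.2f}")
```

The values agree with the classification report: the model misses more clear days (recall 0.63) than cloudy days (recall 0.76).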



In [52]:
#Cross-validation with 10-folds
scores_accuracy = cross_val_score(logmodel, X, y, cv=10, scoring='accuracy')
scores_accuracy

array([0.68006204, 0.68519538, 0.772312  , 0.63826487, 0.67971861,
       0.65494226, 0.65654332, 0.80870183, 0.69311439, 0.68041588])

In [53]:
#Mean of the cross validated models
scores_accuracy.mean()

0.6949270582562523
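
The fold scores range from about 0.64 to 0.81. One possible reason, offered here as an assumption rather than a finding, is that the rows appear to be ordered by station and `cv=10` does not shuffle, so each fold may be dominated by a few stations. A way to test this is to hold out whole stations with `GroupKFold`. The sketch below uses synthetic stand-in data (`X_demo`, `y_demo`, `groups` are all hypothetical); on the real frame, the station index would play the role of `groups`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GroupKFold, cross_val_score

rng = np.random.default_rng(0)

# Synthetic stand-in data: 5 "stations" with 200 rows each and 3 numeric features.
n_stations, rows_per_station = 5, 200
groups = np.repeat(np.arange(n_stations), rows_per_station)
X_demo = rng.normal(size=(n_stations * rows_per_station, 3))
y_demo = (X_demo[:, 0] + rng.normal(scale=0.5, size=len(X_demo)) > 0).astype(int)

# Each fold holds out every row of one station, so the score reflects
# how well the model transfers to a station it has not seen.
cv = GroupKFold(n_splits=n_stations)
group_scores = cross_val_score(
    LogisticRegression(), X_demo, y_demo,
    groups=groups, cv=cv, scoring="accuracy",
)

print(group_scores.round(3))
print(round(group_scores.mean(), 3))
```

If the grouped scores on the real data vary as much as the unshuffled ones, the spread is more likely a station effect than noise.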

In [54]:
#Test logistic regression on each location separately.
idx = d.index
locations = np.unique(idx.tolist())

In [55]:
#Run the model for each location and store the mean cross-validated accuracy
#Note: HOURLYPrecip is not part of the per-location feature set
means = []

for i in locations:
    place = d.loc[i]
    X = place[['HOURLYDRYBULBTEMPF', 'HOURLYDewPointTempF', 'HOURLYRelativeHumidity', 
               'HOURLYWindSpeed', 'HOURLYAltimeterSetting']]
    y = place['Binary_Clouds']
    #cross_val_score clones and refits the model on every fold, so no separate fit is needed
    logmodel = LogisticRegression()
    scores_accuracy = cross_val_score(logmodel, X, y, cv=10, scoring='accuracy')
    means.append(scores_accuracy.mean())

In [56]:
means

[0.7012575830088674,
 0.7492733859506429,
 0.6698509868438028,
 0.8574668064181152,
 0.7260851213721844,
 0.7327762077559228,
 0.6185311949101288,
 0.7197857039332635,
 0.6343126449859093,
 0.754115908184928,
 0.6736689009509176,
 0.6696046449594284,
 0.6696367917810078,
 0.9195339057813326,
 0.7116878370468414,
 0.6647076670634027,
 0.7601083794167944,
 0.764368117470887,
 0.6497821252926055]

In [57]:
print( f'The maximum mean score is {np.max(means)} at {locations[np.argmax(means)]}' )

The maximum mean score is 0.9195339057813326 at MERCURY DESERT ROCK AIRPORT NV US


In [58]:
#Explore the location with the maximum mean score
#Run for Mercury Desert Rock Airport, NV
NV = d.loc['MERCURY DESERT ROCK AIRPORT NV US',]
X = NV[['HOURLYDRYBULBTEMPF', 'HOURLYDewPointTempF', 'HOURLYRelativeHumidity', 
               'HOURLYWindSpeed', 'HOURLYAltimeterSetting']]
y = NV['Binary_Clouds']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 42)
logmodel = LogisticRegression()
logmodel.fit(X_train,y_train)
predictions = logmodel.predict(X_test)
print(classification_report(y_test, predictions))
scores_accuracy = cross_val_score(logmodel, X, y, cv=10, scoring='accuracy')

             precision    recall  f1-score   support

          0       0.93      0.98      0.96      7469
          1       0.73      0.40      0.52       860

avg / total       0.91      0.92      0.91      8329



In [59]:
print( f'The minimum mean score is {np.min(means)} at {locations[np.argmin(means)]}' )

The minimum mean score is 0.6185311949101288 at CHARLESTON INTL. AIRPORT SC US


In [60]:
SC = d.loc['CHARLESTON INTL. AIRPORT SC US',]
X = SC[['HOURLYDRYBULBTEMPF', 'HOURLYDewPointTempF', 'HOURLYRelativeHumidity', 
               'HOURLYWindSpeed', 'HOURLYAltimeterSetting']]
y = SC['Binary_Clouds']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 42)
logmodel = LogisticRegression()
logmodel.fit(X_train,y_train)
predictions = logmodel.predict(X_test)
print(classification_report(y_test, predictions))
scores_accuracy = cross_val_score(logmodel, X, y, cv=10, scoring='accuracy')

             precision    recall  f1-score   support

          0       0.67      0.66      0.66      5572
          1       0.63      0.64      0.63      5068

avg / total       0.65      0.65      0.65     10640



In [61]:
#Minneapolis out of curiosity
MN = d.loc['MINNEAPOLIS ST PAUL INTERNATIONAL AIRPORT MN US',]
X = MN[['HOURLYDRYBULBTEMPF', 'HOURLYDewPointTempF', 'HOURLYRelativeHumidity', 
               'HOURLYWindSpeed', 'HOURLYAltimeterSetting']]
y = MN['Binary_Clouds']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 42)
logmodel = LogisticRegression()
logmodel.fit(X_train,y_train)
predictions = logmodel.predict(X_test)
print(classification_report(y_test, predictions))
scores_accuracy = cross_val_score(logmodel, X, y, cv=10, scoring='accuracy')

             precision    recall  f1-score   support

          0       0.59      0.45      0.51      3699
          1       0.76      0.85      0.80      7498

avg / total       0.70      0.72      0.71     11197



### Brief Conclusion

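The logistic regression results for Minneapolis, Charleston, and Mercury Desert Rock are interesting. In Minneapolis the model finds cloudy days far more reliably than clear ones (recall 0.85 versus 0.45), while at Mercury Desert Rock the opposite holds (recall 0.98 for clear versus 0.40 for cloudy). Part of this reflects class balance: about 90% of the Mercury Desert Rock test rows are clear (7469 of 8329), so its 92% accuracy is only slightly better than always predicting a clear day. Charleston is nearly balanced, and its model reaches about 65% on both classes.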
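
To make "easier to predict" concrete, per-class recall can be reported next to the share of the majority class for each location. The helper below is a sketch under stated assumptions: `evaluate_station`, `make_station`, the feature names `temp` and `humidity`, and the two station names are all hypothetical, and the data is synthetic. Only the split settings (80/20, `random_state=42`) mirror the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

FEATURES = ["temp", "humidity"]  # hypothetical feature names for this demo
rng = np.random.default_rng(1)


def make_station(n, threshold):
    """Build synthetic rows; a higher threshold yields fewer cloudy labels."""
    humidity = rng.uniform(0, 1, n)
    temp = rng.normal(0, 1, n)
    cloudy = (humidity + rng.normal(0, 0.2, n) > threshold).astype(int)
    return pd.DataFrame({"temp": temp, "humidity": humidity, "Binary_Clouds": cloudy})


def evaluate_station(frame, features=FEATURES, target="Binary_Clouds"):
    """Fit one logistic model and return accuracy, per-class recall and majority share."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        frame[features], frame[target], test_size=0.20, random_state=42
    )
    pred = LogisticRegression().fit(X_tr, y_tr).predict(X_te)
    return {
        "accuracy": (pred == y_te).mean(),
        "recall_clear": recall_score(y_te, pred, pos_label=0),
        "recall_cloudy": recall_score(y_te, pred, pos_label=1),
        # Accuracy of a model that always predicts the most common class.
        "majority_share": y_te.value_counts(normalize=True).max(),
    }


stations = {
    "DRY STATION": make_station(1000, 0.9),    # mostly clear
    "MIXED STATION": make_station(1000, 0.5),  # roughly balanced
}
summary = pd.DataFrame({name: evaluate_station(f) for name, f in stations.items()}).T
print(summary.round(2))
```

Comparing `accuracy` with `majority_share` shows how much of a location's score comes from the model rather than from an unbalanced climate.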

# Gradient Boosting Model

Let's compare the results of logistic regression to a boosting model.


In [62]:
#Reset X and y to the full data set
X = d[['HOURLYDRYBULBTEMPF', 'HOURLYDewPointTempF', 'HOURLYRelativeHumidity', 
               'HOURLYWindSpeed', 'HOURLYAltimeterSetting', 'HOURLYPrecip']]
y = d['Binary_Clouds']

In [63]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 7)

In [64]:
from sklearn.ensemble import GradientBoostingClassifier

In [65]:
boost = GradientBoostingClassifier(n_estimators = 100, random_state = 7)
fit = boost.fit(X_train, y_train)

In [66]:
predictions = fit.predict(X_test)

In [67]:
scores_accuracy = cross_val_score(fit, X, y, cv=10, scoring='accuracy')

In [68]:
scores_accuracy

array([0.69804373, 0.67420824, 0.7986191 , 0.65832791, 0.69146637,
       0.66676006, 0.68070927, 0.81050304, 0.68962205, 0.68846127])

In [69]:
scores_accuracy.mean()

0.7056721029753373

In [70]:
#Comparable at first glance: mean accuracy of about 0.71, versus about 0.69 for logistic regression
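
A fold-by-fold comparison gives a slightly sharper picture than the two means alone. The sketch below uses only the two score arrays printed earlier in the notebook. It assumes both runs were scored on the same folds, which should hold here because an integer `cv` splits a classifier's data without shuffling.

```python
import numpy as np

# Fold accuracies copied from the notebook outputs.
logistic_scores = np.array([0.68006204, 0.68519538, 0.772312, 0.63826487, 0.67971861,
                            0.65494226, 0.65654332, 0.80870183, 0.69311439, 0.68041588])
boosting_scores = np.array([0.69804373, 0.67420824, 0.7986191, 0.65832791, 0.69146637,
                            0.66676006, 0.68070927, 0.81050304, 0.68962205, 0.68846127])

# Paired difference per fold: positive means boosting scored higher.
diff = boosting_scores - logistic_scores
wins = int((diff > 0).sum())

print(f"mean difference = {diff.mean():.4f}")
print(f"boosting ahead in {wins} of {len(diff)} folds")
print(f"fold-to-fold std: logistic = {logistic_scores.std(ddof=1):.3f}, "
      f"boosting = {boosting_scores.std(ddof=1):.3f}")
```

Boosting is ahead in most folds, but the average gain of roughly one percentage point is small next to the variation between folds.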