### **1 - The Problem with Surf Forecasting**

I am a Premium subscriber to Surfline, an online surf forecasting service. For $90 per year I have access to a worldwide list of surf locations along with live cameras, current condition reports, forecasts, charts, and other data. It's a fantastic tool which I use on a daily basis for scheduling my surf sessions in advance.

Unfortunately, Surfline falls short when it comes to predicting the quality of surf at a specific location. Real-time local conditions often differ from what Surfline has reported or forecast. This can be either a pleasant surprise or an unexpected letdown depending on the situation.

Regardless of what Surfline says, I usually decide whether to surf based on what I can see on a live camera or by going to the beach. Surfline informs my general expectations, but there is rarely certainty until I can see the conditions in real time. I often wonder whether it is possible to build a more locally reliable model.

### **2 - Data Selection, Cleaning, and Exploratory Analysis**

The datasets used in this project come from noaa.gov and wblivesurf.com. NOAA provides archived swell and wind data. The "fun" rankings (increments of 0.5 on a scale of 1 through 5) are sourced from the WB Live Surf Report archive.

In [None]:
#library imports
import pandas as pd
import datetime
import numpy as np
import sklearn as sk
from sklearn.model_selection import train_test_split

In [None]:
#import data
#the NOAA buoy files are whitespace-delimited; sep=r'\s+' replaces the deprecated delim_whitespace=True
#pd.read_csv already returns a DataFrame, so no pd.DataFrame() wrapper is needed
dfwaves = pd.read_csv('https://www.ndbc.noaa.gov/view_text_file.php?filename=41110h2019.txt.gz&dir=data/historical/stdmet/', sep=r'\s+')

dfwind = pd.read_csv('https://www.ndbc.noaa.gov/view_text_file.php?filename=41037h2019.txt.gz&dir=data/historical/stdmet/', sep=r'\s+')

dfrank = pd.read_csv('https://raw.githubusercontent.com/Neiswender/2019_rating/main/dfwaves.csv%20-%20Copy%20of%20dfwaves.csv.csv')

In [None]:
#dropping redundant or unnecessary rows and columns
#row 0 is the second header line of the NOAA file, which holds units rather than readings
dfwaves = dfwaves.drop(0)
dfwaves = dfwaves[['#YY','MM','DD','WVHT','DPD','APD','MWD','WTMP']]

dfwind = dfwind.drop(0)
dfwind = dfwind[['#YY','MM','DD','WDIR','WSPD','GST','PRES','ATMP']]

#changing column names for easier interpretation 
dfwaves = dfwaves.rename(columns={'#YY':'year','MM':'month','DD':'day','WVHT':'Wave_Height','DPD':'Dominant_Period',
                                  'APD':'Average_Period',
                                  'MWD':'Mean_Wave_Direction','WTMP':'Water_Temp'})

dfwind = dfwind.rename(columns={'#YY':'year','MM':'month','DD':'day','WDIR':'Wind_Direction','WSPD':'Wind_Speed',
                       'GST':'Gust','PRES':'Pressure','ATMP':'Air_Temp'})

#consolidating data into an average of the readings for each day
dfwaves['date'] = pd.to_datetime(dfwaves[['year','month','day']])
dfwaves = dfwaves.astype({'Wave_Height':float,'Dominant_Period':float,'Average_Period':float,'Mean_Wave_Direction':float,'Water_Temp':float})
dfwaves = dfwaves.groupby(['date']).mean()

dfwind['date'] = pd.to_datetime(dfwind[['year','month','day']])
dfwind = dfwind.astype({'Wind_Direction':float,'Wind_Speed':float,'Gust':float,'Pressure':float,'Air_Temp':float})
dfwind = dfwind.groupby(['date']).mean()

dfrank['date'] = pd.to_datetime(dfrank['date'])

#changing the units to imperial
dfwaves['Wave_Height'] = dfwaves['Wave_Height']*3.28084
dfwaves['Water_Temp'] = (dfwaves['Water_Temp']*(9/5))+32
dfwind['Air_Temp'] = (dfwind['Air_Temp']*(9/5))+32

#combining this data for the model to use
df_wind_waves = dfwaves.merge(dfwind,on='date')
df = df_wind_waves.merge(dfrank,on='date')
df.describe()



| | Wave_Height | Dominant_Period | Average_Period | Mean_Wave_Direction | Water_Temp | Wind_Direction | Wind_Speed | Gust | Pressure | Air_Temp | rating |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 365.0 | 365.0 | 365.0 | 365.0 | 365.0 | 365.0 | 365.0 | 365.0 | 365.0 | 365.0 | 365.0 |
| mean | 3.022263 | 7.718357 | 4.846274 | 128.639242 | 74.714136 | 178.308706 | 6.46164 | 9.109527 | 1236.886108 | 68.125871 | 1.235616 |
| std | 1.181778 | 2.100442 | 0.774892 | 27.380304 | 95.091684 | 77.206666 | 2.533922 | 3.480191 | 1342.624098 | 11.539243 | 0.851206 |
| min | 0.984936 | 3.912708 | 3.190417 | 66.916667 | 51.2375 | 29.916667 | 1.683333 | 2.941667 | 1002.804167 | 37.4375 | 0.0 |
| 25% | 2.217301 | 6.131042 | 4.277917 | 108.270833 | 56.9 | 109.25 | 4.566667 | 6.816667 | 1014.3875 | 58.984348 | 0.5 |
| 50% | 2.74702 | 7.362917 | 4.70875 | 125.458333 | 70.37375 | 191.375 | 6.275 | 8.816667 | 1017.745833 | 69.6275 | 1.0 |
| 75% | 3.523485 | 8.888542 | 5.341458 | 146.217391 | 79.95125 | 243.583333 | 8.066667 | 11.108333 | 1021.033333 | 78.56 | 2.0 |
| max | 8.994286 | 15.12875 | 7.355417 | 222.25 | 1793.76875 | 334.875 | 17.7125 | 32.854167 | 9999.0 | 134.72 | 4.0 |
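
The maxima in the summary above (a water temperature near 1794 °F, a pressure of 9999, an air temperature of 134.72 °F) look like placeholder codes rather than real readings. NOAA buoy files appear to mark missing measurements with values such as 999 and 9999, and those codes would then be folded into the daily averages. Below is a minimal sketch of masking them before any conversion or averaging. The small `raw` frame and the `missing_codes` mapping are assumptions made for illustration, not the real buoy data, and the exact code for each column should be checked against NOAA's format notes.

```python
import pandas as pd

# Hypothetical raw readings in metric units, before any conversion.
raw = pd.DataFrame({
    'Water_Temp': [13.4, 999.0, 13.6],
    'Pressure': [1018.2, 9999.0, 1017.9],
    'Air_Temp': [16.1, 999.0, 15.8],
})

# Assumed per-column missing-data codes (verify against the NOAA docs).
missing_codes = {'Water_Temp': 999.0, 'Pressure': 9999.0, 'Air_Temp': 999.0}

clean = raw.copy()
for column, code in missing_codes.items():
    # Series.mask replaces values where the condition is True with NaN.
    clean[column] = clean[column].mask(clean[column] == code)

# mean() skips NaN by default, so placeholders no longer inflate the averages.
daily_mean = clean.mean()
print(daily_mean)
```

In this notebook the equivalent step would sit before the unit conversion and the daily `groupby`, so the codes never reach the averages.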


In [None]:
#engineer target feature to create binary classification problem
#1 = "Go Surf" (rating of 1.5 or higher), 0 = "Don't Go"
df['Go_Surf'] = (df['rating'] >= 1.5).astype(int)


In [None]:
#drop rating column to avoid leakage
df = df.drop(labels='rating',axis=1)
df

| | date | Wave_Height | Dominant_Period | Average_Period | Mean_Wave_Direction | Water_Temp | Wind_Direction | Wind_Speed | Gust | Pressure | Air_Temp | Go_Surf |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2019-01-01 | 2.747020 | 7.541875 | 4.706875 | 147.333333 | 56.071250 | 244.625000 | 6.541667 | 8.879167 | 1018.583333 | 61.67750 | 1 |
| 1 | 2019-01-02 | 2.667733 | 5.672500 | 4.644375 | 148.656250 | 56.738750 | 141.375000 | 4.756250 | 7.131250 | 1020.537500 | 58.48250 | 0 |
| 2 | 2019-01-03 | 2.068980 | 7.071042 | 5.083333 | 125.458333 | 56.900000 | 190.041667 | 2.816667 | 4.300000 | 1017.275000 | 57.83000 | 0 |
| 3 | 2019-01-04 | 2.731474 | 6.062766 | 4.487021 | 156.765957 | 56.882128 | 76.739130 | 4.043478 | 5.847826 | 1014.630435 | 58.85913 | 0 |
| 4 | 2019-01-05 | 3.824913 | 7.145625 | 4.485625 | 160.354167 | 57.777500 | 265.583333 | 9.483333 | 13.291667 | 1006.887500 | 58.63250 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 360 | 2019-12-27 | 2.922740 | 8.458511 | 5.540000 | 123.297872 | 62.148085 | 144.458333 | 4.479167 | 6.454167 | 1024.270833 | 68.24000 | 1 |
| 361 | 2019-12-28 | 2.451744 | 7.962292 | 5.666875 | 127.083333 | 62.596250 | 94.125000 | 2.891667 | 4.304167 | 1023.200000 | 67.34000 | 1 |
| 362 | 2019-12-29 | 2.729932 | 7.876875 | 5.356042 | 134.666667 | 61.640000 | 185.875000 | 4.545833 | 6.416667 | 1020.258333 | 68.90750 | 0 |
| 363 | 2019-12-30 | 4.690918 | 7.105417 | 5.128750 | 156.458333 | 59.558750 | 231.125000 | 9.012500 | 12.033333 | 1011.150000 | 66.76250 | 1 |
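
The Wind_Direction and Mean_Wave_Direction values in the table above are arithmetic daily means of compass bearings. Bearings wrap around at 360°, so readings of 350° and 10° average to 180° even though both point roughly north. One common alternative is a circular mean, which averages the unit vectors instead of the raw angles. The helper `circular_mean_deg` below is a hypothetical name introduced for this sketch; it is not part of the original notebook.

```python
import numpy as np

def circular_mean_deg(degrees):
    """Circular mean of compass bearings, wrapped to the 0-360 range."""
    radians = np.deg2rad(np.asarray(degrees, dtype=float))
    # Average the sine and cosine components, then convert back to an angle.
    mean_sin = np.sin(radians).mean()
    mean_cos = np.cos(radians).mean()
    return float(np.rad2deg(np.arctan2(mean_sin, mean_cos)) % 360)

readings = [350.0, 10.0]
arithmetic = float(np.mean(readings))   # points south
circular = circular_mean_deg(readings)  # within rounding error of north
print(arithmetic, circular)
```

A function like this could be passed to `groupby(...).agg(...)` for the two direction columns while the remaining columns keep the ordinary mean. Whether that changes the model's results here is untested.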


In [None]:
#index the frame by date (this also averages any duplicate dates)
df = df.groupby(['date']).mean()
df


| date | Wave_Height | Dominant_Period | Average_Period | Mean_Wave_Direction | Water_Temp | Wind_Direction | Wind_Speed | Gust | Pressure | Air_Temp | Go_Surf |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 2019-01-01 | 2.747020 | 7.541875 | 4.706875 | 147.333333 | 56.071250 | 244.625000 | 6.541667 | 8.879167 | 1018.583333 | 61.67750 | 1 |
| 2019-01-02 | 2.667733 | 5.672500 | 4.644375 | 148.656250 | 56.738750 | 141.375000 | 4.756250 | 7.131250 | 1020.537500 | 58.48250 | 0 |
| 2019-01-03 | 2.068980 | 7.071042 | 5.083333 | 125.458333 | 56.900000 | 190.041667 | 2.816667 | 4.300000 | 1017.275000 | 57.83000 | 0 |
| 2019-01-04 | 2.731474 | 6.062766 | 4.487021 | 156.765957 | 56.882128 | 76.739130 | 4.043478 | 5.847826 | 1014.630435 | 58.85913 | 0 |
| 2019-01-05 | 3.824913 | 7.145625 | 4.485625 | 160.354167 | 57.777500 | 265.583333 | 9.483333 | 13.291667 | 1006.887500 | 58.63250 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2019-12-27 | 2.922740 | 8.458511 | 5.540000 | 123.297872 | 62.148085 | 144.458333 | 4.479167 | 6.454167 | 1024.270833 | 68.24000 | 1 |
| 2019-12-28 | 2.451744 | 7.962292 | 5.666875 | 127.083333 | 62.596250 | 94.125000 | 2.891667 | 4.304167 | 1023.200000 | 67.34000 | 1 |
| 2019-12-29 | 2.729932 | 7.876875 | 5.356042 | 134.666667 | 61.640000 | 185.875000 | 4.545833 | 6.416667 | 1020.258333 | 68.90750 | 0 |
| 2019-12-30 | 4.690918 | 7.105417 | 5.128750 | 156.458333 | 59.558750 | 231.125000 | 9.012500 | 12.033333 | 1011.150000 | 66.76250 | 1 |


In [None]:
df['Go_Surf'].value_counts(normalize=True)

0    0.569863
1    0.430137
Name: Go_Surf, dtype: float64

### **3 - A Binary Classification Problem**

Target: Go Surf
*   How - Filtering qualitative rankings to select days that have at least a 1.5 fun factor
*   Why - Our positive results represent the best days to target for a surf session. 

Metric: Accuracy
*   How/Why - This is a clear first choice as a metric because my target classes are roughly balanced (about 57% "Don't Go" to 43% "Go Surf")
 
Baseline: "Don't Go"
*   How - The surf quality is considered a "Don't Go" if the fun factor rating is 1 or less. There may be surfable waves and some opportunity for good rides, but overall this is not a day to target for a surf.
*   Why - Always predicting the majority class (about 57% of days) sets the minimum accuracy that a useful model needs to beat




In [None]:
#select features and target
features = df.columns.drop('Go_Surf')
target = 'Go_Surf'
X = df[features]
y = df[target]

In [None]:
#divide into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.2, random_state=42)

In [None]:
#establish baseline
#majority-class ("Don't Go") share over the full dataset, 0.569863 from value_counts, rounded
baseline_acc = 0.5699
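
The hard-coded baseline is the "Don't Go" share across all 365 days. A baseline can also be derived from the training labels and scored on the same test split as the models, which keeps the comparison like-for-like. Below is a minimal sketch using scikit-learn's `DummyClassifier`. The `X_demo` and `y_demo` arrays are synthetic stand-ins for the notebook's `X` and `y`, so the number it prints is not a result for the real data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in labels: 57 "Don't Go" days and 43 "Go Surf" days.
y_demo = np.array([0] * 57 + [1] * 43)
X_demo = np.zeros((len(y_demo), 1))  # the dummy model ignores the features

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Always predicts the most frequent class seen in the training labels.
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X_tr, y_tr)
baseline_test_acc = baseline.score(X_te, y_te)
print('Baseline test accuracy:', baseline_test_acc)
```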

In [None]:
#instantiate our logistic regression model
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
model = make_pipeline(
    LogisticRegression(random_state=42, max_iter=3000)
)

model.fit(X_train,y_train)

Pipeline(memory=None,
         steps=[('logisticregression',
                 LogisticRegression(C=1.0, class_weight=None, dual=False,
                                    fit_intercept=True, intercept_scaling=1,
                                    l1_ratio=None, max_iter=3000,
                                    multi_class='auto', n_jobs=None,
                                    penalty='l2', random_state=42,
                                    solver='lbfgs', tol=0.0001, verbose=0,
                                    warm_start=False))],
         verbose=False)
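
The pipeline above has a single step, and the features sit on very different scales (pressure near 1000, wave periods near 5), which is presumably why `max_iter` had to be raised. A common variation puts a `StandardScaler` ahead of the classifier. The sketch below shows that pattern on synthetic data. The feature values and the labelling rule are made up for illustration, and whether scaling changes the accuracy on the real surf data is untested.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Synthetic stand-ins for two features on very different scales.
wave_height = rng.normal(3.0, 1.2, size=200)
pressure = rng.normal(1017.0, 5.0, size=200)
X_demo = np.column_stack([wave_height, pressure])
# Made-up rule for illustration only: bigger waves mean "go".
y_demo = (wave_height > 3.0).astype(int)

scaled_model = make_pipeline(
    StandardScaler(),  # rescales each feature to zero mean and unit variance
    LogisticRegression(random_state=42),
)
scaled_model.fit(X_demo, y_demo)
demo_acc = scaled_model.score(X_demo, y_demo)
print('Accuracy on the synthetic data:', demo_acc)
```

Because the scaler lives inside the pipeline, it is fitted only on whatever data is passed to `fit`, so test rows do not leak into the scaling statistics.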

In [None]:
#check individual predictions
#named query_date rather than input to avoid shadowing Python's built-in input()
query_date = '2019-03-20'
example = X.loc[query_date:query_date]


In [None]:
model.predict(example)

array([1])

In [None]:
#How does the model score?
train_acc = model.score(X_train,y_train)
test_acc = model.score(X_test,y_test)
print('Training Accuracy Score:', train_acc)
print('Testing Accuracy Score:', test_acc)

Training Accuracy Score: 0.8116438356164384
Testing Accuracy Score: 0.7945205479452054
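
With 365 days in total, the 20% test split holds only 73 of them, so a single test score can shift noticeably with the random split. Cross-validation gives a rough sense of that spread. Below is a minimal sketch with `cross_val_score` on synthetic data. `X_demo` and `y_demo` are made-up stand-ins, so the scores say nothing about the surf model itself.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Synthetic stand-in: 365 rows, 3 features, target tied to the first feature.
X_demo = rng.normal(size=(365, 3))
noise = rng.normal(scale=0.5, size=365)
y_demo = (X_demo[:, 0] + noise > 0).astype(int)

# Five folds: each row is used for testing exactly once.
scores = cross_val_score(
    LogisticRegression(max_iter=3000), X_demo, y_demo, cv=5, scoring='accuracy'
)
mean_acc, spread = scores.mean(), scores.std()
print('Fold accuracies:', scores)
print('Mean:', mean_acc, 'Std:', spread)
```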


In [None]:
#Random Forest Classifier (no random_state is set, so results vary slightly between runs)
from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier()
forest.fit(X_train,y_train)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)

In [None]:
#scoring our random forest model
forest.score(X_test,y_test)

0.8493150684931506

In [None]:
forest.feature_importances_

array([0.25772261, 0.09596235, 0.19110527, 0.05845598, 0.06932768,
       0.07313549, 0.06290378, 0.07426271, 0.05771827, 0.05940587])
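
The importances above come back as a bare array in the column order of `X`. Pairing them with the feature names makes the ranking readable. The sketch below reuses the printed values and assumes the column order shown in the earlier tables. In the notebook itself, `pd.Series(forest.feature_importances_, index=X.columns)` would do the same without retyping anything.

```python
import pandas as pd

# Feature names in the column order of X, and the importances printed above.
feature_names = ['Wave_Height', 'Dominant_Period', 'Average_Period',
                 'Mean_Wave_Direction', 'Water_Temp', 'Wind_Direction',
                 'Wind_Speed', 'Gust', 'Pressure', 'Air_Temp']
importances = [0.25772261, 0.09596235, 0.19110527, 0.05845598, 0.06932768,
               0.07313549, 0.06290378, 0.07426271, 0.05771827, 0.05940587]

# Label each importance and sort from most to least influential.
ranked = pd.Series(importances, index=feature_names).sort_values(ascending=False)
top_three = list(ranked.index[:3])
print(ranked)
```

On these numbers, wave height and the two period features carry a little over half of the total importance, with the remaining seven features each contributing between roughly 6% and 7%.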