<p style="font-family: Arial; font-size:2.75em;color:purple; font-style:bold">

Classification of Weather Data <br><br>
using scikit-learn
<br><br>
</p>

<p style="font-family: Arial; font-size:1.75em;color:purple; font-style:bold"><br>
Daily Weather Data Analysis</p>

In this notebook, we will use scikit-learn to perform a decision tree-based classification of weather data.

<p style="font-family: Arial; font-size:1.75em;color:purple; font-style:bold"><br>

Importing the Necessary Libraries<br></p>

In [3]:
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

<p style="font-family: Arial; font-size:1.75em;color:purple; font-style:bold"><br>

Creating a Pandas DataFrame from a CSV file<br></p>


In [38]:
data = pd.read_csv('data.csv')

<p style="font-family: Arial; font-size:1.75em;color:purple; font-style:bold">Daily Weather Data Description</p>
<br>
The file **data.csv** is a comma-separated file that contains weather data. The data comes from the National Oceanic and Atmospheric Administration (NOAA) and was retrieved using the zip code for Cape Canaveral, Florida. It contains precipitation readings for 2010-2011 and 2018-2019, with the two periods aligned by calendar date.<br><br>
Let's now check all the columns in the data.

In [15]:
data.columns

Index(['STATION', 'NAME', 'DATE', 'PRCP_2010-2011', 'DATE.1',
       'PRCP_2018-2019'],
      dtype='object')

<br>Each row in data.csv pairs the precipitation recorded on one day in 2010-2011 with the precipitation recorded on the same calendar day in 2018-2019.  <br><br>
Sensor measurements from the weather station were captured at hourly intervals. These measurements were then processed to generate daily values. The goal is to predict whether a day in 2018-2019 has moderate precipitation, using the precipitation recorded on the matching day in 2010-2011 as the only feature.

Each row, or sample, consists of the following variables:

* **STATION:** The identifier of the station where the data was collected
* **NAME:** The name of the station's location
* **DATE:** The date of the 2010-2011 observation
* **DATE.1:** The date of the matching 2018-2019 observation
* **PRCP_2010-2011:** The precipitation collected in 2010-2011, measured in standard units
* **PRCP_2018-2019:** The precipitation collected in 2018-2019, measured in standard units
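
The two date columns are meant to refer to the same calendar day in different years. A minimal sketch of how that alignment could be checked is shown here. The three-row `sample` frame is a hypothetical stand-in built from the first rows of data.csv as displayed in this notebook, and the names `first`, `second`, `same_day`, and `year_gap` are introduced only for illustration.

```python
import pandas as pd

# Hypothetical three-row sample; the date strings match the first rows
# of data.csv as displayed in this notebook.
sample = pd.DataFrame({
    "DATE": ["1/2/2010", "1/3/2010", "1/4/2010"],
    "DATE.1": ["1/2/2018", "1/3/2018", "1/4/2018"],
})

# Parse the month/day/year strings into datetimes.
first = pd.to_datetime(sample["DATE"], format="%m/%d/%Y")
second = pd.to_datetime(sample["DATE.1"], format="%m/%d/%Y")

# Rows are aligned when month and day match; the year gap should be constant.
same_day = (first.dt.month == second.dt.month) & (first.dt.day == second.dt.day)
year_gap = second.dt.year - first.dt.year

print(same_day.all())
print(year_gap.unique())
```

Running the same check on the full `data` frame would flag any row where the two periods are not matched by calendar day.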


In [16]:
data

Unnamed: 0,STATION,NAME,DATE,PRCP_2010-2011,DATE.1,PRCP_2018-2019
0,US1FLBV0012,"CAPE CANAVERAL 0.6 ESE, FL US",1/2/2010,0.00,1/2/2018,0.66
1,US1FLBV0012,"CAPE CANAVERAL 0.6 ESE, FL US",1/3/2010,0.00,1/3/2018,1.22
2,US1FLBV0012,"CAPE CANAVERAL 0.6 ESE, FL US",1/4/2010,0.00,1/4/2018,0.22
3,US1FLBV0012,"CAPE CANAVERAL 0.6 ESE, FL US",1/8/2010,0.00,1/8/2018,0.00
4,US1FLBV0012,"CAPE CANAVERAL 0.6 ESE, FL US",3/7/2010,0.00,3/7/2018,0.00
...,...,...,...,...,...,...
116,US1FLBV0012,"CAPE CANAVERAL 0.6 ESE, FL US",9/4/2011,0.00,9/4/2019,2.14
117,US1FLBV0012,"CAPE CANAVERAL 0.6 ESE, FL US",9/6/2011,0.06,9/6/2019,0.00
118,US1FLBV0012,"CAPE CANAVERAL 0.6 ESE, FL US",9/7/2011,0.20,9/7/2019,0.00
119,US1FLBV0012,"CAPE CANAVERAL 0.6 ESE, FL US",9/8/2011,0.46,9/8/2019,0.00


In [17]:
data[data.isnull().any(axis=1)]

Unnamed: 0,STATION,NAME,DATE,PRCP_2010-2011,DATE.1,PRCP_2018-2019
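
The empty result above means no row of `data` contains a missing value. To show what the same expression returns when values are missing, here is a small sketch on a made-up frame. The `toy` frame and its `NaN` entry are assumptions for illustration and do not come from data.csv.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value inserted on purpose.
toy = pd.DataFrame({
    "PRCP_2010-2011": [0.00, np.nan, 0.20],
    "PRCP_2018-2019": [0.66, 1.22, 0.00],
})

# isnull() marks each missing cell; any(axis=1) reduces across columns,
# so the mask is True for rows with at least one missing value.
rows_with_nulls = toy[toy.isnull().any(axis=1)]

print(rows_with_nulls)
print(len(rows_with_nulls))
```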


<p style="font-family: Arial; font-size:1.75em;color:purple; font-style:bold">
Convert to a Classification Task <br><br></p>
Create a binary label, moderate_precipitation: 1 if the 2018-2019 precipitation is greater than 2.5, and 0 otherwise.<br>


In [39]:
clean_data = data.copy()
clean_data['moderate_precipitation'] = (clean_data['PRCP_2018-2019'] > 2.5)*1
print(clean_data['moderate_precipitation'])

0      0
1      0
2      0
3      0
4      0
      ..
116    1
117    0
118    0
119    0
120    0
Name: moderate_precipitation, Length: 121, dtype: int32
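
The thresholding step above can also be written with `astype(int)`, which some readers find more explicit than multiplying by 1. The sketch below uses a short made-up series, not the real column, and also counts the resulting classes. `THRESHOLD`, `prcp`, and `label` are illustrative names.

```python
import pandas as pd

THRESHOLD = 2.5  # same cutoff as the notebook cell above

# Hypothetical precipitation values chosen to sit on both sides of the cutoff.
prcp = pd.Series([0.66, 1.22, 0.00, 2.50, 3.10], name="PRCP_2018-2019")

# The comparison yields booleans; astype(int) maps False/True to 0/1.
label = (prcp > THRESHOLD).astype(int).rename("moderate_precipitation")

print(label.tolist())
print(label.value_counts())
```

Because the comparison is strict, a value exactly equal to 2.5 is labeled 0.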


<p style="font-family: Arial; font-size:1.75em;color:purple; font-style:bold"><br>

Target is stored in 'y'.
<br><br></p>


In [40]:
y = clean_data[['moderate_precipitation']].copy()
y

Unnamed: 0,moderate_precipitation
0,0
1,0
2,0
3,0
4,0
...,...
116,1
117,0
118,0
119,0


In [41]:
clean_data['PRCP_2018-2019'].head()

0    0.66
1    1.22
2    0.22
3    0.00
4    0.00
Name: PRCP_2018-2019, dtype: float64

In [42]:
y.head()

Unnamed: 0,moderate_precipitation
0,0
1,0
2,0
3,0
4,0


<p style="font-family: Arial; font-size:1.75em;color:purple; font-style:bold"><br>

Use 2010-2011 Precipitation as the Feature to Predict Precipitation in 2018-2019
<br><br></p>


In [43]:
precipitation_2010_2011 = ['PRCP_2010-2011']

In [44]:
X = clean_data[precipitation_2010_2011].copy()

In [45]:
X.columns

Index(['PRCP_2010-2011'], dtype='object')

In [46]:
y.columns

Index(['moderate_precipitation'], dtype='object')

<p style="font-family: Arial; font-size:1.75em;color:purple; font-style:bold"><br>

Perform Train and Test Split

<br><br></p>



In [47]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=324)

In [48]:
#type(X_train)
#type(X_test)
#type(y_train)
#type(y_test)
#X_train.head()
y_train.describe()

Unnamed: 0,moderate_precipitation
count,81.0
mean,0.037037
std,0.190029
min,0.0
25%,0.0
50%,0.0
75%,0.0
max,1.0
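
The mean of about 0.037 over 81 training rows corresponds to roughly three positive labels, so the classes are heavily imbalanced. One option, which this notebook does not use, is the `stratify` argument of `train_test_split`, which keeps the class proportions similar in both splits. The sketch below runs on synthetic data. The frame `X_demo`, the labels `y_demo`, and the count of six positives are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 121 rows with 6 positive labels (an assumed count).
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({"PRCP_2010-2011": rng.random(121)})
y_demo = pd.Series([0] * 115 + [1] * 6, name="moderate_precipitation")

# stratify=y_demo asks for similar class proportions in train and test.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=324, stratify=y_demo
)

print(len(y_tr), len(y_te))
print(int(y_tr.sum()), int(y_te.sum()))
```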


<p style="font-family: Arial; font-size:1.75em;color:purple; font-style:bold"><br>

Fit on Train Set
<br><br></p>


In [49]:
precipitation_classifier = DecisionTreeClassifier(max_leaf_nodes=10, random_state=0)
precipitation_classifier.fit(X_train, y_train)

DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
                       max_depth=None, max_features=None, max_leaf_nodes=10,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=0, splitter='best')

In [50]:
type(precipitation_classifier)

sklearn.tree._classes.DecisionTreeClassifier
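
To see which thresholds a fitted tree has learned, scikit-learn provides `export_text` in `sklearn.tree`. The following sketch fits a tree with the same settings on a small made-up data set and prints its rules. `X_demo`, `y_demo`, and `demo_tree` are hypothetical names, and the values are not taken from data.csv.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical, cleanly separable data with a single feature.
X_demo = pd.DataFrame(
    {"PRCP_2010-2011": [0.0, 0.0, 0.1, 0.2, 0.0, 1.5, 2.0, 3.0]}
)
y_demo = pd.Series([0, 0, 0, 0, 0, 1, 1, 1], name="moderate_precipitation")

# Same hyperparameters as the classifier in this notebook.
demo_tree = DecisionTreeClassifier(max_leaf_nodes=10, random_state=0)
demo_tree.fit(X_demo, y_demo)

# export_text renders the learned splits as indented if/else rules.
rules = export_text(demo_tree, feature_names=list(X_demo.columns))
print(rules)
print(demo_tree.get_n_leaves())
```

`max_leaf_nodes=10` is only an upper bound; the tree stops growing earlier once its leaves are pure.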

<p style="font-family: Arial; font-size:1.75em;color:purple; font-style:bold"><br>

Predict on Test Set 

<br><br></p>


In [51]:
predictions = precipitation_classifier.predict(X_test)

In [52]:
predictions[:10]

array([0, 0, 1, 0, 0, 0, 0, 0, 0, 0])

In [53]:
y_test['moderate_precipitation'][:10]

51     0
39     0
10     0
41     0
33     0
109    0
60     0
45     0
11     0
93     0
Name: moderate_precipitation, dtype: int32
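
Comparing the two outputs above by eye is error-prone, so it can help to place actual and predicted labels side by side. The sketch below does this for the first five test rows, copying the index values and labels from the outputs above. The names `y_true_demo`, `pred_demo`, `comparison`, and `mismatches` are illustrative.

```python
import numpy as np
import pandas as pd

# First five actual labels and predictions, copied from the outputs above.
y_true_demo = pd.Series([0, 0, 0, 0, 0], index=[51, 39, 10, 41, 33], name="actual")
pred_demo = np.array([0, 0, 1, 0, 0])

# Predictions are positional, so attach them as a new column on the
# frame that carries the original row index.
comparison = y_true_demo.to_frame()
comparison["predicted"] = pred_demo

# Keep only the rows where the classifier disagrees with the label.
mismatches = comparison[comparison["actual"] != comparison["predicted"]]
print(comparison)
print(mismatches)
```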

<p style="font-family: Arial; font-size:1.75em;color:purple; font-style:bold"><br>

Measure Accuracy of the Classifier
<br><br></p>


In [55]:
accuracy_score(y_true=y_test, y_pred=predictions)

0.975
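
Because positive days are rare in this data (the training labels have a mean of about 0.037), a high accuracy can be reached even by a model that always predicts 0. A hedged way to put the score in context is to compare it with that majority-class baseline and to look at the confusion matrix. The arrays below are hypothetical and are not the notebook's actual test set.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical test labels: 38 negatives and 2 positives.
y_true_demo = np.array([0] * 38 + [1] * 2)
# Hypothetical predictions: one false positive, one missed positive.
y_pred_demo = np.array([0] * 37 + [1] + [0, 1])

model_acc = accuracy_score(y_true_demo, y_pred_demo)
# Baseline that always predicts the majority class (0).
baseline_acc = accuracy_score(y_true_demo, np.zeros_like(y_true_demo))

# Rows are actual classes, columns are predicted classes.
cm = confusion_matrix(y_true_demo, y_pred_demo)

print(model_acc, baseline_acc)
print(cm)
```

In this made-up case the classifier and the always-0 baseline reach the same accuracy, which only the confusion matrix reveals.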