<p style="font-family: Arial; font-size:2.75em;color:purple; font-style:bold">
Classification of Weather Data using Decision Trees
</p>

<p style="font-family: Arial; font-size:1.75em;color:purple; font-style:bold"><br>
Daily Weather Data Analysis</p>

In this notebook, we will use scikit-learn to perform a decision-tree-based classification of weather data.

<p style="font-family: Arial; font-size:1.75em;color:purple; font-style:bold"><br>
Importing the Necessary Libraries<br></p>

In [14]:
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

<p style="font-family: Arial; font-size:1.75em;color:purple; font-style:bold"><br>
Creating a Pandas DataFrame from a CSV file<br></p>


In [16]:
data = pd.read_csv('daily_weather.csv')
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1095 entries, 0 to 1094
Data columns (total 11 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   number                  1095 non-null   int64  
 1   air_pressure_9am        1092 non-null   float64
 2   air_temp_9am            1090 non-null   float64
 3   avg_wind_direction_9am  1091 non-null   float64
 4   avg_wind_speed_9am      1092 non-null   float64
 5   max_wind_direction_9am  1092 non-null   float64
 6   max_wind_speed_9am      1091 non-null   float64
 7   rain_accumulation_9am   1089 non-null   float64
 8   rain_duration_9am       1092 non-null   float64
 9   relative_humidity_9am   1095 non-null   float64
 10  relative_humidity_3pm   1095 non-null   float64
dtypes: float64(10), int64(1)
memory usage: 94.2 KB


<p style="font-family: Arial; font-size:1.75em;color:purple; font-style:bold">Daily Weather Data Description</p>


The file **daily_weather.csv** is a comma-separated file that contains weather data.  This data comes from a weather station located in San Diego, California.  The weather station is equipped with sensors that capture weather-related measurements such as air temperature, air pressure, and relative humidity.  Data was collected for a period of three years, from September 2011 to September 2014, to ensure that sufficient data for different seasons and weather conditions is captured.<br><br>
Let's now check all the columns in the data.

In [18]:
data.columns

Index(['number', 'air_pressure_9am', 'air_temp_9am', 'avg_wind_direction_9am',
       'avg_wind_speed_9am', 'max_wind_direction_9am', 'max_wind_speed_9am',
       'rain_accumulation_9am', 'rain_duration_9am', 'relative_humidity_9am',
       'relative_humidity_3pm'],
      dtype='object')

In [19]:
data.head()

Unnamed: 0,number,air_pressure_9am,air_temp_9am,avg_wind_direction_9am,avg_wind_speed_9am,max_wind_direction_9am,max_wind_speed_9am,rain_accumulation_9am,rain_duration_9am,relative_humidity_9am,relative_humidity_3pm
0,0,918.06,74.822,271.1,2.080354,295.4,2.863283,0.0,0.0,42.42,36.16
1,1,917.347688,71.403843,101.935179,2.443009,140.471548,3.533324,0.0,0.0,24.328697,19.426597
2,2,923.04,60.638,51.0,17.067852,63.7,22.100967,0.0,20.0,8.9,14.46
3,3,920.502751,70.138895,198.832133,4.337363,211.203341,5.190045,0.0,0.0,12.189102,12.742547
4,4,921.16,44.294,277.8,1.85666,136.5,2.863283,8.9,14730.0,92.41,76.74


<br>Each row in daily_weather.csv captures weather data for a separate day.  <br><br>
Sensor measurements from the weather station were captured at one-minute intervals.  These measurements were then processed to generate values that describe daily weather. Since this dataset was created to classify low-humidity days vs. non-low-humidity days (that is, days with normal or high humidity), the variables included are weather measurements in the morning, with one measurement, namely relative humidity, in the afternoon.  The idea is to use the morning weather values to predict whether the day will be low-humidity or not, based on the afternoon measurement of relative humidity.

Each row, or sample, consists of the following variables:

* **number:** unique number for each row
* **air_pressure_9am:** air pressure averaged over a period from 8:55am to 9:04am (*Unit: hectopascals*)
* **air_temp_9am:** air temperature averaged over a period from 8:55am to 9:04am (*Unit: degrees Fahrenheit*)
* **avg_wind_direction_9am:** wind direction averaged over a period from 8:55am to 9:04am (*Unit: degrees, with 0 meaning wind coming from the North, and increasing clockwise*)
* **avg_wind_speed_9am:** wind speed averaged over a period from 8:55am to 9:04am (*Unit: miles per hour*)
* **max_wind_direction_9am:** wind gust direction averaged over a period from 8:55am to 9:10am (*Unit: degrees, with 0 being North and increasing clockwise*)
* **max_wind_speed_9am:** wind gust speed averaged over a period from 8:55am to 9:04am (*Unit: miles per hour*)
* **rain_accumulation_9am:** amount of rain accumulated in the 24 hours prior to 9am (*Unit: millimeters*)
* **rain_duration_9am:** amount of time rain was recorded in the 24 hours prior to 9am (*Unit: seconds*)
* **relative_humidity_9am:** relative humidity averaged over a period from 8:55am to 9:04am (*Unit: percent*)
* **relative_humidity_3pm:** relative humidity averaged over a period from 2:55pm to 3:04pm (*Unit: percent*)


In [21]:
data.head()

Unnamed: 0,number,air_pressure_9am,air_temp_9am,avg_wind_direction_9am,avg_wind_speed_9am,max_wind_direction_9am,max_wind_speed_9am,rain_accumulation_9am,rain_duration_9am,relative_humidity_9am,relative_humidity_3pm
0,0,918.06,74.822,271.1,2.080354,295.4,2.863283,0.0,0.0,42.42,36.16
1,1,917.347688,71.403843,101.935179,2.443009,140.471548,3.533324,0.0,0.0,24.328697,19.426597
2,2,923.04,60.638,51.0,17.067852,63.7,22.100967,0.0,20.0,8.9,14.46
3,3,920.502751,70.138895,198.832133,4.337363,211.203341,5.190045,0.0,0.0,12.189102,12.742547
4,4,921.16,44.294,277.8,1.85666,136.5,2.863283,8.9,14730.0,92.41,76.74


### Checking Null Values

In [23]:
data.isnull().any()

number                    False
air_pressure_9am           True
air_temp_9am               True
avg_wind_direction_9am     True
avg_wind_speed_9am         True
max_wind_direction_9am     True
max_wind_speed_9am         True
rain_accumulation_9am      True
rain_duration_9am          True
relative_humidity_9am     False
relative_humidity_3pm     False
dtype: bool

In [24]:
data.isnull().any(axis = 1)

0       False
1       False
2       False
3       False
4       False
        ...  
1090    False
1091    False
1092    False
1093    False
1094    False
Length: 1095, dtype: bool

In [25]:
#any rows with null values?
filterRowWithNull = data.isnull().any(axis = 1)
data[filterRowWithNull].head()

Unnamed: 0,number,air_pressure_9am,air_temp_9am,avg_wind_direction_9am,avg_wind_speed_9am,max_wind_direction_9am,max_wind_speed_9am,rain_accumulation_9am,rain_duration_9am,relative_humidity_9am,relative_humidity_3pm
16,16,917.89,,169.2,2.192201,196.8,2.930391,0.0,0.0,48.99,51.19
111,111,915.29,58.82,182.6,15.613841,189.0,,0.0,0.0,21.5,29.69
177,177,915.9,,183.3,4.719943,189.9,5.346287,0.0,0.0,29.26,46.5
262,262,923.596607,58.380598,47.737753,10.636273,67.145843,13.671423,0.0,,17.990876,16.461685
277,277,920.48,62.6,194.4,2.751436,,3.869906,0.0,0.0,52.58,54.03


In [26]:
data[data.isnull().any(axis=1)].shape

(31, 11)

In [27]:
data.shape

(1095, 11)
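
The two `isnull().any(...)` calls above answer "which columns?" and "which rows?". As a minimal sketch of the same idea, the toy frame below (hypothetical values, not taken from daily_weather.csv) counts missing values per column and counts the rows that contain at least one missing value.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values; two cells are deliberately missing.
toy = pd.DataFrame({
    "air_temp_9am": [74.8, np.nan, 60.6, 70.1],
    "avg_wind_speed_9am": [2.1, 2.4, np.nan, 4.3],
    "relative_humidity_9am": [42.4, 24.3, 8.9, 12.2],
})

# Number of missing values in each column.
nulls_per_column = toy.isnull().sum()

# Number of rows that have at least one missing value.
rows_with_null = int(toy.isnull().any(axis=1).sum())

print(nulls_per_column)
print(rows_with_null)
```

Applied to `data`, the second count should agree with the 31 rows reported by `data[data.isnull().any(axis=1)].shape`.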

<p style="font-family: Arial; font-size:1.75em;color:purple; font-style:bold"><br>
Data Cleaning Steps<br></p>

We do not need the **number** column, since it is only a row identifier, so we can drop it.

In [29]:
del data['number']
print(data.shape)

(1095, 10)


Now let's drop null values using the pandas **dropna** function.

In [31]:
before_rows = data.shape[0]
print(before_rows)

1095


In [32]:
#dropping rows with at least one NaN
data = data.dropna()

In [33]:
after_rows = data.shape[0]
print(after_rows)

1064


<p style="font-family: Arial; font-size:1.75em;color:purple; font-style:bold"><br>
How many rows were dropped due to cleaning?<br><br></p>


In [35]:
before_rows - after_rows

31
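
`dropna()` with no arguments removes every row that has a missing value in any column, including columns that may not end up as features. The sketch below, on a toy frame with hypothetical values, contrasts that with the `subset` parameter, which only checks the listed columns. It is an optional variation, not what this notebook does.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values; one NaN in each of two columns.
toy = pd.DataFrame({
    "air_temp_9am": [74.8, np.nan, 60.6, 70.1],
    "max_wind_direction_9am": [295.4, 140.5, np.nan, 211.2],
    "relative_humidity_3pm": [36.2, 19.4, 14.5, 12.7],
})

# Drop rows with a NaN in any column (the approach used in this notebook).
dropped_any = toy.dropna()

# Drop rows only when the listed column is missing.
dropped_subset = toy.dropna(subset=["air_temp_9am"])

print(len(toy) - len(dropped_any))     # rows removed by the default call
print(len(toy) - len(dropped_subset))  # rows removed when checking one column
```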

In [36]:
data.shape

(1064, 10)

In [37]:
data.head()

Unnamed: 0,air_pressure_9am,air_temp_9am,avg_wind_direction_9am,avg_wind_speed_9am,max_wind_direction_9am,max_wind_speed_9am,rain_accumulation_9am,rain_duration_9am,relative_humidity_9am,relative_humidity_3pm
0,918.06,74.822,271.1,2.080354,295.4,2.863283,0.0,0.0,42.42,36.16
1,917.347688,71.403843,101.935179,2.443009,140.471548,3.533324,0.0,0.0,24.328697,19.426597
2,923.04,60.638,51.0,17.067852,63.7,22.100967,0.0,20.0,8.9,14.46
3,920.502751,70.138895,198.832133,4.337363,211.203341,5.190045,0.0,0.0,12.189102,12.742547
4,921.16,44.294,277.8,1.85666,136.5,2.863283,8.9,14730.0,92.41,76.74


<p style="font-family: Arial; font-size:1.75em;color:purple; font-style:bold">
Convert to a Classification Task <br></p>
Let's predict/classify the relative humidity values at 3pm by using some of the 9am sensor signals. <br>

Since a classification task predicts a categorical output, and relative_humidity_3pm is currently a numerical type, let's binarize relative_humidity_3pm to 0 or 1: if the humidity is greater than 24.99, we set the label to 1; otherwise, we set it to 0. <br>

**Inputs (features)**: some 9am sensor signals. <br>
**Output**: humidity level @3pm (1 or 0)
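
As a small sketch of the thresholding rule just described, the snippet below applies it to the first five relative_humidity_3pm values shown earlier. The name `THRESHOLD` is introduced here for illustration only. It also shows `.astype(int)` as an equivalent, arguably more explicit, alternative to multiplying a boolean Series by 1.

```python
import pandas as pd

# First five relative_humidity_3pm values from the notebook output.
humidity_3pm = pd.Series([36.16, 19.426597, 14.46, 12.742547, 76.74])

THRESHOLD = 24.99  # illustrative name; the value comes from the text above

# Comparison yields booleans; both lines below turn them into 0/1 integers.
label_mul = (humidity_3pm > THRESHOLD) * 1
label_astype = (humidity_3pm > THRESHOLD).astype(int)

print(label_astype.tolist())  # [1, 0, 0, 0, 1]
```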


In [39]:
data.head()

Unnamed: 0,air_pressure_9am,air_temp_9am,avg_wind_direction_9am,avg_wind_speed_9am,max_wind_direction_9am,max_wind_speed_9am,rain_accumulation_9am,rain_duration_9am,relative_humidity_9am,relative_humidity_3pm
0,918.06,74.822,271.1,2.080354,295.4,2.863283,0.0,0.0,42.42,36.16
1,917.347688,71.403843,101.935179,2.443009,140.471548,3.533324,0.0,0.0,24.328697,19.426597
2,923.04,60.638,51.0,17.067852,63.7,22.100967,0.0,20.0,8.9,14.46
3,920.502751,70.138895,198.832133,4.337363,211.203341,5.190045,0.0,0.0,12.189102,12.742547
4,921.16,44.294,277.8,1.85666,136.5,2.863283,8.9,14730.0,92.41,76.74


In [40]:
#unit for relative_humidity_3pm is a percent.
data['relative_humidity_3pm'].head()

0    36.160000
1    19.426597
2    14.460000
3    12.742547
4    76.740000
Name: relative_humidity_3pm, dtype: float64

In [41]:
print (False * 1)

0


In [42]:
print (True * 1)

1


In [43]:
data['relative_humidity_3pm'] > 24.99

0        True
1       False
2       False
3       False
4        True
        ...  
1090     True
1091     True
1092     True
1093     True
1094    False
Name: relative_humidity_3pm, Length: 1064, dtype: bool

In [44]:
clean_data = data.copy()
# The "*1" below converts a boolean type to an int type
print(clean_data['relative_humidity_3pm'] > 24.99)
print("--------")
print((clean_data['relative_humidity_3pm'] > 24.99)*1)
print("--------")
clean_data['high_humidity_label'] = (clean_data['relative_humidity_3pm'] > 24.99)*1
clean_data.head()

0        True
1       False
2       False
3       False
4        True
        ...  
1090     True
1091     True
1092     True
1093     True
1094    False
Name: relative_humidity_3pm, Length: 1064, dtype: bool
--------
0       1
1       0
2       0
3       0
4       1
       ..
1090    1
1091    1
1092    1
1093    1
1094    0
Name: relative_humidity_3pm, Length: 1064, dtype: int64
--------


Unnamed: 0,air_pressure_9am,air_temp_9am,avg_wind_direction_9am,avg_wind_speed_9am,max_wind_direction_9am,max_wind_speed_9am,rain_accumulation_9am,rain_duration_9am,relative_humidity_9am,relative_humidity_3pm,high_humidity_label
0,918.06,74.822,271.1,2.080354,295.4,2.863283,0.0,0.0,42.42,36.16,1
1,917.347688,71.403843,101.935179,2.443009,140.471548,3.533324,0.0,0.0,24.328697,19.426597,0
2,923.04,60.638,51.0,17.067852,63.7,22.100967,0.0,20.0,8.9,14.46,0
3,920.502751,70.138895,198.832133,4.337363,211.203341,5.190045,0.0,0.0,12.189102,12.742547,0
4,921.16,44.294,277.8,1.85666,136.5,2.863283,8.9,14730.0,92.41,76.74,1


<p style="font-family: Arial; font-size:1.75em;color:purple; font-style:bold"><br>
Output: Target is stored in 'y'.
<br><br></p>


In [46]:
y=clean_data[['high_humidity_label']].copy()
#print(y)
print(type(y))
#TODO: replace the "[[" with "[" and check the type of y again. What is it?  
#print(type(clean_data['high_humidity_label']))
y.head()

<class 'pandas.core.frame.DataFrame'>


Unnamed: 0,high_humidity_label
0,1
1,0
2,0
3,0
4,1
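
The difference that the TODO in the cell above points at can be sketched on a toy frame: indexing with a list of column names returns a DataFrame, while indexing with a single name returns a Series. The variable names below are illustrative.

```python
import pandas as pd

toy = pd.DataFrame({"high_humidity_label": [1, 0, 0, 0, 1]})

as_frame = toy[["high_humidity_label"]]   # list of names -> DataFrame (2-D)
as_series = toy["high_humidity_label"]    # single name   -> Series (1-D)

print(type(as_frame).__name__, as_frame.shape)
print(type(as_series).__name__, as_series.shape)
```

This is why `y.shape` later in the notebook is `(1064, 1)` rather than `(1064,)`.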


In [47]:
clean_data['relative_humidity_3pm'].head()

0    36.160000
1    19.426597
2    14.460000
3    12.742547
4    76.740000
Name: relative_humidity_3pm, dtype: float64

In [48]:
y.head()

Unnamed: 0,high_humidity_label
0,1
1,0
2,0
3,0
4,1


<p style="font-family: Arial; font-size:1.75em;color:purple; font-style:bold">
Inputs: Features are stored in X <br></p>
Choose a subset of 9am sensor data.


In [50]:
clean_data.head()

Unnamed: 0,air_pressure_9am,air_temp_9am,avg_wind_direction_9am,avg_wind_speed_9am,max_wind_direction_9am,max_wind_speed_9am,rain_accumulation_9am,rain_duration_9am,relative_humidity_9am,relative_humidity_3pm,high_humidity_label
0,918.06,74.822,271.1,2.080354,295.4,2.863283,0.0,0.0,42.42,36.16,1
1,917.347688,71.403843,101.935179,2.443009,140.471548,3.533324,0.0,0.0,24.328697,19.426597,0
2,923.04,60.638,51.0,17.067852,63.7,22.100967,0.0,20.0,8.9,14.46,0
3,920.502751,70.138895,198.832133,4.337363,211.203341,5.190045,0.0,0.0,12.189102,12.742547,0
4,921.16,44.294,277.8,1.85666,136.5,2.863283,8.9,14730.0,92.41,76.74,1


In [51]:
data.columns

Index(['air_pressure_9am', 'air_temp_9am', 'avg_wind_direction_9am',
       'avg_wind_speed_9am', 'max_wind_direction_9am', 'max_wind_speed_9am',
       'rain_accumulation_9am', 'rain_duration_9am', 'relative_humidity_9am',
       'relative_humidity_3pm'],
      dtype='object')

### Choose a subset of 9am sensor data.

In [53]:
#morning_features = ['air_pressure_9am','air_temp_9am','avg_wind_direction_9am','avg_wind_speed_9am',
#       'max_wind_direction_9am','max_wind_speed_9am','rain_accumulation_9am',
#        'rain_duration_9am']

In [54]:
morning_features = ['air_pressure_9am','air_temp_9am','avg_wind_speed_9am',
       'max_wind_speed_9am','rain_accumulation_9am',
        'rain_duration_9am']

In [55]:
X = clean_data[morning_features].copy()

In [56]:
X.columns

Index(['air_pressure_9am', 'air_temp_9am', 'avg_wind_speed_9am',
       'max_wind_speed_9am', 'rain_accumulation_9am', 'rain_duration_9am'],
      dtype='object')

In [57]:

y.columns

Index(['high_humidity_label'], dtype='object')

In [58]:
X.shape

(1064, 6)

In [59]:
y.shape

(1064, 1)

<p style="font-family: Arial; font-size:1.75em;color:purple; font-style:bold"> <br>
Perform Test and Train split
<br></p>



## REMINDER: Training Phase

In the **training phase**, the learning algorithm uses the training data to adjust the model’s parameters to minimize errors.  At the end of the training phase, you get the trained model.

<img src="Images/TrainingVSTesting.png" align="middle" style="width:350px;height:200px;"/>
<br>

In the **testing phase**, the trained model is applied to test data.  Test data is separate from the training data, and is previously unseen by the model.  The model is then evaluated on how it performs on the test data.  The goal in building a classifier model is to have the model perform well on training as well as test data.
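
To make the two phases concrete, here is a minimal sketch that fits a small decision tree and reports accuracy on both the training and the test portion. It uses synthetic data from `make_classification` purely for illustration (an assumption of this sketch, not the weather data), and the variable names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, used only for illustration.
X_demo, y_demo = make_classification(n_samples=500, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=0
)

# Training phase: fit the model on the training portion only.
model = DecisionTreeClassifier(max_leaf_nodes=5, random_state=0)
model.fit(X_tr, y_tr)

# Testing phase: evaluate on data the model has not seen.
train_acc = accuracy_score(y_tr, model.predict(X_tr))
test_acc = accuracy_score(y_te, model.predict(X_te))
print(f"train accuracy: {train_acc:.3f}")
print(f"test accuracy:  {test_acc:.3f}")
```

A large gap between the two numbers is a common sign of overfitting; similar values suggest the model generalizes about as well as it fits.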


### Split Dataset into training set and testing set <br>
For example 67% for training and 33% for testing. <br>
<a href="https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html" width = 200> Documentation for sklearn.model_selection.train_test_split() </a>

In [63]:
# https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=324)

In [64]:
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

(712, 6)
(712, 1)
(352, 6)
(352, 1)
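
The shapes above can be reproduced with simple arithmetic. The sketch below assumes that a fractional `test_size` is rounded up to a whole number of test samples and that the remainder goes to training; this matches the counts printed above.

```python
import math

n_samples = 1064   # rows remaining after dropna()
test_size = 0.33

# Assumption: the test count is the fraction rounded up.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)  # 712 352
```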


In [65]:
print(type(X_train))
print(type(X_test))
print(type(y_train))
print(type(y_test))


<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>


In [66]:
X_train.head()

Unnamed: 0,air_pressure_9am,air_temp_9am,avg_wind_speed_9am,max_wind_speed_9am,rain_accumulation_9am,rain_duration_9am
841,918.37,72.932,2.013246,2.773806,0.0,0.0
75,920.1,53.492,13.444009,15.367778,0.0,0.0
95,927.61,54.896,4.988376,7.202947,0.0,0.0
895,919.235153,65.951112,2.942019,3.65881,0.0,0.0
699,919.888128,68.687822,3.960858,5.185547,0.0,0.0


In [67]:
y_train.describe()

Unnamed: 0,high_humidity_label
count,712.0
mean,0.494382
std,0.50032
min,0.0
25%,0.0
50%,0.0
75%,1.0
max,1.0


<p style="font-family: Arial; font-size:1.75em;color:purple; font-style:bold"><br>
Fit on the Train Set (Model building, e.g., creating the decision tree)</p>

 https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html?highlight=decisiontreeclassifier#sklearn.tree.DecisionTreeClassifier


In [69]:
#https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html?highlight=decisiontreeclassifier#sklearn.tree.DecisionTreeClassifier

#humidity_classifier = DecisionTreeClassifier(max_leaf_nodes=10, random_state=0)
humidity_classifier = DecisionTreeClassifier(max_leaf_nodes=5, max_depth = 5, random_state=0)

#Build the model, creating a decision tree, which is defined by the humidity_classifier.
humidity_classifier.fit(X_train, y_train)

humidity_classifier.get_params()

{'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': 5,
 'max_features': None,
 'max_leaf_nodes': 5,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'monotonic_cst': None,
 'random_state': 0,
 'splitter': 'best'}

In [70]:
type(humidity_classifier)

sklearn.tree._classes.DecisionTreeClassifier
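
The classifier above sets both `max_leaf_nodes=5` and `max_depth=5`. A binary tree with at most 5 leaves can be at most 4 levels deep, so the leaf limit is the tighter of the two constraints here. The sketch below checks the size of a fitted tree with `get_n_leaves()` and `get_depth()`; it uses synthetic data as a stand-in, so the exact numbers will differ from the weather model.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data, used only for illustration.
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)

tree = DecisionTreeClassifier(max_leaf_nodes=5, max_depth=5, random_state=0)
tree.fit(X_demo, y_demo)

# Inspect how large the fitted tree actually is.
print("leaves:", tree.get_n_leaves())
print("depth: ", tree.get_depth())
```

Running the same two calls on `humidity_classifier` shows the size of the tree built from the weather data.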

<p style="font-family: Arial; font-size:1.75em;color:purple; font-style:bold"><br>
Predict on Test Set 
</p>


In [72]:
X_test.shape

(352, 6)

In [73]:
predictions = humidity_classifier.predict(X_test)

In [74]:
type(predictions)

numpy.ndarray

In [75]:
predictions[:10]

array([1, 0, 1, 1, 1, 1, 1, 0, 1, 1])

In [76]:
#y_test holds output values for the X_test set
y_test['high_humidity_label'][:10]

456     0
845     0
693     1
259     1
723     1
224     1
300     1
442     0
585     1
1057    1
Name: high_humidity_label, dtype: int64

<p style="font-family: Arial; font-size:1.75em;color:purple; font-style:bold"><br>
Measure Accuracy of the Classifier
<br></p>

Compare predicted y values with actual y values from the test set and calculate accuracy.

In [78]:
accuracy_score(y_true = y_test, y_pred = predictions)

0.7840909090909091
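
Accuracy is the fraction of predictions that match the true labels; the score above corresponds to 276 correct predictions out of 352. As a minimal sketch, the snippet below computes accuracy by hand for the first ten test samples shown earlier and compares it with `accuracy_score`.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# First ten predictions and true labels, copied from the outputs above.
predicted = np.array([1, 0, 1, 1, 1, 1, 1, 0, 1, 1])
actual = np.array([0, 0, 1, 1, 1, 1, 1, 0, 1, 1])

# Fraction of positions where prediction and label agree.
manual = (predicted == actual).mean()

print(manual)                              # 0.9
print(accuracy_score(actual, predicted))   # 0.9
```

Only the first sample is misclassified, so nine of ten match.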

### The Training and Testing Process
<img src="Images/TrainingTesting_Process.png" align="middle" style="width:650px"/>
<br>

<p style="font-family: Arial; font-size:1.75em;color:red; font-style:bold"><br>
Practice
</p>
1. Change some hyperparameters of DecisionTreeClassifier (e.g., criterion, splitter, max_depth), then retrain and retest the model. What is the result? <br>

https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html <br>
2. Change *morning_features*, then retrain and retest the model. What is the result?
<p>


## 1. Change some hyperparameters of DecisionTreeClassifier
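
One possible way to approach this exercise is to loop over a few settings and record the test accuracy for each. The sketch below is a starting point under stated assumptions: it runs on synthetic data so that it is self-contained, and the `results` dictionary and the chosen grid are illustrative. To use it in this notebook, replace the synthetic split with `X_train`, `X_test`, `y_train`, and `y_test`.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; swap in the notebook's own split when practicing.
X_demo, y_demo = make_classification(n_samples=500, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=324
)

results = {}
for criterion in ("gini", "entropy"):
    for max_depth in (2, 5, None):  # None lets the tree grow without a depth cap
        clf = DecisionTreeClassifier(
            criterion=criterion, max_depth=max_depth, random_state=0
        )
        clf.fit(X_tr, y_tr)
        results[(criterion, max_depth)] = accuracy_score(y_te, clf.predict(X_te))

for settings, acc in results.items():
    print(settings, round(acc, 3))
```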


## 2. Adding "relative_humidity_9am" to the feature set.