In this notebook, we will use scikit-learn to perform decision-tree-based classification of weather data.

Import Libraries

In [None]:
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

Creating a Pandas DataFrame from a CSV file 

In [None]:
data = pd.read_csv('daily_weather.csv')

#Daily Weather Data Description<br>
The file **daily_weather.csv** is a comma-separated file that contains weather data. The data comes from a weather station located in San Diego, California. The weather station is equipped with sensors that capture weather-related measurements such as air temperature, air pressure, and relative humidity. Data was collected for a period of three years, from September 2011 to September 2014, to ensure that sufficient data for different seasons and weather conditions is captured.

Let's now check all the columns in the data.

In [None]:
data.columns

Index(['number', 'air_pressure_9am', 'air_temp_9am', 'avg_wind_direction_9am',
       'avg_wind_speed_9am', 'max_wind_direction_9am', 'max_wind_speed_9am',
       'rain_accumulation_9am', 'rain_duration_9am', 'relative_humidity_9am',
       'relative_humidity_3pm'],
      dtype='object')

The task is to predict whether the day will be low-humidity or not, where the label is derived from the afternoon measurement of relative humidity.<br>
Each row, or sample, consists of the following variables:

1. **number**: unique number for each row.<br>
2. **air_pressure_9am**: air pressure averaged over a period from 8:55am to 9:04am (unit: hectopascals)<br>
3. **air_temp_9am**: air temperature averaged over a period from 8:55am to 9:04am (unit: degrees Fahrenheit)<br>
4. **avg_wind_direction_9am**: wind direction averaged over a period from 8:55am to 9:04am (unit: degrees, with 0 meaning coming from the North and increasing clockwise)<br>
5. **avg_wind_speed_9am**: wind speed averaged over a period from 8:55am to 9:04am (unit: miles per hour)<br>
6. **max_wind_direction_9am**: wind gust direction averaged over a period from 8:55am to 9:04am (unit: degrees, with 0 meaning coming from the North and increasing clockwise)<br>
7. **max_wind_speed_9am**: wind gust speed averaged over a period from 8:55am to 9:04am (unit: miles per hour)<br>
8. **rain_accumulation_9am**: amount of rain accumulated in the 24 hours prior to 9am (unit: millimeters)<br>
9. **rain_duration_9am**: amount of time rain was recorded in the 24 hours prior to 9am (unit: seconds)<br>
10. **relative_humidity_9am**: relative humidity averaged over a period from 8:55am to 9:04am (unit: percent)<br>
11. **relative_humidity_3pm**: relative humidity averaged over a period from 2:55pm to 3:04pm (unit: percent)

In [None]:
data

Unnamed: 0,number,air_pressure_9am,air_temp_9am,avg_wind_direction_9am,avg_wind_speed_9am,max_wind_direction_9am,max_wind_speed_9am,rain_accumulation_9am,rain_duration_9am,relative_humidity_9am,relative_humidity_3pm
0,0,918.060000,74.822000,271.100000,2.080354,295.400000,2.863283,0.0,0.0,42.420000,36.160000
1,1,917.347688,71.403843,101.935179,2.443009,140.471548,3.533324,0.0,0.0,24.328697,19.426597
2,2,923.040000,60.638000,51.000000,17.067852,63.700000,22.100967,0.0,20.0,8.900000,14.460000
3,3,920.502751,70.138895,198.832133,4.337363,211.203341,5.190045,0.0,0.0,12.189102,12.742547
4,4,921.160000,44.294000,277.800000,1.856660,136.500000,2.863283,8.9,14730.0,92.410000,76.740000
...,...,...,...,...,...,...,...,...,...,...,...
1090,1090,918.900000,63.104000,192.900000,3.869906,207.300000,5.212070,0.0,0.0,26.020000,38.180000
1091,1091,918.710000,49.568000,241.600000,1.811921,227.400000,2.371156,0.0,0.0,90.350000,73.340000
1092,1092,916.600000,71.096000,189.300000,3.064608,200.800000,3.892276,0.0,0.0,45.590000,52.310000
1093,1093,912.600000,58.406000,172.700000,3.825167,189.100000,4.764682,0.0,0.0,64.840000,58.280000


In [None]:
data[data.isnull().any(axis=1)]

Unnamed: 0,number,air_pressure_9am,air_temp_9am,avg_wind_direction_9am,avg_wind_speed_9am,max_wind_direction_9am,max_wind_speed_9am,rain_accumulation_9am,rain_duration_9am,relative_humidity_9am,relative_humidity_3pm
16,16,917.89,,169.2,2.192201,196.8,2.930391,0.0,0.0,48.99,51.19
111,111,915.29,58.82,182.6,15.613841,189.0,,0.0,0.0,21.5,29.69
177,177,915.9,,183.3,4.719943,189.9,5.346287,0.0,0.0,29.26,46.5
262,262,923.596607,58.380598,47.737753,10.636273,67.145843,13.671423,0.0,,17.990876,16.461685
277,277,920.48,62.6,194.4,2.751436,,3.869906,0.0,0.0,52.58,54.03
334,334,916.23,75.74,149.1,2.751436,187.5,4.183078,,1480.0,31.88,32.9
358,358,917.44,58.514,55.1,10.021491,,12.705819,0.0,0.0,13.88,25.93
361,361,920.444946,65.801845,49.823346,21.520177,61.886944,25.549112,,40.364018,12.278715,7.618649
381,381,918.48,66.542,90.9,3.467257,89.4,4.406772,,0.0,20.64,14.35
409,409,,67.853833,65.880616,4.328594,78.570923,5.216734,0.0,0.0,18.487385,20.356594


#Data cleaning steps

We do not need the row number, so we can remove that column.

In [None]:
del data['number']

Now let's drop null values using the pandas dropna function.

In [None]:
before_rows=data.shape[0]
print(before_rows)

1095


In [None]:
data=data.dropna()


In [None]:
after_rows = data.shape[0]
print(after_rows)

1064


How many rows were dropped due to cleaning?

In [None]:
before_rows - after_rows

31
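
Before dropping rows, it can help to see which columns the missing values come from. The following is a minimal sketch on a small made-up DataFrame (`toy` and its values are assumptions for illustration, not rows from daily_weather.csv); the same calls should apply to `data`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the weather data (values are made up for illustration).
toy = pd.DataFrame({
    'air_pressure_9am': [918.06, np.nan, 923.04, 920.50],
    'air_temp_9am': [74.8, 71.4, np.nan, 70.1],
    'rain_duration_9am': [0.0, 0.0, 20.0, 0.0],
})

# Number of missing values in each column.
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# Rows with at least one missing value, and rows left after dropna.
rows_with_missing = int(toy.isnull().any(axis=1).sum())
rows_after = toy.dropna().shape[0]
print(rows_with_missing, rows_after)
```

The number of rows removed by `dropna()` equals the number of rows with at least one missing value, which is what the before/after subtraction above measures.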

#Convert to a Classification Task

Binarize relative_humidity_3pm to 0 or 1: values above 24.99 become 1 (high humidity), all others become 0.

In [None]:
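# Work on a copy so the cleaned DataFrame stays unchanged.
clean_data = data.copy()
# Label is 1 when 3pm relative humidity exceeds 24.99 percent, else 0.
clean_data['high_humidity_label'] = (clean_data['relative_humidity_3pm'] > 24.99) * 1
print(clean_data['high_humidity_label'])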

0       1
1       0
2       0
3       0
4       1
       ..
1090    1
1091    1
1092    1
1093    1
1094    0
Name: high_humidity_label, Length: 1064, dtype: int64
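
The thresholding step can also be written with `astype(int)`, and it is often worth checking how many samples land in each class. A small sketch, using the first five humidity values shown in this notebook plus two made-up boundary values (24.99 and 25.00) to show that the comparison is strict:

```python
import pandas as pd

# First five 3pm readings from the notebook, plus two made-up boundary values.
humidity_3pm = pd.Series([36.16, 19.426597, 14.46, 12.742547, 76.74, 24.99, 25.00])

# Same rule as the cell above: strictly greater than 24.99 is labelled 1.
label_mult = (humidity_3pm > 24.99) * 1
label_cast = (humidity_3pm > 24.99).astype(int)

# Both spellings produce the same labels.
same_labels = bool((label_mult == label_cast).all())
print(same_labels)

# Class balance: number of samples in each class.
class_counts = label_cast.value_counts().sort_index()
print(class_counts)
```

A strongly unbalanced class count would make plain accuracy harder to interpret later on.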


The target is stored in y.

In [None]:
1*False

0

In [None]:
y=clean_data[['high_humidity_label']].copy()
#y

In [None]:
clean_data['relative_humidity_3pm'].head()

0    36.160000
1    19.426597
2    14.460000
3    12.742547
4    76.740000
Name: relative_humidity_3pm, dtype: float64

Use 9am Sensor Signals as Features to Predict Humidity at 3pm

In [None]:
morning_features = [
    'air_pressure_9am', 'air_temp_9am',
    'avg_wind_direction_9am', 'avg_wind_speed_9am',
    'max_wind_direction_9am', 'max_wind_speed_9am',
    'rain_accumulation_9am', 'rain_duration_9am',
]

In [None]:
x = clean_data[morning_features].copy()

In [None]:
x.columns

Index(['air_pressure_9am', 'air_temp_9am', 'avg_wind_direction_9am',
       'avg_wind_speed_9am', 'max_wind_direction_9am', 'max_wind_speed_9am',
       'rain_accumulation_9am', 'rain_duration_9am'],
      dtype='object')

In [None]:
y.columns

Index(['high_humidity_label'], dtype='object')

#Perform Test and Train split

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
x_train, x_test, y_train, y_test = train_test_split(x,y, test_size=0.33, random_state=325)
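
To check what the split produces, the sketch below repeats the same call on synthetic data with the same number of rows as the cleaned data (1064). The names `x_demo`, `y_demo`, `f1`, `f2`, and `label` are hypothetical and the values are random. With `test_size=0.33`, scikit-learn rounds the test count up, so the test partition should hold 352 rows and the training partition 712.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same row count as the cleaned data (1064).
rng = np.random.default_rng(0)
x_demo = pd.DataFrame(rng.normal(size=(1064, 2)), columns=['f1', 'f2'])
y_demo = pd.DataFrame({'label': rng.integers(0, 2, size=1064)})

x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.33, random_state=325
)

# Row counts of the two partitions.
print(x_tr.shape[0], x_te.shape[0])

# The same random_state reproduces the same split, and features stay
# aligned with their labels through the shared index.
x_tr2, _, _, _ = train_test_split(
    x_demo, y_demo, test_size=0.33, random_state=325
)
print(x_tr.index.equals(x_tr2.index), x_tr.index.equals(y_tr.index))
```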

Fit on Train Set

In [None]:
humidity_classifier = DecisionTreeClassifier(max_leaf_nodes=10, random_state=0)
humidity_classifier.fit(x_train,y_train)

DecisionTreeClassifier(max_leaf_nodes=10, random_state=0)

In [None]:
type(humidity_classifier)

sklearn.tree._classes.DecisionTreeClassifier
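
A fitted tree can be inspected as text to see which thresholds it learned. The sketch below assumes synthetic data in which the label depends only on the first feature; `X_demo`, `feature_a`, and `feature_b` are hypothetical names. The same `export_text` call should work on `humidity_classifier` with `feature_names=morning_features`.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic data: the label depends only on the first feature (made up).
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 100, size=(200, 2))
y_demo = (X_demo[:, 0] > 50).astype(int)

tree = DecisionTreeClassifier(max_leaf_nodes=10, random_state=0)
tree.fit(X_demo, y_demo)

# Text rendering of the learned split rules.
rules = export_text(tree, feature_names=['feature_a', 'feature_b'])
print(rules)

# max_leaf_nodes is an upper bound; the tree may stop with fewer leaves.
n_leaves = tree.get_n_leaves()
print(n_leaves)
```

Limiting `max_leaf_nodes` keeps the tree small enough to read and reduces the risk of overfitting the training set.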

Predict on Test Set

In [None]:
predictions=humidity_classifier.predict(x_test)

In [None]:
predictions[:10]

array([0, 0, 0, 1, 1, 1, 1, 0, 0, 0])

In [None]:
y_test['high_humidity_label'][:10]

459    1
875    0
602    0
138    1
874    1
42     1
599    0
165    0
10     1
107    0
Name: high_humidity_label, dtype: int64

Measure Accuracy of the Classifier

In [None]:
accuracy_score(y_true = y_test,y_pred = predictions)

0.7585227272727273
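
Accuracy alone does not show which kind of mistakes the classifier makes. As a small illustration, the sketch below uses only the first ten true labels and predictions printed above (not the full test set) to compute accuracy by hand, a confusion matrix, and a majority-class baseline for comparison.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# First ten true labels and predictions as printed earlier in the notebook.
y_true = np.array([1, 0, 0, 1, 1, 1, 0, 0, 1, 0])
y_pred = np.array([0, 0, 0, 1, 1, 1, 1, 0, 0, 0])

# Accuracy is the fraction of predictions that match the true labels.
acc = accuracy_score(y_true, y_pred)
manual_acc = (y_true == y_pred).mean()

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)

# Baseline: always predict the most frequent class in y_true.
majority = np.bincount(y_true).argmax()
baseline_acc = accuracy_score(y_true, np.full_like(y_true, majority))

print(acc, manual_acc, baseline_acc)
print(cm)
```

Running the same comparison on the full `y_test` and `predictions` would show how far the 0.7585 accuracy sits above a classifier that always predicts the majority class.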