Weather Data Classification using scikit-learn

First, import the libraries needed to demonstrate the Decision Tree Classifier: pandas for data handling, and scikit-learn for splitting the data, building the model, and scoring it.

In [30]:
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

Read the weather data from the CSV file using the read_csv function of the pandas library.

In [31]:
data = pd.read_csv('/Users/kaumudiseri/Downloads/daily_weather (1).csv')  # Adjust this path for your machine


Daily Weather Data Description

The file "daily_weather.csv" is a comma-separated file containing weather data collected from a weather station in San Diego, California. The station is equipped with sensors that measure air temperature, air pressure, and relative humidity. Data was collected over a three-year period, from September 2011 to September 2014, to ensure a comprehensive representation of different seasons and weather conditions.

List the columns in the dataset.

In [32]:
data.columns

Index(['number', 'air_pressure_9am', 'air_temp_9am', 'avg_wind_direction_9am',
       'avg_wind_speed_9am', 'max_wind_direction_9am', 'max_wind_speed_9am',
       'rain_accumulation_9am', 'rain_duration_9am', 'relative_humidity_9am',
       'relative_humidity_3pm', 'Unnamed: 11'],
      dtype='object')
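
The 'Unnamed: 11' entry in the output above is the name pandas gives to a column whose header cell is empty. One possible cause, which the source does not confirm, is a trailing comma at the end of each line of the CSV. The sketch below reproduces that effect on a tiny made-up CSV (the text in csv_text and the names sample and sample_trimmed are illustrative only) and shows one way to drop such columns.

```python
import io

import pandas as pd

# Hypothetical three-column CSV in which every line ends with a trailing comma.
csv_text = (
    "number,air_temp_9am,relative_humidity_3pm,\n"
    "0,74.822,36.16,\n"
    "1,71.403843,19.426597,\n"
)

sample = pd.read_csv(io.StringIO(csv_text))

# The empty fourth header cell is auto-named "Unnamed: 3" and holds only NaN.
print(sample.columns.tolist())

# Drop every auto-named column before further processing.
unnamed_cols = [c for c in sample.columns if c.startswith("Unnamed")]
sample_trimmed = sample.drop(columns=unnamed_cols)
print(sample_trimmed.columns.tolist())
```

Dropping an empty column up front keeps it from marking every row as incomplete in later null checks.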

In [33]:
data.head()

Unnamed: 0,number,air_pressure_9am,air_temp_9am,avg_wind_direction_9am,avg_wind_speed_9am,max_wind_direction_9am,max_wind_speed_9am,rain_accumulation_9am,rain_duration_9am,relative_humidity_9am,relative_humidity_3pm,Unnamed: 11
0,0,918.06,74.822,271.1,2.080354,295.4,2.863283,0.0,0.0,42.42,36.16,
1,1,917.347688,71.403843,101.935179,2.443009,140.471549,3.533324,0.0,0.0,24.328697,19.426597,
2,2,923.04,60.638,51.0,17.067852,63.7,22.100967,0.0,20.0,8.9,14.46,
3,3,920.502751,70.138895,198.832133,4.337363,211.203341,5.190045,0.0,0.0,12.189102,12.742547,
4,4,921.16,44.294,277.8,1.85666,136.5,2.863283,8.9,14730.0,92.41,76.74,


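Check whether the dataset contains null values by displaying the first rows that have at least one. Every row shown below qualifies because its 'Unnamed: 11' field is empty.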

In [34]:
data[data.isnull().any(axis=1)].head()

Unnamed: 0,number,air_pressure_9am,air_temp_9am,avg_wind_direction_9am,avg_wind_speed_9am,max_wind_direction_9am,max_wind_speed_9am,rain_accumulation_9am,rain_duration_9am,relative_humidity_9am,relative_humidity_3pm,Unnamed: 11
0,0,918.06,74.822,271.1,2.080354,295.4,2.863283,0.0,0.0,42.42,36.16,
1,1,917.347688,71.403843,101.935179,2.443009,140.471549,3.533324,0.0,0.0,24.328697,19.426597,
2,2,923.04,60.638,51.0,17.067852,63.7,22.100967,0.0,20.0,8.9,14.46,
3,3,920.502751,70.138895,198.832133,4.337363,211.203341,5.190045,0.0,0.0,12.189102,12.742547,
4,4,921.16,44.294,277.8,1.85666,136.5,2.863283,8.9,14730.0,92.41,76.74,
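
The check above lists the affected rows but does not show which columns hold the gaps. A minimal sketch of a per-column count follows, using a small made-up frame (toy and its values are assumptions, not the weather data).

```python
import numpy as np
import pandas as pd

# Made-up frame: two columns with one gap each and one column that is entirely empty.
toy = pd.DataFrame({
    "air_temp_9am": [74.8, np.nan, 60.6],
    "relative_humidity_3pm": [36.2, 19.4, np.nan],
    "empty_col": [np.nan, np.nan, np.nan],
})

# Number of missing values in each column.
null_counts = toy.isnull().sum()
print(null_counts)

# Number of rows that contain at least one missing value.
rows_with_nulls = int(toy.isnull().any(axis=1).sum())
print(rows_with_nulls)
```

A single fully empty column is enough to make every row appear in a row-wise null check, so the per-column view is usually the more informative one.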


Data Cleaning Steps

The 'number' column only numbers the rows. Because every value in it is unique, it carries no information that could help the classifier make a decision, so we drop it.

In [35]:
del data['number']

In [36]:
data.fillna(value=0, inplace=True)
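
The cell above replaces every missing value with 0. As a hedged illustration of what that choice implies, the sketch below compares fillna with dropna on a made-up frame (toy, filled, and dropped are hypothetical names). It is not a claim about which option suits the weather data better.

```python
import numpy as np
import pandas as pd

# Made-up readings with one missing air pressure value.
toy = pd.DataFrame({
    "air_pressure_9am": [918.06, np.nan, 923.04],
    "air_temp_9am": [74.822, 71.4, 60.638],
})

# Option used in this notebook: keep every row, writing 0 where a reading is missing.
filled = toy.fillna(value=0)

# Alternative: discard any row that has a missing reading.
dropped = toy.dropna()

print(filled)
print(dropped)
```

Filling keeps all rows but introduces a 0 that no real sensor would report for air pressure, while dropping keeps only observed values at the cost of fewer rows.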

Create a binary label that is 1 when the relative humidity at 3pm exceeds 24.99 and 0 otherwise.

In [37]:
clean_data = data.copy()
clean_data['high_humidity_label'] = (clean_data['relative_humidity_3pm'] > 24.99) * 1
clean_data['high_humidity_label'].head()

0    1
1    0
2    0
3    0
4    1
Name: high_humidity_label, dtype: int64
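
The labelling step works by comparing each value with the threshold, which yields booleans, and then turning those booleans into integers. The sketch below applies it to the five 3pm humidity values shown in this notebook. The astype(int) variant is an equivalent spelling offered as an alternative, not the form the notebook uses.

```python
import pandas as pd

# The first five relative_humidity_3pm values from the dataset preview.
humidity_3pm = pd.Series(
    [36.16, 19.426597, 14.46, 12.742547, 76.74],
    name="relative_humidity_3pm",
)

# Form used in the notebook: multiplying a boolean Series by 1 gives 0/1 integers.
label_by_multiplication = (humidity_3pm > 24.99) * 1

# Equivalent form that states the conversion explicitly.
label_by_astype = (humidity_3pm > 24.99).astype(int)

print(label_by_astype.tolist())  # [1, 0, 0, 0, 1]
```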

In [38]:
y = clean_data[['high_humidity_label']].copy()
y.head()

Unnamed: 0,high_humidity_label
0,1
1,0
2,0
3,0
4,1


In [40]:
clean_data['relative_humidity_3pm'].head()

0    36.160000
1    19.426597
2    14.460000
3    12.742547
4    76.740000
Name: relative_humidity_3pm, dtype: float64

Using 9am Sensor Signals as Features to Predict Humidity at 3pm

Store the names of all 9am measurements in the list 'morning_features'. The 3pm humidity is excluded because the label is derived from it.

In [41]:
morning_features = ['air_pressure_9am', 'air_temp_9am', 'avg_wind_direction_9am',
       'avg_wind_speed_9am', 'max_wind_direction_9am', 'max_wind_speed_9am',
       'rain_accumulation_9am', 'rain_duration_9am', 'relative_humidity_9am']

Copy the 'morning_features' columns from clean_data into a new DataFrame x.

In [42]:
x = clean_data[morning_features].copy()
x.columns

Index(['air_pressure_9am', 'air_temp_9am', 'avg_wind_direction_9am',
       'avg_wind_speed_9am', 'max_wind_direction_9am', 'max_wind_speed_9am',
       'rain_accumulation_9am', 'rain_duration_9am', 'relative_humidity_9am'],
      dtype='object')

In [43]:
y.columns

Index(['high_humidity_label'], dtype='object')

Perform Test and Train split

Use train_test_split to split the data into training and testing sets. Here 33% of the rows are held out for testing, and random_state fixes the shuffle so the split is reproducible.

In [44]:
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.33, random_state=263)
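
To make the effect of test_size and random_state concrete, the sketch below splits a made-up 100-row frame twice with the same settings as the cell above. The names features, labels, and the a_/b_ prefixes are illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data: 100 rows with a single feature and an alternating 0/1 label.
features = pd.DataFrame({"f": np.arange(100)})
labels = pd.Series(np.arange(100) % 2, name="label")

# First split, using the same test_size and random_state as the notebook.
a_train, a_test, la_train, la_test = train_test_split(
    features, labels, test_size=0.33, random_state=263
)

# Second split with identical arguments.
b_train, b_test, lb_train, lb_test = train_test_split(
    features, labels, test_size=0.33, random_state=263
)

print(len(a_train), len(a_test))

# The same random_state selects the same rows for the test set both times.
same_split = a_test.index.equals(b_test.index)
print(same_split)
```

Features and labels are shuffled together, so each row of a_test still lines up with the matching entry in la_test.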

Fit on Train Set

Create a Decision Tree classifier limited to 10 leaf nodes and fit it on the training data.

In [45]:
humidity_classifier = DecisionTreeClassifier(max_leaf_nodes=10, random_state=0)
humidity_classifier.fit(X_train, y_train)

In [46]:
type(humidity_classifier)

sklearn.tree._classes.DecisionTreeClassifier
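
A fitted tree can also be printed as readable rules with export_text from sklearn.tree. The sketch below does this for a tree trained on synthetic data in which the label depends only on a humidity threshold of 30. The data, that threshold, and the names toy_X, toy_y, and toy_tree are assumptions made for illustration, so the printed rules say nothing about the weather model itself.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic features drawn uniformly at random with a fixed seed.
rng = np.random.default_rng(0)
toy_X = pd.DataFrame({
    "relative_humidity_9am": rng.uniform(5, 95, size=200),
    "air_temp_9am": rng.uniform(40, 90, size=200),
})

# Synthetic label: 1 when the humidity feature exceeds 30, else 0.
toy_y = (toy_X["relative_humidity_9am"] > 30).astype(int)

# Same hyperparameters as the notebook's classifier.
toy_tree = DecisionTreeClassifier(max_leaf_nodes=10, random_state=0)
toy_tree.fit(toy_X, toy_y)

# Render the learned splits as indented text rules.
rules = export_text(toy_tree, feature_names=list(toy_X.columns))
print(rules)
```

Calling export_text on humidity_classifier with morning_features as the feature names would show the splits learned from the real data in the same way.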

Predict on Test Set

Use humidity_classifier to predict labels for X_test and store them in y_predicted.

In [47]:
y_predicted = humidity_classifier.predict(X_test)

In [48]:
y_predicted[:10]

array([0, 1, 1, 0, 1, 1, 0, 0, 1, 0])

In [49]:
y_test['high_humidity_label'][:10]

162    0
750    0
183    0
270    0
824    1
474    1
794    0
90     0
451    1
648    0
Name: high_humidity_label, dtype: int64
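
The predictions come back as a plain NumPy array, while the true labels keep their original row index, so the two outputs above have to be compared by position. The sketch below places the ten values shown above side by side. The comparison frame and its column names are illustrative additions.

```python
import pandas as pd

# True labels for the first ten test rows, with their original row index.
actual = pd.Series(
    [0, 0, 0, 0, 1, 1, 0, 0, 1, 0],
    index=[162, 750, 183, 270, 824, 474, 794, 90, 451, 648],
    name="actual",
)

# Predicted labels for the same ten rows, in the same order.
predicted = [0, 1, 1, 0, 1, 1, 0, 0, 1, 0]

comparison = actual.to_frame()
comparison["predicted"] = predicted  # assigned by position, not by index
comparison["correct"] = comparison["actual"] == comparison["predicted"]
print(comparison)

# Row labels of the misclassified samples.
mismatches = comparison.index[~comparison["correct"]].tolist()
print(mismatches)  # [750, 183]
```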

Measure Accuracy of the Classifier

Measure the accuracy of the model with the accuracy_score function from sklearn.metrics. This model classifies about 91% of the test samples correctly.

In [50]:
accuracy_score(y_test, y_predicted) * 100

91.16022099447514
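
Accuracy is the fraction of predictions that match the true labels. As a worked example, the sketch below computes it by hand and with accuracy_score for only the first ten test rows printed earlier, and adds a confusion matrix to show which kind of error occurs. The result for these ten rows is not the score for the full test set.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# First ten true and predicted labels, as printed earlier in the notebook.
actual = np.array([0, 0, 0, 0, 1, 1, 0, 0, 1, 0])
predicted = np.array([0, 1, 1, 0, 1, 1, 0, 0, 1, 0])

# Accuracy by hand: the share of positions where the two arrays agree.
manual_accuracy = (actual == predicted).mean()

# The same quantity from scikit-learn.
library_accuracy = accuracy_score(actual, predicted)
print(manual_accuracy, library_accuracy)  # 8 of the 10 labels agree

# Rows are true classes and columns are predicted classes, both ordered 0 then 1.
matrix = confusion_matrix(actual, predicted)
print(matrix)
```

In this small sample both errors are low-humidity days predicted as high humidity, a detail that a single accuracy figure does not reveal.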