# Classification using Scikit-learn

### Daily analysis of meteorological data
In this notebook we'll use scikit-learn to classify meteorological data with a decision tree.

In [1]:
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

In [2]:
data = pd.read_csv("meteo/diario.csv")

<p style="font-family: Arial; font-size:1.75em;color:purple; font-style:bold">Daily description of meteorological data</p>
<br>
The **daily_weather.csv** file is a comma-separated file containing weather data. The weather station is equipped with sensors that capture weather-related measurements such as air temperature, air pressure, and relative humidity. The data was collected over a three-year period, from September 2011 to September 2014, to ensure that sufficient data was captured for different seasons and weather conditions.<br><br>

Let's now look at all the columns in the data.

In [3]:
data.columns

Index(['number', 'air_pressure_9am', 'air_temp_9am', 'avg_wind_direction_9am',
       'avg_wind_speed_9am', 'max_wind_direction_9am', 'max_wind_speed_9am',
       'rain_accumulation_9am', 'rain_duration_9am', 'relative_humidity_9am',
       'relative_humidity_3pm'],
      dtype='object')

<br>Each row in daily_weather.csv captures weather data for a separate day.

Measurements from the weather station sensors were captured at one-minute intervals. These measurements were then processed to generate values that describe the daily weather. Since this data set was created to classify low-humidity days versus non-low-humidity days (i.e., days with normal or high humidity), the variables included are weather measurements taken in the morning, plus one measurement taken in the afternoon. The idea is to use the morning weather values to predict whether the day will be low in humidity or not, based on the afternoon relative humidity measurement.
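
As a rough illustration of that idea, the sketch below trains a decision tree on synthetic data. The `toy` frame, its values, the assumed link between morning and afternoon humidity, and the tree settings are all assumptions made for this example; only the 24.99 threshold is borrowed from the labeling step later in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the weather data (NOT the real measurements).
rng = np.random.default_rng(0)
n_days = 500
toy = pd.DataFrame({
    "relative_humidity_9am": rng.uniform(5, 95, n_days),
    "air_temp_9am": rng.uniform(40, 90, n_days),
})
# Assumption: afternoon humidity loosely follows morning humidity plus noise.
toy["relative_humidity_3pm"] = (
    0.8 * toy["relative_humidity_9am"] + rng.normal(0, 5, n_days)
).clip(0, 100)

# Binary target derived from the afternoon measurement only.
toy_y = (toy["relative_humidity_3pm"] > 24.99).astype(int)
# Features come only from the morning measurements.
toy_X = toy[["relative_humidity_9am", "air_temp_9am"]]

X_train, X_test, y_train, y_test = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=0
)
tree = DecisionTreeClassifier(max_leaf_nodes=10, random_state=0)
tree.fit(X_train, y_train)

toy_accuracy = accuracy_score(y_test, tree.predict(X_test))
print(f"Accuracy on the synthetic hold-out set: {toy_accuracy:.2f}")
```

The afternoon column is used only to build the target and is kept out of the features; otherwise the tree could simply read the answer from its input.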

What does each variable (column) represent?

* **number:** Unique number for each row
* **air_pressure_9am:** Average atmospheric pressure from 8:55am to 9:04am (*Hectopascals*)
* **air_temp_9am:** Average air temperature from 8:55am to 9:04am (*Degrees Fahrenheit*)
* **avg_wind_direction_9am:** Average wind direction from 8:55am to 9:04am (*Degrees, where 0 is North and values increase clockwise*)
* **avg_wind_speed_9am:** Average wind speed from 8:55am to 9:04am (*Miles per hour*)
* **max_wind_direction_9am:** Average gust direction from 8:55am to 9:04am
* **max_wind_speed_9am:** Average gust speed from 8:55am to 9:04am (*Miles per hour*)
* **rain_accumulation_9am:** Amount of rain accumulated in the 24 hours prior to 9am (*Millimeters*)
* **rain_duration_9am:** Amount of time rain was recorded in the 24 hours prior to 9am (*Seconds*)
* **relative_humidity_9am:** Average relative humidity from 8:55am to 9:04am (*Percent*)
* **relative_humidity_3pm:** Average relative humidity from 2:55pm to 3:04pm (*Percent*)

In [5]:
data.head(10)

Unnamed: 0,number,air_pressure_9am,air_temp_9am,avg_wind_direction_9am,avg_wind_speed_9am,max_wind_direction_9am,max_wind_speed_9am,rain_accumulation_9am,rain_duration_9am,relative_humidity_9am,relative_humidity_3pm
0,0,918.06,74.822,271.1,2.080354,295.4,2.863283,0.0,0.0,42.42,36.16
1,1,917.347688,71.403843,101.935179,2.443009,140.471548,3.533324,0.0,0.0,24.328697,19.426597
2,2,923.04,60.638,51.0,17.067852,63.7,22.100967,0.0,20.0,8.9,14.46
3,3,920.502751,70.138895,198.832133,4.337363,211.203341,5.190045,0.0,0.0,12.189102,12.742547
4,4,921.16,44.294,277.8,1.85666,136.5,2.863283,8.9,14730.0,92.41,76.74
5,5,915.3,78.404,182.8,9.932014,189.0,10.983375,0.02,170.0,35.13,33.93
6,6,915.598868,70.043304,177.875407,3.745587,186.606696,4.589632,0.0,0.0,10.657422,21.385657
7,7,918.07,51.71,242.4,2.527742,271.6,3.646212,0.0,0.0,80.47,74.92
8,8,920.08,80.582,40.7,4.518619,63.0,5.883152,0.0,0.0,29.58,24.03
9,9,915.01,47.498,163.1,4.943637,195.9,6.576604,0.0,0.0,88.6,68.05


In [7]:
# any(axis=1) checks across the columns of each row (axis=0 would check down each column), so this selects every row with at least one null value.
data[data.isnull().any(axis=1)]

Unnamed: 0,number,air_pressure_9am,air_temp_9am,avg_wind_direction_9am,avg_wind_speed_9am,max_wind_direction_9am,max_wind_speed_9am,rain_accumulation_9am,rain_duration_9am,relative_humidity_9am,relative_humidity_3pm
16,16,917.89,,169.2,2.192201,196.8,2.930391,0.0,0.0,48.99,51.19
111,111,915.29,58.82,182.6,15.613841,189.0,,0.0,0.0,21.5,29.69
177,177,915.9,,183.3,4.719943,189.9,5.346287,0.0,0.0,29.26,46.5
262,262,923.596607,58.380598,47.737753,10.636273,67.145843,13.671423,0.0,,17.990876,16.461685
277,277,920.48,62.6,194.4,2.751436,,3.869906,0.0,0.0,52.58,54.03
334,334,916.23,75.74,149.1,2.751436,187.5,4.183078,,1480.0,31.88,32.9
358,358,917.44,58.514,55.1,10.021491,,12.705819,0.0,0.0,13.88,25.93
361,361,920.444946,65.801845,49.823346,21.520177,61.886944,25.549112,,40.364018,12.278715,7.618649
381,381,918.48,66.542,90.9,3.467257,89.4,4.406772,,0.0,20.64,14.35
409,409,,67.853833,65.880616,4.328594,78.570923,5.216734,0.0,0.0,18.487385,20.356594


### Steps to clean the data
We don't need the row number, so we can delete that column.

In [8]:
del data['number']

Next, we drop the rows that contain null values, counting the rows before and after to see how many were removed.

In [9]:
before_rows = data.shape[0]
print(before_rows)

1095


In [10]:
data = data.dropna()

In [11]:
after_rows = data.shape[0]
print(after_rows)

1064


In [12]:
before_rows - after_rows

31
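
The behavior of `dropna()` can be checked on a small example. This is a sketch with a made-up `toy` frame (an assumption for illustration), showing that rows with any missing value are removed and that the surviving rows keep their original index labels.

```python
import numpy as np
import pandas as pd

# Toy frame with two incomplete rows (values are made up).
toy = pd.DataFrame({
    "air_pressure_9am": [918.06, np.nan, 923.04, 920.50],
    "air_temp_9am": [74.8, 71.4, np.nan, 70.1],
})

rows_before = len(toy)
# Default dropna(): remove every row with at least one missing value.
toy_clean = toy.dropna()
rows_dropped = rows_before - len(toy_clean)
print(rows_dropped)

# dropna() returns a new frame and keeps the original index labels,
# so gaps appear where rows were removed.
print(toy_clean.index.tolist())
```

This is why the cleaned weather data can still show index labels up to 1094 even though only 1064 rows remain.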

### Applying classification
First, we convert the relative_humidity_3pm variable into a binary label.

In [14]:
clean_data = data.copy()
# The '>' comparison returns boolean values; multiplying by 1 converts them to integers (True -> 1, False -> 0).
clean_data['high_humidity_label'] = (clean_data['relative_humidity_3pm'] > 24.99) * 1
print(clean_data['high_humidity_label'])

0       1
1       0
2       0
3       0
4       1
       ..
1090    1
1091    1
1092    1
1093    1
1094    0
Name: high_humidity_label, Length: 1064, dtype: int32
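
As a small cross-check, the sketch below compares the multiply-by-1 trick with an explicit `astype(int)` cast. The `toy_humidity` series is an assumption for illustration; its values are rounded copies of the first five `relative_humidity_3pm` readings shown earlier.

```python
import pandas as pd

# Rounded copies of the first five afternoon humidity readings.
toy_humidity = pd.Series(
    [36.16, 19.43, 14.46, 12.74, 76.74], name="relative_humidity_3pm"
)

# The comparison yields booleans; multiplying by 1 casts them to integers.
label_by_product = (toy_humidity > 24.99) * 1
# astype(int) states the same conversion explicitly.
label_by_astype = (toy_humidity > 24.99).astype(int)

# Both approaches produce the same labels.
print((label_by_product == label_by_astype).all())
# value_counts() shows how many days fall into each class.
print(label_by_astype.value_counts())
```

Either form works; `astype(int)` may read more clearly to someone unfamiliar with the arithmetic trick.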


The target is stored in 'y'.

In [17]:
y = clean_data[['high_humidity_label']].copy()
y

Unnamed: 0,high_humidity_label
0,1
1,0
2,0
3,0
4,1
...,...
1090,1
1091,1
1092,1
1093,1


In [18]:
clean_data['relative_humidity_3pm'].head()

0    36.160000
1    19.426597
2    14.460000
3    12.742547
4    76.740000
Name: relative_humidity_3pm, dtype: float64

In [19]:
y.head()

Unnamed: 0,high_humidity_label
0,1
1,0
2,0
3,0
4,1
