# What is a Decision Tree?

A <b>decision tree</b> is a <b>supervised learning algorithm</b> that is mostly used for classification problems. It works with both categorical and continuous input and output variables.

Example:

Let’s say we have a sample of <b>30 students</b> described by three variables: Gender (Boy/Girl), Class (IX/X), and Height (5 to 6 ft). 15 of these 30 students play cricket in their leisure time. We want to build a model that predicts who will play cricket during leisure time. To do that, we need to segregate the students who play cricket based on the most significant of the three input variables.

This is where a decision tree helps: it segregates the students on every value of the three variables and identifies the variable that creates the most homogeneous sets of students (sets that are heterogeneous to each other). In the snapshot below, you can see that the variable Gender identifies the most homogeneous sets compared to the other two variables.
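
One common way to quantify how homogeneous a set is uses Gini impurity, the criterion also used by the classifier later in this notebook. The sketch below is only an illustration: the parent node (30 students, 15 players) comes from the example above, but the per-gender counts are assumed values, not data from the text.

```python
def gini_impurity(n_positive, n_total):
    """Gini impurity of a node with n_total samples, n_positive of them positive."""
    p = n_positive / n_total
    return 1.0 - (p ** 2 + (1.0 - p) ** 2)


def weighted_gini(children):
    """Size-weighted average impurity of child nodes given as (n_positive, n_total)."""
    total = sum(n for _, n in children)
    return sum(n / total * gini_impurity(pos, n) for pos, n in children)


# Parent node from the example: 30 students, 15 of whom play cricket.
parent = gini_impurity(15, 30)

# Hypothetical split on Gender (counts assumed for illustration only):
# 10 girls, 2 of whom play; 20 boys, 13 of whom play.
gender_split = weighted_gini([(2, 10), (13, 20)])

print(f"parent impurity:    {parent:.3f}")
print(f"after Gender split: {gender_split:.3f}")
```

A lower weighted impurity after the split means more homogeneous child sets. A tree built with this criterion evaluates candidate splits in this way and keeps the one with the lowest value.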



# Practical Implementation

The file **daily_weather.csv** is a comma-separated file that contains weather data from a weather station. The station is equipped with sensors that capture weather-related measurements such as air temperature, air pressure, and relative humidity. Data was collected for a period of three years, from September 2011 to September 2014, to ensure that sufficient data for different seasons and weather conditions is captured.

**Problem: Use morning sensor signals as features to predict whether the humidity will be high at 3pm.**


First, we import the Python libraries needed to demonstrate the decision tree classifier.

In [None]:
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

Read the weather data from the CSV file using the pandas read_csv function.

In [None]:
data = pd.read_csv('daily_weather.csv')

In [None]:
data.head()

Unnamed: 0,air_pressure_9am,air_temp_9am,avg_wind_direction_9am,avg_wind_speed_9am,max_wind_direction_9am,max_wind_speed_9am,rain_accumulation_9am,rain_duration_9am,relative_humidity_9am,high_humidity_3pm
0,918.06,74.822,271.1,2.080354,295.4,2.863283,0.0,0.0,42.42,1
1,917.347688,71.403843,101.935179,2.443009,140.471549,3.533324,0.0,0.0,24.328697,0
2,923.04,60.638,51.0,17.067852,63.7,22.100967,0.0,20.0,8.9,0
3,920.502751,70.138895,198.832133,4.337363,211.203341,5.190045,0.0,0.0,12.189102,0
4,921.16,44.294,277.8,1.85666,136.5,2.863283,8.9,14730.0,92.41,1


In [None]:
data.columns

Index(['air_pressure_9am', 'air_temp_9am', 'avg_wind_direction_9am',
       'avg_wind_speed_9am', 'max_wind_direction_9am', 'max_wind_speed_9am',
       'rain_accumulation_9am', 'rain_duration_9am', 'relative_humidity_9am',
       'high_humidity_3pm'],
      dtype='object')

In [None]:
data.shape

(1095, 10)

In [None]:
data.describe()

Unnamed: 0,air_pressure_9am,air_temp_9am,avg_wind_direction_9am,avg_wind_speed_9am,max_wind_direction_9am,max_wind_speed_9am,rain_accumulation_9am,rain_duration_9am,relative_humidity_9am,high_humidity_3pm
count,1092.0,1090.0,1091.0,1092.0,1092.0,1091.0,1089.0,1092.0,1095.0,1095.0
mean,918.882551,64.933001,142.235511,5.508284,148.953518,7.019514,0.203079,294.108052,34.241402,0.499543
std,3.184161,11.175514,69.137859,4.552813,67.238013,5.598209,1.593952,1598.078779,25.472067,0.500228
min,907.99,36.752,15.5,0.693451,28.9,1.185578,0.0,0.0,6.09,0.0
25%,916.55,57.281,65.972506,2.248768,76.553003,3.067477,0.0,0.0,15.092243,0.0
50%,918.921045,65.715479,166.0,3.871333,177.3,4.943637,0.0,0.0,23.179259,0.0
75%,921.160073,73.450974,191.0,7.337163,201.233153,8.94776,0.0,0.0,45.4,1.0
max,929.32,98.906,343.4,23.554978,312.2,29.84078,24.02,17704.0,92.62,1.0


In [None]:
data['high_humidity_3pm'].value_counts()

0    548
1    547
Name: high_humidity_3pm, dtype: int64

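Check for null values and remove the rows that contain them. The per-column null counts below add up to 31, which matches the drop from 1095 to 1064 rows after dropna.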

In [None]:
data.isnull().sum()

air_pressure_9am          3
air_temp_9am              5
avg_wind_direction_9am    4
avg_wind_speed_9am        3
max_wind_direction_9am    3
max_wind_speed_9am        4
rain_accumulation_9am     6
rain_duration_9am         3
relative_humidity_9am     0
high_humidity_3pm         0
dtype: int64

In [None]:
data.dropna(inplace=True)

In [None]:
data.shape

(1064, 10)
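
The same check-then-drop pattern can be tried on a tiny table first. The frame below is a made-up stand-in that borrows three column names from the weather data; its values are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with one missing temperature reading.
toy = pd.DataFrame({
    "air_temp_9am": [74.8, np.nan, 60.6, 70.1],
    "relative_humidity_9am": [42.4, 24.3, 8.9, 12.2],
    "high_humidity_3pm": [1, 0, 0, 0],
})

null_counts = toy.isnull().sum()                 # missing values per column
rows_with_nulls = toy.isnull().any(axis=1).sum() # rows that dropna will remove
cleaned = toy.dropna()                           # returns a new frame; toy is unchanged

print(null_counts)
print(rows_with_nulls, cleaned.shape)
```

Counting the affected rows before dropping them makes it easy to confirm that the shape change afterwards is the one you expected.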

In [None]:
independent_variables = ['air_pressure_9am', 'air_temp_9am', 'avg_wind_direction_9am',
       'avg_wind_speed_9am', 'max_wind_direction_9am', 'max_wind_speed_9am',
       'rain_accumulation_9am', 'rain_duration_9am', 'relative_humidity_9am']

In [None]:
dependent_variable = 'high_humidity_3pm'

In [None]:
# Select the nine morning features and the 3pm target by name
X = data[independent_variables]
Y = data[dependent_variable]

Using train_test_split, we split the data into training and testing datasets.

In [None]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.33, random_state=112)
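
To see what test_size=0.33 means in row counts, the sketch below splits a stand-in array with the same number of rows as the cleaned table (1064). The data is synthetic, and the stratify argument is an optional addition that is not used in the notebook; it keeps the class ratio the same in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data with as many rows as the cleaned weather table.
X_demo = np.arange(1064 * 2).reshape(1064, 2)
y_demo = np.arange(1064) % 2  # alternating 0/1 labels, perfectly balanced

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=112, stratify=y_demo
)

print(len(X_tr), len(X_te))      # rows in the training and test parts
print(y_tr.mean(), y_te.mean())  # share of class 1 in each part
```

The test part is rounded up to a whole number of rows, so 1064 rows yield 352 test rows and 712 training rows.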

**Fit on Train Set**

In [None]:
dec_classifier = DecisionTreeClassifier(criterion='gini', max_leaf_nodes=20)
dec_classifier.fit(X_train, Y_train)
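
A fitted tree can be inspected to see which rules it learned. The sketch below uses synthetic data from make_classification and placeholder feature names, so the printed rules say nothing about the weather data; it only shows the calls involved.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in data (assumption: 4 features, 2 of them informative).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=4, n_informative=2, n_redundant=0, random_state=0
)
feature_names = [f"feature_{i}" for i in range(4)]  # placeholder names

tree = DecisionTreeClassifier(criterion="gini", max_leaf_nodes=20, random_state=0)
tree.fit(X_demo, y_demo)

n_leaves = tree.get_n_leaves()  # never exceeds max_leaf_nodes
depth = tree.get_depth()
rules = export_text(tree, feature_names=feature_names, max_depth=2)

print(n_leaves, depth)
print(rules)  # top levels of the tree as indented if/else rules
```

Setting random_state on the classifier makes the fitted tree reproducible when several splits are tied.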

In [None]:
y_pred_train = dec_classifier.predict(X_train)

In [None]:
accuracy_score(Y_train,y_pred_train)

0.9353932584269663

**Predict on Test Set**

In [None]:
y_predicted = dec_classifier.predict(X_test)

In [None]:
y_predicted[:10]

array([1, 1, 0, 1, 0, 0, 0, 0, 1, 0], dtype=int64)

In [None]:
Y_test[:10]

835    1
910    1
646    0
350    1
2      0
797    0
423    1
850    0
302    1
284    0
Name: high_humidity_3pm, dtype: int64

### Measure Accuracy of the Classifier

In [None]:
accuracy_score(Y_test, y_predicted) * 100

88.63636363636364
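
Accuracy is the share of predictions that match the true labels. As a small worked check, the sketch below recomputes it by hand for the ten predictions and ten labels printed earlier, and adds a confusion matrix to show which kind of error occurs. The variable names are introduced here for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# First ten predictions and true labels, copied from the outputs above.
y_pred_10 = np.array([1, 1, 0, 1, 0, 0, 0, 0, 1, 0])
y_true_10 = np.array([1, 1, 0, 1, 0, 0, 1, 0, 1, 0])

manual_accuracy = (y_pred_10 == y_true_10).mean()   # fraction of matches
library_accuracy = accuracy_score(y_true_10, y_pred_10)

# Rows are true classes (0, 1); columns are predicted classes (0, 1).
cm = confusion_matrix(y_true_10, y_pred_10)

print(manual_accuracy, library_accuracy)
print(cm)
```

On these ten rows, nine predictions match. The single error is a high-humidity day predicted as low, which appears in the bottom-left cell of the matrix.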

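The model predicts whether the humidity will be high at 3pm from morning sensor signals with 88.64% accuracy on the held-out test set, compared with 93.54% on the training set. Because the two classes are almost evenly balanced in the raw data (548 vs. 547), accuracy is a reasonable metric here, and the test score is well above the roughly 50% that a constant guess would achieve. The gap of about 4.9 percentage points between training and test accuracy points to some overfitting, but it is small enough to suggest that the model generalizes reasonably well to unseen data.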
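
A single train/test split gives only one estimate of test accuracy. One way to check how stable that estimate is, and how sensitive it is to max_leaf_nodes, is k-fold cross-validation. The sketch below runs on synthetic data with nine features; the candidate leaf counts are arbitrary choices, and the scores it prints do not describe the weather data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in with nine features, like the weather table.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=9, n_informative=4, random_state=0
)

results = {}
for leaves in (5, 20, 80):  # arbitrary candidate tree sizes
    model = DecisionTreeClassifier(
        criterion="gini", max_leaf_nodes=leaves, random_state=0
    )
    # Five accuracy scores, one per held-out fold.
    scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy")
    results[leaves] = scores.mean()
    print(f"max_leaf_nodes={leaves:3d}  mean accuracy={scores.mean():.3f}  std={scores.std():.3f}")
```

If a larger tree does not improve the cross-validated score, the extra leaves are likely fitting noise rather than signal.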