# Week 7 - Decision trees

## Part 1: Decision Tree Intro

***Exercise 1.1***

*There are two main kinds of decision trees depending on the type of output (numeric vs. categorical). What are they?*

***Answer:***

1. *Categorical-variable decision trees (classification trees) have a categorical target variable, for example 'yes' or 'no'. The tree always predicts one of the classes, never something in between 'yes' and 'no'.*

2. *Continuous-variable decision trees (regression trees) have a continuous target variable. An example would be predicting the final speed of a sprinter's run based on their gender, income, family heritage, etc.*
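
*The two kinds of tree can be illustrated with scikit-learn. This is a minimal sketch on made-up toy data (the arrays `X`, `y_class` and `y_value` are assumptions for illustration, not part of the exercise data):*

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Toy feature matrix: one numeric feature, eight samples (made up for illustration)
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0], [7.0], [8.0]])

# Categorical target -> the classification tree predicts one of the classes
y_class = np.array(["no", "no", "no", "no", "yes", "yes", "yes", "yes"])
clf = DecisionTreeClassifier(random_state=0).fit(X, y_class)
class_prediction = clf.predict([[2.5], [7.5]])

# Continuous target -> the regression tree predicts a number (the mean of a leaf)
y_value = np.array([1.1, 1.9, 3.2, 3.9, 5.1, 6.0, 6.8, 8.1])
reg = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, y_value)
value_prediction = reg.predict([[2.5], [7.5]])

print(class_prediction)
print(value_prediction)
```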
__________

***Exercise 1.2***

*Explain in your own words: Why is entropy useful when deciding where to split the data?*

***Answer:***

*Entropy is a measure of randomness (impurity) in the data. The higher the entropy, the harder it is for us to conclude anything. A fair coin flipped 100 times is truly random, so its outcomes have high entropy. If a split produces subsets that still have high entropy, we know that it is not a good split, since the classes remain mixed. A good split produces subsets with lower entropy than the data we started with.*
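
*To make this concrete, here is a small sketch that computes Shannon entropy and the information gain of a split. The helper names `entropy` and `information_gain` and the toy labels are our own assumptions for illustration:*

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    probs = counts / counts.sum()
    return float(-np.sum(probs * np.log2(probs)))

def information_gain(parent, left, right):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted

parent = ["yes"] * 4 + ["no"] * 4

# A split that separates the classes perfectly
good_gain = information_gain(parent, ["yes"] * 4, ["no"] * 4)

# A split that leaves both children as mixed as the parent
bad_gain = information_gain(parent, ["yes", "yes", "no", "no"], ["yes", "yes", "no", "no"])

print(entropy(parent), good_gain, bad_gain)
```

*The perfect split removes all of the parent's entropy, while the uninformative split removes none of it, which is why a tree would prefer the first.*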
____

***Exercise 1.3***

*Why are trees prone to overfitting?*

***Answer:***

*Decision trees tend to overfit when the model is given too much depth, i.e. the model fits the training set perfectly but does not predict the test set correctly. This can be addressed by testing how fine-grained the model should be, i.e. at which point a further split stops being meaningful for the general model.*
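
*The effect of depth can be sketched on synthetic data. The dataset below is generated with `make_classification` and deliberately noisy labels; it is an assumption for illustration and unrelated to the crime data used later:*

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 20% label noise, so memorising the training set does not generalise
X, y = make_classification(n_samples=2000, n_features=10, n_informative=4,
                           flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

scores = {}
for depth in [2, 5, None]:  # None lets the tree grow until every leaf is pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_train, y_train)
    # Store (training accuracy, test accuracy) for this depth
    scores[depth] = (tree.score(X_train, y_train), tree.score(X_test, y_test))
    print(depth, scores[depth])
```

*The unrestricted tree reaches perfect training accuracy, while its test accuracy stays clearly below that. This gap is the signature of overfitting.*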
____

***Exercise 1.4***

*Explain (in your own words) how random forests help prevent overfitting.*

***Answer:***

*A random forest builds many decision trees, each trained on a random sample of the data and on different subsets of the features. Individual trees may overfit, but their errors differ, so combining their predictions averages much of the overfitting away. By testing the forest on a held-out test set, we can check that it not only reduces the error on the training set but also still provides a good fit (not an overfit) on the test set. The trick is to get a model that does both parts well.*
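
*A hedged sketch of this idea, again on synthetic data (the dataset and the hyperparameter values are assumptions chosen for illustration): a single fully grown tree is compared with a forest whose trees each see a bootstrap sample and a random subset of features at every split.*

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=10, n_informative=4,
                           flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# One fully grown tree: fits the training data closely, varies a lot between samples
single_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Many trees, each fit on a bootstrap sample and restricted to a random
# subset of features at every split; predictions are combined across trees
forest = RandomForestClassifier(n_estimators=200, max_features="sqrt",
                                bootstrap=True, oob_score=True,
                                random_state=0).fit(X_train, y_train)

tree_test_acc = single_tree.score(X_test, y_test)
forest_test_acc = forest.score(X_test, y_test)
print(tree_test_acc, forest_test_acc, forest.oob_score_)
```

*The out-of-bag score (`oob_score_`) estimates generalisation from the samples each tree did not see during training, which gives a second check besides the held-out test set.*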
____

## Part 2: Decision Tree Baseline

***Exercise 2.1***

*Decision trees and real-world crime data*

***Answer:***

*We start off by importing the dataset (and the needed packages) of reported incidents:*

In [86]:
import pandas as pd
import numpy as np

In [3]:
policing_dataframe = pd.read_csv('../rawdata/Police_Department_Incident_Reports__Historical_2003_to_May_2018.csv', parse_dates=[['Date', 'Time']])

*We chose to work with*
- **Type 1:** `BURGLARY`
- **Type 2:** `FORGERY/COUNTERFEITING`

*We select these in the dataset*

In [158]:
# We define the masks
mask_p_df_burg = (policing_dataframe['Category'] == "BURGLARY")
mask_p_df_forg = (policing_dataframe['Category'] == "FORGERY/COUNTERFEITING")

# We now get the training dataframes with 10000 cases (we use .sample method to get random samples over time)
pd_burg_train = policing_dataframe.loc[mask_p_df_burg].reset_index(drop=True).sample(10000)
pd_forg_train = policing_dataframe.loc[mask_p_df_forg].reset_index(drop=True).sample(10000)

# We now get the test dataframes with 2000 cases (we use .sample method to get random samples over time)
#pd_burg_test = policing_dataframe.loc[mask_p_df_burg].reset_index(drop=True).sample(2000)
#pd_forg_test = policing_dataframe.loc[mask_p_df_forg].reset_index(drop=True).sample(2000)

*We can now build the four features that we need:*

- **Feature 1:** `DayOfWeek`
- **Feature 2:** `PdDistrict`
- **Feature 3:** `HourOfDay`
- **Feature 4:** `MonthOfYear`


*We combine the two train datasets*

In [159]:
frames = [pd_burg_train, pd_forg_train]
pd_train = pd.concat(frames).reset_index(drop=True)

In [160]:
# We get the day of the week (Monday = 0) and store it in a column
pd_train["DayOfWeek"] = pd_train["Date_Time"].dt.dayofweek

# We get the hour of the day
pd_train["HourOfDay"] = pd_train["Date_Time"].dt.hour

# We get the month of the year
pd_train["MonthOfYear"] = pd_train["Date_Time"].dt.month

*We also need to encode our `PdDistrict` column. We will do this once we have chosen the data we are interested in working with:*

In [161]:
feature_set = ["DayOfWeek", "PdDistrict", "HourOfDay", "MonthOfYear", "Category"]

In [162]:
# Reset the index
pd_train = pd_train[feature_set].reset_index(drop=True)

*Now we want to do one-hot encoding for the dataframes*

In [163]:
pd_enc = pd.get_dummies(pd_train)
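
*To show what `pd.get_dummies` does here, a small sketch on a made-up frame (the frame `toy` and its values are assumptions for illustration). String columns are expanded into one indicator column per value, while integer columns pass through unchanged:*

```python
import pandas as pd

# Tiny made-up frame with the same kinds of columns as above
toy = pd.DataFrame({
    "DayOfWeek": [0, 5, 6],
    "PdDistrict": ["MISSION", "NORTHERN", "MISSION"],
    "Category": ["BURGLARY", "FORGERY/COUNTERFEITING", "BURGLARY"],
})

# Only string columns are expanded; integer columns pass through unchanged
toy_enc = pd.get_dummies(toy, dtype=int)
print(list(toy_enc.columns))
```

*Note that `DayOfWeek`, `HourOfDay` and `MonthOfYear` therefore stay as plain integers, so the model treats them as ordered numbers rather than as categories.*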

*We now want to split the dataframes into features and targets*

*We first decide on the labels, which are `[Category_BURGLARY, Category_FORGERY/COUNTERFEITING]`.*

In [164]:
# Labels we want to predict
labels = np.array(pd_enc[["Category_BURGLARY", "Category_FORGERY/COUNTERFEITING"]])

# Remove the labels from the features
pd_enc = pd_enc.drop(["Category_BURGLARY", "Category_FORGERY/COUNTERFEITING"], axis=1)

# Save the feature names for later use
feature_list = list(pd_enc.columns)

# Convert to numpy array
features = np.array(pd_enc)
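
*A side note on the label layout: with a two-column indicator array, scikit-learn treats the task as a multi-output problem, and a prediction only counts as correct when both columns match. Since the two categories are mutually exclusive, a single 1-D label is an alternative. The sketch below uses a hypothetical array `labels_onehot` in the same layout as `labels`:*

```python
import numpy as np

# Hypothetical one-hot label array in the same layout as `labels` above:
# column 0 = BURGLARY, column 1 = FORGERY/COUNTERFEITING
labels_onehot = np.array([[1, 0], [0, 1], [0, 1], [1, 0]], dtype=np.uint8)

# argmax picks the index of the column holding the 1 in each row
labels_1d = labels_onehot.argmax(axis=1)
print(labels_1d)
```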

*We can now use sklearn to create our training and test data*

In [166]:
from sklearn.model_selection import train_test_split

train_features, test_features, train_labels, test_labels = train_test_split(features, labels, test_size = 0.20, random_state = 42)

In [167]:
print('Training Features Shape:', train_features.shape)
print('Training Labels Shape:', train_labels.shape)
print('Testing Features Shape:', test_features.shape)
print('Testing Labels Shape:', test_labels.shape)

Training Features Shape: (16000, 13)
Training Labels Shape: (16000, 2)
Testing Features Shape: (4000, 13)
Testing Labels Shape: (4000, 2)


*Now we would like to train the model*

In [173]:
# Import the model we are using
from sklearn.ensemble import RandomForestClassifier

# Instantiate model with 1000 decision trees
rf = RandomForestClassifier(n_estimators = 1000, random_state = 42)

# Train the model on training data
rf.fit(train_features, train_labels);

*We've now fitted a forest of 1000 generated decision trees. Now it's time to test the model by making predictions on the test data*

In [174]:
# Use the forest's predict method on the test data
predictions = rf.predict(test_features)

In [175]:
predictions

array([[0, 1],
       [1, 0],
       [1, 0],
       ...,
       [1, 0],
       [1, 0],
       [1, 0]], dtype=uint8)

In [176]:
train_labels

array([[1, 0],
       [1, 0],
       [1, 0],
       ...,
       [1, 0],
       [1, 0],
       [0, 1]], dtype=uint8)

In [186]:
from sklearn.metrics import accuracy_score

print("The accuracy of the model is:", accuracy_score(test_labels, predictions) * 100,"%")

The accuracy of the model is: 63.4 %
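
*The feature names saved in `feature_list` can be paired with the forest's `feature_importances_` to see which inputs drive the predictions. The following is a sketch on synthetic stand-in data (the frame `toy_features` and the made-up target are assumptions, so the ranking says nothing about the crime data):*

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the encoded features; the column names are illustrative
rng = np.random.default_rng(0)
toy_features = pd.DataFrame({
    "HourOfDay": rng.integers(0, 24, size=500),
    "DayOfWeek": rng.integers(0, 7, size=500),
    "MonthOfYear": rng.integers(1, 13, size=500),
})
# Made-up target that depends only on the hour
toy_labels = (toy_features["HourOfDay"] >= 12).astype(int)

toy_rf = RandomForestClassifier(n_estimators=100, random_state=0)
toy_rf.fit(toy_features.to_numpy(), toy_labels.to_numpy())

# Pair each importance with its column name and sort, largest first
importances = pd.Series(toy_rf.feature_importances_, index=toy_features.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```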


___________

## Part 3: Beyond the Baseline with Weather

***Exercise 3.1***

*Load the weather dataset. If you have your training data and test data on separate DataFrames then merging them with the weather information should be simple*

***Answer:***

*We start by loading in the dataset*

In [188]:
df_weather = pd.read_csv("weather_data.csv")

In [190]:
df_weather["date"] = pd.to_datetime(df_weather["date"])

In [191]:
df_weather

| Unnamed: 0 | date | temperature | humidity | weather | wind_speed | wind_direction | pressure |
|---|---|---|---|---|---|---|---|
| 0 | 2012-10-01 13:00:00 | 16.330000 | 88.0 | light rain | 2.0 | 150.0 | 1009.0 |
| 1 | 2012-10-01 14:00:00 | 16.324993 | 87.0 | sky is clear | 2.0 | 147.0 | 1009.0 |
| 2 | 2012-10-01 15:00:00 | 16.310618 | 86.0 | sky is clear | 2.0 | 141.0 | 1009.0 |
| 3 | 2012-10-01 16:00:00 | 16.296243 | 85.0 | sky is clear | 2.0 | 135.0 | 1009.0 |
| 4 | 2012-10-01 17:00:00 | 16.281869 | 84.0 | sky is clear | 2.0 | 129.0 | 1009.0 |
| 5 | 2012-10-01 18:00:00 | 16.267494 | 83.0 | sky is clear | 2.0 | 123.0 | 1010.0 |
| 6 | 2012-10-01 19:00:00 | 16.253119 | 82.0 | sky is clear | 2.0 | 117.0 | 1010.0 |
| 7 | 2012-10-01 20:00:00 | 16.238745 | 81.0 | sky is clear | 1.0 | 110.0 | 1010.0 |
| 8 | 2012-10-01 21:00:00 | 16.224370 | 80.0 | sky is clear | 1.0 | 104.0 | 1010.0 |
| 9 | 2012-10-01 22:00:00 | 16.209995 | 79.0 | sky is clear | 1.0 | 98.0 | 1011.0 |


In [192]:
# We get the day of the week (Monday = 0) for each weather reading
df_weather["DayOfWeek"] = df_weather["date"].dt.dayofweek

# We get the hour of the day
df_weather["HourOfDay"] = df_weather["date"].dt.hour

# We get the month of the year
df_weather["MonthOfYear"] = df_weather["date"].dt.month

In [197]:
df_weather

| Unnamed: 0 | date | temperature | humidity | weather | wind_speed | wind_direction | pressure | DayOfWeek | HourOfDay | MonthOfYear |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2012-10-01 13:00:00 | 16.330000 | 88.0 | light rain | 2.0 | 150.0 | 1009.0 | 0 | 13 | 10 |
| 1 | 2012-10-01 14:00:00 | 16.324993 | 87.0 | sky is clear | 2.0 | 147.0 | 1009.0 | 0 | 14 | 10 |
| 2 | 2012-10-01 15:00:00 | 16.310618 | 86.0 | sky is clear | 2.0 | 141.0 | 1009.0 | 0 | 15 | 10 |
| 3 | 2012-10-01 16:00:00 | 16.296243 | 85.0 | sky is clear | 2.0 | 135.0 | 1009.0 | 0 | 16 | 10 |
| 4 | 2012-10-01 17:00:00 | 16.281869 | 84.0 | sky is clear | 2.0 | 129.0 | 1009.0 | 0 | 17 | 10 |
| 5 | 2012-10-01 18:00:00 | 16.267494 | 83.0 | sky is clear | 2.0 | 123.0 | 1010.0 | 0 | 18 | 10 |
| 6 | 2012-10-01 19:00:00 | 16.253119 | 82.0 | sky is clear | 2.0 | 117.0 | 1010.0 | 0 | 19 | 10 |
| 7 | 2012-10-01 20:00:00 | 16.238745 | 81.0 | sky is clear | 1.0 | 110.0 | 1010.0 | 0 | 20 | 10 |
| 8 | 2012-10-01 21:00:00 | 16.224370 | 80.0 | sky is clear | 1.0 | 104.0 | 1010.0 | 0 | 21 | 10 |
| 9 | 2012-10-01 22:00:00 | 16.209995 | 79.0 | sky is clear | 1.0 | 98.0 | 1011.0 | 0 | 22 | 10 |


In [208]:
df_combined = df_weather.join(pd_train, lsuffix="_caller", how="right")
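
*Note that `DataFrame.join` without an `on` argument aligns the two frames on their row index, not on the timestamp. An alternative is to match each incident to the weather reading of the same hour. The sketch below assumes made-up frames `toy_weather` and `toy_incidents` (hypothetical names and values) and that the incident timestamp is still available as `Date_Time`:*

```python
import pandas as pd

# Made-up hourly weather readings (column names follow df_weather above)
toy_weather = pd.DataFrame({
    "date": pd.to_datetime(["2012-10-01 13:00", "2012-10-01 14:00", "2012-10-01 15:00"]),
    "temperature": [16.33, 16.32, 16.31],
})

# Made-up incidents with minute-level timestamps
toy_incidents = pd.DataFrame({
    "Date_Time": pd.to_datetime(["2012-10-01 14:35", "2012-10-01 13:05", "2012-10-02 09:00"]),
    "Category": ["BURGLARY", "FORGERY/COUNTERFEITING", "BURGLARY"],
})

# Round each incident down to the hour so it can match an hourly reading
toy_incidents["date"] = toy_incidents["Date_Time"].dt.floor("h")

# Left merge keeps every incident; incidents without a reading get NaN weather
toy_merged = toy_incidents.merge(toy_weather, on="date", how="left")
print(toy_merged)
```

*With this approach, incidents that fall outside the period covered by the weather data (the rows shown above start on 2012-10-01) end up with missing weather values and would need to be dropped or filtered beforehand.*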

In [209]:
df_combined = df_combined[["temperature", "humidity","weather", "wind_speed", "wind_direction", "pressure", "DayOfWeek_caller", "HourOfDay_caller", "MonthOfYear_caller", "DayOfWeek", "PdDistrict", "HourOfDay", "MonthOfYear", "Category"]]

In [210]:
combined_enc = pd.get_dummies(df_combined)

*We create our new `RandomForestClassifier` based on the new data*

In [221]:
# Labels we want to predict
labels = np.array(combined_enc[["Category_BURGLARY", "Category_FORGERY/COUNTERFEITING"]])

# Remove the labels from the features
combined_enc = combined_enc.drop(["Category_BURGLARY", "Category_FORGERY/COUNTERFEITING"], axis=1)

# Save the feature names for later use
feature_list = list(combined_enc.columns)

# Convert to numpy array
features = np.array(combined_enc)

In [222]:
from sklearn.model_selection import train_test_split

train_features, test_features, train_labels, test_labels = train_test_split(features, labels, test_size = 0.20, random_state = 42)

In [223]:
print('Training Features Shape:', train_features.shape)
print('Training Labels Shape:', train_labels.shape)
print('Testing Features Shape:', test_features.shape)
print('Testing Labels Shape:', test_labels.shape)

Training Features Shape: (16000, 45)
Training Labels Shape: (16000, 2)
Testing Features Shape: (4000, 45)
Testing Labels Shape: (4000, 2)


In [224]:
# Import the model we are using
from sklearn.ensemble import RandomForestClassifier

# Instantiate model with 1000 decision trees
rf = RandomForestClassifier(n_estimators = 1000, random_state = 42)

# Train the model on training data
rf.fit(train_features, train_labels);

In [225]:
# Use the forest's predict method on the test data
predictions = rf.predict(test_features)

In [226]:
from sklearn.metrics import accuracy_score

print("The accuracy of the model is:", accuracy_score(test_labels, predictions) * 100,"%")

The accuracy of the model is: 88.875 %


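*Quite interestingly, the prediction accuracy is about 25 percentage points higher (88.875% vs. 63.4%) than for the model that did not include the weather data. This gain should be read with caution: `join` aligns the weather rows with the crime rows by row index rather than by timestamp, and the burglary rows come before the forgery rows in `pd_train`, so the time-ordered weather columns may leak the label.*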
____

## Part 4: Video Lectures and Reading

***Exercise 4.1***

*What are the three key elements to keep in mind when you design an explanatory visualization?*

***Answer:*** *There are 3 key elements:*

- *Start with the question (What is the question that you want to answer and communicate?)*
- *Allow exploration (Let the users investigate the data; tools like D3 or even Tableau can be used for this)*
- *Know your readers (Make sure that you are not dumbing something down for peers, or making it overly complicated when it needs to be cross-disciplinary)*
____

***Exercise 4.2***

*In the video I talk about (1) overview first, (2) zoom and filter, (3) details on demand*
- *Go online and find a visualization that follows these principles (don't use one from the video)*
- *Explain how it achieves (1)-(3). It might be useful to use screenshots to illustrate your explanation.*

***Answer:*** 
- *Online we found [this link](http://manpopex.us/)*
- *(1) First you get a good overview of the dynamic population of Manhattan. (2) Then you can zoom in and filter the data that you need; for instance, you can select a single district instead of the 12 highlighted districts. (3) Then you can select the data for a specific day to see how much it changes over the day.*
_________

***Exercise 4.3***

*Explain in your own words: How is explanatory data analysis different from exploratory data analysis?*

***Answer:*** *Exploratory data analysis is where you are looking for correlations and interesting stories that the data has to tell you. This is in contrast to explanatory data analysis, where you have already found the interesting story in the data and have spent time visualising that story so that others may see it as well.*
_____

***Exercise 4.4***

*What is the Oxford English Dictionary's definition of a narrative?*

***Answer:*** *"an account of a series of events, facts, etc., given in order and with the establishing of connections between them"*
_____

***Exercise 4.5***

*What is your favorite visualization among the examples in section 3? Explain why in a few words.*

***Answer:*** *Figure 3 is good-looking, clear and concise. From this figure, I can easily see in which areas of Afghanistan certain building activities have been taking place. I like the fact that there is a narrative to go with the figure, meaning that you are following the explanatory storyline that the creator has found for you. The data is also open, meaning that I can interactively confirm that the data is correct, at least to some extent.*
_____