# Week 7



## The intro

Anyway. I'm sure you guys have a lot to do this week, so we'll try to keep it relatively light (although there should be enough optional exercises to keep you all busy).

[![IMAGE ALT TEXT HERE](https://img.youtube.com/vi/6b4EQk96SfQ/0.jpg)](https://www.youtube.com/watch?v=6b4EQk96SfQ)

Remember that last week you worked on classifying data using *KNN*. We are going to continue working with machine learning, this time looking at *decision trees* to see how new information can influence the performance of our model in predicting which type of crime happened.

Specifically, crimes can have many causes, so we can combine data sources to better understand what makes a criminal commit a crime. Are there specific factors which trigger that individual to act? Since criminals are notoriously shy about sharing information, we must try to find this out in a different way. Lucky for us, we can do this with data! 

*We are going to use weather data* from San Francisco to try to relate different crimes to meteorological conditions!

* We'll start with a relatively simple exercise focusing on adding weather data to the decision tree from last week (Part 1, 2, and 3).
* Then we'll prepare a bit for next week, when we get into the topic of explanatory data visualization with some lectures and reading (Part 4)

## Part 1: Decision Tree Intro

Now we turn to decision trees. This is a fantastically useful supervised machine-learning method that we use all the time in research. To get started on decision trees, we'll use a fantastic *visual* introduction. 


*Decision Trees Reading 1*: The visual introduction to decision trees on this webpage is AMAZING. Take a look to get an intuitive feel for how trees work. Do not miss this one, it's a treat! http://www.r2d3.us/visual-intro-to-machine-learning-part-1/

*Decision Trees Reading 2*: the second part of the visual introduction is about the topic of model selection, and bias/variance tradeoffs that we looked into earlier during this lesson. But once again, here those topics are visualized in a fantastic and inspiring way, that will make it stick in your brain better. So check it out http://www.r2d3.us/visual-intro-to-machine-learning-part-2/

*Decision Trees Reading 3*: Finally, you can also read about decision trees in DSFS, chapter 17. **You can get it on DTU Learn**

And our little session on decision trees wouldn't be complete without hearing from Ole about these things. 

[![Ole explains decision trees](https://img.youtube.com/vi/LAA_CnkAEx8/0.jpg)](https://www.youtube.com/watch?v=LAA_CnkAEx8)


*Decision tree "reading" 4*: And of course the best way to learn how to get this stuff rolling in practice, is to work through a tutorial or two. We recommend the ones below:
  * https://jakevdp.github.io/PythonDataScienceHandbook/05.08-random-forests.html
  * https://towardsdatascience.com/random-forest-in-python-24d0893d51c0 (this one also has good considerations regarding the one-hot encodings)
  
(But there are many other good ones out there.)

> Exercises: Just a few questions to make sure you've read the text (DSFS chapter 17) and/or watched the video.
> 
> * There are two main kinds of decision trees depending on the type of output (numeric vs. categorical). What are they?
> * Explain in your own words: Why is entropy useful when deciding where to split the data?
> * Why are trees prone to overfitting?
> * Explain (in your own words) how random forests help prevent overfitting.
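
If you'd like something concrete to poke at while thinking about the entropy question, here is a minimal sketch (not the DSFS implementation) that computes the entropy of a set of labels and the information gain of a candidate split. The labels are made up, and the helper names `entropy` and `information_gain` are ours. Note that `sklearn`'s `DecisionTreeClassifier` uses Gini impurity by default; passing `criterion="entropy"` switches it to the measure sketched here.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def information_gain(parent, left, right):
    """Entropy reduction achieved by splitting `parent` into `left` and `right`."""
    n = len(parent)
    weighted = len(left) / n * entropy(left) + len(right) / n * entropy(right)
    return entropy(parent) - weighted

# Toy labels, made up for illustration: 4 burglaries (B) and 4 frauds (F).
parent = ["B", "B", "B", "B", "F", "F", "F", "F"]

# A split that separates the two classes perfectly ...
perfect_gain = information_gain(parent, ["B"] * 4, ["F"] * 4)
# ... and one that leaves both halves exactly as mixed as the parent.
useless_gain = information_gain(parent, ["B", "B", "F", "F"], ["B", "B", "F", "F"])

print("parent entropy:", entropy(parent))
print("gain of perfect split:", perfect_gain)
print("gain of useless split:", useless_gain)
```

A tree-building algorithm would try many candidate splits and keep the one with the highest gain, then repeat on each half.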

## Part 2: Decision Tree Baseline


> *Exercise*: Decision trees and real-world crime data
> 
> The idea for today is to pick two crime-types that have *different geographical patterns* and *different temporal patterns*. We can then use various variables of the real crime data as categories to build a decision tree. I'm thinking we can use
> * `DayOfWeek` (`Sunday`, ..., `Saturday`). (Note: Will need to be encoded as integer in `sklearn`)
> * `PD District` (`TENDERLOIN`, etc). (Note: Will need to be encoded as integer in `sklearn`)
> 
> And we can extract a few more from the `Time` and `Date` variables
> * Hour of the day (1-24)
> * Month of the year (1-12)
> 
> So your job is to **select two crime categories** that (based on your analyses from the past three weeks) have different spatio-temporal patterns. Since we will use weather data to distinguish between them later, let's try to think of crime categories that our intuition tells us might be strongly influenced by the weather conditions (type 1). And also think of other categories where we **don't** expect weather to play a role (type 2). We suggest:

* `BURGLARY` or `VEHICLE THEFT` for type 1. 
* `FORGERY/COUNTERFEITING` or `FRAUD` for type 2. 

But you are free to choose other ones, if you like 🤓

In [1]:
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder  
from matplotlib.image import NonUniformImage
import pandas as pd
import numpy as np
from datetime import datetime
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier

In [12]:
data = pd.read_csv("Police_Department_Incident_Reports__Historical_2003_to_May_2018.csv")
data['time']= [datetime.strptime(i,"%H:%M") for i in data.Time]
data['hours'] = pd.DatetimeIndex(data['time']).hour
data['Month']=pd.DatetimeIndex(data['Date']).month

In [46]:
data['DayOfWeek_'] = data['DayOfWeek'].replace(["Monday","Tuesday","Wednesday","Thursday","Friday","Saturday","Sunday"], np.arange(1,8) )

In [47]:
le = LabelEncoder()
data['PdDistrict_']=le.fit_transform(data['PdDistrict'])
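
To see what the integer encoding above actually does, and what the one-hot alternative mentioned in the tutorials looks like, here is a small sketch on a made-up district column (the four values are just for illustration, not a sample of the real data).

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy district column, made up for illustration.
districts = pd.Series(["TENDERLOIN", "MISSION", "RICHMOND", "MISSION"], name="PdDistrict")

# Integer codes: the classes are sorted, then numbered from 0.
le = LabelEncoder()
codes = le.fit_transform(districts)
print(dict(zip(le.classes_, range(len(le.classes_)))))
print(codes)

# One-hot alternative: one indicator column per district, no implied ordering.
one_hot = pd.get_dummies(districts, prefix="PdDistrict")
print(one_hot.columns.tolist())
```

The integer codes impose an arbitrary ordering on the districts, which a tree can only use through threshold splits; the one-hot version avoids that at the cost of more columns.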

In [60]:
data_B = data[data.Category == 'BURGLARY']
data_F = data[data.Category == 'FRAUD']

In [72]:
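features = ['DayOfWeek_','PdDistrict_','Month','hours']

# Keep only the feature columns. reset_index returns a new DataFrame,
# so the result has to be assigned back to take effect.
data_B = data_B[features].reset_index(drop=True)
data_F = data_F[features].reset_index(drop=True)
data_F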

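| | DayOfWeek_ | PdDistrict_ | Month | hours |
|---|---|---|---|---|
| 0 | 5 | 6 | 10 | 21 |
| 1 | 4 | 8 | 4 | 15 |
| 2 | 4 | 4 | 7 | 10 |
| 3 | 1 | 8 | 10 | 18 |
| 4 | 1 | 2 | 9 | 12 |
| ... | ... | ... | ... | ... |
| 41343 | 2 | 1 | 8 | 20 |
| 41344 | 3 | 9 | 7 | 13 |
| 41345 | 5 | 8 | 6 | 15 |
| 41346 | 0 | 4 | 1 | 20 |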



> We are now going to build a decision tree (or, even better, a [Random Forest](https://jakevdp.github.io/PythonDataScienceHandbook/05.08-random-forests.html); here is [another tutorial for Random Forests](https://towardsdatascience.com/random-forest-in-python-24d0893d51c0)) classifier that takes as input the four features (Hour-of-the-day, Day-of-the-week, Month-of-the-year, and PD-District) of a crime (from one of the two categories) and then tries to predict which category that crime is from.
>
> Some notes/hints
> * Remember to create a balanced dataset, that is, **grab an equal number of examples** from each of the two crime categories. Pick categories with lots of training data. It's probably nice to have something like 10000+ examples of each category to train on. 
> * Also, I recommend you grab your training data at `random` from the set of all examples, since we want crimes to be distributed equally over time.
> * A good option is the  `DecisionTreeClassifier`.
> * We recommend you build a separate Pandas `DataFrame` with your training data, so the process of adding the weather data will be as smooth as possible later on. The same goes for your testing data.
> * Create a function to evaluate the precision of your classifier. Make sure your test data is not used for training. (Since you have created a balanced dataset, the baseline performance (random guess) is 50%. How good can your classifier get?)
> * (Optional, although this one might improve performance). Does one hot encoding affect your results? Why/Why not?  
> * (Optional) Are your results tied to the specific training data you used? Are you overfitting? Try performing [cross-validation](https://towardsdatascience.com/cross-validation-in-machine-learning-72924a69872f) to answer this question.
> * (Optional) If you find yourself with extra time, come back to this exercise and tweak the variables you use to see if you can improve the accuracy of the tree. Try for example adding Year, Month or other variables you think may be relevant.
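
The first few hints above (balanced random sample, held-out test set, evaluation function) can be sketched roughly as follows. This is only one way to do it, and it runs on synthetic data: `fake_crimes`, the peak hours, and the row counts are all invented so that the snippet is self-contained. With the real data you would start from your own crime `DataFrame` instead.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

def fake_crimes(n, category, peak_hour):
    """Hypothetical stand-in for one crime category (not the real SF data)."""
    hours = rng.normal(peak_hour, 4, size=n).round().astype(int) % 24
    return pd.DataFrame({
        "hours": hours,
        "DayOfWeek_": rng.integers(1, 8, size=n),
        "Category": category,
    })

# Deliberately unbalanced raw data: 3000 rows of one class, 1000 of the other.
raw = pd.concat([fake_crimes(3000, "BURGLARY", 3), fake_crimes(1000, "FRAUD", 13)],
                ignore_index=True)

# Balance by drawing the same number of rows, at random, from each category.
n_per_class = int(raw["Category"].value_counts().min())
balanced = raw.groupby("Category").sample(n=n_per_class, random_state=1)

def evaluate(model, X_test, y_test):
    """Fraction of held-out rows whose category is predicted correctly."""
    return float((model.predict(X_test) == y_test).mean())

X = balanced[["hours", "DayOfWeek_"]]
y = balanced["Category"]
# stratify=y keeps the 50/50 class balance in both the training and test parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42, stratify=y)

clf = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
print(f"test accuracy: {evaluate(clf, X_test, y_test):.3f}")
```

Because the classes are balanced, anything clearly above 0.5 on the test set means the features carry some signal.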

In [80]:
sample_n = 40000
data_F=data_F.sample(n=sample_n, random_state=1)
data_B=data_B.sample(n=sample_n, random_state=1)

In [85]:
data_F['y']=np.ones(sample_n)
data_B['y']=2*np.ones(sample_n)
data_tree = pd.concat([data_F,data_B])

In [None]:
from sklearn import tree
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier


In [124]:
X = data_tree[['DayOfWeek_','PdDistrict_','Month','hours']]
y = data_tree['y']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

In [129]:
model = RandomForestClassifier(n_estimators=1000, random_state=42)
model.fit(X_train,y_train)

RandomForestClassifier(n_estimators=1000, random_state=42)

In [130]:
y_hat = model.predict(X_test)

In [131]:
np.sum(y_hat == y_test)/y_hat.shape

array([0.59155303])

In [132]:
from sklearn.metrics import accuracy_score
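# sklearn's convention is accuracy_score(y_true, y_pred); for accuracy the result is the same either way.
accuracy_score(y_test, y_hat)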

0.5915530303030303

### Single decision tree

In [119]:
clf = tree.DecisionTreeClassifier()
clf = clf.fit(X_train, y_train)

In [120]:
y_cl_hat=clf.predict(X_test)

In [121]:
np.sum(y_cl_hat == y_test)/y_cl_hat.shape

array([0.5812])
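
The accuracy figures above come from a single train/test split. To get a feel for the overfitting question from the hints, one option is to compare training accuracy, test accuracy, and a cross-validated score for an unconstrained tree and a shallow one. The sketch below does that on synthetic data (the features, the 30% label noise, and `max_depth=3` are all arbitrary choices for illustration, not tuned for the crime data).

```python
import numpy as np
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in for the crime features: only the first column carries
# signal, and 30% of the labels are flipped so memorising rows cannot help.
X = rng.uniform(0, 24, size=(4000, 4))
y = (X[:, 0] > 12).astype(int)
flip = rng.random(4000) < 0.3
y[flip] = 1 - y[flip]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)

results = {}
for depth in (None, 3):  # None = grow the tree until every leaf is pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42)
    cv = cross_val_score(tree, X_train, y_train, cv=5)  # fits clones, 5 folds
    tree.fit(X_train, y_train)
    results[depth] = (tree.score(X_train, y_train),
                      tree.score(X_test, y_test),
                      cv.mean())
    print(f"max_depth={depth}: train={results[depth][0]:.2f} "
          f"test={results[depth][1]:.2f} cv={results[depth][2]:.2f}")
```

A large gap between training and test accuracy is the usual symptom of overfitting; a cross-validated score that agrees with the test score suggests the result is not tied to one lucky split.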

## Part 3: Beyond the Baseline with Weather

In Part 2, you built a Decision Tree/Random Forest classifier to predict the category of a crime with the help of our friend `sklearn` using the following variables:

* `DayOfWeek` (`Sunday`, ..., `Saturday`).
* `PD District` (`TENDERLOIN`, etc). (**Remember**: you'll need to encode these labels as integers for `sklearn`; you can assign numbers to the labels with something like sklearn's `LabelEncoder` or write your own custom function.)
* Hour of the day and month of the year, extracted from `Time` and `Date`.

That model from Part 2 will function as our baseline. Now that we have it set up, we can use it to understand how adding variables from a **weather dataset** will influence the decisions of the tree later on.

In [154]:
data_w = pd.read_csv("weather_data.csv")


In [155]:
data_w.shape


(44306, 7)

In [156]:
data_w['hours'] = pd.DatetimeIndex(data_w['date']).hour
data_w['Month']=pd.DatetimeIndex(data_w['date']).month
data_w = data_w.dropna()

In [159]:
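# Attempt 1: join on the row index. The weather index and the crime index are
# unrelated, so this pairs up arbitrary rows and is not used further.
merge=pd.merge(data_w,data_tree, how='inner', left_index=True, right_index=True)

In [161]:
# Attempt 2: the same index-based alignment via concat; also not used further.
merge2 = pd.concat([data_w,data_tree], join='inner', axis=1)

In [182]:
# Attempt 3: join on (Month, hours). These keys are not unique in the weather
# data, so every crime matches all weather rows sharing its month and hour,
# across all days and years (hence the ~12 million rows shown below).
new_df = data_tree.merge(data_w,  how='left', on = ['Month','hours'])
new_df = new_df.dropna()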

In [184]:
new_df

| | DayOfWeek_ | PdDistrict_ | Month | hours | y | date | temperature | humidity | weather | wind_speed | wind_direction | pressure |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 8 | 9 | 0 | 1.0 | 2013-09-01T00:00:00.000Z | 20.0415 | 71.0 | sky is clear | 2.0 | 238.0 | 1026.0 |
| 1 | 6 | 8 | 9 | 0 | 1.0 | 2013-09-02T00:00:00.000Z | 18.8880 | 79.0 | moderate rain | 2.0 | 255.0 | 1026.0 |
| 2 | 6 | 8 | 9 | 0 | 1.0 | 2013-09-03T00:00:00.000Z | 23.8100 | 84.0 | broken clouds | 3.0 | 251.0 | 1014.0 |
| 3 | 6 | 8 | 9 | 0 | 1.0 | 2013-09-04T00:00:00.000Z | 19.5695 | 71.0 | sky is clear | 3.0 | 268.0 | 1027.0 |
| 4 | 6 | 8 | 9 | 0 | 1.0 | 2013-09-05T00:00:00.000Z | 25.7300 | 59.0 | sky is clear | 4.0 | 270.0 | 1013.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 12306917 | 1 | 10 | 4 | 11 | 2.0 | 2017-04-26T11:00:00.000Z | 15.0000 | 87.0 | mist | 2.0 | 210.0 | 1018.0 |
| 12306918 | 1 | 10 | 4 | 11 | 2.0 | 2017-04-27T11:00:00.000Z | 12.1700 | 87.0 | mist | 7.0 | 280.0 | 1017.0 |
| 12306919 | 1 | 10 | 4 | 11 | 2.0 | 2017-04-28T11:00:00.000Z | 10.1800 | 76.0 | mist | 4.0 | 280.0 | 1017.0 |
| 12306920 | 1 | 10 | 4 | 11 | 2.0 | 2017-04-29T11:00:00.000Z | 12.9900 | 59.0 | mist | 2.0 | 40.0 | 1018.0 |


Time to get that weather data rolling. The raw data we are using can be found online [here](https://www.meteoblue.com/en/weather/archive/export/san-francisco_united-states-of-america_5391959) or [you can get a convenient version from the files folder of our class repository by clicking here](https://raw.githubusercontent.com/suneman/socialdata2021/master/files/weather_data.csv). 

> *Exercise*
> 
> * Load the weather dataset. If you have your training data and test data in separate `DataFrames`, then merging them with the weather information should be simple.
>   * **Hint**: you can use the join method from pandas. To do so, you will need to round the time to the hour because weather data is recorded hourly. Also it's fine to drop missing values. Here's a [stackoverflow post](https://stackoverflow.com/questions/36292959/pandas-merge-data-frames-on-datetime-index) which may help you. 
>   * *Note*: you'll need to do some encoding on the weather data as before if you want to use the weather column. Also, check that every entry of the new training data actually has weather information attached. 
> * Now that you have the data properly merged, you can **fit a new random forest on the data and compare the results**. How does the weather data influence the prediction performance? (Use the evaluation function you built above.) Is there an impact on the accuracy of predictions? Is weather data relevant for the predictions?
> * *Optional*: Try experimenting with using only certain variables of the weather data. Can you improve the performance of classification by using fewer features/variables?


**Note**: It's not 100% given that adding weather will improve your predictive performance. It can go either way depending on the details of your implementation. The important thing is not performance, but that you implement your code in the right way.
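
As a rough illustration of the hint about rounding to the hour, here is a minimal sketch of a timestamp-based merge on toy data. The three crime rows and two weather rows are made up, and `timestamp` and `hour_ts` are hypothetical column names; the `date` and `temperature` columns mimic the weather file. It also assumes the crime times and the weather timestamps refer to the same timezone, which is worth checking on the real data.

```python
import pandas as pd

# Toy crime records (made up); real rows would combine the Date and Time columns.
crimes = pd.DataFrame({
    "timestamp": pd.to_datetime(
        ["2013-09-01 00:12", "2013-09-01 01:47", "2013-09-02 00:30"]),
    "y": [1, 2, 1],
})

# Toy hourly weather, mimicking the `date` and `temperature` columns.
weather = pd.DataFrame({
    "date": ["2013-09-01T00:00:00.000Z", "2013-09-01T01:00:00.000Z"],
    "temperature": [20.0, 19.5],
})

# Round each crime down to the hour. Parse the weather stamps as UTC and then
# drop the timezone so both keys have the same (timezone-naive) dtype.
crimes["hour_ts"] = crimes["timestamp"].dt.floor("h")
weather["hour_ts"] = pd.to_datetime(weather["date"], utc=True).dt.tz_localize(None)

# Left join: every crime is kept and gets at most one weather row.
# validate= raises an error if the weather keys turn out not to be unique.
merged = crimes.merge(weather.drop(columns="date"), on="hour_ts",
                      how="left", validate="many_to_one")

print(merged)
print("rows without weather:", int(merged["temperature"].isna().sum()))
```

The row count after the merge equals the number of crimes, and counting the missing values afterwards tells you how many crimes have no weather attached.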


In [192]:
lee = LabelEncoder()
new_df['weather_']=lee.fit_transform(new_df['weather'])


In [191]:
pd.unique(new_df['weather'])

array(['sky is clear', 'moderate rain', 'broken clouds', 'mist',
       'scattered clouds', 'overcast clouds', 'light rain', 'few clouds',
       'haze', 'smoke', 'thunderstorm', 'fog', 'proximity shower rain',
       'proximity thunderstorm', 'light intensity drizzle',
       'heavy intensity rain', 'drizzle', 'thunderstorm with light rain',
       'heavy snow', 'light snow', 'proximity thunderstorm with rain',
       'thunderstorm with heavy rain', 'very heavy rain',
       'thunderstorm with rain', 'heavy intensity drizzle',
       'light intensity shower rain', 'shower rain', 'squalls'],
      dtype=object)

In [193]:
X = new_df[['DayOfWeek_','PdDistrict_','Month','hours','weather_','wind_speed','temperature']]
y = new_df['y']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

In [None]:
model2 = RandomForestClassifier(n_estimators=1000, random_state=42)
model2.fit(X_train,y_train)
# Predict with the weather-aware model2; the baseline `model` was fit on four features only.
y_hat = model2.predict(X_test)
accuracy_score(y_test, y_hat)
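
One way to approach the question of whether the weather variables are relevant is to look at the fitted forest's `feature_importances_`. The sketch below shows the idea on synthetic data where, by construction, the label depends on `hours` only and `temperature` is pure noise; the numbers say nothing about the real crime data.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 2000

# Synthetic stand-in: the label depends on `hours` only; `temperature` is noise.
X = pd.DataFrame({
    "hours": rng.integers(0, 24, size=n),
    "temperature": rng.normal(15, 5, size=n),
})
y = (X["hours"] >= 12).astype(int)

forest = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)

# Impurity-based importances are normalised to sum to 1 across features.
importances = pd.Series(forest.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))
```

Impurity-based importances tend to favour features with many distinct values (such as a continuous temperature), so treat them as a hint rather than proof; `sklearn.inspection.permutation_importance` on held-out data is a common cross-check.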

## Part 4: Video Lectures and Reading

Next week we'll be playing around with *explanatory data visualization*. Roughly speaking this means using data visualization to communicate your results to others. Thus, there are new things to think about. We'll start thinking about that already this week.

We start with a video from yours truly and then read a bit from a scientific article about types of explanatory dataviz. (*The video is from an old version of the class that used D3, so just ignore those parts. I'll make a new one ASAP*).

[![Sune talks about designing visualizations](https://img.youtube.com/vi/yHKYMGwefso/0.jpg)](https://www.youtube.com/watch?v=yHKYMGwefso)

> *Exercises*: Explanatory data visualization
> * What are the three key elements to keep in mind when you design an explanatory visualization?
> * In the video I talk about (1) *overview first*,  (2) *zoom and filter*,  (3) *details on demand*. 
>   - Go online and find a visualization that follows these principles (don't use one from the video). 
>   - Explain how it achieves (1)-(3). It might be useful to use screenshots to illustrate your explanation.
> * Explain in your own words: How is explanatory data analysis different from exploratory data analysis?

*Reading*: [Narrative Visualization: Telling Stories with Data](http://vis.stanford.edu/files/2010-Narrative-InfoVis.pdf) by Edward Segel and Jeffrey Heer. We'll read sections 1-3 today. (And the rest next time).

When you get to section 3 it's fun to open up the examples mentioned by the authors in a browser and explore them as you read the text. 

> *Exercise*: Answer a couple of questions about the paper.
> 
> * What is the *Oxford English Dictionary's* definition of a narrative?
> * What is your favorite visualization among the examples in section 3? Explain why in a few words.