
**Part 1: Questions to text and lectures.**

A) Please answer my questions on the Segel and Heer paper we read during lectures 7 and 8.

* What is the Oxford English Dictionary's definition of a narrative?
* What is your favorite visualization among the examples in section 3? Explain why in a few words.
* What's the point of Figure 7?
* Use Figure 7 to find the most common design choice within each category for the Visual narrative and Narrative structure (the categories within visual narrative are 'visual structuring', 'highlighting', etc).
* Check out Figure 8 and section 4.3. What is your favorite genre of narrative visualization? Why? What is your least favorite genre? Why?

B) Also, please answer the questions about my talk on explanatory data visualization.

* What are the three key elements to keep in mind when you design an explanatory visualization?
* In the video I talk about (1) overview first, (2) zoom and filter, (3) details on demand.
* Go online and find a visualization that follows these principles (don't use one from the video).
* Explain how it achieves (1)-(3). It might be useful to use screenshots to illustrate your explanation.
* Explain in your own words: How is explanatory data analysis different from exploratory data analysis?



In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn import svm
from sklearn.neural_network import MLPClassifier
#from sklearn.linear_model import SGDClassifier
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split
%matplotlib inline

df = pd.read_csv('../../SF_Police_Reports.csv')


**Part 2: Random forest and weather**

The aim here is to recreate the work you did in Parts 1 and 2 of the Week 7 lecture. I've phrased things differently relative to the exercise to make the purpose clearer.

Part 2A: Random forest binary classification.

* Using the instructions and material from Week 7, build a random forest classifier to distinguish between two types of crime (you choose) based on spatio-temporal (where/when) features of the data describing the two crimes. When you're done, you should be able to give the classifier a place and a time, and it should tell you which of the two types of crime happened there.
 * Explain your choices for training/test data, features, and encoding. (You decide how to present your results, but here are some example topics to consider: Did you balance the training data? What are the pros/cons of balancing? Do you think your model is overfitting? Did you choose to do cross-validation? Which specific features did you end up using? Why? Which features (if any) did you one-hot encode? Why ... or why not?)
 * Report accuracy. Discuss the model performance.
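
As a rough illustration of the topics listed above (balancing, one-hot encoding, cross-validation, checking for overfitting), here is a minimal sketch on synthetic data. The `toy` frame and its values are made up for illustration and only borrow column names from the crime dataset; undersampling is just one possible way to balance, not necessarily the one expected in the exercise.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for the crime data (hypothetical values, not the real dataset)
n = 3000
toy = pd.DataFrame({
    'Category': rng.choice(['ASSAULT', 'ROBBERY'], size=n, p=[0.75, 0.25]),
    'HourOfDay': rng.integers(0, 24, size=n),
    'DayOfWeek': rng.integers(0, 7, size=n),
    'PdDistrict': rng.choice(['SOUTHERN', 'TARAVAL', 'BAYVIEW'], size=n),
    'X': rng.normal(-122.42, 0.03, size=n),
    'Y': rng.normal(37.77, 0.03, size=n),
})

# Balance the classes by undersampling the larger class to the size of the smaller one
n_min = toy['Category'].value_counts().min()
balanced = pd.concat([
    group.sample(n=n_min, random_state=0)
    for _, group in toy.groupby('Category')
])

# One-hot encode the nominal district feature; numeric features stay as they are
features = pd.get_dummies(
    balanced[['HourOfDay', 'DayOfWeek', 'PdDistrict', 'X', 'Y']],
    columns=['PdDistrict'],
    dtype=float,
)
target = balanced['Category']

X_tr, X_te, y_tr, y_te = train_test_split(
    features, target, test_size=0.2, stratify=target, random_state=0
)

model = RandomForestClassifier(n_estimators=50, random_state=0)

# 5-fold cross-validation on the training part gives a less noisy accuracy estimate
cv_scores = cross_val_score(model, X_tr, y_tr, cv=5)

# A large gap between training and test accuracy is a sign of overfitting
model.fit(X_tr, y_tr)
train_acc = model.score(X_tr, y_tr)
test_acc = model.score(X_te, y_te)
print(f"CV accuracy: {cv_scores.mean():.3f}, train: {train_acc:.3f}, test: {test_acc:.3f}")
```

With balanced classes, an accuracy of 0.5 is the baseline to beat. Because the toy labels are random, the sketch should land near that baseline on the test set while scoring much higher on the training set.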

Part 2B: Info from weather features.

* Now add features from weather data to your random forest.
 * Report accuracy.
 * Discuss how the model performance changes relative to the version with no weather data.
 * Discuss what you have learned about crime from including weather data in your model.
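
One way to attach weather features is to round each crime's timestamp down to the hour and join on that hourly timestamp. The sketch below assumes an hourly weather table with a `date` column and a hypothetical `temperature` column. The real weather file may use different column names, and all values here are invented.

```python
import numpy as np
import pandas as pd

# Hypothetical hourly weather table (column names and values are assumptions)
weather = pd.DataFrame({
    'date': pd.date_range('2015-01-01', periods=48, freq='h'),
    'temperature': np.linspace(8.0, 16.0, 48),
})

# Hypothetical crime records with a full timestamp
crimes = pd.DataFrame({
    'DateTime': pd.to_datetime(
        ['2015-01-01 00:10', '2015-01-01 13:45', '2015-01-02 23:59']
    ),
    'Category': ['ASSAULT', 'ROBBERY', 'ASSAULT'],
})

# Round each crime down to the start of its hour so it lines up with the weather rows
crimes['hour_stamp'] = crimes['DateTime'].dt.floor('h')

# Left join keeps every crime; validate raises if a weather hour appears more than once
merged = crimes.merge(
    weather,
    left_on='hour_stamp',
    right_on='date',
    how='left',
    validate='many_to_one',
)
print(merged[['DateTime', 'Category', 'temperature']])
```

Joining on the full hourly timestamp gives each crime exactly one weather row. The model with weather columns can then be trained on the same train/test split as the model without them, so that any change in accuracy comes from the features and not from the split.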



In [None]:
forest_df = df.copy()
datetime_format = "%Y/%m/%d, %H:%M:%S"
forest_df['DateTime'] = pd.to_datetime(forest_df['Date'] + ' ' + forest_df['Time'])

In [None]:
forest_df['HourOfWeek'] = forest_df['DateTime'].dt.dayofweek * 24 + (forest_df['DateTime'].dt.hour + 1)
forest_df['HourOfDay'] = forest_df['DateTime'].dt.hour
forest_df['MonthOfYear'] = forest_df['DateTime'].dt.month
forest_df['DayOfWeek'] = forest_df['DateTime'].dt.dayofweek
districts = forest_df['PdDistrict'].unique()
print(districts)

In [None]:
forest_df['Category'].unique()
crime_mask = (forest_df['Category'].isin(['ASSAULT', 'ROBBERY']))
# Copy the selection so that later column assignments do not modify a view of forest_df
assault_or_robbery = forest_df.loc[crime_mask].copy()

assault_or_robbery.isnull().sum()

In [None]:
label_crime_type = LabelEncoder()
assault_or_robbery['Category'] = label_crime_type.fit_transform(assault_or_robbery['Category'])

In [None]:
assault_or_robbery.head(10)

In [None]:
assault_or_robbery['Category'].value_counts()

In [None]:
sns.countplot(x=assault_or_robbery['Category'])


In [None]:
sns.countplot(x=forest_df.loc[crime_mask, 'Category'])

In [None]:
# Now separate the dataset into the response variable and the feature variables

X = assault_or_robbery[['MonthOfYear','DayOfWeek','HourOfDay', 'X', 'Y']].copy()
y = assault_or_robbery['Category'].copy()

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)

In [None]:
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

In [None]:
rfc = RandomForestClassifier(n_estimators=200)
rfc.fit(X_train, y_train)
pred_rfc = rfc.predict(X_test)

In [None]:
#Let's see how well our model performed
print(classification_report(y_test, pred_rfc))

In [None]:
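weather_df = pd.read_csv('../../weather_data.csv')
weather_df['date'] = pd.to_datetime(weather_df['date'])
weather_df['HourOfWeek'] = weather_df['date'].dt.dayofweek * 24 + (weather_df['date'].dt.hour + 1)
# Add the remaining merge keys so they exist on both sides of the merge below
weather_df['MonthOfYear'] = weather_df['date'].dt.month
weather_df['DayOfWeek'] = weather_df['date'].dt.dayofweek

In [None]:
weather_df.head(10)

In [None]:
# X has no 'HourOfWeek' column, so take the merge keys from assault_or_robbery instead.
# These keys can repeat across weeks and years, so one crime may match several weather rows.
crime_features = assault_or_robbery[['HourOfWeek', 'MonthOfYear', 'DayOfWeek', 'HourOfDay', 'X', 'Y']]
pd.merge(crime_features, weather_df, on=['HourOfWeek', 'MonthOfYear', 'DayOfWeek']).head()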


**Part 3: Data visualization**

* Create the Bokeh visualization from Part 2 of the Week 8 Lecture, displayed in a beautiful .gif below.
* Provide nice comments for your code. Don't just use the # inline comments, but the full Notebook markdown capabilities and explain what you're doing.

Movie


In [2]:
import bokeh.plotting as bp
bokeh_df = df.copy()

In [3]:
bokeh_df['Date'] = pd.to_datetime(bokeh_df['Date'])
bokeh_df['Time'] = pd.to_datetime(bokeh_df['Time'])
bokeh_df.head()


Unnamed: 0,IncidntNum,Category,Descript,DayOfWeek,Date,Time,PdDistrict,Resolution,Address,X,Y,Location,PdId
0,146196161,NON-CRIMINAL,LOST PROPERTY,Tuesday,2014-09-23,2020-03-30 01:00:00,SOUTHERN,NONE,800 Block of BRYANT ST,-122.403405,37.775421,POINT (-122.403404791479 37.775420706711),14619616171000
1,150045675,ASSAULT,BATTERY,Thursday,2015-01-15,2020-03-30 17:00:00,TARAVAL,NONE,1800 Block of VICENTE ST,-122.485604,37.738821,POINT (-122.48560378101 37.7388214326705),15004567504134
2,140632022,SUSPICIOUS OCC,INVESTIGATIVE DETENTION,Wednesday,2014-07-30,2020-03-30 09:32:00,BAYVIEW,NONE,100 Block of GILLETTE AV,-122.396535,37.71066,POINT (-122.396535107224 37.7106603302503),14063202264085
3,150383259,ASSAULT,BATTERY,Saturday,2015-05-02,2020-03-30 23:10:00,BAYVIEW,"ARREST, BOOKED",2400 Block of PHELPS ST,-122.400131,37.730093,POINT (-122.400130573297 37.7300925390327),15038325904134
4,40753980,OTHER OFFENSES,RECKLESS DRIVING,Friday,2004-07-02,2020-03-30 13:43:00,BAYVIEW,NONE,I-280 / CESAR CHAVEZ ST,-120.5,90.0,POINT (-120.5 90),4075398065020


In [4]:
start_date = '2009-12-31'
end_date = '2019-01-01'
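# Compare against the parsed dates in bokeh_df on both sides (df['Date'] still holds raw strings)
date_mask = (bokeh_df['Date'] > start_date) & (bokeh_df['Date'] < end_date)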
bokeh_df = bokeh_df.loc[date_mask]
bokeh_df.head()

Unnamed: 0,IncidntNum,Category,Descript,DayOfWeek,Date,Time,PdDistrict,Resolution,Address,X,Y,Location,PdId
0,146196161,NON-CRIMINAL,LOST PROPERTY,Tuesday,2014-09-23,2020-03-30 01:00:00,SOUTHERN,NONE,800 Block of BRYANT ST,-122.403405,37.775421,POINT (-122.403404791479 37.775420706711),14619616171000
1,150045675,ASSAULT,BATTERY,Thursday,2015-01-15,2020-03-30 17:00:00,TARAVAL,NONE,1800 Block of VICENTE ST,-122.485604,37.738821,POINT (-122.48560378101 37.7388214326705),15004567504134
2,140632022,SUSPICIOUS OCC,INVESTIGATIVE DETENTION,Wednesday,2014-07-30,2020-03-30 09:32:00,BAYVIEW,NONE,100 Block of GILLETTE AV,-122.396535,37.71066,POINT (-122.396535107224 37.7106603302503),14063202264085
3,150383259,ASSAULT,BATTERY,Saturday,2015-05-02,2020-03-30 23:10:00,BAYVIEW,"ARREST, BOOKED",2400 Block of PHELPS ST,-122.400131,37.730093,POINT (-122.400130573297 37.7300925390327),15038325904134
9,111027676,ASSAULT,BATTERY,Saturday,2011-12-24,2020-03-30 07:00:00,SOUTHERN,NONE,0 Block of DORE ST,-122.412933,37.773927,POINT (-122.412933062384 37.7739274524819),11102767604134


In [None]:
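# 'Category' holds strings, so count occurrences per hour instead of summing them,
# then divide each crime type by its own total to get the share of that crime per hour
hourly_counts = pd.crosstab(bokeh_df['Time'].dt.hour, bokeh_df['Category'])
pivot_df = hourly_counts / hourly_counts.sum()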

pivot_df

In [None]:
from bokeh.models import ColumnDataSource
source = ColumnDataSource(pivot_df)

In [None]:
p = bp.figure()
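
The figure above is left unfinished. As a hedged sketch of how it could continue, the block below draws one bar glyph per crime type from a `ColumnDataSource` and lets the reader mute a crime type by clicking its legend entry. The data, colors, and labels are invented for illustration, and the exact look of the lecture's visualization is not reproduced here.

```python
import pandas as pd
from bokeh.models import ColumnDataSource
from bokeh.plotting import figure

# Hypothetical hourly fractions for two crime types (uniform toy values)
hours = [str(h) for h in range(24)]
toy_fractions = pd.DataFrame({
    'hour': hours,
    'ASSAULT': [1 / 24] * 24,
    'ROBBERY': [1 / 24] * 24,
})
toy_source = ColumnDataSource(toy_fractions)

# A list of strings as x_range makes the x-axis categorical (one slot per hour)
plot = figure(
    x_range=hours,
    title='Share of each crime type per hour (toy data)',
    x_axis_label='Hour of the day',
    y_axis_label='Fraction of crimes',
)

# One bar glyph per crime type, each with its own legend entry
for name, color in [('ASSAULT', 'steelblue'), ('ROBBERY', 'firebrick')]:
    plot.vbar(
        x='hour',
        top=name,
        source=toy_source,
        width=0.9,
        color=color,
        alpha=0.6,
        muted_alpha=0.1,
        legend_label=name,
    )

# Clicking a legend entry fades the corresponding bars instead of removing them
plot.legend.click_policy = 'mute'
```

In a notebook, calling `output_notebook()` from `bokeh.io` and then `show(plot)` from `bokeh.plotting` displays the figure inline. Setting `click_policy` to `'hide'` removes the bars entirely instead of fading them.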