# Assignment 2

### Libraries

In [1]:
import pandas as pd
import numpy as np

import seaborn as sns

import random

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

from bokeh.io import output_notebook, show
from bokeh.models import ColumnDataSource, FactorRange, Legend
from bokeh.plotting import figure

# Part 1: Questions to text and lectures

***What is the Oxford English Dictionary's definition of a narrative?***

<span style="color:red">***Answer:***</span> The *Oxford English Dictionary* defines the term **narrative** as "an account of a series of events, facts, etc., given in order and with the establishing of connections between them." In simple terms, this definition conveys the sense of a sequence of events in which each event sparks the next, and this sequence creates a story. This definition gives us a better understanding of the term "Narrative Visualization", which describes the use of storytelling strategies, mechanisms, or structures to provide meaningful visualizations.


___

***What is your favorite visualization among the examples in section 3? Explain why in a few words.***

<span style="color:red">***Answer:***</span> The Gapminder Human Development Trends visualization is one of the most interesting visualizations as it allows dense information to be quickly comprehended by the user. This is an *interactive slideshow* that examines the trends in global income and health connecting the variables **Income, Regions, Poverty, Health, Countries, Differences, Trends, Gaps, and Deaths** to tell us a story about Human Development. 

One of the most important and, for the period in which it was released, innovative things about this interactive visualization is that it allows the user to choose among a number of languages in which to interact with the plots. The whole visualization is like a presentation, and the user can walk through the plots that describe the aforementioned variables via a progress bar.

This "interactive slideshow" uses three kinds of charts (histograms, scatter plots, and bar charts) without confusing the user. In each plot, there are *annotations, highlighting, etc.* to explain the key observations to the user. In addition, the user can move the mouse over the plot to access more detailed data.

Lastly, in some segments, especially those that visualize time series data, the visualization provides the user with an increased level of interactivity.

All of the above demonstrates why "The Gapminder Human Development Trends" visualization was an innovative and special kind of visualization.
___

***What's the point of Figure 7?***

<span style="color:red">***Answer:***</span> Figure 7 illustrates the design space of the visualization and relates 58 examples that were retrieved from journalism, business, and visualization research in a table.

The first column lists the 58 examples, and across the first row the features of Genre, Visual Narrative, and Narrative Structure are divided into subcategories. If an example falls into one of the categories in the first row, the symbol "+" is placed in the cell where the row and the column intersect.

This table contributes to the effort to analyze the design space and provides a structure of the visual elements that help our storytelling.
___

***Use Figure 7 to find the most common design choice within each category for the Visual narrative and Narrative structure (the categories within visual narrative are 'visual structuring', 'highlighting', etc).***

<span style="color:red">***Answer:***</span>  
The most common design choices within each category for the Visual Narrative and Narrative structure are the following:

##### Visual Narrative:
 * Visual Structuring: Consistent Visual Platform
 * Highlighting: Feature Distinction
 * Transition Guidance: Object Continuity

##### Narrative Structure:
 * Ordering: User-Directed Path
 * Interactivity: Filtering / Selection / Search
 * Messaging: Captions / Headlines 
___

***Check out Figure 8 and section 4.3. What is your favorite genre of narrative visualization? Why? What is your least favorite genre? Why?***

<span style="color:red">***Answer:***</span> The most interesting genre of narrative visualization is Film/Video/Animation because, by combining visual and auditory stimuli to convey a data story, it motivates the viewer to focus and delivers a great deal of information in a short period of time. Moreover, researchers have argued that data videos can be highly impactful, making it a particularly interesting form of narrative visualization to study. The combination of voice narration, video footage, data visualizations, and attention cues provides a balanced video with a powerful story about the evolving relationship among the examined features.

On the other hand, the magazine style is a static narrative visualization that does not entice the reader to spend time interacting with or analyzing it. Lastly, perceptual speed, a cognitive ability, has been shown to correlate negatively with time on task when working with static grouped bar charts (Carenini et al. 2014).
___

***What are the three key elements to keep in mind when you design an explanatory visualization?***

<span style="color:red">***Answer:***</span>  
**1. Start with the question (What is the question that you want to answer and communicate?).** This is perhaps the most crucial element, because the right answer to this question sets the ultimate goal of your work and defines your limits and where you should focus when you communicate with the audience through explanatory data visualization.  
  
**2. Allow exploration (Let the users investigate. An interactive visualization tool).** If we include this element in our explanatory data visualization analysis we allow the user to interact with our work, and we motivate our audience to spend more time exploring the data using our visualization.  
  
**3. Know your readers (Design for an audience).** If we want to communicate our findings successfully, we have to take the knowledge level of our audience into consideration. For instance, if we design a visualization with many technical terms or aspects for a general audience, we risk failing to communicate.

___

***In the video I talk about (1) overview first, (2) zoom and filter, (3) details on demand. Go online and find a visualization that follows these principles (don't use one from the video).***

<span style="color:red">***Answer:***</span>  

https://qz.com/296941/interactive-graphic-every-active-satellite-orbiting-earth/
___

***Explain how it achieves (1)-(3). It might be useful to use screenshots to illustrate your explanation.***

<span style="color:red">***Answer:***</span>  
1) First an overview about satellites orbiting earth is presented.  
  
![Image](https://github.com/Chrypapado/Social_Data_Analysis_and_Visualization/blob/master/Assignment2/Files/Overview.png?raw=true "movie")
  
2) The satellites can be filtered by their attributes.
  
![Image](https://github.com/Chrypapado/Social_Data_Analysis_and_Visualization/blob/master/Assignment2/Files/Filter.png?raw=true "movie")
  
3) Finally, there is an option to specifically select a satellite.
  
![Image](https://github.com/Chrypapado/Social_Data_Analysis_and_Visualization/blob/master/Assignment2/Files/Details%20on%20Demand.png?raw=true "movie")
___

***Explain in your own words: How is explanatory data analysis different from exploratory data analysis?***

<span style="color:red">***Answer:***</span>  Exploratory data analysis is the situation where the user explores the data in order to find interesting patterns. It is often the first step of data analysis: we get familiar with the data, visualize just to explore, ask questions, and look for relationships between variables, outliers, patterns, and trends. 
Explanatory data analysis is the situation where the user creates visualizations in order to show these interesting patterns to readers and viewers in a way that will intrigue and inform them. It typically comes after the exploratory analysis, once you have determined something specific you want to communicate to a given audience: in other words, when you want to tell a story with data.

___

# Part 2: Random forest and weather

In [2]:
#Import the Crime and Weather Data
df = pd.read_csv('Police_Department_Incident_Reports__Historical_2003_to_May_2018.csv')
url = 'https://raw.githubusercontent.com/suneman/socialdata2021/master/files/weather_data.csv'
weather = pd.read_csv(url, error_bad_lines=False, parse_dates=["date"], 
                      date_parser=lambda x: pd.to_datetime(x).tz_convert(None).tz_localize("Etc/GMT+3").tz_convert("Etc/GMT-7")) 

#Crime DataFrame
crimes = df[df['Category'].isin(['BURGLARY','FRAUD'])].reset_index(drop=True)
crimes = crimes[['Category', 'PdDistrict', 'DayOfWeek', 'Date', 'Time']].copy()
#Vectorized timestamp construction; the timezone name is spelled as in the weather parser ("Etc/GMT-7")
crimes["date"] = pd.to_datetime(crimes.Date + " " + crimes.Time).dt.round("H").dt.tz_localize("Etc/GMT-7")

#Merge DataFrames
temp = pd.merge(crimes, weather, on='date', how='left')
temp.dropna(axis=0, inplace=True)
temp.reset_index(drop=True, inplace=True)

#Selecting two Crimes and Grabbing Equal Number of Examples
burglary = temp[temp['Category'].isin(['BURGLARY'])].reset_index(drop=True)
burglary = burglary.sample(n=12000).reset_index(drop=True) 
fraud = temp[temp['Category'].isin(['FRAUD'])].reset_index(drop=True)
fraud = fraud.sample(n=12000).reset_index(drop=True)
data = pd.concat([burglary, fraud], axis=0)
data = data.sample(frac=1).reset_index(drop=True)

<span style="color:red">***Comment:***</span> We merge the two datasets at the outset so that both models draw on the same balanced sample of incidents. The first model ignores the weather attributes, whereas the second one includes them.
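
The merge relies on both tables sharing an hourly `date` key. As a minimal sketch on made-up rows (the timestamps and the `temp` column are hypothetical, and timezones are left out for brevity), the following shows how rounding to the hour lines crimes up with hourly weather observations, and how unmatched rows appear as NaN before `dropna` removes them:

```python
import pandas as pd

# Toy crime records; the timestamps are made up for illustration.
crimes_toy = pd.DataFrame({
    "Category": ["BURGLARY", "FRAUD", "FRAUD"],
    "when": pd.to_datetime(["2013-02-04 04:12", "2013-02-04 04:40", "2013-02-04 09:05"]),
})
# Round to the nearest hour so the key matches hourly weather observations.
crimes_toy["date"] = crimes_toy["when"].dt.round("60min")

# Toy weather table with one row per hour (hypothetical `temp` column).
weather_toy = pd.DataFrame({
    "date": pd.to_datetime(["2013-02-04 04:00", "2013-02-04 05:00"]),
    "temp": [9.5, 9.1],
})

# A left join keeps every crime; crimes without a weather row get NaN.
merged = pd.merge(crimes_toy, weather_toy, on="date", how="left")
matched = merged.dropna().reset_index(drop=True)
print(merged[["Category", "date", "temp"]])
```

In this toy case the 09:05 incident has no weather observation, so it is dropped, which is the same effect `dropna` has on the real merged frame.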

## Part 2A: Random forest binary classification

### Using the instructions and material from Week 7, build a random forest classifier to distinguish between two types (you choose) of crime using spatio-temporal (where/when) features of data describing the two crimes. When you're done, you should be able to give the classifier a place and a time, and it should tell you which of the two types of crime happened there.

In [3]:
#Drop the Weather Attributes for the First Case
crime_data = data[['Category', 'PdDistrict', 'DayOfWeek', 'Date', 'Time']].copy()

#Creating Month of the Year Column
crime_data['MonthOfYear'] = crime_data['Date'].str.strip().str[0:2]

#Creating Hour of the Day Column
crime_data['HourOfDay'] = crime_data['Time'].str.strip().str[0:2]

#Creating the Year Column
crime_data['Year'] = crime_data['Date'].str.strip().str[-4:]

#Selecting Features
crime_data = crime_data[['Category', 'PdDistrict', 'DayOfWeek', 'HourOfDay', 'MonthOfYear', 'Year']].copy()

#One-hot Encoding (PdDistrict) and Ordinal Encoding (DayOfWeek, 0 = Monday ... 6 = Sunday)
district = pd.get_dummies(crime_data.PdDistrict)
dayofweek = pd.DataFrame(pd.Categorical(crime_data.DayOfWeek, 
                                        categories=["Monday", "Tuesday", "Wednesday", "Thursday", 
                                                    "Friday", "Saturday", "Sunday"], 
                                        ordered=True).codes, columns=['Day'])
crime_data = pd.concat([crime_data, district, dayofweek], axis=1)
crime_data.drop(['PdDistrict', 'DayOfWeek'], axis=1, inplace=True)

#Features to Integer Type
crime_data['HourOfDay'] = crime_data['HourOfDay'].astype(int)
crime_data['MonthOfYear'] = crime_data['MonthOfYear'].astype(int)
crime_data['Year'] = crime_data['Year'].astype(int)

#Splitting the Data
train, test = train_test_split(crime_data, test_size=0.2)
x_train = train.iloc[:, 1:].values
y_train = train.iloc[:, 0].values
x_test = test.iloc[:, 1:].values
y_test = test.iloc[:, 0].values

#Random Forest Classifier
rf = RandomForestClassifier(n_estimators=1000, max_depth=20)
rf.fit(x_train, y_train)

#Predictions
predictions = rf.predict(x_test)

#Creating a Function to Give the Classifier a Place and Time to Predict the Crime
def prediction(day, hour, month, year, location):
    pred = []
    district = ['BAYVIEW', 'CENTRAL', 'INGLESIDE', 'MISSION', 'NORTHERN', 'PARK', 'RICHMOND', 'SOUTHERN', 'TARAVAL', 'TENDERLOIN']
    days = ["MONDAY", "TUESDAY", "WEDNESDAY", "THURSDAY", "FRIDAY", "SATURDAY", "SUNDAY"]

    pred.append(hour)
    pred.append(month)
    pred.append(year)
    for i in range(10):
        if location == district[i]:
            pred.append(1)
        else:
            pred.append(0)
    for i in range(len(days)):
        if day == days[i]:
            pred.append(i)
    pred = rf.predict(np.array(pred).reshape(1, 14))
    return pred

#Random Prediction
prediction('MONDAY', 4, 2, 2013, 'PARK')

array(['BURGLARY'], dtype=object)
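
Because the classifier receives a plain positional array, the input has to follow the training column order: hour, month, year, the ten district dummies, then the day code. A possible way to make that ordering explicit, and to fail loudly on an unknown day or district, is sketched below. The helper name `build_row` is hypothetical, and the column order is an assumption read off the feature-selection code above.

```python
DISTRICTS = ['BAYVIEW', 'CENTRAL', 'INGLESIDE', 'MISSION', 'NORTHERN',
             'PARK', 'RICHMOND', 'SOUTHERN', 'TARAVAL', 'TENDERLOIN']
DAYS = ["MONDAY", "TUESDAY", "WEDNESDAY", "THURSDAY", "FRIDAY", "SATURDAY", "SUNDAY"]


def build_row(day, hour, month, year, location):
    """Return a 14-element feature list: hour, month, year, 10 district flags, day code."""
    if day not in DAYS:
        raise ValueError(f"unknown day: {day!r}")
    if location not in DISTRICTS:
        raise ValueError(f"unknown district: {location!r}")
    district_flags = [1 if location == d else 0 for d in DISTRICTS]
    return [hour, month, year] + district_flags + [DAYS.index(day)]


row = build_row('MONDAY', 4, 2, 2013, 'PARK')
print(row)
```

With a fitted model, the row could then be passed as `rf.predict([row])`; a misspelled day raises an error instead of silently producing a vector of the wrong length.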

***Explain about your choices for training/test data, features, and encoding. (You decide how to present your results, but here are some example topics to consider: Did you balance the training data? What are the pros/cons of balancing? Do you think your model is overfitting? Did you choose to do cross-validation? Which specific features did you end up using? Why? Which features (if any) did you one-hot encode? Why ... or why not?))***

<span style="color:red">***Comment:***</span> The data is split 80/20 into training and test sets. The two classes are balanced to the same number of samples (12,000 each) so that the model is not biased towards one crime. The maximum tree depth is set to 20 to limit overfitting. The selected features are the hour of the day, the month, the year, the police district, and the day of the week. The police district is one-hot encoded, while the day of the week is encoded as an ordinal code (0 for Monday through 6 for Sunday).
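
One way to probe the overfitting question is to compare training accuracy with cross-validated accuracy. The sketch below is not part of the original analysis: it uses synthetic data from `make_classification` and a smaller forest than the notebook (both are assumptions chosen so that it runs quickly), but the same calls would apply to `x_train` and `y_train`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the crime features (14 columns, two classes).
X, y = make_classification(n_samples=600, n_features=14, n_informative=4,
                           random_state=0)

rf_demo = RandomForestClassifier(n_estimators=50, max_depth=20, random_state=0)

# 5-fold cross-validated accuracy: each fold is scored on data it was not fitted on.
cv_scores = cross_val_score(rf_demo, X, y, cv=5, scoring="accuracy")

# Accuracy on the data the model was fitted on, for comparison.
rf_demo.fit(X, y)
train_acc = rf_demo.score(X, y)

print(f"train accuracy: {train_acc:.3f}")
print(f"cross-validated accuracy: {cv_scores.mean():.3f} (+/- {cv_scores.std():.3f})")
```

A large gap between the two numbers would hint at overfitting; lowering `max_depth` is one knob that could then be tried.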
___

### Report accuracy. 

In [4]:
print("The model's accuracy is: "+"{:.2f}".format(accuracy_score(y_test, predictions)*100)+"%")

The model's accuracy is: 60.56%


***Discuss the model performance.***

<span style="color:red">***Comment:***</span> The model is not very accurate, which may indicate that the two crimes have similar spatio-temporal patterns. However, at about 60% accuracy the random forest classifier still outperforms the 50% baseline of random guessing on balanced classes.
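
The 50% baseline holds here because the two classes were balanced; in general, the reference point is the share of the majority class. A small sketch with made-up labels (not the notebook's predictions) illustrates the comparison:

```python
from collections import Counter


def majority_baseline(y_true):
    """Accuracy obtained by always predicting the most frequent class."""
    counts = Counter(y_true)
    return max(counts.values()) / len(y_true)


def accuracy(y_true, y_pred):
    """Fraction of positions where prediction and truth agree."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up, balanced labels: 5 burglaries and 5 frauds.
y_true_toy = ["BURGLARY", "FRAUD"] * 5
# Made-up predictions that get 7 of the 10 cases right.
y_pred_toy = ["BURGLARY", "FRAUD", "BURGLARY", "FRAUD", "BURGLARY",
              "FRAUD", "FRAUD", "BURGLARY", "BURGLARY", "BURGLARY"]

baseline = majority_baseline(y_true_toy)
acc = accuracy(y_true_toy, y_pred_toy)
print(f"baseline = {baseline:.2f}, accuracy = {acc:.2f}")
```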

___

## Part 2B: Info from weather features

### Add features from weather data to your random forest.

In [5]:
#Copy DataFrame
weather_data = data.copy()

#Creating Month of the Year Column
weather_data['MonthOfYear'] = weather_data['Date'].str.strip().str[0:2]

#Creating Hour of the Day Column
weather_data['HourOfDay'] = weather_data['Time'].str.strip().str[0:2]

#Creating the Year Column
weather_data['Year'] = weather_data['Date'].str.strip().str[-4:]

#Dropping Columns
weather_data.drop(['Date', 'Time', 'date'], axis=1, inplace=True)

#One-hot Encoding (PdDistrict and Weather) and Ordinal Encoding (DayOfWeek)
district = pd.get_dummies(weather_data.PdDistrict)
weather = pd.get_dummies(weather_data.weather)
dayofweek = pd.DataFrame(pd.Categorical(weather_data.DayOfWeek, 
                                        categories=["Monday", "Tuesday", "Wednesday", "Thursday", 
                                                    "Friday", "Saturday", "Sunday"], 
                                        ordered=True).codes, columns=['Day'])
weather_data = pd.concat([weather_data, district, dayofweek, weather], axis=1)
weather_data.drop(['PdDistrict', 'DayOfWeek', 'weather'], axis=1, inplace=True)

#Features to Integer Type
weather_data['HourOfDay'] = weather_data['HourOfDay'].astype(int)
weather_data['MonthOfYear'] = weather_data['MonthOfYear'].astype(int)
weather_data['Year'] = weather_data['Year'].astype(int)

#Splitting the Data
train, test = train_test_split(weather_data, test_size=0.2)
x_train = train.iloc[:, 1:].values
y_train = train.iloc[:, 0].values
x_test = test.iloc[:, 1:].values
y_test = test.iloc[:, 0].values

#Random Forest Classifier
rf = RandomForestClassifier(n_estimators=1000, max_depth=20)
rf.fit(x_train, y_train)

#Predictions
predictions = rf.predict(x_test)

### Report accuracy.

In [6]:
print("The model's accuracy is: "+"{:.2f}".format(accuracy_score(y_test, predictions)*100)+"%")

The model's accuracy is: 63.35%


***Discuss how the model performance changes relative to the version with no weather data.***

<span style="color:red">***Comment:***</span> The model's accuracy increases slightly (from 60.56% to 63.35%) when the weather features are added. This suggests that the weather carries some information about at least one of the two crimes, most probably burglary.
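
To judge whether an increase of this size is more than noise, a rough check is the binomial standard error of each accuracy. The sketch below assumes a test set of 4,800 samples (20% of the 24,000 balanced rows) and treats the two test sets as independent, which is only an approximation because both splits come from the same sample.

```python
import math

n_test = 24000 // 5              # 20% of the 12,000 + 12,000 balanced rows
acc_no_weather = 0.6056          # reported accuracy without weather features
acc_weather = 0.6335             # reported accuracy with weather features


def accuracy_se(p, n):
    """Binomial standard error of an accuracy p measured on n samples."""
    return math.sqrt(p * (1 - p) / n)


se_no_weather = accuracy_se(acc_no_weather, n_test)
se_weather = accuracy_se(acc_weather, n_test)

diff = acc_weather - acc_no_weather
# Standard error of the difference under the independence assumption.
se_diff = math.sqrt(se_no_weather ** 2 + se_weather ** 2)

print(f"difference = {diff:.4f}, approx. standard error = {se_diff:.4f}")
```

Under these assumptions the difference is a little under three standard errors, so the gain looks real but modest; repeating the split with several seeds would give a firmer answer.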
___

***Discuss what you have learned about crime from including weather data in your model.***

<span style="color:red">***Comment:***</span> In conclusion, including the weather data suggests that at least one of the crimes depends to some extent on the weather. This example shows that you can go beyond your current dataset and derive new features by joining external data on time, location, and possibly other keys, depending on the dataset.
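
One possible way to see which inputs the forest actually relies on is its `feature_importances_` attribute. The sketch below is an illustration on synthetic data, not a result from the crime dataset: the `temperature` and `noise` columns are hypothetical, and the label is constructed to depend on the hour only.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 500

# Synthetic features; only HourOfDay drives the label below.
demo = pd.DataFrame({
    "HourOfDay": rng.integers(0, 24, n),
    "temperature": rng.normal(15, 5, n),   # hypothetical weather column
    "noise": rng.normal(0, 1, n),
})
label = np.where(demo["HourOfDay"] < 12, "BURGLARY", "FRAUD")

rf_demo = RandomForestClassifier(n_estimators=50, random_state=0)
rf_demo.fit(demo.values, label)

# Impurity-based importances, one per column, summing to 1.
importances = pd.Series(rf_demo.feature_importances_, index=demo.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

Applied to the fitted weather model, the same two lines would rank the weather columns against the spatio-temporal ones. Impurity-based importances can favor high-cardinality features, so they are best read as a rough guide.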


___

# Part 3: Data visualization

### Create the Bokeh visualization from Part 2 of the Week 8 Lecture, displayed in a beautiful .gif below.

***Importing the Crime Data and filtering for the period 2010-2018***

In [7]:
data = df.copy()
data = data[data['Date'].str.strip().str[-4:].isin(['2010','2011','2012','2013', '2014', 
                                                    '2015','2016','2017','2018'])]

***Selecting the Focus Crimes***

In [8]:
focuscrimes = set(['WEAPON LAWS', 'PROSTITUTION', 'DRIVING UNDER THE INFLUENCE', 'ROBBERY', 'BURGLARY', 'ASSAULT', 'DRUNKENNESS', 'DRUG/NARCOTIC', 'TRESPASS', 'LARCENY/THEFT', 'VANDALISM', 'VEHICLE THEFT', 'STOLEN PROPERTY', 'DISORDERLY CONDUCT'])
data = data[data['Category'].isin(focuscrimes)]

***Create an Hour Column along with the Hourly DataFrame***

In [9]:
data['Hour'] = data['Time'].str.strip().str[0:2]

hours = ['Midnight - 1am', '1am - 2am', '2am - 3am', '3am - 4am', '4am - 5am', '5am - 6am', '6am - 7am', '7am - 8am', '8am - 9am', '9am - 10am', '10am - 11am', '11am - Noon', 'Noon - 1pm', '1pm - 2pm', '2pm - 3pm', '3pm - 4pm', '4pm - 5pm', '5pm - 6pm', '6pm - 7pm', '7pm - 8pm', '8pm - 9pm', '9pm - 10pm', '10pm - 11pm', '11pm - Midnight']
temp_hour = ['00', '01', '02', '03', '04', '05', '06', '07', '08', '09', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23']

#Map each two-digit hour string to its label; a direct lookup avoids substring clashes between labels
hour_labels = dict(zip(temp_hour, hours))
data['Hour'] = data['Hour'].map(hour_labels)
    
hourData = pd.pivot_table(data, index = "Hour", columns = "Category",values = 'IncidntNum' ,aggfunc = 'count')
hourData = hourData.reindex(hours)

***Normalize the Data***

In [10]:
normal = hourData.astype(float)

#Divide each crime's hourly counts by that crime's total so that every column sums to 1
normal = normal / normal.sum(axis=0)
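
The normalization divides each crime's hourly counts by that crime's total, so every column becomes a relative frequency that sums to 1. As a sketch on made-up counts (the numbers are hypothetical), the same operation can be written as a single vectorized division:

```python
import pandas as pd

# Made-up hourly counts for two crimes.
counts = pd.DataFrame(
    {"ASSAULT": [10, 30, 60], "ROBBERY": [5, 5, 10]},
    index=["Midnight - 1am", "1am - 2am", "2am - 3am"],
)

# Column totals are aligned by column name and divided row by row.
relative = counts / counts.sum(axis=0)
print(relative)
```

Normalizing per crime makes rare and common crimes comparable on the same y-axis, which is why the plot uses relative frequency rather than raw counts.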

***Convert Pandas DataFrame to Bokeh ColumnDataSource***

In [11]:
source = ColumnDataSource(normal)
output_notebook()

***Create an Empty Figure***

In [12]:
p = figure(x_range=FactorRange(factors=normal.index),
           plot_width=1200,
           title='Crimes per Hour', 
           x_axis_label='Hour of the Day', 
           y_axis_label='Relative Frequency')

***Different Colors for Different Crime***

In [13]:
cmap = sns.color_palette('icefire', len(normal.columns)).as_hex()

***Bars and Figure Settings***

In [14]:
bar = {}
items = []

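#Iterate over the DataFrame columns (not the set) so colors and legend entries follow a stable order
for i, crime in enumerate(normal.columns):
    bar[crime] = p.vbar(x='Hour', 
                    top=crime, 
                    source=source, 
                    width=0.7,
                    color=cmap[i],
                    fill_alpha=1.0,  #alpha must lie in [0, 1]
                    muted=True, 
                    muted_alpha=0.005) 
    items.append((crime, [bar[crime]]))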
    
p.xaxis.major_label_orientation = 1
p.y_range.start = 0

***Create an Interactive Legend***

In [15]:
legend = Legend(items=items)
p.add_layout(legend, 'left')    
p.legend.click_policy = 'mute'

show(p)

![Movie](https://github.com/Chrypapado/Social_Data_Analysis_and_Visualization/blob/master/Assignment2/Files/Bokeh%20Visualization.gif?raw=true "movie")

<span style="color:red">***Comment:***</span> In this interactive bar plot, you can select any of the crimes to see how it is distributed over the hours of the day. You can also select more than one and compare them. This example shows that an interactive plot gives the user an engaging way to explore the graph and gain some knowledge from it.