# 02806 - Social Data Analysis and Visualization: Explainer Notebook

## Neighborhood complaints in New York City

This page is the explainer notebook for the Neighborhood Complaints in New York City project found at https://aksteffensen.github.io/. 

### Motivation
The data used in this project is a combination of data from different sources. Originally, we chose to work with these datasets because we wanted to investigate whether there was a correlation between criminal incidents/service requests and how much money people in the five New York City boroughs made. However, the NYC payroll data we had found turned out to be too incomplete, so we switched our focus to the boroughs and the number of service requests and criminal incidents in them during recent years.

The aim of this project is to help users obtain an overview of different criminal incidents and service requests in New York City. Such information can be helpful for people moving or travelling to New York who want to avoid areas where certain incidents happen more often than elsewhere.

### Basic Stats
We consider three datasets in total:

1. Service requests in NYC in 2015, found here: https://data.cityofnewyork.us/Social-Services/311-Service-Requests-2015/hemm-82xw
2. NYPD complaint data from 1967 to 2017, found here: https://data.cityofnewyork.us/Public-Safety/NYPD-Complaint-Map-Year-to-Date-/2fra-mtpn
3. A map of the NYC boroughs, found here: https://github.com/dwillis/nyc-maps/blob/master/boroughs.geojson

Dataset 1 contains 1,707,076 observations and 39 variables. Because the dataset is 900 MB, a small random sample of 2431 observations with no missing values has been generated in the <tt>R</tt>-script <tt>datafix.R</tt>. The formatting of dates and the cleaning of the data can be seen in <tt>datafix2.R</tt>. The data file <tt>complaint_sample2.csv</tt> contains the preprocessed and cleaned data, which is used for the timeline on the website along with the geojson file <tt>boroughs.geojson</tt>. Lastly, a dataset <tt>incidents.csv</tt> is generated from Dataset 1 and used in the first bar plot on the website.
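The sampling itself is done in <tt>R</tt> in <tt>datafix.R</tt>, which is not reproduced here. Purely as an illustration, the same idea (keep only rows without missing values, then draw a random sample) could be sketched in Python as below. The column names, the tiny inline CSV, and the function name `sample_complete_rows` are assumptions made up for the example.

```python
import csv
import io
import random


def sample_complete_rows(rows, k, seed=0):
    """Return up to k randomly chosen rows that have no missing values."""
    complete = [r for r in rows if all(v not in ("", None) for v in r.values())]
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return rng.sample(complete, min(k, len(complete)))


# Tiny made-up CSV standing in for the 311 export (column names are assumptions).
raw = io.StringIO(
    "Created Date,Borough,Latitude,Longitude\n"
    "01/01/2015,BROOKLYN,40.65,-73.95\n"
    "01/02/2015,,40.70,-73.90\n"
    "01/03/2015,QUEENS,40.73,-73.80\n"
    "01/04/2015,BRONX,,\n"
)
rows = list(csv.DictReader(raw))
sample = sample_complete_rows(rows, k=2)
```

For the full 900 MB file one would probably read the data in chunks rather than hold every row in memory; the sketch ignores that.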

Dataset 2 contains 468,761 observations and 24 variables. The data have been summarized for each value of <tt>LAW_CAT_CD</tt> and for each borough. <tt>LAW_CAT_CD</tt> describes which of three categories of legal offense has been committed: FELONY, MISDEMEANOR, or VIOLATION. The preprocessing of this dataset can be seen in <tt>NYPD_datafix.R</tt>. Finally, these values are merged with Dataset 3 above, such that a new geojson file <tt>geodata.json</tt> is generated. This file is used for the figure in the NYPD Complaints section on the website. Further preprocessing has been done in <tt>NYPD_datafix2.R</tt>, whose output is used for the correlation plot in the last section on the website.
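The summarize-and-merge step described above is implemented in the project's <tt>R</tt> scripts. A minimal Python sketch of the same idea is given below: count complaints per borough and legal category, then copy the counts into the properties of each GeoJSON feature. The field names `BORO_NM` and `BoroName`, the helper names, and the toy records are assumptions for illustration and may not match the actual files.

```python
import json
from collections import Counter

CATEGORIES = ("FELONY", "MISDEMEANOR", "VIOLATION")


def count_by_borough(complaints):
    """Count complaints per (borough, legal category) pair."""
    return Counter((c["BORO_NM"], c["LAW_CAT_CD"]) for c in complaints)


def merge_counts(geojson, counts, name_key="BoroName"):
    """Copy per-category counts into each feature's properties (in place)."""
    for feature in geojson["features"]:
        props = feature["properties"]
        boro = props[name_key].upper()  # complaint data is assumed upper-case
        for cat in CATEGORIES:
            props[cat] = counts.get((boro, cat), 0)
    return geojson


# Toy records, not real data.
complaints = [
    {"BORO_NM": "BRONX", "LAW_CAT_CD": "FELONY"},
    {"BORO_NM": "BRONX", "LAW_CAT_CD": "VIOLATION"},
    {"BORO_NM": "QUEENS", "LAW_CAT_CD": "FELONY"},
]
boroughs = {
    "type": "FeatureCollection",
    "features": [
        {"type": "Feature", "properties": {"BoroName": "Bronx"}, "geometry": None},
        {"type": "Feature", "properties": {"BoroName": "Queens"}, "geometry": None},
    ],
}

merged = merge_counts(boroughs, count_by_borough(complaints))
geodata_text = json.dumps(merged)  # what would be written to a geojson file
```

Storing the counts directly in the feature properties means the map code only has to load one file to draw the boroughs and look up their numbers.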

All the <tt>R</tt>-scripts and the <tt>Matlab</tt>-script for preprocessing can be seen in the data folder.

### Genre
<b>Genre: </b>The genres used in this project are annotated graphs/maps and an animation. We chose these genres because we found them informative and visually appealing.


<b>Visual Narrative:</b> For the visual structuring we have mostly used a consistent visual platform, but a kind of progress bar is also used in the timeline animation. Highlighting is done through feature distinction, as different features are presented for the different boroughs using, among other things, bar charts and tooltips. Most transitions in the visualizations are animated to give the user a smooth experience.

    
<b>Narrative Structure:</b> The overall ordering of the project follows a user-directed path, where the user is guided through the different visualizations so that the story is told in the correct order. At the same time, the user has random access to each of the visualizations they are introduced to. The visualizations offer interactivity in the form of hover details and filtering, while the supporting text aids the user with a few instructions. This ensures that the user is introduced to each visualization and the data it depicts, but is also free to explore the data themselves. Hence the messaging is mostly done through introductory text, but annotations and headlines are also used to make sure the necessary information is present in each of the visualizations.

### Visualizations
The first visualization on the page is a fairly simple bar chart that displays the number of complaint incidents from NYCOpenData in the five boroughs. This plot is introduced first because it gives a quick overview of the data, but it also shows that when the number of complaint incidents is normalized, the difference between the five boroughs is small. This leads the user onwards to the next illustration.

The second visualization is a geoplot of the boroughs that interacts with a bar plot displaying the number of incidents as a timeline over 2015. A brush has been added to the timeline so that the user can choose the desired timeframe, and the data points within it are then visualized on the geoplot using the coordinate data in <tt>complaint_sample.csv</tt>. Additionally, an animate button has been added so that the user can get a visual feel for where most incidents happened in 2015 and how often incidents occur around the different boroughs.

The third visualization is a geoplot containing information about the NYPD complaint data. We wanted to investigate which of the NYC boroughs was the safest place to be, which we believed could be correlated with the number of complaints to the police. The number of complaints is somewhat proportional to the number of inhabitants; however, we still found some significant differences when normalizing the data with respect to the number of inhabitants in each borough in 2016.
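Normalizing with respect to the number of inhabitants amounts to dividing each borough's complaint count by its population. The sketch below shows this as a rate per 1,000 inhabitants. The function name `per_1000_inhabitants` is hypothetical and the two boroughs and their numbers are placeholders, not the figures used on the website.

```python
def per_1000_inhabitants(counts, population):
    """Scale raw counts to a rate per 1,000 inhabitants for each borough."""
    return {boro: 1000 * counts[boro] / population[boro] for boro in counts}


# Placeholder values, NOT real NYC figures.
counts = {"Borough A": 500, "Borough B": 1000}
population = {"Borough A": 10_000, "Borough B": 40_000}

rates = per_1000_inhabitants(counts, population)
# Borough B has more complaints in total, but Borough A has the higher rate.
```

The example illustrates why raw totals alone can be misleading: a larger borough can have more complaints in total while still having fewer complaints per inhabitant.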

The fourth and final visualization is a correlation plot illustrating that there does not seem to be a correlation between the legal category of a complaint and the borough it has been reported in. Unfortunately, this plot shows that simple statistics cannot distinguish between the incidents that happen in the different boroughs, and that it will therefore not be simple for users to choose a borough to travel or move to if they want to avoid a specific kind of incident or legal violation.
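The text does not state which measure underlies the correlation plot. As one possible way to quantify the association between two categorical variables such as borough and legal category, the sketch below computes Cramér's V from a table of counts. This is an illustrative choice, not necessarily the method used in <tt>NYPD_datafix2.R</tt>, and the example table is made up.

```python
import math


def cramers_v(table):
    """Cramér's V for a contingency table given as a list of rows of counts.

    Assumes every row total and column total is greater than zero.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    k = min(len(table), len(table[0])) - 1
    return math.sqrt(chi2 / (n * k))


# Made-up counts: rows are boroughs, columns are the three legal categories.
# The rows are proportional, so category does not depend on borough.
proportional = [[10, 20, 30], [20, 40, 60]]
v_independent = cramers_v(proportional)

# Made-up counts where each borough has only one category.
v_dependent = cramers_v([[10, 0], [0, 10]])
```

A value near 0 indicates that the mix of legal categories is about the same in every borough, while a value near 1 would indicate that the borough largely determines the category.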

### Discussion
<b> What went well? </b>

We have managed to create an overview of the number of service requests and NYPD complaints in the NYC boroughs. We have shown both the total number of service requests in the boroughs and the number normalized by population size, which revealed that there was no significant difference between the five boroughs. Additionally, we managed to illustrate where the service requests were reported and created a timeline that helps illustrate the requests on the NYC map.

Furthermore, we managed to find the number of offenses reported to the NYPD and display them in a figure when a borough is selected. This gives the user an idea of how many offenses are reported and which of the three legal categories the offenses belong to.

Lastly we examined the correlation between the boroughs and the legal categories of the offenses reported, which showed that there was no correlation between where the offense had happened and what legal category it belonged to.

<b> What could be improved? </b>

We did not manage to find the clear differences between the five boroughs that we had hoped for, and it was therefore hard to draw any final conclusions. To resolve this, more variables from the datasets could have been considered, for instance the service request resolve time, which would add a larger perspective to the story. Furthermore, a more complex analysis could have been carried out in order to examine the different boroughs. Currently we are only looking at total/normalized numbers and simple correlations, which are very simple measures. Had we carried out a more thorough exploratory analysis of the data, we would probably have had better ideas for modelling the data through, for instance, machine learning or deep learning models.

The third visualization on the page did not end up as intended and could therefore definitely be improved. We would have liked the information shown when hovering over one of the boroughs to be displayed in a bar chart that only appeared on hover. However, we were not able to make the two plots work properly together, and therefore decided to implement the tooltip instead, such that the user still receives the intended information.

Lastly, since the data displayed in the first barplot and the second brushplot is a random sample of a larger dataset, we could have displayed the full actual information somewhere, even if it was just written on the page, such that the user does not interpret data depicted in our visualizations as the actual true numbers.

### Contribution

The information below shows who was mainly responsible for the different parts of the project. However, we have both worked on all sections. The person listed has been the primary force working on both text and visualization in the respective sections:

<b> Kasper Schjødt-Hansen: </b>Data, Visualizing incidents in 2015 over time, NYPD complaints in Boroughs, Conclusion.

<b> Andreas Kjer Steffensen: </b>Introduction, Service Request in Numbers, Correlation between Boroughs and Complaint Types, Conclusion. 
