# Week 4

## Overview

Yay! It's week 4. Today we'll keep things light. I've noticed that many of you are struggling a bit to keep up and are still working on exercises from the previous week. Thus, this week we only have two components, with no lectures and very little reading.

* An exercise on visualizing geodata from a different perspective than the shapefile we played with during Lecture 2.
* Thinking about visualization, data quality, and binning. Why looking at very granular data is often important.

## Part 1: Visualizing geo-data

Today, we'll start by working with [Folium](https://github.com/python-visualization/folium) for plotting the GPS data. The exercise below is based on code illustrated in this nice [tutorial](https://www.kaggle.com/daveianhickey/how-to-folium-for-maps-heatmaps-time-data), so let us start by taking a look at that one.

*Reading*. Read through the following tutorial
 * "How to: Folium for maps, heatmaps & time data". Get it here: https://www.kaggle.com/daveianhickey/how-to-folium-for-maps-heatmaps-time-data
 * (Optional) There are also some nice tricks in "Spatial Visualizations and Analysis in Python with Folium". Read it here: https://towardsdatascience.com/data-101s-spatial-visualizations-and-analysis-in-python-with-folium-39730da2adf

> *Exercise*: A new take on geospatial data. 
>
>A couple of weeks ago (Part 4 of Week 2), we worked with spatial data by using the color intensity of shapefiles to show the counts of certain crimes within those individual areas. Today we look at studying geospatial data by plotting raw data points as well as heatmaps on top of actual maps.
> 
> * First start by plotting a map of San Francisco with a nice tight zoom. Simply use the command `folium.Map([lat, lon], zoom_start=13)`, where you'll have to look up San Francisco's longitude and latitude.
> * Next, use the coordinates for SF City Hall `37.77919, -122.41914` to indicate its location on the map with a nice, pop-up enabled marker. (In the screenshot below, I used the black & white Stamen tiles, because they look cool).
> ![example](https://raw.githubusercontent.com/suneman/socialdataanalysis2020/master/files/city_hall_2020.png)
> * Now, let's plot some more data (no need for popups this time). Select a couple of months of data for `'DRUG/NARCOTIC'` and draw a little dot for each arrest for those two months. You could, for example, choose June-July 2016, but you can choose anything you like - the main concern is to not have too many points as this uses a lot of memory and makes Folium behave non-optimally. 
> We can call this kind of visualization a *point scatter plot*.
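
To keep the number of points manageable, the selection step can be wrapped in a small helper. The sketch below is only one possible approach, not the official solution. The function name `select_incidents`, the `max_points` cap, and the toy table are assumptions made for illustration; the column names `Category`, `DateTime`, `X`, and `Y` follow the SF dataset as used in this notebook.

```python
import pandas as pd


def select_incidents(df, category, start, end, max_points=None, seed=0):
    """Return rows of one category with start <= DateTime < end.

    If max_points is given and the selection is larger than that,
    a random sample of max_points rows is returned instead.
    """
    mask = (
        (df["Category"] == category)
        & (df["DateTime"] >= pd.Timestamp(start))
        & (df["DateTime"] < pd.Timestamp(end))
    )
    subset = df.loc[mask]
    if max_points is not None and len(subset) > max_points:
        subset = subset.sample(n=max_points, random_state=seed)
    return subset


# Toy data standing in for the real incident table (made-up values).
toy = pd.DataFrame(
    {
        "Category": ["DRUG/NARCOTIC", "DRUG/NARCOTIC", "ASSAULT", "DRUG/NARCOTIC"],
        "DateTime": pd.to_datetime(
            ["2016-06-01 00:00", "2016-07-15 13:30", "2016-06-20 08:00", "2016-08-01 00:00"]
        ),
        "X": [-122.41, -122.42, -122.40, -122.43],
        "Y": [37.78, 37.77, 37.76, 37.79],
    }
)

june_july = select_incidents(toy, "DRUG/NARCOTIC", "2016-06-01", "2016-08-01")
capped = select_incidents(toy, "DRUG/NARCOTIC", "2016-06-01", "2016-08-01", max_points=1)
print(len(june_july), len(capped))
```

The half-open interval (start included, end excluded) means that consecutive windows never share an incident. Each row of the result can then be drawn with `folium.CircleMarker`, passing `[Y, X]` as the location.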

Ok. Time for a little break. Note that a nice thing about Folium is that you can zoom in and out of the maps.

> * Now, let's play with **heatmaps**. You can figure out the appropriate commands by grabbing code from the main [tutorial](https://www.kaggle.com/daveianhickey/how-to-folium-for-maps-heatmaps-time-data) and modifying it to suit your needs.
>    * To create your first heatmap, grab all arrests for the category `'SEX OFFENSES, NON FORCIBLE'` across all time. Play with parameters to get plots you like.
>    * Now, comment on the differences between scatter plots and heatmaps. 
>      - What can you see using the scatter plots that you can't see using the heatmaps?
>      - And *vice versa*: what do the heatmaps help you see that's difficult to distinguish in the scatter plots?
>    * Play around with the various parameters for heatmaps. You can find a list here: https://python-visualization.github.io/folium/plugins.html
>    * Comment on the effect of the various parameters for the heatmaps. How do they change the picture? (at least talk about the `radius` and `max_zoom`).
> For one combination of settings, my heatmap plot looks like this.
> ![maps](https://raw.githubusercontent.com/suneman/socialdataanalysis2020/master/files/crime_hot_spot.png)
>    * In that screenshot, I've (manually) highlighted a specific hotspot for this type of crime. Use your detective skills to find out what's going on in that building on the 800 block of Bryant street ... and explain in your own words. 

(*Fun fact*: I remembered the concentration of crime-counts discussed at the end of this exercise from when I did the course back in 2016. It popped up when I used a completely different framework for visualizing geodata called [`geoplotlib`](https://github.com/andrea-cuttone/geoplotlib). You can spot it if you go to that year's [lecture 2](https://nbviewer.jupyter.org/github/suneman/socialdataanalysis2016/blob/master/lectures/Week3.ipynb), exercise 4.)

In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import folium

In [14]:
focuscrimes = set(['WEAPON LAWS', 'PROSTITUTION', 'DRIVING UNDER THE INFLUENCE', 'ROBBERY', 'BURGLARY', 'ASSAULT', 'DRUNKENNESS', 'DRUG/NARCOTIC', 'TRESPASS', 'LARCENY/THEFT', 'VANDALISM', 'VEHICLE THEFT', 'STOLEN PROPERTY', 'DISORDERLY CONDUCT'])

In [15]:
policedata = pd.read_csv(r'C:\Users\Bruger\Desktop\LargeDataFiles\Police_Department_Incident_Reports__Historical_2003_to_May_2018.csv')
policedata['DateTime'] = policedata['Date'] + ' ' + policedata['Time']

policedata['DateTime'] = pd.to_datetime(policedata['DateTime'] , format="%m/%d/%Y %H:%M") 

fcdata = policedata[policedata['Category'].isin(list(focuscrimes))]

In [186]:
sanfran = folium.Map([37.7749,-122.4194], zoom_start=13)
sanfran1 = folium.Map([37.7749,-122.4194], zoom_start=13)
sanfran2 = folium.Map([37.7749,-122.4194], zoom_start=13)
sanfran

In [51]:
folium.Marker([37.77919, -122.4191], popup='City Hall').add_to(sanfran)  


In [52]:
sanfran

In [59]:
# Select DRUG/NARCOTIC incidents from June and July 2016 (end date exclusive)
drugnarcotic = policedata[policedata['Category'] == 'DRUG/NARCOTIC']
drugnarcoticsubset = drugnarcotic[(drugnarcotic['DateTime'] >= "2016-06-01") & (drugnarcotic['DateTime'] < "2016-08-01")]
# Draw one small dot per incident; Folium expects [latitude, longitude], i.e. [Y, X]
for lat, lon in zip(drugnarcoticsubset.Y, drugnarcoticsubset.X):
    folium.CircleMarker(location=[lat, lon],
                        radius=0.2,
                        weight=1).add_to(sanfran)

In [60]:
sanfran

## HEATMAPS

In [92]:
# All non-forcible sex offenses across the whole period, as [latitude, longitude] pairs
sexoffends = policedata[policedata["Category"] == "SEX OFFENSES, NON FORCIBLE"]
heat_data = sexoffends[['Y', 'X']].values.tolist()

In [103]:
from folium.plugins import HeatMap
HeatMap(heat_data,min_opacity=0.5,radius = 20).add_to(sanfran1)
sanfran1

Density is easier to judge with heatmaps, whereas exact locations are easier to read from the points. The points also give more insight into the absolute number of crimes, since the heatmap intensity is relative to the number of crimes committed. When points lie on top of each other, the heatmap can also help identify high-risk zones. Increasing the radius can help if the points are generally far apart relative to the zoom level of the map, or if you have an idea of the risk radius of a certain crime. For example, for Copenhagen it would probably be wise to use a small radius, since crimes are concentrated locally around hotspots, whereas for crime across all of Denmark a larger radius makes more sense. Zooming one level out on the San Francisco map reveals a giant blur of crimes that provides no real information.

For the final element of working with heatmaps, let's now use the cool Folium functionality `HeatMapWithTime` to create a visualization of how the patterns of your favorite crime type change over time.

> *Exercise*: Heat map movies. This exercise is a bit more independent than above - you get to make all the choices.
> * Start by choosing your favorite crime type. Preferably one with spatial patterns that change over time (use your data exploration from the previous lectures to choose a good one).
> * Now, choose a time resolution. You could use daily, weekly, or monthly data for the frames of your movie. Again, the goal is to find interesting temporal patterns to display. We want at least 20 frames though.
> * Create the movie using `HeatMapWithTime`.
> * Comment on your results: 
>   - What patterns does your movie reveal?
>   - Motivate/explain the reasoning behind your choice of crimetype and time-resolution. 
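
The data preparation for such a movie could look roughly like the sketch below: one list of `[lat, lon]` pairs per time step, plus a matching list of labels. This is a minimal illustration under assumptions, not a reference solution. The helper name `monthly_frames` and the toy table are made up, while the columns `DateTime`, `X`, and `Y` follow the dataset as used in this notebook. Note that months without any incidents simply produce no frame.

```python
import pandas as pd


def monthly_frames(df):
    """Group points by calendar month, in chronological order.

    Returns (frames, labels): frames[i] is a list of [lat, lon] pairs
    for the month named by labels[i], formatted as "YYYY-MM".
    """
    month = df["DateTime"].dt.strftime("%Y-%m")
    frames, labels = [], []
    # groupby sorts its keys, and "YYYY-MM" strings sort chronologically.
    for label, group in df.groupby(month):
        frames.append(group[["Y", "X"]].values.tolist())
        labels.append(label)
    return frames, labels


# Toy data standing in for one crime type (made-up values).
toy = pd.DataFrame(
    {
        "DateTime": pd.to_datetime(
            ["2016-02-01 09:00", "2015-12-31 23:00", "2016-01-05 12:00", "2016-01-20 18:30"]
        ),
        "X": [-122.41, -122.42, -122.40, -122.43],
        "Y": [37.78, 37.77, 37.76, 37.79],
    }
)

frames, labels = monthly_frames(toy)
print(labels)
```

The two lists can then be handed to Folium as `plugins.HeatMapWithTime(frames, index=labels)`, so that the movie shows which month each frame belongs to.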

In [183]:
disorderlyconduct = policedata[(policedata["Category"] == "DISORDERLY CONDUCT")]
# Use monthly frames; a 'YYYY-MM' key sorts chronologically
dosorted = disorderlyconduct.sort_values(by='DateTime')
dosorted["MonthYears"] = dosorted["DateTime"].dt.strftime("%Y-%m")

In [185]:
# One frame per month key (in sorted key order), each a list of [latitude, longitude] pairs
heat_time_series = [group[['Y', 'X']].values.tolist() for _, group in dosorted.groupby('MonthYears')]

In [188]:
from folium import plugins
hm = plugins.HeatMapWithTime(heat_time_series,auto_play=True,max_opacity=0.8)
hm.add_to(sanfran2)
# Display the map
sanfran2

## Part 2: Errors in the data. The importance of looking at raw (or close to raw) data.

We started the course by plotting simple histogram plots that showed a lot of cool patterns. But sometimes the binning can hide imprecision, irregularity, and simple errors in the data that could be misleading. In the work we've done so far, we've already come across at least three examples of this in the SF data. 

1. In the hourly activity for `PROSTITUTION` something surprising is going on on Wednesday. Remind yourself [**here**](https://raw.githubusercontent.com/suneman/socialdataanalysis2020/master/files/prostitution_hourly.png), where I've highlighted the phenomenon I'm talking about.
1. When we investigated the details of how the timestamps are recorded using jitter-plots, we saw that many more crimes were recorded e.g. on the hour, 15 minutes past the hour, and to a lesser extent in whole increments of 10 minutes. Crimes didn't appear to be recorded as frequently in between those round numbers. Remind yourself [**here**](https://raw.githubusercontent.com/suneman/socialdataanalysis2020/master/files/jitter_plot.png), where I've highlighted the phenomenon I'm talking about.
1. And finally, today we saw that the Hall of Justice seemed to be an unlikely hotspot for sex offences. Remind yourself [**here**](https://raw.githubusercontent.com/suneman/socialdataanalysis2020/master/files/crime_hot_spot.png).
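
A hotspot like the one in the third example can also be found without a map, by counting how many incidents share exactly the same coordinates. The sketch below shows one way to do that. The helper name `top_locations` and the toy coordinates are invented for illustration and do not correspond to real addresses.

```python
import pandas as pd


def top_locations(df, n=5):
    """Count incidents per exact (Y, X) pair, largest count first."""
    counts = (
        df.groupby(["Y", "X"])
        .size()
        .sort_values(ascending=False)
        .rename("count")
        .reset_index()
    )
    return counts.head(n)


# Toy data: three incidents stacked on one point, two spread out (made-up values).
toy = pd.DataFrame(
    {
        "Y": [37.775, 37.775, 37.775, 37.780, 37.765],
        "X": [-122.404, -122.404, -122.404, -122.420, -122.410],
    }
)

top = top_locations(toy)
print(top)
```

A location whose count is far above all others is a good candidate for a closer look: the coordinates may describe where the incident was recorded or processed, not where it actually took place.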

> *Exercise*: Data errors. The data errors we discovered above become invisible when we aggregate data, for example when we calculate mean values or, more generally, summary statistics. And when we visualize, they become difficult to notice when we bin the data. We explore this process in the exercise below.
>
>This last exercise for today has two parts.
> * For each of the three examples above, describe in your own words how the data errors I call attention to could bias the binned versions of the data. Also briefly mention how they could create errors in how we understand what's going on in San Francisco and in our modeling.
> * Find your own example of human noise in the data and visualize it.
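
As a starting point for spotting human noise, it can help to compare a coarse binning with a finer one. The sketch below uses a handful of made-up timestamps (an assumption for illustration, not real records) in which the recorded times cluster on the full hour. The hourly counts look unremarkable, while the minute-of-hour counts expose the rounding.

```python
import pandas as pd

# Made-up timestamps: half of them fall exactly on the hour.
stamps = pd.Series(
    pd.to_datetime(
        [
            "2016-06-01 10:00", "2016-06-01 10:00", "2016-06-01 10:30", "2016-06-01 10:17",
            "2016-06-01 11:00", "2016-06-01 11:15", "2016-06-01 11:00", "2016-06-01 11:42",
        ]
    )
)

# Coarse bins: incidents per hour of day.
by_hour = stamps.dt.hour.value_counts().sort_index()

# Fine bins: incidents per minute within the hour.
by_minute = stamps.dt.minute.value_counts().sort_index()

print(by_hour)
print(by_minute)
```

On the real data, the same two counts could be drawn as bar charts, for example with `by_minute.plot(kind="bar")`, to make the spikes at round minutes visible.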