# Week 4

Yay! It's week 4. Last week had a lot of material, so this week we only have three components with very little reading.


## Overview

* A video lecture with a few questions
* An exercise on visualizing geodata using a different set of tools from the ones we played with previously.
* Thinking about visualization, data quality, and binning. Why ***looking at the details of the data before applying fancy methods*** is often important.

## Part 1: More lecturing on dataviz

We begin today by learning more about the theory of visualization, digging into data encodings and representations.

[![Week 4 video lecture](https://img.youtube.com/vi/zE6Nr8trdrw/0.jpg)](https://www.youtube.com/watch?v=zE6Nr8trdrw)

> *Exercise:* Some questions about the video. <font color=gray>Try to answer using your human brain first (rather than your LLM). </font>
>
> * Mention 10 examples of ways we can encode data.
> * Are all encodings created equally? Why not? Can you think of an example from the previous lectures?
> * Mention 3 encodings that are difficult for the human eye to parse. Can you find an example of a visualization online that uses one of those three?
> * Explain in your own words: What is the problem with pie-charts?

## Part 2: Visualizing geo-data

It turns out that `plotly` (which we used during Week 2) is not the only way of working with geo-data. There are many different ways to go about it. (The more advanced PhD and PostDoc researchers in my group simply use matplotlib, since that provides more control. For an example of that kind of thing, check out [this tutorial](https://towardsdatascience.com/visualizing-geospatial-data-in-python-e070374fe621).)

Today, we'll try another library for geodata called [Folium](https://github.com/python-visualization/folium). It's good for you all to try out a few different libraries - remember that data visualization and analysis in Python is all about the ability to use many different tools. 

The exercise below is based on the code illustrated in this nice [tutorial](https://www.kaggle.com/daveianhickey/how-to-folium-for-maps-heatmaps-time-data), so let us start by taking a look at that one.

*Reading*. Read through the following tutorial:
 * "How to: Folium for maps, heatmaps & time data". Get it here: https://www.kaggle.com/daveianhickey/how-to-folium-for-maps-heatmaps-time-data. \[**UPDATE 2024**: Note that the Stamen tiles are no longer available.\]
 * (Optional) There are also some nice tricks in "Spatial Visualizations and Analysis in Python with Folium". Read it here: https://towardsdatascience.com/data-101s-spatial-visualizations-and-analysis-in-python-with-folium-39730da2adf

> *Exercise*: A different take on geospatial data. <font color=gray>It's OK to use your LLM for all of the Folium exercises</font>.
>
>A couple of weeks ago (Part 3 of Week 2), we worked with spatial data by using color-intensity of shapefiles to show the counts of certain crimes within those individual areas. Today, we look at studying geospatial data by plotting raw data points as well as heatmaps on top of actual maps.
> 
> * First start by plotting a map of San Francisco with a nice tight zoom. Simply use the command `folium.Map([lat, lon], zoom_start=13)`, where you'll have to look up San Francisco's longitude and latitude.
> * Next, use the coordinates for SF City Hall `37.77919, -122.41914` to indicate its location on the map with a nice, pop-up enabled marker. (In the screenshot below, I used the black & white Stamen tiles, because they look cool. <mark>**UPDATE 2024**: Note that the Stamen tiles are no longer available, but there are many other tile options. Link for more information on Stamen [**here**](https://stamen.com/here-comes-the-future-of-stamen-maps/)</mark>).
>  
> <img src="https://raw.githubusercontent.com/suneman/socialdata2022/main/files/city_hall_2022.png" alt="drawing" width="600"/>
>
> * Now, let's plot some more data (no need for pop-ups this time). Select a couple of months of data for `'DRUG/NARCOTIC'` and draw a little dot for each arrest for those two months. You could, for example, choose June-July 2016, but you can choose anything you like - the main concern is to not have too many points as this uses a lot of memory and makes Folium behave non-optimally. 
> We can call this kind of visualization a *point scatter plot*.

Ok. Time for a little break. Note that a nice thing about Folium is that you can zoom in and out of the maps.

> *Exercise*: Heatmaps.
> * Now, let's play with **heatmaps**. You can figure out the appropriate commands by grabbing code from the main [tutorial](https://www.kaggle.com/daveianhickey/how-to-folium-for-maps-heatmaps-time-data) and modifying it to suit your needs.
>    * To create your first heatmap, grab all arrests for the category `'SEX OFFENSES, NON FORCIBLE'` across all time. Play with parameters to get plots you like.
>    * Now, comment on the differences between scatter plots and heatmaps. 
>      - What can you see using the scatter plots that you can't see using the heatmaps?
>      - And *vice versa*: what do the heatmaps help you see that's difficult to distinguish in the scatter plots?
>    * Play around with the various parameters for heatmaps. You can find a list here: https://python-visualization.github.io/folium/plugins.html
>    * Comment on the effect of the various parameters for the heatmaps. How do they change the picture? (At least talk about `radius` and `blur`.)

In [121]:
import numpy as np
import pandas as pd
import folium
from folium.plugins import HeatMap, HeatMapWithTime
import os

In [103]:
#get current working directory
cwd = os.getcwd()
#get parent directory
parent = os.path.dirname(cwd)
#get files directory
files = os.path.join(parent, 'files')
#get police department data as pandas dataframe
police = pd.read_csv(os.path.join(files, 'Police_Department_Incident_Reports__Historical_2003_to_May_2018_20240130.csv'))



In [104]:
#CLEAN DATA (Remove all data from 2018)
#convert string to datetime
police['Date'] = pd.to_datetime(police['Date'])
#get year from datetime and store into new column called year
police['Year'] = police['Date'].dt.year
#remove all rows with year 2018
police = police[police.Year != 2018]

In [115]:
lat = 37.773972
lon = -122.431297
# create map and display it
sanfran_map = folium.Map(location=[lat, lon], zoom_start=12)

In [68]:
SF_city_hall = [37.77919, -122.41914]
folium.Marker(SF_city_hall, popup='San Francisco City Hall').add_to(sanfran_map)

<folium.map.Marker at 0x14d126c60d0>

In [69]:
url = 'https://raw.githubusercontent.com/suneman/socialdata2022/main/files/sfpd.geojson'
popup = folium.GeoJsonPopup(fields=['DISTRICT'])
folium.GeoJson(url, popup = popup).add_to(sanfran_map)


<folium.features.GeoJson at 0x14ce56ec810>

In [116]:
#GET ALL DRUG CRIMES IN JUNE-JULY 2016
crime = 'DRUG/NARCOTIC'
#get all instances of crime in June and July 2016
crime_data = police[(police['Category'] == crime) & (police['Year'] == 2016) & (police['Date'].dt.month.isin([6, 7]))]
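#make circle plots for each crime using Y (latitude) and X (longitude) coordinates
#(separate loop names so the map-center variables lat and lon are not overwritten)
for point_lat, point_lon in zip(crime_data['Y'], crime_data['X']):
    folium.CircleMarker([point_lat, point_lon],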
                        radius=3,
                        color='red',
                        fill=True,
                        fill_color='red',
                        fill_opacity=0.6,
                       ).add_to(sanfran_map)

In [137]:
sanfran_map = folium.Map(location=[lat, lon], zoom_start=12)
#Get all instances of SEX OFFENSES, NON FORCIBLE across all years
sex_crime = police[police['Category'] == 'SEX OFFENSES, NON FORCIBLE']

heat_data = list(zip(sex_crime['Y'],sex_crime['X']))
HeatMap(heat_data).add_to(sanfran_map)
sanfran_map

For the final element of working with heatmaps, let's now use the cool Folium functionality `HeatMapWithTime` to create a visualization of how the patterns of your favorite crime-type change over time.

> *Exercise*: Heatmap movies. This exercise is a bit more independent than above - you get to make all the choices.
> * Start by choosing your favorite crimetype. Preferably one with spatial patterns that change over time (use your data-exploration from the previous lectures to choose a good one).
> * Now, choose a time-resolution. You could use daily, weekly, or monthly data for the frames in your movie. Again the goal is to find interesting temporal patterns to display. We want at least 20 frames though.
> * Create the movie using `HeatMapWithTime`.
> * Comment on your results: 
>   - What patterns does your movie reveal?
>   - Motivate/explain the reasoning behind your choice of crimetype and time-resolution. 

In [138]:

# Synthetic example data for testing HeatMapWithTime (not used in the cells below)
np.random.seed(3141592)
initial_data = np.random.normal(size=(100, 2)) * np.array([[1, 1]]) + np.array(
    [[48, 5]]
)

move_data = np.random.normal(size=(100, 2)) * 0.01

data = [(initial_data + move_data * i).tolist() for i in range(100)]

In [232]:
categories = police['Category'].unique()
categories

array(['ROBBERY', 'VEHICLE THEFT', 'ARSON', 'ASSAULT', 'TRESPASS',
       'BURGLARY', 'LARCENY/THEFT', 'WARRANTS', 'OTHER OFFENSES',
       'DRUG/NARCOTIC', 'SUSPICIOUS OCC', 'LIQUOR LAWS', 'VANDALISM',
       'WEAPON LAWS', 'NON-CRIMINAL', 'MISSING PERSON', 'FRAUD',
       'SEX OFFENSES, FORCIBLE', 'SECONDARY CODES', 'DISORDERLY CONDUCT',
       'RECOVERED VEHICLE', 'KIDNAPPING', 'FORGERY/COUNTERFEITING',
       'PROSTITUTION', 'DRUNKENNESS', 'BAD CHECKS',
       'DRIVING UNDER THE INFLUENCE', 'LOITERING', 'STOLEN PROPERTY',
       'SUICIDE', 'BRIBERY', 'EXTORTION', 'EMBEZZLEMENT', 'GAMBLING',
       'PORNOGRAPHY/OBSCENE MAT', 'SEX OFFENSES, NON FORCIBLE', 'TREA'],
      dtype=object)

In [228]:
#SEX OFFENSES, FORCIBLE
Category = 'SEX OFFENSES, FORCIBLE'
offenses = police[police['Category'] == Category]
#loop through each year and month in chronological order (year outer, month inner)
time_coords = []
for year in range(2003, 2017):
    for month in range(1, 13):
        #get all instances of the category for this year and month
        month_data = offenses[(offenses['Year'] == year) & (offenses['Date'].dt.month == month)]
        time_coords.append(list(zip(month_data['Y'], month_data['X'])))
#convert coordinates from tuple to list
time_coords = [list(map(list, frame)) for frame in time_coords]

In [229]:
#append a constant weight to every point, so each entry becomes [lat, lon, weight]
weight = 0.3
for time_entry in time_coords:
    for row in time_entry:
        row.append(weight)

In [230]:
sanfran_map = folium.Map(location=[lat, lon], zoom_start=12)
HeatMapWithTime(time_coords).add_to(sanfran_map)
sanfran_map
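
An alternative way to build the movie frames is to let pandas do the grouping, which keeps the frames in chronological order and gives you a label for each one. This is only a sketch: the helper name `monthly_frames` is hypothetical, and the column names `Date`, `Y`, and `X` are assumed to match the SF data used above. Note that months without any incidents are simply skipped.

```python
import pandas as pd

def monthly_frames(df, date_col="Date", lat_col="Y", lon_col="X"):
    """Return one list of [lat, lon] points per month, plus a label per month."""
    # Grouping by monthly period sorts the groups chronologically
    periods = df[date_col].dt.to_period("M")
    frames, labels = [], []
    for period, group in df.groupby(periods):
        frames.append(group[[lat_col, lon_col]].values.tolist())
        labels.append(str(period))
    return frames, labels

# Tiny made-up example, only to show the shape of the output
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2016-07-02", "2016-06-15", "2016-06-20"]),
    "Y": [37.77, 37.78, 37.79],
    "X": [-122.41, -122.42, -122.43],
})
frames, labels = monthly_frames(toy)
```

The result can then be handed to Folium as `HeatMapWithTime(frames, index=labels)`, so that each frame of the movie is shown with its month.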

## Part 3: Errors in the data. The importance of looking at raw (or close to raw) data.

We started the course by plotting simple histograms and bar plots that showed a lot of cool patterns. But sometimes the binning can hide imprecision, irregularity, and simple errors in the data that could be misleading. In the work we've done so far, we've already come across at least three examples of this in the SF data. <font color=gray>It's 100% OK to use your LLM for this one.</font>

1. In the temporal activity for `PROSTITUTION` something surprising is going on on Thursday. Remind yourself [**here**](https://raw.githubusercontent.com/suneman/socialdata2022/main/files/prostitution.png), where I've highlighted the phenomenon I'm talking about.
2. Last week, when we investigated the details of how the timestamps are recorded using jitter-plots in the DAOST exercises, we saw that many more crimes were recorded e.g. on the hour, 15 minutes past the hour, and to a lesser extent in whole increments of 10 minutes. Crimes didn't appear to be recorded as frequently in between those round numbers. Remind yourself [**here**](https://raw.githubusercontent.com/suneman/socialdata2022/main/files/jitter.png), where I've highlighted the phenomenon I'm talking about.
3. Also, the *Hall of Justice* on the 800 block of Bryant street seems to be an unlikely hotspot for sex offences. Take a look [**here**](https://raw.githubusercontent.com/suneman/socialdata2022/main/files/crime_hot_spot.png).
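
To make the second example concrete, here is a small sketch of how one might count records per minute of the hour. Binning by hour would hide exactly this pattern. The helper name `minute_profile` is hypothetical, the sample times are made up, and the `'HH:MM'` string format is an assumption about how the time column is stored.

```python
import pandas as pd

def minute_profile(times):
    """Count how many records fall on each minute of the hour (0-59)."""
    minutes = pd.to_datetime(pd.Series(times), format="%H:%M").dt.minute
    # Reindex so that minutes with no records show up as 0
    return minutes.value_counts().reindex(range(60), fill_value=0)

# Hypothetical sample of 'HH:MM' strings, made up for illustration
sample_times = ["12:00", "12:00", "13:15", "14:07", "15:30", "16:00"]
profile = minute_profile(sample_times)

# Share of records that land exactly on the hour or on a quarter/half hour
round_share = profile.loc[[0, 15, 30, 45]].sum() / profile.sum()
```

If timestamps were recorded precisely, each minute would hold roughly 1/60 of the records, so a large `round_share` is a hint that humans have been rounding.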

> *Exercise*: Data errors. The data errors we discovered above become difficult to notice when we aggregate data (and when we calculate mean values, as well as statistics more generally). Thus, when we visualize, errors become difficult to notice when binning the data. We explore this process in the exercise below.
>
>This last exercise for today has three parts:
> * In each of the examples above, describe in your own words how the data-errors I call attention to above can bias the binned versions of the data. Also, briefly mention how not noticing these errors can result in misconceptions about the underlying patterns of what's going on in San Francisco (and our modeling).
> * Find your own example of human noise in the data and visualize it.
> * Were you able to use your LLM for anything in this exercise?

**A1 part 4**:
- The temporal activity for `PROSTITUTION` shows a surprising pattern on a Thursday. Either there's a lot of prostitution going on in San Francisco on Thursdays, or there's something wrong with the data. It could be the case that reports done during the week were not registered until that day, or that the data was not properly recorded. It could also be the case that the same crime was reported several times. This leads to a bias in the data, and if not noticed, it could lead to misconceptions about the underlying patterns of prostitution during the week.
- The jitter plot shows that many more crimes were recorded e.g. on the hour, 15 minutes past the hour, and to a lesser extent in whole increments of 10 minutes. Crimes didn't appear to be recorded as frequently in between those round numbers. It's a common human habit to round the time to the nearest 5 or 10 minutes, and if the exact time can't be recalled, rounding to the nearest half or whole hour is the most likely option. This habit of rounding leads to a bias in the data. Consequently, if we were to model the daily crime rate, our predictions would most likely pick up on the rounding pattern, and we wouldn't be able to accurately predict the crime occurrences throughout the day.
- The *Hall of Justice* on the 800 block of Bryant street seems to be an unlikely hotspot, since it's right next to the penitentiary. The most likely explanation of this error is that sex offences were registered at the penitentiary, but the location was not properly recorded, so the penitentiary was used as a default location. The bias in the data will be picked up by our visualization tools, so if it goes unnoticed, one could easily be misled to believe that the Hall of Justice is a hotspot for sex offences.
- As shown in the following heatmap, the most frequent reports of sex offenses (non forcible) occur at the San Francisco General Hospital at Potrero Avenue. It is extremely unlikely that sex offenses would occur at a place like this, so similar to the first example, it's a result of reports being made at the hospital, some time after the actual incident. Consequently, one could be misled to believe that the hospital is a hotspot for sex offenses, which could have serious consequences for model predictions. As an example, if we were to distribute police officers based on the crime reports, we would most likely send more officers to the hospital, which would be a waste of resources.
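
One quick way to look for this kind of default location in the raw data is to count how often the exact same coordinate pair appears. The sketch below uses made-up coordinates and a hypothetical helper `top_locations`; the column names `Y` and `X` are assumed to match the SF data.

```python
import pandas as pd

def top_locations(df, n=5, lat_col="Y", lon_col="X"):
    """Return the n most frequent exact (lat, lon) pairs with their counts."""
    counts = df.groupby([lat_col, lon_col]).size()
    return counts.sort_values(ascending=False).head(n)

# Made-up records: three share one coordinate pair, two are spread out
toy = pd.DataFrame({
    "Y": [37.775, 37.775, 37.775, 37.760, 37.781],
    "X": [-122.404, -122.404, -122.404, -122.420, -122.411],
})
top = top_locations(toy, n=2)
```

A single coordinate pair with far more records than its neighbours is worth looking up on the map before trusting any hotspot.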