# Crime data visualization in San Francisco

San Francisco has one of the most generous "open data" policies of any large city. In this lab, we are going to download about 85M of data (238,456 records) describing all police incidents since 2018 (I grabbed the data on August 5, 2019).

## Getting started

Download [Police Department Incident Reports: 2018 to Present](https://data.sfgov.org/Public-Safety/Police-Department-Incident-Reports-2018-to-Present/wg3w-h783) or, if you want, all [San Francisco police department incidents since 1 January 2003](https://data.sfgov.org/Public-Safety/SFPD-Incidents-from-1-January-2003/tmnf-yvry). Save in "CSV for Excel" format.

We can easily figure out how many records there are:

```bash
$ wc -l Police_Department_Incident_Reports__2018_to_Present.csv
  238457 Police_Department_Incident_Reports__2018_to_Present.csv
```

So there are 238,456 records, not including the header row. You can name that data file whatever you want, but I will call it `SFPD.csv` for these exercises.
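
Keep in mind that `wc -l` counts newline characters, so it would overcount if any quoted field happened to contain a line break. As a cross-check, here is a minimal sketch that counts parsed records with Python's `csv` module. The tiny in-memory sample is made up for illustration and stands in for the real file; `count_records` is a hypothetical helper name.

```python
import csv
import io

# Made-up sample standing in for SFPD.csv; the second record has a
# quoted field that contains a line break.
sample = 'Incident ID,Incident Description\n1,Battery\n2,"Theft,\nOther Property"\n'

def count_records(f):
    """Count CSV records in an open file, excluding the header row."""
    reader = csv.reader(f)
    next(reader)                      # skip the header
    return sum(1 for _ in reader)

n_lines = sample.count('\n')          # what wc -l would report
n_records = count_records(io.StringIO(sample))
print(n_lines, n_records)
```

For the real file, something like `with open('/tmp/SFPD.csv', newline='') as f: print(count_records(f))` should do the same job.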

## Sniffing the data

Let's assume the file you downloaded is in `/tmp`:

```python
import pandas as pd

df_sfpd = pd.read_csv('/tmp/SFPD.csv')
df_sfpd.head(2).T
```

| | 0 | 1 |
|---|---|---|
| Incident Datetime | 2018/12/02 12:45:00 AM | 2018/12/01 08:30:00 PM |
| Incident Date | 2018/12/02 | 2018/12/01 |
| Incident Time | 00:45 | 20:30 |
| Incident Year | 2018 | 2018 |
| Incident Day of Week | Sunday | Saturday |
| Report Datetime | 2018/12/02 01:56:00 AM | 2018/12/01 09:18:00 PM |
| Row ID | 74374327130 | 74370071000 |
| Incident ID | 743743 | 743700 |
| Incident Number | 180908554 | 180908112 |
| CAD Number | 1.8336e+08 | 1.83354e+08 |
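
Notice that the date columns come in as plain strings. Here is a small sketch of converting `Incident Datetime` into real timestamps with `pd.to_datetime`. The format string is my guess based on the two rows shown above, and the two-row data frame just copies those values rather than reading the real file.

```python
import pandas as pd

# Two values copied from the head(2) output above.
df = pd.DataFrame({
    'Incident Datetime': ['2018/12/02 12:45:00 AM',
                          '2018/12/01 08:30:00 PM'],
})

# Assumed format: year/month/day, 12-hour clock with AM/PM marker.
when = pd.to_datetime(df['Incident Datetime'],
                      format='%Y/%m/%d %I:%M:%S %p')

print(when.dt.hour.tolist())        # hour of day on a 24-hour clock
print(when.dt.day_name().tolist())  # day of the week
```

The hours and day names computed this way should agree with the `Incident Time` and `Incident Day of Week` columns.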


To get a better idea of what the data looks like, let's do a simple histogram of the categories and crime descriptions.  Here is the category histogram:

```python
from collections import Counter
counter = Counter(df_sfpd['Incident Category'])
counter.most_common(10)
```

```
[('Larceny Theft', 75062),
 ('Other Miscellaneous', 18431),
 ('Non-Criminal', 14886),
 ('Assault', 14323),
 ('Malicious Mischief', 13936),
 ('Burglary', 10617),
 ('Warrant', 8984),
 ('Lost Property', 8796),
 ('Motor Vehicle Theft', 8318),
 ('Fraud', 7063)]
```

And here is the description histogram:

```python
from collections import Counter
counter = Counter(df_sfpd['Incident Description'])
counter.most_common(10)
```

```
[('Theft, From Locked Vehicle, >$950', 32641),
 ('Lost Property', 8796),
 ('Theft, Other Property, $50-$200', 7302),
 ('Battery', 7154),
 ('Malicious Mischief, Vandalism to Property', 6970),
 ('Mental Health Detention', 5960),
 ('Theft, Other Property, >$950', 5702),
 ('Vehicle, Recovered, Auto', 5181),
 ('Vehicle, Stolen, Auto', 4969),
 ('Warrant Arrest, Local SF Warrant', 4588)]
```
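
If you would rather stay inside pandas, `Series.value_counts()` produces the same kind of histogram, sorted from most to least frequent, and it leaves out missing values by default. A minimal sketch on a made-up miniature column (not the real data):

```python
import pandas as pd

# Made-up miniature column, including one missing value.
s = pd.Series(['Larceny Theft', 'Assault', 'Larceny Theft', None,
               'Burglary', 'Larceny Theft', 'Assault'],
              name='Incident Category')

# Counts per distinct value, most frequent first; NaN/None is skipped.
top = s.value_counts().head(2)
print(list(top.items()))
```

On the real data, `df_sfpd['Incident Category'].value_counts().head(10)` would be the analogous call.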

## Word clouds

A more interesting way to visualize differences in term frequency is using a so-called word cloud.  For example, here is a word cloud showing the categories from 2003 to the present.

<img src="figures/SFPD-wordcloud.png" width="400">

Python has a nice library you can use:

```bash
$ pip install wordcloud
```

**Exercise**: In a file called `catcloud.py`, once again get the categories and then create a word cloud object and display it:

```python
from wordcloud import WordCloud
import matplotlib.pyplot as plt
from collections import Counter
import pandas as pd
import sys

df_sfpd = pd.read_csv(sys.argv[1])

# ... delete rows whose 'Incident Category' is NaN ...
categories = ...  # create a Counter object on column 'Incident Category'

wordcloud = WordCloud(width=1800,
                      height=1400,
                      max_words=10000,
                      random_state=1,
                      relative_scaling=0.25)

wordcloud.fit_words(categories)

plt.imshow(wordcloud)
plt.axis("off")
plt.show()
```
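
If you get stuck on the two elided lines, here is one possible way to build the frequencies, shown on a made-up four-row data frame rather than the real file. `fit_words` wants a dictionary mapping each word to its frequency, and a `Counter` is a dictionary, so it can be passed in directly.

```python
from collections import Counter
import pandas as pd

# Made-up stand-in for the real data, with one missing category.
df = pd.DataFrame({'Incident Category':
                   ['Larceny Theft', None, 'Assault', 'Larceny Theft']})

# Drop the rows that have no category, then count what is left.
df = df.dropna(subset=['Incident Category'])
categories = Counter(df['Incident Category'])
print(categories.most_common())
```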

### Which neighborhood is the "worst"?

**Exercise**: Now, pull out the neighborhood and do a word cloud on that in `hoodcloud.py` (it's ok to cut/paste):

<img src="figures/SFPD-hood-wordcloud.png" width="400">

### Crimes per neighborhood


**Exercise**: Filter the CSV file using `grep` from the command line to get just the rows for a particular precinct or neighborhood, such as Mission or South of Market.  Then modify `catcloud.py` to do the same filtering with a pandas query instead.  Pass the hood as an argument (`sys.argv[2]`):

```bash
$ python catcloud.py /tmp/SFPD.csv Mission
```
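
The filtering step might look something like the following sketch. I'm assuming the district lives in a column called `Police District`; that name does not appear in the truncated output above, so check `df_sfpd.columns` for the actual name. `categories_for` is a hypothetical helper and the data frame is made up.

```python
from collections import Counter
import pandas as pd

HOOD_COLUMN = 'Police District'   # assumed column name; verify in your CSV

def categories_for(df, hood, column=HOOD_COLUMN):
    """Count incident categories for the rows matching one hood."""
    rows = df[df[column] == hood]                     # boolean-mask filter
    rows = rows.dropna(subset=['Incident Category'])  # ignore missing categories
    return Counter(rows['Incident Category'])

# Made-up stand-in for the real data.
df = pd.DataFrame({
    'Police District':   ['Mission', 'Southern', 'Mission', 'Mission'],
    'Incident Category': ['Assault', 'Burglary', 'Assault', None],
})

mission = categories_for(df, 'Mission')
print(mission)
```

In `catcloud.py`, the hood would come from `sys.argv[2]` instead of being hard-coded.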

Run the `catcloud.py` script to get an idea of the types of crimes in those two neighborhoods. Here are the crime category clouds for the Mission and SOMA districts:

<table>
    <tr>
        <td><b>Mission</b></td><td><b>SOMA</b></td>
    </tr>
    <tr>
        <td><img src="figures/SFPD-mission-wordcloud.png" width="300"></td><td><img src="figures/SFPD-soma-wordcloud.png" width="300"></td>
    </tr>
 </table>

### Which neighborhood has the most car thefts?

**Exercise**: Modify `hoodcloud.py` to filter for `Motor Vehicle Theft`. Pass the category as an argument (`sys.argv[2]`):

```bash
$ python hoodcloud.py /tmp/SFPD.csv 'Motor Vehicle Theft'
```

<img src="figures/SFPD-car-theft-hood-wordcloud.png" width="300">

Hmm... ok, so parking in the Mission is ok, but SOMA and Bayview/Hunters Point are bad news.
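
Going the other direction (filter on category, count by hood) can be sketched like this. The column name `Analysis Neighborhood` is an assumption on my part since it is not shown in the output earlier, `hood_frequencies` is a hypothetical helper, and the rows are made up.

```python
import pandas as pd

HOOD_COLUMN = 'Analysis Neighborhood'   # assumed column name; verify in your CSV

def hood_frequencies(df, category, column=HOOD_COLUMN):
    """Map each neighborhood to its number of incidents in one category."""
    hoods = df.loc[df['Incident Category'] == category, column]
    return hoods.value_counts().to_dict()   # value_counts skips missing hoods

# Made-up stand-in for the real data.
df = pd.DataFrame({
    'Incident Category': ['Motor Vehicle Theft', 'Assault',
                          'Motor Vehicle Theft', 'Motor Vehicle Theft'],
    'Analysis Neighborhood': ['Bayview Hunters Point', 'Mission',
                              'South of Market', 'Bayview Hunters Point'],
})

hoods = hood_frequencies(df, 'Motor Vehicle Theft')
print(hoods)
```

The resulting dictionary has the word-to-frequency shape that `fit_words` expects.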

If you get stuck in any of these exercises, you can look at the [code associated with these notes](https://github.com/parrt/msds692/tree/master/notes/code/sfpd).