# Initial data exploration for my datasets

More information about this project is available in my GitHub repo: https://github.com/Noah-Baustin/sf_crime_data_analysis

In [1]:
#import modules
import pandas as pd
#import altair as alt



In [2]:
# import csv into a variable
historical_data = pd.read_csv('raw_data/SFPD_Incident_Reports_2003-May2018/Police_Department_Incident_Reports__Historical_2003_to_May_2018.csv', dtype=str)

Let's take a look at our data:

In [3]:
historical_data

Unnamed: 0,PdId,IncidntNum,Incident Code,Category,Descript,DayOfWeek,Date,Time,PdDistrict,Resolution,...,Fix It Zones as of 2017-11-06 2 2,DELETE - HSOC Zones 2 2,Fix It Zones as of 2018-02-07 2 2,"CBD, BID and GBD Boundaries as of 2017 2 2","Areas of Vulnerability, 2016 2 2",Central Market/Tenderloin Boundary 2 2,Central Market/Tenderloin Boundary Polygon - Updated 2 2,HSOC Zones as of 2018-06-05 2 2,OWED Public Spaces 2 2,Neighborhoods 2
0,3114751606302,031147516,06302,LARCENY/THEFT,PETTY THEFT FROM A BUILDING,Sunday,09/28/2003,10:00,SOUTHERN,NONE,...,,,,,,,,,,
1,5069701104134,050697011,04134,ASSAULT,BATTERY,Wednesday,06/22/2005,12:20,NORTHERN,NONE,...,,,,,2,,,,,97
2,6074729204104,060747292,04104,ASSAULT,ASSAULT,Saturday,07/15/2006,00:55,CENTRAL,NONE,...,,,,,2,,,,,106
3,7103536315201,071035363,15201,ASSAULT,STALKING,Tuesday,09/25/2007,00:01,TARAVAL,NONE,...,,,,,1,,,,,49
4,11082415274000,110824152,74000,MISSING PERSON,MISSING ADULT,Saturday,09/24/2011,11:00,TARAVAL,LOCATED,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2129520,16099543005043,160995430,05043,BURGLARY,"BURGLARY OF RESIDENCE, UNLAWFUL ENTRY",Thursday,12/01/2016,17:30,NORTHERN,NONE,...,,1,,,1,,,1,,21
2129521,16093783009031,160937830,09031,OTHER OFFENSES,"MONEY, PROPERTY OR LABOR, FRAUDULENTLY OBTAINING",Monday,11/14/2016,00:01,INGLESIDE,NONE,...,,,,,2,,,,,58
2129522,17078486916110,170784869,16110,DRUG/NARCOTIC,POSSESSION OF HEROIN FOR SALES,Tuesday,09/26/2017,01:38,CENTRAL,"ARREST, BOOKED",...,23,,23,1,2,,,,,106
2129523,18014823304138,180148233,04138,ASSAULT,"BATTERY, FORMER SPOUSE OR DATING RELATIONSHIP",Saturday,02/24/2018,20:30,CENTRAL,"ARREST, BOOKED",...,,,,,2,,,,,108


In [4]:
historical_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2129525 entries, 0 to 2129524
Data columns (total 35 columns):
 #   Column                                                    Dtype 
---  ------                                                    ----- 
 0   PdId                                                      object
 1   IncidntNum                                                object
 2   Incident Code                                             object
 3   Category                                                  object
 4   Descript                                                  object
 5   DayOfWeek                                                 object
 6   Date                                                      object
 7   Time                                                      object
 8   PdDistrict                                                object
 9   Resolution                                                object
 10  Address                                   

I also see that some date columns are stored as strings and need to be converted to datetimes, 'Date' for sure. Let's convert that one:

In [5]:
historical_data['Date'] = pd.to_datetime(historical_data['Date'])
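A side note on the conversion above: the raw values look like `09/28/2003`, so passing an explicit `format` should avoid any day/month guessing. Here's a minimal sketch on toy rows copied from the table above. The combined `Datetime` column is a hypothetical name of mine, not something in the dataset.

```python
import pandas as pd

# Toy stand-in for historical_data, using two Date/Time pairs from the table above.
toy = pd.DataFrame({
    'Date': ['09/28/2003', '06/22/2005'],
    'Time': ['10:00', '12:20'],
})

# Build a combined timestamp first, while 'Date' is still a string.
# 'Datetime' is a hypothetical column name used only for this sketch.
toy['Datetime'] = pd.to_datetime(
    toy['Date'] + ' ' + toy['Time'], format='%m/%d/%Y %H:%M'
)

# Then convert 'Date' itself, spelling out the month/day/year layout.
toy['Date'] = pd.to_datetime(toy['Date'], format='%m/%d/%Y')

print(toy.dtypes)
```

If a value doesn't match the format, `pd.to_datetime` raises an error by default, which is probably what we want at this stage.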

In [6]:
historical_data.head()

Unnamed: 0,PdId,IncidntNum,Incident Code,Category,Descript,DayOfWeek,Date,Time,PdDistrict,Resolution,...,Fix It Zones as of 2017-11-06 2 2,DELETE - HSOC Zones 2 2,Fix It Zones as of 2018-02-07 2 2,"CBD, BID and GBD Boundaries as of 2017 2 2","Areas of Vulnerability, 2016 2 2",Central Market/Tenderloin Boundary 2 2,Central Market/Tenderloin Boundary Polygon - Updated 2 2,HSOC Zones as of 2018-06-05 2 2,OWED Public Spaces 2 2,Neighborhoods 2
0,3114751606302,31147516,6302,LARCENY/THEFT,PETTY THEFT FROM A BUILDING,Sunday,2003-09-28,10:00,SOUTHERN,NONE,...,,,,,,,,,,
1,5069701104134,50697011,4134,ASSAULT,BATTERY,Wednesday,2005-06-22,12:20,NORTHERN,NONE,...,,,,,2.0,,,,,97.0
2,6074729204104,60747292,4104,ASSAULT,ASSAULT,Saturday,2006-07-15,00:55,CENTRAL,NONE,...,,,,,2.0,,,,,106.0
3,7103536315201,71035363,15201,ASSAULT,STALKING,Tuesday,2007-09-25,00:01,TARAVAL,NONE,...,,,,,1.0,,,,,49.0
4,11082415274000,110824152,74000,MISSING PERSON,MISSING ADULT,Saturday,2011-09-24,11:00,TARAVAL,LOCATED,...,,,,,,,,,,


Now that our dates are formatted correctly, let's take a look at our date range included in the data:

In [7]:
historical_data['Date'].min()

Timestamp('2003-01-01 00:00:00')

In [8]:
historical_data['Date'].max()

Timestamp('2018-05-15 00:00:00')

It looks like we most likely have complete data beginning in 2003. But the 2018 data stops in mid-May, so if we want to do an annual analysis, we'll need to exclude 2018.
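Here's a rough sketch of how that exclusion could look, using a toy frame instead of the full dataset. The rows are made up for illustration, and it assumes 'Date' has already been converted to datetimes.

```python
import pandas as pd

# Toy stand-in for historical_data; only the 'Date' column matters here.
toy = pd.DataFrame({
    'IncidntNum': ['030000001', '170000002', '180000003'],  # made-up case numbers
    'Date': pd.to_datetime(
        ['01/01/2003', '09/26/2017', '05/15/2018'], format='%m/%d/%Y'
    ),
})

# Keep only the years that are fully covered (2003 through 2017).
full_years = toy[toy['Date'].dt.year < 2018].copy()

# Count rows per calendar year, in chronological order.
rows_per_year = full_years['Date'].dt.year.value_counts().sort_index()
print(rows_per_year)
```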

We've got 2,129,525 entries. Let's check out our three different columns that have incident identification codes to see how many unique values are in each:

In [9]:
historical_data['IncidntNum'].nunique()

1703626

In [15]:
historical_data['Incident Code'].nunique()

885

In [16]:
historical_data['PdId'].nunique()

2129525

Our documentation tells us that the PdId column is equivalent to the row_id column in the new data (see downloaded pdf titled: Change Notice - Police Incident Reports). And the documentation for the newer dataset tells us that the row_id is the unique identifier for each row. 

It's a good sign that there are exactly as many unique PdId values as there are rows in the dataset.
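The same check can be written as a one-liner instead of comparing two numbers by eye. This is just a sketch on toy rows taken from tables in this notebook, not the full data.

```python
import pandas as pd

# Toy rows: every PdId is different, but one IncidntNum repeats.
toy = pd.DataFrame({
    'PdId': ['6238907023', '6238907043', '12343616010'],
    'IncidntNum': ['62389', '62389', '123436'],
})

# A column works as a row identifier if it has no repeats and no missing values.
pdid_is_key = bool(toy['PdId'].is_unique and toy['PdId'].notna().all())
incidntnum_is_key = bool(toy['IncidntNum'].is_unique)

print(pdid_is_key, incidntnum_is_key)
```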

It also makes sense that there are duplicate values in the IncidntNum column. The IncidntNum refers to the case number. So if a supplemental report was filed after the incident was initially entered in this dataset, it would show up with a new PdId BUT the IncidntNum would be the same. 

But that does mean that I need to make sure I'm not counting the same incident multiple times if it shows up in this dataset multiple times. Most likely, my analysis will focus on a unique set of IncidntNum. 
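One possible way to get down to one row per incident is `drop_duplicates` on the case number. This is only a sketch: it keeps whichever row happens to come first and throws away the other rows' 'Descript' values, so it's fine for counting incidents but not for every kind of analysis.

```python
import pandas as pd

# Toy rows based on values shown in this notebook's tables.
toy = pd.DataFrame({
    'PdId': ['6238907023', '6238907043', '12343616010', '3114751606302'],
    'IncidntNum': ['62389', '62389', '123436', '031147516'],
    'Descript': [
        'STOLEN MOTORCYCLE',
        'VEHICLE, RECOVERED, MOTORCYCLE',
        'POSSESSION OF MARIJUANA',
        'PETTY THEFT FROM A BUILDING',
    ],
})

# Keep the first row seen for each case number.
unique_incidents = toy.drop_duplicates(subset='IncidntNum', keep='first')

print(len(toy), len(unique_incidents))
```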

Let's double check that there are no duplicate rows:

In [19]:
historical_data[historical_data.duplicated()]

Unnamed: 0,PdId,IncidntNum,Incident Code,Category,Descript,DayOfWeek,Date,Time,PdDistrict,Resolution,...,Fix It Zones as of 2017-11-06 2 2,DELETE - HSOC Zones 2 2,Fix It Zones as of 2018-02-07 2 2,"CBD, BID and GBD Boundaries as of 2017 2 2","Areas of Vulnerability, 2016 2 2",Central Market/Tenderloin Boundary 2 2,Central Market/Tenderloin Boundary Polygon - Updated 2 2,HSOC Zones as of 2018-06-05 2 2,OWED Public Spaces 2 2,Neighborhoods 2


No duplicate rows, that's great.

Let's find out how many duplicated IncidntNum values we have:

In [20]:
historical_data[historical_data['IncidntNum'].duplicated()]

Unnamed: 0,PdId,IncidntNum,Incident Code,Category,Descript,DayOfWeek,Date,Time,PdDistrict,Resolution,...,Fix It Zones as of 2017-11-06 2 2,DELETE - HSOC Zones 2 2,Fix It Zones as of 2018-02-07 2 2,"CBD, BID and GBD Boundaries as of 2017 2 2","Areas of Vulnerability, 2016 2 2",Central Market/Tenderloin Boundary 2 2,Central Market/Tenderloin Boundary Polygon - Updated 2 2,HSOC Zones as of 2018-06-05 2 2,OWED Public Spaces 2 2,Neighborhoods 2
523,17001676971010,170016769,71010,OTHER OFFENSES,LOST/STOLEN LICENSE PLATE,Thursday,2017-01-05,18:00,BAYVIEW,NONE,...,,,,,2,,,,,86
562,18023965028150,180239650,28150,VANDALISM,"MALICIOUS MISCHIEF, VANDALISM",Saturday,2018-03-31,14:30,TARAVAL,NONE,...,,,,,2,,,,,43
857,17018698663010,170186986,63010,WARRANTS,WARRANT ARREST,Monday,2017-03-06,16:00,TENDERLOIN,"ARREST, BOOKED",...,18,,18,6,2,1,1,,,20
968,17063457109320,170634571,09320,FRAUD,"CREDIT CARD, THEFT BY USE OF",Friday,2017-08-04,16:00,CENTRAL,NONE,...,12,,12,4,2,,,,35,19
1179,16096494626135,160964946,26135,ASSAULT,UNLAWFUL DISSUADING/THREATENING OF A WITNESS,Sunday,2016-11-27,14:29,BAYVIEW,"ARREST, BOOKED",...,,,,,2,,,,,78
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2129517,16092245574000,160922455,74000,MISSING PERSON,MISSING ADULT,Saturday,2016-11-12,15:08,SOUTHERN,NONE,...,,,,,2,,,,,32
2129519,18014350616652,180143506,16652,DRUG/NARCOTIC,POSSESSION OF METH-AMPHETAMINE FOR SALE,Thursday,2018-02-22,23:27,CENTRAL,"ARREST, BOOKED",...,12,,12,5,2,,,,,19
2129521,16093783009031,160937830,09031,OTHER OFFENSES,"MONEY, PROPERTY OR LABOR, FRAUDULENTLY OBTAINING",Monday,2016-11-14,00:01,INGLESIDE,NONE,...,,,,,2,,,,,58
2129522,17078486916110,170784869,16110,DRUG/NARCOTIC,POSSESSION OF HEROIN FOR SALES,Tuesday,2017-09-26,01:38,CENTRAL,"ARREST, BOOKED",...,23,,23,1,2,,,,,106


In [21]:
1703626 + 425899

2129525

There are 425,899 duplicates for 'IncidntNum'. When we add that to the number of unique values (above), it equals the number of rows in our dataset. That tells us that 1,703,626 is the actual number of incidents that we're working with.
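The arithmetic above can also be done by pandas directly, which avoids typing the numbers in by hand. A small sketch on toy rows, assuming the column has no missing values (`nunique` skips them, which would throw the sum off):

```python
import pandas as pd

# Toy rows: two case numbers appear twice, one appears once.
toy = pd.DataFrame({
    'PdId': ['6238907023', '6238907043', '12343616010',
             '12343602004', '3114751606302'],
    'IncidntNum': ['62389', '62389', '123436', '123436', '031147516'],
})

n_rows = len(toy)
n_unique = toy['IncidntNum'].nunique()

# duplicated() flags every occurrence after the first one, so the sum
# is the number of "extra" rows beyond one per case number.
n_repeats = int(toy['IncidntNum'].duplicated().sum())

print(n_rows, n_unique, n_repeats)
```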

Let's take a look at some of our duplicates:

First we'll create a dataframe with our duplicated cases:

In [29]:
dupe_cases = historical_data[historical_data['IncidntNum'].duplicated()].copy()

Now we create a `list` of the IncidntNum values from those duplicated rows (a value can appear more than once in the list, which is fine for the `isin` lookup below):

In [40]:
dupe_IncidentNum = dupe_cases['IncidntNum'].to_list()

Finally, we're displaying a subset of our original data that just includes the duplicate incident numbers:

In [43]:
dupe_cases_full = historical_data[historical_data['IncidntNum'].isin(dupe_IncidentNum)].sort_values(by='IncidntNum')
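As far as I can tell, the three steps above can be collapsed into one by passing `keep=False` to `duplicated()`, which flags every row whose case number appears more than once (including the first occurrence). Here's a sketch on toy rows comparing the two approaches:

```python
import pandas as pd

# Toy rows: two repeated case numbers and one that appears only once.
toy = pd.DataFrame({
    'PdId': ['6238907023', '6238907043', '12343616010',
             '12343602004', '3114751606302'],
    'IncidntNum': ['62389', '62389', '123436', '123436', '031147516'],
})

# Approach from the cells above: collect repeated case numbers, then look them up.
dupe_nums = toy.loc[toy['IncidntNum'].duplicated(), 'IncidntNum'].to_list()
two_step = toy[toy['IncidntNum'].isin(dupe_nums)].sort_values(by='IncidntNum')

# One-step version: keep=False marks all rows that share a case number.
one_step = toy[toy['IncidntNum'].duplicated(keep=False)].sort_values(by='IncidntNum')

print(len(two_step), len(one_step))
```

Both versions select the same rows here; the one-step version just skips the intermediate list.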

In [44]:
dupe_cases_full.head(30)

Unnamed: 0,PdId,IncidntNum,Incident Code,Category,Descript,DayOfWeek,Date,Time,PdDistrict,Resolution,...,Fix It Zones as of 2017-11-06 2 2,DELETE - HSOC Zones 2 2,Fix It Zones as of 2018-02-07 2 2,"CBD, BID and GBD Boundaries as of 2017 2 2","Areas of Vulnerability, 2016 2 2",Central Market/Tenderloin Boundary 2 2,Central Market/Tenderloin Boundary Polygon - Updated 2 2,HSOC Zones as of 2018-06-05 2 2,OWED Public Spaces 2 2,Neighborhoods 2
1021616,6238907023,62389,7023,VEHICLE THEFT,STOLEN MOTORCYCLE,Tuesday,2004-03-30,11:35,MISSION,NONE,...,,3.0,,,1,,,3.0,,53
1869704,6238907043,62389,7043,VEHICLE THEFT,"VEHICLE, RECOVERED, MOTORCYCLE",Tuesday,2004-03-30,11:35,MISSION,NONE,...,,3.0,,,1,,,3.0,,53
1003273,12343616010,123436,16010,DRUG/NARCOTIC,POSSESSION OF MARIJUANA,Friday,2004-07-02,14:08,SOUTHERN,"ARREST, BOOKED",...,,,,,2,,,,,32
625462,12343602004,123436,2004,"SEX OFFENSES, FORCIBLE","FORCIBLE RAPE, BODILY FORCE",Friday,2004-07-02,14:08,SOUTHERN,"ARREST, BOOKED",...,,,,,2,,,,,32
1836290,24825063010,248250,63010,WARRANTS,WARRANT ARREST,Monday,2003-01-27,12:00,SOUTHERN,"ARREST, BOOKED",...,,,,,2,,,,,32
1009328,24825062050,248250,62050,WARRANTS,ENROUTE TO OUTSIDE JURISDICTION,Monday,2003-01-27,12:00,SOUTHERN,"ARREST, BOOKED",...,,,,,2,,,,,32
1554935,32468419090,324684,19090,DRUNKENNESS,UNDER INFLUENCE OF ALCOHOL IN A PUBLIC PLACE,Monday,2003-03-31,22:08,NORTHERN,"ARREST, BOOKED",...,9.0,,9.0,11.0,2,,,,,103
1297782,32468463010,324684,63010,WARRANTS,WARRANT ARREST,Monday,2003-03-31,22:08,NORTHERN,"ARREST, BOOKED",...,9.0,,9.0,11.0,2,,,,,103
1434901,36322862050,363228,62050,WARRANTS,ENROUTE TO OUTSIDE JURISDICTION,Wednesday,2004-11-03,10:45,TENDERLOIN,"ARREST, BOOKED",...,18.0,1.0,18.0,6.0,2,1.0,1.0,1.0,,20
1029718,36322863010,363228,63010,WARRANTS,WARRANT ARREST,Wednesday,2004-11-03,10:45,TENDERLOIN,"ARREST, BOOKED",...,18.0,1.0,18.0,6.0,2,1.0,1.0,1.0,,20


We can see here anecdotally that the additional entries for a duplicated IncidntNum sometimes have a different value in the 'Descript' column, but sometimes it stays the same. 
#### come back to this!
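When I do come back to this, one way to move past the anecdote might be to count how many distinct 'Descript' values each repeated case number has. This is a sketch on toy rows. The first two case numbers are from the table above, and `999999` is a made-up case where the description repeats.

```python
import pandas as pd

toy = pd.DataFrame({
    'IncidntNum': ['62389', '62389', '248250', '248250', '999999', '999999'],
    'Descript': [
        'STOLEN MOTORCYCLE',
        'VEHICLE, RECOVERED, MOTORCYCLE',
        'WARRANT ARREST',
        'ENROUTE TO OUTSIDE JURISDICTION',
        'BATTERY',   # made-up pair with an identical description
        'BATTERY',
    ],
})

# Only look at case numbers that appear more than once.
repeated = toy[toy['IncidntNum'].duplicated(keep=False)]

# Number of distinct descriptions per case number.
descript_per_incident = repeated.groupby('IncidntNum')['Descript'].nunique()

n_same = int((descript_per_incident == 1).sum())   # description never changes
n_diff = int((descript_per_incident > 1).sum())    # description varies

print(n_same, n_diff)
```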

We're going to want to isolate our incidents that include marijuana crimes. So let's take a look at the unique values in the two columns that might contain information about marijuana crimes:

In [45]:
historical_data['Category'].unique()

array(['LARCENY/THEFT', 'ASSAULT', 'MISSING PERSON', 'VEHICLE THEFT',
       'BURGLARY', 'DRUG/NARCOTIC', 'DRIVING UNDER THE INFLUENCE',
       'VANDALISM', 'OTHER OFFENSES', 'DRUNKENNESS', 'NON-CRIMINAL',
       'ROBBERY', 'SUSPICIOUS OCC', 'TRESPASS', 'WARRANTS',
       'FORGERY/COUNTERFEITING', 'STOLEN PROPERTY',
       'SEX OFFENSES, FORCIBLE', 'FRAUD', 'SECONDARY CODES',
       'PROSTITUTION', 'RECOVERED VEHICLE', 'BRIBERY', 'ARSON',
       'DISORDERLY CONDUCT', 'WEAPON LAWS', 'LIQUOR LAWS', 'EXTORTION',
       'SUICIDE', 'KIDNAPPING', 'SEX OFFENSES, NON FORCIBLE',
       'BAD CHECKS', 'EMBEZZLEMENT', 'LOITERING', 'GAMBLING', 'TREA',
       'PORNOGRAPHY/OBSCENE MAT'], dtype=object)

We can see here that there's a DRUG/NARCOTIC category, so that's probably where we're going to find the marijuana crimes. But nothing specific about marijuana here. That's going to show up in our 'Descript' column. There are too many unique values in that column to list, so let's create a subset:

In [25]:
historical_data_marijuana = historical_data[
    # na=False treats a missing 'Descript' as "no match" rather than leaving NaN in the mask
    historical_data['Descript'].str.contains('MARIJUANA', na=False)
].copy()
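A quick way to sanity-check a subset like this is to list the distinct descriptions it picked up, and to count unique case numbers rather than rows. Here's a sketch on toy rows. The `SALE OF MARIJUANA` label and the `000000001`/`000000002` case numbers are made up for illustration, and `na=False` just makes a missing description count as "no match".

```python
import pandas as pd

toy = pd.DataFrame({
    'IncidntNum': ['123436', '123436', '170784869', '000000001', '000000002'],
    'Descript': [
        'POSSESSION OF MARIJUANA',
        'FORCIBLE RAPE, BODILY FORCE',
        'POSSESSION OF HEROIN FOR SALES',
        'SALE OF MARIJUANA',   # made-up label for illustration
        None,                  # missing description
    ],
})

# Same filter as above, applied to the toy frame.
marijuana = toy[toy['Descript'].str.contains('MARIJUANA', na=False)].copy()

# Which descriptions matched, and how often?
descript_counts = marijuana['Descript'].value_counts()

# Count incidents, not rows, since a case number can appear more than once.
n_incidents = marijuana['IncidntNum'].nunique()

print(descript_counts)
print(n_incidents)
```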

I also see that there are a bunch of columns in the data that are not included in the data dictionary on the city site where we got the data: https://data.sfgov.org/Public-Safety/Police-Department-Incident-Reports-Historical-2003/tmnf-yvry

#### Eventually: will want to get rid of these spare columns!!
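When I get to that, one approach might be to compare the dataframe's columns against a list typed up from the data dictionary. In this sketch `documented_columns` is a hypothetical, deliberately short list and not the real dictionary, and the toy frame only has a few of the actual columns.

```python
import pandas as pd

# Toy frame with a few real column names from the tables above.
toy = pd.DataFrame({
    'PdId': ['3114751606302'],
    'IncidntNum': ['031147516'],
    'Category': ['LARCENY/THEFT'],
    'OWED Public Spaces 2 2': [None],
    'Neighborhoods 2': [None],
})

# Hypothetical list; the real one would be copied from the data dictionary.
documented_columns = ['PdId', 'IncidntNum', 'Category']

# Columns present in the data but missing from the documented list.
spare_columns = [col for col in toy.columns if col not in documented_columns]

# Drop them to get a trimmed frame.
trimmed = toy.drop(columns=spare_columns)

print(spare_columns)
print(list(trimmed.columns))
```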