# Initial data exploration for my datasets

# Note: this notebook is just for Noah's purposes and is not tidied up

More information about this project is available in my GitHub repo: https://github.com/Noah-Baustin/sf_crime_data_analysis

In [2]:
#import modules
import pandas as pd
#import altair as alt



In [None]:
# import csv into a variable
historical_data = pd.read_csv('raw_data/SFPD_Incident_Reports_2003-May2018/Police_Department_Incident_Reports__Historical_2003_to_May_2018.csv', dtype=str)

Let's take a look at our data:

In [None]:
historical_data

In [None]:
historical_data.info()

I also see that some date columns need to be reformatted, starting with 'Date'. Let's convert that one to a datetime type:

In [None]:
historical_data['Date'] = pd.to_datetime(historical_data['Date'])
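As a small sketch of what this conversion does, here is the same call on made-up date strings. The `MM/DD/YYYY` format is an assumption for illustration and may not match the real file. Passing an explicit `format` avoids ambiguous parsing, and `errors='coerce'` turns unparseable values into `NaT` instead of raising an error.

```python
import pandas as pd

# Toy stand-in for the 'Date' column (all strings, as with dtype=str)
sample = pd.DataFrame({'Date': ['01/15/2003', '05/14/2018', 'not a date']})

# Explicit format (assumed here); strings that do not match become NaT
sample['Date'] = pd.to_datetime(sample['Date'], format='%m/%d/%Y', errors='coerce')

# Count how many values could not be parsed
print(sample['Date'].isna().sum())  # 1
print(sample['Date'].min())
```

Checking the count of `NaT` values after the conversion is a quick way to spot rows with unexpected date formats.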

In [None]:
historical_data.head()

Let's take a closer look at our columns:

In [None]:
historical_data.columns

We can see here that there are a bunch of extra columns in the data that aren't included in the documentation and aren't essential to our analysis. Let's get rid of those columns.

In [None]:
#get rid of all those extra columns we don't need

historical_data = historical_data[['PdId', 'IncidntNum', 'Incident Code', 'Category', 'Descript',
       'DayOfWeek', 'Date', 'Time', 'PdDistrict', 'Resolution', 'Address', 'X',
       'Y', 'location']].copy()

In [None]:
historical_data.columns

Now that our columns are cleaned up and our dates are formatted correctly, let's take a look at our date range included in the data:

In [None]:
historical_data['Date'].min()

In [None]:
historical_data['Date'].max()

We can see here that it appears that we most likely have complete data beginning in 2003. BUT we see here that the 2018 data is incomplete, so if we want to do an annual analysis, we'll need to exclude 2018.
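One way to exclude the partial year could look like the sketch below. It uses a toy dataframe in place of `historical_data`; the names `toy`, `full_years` and `per_year` are made up for illustration.

```python
import pandas as pd

# Toy data standing in for historical_data (dates already parsed)
toy = pd.DataFrame({
    'IncidntNum': ['a', 'b', 'c', 'd'],
    'Date': pd.to_datetime(['2003-01-01', '2017-12-31', '2018-01-01', '2018-05-15']),
})

# Keep only full calendar years: everything before 2018
full_years = toy[toy['Date'].dt.year < 2018].copy()

# Rows per year, in chronological order
per_year = full_years['Date'].dt.year.value_counts().sort_index()
print(per_year)
```

Filtering on `.dt.year` only works after the column has been converted with `pd.to_datetime`.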

We've got 2,129,525 entries. Let's check out our three different columns that have incident identification codes to see how many unique values are in each:

In [None]:
historical_data['IncidntNum'].nunique()

In [None]:
historical_data['Incident Code'].nunique()

In [None]:
historical_data['PdId'].nunique()

Our documentation tells us that the PdId column is equivalent to the row_id column in the new data (see downloaded pdf titled: Change Notice - Police Incident Reports). And the documentation for the newer dataset tells us that the row_id is the unique identifier for each row.

It's a good sign that there are exactly as many unique PdId values as there are rows in the dataset.
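A quick way to confirm that a column works as a row identifier is the `is_unique` attribute. This is a sketch on a toy dataframe, not the real data:

```python
import pandas as pd

# Toy data: PdId differs on every row, IncidntNum repeats
toy = pd.DataFrame({
    'PdId': ['1', '2', '3'],
    'IncidntNum': ['A', 'A', 'B'],
})

print(toy['PdId'].is_unique)        # True
print(toy['IncidntNum'].is_unique)  # False
```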

It also makes sense that there are duplicate values in the IncidntNum column. The IncidntNum refers to the case number. So if a supplemental report was filed after the incident was initially entered in this dataset, it would show up with a new PdId BUT the IncidntNum would be the same. 

But that does mean that I need to make sure I'm not counting the same incident multiple times if it shows up in this dataset multiple times. Most likely, my analysis will focus on a unique set of IncidntNum. 

Let's double check that there's no duplicate rows:

In [None]:
historical_data[historical_data.duplicated()]

No duplicate rows, that's great.

Let's find out how many duplicated IncidntNum values we have:

In [None]:
historical_data[historical_data['IncidntNum'].duplicated()]

In [None]:
historical_data['IncidntNum'].nunique() + len(historical_data[historical_data['IncidntNum'].duplicated()])

There are 425,899 duplicates for 'IncidntNum'. When we add that to the number of unique values (above), it equals the number of rows in our dataset. That tells us that 1,703,626 is the actual number of incidents that we're working with.
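To make the arithmetic concrete, here is a toy version of the same check (made-up values). `duplicated()` flags every row whose incident number has already appeared earlier in the file, so each distinct incident contributes one unflagged row and the two counts add up to the row total. This holds as long as the column has no missing values. The real numbers work the same way: 1,703,626 + 425,899 = 2,129,525.

```python
import pandas as pd

# Toy data: 5 rows, 3 distinct incident numbers
toy = pd.DataFrame({
    'PdId': ['1', '2', '3', '4', '5'],
    'IncidntNum': ['A', 'A', 'B', 'C', 'C'],
})

n_rows = len(toy)                                  # 5 rows in total
n_unique = toy['IncidntNum'].nunique()             # 3 distinct incidents
n_repeats = toy['IncidntNum'].duplicated().sum()   # 2 rows repeat an earlier number

print(n_unique, n_repeats, n_rows)  # 3 2 5
```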

#### NOTE: explain this step better b/c confusion in Soo meeting

Let's take a look at some of our duplicates:

First we'll create a dataframe with our duplicated cases:

In [None]:
dupe_cases = historical_data[historical_data['IncidntNum'].duplicated()].copy()

Now we create a `list` of the IncidntNum values from those duplicated rows (note these are incident numbers, not PdIds):

In [None]:
dupe_IncidentNum = dupe_cases['IncidntNum'].to_list()

Finally, we're displaying a subset of our original data that includes just the rows with duplicated incident numbers:

In [None]:
dupe_cases_full = historical_data[historical_data['IncidntNum'].isin(dupe_IncidentNum)].sort_values(by='IncidntNum')

In [None]:
dupe_cases_full.head(30)
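A possible shortcut for the three steps above: `duplicated(keep=False)` flags every row whose incident number appears more than once, including the first occurrence, so no intermediate list is needed. This is a sketch on toy data:

```python
import pandas as pd

# Toy data: 'A' and 'C' each appear twice, 'B' once
toy = pd.DataFrame({
    'PdId': ['1', '2', '3', '4', '5'],
    'IncidntNum': ['A', 'A', 'B', 'C', 'C'],
    'Descript': ['x', 'y', 'z', 'w', 'w'],
})

# keep=False marks all members of each duplicated group, not just the repeats
all_dupes = toy[toy['IncidntNum'].duplicated(keep=False)].sort_values(by='IncidntNum')
print(all_dupes)
```

The result should contain the same rows as the `isin` approach, since both select every row whose incident number occurs more than once.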

We can see here anecdotally that the extra rows for a duplicated IncidntNum sometimes carry a different entry in the Descript column, and sometimes the entry stays the same.
#### come back to this!

We're going to want to isolate our incidents that include marijuana crimes. So let's take a look at the unique values in the two columns that might contain information about marijuana crimes:

In [None]:
historical_data['Category'].unique()

We can see here that there's a DRUG/NARCOTIC category, so that's probably where we'll find the marijuana crimes, but nothing here is specific to marijuana. That detail shows up in our 'Descript' column. There are too many unique values in that column to list, so let's create a subset:

In [None]:
#create dataframe with all our marijuana incidents

historical_data_marijuana = historical_data[
    # na=False treats missing descriptions as non-matches instead of raising an error
    historical_data['Descript'].str.contains('MARIJUANA', na=False)
].copy()

## Soo meeting notes

In [None]:
drug_narcotic_incidents = historical_data[
    historical_data['Category'] == 'DRUG/NARCOTIC'
].reset_index(drop=True)

In [None]:
# needed if I decide to answer my 'extra' question: Compare marijuana arrests to other types of crimes, like narcotics.

drug_narcotic_incidents.to_csv('drug_narcotic_incidents_historical.csv', index=False)

In [None]:
drug_narcotic_incidents.head()

In [None]:
#how to isolate the marijuana duplicate incidentnum's
historical_data[
    historical_data['Descript'].str.contains('MARIJUANA', na=False)
    & historical_data['IncidntNum'].isin(dupe_IncidentNum)
].sort_values(by='IncidntNum')

In [None]:
historical_data[historical_data['IncidntNum'] == "000123436"]

In [None]:
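#show info about the first row in the dataframe [iloc means index location]
#incident_duplicate holds the rows for one incident number (same lookup as the previous cell)
incident_duplicate = historical_data[historical_data['IncidntNum'] == "000123436"]
incident_duplicate.iloc[0]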

In [None]:
#what unique values are in this marijuana data frame for types of arrests
historical_data_marijuana['Descript'].unique()

In [None]:
#show the rows that repeat an incident number (candidates for dropping later)
#going to need to write an explainer
historical_data_marijuana[historical_data_marijuana['IncidntNum'].duplicated()]
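If the goal is one row per incident, `drop_duplicates` with a `subset` is one option. The sketch below uses toy data with made-up descriptions. Which row to keep (`keep='first'` here) is a judgment call that this notebook has not settled yet.

```python
import pandas as pd

# Toy data: incident 'A' has an initial row and a later row
toy = pd.DataFrame({
    'PdId': ['1', '2', '3'],
    'IncidntNum': ['A', 'A', 'B'],
    'Descript': ['initial entry', 'later entry', 'initial entry'],
})

# One row per incident number; keep='first' keeps the earliest row in file order
one_per_incident = toy.drop_duplicates(subset='IncidntNum', keep='first')
print(len(one_per_incident))  # 2
```

Because the later rows for an incident can carry a different description, dropping them may discard information, so it is worth checking a few cases by hand first.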

In [None]:
historical_data_marijuana[historical_data_marijuana['IncidntNum'] == '160676737']

In [None]:
# export the df to a csv
historical_data_marijuana.to_csv("historical_data_marijuana.csv", index=False)

### Bring in the more recent dataset

In [3]:
# import csv into a variable
newer_data = pd.read_csv('raw_data/SFPD_Incident_Reports_2018-10.14.21/Police_Department_Incident_Reports__2018_to_Present(1).csv', dtype=str)

In [None]:
newer_data.columns

In [None]:
historical_data.columns

In [None]:
historical_data['Resolution'].unique()

In [4]:
newer_data.columns

Index(['Incident Datetime', 'Incident Date', 'Incident Time', 'Incident Year',
       'Incident Day of Week', 'Report Datetime', 'Row ID', 'Incident ID',
       'Incident Number', 'CAD Number', 'Report Type Code',
       'Report Type Description', 'Filed Online', 'Incident Code',
       'Incident Category', 'Incident Subcategory', 'Incident Description',
       'Resolution', 'Intersection', 'CNN', 'Police District',
       'Analysis Neighborhood', 'Supervisor District', 'Latitude', 'Longitude',
       'Point', 'Neighborhoods', 'ESNCAG - Boundary File',
       'Central Market/Tenderloin Boundary Polygon - Updated',
       'Civic Center Harm Reduction Project Boundary',
       'HSOC Zones as of 2018-06-05', 'Invest In Neighborhoods (IIN) Areas',
       'Current Supervisor Districts', 'Current Police Districts'],
      dtype='object')
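To line the two datasets up, one option is a rename mapping. In the sketch below, only the `PdId` to `Row ID` pairing comes from the change notice mentioned earlier. The other pairings are guesses based on the column names and should be checked against the documentation before use. The dataframe `old` and the list `new_columns` are toy stand-ins.

```python
import pandas as pd

# Hypothetical mapping from historical column names to newer ones
column_map = {
    'PdId': 'Row ID',                    # stated in the change notice
    'IncidntNum': 'Incident Number',     # assumed from the name
    'Category': 'Incident Category',     # assumed from the name
    'Descript': 'Incident Description',  # assumed from the name
}

# Toy one-row frame with a few historical column names
old = pd.DataFrame({
    'PdId': ['1'], 'IncidntNum': ['A'], 'Category': ['X'],
    'Descript': ['Y'], 'Time': ['12:00'],
})
renamed = old.rename(columns=column_map)

# A few of the newer column names, copied from the output above
new_columns = ['Row ID', 'Incident Number', 'Incident Category',
               'Incident Description', 'Resolution']

# Columns that now exist under the same name in both datasets
shared = [c for c in renamed.columns if c in new_columns]
print(shared)
```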

In [54]:
newer_data['Incident Description'] = newer_data['Incident Description'].str.upper()

Now we need to figure out where the marijuana cases are organized in the newer dataset

In [55]:
find_marijuana_1 = newer_data[
    newer_data['Incident Description'].str.contains('MARIJUANA')
].reset_index(drop=True)

In [56]:
find_marijuana_1

Unnamed: 0,Incident Datetime,Incident Date,Incident Time,Incident Year,Incident Day of Week,Report Datetime,Row ID,Incident ID,Incident Number,CAD Number,...,Longitude,Point,Neighborhoods,ESNCAG - Boundary File,Central Market/Tenderloin Boundary Polygon - Updated,Civic Center Harm Reduction Project Boundary,HSOC Zones as of 2018-06-05,Invest In Neighborhoods (IIN) Areas,Current Supervisor Districts,Current Police Districts
0,2021/04/21 09:22:00 AM,2021/04/21,09:22,2021,Wednesday,2021/04/21 09:55:00 AM,103250716010,1032507,210259178,210259178,...,-122.40851633190513,POINT (-122.40851633190513 37.77375969975922),32,,,,,,10,1
1,2021/10/08 12:42:00 PM,2021/10/08,12:42,2021,Friday,2021/10/08 12:43:00 PM,107857316010,1078573,210656643,212811587,...,-122.40667700592424,POINT (-122.40667700592424 37.767142180962104),33,,,,2,,9,3
2,2021/10/09 12:26:00 PM,2021/10/09,12:26,2021,Saturday,2021/10/09 12:26:00 PM,107886216030,1078862,210658815,212821487,...,-122.38717421937969,POINT (-122.38717421937969 37.74615712680034),56,,,,,,9,2
3,2021/06/05 11:07:00 AM,2021/06/05,11:07,2021,Saturday,2021/06/05 11:07:00 AM,103724316030,1037243,210348549,211561131,...,-122.42353498869593,POINT (-122.42353498869593 37.77223581387494),26,,,,,,11,4
4,2021/07/07 10:01:00 PM,2021/07/07,22:01,2021,Wednesday,2021/07/07 10:01:00 PM,104811216030,1048112,210428470,211883301,...,-122.44331168934727,POINT (-122.44331168934727 37.77046953691186),112,,,,,,5,7
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
532,2020/09/02 04:28:00 PM,2020/09/02,16:28,2020,Wednesday,2020/09/02 04:28:00 PM,95820816010,958208,200528905,202462307,...,-122.43129384491388,POINT (-122.43129384491388 37.72872954740928),94,,,,,,1,9
533,2020/10/22 08:47:00 AM,2020/10/22,08:47,2020,Thursday,2020/10/22 02:42:00 PM,97176416030,971764,200638140,202961857,...,-122.5040487777567,POINT (-122.5040487777567 37.76426117163131),39,,,,,,4,10
534,2021/01/19 03:13:00 AM,2021/01/19,03:13,2021,Tuesday,2021/01/19 03:13:00 AM,99731116030,997311,210041141,210190283,...,-122.41811812009416,POINT (-122.41811812009416 37.7603010605011),53,,,,3,,2,3
535,2021/01/17 10:05:00 AM,2021/01/17,10:05,2021,Sunday,2021/01/17 10:05:00 AM,99688016030,996880,210037532,210170877,...,-122.42746205880601,POINT (-122.42746205880601 37.76877049785351),28,,,,5,,5,3


In [58]:
find_marijuana_1['Incident Description'].unique()

array(['MARIJUANA OFFENSE', 'MARIJUANA, POSSESSION FOR SALE',
       'MARIJUANA, SALES', 'MARIJUANA, CULTIVATING/PLANTING',
       'MARIJUANA, FURNISHING', 'MARIJUANA, TRANSPORTING'], dtype=object)

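This shows that, once the column is upper-cased, the Incident Description column does contain marijuana descriptions: 536 rows across six distinct values. My earlier note that this column had no 'marijuana' entries was most likely down to letter case, since the newer data stores descriptions in mixed case (for example 'Lost Property') and `str.contains('MARIJUANA')` is case-sensitive.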
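Since `str.contains` is case-sensitive by default, a search for 'MARIJUANA' misses mixed-case text such as 'Marijuana Offense'. Here is a minimal sketch with made-up values, showing `case=False` as an alternative to upper-casing the column first:

```python
import pandas as pd

# Toy descriptions in mixed case, with one missing value
desc = pd.Series(['Marijuana Offense', 'Arson', None, 'MARIJUANA, SALES'])

# Case-sensitive search: only the all-caps entry matches
strict = desc.str.contains('MARIJUANA', na=False)

# case=False ignores letter case; na=False treats missing values as non-matches
relaxed = desc.str.contains('MARIJUANA', case=False, na=False)

print(strict.sum(), relaxed.sum())  # 1 2
```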

In [18]:
find_marijuana_2 = newer_data[
    newer_data['Incident Subcategory'].str.contains('MARIJUANA', na=False)
].reset_index(drop=True)

In [19]:
find_marijuana_2

Unnamed: 0,Incident Datetime,Incident Date,Incident Time,Incident Year,Incident Day of Week,Report Datetime,Row ID,Incident ID,Incident Number,CAD Number,...,Longitude,Point,Neighborhoods,ESNCAG - Boundary File,Central Market/Tenderloin Boundary Polygon - Updated,Civic Center Harm Reduction Project Boundary,HSOC Zones as of 2018-06-05,Invest In Neighborhoods (IIN) Areas,Current Supervisor Districts,Current Police Districts


No marijuana strings in the Incident Subcategory column.

In [20]:
find_marijuana_3 = newer_data[
    newer_data['Incident Category'].str.contains('MARIJUANA', na=False)
].reset_index(drop=True)

In [21]:
find_marijuana_3

Unnamed: 0,Incident Datetime,Incident Date,Incident Time,Incident Year,Incident Day of Week,Report Datetime,Row ID,Incident ID,Incident Number,CAD Number,...,Longitude,Point,Neighborhoods,ESNCAG - Boundary File,Central Market/Tenderloin Boundary Polygon - Updated,Civic Center Harm Reduction Project Boundary,HSOC Zones as of 2018-06-05,Invest In Neighborhoods (IIN) Areas,Current Supervisor Districts,Current Police Districts


In [22]:
find_marijuana_4 = newer_data[
    newer_data['Report Type Description'].str.contains('MARIJUANA', na=False)
].reset_index(drop=True)

In [23]:
find_marijuana_4

Unnamed: 0,Incident Datetime,Incident Date,Incident Time,Incident Year,Incident Day of Week,Report Datetime,Row ID,Incident ID,Incident Number,CAD Number,...,Longitude,Point,Neighborhoods,ESNCAG - Boundary File,Central Market/Tenderloin Boundary Polygon - Updated,Civic Center Harm Reduction Project Boundary,HSOC Zones as of 2018-06-05,Invest In Neighborhoods (IIN) Areas,Current Supervisor Districts,Current Police Districts


Frustratingly, I'm not finding any marijuana information in the Incident Subcategory, Incident Category, or Report Type Description columns.
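Rather than checking columns one at a time, one option is to scan every column in a loop and count case-insensitive matches. This is a sketch on toy data; the values are made up.

```python
import pandas as pd

# Toy frame using three of the newer dataset's column names
toy = pd.DataFrame({
    'Incident Category': ['Drug Offense', 'Arson'],
    'Incident Subcategory': ['Drug Violation', 'Arson'],
    'Incident Description': ['Marijuana Offense', 'Arson'],
})

# Count matches per column; astype(str) guards against non-text or missing values
hits = {
    col: int(toy[col].astype(str).str.contains('marijuana', case=False).sum())
    for col in toy.columns
}
print(hits)
```

Columns with a count of zero can be ruled out in one pass.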

In [24]:
upper_test = newer_data

In [25]:
upper_test = upper_test.upper()

AttributeError: 'DataFrame' object has no attribute 'upper'

In [26]:
test = 'hello'
print(test)

hello


In [27]:
test.upper()

'HELLO'
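The error above comes from calling a string method on a whole DataFrame. In pandas, string methods live on the `.str` accessor of a Series, so the upper-casing has to happen column by column. A minimal sketch on toy data:

```python
import pandas as pd

# Toy frame with two text columns, one containing a missing value
toy = pd.DataFrame({
    'Incident Description': ['Arson', 'Lost Property'],
    'Incident Category': ['Arson', None],
})

# A Series of strings exposes string methods through the .str accessor
toy['Incident Description'] = toy['Incident Description'].str.upper()

# To upper-case several text columns, apply .str.upper() to each column
text_cols = ['Incident Description', 'Incident Category']
toy[text_cols] = toy[text_cols].apply(lambda col: col.str.upper())

print(toy)
```

Missing values stay missing after `.str.upper()`, so no extra handling is needed for them here.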

In [32]:
newer_data.columns

Index(['Incident Datetime', 'Incident Date', 'Incident Time', 'Incident Year',
       'Incident Day of Week', 'Report Datetime', 'Row ID', 'Incident ID',
       'Incident Number', 'CAD Number', 'Report Type Code',
       'Report Type Description', 'Filed Online', 'Incident Code',
       'Incident Category', 'Incident Subcategory', 'Incident Description',
       'Resolution', 'Intersection', 'CNN', 'Police District',
       'Analysis Neighborhood', 'Supervisor District', 'Latitude', 'Longitude',
       'Point', 'Neighborhoods', 'ESNCAG - Boundary File',
       'Central Market/Tenderloin Boundary Polygon - Updated',
       'Civic Center Harm Reduction Project Boundary',
       'HSOC Zones as of 2018-06-05', 'Invest In Neighborhoods (IIN) Areas',
       'Current Supervisor Districts', 'Current Police Districts'],
      dtype='object')

In [35]:
incident_descript_upper = newer_data['Incident Subcategory'].str.upper()

In [38]:
find_marijuana_5 = incident_descript_upper.str.contains('MARIJUANA').reset_index(drop=True)

In [41]:
find_marijuana_5.unique()

array([False, nan], dtype=object)

AttributeError: 'Series' object has no attribute 'info'

In [47]:
newer_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 513217 entries, 0 to 513216
Data columns (total 34 columns):
 #   Column                                                Non-Null Count   Dtype 
---  ------                                                --------------   ----- 
 0   Incident Datetime                                     513217 non-null  object
 1   Incident Date                                         513217 non-null  object
 2   Incident Time                                         513217 non-null  object
 3   Incident Year                                         513217 non-null  object
 4   Incident Day of Week                                  513217 non-null  object
 5   Report Datetime                                       513217 non-null  object
 6   Row ID                                                513217 non-null  object
 7   Incident ID                                           513217 non-null  object
 8   Incident Number                                       

In [50]:
newer_data['Incident Description'].unique()

array(['Malicious Mischief, Vandalism to Property', 'Arson',
       'Lost Property', 'Theft, From Locked Vehicle, >$950',
       'Suspicious Occurrence', 'Trespassing',
       'Vehicle, Recovered, Stolen outside SF', 'Mental Health Detention',
       'Burglary, Hot Prowl, Forcible Entry',
       'Theft, From Locked Vehicle, $200-$950',
       'Malicious Mischief, Vandalism to Vehicle', 'Battery, Sexual',
       'Firearm, Possession of Loaded', 'License Plate, Stolen',
       'Theft, From Unlocked Vehicle, >$950',
       'Weapon, Deadly, Imitation or Laser Scope, Exhibiting',
       'Vehicle, Stolen, Motorcycle', 'Investigative Detention',
       'Vehicle, Stolen, Auto', 'Vehicle, Recovered, Auto',
       'Found  Property', 'Theft, From Building, $200-$950',
       'Theft, Lost Property, Petty',
       'Firearm, Discharging Within City Limits',
       'Phone Calls, Harassing', 'Warrant Arrest, Local SF Warrant',
       'Burglary, Residence, Forcible Entry',
       'Burglary, Residence, 