### Data Cleaning
#### The purpose of this notebook is to create cleaned .csv files to export for use in my data analyses

More information about this project is available in my GitHub repo: https://github.com/Noah-Baustin/sf_crime_data_analysis

In [1]:
# import modules
import pandas as pd



In [2]:
# import historical csv into a variable
historical_data = pd.read_csv('raw_data/SFPD_Incident_Reports_2003-May2018/Police_Department_Incident_Reports__Historical_2003_to_May_2018.csv', dtype=str)

# import newer csv into a variable
newer_data = pd.read_csv('raw_data/SFPD_Incident_Reports_2018-10.14.21/Police_Department_Incident_Reports__2018_to_Present(1).csv', dtype=str)
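A quick aside on `dtype=str`: it tells pandas to keep every column as text instead of guessing a numeric type, so identifier-like values stay exactly as they appear in the file. Here's a minimal sketch of the difference. The sample values (including the leading zeros) are made up for illustration and are not taken from the SFPD files.

```python
import io
import pandas as pd

# Hypothetical two-row CSV; these values are invented for the demo.
sample_csv = io.StringIO(
    "incident_code,incident_number\n04134,050697011\n06302,031147516\n"
)

# Default parsing infers integers, which drops the leading zeros.
inferred = pd.read_csv(sample_csv)

# dtype=str keeps every value exactly as written in the file.
sample_csv.seek(0)
as_text = pd.read_csv(sample_csv, dtype=str)

print(inferred["incident_code"].tolist())
print(as_text["incident_code"].tolist())
```

The trade-off is that numeric or date columns have to be converted explicitly later on.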

Trim the extra columns that we don't need from the historical data:

In [3]:
historical_data = historical_data[
    ['PdId', 'IncidntNum', 'Incident Code', 'Category', 'Descript',
       'DayOfWeek', 'Date', 'Time', 'PdDistrict', 'Resolution', 'X',
       'Y', 'location']
].copy()

Change the column names in the historical data to match the API names in the newer data. The SFPD published a key that I used to translate the column names, which can be found on page two of this document: https://drive.google.com/file/d/13n7pncEOxFTWig9-sTKnB2sRiTB54Kb-/view?usp=sharing

In [4]:
historical_data.rename(columns={'PdId': 'row_id',
                                'IncidntNum': 'incident_number',
                                'Incident Code': 'incident_code',
                                'Category': 'incident_category',
                                'Descript': 'incident_description',
                                'DayOfWeek': 'day_of_week',
                                'Date': 'incident_date',
                                'Time': 'incident_time',
                                'PdDistrict': 'police_district',
                                'Resolution': 'resolution',
                                'X': 'longitude',
                                'Y': 'latitude',
                                'location': 'the_geom'
                               }, 
                       inplace=True)

Now let's trim down the columns from the newer dataset so that we're only working with columns that match up to the old data. 

Note: there's no 'the_geom' column in the newer data, but the 'Point' column seems to be equivalent.

In [6]:
newer_data = newer_data[
    ['Row ID', 'Incident Number', 'Incident Code', 'Incident Category', 
     'Incident Description', 'Incident Day of Week', 'Incident Date', 'Incident Time', 
     'Police District', 'Resolution', 'Longitude', 'Latitude', 'Point']
].copy()

Change the column names in the newer dataset to match the API names of the columns. I'm doing this because the original column names contain spaces, which could cause issues down the road.

In [7]:
newer_data.rename(columns={'Row ID': 'row_id',
                           'Incident Number': 'incident_number',
                           'Incident Code': 'incident_code',
                           'Incident Category': 'incident_category',
                           'Incident Description': 'incident_description',
                           'Incident Day of Week': 'day_of_week', 
                           'Incident Date': 'incident_date',
                           'Incident Time': 'incident_time',
                           'Police District': 'police_district',
                           'Resolution': 'resolution',
                           'Longitude': 'longitude', 
                           'Latitude': 'latitude',
                           'Point': 'the_geom' 
                          },
                  inplace=True)

In [8]:
historical_data.columns

Index(['row_id', 'incident_number', 'incident_code', 'incident_category',
       'incident_description', 'day_of_week', 'incident_date', 'incident_time',
       'police_district', 'resolution', 'longitude', 'latitude', 'the_geom'],
      dtype='object')

In [9]:
newer_data.columns

Index(['row_id', 'incident_number', 'incident_code', 'incident_category',
       'incident_description', 'day_of_week', 'incident_date', 'incident_time',
       'police_district', 'resolution', 'longitude', 'latitude', 'the_geom'],
      dtype='object')
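Instead of comparing the two printed lists by eye, the check can be done in code. This is a minimal sketch using tiny stand-in dataframes (the names `historical_demo` and `newer_demo` and their values are hypothetical); the same comparison should work on `historical_data` and `newer_data`.

```python
import pandas as pd

# Tiny stand-ins for the two renamed dataframes (values are made up).
historical_demo = pd.DataFrame({"row_id": ["1"], "incident_code": ["6302"]})
newer_demo = pd.DataFrame({"row_id": ["2"], "incident_code": ["4134"]})

# pd.concat aligns on column names, so a mismatched name would not raise
# an error; it would create an extra column filled with NaN instead.
columns_match = list(historical_demo.columns) == list(newer_demo.columns)
assert columns_match, "column names differ between the two dataframes"
```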

Now that our datasets have matching columns, let's stack them into a single dataframe.

In [10]:
frames = [historical_data, newer_data]
all_data = pd.concat(frames)

The dataframe all_data now contains our combined dataset!
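One thing to be aware of: `pd.concat` keeps each dataframe's original row labels by default, so the labels restart at 0 for the second dataframe. That is why the `all_data.info()` output reports 2,642,742 entries but an index that only runs from 0 to 513216. Here's a small sketch of the behavior with made-up dataframes, along with the `ignore_index=True` option, which renumbers the rows if unique labels are needed.

```python
import pandas as pd

# Made-up stand-ins for the two datasets.
older_demo = pd.DataFrame({"row_id": ["a", "b", "c"]})
newer_demo = pd.DataFrame({"row_id": ["d", "e"]})

# Default: each dataframe's labels are kept, so 0 and 1 appear twice.
kept = pd.concat([older_demo, newer_demo])

# ignore_index=True discards the old labels and renumbers from 0.
renumbered = pd.concat([older_demo, newer_demo], ignore_index=True)

print(kept.index.tolist())
print(renumbered.index.tolist())
```

Repeated labels are harmless for the export at the end of this notebook, since it writes with `index=False`, but they can cause surprises with label-based lookups such as `.loc`.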

In [16]:
all_data.info()
all_data.head()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2642742 entries, 0 to 513216
Data columns (total 13 columns):
 #   Column                Dtype 
---  ------                ----- 
 0   row_id                object
 1   incident_number       object
 2   incident_code         object
 3   incident_category     object
 4   incident_description  object
 5   day_of_week           object
 6   incident_date         object
 7   incident_time         object
 8   police_district       object
 9   resolution            object
 10  longitude             object
 11  latitude              object
 12  the_geom              object
dtypes: object(13)
memory usage: 282.3+ MB


Unnamed: 0,row_id,incident_number,incident_code,incident_category,incident_description,day_of_week,incident_date,incident_time,police_district,resolution,longitude,latitude,the_geom
0,3114751606302,31147516,6302,LARCENY/THEFT,PETTY THEFT FROM A BUILDING,Sunday,09/28/2003,10:00,SOUTHERN,NONE,-120.5,90.0,POINT (-120.50000000000001 90)
1,5069701104134,50697011,4134,ASSAULT,BATTERY,Wednesday,06/22/2005,12:20,NORTHERN,NONE,-122.428223303176,37.7818959488603,POINT (-122.42822330317601 37.7818959488603)
2,6074729204104,60747292,4104,ASSAULT,ASSAULT,Saturday,07/15/2006,00:55,CENTRAL,NONE,-122.410672425337,37.799788690123,POINT (-122.41067242533701 37.799788690123)
3,7103536315201,71035363,15201,ASSAULT,STALKING,Tuesday,09/25/2007,00:01,TARAVAL,NONE,-122.458226300605,37.7413616001449,POINT (-122.458226300605 37.7413616001449)
4,11082415274000,110824152,74000,MISSING PERSON,MISSING ADULT,Saturday,09/24/2011,11:00,TARAVAL,LOCATED,-122.459172646607,37.7082001648459,POINT (-122.459172646607 37.7082001648459)


In [19]:
# convert the incident_date column from text to datetime
all_data['incident_date'] = pd.to_datetime(all_data['incident_date'])
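Without a `format` argument, `pd.to_datetime` has to infer the date layout. A more explicit variant, sketched below, names the layout and turns anything that doesn't fit into `NaT` so it can be counted. The `%m/%d/%Y` layout matches the rows displayed above; whether every row in the newer file uses that same layout isn't shown in this notebook, so treat that as an assumption to check. The sample values are made up.

```python
import pandas as pd

# Made-up values in the MM/DD/YYYY layout, plus one entry that won't parse.
dates_demo = pd.Series(["09/28/2003", "06/22/2005", "not a date"])

# errors="coerce" converts unparseable entries to NaT instead of raising.
parsed = pd.to_datetime(dates_demo, format="%m/%d/%Y", errors="coerce")

# Count how many values failed to parse.
unparsed_count = int(parsed.isna().sum())
print(parsed)
print(unparsed_count)
```

If the two source files turned out to use different layouts, each dataframe could be parsed with its own `format` before concatenating.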

Now let's create a dataframe containing every incident whose description mentions marijuana:

In [17]:
all_data_marijuana = all_data[
    all_data['incident_description'].str.contains('MARIJUANA')
].reset_index(drop=True)
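Two details of `str.contains` are worth knowing for this filter. First, it's case-sensitive by default, so it only matches descriptions written in upper case. Second, a missing description produces a missing value in the mask rather than `False`. Here's a sketch with made-up descriptions showing the `case=False` and `na=False` options. I haven't verified here whether the newer file actually uses a different case or contains missing descriptions, so both are assumptions.

```python
import pandas as pd

# Made-up descriptions: upper case, missing, mixed case, and unrelated.
descriptions_demo = pd.Series(
    ["POSSESSION OF MARIJUANA", None, "Marijuana Offense", "BATTERY"]
)

# Case-sensitive match; na=False treats missing descriptions as non-matches.
mask = descriptions_demo.str.contains("MARIJUANA", na=False)

# case=False also matches mixed-case or lower-case spellings.
mask_any_case = descriptions_demo.str.contains("MARIJUANA", case=False, na=False)

print(mask.tolist())
print(mask_any_case.tolist())
```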

In [20]:
all_data_marijuana

Unnamed: 0,row_id,incident_number,incident_code,incident_category,incident_description,day_of_week,incident_date,incident_time,police_district,resolution,longitude,latitude,the_geom
0,16055139916010,160551399,16010,DRUG/NARCOTIC,POSSESSION OF MARIJUANA,Friday,07/08/2016,08:00,MISSION,"ARREST, BOOKED",-122.42326589360349,37.765649515945,POINT (-122.42326589360349 37.765649515945)
1,17102985016010,171029850,16010,DRUG/NARCOTIC,POSSESSION OF MARIJUANA,Thursday,12/21/2017,10:40,TARAVAL,"ARREST, BOOKED",-122.45364594949392,37.72327255110331,POINT (-122.45364594949392 37.72327255110331)
2,17026584716010,170265847,16010,DRUG/NARCOTIC,POSSESSION OF MARIJUANA,Saturday,04/01/2017,02:10,NORTHERN,"ARREST, BOOKED",-122.43959183986001,37.783850873845424,POINT (-122.43959183986001 37.783850873845424)
3,16071288616010,160712886,16010,DRUG/NARCOTIC,POSSESSION OF MARIJUANA,Friday,09/02/2016,17:30,PARK,"ARREST, BOOKED",-122.45351291112611,37.76869697865512,POINT (-122.45351291112611 37.76869697865512)
4,16054757016030,160547570,16030,DRUG/NARCOTIC,POSSESSION OF MARIJUANA FOR SALES,Wednesday,07/06/2016,18:32,RICHMOND,NONE,-122.46620466789287,37.772540539159316,POINT (-122.46620466789287 37.772540539159316)
...,...,...,...,...,...,...,...,...,...,...,...,...,...
21535,16042276216030,160422762,16030,DRUG/NARCOTIC,POSSESSION OF MARIJUANA FOR SALES,Tuesday,05/24/2016,00:36,NORTHERN,"ARREST, BOOKED",-122.42577891406009,37.78231923926429,POINT (-122.42577891406009 37.78231923926429)
21536,17088308516030,170883085,16030,DRUG/NARCOTIC,POSSESSION OF MARIJUANA FOR SALES,Sunday,10/29/2017,01:23,CENTRAL,"ARREST, BOOKED",-122.41007890005163,37.79649323840542,POINT (-122.41007890005163 37.79649323840542)
21537,17001056716010,170010567,16010,DRUG/NARCOTIC,POSSESSION OF MARIJUANA,Tuesday,01/03/2017,00:01,CENTRAL,NONE,-122.41736820674113,37.79050941515949,POINT (-122.41736820674113 37.79050941515949)
21538,17054496416030,170544964,16030,DRUG/NARCOTIC,POSSESSION OF MARIJUANA FOR SALES,Wednesday,07/05/2017,11:44,PARK,"ARREST, BOOKED",-122.4525403126461,37.768390114069064,POINT (-122.4525403126461 37.768390114069064)


Let's export our two dataframes to .csv files that we can use in other data analyses!

In [21]:
all_data.to_csv("all_data.csv", index=False)
all_data_marijuana.to_csv("all_data_marijuana.csv", index=False)
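When loading these files in another notebook, the column types won't carry over automatically, because a .csv file stores only text. Here's a minimal round-trip sketch using an in-memory buffer and a one-row made-up dataframe (the name `demo` is hypothetical). It keeps `row_id` as text and parses `incident_date` back into a datetime.

```python
import io
import pandas as pd

# One-row stand-in for the cleaned data.
demo = pd.DataFrame({
    "row_id": ["3114751606302"],
    "incident_date": pd.to_datetime(["09/28/2003"], format="%m/%d/%Y"),
})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
demo.to_csv(buffer, index=False)
buffer.seek(0)

# Restore the intended types while reading the data back.
reloaded = pd.read_csv(
    buffer,
    dtype={"row_id": str},
    parse_dates=["incident_date"],
)
print(reloaded.dtypes)
```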