# Initial data exploration for my datasets

More information about this project is available in my GitHub repo: https://github.com/Noah-Baustin/sf_crime_data_analysis

In [None]:
#import modules
import pandas as pd
#import altair as alt

In [None]:
# import csv into a variable
historical_data = pd.read_csv('raw_data/SFPD_Incident_Reports_2003-May2018/Police_Department_Incident_Reports__Historical_2003_to_May_2018.csv', dtype=str)

Let's take a look at our data:

In [None]:
historical_data

In [None]:
historical_data.info()

Because I imported everything with `dtype=str`, every column is stored as text, so the date columns need to be reformatted: for sure 'Date'. Let's convert that one to a datetime type:

In [None]:
historical_data['Date'] = pd.to_datetime(historical_data['Date'])
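
A minimal sketch of a stricter version of this conversion, run on a toy column rather than the real file. The MM/DD/YYYY layout is an assumption about how the 'Date' strings look, so check a few raw values before relying on it:

```python
import pandas as pd

# Toy stand-in for the 'Date' column; the MM/DD/YYYY layout is an assumption.
toy = pd.DataFrame({"Date": ["01/15/2003", "05/14/2018", "not a date"]})

# An explicit format avoids per-value guessing, and errors="coerce" turns
# unparseable strings into NaT instead of raising an exception.
toy["Date"] = pd.to_datetime(toy["Date"], format="%m/%d/%Y", errors="coerce")

# Count how many values failed to parse so they can be inspected later.
n_unparsed = toy["Date"].isna().sum()
print(toy["Date"].dtype)
print(n_unparsed)
```

Counting the `NaT` values afterwards is a cheap way to find out whether any rows were silently dropped from the date range.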

In [None]:
historical_data.head()

Let's take a closer look at our columns:

In [None]:
historical_data.columns

We can see here that the data contains a bunch of extra columns that are not included in the documentation and are not essential to our analysis. Let's get rid of those columns.

In [None]:
#get rid of all those extra columns we don't need

historical_data = historical_data[['PdId', 'IncidntNum', 'Incident Code', 'Category', 'Descript',
       'DayOfWeek', 'Date', 'Time', 'PdDistrict', 'Resolution', 'Address', 'X',
       'Y', 'location']].copy()

In [None]:
historical_data.columns

Now that our columns are cleaned up and our dates are formatted correctly, let's take a look at our date range included in the data:

In [None]:
historical_data['Date'].min()

In [None]:
historical_data['Date'].max()

The data most likely is complete beginning in 2003. BUT the 2018 data is incomplete (the dataset ends partway through the year), so if we want to do an annual analysis, we'll need to exclude 2018.
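
Here is a rough sketch of how the partial year could be excluded once 'Date' is a datetime column. It uses a toy dataframe, and `full_years` is just a name I made up for the example:

```python
import pandas as pd

# Toy data standing in for historical_data after the date conversion.
toy = pd.DataFrame({
    "IncidntNum": ["1", "2", "3"],
    "Date": pd.to_datetime(["2003-01-01", "2017-12-31", "2018-05-15"]),
})

# Keep only rows from years that are complete in the data (2018 is partial).
full_years = toy[toy["Date"].dt.year < 2018].copy()

# Rows per year, to confirm 2018 is gone.
print(full_years["Date"].dt.year.value_counts().sort_index())
```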

We've got 2,129,525 entries. Let's check out our three different columns that hold incident identification codes to see how many unique values are in each:

In [None]:
historical_data['IncidntNum'].nunique()

In [None]:
historical_data['Incident Code'].nunique()

In [None]:
historical_data['PdId'].nunique()

Our documentation tells us that the PdId column is equivalent to the row_id column in the new data (see downloaded pdf titled: Change Notice - Police Incident Reports). And the documentation for the newer dataset tells us that the row_id is the unique identifier for each row.

It's a good sign that there are exactly as many unique PdId values as there are rows in the dataset.
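
The same check can be written as a yes/no test instead of comparing two numbers by eye. This is a small sketch on toy data, not the real dataset:

```python
import pandas as pd

# Toy data: PdId is a row identifier, IncidntNum is a case number that can repeat.
toy = pd.DataFrame({
    "PdId": ["10", "11", "12"],
    "IncidntNum": ["1", "1", "2"],
})

# Series.is_unique is True only when no value appears more than once.
pdid_is_unique = toy["PdId"].is_unique
incidntnum_is_unique = toy["IncidntNum"].is_unique

print(pdid_is_unique, incidntnum_is_unique)
```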

It also makes sense that there are duplicate values in the IncidntNum column. The IncidntNum refers to the case number. So if a supplemental report was filed after the incident was initially entered in this dataset, it would show up with a new PdId BUT the IncidntNum would be the same. 

But that does mean that I need to make sure I'm not counting the same incident multiple times if it shows up in this dataset multiple times. Most likely, my analysis will focus on a unique set of IncidntNum. 
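
One possible way to get to one row per incident is sketched below on toy data. Keeping the first row for each IncidntNum is an assumption for the example only; which report to keep (first, last, or something else) is an analysis decision I haven't made yet:

```python
import pandas as pd

# Toy data: incident "1" has an original report and a supplemental one.
toy = pd.DataFrame({
    "PdId": ["10", "11", "12"],
    "IncidntNum": ["1", "1", "2"],
    "Descript": ["A", "B", "C"],
})

# Keep one row per incident number; keep="first" is an illustrative choice.
one_row_per_incident = toy.drop_duplicates(subset="IncidntNum", keep="first")

print(len(toy), len(one_row_per_incident))
```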

Let's double check that there are no fully duplicated rows:

In [None]:
historical_data[historical_data.duplicated()]

No duplicate rows, that's great.

Let's find out how many duplicated IncidntNum values we have:

In [None]:
historical_data[historical_data['IncidntNum'].duplicated()]

In [None]:
historical_data['IncidntNum'].nunique() + len(historical_data[historical_data['IncidntNum'].duplicated()])

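There are 425,899 rows flagged as duplicates for 'IncidntNum'. By default, `duplicated()` flags every occurrence of a value after the first one, so each flagged row repeats an incident number that already appeared earlier. When we add that count to the number of unique values (above), it equals the number of rows in our dataset (1,703,626 + 425,899 = 2,129,525). That tells us that 1,703,626 is the actual number of incidents that we're working with.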
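
To make that arithmetic easier to follow, here is the same check on a tiny toy column where the counts can be verified by hand. It assumes the column has no missing values, since `nunique()` skips them:

```python
import pandas as pd

# Six rows, three distinct incident numbers: "1" appears 3 times, "3" twice.
toy = pd.DataFrame({"IncidntNum": ["1", "1", "1", "2", "3", "3"]})

n_rows = len(toy)
n_unique = toy["IncidntNum"].nunique()

# duplicated() marks every occurrence after the first, so this counts repeats.
n_repeats = toy["IncidntNum"].duplicated().sum()

# Every row is either the first appearance of a number or a repeat of one.
print(n_rows, n_unique, n_repeats)
```

Each row is either the first appearance of an incident number or a repeat of one, which is why the two counts add up to the row total.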

#### NOTE: explain this step better b/c confusion in Soo meeting

Let's take a look at some of our duplicates:

First we'll create a dataframe with our duplicated cases:

In [None]:
dupe_cases = historical_data[historical_data['IncidntNum'].duplicated()].copy()

Now we create a `list` of the IncidntNum values from those duplicated rows (these are incident numbers, not PdId's):

In [None]:
dupe_IncidentNum = dupe_cases['IncidntNum'].to_list()

Finally, we're creating a subset of our original data that just includes the rows with duplicated incident numbers, sorted so the repeats sit next to each other:

In [None]:
dupe_cases_full = historical_data[historical_data['IncidntNum'].isin(dupe_IncidentNum)].sort_values(by='IncidntNum')

In [None]:
dupe_cases_full.head(30)
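
As a possible shortcut, the list-then-`isin` steps above can be collapsed with `duplicated(keep=False)`, which flags every occurrence of a repeated value, including the first. This sketch uses toy data and checks that both routes select the same rows:

```python
import pandas as pd

toy = pd.DataFrame({
    "PdId": ["10", "11", "12", "13"],
    "IncidntNum": ["1", "1", "2", "3"],
})

# Route 1: keep=False marks all rows whose incident number appears more than once.
all_occurrences = toy[toy["IncidntNum"].duplicated(keep=False)].sort_values(by="IncidntNum")

# Route 2: the list + isin approach used in this notebook.
repeat_nums = toy.loc[toy["IncidntNum"].duplicated(), "IncidntNum"].to_list()
via_isin = toy[toy["IncidntNum"].isin(repeat_nums)].sort_values(by="IncidntNum")

print(all_occurrences.equals(via_isin))
```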

We can see here anecdotally that the repeat entries for a duplicated IncidntNum carry a different value in the 'Descript' column in some cases, but sometimes the value stays the same.
#### come back to this!
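
When I do come back to it, something like the sketch below could replace the eyeball check. It counts the distinct 'Descript' values per incident number on toy data; `varying` and `constant` are names made up for the example:

```python
import pandas as pd

toy = pd.DataFrame({
    "IncidntNum": ["1", "1", "2", "2", "3"],
    "Descript": ["A", "B", "C", "C", "D"],
})

# Only look at incident numbers that appear more than once.
repeated = toy[toy["IncidntNum"].duplicated(keep=False)]

# Number of distinct descriptions recorded for each repeated incident number.
descript_counts = repeated.groupby("IncidntNum")["Descript"].nunique()

varying = descript_counts[descript_counts > 1]    # description changes between rows
constant = descript_counts[descript_counts == 1]  # description is identical

print(len(varying), len(constant))
```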

We're going to want to isolate our incidents that include marijuana crimes. So let's take a look at the unique values in the two columns that might contain information about marijuana crimes:

In [None]:
historical_data['Category'].unique()

We can see here that there's a DRUG/NARCOTIC category, so that's probably where we're going to find the marijuana crimes. But nothing specific about marijuana here. That's going to show up in our 'Descript' column. There are too many unique values in that column to list, so let's create a subset:

In [None]:
#create dataframe with all our marijuana incidents
#na=False treats a missing description as "no match" so the filter can't fail on NaN

historical_data_marijuana = historical_data[
    historical_data['Descript'].str.contains('MARIJUANA', na=False)
].copy()
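
A small sketch of how this text filter behaves when a description is missing. I haven't confirmed whether 'Descript' actually contains missing values; the toy column below includes one on purpose. Without `na=False`, a missing value ends up in the mask and the boolean indexing typically raises an error:

```python
import pandas as pd

toy = pd.DataFrame({
    "Descript": ["POSSESSION OF MARIJUANA", "GRAND THEFT", None],
})

# na=False counts a missing description as "does not contain".
# regex=False matches the text literally rather than as a regular expression.
mask = toy["Descript"].str.contains("MARIJUANA", na=False, regex=False)

marijuana = toy[mask].copy()
print(mask.tolist())
```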

## Soo meeting notes

In [None]:
drug_narcotic_incidents = historical_data[
    historical_data['Category'] == 'DRUG/NARCOTIC'
].reset_index(drop=True)

In [None]:
# needed if I decide to answer my 'extra' question: Compare marijuana arrests to other types of crimes, like narcotics.

drug_narcotic_incidents.to_csv('drug_narcotic_incidents_historical.csv', index=False)

In [None]:
drug_narcotic_incidents.head()

In [None]:
#how to isolate the marijuana duplicate incidentnum's
#na=False treats a missing description as "no match"
historical_data[
    historical_data['Descript'].str.contains('MARIJUANA', na=False) & historical_data['IncidntNum'].isin(dupe_IncidentNum)
].sort_values(by='IncidntNum')

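In [None]:
#pull every row for one incident number and keep it in a variable for inspection
incident_duplicate = historical_data[historical_data['IncidntNum'] == "000123436"]
incident_duplicate

In [None]:
#show info about the first row in the dataframe [iloc selects by integer position]
incident_duplicate.iloc[0]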

In [None]:
#what unique values are in this marijuana data frame for types of arrests
historical_data_marijuana['Descript'].unique()

In [None]:
#show the marijuana rows whose incident number repeats an earlier row (nothing is dropped yet)
#going to need to write an explainer 
historical_data_marijuana[historical_data_marijuana['IncidntNum'].duplicated()]

In [None]:
historical_data_marijuana[historical_data_marijuana['IncidntNum'] == '160676737']

In [None]:
# export the df to a csv
historical_data_marijuana.to_csv("historical_data_marijuana.csv", index=False)

### Bring in the more recent dataset

In [None]:
# import csv into a variable
newer_data = pd.read_csv('raw_data/SFPD_Incident_Reports_2018-10.14.21/Police_Department_Incident_Reports__2018_to_Present(1).csv', dtype=str)

In [None]:
newer_data.columns

In [None]:
historical_data.columns
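
To compare the column names of the two datasets without reading both lists by eye, set operations on the column indexes could help. This is only a sketch: apart from PdId and row_id, which the documentation mentions, the column names below are placeholders and not the real headers:

```python
import pandas as pd

# Placeholder column sets; "shared_col" is hypothetical, not a real header.
old_cols = pd.Index(["PdId", "Category", "shared_col"])
new_cols = pd.Index(["row_id", "shared_col"])

only_in_old = old_cols.difference(new_cols)  # present only in the historical data
only_in_new = new_cols.difference(old_cols)  # present only in the newer data
in_both = old_cols.intersection(new_cols)    # present in both datasets

print(list(only_in_old), list(only_in_new), list(in_both))
```

With the real dataframes, `historical_data.columns` and `newer_data.columns` would take the place of the two toy indexes.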

In [None]:
historical_data['Resolution'].unique()