### The purpose of this notebook is to answer question two of my analysis questions:

#### Which types of marijuana crimes did SFPD report each year? Comparing the types of incidents over time.

In [1]:
#import modules
import pandas as pd
import altair as alt



Import our cleaned dataset that contains all of our marijuana incidents. We made this .csv file in the data_cleaning notebook.

In [2]:
mari_incidents = pd.read_csv('all_data_marijuana.csv', dtype=str)

Convert our incident dates to a datetime data format.

In [3]:
mari_incidents['incident_date'] = pd.to_datetime(mari_incidents['incident_date'])
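Since we read every column in as a string, it can be worth confirming that every date actually parsed. Here's a minimal sketch on made-up values (the `raw` series is hypothetical, not our real data) that uses `errors='coerce'` to turn unparseable entries into `NaT` so we can count them:

```python
import pandas as pd

# Hypothetical date strings; the middle one is deliberately malformed
raw = pd.Series(['2003-01-01', 'not a date', '2021-10-09'])

# errors='coerce' converts anything unparseable to NaT instead of raising
parsed = pd.to_datetime(raw, errors='coerce')

# Count how many values failed to parse
n_bad = parsed.isna().sum()
print(n_bad)
```

If the count is zero on the real column, the plain `pd.to_datetime` call above is all we need.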

Check our date range:

In [4]:
mari_incidents['incident_date'].min()

Timestamp('2003-01-01 00:00:00')

In [5]:
mari_incidents['incident_date'].max()

Timestamp('2021-10-09 00:00:00')

Looks like we've got a full year of data for 2003, our earliest year. But since the 2021 data ends on October 9, we can't do a full annual analysis of that year. So let's make a dataframe with only our full years of data, 2003 through 2020.

In [6]:
full_years = mari_incidents[
    (mari_incidents['incident_date'] >= '2003-01-01') &
    (mari_incidents['incident_date'] < '2021-01-01')
].reset_index(drop=True)
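The same filter can also be written in terms of calendar years, which some people find easier to read. This is just a sketch on a tiny made-up dataframe (`toy` and its values are hypothetical):

```python
import pandas as pd

# Hypothetical stand-in for mari_incidents with dates on both sides of the cutoff
toy = pd.DataFrame({
    'row_id': ['1', '2', '3', '4'],
    'incident_date': pd.to_datetime(
        ['2003-01-01', '2020-12-31', '2021-01-01', '2021-10-09']
    ),
})

# Keep rows whose calendar year is between 2003 and 2020, inclusive
toy_full_years = toy[
    toy['incident_date'].dt.year.between(2003, 2020)
].reset_index(drop=True)

print(toy_full_years)
```

Both versions should keep the same rows; the year-based one just avoids having to think about the exclusive upper bound.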

In [7]:
full_years['incident_description'].unique()

array(['possession of marijuana', 'possession of marijuana for sales',
       'transportation of marijuana', 'planting/cultivating marijuana',
       'sale of marijuana', 'furnishing marijuana', 'marijuana offense'],
      dtype=object)

We can see here that the 'incident_description' column records which type of marijuana crime the police department logged in its incident database, and the unique entries show that each row lists only a single crime. An incident involving several marijuana crimes therefore appears as several rows, one per crime. So although multiple rows in our dataset may describe a single incident, we don't need to drop the duplicate incident numbers for this analysis: we want every row so that we count every marijuana crime logged for each incident.
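To see how often one incident number carries more than one type of marijuana crime, a check along these lines could help. It's a sketch on made-up rows (the `toy` dataframe and its incident numbers are hypothetical), using the same column names as our dataset:

```python
import pandas as pd

# Hypothetical rows: incident A100 was logged with two different crimes
toy = pd.DataFrame({
    'row_id': ['1', '2', '3'],
    'incident_number': ['A100', 'A100', 'B200'],
    'incident_description': [
        'possession of marijuana',
        'sale of marijuana',
        'possession of marijuana',
    ],
})

# Number of distinct descriptions logged under each incident number
descriptions_per_incident = (
    toy.groupby('incident_number')['incident_description'].nunique()
)

# Incident numbers that carry more than one type of marijuana crime
multi_crime = descriptions_per_incident[descriptions_per_incident > 1]
print(multi_crime)
```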

So now we're going to use groupby to count up the number of marijuana incidents in each category during the full duration of our data.

In [8]:
description_counts_all = full_years.groupby(['incident_description']).count()

Now let's isolate one column that we know won't have any null values: row_id

In [9]:
clean_description_counts_all = description_counts_all[['row_id']].copy()

In [10]:
clean_description_counts_all = clean_description_counts_all.reset_index()

In [11]:
#rename columns
clean_description_counts_all.columns = ['crime', 'number_of_incidents']

In [12]:
#sort by number of incidents
clean_description_counts_all = clean_description_counts_all.sort_values(by=['number_of_incidents'], ascending=False).reset_index(drop=True)

In [13]:
clean_description_counts_all

,crime,number_of_incidents
0,possession of marijuana,11287
1,possession of marijuana for sales,5961
2,sale of marijuana,2976
3,transportation of marijuana,770
4,planting/cultivating marijuana,630
5,marijuana offense,262
6,furnishing marijuana,166
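As an aside, the groupby, column selection, rename and sort steps above could probably be collapsed into one chain with `value_counts`. This is a sketch on made-up rows (the `toy` dataframe is hypothetical), not a replacement for the cells above:

```python
import pandas as pd

# Hypothetical rows standing in for full_years
toy = pd.DataFrame({
    'row_id': ['1', '2', '3', '4'],
    'incident_description': [
        'possession of marijuana',
        'possession of marijuana',
        'sale of marijuana',
        'possession of marijuana',
    ],
})

toy_counts = (
    toy['incident_description']
    .value_counts()                 # rows per description, largest first
    .rename_axis('crime')           # name the index so it becomes the 'crime' column
    .reset_index(name='number_of_incidents')
)

print(toy_counts)
```

Because `value_counts` counts rows directly, there's no need to pick a column that we know has no nulls.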


Great! We can draw some conclusions from this data. It shows us that from 2003 to 2020, the San Francisco Police Department logged 22,052 marijuana-related incident records across these seven categories. Possession of marijuana was by far the most common type (11,287 records), followed by possession of marijuana for sales (5,961).

Here's a visualization of our data:

In [14]:
alt.Chart(clean_description_counts_all).mark_bar().encode(
    x=alt.X('crime:O', sort='-y'),  # order bars by descending incident count
    y='number_of_incidents'
).properties(
    title='San Francisco Police: Number of Marijuana Incidents 2003-2020'
)

Now let's take a look at how the types of marijuana incidents actually changed year to year:

Create a dataframe that has the number of incidents for each type of crime for each year:

In [15]:
# Count rows per description per calendar year ('A' = annual, year-end bins; pandas 2.2+ spells this 'YE')
test = full_years.groupby(['incident_description', pd.Grouper(key='incident_date', freq='A')]).count()

In [16]:
test

incident_description,incident_date,row_id,incident_number,incident_code,incident_category,day_of_week,incident_time,police_district,resolution,longitude,latitude,the_geom
furnishing marijuana,2003-12-31,11,11,11,11,11,11,11,11,11,11,11
furnishing marijuana,2004-12-31,7,7,7,7,7,7,7,7,7,7,7
furnishing marijuana,2005-12-31,10,10,10,10,10,10,10,10,10,10,10
furnishing marijuana,2006-12-31,16,16,16,16,16,16,16,16,16,16,16
furnishing marijuana,2007-12-31,24,24,24,24,24,24,24,24,24,24,24
...,...,...,...,...,...,...,...,...,...,...,...,...
transportation of marijuana,2016-12-31,35,35,35,35,35,35,35,35,35,35,35
transportation of marijuana,2017-12-31,23,23,23,23,23,23,23,23,23,23,23
transportation of marijuana,2018-12-31,21,21,21,21,21,21,21,21,21,21,21
transportation of marijuana,2019-12-31,13,13,13,13,13,13,13,13,13,13,13
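Another way to get yearly counts is to group on the calendar year itself and use `size()`, which counts rows regardless of nulls and gives one tidy count column instead of a count for every column. Here's a sketch on made-up rows (the `toy` dataframe is hypothetical):

```python
import pandas as pd

# Hypothetical rows standing in for full_years
toy = pd.DataFrame({
    'incident_description': [
        'sale of marijuana',
        'sale of marijuana',
        'furnishing marijuana',
    ],
    'incident_date': pd.to_datetime(['2003-02-01', '2003-11-15', '2004-06-30']),
})

yearly = (
    toy.groupby(
        ['incident_description', toy['incident_date'].dt.year.rename('year')]
    )
    .size()                                   # number of rows in each group
    .reset_index(name='number_of_incidents')
)

print(yearly)
```

This yields an integer `year` column directly, rather than year-end timestamps like 2003-12-31.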


Clean up that dataframe:

In [17]:
test_2 = test['row_id'].reset_index()

In [18]:
test_2

,incident_description,incident_date,row_id
0,furnishing marijuana,2003-12-31,11
1,furnishing marijuana,2004-12-31,7
2,furnishing marijuana,2005-12-31,10
3,furnishing marijuana,2006-12-31,16
4,furnishing marijuana,2007-12-31,24
...,...,...,...
103,transportation of marijuana,2016-12-31,35
104,transportation of marijuana,2017-12-31,23
105,transportation of marijuana,2018-12-31,21
106,transportation of marijuana,2019-12-31,13


Now I want to set up my dataframe so that the year is the row label and the type of crime is the column label and the values are the number of incidents.

In [19]:
incidents_per_year = test_2.pivot(
    index='incident_date',
    columns='incident_description',
    values='row_id',
)

In [20]:
incidents_per_year

incident_date,furnishing marijuana,marijuana offense,planting/cultivating marijuana,possession of marijuana,possession of marijuana for sales,sale of marijuana,transportation of marijuana
2003-12-31,11.0,,11.0,1313.0,284.0,204.0,11.0
2004-12-31,7.0,,27.0,1073.0,400.0,209.0,21.0
2005-12-31,10.0,,23.0,624.0,394.0,165.0,22.0
2006-12-31,16.0,,35.0,622.0,409.0,248.0,23.0
2007-12-31,24.0,,36.0,817.0,586.0,363.0,47.0
2008-12-31,17.0,,43.0,1058.0,552.0,384.0,86.0
2009-12-31,19.0,,84.0,1136.0,693.0,329.0,111.0
2010-12-31,11.0,,94.0,935.0,585.0,280.0,68.0
2011-12-31,8.0,,78.0,604.0,403.0,155.0,82.0
2012-12-31,9.0,,50.0,613.0,369.0,194.0,70.0
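The NaN entries in this table are year and description combinations that had no rows at all, and they're also why the counts show up as floats. If we decide that "no rows" should be treated as zero incidents (that's an assumption, since a missing description might instead mean the code wasn't in use yet), we could fill them in. A sketch on made-up numbers (the `toy_long` dataframe is hypothetical):

```python
import pandas as pd

# Hypothetical long-format counts; 'marijuana offense' has no row for 2003
toy_long = pd.DataFrame({
    'year': [2003, 2004, 2004],
    'incident_description': [
        'sale of marijuana',
        'sale of marijuana',
        'marijuana offense',
    ],
    'row_id': [5, 3, 2],
})

toy_wide = (
    toy_long.pivot(index='year', columns='incident_description', values='row_id')
    .fillna(0)      # assumption: a missing combination means zero incidents
    .astype(int)    # counts become whole numbers again
)

print(toy_wide)
```

Leaving the NaNs in place, as the notebook does, keeps "not recorded" visibly distinct from "zero".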


In [21]:
incidents_per_year = incidents_per_year.reset_index()

Let's clean up this table a bit:

In [22]:
incidents_per_year['year'] = incidents_per_year['incident_date'].dt.year

In [23]:
incidents_per_year_final = incidents_per_year[
    ['year', 
     'furnishing marijuana', 
     'marijuana offense', 
     'planting/cultivating marijuana',
     'possession of marijuana',
     'possession of marijuana for sales',
     'sale of marijuana',
     'transportation of marijuana'
    ]].copy()

In [24]:
incidents_per_year_final

,year,furnishing marijuana,marijuana offense,planting/cultivating marijuana,possession of marijuana,possession of marijuana for sales,sale of marijuana,transportation of marijuana
0,2003,11.0,,11.0,1313.0,284.0,204.0,11.0
1,2004,7.0,,27.0,1073.0,400.0,209.0,21.0
2,2005,10.0,,23.0,624.0,394.0,165.0,22.0
3,2006,16.0,,35.0,622.0,409.0,248.0,23.0
4,2007,24.0,,36.0,817.0,586.0,363.0,47.0
5,2008,17.0,,43.0,1058.0,552.0,384.0,86.0
6,2009,19.0,,84.0,1136.0,693.0,329.0,111.0
7,2010,11.0,,94.0,935.0,585.0,280.0,68.0
8,2011,8.0,,78.0,604.0,403.0,155.0,82.0
9,2012,9.0,,50.0,613.0,369.0,194.0,70.0


#### There's our table! It shows us the number of incidents in the San Francisco Police Department database for each type of marijuana crime in each year from 2003 to 2020!

Generally speaking, the table looks good. One hiccup is that the 'marijuana offense' incident description didn't come into use until 2018, so that column is NaN for earlier years and we don't have a full history of data for that bucket. That raises an additional reporting question: did SFPD begin coding different types of marijuana crimes under the more general description 'marijuana offense' after 2018? For example, we're no longer seeing possession of marijuana crimes after 2019, but could those be showing up in this more general category?
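One way to start on that reporting question is to look at the first and last year each description appears in the data. This won't tell us why a description stopped or started, only when. Here's a sketch on made-up rows (the `toy` dataframe and its dates are hypothetical):

```python
import pandas as pd

# Hypothetical rows: one description ends in 2019, another begins in 2018
toy = pd.DataFrame({
    'incident_description': [
        'possession of marijuana',
        'possession of marijuana',
        'marijuana offense',
        'marijuana offense',
    ],
    'incident_date': pd.to_datetime(
        ['2003-05-01', '2019-03-10', '2018-02-14', '2020-08-30']
    ),
})

usage_span = (
    toy.assign(year=toy['incident_date'].dt.year)
    .groupby('incident_description')['year']
    .agg(first_year='min', last_year='max')   # earliest and latest year seen
)

print(usage_span)
```

Run against `full_years`, a table like this would show at a glance which descriptions overlap in time and which hand off to one another.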

#### Now let's visualize our data!

First, we've got to massage the shape of our data using the melt function.

In [25]:
chart_data = incidents_per_year_final.melt('year')

In [26]:
chart_data

,year,incident_description,value
0,2003,furnishing marijuana,11.0
1,2004,furnishing marijuana,7.0
2,2005,furnishing marijuana,10.0
3,2006,furnishing marijuana,16.0
4,2007,furnishing marijuana,24.0
...,...,...,...
121,2016,transportation of marijuana,35.0
122,2017,transportation of marijuana,23.0
123,2018,transportation of marijuana,21.0
124,2019,transportation of marijuana,13.0
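For reference, `melt` takes optional names for the two new columns, which can make the long table a little more self-explanatory than the default `value`. A sketch on made-up numbers (the `toy_wide` dataframe and the name `number_of_incidents` are just illustrative choices):

```python
import pandas as pd

# Hypothetical wide table: one row per year, one column per crime type
toy_wide = pd.DataFrame({
    'year': [2003, 2004],
    'furnishing marijuana': [1, 2],
    'sale of marijuana': [3, 4],
})

toy_long = toy_wide.melt(
    id_vars='year',                       # column to keep as-is
    var_name='incident_description',      # former column headers go here
    value_name='number_of_incidents',     # former cell values go here
)

print(toy_long)
```

Each year and crime type pair becomes its own row, which is the shape Altair expects for a multi-series chart.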


Now that we've got our data in the correct shape, we can make a multi-series line chart!

In [27]:
alt.Chart(chart_data).mark_line().encode(
    x='year:O',
    y='value',
    color='incident_description'
).properties(
    title='San Francisco Police: Annual Marijuana Incident Types'
)