# Pandas Breakout Questions

To get some hands-on experience with `pandas`, we will be working with a somber dataset covering all the mass shooting events that occurred in 2015 (`./data/mass_shootings_2015.csv`).

In [4]:
import pandas as pd


## 1. Reading in the Dataframe

First things first, we must read in our dataset!
Turn to your neighbor and figure out how to read this data into a DataFrame. Once you have it read in, take a look at the columns and some summary statistics.

In [5]:
df = pd.read_csv('./data/mass_shootings_2015.csv')

In [10]:
df.describe  # no parentheses: this shows the bound method's repr; df.describe() returns the summary statistics

<bound method NDFrame.describe of          Incident Date           State                    City Or County  \
0    December 31, 2015       Louisiana                       New Orleans   
1    December 27, 2015       Tennessee                           Jackson   
2    December 26, 2015    Pennsylvania                      Philadelphia   
3    December 25, 2015         Florida                      Jacksonville   
4    December 25, 2015         Alabama                            Mobile   
5    December 21, 2015      California                       San Leandro   
6    December 20, 2015  North Carolina                        Wilmington   
7    December 20, 2015         Florida                    Miami (Goulds)   
8    December 20, 2015         Florida               Miami-dade (county)   
9    December 14, 2015        Illinois                Lovejoy (Brooklyn)   
10   December 13, 2015      California                       Los Angeles   
11   December 13, 2015      California                
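
The output above is the repr of the bound method, because `describe` was referenced without being called. As a minimal sketch of the intended inspection, the block below reads a tiny made-up table from an in-memory string (the rows, the placeholder state names, and the name `sample` are assumptions for illustration, not the real dataset) so that it runs without the CSV file:

```python
import io

import pandas as pd

# Tiny stand-in for ./data/mass_shootings_2015.csv (made-up rows, illustration only)
csv_text = """Incident Date,State,# Killed,# Injured
"January 10, 2015",State A,0,4
"January 11, 2015",State B,1,3
"January 12, 2015",State C,2,2
"""

sample = pd.read_csv(io.StringIO(csv_text))

print(sample.columns.tolist())  # column names as read from the header row
print(sample.head())            # first rows; note the parentheses
summary = sample.describe()     # count/mean/std/quartiles for the numeric columns
print(summary)
sample.info()                   # dtypes and non-null counts per column
```

With the real file, the same calls would be made on `df` after `pd.read_csv('./data/mass_shootings_2015.csv')`.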

## 2. Clean Column Names

The first thing I always do when reading in a DataFrame is clean up the column names! Personally, I view spaces, special characters, and capitals in column names as no-no's. Granted, this isn't a hard-and-fast rule, but we have already seen cases where spaces in column names cause issues.

With your neighbor, clean the column names of this DataFrame!

In [7]:
df.columns

Index(['Incident Date', 'State', 'City Or County', 'Address', '# Killed',
       '# Injured', 'Operations'],
      dtype='object')

In [11]:
df2 = df.copy()
cols = df2.columns.tolist()
cols = [c.replace(' ','_').lower() for c in cols]
df2.columns = cols
df2.describe  # no parentheses, so the repr below prints the whole frame rather than summary statistics

<bound method NDFrame.describe of          incident_date           state                    city_or_county  \
0    December 31, 2015       Louisiana                       New Orleans   
1    December 27, 2015       Tennessee                           Jackson   
2    December 26, 2015    Pennsylvania                      Philadelphia   
3    December 25, 2015         Florida                      Jacksonville   
4    December 25, 2015         Alabama                            Mobile   
5    December 21, 2015      California                       San Leandro   
6    December 20, 2015  North Carolina                        Wilmington   
7    December 20, 2015         Florida                    Miami (Goulds)   
8    December 20, 2015         Florida               Miami-dade (county)   
9    December 14, 2015        Illinois                Lovejoy (Brooklyn)   
10   December 13, 2015      California                       Los Angeles   
11   December 13, 2015      California                

In [12]:
df = df2
df.columns

Index(['incident_date', 'state', 'city_or_county', 'address', '#_killed',
       '#_injured', 'operations'],
      dtype='object')
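
The list comprehension above handles spaces and capitals but leaves the `#` character in place. One possible alternative, sketched here on a toy frame, uses the string methods available on the column `Index`. The frame `toy`, its single made-up row, and the choice to rename `#` to `n` are assumptions for illustration; the rest of this notebook keeps the `#_killed` and `#_injured` names.

```python
import pandas as pd

# Toy frame with the same column names as the dataset (the row is made up)
toy = pd.DataFrame(
    [["January 10, 2015", "State A", "City A", "1 Main St", 0, 4, None]],
    columns=["Incident Date", "State", "City Or County", "Address",
             "# Killed", "# Injured", "Operations"],
)

toy.columns = (
    toy.columns
    .str.strip()                            # drop stray leading/trailing spaces
    .str.lower()                            # no capitals
    .str.replace("#", "n", regex=False)     # '# killed' -> 'n killed' (naming is an assumption)
    .str.replace(r"\W+", "_", regex=True)   # runs of non-word characters -> '_'
)
print(toy.columns.tolist())
```

Names without special characters can also be used with attribute access (for example `toy.n_killed`), which is not possible with `#_killed`.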

## 3. Cast the `incident_date` column as a datetime object

We can see from our initial look at this data that the `incident_date` column is actually an object (basically a string). Let's convert this column to an actual datetime!

HINT: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.to_datetime.html

In [15]:
date = pd.to_datetime(df.incident_date)
df.incident_date = date
df.info  # no parentheses: this shows the bound method's repr; df.info() reports the dtypes

<bound method DataFrame.info of     incident_date           state                    city_or_county  \
0      2015-12-31       Louisiana                       New Orleans   
1      2015-12-27       Tennessee                           Jackson   
2      2015-12-26    Pennsylvania                      Philadelphia   
3      2015-12-25         Florida                      Jacksonville   
4      2015-12-25         Alabama                            Mobile   
5      2015-12-21      California                       San Leandro   
6      2015-12-20  North Carolina                        Wilmington   
7      2015-12-20         Florida                    Miami (Goulds)   
8      2015-12-20         Florida               Miami-dade (county)   
9      2015-12-14        Illinois                Lovejoy (Brooklyn)   
10     2015-12-13      California                       Los Angeles   
11     2015-12-13      California                  Huntington Beach   
12     2015-12-12         Georgia            

In [21]:
df.columns

Index(['incident_date', 'state', 'city_or_county', 'address', '#_killed',
       '#_injured', 'operations'],
      dtype='object')
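
To confirm that a conversion like this worked, it helps to check the dtype rather than the printed values. The sketch below parses a few made-up date strings in the same "Month day, year" style; the explicit `format` argument is optional and is an assumption about how the strings are laid out.

```python
import pandas as pd

# Made-up strings in the same style as the raw date column
dates = pd.Series(["December 31, 2015", "December 27, 2015", "January 12, 2015"])

# An explicit format avoids guessing; strings that do not match raise an error
parsed = pd.to_datetime(dates, format="%B %d, %Y")

print(dates.dtype)    # object: plain strings
print(parsed.dtype)   # a datetime64 dtype
print(parsed.dt.year.tolist())
```

Passing `errors="coerce"` to `pd.to_datetime` would turn unparseable strings into `NaT` instead of raising, which can be useful on messier data.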

## 4. Make a new column `month`

Using the `incident_date` column, create a new column called `month` that holds an `int` representing the month number (e.g. `1` would indicate January).

HINT: Try pulling out a single date (use `.loc` for practice!) and extract the month.  Then try using the `.map` function to create a new column

In [26]:
month = df.loc[0, 'incident_date'].month
print(month)

12


In [28]:
df['month'] = df['incident_date'].dt.month
df['month'].head  # no parentheses, so the whole Series is shown; df['month'].head() returns the first five rows

<bound method NDFrame.head of 0      12
1      12
2      12
3      12
4      12
5      12
6      12
7      12
8      12
9      12
10     12
11     12
12     12
13     12
14     12
15     12
16     12
17     12
18     12
19     12
20     11
21     11
22     11
23     11
24     11
25     11
26     11
27     11
28     11
29     11
       ..
300     2
301     2
302     2
303     2
304     2
305     2
306     2
307     1
308     1
309     1
310     1
311     1
312     1
313     1
314     1
315     1
316     1
317     1
318     1
319     1
320     1
321     1
322     1
323     1
324     1
325     1
326     1
327     1
328     1
329     1
Name: month, dtype: int64>
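
The hint suggests `.map`, while the cell above uses the `.dt` accessor. As a small sketch on made-up dates (the Series `dates` is an assumption for illustration), the two approaches give the same result:

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(["2015-12-31", "2015-11-16", "2015-01-04"]))

first_month = dates.loc[0].month           # a single Timestamp exposes .month
via_map = dates.map(lambda ts: ts.month)   # element-wise, as the hint suggests
via_dt = dates.dt.month                    # vectorized datetime accessor

print(first_month)
print(via_map.tolist())
print(via_dt.tolist())
```

The `.dt` accessor is usually the more idiomatic choice for datetime columns, since it avoids calling a Python function once per row.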

## 5. Drop the `operations` column

It looks like the `operations` column doesn't actually contain any useful information, so let's drop it!

In [29]:
df.drop('operations', axis=1, inplace=True)

In [30]:
df.columns

Index(['incident_date', 'state', 'city_or_county', 'address', '#_killed',
       '#_injured', 'month'],
      dtype='object')
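
Dropping with `inplace=True` modifies `df` directly. A possible alternative, sketched on a toy frame (the frame `toy` and its values are made up), is to assign the result of `drop`, which leaves the original untouched:

```python
import pandas as pd

toy = pd.DataFrame({"state": ["State A"], "casualties": [4], "operations": [None]})

# drop returns a new DataFrame; columns= avoids having to remember axis=1
trimmed = toy.drop(columns=["operations"])

print(trimmed.columns.tolist())
print(toy.columns.tolist())   # the original frame still has 'operations'
```

Keeping the original around makes it easier to re-run a notebook cell without hitting a `KeyError` on a column that has already been dropped.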

## 6. How many incidents occurred in each month?

Let's look at how many incidents took place in each month.
HINT: (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.groupby.html)

In [31]:
df.head  # no parentheses, so the whole frame is shown; df.head() returns the first five rows

<bound method NDFrame.head of     incident_date           state                    city_or_county  \
0      2015-12-31       Louisiana                       New Orleans   
1      2015-12-27       Tennessee                           Jackson   
2      2015-12-26    Pennsylvania                      Philadelphia   
3      2015-12-25         Florida                      Jacksonville   
4      2015-12-25         Alabama                            Mobile   
5      2015-12-21      California                       San Leandro   
6      2015-12-20  North Carolina                        Wilmington   
7      2015-12-20         Florida                    Miami (Goulds)   
8      2015-12-20         Florida               Miami-dade (county)   
9      2015-12-14        Illinois                Lovejoy (Brooklyn)   
10     2015-12-13      California                       Los Angeles   
11     2015-12-13      California                  Huntington Beach   
12     2015-12-12         Georgia              

In [32]:
by_month = df.groupby('month')
by_month.count()['incident_date']

month
1     23
2     16
3     20
4     19
5     35
6     36
7     41
8     39
9     34
10    20
11    27
12    20
Name: incident_date, dtype: int64

In [33]:
df.columns

Index(['incident_date', 'state', 'city_or_county', 'address', '#_killed',
       '#_injured', 'month'],
      dtype='object')
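
Counting rows per group can be written in a couple of ways. The sketch below uses a made-up `month` column (the frame `toy` is an assumption for illustration) to compare `groupby(...).size()` with `value_counts()`:

```python
import pandas as pd

toy = pd.DataFrame({"month": [12, 12, 11, 1, 1, 1]})

# size() counts rows per group, regardless of missing values in other columns
per_month = toy.groupby("month").size()

# value_counts() gives the same numbers; sort_index() orders them by month
same_counts = toy["month"].value_counts().sort_index()

print(per_month)
```

Note that `count()` reports non-null values per column, so it only matches the row count when the selected column has no missing values; `size()` does not have that caveat.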

## 7. How many casualties occurred in each month?

Right now we have the number of people involved broken out into the number killed and the number injured. Let's create a single column that indicates the number of casualties (i.e. the sum of killed and injured).

In [39]:
df['casualties'] = df['#_killed']+df['#_injured']
df.columns

Index(['incident_date', 'state', 'city_or_county', 'address', '#_killed',
       '#_injured', 'month', 'casualties'],
      dtype='object')

In [40]:
df.groupby('month').sum()['casualties']

month
1     111
2      85
3     102
4      88
5     185
6     185
7     206
8     200
9     163
10    107
11    137
12    115
Name: casualties, dtype: int64
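
A slightly different way to write this aggregation is to select the column before summing, so only `casualties` is aggregated. This is a sketch on a made-up frame (`toy`, `killed`, and `injured` are illustrative names, not the dataset's):

```python
import pandas as pd

toy = pd.DataFrame({
    "month": [1, 1, 2],
    "state": ["State A", "State B", "State A"],
    "killed": [1, 0, 2],
    "injured": [3, 4, 2],
})

# Element-wise addition of two numeric columns
toy["casualties"] = toy["killed"] + toy["injured"]

# Select the column first, then aggregate just that column
monthly = toy.groupby("month")["casualties"].sum()
print(monthly)
```

Selecting first avoids summing every other column. In recent pandas versions, summing all columns of a frame that also holds strings or dates may give odd results or raise an error, so this ordering tends to be the safer habit.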

## 8. How many casualties occurred by state?

Now that we have a casualties column, let's break down the number of casualties by state.

In [41]:
df.groupby('state').sum()['casualties']

state
Alabama                  17
Arizona                  29
Arkansas                 16
California              158
Colorado                 20
Connecticut              14
Delaware                  6
District of Columbia      9
Florida                 106
Georgia                  98
Illinois                117
Indiana                  49
Iowa                      9
Kansas                    4
Kentucky                 20
Louisiana                83
Maryland                 61
Massachusetts            24
Michigan                 54
Minnesota                24
Mississippi               9
Missouri                 51
Montana                   5
Nebraska                 17
Nevada                    4
New Jersey               36
New Mexico               11
New York                109
North Carolina           59
Ohio                     67
Oklahoma                 19
Oregon                   24
Pennsylvania             73
Rhode Island              4
South Carolina           51
South Dakota  
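
With this many states, an alphabetical listing is hard to scan. One option, sketched on made-up numbers (the frame `toy` and its placeholder state names are assumptions), is to sort the grouped totals so the largest come first:

```python
import pandas as pd

toy = pd.DataFrame({
    "state": ["State A", "State B", "State A", "State C"],
    "casualties": [4, 9, 6, 5],
})

by_state = (
    toy.groupby("state")["casualties"]
    .sum()
    .sort_values(ascending=False)   # largest totals first
)
print(by_state.head(2))             # the two states with the most casualties
```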

## 9. How many distinct cities or counties are represented?

Let's see how many distinct cities or counties are represented in this dataset.

In [42]:
df.columns

Index(['incident_date', 'state', 'city_or_county', 'address', '#_killed',
       '#_injured', 'month', 'casualties'],
      dtype='object')

In [43]:
df['city_or_county'].nunique()

205
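
One caveat with `nunique()` on the name alone is that places sharing a name in different states are counted once. Whether that happens in this dataset is not checked here; the sketch below only shows the difference on a made-up frame (`toy` and its place names are hypothetical):

```python
import pandas as pd

toy = pd.DataFrame({
    "state": ["State A", "State B", "State A"],
    "city_or_county": ["Springfield", "Springfield", "Riverton"],
})

# Distinct names only: the two Springfields collapse into one
distinct_names = toy["city_or_county"].nunique()

# Distinct (state, name) pairs: the two Springfields stay separate
distinct_places = len(toy[["state", "city_or_county"]].drop_duplicates())

print(distinct_names, distinct_places)
```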

## 10. INDEXING!

You should be using `.loc` or `.iloc` for all of these questions! (`.ix` was deprecated and has been removed as of pandas 1.0.)

1. Return all rows occurring in Alabama
2. Return all shootings with more than 5 people killed
3. Return the address of shootings occurring on or after November 1st
4. Return the address and date of all shootings occurring in Louisiana or Florida with casualty counts ranging from 6 to 10 (inclusive)

In [45]:
df[df['state'] == 'Alabama']

|     | incident_date | state | city_or_county | address | #_killed | #_injured | month | casualties |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 4 | 2015-12-25 | Alabama | Mobile | 785 Schillinger Rd S | 0 | 4 | 12 | 4 |
| 35 | 2015-11-16 | Alabama | Cherokee (county) | 1400 block of County Road 664 | 3 | 1 | 11 | 4 |
| 229 | 2015-05-24 | Alabama | Montgomery | Smiley Court | 1 | 3 | 5 | 4 |
| 259 | 2015-04-18 | Alabama | Montgomery | 1800 block of Gibbs Court | 0 | 5 | 4 | 5 |


In [46]:
df[df['#_killed'] > 5]

|     | incident_date | state | city_or_county | address | #_killed | #_injured | month | casualties |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 18 | 2015-12-02 | California | San Bernardino | 1365 South Waterman Avenue | 16 | 19 | 12 | 35 |
| 66 | 2015-10-01 | Oregon | Roseburg | 1140 Umpqua College Rd | 10 | 9 | 10 | 19 |
| 83 | 2015-09-17 | South Dakota | Platte | 36705 379th Street | 6 | 0 | 9 | 6 |
| 128 | 2015-08-08 | Texas | Houston | 2211 Falling Oaks | 8 | 0 | 8 | 8 |
| 159 | 2015-07-16 | Tennessee | Chattanooga | 4051 Amnicola Highway | 6 | 2 | 7 | 8 |
| 195 | 2015-06-17 | South Carolina | Charleston | 110 Calhoun Street | 9 | 0 | 6 | 9 |
| 235 | 2015-05-17 | Texas | Waco | 4671 S Jack Kultgen Expy | 9 | 18 | 5 | 27 |
| 292 | 2015-02-26 | Missouri | Tyrone | 18279 Highway H | 8 | 1 | 2 | 9 |


df.loc[df['incident_date'] >= '2015-11-01', 'address']

In [52]:
mask = df['state'].isin(['Louisiana', 'Florida']) & df['casualties'].between(6, 10)
df.loc[mask, ['address', 'incident_date']]
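
Each of the four questions in this section can be answered by passing a boolean mask (and, where needed, a column selection) to `.loc`. The sketch below does so on a small made-up frame; the rows in `toy` are invented for illustration and do not come from the dataset.

```python
import pandas as pd

# Made-up rows using the cleaned column names from earlier sections
toy = pd.DataFrame({
    "incident_date": pd.to_datetime(
        ["2015-12-25", "2015-11-01", "2015-10-31", "2015-06-17"]
    ),
    "state": ["Florida", "Louisiana", "Alabama", "Florida"],
    "address": ["1 A St", "2 B St", "3 C St", "4 D St"],
    "#_killed": [0, 6, 1, 2],
    "casualties": [4, 8, 5, 11],
})

# 1. All rows in Alabama
alabama = toy.loc[toy["state"] == "Alabama"]

# 2. All rows with more than 5 people killed
deadly = toy.loc[toy["#_killed"] > 5]

# 3. Address of rows on or after November 1st (row mask, then one column)
late = toy.loc[toy["incident_date"] >= "2015-11-01", "address"]

# 4. Address and date for Louisiana or Florida with 6 to 10 casualties;
#    isin tests membership, between is inclusive on both ends by default
mask = toy["state"].isin(["Louisiana", "Florida"]) & toy["casualties"].between(6, 10)
la_fl = toy.loc[mask, ["address", "incident_date"]]

print(la_fl)
```

Comparing a column to a tuple with `==` does not test membership, which is why `isin` is used for the "Louisiana or Florida" condition. Conditions are combined with `&` and `|` rather than `and` and `or`.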

## 11. Sort by Date and reset index

Let's reorder our DataFrame based on the date and reset our index to reflect this.

NOTE: Don't use `df.sort()`; this method was deprecated and has since been removed from pandas. Instead, you should be using `df.sort_values()`.
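
A minimal sketch of this step on a made-up frame (the frame `toy` and its values are assumptions for illustration):

```python
import pandas as pd

toy = pd.DataFrame({
    "incident_date": pd.to_datetime(["2015-12-31", "2015-01-04", "2015-06-17"]),
    "state": ["State A", "State B", "State C"],
})

# Sort oldest first, then replace the shuffled index with 0..n-1;
# drop=True discards the old index instead of keeping it as a column
ordered = toy.sort_values("incident_date").reset_index(drop=True)
print(ordered)
```

Without `drop=True`, `reset_index()` would add the previous index as a new column named `index`.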

## EXTRA CREDIT:  Create a graph showing the weekly frequency of shootings
HINT: Set the index as the date and refer to http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.resample.html
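
As a rough sketch of the approach in the hint, the block below sets a datetime index on a made-up frame and counts rows per week with `resample`. The dates in `toy` are invented, and the plotting call is left as a comment because it needs `matplotlib` installed.

```python
import pandas as pd

# Made-up incident dates (2015-01-01 was a Thursday)
toy = pd.DataFrame({
    "incident_date": pd.to_datetime(
        ["2015-01-01", "2015-01-03", "2015-01-06", "2015-01-20"]
    ),
})

# 'W' groups into weeks ending on Sunday; size() counts rows in each week,
# and weeks with no incidents appear with a count of 0
weekly = toy.set_index("incident_date").resample("W").size()
print(weekly)

# weekly.plot() would draw the weekly frequency as a line chart
```

On the real data, the same chain would be applied to `df` after the `incident_date` column has been converted to datetime.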