# More complicated data

Now things are going to get real.  We're going to look at a dataset of reports from accidents in New Jersey between 2008 and 2013. There are over 1.7 million rows, and the data file is kind of messy.

Thanks to Tom Meagher of the Marshall Project for his work to get the data from the New Jersey Department of Transportation.

If you're not using a computer set up for this workshop, you may need to download the data from https://s3.amazonaws.com/nicar15/njaccidents.csv and copy the downloaded file into the same directory as this notebook.


In [None]:
# running this cell takes a while, don't sweat it...
import pandas as pd
njaccidents = pd.read_csv('njaccidents.csv')

You probably saw a warning:

    Columns (6) have mixed types. Specify dtype option on import or set low_memory=False.
    
Let's see what's up with that. 

Normally we use column names to check data from a `DataFrame` but since we haven't even looked at them yet, here's a trick to see what's up with column 6 without knowing its name.

In [None]:
njaccidents[njaccidents.columns[6]].describe()

So from the above, we see that this data is for the *Police Dept Code*. `pandas` is treating it as a number (`int64`), and there are only 11 unique values for this column. Eleven isn't too many to eyeball, so let's see what they are:

In [None]:
njaccidents[njaccidents.columns[6]].unique()

Yeah, we can see a mix of strings and numbers in there. Since we've got some experience working with data, and since the column name has *code* in it, that makes us nervous. Codes should just about always be treated as strings, not numbers. Let's reload the data file so that the mix doesn't trip us up later. Notice that while in the last lesson we used the column name to set the `dtype` (data type), we can also use a column number. Since `pandas` is mostly concerned with math, most non-numeric data (strings included) is considered type `object`, so we use that.

In [None]:
njaccidents = pd.read_csv('njaccidents.csv', dtype={6: 'object'}) # re-reading 1.7M rows is slow, don't worry
njaccidents[njaccidents.columns[6]].unique()

That looks a lot better. So let's take a look at what we actually have here:

In [None]:
njaccidents.head()
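To see the column-number `dtype` trick in isolation, here is a minimal sketch on a tiny in-memory CSV. The column names and code values are made up for illustration; they are not taken from the accident file. It shows why codes are safer as text: parsed as numbers, they lose their leading zeros.

```python
import io
import pandas as pd

# Hypothetical miniature file: the third column (position 2) holds codes
csv_text = "Year,County,Dept Code\n2008,ATLANTIC,01\n2008,BERGEN,07\n2009,HUDSON,12\n"

# Default parsing guesses that the codes are integers, dropping leading zeros
default = pd.read_csv(io.StringIO(csv_text))
as_numbers = default[default.columns[2]].tolist()

# Keying dtype by column position keeps the codes exactly as written
as_text = pd.read_csv(io.StringIO(csv_text), dtype={2: 'object'})
codes = as_text[as_text.columns[2]].tolist()

print(as_numbers)
print(codes)
```

Keying by position is handy before you have looked at the header; once you know the names, using them is usually easier to read.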

Let's start with a typical question you might have. Which counties have the most rows in this dataset? This is a good use of the `value_counts()` function.

In [None]:
njaccidents['County Name'].value_counts()

OK, that was a little trick. You should have gotten a long error traceback, with the important part at the end:

    KeyError: 'County Name'

That means there's no column with that name. That's weird, that looked like the name of the column. What's going on?

In [None]:
njaccidents.columns

Yuk. Almost every column has a leading space in the name. 

As we saw in the previous section, you can rename columns in a `DataFrame`. Last time we used a `dict` to map current column names to new ones.

But in this case, there are so many problem columns, we should take advantage of an alternate usage: we can pass a function which, given the current column name, returns a replacement name. In our case, we'll strip the white space.

In [None]:
njaccidents.rename(columns=lambda x: x.strip(), inplace=True)
njaccidents.columns
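Here is a small sketch of the same cleanup on a made-up two-column frame (the data is hypothetical). It shows one way to list which names are padded, the function form of `rename`, and an equivalent approach that assigns stripped names through the `.str` accessor on the column index.

```python
import pandas as pd

# Hypothetical frame whose column names carry stray spaces
df = pd.DataFrame({' County Name': ['HUDSON'], ' Total Killed ': [0]})

# Names that change when stripped are the ones with extra whitespace
padded = [name for name in df.columns if name != name.strip()]

# rename() with a function returns a new frame unless inplace=True is given
cleaned = df.rename(columns=lambda name: name.strip())

# Equivalent: overwrite the names with their stripped versions
df.columns = df.columns.str.strip()

print(padded)
print(list(cleaned.columns))
```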

OK. Now let's try again: how many rows are there for each county?

In [None]:
njaccidents['County Name'].value_counts()

In [None]:
njaccidents[njaccidents['County Name']=='HUDSON']

In [None]:
# strip stray whitespace from the county values, not just the column names
njaccidents['County Name'] = njaccidents['County Name'].map(str.strip)
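The two cells above filter on `'HUDSON'` and then strip the county values. A plausible reading (an assumption, since the output isn't shown here) is that the values are padded the way the column names were, so an exact match finds nothing until the padding is removed. This sketch uses made-up values to show the effect, with `.str.strip()` as an alternative that, unlike `map(str.strip)`, passes missing values through instead of raising an error.

```python
import pandas as pd

# Hypothetical county values with trailing spaces
df = pd.DataFrame({'County Name': ['HUDSON   ', 'ESSEX ', 'HUDSON   ']})

# An exact comparison misses the padded values
before = int((df['County Name'] == 'HUDSON').sum())

# Strip the values, then compare again
df['County Name'] = df['County Name'].str.strip()
after = int((df['County Name'] == 'HUDSON').sum())

print(before, after)
```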

In [None]:
njaccidents.dtypes

Columns of type `object` hold strings, including dates that haven't been parsed yet. Columns of type `int64` hold integers.

In [None]:
njaccidents['Police Dept Code'].unique()

In [None]:
njaccidents['Crash Type Code'].unique()

In [None]:
njaccidents['Police Dept Code'] = njaccidents['Police Dept Code'].astype(str)

In [None]:
njaccidents['Police Dept Code'].unique()

In [None]:
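# .copy() makes njcrashinfo an independent DataFrame, so assigning to its
# columns later doesn't trigger a SettingWithCopyWarning
njcrashinfo = njaccidents[['County Name', 'Municipality Name', 'Crash Date', 'Crash Day Of Week', 'Crash Time',
                           'Total Killed', 'Total Injured', 'Pedestrians Killed', 'Pedestrians Injured',
                           'Total Vehicles Involved', 'Alcohol Involved', 'Cell Phone In Use Flag']].copy()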

In [None]:
njcrashinfo

How many car accidents had alcohol involved?

In [None]:
njcrashinfo['Alcohol Involved'].unique()

In [None]:
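# name the counts column explicitly; newer pandas versions would otherwise call it 'count'
alcohol = njcrashinfo['Alcohol Involved'].value_counts().to_frame('Alcohol Involved')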

In [None]:
alcohol

In [None]:
njcrashcount = njcrashinfo['Alcohol Involved'].count()

In [None]:
alcohol['Percent'] = (alcohol['Alcohol Involved']/njcrashcount)*100

In [None]:
alcohol
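As a shortcut for the percentage step, `value_counts()` can return shares directly when called with `normalize=True`. The sketch below uses made-up flag values (`'Y'` and `'N'` are assumptions, not necessarily what the real column contains) and puts counts and percentages side by side.

```python
import pandas as pd

# Hypothetical flags standing in for the 'Alcohol Involved' column
flags = pd.Series(['N', 'N', 'Y', 'N'], name='Alcohol Involved')

counts = flags.value_counts()                        # raw counts per value
percent = flags.value_counts(normalize=True) * 100   # share of non-null rows, as a percent

# Both Series are indexed by the flag value, so they line up automatically
summary = pd.DataFrame({'Count': counts, 'Percent': percent})
print(summary)
```

Like `count()`, `normalize=True` leaves missing values out of the denominator by default.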

In [None]:
njcrashinfo['County Name'].value_counts()

In [None]:
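# numeric_only=True keeps the sum to number columns; text columns can't be meaningfully added up
njcrashinfo.groupby('County Name').sum(numeric_only=True)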

In [None]:
# pick the column by name rather than by position, so we know we're summing deaths
countydeaths = njcrashinfo.groupby('County Name')['Total Killed'].sum().sort_values(ascending=False)

In [None]:
pd.DataFrame(countydeaths)
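Here is a minimal sketch of grouped sums on made-up numbers (the counties are real New Jersey names, the figures are invented). It names the columns to total instead of relying on their position, then sorts by one of them.

```python
import pandas as pd

# Hypothetical crash rows
crashes = pd.DataFrame({
    'County Name': ['ESSEX', 'HUDSON', 'ESSEX', 'OCEAN'],
    'Total Killed': [1, 0, 2, 0],
    'Total Injured': [0, 3, 1, 1],
})

# Sum only the named columns for each county
totals = crashes.groupby('County Name')[['Total Killed', 'Total Injured']].sum()

# Rank the counties by deaths, highest first
deaths = totals['Total Killed'].sort_values(ascending=False)
print(totals)
print(deaths)
```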

In [None]:
from datetime import datetime
# parse the MM/DD/YYYY strings into date objects
njcrashinfo['Crash Date'] = njcrashinfo['Crash Date'].apply(lambda x: datetime.strptime(x, "%m/%d/%Y").date())

In [None]:
njcrashinfo

In [None]:
crashesbydate = njcrashinfo.groupby('Crash Date').count().iloc[:,0]

In [None]:
crashesbydate.sort_values(ascending=False)
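An alternative sketch of the date step, on made-up dates: `pd.to_datetime` parses the whole column in one call, and `groupby(...).size()` counts rows per date without depending on which column happens to come first. Note that this produces `pandas` timestamps rather than the plain `date` objects used above, which also gives access to the `.dt` accessor for things like the year.

```python
import pandas as pd

# Hypothetical crash dates in the same MM/DD/YYYY layout
crashes = pd.DataFrame({'Crash Date': ['01/15/2008', '01/15/2008', '12/31/2013']})

# Parse every string at once, stating the format explicitly
crashes['Crash Date'] = pd.to_datetime(crashes['Crash Date'], format='%m/%d/%Y')

# Rows per date, busiest day first
per_day = crashes.groupby('Crash Date').size().sort_values(ascending=False)

# Rows per calendar year
by_year = crashes['Crash Date'].dt.year.value_counts()

print(per_day)
print(by_year)
```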