# Exploring Accidental Drug Death Data

Heroin and opioid painkillers have led to increasing overdose deaths for several years. I've had trouble finding good open data about it.

I was impressed by [this page on the Connecticut data portal](https://data.ct.gov/view/ecj5-r2i9), which takes the kind of data I hadn't had much luck finding for Illinois and presents it with some good overview graphs.

It also seemed like a better dataset for looking at pivot tables than the one I started with, so I decided to run through some exercises with it.

## Reading data from Socrata

First we need the data. Downloading a file and moving it around is inelegant, so let's see if we can get it over the web. This [web page from Socrata](https://dev.socrata.com/consumers/examples/data-visualization-with-python.html) clued me in to the fact that the SODA API provides ideal input for a dataframe.

The [SODA API docs](https://dev.socrata.com/docs/endpoints.html) show how to get the `pandas`-friendly version of the original URL, and once you know it, the pattern is pretty straightforward. We follow the link from the view linked above to the [original data source](https://data.ct.gov/Health-and-Human-Services/Accidental-Drug-Related-Deaths-January-2012-Sept-2/rybz-nyjw), and
  
`https://data.ct.gov/Health-and-Human-Services/Accidental-Drug-Related-Deaths-January-2012-Sept-2/rybz-nyjw`

becomes

`https://data.ct.gov/resource/rybz-nyjw.json`
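
The pattern can be captured in a few lines. This is a minimal sketch, assuming the dataset's identifier is the `xxxx-xxxx` shaped segment of the page URL; `soda_json_url` is a hypothetical helper name introduced here, not part of any library.

```python
import re


def soda_json_url(page_url):
    """Turn a Socrata dataset page URL into its SODA JSON endpoint.

    Assumption: the dataset id is a path segment shaped like 'rybz-nyjw'.
    """
    site = re.match(r"https?://[^/]+", page_url)
    dataset_id = re.search(r"/([a-z0-9]{4}-[a-z0-9]{4})(?=/|$)", page_url)
    if site is None or dataset_id is None:
        raise ValueError("no dataset id found in {!r}".format(page_url))
    return "{}/resource/{}.json".format(site.group(0), dataset_id.group(1))


page = ('https://data.ct.gov/Health-and-Human-Services/'
        'Accidental-Drug-Related-Deaths-January-2012-Sept-2/rybz-nyjw')
print(soda_json_url(page))
```

The same helper should also cope with the `/about` variant of the page URL, since it looks for the identifier anywhere in the path rather than only at the end.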
  



In [None]:
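import pandas as pd
try:  # Python 3
    from urllib.request import urlopen
except ImportError:  # Python 2
    from urllib2 import urlopen

# JSON endpoint for the dataset via the SODA API
data_url = 'https://data.ct.gov/resource/rybz-nyjw.json'
df = pd.read_json(data_url)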

Unfortunately, when I ran the command above, I got

```URLError: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:581)>```

[Stack Overflow to the rescue](http://stackoverflow.com/a/28048260)

In [39]:
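import ssl

# Workaround for the error above: skip certificate verification.
# This is insecure, so treat it as a stopgap rather than a habit.
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE
response = urlopen(data_url, context=ctx)
df = pd.read_json(response)

print("Read {} rows".format(len(df)))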

Read 1000 rows


## Securing the perimeter

Before you get too deep into a dataset, you need to check it out and make sure there aren't any gotchas hiding. Ideally, we'd find a data dictionary for the dataset. Truth be told, we often skate by without one and just use inference, especially since often there isn't one anyway! But remember that before you make any public pronouncements about findings in the data, you really ought to make sure that the columns and values mean what you guessed they mean.

There's no sign of a full data dictionary for this dataset, although there is some useful information in the [Socrata description page](https://data.ct.gov/Health-and-Human-Services/Accidental-Drug-Related-Deaths-January-2012-Sept-2/rybz-nyjw/about), specifically about how deaths involving heroin and morphine are documented.

So, whether we are skipping the data dictionary out of laziness or because there isn't one, data analysts develop some routine checks for when they start with a dataset. In 2014, Hilary Mason asked people for their checks with [a tweet](https://twitter.com/hmason/statuses/476905839035305984), and Jeff Leek [knit them into a blog post](http://simplystatistics.org/2014/06/13/what-i-do-when-i-get-a-new-data-set-as-told-through-tweets/).

I always like to start with `df.head()` because it's easy to scan and see patterns.


In [27]:
df.head()

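| | age | amphet | any_opioid | benzo_s | casenumber | coc | date | death_city | death_county | death_state | ... | morphine_not_heroin | other | oxyc | oxym | race | residence_city | residence_county | residence_state | sex | tramad |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 52 | NaN | NaN | NaN | 15-10038 | Y | 2015-06-16 | Southington | HARTFORD | CT | ... | NaN | NaN | NaN | NaN | White | Southington | HARTFORD | CT | Male | NaN |
| 1 | 26 | NaN | Y | NaN | 15-10152 | NaN | 2015-06-19 | Manchester | HARTFORD | CT | ... | NaN | NaN | NaN | NaN | White | Manchester | HARTFORD | CT | Male | NaN |
| 2 | 50 | NaN | Y | NaN | 15-10196 | Y | 2015-06-19 | Danbury | FAIRFIELD | CT | ... | NaN | NaN | NaN | NaN | White | Danbury | FAIRFIELD | CT | Male | NaN |
| 3 | 42 | NaN | Y | Y | 15-10202 | NaN | 2015-06-19 | New London | NEW LONDON | CT | ... | NaN | OPIOID NOS | NaN | NaN | White | Waterford | NEW LONDON | CT | Female | NaN |
| 4 | 42 | NaN | Y | NaN | 15-10208 | Y | 2015-06-20 | New London | NEW LONDON | CT | ... | NaN | NaN | NaN | NaN | Hispanic, White | Lebanon | NEW LONDON | CT | Male | NaN |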


We see a lot of `NaN` values. These are blanks in the original dataset and, one would infer, mean "no", so they could be converted to `N` values if null/`NaN` is going to be a problem. Either way, they shouldn't make you nervous.
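
If the `NaN` values do become a problem, the conversion is short. Here is a minimal sketch on a toy frame built from the first three rows shown above, with only two of the drug columns included. `sample` and `flag_cols` are names introduced here for illustration, and reading a blank as "no" is still only an inference.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df: the first three rows, two drug columns only
sample = pd.DataFrame({
    "casenumber": ["15-10038", "15-10152", "15-10196"],
    "any_opioid": [np.nan, "Y", "Y"],
    "coc": ["Y", np.nan, "Y"],
})

flag_cols = ["any_opioid", "coc"]

# Option 1: make the blanks explicit as 'N'
flags = sample[flag_cols].fillna("N")

# Option 2: turn the flags into booleans (NaN compares as not equal to 'Y')
as_bool = sample[flag_cols].eq("Y")

print(flags)
print(as_bool)
```

The boolean form tends to be handier later on, because sums and means of the columns then give counts and proportions directly.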

I also like to use `df.describe()` early on. `describe()` gives different summaries for numeric and non-numeric columns, and by default it only describes the numeric ones. There's only one in this dataset (`age`), so the default isn't super helpful. We can ask for everything with `df.describe(include='all')`. That's fine, and maybe easiest to remember, but it fills the table with `NaN` wherever a statistic doesn't apply to a column's type. So let's do it in a couple of steps.


In [40]:
df.describe()

| | age |
|---|---|
| count | 1000.0 |
| mean | 41.491 |
| std | 12.419991 |
| min | 14.0 |
| 25% | 31.0 |
| 50% | 42.0 |
| 75% | 51.0 |
| max | 81.0 |


OK. Earlier we saw (with `len(df)`) that there are 1000 rows, and the `count` for `age` is also 1000, so we can see that there are no missing values for `age`. (I should note that the nice round 1000 rows has also set off my data "spidey sense" -- I have a hunch that we just got one page of results -- but we can defer that while we get a general sense of the data.)
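
If that hunch turns out to be right, the remaining rows can be requested page by page. The sketch below assumes the SODA `$limit`, `$offset`, and `$order` query parameters and a page size of 1000. `page_url`, `fetch_all`, and `read_page` are hypothetical names, and `fake_read_page` only pretends to be a server holding 2500 records so the loop can be tried offline.

```python
def page_url(base_url, limit, offset):
    """Build the URL for one page of results (ordered so pages stay stable)."""
    return "{}?$order=:id&$limit={}&$offset={}".format(base_url, limit, offset)


def fetch_all(base_url, read_page, limit=1000):
    """Collect every record by requesting pages until a short page comes back.

    read_page is any callable that takes a URL and returns a list of records.
    """
    records = []
    offset = 0
    while True:
        page = read_page(page_url(base_url, limit, offset))
        records.extend(page)
        if len(page) < limit:  # a short (or empty) page means we reached the end
            break
        offset += limit
    return records


# Stand-in for the network: pretend the server holds 2500 records.
def fake_read_page(url):
    query = url.split("?", 1)[1]
    params = dict(pair.split("=") for pair in query.split("&"))
    start = int(params["$offset"])
    return list(range(2500))[start:start + int(params["$limit"])]


rows = fetch_all("https://data.ct.gov/resource/rybz-nyjw.json", fake_read_page)
print(len(rows))
```

With real data, `read_page` could wrap the `urlopen(..., context=ctx)` call plus `json.load`, and the combined list could then be handed to `pd.DataFrame`.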

Let's see what datatypes we have in here, so we can use `describe` to summarize the rest.


In [42]:
df.dtypes.value_counts()

object            28
datetime64[ns]     1
int64              1
dtype: int64

OK: 28 `object` columns -- or for regular people, string values -- plus one `datetime` and the one `int64` (`age`) we just summarized.
Anyway, let's look at each of the others. Since there's only one date, let's look at that first.


In [44]:
df.describe(include=['datetime64[ns]'])

| | date |
|---|---|
| count | 1000 |
| unique | 488 |
| top | 2015-07-05 00:00:00 |
| freq | 7 |
| first | 2014-01-02 00:00:00 |
| last | 2015-09-30 00:00:00 |


In [46]:
df.describe(include=['O'])

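| | amphet | any_opioid | benzo_s | casenumber | coc | death_city | death_county | death_state | deathloc | etoh | ... | morphine_not_heroin | other | oxyc | oxym | race | residence_city | residence_county | residence_state | sex | tramad |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 26 | 463 | 285 | 1000 | 235 | 1000 | 846 | 508 | 1000 | 237 | ... | 11 | 127.0 | 152 | 23 | 997 | 966 | 482 | 492 | 1000 | 22 |
| unique | 2 | 2 | 1 | 1000 | 2 | 149 | 8 | 1 | 224 | 2 | ... | 2 | 44.0 | 2 | 1 | 8 | 194 | 18 | 7 | 2 | 1 |
| top | Y | Y | Y | 15-12337 | Y | Hartford | NEW HAVEN | CT | `{u'latitude': u'41.765775', u'needs_recoding':...` | Y | ... | Y | | Y | Y | White | Waterbury | NEW HAVEN | CT | Male | Y |
| freq | 25 | 462 | 285 | 1 | 225 | 104 | 248 | 508 | 56 | 235 | ... | 10 | 17.0 | 151 | 23 | 826 | 61 | 142 | 481 | 732 | 22 |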
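
The step-by-step approach used above (one `describe` call per kind of column) can also be wrapped in a small helper. This is only a sketch: `describe_by_kind` is a hypothetical name, and `toy` is a stand-in built from three columns of the first four rows of `df.head()`, not the real frame. The rows that `describe` reports for dates vary between pandas versions, so the sketch makes no promise about that table's layout.

```python
import pandas as pd


def describe_by_kind(frame, kinds=("number", "datetime", "object")):
    """Return one describe() table per group of column types."""
    summaries = {}
    for kind in kinds:
        subset = frame.select_dtypes(include=[kind])
        if subset.shape[1]:  # skip groups that have no columns
            summaries[kind] = subset.describe(include="all")
    return summaries


# Toy frame: three columns from the first four rows of df.head()
toy = pd.DataFrame({
    "age": [52, 26, 50, 42],
    "date": pd.to_datetime(["2015-06-16", "2015-06-19",
                            "2015-06-19", "2015-06-19"]),
    "sex": pd.Series(["Male", "Male", "Male", "Female"], dtype=object),
})

summaries = describe_by_kind(toy)
for kind, table in summaries.items():
    print(kind)
    print(table)
```

Keeping the groups separate avoids the `NaN`-heavy table that `include='all'` produces on the full frame, at the cost of having several small tables to read.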
