In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
pd.set_option("display.float_format", lambda x: "%.2f" % x) # Suppress scientific notation for float data type

## Which drugs are states buying most frequently?

First, we need to determine which drugs appear to be outliers as measured by the number of units reimbursed in each state. To get started, query the Medicaid API for the drug name, the state that bought the drug and the total units reimbursed in 2016. The query sums `units_reimbursed` for each state and drug, skips suppressed records (`suppression_used=False`) and excludes the `XX` state code.
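The query string is easier to audit when each SoQL clause is written out separately. The following is a minimal sketch rather than the notebook's own code: `build_query` is a hypothetical helper, it reuses the endpoint and clauses from the query cell, and it leaves out the app token.

```python
from urllib.parse import quote, urlencode

BASE_URL = "https://data.medicaid.gov/resource/neai-csgh.json"

def build_query(measure, limit=4625479):
    """Return a SoQL URL that sums `measure` per state and drug (hypothetical helper)."""
    params = {
        "$select": f"state_code,product_fda_list_name,sum({measure})",
        "$where": "suppression_used=False and not state_code='XX'",
        "$group": "state_code,product_fda_list_name",
        "$limit": limit,
    }
    # Keep SoQL punctuation readable and encode spaces as %20,
    # as in the hand-written URL.
    return BASE_URL + "?" + urlencode(params, safe="$,()='", quote_via=quote)

url = build_query("units_reimbursed")
print(url)
```

Passing `"total_amount_reimbursed"` instead would give the spending query, so the two sections could share one helper.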

In [2]:
query = "https://data.medicaid.gov/resource/neai-csgh.json?$select=state_code,product_fda_list_name,sum(units_reimbursed)&$where=suppression_used=False%20and%20not%20state_code='XX'&$group=state_code,product_fda_list_name&$limit=4625479&$$app_token=v3AK8nRjxbWjtmIBGHJ9OmMlb"
units = pd.read_json(query)
units.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 95081 entries, 0 to 95080
Data columns (total 3 columns):
product_fda_list_name    95079 non-null object
state_code               95081 non-null object
sum_units_reimbursed     95081 non-null float64
dtypes: float64(1), object(2)
memory usage: 2.2+ MB


In [3]:
units.head()

Unnamed: 0,product_fda_list_name,state_code,sum_units_reimbursed
0,ZINC OXIDE,KY,417559.25
1,RAVICTI,TN,15775.0
2,BICILLIN L,IN,1919.67
3,Tramadol H,WA,36053.0
4,NAPROXEN 3,NV,8678.0


Some drug names are fully capitalized and others are not. Since we'll eventually count drugs by that column, we need to standardize the capitalization so the same drug isn't counted under two spellings.

In [4]:
units["product_fda_list_name"] = units["product_fda_list_name"].str.upper()
units.head()

Unnamed: 0,product_fda_list_name,state_code,sum_units_reimbursed
0,ZINC OXIDE,KY,417559.25
1,RAVICTI,TN,15775.0
2,BICILLIN L,IN,1919.67
3,TRAMADOL H,WA,36053.0
4,NAPROXEN 3,NV,8678.0
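The API grouped the rows before the names were uppercased, so a state could still hold two rows for one drug whose spellings differed only in case. A hedged sketch of how those rows might be recombined, using made-up toy values rather than the real data:

```python
import pandas as pd

# Toy rows (invented values): the same drug spelled two ways in one state.
toy = pd.DataFrame({
    "product_fda_list_name": ["Tramadol H", "TRAMADOL H", "RAVICTI"],
    "state_code": ["WA", "WA", "WA"],
    "sum_units_reimbursed": [100.0, 50.0, 30.0],
})

toy["product_fda_list_name"] = toy["product_fda_list_name"].str.upper()

# Re-aggregate so each (state, drug) pair has one row.
# Note: groupby drops rows whose drug name is missing by default.
combined = (
    toy.groupby(["state_code", "product_fda_list_name"], as_index=False)
       ["sum_units_reimbursed"]
       .sum()
)
print(combined)
```

Whether the real data contains such case-only duplicates is an assumption. Comparing the row count before and after the `groupby` would show it.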


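Rank the drugs by units reimbursed within each state, so that rank 1 is the drug with the most units. With `method="min"`, tied drugs share the lowest rank in their tie.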

In [5]:
units["rank"] = units.groupby("state_code")["sum_units_reimbursed"].rank(method="min", ascending=False).astype(int)
units.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 95081 entries, 0 to 95080
Data columns (total 4 columns):
product_fda_list_name    95079 non-null object
state_code               95081 non-null object
sum_units_reimbursed     95081 non-null float64
rank                     95081 non-null int32
dtypes: float64(1), int32(1), object(2)
memory usage: 2.5+ MB
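To see how the per-state ranking treats ties, here is a small illustration on invented numbers (the state codes are real, the values are not):

```python
import pandas as pd

toy = pd.DataFrame({
    "state_code": ["KY", "KY", "KY", "TN", "TN"],
    "sum_units_reimbursed": [500.0, 500.0, 200.0, 90.0, 10.0],
})

# Same call as in the cell above: rank within each state, largest value first.
toy["rank"] = (
    toy.groupby("state_code")["sum_units_reimbursed"]
       .rank(method="min", ascending=False)
       .astype(int)
)
print(toy["rank"].tolist())  # [1, 1, 3, 1, 2]
```

The two tied Kentucky rows both get rank 1 and the next row gets rank 3, so a `rank <= 10` filter can return more than ten rows for a state when a tie straddles the cutoff.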


Create a new dataframe with the top 10 drugs in each state.

In [6]:
top_10_units = units[units["rank"] <= 10]
top_10_units.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 510 entries, 73 to 95066
Data columns (total 4 columns):
product_fda_list_name    510 non-null object
state_code               510 non-null object
sum_units_reimbursed     510 non-null float64
rank                     510 non-null int32
dtypes: float64(1), int32(1), object(2)
memory usage: 17.9+ KB


How many state top-10 lists does each drug appear in?

In [7]:
counts_units = top_10_units["product_fda_list_name"].value_counts().reset_index() # Create new dataframe of drug counts
counts_units.columns = ["product_fda_list_name", "count"] # Rename columns
counts_units.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 62 entries, 0 to 61
Data columns (total 2 columns):
product_fda_list_name    62 non-null object
count                    62 non-null int64
dtypes: int64(1), object(1)
memory usage: 1.0+ KB
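`value_counts()` counts rows, which equals the number of states only if each drug appears at most once per state. As an alternative sketch, on toy data with one deliberate duplicate, distinct states can be counted directly:

```python
import pandas as pd

# Toy top-10 rows (invented): drug "A" appears twice for KY.
top = pd.DataFrame({
    "product_fda_list_name": ["A", "A", "B", "A"],
    "state_code": ["KY", "TN", "KY", "KY"],
})

row_counts = top["product_fda_list_name"].value_counts()

# Count distinct states per drug instead of rows.
state_counts = (
    top.groupby("product_fda_list_name")["state_code"]
       .nunique()
       .reset_index(name="count")
)
print(row_counts["A"])  # 3 rows
print(state_counts)     # A -> 2 states, B -> 1 state
```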


Merge the dataframes into a single dataframe with both ranks and counts.

In [8]:
top_10_units = top_10_units.merge(counts_units, how="inner", on="product_fda_list_name")
top_10_units.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 510 entries, 0 to 509
Data columns (total 5 columns):
product_fda_list_name    510 non-null object
state_code               510 non-null object
sum_units_reimbursed     510 non-null float64
rank                     510 non-null int32
count                    510 non-null int64
dtypes: float64(1), int32(1), int64(1), object(2)
memory usage: 21.9+ KB
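The merge should attach exactly one count to every top-10 row, which is why the row count stays at 510. One way to make that expectation explicit is the `validate` argument of `merge`. This is a sketch on toy data, not a change to the cell above:

```python
import pandas as pd

top = pd.DataFrame({
    "product_fda_list_name": ["A", "A", "B"],
    "state_code": ["KY", "TN", "KY"],
})
counts = pd.DataFrame({
    "product_fda_list_name": ["A", "B"],
    "count": [2, 1],
})

# validate="many_to_one" raises MergeError if a drug name
# appears more than once in `counts`.
merged = top.merge(
    counts, how="inner", on="product_fda_list_name", validate="many_to_one"
)
print(len(merged) == len(top))  # True
```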


Which drugs appear in only a single state's top-10 list?

In [9]:
outliers_units = top_10_units[top_10_units["count"] == 1]
outliers_units.sort_values("product_fda_list_name", ascending=True)

Unnamed: 0,product_fda_list_name,state_code,sum_units_reimbursed,rank,count
226,ADVATE 5ML,NV,8649028.0,8,1
500,ALPRAZOLAM,MO,9810078.31,10,1
427,AMLODIPINE,DC,2354345.13,10,1
486,AMMONIUM L,NY,61596452.85,8,1
442,BROMFED DM,TX,61984061.99,7,1
493,BUPROPION,VT,1365163.5,9,1
497,CHILDREN I,TX,77695830.83,6,1
351,CLONAZEPAM,RI,2107665.0,9,1
501,DEXTROAMP-,MA,14594796.5,8,1
487,DEXTROSE 5,WV,12265060.0,8,1


Export the outliers data as an Excel file.

In [10]:
outliers_units.to_excel("outliers_units.xlsx")
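Since the same steps are applied again to spending, they could be collected into one function. The sketch below assumes the column names used in this notebook. `single_state_top_n` is a hypothetical name, and the toy values are invented.

```python
import pandas as pd

def single_state_top_n(df, value_col, n=10):
    """Return drugs that reach the top n in exactly one state (hypothetical helper)."""
    ranked = df.copy()
    ranked["product_fda_list_name"] = ranked["product_fda_list_name"].str.upper()
    ranked["rank"] = (
        ranked.groupby("state_code")[value_col]
              .rank(method="min", ascending=False)
              .astype(int)
    )
    top = ranked[ranked["rank"] <= n]
    counts = top["product_fda_list_name"].value_counts().reset_index()
    counts.columns = ["product_fda_list_name", "count"]
    top = top.merge(counts, how="inner", on="product_fda_list_name")
    return top[top["count"] == 1].sort_values("product_fda_list_name")

toy = pd.DataFrame({
    "product_fda_list_name": ["Drug A", "DRUG B", "DRUG D", "DRUG A", "DRUG C", "DRUG D"],
    "state_code": ["KY", "KY", "KY", "TN", "TN", "TN"],
    "sum_units_reimbursed": [10.0, 5.0, 1.0, 8.0, 9.0, 2.0],
})
result = single_state_top_n(toy, "sum_units_reimbursed", n=2)
print(result[["product_fda_list_name", "state_code"]])
```

With `n=2`, DRUG A makes both states' lists and is excluded, leaving DRUG B (KY) and DRUG C (TN).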

## Which drugs are states spending the most money on?

So far, we've determined which drugs appear to be outliers as measured by the number of units reimbursed in each state. We now need to determine which drugs appear to be outliers as measured by the total amount of money spent on them. To get started, query the Medicaid API for the drug name, the state that bought the drug and the total amount reimbursed in 2016.

In [11]:
query = "https://data.medicaid.gov/resource/neai-csgh.json?$select=state_code,product_fda_list_name,sum(total_amount_reimbursed)&$where=suppression_used=False%20and%20not%20state_code='XX'&$group=state_code,product_fda_list_name&$limit=4625479&$$app_token=v3AK8nRjxbWjtmIBGHJ9OmMlb"
amount = pd.read_json(query)
amount.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 95081 entries, 0 to 95080
Data columns (total 3 columns):
product_fda_list_name          95079 non-null object
state_code                     95081 non-null object
sum_total_amount_reimbursed    95081 non-null float64
dtypes: float64(1), object(2)
memory usage: 2.2+ MB


In [12]:
amount.head()

Unnamed: 0,product_fda_list_name,state_code,sum_total_amount_reimbursed
0,ZINC OXIDE,KY,44159.87
1,RAVICTI,TN,2396199.88
2,BICILLIN L,IN,49854.37
3,Tramadol H,WA,15903.36
4,NAPROXEN 3,NV,1353.14


Some drug names are fully capitalized and others are not. Since we'll eventually count drugs by that column, we need to standardize the capitalization so the same drug isn't counted under two spellings.

In [13]:
amount["product_fda_list_name"] = amount["product_fda_list_name"].str.upper()
amount.head()

Unnamed: 0,product_fda_list_name,state_code,sum_total_amount_reimbursed
0,ZINC OXIDE,KY,44159.87
1,RAVICTI,TN,2396199.88
2,BICILLIN L,IN,49854.37
3,TRAMADOL H,WA,15903.36
4,NAPROXEN 3,NV,1353.14


Rank the drugs by amount reimbursed within each state, so that rank 1 is the drug with the highest spending.

In [14]:
amount["rank"] = amount.groupby("state_code")["sum_total_amount_reimbursed"].rank(method="min", ascending=False).astype(int)
amount.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 95081 entries, 0 to 95080
Data columns (total 4 columns):
product_fda_list_name          95079 non-null object
state_code                     95081 non-null object
sum_total_amount_reimbursed    95081 non-null float64
rank                           95081 non-null int32
dtypes: float64(1), int32(1), object(2)
memory usage: 2.5+ MB


Create a new dataframe with the top 10 drugs in each state.

In [15]:
top_10_amount = amount[amount["rank"] <= 10]
top_10_amount.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 510 entries, 171 to 94787
Data columns (total 4 columns):
product_fda_list_name          510 non-null object
state_code                     510 non-null object
sum_total_amount_reimbursed    510 non-null float64
rank                           510 non-null int32
dtypes: float64(1), int32(1), object(2)
memory usage: 17.9+ KB


How many state top-10 lists does each drug appear in?

In [16]:
counts_amount = top_10_amount["product_fda_list_name"].value_counts().reset_index() # Create new dataframe of drug counts
counts_amount.columns = ["product_fda_list_name", "count"] # Rename columns
counts_amount.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 111 entries, 0 to 110
Data columns (total 2 columns):
product_fda_list_name    111 non-null object
count                    111 non-null int64
dtypes: int64(1), object(1)
memory usage: 1.8+ KB


Merge the dataframes into a single dataframe with both ranks and counts.

In [17]:
top_10_amount = top_10_amount.merge(counts_amount, how="inner", on="product_fda_list_name")
top_10_amount.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 510 entries, 0 to 509
Data columns (total 5 columns):
product_fda_list_name          510 non-null object
state_code                     510 non-null object
sum_total_amount_reimbursed    510 non-null float64
rank                           510 non-null int32
count                          510 non-null int64
dtypes: float64(1), int32(1), int64(1), object(2)
memory usage: 21.9+ KB


Which drugs appear in only a single state's top-10 list?

In [18]:
outliers_amount = top_10_amount[top_10_amount["count"] == 1]
outliers_amount.sort_values("product_fda_list_name", ascending=True)

Unnamed: 0,product_fda_list_name,state_code,sum_total_amount_reimbursed,rank,count
410,ABILIFY 10,MT,2452464.82,7,1
365,ADVAIR HFA,ME,3750448.51,9,1
209,ADVATE 5ML,NV,12833381.3,3,1
403,ARANESP (D,SD,1961736.07,9,1
501,COMPLERA,NJ,20299740.37,10,1
416,COMPLERA T,DC,5734734.51,9,1
503,DEXTROAMP-,MA,24331641.31,6,1
417,DULERA INH,AL,12122809.43,6,1
498,EPCLUSA,NH,2185523.57,7,1
323,EPCLUSA 4,WA,17715291.62,6,1


Export the outliers data as an Excel file.

In [19]:
outliers_amount.to_excel("outliers_amount.xlsx")