# Predicting "Opioid Prescribers" in Emergency Medicine: New Mexico Focus

**The overarching aim of this project is to predict which doctors will prescribe higher-than-average amounts of opioids and thus be termed "Opioid Prescribers".**


**The goals are to:**
- Predict opioid prescribers in Emergency Medicine across the USA
- Predict opioid prescribers in Emergency Medicine in New Mexico, USA
- Find which predictors are most significant
- Analyze and compare possible trends in opioid prescription within Emergency Medicine in the USA and New Mexico, USA

**Dataset:**
The open-source data for this project was acquired (and can be downloaded) from: https://www.kaggle.com/apryor6/us-opiate-prescriptions
The data is made up of three .csv files: "opioids.csv", "overdoses.csv", and "prescriber-info.csv".

This dataset contains summaries of prescription records for 250 common opioid and non-opioid drugs written by 25,000 unique licensed medical professionals in 2014 in the United States for citizens covered under Medicare Part D, as well as some metadata about the doctors themselves. This is a small subset of data that was sourced from cms.gov. The full dataset contains almost 24 million prescription instances in long format. Dr. Alan ("AJ") Pryor (who posted this dataset on Kaggle) has cleaned and compiled this data into a format with 1 row per prescriber and limited the approximately 1 million total unique prescribers down to 25,000 to keep it manageable. For instructions on getting the full data, please refer to the instructions listed by Dr. Pryor in the Kaggle link above.
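
To make the "long format" versus "1 row per prescriber" distinction concrete, here is a minimal sketch of that kind of reshaping on toy data. The column names `npi`, `drug_name`, and `claim_count` are assumptions made up for illustration; they are not taken from the cms.gov files, and this is not necessarily how the Kaggle dataset was built.

```python
import pandas as pd

# Toy long-format data: one row per (prescriber, drug) pair.
# Column names are hypothetical, chosen only for this illustration.
long_df = pd.DataFrame({
    "npi": [1, 1, 2, 2, 3],
    "drug_name": ["ACYCLOVIR", "XARELTO", "ACYCLOVIR", "ZETIA", "XARELTO"],
    "claim_count": [12, 30, 15, 11, 22],
})

# Pivot to wide format: one row per prescriber, one column per drug,
# with 0 where a prescriber has no record for a drug.
wide = long_df.pivot_table(
    index="npi",
    columns="drug_name",
    values="claim_count",
    aggfunc="sum",
    fill_value=0,
)
print(wide)
```

Each drug becomes its own count column, which is the shape of "prescriber-info.csv" used in the rest of this notebook.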

### A. Import packages and get the data

In [1]:
#for data analysis and wrangling
import pandas as pd
import numpy as np
import random as rnd
import re

#for data visualizations
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import plotly.plotly as py
import plotly.graph_objs as go

#for machine learning
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import Perceptron
from sklearn.linear_model import SGDClassifier
from sklearn.svm import SVC, LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris

In [2]:
prescriber = pd.read_csv('/Users/rmtaylor/Opioid_ER/prescriber-info.csv')
opioids = pd.read_csv('/Users/rmtaylor/Opioid_ER/opioids.csv')
overdose = pd.read_csv('/Users/rmtaylor/Opioid_ER/overdoses.csv')

### B. First look at the PRESCRIBER DATASET

#### 1. Prescriber Dataset

In [3]:
prescriber.head()

Unnamed: 0,NPI,Gender,State,Credentials,Specialty,ABILIFY,ACETAMINOPHEN.CODEINE,ACYCLOVIR,ADVAIR.DISKUS,AGGRENOX,...,VERAPAMIL.ER,VESICARE,VOLTAREN,VYTORIN,WARFARIN.SODIUM,XARELTO,ZETIA,ZIPRASIDONE.HCL,ZOLPIDEM.TARTRATE,Opioid.Prescriber
0,1710982582,M,TX,DDS,Dentist,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
1,1245278100,F,AL,MD,General Surgery,0,0,0,0,0,...,0,0,0,0,0,0,0,0,35,1
2,1427182161,F,NY,M.D.,General Practice,0,0,0,0,0,...,0,0,0,0,0,0,0,0,25,0
3,1669567541,M,AZ,MD,Internal Medicine,0,43,0,0,0,...,0,0,0,0,0,0,0,0,0,1
4,1679650949,M,NV,M.D.,Hematology/Oncology,0,0,0,0,0,...,0,0,0,0,17,28,0,0,0,1


In [4]:
prescriber.columns

Index(['NPI', 'Gender', 'State', 'Credentials', 'Specialty', 'ABILIFY',
       'ACETAMINOPHEN.CODEINE', 'ACYCLOVIR', 'ADVAIR.DISKUS', 'AGGRENOX',
       ...
       'VERAPAMIL.ER', 'VESICARE', 'VOLTAREN', 'VYTORIN', 'WARFARIN.SODIUM',
       'XARELTO', 'ZETIA', 'ZIPRASIDONE.HCL', 'ZOLPIDEM.TARTRATE',
       'Opioid.Prescriber'],
      dtype='object', length=256)

In [5]:
prescriber.shape

(25000, 256)

We can see that:
- There is a prescriber number (NPI), gender, state, etc., followed by a list of drugs prescribed
- There is a "Specialty" feature that may or may not contain the "Emergency Medicine" specialty we are interested in.

**Let's see if Emergency Medicine is in the Specialty feature...**

In [6]:
prescriber["Specialty"].value_counts().sort_values(ascending = False)

Internal Medicine                                                 3194
Family Practice                                                   2975
Dentist                                                           2800
Nurse Practitioner                                                2512
Physician Assistant                                               1839
Emergency Medicine                                                1087
Psychiatry                                                         691
Cardiology                                                         688
Obstetrics/Gynecology                                              615
Orthopedic Surgery                                                 575
Optometry                                                          571
Student in an Organized Health Care Education/Training Program     547
Ophthalmology                                                      519
General Surgery                                                    487
Gastro

In [7]:
# We could have also done it this way but it is harder to read
# prescriber["Specialty"].unique()

- We see that "Emergency Medicine" is in fact one of the specialties and that there are 1087 entries for EM

**Let's make a new DataFrame that only contains the Emergency Medicine data. We will do this to**
- Have a smaller dataset to work with, containing only EM
- See what unique credentials appear within the EM specialty

In [8]:
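# .copy() gives an independent DataFrame, so the column assignments further down do not act on a slice of prescriber
em_prescriber = prescriber[prescriber["Specialty"] == "Emergency Medicine"].copy()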
em_prescriber.head()

Unnamed: 0,NPI,Gender,State,Credentials,Specialty,ABILIFY,ACETAMINOPHEN.CODEINE,ACYCLOVIR,ADVAIR.DISKUS,AGGRENOX,...,VERAPAMIL.ER,VESICARE,VOLTAREN,VYTORIN,WARFARIN.SODIUM,XARELTO,ZETIA,ZIPRASIDONE.HCL,ZOLPIDEM.TARTRATE,Opioid.Prescriber
68,1114977758,M,NE,MD,Emergency Medicine,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
91,1265477343,M,TX,DO,Emergency Medicine,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
108,1588612477,F,CA,,Emergency Medicine,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
109,1881825461,M,FL,D.O.,Emergency Medicine,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
118,1467627729,M,FL,,Emergency Medicine,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1


In [9]:
em_prescriber["Credentials"].value_counts().sort_values(ascending = False)

MD                    479
M.D.                  373
D.O.                   88
DO                     69
M.D                    11
D.O                    10
M. D.                   3
M.D., M.S.              2
MD, MBA                 2
MD PHD                  1
M.D/                    1
MD, PHD                 1
M.D..                   1
M.D,                    1
D.O.,                   1
MD FACEP                1
M.D., PH.D.             1
M.B.B.S                 1
M.D., M.B.A             1
BA,,DC, DO,JD, LLM      1
M.D. PHD                1
MD, FACEP, FAAEM        1
M.D., MS                1
M.D., M.P.H.            1
MD, MSBS                1
FACEP MD                1
Name: Credentials, dtype: int64

##### We learn several different things from looking at the credentials:
- Most are MD and DO
- We will need to clean the credentials by:
    - removing periods, commas, and dashes in all credentials
    - for all listings with multiple degrees in a single row, we will drop the less advanced degrees, leaving only the medical degree.
        - HOWEVER, we will keep the other degrees listed in a new feature called "OtherDegree" in case it is useful
    - We will then look at the credential list again to make sure we have cleaned up all info.
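
As an aside, the cleaning plan above can also be sketched as a single parsing function. This is an alternative illustration under stated assumptions, not the approach the following cells take: `parse_credentials` and `MEDICAL_DEGREES` are hypothetical names, only MD and DO are treated as the terminal medical degree, and the sample strings are copied from the credential counts shown above.

```python
import re
import pandas as pd

# Assumption: only these two tokens count as the terminal medical degree.
MEDICAL_DEGREES = ("MD", "DO")

def parse_credentials(raw):
    """Return (title, other_degrees) for one raw credential string."""
    if pd.isna(raw):
        return "Unknown", []
    # Remove periods and slashes, e.g. "M.D/" -> "MD".
    text = re.sub(r"[./]", "", str(raw).upper())
    # Rejoin degrees that were written with a space, e.g. "M D" -> "MD".
    text = re.sub(r"\bM\s+D\b", "MD", text)
    text = re.sub(r"\bD\s+O\b", "DO", text)
    # Treat commas and whitespace as separators; drop empty tokens.
    tokens = [t for t in re.split(r"[,\s]+", text) if t]
    # The title is the first MD/DO token found, wherever it appears.
    title = next((t for t in tokens if t in MEDICAL_DEGREES), "Unknown")
    others = [t for t in tokens if t != title]
    return title, others

samples = ["M.D.", "M. D.", "D.O.,", "MD PHD", "FACEP MD", "BA,,DC, DO,JD, LLM", None]
for s in samples:
    print(repr(s), "->", parse_credentials(s))
```

Unlike the step-by-step cleaning below, this sketch leaves a token such as "MSBS" unsplit and reports "MBBS" as an other degree under an "Unknown" title, so its results would differ slightly from the notebook's.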

Let's start by removing the periods

In [10]:
em_prescriber = em_prescriber.replace({'Credentials': r'\.'}, {'Credentials': ''}, regex=True)

In [11]:
em_prescriber["Credentials"].value_counts().sort_values(ascending = False)

MD                    864
DO                    167
M D                     3
MD, MBA                 3
MD, MS                  3
MD, PHD                 2
MD PHD                  2
MD,                     1
MBBS                    1
DO,                     1
MD FACEP                1
BA,,DC, DO,JD, LLM      1
MD/                     1
MD, MPH                 1
MD, FACEP, FAAEM        1
MD, MSBS                1
FACEP MD                1
Name: Credentials, dtype: int64

Let's also remove the slash from the one record that contains one...

In [12]:
em_prescriber = em_prescriber.replace({'Credentials': r'/'}, {'Credentials': ''}, regex=True)

In [13]:
em_prescriber["Credentials"].value_counts().sort_values(ascending = False)

MD                    865
DO                    167
MD, MBA                 3
MD, MS                  3
M D                     3
MD, PHD                 2
MD PHD                  2
MBBS                    1
MD,                     1
FACEP MD                1
MD FACEP                1
BA,,DC, DO,JD, LLM      1
MD, MPH                 1
MD, FACEP, FAAEM        1
MD, MSBS                1
DO,                     1
Name: Credentials, dtype: int64

...and now let's remove all whitespace, which fixes the records that show "M D"

In [14]:
em_prescriber = em_prescriber.replace({'Credentials': r'\s'}, {'Credentials': ''}, regex=True)

In [15]:
em_prescriber["Credentials"].value_counts().sort_values(ascending = False)

MD                  868
DO                  167
MD,MBA                3
MD,MS                 3
MD,PHD                2
MDPHD                 2
MD,MSBS               1
MBBS                  1
MD,                   1
MDFACEP               1
MD,MPH                1
FACEPMD               1
BA,,DC,DO,JD,LLM      1
DO,                   1
MD,FACEP,FAAEM        1
Name: Credentials, dtype: int64

Now, to make splitting the names easier, let's add a comma to the "MDPHD", "MDFACEP", and "FACEPMD" records

(The record that starts with "BA,,..." needs separate handling and is fixed a few cells further down)

(I also filled in NaN values with "Unknown")

In [16]:
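# Let's deal with the special cases...
# remove the comma after DO and MD
# NOTE: the pattern is not anchored to the end of the string, so it also strips the comma inside
#       multi-degree records (e.g. "MD,MBA" becomes "MDMBA"); those are re-split just below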
em_prescriber = em_prescriber.replace({'Credentials': r'DO,'}, {'Credentials': 'DO'}, regex=True)
em_prescriber = em_prescriber.replace({'Credentials': r'MD,'}, {'Credentials': 'MD'}, regex=True)

# separate these records with commas (no space) and put in the correct order (terminal degree first)
em_prescriber = em_prescriber.replace({'Credentials': r'MDPHD'}, {'Credentials': 'MD,PHD'}, regex=True)
em_prescriber = em_prescriber.replace({'Credentials': r'MDFACEP'}, {'Credentials': 'MD,FACEP'}, regex=True)
em_prescriber = em_prescriber.replace({'Credentials': r'MDMS'}, {'Credentials': 'MD,MS'}, regex=True)
em_prescriber = em_prescriber.replace({'Credentials': r'MDMBA'}, {'Credentials': 'MD,MBA'}, regex=True)
em_prescriber = em_prescriber.replace({'Credentials': r'MDMPH'}, {'Credentials': 'MD,MPH'}, regex=True)

em_prescriber = em_prescriber.replace({'Credentials': r'FACEPMD'}, {'Credentials': 'MD,FACEP'}, regex=True)
em_prescriber = em_prescriber.replace({'Credentials': r'MSBS'}, {'Credentials': 'MS,BS'}, regex=True)

# MBBS (Bachelor of Medicine, Bachelor of Surgery) is a medical degree awarded outside the US rather than an MD or DO;
# since there is only one such record, I will move it to "Unknown" terminal degree
em_prescriber = em_prescriber.replace({'Credentials': r'MBBS'}, {'Credentials': 'Unknown'}, regex=True)

# fill the empty records with "Unknown" terminal degree
em_prescriber = em_prescriber.fillna("Unknown")

In [17]:
em_prescriber["Credentials"].value_counts().sort_values(ascending = False)

MD                 869
DO                 168
Unknown             34
MD,PHD               4
MD,MBA               3
MD,MS                3
MD,FACEP             2
MD,MS,BS             1
MD,MPH               1
BA,,DC,DOJD,LLM      1
MD,FACEP,FAAEM       1
Name: Credentials, dtype: int64

In [18]:
# I'll now deal with the last problem child ...
em_prescriber = em_prescriber.replace({'Credentials': r'BA,,DC,DOJD,LLM'}, {'Credentials': 'DO,JD,DC,LLM,BA'}, regex=True) # the comma after DO is missing here because the first command in In [16] removed it

In [19]:
em_prescriber["Credentials"].value_counts().sort_values(ascending = False)

MD                 869
DO                 168
Unknown             34
MD,PHD               4
MD,MBA               3
MD,MS                3
MD,FACEP             2
MD,MS,BS             1
DO,JD,DC,LLM,BA      1
MD,MPH               1
MD,FACEP,FAAEM       1
Name: Credentials, dtype: int64

**Now, let's split the data into "Title" and "OtherDegree" features**

In [20]:
# new data frame with split value columns 
new = em_prescriber["Credentials"].str.split(",", n = 5, expand = True) 

# making separate Title column from new data frame
em_prescriber['Title'] = new[0]

# making separate OtherDegree1 column from new data frame
em_prescriber["OtherDegree1"] = new[1]

# making separate OtherDegree(n) columns from new data frame
# NOTE!! IT IS ASSUMED THAT TO HAVE AN MD OR DO DEGREE THAT YOU WOULD HAVE A BA OR BS DEGREE ALREADY. SO, I'M NOT INTERESTED IN THOSE.
#       THERE ARE ALSO ONLY 2 RECORDS WITH DEGREES/CERTIFICATES BEYOND THE 1ST 2 INDICES OF THE STRINGS ABOVE.
# THEREFORE, I AM ONLY INTERESTED IN THE FIRST 2 LEVELS BEYOND THE TERMINAL DEGREE...
em_prescriber["OtherDegree2"]= new[2] 
em_prescriber["OtherDegree3"]= new[3] 

# Dropping old Credentials column 
em_prescriber.drop(columns =["Credentials"], inplace = True) 

In [21]:
em_prescriber.head()

Unnamed: 0,NPI,Gender,State,Specialty,ABILIFY,ACETAMINOPHEN.CODEINE,ACYCLOVIR,ADVAIR.DISKUS,AGGRENOX,ALENDRONATE.SODIUM,...,WARFARIN.SODIUM,XARELTO,ZETIA,ZIPRASIDONE.HCL,ZOLPIDEM.TARTRATE,Opioid.Prescriber,Title,OtherDegree1,OtherDegree2,OtherDegree3
68,1114977758,M,NE,Emergency Medicine,0,0,0,0,0,0,...,0,0,0,0,0,1,MD,,,
91,1265477343,M,TX,Emergency Medicine,0,0,0,0,0,0,...,0,0,0,0,0,1,DO,,,
108,1588612477,F,CA,Emergency Medicine,0,0,0,0,0,0,...,0,0,0,0,0,1,Unknown,,,
109,1881825461,M,FL,Emergency Medicine,0,0,0,0,0,0,...,0,0,0,0,0,1,DO,,,
118,1467627729,M,FL,Emergency Medicine,0,0,0,0,0,0,...,0,0,0,0,0,1,Unknown,,,


In [22]:
em_prescriber["Title"].value_counts().sort_values(ascending = False)

MD         884
DO         169
Unknown     34
Name: Title, dtype: int64

In [23]:
em_prescriber["OtherDegree1"].value_counts().sort_values(ascending = False)

MS       4
PHD      4
FACEP    3
MBA      3
MPH      1
JD       1
Name: OtherDegree1, dtype: int64

In [24]:
em_prescriber["OtherDegree2"].value_counts().sort_values(ascending = False)

BS       1
DC       1
FAAEM    1
Name: OtherDegree2, dtype: int64

In [25]:
em_prescriber["OtherDegree3"].value_counts().sort_values(ascending = False)

LLM    1
Name: OtherDegree3, dtype: int64

Since those with degrees in the "OtherDegree" lists 2 and 3 also have degrees in the 1st list, I will drop the "OtherDegree2" and "OtherDegree3" columns

I will also now turn the strings in the "Title" and "OtherDegree" columns to integers
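
One possible way to do this kind of string-to-integer encoding is `Series.map`, sketched here on toy data. With `map`, any value missing from the dictionary becomes NaN, so an unexpected category is easy to spot before converting to integers. The codes mirror the ones used in the cells below; the toy Series is made up for illustration.

```python
import pandas as pd

# Same codes as used for the Title column in this notebook.
title_codes = {"MD": 0, "DO": 1, "Unknown": 2}

# Toy data standing in for em_prescriber["Title"].
toy_titles = pd.Series(["MD", "DO", "Unknown", "MD"])

encoded = toy_titles.map(title_codes)

# Any title missing from title_codes would show up as NaN here.
unmapped = toy_titles[encoded.isna()].unique().tolist()
print("unmapped titles:", unmapped)

encoded = encoded.astype("int64")
print(encoded.tolist())
```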

In [26]:
# drop columns
em_prescriber.drop(columns =["OtherDegree2", "OtherDegree3"], inplace = True) 

In [27]:
em_prescriber.head()

Unnamed: 0,NPI,Gender,State,Specialty,ABILIFY,ACETAMINOPHEN.CODEINE,ACYCLOVIR,ADVAIR.DISKUS,AGGRENOX,ALENDRONATE.SODIUM,...,VOLTAREN,VYTORIN,WARFARIN.SODIUM,XARELTO,ZETIA,ZIPRASIDONE.HCL,ZOLPIDEM.TARTRATE,Opioid.Prescriber,Title,OtherDegree1
68,1114977758,M,NE,Emergency Medicine,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,MD,
91,1265477343,M,TX,Emergency Medicine,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,DO,
108,1588612477,F,CA,Emergency Medicine,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,Unknown,
109,1881825461,M,FL,Emergency Medicine,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,DO,
118,1467627729,M,FL,Emergency Medicine,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,Unknown,


In [28]:
em_prescriber["OtherDegree1"].value_counts().sort_values(ascending = False)

MS       4
PHD      4
FACEP    3
MBA      3
MPH      1
JD       1
Name: OtherDegree1, dtype: int64

In [29]:
em_prescriber['Title'] = em_prescriber['Title'].replace({'MD': 0, 'DO': 1, 'Unknown': 2 })

In [30]:
em_prescriber["OtherDegree1"].value_counts().sort_values(ascending = False)

MS       4
PHD      4
FACEP    3
MBA      3
MPH      1
JD       1
Name: OtherDegree1, dtype: int64

In [31]:
em_prescriber['OtherDegree1'] = em_prescriber['OtherDegree1'].replace({'PHD':1, 'MS':1, 'FACEP':1, 'MBA':1, 'MPH':1, 'JD': 1})

In [32]:
em_prescriber['OtherDegree1'] = em_prescriber['OtherDegree1'].fillna(0).astype(int)

In [33]:
em_prescriber.head()

Unnamed: 0,NPI,Gender,State,Specialty,ABILIFY,ACETAMINOPHEN.CODEINE,ACYCLOVIR,ADVAIR.DISKUS,AGGRENOX,ALENDRONATE.SODIUM,...,VOLTAREN,VYTORIN,WARFARIN.SODIUM,XARELTO,ZETIA,ZIPRASIDONE.HCL,ZOLPIDEM.TARTRATE,Opioid.Prescriber,Title,OtherDegree1
68,1114977758,M,NE,Emergency Medicine,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
91,1265477343,M,TX,Emergency Medicine,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,1,0
108,1588612477,F,CA,Emergency Medicine,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,2,0
109,1881825461,M,FL,Emergency Medicine,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,1,0
118,1467627729,M,FL,Emergency Medicine,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,2,0


In [34]:
em_prescriber["OtherDegree1"].value_counts().sort_values(ascending = False)

0    1071
1      16
Name: OtherDegree1, dtype: int64
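
The same 0/1 flag could also be derived in one step from whether a value is present at all. This is only a sketch on toy data, assuming that a missing OtherDegree1 means "no additional degree listed"; it also sets the integer width explicitly.

```python
import pandas as pd

# Toy stand-in for the OtherDegree1 column before encoding.
toy_other = pd.Series([None, "PHD", "MS", None, "FACEP"])

# 1 if any additional degree is listed, 0 otherwise.
has_other_degree = toy_other.notna().astype("int64")
print(has_other_degree.tolist())
```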

**Now, let's look at one of the drug columns and see what the data looks like**

I'll look at Aggrenox initially

In [35]:
em_prescriber["AGGRENOX"].value_counts().sort_values(ascending = False)

0     1081
11       1
12       1
13       1
17       1
27       1
69       1
Name: AGGRENOX, dtype: int64

We can see that they are integers and thus would be fine the way they are for downstream machine learning algorithms

These are the numbers of prescriptions written by that specific prescriber during the period in which the data was collected (2014), so this makes sense.
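
Rather than inspecting one drug column at a time, the non-numeric columns can be listed in a single call. A minimal sketch on a toy frame (the values are made up; only the column names come from the dataset):

```python
import pandas as pd

# Toy frame with one text column and two count columns.
toy = pd.DataFrame({
    "State": ["NM", "TX"],
    "AGGRENOX": [0, 11],
    "ACYCLOVIR": [0, 3],
})

# Columns that are not numeric and would still need encoding before modeling.
non_numeric = toy.select_dtypes(exclude="number").columns.tolist()
print(non_numeric)
```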

**We can see that the NPI number is unique for each prescriber. Therefore, we will drop it from the dataset**

In [36]:
em_prescriber = em_prescriber.drop(['NPI'], axis = 1)

In [37]:
em_prescriber.head()

Unnamed: 0,Gender,State,Specialty,ABILIFY,ACETAMINOPHEN.CODEINE,ACYCLOVIR,ADVAIR.DISKUS,AGGRENOX,ALENDRONATE.SODIUM,ALLOPURINOL,...,VOLTAREN,VYTORIN,WARFARIN.SODIUM,XARELTO,ZETIA,ZIPRASIDONE.HCL,ZOLPIDEM.TARTRATE,Opioid.Prescriber,Title,OtherDegree1
68,M,NE,Emergency Medicine,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
91,M,TX,Emergency Medicine,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,1,0
108,F,CA,Emergency Medicine,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,2,0
109,M,FL,Emergency Medicine,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,1,0
118,M,FL,Emergency Medicine,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,2,0


**Let's encode the Gender values as integers now (M = 0, F = 1)**

In [38]:
em_prescriber['Gender'] = em_prescriber['Gender'].replace({'M':0, 'F':1 })

**Finally, let's drop the Specialty feature, since it is now "Emergency Medicine" for every row.**

In [39]:
em_prescriber = em_prescriber.drop(['Specialty'], axis = 1)

In [40]:
em_prescriber.head()

Unnamed: 0,Gender,State,ABILIFY,ACETAMINOPHEN.CODEINE,ACYCLOVIR,ADVAIR.DISKUS,AGGRENOX,ALENDRONATE.SODIUM,ALLOPURINOL,ALPRAZOLAM,...,VOLTAREN,VYTORIN,WARFARIN.SODIUM,XARELTO,ZETIA,ZIPRASIDONE.HCL,ZOLPIDEM.TARTRATE,Opioid.Prescriber,Title,OtherDegree1
68,0,NE,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
91,0,TX,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,1,0
108,1,CA,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,2,0
109,0,FL,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,1,0
118,0,FL,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,2,0


In [41]:
# We now look at the column info and find the column "OtherDegree1" is int32, whereas everything else is int64

# em_prescriber.info(verbose=True)

In [42]:
em_prescriber['OtherDegree1'] = em_prescriber.OtherDegree1.astype(np.int64)

In [43]:
# em_prescriber.info(verbose=True)

**So, now all the data is int64 except the State feature. I'd rather not change it yet, to make initial data visualization easier, but let's look at it...**
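
When the State feature does need to become numeric, one common option is one-hot encoding, since states have no natural order. A minimal sketch on toy data, offered as one possibility rather than the method this notebook settles on:

```python
import pandas as pd

# Toy stand-in for a few em_prescriber rows.
toy = pd.DataFrame({"State": ["NM", "TX", "NM", "CA"], "Gender": [0, 1, 1, 0]})

# One indicator column per state, named State_<code>.
state_dummies = pd.get_dummies(toy["State"], prefix="State", dtype=int)

# Replace the text column with the indicator columns.
encoded = pd.concat([toy.drop(columns="State"), state_dummies], axis=1)
print(encoded)
```

With around 50 states and territories this adds that many columns, which is worth keeping in mind given that some states have only a handful of EM prescribers.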

In [44]:
em_prescriber['State'].value_counts().sort_values(ascending=False)

CA    113
TX     73
FL     67
NY     62
PA     52
MI     47
IL     47
OH     44
IN     36
GA     34
VA     31
NC     29
MA     27
MO     25
NJ     25
WA     24
CO     23
AZ     21
MD     19
KY     19
SC     18
AL     17
MN     17
TN     16
WI     16
NV     15
CT     15
WV     15
LA     14
OR     13
OK     13
MS     12
UT     11
IA     11
NM      9
AR      6
ND      6
RI      5
DC      5
PR      5
DE      5
HI      4
MT      4
KS      4
ME      3
NE      3
ID      2
VT      2
SD      1
NH      1
VI      1
Name: State, dtype: int64

### C. Data Visualizations and Analysis

Let's start by looking for any correlations...
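
The cells below compare group means of the target. As a complementary view, the Pearson correlation of every numeric feature with the target can be computed in one call. This is a sketch on made-up toy values, not results from the dataset; with a 0/1 target, Pearson's r is the point-biserial correlation.

```python
import pandas as pd

# Toy data: values are invented purely to show the mechanics.
toy = pd.DataFrame({
    "Gender": [0, 1, 0, 1, 0, 1],
    "ACETAMINOPHEN.CODEINE": [12, 0, 30, 0, 25, 14],
    "Opioid.Prescriber": [1, 0, 1, 0, 1, 1],
})

target = toy["Opioid.Prescriber"]
features = toy.drop(columns="Opioid.Prescriber")

# Correlation of each feature column with the target, strongest positive first.
corr_with_target = features.corrwith(target).sort_values(ascending=False)
print(corr_with_target)
```

On the real data, non-numeric columns such as State would need to be dropped or encoded first, and constant columns yield NaN.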

In [45]:
em_prescriber.head()

Unnamed: 0,Gender,State,ABILIFY,ACETAMINOPHEN.CODEINE,ACYCLOVIR,ADVAIR.DISKUS,AGGRENOX,ALENDRONATE.SODIUM,ALLOPURINOL,ALPRAZOLAM,...,VOLTAREN,VYTORIN,WARFARIN.SODIUM,XARELTO,ZETIA,ZIPRASIDONE.HCL,ZOLPIDEM.TARTRATE,Opioid.Prescriber,Title,OtherDegree1
68,0,NE,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
91,0,TX,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,1,0
108,1,CA,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,2,0
109,0,FL,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,1,0
118,0,FL,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,2,0


In [46]:
# Mean Opioid.Prescriber rate by Title (a rough check for association)
em_prescriber[["Title", "Opioid.Prescriber"]].groupby(["Title"], as_index=False).mean().sort_values(by="Opioid.Prescriber", ascending=False)

Unnamed: 0,Title,Opioid.Prescriber
0,0,0.964932
1,1,0.952663
2,2,0.911765


In [61]:
# Mean Opioid.Prescriber rate by State (a rough check for association)
state_corr = em_prescriber[["State", "Opioid.Prescriber"]].groupby(["State"], as_index=False).mean().sort_values(by="Opioid.Prescriber", ascending=False)
state_corr

Unnamed: 0,State,Opioid.Prescriber
0,AL,1.0
20,ME,1.0
1,AR,1.0
26,NC,1.0
27,ND,1.0
28,NE,1.0
29,NH,1.0
31,NM,1.0
32,NV,1.0
39,RI,1.0


In [68]:
# Mean Opioid.Prescriber rate by Acetaminophen/codeine prescription count
# (ACTIQ is listed in opioids.csv but is not a column of the prescriber data, so selecting it raised a KeyError)
acetcod_corr = em_prescriber[["ACETAMINOPHEN.CODEINE", "Opioid.Prescriber"]].groupby(["ACETAMINOPHEN.CODEINE"], as_index=False).mean().sort_values(by="Opioid.Prescriber", ascending=False)
acetcod_corr

### D. Split the data into test and training sets before we start data visualizations and analysis

Before we start using data visualization to look at the data closer, let's split it into a train and test set and then only look at the train set.

Our target is "Opioid.Prescriber" since we are trying to predict which prescribers might be considered opioid prescribers within the emergency medicine specialty.

In [None]:
# Create target object and call it y
y=em_prescriber['Opioid.Prescriber']

# Create X
features = em_prescriber.columns.drop('Opioid.Prescriber')
X = em_prescriber[features]

# Split into validation and training data
train_X, val_X, train_y, val_y = train_test_split(X, y, test_size = 0.2, random_state=1)
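
Because most EM prescribers carry the label 1 (the group means by Title shown earlier are all above 0.9), a plain random split can leave very few 0s in the validation set. One option, sketched here on toy data, is the `stratify` argument of `train_test_split`, which keeps the class proportions the same in both parts. The toy frame and its values are invented for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 15 positives and 5 negatives.
toy = pd.DataFrame({
    "Gender": [0, 1] * 10,
    "ACETAMINOPHEN.CODEINE": list(range(20)),
    "Opioid.Prescriber": [1] * 15 + [0] * 5,
})

X = toy.drop(columns="Opioid.Prescriber")
y = toy["Opioid.Prescriber"]

# stratify=y preserves the 75/25 class ratio in both train and validation.
train_X, val_X, train_y, val_y = train_test_split(
    X, y, test_size=0.2, random_state=1, stratify=y
)

print(train_y.value_counts().to_dict())
print(val_y.value_counts().to_dict())
```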

**Now let's see the information about the dataset...**

In [None]:
prescriber.info()

#### 2. Opioid Dataset

In [67]:
opioids.head()

Unnamed: 0,Drug Name,Generic Name
0,ABSTRAL,FENTANYL CITRATE
1,ACETAMINOPHEN-CODEINE,ACETAMINOPHEN WITH CODEINE
2,ACTIQ,FENTANYL CITRATE
3,ASCOMP WITH CODEINE,CODEINE/BUTALBITAL/ASA/CAFFEIN
4,ASPIRIN-CAFFEINE-DIHYDROCODEIN,DIHYDROCODEINE/ASPIRIN/CAFFEIN


In [None]:
opioids.tail()

In [69]:
opioids.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 113 entries, 0 to 112
Data columns (total 2 columns):
Drug Name       113 non-null object
Generic Name    113 non-null object
dtypes: object(2)
memory usage: 1.8+ KB
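
The drug names here use hyphens and spaces (for example "ACETAMINOPHEN-CODEINE"), while the prescriber columns use periods ("ACETAMINOPHEN.CODEINE"). A possible way to find which opioid names exist as prescriber columns is sketched below. The helper `to_column_name` is hypothetical, and its rule (replace each run of non-alphanumeric characters with a period) is an assumption inferred from those examples; the toy column list is a small hand-picked subset.

```python
import re
import pandas as pd

def to_column_name(drug_name):
    """Hypothetical helper: convert a drug name to the assumed column style."""
    # Assumption: each run of non-alphanumeric characters becomes one period.
    return re.sub(r"[^A-Z0-9]+", ".", drug_name.upper()).strip(".")

# Toy stand-ins for opioids["Drug Name"] and prescriber.columns.
toy_opioids = pd.DataFrame(
    {"Drug Name": ["ABSTRAL", "ACETAMINOPHEN-CODEINE", "ACTIQ"]}
)
toy_columns = ["ABILIFY", "ACETAMINOPHEN.CODEINE", "ACYCLOVIR"]

candidates = toy_opioids["Drug Name"].map(to_column_name).tolist()
present = [c for c in candidates if c in toy_columns]
missing = [c for c in candidates if c not in toy_columns]

print("opioid columns present:", present)
print("opioid names without a column:", missing)
```

Checking names this way before selecting columns avoids a KeyError for drugs that appear in the opioid list but not among the prescriber columns.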


#### 3. Overdose Dataset

In [None]:
overdose.head()

In [None]:
overdose.tail()

In [None]:
overdose.info()