# Predicting Opioid Prescribers

Created on: 29th January 18

In [None]:
###################################
# Author: Abhijay
# Created: 29th Jan 18
# Last modified date: 31st Jan 18
###################################

import warnings
warnings.filterwarnings('ignore')

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import preprocessing
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.pipeline import Pipeline
from keras.callbacks import EarlyStopping
from keras.models import Sequential
from keras.layers import Dense
from keras.wrappers.scikit_learn import KerasClassifier

# 1. Introduction

Opiates are drugs derived naturally from opium, such as morphine and codeine; heroin is a semi-synthetic derivative of morphine and is illegal. Opioids are the broader class, which also covers synthetic drugs with similar effects, many of them legally available by prescription. Opioids are prescribed primarily as pain relievers despite a high risk of addiction and overdose. The rise in deaths linked to opioid use became alarming enough that it was declared an epidemic.

As of 31st January 2018 the opioid epidemic is still a crisis, and tweets discussing it keep pouring in: [recent tweets related to the opioid crisis](https://twitter.com/search?q=opioid+crisis&ref_src=twsrc%5Egoogle%7Ctwcamp%5Eserp%7Ctwgr%5Esearch).

An article explaining the crisis: [The opioid epidemic may be even deadlier than we think (Vox)](https://www.vox.com/science-and-health/2017/4/26/15425972/opioid-epidemic-overdose-deadlier-study)

Recent news on the epidemic: [Fast facts on Opioid Crisis](https://edition.cnn.com/2017/09/18/health/opioid-crisis-fast-facts/index.html)

# 2. Objective

The objectives of this notebook are:
1. Find features that are useful for classifying whether a prescriber would prescribe opioids.
2. Build a prediction model that learns from prescribers who prescribed opioids and can later predict how likely a prescriber is to prescribe opioids.

https://twitter.com/StefanMolyneux/status/958538041345310721

# 3. Data

[U.S. Opiate Prescriptions/Overdoses on Kaggle](https://www.kaggle.com/apryor6/us-opiate-prescriptions) has a subset of the data from [cms.gov: Medicare Provider Charge Data](https://www.cms.gov/Research-Statistics-Data-and-Systems/Statistics-Trends-and-Reports/Medicare-Provider-Charge-Data/Part-D-Prescriber.html). More data can also be downloaded using the [script here](https://raw.githubusercontent.com/apryor6/apryor6.github.io/master/Identifying-Opioid-Prescribers/create-dataset.R).

The data description is as follows:
1. NPI – unique National Provider Identifier number
2. Gender - (M/F)
3. State - U.S. State by abbreviation
4. Credentials - set of initials indicative of medical degree
5. Specialty - description of type of medicinal practice
6. A long list of drugs with numeric values indicating the total number of prescriptions written for the year by that individual
7. Opioid.Prescriber - a boolean label indicating whether or not that individual prescribed opiate drugs more than 10 times in the year
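
The column layout described above can be checked with a few lines of pandas. The sketch below uses a made-up three-row stand-in for `prescriber-info.csv` (the values and the single drug column are illustrative assumptions, not taken from the dataset) and separates the metadata columns, the drug-count columns, and the label.

```python
import pandas as pd

# Hypothetical stand-in for prescriber-info.csv; all values are made up.
toy = pd.DataFrame({
    "NPI": [1, 2, 3],
    "Gender": ["M", "F", "F"],
    "State": ["TX", "AL", "NY"],
    "Credentials": ["MD", "DDS", "NP"],
    "Specialty": ["Family Practice", "Dentist", "Nurse Practitioner"],
    "HYDROCODONE.ACETAMINOPHEN": [17, 0, 0],  # illustrative drug-count column
    "Opioid.Prescriber": [1, 0, 0],
})

meta_cols = ["NPI", "Gender", "State", "Credentials", "Specialty"]
label_col = "Opioid.Prescriber"

# Everything that is neither metadata nor the label is a drug-count column.
drug_cols = [c for c in toy.columns if c not in meta_cols + [label_col]]

# Share of each class in the label, useful for spotting class imbalance.
label_share = toy[label_col].value_counts(normalize=True)
print(drug_cols)
print(label_share)
```

Running the same three steps on the real `prescribers` frame would show how balanced the `Opioid.Prescriber` label is before any model is trained.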

In [None]:
prescribers = pd.read_csv('../input/prescriber-info.csv')

In [None]:
# prescribers.shape

In [None]:
prescribers.head()

In [None]:
prescribers.describe()

In [None]:
# prescribers.columns

## 3.1 Data Cleaning

In [None]:
# len(prescribers['Specialty'].unique())

In [None]:
specialty = pd.DataFrame(prescribers.groupby(['Specialty']).count()['NPI']).sort_values('NPI')

In [None]:
# specialty.loc[specialty['NPI']<40].shape

**Of the 109 unique specialties, 63 have a count of less than 40. These need to be merged into an 'Other' category or associated with a more generic specialty.**

In [None]:
rareSpecialty = list(specialty.loc[specialty['NPI']<40].index)

**Rare specialties whose name contains the word 'Surgery' are grouped under 'Surgery'; the rest are combined into the 'Other' category.**

In [None]:
# Rare specialties containing the word 'Surgery' become 'Surgery'; other rare ones become 'Other'
isRare = prescribers['Specialty'].isin(rareSpecialty)
prescribers.loc[isRare, 'Specialty'] = prescribers.loc[isRare, 'Specialty'].apply(
    lambda x: 'Surgery' if 'Surgery' in x.split() else 'Other')
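
To make the grouping rule concrete, here is a minimal sketch on a made-up six-row frame. The specialty names and the threshold of 2 are assumptions chosen so the toy data has rare categories; the notebook itself uses a threshold of 40.

```python
import pandas as pd

# Toy data: one common specialty and three rare ones (names are illustrative).
toy = pd.DataFrame({
    "Specialty": ["Internal Medicine"] * 3
    + ["Thoracic Surgery", "Hand Surgery", "Geriatric Medicine"]
})

min_count = 2  # hypothetical threshold for the toy data; the notebook uses 40
counts = toy["Specialty"].value_counts()
rare = counts[counts < min_count].index

# Relabel only the rare rows: anything mentioning 'Surgery' is pooled
# under 'Surgery', everything else under 'Other'.
is_rare = toy["Specialty"].isin(rare)
toy.loc[is_rare, "Specialty"] = toy.loc[is_rare, "Specialty"].apply(
    lambda s: "Surgery" if "Surgery" in s.split() else "Other"
)

result = toy["Specialty"].tolist()
print(result)
```

The common specialty is left untouched, while the two surgical specialties collapse into one category and the remaining rare one falls into 'Other'.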

**Checking number of unique Credentials:**

In [None]:
# Count prescribers per credential string
Credentials = pd.DataFrame(prescribers.groupby(['Credentials']).count()['NPI']).reset_index()

In [None]:
Credentials[Credentials['NPI']<20]

* There are too many unique credentials for this column to be useful in model building, so it should be discarded.


**The NPI column is only an identifier for the prescriber, so it should also be removed.**

In [None]:
prescribersData = prescribers.drop( ['NPI','Credentials'], axis=1)

**Cleaned data sample:**

In [None]:
prescribersData.head()

* Creating dummy variables from the categorical columns. (n-1) dummies are kept per column to avoid multicollinearity.

In [None]:
prescribersData = pd.get_dummies(prescribersData, columns=['Gender','Specialty','State'], drop_first=True)
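
The effect of `drop_first=True` is easiest to see on a tiny frame. The sketch below uses made-up values; only the column names `Gender` and `State` mirror the notebook.

```python
import pandas as pd

# Toy frame with two categorical columns and one numeric column.
toy = pd.DataFrame({
    "Gender": ["M", "F", "F"],
    "State": ["TX", "AL", "NY"],
    "count": [3, 0, 5],
})

# drop_first=True drops the first (alphabetically smallest) level of each
# categorical column, leaving n-1 indicator columns per column.
encoded = pd.get_dummies(toy, columns=["Gender", "State"], drop_first=True)
cols = list(encoded.columns)
print(cols)
```

`Gender` has two levels and yields one dummy, `State` has three levels and yields two; the dropped levels (`F` and `AL`) become the implicit baseline.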

In [None]:
# len(prescribersData.columns)

# 4. Experiments

## 4.1 Creating and Evaluating Model using Cross Validation

In [None]:
# fix random seed for reproducibility
seed = 7
np.random.seed(seed)

In [None]:
# load dataset
X = prescribersData.drop(['Opioid.Prescriber'],axis=1).values.astype(float)
Y = prescribersData['Opioid.Prescriber'].values

In [None]:
# # encode class values as integers
# encoder = LabelEncoder()
# encoder.fit(Y)
# encoded_Y = encoder.transform(Y)
# ## See and remove IMP

In [None]:
def create_baseline():
    # create model
    model = Sequential()
    model.add(Dense(60, input_dim=354, kernel_initializer='normal', activation='relu'))
    model.add(Dense(1, kernel_initializer='normal', activation='sigmoid'))
    # Compile model
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model
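
As a quick sanity check on the size of this baseline network, the number of trainable parameters of a fully connected layer is inputs times units plus one bias per unit. The helper `dense_params` below is a hypothetical name introduced only for this arithmetic.

```python
def dense_params(n_inputs, n_units):
    """Weights plus biases of one fully connected layer."""
    return n_inputs * n_units + n_units

hidden = dense_params(354, 60)  # 354*60 weights + 60 biases = 21300
output = dense_params(60, 1)    # 60 weights + 1 bias = 61
total = hidden + output         # 21361
print(hidden, output, total)
```

Calling `model.summary()` on the Keras model should report the same totals if the input width really is 354.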

In [None]:
# # evaluate model with standardized dataset
# estimator = KerasClassifier(build_fn=create_baseline, nb_epoch=100, batch_size=5, verbose=1)
# kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
# results = cross_val_score(estimator, X, encoded_Y, cv=kfold)
# print("Results: %.2f%% (%.2f%%)" % (results.mean()*100, results.std()*100))

In [None]:
# evaluate baseline model with standardized dataset
np.random.seed(seed)
estimators = []
estimators.append(('standardize', StandardScaler()))
estimators.append(('mlp', KerasClassifier(build_fn=create_baseline, epochs=20, batch_size=5, verbose=2)))
pipeline = Pipeline(estimators)
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
results = cross_val_score(pipeline, X, Y, cv=kfold)
# print("Standardized: %.2f%% (%.2f%%)" % (results.mean()*100, results.std()*100))

In [None]:
print("Cross-validated accuracy: %.2f%% (std: %.2f%%)" % (results.mean()*100, results.std()*100))
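
Placing the scaler inside the `Pipeline` means it is re-fitted on the training folds of each split, so the held-out fold never influences the scaling statistics. The sketch below shows the same pattern on synthetic data, with scikit-learn's `LogisticRegression` swapped in for the Keras classifier purely to keep the example light; the data and the five-fold setting are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic count-like features and a binary label loosely tied to feature 0.
rng = np.random.RandomState(7)
X_toy = rng.poisson(3.0, size=(200, 5)).astype(float)
y_toy = (X_toy[:, 0] + rng.normal(0, 1, 200) > 3).astype(int)

# The scaler is a pipeline step, so each fold fits it on training data only.
pipe = Pipeline([
    ("standardize", StandardScaler()),
    ("clf", LogisticRegression()),  # stand-in for the Keras model
])
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=7)
scores = cross_val_score(pipe, X_toy, y_toy, cv=cv)
print("mean accuracy: %.3f, std: %.3f" % (scores.mean(), scores.std()))
```

`cross_val_score` returns one accuracy per fold, which is why the notebook reports both a mean and a standard deviation.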

## 4.2 Final model and Summary

### 4.2.1 Without standardizing the features

In [None]:
# create model
model = Sequential()
model.add(Dense(60, input_dim=354, kernel_initializer='normal', activation='relu'))
model.add(Dense(1, kernel_initializer='normal', activation='sigmoid'))
# Compile model
earlystop = EarlyStopping(monitor='val_loss',
                              min_delta=0,
                              patience=2,
                              verbose=0, mode='auto')
callbacks_list = [earlystop]
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])


In [None]:
history = model.fit( X, Y, validation_split=0.1, epochs=20, batch_size=5, verbose=2, callbacks=callbacks_list)
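
The `EarlyStopping` callback with `patience=2` and `min_delta=0` stops training once the validation loss has failed to improve for two consecutive epochs. The function below is a simplified plain-Python sketch of that rule, not Keras's actual implementation; `early_stop_epoch` and the loss values are hypothetical.

```python
def early_stop_epoch(val_losses, patience=2, min_delta=0.0):
    """Return the 0-based epoch at which training would stop, or None."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best - min_delta:
            # Validation loss improved: remember it and reset the counter.
            best = loss
            wait = 0
        else:
            # No improvement: stop once this has happened `patience` times in a row.
            wait += 1
            if wait >= patience:
                return epoch
    return None

stopped_at = early_stop_epoch([0.50, 0.40, 0.41, 0.42, 0.30])
print(stopped_at)  # 3: epochs 2 and 3 both failed to beat the best loss of 0.40
```

Note that the later improvement to 0.30 is never seen, which is the trade-off of a small `patience` value.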

In [None]:
loss, accuracy = model.evaluate(X, Y)

In [None]:
accuracy

In [None]:
loss

In [None]:
# list all data in history
print(history.history.keys())
# summarize history for accuracy
plt.plot(history.history['acc'])
plt.plot(history.history['val_acc'])
plt.title('model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train', 'validation'], loc='upper left')
plt.show()
# summarize history for loss
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'validation'], loc='upper left')
plt.show()
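
The history keys `'acc'` and `'val_acc'` match the Keras version this notebook was written with; more recent Keras releases name them `'accuracy'` and `'val_accuracy'`. A small helper (hypothetical, not part of Keras) can pick whichever pair is present so the plotting code works in both cases.

```python
def accuracy_keys(history_dict):
    """Return the (train, validation) accuracy key names found in a history dict."""
    train_key = "acc" if "acc" in history_dict else "accuracy"
    return train_key, "val_" + train_key

# Made-up history dictionaries in the old and new naming styles.
old_style = {"acc": [0.8], "val_acc": [0.7], "loss": [0.5], "val_loss": [0.6]}
new_style = {"accuracy": [0.8], "val_accuracy": [0.7], "loss": [0.5], "val_loss": [0.6]}

print(accuracy_keys(old_style))
print(accuracy_keys(new_style))
```

With the notebook's objects this would be used as `train_key, val_key = accuracy_keys(history.history)` before calling `plt.plot`.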

### 4.2.2 Standardizing the features

In [None]:
# min_max_scaler = preprocessing.MinMaxScaler()
# X_minmax = min_max_scaler.fit_transform(X)
# # X_test_minmax = min_max_scaler.transform(X_test)
# # http://scikit-learn.org/stable/modules/preprocessing.html

# http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html
scaler = StandardScaler()
scaler.fit(X)
X_scaled = scaler.transform(X)
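
Fitting the scaler on all of `X` lets the rows later used for validation influence the mean and variance. Keras takes the `validation_split` fraction from the end of the arrays before shuffling, so one way to avoid that leakage is to hold out the last rows explicitly and fit the scaler on the rest. The sketch below shows this on synthetic data; the array, the 25% fraction, and the variable names are assumptions for illustration, and the resulting arrays would be passed to `model.fit` via `validation_data` instead of `validation_split`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic count-like feature matrix standing in for X.
rng = np.random.RandomState(7)
X_toy = rng.poisson(3.0, size=(100, 4)).astype(float)

# Hold out the last 25% of rows, mirroring where validation_split takes them.
n_val = int(len(X_toy) * 0.25)
X_train, X_val = X_toy[:-n_val], X_toy[-n_val:]

# Fit the scaler on the training rows only, then apply it to both parts.
toy_scaler = StandardScaler().fit(X_train)
X_train_scaled = toy_scaler.transform(X_train)
X_val_scaled = toy_scaler.transform(X_val)

print(X_train_scaled.shape, X_val_scaled.shape)
```

The training part has zero mean per feature by construction, while the validation part generally does not, which is expected.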

In [None]:
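# Rebuild the network so that training on the scaled features starts from fresh
# weights instead of continuing from the model fitted on unscaled data above.
model = create_baseline()
history = model.fit( X_scaled, Y, validation_split=0.25, epochs=20, batch_size=5, verbose=2, callbacks=callbacks_list)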

In [None]:
loss, accuracy = model.evaluate(X_scaled, Y)
print(loss)
print(accuracy)

In [None]:
# list all data in history
print(history.history.keys())
# summarize history for accuracy
plt.plot(history.history['acc'])
plt.plot(history.history['val_acc'])
plt.title('model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.ylim(0.7,1)
plt.legend(['train', 'validation'], loc='upper left')
plt.show()
# summarize history for loss
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.ylim(0,0.7)
plt.legend(['train', 'validation'], loc='upper left')
plt.show()

## 4.3 Summary

* This model was trained on [U.S. Opiate Prescriptions (from Kaggle)](https://www.kaggle.com/apryor6/us-opiate-prescriptions) which is a subset of data given at [cms.gov: Medicare Provider Charge Data](https://www.cms.gov/Research-Statistics-Data-and-Systems/Statistics-Trends-and-Reports/Medicare-Provider-Charge-Data/Part-D-Prescriber.html)

* Similar data can be downloaded from the aforementioned source to make predictions using the model for further testing

* Because a neural network is being used for binary classification, a rough estimate is that about 40,000 data points (records for about 40,000 doctors/prescribers) with similar features would be needed to train a robust, generalizable model

# 5. References

1. https://medium.com/maheshkkumar/implementing-a-binary-classifier-in-python-b69d08d8da21
2. https://machinelearningmastery.com/how-to-choose-the-right-test-options-when-evaluating-machine-learning-algorithms/
3. https://machinelearningmastery.com/binary-classification-tutorial-with-the-keras-deep-learning-library/
4. https://machinelearningmastery.com/tutorial-first-neural-network-python-keras/
5. https://machinelearningmastery.com/display-deep-learning-model-training-history-in-keras/
6. http://parneetk.github.io/blog/neural-networks-in-keras/
7. https://machinelearningmastery.com/rescaling-data-for-machine-learning-in-python-with-scikit-learn/
8. https://medium.com/@malay.haldar/how-much-training-data-do-you-need-da8ec091e956