## Introduction

TEAMMATES: Akshat and Annie

The overall goal is to predict whether a payment by a company to a medical doctor or facility
was made as part of a research project or not.

### Imports

In [1]:
# data loading and manipulation
import pandas as pd
import numpy as np
import random
from dirty_cat import TargetEncoder

# scikit learn
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import OneHotEncoder, LabelEncoder, StandardScaler
from sklearn.linear_model import LogisticRegression, Ridge, Lasso, ElasticNet
from sklearn.compose import ColumnTransformer
from sklearn.metrics import mean_squared_error, confusion_matrix, roc_auc_score, roc_curve, auc, average_precision_score

# unbalanced sets
from imblearn.under_sampling import RandomUnderSampler

# plotting
import matplotlib.pyplot as plt

%matplotlib inline

### Load data

The positive class consists of payments made by a company to a doctor or facility as part of a **research project**. The negative class consists of the **general payments**.

In the original data sets, the ratio of the positive class to the negative class is roughly 1/20, making the positive class the minority class.

Because the data sets are so large, we subsample both classes at the same rate so that this ratio is maintained. We take about 2M data points from the negative class (Class 0) and about 120K data points from the positive class (Class 1), which is ~20% of each class.

In [2]:
# Import 20% data randomly
# p = 0.2
# df0 = pd.read_csv('../payments2017/d0.csv', skiprows=lambda i: i>0 and random.random() > p)
# df1 = pd.read_csv('../payments2017/d1.csv', skiprows=lambda i: i>0 and random.random() > p)

# Write sampled data for future use
# df0.to_csv('../payments2017/gen_payments_sampled.csv')
# df1.to_csv('../payments2017/res_payments_sampled.csv')
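
The commented-out sampling above draws a different subset on every run because `random.random()` is unseeded. A minimal sketch of a reproducible variant, assuming a seeded `random.Random` instance; the in-memory CSV and the names `toy_csv`, `rng` and `sample` are illustrative stand-ins for the real payment files:

```python
import io
import random

import pandas as pd

# Hypothetical stand-in for one of the large payment CSVs: 100 data rows.
toy_csv = "id,amount\n" + "\n".join(f"{i},{i * 10}" for i in range(100))

p = 0.2                  # keep roughly 20% of the data rows
rng = random.Random(0)   # seeded generator so the subsample is repeatable

# Row 0 is the header and is always kept; every other row is skipped
# with probability 1 - p.
sample = pd.read_csv(
    io.StringIO(toy_csv),
    skiprows=lambda i: i > 0 and rng.random() > p,
)
print(sample.shape)
```

The number of rows kept is random around `p` times the file length, so the sampled class sizes only approximate 20%.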

In [3]:
# Import from sampled files
df0 = pd.read_csv('../payments2017/gen_payments_sampled.csv')
df1 = pd.read_csv('../payments2017/res_payments_sampled.csv')



In [4]:
df0.shape

(2132686, 76)

In [5]:
df1.shape

(120511, 177)
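
As a quick check of the class balance in the sampled files, the ratio can be computed from the two row counts printed above. This is only arithmetic on those shapes; the variable names are illustrative:

```python
# Row counts taken from df0.shape and df1.shape above
n_negative = 2132686  # general payments (Class 0)
n_positive = 120511   # research payments (Class 1)

ratio = n_positive / n_negative
positive_share = n_positive / (n_positive + n_negative)

print(f"positive : negative = 1 : {1 / ratio:.1f}")
print(f"positive share of all rows = {positive_share:.2%}")
```

In the sampled files the ratio comes out somewhat above 1/20, closer to 1/18.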

## Feature Intersection

What features should be excluded because they leak the target information?

There are 75 features present in the negative class, and 176 in the positive class. Our approach to combining the data sets for the positive and negative classes is to take the intersection of their features.

In [10]:
notPrs = list(set(list(df1.columns)).difference(list(df0.columns)))
featureIntersection = list(set(list(df1.columns)).difference(notPrs))
print("There are {} features present in the intersection of the two dataframes.".format(len(featureIntersection) - 1))

df1 = df1[featureIntersection]
df0 = df0[featureIntersection]

There are 65 features present in the intersection of the two dataframes.
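
Because Python sets are unordered, the set-based intersection above can return the shared columns in a different order from run to run. A small sketch of an order-preserving alternative, shown on toy frames; `toy0`, `toy1` and their column names are hypothetical and not the real schema:

```python
import pandas as pd

# Toy stand-ins for df0 / df1 (hypothetical columns)
toy0 = pd.DataFrame({"a": [1], "b": [2], "c": [3]})
toy1 = pd.DataFrame({"c": [4], "a": [5], "d": [6]})

# Keep the columns of toy1 that also exist in toy0, in toy1's order
shared = [col for col in toy1.columns if col in toy0.columns]

toy0 = toy0[shared]
toy1 = toy1[shared]
print(shared)
```

With both frames restricted to the same ordered column list, a later `pd.concat` lines the columns up without any reordering.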


Before we concatenate the two data sets, we add an indicator variable to each one specifying which class the data belongs to. We call this feature **Target**, which is equal to 1 for the positive class and 0 for the negative class.

In [11]:
df1['Target'] = 1
df0['Target'] = 0

df = pd.concat([df1, df0], axis=0)
df.shape

(2253197, 66)

In [12]:
NAs = df.isna().mean().sort_values(ascending=False)

In [13]:
NAs

Recipient_Province                                                  0.999945
Recipient_Postal_Code                                               0.999931
Physician_License_State_code5                                       0.999826
Physician_License_State_code4                                       0.999209
Associated_Drug_or_Biological_NDC_5                                 0.997671
Physician_License_State_code3                                       0.995469
Product_Category_or_Therapeutic_Area_5                              0.993623
Indicate_Drug_or_Biological_or_Device_or_Medical_Supply_5           0.993483
Name_of_Drug_or_Biological_or_Device_or_Medical_Supply_5            0.993420
Covered_or_Noncovered_Indicator_5                                   0.993296
Teaching_Hospital_CCN                                               0.987752
Teaching_Hospital_ID                                                0.987752
Teaching_Hospital_Name                                              0.987752
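
The missing-value fractions above come from `isna().mean()`, which averages a boolean mask per column. A minimal sketch on a toy frame, assuming hypothetical column names (`full`, `half`, `empty`):

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical columns; not the real payment schema
toy = pd.DataFrame({
    "full": [1, 2, 3, 4],
    "half": [1.0, np.nan, 3.0, np.nan],
    "empty": [np.nan] * 4,
})

# isna() gives a boolean frame; the column mean is the fraction missing
na_fraction = toy.isna().mean().sort_values(ascending=False)
print(na_fraction)

# Columns with at least one missing value
cols_with_na = list(na_fraction[na_fraction > 0].index)
print(cols_with_na)
```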

**Baseline**

Identifying possible irrelevant columns

In [114]:
columns_to_drop = ['Recipient_Province', 
'Recipient_Postal_Code', 
'Recipient_Primary_Business_Street_Address_Line2',
'Teaching_Hospital_Name', 
'Teaching_Hospital_CCN',
'Teaching_Hospital_ID',
'Physician_Name_Suffix',       
'Program_Year', 
'Physician_Profile_ID', 
'Physician_Last_Name', 
'Physician_First_Name',
'Record_ID',
'Applicable_Manufacturer_or_Applicable_GPO_Making_Payment_ID',
'Recipient_Zip_Code',
'Date_of_Payment',
'Physician_Middle_Name',
# 'Covered_Recipient_Type', # leaking target info
'Payment_Publication_Date', 
# 'Submitting_Applicable_Manufacturer_or_Applicable_GPO_Name', # leaking target info
# 'Applicable_Manufacturer_or_Applicable_GPO_Making_Payment_Name', # leaking target info
'Unnamed: 0' # Index col from one of the DFs
]

Dropping Columns with any missing value at all

In [115]:
nan_columns = NAs[NAs > 0] 
nan_columns = np.array(nan_columns.index)
to_drop_baseline = list(set(nan_columns) | set(columns_to_drop))

In [116]:
dfBaseline = df.drop(columns=to_drop_baseline)

Checking single variable performances to identify leakage issues

In [120]:
objVars = ['Covered_Recipient_Type', # leaking target info
            'Submitting_Applicable_Manufacturer_or_Applicable_GPO_Name', # leaking target info
            'Applicable_Manufacturer_or_Applicable_GPO_Making_Payment_Name', # leaking target info
           'Form_of_Payment_or_Transfer_of_Value',
           'Dispute_Status_for_Publication', 
           'Delay_in_Publication_Indicator',
           'Related_Product_Indicator',
           'Applicable_Manufacturer_or_Applicable_GPO_Making_Payment_Country',
           'Change_Type',
           'Total_Amount_of_Payment_USDollars']

target = dfBaseline['Target']

single_var = dict()

features = dfBaseline.drop(columns='Target')

X_train, X_test, y_train, y_test = train_test_split(features, target)

rus = RandomUnderSampler(random_state=42)
X_train_rus, y_train_rus = rus.fit_resample(X_train, y_train)    

X_train_rus = pd.DataFrame(X_train_rus, columns=X_train.columns)
X_test = pd.DataFrame(X_test, columns=X_train.columns)

for var in objVars:
    
    if var != 'Total_Amount_of_Payment_USDollars':
        baseline_pipe = Pipeline([
                                ("dummies", OneHotEncoder(handle_unknown='ignore')),
                                ("logreg", LogisticRegression(solver='lbfgs', max_iter=1000))])
    else:
        baseline_pipe = Pipeline([
                                ('scalar', StandardScaler()),
                                ("logreg", LogisticRegression(solver='lbfgs', max_iter=1000))])

    # Baseline Training and testing
    logreg = baseline_pipe.fit(X_train_rus[[var]], y_train_rus)
    y_score = logreg.predict_proba(X_train_rus[[var]])
    
    # Store in dict
    single_var[var] = roc_auc_score(y_train_rus, y_score[:, 1])
    
single_var    



{'Covered_Recipient_Type': 0.9792059412507333,
 'Submitting_Applicable_Manufacturer_or_Applicable_GPO_Name': 0.8615011802682625,
 'Applicable_Manufacturer_or_Applicable_GPO_Making_Payment_Name': 0.94239000863121,
 'Form_of_Payment_or_Transfer_of_Value': 0.8164754233924769,
 'Dispute_Status_for_Publication': 0.500204847692972,
 'Delay_in_Publication_Indicator': 0.5,
 'Related_Product_Indicator': 0.5419384128179292,
 'Applicable_Manufacturer_or_Applicable_GPO_Making_Payment_Country': 0.5523350569889764,
 'Change_Type': 0.5313082741537394,
 'Total_Amount_of_Payment_USDollars': 0.8982626413550628}
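
The scores above are computed on the same undersampled rows the model was fitted on. A hedged sketch of a cross-validated version of the same single-variable check, on synthetic data; the columns `leaky` and `noise` and all other names here are hypothetical:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(0)
n = 400
toy_y = rng.integers(0, 2, size=n)
toy_X = pd.DataFrame({
    # Hypothetical leaky column: its value is determined by the label
    "leaky": np.where(toy_y == 1, "research", "general"),
    # Hypothetical uninformative column
    "noise": rng.choice(["x", "y", "z"], size=n),
})

single_var_cv = {}
for col in toy_X.columns:
    pipe = make_pipeline(OneHotEncoder(handle_unknown="ignore"),
                         LogisticRegression(max_iter=1000))
    scores = cross_val_score(pipe, toy_X[[col]], toy_y, scoring="roc_auc", cv=5)
    single_var_cv[col] = scores.mean()

print(single_var_cv)
```

A single column whose cross-validated AUC is close to 1 is a strong hint that it encodes the target, while an uninformative column stays near 0.5.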

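Modifying the list of columns to drop: `Covered_Recipient_Type` and `Applicable_Manufacturer_or_Applicable_GPO_Making_Payment_Name` reach a single-variable AUC above 0.94, so we treat them as leaking the target and drop them. `Submitting_Applicable_Manufacturer_or_Applicable_GPO_Name` is kept.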

In [133]:
columns_to_drop = ['Recipient_Province', 
'Recipient_Postal_Code', 
'Recipient_Primary_Business_Street_Address_Line2',
'Teaching_Hospital_Name', 
'Teaching_Hospital_CCN',
'Teaching_Hospital_ID',
'Physician_Name_Suffix',       
'Program_Year', 
'Physician_Profile_ID', 
'Physician_Last_Name', 
'Physician_First_Name',
'Record_ID',
'Applicable_Manufacturer_or_Applicable_GPO_Making_Payment_ID',
'Recipient_Zip_Code',
'Date_of_Payment',
'Physician_Middle_Name',
'Covered_Recipient_Type', # leaking target info
'Payment_Publication_Date', 
# 'Submitting_Applicable_Manufacturer_or_Applicable_GPO_Name', # leaking target info
'Applicable_Manufacturer_or_Applicable_GPO_Making_Payment_Name', # leaking target info
'Unnamed: 0' # Index col from one of the DFs
]

In [134]:
nan_columns = NAs[NAs > 0] 
nan_columns = np.array(nan_columns.index)
to_drop_baseline = list(set(nan_columns) | set(columns_to_drop))

In [135]:
dfBaseline = df.drop(columns=to_drop_baseline)

In [136]:
dfBaseline.shape

(2253197, 9)

In [137]:
dfBaseline.head()

| | Applicable_Manufacturer_or_Applicable_GPO_Making_Payment_Country | Form_of_Payment_or_Transfer_of_Value | Change_Type | Submitting_Applicable_Manufacturer_or_Applicable_GPO_Name | Delay_in_Publication_Indicator | Dispute_Status_for_Publication | Related_Product_Indicator | Total_Amount_of_Payment_USDollars | Target |
|---|---|---|---|---|---|---|---|---|---|
| 0 | United States | Cash or cash equivalent | UNCHANGED | Nielsen BioSciences, Inc. | No | No | Yes | 8531.0 | 1 |
| 1 | United States | Cash or cash equivalent | UNCHANGED | Nielsen BioSciences, Inc. | No | No | Yes | 76433.5 | 1 |
| 2 | United States | Cash or cash equivalent | UNCHANGED | Nielsen BioSciences, Inc. | No | No | Yes | 49312.5 | 1 |
| 3 | United States | Cash or cash equivalent | UNCHANGED | Mission Pharmacal Company | No | No | No | 546.15 | 1 |
| 4 | United States | Cash or cash equivalent | UNCHANGED | Mission Pharmacal Company | No | No | No | 225.0 | 1 |


In [138]:
pd.DataFrame(dfBaseline.columns, columns=['Columns'])

| | Columns |
|---|---|
| 0 | Applicable_Manufacturer_or_Applicable_GPO_Maki... |
| 1 | Form_of_Payment_or_Transfer_of_Value |
| 2 | Change_Type |
| 3 | Submitting_Applicable_Manufacturer_or_Applicab... |
| 4 | Delay_in_Publication_Indicator |
| 5 | Dispute_Status_for_Publication |
| 6 | Related_Product_Indicator |
| 7 | Total_Amount_of_Payment_USDollars |
| 8 | Target |


In [154]:
# Train-Test split
target = dfBaseline['Target']
features = dfBaseline.drop(columns='Target')
X_train, X_test, y_train, y_test = train_test_split(features, target)

# Random undersampling
rus = RandomUnderSampler(random_state=42)
X_train_rus, y_train_rus = rus.fit_resample(X_train, y_train)
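
To make explicit what random undersampling does, here is a minimal pandas-only sketch on a toy frame. It is a stand-in for illustration, not the `RandomUnderSampler` implementation; the frame `toy` and the column `amount` are hypothetical:

```python
import pandas as pd

# Toy training set: 2 positive rows and 10 negative rows
toy = pd.DataFrame({"amount": range(12), "Target": [1, 1] + [0] * 10})

# Size of the smallest class
n_minority = toy["Target"].value_counts().min()

# Draw the same number of rows from every class, without replacement
balanced = toy.groupby("Target").sample(n=n_minority, random_state=42)

print(balanced["Target"].value_counts())
```

Only the training split is resampled here, as in the cell above; the test split keeps the original class ratio so that test metrics reflect the real imbalance.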

## First Baselining

Without target encoding

In [159]:
# Defining continuous and categorical variables
objVars = ['Form_of_Payment_or_Transfer_of_Value',
           'Dispute_Status_for_Publication', 
           'Delay_in_Publication_Indicator',
           'Related_Product_Indicator',
           'Applicable_Manufacturer_or_Applicable_GPO_Making_Payment_Country',
           'Change_Type',
           'Submitting_Applicable_Manufacturer_or_Applicable_GPO_Name']

contVars = ['Total_Amount_of_Payment_USDollars']

# contVars_ct = ColumnTransformer([("scalar", StandardScaler(), contVars)])

# catVars_ct = ColumnTransformer([("dummies", OneHotEncoder(handle_unknown='ignore'), objVars)])

# ("target_encoder", TargetEncoder(clf_type="binary_clf"), target_based_encoding)

# baseline_pipe = Pipeline([
#                         ("contvars", contVars_ct),
#                         ("catvars", catVars_ct),
#                         ("logreg", LogisticRegression(solver='lbfgs', max_iter=1000))])

In [None]:
# Defining ColumnTransformer
preprocessor = ColumnTransformer(transformers=[("scalar", StandardScaler(), contVars),
                                              ("dummies", OneHotEncoder(handle_unknown='ignore'), objVars)
                                             ])

# Create pipelines
baseline_pipe = Pipeline(steps=[('preprocessor', preprocessor),
                          ("logreg", LogisticRegression(solver='lbfgs', max_iter=1000))
                         ])

In [147]:
# Baseline Training and testing
# logreg = baseline_pipe.fit(pd.DataFrame(X_train_rus, columns = X_train.columns), y_train_rus)
# y_score = logreg.predict_proba(pd.DataFrame(X_train_rus, columns = X_train.columns))
# baseline= roc_auc_score(y_train_rus, y_score[:, 1])

baseline = cross_val_score(baseline_pipe, pd.DataFrame(X_train_rus, columns = X_train.columns), y_train_rus, scoring='roc_auc', cv=5)
baseline_cv_score = np.mean(baseline)
baseline_cv_score



0.937455810525603
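
In the cross-validation step, the whole unfitted pipeline should be the estimator passed to `cross_val_score`, so that scaling and encoding are refit inside every fold. A self-contained sketch on synthetic data, with hypothetical columns `amount` and `form` standing in for the real features:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.default_rng(0)
n = 300
toy_X = pd.DataFrame({
    "amount": rng.lognormal(mean=3.0, sigma=1.0, size=n),  # hypothetical numeric column
    "form": rng.choice(["cash", "in-kind"], size=n),        # hypothetical categorical column
})
# Synthetic label that depends only on the amount
toy_y = (toy_X["amount"] > toy_X["amount"].median()).astype(int)

toy_pipe = Pipeline(steps=[
    ("preprocessor", ColumnTransformer(transformers=[
        ("scaler", StandardScaler(), ["amount"]),
        ("dummies", OneHotEncoder(handle_unknown="ignore"), ["form"]),
    ])),
    ("logreg", LogisticRegression(max_iter=1000)),
])

# One ROC AUC value per fold
toy_scores = cross_val_score(toy_pipe, toy_X, toy_y, scoring="roc_auc", cv=5)
print(toy_scores.mean())
```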

Trying Target Encoding

In [160]:
# Identify which variables to target encode
target_based_encoding = []
for col in objVars:
    print(col, len(X_train[col].unique()))
    
    if len(X_train[col].unique()) > 100:
        target_based_encoding.append(col)

target_based_encoding

Form_of_Payment_or_Transfer_of_Value 6
Dispute_Status_for_Publication 2
Delay_in_Publication_Indicator 1
Related_Product_Indicator 2
Applicable_Manufacturer_or_Applicable_GPO_Making_Payment_Country 32
Change_Type 3
Submitting_Applicable_Manufacturer_or_Applicable_GPO_Name 1180


['Submitting_Applicable_Manufacturer_or_Applicable_GPO_Name']
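
For intuition about why a column with more than a thousand levels is target encoded rather than one-hot encoded, here is a simplified pandas sketch of the idea: each category is replaced by the mean of the target within that category. It is not the `dirty_cat` implementation; library encoders typically also shrink these means toward the global mean for rare categories, which is omitted here. The `company` column is hypothetical:

```python
import pandas as pd

toy = pd.DataFrame({
    "company": ["A", "A", "A", "B", "B"],  # hypothetical high-cardinality column
    "Target":  [1, 1, 0, 0, 0],
})

# Mean of the target per category, learned from the training rows only
category_means = toy.groupby("company")["Target"].mean()

# Replace each category by its mean target value: one numeric column
# instead of one dummy column per category
toy["company_encoded"] = toy["company"].map(category_means)
print(toy)
```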

In [151]:
# Defining continuous and categorical variables
objVars = ['Form_of_Payment_or_Transfer_of_Value',
           'Dispute_Status_for_Publication', 
           'Delay_in_Publication_Indicator',
           'Related_Product_Indicator',
           'Applicable_Manufacturer_or_Applicable_GPO_Making_Payment_Country',
           'Change_Type']

contVars = ['Total_Amount_of_Payment_USDollars']

# contVars_ct = ColumnTransformer([("scalar", StandardScaler(), contVars)])

# catVars_ct = ColumnTransformer([("dummies", OneHotEncoder(handle_unknown='ignore'), objVars)])

# baseline_pipe = Pipeline([
#                         ("contvars", contVars_ct),
#                         ("catvars", catVars_ct),
#                         ("logreg", LogisticRegression(solver='lbfgs', max_iter=1000))])


In [152]:
# Defining ColumnTransformer
preprocessor = ColumnTransformer(transformers=[("scalar", StandardScaler(), contVars),
                                              ("dummies", OneHotEncoder(handle_unknown='ignore'), objVars),
                                              ("target_encoder", TargetEncoder(clf_type="binary_clf"), target_based_encoding)
                                             ])

# Create pipelines
baseline_pipe = Pipeline(steps=[('preprocessor', preprocessor),
                                ("logreg", LogisticRegression(solver='lbfgs', max_iter=1000))
                               ])



In [153]:
# Baseline Training and testing
baseline = cross_val_score(baseline_pipe, pd.DataFrame(X_train_rus, columns = X_train.columns), y_train_rus, scoring='roc_auc', cv=5)
baseline_cv_score = np.mean(baseline)
baseline_cv_score




0.939610166041333

## Feature engineering ~ Task 3

Imputing NA with 'Missing' values

In [169]:
df_engineered = df.drop(columns=columns_to_drop)

In [170]:
# Train-Test split
target = df_engineered['Target']
features = df_engineered.drop(columns='Target')
X_train, X_test, y_train, y_test = train_test_split(features, target)

# Random undersampling
rus = RandomUnderSampler(random_state=42)
X_train_rus, y_train_rus = rus.fit_resample(X_train, y_train)

In [172]:
obj_vars = X_train.drop(columns=['Total_Amount_of_Payment_USDollars']).columns.values
cont_vars = ['Total_Amount_of_Payment_USDollars']

In [174]:
# Identify which variables to target encode
target_based_encoding = []
for col in obj_vars:
    print(col, len(X_train[col].unique()))
    
    if len(X_train[col].unique()) > 100:
        target_based_encoding.append(col)

target_based_encoding

Indicate_Drug_or_Biological_or_Device_or_Medical_Supply_4 5
Indicate_Drug_or_Biological_or_Device_or_Medical_Supply_2 5
Applicable_Manufacturer_or_Applicable_GPO_Making_Payment_Country 34
Indicate_Drug_or_Biological_or_Device_or_Medical_Supply_1 5
Form_of_Payment_or_Transfer_of_Value 6
Physician_License_State_code5 34
Indicate_Drug_or_Biological_or_Device_or_Medical_Supply_3 5
Covered_or_Noncovered_Indicator_5 3
Associated_Drug_or_Biological_NDC_1 1224
Change_Type 3
Applicable_Manufacturer_or_Applicable_GPO_Making_Payment_State 47
Name_of_Drug_or_Biological_or_Device_or_Medical_Supply_3 1769
Product_Category_or_Therapeutic_Area_5 193
Product_Category_or_Therapeutic_Area_4 298
Submitting_Applicable_Manufacturer_or_Applicable_GPO_Name 1187
Physician_License_State_code2 55
Name_of_Drug_or_Biological_or_Device_or_Medical_Supply_1 8236
Delay_in_Publication_Indicator 1
Dispute_Status_for_Publication 2
Physician_License_State_code3 49
Name_of_Drug_or_Biological_or_Device_or_Medical_Supply_5 5

['Associated_Drug_or_Biological_NDC_1',
 'Name_of_Drug_or_Biological_or_Device_or_Medical_Supply_3',
 'Product_Category_or_Therapeutic_Area_5',
 'Product_Category_or_Therapeutic_Area_4',
 'Submitting_Applicable_Manufacturer_or_Applicable_GPO_Name',
 'Name_of_Drug_or_Biological_or_Device_or_Medical_Supply_1',
 'Name_of_Drug_or_Biological_or_Device_or_Medical_Supply_5',
 'Associated_Drug_or_Biological_NDC_2',
 'Associated_Drug_or_Biological_NDC_3',
 'Product_Category_or_Therapeutic_Area_1',
 'Associated_Drug_or_Biological_NDC_4',
 'Physician_Specialty',
 'Recipient_Primary_Business_Street_Address_Line1',
 'Name_of_Drug_or_Biological_or_Device_or_Medical_Supply_4',
 'Name_of_Drug_or_Biological_or_Device_or_Medical_Supply_2',
 'Product_Category_or_Therapeutic_Area_2',
 'Recipient_City',
 'Product_Category_or_Therapeutic_Area_3']

In [178]:
# Final categorical variables
categorical = [cols for cols in obj_vars if cols not in target_based_encoding]
len(categorical) + len(target_based_encoding)

44

In [182]:
# Defining ColumnTransformer
preprocessor = ColumnTransformer(transformers=[("scalar", StandardScaler(), cont_vars),
                                              ("dummies", make_pipeline(SimpleImputer(strategy='constant', fill_value="Missing"),
                                                                        OneHotEncoder(handle_unknown='ignore')), categorical),
                                              ("target_encoder", make_pipeline(SimpleImputer(strategy='constant', fill_value="Missing"),
                                                                               TargetEncoder(clf_type="binary_clf")), target_based_encoding)
                                             ])

# Create pipelines
take2_pipe = Pipeline(steps=[('preprocessor', preprocessor),
                                ("logreg", LogisticRegression(solver='lbfgs', max_iter=500))
                               ])
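
The imputation step treats a missing value as a category of its own. A minimal sketch of that behaviour on a toy column, assuming a hypothetical `state` feature:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical categorical column with missing entries
toy = pd.DataFrame({"state": ["NY", np.nan, "CA", np.nan]}, dtype=object)

# Replace every missing entry with the literal string "Missing"
imputer = SimpleImputer(strategy="constant", fill_value="Missing")
filled = imputer.fit_transform(toy)

# "Missing" now gets its own dummy column like any other level
encoder = OneHotEncoder(handle_unknown="ignore")
encoded = encoder.fit_transform(filled)
categories = list(encoder.categories_[0])
print(categories)
```

This keeps the rows and columns that the baseline had to drop, and lets the model learn whether the absence of a value is itself informative.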



In [183]:
# Cross-validate the engineered pipeline
baseline = cross_val_score(take2_pipe, pd.DataFrame(X_train_rus, columns = X_train.columns), y_train_rus, scoring='roc_auc', cv=5)
baseline_cv_score = np.mean(baseline)
baseline_cv_score




0.9381459246035337

In [None]:
logreg = take2_pipe.fit(pd.DataFrame(X_train_rus, columns = X_train.columns), y_train_rus)
y_score = logreg.predict_proba(X_test)

In [None]:
preds = logreg.predict(X_test)
tn, fp, fn, tp  = confusion_matrix(y_test, preds).ravel()
print([tn, fp])
print([fn, tp])
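
With a heavily imbalanced test set, the four confusion-matrix cells are easier to read as rates. A short sketch of the arithmetic, using hypothetical counts rather than results from this notebook:

```python
# Hypothetical counts, only to illustrate the arithmetic
tn, fp, fn, tp = 90, 10, 5, 15

precision = tp / (tp + fp)            # share of predicted research payments that are correct
recall = tp / (tp + fn)               # share of actual research payments that are found
false_positive_rate = fp / (fp + tn)  # share of general payments flagged as research

print(precision, recall, false_positive_rate)
```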

In [None]:
roc_auc_score(y_test, y_score[:, 1])

In [None]:
def plot_roc(y_test, y_score):
    """Plot the ROC curve from true labels and positive-class scores."""
    fpr, tpr, thresholds = roc_curve(y_test, y_score)
    roc_auc = auc(fpr, tpr)

    lw = 2
    plt.plot(fpr, tpr, color='darkorange',
             lw=lw, label='ROC curve (area = %0.2f)' % roc_auc)
    plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver operating characteristic')
    plt.legend(loc="lower right")
    plt.show()

In [None]:
plot_roc(y_test, y_score[:, 1])

In [None]:
# Average precision takes the positive-class scores, not both probability columns
average_precision_score(y_test, y_score[:, 1])

In [None]:
y_prob = logreg.predict_proba(X_test)