
# Physician Conversion Model

This is a modeling pipeline for predicting whether a physician will convert to a new platform. The pipeline consists of the following steps:

1. Split the data into train, validation, and inference sets.
2. Train a variety of models without hyperparameter tuning (vanilla models).
3. Select one of the vanilla models and tune its hyperparameters.
4. Evaluate the tuned model on the validation set. The inference set is held out for the inference pipeline.

## Step 1: Split Data into Train, Validation, and Inference Sets

The data was split into 70% train, 20% validation, and 10% inference sets. This leaves enough data to fit the models, a separate set for comparing and tuning them, and a held-out set of unseen records for the inference pipeline.
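
A 70/20/10 split like the one described above can be obtained with two successive calls to `train_test_split`. The following is a minimal sketch, not the notebook's actual code: the toy DataFrame, the stratification on `target`, and the `random_state` value are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the model input dataset (hypothetical values).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age": rng.integers(30, 75, size=1000),
    "number_of_rx": rng.integers(0, 1200, size=1000),
    "target": rng.integers(0, 2, size=1000),
})

X = df.drop(columns="target")
y = df["target"]

# First cut: 70% train, 30% held back.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42
)

# Second cut: one third of the held-back rows become the inference set,
# which gives 20% validation and 10% inference overall.
X_val, X_inf, y_val, y_inf = train_test_split(
    X_rest, y_rest, test_size=1 / 3, stratify=y_rest, random_state=42
)

print(len(X_train), len(X_val), len(X_inf))
```

Stratifying both cuts keeps the share of converting physicians roughly equal across the three sets, which matters when the target is imbalanced.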

## Step 2: Train Vanilla Models

Several vanilla models were trained with their default hyperparameters (no tuning), including logistic regression, a decision tree, and a random forest.
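
The comparison of vanilla models could look roughly like the sketch below. It assumes synthetic data from `make_classification` in place of the physician features, a single train/validation split, and the F1-score as the comparison metric; the dictionary name `vanilla_models` is illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary-classification data standing in for the physician features.
X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# "Vanilla" here means default hyperparameters. Only the random seed is fixed,
# and max_iter is raised so that logistic regression converges.
vanilla_models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "random_forest": RandomForestClassifier(random_state=0),
}

# Fit each model on the train set and score it on the validation set.
val_f1 = {}
for name, model in vanilla_models.items():
    model.fit(X_train, y_train)
    val_f1[name] = f1_score(y_val, model.predict(X_val))

# Print the models from best to worst validation F1-score.
for name, score in sorted(val_f1.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: F1 = {score:.3f}")
```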

## Step 3: Select and Tune a Model

One of the vanilla models (----) was selected for hyperparameter tuning. The tuned hyperparameters include the learning rate and the regularization strength.

## Step 4: Evaluate the Model

The tuned model was evaluated on the validation set. It achieved a high F1-score, which indicates that it balances precision and recall well when predicting whether a physician will convert to the new platform.
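
The F1-score is the harmonic mean of precision and recall, so it can differ noticeably from plain accuracy when converters are rare. The short example below uses made-up labels and predictions (not results from this pipeline) to show how the metrics relate.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Hypothetical validation labels: 2 converters among 10 physicians.
y_val = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
# Hypothetical predictions: one converter found, one missed, no false positives.
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]

precision = precision_score(y_val, y_pred)  # TP / (TP + FP) = 1 / 1
recall = recall_score(y_val, y_pred)        # TP / (TP + FN) = 1 / 2
f1 = f1_score(y_val, y_pred)                # 2 * P * R / (P + R) = 2 / 3
accuracy = accuracy_score(y_val, y_pred)    # 9 correct out of 10

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} accuracy={accuracy:.2f}")
# precision=1.00 recall=0.50 f1=0.67 accuracy=0.90
```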

## Conclusion

The modeling pipeline described in this document produced a tuned model with a high F1-score on the validation set. This suggests that the model identifies physicians who will convert to the new platform with a good balance of precision and recall.

## Next Step

The selected model will be used in the inference pipeline to generate predictions on the inference set.

### Import Libraries and Model Input Dataset

In [3]:
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings("ignore")

# Visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns

# Importing necessary libraries for encoding
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder

# Importing necessary library for scaling
from sklearn.preprocessing import StandardScaler

# Importing necessary library for train-test split
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score

# Importing necessary libraries for model development and evaluation
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, roc_curve, auc
import xgboost as xgb
import lightgbm as lgb

# Hyperparameter Tuning
from hyperopt import fmin, tpe, hp, SparkTrials, STATUS_OK, Trials
import os

### Loading Data

In [39]:
import sagemaker  # needed for sagemaker.Session() below
from sagemaker.feature_store.feature_group import FeatureGroup

feature_group_name = "physician-conversion-feature-group-30-14-22-46"

# Create a SageMaker session
sagemaker_session = sagemaker.Session()

# Load the feature group
feature_group = FeatureGroup(name=feature_group_name, sagemaker_session=sagemaker_session)

In [43]:
query = feature_group.athena_query()
database = query.database
table_name = query.table_name
print(database)
print('')
print(table_name)

sagemaker_featurestore

physician_conversion_feature_group_30_14_22_46_1714486966


In [49]:
# S3 location for the Athena query results
bucket = "sagemaker-experiment-hs"
prefix = "Feature-store-trial"

In [51]:
# Select every column from the feature group's Athena table
query_string = f'SELECT * FROM "{database}"."{table_name}"'
query.run(
    query_string=query_string,
    output_location=f"s3://{bucket}/{prefix}/"
)
query.wait()  # block until the Athena query has finished
dataset = query.as_dataframe()


In [52]:
dataset.head()

| Unnamed: 0 | npi_id | hcp_id | target | age | year_of_experience | number_of_rx | rx_last_1_month | rx_last_3_month | rx_last_6_month | rx_last_12_month | ... | specialty_oncology | specialty_pediatric | specialty_uro-oncology | hco_affiliation_type_contract | hco_affiliation_type_employment | hco_affiliation_type_referral | eventtime | write_time | api_invocation_time | is_deleted |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 9875108 | HCP_14 | 1 | 66 | 38 | 1100 | 1740 | 2884 | 3396 | 5605 | ... | False | False | False | False | False | True | 1714487000.0 | 2024-04-30 14:28:40.579 | 2024-04-30 14:23:28.000 | False |
| 1 | 4936853 | HCP_45 | 0 | 68 | 13 | 833 | 1569 | 2861 | 3119 | 3437 | ... | False | True | False | False | True | False | 1714487000.0 | 2024-04-30 14:28:40.579 | 2024-04-30 14:23:29.000 | False |
| 2 | 6801445 | HCP_62 | 0 | 41 | 60 | 490 | 932 | 1118 | 1700 | 2689 | ... | False | False | False | False | False | True | 1714487000.0 | 2024-04-30 14:28:40.579 | 2024-04-30 14:23:29.000 | False |
| 3 | 8159501 | HCP_116 | 0 | 59 | 60 | 1100 | 2082 | 3317 | 5516 | 9072 | ... | False | True | False | True | False | False | 1714487000.0 | 2024-04-30 14:28:40.579 | 2024-04-30 14:23:30.000 | False |
| 4 | 3696556 | HCP_198 | 0 | 72 | 10 | 243 | 246 | 256 | 342 | 345 | ... | False | False | True | True | False | False | 1714487000.0 | 2024-04-30 14:28:40.579 | 2024-04-30 14:23:31.000 | False |
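
The query result shown above carries identifier columns (`npi_id`, `hcp_id`), the event-time column, and the bookkeeping columns that the Feature Store offline store adds (`write_time`, `api_invocation_time`, `is_deleted`). A possible way to separate features from the label before modeling is sketched below. The choice of which columns to drop is an assumption, and the small DataFrame is a hypothetical stand-in for `dataset`.

```python
import pandas as pd

# Small hypothetical frame mimicking a few of the columns in the query result.
dataset = pd.DataFrame({
    "npi_id": [9875108, 4936853],
    "hcp_id": ["HCP_14", "HCP_45"],
    "target": [1, 0],
    "age": [66, 68],
    "eventtime": [1714487000.0, 1714487000.0],
    "write_time": ["2024-04-30 14:28:40.579"] * 2,
    "api_invocation_time": ["2024-04-30 14:23:28.000", "2024-04-30 14:23:29.000"],
    "is_deleted": [False, False],
})

# Assumed non-feature columns: identifiers plus Feature Store metadata.
non_feature_cols = [
    "npi_id", "hcp_id", "eventtime",
    "write_time", "api_invocation_time", "is_deleted",
]

# Keep only records that are not marked as deleted,
# then separate the features from the label.
live = dataset[~dataset["is_deleted"]]
X = live.drop(columns=non_feature_cols + ["target"])
y = live["target"]

print(list(X.columns))
# ['age']
```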
