Problem 1 - The development of drugs is critical in providing therapeutic options
for patients suffering from chronic and terminal illnesses. “Target Drug”, in particular,
is designed to enhance the patient's health and well-being without causing
dependence on other medications that could potentially lead to severe and
life-threatening side effects. These drugs are specifically tailored to treat a particular
disease or condition, offering a more focused and effective approach to treatment,
while minimising the risk of harmful reactions.
The objective of this assignment is to develop a predictive model that predicts
whether a patient will be eligible for “Target Drug” in the next 30 days. Knowing
whether the patient is eligible will help the physician treating the patient make an
informed decision about which treatments to give.

#Import required packages

In [2]:
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
import matplotlib.pyplot as plt
import seaborn as sns

#Read The File

In [2]:
df_train = pd.read_parquet("/content/train.parquet", engine="pyarrow")
df_train.head(5)

Unnamed: 0,Patient-Uid,Date,Incident
0,a0db1e73-1c7c-11ec-ae39-16262ee38c7f,2019-03-09,PRIMARY_DIAGNOSIS
1,a0dc93f2-1c7c-11ec-9cd2-16262ee38c7f,2015-05-16,PRIMARY_DIAGNOSIS
3,a0dc94c6-1c7c-11ec-a3a0-16262ee38c7f,2018-01-30,SYMPTOM_TYPE_0
4,a0dc950b-1c7c-11ec-b6ec-16262ee38c7f,2015-04-22,DRUG_TYPE_0
8,a0dc9543-1c7c-11ec-bb63-16262ee38c7f,2016-06-18,DRUG_TYPE_1


#Preprocess and clean the data

In [3]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3220868 entries, 0 to 29080911
Data columns (total 3 columns):
 #   Column       Dtype         
---  ------       -----         
 0   Patient-Uid  object        
 1   Date         datetime64[ns]
 2   Incident     object        
dtypes: datetime64[ns](1), object(2)
memory usage: 98.3+ MB


In [4]:
# drop_duplicates is a method: call it and keep the DataFrame it returns
df_train = df_train.drop_duplicates()

In [5]:
# Data cleaning and formatting
df_train['Date'] = pd.to_datetime(df_train['Date'])
df_train.dropna(inplace=True)

#Separate positive and negative sets based on Target Drug


In [6]:
# Positive set: event rows where the incident is the "Target Drug"
positive_set = df_train[df_train['Incident'] == 'TARGET DRUG']

# Negative set: all other event rows (a patient can have rows in both sets)
negative_set = df_train[df_train['Incident'] != 'TARGET DRUG']

# Date of the first "Target Drug" event for each patient
eligible_patients = positive_set.groupby('Patient-Uid')['Date'].min()

# Keep positive rows dated on or after the patient's first "Target Drug" event
# (vectorized lookup instead of a row-wise apply)
first_target_date = positive_set['Patient-Uid'].map(eligible_patients)
positive_set = positive_set[positive_set['Date'] >= first_target_date]

# Combine positive and negative sets
combined_set = pd.concat([positive_set, negative_set])
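
The filters above select event rows rather than patients, so a patient who received the Target Drug still contributes rows to negative_set. A minimal sketch of a patient-level split follows; the `toy` frame and the `positive_patients` / `negative_patients` names are made up for illustration and are not part of the assignment data.

```python
import pandas as pd

# Hypothetical toy events with the same three columns as the real data.
toy = pd.DataFrame({
    'Patient-Uid': ['p1', 'p1', 'p2', 'p3', 'p3'],
    'Date': pd.to_datetime(['2019-01-01', '2019-02-01', '2019-01-15',
                            '2019-03-01', '2019-03-20']),
    'Incident': ['SYMPTOM_TYPE_0', 'TARGET DRUG', 'DRUG_TYPE_1',
                 'DRUG_TYPE_0', 'TARGET DRUG'],
})

# Patients with at least one TARGET DRUG event form the positive group.
is_target = toy['Incident'] == 'TARGET DRUG'
positive_patients = set(toy.loc[is_target, 'Patient-Uid'])

# Every other patient belongs to the negative group.
negative_patients = set(toy['Patient-Uid']) - positive_patients

# First TARGET DRUG date for each positive patient.
first_target_date = toy[is_target].groupby('Patient-Uid')['Date'].min()

print(sorted(positive_patients), sorted(negative_patients))
```

Working at the patient level keeps each patient in exactly one group, which matches the wording of the problem statement more closely than a row-level filter.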

#Feature Engineering

In [7]:
import numpy as np

# Create frequency-based features
patient_frequency = df_train.groupby('Patient-Uid')['Date'].count().reset_index()
patient_frequency.columns = ['Patient-Uid', 'Frequency']

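# Calculate the time since the previous event for each patient.
# diff() compares consecutive rows, so sort chronologically within each patient first.
df_train = df_train.sort_values(['Patient-Uid', 'Date'])
df_train['TimeSinceLastEvent'] = df_train.groupby('Patient-Uid')['Date'].diff().dt.days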

# Calculate time since the first event for each patient
df_train['TimeSinceFirstEvent'] = (df_train['Date'] - df_train.groupby('Patient-Uid')['Date'].transform('min')).dt.days

# Create a feature for the number of days between consecutive events
df_train['TimeBetweenEvents'] = df_train['TimeSinceLastEvent'].fillna(0)

# Create a feature for the average time between events for each patient
avg_time_between_events = df_train.groupby('Patient-Uid')['TimeBetweenEvents'].mean().reset_index()
avg_time_between_events.columns = ['Patient-Uid', 'AvgTimeBetweenEvents']

# Combine features
final_features = pd.merge(df_train, patient_frequency, on='Patient-Uid', how='left')
final_features = pd.merge(final_features, avg_time_between_events, on='Patient-Uid', how='left')

# Create binary features for different event types (e.g., DRUG_TYPE_7, SYMPTOM_TYPE_2, etc.)
event_types = df_train['Incident'].unique()

for event in event_types:
    final_features[event] = (final_features['Incident'] == event).astype(int)
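
To make the time-based features concrete, here is a small sketch on made-up events (the `toy` frame is hypothetical). It sorts each patient's events by date before taking differences, since `diff()` only compares a row with the row before it, and it builds the indicator columns with `pd.get_dummies` as one possible alternative to the loop above.

```python
import pandas as pd

# Hypothetical toy events, deliberately out of chronological order.
toy = pd.DataFrame({
    'Patient-Uid': ['p1', 'p1', 'p1', 'p2', 'p2'],
    'Date': pd.to_datetime(['2019-01-10', '2019-01-01', '2019-01-31',
                            '2019-02-01', '2019-02-11']),
    'Incident': ['DRUG_TYPE_0', 'SYMPTOM_TYPE_0', 'TARGET DRUG',
                 'DRUG_TYPE_1', 'DRUG_TYPE_0'],
})

# Sort within each patient so consecutive rows are consecutive events.
toy = toy.sort_values(['Patient-Uid', 'Date']).reset_index(drop=True)

grouped_dates = toy.groupby('Patient-Uid')['Date']

# Days since the previous event (NaN for a patient's first event).
toy['TimeSinceLastEvent'] = grouped_dates.diff().dt.days

# Days since the patient's first recorded event.
toy['TimeSinceFirstEvent'] = (toy['Date'] - grouped_dates.transform('min')).dt.days

# One 0/1 indicator column per incident type.
indicators = pd.get_dummies(toy['Incident']).astype(int)
toy = pd.concat([toy, indicators], axis=1)

print(toy)
```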

#Splitting the dataset into training and validation sets.

In [9]:
from sklearn.model_selection import train_test_split
# Split the dataset into features and labels
X = final_features.drop(['Patient-Uid', 'Date', 'Incident', 'TARGET DRUG'], axis=1)
y = final_features['TARGET DRUG']

# Split the data into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=42)
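
Because each patient contributes many rows, a plain random split can place rows from the same patient in both the training and the validation set. A hedged sketch of a group-aware alternative using scikit-learn's `GroupShuffleSplit` is shown below; the arrays are toy data, and in this notebook the groups would presumably be the `Patient-Uid` column.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical toy data: 10 rows belonging to 5 patients.
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array([0, 1, 0, 0, 1, 0, 0, 1, 0, 0])
groups = np.array(['p1', 'p1', 'p2', 'p2', 'p3', 'p3', 'p4', 'p4', 'p5', 'p5'])

# Split by patient: whole groups go to either the training or the validation side.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.3, random_state=42)
train_idx, val_idx = next(splitter.split(X_toy, y_toy, groups=groups))

train_patients = set(groups[train_idx])
val_patients = set(groups[val_idx])

# The two sets of patients should not overlap.
print(train_patients & val_patients)
```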


#Fitting the model on the training data.

#Scaling

In [10]:
from sklearn.preprocessing import StandardScaler
# Scale the features using StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_val_scaled = scaler.transform(X_val)



# Create an imputer

In [11]:
from sklearn.impute import SimpleImputer

# Create an imputer to fill missing values with the mean of each column
imputer = SimpleImputer(strategy='mean')

# Fit the imputer on the training features and transform both training and validation features
X_train_imputed = imputer.fit_transform(X_train_scaled)
X_val_imputed = imputer.transform(X_val_scaled)
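
The scaling, imputation, and model steps can also be chained in a single scikit-learn pipeline, which guarantees that the same fitted transformers are applied at training and prediction time. This is a sketch on random toy data, not the notebook's actual features; it imputes before scaling, which is the more common ordering.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy features with a few missing values.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(40, 3))
y_toy = (X_toy[:, 0] > 0).astype(int)
X_toy[::7, 1] = np.nan

# Impute, then scale, then fit the classifier, all in one estimator.
model = make_pipeline(
    SimpleImputer(strategy='mean'),
    StandardScaler(),
    LogisticRegression(),
)
model.fit(X_toy, y_toy)

# predict() runs the fitted imputer and scaler before the classifier.
preds = model.predict(X_toy)
print(preds[:10])
```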


#Model fitting - LogisticRegression

In [12]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

# Create a Logistic Regression model
logreg_model = LogisticRegression()

# Train the model on the imputed training data
logreg_model.fit(X_train_imputed, y_train)

# Predict on the imputed validation data
y_pred_logreg = logreg_model.predict(X_val_imputed)

# F1 score on the validation set
f1_score_logreg = f1_score(y_val, y_pred_logreg)

print("F1 Score (Logistic Regression):", f1_score_logreg)

F1 Score (Logistic Regression): 1.0
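
A validation F1 of exactly 1.0 deserves a check before it is trusted. Each row carries exactly one incident, and X keeps an indicator column for every incident type except TARGET DRUG, so the label equals 1 minus the sum of the remaining indicators. The model can therefore recover the label from the features alone. A small sketch on made-up rows illustrates the relationship:

```python
import pandas as pd

# Hypothetical toy incidents, one per row as in the real data.
toy = pd.DataFrame({
    'Incident': ['DRUG_TYPE_0', 'TARGET DRUG', 'SYMPTOM_TYPE_0',
                 'TARGET DRUG', 'DRUG_TYPE_1'],
})

# Same construction as in the notebook: one indicator per incident type.
indicators = pd.get_dummies(toy['Incident']).astype(int)
y_toy = indicators['TARGET DRUG']
X_toy = indicators.drop(columns='TARGET DRUG')

# The label is fully determined by the other indicator columns.
reconstructed = 1 - X_toy.sum(axis=1)
print((reconstructed == y_toy).all())
```

A score obtained this way says little about predicting eligibility in the next 30 days; features built only from events before a cutoff date would avoid this dependency.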


#Load Test Data

In [3]:
df_test = pd.read_parquet("/content/test.parquet", engine="pyarrow")
df_test.head(5)

Unnamed: 0,Patient-Uid,Date,Incident
0,a0f9e8a9-1c7c-11ec-8d25-16262ee38c7f,2016-12-08,SYMPTOM_TYPE_0
1,a0f9e8a9-1c7c-11ec-8d25-16262ee38c7f,2018-10-17,DRUG_TYPE_0
2,a0f9e8a9-1c7c-11ec-8d25-16262ee38c7f,2017-12-01,DRUG_TYPE_2
3,a0f9e8a9-1c7c-11ec-8d25-16262ee38c7f,2018-12-05,DRUG_TYPE_1
4,a0f9e8a9-1c7c-11ec-8d25-16262ee38c7f,2017-11-04,SYMPTOM_TYPE_0


#Preprocessing And Clean the test data

In [14]:
df_test.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1065524 entries, 0 to 1372859
Data columns (total 3 columns):
 #   Column       Non-Null Count    Dtype         
---  ------       --------------    -----         
 0   Patient-Uid  1065524 non-null  object        
 1   Date         1065524 non-null  datetime64[ns]
 2   Incident     1065524 non-null  object        
dtypes: datetime64[ns](1), object(2)
memory usage: 32.5+ MB


In [15]:
# drop_duplicates is a method: call it and keep the DataFrame it returns
df_test = df_test.drop_duplicates()

In [16]:
# Data cleaning and formatting
df_test['Date'] = pd.to_datetime(df_test['Date'])

# Handle missing values
df_test.dropna(inplace=True)

#Separate positive and negative sets based on Target Drug

In [17]:
# Positive set: event rows where the incident is the "Target Drug"
positive_set = df_test[df_test['Incident'] == 'TARGET DRUG']

# Negative set: all other event rows
negative_set = df_test[df_test['Incident'] != 'TARGET DRUG']

# Date of the first "Target Drug" event for each patient
eligible_patients = positive_set.groupby('Patient-Uid')['Date'].min()

# Keep positive rows dated on or after the patient's first "Target Drug" event
# (vectorized lookup; also safe when positive_set is empty)
first_target_date = positive_set['Patient-Uid'].map(eligible_patients)
positive_set = positive_set[positive_set['Date'] >= first_target_date]

# Combine positive and negative sets
combined_set = pd.concat([positive_set, negative_set])

#Feature Engineering - Time-based and frequency-based

In [None]:
# Create frequency-based features
patient_frequency = df_test.groupby('Patient-Uid')['Date'].count().reset_index()
patient_frequency.columns = ['Patient-Uid', 'Frequency']


In [None]:

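# Time since the previous event for each patient.
# diff() compares consecutive rows, so sort chronologically within each patient first.
df_test = df_test.sort_values(['Patient-Uid', 'Date'])
df_test['TimeSinceLastEvent'] = df_test.groupby('Patient-Uid')['Date'].diff().dt.days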

# time since the first event for each patient
df_test['TimeSinceFirstEvent'] = (df_test['Date'] - df_test.groupby('Patient-Uid')['Date'].transform('min')).dt.days

# number of days between consecutive events
df_test['TimeBetweenEvents'] = df_test['TimeSinceLastEvent'].fillna(0)

In [18]:

# Create a new column 'DRUG_TYPE_18' in the test data with default values (e.g., 0)
df_test['DRUG_TYPE_18'] = 0

# Create a feature for the average time between events for each patient
avg_time_between_events = df_test.groupby('Patient-Uid')['TimeBetweenEvents'].mean().reset_index()
avg_time_between_events.columns = ['Patient-Uid', 'AvgTimeBetweenEvents']

# Combine features
final_test_features = pd.merge(df_test, patient_frequency, on='Patient-Uid', how='left')
final_test_features = pd.merge(final_test_features, avg_time_between_events, on='Patient-Uid', how='left')

# Create binary features for different event types
event_types = df_test['Incident'].unique()

for event in event_types:
    final_test_features[event] = (final_test_features['Incident'] == event).astype(int)
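
The indicator columns built here come from the incident types present in the test data, which need not match the training columns in number or order. One way to align a test frame with the training columns is `DataFrame.reindex`; the sketch below uses made-up column sets (`train_cols` and `DRUG_TYPE_99` are illustrative).

```python
import pandas as pd

# Hypothetical training column order.
train_cols = ['Frequency', 'DRUG_TYPE_0', 'DRUG_TYPE_18', 'SYMPTOM_TYPE_0']

# Hypothetical test features: different order, one missing column, one extra column.
test_toy = pd.DataFrame({
    'SYMPTOM_TYPE_0': [1, 0],
    'Frequency': [5, 5],
    'DRUG_TYPE_0': [0, 1],
    'DRUG_TYPE_99': [0, 0],
})

# Keep only the training columns, in training order; columns absent from the test data become 0.
aligned = test_toy.reindex(columns=train_cols, fill_value=0)
print(aligned)
```

This removes the need to add individual missing columns such as DRUG_TYPE_18 by hand.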

#Applying scaling and imputation to the test data.


In [None]:
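# Separate features from the test data and align them with the training features X
# (same columns in the same order; columns absent from the test data are filled with 0)
X_test = final_test_features.drop(['Patient-Uid', 'Date', 'Incident'], axis=1)
X_test = X_test.reindex(columns=X.columns, fill_value=0)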

# Apply the scaler and imputer that were already fitted on the training data
X_test_scaled = scaler.transform(X_test)
X_test_imputed = imputer.transform(X_test_scaled)

# Reuse the Logistic Regression model fitted on the training data
# (a newly constructed LogisticRegression is unfitted and cannot predict)
y_pred_test = logreg_model.predict(X_test_imputed)


# Create a DataFrame for predictions
predictions_df = pd.DataFrame({'Patient-Uid': final_test_features['Patient-Uid'], 'Eligibility_Prediction': y_pred_test})

# Save predictions to a CSV file
predictions_df.to_csv('eligibility_predictions_lr_test.csv', index=False)
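
The predictions above contain one row per event, while the problem statement asks for a decision per patient. One possible way to collapse them is sketched below on made-up predictions; the "any row flagged" rule is an assumption, not something the assignment specifies.

```python
import pandas as pd

# Hypothetical row-level predictions (one row per event).
row_preds = pd.DataFrame({
    'Patient-Uid': ['p1', 'p1', 'p2', 'p2', 'p3'],
    'Eligibility_Prediction': [0, 1, 0, 0, 1],
})

# Assumed convention: a patient is flagged if any of their rows is flagged.
patient_preds = (
    row_preds.groupby('Patient-Uid', as_index=False)['Eligibility_Prediction'].max()
)
print(patient_preds)
```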
