# JanataHack: Healthcare Analytics

### **This notebook describes the solution presented for the JanataHack Healthcare Analytics challenge conducted by the Analytics Vidhya team**
[Janatahack: Healthcare Analytics](https://datahack.analyticsvidhya.com/contest/janatahack-healthcare-analytics/#ProblemStatement)


---




# **The Problem Statement**



MedCamp was started because the founders saw their families suffer due to poor work-life balance and neglected health.
MedCamp organizes health camps in several cities with low work-life balance. They reach out to working people and ask them to register for these health camps. For those who attend, MedCamp provides the facility to undergo health checks or to increase awareness by visiting various stalls (depending on the format of the camp).

MedCamp has conducted 65 such events over a period of 4 years and sees a high drop-off between “Registration” and the number of people taking tests at the camps. In the last 4 years, they have stored data on ~110,000 registrations.

One of the largest costs in arranging these camps is the amount of inventory that needs to be carried. Carrying more inventory than required incurs unnecessarily high costs. On the other hand, carrying less inventory than required for conducting these medical checks leaves people with a bad experience.

 

The Process:

* MedCamp employees / volunteers reach out to people and drive registrations.
* During the camp, people who “ShowUp” either undergo the medical tests or visit stalls, depending on the format of the health camp.

Other things to note:

* Since this is a completely voluntary activity for the working professionals, MedCamp usually has little profile information about these people.
* For a few camps, there was a hardware failure, so some information about the date and time of registration is lost.
* MedCamp runs 3 formats of these camps. The first and second formats provide people with an instantaneous health score. The third format provides information about several health issues through various awareness stalls.

Favourable outcome:

* For the first 2 formats, a favourable outcome is defined as getting a health_score, while in the third format it is defined as visiting at least one stall.
* We need to predict the chances (probability) of having a favourable outcome.


# Data Description



**Train.zip** contains the following 6 csvs alongside the data dictionary that contains definitions for each variable

**Health_Camp_Detail.csv** – File containing Health_Camp_Id, Camp_Start_Date, Camp_End_Date and Category details of each camp.

**Train.csv** – File containing registration details for all the test camps. This includes Patient_ID, Health_Camp_ID, Registration_Date and a few anonymized variables as on registration date.

**Patient_Profile.csv** – This file contains Patient profile details like Patient_ID, Online_Follower, Social media details, Income, Education, Age, First_Interaction_Date, City_Type and Employer_Category

**First_Health_Camp_Attended.csv** – This file contains details about people who attended health camp of first format. This includes Donation (amount) & Health_Score of the person.

**Second_Health_Camp_Attended.csv** - This file contains details about people who attended health camp of second format. This includes Health_Score of the person.

**Third_Health_Camp_Attended.csv** - This file contains details about people who attended health camp of third format. This includes Number_of_stall_visited & Last_Stall_Visited_Number.



## Test Set

**Test.csv** – File containing registration details for all the test camps. This includes Patient_ID, Health_Camp_ID, Registration_Date and a few anonymized variables as on registration date.

 

## Train / Test split:

* Camps started on or before 31st March 2006 are considered in Train.
* Test data is for all camps conducted on or after 1st April 2006.


## Sample Submission:

**Patient_ID**: Unique identifier for each patient. This ID is not sequential in nature and cannot be used in modeling

**Health_Camp_ID**: Unique identifier for each camp. This ID is not sequential in nature and cannot be used in modeling

**Outcome**: Predicted probability of a favourable outcome


# Introduction to PyCaret

**[PyCaret](https://pycaret.org/)** is an open source, low-code machine learning library in Python that lets you go from preparing your data to deploying your model with only a few lines of code, in your choice of notebook environment.

![](https://avatars1.githubusercontent.com/u/58118658?s=460&u=93d2659aa7c1a6b8fecc15b87a2bad4d8b0022fb&v=4)

# **Importing Libraries**

In [None]:
!pip install pycaret # To install pycaret (run before importing it below)
import os # To work with operating system
import pickle as pkl # To save python objects
import numpy as np # For Numerical Computing
import pandas as pd # To Work with csv files and dataframes
from sklearn.metrics import accuracy_score, confusion_matrix # For Metrics
from pycaret.classification import * # To build classification models with PyCaret
from fastai.tabular import add_datepart # To add Date related columns into dataframe (fastai v1 import path)

In [3]:
#Loading the data from google drive to colab local path
DATASET_DRIVE_PATH = "/content/drive/My Drive/HealthCareDataset"
DATASET_LOCAL_PATH = "/content/HealthCareDataset"
if not os.path.exists(DATASET_LOCAL_PATH) : 
    !cp -r "{DATASET_DRIVE_PATH}" "{DATASET_LOCAL_PATH}"
%cd "{DATASET_LOCAL_PATH}"

/content/HealthCareDataset


# Read Data 

In [4]:
train = pd.read_csv("Train/Train.csv", parse_dates=['Registration_Date'])
train.sample(5)

Unnamed: 0,Patient_ID,Health_Camp_ID,Registration_Date,Var1,Var2,Var3,Var4,Var5
67116,511320,6534,2006-07-28,0,0,0,0,0
20629,507978,6554,2005-06-18,0,0,0,0,0
49863,512756,6529,2006-03-29,0,0,0,0,0
16034,514277,6539,2004-11-20,1,0,0,0,0
72413,526437,6561,2003-12-05,2,0,0,0,2


In [5]:
first_health_camp_data = pd.read_csv("Train/First_Health_Camp_Attended.csv", 
                                     usecols = ['Patient_ID', 'Health_Camp_ID', 'Donation', 'Health_Score'])
first_health_camp_data.sample(5)

Unnamed: 0,Patient_ID,Health_Camp_ID,Donation,Health_Score
4605,514487,6563,30,0.205128
4264,492349,6538,20,0.546326
4541,521163,6539,40,0.771654
5949,507738,6537,80,0.917012
4756,514830,6586,40,0.33


In [6]:
second_health_camp_data = pd.read_csv("Train/Second_Health_Camp_Attended.csv")
second_health_camp_data.sample(5)

Unnamed: 0,Patient_ID,Health_Camp_ID,Health Score
6451,510128,6534,0.586493
1125,517419,6523,0.869446
3907,508046,6529,0.221351
2403,490079,6534,0.844787
5142,487838,6536,0.55266


In [7]:
third_health_camp_data = pd.read_csv("Train/Third_Health_Camp_Attended.csv")
third_health_camp_data.sample(5)

Unnamed: 0,Patient_ID,Health_Camp_ID,Number_of_stall_visited,Last_Stall_Visited_Number
4163,487181,6527,1,2
2095,495414,6527,2,1
2762,526449,6541,1,1
3670,494790,6527,5,3
6069,497189,6541,5,4


In [8]:
health_camp_detail = pd.read_csv("Train/Health_Camp_Detail.csv", parse_dates = ['Camp_Start_Date', 'Camp_End_Date'])
health_camp_detail.sample(5)

Unnamed: 0,Health_Camp_ID,Camp_Start_Date,Camp_End_Date,Category1,Category2,Category3
19,6540,2004-11-01,2004-11-04,First,E,2
41,6541,2005-12-03,2006-01-30,Third,G,2
56,6545,2006-09-22,2006-09-27,First,C,2
26,6532,2005-02-19,2005-08-23,First,F,2
39,6575,2005-10-12,2005-10-14,First,C,2


In [9]:
patient_profile = pd.read_csv("Train/Patient_Profile.csv", parse_dates = ['First_Interaction'])
patient_profile.sample(5)

Unnamed: 0,Patient_ID,Online_Follower,LinkedIn_Shared,Twitter_Shared,Facebook_Shared,Income,Education_Score,Age,First_Interaction,City_Type,Employer_Category
25694,518380,0,0,0,0,,,,2003-10-19,,
35958,511445,0,0,0,0,,,,2006-08-06,,
27354,498143,0,0,0,0,1.0,,36.0,2004-08-18,E,
2069,504619,0,0,0,0,,,,2003-11-29,C,
25289,501867,0,0,0,0,,,,2002-11-12,B,


# Feature Engineering

## Transforming Train Data

In [10]:
#Filling the missing values with UNK for categorical cols
patient_profile['City_Type'] = patient_profile['City_Type'].fillna('UNK')
patient_profile['Employer_Category'] = patient_profile['Employer_Category'].fillna('UNK')

In [11]:
# Joining the patient and health camp data with train data
data = train.merge(patient_profile, on = 'Patient_ID', how = 'left', suffixes = ['_train', '_patient'])
data = data.merge(health_camp_detail, on = 'Health_Camp_ID', how = 'left', suffixes = ['_train', '_health_camp'])
data = data.merge(first_health_camp_data, on = ['Patient_ID', 'Health_Camp_ID'], 
                  how = 'left', suffixes = ['_train', '_first_health_camp'])
data = data.merge(second_health_camp_data, on = ['Patient_ID', 'Health_Camp_ID'], 
                  how = 'left', suffixes = ['_train', '_second_health_camp'])
data = data.merge(third_health_camp_data, on = ['Patient_ID', 'Health_Camp_ID'], 
                  how = 'left', suffixes = ['_train', '_third_health_camp'])
data.sample(5)

Unnamed: 0,Patient_ID,Health_Camp_ID,Registration_Date,Var1,Var2,Var3,Var4,Var5,Online_Follower,LinkedIn_Shared,Twitter_Shared,Facebook_Shared,Income,Education_Score,Age,First_Interaction,City_Type,Employer_Category,Camp_Start_Date,Camp_End_Date,Category1,Category2,Category3,Donation,Health_Score,Health Score,Number_of_stall_visited,Last_Stall_Visited_Number
2584,510630,6570,2005-04-29,3,1,0,0,1,0,0,0,0,1.0,91.0,38.0,2005-02-17,D,Education,2005-07-09,2005-07-22,First,E,2,40.0,0.425993,,,
59340,515519,6562,2005-01-07,0,0,0,0,0,0,0,0,0,0.0,,43.0,2004-09-08,A,UNK,2004-11-24,2005-06-02,First,F,2,,,,,
53459,491889,6542,2005-08-16,0,0,0,0,0,0,0,0,0,,,,2005-07-30,E,UNK,2005-02-19,2005-08-23,First,F,2,,,,,
20668,512104,6527,2005-05-24,0,0,0,0,0,0,0,0,0,1.0,69.0,44.0,2004-09-21,B,Technology,2005-06-13,2005-07-22,Third,G,2,,,,2.0,1.0
44997,494580,6540,2004-10-29,0,0,0,0,0,0,0,0,0,,,,2004-10-25,UNK,UNK,2004-11-01,2004-11-04,First,E,2,,,,,
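The left joins above rely on each patient and each camp appearing only once in its lookup table; if a key were repeated, the merge would silently duplicate registrations. A minimal sketch of how that assumption could be checked with the `validate` and `indicator` arguments of `DataFrame.merge`. The toy frames and variable names below are illustrative stand-ins, not the contest files.

```python
import pandas as pd

# Toy stand-ins for Train.csv and Patient_Profile.csv (illustrative values only)
registrations = pd.DataFrame({
    "Patient_ID": [1, 1, 2, 3],
    "Health_Camp_ID": [10, 11, 10, 12],
})
profiles = pd.DataFrame({
    "Patient_ID": [1, 2],
    "City_Type": ["A", "UNK"],
})

# validate="many_to_one" raises a MergeError if Patient_ID repeats in `profiles`;
# indicator=True adds a `_merge` column showing which rows found a match.
merged = registrations.merge(
    profiles, on="Patient_ID", how="left",
    validate="many_to_one", indicator=True,
)

# A left join on a unique key keeps exactly one row per registration.
row_count_preserved = len(merged) == len(registrations)
unmatched = int((merged["_merge"] == "left_only").sum())
print(f"rows preserved: {row_count_preserved}, registrations without a profile: {unmatched}")
```

Running the same check on the real frames would show whether any registration lacks a profile or camp record before the features are built.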


In [12]:
#Adding extra features and target variables
data['Camp_Duration'] = (data['Camp_End_Date'] - data['Camp_Start_Date']).dt.days
# Note: these two differences are taken before Registration_Date is imputed below,
# so rows with a missing registration date keep NaN in both features
data['RegistrationToCampStart'] = (data['Registration_Date'] - data['Camp_Start_Date']).dt.days
data['RegistrationToCampEnd'] = (data['Registration_Date'] - data['Camp_End_Date']).dt.days
# Filling missing registration dates with the later of First_Interaction and Camp_Start_Date
data['Registration_Date'] = np.where(data['Registration_Date'].isnull(), 
                                     np.where(data['First_Interaction'] > data['Camp_Start_Date'], 
                                              data['First_Interaction'], data['Camp_Start_Date']), 
                                     data['Registration_Date'] 
                                     )

data['Age'] = data['Age'].replace({'None' : np.nan}) #Replacing None value with NaN value

# Defining Target Variables
# 'Health_Score' comes from the first-format file, 'Health Score' (with a space) from the second

data['isAttendedFirstCamp'] = data['Health_Score'].notnull().astype('int')
data['isAttendedSecondCamp'] = data['Health Score'].notnull().astype('int')
data['isAttendedThirdCamp'] = (data['Number_of_stall_visited']>0).astype('int')

data.sample(5)

Unnamed: 0,Patient_ID,Health_Camp_ID,Registration_Date,Var1,Var2,Var3,Var4,Var5,Online_Follower,LinkedIn_Shared,Twitter_Shared,Facebook_Shared,Income,Education_Score,Age,First_Interaction,City_Type,Employer_Category,Camp_Start_Date,Camp_End_Date,Category1,Category2,Category3,Donation,Health_Score,Health Score,Number_of_stall_visited,Last_Stall_Visited_Number,Camp_Duration,RegistrationToCampStart,RegistrationToCampEnd,isAttendedFirstCamp,isAttendedSecondCamp,isAttendedThirdCamp
49084,500204,6585,2003-11-02,0,0,0,0,0,0,0,0,0,4.0,72.0,45.0,2003-10-23,G,Education,2003-11-22,2003-12-05,First,E,2,,,,,,13,-20.0,-33.0,0,0,0
68258,487445,6586,2004-10-09,0,0,0,0,0,0,0,0,0,,,,2004-08-30,UNK,UNK,2004-10-01,2004-10-18,First,E,2,,,,,,17,8.0,-9.0,0,0,0
46679,500719,6529,2006-03-28,0,0,0,0,0,0,0,0,0,,,,2006-01-10,UNK,UNK,2006-03-30,2006-04-03,Second,A,2,,,,,,4,-2.0,-6.0,0,0,0
21526,528322,6523,2005-03-01,0,0,0,0,0,0,0,0,0,1.0,87.0,40.0,2004-12-04,D,Consulting,2005-02-23,2005-09-16,Second,D,2,,,0.373206,,,205,6.0,-199.0,0,1,0
58804,489643,6526,2005-01-09,3,0,0,0,0,1,0,1,1,4.0,82.0,41.0,2003-01-31,A,Technology,2005-01-03,2005-02-20,First,E,2,,,,,,48,6.0,-42.0,0,0,0
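In the cell above, the registration-to-camp differences are computed before the missing `Registration_Date` values are filled, so those rows end up with `NaN` in the derived features. The sketch below contrasts that order with the alternative of filling first, on a two-row toy frame. It is only an illustration of the difference; the alternative order was not part of the original run, and the toy dates are made up.

```python
import pandas as pd

toy = pd.DataFrame({
    "Registration_Date": pd.to_datetime(["2005-01-10", None]),
    "First_Interaction": pd.to_datetime(["2004-12-01", "2005-03-05"]),
    "Camp_Start_Date": pd.to_datetime(["2005-01-03", "2005-03-01"]),
})

# Same fallback rule as the notebook: the later of first interaction and camp start.
fallback = toy["First_Interaction"].where(
    toy["First_Interaction"] > toy["Camp_Start_Date"], toy["Camp_Start_Date"]
)

# Order used in the notebook: difference first, so the missing date stays NaN.
before = (toy["Registration_Date"] - toy["Camp_Start_Date"]).dt.days

# Alternative order: fill the date first, then take the difference.
filled = toy["Registration_Date"].fillna(fallback)
after = (filled - toy["Camp_Start_Date"]).dt.days

print(pd.DataFrame({"difference_then_fill": before, "fill_then_difference": after}))
```

Either choice can be defensible: leaving `NaN` lets the model treat "registration date unknown" as its own signal, while filling first gives every row a numeric value.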


In [13]:
data.sample(5)

Unnamed: 0,Patient_ID,Health_Camp_ID,Registration_Date,Var1,Var2,Var3,Var4,Var5,Online_Follower,LinkedIn_Shared,Twitter_Shared,Facebook_Shared,Income,Education_Score,Age,First_Interaction,City_Type,Employer_Category,Camp_Start_Date,Camp_End_Date,Category1,Category2,Category3,Donation,Health_Score,Health Score,Number_of_stall_visited,Last_Stall_Visited_Number,Camp_Duration,RegistrationToCampStart,RegistrationToCampEnd,isAttendedFirstCamp,isAttendedSecondCamp,isAttendedThirdCamp
53100,522808,6536,2005-01-12,0,0,0,0,0,0,0,0,0,1.0,,,2003-12-19,F,Technology,2005-02-15,2005-02-18,Second,D,2,,,,,,3,-34.0,-37.0,0,0,0
33100,506482,6534,2006-01-06,0,0,0,0,0,0,0,0,0,,,,2005-06-15,UNK,UNK,2005-10-17,2007-11-07,Second,A,2,,,0.64297,,,751,81.0,-670.0,0,1,0
49105,499550,6580,2004-12-15,0,0,0,0,0,0,0,0,0,3.0,,70.0,2004-12-06,UNK,Transport,2004-12-22,2005-01-06,First,E,2,,,,,,15,-7.0,-22.0,0,0,0
49540,519587,6539,2004-09-22,0,0,0,0,0,1,0,1,0,3.0,,42.0,2004-09-16,H,Software Industry,2004-08-07,2005-02-12,First,F,2,,,,,,189,46.0,-143.0,0,0,0
29655,514355,6537,2007-01-16,0,0,0,0,0,0,0,0,0,,,,2006-08-17,C,UNK,2005-09-27,2007-11-07,First,F,2,,,,,,771,476.0,-295.0,0,0,0


## Transforming Test Data

In [14]:
# Test data
test = pd.read_csv("test.csv", parse_dates = ['Registration_Date'])
test.sample(5)

Unnamed: 0,Patient_ID,Health_Camp_ID,Registration_Date,Var1,Var2,Var3,Var4,Var5
29040,504535,6583,2006-06-24,0,0,0,0,0
28273,497131,6548,2006-06-20,0,0,0,0,0
4224,524822,6584,2006-06-05,0,0,0,0,0
15270,506616,6566,2006-03-10,0,0,0,0,0
4512,511429,6576,2006-09-06,0,0,0,0,0


In [15]:
# Checking the Patient data in test with patients profile
test_patients = set(test.Patient_ID.unique())
patients = set(patient_profile['Patient_ID'].unique())
patients.intersection(test_patients) == test_patients

True

In [16]:
# Checking the health camp data given in test
test_health_camps = set(test.Health_Camp_ID.unique())
health_camps = set(health_camp_detail.Health_Camp_ID.unique())
health_camps.intersection(test_health_camps) == test_health_camps 

True
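The two set comparisons above confirm that every patient and camp in the test file has metadata. A small sketch of an equivalent check that also reports which IDs are missing when the answer is `False`; the ID values here are invented for illustration.

```python
import pandas as pd

# Toy ID columns standing in for test.Patient_ID and patient_profile.Patient_ID
test_ids = pd.Series([101, 102, 103], name="Patient_ID")
profile_ids = pd.Series([101, 102, 104], name="Patient_ID")

# issubset expresses the intent directly: is every test ID present in the profiles?
all_covered = set(test_ids).issubset(set(profile_ids))

# isin gives the offending IDs, which the boolean alone does not reveal.
missing_ids = test_ids[~test_ids.isin(profile_ids)].tolist()

print(all_covered, missing_ids)
```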

In [17]:
# Joining Test data with Patient and Health camp metadata
test_new = (test.merge(patient_profile, on = 'Patient_ID', how = 'inner')
            .merge(health_camp_detail, on = 'Health_Camp_ID', how = 'inner'))

test_new.sample(5)

Unnamed: 0,Patient_ID,Health_Camp_ID,Registration_Date,Var1,Var2,Var3,Var4,Var5,Online_Follower,LinkedIn_Shared,Twitter_Shared,Facebook_Shared,Income,Education_Score,Age,First_Interaction,City_Type,Employer_Category,Camp_Start_Date,Camp_End_Date,Category1,Category2,Category3
19122,505835,6583,2006-06-27,0,0,0,0,0,0,0,0,0,,,,2004-01-31,H,UNK,2006-08-02,2006-08-05,Second,A,2
28477,515066,6550,2006-11-09,0,0,0,0,0,0,0,0,0,,,,2006-11-09,H,UNK,2006-10-12,2006-12-18,Third,G,2
15210,512698,6584,2006-07-14,3,0,0,0,1,0,0,0,0,,,,2003-05-04,G,UNK,2006-08-04,2006-08-09,Second,A,2
34115,492759,6568,2006-08-19,0,0,0,0,0,0,0,0,0,,,,2006-08-07,UNK,UNK,2006-08-17,2006-09-13,First,E,2
34135,511863,6568,2006-08-19,1,0,0,0,0,0,0,0,0,5.0,78.0,49.0,2003-01-25,G,Education,2006-08-17,2006-09-13,First,E,2


In [18]:
#Extracting new features for test data
test_new['Camp_Duration'] = (test_new['Camp_End_Date'] - test_new['Camp_Start_Date']).dt.days
test_new['RegistrationToCampStart'] = (test_new['Registration_Date'] - test_new['Camp_Start_Date']).dt.days
test_new['RegistrationToCampEnd'] = (test_new['Registration_Date'] - test_new['Camp_End_Date']).dt.days
test_new['Age'] = test_new['Age'].replace({'None' : np.nan})

In [19]:
# Comparing the Train and Test Columns
set(data.columns).difference(test_new.columns)

{'Donation',
 'Health Score',
 'Health_Score',
 'Last_Stall_Visited_Number',
 'Number_of_stall_visited',
 'isAttendedFirstCamp',
 'isAttendedSecondCamp',
 'isAttendedThirdCamp'}

The following columns are ignored because they will not be available at prediction time for new registrations:


* Health Score
* Health_Score
* Last_Stall_Visited_Number
* Number_of_stall_visited
* Donation


In [22]:
# Defining the Train and Target Columns
train_columns = ['Registration_Date', 'Var1', 'Var2', 'Category3', 'Camp_Duration', 
                 'Var3', 'Var4', 'Var5', 'Online_Follower', 'LinkedIn_Shared', 
                 'Twitter_Shared', 'Facebook_Shared', 'Income', 'Education_Score',  
                 'First_Interaction', 'City_Type', 'Employer_Category', 'Age',
                 'Camp_Start_Date', 'Camp_End_Date', 'Category1', 'Category2', 
                 'RegistrationToCampStart', 'RegistrationToCampEnd'
                 ]
date_columns = ['Registration_Date', 'First_Interaction', 'Camp_Start_Date', 'Camp_End_Date']
target_columns = ['isAttendedFirstCamp', 
                  'isAttendedSecondCamp', 
                  'isAttendedThirdCamp'
                  ]

In [23]:
data_train = data[train_columns + target_columns].copy()
data_train['index'] = data_train['Registration_Date']
data_train.set_index('index', inplace = True)
data_train = data_train.sort_index()
data_train.sample(5)

index,Registration_Date,Var1,Var2,Category3,Camp_Duration,Var3,Var4,Var5,Online_Follower,LinkedIn_Shared,Twitter_Shared,Facebook_Shared,Income,Education_Score,First_Interaction,City_Type,Employer_Category,Age,Camp_Start_Date,Camp_End_Date,Category1,Category2,RegistrationToCampStart,RegistrationToCampEnd,isAttendedFirstCamp,isAttendedSecondCamp,isAttendedThirdCamp
2006-12-07,2006-12-07,0,0,2,771,0,0,0,0,0,0,0,,,2006-12-05,A,UNK,,2005-09-27,2007-11-07,First,F,436.0,-335.0,0,0,0
2005-11-10,2005-11-10,0,0,2,751,0,0,0,0,0,0,0,,,2005-10-11,UNK,UNK,,2005-10-17,2007-11-07,Second,A,24.0,-727.0,0,1,0
2005-07-04,2005-07-04,0,0,2,185,0,0,0,0,0,0,0,,,2005-07-02,UNK,UNK,,2005-02-19,2005-08-23,First,F,135.0,-50.0,0,0,0
2005-07-12,2005-07-12,0,0,2,185,0,0,0,0,0,0,0,,,2005-06-17,D,UNK,,2005-02-19,2005-08-23,First,F,143.0,-42.0,0,0,0
2005-03-06,2005-03-06,0,0,2,185,0,0,0,0,0,0,0,,,2003-09-23,I,UNK,,2005-02-19,2005-08-23,First,F,15.0,-170.0,0,0,0


In [26]:
# Adding date related columns to Train and Test Data
for column in date_columns : 
    print(column)
    add_datepart(data_train, field_name = column)
    add_datepart(test_new, field_name = column)
    

Registration_Date
First_Interaction
Camp_Start_Date
Camp_End_Date
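`add_datepart` replaces each date column with a set of calendar features (year, month, day of week, an elapsed-seconds value, and so on). The sketch below reproduces a few of them with plain pandas to show what the derived columns contain. `add_basic_dateparts` is a hypothetical helper written for this illustration, not the fastai implementation, and it covers only a subset of the columns fastai generates.

```python
import pandas as pd

def add_basic_dateparts(df, field_name, prefix):
    """Hypothetical helper: derive a few calendar features from a datetime column."""
    dates = df[field_name]
    df[prefix + "Year"] = dates.dt.year
    df[prefix + "Month"] = dates.dt.month
    df[prefix + "Day"] = dates.dt.day
    df[prefix + "Dayofweek"] = dates.dt.dayofweek  # Monday=0 ... Sunday=6
    df[prefix + "Dayofyear"] = dates.dt.dayofyear
    # Whole seconds since the Unix epoch
    df[prefix + "Elapsed"] = (dates - pd.Timestamp("1970-01-01")) // pd.Timedelta(seconds=1)
    # Drop the raw date column once its parts have been extracted
    return df.drop(columns=[field_name])

toy = pd.DataFrame({"Registration_Date": pd.to_datetime(["2006-05-21"])})
toy = add_basic_dateparts(toy, "Registration_Date", prefix="Registration_")
row = toy.iloc[0]
print(row)
```

The prefix is passed explicitly here; in the notebook the column names (`Registration_Year`, `First_InteractionYear`, ...) come from fastai's own naming.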


In [27]:
#Defining Continuous and Categorical Features
continuous_features = ['Age', 'RegistrationToCampStart', 'RegistrationToCampEnd', 'Camp_Duration', 
                       'Registration_Elapsed', 'First_InteractionElapsed', 'Camp_Start_Elapsed', 'Camp_End_Elapsed']
train_features = test_new.columns
categorical_features = [column_name for column_name in train_features if column_name not in continuous_features and 
                       column_name not in ['Patient_ID', 'Health_Camp_ID']]
len(categorical_features), len(continuous_features)

(64, 8)

## Saving the Transformed Data

Saving the transformed data lets us skip the steps above when the notebook is restarted.

In [None]:
# Saving the Data as Pickle Files
with open('train_data.pkl', 'wb') as f : 
    pkl.dump(data_train, f)

with open('test_data.pkl', 'wb') as f : 
    pkl.dump(test_new, f)

with open('categ_feat.pkl', 'wb') as f : 
    pkl.dump(categorical_features, f)

with open('cont_feat.pkl', 'wb') as f : 
    pkl.dump(continuous_features, f)

In [None]:
# Moving the Data to Google drive
!cp "/content/HealthCareDataset/test_data.pkl" "/content/drive/My Drive/HealthCareDataset"
!cp "/content/HealthCareDataset/train_data.pkl" "/content/drive/My Drive/HealthCareDataset"
!cp "/content/HealthCareDataset/categ_feat.pkl" "/content/drive/My Drive/HealthCareDataset"
!cp "/content/HealthCareDataset/cont_feat.pkl" "/content/drive/My Drive/HealthCareDataset"

## **Use the script below to load the data from Google Drive**




```
if not os.path.exists(DATASET_LOCAL_PATH) : 
    !cp -r "{DATASET_DRIVE_PATH}" "{DATASET_LOCAL_PATH}"

%cd "{DATASET_LOCAL_PATH}"

with open('train_data.pkl', 'rb') as f : 
    data_train = pkl.load(f)

with open('test_data.pkl', 'rb') as f : 
    test_data = pkl.load(f)

with open('categ_feat.pkl', 'rb') as f : 
    categorical_features = pkl.load(f)

with open('cont_feat.pkl', 'rb') as f : 
    continuous_features = pkl.load(f)
    
```



# Splitting the Data into Train and Validation Splits

In [31]:
train_data = data_train.loc[ : '2006-03-31'].copy()
val_data = data_train.loc['2006-04-01' : ].copy()
train_data.shape, val_data.shape, data_train.shape[0]

((66382, 75), (8896, 75), 75278)
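The split above slices the frame by its registration-date index, so it only works because `data_train` was sorted by that index earlier. A minimal sketch of the same partial-string slicing on a toy frame, with two sanity checks (no row lost, no overlap in time). The toy dates and labels are made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame(
    {"isAttendedFirstCamp": [0, 1, 0, 1]},
    index=pd.to_datetime(["2006-04-01", "2006-03-30", "2006-04-02", "2006-03-31"]),
).sort_index()  # label slicing on a DatetimeIndex expects a sorted index

# Both slice endpoints are inclusive, so 2006-03-31 lands in the training part.
train_part = toy.loc[:"2006-03-31"]
val_part = toy.loc["2006-04-01":]

# Every row ends up in exactly one part, and training strictly precedes validation.
no_rows_lost = len(train_part) + len(val_part) == len(toy)
no_time_overlap = train_part.index.max() < val_part.index.min()
print(no_rows_lost, no_time_overlap)
```

Note that this cutoff is applied to the registration date, whereas the contest's own train/test rule is stated in terms of camp start dates.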

# ML Modeling with PyCaret

To build ML models with PyCaret, we need to do the following tasks:



1.   Define the setup with details such as the target variable and the continuous and categorical features
2.   Create the model with the create_model function or tune the model with the tune_model function

We can save a PyCaret model with the **save_model** function and load a saved model with the **load_model** function.

For more tutorials or documentation, visit [PyCaret Tutorials](https://pycaret.org/tutorial/)



## Model for checking the First Health Camp Score

In [None]:
pycaret_clf_first = setup(data = train_data, 
                    target = 'isAttendedFirstCamp',
                    numeric_imputation = 'mean',
                    categorical_features = categorical_features, 
                    ignore_features = ['isAttendedSecondCamp', 'isAttendedThirdCamp'], 
                    numeric_features = continuous_features, 
                    bin_numeric_features = continuous_features, 
                    feature_selection = True, 
                    )
tuned_catboost_first = tune_model('catboost', optimize = 'AUC', n_iter = 20)
save_model(tuned_catboost_first, 'tuned_catboost_first')

Unnamed: 0,Accuracy,AUC,Recall,Prec.,F1,Kappa
0,0.9231,0.8761,0.2161,0.5949,0.317,0.2856
1,0.9214,0.8685,0.2018,0.5714,0.2983,0.2666
2,0.9224,0.8633,0.1766,0.6063,0.2735,0.2454
3,0.9199,0.8696,0.1606,0.5556,0.2491,0.2202
4,0.919,0.8563,0.1678,0.529,0.2548,0.2239
5,0.9241,0.8769,0.2046,0.6224,0.308,0.2785
6,0.921,0.8608,0.1885,0.5655,0.2828,0.2519
7,0.9191,0.8659,0.1586,0.5349,0.2447,0.215
8,0.9212,0.8529,0.1793,0.5735,0.2732,0.2435
9,0.9203,0.8589,0.1747,0.5547,0.2657,0.2355


Transformation Pipeline and Model Succesfully Saved


## Model for checking the Second Health Camp Score

In [None]:
%%time 
pycaret_clf_second = setup(data = train_data, 
                    target = 'isAttendedSecondCamp',
                    numeric_imputation = 'mean',
                    categorical_features = categorical_features, 
                    ignore_features = ['target', 'isAttendedFirstCamp', 'isAttendedThirdCamp'], 
                    numeric_features = continuous_features, 
                    bin_numeric_features = continuous_features, 
                    feature_selection = True, 
                    )
tuned_catboost_second = tune_model('catboost', optimize = 'AUC', n_iter = 20)
save_model(tuned_catboost_second, 'tuned_catboost_second')

Unnamed: 0,Accuracy,AUC,Recall,Prec.,F1,Kappa
0,0.9452,0.9766,0.7185,0.7443,0.7312,0.7006
1,0.9402,0.9746,0.7044,0.7161,0.7102,0.6769
2,0.9391,0.9718,0.7099,0.706,0.7079,0.6739
3,0.9364,0.972,0.6916,0.6954,0.6935,0.658
4,0.94,0.9717,0.7221,0.7066,0.7143,0.6808
5,0.9364,0.9733,0.6764,0.7008,0.6884,0.653
6,0.9398,0.9727,0.6947,0.717,0.7057,0.6722
7,0.936,0.9725,0.6947,0.6909,0.6928,0.6571
8,0.945,0.976,0.7294,0.7375,0.7335,0.7028
9,0.9433,0.9768,0.713,0.7331,0.7229,0.6913


CPU times: user 16min 34s, sys: 1min, total: 17min 34s
Wall time: 38min 5s


## Model for checking the Third Health Camp Score

In [None]:
%%time 
pycaret_clf_third = setup(data = train_data, 
                    target = 'isAttendedThirdCamp',
                    numeric_imputation = 'mean',
                    categorical_features = categorical_features, 
                    ignore_features = ['target', 'isAttendedFirstCamp', 'isAttendedSecondCamp'], 
                    numeric_features = continuous_features, 
                    bin_numeric_features = continuous_features, 
                    feature_selection = True, 
                    )
tuned_catboost_third = tune_model('catboost', optimize = 'AUC', n_iter = 20)
save_model(tuned_catboost_third, 'tuned_catboost_third')

Unnamed: 0,Accuracy,AUC,Recall,Prec.,F1,Kappa
0,0.9639,0.9875,0.8593,0.7563,0.8045,0.7848
1,0.9643,0.9882,0.8659,0.7562,0.8074,0.7878
2,0.9602,0.9841,0.8527,0.7307,0.787,0.7652
3,0.959,0.9855,0.8637,0.7185,0.7844,0.762
4,0.9619,0.9873,0.8524,0.7428,0.7938,0.7729
5,0.9649,0.9865,0.8767,0.7552,0.8114,0.7922
6,0.96,0.9855,0.8462,0.7319,0.7849,0.763
7,0.9628,0.9871,0.8527,0.7505,0.7984,0.778
8,0.9638,0.9858,0.8505,0.7588,0.8021,0.7822
9,0.9594,0.9838,0.8637,0.7211,0.786,0.7638


CPU times: user 16min 2s, sys: 55.7 s, total: 16min 57s
Wall time: 27min 26s
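The three tables above list per-fold metrics but no summary row. The sketch below averages the fold AUC values, copied by hand from those tables, to compare the three models at a glance. It uses only the standard library; the dictionary keys are labels chosen for this illustration.

```python
from statistics import mean, stdev

# Per-fold AUC values copied from the three result tables above
fold_auc = {
    "first": [0.8761, 0.8685, 0.8633, 0.8696, 0.8563,
              0.8769, 0.8608, 0.8659, 0.8529, 0.8589],
    "second": [0.9766, 0.9746, 0.9718, 0.9720, 0.9717,
               0.9733, 0.9727, 0.9725, 0.9760, 0.9768],
    "third": [0.9875, 0.9882, 0.9841, 0.9855, 0.9873,
              0.9865, 0.9855, 0.9871, 0.9858, 0.9838],
}

# Mean and sample standard deviation across the 10 folds, rounded to 4 decimals
summary = {
    name: (round(mean(values), 4), round(stdev(values), 4))
    for name, values in fold_auc.items()
}

for name, (avg, sd) in summary.items():
    print(f"{name}: mean AUC = {avg:.4f}, std = {sd:.4f}")
```

The first-format model stands out as the weakest of the three. Its table also shows fold recall near 0.2 despite accuracy above 0.9, which suggests the positive class is rare for that format and that accuracy alone would be a misleading metric there.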


## Use the following code to load the saved models



```
tuned_catboost_third = load_model('tuned_catboost_third')
tuned_catboost_first = load_model('tuned_catboost_first')
tuned_catboost_second = load_model('tuned_catboost_second')

```



# Predicting with the Model

In [None]:
test_predictions = test_data.copy()

predictions = predict_model(tuned_catboost_first, data=test_data)
test_predictions['FirstCampLabel'] = predictions['Label']
test_predictions['FirstCampScore'] = predictions['Score']

predictions = predict_model(tuned_catboost_third, data=test_data)
test_predictions['ThirdCampLabel'] = predictions['Label']
test_predictions['ThirdCampScore'] = predictions['Score']

predictions = predict_model(tuned_catboost_second, data=test_data)
test_predictions['SecondCampLabel'] = predictions['Label']
test_predictions['SecondCampScore'] = predictions['Score']

test_predictions.head()

Unnamed: 0,Patient_ID,Health_Camp_ID,Var1,Var2,Var3,Var4,Var5,Online_Follower,LinkedIn_Shared,Twitter_Shared,Facebook_Shared,Income,Education_Score,Age,City_Type,Employer_Category,Category1,Category2,Category3,Camp_Duration,RegistrationToCampStart,RegistrationToCampEnd,Registration_Year,Registration_Month,Registration_Week,Registration_Day,Registration_Dayofweek,Registration_Dayofyear,Registration_Is_month_end,Registration_Is_month_start,Registration_Is_quarter_end,Registration_Is_quarter_start,Registration_Is_year_end,Registration_Is_year_start,Registration_Elapsed,First_InteractionYear,First_InteractionMonth,First_InteractionWeek,First_InteractionDay,First_InteractionDayofweek,First_InteractionDayofyear,First_InteractionIs_month_end,First_InteractionIs_month_start,First_InteractionIs_quarter_end,First_InteractionIs_quarter_start,First_InteractionIs_year_end,First_InteractionIs_year_start,First_InteractionElapsed,Camp_Start_Year,Camp_Start_Month,Camp_Start_Week,Camp_Start_Day,Camp_Start_Dayofweek,Camp_Start_Dayofyear,Camp_Start_Is_month_end,Camp_Start_Is_month_start,Camp_Start_Is_quarter_end,Camp_Start_Is_quarter_start,Camp_Start_Is_year_end,Camp_Start_Is_year_start,Camp_Start_Elapsed,Camp_End_Year,Camp_End_Month,Camp_End_Week,Camp_End_Day,Camp_End_Dayofweek,Camp_End_Dayofyear,Camp_End_Is_month_end,Camp_End_Is_month_start,Camp_End_Is_quarter_end,Camp_End_Is_quarter_start,Camp_End_Is_year_end,Camp_End_Is_year_start,Camp_End_Elapsed,FirstCampLabel,FirstCampScore,ThirdCampLabel,ThirdCampScore,SecondCampLabel,SecondCampScore
0,505701,6548,1,0,0,0,2,0,0,0,0,0.0,,44.0,E,,Third,G,2,66,-23,-89,2006,5,20,21,6,141,False,False,False,False,False,False,1148169600,2003,2,6,5,2,36,False,False,False,False,False,False,1044403200,2006,6,24,13,1,164,False,False,False,False,False,False,1150156800,2006,8,33,18,4,230,False,False,False,False,False,False,1155859200,0,0.0001,1,0.5458,0,0.0003
1,500633,6548,0,0,0,0,0,0,1,0,0,1.0,67.0,41.0,D,Consulting,Third,G,2,66,-7,-73,2006,6,23,6,1,157,False,False,False,False,False,False,1149552000,2004,12,50,11,5,346,False,False,False,False,False,False,1102723200,2006,6,24,13,1,164,False,False,False,False,False,False,1150156800,2006,8,33,18,4,230,False,False,False,False,False,False,1155859200,0,0.0002,1,0.6048,0,0.0003
2,494067,6548,0,0,0,0,0,0,0,0,0,,,,B,,Third,G,2,66,36,-30,2006,7,29,19,2,200,False,False,False,False,False,False,1153267200,2006,7,29,19,2,200,False,False,False,False,False,False,1153267200,2006,6,24,13,1,164,False,False,False,False,False,False,1150156800,2006,8,33,18,4,230,False,False,False,False,False,False,1155859200,0,0.0001,1,0.7298,0,0.0002
3,498974,6548,0,0,0,0,0,0,0,0,0,2.0,66.74,46.0,B,Software Industry,Third,G,2,66,16,-50,2006,6,26,29,3,180,False,False,False,False,False,False,1151539200,2004,11,46,8,0,313,False,False,False,False,False,False,1099872000,2006,6,24,13,1,164,False,False,False,False,False,False,1150156800,2006,8,33,18,4,230,False,False,False,False,False,False,1155859200,0,0.0,1,0.6834,0,0.0003
4,517714,6548,0,0,0,0,0,0,0,0,0,,,,B,,Third,G,2,66,4,-62,2006,6,24,17,5,168,False,False,False,False,False,False,1150502400,2005,2,6,9,2,40,False,False,False,False,False,False,1107907200,2006,6,24,13,1,164,False,False,False,False,False,False,1150156800,2006,8,33,18,4,230,False,False,False,False,False,False,1155859200,0,0.0001,1,0.6064,0,0.0002


In [None]:
# Combining all the three predictions to get the final prediction for all health camps
test_predictions['Outcome'] = (test_predictions['FirstCampScore'] + 
                               test_predictions['SecondCampScore'] + 
                               test_predictions['ThirdCampScore'])
test_predictions.sample(5)

Unnamed: 0,Patient_ID,Health_Camp_ID,Var1,Var2,Var3,Var4,Var5,Online_Follower,LinkedIn_Shared,Twitter_Shared,Facebook_Shared,Income,Education_Score,Age,City_Type,Employer_Category,Category1,Category2,Category3,Camp_Duration,RegistrationToCampStart,RegistrationToCampEnd,Registration_Year,Registration_Month,Registration_Week,Registration_Day,Registration_Dayofweek,Registration_Dayofyear,Registration_Is_month_end,Registration_Is_month_start,Registration_Is_quarter_end,Registration_Is_quarter_start,Registration_Is_year_end,Registration_Is_year_start,Registration_Elapsed,First_InteractionYear,First_InteractionMonth,First_InteractionWeek,First_InteractionDay,First_InteractionDayofweek,...,First_InteractionIs_month_start,First_InteractionIs_quarter_end,First_InteractionIs_quarter_start,First_InteractionIs_year_end,First_InteractionIs_year_start,First_InteractionElapsed,Camp_Start_Year,Camp_Start_Month,Camp_Start_Week,Camp_Start_Day,Camp_Start_Dayofweek,Camp_Start_Dayofyear,Camp_Start_Is_month_end,Camp_Start_Is_month_start,Camp_Start_Is_quarter_end,Camp_Start_Is_quarter_start,Camp_Start_Is_year_end,Camp_Start_Is_year_start,Camp_Start_Elapsed,Camp_End_Year,Camp_End_Month,Camp_End_Week,Camp_End_Day,Camp_End_Dayofweek,Camp_End_Dayofyear,Camp_End_Is_month_end,Camp_End_Is_month_start,Camp_End_Is_quarter_end,Camp_End_Is_quarter_start,Camp_End_Is_year_end,Camp_End_Is_year_start,Camp_End_Elapsed,FirstCampLabel,FirstCampScore,ThirdCampLabel,ThirdCampScore,SecondCampLabel,SecondCampScore,Outcome,target
0,505701,6548,1,0,0,0,2,0,0,0,0,0,,44,E,,Third,G,2,66,-23,-89,2006,5,20,21,6,141,False,False,False,False,False,False,1148169600,2003,2,6,5,2,...,False,False,False,False,False,1044403200,2006,6,24,13,1,164,False,False,False,False,False,False,1150156800,2006,8,33,18,4,230,False,False,False,False,False,False,1155859200,0,0.0001,1,0.5458,0,0.0003,0.5462,1
1,500633,6548,0,0,0,0,0,0,1,0,0,1,67,41,D,Consulting,Third,G,2,66,-7,-73,2006,6,23,6,1,157,False,False,False,False,False,False,1149552000,2004,12,50,11,5,...,False,False,False,False,False,1102723200,2006,6,24,13,1,164,False,False,False,False,False,False,1150156800,2006,8,33,18,4,230,False,False,False,False,False,False,1155859200,0,0.0002,1,0.6048,0,0.0003,0.6053,1
2,494067,6548,0,0,0,0,0,0,0,0,0,,,,B,,Third,G,2,66,36,-30,2006,7,29,19,2,200,False,False,False,False,False,False,1153267200,2006,7,29,19,2,...,False,False,False,False,False,1153267200,2006,6,24,13,1,164,False,False,False,False,False,False,1150156800,2006,8,33,18,4,230,False,False,False,False,False,False,1155859200,0,0.0001,1,0.7298,0,0.0002,0.7301,1
3,498974,6548,0,0,0,0,0,0,0,0,0,2,66.74,46,B,Software Industry,Third,G,2,66,16,-50,2006,6,26,29,3,180,False,False,False,False,False,False,1151539200,2004,11,46,8,0,...,False,False,False,False,False,1099872000,2006,6,24,13,1,164,False,False,False,False,False,False,1150156800,2006,8,33,18,4,230,False,False,False,False,False,False,1155859200,0,0.0000,1,0.6834,0,0.0003,0.6837,1
4,517714,6548,0,0,0,0,0,0,0,0,0,,,,B,,Third,G,2,66,4,-62,2006,6,24,17,5,168,False,False,False,False,False,False,1150502400,2005,2,6,9,2,...,False,False,False,False,False,1107907200,2006,6,24,13,1,164,False,False,False,False,False,False,1150156800,2006,8,33,18,4,230,False,False,False,False,False,False,1155859200,0,0.0001,1,0.6064,0,0.0002,0.6067,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
35244,511358,6545,0,0,0,0,0,0,0,0,0,,,,H,,First,C,2,5,4,-1,2006,9,39,26,1,269,False,False,False,False,False,False,1159228800,2006,9,37,16,5,...,False,False,False,False,False,1158364800,2006,9,38,22,4,265,False,False,False,False,False,False,1158883200,2006,9,39,27,2,270,False,False,False,False,False,False,1159315200,1,0.5206,0,0.0000,0,0.0007,0.5213,1
35245,500117,6545,0,0,0,0,0,0,0,0,0,,,,,,First,C,2,5,2,-3,2006,9,38,24,6,267,False,False,False,False,False,False,1159056000,2006,9,37,15,4,...,False,False,False,False,False,1158278400,2006,9,38,22,4,265,False,False,False,False,False,False,1158883200,2006,9,39,27,2,270,False,False,False,False,False,False,1159315200,0,0.4294,0,0.0000,0,0.0006,0.4300,0
35246,496733,6545,0,0,0,0,0,0,0,0,0,,,,,,First,C,2,5,3,-2,2006,9,39,25,0,268,False,False,False,False,False,False,1159142400,2006,9,37,17,6,...,False,False,False,False,False,1158451200,2006,9,38,22,4,265,False,False,False,False,False,False,1158883200,2006,9,39,27,2,270,False,False,False,False,False,False,1159315200,0,0.3416,0,0.0000,0,0.0006,0.3422,0
35247,514977,6545,0,0,0,0,0,0,0,0,0,,,,,,First,C,2,5,3,-2,2006,9,39,25,0,268,False,False,False,False,False,False,1159142400,2005,12,50,13,1,...,False,False,False,False,False,1134432000,2006,9,38,22,4,265,False,False,False,False,False,False,1158883200,2006,9,39,27,2,270,False,False,False,False,False,False,1159315200,0,0.4660,0,0.0000,0,0.0006,0.4666,0
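The cell above adds the three scores together, which works because the two non-matching models score close to zero, but the sum is not guaranteed to stay within [0, 1]. One alternative, sketched below on toy rows, is to take only the score of the model that matches the camp's format. This assumes `Category1` identifies the format (First / Second / Third), as the sample rows suggest; it is a variation for comparison, not what the notebook submitted, and the column `Outcome_by_format` is a name invented here.

```python
import numpy as np
import pandas as pd

# Toy rows shaped like test_predictions (scores are illustrative)
toy = pd.DataFrame({
    "Category1": ["Third", "First", "Second"],
    "FirstCampScore": [0.0001, 0.5206, 0.0100],
    "SecondCampScore": [0.0003, 0.0007, 0.8000],
    "ThirdCampScore": [0.5458, 0.0000, 0.0200],
})

# Pick the score of the model trained for the camp's own format.
conditions = [
    toy["Category1"] == "First",
    toy["Category1"] == "Second",
    toy["Category1"] == "Third",
]
choices = [toy["FirstCampScore"], toy["SecondCampScore"], toy["ThirdCampScore"]]
toy["Outcome_by_format"] = np.select(conditions, choices, default=np.nan)

# The notebook's approach, for comparison: sum of all three scores.
toy["Outcome_sum"] = (
    toy["FirstCampScore"] + toy["SecondCampScore"] + toy["ThirdCampScore"]
)

print(toy[["Category1", "Outcome_by_format", "Outcome_sum"]])
```

Since the competition metric and leaderboard results are not shown here, which of the two combinations scores better would need to be checked on the validation split.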


In [None]:
# Saving the Predictions to CSV file
submission = test_predictions[['Patient_ID', 'Health_Camp_ID', 'Outcome']]
submission.to_csv("submission.csv", index = False)