Congratulations – you have been hired as Chief Data Scientist of MedCamp – a not for profit organization dedicated in making health conditions for working professionals better. MedCamp was started because the founders saw their family suffer due to bad work life balance and neglected health.

MedCamp organizes health camps in several cities with low work life balance. They reach out to working people and ask them to register for these health camps. For those who attend, MedCamp provides them facility to undergo health checks or increase awareness by visiting various stalls (depending on the format of camp). 

MedCamp has conducted 65 such events over a period of 4 years and they see a high drop off between “Registration” and Number of people taking tests at the Camps. In last 4 years, they have stored data of ~110,000 registrations they have done.

One of the huge costs in arranging these camps is the amount of inventory you need to carry. If you carry more than required inventory, you incur unnecessarily high costs. On the other hand, if you carry less than required inventory for conducting these medical checks, people end up having bad experience.

 

The Process:
MedCamp employees / volunteers reach out to people and drive registrations.
During the camp, People who “ShowUp” either undergo the medical tests or visit stalls depending on the format of health camp.
 

Other things to note:
Since this is a completely voluntary activity for the working professionals, MedCamp usually has little profile information about these people.
For a few camps, there was hardware failure, so some information about date and time of registration is lost.
MedCamp runs 3 formats of these camps. The first and second format provides people with an instantaneous health score. The third format provides information about several health issues through various awareness stalls.

Favorable outcome:
For the first 2 formats, a favourable outcome is defined as getting a health_score, while in the third format it is defined as visiting at least a stall.
You need to predict the chances (probability) of having a favourable outcome.

In [34]:
# Importing the libraries

In [33]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')
pd.options.display.max_columns = None
plt.rcParams['figure.figsize']=[15,15]

In [35]:
# Importing the files

In [2]:
train = pd.read_csv('~/Downloads/healthcare/Train.csv')
test = pd.read_csv('~/Downloads/healthcare/test_l0Auv8Q.csv')
fhc = pd.read_csv('~/Downloads/healthcare/First_Health_Camp_Attended.csv')
hcd = pd.read_csv('~/Downloads/healthcare/Health_Camp_Detail.csv')
pp = pd.read_csv('~/Downloads/healthcare/Patient_Profile.csv')
shc = pd.read_csv('~/Downloads/healthcare/Second_Health_Camp_Attended.csv')
thc = pd.read_csv('~/Downloads/healthcare/Third_Health_Camp_Attended.csv')

In [36]:
# checking the initial shape of the data

In [4]:
train.shape, test.shape

((75278, 8), (35249, 8))

In [37]:
# combining the train and test data

In [7]:
combined = pd.concat([train,test], ignore_index= True)

In [8]:
combined.shape

(110527, 8)

In [11]:
combined.head()

Unnamed: 0,Patient_ID,Health_Camp_ID,Registration_Date,Var1,Var2,Var3,Var4,Var5
0,489652,6578,10-Sep-05,4,0,0,0,2
1,507246,6578,18-Aug-05,45,5,0,0,7
2,523729,6534,29-Apr-06,0,0,0,0,0
3,524931,6535,07-Feb-04,0,0,0,0,0
4,521364,6529,28-Feb-06,15,1,0,0,7


In [9]:
train.head()

Unnamed: 0,Patient_ID,Health_Camp_ID,Registration_Date,Var1,Var2,Var3,Var4,Var5
0,489652,6578,10-Sep-05,4,0,0,0,2
1,507246,6578,18-Aug-05,45,5,0,0,7
2,523729,6534,29-Apr-06,0,0,0,0,0
3,524931,6535,07-Feb-04,0,0,0,0,0
4,521364,6529,28-Feb-06,15,1,0,0,7


In [10]:
fhc.head(3)

Unnamed: 0,Patient_ID,Health_Camp_ID,Donation,Health_Score,Unnamed: 4
0,506181,6560,40,0.439024,
1,494977,6560,20,0.097561,
2,518680,6560,10,0.04878,


In [12]:
# merge fhc with combined 

In [13]:
combined = pd.merge(left = combined , right = fhc, on=['Patient_ID','Health_Camp_ID'], how= 'left')

In [14]:
combined.shape

(110527, 11)

In [15]:
shc.head()

Unnamed: 0,Patient_ID,Health_Camp_ID,Health Score
0,526631,6536,0.875136
1,509122,6536,0.7557
2,498864,6536,0.673181
3,515398,6536,0.722041
4,504624,6536,0.464712


In [16]:
# merge shc with combined 

In [17]:
combined = pd.merge(left = combined , right = shc, on=['Patient_ID','Health_Camp_ID'], how= 'left')

In [18]:
# merge thc with combined 

In [19]:
thc.head()

Unnamed: 0,Patient_ID,Health_Camp_ID,Number_of_stall_visited,Last_Stall_Visited_Number
0,517875,6527,3,1
1,504692,6578,1,1
2,504692,6527,3,1
3,493167,6527,4,4
4,510954,6528,2,2


In [20]:
combined = pd.merge(left = combined , right = thc, on=['Patient_ID','Health_Camp_ID'], how= 'left')

In [22]:
combined.shape

(110527, 14)

In [21]:
hcd.head()

Unnamed: 0,Health_Camp_ID,Camp_Start_Date,Camp_End_Date,Category1,Category2,Category3
0,6560,16-Aug-03,20-Aug-03,First,B,2
1,6530,16-Aug-03,28-Oct-03,First,C,2
2,6544,03-Nov-03,15-Nov-03,First,F,1
3,6585,22-Nov-03,05-Dec-03,First,E,2
4,6561,30-Nov-03,18-Dec-03,First,E,1


In [23]:
# # merge hcd with combined 

In [24]:
combined = pd.merge(left = combined , right = hcd, on='Health_Camp_ID', how= 'left')

In [25]:
# merge pp with combined 

In [26]:
pp.head()

Unnamed: 0,Patient_ID,Online_Follower,LinkedIn_Shared,Twitter_Shared,Facebook_Shared,Income,Education_Score,Age,First_Interaction,City_Type,Employer_Category
0,516956,0,0,0,0,1,90.0,39,18-Jun-03,,Software Industry
1,507733,0,0,0,0,1,,40,20-Jul-03,H,Software Industry
2,508307,0,0,0,0,3,87.0,46,02-Nov-02,D,BFSI
3,512612,0,0,0,0,1,75.0,47,02-Nov-02,D,Education
4,521075,0,0,0,0,3,,80,24-Nov-02,H,Others


In [27]:
combined = pd.merge(left = combined , right = pp, on='Patient_ID', how= 'left')

In [28]:
combined.shape

(110527, 29)

In [32]:
combined.head(10)

Unnamed: 0,Patient_ID,Health_Camp_ID,Registration_Date,Var1,Var2,Var3,Var4,Var5,Donation,Health_Score,Unnamed: 4,Health Score,Number_of_stall_visited,Last_Stall_Visited_Number,Camp_Start_Date,Camp_End_Date,Category1,Category2,Category3,Online_Follower,LinkedIn_Shared,Twitter_Shared,Facebook_Shared,Income,Education_Score,Age,First_Interaction,City_Type,Employer_Category
0,489652,6578,10-Sep-05,4,0,0,0,2,,,,,2.0,1.0,16-Aug-05,14-Oct-05,Third,G,2,0,0,0,0,,,,06-Dec-04,,
1,507246,6578,18-Aug-05,45,5,0,0,7,,,,,,,16-Aug-05,14-Oct-05,Third,G,2,0,0,0,0,1.0,75.0,40.0,08-Sep-04,C,Others
2,523729,6534,29-Apr-06,0,0,0,0,0,,,,0.402054,,,17-Oct-05,07-Nov-07,Second,A,2,0,0,0,0,,,,22-Jun-04,,
3,524931,6535,07-Feb-04,0,0,0,0,0,,,,,,,01-Feb-04,18-Feb-04,First,E,2,0,0,0,0,,,,07-Feb-04,I,
4,521364,6529,28-Feb-06,15,1,0,0,7,,,,0.845597,,,30-Mar-06,03-Apr-06,Second,A,2,0,0,0,1,1.0,70.0,40.0,04-Jul-03,I,Technology
5,494493,6570,20-May-05,0,0,0,0,0,,,,,,,09-Jul-05,22-Jul-05,First,E,2,0,0,0,0,,,,01-Feb-04,,
6,523001,6562,22-May-05,0,0,0,0,0,,,,,,,24-Nov-04,02-Jun-05,First,F,2,0,0,0,0,,,,07-Apr-05,,
7,500733,6535,31-Jan-04,0,0,0,0,0,,,,,,,01-Feb-04,18-Feb-04,First,E,2,0,0,0,0,,,,23-Mar-03,D,
8,501155,6538,31-Jan-04,0,0,0,0,0,,,,,,,09-Jan-04,04-Feb-05,First,F,2,0,0,0,0,,,,31-Jan-04,B,
9,501457,6538,12-Aug-04,0,0,0,0,0,,,,,,,09-Jan-04,04-Feb-05,First,F,2,0,0,0,0,0.0,,,28-Jan-03,B,


### FEATURE ENGINEERING

In [45]:
# code to convert string to date
combined['Registration_Date'] = pd.to_datetime(combined.Registration_Date, dayfirst = True)
combined['Registration_Month'] = combined.Registration_Date.dt.month
combined['Registration_Year'] = combined.Registration_Date.dt.year
combined['Registration_Day'] = combined.Registration_Date.dt.day

In [46]:
combined['Camp_Start_Month'] = pd.DatetimeIndex(combined.Camp_Start_Date).month
combined['Camp_Start_Year'] = pd.DatetimeIndex(combined.Camp_Start_Date).year
combined['Camp_Start_Day'] = pd.DatetimeIndex(combined.Camp_Start_Date).day

combined['Camp_End_Month'] = pd.DatetimeIndex(combined.Camp_End_Date).month
combined['Camp_End_Year'] = pd.DatetimeIndex(combined.Camp_End_Date).year
combined['Camp_End_Day'] = pd.DatetimeIndex(combined.Camp_End_Date).day

In [72]:
combined.head(10)

Unnamed: 0,Patient_ID,Health_Camp_ID,Registration_Date,Var1,Var2,Var3,Var4,Var5,Donation,Health_Score,Unnamed: 4,Health Score,Number_of_stall_visited,Last_Stall_Visited_Number,Camp_Start_Date,Camp_End_Date,Category1,Category2,Category3,Income,Education_Score,Age,First_Interaction,City_Type,Employer_Category,Registration_Month,Registration_Year,Registration_Day,Camp_Start_Month,Camp_Start_Year,Camp_Start_Day,Camp_End_Month,Camp_End_Year,Camp_End_Day,Online_presence,Camp_Duration,Interaction_Diff,HC_per_patient,Patient_per_HC,Patient_per_year,Patient_per_month
0,489652,6578,2005-09-10,4,0,0,0,2,,,,,2.0,1.0,2005-08-16,2005-10-14,Third,G,2,,,,2004-12-06,,,9.0,2005.0,10.0,8,2005,16,10,2005,14,0,59,278.0,11,2837,22359,10445
1,507246,6578,2005-08-18,45,5,0,0,7,,,,,,,2005-08-16,2005-10-14,Third,G,2,1.0,75.0,40.0,2004-09-08,C,Others,8.0,2005.0,18.0,8,2005,16,10,2005,14,0,59,344.0,26,2837,22359,10445
2,523729,6534,2006-04-29,0,0,0,0,0,,,,0.402054,,,2005-10-17,2007-11-07,Second,A,2,,,,2004-06-22,,,4.0,2006.0,29.0,10,2005,17,11,2007,7,0,751,676.0,7,3597,22359,8242
3,524931,6535,2004-02-07,0,0,0,0,0,,,,,,,2004-02-01,2004-02-18,First,E,2,,,,2004-02-07,I,,2.0,2004.0,7.0,2,2004,1,2,2004,18,0,17,0.0,6,1882,10902,8717
4,521364,6529,2006-02-28,15,1,0,0,7,,,,0.845597,,,2006-03-30,2006-04-03,Second,A,2,1.0,70.0,40.0,2003-07-04,I,Technology,2.0,2006.0,28.0,3,2006,30,4,2006,3,1,4,970.0,23,3823,16175,3823
5,494493,6570,2005-05-20,0,0,0,0,0,,,,,,,2005-07-09,2005-07-22,First,E,2,,,,2004-02-01,,,5.0,2005.0,20.0,7,2005,9,7,2005,22,0,13,474.0,20,3564,22359,3726
6,523001,6562,2005-05-22,0,0,0,0,0,,,,,,,2004-11-24,2005-06-02,First,F,2,,,,2005-04-07,,,5.0,2005.0,22.0,11,2004,24,6,2005,2,0,190,45.0,3,2338,10902,8757
7,500733,6535,2004-01-31,0,0,0,0,0,,,,,,,2004-02-01,2004-02-18,First,E,2,,,,2003-03-23,D,,1.0,2004.0,31.0,2,2004,1,2,2004,18,0,17,314.0,13,1882,10902,8717
8,501155,6538,2004-01-31,0,0,0,0,0,,,,,,,2004-01-09,2005-02-04,First,F,2,,,,2004-01-31,B,,1.0,2004.0,31.0,1,2004,9,2,2005,4,0,392,0.0,2,3954,10902,9076
9,501457,6538,2004-08-12,0,0,0,0,0,,,,,,,2004-01-09,2005-02-04,First,F,2,0.0,,,2003-01-28,B,,8.0,2004.0,12.0,1,2004,9,2,2005,4,0,392,562.0,8,3954,10902,9076


In [48]:
# creating online presence using patient profiling data

combined['Online_presence'] = combined['Online_Follower'] + combined['LinkedIn_Shared'] + combined['Twitter_Shared'] + combined['Facebook_Shared']

In [49]:
# dropping the columns

combined.drop(['Online_Follower','LinkedIn_Shared','Twitter_Shared','Facebook_Shared'], axis = 1, inplace = True)

In [52]:
del pp  # deleting the unnecessary data

In [53]:
# camp duration from CampEndDate - CampStartDate

combined['Camp_Start_Date'] = pd.to_datetime(combined.Camp_Start_Date, dayfirst = True)
combined['Camp_End_Date'] = pd.to_datetime(combined.Camp_End_Date, dayfirst = True)
combined['Camp_Duration'] = (combined['Camp_End_Date']-combined['Camp_Start_Date']).dt.days

In [57]:
# Days_Interaction_diff = Registration_date - First_Interaction
#converting first interaction to datetime

combined['First_Interaction']= pd.to_datetime(combined.First_Interaction, dayfirst = True)

combined['Interaction_Diff'] = (combined['Registration_Date'] - combined['First_Interaction']).dt.days

In [60]:
# number of Health camp each patient visited

combined['HC_per_patient']= combined.groupby('Patient_ID')['Health_Camp_ID'].transform('nunique')

# number of patients in each health camp
combined['Patient_per_HC']= combined.groupby('Health_Camp_ID')['Patient_ID'].transform('nunique')

In [65]:
combined['HC_per_patient'].describe()

count    110527.000000
mean          7.501036
std           6.888289
min           1.000000
25%           2.000000
50%           5.000000
75%          11.000000
max          40.000000
Name: HC_per_patient, dtype: float64

In [66]:
combined['Patient_per_HC'].describe()

count    110527.000000
mean       2904.425480
std        1330.060994
min          44.000000
25%        1993.000000
50%        2763.000000
75%        3809.000000
max        6543.000000
Name: Patient_per_HC, dtype: float64

In [70]:
# Patients per year in the starting of a year

combined['Patient_per_year'] = combined.groupby('Camp_Start_Year')['Patient_ID'].transform('nunique')
combined.groupby('Camp_Start_Year')['Patient_ID'].nunique()

Camp_Start_Year
2003     3081
2004    10902
2005    22359
2006    16175
2007     2579
Name: Patient_ID, dtype: int64

In [71]:
# patients per month in the start of a year

combined['Patient_per_month'] = combined.groupby('Camp_Start_Month')['Patient_ID'].transform('nunique')
combined.groupby('Camp_Start_Month')['Patient_ID'].nunique()

Camp_Start_Month
1      9076
2      8717
3      3823
4      6164
5      1903
6      8208
7      3726
8     10445
9     11210
10     8242
11     8757
12     6774
Name: Patient_ID, dtype: int64

In [73]:
# patients per year in the beginning of each year

combined['Patient_per_End_Year'] = combined.groupby('Camp_End_Year')['Patient_ID'].transform('nunique')

# patients per month in the end of each year
combined['Patient_per_End_Month'] = combined.groupby('Camp_End_Month')['Patient_ID'].transform('nunique')

In [79]:
combined.head()

Unnamed: 0,Patient_ID,Health_Camp_ID,Registration_Date,Var1,Var2,Var3,Var4,Var5,Donation,Health_Score,Unnamed: 4,Health Score,Number_of_stall_visited,Last_Stall_Visited_Number,Camp_Start_Date,Camp_End_Date,Category1,Category2,Category3,Income,Education_Score,Age,First_Interaction,City_Type,Employer_Category,Registration_Month,Registration_Year,Registration_Day,Camp_Start_Month,Camp_Start_Year,Camp_Start_Day,Camp_End_Month,Camp_End_Year,Camp_End_Day,Online_presence,Camp_Duration,Interaction_Diff,HC_per_patient,Patient_per_HC,Patient_per_year,Patient_per_month,Patient_per_End_Year,Patient_per_End_Month,Target
0,489652,6578,2005-09-10,4,0,0,0,2,,,,,2.0,1.0,2005-08-16,2005-10-14,Third,G,2,,,,2004-12-06,,,9.0,2005.0,10.0,8,2005,16,10,2005,14,0,59,278.0,11,2837,22359,10445,17330,6109,1
1,507246,6578,2005-08-18,45,5,0,0,7,,,,,,,2005-08-16,2005-10-14,Third,G,2,1.0,75.0,40.0,2004-09-08,C,Others,8.0,2005.0,18.0,8,2005,16,10,2005,14,0,59,344.0,26,2837,22359,10445,17330,6109,0
2,523729,6534,2006-04-29,0,0,0,0,0,,,,0.402054,,,2005-10-17,2007-11-07,Second,A,2,,,,2004-06-22,,,4.0,2006.0,29.0,10,2005,17,11,2007,7,0,751,676.0,7,3597,22359,8242,14379,14643,1
3,524931,6535,2004-02-07,0,0,0,0,0,,,,,,,2004-02-01,2004-02-18,First,E,2,,,,2004-02-07,I,,2.0,2004.0,7.0,2,2004,1,2,2004,18,0,17,0.0,6,1882,10902,8717,6319,11010,0
4,521364,6529,2006-02-28,15,1,0,0,7,,,,0.845597,,,2006-03-30,2006-04-03,Second,A,2,1.0,70.0,40.0,2003-07-04,I,Technology,2.0,2006.0,28.0,3,2006,30,4,2006,3,1,4,970.0,23,3823,16175,3823,14842,6519,1


In [75]:
combined.shape

(110527, 43)

## Target Variable

In [None]:
# if Health score == yes or visited stall == yes then 1 else 0

In [76]:
def outcome(a,b,c,d):
    if((a>0)|(b>0)|(c>0)|(d>0)):
        return(1)
    else:
        return(0)

In [78]:
combined['Target']= combined.apply(lambda x:outcome(x['Health_Score'],x['Health Score'],x['Number_of_stall_visited'],
                                x['Last_Stall_Visited_Number']), axis =1)

In [80]:
combined['Target'].value_counts()

0    89993
1    20534
Name: Target, dtype: int64

In [81]:
# let's drop the unnecessary variables

combined.columns

Index(['Patient_ID', 'Health_Camp_ID', 'Registration_Date', 'Var1', 'Var2',
       'Var3', 'Var4', 'Var5', 'Donation', 'Health_Score', 'Unnamed: 4',
       'Health Score', 'Number_of_stall_visited', 'Last_Stall_Visited_Number',
       'Camp_Start_Date', 'Camp_End_Date', 'Category1', 'Category2',
       'Category3', 'Income', 'Education_Score', 'Age', 'First_Interaction',
       'City_Type', 'Employer_Category', 'Registration_Month',
       'Registration_Year', 'Registration_Day', 'Camp_Start_Month',
       'Camp_Start_Year', 'Camp_Start_Day', 'Camp_End_Month', 'Camp_End_Year',
       'Camp_End_Day', 'Online_presence', 'Camp_Duration', 'Interaction_Diff',
       'HC_per_patient', 'Patient_per_HC', 'Patient_per_year',
       'Patient_per_month', 'Patient_per_End_Year', 'Patient_per_End_Month',
       'Target'],
      dtype='object')

In [83]:
new_data = combined.drop(['Patient_ID', 'Health_Camp_ID', 'Registration_Date','Donation','Unnamed: 4','Health_Score'
             ,'Health Score','Income','Education_Score','Age','First_Interaction','Number_of_stall_visited', 'Last_Stall_Visited_Number',
             'First_Interaction','City_Type','Employer_Category'], axis = 1)

In [85]:
new_data.drop(['Camp_Start_Date','Camp_End_Date'], axis = 1, inplace = True)

In [92]:
new_data.head()

Unnamed: 0,Var1,Var2,Var3,Var4,Var5,Category1,Category2,Category3,Registration_Month,Registration_Year,Registration_Day,Camp_Start_Month,Camp_Start_Year,Camp_Start_Day,Camp_End_Month,Camp_End_Year,Camp_End_Day,Online_presence,Camp_Duration,Interaction_Diff,HC_per_patient,Patient_per_HC,Patient_per_year,Patient_per_month,Patient_per_End_Year,Patient_per_End_Month,Target
0,4,0,0,0,2,3,6,2,9.0,2005.0,10.0,8,2005,16,10,2005,14,0,59,278.0,11,2837,22359,10445,17330,6109,1
1,45,5,0,0,7,3,6,2,8.0,2005.0,18.0,8,2005,16,10,2005,14,0,59,344.0,26,2837,22359,10445,17330,6109,0
2,0,0,0,0,0,2,0,2,4.0,2006.0,29.0,10,2005,17,11,2007,7,0,751,676.0,7,3597,22359,8242,14379,14643,1
3,0,0,0,0,0,1,4,2,2.0,2004.0,7.0,2,2004,1,2,2004,18,0,17,0.0,6,1882,10902,8717,6319,11010,0
4,15,1,0,0,7,2,0,2,2.0,2006.0,28.0,3,2006,30,4,2006,3,1,4,970.0,23,3823,16175,3823,14842,6519,1


In [87]:
del combined, fhc,shc,thc, hcd  # deleting unnecessary data

In [88]:
new_data.Category1.value_counts()

First     61417
Second    27783
Third     21327
Name: Category1, dtype: int64

In [89]:
# encoding the category1 variable

mapped = {'First':1,'Second':2,'Third':3}

new_data['Category1'] = new_data.Category1.map(mapped)

In [91]:
# encoding the category2 column
# pd.factorize works as same as labelencoder
new_data['Category2'] = pd.factorize(new_data.Category2, sort = True)[0]

In [99]:
new_data.isnull().sum()

Var1                       0
Var2                       0
Var3                       0
Var4                       0
Var5                       0
Category1                  0
Category2                  0
Category3                  0
Registration_Month       334
Registration_Year        334
Registration_Day         334
Camp_Start_Month           0
Camp_Start_Year            0
Camp_Start_Day             0
Camp_End_Month             0
Camp_End_Year              0
Camp_End_Day               0
Online_presence            0
Camp_Duration              0
Interaction_Diff         334
HC_per_patient             0
Patient_per_HC             0
Patient_per_year           0
Patient_per_month          0
Patient_per_End_Year       0
Patient_per_End_Month      0
Target                     0
dtype: int64

In [101]:
# treating null values

new_data['Registration_Month']= new_data['Registration_Month'].fillna(new_data['Registration_Month'].mode()[0])
new_data['Registration_Year']= new_data['Registration_Year'].fillna(new_data['Registration_Year'].mode()[0])
new_data['Registration_Day']= new_data['Registration_Day'].fillna(new_data['Registration_Day'].mode()[0])
new_data['Interaction_Diff']= new_data['Interaction_Diff'].fillna(new_data['Interaction_Diff'].median())

In [102]:
new_data.isnull().sum()

Var1                     0
Var2                     0
Var3                     0
Var4                     0
Var5                     0
Category1                0
Category2                0
Category3                0
Registration_Month       0
Registration_Year        0
Registration_Day         0
Camp_Start_Month         0
Camp_Start_Year          0
Camp_Start_Day           0
Camp_End_Month           0
Camp_End_Year            0
Camp_End_Day             0
Online_presence          0
Camp_Duration            0
Interaction_Diff         0
HC_per_patient           0
Patient_per_HC           0
Patient_per_year         0
Patient_per_month        0
Patient_per_End_Year     0
Patient_per_End_Month    0
Target                   0
dtype: int64

## Train Test Split

In [106]:
train.shape, test.shape

((75278, 8), (35249, 8))

In [197]:
new_data.loc[test.shape[0]:]

Unnamed: 0,Var1,Var2,Var3,Var4,Var5,Category1,Category2,Category3,Registration_Month,Registration_Year,Registration_Day,Camp_Start_Month,Camp_Start_Year,Camp_Start_Day,Camp_End_Month,Camp_End_Year,Camp_End_Day,Online_presence,Camp_Duration,Interaction_Diff,HC_per_patient,Patient_per_HC,Patient_per_year,Patient_per_month,Patient_per_End_Year,Patient_per_End_Month,Target
35249,1,0,0,0,1,1,4,2,11.0,2003.0,21.0,11,2003,22,12,2003,5,0,13,148.0,11,1398,3081,8757,1753,4733,0
35250,0,0,0,0,0,3,6,2,9.0,2005.0,17.0,8,2005,16,10,2005,14,0,59,591.0,10,2837,22359,10445,17330,6109,0
35251,0,0,0,0,0,1,4,2,12.0,2004.0,2.0,12,2004,22,1,2005,6,0,15,2.0,2,3517,10902,6774,17330,5293,0
35252,0,0,0,0,0,1,5,2,6.0,2006.0,10.0,9,2005,27,11,2007,7,0,771,0.0,1,6543,22359,11210,14379,14643,0
35253,1,0,0,0,2,2,0,2,9.0,2006.0,27.0,10,2005,17,11,2007,7,0,751,130.0,9,3597,22359,8242,14379,14643,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
110522,12,2,0,0,6,1,2,2,4.0,2006.0,13.0,4,2006,8,4,2006,17,2,9,786.0,26,166,16175,6164,14842,6519,0
110523,0,0,0,0,0,2,3,2,11.0,2006.0,3.0,11,2006,13,11,2006,18,0,5,415.0,3,2180,16175,8757,14842,14643,0
110524,0,0,0,0,0,2,0,2,6.0,2006.0,17.0,8,2006,4,8,2006,9,0,5,736.0,16,3041,16175,10445,14842,8663,0
110525,0,0,0,0,0,2,3,2,1.0,2007.0,13.0,1,2007,30,2,2007,4,0,5,619.0,5,2441,2579,9076,14379,11010,0


In [108]:
newtrain = new_data.loc[0:train.shape[0]-1,:]
newtest = new_data.loc[train.shape[0]:,:]

In [109]:
newtrain.shape, newtest.shape

((75278, 27), (35249, 27))

In [110]:
newtest.drop('Target', axis = 1, inplace = True)

In [111]:
newtrain.shape, newtest.shape

((75278, 27), (35249, 26))

In [133]:
X = newtrain.drop('Target', axis =1)
y = newtrain.Target

## Model Building

In [112]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier,GradientBoostingClassifier
from catboost import CatBoostClassifier
from xgboost import XGBClassifier

In [127]:
# models

lr = LogisticRegression()
rf = RandomForestClassifier()
gbc = GradientBoostingClassifier()
cb = CatBoostClassifier(eval_metric = 'AUC')
xgb = XGBClassifier(eval_metric = 'auc')

In [131]:
# let's build voting classifier model

from sklearn.ensemble import VotingClassifier
vc = VotingClassifier(estimators=[('lr',lr),('rf',rf),('gbc',gbc),('cb',cb),('xgb',xgb)], voting = 'soft')

In [134]:
pred = vc.fit(X,y).predict_proba(newtest)

Learning rate set to 0.065204
0:	total: 21.3ms	remaining: 21.2s
1:	total: 40ms	remaining: 20s
2:	total: 59ms	remaining: 19.6s
3:	total: 79.3ms	remaining: 19.8s
4:	total: 103ms	remaining: 20.5s
5:	total: 128ms	remaining: 21.2s
6:	total: 152ms	remaining: 21.6s
7:	total: 176ms	remaining: 21.9s
8:	total: 202ms	remaining: 22.2s
9:	total: 228ms	remaining: 22.5s
10:	total: 252ms	remaining: 22.6s
11:	total: 276ms	remaining: 22.7s
12:	total: 299ms	remaining: 22.7s
13:	total: 321ms	remaining: 22.6s
14:	total: 345ms	remaining: 22.6s
15:	total: 368ms	remaining: 22.7s
16:	total: 391ms	remaining: 22.6s
17:	total: 414ms	remaining: 22.6s
18:	total: 434ms	remaining: 22.4s
19:	total: 452ms	remaining: 22.1s
20:	total: 469ms	remaining: 21.9s
21:	total: 487ms	remaining: 21.7s
22:	total: 505ms	remaining: 21.4s
23:	total: 521ms	remaining: 21.2s
24:	total: 536ms	remaining: 20.9s
25:	total: 552ms	remaining: 20.7s
26:	total: 568ms	remaining: 20.5s
27:	total: 585ms	remaining: 20.3s
28:	total: 602ms	remaining: 20

245:	total: 4.29s	remaining: 13.1s
246:	total: 4.32s	remaining: 13.2s
247:	total: 4.33s	remaining: 13.1s
248:	total: 4.35s	remaining: 13.1s
249:	total: 4.37s	remaining: 13.1s
250:	total: 4.39s	remaining: 13.1s
251:	total: 4.41s	remaining: 13.1s
252:	total: 4.42s	remaining: 13.1s
253:	total: 4.44s	remaining: 13s
254:	total: 4.46s	remaining: 13s
255:	total: 4.48s	remaining: 13s
256:	total: 4.5s	remaining: 13s
257:	total: 4.52s	remaining: 13s
258:	total: 4.53s	remaining: 13s
259:	total: 4.55s	remaining: 13s
260:	total: 4.57s	remaining: 12.9s
261:	total: 4.58s	remaining: 12.9s
262:	total: 4.6s	remaining: 12.9s
263:	total: 4.62s	remaining: 12.9s
264:	total: 4.64s	remaining: 12.9s
265:	total: 4.65s	remaining: 12.8s
266:	total: 4.67s	remaining: 12.8s
267:	total: 4.69s	remaining: 12.8s
268:	total: 4.71s	remaining: 12.8s
269:	total: 4.73s	remaining: 12.8s
270:	total: 4.74s	remaining: 12.8s
271:	total: 4.76s	remaining: 12.7s
272:	total: 4.78s	remaining: 12.7s
273:	total: 4.8s	remaining: 12.7s
27

488:	total: 8.49s	remaining: 8.88s
489:	total: 8.51s	remaining: 8.86s
490:	total: 8.53s	remaining: 8.84s
491:	total: 8.55s	remaining: 8.82s
492:	total: 8.56s	remaining: 8.81s
493:	total: 8.58s	remaining: 8.79s
494:	total: 8.6s	remaining: 8.77s
495:	total: 8.62s	remaining: 8.75s
496:	total: 8.63s	remaining: 8.74s
497:	total: 8.65s	remaining: 8.72s
498:	total: 8.67s	remaining: 8.7s
499:	total: 8.69s	remaining: 8.69s
500:	total: 8.71s	remaining: 8.67s
501:	total: 8.72s	remaining: 8.65s
502:	total: 8.74s	remaining: 8.64s
503:	total: 8.76s	remaining: 8.62s
504:	total: 8.78s	remaining: 8.61s
505:	total: 8.8s	remaining: 8.59s
506:	total: 8.81s	remaining: 8.57s
507:	total: 8.83s	remaining: 8.55s
508:	total: 8.85s	remaining: 8.54s
509:	total: 8.87s	remaining: 8.52s
510:	total: 8.89s	remaining: 8.5s
511:	total: 8.9s	remaining: 8.49s
512:	total: 8.92s	remaining: 8.47s
513:	total: 8.94s	remaining: 8.46s
514:	total: 8.96s	remaining: 8.44s
515:	total: 8.98s	remaining: 8.42s
516:	total: 9s	remaining:

733:	total: 12.9s	remaining: 4.68s
734:	total: 12.9s	remaining: 4.66s
735:	total: 13s	remaining: 4.64s
736:	total: 13s	remaining: 4.63s
737:	total: 13s	remaining: 4.61s
738:	total: 13s	remaining: 4.59s
739:	total: 13s	remaining: 4.57s
740:	total: 13s	remaining: 4.55s
741:	total: 13.1s	remaining: 4.54s
742:	total: 13.1s	remaining: 4.52s
743:	total: 13.1s	remaining: 4.5s
744:	total: 13.1s	remaining: 4.49s
745:	total: 13.1s	remaining: 4.47s
746:	total: 13.1s	remaining: 4.45s
747:	total: 13.2s	remaining: 4.43s
748:	total: 13.2s	remaining: 4.42s
749:	total: 13.2s	remaining: 4.4s
750:	total: 13.2s	remaining: 4.38s
751:	total: 13.2s	remaining: 4.36s
752:	total: 13.2s	remaining: 4.34s
753:	total: 13.3s	remaining: 4.33s
754:	total: 13.3s	remaining: 4.31s
755:	total: 13.3s	remaining: 4.3s
756:	total: 13.3s	remaining: 4.28s
757:	total: 13.3s	remaining: 4.26s
758:	total: 13.4s	remaining: 4.24s
759:	total: 13.4s	remaining: 4.23s
760:	total: 13.4s	remaining: 4.21s
761:	total: 13.4s	remaining: 4.19s


970:	total: 17.2s	remaining: 513ms
971:	total: 17.2s	remaining: 495ms
972:	total: 17.2s	remaining: 477ms
973:	total: 17.2s	remaining: 460ms
974:	total: 17.2s	remaining: 442ms
975:	total: 17.2s	remaining: 424ms
976:	total: 17.3s	remaining: 406ms
977:	total: 17.3s	remaining: 389ms
978:	total: 17.3s	remaining: 371ms
979:	total: 17.3s	remaining: 353ms
980:	total: 17.3s	remaining: 336ms
981:	total: 17.3s	remaining: 318ms
982:	total: 17.4s	remaining: 300ms
983:	total: 17.4s	remaining: 283ms
984:	total: 17.4s	remaining: 265ms
985:	total: 17.4s	remaining: 247ms
986:	total: 17.4s	remaining: 230ms
987:	total: 17.5s	remaining: 212ms
988:	total: 17.5s	remaining: 195ms
989:	total: 17.5s	remaining: 177ms
990:	total: 17.5s	remaining: 159ms
991:	total: 17.6s	remaining: 142ms
992:	total: 17.6s	remaining: 124ms
993:	total: 17.6s	remaining: 106ms
994:	total: 17.6s	remaining: 88.6ms
995:	total: 17.7s	remaining: 70.9ms
996:	total: 17.7s	remaining: 53.2ms
997:	total: 17.7s	remaining: 35.4ms
998:	total: 17.7

In [140]:
pred[:,1]

array([0.484439  , 0.39082388, 0.17530693, ..., 0.37012055, 0.24169588,
       0.42842214])

In [142]:
# submission

submission = pd.DataFrame({'Patient_ID': test.Patient_ID,
                         'Health_Camp_ID': test.Health_Camp_ID,
                         'Outcome':pred[:,1]})

In [144]:
submission.head()

Unnamed: 0,Patient_ID,Health_Camp_ID,Outcome
0,505701,6548,0.484439
1,500633,6584,0.390824
2,506945,6582,0.175307
3,497447,6551,0.445131
4,496446,6533,0.162


In [146]:
submission.to_csv('VotingModel.csv', index = False)  #0.6303

In [158]:
# using LGBM 

In [147]:
from lightgbm import LGBMClassifier

In [151]:
lgbm = LGBMClassifier(n_estimators = 500, max_depth = 10, random_state = 10, learning_rate= 0.01)

pred_lgbm = lgbm.fit(X,y).predict_proba(newtest)

In [152]:
pred_lgbm[:1]

array([[0.30458491, 0.69541509]])

In [155]:
# submission

submission2 = pd.DataFrame({'Patient_ID': test.Patient_ID,
                         'Health_Camp_ID': test.Health_Camp_ID,
                         'Outcome':pred_lgbm[:,1]})

In [156]:
submission2.head()

Unnamed: 0,Patient_ID,Health_Camp_ID,Outcome
0,505701,6548,0.695415
1,500633,6584,0.409714
2,506945,6582,0.179618
3,497447,6551,0.453857
4,496446,6533,0.066385


In [157]:
submission2.to_csv('lgbmmodel.csv', index = False)  #0.6277

In [159]:
# rfecv and cross validation

In [160]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.feature_selection import RFECV

In [161]:
dtree = DecisionTreeClassifier()

In [165]:
rfe = RFECV(estimator = dtree, step= 1, min_features_to_select== 5, cv = 5, verbose= 5)

In [166]:
rfe.fit(X,y)
features = list(rfe.get_feature_names_out())
print(features)

Fitting estimator with 26 features.
Fitting estimator with 25 features.
Fitting estimator with 24 features.
Fitting estimator with 23 features.
Fitting estimator with 22 features.
Fitting estimator with 21 features.
Fitting estimator with 20 features.
Fitting estimator with 19 features.
Fitting estimator with 18 features.
Fitting estimator with 17 features.
Fitting estimator with 16 features.
Fitting estimator with 15 features.
Fitting estimator with 14 features.
Fitting estimator with 13 features.
Fitting estimator with 12 features.
Fitting estimator with 11 features.
Fitting estimator with 10 features.
Fitting estimator with 9 features.
Fitting estimator with 8 features.
Fitting estimator with 7 features.
Fitting estimator with 6 features.
Fitting estimator with 26 features.
Fitting estimator with 25 features.
Fitting estimator with 24 features.
Fitting estimator with 23 features.
Fitting estimator with 22 features.
Fitting estimator with 21 features.
Fitting estimator with 20 featur

In [167]:
# rfe features

rfe_input = X.loc[:,features]
rfe_test = newtest.loc[:,features]

In [170]:
rfe_input.shape, rfe_test.shape

((75278, 14), (35249, 14))

In [172]:
lgbm = LGBMClassifier(n_estimators = 500, max_depth = 10, random_state = 10, learning_rate= 0.01)

pred_rfe_lgbm = lgbm.fit(rfe_input,y).predict_proba(rfe_test)

In [173]:
#submission

submission3 = pd.DataFrame({'Patient_ID': test.Patient_ID,
                         'Health_Camp_ID': test.Health_Camp_ID,
                         'Outcome':pred_rfe_lgbm[:,1]})

In [174]:
submission3.head()

Unnamed: 0,Patient_ID,Health_Camp_ID,Outcome
0,505701,6548,0.65658
1,500633,6584,0.600173
2,506945,6582,0.102796
3,497447,6551,0.668019
4,496446,6533,0.049789


In [176]:
submission3.to_csv('rfemodel.csv', index = False)  #0.6306

In [177]:
# cross validation

In [178]:
from sklearn.model_selection import KFold

In [184]:
kfold = KFold(n_splits = 5, shuffle = True)
lgbm = LGBMClassifier(n_estimators = 500,
                     max_depth = 10, random_state = 42,
                     learning_rate = 0.01,
                     scale_pos_weight = 3)

pred_df = pd.DataFrame()

n=5
for i in range(n):
    folds = next(kfold.split(X))  #next helps you iterate over your folds
    xtrain = X.iloc[folds[0]]  #builds xtrain data
    ytrain = y.iloc[folds[0]]  #creates y value
    lgbm.fit(xtrain, ytrain)
    pred_df[i] = lgbm.predict_proba(newtest)[:, 1]

In [185]:
median_prob = pred_df.median(axis = 1)

In [187]:
submission4 = pd.DataFrame({'Patient_ID': test.Patient_ID,
                         'Health_Camp_ID': test.Health_Camp_ID,
                         'Outcome':median_prob})

submission4.to_csv('Median_probmodel.csv', index = False)  #0.6302

## Summary

* Model performance drastically gets impacted by the feature engineering
* We saw that LGBM performed the best among all fro this competition leading at #2 in public leaderboard
* We saw RFECV features did not performed well
* Parameter tuning of LGBM can take the model Performance to whole new level
* Cross validation model for lgbm did a fantastic job earning us few brownie points in the leaderboard