**Congratulations! You have been hired as Chief Data Scientist of MedCamp, a not-for-profit organization dedicated to improving the health of working professionals. MedCamp was started because its founders saw their families suffer from poor work-life balance and neglected health.**

**MedCamp organizes health camps in several cities with low work-life balance. It reaches out to working people and asks them to register for these camps. Those who attend can undergo health checks or increase their awareness by visiting various stalls, depending on the format of the camp.**

**MedCamp has conducted 65 such events over a period of 4 years and sees a high drop-off between “Registration” and the number of people actually taking tests at the camps. Over those 4 years, it has stored data on about 110,000 registrations.**

**One of the largest costs in arranging these camps is the inventory that must be carried. Carrying more inventory than required incurs unnecessarily high costs. Carrying less than required for the medical checks leaves people with a bad experience.**

 

#### The Process:
* MedCamp employees / volunteers reach out to people and drive registrations.
* During the camp, people who “ShowUp” either undergo the medical tests or visit stalls, depending on the format of the health camp.
 

### Other things to note:
* Since this is a completely voluntary activity for the working professionals, MedCamp usually has little profile information about these people.
* For a few camps, there was a hardware failure, so some information about the date and time of registration is lost.
* MedCamp runs 3 formats of these camps. The first and second formats provide people with an instantaneous health score. The third format provides information about several health issues through various awareness stalls.

### Favorable outcome:
* For the first 2 formats, a favourable outcome is defined as getting a health_score, while in the third format it is defined as visiting at least one stall.
* You need to predict the chances (probability) of having a favourable outcome.
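
The definition above can be sketched in pandas as follows. This is a minimal illustration on made-up rows, not MedCamp's data. The helper name `favourable_outcome` is hypothetical, a single `Health_Score` column stands in for both scored formats, and `Category1` is assumed to hold the camp format.

```python
import numpy as np
import pandas as pd

def favourable_outcome(df: pd.DataFrame) -> pd.Series:
    """Return 1 where a registration had a favourable outcome, else 0.

    Formats 1 and 2: a health score was recorded.
    Format 3: at least one stall was visited.
    """
    has_score = df["Health_Score"].notna()
    visited_stall = df["Number_of_stall_visited"].fillna(0) > 0
    scored_format = df["Category1"].isin(["First", "Second"])
    flags = np.where(scored_format, has_score, visited_stall).astype(int)
    return pd.Series(flags, index=df.index)

# Made-up rows: one per situation the definition distinguishes
toy = pd.DataFrame({
    "Category1": ["First", "Second", "Third", "Third"],
    "Health_Score": [0.44, np.nan, np.nan, np.nan],
    "Number_of_stall_visited": [np.nan, np.nan, 3, 0],
})
toy["Favourable"] = favourable_outcome(toy)
print(toy)
```

The model's job is then to predict the probability that this flag is 1 for a given registration.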
 

### Data Description
#### **Train.zip contains the following 6 CSV files, alongside a data dictionary that defines each variable**

* Health_Camp_Detail.csv – File containing Health_Camp_Id, Camp_Start_Date, Camp_End_Date and Category details of each camp.

* Train.csv – File containing registration details for all the train camps. This includes Patient_ID, Health_Camp_ID, Registration_Date and a few anonymized variables as on registration date.

* Patient_Profile.csv – This file contains Patient profile details like Patient_ID, Online_Follower, Social media details, Income, Education, Age, First_Interaction_Date, City_Type and Employer_Category

* First_Health_Camp_Attended.csv – This file contains details about people who attended a health camp of the first format. This includes Donation (amount) & Health_Score of the person.

* Second_Health_Camp_Attended.csv - This file contains details about people who attended a health camp of the second format. This includes Health_Score of the person.

* Third_Health_Camp_Attended.csv - This file contains details about people who attended a health camp of the third format. This includes Number_of_stall_visited & Last_Stall_Visited_Number.


#### **Test Set**

* Test.csv – File containing registration details for all the test camps. This includes Patient_ID, Health_Camp_ID, Registration_Date and a few anonymized variables as on registration date.

 

### Train / Test split:

*Camps that started on or before 31st March 2006 are included in Train.
Test data covers all camps conducted on or after 1st April 2006.*
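
As a rough sketch of the rule above, camps can be assigned to either side of the cutoff by their start date. The three camps below are invented for illustration, and the `%d-%b-%y` format is an assumption based on values such as `16-Aug-03` in Health_Camp_Detail.csv.

```python
import pandas as pd

# Invented camps on both sides of the cutoff
camps = pd.DataFrame({
    "Health_Camp_ID": [1, 2, 3],
    "Camp_Start_Date": ["16-Aug-03", "31-Mar-06", "01-Apr-06"],
})
camps["Camp_Start_Date"] = pd.to_datetime(camps["Camp_Start_Date"], format="%d-%b-%y")

cutoff = pd.Timestamp("2006-03-31")
is_train_camp = camps["Camp_Start_Date"] <= cutoff

train_camp_ids = camps.loc[is_train_camp, "Health_Camp_ID"].tolist()
test_camp_ids = camps.loc[~is_train_camp, "Health_Camp_ID"].tolist()
print(train_camp_ids, test_camp_ids)
```

Because the split is by time, any validation scheme that mimics it (holding out the latest camps) is likely to be more representative than a random split.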


**Sample Submission:**

* *Patient_ID: Unique identifier for each patient. This ID is not sequential in nature and cannot be used in modeling.*

* *Health_Camp_ID: Unique identifier for each camp. This ID is not sequential in nature and cannot be used in modeling.*

* *Outcome: Predicted probability of a favourable outcome.*



# Import required Libraries

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

import warnings
warnings.filterwarnings('ignore')

plt.rcParams['figure.figsize'] = [12,6]
pd.set_option('display.max_columns',50)  # show up to 50 columns when displaying a DataFrame

In [2]:
train=pd.read_csv(r"C:\Users\jakha\Downloads\Train (2).csv")
test=pd.read_csv(r"C:\Users\jakha\Downloads\Test (2).csv")
submissiondata=pd.read_csv(r"C:\Users\jakha\Downloads\Sample_Submission (1).csv")
hcdtl=pd.read_csv(r"C:\Users\jakha\Downloads\Health_Camp_Detail.csv")
pp=pd.read_csv(r"C:\Users\jakha\Downloads\Patient_Details.csv")
fhc=pd.read_csv(r"C:\Users\jakha\Downloads\First_Health_Camp.csv")
shc=pd.read_csv(r"C:\Users\jakha\Downloads\Second_Health_Camp.csv")
thc=pd.read_csv(r"C:\Users\jakha\Downloads\Third_Health_Camp.csv")

In [3]:
train.shape

(52694, 9)

In [4]:
test.shape

(22584, 8)

In [5]:
# combine the train and test data
combined = pd.concat([train, test], ignore_index= True)

In [6]:
combined.shape

(75278, 9)

In [7]:
combined.head(2)

Unnamed: 0,Patient_ID,Health_Camp_ID,Registration_Date,Var1,Var2,Var3,Var4,Var5,outcome
0,526927,6570,14/05/05,0,0,0,0,0,0.0
1,510379,6534,26/05/06,0,0,0,0,0,1.0


In [8]:
fhc.head(2)

Unnamed: 0,Patient_ID,Health_Camp_ID,Donation,Health_Score,Unnamed: 4
0,506181,6560,40,0.439024,
1,494977,6560,20,0.097561,


In [9]:
# merge fhc with combined data
combined = pd.merge(left = combined , right = fhc, on = ['Patient_ID', 'Health_Camp_ID'], how = 'left')

In [10]:
combined.shape

(75278, 12)
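
The row count stays at 75278 because a left merge keeps every registration, provided each (Patient_ID, Health_Camp_ID) pair appears at most once in the right-hand file. A minimal sketch of how that assumption could be checked, using toy frames and pandas' `validate` and `indicator` options:

```python
import pandas as pd

# Toy stand-ins for the registrations and one attendance file
registrations = pd.DataFrame({
    "Patient_ID": [1, 2, 3],
    "Health_Camp_ID": [10, 10, 11],
})
attended = pd.DataFrame({
    "Patient_ID": [1],
    "Health_Camp_ID": [10],
    "Health_Score": [0.44],
})

merged = registrations.merge(
    attended,
    on=["Patient_ID", "Health_Camp_ID"],
    how="left",
    validate="many_to_one",  # raises MergeError if `attended` repeats a key pair
    indicator=True,          # adds a `_merge` column: 'both' or 'left_only'
)
print(merged)
```

Rows marked `left_only` are registrations with no attendance record, which is why the merged score columns are mostly NaN.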

In [11]:
combined.head(2)

Unnamed: 0,Patient_ID,Health_Camp_ID,Registration_Date,Var1,Var2,Var3,Var4,Var5,outcome,Donation,Health_Score,Unnamed: 4
0,526927,6570,14/05/05,0,0,0,0,0,0.0,,,
1,510379,6534,26/05/06,0,0,0,0,0,1.0,,,


In [12]:
shc.head(2)

Unnamed: 0,Patient_ID,Health_Camp_ID,Health Score
0,526631,6536,0.875136
1,509122,6536,0.7557


In [13]:
# merge shc with combined data
combined = pd.merge(left = combined , right = shc, on = ['Patient_ID', 'Health_Camp_ID'], how = 'left')

In [14]:
combined.shape

(75278, 13)

In [15]:
combined.head(2)

Unnamed: 0,Patient_ID,Health_Camp_ID,Registration_Date,Var1,Var2,Var3,Var4,Var5,outcome,Donation,Health_Score,Unnamed: 4,Health Score
0,526927,6570,14/05/05,0,0,0,0,0,0.0,,,,
1,510379,6534,26/05/06,0,0,0,0,0,1.0,,,,0.402054


In [16]:
thc.head(2)

Unnamed: 0,Patient_ID,Health_Camp_ID,Number_of_stall_visited,Last_Stall_Visited_Number
0,517875,6527,3,1
1,504692,6578,1,1


In [17]:
# merge thc with combined data
combined = pd.merge(left = combined , right = thc, on = ['Patient_ID', 'Health_Camp_ID'], how = 'left')

In [18]:
combined.shape

(75278, 15)

In [19]:
combined.head(2)

Unnamed: 0,Patient_ID,Health_Camp_ID,Registration_Date,Var1,Var2,Var3,Var4,Var5,outcome,Donation,Health_Score,Unnamed: 4,Health Score,Number_of_stall_visited,Last_Stall_Visited_Number
0,526927,6570,14/05/05,0,0,0,0,0,0.0,,,,,,
1,510379,6534,26/05/06,0,0,0,0,0,1.0,,,,0.402054,,


In [20]:
hcdtl.head(2)

Unnamed: 0,Health_Camp_ID,Camp_Start_Date,Camp_End_Date,Category1,Category2,Category3
0,6560,16-Aug-03,20-Aug-03,First,B,2
1,6530,16-Aug-03,28-Oct-03,First,C,2


In [21]:
# merge hcdtl with combined data
combined = pd.merge(left = combined , right = hcdtl, 
                    on = 'Health_Camp_ID', how = 'left')

In [22]:
combined.shape

(75278, 20)

In [23]:
combined.head(2)

Unnamed: 0,Patient_ID,Health_Camp_ID,Registration_Date,Var1,Var2,Var3,Var4,Var5,outcome,Donation,Health_Score,Unnamed: 4,Health Score,Number_of_stall_visited,Last_Stall_Visited_Number,Camp_Start_Date,Camp_End_Date,Category1,Category2,Category3
0,526927,6570,14/05/05,0,0,0,0,0,0.0,,,,,,,09-Jul-05,22-Jul-05,First,E,2
1,510379,6534,26/05/06,0,0,0,0,0,1.0,,,,0.402054,,,17-Oct-05,07-Nov-07,Second,A,2


In [24]:
pp.head(2)

Unnamed: 0,Patient_ID,Online_Follower,LinkedIn_Shared,Twitter_Shared,Facebook_Shared,Income,Education_Score,Age,First_Interaction,City_Type,Employer_Category
0,516956,0,0,0,0,1,90.0,39,18-Jun-03,,Software Industry
1,507733,0,0,0,0,1,,40,20-Jul-03,H,Software Industry


In [25]:
# merge pp with combined data
combined = pd.merge(left = combined , right = pp, on = 'Patient_ID', how = 'left')

In [26]:
combined.shape

(75278, 30)

In [27]:
combined.head()

Unnamed: 0,Patient_ID,Health_Camp_ID,Registration_Date,Var1,Var2,Var3,Var4,Var5,outcome,Donation,Health_Score,Unnamed: 4,Health Score,Number_of_stall_visited,Last_Stall_Visited_Number,Camp_Start_Date,Camp_End_Date,Category1,Category2,Category3,Online_Follower,LinkedIn_Shared,Twitter_Shared,Facebook_Shared,Income,Education_Score,Age,First_Interaction,City_Type,Employer_Category
0,526927,6570,14/05/05,0,0,0,0,0,0.0,,,,,,,09-Jul-05,22-Jul-05,First,E,2,0,0,0,0,,,,14-Nov-04,,
1,510379,6534,26/05/06,0,0,0,0,0,1.0,,,,0.402054,,,17-Oct-05,07-Nov-07,Second,A,2,0,0,0,0,,,,26-May-06,H,
2,520968,6557,07/01/04,0,0,0,0,0,1.0,20.0,0.611111,,,,,04-Jan-04,09-Jan-04,First,C,2,0,0,0,0,,,,07-Jan-04,H,
3,507625,6535,12/02/04,0,0,0,0,0,0.0,,,,,,,01-Feb-04,18-Feb-04,First,E,2,0,0,0,0,,,,12-Feb-04,B,
4,502611,6581,14/03/04,0,0,0,0,0,0.0,,,,,,,07-Dec-03,13-Jun-04,First,F,2,0,0,0,0,,,,14-Mar-04,B,


## Feature Engineering

In [28]:
combined.dtypes

Patient_ID                     int64
Health_Camp_ID                 int64
Registration_Date             object
Var1                           int64
Var2                           int64
Var3                           int64
Var4                           int64
Var5                           int64
outcome                      float64
Donation                     float64
Health_Score                 float64
Unnamed: 4                   float64
Health Score                 float64
Number_of_stall_visited      float64
Last_Stall_Visited_Number    float64
Camp_Start_Date               object
Camp_End_Date                 object
Category1                     object
Category2                     object
Category3                      int64
Online_Follower                int64
LinkedIn_Shared                int64
Twitter_Shared                 int64
Facebook_Shared                int64
Income                        object
Education_Score               object
Age                           object
F

In [30]:
# Registration_Date is stored as an object (string) column
# first convert the strings to datetime using 'to_datetime'

combined['Registration_Date'] = pd.to_datetime(combined.Registration_Date, 
                                               dayfirst = True)  # dayfirst=True parses 14/05/05 as 14 May 2005

# now split date into day, month and year
combined['Registration_Day'] = combined.Registration_Date.dt.day
combined['Registration_Month'] = combined.Registration_Date.dt.month 
combined['Registration_Year'] = combined.Registration_Date.dt.year
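
An alternative sketch of the same step passes an explicit format instead of relying on `dayfirst`. The `%d/%m/%y` format is an assumption inferred from values such as `14/05/05`, and `errors="coerce"` turns anything unparseable into `NaT`.

```python
import pandas as pd

# Two sample dates in the notebook's format plus one missing value
raw = pd.Series(["14/05/05", "26/05/06", None])
parsed = pd.to_datetime(raw, format="%d/%m/%y", errors="coerce")

parts = pd.DataFrame({
    "day": parsed.dt.day,
    "month": parsed.dt.month,
    "year": parsed.dt.year,
})
print(parts)
```

Missing dates become `NaT`, so the derived day, month and year columns are floats with NaN, which matches the `14.0`, `5.0`, `2005.0` values seen in the merged frame.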

In [31]:
# camp start and end date
#Camp_Start_Date               object
#Camp_End_Date                 object

combined["Camp_Start_Month"] = pd.DatetimeIndex(combined.Camp_Start_Date).month
combined["Camp_Start_Day"] = pd.DatetimeIndex(combined.Camp_Start_Date).day
combined["Camp_Start_Year"] = pd.DatetimeIndex(combined.Camp_Start_Date).year

combined["Camp_End_Month"] = pd.DatetimeIndex(combined.Camp_End_Date).month
combined["Camp_End_Day"] = pd.DatetimeIndex(combined.Camp_End_Date).day
combined["Camp_End_Year"] = pd.DatetimeIndex(combined.Camp_End_Date).year

In [32]:
# creating Online_presence  using Patient Profile Data

combined['Online_Presence'] = combined['Online_Follower'] + combined['LinkedIn_Shared']+combined['Twitter_Shared']+combined['Facebook_Shared']
            

In [33]:
# drop the four columns that were summed into Online_Presence
combined.drop(['Online_Follower','LinkedIn_Shared', 
               'Twitter_Shared','Facebook_Shared'], axis = 1, inplace = True)

In [34]:
# camp duration from camp start and end date

# convert object columns into datetime format
combined['Camp_Start_Date'] = pd.to_datetime(combined.Camp_Start_Date, dayfirst = True) # the first element is the day
combined['Camp_End_Date'] = pd.to_datetime(combined.Camp_End_Date, dayfirst = True)


# the difference is a timedelta
# to extract only the number of days we use the '.dt.days' attribute
combined['Camp_Duration'] = (combined['Camp_End_Date'] - combined['Camp_Start_Date']).dt.days
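
As a worked check of the subtraction, the sketch below recomputes the duration for the two camps shown in the earlier `head()` output (09-Jul-05 to 22-Jul-05, and 17-Oct-05 to 07-Nov-07). The explicit `%d-%b-%y` format is an assumption based on those values.

```python
import pandas as pd

start = pd.to_datetime(pd.Series(["09-Jul-05", "17-Oct-05"]), format="%d-%b-%y")
end = pd.to_datetime(pd.Series(["22-Jul-05", "07-Nov-07"]), format="%d-%b-%y")

# Subtracting datetimes gives timedeltas; .dt.days keeps the whole-day count
duration_days = (end - start).dt.days
print(duration_days.tolist())
```

The second camp spans more than two years, so very long durations in this column are real values rather than parsing errors.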

In [35]:
# Int_days_diff: Registration_Date - First_Interaction, in days

# first converting object into datetime format

combined['First_Interaction'] = pd.to_datetime(combined.First_Interaction, 
                                               dayfirst = True)


# then finding the diff b/w reg_date and first_int_date
combined['Int_days_diff'] = (combined['Registration_Date'] - combined['First_Interaction']).dt.days


In [37]:
combined.head()

Unnamed: 0,Patient_ID,Health_Camp_ID,Registration_Date,Var1,Var2,Var3,Var4,Var5,outcome,Donation,Health_Score,Unnamed: 4,Health Score,Number_of_stall_visited,Last_Stall_Visited_Number,Camp_Start_Date,Camp_End_Date,Category1,Category2,Category3,Income,Education_Score,Age,First_Interaction,City_Type,Employer_Category,Registration_Day,Registration_Month,Registration_Year,Camp_Start_Month,Camp_Start_Day,Camp_Start_Year,Camp_End_Month,Camp_End_Day,Camp_End_Year,Online_Presence,Camp_Duration,Int_days_diff
0,526927,6570,2005-05-14,0,0,0,0,0,0.0,,,,,,,2005-07-09,2005-07-22,First,E,2,,,,2004-11-14,,,14.0,5.0,2005.0,7,9,2005,7,22,2005,0,13,181.0
1,510379,6534,2006-05-26,0,0,0,0,0,1.0,,,,0.402054,,,2005-10-17,2007-11-07,Second,A,2,,,,2006-05-26,H,,26.0,5.0,2006.0,10,17,2005,11,7,2007,0,751,0.0
2,520968,6557,2004-01-07,0,0,0,0,0,1.0,20.0,0.611111,,,,,2004-01-04,2004-01-09,First,C,2,,,,2004-01-07,H,,7.0,1.0,2004.0,1,4,2004,1,9,2004,0,5,0.0
3,507625,6535,2004-02-12,0,0,0,0,0,0.0,,,,,,,2004-02-01,2004-02-18,First,E,2,,,,2004-02-12,B,,12.0,2.0,2004.0,2,1,2004,2,18,2004,0,17,0.0
4,502611,6581,2004-03-14,0,0,0,0,0,0.0,,,,,,,2003-12-07,2004-06-13,First,F,2,,,,2004-03-14,B,,14.0,3.0,2004.0,12,7,2003,6,13,2004,0,189,0.0


In [39]:
combined.shape

(75278, 38)

In [40]:
combined.Patient_ID.nunique()

29828

In [41]:
combined.Health_Camp_ID.nunique()

44

In [42]:
# Donation statistics per Health_Camp_ID
combined.groupby('Health_Camp_ID').Donation.describe()

Unnamed: 0_level_0,count,mean,std,min,25%,50%,75%,max
Health_Camp_ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
6523,0.0,,,,,,,
6524,54.0,25.555556,10.580628,10.0,20.0,20.0,30.0,60.0
6526,140.0,31.214286,24.034384,10.0,10.0,20.0,40.0,150.0
6527,0.0,,,,,,,
6528,0.0,,,,,,,
6529,0.0,,,,,,,
6530,42.0,19.52381,10.109735,10.0,10.0,20.0,27.5,50.0
6531,79.0,18.35443,8.232266,10.0,10.0,20.0,20.0,40.0
6532,262.0,32.824427,29.446355,10.0,10.0,30.0,40.0,330.0
6534,0.0,,,,,,,


In [43]:
# number of registrations (rows) per patient; count() includes repeats, unlike nunique()
combined.groupby('Patient_ID')['Health_Camp_ID'].count()

Patient_ID
485679    2
485680    1
485681    2
485682    1
485684    1
         ..
528651    1
528653    1
528655    2
528656    2
528657    5
Name: Health_Camp_ID, Length: 29828, dtype: int64

In [44]:
# unique Patient_ID per health camp
combined.groupby('Health_Camp_ID')['Patient_ID'].nunique()

Health_Camp_ID
6523    2084
6524     149
6526    3809
6527    4144
6528    1744
6529    3823
6530     259
6531     120
6532    1993
6534    3597
6535    1882
6536    2037
6537    3859
6538    3954
6539    1992
6540    1426
6541    1547
6542    2368
6543    6543
6544     128
6546     403
6549    1835
6552      82
6553      94
6554    2303
6555    1738
6557      52
6558      44
6560     123
6561     200
6562    2338
6563     171
6564     514
6565      66
6569     177
6570    3564
6571    2086
6575      90
6578    2837
6580    3517
6581    1485
6585    1398
6586    2624
6587      79
Name: Patient_ID, dtype: int64

In [45]:
combined.groupby('Health_Camp_ID')['Patient_ID'].transform('nunique')

0        3564
1        3597
2          52
3        1882
4        1485
         ... 
75273    1992
75274    1993
75275    3954
75276    2338
75277    1426
Name: Patient_ID, Length: 75278, dtype: int64

In [46]:
# unique HealthCamp per patient

combined["HC_per_patient"] = combined.groupby('Patient_ID')['Health_Camp_ID'].transform('nunique')

# unique patient per HealthCamp
combined["Patient_per_HC"] = combined.groupby('Health_Camp_ID')['Patient_ID'].transform('nunique')
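
The difference between aggregating and `transform` can be seen on a small made-up frame: `nunique()` returns one value per group, while `transform('nunique')` broadcasts that value back onto every row so it can be stored as a feature column.

```python
import pandas as pd

toy = pd.DataFrame({
    "Patient_ID":     [1, 1, 2, 3, 3, 3],
    "Health_Camp_ID": [10, 11, 10, 10, 10, 12],
})

# One value per patient (index = Patient_ID)
per_patient = toy.groupby("Patient_ID")["Health_Camp_ID"].nunique()

# Same values, aligned row by row with the original frame
toy["HC_per_patient"] = toy.groupby("Patient_ID")["Health_Camp_ID"].transform("nunique")
print(per_patient)
print(toy)
```

Patient 3 has three rows but only two distinct camps, which is the case where `count()` and `nunique()` disagree.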

In [47]:
combined.shape

(75278, 40)

In [48]:
combined.HC_per_patient.describe()
# distribution of the number of distinct health camps per patient (one value per registration row)

count    75278.000000
mean         5.660618
std          5.077292
min          1.000000
25%          2.000000
50%          4.000000
75%          8.000000
max         32.000000
Name: HC_per_patient, dtype: float64

In [49]:
combined.Patient_per_HC.describe()
# distribution of the number of distinct patients per health camp (one value per registration row)

count    75278.000000
mean      3081.875873
std       1457.243273
min         44.000000
25%       1993.000000
50%       2837.000000
75%       3823.000000
max       6543.000000
Name: Patient_per_HC, dtype: float64

In [50]:
# unique patients among camps starting in the same month, and in the same year

combined['Patients_Per_Month'] = combined.groupby('Camp_Start_Month')['Patient_ID'].transform('nunique')
combined['Patients_Per_Year'] = combined.groupby('Camp_Start_Year')['Patient_ID'].transform('nunique')

In [51]:
combined.groupby('Camp_Start_Year')['Patient_ID'].transform('nunique')
# gives an info about patient ID and number of patients visiting in a particular year

0        22359
1        22359
2        10902
3        10902
4         3081
         ...  
75273    10902
75274    22359
75275    10902
75276    10902
75277    10902
Name: Patient_ID, Length: 75278, dtype: int64

In [52]:
#del combined['Patients_Per_Month']
combined.groupby('Camp_Start_Year')['Patient_ID'].nunique()
# 2003     3081

Camp_Start_Year
2003     3081
2004    10902
2005    22359
2006     4681
Name: Patient_ID, dtype: int64

In [53]:
combined.groupby('Camp_End_Year')['Patient_ID'].nunique()
#2003     1753

Camp_End_Year
2003     1753
2004     6319
2005    17330
2006     5663
2007    10197
Name: Patient_ID, dtype: int64

In [54]:
# unique patients among camps ending in the same year

combined['Patients_Per_End_Year'] = combined.groupby('Camp_End_Year')['Patient_ID'].transform('nunique')

# unique patients among camps ending in the same month
combined['Patients_Per_End_Month'] = combined.groupby('Camp_End_Month')['Patient_ID'].transform('nunique')

In [55]:
combined.shape

(75278, 44)

In [56]:
combined.head()

Unnamed: 0,Patient_ID,Health_Camp_ID,Registration_Date,Var1,Var2,Var3,Var4,Var5,outcome,Donation,Health_Score,Unnamed: 4,Health Score,Number_of_stall_visited,Last_Stall_Visited_Number,Camp_Start_Date,Camp_End_Date,Category1,Category2,Category3,Income,Education_Score,Age,First_Interaction,City_Type,Employer_Category,Registration_Day,Registration_Month,Registration_Year,Camp_Start_Month,Camp_Start_Day,Camp_Start_Year,Camp_End_Month,Camp_End_Day,Camp_End_Year,Online_Presence,Camp_Duration,Int_days_diff,HC_per_patient,Patient_per_HC,Patients_Per_Month,Patients_Per_Year,Patients_Per_End_Year,Patients_Per_End_Month
0,526927,6570,2005-05-14,0,0,0,0,0,0.0,,,,,,,2005-07-09,2005-07-22,First,E,2,,,,2004-11-14,,,14.0,5.0,2005.0,7,9,2005,7,22,2005,0,13,181.0,2,3564,3564,22359,17330,7008
1,510379,6534,2006-05-26,0,0,0,0,0,1.0,,,,0.402054,,,2005-10-17,2007-11-07,Second,A,2,,,,2006-05-26,H,,26.0,5.0,2006.0,10,17,2005,11,7,2007,0,751,0.0,1,3597,7346,22359,10197,11374
2,520968,6557,2004-01-07,0,0,0,0,0,1.0,20.0,0.611111,,,,,2004-01-04,2004-01-09,First,C,2,,,,2004-01-07,H,,7.0,1.0,2004.0,1,4,2004,1,9,2004,0,5,0.0,1,52,6902,10902,6319,5293
3,507625,6535,2004-02-12,0,0,0,0,0,0.0,,,,,,,2004-02-01,2004-02-18,First,E,2,,,,2004-02-12,B,,12.0,2.0,2004.0,2,1,2004,2,18,2004,0,17,0.0,1,1882,8717,10902,6319,9102
4,502611,6581,2004-03-14,0,0,0,0,0,0.0,,,,,,,2003-12-07,2004-06-13,First,F,2,,,,2004-03-14,B,,14.0,3.0,2004.0,12,7,2003,6,13,2004,0,189,0.0,2,1485,6272,3081,6319,5140


# Target Variable

In [64]:
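def outcome(a,b,c,d):
    # 1 if any attendance signal is positive, else 0
    # comparisons with NaN evaluate to False, so missing values count as "no signal"
    return int((a > 0) or (b > 0) or (c > 0) or (d > 0))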

In [65]:
# generating the target variable
combined['Target'] = combined.apply(lambda x: outcome(x['Health_Score'], x['Health Score'], 
                                 x['Number_of_stall_visited'], x['Last_Stall_Visited_Number']), axis = 1)    
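
A vectorized sketch of the same rule, shown on made-up rows: comparing whole columns at once avoids the row-by-row `apply`, and comparisons against NaN are False, so missing values behave as in the function above. The name `signal_cols` is introduced here only for illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Health_Score":              [np.nan, 0.61, np.nan, np.nan],
    "Health Score":              [np.nan, np.nan, 0.40, np.nan],
    "Number_of_stall_visited":   [np.nan, np.nan, np.nan, 3],
    "Last_Stall_Visited_Number": [np.nan, np.nan, np.nan, 1],
})

signal_cols = ["Health_Score", "Health Score",
               "Number_of_stall_visited", "Last_Stall_Visited_Number"]

# True where any signal column is > 0 in that row, then cast to 0/1
toy["Target"] = toy[signal_cols].gt(0).any(axis=1).astype(int)
print(toy["Target"].tolist())
```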

In [66]:
combined.outcome.value_counts()

0.0    38354
1.0    14340
Name: outcome, dtype: int64

In [67]:
combined.head()

Unnamed: 0,Patient_ID,Health_Camp_ID,Registration_Date,Var1,Var2,Var3,Var4,Var5,outcome,Donation,Health_Score,Unnamed: 4,Health Score,Number_of_stall_visited,Last_Stall_Visited_Number,Camp_Start_Date,Camp_End_Date,Category1,Category2,Category3,Income,Education_Score,Age,First_Interaction,City_Type,Employer_Category,Registration_Day,Registration_Month,Registration_Year,Camp_Start_Month,Camp_Start_Day,Camp_Start_Year,Camp_End_Month,Camp_End_Day,Camp_End_Year,Online_Presence,Camp_Duration,Int_days_diff,HC_per_patient,Patient_per_HC,Patients_Per_Month,Patients_Per_Year,Patients_Per_End_Year,Patients_Per_End_Month,Target
0,526927,6570,2005-05-14,0,0,0,0,0,0.0,,,,,,,2005-07-09,2005-07-22,First,E,2,,,,2004-11-14,,,14.0,5.0,2005.0,7,9,2005,7,22,2005,0,13,181.0,2,3564,3564,22359,17330,7008,0
1,510379,6534,2006-05-26,0,0,0,0,0,1.0,,,,0.402054,,,2005-10-17,2007-11-07,Second,A,2,,,,2006-05-26,H,,26.0,5.0,2006.0,10,17,2005,11,7,2007,0,751,0.0,1,3597,7346,22359,10197,11374,1
2,520968,6557,2004-01-07,0,0,0,0,0,1.0,20.0,0.611111,,,,,2004-01-04,2004-01-09,First,C,2,,,,2004-01-07,H,,7.0,1.0,2004.0,1,4,2004,1,9,2004,0,5,0.0,1,52,6902,10902,6319,5293,1
3,507625,6535,2004-02-12,0,0,0,0,0,0.0,,,,,,,2004-02-01,2004-02-18,First,E,2,,,,2004-02-12,B,,12.0,2.0,2004.0,2,1,2004,2,18,2004,0,17,0.0,1,1882,8717,10902,6319,9102,0
4,502611,6581,2004-03-14,0,0,0,0,0,0.0,,,,,,,2003-12-07,2004-06-13,First,F,2,,,,2004-03-14,B,,14.0,3.0,2004.0,12,7,2003,6,13,2004,0,189,0.0,2,1485,6272,3081,6319,5140,0


In [68]:
combined.columns

Index(['Patient_ID', 'Health_Camp_ID', 'Registration_Date', 'Var1', 'Var2',
       'Var3', 'Var4', 'Var5', 'outcome', 'Donation', 'Health_Score',
       'Unnamed: 4', 'Health Score', 'Number_of_stall_visited',
       'Last_Stall_Visited_Number', 'Camp_Start_Date', 'Camp_End_Date',
       'Category1', 'Category2', 'Category3', 'Income', 'Education_Score',
       'Age', 'First_Interaction', 'City_Type', 'Employer_Category',
       'Registration_Day', 'Registration_Month', 'Registration_Year',
       'Camp_Start_Month', 'Camp_Start_Day', 'Camp_Start_Year',
       'Camp_End_Month', 'Camp_End_Day', 'Camp_End_Year', 'Online_Presence',
       'Camp_Duration', 'Int_days_diff', 'HC_per_patient', 'Patient_per_HC',
       'Patients_Per_Month', 'Patients_Per_Year', 'Patients_Per_End_Year',
       'Patients_Per_End_Month', 'Target'],
      dtype='object')

In [69]:
combined.Age.describe()

count     75278
unique       50
top        None
freq      51612
Name: Age, dtype: object

In [70]:
# dropping unnecessary columns
new = combined.drop(['Patient_ID', 'Health_Camp_ID', 'Registration_Date',
                      'Donation', 'Health_Score', 'Unnamed: 4',
                      'Health Score', 'Number_of_stall_visited', 'Last_Stall_Visited_Number', 
                      'Camp_Start_Date', 'Camp_End_Date', 'Income', 
                      'Education_Score', 'First_Interaction','City_Type', 'Employer_Category', 'Age'], axis = 1)

In [92]:
del combined, fhc, shc, thc, hcdtl, pp

In [71]:
new.head()

Unnamed: 0,Var1,Var2,Var3,Var4,Var5,outcome,Category1,Category2,Category3,Registration_Day,Registration_Month,Registration_Year,Camp_Start_Month,Camp_Start_Day,Camp_Start_Year,Camp_End_Month,Camp_End_Day,Camp_End_Year,Online_Presence,Camp_Duration,Int_days_diff,HC_per_patient,Patient_per_HC,Patients_Per_Month,Patients_Per_Year,Patients_Per_End_Year,Patients_Per_End_Month,Target
0,0,0,0,0,0,0.0,First,E,2,14.0,5.0,2005.0,7,9,2005,7,22,2005,0,13,181.0,2,3564,3564,22359,17330,7008,0
1,0,0,0,0,0,1.0,Second,A,2,26.0,5.0,2006.0,10,17,2005,11,7,2007,0,751,0.0,1,3597,7346,22359,10197,11374,1
2,0,0,0,0,0,1.0,First,C,2,7.0,1.0,2004.0,1,4,2004,1,9,2004,0,5,0.0,1,52,6902,10902,6319,5293,1
3,0,0,0,0,0,0.0,First,E,2,12.0,2.0,2004.0,2,1,2004,2,18,2004,0,17,0.0,1,1882,8717,10902,6319,9102,0
4,0,0,0,0,0,0.0,First,F,2,14.0,3.0,2004.0,12,7,2003,6,13,2004,0,189,0.0,2,1485,6272,3081,6319,5140,0


In [72]:
new.Category1.value_counts()

First     49892
Second    15114
Third     10272
Name: Category1, dtype: int64

In [73]:
# "Category1" mapping
mapped = {'First':1, 'Second':2, 'Third':3}
new["Category1"] = new.Category1.map(mapped)

In [74]:
new["Category1"].value_counts()

1    49892
2    15114
3    10272
Name: Category1, dtype: int64

In [75]:
new.Category2.value_counts()

F    24660
E    20988
A    10993
G    10272
D     4121
B     2426
C     1818
Name: Category2, dtype: int64

In [76]:
# Category2: integer-encode the labels
new['Category2'] = pd.factorize(new.Category2, sort = True)[0]
# factorize assigns each distinct value an integer code; sort=True makes the codes follow sorted (alphabetical) order
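
A small sketch of what `pd.factorize` returns, on made-up labels: the first element is the array of integer codes and the second is the sorted set of distinct values, so a code can always be mapped back to its label.

```python
import pandas as pd

cats = pd.Series(["E", "A", "C", "E", "F"])
codes, uniques = pd.factorize(cats, sort=True)

print(codes.tolist())    # integer code for each row
print(list(uniques))     # label that each code stands for
```

Note that these codes impose an ordering (A < C < E < F) that tree-based models can usually tolerate but linear models may misread as a magnitude.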

In [77]:
new.Category2.value_counts()

5    24660
4    20988
0    10993
6    10272
3     4121
1     2426
2     1818
Name: Category2, dtype: int64

In [78]:
new.shape

(75278, 28)

In [79]:
new.head()

Unnamed: 0,Var1,Var2,Var3,Var4,Var5,outcome,Category1,Category2,Category3,Registration_Day,Registration_Month,Registration_Year,Camp_Start_Month,Camp_Start_Day,Camp_Start_Year,Camp_End_Month,Camp_End_Day,Camp_End_Year,Online_Presence,Camp_Duration,Int_days_diff,HC_per_patient,Patient_per_HC,Patients_Per_Month,Patients_Per_Year,Patients_Per_End_Year,Patients_Per_End_Month,Target
0,0,0,0,0,0,0.0,1,4,2,14.0,5.0,2005.0,7,9,2005,7,22,2005,0,13,181.0,2,3564,3564,22359,17330,7008,0
1,0,0,0,0,0,1.0,2,0,2,26.0,5.0,2006.0,10,17,2005,11,7,2007,0,751,0.0,1,3597,7346,22359,10197,11374,1
2,0,0,0,0,0,1.0,1,2,2,7.0,1.0,2004.0,1,4,2004,1,9,2004,0,5,0.0,1,52,6902,10902,6319,5293,1
3,0,0,0,0,0,0.0,1,4,2,12.0,2.0,2004.0,2,1,2004,2,18,2004,0,17,0.0,1,1882,8717,10902,6319,9102,0
4,0,0,0,0,0,0.0,1,5,2,14.0,3.0,2004.0,12,7,2003,6,13,2004,0,189,0.0,2,1485,6272,3081,6319,5140,0


In [81]:
new.isnull().sum()

Var1                          0
Var2                          0
Var3                          0
Var4                          0
Var5                          0
outcome                   22584
Category1                     0
Category2                     0
Category3                     0
Registration_Day            334
Registration_Month          334
Registration_Year           334
Camp_Start_Month              0
Camp_Start_Day                0
Camp_Start_Year               0
Camp_End_Month                0
Camp_End_Day                  0
Camp_End_Year                 0
Online_Presence               0
Camp_Duration                 0
Int_days_diff               334
HC_per_patient                0
Patient_per_HC                0
Patients_Per_Month            0
Patients_Per_Year             0
Patients_Per_End_Year         0
Patients_Per_End_Month        0
Target                        0
dtype: int64

In [83]:
new['Registration_Day'].fillna(new.Registration_Day.mode()[0], inplace = True)

In [84]:
new['Registration_Month'].fillna(new.Registration_Month.mode()[0], inplace = True)
new['Registration_Year'].fillna(new.Registration_Year.mode()[0], inplace = True)
new['Int_days_diff'].fillna(new.Int_days_diff.median(), inplace = True)
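
The same imputation can also be written by assigning the filled column back, which avoids the chained-assignment warnings that newer pandas releases emit for column-level `inplace=True`. A minimal sketch on made-up values:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Registration_Month": [5.0, 5.0, 1.0, np.nan],
    "Int_days_diff": [181.0, 0.0, np.nan, 10.0],
})

# Mode for a calendar-like column, median for a skewed numeric one
toy["Registration_Month"] = toy["Registration_Month"].fillna(
    toy["Registration_Month"].mode()[0]
)
toy["Int_days_diff"] = toy["Int_days_diff"].fillna(toy["Int_days_diff"].median())
print(toy)
```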

In [85]:
new.isnull().sum()

Var1                          0
Var2                          0
Var3                          0
Var4                          0
Var5                          0
outcome                   22584
Category1                     0
Category2                     0
Category3                     0
Registration_Day              0
Registration_Month            0
Registration_Year             0
Camp_Start_Month              0
Camp_Start_Day                0
Camp_Start_Year               0
Camp_End_Month                0
Camp_End_Day                  0
Camp_End_Year                 0
Online_Presence               0
Camp_Duration                 0
Int_days_diff                 0
HC_per_patient                0
Patient_per_HC                0
Patients_Per_Month            0
Patients_Per_Year             0
Patients_Per_End_Year         0
Patients_Per_End_Month        0
Target                        0
dtype: int64

# Train / Test Data Split

In [86]:
train.shape, test.shape

((52694, 9), (22584, 8))

In [87]:
new.shape

(75278, 28)

In [88]:
newtrain = new.loc[:train.shape[0]-1, :]
newtest = new.loc[train.shape[0]:, :]

In [89]:
newtrain.shape,newtest.shape

((52694, 28), (22584, 28))

In [90]:
# drop outcome from the test data (it is NaN for every test row)
newtest = newtest.drop('outcome', axis = 1)

In [91]:
newtrain.shape, newtest.shape

((52694, 28), (22584, 27))
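
One thing worth checking at this point: `Target` is derived from the attendance files, so it appears to carry nearly the same information as `outcome`. If it stays among the features, the model can read the answer directly, and the resulting scores would not reflect real predictive power. A hedged sketch of a split that excludes both columns, on made-up rows (the name `leaky_cols` is hypothetical):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Var1": [0, 1, 2, 3],
    "outcome": [0.0, 1.0, np.nan, np.nan],  # NaN marks test rows
    "Target": [0, 1, 1, 0],                 # built from attendance records
})

is_train = toy["outcome"].notna()
leaky_cols = ["outcome", "Target"]          # label and label-derived column

X_train = toy.loc[is_train].drop(columns=leaky_cols)
y_train = toy.loc[is_train, "outcome"].astype(int)
X_test = toy.loc[~is_train].drop(columns=leaky_cols)
print(X_train.columns.tolist(), len(X_test))
```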

# Building a Base Model

* Logistic Regression
* Random Forest Model
* Gradient Boosting Model
* Extreme Gradient Boosting (XGBoost) Model
* CatBoost Model

In [92]:
X = newtrain.drop('outcome', axis =1)
y = newtrain.outcome

In [93]:
from sklearn.model_selection import train_test_split

In [94]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from catboost import CatBoostClassifier
from xgboost import XGBClassifier

In [95]:
# model Instances
lr = LogisticRegression()
rf = RandomForestClassifier()
gb = GradientBoostingClassifier()
xgb = XGBClassifier(eval_metric = 'auc')
cboost = CatBoostClassifier(eval_metric= 'AUC')

In [96]:
# Building a voting Classifier Model.
from sklearn.ensemble import VotingClassifier
vc = VotingClassifier(estimators=[('lr', lr), ('rf', rf), 
                                  ('gbm', gb), ('xgb', xgb), ('cb', cboost)], voting = 'soft') # soft voting averages the predicted class probabilities
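
The arithmetic behind soft voting can be sketched with NumPy alone. The probabilities below are invented for three hypothetical models and four rows, and equal weights are assumed.

```python
import numpy as np

# Rows = models, columns = samples; each entry is P(class 1) from one model
proba = np.array([
    [0.90, 0.20, 0.60, 0.40],
    [0.80, 0.10, 0.40, 0.45],
    [0.70, 0.30, 0.70, 0.35],
])

# Soft voting: average the probabilities, then threshold the average
soft_vote = proba.mean(axis=0)
soft_labels = (soft_vote >= 0.5).astype(int)

# Hard voting, for contrast: threshold each model first, then take the majority
hard_labels = ((proba >= 0.5).sum(axis=0) >= 2).astype(int)
print(soft_vote, soft_labels, hard_labels)
```

Soft voting keeps the averaged probability available, which matters here because the submission asks for a probability rather than a class label.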

In [98]:
pred = vc.fit(X, y).predict(newtest)

Learning rate set to 0.055992
0:	total: 24.4ms	remaining: 24.4s
1:	total: 38.5ms	remaining: 19.2s
2:	total: 50.8ms	remaining: 16.9s
...

In [99]:
pred

array([1., 0., 0., ..., 0., 0., 0.])

In [100]:
submissiondata['outcome'] = pred
# share of each predicted class
submissiondata['outcome'].value_counts(normalize = True)

0.0    0.725735
1.0    0.274265
Name: outcome, dtype: float64

In [127]:
# submission DataFrame
# `pred` as displayed above is a 1-D array of class labels, so it is used as is
submission = pd.DataFrame({'Patient_ID': test.Patient_ID,
                           'Health_Camp_ID': test.Health_Camp_ID,
                           'Outcome': pred})

In [103]:
submissiondata.to_csv('VotingModel.csv', index = False)

In [128]:
from lightgbm import LGBMClassifier

In [134]:
lgbm = LGBMClassifier(n_estimators=500, max_depth=10,
                     random_state=42,
                     learning_rate=0.01, 
                     scale_pos_weight = 3)

In [135]:
pred_lgbm = lgbm.fit(X,y).predict_proba(newtest)

In [136]:
pred_lgbm[:,1]

array([0.85390021, 0.60135099, 0.37209634, ..., 0.6567897 , 0.35696185,
       0.75523136])

In [137]:
submission = pd.DataFrame({'Patient_ID':test.Patient_ID,
                           'Health_Camp_ID': test.Health_Camp_ID,
                          'Outcome': pred_lgbm[:, 1]})

submission.to_csv('LGB_Model.csv', index = False)
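
The `scale_pos_weight = 3` setting above up-weights the positive class. One common starting point (an assumption here; the notebook does not say how 3 was chosen) is the ratio of negative to positive training labels. A minimal sketch on a made-up label vector `y_demo`:

```python
import pandas as pd

# Hypothetical labels: 3 positives out of 10 rows (not the competition data).
y_demo = pd.Series([0, 0, 1, 0, 0, 1, 0, 0, 1, 0])

counts = y_demo.value_counts()
# Heuristic starting point: (# negative rows) / (# positive rows).
ratio = counts.loc[0] / counts.loc[1]
print(f"negatives={counts.loc[0]}, positives={counts.loc[1]}, ratio={ratio:.2f}")
```

The resulting value is only a first guess and would normally be tuned together with the other parameters.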

# Cross validation and RFECV

In [138]:
from sklearn.tree import DecisionTreeClassifier

In [139]:
from sklearn.feature_selection import RFECV

In [140]:
dtree = DecisionTreeClassifier()
rfe = RFECV(estimator = dtree, step = 1,
           min_features_to_select = 5, cv = 5, verbose = 5)

In [141]:
rfe.fit(X,y)
feat = list(rfe.get_feature_names_out())
print(feat)

Fitting estimator with 26 features.
Fitting estimator with 25 features.
Fitting estimator with 24 features.
Fitting estimator with 23 features.
Fitting estimator with 22 features.
Fitting estimator with 21 features.
Fitting estimator with 20 features.
Fitting estimator with 19 features.
Fitting estimator with 18 features.
Fitting estimator with 17 features.
Fitting estimator with 16 features.
Fitting estimator with 15 features.
Fitting estimator with 14 features.
Fitting estimator with 13 features.
Fitting estimator with 12 features.
Fitting estimator with 11 features.
Fitting estimator with 10 features.
Fitting estimator with 9 features.
Fitting estimator with 8 features.
Fitting estimator with 7 features.
Fitting estimator with 6 features.

In [142]:
feat

['Var1',
 'Var2',
 'Var4',
 'Var5',
 'Category1',
 'Category2',
 'Category3',
 'Registration_Day',
 'Registration_Month',
 'Registration_Year',
 'Camp_Start_Month',
 'Camp_Start_Day',
 'Camp_Start_Year',
 'Camp_End_Month',
 'Camp_End_Day',
 'Camp_End_Year',
 'Online_Presence',
 'Camp_Duration',
 'Int_days_diff',
 'HC_per_patient',
 'Patient_per_HC',
 'Patients_Per_Month',
 'Patients_Per_Year',
 'Patients_Per_End_Year',
 'Patients_Per_End_Month']

In [143]:
# RFE Features
rfe_input = X.loc[:, feat]
rfe_test = newtest.loc[:, feat]
rfe_input.shape, rfe_test.shape

((75278, 25), (35249, 25))

In [146]:
lgbm = LGBMClassifier(n_estimators=500, max_depth=10,
                     random_state=42,
                     learning_rate=0.01, 
                     scale_pos_weight = 3)

pred_lgbm = lgbm.fit(rfe_input,y).predict_proba(rfe_test)
submission = pd.DataFrame({'Patient_ID':test.Patient_ID,
                           'Health_Camp_ID': test.Health_Camp_ID,
                          'Outcome': pred_lgbm[:, 1]})

submission.to_csv('RFE_Model.csv', index = False)

# Cross-Validation

In [145]:
from sklearn.model_selection import KFold

In [147]:
kfold = KFold(n_splits = 5, shuffle = True, random_state = 42)  # fixed seed for reproducible folds
lgbm = LGBMClassifier(n_estimators=500, max_depth=10,
                     random_state=42,
                     learning_rate=0.01, 
                     scale_pos_weight = 3)

pred_df = pd.DataFrame()
# iterate over all 5 folds: fit on each fold's training part, then predict the test set
for i, (train_idx, _) in enumerate(kfold.split(X)):
    xtrain = X.iloc[train_idx]      # training rows of fold i
    ytrain = y.iloc[train_idx]
    lgbm.fit(xtrain, ytrain)
    pred_df[i] = lgbm.predict_proba(newtest)[:,1]
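
A small sketch of how `KFold.split` yields index pairs, on a hypothetical 10-row array (`X_demo` and `kfold_demo` are illustrative names): iterating the generator once visits every fold, and the validation indices are disjoint and together cover all rows.

```python
import numpy as np
from sklearn.model_selection import KFold

X_demo = np.arange(20).reshape(10, 2)   # hypothetical 10-row feature matrix
kfold_demo = KFold(n_splits=5, shuffle=True, random_state=0)

seen = []
for fold, (train_idx, valid_idx) in enumerate(kfold_demo.split(X_demo)):
    print(fold, "train:", train_idx, "valid:", valid_idx)
    seen.extend(valid_idx.tolist())   # collect validation rows across folds
```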

In [148]:
median_prob = pred_df.median(axis = 1)

In [149]:
submission = pd.DataFrame({'Patient_ID':test.Patient_ID,
                          'Health_Camp_ID': test.Health_Camp_ID,
                          'Outcome': median_prob})

submission.to_csv('MedianProb_Model.csv', index = False)
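
Taking the row-wise median of the fold predictions makes the blend less sensitive to a single fold model that disagrees with the rest. A toy illustration with made-up probabilities (`pred_demo` is hypothetical, not the notebook's `pred_df`):

```python
import pandas as pd

# Hypothetical probabilities from 5 fold models for 3 test rows.
pred_demo = pd.DataFrame({
    0: [0.10, 0.80, 0.50],
    1: [0.12, 0.75, 0.55],
    2: [0.11, 0.78, 0.45],
    3: [0.90, 0.79, 0.52],   # fold 3 is an outlier on the first row
    4: [0.09, 0.77, 0.48],
})

median_demo = pred_demo.median(axis=1)   # row-wise median across folds
mean_demo = pred_demo.mean(axis=1)       # row-wise mean, for comparison
print(pd.DataFrame({"median": median_demo, "mean": mean_demo}))
```

On the first row the mean is pulled toward the outlier while the median stays close to the other four predictions.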

# Summary

* Feature engineering improved model performance drastically.
* LGBM appeared to be the best model for this competition, reaching #2 on the public leaderboard.
* The features selected by RFECV did not work well, so we dropped that model.
* Parameter tuning of LGBM could take model performance to a whole new level.
* The cross-validation model did a good job, earning us a couple of brownie points on the leaderboard.
* We learned that the target variable can be masked and may have to be derived from the problem statement.
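
The last point refers to the favourable-outcome rule in the problem statement: a health score for the first two camp formats, or at least one stall visit for the third. A minimal sketch of deriving such a target with pandas follows. The miniature tables and the column names `Health_Score` and `Number_of_stall_visited` are assumptions for illustration, not taken from the notebook.

```python
import pandas as pd

# Hypothetical miniature tables; column names are assumptions.
registrations = pd.DataFrame({
    "Patient_ID":     [1, 2, 3, 4],
    "Health_Camp_ID": [10, 10, 30, 30],
})
scored = pd.DataFrame({           # attendees of format 1/2 camps with a score
    "Patient_ID": [1], "Health_Camp_ID": [10], "Health_Score": [0.42],
})
stalls = pd.DataFrame({           # attendees of format 3 camps
    "Patient_ID": [3, 4], "Health_Camp_ID": [30, 30],
    "Number_of_stall_visited": [2, 0],
})

keys = ["Patient_ID", "Health_Camp_ID"]
df = (registrations.merge(scored, on=keys, how="left")
                   .merge(stalls, on=keys, how="left"))

# Favourable outcome: a health score exists, or at least one stall was visited.
df["outcome"] = (df["Health_Score"].notna()
                 | (df["Number_of_stall_visited"].fillna(0) > 0)).astype(int)
print(df[keys + ["outcome"]])
```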