# Healthcare Analytics - Exploring the Data

# Jupyter Preliminaries
* Each block is called a cell. Cells can be used for code or text (markdown).
* Each code cell is marked by a number in square brackets on the left. If you are working on a particular cell at the moment, you'll see a play button or a square (stop) button instead, depending on whether or not the cell is running.
* If a cell that you're not working on is running, the number is replaced by an asterisk (*).
* To run a cell, press the aforementioned play button, use one of the run options on the toolbar, or use Shift+Enter on your keyboard.
* Depending on the code, some cells may take a few seconds to run. You can stop execution partway using the stop button, but bear in mind that the remaining statements in the cell will not run, so its results may be incomplete.
* The input files are available in the input folder under 'Data' on the right. Output files are saved to, and can be downloaded from, the output folder.

# Python Preliminaries
* List - Represented by [], items separated by commas. List items can have any datatype. Ordered, changeable, duplicates allowed
* Dictionary - Represented by {key:value}, may have multiple key-value pairs. An item can be accessed using its key. No duplicate keys allowed, changeable, keeps insertion order (Python 3.7+)
* Set - Represented by {item1, item2}; an empty set is written set(), since {} is an empty dictionary. No duplicates allowed, unordered, unindexed
* Tuple - Represented by (). Ordered, unchangeable, duplicates allowed
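
The four containers above can be tried out in a few lines. This is a minimal sketch with made-up values; the variable names are only for illustration.

```python
# A list is ordered and changeable, and it may hold duplicates.
camps = ['First', 'Second', 'Third', 'First']
camps.append('Second')

# A dictionary maps unique keys to values; assigning to an existing key overwrites it.
patient = {'Patient_ID': 1, 'Age': 40}
patient['Age'] = 41

# A set keeps only distinct items. Note that {} creates an empty dict, not a set.
unique_camps = set(camps)
empty_braces = {}

# A tuple is ordered and unchangeable, so item assignment raises a TypeError.
point = (3, 4)
try:
    point[0] = 10
except TypeError as err:
    tuple_error = str(err)

print(camps)
print(patient)
print(unique_camps)
print(tuple_error)
```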

# **PROBLEM**
Link to the original source: https://datahack.analyticsvidhya.com/contest/janatahack-healthcare-analytics/#ProblemStatement

MedCamp organizes health camps in several cities with poor work-life balance. They reach out to working people and ask them to register for these health camps.

There are three types of health camps. In the first two, the patient is given a health score after a few check-ups. Getting a health score is considered a favorable outcome.

In the third type of health camp, stalls are set up to raise awareness. A favorable outcome here is when a patient visits at least one stall.

The goal is to predict whether or not the outcome will be favorable.



# Imports

In [None]:
import numpy as np
import pandas as pd

* numpy - A library for fast numerical computation on arrays
* pandas - THE Python go-to library for Data Analysis

# Pandas Preliminaries
* Series - A one-dimensional array that has axis labels, i.e. indices
* Dataframe - A two-dimensional tabular representation of data with index and column labels. Each dataframe column can be used as a single series. A dataframe column can be accessed using df[col_name] or df.col_name (the latter only when the name is a valid Python identifier). A smaller dataframe with only a few of the original columns can also be accessed using df[col_list]
* CSV file - Comma Separated Values. This contains tabular data with the elements of each row separated by commas. A csv file can be opened in a spreadsheet program such as Excel.
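
As a rough illustration of these three ideas, the sketch below reads a tiny CSV from an in-memory string (so no file is needed) and then pulls out a Series and a smaller dataframe. The values are made up; only the column names are borrowed from this notebook.

```python
import io
import pandas as pd

# Made-up CSV text: a header row, then one row per patient.
csv_text = """Patient_ID,Age,City_Type
101,34,A
102,51,B
103,29,A
"""
toy = pd.read_csv(io.StringIO(csv_text))

ages = toy['Age']                      # a single column is a Series
subset = toy[['Patient_ID', 'Age']]    # a list of columns gives a smaller dataframe

print(type(ages).__name__, type(subset).__name__)
print(toy.Age.equals(toy['Age']))      # attribute access returns the same Series
```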

# Reading the CSV Files
1. Patient details - Age, Education etc.; imported as patient_profile
2. Details about the health camps; imported as hc_detail
3. Attendance information from each of the three health camp types; imported as hc1, hc2 and hc3
4. Train data set - Patient ID, Health Camp ID, anonymized variables; imported as df_temp (merged into df later)
5. Test data set - Same as Train; imported as test

In [None]:
patient_profile = pd.read_csv('../input/janatahack-healthcare-analytics/Train/Patient_Profile.csv')
hc_detail = pd.read_csv('../input/janatahack-healthcare-analytics/Train/Health_Camp_Detail.csv')
hc1 = pd.read_csv('../input/janatahack-healthcare-analytics/Train/First_Health_Camp_Attended.csv')
hc2 = pd.read_csv('../input/janatahack-healthcare-analytics/Train/Second_Health_Camp_Attended.csv')
hc3 = pd.read_csv('../input/janatahack-healthcare-analytics/Train/Third_Health_Camp_Attended.csv')
df_temp = pd.read_csv('../input/janatahack-healthcare-analytics/Train/Train.csv')
test = pd.read_csv('../input/janatahack-healthcare-analytics/Test.csv')

# Exploratory Data Analysis
* The data needs to be inspected first, to gauge which operations are needed to bring it into a state that the model can work with.
* df.info() prints the datatype of each column, along with its number of non-null entries.
* df.nunique() gives the number of unique values in each column.
* df[col].unique() gives an array of the unique values in a particular column.
* df.isnull() gives a True/False output for whether or not a particular element is a missing value. df.isnull().sum() sums up the True outputs for every column, thereby giving the number of missing values in each column.
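
The inspection functions listed above can be seen side by side on a small made-up dataframe. This is only a sketch; the column names echo the real data but the values are invented.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'City_Type': ['A', 'B', None, 'A'],
    'Income': [1.0, np.nan, 3.0, np.nan],
})

toy.info()                               # prints dtypes and non-null counts; returns None
unique_per_column = toy.nunique()        # missing values are not counted by default
city_values = toy['City_Type'].unique()  # distinct values, including the missing entry
missing_per_column = toy.isnull().sum()  # number of missing values per column

print(unique_per_column)
print(city_values)
print(missing_per_column)
```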

In [None]:
patient_profile.info()  # info() prints its summary itself and returns None
patient_profile

In [None]:
hc_detail.info()  # info() prints its summary itself and returns None
hc_detail

In [None]:
hc1.info()  # info() prints its summary itself and returns None
hc1

In [None]:
hc1 = hc1.rename({'Health_Score':'Health_Score_1'}, axis = 1)
hc1

In [None]:
hc2.info()  # info() prints its summary itself and returns None
hc2

In [None]:
hc2 = hc2.rename({'Health Score':'Health_Score_2'}, axis = 1)
hc2

In [None]:
hc3.info()  # info() prints its summary itself and returns None
hc3

In [None]:
df_temp.info()  # info() prints its summary itself and returns None
df_temp

In [None]:
test

# Merging all the Dataframes

In [None]:
df = df_temp.merge(hc_detail, on = ['Health_Camp_ID'],how = 'left')
df = df.merge(hc1, on = ['Patient_ID','Health_Camp_ID'],how = 'left')
df = df.merge(hc2, on = ['Patient_ID','Health_Camp_ID'],how = 'left')
df = df.merge(hc3, on = ['Patient_ID','Health_Camp_ID'],how = 'left')
df = df.merge(patient_profile, on = ['Patient_ID'],how = 'left')
df

In [None]:
df.columns

# Missing Values
* Missing values are pieces of information unavailable due to various circumstances. In a dataframe, these values show up as NaN or None, and generally disrupt other operations done on the dataframe, as well as the working of the model.
* They can be filled in using .fillna(). .fillna(0) replaces all of them with 0s. In certain cases, -1 would be more appropriate, while in others, the mean/mode of the other data in the column is used.
* If there aren't many missing values in comparison to the size of the data, they can be dropped using .dropna()
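
A minimal sketch of these options on invented values is shown here. Which fill value suits which column is a judgment call; the choices below are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'Donation': [10.0, np.nan, 30.0],
    'Last_Stall': [2.0, np.nan, np.nan],
    'Age': [40.0, np.nan, 60.0],
})

filled = toy.copy()
filled['Donation'] = filled['Donation'].fillna(0)            # treat missing as "no donation"
filled['Last_Stall'] = filled['Last_Stall'].fillna(-1)       # -1 as a marker for "no stall visited"
filled['Age'] = filled['Age'].fillna(filled['Age'].mean())   # impute with the column mean

dropped = toy.dropna()   # keeps only the rows without any missing value

print(filled)
print(len(dropped))
```

Note that .fillna() and .dropna() return new objects, so the result has to be assigned back, as the cells in this notebook do.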

In [None]:
df.isnull().sum()

In [None]:
df = df.drop(['Unnamed: 4'],axis = 1)
df

In [None]:
df['Last_Stall_Visited_Number'].unique()

In [None]:
df[['Donation','Health_Score_1','Health_Score_2','Number_of_stall_visited']] = df[['Donation','Health_Score_1','Health_Score_2','Number_of_stall_visited']].fillna(0)
df['Last_Stall_Visited_Number'] = df['Last_Stall_Visited_Number'].fillna(-1)
df

In [None]:
for col in df.select_dtypes(include = 'object').columns:
    print(col + '\n',df[col].unique(),'\n')

In [None]:
df['Income'].unique()

In [None]:
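# Some entries are the string 'None'; swap in '-1' as a placeholder so the columns can be cast to float
df = df.replace('None','-1')
df[['Income','Education_Score','Age']] = df[['Income','Education_Score','Age']].astype('float64')
# Replace the Age placeholder with the mean of the known ages
df['Age'] = df['Age'].replace(-1,df['Age'][df['Age']!=-1].mean())
# Give missing categories an explicit label instead of dropping the rows
df['City_Type'] = df['City_Type'].fillna('Unknown')
df['Employer_Category'] = df['Employer_Category'].fillna('Others')
df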

In [None]:
df.isnull().sum()

In [None]:
df = df.dropna()
df

# Defining the Outcome
As mentioned in the problem, the goal is to predict whether or not the outcome is favorable. This is best represented using a binary column.

The conditions are:
1. In the third type of health camp, the outcome is favorable (1) if the patient visits at least one stall. This data is available in the column 'Number_of_stall_visited'.
2. In the other two types of health camps, the outcome is favorable (1) when a health score is obtained. The scores are accessible in the columns 'Health_Score_1' and 'Health_Score_2'.

In [None]:
df['Outcome'] = np.where(df['Number_of_stall_visited']>0,1,
                         np.where(df['Health_Score_1']>0,1,
                                  np.where(df['Health_Score_2']>0,1,0)))
df
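
The nested np.where calls amount to a logical OR of the three conditions. The sketch below, on made-up values, checks that a flatter formulation gives the same column; it is an alternative way to write the same rule, not a change to it.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'Number_of_stall_visited': [0, 2, 0, 0],
    'Health_Score_1': [0.0, 0.0, 0.4, 0.0],
    'Health_Score_2': [0.0, 0.0, 0.0, 0.0],
})

# The nested form used in this notebook
nested = np.where(toy['Number_of_stall_visited'] > 0, 1,
                  np.where(toy['Health_Score_1'] > 0, 1,
                           np.where(toy['Health_Score_2'] > 0, 1, 0)))

# The same rule as a single boolean expression
flat = ((toy['Number_of_stall_visited'] > 0)
        | (toy['Health_Score_1'] > 0)
        | (toy['Health_Score_2'] > 0)).astype(int)

print(flat.tolist())
```

Both versions rely on the missing scores and stall counts having been filled with 0 earlier, so that 0 stands for "nothing recorded".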

# Further Data Preprocessing
Most models need their training data in numeric form, so the non-numeric columns must be converted appropriately.

In [None]:
df.select_dtypes('object').columns

The pd.to_datetime() function parses date strings into pandas datetime values (displayed as YYYY-MM-DD). After a date column is converted to this type, the year, month, day and several other features can be drawn directly from it through the .dt accessor.
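
As a small sketch of that conversion, the example below parses two invented dates and extracts a few features. The ISO-style strings are an assumption for illustration; the real columns may use a different format.

```python
import pandas as pd

# Made-up dates in ISO format
toy = pd.DataFrame({'Camp_Start_Date': ['2003-08-16', '2005-02-01']})

toy['Camp_Start_Date'] = pd.to_datetime(toy['Camp_Start_Date'])
toy['Camp_Start_Date_year'] = toy['Camp_Start_Date'].dt.year
toy['Camp_Start_Date_month'] = toy['Camp_Start_Date'].dt.month
toy['Camp_Start_Date_day'] = toy['Camp_Start_Date'].dt.day

print(toy.dtypes)
print(toy)
```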

In [None]:
dates = ['Registration_Date','First_Interaction','Camp_Start_Date', 'Camp_End_Date']
for date in dates:
    df[date] = pd.to_datetime(df[date])  # convert the whole column at once
    df[date + '_year'] = df[date].dt.year
    df[date + '_month'] = df[date].dt.month
df

In [None]:
df = df.drop(dates, axis = 1)

A categorical column has just a few unique values (categories), and each entry takes one of these values. Label encoding replaces each category with an integer. LabelEncoder assigns these integers in sorted order of the categories; they are only identifiers and carry no meaningful ranking, although some models may still treat them as ordered.
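
A minimal sketch of label encoding on an invented column follows. The category names are placeholders, not values from the data set.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

cities = pd.Series(['C', 'A', 'B', 'A'])   # made-up categories

le = LabelEncoder()
codes = le.fit_transform(cities)

print(list(le.classes_))                    # the classes are stored in sorted order
print(list(codes))                          # each entry becomes the index of its class
print(list(le.inverse_transform(codes)))    # the original labels can be recovered
```

When one encoder is reused across several columns through .apply, as in the cell below, it is refitted for each column, so afterwards it only remembers the classes of the last column.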

In [None]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df[['City_Type','Employer_Category','Category2']] = df[['City_Type','Employer_Category','Category2']].apply(le.fit_transform)
df['Category1'] = df['Category1'].replace({'First':1,'Second':2,'Third':3})
df

# Downloading the Final Output

In [None]:
# to_csv writes the file and returns None when given a path, so there is nothing to assign
df.to_csv('final_df.csv', index = False)

# Plotting and Chart Representations
Visualizing the data often gives a clearer picture of it. The basic library used for data visualization is Matplotlib, while Seaborn and Plotly offer higher-level and interactive charts.

In [None]:
import seaborn as sns
sns.countplot(x = 'Outcome', data = df)  # recent seaborn versions expect the column as a keyword argument

In [None]:
import matplotlib.pyplot as plt
fig = plt.figure()
plt.hist(df['Age'],color = 'r')

In [None]:
df.columns

In [None]:
import plotly.express as px

# Without `values`, each sector is sized by its number of rows (summing Patient_ID would not be meaningful)
fig = px.sunburst(df,path = ['Outcome','Employer_Category'],color = 'City_Type', color_continuous_scale=px.colors.sequential.GnBu)
fig.show()