## Exploring data: initial assessment of the data

This notebook uses the medical appointment no-show dataset from Kaggle, available [here](https://www.kaggle.com/joniarroba/noshowappointments).

The dataset contains information about 110,527 medical appointments, each described by 14 columns: 13 characteristic features and the target variable. Some patients did not show up for their appointments. It would be useful to know beforehand whether a patient is likely to show up, given the characteristics of the appointment.

The target variable is:

- **No-show** (str) - No (patient showed up) or Yes (patient did not show up)

The following features are provided in the data:

- **PatientId** (float) - Identification of a patient
- **AppointmentID** (int) - Identification of each appointment
- **Gender** (str) - Male or Female
- **ScheduledDay** (str) - The day (with time) the patient called or registered the appointment; this is on or before the appointment day.
- **AppointmentDay** (str) - The day of the actual appointment, when the patient has to visit the doctor.
- **Age** (int) - Age of the patient.
- **Neighbourhood** (str) - Where the appointment takes place.
- **Scholarship** (int) - True or False
- **Hipertension** (int) - True or False
- **Diabetes** (int) - True or False
- **Alcoholism** (int) - True or False
- **Handcap** (int) - True or False
- **SMS_received** (int) - True or False

**This section focuses on loading the data and understanding the features.**


In [32]:
# import necessary packages
import pandas as pd

In [3]:
# load the CSV file
patient_data = pd.read_csv('../data/KaggleV2-May-2016.csv')
patient_data.head()

| | PatientId | AppointmentID | Gender | ScheduledDay | AppointmentDay | Age | Neighbourhood | Scholarship | Hipertension | Diabetes | Alcoholism | Handcap | SMS_received | No-show |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 29872500000000.0 | 5642903 | F | 2016-04-29T18:38:08Z | 2016-04-29T00:00:00Z | 62 | JARDIM DA PENHA | 0 | 1 | 0 | 0 | 0 | 0 | No |
| 1 | 558997800000000.0 | 5642503 | M | 2016-04-29T16:08:27Z | 2016-04-29T00:00:00Z | 56 | JARDIM DA PENHA | 0 | 0 | 0 | 0 | 0 | 0 | No |
| 2 | 4262962000000.0 | 5642549 | F | 2016-04-29T16:19:04Z | 2016-04-29T00:00:00Z | 62 | MATA DA PRAIA | 0 | 0 | 0 | 0 | 0 | 0 | No |
| 3 | 867951200000.0 | 5642828 | F | 2016-04-29T17:29:31Z | 2016-04-29T00:00:00Z | 8 | PONTAL DE CAMBURI | 0 | 0 | 0 | 0 | 0 | 0 | No |
| 4 | 8841186000000.0 | 5642494 | F | 2016-04-29T16:07:23Z | 2016-04-29T00:00:00Z | 56 | JARDIM DA PENHA | 0 | 1 | 1 | 0 | 0 | 0 | No |


In [5]:
# Shape of the data
patient_data.shape

(110527, 14)

In [30]:
# Check the distribution of target variable
patient_data['No-show'].value_counts()

No     88208
Yes    22319
Name: No-show, dtype: int64
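
The raw counts are easier to interpret as proportions. The sketch below rebuilds the two counts printed above as a small Series (the names `no_show_counts` and `no_show_share` are illustrative, not part of the original notebook) and divides by the total.

```python
import pandas as pd

# Counts copied from the value_counts() output above.
no_show_counts = pd.Series({"No": 88208, "Yes": 22319}, name="No-show")

# Dividing by the total gives the share of each class.
no_show_share = no_show_counts / no_show_counts.sum()
print(no_show_share.round(3))
```

On the full DataFrame, `patient_data['No-show'].value_counts(normalize=True)` should give the same shares directly. Roughly one appointment in five is a no-show, so the target classes are imbalanced.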

###### There are 110,527 observations and 14 columns: 13 features plus the target variable, No-show.

In [5]:
# Check the type of features
patient_data.dtypes

PatientId         float64
AppointmentID       int64
Gender             object
ScheduledDay       object
AppointmentDay     object
Age                 int64
Neighbourhood      object
Scholarship         int64
Hipertension        int64
Diabetes            int64
Alcoholism          int64
Handcap             int64
SMS_received        int64
No-show            object
dtype: object
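
The two date columns and the target are loaded as `object` (plain strings), which limits what can be done with them. A minimal sketch of one possible conversion follows, using two rows copied from the `head()` output as a stand-in for the CSV file. The names `sample_csv`, `sample` and `no_show_flag` are hypothetical and only serve this illustration.

```python
import io
import pandas as pd

# Two rows copied from the head() output above, standing in for the CSV file.
sample_csv = io.StringIO(
    "PatientId,AppointmentID,Gender,ScheduledDay,AppointmentDay,Age,Neighbourhood,"
    "Scholarship,Hipertension,Diabetes,Alcoholism,Handcap,SMS_received,No-show\n"
    "29872500000000.0,5642903,F,2016-04-29T18:38:08Z,2016-04-29T00:00:00Z,62,"
    "JARDIM DA PENHA,0,1,0,0,0,0,No\n"
    "558997800000000.0,5642503,M,2016-04-29T16:08:27Z,2016-04-29T00:00:00Z,56,"
    "JARDIM DA PENHA,0,0,0,0,0,0,No\n"
)
sample = pd.read_csv(sample_csv)

# Parse the ISO 8601 strings into timestamps.
for col in ["ScheduledDay", "AppointmentDay"]:
    sample[col] = pd.to_datetime(sample[col])

# Encode the target as a boolean: True means the patient did not show up.
sample["no_show_flag"] = sample["No-show"].eq("Yes")

print(sample.dtypes)
```

The same steps could be applied to `patient_data` after loading. Whether to overwrite the original columns or add new ones is a design choice left open here.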

###### The features can be divided into the following categorical and continuous variables:

###### 1. Categorical - PatientId, Gender, Neighbourhood, Scholarship, Hipertension, Diabetes, Alcoholism, Handcap, SMS_received
###### 2. Continuous - ScheduledDay, AppointmentDay, Age, AppointmentID
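
One rough way to sanity-check such a split is to count the distinct values in each column: a column with only a handful of distinct values usually behaves as a categorical feature. The sketch below does this on four columns taken from the five rows shown earlier. The `sample` frame is an illustrative stand-in, not the full dataset, so the counts say nothing about the real cardinalities.

```python
import pandas as pd

# Four columns from the five rows shown in the head() output above.
sample = pd.DataFrame({
    "Gender": ["F", "M", "F", "F", "F"],
    "Age": [62, 56, 62, 8, 56],
    "Neighbourhood": ["JARDIM DA PENHA", "JARDIM DA PENHA", "MATA DA PRAIA",
                      "PONTAL DE CAMBURI", "JARDIM DA PENHA"],
    "Scholarship": [0, 0, 0, 0, 0],
})

# Number of distinct values per column; low counts suggest a categorical feature.
distinct_counts = sample.nunique()
print(distinct_counts)
```

Running `patient_data.nunique()` on the full data would give the counts that actually matter for the grouping above.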