<a href="https://colab.research.google.com/github/GuerrillaGambit/Medical-appointment-No-Show/blob/main/DataScience.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# Review: Medical Appointment No Shows

## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#wrangling">Data Wrangling</a></li>
<li><a href="#cleaning">Data Cleaning</a></li>
</ul>

<a id='intro'></a>
## Introduction

> * The dataset is Medical Appointment No Shows. <br>
> * The variable names are largely self-explanatory. The columns in the dataset are 'PatientId', 'AppointmentID', 'Gender', 'ScheduledDay', 'AppointmentDay', 'Age', 'Neighbourhood', 'Scholarship', 'Hipertension', 'Diabetes', 'Alcoholism', 'Handcap', 'SMS_received', 'No-show'. <br>
> * A No-show value of Yes means the patient did not turn up for the doctor visit; No means the patient attended.

## Important information:

1. Explain your findings and the steps you've followed using the Markdown cells. Create Markdown cells wherever necessary.
2. Double-click the Markdown cells to edit them and add your inferences.
3. Add the code cells your task needs. You are not restricted to the cells created beforehand.


### Steps to be followed:
1. Load the data
2. Variable Identification.
3. Check for cleanliness.
4. Trim and clean the data.
5. Feature Selection.
6. Modification of features if needed.


In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import sklearn

<a id='wrangling'></a>
## Data Wrangling

> **Tip**: In this section of the report, you will load in the data, check for cleanliness, and then trim and clean your dataset for analysis. Make sure that you document your steps carefully and justify your cleaning decisions.

### Loading the data

In [None]:
#mounting Google Drive
from google.colab import drive
drive.mount('/content/gdrive')


Mounted at /content/gdrive


In [None]:
#providing configuration path
import os
os.environ['KAGGLE_CONFIG_DIR'] = "/content/gdrive/My Drive/Kaggle"

In [None]:
#Changing working directory
%cd /content/gdrive/My Drive/Kaggle

/content/gdrive/My Drive/Kaggle


In [None]:
#Downloading Dataset
!kaggle datasets download -d joniarroba/noshowappointments

Downloading noshowappointments.zip to /content/gdrive/My Drive/Kaggle
  0% 0.00/2.40M [00:00<?, ?B/s]
100% 2.40M/2.40M [00:00<00:00, 39.1MB/s]


In [None]:
!ls

kaggle.json  KaggleV2-May-2016.csv  noshowappointments.zip


In [None]:
#unzipping the zip files and deleting the zip files
!unzip \*.zip  && rm *.zip

Archive:  noshowappointments.zip
replace KaggleV2-May-2016.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: y
  inflating: KaggleV2-May-2016.csv   


In [None]:
dataset = pd.read_csv('KaggleV2-May-2016.csv')

In [None]:
# Create a summary of the data.
dataset.describe()

Unnamed: 0,PatientId,AppointmentID,Age,Scholarship,Hipertension,Diabetes,Alcoholism,Handcap,SMS_received
count,110527.0,110527.0,110527.0,110527.0,110527.0,110527.0,110527.0,110527.0,110527.0
mean,147496300000000.0,5675305.0,37.088874,0.098266,0.197246,0.071865,0.0304,0.022248,0.321026
std,256094900000000.0,71295.75,23.110205,0.297675,0.397921,0.258265,0.171686,0.161543,0.466873
min,39217.84,5030230.0,-1.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,4172614000000.0,5640286.0,18.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,31731840000000.0,5680573.0,37.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,94391720000000.0,5725524.0,55.0,0.0,0.0,0.0,0.0,0.0,1.0
max,999981600000000.0,5790484.0,115.0,1.0,1.0,1.0,1.0,4.0,1.0


The minimum age in the dataset is negative (-1), which is invalid and needs to be handled, and the maximum age is 115.

In [None]:
# Drop rows with a negative age, which is not a valid value.
dataset.drop(dataset[dataset['Age']<0].index, inplace=True)

In [None]:
# Cap Handcap at 1 so that it becomes a binary flag (values 2, 3 and 4 become 1).
dataset.loc[dataset.Handcap > 1, "Handcap"] = 1

### Variable Identification
* Identify the target variable (dependent variable) and the predictor variables (independent variables) that affect the status of the target variable.

In [None]:
dataset.head()

Unnamed: 0,PatientId,AppointmentID,Gender,ScheduledDay,AppointmentDay,Age,Neighbourhood,Scholarship,Hipertension,Diabetes,Alcoholism,Handcap,SMS_received,No-show
0,29872500000000.0,5642903,F,2016-04-29T18:38:08Z,2016-04-29T00:00:00Z,62,JARDIM DA PENHA,0,1,0,0,0,0,No
1,558997800000000.0,5642503,M,2016-04-29T16:08:27Z,2016-04-29T00:00:00Z,56,JARDIM DA PENHA,0,0,0,0,0,0,No
2,4262962000000.0,5642549,F,2016-04-29T16:19:04Z,2016-04-29T00:00:00Z,62,MATA DA PRAIA,0,0,0,0,0,0,No
3,867951200000.0,5642828,F,2016-04-29T17:29:31Z,2016-04-29T00:00:00Z,8,PONTAL DE CAMBURI,0,0,0,0,0,0,No
4,8841186000000.0,5642494,F,2016-04-29T16:07:23Z,2016-04-29T00:00:00Z,56,JARDIM DA PENHA,0,1,1,0,0,0,No


#### Mention your target variable(s) in this markdown cell and specify the predictor variable(s) you'll use for the analysis

### Type of Variable Classified:
#### Predictor Variables: Gender, Scheduled Day, Appointment Day, Age, Neighbourhood, Scholarship, Hipertension, Alcoholism, Handcap, SMS received.

<br>

#### Target Variable (Dependent Variable): No Show

<br>
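The split between predictors and target described above can be sketched as follows. This is a minimal illustration on a small hand-made sample, not the notebook's actual pipeline: the `sample` frame, the names `X` and `y`, and the 0/1 encoding of `No-show` are assumptions made for the example.

```python
import pandas as pd

# Hypothetical sample that reuses a few of the dataset's column names.
sample = pd.DataFrame({
    "Gender": ["F", "M", "F"],
    "Age": [62, 56, 8],
    "SMS_received": [0, 0, 1],
    "No-show": ["No", "No", "Yes"],
})

# Target (dependent variable): 1 = patient missed the appointment, 0 = attended.
# The 0/1 encoding is an assumption for illustration.
y = sample["No-show"].map({"No": 0, "Yes": 1})

# Predictors (independent variables): every remaining column.
X = sample.drop(columns=["No-show"])

print(X.columns.tolist())
print(y.tolist())
```

The same two lines would apply to the full `dataset` once the predictor list has been settled.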

### Classify the features based on data type

In [None]:
# Check the data types of each column (features)
dataset.dtypes

PatientId         float64
AppointmentID       int64
Gender             object
ScheduledDay       object
AppointmentDay     object
Age                 int64
Neighbourhood      object
Scholarship         int64
Hipertension        int64
Diabetes            int64
Alcoholism          int64
Handcap             int64
SMS_received        int64
No-show            object
dtype: object

### Write the features categorized by data types here:
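### Data Types
* float64: PatientId
* int64: AppointmentID, Age, Scholarship, Hipertension, Diabetes, Alcoholism, Handcap, SMS_received
* object: Gender, ScheduledDay, AppointmentDay, Neighbourhood, No-show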

<br>
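One possible way to group column names by data type programmatically is sketched below. The `sample` frame and the `columns_by_dtype` name are hypothetical; only `DataFrame.dtypes` is taken from the notebook.

```python
import pandas as pd

# Hypothetical sample that reuses a few of the dataset's column names.
sample = pd.DataFrame({
    "PatientId": [29872500000000.0, 558997800000000.0],
    "AppointmentID": [5642903, 5642503],
    "Gender": ["F", "M"],
    "Age": [62, 56],
    "No-show": ["No", "No"],
})

# Build a mapping of dtype name -> list of columns with that dtype.
columns_by_dtype = {}
for column, dtype in sample.dtypes.items():
    columns_by_dtype.setdefault(str(dtype), []).append(column)

for dtype_name, columns in columns_by_dtype.items():
    print(dtype_name, columns)
```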

### Checking for missing data

In [None]:
dataset.isnull().sum()

PatientId         0
AppointmentID     0
Gender            0
ScheduledDay      0
AppointmentDay    0
Age               0
Neighbourhood     0
Scholarship       0
Hipertension      0
Diabetes          0
Alcoholism        0
Handcap           0
SMS_received      0
No-show           0
dtype: int64

Is there any missing data? _(Answer here)_
#### **No null data: every column has a null count of 0.**


#### Check the counts of some categorical variables and give your inference

In [None]:
dataset.describe()

Unnamed: 0,PatientId,AppointmentID,Age,Scholarship,Hipertension,Diabetes,Alcoholism,Handcap,SMS_received
count,110526.0,110526.0,110526.0,110526.0,110526.0,110526.0,110526.0,110526.0,110526.0
mean,147493400000000.0,5675304.0,37.089219,0.098266,0.197248,0.071865,0.0304,0.020276,0.321029
std,256094300000000.0,71295.44,23.110026,0.297676,0.397923,0.258266,0.171686,0.140943,0.466874
min,39217.84,5030230.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,4172536000000.0,5640285.0,18.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,31731840000000.0,5680572.0,37.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,94389630000000.0,5725523.0,55.0,0.0,0.0,0.0,0.0,0.0,1.0
max,999981600000000.0,5790484.0,115.0,1.0,1.0,1.0,1.0,1.0,1.0


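All columns in the summary have the same count (110,526). This is one fewer than the original 110,527 because the row with a negative age was dropped.

Handcap is treated here as a binary categorical variable that should be 0 or 1, but the raw data also contained the values 2, 3 and 4.

This was handled in the linked cell, where the values 2, 3 and 4 were set to 1.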

https://colab.research.google.com/drive/1PAvSNbeH9qhJVXxkbAvTfQji2cpWCKho#scrollTo=Klw1MaOBsLPv&line=1&uniqifier=1
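`describe()` only summarizes numeric columns, so per-category counts need a different call. A minimal sketch using `value_counts()` is shown here on a hypothetical sample. It also uses `clip(upper=1)` as a stand-in for the `.loc` assignment used earlier in the notebook; both set every Handcap value above 1 to 1.

```python
import pandas as pd

# Hypothetical sample; the Handcap values 2 and 4 mimic the out-of-range codes.
sample = pd.DataFrame({
    "Gender": ["F", "M", "F", "F", "M"],
    "Handcap": [0, 1, 2, 0, 4],
    "No-show": ["No", "No", "Yes", "No", "Yes"],
})

# Count how often each category occurs.
gender_counts = sample["Gender"].value_counts()
handcap_counts_before = sample["Handcap"].value_counts().sort_index()

# Cap Handcap at 1 and count again to confirm that only 0 and 1 remain.
capped = sample["Handcap"].clip(upper=1)
handcap_counts_after = capped.value_counts().sort_index()

print(gender_counts)
print(handcap_counts_before)
print(handcap_counts_after)
```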

<a id='cleaning'></a>
## Data Cleaning

#### Changes to make to the date format
1. The ScheduledDay and AppointmentDay columns hold date-time values.
2. Split each date-time value into date (day of month), weekday and month.
3. Day and month should be stored as plain integers.
4. The weekday should be coded from Monday: 0 to Sunday: 6.
5. As the dataset is from 2016, the year can be ignored.

Hint : This can be done using NumPy's datetime64 (https://docs.scipy.org/doc/numpy/reference/arrays.datetime.html)
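The conversion steps listed above could look roughly like this. The sketch uses pandas' `to_datetime` and the `.dt` accessor rather than raw NumPy `datetime64` calls, and the sample rows and the new column names (`*_date`, `*_month`, `*_weekday`) are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical sample in the same timestamp format as the dataset.
sample = pd.DataFrame({
    "ScheduledDay": ["2016-04-29T18:38:08Z", "2016-04-27T08:36:51Z"],
    "AppointmentDay": ["2016-04-29T00:00:00Z", "2016-05-03T00:00:00Z"],
})

for column in ["ScheduledDay", "AppointmentDay"]:
    parsed = pd.to_datetime(sample[column])
    sample[column + "_date"] = parsed.dt.day           # day of month as an integer
    sample[column + "_month"] = parsed.dt.month        # month as an integer
    sample[column + "_weekday"] = parsed.dt.dayofweek  # Monday = 0 ... Sunday = 6

print(sample.filter(like="_"))
```

The year is not extracted, in line with the note that the dataset is from 2016.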

* Some of the column names are misspelled (for example, 'Hipertension' and 'Handcap').
* Correcting them will make it easier for the users to follow.
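A rename along these lines could be done with `DataFrame.rename`. The corrected spellings in `corrected_names` are suggestions made for this sketch, not names defined by the dataset or the notebook.

```python
import pandas as pd

# Empty hypothetical frame that reuses a few of the dataset's column names.
sample = pd.DataFrame(
    columns=["PatientId", "Hipertension", "Handcap", "SMS_received", "No-show"]
)

# Assumed corrections; adjust the target names to taste.
corrected_names = {
    "Hipertension": "Hypertension",
    "Handcap": "Handicap",
    "No-show": "No_show",
}

# rename returns a new frame and leaves columns not in the mapping untouched.
renamed = sample.rename(columns=corrected_names)

print(renamed.columns.tolist())
```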

### Check the unique values for each column.
Certain values might be in the wrong format or might not make sense for the feature. Identify and correct any such values.

Add any number of cells you think are necessary in this section
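A simple way to scan each column for suspicious values is to list its unique values. The sketch below runs on a hypothetical sample in which the age of -1 and the Handcap value of 4 stand in for the kind of entries to look for.

```python
import pandas as pd

# Hypothetical sample containing one invalid age and one out-of-range Handcap code.
sample = pd.DataFrame({
    "Gender": ["F", "M", "F"],
    "Age": [62, -1, 8],
    "Handcap": [0, 4, 1],
})

# Collect the sorted unique values of every column for inspection.
unique_values = {
    column: sorted(sample[column].unique().tolist())
    for column in sample.columns
}

for column, values in unique_values.items():
    print(column, values)
```

For columns with many distinct values, such as Neighbourhood, printing `nunique()` first may be more practical than printing the full list.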

Write your findings **here**

### Feature Selection

Choose the features that are best suited for the analysis and drop those which are unnecessary.

Display your final dataset at the end
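Dropping unneeded columns and displaying the result might look like the following. Treating the identifier columns `PatientId` and `AppointmentID` as unnecessary is an assumption made for this sketch; the notebook does not state which features to drop.

```python
import pandas as pd

# Hypothetical sample that reuses a few of the dataset's column names.
sample = pd.DataFrame({
    "PatientId": [29872500000000.0, 558997800000000.0],
    "AppointmentID": [5642903, 5642503],
    "Gender": ["F", "M"],
    "Age": [62, 56],
    "No-show": ["No", "No"],
})

# Assumption: identifier columns carry no predictive information.
columns_to_drop = ["PatientId", "AppointmentID"]
final_sample = sample.drop(columns=columns_to_drop)

# Display the final dataset.
print(final_sample.head())
```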
