## Phase 1: Data Understanding and Scope Definition

### Task 2

In [13]:
# Import required libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [33]:
# Load the dataset
df = pd.read_csv("diabetic_data.csv")
df.shape

(101766, 50)

In [34]:
df.sample(5)

| | encounter_id | patient_nbr | race | gender | age | weight | admission_type_id | discharge_disposition_id | admission_source_id | time_in_hospital | ... | citoglipton | insulin | glyburide-metformin | glipizide-metformin | glimepiride-pioglitazone | metformin-rosiglitazone | metformin-pioglitazone | change | diabetesMed | readmitted |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 7103 | 34158750 | 40764573 | Caucasian | Female | [40-50) | ? | 1 | 1 | 7 | 3 | ... | No | Up | No | No | No | No | No | Ch | Yes | <30 |
| 78514 | 239940882 | 55680444 | Caucasian | Female | [80-90) | ? | 2 | 3 | 1 | 5 | ... | No | Up | No | No | No | No | No | Ch | Yes | >30 |
| 37599 | 116662986 | 23508954 | AfricanAmerican | Male | [50-60) | ? | 1 | 1 | 7 | 1 | ... | No | Steady | No | No | No | No | No | No | Yes | NO |
| 18291 | 66360738 | 28445220 | Caucasian | Female | [60-70) | ? | 1 | 1 | 17 | 3 | ... | No | Down | No | No | No | No | No | Ch | Yes | NO |
| 5283 | 27811320 | 16652376 | Caucasian | Female | [50-60) | ? | 1 | 1 | 7 | 3 | ... | No | Steady | No | No | No | No | No | No | Yes | NO |


In [None]:
# Overview of the dataset
df.info()

In [4]:
# List of column names (no hard-coded column count needed)
cols = df.columns.tolist()

print(cols)

['encounter_id', 'patient_nbr', 'race', 'gender', 'age', 'weight', 'admission_type_id', 'discharge_disposition_id', 'admission_source_id', 'time_in_hospital', 'payer_code', 'medical_specialty', 'num_lab_procedures', 'num_procedures', 'num_medications', 'number_outpatient', 'number_emergency', 'number_inpatient', 'diag_1', 'diag_2', 'diag_3', 'number_diagnoses', 'max_glu_serum', 'A1Cresult', 'metformin', 'repaglinide', 'nateglinide', 'chlorpropamide', 'glimepiride', 'acetohexamide', 'glipizide', 'glyburide', 'tolbutamide', 'pioglitazone', 'rosiglitazone', 'acarbose', 'miglitol', 'troglitazone', 'tolazamide', 'examide', 'citoglipton', 'insulin', 'glyburide-metformin', 'glipizide-metformin', 'glimepiride-pioglitazone', 'metformin-rosiglitazone', 'metformin-pioglitazone', 'change', 'diabetesMed', 'readmitted']


In [5]:
# Number of unique values in each column, in column order
nUnique = df.nunique().tolist()

print(nUnique)

[101766, 71518, 6, 3, 10, 10, 8, 26, 17, 14, 18, 73, 118, 7, 75, 39, 33, 21, 717, 749, 790, 16, 3, 3, 4, 4, 4, 4, 4, 2, 4, 4, 2, 4, 4, 4, 4, 2, 3, 1, 1, 4, 4, 2, 2, 2, 2, 2, 2, 3]


In [None]:
# df.groupby('patient_nbr')['A1Cresult'].value_counts()
df.groupby('A1Cresult')['patient_nbr'].value_counts()

In [None]:
# Frequency of each A1Cresult category, then the same counts as a bar chart
print(df['A1Cresult'].value_counts().sort_index())
sns.countplot(data=df, x='A1Cresult')

### Task 3

In [35]:
df['readmitted'].value_counts(normalize=True)

readmitted
NO     0.539119
>30    0.349282
<30    0.111599
Name: proportion, dtype: float64

##### Why is this a class imbalance problem?
###### The target has three classes. One of them (NO) appears in about 54% of the dataset, while the other two together cover only about 46%. Of those two, the class we actually want the model to predict, readmission within 30 days ("<30"), covers only about 11% of the dataset. Taking "<30" as the positive class and merging NO and ">30" into the negative class gives a ratio of roughly 11:89 (positive class : negative class), which makes the target variable imbalanced.
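
The split described above can be checked with a short sketch. It reuses the proportions printed by `value_counts(normalize=True)`; the helper `to_binary_target` and the idea of merging `NO` and `>30` into one negative class are assumptions made for illustration, not code from this notebook.

```python
import pandas as pd

# Proportions copied from the value_counts(normalize=True) output above.
proportions = pd.Series({"NO": 0.539119, ">30": 0.349282, "<30": 0.111599})

# Assumption: "<30" is the positive class; "NO" and ">30" form the negative class.
positive_share = proportions["<30"]
negative_share = proportions.drop("<30").sum()

print(f"positive: {positive_share:.1%}, negative: {negative_share:.1%}")
print(f"negatives per positive: {negative_share / positive_share:.1f}")


def to_binary_target(readmitted: pd.Series) -> pd.Series:
    """Hypothetical helper: 1 for readmission within 30 days, 0 otherwise."""
    return (readmitted == "<30").astype(int)


# Tiny made-up example of the helper in use.
example = to_binary_target(pd.Series(["NO", "<30", ">30"]))
```

On the real dataframe the same helper could be applied as `to_binary_target(df["readmitted"])`.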

### Task 4

##### What are encounter_id and patient_nbr, and why must they never be used as model inputs?
###### They are just identifiers for patients and their encounters, so we should not treat them as features. encounter_id is unique for every row, while patient_nbr repeats when the same patient has several encounters (71,518 unique values across 101,766 rows). They are useful for telling rows and patients apart, but the numbers themselves carry no mathematical meaning for the model. This is why they must never be used as model inputs.
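
A minimal sketch of keeping the identifiers out of the feature set is shown below. It runs on a small made-up frame; `ID_COLUMNS` and `split_identifiers` are hypothetical names, and only the two column names come from the dataset.

```python
import pandas as pd

# The two identifier columns named in the question.
ID_COLUMNS = ["encounter_id", "patient_nbr"]


def split_identifiers(frame: pd.DataFrame):
    """Return (ids, features): identifiers are kept aside, not fed to a model."""
    ids = frame[ID_COLUMNS].copy()
    features = frame.drop(columns=ID_COLUMNS)
    return ids, features


# Made-up rows for illustration only; patient 10 appears twice.
toy = pd.DataFrame({
    "encounter_id": [1, 2, 3],
    "patient_nbr": [10, 10, 20],
    "time_in_hospital": [3, 5, 1],
})

ids, features = split_identifiers(toy)

# Rows that belong to a patient with more than one encounter.
repeat_rows = int(ids["patient_nbr"].duplicated(keep=False).sum())
print(features.columns.tolist(), repeat_rows)
```

Keeping `ids` around instead of discarding it leaves the option of grouping rows by patient later, for example so that one patient's encounters do not end up in both the training and the test split.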

##### Looking at discharge_disposition_id and the medication change columns, would knowing these in advance give unfair future information?
###### As far as I understand, I don't think these columns give unfair future information; they provide essential information for the model to train on. discharge_disposition_id describes the patient's condition at discharge by recording where the patient is sent after leaving the hospital, which is directly related to the probability of readmission within 30 days. I am currently unsure about the medication columns: I can't tell whether they record dose changes made during the hospital stay or after discharge from the hospital.
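
One way to look at this question more closely is to compare readmission outcomes across discharge codes. The sketch below uses made-up rows, so the ids and rates are illustrative only; a code whose rows fall almost entirely into one outcome would be a reason to check when and how that value gets recorded.

```python
import pandas as pd

# Toy rows for illustration only; these ids and labels are not taken from the dataset.
toy = pd.DataFrame({
    "discharge_disposition_id": [1, 1, 1, 3, 3, 3],
    "readmitted": ["NO", "NO", "<30", "<30", ">30", "<30"],
})

# Row-normalised crosstab: each row shows how one discharge code splits across outcomes.
rates = pd.crosstab(
    toy["discharge_disposition_id"],
    toy["readmitted"],
    normalize="index",
)
print(rates)
```

The same call on the real dataframe would use `df["discharge_disposition_id"]` and `df["readmitted"]` in place of the toy columns.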


### Task 5

In [37]:
(df == '?').sum().sort_values(ascending=False)

weight                      98569
medical_specialty           49949
payer_code                  40256
race                         2273
diag_3                       1423
diag_2                        358
diag_1                         21
admission_type_id               0
patient_nbr                     0
encounter_id                    0
time_in_hospital                0
admission_source_id             0
num_lab_procedures              0
num_procedures                  0
num_medications                 0
discharge_disposition_id        0
gender                          0
age                             0
number_inpatient                0
number_emergency                0
number_outpatient               0
number_diagnoses                0
max_glu_serum                   0
A1Cresult                       0
metformin                       0
repaglinide                     0
nateglinide                     0
chlorpropamide                  0
glimepiride                     0
acetohexamide 

##### Which columns are heavily missing?
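###### The heavily missing columns are weight (about 97% of rows), medical_specialty (about 49%) and payer_code (about 40%). race (about 2%) and diag_3 (about 1%) also have missing values, but far fewer.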
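
The counts above are easier to compare as percentages. This sketch divides the '?' counts from the output by the row count from `df.shape`; the names `N_ROWS`, `missing_counts` and `missing_pct` are chosen here for illustration.

```python
import pandas as pd

N_ROWS = 101766  # number of rows, from df.shape

# '?' counts copied from the (df == '?').sum() output above.
missing_counts = pd.Series({
    "weight": 98569,
    "medical_specialty": 49949,
    "payer_code": 40256,
    "race": 2273,
    "diag_3": 1423,
    "diag_2": 358,
    "diag_1": 21,
})

# Share of rows with a '?' in each column, as a percentage.
missing_pct = (missing_counts / N_ROWS * 100).round(1)
print(missing_pct)
```

On the dataframe itself, `(df == '?').mean() * 100` should give the same percentages for every column in one step.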
##### Why might these be missing in real hospitals?
###### Since the dataset dates from around 2008, there was probably no requirement at the time to record a patient's weight, which would explain why most of those values are missing. Nowadays I don't think weight, medical_specialty and diag_3 would be missing as often, because it is important to know which specialty treated the patient and to record even minor diagnoses.
##### Should missingness itself carry information?
###### Yes, I think it does carry information about the dataset. For example, if the share of missing values in a column is very high, we can simply drop that column, because there is no reliable way to predict the most probable value each row should have had.
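
The idea of dropping very sparse columns can be sketched as follows, together with one optional way of keeping the missingness itself as a 0/1 feature for the columns that remain. Everything here is an assumption for illustration: the function names, the 0.9 threshold, and the made-up rows are not part of the notebook.

```python
import pandas as pd


def drop_sparse_columns(frame: pd.DataFrame, threshold: float = 0.9,
                        marker: str = "?") -> pd.DataFrame:
    """Drop columns whose share of `marker` values exceeds `threshold` (hypothetical cutoff)."""
    share = (frame == marker).mean()
    return frame.loc[:, share <= threshold]


def add_missing_flags(frame: pd.DataFrame, columns, marker: str = "?") -> pd.DataFrame:
    """Add a 0/1 `<column>_missing` indicator for each listed column."""
    out = frame.copy()
    for col in columns:
        out[f"{col}_missing"] = (out[col] == marker).astype(int)
    return out


# Made-up rows for illustration only.
toy = pd.DataFrame({
    "weight": ["?", "?", "?", "?"],
    "race": ["Caucasian", "?", "AfricanAmerican", "Caucasian"],
    "age": ["[40-50)", "[50-60)", "[60-70)", "[80-90)"],
})

kept = drop_sparse_columns(toy)               # weight is entirely '?', so it is dropped
flagged = add_missing_flags(kept, ["race"])   # race keeps a missingness indicator
print(flagged)
```

Whether an indicator like `race_missing` actually helps would have to be checked against the model; the sketch only shows that the information can be preserved rather than discarded.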