**Business Understanding**

- Dataset source: the MIMIC-III database
- Contains vital-sign and clinical measurements recorded while monitoring Intensive Care Unit (ICU) patients for 48 hours, together with an in-hospital mortality label

In [1]:
# Loading required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
# Eyeballing the data
df = pd.read_csv('ihm_48_hours.csv')
df.head()

Unnamed: 0,Capillary refill rate,Diastolic blood pressure,Fraction inspired oxygen,Glascow coma scale eye opening,Glascow coma scale motor response,Glascow coma scale total,Glascow coma scale verbal response,Glucose,Heart Rate,Height,Mean blood pressure,Oxygen saturation,Respiratory rate,Systolic blood pressure,Temperature,Weight,pH,Patient_id,target
0,,73.0,,Spontaneously,Obeys Commands,,Oriented,-11.396037,-19.976803,,76.0,94.0,17.0,116.0,36.388889,83.5,,30552,0
1,,73.0,,Spontaneously,Obeys Commands,,Oriented,115.0,96.0,,76.0,95.0,18.0,116.0,36.388889,83.5,,30552,0
2,,73.0,,Spontaneously,Obeys Commands,,Oriented,115.0,96.0,,76.0,-6.497052,18.0,116.0,36.388889,83.5,,30552,0
3,,73.0,,Spontaneously,Obeys Commands,,Oriented,115.0,96.0,,76.0,95.0,18.0,116.0,36.388889,83.5,,30552,0
4,,73.0,,Spontaneously,Obeys Commands,,Oriented,115.0,96.0,,76.0,95.0,18.0,116.0,36.388889,83.5,,30552,0


**Data Understanding**

| Column Name                         | Description                                                                 |
|-------------------------------------|-----------------------------------------------------------------------------|
| Capillary refill rate               | The time it takes for color to return to an external capillary bed (such as a fingertip) after pressure is applied. It is an indicator of peripheral perfusion. |
| Diastolic blood pressure            | The pressure in the arteries when the heart rests between beats. It's the lower of the two blood pressure measurements. |
| Fraction inspired oxygen            | The concentration of oxygen in the air mixture that is inhaled by the patient, often measured in ventilated patients. |
| Glascow coma scale eye opening      | A component of the Glasgow Coma Scale (GCS) that measures a patient's ability to open their eyes in response to stimuli. |
| Glascow coma scale motor response    | A component of the GCS that assesses a patient's motor response to stimuli, indicating brain function. |
| Glascow coma scale total            | The total score of the Glasgow Coma Scale, which assesses the level of consciousness in a person following a traumatic brain injury. |
| Glascow coma scale verbal response   | A component of the GCS that evaluates a patient’s verbal response, indicating their level of consciousness. |
| Glucose                             | The level of sugar (glucose) in the blood, an important indicator of metabolic status. |
| Heart Rate                          | The number of heartbeats per minute, indicating the functioning of the cardiovascular system. |
| Height                              | The measurement of a patient's stature, typically recorded in centimeters or meters. |
| Mean blood pressure                 | The average pressure in a patient's arteries during one cardiac cycle, indicating overall blood pressure. |
| Oxygen saturation                   | The percentage of hemoglobin binding sites in the bloodstream occupied by oxygen, indicating the oxygenation status of the patient. |
| Respiratory rate                    | The number of breaths taken per minute, indicating the patient’s respiratory status. |
| Systolic blood pressure             | The pressure in the arteries when the heart beats. It is the higher of the two blood pressure measurements. |
| Temperature                         | The body temperature of the patient, indicating metabolic and homeostatic status. |
| Weight                              | The body mass of the patient, typically recorded in kilograms. |
| pH                                  | The measure of acidity or alkalinity of the blood, indicating the balance of acids and bases in the body. |
| Patient_id                          | A unique identifier assigned to each patient. |
| target                              | The outcome variable of interest. For this project, it indicates in-hospital mortality. |

In [3]:
# Shape of the dataset
shape = df.shape
print('The dataset contains', shape[0], 'rows and', shape[1], 'columns.')

The dataset contains 300912 rows and 19 columns.


In [4]:
# checking the data types
df.dtypes

Capillary refill rate                 float64
Diastolic blood pressure              float64
Fraction inspired oxygen              float64
Glascow coma scale eye opening         object
Glascow coma scale motor response      object
Glascow coma scale total              float64
Glascow coma scale verbal response     object
Glucose                               float64
Heart Rate                            float64
Height                                float64
Mean blood pressure                   float64
Oxygen saturation                     float64
Respiratory rate                      float64
Systolic blood pressure               float64
Temperature                           float64
Weight                                float64
pH                                    float64
Patient_id                             object
target                                  int64
dtype: object

**Insights:**

- **Numerical variables**: Capillary refill rate, Diastolic blood pressure, Fraction inspired oxygen, Glascow coma scale total, Glucose, Heart Rate, Height, Mean blood pressure, Oxygen saturation, Respiratory rate, Systolic blood pressure, Temperature, Weight, pH, target. Note that target is stored as an integer but takes only the values 0 and 1.

- **Categorical variables**: Glascow coma scale eye opening, Glascow coma scale motor response, Glascow coma scale verbal response, Patient_id.
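
The split above can also be derived programmatically rather than read off the `dtypes` output. The following is a minimal sketch on a small toy frame (the values are hypothetical; only the column names follow the dataset), using `select_dtypes` to separate numeric columns from everything else:

```python
import pandas as pd

# Toy frame standing in for `df` (hypothetical values, real column names)
toy = pd.DataFrame({
    "Heart Rate": [96.0, 84.0, 70.0],
    "Glascow coma scale eye opening": ["Spontaneously", "To Speech", None],
    "Patient_id": ["30552", "30552", "30553"],
    "target": [0, 0, 1],
})

# Numeric dtypes (int, float) vs. all remaining columns (text labels, IDs)
numerical_cols = toy.select_dtypes(include="number").columns.tolist()
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

print("Numerical:", numerical_cols)
print("Categorical:", categorical_cols)
```

On the real data the same two calls would be applied to `df`. A dtype-based split still places `target` among the numerical columns, so binary labels need a separate check such as `nunique()`.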




In [5]:
# Descriptive Statistics -> Categorical
df.describe(include='object')


Unnamed: 0,Glascow coma scale eye opening,Glascow coma scale motor response,Glascow coma scale verbal response,Patient_id
count,274190,296978,296884,300912
unique,7,12,12,6269
top,4 Spontaneously,6 Obeys Commands,1.0 ET/Trach,30552
freq,94516,115595,87646,48


**Insights:**

- None of the categorical variables is binary, since none has exactly 2 unique values.
- The three GCS components are stored as text labels but are ordinal in clinical terms (each label corresponds to a score), whereas Patient_id is a nominal identifier with no order. They include:

- **1.Glascow coma scale eye opening**
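    - This variable has 7 unique categories, more than the 4 levels of the GCS eye-opening score. This suggests that the same level is recorded under several labels (for example, "Spontaneously" in the preview above and "4 Spontaneously" here).
    - The most frequent category is 4 Spontaneously, with 94516 observations.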

- **2.Glascow coma scale motor response**
    - This variable has 12 unique categories.
    - The category that has the highest number of observations is 6 Obeys Commands which has 115595 observations.

- **3.Glascow coma scale verbal response**
    - This variable has 12 unique categories.
    - The category that has the highest number of observations is 1.0 ET/Trach which has 87646 observations.

- **4.Patient_id**
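    - This variable has 6269 unique values, which suggests that 6269 patients are included in the dataset.
    - The count of this variable is 300912, so each patient has multiple records. Since 300912 / 6269 = 48 and the most frequent ID appears 48 times, every patient has exactly 48 rows, consistent with one record per hour over the 48-hour window.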
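
Two of the observations above can be checked directly: how many rows each patient contributes, and which labels a GCS column actually contains. The sketch below uses a toy frame with hypothetical IDs and labels (the label spellings are assumptions, apart from "Spontaneously" and "4 Spontaneously", which appear in the outputs above):

```python
import pandas as pd

# Toy data (hypothetical): two patients with three rows each
toy = pd.DataFrame({
    "Patient_id": ["a", "a", "a", "b", "b", "b"],
    "Glascow coma scale eye opening": [
        "Spontaneously", "4 Spontaneously", "To Speech",
        "3 To speech", None, "4 Spontaneously",
    ],
})

# Rows per patient: one distinct size means every patient has the same number of records
rows_per_patient = toy.groupby("Patient_id").size()
print(rows_per_patient.value_counts())

# Distinct labels: variants of the same level show up as separate categories
label_counts = toy["Glascow coma scale eye opening"].value_counts(dropna=False)
print(label_counts)
```

Applied to `df`, the first check would show whether all 6269 patients have 48 rows, and the second would list the 7 eye-opening labels so that duplicates of the same level can be identified.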


In [6]:
# Descriptive statistics -> Numerical variables
df.describe()

Unnamed: 0,Capillary refill rate,Diastolic blood pressure,Fraction inspired oxygen,Glascow coma scale total,Glucose,Heart Rate,Height,Mean blood pressure,Oxygen saturation,Respiratory rate,Systolic blood pressure,Temperature,Weight,pH,target
count,6336.0,296944.0,88464.0,184416.0,300698.0,300912.0,55824.0,296984.0,300912.0,300864.0,300912.0,298848.0,221040.0,230614.0,300912.0
mean,0.219223,62.541099,0.599884,10.818123,130.628329,79.447793,168.543422,78.79108,95.34346,18.731265,119.694213,36.832834,82.969018,5.573617,0.142128
std,0.413753,341.559624,0.253919,4.334923,84.171126,32.14592,15.137414,29.52986,2529.203751,6.884248,23.396042,1.000075,26.765857,5.963634,0.349182
min,0.0,0.0,0.0,3.0,-19.999974,-19.999623,0.0,-34.0,-19.999687,0.0,0.0,0.0,0.0,-19.999706,0.0
25%,0.0,51.0,0.4,8.0,101.0,70.0,160.0,68.0,95.0,15.0,103.0,36.277802,66.6,7.31,0.0
50%,0.0,59.0,0.5,11.0,126.0,84.0,170.0,77.0,98.0,18.0,117.0,36.833333,79.099998,7.37,0.0
75%,0.0,69.0,0.7,15.0,158.0,97.0,178.0,88.0,100.0,22.0,134.0,37.388889,94.699997,7.42,0.0
max,1.0,100105.01,7.1,15.0,9999.0,941.0,203.0,9381.0,981023.0,1211.0,295.0,73.760002,931.224376,99.0,1.0


**Insights:**

- **Capillary refill rate:**
    - Has 6336 data points.
    - Takes values from 0 to 1, which suggests it is a binary variable.
- **Diastolic blood pressure:**
    - Has 296944 data points recorded.
    - Has a high standard deviation relative to its mean (mean = 62.54, standard deviation = 341.56).
    - Ranges from 0 to 100105.01. Since the quartiles lie between 51 and 69, the maximum points to extreme outliers.

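- **Fraction inspired oxygen:**
    - The mean is 0.60, with a standard deviation of 0.25.
    - Ranges from 0 to 7.1. A fraction cannot exceed 1, so values above 1 are likely entry errors or values recorded on a different scale.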
- **Glascow coma scale total:**
    - The mean is 10.82, with a standard deviation of 4.33.
    - Ranges from 3 to 15, with a median of 11 and an upper quartile of 15, so values cluster towards the higher end of the range.

- **Glucose:**
    - The mean glucose level is 130.63, with a standard deviation of 84.17, which is quite large.
    - Ranges from about -20 to 9999. Negative glucose values are not physically possible, so the minimum is likely an error.

- **Heart Rate:**
    - Has 300912 records.
    - The mean is about 79 beats per minute (bpm).
    - Has a standard deviation of 32.15 (moderate variability).
    - The minimum value is -19.999623 and the maximum value is 941.
    - Based on the quartiles, the middle 50% of values fall between 70 and 97 bpm, with a median of 84 bpm.
    - The negative minimum and the very high maximum suggest errors in the data, possibly introduced during data entry.

- **Height:**
    - Has 55824 records.
    - The mean height is 168.54.
    - Has a standard deviation of 15.14, which indicates a modest spread around the mean.
    - The middle 50% of heights fall between 160 and 178, with a median of 170.
    - The tallest individual had a height of 203 and the shortest a height of 0 (a probable error).

- **Mean blood pressure:**
    - Has 296984 records.
    - The mean is 78.79 mmHg.
    - Has a standard deviation of 29.53, which indicates noticeable spread around the mean.
    - The middle 50% of values fall between 68 and 88, with a median of 77.
    - The minimum value is -34 and the maximum is 9381, both of which are likely data errors.

- **Oxygen saturation:**
    - Has 300912 records.
    - The mean is 95.34.
    - Has a standard deviation of 2529.20, which is very high and indicates extreme outliers in the data.
    - The middle 50% of values fall between 95 and 100, with a median of 98.
    - The minimum recorded value is -19.999687 and the maximum is 981023. Both are likely errors, since saturation is a percentage.

- **Respiratory rate:**
    - Has 300864 records.
    - The mean is 18.73.
    - Has a standard deviation of 6.88 (moderate variability), so most of the data appears reasonable.
    - The lowest value is 0 and the largest is 1211 (likely erroneous, since the value is implausibly large).
    - The middle 50% of values fall between 15 and 22, with a median of 18.

- **Systolic blood pressure:**
    - Has 300912 records.
    - The mean is 119.69.
    - Has a standard deviation of 23.40 (moderate variability).
    - The lowest value is 0 and the largest is 295.
    - The middle 50% of values fall between 103 and 134, with a median of 117.

- **Temperature:**
    - Has 298848 records.
    - The mean is 36.83°C.
    - Has a standard deviation of 1.00, which is small, so most of the data appears reasonable.
    - The lowest value is 0°C and the largest is 73.76°C, both implausible for body temperature.
    - The middle 50% of values fall between 36.28°C and 37.39°C, with a median of 36.83°C.

- **Weight:**
    - Has 221040 records.
    - The mean is 82.97.
    - Has a standard deviation of 26.77, which indicates considerable spread.
    - The lowest weight is 0 and the largest is 931.22, both likely errors.
    - The middle 50% of weights fall between 66.60 and 94.70, with a median of 79.10.

- **pH:**
    - Has 230614 records.
    - The mean is 5.57, well below the median of 7.37, which indicates that extreme low values pull the mean down.
    - Has a standard deviation of 5.96, which is large compared with the interquartile range of 0.11.
    - The lowest pH is -19.999706 and the largest is 99.
    - The middle 50% of pH values fall between 7.31 and 7.42, with a median of 7.37. The minimum and maximum values indicate outliers, so data quality appears compromised for this variable.
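
The bullets above judge outliers by comparing the minimum and maximum against the quartiles. One way to quantify this is to count values outside the usual interquartile-range fences. The sketch below is an illustration on hypothetical toy values, and the helper name `iqr_outlier_share` and the multiplier `k = 1.5` are assumptions, not part of the original analysis:

```python
import pandas as pd

# Toy values (hypothetical) mimicking a vital sign with two implausible entries
toy = pd.DataFrame({"Heart Rate": [70.0, 84.0, 97.0, 80.0, 90.0, -19.99, 941.0]})

def iqr_outlier_share(series: pd.Series, k: float = 1.5) -> float:
    """Percentage of non-missing values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    values = series.dropna()
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    outside = (values < q1 - k * iqr) | (values > q3 + k * iqr)
    return outside.mean() * 100

share = iqr_outlier_share(toy["Heart Rate"])
print(f"Heart Rate: {share:.1f}% of values outside the IQR fences")
```

On the real data the helper could be applied to each numerical column of `df`. The IQR rule is a statistical heuristic, so values it flags still need to be checked against clinically plausible ranges before they are treated as errors.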



*Generally:*
- *Based on the above, data quality is a concern: some features contain outliers and others contain missing values.*



In [9]:
# checking the percentage of missing values
missing_percentage = df.isnull().mean() * 100
print(missing_percentage)


Capillary refill rate                 97.894401
Diastolic blood pressure               1.318658
Fraction inspired oxygen              70.601372
Glascow coma scale eye opening         8.880337
Glascow coma scale motor response      1.307359
Glascow coma scale total              38.714309
Glascow coma scale verbal response     1.338597
Glucose                                0.071117
Heart Rate                             0.000000
Height                                81.448397
Mean blood pressure                    1.305365
Oxygen saturation                      0.000000
Respiratory rate                       0.015952
Systolic blood pressure                0.000000
Temperature                            0.685915
Weight                                26.543308
pH                                    23.361647
Patient_id                             0.000000
target                                 0.000000
dtype: float64


**Insights:**
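- The following features have more than 50% missing values: Capillary refill rate (97.89%), Fraction inspired oxygen (70.60%), and Height (81.45%).
- The remaining features with missing values all fall below 50%: Glascow coma scale total (38.71%), Weight (26.54%), pH (23.36%), Glascow coma scale eye opening (8.88%), Glascow coma scale verbal response (1.34%), Diastolic blood pressure (1.32%), Glascow coma scale motor response (1.31%), Mean blood pressure (1.31%), Temperature (0.69%), Glucose (0.07%), and Respiratory rate (0.02%).
- Heart Rate, Oxygen saturation, Systolic blood pressure, Patient_id, and target have no missing values.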
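
The grouping by missing-value share can be computed instead of read off the printed output. The following is a minimal sketch on a toy frame with hypothetical values. The 50% cutoff mirrors the split used in the insights and is an assumption about what counts as too sparse:

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with one mostly empty column
toy = pd.DataFrame({
    "Capillary refill rate": [np.nan, np.nan, np.nan, 0.0],
    "Glucose": [115.0, np.nan, 126.0, 101.0],
    "Heart Rate": [96.0, 84.0, 70.0, 97.0],
})

THRESHOLD = 50  # percent; assumed cutoff matching the insights above

# Share of missing values per column, then split at the threshold
missing_pct = toy.isnull().mean() * 100
sparse_cols = missing_pct[missing_pct > THRESHOLD].index.tolist()
usable_cols = missing_pct[missing_pct <= THRESHOLD].index.tolist()

print(missing_pct.sort_values(ascending=False))
print("More than 50% missing:", sparse_cols)
```

With `df` in place of `toy`, `sparse_cols` would be expected to contain Capillary refill rate, Fraction inspired oxygen, and Height, based on the percentages printed above.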