# <u><center>**Business Understanding**</center></u>

## Project Title: **Predictive Analytics for Sepsis: A Machine Learning Approach with API Deployment**

### Problem Statement: 
Sepsis is a serious condition in which the body responds improperly to an infection. The infection-fighting processes turn on the body, causing the organs to work poorly.
Sepsis may progress to septic shock. This is a dramatic drop in blood pressure that can damage the lungs, kidneys, liver and other organs. When the damage is severe, it can lead to death.
Early treatment of sepsis improves chances for survival. 

### Project Objective:
The primary objective of this project is to develop a machine learning model that can predict the likelihood of a patient in the Intensive Care Unit (ICU) developing sepsis. By accurately predicting sepsis onset, healthcare providers can take timely and appropriate actions to prevent the condition, thereby improving patient outcomes, reducing mortality rates, and decreasing the length of ICU stays.

### Project Success Criterion:
The project is considered successful if the pipelines and models make accurate predictions and the API works correctly, accepting multiple inputs and returning a prediction for each one. This is essential for identifying patients who are likely to develop sepsis so that the necessary precautions can be taken.
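The "multiple inputs, one prediction each" requirement can be illustrated without committing to a web framework. The following is a minimal sketch, assuming the fitted model exposes a scikit-learn-style `predict` method. `ThresholdModel`, `predict_batch`, and the ordering of `FEATURES` are hypothetical names chosen for illustration, and the threshold rule is a placeholder, not a clinical rule or the project's model.

```python
import pandas as pd

# Feature columns taken from the dataset description; the order is an assumption.
FEATURES = ["PRG", "PL", "PR", "SK", "TS", "M11", "BD2", "Age", "Insurance"]


class ThresholdModel:
    """Hypothetical stand-in for a trained pipeline (not the project's model)."""

    def predict(self, X: pd.DataFrame) -> list:
        # Placeholder rule for illustration only.
        return ["Positive" if pl >= 140 else "Negative" for pl in X["PL"]]


def predict_batch(model, records: list) -> list:
    """Return one prediction per input record, in the same order."""
    frame = pd.DataFrame(records)
    missing = [col for col in FEATURES if col not in frame.columns]
    if missing:
        raise ValueError(f"Missing features: {missing}")
    return list(model.predict(frame[FEATURES]))


# Two records copied from the dataset preview.
records = [
    {"PRG": 6, "PL": 148, "PR": 72, "SK": 35, "TS": 0,
     "M11": 33.6, "BD2": 0.627, "Age": 50, "Insurance": 0},
    {"PRG": 1, "PL": 85, "PR": 66, "SK": 29, "TS": 0,
     "M11": 26.6, "BD2": 0.351, "Age": 31, "Insurance": 0},
]
predictions = predict_batch(ThresholdModel(), records)
print(predictions)
```

An API endpoint would typically wrap a function like `predict_batch`, so the batch behaviour can be tested independently of the web layer.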

### Technologies and Tools:
Choose appropriate machine learning frameworks (e.g., TensorFlow, scikit-learn) and data processing tools (e.g., pandas, SQL) for model development. Decide on visualization libraries (e.g., Matplotlib, Seaborn) for result interpretation.

### Risks and Contingencies:
Identify potential risks such as data quality issues, model overfitting, or regulatory compliance. Develop contingency plans to address these risks and mitigate their impact on project timelines and outcomes.




### **Hypothesis**
##### ***Null Hypothesis (H0):***
Sepsis is most common in older patients, i.e., those aged 40 years and above.
##### ***Alternative Hypothesis (H1):***
Sepsis is least common in older patients, i.e., those aged 40 years and above.
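
One way to examine this hypothesis is to compare the share of positive cases on each side of the 40-year cutoff. The sketch below uses only the nine rows visible in the dataset preview, which is far too few for any conclusion. `AgeGroup` is a hypothetical helper column introduced for illustration.

```python
import pandas as pd

# The nine rows visible in the dataset preview (Age and Sepssis only).
preview = pd.DataFrame({
    "Age": [50, 31, 32, 21, 33, 34, 22, 46, 21],
    "Sepssis": ["Positive", "Negative", "Positive", "Negative", "Positive",
                "Negative", "Positive", "Negative", "Negative"],
})

# Split patients at the 40-year cutoff named in the hypothesis.
preview["AgeGroup"] = preview["Age"].map(
    lambda age: "40 and above" if age >= 40 else "below 40"
)

# Row-normalised cross-tabulation: share of each outcome within an age group.
rates = pd.crosstab(preview["AgeGroup"], preview["Sepssis"], normalize="index")
print(rates)
```

A difference in proportions alone does not establish significance. On the full dataset, a chi-square test of independence on the un-normalised table (for example with `scipy.stats.chi2_contingency`) would be a common follow-up.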

### **Analytical Questions**
1. Between insured and uninsured patients, which group was more affected by sepsis?
2. What is the distribution of body mass index (M11) across different age groups?
3. How does the average blood pressure (PR) compare between patients with positive and negative sepsis statuses?
4. Is there a significant correlation between plasma glucose levels (PRG) and the development of sepsis (Sepssis)?
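
Questions 3 and 4 map onto a group-wise mean and a correlation with a binary-encoded target. The sketch below uses the first five rows of the dataset preview purely to show the mechanics. The 0/1 encoding of `Sepssis` is an assumption made here for illustration.

```python
import pandas as pd

# First five rows of the dataset preview (PRG, PR and Sepssis only).
preview = pd.DataFrame({
    "PRG": [6, 1, 8, 1, 0],
    "PR": [72, 66, 64, 66, 40],
    "Sepssis": ["Positive", "Negative", "Positive", "Negative", "Positive"],
})

# Question 3: mean blood pressure per sepsis status.
mean_pr = preview.groupby("Sepssis")["PR"].mean()
print(mean_pr)

# Question 4: encode the target as 0/1, then compute the Pearson correlation,
# which for a binary variable equals the point-biserial correlation.
target = (preview["Sepssis"] == "Positive").astype(int)
prg_corr = preview["PRG"].corr(target)
print(prg_corr)
```

`Series.corr` returns only the coefficient. Judging whether the correlation is significant needs a p-value as well, which `scipy.stats.pointbiserialr` can provide on the full dataset.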

# <u><center>**Data Understanding**</center></u>

In [3]:
## importing necessary libraries
## libraries for data manipulation
import pandas as pd
import numpy as np

## packages for data visualization
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px

print("All modules imported")

All modules imported


In [5]:
## loading the dataset
df = pd.read_csv('./Datasets/Paitients_Files_Train.csv')

In [8]:
## getting a preview of the dataset
df

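|     | ID        | PRG | PL  | PR  | SK  | TS  | M11  | BD2   | Age | Insurance | Sepssis  |
|-----|-----------|-----|-----|-----|-----|-----|------|-------|-----|-----------|----------|
| 0   | ICU200010 | 6   | 148 | 72  | 35  | 0   | 33.6 | 0.627 | 50  | 0         | Positive |
| 1   | ICU200011 | 1   | 85  | 66  | 29  | 0   | 26.6 | 0.351 | 31  | 0         | Negative |
| 2   | ICU200012 | 8   | 183 | 64  | 0   | 0   | 23.3 | 0.672 | 32  | 1         | Positive |
| 3   | ICU200013 | 1   | 89  | 66  | 23  | 94  | 28.1 | 0.167 | 21  | 1         | Negative |
| 4   | ICU200014 | 0   | 137 | 40  | 35  | 168 | 43.1 | 2.288 | 33  | 1         | Positive |
| ... | ...       | ... | ... | ... | ... | ... | ...  | ...   | ... | ...       | ...      |
| 594 | ICU200604 | 6   | 123 | 72  | 45  | 230 | 33.6 | 0.733 | 34  | 0         | Negative |
| 595 | ICU200605 | 0   | 188 | 82  | 14  | 185 | 32.0 | 0.682 | 22  | 1         | Positive |
| 596 | ICU200606 | 0   | 67  | 76  | 0   | 0   | 45.3 | 0.194 | 46  | 1         | Negative |
| 597 | ICU200607 | 1   | 89  | 24  | 19  | 25  | 27.8 | 0.559 | 21  | 0         | Negative |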


## Feature Understanding

| Column Name | Attribute/Target | Description                                                              |
|-------------|------------------|--------------------------------------------------------------------------|
| ID          | N/A              | Unique number representing the patient ID                                |
| PRG         | Attribute 1      | Plasma glucose                                                           |
| PL          | Attribute 2      | Blood Work Result-1 (mu U/ml)                                            |
| PR          | Attribute 3      | Blood Pressure (mm Hg)                                                   |
| SK          | Attribute 4      | Blood Work Result-2 (mm)                                                 |
| TS          | Attribute 5      | Blood Work Result-3 (mu U/ml)                                            |
| M11         | Attribute 6      | Body mass index (weight in kg/(height in m)^2)                           |
| BD2         | Attribute 7      | Blood Work Result-4 (mu U/ml)                                            |
| Age         | Attribute 8      | Patient's age (years)                                                    |
| Insurance   | N/A              | Whether the patient holds a valid insurance card                         |
| Sepssis     | Target           | Positive if a patient in the ICU will develop sepsis; Negative otherwise |
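
Because the column codes are terse, it can help to map them to descriptive names while exploring the data. The sketch below shows one possible mapping. Every name in `READABLE_NAMES` is a hypothetical choice made for illustration and is not part of the dataset or the project.

```python
import pandas as pd

# Hypothetical descriptive names; the keys come from the feature table above.
READABLE_NAMES = {
    "PRG": "plasma_glucose",
    "PL": "blood_work_1",
    "PR": "blood_pressure",
    "SK": "blood_work_2",
    "TS": "blood_work_3",
    "M11": "bmi",
    "BD2": "blood_work_4",
    "Age": "age",
    "Insurance": "insurance",
    "Sepssis": "sepsis",
}

# First row of the dataset preview, used as a stand-in for the full frame.
sample = pd.DataFrame([{
    "ID": "ICU200010", "PRG": 6, "PL": 148, "PR": 72, "SK": 35, "TS": 0,
    "M11": 33.6, "BD2": 0.627, "Age": 50, "Insurance": 0, "Sepssis": "Positive",
}])

# rename returns a new frame; columns not in the mapping (ID) are left as-is.
renamed = sample.rename(columns=READABLE_NAMES)
print(list(renamed.columns))
```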

### **Exploratory Data Analysis (EDA)**

In [15]:
### getting information about columns
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 599 entries, 0 to 598
Data columns (total 11 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   ID         599 non-null    object 
 1   PRG        599 non-null    int64  
 2   PL         599 non-null    int64  
 3   PR         599 non-null    int64  
 4   SK         599 non-null    int64  
 5   TS         599 non-null    int64  
 6   M11        599 non-null    float64
 7   BD2        599 non-null    float64
 8   Age        599 non-null    int64  
 9   Insurance  599 non-null    int64  
 10  Sepssis    599 non-null    object 
dtypes: float64(2), int64(7), object(2)
memory usage: 51.6+ KB


### ***Column Information Report***
- Dataset has 11 columns and 599 rows
- There are 2 non-numeric columns and 9 numeric columns
- No null values in the columns
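
The figures in this report can also be derived programmatically rather than read off the `df.info()` output. This is a minimal sketch. `summarize_columns` is a hypothetical helper, and the two-row frame built from the dataset preview stands in for the full `df`.

```python
import pandas as pd


def summarize_columns(frame: pd.DataFrame) -> dict:
    """Count rows, columns, numeric vs. non-numeric columns, and nulls."""
    numeric = frame.select_dtypes(include="number").columns
    return {
        "rows": len(frame),
        "columns": frame.shape[1],
        "numeric_columns": len(numeric),
        "non_numeric_columns": frame.shape[1] - len(numeric),
        "null_values": int(frame.isna().sum().sum()),
    }


# First two rows of the dataset preview, used as a stand-in for the full frame.
sample = pd.DataFrame([
    {"ID": "ICU200010", "PRG": 6, "PL": 148, "PR": 72, "SK": 35, "TS": 0,
     "M11": 33.6, "BD2": 0.627, "Age": 50, "Insurance": 0, "Sepssis": "Positive"},
    {"ID": "ICU200011", "PRG": 1, "PL": 85, "PR": 66, "SK": 29, "TS": 0,
     "M11": 26.6, "BD2": 0.351, "Age": 31, "Insurance": 0, "Sepssis": "Negative"},
])
summary = summarize_columns(sample)
print(summary)
```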

In [11]:
### descriptive analysis
df.describe(include='all').T

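|           | count | unique | top       | freq | mean       | std        | min   | 25%   | 50%   | 75%   | max   |
|-----------|-------|--------|-----------|------|------------|------------|-------|-------|-------|-------|-------|
| ID        | 599.0 | 599.0  | ICU200608 | 1.0  |            |            |       |       |       |       |       |
| PRG       | 599.0 |        |           |      | 3.824708   | 3.362839   | 0.0   | 1.0   | 3.0   | 6.0   | 17.0  |
| PL        | 599.0 |        |           |      | 120.153589 | 32.682364  | 0.0   | 99.0  | 116.0 | 140.0 | 198.0 |
| PR        | 599.0 |        |           |      | 68.732888  | 19.335675  | 0.0   | 64.0  | 70.0  | 80.0  | 122.0 |
| SK        | 599.0 |        |           |      | 20.562604  | 16.017622  | 0.0   | 0.0   | 23.0  | 32.0  | 99.0  |
| TS        | 599.0 |        |           |      | 79.460768  | 116.576176 | 0.0   | 0.0   | 36.0  | 123.5 | 846.0 |
| M11       | 599.0 |        |           |      | 31.920033  | 8.008227   | 0.0   | 27.1  | 32.0  | 36.55 | 67.1  |
| BD2       | 599.0 |        |           |      | 0.481187   | 0.337552   | 0.078 | 0.248 | 0.383 | 0.647 | 2.42  |
| Age       | 599.0 |        |           |      | 33.290484  | 11.828446  | 21.0  | 24.0  | 29.0  | 40.0  | 81.0  |
| Insurance | 599.0 |        |           |      | 0.686144   | 0.464447   | 0.0   | 0.0   | 1.0   | 1.0   | 1.0   |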


### ***Descriptive Analysis Report***
- The highest plasma glucose (PRG) value recorded was 17.0
- Blood pressure (PR) averaged 68.7, ranging from 0 to 122.0
- Body mass index (M11) ranged from 0 to 67.1, with a mean of 31.9
- The oldest patient was 81 years old and the youngest was 21, with an average age of 33 years
- Most patients tested negative for the target variable (Sepssis)
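
Two points in this report are worth checking directly. A minimum of 0 for blood pressure or body mass index is not physiologically plausible, so zeros may stand for missing measurements, which is an assumption to verify against the data source. The class balance of the target is also not shown in the `describe` output above. The sketch below demonstrates both checks on the first five rows of the dataset preview. The choice of columns in `ZERO_CHECK` is illustrative.

```python
import pandas as pd

# First five rows of the dataset preview (selected columns only).
preview = pd.DataFrame({
    "PR": [72, 66, 64, 66, 40],
    "SK": [35, 29, 0, 23, 35],
    "TS": [0, 0, 0, 94, 168],
    "M11": [33.6, 26.6, 23.3, 28.1, 43.1],
    "Sepssis": ["Positive", "Negative", "Positive", "Negative", "Positive"],
})

# Columns where a zero is suspicious (assumed list, adjust as needed).
ZERO_CHECK = ["PR", "SK", "TS", "M11"]

# Count zero entries per column; these are candidates for hidden missing values.
zero_counts = (preview[ZERO_CHECK] == 0).sum()
print(zero_counts)

# Share of each class in the target column.
class_share = preview["Sepssis"].value_counts(normalize=True)
print(class_share)
```

Run against the full `df`, the same two expressions would show how many zero entries each measurement has and whether negative cases really form the majority.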
