# About the project and the data set 

About the data set
 
The dataset contains patient records collected in an intensive care unit (ICU) and is used to predict whether a patient will develop sepsis. Its attributes include plasma glucose, blood work results, blood pressure, body mass index, and patient age. The binary target column, 'Sepssis' (spelled this way in the data), marks each patient as 'Positive' (likely to develop sepsis) or 'Negative' (not likely). An insurance-status column is also present, which makes it possible to explore how patient characteristics relate to the occurrence of sepsis in the ICU.

About the Project

As the final project of the program, this project focuses on building an API that serves a machine-learning model. This approach is particularly useful when a model's architecture must stay confidential, or when users already consume services through API integrations. Once the API is created and deployed, the model can receive requests over the internet.

## Analytical Questions 

1. Demographic Analysis:

a. What is the distribution of patients based on age?

b. How are patients distributed across the different insurance statuses?

c. Is there any correlation between age and the likelihood of developing sepsis?

2. Feature Relationships:

a. Are there any noticeable patterns or correlations between different numerical features (e.g., PRG, PL, PR) and the target variable (Sepssis)?

b. How does the distribution of Plasma Glucose (PRG) differ between patients who develop sepsis and those who don't?

3. Class Distribution:

a. What is the distribution of classes in the target variable (Positive/Negative)?
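
Several of the questions above reduce to counts and group-wise summaries. The following is a minimal sketch of how they might be answered with pandas. The five rows are copied from the train-set preview shown later in this notebook so that the cell runs on its own; the names `sample`, `class_counts`, `class_share`, `insurance_counts`, and `prg_by_class` are illustrative, and the real analysis would use `df_train` instead.

```python
import pandas as pd

# Five rows copied from the train-set preview; stand-in for df_train.
sample = pd.DataFrame(
    {
        "PRG": [6, 1, 8, 1, 0],
        "Age": [50, 31, 32, 21, 33],
        "Insurance": [0, 0, 1, 1, 1],
        "Sepssis": ["Positive", "Negative", "Positive", "Negative", "Positive"],
    }
)

# Question 3a: class counts and class proportions of the target
class_counts = sample["Sepssis"].value_counts()
class_share = sample["Sepssis"].value_counts(normalize=True)

# Question 1b: number of patients per insurance status
insurance_counts = sample["Insurance"].value_counts().sort_index()

# Question 2b: summary of PRG within each outcome group
prg_by_class = sample.groupby("Sepssis")["PRG"].agg(["mean", "median", "min", "max"])

print(class_counts, class_share, insurance_counts, prg_by_class, sep="\n\n")
```

With only five rows these numbers say nothing about the real data; the point is the shape of the calls, which carry over unchanged to the full train set.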


## Hypothesis:

This study seeks to investigate the impact of key health-related features, including plasma glucose levels, blood pressure, and body mass index (BMI), on the probability of a patient developing sepsis within the intensive care unit (ICU). We hypothesize that variations in these specific health indicators will exhibit statistically significant correlations with the likelihood of sepsis occurrence. By analyzing these relationships, we aim to identify crucial factors contributing to the development of sepsis, thereby informing more targeted and effective medical interventions in the ICU setting.
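
One simple way to quantify such a relationship is the point-biserial correlation, which is the Pearson correlation between a numeric feature and the target encoded as 0/1. The sketch below is an illustration, not the method used later in this project. The helper `outcome_correlations` is hypothetical, the sample rows are copied from the train-set preview, and the choice of `PR` for blood pressure and `M11` for BMI is an assumption about the column meanings (only `PRG` is named as plasma glucose in this notebook).

```python
import pandas as pd


def outcome_correlations(df, features, target="Sepssis", positive="Positive"):
    """Pearson correlation of each feature with the 0/1-encoded target.

    For a binary target this equals the point-biserial correlation.
    """
    encoded = (df[target] == positive).astype(int)  # Positive -> 1, otherwise 0
    return df[features].corrwith(encoded).sort_values(ascending=False)


# Five rows copied from the train-set preview; stand-in for df_train.
sample = pd.DataFrame(
    {
        "PRG": [6, 1, 8, 1, 0],
        "PR": [72, 66, 64, 66, 40],           # assumed to be blood pressure
        "M11": [33.6, 26.6, 23.3, 28.1, 43.1],  # assumed to be BMI
        "Sepssis": ["Positive", "Negative", "Positive", "Negative", "Positive"],
    }
)

correlations = outcome_correlations(sample, ["PRG", "PR", "M11"])
print(correlations)
```

A correlation coefficient alone does not establish statistical significance. A p-value would need an additional test, for example `scipy.stats.pointbiserialr`, and SciPy is not among the packages installed in this notebook.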

## Installing Packages 

In [None]:
%pip install pandas matplotlib seaborn

## Importing Packages

In [None]:
# Import necessary libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

## Loading data 

In [5]:
# Loading the train data set and the test data set
# Raw strings (r"...") keep the Windows backslashes from being read as escape sequences
train_path = r"G:\AZUBI-AFRICA\CAREER_ACCELERATOR_ALL_OUT\LP6_API_BUILDING\lp6_building_api\data\Paitients_Files_Train.csv"
test_path = r"G:\AZUBI-AFRICA\CAREER_ACCELERATOR_ALL_OUT\LP6_API_BUILDING\lp6_building_api\data\Paitients_Files_Test.csv"

df_train = pd.read_csv(train_path)

df_test = pd.read_csv(test_path)
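
Absolute paths like the ones above only work on the machine where they were written. As a possible alternative, the sketch below builds the path with `pathlib` and fails early with a readable message when the file is missing. The helper name `load_split` and the relative `data` folder in the usage comment are assumptions for illustration, not part of the original project layout.

```python
from pathlib import Path

import pandas as pd


def load_split(data_dir, filename):
    """Read one CSV file from data_dir, raising a clear error if it is absent."""
    path = Path(data_dir) / filename  # joins with the right separator on any OS
    if not path.is_file():
        raise FileNotFoundError(f"Expected data file not found: {path}")
    return pd.read_csv(path)


# Hypothetical usage, assuming the notebook runs next to a "data" folder:
# df_train = load_split("data", "Paitients_Files_Train.csv")
# df_test = load_split("data", "Paitients_Files_Test.csv")
```

Keeping the directory as a parameter means only one value changes when the notebook moves to another machine.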

# Let's look at some information on our data sets

## train set

In [6]:
# looking at the head of the train set 
df_train.head()

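| | ID | PRG | PL | PR | SK | TS | M11 | BD2 | Age | Insurance | Sepssis |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ICU200010 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 0 | Positive |
| 1 | ICU200011 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 | Negative |
| 2 | ICU200012 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 | Positive |
| 3 | ICU200013 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 1 | Negative |
| 4 | ICU200014 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 | Positive |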


In [7]:
df_train.describe(include='object')  # summary statistics for the object (string) columns

| | ID | Sepssis |
|---|---|---|
| count | 599 | 599 |
| unique | 599 | 2 |
| top | ICU200010 | Negative |
| freq | 1 | 391 |


In [8]:
df_train.isnull().sum()  # checking for null values 

ID           0
PRG          0
PL           0
PR           0
SK           0
TS           0
M11          0
BD2          0
Age          0
Insurance    0
Sepssis      0
dtype: int64

In [10]:
df_train.duplicated().sum()  # checking for duplicate rows

0
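
The individual checks above (missing values, duplicates, class balance) can also be collected into a single summary, which is convenient when the same checks are repeated on the test set. This is a minimal sketch: `quality_report` is a hypothetical helper, and the five sample rows are copied from the train-set preview so the cell runs without the CSV files.

```python
import pandas as pd


def quality_report(df, target="Sepssis"):
    """Collect basic data-quality figures for a data frame in one dictionary."""
    return {
        "rows": len(df),
        "missing_cells": int(df.isnull().sum().sum()),   # total NaN cells
        "duplicate_rows": int(df.duplicated().sum()),    # fully identical rows
        "class_counts": df[target].value_counts().to_dict(),
    }


# Five rows copied from the train-set preview; stand-in for df_train.
sample = pd.DataFrame(
    {
        "ID": ["ICU200010", "ICU200011", "ICU200012", "ICU200013", "ICU200014"],
        "PRG": [6, 1, 8, 1, 0],
        "Sepssis": ["Positive", "Negative", "Positive", "Negative", "Positive"],
    }
)

report = quality_report(sample)
print(report)
```

Going by the outputs above, the train set has 599 rows, no missing values, and no duplicate rows, with 391 'Negative' labels, which leaves 208 'Positive' labels.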

# Exploratory Data Analysis 