# Notebook Overview

In this notebook, we perform an Exploratory Data Analysis on a dataset and train a machine learningng model that classifies patients in the ICU that will develop Sepsis and those that will not develop Sepsis


# Set Up



**Installation**

Here is the section where we installed all the packages/libraries that will be needed to tackle the challlenge.

In [1]:
# Installatin of packages
#!pip install numpy
#!pip install pandas
#!pip install patool
#!pip install forex_python
#!pip install pandas_profiling
#! pip install https://github.com/pandas-profiling/pandas-profiling/archive/master.zip 
#!pip install -U imbalanced-learn
pip install scikit-learn==1.0.2


SyntaxError: invalid syntax (3424016597.py, line 9)

# Importation
Here is the section to import all the packages/libraries that will be used through this notebook.

In [None]:
# Data handling
import pandas as pd
import numpy as np
from statistics import mean
from forex_python.converter import CurrencyRates
from babel.numbers import format_currency
import datetime as dt

# Statistics
from scipy import stats
from scipy.stats import shapiro, trim_mean, mstats, mode
from scipy.stats import ttest_ind


# Vizualisation (Matplotlib, Plotly, Seaborn, etc. )
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sn


# balance data
from imblearn import under_sampling, over_sampling
from imblearn.over_sampling import SMOTE, RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler

# Machine learning libraries and metrics
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier

# Feature Processing (Scikit-learn processing, etc. )
from sklearn.preprocessing import StandardScaler, RobustScaler, OneHotEncoder, LabelEncoder, Binarizer
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.metrics import f1_score,roc_curve, auc,roc_auc_score
from sklearn.preprocessing import FunctionTransformer
from sklearn.compose import ColumnTransformer
from sklearn.compose import TransformedTargetRegressor
import joblib
# Other packages
import os
import warnings
warnings.filterwarnings('ignore')
import patoolib
import pickle
from sklearn.pipeline import Pipeline

# Data Loading
Here is the section to load the datasets (train, eval, test) and the additional files

In [None]:
testurl="https://raw.githubusercontent.com/Gilbert-B/Machine-Learning-API-using-FastAPI/main/assets/datasets/Paitients_Files_Test.csv"
trainurl="https://raw.githubusercontent.com/Gilbert-B/Machine-Learning-API-using-FastAPI/main/assets/datasets/Paitients_Files_Train.csv"

In [None]:
test_df = pd.read_csv(testurl,error_bad_lines=False)
train_df= pd.read_csv(trainurl,error_bad_lines=False)

# Exploratory Data Analysis: EDA
Here is the section to **inspect** the datasets in depth, **present** it, make **hypotheses** and **think** the *cleaning, processing and features creation*.

## Dataset overview

Have a look at the loaded datsets using the following methods: `.head(), .info()`

In [None]:
# A quick look at the shape of our dataset

train_df.shape

In [None]:
#Looking at the head of our dataset

train_df.head()

In [None]:
#Taking a look at the tail
train_df.tail()

##### Description of Columns 

ID	-	Unique number to represent patient ID

PRG - 	Plasma glucose

PL	-	Blood Work Result-1 (mu U/ml)

PR	-	Blood Pressure (mm Hg) 

SK	-	Blood Work Result-2 (mm)

TS	-   Blood Work Result-3 (mu U/ml)

M11	-	Body mass index (weight in kg/(height in m)^2

BD2	-	Blood Work Result-4 (mu U/ml)

Age	-	patients age (years)

Insurance	- If a patient holds a valid insurance card

Sepssis	Target	Positive: if a patient in ICU will develop a sepsis , and Negative: otherwise

In [2]:
#Look at the columns in the dataset and their data types

train_df.info()

NameError: name 'train_df' is not defined

In [None]:
#Get more details about the features of our data
train_df.describe()

In [None]:
#Check for missing values
train_df.isna().sum()



In [None]:
#Check for outliers
train_df.boxplot()

## Issues With the Data


Too many zeros in each column

The column names are not very descriptive.

The target variable 'Sepssis' may have imbalanced classes.

There are many outliers in some of the numerical columns.

There could be correlations between some of the predictor variables, leading to multicollinearity.


## How I Intend to Solve Them

Replace zeros in each column with the median value

Rename the column names to be more descriptive and easier to understand.

Handle the imbalanced classes in the target variable using techniques such as undersampling or oversampling.

Use visualization techniques such as box plots and scatter plots to identify and handle any outliers.

Use correlation analysis to identify highly correlated variables and consider dropping or transforming them.

## Hypothesis

***Null Hypothesis:*** Age does not determine whether a patient will develop Sepssis

***Alternate Hypothesis:*** Age determines whether a pateint will develop Sepssis

##  Questions

1. Is the train dataset complete?
2. What are the ages of the youngest and oldest patients?
3. What are the youngest and oldest patients with Sepssis?
4. What is the average age ?
5. What is the ratio of patients who are positive for sepssis to the negative patients ?
6. What is the highest and lowest BMI?
7. What is the average BMI ?
8. Is there a corelation between the Sepssis status and the other attributes? 

## Data Cleaning 

In [None]:
# First Rename the columns
train_df = train_df.rename(columns={
    "PRG": "Plasma_glucose",
    "PL": "Blood_Work_R1",
    "PR": "Blood_Pressure",
    "SK": "Blood_Work_R2",
    "TS": "Blood_Work_R3",
    "M11": "BMI",
    "BD2": "Blood_Work_R4",
    "Age": "Patient_age",
    "Sepssis": "Target"
})

In [None]:
numerical_features = ['Plasma_glucose', 'Blood_Work_R1', 'Blood_Pressure', 'Blood_Work_R2', 'Blood_Work_R3', 'BMI', 'Blood_Work_R4', 'Patient_age']

##### Removing the rows where BMI is 0 

In [None]:
# Lets inspect our dataset again
train_df

A glance at our dataset shows the value 0 in some of the columns. This can not be possible and indicates the presence of wrong vaules in our dataset. Lets first remove 0 BMIs and replace the other 0 values in the columns with the median.

In [None]:
#Extracting rows with 0 BMI
zero_bmi = train_df[train_df['BMI']==0.0]
zero_bmi

In [None]:
# Removing rows with 0 BMI
train_df.drop(train_df[train_df['BMI'] == 0.0].index, inplace=True)

In [None]:
#confirming that all 0 BMIs have been removed from our dataset
zero_bmi2 = train_df[train_df['BMI']==0.0]
zero_bmi2

##### Replace zeros in other  columns  with the median value

In [None]:
# Another look at our dataset shows that most of our columns have 0 for values.
train_df

In [None]:
columns_with_too_many_zeros = ['Plasma_glucose', 'Blood_Work_R2', 'Blood_Work_R3']
for col in columns_with_too_many_zeros:
    train_df[col].replace(to_replace=0, value=train_df[col].median(), inplace=True)

In [None]:
train_df

#### Checking for Outliers

In [None]:
plt.figure(figsize=(10, 6))

# Plot the boxplot
train_df.boxplot()

# Rotate x-axis labels by 45 degrees
plt.xticks(rotation=45)

# Display the plot
plt.show()

The box plots of the various columns as visualized above, shows the presence of outliers in our data.
Outliers can skew the results of machine learning models and make them less accurate and reliable. 

In [None]:
Q1 = train_df.quantile(0.25)
Q3 = train_df.quantile(0.75)
IQR = Q3-Q1
IQR
((train_df< (Q1-1.5 * IQR)) | (train_df > (Q3 + 1.5 * IQR))).any()

All the columns except ID, Insurance and the Target Column have outliers.

#### Calculating the Interquartile range, setting the outlier boundary and removing the outliers from the dataframe

In [None]:
# Specify the columns of interest
columns_of_interest =  ['BMI', 'Blood_Pressure', 'Blood_Work_R1','Blood_Work_R2','Blood_Work_R3','Blood_Work_R4','Patient_age','Plasma_glucose']

# Check if outliers still exist in the columns
outliers_exist = False

for column in columns_of_interest:
    # Calculate the first and third quartiles (Q1 and Q3)
    Q1 = train_df[column].quantile(0.25)
    Q3 = train_df[column].quantile(0.75)

    # Calculate the interquartile range (IQR)
    IQR = Q3 - Q1

    # Define the lower and upper bounds for outliers
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    
    # Modify the values in the column to be within the range
    train_df[column] = train_df[column].clip(lower_bound, upper_bound)

    # Check if outliers exist in the column
    if (train_df[column] < lower_bound).any() or (train_df[column] > upper_bound).any():
        outliers_exist = True
        print(f"Outliers still exist in '{column}'.")

if not outliers_exist:
    print("No outliers exist in the specified columns.")


In [None]:
plt.figure(figsize=(10, 6))

# Plot the boxplot
train_df.boxplot()

# Rotate x-axis labels by 45 degrees
plt.xticks(rotation=45)

# Display the plot
plt.show()

## Univariate Analysis

#### Positive Sepssis Cases 

In [None]:
positive_cases = train_df[train_df['Target'] == 'Positive']
positive_cases

Age

In [None]:
positive_age_stats = positive_cases['Patient_age'].describe()
positive_age_stats 

In [None]:
no_positives= positive_age_stats['count']
print(f'The no of patients diagnosed with Sepssis is {no_positives}')

In [None]:
positive_mean_age = positive_age_stats['mean']
print(f'The mean age of patients with Sepssis is: {positive_mean_age:.2f} years')

In [None]:
highest_positive_age = positive_age_stats['max']
print(f'The oldest patient with Sepssis is {highest_positive_age} years old')

In [None]:
lowest_positive_age = positive_age_stats['min']
print(f'The youngest patient with Sepssis is {lowest_positive_age} years old')

In [None]:
# Extract the 'age' column from the DataFrame
ages = positive_cases['Patient_age']

# Create a histogram
plt.figure(figsize=(10, 6))
plt.hist(ages, bins=20, edgecolor='black')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.title('Distribution of Ages of Patients with Sepssis')
plt.show()

# Create a kernel density plot
plt.figure(figsize=(10, 6))
sn.kdeplot(ages, shade=True, color='blue', linewidth=2)
plt.axvline(positive_mean_age, color='red', linestyle='--', label='Mean Age')
plt.xlabel('Age')
plt.ylabel('Density')
plt.title('Distribution of Ages of Patients with Sepssis (Kernel Density Plot)')
plt.legend()
plt.show()

BMI

In [None]:
positive_bmi_stats = positive_cases['BMI'].describe()
positive_bmi_stats

In [None]:
positive_mean_bmi = positive_bmi_stats['mean']
print(f'The average BMI for patients with Sepssis is {positive_mean_bmi:.2f}')

In [None]:
highest_bmi = positive_bmi_stats['max']
print(f'The highest BMI for a patient with Sepssis is {highest_bmi}')

In [None]:
lowest_positive_bmi = positive_bmi_stats['min']
print(f'The lowest BMI for a patient with Sepssis is {lowest_positive_bmi}')

In [None]:
# Extract the BMI 'M11' column from the DataFrame
BMI = positive_cases['BMI']

# Create a histogram
plt.figure(figsize=(10, 6))
plt.hist(ages, bins=20, edgecolor='black')
plt.xlabel('BMI')
plt.ylabel('Frequency')
plt.title('Distribution of BMI of Patients with Sepssis')
plt.show()

# Create a kernel density plot
plt.figure(figsize=(10, 6))
sn.kdeplot(BMI, shade=True, color='blue', linewidth=2)
plt.axvline(positive_mean_age, color='red', linestyle='--', label='Mean BMI')
plt.xlabel('BMI')
plt.ylabel('Density')
plt.title('Distribution of BMI of Patients With Sepssis (Kernel Density Plot)')
plt.legend()
plt.show()

#### Negative Sepssis Cases

AGE

In [None]:
negative_cases = train_df[train_df['Target'] == 'Negative']
negative_cases

In [None]:
negative_age_stats = negative_cases['Patient_age'].describe()
negative_age_stats 

In [None]:
No_Negative = negative_age_stats['count']
print (f'No of patients without Sepssis is {No_Negative}')

In [None]:
mean_age = negative_age_stats['mean']
print(f'The mean age for patients without Sepssis is: {mean_age:.2f} years')


In [None]:
highest_negative_age = negative_age_stats['max']
print(f'The oldest patient without Sepssis is {highest_negative_age} years old')

In [None]:
lowest_negative_age = negative_age_stats['min']
print(f'The youngest patient withot Sepssis is {lowest_negative_age} years old')

In [None]:
# Extract the 'age' column from the DataFrame
ages = negative_cases['Patient_age']

# Create a histogram
plt.figure(figsize=(10, 6))
plt.hist(ages, bins=20, edgecolor='black')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.title('Distribution of Ages of Patients without Sepssis')
plt.show()

# Create a kernel density plot
plt.figure(figsize=(10, 6))
sn.kdeplot(ages, shade=True, color='blue', linewidth=2)
plt.axvline(positive_mean_age, color='red', linestyle='--', label='Mean Age')
plt.xlabel('Age')
plt.ylabel('Density')
plt.title('Distribution of Ages of Patients without Sepssis (Kernel Density Plot)')
plt.legend()
plt.show()

BMI

In [None]:
negative_bmi_stats = negative_cases['BMI'].describe()
negative_bmi_stats

In [None]:
negative_mean_bmi = negative_bmi_stats['mean']
print(f'The mean BMI for patients without Sepssis is: {negative_mean_bmi:.2f}')

In [None]:
highest_negative_bmi = negative_bmi_stats['max']
print(f'The highest BMI for a  patient without Sepssis is {highest_negative_bmi}')

In [None]:
lowest_negative_bmi = negative_bmi_stats['min']
print(f'The lowest BMI for a patient withot Sepssis is {lowest_negative_bmi}')

In [None]:
# Extract the BMI 'M11' column from the DataFrame
BMI = negative_cases['BMI']

# Create a histogram
plt.figure(figsize=(10, 6))
plt.hist(ages, bins=20, edgecolor='black')
plt.xlabel('BMI')
plt.ylabel('Frequency')
plt.title('Distribution of BMI of Patients Without Sepssis')
plt.show()

# Create a kernel density plot
plt.figure(figsize=(10, 6))
sn.kdeplot(BMI, shade=True, color='blue', linewidth=2)
plt.axvline(negative_mean_bmi, color='red', linestyle='--', label='Mean BMI')
plt.xlabel('BMI')
plt.ylabel('Density')
plt.title('Distribution of BMI of Patients Without Sepssis (Kernel Density Plot)')
plt.legend()
plt.show()

## Univariate Analysis

#### Graphically Displaying all other numerical columns using Histogram 

In [None]:
# Set the style for the plot
sn.set(style="ticks", color_codes=True)

# Create a grid of 3 by 3 subplots
fig, axes = plt.subplots(3, 3, figsize=(12, 12))

# Flatten the axes array
axes = axes.flatten()

# Plot histograms for each numerical column
for i, col in enumerate(numerical_features):
    sn.histplot(data=train_df, x=col, kde=True, bins=10, ax=axes[i])
    axes[i].set_title(f'Histogram of {col}')
    axes[i].set_xlabel(col)
    axes[i].set_ylabel('Count')

# Adjust the spacing between subplots
plt.tight_layout()

# Display the plot
plt.show()

In [None]:
# Distribution of the target variable (Sepssis)
sn.countplot(x='Target', data=train_df)
plt.title('Distribution of Sepssis')
plt.show()

# Pairplot to visualize relationships between variables
sn.pairplot(train_df, hue='Target', diag_kind='kde')
plt.show()

# Correlation heatmap
plt.figure(figsize=(10, 8))
sn.heatmap(train_df.corr(), annot=True, cmap='coolwarm')
plt.title('Correlation Heatmap')
plt.show()

In [None]:
# Boxplots of numerical variables grouped by Sepssis Target
numeric_columns = ['Plasma_glucose', 'Blood_Work_R1', 'Blood_Pressure', 'Blood_Work_R2', 'Blood_Work_R3', 'BMI', 'Blood_Work_R4', 'Patient_age']
for column in numeric_columns:
    plt.figure(figsize=(8, 6))
    sn.boxplot(x='Target', y=column, data=train_df)
    plt.title('Boxplot of ' + column + ' grouped by Sepssis Status')
    plt.show()

# Histograms of numerical variables grouped by Sepssis Target
for column in numeric_columns:
    plt.figure(figsize=(8, 6))
    sn.histplot(data=train_df, x=column, hue='Target', kde=True)
    plt.title('Histogram of ' + column + ' grouped by Sepssis  Status')
    plt.show()

# Bar plots of categorical variable (Insurance) grouped by Sepssis Target
plt.figure(figsize=(8, 6))
sn.countplot(x='Insurance', hue='Target', data=train_df)
plt.title('Count of Insurance grouped by Sepssis Status')
plt.show()

In [None]:
sn.histplot(data=train_df, x='Patient_age', hue='Target', alpha=0.5, kde=True)
plt.title(f'Histogram of Patient_age')
plt.xlabel('Patient_age')
plt.ylabel('Count')
plt.show()

## Hypothesis Validation 

In [None]:
# Split the data into two groups based on the Sepssis variable
target_positive = train_df[train_df['Target'] == 'Positive']
target_negative= train_df[train_df['Target'] == 'Negative']

# Extract the Age(Patient_age) values for each group
age_target_positive = target_positive['Patient_age']
age_target_negative = target_negative['Patient_age']

# Perform independent samples t-test
t_statistic, p_value = ttest_ind(age_target_positive, age_target_negative)

# Print the results
print("T-Statistic:", t_statistic)
print("P-Value:", p_value)

In [None]:
print("Based on the t-test results, the t-statistic value is {} and the p-value is {}.".format(t_statistic, p_value))
print("\nInterpreting the results:")
print("T-Statistic: The t-statistic measures the difference between the means of the two groups (positive and negative Sepssis Target) relative to the variability within each group. In this case,the t-statistic value of {} indicates a substantial difference in the mean age between the two groups.".format(t_statistic))
print("P-Value: The p-value is a measure of the statistical significance of the t-test results.In this case,the p-value is very small, which is less than commonly used significance levels like 0.05 or 0.01.This indicates strong evidence against the null hypothesis.")
print("\nInterpretation: With a t-statistic of {} and a very small p-value of {}, we can conclude that there is a significant difference in the mean age between patients with a positive Sepssis status and those with a negative Sepssis status. The results suggest that age may play a role in determining the likelihood of developing sepsis.".format(t_statistic, p_value))


### Answers

#### 1. Is the train dataset complete

In [None]:
train_df.isnull().sum()

There are no missing values in the dataset 

#### 2. What are the ages of the youngest and oldest patients

In [None]:
oldest_age = train_df['Patient_age'].max()
youngest_age= train_df['Patient_age'].min()

In [None]:
print(f'The youngest and oldest patients are {youngest_age} and {oldest_age} years respectively')

#### 3. What are the youngest and oldest patients with Sepssis?

In [None]:
highest_positive_age = positive_age_stats['max']
lowest_positive_age = positive_age_stats['min']

In [None]:
print(f'The youngest and oldest patient with Sepssis is {lowest_positive_age} and {highest_positive_age} years respectively')

#### 4. What is the average age ?


In [None]:
average_age = train_df['Patient_age'].mean()
print(f'The Average age is {average_age:.2f} years old')

#### 5. What is the ratio of patients who are positive for sepssis to the negative patients ?


In [None]:
# Calculate the count of positive and negative patients
positive_count = train_df[train_df['Target'] == 'Positive'].shape[0]
negative_count = train_df[train_df['Target'] == 'Negative'].shape[0]

# Calculate the ratio
ratio = positive_count / negative_count

print(f'The ratio of patientrs positive for sepssis to negative patients is {ratio:.2f}')

#### 6.What is the highest and lowest BMI?


In [None]:
highest_bmi = train_df['BMI'].max()
lowest_bmi= train_df['BMI'].min()

print(f'The highest and lowest BMI is {highest_bmi:.2f} and {lowest_bmi:.2f} respectively')

#### 7.What is the average BMI ?


In [None]:
average_bmi = train_df['BMI'].mean()

print(f'The average BMI is {average_bmi:.2f}')

#### 8.Is there a corelation between the Sepssis status and the other attributes?

In [None]:
# Replace "Positive" with 1 and "Negative" with 0
train_df['Target'] = train_df['Target'].replace({'Positive': 1, 'Negative': 0})

# Print the updated DataFrame
train_df.head(5)

In [None]:
# Calculate the correlation matrix
correlation_matrix = train_df.corr()

# Plot the correlation heatmap
plt.figure(figsize=(10, 8))
sn.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
plt.title('Correlation Heatmap')
plt.show()


In [None]:
# Calculate the correlation matrix
correlation_matrix = train_df.corr()

# Create a DataFrame from the correlation matrix
correlation_table = pd.DataFrame(correlation_matrix)

# Print the correlation table
correlation_table

In [None]:

# Set the threshold for high correlation
threshold = 0.5

# Find the highly correlated variables
high_correlation = (correlation_matrix.abs() > threshold) & (correlation_matrix != 1)

# Get the variable pairs with high correlation
high_correlation_pairs = [(i, j) for i in high_correlation.columns for j in high_correlation.columns if high_correlation.loc[i, j]]

# Print the highly correlated variables
for pair in high_correlation_pairs:
    var1, var2 = pair
    correlation_value = correlation_matrix.loc[var1, var2]
    print(f"{var1} and {var2} are highly correlated (correlation value: {correlation_value})")


## Feature Processing and Engineering

In [None]:
train_df.info()

#### Check and Drop Duplicates 

In [None]:
#Check for duplicate rows in data
duplicate_rows = train_df.duplicated()
print("Number of duplicate rows:", duplicate_rows.sum())

#### Impute Missing Values 

In [None]:
missing_values = train_df.isna().sum()
print(missing_values)

## Feature Encoding

##### Removing Columns 

In [None]:
# Since Plasma_Glucose and Patient_age are highly correlated, we will remove Plasma_Glucose 
# We will remove Blood_work_R2 since it is also highly correlated to BMI
# We will remove the ID column
# we drop Insurance as well since it isnt a relevant field 

In [None]:
train_df_new = train_df.drop(['Blood_Work_R2', 'Plasma_glucose', 'ID', 'Insurance'], axis=1)


In [None]:
train_df_new

## Data Spliting 

In [None]:
# Use train_test_split with a random_state, and add stratify for Classification
# Split the  data into train and validation sets
X_train, X_eval, y_train, y_eval = train_test_split(train_df_new.iloc[:, :-1], train_df_new.iloc[:, -1:],
                                                    test_size=0.2, random_state=42, stratify=train_df_new.iloc[:, -1:])


In [None]:
X_train.shape,X_eval.shape,y_train.shape,y_eval.shape

In [None]:
X_train

In [None]:
y_train

## Feature Scaling 

#### Checking to see if Our Data is Balanced 

In [None]:
# Count the occurrences of each class label
class_counts = train_df['Target'].value_counts()

# Print the class counts
print(class_counts)

# Calculate the class frequencies
class_frequencies = class_counts / len(train_df)

# Print the class frequencies
print(class_frequencies)

In [None]:
# Our dataset is imbalanced and we would have to Scale it 
scaler = StandardScaler()
scaler.fit(X_train)

X_train_scaled = scaler.transform(X_train)
X_train_df = pd.DataFrame(X_train_scaled, columns = ['Blood_Work_R1', 'Blood_Pressure', 'Blood_Work_R3', 'BMI', 'Blood_Work_R4', 'Patient_age'])

X_eval_scaled = scaler.transform(X_eval)
X_eval_df = pd.DataFrame(X_eval_scaled, columns = ['Blood_Work_R1', 'Blood_Pressure','Blood_Work_R3', 'BMI', 'Blood_Work_R4', 'Patient_age'])

# Machine Learing Model 

#### Here is the section to build, train, evaluate and compare the models to each others.

## Logistics Regression Model

In [None]:
# Instanciate the model

lr = LogisticRegression(random_state=42)

In [None]:
# Train the model on the training set

lr.fit(X_train_scaled, y_train)

In [None]:
y_pred = lr.predict(X_eval)

In [None]:
y_pred

In [None]:
#classification report for the model's performance on the eval set.
LRM=(classification_report(y_eval, y_pred))


print(LRM)

In [None]:
# Function to parse the classification report string
def parse_classification_report(LRM):
    lines = LRM.strip().split('\n')
    metrics = {}
    for line in lines[2:4]:  # Only parse the lines for class 0 and class 1
        label, precision, recall, f1_score, _ = line.split()
        metrics[f'precision_{label}'] = float(precision)
        metrics[f'recall_{label}'] = float(recall)
        metrics[f'f1-score_{label}'] = float(f1_score)
    return metrics

# Call the function to parse the classification report string
lr_metrics = parse_classification_report(LRM)

print(lr_metrics)


## Random Forest Classifier

In [None]:
# creating the model
rfc = RandomForestClassifier(n_estimators=100, random_state=42)
rfc.set_params(**{'n_estimators': 100, 'random_state': 42})

In [None]:
# Train the model on the training set
rfc.fit(X_train_scaled, y_train)

In [None]:
# Evaluate the Model on the Evaluation dataset 
y_pred = rfc.predict(X_eval)

In [None]:
y_pred

In [None]:
# Compute the valid metrics for the use case # Optional: show the classification report 
accuracy = accuracy_score(y_eval, y_pred)
precision = precision_score(y_eval, y_pred)
recall = recall_score(y_eval, y_pred)
f1 = f1_score(y_eval, y_pred)


#classification report for the model's performance on the eval set.
RFCM=(classification_report(y_eval, y_pred))

print(RFCM)

For each model and class, we have the following metrics:

Precision: It measures how many of the predicted positive instances are actually positive. A higher precision means fewer false positives.

Recall: It measures how many of the actual positive instances are correctly identified. A higher recall means fewer false negatives.

F1-score: It is a balanced measure that combines both precision and recall. It provides a single value that represents the model's overall performance for a specific class.

In [None]:
# Function to parse the classification report string
def parse_classification_report(RFCM):
    lines = RFCM.strip().split('\n')
    metrics = {}
    for line in lines[2:4]:  # Only parse the lines for class 0 and class 1
        label, precision, recall, f1_score, _ = line.split()
        metrics[f'precision_{label}'] = float(precision)
        metrics[f'recall_{label}'] = float(recall)
        metrics[f'f1-score_{label}'] = float(f1_score)
    return metrics

# Call the function to parse the classification report string
rfc_metrics = parse_classification_report(RFCM)

print(rfc_metrics)


## Gradient Boosting Classifier Model

In [None]:
# Initialize the model with default hyperparameters
gbc = GradientBoostingClassifier(random_state=42)

In [None]:
# Fit the model on the training data
gbc.fit(X_train_scaled, y_train)

In [None]:
# Make predictions on the evaluation set
y_pred = gbc.predict(X_eval_scaled)

In [None]:
y_pred

In [None]:
#classification report for the model's performance on the evalset.
GBCM= (classification_report(y_eval, y_pred))


print(GBCM)

In [None]:
# Function to parse the classification report string
def parse_classification_report(GBCM):
    lines = GBCM.strip().split('\n')
    metrics = {}
    for line in lines[2:4]:  # Only parse the lines for class 0 and class 1
        label, precision, recall, f1_score, _ = line.split()
        metrics[f'precision_{label}'] = float(precision)
        metrics[f'recall_{label}'] = float(recall)
        metrics[f'f1-score_{label}'] = float(f1_score)
    return metrics

# Call the function to parse the classification report string
gbc_metrics = parse_classification_report(GBCM)

print(gbc_metrics)

# Models comparison

Creating a pandas dataframe that will allow us to compare our models.

In [None]:
models = []

In [None]:
models.append('RFC')
models.append('GBC')
models.append('LR')

In [None]:
metrics_list = [rfc_metrics, gbc_metrics, lr_metrics]

In [None]:
combined_metrics = []
for i, m in enumerate(metrics_list):
    m['model'] = models[i]
    combined_metrics.append(m)

In [None]:
metrics_df = pd.DataFrame(combined_metrics)
metrics_df.set_index('model', inplace=True)

In [None]:
print(metrics_df)

If we consider the F1-score as the evaluation criterion, we can compare the F1-scores for each model and class. Generally, a higher F1-score indicates better performance in terms of both precision and recall.

Looking at the F1-scores in the table:

For class 0, the GBC model has the highest F1-score of 0.84, followed by the RFC model with an F1-score of 0.84, and the LR model with an F1-score of 0.83.
For class 1, the GBC model has the highest F1-score of 0.73, followed by the RFC model with an F1-score of 0.68, and the LR model with an F1-score of 0.63.
Based on the F1-scores, we can conclude that the GBC model performed the best for both class 0 and class 1, followed by the RFC model.

# Hyper Parameter tuning.

Fine-tuning the Top-k models (3 < k < 5) using a GridSearchCV (that is in sklearn.model_selection ) to find the best hyperparameters and achieve the maximum performance of each of the Top-k models, then comparing them again to select the best one.

In [None]:
# Parameter grid for GBC
gbc_param_grid = {
    'n_estimators': [100, 200, 300],
    'learning_rate': [0.1, 0.05, 0.01],
    'max_depth': [3, 5, 7]
}

# Parameter grid for LR
lr_param_grid = {
    'C': [0.1, 1.0, 10.0],
    'penalty': ['l1', 'l2']
}

# Parameter grid for RF
rf_param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [None, 5, 10],
    'min_samples_split': [2, 5, 10]
}


In [None]:
# GridSearchCV for GBC
gbc_grid_search = GridSearchCV(
    estimator=GradientBoostingClassifier(),
    param_grid=gbc_param_grid,
    scoring='accuracy',
    cv=5
)

# GridSearchCV for LR
lr_grid_search = GridSearchCV(
    estimator=LogisticRegression(),
    param_grid=lr_param_grid,
    scoring='accuracy',
    cv=5
)

# GridSearchCV for RF
rf_grid_search = GridSearchCV(
    estimator=RandomForestClassifier(),
    param_grid=rf_param_grid,
    scoring='accuracy',
    cv=5
)


In [None]:
gbc_grid_search.fit(X_train_scaled, y_train)
lr_grid_search.fit(X_train_scaled, y_train)
rf_grid_search.fit(X_train_scaled, y_train)


#### Assessing the best parameters

In [None]:
# GBC - Best hyperparameters
best_gbc_params = gbc_grid_search.best_params_

# LR - Best hyperparameters
best_lr_params = lr_grid_search.best_params_

# RF - Best hyperparameters
best_rf_params = rf_grid_search.best_params_


#### Assessing the best score 

In [None]:
# GBC - Best score
best_gbc_score = gbc_grid_search.best_score_

# LR - Best score
best_lr_score = lr_grid_search.best_score_

# RF - Best score
best_rf_score = rf_grid_search.best_score_


Use the best hyperparameters obtained from Grid Search to train your final models and evaluate them on the test data.

In [None]:
# Initialize GBC with best hyperparameters
gbc_best = GradientBoostingClassifier(**best_gbc_params)
gbc_best.fit(X_train_scaled, y_train)

# Initialize LR with best hyperparameters
lr_best = LogisticRegression(**best_lr_params)
lr_best.fit(X_train_scaled, y_train)

# Initialize RF with best hyperparameters
rf_best = RandomForestClassifier(**best_rf_params)
rf_best.fit(X_train_scaled, y_train)


In [None]:
#creating a file path to save all the componets in.
if not os.path.exists("ml"):
    os.makedirs("ml")

In [None]:
# set the destination path to the "export" directory
destination = os.path.join(".", "ml", "gbc.pkl")
joblib.dump(gbc_best, destination)

In [None]:
destination2 = os.path.join(".", "ml", "scaler.pkl")
joblib.dump(scaler, destination2)

In [None]:
import pickle

with open('scaler.pkl', 'wb') as f:
    pickle.dump(scaler, f)
