<h1> 1. Predicting Diabetes Onset Using Logistic Regression </h1>
<h3><b>Preprocessing Steps:</b></h3>
<ul>
    <li>Handle missing values if any.</li>
    <li>Standardize features.</li>
    <li>Encode categorical variables if any.</li>
</ul>
<h3><b>Task:</b> Implement logistic regression to predict diabetes onset and evaluate the model using accuracy, precision, and recall.</h3>



In [1]:
# Importing Libraries
import pandas as pd
from sklearn.preprocessing import StandardScaler  # For standardization
from sklearn.model_selection import train_test_split  # For splitting the data
from sklearn.linear_model import LogisticRegression   
from sklearn.metrics import accuracy_score, precision_score, recall_score

In [2]:
# Loading the dataset
diabetes_dataset = pd.read_csv('Datasets/Diabetes.csv')  # forward slash works on Windows, Linux, and macOS
print(diabetes_dataset.shape, '\n')
diabetes_dataset.head()

(768, 9) 



,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


In [3]:
# Printing the basic statistics of the data
diabetes_dataset.describe()

,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
count,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0
mean,3.845052,120.894531,69.105469,20.536458,79.799479,31.992578,0.471876,33.240885,0.348958
std,3.369578,31.972618,19.355807,15.952218,115.244002,7.88416,0.331329,11.760232,0.476951
min,0.0,0.0,0.0,0.0,0.0,0.0,0.078,21.0,0.0
25%,1.0,99.0,62.0,0.0,0.0,27.3,0.24375,24.0,0.0
50%,3.0,117.0,72.0,23.0,30.5,32.0,0.3725,29.0,0.0
75%,6.0,140.25,80.0,32.0,127.25,36.6,0.62625,41.0,1.0
max,17.0,199.0,122.0,99.0,846.0,67.1,2.42,81.0,1.0


In [4]:
# Printing information of dataset
diabetes_dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 non-null    int64  
 1   Glucose                   768 non-null    int64  
 2   BloodPressure             768 non-null    int64  
 3   SkinThickness             768 non-null    int64  
 4   Insulin                   768 non-null    int64  
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64  
 8   Outcome                   768 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 54.1 KB


In [5]:
# Separating the features and the target variable
X = diabetes_dataset.drop('Outcome', axis=1)
Y = diabetes_dataset['Outcome']

print("Shape of X:", X.shape)
print("Shape of Y:", Y.shape)

Shape of X: (768, 8)
Shape of Y: (768,)


<h2>Data Preprocessing</h2>

<h3>1. Handling Missing Values</h3>

In [6]:
# Checking the missing values in the dataset
diabetes_dataset.isnull().sum()

Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64

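-> isnull() reports no null values in any column, so we can proceed to the next preprocessing step, i.e., <b>standardizing features</b>. Note that the describe() output shows a minimum of 0 for Glucose, BloodPressure, SkinThickness, Insulin, and BMI; such zeros are physiologically implausible and may be unrecorded measurements. They are left unchanged here.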
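
-> A null check does not catch placeholder values. The sketch below counts zeros in the clinical columns and, optionally, replaces them with the column median. It uses only the five rows shown in the head() output, and treating a zero as "not recorded" is an assumption, not something the dataset states.

```python
import numpy as np
import pandas as pd

# First five rows of the clinical columns, copied from the head() output
sample = pd.DataFrame({
    'Glucose':       [148, 85, 183, 89, 137],
    'BloodPressure': [72, 66, 64, 66, 40],
    'SkinThickness': [35, 29, 0, 23, 35],
    'Insulin':       [0, 0, 0, 94, 168],
    'BMI':           [33.6, 26.6, 23.3, 28.1, 43.1],
})

# Count zeros per column (assumption: a zero means "not recorded")
zero_counts = (sample == 0).sum()
print(zero_counts)

# Optional: mark zeros as NaN, then impute with each column's median
cleaned = sample.replace(0, np.nan)
cleaned = cleaned.fillna(cleaned.median())
print(cleaned)
```

Median imputation is only one option; whether it is appropriate depends on how many values are affected in the full dataset.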

<h3>2. Standardizing Features</h3>

In [7]:
# Applying standardization (z-score method)
scaler = StandardScaler()
X_standardized = scaler.fit_transform(X)

# Converting the standardized features to dataframe
X_standardized = pd.DataFrame(X_standardized, columns=X.columns)
X_standardized

,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age
0,0.639947,0.848324,0.149641,0.907270,-0.692891,0.204013,0.468492,1.425995
1,-0.844885,-1.123396,-0.160546,0.530902,-0.692891,-0.684422,-0.365061,-0.190672
2,1.233880,1.943724,-0.263941,-1.288212,-0.692891,-1.103255,0.604397,-0.105584
3,-0.844885,-0.998208,-0.160546,0.154533,0.123302,-0.494043,-0.920763,-1.041549
4,-1.141852,0.504055,-1.504687,0.907270,0.765836,1.409746,5.484909,-0.020496
...,...,...,...,...,...,...,...,...
763,1.827813,-0.622642,0.356432,1.722735,0.870031,0.115169,-0.908682,2.532136
764,-0.547919,0.034598,0.046245,0.405445,-0.692891,0.610154,-0.398282,-0.531023
765,0.342981,0.003301,0.149641,0.154533,0.279594,-0.735190,-0.685193,-0.275760
766,-0.844885,0.159787,-0.470732,-1.288212,-0.692891,-0.240205,-0.371101,1.170732


In [8]:
# Printing the basic statistics of the standardized data
X_standardized.describe().round(2)

,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age
count,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0
mean,-0.0,-0.0,0.0,0.0,-0.0,0.0,0.0,0.0
std,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
min,-1.14,-3.78,-3.57,-1.29,-0.69,-4.06,-1.19,-1.04
25%,-0.84,-0.69,-0.37,-1.29,-0.69,-0.6,-0.69,-0.79
50%,-0.25,-0.12,0.15,0.15,-0.43,0.0,-0.3,-0.36
75%,0.64,0.61,0.56,0.72,0.41,0.58,0.47,0.66
max,3.91,2.44,2.73,4.92,6.65,4.46,5.88,4.06


-> The features have been standardized with StandardScaler: every feature now has a mean of 0 and a standard deviation of 1 (to two decimal places).
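
-> For reference, StandardScaler computes z = (x - mean) / std for each column, using the population standard deviation (ddof=0). The minimal sketch below checks this by hand on the Glucose values of the first five rows only, so the resulting numbers differ from the full-dataset table above.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Glucose values of the first five rows, taken from the head() output
glucose = np.array([[148.0], [85.0], [183.0], [89.0], [137.0]])

# Manual z-score: subtract the column mean, divide by the population std (ddof=0)
manual = (glucose - glucose.mean(axis=0)) / glucose.std(axis=0, ddof=0)

# StandardScaler performs the same computation column by column
scaled = StandardScaler().fit_transform(glucose)

print(np.allclose(manual, scaled))  # True
```

Because pandas' describe() reports the sample standard deviation (ddof=1), the std of a standardized column is very slightly above 1 before rounding.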

<h3>3. Encoding Categorical Variables</h3>

In [9]:
# Printing the features information
X_standardized.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 8 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 non-null    float64
 1   Glucose                   768 non-null    float64
 2   BloodPressure             768 non-null    float64
 3   SkinThickness             768 non-null    float64
 4   Insulin                   768 non-null    float64
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    float64
dtypes: float64(8)
memory usage: 48.1 KB


-> Since no feature has the data type 'object', no feature needs to be encoded.<br>
-> The data preprocessing steps are now complete, so we proceed to model training and evaluation.
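
-> If a dataset did contain categorical features, the check and the encoding could look like the sketch below. The 'Gender' column is hypothetical and is not part of the diabetes dataset; one-hot encoding with pd.get_dummies is just one possible choice.

```python
import pandas as pd

# Hypothetical frame: 'Gender' is NOT a column of the diabetes dataset
df = pd.DataFrame({
    'Age': [50, 31, 32],
    'Gender': ['F', 'M', 'F'],
})

# Non-numeric columns are the candidates for encoding
categorical_cols = df.select_dtypes(exclude='number').columns.tolist()
print(categorical_cols)  # ['Gender']

# One-hot encode them; drop_first removes one redundant indicator column
encoded = pd.get_dummies(df, columns=categorical_cols, drop_first=True)
print(encoded.columns.tolist())  # ['Age', 'Gender_M']
```

Encoding would normally run on the raw features before standardization, since StandardScaler expects numeric input.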

<h2>Model Training</h2>

In [10]:
# Splitting the data into training and test sets in an 80/20 ratio
X_train, X_test, Y_train, Y_test = train_test_split(X_standardized, Y, test_size=0.2, random_state=42)

In [11]:
# Initializing and fitting the logistic regression model
lr_model = LogisticRegression()
lr_model.fit(X_train, Y_train)
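
-> In this notebook the scaler is fitted on all 768 rows before the split, so the test rows contribute to the scaling statistics. A possible alternative, sketched below, is to split first and let a pipeline fit the scaler on the training rows only. The data here is a synthetic stand-in from make_classification (an assumption, not the diabetes data), and stratify is an optional extra that keeps the class ratio equal in both sets.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 768 rows, 8 features
X_demo, y_demo = make_classification(n_samples=768, n_features=8, random_state=42)

# Split first so the test rows stay unseen during preprocessing
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Fitting the pipeline fits the scaler on the training rows only;
# at prediction time the same training statistics are applied to new rows
pipeline = make_pipeline(StandardScaler(), LogisticRegression())
pipeline.fit(X_tr, y_tr)

test_accuracy = pipeline.score(X_te, y_te)
print(test_accuracy)
```

Applying this to the diabetes data would likely change the reported scores slightly, so the numbers below refer to the original setup only.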

In [12]:
# Predicting the target variable
Y_pred = lr_model.predict(X_test)
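
-> predict() returns hard class labels. For a binary logistic regression these correspond to thresholding the predicted probability of class 1 at 0.5. The sketch below illustrates this on synthetic stand-in data (an assumption) and shows how a lower, arbitrarily chosen threshold of 0.3 flags more cases as positive.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (assumption), not the diabetes dataset
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)
model = LogisticRegression().fit(X_demo, y_demo)

# Column 1 of predict_proba holds the estimated P(class = 1)
proba_pos = model.predict_proba(X_demo)[:, 1]

# Thresholding at 0.5 reproduces the labels returned by predict()
default_pred = (proba_pos > 0.5).astype(int)
print(np.array_equal(default_pred, model.predict(X_demo)))

# A lower threshold (0.3 is an arbitrary illustration) flags at least as many positives
lenient_pred = (proba_pos >= 0.3).astype(int)
print(default_pred.sum(), lenient_pred.sum())
```

Lowering the threshold typically raises recall at the cost of precision, which matters when missed diabetes cases are costlier than false alarms.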

<h2>Model Evaluation</h2>

<h3>1. Accuracy Score</h3>

In [13]:
# Computing the accuracy of the model
accuracy = accuracy_score(Y_test, Y_pred)
print("Accuracy of the Model:", accuracy)

Accuracy of the Model: 0.7532467532467533


<h3>2. Precision Score</h3>

In [14]:
# Computing the precision of the model
precision = precision_score(Y_test, Y_pred)
print("Precision of the Model:", precision)

Precision of the Model: 0.6491228070175439


<h3>3. Recall Score</h3>

In [15]:
# Computing the recall of the model
recall = recall_score(Y_test, Y_pred)
print("Recall score of the Model:", recall)

Recall score of the Model: 0.6727272727272727


<ul>
    <li><b>Accuracy (0.753):</b> The model classifies 75.3% of the test samples correctly.</li>
    <li><b>Precision (0.649):</b> Of the cases predicted as diabetic, 64.9% are actually diabetic.</li>
    <li><b>Recall (0.673):</b> The model identifies 67.3% of the actual diabetes cases.</li>
</ul>

-> The model performs moderately well but has room for improvement: roughly 35% of its positive predictions are false positives, and it misses roughly 33% of the actual diabetes cases (false negatives).
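
-> The three scores can be traced back to confusion-matrix counts. The counts below are not printed anywhere in this notebook; they are inferred from the reported scores, assuming the test set holds 154 samples (20% of 768, rounded up). They reproduce all three values.

```python
# Confusion-matrix counts inferred from the reported scores (assumption:
# 154 test samples); they are not an output of the notebook itself
tp, fp, fn, tn = 37, 20, 18, 79

accuracy = (tp + tn) / (tp + fp + fn + tn)  # correct predictions / all predictions
precision = tp / (tp + fp)                  # true positives / predicted positives
recall = tp / (tp + fn)                     # true positives / actual positives

print(round(accuracy, 3), round(precision, 3), round(recall, 3))
# 0.753 0.649 0.673
```

Under these counts, 20 of the 57 positive predictions are false alarms and 18 of the 55 actual diabetes cases are missed.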

<hr>