# **Diabetes Prediction using Logistic Regression**

# **Get Understanding about Dataset**

## **There are total 9 columns with 8 independent parameters and 1 outcome parameter**

1. **Pregnancies** - Number of times pregnant

2. **Glucose** - Oral Glucose Tolerance Test : The glucose tolerance test, also known as the oral glucose tolerance test, measures your body's response to sugar (glucose).

3. **Blood Pressure** - Blood pressure is the pressure of blood pushing against the walls of your arteries. Arteries carry blood from your heart to other parts of your body.

-- **Blood pressure is measured using two numbers:**

- The first number, called **systolic** blood pressure, measures the pressure in your arteries when your heart beats.

- The second number, called **diastolic** blood pressure, measures the pressure in your arteries when your heart rests between beats.

- **Normal : Lower than 80**
- **Stage 1 hypertension : 80-89**
- **Stage 2 hypertension : 90 or above**
- **Hypertensive crisis : 120 or above (Most people with diabetes will eventually have high blood pressure)** 

4. **Skin Thickness** - Skin thickness is primarily determined by collagen content and is increased in insulin-dependent diabetes mellitus (IDDM).

5. **Insulin** - People with type 2 diabetes or gestational diabetes need insulin therapy if other treatments haven't been able to keep blood glucose levels within the desired range. Insulin therapy helps prevent diabetes complications by keeping your blood sugar within your target range.

6. **BMI** - Body mass index (BMI) is a measure of body fat based on height and weight that applies to adult men and women. Any increase in BMI above normal weight levels is associated with an increased risk of being diagnosed as having complications of diabetes mellitus.

7. **Diabetes Pedigree Function** - indicates the function which scores likelihood of diabetes based on family history. 

8. **Age** - indicates the age of the person (in years)

9. **Outcome** - indicates if the patient had a diabetes or not (1 = yes, 0 = no)

# **Import Library**

In [2]:
import pandas as pd

In [3]:
import numpy as np

# **Import CSV as DataFrame**

In [4]:
df = pd.read_csv('https://github.com/YBI-Foundation/Dataset/raw/main/Diabetes.csv')

# **Analyzing the data**

# **Displaying the first 5 rows of DataFrame**

In [5]:
df.head()

Unnamed: 0,pregnancies,glucose,diastolic,triceps,insulin,bmi,dpf,age,diabetes
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


# **Detailed Information of DataFrame**

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   pregnancies  768 non-null    int64  
 1   glucose      768 non-null    int64  
 2   diastolic    768 non-null    int64  
 3   triceps      768 non-null    int64  
 4   insulin      768 non-null    int64  
 5   bmi          768 non-null    float64
 6   dpf          768 non-null    float64
 7   age          768 non-null    int64  
 8   diabetes     768 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 54.1 KB


# **Getting the Summary Statistics**

In [7]:
df.describe()

Unnamed: 0,pregnancies,glucose,diastolic,triceps,insulin,bmi,dpf,age,diabetes
count,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0
mean,3.845052,120.894531,69.105469,20.536458,79.799479,31.992578,0.471876,33.240885,0.348958
std,3.369578,31.972618,19.355807,15.952218,115.244002,7.88416,0.331329,11.760232,0.476951
min,0.0,0.0,0.0,0.0,0.0,0.0,0.078,21.0,0.0
25%,1.0,99.0,62.0,0.0,0.0,27.3,0.24375,24.0,0.0
50%,3.0,117.0,72.0,23.0,30.5,32.0,0.3725,29.0,0.0
75%,6.0,140.25,80.0,32.0,127.25,36.6,0.62625,41.0,1.0
max,17.0,199.0,122.0,99.0,846.0,67.1,2.42,81.0,1.0


# **Column Names**

In [8]:
df.columns

Index(['pregnancies', 'glucose', 'diastolic', 'triceps', 'insulin', 'bmi',
       'dpf', 'age', 'diabetes'],
      dtype='object')

# **Shape of DataFrame - displays total no. of rows and cols**

In [9]:
df.shape

(768, 9)

# **Get unique values (Class or Label) in y variable**

In [10]:
df['diabetes'].value_counts()

0    500
1    268
Name: diabetes, dtype: int64

In [11]:
df.groupby('diabetes').mean()

Unnamed: 0_level_0,pregnancies,glucose,diastolic,triceps,insulin,bmi,dpf,age
diabetes,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
0,3.298,109.98,68.184,19.664,68.792,30.3042,0.429734,31.19
1,4.865672,141.257463,70.824627,22.164179,100.335821,35.142537,0.5505,37.067164


# **Define X and Y**

## **X - (Features or Independent or Attribute Variable)**
## **Y - (Label or Dependent or Target Variable)**

In [12]:
X = df[['pregnancies', 'glucose', 'diastolic', 'triceps', 'insulin', 'bmi',
       'dpf', 'age',]]

In [13]:
X.shape

(768, 8)

In [14]:
X

Unnamed: 0,pregnancies,glucose,diastolic,triceps,insulin,bmi,dpf,age
0,6,148,72,35,0,33.6,0.627,50
1,1,85,66,29,0,26.6,0.351,31
2,8,183,64,0,0,23.3,0.672,32
3,1,89,66,23,94,28.1,0.167,21
4,0,137,40,35,168,43.1,2.288,33
...,...,...,...,...,...,...,...,...
763,10,101,76,48,180,32.9,0.171,63
764,2,122,70,27,0,36.8,0.340,27
765,5,121,72,23,112,26.2,0.245,30
766,1,126,60,0,0,30.1,0.349,47


In [15]:
y = df['diabetes']

In [16]:
y.shape

(768,)

In [17]:
y

0      1
1      0
2      1
3      0
4      1
      ..
763    0
764    0
765    0
766    1
767    0
Name: diabetes, Length: 768, dtype: int64

# **Get X variables standardized**

In [18]:
from sklearn.preprocessing import MinMaxScaler

In [19]:
mm = MinMaxScaler()

In [20]:
X = mm.fit_transform(X)

In [21]:
X

array([[0.35294118, 0.74371859, 0.59016393, ..., 0.50074516, 0.23441503,
        0.48333333],
       [0.05882353, 0.42713568, 0.54098361, ..., 0.39642325, 0.11656704,
        0.16666667],
       [0.47058824, 0.91959799, 0.52459016, ..., 0.34724292, 0.25362938,
        0.18333333],
       ...,
       [0.29411765, 0.6080402 , 0.59016393, ..., 0.390462  , 0.07130658,
        0.15      ],
       [0.05882353, 0.63316583, 0.49180328, ..., 0.4485842 , 0.11571307,
        0.43333333],
       [0.05882353, 0.46733668, 0.57377049, ..., 0.45305514, 0.10119556,
        0.03333333]])

# **Train Test Split Data**

In [22]:
from sklearn.model_selection import train_test_split

In [23]:
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size = 0.3, stratify = y, random_state = 202529)

In [24]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((537, 8), (231, 8), (537,), (231,))

# **Train the Model**

In [25]:
from sklearn.linear_model import LogisticRegression

In [26]:
lr = LogisticRegression()

In [27]:
lr.fit(X_train, y_train)

LogisticRegression()

# **Model Prediction**

In [28]:
y_pred = lr.predict(X_test)

In [29]:
y_pred.shape

(231,)

In [30]:
y_pred

array([1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0,
       1, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1,
       0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1,
       1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0,
       0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0,
       1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,
       0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0,
       0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0])

# **Get Probability of each predicted class**

In [31]:
lr.predict_proba(X_test)

array([[0.37133669, 0.62866331],
       [0.47994452, 0.52005548],
       [0.5582806 , 0.4417194 ],
       [0.63306346, 0.36693654],
       [0.51196175, 0.48803825],
       [0.71216516, 0.28783484],
       [0.77609618, 0.22390382],
       [0.37949613, 0.62050387],
       [0.74019308, 0.25980692],
       [0.71243995, 0.28756005],
       [0.22077358, 0.77922642],
       [0.84304697, 0.15695303],
       [0.87667311, 0.12332689],
       [0.89919611, 0.10080389],
       [0.71852137, 0.28147863],
       [0.2444943 , 0.7555057 ],
       [0.65785309, 0.34214691],
       [0.67390071, 0.32609929],
       [0.82016107, 0.17983893],
       [0.40597775, 0.59402225],
       [0.70736983, 0.29263017],
       [0.81196178, 0.18803822],
       [0.05596467, 0.94403533],
       [0.40581684, 0.59418316],
       [0.72869687, 0.27130313],
       [0.80177623, 0.19822377],
       [0.1551265 , 0.8448735 ],
       [0.88975685, 0.11024315],
       [0.73722465, 0.26277535],
       [0.32004603, 0.67995397],
       [0.

# **Model Evaluation**

In [32]:
from sklearn.metrics import confusion_matrix, classification_report

In [33]:
print(confusion_matrix(y_test, y_pred))

[[134  16]
 [ 35  46]]


In [34]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.79      0.89      0.84       150
           1       0.74      0.57      0.64        81

    accuracy                           0.78       231
   macro avg       0.77      0.73      0.74       231
weighted avg       0.78      0.78      0.77       231



# **Future Predictions**

## *Lets select a random sample from the existing dataset as a new value*

### Steps to Follow

1. Extract a random row using **sample function**
2. Separate X and Y
3. Standardize X
4. Predict

In [35]:
X_new = df.sample(1)

In [36]:
X_new

Unnamed: 0,pregnancies,glucose,diastolic,triceps,insulin,bmi,dpf,age,diabetes
598,1,173,74,0,0,36.8,0.088,38,1


In [37]:
X_new.shape

(1, 9)

In [38]:
X_new = X_new.drop('diabetes', axis = 1)

In [39]:
X_new

Unnamed: 0,pregnancies,glucose,diastolic,triceps,insulin,bmi,dpf,age
598,1,173,74,0,0,36.8,0.088,38


In [40]:
X_new.shape

(1, 8)

In [41]:
X_new = mm.fit_transform(X_new)

In [42]:
y_pred_new = lr.predict(X_new)

In [43]:
y_pred_new

array([0])

In [44]:
lr.predict_proba(X_new)

array([[0.99418575, 0.00581425]])