# <span style="color:purple;">**Diabetes Identification with Logistic Regression**

![image.png](attachment:dd130a27-cc5d-4a75-a5a8-29b70196d54c.png)

### **Use Logistic Regression to Predict Binary Class Problem**

## <span style="color:purple;">**Dataset Summary**

#### Nine (9) columns with eight (8) independent parameter and one (1) outcome parameter.

#### Seven hundred sixty eight (768) observations having two hundred sixty eight (268) positive for diabetes (1) and five hundred (500) negative for diabetes (0)

#### **1. Pregnancies:** Number of times pregnant
#### **2. Glucose:** Oral Glucose Tolerance Test result The glucose tolerance test is a lab test to check how your body moves sugar from the blood into tissues like muscle and fat. The test is often used to diagnose diabetes. The most common glucose tolerance test is the oral glucose tolerance test (OGTT). Before the test begins, a sample of blood will be taken. You will then be asked to drink a liquid containing a certain amount of glucose (usually 75 grams). Your blood will be taken again every 30 to 60 minutes after you drink the solution.
#### **3. BloodPressure:** Diastolic Blood Pressure values in (mm Hg) The diastolic reading, or the bottom number, is the pressure in the arteries when the heart rests between beats. This is the time when the heart fills with blood and gets oxygen. This is what your diastolic blood pressure number means:
* #### Normal: Lower than 80
* #### Stage 1 hypertension: 80-89
* #### Stage 2 hypertension: 90 or more
* #### Hypertensive crisis: 120 or more Most people with diabetes will eventually have high     blood pressure.

#### **4. Skin Thickness:** Triceps skin fold thickness in (mm) Skinfold thickness, so that a prediction of the total amount of body fat can be made.The triceps skinfold is necessary for calculating the upper arm muscle circumference. Its thickness gives information about the fat reserves of the body, whereas the calculated muscle mass gives information about the protein reserves. For adults, the standard normal values for triceps skinfolds are 2.5mm (men) or about 20% fat; 18.0mm (women) or about 30% fat. Measurement half, or less, of these values represent about the 15th percentile and can be considered as either borderline, or fat depleted. Values over 20mm (men) and 30mm (women) represent about the 85th percentile, and can be considered.

#### **5. Insulin:** 2-Hour serum insulin (mu U/ml) Insulin is a hormone that helps move blood sugar, known as glucose, from your bloodstream into your cells. 2-hour Serum Insulin: Greater than 150 mu U/ml relates to insulin therapy Insulin therapy is a critical part of treatment for people with type 1 diabetes and also for many with type 2 diabetes. The goal of insulin therapy is to keep your blood sugar levels within a target range.

#### **6. BMI:** Body mass index The Body Mass Index (BMI) provides a simple, yet accurate method of assessing whether a patient is at risk from either over-or-underweight. However, a proportionally greater lean body mass and/or skeletal frame size can contribute to apparent excess body weight. Many athletes, for example would be considered 'overweight', yet skin-fold tests show a sub-normal amount of adipose tissue. It can easily be calculated by dividing the patient's weight (kg) by the square of their height (meters). BMI= weight(kg)/(height(m)]2

#### **7. Diabetes Pedigree Function:** Diabetes pedigree function Diabetes Pedigree Function, it provided some data on diabetes mellitus history in relatives and the genetic relationship of those relatives to the patient. This measure of genetic influence gave us an idea of the hereditary risk one might have with the onset of diabetes mellitus.

#### **8. Age:** Age in years

#### **9. Outcome:** Class 1 indicates person having diabetes and 0 indicates other.****

## <span style="color:purple;">**Import Library**

In [1]:
import pandas as pd
import numpy as np

## <span style="color:purple;">**Import CSV as DataFrame**

In [2]:
df= pd.read_csv('https://github.com/YBI-Foundation/Dataset/raw/main/Diabetes.csv')

## <span style="color:purple;">**Get the First Five Rows of DataFrame**

In [3]:
df.head()

Unnamed: 0,pregnancies,glucose,diastolic,triceps,insulin,bmi,dpf,age,diabetes
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


## <span style="color:purple;">**Get Information of DataFrame**

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   pregnancies  768 non-null    int64  
 1   glucose      768 non-null    int64  
 2   diastolic    768 non-null    int64  
 3   triceps      768 non-null    int64  
 4   insulin      768 non-null    int64  
 5   bmi          768 non-null    float64
 6   dpf          768 non-null    float64
 7   age          768 non-null    int64  
 8   diabetes     768 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 54.1 KB


## <span style="color:purple;">**Get the Summary Statistics**

In [5]:
df.describe()

Unnamed: 0,pregnancies,glucose,diastolic,triceps,insulin,bmi,dpf,age,diabetes
count,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0
mean,3.845052,120.894531,69.105469,20.536458,79.799479,31.992578,0.471876,33.240885,0.348958
std,3.369578,31.972618,19.355807,15.952218,115.244002,7.88416,0.331329,11.760232,0.476951
min,0.0,0.0,0.0,0.0,0.0,0.0,0.078,21.0,0.0
25%,1.0,99.0,62.0,0.0,0.0,27.3,0.24375,24.0,0.0
50%,3.0,117.0,72.0,23.0,30.5,32.0,0.3725,29.0,0.0
75%,6.0,140.25,80.0,32.0,127.25,36.6,0.62625,41.0,1.0
max,17.0,199.0,122.0,99.0,846.0,67.1,2.42,81.0,1.0


## <span style="color:purple;">**Get Column Names**

In [6]:
df.columns

Index(['pregnancies', 'glucose', 'diastolic', 'triceps', 'insulin', 'bmi',
       'dpf', 'age', 'diabetes'],
      dtype='object')

## <span style="color:purple;">**Get Shape of DataFrame**

In [7]:
df.shape

(768, 9)

## <span style="color:purple;">**Get Unique Values (Class or Label) in Y Variable**

In [8]:
df['diabetes'].value_counts()

diabetes
0    500
1    268
Name: count, dtype: int64

In [9]:
df.groupby('diabetes').mean()

Unnamed: 0_level_0,pregnancies,glucose,diastolic,triceps,insulin,bmi,dpf,age
diabetes,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
0,3.298,109.98,68.184,19.664,68.792,30.3042,0.429734,31.19
1,4.865672,141.257463,70.824627,22.164179,100.335821,35.142537,0.5505,37.067164


## <span style="color:purple;">**Define Y (dependent or label or target variable) and X (independent or features or attribute variable)**

In [10]:
y= df['diabetes']

In [11]:
y.shape

(768,)

In [12]:
y

0      1
1      0
2      1
3      0
4      1
      ..
763    0
764    0
765    0
766    1
767    0
Name: diabetes, Length: 768, dtype: int64

In [13]:
x= df[['pregnancies', 'glucose', 'diastolic', 'triceps', 'insulin', 'bmi',
       'dpf', 'age']]

### **Or Use .Drop function to Define X**

In [14]:
x= df.drop('diabetes', axis= 1)

In [15]:
x.shape

(768, 8)

In [16]:
x

Unnamed: 0,pregnancies,glucose,diastolic,triceps,insulin,bmi,dpf,age
0,6,148,72,35,0,33.6,0.627,50
1,1,85,66,29,0,26.6,0.351,31
2,8,183,64,0,0,23.3,0.672,32
3,1,89,66,23,94,28.1,0.167,21
4,0,137,40,35,168,43.1,2.288,33
...,...,...,...,...,...,...,...,...
763,10,101,76,48,180,32.9,0.171,63
764,2,122,70,27,0,36.8,0.340,27
765,5,121,72,23,112,26.2,0.245,30
766,1,126,60,0,0,30.1,0.349,47


###  <span style="color:purple;">**Get X Variables Standardized**
    
#### Standardization of datasets is a common requirement for many machine learning estimators implemented in scikit-learn; they might behave badly if the individual features do not more or less look like standard normally distributed data: Gaussian with zero mean and unit variance.
#### Next approach is go for MinMax Scaler

In [17]:
from sklearn.preprocessing import MinMaxScaler

In [18]:
mm= MinMaxScaler()

In [19]:
x= mm.fit_transform(x)
x

array([[0.35294118, 0.74371859, 0.59016393, ..., 0.50074516, 0.23441503,
        0.48333333],
       [0.05882353, 0.42713568, 0.54098361, ..., 0.39642325, 0.11656704,
        0.16666667],
       [0.47058824, 0.91959799, 0.52459016, ..., 0.34724292, 0.25362938,
        0.18333333],
       ...,
       [0.29411765, 0.6080402 , 0.59016393, ..., 0.390462  , 0.07130658,
        0.15      ],
       [0.05882353, 0.63316583, 0.49180328, ..., 0.4485842 , 0.11571307,
        0.43333333],
       [0.05882353, 0.46733668, 0.57377049, ..., 0.45305514, 0.10119556,
        0.03333333]])

### <span style="color:purple;">**Get Train Test Split**

In [20]:
from sklearn.model_selection import train_test_split

In [21]:
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size= 0.3, stratify= y,random_state= 2529)

In [22]:
x_train.shape, x_test.shape, y_train.shape, y_test.shape

((537, 8), (231, 8), (537,), (231,))

### <span style="color:purple;">**Get Model Train**

In [23]:
from sklearn.linear_model import LogisticRegression

In [24]:
lr = LogisticRegression()

In [25]:
lr.fit(x_train, y_train)

### <span style="color:purple;">**Get Model Prediction**

In [26]:
y_pred= lr.predict(x_test)

In [27]:
y_pred.shape

(231,)

In [28]:
y_pred

array([0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,
       0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1,
       1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0,
       1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
       0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1,
       0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0,
       0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1,
       0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1,
       1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0])

### <span style="color:purple;">**Get Probability of Each Predicted Class**

In [29]:
lr.predict_proba(x_test)

array([[0.71101198, 0.28898802],
       [0.80246044, 0.19753956],
       [0.50085081, 0.49914919],
       [0.8745601 , 0.1254399 ],
       [0.84313967, 0.15686033],
       [0.72965238, 0.27034762],
       [0.32611128, 0.67388872],
       [0.82905388, 0.17094612],
       [0.57764733, 0.42235267],
       [0.5794767 , 0.4205233 ],
       [0.90475455, 0.09524545],
       [0.42428281, 0.57571719],
       [0.81659611, 0.18340389],
       [0.86057018, 0.13942982],
       [0.55629153, 0.44370847],
       [0.83208198, 0.16791802],
       [0.40636481, 0.59363519],
       [0.8430081 , 0.1569919 ],
       [0.6035823 , 0.3964177 ],
       [0.51982645, 0.48017355],
       [0.65174255, 0.34825745],
       [0.89662971, 0.10337029],
       [0.88362346, 0.11637654],
       [0.50529753, 0.49470247],
       [0.74048922, 0.25951078],
       [0.38010129, 0.61989871],
       [0.86743064, 0.13256936],
       [0.63051126, 0.36948874],
       [0.54593476, 0.45406524],
       [0.16753486, 0.83246514],
       [0.

### <span style="color:purple;">**Get Model Evaluation**

In [30]:
from sklearn.metrics import confusion_matrix, classification_report

In [31]:
print(confusion_matrix(y_test,y_pred))

[[136  14]
 [ 37  44]]


In [32]:
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           0       0.79      0.91      0.84       150
           1       0.76      0.54      0.63        81

    accuracy                           0.78       231
   macro avg       0.77      0.72      0.74       231
weighted avg       0.78      0.78      0.77       231



### <span style="color:purple;">**Get Future Predictions**
    
#### Lets select a random sample from existing dataset as new value or patient
    
##### Steps to follow
##### 1. Extract a random row using sample function
##### 2. Seperate X and Y
##### 3. Standardize X
##### 4. Predict    

In [33]:
x_new = df.sample(1)
x_new

Unnamed: 0,pregnancies,glucose,diastolic,triceps,insulin,bmi,dpf,age,diabetes
748,3,187,70,22,200,36.4,0.408,36,1


In [34]:
x_new.shape

(1, 9)

In [35]:
x_new= x_new.drop('diabetes', axis =1)
x_new

Unnamed: 0,pregnancies,glucose,diastolic,triceps,insulin,bmi,dpf,age
748,3,187,70,22,200,36.4,0.408,36


In [36]:
x_new.shape

(1, 8)

In [37]:
x_new= mm.fit_transform(x_new)

In [38]:
y_pred_new = lr.predict(x_new)

In [39]:
y_pred_new

array([0])

In [40]:
lr.predict_proba(x_new)

array([[0.99508059, 0.00491941]])

### **Predicted and Actual Class is Zero (0) that is Non-Diabetic**