# 🩺 Diabetes Prediction using Machine Learning with Python

In this project, we develop a **machine learning model** to predict the likelihood of diabetes in patients based on medical data. Using a **classification algorithm** (a support vector machine), we aim to build an accurate predictive model that can assist in early detection and intervention.


In [1]:
# Import necessary libraries
import numpy as np  # For numerical operations
import pandas as pd  # For data manipulation and analysis
from sklearn.preprocessing import StandardScaler  # For feature scaling
from sklearn import svm  # For support vector machine models
from sklearn.metrics import accuracy_score  # For evaluating model performance
import warnings  # To manage warnings
# Suppress warnings to keep the output clean
warnings.filterwarnings('ignore')


# Data Collection and Analysis

In [2]:
# Load the dataset into a Pandas DataFrame

df = pd.read_csv("diabetes.csv")


In [3]:
# Display the first 5 rows of the dataset

df.head()


,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


In [4]:
# Display the last 5 rows of the dataset

df.tail()

,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
763,10,101,76,48,180,32.9,0.171,63,0
764,2,122,70,27,0,36.8,0.34,27,0
765,5,121,72,23,112,26.2,0.245,30,0
766,1,126,60,0,0,30.1,0.349,47,1
767,1,93,70,31,0,30.4,0.315,23,0


In [5]:
# Check the number of rows and columns in the dataset

df.shape


(768, 9)

In [6]:
# Get statistical summary of the dataset

df.describe()


,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
count,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0
mean,3.845052,120.894531,69.105469,20.536458,79.799479,31.992578,0.471876,33.240885,0.348958
std,3.369578,31.972618,19.355807,15.952218,115.244002,7.88416,0.331329,11.760232,0.476951
min,0.0,0.0,0.0,0.0,0.0,0.0,0.078,21.0,0.0
25%,1.0,99.0,62.0,0.0,0.0,27.3,0.24375,24.0,0.0
50%,3.0,117.0,72.0,23.0,30.5,32.0,0.3725,29.0,0.0
75%,6.0,140.25,80.0,32.0,127.25,36.6,0.62625,41.0,1.0
max,17.0,199.0,122.0,99.0,846.0,67.1,2.42,81.0,1.0
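
The `min` row above shows zeros for Glucose, BloodPressure, SkinThickness, Insulin and BMI. Such values are not plausible measurements and most likely mark missing data. Below is a minimal sketch of how these zeros could be counted and flagged, using only the five rows shown by `df.head()`. The column list `zero_as_missing` is an assumption made for illustration; the notebook itself does not treat zeros specially.

```python
import numpy as np
import pandas as pd

# The five rows printed by df.head() earlier in this notebook
sample = pd.DataFrame({
    "Glucose": [148, 85, 183, 89, 137],
    "BloodPressure": [72, 66, 64, 66, 40],
    "SkinThickness": [35, 29, 0, 23, 35],
    "Insulin": [0, 0, 0, 94, 168],
    "BMI": [33.6, 26.6, 23.3, 28.1, 43.1],
})

# Assumption: a zero in these columns means "not measured"
zero_as_missing = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]

# Count zeros per column
zero_counts = (sample[zero_as_missing] == 0).sum()
print(zero_counts)

# Replace zeros with NaN so they can be imputed or dropped later
cleaned = sample.copy()
cleaned[zero_as_missing] = cleaned[zero_as_missing].replace(0, np.nan)
print(cleaned.isna().sum())
```

On the full dataset the same two steps would be applied to `df` instead of `sample`.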


In [7]:
# Count the number of occurrences of each unique value in the 'Outcome' column

df['Outcome'].value_counts()


Outcome
0    500
1    268
Name: count, dtype: int64
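
The classes are imbalanced: 500 non-diabetic versus 268 diabetic records. A useful reference point for any accuracy reported later is the score of a trivial model that always predicts the majority class. The sketch below computes that baseline from the counts printed above.

```python
# Class counts taken from the value_counts() output above
counts = {0: 500, 1: 268}

total = sum(counts.values())
majority_class = max(counts, key=counts.get)

# Accuracy of always predicting the majority class
baseline_accuracy = counts[majority_class] / total

print(f"Majority class: {majority_class}")
print(f"Baseline accuracy: {baseline_accuracy:.3f}")  # 500 / 768
```

A classifier is only informative to the extent that it beats this baseline of roughly 65%.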

In [8]:
# Calculate the mean of each numerical column, grouped by 'Outcome'

df.groupby('Outcome').mean()


Outcome,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age
0,3.298,109.98,68.184,19.664,68.792,30.3042,0.429734,31.19
1,4.865672,141.257463,70.824627,22.164179,100.335821,35.142537,0.5505,37.067164


# 🔍 Observing Glucose Levels in Diabetics

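The group means above show that **glucose levels** are noticeably higher in individuals diagnosed with diabetes: about 141 mg/dL on average, compared with about 110 mg/dL for those without. The diabetic group is also older on average (about 37 versus 31 years) and has a higher mean BMI.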
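
To put the glucose difference between the two groups in perspective, it can be expressed in units of the overall standard deviation. The sketch below uses only numbers copied from the `groupby` and `describe` outputs above; the variable names are illustrative.

```python
# Values copied from the groupby('Outcome').mean() and describe() outputs
glucose_mean_diabetic = 141.257463
glucose_mean_non_diabetic = 109.98
glucose_std_overall = 31.972618

# Absolute gap between the two group means
gap = glucose_mean_diabetic - glucose_mean_non_diabetic

# The same gap measured in overall standard deviations
gap_in_std_units = gap / glucose_std_overall

print(f"Gap in mean glucose: {gap:.2f}")
print(f"Gap in standard deviations: {gap_in_std_units:.2f}")
```

A gap of close to one standard deviation suggests glucose is among the more informative features in this dataset.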

# 🏷️ Separating Features and Target Labels

- **Features**: These are the input data points used to predict diabetes (e.g., glucose levels, age, BMI, etc.).
- **Target Labels**: The actual outcome or result (whether the patient has diabetes or not).



In [10]:
# Separate the dataset into features (X) and target labels (Y)

X = df.drop(columns='Outcome', axis=1)  # Features: all columns except 'Outcome'
Y = df['Outcome']  # Target labels: 'Outcome' column


In [11]:
print(X.head())

   Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  \
0            6      148             72             35        0  33.6   
1            1       85             66             29        0  26.6   
2            8      183             64              0        0  23.3   
3            1       89             66             23       94  28.1   
4            0      137             40             35      168  43.1   

   DiabetesPedigreeFunction  Age  
0                     0.627   50  
1                     0.351   31  
2                     0.672   32  
3                     0.167   21  
4                     2.288   33  


In [12]:
print(Y)

0      1
1      0
2      1
3      0
4      1
      ..
763    0
764    0
765    0
766    1
767    0
Name: Outcome, Length: 768, dtype: int64


# Data Standardization

# ⚖️ Feature Scaling with StandardScaler

To ensure that all features contribute equally to the machine learning model, we apply **feature scaling**. This process standardizes the data, so that each feature has a mean of 0 and a standard deviation of 1.
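
Standardization computes z = (x - mean) / std for every value. As a worked check, the sketch below reproduces the standardized Glucose value of the first patient (148) from the summary statistics printed by `df.describe()`. One detail matters: `describe()` reports the sample standard deviation (`ddof=1`), whereas `StandardScaler` divides by the population standard deviation (`ddof=0`), so the sketch converts between the two.

```python
import math

n = 768                          # number of rows in the dataset
glucose_mean = 120.894531        # from df.describe()
glucose_sample_std = 31.972618   # from df.describe(), computed with ddof=1

# Convert the sample std (ddof=1) to the population std (ddof=0)
glucose_population_std = glucose_sample_std * math.sqrt((n - 1) / n)

# z-score of the first patient's glucose reading
z = (148 - glucose_mean) / glucose_population_std
print(round(z, 4))
```

The result agrees with the Glucose entry in the first row of the standardized array printed later in this notebook (0.84832379).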

# 🔧 Creating a StandardScaler Instance

```python
scaler = StandardScaler()  # Creates a scaler that standardizes each feature
```


In [13]:
# Create an instance of StandardScaler for feature scaling

scaler = StandardScaler()


In [14]:
# Fit the StandardScaler to the feature data (X) to compute the scaling parameters

scaler.fit(X)


In [15]:
# Apply the scaling transformation to the feature data (X) using the fitted scaler

standardized_data = scaler.transform(X)


In [16]:
print(standardized_data)

[[ 0.63994726  0.84832379  0.14964075 ...  0.20401277  0.46849198
   1.4259954 ]
 [-0.84488505 -1.12339636 -0.16054575 ... -0.68442195 -0.36506078
  -0.19067191]
 [ 1.23388019  1.94372388 -0.26394125 ... -1.10325546  0.60439732
  -0.10558415]
 ...
 [ 0.3429808   0.00330087  0.14964075 ... -0.73518964 -0.68519336
  -0.27575966]
 [-0.84488505  0.1597866  -0.47073225 ... -0.24020459 -0.37110101
   1.17073215]
 [-0.84488505 -0.8730192   0.04624525 ... -0.20212881 -0.47378505
  -0.87137393]]


In [17]:
X = standardized_data
Y = df['Outcome']

In [18]:
print(X)

[[ 0.63994726  0.84832379  0.14964075 ...  0.20401277  0.46849198
   1.4259954 ]
 [-0.84488505 -1.12339636 -0.16054575 ... -0.68442195 -0.36506078
  -0.19067191]
 [ 1.23388019  1.94372388 -0.26394125 ... -1.10325546  0.60439732
  -0.10558415]
 ...
 [ 0.3429808   0.00330087  0.14964075 ... -0.73518964 -0.68519336
  -0.27575966]
 [-0.84488505  0.1597866  -0.47073225 ... -0.24020459 -0.37110101
   1.17073215]
 [-0.84488505 -0.8730192   0.04624525 ... -0.20212881 -0.47378505
  -0.87137393]]


In [19]:
print(Y)

0      1
1      0
2      1
3      0
4      1
      ..
763    0
764    0
765    0
766    1
767    0
Name: Outcome, Length: 768, dtype: int64


# ✂️ Train-Test Split

Before building the model, we divide the dataset into two parts:
- **Training set**: Used to train the machine learning model.
- **Test set**: Used to evaluate the model's performance on unseen data.

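Evaluating on held-out data shows how well the model generalizes to new data and makes overfitting visible: a model that scores much higher on the training set than on the test set has memorized rather than learned.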

# 🚦 Performing Train-Test Split

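```python
from sklearn.model_selection import train_test_split

# Split features and labels into training and test sets
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, stratify=Y, random_state=2)
```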


In [21]:
from sklearn.model_selection import train_test_split

# Split the dataset into training and testing sets
X_train, X_test, Y_train, Y_test = train_test_split(
    X,             # Feature data to be split
    Y,             # Target labels to be split
    test_size=0.2, # Proportion of the dataset to use for testing (20%)
    stratify=Y,    # Ensure that the distribution of the target labels is preserved in both train and test sets
    random_state=2 # Seed for random number generation to ensure reproducibility
)


In [22]:
print(X.shape, X_train.shape, X_test.shape)

(768, 8) (614, 8) (154, 8)
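
The shapes above follow directly from `test_size=0.2`. To the best of my understanding, scikit-learn rounds the test share up and assigns the remaining rows to the training set, which the short sketch below reproduces with plain arithmetic.

```python
import math

n_samples = 768
test_size = 0.2

# The test share is rounded up; the remainder goes to training
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)  # 614 154
```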


# 🏋️‍♂️ Training the Model

After splitting the data into training and testing sets, we proceed to train the machine learning model using the training data. This process involves feeding the model with the feature set (input data) and the target labels (output data) so that it can learn the underlying patterns.

In [23]:
# Create a Support Vector Classifier with a linear kernel

classifier = svm.SVC(kernel='linear')


In [24]:
# Training the Support Vector Machine classifier (SVC)

In [25]:
# Train the Support Vector Classifier using the training data

classifier.fit(X_train, Y_train)
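
One caveat about the workflow so far: the scaler was fitted on all 768 rows before the split, so the test rows influenced the scaling statistics. A common alternative is to split first and let a pipeline fit the scaler on the training rows only. The sketch below illustrates this on synthetic data from `make_classification`, since it has to run on its own; the names `X_demo`, `y_demo` and `model` are hypothetical and the printed accuracy says nothing about the diabetes dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the diabetes data: 768 rows, 8 features (hypothetical)
X_demo, y_demo = make_classification(n_samples=768, n_features=8, random_state=2)

# Split BEFORE any scaling takes place
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=2
)

# The pipeline fits the scaler on the training split only, then the SVM
model = make_pipeline(StandardScaler(), SVC(kernel="linear"))
model.fit(X_tr, y_tr)

# score() scales X_te with the training statistics before predicting
test_accuracy = model.score(X_te, y_te)
print(f"Test accuracy on synthetic data: {test_accuracy:.3f}")
```

In this notebook the equivalent change would be to pass the unscaled features to `train_test_split` and call `fit` on a pipeline with `X_train` and `Y_train`.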


# Model Evaluation

In [26]:
# Calculate the accuracy score on the training dataset

In [27]:
# Predict the labels for the training data
X_train_prediction = classifier.predict(X_train)

# Calculate the accuracy of the classifier on the training data
training_data_accuracy = accuracy_score(Y_train, X_train_prediction)

In [28]:
print('Accuracy Score of the training Data: ', training_data_accuracy)

Accuracy Score of the training Data:  0.7866449511400652


In [29]:
# Predict the labels for the test data
X_test_prediction = classifier.predict(X_test)

# Calculate the accuracy of the classifier on the test data
testing_data_accuracy = accuracy_score(Y_test, X_test_prediction)

# Print the accuracy score of the test data
print('Accuracy Score of the Test Data:', testing_data_accuracy)


Accuracy Score of the Test Data: 0.7727272727272727
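
A test accuracy of 0.7727 corresponds to 119 correct predictions out of 154. Because the classes are imbalanced, accuracy alone does not reveal how the errors are distributed between missed diabetics and false alarms. A confusion matrix with precision and recall does. The sketch below uses hypothetical toy labels so that it runs on its own; they are not the notebook's actual predictions.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical toy labels, NOT the notebook's actual predictions
y_true = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0, 1, 0]

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()
print(cm)

# Precision: share of predicted diabetics that are truly diabetic
print("Precision:", precision_score(y_true, y_pred))
# Recall: share of true diabetics that the model found
print("Recall:", recall_score(y_true, y_pred))
```

In the notebook, the same calls would take `Y_test` and `X_test_prediction` as arguments.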


# Making a Predictive System

In [30]:
import numpy as np

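# Sample input data, in the same order as the feature columns.
# Note: these arbitrary values lie far outside the ranges seen in training
# (e.g. Pregnancies = 145.78), so the prediction below is an extrapolation.
input_data = (145.78, 32.14, 89.56, 167.32, 49.91, 132.48, 178.67, 61.22)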

# Changing the input_data to a NumPy array
input_data_np_array = np.asarray(input_data)

# Reshape the array as we are predicting for one instance
input_data_reshape = input_data_np_array.reshape(1, -1)

# Standardize the input data (assuming scaler has been defined and fitted earlier)
std_data = scaler.transform(input_data_reshape)

# Print standardized data
print("Standardized Data:", std_data)

# Make a prediction (assuming classifier has been trained and defined earlier)
prediction = classifier.predict(std_data)

# Print the prediction result
print("Prediction:", prediction)

# Check if the prediction indicates diabetes
if prediction[0] == 0:
    print("The person is not diabetic.")
else:
    print("The person is diabetic.")


Standardized Data: [[ 4.21499194e+01 -2.77776346e+00  1.05745324e+00  9.20744697e+00
  -2.59527237e-01  1.27537878e+01  5.38179482e+02  2.38067999e+00]]
Prediction: [1]
The person is diabetic.
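
The standardized values above are extreme (for example 538 for DiabetesPedigreeFunction) because this sample input lies far outside the data the model was trained on. A simple guard is to compare each input against the minimum and maximum seen in training. The sketch below does this with the ranges copied from the `df.describe()` output; the helper `out_of_range_features` is a hypothetical name, not part of the notebook.

```python
# Feature ranges copied from the min and max rows of df.describe()
feature_ranges = {
    "Pregnancies": (0.0, 17.0),
    "Glucose": (0.0, 199.0),
    "BloodPressure": (0.0, 122.0),
    "SkinThickness": (0.0, 99.0),
    "Insulin": (0.0, 846.0),
    "BMI": (0.0, 67.1),
    "DiabetesPedigreeFunction": (0.078, 2.42),
    "Age": (21.0, 81.0),
}

def out_of_range_features(values):
    """Return the names of features whose value lies outside the training range."""
    return [
        name
        for (name, (low, high)), value in zip(feature_ranges.items(), values)
        if not low <= value <= high
    ]

sample_input = (145.78, 32.14, 89.56, 167.32, 49.91, 132.48, 178.67, 61.22)
flagged = out_of_range_features(sample_input)
print(flagged)
```

Four of the eight values are flagged, which is a good reason to treat the prediction for this input with caution.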


In [31]:
import numpy as np

# Sample input data
input_data = (4,110,92,0,0,37.6,0.191,30)

# Changing the input_data to a NumPy array
input_data_np_array = np.asarray(input_data)

# Reshape the array as we are predicting for one instance
input_data_reshape = input_data_np_array.reshape(1, -1)

# Standardize the input data (assuming scaler has been defined and fitted earlier)
std_data = scaler.transform(input_data_reshape)

# Print standardized data
print("Standardized Data:", std_data)

# Make a prediction (assuming classifier has been trained and defined earlier)
prediction = classifier.predict(std_data)

# Print the prediction result
print("Prediction:", prediction)

# Check if the prediction indicates diabetes
if prediction[0] == 0:
    print("The person is not diabetic.")
else:
    print("The person is diabetic.")


Standardized Data: [[ 0.04601433 -0.34096773  1.18359575 -1.28821221 -0.69289057  0.71168975
  -0.84827977 -0.27575966]]
Prediction: [0]
The person is not diabetic.
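
The prediction cells repeat the same steps: convert, reshape, scale, predict, print. These steps could be wrapped in one function. The sketch below is one possible shape for such a helper; `predict_diabetes` and `FEATURES` are hypothetical names. To stay runnable on its own it fits a throwaway scaler and model on the five rows shown by `df.head()`, which is far too little data for a meaningful model.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

FEATURES = ["Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
            "Insulin", "BMI", "DiabetesPedigreeFunction", "Age"]

def predict_diabetes(values, scaler, model):
    """Scale one record with a fitted scaler and classify it with a fitted model."""
    # A one-row DataFrame keeps the feature names the scaler was fitted with
    record = pd.DataFrame([values], columns=FEATURES)
    label = model.predict(scaler.transform(record))[0]
    return "diabetic" if label == 1 else "not diabetic"

# Throwaway demo fit on the five rows shown by df.head() (illustration only)
rows = [
    (6, 148, 72, 35, 0, 33.6, 0.627, 50),
    (1, 85, 66, 29, 0, 26.6, 0.351, 31),
    (8, 183, 64, 0, 0, 23.3, 0.672, 32),
    (1, 89, 66, 23, 94, 28.1, 0.167, 21),
    (0, 137, 40, 35, 168, 43.1, 2.288, 33),
]
labels = [1, 0, 1, 0, 1]
demo_X = pd.DataFrame(rows, columns=FEATURES)
demo_scaler = StandardScaler().fit(demo_X)
demo_model = SVC(kernel="linear").fit(demo_scaler.transform(demo_X), labels)

result = predict_diabetes((4, 110, 92, 0, 0, 37.6, 0.191, 30), demo_scaler, demo_model)
print(f"The person is {result}.")
```

With the notebook's own objects, the call would be `predict_diabetes(input_data, scaler, classifier)`.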


In [32]:
# Saving the Trained Model

In [33]:
# Importing the pickle module for model serialization and deserialization

import pickle


In [34]:
# Defining the filename to save the trained model

filename = "saved_model.sav"

# Saving the trained classifier model using pickle

with open(filename, 'wb') as f:
    pickle.dump(classifier, f)

In [35]:
# Loading the saved classifier model from the file

with open('saved_model.sav', 'rb') as f:
    loaded_model = pickle.load(f)
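
Only the classifier is saved here, yet every prediction also needs the fitted `scaler`, which the next cell still takes from memory. In a fresh session the loaded model alone would not be enough. One option is to pickle both objects together. The sketch below shows this with toy data and a temporary file; the bundle layout and the filename `model_bundle.pkl` are assumptions for illustration.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Toy data standing in for the real features and labels (hypothetical)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(40, 8))
y_toy = (X_toy[:, 1] > 0).astype(int)

toy_scaler = StandardScaler().fit(X_toy)
toy_model = SVC(kernel="linear").fit(toy_scaler.transform(X_toy), y_toy)

# Bundle both objects so they are always saved and loaded together
bundle = {"scaler": toy_scaler, "model": toy_model}
path = os.path.join(tempfile.mkdtemp(), "model_bundle.pkl")  # hypothetical filename
with open(path, "wb") as f:
    pickle.dump(bundle, f)

with open(path, "rb") as f:
    restored = pickle.load(f)

# Compare predictions from the original and the restored pair
original_pred = toy_model.predict(toy_scaler.transform(X_toy))
restored_pred = restored["model"].predict(restored["scaler"].transform(X_toy))
print((original_pred == restored_pred).all())
```

As with any pickle file, a bundle like this should only be loaded from a trusted source and with a compatible scikit-learn version.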


In [36]:
# Sample input data
input_data = (5,166,72,19,175,25.8,0.587,51)

# Changing the input_data to a NumPy array
input_data_np_array = np.asarray(input_data)

# Reshape the array as we are predicting for one instance
input_data_reshape = input_data_np_array.reshape(1, -1)

# Standardize the input data (assuming scaler has been defined and fitted earlier)
std_data = scaler.transform(input_data_reshape)

# Print standardized data
print("Standardized Data:", std_data)

# Make a prediction with the model loaded from disk
prediction = loaded_model.predict(std_data)

# Print the prediction result
print("Prediction:", prediction)

# Check if the prediction indicates diabetes
if prediction[0] == 0:
    print("The person is not diabetic.")
else:
    print("The person is diabetic.")

Standardized Data: [[ 0.3429808   1.41167241  0.14964075 -0.09637905  0.82661621 -0.78595734
   0.34768723  1.51108316]]
Prediction: [1]
The person is diabetic.


# 📝 Conclusion

The linear SVM classifier reached an accuracy of about 78.7% on the training data and 77.3% on the test data; the small gap between the two suggests the model is not strongly overfitting.

For the final sample input, passed through the saved and reloaded model:

- **Standardized Data**: `[[ 0.3429808   1.41167241  0.14964075 -0.09637905  0.82661621 -0.78595734 0.34768723  1.51108316]]`
- **Prediction**: `[1]` (indicating the presence of diabetes)

The model classifies this person as diabetic. With a test accuracy of about 77%, a model like this can help flag patients for follow-up testing, supporting early detection, timely medical responses and lifestyle adjustments.
