# Diabetes Prediction Using Machine Learning

## Overview
This notebook trains a machine learning model that predicts diabetes from the medical indicators in the PIMA Indians Diabetes Dataset.

## Dataset Features
| Feature | Description | Unit |
|---------|------------|------|
| Pregnancies | Number of pregnancies | Count |
| Glucose | Plasma glucose concentration | mg/dL |
| BloodPressure | Diastolic blood pressure | mm Hg |
| SkinThickness | Triceps skin fold thickness | mm |
| Insulin | 2-hour serum insulin | µU/mL |
| BMI | Body mass index | kg/m² |
| DiabetesPedigreeFunction | Diabetes pedigree function | Score |
| Age | Age of patient | Years |

## Methodology
1. Data Preprocessing
   - Feature scaling
   - Train-test split (70-30)
2. Model: Support Vector Machine
   - Linear kernel
   - Standardized features
3. Evaluation
   - Accuracy metrics
   - Prediction system

## Note
For screening purposes only. Consult healthcare providers for diagnosis.

## Why SVM for Diabetes Prediction?

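A Support Vector Machine (SVM) was chosen because it:
- Is well suited to binary classification problems
- Is often effective on small, tabular medical datasets
- Works well with numerical features once they are standardized
- Tends to generalize well by maximizing the margin between classes
- Yields a clear decision boundary (a hyperplane with the linear kernel)
- Tolerates some outliers through the soft margin controlled by `C`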
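
To make the "clear decision boundaries" point concrete: with a linear kernel, the fitted model is a hyperplane, and the score for a sample is `w · x + b`. The sketch below uses a small synthetic dataset (an assumption for illustration, not the PIMA data) to check that scikit-learn's `decision_function` agrees with that formula.

```python
import numpy as np
from sklearn import svm

# Tiny synthetic, linearly separable dataset (illustrative only, not PIMA data)
X_toy = np.array([[0.0, 0.0], [0.5, 1.0], [1.0, 0.5],
                  [3.0, 3.0], [3.5, 2.5], [4.0, 3.5]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

toy_clf = svm.SVC(kernel='linear')
toy_clf.fit(X_toy, y_toy)

# For a linear kernel, coef_ holds w and intercept_ holds b
w = toy_clf.coef_[0]
b = toy_clf.intercept_[0]
manual_scores = X_toy @ w + b

# Positive scores map to class 1, negative scores to class 0
manual_labels = (manual_scores > 0).astype(int)

print(np.allclose(manual_scores, toy_clf.decision_function(X_toy)))
print(np.array_equal(manual_labels, toy_clf.predict(X_toy)))
```

The same attributes are available on the notebook's `classifier` after fitting, where `coef_` has one weight per standardized feature.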

In [1]:
# 1. Importing the Dependencies

import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn import svm
from sklearn.metrics import accuracy_score

## Dependencies

Key libraries used:
- **NumPy**: Numerical operations and arrays
- **Pandas**: Data manipulation and analysis
- **Scikit-learn**:
  - StandardScaler: Feature normalization
  - train_test_split: Dataset splitting
  - SVM: Machine learning model
  - accuracy_score: Performance metric

In [2]:
# 2. Loading the Dataset

df = pd.read_csv('Diabetes.csv')
df.head()

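| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |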


## Dataset Loading

Loading PIMA Indians Diabetes Dataset:
- CSV format containing patient records
- Features: Medical indicators
- Target: Diabetes diagnosis (0/1)
- Initial view shows data structure

In [3]:
# 3. Checking the Shape of the Dataset

df.shape

(768, 9)

## Dataset Dimensions

The shape `(768, 9)` shows:
- 768 patient records
- 9 columns: 8 features plus the `Outcome` target
- A small dataset by ML standards, so metrics can shift noticeably with the choice of split

In [4]:
# 4. Checking the Statistical Summary of the Dataset

df.describe()

| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| count | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 |
| mean | 3.845052 | 120.894531 | 69.105469 | 20.536458 | 79.799479 | 31.992578 | 0.471876 | 33.240885 | 0.348958 |
| std | 3.369578 | 31.972618 | 19.355807 | 15.952218 | 115.244002 | 7.88416 | 0.331329 | 11.760232 | 0.476951 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.078 | 21.0 | 0.0 |
| 25% | 1.0 | 99.0 | 62.0 | 0.0 | 0.0 | 27.3 | 0.24375 | 24.0 | 0.0 |
| 50% | 3.0 | 117.0 | 72.0 | 23.0 | 30.5 | 32.0 | 0.3725 | 29.0 | 0.0 |
| 75% | 6.0 | 140.25 | 80.0 | 32.0 | 127.25 | 36.6 | 0.62625 | 41.0 | 1.0 |
| max | 17.0 | 199.0 | 122.0 | 99.0 | 846.0 | 67.1 | 2.42 | 81.0 | 1.0 |


## Statistical Summary

Key statistics for each feature:
- Mean, std: Central tendency and spread
- min, max: Value ranges
- Quartiles: Data distribution
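- Flags data quality issues: the minimum of 0 for Glucose, BloodPressure, SkinThickness, Insulin, and BMI is not physiologically plausible and likely stands in for missing measurements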
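
One way to follow up on the summary is to count how many zeros each measurement column contains. The sketch below does this on the five rows shown in the `df.head()` output, so it runs without the CSV file; the list `zero_suspect_cols` is an assumption about which columns cannot legitimately be 0.

```python
import pandas as pd

# First five rows of the dataset, copied from the df.head() output above
sample = pd.DataFrame({
    'Glucose': [148, 85, 183, 89, 137],
    'BloodPressure': [72, 66, 64, 66, 40],
    'SkinThickness': [35, 29, 0, 23, 35],
    'Insulin': [0, 0, 0, 94, 168],
    'BMI': [33.6, 26.6, 23.3, 28.1, 43.1],
})

# Assumption: a value of 0 is not physiologically plausible in these columns
zero_suspect_cols = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']

# Count zeros per column; on the full dataset, use df[zero_suspect_cols] instead
zero_counts = (sample[zero_suspect_cols] == 0).sum()
print(zero_counts)
```

Running the same expression on the full `df` would show how widespread the placeholder zeros are before deciding whether to impute them.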

In [5]:
# 5. Checking the Distribution of the Target Variable

df['Outcome'].value_counts()

Outcome
0    500
1    268
Name: count, dtype: int64

## 5. Target Variable Distribution Analysis

Understanding the distribution of the target variable (Outcome) is critical:

1. **Class Balance**:
   - Shows distribution between diabetic and non-diabetic cases
   - Helps identify class imbalance
   - Influences choice of evaluation metrics

2. **Model Strategy**:
   - Helps decide if we need:
     - Class weights
     - Over/under sampling
     - SMOTE or other balancing techniques

3. **Performance Expectations**:
   - Sets a baseline: always predicting the majority class (non-diabetic) would already reach 500/768 ≈ 65.1% accuracy
   - Guides choice of evaluation metrics
   - Helps interpret model results

4. **Medical Context**:
   - Reflects real-world diabetes prevalence
   - Helps in understanding prediction bias
   - Guides clinical application strategy
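
The class counts above (500 vs. 268) are enough to work out two of these quantities directly. The following sketch rebuilds a label array with those counts, computes the majority-class baseline accuracy, and shows the weights that scikit-learn's `'balanced'` option would assign. It is an illustration of the idea, not a step the notebook performs.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Label array rebuilt from the value_counts() output above
y = np.array([0] * 500 + [1] * 268)

# Accuracy of a trivial model that always predicts the majority class
majority_baseline = np.bincount(y).max() / len(y)
print(f"Majority-class baseline accuracy: {majority_baseline:.3f}")

# 'balanced' weights are n_samples / (n_classes * class_count)
weights = compute_class_weight('balanced', classes=np.array([0, 1]), y=y)
print(weights.round(3))
```

Any reported accuracy is best read against that baseline. If the imbalance turns out to matter, passing `class_weight='balanced'` to `svm.SVC` is one option to try.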

In [6]:
# 6. Checking for Missing Values

df.isna().sum()

Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64

## 6. Missing Values Analysis

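Checking for missing values is a crucial data quality step. Note that `isna()` only counts `NaN` entries, so the all-zero result above does not rule out placeholder values such as the 0 minimums seen in the statistical summary: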

1. **Data Quality Assessment**:
   - Identifies incomplete records
   - Shows which features have missing data
   - Quantifies the extent of missing data

2. **Why It's Important**:
   - Missing data can bias the model
   - Some algorithms can't handle missing values
   - Helps decide on data imputation strategy

3. **Handling Strategies**:
   - Drop rows with missing values
   - Impute with mean/median/mode
   - Use advanced imputation techniques

4. **Impact on Model**:
   - Ensures reliable predictions
   - Maintains data integrity
   - Prevents model errors
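
As a sketch of the "impute with median" strategy, the example below treats zeros in two columns as missing and fills them with the column median. It uses the five rows from the `df.head()` output so it is self-contained; treating 0 as "not measured" is an assumption, and the notebook itself does not apply this step.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Two columns from the first five rows shown in df.head() above
sample = pd.DataFrame({
    'SkinThickness': [35, 29, 0, 23, 35],
    'Insulin': [0, 0, 0, 94, 168],
})

# Assumption: a 0 in these columns means "not measured"
sample_nan = sample.replace(0, np.nan)
print(sample_nan.isna().sum())

# Fill the gaps with each column's median of the observed values
imputer = SimpleImputer(strategy='median')
imputed = pd.DataFrame(imputer.fit_transform(sample_nan), columns=sample.columns)
print(imputed)
```

In a full workflow the imputer would normally be fit on the training split only, for the same reason a scaler would.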

In [8]:
# Analyzing the Data
df.groupby('Outcome').mean()

| Outcome | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age |
|---|---|---|---|---|---|---|---|---|
| 0 | 3.298 | 109.98 | 68.184 | 19.664 | 68.792 | 30.3042 | 0.429734 | 31.19 |
| 1 | 4.865672 | 141.257463 | 70.824627 | 22.164179 | 100.335821 | 35.142537 | 0.5505 | 37.067164 |


In [9]:
# 7. Splitting the Data into Features and Target Variable
X = df.drop(columns='Outcome')
Y = df['Outcome']

## 7. Feature and Target Separation

Separating features and target variable is a fundamental step:

1. **Data Organization**:
   - X (Features): Medical indicators
   - Y (Target): Diabetes outcome
   - Prepares data for model training

2. **Why We Do This**:
   - Follows supervised learning framework
   - Allows proper model training
   - Enables prediction testing

3. **Features (X)**:
   - Contains all medical measurements
   - Excludes the outcome column
   - Used to make predictions

4. **Target (Y)**:
   - Binary outcome (0 or 1)
   - What we want to predict
   - Ground truth for training

This separation is crucial because:
- Models need clear input/output structure
- Enables proper evaluation
- Keeps the target out of the model's inputs (avoids target leakage)

In [11]:
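# 8. Feature Scaling

# Standardize each feature to zero mean and unit variance.
# Note: the scaler is fit on all rows here, before the train-test split.
scaler = StandardScaler()
scaler.fit(X)
standardized_data = scaler.transform(X)

# X becomes a NumPy array; Y is unchanged from the previous step
X = standardized_data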

## Feature Scaling

StandardScaler implementation:
- Standardizes features to zero mean and unit variance
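- Important for SVMs, whose margin depends on distances between points
- Keeps wide-ranging features (e.g. Insulin) from dominating narrow-ranging ones (e.g. DiabetesPedigreeFunction)
- The fitted scaler is kept and reused to transform new inputs at prediction time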
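
The standardization itself is just `z = (x - mean) / std` per column. As a small check, the sketch below applies that formula by hand to the Glucose and BMI values from the first five rows and compares the result with `StandardScaler`. The names `scaler_demo` and `new_sample` are illustrative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Glucose and BMI values from the first five rows shown in df.head() above
values = np.array([[148, 33.6], [85, 26.6], [183, 23.3], [89, 28.1], [137, 43.1]])

scaler_demo = StandardScaler().fit(values)

# StandardScaler uses the population standard deviation (ddof=0)
manual = (values - values.mean(axis=0)) / values.std(axis=0)
print(np.allclose(manual, scaler_demo.transform(values)))

# A new sample is scaled with the stored mean_ and scale_, not its own statistics
new_sample = np.array([[116, 25.6]])
manual_new = (new_sample - scaler_demo.mean_) / scaler_demo.scale_
print(np.allclose(manual_new, scaler_demo.transform(new_sample)))
```

This is why the prediction step calls `scaler.transform` on new input rather than fitting a fresh scaler.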

In [53]:
# 9. Splitting the Data into Training and Testing Sets

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, stratify=Y, random_state=35)
print(X.shape, X_train.shape, X_test.shape)

(768, 8) (537, 8) (231, 8)


## 9. Train-Test Data Split

Splitting data into training and testing sets is essential:

1. **Purpose**:
   - Evaluates model's generalization
   - Helps detect overfitting (a large gap between training and test accuracy)
   - Simulates real-world usage

2. **Split Parameters**:
   - test_size=0.3: 70% training, 30% testing
   - stratify=Y: Maintains class distribution
   - random_state=35: Ensures reproducibility

3. **Why These Ratios?**:
   - Training (70%): Sufficient for learning patterns
   - Testing (30%): Adequate for validation
   - Balanced compromise between learning and validation

4. **Stratification**:
   - Maintains class proportions
   - Prevents sampling bias
   - Ensures representative splits

5. **Importance**:
   - Validates model performance
   - Tests on unseen data
   - Provides reliable metrics
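
The effect of `stratify` can be checked without the CSV file. This sketch builds a synthetic label array with the dataset's class counts (500 vs. 268) and a placeholder feature, then compares the positive rate across the splits. `X_demo` and `y_demo` are stand-ins, not the real data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same class counts as the dataset (500 vs. 268)
y_demo = np.array([0] * 500 + [1] * 268)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder feature

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=35
)

print(len(y_tr), len(y_te))
print(f"positive rate: full={y_demo.mean():.3f}, "
      f"train={y_tr.mean():.3f}, test={y_te.mean():.3f}")
```

The split sizes match the 537/231 shapes printed above, and the share of positive cases stays close to the overall 34.9% in both parts.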

In [54]:
# 10. Classifier
classifier = svm.SVC(kernel='linear')

## Model Training

SVM Configuration:
- Linear kernel for interpretability
- Default regularization (C=1.0)
- Trained on standardized features
- Optimized for binary classification
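
One caveat on "trained on standardized features": the notebook fits `StandardScaler` on all rows before splitting, so the test rows contribute to the scaling statistics. A common alternative, sketched here on synthetic data from `make_classification` (an assumption, not the PIMA data), wraps the scaler and classifier in a pipeline so the scaler is fit on the training rows only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic 8-feature binary dataset (stand-in for the real data)
X_syn, y_syn = make_classification(n_samples=300, n_features=8, random_state=0)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.3, stratify=y_syn, random_state=35
)

# fit() scales using training statistics only; score() reuses them on the test set
model = make_pipeline(StandardScaler(), SVC(kernel='linear', C=1.0))
model.fit(X_tr, y_tr)

test_acc = model.score(X_te, y_te)
print(f"test accuracy: {test_acc:.3f}")
```

With a pipeline, the same object can also be passed to cross-validation helpers, which refit the scaler inside each fold.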

In [55]:
# 11. Training the Support Vector Machine Classifier 
classifier.fit(X_train, Y_train)

| Parameter | Value |
|---|---|
| C | 1.0 |
| kernel | 'linear' |
| degree | 3 |
| gamma | 'scale' |
| coef0 | 0.0 |
| shrinking | True |
| probability | False |
| tol | 0.001 |
| cache_size | 200 |
| class_weight | None |


In [56]:
# 12. Model Evaluation
# accuracy_score expects the true labels first, then the predictions
x_train_pred = classifier.predict(X_train)
training_data_acc = accuracy_score(Y_train, x_train_pred)

print("accuracy_score of training data : ", training_data_acc)

x_test_pred = classifier.predict(X_test)
test_data_acc = accuracy_score(Y_test, x_test_pred)

print("accuracy_score of testing data : ", test_data_acc)


accuracy_score of training data :  0.7783985102420856
accuracy_score of testing data :  0.7922077922077922
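
Because the classes are imbalanced, accuracy alone can hide how many diabetic cases are missed. The sketch below shows how a confusion matrix, precision, and recall are computed, using hand-made labels chosen for illustration (they are not the notebook's predictions).

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score)

# Hand-made labels for illustration (not the notebook's predictions)
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

# Rows are true classes, columns are predicted classes: [[TN, FP], [FN, TP]]
cm = confusion_matrix(y_true, y_pred)
print(cm)

acc = accuracy_score(y_true, y_pred)     # 7 of 10 correct
prec = precision_score(y_true, y_pred)   # TP / (TP + FP) = 2 / 3
rec = recall_score(y_true, y_pred)       # TP / (TP + FN) = 2 / 4
print(acc, prec, rec)
```

Here accuracy is 0.7 even though half of the positive cases are missed. The same functions can be applied to the notebook's `Y_test` and its test-set predictions.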


In [57]:
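# 13. Making a Predictive System
# Feature order: Pregnancies, Glucose, BloodPressure, SkinThickness,
# Insulin, BMI, DiabetesPedigreeFunction, Age
input_data = (5, 116, 74, 0, 0, 25.6, 0.201, 30)

# Reshape to a single-row 2D array, as scikit-learn expects
input_data_as_np_arr = np.asarray(input_data)
input_data_reshape = input_data_as_np_arr.reshape(1, -1)

# Reuse the scaler fitted earlier; do not refit it on the new sample
std_input_data = scaler.transform(input_data_reshape)

prediction = classifier.predict(std_input_data)

if prediction[0] == 0:
    print("The person is NON-DIABETIC")
else:
    print("The person is DIABETIC")

The person is NON-DIABETIC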




## Prediction System

Process flow:
1. Input: 8 medical parameters
2. Preprocessing: Array conversion and scaling
3. Prediction: Binary classification (0: Non-diabetic, 1: Diabetic)
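
The three steps above could be wrapped in a small function. The sketch below is one possible shape for it: `predict_one` and `FEATURES` are hypothetical names, and the model is trained on just the five rows from `df.head()` so the example runs on its own. A model fit on five rows is not meaningful; in practice the notebook's fitted `scaler` and `classifier` would be passed in.

```python
import pandas as pd
from sklearn import svm
from sklearn.preprocessing import StandardScaler

FEATURES = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness',
            'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age']

# Five rows from df.head() above, used only so the sketch is self-contained
rows = [
    (6, 148, 72, 35, 0, 33.6, 0.627, 50),
    (1, 85, 66, 29, 0, 26.6, 0.351, 31),
    (8, 183, 64, 0, 0, 23.3, 0.672, 32),
    (1, 89, 66, 23, 94, 28.1, 0.167, 21),
    (0, 137, 40, 35, 168, 43.1, 2.288, 33),
]
labels = [1, 0, 1, 0, 1]

X_small = pd.DataFrame(rows, columns=FEATURES)
demo_scaler = StandardScaler().fit(X_small)
demo_clf = svm.SVC(kernel='linear').fit(demo_scaler.transform(X_small), labels)


def predict_one(values, scaler, classifier):
    """Hypothetical helper: scale one 8-value record and return 0 or 1."""
    if len(values) != len(FEATURES):
        raise ValueError(f"expected {len(FEATURES)} values, got {len(values)}")
    # A one-row DataFrame keeps the column names the scaler was fit with
    record = pd.DataFrame([values], columns=FEATURES)
    return int(classifier.predict(scaler.transform(record))[0])


label = predict_one((5, 116, 74, 0, 0, 25.6, 0.201, 30), demo_scaler, demo_clf)
print("DIABETIC" if label == 1 else "NON-DIABETIC")
```

Checking the input length up front turns a wrong number of values into a readable error before the record reaches the scaler.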

Note: Use as a screening tool only.