# **Diabetes Prediction System (Diabetic vs Non-Diabetic)**

Binary Classification Problem

>This project demonstrates a real-world machine learning application that predicts whether a person is diabetic or non-diabetic using medical and health-related data. The goal is to show how raw healthcare data can be processed, modeled, and transformed into actionable predictions using supervised learning techniques.

---

**Step 1: Data Collection**

The dataset used in this project contains medical records of patients, where each record includes multiple health indicators such as:
- Glucose level
- Blood pressure
- Body Mass Index (BMI)
- Insulin level
- Age
- Other relevant diagnostic features

Each data sample is labeled as:
- 1 – Diabetic
- 0 – Non-Diabetic

This labeled structure makes the problem suitable for **binary classification**.

---

**Step 2: Data Preprocessing**

Raw medical data cannot be directly used for modeling. In this step, the dataset is prepared for machine learning:
- The dataset is loaded into a Pandas DataFrame.
- Feature variables and the target label are separated.
- Missing or invalid values are handled appropriately.
- Numerical transformations and scaling are applied to normalize the data.
- The final dataset is converted into a format suitable for model training.

This step ensures the model learns from **clean, consistent, and meaningful data**.
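
One way to handle invalid values can be sketched as follows. The summary statistics later in this notebook show a minimum of 0 for columns such as Glucose and BMI, which is not physiologically plausible; the notebook's own code cells do not impute these values. The sketch below assumes that a zero in those columns means "not recorded" and replaces it with the column median. The toy values and the `zero_as_missing` list are illustrative assumptions, not part of the original code.

```python
import numpy as np
import pandas as pd

# Toy records (made-up values, not the real dataset).
df = pd.DataFrame({
    "Glucose": [148, 0, 183, 89],
    "BMI": [33.6, 26.6, 0.0, 28.1],
    "Pregnancies": [6, 0, 8, 1],  # zero is a legitimate value here
})

# Assumption: a zero means "not recorded" in these columns only.
zero_as_missing = ["Glucose", "BMI"]

cleaned = df.copy()
# Mark implausible zeros as missing, then fill each gap with the column median.
cleaned[zero_as_missing] = cleaned[zero_as_missing].replace(0, np.nan)
cleaned[zero_as_missing] = cleaned[zero_as_missing].fillna(
    cleaned[zero_as_missing].median()
)

print(cleaned)
```

Median imputation is only one option; whether it is appropriate depends on how many values are missing in each column.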

---

**Step 3: Train-Test Data Split**

To evaluate the model fairly, the dataset is divided into two parts:
- **Training data** – Used to train the machine learning model.
- **Testing data** – Used to test the model on unseen data.

Stratified splitting keeps the proportion of diabetic and non-diabetic cases the same in both parts, so the test score is not skewed by an unrepresentative split.
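
The effect of stratification can be checked with a small sketch. It uses placeholder features and a label vector with the same class counts reported later in this notebook (500 non-diabetic, 268 diabetic); the variable names are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Labels with the same class counts as the dataset: 500 zeros, 268 ones.
labels = np.array([0] * 500 + [1] * 268)
# Placeholder features (assumption): one dummy column, just to have something to split.
features = np.arange(len(labels)).reshape(-1, 1)

f_train, f_test, l_train, l_test = train_test_split(
    features, labels, test_size=0.2, stratify=labels, random_state=2
)

# With stratify=labels, the share of positives is nearly identical in every part.
print("overall :", round(labels.mean(), 3))
print("training:", round(l_train.mean(), 3))
print("testing :", round(l_test.mean(), 3))
```

Without `stratify`, the positive share in a small test set can drift by a few percentage points from one random split to the next.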

---

**Step 4: Model Training (Support Vector Machine – SVM)**

The processed training data is fed into a Support Vector Machine (SVM) classifier:
- SVM is a supervised machine learning algorithm.
- It is highly effective for **binary classification problems**.
- SVM is a margin-based classifier that finds the optimal decision boundary between classes.
- The model learns patterns that separate diabetic and non-diabetic individuals based on medical features.

A linear SVM is a reasonable baseline for a small tabular dataset like this one, though its generalization still has to be confirmed on the test set.
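
To make the idea of a decision boundary concrete, here is a minimal sketch on six made-up 2D points (not patient data). It fits a linear `SVC` and shows that the sign of `decision_function` determines the predicted class.

```python
import numpy as np
from sklearn import svm

# Two well-separated toy clusters (made-up points).
points = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0],
                   [3.0, 3.0], [4.0, 3.0], [3.0, 4.0]])
classes = np.array([0, 0, 0, 1, 1, 1])

toy_clf = svm.SVC(kernel="linear").fit(points, classes)

# For a linear kernel the boundary is the line w . x + b = 0.
print("w:", toy_clf.coef_[0], "b:", toy_clf.intercept_[0])

# decision_function gives the signed score; positive means class 1.
scores = toy_clf.decision_function(points)
preds = toy_clf.predict(points)
print(scores)
print(preds)
```

The points with the smallest absolute score lie closest to the boundary; these are the support vectors that define the margin.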

---

**Step 5: Model Evaluation & Accuracy**

After training, the model is evaluated using the test dataset:
- Predictions are generated for unseen patient data.
- Model performance is measured using accuracy score.
- The evaluation confirms how well the model generalizes beyond training data.

In this notebook the linear SVM reaches about 78% accuracy on the training data and 77% on the test data; the small gap between the two suggests the model is not overfitting.
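
Because the classes are imbalanced (500 vs. 268), accuracy alone can hide weak performance on the diabetic class. The sketch below uses hand-written labels (not the notebook's predictions) to show how a confusion matrix, precision, and recall complement the accuracy score.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score)

# Made-up true labels and predictions for ten patients.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

acc = accuracy_score(y_true, y_pred)
cm = confusion_matrix(y_true, y_pred)    # rows: true class, columns: predicted class
prec = precision_score(y_true, y_pred)   # share of predicted positives that are correct
rec = recall_score(y_true, y_pred)       # share of actual positives that were found

print("accuracy :", acc)    # 0.7
print("confusion matrix:")
print(cm)                   # [[5 1] [2 2]]
print("precision:", prec)
print("recall   :", rec)    # 0.5
```

Here the accuracy is 70%, yet only half of the diabetic cases are detected, which is the kind of gap the accuracy score by itself does not reveal.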

---

**Step 6: Prediction on New Data**

Once validated, the trained model is used for real-world prediction:
- New patient data is provided as input.
- The input is converted into a NumPy array, reshaped to a single row, and standardized with the same scaler used for training.
- The model predicts whether the person is diabetic or not.

This step simulates how the model would behave in a real healthcare scenario.
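
The parsing and reshaping part of this step can be sketched as a small function. `parse_patient_record` and `N_FEATURES` are hypothetical names introduced for illustration; the count of eight assumes the eight feature columns used in this notebook.

```python
import numpy as np

N_FEATURES = 8  # assumption: the eight feature columns used in this notebook


def parse_patient_record(raw: str) -> np.ndarray:
    """Hypothetical helper: parse a comma-separated record into a (1, n) array."""
    values = [float(v) for v in raw.split(",")]
    if len(values) != N_FEATURES:
        raise ValueError(f"expected {N_FEATURES} values, got {len(values)}")
    # scikit-learn estimators expect a 2D array: one row per sample.
    return np.asarray(values).reshape(1, -1)


sample = parse_patient_record("1,85,66,29,0,26.6,0.351,31")
print(sample.shape)  # (1, 8)
```

Checking the number of values up front gives a clearer error than the shape mismatch the scaler or model would otherwise raise.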

---

## Key Skills Demonstrated

- Data preprocessing & feature handling
- Binary classification using supervised learning
- Support Vector Machine (SVM) modeling
- Model evaluation and accuracy analysis
- Real-world healthcare data application

---

## Final Outcome

*This project showcases an end-to-end machine learning pipeline, from raw data collection and preprocessing to model training, evaluation, and prediction on new data. It highlights how data-driven approaches can support early diagnosis and decision-making in healthcare.*

In [3]:
# importing dependencies
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn import svm
from sklearn.metrics import accuracy_score

In [6]:
# loading the data for analysis
# PIMA Diabetes Dataset
diabetes_dataset = pd.read_csv('diabetes.csv')

# printing the first 5 rows of the dataset
diabetes_dataset.head()

,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


`Outcome` has 2 values, known as labels:
- 0: not diabetic
- 1: diabetic




In [None]:
# number of rows and columns in this dataset
diabetes_dataset.shape
# (rows, columns)

(768, 9)

In [None]:
# getting the statistical measures of the data
diabetes_dataset.describe()


,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
count,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0
mean,3.845052,120.894531,69.105469,20.536458,79.799479,31.992578,0.471876,33.240885,0.348958
std,3.369578,31.972618,19.355807,15.952218,115.244002,7.88416,0.331329,11.760232,0.476951
min,0.0,0.0,0.0,0.0,0.0,0.0,0.078,21.0,0.0
25%,1.0,99.0,62.0,0.0,0.0,27.3,0.24375,24.0,0.0
50%,3.0,117.0,72.0,23.0,30.5,32.0,0.3725,29.0,0.0
75%,6.0,140.25,80.0,32.0,127.25,36.6,0.62625,41.0,1.0
max,17.0,199.0,122.0,99.0,846.0,67.1,2.42,81.0,1.0


In [None]:
diabetes_dataset['Outcome'].value_counts()
# returns the number of rows for each label in the column
# number of 0s is 500
# number of 1s is 268

Outcome,count
0,500
1,268


In [None]:
diabetes_dataset.groupby('Outcome').mean()
# returns the mean of each column, grouped by label

Outcome,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age
0,3.298,109.98,68.184,19.664,68.792,30.3042,0.429734,31.19
1,4.865672,141.257463,70.824627,22.164179,100.335821,35.142537,0.5505,37.067164


In [None]:
# separating the features and labels
x = diabetes_dataset.drop(columns='Outcome')   # all columns except the 'Outcome' column
y = diabetes_dataset['Outcome']   # Series holding only the labels

In [None]:
# standardizing the data
# note: the scaler is fitted on the full dataset here, before the train/test split
scaler = StandardScaler()
scaler.fit(x)

In [None]:
standerdized_data = scaler.transform(x)  # transforming the data with the fitted scaler
standerdized_data

array([[ 0.63994726,  0.84832379,  0.14964075, ...,  0.20401277,
         0.46849198,  1.4259954 ],
       [-0.84488505, -1.12339636, -0.16054575, ..., -0.68442195,
        -0.36506078, -0.19067191],
       [ 1.23388019,  1.94372388, -0.26394125, ..., -1.10325546,
         0.60439732, -0.10558415],
       ...,
       [ 0.3429808 ,  0.00330087,  0.14964075, ..., -0.73518964,
        -0.68519336, -0.27575966],
       [-0.84488505,  0.1597866 , -0.47073225, ..., -0.24020459,
        -0.37110101,  1.17073215],
       [-0.84488505, -0.8730192 ,  0.04624525, ..., -0.20212881,
        -0.47378505, -0.87137393]])
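
What `StandardScaler` computes can be checked by hand. The sketch below uses a few made-up rows (not the full dataset) and compares the scaler's output with the manual z-score, that is, each value minus its column mean, divided by the column's population standard deviation. The name `toy_scaler` is illustrative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy feature matrix (made-up values): two columns on different scales.
toy_features = np.array([[148.0, 33.6],
                         [85.0, 26.6],
                         [183.0, 23.3],
                         [89.0, 28.1]])

toy_scaler = StandardScaler().fit(toy_features)
scaled = toy_scaler.transform(toy_features)

# Manual z-score: subtract the column mean, divide by the population std (ddof=0).
manual = (toy_features - toy_features.mean(axis=0)) / toy_features.std(axis=0)

print(np.allclose(scaled, manual))  # True
```

After scaling, every column has mean 0 and unit variance, so no single feature dominates the SVM's distance computations because of its units.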

In [None]:
x = standerdized_data

In [None]:
# splitting into training data and testing data
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, stratify=y, random_state=2)
# test_size=0.2 holds out 20% of the data for testing
# stratify=y keeps the class proportions of y similar in the training and testing data
# random_state=2 fixes the random seed so the split is reproducible
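
The cells above fit the scaler on the full dataset before splitting, so statistics from the test rows influence the scaling. A common alternative, sketched here under stated assumptions, is to split first and wrap the scaler and classifier in a scikit-learn pipeline so the scaler only sees the training rows. The data is a synthetic stand-in from `make_classification`, not the diabetes dataset, and all variable names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (assumption): 768 rows and 8 features, like the dataset's shape.
X_demo, y_demo = make_classification(n_samples=768, n_features=8, random_state=0)

# Split the raw (unscaled) data first.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=2
)

# The pipeline fits StandardScaler on the training rows only, then trains the SVM.
model = make_pipeline(StandardScaler(), SVC(kernel="linear"))
model.fit(X_tr, y_tr)

# score() scales the test rows with the training statistics before predicting.
test_score = model.score(X_te, y_te)
print("test accuracy:", test_score)
```

With a pipeline, new inputs can be passed to `model.predict` directly, since the scaling step is applied automatically.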

In [None]:
# creating the ML model
classifier = svm.SVC(kernel='linear')
# using the Support Vector Classifier (SVC) with a linear kernel


In [None]:
# training the Support Vector Machine classifier
classifier.fit(x_train, y_train)


# **Model Evaluation**

In [44]:
# finding the accuracy score on the training data
train_prediction = classifier.predict(x_train)
train_accuracy = accuracy_score(y_train, train_prediction)  # comparing the true y_train labels with the predictions

print('Accuracy score of the training data', int(train_accuracy * 100), "%")


Accuracy score of the training data 78 %


In [45]:
# finding the accuracy score on the test data
test_prediction = classifier.predict(x_test)
test_accuracy = accuracy_score(y_test, test_prediction)  # comparing the true y_test labels with the predictions

print('Accuracy score of the test data', int(test_accuracy * 100), "%")

Accuracy score of the test data 77 %


# Making a predictive system for external data

In [52]:
# taking data as input
input_data = input("Enter the data (comma-separated): ")

# convert the input string to a list of floats
input_data = [float(value) for value in input_data.split(',')]

# changing "input_data" to a NumPy array
input_data_as_numpy_array = np.asarray(input_data)

# reshape the array as we are predicting for one instance
input_data_reshape = input_data_as_numpy_array.reshape(1, -1)

# standardize the input data with the scaler fitted earlier
skd_data = scaler.transform(input_data_reshape)
print(skd_data)

# prediction
prediction = classifier.predict(skd_data)
print(prediction)

# applying the condition
if prediction[0] == 0:
  print('You are not a diabetic patient')
else:
  print('You are a diabetic patient')

Enter the data (comma-separated): 1,85,66,29,0,26.6,0.351,31
[[-0.84488505 -1.12339636 -0.16054575  0.53090156 -0.69289057 -0.68442195
  -0.36506078 -0.19067191]]
[0]
You are not a diabetic patient


