## Project Overview
A machine learning system that predicts diabetes risk from clinical health indicators. The project uses a supervised machine learning algorithm (a Support Vector Machine classifier) to predict whether a patient is diabetic, with the aim of supporting early detection and helping healthcare professionals make informed decisions.

Industry: Healthcare / Medical Technology

## IMPORT DEPENDENCIES

In [2]:
# import the required libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import svm
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler

### DATA COLLECTION AND DATA PROCESSING

In [4]:
# Load the dataset to a pandas dataframe
diabetes_prediction = pd.read_csv('diabetes.csv')

In [5]:
# Check the first 5 rows of the dataset
diabetes_prediction.head()

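| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |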


In [6]:
# Check the shape of the dataset
diabetes_prediction.shape

(768, 9)

In [7]:
# Check the statistical measures of the dataset
diabetes_prediction.describe()

| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| count | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 |
| mean | 3.845052 | 120.894531 | 69.105469 | 20.536458 | 79.799479 | 31.992578 | 0.471876 | 33.240885 | 0.348958 |
| std | 3.369578 | 31.972618 | 19.355807 | 15.952218 | 115.244002 | 7.88416 | 0.331329 | 11.760232 | 0.476951 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.078 | 21.0 | 0.0 |
| 25% | 1.0 | 99.0 | 62.0 | 0.0 | 0.0 | 27.3 | 0.24375 | 24.0 | 0.0 |
| 50% | 3.0 | 117.0 | 72.0 | 23.0 | 30.5 | 32.0 | 0.3725 | 29.0 | 0.0 |
| 75% | 6.0 | 140.25 | 80.0 | 32.0 | 127.25 | 36.6 | 0.62625 | 41.0 | 1.0 |
| max | 17.0 | 199.0 | 122.0 | 99.0 | 846.0 | 67.1 | 2.42 | 81.0 | 1.0 |


In [8]:
# Check the Value counts of the label column
diabetes_prediction['Outcome'].value_counts()

Outcome
0    500
1    268
Name: count, dtype: int64

1 ---> Diabetes

0 ---> Non-Diabetes

In [10]:
# Check for missing/null values in the dataset
diabetes_prediction.isnull().sum()

Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64
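
The `isnull()` check above reports no missing values, but `describe()` shows a minimum of 0 for Glucose, BloodPressure, SkinThickness, Insulin, and BMI. Zeros in these measurements may be placeholders for unrecorded values; that is an assumption, not something this notebook confirms. A minimal sketch of how such zeros could be counted, using a hypothetical five-row `sample` frame built from the rows printed by `head()`:

```python
import pandas as pd

# Hypothetical miniature of the dataset, copied from the rows shown by head().
sample = pd.DataFrame({
    "Glucose": [148, 85, 183, 89, 137],
    "BloodPressure": [72, 66, 64, 66, 40],
    "SkinThickness": [35, 29, 0, 23, 35],
    "Insulin": [0, 0, 0, 94, 168],
    "BMI": [33.6, 26.6, 23.3, 28.1, 43.1],
})

# Columns where a value of 0 is ASSUMED to mean "not recorded".
zero_suspect_columns = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]

# Count zeros per column; isnull() cannot see these because 0 is a valid number.
zero_counts = (sample[zero_suspect_columns] == 0).sum()
print(zero_counts)
```

On the full dataset the same expression could be applied to `diabetes_prediction` directly. Whether to impute or keep such zeros is a modelling decision that this notebook does not take.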

In [11]:
# Check the mean of each feature, grouped by the label ('Outcome')
diabetes_prediction.groupby('Outcome').mean()

| Outcome | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age |
|---|---|---|---|---|---|---|---|---|
| 0 | 3.298 | 109.98 | 68.184 | 19.664 | 68.792 | 30.3042 | 0.429734 | 31.19 |
| 1 | 4.865672 | 141.257463 | 70.824627 | 22.164179 | 100.335821 | 35.142537 | 0.5505 | 37.067164 |


### Separate the Data (Features) and the Label (Outcome)

In [13]:
# Separate the features and the label
x = diabetes_prediction.drop(columns='Outcome')
y = diabetes_prediction['Outcome']

### DATA STANDARDIZATION
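* I am standardizing the data because the features are on very different scales (for example, Insulin ranges from 0 to 846 while DiabetesPedigreeFunction ranges from 0.078 to 2.42)
* Standardizing rescales every feature to zero mean and unit variance, which generally helps a Support Vector Machine make better predictions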
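
To make the rescaling concrete, here is a small sketch comparing a hand-computed z-score with the output of `StandardScaler`. The `glucose` array is an illustrative subset (the five Glucose values printed by `head()`), not the full column.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Glucose values from the first five rows of the dataset (illustrative subset).
glucose = np.array([[148.0], [85.0], [183.0], [89.0], [137.0]])

# Manual z-score: subtract the column mean, divide by the population std (ddof=0).
manual = (glucose - glucose.mean(axis=0)) / glucose.std(axis=0)

# StandardScaler performs the same computation column by column.
scaled = StandardScaler().fit_transform(glucose)

# The two results should agree up to floating-point error.
print(np.allclose(manual, scaled))
```

Note that `StandardScaler` divides by the population standard deviation, while pandas' `describe()` reports the sample standard deviation, so the two differ slightly.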

In [15]:
# Standardizing the dataset
Scaler = StandardScaler()

In [16]:
Scaler.fit(x)

In [17]:
# Transform the dataset using the fitted scaler
Standardized_data = Scaler.transform(x)

In [18]:
print(Standardized_data)

[[ 0.63994726  0.84832379  0.14964075 ...  0.20401277  0.46849198
   1.4259954 ]
 [-0.84488505 -1.12339636 -0.16054575 ... -0.68442195 -0.36506078
  -0.19067191]
 [ 1.23388019  1.94372388 -0.26394125 ... -1.10325546  0.60439732
  -0.10558415]
 ...
 [ 0.3429808   0.00330087  0.14964075 ... -0.73518964 -0.68519336
  -0.27575966]
 [-0.84488505  0.1597866  -0.47073225 ... -0.24020459 -0.37110101
   1.17073215]
 [-0.84488505 -0.8730192   0.04624525 ... -0.20212881 -0.47378505
  -0.87137393]]


In [19]:
# Assign the standardized data to (x) and the label to (y)
x = Standardized_data
y = diabetes_prediction['Outcome']

### Splitting the Data into 'Training' and 'Test' Data

In [21]:
# Split the dataset into training (80%) and test (20%) data; stratify=y keeps the class ratio the same in both splits
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, stratify=y, random_state=2)

In [22]:
print(x.shape, x_train.shape, x_test.shape)

(768, 8) (614, 8) (154, 8)
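
The effect of `stratify=y` can be checked on its own. The sketch below uses synthetic `labels` with the same class counts as the Outcome column (500 zeros, 268 ones) and a placeholder `features` matrix; both names are made up for this illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with the same class counts as the Outcome column.
labels = np.array([0] * 500 + [1] * 268)
# Placeholder feature matrix; the values do not matter for this check.
features = np.arange(len(labels)).reshape(-1, 1)

_, _, labels_train, labels_test = train_test_split(
    features, labels, test_size=0.2, stratify=labels, random_state=2
)

# With stratify, each split should keep roughly the same share of class 1 (about 35%).
print(round(labels.mean(), 3), round(labels_train.mean(), 3), round(labels_test.mean(), 3))
```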


### Training the Model ---> Support Vector Machine (SVM)
* SVM is a supervised classifier; here it is used with a linear kernel

In [24]:
# Create a Support Vector Machine classifier with a linear kernel
classifier = svm.SVC(kernel = 'linear')

In [25]:
# Train the Support Vector Machine classifier on the training data
classifier.fit(x_train, y_train)
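
In this notebook the scaler is fitted on the full dataset before the split, so the test rows contribute to the scaling statistics. One common alternative, shown here only as a sketch and not as the method used above, is to chain the scaler and the classifier so that the scaler is fitted on the training rows alone. The data is a synthetic stand-in from `make_classification`, and all variable names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the diabetes data: 768 rows, 8 features (assumption).
features, labels = make_classification(n_samples=768, n_features=8, random_state=2)

features_train, features_test, labels_train, labels_test = train_test_split(
    features, labels, test_size=0.2, stratify=labels, random_state=2
)

# The pipeline fits the scaler on the training rows only, then trains the linear SVM.
model = make_pipeline(StandardScaler(), SVC(kernel="linear"))
model.fit(features_train, labels_train)

# predict() applies the already-fitted scaler before classifying.
test_predictions = model.predict(features_test)
print(test_predictions.shape)
```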

### MODEL EVALUATION USING THE ACCURACY SCORE
* The accuracy score is used to evaluate how well the model is performing: it is the fraction of predictions that match the true labels.
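
As a small worked example of the metric, the sketch below computes accuracy by hand and with `accuracy_score` on six made-up labels. It also computes the accuracy of always predicting the majority class, using the class counts from `value_counts()` (500 and 268), as a reference point for the scores that follow.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Small hypothetical example: six true labels and six predictions.
y_true = np.array([0, 0, 1, 1, 0, 1])
y_pred = np.array([0, 1, 1, 1, 0, 0])

# Accuracy is the fraction of predictions that match the true labels.
manual_accuracy = (y_true == y_pred).mean()
library_accuracy = accuracy_score(y_true, y_pred)

# Reference point: always predicting class 0 on a 500 / 268 class split.
majority_baseline = 500 / (500 + 268)

print(manual_accuracy, library_accuracy, round(majority_baseline, 3))
```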

In [27]:
# Check the accuracy score on the training data
x_train_prediction = classifier.predict(x_train)
# accuracy_score expects the true labels first, then the predictions
training_data_accuracy = accuracy_score(y_train, x_train_prediction)

In [28]:
print('Accuracy Score on Training Data : ', training_data_accuracy)

Accuracy Score on Training Data :  0.7866449511400652


#### The accuracy score on the training data is about 79%.

#### This is a reasonable score: always predicting the majority class (non-diabetic, 500 of 768 samples) would give about 65%.

In [30]:
# Check the accuracy score on the test data
x_test_prediction = classifier.predict(x_test)
# accuracy_score expects the true labels first, then the predictions
test_data_accuracy = accuracy_score(y_test, x_test_prediction)

In [31]:
print('Accuracy Score on Test Data : ', test_data_accuracy)

Accuracy Score on Test Data :  0.7727272727272727


#### The accuracy score on the test data is about 77%.

#### This is close to the training score (about 79%), which suggests the model is not overfitting.
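
Accuracy alone does not show which class the errors fall on, which matters when a missed diabetic case is costlier than a false alarm. A confusion matrix is one way to look at this. The sketch below uses ten made-up labels, not results from the model above.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score

# Hypothetical true labels and predictions (1 = diabetic, 0 = non-diabetic).
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
y_pred = np.array([0, 0, 1, 0, 1, 0, 1, 0, 0, 0])

# Rows are true classes, columns are predicted classes: [[TN, FP], [FN, TP]].
matrix = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = matrix.ravel()

# Recall for class 1: the share of truly diabetic samples that were detected.
diabetic_recall = recall_score(y_true, y_pred)

print(matrix)
print(diabetic_recall)
```

In the notebook, the same calls could be applied to `y_test` and `x_test_prediction`.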

### Building a Predictive System That Predicts Whether a Patient Is 'Diabetic' or 'Non-Diabetic'

In [34]:
input_data = (7,181,84,21,192,35.9,0.586,51)

# Changing the input data to a numpy array
input_data_as_numpy_array = np.asarray(input_data)

# Reshape the numpy array as we are predicting for one instance
input_data_reshaped = input_data_as_numpy_array.reshape(1, -1)

# Standardize the Input Data
std_data = Scaler.transform(input_data_reshaped)
print(std_data)

prediction = classifier.predict(std_data)
print(prediction)

if (prediction[0]==0):
    print('The Patient Is Not Diabetic')
else:
    print('The Patient Is Diabetic')

[[0.93691372 1.88112959 0.77001375 0.02907707 0.97422544 0.49592704
  0.34466711 1.51108316]]
[1]
The Patient Is Diabetic
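
The steps above can be wrapped in a reusable function. The sketch below is one possible shape for it: `predict_patient` and `FEATURE_COLUMNS` are hypothetical names, and the tiny training set is just the five rows printed by `head()`, used only so the example runs on its own. Passing a one-row DataFrame with the original column names also avoids the feature-name warning that recent scikit-learn versions emit when a scaler fitted on a DataFrame is given a plain NumPy array.

```python
import pandas as pd
from sklearn import svm
from sklearn.preprocessing import StandardScaler

FEATURE_COLUMNS = ["Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
                   "Insulin", "BMI", "DiabetesPedigreeFunction", "Age"]

def predict_patient(scaler, model, values):
    """Hypothetical helper: scale one patient's values and return a readable label."""
    # A one-row DataFrame keeps the column names the scaler was fitted with.
    row = pd.DataFrame([values], columns=FEATURE_COLUMNS)
    prediction = model.predict(scaler.transform(row))
    if prediction[0] == 1:
        return "The Patient Is Diabetic"
    return "The Patient Is Not Diabetic"

# Tiny stand-in training set: the five rows shown by head() (illustration only).
train = pd.DataFrame(
    [[6, 148, 72, 35, 0, 33.6, 0.627, 50],
     [1, 85, 66, 29, 0, 26.6, 0.351, 31],
     [8, 183, 64, 0, 0, 23.3, 0.672, 32],
     [1, 89, 66, 23, 94, 28.1, 0.167, 21],
     [0, 137, 40, 35, 168, 43.1, 2.288, 33]],
    columns=FEATURE_COLUMNS,
)
outcomes = [1, 0, 1, 0, 1]

scaler = StandardScaler().fit(train)
model = svm.SVC(kernel="linear").fit(scaler.transform(train), outcomes)

label = predict_patient(scaler, model, (7, 181, 84, 21, 192, 35.9, 0.586, 51))
print(label)
```

With the notebook's own objects, the call would be `predict_patient(Scaler, classifier, input_data)`.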


