# Diabetes Prediction with a Machine Learning Algorithm: Support Vector Machines

#### Why SVM? SVM is chosen because it handles high-dimensional data well and finds a maximum-margin decision boundary, which suits binary classification tasks such as predicting diabetes. It also works well on small datasets, and margin maximization together with the regularization parameter C helps limit overfitting.

## Import required packages 

In [39]:
import numpy as np
import pandas as pd 
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split,GridSearchCV 
from sklearn import svm
from sklearn.metrics import accuracy_score
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

## Data Collection 

#### Load the well-known diabetes dataset, which contains features such as glucose level, blood pressure, and body mass index (BMI), along with a target variable indicating the presence or absence of diabetes. Several features are needed because, in medicine, multiple factors are weighed or tested before a diagnosis is made. For example, a headache is a common symptom of several diseases, so other factors must be tested too.

In [33]:
diabetes=pd.read_csv('../diabetes.csv')



### Data Processing 

####  Perform preprocessing steps such as handling missing values, scaling features to a similar range, and splitting the dataset into training and testing sets.

In [15]:
print(diabetes.head())
print("\n")
data_shape=diabetes.shape

print(f"Number of rows and columns in data are :{data_shape[0]} rows and {data_shape[1]} columns")

   Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  \
0            6      148             72             35        0  33.6   
1            1       85             66             29        0  26.6   
2            8      183             64              0        0  23.3   
3            1       89             66             23       94  28.1   
4            0      137             40             35      168  43.1   

   DiabetesPedigreeFunction  Age  Outcome  
0                     0.627   50        1  
1                     0.351   31        0  
2                     0.672   32        1  
3                     0.167   21        0  
4                     2.288   33        1  


Number of rows and columns in data are :768 rows and 9 columns


### Getting the Statistical measure of the data 

In [16]:
diabetes.describe()


| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| count | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 |
| mean | 3.845052 | 120.894531 | 69.105469 | 20.536458 | 79.799479 | 31.992578 | 0.471876 | 33.240885 | 0.348958 |
| std | 3.369578 | 31.972618 | 19.355807 | 15.952218 | 115.244002 | 7.88416 | 0.331329 | 11.760232 | 0.476951 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.078 | 21.0 | 0.0 |
| 25% | 1.0 | 99.0 | 62.0 | 0.0 | 0.0 | 27.3 | 0.24375 | 24.0 | 0.0 |
| 50% | 3.0 | 117.0 | 72.0 | 23.0 | 30.5 | 32.0 | 0.3725 | 29.0 | 0.0 |
| 75% | 6.0 | 140.25 | 80.0 | 32.0 | 127.25 | 36.6 | 0.62625 | 41.0 | 1.0 |
| max | 17.0 | 199.0 | 122.0 | 99.0 | 846.0 | 67.1 | 2.42 | 81.0 | 1.0 |
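
The summary statistics report a minimum of 0 for Glucose, BloodPressure, SkinThickness, Insulin, and BMI. Such values are physiologically implausible, so they may be placeholders for missing measurements; that reading is an assumption, not something the dataset states. The sketch below shows one possible way to handle them: mark the zeros as missing and fill them with the column median. It uses a small hypothetical frame built from the five rows printed earlier rather than the full file.

```python
import numpy as np
import pandas as pd

# Hypothetical mini-frame copied from the first five rows shown earlier
sample = pd.DataFrame({
    "Pregnancies": [6, 1, 8, 1, 0],
    "Glucose": [148, 85, 183, 89, 137],
    "BloodPressure": [72, 66, 64, 66, 40],
    "SkinThickness": [35, 29, 0, 23, 35],
    "Insulin": [0, 0, 0, 94, 168],
    "BMI": [33.6, 26.6, 23.3, 28.1, 43.1],
})

# Assumption: in these columns a 0 means "not recorded".
# Pregnancies is left out because 0 is a valid value there.
zero_as_missing = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]

# Replace 0 with NaN only in the selected columns
cleaned = sample.replace({col: 0 for col in zero_as_missing}, np.nan)
missing_counts = cleaned.isna().sum()

# Fill each missing value with the median of its column
cleaned = cleaned.fillna(cleaned.median())

print(missing_counts)
print(cleaned)
```

Median imputation is only one option; on the real data the imputation values would ideally be computed from the training split alone.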


In [35]:
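print(diabetes.columns)
# Count rows per class (the name `all` is avoided because it shadows the built-in)
outcome_counts=diabetes['Outcome'].value_counts()
print("0 -----> Non Diabetic")
print("1 ------> Diabetic")

print(f"{outcome_counts[0]} individuals do not have diabetes while {outcome_counts[1]} individuals are presented with diabetes")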

Index(['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
       'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome'],
      dtype='object')
0 -----> Non Diabetic
1 ------> Diabetic
500 individuals do not have diabetes while 268 individuals are presented with diabetes
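
Because the classes are unevenly split (500 versus 268), it helps to know what accuracy a trivial model would reach by always predicting the majority class. The short calculation below is a sketch of that baseline using the counts printed above; the variable names are illustrative.

```python
# Class counts taken from the output above
n_non_diabetic, n_diabetic = 500, 268
total = n_non_diabetic + n_diabetic

# Accuracy of a model that always predicts the majority class
majority_baseline = max(n_non_diabetic, n_diabetic) / total
print(f"Majority-class baseline accuracy: {majority_baseline:.3f}")
```

Any accuracy reported later is more meaningful when compared against this baseline.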


In [19]:
diabetes.groupby('Outcome').mean()

| Outcome | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age |
|---|---|---|---|---|---|---|---|---|
| 0 | 3.298 | 109.98 | 68.184 | 19.664 | 68.792 | 30.3042 | 0.429734 | 31.19 |
| 1 | 4.865672 | 141.257463 | 70.824627 | 22.164179 | 100.335821 | 35.142537 | 0.5505 | 37.067164 |


### Separating the data and labels 

In [20]:
# `columns=` already identifies the axis, so `axis=1` is redundant
X=diabetes.drop(columns='Outcome')
Y=diabetes['Outcome']



### Data standardization 

In [21]:
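# Rescale every feature to zero mean and unit variance
# Note: fitting on all rows before the train/test split lets test-set statistics influence the scaling
scaler=StandardScaler()
scaler.fit(X)
standardized_data=scaler.transform(X)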

In [22]:
X=standardized_data
Y=diabetes['Outcome']

### Train test Split 

In [23]:
X_train,X_test,Y_train,Y_test=train_test_split(X,Y,test_size=0.2,stratify=Y,random_state=2)
print(X.shape,X_train.shape,X_test.shape)

(768, 8) (614, 8) (154, 8)
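
The scaler above was fitted on all 768 rows before the split. A common alternative is to split first and let a pipeline fit the scaler on the training rows only, so no test-set statistics reach the model. The sketch below illustrates the idea on synthetic data from `make_classification`, which stands in for the diabetes features; the variable names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in with the same shape as the diabetes features (768 x 8)
X_demo, y_demo = make_classification(n_samples=768, n_features=8, random_state=2)

# Split first, keeping the class ratio the same in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=2
)

# The pipeline fits the scaler on the training rows only,
# then applies that same scaling when predicting on test rows
leak_free_model = make_pipeline(StandardScaler(), SVC(kernel="linear"))
leak_free_model.fit(X_tr, y_tr)

test_accuracy = leak_free_model.score(X_te, y_te)
print(X_tr.shape, X_te.shape)
print(f"Test accuracy on synthetic data: {test_accuracy:.3f}")
```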


### Training the model 

#### Train an SVM model using the training data. Choose an appropriate kernel function (e.g., linear, polynomial, or radial basis function) and tune hyperparameters such as the regularization parameter (C) and the kernel coefficient (gamma) using grid search with cross-validation.

In [24]:
classifier=svm.SVC(kernel='linear')


### Training the support vector machine classifier

In [25]:
classifier.fit(X_train,Y_train)
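
With a linear kernel, the fitted model exposes one weight per feature through `coef_`, and on standardized inputs the magnitude of each weight gives a rough indication of how strongly that feature influences the decision boundary. The sketch below shows how the weights could be inspected; it is trained on synthetic data with placeholder feature names, so the ranking it prints says nothing about the diabetes features.

```python
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data with 8 features and placeholder names
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)
feature_names = [f"feature_{i}" for i in range(X_demo.shape[1])]

# Standardize so the weights are comparable across features
X_demo = StandardScaler().fit_transform(X_demo)
linear_svm = SVC(kernel="linear").fit(X_demo, y_demo)

# coef_ has shape (1, n_features) for a binary linear SVM
weights = linear_svm.coef_[0]

# Rank features by absolute weight, largest first
ranking = sorted(zip(feature_names, weights), key=lambda pair: abs(pair[1]), reverse=True)
for name, weight in ranking:
    print(f"{name}: {weight:+.3f}")
```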


## Model Evaluation 

#### Evaluate the trained SVM model on the testing data using performance metrics such as accuracy, precision, recall, and F1-score to assess its effectiveness in predicting diabetes.

### Accuracy Score 

In [26]:
X_train_prediction=classifier.predict(X_train)
# accuracy_score expects the true labels first, then the predictions
training_data_accuracy=accuracy_score(Y_train,X_train_prediction)
print(f"The accuracy score of the training dataset is : {training_data_accuracy *100}")

The accuracy score of the training dataset is : 78.66449511400651


### Accuracy score on the test prediction 

In [27]:
X_test_prediction=classifier.predict(X_test)
# accuracy_score expects the true labels first, then the predictions
test_data_accuracy=accuracy_score(Y_test,X_test_prediction)
print(f"Accuracy score for test data is : {test_data_accuracy *100}")


Accuracy score for test data is : 77.27272727272727
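
The evaluation description also mentions precision, recall, and F1-score, which matter here because the classes are imbalanced. The sketch below shows how these metrics could be computed with scikit-learn. The two label lists are a small hand-made example, not the notebook's predictions; on the real data `Y_test` and `X_test_prediction` would take their place.

```python
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Hand-made labels for illustration (1 = diabetic, 0 = non-diabetic)
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_hat = [1, 0, 0, 1, 0, 1, 1, 0]

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

precision = precision_score(y_true, y_hat)  # tp / (tp + fp)
recall = recall_score(y_true, y_hat)        # tp / (tp + fn)
f1 = f1_score(y_true, y_hat)                # harmonic mean of precision and recall

print(f"tn={tn} fp={fp} fn={fn} tp={tp}")
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

In a screening setting, recall on the diabetic class is often the number to watch, since a false negative means a missed case.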


### Making a Predictive system 

In [28]:
input_data=(4,110,92,0,0,37.6,0.191,30)

# Convert the input data to a NumPy array
input_data_as_numpy_array=np.asarray(input_data)

# Reshape to (1, n_features) because we are predicting for a single instance
input_data_reshape=input_data_as_numpy_array.reshape(1,-1)

#Standardize the input data 
std_input_data=scaler.transform(input_data_reshape)
print(std_input_data)




[[ 0.04601433 -0.34096773  1.18359575 -1.28821221 -0.69289057  0.71168975
  -0.84827977 -0.27575966]]
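
The first standardized value above can be reproduced by hand from the summary statistics, which makes the transformation concrete: `StandardScaler` computes z = (x - mean) / std for each feature. One detail worth noting is that `describe()` reports the sample standard deviation (ddof=1), while `StandardScaler` uses the population standard deviation (ddof=0), so a small correction is needed. The sketch below works through the Pregnancies value of 4.

```python
import math

n_rows = 768
mean_pregnancies = 3.845052  # from describe()
sample_std = 3.369578        # from describe(), computed with ddof=1

# Convert the sample std (ddof=1) to the population std (ddof=0) used by StandardScaler
population_std = sample_std * math.sqrt((n_rows - 1) / n_rows)

# z-score for an input of 4 pregnancies
z = (4 - mean_pregnancies) / population_std
print(z)
```

The result agrees with the first entry of the printed array to within the rounding of the summary statistics.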


### Prediction on Input data 

In [31]:
input_data_prediction=classifier.predict(std_input_data)
print(input_data_prediction)
if input_data_prediction[0]==0:
    print("Individual is not diabetic")
else:
    print("Individual is diabetic")
    

[0]
Individual is not diabetic
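
The prediction steps above can be wrapped in a small helper so that reshaping, scaling, and classification happen in one call. The sketch below is one way to do this, assuming the scaler and classifier are combined in a pipeline. The function name `predict_label` is hypothetical, and the model is trained on synthetic data purely so the example runs on its own. It also returns the signed distance from `decision_function`, which indicates how far the instance lies from the decision boundary.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def predict_label(model, features):
    """Return (label, margin) for a single instance given as a flat sequence."""
    # Reshape to (1, n_features) because the model expects a 2D array
    row = np.asarray(features, dtype=float).reshape(1, -1)
    label = int(model.predict(row)[0])
    # Signed distance to the boundary: positive values fall on the class-1 side
    margin = float(model.decision_function(row)[0])
    return label, margin


# Synthetic stand-in data; the real notebook would use the diabetes features
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=1)
demo_model = make_pipeline(StandardScaler(), SVC(kernel="linear")).fit(X_demo, y_demo)

label, margin = predict_label(demo_model, X_demo[0])
print(label, margin)
```

A margin close to zero suggests a borderline case, which may deserve more caution than the bare label conveys.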


### Improving the model to achieve higher accuracy

In [38]:
# Reuse X and Y from the cells above (X already holds the standardized features)

# Split the dataset into training and testing sets
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)

# Standardize the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Define parameter grid for grid search
param_grid = [
    {'C': [0.1, 1, 10, 100], 'kernel': ['linear']},
    {'C': [0.1, 1, 10, 100], 'kernel': ['poly'], 'degree': [2, 3, 4]},
    {'C': [0.1, 1, 10, 100], 'kernel': ['rbf'], 'gamma': [0.1, 0.01, 0.001]}
]

# Perform grid search with cross-validation to find the best hyperparameters
grid_search = GridSearchCV(SVC(), param_grid, cv=5)
grid_search.fit(X_train_scaled, Y_train)

# Get the best hyperparameters
best_params = grid_search.best_params_

# GridSearchCV refits the best configuration on the full training set by default
# (refit=True), so the tuned model is already trained and can be used directly
svm_model = grid_search.best_estimator_

# Make predictions on the test set
y_pred = svm_model.predict(X_test_scaled)

# Calculate accuracy
accuracy = accuracy_score(Y_test, y_pred)
print("Accuracy:", accuracy)

Accuracy: 0.7727272727272727
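
When tuning, it can help to look at the cross-validated score of the winning configuration as well as the single test-set number, and to put the scaler inside the search so each fold is scaled using its own training part. The sketch below illustrates both ideas on synthetic data. The step names `scale` and `svc` and the reduced grid are choices made for this example, not the notebook's settings.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)

# Scaling happens inside each cross-validation fold
pipe = Pipeline([("scale", StandardScaler()), ("svc", SVC())])

# Parameters of a pipeline step are addressed as <step>__<parameter>
demo_grid = [
    {"svc__kernel": ["linear"], "svc__C": [0.1, 1, 10]},
    {"svc__kernel": ["rbf"], "svc__C": [0.1, 1, 10], "svc__gamma": [0.1, 0.01]},
]

search = GridSearchCV(pipe, demo_grid, cv=5)
search.fit(X_demo, y_demo)

print("Best parameters:", search.best_params_)
print(f"Mean cross-validated accuracy of best model: {search.best_score_:.3f}")
```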


### Using a Random Forest Classifier

In [40]:

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=42)

# Initialize the Random Forest classifier
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)

# Train the classifier on the training data
rf_classifier.fit(X_train, y_train)

# Make predictions on the testing data
y_pred = rf_classifier.predict(X_test)

# Calculate the accuracy of the model
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

Accuracy: 0.7272727272727273
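
A single 154-row test split gives a noisy estimate, so a gap of a few percentage points between two models may not be meaningful. One way to compare them more fairly is to score both with the same cross-validation folds. The sketch below does this on synthetic data; the dictionary keys and variable names are illustrative, and the scores it prints do not describe the diabetes dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)

# The same stratified folds are used for both models
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

candidates = {
    "svm_linear": make_pipeline(StandardScaler(), SVC(kernel="linear")),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

# One accuracy value per fold for each model
fold_scores = {
    name: cross_val_score(model, X_demo, y_demo, cv=folds)
    for name, model in candidates.items()
}

for name, scores in fold_scores.items():
    print(f"{name}: mean={scores.mean():.3f} std={scores.std():.3f}")
```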
