# Problem Statement:

The goal of this project is to develop a predictive model using Support Vector Machines (SVMs) to accurately classify individuals as diabetic or non-diabetic based on their medical data. This model aims to aid in the early diagnosis of diabetes, potentially leading to better management strategies and improved health outcomes.

## Dataset:

The dataset used is "diabetes.csv", which contains 768 records with eight medical features and a binary class label.

Features in the dataset include:
- pregnancies: Number of times pregnant
- glucose: Plasma glucose concentration
- diastolic: Diastolic blood pressure (mm Hg)
- triceps: Triceps skin fold thickness (mm)
- insulin: 2-hour serum insulin (μU/mL)
- bmi: Body mass index (weight in kg/(height in m)^2)
- dpf: Diabetes pedigree function
- age: Age (years)
- diabetes: Class variable (0 = non-diabetic, 1 = diabetic)

## Methodology:

### Data Preprocessing:

- Handle missing values (e.g., imputation or removal).
- Standardize the features, since SVMs are sensitive to feature scale.
- Split the data into training and testing sets.

### SVM Model Development:

- Select an appropriate SVM kernel (linear, polynomial, RBF, etc.).
- Tune hyperparameters (e.g., C, gamma) using techniques like grid search or cross-validation.
- Train the SVM model on the training data.

### Evaluation:

- Evaluate the model's performance on the testing set using metrics such as:
  - Accuracy
  - Precision
  - Recall
  - F1-score
  - ROC curve and AUC
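ROC AUC is computed from a continuous score rather than from hard class labels. The following is a minimal sketch, assuming synthetic data from `make_classification` as a stand-in for the diabetes features; the variable names (`X_demo`, `clf`, `scores`) are illustrative and not part of the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the diabetes data (assumption: 768 rows, 8 features).
X_demo, y_demo = make_classification(
    n_samples=768, n_features=8, weights=[0.65, 0.35], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

clf = SVC(kernel="rbf").fit(X_tr, y_tr)

# SVC does not expose predict_proba unless probability=True is set;
# the signed distance from decision_function is enough for ROC analysis.
scores = clf.decision_function(X_te)
auc = roc_auc_score(y_te, scores)
fpr, tpr, thresholds = roc_curve(y_te, scores)
print(f"ROC AUC: {auc:.3f}")
```

The `fpr` and `tpr` arrays can be passed to any plotting library to draw the curve itself.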

## Considerations:

- **Feature Importance**: Analyze feature importance to understand which factors contribute most to diabetes prediction.
- **Imbalanced Classes**: If the dataset is imbalanced (uneven distribution of diabetic/non-diabetic cases), address this using techniques like oversampling, undersampling, or cost-sensitive learning.
- **Comparative Analysis**: Compare the performance of the SVM model to other machine learning algorithms (e.g., decision trees, logistic regression) to determine the most effective approach.
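The class-imbalance and comparison points can be explored together. The sketch below is one possible setup, not the notebook's method: it assumes synthetic data with roughly 35% positives, uses `class_weight="balanced"` as the cost-sensitive option, and scores each candidate with 5-fold cross-validated F1. The `candidates` dictionary and the tree depth are illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in with about 35% positive cases (assumption).
X_demo, y_demo = make_classification(
    n_samples=768, n_features=8, weights=[0.65, 0.35], random_state=42
)

# class_weight="balanced" reweights errors inversely to class frequency.
candidates = {
    "svm_rbf": make_pipeline(
        StandardScaler(), SVC(kernel="rbf", class_weight="balanced")
    ),
    "logistic_regression": make_pipeline(
        StandardScaler(),
        LogisticRegression(class_weight="balanced", max_iter=1000),
    ),
    "decision_tree": DecisionTreeClassifier(
        class_weight="balanced", max_depth=4, random_state=42
    ),
}

results = {}
for name, estimator in candidates.items():
    # F1 on the positive class is more informative than accuracy here.
    f1_scores = cross_val_score(estimator, X_demo, y_demo, cv=5, scoring="f1")
    results[name] = f1_scores.mean()
    print(f"{name:>20}: F1 = {f1_scores.mean():.3f} +/- {f1_scores.std():.3f}")
```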




**1. Import Necessary Libraries**

In [11]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.preprocessing import StandardScaler

**2. Load the Dataset**

* **Upload your "diabetes.csv" file:** Click the "Files" icon in the left sidebar, then "Upload" and select your dataset.
* **Load the dataset into a Pandas DataFrame:**

In [12]:
diabetes_df = pd.read_csv("diabetes.csv")
diabetes_df.head()  # Display the first few rows to check the data

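|   | pregnancies | glucose | diastolic | triceps | insulin | bmi | dpf | age | diabetes |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |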


In [13]:
# Summarize column types, non-null counts, and basic statistics
diabetes_df.info()
diabetes_df.describe()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   pregnancies  768 non-null    int64  
 1   glucose      768 non-null    int64  
 2   diastolic    768 non-null    int64  
 3   triceps      768 non-null    int64  
 4   insulin      768 non-null    int64  
 5   bmi          768 non-null    float64
 6   dpf          768 non-null    float64
 7   age          768 non-null    int64  
 8   diabetes     768 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 54.1 KB


|   | pregnancies | glucose | diastolic | triceps | insulin | bmi | dpf | age | diabetes |
|---|---|---|---|---|---|---|---|---|---|
| count | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 |
| mean | 3.845052 | 120.894531 | 69.105469 | 20.536458 | 79.799479 | 31.992578 | 0.471876 | 33.240885 | 0.348958 |
| std | 3.369578 | 31.972618 | 19.355807 | 15.952218 | 115.244002 | 7.88416 | 0.331329 | 11.760232 | 0.476951 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.078 | 21.0 | 0.0 |
| 25% | 1.0 | 99.0 | 62.0 | 0.0 | 0.0 | 27.3 | 0.24375 | 24.0 | 0.0 |
| 50% | 3.0 | 117.0 | 72.0 | 23.0 | 30.5 | 32.0 | 0.3725 | 29.0 | 0.0 |
| 75% | 6.0 | 140.25 | 80.0 | 32.0 | 127.25 | 36.6 | 0.62625 | 41.0 | 1.0 |
| max | 17.0 | 199.0 | 122.0 | 99.0 | 846.0 | 67.1 | 2.42 | 81.0 | 1.0 |


**3. Preprocessing**

* **Handle missing values:**

In [14]:
diabetes_df.isnull().sum()  # Count NaN values per column
# No NaNs are reported, but describe() shows a minimum of 0 for glucose, diastolic,
# triceps, insulin, and bmi; such zeros are likely missing readings stored as 0.

pregnancies    0
glucose        0
diastolic      0
triceps        0
insulin        0
bmi            0
dpf            0
age            0
diabetes       0
dtype: int64
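`isnull()` only detects `NaN`, so it cannot see placeholders stored as 0. A minimal sketch of one way to handle this, assuming that a zero in `glucose`, `diastolic`, `triceps`, `insulin`, or `bmi` means "not measured" (an assumption, since the dataset itself does not say so). It uses the five rows shown by `diabetes_df.head()` as a small stand-in; `demo_df` and `cleaned_df` are illustrative names.

```python
import numpy as np
import pandas as pd

# The five rows displayed by diabetes_df.head(), used as a tiny stand-in.
demo_df = pd.DataFrame({
    "pregnancies": [6, 1, 8, 1, 0],
    "glucose": [148, 85, 183, 89, 137],
    "diastolic": [72, 66, 64, 66, 40],
    "triceps": [35, 29, 0, 23, 35],
    "insulin": [0, 0, 0, 94, 168],
    "bmi": [33.6, 26.6, 23.3, 28.1, 43.1],
    "dpf": [0.627, 0.351, 0.672, 0.167, 2.288],
    "age": [50, 31, 32, 21, 33],
    "diabetes": [1, 0, 1, 0, 1],
})

# Assumption: zero in these columns means "missing", not a real reading.
zero_as_missing = ["glucose", "diastolic", "triceps", "insulin", "bmi"]

# Count zeros per column before changing anything.
zero_counts = (demo_df[zero_as_missing] == 0).sum()
print(zero_counts)

# Mark zeros as NaN, then fill each column with the median of its observed values.
cleaned_df = demo_df.copy()
cleaned_df[zero_as_missing] = cleaned_df[zero_as_missing].replace(0, np.nan)
cleaned_df[zero_as_missing] = cleaned_df[zero_as_missing].fillna(
    cleaned_df[zero_as_missing].median()
)
print(cleaned_df)
```

In a full workflow the medians would normally be computed on the training split only and then applied to the test split.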

* **Feature Scaling:**

In [15]:
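# Standardize each feature to zero mean and unit variance.
# Caveat: fitting on all rows before the split lets test-set statistics leak into training.
scaler = StandardScaler()
features_to_scale = ['pregnancies', 'glucose', 'diastolic', 'triceps', 'insulin', 'bmi', 'dpf', 'age']
diabetes_df[features_to_scale] = scaler.fit_transform(diabetes_df[features_to_scale])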

* **Split data:**

In [16]:
X = diabetes_df.drop('diabetes', axis=1)
y = diabetes_df['diabetes']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
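The cells above fit `StandardScaler` on every row and only then split the data. One common alternative is to split first and wrap the scaler and the classifier in a pipeline, so the scaler is fitted on training rows only. The sketch below assumes synthetic data in place of `diabetes.csv` and adds `stratify`, which the notebook does not use; `pipe` and the `X_tr`/`X_te` names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the diabetes data (assumption: 768 rows, 8 features).
X_demo, y_demo = make_classification(
    n_samples=768, n_features=8, weights=[0.65, 0.35], random_state=42
)

# stratify keeps the class ratio similar in the training and test splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# The pipeline fits the scaler on the training rows only and reuses
# those statistics when it transforms the test rows.
pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
pipe.fit(X_tr, y_tr)

test_accuracy = pipe.score(X_te, y_te)
print(f"Test accuracy: {test_accuracy:.3f}")
```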

**4. SVM Model Creation and Training**

Experiment with different kernels (e.g., 'linear', 'poly', 'rbf') to see which one fits the data best.

In [17]:
model = SVC(kernel='rbf')  # Baseline: RBF kernel with the default C and gamma
model.fit(X_train, y_train)
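Kernels can be compared systematically with cross-validation instead of a single train/test split. This is a rough sketch on synthetic data (an assumption, since it does not read `diabetes.csv`); the `kernel_scores` dictionary is an illustrative name.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (assumption).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)

kernel_scores = {}
for kernel in ["linear", "poly", "rbf", "sigmoid"]:
    # Scaling happens inside each fold, so no fold sees the others' statistics.
    pipe = make_pipeline(StandardScaler(), SVC(kernel=kernel))
    scores = cross_val_score(pipe, X_demo, y_demo, cv=5, scoring="accuracy")
    kernel_scores[kernel] = scores.mean()
    print(f"{kernel:>8}: {scores.mean():.3f} +/- {scores.std():.3f}")
```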

**5. Prediction and Evaluation**

In [18]:
y_pred = model.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))

Accuracy: 0.7272727272727273
Confusion Matrix:
 [[81 18]
 [24 31]]
Classification Report:
               precision    recall  f1-score   support

           0       0.77      0.82      0.79        99
           1       0.63      0.56      0.60        55

    accuracy                           0.73       154
   macro avg       0.70      0.69      0.70       154
weighted avg       0.72      0.73      0.72       154
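The class-1 figures in the report follow directly from the confusion matrix, whose rows are actual classes and whose columns are predicted classes. A short worked example using the four counts printed above:

```python
# Counts copied from the confusion matrix above
# (rows = actual class, columns = predicted class).
tn, fp = 81, 18   # actual non-diabetic: correctly rejected, falsely flagged
fn, tp = 24, 31   # actual diabetic: missed, correctly detected

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of those predicted diabetic, the share that are
recall = tp / (tp + fn)      # of those actually diabetic, the share detected
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
# prints: accuracy=0.73 precision=0.63 recall=0.56 f1=0.60
```

A recall of 0.56 means that 24 of the 55 diabetic patients in the test set were missed, which matters more for a screening tool than the overall accuracy suggests.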



**6. Hyperparameter Tuning**

In [19]:
from sklearn.model_selection import GridSearchCV

# Search over C and gamma for the default RBF kernel with 5-fold cross-validation
param_grid = {'C': [0.1, 1, 10, 100], 'gamma': [1, 0.1, 0.01, 0.001]}
grid = GridSearchCV(SVC(), param_grid, refit=True, verbose=2)
grid.fit(X_train, y_train)
print(grid.best_params_)

Fitting 5 folds for each of 16 candidates, totalling 80 fits
[CV] END .....................................C=0.1, gamma=1; total time=   0.0s
[CV] END ...................................C=0.1, gamma=0.1; total time=   0.0s
[CV] END ..................................C=0.1, gamma=0.01; total time=   0.0s
[... remaining fit log truncated ...]

In [20]:
model = SVC(kernel='rbf', C=100, gamma=0.001)  # RBF kernel with C and gamma set after the grid search
model.fit(X_train, y_train)
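Because `GridSearchCV` is called with `refit=True`, the search object already holds a model retrained on the whole training set with the best parameters, so the values do not have to be retyped by hand. A minimal sketch on synthetic data (an assumption), with the scaler placed inside a pipeline; the step names `scale` and `svc` and the variable `tuned_model` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (assumption).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

pipe = Pipeline([("scale", StandardScaler()), ("svc", SVC(kernel="rbf"))])

# Same C and gamma values as the grid above; the "svc__" prefix routes
# each parameter to the SVC step of the pipeline.
param_grid = {
    "svc__C": [0.1, 1, 10, 100],
    "svc__gamma": [1, 0.1, 0.01, 0.001],
}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X_tr, y_tr)

print("Best parameters:", search.best_params_)
print(f"Best CV accuracy: {search.best_score_:.3f}")

# refit=True is the default, so best_estimator_ is already trained on all of X_tr.
tuned_model = search.best_estimator_
print(f"Test accuracy: {tuned_model.score(X_te, y_te):.3f}")
```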

**7. Results and Classification Report**

In [21]:
y_pred = model.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))

Accuracy: 0.7727272727272727
Confusion Matrix:
 [[82 17]
 [18 37]]
Classification Report:
               precision    recall  f1-score   support

           0       0.82      0.83      0.82        99
           1       0.69      0.67      0.68        55

    accuracy                           0.77       154
   macro avg       0.75      0.75      0.75       154
weighted avg       0.77      0.77      0.77       154
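An RBF-kernel SVM has no coefficients that map directly to input features, so the feature-importance question raised under Considerations needs a model-agnostic method. One option is permutation importance, sketched here on synthetic data (an assumption). The feature names are generic placeholders, and the printed values say nothing about the real diabetes features.

```python
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data with placeholder feature names (assumption).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, n_informative=3, random_state=42
)
feature_names = [f"feature_{i}" for i in range(X_demo.shape[1])]

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf")).fit(X_tr, y_tr)

# Shuffle one column at a time on held-out data and measure the accuracy drop.
result = permutation_importance(
    pipe, X_te, y_te, n_repeats=10, random_state=42, scoring="accuracy"
)

# Print features from most to least important.
for idx in result.importances_mean.argsort()[::-1]:
    print(f"{feature_names[idx]:>10}: "
          f"{result.importances_mean[idx]:.3f} +/- {result.importances_std[idx]:.3f}")
```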

