## INTRODUCTION
* This project implements a machine learning solution for predicting diabetes using the PIMA Indian Diabetes Dataset.
* The goal is to classify patients as diabetic or non-diabetic based on medical diagnostic measurements using a Support Vector Machine (SVM) classifier.

### DATASET OVERVIEW:
- 768 samples with 8 medical features
- Features include glucose level, BMI, age, insulin, etc.
- Binary classification: 0 (Non-Diabetic) vs 1 (Diabetic)
- Real-world application: Early diabetes detection systems

### METHODOLOGY:
1. Data loading and exploratory analysis
2. Data standardization for SVM optimization
3. Train-test split (80%-20%)
4. SVM model training with linear kernel
5. Performance evaluation and testing
6. Predictive system implementation

## Import required libraries

In [1]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn import svm
from sklearn.metrics import accuracy_score

## Load and explore the dataset

In [2]:
diabetes_dataset = pd.read_csv(r"C:\Users\Mohamed Makki\Desktop\Projects\Machine_Learning-Projects\Project_6\Dataset\diabetes.csv")

In [3]:
print(f"Dataset shape: {diabetes_dataset.shape}")
print(f"Features: {list(diabetes_dataset.columns[:-1])}")
print(f"Target: {diabetes_dataset.columns[-1]}")

Dataset shape: (768, 9)
Features: ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age']
Target: Outcome
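
As part of exploring the data, it can help to count zero entries in columns where zero is not a physiologically plausible measurement (for example glucose or BMI); in this dataset such zeros are commonly treated as missing values. The sketch below shows the check on a small hypothetical frame `toy` that reuses five of the column names above. Its values are invented for illustration and are not rows from the CSV.

```python
import pandas as pd

# Hypothetical toy rows using the notebook's column names; not taken from the real CSV
toy = pd.DataFrame({
    "Glucose": [148, 0, 183],
    "BloodPressure": [72, 66, 0],
    "SkinThickness": [35, 29, 0],
    "Insulin": [0, 0, 94],
    "BMI": [33.6, 26.6, 0.0],
})

# Count how many entries in each column are exactly zero
zero_counts = (toy == 0).sum()
print(zero_counts)
```

On the real data, the same expression could be applied to the corresponding columns of `diabetes_dataset`.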


## Display class distribution

In [4]:
outcome_counts = diabetes_dataset['Outcome'].value_counts()
print("\nClass distribution:")
print(f"Non-Diabetic (0): {outcome_counts[0]} samples ({outcome_counts[0]/len(diabetes_dataset)*100:.1f}%)")
print(f"Diabetic (1): {outcome_counts[1]} samples ({outcome_counts[1]/len(diabetes_dataset)*100:.1f}%)")


Class distribution:
Non-Diabetic (0): 500 samples (65.1%)
Diabetic (1): 268 samples (34.9%)
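
Because the classes are imbalanced, the counts above also give a reference point for later accuracy figures: a trivial classifier that always predicts the majority class. A minimal sketch of that arithmetic, with the two counts copied from the output above and illustrative variable names:

```python
# Class counts as printed in the class distribution above
n_non_diabetic = 500
n_diabetic = 268
n_total = n_non_diabetic + n_diabetic

# Accuracy of a trivial classifier that always predicts the majority class
baseline_accuracy = max(n_non_diabetic, n_diabetic) / n_total
print(f"Majority-class baseline accuracy: {baseline_accuracy:.3f}")
```

Any trained model should be judged against this baseline rather than against 50%.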


## Show statistical summary by class

In [5]:
print("\nAverage values by class:")
class_means = diabetes_dataset.groupby('Outcome').mean()
print(f"Non-Diabetic average glucose: {class_means.loc[0, 'Glucose']:.1f}")
print(f"Diabetic average glucose: {class_means.loc[1, 'Glucose']:.1f}")


Average values by class:
Non-Diabetic average glucose: 110.0
Diabetic average glucose: 141.3


## Data preprocessing

In [6]:
X = diabetes_dataset.drop(columns='Outcome')  # Features (8 medical measurements)
y = diabetes_dataset['Outcome']               # Target (0: Non-Diabetic, 1: Diabetic)

## Data standardization - crucial for SVM performance

In [7]:
scaler = StandardScaler()
X_standardized = scaler.fit_transform(X)

In [8]:
print("\nData preprocessed and standardized for SVM")
print(f"Feature matrix shape: {X_standardized.shape}")
print(f"Target vector shape: {y.shape}")


Data preprocessed and standardized for SVM
Feature matrix shape: (768, 8)
Target vector shape: (768,)
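
Standardization rescales each feature to zero mean and unit variance. The sketch below reproduces the calculation by hand with NumPy for a single column, using the population standard deviation (`ddof=0`), which is the convention `StandardScaler` follows. The `glucose` values are hypothetical and serve only to illustrate the formula.

```python
import numpy as np

# Hypothetical glucose readings, for illustration only
glucose = np.array([85.0, 110.0, 140.0, 165.0])

# z = (x - mean) / std, with the population standard deviation (ddof=0)
z = (glucose - glucose.mean()) / glucose.std(ddof=0)

print(z.round(3))
print(f"mean after scaling: {z.mean():.3f}, std after scaling: {z.std():.3f}")
```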


## Split data into training and testing sets

In [9]:
X_train, X_test, y_train, y_test = train_test_split(
    X_standardized, y, test_size=0.2, stratify=y, random_state=42
)

In [10]:
print("\nData split completed:")
print(f"Training samples: {X_train.shape[0]}")
print(f"Testing samples: {X_test.shape[0]}")


Data split completed:
Training samples: 614
Testing samples: 154
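
The cells above fit the scaler on all 768 rows and split afterwards, so the test rows contribute to the scaling statistics. A common alternative is to split first and let a `Pipeline` fit the scaler on the training rows only. The sketch below assumes synthetic data from `make_classification` as a stand-in for the CSV; `X_demo`, `y_demo`, and `leak_free_model` are illustrative names, and the accuracy it prints is not a result of this project.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the 768 x 8 feature matrix; not the real dataset
X_demo, y_demo = make_classification(
    n_samples=768, n_features=8, n_informative=5, weights=[0.65, 0.35], random_state=0
)

# Split the raw (unscaled) features first
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# The pipeline fits StandardScaler on X_tr only, then reuses those
# statistics whenever it transforms X_te inside predict() or score()
leak_free_model = make_pipeline(StandardScaler(), SVC(kernel="linear", random_state=42))
leak_free_model.fit(X_tr, y_tr)
demo_test_accuracy = leak_free_model.score(X_te, y_te)

print(f"Training rows: {X_tr.shape[0]}, testing rows: {X_te.shape[0]}")
print(f"Demo test accuracy (synthetic data): {demo_test_accuracy:.4f}")
```

With this layout the fitted pipeline can also be passed directly to cross-validation helpers, since scaling and classification are refit together on each fold.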


## Train Support Vector Machine model

In [11]:
classifier = svm.SVC(kernel='linear', random_state=42)
classifier.fit(X_train, y_train)

print("\nSVM model trained with linear kernel")


SVM model trained with linear kernel
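
With a linear kernel, the fitted model reduces to a weight vector and an intercept, and the score for a sample is `w . x + b`. The sketch below checks this on synthetic data by recomputing the scores from `coef_` and `intercept_` and comparing them with `decision_function`. The names `X_demo`, `y_demo`, and `linear_svm` are illustrative and separate from the notebook's `classifier`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic two-class data with 8 features; not the diabetes dataset
X_demo, y_demo = make_classification(
    n_samples=200, n_features=8, n_informative=5, random_state=0
)

linear_svm = SVC(kernel="linear").fit(X_demo, y_demo)

# Manual score: f(x) = w . x + b
manual_scores = X_demo @ linear_svm.coef_.ravel() + linear_svm.intercept_[0]
library_scores = linear_svm.decision_function(X_demo)
scores_match = np.allclose(manual_scores, library_scores)

# A positive score maps to class 1, a negative score to class 0
manual_labels = (manual_scores > 0).astype(int)

print(f"Manual scores match decision_function: {scores_match}")
```

Applied to the notebook's `classifier`, `classifier.coef_` would hold one weight per standardized feature, which gives a rough indication of which measurements push the score toward the diabetic class.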


## Evaluate model performance

In [12]:
train_predictions = classifier.predict(X_train)
test_predictions = classifier.predict(X_test)

In [13]:
train_accuracy = accuracy_score(y_train, train_predictions)
test_accuracy = accuracy_score(y_test, test_predictions)

In [14]:
print("\nModel Performance:")
print(f"Training Accuracy: {train_accuracy:.4f} ({train_accuracy*100:.2f}%)")
print(f"Testing Accuracy: {test_accuracy:.4f} ({test_accuracy*100:.2f}%)")


Model Performance:
Training Accuracy: 0.7915 (79.15%)
Testing Accuracy: 0.7208 (72.08%)


In [15]:
# Check model generalization
generalization = "Good" if abs(train_accuracy - test_accuracy) < 0.05 else "Fair"
print(f"Model Generalization: {generalization}")

Model Generalization: Fair
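
Accuracy alone can hide how the errors are distributed between the two classes, which matters here because roughly two thirds of the samples are non-diabetic. A confusion matrix with precision and recall for the diabetic class is one way to look closer. The sketch below uses short hypothetical label lists (`y_true_demo`, `y_pred_demo`) rather than the notebook's `y_test` and `test_predictions`, so its numbers say nothing about the trained model.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels, for illustration only (not the notebook's y_test)
y_true_demo = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred_demo = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

# For binary labels, ravel() yields: true negatives, false positives,
# false negatives, true positives
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

precision = precision_score(y_true_demo, y_pred_demo)  # tp / (tp + fp)
recall = recall_score(y_true_demo, y_pred_demo)        # tp / (tp + fn)

print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print(f"Precision (class 1): {precision:.3f}, Recall (class 1): {recall:.3f}")
```

In the notebook itself, the same calls would take `y_test` and `test_predictions` as arguments.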


## Prediction system for new patient data

In [16]:
def predict_diabetes_risk(model, scaler, patient_data):
    """
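    Predict diabetes risk for a new patient.
    
    Parameters:
    model: Trained SVM classifier
    scaler: Fitted StandardScaler object
    patient_data: Sequence of 8 medical measurements, in the training column order
    
    Returns:
    Tuple (prediction, risk_status, confidence_level), where confidence_level
    is the absolute value of the SVM decision function (an unnormalized
    margin score, not a probability)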
    """
    # Convert input to numpy array and reshape
    input_array = np.asarray(patient_data).reshape(1, -1)
    
    # Standardize the input data using the same scaler
    standardized_input = scaler.transform(input_array)
    
    # Make prediction
    prediction = model.predict(standardized_input)[0]
    confidence = model.decision_function(standardized_input)[0]
    
    # Interpret result
    risk_status = "Non-Diabetic" if prediction == 0 else "Diabetic"
    confidence_level = abs(confidence)
    
    return prediction, risk_status, confidence_level
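
Because `scaler` was fitted on a pandas DataFrame, recent scikit-learn versions emit a warning about missing feature names when `transform` receives a plain NumPy array. One possible way to avoid that is to wrap the patient tuple in a one-row DataFrame with the training column names. The sketch below is an assumption about how this could look, not part of the original function: `train_frame` holds illustrative values only, and `demo_scaler` is a separate scaler fitted just for this example.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

feature_names = [
    "Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
    "Insulin", "BMI", "DiabetesPedigreeFunction", "Age",
]

# Illustrative rows only; in the notebook the scaler is fitted on the real features
train_frame = pd.DataFrame(
    [
        [1, 89, 66, 23, 94, 28.1, 0.167, 21],
        [2, 120, 70, 30, 100, 30.0, 0.400, 35],
        [5, 160, 80, 35, 150, 35.5, 0.600, 50],
    ],
    columns=feature_names,
)
demo_scaler = StandardScaler().fit(train_frame)

# Wrap the patient tuple in a one-row DataFrame so column names match the fit
patient = (1, 89, 66, 23, 94, 28.1, 0.167, 21)
patient_frame = pd.DataFrame([patient], columns=feature_names)
scaled = demo_scaler.transform(patient_frame)

print(scaled.shape)
```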

## Test prediction system with sample patient data

In [17]:
# Format: (Pregnancies, Glucose, BloodPressure, SkinThickness, Insulin, BMI, DiabetesPedigree, Age)
sample_patient = (1, 89, 66, 23, 94, 28.1, 0.167, 21)

prediction, risk_status, confidence = predict_diabetes_risk(classifier, scaler, sample_patient)



In [18]:
print("\nSample Patient Prediction:")
print(f"Input data: {sample_patient}")
print(f"Prediction: {prediction} ({risk_status})")
print(f"Decision confidence: {confidence:.4f}")


Sample Patient Prediction:
Input data: (1, 89, 66, 23, 94, 28.1, 0.167, 21)
Prediction: 0 (Non-Diabetic)
Decision confidence: 2.4876


In [19]:
# Interpret the result
if prediction == 0:
    print("Result: The patient is predicted to be non-diabetic")
else:
    print("Result: The patient is predicted to be diabetic")

Result: The patient is predicted to be non-diabetic
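
The "decision confidence" printed above is the magnitude of the SVM decision function, which has no fixed scale and is not a probability. If probability-like outputs are wanted, one option is to calibrate the classifier, for example with scikit-learn's `CalibratedClassifierCV`. The sketch below assumes synthetic data and illustrative names (`X_demo`, `y_demo`, `calibrated_svm`); it is not how the notebook's `classifier` was trained.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic two-class data with 8 features; not the diabetes dataset
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=5, random_state=0
)

# Sigmoid (Platt-style) calibration fitted with 5-fold cross-validation
calibrated_svm = CalibratedClassifierCV(SVC(kernel="linear"), method="sigmoid", cv=5)
calibrated_svm.fit(X_demo, y_demo)

# Columns follow calibrated_svm.classes_, here [0, 1]
probabilities = calibrated_svm.predict_proba(X_demo[:1])
print(f"P(class 0) = {probabilities[0, 0]:.3f}, P(class 1) = {probabilities[0, 1]:.3f}")
```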


## CONCLUSION
* This project demonstrates the application of a Support Vector Machine to diabetes prediction. With standardized features, the linear-kernel SVM reached 79.15% training accuracy and 72.08% testing accuracy, compared with the 65.1% share of the majority (non-diabetic) class.

* The implementation highlights the importance of data preprocessing, particularly feature standardization: SVMs are sensitive to feature scale, so unscaled measurements with large ranges would otherwise dominate the decision boundary. This approach has practical applications in healthcare systems, early disease detection, and preventive medicine.

* The results suggest that machine learning can assist medical professionals in assessing patient health risks when applied to well-structured medical datasets, although the gap between training and testing accuracy (rated "Fair" above) leaves room for improvement.