## Disease Prediction Using Classification

In this notebook, we build a classification model to **predict the disease (prognosis)** based on a patient's symptoms, weather conditions, and other features such as age and gender.

### 🔍 Why Classification?

This task is a classic **supervised learning problem**: we have input features and a known output label (`prognosis`). Classification allows us to train a model that can learn from historical cases and make accurate predictions on new patient data.

For example:
- If a patient presents with **headache, vomiting, and high temperature**, the model may predict a high probability of **Migraine**.
- If a patient has **chest pain, high blood pressure, and humidity is high**, it might predict **Heart Attack**.

These predictions could support **clinical decision-making**, early detection, or patient triage.

### 🔗 How Pattern Mining Helps

Earlier, we used pattern mining (FP-Growth) to identify frequent symptom combinations linked to specific diseases. Those patterns help:
- Highlight **strong symptom-disease associations** (e.g., `{headache, vomiting} → Migraine`)
- Guide **feature importance awareness** before modeling
- Validate whether the model is learning similar relationships

### ❌ Why Not Clustering or Outlier Detection?

- **Clustering** is unsupervised and used to explore hidden groupings — but we already know the disease labels.
- **Outlier detection** identifies rare or unusual data points — useful for anomaly detection, not disease prediction.

Therefore, **classification** is the most appropriate and effective approach for our goal.


## Step 1: Load Preprocessed Data

We begin by loading the cleaned and scaled dataset from the preprocessing step. This dataset includes both binary symptom indicators and continuous features (e.g., age, weather) that have been normalized.


In [20]:
import pandas as pd

# Load preprocessed data
df = pd.read_csv("../data/processed/cleaned_data.csv")

# Separate features and target
X = df.drop(columns=["prognosis"])
y = df["prognosis"]

print(f"Dataset shape: {X.shape}")
print(f"Target classes: {y.nunique()} → {y.unique()[:11]}")
# Display first few rows of the dataset
print("\nFirst few rows of the dataset:")
print(X.head())


Dataset shape: (4981, 49)
Target classes: 11 → ['Heart Attack' 'Influenza' 'Dengue' 'Sinusitis' 'Eczema' 'Common Cold'
 'Heat Stroke' 'Migraine' 'Malaria' 'Arthritis' 'Stroke']

First few rows of the dataset:
        Age  Gender  Temperature (C)  Humidity  Wind Speed (km/h)  nausea  \
0  0.030303       1         0.729691  0.586755           0.264610       1   
1  0.545455       0         0.654889  0.364238           0.486594       0   
2  0.444444       0         0.515404  0.709272           0.136890       0   
3  0.050505       0         0.933323  0.380132           0.575202       1   
4  0.696970       0         0.593129  0.793377           0.572230       0   

   joint_pain  abdominal_pain  high_fever  chills  ...  sinus_headache  \
0           0               0           0       0  ...               0   
1           0               0           0       1  ...               0   
2           0               0           0       0  ...               0   
3           0               0   

## Step 2: Split Data into Training and Test Sets

To evaluate our classification model, we split the data into two parts:

- **Training set**: Used to train the model (80% of the data).
- **Test set**: Used to evaluate how well the model performs on unseen data (20%).

We use `train_test_split()` from `scikit-learn` with the following parameters:

- `X`: All the input features (symptoms, age, weather data, etc.).
- `y`: The target labels (i.e., the `prognosis` column — the disease to be predicted).
- `test_size=0.2`: Allocates 20% of the data for testing.
- `stratify=y`: Ensures that each class (disease) is proportionally represented in both train and test sets.
- `random_state=42`: Ensures reproducibility by fixing the random seed.

This gives us:

- `X_train`: Feature values for training.
- `X_test`: Feature values for testing.
- `y_train`: Target labels for training.
- `y_test`: Target labels for testing.
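
As a quick illustration of what `stratify=y` does, the sketch below splits a small, deliberately imbalanced label vector and compares class shares in the two parts. The data (`X_toy`, `y_toy`) and the helper `class_share` are hypothetical and exist only for this example; they are not part of the project dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced labels (hypothetical): 80 samples of class "A", 20 of class "B"
y_toy = np.array(["A"] * 80 + ["B"] * 20)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=42
)


def class_share(labels, cls):
    """Fraction of entries in `labels` equal to `cls`."""
    return float(np.mean(labels == cls))


train_share_b = class_share(y_tr, "B")
test_share_b = class_share(y_te, "B")
print(f"Share of class B -> train: {train_share_b:.2f}, test: {test_share_b:.2f}")
```

With stratification, the minority class keeps roughly the same share in both splits, which matters when some diseases are rarer than others.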


In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)



## Step 3: Train a K-Nearest Neighbors (KNN) Classifier

We train a KNN model using `k = 5`, which means the prediction is based on the 5 closest neighbors in the training data. KNN is a distance-based method, so it's important that continuous features are properly scaled — which was done during preprocessing.

### Introduction to K-Nearest Neighbors (KNN)

K-Nearest Neighbors (KNN) is a **non-parametric** and **instance-based** learning algorithm. It works by comparing a new, unseen data point to the labeled data points in the training set. Specifically, it looks for the `k` closest data points (neighbors) in terms of feature similarity — typically using a distance metric like Euclidean distance — and assigns the most common class label among those neighbors to the new point.

This method works well when similar cases tend to have similar outcomes, and its predictions are easy to explain: each one can be traced back to the specific training cases that voted for it.

In our context, KNN is useful because it can naturally take into account the **similarity of symptoms and environmental conditions** (like temperature or humidity) across patients. For example, if previous patients with high fever, chest pain, and similar weather exposure were diagnosed with dengue, KNN can use those historical patterns to make accurate predictions for new cases. It doesn't assume a fixed model form, which is helpful given the complex, multi-factor nature of medical data.
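
The neighbor search and majority vote described above can be sketched in a few lines of NumPy. This is a minimal illustration under simple assumptions (Euclidean distance, unweighted votes), not scikit-learn's implementation; the function `knn_predict` and the toy arrays are hypothetical names introduced for this example.

```python
from collections import Counter

import numpy as np


def knn_predict(train_X, train_y, x_new, k=5):
    """Classify x_new by majority vote among its k nearest training points."""
    # Euclidean distance from x_new to every training point
    distances = np.linalg.norm(train_X - x_new, axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Most common label among those neighbors
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Toy data (hypothetical): two well-separated groups of points
toy_X = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]], dtype=float)
toy_y = np.array(["A", "A", "A", "B", "B", "B"])

pred_near_a = knn_predict(toy_X, toy_y, np.array([0.5, 0.5]), k=3)
pred_near_b = knn_predict(toy_X, toy_y, np.array([5.5, 5.5]), k=3)
print(pred_near_a, pred_near_b)
```

Because every feature contributes to the distance, a feature on a much larger scale would dominate the vote, which is why the continuous features were normalized during preprocessing.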


### 🔍 Visual Explanation of KNN (Illustration Only)

To help visualize how the **K-Nearest Neighbors algorithm** works, the following plot shows a **synthetic 2D dataset** and a new data point being classified based on its 5 nearest neighbors.

> ⚠️ **Note**: This example is **purely illustrative**. It uses **artificial data** for visualization purposes and is not based on the actual medical dataset used in this project.

The plot demonstrates:
- How KNN uses distance to find the `k` nearest points.
- How the predicted class is chosen by **majority vote** among neighbors.
- The visual connection between the new sample and its neighbors.


In [None]:
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.neighbors import NearestNeighbors, KNeighborsClassifier
import numpy as np

# Generate simple 2D data for demonstration.
# Separate names are used so the real X and y loaded in Step 1 are not overwritten.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=2, n_informative=2, n_redundant=0,
    n_clusters_per_class=1, n_classes=3, random_state=42
)

colors = ['red', 'green', 'blue']
class_labels = ['Class 0', 'Class 1', 'Class 2']

# Train KNN model
k = 5
knn = KNeighborsClassifier(n_neighbors=k)
knn.fit(X_demo, y_demo)

# New sample
new_point = np.array([[0, 0]])
predicted_class = knn.predict(new_point)[0]

# Find k nearest neighbors
nn = NearestNeighbors(n_neighbors=k)
nn.fit(X_demo)
distances, indices = nn.kneighbors(new_point)

# Plot the data
plt.figure(figsize=(8, 6))

# Plot existing points
for class_value in np.unique(y_demo):
    plt.scatter(
        X_demo[y_demo == class_value, 0], X_demo[y_demo == class_value, 1],
        color=colors[class_value], label=class_labels[class_value], alpha=0.6
    )

# Highlight nearest neighbors
for idx in indices[0]:
    plt.plot([X_demo[idx, 0], new_point[0, 0]], [X_demo[idx, 1], new_point[0, 1]],
             'k--', alpha=0.5)

# Plot new point
plt.scatter(new_point[0][0], new_point[0][1], color='black', edgecolor='white', 
            marker='X', s=200, label='New Sample')

# Annotate prediction
plt.text(new_point[0][0] + 0.3, new_point[0][1], 
         f"Predicted: Class {predicted_class}", fontsize=12, color='black')

plt.title(f"KNN Decision (k={k}) with Nearest Neighbors Highlighted")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()



In [None]:
from sklearn.neighbors import KNeighborsClassifier

# Initialize and train KNN
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

# Predict on test set
y_pred = knn.predict(X_test)



## Step 3.1: Evaluate Model Performance

We evaluate the KNN model using standard classification metrics:
- **Accuracy**: Overall correctness
- **Precision / Recall / F1-Score**: Per-class evaluation
- **Confusion Matrix**: Visual breakdown of true vs predicted labels
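
To make the relationship between these metrics concrete, the sketch below derives accuracy, per-class precision, recall, and F1 from a small confusion matrix by hand. The matrix `cm_toy` is a made-up two-class example, not a result from this project.

```python
import numpy as np

# Hypothetical 2-class confusion matrix: rows = actual class, columns = predicted class
cm_toy = np.array([[8, 2],
                   [1, 9]])

# Accuracy: correct predictions (the diagonal) over all predictions
accuracy_toy = np.trace(cm_toy) / cm_toy.sum()

# Recall: of all samples that truly belong to a class, how many were found
recall_toy = np.diag(cm_toy) / cm_toy.sum(axis=1)

# Precision: of all samples predicted as a class, how many were correct
precision_toy = np.diag(cm_toy) / cm_toy.sum(axis=0)

# F1: harmonic mean of precision and recall
f1_toy = 2 * precision_toy * recall_toy / (precision_toy + recall_toy)

# Row-normalized matrix: each row sums to 1, so the diagonal equals per-class recall
cm_toy_normalized = cm_toy / cm_toy.sum(axis=1, keepdims=True)

print("accuracy :", accuracy_toy)
print("precision:", precision_toy.round(3))
print("recall   :", recall_toy.round(3))
print("f1       :", f1_toy.round(3))
```

The row normalization here is the same one used for the confusion matrix heatmap, so the diagonal of that plot can be read directly as per-class recall.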


In [None]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import os

def evaluate_model(y_true, y_pred, class_names, save_path):
    """
    Evaluates a classification model by printing metrics and plotting a normalized confusion matrix.

    Parameters:
    - y_true: Ground truth labels
    - y_pred: Predicted labels
    - class_names: List of class names in the order of encoded labels
    - save_path: Path to save the confusion matrix image
    """

    # Accuracy
    accuracy = accuracy_score(y_true, y_pred)
    print(f"Accuracy: {accuracy:.4f}")

    # Classification Report
    print(classification_report(y_true, y_pred, target_names=class_names))

    # Format labels: e.g. 0: Arthritis, 1: Common Cold, ...
    labels = [f"{i}: {name}" for i, name in enumerate(class_names)]

    # Confusion Matrix
    cm = confusion_matrix(y_true, y_pred)
    cm_normalized = cm.astype("float") / cm.sum(axis=1)[:, np.newaxis]

    # Ensure output folder exists
    os.makedirs(os.path.dirname(save_path), exist_ok=True)

    # Plot
    plt.figure(figsize=(10, 8))
    sns.heatmap(cm_normalized, annot=True, fmt=".2f", cmap="Blues", cbar=True,
                xticklabels=labels, yticklabels=labels)
    plt.title("Normalized Confusion Matrix")
    plt.xlabel("Predicted")
    plt.ylabel("Actual")
    plt.xticks(rotation=45, ha="right")
    plt.yticks(rotation=0)
    plt.tight_layout()
    plt.savefig(save_path, dpi=300)
    plt.show()

# Get class names from the target variable
# If your dataframe is still called df
class_names = df['prognosis'].unique().tolist()
class_names = sorted(class_names)  # Ensure correct order

# Evaluate the model
save_path = "../report/knn/confusion_matrix.png"
evaluate_model(y_test, y_pred, class_names, save_path)


## Step 3.2: Optimize the Number of Neighbors (k)

K-Nearest Neighbors relies on the `k` parameter — the number of nearest neighbors used to classify a new point.

Choosing the right `k` is crucial:
- **Too small** → very sensitive to noise (overfitting).
- **Too large** → overly smooth predictions (underfitting).

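In this step, we evaluate multiple values of `k` and visualize their corresponding accuracy scores. Our goal is to select the value of `k` that maximizes accuracy on the test set. Note that choosing `k` on the test set makes the reported accuracy somewhat optimistic, because the test data has then influenced the model; cross-validation on the training set is a common alternative.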
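
A common alternative to picking `k` from test-set accuracy is k-fold cross-validation on the training data, keeping the test set untouched until the end. The sketch below shows the idea on synthetic data; the dataset, the candidate range, and the names `cv_means` and `best_k_cv` are assumptions made for this illustration, not part of the notebook's pipeline.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (hypothetical), so the example runs on its own
X_syn, y_syn = make_classification(
    n_samples=300, n_features=6, n_informative=4, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, stratify=y_syn, random_state=0
)

# Mean 5-fold cross-validated accuracy for each candidate k, using training data only
cv_means = {}
for k_cand in range(1, 16, 2):
    model = KNeighborsClassifier(n_neighbors=k_cand)
    cv_means[k_cand] = cross_val_score(model, X_tr, y_tr, cv=5).mean()

best_k_cv = max(cv_means, key=cv_means.get)

# The test set is used once, only to report the final performance
final_model = KNeighborsClassifier(n_neighbors=best_k_cv).fit(X_tr, y_tr)
test_acc = final_model.score(X_te, y_te)
print(f"best k by CV: {best_k_cv}, test accuracy: {test_acc:.3f}")
```

The same loop could be applied to `X_train` and `y_train` from Step 2 if an unbiased test estimate is needed.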


In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt

k_values = range(1, 21)
accuracies = []

for k in k_values:
    model = KNeighborsClassifier(n_neighbors=k)
    model.fit(X_train, y_train)
    y_pred_k = model.predict(X_test)
    acc = accuracy_score(y_test, y_pred_k)
    accuracies.append(acc)

# Plot accuracy vs k
plt.figure(figsize=(8, 5))
plt.plot(k_values, accuracies, marker='o')
plt.title("Accuracy vs. Number of Neighbors (k)")
plt.xlabel("Number of Neighbors (k)")
plt.ylabel("Accuracy")
plt.xticks(k_values)
plt.grid(True)
plt.tight_layout()

# Save the figure
plt.savefig("../report/knn/accuracy_vs_k.png", dpi=300)
plt.show()



## Step 3.3: Final KNN Model & Interpretation

After tuning the `k` value, we retrain the model using the **best-performing number of neighbors** and finalize our results.

We evaluate this model again to confirm its performance and summarize the key takeaways.


In [None]:
# Final model with the best k from the sweep above (the first k that reaches the highest test accuracy)
best_k = k_values[accuracies.index(max(accuracies))]
final_knn = KNeighborsClassifier(n_neighbors=best_k)
final_knn.fit(X_train, y_train)
final_preds = final_knn.predict(X_test)

# Evaluation
from sklearn.metrics import classification_report, confusion_matrix

print(f"✅ Final KNN model trained with k = {best_k}")
print("\nFinal Classification Report:")
print(classification_report(y_test, final_preds))

# Generate new confusion matrix
evaluate_model(y_test, final_preds, class_names, save_path="../report/knn/final_confusion_matrix.png")




In [None]:
from sklearn.metrics import classification_report
import pandas as pd

# Generate the classification report as a dictionary
report = classification_report(y_test, final_preds, target_names=class_names, output_dict=True)

# Convert it to a DataFrame
df_report = pd.DataFrame(report).transpose()

# Round for better readability
df_report = df_report.round(3)

# Display the table
display(df_report)



## Step 4: Disease Prediction Using Decision Tree Classifier

In this step, we train a **Decision Tree classifier** to predict patient diagnoses based on their symptoms and environmental conditions.

### What is a Decision Tree?

A **Decision Tree** is a supervised learning model that makes predictions by learning a series of **if-else rules** from the data. The model recursively splits the dataset based on feature values that best separate the classes. The resulting structure is a flowchart-like tree, where:

- **Internal nodes** represent decisions based on feature thresholds (e.g., "Is temperature > 38°C?").
- **Branches** represent possible outcomes of a decision.
- **Leaf nodes** represent the final predicted class (diagnosis).

This model is highly interpretable — clinicians can trace back the reasoning behind each prediction by following the tree path.
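
To show what "best separates the classes" can mean in practice, here is a minimal sketch of choosing a threshold on one feature by minimizing weighted Gini impurity (the default split criterion of scikit-learn's `DecisionTreeClassifier`). The functions `gini` and `best_threshold` and the toy temperature data are hypothetical and written only for this example.

```python
import numpy as np


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)


def best_threshold(values, labels):
    """Return (threshold, weighted_impurity) of the best binary split on one feature."""
    order = np.argsort(values)
    values, labels = values[order], labels[order]
    best = (None, np.inf)
    for i in range(1, len(values)):
        if values[i] == values[i - 1]:
            continue  # no threshold fits between two identical values
        threshold = (values[i] + values[i - 1]) / 2
        left, right = labels[:i], labels[i:]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if weighted < best[1]:
            best = (threshold, weighted)
    return best


# Toy data (hypothetical): body temperature in degrees Celsius and a label
temps = np.array([36.5, 37.0, 38.5, 39.0])
diagnosis = np.array(["healthy", "healthy", "fever", "fever"])

threshold, impurity = best_threshold(temps, diagnosis)
print(f"impurity before split: {gini(diagnosis):.2f}")
print(f"best threshold: {threshold}, impurity after split: {impurity:.2f}")
```

A full tree repeats this search over every feature at every node and then recurses into the two resulting subsets.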

### Why Use a Decision Tree Here?

In our context, Decision Trees are a good fit because:

- They can naturally handle both **binary symptoms** and **continuous weather features**.
- They can capture **nonlinear interactions** between symptoms and conditions.
- The tree structure provides an **explainable model**, which is desirable in healthcare applications.
- They allow us to identify which features (symptoms, age, weather) are most important for predicting each disease.
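
As an illustration of the last point, a fitted scikit-learn tree exposes `feature_importances_`, which summarizes how much each feature contributed to impurity reduction. The sketch below uses synthetic data and placeholder feature names (`feature_0`, `feature_1`, ...), which are assumptions for this example only.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (hypothetical): 5 features, only 2 of them informative
X_syn, y_syn = make_classification(
    n_samples=300, n_features=5, n_informative=2, n_redundant=0,
    n_classes=2, n_clusters_per_class=1, random_state=0
)
names = [f"feature_{i}" for i in range(X_syn.shape[1])]

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_syn, y_syn)

# Importances are normalized impurity reductions and sum to 1 across features
importances = tree.feature_importances_
ranking = sorted(zip(names, importances), key=lambda pair: pair[1], reverse=True)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

On the project data, the same attribute could be read from `dt_model` after training and paired with the column names of `X_train`.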

### Workflow

We will:
1. Train a Decision Tree model on our dataset.
2. Evaluate its predictive performance.
3. Visualize the learned tree structure to inspect how the model makes decisions.
4. Compare its performance across different tree depths.

### Limitations of Decision Trees

While Decision Trees offer excellent interpretability, they also come with some limitations:

- **Overfitting**: A fully grown tree may fit the training data perfectly but perform poorly on new data. Pruning or limiting the tree depth (e.g., `max_depth`) is often required to avoid this.
- **Instability**: Small changes in the data can lead to very different tree structures, as the tree-building process is greedy.
- **Bias toward features with more levels**: Features with more unique values may dominate splits, even if they are not truly more informative.
- **Lower predictive power** compared to ensemble methods (such as Random Forest or Gradient Boosted Trees), which combine multiple trees to improve accuracy and robustness.

Despite these limitations, Decision Trees remain a valuable tool for **interpretable baseline models** and for understanding the structure of the data.
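
One simple way to check for the overfitting described above is to compare training and test accuracy: a large gap suggests the tree has memorized the training data. The sketch below does this on synthetic data with some label noise (`flip_y`); the dataset and the `results` dictionary are assumptions for this illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (hypothetical) with 15% randomly flipped labels
X_syn, y_syn = make_classification(
    n_samples=600, n_features=10, n_informative=4, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, flip_y=0.15, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.25, stratify=y_syn, random_state=0
)

# Compare a shallow tree (max_depth=3) with a fully grown tree (max_depth=None)
results = {}
for depth in (3, None):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    results[depth] = (tree.score(X_tr, y_tr), tree.score(X_te, y_te))
    label = "full" if depth is None else depth
    print(f"max_depth={label}: train={results[depth][0]:.3f}, test={results[depth][1]:.3f}")
```

The same train-versus-test comparison can be run with `X_train` and `X_test` from Step 2 to see how large the gap is on the real data.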



## Step 4.1: Train the Decision Tree Classifier

We first initialize and train a **Decision Tree classifier** using the training set.

At this stage, we use default hyperparameters, meaning the tree will grow fully to fit the data. We will later explore how controlling the tree depth can help improve generalization.

Training a Decision Tree involves the model learning **splitting rules** that divide the feature space into regions associated with each disease class.

We then generate predictions on the test set to evaluate the model's performance.


In [None]:
from sklearn.tree import DecisionTreeClassifier

# Initialize Decision Tree
dt_model = DecisionTreeClassifier(random_state=42)

# Train the model
dt_model.fit(X_train, y_train)

# Predict on test set
dt_pred = dt_model.predict(X_test)



## Step 4.2: Evaluate the Decision Tree Model

We evaluate the Decision Tree classifier using the following metrics:

- **Accuracy**: Overall proportion of correct predictions.
- **Precision / Recall / F1-Score**: Detailed per-class evaluation of model performance.
- **Confusion Matrix**: Visual breakdown of true vs. predicted labels.

These metrics help us understand how well the model performs across different disease classes and whether it tends to favor certain predictions.


In [None]:
# Use correct class names from your dataframe
class_names = sorted(df['prognosis'].unique().tolist())

# Evaluate Decision Tree
save_path = "../report/decision_tree/confusion_matrix.png"
evaluate_model(y_test, dt_pred, class_names, save_path)



## Step 4.3: Visualize the Decision Tree

One of the main advantages of Decision Trees is their **interpretability**. The trained model can be visualized as a tree structure where:

- Each **internal node** represents a decision based on a feature.
- Each **branch** corresponds to the outcome of that decision.
- Each **leaf node** represents a predicted disease class.

Visualizing the tree helps us understand **which symptoms and environmental factors** are most important for predicting each disease. It also allows us to inspect the logical structure of the model.


In [None]:
from sklearn.tree import plot_tree
import matplotlib.pyplot as plt

# Plot the full Decision Tree
plt.figure(figsize=(20, 10))
plot_tree(dt_model, 
          feature_names=df.drop(columns=["prognosis"]).columns.tolist(),
          class_names=class_names, 
          filled=True, 
          fontsize=8)

plt.title("Decision Tree Visualization")
plt.tight_layout()

# Save to report
plt.savefig("../report/decision_tree/decision_tree_plot.png", dpi=300)
plt.show()



## Step 4.4: Tune the Maximum Depth of the Tree

Fully grown Decision Trees tend to **overfit** the training data, resulting in very large and complex trees that are difficult to interpret.

By limiting the tree's maximum depth (`max_depth` parameter), we can:

- Improve generalization to unseen data.
- Produce a simpler and more readable tree.
- Focus on the most important decision paths.

Here, we retrain the model at several depths (`max_depth` = 4, 8, 12, 16, 20) and compare the results; the shallower settings produce a more interpretable structure.


In [None]:
# Retrain Decision Tree with limited depth
max_depths = [4,8,12,16,20]

for max_depth in max_depths:
    dt_model = DecisionTreeClassifier(max_depth=max_depth, random_state=42)
    dt_model.fit(X_train, y_train)
    
    # Predict on test set
    dt_pred = dt_model.predict(X_test)
    
    # Evaluate the model
    save_path = f"../report/decision_tree/confusion_matrix_maxdepth{max_depth}.png"
    evaluate_model(y_test, dt_pred, class_names, save_path)
    
    # Visualize the tree
    plt.figure(figsize=(20, 10))
    plot_tree(dt_model, 
              feature_names=df.drop(columns=["prognosis"]).columns.tolist(),
              class_names=class_names, 
              filled=True, 
              fontsize=8)
    
    plt.title(f"Decision Tree Visualization (max_depth={max_depth})")
    plt.tight_layout()
    plt.savefig(f"../report/decision_tree/decision_tree_plot_maxdepth{max_depth}.png", dpi=300)
    plt.show()



## Step 4.5: Effect of Maximum Tree Depth on Model Performance

To better understand how the depth of the Decision Tree affects its predictive power, we conduct an experiment by training trees with different `max_depth` values.

This allows us to observe the trade-off between:

- **Model Complexity**: Deeper trees can model more complex patterns but are prone to overfitting.
- **Interpretability**: Shallower trees are easier to understand but may underfit the data.
- **Accuracy**: How predictive performance varies as we change the depth.

We evaluate accuracy on the test set for several values of `max_depth` and visualize the results.


In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt

# Range of max_depth values to test
depth_values = [2, 3, 4, 5, 6, 8, 10, 12, 15, None]  # None = full depth

# Store results
depth_list = []
accuracy_list = []

for depth in depth_values:
    # Train Decision Tree
    model = DecisionTreeClassifier(max_depth=depth, random_state=42)
    model.fit(X_train, y_train)
    
    # Predict on test set
    y_pred_depth = model.predict(X_test)
    
    # Calculate accuracy
    acc = accuracy_score(y_test, y_pred_depth)
    
    # Store results (string labels so the x-axis is treated as categorical)
    depth_list.append("Full" if depth is None else str(depth))
    accuracy_list.append(acc)
    
    print(f"max_depth = {depth}: accuracy = {acc:.4f}")

# Plot results
plt.figure(figsize=(8, 5))
plt.plot(depth_list, accuracy_list, marker='o', linestyle='-')
plt.title("Effect of Decision Tree max_depth on Accuracy")
plt.xlabel("max_depth")
plt.ylabel("Test Set Accuracy")
plt.grid(True)
plt.tight_layout()

# Save to report
plt.savefig("../report/decision_tree/max_depth_optimization.png", dpi=300)
plt.show()



## Conclusion of Decision Tree Modeling

In this section, we trained a Decision Tree classifier to predict diseases based on patient symptoms and environmental factors.

- The full tree (no depth limit) achieved very high accuracy (**97.19%**) but produced a very large and complex tree, prone to overfitting.
- We systematically evaluated different values of `max_depth`, observing the trade-off between model complexity and performance:

| max_depth | Accuracy (%) |
|-----------|--------------|
| 2         | 40.12        |
| 3         | 44.93        |
| 4         | 48.85        |
| 5         | 53.16        |
| 6         | 56.47        |
| 8         | 63.89        |
| 10        | 70.51        |
| 12        | 75.53        |
| 15        | 84.15        |
| Full      | 97.19        |

- As expected, shallow trees (e.g., `max_depth=2` to `4`) offered high interpretability but suffered from low accuracy (underfitting).
- Deeper trees progressively improved accuracy, but at the cost of producing much larger and less interpretable models.

From the visualization at `max_depth=4`, we observed that:

- `chest_pain` and `hiv_aids` were highly informative and appeared at the top of the tree.
- Features such as `diarrhea`, `headache`, and `trouble_seeing` also contributed to differentiating certain diseases.

This experiment clearly illustrates the **performance vs. interpretability trade-off** inherent to Decision Trees.

Overall, Decision Trees provide a transparent and useful baseline model for our medical prediction task, especially when explainability is required.

In the next step, we will explore ensemble classifiers (Random Forest, Gradient Boosted Trees) to aim for **higher predictive power** while analyzing feature importance.


## Step 5.1: Increasing Predictive Power Using Ensemble Classifiers

We've now built individual decision trees to predict diseases from symptoms and environmental factors. While decision trees can be intuitive and easy to interpret, varying the depth forces a trade-off: shallow trees are readable but inaccurate, and deep trees are accurate but hard to interpret and at risk of generalizing poorly to unseen data. This is where ensemble methods like Random Forest and XGBoost come into play, boosting predictive power by combining multiple decision trees.

Instead of one deep, overfit tree, Random Forest builds many (e.g., hundreds of) independent decision trees. Each tree is trained on a bootstrap sample of the data and considers only a random subset of features at each split, so each one is intentionally "overfit" to a slightly different view of the data. By averaging or majority-voting the predictions of these many diverse trees, the collective decision becomes far more robust. The individual errors and idiosyncrasies of each tree tend to cancel out, which reduces variance and, consequently, the overfitting risk of a single full tree.
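
The bootstrap-and-vote idea can be sketched by hand with a loop over individual trees. This is a simplified illustration on synthetic data, not scikit-learn's `RandomForestClassifier` (which averages class probabilities across trees rather than counting hard votes); all variable names here are assumptions for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (hypothetical) with integer class labels 0, 1, 2
X_syn, y_syn = make_classification(
    n_samples=400, n_features=8, n_informative=4, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.25, stratify=y_syn, random_state=0
)

rng = np.random.default_rng(0)
n_trees = 25
all_votes = []
for t in range(n_trees):
    # Bootstrap sample: draw training rows with replacement
    rows = rng.integers(0, len(X_tr), size=len(X_tr))
    # max_features="sqrt": each split considers only a random subset of features
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=t)
    tree.fit(X_tr[rows], y_tr[rows])
    all_votes.append(tree.predict(X_te))

votes = np.stack(all_votes)  # shape: (n_trees, n_test_samples)

# Majority vote per test sample across all trees
forest_pred = np.array([np.bincount(col, minlength=3).argmax() for col in votes.T])
forest_acc = float(np.mean(forest_pred == y_te))
print(f"hand-rolled forest accuracy: {forest_acc:.3f}")
```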

While also using many trees, Gradient Boosting tackles overfitting differently. It builds trees sequentially, with each new tree trying to correct the prediction errors (residuals) of the previous trees. This iterative refinement allows the model to progressively learn more complex patterns and focus on the data points that were difficult for earlier trees. Despite building on potentially "weak learners," the cumulative effect, combined with built-in regularization techniques (which XGBoost is particularly good at), leads to extremely strong and well-generalized models, mitigating the risk of overfitting found in a single deep tree. 
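
The sequential "fit the residuals" idea can be sketched for a simple regression problem with squared loss, where the residuals are exactly the negative gradient. This is a hand-rolled illustration using shallow scikit-learn regression trees, not XGBoost itself, and it omits the regularization XGBoost adds; the data and names are assumptions for the example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic 1D regression data (hypothetical): a noisy sine curve
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 6, size=200)).reshape(-1, 1)
target = np.sin(x).ravel() + rng.normal(scale=0.1, size=200)

learning_rate = 0.3
# Start from a constant prediction: the mean of the target
prediction = np.full_like(target, target.mean())
mse_history = [float(np.mean((target - prediction) ** 2))]

for _ in range(50):
    # Residuals = what the current ensemble still gets wrong
    residuals = target - prediction
    # A shallow tree (weak learner) is fit to those residuals
    weak_tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    weak_tree.fit(x, residuals)
    # A damped version of its correction is added to the running prediction
    prediction += learning_rate * weak_tree.predict(x)
    mse_history.append(float(np.mean((target - prediction) ** 2)))

print(f"training MSE: start={mse_history[0]:.4f}, end={mse_history[-1]:.4f}")
```

For classification, the same loop is applied to gradients of a classification loss instead of raw residuals, but the sequential structure is the same.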

## Step 5.2: Utilizing Random Forest 

Now we can use the training set defined earlier (`X_train`, `y_train`) to train a Random Forest classifier and see whether it addresses the overfitting risk of a single tree.


In [None]:
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)
clf.fit(X_train, y_train)


For clarity, here is a visual of one tree from the random forest:

In [None]:
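from sklearn.tree import plot_tree
import matplotlib.pyplot as plt

# Column names of the training features, used to label the split nodes
feature_names = X_train.columns.tolist()

plt.figure(figsize=(20, 10))
plot_tree(clf.estimators_[0], 
          filled=True, 
          feature_names=feature_names, 
          class_names=class_names, 
          rounded=True, 
          proportion=False, 
          precision=2)
plt.title("Visualization of One Tree from the Random Forest")
plt.show()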

And then we can create some predictions in order to evaluate the performance of the random forest.

In [None]:
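y_pred = clf.predict(X_test)
# Keep the full (n_samples, n_classes) probability matrix: the target has 11 classes,
# so a single probability column is not enough for the ROC analysis below
y_proba = clf.predict_proba(X_test)

## Step 5.3: Evaluating Random Forest

We can use accuracy, a confusion matrix, precision, recall, F1-score, and ROC-AUC to evaluate the performance of our random forest. Because the target is multiclass, ROC-AUC is computed one-vs-rest and macro-averaged over the classes.

In [None]:
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.preprocessing import label_binarize

# 1. Accuracy
print("Accuracy:", accuracy_score(y_test, y_pred))

# 2. Confusion Matrix
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='d', xticklabels=class_names, yticklabels=class_names)
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.title("Confusion Matrix")
plt.show()

# 3. Classification Report
print("Classification Report:")
print(classification_report(y_test, y_pred, target_names=class_names))

# 4. ROC-AUC Score (multiclass: one-vs-rest, macro-averaged)
roc_auc = roc_auc_score(y_test, y_proba, multi_class="ovr", labels=clf.classes_)
print("ROC-AUC Score (one-vs-rest, macro):", roc_auc)

# 5. ROC Curves: one one-vs-rest curve per class
y_test_bin = label_binarize(y_test, classes=clf.classes_)
for i, cls in enumerate(clf.classes_):
    fpr, tpr, _ = roc_curve(y_test_bin[:, i], y_proba[:, i])
    plt.plot(fpr, tpr, label=str(cls))
plt.plot([0, 1], [0, 1], linestyle='--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title(f'One-vs-Rest ROC Curves (macro AUC = {roc_auc:.2f})')
plt.legend()
plt.show()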


### SVM

In [None]:
from sklearn.svm import SVC

svm_model = SVC(kernel="rbf", C=1.0, gamma="scale", decision_function_shape="ovr", random_state=42)

# Train the model
svm_model.fit(X_train, y_train)


In [None]:

# Predict on test set
y_pred = svm_model.predict(X_test)

# Compute accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.4f}")


In [None]:
y_pred = svm_model.predict(X_test)

# Compute confusion matrix
cm = confusion_matrix(y_test, y_pred)
