# **Parkinson’s Disease Detection using Voice Biomarkers**
This notebook explores the use of ML classifiers (Logistic Regression, Random Forest, Platt-calibrated SVC) to predict early-stage Parkinson’s Disease from voice features.


# **1. Importing Libraries**

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb

from imblearn.over_sampling import RandomOverSampler
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, MinMaxScaler
from sklearn.feature_selection import SelectKBest, chi2
from tqdm.notebook import tqdm
from sklearn import metrics
from sklearn.svm import SVC
from xgboost import XGBClassifier
from sklearn.linear_model import LogisticRegression

import warnings
warnings.filterwarnings('ignore')

# **2. Data Loading**

In [None]:
from google.colab import files
import os

if not os.path.exists('parkinson_disease.csv'):
    files.upload()

df = pd.read_csv('parkinson_disease.csv')
pd.set_option('display.max_columns', 10)
df.sample(5)


# **3. Exploratory Data Analysis (EDA)**

In [None]:
# Display dataset info: data types, nulls, memory usage
df.info()

In [None]:
# Transposed describe for better view across all columns
df.describe().T

In [None]:
# Check total number of missing values
df.isnull().sum().sum()

# **4. Data Preprocessing**


We start by encoding the target and normalizing the features with Min-Max scaling. Chi-squared feature selection, which retains the 30 most informative features, follows in Section 5.


In [None]:
print(df['class'])

In [None]:
# Encode the target column 'class' as integers if it is stored as strings
if df['class'].dtype == 'object':
    le = LabelEncoder()
    df['class'] = le.fit_transform(df['class'])

# Separate features and target
X = df.drop(columns=['class'])
y = df['class']

# Normalize features to [0, 1]
scaler = MinMaxScaler()
X_scaled = pd.DataFrame(scaler.fit_transform(X), columns=X.columns)

X_scaled.head()


# **4.2 Removing Highly Correlated Features**

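Highly correlated features (absolute correlation > 0.7) are largely redundant and can harm model performance.
We average the rows that share an `id`, drop the `id` column, and then walk through the features in order, keeping a feature only if its absolute correlation with every feature already kept does not exceed 0.7.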

In [None]:
# Group by 'id' and average duplicates
df = df.groupby('id').mean().reset_index()

# Drop 'id' as it's no longer needed
df.drop('id', axis=1, inplace=True)

# Remove features with high correlation (> 0.7)
target_col = 'class'
columns = list(df.columns)
columns.remove(target_col)

filtered_columns = []

for col in columns:
    keep = True
    for sel in filtered_columns:
        if abs(df[col].corr(df[sel])) > 0.7:
            keep = False
            break
    if keep:
        filtered_columns.append(col)

# Add back the target column
filtered_columns.append(target_col)
df = df[filtered_columns]

print("Remaining shape after removing correlated features:", df.shape)
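
The same greedy filter can also be sketched against a correlation matrix that is computed once, instead of calling `.corr()` for every pair. This is a minimal sketch on toy data, not the notebook's dataset: the helper name `drop_correlated` and the columns `a`, `b`, `c` are hypothetical.

```python
import numpy as np
import pandas as pd

def drop_correlated(frame: pd.DataFrame, threshold: float = 0.7) -> list:
    """Greedy filter: drop a column if |corr| with any already-kept column exceeds threshold."""
    abs_corr = frame.corr().abs()  # all pairwise correlations, computed once
    kept = []
    for col in frame.columns:
        if not any(abs_corr.loc[col, k] > threshold for k in kept):
            kept.append(col)
    return kept

# Toy data (hypothetical): 'b' is almost an exact multiple of 'a'; 'c' is independent noise
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": 2 * a + rng.normal(scale=0.01, size=200),
    "c": rng.normal(size=200),
})

kept_columns = drop_correlated(toy)
print(kept_columns)
```

Because columns are visited in order, the first member of each correlated group is the one that survives, which matches the loop above.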


In [None]:
# Check class balance (sort_index puts class 0 first so the tick labels match the bars)
plt.figure(figsize=(6, 4))
df['class'].value_counts().sort_index().plot(kind='bar', color=['skyblue', 'salmon'])
plt.title('Class Distribution')
plt.xticks(ticks=[0, 1], labels=['Healthy', 'Parkinson'], rotation=0)
plt.show()

# Correlation heatmap
plt.figure(figsize=(14, 10))
corr = df.corr()
sb.heatmap(corr, cmap='coolwarm', annot=False)
plt.title('Feature Correlation Heatmap')
plt.show()


## 5. Feature Selection and Class Distribution Analysis 🔍📊
To reduce dimensionality and improve model efficiency, we use the chi-squared test to select the 30 features with the strongest relationship to the target variable (`class`).

> 🧠 **Note:** The chi-squared test accepts only non-negative values (it assumes frequency-like data), so we first normalize the features to the **[0, 1] range**.


In [None]:
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_selection import SelectKBest, chi2
import seaborn as sns
import matplotlib.pyplot as plt

# Separating features and target
X = df.drop('class', axis=1)
y = df['class']

# Normalizing features (required for Chi-squared test)
# Chi-squared test assumes non-negative input features
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)

# Applying Chi-squared test to select top 30 features
selector = SelectKBest(score_func=chi2, k=30)
X_selected = selector.fit_transform(X_scaled, y)

# Getting names of selected features
selected_mask = selector.get_support()
selected_columns = X.columns[selected_mask]

# Displaying selected features with their Chi-squared scores
feature_scores = selector.scores_[selected_mask]
chi2_scores_df = pd.DataFrame({'Feature': selected_columns, 'Chi2 Score': feature_scores})
chi2_scores_df = chi2_scores_df.sort_values(by='Chi2 Score', ascending=False)
print("Top 30 features with Chi-squared scores:")
display(chi2_scores_df)


# Creating filtered DataFrame with selected features and target
df = pd.DataFrame(X_selected, columns=selected_columns)
df['class'] = y.reset_index(drop=True)  # Ensuring alignment with transformed X

print("Shape after Chi-squared feature selection:", df.shape)
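
The scaling step in the cell above is not optional: scikit-learn's `chi2` raises a `ValueError` when the feature matrix contains negative values. The sketch below illustrates this on a tiny made-up matrix; `X_raw` and `y_toy` are hypothetical and unrelated to the voice features.

```python
import numpy as np
from sklearn.feature_selection import chi2
from sklearn.preprocessing import MinMaxScaler

# Toy feature matrix (hypothetical values) that contains negative entries
X_raw = np.array([[-1.0, 2.0], [0.5, -3.0], [2.0, 1.0], [1.5, 4.0]])
y_toy = np.array([0, 0, 1, 1])

# chi2 refuses negative input
try:
    chi2(X_raw, y_toy)
    raised = False
except ValueError as err:
    raised = True
    print("chi2 rejected the raw input:", err)

# After Min-Max scaling every value lies in [0, 1], so the test can run
X_unit = MinMaxScaler().fit_transform(X_raw)
scores, p_values = chi2(X_unit, y_toy)
print("chi2 scores:", scores)
```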

### Class Distribution Visualization

Before splitting the data, we examine the class balance to understand if the dataset is skewed toward any particular label. This informs our sampling strategy.

In [None]:
# Class distribution - Pie Chart
class_counts = df['class'].value_counts()
plt.figure(figsize=(6, 6))
plt.pie(class_counts.values,
        labels=class_counts.index,
        autopct='%1.1f%%',
        colors=['#FFA500', '#1f77b4'],
        startangle=90)
plt.title("Class Distribution")
plt.axis('equal')
plt.show()

# Class distribution - Bar Plot (for actual counts)
plt.figure(figsize=(5, 4))
sns.countplot(x='class', data=df, palette=['#FFA500', '#1f77b4'])
plt.title("Class Counts")
plt.xlabel("Class")
plt.ylabel("Frequency")
plt.show()

# **6. Model Training and Evaluation**
We train three different classifiers (Logistic Regression, Random Forest, and a Platt-calibrated SVM) on the Parkinson’s dataset.
To address the class imbalance (~75% positive class), the Logistic Regression and Random Forest models are trained with `class_weight='balanced'`.

Each model is evaluated with ROC-AUC and PR-AUC on a held-out validation set.
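
The model definitions in this section pass `class_weight='balanced'` to Logistic Regression and Random Forest. As a rough illustration of what that setting does, the sketch below computes the weights for hypothetical labels with a 75% positive rate (`y_toy` is made up, not the notebook's target).

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical labels with roughly the positive rate described above
y_toy = np.array([0] * 25 + [1] * 75)

# 'balanced' weight for class c = n_samples / (n_classes * count(c))
weights = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y_toy)
print(dict(zip([0, 1], weights.round(3))))
```

The minority class receives the larger weight, so each of its errors costs more during training.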


### Step 1: Stratified Cross-Validation & Oversampling Setup

In [None]:
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score, precision_recall_curve, auc
from sklearn.calibration import CalibratedClassifierCV
from imblearn.over_sampling import RandomOverSampler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
import numpy as np

# Preparing data
features = df.drop('class', axis=1)
target = df['class']

# Stratified K-Fold Cross Validation
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Model definitions with class balancing
models = {
    "Logistic Regression": LogisticRegression(class_weight='balanced', max_iter=1000),
    "Random Forest": RandomForestClassifier(class_weight='balanced', random_state=42),
    "SVC (Platt Calibrated)": CalibratedClassifierCV(SVC(kernel='rbf', probability=True), method='sigmoid', cv=3)
}
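
For readers unfamiliar with Platt scaling, the following is a minimal sketch of the calibrated SVC on synthetic data, with `X_toy` and `y_toy` as hypothetical stand-ins generated by `make_classification`. `method='sigmoid'` fits a logistic curve to the classifier's scores. The wrapper can work from `decision_function` scores, so this sketch leaves `probability` at its default in `SVC`.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic stand-in data (hypothetical, not the voice dataset)
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)

# Platt scaling: a sigmoid is fitted to SVC scores on held-out folds (cv=3)
calibrated = CalibratedClassifierCV(SVC(kernel="rbf"), method="sigmoid", cv=3)
calibrated.fit(X_toy, y_toy)

# Calibrated class probabilities for the first five rows
probs = calibrated.predict_proba(X_toy[:5])
print(probs.round(3))
```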


In [None]:
print(features.columns)
print(features.dtypes)


### Step 2: Model Training, AUC Scoring & Confidence Intervals

#### 🧪 Train-Test Split

We split the data into training and test sets using stratified sampling to maintain class distribution.
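
As a quick illustration of what stratification buys, the sketch below splits hypothetical labels with a 75% positive rate and checks that both parts keep that rate. The names `X_toy`, `y_toy`, `X_tr` and `X_va` are made up for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical labels: 50 negatives and 150 positives (75% positive)
y_toy = np.array([0] * 50 + [1] * 150)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)  # placeholder feature column

X_tr, X_va, y_tr, y_va = train_test_split(
    X_toy, y_toy, stratify=y_toy, test_size=0.2, random_state=42
)

# Both parts should show (almost exactly) the same positive rate
print("train positive rate:", y_tr.mean())
print("validation positive rate:", y_va.mean())
```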


In [None]:
from sklearn.metrics import precision_recall_curve, auc, roc_auc_score
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
import numpy as np

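# Split the chi-squared-selected features prepared in Step 1 (features/target), stratified on the class label
X_train, X_val, y_train, y_val = train_test_split(features, target, stratify=target, test_size=0.2, random_state=42)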

# Store results
results = []

for name, model in models.items():
    print(f"\n🔍 Training model: {name}")
    model.fit(X_train, y_train)

    # Predict probabilities or scores
    if hasattr(model, "predict_proba"):
        val_probs = model.predict_proba(X_val)[:, 1]
    elif hasattr(model, "decision_function"):
        val_probs = model.decision_function(X_val)
    else:
        print(f"⚠️ Skipping {name}: No probability or decision function.")
        continue

    if len(val_probs) != len(y_val):
        print(f"❌ Length mismatch for {name}")
        continue

    # ROC-AUC & PR-AUC
    roc_auc = roc_auc_score(y_val, val_probs)
    precision, recall, _ = precision_recall_curve(y_val, val_probs)
    pr_auc = auc(recall, precision)

    results.append({
        "Model": name,
        "ROC_AUC": roc_auc,
        "PR_AUC": pr_auc,
        "Precision": precision,
        "Recall": recall
    })

# Print result table
print("\n📊 Table Summary of Results")
print(f"{'Model':<24} {'ROC-AUC':<10} {'PR-AUC':<10}")
print("-" * 46)
for r in results:
    print(f"{r['Model']:<24} {r['ROC_AUC']:<10.2f} {r['PR_AUC']:<10.2f}")

# PR Curves
for r in results:
    plt.figure(figsize=(6, 5))
    plt.plot(r["Recall"], r["Precision"], label=f"{r['Model']} (AUC = {r['PR_AUC']:.2f})", linewidth=2)
    plt.xlabel("Recall", fontsize=12)
    plt.ylabel("Precision", fontsize=12)
    plt.title(f"{r['Model']} Precision-Recall Curve", fontsize=13, weight='bold')
    plt.legend(loc='lower left', fontsize=10)
    plt.grid(True, linestyle='--', alpha=0.6)
    plt.tight_layout()
    plt.show()


### 📊 Table Summary of Results

| Model                  | ROC-AUC | PR-AUC |
| ---------------------- | ------- | ------ |
| Logistic Regression    | 0.77    | 0.90   |
| Random Forest          | 0.78    | 0.91   |
| SVC (Platt Calibrated) | 0.73    | 0.88   |



### ✅ Observations

- All models are evaluated on a **stratified 80/20 hold-out split**, which keeps the class distribution consistent between the training and validation sets.
- To mitigate class imbalance (~75% positive), Logistic Regression and Random Forest are trained with **`class_weight='balanced'`**.
- Performance metrics include both **ROC-AUC** (sensitivity-specificity tradeoff) and **PR-AUC** (more informative under imbalance).
- The **Support Vector Machine (SVC)** is calibrated using **Platt Scaling** to enable reliable probability outputs for interpretability (e.g., in SHAP analysis).
- Model comparison is based on the **ROC-AUC and PR-AUC values** measured on that validation set.
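
When reading PR-AUC values under this imbalance, it helps to know the baseline: a scorer with no information reaches a PR-AUC near the positive-class prevalence, not 0.5. The sketch below shows this on simulated labels with a 75% positive rate. It uses `average_precision_score` rather than the trapezoidal `auc(recall, precision)` used in the notebook, and all data here is randomly generated.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

rng = np.random.default_rng(0)
n = 20000

# Simulated labels (about 75% positive) and scores that carry no information
y_toy = (rng.random(n) < 0.75).astype(int)
random_scores = rng.random(n)

prevalence = y_toy.mean()
roc = roc_auc_score(y_toy, random_scores)
ap = average_precision_score(y_toy, random_scores)

print(f"prevalence = {prevalence:.3f}")
print(f"ROC-AUC of random scores = {roc:.3f}")
print(f"average precision of random scores = {ap:.3f}")
```

A PR-AUC should therefore be judged against the prevalence of the positive class rather than against 0.5.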


A 95% confidence interval on ROC-AUC, estimated for example from per-fold scores under the `StratifiedKFold` splitter defined in Step 1, quantifies the uncertainty of the estimate. A narrow interval indicates stable model performance, which is important for clinical applicability.
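
One way such an interval could be obtained is sketched below, on synthetic data from `make_classification` rather than the voice dataset. It scores a class-weighted Logistic Regression with stratified 5-fold cross-validation and forms a Student-t interval from the five fold scores. This is an approximation, since fold scores are not fully independent, and the constant `T_CRIT` assumes exactly five folds.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data (hypothetical) with about 75% positives
X_toy, y_toy = make_classification(
    n_samples=300, n_features=10, weights=[0.25, 0.75], random_state=0
)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
model = LogisticRegression(class_weight="balanced", max_iter=1000)

# One ROC-AUC value per fold
fold_auc = cross_val_score(model, X_toy, y_toy, cv=skf, scoring="roc_auc")

T_CRIT = 2.776  # two-sided 95% Student-t critical value for 4 degrees of freedom
mean_auc = fold_auc.mean()
half_width = T_CRIT * fold_auc.std(ddof=1) / np.sqrt(len(fold_auc))
ci_low, ci_high = mean_auc - half_width, mean_auc + half_width

print(f"ROC-AUC = {mean_auc:.3f} (95% CI {ci_low:.3f} to {ci_high:.3f})")
```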


### 6.1 Classification Report — All Models


Precision, recall, and F1-score are reported for each model, giving insight into how well they handle both classes.



In [None]:
from sklearn.metrics import classification_report

print("📌 Classification Reports\n")

for name, model in models.items():
    print(f"\n{name}")
    y_pred = model.predict(X_val)
    print(classification_report(y_val, y_pred))


### 📊 Classification Report Summary

**Understanding the Averages:**
- **Macro Avg**: Gives equal weight to each class — useful to assess model fairness.
- **Weighted Avg**: Adjusts metrics based on class distribution — useful when data is imbalanced.
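
To make the two averages concrete, the sketch below recomputes them by hand from the Logistic Regression confusion matrix reported in section 6.2 of this notebook. The helper `per_class_f1` is a hypothetical name, and the example uses only plain Python.

```python
# Logistic Regression confusion matrix from section 6.2 (rows = true class, columns = predicted)
cm = [[9, 4],
      [9, 29]]

def per_class_f1(cm, k):
    """F1 for class k, treating k as the positive class."""
    tp = cm[k][k]
    fp = sum(cm[r][k] for r in range(len(cm))) - tp  # predicted k, actually another class
    fn = sum(cm[k]) - tp                             # actually k, predicted another class
    return 2 * tp / (2 * tp + fp + fn)

support = [sum(row) for row in cm]  # true samples per class: [13, 38]
f1 = [per_class_f1(cm, k) for k in range(len(cm))]

macro_f1 = sum(f1) / len(f1)                                          # every class counts equally
weighted_f1 = sum(s * f for s, f in zip(support, f1)) / sum(support)  # classes weighted by support

print(round(macro_f1, 2), round(weighted_f1, 2))  # prints: 0.7 0.76
```

The weighted average is higher here because the larger class (Parkinson’s) also has the better F1-score.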

---

**🧪 Dataset Note:**
- Class **1 (Parkinson’s)** has **higher support (~74%)**.
- Pay close attention to **Class 0 (Healthy)** metrics, especially **recall**, to avoid false positives.

---

#### 🔍 Model-wise Observations:

- **📌 Logistic Regression**
  - Balanced performance across both classes.
  - **Recall for Class 0 = 0.69** → Detects most healthy individuals.
  - Weighted F1-score: **0.76**

- **📌 Random Forest**
  - Very strong at detecting Parkinson’s (**Recall = 0.92**).
  - **Recall for Class 0 = 0.31** → Misses many healthy individuals.
  - May be biased toward the majority class despite good overall accuracy (**76%**).

- **📌 SVC (Platt Calibrated)**
  - **Recall for Class 0 = 0.38**, an improvement over earlier results.
  - Excellent detection of Parkinson’s (**Recall = 0.95**).
  - Highest overall accuracy of the three (41 of 51 validation cases, about 80%), though Class 0 recall remains low.

---

### ✅ Takeaway:
> Even when overall accuracy is high, always examine **recall for the minority class (Class 0)** — especially in medical applications where misclassification can carry high risk.


### 6.2 Confusion Matrices – All Models



The confusion matrices below break down each model's predictions on the validation set into true/false positives and negatives, which makes model behavior more concrete. In each matrix, you can observe:

- **True Positives (TP)** – Bottom-right: Parkinson’s correctly identified  
- **True Negatives (TN)** – Top-left: Healthy correctly identified  
- **False Positives (FP)** – Top-right: Healthy misclassified as Parkinson’s  
- **False Negatives (FN)** – Bottom-left: Parkinson’s misclassified as Healthy  

These visualizations help us evaluate how well each model distinguishes between healthy individuals and those with Parkinson’s Disease.
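
The four cells map directly onto sensitivity and specificity. As a small sketch, the code below rebuilds label vectors that reproduce the Random Forest matrix reported in this section (`[[4, 9], [3, 35]]`) and unpacks the cells with `ravel()`. The vectors are constructed for illustration and are not the notebook's actual predictions.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Label vectors constructed to reproduce the reported Random Forest matrix [[4, 9], [3, 35]]
y_true = np.array([0] * 13 + [1] * 38)
y_pred = np.array([0] * 4 + [1] * 9 + [0] * 3 + [1] * 35)

# For binary labels, ravel() yields the cells in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

sensitivity = tp / (tp + fn)  # recall for Parkinson's (Class 1)
specificity = tn / (tn + fp)  # recall for Healthy (Class 0)

print(tn, fp, fn, tp)
print(round(float(sensitivity), 2), round(float(specificity), 2))
```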


> 🧠 **Note:** In this dataset, **Class 0 = Healthy** and **Class 1 = Parkinson's Disease**.


In [None]:
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

print("📌 Confusion Matrices\n")

for name, model in models.items():
    # Predict on the held-out validation set
    y_pred = model.predict(X_val)
    cm = confusion_matrix(y_val, y_pred)

    fig, ax = plt.subplots(figsize=(5, 4))
    disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=['Healthy', "Parkinson's"])
    disp.plot(cmap=plt.cm.Blues, values_format='d', ax=ax)

    ax.set_title(f"Confusion Matrix – {name}", fontsize=12)
    ax.set_xlabel("Predicted Label")
    ax.set_ylabel("True Label")
    ax.grid(False)

    plt.tight_layout()
    plt.show()


### 🔍 Observation – Logistic Regression

Confusion Matrix:  
[[9, 4],
[9, 29]]

✅ Parkinson’s (Class 1) detection remains strong with 29 true positives.

⚠️ 9 Parkinson’s cases are missed and predicted as Healthy (false negatives), which is risky in medical screening.

⚠️ 4 Healthy individuals are incorrectly flagged as having Parkinson’s (false positives).

✅ Class 0 recall has improved slightly (9 out of 13), but false negatives for Parkinson’s remain a key concern.

Overall, Logistic Regression maintains good detection for Parkinson’s but still needs refinement to minimize false negatives — crucial in clinical settings.

### 🔍 Observation – Random Forest

Confusion Matrix:  
[[4, 9],
[3, 35]]

✅ Parkinson’s detection is excellent with 35 true positives and only 3 false negatives, showing high sensitivity.

⚠️ 9 Healthy individuals were incorrectly predicted as having Parkinson’s (false positives), which reduces specificity.

⚠️ Only 4 of the 13 Healthy individuals were correctly identified (true negatives), indicating difficulty in distinguishing non-Parkinson’s cases.

Overall, the model strongly favors identifying Parkinson’s, which is valuable in clinical screening — but the high number of false positives may require further calibration or tuning to avoid unnecessary concern for healthy individuals.




### 🔍 Observation – SVC (Platt Calibrated)

Confusion Matrix:  
[[5, 8],
[2, 36]]

✅ 36 true positives and only 2 false negatives show high sensitivity in detecting Parkinson’s.

⚠️ 8 Healthy individuals were misclassified as Parkinson’s (false positives), impacting the model’s specificity.

✅ 5 true negatives indicate some ability to correctly identify Healthy individuals — better than before.

Overall, the model maintains a good balance, excelling in Parkinson’s detection while moderately improving its identification of Healthy cases compared to earlier performance. With further tuning, it could become a solid screening tool in clinical contexts.