
**Logistic Regression** is a widely used statistical model for binary classification: it estimates the probability of one of two possible outcomes. Here the outcome is whether a subject is demented (1) or nondemented (0), predicted from clinical and MRI-derived features.

**Preprocessing and Pipeline**


I used a ColumnTransformer to separately handle numeric and categorical data:

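*Numeric features* (such as Age, eTIV, nWBV) were standardized with StandardScaler() to zero mean and unit variance, so that features measured on very different scales contribute comparably.

The *categorical feature* (M/F) was one-hot encoded into numeric indicator columns that the model can use.

The ColumnTransformer was wrapped in a Pipeline to combine preprocessing and model training into one step. Because the scaler and encoder are fitted only on the data passed to `fit` (the training set), no statistics from the test set leak into preprocessing. The Pipeline also keeps the code clean and makes it easy to swap models later.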
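The leakage point can be checked directly. The following is a minimal sketch on synthetic stand-in data (not the study data; the `toy_*` names and the toy label rule are assumptions made for illustration). It fits a pipeline of the same shape and confirms that the fitted scaler's means match the training rows alone.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data (NOT the study data): two numeric columns, one categorical.
rng = np.random.default_rng(0)
n_rows = 100
toy_X = pd.DataFrame({
    "Age": rng.normal(75, 8, n_rows),
    "nWBV": rng.normal(0.73, 0.04, n_rows),
    "M/F": rng.choice(["M", "F"], n_rows),
})
toy_y = (toy_X["nWBV"] < 0.73).astype(int)  # arbitrary toy label, for illustration only

toy_preprocessor = ColumnTransformer([
    ("num", StandardScaler(), ["Age", "nWBV"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["M/F"]),
])
toy_pipeline = Pipeline([
    ("preprocessor", toy_preprocessor),
    ("classifier", LogisticRegression(solver="liblinear", random_state=42)),
])

toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, stratify=toy_y, test_size=0.2, random_state=42
)
toy_pipeline.fit(toy_X_train, toy_y_train)

# The scaler inside the fitted pipeline learned its means from the training rows only.
fitted_scaler = toy_pipeline.named_steps["preprocessor"].named_transformers_["num"]
train_means = toy_X_train[["Age", "nWBV"]].mean().to_numpy()
print("Scaler means match training-set means:", np.allclose(fitted_scaler.mean_, train_means))

# At prediction time the test rows are transformed with those training-set statistics.
toy_pred = toy_pipeline.predict(toy_X_test)
```

Scaling the full dataset before splitting would instead mix test-set values into the means and variances, which is the leakage the Pipeline avoids.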

In [6]:
import numpy as np
import pandas as pd

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix


In [9]:
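# Load the dataset and keep only first-visit rows
scalar_df = pd.read_csv('/content/Df.csv')
df = scalar_df[scalar_df['Visit'] == 1].copy()

# Features: drop the target, identifiers, and columns not used as predictors
X = df.drop(columns=['Label', 'Subject ID', 'MRI ID', 'Group', 'Visit', 'Hand'], errors='ignore')
# Target: 1 = demented, 0 = nondemented
y = df['Label']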

print(X.shape, y.value_counts())

(136, 10) Label
0    72
1    64
Name: count, dtype: int64


In [10]:
numeric_features = ['Age', 'EDUC', 'SES', 'eTIV', 'nWBV', 'ASF']
categorical_features = ['M/F']

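# X has 10 columns, but only the 7 listed above are used:
# ColumnTransformer drops unlisted columns by default (remainder='drop').
preprocessor = ColumnTransformer([
    ('num', StandardScaler(), numeric_features),
    ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)
])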

pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression(solver='liblinear', random_state=42))
])


In [11]:
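# Stratified 80/20 split keeps the class ratio approximately equal in both sets
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.2, random_state=42)

# Fit preprocessing and classifier on the training set only, then predict the held-out set
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)

# Evaluation
print("Logistic Regression Model Evaluation:")
print(classification_report(y_test, y_pred))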


              precision    recall  f1-score   support

           0       0.64      0.60      0.62        15
           1       0.57      0.62      0.59        13

    accuracy                           0.61        28
   macro avg       0.61      0.61      0.61        28
weighted avg       0.61      0.61      0.61        28



In [15]:
print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred))


Confusion Matrix:
[[9 6]
 [5 8]]


The confusion matrix shows that the model correctly classified 9 of 15 nondemented subjects and 8 of 13 demented subjects. It misclassified 6 nondemented subjects as demented (false positives) and 5 demented subjects as nondemented (false negatives), which leaves clear room for improvement. The false negatives matter most here, given the medical importance of correctly identifying demented individuals, and they motivate trying a more complex or better-tuned model.
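To make the error types concrete, here is a small sketch that derives sensitivity and specificity from the counts in the confusion matrix printed above. It assumes scikit-learn's layout (rows are true classes, columns are predicted classes, in label order 0, 1); the variable names are illustrative.

```python
# Counts copied from the printed confusion matrix [[9 6], [5 8]].
tn, fp = 9, 6   # true class 0 (nondemented): predicted 0, predicted 1
fn, tp = 5, 8   # true class 1 (demented):    predicted 0, predicted 1

sensitivity = tp / (tp + fn)   # share of demented subjects that were detected: 8 / 13
specificity = tn / (tn + fp)   # share of nondemented subjects correctly cleared: 9 / 15
precision_1 = tp / (tp + fp)   # share of "demented" predictions that were right: 8 / 14

print(f"sensitivity={sensitivity:.2f} specificity={specificity:.2f} precision={precision_1:.2f}")
# prints: sensitivity=0.62 specificity=0.60 precision=0.57
```

These values match the class 1 recall, class 0 recall, and class 1 precision in the classification report.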

In [13]:
print("Accuracy:", accuracy_score(y_test, y_pred))

Accuracy: 0.6071428571428571


**Interpretation:**

| Result               | Interpretation |
| -------------------- | -------------- |
| Accuracy = 61%       | The model correctly classified 17 out of 28 test subjects |
| Class 0 F1 = 0.62    | Nondemented subjects are identified moderately well (precision 0.64, recall 0.60) |
| Class 1 F1 = 0.59    | Demented subjects are slightly harder to identify (precision 0.57, recall 0.62) |
| Balanced metrics     | Precision and recall are similar across the two classes, so the model isn’t heavily biased toward one class |
| Room for improvement | 61% accuracy is only about 7 percentage points above the 54% obtained by always predicting the majority class (15 of 28 test subjects); better performance may come from more data or a more complex model like Random Forest or XGBoost |
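The F1 scores and accuracy in the table can be reproduced from the confusion matrix counts. The sketch below does so and, as an added point of comparison, computes the accuracy of a trivial classifier that always predicts the larger test class. The helper `f1` is a hypothetical name introduced for this example.

```python
# Counts copied from the reported confusion matrix (true class in rows, predicted in columns).
tn, fp, fn, tp = 9, 6, 5, 8
total = tn + fp + fn + tp

def f1(true_pos, false_pos, false_neg):
    """F1 (harmonic mean of precision and recall), written in terms of counts."""
    return 2 * true_pos / (2 * true_pos + false_pos + false_neg)

f1_class1 = f1(tp, fp, fn)   # demented as the positive class
f1_class0 = f1(tn, fn, fp)   # nondemented as the positive class (roles of fp/fn swap)
accuracy = (tp + tn) / total
majority_baseline = max(tn + fp, fn + tp) / total   # always predict the larger class

print(f"F1 class 0 = {f1_class0:.2f}, F1 class 1 = {f1_class1:.2f}")
print(f"accuracy = {accuracy:.3f}, majority-class baseline = {majority_baseline:.3f}")
# prints:
# F1 class 0 = 0.62, F1 class 1 = 0.59
# accuracy = 0.607, majority-class baseline = 0.536
```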


**Summary**

The logistic regression model serves as a simple, interpretable baseline for dementia classification using demographic and MRI-based features. It achieved 61% accuracy on the 28-subject test set, with similar precision and recall for both classes. Performance is modest, but it provides a meaningful reference point for more advanced models like Random Forest or neural networks.
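As a hedged sketch of that next step, the example below swaps only the final pipeline step for a Random Forest and compares both models with 5-fold stratified cross-validation, which gives a steadier estimate than a single small test split. It runs on synthetic stand-in data (not the study data), and the `demo_*` names, fold count, and forest settings are assumptions chosen for illustration, not tuned values.

```python
import numpy as np
from sklearn.base import clone
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (NOT the study data): 136 rows, 6 numeric features.
rng = np.random.default_rng(0)
demo_X = rng.normal(size=(136, 6))
demo_y = (demo_X[:, 0] + rng.normal(scale=1.0, size=136) > 0).astype(int)

baseline_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression(solver="liblinear", random_state=42)),
])

# Copy the pipeline and replace only the "classifier" step; preprocessing is unchanged.
forest_pipeline = clone(baseline_pipeline).set_params(
    classifier=RandomForestClassifier(n_estimators=200, random_state=42)
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
demo_scores = {}
for name, model in [("logistic regression", baseline_pipeline),
                    ("random forest", forest_pipeline)]:
    # Each fold refits the whole pipeline, so scaling never sees the validation rows.
    demo_scores[name] = cross_val_score(model, demo_X, demo_y, cv=cv, scoring="accuracy")
    print(f"{name}: mean accuracy {demo_scores[name].mean():.2f} "
          f"(std {demo_scores[name].std():.2f})")
```

The scores printed here describe the synthetic data only; they say nothing about how either model would perform on the dementia dataset.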