# Multiclass Logistic Regression on the UCI Digits Dataset

This notebook is part of the AF3 Product Integrator.  
It builds a complete supervised learning workflow on the **Digits** dataset:
- Load a real public dataset
- Explore and clean the data
- Apply preprocessing and normalization
- Train a **multinomial Logistic Regression** model
- Evaluate the results with proper metrics and plots


## 1. Imports

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score,
    precision_score,
    recall_score,
    f1_score,
    confusion_matrix,
    classification_report
)

# Just to display all columns when needed
pd.set_option("display.max_columns", None)
sns.set(style="whitegrid")

## 2. Load dataset from sklearn (UCI Digits)

In [None]:
# Loading the digits dataset from sklearn
digits = load_digits()

# digits.data has the 64 numeric features, digits.target has the labels (0-9)
X = pd.DataFrame(digits.data)
y = pd.Series(digits.target, name="target")

# Combine into a single DataFrame for easier analysis
df = pd.concat([X, y], axis=1)

df.head()
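
The 64 features are just the 8x8 pixel grid of each image, flattened row by row. The short check below is a minimal sketch of that relationship; it reloads the dataset so it can run on its own, and the name `flattened` is only used for this illustration.

```python
import numpy as np
from sklearn.datasets import load_digits

digits = load_digits()

# Each image is an 8x8 grid; flattening it row by row gives 64 values
n_samples = digits.images.shape[0]
flattened = digits.images.reshape(n_samples, -1)

print("images shape:", digits.images.shape)
print("data shape:  ", digits.data.shape)
print("data equals flattened images:", np.array_equal(flattened, digits.data))
```

So feature `i` corresponds to the pixel at row `i // 8`, column `i % 8`.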

## 3. Basic exploratory data analysis (EDA)

In [None]:
# Shape of the dataset
print("Shape:", df.shape)

# Info about data types and non-null values
df.info()

In [None]:
# Basic statistics for the numeric features
df.describe().T.head(10)
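
The summary statistics are also a good place to spot features that never change, for example pixels that are blank in every image. The following is a small sketch of that check; `X_check` and `constant_cols` are illustrative names that are not used elsewhere in the notebook.

```python
import pandas as pd
from sklearn.datasets import load_digits

# Reload the features so this cell runs on its own
X_check = pd.DataFrame(load_digits().data)

# A feature is constant when its minimum and maximum are the same
value_range = X_check.max() - X_check.min()
constant_cols = value_range[value_range == 0].index.tolist()

print("Number of constant features:", len(constant_cols))
print("Constant feature indices:", constant_cols)
```

Constant features carry no information for the classifier, which matters later when interpreting the scaled values and the coefficients.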

### 3.1 Missing values check

In [None]:
# Checking for missing values in each column
df.isna().sum()

### 3.2 Target class distribution

In [None]:
plt.figure(figsize=(6,4))
sns.countplot(x="target", data=df)
plt.title("Digits class distribution (0-9)")
plt.xlabel("Digit")
plt.ylabel("Count")
plt.tight_layout()

# Create figures folder if it does not exist
import os
os.makedirs("../figures", exist_ok=True)
plt.savefig("../figures/class_distribution.png", dpi=300)
plt.show()

### 3.3 Visualizing some sample digits

In [None]:
# Plotting some sample images to understand the data
fig, axes = plt.subplots(2, 5, figsize=(10, 5))
axes = axes.ravel()

for i in range(10):
    axes[i].imshow(digits.images[i], cmap="gray")
    axes[i].set_title(f"Label: {digits.target[i]}")
    axes[i].axis("off")

plt.tight_layout()
plt.savefig("../figures/sample_digits.png", dpi=300)
plt.show()

### 3.4 (Optional) Save dataset as CSV

In [None]:
# Optional: keep a CSV copy of the dataset in ../data
os.makedirs("../data", exist_ok=True)
df.to_csv("../data/digits.csv", index=False)
print("Saved digits.csv to ../data/digits.csv")

## 4. Feature/target split

In [None]:
# Separating features (X) and target (y)
X = df.drop(columns=["target"])
y = df["target"]

print("Features shape:", X.shape)
print("Target shape:", y.shape)

## 5. Train/Test split (70/30)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42, stratify=y
)

print("X_train:", X_train.shape)
print("X_test:", X_test.shape)
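
Passing `stratify=y` is meant to keep the class proportions roughly the same in both splits. The sketch below checks this using the same split settings as above; the variable names (`X_all`, `y_tr`, `proportions`, ...) are only for this illustration.

```python
import pandas as pd
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

X_all, y_all = load_digits(return_X_y=True)
_, _, y_tr, y_te = train_test_split(
    X_all, y_all, test_size=0.30, random_state=42, stratify=y_all
)

# Share of each digit in the full data and in each split
proportions = pd.DataFrame({
    "full": pd.Series(y_all).value_counts(normalize=True).sort_index(),
    "train": pd.Series(y_tr).value_counts(normalize=True).sort_index(),
    "test": pd.Series(y_te).value_counts(normalize=True).sort_index(),
})
print(proportions.round(3))
```

The three columns should be nearly identical for every digit.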

## 6. Feature scaling (standardization)

In [None]:
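# StandardScaler standardizes each feature to mean 0 and std 1 on the training set
# (a constant feature, such as an always-blank pixel, is mapped to 0 instead).
# It is fit on the training split only, so no test-set statistics leak into preprocessing.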
scaler = StandardScaler()

X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

X_train_scaled[:3]
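
To see what the scaler actually did, we can check the column means and standard deviations after the transform. This is a minimal, self-contained sketch that repeats the split and scaling with illustrative names (`scaler_check`, `varying`); it is not a required step of the workflow.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X_all, y_all = load_digits(return_X_y=True)
X_tr, X_te, _, _ = train_test_split(
    X_all, y_all, test_size=0.30, random_state=42, stratify=y_all
)

scaler_check = StandardScaler()
X_tr_s = scaler_check.fit_transform(X_tr)  # statistics come from the training split
X_te_s = scaler_check.transform(X_te)      # the test split reuses those statistics

# Features that take more than one value in the training split
varying = X_tr.max(axis=0) > X_tr.min(axis=0)

print("Train means close to 0:", np.allclose(X_tr_s.mean(axis=0), 0.0, atol=1e-8))
print("Train stds close to 1 (varying features):",
      np.allclose(X_tr_s[:, varying].std(axis=0), 1.0))
print("Constant features in the training split:", int((~varying).sum()))
print("Largest |mean| on the test split:", float(np.abs(X_te_s.mean(axis=0)).max()))
```

The test-split means are close to zero but not exactly zero, which is expected because the scaler only saw the training data.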

## 7. Multiclass Logistic Regression (multinomial)

In [None]:
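# Creating the multinomial Logistic Regression model.
# With the lbfgs solver, scikit-learn fits a multinomial (softmax) model for
# multiclass targets by default. The explicit multi_class argument is
# deprecated since scikit-learn 1.5, so it is omitted here.
log_reg = LogisticRegression(
    solver="lbfgs",
    max_iter=2000
)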

# Training the model
log_reg.fit(X_train_scaled, y_train)

# Predictions
y_pred = log_reg.predict(X_test_scaled)
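
For intuition on what "multinomial" means here: the model computes one linear score per class and turns the ten scores into probabilities with a softmax. The sketch below reproduces `predict_proba` by hand under that assumption. It refits a model with the same settings so it runs on its own, and names such as `clf`, `scores` and `probs` are only for illustration.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X_all, y_all = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_all, y_all, test_size=0.30, random_state=42, stratify=y_all
)
scaler_demo = StandardScaler().fit(X_tr)
X_tr_s, X_te_s = scaler_demo.transform(X_tr), scaler_demo.transform(X_te)

clf = LogisticRegression(solver="lbfgs", max_iter=2000).fit(X_tr_s, y_tr)

# One linear score per class: shape (n_test_samples, 10)
scores = X_te_s @ clf.coef_.T + clf.intercept_

# Softmax; subtracting the row maximum keeps the exponentials numerically stable
shifted = scores - scores.max(axis=1, keepdims=True)
probs = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)

print("Matches predict_proba:", np.allclose(probs, clf.predict_proba(X_te_s)))
print("Agreement with predict:",
      np.mean(clf.classes_[probs.argmax(axis=1)] == clf.predict(X_te_s)))
```

The predicted digit is simply the class with the highest probability in each row.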

## 8. Evaluation metrics

In [None]:
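# Computing the main metrics. average="weighted" averages the per-class
# scores using each class's number of test samples (support) as the weight.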
acc = accuracy_score(y_test, y_pred)
prec = precision_score(y_test, y_pred, average="weighted")
rec = recall_score(y_test, y_pred, average="weighted")
f1 = f1_score(y_test, y_pred, average="weighted")

print("Accuracy:", acc)
print("Precision:", prec)
print("Recall:", rec)
print("F1-score:", f1)
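
To make the `average="weighted"` option concrete, the sketch below recomputes weighted precision by hand on a tiny set of labels. The labels are made up purely for illustration and have nothing to do with the Digits results.

```python
import numpy as np
from sklearn.metrics import precision_score

# Hypothetical true and predicted labels for a 3-class problem
y_true = np.array([0, 0, 0, 0, 1, 1, 2, 2, 2, 2])
y_hat = np.array([0, 0, 0, 1, 1, 1, 2, 2, 0, 2])

# Precision of each class on its own
per_class = precision_score(y_true, y_hat, average=None)

# Support = number of true samples in each class
support = np.bincount(y_true)

# Weighted average: classes with more samples count more
manual_weighted = np.sum(per_class * support) / support.sum()
# Macro average: every class counts the same
manual_macro = per_class.mean()

print("Per-class precision:", per_class)
print("Weighted (manual / sklearn):", manual_weighted,
      precision_score(y_true, y_hat, average="weighted"))
print("Macro (manual / sklearn):   ", manual_macro,
      precision_score(y_true, y_hat, average="macro"))
```

Because the Digits classes are close to balanced, weighted and macro averages should be very similar on this dataset; they would diverge more on an imbalanced one.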

### 8.1 Save metrics and classification report

In [None]:
# Creating the results folder if it does not exist
os.makedirs("../results", exist_ok=True)

# Saving a simple metrics summary
metrics_text = (
    f"Accuracy: {acc}\n"
    f"Precision (weighted): {prec}\n"
    f"Recall (weighted): {rec}\n"
    f"F1-score (weighted): {f1}\n"
)

with open("../results/metrics.txt", "w") as f:
    f.write(metrics_text)

print("Metrics saved to ../results/metrics.txt")

In [None]:
# Full classification report (per class)
report = classification_report(y_test, y_pred)
print(report)

with open("../results/classification_report.txt", "w") as f:
    f.write(report)

print("Classification report saved to ../results/classification_report.txt")

## 9. Confusion matrix

In [None]:
cm = confusion_matrix(y_test, y_pred)

# Save confusion matrix as CSV as well (for the report if needed)
cm_df = pd.DataFrame(cm, index=range(10), columns=range(10))
cm_df.to_csv("../results/confusion_matrix.csv", index=True)

plt.figure(figsize=(8,6))
sns.heatmap(cm, annot=True, fmt="g", cmap="Blues")
plt.title("Confusion Matrix")
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.tight_layout()
plt.savefig("../figures/confusion_matrix.png", dpi=300)
plt.show()
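
The confusion matrix also contains the per-class metrics: dividing each diagonal entry by its row sum gives recall, and dividing by its column sum gives precision. The sketch below shows this on a small set of made-up labels (not the Digits predictions).

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical true and predicted labels for a 3-class problem
y_true = np.array([0, 0, 0, 0, 1, 1, 2, 2, 2, 2])
y_hat = np.array([0, 0, 0, 1, 1, 1, 2, 2, 0, 2])

cm_demo = confusion_matrix(y_true, y_hat)  # rows = actual, columns = predicted

# Row-normalized matrix: each row sums to 1, the diagonal is per-class recall
cm_normalized = cm_demo / cm_demo.sum(axis=1, keepdims=True)
recall_from_cm = np.diag(cm_normalized)

# Diagonal divided by column sums is per-class precision
precision_from_cm = np.diag(cm_demo) / cm_demo.sum(axis=0)

print(cm_demo)
print("Recall from matrix:   ", recall_from_cm, recall_score(y_true, y_hat, average=None))
print("Precision from matrix:", precision_from_cm, precision_score(y_true, y_hat, average=None))
```

`confusion_matrix` also accepts `normalize="true"` to return the row-normalized version directly, which can make the heatmap easier to read when class sizes differ.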

## 10. Simple coefficient analysis

In [None]:
# The model learns one set of coefficients per class
coefs = log_reg.coef_
print("Coefficients shape:", coefs.shape)  # (10 classes, 64 features)

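# As an example, show the ten largest positive coefficients for class 0.
# Features were standardized, so each coefficient is per standard deviation of its pixel.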
coef_class_0 = pd.Series(coefs[0], index=X.columns)
print(coef_class_0.sort_values(ascending=False).head(10))
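
Since each feature is a pixel, a row of coefficients can be reshaped back into an 8x8 grid to see which regions of the image push the model towards a class. The following is a rough sketch of that idea. To keep it short it fits on the full dataset instead of the training split, so the numbers will differ from `log_reg`; the names `clf_demo` and `coef_grid_0` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X_all, y_all = load_digits(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X_all)

# Simplification for the demo: fit on all samples, no train/test split
clf_demo = LogisticRegression(solver="lbfgs", max_iter=2000).fit(X_scaled, y_all)

# Coefficients for the first class, laid out like the original image
coef_grid_0 = clf_demo.coef_[0].reshape(8, 8)

# Position of the pixel with the largest positive weight
row, col = np.unravel_index(coef_grid_0.argmax(), coef_grid_0.shape)

print("Class:", clf_demo.classes_[0])
print(np.round(coef_grid_0, 2))
print(f"Largest positive weight at row {row}, column {col}")
```

The same grid could be passed to `plt.imshow` or `sns.heatmap` to get a picture for each digit.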

## 11. Short conclusions (for the report)

In this notebook we:
- Used a **real, public dataset** (UCI Digits via sklearn)
- Performed a basic exploratory data analysis (EDA)
- Verified there were no missing values
- Standardized all numeric features
- Trained a **multinomial Logistic Regression** model
- Evaluated it using Accuracy, Precision, Recall and F1-score
- Visualized the confusion matrix and inspected the learned coefficients

These elements can be used directly in the final PDF report: methodology, results, and discussion.
