# 🧙‍♂️ DSLR - Hogwarts Sorting Hat Algorithm

This notebook demonstrates the main components of the DSLR project, which recreates Hogwarts' Sorting Hat using a logistic regression classifier built from scratch.

## Overview

We'll explore the following components:
1. **Dataset Overview & Custom Statistical Analysis**
2. **Data Visualization & Feature Selection**
3. **Dataset Preprocessing**
4. **Logistic Regression Training**
5. **Model Evaluation**
6. **Prediction & Comparison with scikit-learn**

## Preparation: Setting Up the Environment

Start by importing the necessary Python libraries. These tools are used to load, clean, and visualize the data.

In [None]:
''' Prepare the environment for the project. '''
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os
import sys

sys.path.append('..')

sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (9, 6)

## 1. Dataset Overview & Custom Statistical Analysis

- load the student dataset, which contains scores for various subjects and the house each student belongs to (Gryffindor, Ravenclaw, etc.).

In [None]:
train_dataset_path = '../../data/raw/dataset_train.csv'
df = pd.read_csv(train_dataset_path)

print(f"Dataset shape: {df.shape}")
df.head()

- use the custom `describe.py` implementation to analyze it.

In [17]:
from data.describe import ft_describe

custom_stats = ft_describe(df, is_bonus=True)
custom_stats

| Statistic | Index | Arithmancy | Astronomy | Herbology | Defense Against the Dark Arts | Divination | Muggle Studies | Ancient Runes | History of Magic | Transfiguration | Potions | Care of Magical Creatures | Charms | Flying |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1600.0 | 1566.0 | 1568.0 | 1567.0 | 1569.0 | 1561.0 | 1565.0 | 1565.0 | 1557.0 | 1566.0 | 1570.0 | 1560.0 | 1600.0 | 1600.0 |
| mean | 799.5 | 49634.570243 | 39.797131 | 1.14102 | -0.387863 | 3.15391 | -224.589915 | 495.74797 | 2.963095 | 1030.096946 | 5.950373 | -0.053427 | -243.374409 | 21.958012 |
| std | 462.02453 | 16679.806036 | 520.298268 | 5.219682 | 5.212794 | 4.155301 | 486.34484 | 106.285165 | 4.425775 | 44.125116 | 3.147854 | 0.971457 | 8.78364 | 97.631602 |
| min | 0.0 | -24370.0 | -966.740546 | -10.295663 | -10.162119 | -8.727 | -1086.496835 | 283.869609 | -8.858993 | 906.62732 | -4.697484 | -3.313676 | -261.04892 | -181.47 |
| 25% | 399.75 | 38511.5 | -489.551387 | -4.308182 | -5.259095 | 3.099 | -577.580096 | 397.511047 | 2.218653 | 1026.209993 | 3.646785 | -0.671606 | -250.6526 | -41.87 |
| 50% | 799.5 | 49013.5 | 260.289446 | 3.469012 | -2.589342 | 4.624 | -419.164294 | 463.918305 | 4.378176 | 1045.506996 | 5.874837 | -0.044811 | -244.867765 | -2.515 |
| 75% | 1199.25 | 60811.25 | 524.771949 | 5.419183 | 4.90468 | 5.667 | 254.994857 | 597.49223 | 5.825242 | 1058.43641 | 8.248173 | 0.589919 | -232.552305 | 50.56 |
| max | 1599.0 | 104956.0 | 1016.21194 | 11.612895 | 9.667405 | 10.032 | 1092.388611 | 745.39622 | 11.889713 | 1098.958201 | 13.536762 | 3.056546 | -225.42814 | 279.07 |
| range | 1599.0 | 129326.0 | 1982.952486 | 21.908558 | 19.829525 | 18.759 | 2178.885445 | 461.526611 | 20.748706 | 192.330881 | 18.234246 | 6.370222 | 35.62078 | 460.54 |
| iqr | 799.5 | 22299.75 | 1014.323336 | 9.727365 | 10.163775 | 2.568 | 832.574954 | 199.981183 | 3.606588 | 32.226418 | 4.601387 | 1.261526 | 18.100295 | 92.43 |


- compare the results with pandas' built-in `describe()` method to verify accuracy.

In [None]:
pandas_stats = df.describe()

column = 'Astronomy'
comparison = pd.DataFrame({
    'Custom': custom_stats[column].loc[['count', 'mean', 'std', 'min', '25%', '50%', '75%', 'max']],
    'Pandas': pandas_stats[column]
})

comparison['Difference'] = comparison['Custom'] - comparison['Pandas']
comparison['% Diff'] = (comparison['Difference'] / comparison['Pandas'] * 100).round(6)

comparison
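
For reference, statistics like these can be computed without pandas' built-in helpers. The following is a minimal sketch, not the project's `ft_describe`: the names `percentile` and `describe_column` are hypothetical, and it assumes the sample standard deviation (dividing by `count - 1`) and linear interpolation between neighboring values for percentiles, which are the defaults pandas uses.

```python
import math


def percentile(sorted_values, q):
    """Return the q-th percentile (0 <= q <= 1) using linear interpolation."""
    position = (len(sorted_values) - 1) * q
    lower = math.floor(position)
    upper = math.ceil(position)
    fraction = position - lower
    return sorted_values[lower] + (sorted_values[upper] - sorted_values[lower]) * fraction


def describe_column(values):
    """Summarize one numeric column, ignoring missing values (None or NaN)."""
    clean = sorted(v for v in values if v is not None and not math.isnan(v))
    count = len(clean)
    mean = sum(clean) / count
    # Sample variance: divide by count - 1 (assumes at least two valid values).
    variance = sum((v - mean) ** 2 for v in clean) / (count - 1)
    return {
        'count': float(count),
        'mean': mean,
        'std': math.sqrt(variance),
        'min': clean[0],
        '25%': percentile(clean, 0.25),
        '50%': percentile(clean, 0.50),
        '75%': percentile(clean, 0.75),
        'max': clean[-1],
    }


# Toy column with two missing entries, which are skipped.
stats = describe_column([2.0, 4.0, float('nan'), 4.0, 5.0, 7.0, None])
```

Skipping missing values before counting is what makes `count` differ between columns in the table above.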

- plot the distribution of students across the four Hogwarts houses.

In [None]:
house_counts = df['Hogwarts House'].value_counts()
colors = {
    'Gryffindor': 'red',
    'Hufflepuff': 'yellow',
    'Ravenclaw': 'blue',
    'Slytherin': 'green'
}

plt.figure(figsize=(10, 6))
house_counts.plot(kind='bar', color=[colors[house] for house in house_counts.index])
plt.title('Distribution of Students Across Hogwarts Houses', fontsize=15)
plt.xlabel('House', fontsize=12)
plt.ylabel('Number of Students', fontsize=12)
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

print(f"Total students: {house_counts.sum()}")
for house, count in house_counts.items():
    print(f"{house}: {count} students ({count/house_counts.sum()*100:.1f}%)")

## 2. Data Visualization & Feature Selection

Start by exploring the features to understand which ones are useful for classifying students.

The function below plots the score distribution of each house for a given course. Comparing these distributions helps detect patterns or differences between the houses.

In [None]:
from visualization.utils import HOUSE_COLORS

def plot_course_histogram(course):
    plt.figure(figsize=(12, 6))
    sns.histplot(
        df,
        x=course,
        hue='Hogwarts House',
        palette=HOUSE_COLORS,
        element='step',
        multiple='layer'
    )
    plt.title(f"{course} Score Distribution by House", fontsize=15)
    plt.xlabel("Score", fontsize=12)
    plt.ylabel("Number of Students", fontsize=12)
    plt.tight_layout()
    plt.show()

The courses `Care of Magical Creatures` and `Arithmancy` show a **homogeneous distribution**: the scores of the four houses largely overlap, so these features are unlikely to provide meaningful separation between the Hogwarts houses.

Exclude these features from the training stage to avoid introducing noise into the model.

In [None]:
plot_course_histogram('Care of Magical Creatures')
plot_course_histogram('Arithmancy')

In [None]:
def create_scatter_plot(x_feature, y_feature):
    plt.figure(figsize=(10, 8))
    sns.scatterplot(
        data=df,
        x=x_feature,
        y=y_feature,
        hue='Hogwarts House',
        palette=HOUSE_COLORS,
        alpha=0.7
    )
    plt.title(f"{x_feature} vs {y_feature}", fontsize=15)
    plt.xlabel(x_feature, fontsize=12)
    plt.ylabel(y_feature, fontsize=12)
    plt.tight_layout()
    plt.show()

The features **'Astronomy'** and **'Defense Against the Dark Arts'** are highly correlated. Including both in the training stage would add redundancy without adding information.

Keeping only one of these features is sufficient.

In [None]:
create_scatter_plot('Astronomy', 'Defense Against the Dark Arts')
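
The strength of a linear relationship like this one can be quantified with the Pearson correlation coefficient. Below is a minimal pure-Python sketch; the helper `pearson_correlation` and the toy data are illustrative and not part of the project.

```python
import math


def pearson_correlation(xs, ys):
    """Pearson correlation of two equally long sequences.

    Pairs in which either value is NaN are skipped.
    """
    pairs = [
        (x, y) for x, y in zip(xs, ys)
        if not (math.isnan(x) or math.isnan(y))
    ]
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x, _ in pairs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for _, y in pairs))
    return covariance / (spread_x * spread_y)


# Toy data: the second column is an exact negative multiple of the first.
a = [1.0, 2.0, float('nan'), 4.0, 5.0]
b = [-0.1, -0.2, -0.3, -0.4, -0.5]
r = pearson_correlation(a, b)  # close to -1 for this toy data
```

On the real data this could be called as `pearson_correlation(df['Astronomy'], df['Defense Against the Dark Arts'])`. A coefficient whose absolute value is close to 1 suggests that the second feature carries little information beyond the first.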

Let's create a subset of the pair plot focusing on the most relevant features for classification.

## 3. Preprocess the Datasets

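Drop the features excluded above, along with the non-score columns, from both the training and test sets, then fill missing values with `0.0`.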

In [None]:
TEST_DATASET = '../../data/raw/dataset_test.csv'

FEATURES_TO_DROP = [
    'Care of Magical Creatures',          # Homogeneous
    'Arithmancy',                         # Homogeneous
    'Defense Against the Dark Arts',      # Highly correlated with Astronomy

    'First Name',                         # Not useful for analysis
    'Last Name',                          # Not useful for analysis
    'Birthday',                           # Not useful for analysis
    'Best Hand',                          # Not useful for analysis
    'Index'                               # Not useful for analysis
]

prediction_df = pd.read_csv(TEST_DATASET)
prediction_df.drop(columns=FEATURES_TO_DROP, inplace=True)
prediction_df.fillna(0.0, inplace=True)


raw_df = pd.read_csv(train_dataset_path)
processed_df = raw_df.drop(columns=FEATURES_TO_DROP)
processed_df.fillna(0.0, inplace=True)

processed_df.head()

In [None]:
from sklearn.preprocessing import StandardScaler

labels = processed_df['Hogwarts House']
features_df = processed_df.drop('Hogwarts House', axis=1)
prediction_features_df = prediction_df.drop('Hogwarts House', axis=1)

scaler = StandardScaler()
features = scaler.fit_transform(features_df)
prediction_features = scaler.transform(prediction_features_df)
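
To make the scaling step concrete, here is a rough NumPy sketch of what standardization does: each column is centered on its mean and divided by its population standard deviation, and the statistics learned from the training set are reused for the test set. The helper names `fit_standardizer` and `standardize` are hypothetical, and the arrays are toy data.

```python
import numpy as np


def fit_standardizer(X):
    """Return the per-column mean and population standard deviation (ddof=0)."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    # Guard against constant columns, which would otherwise divide by zero.
    std = np.where(std == 0.0, 1.0, std)
    return mean, std


def standardize(X, mean, std):
    """Center and scale X with statistics computed beforehand."""
    return (X - mean) / std


train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
test = np.array([[2.0, 40.0]])

# Fit on the training data only, then apply the same shift and scale to both sets.
mean, std = fit_standardizer(train)
train_scaled = standardize(train, mean, std)
test_scaled = standardize(test, mean, std)
```

Reusing the training statistics keeps both sets on the same scale, so weights learned on the training features remain meaningful for the prediction features.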

## 4. Train Logistic Regression Models

Train the model using **Mini-Batch Gradient Descent**.

In [None]:
from models.train import train, TrainingConfig
import time

algo = "mini_batch_gradient_descent"

config = TrainingConfig()

print(f"\n--- Training with {algo} ---")
start = time.time()
weights = train(algo, features, labels, config)
duration = time.time() - start

f'Training completed in {duration:.2f} seconds'
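
The internals of `train` are not shown in this notebook. As an illustration only, the sketch below shows one common way to build a multi-class classifier from logistic regression: one binary model per class (one-vs-all), each fitted with mini-batch gradient descent. The function names, hyperparameters, and toy data are assumptions, not the project's implementation.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def train_one_vs_all(X, y, classes, lr=0.1, epochs=200, batch_size=32, seed=0):
    """Fit one binary logistic regression per class with mini-batch gradient descent.

    Returns a (n_classes, n_features + 1) weight matrix; column 0 is the bias.
    """
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])  # prepend a bias column
    weights = np.zeros((len(classes), Xb.shape[1]))
    for k, cls in enumerate(classes):
        target = (y == cls).astype(float)  # 1 for this class, 0 for the rest
        for _ in range(epochs):
            order = rng.permutation(len(Xb))  # reshuffle the rows every epoch
            for start in range(0, len(Xb), batch_size):
                batch = order[start:start + batch_size]
                error = sigmoid(Xb[batch] @ weights[k]) - target[batch]
                gradient = Xb[batch].T @ error / len(batch)
                weights[k] -= lr * gradient
    return weights


def predict_one_vs_all(X, weights, classes):
    """Pick, for each row, the class whose binary model is most confident."""
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])
    scores = sigmoid(Xb @ weights.T)
    return np.asarray(classes)[scores.argmax(axis=1)]


# Toy data: three well-separated 2D clusters standing in for houses.
rng = np.random.default_rng(42)
centers = np.array([[-3.0, 0.0], [3.0, 0.0], [0.0, 4.0]])
toy_classes = ['A', 'B', 'C']
toy_X = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])
toy_y = np.repeat(toy_classes, 50)

toy_weights = train_one_vs_all(toy_X, toy_y, toy_classes)
toy_predictions = predict_one_vs_all(toy_X, toy_weights, toy_classes)
toy_accuracy = (toy_predictions == toy_y).mean()
```

Each update uses only `batch_size` rows, so it is cheaper than a full-batch step and less noisy than updating on a single sample.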

## 5. Evaluate Model Performance

Measure accuracy on the training set.

In [None]:
from models.predict import predict

predictions = predict(features, weights)

matches = predictions['Hogwarts House'] == labels.reset_index(drop=True)
accuracy = matches.sum() / len(matches) * 100
f'{matches.sum()} out of {len(matches)} predictions match the actual labels ({accuracy:.3f}%) on the training set.'

## 6. Predict and Export Results

Now run the **predict** function on the prepared test dataset, in which students have not yet been assigned to **Hogwarts Houses**.
Then compare the results with scikit-learn's logistic regression.

In [None]:
predictions = predict(prediction_features, weights)

predictions.head()
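
The export step itself is not shown in this notebook. A minimal sketch of writing the predicted houses to a CSV file could look like the following; the helper `export_predictions`, the file name `houses.csv`, and the `Index,Hogwarts House` header are assumptions rather than the project's actual output format.

```python
import csv
import os
import tempfile


def export_predictions(houses, path):
    """Write one predicted house per row, keyed by the row position."""
    with open(path, 'w', newline='') as handle:
        writer = csv.writer(handle)
        writer.writerow(['Index', 'Hogwarts House'])  # assumed header
        for index, house in enumerate(houses):
            writer.writerow([index, house])


# Demonstration in a temporary directory so no project file is touched.
with tempfile.TemporaryDirectory() as tmp_dir:
    output_path = os.path.join(tmp_dir, 'houses.csv')  # hypothetical file name
    export_predictions(['Ravenclaw', 'Slytherin'], output_path)
    with open(output_path, newline='') as handle:
        rows = list(csv.reader(handle))
```

With the objects defined above, the call would be along the lines of `export_predictions(predictions['Hogwarts House'], 'houses.csv')`.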

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler
from sklearn.multiclass import OneVsRestClassifier

scaler = StandardScaler()
X_scaled = scaler.fit_transform(processed_df.drop("Hogwarts House", axis=1))
y = processed_df["Hogwarts House"]

model = OneVsRestClassifier(LogisticRegression(max_iter=1000))
model.fit(X_scaled, y)

sklearn_acc = accuracy_score(y, model.predict(X_scaled))
f"sklearn Logistic Regression accuracy on training dataset {sklearn_acc * 100:.3f}%"

Compare the predictions of the two models on the test set.

In [None]:

sklearn_predictions = model.predict(prediction_features)
# Share of test rows on which both models predict the same house.
models_match = accuracy_score(predictions['Hogwarts House'], sklearn_predictions)

f'{models_match * 100:.3f}% of the predictions match between the custom model and sklearn Logistic Regression'