## 📘 Introduction

This notebook supports the capstone project **"Question Difficulty Classification"**, which develops a machine learning model to predict the **grade level** (3rd–12th) of an input question. The goal is to align questions with the **Common Core State Standards (CCSS)**.

Inspired by the paper *"Question Difficulty Estimation Based on Attention Model for Question Answering"*, the architecture extends BERT using a **custom Dual Attention Mechanism** (`DualBertModel`) to better capture the semantic complexity of questions.

This notebook walks through:
- Loading and preprocessing a custom educational dataset (`QxGrade`)
- Initializing and training a dual-attention BERT-based classifier
- Evaluating performance on unseen grade-level questions
- Saving the model for deployment in a **Streamlit app**

The project is implemented primarily in **PyTorch**.



## 📦 Library Imports

In [7]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

from transformers import AutoModel, AutoTokenizer
from transformers.models.bert.modeling_bert import BertIntermediate, BertOutput, BertEncoder, BertSelfAttention, BertSelfOutput, BertModel, BertConfig, BertPooler
from torch.utils.data import TensorDataset, DataLoader

import pandas as pd
import matplotlib.pyplot as plt

## 🛠️ Functions
Here are the training and testing loops used later in the notebook:

In [8]:
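def train(model, input_ids, attention_mask, train_loader, criterion, optimizer, epochs):
    # Note: the input_ids and attention_mask arguments are overwritten by the batch
    # loop below, and `device` is read from the notebook's global scope.
    running_acc = []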

    for epoch in range(epochs):
        model.train()
        total_loss = 0
        total_correct = 0
        total_samples = 0
        
        for input_ids, attention_mask, labels in train_loader:
            input_ids = input_ids.to(device)
            attention_mask = attention_mask.to(device)
            labels = labels.to(device)

            # Forward pass -- Passes the questions through the Neural Network then tests them against the correct labels
            outputs = model(input_ids, attention_mask)
            loss = criterion(outputs, labels)

            # Accuracy
            _, predicted_labels = torch.max(outputs, dim=1)
            correct_predictions = (predicted_labels == labels).sum().item()
            total_correct += correct_predictions
            total_samples += labels.size(0)

            # Backward pass -- Adjusts model to make better predictions
            optimizer.zero_grad()  #Zero out the gradients from the previous batch
            loss.backward()  # Backpropagate the loss to compute gradients
            optimizer.step()  #Perform a single step to update parameters

            total_loss += loss.item()

        # Epoch metrics
        epoch_loss = total_loss / len(train_loader)
        epoch_acc = total_correct / total_samples
        running_acc.append(epoch_acc)
        print(f'Epoch [{epoch+1}/{epochs}], Loss: {epoch_loss:.4f}, Accuracy: {epoch_acc:.4f}')

    # Plotting after training
    df_acc = pd.DataFrame({'Epochs': range(1, epochs + 1), 'Accuracy': running_acc})
    df_acc.plot(x='Epochs', y='Accuracy', kind='line', title='Training Accuracy Over Epochs', grid=True)
    plt.show()





def test(model, test_loader, criterion, device, epochs):
    running_test_acc = []

    # The model is not updated here, so each pass over test_loader gives the same accuracy
    for epoch in range(epochs):
        model.eval()
        total_loss = 0
        total_correct = 0
        total_samples = 0

        with torch.no_grad():  # Disable gradient tracking; evaluation must not train on test data
            for input_ids, attention_mask, labels in test_loader:
                input_ids = input_ids.to(device)
                attention_mask = attention_mask.to(device)
                labels = labels.to(device)

                # Forward pass
                outputs = model(input_ids, attention_mask=attention_mask)
                loss = criterion(outputs, labels)

                # Accuracy
                _, predicted_labels = torch.max(outputs, dim=1)
                correct_predictions = (predicted_labels == labels).sum().item()
                total_correct += correct_predictions
                total_samples += labels.size(0)

                total_loss += loss.item()

        # Epoch metrics
        epoch_loss = total_loss / len(test_loader)
        epoch_acc = total_correct / total_samples
        running_test_acc.append(epoch_acc)
        print(f'[Test] Epoch [{epoch+1}/{epochs}], Loss: {epoch_loss:.4f}, Accuracy: {epoch_acc:.4f}')

    # Plotting after testing
    df_test = pd.DataFrame({'Epochs': range(1, epochs + 1), 'Accuracy': running_test_acc})
    df_test.plot(x='Epochs', y='Accuracy', kind='line', title='Testing Accuracy Over Epochs', grid=True)
    plt.show()
    
    return df_test
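
One detail of the metrics above: `epoch_acc` is computed per sample (`total_correct / total_samples`), while `epoch_loss` averages the per-batch mean losses (`total_loss / len(loader)`), so a smaller final batch counts as much as a full one. The following is a minimal pure-Python sketch of that difference. The batch values are made up for illustration, and `epoch_metrics` is a hypothetical helper, not part of the notebook.

```python
def epoch_metrics(batches):
    """Aggregate per-batch results the way train()/test() do.

    Each batch is a tuple (mean_loss, num_correct, batch_size).
    Returns (loss averaged over batches, accuracy over samples,
             loss averaged over samples).
    """
    total_loss = 0.0
    weighted_loss = 0.0
    total_correct = 0
    total_samples = 0
    for mean_loss, num_correct, batch_size in batches:
        total_loss += mean_loss                  # mirrors `total_loss += loss.item()`
        weighted_loss += mean_loss * batch_size  # weights each batch by its size
        total_correct += num_correct
        total_samples += batch_size
    per_batch_loss = total_loss / len(batches)
    per_sample_loss = weighted_loss / total_samples
    accuracy = total_correct / total_samples
    return per_batch_loss, accuracy, per_sample_loss


# Made-up numbers: two full batches of 32 and a final batch of 4.
toy_batches = [(1.0, 16, 32), (1.0, 16, 32), (4.0, 0, 4)]
per_batch_loss, accuracy, per_sample_loss = epoch_metrics(toy_batches)
print(f"loss (per batch):  {per_batch_loss:.4f}")
print(f"loss (per sample): {per_sample_loss:.4f}")
print(f"accuracy:          {accuracy:.4f}")
```

With batches of equal size the two loss averages coincide; they only drift apart when the last batch is smaller than the rest.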


## 🧠 Neural Network Modifications
These classes modify the BERT model to use a dual multi-head attention mechanism and add a multiclass classification head.

In [9]:
## Implements dual self-attention: two separate multi-head self-attention modules run in parallel and their outputs are combined
class DualBertAttention(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.attention1 = BertSelfAttention(config)
        self.attention2 = BertSelfAttention(config)
        self.output1 = BertSelfOutput(config)
        self.output2 = BertSelfOutput(config)

    def forward(self, hidden_states, attention_mask=None, head_mask=None):
        attn_output1 = self.attention1(hidden_states, attention_mask, head_mask)[0]
        attn_output1 = self.output1(attn_output1, hidden_states)

        attn_output2 = self.attention2(hidden_states, attention_mask, head_mask)[0]
        attn_output2 = self.output2(attn_output2, hidden_states)

        # Sum the two attention outputs, then apply ReLU (replacing any negative values with 0)
        dual_attention_output = F.relu(attn_output1 + attn_output2)
        return dual_attention_output


## Wraps DualBertAttention in a full transformer layer with intermediate and output sublayers
class DualBertLayer(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.attention = DualBertAttention(config)
        self.intermediate = BertIntermediate(config)
        self.output = BertOutput(config)

    def forward(
        self,
        hidden_states,
        attention_mask=None,
        head_mask=None,
        encoder_hidden_states=None,
        encoder_attention_mask=None,
        past_key_value=None,
        output_attentions=False,
    ):
        attention_output = self.attention(hidden_states, attention_mask, head_mask)
        intermediate_output = self.intermediate(attention_output)
        layer_output = self.output(intermediate_output, attention_output)
        return (layer_output,)


## Stacks multiple DualBertLayer modules to form the full transformer encoder
class DualBertEncoder(BertEncoder):
    def __init__(self, config):
        super().__init__(config)
        self.layer = nn.ModuleList([DualBertLayer(config) for _ in range(config.num_hidden_layers)])


## Replaces the standard BERT encoder with the DualBertEncoder while keeping pooling layer
class DualBertModel(BertModel):
    def __init__(self, config):
        super().__init__(config)
        self.encoder = DualBertEncoder(config)
        self.pooler = BertPooler(config)

    def forward(self, input_ids=None, attention_mask=None, token_type_ids=None, **kwargs):
        # Run the standard BERT forward pass using the modified encoder
        outputs = super().forward(input_ids=input_ids,
                                  attention_mask=attention_mask,
                                  token_type_ids=token_type_ids,
                                  **kwargs)

        sequence_output = outputs[0]
        pooled_output = self.pooler(sequence_output)

        return (sequence_output, pooled_output)


## Adds a classification head on top of the DualBertModel for text classification
class BertClassifier(nn.Module):
    def __init__(self, bert_model, hidden_dim=64, output_dim=6):
        super().__init__()
        self.bert = bert_model 
        self.classifier = nn.Sequential(
            nn.Linear(self.bert.config.hidden_size, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, output_dim)
        )

    def forward(self, input_ids, attention_mask=None):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        cls_embedding = outputs[0][:, 0, :]  # Take [CLS] token's embedding
        return self.classifier(cls_embedding)


## 🗃️ Dataset Import
Import the question set used to train the model. `QxGrade_Dataset` contains roughly 26k questions scraped from PDF textbooks. The textbooks were chosen for their alignment with the Common Core State Standards, which gives us a framework to reuse when training the model on additional data.

In [10]:
df = pd.read_csv('QxGrade_Dataset.csv')

In [11]:
grade_counts = df['Grade'].value_counts().sort_index()
print("Grade Distribution:\n", grade_counts)

Grade Distribution:
 Grade
3      418
4     1191
5     1658
6      549
7     1153
8     1805
9     8668
10    2704
11    7288
12    1125
Name: count, dtype: int64
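
The distribution above is uneven: grades 9 and 11 together account for roughly 60% of the questions. As a rough sketch (not something the notebook itself does), the snippet below computes the accuracy of always guessing the most common grade, which is a useful floor when reading the model's accuracy, along with "balanced" inverse-frequency class weights. Such weights could optionally be passed as the `weight` argument of `nn.CrossEntropyLoss`; that is an assumption about a possible extension, and the names `printed_counts` and `class_weights` are hypothetical.

```python
# Grade counts copied from the distribution printed above.
printed_counts = {3: 418, 4: 1191, 5: 1658, 6: 549, 7: 1153,
                  8: 1805, 9: 8668, 10: 2704, 11: 7288, 12: 1125}

total = sum(printed_counts.values())
n_grades = len(printed_counts)

# Accuracy of a trivial classifier that always predicts the most common grade.
majority_grade = max(printed_counts, key=printed_counts.get)
majority_baseline = printed_counts[majority_grade] / total

# "Balanced" inverse-frequency weights: total / (n_grades * count).
# Rare grades receive weights above 1, common grades below 1.
class_weights = {g: total / (n_grades * c) for g, c in printed_counts.items()}

print(f"Total questions: {total}")
print(f"Majority grade: {majority_grade}, baseline accuracy: {majority_baseline:.4f}")
for grade in sorted(class_weights):
    print(f"Grade {grade:>2}: weight {class_weights[grade]:.3f}")
```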


The two columns we use are `Grade` and `question`. `.values.tolist()` converts the question column into a Python list `x`, and the grades (3–12) are cast to strings and collected into a second list `y`.

In [12]:
x = df.question.values.tolist()  ## x holds the question text
y = df.Grade.astype(str).tolist() ## y holds the grade labels

num_classes = len(set(y))  ## Number of distinct grades (10 classes, 3rd-12th)

Use `train_test_split` to hold out 20% of the data for testing.

In [None]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)  ##This is the only place where we use SciKit learn instead of pytorch

Tokenize the data

In [None]:
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')  ##Use the GPU when one is available, otherwise fall back to the CPU

# Tokenize training and test sets
x_train_encodings = tokenizer(x_train, padding=True, truncation=True, return_tensors='pt', max_length=32)
x_test_encodings = tokenizer(x_test, padding=True, truncation=True, return_tensors='pt', max_length=32)

# Extract input IDs and attention masks
input_ids_train = x_train_encodings["input_ids"].to(device)
attention_mask_train = x_train_encodings["attention_mask"].to(device)

input_ids_test = x_test_encodings["input_ids"].to(device)
attention_mask_test = x_test_encodings["attention_mask"].to(device)
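
With `padding=True`, `truncation=True`, and `max_length=32`, the tokenizer cuts each question to at most 32 tokens, pads shorter questions up to the longest one in the batch, and returns an attention mask with 1 for real tokens and 0 for padding. The toy sketch below imitates only that padding and masking step in plain Python. `pad_and_mask` is a hypothetical helper, the token IDs are illustrative, and unlike the real tokenizer it does not handle special tokens when truncating. The pad ID of 0 matches `[PAD]` in `bert-base-uncased`.

```python
def pad_and_mask(sequences, max_length, pad_id=0):
    """Truncate to max_length, pad to the longest sequence, build masks."""
    truncated = [seq[:max_length] for seq in sequences]
    longest = max(len(seq) for seq in truncated)
    input_ids, attention_mask = [], []
    for seq in truncated:
        n_pad = longest - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)             # pad on the right
        attention_mask.append([1] * len(seq) + [0] * n_pad)  # 1 = real, 0 = pad
    return input_ids, attention_mask


# Three toy "questions" of different lengths, already converted to IDs.
toy_batch = [[101, 2054, 102], [101, 2054, 2003, 1996, 102], [101, 102]]
toy_ids, toy_mask = pad_and_mask(toy_batch, max_length=4)

for row_ids, row_mask in zip(toy_ids, toy_mask):
    print(row_ids, row_mask)
```

Every row comes out with the same length, which is what allows the encodings to be stacked into a single tensor.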


In [15]:
y_train_ints = [int(label) -3 for label in y_train] #Shift grades 3-12 to zero-based class indices 0-9, as CrossEntropyLoss expects
y_test_ints = [int(label) -3 for label in y_test]
y_train_tensor = torch.tensor(y_train_ints)
y_test_tensor = torch.tensor(y_test_ints)
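
Because the labels are shifted by 3, a predicted class index has to be shifted back before it is shown as a grade, for example in the Streamlit app. A minimal sketch of the round trip follows; `grade_to_index` and `index_to_grade` are hypothetical helper names, not functions defined in this notebook.

```python
MIN_GRADE = 3
MAX_GRADE = 12


def grade_to_index(grade):
    """Map a grade (int or numeric string, 3-12) to a class index 0-9."""
    grade = int(grade)
    if not MIN_GRADE <= grade <= MAX_GRADE:
        raise ValueError(f"grade {grade} is outside {MIN_GRADE}-{MAX_GRADE}")
    return grade - MIN_GRADE


def index_to_grade(index):
    """Map a class index 0-9 (e.g. an argmax over logits) back to a grade."""
    if not 0 <= index <= MAX_GRADE - MIN_GRADE:
        raise ValueError(f"class index {index} is out of range")
    return index + MIN_GRADE


labels = ["3", "9", "12"]
indices = [grade_to_index(g) for g in labels]
restored = [index_to_grade(i) for i in indices]
print(indices)   # [0, 6, 9]
print(restored)  # [3, 9, 12]
```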

## 🤖 Build Custom BERT
Using the classes defined above, instantiate our version of BERT.

In [16]:
# Load the base BERT config and build the dual-attention model from it
config = BertConfig.from_pretrained("bert-base-uncased")
dual_bert = DualBertModel(config)

# Load pretrained BERT model
pretrained_bert = BertModel.from_pretrained("bert-base-uncased")

# Reuse the pretrained embeddings and pooler; the dual-attention encoder
# layers are not copied, so they are trained from scratch
dual_bert.embeddings = pretrained_bert.embeddings
dual_bert.pooler = pretrained_bert.pooler

# Build classifier
bert_classifier = BertClassifier(dual_bert, hidden_dim=64, output_dim=num_classes).to(device)

## 🔧 Set Hyperparameters
Set the hyperparameters. You can tinker with the number of epochs, batch size, and sequence length here.

In [17]:
train_epochs = 20 ##How many times we go through the loop
test_epochs = 5

criterion = nn.CrossEntropyLoss()  ##This compares the predicted class scores with the correct class
optimizer = torch.optim.Adam(bert_classifier.parameters(), lr=2e-5)  ##Adam updates the model parameters; lr sets the size of each update step

sequence_length = 32   ## Maximum length of tokens to be used at a time
batch_size = 32  ##The number of training examples in one forward/backward pass
input_dim = 512  ##Not read by BertClassifier, which takes its width from config.hidden_size
d_model = 512  ##Not read by BertClassifier either; 512 is the default d_model of PyTorch's nn.Transformer


test_dataset = TensorDataset(input_ids_test, attention_mask_test, y_test_tensor)    ##Wrap the test and training tensors in datasets so DataLoader can batch them
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=True)

train_dataset = TensorDataset(input_ids_train, attention_mask_train, y_train_tensor)
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)


in_features = input_ids_train.shape[1]   ##Padded sequence length of the training inputs
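
To get a feel for what one epoch means with these settings, the sketch below estimates how many batches each loader yields. It assumes every row counted in the grade distribution ends up in the split, and that `train_test_split` rounds the test set size up, so treat the numbers as estimates rather than values read from the notebook.

```python
import math

n_total = 26559      # sum of the printed grade counts (assumed to be all rows)
test_size = 0.2
batch_size = 32

# Assumed split sizes: test set rounded up, training set takes the rest.
n_test = math.ceil(test_size * n_total)
n_train = n_total - n_test

# DataLoader keeps the final partial batch by default (drop_last=False),
# so the number of batches is the ceiling of samples / batch_size.
train_batches = math.ceil(n_train / batch_size)
test_batches = math.ceil(n_test / batch_size)
last_train_batch = n_train - (train_batches - 1) * batch_size

print(f"train samples: {n_train}, batches per epoch: {train_batches}")
print(f"test samples:  {n_test}, batches per pass:  {test_batches}")
print(f"size of the final training batch: {last_train_batch}")
```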

## ⚠️ CUDA Troubleshooting
If CUDA is not available, this block will tell you. The code falls back to the CPU in that case, but the training loop is not practical to run there.

In [18]:


print("CUDA available? ", torch.cuda.is_available())
print("Device name:    ", torch.cuda.get_device_name(0) if torch.cuda.is_available() else "No GPU detected")




CUDA available?  True
Device name:     NVIDIA GeForce RTX 4070 Ti


## 🏋️ Model Training
Use BERT to classify the data. The training loop takes around 1 minute per epoch with the current hyperparameters.

In [None]:
train(
    bert_classifier,
    input_ids_train,
    attention_mask_train,
    train_loader,
    criterion,
    optimizer,
    train_epochs
)

torch.save(bert_classifier.state_dict(), 'Bert_Classifier.pt')


Epoch [1/20], Loss: 1.4273, Accuracy: 0.5319


## 💾 Load Trained Model

In [None]:
# Rebuild full model
bert_base = DualBertModel(config)
bert_classifier = BertClassifier(bert_model=bert_base, hidden_dim=64, output_dim=num_classes)

# Load the model state dict (the file name must match the one used in torch.save, including case)
bert_classifier.load_state_dict(torch.load("Bert_Classifier.pt", map_location=device))

# Move to device and set to eval mode
bert_classifier.to(device)
bert_classifier.eval()


## 🧪 Model Testing

In [None]:
bert_classifier.eval()
test(bert_classifier, test_loader, criterion, device, test_epochs)



## 📚 Conclusions

- A model can be successfully trained to classify questions by grade level in alignment with the Common Core State Standards (CCSS).

- Expanding the dataset with greater variety and sourcing material from a wider range of educational companies is likely to significantly improve model performance.

- Multiple rounds of testing with the current architecture consistently show an accuracy of approximately 70%.

- While the model is functional in its current form, further improvements in accuracy and confidence would be necessary before deployment in a professional or production environment. 