3bada66/LLM-project

University Student Support — Intent Detection

A machine learning project that classifies student questions into predefined intent categories. The system supports two model backends for comparison:

  • Classic ML: TF-IDF + best classifier selected via stratified cross-validation (scikit-learn)
  • BERT: bert-base-uncased fine-tuned for sequence classification (Hugging Face + PyTorch)

Intent Schema

The classifier recognizes 11 university support intents:

Intent              Example question
ask_schedule        What time does my calculus class start?
ask_registration    How do I register for courses next semester?
ask_tuition         When is the tuition payment deadline?
ask_deadline        What is the last day to withdraw from a course?
ask_location        Where is the registrar's office located?
ask_contact         How can I contact my academic advisor?
ask_gpa             What GPA do I need to maintain my scholarship?
ask_courses         What courses are required for a CS degree?
ask_accommodation   How do I apply for on-campus housing?
greeting            Hi, I need some help.
goodbye             Thanks, that was very helpful. Goodbye.

Dataset

  • File: data/intent-detection-train.jsonl
  • Format: one JSON object per line, fields text and label
  • Size: 333 examples (38 for ask_contact, which was extended with informal/email-style queries; 30 each for the 9 other original intents; 25 for ask_accommodation)
  • Language: English
  • Balance: near-uniform; number of classes is inferred dynamically at train time
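
To make the file format concrete, here is a minimal sketch of reading a JSONL file with the text and label fields and counting examples per intent. The helper names load_jsonl and label_counts are illustrative and are not taken from the project's code.

```python
import json
from collections import Counter
from pathlib import Path


def load_jsonl(path):
    """Read one JSON object per line, skipping blank lines."""
    records = []
    with Path(path).open(encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue
            record = json.loads(line)
            # Each record is expected to carry the two documented fields.
            if "text" not in record or "label" not in record:
                raise ValueError(f"line {line_number}: missing 'text' or 'label'")
            records.append(record)
    return records


def label_counts(records):
    """Count examples per intent label."""
    return Counter(record["label"] for record in records)
```

Calling label_counts(load_jsonl("data/intent-detection-train.jsonl")) is one way to check the per-intent sizes listed above.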

Splits are generated deterministically via stratified sampling:

Split        Default size
Train        70%
Validation   10%
Test         20%
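
A deterministic stratified split with these proportions can be sketched with the standard library alone. This is an illustration under assumptions, not the project's implementation: the function name, the seed value 42, and the per-label rounding are all hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(records, val_frac=0.10, test_frac=0.20, seed=42):
    """Split records into train/validation/test, label by label."""
    by_label = defaultdict(list)
    for record in records:
        by_label[record["label"]].append(record)

    rng = random.Random(seed)  # fixed seed: same input order gives the same split
    train, validation, test = [], [], []
    for label in sorted(by_label):  # iterate labels in a fixed order
        group = by_label[label][:]
        rng.shuffle(group)
        n_test = round(len(group) * test_frac)
        n_val = round(len(group) * val_frac)
        test.extend(group[:n_test])
        validation.extend(group[n_test:n_test + n_val])
        train.extend(group[n_test + n_val:])
    return train, validation, test
```

Splitting each label separately keeps the class proportions of the full dataset in all three splits, which matters with only 25 to 38 examples per intent.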

Project Structure

university-intent-detection/
├── analysis/
│   ├── scripts/
│   │   ├── common.py                  # Shared I/O helpers
│   │   ├── eda_report.py              # Dataset summary + intent distribution
│   │   ├── robustness_experiments.py  # Lexical overlap + baseline CV stats
│   │   ├── generate_stress_test.py    # Rule-based perturbed test generation
│   │   └── run_all.py                 # Runs all analysis scripts in sequence
│   └── outputs/                       # Generated analysis artifacts
├── apps/
│   └── app.py                         # Streamlit demo app
├── config/
│   └── settings.yaml                  # Central configuration
├── data/
│   ├── intent-detection-train.jsonl   # Main dataset (333 examples, 11 intents)
│   ├── intent-detection-test-perturbed.jsonl  # Stress-test variants
│   └── splits/                        # Deterministic train/val/test splits
├── logs/                              # Runtime logs
├── models/                            # Saved model artifacts
├── outputs/evaluations/               # Structured evaluation outputs (JSON)
├── src/
│   ├── cli/main.py                    # Unified CLI (train/predict/evaluate/prepare-data/augment-data)
│   ├── config/                        # Config loader and typed settings
│   ├── data/                          # Dataset preparation and augmentation
│   ├── models/                        # base, classic, bert, registry
│   ├── services/                      # Training, evaluation, prediction orchestration
│   └── utils/                         # Dataset I/O, logging, run helpers
├── Dockerfile
├── Makefile
└── requirements.txt

Setup

python -m venv .venv
.venv\Scripts\python -m pip install --upgrade pip
.venv\Scripts\python -m pip install -r requirements.txt

The paths above use the Windows virtual-environment layout; on macOS or Linux, replace .venv\Scripts\python with .venv/bin/python.

Or with Makefile:

make install

Workflow

1. Prepare dataset splits

make prepare-data

Creates data/splits/train.jsonl, validation.jsonl, test.jsonl.

2. (Optional) Augment training data

make augment-data

Creates data/splits/train_expanded.jsonl with rule-based variants added to balance classes.

3. Train

make train MODEL=classic
make train MODEL=bert

Or directly:

python -m src.cli.main train --model classic --dataset data/intent-detection-train.jsonl
python -m src.cli.main train --model bert   --dataset data/intent-detection-train.jsonl

4. Predict

make predict MODEL=classic TEXT="How do I register for classes next semester?"
make predict MODEL=bert   TEXT="How do I register for classes next semester?"

Or directly:

python -m src.cli.main predict --model classic --text "How do I register for classes next semester?"
python -m src.cli.main predict --model bert   --text "How do I register for classes next semester?"

5. Evaluate

make evaluate MODEL=classic
make evaluate MODEL=bert

Structured evaluation artifacts are saved to outputs/evaluations/<model>/<timestamp>/:

  • summary.json — run metadata and split info
  • metrics.json — accuracy, macro F1, weighted F1, precision, recall
  • classification_report.json — per-class precision/recall/F1
  • confusion_matrix.json — full confusion matrix
  • predictions.json — per-sample true vs predicted labels
  • error_analysis.json — per-class error breakdown and top confused label pairs

5a. Robustness evaluation (optional)

Evaluates both models on a perturbed stress-test dataset and logs a side-by-side comparison. Requires the perturbed dataset to exist (run python analysis/scripts/generate_stress_test.py first).

python -m src.cli.main evaluate-robustness

Or with a custom dataset path:

python -m src.cli.main evaluate-robustness --dataset data/intent-detection-test-perturbed.jsonl
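
For reference, the macro F1 reported in metrics.json and the confused label pairs in error_analysis.json can be computed from true and predicted labels along these lines. This is a standard-library sketch with hypothetical function names; the project's evaluation service may compute them differently.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in y_true or y_pred."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def top_confused_pairs(y_true, y_pred, k=3):
    """Most frequent (true, predicted) pairs among the misclassified samples."""
    errors = Counter((t, p) for t, p in zip(y_true, y_pred) if t != p)
    return errors.most_common(k)
```

Macro F1 weights every intent equally, so a weak class such as a small one is not hidden by strong results on the larger classes.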

6. Run the Streamlit demo

make app

Or:

.venv\Scripts\python -m streamlit run apps/app.py

Docker

make docker-train   MODEL=classic
make docker-predict MODEL=classic TEXT="What time is my physics class?"
make docker-evaluate MODEL=classic

make docker-train   MODEL=bert
make docker-app

Exploratory Analysis

python analysis/scripts/run_all.py

Or individually:

python analysis/scripts/eda_report.py
python analysis/scripts/robustness_experiments.py
python analysis/scripts/generate_stress_test.py

Outputs are written to analysis/outputs/.


Models

Classic ML (TF-IDF + sklearn)

  • Vectorizer: TfidfVectorizer with custom tokenizer (whitespace/apostrophe split + contraction expansion)
  • Contraction expansion: 36-entry lookup table applied before tokenization (e.g. "don't" → "do not")
  • Candidates: Random Forest, Logistic Regression, SVM, Naive Bayes, Gradient Boosting, KNN
  • Selection: stratified k-fold cross-validation, best macro F1 wins
  • Confidence scores: SVC is fitted with probability=True (Platt scaling) to enable per-prediction confidence scores
  • Saved to: models/traditional/

BERT (bert-base-uncased)

  • Tokenizer: BertTokenizer
  • Model: BertForSequenceClassification fine-tuned on training split; num_labels inferred dynamically
  • Optimizer: AdamW with linear warmup scheduler (warmup = 10% of total training steps)
  • Validation loop: macro F1 computed on data/splits/validation.jsonl after each epoch
  • Early stopping: patience = 2 epochs; best checkpoint (highest val macro F1) is restored before saving
  • Confidence scores: softmax over logits, max probability returned alongside predicted intent
  • Saved to: models/bert/
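
The warmup schedule and the early-stopping rule can be illustrated without any deep-learning dependency. The sketch below assumes the learning rate decays linearly to zero after warmup, which this README does not state, and it replays a list of validation scores instead of training a model. Both function names are hypothetical.

```python
def linear_warmup_factor(step, total_steps, warmup_frac=0.10):
    """Learning-rate multiplier: ramps 0 -> 1 during warmup, then decays linearly to 0."""
    warmup_steps = int(total_steps * warmup_frac)
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    # Assumption: linear decay after warmup.
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


def simulate_early_stopping(val_f1_per_epoch, patience=2):
    """Return (best_epoch, best_f1, epochs_run) for a sequence of validation macro F1 scores."""
    best_f1, best_epoch, stale = float("-inf"), None, 0
    epochs_run = 0
    for epoch, f1 in enumerate(val_f1_per_epoch, start=1):
        epochs_run = epoch
        if f1 > best_f1:
            # A real training loop would snapshot the model weights here.
            best_f1, best_epoch, stale = f1, epoch, 0
        else:
            stale += 1
            if stale >= patience:
                # A real training loop would restore the best snapshot before saving.
                break
    return best_epoch, best_f1, epochs_run
```

With patience = 2, scores of 0.60, 0.70, 0.69, 0.68 stop training after epoch 4 and keep the epoch 2 checkpoint, even if a later epoch would have scored higher.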

BERT training output — UNEXPECTED and MISSING weights (normal)

When you run train --model bert for the first time, Hugging Face prints a message like this:

Some weights of BertForSequenceClassification were not initialized from the model checkpoint
at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a downstream task ...

Some weights of the model checkpoint at bert-base-uncased were not used when initializing
BertForSequenceClassification: ['cls.predictions.bias', ...]

This is normal. It is not an error.

Here is what each part means:

MISSING / "newly initialized"

classifier.weight and classifier.bias are the linear layer on top of BERT that maps the [CLS] token representation to your intent labels. bert-base-uncased was pre-trained on general English text for masked-token prediction — it has no classification layer. When the code calls BertForSequenceClassification.from_pretrained(..., num_labels=N), Hugging Face creates this layer fresh and initialises it randomly. It is supposed to be missing from the checkpoint — that is the whole point of fine-tuning. These weights are what get trained during your epochs.

UNEXPECTED / "not used"

bert-base-uncased ships with a masked language model head (cls.predictions.*) that was used during pre-training to predict masked tokens. BertForSequenceClassification has no slot for this head — it only needs the encoder. Those weights are discarded at load time. They are "unexpected" from the classification model's perspective.

The exact line that triggers both messages is in src/models/bert.py:

self.model = BertForSequenceClassification.from_pretrained(
    self.model_name, num_labels=self._num_labels  # bert.py:86-88
)

num_labels is derived dynamically two lines earlier:

self._num_labels = int(len(self.label_encoder.classes_))

The classifier head is always the right size for whatever labels are in your training split — no hardcoding.

When would these messages become a real problem?

The UNEXPECTED/MISSING messages are only a problem if:

  • bert.encoder.* or bert.embeddings.* appear in the MISSING list — that would mean the core encoder failed to load entirely.
  • A saved fine-tuned checkpoint is loaded into a fresh architecture with a different num_labels — PyTorch would raise a shape mismatch error, not just a warning.
  • A Python traceback appears in the lines immediately following these messages.

None of these apply when loading bert-base-uncased into BertForSequenceClassification for the first time.

Summary: training is successful if it proceeds past these messages and completes its epochs (or early-stops on validation F1) without raising a traceback. The UNEXPECTED/MISSING lines appear before epoch 1 and have no effect on anything that follows.
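
The first warning sign above can be checked mechanically. The following sketch takes a list of missing parameter names and returns those that belong to the core model. The function name is hypothetical, and how the list is obtained is left open.

```python
CRITICAL_PREFIXES = ("bert.encoder.", "bert.embeddings.")


def critical_missing_keys(missing_keys):
    """Return the missing keys that belong to the core encoder or embeddings."""
    return [key for key in missing_keys if key.startswith(CRITICAL_PREFIXES)]
```

An empty result for ['classifier.bias', 'classifier.weight'] matches the expected first-run behaviour. One way to get such a list, to the best of my knowledge, is to pass output_loading_info=True to Hugging Face's from_pretrained, which returns loading details alongside the model; the sketch itself does not depend on that.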


Notes

  • The number of output classes is inferred dynamically from the label encoder — no hardcoded class count.
  • Both models return a confidence score alongside the predicted intent; the Streamlit app displays High / Medium / Low confidence labels (≥90% / ≥70% / below 70%).
  • The Streamlit app uses @st.cache_resource so models are loaded once per server process, not on every button click.
  • Retraining is required after adding new intents; the pipeline automatically picks up new labels via LabelEncoder.fit_transform.
  • If Hugging Face rate limits apply, set the HF_TOKEN environment variable.
  • On Windows, symlink warnings from huggingface_hub are cosmetic; enable Developer Mode to suppress them.
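
The confidence handling mentioned in these notes (softmax over logits, then High / Medium / Low at the 90% and 70% thresholds) can be sketched in plain Python. The function names are illustrative, and the project's models produce their scores through PyTorch and scikit-learn rather than this code.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def confidence_label(probability):
    """Map a probability to the High / Medium / Low buckets (>=0.90 / >=0.70 / below)."""
    if probability >= 0.90:
        return "High"
    if probability >= 0.70:
        return "Medium"
    return "Low"


def predict_with_confidence(logits, labels):
    """Return (intent, max probability, bucket) for one example."""
    probabilities = softmax(logits)
    best = max(range(len(labels)), key=probabilities.__getitem__)
    return labels[best], probabilities[best], confidence_label(probabilities[best])
```

Subtracting the largest logit before exponentiating leaves the probabilities unchanged and avoids overflow for large scores.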

About

Intent classification system for a university student support chatbot. Fine-tuned BERT (92.5% accuracy) vs. TF-IDF + SVM baseline (82.1%) across 11 intent categories. Includes Streamlit demo, evaluation outputs, and Docker support.
