A machine learning project that classifies student questions into predefined intent categories. The system supports two model backends for comparison:
- Classic ML: TF-IDF + best classifier selected via stratified cross-validation (scikit-learn)
- BERT: `bert-base-uncased` fine-tuned for sequence classification (Hugging Face + PyTorch)
The classifier recognizes 11 university support intents:
| Intent | Example question |
|---|---|
| `ask_schedule` | What time does my calculus class start? |
| `ask_registration` | How do I register for courses next semester? |
| `ask_tuition` | When is the tuition payment deadline? |
| `ask_deadline` | What is the last day to withdraw from a course? |
| `ask_location` | Where is the registrar's office located? |
| `ask_contact` | How can I contact my academic advisor? |
| `ask_gpa` | What GPA do I need to maintain my scholarship? |
| `ask_courses` | What courses are required for a CS degree? |
| `ask_accommodation` | How do I apply for on-campus housing? |
| `greeting` | Hi, I need some help. |
| `goodbye` | Thanks, that was very helpful. Goodbye. |
- File: `data/intent-detection-train.jsonl`
- Format: one JSON object per line, with fields `text` and `label`
- Size: 333 examples (38 for `ask_contact`, extended with informal/email-style queries; 30 each for 9 original intents; 25 for `ask_accommodation`)
- Language: English
- Balance: near-uniform; number of classes is inferred dynamically at train time
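A minimal sketch of reading this format and counting examples per label, assuming only the `text` and `label` fields described above. The helper name `load_examples` and the two sample records are hypothetical and are not taken from the project code.

```python
import json
from collections import Counter
from typing import Iterable


def load_examples(lines: Iterable[str]) -> list[dict]:
    """Parse JSONL lines into dicts, skipping blank lines."""
    examples = []
    for line_number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        # Both fields are required by the format described above.
        if "text" not in record or "label" not in record:
            raise ValueError(f"line {line_number}: missing 'text' or 'label'")
        examples.append(record)
    return examples


# Hypothetical sample records in the same shape as the dataset file.
sample_lines = [
    '{"text": "Where is the registrar\'s office located?", "label": "ask_location"}',
    '{"text": "Hi, I need some help.", "label": "greeting"}',
    "",
]
examples = load_examples(sample_lines)
label_counts = Counter(example["label"] for example in examples)
print(label_counts)
```

With the real file, the same function could be given an open file handle, for example `open("data/intent-detection-train.jsonl", encoding="utf-8")`, since iterating a file yields its lines.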
Splits are generated deterministically via stratified sampling:
| Split | Default size |
|---|---|
| Train | 70% |
| Validation | 10% |
| Test | 20% |
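The idea behind a deterministic stratified split can be sketched with the standard library alone. This is an illustration under stated assumptions, not the project's implementation (which may rely on scikit-learn utilities): the function name `stratified_split`, the seed value, and the toy data are all hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(examples, val_ratio=0.10, test_ratio=0.20, seed=42):
    """Split examples into train/validation/test, preserving label proportions.

    A fixed seed makes the split reproducible for the same input.
    """
    by_label = defaultdict(list)
    for example in examples:
        by_label[example["label"]].append(example)

    rng = random.Random(seed)
    train, validation, test = [], [], []
    # Visit labels in sorted order so the label sequence is stable.
    for label in sorted(by_label):
        group = by_label[label][:]
        rng.shuffle(group)
        n_test = round(len(group) * test_ratio)
        n_val = round(len(group) * val_ratio)
        test.extend(group[:n_test])
        validation.extend(group[n_test:n_test + n_val])
        train.extend(group[n_test + n_val:])
    return train, validation, test


# Hypothetical toy data: 3 labels with 30 examples each.
toy = [{"text": f"q{i}", "label": f"intent_{i % 3}"} for i in range(90)]
train, validation, test = stratified_split(toy)
print(len(train), len(validation), len(test))
```

Because each label is split separately, every class keeps roughly the same 70/10/20 proportions as the whole dataset, which matters when some classes have only 25 examples.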
university-intent-detection/
├── analysis/
│ ├── scripts/
│ │ ├── common.py # Shared I/O helpers
│ │ ├── eda_report.py # Dataset summary + intent distribution
│ │ ├── robustness_experiments.py # Lexical overlap + baseline CV stats
│ │ ├── generate_stress_test.py # Rule-based perturbed test generation
│ │ └── run_all.py # Runs all analysis scripts in sequence
│ └── outputs/ # Generated analysis artifacts
├── apps/
│ └── app.py # Streamlit demo app
├── config/
│ └── settings.yaml # Central configuration
├── data/
│ ├── intent-detection-train.jsonl # Main dataset (333 examples, 11 intents)
│ ├── intent-detection-test-perturbed.jsonl # Stress-test variants
│ └── splits/ # Deterministic train/val/test splits
├── logs/ # Runtime logs
├── models/ # Saved model artifacts
├── outputs/evaluations/ # Structured evaluation outputs (JSON)
├── src/
│ ├── cli/main.py # Unified CLI (train/predict/evaluate/prepare-data/augment-data)
│ ├── config/ # Config loader and typed settings
│ ├── data/ # Dataset preparation and augmentation
│ ├── models/ # base, classic, bert, registry
│ ├── services/ # Training, evaluation, prediction orchestration
│ └── utils/ # Dataset I/O, logging, run helpers
├── Dockerfile
├── Makefile
└── requirements.txt
python -m venv .venv
.venv\Scripts\python -m pip install --upgrade pip
.venv\Scripts\python -m pip install -r requirements.txt
Or with Makefile:
make install

make prepare-data
Creates `data/splits/train.jsonl`, `validation.jsonl`, `test.jsonl`.

make augment-data
Creates `data/splits/train_expanded.jsonl` with rule-based variants added to balance classes.
make train MODEL=classic
make train MODEL=bert
Or directly:
python -m src.cli.main train --model classic --dataset data/intent-detection-train.jsonl
python -m src.cli.main train --model bert --dataset data/intent-detection-train.jsonl

make predict MODEL=classic TEXT="How do I register for classes next semester?"
make predict MODEL=bert TEXT="How do I register for classes next semester?"
Or directly:
python -m src.cli.main predict --model classic --text "How do I register for classes next semester?"
python -m src.cli.main predict --model bert --text "How do I register for classes next semester?"

make evaluate MODEL=classic
make evaluate MODEL=bert
Structured evaluation artifacts are saved to `outputs/evaluations/<model>/<timestamp>/`:
- `summary.json`: run metadata and split info
- `metrics.json`: accuracy, macro F1, weighted F1, precision, recall
- `classification_report.json`: per-class precision/recall/F1
- `confusion_matrix.json`: full confusion matrix
- `predictions.json`: per-sample true vs predicted labels
- `error_analysis.json`: per-class error breakdown and top confused label pairs

The `evaluate-robustness` command evaluates both models on a perturbed stress-test dataset and logs a side-by-side comparison.
It requires the perturbed dataset to exist (run `python analysis/scripts/generate_stress_test.py` first).
python -m src.cli.main evaluate-robustness
Or with a custom dataset path:
python -m src.cli.main evaluate-robustness --dataset data/intent-detection-test-perturbed.jsonl
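Macro F1, one of the values reported in `metrics.json`, is the unweighted mean of the per-class F1 scores, so every intent counts equally regardless of how many examples it has. A small standard-library sketch of that definition follows; the function name and the sample labels are hypothetical, and the project itself most likely computes this with a library routine.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores over all observed labels."""
    labels = sorted(set(y_true) | set(y_pred))
    if not labels:
        return 0.0
    f1_scores = []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denominator = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); the guard against zero is defensive.
        f1_scores.append(2 * tp / denominator if denominator else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Hypothetical predictions for four questions.
y_true = ["greeting", "greeting", "goodbye", "ask_gpa"]
y_pred = ["greeting", "goodbye", "goodbye", "ask_gpa"]
score = macro_f1(y_true, y_pred)
print(round(score, 4))
```

Here `greeting` and `goodbye` each score 2/3 and `ask_gpa` scores 1, so the macro average is 7/9.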
make app
Or:
.venv\Scripts\python -m streamlit run apps/app.py

make docker-train MODEL=classic
make docker-predict MODEL=classic TEXT="What time is my physics class?"
make docker-evaluate MODEL=classic
make docker-train MODEL=bert
make docker-app

python analysis/scripts/run_all.py
Or individually:
python analysis/scripts/eda_report.py
python analysis/scripts/robustness_experiments.py
python analysis/scripts/generate_stress_test.py
Outputs are written to `analysis/outputs/`.
- Vectorizer: `TfidfVectorizer` with custom tokenizer (whitespace/apostrophe split + contraction expansion)
- Contraction expansion: 36-entry lookup table applied before tokenization (e.g. `"don't"` → `"do not"`)
- Candidates: Random Forest, Logistic Regression, SVM, Naive Bayes, Gradient Boosting, KNN
- Selection: stratified k-fold cross-validation, best macro F1 wins
- SVC fitted with `probability=True` (Platt scaling) to enable per-prediction confidence scores
- Saved to: `models/traditional/`
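The contraction expansion and tokenizer steps listed above might look roughly like the sketch below. The three-entry table stands in for the project's 36-entry table, and the lowercasing step and function names are assumptions made for illustration.

```python
import re

# Hypothetical three-entry table; the project's table has 36 entries.
CONTRACTIONS = {
    "don't": "do not",
    "can't": "cannot",
    "what's": "what is",
}


def expand_contractions(text: str) -> str:
    """Lowercase the text and replace known contractions word by word."""
    words = text.lower().split()
    return " ".join(CONTRACTIONS.get(word, word) for word in words)


def tokenize(text: str) -> list[str]:
    """Expand contractions first, then split on whitespace and apostrophes."""
    expanded = expand_contractions(text)
    return [token for token in re.split(r"[\s']+", expanded) if token]


tokens = tokenize("What's the registrar's number? I don't know")
print(tokens)
```

Expanding before splitting matters: otherwise `don't` would be cut into `don` and `t` and the negation would be lost. A function like this could be supplied through the `tokenizer` parameter of scikit-learn's `TfidfVectorizer`.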
- Tokenizer: `BertTokenizer`
- Model: `BertForSequenceClassification` fine-tuned on training split; `num_labels` inferred dynamically
- Optimizer: AdamW with linear warmup scheduler (warmup = 10% of total training steps)
- Validation loop: macro F1 computed on `data/splits/validation.jsonl` after each epoch
- Early stopping: patience = 2 epochs; best checkpoint (highest val macro F1) is restored before saving
- Confidence scores: softmax over logits, max probability returned alongside predicted intent
- Saved to: `models/bert/`
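Two pieces of the training loop described above, the warmup schedule and patience-based early stopping, can be sketched without any deep learning dependency. The names below are hypothetical, and the linear decay to zero after warmup is an assumption (the list above only states that warmup covers 10% of the steps).

```python
def linear_warmup_factor(step: int, total_steps: int, warmup_fraction: float = 0.10) -> float:
    """Learning-rate multiplier: linear ramp up, then (assumed) linear decay to zero."""
    warmup_steps = round(total_steps * warmup_fraction)
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


class EarlyStopper:
    """Track the best validation score; stop after `patience` epochs without improvement."""

    def __init__(self, patience: int = 2):
        self.patience = patience
        self.best_score = float("-inf")
        self.best_epoch = None
        self.epochs_without_improvement = 0

    def update(self, epoch: int, score: float) -> bool:
        """Record a score and return True when training should stop."""
        if score > self.best_score:
            self.best_score = score
            self.best_epoch = epoch  # a real loop would also snapshot the weights here
            self.epochs_without_improvement = 0
        else:
            self.epochs_without_improvement += 1
        return self.epochs_without_improvement >= self.patience


# Hypothetical validation macro F1 values, one per epoch.
val_scores = [0.71, 0.83, 0.82, 0.80, 0.90]
stopper = EarlyStopper(patience=2)
for epoch, score in enumerate(val_scores, start=1):
    if stopper.update(epoch, score):
        break
print(stopper.best_epoch, stopper.best_score, epoch)
```

In this run training stops at epoch 4, after two epochs without improvement, and the epoch 2 checkpoint would be the one restored. The later score of 0.90 is never reached, which is the usual trade-off of a short patience.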
When you run train --model bert for the first time, Hugging Face prints a message like this:
Some weights of BertForSequenceClassification were not initialized from the model checkpoint
at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a downstream task ...
Some weights of the model checkpoint at bert-base-uncased were not used when initializing
BertForSequenceClassification: ['cls.predictions.bias', ...]
This is normal. It is not an error.
Here is what each part means:
MISSING / "newly initialized"
classifier.weight and classifier.bias are the linear layer on top of BERT that maps the [CLS] token representation to your intent labels. bert-base-uncased was pre-trained on general English text for masked-token prediction — it has no classification layer. When the code calls BertForSequenceClassification.from_pretrained(..., num_labels=N), Hugging Face creates this layer fresh and initialises it randomly. It is supposed to be missing from the checkpoint — that is the whole point of fine-tuning. These weights are what get trained during your epochs.
UNEXPECTED / "not used"
bert-base-uncased ships with a masked language model head (cls.predictions.*) that was used during pre-training to predict masked tokens. BertForSequenceClassification has no slot for this head — it only needs the encoder. Those weights are discarded at load time. They are "unexpected" from the classification model's perspective.
The exact line that triggers both messages is in src/models/bert.py:
self.model = BertForSequenceClassification.from_pretrained(
    self.model_name, num_labels=self._num_labels  # bert.py:86-88
)
num_labels is derived dynamically two lines earlier:
self._num_labels = int(len(self.label_encoder.classes_))
The classifier head is always the right size for whatever labels are in your training split, with no hardcoding.
When would these messages become a real problem?
The UNEXPECTED/MISSING messages are only a problem if:
- `bert.encoder.*` or `bert.embeddings.*` appear in the MISSING list, which would mean the core encoder failed to load.
- A saved fine-tuned checkpoint is loaded into a fresh architecture with a different `num_labels`; PyTorch would raise a shape mismatch error, not just a warning.
- A Python traceback appears in the lines immediately following these messages.
None of these apply when loading bert-base-uncased into BertForSequenceClassification for the first time.
Summary: training is only considered successful if it proceeds past this message and completes its epochs (or early-stops on validation F1) without raising a traceback. The UNEXPECTED/MISSING lines appear before epoch 1 and have no effect on anything that follows.
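The first check above can be automated with a few lines. This is a sketch, not part of the project: the function name `find_problem_keys` is hypothetical, and the second key list is an invented example of what a genuinely broken load could look like.

```python
CORE_PREFIXES = ("bert.encoder.", "bert.embeddings.")


def find_problem_keys(missing_keys):
    """Return the missing keys that belong to the core encoder or embeddings."""
    return [key for key in missing_keys if key.startswith(CORE_PREFIXES)]


# Key names taken from the log message shown above.
benign_missing = ["classifier.bias", "classifier.weight"]
# Hypothetical list that would indicate a real loading failure.
broken_missing = ["classifier.weight", "bert.encoder.layer.0.attention.self.query.weight"]

benign_result = find_problem_keys(benign_missing)
broken_result = find_problem_keys(broken_missing)
print(benign_result)
print(broken_result)
```

An empty result means only the classification head is new, which is the expected situation. In Hugging Face Transformers, passing `output_loading_info=True` to `from_pretrained` should return the model together with a dictionary containing `missing_keys` and `unexpected_keys`, which could feed a check like this; treat that wiring as an assumption to verify against the installed version.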
- The number of output classes is inferred dynamically from the label encoder — no hardcoded class count.
- Both models return a confidence score alongside the predicted intent; the Streamlit app displays High / Medium / Low confidence labels (≥90% / ≥70% / below 70%).
- The Streamlit app uses `@st.cache_resource` so models are loaded once per server process, not on every button click.
- Retraining is required after adding new intents; the pipeline automatically picks up new labels via `LabelEncoder.fit_transform`.
- If Hugging Face rate limits apply, set the `HF_TOKEN` environment variable.
- On Windows, symlink warnings from `huggingface_hub` are cosmetic; enable Developer Mode to suppress them.
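To illustrate how a confidence score and its High / Medium / Low label relate, here is a standard-library sketch of a softmax over logits followed by the threshold mapping given above (≥90%, ≥70%, below 70%). The function names, label subset, and logit values are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def confidence_label(probability: float) -> str:
    """Map a probability to the High / Medium / Low buckets."""
    if probability >= 0.90:
        return "High"
    if probability >= 0.70:
        return "Medium"
    return "Low"


# Hypothetical logits for three intents.
labels = ["ask_schedule", "ask_tuition", "greeting"]
logits = [4.0, 1.0, 0.5]
probabilities = softmax(logits)
best_index = max(range(len(labels)), key=lambda i: probabilities[i])
predicted_intent = labels[best_index]
predicted_bucket = confidence_label(probabilities[best_index])
print(predicted_intent, predicted_bucket)
```

Subtracting the largest logit before exponentiating does not change the result but avoids overflow for large values.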