<a href="https://colab.research.google.com/github/AndreasCaldewei/colab/blob/main/faq_distillation_notebook_fixed.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# FAQ System with Model Distillation

This notebook demonstrates how to build an FAQ system using model distillation. We'll transfer knowledge from a large language model to a smaller, more efficient model that can quickly answer frequently asked questions.

## What is this project?

- **Goal**: Create an FAQ system that can quickly match user questions to the most relevant answers
- **Approach**: Use model distillation to create efficient semantic matching
- **Benefits**: Fast response times, offline operation, and minimal resource requirements

Let's get started!

## Step 1: Install Required Packages

In [None]:
!pip install transformers datasets tqdm joblib torch scikit-learn matplotlib pandas numpy


## Step 2: Import Libraries

In [None]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.neighbors import NearestNeighbors
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.model_selection import train_test_split

import torch
from transformers import AutoModel, AutoTokenizer
from tqdm.notebook import tqdm
import matplotlib.pyplot as plt
import time
import joblib
import re
import json

## Step 3: Check for GPU and Configure Environment

In [None]:
# Check if GPU is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# If using GPU, check which one we have
if device.type == "cuda":
    !nvidia-smi

# Configuration settings
max_length = 128  # Maximum sequence length for the model
batch_size = 8    # Batch size for processing
seed = 42         # Random seed for reproducibility

# Set random seeds for reproducibility
np.random.seed(seed)
torch.manual_seed(seed)

## Step 4: Load the Teacher Model

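We'll use a pre-trained transformer as our teacher model. The default below is `sentence-transformers/all-MiniLM-L6-v2`, a compact model tuned for sentence similarity; standard BERT and DistilBERT are listed as alternatives.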

In [None]:
print("Loading the teacher model...")
model_name = "sentence-transformers/all-MiniLM-L6-v2"  # Optimized for sentence similarity
# Alternative options:
# model_name = "bert-base-uncased"  # Standard BERT
# model_name = "distilbert-base-uncased"  # Smaller, faster model

tokenizer = AutoTokenizer.from_pretrained(model_name)
teacher_model = AutoModel.from_pretrained(model_name)
teacher_model = teacher_model.to(device)
teacher_model.eval()  # Set to evaluation mode
print(f"Loaded {model_name} model")

## Step 5: Create and Prepare FAQ Dataset

Let's define our FAQ dataset. You can replace this with your own custom FAQ data.

In [None]:
# Define a sample FAQ dataset (replace with your own data)
sample_faqs = [
    {
        "question": "What is model distillation?",
        "answer": "Model distillation is a technique for transferring knowledge from a large, complex model (teacher) to a smaller, more efficient model (student) while maintaining as much performance as possible.",
        "category": "Technical"
    },
    {
        "question": "How does model distillation work?",
        "answer": "Model distillation works by training a smaller model to mimic the behavior or outputs of a larger model, often by using the larger model's predictions or embeddings as training signals.",
        "category": "Technical"
    },
    {
        "question": "What are the benefits of using distilled models?",
        "answer": "Distilled models are smaller, faster, and require less computational resources while retaining most of the performance of larger models. They're ideal for deployment in resource-constrained environments.",
        "category": "Benefits"
    },
    {
        "question": "Can I use distilled models on mobile devices?",
        "answer": "Yes, distilled models are well-suited for mobile applications due to their smaller size and faster inference times.",
        "category": "Applications"
    },
    {
        "question": "What's the difference between model distillation and model pruning?",
        "answer": "Model distillation trains a new, smaller model using a larger model's outputs, while pruning reduces the size of an existing model by removing unnecessary parameters.",
        "category": "Technical"
    },
    {
        "question": "How much smaller can a distilled model be?",
        "answer": "Distilled models can be significantly smaller, sometimes 10-50 times smaller than the original teacher model, depending on the architecture and distillation approach.",
        "category": "Technical"
    },
    {
        "question": "Are distilled models as accurate as the original models?",
        "answer": "Distilled models typically retain 90-95% of the performance of the original teacher model, with the exact percentage depending on the task complexity and model architectures.",
        "category": "Performance"
    },
    {
        "question": "What type of tasks can I use distilled models for?",
        "answer": "Distilled models work well for a wide range of tasks including classification, retrieval, question answering, and text embedding generation. They're particularly effective for well-defined tasks with clear outputs.",
        "category": "Applications"
    },
    {
        "question": "How do I choose the right size for my student model?",
        "answer": "The optimal student model size depends on your resource constraints and performance requirements. Start with a model 3-10x smaller than the teacher and evaluate the performance/size tradeoff.",
        "category": "Technical"
    },
    {
        "question": "Can I distill any type of model?",
        "answer": "Most model types can be distilled, including neural networks, transformer models, and decision trees. The distillation process varies by model type but the core principle of knowledge transfer remains the same.",
        "category": "Technical"
    },
    {
        "question": "What are some popular model distillation techniques?",
        "answer": "Popular techniques include soft target distillation (using probability distributions), feature distillation (matching intermediate representations), and data augmentation during distillation to improve generalization.",
        "category": "Technical"
    },
    {
        "question": "How long does the distillation process take?",
        "answer": "Distillation typically takes less time than training the original teacher model from scratch. Depending on the model size and data volume, it can range from hours to days on standard hardware.",
        "category": "Process"
    },
    {
        "question": "What is temperature in knowledge distillation?",
        "answer": "Temperature is a hyperparameter in distillation that controls the softness of probability distributions from the teacher model. Higher temperatures produce softer distributions that better transfer knowledge about relationships between classes.",
        "category": "Technical"
    },
    {
        "question": "Can I create an ensemble of distilled models?",
        "answer": "Yes, ensembles of multiple distilled models can be effective, offering improved performance while still being more efficient than the original large model.",
        "category": "Applications"
    },
    {
        "question": "What's the history of model distillation?",
        "answer": "Model distillation was popularized by Geoffrey Hinton in 2015 with his paper 'Distilling the Knowledge in a Neural Network,' though similar concepts existed earlier under different names.",
        "category": "Background"
    }
]

# Extract questions, answers, and categories from the sample FAQs
questions = [faq["question"] for faq in sample_faqs]
answers = [faq["answer"] for faq in sample_faqs]
categories = [faq["category"] for faq in sample_faqs]

# Create additional variations of each question for better training
question_variations = []
answer_variations = []
category_variations = []

for i, q in enumerate(questions):
    # Original question
    question_variations.append(q)
    answer_variations.append(answers[i])
    category_variations.append(categories[i])

    # Add variations - remove question marks, rephrase slightly
    variations = [
        q.replace("?", ""),
        "Can you tell me " + q.lower(),
        "I'd like to know " + q.lower(),
        "Tell me about " + q.lower().replace("what is ", "").replace("how does ", "").replace("?", "")
    ]

    for var in variations:
        question_variations.append(var)
        answer_variations.append(answers[i])
        category_variations.append(categories[i])

print(f"Original FAQs: {len(questions)}")
print(f"With variations: {len(question_variations)}")

# Display a few examples of the variations
for i in range(3):
    print(f"\nOriginal: {questions[i]}")
    idx = i * 5  # Each original question has 4 variations + original
    for j in range(1, 5):
        print(f"Variation {j}: {question_variations[idx + j]}")
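
If you replace the sample data with your own FAQs, it helps to check the entries before embedding them. The sketch below is one possible way to do that, assuming your FAQs are stored as a JSON array with the same `question`, `answer`, and `category` keys used above. The helper name `load_faqs` and the example entry are hypothetical and not part of the original notebook.

```python
import json

REQUIRED_KEYS = ("question", "answer", "category")

def load_faqs(json_text):
    """Parse a JSON array of FAQ entries and check each has the required keys."""
    faqs = json.loads(json_text)
    if not isinstance(faqs, list):
        raise ValueError("Expected a JSON array of FAQ objects")
    for position, faq in enumerate(faqs):
        if not isinstance(faq, dict):
            raise ValueError(f"FAQ #{position} is not a JSON object")
        # A key counts as missing if it is absent or only whitespace
        missing = [key for key in REQUIRED_KEYS if not str(faq.get(key, "")).strip()]
        if missing:
            raise ValueError(f"FAQ #{position} is missing fields: {missing}")
    return faqs

# Hypothetical example entry, in the same format as sample_faqs
example_json = """
[
  {"question": "How do I reset my password?",
   "answer": "Use the reset link on the sign-in page.",
   "category": "Account"}
]
"""
custom_faqs = load_faqs(example_json)
print(f"Loaded {len(custom_faqs)} FAQ entries")
```

The returned list can then be assigned to `sample_faqs` before the rest of the notebook runs.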

## Step 6: Define Embedding Extraction Function

This function will extract embeddings from our teacher model.

In [None]:
def get_model_embeddings(texts, tokenizer, model, batch_size=8, max_length=128):
    """Extract embeddings from the teacher model

    Args:
        texts (list): List of text strings to embed
        tokenizer: The tokenizer for the model
        model: The model to use for embedding
        batch_size (int): Number of texts to process at once
        max_length (int): Maximum sequence length for tokenization

    Returns:
        numpy.ndarray: Array of embeddings, shape (len(texts), embedding_dim)
    """
    embeddings = []

    # Process in batches
    for i in tqdm(range(0, len(texts), batch_size), desc="Extracting embeddings"):
        batch_texts = texts[i:i+batch_size]

        with torch.no_grad():  # No need to track gradients
            # Tokenize the text, padding only to the longest sequence in the batch
            inputs = tokenizer(batch_texts, return_tensors="pt", padding=True,
                               truncation=True, max_length=max_length)
            inputs = {k: v.to(device) for k, v in inputs.items()}

            # Run the model; last_hidden_state has shape (batch, seq_len, hidden_dim)
            outputs = model(**inputs)
            last_hidden_state = outputs.last_hidden_state

            # Mean-pool over real tokens only, using the attention mask to ignore padding
            mask = inputs["attention_mask"].unsqueeze(-1).to(last_hidden_state.dtype)
            summed = (last_hidden_state * mask).sum(dim=1)
            counts = mask.sum(dim=1).clamp(min=1e-9)
            mean_embeddings = (summed / counts).cpu().numpy()

            for emb in mean_embeddings:
                embeddings.append(emb)

    return np.array(embeddings)

# Function to normalize embeddings (improves performance for cosine similarity)
def normalize_embeddings(embeddings):
    """Normalize embeddings to unit length"""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    return embeddings / norms
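
Mean pooling over a padded batch gives different results depending on whether padding positions are counted. The dependency-free sketch below illustrates the idea with plain Python lists. The function name `masked_mean_pool` is hypothetical, and the padding vectors are set to zero only to keep the arithmetic easy to follow; a real model produces arbitrary values at padded positions.

```python
def masked_mean_pool(token_vectors, attention_mask):
    """Average token vectors, counting only positions where the mask is 1."""
    dim = len(token_vectors[0])
    totals = [0.0] * dim
    count = 0
    for vector, keep in zip(token_vectors, attention_mask):
        if keep:
            count += 1
            for d in range(dim):
                totals[d] += vector[d]
    count = max(count, 1)  # Avoid dividing by zero for an all-padding row
    return [t / count for t in totals]

# Two real tokens followed by two padding positions (illustrative values)
tokens = [[2.0, 4.0], [4.0, 8.0], [0.0, 0.0], [0.0, 0.0]]
mask = [1, 1, 0, 0]

pooled = masked_mean_pool(tokens, mask)
naive = [sum(v[d] for v in tokens) / len(tokens) for d in range(2)]

print(pooled)  # [3.0, 6.0]
print(naive)   # [1.5, 3.0]
```

The naive average shrinks as more padding is added, so the same sentence would get a different embedding depending on how much it was padded.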

## Step 7: Extract Embeddings from the Teacher Model

In [None]:
print("Extracting embeddings for FAQ questions...")
start_time = time.time()
question_embeddings = get_model_embeddings(question_variations, tokenizer, teacher_model, batch_size=batch_size)
extraction_time = time.time() - start_time

print(f"Embedding extraction took {extraction_time:.2f} seconds")
print(f"Embedding shape: {question_embeddings.shape}")

# Normalize embeddings for better similarity matching
question_embeddings = normalize_embeddings(question_embeddings)

# Visualize a few embeddings
plt.figure(figsize=(10, 5))
plt.imshow(question_embeddings[:10, :50], aspect='auto', cmap='viridis')
plt.colorbar()
plt.title('First 10 question embeddings (first 50 dimensions)')
plt.xlabel('Embedding dimension')
plt.ylabel('Question')
plt.show()
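
Normalizing embeddings to unit length matters because the cosine similarity of two unit vectors is simply their dot product. The short sketch below works through this with plain Python; the helper names `l2_normalize` and `dot` are illustrative and not part of the notebook.

```python
import math

def l2_normalize(vector, eps=1e-12):
    """Scale a vector to unit length (eps guards against a zero vector)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / max(norm, eps) for x in vector]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

a = l2_normalize([3.0, 4.0])   # approximately [0.6, 0.8]
b = l2_normalize([4.0, 3.0])   # approximately [0.8, 0.6]

similarity_ab = dot(a, b)          # cosine similarity, about 0.96
distance_ab = 1.0 - similarity_ab  # cosine distance, about 0.04
print(f"similarity={similarity_ab:.2f}, distance={distance_ab:.2f}")
```

The cosine distance computed here is the same quantity that `NearestNeighbors(metric='cosine')` reports.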

## Step 8: Train Student Models

We'll train two lightweight student classifiers, logistic regression and a random forest, on the teacher's embeddings to predict each question's category.

In [None]:
# Prepare data for training
# For simplicity, we'll convert category strings to numerical labels
unique_categories = sorted(set(category_variations))  # Sorted so label IDs are stable across runs
category_to_id = {cat: idx for idx, cat in enumerate(unique_categories)}
category_ids = [category_to_id[cat] for cat in category_variations]

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    question_embeddings,
    category_ids,
    test_size=0.2,
    random_state=seed,
    stratify=category_ids  # Ensure same category distribution in train/test
)

# Initialize student models
print("Training student models...")
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000, random_state=seed),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=seed)
}

results = {}
train_times = {}

# Train each model and measure performance
for name, model in models.items():
    print(f"\nTraining {name}...")
    start_time = time.time()
    model.fit(X_train, y_train)
    train_time = time.time() - start_time

    # Evaluate
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)

    results[name] = accuracy
    train_times[name] = train_time

    print(f"  Accuracy: {accuracy:.4f}, Training time: {train_time:.2f} seconds")
    print("  Classification Report:")
    print(classification_report(y_test, y_pred, target_names=unique_categories))
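
Because each original question contributes five near-identical phrasings, a random split can place phrasings of the same question in both the training and test sets, which tends to inflate accuracy. One possible alternative, sketched below with only the standard library, is to split by original question so that all phrasings of a question stay together. The function `grouped_split` is hypothetical; scikit-learn's `GroupShuffleSplit` provides similar behavior.

```python
import random

def grouped_split(group_ids, test_fraction=0.2, seed=42):
    """Split sample indices so all samples sharing a group id land on the same side."""
    unique_groups = sorted(set(group_ids))
    rng = random.Random(seed)
    rng.shuffle(unique_groups)
    n_test = max(1, round(len(unique_groups) * test_fraction))
    test_groups = set(unique_groups[:n_test])
    train_idx = [i for i, g in enumerate(group_ids) if g not in test_groups]
    test_idx = [i for i, g in enumerate(group_ids) if g in test_groups]
    return train_idx, test_idx

# 15 original questions x 5 phrasings each, mirroring the layout of question_variations
group_ids = [i // 5 for i in range(75)]
train_idx, test_idx = grouped_split(group_ids)

print(f"Train samples: {len(train_idx)}, test samples: {len(test_idx)}")
```

With the sample data, several categories contain only one original question, so a grouped split cannot put those categories on both sides. A larger FAQ set is needed for a meaningful grouped evaluation.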

## Step 9: Visualize Model Performance

In [None]:
plt.figure(figsize=(12, 6))

# Accuracy comparison
plt.subplot(1, 2, 1)
plt.bar(results.keys(), results.values(), color=['#3498db', '#2ecc71'])
plt.title('Model Accuracy', fontsize=14)
plt.ylabel('Accuracy', fontsize=12)
plt.ylim(0, 1)
plt.grid(axis='y', linestyle='--', alpha=0.7)
for i, (key, value) in enumerate(results.items()):
    plt.text(i, value + 0.02, f'{value:.4f}', ha='center', fontsize=11)

# Training time comparison
plt.subplot(1, 2, 2)
plt.bar(train_times.keys(), train_times.values(), color=['#3498db', '#2ecc71'])
plt.title('Training Time', fontsize=14)
plt.ylabel('Time (seconds)', fontsize=12)
plt.grid(axis='y', linestyle='--', alpha=0.7)
for i, (key, value) in enumerate(train_times.items()):
    plt.text(i, value, f'{value:.2f}s', ha='center', va='bottom', fontsize=11)

plt.tight_layout()
plt.show()

# Find the best model
best_model_name = max(results, key=results.get)
print(f"The best performing model is: {best_model_name} with accuracy: {results[best_model_name]:.4f}")

## Step 10: Create a Nearest Neighbors Model for FAQ Matching

We'll use a nearest neighbors approach to find the most similar questions in our database.

In [None]:
# Extract original question embeddings (without variations)
original_questions = list(questions)  # Copy of the original questions
original_answers = list(answers)      # Copy of the original answers

print("Creating embeddings for the original questions...")
original_embeddings = get_model_embeddings(original_questions, tokenizer, teacher_model)
original_embeddings = normalize_embeddings(original_embeddings)

# Create a nearest neighbors model
print("Training nearest neighbors model...")
nn_model = NearestNeighbors(n_neighbors=3, metric='cosine')
nn_model.fit(original_embeddings)

# Function to find the most similar question and return its answer
def get_answer(query, nn_model, tokenizer, teacher_model, questions, answers, threshold=0.2):
    """Get the most relevant answer for a given query

    Args:
        query (str): The user's question
        nn_model: Trained nearest neighbors model
        tokenizer: The tokenizer for the embedding model
        teacher_model: The model to use for embedding
        questions (list): List of original questions
        answers (list): List of original answers
        threshold (float): Maximum distance threshold for a match

    Returns:
        tuple: (matched_question, answer, distance, confidence)
    """
    # Get query embedding
    query_embedding = get_model_embeddings([query], tokenizer, teacher_model)
    query_embedding = normalize_embeddings(query_embedding)

    # Find nearest neighbors
    distances, indices = nn_model.kneighbors(query_embedding)

    # Get closest match
    closest_idx = indices[0][0]
    distance = distances[0][0]

    # Calculate confidence (1 - distance)
    confidence = 1 - distance

    # Check if match is good enough
    if distance > threshold:
        return (None, "I don't have information on that specific question.", distance, confidence)

    return (questions[closest_idx], answers[closest_idx], distance, confidence)
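
The matching logic above comes down to two steps: find the FAQ vector with the smallest cosine distance, then reject the match if that distance exceeds the threshold. The sketch below reproduces those steps with plain Python lists, assuming all vectors are already unit length. The function `best_match` and the two-dimensional vectors are illustrative only.

```python
import math

def best_match(query_vec, faq_vecs, threshold=0.2):
    """Return (index, distance) of the closest unit vector, or (None, distance) if too far."""
    best_idx, best_dist = None, float("inf")
    for idx, vec in enumerate(faq_vecs):
        # For unit vectors, cosine distance is 1 minus the dot product
        dist = 1.0 - sum(q * v for q, v in zip(query_vec, vec))
        if dist < best_dist:
            best_idx, best_dist = idx, dist
    if best_dist > threshold:
        return None, best_dist
    return best_idx, best_dist

faq_vecs = [[1.0, 0.0], [0.0, 1.0]]

# Close to the second FAQ vector: distance is about 0.04, so it matches
close_idx, close_dist = best_match([0.28, 0.96], faq_vecs)

# Halfway between both FAQ vectors: distance is about 0.29, so it is rejected
half = math.sqrt(0.5)
far_idx, far_dist = best_match([half, half], faq_vecs)

print(close_idx, round(close_dist, 2))
print(far_idx, round(far_dist, 2))
```

Raising the threshold accepts looser matches; lowering it makes the system more likely to answer that it has no information.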

## Step 11: Create a Complete FAQ System Class

In [None]:
class FAQSystem:
    """A distilled model-based FAQ system"""

    def __init__(self, teacher_model, tokenizer, nn_model, category_model,
                 questions, answers, categories, category_mapping):
        self.teacher_model = teacher_model
        self.tokenizer = tokenizer
        self.nn_model = nn_model
        self.category_model = category_model
        self.questions = questions
        self.answers = answers
        self.categories = categories
        self.category_mapping = category_mapping
        self.id_to_category = {v: k for k, v in category_mapping.items()}

    def get_answer(self, query, threshold=0.2):
        """Get answer for a user question"""
        return get_answer(query, self.nn_model, self.tokenizer, self.teacher_model,
                          self.questions, self.answers, threshold)

    def predict_category(self, query):
        """Predict the category of a question"""
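        # Get query embedding, normalized the same way as the training embeddings
        query_embedding = get_model_embeddings([query], self.tokenizer, self.teacher_model)
        query_embedding = normalize_embeddings(query_embedding)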

        # Predict category
        category_id = self.category_model.predict(query_embedding)[0]
        category = self.id_to_category[category_id]

        return category

    def get_questions_by_category(self, category):
        """Get all questions in a specific category"""
        result = []
        for q, c in zip(self.questions, self.categories):
            if c == category:
                result.append(q)
        return result

    def save(self, filename="faq_system.joblib"):
        """Save the FAQ system to disk"""
        # Create a dictionary with everything we need to restore the system
        system_data = {
            "questions": self.questions,
            "answers": self.answers,
            "categories": self.categories,
            "category_mapping": self.category_mapping,
            "nn_model": self.nn_model,
            "category_model": self.category_model,
            "model_name": self.tokenizer.name_or_path
        }

        joblib.dump(system_data, filename)
        print(f"FAQ system saved to {filename}")

    @classmethod
    def load(cls, filename="faq_system.joblib"):
        """Load a saved FAQ system"""
        system_data = joblib.load(filename)

        # Load the model and tokenizer
        model_name = system_data["model_name"]
        tokenizer = AutoTokenizer.from_pretrained(model_name)
        teacher_model = AutoModel.from_pretrained(model_name)
        teacher_model = teacher_model.to(device)  # Match the device used by get_model_embeddings
        teacher_model.eval()

        # Create instance
        return cls(
            teacher_model=teacher_model,
            tokenizer=tokenizer,
            nn_model=system_data["nn_model"],
            category_model=system_data["category_model"],
            questions=system_data["questions"],
            answers=system_data["answers"],
            categories=system_data["categories"],
            category_mapping=system_data["category_mapping"]
        )

## Step 12: Create and Save the FAQ System

In [None]:
# Create the full FAQ system
print("Building the complete FAQ system...")

# Use the best performing model for category prediction
best_model = models[best_model_name]

# Create our FAQ system
faq_system = FAQSystem(
    teacher_model=teacher_model,
    tokenizer=tokenizer,
    nn_model=nn_model,
    category_model=best_model,
    questions=original_questions,
    answers=original_answers,
    categories=categories,  # One category per original question
    category_mapping=category_to_id
)

# Save the FAQ system
faq_system.save("distilled_faq_system.joblib")

# Download the saved model (if running in Colab)
try:
    from google.colab import files
    files.download('distilled_faq_system.joblib')
    print("Model downloaded. You can use this file to load the FAQ system later.")
except ImportError:
    print("Not running in Colab. Model saved locally.")

## Step 13: Test the FAQ System with Sample Questions

In [None]:
print("Testing the FAQ system with sample questions...\n")

test_questions = [
    "What exactly is model distillation?",
    "I want to know if distilled models can run on my phone",
    "Are smaller models less accurate?",
    "Tell me about the benefits of model distillation",
    "What's the relationship between model distillation and knowledge transfer?",
    "Why would I want to use a distilled model?"
]

for i, question in enumerate(test_questions):
    print(f"\nQuestion {i+1}: {question}")

    # Get answer
    matched_question, answer, distance, confidence = faq_system.get_answer(question)

    # Get category
    category = faq_system.predict_category(question)

    print(f"Category: {category}")
    print(f"Matched question: {matched_question}")
    print(f"Confidence: {confidence:.2f}")
    print(f"Answer: {answer}")
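
To check the response-time claim on your own hardware, you could time repeated calls to the answer function. The sketch below shows one simple way to do this with the standard library. `measure_latency` is a hypothetical helper, and `fake_answer` is a stand-in so the example runs on its own; in the notebook you would pass `faq_system.get_answer` instead.

```python
import time

def measure_latency(answer_fn, queries, repeats=3):
    """Return the mean wall-clock seconds per call of answer_fn over the queries."""
    start = time.perf_counter()
    for _ in range(repeats):
        for query in queries:
            answer_fn(query)
    elapsed = time.perf_counter() - start
    return elapsed / (repeats * len(queries))

def fake_answer(query):
    """Hypothetical stand-in for faq_system.get_answer."""
    return query.lower()

sample_queries = ["What is model distillation?", "Are smaller models less accurate?"]
mean_seconds = measure_latency(fake_answer, sample_queries)
print(f"Mean latency: {mean_seconds * 1000:.3f} ms per query")
```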

## Step 14: Interactive Demo

Let's create an interactive demo to test our FAQ system.

In [None]:
from IPython.display import display, HTML, clear_output
import ipywidgets as widgets

# Create input widget
question_input = widgets.Text(
    value='',
    placeholder='Type your question here',
    description='Question:',
    layout=widgets.Layout(width='80%')
)

# Create output widget
output = widgets.Output()

# Create button
button = widgets.Button(
    description='Ask',
    button_style='primary',
    tooltip='Ask the FAQ system'
)

# Create distance threshold slider (matches farther than this are rejected)
threshold_slider = widgets.FloatSlider(
    value=0.2,
    min=0.0,
    max=0.5,
    step=0.05,
    description='Threshold:',
    tooltip='Maximum cosine distance for a match',
    layout=widgets.Layout(width='50%')
)

# Function to handle button click
def on_button_clicked(b):
    with output:
        clear_output()
        question = question_input.value
        if not question:
            print("Please enter a question.")
            return

        print(f"Question: {question}")

        # Get answer with current threshold
        matched_question, answer, distance, confidence = faq_system.get_answer(
            question, threshold=threshold_slider.value
        )

        # Get category
        category = faq_system.predict_category(question)

        print(f"\nCategory: {category}")

        if matched_question:
            print(f"\nMatched question: {matched_question}")
            print(f"Confidence: {confidence:.2f}")
            print(f"\nAnswer: {answer}")
        else:
            print(f"\nNo matching question found (confidence: {confidence:.2f})")
            print("Please rephrase your question or ask something related to model distillation.")

# Connect the button to the function
button.on_click(on_button_clicked)

# Display the widgets
display(HTML("<h3>FAQ System Demo</h3>"))
display(HTML("<p>Ask any question about model distillation.</p>"))
display(widgets.HBox([question_input, button]))
display(threshold_slider)
display(output)

## Step 15: How to Use the Saved FAQ System

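Here's how to load and use your saved FAQ system in a production environment. The saved file stores only the data and scikit-learn models, so the `FAQSystem` class, the helper functions (`get_model_embeddings`, `normalize_embeddings`, `get_answer`), and the `device` variable must be defined or importable before loading.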

In [None]:
# This is for reference - how to load and use the saved FAQ system

'''
# Import required libraries
import joblib
from transformers import AutoTokenizer, AutoModel

# Load the saved FAQ system
loaded_faq_system = FAQSystem.load("distilled_faq_system.joblib")

# Use the system to answer questions
question = "What is model distillation?"
matched_question, answer, distance, confidence = loaded_faq_system.get_answer(question)
print(f"Question: {question}")
print(f"Answer: {answer}")
print(f"Confidence: {confidence:.2f}")
'''

## How to Extend This FAQ System

To extend this FAQ system with your own data:

1. Create a list of your own FAQs in the format used in this notebook
2. Replace the `sample_faqs` variable with your own data
3. Re-run the notebook to train the models on your data
4. Save the resulting FAQ system for deployment

You can also fine-tune various parameters like:
- The confidence threshold for matching questions
- The teacher model used for embedding
- The student models used for classification

For larger FAQ datasets, consider adding pagination to the results and implementing more sophisticated matching algorithms.
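
One way to tune the matching threshold is to collect a small set of validation queries, record the distance of each top match and whether it was correct, and then compare candidate thresholds. The sketch below assumes such a list is available. The helper `sweep_threshold` is hypothetical, and the values in `labeled` are made up purely for illustration, not measured results.

```python
def sweep_threshold(labeled_distances, thresholds):
    """For each threshold, report coverage (share answered) and precision (share correct)."""
    rows = []
    for t in thresholds:
        # A query is answered when its top-match distance is within the threshold
        answered = [ok for dist, ok in labeled_distances if dist <= t]
        coverage = len(answered) / len(labeled_distances)
        precision = sum(answered) / len(answered) if answered else None
        rows.append((t, coverage, precision))
    return rows

# Made-up validation data: (cosine distance of top match, was the match correct?)
labeled = [(0.05, True), (0.12, True), (0.18, True),
           (0.25, False), (0.31, True), (0.45, False)]

sweep_rows = sweep_threshold(labeled, [0.1, 0.2, 0.3, 0.5])
for t, coverage, precision in sweep_rows:
    precision_text = "n/a" if precision is None else f"{precision:.2f}"
    print(f"threshold={t:.1f}  coverage={coverage:.2f}  precision={precision_text}")
```

A lower threshold usually answers fewer queries but with fewer wrong matches, so the right value depends on how costly a wrong answer is for your use case.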

## Conclusion

In this notebook, we've built a complete FAQ system using model distillation. The system uses a pre-trained transformer as a teacher to generate sentence embeddings, then trains smaller, more efficient scikit-learn models on those embeddings as students for category prediction and question matching. The teacher is still used at query time to embed each incoming question.

The result is a powerful FAQ system that can:
- Match user questions to the most relevant answers
- Categorize questions by topic
- Run efficiently with minimal resources
- Be easily extended with your own FAQ data

This approach demonstrates the power of model distillation for creating practical, deployable AI systems by transferring knowledge from large models to smaller, more efficient ones.