# TODO — Applied NLP Mini-Project: Intelligent FAQ Assistant

In this final exercise, you will combine the skills learned in the previous tasks:

- From **Exercise 1**: using pre-trained Transformers for text classification.  
- From **Exercise 2**: building a semantic search engine using embeddings.

Your goal is to create a **simple Intelligent FAQ Assistant**:
1. Use a Transformer model to **classify** a user query into a topic (e.g., “product”, “payment”, “technical”).  
2. Use **semantic search** to find the most relevant FAQ entry within that topic.  
3. Return the best-matched answer to the user.

This brings together everything learned in Weeks 3 & 4 — modern Transformer-based NLP applications.


### 1) Environment Setup

Install the Hugging Face and evaluation libraries we need.

In [1]:

!pip install -q sentence-transformers datasets scikit-learn

from sentence_transformers import SentenceTransformer, util
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
import torch, numpy as np, pandas as pd


### 2) Create a Small FAQ Dataset

For simplicity, we’ll simulate a small FAQ base with three topics:
- Product
- Payment
- Technical

Each entry has a *question* and an *answer*.

You can replace this dataset with your own domain data later.


In [2]:
data = {
    "topic": ["product","product","product",
              "payment","payment","payment",
              "technical","technical","technical"],
    "question": [
        "What features does the new phone have?",
        "Is the laptop waterproof?",
        "Do you sell smart watches?",
        "How can I get a refund?",
        "Is there a discount for students?",
        "Can I change my payment method?",
        "How do I reset my password?",
        "Why is my app crashing?",
        "Does the software work on Linux?"
    ],
    "answer": [
        "The new phone includes a dual camera and 5G support.",
        "The laptop is not waterproof; keep it away from water.",
        "Yes, we offer several smart watch models online.",
        "You can request a refund from your order history page.",
        "Yes, students can apply a 10% discount at checkout.",
        "You can update payment methods from your account settings.",
        "Go to Settings → Security → Reset Password.",
        "Try reinstalling the app and clearing cached data.",
        "Yes, our software is compatible with Linux systems."
    ]
}

faq_df = pd.DataFrame(data)
faq_df


| | topic | question | answer |
|---|---|---|---|
| 0 | product | What features does the new phone have? | The new phone includes a dual camera and 5G su... |
| 1 | product | Is the laptop waterproof? | The laptop is not waterproof; keep it away fro... |
| 2 | product | Do you sell smart watches? | Yes, we offer several smart watch models online. |
| 3 | payment | How can I get a refund? | You can request a refund from your order histo... |
| 4 | payment | Is there a discount for students? | Yes, students can apply a 10% discount at chec... |
| 5 | payment | Can I change my payment method? | You can update payment methods from your accou... |
| 6 | technical | How do I reset my password? | Go to Settings → Security → Reset Password. |
| 7 | technical | Why is my app crashing? | Try reinstalling the app and clearing cached d... |
| 8 | technical | Does the software work on Linux? | Yes, our software is compatible with Linux sys... |


### 3) Train a Simple Topic Classifier

We will use sentence embeddings from a pre-trained model (`all-MiniLM-L6-v2`)
and train a **lightweight logistic regression classifier** to predict the FAQ topic.

This keeps training fast while still relying on Transformer-based representations.


In [3]:
# Load model for sentence embeddings
# Hint: Use SentenceTransformer with the pre-trained model "all-MiniLM-L6-v2"
model = SentenceTransformer("all-MiniLM-L6-v2")

# Encode questions into dense vectors
# Hint: Convert all FAQ questions into numerical embeddings using model.encode()
# It could be argued that retrieval may work better when embedding the answers, or even questions + answers
# (cf. Wen-Ting Tseng, Tien-Hong Lo, Yung-Chang Hsu, Berlin Chen (2020). Effective FAQ Retrieval and Question Matching With Unsupervised Knowledge Injection. arXiv).
# I ask for NumPy arrays as output because scikit-learn functions such as train_test_split expect array-like inputs.
# As I don't have massive amounts of data, I can run this on CPU.
X = model.encode(faq_df["question"].tolist(), convert_to_numpy=True)


# Hint: Convert text topic labels (e.g., 'product', 'payment', 'technical') into numeric codes
# I don't embed the topics, because they are the target labels. Technically I could leave them as strings,
# as many classifiers handle them internally, but apparently it is better to keep them numeric
# (cf. Julian, D., Raschka, S. & Hearty, J. (2016):
# Python: Deeper Insights into Machine Learning.
# Packt Publishing. Chapter 4, "Building Good Training Sets – Data Preprocessing").
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y = le.fit_transform(faq_df["topic"])


# Split for validation
# Hint: Use train_test_split to divide data into training (70%) and testing (30%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train simple classifier
# Hint: Create a LogisticRegression model and fit it using the training embeddings and labels. try max iteration 500.
clf = LogisticRegression(max_iter=500)
clf.fit(X_train, y_train)

# Evaluate
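# Hint: Predict topics on test data and print a classification report to see accuracy and F1-score
# LabelEncoder numbers the classes in sorted order, so the names must come from le.classes_, not from unique().
# Passing labels explicitly keeps the report valid even if the small test split misses a class.
y_pred = clf.predict(X_test)
print(classification_report(y_test, y_pred, labels=np.arange(len(le.classes_)),
                            target_names=le.classes_, zero_division=0))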

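# The results are very good. It seems the queries are positioned in the embedding vector space in such a way
# that hyperplanes can be defined that separate them according to their topic labels.
# With only 3 test samples (30% of 9 entries), however, the report is weak evidence of how well this generalizes.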
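
A note on label order in the evaluation step: `LabelEncoder` numbers classes in sorted order (exposed as `le.classes_`), whereas `unique()` returns values in order of first appearance. Class names passed to `classification_report` should therefore follow the encoder's order. The sketch below mimics both orderings in plain Python, so it runs without scikit-learn or pandas; the variable names are illustrative only.

```python
# Topic column as it appears in the FAQ dataset.
topics = ["product", "product", "product",
          "payment", "payment", "payment",
          "technical", "technical", "technical"]

# Order of first appearance, which is what pandas' unique() returns.
appearance_order = list(dict.fromkeys(topics))

# Sorted order, which is how LabelEncoder assigns the codes 0, 1, 2.
sorted_order = sorted(set(topics))

print(appearance_order)  # ['product', 'payment', 'technical']
print(sorted_order)      # ['payment', 'product', 'technical']
```

Because the two lists differ, using the appearance order as `target_names` would label the "payment" row of the report as "product" and vice versa.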



### 4) Build Semantic Search for Each Topic

Now we create a dictionary of embeddings per topic so that once the
classifier predicts a topic, we can perform **semantic similarity search**
within that topic only.
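
To make the similarity step concrete, here is a minimal sketch of cosine similarity search restricted to one topic. It uses plain Python and hypothetical 3-dimensional vectors in place of real embeddings (`all-MiniLM-L6-v2` produces 384-dimensional vectors); the names `topic_vectors`, `query_vec`, and `predicted_topic` are assumptions made for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy "embeddings", grouped by topic.
topic_vectors = {
    "payment": [[0.9, 0.1, 0.0], [0.7, 0.3, 0.1]],
    "technical": [[0.0, 0.2, 0.9], [0.1, 0.1, 0.8]],
}

query_vec = [0.8, 0.2, 0.0]
predicted_topic = "payment"  # would come from the classifier

# Compare the query only against the vectors of the predicted topic.
scores = [cosine_similarity(query_vec, v) for v in topic_vectors[predicted_topic]]
best_idx = max(range(len(scores)), key=scores.__getitem__)
print(best_idx)  # 0
```

The index returned here is relative to the topic's own list, which is why the per-topic DataFrame and the per-topic embeddings need to be kept in the same row order.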


In [None]:
# Hint: Group the FAQ DataFrame by topic to create a separate subset for each category
topic_groups = {topic: subdf for topic, subdf in faq_df.groupby("topic")}
# We now have a dictionary with the topics as keys, and a smaller df with the faqs for the topic.

# Hint: Create an empty dictionary to store embeddings for each topic
topic_embeddings = {}

# Hint: Loop through each topic group and encode its questions into embeddings
for topic, subdf in topic_groups.items():
    # Use the same model to create embeddings for all questions under this topic
    topic_embeddings[topic] = model.encode(subdf["question"].tolist(), convert_to_numpy=True)

# So now topic_embeddings is a dictionary with topics as keys and NumPy arrays as values; each array holds
# the embeddings of the FAQ questions for the respective topic, in the same row order as topic_groups[topic].



### 5) Query the Assistant

When a user enters a question:
1. Classify it to find the most likely topic.  
2. Perform semantic similarity search within that topic’s FAQs.  
3. Return the most relevant answer.


In [None]:
def faq_assistant(query):
    # Step 1: Encode query
    # Hint: Convert the user’s input text into an embedding using the same model
    # As the query is a single string, I wrap it in a list so that encode() returns a 2-D array, which the classifier expects
    q_emb = model.encode([query], convert_to_numpy=True)

    # Step 2: Predict topic
    # Hint: Use the trained classifier to predict which topic the query belongs to
    topic_id = clf.predict(q_emb)[0]

    # Hint: Convert the numeric topic ID back to the original topic name
    # As I used LabelEncoder to transform the topics into ids, I need to reverse the operation.
    # Again, as this is a single predicted value, I need to wrap topic_id in a list,
    # because inverse_transform expects an array-like.
    topic_name = le.inverse_transform([topic_id])[0]
    print(f"\n\nPredicted Topic: {topic_name.title()}\n")
    print(f"Asked question: {query}")

    # Step 3: Semantic search within topic
    # Hint: Select all FAQs related to the predicted topic
    topic_df = topic_groups[topic_name]

    # Hint: Retrieve precomputed embeddings for that topic’s questions
    topic_vecs = topic_embeddings[topic_name]

    # Hint: Convert the query embedding to a tensor and move it to the same device as topic embeddings
    # As I used NumPy rather than Tensors for my embeddings, I don't need the device method.
    # This will always just run on CPU.
    q_tensor = torch.tensor(q_emb, dtype=torch.float32)

    # Hint: Compute cosine similarity between the query and all topic question embeddings
    cos_scores = util.cos_sim(q_tensor, topic_vecs)[0]

    # Hint: Find the index of the most similar question
    best_idx = torch.argmax(cos_scores).item()

    # Hint: Print the matched FAQ question and its corresponding answer
    print(f"Matched Question: {topic_df.iloc[best_idx]['question']}")
    print(f"Answer: {topic_df.iloc[best_idx]['answer']}")


### 6) Test the System

Try different queries and see if the assistant finds the right FAQ.
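
Queries that fall outside the FAQ base still receive an answer, because taking the argmax always returns the nearest entry, however distant. One possible guard, sketched below under stated assumptions, is to reject matches whose best cosine score is below a threshold. The function name `answer_or_fallback`, the fallback text, and the value `min_score=0.5` are hypothetical; a suitable threshold would need to be tuned on real queries.

```python
def answer_or_fallback(scores, answers, min_score=0.5):
    """Return the best-scoring answer, or a fallback message if the match is weak.

    scores  -- cosine similarities between the query and each FAQ question
    answers -- FAQ answers in the same order as scores
    """
    best_idx = max(range(len(scores)), key=scores.__getitem__)
    if scores[best_idx] < min_score:
        return "Sorry, I could not find a matching FAQ entry."
    return answers[best_idx]

answers = ["Answer A", "Answer B", "Answer C"]

# Hypothetical scores for a well-covered query and for an out-of-scope query.
print(answer_or_fallback([0.12, 0.81, 0.30], answers))  # Answer B
print(answer_or_fallback([0.21, 0.33, 0.18], answers))  # fallback message
```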


In [None]:


faq_assistant("How can I pay using PayPal?")
faq_assistant("My phone is broken, what warranty do I have?")
faq_assistant("App keeps freezing when I open it.")

faq_assistant("When was the latest iPad model released?")
faq_assistant("Will I get a refund?")
faq_assistant("How do I find out about the new iOS features?")




Predicted Topic: Payment

Asked question: How can I pay using PayPal?
Matched Question: Can I change my payment method?
Answer: You can update payment methods from your account settings.


Predicted Topic: Product

Asked question: My phone is broken, what warranty do I have?
Matched Question: What features does the new phone have?
Answer: The new phone includes a dual camera and 5G support.


Predicted Topic: Technical

Asked question: App keeps freezing when I open it.
Matched Question: Why is my app crashing?
Answer: Try reinstalling the app and clearing cached data.


Predicted Topic: Product

Asked question: When was the latest iPad model released?
Matched Question: What features does the new phone have?
Answer: The new phone includes a dual camera and 5G support.


Predicted Topic: Payment

Asked question: Will I get a refund?
Matched Question: How can I get a refund?
Answer: You can request a refund from your order history page.


Predicted Topic: Product

Asked question: How d

### ✅ Expected Outcome
Your system should:
- Correctly identify the FAQ topic.  
- Retrieve and display the most relevant answer.  

This simple pipeline demonstrates how Transformer-based representations enable modern NLP applications that *understand meaning* rather than relying on keywords.


**Report**


We are doing a two-step retrieval:

First, we identify a topic with a simple linear classifier that was trained on the question embeddings. (It would be interesting to understand how including the answers in the training would change the accuracy of the predictions.)

Then we do a search within the topic to find the actual FAQ entry that comes closest to the question.

This is an alternative to the approach from the exercise, where we ran a similarity search directly (with FAISS) and then applied cross-encoder reranking.

I couldn't find concrete references on why one would prefer one over the other.

However, predicting the topic first means that fewer similarity calculations need to be done, because the query is compared only against the FAQs of one topic. This should reduce the search time per query.
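
A rough back-of-the-envelope sketch of that saving, counting vector dot products per query. It assumes evenly sized topics and that the linear classifier costs one dot product per topic; the cost of embedding the query is ignored because both approaches share it. The function name and the second example size are made up for illustration.

```python
def dot_products(n_faqs, n_topics):
    """Return (flat, routed) dot-product counts for one query.

    flat   -- compare the query against every FAQ entry
    routed -- score each topic with the classifier, then search one topic
    Assumes n_faqs is evenly divisible by n_topics.
    """
    flat = n_faqs
    routed = n_topics + n_faqs // n_topics
    return flat, routed

print(dot_products(9, 3))        # (9, 6)
print(dot_products(10_000, 20))  # (10000, 520)
```

For the nine-entry dataset in this notebook the difference is negligible; the saving only becomes relevant for much larger FAQ bases, and an approximate index would change the picture again.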

There is a risk, though, when the accuracy of the linear model is low. In case of a wrong prediction, the similarity search has no chance of finding the correct question. Therefore, this approach only makes sense when the topics are well separated.
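
One way to soften this risk, sketched here as an assumption rather than something tested in this notebook, is to search all topics whenever the classifier is not confident. In the real pipeline the probabilities could come from `clf.predict_proba`; below they are hard-coded, and the toy 2-dimensional vectors, the function name `search`, and the threshold `min_confidence=0.6` are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def search(query_vec, topic_probs, topic_vectors, min_confidence=0.6):
    """Return (topic, index, score) of the best match.

    Searches only the most probable topic if the classifier is confident,
    otherwise falls back to searching every topic.
    """
    top_topic = max(topic_probs, key=topic_probs.get)
    if topic_probs[top_topic] >= min_confidence:
        candidates = [top_topic]
    else:
        candidates = list(topic_vectors)

    best = None
    for topic in candidates:
        for idx, vec in enumerate(topic_vectors[topic]):
            score = cosine_similarity(query_vec, vec)
            if best is None or score > best[2]:
                best = (topic, idx, score)
    return best

# Toy setup: the query is clearly closest to the "payment" vector.
topic_vectors = {"product": [[1.0, 0.0]], "payment": [[0.0, 1.0]]}
query_vec = [0.1, 0.9]

# Confident but wrong classifier: the search is stuck in the wrong topic.
confident = search(query_vec, {"product": 0.9, "payment": 0.1}, topic_vectors)
# Uncertain classifier: the fallback searches all topics and recovers.
uncertain = search(query_vec, {"product": 0.55, "payment": 0.45}, topic_vectors)

print(confident[0])  # product
print(uncertain[0])  # payment
```

The fallback does not help when the classifier is confidently wrong, as the first call shows, so well-separated topics remain a precondition.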

In the exercise we learned that FAISS is also extremely powerful. If the split into topics is not done well, FAISS + cross-encoder reranking may be preferable.

I am quite intrigued, though, by how good linear classifiers seem to be at topic detection. I am wondering whether that also works for content categories, like technical vs. medical vs. marketing vs. legal, or for different stylistic nuances (formal vs. informal). Another question I have is how to proceed if "a document" is a whole article rather than just a question. I will try that out.