# **[INSERT YOUR TOPIC] Q&A Chatbot**
**Name**: [INSERT YOUR NAME]<br>
**Organization**: Northwestern AI Club - Fall 2025<br>

**Project Description**: For this project, you will build a chatbot that can answer questions about a specific topic of your choice (a sports team, favorite anime, etc.) by retrieving relevant information from the web and generating natural-language answers using a local LLM. This project is meant to serve as a very light introduction to some useful resources like Hugging Face and key concepts like retrieval-augmented generation (RAG), embeddings, cosine similarity, and LLM parameters like temperature. Unlike using a pre-packaged API, this project lets you work directly with models, giving you hands-on experience with prompt design, stopping conditions, and model behavior. By the end of this project, you should have a solid starting point to build more complex, topic-specific applications.

> **Note**: This project gives a hands-on, practical taste of production LLM workflows (but MUCH less complex). The next project will be more focused on the architecture behind these models.

# **Setting Up Required Packages**

While Colab already comes with a bunch of useful packages installed, there are still a few packages that we have to manually install.<br>

In [None]:
!pip install -q -U ddgs
!pip install -q -U newspaper3k
!pip install -q -U lxml_html_clean

Now we can import all the libraries that we'll need for the project.

In [None]:
import re
import torch
import requests
from ddgs import DDGS
from bs4 import BeautifulSoup
from newspaper import Article
from google.colab import userdata
from huggingface_hub import login
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
from transformers import AutoModelForCausalLM, AutoTokenizer, StoppingCriteria, StoppingCriteriaList

# **Search Engine Setup**

For this project, we'll be using DuckDuckGo as our search engine. DuckDuckGo's API lacks many of the quality-of-life features that services like Google typically have, such as richer semantic understanding, advanced ranking algorithms, and personalized results. This means that we'll have to do a little more than pass in our query and take the first few results.<br><br>

For this part, you'll be tasked with refining your queries to produce more consistent results and identifying trusted domains that provide reliable information for your specific topic. Below are functions that have been provided for you to use. If you need any additional help understanding how they work, you can use resources like ChatGPT or ask during our next club meeting.

> **Note**: If you want to add features to improve the search phase, please feel free! The code is quite modular, so you shouldn’t have to change much to customize it. You can personalize this project as much as you want :)

In [None]:
def find_sources(question, context_prefix, clarifiers, trusted_domains, num_sources_limit, display=False):
  """
  Purpose:
    Given a question, find the most relevant sources to answer it.

  Inputs:
    * Question: The question that the user inputs.
    * Context_Prefix: A prefix to put at the beginning of the query.
    * Clarifiers: A dictionary of terms and their expansion.
    * Trusted_Domains: A list of trusted domains to prioritize.
    * Num_Sources_Limit: The maximum number of URLs to return.

  Output:
    * Best_Sources: A list of URLs to the most relevant webpages to answer the question.
  """
  # Create the query using our context prefix & clarifiers
  query = prepare_query(question, context_prefix, clarifiers)

  # Get an initial list of results
  source_urls = search_web(query, display)

  # Filter the results to prioritize trusted domains
  best_sources = filter_results(source_urls, trusted_domains, num_sources_limit)

  # Show the results
  if display:
    print(f"Final Results for \"{query}\":")
    for source in best_sources:
      print(f"\t{source}")

  return best_sources

def prepare_query(question, context_prefix, clarifiers):
  """
  Purpose:
    Uses the question, context prefix, and clarifiers to create a query. Since DuckDuckGo's
    API primarily uses pattern matching to retrieve results, this function adds a context
    prefix and expands any clarifiers in the question to be more specific.

  Inputs:
    * Question: The question that the user inputs.
    * Context_Prefix: A prefix to put at the beginning of the query.
    * Clarifiers: A dictionary of terms and their expansion.

  Output:
    * Query: The query that will be used to search the web.
  """
  for term, expansion in clarifiers.items():

    # If the expansion is already present, don't expand the term
    if expansion.lower() in question.lower():
      continue

    # Otherwise, replace the term in the question with the expansion
    pattern = r'\b' + re.escape(term) + r'\b'
    question = re.sub(pattern, expansion, question, flags=re.IGNORECASE)

  # Add the context prefix
  query = f"{context_prefix} {question}"

  return query

def search_web(query, display=False):
  """
  Purpose:
    Takes in the query and uses DuckDuckGo to search for results. Since DuckDuckGo's
    API can be inconsistent, this function does 3 rounds of searches and aggregates
    the top 7 results from each round, skipping duplicates.

  Inputs:
    * Query: The query that will be used to search the web.

  Output:
    * Source_Urls: A list of URLs for the aggregated results (top 7 from each round).
  """
  source_urls = []

  # Use DuckDuckGo as the search client
  with DDGS() as search_client:

    # Search 3 times and aggregate results
    for round_num in range(3):  # avoid shadowing the built-in round()
      if display: print("Search Results For Round", round_num+1)
      results = search_client.text(query)

      # Add the top 7 results from each search
      for result in results[:7]:
        source_url = result["href"]
        if display: print(f"\t{source_url}")

        # If it is a duplicate, don't add it
        if source_url not in source_urls:
          source_urls.append(source_url)

      if display: print("")

  return source_urls

def filter_results(source_urls, trusted_domains, num_sources_limit):
  """
  Purpose:
    Filters the results to prioritize trusted domains. This function reduces
    noisy and irrelevant results that DuckDuckGo's API can produce.

  Inputs:
    * Source_Urls: A list of URLs for the aggregated results.
    * Trusted_Domains: A list of trusted domains to prioritize.
    * Num_Sources_Limit: The maximum number of URLs to return.

  Output:
    * Best_Sources: A list of URLs to the most relevant webpages to answer the question.
  """
  best_sources = []

  # Add results that are from a trusted domain
  for source in source_urls:

    for domain in trusted_domains:
      if domain in source:
        best_sources.append(source)
        break

    if len(best_sources) == num_sources_limit:
      break

  # If we end up with fewer than the limit, fill the remaining slots with other sources
  if len(best_sources) < num_sources_limit:
    fallback_sources = [source for source in source_urls if source not in best_sources]
    best_sources += fallback_sources[:num_sources_limit - len(best_sources)]

  return best_sources
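
One thing to keep in mind about `filter_results`: the check `domain in source` is a plain substring test, so a trusted domain can also match when it merely appears inside another hostname or inside the URL's query string. The sketch below is one possible stricter check that compares hostnames instead. The helper name `is_trusted` and the example URLs are made up for illustration and are not part of the provided code.

```python
from urllib.parse import urlparse

def is_trusted(url, trusted_domains):
  """Hypothetical helper: True if the URL's hostname is a trusted domain or one of its subdomains."""
  hostname = (urlparse(url).hostname or "").lower()
  for domain in trusted_domains:
    domain = domain.lower()
    # Accept an exact match ("espn.com") or a subdomain ("www.espn.com")
    if hostname == domain or hostname.endswith("." + domain):
      return True
  return False

# Made-up URLs: only the first one is actually hosted on espn.com
urls = [
  "https://www.espn.com/nba/story",
  "https://example.org/search?q=espn.com",
  "https://notespn.com/article",
]
trusted = [url for url in urls if is_trusted(url, ["espn.com"])]
print(trusted)
```

With the substring test, all three URLs above would count as trusted. If you adopt something like this, list bare domains such as `espn.com` in `trusted_domains` rather than full URLs.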

As stated earlier, we're going to address the limitations of DuckDuckGo's API in 2 different ways. First, we're going to refine the query in order to get more consistent results. Second, we're going to filter the results to prioritize trusted domains. By having these 2 systems in place, we should be able to reduce irrelevant or noisy results that can reduce the quality of retrieved information or confuse the LLM.<br><br>

Here is a breakdown of what you need to add:<br>

> `Context Prefix`: DuckDuckGo's API primarily uses pattern matching to get results. By putting a context prefix at the beginning of the query, we can push the search engine toward more relevant results. For example, if your topic is about Northwestern, you can have something like [Northwestern University] as your context prefix.

> `Clarifiers`: Pattern matching can be problematic if a term is ambiguous. For example, “Apple” could refer to the fruit or the company. To help clarify, you can expand ambiguous terms like "Apple" in your query to "Apple fruit" to be more specific.

> `Trusted Domains`: Even though refining the query can produce more consistent results, DuckDuckGo can still produce irrelevant results. To make sure we get the best information, we can prioritize trusted domains and then look at other sources.
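
To make the context prefix and clarifiers concrete, here is a small standalone sketch that restates the logic of `prepare_query` so it can run on its own. The function name `expand_query`, the prefix "Nutrition", and the example questions are assumptions chosen for illustration.

```python
import re

def expand_query(question, context_prefix, clarifiers):
  """Restates the prepare_query logic: expand ambiguous terms, then add the prefix."""
  for term, expansion in clarifiers.items():
    # Skip the term if the question already contains its expansion
    if expansion.lower() in question.lower():
      continue
    # Replace whole-word, case-insensitive matches of the term
    pattern = r"\b" + re.escape(term) + r"\b"
    question = re.sub(pattern, expansion, question, flags=re.IGNORECASE)
  return f"{context_prefix} {question}"

clarifiers = {"Apple": "Apple fruit"}

# The ambiguous term gets expanded
first = expand_query("Is an apple healthy?", "Nutrition", clarifiers)

# The expansion is already present, so nothing is replaced
second = expand_query("Is an apple fruit healthy?", "Nutrition", clarifiers)

print(first)
print(second)
```

The "already present" check matters when the expansion contains the term itself, as with "Apple" and "Apple fruit": without it, the second question would become "apple fruit fruit".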


In [None]:
question = ""

context_prefix = ""

clarifiers = {"TERM": "EXPANSION"}

trusted_domains = ["DOMAIN"]

num_sources_limit = _

source_urls = find_sources(question, context_prefix, clarifiers, trusted_domains, num_sources_limit, display=True)

# **Scraping & Formatting Data**

Now that we have a list of links to relevant webpages, we need to extract the content from those webpages to feed into our LLM. Scraping webpages can be quite annoying because every website structures its pages differently. To help with this, the code below tries 2 different methods for each URL: Newspaper3k first, then Requests with BeautifulSoup as a fallback.

> **Note**: Scraping isn’t always perfect. Some websites (like Reddit) make scraping intentionally difficult and prefer that you use their official API. Don’t worry if some pages don’t work; the goal here is to get enough usable content to pass along to the LLM. You’re also welcome to modify the scraping code or try other libraries if you’re curious.

In [None]:
def scrape_webpages(source_urls, display=False):
  """
  Purpose:
    Takes in the URLs for the best webpages and attempts to scrape the content
    from those webpages to feed into the LLM.

  Inputs:
    * Source_URLs: A list of URLs for the best webpages.

  Output:
    * Webpages: A list of the scraped content from the webpages.
  """
  webpages = []

  for url in source_urls:

    # First try using Newspaper3k to scrape the webpage
    try:
      webpage_text = scrape_webpage_newspaper3k(url)
      webpages.append(webpage_text)

      if display: print(f"Successfully scraped {url} using newspaper3k.\n")

    # If that fails, try using Requests and BeautifulSoup
    except Exception:
      if display: print(f"Newspaper3k failed for {url}, falling back to requests.\n")
      webpage_text = scrape_webpage_requests(url)

      if webpage_text:
        if display: print(f"\tSuccessfully scraped {url} using requests.\n")
        webpages.append(webpage_text)

      else:
        if display: print(f"\tFailed to scrape {url} using both methods.\n")

  return webpages

def scrape_webpage_newspaper3k(url):
  """
  Purpose:
    Uses Newspaper3k to scrape the content from the webpage.
  """
  webpage = Article(url)
  webpage.download()
  webpage.parse()

  return webpage.text

def scrape_webpage_requests(url):
  """
  Purpose:
    Uses Requests and BeautifulSoup to scrape the content from the webpage.
  """
  headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}

  try:
    # Make an HTTP GET request to the given URL
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()

    # Identify paragraphs by looking for <p> tags
    soup = BeautifulSoup(response.text, "html.parser")
    paragraphs = soup.find_all("p")

    # Join paragraphs together into one string separated by 2 newline characters
    article_text = "\n\n".join(p.get_text() for p in paragraphs)

    return article_text

  # Treat any request or parsing error as a failed scrape
  except Exception:
    return None

For this section, you aren't required to code anything unless you want to modify the code. I would recommend skimming through some of the functions (especially `scrape_webpage_requests`) to see how the code scrapes content in case you need to troubleshoot later.<br>

> For now, run the code below to scrape the webpages we found during the search phase. If you notice that many pages are failing, you may need to tweak the scraping code or adjust your list of trusted domains.

In [None]:
webpages = scrape_webpages(source_urls, display=True)

Now we're going to load an embedding model to transform text into numerical vectors.
> Make sure that your Hugging Face access token is labeled as `HF_TOKEN` in your notebook secrets.

In [None]:
# Load a sentence embedding model
embedding_model = SentenceTransformer('all-MiniLM-L6-v2')

This is the provided code for extracting the most relevant information from the webpages. This is done in 2 steps:

> **Separate By Paragraphs**: First, we'll separate the content from each webpage by paragraph.

> **Rank By Cosine Similarity**: Then, we'll calculate the cosine similarity between the question and each paragraph to determine which paragraphs are most relevant.
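
Cosine similarity measures the angle between two vectors: $\cos(q, p) = \frac{q \cdot p}{\lVert q \rVert \, \lVert p \rVert}$. It is close to 1 when the vectors point in the same direction and close to 0 when they are unrelated. The sketch below computes it by hand on tiny made-up 3-dimensional vectors to show the ranking idea. The real pipeline uses scikit-learn's `cosine_similarity` on embeddings from `all-MiniLM-L6-v2`, which have 384 dimensions.

```python
import math

def cosine(u, v):
  """Cosine similarity between two equal-length, non-zero vectors."""
  dot = sum(a * b for a, b in zip(u, v))
  norm_u = math.sqrt(sum(a * a for a in u))
  norm_v = math.sqrt(sum(b * b for b in v))
  return dot / (norm_u * norm_v)

# Toy "embeddings" invented for illustration only
question_vec = [1.0, 0.0, 1.0]
paragraph_vecs = {
  "on-topic paragraph": [2.0, 0.0, 2.0],        # same direction as the question
  "partly related paragraph": [1.0, 1.0, 0.0],  # shares one component
  "unrelated paragraph": [0.0, 3.0, 0.0],       # orthogonal to the question
}

# Rank paragraph names from most to least similar to the question
ranked = sorted(
  paragraph_vecs,
  key=lambda name: cosine(question_vec, paragraph_vecs[name]),
  reverse=True,
)
print(ranked)
```

Note that the on-topic vector is twice as long as the question vector but still scores highest: cosine similarity ignores length and only compares direction.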

In [None]:
def extract_best_paragraphs(question, webpages, max_num_paragraphs, display=False):
  """
  Purpose:
    Given the question and the content on the webpages, extract the most relevant
    paragraphs to answer the question.

  Inputs:
    * Question: The question that the user inputs.
    * Webpages: A list of the scraped content from the webpages.
    * Max_Num_Paragraphs: The maximum number of paragraphs to return.

  Output:
    * Best_Paragraphs: A list of the most relevant paragraphs to answer the question.
  """
  # Separate all the paragraphs in the webpages
  paragraphs = split_webpages_into_paragraphs(webpages)

  # Retrieve the best paragraphs
  best_paragraphs = filter_paragraphs(question, paragraphs, max_num_paragraphs)

  # Print results
  if display:
    print("Query:", question, "\n")
    for paragraph in best_paragraphs:
      print(f"   {paragraph}\n")

  return best_paragraphs

def split_webpages_into_paragraphs(webpages):
  """
  Purpose:
    Splits the content from each webpage by paragraph. We'll determine where
    each paragraph starts by looking for blank lines (two newline characters in a row).

  Inputs:
    * Webpages: A list of the scraped content from the webpages.

  Output:
    * Paragraphs: A list of all the paragraphs from the webpages provided.
  """
  paragraphs = []

  for webpage in webpages:

    # Split by double newlines
    raw_paragraphs = webpage.split("\n\n")

    # Strip surrounding whitespace and skip paragraphs that end up empty
    for p in raw_paragraphs:
      cleaned = p.strip()
      if cleaned:
        paragraphs.append(cleaned)

  return paragraphs

def filter_paragraphs(question, paragraphs, max_num_paragraphs):
  """
  Purpose:
    Calculates the cosine similarity between the question and each paragraph
    to determine which paragraphs are most relevant.

  Inputs:
    * Question: The question that the user inputs.
    * Paragraphs: A list of all the paragraphs from the webpages provided.
    * Max_Num_Paragraphs: The maximum number of paragraphs to return.

  Output:
    * Best_Paragraphs: A list of the most relevant paragraphs to answer the question.
  """
  # Embed the question and each paragraph
  question_embedding = embedding_model.encode([question])
  paragraph_embeddings = embedding_model.encode(paragraphs)

  # Compute cosine similarity between question and all paragraphs
  similarities = cosine_similarity(question_embedding, paragraph_embeddings)[0]

  # Pair paragraphs with similarity scores and sort in descending order
  scored_paragraphs = list(zip(paragraphs, similarities))
  scored_paragraphs.sort(key=lambda x: x[1], reverse=True)

  # Select the best paragraphs
  top_scores = scored_paragraphs[:max_num_paragraphs]
  best_paragraphs = [paragraph for paragraph, score in top_scores]

  return best_paragraphs

Now let's see what the content we've extracted looks like.

In [None]:
best_paragraphs = extract_best_paragraphs(question, webpages, _, display=True)

# **Configuring Our LLM**

First, let's load the model through Hugging Face. The default model that we'll be using is Qwen2.5-1.5B-Instruct because it has been instruction-tuned for conversation tasks. You can also explore Hugging Face and use another model of your choice; just keep a few things in mind:

> **1.** If you're using a model with a lot of parameters (more than 3 billion), consider using a bitsandbytes configuration to load the weights in 4-bit instead of 32-bit (only a few lines of code). This greatly reduces memory usage, though it does not necessarily make generation faster.

> **2.** Not all models on Hugging Face are instruction-tuned, so performance may be worse if the model isn’t specifically trained for human dialogue. This can make a HUGE difference, so be sure to research the model.
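
The comment in the loading cell refers to a `bnb_config` that is not defined anywhere in this notebook. Below is a minimal sketch of what such a configuration could look like, assuming the `bitsandbytes` package is installed (`!pip install -q -U bitsandbytes`) and the runtime has a CUDA GPU. The model name and the specific settings are example choices, not recommendations from this project.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Example 4-bit settings (assumed values; adjust for your model and hardware)
bnb_config = BitsAndBytesConfig(
  load_in_4bit=True,                     # store the weights in 4-bit
  bnb_4bit_quant_type="nf4",             # NormalFloat4 quantization
  bnb_4bit_compute_dtype=torch.float16,  # run the matrix multiplications in fp16
)

# Example of a larger model; swap in whichever model you picked
model_name = "Qwen/Qwen2.5-7B-Instruct"
model = AutoModelForCausalLM.from_pretrained(
  model_name,
  device_map="auto",
  quantization_config=bnb_config,
)
```

For the default 1.5B model this step is not needed.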

In [None]:
# Define the name of the model we want
model_name = "Qwen/Qwen2.5-1.5B-Instruct"

# Load the model weights (if you use a bigger model, add quantization_config=bnb_config)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

# Get the specific tokenizer for this model
tokenizer = AutoTokenizer.from_pretrained(model_name)

Now we need to feed the question and the most relevant content from the webpages into the model. For the `tokenize_prompt` function, design the format of the prompt to feed into the model. The rest of the code in the function will tokenize the prompt for you.
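
If you're unsure where to start with the prompt, the sketch below shows one possible layout: an instruction, the retrieved context, the question, and an open "Answer:" line for the model to complete. The helper name `build_prompt`, the wording of the instruction, and the sample paragraphs are all made up for illustration. Treat it as a starting point to experiment with, not the expected answer.

```python
def build_prompt(question, paragraphs):
  """Hypothetical helper: one possible layout for a retrieval-augmented prompt."""
  context = "\n\n".join(paragraphs)
  return (
    "Answer the question using only the context below. "
    "If the context does not contain the answer, say you don't know.\n\n"
    f"Context:\n{context}\n\n"
    f"Question: {question}\n"
    "Answer:"
  )

# Made-up inputs, just to show the resulting shape of the prompt
example_prompt = build_prompt(
  "Who founded the club?",
  ["The club was founded in 2019.", "It meets weekly."],
)
print(example_prompt)
```

Telling the model to rely only on the context and to admit when it doesn't know is a common way to try to reduce hallucinated answers. Instruction-tuned models may also respond better when the prompt is wrapped with the tokenizer's chat template (`tokenizer.apply_chat_template`), which is worth trying if plain-text prompts give poor results.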

In [None]:
def tokenize_prompt(question, paragraphs, tokenizer, max_input_tokens):
  """
  Purpose:
    Given the question and the most relevant paragraphs, create and
    tokenize the prompt to feed into the model.

  Inputs:
    * Question: The question that the user inputs.
    * Paragraphs: A list of the most relevant paragraphs to answer the question.
    * Tokenizer: The tokenizer for the model.
    * Max_Input_Tokens: The maximum number of tokens allowed in the prompt.

  Output:
    * Prompt: The prompt to feed into the model.
    * Tokenized_Prompt: The tokenized prompt
  """
  # Create the prompt to feed into the model
  combined_paragraphs = "\n\n".join(paragraphs)
  prompt = (f"DESIGN YOUR PROMPT HERE")

  # Tokenize the prompt
  tokenized_prompt = tokenizer(prompt, return_tensors="pt")
  token_count = tokenized_prompt['input_ids'].shape[-1]

  # Make sure the prompt isn't too long
  if token_count > max_input_tokens:
    raise ValueError(f"Prompt is too long ({token_count} tokens); the maximum allowed is {max_input_tokens}. Try decreasing the number of paragraphs or increasing the max input tokens.")

  return prompt, tokenized_prompt

When working with LLMs, it’s important to remember that we don’t just give them instructions and walk away. We also need to observe their behavior and add rules/guardrails for them to follow. Without moderation, the model may continue generating endlessly, start rambling, or hallucinate extra information.

> While the `max_new_tokens` argument will prevent the model from generating infinitely, the model might ramble about random topics or constantly repeat itself until it hits that limit. By implementing custom stopping conditions, we can stop the model early to prevent that kind of behavior.
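
To illustrate the idea without loading a model, here is a toy sketch of a stopping check that works on plain text. It stops on a newline or when the same word repeats several times in a row. The function `should_stop`, the `max_repeats` threshold, and the simulated output are assumptions for illustration. In the real pipeline, a check like this would run on the decoded text inside the stopping criteria class.

```python
def should_stop(generated_text, max_repeats=3):
  """Hypothetical check: stop on a newline or when the last word repeats max_repeats times."""
  if "\n" in generated_text:
    return True
  words = generated_text.split()
  # The last max_repeats words are all identical
  if len(words) >= max_repeats and len(set(words[-max_repeats:])) == 1:
    return True
  return False

# Simulate a model that emits one piece of text at a time and starts repeating itself
pieces = ["The", " answer", " is", " yes", " yes", " yes", " yes"]
text = ""
for piece in pieces:
  text += piece
  if should_stop(text):
    break

print(repr(text))
```

The loop ends one piece early: the check fires as soon as the third "yes" arrives, so the fourth is never appended.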

In [None]:
def generate_response(tokenized_prompt, model, tokenizer, max_new_tokens, temperature):
  """
  Purpose:
    Given the tokenized prompt, generate a response from the model.

  Inputs:
    * Tokenized_Prompt: The tokenized prompt.
    * Model: The model to use.
    * Tokenizer: The tokenizer for the model.
    * Max_New_Tokens: The maximum number of tokens for the model to generate.
    * Temperature: The temperature to use for sampling.

  Output:
    * Generated_Text: The generated text from the model.
  """
  # Set up inputs by moving tensors onto the model's device (the GPU, if one is available)
  inputs = {key: tensor_value.to(model.device) for key, tensor_value in tokenized_prompt.items()}
  prompt_length = inputs['input_ids'].shape[-1]

  # Attach custom stopping criteria
  stopping_criteria = StoppingCriteriaList([Custom_Stop_Conditions(tokenizer, prompt_length)])

  outputs = model.generate(**inputs,
                           temperature=temperature,
                           max_new_tokens=max_new_tokens,
                           do_sample=True,
                           pad_token_id=tokenizer.eos_token_id,
                           eos_token_id=tokenizer.eos_token_id,
                           stopping_criteria=stopping_criteria)

  generated_ids = outputs[0][prompt_length:]
  generated_text = tokenizer.decode(generated_ids, skip_special_tokens=True)

  return generated_text

class Custom_Stop_Conditions(StoppingCriteria):
  """
  Purpose:
    Implements custom stopping logic for text generation. During generation,
    this class is called repeatedly after each token is generated. It only looks
    at tokens generated after the prompt and checks if the stopping conditions
    have been met. If so, it signals the model to stop early.

  Inputs (via __init__):
    * Tokenizer: The tokenizer for the model (to decode the generated tokens).
    * Prompt_Length: Number of tokens in the original prompt (to separate prompt vs. generated tokens).
  """
  def __init__(self, tokenizer, prompt_length):
    super().__init__()
    self.tokenizer = tokenizer
    self.prompt_length = prompt_length

  def __call__(self, input_ids, scores, **kwargs):
    # Decode only the newly generated tokens
    generated_ids = input_ids[0][self.prompt_length:]
    generated_text = self.tokenizer.decode(generated_ids, skip_special_tokens=False)

    # Stop if <|endoftext|> token actually appears in the decoded text
    if "<|endoftext|>" in generated_text:
      return True

    # Stop if EOS token appears
    last_token = input_ids[0, -1].item()
    if last_token == self.tokenizer.eos_token_id:
      return True

    # Stop if a newline character appears in the generated text
    if "\n" in generated_text:
      return True

    # Otherwise continue generating
    return False

# **Final Product**

In [None]:
def final_product(question, context_prefix, clarifiers, trusted_domains, num_sources_limit, max_num_paragraphs, model, tokenizer, max_input_tokens, max_new_tokens, temperature):

  # Find the most relevant sources
  source_urls = find_sources(question, context_prefix, clarifiers, trusted_domains, num_sources_limit)

  # Scrape the webpages and extract the most relevant paragraphs
  webpages = scrape_webpages(source_urls)
  best_paragraphs = extract_best_paragraphs(question, webpages, max_num_paragraphs)

  # Tokenize the prompt and feed it into the model
  prompt, tokenized_prompt = tokenize_prompt(question, best_paragraphs, tokenizer, max_input_tokens)
  generated_text = generate_response(tokenized_prompt, model, tokenizer, max_new_tokens, temperature)

  return prompt, generated_text

In [None]:
question = ""

context_prefix = ""

clarifiers = {"TERM": "EXPANSION"}

trusted_domains = ["DOMAIN"]

num_sources_limit = _

max_num_paragraphs = _

max_input_tokens = _

max_new_tokens = _

temperature = _

prompt, generated_text = final_product(question, context_prefix, clarifiers, trusted_domains, num_sources_limit, max_num_paragraphs, model, tokenizer, max_input_tokens, max_new_tokens, temperature)

print(f"{prompt}\n\n{generated_text}")
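
When choosing a value for `temperature`, it may help to see what it does to the model's next-token probabilities. The sketch below applies a temperature-scaled softmax to three made-up scores (logits). It is a simplified illustration of the idea, not the exact code path inside `model.generate`.

```python
import math

def softmax_with_temperature(logits, temperature):
  """Turn raw scores into probabilities after dividing by the temperature."""
  scaled = [x / temperature for x in logits]
  highest = max(scaled)  # subtract the max for numerical stability
  exps = [math.exp(x - highest) for x in scaled]
  total = sum(exps)
  return [e / total for e in exps]

# Made-up scores for three candidate tokens
logits = [2.0, 1.0, 0.0]

for t in (0.2, 1.0, 2.0):
  probs = softmax_with_temperature(logits, t)
  print(f"temperature={t}:", [round(p, 3) for p in probs])
```

Lower temperatures concentrate probability on the top-scoring token (more predictable answers), while higher temperatures flatten the distribution (more varied, but riskier, answers). Because `generate_response` samples with `do_sample=True`, the temperature must be greater than 0.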