Exercise 1: Data Loading And Preparation

In this exercise, we will set up the environment and prepare the dataset that we will use throughout this project. Proper data preparation ensures smooth downstream processes, such as generating embeddings, working with vector databases, or building machine learning models. Let’s walk through each step together.



Why This Step Matters:

Before diving into advanced techniques, it’s crucial to:

Ensure all required libraries are installed.
Load and inspect the data to understand its structure.
Prepare a manageable subset for quicker iterations during development.
These steps help us avoid technical issues and ensure our analysis or models are built on a solid foundation.



Instructions:

1. Install Required Libraries

The project requires specialized libraries for vector search and database management:

Run the following in a notebook cell. The leading ! sends each command to the shell:

In [1]:
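!pip install -q faiss-cpu==1.7.4
!pip install -q chromadb==0.3.21
# -U upgrades chromadb past the 0.3.21 pin above
!pip install -qU chromadb
# Quote the specifier so the shell does not treat "<" as input redirection
!pip install -q "numpy<2"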

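Because this tutorial pins some packages and later upgrades them, it can help to confirm which versions actually ended up in the environment. The following is a minimal sketch using only the standard library; the package list is copied from the install commands, and the helper name installed_versions is illustrative rather than part of the original notebook.

```python
from importlib import metadata

# Package names as they appear on PyPI (taken from the install commands).
PACKAGES = ["faiss-cpu", "chromadb", "numpy"]


def installed_versions(packages):
    """Return {package: version string, or None if it is not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions(PACKAGES).items():
    print(f"{name}: {version or 'not installed'}")
```

If numpy reports a 2.x version here, the quoted "numpy<2" constraint did not take effect and the install cell may need to be rerun.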

Create a cache directory to keep the workspace organized and to hold intermediate data or downloaded files. The same cell installs the OpenMP development package (libomp-dev) and upgrades faiss-cpu:

In [2]:
!mkdir -p cache

!apt install libomp-dev
!python -m pip install --upgrade faiss-cpu


Then import the essential libraries:

In [3]:
import numpy as np
import pandas as pd
import faiss
import json
from sentence_transformers import SentenceTransformer, InputExample
import chromadb
from chromadb.config import Settings
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline

2. Load the Dataset

We’ll be working with a dataset called labelled_newscatcher_dataset.csv, which contains labeled news articles. These articles will later be processed into embeddings for vector storage and search.

Task: Load the dataset into a pandas DataFrame:

In [6]:
path = "/content/labelled_newscatcher_dataset.csv"
pdf = pd.read_csv(path, sep=';')
print(pdf.head())

     topic                                               link          domain  \
0  SCIENCE  https://www.eurekalert.org/pub_releases/2020-0...  eurekalert.org   
1  SCIENCE  https://www.pulse.ng/news/world/an-irresistibl...        pulse.ng   
2  SCIENCE  https://www.express.co.uk/news/science/1322607...   express.co.uk   
3  SCIENCE  https://www.ndtv.com/world-news/glaciers-could...        ndtv.com   
4  SCIENCE  https://www.thesun.ie/tech/5742187/perseid-met...       thesun.ie   

        published_date                                              title lang  
0  2020-08-06 13:59:45  A closer look at water-splitting's solar fuel ...   en  
1  2020-08-12 15:14:19  An irresistible scent makes locusts swarm, stu...   en  
2  2020-08-13 21:01:00  Artificial intelligence warning: AI will know ...   en  
3  2020-08-03 22:18:26   Glaciers Could Have Sculpted Mars Valleys: Study   en  
4  2020-08-12 19:54:36  Perseid meteor shower 2020: What time and how ...   en  
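The sep=';' argument matters because pandas assumes a comma by default. The sketch below illustrates the difference on a tiny in-memory sample; the rows and example.org links are made up for illustration and only mimic the semicolon layout of the real file.

```python
import io

import pandas as pd

# Tiny in-memory stand-in for the real file (made-up rows, same ';' layout).
sample = (
    "topic;link;title\n"
    "SCIENCE;https://example.org/a;First title\n"
    "HEALTH;https://example.org/b;Second title\n"
)

wrong = pd.read_csv(io.StringIO(sample))           # default sep=','
right = pd.read_csv(io.StringIO(sample), sep=";")  # matches the file layout

print(wrong.shape)          # every row collapses into a single column
print(right.shape)          # one column per field
print(list(right.columns))
```

A quick look at the shape or the column names right after loading is usually enough to catch a wrong delimiter.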


3. Add an Identifier Column (if needed)

Unique identifiers help us track each record, especially when we work with vector databases:

In [27]:
pdf["id"] = pdf.index.astype(str)

4. Inspect the Data

Use the following command to get a quick overview of the dataset:

In [28]:
display(pdf)

Unnamed: 0,topic,link,domain,published_date,title,lang,id
0,SCIENCE,https://www.eurekalert.org/pub_releases/2020-0...,eurekalert.org,2020-08-06 13:59:45,A closer look at water-splitting's solar fuel ...,en,0
1,SCIENCE,https://www.pulse.ng/news/world/an-irresistibl...,pulse.ng,2020-08-12 15:14:19,"An irresistible scent makes locusts swarm, stu...",en,1
2,SCIENCE,https://www.express.co.uk/news/science/1322607...,express.co.uk,2020-08-13 21:01:00,Artificial intelligence warning: AI will know ...,en,2
3,SCIENCE,https://www.ndtv.com/world-news/glaciers-could...,ndtv.com,2020-08-03 22:18:26,Glaciers Could Have Sculpted Mars Valleys: Study,en,3
4,SCIENCE,https://www.thesun.ie/tech/5742187/perseid-met...,thesun.ie,2020-08-12 19:54:36,Perseid meteor shower 2020: What time and how ...,en,4
...,...,...,...,...,...,...,...
108769,NATION,https://www.vanguardngr.com/2020/08/pdp-govern...,vanguardngr.com,2020-08-08 02:40:00,PDP governors’ forum urges security agencies t...,en,108769
108770,BUSINESS,https://www.patentlyapple.com/patently-apple/2...,patentlyapple.com,2020-08-08 01:27:12,"In Q2-20, Apple Dominated the Premium Smartpho...",en,108770
108771,HEALTH,https://www.belfastlive.co.uk/news/health/coro...,belfastlive.co.uk,2020-08-12 17:01:00,Coronavirus Northern Ireland: Full breakdown s...,en,108771
108772,ENTERTAINMENT,https://www.thenews.com.pk/latest/696364-paul-...,thenews.com.pk,2020-08-05 04:59:00,Paul McCartney details post-Beatles distress a...,en,108772


5. Create a Subset for Faster Processing

Working with large datasets can be time-consuming. To enable faster iterations during development:

Task: Select a smaller subset of the DataFrame (e.g., the first 1000 rows).
This approach lets you test your code efficiently before scaling up to the entire dataset.

In [29]:
pdf_subset = pdf.head(1000)
print(pdf_subset.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 7 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   topic           1000 non-null   object
 1   link            1000 non-null   object
 2   domain          1000 non-null   object
 3   published_date  1000 non-null   object
 4   title           1000 non-null   object
 5   lang            1000 non-null   object
 6   id              1000 non-null   object
dtypes: object(7)
memory usage: 54.8+ KB
None
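Note that head(1000) keeps the file's original order, so the subset reflects whatever ordering the file has. The sketch below shows how one might compare it with a random sample; it uses a toy frame that is deliberately ordered by topic, which is an assumption for illustration and may not match the real file.

```python
import pandas as pd

# Hypothetical frame ordered by topic (an assumption made for illustration).
df = pd.DataFrame({
    "topic": ["SCIENCE"] * 6 + ["HEALTH"] * 6 + ["SPORTS"] * 6,
    "title": [f"title {i}" for i in range(18)],
})

first_rows = df.head(6)                        # what the exercise does
random_rows = df.sample(n=6, random_state=0)   # one possible alternative

# Compare how many rows each topic contributes to the two subsets.
print(first_rows["topic"].value_counts().to_dict())
print(random_rows["topic"].value_counts().to_dict())
```

On the real data, pdf_subset["topic"].value_counts() gives the same kind of check. If one topic dominates, sampling with a fixed random_state is a reproducible alternative to head.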


Exercise 2: Vectorization With Sentence Transformers

In this exercise, we will transform our textual data (news titles) into numerical representations known as embeddings. This step is crucial for enabling machines to understand and work with text data in tasks like similarity search, clustering, and machine learning. We will use Sentence Transformers, a popular library for generating dense vector representations of text.



Why This Step Matters:

Machines cannot directly process raw text—they need numerical input. Embeddings are dense vectors that capture the meaning and context of text. By generating embeddings for our news titles, we make them usable for downstream tasks such as similarity searches or feeding into machine learning models.



Instructions:

1. Install and Import Sentence Transformers Library

The sentence_transformers library provides easy-to-use methods for generating sentence-level embeddings.

In [11]:
from sentence_transformers import InputExample

2. Prepare the Data for Embedding Generation

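We will apply a helper function to the DataFrame subset created earlier. The function wraps each row in an InputExample object, the container sentence_transformers uses for labeled training examples. Generating embeddings with model.encode (steps 6 and 7) only needs the raw title strings, so the InputExample list is optional for the rest of this project.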

Task: Extract the subset of the DataFrame (e.g., pdf_subset) for which you want to generate embeddings.

In [None]:
# Already done: pdf_subset was created in Exercise 1.

3. Create a Helper Function

This function converts each record (news title) into an InputExample, the format sentence_transformers uses for labeled examples:

In [30]:
if "topic" in pdf.columns:
    unique_topics = pdf["topic"].unique()
    topic_to_label = {topic: i for i, topic in enumerate(unique_topics)}
    print("Topic to label mapping:", topic_to_label)
else:
    topic_to_label = {}

from sentence_transformers import InputExample

def example_create_fn(row: pd.Series) -> InputExample:
    """
    Convert a DataFrame row into an InputExample for sentence_transformers.
    Uses 'title' as the text and 'topic' (mapped to number) as the label.
    If 'topic' missing, label defaults to 0.
    """
    text = row["title"]
    if "topic" in row and row["topic"] in topic_to_label:
        label = float(topic_to_label[row["topic"]])
    else:
        label = 0.0  # default label if no topic

    return InputExample(texts=[text], label=label)

Topic to label mapping: {'SCIENCE': 0, 'TECHNOLOGY': 1, 'HEALTH': 2, 'WORLD': 3, 'ENTERTAINMENT': 4, 'SPORTS': 5, 'BUSINESS': 6, 'NATION': 7}


4. Apply the Helper Function to the Subset

We’ll apply this function across the subset DataFrame to generate a list of InputExample objects:

In [31]:
faiss_train_examples = [example_create_fn(row) for _, row in pdf_subset.iterrows()]
print(faiss_train_examples[0])


<InputExample> label: 0.0, texts: A closer look at water-splitting's solar fuel potential


5. Initialize the Embedding Model

We will use the pre-trained model all-MiniLM-L6-v2, which provides high-quality embeddings for a wide range of natural language processing (NLP) tasks.

In [18]:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")


6. Extract the Titles and Convert to a List of Strings

Extract the “title” column from your DataFrame subset and convert it into a list. This is the raw text data we’ll be embedding.

Task: Convert the titles into a list of strings.

In [32]:
# Collect the titles as a plain list of strings
titles_list = pdf_subset["title"].tolist()
print(titles_list[:5])  # Preview the first few titles rather than all 1,000



7. Generate Embeddings for the Titles

Using the initialized model, generate embeddings for each title:

In [33]:
faiss_title_embedding = model.encode(
    titles_list,
    convert_to_numpy=True,        # returns NumPy array (easier to use with FAISS)
    show_progress_bar=True        # shows a loading bar during processing
)


8. Check Embedding Dimensions

To verify the embeddings were generated correctly, check the shape of the output:

In [22]:
faiss_title_embedding.shape

(1000, 384)

Exercise 3: FAISS Indexing And Search

In this exercise, we will use FAISS (Facebook AI Similarity Search) to build an index of the embeddings generated in the previous exercise. This allows us to perform fast and efficient similarity searches over large collections of vectors. The goal is to make it possible to retrieve the most relevant news articles based on a user’s query.



Why This Step Matters:

FAISS is a library designed to perform similarity search at scale. When working with embeddings (which are high-dimensional vectors), searching through them efficiently becomes challenging. FAISS provides optimized algorithms for indexing and searching, making it possible to retrieve similar items in milliseconds, even from large datasets.



Instructions:

1. Install and Import FAISS Library

If you haven’t already, ensure FAISS is installed and import the necessary modules:

In [23]:
import numpy as np
import faiss

2. Prepare the Data for Indexing

Use the embedding vectors generated from the previous exercise and prepare them for indexing:

In [34]:
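pdf_to_index = pdf_subset
# FAISS expects 64-bit integer IDs; the "id" column holds strings, so convert explicitly
id_index = pdf_to_index["id"].astype("int64").to_numpy()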

3. Normalize the Embedding Vectors

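To perform cosine similarity search (which measures the angle between vectors rather than their distance), we first normalize the embedding vectors to unit length. The inner product of two unit vectors equals their cosine similarity, which is why an inner-product index is used in the next step: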

In [35]:
# Make a copy so you can keep the original if needed
content_encoded_normalized = faiss_title_embedding.copy()

# Normalize in-place (each row becomes a unit vector)
faiss.normalize_L2(content_encoded_normalized)
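To see why normalization turns an inner-product search into a cosine-similarity search, recall that cos(a, b) = a·b / (|a| |b|), so for unit vectors the denominator is 1. The following NumPy sketch checks this on random stand-in vectors; it does not use FAISS, and the row-wise division is assumed to mirror what faiss.normalize_L2 does in place.

```python
import numpy as np

rng = np.random.default_rng(0)
vectors = rng.normal(size=(4, 8)).astype("float32")  # stand-in embeddings

# Row-wise L2 normalization: every row is scaled to unit length.
unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)

# Cosine similarity computed from the raw vectors...
a, b = vectors[0], vectors[1]
cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# ...agrees with the plain inner product of the normalized vectors.
inner = unit[0] @ unit[1]
print(np.isclose(cosine, inner, atol=1e-5))
```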

4. Create the FAISS Index

FAISS provides different types of indexes depending on the similarity measure and search requirements. We will use an IndexFlatIP (Inner Product) wrapped in an IndexIDMap:

In [36]:
index_content = faiss.IndexIDMap(faiss.IndexFlatIP(len(faiss_title_embedding[0])))
index_content.add_with_ids(content_encoded_normalized, id_index)
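Conceptually, a flat inner-product index compares the query against every stored vector and keeps the k highest scores. The NumPy sketch below imitates that behaviour, including the ID mapping, on three toy vectors. The function name flat_ip_search and the toy data are illustrative, and this is not how FAISS is implemented internally.

```python
import numpy as np


def flat_ip_search(index_vectors, ids, query_vectors, k):
    """Exhaustive inner-product search; returns (scores, ids) shaped like D, I."""
    scores = query_vectors @ index_vectors.T       # (n_queries, n_vectors)
    top = np.argsort(-scores, axis=1)[:, :k]       # positions of the k best scores
    D = np.take_along_axis(scores, top, axis=1)    # the scores themselves
    I = ids[top]                                   # translate positions to custom IDs
    return D, I


index_vectors = np.array([[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]], dtype="float32")
ids = np.array([10, 20, 30], dtype="int64")
query = np.array([[1.0, 0.0]], dtype="float32")

D, I = flat_ip_search(index_vectors, ids, query, k=2)
print(D, I)
```

The returned pair has one row per query, which is why the search function in the next step reads D[0] and I[0].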

5. Implement a Search Function

Next, we’ll define a function search_content that takes a user query and retrieves the most similar articles from the index:

In [39]:
def search_content(query, pdf_to_index, k=3):
    # Encode query to embedding
    query_vector = model.encode([query], convert_to_numpy=True)
    faiss.normalize_L2(query_vector)

    # Search FAISS index
    D, I = index_content.search(query_vector, k)

    # Use DataFrame index directly (already integers)
    similarities = D[0]
    matched_indices = I[0]

    # Get results from the DataFrame
    results = pdf_to_index.loc[matched_indices].copy()
    results["similarities"] = similarities

    return results.sort_values(by="similarities", ascending=False)
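One edge case worth knowing: when an index holds fewer than k vectors, FAISS pads the returned labels with -1, and a label-based lookup on those values would likely fail. The sketch below shows one way to skip the padding before mapping labels back to records. The helper rows_for_hits and the toy lookup table are hypothetical and use plain dictionaries instead of the DataFrame.

```python
import numpy as np


def rows_for_hits(records_by_id, scores, labels):
    """Pair each valid label with its record and score, skipping -1 padding."""
    hits = []
    for score, label in zip(scores, labels):
        if label == -1:  # padding entry: no vector was found for this slot
            continue
        record = dict(records_by_id[int(label)])
        record["similarities"] = float(score)
        hits.append(record)
    return sorted(hits, key=lambda r: r["similarities"], reverse=True)


# Hypothetical lookup table and search output for a single query.
records_by_id = {0: {"title": "first"}, 1: {"title": "second"}}
scores = np.array([0.9, 0.4, -1.0], dtype="float32")
labels = np.array([1, 0, -1], dtype="int64")

for hit in rows_for_hits(records_by_id, scores, labels):
    print(hit)
```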

6. Test the Search Function

Use the search function to find articles related to a sample query:

In [40]:
display(search_content("animal", pdf_to_index, k=5))

Unnamed: 0,topic,link,domain,published_date,title,lang,id,similarities
176,TECHNOLOGY,https://www.pushsquare.com/news/2020/08/random...,pushsquare.com,2020-08-03 16:30:00,Random: You Can Pick Up and Pet Cats in Assass...,en,176,0.391902
975,HEALTH,https://www.news-medical.net/news/20200813/Res...,news-medical.net,2020-08-13 05:18:00,Researchers explore social behavior of animals...,en,975,0.376784
99,TECHNOLOGY,https://www.gematsu.com/2020/08/ghostwire-toky...,gematsu.com,2020-08-07 16:43:13,Ghostwire: Tokyo confirms dog petting,en,99,0.344058
928,SCIENCE,https://www.thecut.com/2020/08/scientists-say-...,thecut.com,2020-08-04 12:52:00,Just Let This Lizard Be a Dinosaur,en,928,0.317387
762,SCIENCE,https://af.reuters.com/article/worldNews/idAFK...,af.reuters.com,2020-08-13 16:51:00,'Secret' life of sharks: Study reveals their s...,en,762,0.295497


Exercise 4: ChromaDB Collection And Querying

In this exercise, we will introduce ChromaDB, an open-source vector database designed to store, index, and query embedding vectors. ChromaDB simplifies working with embeddings, and unlike FAISS, it can automatically handle tokenization, embedding, and indexing without requiring manual embedding generation. This makes it a good fit for applications built on large language models (LLMs), especially Q&A systems or search engines.



Why This Step Matters:

With embeddings generated for our data, the next logical step is to store and query these embeddings efficiently. ChromaDB provides a higher-level interface for managing embeddings and supports metadata, making it a good fit for building applications like document search or Q&A systems. By using ChromaDB, we demonstrate how to integrate embeddings into a real-world workflow that supports querying and retrieving relevant documents.



Instructions:

1. Install and Import ChromaDB Library

Ensure you have ChromaDB installed and import the necessary components:

In [41]:
import chromadb
from chromadb.config import Settings

2. Initialize a ChromaDB Client and Create a Collection

ChromaDB organizes vectors into collections, which are similar to tables in a database. Each collection holds a set of documents (vectors) and associated metadata.

In [42]:
chroma_client = chromadb.Client()
collection_name = "my_news"

# If a collection with the same name exists, delete it to avoid conflicts.
# list_collections() may return Collection objects or plain names, depending on the chromadb version.
existing_names = [getattr(c, "name", c) for c in chroma_client.list_collections()]
if collection_name in existing_names:
    chroma_client.delete_collection(name=collection_name)

print(f"Creating collection: '{collection_name}'")
collection = chroma_client.create_collection(name=collection_name)

Creating collection: 'my_news'


3. Add Data to the Collection

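ChromaDB simplifies data ingestion: if you don’t supply embeddings or a custom embedding function, the collection embeds the documents itself with its default embedding function, which handles tokenization, embedding, and indexing. In recent releases that default is an ONNX build of all-MiniLM-L6-v2; older releases used SentenceTransformerEmbeddingFunction.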

Task: Add the first 100 news titles from the DataFrame subset to the collection. Alongside each title, include its corresponding topic as metadata and assign a unique ID for each document.

In [44]:
# Display the DataFrame subset (for reference)
display(pdf_subset)

collection.add(
    documents=pdf_subset["title"][:100].tolist(),
    metadatas=[{"topic": topic} for topic in pdf_subset["topic"][:100].tolist()],
    ids=pdf_subset["id"][:100].tolist()  # Unique string IDs, taken from the "id" column.
)

Unnamed: 0,topic,link,domain,published_date,title,lang,id
0,SCIENCE,https://www.eurekalert.org/pub_releases/2020-0...,eurekalert.org,2020-08-06 13:59:45,A closer look at water-splitting's solar fuel ...,en,0
1,SCIENCE,https://www.pulse.ng/news/world/an-irresistibl...,pulse.ng,2020-08-12 15:14:19,"An irresistible scent makes locusts swarm, stu...",en,1
2,SCIENCE,https://www.express.co.uk/news/science/1322607...,express.co.uk,2020-08-13 21:01:00,Artificial intelligence warning: AI will know ...,en,2
3,SCIENCE,https://www.ndtv.com/world-news/glaciers-could...,ndtv.com,2020-08-03 22:18:26,Glaciers Could Have Sculpted Mars Valleys: Study,en,3
4,SCIENCE,https://www.thesun.ie/tech/5742187/perseid-met...,thesun.ie,2020-08-12 19:54:36,Perseid meteor shower 2020: What time and how ...,en,4
...,...,...,...,...,...,...,...
995,TECHNOLOGY,https://www.androidcentral.com/mate-40-will-be...,androidcentral.com,2020-08-07 17:12:33,The Mate 40 will be the last Huawei phone with...,en,995
996,SCIENCE,https://www.cnn.com/2020/08/17/africa/stone-ag...,cnn.com,2020-08-17 17:10:00,"Early humans knew how to make comfy, pest-free...",en,996
997,HEALTH,https://www.tenterfieldstar.com.au/story/68776...,tenterfieldstar.com.au,2020-08-13 03:26:06,Regional Vic set for virus testing blitz,en,997
998,HEALTH,https://news.sky.com/story/coronavirus-trials-...,news.sky.com,2020-08-13 13:22:58,Coronavirus: Trials of second contact-tracing ...,en,998


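The three arguments passed to collection.add are parallel lists: the n-th document, metadata entry, and ID all describe the same record. The sketch below builds them from a list of row dictionaries and checks that the IDs are unique. The helper build_payload and the sample rows are hypothetical and only illustrate the shape of the data.

```python
def build_payload(rows, limit):
    """Build the parallel lists that collection.add expects from row dicts."""
    selected = rows[:limit]
    payload = {
        "documents": [row["title"] for row in selected],
        "metadatas": [{"topic": row["topic"]} for row in selected],
        "ids": [str(row["id"]) for row in selected],
    }
    # Each record needs its own ID, so duplicates are rejected early.
    assert len(set(payload["ids"])) == len(selected), "IDs must be unique"
    return payload


# Hypothetical rows; with the DataFrame these could come from pdf_subset.to_dict("records").
rows = [
    {"id": "0", "topic": "SCIENCE", "title": "First title"},
    {"id": "1", "topic": "HEALTH", "title": "Second title"},
    {"id": "2", "topic": "SPORTS", "title": "Third title"},
]

payload = build_payload(rows, limit=2)
print(payload)
# With a real collection, the call would be: collection.add(**payload)
```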


4. Query the Collection

Finally, perform a search query to retrieve the most relevant documents based on a search term.

Task: Query the collection using a term (e.g., “space”) and retrieve the top 10 most relevant documents.

In [47]:
import json

results = collection.query(
    query_texts=["space"],
    n_results=10  # Top 10 most relevant documents
)  # Perform the search query.

print(json.dumps(results, indent=4))

{
    "ids": [
        [
            "72",
            "7",
            "30",
            "26",
            "23",
            "76",
            "69",
            "40",
            "47",
            "75"
        ]
    ],
    "embeddings": null,
    "documents": [
        [
            "Beck teams up with NASA and AI for 'Hyperspace' visual album experience",
            "Orbital space tourism set for rebirth in 2021",
            "NASA drops \"insensitive\" nicknames for cosmic objects",
            "\u2018It came alive:\u2019 NASA astronauts describe experiencing splashdown in SpaceX Dragon",
            "Hubble Uses Moon As \u201cMirror\u201d to Study Earth\u2019s Atmosphere \u2013 Proxy in Search of Potentially Habitable Planets Around Other Stars",
            "Australia's small yet crucial part in the mission to find life on Mars",
            "NASA Astronauts in SpaceX Capsule Splashdown in Gulf Of Mexico",
            "SpaceX's Starship spacecraft saw 150 meters high",
          
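The result is a dictionary of nested lists: the outer list has one entry per query text, and the inner lists hold the ranked hits. The sketch below pairs each ID with its document for the first query. It uses a hand-copied excerpt of the output (first three hits only) and a hypothetical helper name, hits_for_query.

```python
# Abbreviated copy of the query output: one query text, first three hits.
sample_results = {
    "ids": [["72", "7", "30"]],
    "documents": [[
        "Beck teams up with NASA and AI for 'Hyperspace' visual album experience",
        "Orbital space tourism set for rebirth in 2021",
        "NASA drops \"insensitive\" nicknames for cosmic objects",
    ]],
}


def hits_for_query(results, query_index=0):
    """Return (id, document) pairs for one query, in ranked order."""
    return list(zip(results["ids"][query_index], results["documents"][query_index]))


for doc_id, text in hits_for_query(sample_results):
    print(doc_id, text)
```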

Exercise 5: Question Answering With Hugging Face Model

In this exercise, we will bring everything together by building a Question Answering (Q/A) system using a Hugging Face language model. By combining document retrieval (via ChromaDB) with text generation (via Hugging Face), we create a simple yet powerful pipeline where a model generates answers based on relevant context.



Why This Step Matters:

Retrieving relevant documents is only half the battle. The next step is generating meaningful responses based on that retrieved content. This is a core technique in modern Retrieval-Augmented Generation (RAG) systems, where a language model leverages both pre-trained knowledge and external information to answer questions more accurately. By integrating ChromaDB and Hugging Face transformers, we simulate a real-world Q/A pipeline.



Instructions:

1. Install and Import the Transformers Library

The Hugging Face transformers library provides access to a variety of pre-trained language models.

In [48]:
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline

2. Initialize the Model and Tokenizer

Select a pre-trained model for text generation (e.g., GPT-2 or a similar causal language model) and initialize both the model and its tokenizer:

In [49]:
from transformers import AutoTokenizer, AutoModelForCausalLM

# 1. Choose a pre-trained model ID from Hugging Face Hub
model_id = "gpt2"

# 2. Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)

# 3. Load the causal language model
lm_model = AutoModelForCausalLM.from_pretrained(model_id)


3. Create a Text Generation Pipeline

Set up a pipeline for text generation, which wraps the model and tokenizer into a convenient interface:

In [59]:
pipe = pipeline(
    "text-generation",
    model=lm_model,
    tokenizer=tokenizer,
    max_new_tokens=512,  # Maximum number of tokens to generate.
    device_map="auto",   # Automatically uses available GPU/CPU resources.
    pad_token_id=tokenizer.eos_token_id
)

Device set to use cpu


4. Construct a Prompt Template

The prompt includes both the retrieved context (from ChromaDB) and the user’s question. This way, the model generates a response informed by the relevant documents.

In [51]:
question = "What's the latest news on space development?"  # The user's question.
context = " ".join(f"#{doc}" for doc in results["documents"][0])  # Concatenate the retrieved documents.
prompt_template = f"Relevant context: {context}\n\n The user's question: {question}"
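Since the later steps rebuild this prompt with different questions and context sizes, the same layout can be wrapped in a small function. This is a sketch only: build_prompt and its max_docs parameter are illustrative names, not part of the original notebook, and the two documents are copied from the query results.

```python
def build_prompt(documents, question, max_docs=None):
    """Join retrieved documents into the prompt layout used in this exercise."""
    selected = documents if max_docs is None else documents[:max_docs]
    context = " ".join(f"#{doc}" for doc in selected)
    return f"Relevant context: {context}\n\n The user's question: {question}"


docs = [
    "Orbital space tourism set for rebirth in 2021",
    "NASA drops \"insensitive\" nicknames for cosmic objects",
]
prompt = build_prompt(docs, "What's the latest news on space development?", max_docs=1)
print(prompt)
```

In the notebook, the call would look like build_prompt(results["documents"][0], question), with max_docs controlling how much retrieved context reaches the model.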

5. Generate a Response Using the Pipeline

Feed the prompt to the text generation pipeline and generate a response:

In [60]:
lm_response = pipe(prompt_template) # Use the pipeline to generate text based on the prompt.
print(lm_response[0]["generated_text"])

Relevant context: #Beck teams up with NASA and AI for 'Hyperspace' visual album experience #Orbital space tourism set for rebirth in 2021 #NASA drops "insensitive" nicknames for cosmic objects #‘It came alive:’ NASA astronauts describe experiencing splashdown in SpaceX Dragon #Hubble Uses Moon As “Mirror” to Study Earth’s Atmosphere – Proxy in Search of Potentially Habitable Planets Around Other Stars #Australia's small yet crucial part in the mission to find life on Mars #NASA Astronauts in SpaceX Capsule Splashdown in Gulf Of Mexico #SpaceX's Starship spacecraft saw 150 meters high #NASA’s InSight lander shows what’s beneath Mars’ surface #Alien base on Mercury: ET hunters claim to find huge UFO

 The user's question: What's the latest news on space development?

‣Please note: this question is from the NASA Astronauts Blog. It's in the current version of the article (1.5 MB) (accessed 5/18/2017). It is published in English on the NASA website.

‣Please note: this question is from the

6. Experiment with Different Prompts and Context Windows

Try varying the question and the context size (e.g., using more or fewer retrieved documents) to observe how the model’s responses change.

In [62]:
# Try using only the top 3 documents
results_3 = collection.query(query_texts=[question], n_results=3)

context_3 = " ".join(f"#{doc}" for doc in results_3["documents"][0])  # Concatenate the retrieved documents.
prompt_template_3 = f"Relevant context: {context_3}\n\n The user's question: {question}"

lm_response_3 = pipe(prompt_template_3) # Use the pipeline to generate text based on the prompt.
print(lm_response_3[0]["generated_text"])


Relevant context: #Orbital space tourism set for rebirth in 2021 #Beck teams up with NASA and AI for 'Hyperspace' visual album experience #SpaceX, NASA Demo-2 Rocket Launch Set for Saturday: How to Watch

 The user's question: What's the latest news on space development? The user's question: What's the latest news on space development?

The user's question: What is your favorite story about space tourism? The user's question: What is your favorite story about space tourism?

The user's question: What's your favorite story about space tourism?

The user's question: What's your favorite story about space tourism?

The user's question: What's your favorite story about space tourism?

The user's question: What's your favorite story about space tourism?

The user's question: What's your favorite story about space tourism?

The user's question: What's your favorite story about space tourism?

The user's question: What's your favorite story about space tourism?

The user's question: What's yo