Part 1: Word Embedding Arithmetic (30 Marks)
Task
Create 5 examples of word arithmetic similar to the "king - man + woman ≈ queen" analogy. Use
words that have relevant semantic relationships.
Steps
1. Load the BERT model and tokenizer.
2. Implement functions to get word embeddings and perform word arithmetic.
3. Write word_arithmetic and find_most_similar functions to create your examples.
4. Together, the two functions take two lists of words:
○ The first list holds the parameters to word_arithmetic, for example (paris, france,
italy). Run the arithmetic and collect the return value (e.g., paris - france + italy =
?).
○ Pass the return value of word_arithmetic to find_most_similar, along with the
second list of candidate words, such as (rome, romaine, romania, ronnie, random), to find
the word most similar to the answer.
○ Show this for 5 such sets of words.
○ Print the answer for each of the 5 test cases.
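
The two-step workflow in step 4 can be sketched without any model. This is a minimal illustration only: the 3-dimensional vectors in TOY_EMBEDDINGS are hand-picked and hypothetical, and the toy_ function names are not part of the assignment. The notebook cells replace them with BERT embeddings.

```python
import math

# Hypothetical 3-dimensional embeddings, hand-picked for illustration only.
TOY_EMBEDDINGS = {
    "paris":  [0.9, 0.1, 0.8],
    "france": [0.9, 0.1, 0.0],
    "italy":  [0.1, 0.9, 0.0],
    "rome":   [0.1, 0.9, 0.8],
    "random": [0.5, 0.5, 0.1],
}

def toy_word_arithmetic(word1, word2, word3):
    """Return the vector word1 - word2 + word3, computed element by element."""
    a, b, c = (TOY_EMBEDDINGS[w] for w in (word1, word2, word3))
    return [x - y + z for x, y, z in zip(a, b, c)]

def cosine_similarity(u, v):
    """Cosine of the angle between u and v (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

def toy_find_most_similar(target, candidates):
    """Return the candidate word whose toy embedding is closest to target."""
    return max(candidates, key=lambda w: cosine_similarity(target, TOY_EMBEDDINGS[w]))

target = toy_word_arithmetic("paris", "france", "italy")
best = toy_find_most_similar(target, ["rome", "random"])
print(f"paris - france + italy ≈ {best}")
```

With these hand-picked vectors the arithmetic lands on the vector for rome, which is why the analogy resolves cleanly here; real embeddings are rarely this tidy.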


In [7]:

!pip install langchain faiss-cpu transformers openai wikipedia




In [1]:
!pip install langchain


In [2]:
!pip install transformers torch scipy




In [3]:
from transformers import BertModel, BertTokenizer
import torch
import numpy as np
from scipy.spatial.distance import cosine

# Load BERT model and tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

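# Function to get an embedding for a word
def get_embedding(word):
    # The tokenizer wraps the input as [CLS] <word pieces> [SEP]
    inputs = tokenizer(word, return_tensors='pt')
    with torch.no_grad():
        outputs = model(**inputs)
    # Position 0 of the last hidden state is the [CLS] token, not the word itself;
    # its contextual vector is used here as a summary embedding of the input
    return outputs.last_hidden_state[0][0].numpy()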
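
get_embedding returns the vector at position 0 of the last hidden state, which for BERT is the [CLS] token. One alternative that is sometimes tried is to average the vectors of the word pieces between [CLS] and [SEP]. The sketch below shows only the pooling step on plain Python lists; the function name mean_pool_word_pieces and the sample numbers are hypothetical. With the real model, the input could be obtained as outputs.last_hidden_state[0].tolist().

```python
def mean_pool_word_pieces(token_vectors):
    """Average the vectors strictly between the first ([CLS]) and last ([SEP]) positions."""
    inner = token_vectors[1:-1]
    if not inner:
        raise ValueError("expected at least one word-piece vector between [CLS] and [SEP]")
    dim = len(inner[0])
    return [sum(vec[i] for vec in inner) / len(inner) for i in range(dim)]

# Hypothetical 2-dimensional hidden states for "[CLS] piece1 piece2 [SEP]".
hidden = [[9.0, 9.0], [1.0, 3.0], [3.0, 5.0], [7.0, 7.0]]
pooled = mean_pool_word_pieces(hidden)
print(pooled)  # [2.0, 4.0]
```

Whether this pooling gives better analogies than the [CLS] vector is not established here; it is simply a second option to compare against.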



In [4]:
def word_arithmetic(word1, word2, word3):
    # Get embeddings for all three words
    emb1 = get_embedding(word1)
    emb2 = get_embedding(word2)
    emb3 = get_embedding(word3)

    # Perform word arithmetic: (word1 - word2 + word3)
    result_vector = emb1 - emb2 + emb3
    return result_vector


In [5]:
def find_most_similar(target_vector, word_list):
    """Return the (word, similarity) pair from word_list closest to target_vector."""
    similarity_scores = []

    for word in word_list:
        word_emb = get_embedding(word)
        # scipy's cosine() is a distance, so 1 - distance gives cosine similarity
        similarity = 1 - cosine(target_vector, word_emb)
        similarity_scores.append((word, similarity))

    # Pick the candidate with the highest similarity
    return max(similarity_scores, key=lambda pair: pair[1])


In [6]:
examples = [
    ('king', 'man', 'woman'),  # king - man + woman ≈ queen
    ('paris', 'france', 'italy'),  # paris - france + italy ≈ rome
    ('apple', 'fruit', 'vegetable'),  # apple - fruit + vegetable ≈ carrot
    ('car', 'road', 'water'),  # car - road + water ≈ boat
    ('doctor', 'hospital', 'school')  # doctor - hospital + school ≈ teacher
]

word_lists = [
    ['queen', 'princess', 'lady', 'woman', 'duchess'],
    ['rome', 'milan', 'florence', 'venice', 'turin'],
    ['carrot', 'broccoli', 'tomato', 'potato', 'cucumber'],
    ['boat', 'ship', 'raft', 'canoe', 'submarine'],
    ['teacher', 'student', 'professor', 'principal', 'nurse']
]

# Generate results for the 5 examples
for i, example in enumerate(examples):
    word1, word2, word3 = example
    result_vector = word_arithmetic(word1, word2, word3)

    # Find the most similar word
    most_similar_word = find_most_similar(result_vector, word_lists[i])

    # Output the result
    print(f"{word1} - {word2} + {word3} ≈ {most_similar_word[0]} (Similarity: {most_similar_word[1]:.4f})")


king - man + woman ≈ lady (Similarity: 0.9683)
paris - france + italy ≈ florence (Similarity: 0.9418)
apple - fruit + vegetable ≈ carrot (Similarity: 0.9111)
car - road + water ≈ ship (Similarity: 0.9358)
doctor - hospital + school ≈ student (Similarity: 0.9528)
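
Several of the results above differ from the expected answer (for example, lady rather than queen). Since find_most_similar reports only the top candidate, it can help to rank all candidates and see where the expected word falls. The sketch below assumes the (word, similarity) pairs have already been computed; the scores in it are hypothetical and are not taken from the run above.

```python
def rank_candidates(scored):
    """Sort (word, similarity) pairs from most to least similar."""
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

def rank_of(word, ranked):
    """Return the 1-based position of word in a ranked list, or None if absent."""
    for position, (candidate, _) in enumerate(ranked, start=1):
        if candidate == word:
            return position
    return None

# Hypothetical similarity scores, for illustration only.
scored = [
    ("queen", 0.95),
    ("princess", 0.90),
    ("lady", 0.97),
    ("woman", 0.93),
    ("duchess", 0.88),
]
ranked = rank_candidates(scored)
print("Top candidate:", ranked[0][0])
print("Rank of expected answer 'queen':", rank_of("queen", ranked))
```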


Part 2: RAG System Implementation (30 Marks)
Task
Implement a simple RAG system using LangChain, process an article of your choice, and run 5
different queries on its content.
Steps
1. Choose at least 5 diverse articles on different topics of interest to you from the Wikipedia
dump on HuggingFace (e.g., Artificial Intelligence, Machine Learning, etc.).
2. Use the provided code from the class to load and process each article, create
embeddings, store the embeddings for all articles in a single VectorDB, and set up the
RAG system.
3. Formulate 10 diverse queries that explore various aspects of your articles' content.
4. Run each query using the run_query function and record the results.
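
The retrieval half of these steps can be sketched with the standard library alone. This is an illustration under stated assumptions, not the class code: embed is a toy bag-of-words stand-in for a neural embedding model, retrieve is a hypothetical helper, and the three chunks are made-up sentences. The notebook cells use HuggingFace embeddings and FAISS instead.

```python
import math
from collections import Counter

def embed(text):
    """Toy embedding: lowercase word counts (a stand-in for a neural embedding model)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, chunks, k=2):
    """Return the k chunks most similar to the query."""
    query_vec = embed(query)
    ranked = sorted(chunks, key=lambda chunk: cosine(query_vec, embed(chunk)), reverse=True)
    return ranked[:k]

# Made-up chunks standing in for the split articles.
chunks = [
    "assistive technology helps people with disabilities",
    "screen readers describe what is on the screen",
    "autism is a neurodevelopmental disorder",
]
top = retrieve("what is assistive technology", chunks, k=1)
print(top)
```

A query only retrieves useful context when the indexed chunks actually cover its topic, which is why step 3 ties the queries to the chosen articles.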

In [7]:
!pip install datasets langchain faiss-cpu transformers openai wikipedia

from datasets import load_dataset

# Load the Wikipedia dataset (trust_remote_code=True skips the interactive prompt
# that the dataset's loading script otherwise triggers)
wikipedia_dataset = load_dataset("wikipedia", "20220301.en", split='train', trust_remote_code=True)

# Extract a few articles by row index (change the indices to pick other articles)
article_indices = [1, 50, 100, 200, 300]
articles = [wikipedia_dataset[i]['text'] for i in article_indices]

# Display the first few characters of each article to confirm
for idx, article in enumerate(articles):
    print(f"Article {idx+1} preview:\n{article[:500]}...\n")
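
Selecting rows by index returns whatever articles happen to sit at those positions. To pick articles by topic, one option is to match on the title instead. The sketch below assumes each record is a dict with "title" and "text" keys, as the rows of this dataset are; select_articles_by_title is a hypothetical helper, and the records list is a tiny stand-in for the real dataset. Scanning the full dataset this way may be slow.

```python
def select_articles_by_title(records, wanted_titles):
    """Return {title: text} for the wanted titles, stopping once all are found."""
    remaining = set(wanted_titles)
    selected = {}
    for record in records:
        title = record["title"]
        if title in remaining:
            selected[title] = record["text"]
            remaining.discard(title)
            if not remaining:
                break  # every requested title has been found
    return selected

# Tiny stand-in records using the same "title" and "text" keys.
records = [
    {"title": "Autism", "text": "Autism is a neurodevelopmental disorder..."},
    {"title": "Assistive technology", "text": "Assistive technology (AT) is a term..."},
    {"title": "Unrelated", "text": "Some other article."},
]
chosen = select_articles_by_title(records, ["Assistive technology", "Autism"])
print(sorted(chosen))
```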



Article 1 preview:
Autism is a neurodevelopmental disorder characterized by difficulties with social interaction and communication, and by restricted and repetitive behavior. Parents often notice signs during the first three years of their child's life. These signs often develop gradually, though some autistic children experience regression in their communication and social skills after reaching developmental milestones at a normal pace.

Autism is associated with a combination of genetic and environmental factors...

Article 2 preview:
Assistive technology (AT) is a term for assistive, adaptive, and rehabilitative devices for people with disabilities and the elderly. People with disabilities often have difficulty performing activities of daily living (ADLs) independently, or even with assistance. ADLs are self-care activities that include toileting, mobility (ambulation), eating, bathing, dressing, grooming, and personal device care. Assistive technology can ameliorate the effects of 

In [9]:
!pip uninstall langchain -y
!pip install langchain




In [11]:
!pip install langchain-community



In [13]:
!pip install -U langchain-huggingface
!pip install sentence-transformers




In [15]:
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_text_splitters import CharacterTextSplitter

# Load a sentence-transformers embedding model. This MPNet-based model is the default
# that HuggingFaceEmbeddings uses when no model_name is given; it is not BERT.
embedding_model = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")

# Split articles into smaller chunks so that each embedding covers a focused passage
text_splitter = CharacterTextSplitter(chunk_size=500, chunk_overlap=50)

article_chunks = []
for article in articles:
    chunks = text_splitter.split_text(article)
    article_chunks.extend(chunks)

# Create embeddings for each chunk and store them in a single FAISS vector store
faiss_index = FAISS.from_texts(article_chunks, embedding_model)

# Save the vector store to disk (optional)
faiss_index.save_local("wiki_faiss_index")
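
The roles of chunk_size and chunk_overlap can be illustrated with a simplified fixed-window splitter. This is not how CharacterTextSplitter works internally (it first splits on a separator and then merges the pieces); chunk_text is a hypothetical helper that only shows how consecutive chunks share characters at their boundaries.

```python
def chunk_text(text, chunk_size=500, chunk_overlap=50):
    """Cut text into windows of chunk_size characters that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]

sample = "abcdefghij"
pieces = chunk_text(sample, chunk_size=4, chunk_overlap=1)
print(pieces)  # ['abcd', 'defg', 'ghij', 'j']
```

Each chunk starts with the last character of the previous one, so a sentence that straddles a boundary is still partly visible in both chunks.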




In [17]:
# Formulate queries for the RAG system
queries = [
    "What is the difference between artificial intelligence and machine learning?",
    "Can you explain the basics of natural language processing?",
    "What is the significance of robotics in modern industry?",
    "How does data science contribute to decision-making in businesses?",
    "How do generative models, like GPT, create human-like text?",
    "Explain the challenges faced by machine learning systems.",
    "How does deep learning differ from traditional machine learning?",
    "What are some applications of natural language processing?",
    "What role does big data play in the development of smart cities?",
    "What is the future of AI in society?"
]

# Define a function to run a query
def run_query(query, k=3):
    # Search the FAISS index for the k most relevant document chunks
    docs = faiss_index.similarity_search(query, k=k)

    # Combine the retrieved chunks into one context string
    combined_docs = "\n\n".join(doc.page_content for doc in docs)

    # An OpenAI or Hugging Face model could generate an answer from this context.
    # Placeholder for that generation step:
    # response = generative_model.generate(combined_docs, query)

    # For now, print the retrieved documents and return them so they can be recorded
    print(f"Query: {query}")
    print(f"Retrieved Documents:\n{combined_docs[:500]}...\n")
    return combined_docs

# Run each query and record the results
results = {query: run_query(query) for query in queries}
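
The generation step is left as a placeholder in run_query. One common way to fill it is to build a prompt that places the retrieved chunks in front of the question and pass that prompt to a language model. The sketch below covers only the prompt construction; build_prompt, the prompt wording, and the sample chunks are assumptions for illustration, and the call to an actual model is not shown.

```python
def build_prompt(query, retrieved_chunks):
    """Combine numbered context chunks and the question into a single prompt string."""
    context = "\n\n".join(
        f"[{i}] {chunk}" for i, chunk in enumerate(retrieved_chunks, start=1)
    )
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )

# Made-up chunks standing in for the output of a similarity search.
sample_chunks = [
    "Assistive technology is any object or system that helps people with disabilities.",
    "Adaptive technology is specifically designed for disabled people.",
]
prompt = build_prompt("What is assistive technology?", sample_chunks)
print(prompt)
```

Numbering the chunks makes it possible to ask the model to cite which passage supports its answer.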


Query: What is the difference between artificial intelligence and machine learning?
Retrieved Documents:
Adaptive technology
Adaptive technology and assistive technology are different. Assistive technology is something that is used to help disabled people, while adaptive technology covers items that are specifically designed for disabled people and would seldom be used by a non-disabled person. In other words, assistive technology is any object or system that helps people with disabilities, while adaptive technology is specifically designed for disabled people. Consequently, adaptive technology is ...

Query: Can you explain the basics of natural language processing?
Retrieved Documents:
Some example of screen readers are Apple VoiceOver, Google TalkBack and Microsoft Narrator. This software is provided free of charge on all Apple devices. Apple VoiceOver includes the option to magnify the screen, control the keyboard, and provide verbal descriptions to describe what is happening on th