<a href="https://colab.research.google.com/github/Rajraut12/Prompt-Engineering/blob/main/Prompt_Engineering_Homework_3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Part 1: Word Embedding Arithmetic
### Step 1: Load the BERT Model and Tokenizer

In [None]:
from transformers import BertTokenizer, BertModel
import torch
import numpy as np

# Load the BERT model and tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# Ensure the model is in evaluation mode
model.eval()



BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(30522, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0-11): 12 x BertLayer(
        (attention): BertAttention(
          (self): BertSdpaSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False

### Step 2: Implement Functions to Get Word Embeddings and Perform Word Arithmetic

In [None]:
# Function to get the embedding of a word
def get_word_embedding(word):
    # Tokenize the word and convert to tensor
    inputs = tokenizer(word, return_tensors='pt')

    # Get the hidden states from BERT
    with torch.no_grad():
        outputs = model(**inputs)

    # Use the last hidden layer, averaging over all token positions
    # (this includes the [CLS] and [SEP] special tokens added by the tokenizer)
    word_embedding = outputs.last_hidden_state.mean(dim=1).squeeze()

    return word_embedding

# Function to perform word arithmetic
def word_arithmetic(words1):
    vec = get_word_embedding(words1[0])

    for i in range(1, len(words1), 2):
        if words1[i] == '-':
            vec -= get_word_embedding(words1[i + 1])
        elif words1[i] == '+':
            vec += get_word_embedding(words1[i + 1])

    return vec

# Function to find the most similar word
def find_most_similar(target_embedding, words_list):
    similarities = []

    for word in words_list:
        word_embedding = get_word_embedding(word)
        # Compute cosine similarity
        cos_sim = torch.nn.functional.cosine_similarity(target_embedding, word_embedding, dim=0)
        similarities.append((word, cos_sim.item()))

    # Sort based on similarity and return the most similar word
    similarities.sort(key=lambda x: x[1], reverse=True)

    return similarities[0][0]
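
The arithmetic and the cosine comparison above can be checked by hand on small vectors. The following is a minimal sketch, assuming made-up 3-dimensional vectors (axes: royalty, male, female) in place of the 768-dimensional BERT embeddings; the `toy` dictionary and its values are hypothetical and chosen only so the result is easy to verify.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings, NOT BERT output: [royalty, male, female]
toy = {
    "king":     [1.0, 1.0, 0.0],
    "man":      [0.0, 1.0, 0.0],
    "woman":    [0.0, 0.0, 1.0],
    "queen":    [1.0, 0.0, 1.0],
    "princess": [0.5, 0.0, 1.0],
    "lady":     [0.0, 0.0, 1.0],
}

# king - man + woman, computed element-wise
target = [k - m + w for k, m, w in zip(toy["king"], toy["man"], toy["woman"])]

# Pick the candidate whose vector points in the most similar direction
candidates = ["queen", "princess", "lady"]
best = max(candidates, key=lambda word: cosine(target, toy[word]))
print(best)
```

Because cosine similarity ignores vector length, only the direction of `target` matters; that is why the comparison still makes sense after adding and subtracting embeddings of different magnitudes.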


### Step 3: Perform Word Arithmetic and Find Most Similar Word

In [None]:
# List of examples
examples = [
    (["paris", "-", "france", "+", "italy"], ["rome", "romaine", "romania", "ronnie", "random"]),
    (["king", "-", "man", "+", "woman"], ["queen", "lady", "empress", "princess", "duchess"]),
    (["apple", "-", "fruit", "+", "vegetable"], ["carrot", "potato", "broccoli", "grape", "berry"]),
    (["japan", "-", "tokyo", "+", "seoul"], ["korea", "china", "singapore", "taiwan", "india"]),
    (["doctor", "-", "hospital", "+", "school"], ["teacher", "student", "principal", "nurse", "librarian"])
]

# Execute the word arithmetic for each example
for i, (arithmetic_words, candidates) in enumerate(examples):
    target_embedding = word_arithmetic(arithmetic_words)
    most_similar_word = find_most_similar(target_embedding, candidates)
    print(f"Example {i+1}: {arithmetic_words} -> Most similar word: {most_similar_word}")

Example 1: ['paris', '-', 'france', '+', 'italy'] -> Most similar word: rome
Example 2: ['king', '-', 'man', '+', 'woman'] -> Most similar word: queen
Example 3: ['apple', '-', 'fruit', '+', 'vegetable'] -> Most similar word: potato
Example 4: ['japan', '-', 'tokyo', '+', 'seoul'] -> Most similar word: korea
Example 5: ['doctor', '-', 'hospital', '+', 'school'] -> Most similar word: teacher
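
Each line above reports only the winning candidate, which hides how close the runners-up were. A possible variant returns the full ranking with scores instead. This is a sketch only: `rank_candidates` is a hypothetical helper, and the vectors are invented for illustration rather than taken from BERT.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def rank_candidates(target, candidate_vectors):
    """Return (word, score) pairs sorted from most to least similar to target."""
    scored = [(word, cosine(target, vec)) for word, vec in candidate_vectors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Invented vectors standing in for real embeddings
target = [0.9, 0.1, 0.8]
candidate_vectors = {
    "carrot": [0.8, 0.2, 0.9],
    "potato": [0.9, 0.1, 0.7],
    "grape":  [0.1, 0.9, 0.2],
}

ranking = rank_candidates(target, candidate_vectors)
for word, score in ranking:
    print(f"{word}: {score:.4f}")
```

Printing the scores makes it visible when two candidates are nearly tied, in which case the top-1 answer should be read with some caution.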


# Part 2: RAG System Implementation

In [1]:
!pip install langchain
!pip install langchain-community
!pip install langchain-groq
!pip install wikipedia
!pip install sentence-transformers
!pip install faiss-gpu


In [2]:
import os
from getpass import getpass

from langchain.chains.question_answering import load_qa_chain
from langchain_groq import ChatGroq
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_community.document_loaders import WikipediaLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Read the Groq API key from the environment, prompting for it if it is not set.
# Never hard-code API keys in a notebook.
if not os.environ.get("GROQ_API_KEY"):
    os.environ["GROQ_API_KEY"] = getpass("Enter your Groq API key: ")

# Step 1: Choose 5 articles (article titles are just examples, adjust as needed)
articles = [
    "Artificial Intelligence",
    "Quantum Computing",
    "Climate Change",
    "Ancient Civilizations",
    "Space Exploration"
]

# Step 2: Load each article (WikipediaLoader treats the title as a search query,
# so it may return several matching pages per title)
documents = []
for article_title in articles:
    loader = WikipediaLoader(article_title)
    article_text = loader.load()
    documents.extend(article_text)

# Step 3: Split the documents into chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
docs_chunks = text_splitter.split_documents(documents)

# Step 4: Create embeddings and store in a VectorDB
embedding_model = HuggingFaceEmbeddings(model_name='sentence-transformers/all-MiniLM-L6-v2')
vector_db = FAISS.from_documents(docs_chunks, embedding_model)

# Step 5: Initialize the Groq LLM
llm = ChatGroq(model_name="mixtral-8x7b-32768")  # Ensure ChatGroq is compatible with langchain

# Step 6: Load the QA chain using the appropriate LLM
qa_chain = load_qa_chain(llm, chain_type="stuff")

# Step 7: Define the query function using the chain and retriever
def run_query(query):
    docs = vector_db.similarity_search(query)
    result = qa_chain.run(input_documents=docs, question=query)
    return result

# Step 8: Run 10 diverse queries on the RAG system
queries = [
    "What is the main goal of Artificial Intelligence?",
    "How does Quantum Computing differ from classical computing?",
    "What are the primary causes of Climate Change?",
    "Which ancient civilization was known for building pyramids?",
    "What is the significance of space exploration?",
    "How is AI impacting modern industries?",
    "What is the principle of superposition in Quantum Computing?",
    "What are the potential effects of climate change on the environment?",
    "Which ancient civilization is credited with the invention of the wheel?",
    "What are the future prospects of human settlement on Mars?"
]

# Step 9: Run each query and record results
for i, query in enumerate(queries, 1):
    response = run_query(query)
    print(f"Query {i}: {query}")
    print(f"Response: {response}\n")
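
The `run_query` function above follows a retrieve-then-stuff pattern: find the chunks most similar to the query, then place them all into a single prompt for the LLM. The sketch below illustrates that flow without any external services. It assumes a toy bag-of-words `embed` function in place of the sentence-transformer model, a brute-force search in place of FAISS, and an invented prompt template; none of these names or strings come from LangChain.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy 'embedding': lowercase word counts (stand-in for a sentence encoder)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, chunks, k=2):
    """Brute-force similarity search: return the k chunks closest to the query."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

def build_stuff_prompt(query, retrieved):
    """'Stuff' strategy: concatenate every retrieved chunk into one prompt."""
    context = "\n\n".join(retrieved)
    return (
        "Use the context to answer the question.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )

# Hypothetical chunks standing in for the split Wikipedia articles
chunks = [
    "Quantum computing uses qubits that can be in superposition.",
    "Climate change is driven largely by greenhouse gas emissions.",
    "Space exploration relies on rockets and robotic probes.",
]

query = "What drives climate change?"
top = retrieve(query, chunks, k=1)
prompt = build_stuff_prompt(query, top)
print(prompt)
```

The "stuff" strategy is the simplest option, but the combined chunks must fit within the model's context window; that constraint is one reason to keep `chunk_size` and the number of retrieved chunks modest.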






Query 1: What is the main goal of Artificial Intelligence?
Response: The main goal of Artificial Intelligence is to develop and study methods and software that enable machines to perceive their environment, learn from experience, and use intelligence to take actions that maximize their chances of achieving defined goals. This overarching goal can be broken down into specific traits or capabilities that researchers aim to display in an intelligent system, such as reasoning, knowledge representation, planning, learning, natural language processing, perception, and support for robotics. The long-term goal of AI is to create a general intelligence that can complete any task performable by a human on at least an equal level.

Query 2: How does Quantum Computing differ from classical computing?
Response: Quantum computing differs from classical computing in several ways. First, quantum algorithms can leverage quantum mechanical phenomena such as superposition and entanglement, which can allo