### Setting up Environment and the LLM

#### Installing necessary packages
First, we install the Hugging Face and LlamaIndex libraries that are used throughout the project.

In [None]:
# install llama_index
! pip install llama_index==0.10.19 llama_index_core==0.10.19 torch llama-index-embeddings-huggingface peft optimum bitsandbytes

# install auto_gptq
! pip install auto_gptq

# install docx2txt
! pip install docx2txt



#### LLM Setup
Run the cells below to import the necessary libraries, set up the Hugging Face LLM, and prompt the LLM for a response. In this mini project we use the Qwen2.5 LLM from Alibaba Cloud.

In [None]:
# import list of libraries
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core import Settings, SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.retrievers import VectorIndexRetriever
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.postprocessor import SimilarityPostprocessor
from transformers import AutoModelForCausalLM, AutoTokenizer

In [None]:
# instantiate the LLM from the Hugging Face library
model_name = "Qwen/Qwen2.5-1.5B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_name,
                                             trust_remote_code=False,
                                             revision="main",
                                             device_map="cuda:0"
                                             )
tokenizer = AutoTokenizer.from_pretrained(model_name)

### Task 1 - LLM Prompting and Output Response Processing



#### Step 1: Prompt Generation

In this step we write a function that generates the prompts used to query the LLM. Complete the prompt-generating function below according to the following specs.

**Input:** User query and context

**Output:** Prompt as a string.

**Example:**
1. Context: A customer is having issues with their smartphone battery draining quickly and the phone overheating.
2. User query: My battery is constantly draining, what are some suggestions.
3. Example output prompt:
```
Context: A customer is having issues with their smartphone battery draining quickly and the phone overheating.
Please respond to the following user comment. Use the context above if it is helpful.
User comments: My battery is constantly draining, what are some suggestions.
```
4. Expected format:
```
Context: <context>
Please respond to the following user comment. Use the context above if it is helpful.
User comments: <user query>
```


In [None]:
# TODO 1: Populate the user prompt generating function
def prompt_with_context(context, user_query):
    # Format the prompt with the provided context and user query
    prompt = f"""
Context: {context}
Please respond to the following user comment. Use the context above if it is helpful.
User comments: {user_query}
"""
    return prompt

example_prompt = prompt_with_context("", "What is the functionality of an LLM?")
print(example_prompt)


Context: 
Please respond to the following user comment. Use the context above if it is helpful.
User comments: What is the functionality of an LLM?



#### Step 2: LLM Query Function

In this step we write a function that queries the LLM given the context and the user query, using the prompt generation function implemented above.

Recall that an LLM works with tokens rather than with strings and characters directly. We therefore tokenize the prompt with the tokenizer before generation, and decode the tokens that the LLM generates afterwards.

**Input:** User query and context

**Output:** LLM response as a string

Refer to the [Hugging Face Tokenizer](https://huggingface.co/docs/transformers/main_classes/tokenizer#transformers.PreTrainedTokenizer.encode_plus) for additional information.
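
The encode, generate, strip, and decode flow described above can be sketched with a toy tokenizer. This is only an illustration: the vocabulary and the `toy_encode` / `toy_decode` helpers are hypothetical stand-ins, not the Hugging Face tokenizer, and the "generated" continuation is hard-coded rather than produced by a model.

```python
# Toy stand-in for a tokenizer: NOT the Hugging Face implementation.
# The vocabulary and helper names below are hypothetical.
vocab = {"<bos>": 0, "what": 1, "is": 2, "an": 3, "llm": 4,
         "a": 5, "language": 6, "model": 7}
id_to_token = {idx: tok for tok, idx in vocab.items()}

def toy_encode(text):
    """Map a whitespace-separated string to a list of token IDs."""
    return [vocab["<bos>"]] + [vocab[word] for word in text.lower().split()]

def toy_decode(token_ids, skip_special_tokens=True):
    """Map token IDs back to a string, optionally dropping special tokens."""
    tokens = [id_to_token[i] for i in token_ids]
    if skip_special_tokens:
        tokens = [t for t in tokens if not (t.startswith("<") and t.endswith(">"))]
    return " ".join(tokens)

prompt_ids = toy_encode("what is an llm")

# A causal LM returns the prompt tokens followed by the new tokens.
# The continuation is hard-coded here purely for illustration.
generated_ids = prompt_ids + toy_encode("a language model")[1:]

# Keep only the tokens after the prompt, then decode them.
new_ids = generated_ids[len(prompt_ids):]
response = toy_decode(new_ids)

print(prompt_ids)
print(response)
```

Slicing by the prompt length is the same idea the query function uses to drop the echoed input before decoding.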

In [None]:
def get_llm_response(context, user_query, temperature=0.0001):
    # Generate the prompt
    prompt = prompt_with_context(context, user_query)

    # Setting up prompting messages
    messages = [
        {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
        {"role": "user", "content": prompt}
    ]
    input_text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )

    # Step 1: Tokenize the input_text
    model_inputs = tokenizer.encode_plus(
        input_text,
        return_tensors="pt",  # Ensure tensors are returned for model input
        padding=True,
        truncation=True
    )

    # Step 1.1: Move inputs to the same device as the model
    device = model.device  # Automatically detect the model's device
    model_inputs = {key: value.to(device) for key, value in model_inputs.items()}  # Move inputs to the model's device

    # Step 2: Call the LLM to generate tokenized outputs
    generated_token_ids = model.generate(
        **model_inputs,
        max_new_tokens=512,
        temperature=temperature
    )

    # Step 3: Post-process the generated tokens to remove input tokens
    def post_processing(generated_token_ids):
        # Remove input tokens from the generated tokens
        input_length = model_inputs['input_ids'].shape[1]  # Length of input tokens
        return generated_token_ids[:, input_length:]  # Retain only the new tokens

    generated_token_ids_post_processed = post_processing(generated_token_ids)

    # Step 4: Decode the generated tokens
    response = tokenizer.batch_decode(
        generated_token_ids_post_processed,
        skip_special_tokens=True  # Remove special tokens during decoding
    )[0]  # Decode as a string

    return response

example_prompt_response = get_llm_response("", "What is the functionality of an LLM?")
print(example_prompt_response)

An LLM (Large Language Model) is designed to understand and generate human-like text based on the input provided. It can be used for various purposes such as language translation, content creation, customer service chatbots, and more. The specific functionalities depend on the model's architecture and training data.


#### Step 3: Query the LLM with Different Temperature Settings

In this step, we query the LLM using the functions implemented above. We are now ready to write our first query to the Qwen2.5-1.5B LLM.

Additionally, we explore the effect of temperature on LLM inference.

**Example prompt (empty context):** What is the functionality of an LLM?


In [None]:
# TODO 5: use the get_llm_response function implemented above
# and print the LLM's response for the example prompt.
def query_llm_example():
    # Example prompt
    context = ""  # Empty context
    user_query = "What is the functionality of an LLM?"
    temperature = 0.0001  # Specified temperature

    # Query the LLM using the implemented function
    response = get_llm_response(context, user_query, temperature)

    # Print the LLM's response
    print("LLM Response:")
    print(response)

# Execute the function
query_llm_example()

LLM Response:
An LLM (Large Language Model) is designed to understand and generate human-like text based on the input provided. It can be used for various purposes such as language translation, content creation, customer service chatbots, and more. The specific functionalities depend on the model's architecture and training data.


The temperature parameter of an LLM controls the randomness of the generated responses. It rescales the probability distribution from which the next token is sampled. At a low temperature (close to 0), generation is nearly deterministic and concentrates on the most probable tokens, so responses are consistent from run to run. This suits tasks that require reliability, such as technical explanations or summarization. At a high temperature, the model samples less probable tokens more often, so responses are more diverse, more creative, and sometimes unpredictable. This suits tasks such as creative writing or brainstorming.
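
To make the effect concrete, here is a minimal sketch of temperature scaling applied to a softmax over next-token scores. The logits are made up for illustration and the `softmax_with_temperature` helper is hypothetical. It shows the standard formulation (divide the logits by the temperature before the softmax), not the internals of `model.generate`.

```python
import numpy as np

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = np.asarray(logits, dtype=np.float64) / temperature
    scaled -= scaled.max()  # subtract the max for numerical stability
    exp = np.exp(scaled)
    return exp / exp.sum()

# Hypothetical logits for four candidate next tokens.
logits = [2.0, 1.0, 0.5, 0.1]

for temperature in [0.0001, 0.5, 1.0, 1.5]:
    probs = softmax_with_temperature(logits, temperature)
    print(f"T={temperature}: {np.round(probs, 3)}")
```

At the lowest temperature nearly all of the probability mass sits on the highest-scoring token, while higher temperatures flatten the distribution so that other tokens are sampled more often.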

In [None]:
# TODO 6: change the temperature parameter value and see how that affects the LLM output.
# You might want to repeatedly generate a few responses to the same prompt to see the difference.
def experiment_with_temperature():
    # Example prompt
    context = ""  # Empty context
    user_query = "What is the functionality of an LLM?"

    # Test different temperature values
    for temp in [0.0001, 0.5, 1.0, 1.5]:
        print(f"Temperature: {temp}")
        response = get_llm_response(context, user_query, temperature=temp)
        print("LLM Response:")
        print(response)
        print("-" * 50)

# Execute the function
experiment_with_temperature()

Temperature: 0.0001
LLM Response:
An LLM (Large Language Model) is designed to understand and generate human-like text based on the input provided. It can be used for various purposes such as language translation, content creation, customer service chatbots, and more. The specific functionalities depend on the model's architecture and training data.
--------------------------------------------------
Temperature: 0.5
LLM Response:
An LLM (Large Language Model) is a type of artificial intelligence that can generate human-like text based on the input provided to it. It has the ability to understand and process natural language, allowing it to perform tasks such as translation, summarization, question answering, and more. LLMs are designed to be highly versatile and capable of handling a wide range of inputs and generating appropriate responses. They have been used in various applications, including chatbots, virtual assistants, content generation for websites or social media platforms, an

#### Step 4: Hallucination in LLMs
LLMs can hallucinate, producing factually incorrect or self-contradictory results. In our outputs, temperature alone did not directly reduce hallucinations. It influences randomness and creativity, and lower temperatures lead to more grounded and consistent responses. We also found that including factual context in the prompt explicitly guides the model away from potential hallucinations: the response reflects the provided context, which reduces the risk of irrelevant or unsupported output.

In [None]:
# TODO 7: Come up with a prompt that results in hallucination, then see
# if changing the temperature or providing factual context helps.
def experiment_with_hallucination():
    hallucination_prompt = "What is the date when squirrels started using ChatGPT?"

    configurations = [
        {"temperature": 0.0001, "context": ""},
        {"temperature": 0.5, "context": ""},
        {"temperature": 1.0, "context": ""},
        {"temperature": 0.5, "context": "Squirrels do not use ChatGPT, and there is no record of such an event."}
    ]

    for config in configurations:
        print(f"Temperature: {config['temperature']} | Context: {config['context']}")
        response = get_llm_response(config["context"], hallucination_prompt, temperature=config["temperature"])
        print("LLM Response:")
        print(response)
        print("-" * 50)

experiment_with_hallucination()

Temperature: 0.0001 | Context: 
LLM Response:
I'm sorry, but I don't have any information about when squirrels started using ChatGPT. This question seems to be unrelated to my knowledge base and capabilities as an AI language model. If you have any other questions or need assistance with something else, feel free to ask!
--------------------------------------------------
Temperature: 0.5 | Context: 
LLM Response:
I'm sorry, but I cannot provide an answer to your question as there is no relevant information provided in the given context. The context does not mention any specific date for when squirrels started using ChatGPT. Please provide more details or clarify your question.
--------------------------------------------------
Temperature: 1.0 | Context: 
LLM Response:
I'm sorry, but I don't have any information about when squirrels started using ChatGPT. My training data does not include such specific details related to wildlife or squirrel behavior in relation to artificial intellige

In [None]:
# Query the LLM with varying temperature settings
temperatures = [0.0001, 0.1, 1.0]
context = ""
example_prompt = "What is the functionality of an LLM?"

for temp in temperatures:
    print(f"\n--- Response at temperature {temp} ---")
    for _ in range(3):
        response = get_llm_response(context, example_prompt, temperature=temp)
        print(f"Response: {response}")
        print('--------')

### Task 2 - Understanding Tokenization and Embeddings
1. Instantiate an open-source Hugging Face embedding model.
2. Encode the given sentence examples.
3. Implement the cosine vector similarity score.
4. Compare the embeddings between those examples and a given reference and select the ones that have a similarity score greater than some threshold.
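
As a rough sketch of the last item, the snippet below scores a few candidates against a reference and keeps those above a threshold. The 3D vectors are made-up stand-ins for real sentence embeddings, and the `cosine` helper and the `threshold` value of 0.5 are assumptions chosen for illustration.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two 1D vectors (assumes neither is all zeros)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up 3D vectors standing in for real sentence embeddings.
reference = np.array([1.0, 0.0, 0.0])
candidates = {
    "same direction": np.array([2.0, 0.0, 0.0]),
    "45 degrees away": np.array([1.0, 1.0, 0.0]),
    "orthogonal": np.array([0.0, 1.0, 0.0]),
}

threshold = 0.5  # hypothetical cut-off; tune it for the task at hand

# Score every candidate against the reference, then filter by the threshold.
scores = {name: cosine(reference, vec) for name, vec in candidates.items()}
selected = [name for name, score in scores.items() if score > threshold]

print(scores)
print(selected)
```

The right threshold depends on the embedding model and the data, so it is usually chosen by inspecting scores on known similar and dissimilar pairs.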

#### Embedding Model Setup
In this project we use the all-mpnet-base-v2 embedding model, which is based on Microsoft's mpnet-base model. It is intended to encode sentences and short paragraphs.

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM
from sentence_transformers import SentenceTransformer
from llama_index.core import Document, VectorStoreIndex

In [None]:
# Load a sentence-transformers model for text embeddings from Hugging Face
embed_model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")

#### Step 1: Embedding Generation

In this step we write a function that generates the embeddings used for downstream tasks such as RAG. Complete the embedding-generating function below according to the following specs.

**Input:** User input text.

**Output:** Text embedding using the embedding model.

Embedding shape: torch.Size([768]). This indicates that the embedding is a 1D tensor (a vector) with 768 elements. Each of these 768 values is a numerical feature representing the semantic meaning of the input text.

In [None]:
# TODO 8: Generate embeddings using the embedding model
import torch

def generate_embeddings(text):
    # Generate embeddings using the encode method of SentenceTransformer
    embedding = embed_model.encode(
        text,
        convert_to_tensor=True,  # Return PyTorch tensor
        device=embed_model.device  # Ensure embeddings are computed on the correct device (CPU or GPU)
    )
    return embedding

# Example usage
text = "Sample text for embedding generation."
embedding = generate_embeddings(text)
print("Embedding shape:", embedding.shape)

Embedding shape: torch.Size([768])


The dimension of the embedding vector remains constant (e.g., 768) regardless of the length of the input sentence. This is because the embedding model (SentenceTransformer) generates fixed-length embeddings based on the model's architecture and not the input size.
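
One common way a sentence-embedding model arrives at a fixed size is to pool its per-token vectors into a single vector. The sketch below assumes mean pooling and uses random arrays in place of real model outputs. The `mean_pool` helper is hypothetical and is not the SentenceTransformer implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size = 768  # matches the embedding size reported above

def mean_pool(token_vectors):
    """Average per-token vectors into a single fixed-size sentence vector."""
    return token_vectors.mean(axis=0)

# Random arrays standing in for per-token model outputs of different lengths.
for num_tokens in [4, 12, 30]:
    token_vectors = rng.normal(size=(num_tokens, hidden_size))
    sentence_vector = mean_pool(token_vectors)
    print(num_tokens, "tokens ->", sentence_vector.shape)
```

Whatever the number of tokens, averaging over the token axis leaves a vector whose length equals the hidden size.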

In [None]:
sentences = ["Short sentence.", "This is a slightly longer sentence to test embedding length.", "An even longer sentence to further examine whether the embedding size remains fixed regardless of input length."]
embeddings = [generate_embeddings(sentence) for sentence in sentences]

for i, embedding in enumerate(embeddings):
    print(f"Sentence {i+1}: {sentences[i]}")
    print(f"Embedding shape: {embedding.shape}")
    print("-" * 50)

Sentence 1: Short sentence.
Embedding shape: torch.Size([768])
--------------------------------------------------
Sentence 2: This is a slightly longer sentence to test embedding length.
Embedding shape: torch.Size([768])
--------------------------------------------------
Sentence 3: An even longer sentence to further examine whether the embedding size remains fixed regardless of input length.
Embedding shape: torch.Size([768])
--------------------------------------------------


#### Step 2: Cosine Similarity Score

In this step we write a function that computes the similarity of two embeddings. We use cosine similarity.

**Input:** Two embeddings as numpy arrays.

**Output:** The cosine similarity score between the two embeddings.

In [None]:
import numpy as np

# TODO 9: Calculate the cosine similarity score for the two embeddings
def cosine_similarity_score(src_embedding, tar_embedding):
    # Compute the L2 norm (magnitude) of each embedding
    src_norm = np.linalg.norm(src_embedding)
    tar_norm = np.linalg.norm(tar_embedding)

    # Handle edge case where embedding norm is zero
    if src_norm == 0 or tar_norm == 0:
        return 0.0  # No similarity if one of the vectors has zero magnitude

    # Compute the cosine similarity
    score = np.dot(src_embedding, tar_embedding) / (src_norm * tar_norm)

    return score

# Example embeddings
embedding1 = np.array([1, 2, 3])
embedding2 = np.array([4, 5, 6])

# Compute cosine similarity
score = cosine_similarity_score(embedding1, embedding2)
print("Cosine Similarity Score:", score)

Cosine Similarity Score: 0.9746318461970762
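
As a check on the score above, the same value can be worked out by hand for the example vectors $\mathbf{a} = (1, 2, 3)$ and $\mathbf{b} = (4, 5, 6)$, using the standard definition of cosine similarity:

$$
\cos(\mathbf{a}, \mathbf{b})
= \frac{\mathbf{a} \cdot \mathbf{b}}{\lVert \mathbf{a} \rVert \, \lVert \mathbf{b} \rVert}
= \frac{1 \cdot 4 + 2 \cdot 5 + 3 \cdot 6}{\sqrt{1^2 + 2^2 + 3^2} \, \sqrt{4^2 + 5^2 + 6^2}}
= \frac{32}{\sqrt{14} \, \sqrt{77}}
= \frac{32}{\sqrt{1078}}
\approx 0.9746
$$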


#### Step 3: Comparing Similarity between Sentences.

In this step we use the embedding generation function and the cosine similarity score function implemented above to compute the similarity between pairs of the provided sentences.

In [None]:
from google.colab import files

# Upload sentences.txt
uploaded = files.upload()

Saving sentences.txt to sentences (1).txt


In [None]:
# TODO 10: Calculate the cosine similarity score between target sentence and
# each of the other sentences provided in the sentences.txt file.
# Function to generate embeddings
def generate_embeddings(text):
    embedding = embed_model.encode(
        text,
        convert_to_tensor=True,
        device=embed_model.device
    )
    return embedding.cpu().numpy()

# Function to calculate cosine similarity
def cosine_similarity_score(src_embedding, tar_embedding):
    src_norm = np.linalg.norm(src_embedding)
    tar_norm = np.linalg.norm(tar_embedding)
    if src_norm == 0 or tar_norm == 0:
        return 0.0
    return np.dot(src_embedding, tar_embedding) / (src_norm * tar_norm)

# Read the file
with open("sentences.txt", "r") as file:
    sentences = file.readlines()

# Extract the target sentence and other sentences
target_sentence = sentences[0].strip()  # First line is the target sentence
other_sentences = [sentence.strip() for sentence in sentences[3:]]

# Generate embeddings for the target sentence
target_embedding = generate_embeddings(target_sentence)

# Calculate and print cosine similarity scores
similarity_scores = []
for sentence in other_sentences:
    other_embedding = generate_embeddings(sentence)
    score = cosine_similarity_score(target_embedding, other_embedding)
    similarity_scores.append((sentence, score))

# Display results
print(f"Target Sentence: {target_sentence}")
for sentence, score in similarity_scores:
    print(f"Sentence: {sentence}")
    print(f"Cosine Similarity Score: {score}")
    print("-" * 50)

Target Sentence: Target Sentence: The weather today is perfect for a walk in the park.
Sentence: Jogging in the park is a great way to exercise in the morning.
Cosine Similarity Score: 0.4265746772289276
--------------------------------------------------
Sentence: It might rain later, so the park could get muddy.
Cosine Similarity Score: 0.39862650632858276
--------------------------------------------------
Sentence: It’s an ideal day to take a walk through the park.
Cosine Similarity Score: 0.6885090470314026
--------------------------------------------------
Sentence: A picnic in the park sounds like a fun idea.
Cosine Similarity Score: 0.38541653752326965
--------------------------------------------------
Sentence: The park is a great place to enjoy today’s beautiful weather.
Cosine Similarity Score: 0.5804292559623718
--------------------------------------------------
Sentence: Today’s weather makes it wonderful to walk outdoors in the park.
Cosine Similarity Score: 0.6761819124221