# Setting up Google Colab and the Replicate API

## Setting up Google Colab
1. **Open this notebook**: Go to [Colab](https://colab.research.google.com/github/Chair-of-Banking-and-Finance/Bachelor_thesis_24_25_Template/blob/main/Llama_RAG/llama2%20notebook.ipynb).
2. **Connect to runtime**: Click the "Connect" button in the top-right corner of the screen. This connects your notebook to a virtual machine.
3. **Set GPU as hardware accelerator**: Go to "Runtime" -> "Change runtime type" and select "GPU" from the hardware accelerator dropdown menu. A GPU is highly recommended when working with large language models like Llama 2. In this notebook it mainly speeds up the sentence embeddings, because the Llama 2 model itself is called through the Replicate API.
4. **Install required packages**: Run the installation cell below to install all necessary dependencies for the project. These include `faiss-cpu`, `sentence-transformers`, `replicate`, `PyPDF2`, `pytesseract`, and others.
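
To confirm that step 3 took effect, a small check like the following can help. It is only a sketch: it assumes that a GPU runtime exposes the `nvidia-smi` command-line tool (which ships with the NVIDIA driver), and the helper name `gpu_available` is made up for this example.

```python
import shutil
import subprocess


def gpu_available() -> bool:
    """Return True if the NVIDIA driver tool reports at least one GPU."""
    # `nvidia-smi` is absent on CPU-only runtimes.
    if shutil.which("nvidia-smi") is None:
        return False
    try:
        result = subprocess.run(
            ["nvidia-smi", "-L"],  # -L lists the detected GPUs, one per line
            capture_output=True,
            text=True,
            timeout=30,
        )
    except (OSError, subprocess.SubprocessError):
        return False
    return result.returncode == 0 and "GPU" in result.stdout


if gpu_available():
    print("GPU runtime detected")
else:
    print("No GPU detected: check Runtime -> Change runtime type")
```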


# Every section marked with a flag 🚩 may need some input from you


# The following cell installs and imports the necessary libraries in your notebook



In [None]:
# Install necessary libraries
!pip install scikit-learn
!apt-get update
!apt-get install -y tesseract-ocr
!pip install faiss-cpu sentence-transformers python-dotenv replicate PyPDF2 Pillow pytesseract

# Import necessary libraries
from sklearn.metrics.pairwise import cosine_similarity
import os
import replicate
import PyPDF2
from sentence_transformers import SentenceTransformer
import faiss
import numpy as np
from PIL import Image
import pytesseract
import io

Hit:1 https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/ InRelease
Hit:2 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  InRelease
Hit:3 http://archive.ubuntu.com/ubuntu jammy InRelease
Hit:4 http://security.ubuntu.com/ubuntu jammy-security InRelease
Ign:5 https://r2u.stat.illinois.edu/ubuntu jammy InRelease
Hit:6 http://archive.ubuntu.com/ubuntu jammy-updates InRelease
Hit:7 https://r2u.stat.illinois.edu/ubuntu jammy Release
Hit:8 http://archive.ubuntu.com/ubuntu jammy-backports InRelease
Hit:9 https://ppa.launchpadcontent.net/deadsnakes/ppa/ubuntu jammy InRelease
Hit:11 https://ppa.launchpadcontent.net/graphics-drivers/ppa/ubuntu jammy InRelease
Hit:12 https://ppa.launchpadcontent.net/ubuntugis/ppa/ubuntu jammy InRelease
Reading package lists... Done
W: Skipping acquire of configured file 'main/source/Sources' as repository 'https://r2u.stat.illinois.edu/ubuntu jammy InRelease' does not seem to provide it (sources.list entry misspelt?)
Reading pack
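
If the installation output is long or cut off, it can be hard to tell whether everything installed. The sketch below checks that each library can be found for import. The list uses import names, which differ from some pip names (for example `sklearn` for `scikit-learn`, `PIL` for `Pillow`); the helper `missing_modules` is a name chosen for this example.

```python
import importlib.util

# Import names (not pip names) of the libraries used in this notebook.
REQUIRED_MODULES = [
    "sklearn", "faiss", "sentence_transformers", "replicate",
    "PyPDF2", "PIL", "pytesseract", "numpy",
]


def missing_modules(module_names):
    """Return the subset of module_names that cannot be found for import."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]


missing = missing_modules(REQUIRED_MODULES)
print("All libraries found" if not missing else f"Missing libraries: {missing}")
```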

# 🚩 Insert your Replicate API token in place of the placeholder string in the cell below, then run the cell. You can get a token here: https://replicate.com/account/api-tokens. For reference, after a few hundred calls my bill was about $0.03. You can also set your own monthly billing limit.

In [None]:
# Set the API token directly in the environment
os.environ["REPLICATE_API_TOKEN"] = "YOUR_API_TOKEN"  # replace with your own token
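
Hard-coding the token means it is saved with the notebook. As an alternative sketch, the token can be typed at a hidden prompt and checked before it is stored. The helper names `set_replicate_token` and `prompt_for_token` are assumptions made for this example; only the environment variable name `REPLICATE_API_TOKEN` comes from the cell above.

```python
import os
from getpass import getpass


def set_replicate_token(token: str, env=os.environ) -> None:
    """Store the token under REPLICATE_API_TOKEN after basic sanity checks."""
    token = token.strip()
    # Reject empty input and unedited placeholder text
    # (placeholders contain spaces or read YOUR_API_TOKEN).
    if not token or " " in token or token == "YOUR_API_TOKEN":
        raise ValueError("Please provide your real Replicate API token.")
    env["REPLICATE_API_TOKEN"] = token


def prompt_for_token() -> None:
    """Ask for the token without echoing it, then store it."""
    set_replicate_token(getpass("Replicate API token: "))
```

In the notebook, calling `prompt_for_token()` once would replace the assignment above.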

# 🚩 Definition of the helper functions for file reading, embedding, context search and response generation, plus the sentence transformer model -> The text is split into chunks in the `sentences = ...` line of `create_embeddings`. Changing how the text is chunked, or increasing or decreasing the chunk size, may lead to more accurate responses. E.g. for fixed-size chunks of 512 characters you can type the following: sentences = [text[i:i+512] for i in range(0, len(text), 512)]

In [None]:
def read_file(file_path):
    """
    Reads a PDF or text file and extracts text from it, including OCR for images.
    """
    text = ""
    try:
        if file_path.lower().endswith('.pdf'):
            reader = PyPDF2.PdfReader(file_path)
            for page_num, page in enumerate(reader.pages):
                text += page.extract_text() or ""

                # OCR any images embedded in the page. `page.images` (PyPDF2 3.x)
                # resolves indirect objects and decodes each image into file bytes
                # that Pillow can open.
                try:
                    for image in page.images:
                        img = Image.open(io.BytesIO(image.data))
                        text += pytesseract.image_to_string(img)
                except Exception as img_error:
                    # A failing image should not discard the text of the whole file
                    print(f"Error processing images on page {page_num}: {img_error}")
        elif file_path.lower().endswith('.txt'):
            with open(file_path, 'r', encoding='utf-8') as file:
                text = file.read()
        else:
            raise ValueError("Unsupported file format. Only PDF and TXT files are supported.")
    except Exception as e:
        print(f"Error reading file: {e}")
    return text

def read_files(file_paths):
    """
    Reads multiple PDF or text files and combines their text content.
    """
    combined_text = ""
    for file_path in file_paths:
        combined_text += read_file(file_path) + "\n"
    return combined_text

def create_embeddings(text, model):
    """
    Creates embeddings for the given text using the specified model.
    """
    try:
        # Split text into sentences and drop empty fragments
        sentences = [s.strip() for s in text.split('. ') if s.strip()]
        embeddings = model.encode(sentences)
        return sentences, embeddings
    except Exception as e:
        print(f"Error creating embeddings: {e}")
        return [], np.array([])

def search_context(query, sentences, embeddings, model, top_k=5):
    """
    Searches for relevant context for the given query using FAISS.
    """
    try:
        query_embedding = model.encode([query])[0]
        # Ensure the embeddings are in the correct shape
        embeddings = np.array(embeddings).astype('float32')
        index = faiss.IndexFlatL2(embeddings.shape[1])
        index.add(embeddings)
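        # FAISS expects float32 queries; never request more neighbours than exist
        query_vector = np.array([query_embedding], dtype='float32')
        D, I = index.search(query_vector, k=min(top_k, len(sentences)))
        # FAISS pads missing results with -1, so skip those indices
        return [sentences[i] for i in I[0] if i >= 0]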
    except Exception as e:
        print(f"Error searching for context: {e}")
        return []

def calculate_cosine_similarity(query, embeddings, model):
    """
    Calculates cosine similarity between the query and all sentences.
    """
    try:
        query_embedding = model.encode([query])
        similarities = cosine_similarity(query_embedding, embeddings)
        return similarities[0]  # Return similarity scores for the query
    except Exception as e:
        print(f"Error calculating cosine similarity: {e}")
        return []

def generate_llama2_response(prompt_input, context, pre_prompt):
    """
    Generates a response using the LLaMA2 model.
    """
    try:
        prompt_with_context = f"{pre_prompt}\n\nContext: {context}\n\nUser: {prompt_input}\nAssistant: "
        output = replicate.run(
            'a16z-infra/llama13b-v2-chat:df7690f1994d94e96ad9d568eac121aecf50684a0b0963b25a41cc40061269e5',
            input={"prompt": prompt_with_context, "temperature": 0, "top_p": 1, "max_length": 128, "repetition_penalty": 1}
        )
        response = ''.join(output)
        return response
    except Exception as e:
        print(f"Error generating response: {e}")
        return ""

# Initialize the Sentence Transformer model
sentence_model = SentenceTransformer('all-MiniLM-L6-v2')  # Embedding model used for retrieval (not the LLM)
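
The chunk-size suggestion in the 🚩 note for this cell can be wrapped in a small helper. The following is a sketch, not part of the original notebook: `chunk_text` is a hypothetical name, and the `overlap` parameter is an added assumption that lets neighbouring chunks share some characters so that text cut at a boundary keeps part of its surrounding context. With `overlap=0` it reduces to the list comprehension from the note.

```python
def chunk_text(text: str, chunk_size: int = 512, overlap: int = 64) -> list[str]:
    """Split text into chunks of at most chunk_size characters.

    Consecutive chunks share `overlap` characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap  # distance between the starts of two chunks
    chunks = [text[i:i + chunk_size] for i in range(0, len(text), step)]
    # Drop chunks that contain only whitespace.
    return [chunk for chunk in chunks if chunk.strip()]


print(chunk_text("abcdefghij", chunk_size=4, overlap=2))
```

To try it, the `sentences = ...` line in `create_embeddings` could be replaced by `sentences = chunk_text(text)`.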



# 🚩 Definition of the pre-prompt. Insert any pre-prompt that fits your queries. One example is already given below. Feel free to change it or add more in the same format.

In [None]:
# New pre-prompt
pre_prompt = (
    "You are a specialized financial analysis assistant with expertise in interpreting and summarizing financial analyst reports found in the 'sample_data' folder. "
    "Your task is to provide data-driven answers based on this document, ensuring that your responses are directly relevant to the user's queries. "
    "Keep your responses short, concise, and organized into clear paragraphs to facilitate understanding. "
    "Focus on delivering precise and accurate financial insights exactly as stated in the document."
)
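
To see how the pre-prompt ends up in the text sent to the model, the sketch below rebuilds the prompt with the same f-string layout that `generate_llama2_response` uses. The function name `build_prompt` and the sample strings are made up for illustration.

```python
def build_prompt(pre_prompt: str, context: str, question: str) -> str:
    """Mirror the prompt layout used in generate_llama2_response."""
    return f"{pre_prompt}\n\nContext: {context}\n\nUser: {question}\nAssistant: "


example = build_prompt(
    pre_prompt="You are a financial analysis assistant.",  # shortened, hypothetical pre-prompt
    context="Revenue rose 5% year over year.",  # made-up context for illustration
    question="How did revenue develop?",
)
print(example)
```

Printing the assembled prompt like this is a quick way to check that a new pre-prompt reads well together with the retrieved context.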

# Query handling

In [None]:
def handle_query(prompt_input):
    """
    Handles user queries by retrieving relevant context and generating responses.
    """
    context = search_context(prompt_input, sentences, embeddings, sentence_model, top_k=5)
    context_text = " ".join(context)
    response = generate_llama2_response(prompt_input, context_text, pre_prompt)
    return response
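
When an answer looks wrong, it helps to see which passages were retrieved. The sketch below is a variant of the query handler that returns the passages together with the answer. It is an illustration under assumptions: `answer_with_sources` and the two stand-in functions are hypothetical, and the retrieval and generation steps are passed in as arguments so the example runs without the embedding model or the Replicate API.

```python
from typing import Callable, Sequence


def answer_with_sources(
    question: str,
    retrieve: Callable[[str], Sequence[str]],
    generate: Callable[[str, str], str],
) -> dict:
    """Run retrieval, then generation, and return both for inspection."""
    passages = list(retrieve(question))
    if not passages:
        # Nothing was retrieved, so there is no document context to answer from.
        return {"answer": "", "sources": []}
    answer = generate(question, " ".join(passages))
    return {"answer": answer, "sources": passages}


# Stand-ins for the real retrieval and generation functions.
def fake_retrieve(question):
    return ["Passage A.", "Passage B."]


def fake_generate(question, context):
    return f"(answer to {question!r} using {len(context)} characters of context)"


result = answer_with_sources("What is the thesis about?", fake_retrieve, fake_generate)
print(result["answer"])
print(result["sources"])
```

In the notebook, `retrieve` could be `lambda q: search_context(q, sentences, embeddings, sentence_model)` and `generate` could be `lambda q, c: generate_llama2_response(q, c, pre_prompt)`.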


# 🚩 Locate your file in the "/content/sample_data" folder in the file browser on the left. Right-click the document and click "Copy path". Insert the path into the `file_paths` list in the code below.

# Example: file_paths = ["/content/sample_data/JP Morgan BMW@GR BMW Q3’23 First Take strong quarter.pdf"]

In [None]:
# Modify the file paths as needed
file_paths = [
    "/content/Chicago Paper.pdf",
    "/content/MAS thesis.pdf"
]

file_text = read_files(file_paths)
sentences, embeddings = create_embeddings(file_text, sentence_model)

Error reading file: 'IndirectObject' object has no attribute 'get'
Page 0: No images or unsupported format
Page 1: No images or unsupported format
Page 2: No images or unsupported format
Page 3: No images or unsupported format
Page 4: No images or unsupported format
Page 5: No images or unsupported format
Page 6: No images or unsupported format
Page 7: No images or unsupported format
Page 8: No images or unsupported format
Page 9: No images or unsupported format
Page 10: No images or unsupported format
Page 11: No images or unsupported format
Page 12: No images or unsupported format
Page 13: No images or unsupported format
Page 14: No images or unsupported format
Page 15: No images or unsupported format
Page 16: No images or unsupported format
Page 17: No images or unsupported format
Page 18: No images or unsupported format
Page 19: No images or unsupported format
Page 20: No images or unsupported format
Page 21: No images or unsupported format
Page 22: No images or unsupported format
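
Because `read_file` only prints a message when something goes wrong, a mistyped path can go unnoticed. A sketch of a check that could run before `read_files` is shown below; the helper name `check_file_paths` is an assumption, and the two accepted extensions are the ones `read_file` supports.

```python
import os

SUPPORTED_EXTENSIONS = (".pdf", ".txt")  # the two formats read_file accepts


def check_file_paths(file_paths):
    """Split file_paths into (usable, problems) before reading anything."""
    usable, problems = [], []
    for path in file_paths:
        if not path.lower().endswith(SUPPORTED_EXTENSIONS):
            problems.append((path, "unsupported extension"))
        elif not os.path.isfile(path):
            problems.append((path, "file not found"))
        else:
            usable.append(path)
    return usable, problems
```

For example, `usable, problems = check_file_paths(file_paths)` followed by printing `problems` shows which entries need fixing before any embeddings are created.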


In [None]:
# Calculate cosine similarity between the query and sentences
# (requires `prompt_input`, which is set in the 🚩 prompt cell further down; run that cell first)
similarity_scores = calculate_cosine_similarity(prompt_input, embeddings, sentence_model)
print(f"Cosine Similarity Scores:\n{similarity_scores}")

Cosine Similarity Scores:
[0.16548398 0.03780279 0.3223809  ... 0.07290286 0.00178354 0.16407873]
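
The raw score array is hard to read on its own. The sketch below pairs each score with its sentence and keeps the best matches. The function name `top_matches` and the demo values are made up for illustration. Note that `search_context` ranks by L2 distance rather than cosine similarity; the two rankings coincide when all embedding vectors have unit length.

```python
def top_matches(scores, sentences, k=5):
    """Return the k (score, sentence) pairs with the highest scores."""
    if len(scores) != len(sentences):
        raise ValueError("scores and sentences must have the same length")
    ranked = sorted(zip(scores, sentences), key=lambda pair: pair[0], reverse=True)
    return [(float(score), sentence) for score, sentence in ranked[:k]]


# Made-up scores and sentences for illustration only.
demo_scores = [0.17, 0.04, 0.32, 0.07]
demo_sentences = ["first", "second", "third", "fourth"]
for score, sentence in top_matches(demo_scores, demo_sentences, k=2):
    print(f"{score:.2f}  {sentence}")
```

In the notebook this could be called as `top_matches(similarity_scores, sentences)`.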


In [None]:
# Retrieve relevant context and generate a response
# (the same steps as handle_query; requires `prompt_input` from the 🚩 prompt cell further down)
context = search_context(prompt_input, sentences, embeddings, sentence_model, top_k=5)
response = generate_llama2_response(prompt_input, " ".join(context), pre_prompt)

# 🚩 Insert your prompt in prompt_input and run the code

In [None]:
# Prompt
prompt_input = "What is the MAS thesis about?"
response = handle_query(prompt_input)

print(f"Assistant:\n{response}")

Assistant:
 Sure, I'd be happy to help! Based on the provided document, the MAS thesis appears to be about the Macro Analyst System (MAS), which is a systematic approach to financial analysis that involves several stages, including financial modeling, sentiment analysis, risk assessment, and scenario analysis. The thesis also discusses common errors encountered in the MAS and their origins and impacts. Additionally, the thesis highlights the core principles of the MAS, which include autonomy, local views, decentralization, and cooperation. Is there anything specific you would like to know about the MAS thesis?
