### Project: Multilingual RAG Model for Finnish and English Language Support


---



#### Description:

**This project involves the development of a sophisticated Retrieval-Augmented Generation (RAG) model capable of providing accurate, contextually relevant responses in both Finnish and English. The model leverages state-of-the-art natural language processing (NLP) techniques, integrating retrieval mechanisms to augment the generative process, ensuring that the answers are not only relevant but also linguistically appropriate in both languages. The goal is to create a seamless experience for users interacting in either language, allowing for efficient cross-lingual capabilities in real-time applications**.

**The solution employs custom data pipelines, document loaders, and a multilingual embedding model that covers both Finnish and English. By combining a vector database with context-aware retrieval, the system generates responses tailored to the user’s input, regardless of language**.



---



### Installing Libraries

In [None]:
'''
!pip install langchain
!pip install -U langchain-community
!pip install -U langchain-chroma
!pip install -qU pypdf
!pip install -qU langchain-cohere
!pip install langid
!pip install -U deep-translator
!pip install python-dotenv
'''

### Import Dependencies



In [2]:
from langchain_community.document_loaders import PyPDFLoader # Importing PyPDFLoader to handle loading PDF files as documents.
from langchain.text_splitter import RecursiveCharacterTextSplitter # Importing RecursiveCharacterTextSplitter to split text content into smaller chunks recursively, based on character count.
from langchain_cohere import ChatCohere # Importing Cohere model for language modeling tasks, allowing the use of Cohere’s language generation capabilities.
from langchain_cohere import CohereEmbeddings # Importing CohereEmbeddings for creating embeddings using Cohere, which can be useful for similarity searches or semantic understanding.
from langchain_chroma import Chroma # Importing Chroma vector store from Langchain to store and retrieve embeddings, enabling vector-based similarity searches.
import os # Importing os module for operating system functionalities.
import langid # Importing langid for language detection functionalities.
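from langchain.prompts import PromptTemplate # Importing PromptTemplate to build the prompt from the retrieved context and the query.
from langchain_core.runnables import RunnableSequence  # Importing RunnableSequence to chain a prompt and an LLM.
from deep_translator import GoogleTranslator # Importing GoogleTranslator to translate retrieved text into the query language.
from dotenv import load_dotenv # Importing load_dotenv to load environment variables from a .env file.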
load_dotenv()


True

### Implementation: Step-by-Step

## Loading in the Data
---
#### 1. The first step is to load the documents, using the LangChain community `PyPDFLoader`.

In [3]:
# Function to load PDF files from a directory and extract pages asynchronously
async def load_pdfs_from_directory(directory_path):
    """
    Loads all PDF files in the given directory and extracts their pages asynchronously.

    Args:
        directory_path (str): Path to the directory containing PDF files.

    Returns:
        dict: A dictionary where keys are file names and values are the list of pages extracted from each PDF.
    """
    # Dictionary to store pages extracted from each PDF file
    pdf_pages = {}

    # Iterate over each file in the directory
    for file_name in os.listdir(directory_path):
        # Check if the file is a PDF (you can adjust this based on your file extensions)
        if file_name.endswith(".pdf"):
            file_path = os.path.join(directory_path, file_name)
            loader = PyPDFLoader(file_path=file_path)
            pages = []

            # Asynchronously load pages from the PDF file
            async for page in loader.alazy_load():
                pages.append(page)

            # Store the extracted pages in the dictionary, using the file name as the key
            pdf_pages[file_name] = pages

    # Return the dictionary containing pages for each file
    return pdf_pages

# Input Directory path
directory_path = input("Please enter the path to the directory containing your PDF files: ")
pdf_data = await load_pdfs_from_directory(directory_path)
print("Documents Successfully Added in pdf_data dictionary Thank you :)")
print(f'There are Total {len(pdf_data.keys())} PDF files present in the directory at the moment')

Documents Successfully Added in pdf_data dictionary Thank you :)
There are Total 2 PDF files present in the directory at the moment
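The extension check in `load_pdfs_from_directory` is case-sensitive, so a file named `REPORT.PDF` would be skipped. Below is a minimal sketch of a case-insensitive listing that uses only the standard library. `list_pdf_files` is a hypothetical helper introduced here for illustration; it is not part of the notebook.

```python
from pathlib import Path


def list_pdf_files(directory_path):
    """Return the PDF paths in a directory, matching the extension case-insensitively.

    Hypothetical helper: only regular files directly inside the directory are considered.
    """
    directory = Path(directory_path)
    return sorted(
        path for path in directory.iterdir()
        if path.is_file() and path.suffix.lower() == ".pdf"
    )
```

The returned paths could then be passed to `PyPDFLoader(file_path=str(path))` in place of the `os.listdir` loop.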


## Chunking
---
#### 2. The second step is to split the documents into chunks so that each piece stays within the model's token limit.

In [4]:
# Function to split documents into chunks of a given size and overlap
def split_documents(pdf_data, chunk_size=500, chunk_overlap=25):
    """
    Splits the extracted PDF pages into smaller chunks using RecursiveCharacterTextSplitter.

    Args:
        pdf_data (dict): Dictionary containing PDF file names as keys and list of Document objects (each representing a page) as values.
        chunk_size (int): The size of each chunk of text.
        chunk_overlap (int): The overlap between consecutive chunks.

    Returns:
        dict: A dictionary where the keys are file names and values are lists of split document chunks.
    """
    # Initialize the text splitter
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,        # Size of each chunk in characters
        chunk_overlap=chunk_overlap,  # Overlap between consecutive chunks
        length_function=len,          # Function used to measure text length
        add_start_index=True,         # Record each chunk's start index in its metadata
    )

    # Dictionary to store the split documents
    split_docs = {}

    # Iterate over each PDF file's data
    for file_name, documents in pdf_data.items():
        # Split the documents (pages) into chunks
        chunks = text_splitter.split_documents(documents)

        # Store the chunks in the dictionary
        split_docs[file_name] = chunks

    return split_docs

# Now split the documents into chunks
split_pdf_data = split_documents(pdf_data)

# Print the result
print(f"Total split documents: {sum([len(chunks) for chunks in split_pdf_data.values()])}")



Total split documents: 528
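To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a simplified sketch of fixed-size character windows with overlap. It is not how `RecursiveCharacterTextSplitter` works internally: the real splitter also tries to break on separators such as paragraphs, lines, and spaces, so its chunks vary in length. `chunk_text` is a hypothetical name used only for this illustration.

```python
def chunk_text(text, chunk_size=500, chunk_overlap=25):
    """Split text into fixed-size character windows that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


# Small values make the overlap visible: each chunk repeats the last
# character of the previous one.
example_chunks = chunk_text("abcdefghij", chunk_size=4, chunk_overlap=1)
```

With the notebook's defaults (500 and 25), each window advances 475 characters, so the last 25 characters of one chunk reappear at the start of the next.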


## Combining Documents

In [5]:
# This function combines the chunks from all PDFs into one list so they can be embedded and stored in a vector database.
def combine_documents(split_pdf_data):
    """
    Combines all the PDF documents stored in a dictionary into a single list.

    Args:
        split_pdf_data (dict): A dictionary where the values are lists of text chunks or pages from the PDFs.

    Returns:
        list: A single list containing all the text chunks/pages from all PDFs.
    """
    return sum(split_pdf_data.values(), [])

# Assuming split_pdf_data is a dictionary where the values are lists of text chunks or pages
combine_all_documents = combine_documents(split_pdf_data)
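`sum(lists, [])` builds a new list on every addition, which becomes slow when there are many files. A possible alternative, sketched here under the assumption that the input is the same dictionary of lists, is `itertools.chain.from_iterable`. `flatten_chunks` is a hypothetical name, not a function from the notebook.

```python
from itertools import chain


def flatten_chunks(chunks_by_file):
    """Flatten a {file_name: [chunk, ...]} dictionary into one list, preserving order."""
    return list(chain.from_iterable(chunks_by_file.values()))
```

For two PDFs the difference is negligible; the alternative mainly matters as the document base grows.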

## Embeddings and Data Storage
---
#### 3. The third step is to embed the chunks and store them in a database. Chroma is used as the vector store; the multilingual capability comes from Cohere's `embed-multilingual-v3.0` embedding model.

In [6]:
# This function embeds the documents and stores them in a Chroma vector database.
def store_documents_in_chroma(documents, api_key, persist_directory="./chroma_storage"):
    """
    Stores documents in a Chroma database using multilingual embeddings from Cohere.

    Args:
        documents (list): List of documents to be embedded and stored in the Chroma database.
        api_key (str): API key for accessing Cohere's embedding service.
        persist_directory (str, optional): Directory path to save the Chroma database. Defaults to "./chroma_storage".

    Returns:
        None
    """

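    # Set up environment variable for Cohere API key (fail early with a clear message if it is missing)
    if not api_key:
        raise ValueError("COHERE_API_KEY is not set; add it to your .env file or environment.")
    os.environ["COHERE_API_KEY"] = api_key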

    # Initialize the Cohere embedding model with multilingual capabilities
    embedding_model = CohereEmbeddings(model="embed-multilingual-v3.0")

    # Embed the documents and store them in a Chroma vector store saved under persist_directory
    chroma_db = Chroma.from_documents(
        documents=documents,
        embedding=embedding_model,
        persist_directory=persist_directory,
    )

    # With persist_directory set, langchain_chroma writes the collection to disk automatically,
    # so no explicit chroma_db.persist() call is needed.

    # Output confirmation message
    print("Documents successfully stored in Chroma DB :)")

# Example usage
api_key = os.environ.get('COHERE_API_KEY')
store_documents_in_chroma(combine_all_documents, api_key)


Documents successfully stored in Chroma DB :)


## Retrieval Using Vector Similarity Search


---

#### 4. Here I use vector similarity search, which computes a distance metric between the query vector and the indexed vectors in the database. This method is typically more effective than keyword matching for retrieving information that is contextually relevant to the prompt.
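As an illustration of the idea, here is a minimal pure-Python sketch of ranking indexed vectors against a query vector. Cosine similarity is assumed as the metric for this example; the metric Chroma actually applies depends on how the collection is configured. `cosine_similarity` and `top_k` are hypothetical helpers, and the tiny two-dimensional vectors stand in for real embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vector, indexed_vectors, k=3):
    """Return the k (doc_id, score) pairs with the highest cosine similarity."""
    scored = [
        (doc_id, cosine_similarity(query_vector, vector))
        for doc_id, vector in indexed_vectors.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy index: "a" points the same way as the query, "c" partly, "b" not at all.
toy_index = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
ranked = top_k([1.0, 0.0], toy_index, k=2)
```

A vector database performs the same ranking, but over high-dimensional embeddings and usually with an approximate index instead of a full scan.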



In [7]:
# This Function loads the Chroma DB and queries the relevant documents based on the query text.
def load_and_query_chroma(persist_directory="./chroma_storage", query_text="none", k=3):
    """
    Loads a persisted Chroma database and runs a similarity search on it.
    Automatically detects the language of the query and provides results in the same language.

    Args:
        persist_directory (str): Directory where the Chroma database is stored.
        query_text (str): The input query for similarity search.
        k (int): Number of nearest neighbors to retrieve from the search.

    Returns:
        str: The retrieved passages joined into one string (translated into the query language if needed), or a message indicating that no suitable match was found or that the query language is unsupported.
    """

    # Initialize the Cohere multilingual embedding model
    embedding_model = CohereEmbeddings(model="embed-multilingual-v3.0")

    # Load the Chroma DB from the persistence directory
    chroma_db = Chroma(persist_directory=persist_directory, embedding_function=embedding_model)

    try:
        # Detect the language of the query using langid
        query_language, _ = langid.classify(query_text)

        # Check if the language is supported (English or Finnish)
        if query_language not in ["en", "fi"]:
            return "This system currently supports queries in English and Finnish only."

    except Exception:
        return "Error detecting language. This system currently supports queries in English and Finnish only."

    # Perform similarity search
    results = chroma_db.similarity_search_with_relevance_scores(query=query_text, k=k)

    # Check if any results meet the relevance threshold
    if not results or results[0][1] < 0.3:
        if query_language == "fi":
            return "Emme pystyneet löytämään sopivaa vastausta tähän kyselyyn."
        else:  # Default to English for non-Finnish detected languages
            return "We are unable to find a suitable match for this query."

    # Extract page content from each result
    formatted_results = [result.page_content for result, score in results]

    # Join results into a single string
    joined_results = " ".join(formatted_results)

    # Detect the language of the results (if they are different from the query language, translate them)
    result_language, _ = langid.classify(joined_results)

    if query_language != result_language:
        # Translate results to the language of the query
        translated_results = GoogleTranslator(source=result_language, target=query_language).translate(joined_results)
        return translated_results
    else:
        # Return results in their original language if no translation is needed
        return joined_results
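`load_and_query_chroma` returns either retrieved context or a user-facing message as a plain string, so callers have to compare against exact message text to tell the two apart. One possible alternative, sketched below with hypothetical names (`RetrievalResult`, `pick_context`), keeps the status separate from the text. The sketch assumes the scored chunks arrive sorted by descending relevance, and its 0.3 threshold mirrors the one used above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RetrievalResult:
    status: str                    # "ok", "no_match", or "unsupported_language"
    context: Optional[str] = None  # joined passages when status is "ok"


def pick_context(scored_chunks, query_language, threshold=0.3, supported=("en", "fi")):
    """Decide what to hand to the LLM from (text, score) pairs sorted by descending score."""
    if query_language not in supported:
        return RetrievalResult(status="unsupported_language")
    if not scored_chunks or scored_chunks[0][1] < threshold:
        return RetrievalResult(status="no_match")
    joined = " ".join(text for text, _ in scored_chunks)
    return RetrievalResult(status="ok", context=joined)
```

A caller can then branch on `result.status` and choose the Finnish or English message itself, without depending on the wording of the retrieval function's strings.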


## Combining Context and Query

---
#### 5. I use the ChatCohere model to generate the response from the retrieved context and the query.


In [None]:
# Function to create a formatted prompt using specific context and question input
def create_formatted_prompt(context, question):
    """
    Generates a formatted prompt for answering questions in a structured and professional manner.

    Parameters:
    - context (str): The context or background information to base the answer on.
    - question (str): The question to be answered.

    Returns:
    - str: A formatted prompt with clear guidelines for the response format and language.
    """

    # Enhanced prompt template to ensure precise, relevant, and well-structured responses
    PROMPT_TEMPLATE = """
    You are a professional assistant with expertise in the subject matter. Your goal is to provide the most accurate, clear, and relevant response based on the context provided. Ensure the response is well-organized, insightful, and demonstrates a deep understanding of the topic.
    Respond in the same language in which the question was asked.
        
    Context:
    {context}
        
    Question:
    {question}
    
    """

    # Initialize the PromptTemplate with defined input variables for context and question
    prompt_template = PromptTemplate(
        input_variables=["context", "question"],
        template=PROMPT_TEMPLATE
    )

    # Format the template with the given context and question, returning a final prompt
    return prompt_template.format(context=context, question=question)



In [9]:
# Main function to generate response
def generate_response(query):
    """
    Generates a response to a query by retrieving context from the Chroma database and
    generating an answer using a Cohere LLM.

    Parameters:
    - query (str): The user's question or input query.

    Returns:
    - str: The generated response based on the context and question.
    """
    # Retrieve context for the query from the Chroma database
    context = load_and_query_chroma(query_text=query)

    # load_and_query_chroma signals failures with fixed messages; handle them before calling the LLM
    if context == "Emme pystyneet löytämään sopivaa vastausta tähän kyselyyn.":
        return "Emme valitettavasti löytäneet sopivaa vastausta tähän kyselyyn. Voisitko ystävällisesti tarkentaa tai muotoilla kysymyksesi uudelleen, jotta voimme auttaa sinua paremmin?"  # Return the Finnish no-answer message
    elif context == "We are unable to find a suitable match for this query.":
        return "Unfortunately, we were unable to find a suitable answer to this query. Could you kindly clarify or rephrase your question so that we can assist you better?"  # Return the English no-answer message
    elif context in (
        "This system currently supports queries in English and Finnish only.",
        "Error detecting language. This system currently supports queries in English and Finnish only.",
    ):
        return context  # Unsupported or undetectable language: pass the message through instead of using it as context

    # Generate the formatted prompt
    prompt = create_formatted_prompt(context=context, question=query)

    # Initialize Cohere LLM (reads the COHERE_API_KEY environment variable)
    llm = ChatCohere()

    # The prompt is already fully formatted, so it is passed straight to the model.
    # Wrapping it in a second PromptTemplate would re-parse any literal braces
    # in the retrieved context as template variables.
    generated_answer = llm.invoke(prompt)
    return generated_answer.content
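One caveat when a prompt that already contains retrieved text is treated as a template again: literal braces in the context are parsed as placeholders. The sketch below reproduces the effect with plain `str.format`, on the assumption that the default f-string style of `PromptTemplate` behaves similarly. The context string is made up for illustration, and `escape_braces` is a hypothetical helper.

```python
def escape_braces(text):
    """Double literal braces so str.format-style templating leaves them untouched."""
    return text.replace("{", "{{").replace("}", "}}")


# Made-up context containing literal braces, as extracted PDF text sometimes does.
retrieved_context = "Shift codes are listed as {A, B, C} in the appendix."
already_formatted = f"Context:\n{retrieved_context}\n\nQuestion:\nWhich shift codes exist?"

# Formatting the finished prompt a second time treats "{A, B, C}" as a placeholder.
try:
    already_formatted.format()
    reparse_failed = False
except (KeyError, IndexError, ValueError):
    reparse_failed = True

# Escaping the braces first makes the second pass a no-op.
safe_prompt = escape_braces(already_formatted).format()
```

Formatting the template once and sending the resulting string directly to the model avoids the second pass altogether.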



In [10]:
# Enter a question and the chatbot answers it from the indexed documents
# Ask in English
generate_response("What are the provisions for uninterrupted two-shift and three-shift work?")

'**Uninterrupted Two-Shift Work:**\n- The context mentions that the transition to uninterrupted two-shift work will be agreed upon locally, suggesting that the specific terms may vary depending on the location or the agreement between the parties involved.\n- It is stated that this arrangement will not come into effect before 01.12.2021, indicating a specific start date for the implementation of the two-shift system.\n- The company will have uninterrupted access to working time systems related to two-shift work, implying that there are established systems or regulations in place to manage this type of work schedule.\n\n**Uninterrupted Three-Shift Work:**\n- Regular working hours for uninterrupted three-shift work are defined as a maximum of eight hours per day and no more than 48 hours per week.\n- The weekly working time is further clarified as a maximum of eight weekly days, which might refer to a specific scheduling arrangement where the total working hours for the week are distribu

In [11]:
# Ask in Finnish
generate_response("Mitkä ovat keskeytymättömän kaksi- ja kolmivuorotyön määräykset?")

'**Keskeytymättömän Kaksivuorotyön (KKT) Määräykset:**\n- Keskeytymättömän kaksivuorotyön ehdot ja siirtyminen keskeytymättömään kaksivuorotyöhön sovitaan paikallisesti sopimalla osapuolten välillä.\n- Osapuolet sopivat, että uudet määräykset eivät vaikuta ennen 01.12.2021 yrityksessä jo käytössä oleviin keskeytymättömän kaksivuorotyöhön liittyviin työaikajärjestelmiin.\n\n**Keskeytymättömän Kolmivuorotyön (KKT) Määräykset:**\n- Säännöllinen työaika keskeytymättömässä kolmivuorotyössä on enintään 8 tuntia vuorokaudessa ja 48 tuntia viikossa.\n- Viikoittainen työaika on rajoitettu enintään kahdeksaan viikkoa pidemmälle.\n\nNäiden määräysten tarkoitus on säätää työaikaa ja varmistaa, että työntekijät saavat riittävän levon ja työskentelevät turvallisissa ehdoissa, erityisesti vaatimallisissa vuorotyöskenteissä. Soveltamisohjeet antavat työnantajille ja työntekijöille selkeyttä siitä, miten noudattaa näitä määräyksiä ja soveltaa niitä paikallisesti sopivalla tavalla.'

## Where We Can Improve

##### The RAG bot answers questions from the indexed documents, but it currently draws on only two PDFs. Expanding the document base and testing a variety of prompts should improve its accuracy and reliability. Additionally, adjusting how much context is retrieved (for example, the chunk size and the number of retrieved chunks `k`) may yield more comprehensive answers.

---