# Notebook For LangChain Testing with Falcon 40b-Instruct

**Author:** <u>Sven STEINBAUER</u><br>
**Date:** $\underline{10^{th} Nov. 2023}$

---

## Necessary Imports / Libraries

In [2]:
# System Libs
import logging
import re
import json
import glob

# NLP Libs
# from ctransformers import AutoModelForCausalLM
from langchain.llms import CTransformers # for falcon, llama and mistral models
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain
from accelerate import Accelerator # to put llm inference with ctransformers on my Nvidia GPU

# NLP EDA Personal LIbs
from sst.pdf_extract import PDFExtract
from sst.ppt_extract import PPTExtract
from sst.sem_search import SemanticSearch

2023-11-11 12:33:23,377 - root - 33 - INFO - PyTorch is using: cuda
2023-11-11 12:33:23,378 - root - 36 - INFO - PyTorch version: 2.1.0


## Make Corpus

In [3]:
#================================#
# Build Corpus
#================================#
base_path = r"C:\Users\svens\00_Private\Knowledge\ECONOMICS-TheCode\Keynes"
field = "keynes"
lang = "en"
# ----------------------------------------------------
# PDF Alternative
#pdf_reader = PDFExtract(base_path=base_path, field=field)

# Write dictionary file as "key" and dictionary of pages and related content as "value", e.g. {"file1-path":{"page1":[text_content,..., page_content_general]...}, "file2-path":...}
#keynes_dict = pdf_reader.write_pdf_corpus_dict()

In [4]:
#if lang == "de":
    #with open(f"data/file_page_{field}_{lang}.json", "w", encoding="utf-8") as outfile:
        #json.dump(keynes_dict, outfile, ensure_ascii=False)
#else:
    #with open(f"data/file_sentence_{field}_{lang}.json", "w", encoding="utf-8") as outfile:
       # json.dump(keynes_dict, outfile, ensure_ascii=True)

In my case I have already run the extraction for all the PDFs I wanted, more than 400 pages in total. This, unfortunately, took a while. Each run of this module creates a corpus text file in the `data/` folder called `pdf_corpus_{field}.txt`. So, in order to speed up the code, I commented out the execution of the **PDFExtract** class and work directly on the text file it created.
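
The caching idea described above can be sketched as follows. This is a minimal sketch, not the notebook's actual code: `load_or_build_corpus` and `build_fn` are hypothetical names, and `build_fn` stands in for whatever slow extraction step produces the corpus text.

```python
from pathlib import Path


def load_or_build_corpus(corpus_path, build_fn):
    """Return the corpus text, running the slow extraction only if no cached file exists.

    `build_fn` is a hypothetical zero-argument callable that returns the corpus as a string.
    """
    path = Path(corpus_path)
    if path.exists():
        # cached corpus found: skip the expensive extraction
        return path.read_text(encoding="utf-8")
    # no cache yet: build the corpus once and store it for later runs
    text = build_fn()
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(text, encoding="utf-8")
    return text
```

With such a helper the extraction cell would not need to be commented out by hand; a second run would simply read the cached file.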

## EDA: Prepare Corpus

In [5]:
# For Cleaning content
def chain_sub(input_string, substitutions):
    # substitutions must be a list of tuples (r'expression_to_be_replaced', 'replacement_string')
    for pattern, replacement in substitutions:
        input_string = re.sub(pattern, replacement, input_string)
    return input_string
    
# Regex Pattern to be replaced
pat_and_replace=[
    (r"\n", ""),
    (r"[+*%?!'.;,@ʺ‎…`]", ''),
    (r"(´)", ''),
    (r"(\$)", 's'),
    (r"(<UNK>+)", ''),
    (r"(ñ+)", 'n'),
    (r"(\")", ''),
    (r"(/)", ' '),
    (r"(ş)", 's'),
    (r"(ă)", 'a'),
    (r"(†)", ''),
    (r"(ŭ)", 'u'),
    (r"(ŏ)", 'o'),
    (r"(ō)", 'o'),
    (r"(ī)", 'i'),
    (r"(ç)", 'c'),
    # replace runs of colons or hyphens with a space
    (r'(:+|-+)', ' '),
    # collapse repeated spaces into a single space
    (r" +", ' '),
]

corpus_path = f"data/pdf_corpus_{field}.txt"
clean_corpus_path = f"data/clean_corpus_{field}.txt"
# both files are closed automatically when the with-block ends
with open(corpus_path, encoding="utf-8") as c, open(clean_corpus_path, "w", encoding="utf-8") as clean_content_file:
    for line in c:
        clean_line = chain_sub(line.strip(), pat_and_replace)
        # ignore empty lines
        if clean_line.strip():
            clean_content_file.write(clean_line + "\n")
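
To see what the chained substitutions do to a single line, here is a small self-contained sketch. It repeats `chain_sub` and uses only a shortened, illustrative subset of the patterns above; the sample sentence is made up.

```python
import re


def chain_sub(input_string, substitutions):
    # apply each (pattern, replacement) pair in order
    for pattern, replacement in substitutions:
        input_string = re.sub(pattern, replacement, input_string)
    return input_string


# shortened, illustrative subset of pat_and_replace
demo_patterns = [
    (r"[+*%?!'.;,@]", ""),  # drop punctuation
    (r"(ç)", "c"),          # transliterate a special character
    (r"(:+|-+)", " "),      # colons and hyphens become spaces
    (r" +", " "),           # collapse repeated spaces
]

sample = "Trade-cycles: the façade,  explained!"
cleaned = chain_sub(sample, demo_patterns)
print(cleaned)
```

The order of the tuples matters: the space-collapsing pattern comes last so that spaces introduced by earlier replacements are merged as well.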

## Semantic Search with Torch & SentenceTransformer: Calculate Embeddings, Match Query & Corpus

Once we have a clean corpus, we split it into a list of sentences that can be passed to the `SentenceTransformer` class. It calculates embeddings for the corpus sentences and for the query, which we then compare via the **cosine similarity metric**: the best matches are the corpus vectors that form the smallest angle with the query vector.

We want to implement this *similarity search* via the `SemanticSearch` Class which I have written.

Specifically, we get back a dataframe with the top-k entries, each holding the matching corpus passage and its similarity score.
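
The ranking step can be illustrated without any model. The following is a minimal sketch in plain Python, not the `SemanticSearch` implementation: the three-dimensional vectors are invented stand-ins for real sentence embeddings, and `cosine_similarity` and `top_k_matches` are hypothetical helper names.

```python
import math


def cosine_similarity(a, b):
    # cos(angle) = dot(a, b) / (|a| * |b|); assumes neither vector is all zeros
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k_matches(query_vec, corpus_vecs, passages, k):
    # score every passage against the query and keep the k highest scores
    scored = [
        (cosine_similarity(query_vec, vec), passage)
        for vec, passage in zip(corpus_vecs, passages)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# invented toy data: real embeddings have hundreds of dimensions
passages = ["cycles of trade", "theory of money", "price of wheat"]
corpus_vecs = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.1], [0.0, 0.2, 0.9]]
query_vec = [1.0, 0.2, 0.0]

hits = top_k_matches(query_vec, corpus_vecs, passages, k=2)
for score, passage in hits:
    print(f"{score:.3f}  {passage}")
```

A score close to 1 means a small angle between query and passage, so the highest-scoring passage is treated as the best match.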

In [6]:
query = "Please explain the theory of business cycles."
top_k_sent = 5
#===============================================#
# TRANSFORM CORPUS ACCORDINGLY
#===============================================#
def get_list_of_sentences(file):
    # read the whole clean corpus and return it as a list with one entry per line
    with open(file, "r", encoding="utf-8") as f:
        corpus = f.read()
    return corpus.split("\n")

file = f"data/clean_corpus_{field}.txt"
corpus = get_list_of_sentences(file=file)

In [7]:
#===============================================#
# DO SEMANTIC SEARCH AND RETURN BEST MATCHES
#===============================================#
sem_search = SemanticSearch(query=query, top_k_sent=top_k_sent, corpus=corpus)
search_result = sem_search.do_semantic_search()

2023-11-11 12:33:29,147 - root - 57 - INFO - ##################################################

2023-11-11 12:33:29,148 - root - 58 - INFO - GIVEN PARAMETERS
----------------------------------------------------------------------------------------------------------------------
2023-11-11 12:33:29,148 - root - 59 - INFO - QUERY: Please explain the theory of business cycles.
2023-11-11 12:33:29,149 - root - 60 - INFO - NUMBER OF TOP RESULTS SHOWN: 5
2023-11-11 12:33:29,150 - root - 61 - INFO - Corpus is up to you...
2023-11-11 12:33:29,151 - root - 62 - INFO - ##################################################

2023-11-11 12:33:29,151 - sentence_transformers.SentenceTransformer - 66 - INFO - Load pretrained SentenceTransformer: C:\Users\svens\02_DataScience\00_ML\NLP\openModels\paraphrase-multilingual-MiniLM-L12-v2


2023-11-11 12:33:41,887 - root - 96 - INFO - Embeddings calculated...

2023-11-11 12:33:41,906 - root - 129 - INFO - Top Hits from the SEMANTIC SEARCH MODEL.
2023-11-11 12:33:41,906 - root - 130 - INFO - ---------------------------------------------------------------------------



In [8]:
# best match
best_match = str(search_result['corpus_passage'].iloc[0])
best_match

'cycles having a regular phase has been founded The same thing is true of prices which'

## Find Context in your Corpus based on Semantic Search Result

In [9]:
#===============================================#
# FIND CORRESPONDING PAGE CONTENT RELATED TO BEST MATCH SENTENCE
#===============================================#
# load the dictionary {file-path: {page: [..., page_content]}} written by PDFExtract
with open(f"data/file_page_{field}_{lang}.json", "r", encoding="utf-8") as infile:
    file_page_cont_dict = json.load(infile)
# next, collect the keys (= file names) and the page dictionaries

# All files collected; those are the keys
key_list = list(file_page_cont_dict.keys())

# All page dictionaries of all files
page_dict_list = list(file_page_cont_dict.values())

In [10]:
def find_context_data(best_match, key_list, file_page_cont_dict):
    """Return ([clean page text], file key, page key) for the first page that contains best_match."""
    # escape the best match so that it is searched as literal text, not as a regex
    pattern = re.compile(re.escape(best_match), re.IGNORECASE)
    for file_key in key_list:
        # get the page dictionary of the individual file
        pages_dictionary = file_page_cont_dict[file_key]
        # iterate through the pages (as keys); the element at index 4 of each page list
        # holds the whole page content as text
        for page_key, page_items in pages_dictionary.items():
            text = ''.join(page_items[4])
            # important step: clean the text like the corpus, e.g. remove the new-line characters '\n'
            clean_text = chain_sub(text, pat_and_replace)
            if pattern.search(clean_text):
                logging.info("match found")
                return [clean_text], file_key, page_key
    # no page contains the best match; avoids returning unbound variables
    return [], None, None

In [11]:
context_list, file, page_id = find_context_data(best_match=best_match, key_list=key_list, file_page_cont_dict=file_page_cont_dict)

2023-11-11 12:33:49,683 - root - 17 - INFO - match found
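
The lookup can be tried on a toy dictionary. This is a simplified sketch under assumptions: the file names, page keys and texts are invented, the page text is assumed to sit at index 4 of each page list as in the function above, the cleaning step is left out, and `find_page` is a hypothetical helper that uses a plain case-insensitive substring test instead of a regex.

```python
# invented stand-in for file_page_cont_dict: {file: {page: [..., page_text_at_index_4]}}
toy_corpus = {
    "file_a.pdf": {
        "page1": ["meta", "meta", "meta", "meta", "Money and prices are discussed here"],
        "page2": ["meta", "meta", "meta", "meta", "Wages adjust slowly in a slump"],
    },
    "file_b.pdf": {
        "page1": ["meta", "meta", "meta", "meta", "Trade cycles having a regular phase"],
    },
}


def find_page(best_match, file_page_dict, text_index=4):
    # return (file, page, text) of the first page containing best_match, else None
    needle = best_match.lower()
    for file_key, pages in file_page_dict.items():
        for page_key, items in pages.items():
            page_text = "".join(items[text_index])
            if needle in page_text.lower():
                return file_key, page_key, page_text
    return None


result = find_page("cycles having a regular phase", toy_corpus)
print(result)
```

Returning `None` for a miss makes the "nothing found" case explicit for the caller.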


In [12]:
context_list



## LLM Retrieval-Augmented Answer Generation (RAAG)

Once we have the appropriate context, we can use `LangChain` and `ctransformers` to initialize our **Large Language Model (LLM)**. The notebook was set up for the **Falcon-40b-Instruct Open Assistant Model**, but the cell below loads a **Mistral-7B-Instruct** GGUF model instead (the Falcon path is commented out). Then we set up a **prompt template which contains variables in curly brackets `{}`** and pass in the query together with the context found by our **Semantic Search**.

In [13]:
#from langchain.llms import LlamaCpp
#falcon_model = r"C:\Users\svens\02_DataScience\00_ML\NLP\openModels\falcon-40b-top1-560.ggccv1.q4_k.bin"
mistral_model = r"C:\Users\svens\02_DataScience\00_ML\NLP\openModels\mistral-7b-instruct-v0.1.Q8_0.gguf"
#llm = AutoModelForCausalLM.from_pretrained(falcon_model, model_type='falcon', threads=8, context_length=2048, max_new_tokens=1024)
llm_config = {'max_new_tokens': 1024, 'context_length': 5000, 'gpu_layers': 5, 'threads': 8}
llm = CTransformers(model=mistral_model, model_type="mistral", config=llm_config)

In [14]:
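# attempt to accelerate and send the llm to the gpu
# note: as far as I can tell, Accelerator.prepare() targets PyTorch objects (models, optimizers, dataloaders)
# and returns other objects unchanged; GPU offloading here is controlled by 'gpu_layers' in llm_config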
accelerator = Accelerator()
llm, llm_config = accelerator.prepare(llm, llm_config)

In [15]:
# Now make a use case specific template for your prompt
prompt_template = """Use the following given economic context in order to answer concisely the given query at the end. If you do not know the answer, please simply say that you do not know it and do not try to make up an answer.
{context}
Query: {query}
Answer:"""

prompt = PromptTemplate.from_template(prompt_template)
prompt.input_variables

['context', 'query']
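
How the curly-bracket variables get filled can be shown with plain Python string formatting. This sketch uses `str.format` and `string.Formatter` from the standard library as a stand-in, not LangChain itself; the shortened template and the example values are invented for illustration.

```python
from string import Formatter

# shortened stand-in for the prompt template above
template = "Context:\n{context}\nQuery: {query}\nAnswer:"

# collect the placeholder names, similar in spirit to prompt.input_variables
fields = [name for _, name, _, _ in Formatter().parse(template) if name]

# fill both placeholders with example values
filled = template.format(
    context="Cycles tend to reverse themselves.",
    query="What is a business cycle?",
)
print(fields)
print(filled)
```

The filled string ends with `Answer:`, so the model continues the text from exactly that point.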

In [16]:
# Use the LLM to generate an answer from the given context
# note: context_list[0][2000:] passes the page text from character 2000 onward as context
query_llm = LLMChain(llm=llm, prompt=prompt)
response = query_llm.run({"context": str(context_list[0][2000:]), "query": query})
print(response)

 The theory of business cycles explains the periodic fluctuations in economic activity, known as business cycles, that occur in capitalist economies. These fluctuations are caused by a combination of factors, including changes in consumer spending, investment, and interest rates. While business cycles can be severe and unstable, they tend to wear themselves out before reaching extremes and eventually reverse themselves. The theory of business cycles is based on the idea that these fluctuations are not only common but also predictable and cyclical in nature.
