# **Fine-Tuning a Local LLM on Your Data Using `LangChain`**  [<sup>*source*</sup>](https://archive.is/medium.com/towards-artificial-intelligence/finetuning-local-large-language-models-on-your-data-using-langchain-9229da66ad9b)

## Extracting Knowledge from Data using LLMs

![alt](resources/document%20understading%20and%20contex-answer%20generation%20using%20LMM.jpg)

LangChain offers a variety of features that simplify working with documents from many sources, including text files, PDF files, and tabular databases such as Google BigQuery. The workflow consists of the following steps:

1. Use LangChain **loaders** to import the desired documents.

2. Divide the documents into **smaller sections or chunks**.

3. Convert the text into **embeddings**, which represent the semantic meaning.

4. Store the embeddings in a **database**, specifically ChromaDB.

5. Conduct a semantic search to retrieve the most relevant content based on our query.

6. Incorporate the **retrieved information as context** into our Large Language Model (LLM).
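
Step 2 can be illustrated without any library. The following is a minimal sketch of fixed-size character chunking with overlap; the function name `split_into_chunks` and the `chunk_size` and `overlap` values are illustrative assumptions, not LangChain's own splitter, which additionally tries to break on separators such as newlines.

```python
def split_into_chunks(text: str, chunk_size: int = 40, overlap: int = 10) -> list[str]:
    """Split text into fixed-size character windows that overlap by `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


# Made-up sample text, used only for illustration.
sample = "patient: my throat is sore. doctor: gargle with warm salt water and stay hydrated."
chunks = split_into_chunks(sample)
for chunk in chunks:
    print(repr(chunk))
```

The overlap repeats the tail of one chunk at the head of the next, so a sentence cut at a chunk boundary still appears whole in at least one chunk when it is shorter than the overlap.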


In [None]:
# INSTALL DEPENDENCIES
# ! pip install langchain==0.0.163
# ! pip install pygpt4all==1.1.0
# ! pip install transformers
# ! pip install datasets
# ! pip install chromadb
# ! pip install tiktoken

## Download the dataset

Consider a scenario in which you, as a machine learning engineer, work with sensitive medical data: dialogs between patients and doctors.
Your objective is to build an application that uses these dialogs as a knowledge base and provides initial answers to basic patient queries before they are redirected to a doctor.


In [2]:
import pandas as pd
from datasets import load_dataset

# Download the medical_dialog dataset from Hugging Face
dataset = load_dataset('medical_dialog', 'processed.en')

df = pd.DataFrame(dataset['train'])

df.head()

| | description | utterances |
|---|---|---|
| 0 | throat a bit sore and want to get a good imune... | [patient: throat a bit sore and want to get a ... |
| 1 | hey there i have had cold "symptoms" for over ... | [patient: hey there i have had cold "symptoms"... |
| 2 | i have a tight and painful chest with a dry co... | [patient: i have a tight and painful chest wit... |
| 3 | what will happen after the incubation period f... | [patient: what will happen after the incubatio... |
| 4 | suggest treatment for pneumonia | [patient: just found out i was pregnant. yeste... |


In [3]:
dialog = []
# put each utterance on a separate row
# zip(*...) transposes the [patient, doctor] pairs into two parallel tuples
patient, doctor = zip(*df['utterances'])
for i in range(len(patient)):
  dialog.append(patient[i])
  dialog.append(doctor[i])

dialog_df = pd.DataFrame({"dialog": dialog})

# save the data to txt file
dialog_df.to_csv('resources/data.txt', sep=' ', index=False)
dialog_df.head()

| | dialog |
|---|---|
| 0 | patient: throat a bit sore and want to get a g... |
| 1 | doctor: during this pandemic. throat pain can ... |
| 2 | patient: hey there i have had cold "symptoms" ... |
| 3 | doctor: yes. protection. it is not enough symp... |
| 4 | patient: i have a tight and painful chest with... |
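
The reshaping in cell `In [3]` relies on `zip(*...)` to transpose a list of two-item exchanges. Here is a small sketch on made-up data; the two exchanges below are invented for illustration and assume every entry holds exactly one patient turn followed by one doctor turn.

```python
# Toy stand-in for df['utterances']: each entry is a [patient, doctor] pair.
utterances = [
    ["patient: my throat is sore", "doctor: stay hydrated"],
    ["patient: i have a dry cough", "doctor: monitor your temperature"],
]

# zip(*pairs) transposes the pairs into two parallel tuples.
patient, doctor = zip(*utterances)

# Interleave the turns so that each utterance ends up on its own row.
dialog = [turn for pair in zip(patient, doctor) for turn in pair]

for row in dialog:
    print(row)
```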


## Document embedding

First, a text loader is created for `data.txt`, the file that contains our domain data. The document is then split into multiple chunks and embedded using the `HuggingFaceEmbeddings` class from LangChain.

By default, `HuggingFaceEmbeddings` uses the sentence-transformers model **all-mpnet-base-v2**, which was trained on a dataset of 1B sentence pairs and maps sentences and paragraphs to a 768-dimensional dense vector space.

Finally, the embedded document is stored in a Chroma database using `VectorstoreIndexCreator`. This step also creates an index for the document, which allows efficient search and retrieval based on its embedded representation.

In [5]:
from langchain.document_loaders import TextLoader

from langchain.indexes import VectorstoreIndexCreator
from langchain.embeddings import HuggingFaceEmbeddings

# path to the dialog text file created above
loader = TextLoader('resources/data.txt')

# Embed the document and store it in Chroma
index = VectorstoreIndexCreator(embedding=HuggingFaceEmbeddings()).from_loaders([loader])

## Load the LLM 
In our case, we use `GPT4All-J`. The originally released GPT4All model had a research-only license, while the newer GPT4All-J is released under an *Apache-2 license*.

In [4]:
from langchain.llms import GPT4All
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler

# specify the path to the downloaded .bin file
# replace with your own local file path
local_path = 'F:\\DOCUMENTOS\\DATA_SCIENCE\\Large Language Models LLM\\ggml-gpt4all-j-v1.3-groovy.bin'
# Callbacks support token-wise streaming
callbacks = [StreamingStdOutCallbackHandler()]
# verbose=True is required so that output is passed to the callback manager
llm = GPT4All(model=local_path, callbacks=callbacks, verbose=True, backend='gptj')


Found model file at  F:\\DOCUMENTOS\\DATA_SCIENCE\\Large Language Models LLM\\ggml-gpt4all-j-v1.3-groovy.bin


## Similarity Search

Similarity search finds the documents or passages that are most similar to a given query or question. It retrieves relevant content based on the similarity of contextual meaning rather than on exact wording.

In our case, the **similarity search** is performed on the index object created earlier. We also specify that the search should return the 4 most similar documents (`k=4`).
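
To make the idea concrete, here is a minimal sketch of top-k retrieval by cosine similarity. It swaps the neural embeddings for plain word-count vectors, so it only captures word overlap rather than meaning; the helper names `embed`, `cosine` and `top_k` and the sample documents are assumptions for illustration, not part of LangChain or Chroma.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': a bag-of-words count vector (a stand-in for a real model)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query: str, documents: list[str], k: int = 2) -> list[str]:
    """Return the k documents most similar to the query, best match first."""
    query_vec = embed(query)
    ranked = sorted(documents, key=lambda doc: cosine(query_vec, embed(doc)), reverse=True)
    return ranked[:k]


# Made-up documents, used only for illustration.
documents = [
    "doctor: gargling with warm salt water may help a sore throat",
    "patient: i have a tight and painful chest with a dry cough",
    "doctor: wash your hands frequently and cover your cough",
]
results = top_k("what helps a sore throat", documents, k=2)
print(results)
```

A real vector store follows the same pattern, but the vectors come from an embedding model, so a query can match a passage that shares no words with it.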



In [9]:
query = "what is the solution for soar throat"

# perform similarity search and retrieve the context from our documents
results = index.vectorstore.similarity_search(query, k=4)
# join all context information (top 4) into one string
context = "\n".join([document.page_content for document in results])
print("Retrieving information related to your question...")
print(f"Found this content which is most similar to your question: \n {context}")


Retrieving information related to your question...
Found this content which is most similar to your question: 
 "doctor: in brief: standard precautions covid-19 is now official name for the illness caused by the newly discovered coronavirus (coronavirus infectious disease - 2019). so far it is extremely rare in the us (2/12/20). until and unless covid-19 becomes common no special precautions are necessary. in any dormitory or group living situation people with respiratory symptoms (colds, flu, etc.) should cover their coughs and wash hands frequently."
"patient: is gargling with listerine effective against corona virus induced sore throat? will it kill the virus? how about with mixture of warm water and salt, will this also kill virus!"
"doctor: gargling. you can't be sure but it may help if you do those things as well as using zinc lozenges at the first sign of any throat discomfort and stay hydrated also. i recommend them. at least it'll do no harm."
"doctor: in brief: standard preca

## Add context to LLM

After executing the **similarity search**, we join the content of the retrieved documents into a single string, the `context`. This context is then fed to GPT4All to generate an informed answer to our question.

Adding `context` to an LLM means including additional text snippets in the model's input so that it can generate more relevant responses. With relevant context, the model has a better understanding of the conversation or task at hand.

In our case, the context was retrieved by the similarity search, which returns the dialogs closest to the question the user asked. These dialogs are then fed to the LLM alongside the original question in order to generate the final answer.
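
Combining context and question amounts to string templating. Below is a minimal sketch in plain Python that reuses the prompt wording from this section; `build_prompt` and the sample snippets are hypothetical names and data, and LangChain's `PromptTemplate` is deliberately not used here.

```python
TEMPLATE = (
    "Please use the following context to answer questions.\n"
    "Context: {context}\n"
    "---\n"
    "Question: {question}\n"
    "Answer: Let's think step by step."
)


def build_prompt(snippets: list[str], question: str) -> str:
    """Join the retrieved snippets and place them in the prompt with the question."""
    context = "\n".join(snippets)
    return TEMPLATE.format(context=context, question=question)


# Made-up snippets standing in for the results of a similarity search.
snippets = [
    "patient: is gargling effective against a sore throat?",
    "doctor: gargling may help, and staying hydrated does no harm.",
]
prompt_text = build_prompt(snippets, "what is the solution for a sore throat")
print(prompt_text)
```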


In [10]:
from langchain import PromptTemplate, LLMChain


template = """
Please use the following context to answer questions.
Context: {context}
---
Question: {question}
Answer: Let's think step by step."""

prompt = PromptTemplate(
    template=template, input_variables=["context", "question"]).partial(
    context=context)

llm_chain = LLMChain(prompt=prompt, llm=llm)

print("Processing the information with gpt4all...\n")
print(llm_chain.run(query))


Processing the information with gpt4all...

 
1) First of all, it is important that we understand the symptoms of sore throat as well as its causes. The most common cause of a sore throat in adults and children are viral infections such as colds or flu viruses. However, there can be other reasons for this symptom too like allergies, acid reflux, asthma etc.
2) If you have been experiencing these symptoms for more than two days then it is important to consult with your doctor immediately. 
3) In case of a viral infection such as colds or flu viruses, the best way to treat sore throat would be gargling salt water and zinc lozenges which can help in reducing inflammation caused by virus. However, if you have been experiencing these symptoms for more than two days then it is important that we consult with your doctor immediately.
4) In case of allergies or acid reflux the best way to treat sore throat would be gargling salt water and zinc lozenges which can help in reducing inflammation ca

# **Another** [**example**](https://medium.com/ai-in-plain-english/fine-tuning-large-language-models-with-langchain-1cf453349001)

Suppose we have a collection of CVs in PDF format, and we want to use an LLM to extract information about the candidates or evaluate their suitability for a particular role.

In [None]:
# DEPENDENCIES
# !pip install chromadb
# !pip install langchain
# !pip install pypdf
# !pip install llama-index

In [None]:
import os
# If you use OpenAI models, set your OpenAI API key here
os.environ["OPENAI_API_KEY"] = "your OpenAI API key"

In [2]:

from langchain.document_loaders import PyPDFLoader
from langchain.indexes import VectorstoreIndexCreator
from langchain.embeddings import HuggingFaceEmbeddings

loader = PyPDFLoader('resources/cv_david_smith.pdf')

# initialize the vector index creator; the default embedding is OpenAIEmbeddings,
# so we pass HuggingFaceEmbeddings explicitly to keep everything local
index = VectorstoreIndexCreator(embedding=HuggingFaceEmbeddings()).from_loaders([loader])

To retrieve information from the document, we write our question as a query. The index object we just created has a `query` method, which makes it feel as if we were querying a database.
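
Conceptually, a call like `index.query(question, llm=llm)` bundles the two steps from the first example: retrieve the relevant chunks, then ask the LLM to answer from them. The sketch below mimics that flow with stand-ins; `retrieve`, `answer` and `echo_llm` are hypothetical helpers (a keyword-overlap retriever and a fake model that simply returns its prompt), and the chunks are made up. It is not LangChain's implementation.

```python
import re
from typing import Callable


def words(text: str) -> set[str]:
    """Lowercased set of alphabetic words in the text."""
    return set(re.findall(r"[a-z]+", text.lower()))


def retrieve(question: str, chunks: list[str], k: int = 1) -> list[str]:
    """Rank chunks by how many words they share with the question (toy retriever)."""
    question_words = words(question)
    ranked = sorted(chunks, key=lambda c: len(question_words & words(c)), reverse=True)
    return ranked[:k]


def answer(question: str, chunks: list[str], llm: Callable[[str], str], k: int = 1) -> str:
    """Retrieve context for the question, build a prompt, and pass it to the model."""
    context = "\n".join(retrieve(question, chunks, k))
    prompt = f"Context: {context}\n---\nQuestion: {question}\nAnswer:"
    return llm(prompt)


def echo_llm(prompt: str) -> str:
    """Fake model used only for this sketch: returns the prompt it received."""
    return prompt


# Made-up CV fragments, used only for illustration.
chunks = ["name: DAVID SMITH", "skills: python and sql", "education: msc in statistics"]
result = answer("which skills does the candidate have", chunks, echo_llm)
print(result)
```

Because the fake model echoes its input, the printed text shows exactly which chunk the retriever selected and how it was placed in the prompt.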

In [5]:
query = "what is the name of the candidate you have ?"
index.query(query, llm=llm)

 The name of the candidate I am referring to in this context is DAVID SMITH.

' The name of the candidate I am referring to in this context is DAVID SMITH.'

## Adding multiple documents

In [11]:
from langchain.document_loaders import PyPDFLoader
from langchain.indexes import VectorstoreIndexCreator
from langchain.embeddings import HuggingFaceEmbeddings

# one loader per CV; all documents are embedded into the same index
loaders = [PyPDFLoader('resources/cv_Jo Brown.pdf'), PyPDFLoader('resources/cv_david_smith.pdf')]
index = VectorstoreIndexCreator(embedding=HuggingFaceEmbeddings()).from_loaders(loaders)

In [12]:
index.query('Give me the names of the candidates you have',llm=llm)

 I am sorry but there is no information provided in the given context about any candidate.

' I am sorry but there is no information provided in the given context about any candidate.'

In [14]:
# query in Spanish: "what is CNN in computer science?"
print(index.query('que es CNN en ciencias de la computacion ',llm=llm))


 CNN stands for "Computer Networks" in Spanish, which is the name of an online news website that focuses on computer science topics such as programming languages and software development. CNN stands for "Computer Networks" in Spanish, which is the name of an online news website that focuses on computer science topics such as programming languages and software development.
