# Build a retrieval-based QA model with LangChain and FAISS as the vector database

## Prepare data


## Create vector database in FAISS

Before indexing, you need to choose an embedding (vectorization) model. You can train your own vectorizer (for example, starting from a SentenceTransformers model) or use an off-the-shelf one. I will use the embeddings from OpenAI, which require an API key.

As the vector database I will use FAISS, one of the fastest libraries for nearest-neighbor search. LangChain also provides integrations for many other storage systems (Elasticsearch, Redis, Pinecone, and so on).
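To make "searching nearest neighbors" concrete, here is a minimal brute-force sketch in plain Python. It is not how FAISS is implemented; it only illustrates the lookup a vector store performs. The function names, the toy 3-dimensional vectors, and the choice of cosine similarity are all assumptions made for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest_neighbors(query, vectors, k=2):
    """Return the indices of the k stored vectors most similar to the query."""
    scored = [(cosine_similarity(query, v), i) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [i for _, i in scored[:k]]

# Hypothetical 3-dimensional "embeddings"; real ones have hundreds of dimensions.
toy_vectors = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
toy_query = [1.0, 0.05, 0.0]

top_indices = nearest_neighbors(toy_query, toy_vectors, k=2)
print(top_indices)
```

This scan compares the query against every stored vector, so its cost grows linearly with the collection. A dedicated library such as FAISS is useful precisely because it performs the same kind of lookup much more efficiently on large collections.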

In [65]:
#  import libraries
import os
import pandas as pd
from langchain.text_splitter import CharacterTextSplitter
from langchain_community.document_loaders import DirectoryLoader, TextLoader
from langchain_community.vectorstores import FAISS
from langchain_openai import ChatOpenAI, OpenAIEmbeddings, OpenAI
from langchain.chains import RetrievalQA
from langchain.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain.chains import create_history_aware_retriever, create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from ragas.testset.generator import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_recall,
    context_precision,
    answer_correctness
)
    



In [30]:
import os
from dotenv import load_dotenv
load_dotenv()

True

In [31]:

openai_api_key = os.getenv("OPENAI_API_KEY")
store = '/Users/gala/PycharmProjects/Scalian_Chat/ScalianChatBot/server/frontend/webui/store_documents'
database = os.getenv("DATABASE")
index = os.getenv("INDEX")

In [32]:
folder_path = store
text_loader_kwargs = {"autodetect_encoding": True}

loader = DirectoryLoader(folder_path,
            loader_cls=TextLoader,
            loader_kwargs=text_loader_kwargs,
            show_progress=True,
            silent_errors=True,
        )
documents = loader.load()

100%|██████████| 230/230 [00:00<00:00, 1620.55it/s]


In [34]:
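# split the documents into chunks of about 1000 characters to stay within the token limit
# (not strictly necessary for this corpus); CharacterTextSplitter only splits on its
# separator ("\n\n" by default), so a long paragraph can still exceed chunk_size
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=100)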
docs = text_splitter.split_documents(documents)

Created a chunk of size 2885, which is longer than the specified 1000
Created a chunk of size 5142, which is longer than the specified 1000
Created a chunk of size 1247, which is longer than the specified 1000
Created a chunk of size 1359, which is longer than the specified 1000
Created a chunk of size 1302, which is longer than the specified 1000
Created a chunk of size 1197, which is longer than the specified 1000
Created a chunk of size 1044, which is longer than the specified 1000
Created a chunk of size 1285, which is longer than the specified 1000
Created a chunk of size 4012, which is longer than the specified 1000
Created a chunk of size 2745, which is longer than the specified 1000
Created a chunk of size 1791, which is longer than the specified 1000
Created a chunk of size 1309, which is longer than the specified 1000
Created a chunk of size 2680, which is longer than the specified 1000
Created a chunk of size 1258, which is longer than the specified 1000
Created a chunk of s
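
The warnings above appear because a separator-based splitter can only cut where the separator occurs: a paragraph with no blank line inside it stays in one piece, however long it is. The sketch below is a simplified re-implementation of that idea, not LangChain's actual code (it ignores overlap), followed by a hypothetical `hard_wrap` fallback that slices oversized chunks. In LangChain itself, `RecursiveCharacterTextSplitter` is the usual way to get chunks that respect the size limit.

```python
def split_on_separator(text, separator="\n\n", chunk_size=1000):
    """Greedily merge separator-delimited pieces (simplified, no overlap)."""
    chunks, current = [], ""
    for piece in text.split(separator):
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = piece  # may already exceed chunk_size on its own
    if current:
        chunks.append(current)
    return chunks

def hard_wrap(chunks, chunk_size=1000):
    """Hypothetical fallback: cut any oversized chunk into fixed-size slices."""
    wrapped = []
    for chunk in chunks:
        for start in range(0, len(chunk), chunk_size):
            wrapped.append(chunk[start:start + chunk_size])
    return wrapped

# A short paragraph, one 2500-character paragraph without blank lines, another short one.
sample = "short paragraph\n\n" + "x" * 2500 + "\n\nanother short one"

raw_chunks = split_on_separator(sample)
safe_chunks = hard_wrap(raw_chunks)
print([len(c) for c in raw_chunks])
print([len(c) for c in safe_chunks])
```

Fixed-size slicing cuts mid-sentence, so it is only a last resort; splitting recursively on progressively smaller separators (paragraphs, lines, words) usually keeps chunks more readable.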

In [36]:
# vectorize text
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
db = FAISS.from_documents(docs, embeddings)

In [37]:

# check whether the database has already been saved locally
current_dir = os.path.dirname('/Users/gala/PycharmProjects/Scalian_Chat/ScalianChatBot/server/frontend/webui/')
folder_documents = os.path.join(current_dir, 'scalian_database')
if os.path.exists(folder_documents) and os.path.exists(
    f"{folder_documents}/{index}"
):
    print("Database exists...")
    is_database_stored = True
else:
    print("Database does not exist...")
    is_database_stored = False

Database exists...


In [63]:
# load the stored database if it exists; otherwise build it and save it to the same folder

llm = ChatOpenAI(model="gpt-3.5-turbo-0125", temperature=0.3, max_tokens=500)
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

if is_database_stored:
    print("Loading the database...")
    # note: newer langchain_community releases also require
    # allow_dangerous_deserialization=True when loading a local index
    db = FAISS.load_local(folder_documents, embeddings)
else:
    print("Creating the database...")
    db = FAISS.from_documents(docs, embeddings)
    # save to the folder checked above, not to a path relative to the working directory
    db.save_local(folder_documents)

retriever = db.as_retriever()
# prompt = ChatPromptTemplate.from_messages(
#             [
#                 MessagesPlaceholder(variable_name="chat_history"),
#                 
#                 (
#                     "user",
#                     "Given the above conversation, generate a search query to look up in order to get information relevant to the conversation",
#                 ),
#             ]
#         )

retriever_chain = RetrievalQA.from_chain_type(llm=llm, retriever=retriever, return_source_documents=True)

Loading the database...


In [46]:
# the file_name field must be added, otherwise generation will not start
for document in documents:
    document.metadata['file_name'] = document.metadata['source']
# generator with openai models
generator = TestsetGenerator.with_openai()

# generate testset
testset = generator.generate_with_langchain_docs(documents, test_size=10, distributions={simple: 0.5, reasoning: 0.25, multi_context: 0.25})

embedding nodes:   0%|          | 0/800 [00:00<?, ?it/s]

Filename and doc_id are the same for all nodes.


Generating:   0%|          | 0/10 [00:00<?, ?it/s]

In [47]:
testset_pd = testset.to_pandas()
test_questions = testset_pd["question"].values.tolist() # list of questions to test
test_groundtruths = testset_pd["ground_truth"].values.tolist() # list of ground truths (answers)
testset_pd.head()

| | question | contexts | ground_truth | evolution_type | episode_done |
|---|---|---|---|---|---|
| 0 | ¿Cómo ayudan las nuevas tecnologías digitales ... | [title: Industria del futuro - Scalian\nurl: h... | Las nuevas tecnologías digitales ayudan a las ... | simple | True |
| 1 | ¿Cuál es la importancia de los Datos en el éxi... | [presarial, hiperpersonalización y la tecnolog... | Optimizar el uso de datos es esencial para el ... | simple | True |
| 2 | ¿Por qué son esenciales la adaptación y la for... | [title: El Futuro del Trabajo: Cómo la Automat... | Las habilidades de adaptación y la formación c... | simple | True |
| 3 | ¿Cómo contribuye la innovación colaborativa al... | [title: Nuestros socios - Scalian\nurl: https:... | La innovación colaborativa contribuye al enfoq... | simple | True |
| 4 | ¿Cómo contribuye el enfoque ágil a la adopción... | [ que hace vivir a la empresa. Incluso si cada... | La práctica ágil permite adoptar una postura d... | simple | True |


In [50]:
# save testset
testset_pd.to_csv('testset.csv', index=False)

In [51]:
from datasets import Dataset  
from tqdm import tqdm  

In [64]:

answers = []
contexts = []

# run every test question through the chain and collect answers and retrieved contexts
for question in test_questions:
    response = retriever_chain.invoke({"query": question})
    answers.append(response["result"])
    contexts.append([context.page_content for context in response["source_documents"]])

# wrap into a Hugging Face dataset
response_dataset_retrieval = Dataset.from_dict({
    "question": test_questions,
    "answer": answers,
    "contexts": contexts,
    "ground_truth": test_groundtruths,
})
response_dataset_retrieval[0]


{'question': '¿Cómo ayudan las nuevas tecnologías digitales a las industrias a superar nuevos umbrales de rendimiento, eficiencia y creación de valores según Scalian?',
 'answer': 'Las nuevas tecnologías digitales ayudan a las industrias a superar nuevos umbrales de rendimiento, eficiencia y creación de valores según Scalian al ayudar a los industriales a seleccionar entre las nuevas tecnologías digitales disponibles. Esto les permite recurrir a aquellas tecnologías que les permitan sobrepasar nuevos umbrales de rendimiento, eficiencia y creación de valores. Es importante tener en cuenta que los objetivos elegidos deben situarse fuera de la zona de confort, es decir, fuera de los objetivos de progreso continuo o del enfoque "Lean". Esto implica abordar una visión integral para aprovechar al máximo el aporte de la tecnología digital en el sector industrial. Scalian se apoya en metodologías propias y en tecnologías digitales como big data, inteligencia artificial, simulación, realidad au

In [67]:
metrics = [
    faithfulness,
    answer_relevancy,
    context_recall,
    context_precision
]
retrieval_results = evaluate(response_dataset_retrieval, metrics, raise_exceptions=False)
retrieval_results


Evaluating:   0%|          | 0/40 [00:00<?, ?it/s]

{'faithfulness': 1.0000, 'answer_relevancy': 0.9394, 'context_recall': 0.7750, 'context_precision': 0.8917}
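
To give some intuition for what a score such as `context_precision` rewards, here is a rough sketch of a rank-weighted precision over the retrieved chunks, following the formula as I understand it from the ragas documentation. It is an assumption-laden illustration, not the library's code: ragas obtains the relevance verdicts from an LLM, whereas the `hypothetical_verdicts` list below is made up.

```python
def context_precision_at_k(verdicts):
    """Rank-weighted precision.

    verdicts[i] is 1 if the chunk retrieved at rank i + 1 was judged
    relevant to the question, else 0.
    """
    relevant_total = sum(verdicts)
    if relevant_total == 0:
        return 0.0
    score, hits = 0.0, 0
    for rank, verdict in enumerate(verdicts, start=1):
        if verdict:
            hits += 1
            score += hits / rank  # precision at this rank, counted only for relevant chunks
    return score / relevant_total

# Hypothetical verdicts for four retrieved chunks: ranks 1, 3 and 4 are relevant.
hypothetical_verdicts = [1, 0, 1, 1]
example_score = context_precision_at_k(hypothetical_verdicts)
print(round(example_score, 4))
```

The score is highest when relevant chunks are ranked first, so an irrelevant chunk near the top costs more than one at the bottom of the list.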

In [68]:
df_metrics = pd.DataFrame(list(retrieval_results.items()), columns=['Metric', 'Evaluation'])
df_metrics

| | Metric | Evaluation |
|---|---|---|
| 0 | faithfulness | 1.0 |
| 1 | answer_relevancy | 0.939357 |
| 2 | context_recall | 0.775 |
| 3 | context_precision | 0.891667 |


In [69]:
df_metrics.to_csv('retrieval_metrics.csv', index=False)