## Information Retrieval

Information retrieval (IR) is the process of retrieving relevant information from a large collection of data based on a user's query. It covers the techniques and methodologies used to locate, efficiently and effectively, the information that matches a user's information need. In IR, the collection typically consists of text documents, web pages, or multimedia files, and retrieval is performed using keyword-based search or more advanced techniques such as natural language processing and machine learning. IR systems are widely used in search engines, digital libraries, document management systems, and recommendation systems to help users find relevant information in very large amounts of data.


### Information Retrieval Search Models

- Boolean model: In the Boolean model, a query is a Boolean expression of terms (combined with AND, OR, and NOT), and each document is either relevant or non-relevant to the query. This model is simple and efficient, but it can be too restrictive: it does not rank results and does not account for partial matches or term frequency.

- Vector space model: In the vector space model, a document is represented as a vector of term weights, and a query is represented as a vector of term weights as well. The similarity between the query vector and each document vector is computed using a similarity measure, such as cosine similarity. This model is flexible and can handle partial matches and term frequency, but it may suffer from the "curse of dimensionality" and may require normalization and tuning.
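
The difference between the two models can be sketched in a few lines of plain Python. This is a minimal illustration, not a production implementation: the toy corpus, the query, and the helper names (`tokenize`, `boolean_and`, `vector_rank`) are all made up for this example, and raw term counts stand in for the term weights a real system would use.

```python
import math
from collections import Counter

# Toy corpus; the documents and the query are invented for illustration.
docs = {
    "d1": "workers shall not pay recruitment fees",
    "d2": "the employer pays recruitment fees and health examination fees",
    "d3": "wage statements are provided each pay period",
}

def tokenize(text):
    return text.lower().split()

def boolean_and(query, corpus):
    """Boolean model: a document matches only if it contains every query term."""
    terms = set(tokenize(query))
    return [doc_id for doc_id, text in corpus.items() if terms <= set(tokenize(text))]

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def vector_rank(query, corpus):
    """Vector space model: score every document and sort by similarity."""
    q = Counter(tokenize(query))
    scores = {
        doc_id: cosine_similarity(q, Counter(tokenize(text)))
        for doc_id, text in corpus.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

query = "employer pays fees"
print(boolean_and(query, docs))   # only documents containing all three terms
print(vector_rank(query, docs))   # every document, ordered by cosine similarity
```

With this corpus the Boolean query matches only `d2`, while the vector space ranking still gives `d1` a non-zero score because it shares the term "fees" with the query.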



![Document ingestion pipeline diagram from the LangChain blog](https://blog.langchain.dev/content/images/2023/02/ingest.png)


In [1]:
import os
from langchain.vectorstores import Chroma
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import SentenceTransformerEmbeddings
from langchain.document_loaders import PyPDFLoader
from langchain.document_loaders import DirectoryLoader

## Load the documents using LangChain

LangChain uses document loaders to bring in information from various sources and prepare it for processing. These loaders act as data connectors: they fetch the information and convert it into a format LangChain understands.

LangChain provides many document loaders, including:

- TextLoader
- CSVLoader
- DirectoryLoader
- PyPDFLoader
- ArxivLoader
- Docx2txtLoader
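
To make the idea of a loader concrete, here is a minimal sketch of what one does conceptually: read files from a source and wrap each one as text plus metadata. This is not LangChain's implementation. `SimpleDocument` and `load_text_files` are hypothetical names, and the only assumption taken from the notebook is that a loaded document exposes `page_content` and a `metadata` dictionary with a `source` key, as the cells further down rely on.

```python
from dataclasses import dataclass, field
from pathlib import Path

@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a loaded document: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)

def load_text_files(directory, pattern="*.txt"):
    """Read every file in `directory` matching `pattern` and record its source path."""
    documents = []
    for path in sorted(Path(directory).glob(pattern)):
        text = path.read_text(encoding="utf-8")
        documents.append(
            SimpleDocument(page_content=text, metadata={"source": str(path)})
        )
    return documents
```

A format-specific loader such as a PDF loader follows the same pattern, with the extra work of extracting text from the file format.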

In [18]:
directory = 'Companies_Documents/Cisco'

def load_docs(directory):
    # Load every PDF in the directory; PyPDFLoader yields one document per page
    loader = DirectoryLoader(directory, glob="./*.pdf", loader_cls=PyPDFLoader)
    documents = loader.load()
    return documents

documents = load_docs(directory)

## Document Chunking (Splitting)

Chunking is the process of breaking down large pieces of text into smaller segments. It is an essential technique that helps optimize the relevance of the content we get back from a vector database once the content has been embedded with an embedding model.

- The `RecursiveCharacterTextSplitter` splits a large text into chunks of a specified size, using an ordered list of separators. The default separators are `["\n\n", "\n", " ", ""]`.

It first tries to split the text on `\n\n`. If a resulting piece is still larger than the chunk size, it moves to the next separator, `\n`, and splits that piece again. It continues down the list of separators until every piece is no larger than the specified chunk size.
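
The recursive fallback described above can be sketched in plain Python. This is a simplified illustration and not LangChain's actual code: `recursive_split` is a hypothetical helper, it ignores `chunk_overlap`, and it merges neighbouring pieces greedily as long as they still fit in one chunk.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split `text` into pieces of at most `chunk_size` characters (simplified sketch)."""
    if len(text) <= chunk_size:
        return [text] if text else []
    # Last resort (empty separator or none left): cut at fixed character offsets.
    if not separators or separators[0] == "":
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, remaining = separators[0], separators[1:]
    chunks, current = [], ""
    for part in text.split(sep):
        candidate = part if not current else current + sep + part
        if len(candidate) <= chunk_size:
            current = candidate          # keep merging while the chunk still fits
            continue
        if current:
            chunks.append(current)       # close the chunk built so far
        if len(part) <= chunk_size:
            current = part
        else:
            # The piece is still too large: retry it with the next separator.
            chunks.extend(recursive_split(part, chunk_size, remaining))
            current = ""
    if current:
        chunks.append(current)
    return chunks

sample = "alpha beta gamma\n\ndelta epsilon\nzeta eta theta iota"
for chunk in recursive_split(sample, chunk_size=12):
    print(repr(chunk))
```

Every printed chunk is at most 12 characters long, and words are only cut in the middle when no earlier separator produces a small enough piece.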

In [19]:
def split_docs(documents, chunk_size=1000, chunk_overlap=100):
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        separators=["\n\n", "\n", " ", ""],
    )
    docs = text_splitter.split_documents(documents)
    return docs

docs = split_docs(documents)
print(len(docs))

404


## Create Embeddings of the documents using Sentence Transformers

SentenceTransformers is a Python framework for state-of-the-art sentence, text and image embeddings. You can find its documentation here: https://www.sbert.net/

You can use this framework to compute sentence / text embeddings for more than 100 languages. These embeddings can then be compared e.g. with cosine-similarity to find sentences with a similar meaning. This can be useful for semantic textual similarity, semantic search, or paraphrase mining.

You can also find all the pre-trained models here: https://www.sbert.net/docs/pretrained_models.html . Some of these models support Arabic and more than 50 languages.

In [20]:
embeddings = SentenceTransformerEmbeddings(model_name='all-MiniLM-L12-v2')

## Create the vector store database

Chroma is a vector store (vector database) built by the company of the same name. Like many other vector stores, Chroma DB is used for storing and retrieving vector embeddings.

At the time of writing, Chroma does not provide a hosting service, although one is planned, so applications built around Chroma store their data in the local file system. Chroma DB offers different ways to store vector embeddings: you can keep them in memory, or save them to disk and load them back later.

Other examples of vector store databases:
- FAISS
- Qdrant
- Weaviate
- Deep Lake
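
What a vector store does can be illustrated with a toy version in plain Python: keep (vector, text) pairs, return the texts whose vectors are closest to a query vector, and optionally save the data to disk. This is only a sketch under those assumptions. `TinyVectorStore` is a hypothetical class, it is not Chroma's API, and the two-dimensional vectors are hand-written rather than produced by an embedding model.

```python
import json
import math

class TinyVectorStore:
    """Toy in-memory vector store (illustrative only, not Chroma's API)."""

    def __init__(self):
        self.records = []  # list of (vector, text) pairs

    def add(self, vector, text):
        self.records.append((list(vector), text))

    def search(self, query_vector, k=2):
        """Return the k texts whose vectors have the highest cosine similarity."""
        def cosine(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
            return dot / norm if norm else 0.0

        ranked = sorted(
            self.records, key=lambda rec: cosine(query_vector, rec[0]), reverse=True
        )
        return [text for _, text in ranked[:k]]

    def save(self, path):
        """Persist the records as JSON so they can be reloaded later."""
        with open(path, "w", encoding="utf-8") as f:
            json.dump(self.records, f)

    @classmethod
    def load(cls, path):
        store = cls()
        with open(path, encoding="utf-8") as f:
            store.records = [(vector, text) for vector, text in json.load(f)]
        return store

store = TinyVectorStore()
store.add([1.0, 0.0], "recruitment fees")
store.add([0.0, 1.0], "wage statements")
store.add([0.9, 0.1], "health examination fees")
print(store.search([1.0, 0.0], k=2))
```

Real vector stores add approximate nearest-neighbour indexes, metadata filtering, and more robust persistence on top of this basic idea.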

In [21]:
persist_directory = 'docs/chroma/'

db = Chroma.from_documents(docs, embeddings, persist_directory=persist_directory)
db.persist()

In [22]:
retriever = db.as_retriever(search_kwargs={'k':5})

In [28]:
query = "Does the company prohibit 'recruitment fees' to workers?"
# matching_docs = retriever.get_relevant_documents(query)
matching_docs = db.similarity_search(query)

print(matching_docs[0].page_content)
print(matching_docs[0].metadata['source'])
matching_docs[0].metadata['page']

expectation for the p rohibition  of forced labor.  These nonconformities indicated risks for forced labor or bonded labor. These nonconformities 
ranged in severity.  
The most common  nonconformities we identified related to workers paying small fees pertaining to the recruit ment process, such as small one -
time fees for health examinations, deposits, or transportation fees often amounting to less than five percent of the  worker’s  monthly salary. 
These fees were sometimes reimbursed after commencement of employment. Our teams c ontinue to work with suppliers to develop models in 
which employers pay healthcare providers directly for health examinations, eliminating the need for workers to be reimbursed.  
Less often , we identified  risks of bonded labor, a type of forced labor. Worke rs become bonded by debt when they are forced to work in order to
Companies_Documents/Cisco/cisco-modern-slavery-statement.pdf


4

In [29]:
query = "employer pays"
# matching_docs = db.similarity_search(query)
matching_docs = retriever.get_relevant_documents(query)
print(matching_docs[0].page_content)
print(matching_docs[0].metadata['source'])
print(matching_docs[0].metadata['page'])

For each pay period, workers shall be provided with a timely and understandable wage statement 
that includes sufficient information to verify accurate compensation for work performed. All use of 
temporary, dispatch and outsourced labor shall be within the limits of the local law.  
 
5) Non-Discrimination/Non -Harassment/Humane Treatment  
Participants shall commit to a workplace free of harassment and unlawful discrimination. There 
shall be no harsh or inhumane treatment including violence, gender -based violence, sexual  
harassment,  sexual  abuse,  corporal  punishment,  mental  or physical  coercion,  bullying, public 
shaming, or verbal abuse of workers; nor is there to be the threat of any such treatment. Companies shall not engage in discrimination or harassment based on race, color, age, gender, 
sexual orientation, gender identity or expression,  ethnicity or national origin, disability, pregnancy,
Companies_Documents/Cisco/RBACodeofConduct8.0_English.pdf
2


## Retrieving based on a specific keyword

In [9]:
import PyPDF2

def search_keywords_in_folder(folder_path, search_keywords):
    results = []
    # Iterate through files in the folder
    for filename in os.listdir(folder_path):
        if filename.endswith('.pdf'):  # Check if file is PDF
            filepath = os.path.join(folder_path, filename)
            with open(filepath, 'rb') as pdf_file:
                pdf_reader = PyPDF2.PdfReader(pdf_file)
                # Iterate through pages of the PDF
                for page_num in range(len(pdf_reader.pages)):
                    page_obj = pdf_reader.pages[page_num]
                    text = page_obj.extract_text()
                    # Check if the keyword is in the text
                    if search_keywords.lower() in text.lower():
                        results.append((text, filename, page_num + 1))
                        break  # Stop searching this file once keyword is found
    if results:
        return "Yes", results
    else:
        return "", None

In [27]:
result, found_locations = search_keywords_in_folder(directory, 'employer pays')
print("Search result:", result)
if result == "Yes":
    print("Keywords found in the following locations:")
    for content, filename, page_num in found_locations:
        print(f"Content: {content}, File: {filename}, Page: {page_num}")

Search result: 
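
The empty result shows how strict an exact substring test is. One thing that can defeat it is whitespace: text extracted from PDFs often contains line breaks or doubled spaces inside a phrase, as the outputs above show. The sketch below normalizes whitespace before matching. `normalize` and `contains_phrase` are hypothetical helpers, the sample text is invented, and this is not necessarily the reason the search above found nothing, since the exact phrase may simply not occur in the documents.

```python
import re

def normalize(text):
    """Lowercase the text and collapse every run of whitespace into one space."""
    return re.sub(r"\s+", " ", text).strip().lower()

def contains_phrase(text, phrase):
    """Whitespace-insensitive, case-insensitive phrase match."""
    return normalize(phrase) in normalize(text)

# Invented sample: a line break splits the phrase, as PDF extraction might.
page_text = "In these models the  employer\npays the provider directly."

print("employer pays" in page_text.lower())          # plain substring test
print(contains_phrase(page_text, "employer pays"))   # normalized test
```

On this sample the plain substring test fails and the normalized test succeeds. Neither variant handles inflected forms such as "employers pay", which is where the embedding-based search is more forgiving.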


## References

- https://medium.com/@varsha.rainer/document-loaders-in-langchain-7c2db9851123
- https://docs.langflow.org/components/text-splitters#:~:text=The%20RecursiveCharacterTextSplitter%20splits%20the%20text,size%20exceeds%20a%20specified%20threshold.
- https://medium.com/@cronozzz.rocks/splitting-large-documents-text-splitters-langchain-7c7bfa899267
- https://python.langchain.com/docs/modules/data_connection/retrievers/vectorstore/
- https://jorgepit-14189.medium.com/get-started-with-chroma-db-and-retrieval-models-using-langchain-87784ffaa918