# Overview

This notebook demonstrates a **smart assistant** for analyzing text documents with the **LangChain** framework. It extracts text from a PDF document, splits the content into overlapping chunks, and indexes the chunks in a vector database for semantic search and question answering. The assistant uses **Hugging Face embeddings** for semantic similarity, **FAISS** for vector storage, and an OpenAI model to generate the answers. It is designed to answer questions from the contents of the provided document and to decline questions the document does not cover.

### Key Features:
- **Dynamic Input**: Upload any PDF document for analysis.
- **Semantic Search**: Retrieve relevant passages by embedding similarity rather than exact keyword matching.
- **Efficient Storage**: Utilize FAISS for scalable vector-based storage.

### Example Data:
This implementation uses the paper [**"Text Classification Algorithms: A Survey"**](https://www.mdpi.com/2078-2489/10/4/150), which surveys the methods and steps involved in text classification.

### Use Cases:
- Analyze scientific papers, reports, or articles.
- Retrieve specific details from large documents.
- Dynamically update the assistant's knowledge base by uploading new PDFs.

Feel free to upload any PDF document to test and customize the assistant for your specific needs.


In [1]:
!pip install langchain
!pip install -U langchain-community
!pip install pdfplumber
!pip install faiss-gpu


### Import libraries

In [14]:
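import pdfplumber
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains import RetrievalQA
# Third-party integrations live in langchain-community (installed above);
# importing them through `langchain.*` is deprecated.
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_community.llms import OpenAI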

### Upload and process PDF documents

In [3]:
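def extract_text_from_pdf(file_path):
    """Return the text of every page in the PDF, joined by newlines."""
    page_texts = []
    with pdfplumber.open(file_path) as pdf:
        for page in pdf.pages:
            # extract_text() can return None for a page without a text layer
            page_texts.append(page.extract_text() or '')
    # Joining with a newline keeps the last line of one page apart from the next page
    return '\n'.join(page_texts)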

In [7]:
text = extract_text_from_pdf("Text Classification Algorithms.pdf")

### Splitting text into parts with LangChain

In [8]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = text_splitter.split_text(text)
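
To make `chunk_size` and `chunk_overlap` concrete, here is a simplified sketch of overlapping chunking with a fixed-width sliding window. It is not how `RecursiveCharacterTextSplitter` works internally (that class prefers to break on separators such as paragraphs, newlines, and spaces); `sliding_window_chunks` is a hypothetical helper used only for illustration.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Split text into fixed-width windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    windows = []
    for start in range(0, len(text), step):
        windows.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return windows

demo_chunks = sliding_window_chunks("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=4)
print(demo_chunks)
```

Each window repeats the last four characters of the previous one, so a sentence cut at a chunk boundary still appears whole in at least one chunk when it is shorter than the overlap.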

In [9]:
chunks[0]

'Review\nText Classification Algorithms: A Survey\nKamranKowsari1,2,* ID,KianaJafariMeimandi1,MojtabaHeidarysafa1,SanjanaMendu1 ID,\nLauraBarnes1,2,3 ID andDonaldBrown1,3 ID\n1 DepartmentofSystemsandInformationEngineering,UniversityofVirginia,Charlottesville,VA22904,USA;\nkj6vd@virginia.edu(K.J.M.);mh4pk@virginia.edu(M.H.);sm7gc@virginia.edu(S.M.);\nlb3dp@virginia.edu(L.B.);deb@virginia.edu(D.B.)\n2 SensingSystemsforHealthLab,UniversityofVirginia,Charlottesville,VA22911,USA\n3 SchoolofDataScience,UniversityofVirginia,Charlottesville,VA22904,USA\n* Correspondence:kk7nc@virginia.edu;Tel.:+1-202-812-3013\n(cid:1)(cid:2)(cid:3)(cid:1)(cid:4)(cid:5)(cid:6)(cid:7)(cid:8)(cid:1)\n(cid:1)(cid:2)(cid:3)(cid:4)(cid:5)(cid:6)(cid:7)\nReceived:22March2019;Accepted:17April2019;Published:23April2019\nAbstract: Inrecentyears,therehasbeenanexponentialgrowthinthenumberofcomplexdocuments\nandtextsthatrequireadeeperunderstandingofmachinelearningmethodstobeabletoaccurately'
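
The first chunk contains `(cid:N)` placeholders, which PDF extractors emit for glyphs they cannot map to characters. A minimal cleanup sketch, assuming these tokens carry no useful content, could be applied to `text` before splitting; `remove_cid_tokens` is a hypothetical helper, not part of pdfplumber or LangChain.

```python
import re

# Matches unmapped-glyph placeholders such as "(cid:12)".
CID_PATTERN = re.compile(r"\(cid:\d+\)")

def remove_cid_tokens(raw_text):
    """Drop (cid:N) placeholders and any line that consisted only of them."""
    cleaned_lines = []
    for line in raw_text.split("\n"):
        stripped = CID_PATTERN.sub("", line)
        # Keep lines that still have content, and lines that were blank to begin with
        if stripped.strip() or not line.strip():
            cleaned_lines.append(stripped)
    return "\n".join(cleaned_lines)

sample = "Tel.:+1-202-812-3013\n(cid:1)(cid:2)(cid:3)\nReceived:22March2019"
cleaned_sample = remove_cid_tokens(sample)
print(cleaned_sample)
```

The missing spaces between words are a separate extraction issue. pdfplumber's `extract_text` accepts an `x_tolerance` argument that may help with it, although that is not tested here.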

### Creating a vector database using LangChain

In [10]:
def create_vector_database(chunks):
  embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
  vector_database = FAISS.from_texts(chunks, embedding=embeddings)
  return vector_database

In [11]:
vector_database = create_vector_database(chunks)
type(vector_database)
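
Conceptually, the vector database stores one embedding per chunk and returns the chunks whose vectors lie closest to the embedding of the query. The toy sketch below shows that idea with hand-written 3-dimensional vectors and a brute-force Euclidean search. The vectors, texts, and the `nearest_chunks` helper are all made up for illustration; real `all-MiniLM-L6-v2` embeddings have 384 dimensions, and FAISS uses optimized index structures rather than a Python loop.

```python
import math

def nearest_chunks(query_vector, chunk_vectors, chunk_texts, k=2):
    """Return the k chunk texts whose vectors are closest to query_vector."""
    scored = []
    for vector, chunk_text in zip(chunk_vectors, chunk_texts):
        distance = math.dist(query_vector, vector)  # Euclidean (L2) distance
        scored.append((distance, chunk_text))
    scored.sort(key=lambda pair: pair[0])  # smallest distance first
    return [chunk_text for _, chunk_text in scored[:k]]

toy_texts = ["Rocchio classification", "Feature extraction with TF-IDF", "Evaluation metrics"]
toy_vectors = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.1], [0.0, 0.2, 0.9]]
toy_query = [0.8, 0.2, 0.1]  # stands in for the embedding of a question about Rocchio

top_chunks = nearest_chunks(toy_query, toy_vectors, toy_texts, k=2)
print(top_chunks)
```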


### Search with LangChain

In [17]:
import os
os.environ["OPENAI_API_KEY"] = "your OpenAI API key here..."
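
Writing the key directly into a cell makes it easy to leak when the notebook is shared. One possible alternative, sketched here under the assumption that an interactive prompt is acceptable, is to reuse a key that is already set and ask for one only when it is missing. `ensure_api_key` is a hypothetical helper; the demo passes a plain dictionary and a stand-in prompt so that it runs without user input.

```python
import os
from getpass import getpass

def ensure_api_key(env=os.environ, name="OPENAI_API_KEY", prompt=getpass):
    """Return the key stored under `name`, asking for it only if it is missing."""
    if not env.get(name):
        env[name] = prompt(f"Enter {name}: ")
    return env[name]

# Demo with a plain dict and a fake prompt, so nothing is read from the keyboard
demo_env = {}
ensure_api_key(env=demo_env, prompt=lambda message: "sk-placeholder")
print(sorted(demo_env))
```

In the notebook itself, calling `ensure_api_key()` with no arguments would prompt once with hidden input and store the value in `os.environ`.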

In [18]:
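def search_answer(query, vector_database):
  # Retrieve the most similar chunks and pass them to the LLM in one prompt ("stuff")
  retriever = vector_database.as_retriever()
  chain = RetrievalQA.from_chain_type(llm=OpenAI(), chain_type="stuff", retriever=retriever)
  # invoke() replaces the deprecated run(); the answer is stored under the "result" key
  answer = chain.invoke({"query": query})["result"]
  return answer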
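
The `"stuff"` chain type places all retrieved chunks into a single prompt together with the question. The sketch below illustrates that assembly step only. `build_stuff_prompt` and the template wording are illustrative assumptions, not LangChain's actual prompt or code.

```python
def build_stuff_prompt(question, retrieved_chunks):
    """Concatenate all retrieved chunks into one context block ahead of the question."""
    context = "\n\n".join(retrieved_chunks)
    return (
        "Use the following context to answer the question. "
        "If the answer is not in the context, say that you don't know.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

prompt = build_stuff_prompt(
    "what is text classification?",
    ["Chunk about feature extraction.", "Chunk about classifier selection."],
)
print(prompt)
```

Because every retrieved chunk goes into one prompt, the number of retrieved chunks multiplied by the chunk size has to fit within the model's context window.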

### Test how it works

In [22]:
query = "what is text classification?"
answer = search_answer(query, vector_database)
print(answer)

 Text classification is the process of automatically categorizing or organizing text documents into different groups or categories based on their content. It involves using machine learning algorithms to analyze and extract features from text data, reducing its dimensionality, selecting a classifier, and evaluating its performance. Text classification is commonly used in various fields such as information retrieval, medicine, social sciences, healthcare, psychology, and law. 


In [24]:
query = "what are the main steps of text classification you know?"
answer = search_answer(query, vector_database)
print(answer)

 The main steps of text classification include feature extraction, dimension reduction, classifier selection, and evaluation.


In [25]:
query = "tell me about Rocchio Classification"
answer = search_answer(query, vector_database)
print(answer)

 The Rocchio algorithm was first introduced in 1971 by J.J. Rocchio as a method for using relevance feedback to query full-text databases. It is a classification algorithm that uses TF-IDF weights for each informative word instead of boolean features, and assigns each test document to the class with the maximum similarity between the test document and each prototype vector. Some advantages of this algorithm include easy implementation and low computational complexity, but it can also misclassify multi-modal classes and is not very robust. It also requires careful tuning of hyper-parameters. 


In [26]:
query = "what is the weather in New York in the summer?"
answer = search_answer(query, vector_database)
print(answer)

 I don't know.


### Conclusion
The smart assistant answered the three questions about the provided document, including "what is text classification?" and "tell me about Rocchio Classification". When asked an unrelated question, "what is the weather in New York in the summer?", it responded with "I don't know", as this information was not included in the document. These queries suggest that the system grounds its answers in the document rather than in general knowledge, although a larger set of test questions would be needed to measure its accuracy.