<a href="https://colab.research.google.com/github/Ayasaberomran/beginners-projects-for-ML-/blob/main/document_qa_langchain/Document_QA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Document Analysis AI

**Project Goal**  
Build a Retrieval‑Augmented Generation (RAG) system to answer questions from documents (PDF/text).

**Tools & Libraries**  
- Python | Hugging Face Transformers | LangChain | FAISS  

**Approach**  
1. Load & split documents  
2. Embed text chunks and store them in FAISS  
3. Build a QA pipeline using an LLM  
4. Demonstrate with sample queries


# Install dependencies

In [21]:
!pip install -q pypdf


In [35]:
!pip install -q langchain langchain-community sentence-transformers transformers faiss-cpu


In [31]:
from langchain.chains import RetrievalQA
from langchain_community.llms import HuggingFacePipeline
from transformers import pipeline


# Load document

In [39]:
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

loader = PyPDFLoader("/content/Gairola_MLFF-Net.pdf")
docs = loader.load()


# Split the document

In [40]:

splitter = RecursiveCharacterTextSplitter(
    chunk_size=400,      # must stay below 512 (chunk_size is counted in characters by default)
    chunk_overlap=50     # small overlap between consecutive chunks
)
chunks = splitter.split_documents(docs)
print("Number of chunks:", len(chunks))



Number of chunks: 111
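
To make the effect of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of fixed-size windowing with overlap. It is a simplified illustration only: `split_with_overlap` is a hypothetical helper, and it does not reproduce how `RecursiveCharacterTextSplitter` works internally (the real splitter also tries to break on separators such as paragraphs and sentences).

```python
def split_with_overlap(text: str, chunk_size: int = 400, chunk_overlap: int = 50) -> list[str]:
    """Cut text into character windows; neighbours share `chunk_overlap` characters.

    Hypothetical helper for illustration, not part of LangChain.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "abcdefghij" * 100  # 1000 characters of dummy text
pieces = split_with_overlap(sample)
print(len(pieces), [len(p) for p in pieces])
```

With these settings each window advances by 350 characters, so the last 50 characters of one chunk reappear at the start of the next. That overlap is what keeps a sentence cut at a chunk boundary from being lost to the retriever.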


# Step 3 – Embeddings and FAISS Vector Store



In [41]:
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS

embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
vectorstore = FAISS.from_documents(chunks, embeddings)
print("FAISS store created.")


FAISS store created.
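
Conceptually, the vector store answers one question: which stored chunk vectors are closest to the query vector? The sketch below shows that idea with a brute-force cosine-similarity search in plain Python. It is an assumption-laden toy: the three-dimensional vectors are made up, `cosine_similarity` and `top_k` are hypothetical helpers, and the real FAISS index may use a different metric (such as L2 distance) and much faster search structures.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], chunk_vecs: list[list[float]], k: int = 2) -> list[int]:
    """Return the indices of the k chunk vectors most similar to the query."""
    scored = [(cosine_similarity(query_vec, v), i) for i, v in enumerate(chunk_vecs)]
    scored.sort(reverse=True)  # highest similarity first
    return [i for _, i in scored[:k]]


# Toy stand-ins for chunk embeddings (real ones come from the embedding model)
toy_vectors = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0]]
query = [1.0, 0.05, 0.0]
best = top_k(query, toy_vectors, k=2)
print(best)
```

In the notebook itself, the same role is played by the retriever built from `vectorstore`, which embeds the question and returns the nearest chunks.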


# Prepare the model

In [32]:
# Create a pipeline using the flan-t5-base model
qa_pipeline = pipeline("text2text-generation", model="google/flan-t5-base", max_length=512)

# Wrap it as a LangChain LLM
llm = HuggingFacePipeline(pipeline=qa_pipeline)


Device set to use cpu


# Build the RetrievalQA pipeline with FAISS


In [33]:
qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectorstore.as_retriever()
)

def ask_question(q: str) -> str:
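    # invoke() replaces the deprecated run(); RetrievalQA returns the answer under "result"
    return qa.invoke({"query": q})["result"]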


# Step 4 – Enable the free Hugging Face model (flan-t5)


In [42]:
from transformers import pipeline
from langchain_community.llms import HuggingFacePipeline
from langchain.chains import RetrievalQA

# Create huggingface pipeline
hf_pipeline = pipeline("text2text-generation", model="google/flan-t5-base", max_length=512)

# Wrap with LangChain
llm = HuggingFacePipeline(pipeline=hf_pipeline)

# Build QA chain
qa = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff", retriever=vectorstore.as_retriever())

def ask_question(q: str) -> str:
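    # invoke() replaces the deprecated run(); RetrievalQA returns the answer under "result"
    return qa.invoke({"query": q})["result"]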


Device set to use cpu
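
The `chain_type="stuff"` setting means the retrieved chunks are all placed ("stuffed") into a single prompt together with the question. The sketch below illustrates that idea under stated assumptions: `build_stuff_prompt` is a hypothetical helper, the template wording is simplified and is not LangChain's actual prompt, and the chunk strings are placeholders rather than text from the PDF.

```python
def build_stuff_prompt(question: str, retrieved_chunks: list[str]) -> str:
    """Concatenate all retrieved chunks into one context block, then append the question.

    Hypothetical helper for illustration; the real chain uses its own template.
    """
    context = "\n\n".join(retrieved_chunks)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Placeholder chunks standing in for what the retriever would return
toy_chunks = ["First retrieved chunk.", "Second retrieved chunk."]
prompt = build_stuff_prompt("What is the purpose of the document?", toy_chunks)
print(prompt)
```

This is also why the chunk size matters: every retrieved chunk has to fit into the model's input alongside the question, so smaller chunks leave room for more of them.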


In [43]:
print(ask_question("What is the purpose of the document?"))


To describe the work of Ajay Krishan Gairola.


In [44]:
print(ask_question("Summarize the key findings."))


RESULTS This section provides a thorough analysis in addition to explaining the research findings. Figures, graphs, tables, and other reader - friendly formats can be used to present results [17], [18]. 3.1. Results of individual models using three datasets Models trained with DenseNet-121 and ViTb16 have an average F1 -score of 80 and 7 9, recall of 79 and 75, precision of 81 and 75 , and accuracy of 83% and 7 9%, respectively, on ISIC2016 dataset. 3.7. Comparison of the proposed model with the state of the arts Analysis of all the test data showed that our proposed model correctly classified 86% of the data. (e) the FFB 2.6. Performance evaluations The effectiveness of algorithms for automatically classifying images of skin lesions is often measured by their F1 -score, recall, precision, and accuracy. True positives ( T_P), false negatives ( F_N), and false positives ( F_P) proportions provide the basis for these measures’ calculation:
