# **Internship Task: Document Research & Theme Identification Chatbot**

Objective:
Create an interactive chatbot that can perform research across a large set of documents(minimum 75 documents), identify common themes (multiple themes are possible), and provide detailed, cited responses to user queries.

### Installing and Importing Required Libraries

In [13]:
!pip install pymupdf pdfplumber pytesseract pillow sentence-transformers faiss-cpu transformers accelerate pillow opencv-python

Collecting pdfplumber
  Downloading pdfplumber-0.11.6-py3-none-any.whl.metadata (42 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m42.8/42.8 kB[0m [31m1.6 MB/s[0m eta [36m0:00:00[0m
Collecting pdfminer.six==20250327 (from pdfplumber)
  Downloading pdfminer_six-20250327-py3-none-any.whl.metadata (4.1 kB)
Collecting pypdfium2>=4.18.0 (from pdfplumber)
  Downloading pypdfium2-4.30.1-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (48 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m48.2/48.2 kB[0m [31m3.0 MB/s[0m eta [36m0:00:00[0m
Downloading pdfplumber-0.11.6-py3-none-any.whl (60 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m60.2/60.2 kB[0m [31m3.5 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading pdfminer_six-20250327-py3-none-any.whl (5.6 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m5.6/5.6 MB[0m [31m45.5 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading pypdfium2-4.30.1-py3-non

In [14]:
import fitz  # PyMuPDF
import pytesseract
from PIL import Image
import pdfplumber
from google.colab import files
from sentence_transformers import SentenceTransformer
import faiss
import numpy as np
from transformers import pipeline
import io

### File Upload and Text Extraction

In [15]:
def extract_pdf(pdf_path):
    doc = fitz.open(pdf_path)
    text = ""
    for page in doc:
        text += page.get_text()
    return text

def extract_image(image_path):
    image = Image.open(image_path)
    text = pytesseract.image_to_string(image)
    return text

uploaded = files.upload()
texts = []
for filename in uploaded.keys():
    if filename.endswith(".pdf"):
        text = extract_pdf(filename)
    elif filename.lower().endswith((".png", ".jpg", ".jpeg", ".tiff")):
        text = extract_image(filename)
    else:
        text = ""
    texts.append({'filename': filename, 'text': text})
print(text)

Saving air1.pdf to air1 (1).pdf
Environmental Pollution: Types, Causes and 
Consequences 
Satsita Khasanova, Elina Alieva, and Aishat Shemilkhanova  
Kadyrov Chechen State University, Sheripova Street, 32, 364024, Grozny, Russia 
Abstract. Environmental pollution is not a new phenomenon, but it 
remains the greatest global problem facing humanity and a major 
environmental cause of morbidity and mortality. Human activities related 
to urbanization, industrialization, mining and exploration are at the 
forefront of global environmental pollution. Both developed and 
developing countries share this burden, although awareness and stronger 
laws in developed countries have done more to protect their environment. 
Despite global attention to pollution, its impact is still being felt due to its 
severe long-term effects. The purpose of this work is to display the severity 
of the problem of environmental pollution, in particular, water pollution, 
air pollution, radioactive pollution, noise

### Spliting Text into Chunks

In [18]:
def chunk_text(text, max_words=150):
    words = text.split()
    return [' '.join(words[i:i+max_words]) for i in range(0, len(words), max_words)]

chunks = []
for doc in texts:
    for chunk in chunk_text(doc['text']):
        chunks.append({'text': chunk, 'filename': doc['filename']})
print(f"\nTotal Chunks Created: {len(chunks)}")


Total Chunks Created: 22


### Text Embeddings is Faiss Index

In [20]:
model = SentenceTransformer('all-MiniLM-L6-v2')
corpus = [chunk['text'] for chunk in chunks]
embeddings = model.encode(corpus, convert_to_numpy=True)

dimension = embeddings.shape[1]
index = faiss.IndexFlatL2(dimension)
index.add(embeddings)
print("FAISS index created and embeddings stored.")

FAISS index created and embeddings stored.


### Querying the Document

In [21]:
query = input("Enter your query: ")
query_embedding = model.encode([query])
D, I = index.search(query_embedding, k=5)
top_chunks = [chunks[i] for i in I[0]]
for i, chunk in enumerate(top_chunks):
    print(f"Result {i+1} from {chunk['filename']}\n{chunk['text'][:300]}\n---")

Enter your query: cause of pollution
Result 1 from air1 (1).pdf
are two types of air pollutants: Primary pollutants are those that directly contribute to air pollution. Sulfur dioxide emitted from factories is the main pollutant. Secondary pollutants are formed as a result of mixing and reaction of primary pollutants. Smog is a secondary pollutant resulting from
---
Result 2 from air1 (1).pdf
hydrocarbons and chemicals are mainly produced in factories and industries. They are released into the atmosphere, degrading its quality; 5. Household Sources: Toxic chemicals are released into the air from household cleaners and paints. The smell coming from freshly painted walls is the smell of ch
---
Result 3 from air1 (1).pdf
by about 20%. Causes and sources of noise pollution: 1. Industrialization: Industrialization has led to an increase in noise pollution due to the use of heavy machinery such as generators, mills, and massive exhaust fans that produce unwanted noise. 2. Vehicles. The secon

### Generate Answers with LLM (FLAN-T5)

In [27]:
qa_model = pipeline("text2text-generation", model="google/flan-t5-base", max_new_tokens=256)

prompt = (
    f"Answer the question based on the following text:\n\n"
    f"{''.join(chunk['text'] for chunk in top_chunks)}\n\n"
    f"Question: {query}\nAnswer:"
)

answer = qa_model(prompt)[0]['generated_text']
print("\n🧠 Final Answer:")
print(answer)

Device set to use cpu
Token indices sequence length is longer than the specified maximum sequence length for this model (1035 > 512). Running this sequence through the model will result in indexing errors



🧠 Final Answer:
primary pollutants


### Theme Identification

In [28]:
def extract_theme(chunks):
    themes = []
    for i, chunk in enumerate(chunks):
        prompt = f"Read this chunk and describe its main theme in one sentence:\n\n{chunk}\n\nAnswer:"
        theme = qa_model(prompt)[0]['generated_text']
        themes.append((f"Chunk {i+1}", theme))
    return themes

theme_results = extract_theme(top_chunks)

print("\n🎯 Themes Identified:")
for chunk_id, theme in theme_results:
    print(f"{chunk_id}: {theme}")


🎯 Themes Identified:
Chunk 1: Air pollution is the result of the incomplete combustion of fossil fuels.
Chunk 2: Air pollution
Chunk 3: Noise pollution can pose a risk to human health
Chunk 4: Water pollution is of vital importance to humanity as it is directly related to human well-being. Water quality is of vital importance to humanity as it is directly related to human well-being. When water becomes polluted, it has a direct or indirect negative effect on all forms of life that depend on it. The effects of water pollution can be felt for many years. Contaminated water is the cause of many waterborne diseases and epidemics that are widespread in many countries. Water pollution is defined as pollution of water bodies. Water pollution is caused by urbanization, deforestation, industrial effluents, detergents and fertilizers, and agricultural effluents are all examples of pollution.
Chunk 5: The study of air pollution is a very important issue that needs to be addressed in the foreseea