1. Install the Dependencies

In [1]:
!pip install openai langchain pymupdf tqdm pinecone sentence-transformers faiss-cpu transformers accelerate


2. Import the Required Packages

In [2]:
import os
import fitz
import faiss
import numpy as np
from tqdm import tqdm
from google.colab import files
from langchain.text_splitter import RecursiveCharacterTextSplitter
from sentence_transformers import SentenceTransformer
from transformers import pipeline
from openai import OpenAI
from pinecone import Pinecone

3. Set the OpenAI API Key, Pinecone API Key, and Pinecone Index

In [None]:
import getpass
openai_api_key = getpass.getpass("Enter your OpenAI API Key: ")
client = OpenAI(api_key=openai_api_key)

pinecone_api_key = getpass.getpass("Enter your Pinecone API Key: ")
os.environ["PINECONE_API_KEY"] = pinecone_api_key
pc = Pinecone(api_key=pinecone_api_key)
index_name = input("Enter Index Name: ")
index = pc.Index(index_name)

Enter your OpenAI API Key: ··········
Enter your Pinecone API Key: ··········
Enter Index Name: demo
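
Typing the keys on every run gets tedious once the notebook is rerun often. The sketch below is one possible variation, assuming the keys may already be exported as environment variables; `get_secret` and `DEMO_API_KEY` are hypothetical names used only for illustration and are not part of the OpenAI or Pinecone libraries.

```python
import os
import getpass

def get_secret(env_name, prompt):
    """Return the value of env_name if it is set, otherwise prompt without echo."""
    value = os.environ.get(env_name)
    if value:
        return value
    return getpass.getpass(prompt)

# Demo with a dummy value so that no interactive prompt is needed.
os.environ["DEMO_API_KEY"] = "dummy-value"
demo_key = get_secret("DEMO_API_KEY", "Enter your demo API key: ")
```

In the cell above, the same helper could replace the two direct `getpass.getpass(...)` calls, for example `get_secret("PINECONE_API_KEY", "Enter your Pinecone API Key: ")`.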


4. Upload the Required Files (I am uploading my resume)

In [3]:
uploaded_files = files.upload()

Saving resume.pdf to resume.pdf


5. Extract and Split Text from the Uploaded File(s)

In [4]:
def extract_text_from_file(file_path):
    if file_path.endswith(".pdf"):
        # Open with a context manager so the PDF handle is closed afterwards.
        with fitz.open(file_path) as doc:
            return "\n".join(page.get_text() for page in doc)
    elif file_path.endswith(".txt"):
        with open(file_path, "r", encoding="utf-8") as f:
            return f.read()
    else:
        raise ValueError("Unsupported file type")

text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
documents = []
for filename in uploaded_files.keys():
    raw_text = extract_text_from_file(filename)
    chunks = text_splitter.split_text(raw_text)
    for i, chunk in enumerate(chunks):
        documents.append({"id": f"{filename}-{i}", "text": chunk, "metadata": {"source": filename}})
print(f"Chunks Created: {len(documents)}")


Chunks Created: 11
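
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a simplified sketch of overlapping character windows. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers to break at separators such as paragraphs, lines, and spaces, so its chunks will differ); `sliding_window_chunks` is a hypothetical helper written only to illustrate the two parameters.

```python
def sliding_window_chunks(text, chunk_size=500, chunk_overlap=50):
    """Split text into fixed-size character windows that overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks

sample = "abcdefghijklmnopqrstuvwxyz"
demo_chunks = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=3)
print(demo_chunks)
# ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk repeats the last three characters of the previous one, which is the point of the overlap: a sentence cut at a chunk boundary still appears whole in at least one chunk.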


6. Generate Embeddings and Create the FAISS Index

In [5]:
embed_model = SentenceTransformer('all-MiniLM-L6-v2')
texts = [doc["text"] for doc in documents]
metas = [doc["metadata"] for doc in documents]
ids = [doc["id"] for doc in documents]
embeddings = embed_model.encode(texts, show_progress_bar=True)
dimension = embeddings.shape[1]
faiss_index = faiss.IndexFlatL2(dimension)
faiss_index.add(np.array(embeddings))
doc_lookup = {i: {"text": texts[i], "metadata": metas[i]} for i in range(len(texts))}
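
`IndexFlatL2` performs an exact (brute-force) search and reports squared Euclidean distances. The NumPy sketch below shows the same computation on toy data, which may help when sanity-checking results on a small corpus; `brute_force_l2_search` and the toy vectors are illustrative assumptions, not part of FAISS.

```python
import numpy as np

def brute_force_l2_search(vectors, query, k=3):
    """Return (squared distances, indices) of the k vectors closest to query."""
    diffs = vectors - query                  # broadcast the query over all rows
    sq_dists = np.sum(diffs * diffs, axis=1)
    order = np.argsort(sq_dists)[:k]         # indices of the k smallest distances
    return sq_dists[order], order

toy_vectors = np.array(
    [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]], dtype="float32"
)
toy_query = np.array([0.9, 0.1], dtype="float32")
toy_dists, toy_ids = brute_force_l2_search(toy_vectors, toy_query, k=2)
print(toy_ids.tolist())  # [1, 0]
```

For a handful of chunks like the 11 created here, exact search is entirely adequate; approximate index types only start to matter at much larger scales.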


7. Function for Retrieving the Top-k Chunks (change the value of k as required)

In [6]:
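def retrieve_top_k(query, k=3):
    # Embed the query with the same model used for the chunks.
    query_vec = embed_model.encode([query])
    distances, indices = faiss_index.search(np.array(query_vec), k)
    # FAISS pads the result with -1 when fewer than k vectors are indexed.
    return [doc_lookup[idx]["text"] for idx in indices[0] if idx != -1]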
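
The function above discards the distances returned by the search. A possible extension, sketched here on hand-made arrays shaped like FAISS search output, keeps each distance and drops weak matches. `filter_hits`, the toy data, and the `max_distance` value are all hypothetical; a sensible threshold depends on the embedding model and would need to be tuned.

```python
import numpy as np

def filter_hits(distances, indices, lookup, max_distance):
    """Pair each hit with its distance, dropping padding and distant matches."""
    hits = []
    for dist, idx in zip(distances[0], indices[0]):
        if idx == -1:            # padding entry: fewer than k vectors were indexed
            continue
        if dist > max_distance:  # too far from the query to be useful
            continue
        hits.append((lookup[int(idx)]["text"], float(dist)))
    return hits

# Toy stand-ins for doc_lookup and for the (1, k) arrays a search returns.
toy_lookup = {0: {"text": "chunk A"}, 1: {"text": "chunk B"}, 2: {"text": "chunk C"}}
toy_distances = np.array([[0.25, 0.5, 1.75, 3.4e38]], dtype="float32")
toy_indices = np.array([[2, 0, 1, -1]], dtype="int64")
toy_hits = filter_hits(toy_distances, toy_indices, toy_lookup, max_distance=1.0)
print(toy_hits)  # [('chunk C', 0.25), ('chunk A', 0.5)]
```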

8. Set Up the Question-Answering Function Using FLAN-T5 (Offline QA) [I only have free-tier access to both OpenAI and Pinecone]

In [7]:
def answer_query_offline(query, k=3):
    top_chunks = retrieve_top_k(query, k)
    context = "\n\n".join(top_chunks)
    prompt = f"Answer the question based on the following context:\n\n{context}\n\nQuestion: {query}"
    model = pipeline("text2text-generation", model="google/flan-t5-base", max_length=512)
    response = model(prompt)
    return response[0]['generated_text']
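
With larger values of `k`, the joined context can grow beyond what a small model handles well. The sketch below builds the same prompt but stops adding chunks once a character budget is reached. `build_prompt` and `max_context_chars` are hypothetical, and a character count is only a rough proxy for the model's token limit.

```python
def build_prompt(query, chunks, max_context_chars=1500):
    """Join chunks into a context block, stopping before the budget is exceeded."""
    kept, used = [], 0
    for chunk in chunks:
        if used + len(chunk) > max_context_chars:
            break  # adding this chunk would exceed the budget
        kept.append(chunk)
        used += len(chunk)
    context = "\n\n".join(kept)
    return (
        "Answer the question based on the following context:\n\n"
        f"{context}\n\nQuestion: {query}"
    )

demo_prompt = build_prompt(
    "What is X?", ["a" * 10, "b" * 10, "c" * 10], max_context_chars=25
)
```

Because retrieval returns chunks in order of increasing distance, truncating from the end drops the least relevant chunks first.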

9. Query the Model and Get the Answers

In [8]:
def format_answer(text, words_per_line=15):
    words = text.split()
    lines = [' '.join(words[i:i+words_per_line]) for i in range(0, len(words), words_per_line)]
    return '\n'.join(lines)

while True:
    query = input("\nAsk a question (or type 'exit' to quit): ")
    if query.lower() == 'exit':
        break
    answer = answer_query_offline(query)
    print("Answer:\n", format_answer(answer))



Ask a question (or type 'exit' to quit): Find the total months of experience.


Device set to use cuda:0


Answer:
 12

Ask a question (or type 'exit' to quit): Show experience details


Answer:
 What do you need to know about Srijan Gupta?

Ask a question (or type 'exit' to quit): What is Srijan Gupta job wise experience?


Answer:
 over a year of hands-on experience in machine learning, data processing, and system simulation

Ask a question (or type 'exit' to quit): Give detailed job experience 


Answer:
 I am currently pursuing my Master’s in Computer Science, with over a year of hands-on
experience in machine learning, data processing, and system simulation. I am skilled in Python for
data manipulation and extraction, with proficiency in libraries such as NumPy, SciPy, Dash, and Pandas.
Additionally, through various research projects, coursework, and internships, I have gained experience in APPLICATIONS.

Ask a question (or type 'exit' to quit): exit
