# 📄 RAG with Your PDF / CSV / Text Documents
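This notebook demonstrates Retrieval-Augmented Generation (RAG) over an uploaded file. The file's text is split into chunks, embedded with Sentence-BERT, and indexed with FAISS; the chunks closest to a question are then passed to an OpenAI GPT model as context for generating the answer.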

## Step 1: Install Required Libraries

In [17]:
!pip install --upgrade faiss-cpu sentence-transformers openai PyPDF2 pandas



## Step 2: Import Libraries

In [34]:
import os
import faiss
import numpy as np
import openai
import pandas as pd
from sentence_transformers import SentenceTransformer
from PyPDF2 import PdfReader
from IPython.display import display
from pathlib import Path

## Step 3: Set OpenAI API Key

In [35]:
openai.api_key = os.getenv("OPENAI_API_KEY")  # Read the key from the environment; never hard-code it in the notebook
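
Hard-coding a key in a notebook exposes it to anyone who sees the file. A minimal sketch of an alternative, assuming the key is stored in an environment variable named `OPENAI_API_KEY`; `load_api_key` is a hypothetical helper for illustration, not part of the `openai` package:

```python
import os
from getpass import getpass

def load_api_key(env_var="OPENAI_API_KEY", prompt_if_missing=True):
    """Return the API key from the environment, optionally prompting for it.

    Hypothetical helper for illustration; not part of the openai package.
    """
    key = os.environ.get(env_var, "").strip()
    if not key and prompt_if_missing:
        # getpass hides the typed value so it does not appear in the cell output
        key = getpass(f"Enter {env_var}: ").strip()
    if not key:
        raise RuntimeError(f"No API key found; set the {env_var} environment variable.")
    return key
```

With this helper, the cell above could read `openai.api_key = load_api_key()`.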


## Step 4: Upload Your File

In [36]:
from google.colab import files  # or use Jupyter upload widget if not using Colab
uploaded = files.upload()
filename = next(iter(uploaded))
filepath = Path(filename)
print(f"Uploaded file: {filepath}")


Saving Milestone 3-Sentiment Analysis of Social Media Posts- Rizmi Sowdhagar.pdf to Milestone 3-Sentiment Analysis of Social Media Posts- Rizmi Sowdhagar (1).pdf
Uploaded file: Milestone 3-Sentiment Analysis of Social Media Posts- Rizmi Sowdhagar (1).pdf
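
Outside Colab, `google.colab` cannot be imported, so the upload cell fails. One hedged alternative is to point the notebook at a file that is already on disk. The helper below is a sketch; `resolve_input_file` and `SUPPORTED_SUFFIXES` are hypothetical names, and the suffix list assumes the three formats handled in Step 5:

```python
from pathlib import Path

# Formats the notebook knows how to read (assumed from the extraction step)
SUPPORTED_SUFFIXES = {".pdf", ".csv", ".txt"}

def resolve_input_file(path_str):
    """Validate a local path and return it as a Path object."""
    path = Path(path_str).expanduser()
    if path.suffix.lower() not in SUPPORTED_SUFFIXES:
        raise ValueError(f"Unsupported file type: {path.suffix!r}")
    if not path.is_file():
        raise FileNotFoundError(f"No such file: {path}")
    return path
```

For example, `filepath = resolve_input_file("my_document.pdf")`, where the file name is a placeholder.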


## Step 5: Extract Text from File

In [37]:
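def extract_text_from_file(path):
    """Return the text content of a .pdf, .csv, or .txt file."""
    suffix = path.suffix.lower()  # accept upper-case extensions such as .PDF
    if suffix == '.pdf':
        reader = PdfReader(str(path))
        # Extract each page once; pages with no extractable text are skipped
        page_texts = (page.extract_text() for page in reader.pages)
        return '\n'.join(text for text in page_texts if text)
    elif suffix == '.csv':
        df = pd.read_csv(path)
        # Join the cells of each row into one space-separated line
        return '\n'.join(df.astype(str).agg(' '.join, axis=1).tolist())
    elif suffix == '.txt':
        return path.read_text()
    else:
        raise ValueError(f"Unsupported file type: {path.suffix}")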

full_text = extract_text_from_file(filepath)
chunks = [full_text[i:i+500] for i in range(0, len(full_text), 500)]
print(f"Text split into {len(chunks)} chunks.")


Text split into 31 chunks.
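
Slicing every 500 characters can cut a sentence in half at a chunk boundary, and PDF extraction often leaves stray line breaks between words that use up the character budget. A minimal sketch of one common mitigation: collapse whitespace first, then let consecutive chunks overlap. `normalize_whitespace`, `chunk_text`, and the 50-character overlap are illustrative assumptions, not values from this notebook:

```python
import re

def normalize_whitespace(text):
    """Collapse runs of spaces, tabs, and newlines into single spaces."""
    return re.sub(r"\s+", " ", text).strip()

def chunk_text(text, chunk_size=500, overlap=50):
    """Split text into fixed-size character chunks that share `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]
```

A possible drop-in for the slicing line is `chunks = chunk_text(normalize_whitespace(full_text))`. Overlap increases the number of chunks, so the index grows accordingly.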


## Step 6: Embed Chunks with Sentence-BERT

In [39]:
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(chunks)
print("Embeddings shape:", embeddings.shape)
print("Sample embedding vector (first chunk):", embeddings[0][:5])  # Show first 5 values

Embeddings shape: (31, 384)
Sample embedding vector (first chunk): [-0.04104293 -0.03323935 -0.02671899 -0.03540211  0.12259932]


## Step 7: Create FAISS Index

In [40]:
dimension = embeddings.shape[1]
index = faiss.IndexFlatL2(dimension)  # exact (brute-force) search on L2 distance
index.add(np.asarray(embeddings, dtype="float32"))  # FAISS expects float32 vectors

print("FAISS index created with dimension:", dimension)
print("Total number of vectors in index:", index.ntotal)


FAISS index created with dimension: 384
Total number of vectors in index: 31
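
`IndexFlatL2` performs exact search: it compares the query against every stored vector and ranks the results by squared Euclidean distance. The pure-Python sketch below mimics that ranking on tiny vectors to make it concrete; `brute_force_l2_search` is a hypothetical name, and the function is far slower than FAISS, so it is for illustration only:

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def brute_force_l2_search(vectors, query, k=3):
    """Return (distances, indices) of the k vectors nearest to query."""
    # Score every vector, then sort so the smallest distance comes first
    scored = sorted((squared_l2(v, query), i) for i, v in enumerate(vectors))
    top = scored[:k]
    return [d for d, _ in top], [i for _, i in top]

vectors = [[0, 0], [1, 0], [0, 2], [3, 3]]
distances, indices = brute_force_l2_search(vectors, [2, 0], k=2)
print(indices)    # [1, 0]
print(distances)  # [1, 4]
```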


## Step 8: Ask a Question and Retrieve Chunks

In [42]:
query = "Enter your question here"  # Placeholder: replace with a real question about the document
query_vector = model.encode([query])
_, indices = index.search(query_vector, k=3)
retrieved_chunks = [chunks[i] for i in indices[0]]
context = '\n'.join(retrieved_chunks)

print("Query:", query)
print("Retrieved indices:", indices)
print("Retrieved chunks preview:\n", context[:500])


Query: Enter your question here
Retrieved indices: [[ 7 13  4]]
Retrieved chunks preview:
● Scaling
  Scaled key numeric fields to improve model convergence.
● Class Imbalance Visualization
  Class distribution: positive > neutral > negative, some imbalance exists.
Week 2: Model Development and Experimentation
● Prepares Features and Target
● Train-Test Split
  Used 80-20 train-test split for all supervised models.
● Model Training- Random Forest
  A Random Forest classifi
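
The search call above discards the distances, although they indicate how close each chunk is to the query. FAISS also pads the result with index `-1` when fewer than `k` vectors are available. A hedged sketch of a post-processing step follows; `select_hits` and the `max_distance` cut-off are hypothetical, and a sensible threshold depends on the embedding model and the data:

```python
def select_hits(distances, indices, max_distance=None):
    """Pair each retrieved index with its distance and drop unusable hits.

    distances, indices: the first rows of the two arrays returned by index.search
    max_distance: optional cut-off; hits farther away than this are discarded
    """
    hits = []
    for dist, idx in zip(distances, indices):
        if idx < 0:  # FAISS uses -1 to mark a missing neighbour
            continue
        if max_distance is not None and dist > max_distance:
            continue
        hits.append((int(idx), float(dist)))
    return hits
```

To use it, keep both return values, `distances, indices = index.search(query_vector, k=3)`, and call `select_hits(distances[0], indices[0])`; the retrieved chunks are then `[chunks[i] for i, _ in hits]`.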


## Step 9: Generate Answer with GPT

In [33]:
from openai import OpenAI

# Initialize the OpenAI client; the key is read from the environment, never hard-coded
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

# Construct the prompt
augmented_prompt = f"Answer the question using the context below:\n\nContext:\n{context}\n\nQuestion: {query}\n\nAnswer:"

# Call the Chat Completions API
response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "user", "content": augmented_prompt}
    ]
)

# Print the answer
print(response.choices[0].message.content)


How is the class distribution visualized and addressed in the model development and experimentation phase? 

Answer: The class distribution is visualized as positive > neutral > negative, with some imbalance existing. To improve model convergence, key numeric fields are scaled.
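
In the output above the model supplies a question of its own, most likely because the query was still the placeholder string. One way to make the behaviour more predictable is to move the instructions into a system message and tell the model what to do when the context does not contain the answer. The sketch below only builds the `messages` list; `build_messages` and the instruction wording are illustrative assumptions:

```python
def build_messages(context, query):
    """Build a chat `messages` list that separates instructions from content."""
    system = (
        "Answer the question using only the provided context. "
        "If the context does not contain the answer, say that you do not know."
    )
    user = f"Context:\n{context}\n\nQuestion: {query}"
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user},
    ]
```

The result can be passed as `messages=build_messages(context, query)` in the `client.chat.completions.create` call.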
