<a href="https://colab.research.google.com/github/KurubaGeethika/Knowledge-graph-ai/blob/main/AIGurukul.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Load JSON data → convert to DataFrame

Clean / prepare data

Generate embeddings for prompt + response

Store in a vector database (FAISS) with metadata

RAG: retrieve top-K similar prompts for a user query

Use open-source LLM for generating an answer

Environment setup

In [1]:
!pip install pandas faiss-cpu sentence-transformers transformers accelerate

Collecting faiss-cpu
  Downloading faiss_cpu-1.13.2-cp310-abi3-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (7.6 kB)
Downloading faiss_cpu-1.13.2-cp310-abi3-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (23.8 MB)
Installing collected packages: faiss-cpu
Successfully installed faiss-cpu-1.13.2


Importing packages

In [None]:
import pandas as pd
import numpy as np
import json
from sentence_transformers import SentenceTransformer
import faiss
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
import torch

STEP 1: Load JSON data

In [5]:
from google.colab import files
import io

uploaded = files.upload()  # This will prompt you to select a file

# Get the uploaded filename
filename = list(uploaded.keys())[0]
print("Uploaded file:", filename)



Saving mobile_qa.json to mobile_qa (1).json
Uploaded file: mobile_qa (1).json


In [19]:
import json
import pandas as pd
with open(filename, 'r') as f:
    data = json.load(f)

# Convert to DataFrame
df = pd.DataFrame(data)
print("Initial DataFrame shape:", df.shape)
df.head()

Initial DataFrame shape: (46945, 7)


Unnamed: 0,questionType,asin,answerTime,unixTime,question,answerType,answer
0,yes/no,1466736038,"Mar 8, 2014",1394266000.0,Is there a SIM card in it?,Y,Yes. The Galaxy SIII accommodates a micro SIM ...
1,open-ended,1466736038,"Aug 4, 2014",1407136000.0,Why hasnt it upgraded to latest Android OS 4.4...,,"My S3 was able to upgrade to 4.4.2 last week, ..."
2,yes/no,1466736038,"Jan 29, 2015",1422518000.0,"Is this phone new, with 1 year manufacture war...",?,It is new but I was not able to get it activat...
3,yes/no,1466736038,"Nov 24, 2014",1416816000.0,Is this phone brand new and NOT a mini?,?,The phone we received was exactly as described...
4,open-ended,1466736038,"Oct 14, 2014",1413270000.0,this product is used with GSM chip in my count...,,I am sure (but not positive) that this phone w...
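
The loading cell assumes the uploaded file is a single JSON document. As a minimal sketch, a loader could also accept JSON Lines (one object per line), which is a common export format for Q&A dumps. The helper name `load_json_or_jsonl` is hypothetical and not part of the original notebook; its return value can be passed to `pd.DataFrame` in the same way as `data`.

```python
import json


def load_json_or_jsonl(path):
    """Parse a file as one JSON document, falling back to JSON Lines."""
    with open(path, "r", encoding="utf-8") as f:
        text = f.read()
    try:
        # Regular JSON: a list of records or a dict of columns.
        return json.loads(text)
    except json.JSONDecodeError:
        # JSON Lines: one JSON object per non-empty line.
        return [json.loads(line) for line in text.splitlines() if line.strip()]
```

Usage would be `data = load_json_or_jsonl(filename)` followed by `df = pd.DataFrame(data)`.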


STEP 2: Basic cleaning

In [20]:
# Initial number of rows
initial_rows = len(df)
print(f"Initial number of rows: {initial_rows}")

# Track missing values before filling
print("Missing values before filling:")
print(df.isna().sum())

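# Fill missing/null values with placeholder strings.
# Per the printed counts, only unixTime and answerType actually contain nulls.
# Note: filling the numeric unixTime column with a string turns it into an
# object-dtype column; that is harmless here because unixTime is not used later.
df['questionType'] = df['questionType'].fillna("Unknown questionType")
df['answerTime'] = df['answerTime'].fillna("Unknown answerTime")
df['unixTime'] = df['unixTime'].fillna("Unknown unixTime")
df['answerType'] = df['answerType'].fillna("Unknown answerType")
df['answer'] = df['answer'].fillna("No answer generated")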

# Track missing values after filling
print("\nMissing values after filling:")
print(df.isna().sum())

# Drop rows where critical fields are missing
before_drop = len(df)
df = df.dropna(subset=['question', 'asin'])
after_drop = len(df)
print(f"\nRows dropped: {before_drop - after_drop}")
print(f"Remaining rows: {after_drop}")


Initial number of rows: 46945
Missing values before filling:
questionType        0
asin                0
answerTime          0
unixTime         1224
question            0
answerType      19181
answer              0
dtype: int64

Missing values after filling:
questionType    0
asin            0
answerTime      0
unixTime        0
question        0
answerType      0
answer          0
dtype: int64

Rows dropped: 0
Remaining rows: 46945


In [21]:

# Concatenate prompt + response for embedding
df['text_for_embedding'] = df['question'] + " " + df['answer']
df.head()


Unnamed: 0,questionType,asin,answerTime,unixTime,question,answerType,answer,text_for_embedding
0,yes/no,1466736038,"Mar 8, 2014",1394265600.0,Is there a SIM card in it?,Y,Yes. The Galaxy SIII accommodates a micro SIM ...,Is there a SIM card in it? Yes. The Galaxy SII...
1,open-ended,1466736038,"Aug 4, 2014",1407135600.0,Why hasnt it upgraded to latest Android OS 4.4...,Unknown answerType,"My S3 was able to upgrade to 4.4.2 last week, ...",Why hasnt it upgraded to latest Android OS 4.4...
2,yes/no,1466736038,"Jan 29, 2015",1422518400.0,"Is this phone new, with 1 year manufacture war...",?,It is new but I was not able to get it activat...,"Is this phone new, with 1 year manufacture war..."
3,yes/no,1466736038,"Nov 24, 2014",1416816000.0,Is this phone brand new and NOT a mini?,?,The phone we received was exactly as described...,Is this phone brand new and NOT a mini? The ph...
4,open-ended,1466736038,"Oct 14, 2014",1413270000.0,this product is used with GSM chip in my count...,Unknown answerType,I am sure (but not positive) that this phone w...,this product is used with GSM chip in my count...
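
Plain `+` concatenation relies on both columns being clean strings. As a hedged sketch, a small helper could tolerate missing values and collapse stray whitespace before embedding. The name `build_embedding_text` is hypothetical and not from the original notebook.

```python
import re


def build_embedding_text(question, answer):
    """Join a question and an answer into one string, skipping missing values."""
    parts = []
    for value in (question, answer):
        # None and float NaN (NaN != NaN) both count as missing.
        if value is None or value != value:
            continue
        # Collapse runs of whitespace, including newlines, into single spaces.
        cleaned = re.sub(r"\s+", " ", str(value)).strip()
        if cleaned:
            parts.append(cleaned)
    return " ".join(parts)
```

It could be applied row by row, for example with `df.apply(lambda r: build_embedding_text(r['question'], r['answer']), axis=1)`.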


In [22]:


# Optional: add length or other features
df['text_length'] = df['text_for_embedding'].apply(len)
df.head()

Unnamed: 0,questionType,asin,answerTime,unixTime,question,answerType,answer,text_for_embedding,text_length
0,yes/no,1466736038,"Mar 8, 2014",1394265600.0,Is there a SIM card in it?,Y,Yes. The Galaxy SIII accommodates a micro SIM ...,Is there a SIM card in it? Yes. The Galaxy SII...,78
1,open-ended,1466736038,"Aug 4, 2014",1407135600.0,Why hasnt it upgraded to latest Android OS 4.4...,Unknown answerType,"My S3 was able to upgrade to 4.4.2 last week, ...",Why hasnt it upgraded to latest Android OS 4.4...,282
2,yes/no,1466736038,"Jan 29, 2015",1422518400.0,"Is this phone new, with 1 year manufacture war...",?,It is new but I was not able to get it activat...,"Is this phone new, with 1 year manufacture war...",145
3,yes/no,1466736038,"Nov 24, 2014",1416816000.0,Is this phone brand new and NOT a mini?,?,The phone we received was exactly as described...,Is this phone brand new and NOT a mini? The ph...,362
4,open-ended,1466736038,"Oct 14, 2014",1413270000.0,this product is used with GSM chip in my count...,Unknown answerType,I am sure (but not positive) that this phone w...,this product is used with GSM chip in my count...,403


STEP 3: Generate Embeddings

In [25]:
!pip install -U sentence-transformers



In [27]:
import torch
from sentence_transformers import SentenceTransformer

# Use a SentenceTransformer model (GPU supported)
device = 'cuda' if torch.cuda.is_available() else 'cpu'
embed_model = SentenceTransformer('all-MiniLM-L6-v2', device=device)

# Generate embeddings
embeddings = embed_model.encode(df['text_for_embedding'].tolist(), convert_to_numpy=True, show_progress_bar=True)
print("Embeddings shape:", embeddings.shape)

Batches:   0%|          | 0/1468 [00:00<?, ?it/s]

Embeddings shape: (46945, 384)
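
The progress bar reports 1468 batches for 46945 rows, which is consistent with a batch size of 32 (assumed here to be the `encode` default; it can be set explicitly with `batch_size=`). A rough sketch of that arithmetic, with a hypothetical `iter_batches` helper showing how a list is split:

```python
import math


def iter_batches(items, batch_size=32):
    """Yield consecutive slices of items, each at most batch_size long."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Number of batches needed for the 46945 rows at an assumed batch size of 32.
n_batches = math.ceil(46945 / 32)
```

A larger batch size usually speeds up encoding on a GPU at the cost of more memory.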


STEP 4: Create FAISS vector DB

In [29]:
# Install the CPU version (recommended for most tasks)
!pip install faiss-cpu

# Import the library
import faiss



In [30]:
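d = embeddings.shape[1]  # embedding dimension (384 for all-MiniLM-L6-v2)
# Normalize in place to unit length. On unit vectors, squared L2 distance
# equals 2 - 2 * cosine similarity, so L2 ranking matches cosine ranking.
faiss.normalize_L2(embeddings)
index = faiss.IndexFlatL2(d)  # exact (brute-force) L2 search
index.add(embeddings)
print("Vector DB size:", index.ntotal)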

Vector DB size: 46945
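
Why an L2 index works for cosine similarity once the vectors are normalized: for unit-length vectors a and b, the squared L2 distance equals 2 - 2·cos(a, b), so the smallest distance corresponds to the largest cosine. The following is a small pure-Python check of that identity on made-up vectors, not part of the original pipeline.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def squared_l2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


a = normalize([1.0, 2.0, 3.0])
b = normalize([2.0, 0.5, 1.0])
lhs = squared_l2(a, b)        # what an L2 index ranks by
rhs = 2 - 2 * cosine(a, b)    # the cosine-based equivalent
```

If cosine scores themselves are wanted rather than distances, an inner-product index on the normalized vectors is the usual alternative.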


STEP 5: RAG Retrieval

In [31]:
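def retrieve_top_k(query, k=5):
    """Return the k rows of df whose embeddings are closest to the query."""
    query_vec = embed_model.encode([query], convert_to_numpy=True)
    faiss.normalize_L2(query_vec)  # same normalization as the indexed vectors
    # D holds distances, I holds row positions in insertion order
    D, I = index.search(query_vec, k)
    # iloc is positional, matching the order in which embeddings were added
    return df.iloc[I[0]]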

# Example query
user_question = "Can I use 2 SIMs on iPhone 14 in the US?"
top_k_records = retrieve_top_k(user_question, k=5)
print("Top-k retrieved records:\n", top_k_records[['question', 'answer']])

Top-k retrieved records:
                                                 question  \
25646                          do i have to uses 2 sims?   
25124  Hi, Do we need to use 2 sims? Does it have int...   
10540  Does it accept two SIMS? Any issues using this...   
42499    can this phone be used here in the USA with att   
46433                can I switch the sim to a iphone 4s   

                                                  answer  
25646  You do not have to use 2 SIMS for the phone to...  
25124  Dear Customer, Thanks for your interest on thi...  
10540           No it does not. Do not buy it for europe  
42499  Yes it is a two Sims card phone it can also be...  
46433  just get the sim at metro for 15 bucks, just m...  
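
Several of the retrieved rows are only loosely related to the query, and `retrieve_top_k` discards the distances in `D`. As an illustrative sketch (plain Python, not the FAISS implementation), the same exact search can be written so that distances are returned alongside positions, which would let weak matches be filtered out. The function name and toy vectors are assumptions.

```python
import heapq


def top_k_by_distance(query, vectors, k=5):
    """Return (distances, indices) of the k vectors nearest to query (squared L2)."""
    scored = (
        (sum((q - x) ** 2 for q, x in zip(query, vec)), i)
        for i, vec in enumerate(vectors)
    )
    best = heapq.nsmallest(k, scored)  # sorted by ascending distance
    distances = [d for d, _ in best]
    indices = [i for _, i in best]
    return distances, indices


# Toy unit vectors standing in for the embeddings.
vectors = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8], [0.8, 0.6]]
distances, indices = top_k_by_distance([1.0, 0.0], vectors, k=2)
```

In the notebook, the analogous change would be to return `D[0]` together with `df.iloc[I[0]]` and drop rows whose distance exceeds a chosen threshold.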


STEP 6: Generate LLM answer

In [32]:
# Build context from retrieved records
context = ""
for idx, row in top_k_records.iterrows():
    context += f"Q: {row['question']}\nA: {row['answer']}\n\n"

rag_prompt = f"""
You are a smartphone assistant. Use the following Q&A to answer the user's question.

User Question: {user_question}

Context:
{context}

Answer concisely:
"""

In [34]:
!pip install transformers



In [35]:
from transformers import AutoTokenizer

In [37]:
from transformers import AutoModelForCausalLM

In [38]:
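# Load an open-source LLM. Vicuna-7B is large (about 13.5 GB of fp16 shards),
# so swap in a smaller model if GPU memory is limited.
llm_model = "TheBloke/vicuna-7B-1.1-HF"
tokenizer = AutoTokenizer.from_pretrained(llm_model)
model = AutoModelForCausalLM.from_pretrained(
    llm_model,
    device_map='auto',
    torch_dtype=torch.float16,  # newer transformers releases prefer `dtype=`
    # Needed when device_map='auto' spills weights to disk; without it,
    # loading fails with the ValueError recorded in this cell's output.
    offload_folder="offload",
)

# Send inputs to the device that holds the model's first layers
inputs = tokenizer(rag_prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200)
# Decode only the newly generated tokens, not the echoed prompt
new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
answer = tokenizer.decode(new_tokens, skip_special_tokens=True)

print("\n=== Generated Answer ===")
print(answer)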

config.json:   0%|          | 0.00/582 [00:00<?, ?B/s]

`torch_dtype` is deprecated! Use `dtype` instead!


pytorch_model.bin.index.json: 0.00B [00:00, ?B/s]

Fetching 2 files:   0%|          | 0/2 [00:00<?, ?it/s]

pytorch_model-00002-of-00002.bin:   0%|          | 0.00/3.50G [00:00<?, ?B/s]

pytorch_model-00001-of-00002.bin:   0%|          | 0.00/9.98G [00:00<?, ?B/s]

The following generation flags are not valid and may be ignored: ['pad_token_id']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


ValueError: The current `device_map` had weights offloaded to the disk. Please provide an `offload_folder` for them. Alternatively, make sure you have `safetensors` installed if the model you are using offers the weights in this format.
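
The error arises because the weights do not fit in the memory that `device_map='auto'` can place them in, so some are assigned to disk. A back-of-the-envelope estimate of weight size, assuming roughly 7 billion parameters and ignoring activations and the generation cache, can help decide whether a model is realistic for a given runtime. The helper name is hypothetical.

```python
def estimate_weight_gb(n_params, bytes_per_param=2):
    """Approximate size of model weights in GB (2 bytes per parameter for float16)."""
    return n_params * bytes_per_param / 1e9


fp16_gb = estimate_weight_gb(7e9)      # half precision
fp32_gb = estimate_weight_gb(7e9, 4)   # full precision
```

The half-precision figure is in line with the two shards listed in the download output (3.50G + 9.98G), which suggests either supplying an offload folder or choosing a smaller model when memory is limited.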