<a href="https://colab.research.google.com/github/Nathanael9212/rag-news-project/blob/main/retrieval_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Install Libraries

In [1]:
!pip install -q sentence-transformers faiss-cpu rank_bm25 transformers


Load Dataset

In [2]:
from google.colab import drive
drive.mount('/content/drive')

import json
import pandas as pd
import re
import numpy as np
import warnings
warnings.filterwarnings('ignore')

DATA_PATH = '/content/drive/MyDrive/dataset_retrieval/news.json'

Mounted at /content/drive


In [3]:
# Load data from the JSON Lines file (one JSON object per line)
data = []
with open(DATA_PATH, 'r', encoding='utf-8') as f:
    for line in f:
        data.append(json.loads(line))

# Convert to a DataFrame and sample 5,000 articles for efficiency.
df = pd.DataFrame(data).sample(n=5000, random_state=42).reset_index(drop=True)

print("="*70)
print("📊 TABLE 1: RAW DATASET (Before Preprocessing)")
print("="*70)
print(f"Total data: {len(df)} documents | Categories: {df['category'].nunique()} types\n")

# Show the raw data table
display(df[['headline', 'short_description', 'category', 'date']].head(10))

📊 TABLE 1: RAW DATASET (Before Preprocessing)
Total data: 5000 documents | Categories: 42 types



Unnamed: 0,headline,short_description,category,date
0,What If We Were All Family Generation Changers?,"What if, in doing so, we won't just create new...",IMPACT,2014-06-20
1,Firestorm At AOL Over Employee Benefit Cuts,It should have been a glorious week for AOL ch...,BUSINESS,2014-02-08
2,Dakota Access Protesters Arrested As Deadline ...,A few protesters who refused to leave remained...,POLITICS,2017-02-22
3,One Glimpse Of These Baby Kit Foxes And You'll...,,GREEN,2014-05-14
4,"Mens' Sweat Pheromone, Androstadienone, Influe...",Scientists didn't know if humans played that g...,SCIENCE,2013-06-02
5,Summer Sleepover Tips,Here are five ways to get some beauty sleep wh...,PARENTING,2012-07-25
6,End of the Year,"For a moment, let yourself wonder -- and let y...",WELLNESS,2012-11-04
7,Maybe Colleges Should Take A Lesson From Zoos,"By Michael Preston, UCF Forum columnist What w...",SCIENCE,2017-05-10
8,Supermodel Stephanie Seymour Does Sexy Photo S...,"In particular, Seymour remembers the way the b...",STYLE & BEAUTY,2014-02-10
9,American Attitudes About Guns Have Become Much...,"When I was a kid growing up in Washington, D.C...",POLITICS,2017-07-27


Text Preprocessing

In [4]:
def preprocess(text):
    text = str(text).lower()                          # lowercase
    text = re.sub(r'http\S+|www\S+', '', text)        # remove URLs
    text = re.sub(r'[^a-zA-Z0-9\s]', ' ', text)       # replace special characters with spaces
    text = re.sub(r'\s+', ' ', text).strip()          # collapse extra whitespace
    return text

# Combine headline + short_description → document (missing values become empty strings)
df['document'] = df['headline'].fillna('') + ' ' + df['short_description'].fillna('')
# Clean the text → doc_clean
df['doc_clean'] = df['document'].apply(preprocess)

print("="*70)
print("📊 TABLE 2: AFTER PREPROCESSING")
print("="*70)
print("Steps: lowercase → remove URLs → remove special characters → collapse whitespace\n")

# Show a before-vs-after comparison table
comparison = df[['headline', 'document', 'doc_clean']].head(5).copy()
comparison.columns = ['Headline (Original)', 'Document (Combined)', 'Document (Clean)']
display(comparison)

📊 TABLE 2: AFTER PREPROCESSING
Steps: lowercase → remove URLs → remove special characters → collapse whitespace



Unnamed: 0,Headline (Original),Document (Combined),Document (Clean)
0,What If We Were All Family Generation Changers?,What If We Were All Family Generation Changers...,what if we were all family generation changers...
1,Firestorm At AOL Over Employee Benefit Cuts,Firestorm At AOL Over Employee Benefit Cuts It...,firestorm at aol over employee benefit cuts it...
2,Dakota Access Protesters Arrested As Deadline ...,Dakota Access Protesters Arrested As Deadline ...,dakota access protesters arrested as deadline ...
3,One Glimpse Of These Baby Kit Foxes And You'll...,One Glimpse Of These Baby Kit Foxes And You'll...,one glimpse of these baby kit foxes and you ll...
4,"Mens' Sweat Pheromone, Androstadienone, Influe...","Mens' Sweat Pheromone, Androstadienone, Influe...",mens sweat pheromone androstadienone influence...
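
To make the effect of each cleaning step concrete, the sketch below applies the same `preprocess` logic to a made-up sentence (the sample string is an illustrative assumption, not taken from the dataset). Note that apostrophes are replaced by spaces, which is why "You'll" shows up as "you ll" in the cleaned column above.

```python
import re

def preprocess(text):
    text = str(text).lower()                          # lowercase
    text = re.sub(r'http\S+|www\S+', '', text)        # remove URLs
    text = re.sub(r'[^a-zA-Z0-9\s]', ' ', text)       # replace special characters with spaces
    text = re.sub(r'\s+', ' ', text).strip()          # collapse extra whitespace
    return text

# Hypothetical input, chosen to exercise every step
sample = "You'll LOVE this: https://example.com/story?id=1 (50% off!)"
cleaned = preprocess(sample)
print(cleaned)  # you ll love this 50 off
```

Because punctuation is dropped rather than normalized, contractions and possessives split into short fragments such as "ll" or "s". That is harmless for this demo but worth keeping in mind when reading BM25 matches.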


BM25: Keyword-Based Index

In [5]:
from rank_bm25 import BM25Okapi

# Tokenize documents (whitespace split on the cleaned text)
corpus = [doc.split() for doc in df['doc_clean'].tolist()]

# Build the BM25 index
bm25 = BM25Okapi(corpus)

print("✅ BM25 index ready!")
print(f"   • Number of documents: {len(corpus)}")
print(f"   • Average tokens per document: {np.mean([len(doc) for doc in corpus]):.1f}")

✅ BM25 index ready!
   • Number of documents: 5000
   • Average tokens per document: 30.2
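
For intuition about what a BM25 index computes, here is a minimal pure-Python sketch of Okapi-style scoring on a toy corpus. It is not the `rank_bm25` implementation: the IDF uses a smoothed, non-negative variant, and `k1=1.5`, `b=0.75`, the function name `bm25_scores`, and the toy documents are all assumptions made for illustration.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, corpus, k1=1.5, b=0.75):
    """Score every tokenized document in `corpus` against `query_tokens`."""
    n_docs = len(corpus)
    avgdl = sum(len(doc) for doc in corpus) / n_docs
    # Document frequency: how many documents contain each term
    doc_freq = Counter(term for doc in corpus for term in set(doc))
    scores = []
    for doc in corpus:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            n_t = doc_freq[term]
            # Smoothed IDF (assumed variant): rarer terms weigh more
            idf = math.log(1 + (n_docs - n_t + 0.5) / (n_t + 0.5))
            # Term frequency saturates via k1; b penalizes long documents
            denom = tf[term] + k1 * (1 - b + b * len(doc) / avgdl)
            score += idf * tf[term] * (k1 + 1) / denom
        scores.append(score)
    return scores

# Hypothetical toy corpus, tokenized by whitespace like the cell above
toy_corpus = [
    "flu vaccine trial begins".split(),
    "stock markets rally again".split(),
    "new vaccine for flu approved after vaccine trial".split(),
]
scores = bm25_scores("vaccine trial".split(), toy_corpus)
ranking = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
print(ranking, [round(s, 3) for s in scores])
```

The document with no query terms scores exactly zero, which reflects the main limitation of keyword retrieval: it cannot match a query that uses different words from the article.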


In [6]:
from sentence_transformers import SentenceTransformer
import faiss

# Load the embedding model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Create embeddings
print("🔄 Creating embeddings...")
embeddings = model.encode(df['doc_clean'].tolist(), show_progress_bar=True, batch_size=64)

# Build the FAISS index: L2-normalize first so inner product equals cosine similarity
dimension = embeddings.shape[1]
index = faiss.IndexFlatIP(dimension)
faiss.normalize_L2(embeddings)
index.add(embeddings)

print(f"\n✅ FAISS index ready!")
print(f"   • Number of vectors: {index.ntotal}")
print(f"   • Vector dimension: {dimension}")

🔄 Creating embeddings...


✅ FAISS index ready!
   • Number of vectors: 5000
   • Vector dimension: 384
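
The cell above normalizes the embeddings before adding them to an inner-product index. The sketch below, in plain Python with made-up 3-dimensional vectors, shows why: once both vectors have unit length, their inner product is the cosine similarity. The helper names and the vectors are assumptions for illustration only.

```python
import math

def inner_product(a, b):
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(inner_product(vec, vec))
    return [x / norm for x in vec]

def cosine_similarity(a, b):
    return inner_product(a, b) / (
        math.sqrt(inner_product(a, a)) * math.sqrt(inner_product(b, b))
    )

# Hypothetical vectors standing in for a document and a query embedding
doc_vec = [3.0, 4.0, 0.0]
query_vec = [1.0, 2.0, 2.0]

ip_after_norm = inner_product(l2_normalize(doc_vec), l2_normalize(query_vec))
cos = cosine_similarity(doc_vec, query_vec)
print(round(ip_after_norm, 4), round(cos, 4))  # 0.7333 0.7333
```

This is also why the query embedding has to be normalized at search time: skipping that step would make the returned scores depend on the query vector's length.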


Retrieval Functions

In [7]:
# BM25 retrieval function
def retrieve_bm25(query, top_k=5):
    tokens = preprocess(query).split()
    scores = bm25.get_scores(tokens)
    top_idx = np.argsort(scores)[::-1][:top_k]
    return top_idx, scores[top_idx]

# FAISS retrieval function
def retrieve_faiss(query, top_k=5):
    q_emb = model.encode([preprocess(query)])
    faiss.normalize_L2(q_emb)
    scores, indices = index.search(q_emb, top_k)
    return indices[0], scores[0]

# Run a test query and show the retrieval results
print("="*70)
print("📊 TABLE 3: RETRIEVAL RESULTS")
print("="*70)

test_query = "COVID vaccine booster shots in America"
print(f"Query: \"{test_query}\"\n")

# BM25 Results
print("🔹 BM25 Results:")
idx_bm25, scores_bm25 = retrieve_bm25(test_query, top_k=5)
bm25_results = pd.DataFrame({
    'Rank': range(1, len(idx_bm25) + 1),
    'Score': [f"{s:.2f}" for s in scores_bm25],
    'Category': [df.iloc[i]['category'] for i in idx_bm25],
    'Headline': [df.iloc[i]['headline'][:60] + '...' for i in idx_bm25]
})
display(bm25_results)

# FAISS Results
print("\n🔹 FAISS Results:")
idx_faiss, scores_faiss = retrieve_faiss(test_query, top_k=5)
faiss_results = pd.DataFrame({
    'Rank': range(1, len(idx_faiss) + 1),
    'Score': [f"{s:.4f}" for s in scores_faiss],
    'Category': [df.iloc[i]['category'] for i in idx_faiss],
    'Headline': [df.iloc[i]['headline'][:60] + '...' for i in idx_faiss]
})
display(faiss_results)

📊 TABLE 3: RETRIEVAL RESULTS
Query: "COVID vaccine booster shots in America"

🔹 BM25 Results:


Unnamed: 0,Rank,Score,Category,Headline
0,1,17.32,HEALTHY LIVING,Going Back To The Old Whooping Cough Vaccine C...
1,2,13.62,WORLD NEWS,More Than 15 Million Have Had Covid Vaccine In...
2,3,10.37,WELLNESS,Bird Flu Vaccine: Prototype For H5N1 Vaccine B...
3,4,9.37,HEALTHY LIVING,Experimental Zika Vaccine Successfully Induces...
4,5,9.23,WELLNESS,Vaccine of Hope...



🔹 FAISS Results:


Unnamed: 0,Rank,Score,Category,Headline
0,1,0.5514,IMPACT,Big Pharma Says It Offers Cheap Vaccines For R...
1,2,0.5467,WORLD NEWS,More Than 15 Million Have Had Covid Vaccine In...
2,3,0.5333,HEALTHY LIVING,Health Officials To Decide If The U.S. Should ...
3,4,0.5137,HEALTHY LIVING,Going Back To The Old Whooping Cough Vaccine C...
4,5,0.5136,WELLNESS,Perennial Flu Shot One Step Closer...


In [8]:
from transformers import T5Tokenizer, T5ForConditionalGeneration
import torch

# Load model
tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-base")
llm = T5ForConditionalGeneration.from_pretrained("google/flan-t5-base")

# Move to GPU if available
device = "cuda" if torch.cuda.is_available() else "cpu"
llm.to(device)

print(f"✅ LLM Flan-T5 ready on {device}")

✅ LLM Flan-T5 ready on cpu


RAG Pipeline (Retrieval + LLM)

In [9]:
# Generate an answer from a prompt
def generate_answer(prompt):
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=512).to(device)
    outputs = llm.generate(**inputs, max_length=200, num_beams=4, early_stopping=True)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# RAG Pipeline
def rag(question, top_k=3, use_faiss=True):
    # Step 1: Retrieval
    if use_faiss:
        indices, scores = retrieve_faiss(question, top_k)
    else:
        indices, scores = retrieve_bm25(question, top_k)

    # Step 2: Build the context
    context = ""
    sources = []
    for i, idx in enumerate(indices):
        headline = df.iloc[idx]['headline']
        desc = df.iloc[idx]['short_description']
        context += f"[{i+1}] {headline}: {desc}\n"
        sources.append({'headline': headline, 'category': df.iloc[idx]['category']})

    # Step 3: Build the prompt
    prompt = f"""Answer the question based on these news articles:

{context}

Question: {question}
Answer:"""

    # Step 4: Generate
    answer = generate_answer(prompt)

    return answer, sources, indices, scores
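
To inspect exactly what text the LLM receives, the prompt-building steps of `rag` can be pulled out into a standalone helper. The sketch below mirrors the format used above; the helper name `build_prompt` and the two sample articles are hypothetical and exist only for illustration.

```python
def build_prompt(question, articles):
    """Format retrieved articles into a numbered context followed by the question."""
    context = ""
    for i, art in enumerate(articles):
        context += f"[{i+1}] {art['headline']}: {art['short_description']}\n"
    return f"""Answer the question based on these news articles:

{context}

Question: {question}
Answer:"""

# Hypothetical retrieved articles (not from the dataset)
articles = [
    {"headline": "City Council Approves New Bike Lanes",
     "short_description": "The vote passed 7-2 on Tuesday."},
    {"headline": "Local Library Extends Weekend Hours",
     "short_description": "Branches will stay open until 8 p.m."},
]
prompt = build_prompt("What did the city council approve?", articles)
print(prompt)
```

Printing the prompt this way makes it easier to check whether the context fits within the 512-token limit set in `generate_answer`, and whether the numbered markers such as `[1]` are something a small model might simply echo back instead of answering.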

In [10]:
print("="*70)
print("🚀 TEST RAG SYSTEM")
print("="*70)

questions = [
    "What happened with COVID boosters in America?",
    "Tell me about airline passenger incidents",
    "What natural disasters happened recently?"
]

for q in questions:
    answer, sources, _, _ = rag(q, top_k=3)
    print(f"\n❓ Question: {q}")
    print(f"💡 Answer: {answer}")
    print("📚 Sources:")
    for s in sources:
        print(f"   • [{s['category']}] {s['headline'][:55]}...")
    print("-"*70)

🚀 TEST RAG SYSTEM

❓ Question: What happened with COVID boosters in America?
💡 Answer: Car Thefts Spike During COVID-19 Pandemic
📚 Sources:
   • [CRIME] Car Thefts Spike During COVID-19 Pandemic...
   • [WORLD NEWS] More Than 15 Million Have Had Covid Vaccine In 'Extraor...
   • [U.S. NEWS] CDC Drops Some Quarantine, Screening Recommendations Fo...
----------------------------------------------------------------------

❓ Question: Tell me about airline passenger incidents
💡 Answer: [1] These Are The 10 Worst Habits Of Airplane Passengers
📚 Sources:
   • [TRAVEL] These Are The 10 Worst Habits Of Airplane Passengers...
   • [FIFTY] 50 Shades Of Shame-Worthy Behaviors Beyond Airline Pass...
   • [TRAVEL] The Never-Ending Challenge Of Securing Our Air Transpor...
----------------------------------------------------------------------

❓ Question: What natural disasters happened recently?
💡 Answer: [3]
📚 Sources:
   • [IMPACT] When Disaster Strikes, Mothers and Newborns Are the Mos...
   • [

In [11]:
# Evaluation function: a category match is used as a proxy for relevance
def evaluate(query, expected_category, k=5):
    indices, _ = retrieve_faiss(query, k)

    # Precision@K
    relevant = sum(1 for i in indices if df.iloc[i]['category'] == expected_category)
    precision = relevant / k

    # Reciprocal rank of the first relevant result (averaged over queries to get MRR)
    mrr = 0
    for rank, i in enumerate(indices, 1):
        if df.iloc[i]['category'] == expected_category:
            mrr = 1 / rank
            break

    return precision, mrr

print("="*70)
print("📊 TABLE 4: RETRIEVAL EVALUATION RESULTS")
print("="*70)

# Test cases for evaluation: (query, expected category)
test_cases = [
    ("COVID vaccine health booster", "U.S. NEWS"),
    ("funny comedy humor tweets", "COMEDY"),
    ("parenting children kids toddler", "PARENTING"),
    ("hurricane storm flood disaster", "WORLD NEWS"),
    ("movie film documentary cinema", "CULTURE & ARTS"),
    ("politics government election president", "POLITICS"),
]

# Run the evaluation, keeping the raw numeric values
eval_results = []
for query, cat in test_cases:
    p, m = evaluate(query, cat, k=5)
    eval_results.append({
        'Query': query,
        'Expected Category': cat,
        'Precision@5': p,
        'MRR': m
    })

# Show the evaluation table (rounded for display only)
eval_df = pd.DataFrame(eval_results)
display(eval_df.round(2))

# Compute averages from the unrounded values
avg_p = eval_df['Precision@5'].mean()
avg_m = eval_df['MRR'].mean()
print(f"\n📈 Average Precision@5: {avg_p:.3f}")
print(f"📈 Average MRR: {avg_m:.3f}")

📊 TABLE 4: RETRIEVAL EVALUATION RESULTS


Unnamed: 0,Query,Expected Category,Precision@5,MRR
0,COVID vaccine health booster,U.S. NEWS,0.0,0.0
1,funny comedy humor tweets,COMEDY,0.4,0.25
2,parenting children kids toddler,PARENTING,0.6,1.0
3,hurricane storm flood disaster,WORLD NEWS,0.0,0.0
4,movie film documentary cinema,CULTURE & ARTS,0.0,0.0
5,politics government election president,POLITICS,1.0,1.0



📈 Average Precision@5: 0.333
📈 Average MRR: 0.375
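
As a worked example of the two metrics, the sketch below computes Precision@K and reciprocal rank for one made-up ranked list of categories, then re-derives the averages from the per-query values in the evaluation table above. The function names and the sample ranking are assumptions for illustration; the two value lists are copied from the table.

```python
def precision_at_k(retrieved, expected, k):
    """Fraction of the top-k retrieved categories that match the expected one."""
    return sum(1 for c in retrieved[:k] if c == expected) / k

def reciprocal_rank(retrieved, expected):
    """1 / rank of the first match, or 0.0 if nothing matches."""
    for rank, c in enumerate(retrieved, 1):
        if c == expected:
            return 1 / rank
    return 0.0

# Hypothetical top-5 categories for a query whose expected category is COMEDY
retrieved = ["POLITICS", "COMEDY", "POLITICS", "COMEDY", "MEDIA"]
p = precision_at_k(retrieved, "COMEDY", 5)   # 2 matches out of 5
rr = reciprocal_rank(retrieved, "COMEDY")    # first match at rank 2
print(p, rr)  # 0.4 0.5

# Per-query values copied from the evaluation table
table_precision = [0.0, 0.4, 0.6, 0.0, 0.0, 1.0]
table_rr = [0.0, 0.25, 1.0, 0.0, 0.0, 1.0]
mean_p = sum(table_precision) / len(table_precision)
mean_rr = sum(table_rr) / len(table_rr)
print(f"{mean_p:.3f} {mean_rr:.3f}")  # 0.333 0.375
```

Since relevance here means "same category label", a query can score zero even when the retrieved articles are on topic but filed under a different category, so these numbers are best read as a rough lower bound rather than a measure of answer quality.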


In [12]:
print("="*70)
print("🤖 RAG NEWS QA - INTERACTIVE DEMO")
print("="*70)
print("Type a question (in English)")
print("Type 'quit' to exit\n")

while True:
    q = input("❓ Question: ")
    if q.lower() in ['quit', 'exit', 'q']:
        print("👋 Thank you!")
        break

    answer, sources, indices, scores = rag(q, top_k=3)

    print(f"\n💡 Answer: {answer}\n")

    # Show the sources in a table
    source_df = pd.DataFrame({
        'Rank': range(1, len(sources)+1),
        'Score': [f"{s:.4f}" for s in scores],
        'Category': [s['category'] for s in sources],
        'Headline': [s['headline'][:50] + '...' for s in sources]
    })
    display(source_df)
    print("\n" + "-"*70 + "\n")

🤖 RAG NEWS QA - INTERACTIVE DEMO
Type a question (in English)
Type 'quit' to exit

❓ Question: healty news

💡 Answer: [3]



Unnamed: 0,Rank,Score,Category,Headline
0,1,0.3757,MEDIA,New York Daily News Scorches Sean Hannity With...
1,2,0.3755,ENTERTAINMENT,Nick Cannon Wilds Out On Twitter To Shut Down ...
2,3,0.3537,ENTERTAINMENT,27 Deliciously Snarky Tweets About ‘The Bachel...



----------------------------------------------------------------------

❓ Question: covid news

💡 Answer: [2]



Unnamed: 0,Rank,Score,Category,Headline
0,1,0.5217,U.S. NEWS,"CDC Drops Some Quarantine, Screening Recommend..."
1,2,0.5086,WELLNESS,13 Of The Biggest Health News Stories Of 2013...
2,3,0.4907,WORLD NEWS,Iran Shutters Newspaper After Expert Questions...



----------------------------------------------------------------------

❓ Question: comedy news

💡 Answer: [1]



Unnamed: 0,Rank,Score,Category,Headline
0,1,0.4919,COMEDY,How Can You Expect Me To Be Funny When The Wor...
1,2,0.4725,POLITICS,Friday Talking Points -- Not Funny Anymore...
2,3,0.4478,COMEDY,Comedians Live Tweet The Oscars...



----------------------------------------------------------------------


