
Cell 1 - Package Installation

In [16]:
# Install Required Packages
print("Installing required packages...")

!pip install -q sentence-transformers datasets transformers torch torchvision torchaudio
!pip install -q faiss-cpu pandas numpy scikit-learn
!pip install -q gradio
!pip install -q huggingface_hub accelerate fsspec

print("All packages installed successfully!")
print("Please restart runtime if prompted, then proceed to Cell 2")

Installing required packages...
All packages installed successfully!
Please restart runtime if prompted, then proceed to Cell 2


Cell 2 - Import Libraries and Setup

In [17]:
# Import Libraries and Setup
import os
import json
import pandas as pd
import numpy as np
import torch
from datasets import load_dataset
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader
import faiss
from transformers import pipeline
import gradio as gr
import warnings
warnings.filterwarnings('ignore')

# Check if GPU is available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")
print("Libraries imported successfully!")

Using device: cpu
Libraries imported successfully!


Cell 3 - Data Processing Class

In [18]:
# Quote Data Processor Class
class ColabQuoteDataProcessor:
    def __init__(self):
        self.dataset = None
        self.processed_data = None

    def load_data(self, max_samples=5000):
        """Load dataset using pandas read_json from HuggingFace"""
        print(" Loading dataset from HuggingFace...")
        try:
            # Load dataset using the specified method
            df = pd.read_json("hf://datasets/Abirate/english_quotes/quotes.jsonl", lines=True)
            print(f"Dataset loaded successfully. Total size: {len(df)}")

            # Limit dataset size for Colab memory constraints
            if len(df) > max_samples:
                df = df.sample(n=max_samples, random_state=42).reset_index(drop=True)
                print(f"Randomly sampled {max_samples} quotes for Colab optimization")

            # Store as dataset format for compatibility
            self.dataset = {"train": df}
            print(f" Dataset ready with {len(df)} quotes")
            return self.dataset

        except Exception as e:
            print(f" Error loading dataset with pandas method: {e}")
            print(" Trying alternative HuggingFace datasets library...")
            try:
                # Fallback to datasets library
                from datasets import load_dataset
                dataset = load_dataset("Abirate/english_quotes")
                df = pd.DataFrame(dataset['train'])

                if len(df) > max_samples:
                    df = df.sample(n=max_samples, random_state=42).reset_index(drop=True)
                    print(f"Fallback: Limited to {max_samples} samples")

                self.dataset = {"train": df}
                return self.dataset

            except Exception as e2:
                print(f" Fallback also failed: {e2}")
                # Create sample data if both methods fail
                return self.create_sample_data()

    def create_sample_data(self):
        """Create sample data if dataset loading fails"""
        print("🔧 Creating sample dataset as fallback...")
        sample_quotes = [
            {"quote": "The only way to do great work is to love what you do.", "author": "Steve Jobs", "tags": ["motivation", "work", "success"]},
            {"quote": "Life is what happens to you while you're busy making other plans.", "author": "John Lennon", "tags": ["life", "philosophy"]},
            {"quote": "The future belongs to those who believe in the beauty of their dreams.", "author": "Eleanor Roosevelt", "tags": ["dreams", "future", "hope"]},
            {"quote": "It is during our darkest moments that we must focus to see the light.", "author": "Aristotle", "tags": ["hope", "perseverance"]},
            {"quote": "The way to get started is to quit talking and begin doing.", "author": "Walt Disney", "tags": ["action", "motivation"]},
            {"quote": "Your time is limited, don't waste it living someone else's life.", "author": "Steve Jobs", "tags": ["life", "authenticity"]},
            {"quote": "If life were predictable it would cease to be life, and be without flavor.", "author": "Eleanor Roosevelt", "tags": ["life", "unpredictability"]},
            {"quote": "The only impossible journey is the one you never begin.", "author": "Tony Robbins", "tags": ["journey", "motivation"]},
            {"quote": "In the end, we will remember not the words of our enemies, but the silence of our friends.", "author": "Martin Luther King Jr.", "tags": ["friendship", "courage"]},
            {"quote": "Success is not final, failure is not fatal: it is the courage to continue that counts.", "author": "Winston Churchill", "tags": ["success", "failure", "courage"]}
        ]

        # Create dataset structure
        df = pd.DataFrame(sample_quotes)
        self.dataset = {"train": df}
        print(f"Sample dataset created with {len(df)} quotes")
        return self.dataset

    def preprocess_data(self):
        """Clean and preprocess the dataset"""
        print("Preprocessing data...")

        # Get DataFrame from dataset
        if isinstance(self.dataset['train'], pd.DataFrame):
            df = self.dataset['train'].copy()
        else:
            df = pd.DataFrame(self.dataset['train'])

        print(f"Original dataset shape: {df.shape}")
        print(f"Columns: {df.columns.tolist()}")

        # Display sample data
        print("Sample data:")
        print(df.head(2))

        # Handle missing values
        print("Handling missing values...")
        initial_size = len(df)
        df = df.dropna(subset=['quote', 'author'])
        print(f"Removed {initial_size - len(df)} rows with missing quote/author")

        # Clean text
        df['quote_clean'] = df['quote'].astype(str).str.strip()
        df['author_clean'] = df['author'].astype(str).str.strip()

        # Handle tags - check if tags column exists and handle different formats
        if 'tags' in df.columns:
            print(" Processing tags column...")
            df['tags'] = df['tags'].apply(lambda x:
                x if isinstance(x, list)
                else [x] if isinstance(x, str) and x.strip()
                else []
            )
        else:
            print("No tags column found, creating empty tags")
            df['tags'] = [[] for _ in range(len(df))]

        # Create search text for embedding (use the cleaned quote/author columns)
        df['search_text'] = df.apply(
            lambda row: f"Quote: {row['quote_clean']} Author: {row['author_clean']} Tags: {', '.join(row['tags']) if row['tags'] else 'no tags'}",
            axis=1
        )

        self.processed_data = df.reset_index(drop=True)
        print(f"Data preprocessing completed!")
        print(f"Final dataset size: {len(df)} quotes")
        print(f"Sample search text: {df['search_text'].iloc[0][:100]}...")

        # Return the re-indexed frame so row positions match the stored copy
        return self.processed_data

print("Data Processing Class defined successfully!")

Data Processing Class defined successfully!
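
The tag handling and search-text construction in `preprocess_data` can be tried in isolation. The following is a minimal, dependency-free sketch of that logic; the helper names `normalize_tags` and `build_search_text` and the three sample rows are hypothetical and are not part of the notebook.

```python
def normalize_tags(value):
    """Return tags as a list: lists pass through, non-empty strings are wrapped, anything else becomes []."""
    if isinstance(value, list):
        return value
    if isinstance(value, str) and value.strip():
        return [value]
    return []


def build_search_text(quote, author, tags):
    """Combine quote, author, and tags into the single string that gets embedded."""
    tag_text = ", ".join(tags) if tags else "no tags"
    return f"Quote: {quote} Author: {author} Tags: {tag_text}"


# Hypothetical rows covering the three tag formats handled above
rows = [
    {"quote": "Love all, trust a few.", "author": "William Shakespeare", "tags": ["love", "trust"]},
    {"quote": "Be yourself.", "author": "Unknown", "tags": "identity"},
    {"quote": "Stay curious.", "author": "Unknown", "tags": None},
]

search_texts = [
    build_search_text(r["quote"], r["author"], normalize_tags(r["tags"]))
    for r in rows
]
for text in search_texts:
    print(text)
```

Embedding the author and tags together with the quote is what lets queries such as an author name match, at the cost of mixing metadata into the semantic vector.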


Cell 4 - Embedding Model Class

In [19]:
# Quote Embedding Model Class
class ColabQuoteEmbeddingModel:
    def __init__(self, model_name='all-MiniLM-L6-v2'):
        self.model_name = model_name
        self.model = None
        self.device = device

    def load_model(self):
        """Load the sentence transformer model"""
        print(f" Loading model: {self.model_name}")
        try:
            self.model = SentenceTransformer(self.model_name, device=str(self.device))
            print(f" Model loaded successfully on {self.device}")
        except Exception as e:
            print(f" Error loading model: {e}")
            # Fallback to CPU
            self.model = SentenceTransformer(self.model_name, device='cpu')
            print(" Loaded model on CPU")
        return self.model

print(" Embedding Model Class defined successfully!")

 Embedding Model Class defined successfully!


Cell 5 - RAG Pipeline Class

In [20]:
# Quote RAG Pipeline Class
class ColabQuoteRAGPipeline:
    def __init__(self, embedding_model):
        self.embedding_model = embedding_model
        self.index = None
        self.quotes_data = None
        self.embeddings = None

    def create_embeddings(self, quotes_data):
        """Create embeddings for all quotes"""
        print("Creating embeddings...")
        self.quotes_data = quotes_data.reset_index(drop=True)

        # Generate embeddings in batches to manage memory
        texts = quotes_data['search_text'].tolist()
        batch_size = 32
        embeddings_list = []

        for i in range(0, len(texts), batch_size):
            batch_texts = texts[i:i+batch_size]
            batch_embeddings = self.embedding_model.encode(batch_texts, convert_to_tensor=False)
            embeddings_list.append(batch_embeddings)
            print(f"Processed {min(i+batch_size, len(texts))}/{len(texts)} texts")

        self.embeddings = np.vstack(embeddings_list)

        # Create FAISS index
        dimension = self.embeddings.shape[1]
        self.index = faiss.IndexFlatIP(dimension)  # Inner product for similarity

        # Normalize embeddings for cosine similarity (FAISS expects contiguous float32)
        self.embeddings = np.ascontiguousarray(self.embeddings, dtype='float32')
        faiss.normalize_L2(self.embeddings)
        self.index.add(self.embeddings)

        print(f"Index created with {self.index.ntotal} vectors")
        return self.index

    def retrieve_quotes(self, query, top_k=5):
        """Retrieve relevant quotes for a query"""
        if self.index is None:
            raise ValueError("Index not created. Call create_embeddings() first.")

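        # Encode query (FAISS expects contiguous float32)
        query_embedding = np.ascontiguousarray(self.embedding_model.encode([query]), dtype='float32')
        faiss.normalize_L2(query_embedding)

        # Search (top_k may arrive as a float from a UI slider, so cast it)
        similarities, indices = self.index.search(query_embedding, int(top_k))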

        # Get results
        results = []
        for i, (similarity, idx) in enumerate(zip(similarities[0], indices[0])):
            if 0 <= idx < len(self.quotes_data):  # FAISS pads missing results with -1
                quote_data = self.quotes_data.iloc[idx]
                results.append({
                    'quote': quote_data['quote'],
                    'author': quote_data['author'],
                    'tags': quote_data['tags'],
                    'similarity': float(similarity),
                    'rank': i + 1
                })

        return results

    def query(self, user_query, top_k=5):
        """Complete RAG query processing"""
        retrieved_quotes = self.retrieve_quotes(user_query, top_k)
        return retrieved_quotes

print("RAG Pipeline Class defined successfully!")

RAG Pipeline Class defined successfully!
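
The pipeline pairs `faiss.IndexFlatIP` with `faiss.normalize_L2` because the inner product of two unit-length vectors equals their cosine similarity. The sketch below illustrates that idea with a brute-force search in plain Python; it is not how FAISS is implemented, and the function names and toy vectors are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def inner_product(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def top_k_search(query, corpus, k):
    """Brute-force search: score every corpus vector, return the k best (score, index) pairs."""
    q = l2_normalize(query)
    scored = [(inner_product(q, l2_normalize(v)), i) for i, v in enumerate(corpus)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy 2-D "embeddings": same direction, 45 degrees off, and orthogonal to the query
corpus = [[1.0, 0.0], [3.0, 3.0], [0.0, 2.0]]
results = top_k_search([2.0, 0.0], corpus, k=2)
for score, idx in results:
    print(f"index={idx} cosine={score:.3f}")
```

Because both sides are normalized, vector length has no effect on the ranking: only direction matters.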


Cell 6 - System Initialization

In [21]:
# Load and Process Data
print("Loading and processing data...")

data_processor = ColabQuoteDataProcessor()
dataset = data_processor.load_data(max_samples=1000)  # Reduced for better performance
processed_data = data_processor.preprocess_data()

print(f" Data loaded: {len(processed_data)} quotes")
print(" Ready for next step!")

Loading and processing data...
 Loading dataset from HuggingFace...
Dataset loaded successfully. Total size: 2508
Randomly sampled 1000 quotes for Colab optimization
 Dataset ready with 1000 quotes
Preprocessing data...
Original dataset shape: (1000, 3)
Columns: ['quote', 'author', 'tags']
Sample data:
                                               quote                author  \
0  “If you never did you should. These things are...             Dr. Seuss   
1         “Love all, trust a few, do wrong to none.”  William Shakespeare,   

                             tags  
0                         [suess]  
1  [do-wrong, love, trust, wrong]  
Handling missing values...
Removed 0 rows with missing quote/author
 Processing tags column...
Data preprocessing completed!
Final dataset size: 1000 quotes
Sample search text: Quote: “If you never did you should. These things are fun and fun is good.” Author: Dr. Seuss Tags: ...
 Data loaded: 1000 quotes
 Ready for next step!


In [22]:
# Load Embedding Model
print(" Loading embedding model...")

embedding_model = ColabQuoteEmbeddingModel()
model = embedding_model.load_model()

print(" Model loaded successfully!")
print(" Ready for embedding creation!")

 Loading embedding model...
 Loading model: all-MiniLM-L6-v2
 Model loaded successfully on cpu
 Model loaded successfully!
 Ready for embedding creation!


In [23]:
# Create RAG Pipeline and Embeddings
print(" Creating RAG pipeline and embeddings...")
print(" This may take a few minutes depending on dataset size...")

rag_pipeline = ColabQuoteRAGPipeline(model)

# Add progress tracking
import time
start_time = time.time()

rag_pipeline.create_embeddings(processed_data)

end_time = time.time()
print(f" Embedding creation took {end_time - start_time:.2f} seconds")

# Store in global variable for easy access
rag_system = rag_pipeline

print(" System fully initialized!")
print(f" Ready to search through {len(processed_data)} quotes")
print(" Proceed to next cell for testing!")

 Creating RAG pipeline and embeddings...
 This may take a few minutes depending on dataset size...
Creating embeddings...
Processed 32/1000 texts
Processed 64/1000 texts
Processed 96/1000 texts
Processed 128/1000 texts
Processed 160/1000 texts
Processed 192/1000 texts
Processed 224/1000 texts
Processed 256/1000 texts
Processed 288/1000 texts
Processed 320/1000 texts
Processed 352/1000 texts
Processed 384/1000 texts
Processed 416/1000 texts
Processed 448/1000 texts
Processed 480/1000 texts
Processed 512/1000 texts
Processed 544/1000 texts
Processed 576/1000 texts
Processed 608/1000 texts
Processed 640/1000 texts
Processed 672/1000 texts
Processed 704/1000 texts
Processed 736/1000 texts
Processed 768/1000 texts
Processed 800/1000 texts
Processed 832/1000 texts
Processed 864/1000 texts
Processed 896/1000 texts
Processed 928/1000 texts
Processed 960/1000 texts
Processed 992/1000 texts
Processed 1000/1000 texts
Index created with 1000 vectors
 Embedding creation took 92.33 seconds
 System fully initialized!
 Ready to search through 1000 quotes
 Proceed to next cell for testing!
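
The batching loop in `create_embeddings` prints one line per batch, which produces a long log for 1000 texts. One possible alternative, sketched below under the assumption that a coarser progress report is acceptable, is to log only every tenth batch and the final one. The `iter_batches` helper is hypothetical, and the encoding step is replaced by a simple counter so the example runs without a model.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements each."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


texts = [f"text {i}" for i in range(1000)]  # stand-in for the search_text column
processed = 0
progress_log = []

for batch_number, batch in enumerate(iter_batches(texts, 32), start=1):
    processed += len(batch)  # a real pipeline would encode the batch here
    # Report every 10th batch and always on the final one
    if batch_number % 10 == 0 or processed == len(texts):
        progress_log.append(processed)
        print(f"Processed {processed}/{len(texts)} texts")
```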

Cell 7 - Test Search Function

In [24]:
# Quick Test to Verify System
print(" Testing the system...")

# Simple test function
def quick_test():
    try:
        test_query = "motivation"
        print(f" Testing query: '{test_query}'")

        results = rag_system.query(test_query, top_k=2)

        if results:
            print(" System is working!")
            for i, result in enumerate(results, 1):
                print(f"{i}. \"{result['quote'][:50]}...\" - {result['author']}")
            return True
        else:
            print(" No results found")
            return False

    except Exception as e:
        print(f" Test failed: {e}")
        return False

# Run the test
if quick_test():
    print(" System ready for Gradio interface!")
else:
    print(" Please check previous cells for errors")

 Testing the system...
 Testing query: 'motivation'
 System is working!
1. "“The starting point of all achievement is DESIRE. ..." - Napoleon Hill,
2. "“Of course motivation is not permanent. But then, ..." - Zig Ziglar,
 System ready for Gradio interface!


In [25]:
# Define the main search function for Gradio
def search_quotes(query, num_results=5):
    """Search for quotes based on user query"""
    print(f" Searching for: '{query}'")  # Debug print

    if not query or not query.strip():
        return " Please enter a valid query."

    try:
        # Query the system
        retrieved_quotes = rag_system.query(query, top_k=num_results)

        if not retrieved_quotes:
            return f"No quotes found for: '{query}'"

        # Format response nicely
        response = f" **Search Results for:** '{query}'\n\n"

        for i, quote in enumerate(retrieved_quotes, 1):
            response += f"**{i}. Quote (Similarity: {quote['similarity']:.3f})**\n"
            response += f"💬 \"{quote['quote']}\"\n"
            response += f"👤 **Author:** {quote['author']}\n"

            if quote['tags']:
                response += f" **Tags:** {', '.join(quote['tags'])}\n"

            response += "\n" + "─" * 50 + "\n\n"

        return response

    except Exception as e:
        print(f" Search error: {e}")  # Debug print
        return f" Search failed: {str(e)}"

print(" Search function defined!")

 Search function defined!
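
The formatting half of `search_quotes` can be separated from retrieval so it can be checked without loading the model. This is a minimal sketch of that split, assuming result dictionaries shaped like those returned by `retrieve_quotes`; the function name `format_results` and the sample entry are hypothetical placeholders, not real search output.

```python
def format_results(query, results):
    """Render retrieved quotes as Markdown; `results` mirrors the dicts built in retrieve_quotes()."""
    if not results:
        return f"No quotes found for: '{query}'"

    parts = [f"**Search Results for:** '{query}'", ""]
    for rank, item in enumerate(results, 1):
        parts.append(f"**{rank}. Quote (Similarity: {item['similarity']:.3f})**")
        parts.append(f"\"{item['quote']}\"")
        parts.append(f"**Author:** {item['author']}")
        if item["tags"]:
            parts.append(f"**Tags:** {', '.join(item['tags'])}")
        parts.append("")  # blank line between entries
    return "\n".join(parts)


# Placeholder entry for illustration only
sample_results = [
    {"quote": "Example quote text.", "author": "Example Author",
     "tags": ["example", "placeholder"], "similarity": 0.51234, "rank": 1},
]
formatted = format_results("example query", sample_results)
print(formatted)
```

Keeping the formatter free of model calls means a failure in the Gradio output can be narrowed down to either retrieval or rendering.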


Cell 8 - Gradio Interface

In [26]:
# Create Gradio Interface (Don't launch yet)
def create_interface():
    """Create a clean Gradio interface"""

    with gr.Blocks(title="Quote Search System", theme=gr.themes.Soft()) as demo:
        gr.Markdown("# Semantic Quote Search System")
        gr.Markdown("*Find quotes using natural language queries*")

        # Search Interface
        gr.Markdown("### Search for Quotes")

        with gr.Row():
            with gr.Column(scale=4):
                query_input = gr.Textbox(
                    label="Enter your search query",
                    placeholder="e.g., 'quotes about love', 'motivation quotes', 'Steve Jobs quotes'",
                    lines=2
                )
            with gr.Column(scale=1):
                num_results = gr.Slider(
                    label="Number of results",
                    minimum=1,
                    maximum=10,
                    value=5,
                    step=1
                )

        search_btn = gr.Button("Search Quotes", variant="primary", size="lg")

        # Results
        results_output = gr.Markdown(
            label="Search Results",
            value="Enter a query and click 'Search Quotes' to see results."
        )

        search_btn.click(
            search_quotes,
            inputs=[query_input, num_results],
            outputs=results_output
        )

        # Quick Examples
        gr.Markdown("### Quick Examples")

        example_queries = [
            "quotes about love",
            "motivational quotes",
            "Steve Jobs quotes",
            "quotes about life",
            "inspirational quotes",
            "quotes about success"
        ]

        with gr.Row():
            for query in example_queries[:3]:
                gr.Button(query, size="sm").click(
                    lambda q=query: q, outputs=query_input
                )

        with gr.Row():
            for query in example_queries[3:]:
                gr.Button(query, size="sm").click(
                    lambda q=query: q, outputs=query_input
                )

    return demo

# Create the interface
print("Creating Gradio interface...")
demo = create_interface()
print("Interface created successfully!")
print("Ready to launch in next cell!")

Creating Gradio interface...
Interface created successfully!
Ready to launch in next cell!


Cell 9 - Launch Interface

In [27]:
# Launch the Gradio Interface
print("🚀 Launching Quote Search System Interface...")

# Create and launch the interface
demo = create_interface()

# Launch with public sharing for Colab
demo.launch(
    share=True,      # Creates public URL
    debug=True,      # Show debug info
    height=600,      # Interface height
    show_error=True, # Show errors
    quiet=False      # Show launch info
)

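# Note: with debug=True, launch() blocks in Colab, so the lines below
# only print after the server has been interrupted.
print("✅ Interface launched successfully!")
print("🌐 A public URL has been generated for sharing")
print("⏹️ To stop the interface, interrupt the kernel or restart runtime")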

🚀 Launching Quote Search System Interface...
Colab notebook detected. This cell will run indefinitely so that you can see errors and logs. To turn off, set debug=False in launch().
* Running on public URL: https://aed85085849f38f46a.gradio.live

This share link expires in 1 week. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)


Keyboard interruption in main thread... closing server.
Killing tunnel 127.0.0.1:7860 <> https://aed85085849f38f46a.gradio.live
✅ Interface launched successfully!
🌐 A public URL has been generated for sharing
⏹️ To stop the interface, interrupt the kernel or restart runtime


Cell 10 - Stop and Cleanup

In [28]:
# STOP the interface and clean up
print("Stopping the interface...")

try:
    # Stop the Gradio interface
    demo.close()
    print("Gradio interface stopped!")
except Exception as e:
    print(f"Interface stop warning: {e}")

try:
    # Clear large variables from memory
    del rag_system
    del rag_pipeline
    del embedding_model
    del processed_data
    del demo
    print("Large variables cleared from memory!")
except NameError:
    print("Some variables were already cleared")

# Force garbage collection
import gc
gc.collect()

print("Cleanup completed!")
print("You can now run other code or restart if needed")
print("To restart the system, run cells 1-9 again")

Stopping the interface...
Closing server running on port: 7860
Gradio interface stopped!
Large variables cleared from memory!
Cleanup completed!
You can now run other code or restart if needed
To restart the system, run cells 1-9 again
