# 🔍 Smart Document Search Engine

## 📚 Overview
This notebook implements a **TF-IDF (Term Frequency-Inverse Document Frequency)** search engine from scratch. 
It supports both **Arabic** and **English** documents and uses **Cosine Similarity** to rank search results.

### 🚀 Features
- **Bilingual Support**: Handles English and Arabic text.
- **PDF Processing**: Extracts text from PDF files automatically.
- **Vector Space Model**: Represents text as mathematical vectors.
- **Ranked Retrieval**: Returns the most relevant results first.

---

## 🛠️ Step 1: Setup Workspace
First, we install the necessary libraries.

In [None]:
# Install required packages
!pip install flask nltk PyPDF2

## 📦 Step 2: Imports & Initialization
We import standard libraries for file handling and math, along with `nltk` for natural language processing.

In [None]:
import re
import math
import os
import glob
from collections import defaultdict

# PDF Handling
import PyPDF2

# NLP Libraries
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Download necessary NLTK datasets
print("⏳ Downloading NLTK resources...")
nltk.download('stopwords', quiet=True)
nltk.download('wordnet', quiet=True)
print("✅ NLTK resources ready.")

---
## 🔤 Step 3: Text Preprocessing

The quality of a search engine depends heavily on preprocessing. We perform:
1.  **Normalization**: Unifying character forms (especially for Arabic).
2.  **Tokenization**: Splitting text into words.
3.  **Stopword Removal**: Removing common words like "the", "in", "في", "من".
4.  **Lemmatization**: Converting words to their base form (e.g., "engines" -> "engine").
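
As a rough, dependency-free illustration of the first three steps, the sketch below runs a mixed English/Arabic string through normalization, tokenization, and stopword removal. The helper name `toy_preprocess` and the tiny stopword sets are assumptions made for this example only, and lemmatization is left out because it needs the WordNet data.

```python
import re

# Hypothetical, deliberately tiny stopword sets (illustration only).
TOY_STOPWORDS_EN = {"the", "in", "a"}
TOY_STOPWORDS_AR = {"في", "من"}

def toy_preprocess(text):
    """Normalize, tokenize, and drop stopwords (no lemmatization)."""
    text = text.lower()
    # Normalization: unify alef variants and map ta marbuta to ha.
    text = re.sub("[إأآ]", "ا", text)
    text = re.sub("ة", "ه", text)
    # Tokenization: replace punctuation with spaces, then split on whitespace.
    tokens = re.sub(r"[^\w\s]", " ", text).split()
    # Stopword removal for both languages.
    stopwords_all = TOY_STOPWORDS_EN | TOY_STOPWORDS_AR
    return [t for t in tokens if t not in stopwords_all]

toy_tokens = toy_preprocess("The engine ranks documents, في المكتبة!")
print(toy_tokens)  # three English tokens plus one normalized Arabic token
```

The English article and the Arabic preposition are dropped, the punctuation disappears, and the final Arabic word comes out with its trailing ta marbuta mapped to ha.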

In [None]:
# ==========================================
# Configuration & Constants
# ==========================================

# Define Arabic Stopwords manually as they might be incomplete in NLTK
ARABIC_STOPWORDS = {
    "في", "على", "من", "إلى", "عن", "مع", "هذا", "هذه",
    "هو", "هي", "هم", "هن", "كان", "كانت", "يكون",
    "ما", "لا", "لم", "لن", "أن", "إن", "كل", "أي"
}

# Initialize English Stopwords and Lemmatizer
STOP_WORDS_EN = set(stopwords.words("english"))
LEMMATIZER = WordNetLemmatizer()

# Regex to identify Arabic characters
ARABIC_REGEX = re.compile(r"[\u0600-\u06FF]+")

print("✅ Preprocessing configuration loaded.")

In [None]:
def normalize_arabic(text: str) -> str:
    """
    Normalize Arabic text by removing diacritics and standardizing characters.
    Example: 'أحمد' -> 'احمد'
    """
    text = re.sub("[إأآا]", "ا", text)
    text = re.sub("ى", "ي", text)
    text = re.sub("ؤ", "و", text)
    text = re.sub("ئ", "ي", text)
    text = re.sub("ة", "ه", text)
    text = re.sub(r"[\u064B-\u0652]", "", text)  # Remove diacritics
    return text

def is_arabic_token(token: str) -> bool:
    """Check if the token contains Arabic characters."""
    return ARABIC_REGEX.search(token) is not None

print("✅ Arabic normalization functions defined.")

In [None]:
# Text is normalized before stopword filtering, so the stopword set must be
# normalized the same way; otherwise entries containing hamza or alef maqsura never match.
ARABIC_STOPWORDS_NORMALIZED = {normalize_arabic(word) for word in ARABIC_STOPWORDS}

def preprocess(text: str) -> list:
    """
    Main preprocessing function.
    Input: Raw String
    Output: List of cleaned tokens
    """
    if not text:
        return []

    # 1. Lowercase & Normalize
    text = text.lower()
    text = normalize_arabic(text)

    # 2. Remove Punctuation (keep only word chars and spaces)
    text = re.sub(r"[^\w\s]", " ", text)

    tokens = text.split()
    clean_tokens = []

    for token in tokens:
        # Handle Arabic Tokens
        if is_arabic_token(token):
            if token not in ARABIC_STOPWORDS_NORMALIZED and len(token) > 1:
                clean_tokens.append(token)

        # Handle English Tokens
        elif token.isalpha():
            if token not in STOP_WORDS_EN:
                # Lemmatize as a noun (the default POS): 'engines' -> 'engine'
                clean_tokens.append(LEMMATIZER.lemmatize(token))

    return clean_tokens

print("✅ Main preprocessing function defined.")

---
## 📄 Step 4: Document Loading (PDFs)

We need to extract text from PDF files to build our search index. 
We split the text into **paragraphs** to make search results more specific (granular).
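
The splitting rule can be tried on a plain string before any PDF is involved. The following is a minimal sketch, assuming a hypothetical helper `split_into_paragraphs` and a made-up page of text; it keeps only newline-separated chunks longer than a minimum length, which filters out headings and page numbers.

```python
def split_into_paragraphs(text, min_length=20):
    """Split on newlines and keep only chunks longer than min_length characters."""
    paragraphs = []
    for chunk in text.split("\n"):
        chunk = chunk.strip()
        if len(chunk) > min_length:
            paragraphs.append(chunk)
    return paragraphs

# Hypothetical extracted page: a heading, two sentences, and a page number.
toy_page = (
    "Chapter 1\n"
    "Information retrieval finds documents relevant to a query.\n"
    "\n"
    "12\n"
    "TF-IDF weighs each term by how rare it is in the corpus.\n"
)

toy_paragraphs = split_into_paragraphs(toy_page)
for paragraph in toy_paragraphs:
    print(paragraph)
```

Only the two full sentences survive; the short heading, the blank line, and the page number are discarded. Text extracted from PDFs often has a newline at the end of every visual line, so in practice each "paragraph" may be closer to a single line.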

In [None]:
MIN_PARAGRAPH_LENGTH = 20

def load_pdf(filename: str) -> list:
    """
    Reads a PDF file and returns a list of substantial paragraphs.
    """
    if not os.path.exists(filename):
        return []

    docs = []
    try:
        with open(filename, "rb") as f:
            reader = PyPDF2.PdfReader(f)
            for page in reader.pages:
                text = page.extract_text()
                if not text:
                    continue
                # Split by newlines to get potential paragraphs
                for paragraph in text.split("\n"):
                    paragraph = paragraph.strip()
                    if len(paragraph) > MIN_PARAGRAPH_LENGTH:
                        docs.append(paragraph)
    except Exception as e:
        print(f"❌ Error loading PDF {filename}: {e}")
        return []
    
    return docs

print("✅ PDF loading function defined.")

In [None]:
def load_all_pdfs_from_folder(folder_path: str) -> list:
    """Scans a directory for all .pdf files and loads them."""
    all_documents = []
    
    if not os.path.exists(folder_path):
        print(f"⚠️ Folder not found: {folder_path}")
        return []
    
    pdf_files = glob.glob(os.path.join(folder_path, "*.pdf"))
    
    if not pdf_files:
        print(f"ℹ️ No PDF files found in: {folder_path}")
        return []
    
    print(f"📂 Found {len(pdf_files)} PDF file(s). Loading...")
    
    for pdf_file in pdf_files:
        print(f"   → Processing: {os.path.basename(pdf_file)}")
        docs = load_pdf(pdf_file)
        all_documents.extend(docs)
    
    print(f"✅ Total loaded documents (paragraphs): {len(all_documents)}")
    return all_documents

print("✅ Batch PDF loading function defined.")

---
## 🧮 Step 5: The Search Engine Core

Here we implement the **Vector Space Model** logic:

### 1. TF (Term Frequency)
How often a word appears in a specific document.
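
To make this concrete, the sketch below counts terms in one hypothetical tokenized document and then applies the `1 + log(TF)` dampening that the notebook's `tfidf_vector` function uses. The names `toy_doc_tokens`, `toy_tf`, and `dampened_tf` are illustrative assumptions, not part of the notebook.

```python
import math
from collections import Counter

# Hypothetical document that has already been preprocessed into tokens.
toy_doc_tokens = ["search", "engine", "search", "index", "search"]

# Raw term frequency: how many times each token occurs in this document.
toy_tf = Counter(toy_doc_tokens)

def dampened_tf(count):
    """Log-dampened term frequency: 1 + ln(count), or 0 for absent terms."""
    return 1 + math.log(count) if count > 0 else 0.0

for term, count in toy_tf.items():
    print(term, count, round(dampened_tf(count), 3))
```

A term that occurs three times gets a weight of about 2.1 rather than 3, so repeating a word inside one document raises its weight only slowly.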

### 2. IDF (Inverse Document Frequency)
How unique a word is across all documents. Common words like "is" have low IDF, while specific terms like "algorithm" have high IDF.

$$ IDF(t) = \log \left( \frac{Total\ Documents}{Documents\ with\ term\ t} \right) $$
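
A small worked example may help here. The sketch below applies the formula to a hypothetical four-document corpus; `toy_docs` and `toy_idf` are names invented for this illustration, and the natural logarithm is assumed, as in Python's `math.log`.

```python
import math

# Hypothetical toy corpus: each document is already tokenized.
toy_docs = [
    ["search", "engine", "ranking"],
    ["search", "algorithm"],
    ["football", "match"],
    ["search", "football"],
]

def toy_idf(docs):
    """Return {term: log(N / df)}, where df is the number of docs containing the term."""
    n_docs = len(docs)
    doc_freq = {}
    for doc in docs:
        for term in set(doc):  # count each term at most once per document
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}

toy_idf_values = toy_idf(toy_docs)
print(round(toy_idf_values["search"], 3))     # in 3 of 4 docs -> log(4/3)
print(round(toy_idf_values["football"], 3))   # in 2 of 4 docs -> log(4/2)
print(round(toy_idf_values["algorithm"], 3))  # in 1 of 4 docs -> log(4/1)
```

The rarer the term, the larger its IDF. A term that appeared in every document would get log(1) = 0 and contribute nothing to the ranking.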

### 3. Cosine Similarity
Measures the cosine of the angle between two vectors (Query Vector vs. Document Vector). A value of **1.0** means identical direction (perfect match), while **0.0** means the vectors share no terms.
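
For a query vector $q$ and a document vector $d$, the score is:

$$ \cos(\theta) = \frac{q \cdot d}{\|q\| \, \|d\|} $$

The sketch below evaluates this on a few hand-made sparse vectors stored as dictionaries. The function `toy_cosine` and the weights are assumptions chosen to make the arithmetic easy to follow.

```python
import math

def toy_cosine(a, b):
    """Cosine similarity between two sparse vectors stored as {term: weight}."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical weights, chosen for easy arithmetic.
toy_query = {"search": 1.0, "engine": 1.0}
toy_same_direction = {"search": 2.0, "engine": 2.0}  # same terms, scaled by 2
toy_partial = {"search": 1.0, "football": 1.0}       # one shared term
toy_unrelated = {"football": 3.0}                    # no shared terms

print(round(toy_cosine(toy_query, toy_same_direction), 3))  # approximately 1.0
print(round(toy_cosine(toy_query, toy_partial), 3))         # approximately 0.5
print(round(toy_cosine(toy_query, toy_unrelated), 3))       # 0.0
```

Scaling a vector does not change its direction, which is why the doubled document still scores about 1.0. This makes the measure insensitive to document length.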

In [None]:
def build_tf(docs: list) -> list:
    """Calculate Term Frequency (TF) for each document."""
    tf_docs = []
    for doc in docs:
        freq = defaultdict(int)
        for token in preprocess(doc):
            freq[token] += 1
        tf_docs.append(freq)
    return tf_docs

def compute_idf(tf_docs: list) -> dict:
    """Calculate Inverse Document Frequency (IDF) for all unique terms."""
    N = len(tf_docs)
    df = defaultdict(int)

    for doc in tf_docs:
        for term in doc.keys():
            df[term] += 1

    # Standard IDF formula
    idf = {term: math.log(N / df_val) for term, df_val in df.items() if df_val > 0}
    return idf

print("✅ TF and IDF calculation functions defined.")

In [None]:
def tfidf_vector(tf_doc: dict, idf: dict) -> dict:
    """
    Convert a document's TF dictionary into a TF-IDF vector.
    Formula: (1 + log(TF)) * IDF
    """
    vec = {}
    for term, freq in tf_doc.items():
        if freq <= 0:
            continue
        idf_val = idf.get(term, 0.0)
        if idf_val == 0:
            continue
        
        # Log-normalization reduces impact of very frequent words within a doc
        weight = (1 + math.log(freq)) * idf_val
        vec[term] = weight
    return vec

print("✅ TF-IDF vectorization function defined.")

In [None]:
def cosine_similarity(v1: dict, v2: dict) -> float:
    """
    Calculate cosine similarity between two sparse vectors.
    Result ranges from 0.0 (no match) to 1.0 (perfect match).
    """
    if not v1 or not v2:
        return 0.0

    # Optimization: iterate over the shorter vector
    if len(v1) > len(v2):
        v1, v2 = v2, v1

    dot_product = 0.0
    for term, val in v1.items():
        dot_product += val * v2.get(term, 0.0)

    norm1 = math.sqrt(sum(v ** 2 for v in v1.values()))
    norm2 = math.sqrt(sum(v ** 2 for v in v2.values()))

    if norm1 == 0.0 or norm2 == 0.0:
        return 0.0

    return dot_product / (norm1 * norm2)

print("✅ Cosine similarity function defined.")

### 🏗️ Search Engine Class
This class encapsulates everything. When initialized, it pre-computes vectors for all documents so that searching is fast.

In [None]:
MAX_TEXT_DISPLAY_LENGTH = 200

class SimpleSearchEngine:
    def __init__(self, docs: list):
        self.docs = docs or []
        print("⚙️ Building Index...")
        
        # 1. Build TF for all docs
        self.tf_docs = build_tf(self.docs)
        
        # 2. Compute IDF global stats
        self.idf = compute_idf(self.tf_docs)
        
        # 3. Pre-compute TF-IDF vectors for all docs
        self.doc_vectors = [tfidf_vector(tf_doc, self.idf) for tf_doc in self.tf_docs]
        print("✅ Index built successfully.")

    def ranked_search(self, query: str, top_k: int = 5) -> list:
        query = (query or "").strip()
        if not query:
            return []

        # 1. Convert Query to Vector
        q_tf = defaultdict(int)
        for token in preprocess(query):
            q_tf[token] += 1
            
        q_vec = tfidf_vector(q_tf, self.idf)
        if not q_vec:
            return []

        # 2. Compare Query vs All Docs
        results = []
        for i, d_vec in enumerate(self.doc_vectors):
            score = cosine_similarity(q_vec, d_vec)
            if score > 0:
                # Truncate text for display
                full_text = self.docs[i]
                preview = full_text[:MAX_TEXT_DISPLAY_LENGTH] + ("..." if len(full_text) > MAX_TEXT_DISPLAY_LENGTH else "")
                
                results.append({
                    "score": round(score, 3),
                    "text": preview,
                    "index": i
                })

        # 3. Sort by Score (Descending)
        results.sort(key=lambda x: x["score"], reverse=True)
        return results[:top_k]

print("✅ SimpleSearchEngine class defined.")

---
## 🧪 Step 6: Testing & Execution

We'll load some **sample data** directly in code so you can test it immediately without needing PDF files, but we also check for local PDFs.

In [None]:
SAMPLE_DOCUMENTS = [
    "Information retrieval is the process of obtaining information system resources relevant to an information need.",
    "Search engines use algorithms like TF-IDF and PageRank to rank web pages.",
    "Machine learning improves search results by learning from user feedback.",
    "Natural language processing enables computers to understand human language.",
    "Deep learning models are used in modern neural search engines.",
    "Football is a popular sport played with a spherical ball.",
    "Artificial Intelligence is simulating human intelligence in machines.",
    "PyTorch and TensorFlow are popular deep learning libraries.",
]

print(f"✅ Loaded {len(SAMPLE_DOCUMENTS)} sample documents.")

In [None]:
# Initialize Engine

# 1. Try to load real PDFs
pdf_folder = "pdfs"
documents = load_all_pdfs_from_folder(pdf_folder)

# 2. Fallback to sample data if no PDFs found
if not documents:
    print("\n⚠️ No PDFs found. Using Sample Documents instead.")
    documents = SAMPLE_DOCUMENTS

# 3. Create Engine Instance
engine = SimpleSearchEngine(documents)
print(f"\n🎉 Search engine ready with {len(documents)} documents!")

### 🔍 Try a Search Query

In [None]:
query = "machine learning AI"
results = engine.ranked_search(query)

print(f"\n🔎 Query: '{query}'")
print("=" * 80)
if results:
    for i, res in enumerate(results, 1):
        print(f"\n{i}. Score: {res['score']:.3f}")
        print(f"   {res['text']}")
else:
    print("No results found.")
print("\n" + "=" * 80)

---

## 🌐 Step 7: Flask Web Application (Optional)

This section shows how to wrap the search engine in a **Flask web interface**. 

**⚠️ Note:**
- Running this cell starts a web server that blocks the notebook.
- You'll need to stop the cell manually (interrupt the kernel) to continue.
- The Flask app reuses the `engine` we already initialized above.
- Make sure a `templates/index.html` file exists next to the notebook, since `render_template` looks for it there.

In [None]:
# Flask Web Application Code
# This reuses the search engine we already created above

from flask import Flask, render_template, request

# Configuration
FLASK_HOST = "127.0.0.1"
FLASK_PORT = 5000
FLASK_DEBUG = True
TOP_K_RESULTS = 10

# Create Flask app
app = Flask(__name__)

# Note: We're reusing the 'engine' variable initialized above
# No need to create a new search engine instance

@app.route("/", methods=["GET", "POST"])
def index():
    """
    Main route that handles displaying the search form and processing queries.
    """
    results = []
    query = ""

    if request.method == "POST":
        query = request.form.get("query", "").strip()
        if query:
            # Use the global 'engine' variable
            results = engine.ranked_search(query, top_k=TOP_K_RESULTS)

    return render_template("index.html", query=query, results=results)


# Run the Flask app
if __name__ == "__main__":
    print("=" * 50)
    print("🔍 Smart Document Search Engine")
    print("=" * 50)
    print(f"Server running on: http://{FLASK_HOST}:{FLASK_PORT}")
    print("Press INTERRUPT (■ button) to stop the server")
    print("=" * 50)
    
    # Run without reloader in notebook environment to avoid issues
    app.run(debug=FLASK_DEBUG, host=FLASK_HOST, port=FLASK_PORT, use_reloader=False)