# **Search Engine**

## Overview
This notebook implements a **TF-IDF (Term Frequency-Inverse Document Frequency)** search engine from scratch.
It supports both **Arabic** and **English** documents and uses **Cosine Similarity** to rank search results.

### Features
- **Bilingual Support**: Handles English and Arabic text.
- **PDF Processing**: Extracts text from PDF files automatically.
- **Vector Space Model**: Represents text as mathematical vectors.
- **Ranked Retrieval**: Returns the most relevant results first.

In [1]:
# Install required packages into the environment of the running kernel
%pip install flask nltk PyPDF2


## **Step 2: Imports & Initialization**

In [2]:
import re
import math
import os
import glob
from collections import defaultdict

# PDF Handling
import PyPDF2

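# NLP Libraries
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
# Fetch the NLTK data used below (skipped if already downloaded)
nltk.download("stopwords", quiet=True)
nltk.download("wordnet", quiet=True)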

## **Step 3: Text Preprocessing**

The quality of a search engine depends heavily on preprocessing. We perform:
1.  **Normalization**: Unifying character forms (especially for Arabic).
2.  **Tokenization**: Splitting text into words.
3.  **Stopword Removal**: Removing common words like "the", "in", "في", "من".
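4.  **Lemmatization**: Converting words to their base form (e.g., "engines" -> "engine"). The lemmatizer is called with its default noun setting, so verb forms such as "running" are left unchanged.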

In [3]:
# ==========================================
# Configuration & Constants
# ==========================================

# Define Arabic Stopwords manually as they might be incomplete in NLTK
ARABIC_STOPWORDS = {
    "في", "على", "من", "إلى", "عن", "مع", "هذا", "هذه",
    "هو", "هي", "هم", "هن", "كان", "كانت", "يكون",
    "ما", "لا", "لم", "لن", "أن", "إن", "كل", "أي"
}

# Initialize English Stopwords and Lemmatizer
STOP_WORDS_EN = set(stopwords.words("english"))
LEMMATIZER = WordNetLemmatizer()

# Regex to identify Arabic characters
ARABIC_REGEX = re.compile(r"[\u0600-\u06FF]+")

In [4]:
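def normalize_arabic(text: str) -> str:
    """Unify common Arabic letter variants and strip diacritics."""
    text = re.sub("[إأآا]", "ا", text)
    text = re.sub("ى", "ي", text)
    text = re.sub("ؤ", "و", text)
    text = re.sub("ئ", "ي", text)
    text = re.sub("ة", "ه", text)
    text = re.sub("[ًٌٍَُِّْ]", "", text)  # Remove diacritics
    return text

# Tokens are normalized before the stopword check, so the stopword list
# must be normalized the same way or hamza/alef-maqsura entries never match.
ARABIC_STOPWORDS = {normalize_arabic(word) for word in ARABIC_STOPWORDS}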

def is_arabic_token(token: str) -> bool:
    return ARABIC_REGEX.search(token) is not None
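
Because tokens are normalized before they are compared with the stopword list, the list itself has to go through the same normalization, or entries written with hamza or alef maqsura forms will never match. The self-contained check below illustrates this. `normalize` is a local copy of the rules above, and `raw_stopwords` is a hypothetical three-word sample rather than the full list.

```python
import re

def normalize(text: str) -> str:
    """Local copy of the normalization rules, for illustration only."""
    text = re.sub("[إأآا]", "ا", text)
    text = re.sub("ى", "ي", text)
    text = re.sub("ؤ", "و", text)
    text = re.sub("ئ", "ي", text)
    text = re.sub("ة", "ه", text)
    text = re.sub(r"[\u064B-\u0652]", "", text)  # diacritics: fathatan .. sukun
    return text

# Hypothetical sample of stopwords in their usual spelling
raw_stopwords = {"إلى", "على", "من"}
normalized_stopwords = {normalize(w) for w in raw_stopwords}

# What the pipeline actually sees after normalization
token = normalize("إلى")

in_raw = token in raw_stopwords                # spelling no longer matches
in_normalized = token in normalized_stopwords  # matches after normalizing the set
```

Here `in_raw` is `False` and `in_normalized` is `True`, which is why the stopword set should be built from normalized spellings.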


In [5]:
def preprocess(text: str) -> list:
    if not text:
        return []

    # 1. Lowercase & Normalize
    text = text.lower()
    text = normalize_arabic(text)

    # 2. Remove Punctuation (keep only word chars and spaces)
    text = re.sub(r"[^\w\s]", " ", text)

    tokens = text.split()
    clean_tokens = []

    for token in tokens:
        # Handle Arabic Tokens
        if is_arabic_token(token):
            if token not in ARABIC_STOPWORDS and len(token) > 1:
                clean_tokens.append(token)
        
        # Handle English Tokens
        elif token.isalpha():
            if token not in STOP_WORDS_EN:
                # Lemmatize as a noun (the default POS), e.g. 'engines' -> 'engine'
                clean_tokens.append(LEMMATIZER.lemmatize(token))

    return clean_tokens
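
To see what the punctuation step does before any stopword or lemma handling, the following stripped-down sketch reproduces only the lowercase, regex, and split stages. `rough_tokens` is a hypothetical helper, and NLTK is deliberately left out so the snippet runs on its own.

```python
import re

def rough_tokens(text: str) -> list:
    """Lowercase, replace punctuation with spaces, split on whitespace."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", " ", text)  # \w keeps letters, digits, underscore
    return text.split()

# Hyphens and dots become spaces, so "TF-IDF" splits into two tokens
tokens = rough_tokens("TF-IDF, v2.0 ranks results!")

# Mimics the isalpha() test of the English branch
alpha_only = [t for t in tokens if t.isalpha()]
```

`tokens` contains "v2" and "0", which the `isalpha()` filter then discards, so version numbers and other alphanumeric strings are not searchable with this pipeline.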

## **Step 4: Document Loading (PDFs)**

In [6]:
MIN_PARAGRAPH_LENGTH = 20

def load_pdf(filename: str) -> list:
    if not os.path.exists(filename):
        return []

    docs = []
    try:
        with open(filename, "rb") as f:
            reader = PyPDF2.PdfReader(f)
            for page in reader.pages:
                text = page.extract_text()
                if not text:
                    continue
                # Split by newlines to get potential paragraphs
                for paragraph in text.split("\n"):
                    paragraph = paragraph.strip()
                    if len(paragraph) > MIN_PARAGRAPH_LENGTH:
                        docs.append(paragraph)
    except Exception as e:
        print(f"Error loading PDF {filename}: {e}")
        return []
    
    return docs
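
The line splitting and length filter can be tried without a PDF. The sketch below applies the same rule to a plain string. `split_paragraphs` and the sample text are hypothetical, and the threshold mirrors `MIN_PARAGRAPH_LENGTH`.

```python
MIN_LENGTH = 20  # same threshold as MIN_PARAGRAPH_LENGTH

def split_paragraphs(text: str, min_length: int = MIN_LENGTH) -> list:
    """Keep stripped lines that are longer than min_length characters."""
    kept = []
    for line in text.split("\n"):
        line = line.strip()
        if len(line) > min_length:
            kept.append(line)
    return kept

sample = "Chapter 1\nSearch engines rank documents by relevance.\n  \n42\n"
paragraphs = split_paragraphs(sample)
```

Short lines such as headings and page numbers are dropped. Each surviving line becomes its own "document", so if the extractor emits one line per visual line, a paragraph that wraps in the PDF is indexed as several documents.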

In [7]:
def load_all_pdfs_from_folder(folder_path: str) -> list:
    all_documents = []
    
    if not os.path.exists(folder_path):
        print(f"Folder not found: {folder_path}")
        return []
    
    pdf_files = glob.glob(os.path.join(folder_path, "*.pdf"))
    
    if not pdf_files:
        print(f"ℹ No PDF files found in: {folder_path}")
        return []
    
    print(f"Found {len(pdf_files)} PDF file(s). Loading...")
    
    for pdf_file in pdf_files:
        print(f"   → Processing: {os.path.basename(pdf_file)}")
        docs = load_pdf(pdf_file)
        all_documents.extend(docs)
    
    print(f"Total loaded documents (paragraphs): {len(all_documents)}")
    return all_documents


## **Step 5: The Search Engine Core**

Here we implement the **Vector Space Model** logic:

### 1. TF (Term Frequency)
How often a word appears in a specific document.

### 2. IDF (Inverse Document Frequency)
How unique a word is across all documents. Common words like "is" have low IDF, while specific terms like "algorithm" have high IDF.

$$ IDF(t) = \log \left( \frac{Total\ Documents}{Documents\ with\ term\ t} \right) $$
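
As a quick worked example of the formula, the sketch below computes IDF for a hypothetical three-document corpus that has already been tokenized. It uses the natural logarithm, matching `math.log`.

```python
import math
from collections import Counter

# Hypothetical, already-tokenized corpus
corpus = [
    ["search", "engine", "ranking"],
    ["search", "query"],
    ["search", "engine", "index"],
]

n_docs = len(corpus)
# Document frequency: in how many documents each term occurs
doc_freq = Counter(term for doc in corpus for term in set(doc))
idf = {term: math.log(n_docs / df) for term, df in doc_freq.items()}
```

`search` occurs in every document, so its IDF is 0 and it cannot influence ranking. `engine` gets ln(3/2) ≈ 0.405 and `query` gets ln 3 ≈ 1.099.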

### 3. Cosine Similarity
Measures the angle between two vectors (Query Vector vs. Document Vector). A value of **1.0** means identical direction (perfect match), **0.0** means no similarity.
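
Written out for a query vector $q$ and a document vector $d$ over the same vocabulary (the symbols are notation introduced here, not names from the code):

$$ \cos(q, d) = \frac{\sum_{t} q_t \, d_t}{\sqrt{\sum_{t} q_t^2} \; \sqrt{\sum_{t} d_t^2}} $$

Since the TF-IDF weights used in this notebook are never negative, the score always falls between 0 and 1.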

In [8]:
def build_tf(docs: list) -> list:
    tf_docs = []
    for doc in docs:
        freq = defaultdict(int)
        for token in preprocess(doc):
            freq[token] += 1
        tf_docs.append(freq)
    return tf_docs

def compute_idf(tf_docs: list) -> dict:
    N = len(tf_docs)
    df = defaultdict(int)

    for doc in tf_docs:
        for term in doc.keys():
            df[term] += 1

    # Standard IDF formula
    idf = {term: math.log(N / df_val) for term, df_val in df.items() if df_val > 0}
    return idf

In [9]:
def tfidf_vector(tf_doc: dict, idf: dict) -> dict:
    vec = {}
    for term, freq in tf_doc.items():
        if freq <= 0:
            continue
        idf_val = idf.get(term, 0.0)
        if idf_val == 0:
            continue
        
        # Log-normalization reduces impact of very frequent words within a doc
        weight = (1 + math.log(freq)) * idf_val
        vec[term] = weight
    return vec
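
To make the effect of the log-scaled term frequency concrete, here is a small sketch that evaluates `1 + log(freq)` for a few counts. `damped_tf` is a hypothetical helper name. The formula is the one used in `tfidf_vector`.

```python
import math

def damped_tf(freq: int) -> float:
    """Sublinear term-frequency scaling: 1 + ln(freq) for freq >= 1."""
    return 1 + math.log(freq) if freq > 0 else 0.0

weights = {freq: round(damped_tf(freq), 3) for freq in (1, 2, 10, 100)}
```

A term that appears 100 times ends up with a TF factor of about 5.6 rather than 100, so repetition still helps but with sharply diminishing returns.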

In [10]:
def cosine_similarity(v1: dict, v2: dict) -> float:
    if not v1 or not v2:
        return 0.0

    # Optimization: iterate over the shorter vector
    if len(v1) > len(v2):
        v1, v2 = v2, v1

    dot_product = 0.0
    for term, val in v1.items():
        dot_product += val * v2.get(term, 0.0)

    norm1 = math.sqrt(sum(v ** 2 for v in v1.values()))
    norm2 = math.sqrt(sum(v ** 2 for v in v2.values()))

    if norm1 == 0.0 or norm2 == 0.0:
        return 0.0

    return dot_product / (norm1 * norm2)

### **Search Engine Class**
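This class ties the previous functions together. At initialization it computes the term frequencies, the global IDF table, and the TF-IDF vector of every document once, so a search only has to vectorize the query and compare it against the stored vectors.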

In [11]:
MAX_TEXT_DISPLAY_LENGTH = 200

class SimpleSearchEngine:
    def __init__(self, docs: list):
        self.docs = docs or []      

        # 1. Build TF for all docs
        self.tf_docs = build_tf(self.docs)
        
        # 2. Compute IDF global stats
        self.idf = compute_idf(self.tf_docs)
        
        # 3. Pre-compute TF-IDF vectors for all docs
        self.doc_vectors = [tfidf_vector(tf_doc, self.idf) for tf_doc in self.tf_docs]

    def ranked_search(self, query: str, top_k: int = 5) -> list:
        query = (query or "").strip()
        if not query:
            return []

        # 1. Convert Query to Vector
        q_tf = defaultdict(int)
        for token in preprocess(query):
            q_tf[token] += 1
            
        q_vec = tfidf_vector(q_tf, self.idf)
        if not q_vec:
            return []

        # 2. Compare Query vs All Docs
        results = []
        for i, d_vec in enumerate(self.doc_vectors):
            score = cosine_similarity(q_vec, d_vec)
            if score > 0:
                # Truncate text for display
                full_text = self.docs[i]
                preview = full_text[:MAX_TEXT_DISPLAY_LENGTH] + ("..." if len(full_text) > MAX_TEXT_DISPLAY_LENGTH else "")
                
                results.append({
                    "score": round(score, 3),
                    "text": preview,
                    "index": i
                })

        # 3. Sort by Score (Descending)
        results.sort(key=lambda x: x["score"], reverse=True)
        return results[:top_k]
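
One optional refinement, which is not part of the class above: `cosine_similarity` recomputes each document's norm on every query. If the vectors are scaled to unit length once at index time, the cosine reduces to a plain dot product. A minimal sketch, with `unit_vector` and `dot` as hypothetical helpers:

```python
import math

def unit_vector(vec: dict) -> dict:
    """Return a copy of a sparse vector scaled to length 1."""
    norm = math.sqrt(sum(v * v for v in vec.values()))
    if norm == 0.0:
        return {}
    return {term: v / norm for term, v in vec.items()}

def dot(v1: dict, v2: dict) -> float:
    """Dot product of two sparse vectors, iterating over the shorter one."""
    if len(v1) > len(v2):
        v1, v2 = v2, v1
    return sum(val * v2.get(term, 0.0) for term, val in v1.items())

doc = unit_vector({"deep": 3.0, "learning": 4.0})
query = unit_vector({"learning": 2.0})
score = dot(query, doc)  # equals the cosine similarity of the original vectors
```

With these toy vectors the score is 0.8, the same value the full cosine formula gives. Whether the saving matters depends on corpus size; for a few hundred paragraphs the difference is likely negligible.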

## **Step 6: Testing & Execution**

We'll load some **sample data** directly in code so you can test it immediately without needing PDF files, but we also check for local PDFs.

In [12]:
SAMPLE_DOCUMENTS = [
    "Information retrieval is the process of obtaining information system resources relevant to an information need.",
    "Search engines use algorithms like TF-IDF and PageRank to rank web pages.",
    "Machine learning improves search results by learning from user feedback.",
    "Natural language processing enables computers to understand human language.",
    "Deep learning models are used in modern neural search engines.",
    "Football is a popular sport played with a spherical ball.",
    "Artificial Intelligence is simulating human intelligence in machines.",
    "PyTorch and TensorFlow are popular deep learning libraries.",
]

## **Initialize Engine Class**

In [13]:
# Initialize Engine

# 1. Try to load real PDFs
pdf_folder = "pdfs"
documents = load_all_pdfs_from_folder(pdf_folder)

# 2. Fallback to sample data if no PDFs found
if not documents:
    print("\nNo PDFs found. Using Sample Documents instead.")
    documents = SAMPLE_DOCUMENTS

# 3. Create Engine Instance
engine = SimpleSearchEngine(documents)
print(f"\nSearch engine ready with {len(documents)} documents!")

ℹ No PDF files found in: pdfs

No PDFs found. Using Sample Documents instead.

Search engine ready with 8 documents!
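
To see ranked retrieval on a concrete query without NLTK or PDFs, the sketch below runs the same TF-IDF and cosine logic end to end on three hypothetical documents. It uses a plain whitespace tokenizer instead of `preprocess`, so stopwords and lemmatization are ignored. All names in it are local to the example.

```python
import math
from collections import Counter

docs = [
    "deep learning models power neural search",
    "football is a popular sport",
    "machine learning improves search results",
]

def tokenize(text: str) -> list:
    return text.lower().split()  # simplification: no stopwords, no lemmas

tf_docs = [Counter(tokenize(d)) for d in docs]
df = Counter(term for tf in tf_docs for term in tf)  # one count per document
idf = {term: math.log(len(docs) / n) for term, n in df.items()}

def tfidf(tf: dict) -> dict:
    # Terms with zero IDF, or unknown to the corpus, are dropped
    return {t: (1 + math.log(f)) * idf[t]
            for t, f in tf.items() if idf.get(t, 0.0) > 0}

def cosine(v1: dict, v2: dict) -> float:
    dot = sum(w * v2.get(t, 0.0) for t, w in v1.items())
    n1 = math.sqrt(sum(w * w for w in v1.values()))
    n2 = math.sqrt(sum(w * w for w in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

doc_vecs = [tfidf(tf) for tf in tf_docs]
q_vec = tfidf(Counter(tokenize("deep learning")))

scores = [(cosine(q_vec, d), i) for i, d in enumerate(doc_vecs)]
ranking = [i for s, i in sorted(scores, reverse=True) if s > 0]
```

`ranking` comes out as `[0, 2]`: document 0 contains both query terms, document 2 contains only "learning", and document 1 shares nothing with the query and is left out.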


## **Step 7: Flask Web Application**

In [14]:
# Flask Web Application Code
# This reuses the search engine we already created above

from flask import Flask, render_template, request

# Configuration
FLASK_HOST = "127.0.0.1"
FLASK_PORT = 5000
FLASK_DEBUG = True
TOP_K_RESULTS = 10

# Create Flask app
app = Flask(__name__)

# Note: We're reusing the 'engine' variable initialized above
# No need to create a new search engine instance

@app.route("/", methods=["GET", "POST"])
def index():
    """
    Main route that handles displaying the search form and processing queries.
    """
    results = []
    query = ""

    if request.method == "POST":
        query = request.form.get("query", "").strip()
        if query:
            # Use the global 'engine' variable
            results = engine.ranked_search(query, top_k=TOP_K_RESULTS)

    return render_template("index.html", query=query, results=results)


# Run the Flask app
if __name__ == "__main__":
    
    # Run without reloader in notebook environment to avoid issues
    app.run(debug=FLASK_DEBUG, host=FLASK_HOST, port=FLASK_PORT, use_reloader=False)

 * Serving Flask app '__main__'
 * Debug mode: on


 * Running on http://127.0.0.1:5000
Press CTRL+C to quit
127.0.0.1 - - [23/Dec/2025 01:16:35] "GET / HTTP/1.1" 200 -
127.0.0.1 - - [23/Dec/2025 01:16:35] "GET /static/style.css HTTP/1.1" 304 -
127.0.0.1 - - [23/Dec/2025 01:16:40] "POST / HTTP/1.1" 200 -
127.0.0.1 - - [23/Dec/2025 01:16:40] "GET /static/style.css HTTP/1.1" 304 -
