<a href="https://colab.research.google.com/github/HazemmoAlsady/Search-Engine/blob/main/Search_Engine.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Setup**

In [5]:
import nltk

nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('stopwords')
nltk.download('wordnet')


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

# **Preprocessing**

In [6]:
import re
import math
from collections import defaultdict
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

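def preprocess(text):
    """Lowercase, strip punctuation, tokenize, drop stop words, and lemmatize."""
    # Remove punctuation; \w keeps letters, digits, and underscores.
    text = re.sub(r'[^\w\s]', '', text.lower())
    tokens = word_tokenize(text)
    # Keep alphabetic tokens that are not stop words, reduced to their lemma.
    tokens = [
        lemmatizer.lemmatize(t)
        for t in tokens
        if t not in stop_words and t.isalpha()
    ]
    return tokens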


# **Input Documents**

In [7]:
def input_documents():
    docs = []
    print("Enter documents (type 'done' to finish):")
    while True:
        doc = input(f"Doc {len(docs)}: ")
        if doc.lower() == "done":
            break
        docs.append(doc)
    return docs

docs = input_documents()
print(f"\n{len(docs)} documents indexed ✅")


Enter documents (type 'done' to finish):
Doc 0: 
Doc 1: Information retrieval is the process of finding relevant documents.
Doc 2: Search engines use TF IDF and cosine similarity for ranking.
Doc 3: The quick brown fox jumps over the lazy dog.
Doc 4: Data science and information retrieval are closely related.
Doc 5: Cosine similarity is used in ranked retrieval models.
Doc 6: done

6 documents indexed ✅
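
In the transcript above, the blank line entered at `Doc 0` is stored as an empty document, so it is counted in `N` when IDF is computed later. One possible way to avoid that is sketched below. The helper name `collect_documents` is hypothetical, and it reads from any iterable of strings instead of calling `input()` so that it can be run without a prompt.

```python
def collect_documents(lines):
    """Collect non-empty documents until a line equal to 'done' is seen.

    `lines` is any iterable of strings (a hypothetical stand-in for input()).
    """
    docs = []
    for line in lines:
        text = line.strip()
        if text.lower() == "done":
            break
        if not text:
            continue  # skip blank entries instead of indexing them
        docs.append(text)
    return docs


sample = ["", "Information retrieval finds relevant documents.", "  ", "done", "ignored"]
print(collect_documents(sample))
# ['Information retrieval finds relevant documents.']
```

With this variant the document numbers would shift down by one relative to the transcript, because the empty entry is never stored.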


# **Build TF + IDF**

In [9]:
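def build_tf(docs):
    """Return one term -> raw count mapping per document."""
    tf_docs = []
    for doc in docs:
        freq = defaultdict(int)
        for token in preprocess(doc):
            freq[token] += 1
        tf_docs.append(freq)
    return tf_docs

def compute_idf(tf_docs):
    """Return unsmoothed IDF, log(N / df), for every term in the corpus."""
    N = len(tf_docs)
    df = defaultdict(int)

    # Document frequency: count each term once per document it appears in.
    for doc in tf_docs:
        for term in doc:
            df[term] += 1

    idf = {term: math.log(N / df[term]) for term in df}
    return idf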

tf_docs = build_tf(docs)
idf = compute_idf(tf_docs)
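
As a quick check on the `log(N / df)` formula, the sketch below recomputes it on three hand-tokenized toy documents. The token lists and the names `toy_docs`, `toy_tf`, and `toy_idf` are made up for illustration; they bypass `preprocess`, so no NLTK data is needed.

```python
import math
from collections import Counter

# Hypothetical, already-tokenized documents (not the notebook's corpus).
toy_docs = [
    ["cosine", "similarity", "ranking"],
    ["cosine", "retrieval"],
    ["cosine", "retrieval", "retrieval"],
]

# Raw term counts per document, mirroring build_tf.
toy_tf = [Counter(tokens) for tokens in toy_docs]

# Document frequency: number of documents that contain each term.
df = Counter(term for tf in toy_tf for term in tf)

# Unsmoothed IDF, the same formula as compute_idf.
N = len(toy_tf)
toy_idf = {term: math.log(N / df[term]) for term in df}

for term, value in sorted(toy_idf.items()):
    print(f"{term:<10} df={df[term]} idf={value:.3f}")
# cosine     df=3 idf=0.000
# ranking    df=1 idf=1.099
# retrieval  df=2 idf=0.405
# similarity df=1 idf=1.099
```

Note that `cosine` appears in every toy document and therefore gets an IDF of 0, so under this unsmoothed formula it would contribute nothing to any score.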


# **TF-IDF Vector + Cosine Similarity**

In [10]:
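def tfidf_vector(tf_doc, idf):
    """Weight each term by (1 + log(tf)) * idf; terms missing from idf get 0."""
    vec = {}
    for term, freq in tf_doc.items():
        vec[term] = (1 + math.log(freq)) * idf.get(term, 0)
    return vec

def cosine(v1, v2):
    """Cosine similarity of two sparse vectors stored as term -> weight dicts."""
    # Only terms present in both vectors contribute to the dot product.
    dot = sum(v1.get(t, 0) * v2.get(t, 0) for t in v1)
    norm1 = math.sqrt(sum(v**2 for v in v1.values()))
    norm2 = math.sqrt(sum(v**2 for v in v2.values()))
    # An all-zero vector has no direction, so treat the similarity as 0.
    if norm1 == 0 or norm2 == 0:
        return 0
    return dot / (norm1 * norm2)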
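
To make the cosine computation concrete, here is a small self-contained sketch on hand-picked sparse vectors. The function name `cosine_sparse` and the weights are illustrative assumptions, chosen so the arithmetic is easy to follow; the logic is intended to match `cosine` above.

```python
import math


def cosine_sparse(v1, v2):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v2.get(t, 0.0) for t, w in v1.items())
    norm1 = math.sqrt(sum(w * w for w in v1.values()))
    norm2 = math.sqrt(sum(w * w for w in v2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot / (norm1 * norm2)


# Hypothetical weights, not taken from the notebook's corpus.
query = {"cosine": 1.0, "similarity": 1.0}
doc_a = {"cosine": 1.0, "similarity": 1.0, "ranking": 1.0}
doc_b = {"fox": 2.0, "dog": 1.0}

print(round(cosine_sparse(query, doc_a), 3))  # 0.816, i.e. 2 / (sqrt(2) * sqrt(3))
print(cosine_sparse(query, doc_b))            # 0.0, no shared terms
print(cosine_sparse(query, {}))               # 0.0, empty vector is guarded
```

Because the score is normalized by both vector lengths, a longer document with extra terms (such as `ranking` in `doc_a`) scores below 1 even when it contains every query term.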


# **Ranked Search Engine**

In [11]:
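def ranked_search(query, docs, tf_docs, idf, top_k=5):
    """Return up to top_k (index, text, score) tuples with a positive score."""
    # Term frequencies of the preprocessed query.
    q_tf = defaultdict(int)
    for token in preprocess(query):
        q_tf[token] += 1

    q_vec = tfidf_vector(q_tf, idf)

    # Score every document against the query vector.
    scores = []
    for i, tf_doc in enumerate(tf_docs):
        d_vec = tfidf_vector(tf_doc, idf)
        score = cosine(q_vec, d_vec)
        scores.append((i, score))

    # Highest score first; drop documents that share no weighted term.
    scores.sort(key=lambda x: x[1], reverse=True)
    return [(i, docs[i], score) for i, score in scores[:top_k] if score > 0]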
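
`ranked_search` rebuilds every document vector on each query. For a larger corpus, one option is to build the vectors once and reuse them. The following is a minimal sketch of that idea under stated assumptions: the corpus is a made-up, pre-tokenized toy set, `ranked_search_cached` is a hypothetical name, and the query is passed as tokens so the example runs without NLTK.

```python
import math
from collections import Counter


def tfidf_vector(tf_doc, idf):
    # Log-scaled term frequency times IDF, as in the notebook.
    return {t: (1 + math.log(f)) * idf.get(t, 0) for t, f in tf_doc.items()}


def cosine(v1, v2):
    dot = sum(w * v2.get(t, 0) for t, w in v1.items())
    n1 = math.sqrt(sum(w * w for w in v1.values()))
    n2 = math.sqrt(sum(w * w for w in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0


# Hypothetical pre-tokenized corpus; the notebook would call preprocess() instead.
toy_docs = [
    ["information", "retrieval", "process"],
    ["cosine", "similarity", "ranking"],
    ["cosine", "similarity", "retrieval", "model"],
]
toy_tf = [Counter(d) for d in toy_docs]
df = Counter(t for tf in toy_tf for t in tf)
toy_idf = {t: math.log(len(toy_tf) / n) for t, n in df.items()}

# Build every document vector once, up front.
doc_vecs = [tfidf_vector(tf, toy_idf) for tf in toy_tf]


def ranked_search_cached(query_tokens, doc_vecs, idf, top_k=5):
    """Rank documents against a pre-tokenized query using cached vectors."""
    q_vec = tfidf_vector(Counter(query_tokens), idf)
    scores = [(i, cosine(q_vec, d)) for i, d in enumerate(doc_vecs)]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return [(i, s) for i, s in scores[:top_k] if s > 0]


results = ranked_search_cached(["cosine", "similarity"], doc_vecs, toy_idf)
print([i for i, _ in results])  # [1, 2]
```

Document 1 ranks above document 2 here because both match the same two query terms, but document 2 carries an extra weighted term that lengthens its vector. The cached vectors would need rebuilding whenever documents are added, since IDF depends on the whole corpus.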


# **Interactive Search Loop**

In [12]:
print("\nSearch Engine Ready 🔍 (type 'quit' to exit)")

while True:
    query = input("\nSearch: ")
    if query.lower() == "quit":
        break

    results = ranked_search(query, docs, tf_docs, idf)

    if not results:
        print("No results found ❌")
        continue

    for i, text, score in results:
        print(f"\nScore: {score:.3f}")
        print(f"Doc {i}: {text}")



Search Engine Ready 🔍 (type 'quit' to exit)

Search: 
No results found ❌

Search: information retrieval

Score: 0.341
Doc 1: Information retrieval is the process of finding relevant documents.

Score: 0.341
Doc 4: Data science and information retrieval are closely related.

Score: 0.105
Doc 5: Cosine similarity is used in ranked retrieval models.

Search: cosine similarity

Score: 0.439
Doc 5: Cosine similarity is used in ranked retrieval models.

Score: 0.334
Doc 2: Search engines use TF IDF and cosine similarity for ranking.

Search: search engine

Score: 0.544
Doc 2: Search engines use TF IDF and cosine similarity for ranking.

Search: tf idf

Score: 0.544
Doc 2: Search engines use TF IDF and cosine similarity for ranking.

Search: quit


In [13]:
print("Indexed terms:")
print(list(idf.keys())[:20])


Indexed terms:
['information', 'retrieval', 'process', 'finding', 'relevant', 'document', 'search', 'engine', 'use', 'tf', 'idf', 'cosine', 'similarity', 'ranking', 'quick', 'brown', 'fox', 'jump', 'lazy', 'dog']
