## Step 1: Importing Libraries

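In this step, we import the libraries needed for text processing and calculations: NLTK for tokenization, stopword removal, and stemming, `defaultdict` for the inverted index, and `math` for logarithms and square roots.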

In [None]:
import nltk
nltk.download('punkt')
nltk.download('stopwords')

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
from collections import defaultdict
import math




[nltk_data] Downloading package punkt to /usr/share/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /usr/share/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## Step 2: Reading and Parsing Documents

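In this step, we read the **Cran 1400** dataset, split the content into individual documents on the `.I` marker, and extract the document ID, title, and body text. Because the code splits only on the `.T` and `.W` markers, the author (`.A`) and bibliography (`.B`) sections stay inside the title segment and become part of the indexed text, as the example output shows.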

In [None]:
with open("/kaggle/input/information-ret/cran.all.1400", "r", encoding="utf-8") as f:
    content = f.read()

raw_documents = content.strip().split(".I ")[1:]
documents = []

for raw_doc in raw_documents:
    parts = raw_doc.split("\n.T\n")
    if len(parts) < 2:
        continue

    doc_id = parts[0].strip()
    body_parts = parts[1].split("\n.W\n")
    if len(body_parts) < 2:
        continue

    title = body_parts[0].strip()
    body = body_parts[1].strip()

    full_text = title + " " + body
    documents.append({
        "id": doc_id,
        "text": full_text
    })

print(f"Number of documents: {len(documents)}")
print("Example document:\n", documents[0])


Number of documents: 1400
Example document:
 {'id': '1', 'text': 'experimental investigation of the aerodynamics of a\nwing in a slipstream .\n.A\nbrenckman,m.\n.B\nj. ae. scs. 25, 1958, 324. experimental investigation of the aerodynamics of a\nwing in a slipstream .\n  an experimental study of a wing in a propeller slipstream was\nmade in order to determine the spanwise distribution of the lift\nincrease due to slipstream at different angles of attack of the wing\nand at different free stream to slipstream velocity ratios .  the\nresults were intended in part as an evaluation basis for different\ntheoretical treatments of this problem .\n  the comparative span loading curves, together with\nsupporting evidence, showed that a substantial part of the lift increment\nproduced by the slipstream was due to a /destalling/ or\nboundary-layer-control effect .  the integrated remaining lift\nincrement, after subtracting this destalling lift, was found to agree\nwell with a potential flow theor
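
The example output above shows the `.A` and `.B` sections attached to the title. If those fields should be kept separate, one option is to split on every field marker rather than only on `.T` and `.W`. The following is a minimal sketch under that assumption; `parse_cranfield_record` and the inline `sample` record are hypothetical names introduced for illustration and are not part of the notebook's code.

```python
import re

# Field markers as they appear in the example output: .T, .A, .B, .W on their own line
FIELD_MARKER = re.compile(r"^\.([TABW])[ \t]*$", re.MULTILINE)

def parse_cranfield_record(raw_doc):
    """Split one record (the text following '.I ') into separate fields."""
    doc_id, _, rest = raw_doc.partition("\n")
    # re.split with a capture group yields [prefix, tag, text, tag, text, ...]
    pieces = FIELD_MARKER.split(rest)
    fields = {"T": "", "A": "", "B": "", "W": ""}
    for tag, text in zip(pieces[1::2], pieces[2::2]):
        fields[tag] = text.strip()
    return {
        "id": doc_id.strip(),
        "title": fields["T"],
        "author": fields["A"],
        "biblio": fields["B"],
        "body": fields["W"],
    }

# Hypothetical sample, shortened from the example document printed above
sample = (
    "1\n.T\nexperimental investigation of the aerodynamics of a\n"
    "wing in a slipstream .\n.A\nbrenckman,m.\n.B\nj. ae. scs. 25, 1958, 324.\n"
    ".W\nan experimental study of a wing in a propeller slipstream ."
)
record = parse_cranfield_record(sample)
print(record["title"])
print(record["author"])
```

With the fields separated, the choice of what to index (title and body only, or everything) becomes explicit instead of being a side effect of the split points.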

## Step 3: Preprocessing the Documents

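In this step, we preprocess each document's text: lowercase it, tokenize it, drop tokens that are not purely alphabetic, remove English stopwords, and reduce the remaining tokens to their stems with the Porter stemmer.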

In [None]:
stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()

def preprocess(text):
    text = text.lower()
    tokens = word_tokenize(text)
    tokens = [token for token in tokens if token.isalpha()]
    tokens = [token for token in tokens if token not in stop_words]
    tokens = [stemmer.stem(token) for token in tokens]
    return tokens

for doc in documents:
    doc["tokens"] = preprocess(doc["text"])

print("Sample tokens from the first document:", documents[0]["tokens"][:20])


Sample tokens from the first document: ['experiment', 'investig', 'aerodynam', 'wing', 'slipstream', 'brenckman', 'ae', 'sc', 'experiment', 'investig', 'aerodynam', 'wing', 'slipstream', 'experiment', 'studi', 'wing', 'propel', 'slipstream', 'made', 'order']
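
To make the effect of each filtering stage visible, the sketch below applies the same sequence of filters to one sentence and returns the intermediate results. It is a dependency-free approximation, not the notebook's `preprocess`: a regular expression stands in for `word_tokenize`, `TINY_STOP_WORDS` is a hand-picked stand-in for NLTK's English stopword list, and stemming is left out. The name `trace_preprocess` is hypothetical.

```python
import re

# Assumption: a tiny stand-in for stopwords.words('english')
TINY_STOP_WORDS = {"a", "an", "in", "of", "the", "was"}

def trace_preprocess(text):
    """Return the token list after each stage (stemming omitted)."""
    lowered = text.lower()
    # Stand-in for word_tokenize: runs of word characters, or single punctuation marks
    tokens = re.findall(r"\w+|[^\w\s]", lowered)
    # Keep purely alphabetic tokens; this drops numbers and punctuation
    alphabetic = [t for t in tokens if t.isalpha()]
    # Remove stopwords
    content = [t for t in alphabetic if t not in TINY_STOP_WORDS]
    return {"tokens": tokens, "alphabetic": alphabetic, "content": content}

stages = trace_preprocess("A wing in a slipstream, 1958 .")
for name, value in stages.items():
    print(f"{name}: {value}")
```

The `isalpha()` filter removes the numeric token and the punctuation, which is why the years and page numbers from the bibliography line do not appear among the sample tokens above.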


## Step 4: Building the Inverted Index

In this step, we build the **inverted index** which maps each term to:
- The list of documents where the term appears.
- The **term frequency (TF)** in each document.

The **inverted index** helps in quickly identifying which documents contain which terms, and how often each term appears in those documents.


In [None]:
inverted_index = defaultdict(lambda: defaultdict(int))

for doc in documents:
    doc_id = doc["id"]
    tokens = doc["tokens"]

    for token in tokens:
        inverted_index[token][doc_id] += 1

sample_term = list(inverted_index.keys())[0]
print(f"Sample inverted index for the term '{sample_term}':", inverted_index[sample_term])


Sample inverted index for the term 'experiment': defaultdict(<class 'int'>, {'1': 3, '11': 1, '12': 1, '16': 1, '17': 1, '19': 1, '25': 1, '29': 1, '30': 2, '35': 1, '37': 1, '41': 1, '43': 1, '47': 1, '52': 2, '53': 1, '58': 1, '69': 1, '70': 1, '74': 2, '78': 2, '84': 3, '99': 2, '101': 1, '103': 1, '112': 1, '115': 1, '121': 1, '123': 3, '131': 1, '137': 1, '140': 1, '142': 1, '154': 1, '156': 1, '167': 1, '168': 1, '170': 1, '171': 2, '173': 2, '176': 1, '179': 2, '183': 1, '184': 1, '186': 3, '187': 1, '188': 1, '189': 2, '191': 1, '195': 3, '197': 2, '202': 1, '203': 1, '206': 2, '207': 2, '212': 1, '216': 1, '220': 1, '222': 1, '225': 2, '227': 1, '230': 1, '234': 4, '245': 1, '251': 1, '256': 3, '257': 1, '262': 1, '271': 3, '273': 1, '277': 1, '282': 1, '283': 1, '286': 1, '287': 1, '289': 1, '294': 1, '295': 1, '304': 1, '307': 1, '329': 2, '330': 2, '334': 2, '338': 1, '339': 2, '344': 3, '345': 1, '346': 3, '347': 1, '354': 1, '360': 1, '369': 1, '370': 1, '372': 3, '377': 
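
The same nested-`defaultdict` structure is easier to inspect on a toy collection. The sketch below builds an index over three hypothetical, already-tokenized documents (`toy_docs` and `toy_index` are illustrative names) and reads back a posting list and a document frequency.

```python
from collections import defaultdict

# Hypothetical pre-tokenized documents
toy_docs = {
    "d1": ["wing", "slipstream", "wing"],
    "d2": ["wing", "lift"],
    "d3": ["lift", "lift", "flow"],
}

# term -> {doc_id -> term frequency}
toy_index = defaultdict(lambda: defaultdict(int))
for doc_id, tokens in toy_docs.items():
    for token in tokens:
        toy_index[token][doc_id] += 1

print(dict(toy_index["wing"]))      # {'d1': 2, 'd2': 1}
print(len(toy_index["lift"]))       # 2  (document frequency of "lift")

# .get() looks up an unseen term without inserting it into the index
print(toy_index.get("drag", {}))    # {}
print("drag" in toy_index)          # False
```

Indexing a `defaultdict` with an unseen term (`toy_index["drag"]`) would silently insert an empty entry, so read-only lookups for query terms are safer with `.get`.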

## Step 5: Calculating TF-IDF

In this step, we calculate the **TF-IDF (Term Frequency - Inverse Document Frequency)** weight of each term in each document.

- **TF (Term Frequency)**: The number of times a term appears in a document (the raw count stored in the inverted index).
- **IDF (Inverse Document Frequency)**: Measures how rare a term is across the collection, computed here as the natural logarithm of the total number of documents divided by the number of documents containing the term. The more documents a term appears in, the lower its IDF.
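
As a worked example of this weighting, the sketch below computes a natural-log IDF, the form used by `calculate_idf` in this notebook, and the resulting TF-IDF weight for toy counts. The numbers and the helper name `toy_idf` are made up for illustration.

```python
import math

def toy_idf(total_docs, doc_count):
    """Natural-log IDF, ln(N / df); 0 for a term that appears in no document."""
    if doc_count == 0:
        return 0.0
    return math.log(total_docs / doc_count)

total_docs = 1000

# A rare term (10 of 1000 documents) versus a common one (500 of 1000)
rare_idf = toy_idf(total_docs, 10)      # ln(100)
common_idf = toy_idf(total_docs, 500)   # ln(2)

# The same term frequency yields a much larger weight for the rare term
tf = 3
print(f"rare:   idf={rare_idf:.3f}, tf-idf={tf * rare_idf:.3f}")
print(f"common: idf={common_idf:.3f}, tf-idf={tf * common_idf:.3f}")

# A term found in every document gets IDF 0, so it cannot influence the ranking
print(toy_idf(total_docs, total_docs))
```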




In [None]:
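def calculate_idf(term, total_docs):
    # Document frequency: number of documents whose posting list contains the term.
    # .get() avoids inserting an empty entry into the defaultdict for unseen terms.
    doc_count = len(inverted_index.get(term, {}))
    if doc_count == 0:
        return 0
    # Natural-log IDF
    return math.log(total_docs / doc_count)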

def calculate_tfidf(documents, inverted_index):
    total_docs = len(documents)
    tfidf_scores = {}

    for term in inverted_index:
        idf = calculate_idf(term, total_docs)
        for doc_id, tf in inverted_index[term].items():
            if doc_id not in tfidf_scores:
                tfidf_scores[doc_id] = {}
            tfidf_scores[doc_id][term] = tf * idf
    return tfidf_scores

tfidf_scores = calculate_tfidf(documents, inverted_index)

print(f"Sample TF-IDF for document 1: {tfidf_scores['1']}")


Sample TF-IDF for document 1: {'experiment': 4.2724337557529655, 'investig': 2.710699114540938, 'aerodynam': 4.147487041130396, 'wing': 7.4757964316767405, 'slipstream': 28.075668948850883, 'brenckman': 7.24422751560335, 'ae': 1.520642413650969, 'sc': 1.4390925466868614, 'studi': 1.7719568419318754, 'propel': 3.912023005428146, 'made': 2.7611926800105056, 'order': 2.2335922215070942, 'determin': 1.5437839422126636, 'spanwis': 4.02535169073515, 'distribut': 1.3609051271150712, 'lift': 9.181870500900727, 'increas': 1.916351346813769, 'due': 4.576800916004179, 'differ': 5.9758622626701605, 'angl': 1.9409226075442743, 'attack': 2.571398681141444, 'free': 2.5998366164619773, 'stream': 2.2137895942109145, 'veloc': 1.5076552181241583, 'ratio': 1.5882357047834974, 'result': 0.70608769183568, 'intend': 4.75932086581535, 'part': 5.180534330891653, 'evalu': 5.991464547107982, 'basi': 3.0247198104272432, 'theoret': 1.784642001459191, 'treatment': 3.5306554488990423, 'problem': 1.4665751923806936, 

In [None]:
def get_query_tfidf(query, inverted_index, tfidf_scores):
    tokens = preprocess(query)
    tf_query = defaultdict(int)

    for token in tokens:
        tf_query[token] += 1

    total_docs = len(tfidf_scores)

    query_tfidf = {}
    for term, tf in tf_query.items():
        doc_count = len(inverted_index.get(term, {}))
        if doc_count == 0:
            idf = 0
        else:
            idf = math.log(total_docs / doc_count)
        query_tfidf[term] = tf * idf

    return query_tfidf


## Step 6: Calculating Cosine Similarity

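In this step, we calculate the **Cosine Similarity** between the query and each document. Cosine similarity measures how similar two vectors are as the cosine of the angle between them: the dot product divided by the product of the vectors' magnitudes. Since the TF-IDF weights here are never negative, the score ranges from 0 (no shared terms) to 1 (same direction).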

In [None]:
def cosine_similarity(query_tfidf, doc_tfidf):
    # Only terms present in the query can contribute to the dot product
    dot_product = sum(weight * doc_tfidf.get(term, 0) for term, weight in query_tfidf.items())

    # Euclidean length of each sparse vector
    query_magnitude = math.sqrt(sum(value**2 for value in query_tfidf.values()))
    doc_magnitude = math.sqrt(sum(value**2 for value in doc_tfidf.values()))

    # An empty or all-zero vector has no direction; treat it as no match
    if query_magnitude == 0 or doc_magnitude == 0:
        return 0

    return dot_product / (query_magnitude * doc_magnitude)

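query = "aerodynamics of wing"
query_tokens = preprocess(query)
# Quick check only: the query vector reuses document 1's own TF-IDF weights for the query terms.
# rank_documents builds the query vector with get_query_tfidf instead.
query_tfidf = {term: tfidf_scores['1'].get(term, 0) for term in query_tokens}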

similarity = cosine_similarity(query_tfidf, tfidf_scores['1'])
print(f"Cosine similarity between query and document 1: {similarity}")


Cosine similarity between query and document 1: 0.20443306891784382
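
To see how the measure behaves, the sketch below evaluates the same formula on three toy document vectors. `sparse_cosine` is a hypothetical stand-alone helper and the weights are made up; it is meant as an illustration of the formula, not as a replacement for the function above.

```python
import math

def sparse_cosine(vec_a, vec_b):
    """Cosine similarity between two sparse vectors stored as {term: weight} dicts."""
    dot = sum(weight * vec_b.get(term, 0.0) for term, weight in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

query_vec = {"wing": 1.0, "lift": 1.0}
doc_same_direction = {"wing": 2.0, "lift": 2.0}   # same direction, twice the length
doc_partial = {"wing": 3.0, "flow": 4.0}          # shares only "wing"
doc_disjoint = {"flow": 1.0}                      # no shared terms

print(sparse_cosine(query_vec, doc_same_direction))  # approximately 1
print(sparse_cosine(query_vec, doc_partial))         # between 0 and 1
print(sparse_cosine(query_vec, doc_disjoint))        # 0.0
```

Dividing by both magnitudes makes the score depend on direction only, so a long document does not outrank a short one merely because its weights are larger.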


In [None]:
def rank_documents(query):
    query_tfidf = get_query_tfidf(query, inverted_index, tfidf_scores)

    similarities = []

    for doc_id in tfidf_scores:
        similarity = cosine_similarity(query_tfidf, tfidf_scores[doc_id])
        similarities.append((doc_id, similarity))

    similarities.sort(key=lambda x: x[1], reverse=True)

    return similarities[:10]
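
`rank_documents` recomputes every document's magnitude for each query and sorts all scores before keeping ten. For a collection of this size that is unlikely to matter; for larger ones, a common variation is to compute the document norms once and select the top results with `heapq.nlargest`. The sketch below shows that variation on toy vectors. `build_doc_norms`, `rank_top_k`, and the toy data are hypothetical and not part of the notebook.

```python
import heapq
import math

def build_doc_norms(doc_vectors):
    """Precompute the Euclidean norm of every document vector once."""
    return {
        doc_id: math.sqrt(sum(w * w for w in vec.values()))
        for doc_id, vec in doc_vectors.items()
    }

def rank_top_k(query_vec, doc_vectors, doc_norms, k=10):
    """Return the k (doc_id, cosine) pairs with the highest similarity."""
    query_norm = math.sqrt(sum(w * w for w in query_vec.values()))
    scores = []
    for doc_id, vec in doc_vectors.items():
        denom = query_norm * doc_norms[doc_id]
        if denom == 0:
            scores.append((doc_id, 0.0))
            continue
        dot = sum(w * vec.get(term, 0.0) for term, w in query_vec.items())
        scores.append((doc_id, dot / denom))
    # nlargest avoids sorting the full list when k is small
    return heapq.nlargest(k, scores, key=lambda pair: pair[1])

# Toy TF-IDF vectors (made-up weights)
toy_vectors = {
    "d1": {"wing": 2.0, "slipstream": 5.0},
    "d2": {"wing": 1.0, "lift": 1.0},
    "d3": {"flow": 3.0},
}
toy_norms = build_doc_norms(toy_vectors)
top_two = rank_top_k({"wing": 1.0, "lift": 1.0}, toy_vectors, toy_norms, k=2)
for doc_id, score in top_two:
    print(doc_id, round(score, 3))
```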


## Step 7: Handling User Input for Multiple Queries

In this step, the user enters **multiple queries** interactively. The number of queries is set by the `range` in the `for` loop; `range(1, 3)` prompts for two.

For each query entered, the engine ranks the documents by their **Cosine Similarity** to the query and prints the top 10.

In [None]:
def test_multiple_user_inputs():
    queries = []

    print("Please enter search queries:")
    for i in range(1, 3):
        query = input(f"Enter query {i}: ")
        queries.append(query)

    for query in queries:
        print(f"Testing query: {query}")
        top_documents = rank_documents(query)
        for doc_id, similarity in top_documents:
            print(f"Document {doc_id} - Similarity: {similarity}")
        print("-" * 50)

test_multiple_user_inputs()


Please enter search queries:


Enter query 1:  aeroelastic problems
Enter query 2:  slender conical wings


Testing query: aeroelastic problems
Document 746 - Similarity: 0.5162829056599343
Document 875 - Similarity: 0.34890303298886677
Document 781 - Similarity: 0.24123787275371236
Document 12 - Similarity: 0.22877521283067803
Document 14 - Similarity: 0.21428828154357907
Document 685 - Similarity: 0.17268928451467708
Document 284 - Similarity: 0.16395843169681382
Document 141 - Similarity: 0.15026406258008437
Document 1361 - Similarity: 0.14046464118688892
Document 390 - Similarity: 0.12201489530958481
--------------------------------------------------
Testing query: slender conical wings
Document 633 - Similarity: 0.4006985119440997
Document 752 - Similarity: 0.3740413963481433
Document 683 - Similarity: 0.3221847744750085
Document 513 - Similarity: 0.32130820053841525
Document 1070 - Similarity: 0.3194540592510208
Document 1058 - Similarity: 0.2908455156213503
Document 601 - Similarity: 0.2880270341369228
Document 465 - Similarity: 0.27946546679755213
Document 680 - Similarity: 0.2733836
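
Because `input()` blocks until someone types, the cell above cannot be re-run unattended. One possible alternative is to pass the queries in as a list and take the ranking function as a parameter. The sketch below assumes that design; `run_queries` and `fake_rank` are hypothetical, the document IDs and scores it prints are placeholders, and in the notebook the second argument would be `rank_documents`.

```python
def run_queries(queries, rank_fn, top_n=3):
    """Rank each query with rank_fn and return {query: [(doc_id, score), ...]}."""
    results = {}
    for query in queries:
        results[query] = rank_fn(query)[:top_n]
    return results

def fake_rank(query):
    """Stand-in ranker for this sketch only: placeholder IDs and scores."""
    base = max(1, len(query.split()))
    return [("dA", 0.5 / base), ("dB", 0.3 / base), ("dC", 0.2 / base), ("dD", 0.1 / base)]

report = run_queries(["aeroelastic problems", "slender conical wings"], fake_rank)
for query, hits in report.items():
    print(f"Testing query: {query}")
    for doc_id, score in hits:
        print(f"Document {doc_id} - Similarity: {score:.4f}")
    print("-" * 50)
```

Returning the results instead of printing them inside the function also makes it possible to compare runs or write the rankings to a file.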

## Step 8: Reading Queries from cran.qry

In this step, we read the predefined queries from the **cran.qry** file. For each query we extract its ID and text, then rank the documents against it with the search engine.

In [None]:
queries = []

with open("/kaggle/input/information-ret/cran.qry", "r", encoding="utf-8") as f:
    content = f.read()

raw_queries = content.strip().split(".I ")[1:]

for raw_query in raw_queries:
    parts = raw_query.split("\n.W\n")
    if len(parts) < 2:
        continue

    query_id = parts[0].strip()
    query_text = parts[1].strip()

    queries.append({
        "id": query_id,
        "query": query_text
    })

for query in queries:
    query_text = query["query"]
    print(f"Testing query: {query_text}")
    top_documents = rank_documents(query_text)
    for doc_id, similarity in top_documents:
        print(f"Document {doc_id} - Similarity: {similarity}")
    print("-" * 50)


Testing query: what similarity laws must be obeyed when constructing aeroelastic models
of heated high speed aircraft .
Document 51 - Similarity: 0.2705118029810288
Document 746 - Similarity: 0.24020465925891124
Document 359 - Similarity: 0.20846121267541776
Document 875 - Similarity: 0.18525992743688613
Document 13 - Similarity: 0.1794288024366314
Document 56 - Similarity: 0.17942148554813572
Document 12 - Similarity: 0.17638242504480664
Document 486 - Similarity: 0.14833458942226246
Document 584 - Similarity: 0.14083125315903686
Document 685 - Similarity: 0.13971730958454948
--------------------------------------------------
Testing query: what are the structural and aeroelastic problems associated with flight
of high speed aircraft .
Document 12 - Similarity: 0.4579373478031025
Document 746 - Similarity: 0.40526811530518575
Document 51 - Similarity: 0.32788591691126917
Document 875 - Similarity: 0.24770325140724356
Document 100 - Similarity: 0.2059761779971445
Document 1169 - Simila
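
The printed queries span two lines because the query text in the file contains embedded newlines. A minimal sketch of parsing the same `.I` / `.W` layout with the whitespace collapsed is shown below. `parse_queries` and the inline `sample_qry` string are hypothetical; the query text is copied from the output above, while the IDs `001` and `002` are assumed for the example.

```python
def parse_queries(content):
    """Parse '.I <id>' / '.W' records into a list of {'id', 'query'} dicts."""
    parsed = []
    for raw_query in content.strip().split(".I ")[1:]:
        parts = raw_query.split("\n.W\n")
        if len(parts) < 2:
            continue
        parsed.append({
            "id": parts[0].strip(),
            # str.split() with no arguments collapses newlines and repeated spaces
            "query": " ".join(parts[1].split()),
        })
    return parsed

# Assumed IDs; query text taken from the printed output
sample_qry = (
    ".I 001\n.W\nwhat similarity laws must be obeyed when constructing aeroelastic models\n"
    "of heated high speed aircraft .\n"
    ".I 002\n.W\nwhat are the structural and aeroelastic problems associated with flight\n"
    "of high speed aircraft .\n"
)
parsed_queries = parse_queries(sample_qry)
for q in parsed_queries:
    print(q["id"], "->", q["query"])
```

Keeping each query on one line makes the log easier to scan and to search.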