# Query Processing – Sports IR

In this lab, we demonstrate **query processing techniques** for Information Retrieval.  
We will perform **query expansion**, **spelling correction**, and handle **query language variations** using cricket-related documents and vocabulary.


In [3]:
import nltk
from nltk.corpus import wordnet as wn
from collections import Counter

# Download NLTK resources
# (newer NLTK releases may additionally need nltk.download('punkt_tab') for word_tokenize)
nltk.download('wordnet')
nltk.download('punkt')

# Cricket-themed documents
documents = [
    "Sachin scored a brilliant century in the match.",
    "Virat Kohli is the captain of the Indian cricket team.",
    "The bowler took three wickets in the final over.",
    "He hit a six to win the game."
]

query = "batsman"
vocab = ["batsman", "bowler", "wicket", "inning", "century", "six", "match", "over"]
corpus = " ".join(documents)


[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


## Query Expansion Using WordNet

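We expand the query with the lemma names of every **WordNet** synset that contains a query term, i.e. its synonyms across all senses.  
Expansion mainly improves recall: a document that uses a different word for the same concept can still match. Because every sense is included, it can also add terms that have nothing to do with cricket.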


In [4]:
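def query_expansion_wordnet(query):
    """Return the query terms plus every WordNet lemma that shares a synset with them."""
    words = nltk.word_tokenize(query)
    expanded_query = set(words)  # start from the original terms

    for word in words:
        for syn in wn.synsets(word):      # every sense of the word
            for lemma in syn.lemmas():    # every synonym within that sense
                # WordNet joins multi-word lemmas with underscores
                expanded_query.add(lemma.name().replace('_', ' '))
    # A set has no fixed order, so the order of the returned list can vary between runs
    return list(expanded_query)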

print("1. Query Expansion (WordNet):")
print(query_expansion_wordnet(query))


1. Query Expansion (WordNet):
['batter', 'slugger', 'batsman', 'hitter']
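
Expansion only pays off when the added terms actually occur in the collection. The sketch below illustrates this with OR-retrieval over whole-word tokens. It assumes the expanded term list printed above (copied in by hand so the block runs without WordNet) and adds one hypothetical fifth document that is not part of the lab's collection; `tokenize` and `retrieve_any` are illustrative helper names.

```python
import re

# Terms as printed by the WordNet expansion above, copied by hand (assumption)
expanded_terms = ["batsman", "batter", "hitter", "slugger"]

docs = [
    "Sachin scored a brilliant century in the match.",
    "Virat Kohli is the captain of the Indian cricket team.",
    "The bowler took three wickets in the final over.",
    "He hit a six to win the game.",
    "The batter was out for a duck.",  # hypothetical extra document
]

def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())

def retrieve_any(terms, collection):
    """Return documents containing at least one of the terms as a whole word."""
    term_set = set(terms)
    return [d for d in collection if term_set & set(tokenize(d))]

print("Original query:", retrieve_any(["batsman"], docs))
print("Expanded query:", retrieve_any(expanded_terms, docs))
```

With the original query nothing matches; with the expanded query the hypothetical "batter" document is returned.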


## Spelling Correction

We demonstrate **three approaches** to correct misspelled cricket terms:

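1. **Edit Distance** – picks the vocabulary word with the smallest Levenshtein distance (insertions, deletions, substitutions) to the misspelled word.
2. **K-grams** – ranks vocabulary words by the share of the misspelled word's character k-grams they contain.
3. **Context-sensitive** – uses the words that occur next to a term in the corpus.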


In [5]:
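# A. Edit Distance
def edit_distance(w1, w2):
    """Levenshtein distance: minimum number of insertions, deletions and substitutions."""
    # dp[i][j] = distance between the first i characters of w1 and the first j of w2
    dp = [[0]*(len(w2)+1) for _ in range(len(w1)+1)]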
    for i in range(len(w1)+1):
        for j in range(len(w2)+1):
            if i == 0:
                dp[i][j] = j
            elif j == 0:
                dp[i][j] = i
            elif w1[i-1] == w2[j-1]:
                dp[i][j] = dp[i-1][j-1]
            else:
                dp[i][j] = 1 + min(dp[i-1][j], dp[i][j-1], dp[i-1][j-1])
    return dp[-1][-1]

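def correct_by_edit_distance(word, vocab):
    """Return the vocabulary word closest to `word`; on ties the earliest entry in vocab wins."""
    min_dist = float('inf')
    correction = word  # returned unchanged only if vocab is empty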
    for w in vocab:
        dist = edit_distance(word, w)
        if dist < min_dist:
            min_dist = dist
            correction = w
    return correction

print("Edit Distance Correction of 'batman':")
print(correct_by_edit_distance("batman", vocab))


Edit Distance Correction of 'batman':
batsman
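
A nearest-word corrector always returns something, even for input that resembles nothing in the vocabulary, and it hides ties. One possible refinement, sketched below, is to return every candidate at the minimum distance and to give up when that minimum exceeds a threshold. The names `levenshtein` and `candidates_within` and the default `max_dist=2` are assumptions made for illustration, not part of the lab code.

```python
def levenshtein(a, b):
    """Levenshtein distance computed with two rolling rows instead of a full table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

def candidates_within(word, vocab, max_dist=2):
    """Return all vocabulary words at the minimum distance, or [] if that exceeds max_dist."""
    scored = [(levenshtein(word, v), v) for v in vocab]
    if not scored:
        return []
    best = min(d for d, _ in scored)
    if best > max_dist:
        return []
    return [v for d, v in scored if d == best]

vocab = ["batsman", "bowler", "wicket", "inning", "century", "six", "match", "over"]
print(candidates_within("batman", vocab))
print(candidates_within("xyzzy", vocab, max_dist=1))
```

Returning a list leaves the final choice to a later step, such as a frequency or context model.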


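# B. K-gram method
def k_grams(word, k=2):
    """Set of all character k-grams in word (empty if the word is shorter than k)."""
    return {word[i:i+k] for i in range(len(word)-k+1)}

def kgram_correction(word, vocab, k=2):
    """Return the three vocabulary words sharing the largest fraction of word's k-grams."""
    word_grams = k_grams(word, k)
    scores = []
    for v in vocab:
        v_grams = k_grams(v, k)
        # Fraction of the query word's k-grams found in v (not symmetric, unlike Jaccard)
        overlap = len(word_grams & v_grams) / max(len(word_grams), 1)
        scores.append((v, overlap))
    return sorted(scores, key=lambda x: -x[1])[:3]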

print("K-gram Correction for 'wiket':", kgram_correction("wiket", vocab))
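
A common variant of the k-gram approach scores candidates with the Jaccard coefficient and pads each word with a boundary marker so that the first and last letters also contribute a k-gram. The sketch below is one way to write it; the `$` marker and the names `padded_kgrams`, `jaccard` and `rank_by_jaccard` are illustrative choices, not taken from the lab code.

```python
def padded_kgrams(word, k=2):
    """Character k-grams of the word wrapped in '$' boundary markers."""
    padded = "$" + word + "$"
    return {padded[i:i + k] for i in range(len(padded) - k + 1)}

def jaccard(a, b):
    """Jaccard coefficient |A & B| / |A | B| of two sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def rank_by_jaccard(word, vocab, k=2, top_n=3):
    """Rank vocabulary words by the Jaccard overlap of their padded k-grams with word."""
    word_grams = padded_kgrams(word, k)
    scored = [(v, jaccard(word_grams, padded_kgrams(v, k))) for v in vocab]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]

vocab = ["batsman", "bowler", "wicket", "inning", "century", "six", "match", "over"]
print(rank_by_jaccard("wiket", vocab))
```

For "wiket" against "wicket", five of the eight distinct padded bigrams are shared, so the Jaccard score is 5/8 = 0.625, and no other vocabulary word shares any bigram with it.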


In [6]:
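# C. Context-sensitive (based on co-occurrence)
def context_sensitive(word, corpus, vocab):
    """Count the words that appear directly before or after `word` in the corpus.

    This only gathers context statistics; it does not choose a correction,
    and `vocab` is currently unused.
    """
    # Lowercase and strip trailing punctuation so that "match." counts as "match"
    tokens = [t.strip('.,!?').lower() for t in corpus.split()]
    context = Counter()
    for i, w in enumerate(tokens):
        if w != word:
            continue
        if i > 0:
            context[tokens[i-1]] += 1
        if i < len(tokens)-1:
            context[tokens[i+1]] += 1
    return context.most_common(3)

# "batsman" never occurs in the documents, so no context is found and the result is empty
print("Context-sensitive Correction for 'batsman':", context_sensitive("batsman", corpus, vocab))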


Context-sensitive Correction for 'batsman': []
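
To turn neighbour counts into an actual correction, one option is to score each candidate spelling by how often it appears next to the surrounding query words in the corpus. The sketch below does this with bigram counts over the four lab documents. The candidate lists are hypothetical (in practice they might come from the edit-distance or k-gram step), and `pick_by_context` is an illustrative name.

```python
import re
from collections import Counter

documents = [
    "Sachin scored a brilliant century in the match.",
    "Virat Kohli is the captain of the Indian cricket team.",
    "The bowler took three wickets in the final over.",
    "He hit a six to win the game.",
]

def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())

# Count adjacent word pairs inside each document (no pairs across document boundaries)
bigram_counts = Counter()
for doc in documents:
    toks = tokenize(doc)
    bigram_counts.update(zip(toks, toks[1:]))

def pick_by_context(prev_word, next_word, candidates):
    """Return the candidate seen most often between prev_word and next_word.

    Pass None for a missing neighbour. If no candidate has any support,
    the first candidate is returned.
    """
    def score(c):
        return bigram_counts[(prev_word, c)] + bigram_counts[(c, next_word)]
    return max(candidates, key=score)

# Hypothetical candidates for the misspelling in "brilliant centry in"
print(pick_by_context("brilliant", "in", ["sentry", "century", "entry"]))
```

With such a small corpus most candidates score zero, so this is only meaningful when the neighbouring words were actually observed.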


## Query Language Variations

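We simulate different query types in cricket IR. All three use case-insensitive substring matching against the documents:

1. **Single keyword query**
2. **Boolean query** – e.g., "batsman AND century"
3. **Natural/structured query** – e.g., "Show me documents about bowlers taking wickets", reduced by hand to a list of key terms that are combined with OR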


In [7]:
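# Single query: case-insensitive substring match on the query term
# ("batsman" does not appear in any document, so the result is empty)
single_query = [d for d in documents if query in d.lower()]
print("Single Query Results:", single_query)

# Boolean query: "batsman AND century" (both terms must appear in the same document)
bool_query = [d for d in documents if "batsman" in d.lower() and "century" in d.lower()]
print("Boolean Query (batsman AND century):", bool_query)

# Natural / Structured query simulation: the key terms are picked by hand,
# not extracted from natural_query
natural_query = "Show me documents about batsmen scoring centuries or bowlers taking wickets."
structured_terms = ["batsman", "bowler", "century", "wicket"]

# OR semantics: keep a document if it contains any of the terms
structured_results = [d for d in documents if any(term in d.lower() for term in structured_terms)]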
print("Natural Query:", natural_query)
print("Structured Query Results:")
for result in structured_results:
    print(result)


Single Query Results: []
Boolean Query (batsman AND century): []
Natural Query: Show me documents about batsmen scoring centuries or bowlers taking wickets.
Structured Query Results:
Sachin scored a brilliant century in the match.
The bowler took three wickets in the final over.
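
The list comprehensions above rescan every document for every query. A more typical IR arrangement is an inverted index that maps each term to the set of documents containing it, so that AND and OR become set intersection and union. The sketch below is a minimal version under that assumption; `index`, `boolean_and` and `boolean_or` are illustrative names. It matches whole tokens, so unlike the substring test above, "wicket" does not match "wickets" unless stemming is added.

```python
import re
from collections import defaultdict

documents = [
    "Sachin scored a brilliant century in the match.",
    "Virat Kohli is the captain of the Indian cricket team.",
    "The bowler took three wickets in the final over.",
    "He hit a six to win the game.",
]

def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())

# Inverted index: term -> set of ids of the documents that contain it
index = defaultdict(set)
for doc_id, doc in enumerate(documents):
    for tok in tokenize(doc):
        index[tok].add(doc_id)

def boolean_and(*terms):
    """Ids of documents that contain every term."""
    postings = [index.get(t, set()) for t in terms]
    return sorted(set.intersection(*postings)) if postings else []

def boolean_or(*terms):
    """Ids of documents that contain at least one term."""
    return sorted(set().union(*(index.get(t, set()) for t in terms)))

print("century AND match:", boolean_and("century", "match"))
print("batsman AND century:", boolean_and("batsman", "century"))
print("century OR wickets:", boolean_or("century", "wickets"))
```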
