
Question 1

The algorithm for suggesting query augmentation terms is designed to address the dual goals of relevance and diversity. It aims to provide users with a rich set of terms to expand their query while capturing different facets of the underlying information need. Below, I explain the design and rationale behind each step of the process.

The algorithm begins with the user query and the top N documents retrieved as relevant by an Information Retrieval system. Preprocessing these documents is crucial to ensure that only meaningful terms are considered. Tokenization breaks text into individual words, while stopword removal eliminates common words that do not contribute to meaning. Stemming or lemmatization reduces words to their root forms, helping group variations together.

Candidate terms are extracted from the preprocessed documents using TF-IDF. TF-IDF ranks terms based on their frequency in the top N documents while penalizing terms common across all documents. This scoring highlights terms that are both significant within the retrieved set and less generic, ensuring that suggestions are specific to the query context. Original query terms are excluded from the candidate list, as the focus is on generating additional terms.
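The scoring step can be illustrated with a small, dependency-free sketch. It assumes already-tokenized documents and uses a simplified TF-IDF variant (relative term frequency times `log(N / df) + 1`), which is not the exact formula `TfidfVectorizer` applies. The function name `tfidf_candidates`, the `top_k` parameter, and the toy documents are hypothetical and only serve the illustration.

```python
import math
from collections import Counter

def tfidf_candidates(docs, query_terms, top_k=5):
    """Rank non-query terms by their summed TF-IDF score over tokenized docs."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    scores = Counter()
    for doc in docs:
        for term, count in Counter(doc).items():
            tf = count / len(doc)                    # relative term frequency
            idf = math.log(n_docs / df[term]) + 1.0  # rarer terms weigh more
            scores[term] += tf * idf
    excluded = set(query_terms)
    # Sort by descending score; break ties alphabetically for stable output.
    ranked = sorted(
        (term for term in scores if term not in excluded),
        key=lambda term: (-scores[term], term),
    )
    return ranked[:top_k]

# Toy tokenized documents (made up for illustration).
docs = [
    ["climate", "change", "emissions", "policy"],
    ["climate", "change", "carbon", "emissions"],
    ["climate", "sea", "level", "rise"],
]
top_terms = tfidf_candidates(docs, query_terms=["climate", "change"], top_k=3)
print(top_terms)
```

Here "emissions" ranks first because it occurs in two of the three documents, while the query terms "climate" and "change" never appear in the output.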

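To compare terms by meaning rather than by spelling, the candidate terms are converted into vector representations using word embeddings, either pre-trained or, as in the code below, trained with Word2Vec on the retrieved documents themselves. These embeddings capture semantic relationships between words based on the contexts in which they are used. For example, "global warming" and "climate change" would have similar vector representations (for multi-word phrases, one option is to average the vectors of the individual words).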
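As a rough illustration of what "similar vector representations" means, the sketch below compares phrases by cosine similarity. The three-dimensional vectors are invented for the example (real embeddings are learned and typically have 100 or more dimensions), and averaging word vectors to represent a phrase is one common heuristic, assumed here rather than taken from the notebook.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mean_vector(vectors):
    """Element-wise mean, used here as a simple phrase representation."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

# Made-up toy embeddings, NOT output from a trained model.
toy_vectors = {
    "global": [0.7, 0.6, 0.2],
    "warming": [0.9, 0.8, 0.1],
    "climate": [0.8, 0.9, 0.2],
    "change": [0.6, 0.7, 0.3],
    "football": [0.1, 0.2, 0.9],
}

global_warming = mean_vector([toy_vectors["global"], toy_vectors["warming"]])
climate_change = mean_vector([toy_vectors["climate"], toy_vectors["change"]])

sim_related = cosine(global_warming, climate_change)
sim_unrelated = cosine(global_warming, toy_vectors["football"])
print(sim_related > sim_unrelated)
```

With these toy values the two climate phrases score close to 1, while the unrelated term scores much lower.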

A key requirement is diversity in suggestions, ensuring that various aspects of the query intent are covered. Clustering algorithms are applied to group semantically similar terms. Each cluster represents a distinct theme or facet related to the query. The number of clusters can be determined dynamically, balancing granularity and usability.

From each cluster, the algorithm selects representative terms, ensuring diversity by picking terms closest to the cluster centroid. This avoids redundancy while preserving relevance within the thematic grouping.

Candidate terms are scored based on their relevance to the query and their co-occurrence with query terms in the retrieved documents. A penalty is applied to similar terms to maintain diversity.
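The similarity penalty is not implemented in the code cells that follow, so here is one way it could look: a greedy selection in the style of maximal marginal relevance, where each candidate's relevance score is reduced in proportion to its highest similarity with any term already chosen. The function `select_diverse`, the `penalty` weight, and all scores and vectors are hypothetical values chosen for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def select_diverse(candidates, relevance, vectors, k=3, penalty=0.5):
    """Greedily pick k terms, penalizing similarity to terms already picked."""
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < k:
        def adjusted(term):
            # Highest similarity to any selected term (0.0 if none selected yet).
            max_sim = max(
                (cosine(vectors[term], vectors[s]) for s in selected),
                default=0.0,
            )
            return relevance[term] - penalty * max_sim
        best = max(remaining, key=adjusted)
        selected.append(best)
        remaining.remove(best)
    return selected

# Made-up relevance scores and 2-d vectors for illustration only.
relevance = {"emissions": 0.9, "carbon": 0.85, "policy": 0.6, "treaty": 0.5}
vectors = {
    "emissions": [1.0, 0.1],
    "carbon": [0.95, 0.2],
    "policy": [0.1, 1.0],
    "treaty": [0.2, 0.9],
}
terms = list(relevance)

with_penalty = select_diverse(terms, relevance, vectors, k=2, penalty=0.5)
without_penalty = select_diverse(terms, relevance, vectors, k=2, penalty=0.0)
print(with_penalty, without_penalty)
```

Without the penalty the two most relevant terms are near-synonyms ("emissions" and "carbon"). With it, the second pick moves to "policy", which covers a different facet.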

In [13]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity
from gensim.models import Word2Vec


In [31]:
import nltk
# Note: do not reset nltk.data.path to an empty list; NLTK would then be
# unable to locate the resources downloaded below.
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


In [32]:
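def preprocess_documents(documents):
    """Lowercase and tokenize each document, dropping stopwords and non-alphanumeric tokens."""
    stop_words = set(stopwords.words('english'))
    processed_docs = []
    for doc in documents:
        tokens = word_tokenize(doc.lower())
        # Keep alphanumeric tokens only (removes punctuation) and skip stopwords.
        tokens = [t for t in tokens if t.isalnum() and t not in stop_words]
        processed_docs.append(tokens)
    return processed_docs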

In [27]:
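def get_candidate_terms(docs, query_terms):
    """Return non-query terms sorted by summed TF-IDF score, highest first."""
    vectorizer = TfidfVectorizer()
    # TfidfVectorizer expects strings, so re-join the token lists.
    tfidf_matrix = vectorizer.fit_transform([' '.join(doc) for doc in docs])
    feature_names = vectorizer.get_feature_names_out()
    # Sum each term's TF-IDF weight over all documents.
    scores = tfidf_matrix.sum(axis=0).A1
    query_terms = set(query_terms)
    candidate_terms = [
        feature_names[i] for i in scores.argsort()[::-1] if feature_names[i] not in query_terms
    ]
    return candidate_terms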

In [28]:
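def cluster_terms(candidate_terms, docs, n_clusters=5):
    """Group candidate terms into clusters of semantically similar terms."""
    # Word2Vec is trained on the retrieved documents themselves (not pre-trained).
    model = Word2Vec(docs, vector_size=100, min_count=1, workers=4)
    # Keep terms and vectors aligned: drop terms missing from the vocabulary.
    terms = [term for term in candidate_terms if term in model.wv]
    if not terms:
        return {}
    term_vectors = [model.wv[term] for term in terms]
    # KMeans cannot form more clusters than there are samples.
    kmeans = KMeans(n_clusters=min(n_clusters, len(terms)), n_init=10, random_state=0)
    labels = kmeans.fit_predict(term_vectors)
    clusters = {}
    for term, label in zip(terms, labels):
        clusters.setdefault(int(label), []).append(term)
    return clusters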
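The clustering cell groups terms but does not yet pick the term closest to each centroid, as the write-up describes. A minimal sketch of that selection step follows, assuming a `clusters` dictionary of label to terms and a `vectors` dictionary of term to embedding. The helper `closest_to_centroid` and the toy data are hypothetical and are not wired into `suggest_query_terms`.

```python
import math

def closest_to_centroid(clusters, vectors):
    """For each cluster, return the term nearest (Euclidean) to the cluster mean."""
    representatives = {}
    for label, terms in clusters.items():
        dim = len(vectors[terms[0]])
        # Centroid = element-wise mean of the cluster's term vectors.
        centroid = [sum(vectors[t][i] for t in terms) / len(terms) for i in range(dim)]
        representatives[label] = min(
            terms, key=lambda t: math.dist(vectors[t], centroid)
        )
    return representatives

# Made-up clusters and 2-d vectors for illustration only.
clusters = {
    0: ["emissions", "carbon", "co2"],
    1: ["policy", "treaty", "agreement"],
}
vectors = {
    "emissions": [1.0, 0.0],
    "carbon": [0.8, 0.2],
    "co2": [0.6, 0.1],
    "policy": [0.0, 1.0],
    "treaty": [0.2, 0.8],
    "agreement": [0.1, 0.6],
}
representatives = closest_to_centroid(clusters, vectors)
print(representatives)
```

In a full pipeline, the vectors would come from the trained embedding model, and the centroid could equally be read from the fitted `KMeans` object's `cluster_centers_` attribute.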

In [29]:
def suggest_query_terms(query, documents, top_n=5):
    """Return up to top_n suggested terms per cluster, one list per cluster."""
    processed_docs = preprocess_documents(documents)
    query_terms = query.lower().split()
    candidate_terms = get_candidate_terms(processed_docs, query_terms)
    clusters = cluster_terms(candidate_terms, processed_docs)
    # Terms within each cluster keep their TF-IDF ranking order.
    suggestions = [terms[:top_n] for terms in clusters.values()]
    return suggestions
