<a href="https://colab.research.google.com/github/RafaelNovais/MasterAI/blob/master/IRExam.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#1. Weighting Schemes
* Describe, in your own words, with reference to any well-known term weighting scheme, the main constituents of a good weighting scheme.

A good weighting scheme in information retrieval, such as TF-IDF (Term Frequency-Inverse Document Frequency) or related schemes, is designed to assign a weight to terms in a document based on their importance relative to the document and the entire collection (corpus). The main constituents of a good weighting scheme include:

1. Term Frequency (TF)
Purpose: Measures how often a term occurs in a document.
Reasoning: Terms that appear more frequently in a document are likely more relevant to its content.
Challenges: Raw frequency alone can overemphasize commonly occurring terms in longer documents, requiring normalization (e.g., by document length).

**TF = number of occurrences of the term / total number of terms, within the document**

2. Inverse Document Frequency (IDF)
Purpose: Measures the rarity of a term across the entire corpus.
Reasoning: Terms that occur in many documents (e.g., "the", "is") are less useful for distinguishing one document from another.
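Formula: Typically expressed as IDF(t) = log(N / df(t)), where N is the total number of documents in the corpus and df(t) is the number of documents that contain the term t.

**Total number of documents / number of documents containing the term, computed over the whole collection**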

Challenges: Rare terms are not always more relevant, and logarithmic scaling is often used to moderate IDF values.

3. Normalization
Purpose: Ensures comparability between documents of varying lengths and term distributions.
Methods: Can include cosine normalization or length-based scaling to avoid bias toward longer documents.
Benefit: Improves ranking fairness and retrieval accuracy.
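
The three constituents above can be combined in a few lines. This is a minimal sketch, not a reference implementation: the function name `tf_idf_vectors`, the toy documents, and the choice of length-normalized TF with a plain `log(N / df)` IDF are assumptions made for illustration.

```python
import math
from collections import Counter

def tf_idf_vectors(docs):
    """Return one cosine-normalized TF-IDF vector (a dict) per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        # TF normalized by document length, multiplied by a log-scaled IDF.
        weights = {
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        }
        # Cosine normalization: scale the vector to unit length.
        norm = math.sqrt(sum(w * w for w in weights.values()))
        if norm > 0:
            weights = {t: w / norm for t, w in weights.items()}
        vectors.append(weights)
    return vectors

# Hypothetical toy collection, already tokenized.
docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "information retrieval ranks documents".split(),
]
vectors = tf_idf_vectors(docs)
```

In the first document, "sat" ends up with a higher weight than "cat" because "cat" also occurs in the second document, so its IDF is lower.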

--------------------------------------------------
#2. Evaluate
* Precision and recall are often used to measure the quality of an IR system. Explain what is meant by these terms. Suggest any alternative approach that you might consider to measure the quality of the IR system.

Precision and recall are key metrics in Information Retrieval (IR), used to evaluate how well a system retrieves relevant documents for a query.

1. Precision

Definition: The proportion of retrieved documents that are relevant.

Example: If an IR system retrieves 10 documents, and 7 of them are relevant, the precision is 70%.

Significance: Indicates the system’s ability to avoid retrieving irrelevant documents.

2. Recall

Definition: The proportion of all relevant documents in the corpus that are retrieved.

Example: If there are 50 relevant documents in the corpus, and the system retrieves 25 of them, the recall is 50%.

Significance: Indicates the system’s ability to retrieve all relevant documents.
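
Both definitions reduce to set arithmetic. The sketch below assumes that retrieved and relevant documents are given as collections of IDs; the helper name `precision_recall` and the document IDs are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Compute (precision, recall) from collections of document IDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# 10 documents retrieved, 7 of them relevant; 14 relevant documents exist in total.
retrieved = [f"d{i}" for i in range(1, 11)]
relevant = [f"d{i}" for i in range(1, 8)] + [f"d{i}" for i in range(11, 18)]
precision, recall = precision_recall(retrieved, relevant)
```

With these inputs the precision is 7/10 and the recall is 7/14.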


**Alternative Metrics to Measure Quality**

1. F1-Score

 Definition: Harmonic mean of precision and recall, providing a single measure of balance.

  Use Case: Useful when a balance between precision and recall is desired

2. Mean Average Precision (MAP)

  Definition: The mean, over all queries in a dataset, of each query's average precision (the average of the precision values at the ranks where relevant documents appear).

  Advantage: Takes the ranking of retrieved documents into account, rewarding systems that place relevant documents higher.
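
A rough sketch of both alternative metrics follows. The function names and the input format for `mean_average_precision` (a list of `(ranking, relevant)` pairs, one per query) are assumptions for illustration.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def average_precision(ranking, relevant):
    """Average of the precision values at each rank where a relevant document appears."""
    relevant = set(relevant)
    hits, total = 0, 0.0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank  # precision at this rank
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """runs: list of (ranking, relevant) pairs, one per query."""
    if not runs:
        return 0.0
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

# The same two relevant documents, ranked well and ranked less well.
good = average_precision(["a", "b", "x"], {"a", "b"})
worse = average_precision(["a", "x", "b"], {"a", "b"})
```

Both rankings retrieve the same documents, so their precision and recall are identical, but `good` is higher than `worse` because average precision rewards placing relevant documents earlier.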


----------------------------------------------------------------
#3. Preprocessing
* Outline a suitable approach to indexing a document collection to allow efficient handling of queries in a system adopting a vector space framework.

1. Preprocessing the Document Collection
  * Tokenization: Break text into individual terms (e.g., words or phrases).
  * Stop Word Removal: Exclude common, non-informative words (e.g., "the", "is").
  * Stemming/Lemmatization: Reduce words to their root forms (e.g., "running" → "run").
  * Lowercasing: Normalize terms to lowercase for case-insensitive matching.


2. Feature Extraction
  * Term Frequency (TF): Count occurrences of each term in a document.
  * Inverse Document Frequency (IDF): Compute the rarity of each term across the corpus.
  * TF-IDF Weighting: Calculate weighted term importance for each document
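
The preprocessing steps can be sketched as below. Two things here are assumptions rather than part of the outline above: the toy `naive_stem` function stands in for a real stemmer (such as Porter's), and the term counts are stored in an inverted index (term to `{doc_id: count}`), a common structure for efficient query handling. The stop-word list is a tiny illustrative sample.

```python
import re
from collections import Counter, defaultdict

STOP_WORDS = {"the", "is", "a", "an", "of", "on", "and"}  # illustrative only

def naive_stem(token):
    """Toy suffix stripper; a stand-in for a real stemmer, not a replacement."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            stem = token[: -len(suffix)]
            # Undo consonant doubling, e.g. "running" -> "runn" -> "run".
            if suffix != "s" and stem[-1] == stem[-2]:
                stem = stem[:-1]
            return stem
    return token

def preprocess(text):
    """Lowercase, tokenize, drop stop words, then stem."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]

def build_index(collection):
    """collection: dict of doc_id -> raw text. Returns term -> {doc_id: count}."""
    index = defaultdict(dict)
    for doc_id, text in collection.items():
        for term, count in Counter(preprocess(text)).items():
            index[term][doc_id] = count
    return index

index = build_index({"d1": "The cat is running", "d2": "Cats and dogs"})
```

At query time only the postings of the query's own terms need to be read, which is what makes this layout efficient compared with scanning every document vector.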

------------------------------------------------------------------
#1. Feedback
* Discuss an approach where user feedback (users identifying whether returned documents are relevant or not) can be used to extend or improve the query. Outline any issues with this approach.

  User feedback about the relevance of returned documents can be utilized to refine or extend the original query, improving retrieval performance. This approach, known as relevance feedback, is commonly used in systems adopting models like the vector space model or probabilistic retrieval.

  1. Basic Idea
  * Input: Users mark returned documents as "relevant" or "non-relevant" after an initial query.
  * Process: The system adjusts the query based on this feedback by incorporating terms from relevant documents and de-emphasizing terms from non-relevant ones.
  * Output: A refined query that retrieves more relevant documents in subsequent searches.
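
One well-known way to carry out this adjustment in the vector space model is the Rocchio algorithm, which is named here as an illustration; the notes above do not prescribe a specific method. In this sketch the vectors are plain dicts of term weights, and the default values for `alpha`, `beta`, and `gamma` are conventional starting points, not tuned settings.

```python
from collections import defaultdict

def rocchio(query, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Refine a query vector using feedback; all vectors are dicts of term -> weight."""
    updated = defaultdict(float)
    # Keep the original query terms.
    for term, w in query.items():
        updated[term] += alpha * w
    # Move towards the centroid of the documents marked relevant.
    for doc in relevant:
        for term, w in doc.items():
            updated[term] += beta * w / len(relevant)
    # Move away from the centroid of the documents marked non-relevant.
    for doc in non_relevant:
        for term, w in doc.items():
            updated[term] -= gamma * w / len(non_relevant)
    # Terms whose weight is not positive are dropped from the refined query.
    return {t: w for t, w in updated.items() if w > 0}

# Hypothetical feedback for the ambiguous query "jaguar".
refined = rocchio(
    query={"jaguar": 1.0},
    relevant=[{"jaguar": 1.0, "car": 1.0}],
    non_relevant=[{"jaguar": 1.0, "cat": 1.0}],
)
```

Here the refined query gains the term "car" from the relevant document, while "cat" is dropped because its weight would be negative.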

  Relevance feedback is a powerful approach for improving query effectiveness in IR systems, leveraging user interactions to dynamically refine results. However, the approach requires careful handling of user effort, noise, and computational complexity to ensure practicality and scalability. Proper balancing of automation and user input can maximize its benefits.

----------------------------------------------------------------------
#2. Implicit Feedback
* Predicting which query q, q' (modified through relevance feedback), is likely to perform better requires analyzing various aspects of their composition, context, and anticipated effectiveness. Below are strategies for making such predictions:

**Query Quality Metrics**

1. Query Specificity
* Definition: Measures how focused or narrow a query is in identifying relevant documents.
* Prediction: A more specific query (e.g., with unique, high-discrimination terms) is likely to perform better, as it reduces ambiguity.
* Example: If q' includes high-IDF terms from relevant documents, it may outperform q.
2. Query Generality
* Definition: Measures the breadth of terms included in a query.
* Prediction: If the information need is broad, the query that includes more diverse terms might perform better, since this improves recall.

To predict which query is likely to perform better, analyze metrics like specificity, precision, recall, and feedback utilization, while accounting for the characteristics of the document corpus and user needs. Empirical testing using relevance metrics (e.g., precision, recall) or predictive models can provide quantitative insights into performance differences between q and q'.
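
As a rough illustration of the specificity signal, the average IDF of the query terms can be compared for q and q'. This is only a sketch: the function name `average_idf`, the document frequencies, and the corpus size are hypothetical, and a higher value suggests a more specific query without guaranteeing better retrieval.

```python
import math

def average_idf(query_terms, doc_freq, n_docs):
    """Mean IDF of the query terms; unseen terms are treated as df = 1 (an assumption)."""
    if not query_terms:
        return 0.0
    total = sum(
        math.log(n_docs / max(doc_freq.get(term, 0), 1)) for term in query_terms
    )
    return total / len(query_terms)

# Hypothetical corpus statistics.
doc_freq = {"jaguar": 50, "car": 400, "xk120": 5}
n_docs = 1000

q = ["jaguar", "car"]
q_prime = ["jaguar", "car", "xk120"]  # q extended with a rare feedback term

spec_q = average_idf(q, doc_freq, n_docs)
spec_q_prime = average_idf(q_prime, doc_freq, n_docs)
```

In this example `spec_q_prime` is larger than `spec_q`, because the added term is rare in the corpus.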

FOR BETTER METRIC PERFORMANCE

#2. Pseudo FeedBack

To provide a diverse set of candidate terms for users to augment their queries, a pseudo-feedback mechanism can analyze the initial query results and extract terms that represent different facets of the information need. The goal is to present terms that enhance both precision and recall while covering a range of related topics or contexts. Below is a suggested approach:

1. Initial Retrieval
  * Retrieve Documents: Execute the initial query and retrieve the top N ranked documents (e.g., top 10-50).
  * Extract Terms: Identify candidate terms from these documents. Candidate terms could include:
    * High-frequency words.
    * Proper nouns, named entities (e.g., person names, locations, organizations).
    * Key phrases (using phrase detection algorithms).

2. Scoring and Filtering Terms
Assign scores to each extracted term to measure its importance and relevance:

  * TF-IDF Weighting:
  Prioritize terms that occur frequently in the top N documents but rarely in the entire corpus.


3. Clustering for Diversity
  * Cluster candidate terms into distinct groups to represent different facets of the topic:
  * Algorithm: Use clustering techniques like **K-Means**, DBSCAN, or hierarchical clustering on term vectors (e.g., embeddings).
  * Output: Each cluster represents a unique subtopic or context related to the query.
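
The retrieval and scoring steps (1 and 2) can be sketched as follows; the clustering step is left out. The function name `expansion_candidates`, the scoring formula (frequency in the top-N documents multiplied by `log(N_corpus / df)`), and all statistics are assumptions made for illustration.

```python
import math
from collections import Counter

def expansion_candidates(top_docs, doc_freq, n_docs, query_terms, k=5):
    """Rank terms from the top-N documents by a TF-IDF style score.

    top_docs: tokenized top-N documents; doc_freq: corpus-wide document frequencies.
    """
    tf = Counter(term for doc in top_docs for term in doc)
    scores = {
        term: count * math.log(n_docs / max(doc_freq.get(term, 0), 1))
        for term, count in tf.items()
        if term not in query_terms  # do not suggest terms already in the query
    }
    # Highest score first; ties broken alphabetically for stable output.
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]

# Hypothetical top-2 documents and corpus statistics for the query "jaguar".
top_docs = [
    ["jaguar", "car", "engine", "the"],
    ["jaguar", "engine", "speed", "the"],
]
doc_freq = {"the": 1000, "jaguar": 50, "car": 400, "engine": 100, "speed": 200}
candidates = expansion_candidates(top_docs, doc_freq, 1000, {"jaguar"}, k=3)
```

Here "engine" ranks first because it is frequent in the top documents and fairly rare in the corpus, while "the" scores zero and falls outside the top three.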


This approach ensures a diverse and informative set of candidate terms for query augmentation. By using clustering and ranking mechanisms, the system captures various aspects of the query topic, enabling users to refine their search effectively. Integrating interactive features empowers users to customize the query further, balancing automation with user control.

----------------------------------------------------------
#3. Query Difficulty Prediction
* Given a submitted query, we can process the query in many ways, e.g. query expansion. These additional techniques may be more beneficial for difficult queries. Suggest a suitable means to identify a difficult query.

Identifying a difficult query is crucial for deciding when to apply additional techniques like query expansion, relevance feedback, or other refinements. A suitable approach would involve analyzing multiple aspects of the query and its initial retrieval performance, as detailed below:
1.  Query Length
2.  Query Specificity
3.  Term Ambiguity
4.


A comprehensive approach to identifying difficult queries involves analyzing query structure, retrieval performance, term semantics, and user interaction data. Combining these metrics into a hybrid model ensures robust predictions, allowing the system to selectively apply advanced techniques like query expansion or clustering to improve retrieval effectiveness.














#1. Collab Filtering
#2. Similarity
#3. Clustering

#1. Learning MIR
#2. Format approaches
#3. Visualisation