# Computational Linguistics Semester VI Project

## Student Details
Name: Desh Iyer <br>
SAP ID: 500081889 <br>
Batch: Year III, AI/ML(H), B5 <br>

## Problem Statement
The project is on the "Applications of NLP and Emerging technologies in NLP". Refer to the uploaded units 4 & 5 presentations and create a summary of the sub-topics listed there.

---

## 1. Information Retrieval

### What is IR?
Information Retrieval (IR) in NLP is the task of retrieving relevant documents or information from a large collection of unstructured or semi-structured data in response to a user query. IR techniques are used in applications such as web search, recommendation systems, chatbots, and question answering systems.

The field of IR has gained immense popularity in recent years due to the explosive growth of digital data and the need to manage and extract insights from this data. IR techniques have become increasingly sophisticated, using a combination of machine learning algorithms, natural language processing, and knowledge representation techniques to deliver more accurate and relevant results.

One of the primary uses of IR in NLP is in search engines, which retrieve relevant web pages in response to user queries. For example, search engines like Google and Bing match user queries with relevant web pages, taking into account factors such as keyword relevance, page authority, and user behavior. IR is also used in recommendation systems to suggest products, movies, or music to users based on their past behavior or preferences. For instance, Amazon uses IR techniques to recommend products to its customers based on their purchase history and browsing behavior.

### IR Development Timeline
The development timeline of Information Retrieval (IR) can be traced back to the early 1900s when books were indexed manually using a controlled vocabulary. A brief overview of the major developments in IR follows:

1. Manual indexing: In the early days of IR, documents were manually indexed using controlled vocabularies like thesauri. This approach was time-consuming and expensive.

2. Boolean retrieval: In the 1960s, Boolean retrieval was introduced, which enabled the use of Boolean logic in searching for documents. This method is based on the use of AND, OR, and NOT operators.

3. Probabilistic retrieval: In the 1970s, probabilistic retrieval models were developed that used statistical methods to rank documents based on their relevance to a query. These models assigned a probability score to each document, and documents were ranked based on these scores.

4. Vector space model: In the 1970s, the vector space model was introduced (Salton, Wong, and Yang, 1975), which represented documents and queries as vectors in a high-dimensional space. This method used cosine similarity to rank documents based on their similarity to a query.

5. Web search engines: In the 1990s, web search engines like Yahoo and Google were developed, which made IR accessible to the general public. These search engines used a combination of techniques including crawling, indexing, and ranking to retrieve relevant results.

6. Semantic search: In recent years, there has been a focus on developing semantic search engines that can understand natural language queries and retrieve relevant results based on the meaning of the query rather than just keyword matching.

Across these stages, IR moved from manual, vocabulary-driven indexing toward statistical ranking and, most recently, meaning-based matching. Each step was driven by the growth of digital data and the need to search it efficiently.

### IR Models and Types

There are several models and types of Information Retrieval (IR), some of which are:

1. Boolean Model: It is one of the earliest and simplest models of IR. It is based on Boolean algebra and uses the operators AND, OR, and NOT to retrieve documents based on specific keywords.

2. Vector Space Model: It represents documents and queries as vectors in a high-dimensional space. The similarity between a query and a document is typically measured by the cosine of the angle between their vectors, so a smaller angle means a closer match.

3. Probabilistic Model: It treats relevance as uncertain and ranks documents by the estimated probability that each one is relevant to the query, given the terms they share.

4. Latent Semantic Analysis (LSA): It is a variant of the vector space model that identifies the underlying semantic relationships between terms and documents. It applies a truncated singular value decomposition to the term-document matrix, reducing the dimensionality of the space to a small number of latent semantic dimensions.

5. Neural IR Models: They use deep learning techniques to learn representations of queries and documents, which are then used to rank documents by relevance to a given query.

6. Cross-Language IR: It deals with the retrieval of documents in languages other than the query language.
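
To make the vector space model (item 2 in the list above) more concrete, here is a minimal sketch of TF-IDF weighting with cosine-similarity ranking. The whitespace tokenizer, the smoothed IDF formula, the function names, and the three toy documents are all assumptions made for illustration; they are not taken from the course material or from any particular library.

```python
import math
from collections import Counter


def tf_idf_vectors(tokenized_docs):
    """Return one sparse TF-IDF vector (term -> weight) per document, plus the IDF table."""
    n_docs = len(tokenized_docs)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))
    # Assumed IDF variant: log(N / df) + 1, so terms found in every document keep some weight
    idf = {term: math.log(n_docs / df) + 1.0 for term, df in doc_freq.items()}
    vectors = [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in tokenized_docs]
    return vectors, idf


def cosine(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank(query, docs):
    """Return (score, document index) pairs, best match first."""
    tokenized = [d.lower().split() for d in docs]
    vectors, idf = tf_idf_vectors(tokenized)
    # Query terms that never occur in the collection are ignored
    q_counts = Counter(query.lower().split())
    q_vec = {t: tf * idf[t] for t, tf in q_counts.items() if t in idf}
    scores = [(cosine(q_vec, vec), i) for i, vec in enumerate(vectors)]
    return sorted(scores, reverse=True)


# Hypothetical toy collection, for illustration only
toy_docs = [
    "boundary layer flow over a flat plate",
    "heat transfer in laminar boundary layer",
    "supersonic wing design",
]
ranking = rank("boundary layer heat", toy_docs)
for score, doc_index in ranking:
    print(f"{score:.3f}  {toy_docs[doc_index]}")
```

The second document shares all three query terms and ranks first, while the third shares none and scores zero. Production systems usually add stemming, stop-word removal, and an inverted index so that only documents containing a query term are scored.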

These models are used in various types of IR systems such as web search engines, question-answering systems, and digital libraries, among others.
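
As one concrete instance of the probabilistic family, the sketch below scores documents with the Okapi BM25 formula, which grew out of the probabilistic relevance framework. The parameter defaults `k1=1.5` and `b=0.75`, the non-negative IDF variant, the whitespace tokenizer, and the toy documents are assumptions chosen for illustration.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Return one BM25 score per document for the given query."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(d) for d in tokenized) / n_docs
    # Document frequency of each term across the collection
    doc_freq = Counter(term for doc in tokenized for term in set(doc))

    scores = []
    for doc in tokenized:
        tf = Counter(doc)
        # Longer-than-average documents are penalized; b controls how strongly
        length_norm = 1 - b + b * len(doc) / avg_len
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            df = doc_freq[term]
            # Rare terms get a higher weight; the +1 keeps the IDF non-negative
            idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)
            # Term frequency saturates: repeated occurrences add less and less
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * length_norm)
        scores.append(score)
    return scores


# Hypothetical toy collection, for illustration only
toy_docs = [
    "boundary layer flow over a flat plate",
    "heat transfer in laminar boundary layer",
    "supersonic wing design",
]
scores = bm25_scores("heat transfer", toy_docs)
print(scores)
```

Only the second document contains the query terms, so it is the only one with a non-zero score.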

### Implementing the Boolean IR Model

Here's an example Python script that implements the Boolean model using the Whoosh library on the Cranfield dataset, which is found at this [link](http://ir.dcs.gla.ac.uk/resources/test_collections/cran/).

```python
import os

from whoosh.fields import Schema, TEXT, ID
from whoosh.index import create_in
from whoosh.qparser import QueryParser

INDEX_DIR = "./assets/data/booleanSchema"
CRANFIELD_PATH = "./assets/data/cran/cran.all.1400"

# Define the schema for the index: a stored document number and the stored, searchable text
schema = Schema(documentNumber=ID(stored=True), text=TEXT(stored=True))

# Create the index directory (and any missing parent directories) if needed
os.makedirs(INDEX_DIR, exist_ok=True)

# Create a new, empty index with this schema in the directory
ix = create_in(INDEX_DIR, schema)

# Open a writer object for the index
writer = ix.writer()

# Parse the Cranfield dataset and add documents to the index
with open(CRANFIELD_PATH) as f:
    current_doc = ""
    current_text = ""

    for line in f:
        if line.startswith(".I"):
            # A new ".I <number>" record starts: flush the previous document first
            if current_doc and current_text:
                writer.add_document(documentNumber=current_doc, text=current_text)

            current_doc = line.strip().split()[-1]
            current_text = ""
        elif line.startswith(".T") or line.startswith(".W"):
            # Skip the title and body marker lines themselves
            pass
        else:
            # Everything else, including the .A and .B marker lines, is indexed as text
            current_text += line.strip() + " "

    # Flush the last document in the file
    if current_doc and current_text:
        writer.add_document(documentNumber=current_doc, text=current_text)

writer.commit()

# Open a searcher for the index; the context manager closes it when done
with ix.searcher() as searcher:
    # Define a query using the Whoosh QueryParser; OR is parsed as a Boolean operator
    query_parser = QueryParser("text", schema=ix.schema)
    query = query_parser.parse("Boolean OR model")

    # Execute the query and collect the matching document numbers
    # (search() returns the top 10 hits by default)
    results = [r["documentNumber"] for r in searcher.search(query)]

# Print the results
print(results)
```

['874', '800', '879', '358', '1164', '1340', '1091', '1090', '431', '1162']
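
The Whoosh example hides the set operations that define the Boolean model, and it returns ranked hits rather than a plain set of matches. As a rough illustration of the underlying idea, the sketch below builds an inverted index in plain Python and evaluates AND, OR, and NOT as set intersection, union, and difference. The helper names, the whitespace tokenizer, and the three-document toy collection are hypothetical and are not part of the Cranfield data.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def postings(index, term):
    """Return the posting set for a term (empty if the term is unseen)."""
    return index.get(term.lower(), set())


# Hypothetical toy collection, for illustration only
toy_docs = {
    "1": "boolean retrieval model",
    "2": "vector space model",
    "3": "boolean algebra",
}
index = build_inverted_index(toy_docs)
all_ids = set(toy_docs)

# AND: documents containing both terms (set intersection)
and_hits = postings(index, "boolean") & postings(index, "model")
# OR: documents containing either term (set union)
or_hits = postings(index, "boolean") | postings(index, "model")
# NOT: documents that do not contain the term (complement against all IDs)
not_hits = all_ids - postings(index, "model")

print(sorted(and_hits), sorted(or_hits), sorted(not_hits))
```

A document either satisfies a Boolean expression or it does not, so the results here are unordered sets with no relevance score attached.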


---

## 2. QA Systems