**Practical 10**

**Aim: Advanced Topics in Information Retrieval**
*   Implement a text summarization algorithm (e.g., extractive or abstractive).

*   Build a question-answering system using techniques such as information extraction.

In [2]:
print("T074 Kermeen")
# Import required libraries
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize

# Download NLTK resources (run once)
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('punkt_tab') # Added to resolve LookupError

# Input text
text = """
Information Retrieval is the process of obtaining relevant information
from a large collection of data. It plays an important role in search engines.
Text summarization helps in reducing the size of documents while preserving
important information. Automatic summarization is widely used in news
applications and research domains.
"""

# Sentence tokenization
sentences = sent_tokenize(text)

# Remove stop words and calculate word frequencies
stop_words = set(stopwords.words('english'))
word_frequencies = {}

for word in word_tokenize(text.lower()):
    # Count only alphanumeric tokens that are not stop words
    if word.isalnum() and word not in stop_words:
        word_frequencies[word] = word_frequencies.get(word, 0) + 1

# Normalize word frequencies
max_frequency = max(word_frequencies.values())
for word in word_frequencies:
    word_frequencies[word] /= max_frequency

# Score each sentence by summing the normalized frequencies of its words
sentence_scores = {}
for sentence in sentences:
    for word in word_tokenize(sentence.lower()):
        if word in word_frequencies:
            sentence_scores[sentence] = (
                sentence_scores.get(sentence, 0) + word_frequencies[word]
            )

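# Select the top 2 sentences by score
summary_sentences = sorted(
    sentence_scores,
    key=sentence_scores.get,
    reverse=True
)[:2]

# Generate summary, restoring the original document order of the selected sentences
summary_sentences.sort(key=sentences.index)
summary = ' '.join(summary_sentences)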
print("Summary:\n", summary)

T074 Kermeen


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


Summary:
 
Information Retrieval is the process of obtaining relevant information
from a large collection of data. Text summarization helps in reducing the size of documents while preserving
important information.
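
The frequency-based scoring used in the cell above can also be sketched as a reusable function. This is a minimal illustration under stated assumptions, not the notebook's implementation: it uses only the standard library, a naive regex sentence splitter in place of NLTK's `sent_tokenize`, and a tiny hand-written stop-word list (`STOP_WORDS`, hypothetical) in place of NLTK's English list. The names `summarize` and `SAMPLE_TEXT` are introduced here for illustration only. Selected sentences are returned in their original document order.

```python
import re
from collections import Counter

# Hypothetical, deliberately small stop-word list (NLTK's English list is much larger)
STOP_WORDS = {"a", "an", "and", "from", "in", "is", "it", "of", "the", "while"}

SAMPLE_TEXT = """
Information Retrieval is the process of obtaining relevant information
from a large collection of data. It plays an important role in search engines.
Text summarization helps in reducing the size of documents while preserving
important information. Automatic summarization is widely used in news
applications and research domains.
"""


def content_words(sentence):
    """Lowercase alphanumeric tokens of a sentence, minus stop words."""
    return [w for w in re.findall(r"[a-z0-9]+", sentence.lower())
            if w not in STOP_WORDS]


def summarize(text, n=2):
    """Return the n highest-scoring sentences, joined in document order."""
    # Naive split: break after '.', '!' or '?' followed by whitespace,
    # then collapse internal line breaks into single spaces
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    sentences = [" ".join(p.split()) for p in parts if p.strip()]

    # Word frequencies over the whole text
    freq = Counter(w for s in sentences for w in content_words(s))
    if not freq:
        return ""
    top = max(freq.values())

    # Sentence score = sum of normalized frequencies of its content words
    scores = [sum(freq[w] / top for w in content_words(s)) for s in sentences]

    # Pick the n best indices, then restore document order
    best = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:n]
    return " ".join(sentences[i] for i in sorted(best))


print("Summary:", summarize(SAMPLE_TEXT, 2))
```

Because the splitter and stop-word list are simplified, scores on other inputs may differ from those produced by the NLTK-based cell.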


**Question-Answering System: Python Code**

In [3]:
print("T074 Kermeen")
# Import required libraries
import spacy
from nltk.tokenize import sent_tokenize  # relies on the 'punkt' data downloaded in the previous cell

# Load spaCy language model
nlp = spacy.load("en_core_web_sm")

# Input document
text = """
Information Retrieval deals with the storage and retrieval of information.
Search engines use IR techniques to retrieve relevant documents.
Natural Language Processing helps computers understand human language.
"""

# Sentence tokenization
sentences = sent_tokenize(text)

# Accept user question
question = input("Enter your question: ")

# Extract keywords from question
question_doc = nlp(question)
keywords = [
    token.text.lower()
    for token in question_doc
    if not token.is_stop and not token.is_punct
]

# Find most relevant sentence
best_sentence = ""
max_score = 0

for sentence in sentences:
    score = 0
    sentence_doc = nlp(sentence.lower())

    for token in sentence_doc:
        if token.text in keywords:
            score += 1

    if score > max_score:
        max_score = score
        best_sentence = sentence

# Display answer
if best_sentence:
    print("Answer:", best_sentence)
else:
    print("Answer not found in the document.")

T074 Kermeen
Enter your question: What is information retrieval
Answer: 
Information Retrieval deals with the storage and retrieval of information.
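
The keyword-overlap matching used in the cell above can be sketched without spaCy or interactive input, which makes it easier to test. This is a simplified illustration, not the notebook's method: it assumes a regex tokenizer, a small hypothetical stop-word list (`QUESTION_STOP_WORDS`), and it counts distinct shared keywords per sentence, whereas the cell above counts every matching token. The names `answer` and `DOCUMENT` are introduced here for illustration only.

```python
import re

# Hypothetical stop-word list for questions (spaCy's built-in list is larger)
QUESTION_STOP_WORDS = {
    "a", "an", "and", "are", "do", "does", "how", "is", "of",
    "the", "to", "what", "with",
}

DOCUMENT = """
Information Retrieval deals with the storage and retrieval of information.
Search engines use IR techniques to retrieve relevant documents.
Natural Language Processing helps computers understand human language.
"""


def words(text):
    """Set of lowercase alphanumeric tokens in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def answer(question, document):
    """Return the sentence sharing the most keywords with the question, or None."""
    # Naive split: break after '.', '!' or '?' followed by whitespace
    parts = re.split(r"(?<=[.!?])\s+", document.strip())
    sentences = [p.strip() for p in parts if p.strip()]

    keywords = words(question) - QUESTION_STOP_WORDS

    best_sentence, best_score = None, 0
    for sentence in sentences:
        score = len(keywords & words(sentence))
        # Strict '>' keeps the earliest sentence when scores tie
        if score > best_score:
            best_sentence, best_score = sentence, score
    return best_sentence


result = answer("What is information retrieval", DOCUMENT)
print("Answer:", result if result else "Answer not found in the document.")
```

Exact string matching means that "retrieve" does not match "retrieval"; comparing lemmas or stems instead would be one way to loosen this, at the cost of an extra dependency.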
