In [6]:
import string
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer

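# Download required NLTK data
nltk.download('wordnet')    # not used below; only needed for WordNet-based lemmatization
nltk.download('punkt')      # tokenizer models for word_tokenize (newer NLTK releases may require 'punkt_tab' instead)
nltk.download('stopwords')  # stopword lists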

# Initialize Porter Stemmer
ps = PorterStemmer()

# Read text from file (explicit UTF-8 so curly quotes and dashes decode correctly)
text = ""
with open("IR_text.txt", encoding="utf-8") as file:
    for line in file:
        text += line
print("Original Text:\n", text)

# Tokenize the text
word_token = word_tokenize(text)
print("\nTokenized Words:\n", word_token)

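# Remove punctuation
PUNCT = set(string.punctuation)

def remove(words):
    # Drop tokens made up entirely of ASCII punctuation (e.g. ",", "(", "``", "''")
    cleaned_words = [word for word in words if not all(ch in PUNCT for ch in word)]
    return cleaned_words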

clean = remove(word_token)
print("\nAfter Removing Punctuation:\n", clean)

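# Remove stopwords
swords = set(stopwords.words("english"))  # a set gives fast membership tests
def remove_stopwords(clean_words):
    # Compare in lowercase so "The" and "the" are treated alike
    without_stop = [word for word in clean_words if word.lower() not in swords]
    return without_stop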

removed = remove_stopwords(clean)
print("\nAfter Removing Stopwords:\n", removed)

# Apply stemming
def stemming(cleaned):
    stemmed_words = [ps.stem(word) for word in cleaned]
    return stemmed_words

stemmed = stemming(removed)
print("\nAfter Stemming:\n", stemmed)


Original Text:
 Information Retrieval (IR) is the process of extracting relevant information from vast collections of unstructured or semi-structured data, such as documents, web pages, or databases, to satisfy a user’s information need. It involves several key steps, starting with document representation, where raw text is preprocessed (through tokenization, stemming, lemmatization, and stop-word removal) to ensure consistency in indexing and retrieval. Next, indexing organizes data to allow efficient searching, often using inverted indexes that map terms to their occurrences across documents. Query processing transforms and refines user queries to enhance search relevance, incorporating techniques like query expansion, synonym recognition, and spelling correction. Finally, the matching and ranking stage scores and ranks documents based on relevance to the query, typically using algorithms like TF-IDF or modern deep learning models, ensuring that the most relevant results appear a

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Lenovo\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Lenovo\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Lenovo\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Here's a summary of each step in the code:

1. **Reading the Text:**
   - The program reads the content of a file called `IR_text.txt` line by line and combines all the lines into one string.
   - **Purpose:** This step loads the raw text data for further processing.

2. **Tokenizing the Text:**
   - The `word_tokenize()` function from NLTK splits the text into tokens: individual words, with punctuation marks emitted as separate tokens.
   - **Purpose:** Tokenization breaks text into smaller units that are easier to analyze.

3. **Removing Punctuation:**
   - A function `remove()` filters out punctuation tokens using Python's `string.punctuation`, which covers ASCII punctuation only, so characters such as curly quotes or long dashes are not removed.
   - **Purpose:** To clean the token list by removing punctuation, leaving only words for analysis.

4. **Removing Stopwords:**
   - The program uses NLTK's English stopword list (common, low-information words such as "the" and "is") and removes matching tokens, comparing in lowercase so that capitalized forms are caught too.
   - **Purpose:** Stopwords are often removed because they carry little meaning on their own and would otherwise dominate term counts.

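5. **Stemming:**
   - The `PorterStemmer` reduces words to their stems by stripping common suffixes (e.g., "running" becomes "run").
   - **Purpose:** Stemming maps inflected forms to a common stem so that related words are grouped together (e.g., "running" and "runs" both become "run"). The stem need not be a dictionary word, and not every related form is merged: the Porter stemmer leaves "runner" unchanged.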
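
A note on step 1: if `IR_text.txt` is stored as UTF-8 but opened with a different default encoding (cp1252 is common on Windows), curly apostrophes and long dashes can turn into sequences such as `â€™`. The sketch below reproduces that effect on a short sample string and shows that decoding with the matching codec avoids it. The sample sentence and variable names are illustrative, not taken from the file.

```python
# Sketch: why the encoding passed to open() matters.
original = "a user\u2019s information need"  # contains a curly apostrophe (U+2019)

# How a UTF-8 file stores the text on disk.
utf8_bytes = original.encode("utf-8")

# Decoding UTF-8 bytes with the wrong codec produces mojibake.
wrong = utf8_bytes.decode("cp1252")

# Decoding with the matching codec round-trips cleanly.
right = utf8_bytes.decode("utf-8")

print(wrong)
print(right)
```

Passing `encoding="utf-8"` to `open()` has the same effect as the second decode, provided the file really is UTF-8.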

Each step cleans, transforms, and prepares the text for analysis by breaking it down into more manageable and meaningful components.
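
The steps can also be chained into a single function. The following is a minimal sketch that uses only the standard library so it runs without any NLTK downloads: `toy_tokenize` is a crude regex stand-in for `word_tokenize`, `TOY_STOPWORDS` is a hypothetical, very short stopword set, and stemming is left as an optional callable. None of these names come from the notebook, and the output will differ from what NLTK produces on real text.

```python
import re
import string

# Hypothetical, deliberately tiny stopword set; NLTK's English list is much longer.
TOY_STOPWORDS = {"the", "is", "of", "a", "to", "and", "or", "in"}
PUNCT = set(string.punctuation)

def toy_tokenize(text):
    # Crude stand-in for word_tokenize: runs of word characters,
    # or single non-space symbols such as "," and ".".
    return re.findall(r"\w+|[^\w\s]", text)

def drop_punctuation(tokens):
    # Remove tokens made up entirely of ASCII punctuation.
    return [t for t in tokens if not all(ch in PUNCT for ch in t)]

def drop_stopwords(tokens):
    # Case-insensitive stopword filter.
    return [t for t in tokens if t.lower() not in TOY_STOPWORDS]

def preprocess(text, stem=None):
    tokens = drop_stopwords(drop_punctuation(toy_tokenize(text)))
    # `stem` is optional: any callable mapping a word to its stem can be passed in.
    return [stem(t) for t in tokens] if stem else tokens

result = preprocess("Indexing organizes data, to allow efficient searching.")
print(result)
# ['Indexing', 'organizes', 'data', 'allow', 'efficient', 'searching']
```

To include the stemming step, a stemmer's method can be supplied as the `stem` argument, for example `preprocess(text, stem=PorterStemmer().stem)` when NLTK is installed.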