A corpus (plural: corpora) is a large collection of text. It can range from a few paragraphs to entire books, or be a compilation of many texts. Its primary purpose is to serve as a dataset for linguistic research and NLP applications. In this discussion, we use a small corpus taken from a paragraph of the Wikipedia article on India to illustrate the concepts.

In [4]:
corpus = "India is a country in South Asia. It is the seventh-largest country by land area, the second-most populous country, and the most populous democracy in the world."
print(corpus)

India is a country in South Asia. It is the seventh-largest country by land area, the second-most populous country, and the most populous democracy in the world.


# Understanding Vocabulary

Vocabulary, in the context of NLP, refers to the set of unique words in a given corpus. Its size, the number of unique words, is usually counted after preprocessing steps such as removing stop words and special characters, and it is a critical factor in many text analysis tasks.
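
As a quick illustration of the difference between the number of words and the number of unique words, here is a minimal sketch on a made-up sentence. The sentence and the variable names are illustrative only and are not part of the corpus used in this discussion.

```python
# Hypothetical toy sentence, used only to illustrate the definition.
sentence = "the cat saw the other cat"

words = sentence.split()   # whitespace split is enough for this toy case
vocabulary = set(words)    # a set keeps each distinct word exactly once

print(len(words))        # total number of words
print(len(vocabulary))   # number of unique words (the vocabulary size)
```

Repeated words such as "the" and "cat" count once in the vocabulary, which is why the second number is smaller than the first.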



# Preprocessing the Corpus

Preprocessing is a crucial step in text analysis: it cleans and prepares the text so that it is suitable for further analysis. It typically involves the following steps:

*   Tokenization: Splitting the corpus into individual words or tokens.
*   Stop Words Removal: Removing common words that contribute little to the meaning of the text (e.g., "is", "the", "and").
*   Special Characters Removal: Removing punctuation marks and other non-alphabetic characters.
*   Converting to Lowercase: Converting all words to lowercase so that, for example, "Country" and "country" count as the same word. The walkthrough below lowercases words only when comparing them against the stop word list, so the resulting tokens keep their original case.

Let’s walk through the preprocessing steps with code examples.

First, we need to tokenize the corpus and remove stop words:

In [8]:
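import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# One-time setup: fetch the tokenizer models and the stop word list
# (newer NLTK releases may ask for 'punkt_tab' instead of 'punkt').
# nltk.download('punkt')
# nltk.download('stopwords')

# Tokenize the corpus into words and punctuation marks
tokens = word_tokenize(corpus)

# Remove stop words; compare in lowercase so that "It" matches "it"
stop_words = set(stopwords.words('english'))
filtered_tokens = [word for word in tokens if word.lower() not in stop_words]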

print(filtered_tokens)

['India', 'country', 'South', 'Asia', '.', 'seventh-largest', 'country', 'land', 'area', ',', 'second-most', 'populous', 'country', ',', 'populous', 'democracy', 'world', '.']


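# Removing Special Characters

Next, we keep only the tokens that consist entirely of letters. Note that isalpha() rejects a token if it contains any non-letter character, so the hyphenated words "seventh-largest" and "second-most" are dropped along with the punctuation marks.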

In [9]:
filtered_tokens = [word for word in filtered_tokens if word.isalpha()]
print(filtered_tokens)

['India', 'country', 'South', 'Asia', 'country', 'land', 'area', 'populous', 'country', 'populous', 'democracy', 'world']
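
If hyphenated words should stay in the vocabulary, one option is to replace the isalpha() check with a pattern that also accepts hyphens between letters. The sketch below is an alternative, not the method used in this walkthrough; the pattern and the names WORD_PATTERN and kept_tokens are assumptions made for illustration.

```python
import re

# Tokens as they look after stop word removal (copied from the output above).
filtered_tokens = ['India', 'country', 'South', 'Asia', '.', 'seventh-largest',
                   'country', 'land', 'area', ',', 'second-most', 'populous',
                   'country', ',', 'populous', 'democracy', 'world', '.']

# Assumed rule: one or more letters, optionally joined by single hyphens.
WORD_PATTERN = re.compile(r"[A-Za-z]+(?:-[A-Za-z]+)*")

# fullmatch requires the whole token to fit the pattern, so "." and "," fail.
kept_tokens = [word for word in filtered_tokens if WORD_PATTERN.fullmatch(word)]
print(kept_tokens)
```

With this filter the punctuation tokens are still removed, but "seventh-largest" and "second-most" survive, so the vocabulary computed from kept_tokens would be larger than the one computed in the walkthrough.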


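# Calculating Vocabulary Size

Finally, converting the token list to a set removes the duplicates; the number of elements in that set is the vocabulary size.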

In [10]:
unique_words = set(filtered_tokens)
vocab_size = len(unique_words)
print(f"Vocabulary Size: {vocab_size}")

Vocabulary Size: 9
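
The steps above, including the lowercase conversion, can be combined into a single function. The following is a minimal sketch that uses only the standard library: the regular expression is a simplified stand-in for word_tokenize, STOP_WORDS is a small hand-picked list rather than NLTK's full English list, and build_vocabulary is a hypothetical helper name.

```python
import re

# Hand-picked stop words for this sketch only; NLTK's English list is much longer.
STOP_WORDS = {"is", "a", "in", "it", "the", "by", "and", "most"}

def build_vocabulary(text, stop_words=STOP_WORDS):
    """Return the unique lowercase alphabetic words in text, minus stop words."""
    # Lowercase first so that case variants collapse into one word, then
    # split into runs of word characters and hyphens (simplified tokenizer).
    tokens = re.findall(r"[\w-]+", text.lower())
    # Keep purely alphabetic tokens that are not stop words.
    return {t for t in tokens if t.isalpha() and t not in stop_words}

corpus = ("India is a country in South Asia. It is the seventh-largest country "
          "by land area, the second-most populous country, and the most "
          "populous democracy in the world.")

vocabulary = build_vocabulary(corpus)
print(sorted(vocabulary))
print(f"Vocabulary Size: {len(vocabulary)}")
```

For this corpus the lowercase vocabulary has the same size as the one computed above, because no word appears in more than one casing. On longer texts, where a word can appear both capitalized at the start of a sentence and in lowercase elsewhere, lowercasing would typically reduce the count.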
