# NLTK Corpus

The NLTK (Natural Language Toolkit) corpus is a collection of text datasets provided by the NLTK library, which is widely used in natural language processing (NLP). These datasets enable developers to experiment with and train NLP models.

It includes:
- **Text Data**: Collections of books, news articles, chats, etc. (e.g., "Gutenberg Corpus," "Chat Corpus").
- **Annotated Corpora**: Tagged and structured text, like part-of-speech-tagged sentences (e.g., "Treebank").
- **Lexical Resources**: Wordlists and dictionaries (e.g., "WordNet").
- **Pretrained Models**: Statistical models for tasks like tokenization (e.g., the "Punkt" sentence tokenizer).

### Stop Words

Stop words are commonly used words in a language that are often excluded from processing because they carry little meaningful information on their own. They are so frequent that they can overshadow the more important terms in a text during analysis or model building.

Examples of stop words in English include:

+ **Articles**: a, an, the
+ **Conjunctions**: and, or, but
+ **Prepositions**: in, on, at
+ **Pronouns**: he, she, they
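
To illustrate how frequent words can overshadow the more informative ones, here is a minimal sketch using only the standard library. The sample text and the three-word `demo_stop_words` set are made up for illustration and are not NLTK's list.

```python
from collections import Counter

# Toy text and a tiny hand-written stop word set (illustrative assumptions)
text = "the cat sat on the mat and the dog sat on the rug"
demo_stop_words = {"the", "on", "and"}

tokens = text.split()

# Count every token, then count only the tokens that are not stop words
all_counts = Counter(tokens)
content_counts = Counter(t for t in tokens if t not in demo_stop_words)

print(all_counts.most_common(1))      # the most frequent token overall
print(content_counts.most_common(1))  # the most frequent non-stop-word token
```

Before filtering, the top token is "the"; after filtering, it is "sat", which says more about the content of the text.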

In NLTK, there is a built-in stopwords corpus that contains predefined lists of stop words for different languages. These lists can be used to filter out common words when processing text data.

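To use this corpus, first download it with `nltk.download('stopwords')` (commented out below; run it once on first use), then import it from `nltk.corpus`. The `word_tokenize` function used later also needs the Punkt tokenizer data (`punkt`, or `punkt_tab` in recent NLTK releases).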

In [5]:
import nltk
# nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
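
Since the download call in the cell above is commented out, one option is to download a resource only when it is missing. The following is a sketch, not part of the original notebook; the helper name `ensure_nltk_resource` is hypothetical, while `nltk.data.find` and `nltk.download` are existing NLTK functions.

```python
import nltk


def ensure_nltk_resource(resource_path, package):
    """Download `package` only if `resource_path` is not installed yet.

    Returns True when the resource is available afterwards.
    (Hypothetical helper, shown for illustration.)
    """
    try:
        # nltk.data.find raises LookupError when the resource is missing
        nltk.data.find(resource_path)
        return True
    except LookupError:
        # nltk.download returns True on success and False on failure
        return nltk.download(package, quiet=True)
```

For example, calling `ensure_nltk_resource("corpora/stopwords", "stopwords")` before the import would fetch the stop word lists only on the first run.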

In [6]:
stop_words = set(stopwords.words('english'))
print(stop_words)

{'t', "it's", 'their', 'some', "it'll", 'between', 'd', 'needn', 'o', "he'd", "we've", 'to', 'yourself', 'there', "won't", "couldn't", 'so', 'of', 'once', 'themselves', "we'd", 'won', 'her', 'not', "weren't", 'only', "we'll", 'any', 'but', 'how', 'theirs', 'about', 'me', 'what', 'him', 'hadn', 'by', "needn't", "you'd", 'doing', 'had', 'we', "doesn't", 'll', 'in', 'isn', 'she', "mustn't", "she's", 'aren', 'have', 'was', 'while', 'here', 'that', 'where', 'whom', "you've", 'ma', 'they', 'on', 'does', "he'll", 'because', "she'd", "they'd", 'weren', 'as', 'very', 'up', "hadn't", "they're", "that'll", 'has', "i'm", 'most', 'than', 'same', 'then', 'and', 'a', 'are', 'i', 'shan', 'out', 'wouldn', 'which', "didn't", 'its', 'having', 'those', 'own', 'more', 'when', 'further', 'ourselves', 'our', 'your', "aren't", "you're", 'until', 'herself', 'during', 'again', "shouldn't", 'will', 'through', 'couldn', "they'll", 'himself', 'yours', 'who', "you'll", 'before', 'over', "wouldn't", "haven't", 'can'

By removing stop words, we reduce the noise in the text data and focus on the words that carry more significant meaning.

In [7]:
sentence = "This is a sample sentence, showing off the stop words filtration."
words = word_tokenize(sentence)
# Keep only the tokens that are not in the stop word set (case-sensitive match)
filtered_sentence = [w for w in words if w not in stop_words]
print(filtered_sentence)

['This', 'sample', 'sentence', ',', 'showing', 'stop', 'words', 'filtration', '.']
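
In the output above, 'This' survives because the entries in NLTK's English list are lowercase and the membership test is case-sensitive; the punctuation tokens survive as well. A minimal sketch of a case-insensitive filter follows. The function name `remove_stop_words`, the hard-coded token list, and the tiny `demo_stop_words` set are illustrative assumptions chosen so the sketch runs without any NLTK data; in practice you would pass `set(stopwords.words('english'))` and the result of `word_tokenize`.

```python
def remove_stop_words(tokens, stop_words):
    """Drop stop words (case-insensitively) and punctuation-only tokens."""
    return [
        tok for tok in tokens
        if tok.lower() not in stop_words        # compare in lowercase
        and any(ch.isalnum() for ch in tok)     # skip tokens like ',' or '.'
    ]


# Hand-written stand-in for set(stopwords.words('english')) (an assumption)
demo_stop_words = {"this", "is", "a", "off", "the"}

# Tokens of the sample sentence, hard-coded so no tokenizer is required
tokens = ["This", "is", "a", "sample", "sentence", ",", "showing", "off",
          "the", "stop", "words", "filtration", "."]

filtered = remove_stop_words(tokens, demo_stop_words)
print(filtered)
```

Lowercasing only for the comparison keeps the original casing of the tokens that remain.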


In a similar manner to `stopwords`, other corpora can be downloaded with `nltk.download` and imported from the `nltk.corpus` package for use in various applications.

### WordNet

*WordNet* is a large lexical database of English. Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept. One can use it to find the synonyms, antonyms and definitions of words.

In [8]:
nltk.download("wordnet")

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\nagan\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [1]:
from nltk.corpus import wordnet
syns=wordnet.synsets("computer")
print(syns)

[Synset('computer.n.01'), Synset('calculator.n.01')]


Print the name of the first lemma in the first synset, along with that synset's definition:

In [3]:
print(syns[0].lemmas()[0].name())
print(syns[0].definition())

computer
a machine for performing calculations automatically


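Print example sentences for a synset. The first synset returned for "program" is `plan.n.01`, which is why the examples mention "plan":

In [7]: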
syns1=wordnet.synsets("program")
print(syns1[0].examples())

['they drew up a six-step plan', 'they discussed plans for a new bond issue']


Get the synonyms and antonyms of a word:

In [8]:
synonyms = []
antonyms = []

for syn in wordnet.synsets("good"):
    for lemma in syn.lemmas():
        synonyms.append(lemma.name())
        # Not every lemma has an antonym; keep the first one when present
        if lemma.antonyms():
            antonyms.append(lemma.antonyms()[0].name())

print(set(synonyms))
print(set(antonyms))


{'dear', 'honest', 'unspoilt', 'commodity', 'thoroughly', 'adept', 'good', 'safe', 'expert', 'salutary', 'well', 'practiced', 'serious', 'honorable', 'full', 'just', 'goodness', 'respectable', 'ripe', 'in_force', 'undecomposed', 'trade_good', 'secure', 'upright', 'unspoiled', 'in_effect', 'skillful', 'effective', 'right', 'dependable', 'beneficial', 'estimable', 'sound', 'near', 'proficient', 'skilful', 'soundly'}
{'ill', 'bad', 'badness', 'evil', 'evilness'}


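Show the semantic similarity between two synsets using Wu-Palmer similarity (`wup_similarity`), where a higher score means the concepts sit closer together in the WordNet hierarchy: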

In [9]:
w1=wordnet.synset("ship.n.01")
w2=wordnet.synset("boat.n.01")
print(w1.wup_similarity(w2))

0.9090909090909091
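
For reference, the Wu-Palmer score is commonly defined in terms of depths in the taxonomy. The formula below is a sketch of that usual definition, where $\mathrm{lcs}(s_1, s_2)$ denotes the least common subsumer (the deepest ancestor shared by both synsets); NLTK's implementation may differ in details such as how depth is counted.

$$
\mathrm{wup}(s_1, s_2) = \frac{2 \cdot \mathrm{depth}\big(\mathrm{lcs}(s_1, s_2)\big)}{\mathrm{depth}(s_1) + \mathrm{depth}(s_2)}
$$

Two synsets that share a deep common ancestor therefore score close to 1, which is consistent with the high value above for "ship" and "boat".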


### Sources

+ [sentdex-corpora](https://youtu.be/TKAXDqoG2dc)
+ [sentdex-Wordnet](https://youtu.be/T68P5-8tM-Y)
+ [Princeton.edu](https://wordnet.princeton.edu/)