# Working with texts

There are several ways to analyze the context of a text besides simply reading it.

In [1]:
from nltk.book import *

*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908


Let's analyze the context of a word:

In [3]:
text1.concordance("monstrous")

Displaying 11 of 11 matches:
ong the former , one was of a most monstrous size . ... This came towards us , 
ON OF THE PSALMS . " Touching that monstrous bulk of the whale or ork we have r
ll over with a heathenish array of monstrous clubs and spears . Some were thick
d as you gazed , and wondered what monstrous cannibal and savage could ever hav
that has survived the flood ; most monstrous and most mountainous ! That Himmal
they might scout at Moby Dick as a monstrous fable , or still worse and more de
th of Radney .'" CHAPTER 55 Of the Monstrous Pictures of Whales . I shall ere l
ing Scenes . In connexion with the monstrous pictures of whales , I am strongly
ere to enter upon those still more monstrous stories of them which are to be fo
ght have been rummaged out of this monstrous cabinet there is no telling . But 
of Whale - Bones ; for Whales of a monstrous size are oftentimes cast up dead u


In [4]:
text2.concordance("monstrous")

Displaying 11 of 11 matches:
. " Now , Palmer , you shall see a monstrous pretty girl ." He immediately went
your sister is to marry him . I am monstrous glad of it , for then I shall have
ou may tell your sister . She is a monstrous lucky girl to get him , upon my ho
k how you will like them . Lucy is monstrous pretty , and so good humoured and 
 Jennings , " I am sure I shall be monstrous glad of Miss Marianne ' s company 
 usual noisy cheerfulness , " I am monstrous glad to see you -- sorry I could n
t however , as it turns out , I am monstrous glad there was never any thing in 
so scornfully ! for they say he is monstrous fond of her , as well he may . I s
possible that she should ." " I am monstrous glad of it . Good gracious ! I hav
thing of the kind . So then he was monstrous happy , and talked on some time ab
e very genteel people . He makes a monstrous deal of money , and they keep thei


The same words have different contexts in different texts.
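
To make the idea of a concordance concrete, here is a minimal pure-Python sketch of a keyword-in-context view. It is not NLTK's implementation; the helper name `concordance`, the `width` parameter, and the toy sentence are assumptions made for illustration.

```python
def concordance(tokens, word, width=3):
    """Return each occurrence of `word` with up to `width` tokens of context per side."""
    target = word.lower()
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == target:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{tok}] {right}".strip())
    return lines

# Toy token list (hypothetical), loosely modelled on the Austen matches above.
tokens = "I am monstrous glad of it , for she is a monstrous lucky girl".split()
for line in concordance(tokens, "monstrous"):
    print(line)
```

Each printed line shows the target word in brackets with its neighbours on both sides, which is the same information the `concordance()` output arranges in fixed-width columns.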

We can also find the words that appear in contexts similar to those of a given word:

In [5]:
text1.similar("monstrous")
text2.similar("monstrous")

true contemptible christian abundant few part mean careful puzzled
mystifying passing curious loving wise doleful gamesome singular
delightfully perilous fearless
very so exceedingly heartily a as good great extremely remarkably
sweet vast amazingly


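The contexts in question *differ*: in Moby Dick the word behaves like a descriptive adjective, while in Sense and Sensibility it behaves like an intensifier such as "very". We extract this information with just a few lines of code.

Furthermore, we can find the contexts that two or more words share in a given text. For this we use **common_contexts()**: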

In [9]:
text1.common_contexts(["monstrous", "very"])
text2.common_contexts(["monstrous", "very"])

No common contexts were found
am_glad a_pretty a_lucky is_pretty be_glad


## Occurrence of words in a text

Word occurrence counts support many tasks, such as word trend analysis, finding associations between terms, and preparing data for training algorithms.

First we must find out the length of the text.

In [12]:
print(len(text3))
# Build the vocabulary: the distinct tokens of the text, sorted alphabetically.
tokens = sorted(set(text3))

44764


We can use the FreqDist() object to **count the frequencies** of tokens:

In [20]:
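# Note: `tokens` holds each distinct item once, so every count below is 1.
# Pass `text3` itself to FreqDist to get real frequencies.
fdist = FreqDist(tokens)
print(fdist)
# Return the 20 most frequent tokens.
most_common = fdist.most_common(20)
print(most_common)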

<FreqDist with 2789 samples and 2789 outcomes>
[('!', 1), ("'", 1), ('(', 1), (')', 1), (',', 1), (',)', 1), ('.', 1), ('.)', 1), (':', 1), (';', 1), (';)', 1), ('?', 1), ('?)', 1), ('A', 1), ('Abel', 1), ('Abelmizraim', 1), ('Abidah', 1), ('Abide', 1), ('Abimael', 1), ('Abimelech', 1)]


Note that although the text contains 44,764 tokens, only 2,789 of them are distinct.
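
The cell above passes the de-duplicated `tokens` to `FreqDist`, which is why every count is 1. The sketch below contrasts counting the vocabulary with counting the full token sequence. It uses `collections.Counter` (which `nltk.FreqDist` subclasses) and a made-up token list, so that it runs without any corpus data.

```python
from collections import Counter

# Toy token list (hypothetical).
tokens = ["and", "the", "and", "God", "said", "and", "the", "light"]

# Counting the de-duplicated vocabulary: every item appears exactly once.
vocab_counts = Counter(sorted(set(tokens)))

# Counting the full token sequence: real frequencies.
token_counts = Counter(tokens)

print(vocab_counts.most_common(2))
print(token_counts.most_common(2))
```

With NLTK, the equivalent of the second call would be `FreqDist(text3)`.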

Another useful measure is the lexical richness of the text, the ratio of distinct tokens to total tokens:

In [23]:
def lexical_diversity(text):
    return len(set(text)) / len(text)

print(lexical_diversity(text3))

0.06230453042623537
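
The score above treats punctuation marks and differently cased forms ("And", "and") as separate items. As a hedged sketch, the ratio can also be computed after lowercasing and dropping non-alphabetic tokens; the `normalized` helper and the sample tokens are assumptions, and whether to normalize depends on the analysis.

```python
def lexical_diversity(tokens):
    """Ratio of distinct tokens to total tokens."""
    return len(set(tokens)) / len(tokens)

def normalized(tokens):
    """Lowercase the tokens and keep only alphabetic ones."""
    return [t.lower() for t in tokens if t.isalpha()]

# Toy token list (hypothetical).
tokens = ["And", "God", "said", ",", "Let", "there", "be", "light", ":",
          "and", "there", "was", "light", "."]

print(lexical_diversity(tokens))              # raw tokens
print(lexical_diversity(normalized(tokens)))  # case-folded, punctuation removed
```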


We can also measure the **percentage of occurrence** of a given word, that is, the share of the text's tokens it accounts for:

In [24]:
def percentage(count, total):
    return 100 * count / total

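# `fdist` was built from the unique tokens, so fdist["smote"] is 1 here;
# text3.count("smote") would give the word's true count.
print(percentage(fdist["smote"], len(text3)))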

0.002233937985881512
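
For a percentage that reflects how often a word really occurs, the count has to come from the full token sequence. A minimal sketch on a made-up token list, using the built-in `list.count`:

```python
def percentage(count, total):
    """Express `count` as a percentage of `total`."""
    return 100 * count / total

# Toy token list (hypothetical).
tokens = "and he smote them and they fled and he smote the city".split()

count = tokens.count("smote")  # true number of occurrences
print(f"{percentage(count, len(tokens)):.2f}%")
```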


## corpus.gutenberg.fileids() in NLTK

The `corpus.gutenberg.fileids()` method is part of the Natural Language Toolkit (NLTK) library in Python. It returns the list of text file identifiers (file IDs) in the Gutenberg corpus, a small selection of literary texts from Project Gutenberg that is included with NLTK.

### Purpose

This function is used to list all the file names (as strings) of the texts available in the gutenberg corpus. These file IDs can then be used to load and analyze specific texts.

In [26]:
import nltk

nltk.corpus.gutenberg.fileids()

['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt',
 'bible-kjv.txt', 'blake-poems.txt', 'bryant-stories.txt',
 'burgess-busterbrown.txt', 'carroll-alice.txt', 'chesterton-ball.txt',
 'chesterton-brown.txt', 'chesterton-thursday.txt', 'edgeworth-parents.txt',
 'melville-moby_dick.txt', 'milton-paradise.txt', 'shakespeare-caesar.txt',
 'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt', 'whitman-leaves.txt']


Now I will choose the first text from these options.

In [27]:
emma = nltk.corpus.gutenberg.words('austen-emma.txt')

len(emma)

192427

We don't need to type such long names all the time. Python provides another version of the import statement, as follows:

In [29]:
from nltk.corpus import gutenberg

gutenberg.fileids()

['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', ...]

Now let's write a short program that loops over every `fileid` returned by `gutenberg.fileids()` and computes a few statistics for each text.

In [40]:
from nltk.corpus import gutenberg

for fileid in gutenberg.fileids():
    num_chars = len(gutenberg.raw(fileid))    # characters in the raw text
    num_words = len(gutenberg.words(fileid))  # tokens (words and punctuation)
    num_sents = len(gutenberg.sents(fileid))  # sentences
    num_vocab = len(set(w.lower() for w in gutenberg.words(fileid)))  # distinct lowercased tokens

    print(f"{fileid}")
    print(f" - Characters: {num_chars}")
    print(f" - Words:      {num_words}")
    print(f" - Sentences:  {num_sents}")
    print(f" - Unique vocabulary: {num_vocab}")
    print("-" * 40)



austen-emma.txt
 - Characters: 887071
 - Words:      192427
 - Sentences:  7752
 - Unique vocabulary: 7344
----------------------------------------
austen-persuasion.txt
 - Characters: 466292
 - Words:      98171
 - Sentences:  3747
 - Unique vocabulary: 5835
----------------------------------------
austen-sense.txt
 - Characters: 673022
 - Words:      141576
 - Sentences:  4999
 - Unique vocabulary: 6403
----------------------------------------
bible-kjv.txt
 - Characters: 4332554
 - Words:      1010654
 - Sentences:  30103
 - Unique vocabulary: 12767
----------------------------------------
blake-poems.txt
 - Characters: 38153
 - Words:      8354
 - Sentences:  438
 - Unique vocabulary: 1535
----------------------------------------
bryant-stories.txt
 - Characters: 249439
 - Words:      55563
 - Sentences:  2863
 - Unique vocabulary: 3940
----------------------------------------
burgess-busterbrown.txt
 - Characters: 84663
 - Words:      18963
 - Sentences:  1054
 - Unique vocabulary
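
The raw counts become easier to compare across texts once they are turned into ratios. The sketch below derives three of them from the same four quantities; the helper name `text_stats`, its dictionary keys, and the two-sentence sample are assumptions made so the example runs without corpus data.

```python
def text_stats(raw, words, sents):
    """Derive per-text ratios from the raw string, token list, and sentence list."""
    vocab = {w.lower() for w in words}
    return {
        # Characters per token (whitespace included in the character count).
        "avg_word_len": len(raw) / len(words),
        # Tokens per sentence.
        "avg_sent_len": len(words) / len(sents),
        # How many times each vocabulary item is used on average.
        "avg_word_reuse": len(words) / len(vocab),
    }

# Toy sample (hypothetical), shaped like the output of raw(), sents(), and words().
raw = "The cat sat. The cat ran."
sents = [["The", "cat", "sat", "."], ["The", "cat", "ran", "."]]
words = [w for s in sents for w in s]

stats = text_stats(raw, words, sents)
print(stats)
```

With a real corpus, the arguments would be `gutenberg.raw(fileid)`, `gutenberg.words(fileid)`, and `gutenberg.sents(fileid)`. Plugging in the Emma figures from the output gives roughly 4.6 characters per token, 24.8 tokens per sentence, and each vocabulary item used about 26 times.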

## What is a Corpus in NLTK?
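In the context of NLTK (Natural Language Toolkit), a corpus is a large and structured collection of texts that can be used for various natural language processing (NLP) tasks such as tokenization, tagging, parsing, and language modeling. Corpora are not specific to NLTK: the cell below loads the 20 Newsgroups corpus through scikit-learn's `fetch_20newsgroups`.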

In [44]:
from sklearn.datasets import fetch_20newsgroups

# Download the training data
newsgroups_train = fetch_20newsgroups(subset='train')

# Download the test data
newsgroups_test = fetch_20newsgroups(subset='test')

# You can then access the data
print(newsgroups_train.filenames.shape)
print(newsgroups_train.target_names)
print(newsgroups_train.data[0])

(11314,)
['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']
From: lerxst@wam.umd.edu (where's my thing)
Subject: WHAT car is this!?
Nntp-Posting-Host: rac3.wam.umd.edu
Organization: University of Maryland, College Park
Lines: 15

 I was wondering if anyone out there could enlighten me on this car I saw
the other day. It was a 2-door sports car, looked to be from the late 60s/
early 70s. It was called a Bricklin. The doors were really small. In addition,
the front bumper was separate from the rest of the body. This is 
all I know. If anyone can tellme a model name, engine specs, years
of production, where this car is made, history, or whatever info y

## Meaning of each part:

(11314,)
This is the shape of the `filenames` array: the training subset contains 11,314 documents. It is not the ID of a single document.

['alt.atheism', 'comp.graphics', ..., 'talk.religion.misc']
These are the 20 newsgroups (topics/categories) in the dataset.

Each post belongs to exactly one of them. This one is about a car, so `rec.autos` is the likely label; the actual label can be looked up with `newsgroups_train.target_names[newsgroups_train.target[0]]`.

From: lerxst@wam.umd.edu (where's my thing)
The author of the post (username: lerxst) from the University of Maryland.

Subject: WHAT car is this!?
The title of the post: the author is asking about an unknown car.

Nntp-Posting-Host: rac3.wam.umd.edu
The server/host that posted it, part of the Usenet infrastructure.

Organization: University of Maryland, College Park
The user's organization or institution.

Lines: 15
Number of text lines in the message body.
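
Because each post is a plain-text message with headers, a blank line, and a body, the fields described above can be pulled out programmatically. The following is a small sketch using Python's standard `email` module on an abbreviated copy of the post; with the real dataset the input string would be `newsgroups_train.data[0]`.

```python
from email import message_from_string

# Abbreviated copy of the post shown above (body shortened for illustration).
post = (
    "From: lerxst@wam.umd.edu (where's my thing)\n"
    "Subject: WHAT car is this!?\n"
    "Nntp-Posting-Host: rac3.wam.umd.edu\n"
    "Organization: University of Maryland, College Park\n"
    "Lines: 15\n"
    "\n"
    " I was wondering if anyone out there could enlighten me on this car I saw\n"
    "the other day.\n"
)

msg = message_from_string(post)

print(msg["From"])          # author header
print(msg["Subject"])       # subject header
print(int(msg["Lines"]))    # declared number of body lines
print(msg.get_payload())    # message body, everything after the blank line
```

Separating headers from the body this way is often a first preprocessing step, since header fields can leak the category to a classifier.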

