<h1>2- Accessing Text Corpora</h1>

<h3>Gutenberg Corpus</h3>

<p>NLTK includes a small selection of texts from the Project Gutenberg electronic text archive, which contains some 25,000 free electronic books, hosted at http://www.gutenberg.org/. We begin by getting the Python interpreter to load the NLTK package, then ask to see nltk.corpus.gutenberg.fileids(), the file identifiers in this corpus:</p>

In [None]:
import matplotlib.pyplot as plt

In [None]:
%matplotlib notebook

In [None]:
import nltk

In [None]:
nltk.corpus.gutenberg.fileids()

In [None]:
from nltk.corpus import gutenberg
emma = gutenberg.words('austen-emma.txt')
print(emma[:20])
print(len(emma), len(set(emma)))

In [None]:
len(emma)/len(set(emma))

In [None]:
for s in gutenberg.sents("shakespeare-macbeth.txt")[10:20]:
    print(s)

In [None]:
emma

In [None]:
emma = nltk.Text(emma)
emma.concordance("surprize", lines=8)

In [None]:
import string
print(string.punctuation)

In [None]:
s = "Mr. Obama has explained that. Fifa world cup (year 2022)"
s

In [None]:
[nltk.word_tokenize(s) for s in nltk.sent_tokenize(s)]

<h4><i>Join function:</i></h4>

In [None]:
def join_funct(sentence):
    """Join a list of tokens with single spaces."""
    return ' '.join(sentence)

# Find the longest sentence (by token count) in Macbeth and print it
macbeth_sents = gutenberg.sents("shakespeare-macbeth.txt")
longest_se = max(len(s) for s in macbeth_sents)
longest_sent = [s for s in macbeth_sents if len(s) == longest_se]
print(join_funct(longest_sent[0]))

In [None]:
import re
# Remove the space before closing punctuation (raw strings avoid invalid-escape warnings)
re.sub(r" ([,.;):])", r"\1", "Mr. seguin ( propriétaire de la chèvre ) nous parle de lui ( Merci à lui ) .")

In [None]:
import re
def join_func(sentence):
    """Join tokens with spaces, then re-attach punctuation to its neighbours."""
    sentence = ' '.join(sentence)                       # join normally
    sentence = re.sub(r" ([,.;):])", r"\1", sentence)   # stick to left
    sentence = re.sub(r"([(]) ", r"\1", sentence)       # stick to right
    sentence = re.sub(r" (') ", r"\1", sentence)        # join both sides
    return sentence
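
As a quick check of what the three substitutions do, the sketch below applies the same rules to a hand-made token list, so it runs without any NLTK data. The tokens and the function name `detokenize` are assumptions made for illustration; the rules mirror `join_func` from the cell above.

```python
import re

def detokenize(tokens):
    """Join tokens with spaces, then re-attach punctuation (same rules as join_func)."""
    sentence = ' '.join(tokens)
    sentence = re.sub(r" ([,.;):])", r"\1", sentence)  # closing punctuation sticks to the left
    sentence = re.sub(r"([(]) ", r"\1", sentence)      # opening parenthesis sticks to the right
    sentence = re.sub(r" (') ", r"\1", sentence)       # apostrophe joins both sides
    return sentence

# Hand-made tokens, for illustration only
tokens = ['Macbeth', "'", 's', 'castle', '(', 'Inverness', ')', ',', 'at', 'night', '.']
detokenized = detokenize(tokens)
print(detokenized)
```

Note that these rules are deliberately simple: they do not handle opening quotation marks or hyphens, for instance.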

In [None]:
import re
from nltk.corpus import gutenberg
macbeth_sentences = gutenberg.sents('shakespeare-macbeth.txt')
longest_se = max([len(s) for s in macbeth_sentences])
longest_sen = [s for s in macbeth_sentences if len(s) == longest_se]
longest_sent = ' '.join(longest_sen[0])
print("First method \n", longest_sent, "\n Second method \n", join_func(longest_sen[0]), "\n")

In [None]:
for fileid in gutenberg.fileids():
    num_chars = len(gutenberg.raw(fileid))
    num_words = len(gutenberg.words(fileid))
    num_sents = len(gutenberg.sents(fileid))
    num_vocab = len(set([w.lower() for w in gutenberg.words(fileid)]))
    print(int(num_chars/num_words), int(num_words/num_sents), int(num_words/num_vocab), fileid)
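
The three numbers printed for each file are the average word length (characters per token), the average sentence length (tokens per sentence), and the lexical diversity (tokens per distinct lowercased word). A minimal sketch on a hand-made sample, with no NLTK download required; the function name `corpus_stats` and the sample text are assumptions made for illustration:

```python
def corpus_stats(raw, words, sents):
    """Return (avg word length, avg sentence length, lexical diversity), truncated to ints."""
    num_chars = len(raw)                          # counts spaces and punctuation too
    num_words = len(words)
    num_sents = len(sents)
    num_vocab = len({w.lower() for w in words})   # distinct word types, case-folded
    return (num_chars // num_words, num_words // num_sents, num_words // num_vocab)

# Hand-made sample standing in for gutenberg.raw / .words / .sents
raw = "The cat sat. The cat ran."
sents = [['The', 'cat', 'sat', '.'], ['The', 'cat', 'ran', '.']]
words = [w for s in sents for w in s]

stats = corpus_stats(raw, words, sents)
print(stats)
```

Because the character count includes spaces, the first figure slightly overstates the true average word length.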

<h3>Web and Chat Text</h3>

<p>Although Project Gutenberg contains thousands of books, it represents established
literature. It is important to consider less formal language as well. NLTK’s small collection
of web text includes content from a Firefox discussion forum, conversations
overheard in New York, the movie script of <i>Pirates of the Caribbean</i>, personal advertisements, and wine reviews:</p>

In [None]:
from nltk.corpus import webtext
for fileid in webtext.fileids():
    print(fileid, ":", webtext.raw(fileid)[:65], '...')

In [None]:
# fileid still holds the last identifier from the loop in the previous cell
print(webtext.raw(fileid)[:1000])

In [None]:
from nltk.corpus import nps_chat
chatroom = nps_chat.posts('10-19-20s_706posts.xml')
print(join_func(chatroom[100]))

<h3><font color="red">Brown Corpus</font></h3>

<p>The Brown Corpus was the first million-word electronic corpus of English, created in
1961 at Brown University. This corpus contains text from 500 sources, and the sources
have been categorized by genre, such as news, editorial, and so on. Table 2-1 gives an
    example of each genre (for a complete list, see <a>http://icame.uib.no/brown/bcm-los.html</a>).</p>

<p>We can access the corpus as a list of words or a list of sentences (where each sentence
is itself just a list of words). We can optionally specify particular categories or files to
read:</p>

In [None]:
from nltk.corpus import brown
brown.categories()

In [None]:
print(brown.words(categories='adventure')[:25])

In [None]:
print(brown.words(fileids=['cg22'])[:25])

In [None]:
for counter, value in enumerate(brown.sents(categories=['news', 'editorial', 'reviews'])[1:7]):
    print(str(counter), "- ", join_func(value), end="\n\n")

In [None]:
from nltk.corpus import brown
news_text = brown.words(categories='news')
fdist = nltk.FreqDist([w.lower() for w in news_text])
modals = ['can', 'could', 'may', 'might', 'must', 'will']
for m in modals:
    print(m + ':', fdist[m], end=' ')

In [None]:
cfd = nltk.ConditionalFreqDist(
    (genre, word)
    for genre in brown.categories()
    for word in brown.words(categories=genre))
genres = ['news', 'religion', 'hobbies', 'science_fiction', 'romance', 'humor']
modals = ['can', 'could', 'may', 'might', 'must', 'will']
cfd.tabulate(conditions=genres, samples=modals)
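
A conditional frequency distribution is, in essence, one frequency counter per condition, built from (condition, sample) pairs. The sketch below reproduces that idea with the standard library on a few hand-made pairs. The data and the name `build_cfd` are illustrative assumptions; `nltk.ConditionalFreqDist` remains the tool used in this notebook.

```python
from collections import Counter, defaultdict

def build_cfd(pairs):
    """Group (condition, sample) pairs into one Counter per condition."""
    cfd = defaultdict(Counter)
    for condition, sample in pairs:
        cfd[condition][sample] += 1
    return cfd

# Hand-made (genre, word) pairs, for illustration only
pairs = [('news', 'will'), ('news', 'can'), ('news', 'will'),
         ('romance', 'could'), ('romance', 'could'), ('romance', 'will')]
toy_cfd = build_cfd(pairs)

# Print a small table: one row per genre, one column per modal
modals = ['can', 'could', 'will']
print(f"{'':10}" + ''.join(f"{m:>7}" for m in modals))
for genre in ['news', 'romance']:
    print(f"{genre:10}" + ''.join(f"{toy_cfd[genre][m]:>7}" for m in modals))
```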

<h3>Reuters Corpus</h3>

<p>The Reuters Corpus contains 10,788 news documents totaling 1.3 million words. The
documents have been classified into 90 topics, and grouped into two sets, called “training”
and “test”; thus, the text with fileid <FONT face="courier", size="3">'test/14826'</FONT> is a document drawn from the
test set. This split is for training and testing algorithms that automatically detect the
topic of a document, as we will see in Chapter 6.</p>
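
Because set membership is encoded in the prefix of the file identifier, the split can be recovered with a simple string test. A minimal sketch, assuming identifiers of the form `'test/...'` and `'training/...'`; apart from `'test/14826'`, the identifiers below are made up for illustration:

```python
# Illustrative identifiers; only 'test/14826' is taken from the text above
fileids = ['test/14826', 'test/00001', 'training/00002', 'training/00003', 'training/00004']

train_ids = [f for f in fileids if f.startswith('training/')]
test_ids = [f for f in fileids if f.startswith('test/')]

print(len(train_ids), "training documents,", len(test_ids), "test documents")
```

With the real corpus, the same filter would be applied to the list returned by `reuters.fileids()`.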

In [None]:
from nltk.corpus import reuters
#reuters.fileids()

In [None]:
reuters.categories()

<h3>Inaugural Address Corpus</h3>

In [None]:
from nltk.corpus import inaugural
inaugural.fileids()

In [None]:
[fileid[:4] for fileid in inaugural.fileids()]

In [None]:
%matplotlib notebook
cfd = nltk.ConditionalFreqDist(
    (target, fileid[:4])
    for fileid in inaugural.fileids()
    for w in inaugural.words(fileid)
    for target in ['man', 'women',"people"]
    if w.lower().startswith(target))
cfd.plot(cumulative=True)

<h2>Annotated Text Corpora</h2>

<h3>Corpora in Other Languages</h3>

In [None]:
join_func(nltk.corpus.cess_esp.words()[0:100])  # Spanish

In [None]:
nltk.corpus.floresta.words()  # Portuguese

In [None]:
nltk.corpus.indian.words('hindi.pos')  # Hindi

In [None]:
nltk.corpus.udhr.fileids()

In [None]:
from nltk.corpus import udhr
languages = ['Chickasaw', 'English', 'German_Deutsch',
             'Greenlandic_Inuktikut', 'Hungarian_Magyar', 'Ibibio_Efik']
cfd = nltk.ConditionalFreqDist(
    (lang, len(word))
    for lang in languages
    for word in udhr.words(lang + '-Latin1'))
cfd.plot()

In [None]:
udhr.fileids()
raw_text = udhr.raw('Arabic_Alarabia-Arabic')
nltk.FreqDist(raw_text).plot()

In [None]:
raw = gutenberg.raw("burgess-busterbrown.txt")
print(raw[1:20])

In [None]:
words = gutenberg.words("burgess-busterbrown.txt")
print(words[1:20])

In [None]:
sents = gutenberg.sents("burgess-busterbrown.txt")
print(sents[1:20])

In [None]:
from nltk.corpus import treebank
print(treebank.words())
tree1 = treebank.parsed_sents()[0:1]
print(tree1)

<h3>Counting Words by Genre</h3>

In [None]:
from nltk.corpus import brown
cfd = nltk.ConditionalFreqDist(
    (genre, word)
    for genre in brown.categories()
    for word in brown.words(categories=genre))

<h3>Plotting and Tabulating Distributions</h3>

In [None]:
from nltk.corpus import inaugural
cfd1 = nltk.ConditionalFreqDist(
    (target, fileid[:4])
    for fileid in inaugural.fileids()
    for w in inaugural.words(fileid)
    for target in ['america', 'citizen']
    if w.lower().startswith(target))
cfd1.plot(cumulative=True, title="Freq of words ('america' & 'citizen') from 1789 until 2009 in inaugural corpus")

In [None]:
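# cfd2 is not defined earlier; it is assumed here to be the UDHR word-length distribution for these two languages
cfd2 = nltk.ConditionalFreqDist((lang, len(word)) for lang in ["English", "French_Francais"] for word in udhr.words(lang + '-Latin1'))
cfd2.plot(conditions=["English", "French_Francais"], samples=range(25), cumulative=False)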

<h3>More Bigrams</h3>

In [None]:
import nltk
sent = ['In', 'the', 'beginning', 'God', 'created', 'the', 'heaven', 'and', 'the', 'earth', '.']
sent2 = []
for i in nltk.bigrams(sent):
    sent2.append(list(i))
    
print(sent2)
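
A bigram is simply a pair of adjacent tokens, so the same list can be produced without NLTK by zipping the sentence with a copy of itself shifted by one position. This is a sketch of the idea, not a description of how `nltk.bigrams` is implemented:

```python
sent = ['In', 'the', 'beginning', 'God', 'created', 'the', 'heaven', 'and', 'the', 'earth', '.']

# Pair each token with its successor; a sentence of n tokens yields n - 1 bigrams
pairs = [list(pair) for pair in zip(sent, sent[1:])]
print(pairs[:3])
```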

In [None]:
from nltk.corpus import inaugural
text_bigram = []
for s in inaugural.sents():
    for w in nltk.bigrams(s):
        text_bigram.append(list(w))
    
print(text_bigram[:80])


<h3>Names Corpus</h3>

In [None]:
names = nltk.corpus.names
names.fileids()

In [None]:
male_names = names.words('male.txt')
female_names = names.words('female.txt')
female_set = set(female_names)  # set membership is much faster than scanning the list
print([w for w in male_names if w in female_set])

In [None]:
cfd = nltk.ConditionalFreqDist(
    (fileid, name[-1])
    for fileid in names.fileids()
    for name in names.words(fileid))
cfd.plot(title="Conditional frequency distribution")

<h3>A Pronouncing Dictionary</h3>

In [None]:
entries = nltk.corpus.cmudict.entries()
print(len(entries), "\n")
for entry in entries[39943:39951]:
    print(entry)

<p>For each word, this lexicon provides a list of phonetic codes (distinct labels for each contrastive sound) known as phones. Observe that fire has two pronunciations (in U.S. English): the one-syllable <font face="courier",size="3">F AY1 R</font>, and the two-syllable <font face="courier",size="3">F AY1 ER0</font>. The symbols in the CMU Pronouncing Dictionary are from the Arpabet, described in more detail at <a>http://en.wikipedia.org/wiki/Arpabet</a>.</p>
<p>Each entry consists of two parts, and we can process these individually using a more complex version of the <font face="courier", size="3">for</font> statement. Instead of writing <font face="courier", size="3">for entry in entries:</font>, we replace <font face="courier", size="3">entry</font> with two variable names, <font face="courier", size="3">word</font>, <font face="courier", size="3">pron</font> . Now, each time through the loop, <font face="courier", size="3">word</font> is assigned the first part of the entry, and <font face="courier", size="3">pron</font> is assigned the second part of the entry:</p>
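
The unpacking described above works for any sequence of two-part items. A minimal sketch on two hand-written entries (the two pronunciations of fire quoted above), so it runs without loading the dictionary; the names `sample_entries` and `phone_counts` are assumptions made for illustration:

```python
# Two entries in the same (word, pronunciation) shape as cmudict.entries()
sample_entries = [('fire', ['F', 'AY1', 'ER0']),
                  ('fire', ['F', 'AY1', 'R'])]

phone_counts = []
for word, pron in sample_entries:   # each tuple is unpacked into two names
    phone_counts.append(len(pron))
    print(word, '->', ' '.join(pron))
```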

In [None]:
for word, pron in entries:
    if len(pron) == 3:
        ph_1, ph_2, ph_3 = pron
        if ph_1 == 'P' and ph_3 == 'T':
            print(word, ph_2)

In [None]:
syllable = ['N', 'IH0', 'K', 'S']
print([word for word, pron in entries if pron[-4:] == syllable])

In [None]:
[w for w, pron in entries if pron[-1] == 'M' and w[-1] == 'n']

In [None]:
def stress(pron):
    return [char for phone in pron for char in phone if char.isdigit()]
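
In these transcriptions the vowel phones carry a digit that marks stress: 1 for primary stress, 2 for secondary stress, and 0 for no stress. The `stress` function keeps only those digits. A quick check on the two pronunciations of fire quoted earlier, written out by hand so that no corpus is needed:

```python
def stress(pron):
    """Collect the stress digits from a list of phones."""
    return [char for phone in pron for char in phone if char.isdigit()]

two_syllables = stress(['F', 'AY1', 'ER0'])   # two-syllable "fire"
one_syllable = stress(['F', 'AY1', 'R'])      # one-syllable "fire"
print(two_syllables, one_syllable)
```

The number of digits returned equals the number of syllables, which is why comparing against a fixed list of digits selects words by both length and stress pattern.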

In [None]:
print([w for w, pron in entries if stress(pron) == ['0', '1', '0', '2', '0']])

In [None]:
print([w for w, pron in entries if stress(pron) == ['0', '2', '0', '1', '0']])

In [None]:
p3 = [(pron[0]+'-'+pron[2], word)
    for (word, pron) in entries
    if pron[0] == 'P' and len(pron) == 3]
cfd = nltk.ConditionalFreqDist(p3)
for template in cfd.conditions():
    if len(cfd[template]) > 10:
        words = cfd[template].keys()
        wordlist = ' '.join(words)
        print(template, wordlist[:70] + " ...")

In [None]:
prondict = nltk.corpus.cmudict.dict()
prondict['fire']

## Using the datasets package

In [None]:
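# Note: datasets.list_datasets is deprecated in recent releases of datasets;
# huggingface_hub.list_datasets is the suggested replacement.
from datasets import list_datasets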

# Get a list of all available dataset names
all_datasets = list_datasets()
print(len(all_datasets), all_datasets[:10])

As we know, only ```en``` and ```en-basic``` are supported by the ```nltk.corpus.words``` interface.

Beyond ```word_tokenize```, there are other steps to consider, such as lowercasing and lemmatizing or stemming. Still, the ```datasets``` package is a good place to start when you need words in a language other than the English that ```nltk.corpus.words``` provides.

In [None]:
# In case
# pip install -U apache_beam
# pip install -U datasets
# pip install -U nltk
# pip install -U tqdm

The name ```tqdm``` comes from the Arabic word *taqaddum*, which means "progress".

In [None]:
from itertools import chain

from tqdm import tqdm
#import nltk
#nltk.download('popular')

from nltk import word_tokenize
from datasets import load_dataset
dataset = load_dataset("wikipedia", "20220301.de")


num_texts = len(dataset['train'])  # 2,665,357 articles


# Collect the distinct tokens of the first 100,000 texts (from_iterable keeps the generator lazy)
de_words = set(chain.from_iterable(word_tokenize(dataset['train'][i]['text']) for i in tqdm(range(100_000))))

In [None]:
len(de_words)

In [None]:
print(*list(de_words)[:42], sep=", ")

In [None]:
de_words_part = list(de_words)[:42]
# 0 marks a token that contains a digit, 1 a token without digits
positions = [0 if any(ch.isdigit() for ch in w) else 1 for w in de_words_part]
print(*positions[:15], sep=', ', end='\n\n')

# Split the sample into tokens with digits (popped_out) and tokens without
popped_out = [w for w, keep in zip(de_words_part, positions) if not keep]
de_words_part = [w for w, keep in zip(de_words_part, positions) if keep]

print(*de_words_part, sep=', ', end='\n\n')
print(*popped_out, sep=', ')

### Other well-known Natural Language Processing (NLP) datasets:
- **GLUE**: A collection of nine different NLP tasks for evaluating language understanding.
- **SQuAD**: The Stanford Question Answering Dataset, widely used for reading comprehension and QA.
- **CNN/DailyMail**: A dataset for summarization tasks based on news articles.
- **CoNLL-2003**: For named entity recognition (NER).
- **Wikitext**: A large-scale language modeling dataset extracted from Wikipedia.
- **IMDB**: A sentiment analysis dataset composed of movie reviews.

See <a>https://huggingface.co/</a> for more information.

For example, to load the *'hamim-87/Ashrafur_bangla_math'* dataset:

```small_test = load_dataset('hamim-87/Ashrafur_bangla_math')```