# **PRACTICE 1 - NLTK**

*   Exploring a Corpus
*   Vocabulary and Frequency distribution
*   Collocations - Unigrams, Bigrams, Trigrams
*   Exercise 1 - 16/03/2023

## COLAB ONLY - Installing requirements

In [None]:
!pip install nltk

## Exploring a Corpus

In [None]:
# Imports
import nltk
import nltk.corpus

In [None]:
# Opens NLTK's interactive downloader, where you can pick corpora and models to fetch
nltk.download()

In [None]:
# Getting the Brown corpus:
from nltk.corpus import brown

## Download a specific dataset
try:
    nltk.data.find('corpora/brown')  # resources are looked up by their path inside nltk_data
except LookupError:
    nltk.download('brown')

## Check type and folder
print("Brown corpora:"+str(nltk.corpus.brown).replace('\\\\','/'))

In [None]:
## Corpus categories
print(f'{brown.categories()}')

In [None]:
## Corpus files ids for a given category
print(f'{brown.fileids(categories="adventure")}')
## Corpus files ids
#print(f'{brown.fileids()}')

In [None]:
## Getting the first 30 words of the first file
n = 30
category = 'adventure'
fid = brown.fileids(categories=category)[0]

print(f'{fid} first {n} words: {brown.words(fid)[:n]}')

In [None]:
## Getting first sentence
first_sentence = brown.sents(fid)[0]
print(f'{fid} first sentence: {first_sentence}')

In [None]:
## Getting first sentence as text
first_sentence_text = ' '.join(first_sentence)
print(f'{fid} text: {first_sentence_text}')

In [None]:
print(f'File "{fid}" sentences:')
for i, s in enumerate(brown.sents(fileids=fid)[:10]):
    print(f'{i} {" ".join(s)}')

## Vocabulary and Frequency

In [None]:
# Vocabulary extraction and comparison
vocabulary_fid = set(brown.words(fileids=fid))
vocabulary_cat = set(brown.words(categories=category))

print(f'File "{fid}" vocab size: {len(vocabulary_fid)}'
      f'\nCategory "{category}" vocab size: {len(vocabulary_cat)}')
print(f'Brown total words: {len(brown.words())}')
print(f'Brown total words for {category}: {len(brown.words(categories=category))}')

In [None]:
print(f'Category vocabulary - Fid vocabulary:'
      f'\n-> size '
      f'\n--> cat - fid : {len(vocabulary_cat.difference(vocabulary_fid))}'
      f'\n--> fid - cat : {len(vocabulary_fid.difference(vocabulary_cat))} (Obviously!)'
      #f'\n-> words '
      #f'\n--> cat - fid : {vocabulary_cat.difference(vocabulary_fid)}'
      #f'\n--> fid - cat : {vocabulary_fid.difference(vocabulary_cat)}'
      )
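The "(Obviously!)" holds because the file is part of the category, so its vocabulary is a subset of the category vocabulary and the difference in that direction is always empty. A minimal sketch on toy data (the word lists below are made up for illustration, not taken from Brown):

```python
# Toy stand-ins for a category and one of its files (hypothetical data).
category_words = ["the", "dog", "ran", ".", "the", "cat", "sat", "."]
file_words = category_words[:4]  # the "file" is a slice of the "category"

vocab_cat = set(category_words)
vocab_file = set(file_words)

# The `-` operator on sets is equivalent to .difference()
only_in_cat = vocab_cat - vocab_file
only_in_file = vocab_file - vocab_cat

print(f"cat - file: {sorted(only_in_cat)}")
print(f"file - cat: {sorted(only_in_file)}")
print(f"file vocabulary is a subset of the category: {vocab_file <= vocab_cat}")
```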

In [None]:
vocabulary_fid

In [None]:
from nltk.probability import FreqDist  # explicit import instead of a wildcard

# Extracting frequency distribution for a specific file
f_dist = FreqDist(brown.words(fileids=fid))
f_dist

In [None]:
## Looking for the 10 most frequent words
f_dist.most_common(10)

## Why is the dot (.) the most common "word"?

How do we solve this?

In [None]:
import string

punctuations = set(string.punctuation)
punctuations.add('``')
punctuations.add('\'\'')

words = brown.words(fileids=fid)
filtered_words = [w for w in words if w not in punctuations]
fixed_f_dist = FreqDist(filtered_words)
fixed_f_dist.most_common(20)
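Even after removing punctuation, "The" and "the" are still counted as different words. One possible refinement is to lowercase the tokens before counting. The sketch below uses `collections.Counter` on a made-up token list so it runs without any corpus; `FreqDist` is a `Counter` subclass, so the same idea should carry over. The extra `--` entry is an assumption about which multi-character punctuation tokens you may want to drop.

```python
from collections import Counter
import string

# Hypothetical tokens, mimicking Brown-style quote marks (`` and '').
tokens = ["The", "man", "said", ",", "``", "the", "horse", "ran", "''", ".",
          "The", "horse", "."]

# Single punctuation characters plus a few multi-character punctuation tokens.
punctuation_tokens = set(string.punctuation) | {"``", "''", "--"}

# Drop punctuation tokens, then lowercase what is left.
normalized = [t.lower() for t in tokens if t not in punctuation_tokens]
counts = Counter(normalized)

print(counts.most_common(2))  # → [('the', 3), ('horse', 2)]
```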

## Collocations - Unigrams, Bigrams
The `nltk.collocations` module provides classes to extract bigrams, trigrams, and so on.
And unigrams? Guess what: `FreqDist` already gives you unigrams :-)

In [None]:
import nltk.collocations as collocations
from nltk.corpus import brown
import string

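# Punctuation characters plus Brown's two-character quote tokens, as a set for fast lookup
punctuations = set(string.punctuation) | {'``', "''"}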

bigram_measures = collocations.BigramAssocMeasures()
bigrams_finder = collocations.BigramCollocationFinder.from_words(brown.words())
bigrams_finder.apply_word_filter(lambda w: w.lower() in punctuations)
bigrams_finder.nbest(bigram_measures.pmi, 20)
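Ranking by PMI alone tends to favour pairs that occur only once, because two rare words that appear together get a very high score. The sketch below computes pointwise mutual information by hand on a made-up sentence to show the effect, and then applies a minimum-count filter. It uses the usual PMI definition and is not meant to reproduce NLTK's implementation detail for detail; the names `pmi`, `ranked` and `frequent` are illustrative. On an NLTK finder, `apply_freq_filter(min_freq)` serves the same purpose as the filter here.

```python
import math
from collections import Counter

# Hypothetical toy text; with Brown you would use brown.words() instead.
tokens = "new york is big and new york is old and the city is new".split()

word_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))
n_tokens = len(tokens)


def pmi(bigram):
    """log2( P(w1, w2) / (P(w1) * P(w2)) ), with probabilities estimated from counts."""
    w1, w2 = bigram
    p_joint = bigram_counts[bigram] / n_tokens
    p_w1 = word_counts[w1] / n_tokens
    p_w2 = word_counts[w2] / n_tokens
    return math.log2(p_joint / (p_w1 * p_w2))


# Rank all bigrams by PMI: a pair seen only once comes out on top.
ranked = sorted(bigram_counts, key=pmi, reverse=True)
print(ranked[0], round(pmi(ranked[0]), 3))

# Keep only bigrams seen at least twice before trusting the ranking.
frequent = [b for b in ranked if bigram_counts[b] >= 2]
print(frequent)
```

Here `("the", "city")` occurs once yet outranks `("new", "york")`, which occurs twice; the count filter removes it.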

# EXERCISE 1 - 16/03/2023

Using the "adventure" category of the Brown corpus as your test set, compare it with the rest of the corpus (all remaining categories taken together) and check:
* The vocabulary size difference.
* The intersection between the 100 most frequent words of each part.
* The most common bigrams under different association measures.
* The same for trigrams.
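As a possible starting point for the second item, here is a small sketch of a top-k overlap on toy data. The helper `top_k_words` and both token lists are hypothetical; in the exercise you would build the two token lists from `brown.words(categories=...)`, which accepts either a single category or a list of categories, and use k = 100.

```python
from collections import Counter


def top_k_words(tokens, k):
    """Return the set of the k most frequent tokens."""
    return {word for word, _ in Counter(tokens).most_common(k)}


# Made-up stand-ins for the test set and for the rest of the corpus.
test_tokens = ("the ship sailed and the ship sank and "
               "the ship burned near the coast").split()
rest_tokens = ("the bank closed and the bank opened and "
               "the market fell and the rates rose").split()

top_test = top_k_words(test_tokens, 3)
top_rest = top_k_words(rest_tokens, 3)

shared = top_test & top_rest      # frequent in both parts
only_test = top_test - top_rest   # frequent only in the test set

print(f"shared: {sorted(shared)}")
print(f"only in test set: {sorted(only_test)}")
```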