In [43]:
import nltk
from nltk.corpus import brown
from nltk.tokenize import word_tokenize
from nltk.util import ngrams
import re


Importing the corpus

In [15]:
nltk.download('brown')

sentences = [' '.join(words) for words in brown.sents()]

print(sentences[:5])  


[nltk_data] Downloading package brown to
[nltk_data]     C:\Users\Aymane\AppData\Roaming\nltk_data...
[nltk_data]   Package brown is already up-to-date!


["The Fulton County Grand Jury said Friday an investigation of Atlanta's recent primary election produced `` no evidence '' that any irregularities took place .", "The jury further said in term-end presentments that the City Executive Committee , which had over-all charge of the election , `` deserves the praise and thanks of the City of Atlanta '' for the manner in which the election was conducted .", "The September-October term jury had been charged by Fulton Superior Court Judge Durwood Pye to investigate reports of possible `` irregularities '' in the hard-fought primary which was won by Mayor-nominate Ivan Allen Jr. .", "`` Only a relative handful of such reports was received '' , the jury said , `` considering the widespread interest in the election , the number of voters and the size of this city '' .", "The jury said it did find that many of Georgia's registration and election laws `` are outmoded or inadequate and often ambiguous '' ."]


__Corpus Normalization Steps__

Before processing any corpus, several steps must be followed to prepare the corpus in a format suitable for analysis. This process is called **normalization**. The steps include:

__Case Folding__
- Transform the entire corpus content into **lowercase characters** to ensure uniformity.

__Tokenization__
- Divide the text into smaller units, such as:
  - **Words**
  - **Subwords**
  - **Other entities**
- The choice of units depends on various factors, including the algorithm used and the corpus characteristics.

__Special Characters Management__
- Remove all **punctuation marks**, since this application does not treat them as words.
- Add two special tokens:
  - `<s>`: Marks the **start** of a sentence.
  - `</s>`: Marks the **end** of a sentence.
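
To make the three steps concrete, here is a minimal sketch that applies them to a single sentence. The helper name `normalize` and its parameter `n` are hypothetical, and plain whitespace splitting stands in for a real tokenizer, so this is an illustration under those assumptions rather than the exact pipeline used in the cells that follow.

```python
import re

def normalize(sentence, n=3):
    """Case-fold, strip punctuation, split into words, and pad with markers.

    `normalize` and `n` are illustrative names, not part of NLTK.
    """
    lowered = sentence.lower()                     # case folding
    no_punct = re.sub(r"[^\w\s]", " ", lowered)    # punctuation becomes a space
    tokens = no_punct.split()                      # whitespace split as a stand-in tokenizer
    # An n-gram model needs n-1 start markers and one end marker.
    return ["<s>"] * (n - 1) + tokens + ["</s>"]

example = normalize("The jury said , `` No evidence '' .")
print(example)
```

With `n=3` the sentence gets two `<s>` markers, which matches the trigram setting used later in this notebook.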


In [16]:
lower_sentences = [sentence.lower() for sentence in sentences]

print(lower_sentences[:5])  


["the fulton county grand jury said friday an investigation of atlanta's recent primary election produced `` no evidence '' that any irregularities took place .", "the jury further said in term-end presentments that the city executive committee , which had over-all charge of the election , `` deserves the praise and thanks of the city of atlanta '' for the manner in which the election was conducted .", "the september-october term jury had been charged by fulton superior court judge durwood pye to investigate reports of possible `` irregularities '' in the hard-fought primary which was won by mayor-nominate ivan allen jr. .", "`` only a relative handful of such reports was received '' , the jury said , `` considering the widespread interest in the election , the number of voters and the size of this city '' .", "the jury said it did find that many of georgia's registration and election laws `` are outmoded or inadequate and often ambiguous '' ."]


Next, I use a regular expression to match every character that is neither a word character (letter, digit, or underscore) nor whitespace, and replace each match with a space.

In [38]:
cleaned_sentences = [re.sub(r'[^\w\s]',' ',sentence) for sentence in lower_sentences]
print(cleaned_sentences[:5]) 

['the fulton county grand jury said friday an investigation of atlanta s recent primary election produced    no evidence    that any irregularities took place  ', 'the jury further said in term end presentments that the city executive committee   which had over all charge of the election      deserves the praise and thanks of the city of atlanta    for the manner in which the election was conducted  ', 'the september october term jury had been charged by fulton superior court judge durwood pye to investigate reports of possible    irregularities    in the hard fought primary which was won by mayor nominate ivan allen jr   ', '   only a relative handful of such reports was received      the jury said      considering the widespread interest in the election   the number of voters and the size of this city     ', 'the jury said it did find that many of georgia s registration and election laws    are outmoded or inadequate and often ambiguous     ']
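
The pattern `[^\w\s]` has two side effects worth seeing on a small input: apostrophes and hyphens split words apart (as in `atlanta s` above), and underscores survive because `\w` includes `_`. The sketch below shows both on a made-up string; the `strict` variant is only one possible way to drop underscores as well, not something the notebook itself does.

```python
import re

text = "atlanta's over-all snake_case `` quoted '' ."

# Same pattern as in the cell above: anything that is not a word
# character or whitespace is replaced by a single space.
replaced = re.sub(r"[^\w\s]", " ", text)
print(replaced.split())

# Hypothetical stricter variant that also treats "_" as punctuation.
strict = re.sub(r"[^\w\s]|_", " ", text)
print(strict.split())
```

Replacing with a space rather than the empty string keeps `over-all` from collapsing into `overall`, at the cost of leftover fragments such as a lone `s`.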


For tokenization, I will use `word_tokenize`, NLTK's prebuilt tokenizer, which relies on the pretrained Punkt models. Later I will try to implement BPE (byte-pair encoding), a subword tokenization algorithm, and show the steps behind it, but for now let's stick with this.

In [33]:
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('punkt_tab')


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Aymane\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Aymane\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\Aymane\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\Aymane\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt_tab.zip.


True

In [40]:
tokenized_sentences = [word_tokenize(sentence) for sentence in cleaned_sentences]
print(tokenized_sentences[:5])

[['the', 'fulton', 'county', 'grand', 'jury', 'said', 'friday', 'an', 'investigation', 'of', 'atlanta', 's', 'recent', 'primary', 'election', 'produced', 'no', 'evidence', 'that', 'any', 'irregularities', 'took', 'place'], ['the', 'jury', 'further', 'said', 'in', 'term', 'end', 'presentments', 'that', 'the', 'city', 'executive', 'committee', 'which', 'had', 'over', 'all', 'charge', 'of', 'the', 'election', 'deserves', 'the', 'praise', 'and', 'thanks', 'of', 'the', 'city', 'of', 'atlanta', 'for', 'the', 'manner', 'in', 'which', 'the', 'election', 'was', 'conducted'], ['the', 'september', 'october', 'term', 'jury', 'had', 'been', 'charged', 'by', 'fulton', 'superior', 'court', 'judge', 'durwood', 'pye', 'to', 'investigate', 'reports', 'of', 'possible', 'irregularities', 'in', 'the', 'hard', 'fought', 'primary', 'which', 'was', 'won', 'by', 'mayor', 'nominate', 'ivan', 'allen', 'jr'], ['only', 'a', 'relative', 'handful', 'of', 'such', 'reports', 'was', 'received', 'the', 'jury', 'said', '

Now I add the special markers to the beginning and end of each sentence. Note that I add two `<s>` markers at the beginning of each sentence because I am going to use a trigram model, so the first word needs two tokens of context.

In [42]:
normalized_sentences = [['<s>','<s>'] + sentence + ['</s>'] for sentence in tokenized_sentences]
print(normalized_sentences[:5])

[['<s>', '<s>', 'the', 'fulton', 'county', 'grand', 'jury', 'said', 'friday', 'an', 'investigation', 'of', 'atlanta', 's', 'recent', 'primary', 'election', 'produced', 'no', 'evidence', 'that', 'any', 'irregularities', 'took', 'place', '</s>'], ['<s>', '<s>', 'the', 'jury', 'further', 'said', 'in', 'term', 'end', 'presentments', 'that', 'the', 'city', 'executive', 'committee', 'which', 'had', 'over', 'all', 'charge', 'of', 'the', 'election', 'deserves', 'the', 'praise', 'and', 'thanks', 'of', 'the', 'city', 'of', 'atlanta', 'for', 'the', 'manner', 'in', 'which', 'the', 'election', 'was', 'conducted', '</s>'], ['<s>', '<s>', 'the', 'september', 'october', 'term', 'jury', 'had', 'been', 'charged', 'by', 'fulton', 'superior', 'court', 'judge', 'durwood', 'pye', 'to', 'investigate', 'reports', 'of', 'possible', 'irregularities', 'in', 'the', 'hard', 'fought', 'primary', 'which', 'was', 'won', 'by', 'mayor', 'nominate', 'ivan', 'allen', 'jr', '</s>'], ['<s>', '<s>', 'only', 'a', 'relative',
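
To see why two start markers are needed, the sketch below lists the trigrams of a short padded sentence. The helper `sliding_ngrams` is a hypothetical stdlib-only stand-in for `nltk.util.ngrams`, written here so the example runs without NLTK.

```python
def sliding_ngrams(tokens, n):
    """Return every contiguous window of n tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

padded = ["<s>", "<s>", "the", "jury", "said", "</s>"]
trigrams = sliding_ngrams(padded, 3)

for trigram in trigrams:
    print(trigram)
```

Every real token, including `</s>`, ends exactly one trigram, so each one gets a full two-token history. With a single `<s>`, the first word `the` would have no trigram in which it is the predicted token.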

Here I compute all the trigrams, bigrams, and unigrams in the corpus. They will be used for interpolation smoothing, which handles n-grams with zero probability and generally improves the model's performance.

In [47]:
trigrams_corpus = [list(ngrams(sentence, 3)) for sentence in normalized_sentences]
bigrams_corpus = [list(ngrams(sentence, 2)) for sentence in normalized_sentences]
unigrams_corpus = [list(ngrams(sentence, 1)) for sentence in normalized_sentences]
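
As a rough sketch of where these lists are headed, simple linear interpolation estimates P(w3 | w1, w2) as λ3·P(w3 | w1, w2) + λ2·P(w3 | w2) + λ1·P(w3), with the weights summing to 1. The code below counts n-grams on a two-sentence toy corpus and applies that formula. The helper names, the toy data, and the weights `(0.6, 0.3, 0.1)` are all assumptions chosen for illustration; in practice the weights would be tuned on held-out data.

```python
from collections import Counter

def sliding_ngrams(tokens, n):
    """Stdlib stand-in for nltk.util.ngrams: all n-token windows as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def build_counts(sentences):
    """Count unigrams, bigrams, and trigrams over all sentences."""
    counts = {1: Counter(), 2: Counter(), 3: Counter()}
    for sentence in sentences:
        for n in counts:
            counts[n].update(sliding_ngrams(sentence, n))
    return counts

def interpolated_prob(counts, w1, w2, w3, lambdas=(0.6, 0.3, 0.1)):
    """Linear interpolation of trigram, bigram, and unigram estimates.

    The lambdas are illustrative values, not tuned weights.
    """
    l3, l2, l1 = lambdas
    total = sum(counts[1].values())      # note: this total includes the <s> markers
    tri_ctx = counts[2][(w1, w2)]        # how often the two-word history occurs
    bi_ctx = counts[1][(w2,)]            # how often the one-word history occurs
    p3 = counts[3][(w1, w2, w3)] / tri_ctx if tri_ctx else 0.0
    p2 = counts[2][(w2, w3)] / bi_ctx if bi_ctx else 0.0
    p1 = counts[1][(w3,)] / total
    return l3 * p3 + l2 * p2 + l1 * p1

toy_sentences = [
    ["<s>", "<s>", "the", "jury", "said", "</s>"],
    ["<s>", "<s>", "the", "jury", "met", "</s>"],
]
toy_counts = build_counts(toy_sentences)

seen = interpolated_prob(toy_counts, "the", "jury", "said")
unseen = interpolated_prob(toy_counts, "jury", "the", "said")
print(seen, unseen)
```

The unseen trigram still gets a non-zero probability through the unigram term, which is the point of interpolating. The lists built in the cell above are nested per sentence, so they would need to be flattened (or fed through a loop like the one in `build_counts`) before counting.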
