### 1 - Data Preparation

Proper data preparation is essential for training an effective N-gram model. We will define functions that load the text, clean it, generate N-grams, and build a vocabulary from the text data.

In [2]:
from DTSC_685_Assignment2C_NGram_only_functions import *

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Eddie\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


#### 1.1 - Import necessary libraries:


>```python
import string
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.util import ngrams

In [4]:
import string
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.util import ngrams
import re

class TextPreprocessor:
    def __init__(self):
        try:
            self.stop_words = set(stopwords.words('english'))
        except LookupError:  # raised by NLTK when the stopwords corpus is missing
            import nltk
            nltk.download('stopwords')
            nltk.download('punkt')
            self.stop_words = set(stopwords.words('english'))
    
    def load_text(self, filepath):
        with open(filepath, 'r', encoding='utf-8') as file:
            return file.read()
    
    def clean_text(self, text, remove_stopwords=False, remove_punctuation=True):
        text = text.lower()
        text = re.sub(r'[^a-zA-Z\s]', ' ', text) if remove_punctuation else text
        tokens = word_tokenize(text)
        if remove_stopwords:
            tokens = [token for token in tokens if token not in self.stop_words]
        return ' '.join(tokens)
    
    def generate_ngrams(self, text, n=5):
        tokens = word_tokenize(text)
        return list(ngrams(tokens, n))
    
    def build_vocabulary(self, text):
        tokens = word_tokenize(text)
        return set(tokens)

def prepare_shakespeare_data():
    preprocessor = TextPreprocessor()
    train_text = preprocessor.load_text('WS_train.txt')
    test_text = preprocessor.load_text('WS_test.txt')
    validation_text = preprocessor.load_text('WS_validation.txt')
    clean_train = preprocessor.clean_text(
        train_text,
        remove_stopwords=False,
        remove_punctuation=False
    )
    vocab = preprocessor.build_vocabulary(clean_train)
    fivegrams = preprocessor.generate_ngrams(clean_train, n=5)
    return {
        'train_text': clean_train,
        'test_text': test_text,
        'validation_text': validation_text,
        'vocabulary': vocab,
        'fivegrams': fivegrams
    }

if __name__ == "__main__":
    data = prepare_shakespeare_data()
    print(f"Vocabulary size: {len(data['vocabulary'])}")
    print(f"Number of 5-grams: {len(data['fivegrams'])}")
    print("\nExample 5-grams:")
    for gram in list(data['fivegrams'])[:3]:
        print(gram)


Vocabulary size: 29023
Number of 5-grams: 1130241

Example 5-grams:
('1609', 'the', 'sonnets', 'by', 'william')
('the', 'sonnets', 'by', 'william', 'shakespeare')
('sonnets', 'by', 'william', 'shakespeare', '1')


#### 1.2 - Load the Text

Define a function called `load_text` that reads a text file and returns its contents as a string. The function should take the following parameters:

- `file_path`: A string representing the path to the text file to be read.


Use Python's built-in `open` function for reading files with appropriate error handling (for cases where the file might not exist) and `encoding='utf-8'`.

Use the `load_text` function to load the `WS_train.txt` as the training text data. Store it as `train_text`.


**Expected Output:**

>```python
train_text[0:500]


    "1609\n\nTHE SONNETS\n\nby William Shakespeare\n\n\n\n                     1\n  From fairest creatures we desire increase,\n  That thereby beauty's rose might never die,\n  But as the riper should by time decease,\n  His tender heir might bear his memory:\n  But thou contracted to thine own bright eyes,\n  Feed'st thy light's flame with self-substantial fuel,\n  Making a famine where abundance lies,\n  Thy self thy foe, to thy sweet self too cruel:\n  Thou that art now the world's fresh ornament,\n  And only heral"


In [6]:
def load_text(file_path):
    try:
        with open(file_path, 'r', encoding='utf-8') as file:
            return file.read()
    except FileNotFoundError:
        raise FileNotFoundError(f"Error: The file '{file_path}' was not found.")
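    except UnicodeDecodeError as e:
        # UnicodeDecodeError takes five constructor arguments, so reuse the original details
        raise UnicodeDecodeError(
            e.encoding, e.object, e.start, e.end,
            f"unable to decode '{file_path}'; please ensure the file is UTF-8 encoded"
        ) from e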
    except IOError as e:
        raise IOError(f"Error reading '{file_path}': {str(e)}")

try:
    train_text = load_text('WS_train.txt')
    print("train_text[0:500]:")
    print(repr(train_text[0:500]))
except Exception as e:
    print(f"An error occurred: {str(e)}")


train_text[0:500]:
"1609\n\nTHE SONNETS\n\nby William Shakespeare\n\n\n\n                     1\n  From fairest creatures we desire increase,\n  That thereby beauty's rose might never die,\n  But as the riper should by time decease,\n  His tender heir might bear his memory:\n  But thou contracted to thine own bright eyes,\n  Feed'st thy light's flame with self-substantial fuel,\n  Making a famine where abundance lies,\n  Thy self thy foe, to thy sweet self too cruel:\n  Thou that art now the world's fresh ornament,\n  And only heral"


#### 1.3 - Clean the Text

Define a function named `clean_text` that will standardize, tokenize, and remove punctuation from the text data, while retaining the **<DELETED>** placeholders. The function should take the following parameter:

- `text`: The raw string of text to be cleaned.

The function will:
    
1. Convert the text to lowercase.
2. Tokenize the text.
3. Remove all punctuation tokens using `string.punctuation`.
4. Remove stop-word tokens using NLTK's `stopwords.words('english')`.

*These steps should be performed in the order specified above.*
    
The function `clean_text` should return a **list** of clean tokens, retaining the **<DELETED>** placeholders.

Apply the `clean_text` function to the loaded `train_text`. Since the function internally handles the removal of stop words and punctuation (except for the `<DELETED>` placeholders), only the raw text needs to be passed as an argument. Store the output in a variable named `cleaned_train_text`.

P.S.: Don't forget to download the required NLTK resources:

>```python
nltk.download('stopwords')
nltk.download('punkt')


**Expected Output:**

>```python
cleaned_train_text[0:30]
    ['1609',
     'sonnets',
     'william',
     'shakespeare',
     '1',
     'fairest',
     'creatures',
     'desire',
     'increase',
     'thereby',
     'beauty',
     "'s",
     'rose',
     'might',
     'never',
     'die',
     'riper',
     'time',
     'decease',
     'tender',
     'heir',
     'might',
     'bear',
     'memory',
     'thou',
     'contracted',
     'thine',
     'bright',
     'eyes',
     "feed'st"]

In [8]:
import string
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

nltk.download('stopwords')
nltk.download('punkt')

def clean_text(text):
    text = text.lower()
    tokens = word_tokenize(text)
    punct_set = set(string.punctuation)
    stop_words = set(stopwords.words('english'))
    cleaned_tokens = []
    
    for token in tokens:
        if token == '<deleted>':
            cleaned_tokens.append(token)
        elif all(char in punct_set for char in token):
            continue
        elif token in stop_words:
            continue
        else:
            cleaned_tokens.append(token)
    
    return cleaned_tokens

try:
    with open('WS_train.txt', 'r', encoding='utf-8') as file:
        train_text = file.read()
    
    cleaned_train_text = clean_text(train_text)
    
    print("cleaned_train_text[0:30]")
    import pprint
    pp = pprint.PrettyPrinter(indent=4)
    pp.pprint(cleaned_train_text[0:30])

except Exception as e:
    print(f"An error occurred: {str(e)}")


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Eddie\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Eddie\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


cleaned_train_text[0:30]
[   '1609',
    'sonnets',
    'william',
    'shakespeare',
    '1',
    'fairest',
    'creatures',
    'desire',
    'increase',
    'thereby',
    'beauty',
    "'s",
    'rose',
    'might',
    'never',
    'die',
    'riper',
    'time',
    'decease',
    'tender',
    'heir',
    'might',
    'bear',
    'memory',
    'thou',
    'contracted',
    'thine',
    'bright',
    'eyes',
    "feed'st"]


#### 1.4 - Create N-grams

Define a function named `create_ngrams` that will convert a list of tokens into a list of N-grams. The function should take the following parameters:

- `tokens`: A list of words (tokens) from which to create N-grams.

- `n`: The order of the N-gram (e.g., 2 for bigrams, 3 for trigrams, etc.).


Use the `ngrams` function from NLTK to create N-grams from the tokens.
Introduce the special tokens `<s>` and `</s>` to mark the start and the end of the text. There should be `n-1` start tokens at the beginning of the text and exactly one end token at the end.


The function should return a **list** of N-grams.
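
As a minimal sketch of the padding rule above, the following pure-Python example shows the resulting N-grams on a toy token list. It does not use NLTK: `pad_and_window` is a hypothetical helper name, and the `zip`-based windowing is only a stand-in for `nltk.util.ngrams`.

```python
def pad_and_window(tokens, n):
    """Pad with n-1 '<s>' tokens and one '</s>', then slide a window of size n."""
    padded = ['<s>'] * (n - 1) + list(tokens) + ['</s>']
    # zip over n staggered views of the padded list yields consecutive n-tuples
    return list(zip(*(padded[i:] for i in range(n))))

toy_tokens = ['fairest', 'creatures', 'desire']
toy_bigrams = pad_and_window(toy_tokens, 2)
toy_trigrams = pad_and_window(toy_tokens, 3)

print(toy_bigrams)
print(toy_trigrams)
```

With this padding scheme, a list of `k` tokens yields `k + 1` N-grams for any `n`, because the padded list has `k + n` items and each window consumes `n` of them.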


Use the `create_ngrams` function to convert `cleaned_train_text` into two sets of N-grams:

- Create bigrams from `cleaned_train_text` and store them in a variable named `train_bigrams` by passing `cleaned_train_text` with the appropriate value of `n`.    

- Create fivegrams from `cleaned_train_text` and store them in a variable named `train_fivegrams` by passing `cleaned_train_text` with the appropriate value of `n`.

**Expected Output:**

>```python
train_bigrams[0:15]

    [('<s>', '1609'),
     ('1609', 'sonnets'),
     ('sonnets', 'william'),
     ('william', 'shakespeare'),
     ('shakespeare', '1'),
     ('1', 'fairest'),
     ('fairest', 'creatures'),
     ('creatures', 'desire'),
     ('desire', 'increase'),
     ('increase', 'thereby'),
     ('thereby', 'beauty'),
     ('beauty', "'s"),
     ("'s", 'rose'),
     ('rose', 'might'),
     ('might', 'never')]

>```python
train_bigrams[-15:]

    [('prohibited', 'commercial'),
     ('commercial', 'distribution'),
     ('distribution', 'includes'),
     ('includes', 'service'),
     ('service', 'charges'),
     ('charges', 'download'),
     ('download', 'time'),
     ('time', 'membership.'),
     ('membership.', 'end'),
     ('end', 'etext'),
     ('etext', 'complete'),
     ('complete', 'works'),
     ('works', 'william'),
     ('william', 'shakespeare'),
     ('shakespeare', '</s>')]

>```python
train_fivegrams[0:15]

    [('<s>', '<s>', '<s>', '<s>', '1609'),
     ('<s>', '<s>', '<s>', '1609', 'sonnets'),
     ('<s>', '<s>', '1609', 'sonnets', 'william'),
     ('<s>', '1609', 'sonnets', 'william', 'shakespeare'),
     ('1609', 'sonnets', 'william', 'shakespeare', '1'),
     ('sonnets', 'william', 'shakespeare', '1', 'fairest'),
     ('william', 'shakespeare', '1', 'fairest', 'creatures'),
     ('shakespeare', '1', 'fairest', 'creatures', 'desire'),
     ('1', 'fairest', 'creatures', 'desire', 'increase'),
     ('fairest', 'creatures', 'desire', 'increase', 'thereby'),
     ('creatures', 'desire', 'increase', 'thereby', 'beauty'),
     ('desire', 'increase', 'thereby', 'beauty', "'s"),
     ('increase', 'thereby', 'beauty', "'s", 'rose'),
     ('thereby', 'beauty', "'s", 'rose', 'might'),
     ('beauty', "'s", 'rose', 'might', 'never')]

>```python
train_fivegrams[-15:]

    [('distributed', 'used', 'commercially', 'prohibited', 'commercial'),
     ('used', 'commercially', 'prohibited', 'commercial', 'distribution'),
     ('commercially', 'prohibited', 'commercial', 'distribution', 'includes'),
     ('prohibited', 'commercial', 'distribution', 'includes', 'service'),
     ('commercial', 'distribution', 'includes', 'service', 'charges'),
     ('distribution', 'includes', 'service', 'charges', 'download'),
     ('includes', 'service', 'charges', 'download', 'time'),
     ('service', 'charges', 'download', 'time', 'membership.'),
     ('charges', 'download', 'time', 'membership.', 'end'),
     ('download', 'time', 'membership.', 'end', 'etext'),
     ('time', 'membership.', 'end', 'etext', 'complete'),
     ('membership.', 'end', 'etext', 'complete', 'works'),
     ('end', 'etext', 'complete', 'works', 'william'),
     ('etext', 'complete', 'works', 'william', 'shakespeare'),
     ('complete', 'works', 'william', 'shakespeare', '</s>')]

In [10]:
from nltk.util import ngrams
import pprint

def create_ngrams(tokens, n):
    start_tokens = ['<s>'] * (n - 1)
    end_tokens = ['</s>']
    padded_tokens = start_tokens + tokens + end_tokens
    return list(ngrams(padded_tokens, n))

try:
    train_bigrams = create_ngrams(cleaned_train_text, 2)
    train_fivegrams = create_ngrams(cleaned_train_text, 5)

    pp = pprint.PrettyPrinter(indent=1)

    print("train_bigrams[0:15]")
    pp.pprint(train_bigrams[0:15])

    print("\ntrain_bigrams[-15:]")
    pp.pprint(train_bigrams[-15:])

    print("\ntrain_fivegrams[0:15]")
    pp.pprint(train_fivegrams[0:15])

    print("\ntrain_fivegrams[-15:]")
    pp.pprint(train_fivegrams[-15:])

except Exception as e:
    print(f"An error occurred: {str(e)}")


train_bigrams[0:15]
[('<s>', '1609'),
 ('1609', 'sonnets'),
 ('sonnets', 'william'),
 ('william', 'shakespeare'),
 ('shakespeare', '1'),
 ('1', 'fairest'),
 ('fairest', 'creatures'),
 ('creatures', 'desire'),
 ('desire', 'increase'),
 ('increase', 'thereby'),
 ('thereby', 'beauty'),
 ('beauty', "'s"),
 ("'s", 'rose'),
 ('rose', 'might'),
 ('might', 'never')]

train_bigrams[-15:]
[('prohibited', 'commercial'),
 ('commercial', 'distribution'),
 ('distribution', 'includes'),
 ('includes', 'service'),
 ('service', 'charges'),
 ('charges', 'download'),
 ('download', 'time'),
 ('time', 'membership.'),
 ('membership.', 'end'),
 ('end', 'etext'),
 ('etext', 'complete'),
 ('complete', 'works'),
 ('works', 'william'),
 ('william', 'shakespeare'),
 ('shakespeare', '</s>')]

train_fivegrams[0:15]
[('<s>', '<s>', '<s>', '<s>', '1609'),
 ('<s>', '<s>', '<s>', '1609', 'sonnets'),
 ('<s>', '<s>', '1609', 'sonnets', 'william'),
 ('<s>', '1609', 'sonnets', 'william', 'shakespeare'),
 ('1609', 'sonnets',

#### 1.5 - Build Vocabulary:

Define a function called `build_vocab` that creates a set of unique words from a list of tokens, EXCLUDING the `<DELETED>` placeholder. The function should take the following parameter:

- `tokens`: A **list** of clean tokens from which to build the vocabulary.
    
The function `build_vocab` should return a `set` of unique tokens, which will be our vocabulary.



Execute the `build_vocab` function using `cleaned_train_text` to construct a set of unique words, which will serve as the vocabulary for the N-gram model. Store the result in a variable named `vocab`, ensuring that the `<DELETED>` placeholder is not included in the vocabulary.


P.S.: Trying to remove the `<DELETED>` placeholder from the training text will produce an error, since this placeholder only exists in the test text.
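
The note above presumably refers to `set.remove`, which raises `KeyError` when the element is absent, whereas `set.discard` does nothing in that case. A small sketch on a toy vocabulary (the token values are illustrative only):

```python
toy_vocab = {'thou', 'art', 'fairest'}

# discard() silently does nothing when the element is missing
toy_vocab.discard('<DELETED>')

# remove() raises KeyError when the element is missing
try:
    toy_vocab.remove('<DELETED>')
    remove_failed = False
except KeyError:
    remove_failed = True

print(remove_failed)
print(sorted(toy_vocab))
```

Filtering inside a set comprehension, as an alternative to either method, avoids the question entirely because nothing is ever removed from an existing set.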

In [12]:
def build_vocab(tokens):
    # The comparison is case-insensitive, so one lowercase literal covers both spellings
    vocab = {token for token in tokens if token.lower() != '<deleted>'}
    return vocab

try:
    vocab = build_vocab(cleaned_train_text)
    
    print(f"Vocabulary size: {len(vocab)}")
    print("\nFirst 20 vocabulary items (sorted):")
    print(sorted(list(vocab))[:20])

except Exception as e:
    print(f"An error occurred: {str(e)}")


Vocabulary size: 29331

First 20 vocabulary items (sorted):
["'-'god-a-mercy", "'-on", "'-why", "'abbominable", "'above", "'accommodated", "'accost", "'accurs'd", "'achilles", "'ad", "'adieu", "'affected", "'after", "'against", "'aged", "'agrippa", "'ah", "'aio", "'air", "'alack"]


### 2A - N-gram Model Training - part A

Training an N-gram model is a key step in many natural language processing tasks. This process involves calculating the frequency distribution of N-grams and estimating their probabilities based on the training corpus. These statistics will then be used to predict the next word in a sequence or to determine the most likely correction in a text.


#### 2A.1 - Import necessary libraries:

>```python
from nltk import FreqDist, ConditionalFreqDist, ConditionalProbDist, MLEProbDist

In [15]:
from nltk import FreqDist, ConditionalFreqDist, ConditionalProbDist, MLEProbDist
import nltk

class NgramModelTrainer:
    def __init__(self):
        self.unigram_freq = FreqDist()
        self.bigram_cfd = ConditionalFreqDist()
        self.trigram_cfd = ConditionalFreqDist()
        self.fivegram_cfd = ConditionalFreqDist()

if __name__ == "__main__":
    try:
        trainer = NgramModelTrainer()

        try:
            nltk.data.find('tokenizers/punkt')
        except LookupError:
            nltk.download('punkt')

        print("N-gram model training environment ready!")
        print("Available frequency distribution classes:")
        print("- FreqDist: For unigram frequencies")
        print("- ConditionalFreqDist: For n-gram frequencies")
        print("- ConditionalProbDist: For probability distributions")
        print("- MLEProbDist: For maximum likelihood estimation")

    except Exception as e:
        print(f"Error during initialization: {str(e)}")
        raise


N-gram model training environment ready!
Available frequency distribution classes:
- FreqDist: For unigram frequencies
- ConditionalFreqDist: For n-gram frequencies
- ConditionalProbDist: For probability distributions
- MLEProbDist: For maximum likelihood estimation


#### 2A.2 - Calculate Frequency Distribution

Create a function named `calculate_ngram_freq` to calculate the frequency distribution of N-grams in the training data, employing the `FreqDist` class from the `nltk` library. The function should take the following parameter:

- `ngrams_list`: A list of N-grams for which to calculate the frequency distribution.

The function should return:

- A `FreqDist` object representing the Frequency Distribution of the input N-grams.

Using the function `calculate_ngram_freq`, create two variables named `bigram_freq_dist` and `fivegram_freq_dist`. Use the appropriate N-grams list.
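
For intuition, `FreqDist` behaves much like the standard library's `collections.Counter` (in NLTK 3 it is a subclass of it). The sketch below uses `Counter` directly on a handful of made-up bigrams, so it runs without NLTK; the data is illustrative and not taken from the corpus.

```python
from collections import Counter

toy_bigrams = [
    ('thou', 'art'), ('art', 'fair'), ('thou', 'art'),
    ('thou', 'hast'), ('art', 'fair'), ('thou', 'art'),
]

# Counting hashable tuples gives the frequency of each distinct bigram
toy_freq = Counter(toy_bigrams)

# most_common(k) returns the k most frequent (item, count) pairs, highest first
top_two = toy_freq.most_common(2)
print(top_two)
```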


**Expected Output:**


>```python
bigram_freq_dist

    FreqDist({('thou', 'art'): 543, ('king', 'henry'): 402, ('thou', 'hast'): 369, ('exeunt', 'scene'): 341, ('king', 'richard'): 278, ('let', 'us'): 259, ('william', 'shakespeare'): 257, ('let', "'s"): 255, ('art', 'thou'): 234, ('thou', 'shalt'): 233, ...})


>```python
fivegram_freq_dist

    FreqDist({('electronic', 'version', 'complete', 'works', 'william'): 218, ('version', 'complete', 'works', 'william', 'shakespeare'): 218, ('complete', 'works', 'william', 'shakespeare', 'copyright'): 218, ('works', 'william', 'shakespeare', 'copyright', '1990-1993'): 218, ('william', 'shakespeare', 'copyright', '1990-1993', 'world'): 218, ('shakespeare', 'copyright', '1990-1993', 'world', 'library'): 218, ('copyright', '1990-1993', 'world', 'library', 'inc.'): 218, ('1990-1993', 'world', 'library', 'inc.', 'provided'): 218, ('world', 'library', 'inc.', 'provided', 'project'): 218, ('library', 'inc.', 'provided', 'project', 'gutenberg'): 218, ...})


**References**
[FreqDist](https://www.nltk.org/api/nltk.probability.FreqDist.html)

In [17]:
from nltk import FreqDist

def calculate_ngram_freq(ngrams_list):
    try:
        freq_dist = FreqDist(ngrams_list)
        return freq_dist
    except Exception as e:
        print(f"Error calculating n-gram frequencies: {str(e)}")
        raise

try:
    bigram_freq_dist = calculate_ngram_freq(train_bigrams)
    fivegram_freq_dist = calculate_ngram_freq(train_fivegrams)

    print("\nbigram_freq_dist")
    print(f"FreqDist({dict(bigram_freq_dist.most_common(10))})")

    print("\nfivegram_freq_dist")
    print(f"FreqDist({dict(fivegram_freq_dist.most_common(10))})")

except Exception as e:
    print(f"Error in main execution: {str(e)}")
    raise



bigram_freq_dist
FreqDist({('thou', 'art'): 543, ('king', 'henry'): 402, ('thou', 'hast'): 369, ('exeunt', 'scene'): 341, ('king', 'richard'): 278, ('let', 'us'): 259, ('william', 'shakespeare'): 257, ('let', "'s"): 255, ('art', 'thou'): 234, ('thou', 'shalt'): 233})

fivegram_freq_dist
FreqDist({('electronic', 'version', 'complete', 'works', 'william'): 218, ('version', 'complete', 'works', 'william', 'shakespeare'): 218, ('complete', 'works', 'william', 'shakespeare', 'copyright'): 218, ('works', 'william', 'shakespeare', 'copyright', '1990-1993'): 218, ('william', 'shakespeare', 'copyright', '1990-1993', 'world'): 218, ('shakespeare', 'copyright', '1990-1993', 'world', 'library'): 218, ('copyright', '1990-1993', 'world', 'library', 'inc.'): 218, ('1990-1993', 'world', 'library', 'inc.', 'provided'): 218, ('world', 'library', 'inc.', 'provided', 'project'): 218, ('library', 'inc.', 'provided', 'project', 'gutenberg'): 218})


#### 2A.3 - Probability Estimation

Create a function named `estimate_ngram_probabilities` to estimate the conditional probabilities of N-grams, utilizing the `ConditionalFreqDist` and `ConditionalProbDist` classes along with a probability distribution such as `MLEProbDist` from the `nltk.probability` module. The function should take the following parameter:
    
- `ngrams_list`: A list of N-grams for which to estimate conditional probabilities.
   
The function should return:

- A `ConditionalProbDist` object representing the conditional probabilities of the input N-grams.


Using the function `estimate_ngram_probabilities`, create two variables named `bigram_prob_dist` and `fivegram_prob_dist`. Use the appropriate N-grams list.
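
Under maximum likelihood estimation, the conditional probability is a relative frequency: `P(word | context) = count(context, word) / count(context)`, where the context is the first `N-1` words of the N-gram. The stdlib-only sketch below works through that computation on toy data. The helper `mle_conditional_probs` is hypothetical and only mirrors what `ConditionalFreqDist` combined with `MLEProbDist` computes.

```python
from collections import Counter, defaultdict

def mle_conditional_probs(ngrams_list):
    """Map each (N-1)-word context to {next_word: relative frequency}."""
    counts = defaultdict(Counter)
    for ngram in ngrams_list:
        context, word = tuple(ngram[:-1]), ngram[-1]
        counts[context][word] += 1
    probs = {}
    for context, word_counts in counts.items():
        total = sum(word_counts.values())
        probs[context] = {w: c / total for w, c in word_counts.items()}
    return probs

toy_bigrams = [('thou', 'art'), ('thou', 'art'), ('thou', 'hast'), ('art', 'fair')]
toy_probs = mle_conditional_probs(toy_bigrams)

print(toy_probs[('thou',)])
print(toy_probs[('art',)])
```

Note that the context key is a tuple even for bigrams, for example `('thou',)`, so lookups must use a one-element tuple and not a bare string.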


**References**

[ConditionalProbDist](https://tedboy.github.io/nlps/generated/generated/nltk.ConditionalProbDist.html)

[MLEProbDist](https://www.nltk.org/api/nltk.probability.MLEProbDist.html?highlight=probability+probdist#nltk.probability.MLEProbDist)
    

In [19]:
from nltk import ConditionalFreqDist, ConditionalProbDist, MLEProbDist

def estimate_ngram_probabilities(ngrams_list):
    try:
        cfd = ConditionalFreqDist(
            (tuple(ngram[:-1]), ngram[-1]) 
            for ngram in ngrams_list
        )
        
        cpd = ConditionalProbDist(cfd, MLEProbDist)
        
        return cpd
    
    except Exception as e:
        print(f"Error estimating n-gram probabilities: {str(e)}")
        raise

try:
    bigram_prob_dist = estimate_ngram_probabilities(train_bigrams)
    fivegram_prob_dist = estimate_ngram_probabilities(train_fivegrams)

    print("\nExample Bigram Probabilities:")
    context = ('thou',)
    if context in bigram_prob_dist.conditions():
        print(f"\nProbabilities following '{context}':")
        for outcome in ['art', 'hast', 'shalt']:
            prob = bigram_prob_dist[context].prob(outcome)
            print(f"P('{outcome}' | '{context}') = {prob:.4f}")

    print("\nExample Fivegram Probabilities:")
    context = ('electronic', 'version', 'complete', 'works')
    if context in fivegram_prob_dist.conditions():
        print(f"\nProbabilities following {context}:")
        prob = fivegram_prob_dist[context].prob('william')
        print(f"P('william' | {context}) = {prob:.4f}")

except Exception as e:
    print(f"Error in main execution: {str(e)}")
    raise



Example Bigram Probabilities:

Probabilities following '('thou',)':
P('art' | '('thou',)') = 0.0992
P('hast' | '('thou',)') = 0.0674
P('shalt' | '('thou',)') = 0.0426

Example Fivegram Probabilities:

Probabilities following ('electronic', 'version', 'complete', 'works'):
P('william' | ('electronic', 'version', 'complete', 'works')) = 1.0000


### 2B - N-gram Model Training - part B

In this section we will create the core function of the N-gram model: `predict_next_word`.
    

#### 2B.1 - Import necessary libraries:

>```python
from nltk.probability import ConditionalProbDist

In [22]:
from nltk.probability import ConditionalProbDist
from nltk import FreqDist, ConditionalFreqDist, MLEProbDist

class NextWordPredictor:
    def __init__(self):
        self.bigram_prob_dist = None
        self.fivegram_prob_dist = None
        
    def load_models(self, bigram_prob_dist, fivegram_prob_dist):
        self.bigram_prob_dist = bigram_prob_dist
        self.fivegram_prob_dist = fivegram_prob_dist

if __name__ == "__main__":
    predictor = NextWordPredictor()
    
    print("Next Word Predictor initialized with:")
    print("- ConditionalProbDist support for probability distributions")
    print("- Support for bigram and fivegram models")
    print("Ready for model loading and prediction tasks")


Next Word Predictor initialized with:
- ConditionalProbDist support for probability distributions
- Support for bigram and fivegram models
Ready for model loading and prediction tasks


#### 2B.2 - Predict next word

Create a function named `predict_next_word` that utilizes the conditional probabilities to predict the most probable next word after a given context.

The function should take the following parameters:

- A context, which is a tuple of words that precedes the word to be predicted. The size of the context should be N-1 for an N-gram model.
- A `ConditionalProbDist` object that has been previously computed from the training data.
- An optional integer `top_n` that specifies how many of the most probable next words to return (default is 1, which returns only the most probable next word).


The function should handle cases where the context is not found in the `ConditionalProbDist`, returning the default value `<UNK>`.
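
The lookup-and-rank logic described above can be sketched without NLTK by using a plain dictionary as a stand-in for the `ConditionalProbDist`. The name `predict_from_table`, the toy probabilities, and the choice to return `(word, probability)` pairs are assumptions made for illustration; the specification itself does not fix the return shape.

```python
def predict_from_table(context, prob_table, top_n=1):
    """Return up to top_n (word, probability) pairs, or '<UNK>' for unseen contexts."""
    if context not in prob_table:
        return [('<UNK>', 0.0)]
    # Rank candidate next words by probability, highest first
    ranked = sorted(prob_table[context].items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:top_n]

toy_table = {
    ('thou',): {'art': 0.5, 'hast': 0.3, 'shalt': 0.2},
}

seen = predict_from_table(('thou',), toy_table, top_n=2)
unseen = predict_from_table(('we',), toy_table)
print(seen)
print(unseen)
```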

In [24]:
def predict_next_word(context, cpd, top_n=1):
    try:
        if context not in cpd.conditions():
            return [('<UNK>', 0.0)]
        
        prob_dist = cpd[context]
        
        next_words = [(word, prob_dist.prob(word)) for word in prob_dist.samples()]
        
        next_words.sort(key=lambda x: x[1], reverse=True)
        
        return next_words[:top_n]
        
    except Exception as e:
        print(f"Error predicting next word: {str(e)}")
        return [('<UNK>', 0.0)]

if __name__ == "__main__":
    example_context = ('thou',)
    try:
        predictions = predict_next_word(example_context, bigram_prob_dist, top_n=3)
        
        print(f"\nPredictions for context {example_context}:")
        for word, prob in predictions:
            print(f"Word: '{word}', Probability: {prob:.4f}")
            
    except Exception as e:
        print(f"Error in example: {str(e)}")



Predictions for context ('thou',):
Word: 'art', Probability: 0.0992
Word: 'hast', Probability: 0.0674
Word: 'shalt', Probability: 0.0426


### 3 - Text Correction

After training our N-gram model, the next step is to apply it to correct texts that contain placeholders indicating missing words. In this section, we will import the test text and use our model to predict the words that should replace the `<DELETED>` placeholders.


#### 3.1 - Correction Function

Create a function called `correct_text_with_ngrams` that searches for `<DELETED>` placeholders in the test data and uses the `predict_next_word` function to find the most probable replacement.

The function should take the following parameters:

- `text_data`: The list of tokens from the test data, including `<DELETED>` placeholders.
- `ngram_model`: The trained N-gram model to use for prediction (e.g., bigram or fivegram model).
- `n`: The order of the N-gram (e.g., 2 for bigrams, 3 for trigrams, etc.).

The function should return:

- `corrected_text`: A **list** of tokens where `<DELETED>` placeholders have been replaced with the most probable word predicted by the model.
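
A minimal sketch of the replacement loop, again using a plain dictionary in place of the trained model so that it runs on its own. `fill_placeholders` and the toy table are hypothetical. The left-padding with `<s>` assumes the same padding convention that was used when the training N-grams were created.

```python
def fill_placeholders(tokens, prob_table, n, placeholder='<DELETED>'):
    """Replace each placeholder with the most probable word given the n-1 preceding tokens."""
    result = list(tokens)
    for i, token in enumerate(result):
        if token != placeholder:
            continue
        context = tuple(result[max(0, i - (n - 1)):i])
        # Left-pad short contexts so they match the '<s>' padding used in training
        context = ('<s>',) * (n - 1 - len(context)) + context
        candidates = prob_table.get(context)
        result[i] = max(candidates, key=candidates.get) if candidates else '<UNK>'
    return result

toy_table = {
    ('<s>',): {'thou': 1.0},
    ('thou',): {'art': 0.6, 'hast': 0.4},
}
toy_tokens = ['<DELETED>', '<DELETED>', 'fair', '<DELETED>']
filled = fill_placeholders(toy_tokens, toy_table, n=2)
print(filled)
```

Because the list is updated in place while scanning left to right, an earlier prediction becomes part of the context for a later placeholder, as happens for the second token in this toy input.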

In [27]:
def correct_text_with_ngrams(text_data, cpd, n):
    corrected_text = text_data.copy()
    
    try:
        for i in range(len(corrected_text)):
            if corrected_text[i] == '<DELETED>':
                start_idx = max(0, i - (n - 1))
                context = tuple(corrected_text[start_idx:i])
                
                if len(context) < n - 1:
                    context = ('<s>',) * (n - 1 - len(context)) + context
                
                prediction = predict_next_word(context, cpd)[0]
                
                corrected_text[i] = prediction[0]
        
        return corrected_text
        
    except Exception as e:
        print(f"Error correcting text: {str(e)}")
        return text_data

if __name__ == "__main__":
    test_text = ['in', 'fair', 'verona', 'where', 'we', '<DELETED>', 'our', 'scene']
    
    try:
        corrected_bigram = correct_text_with_ngrams(test_text, bigram_prob_dist, 2)
        print("\nBigram correction:")
        print(f"Original: {' '.join(test_text)}")
        print(f"Corrected: {' '.join(corrected_bigram)}")
        
        corrected_fivegram = correct_text_with_ngrams(test_text, fivegram_prob_dist, 5)
        print("\nFivegram correction:")
        print(f"Original: {' '.join(test_text)}")
        print(f"Corrected: {' '.join(corrected_fivegram)}")
        
    except Exception as e:
        print(f"Error in example: {str(e)}")



Bigram correction:
Original: in fair verona where we <DELETED> our scene
Corrected: in fair verona where we <UNK> our scene

Fivegram correction:
Original: in fair verona where we <DELETED> our scene
Corrected: in fair verona where we <UNK> our scene
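
Both models return `<UNK>` for this sentence. That is consistent with how the training data was cleaned: `clean_text` removes stop words, and `we` is in NLTK's English stop-word list, so a context ending in `we` cannot appear among the trained conditions. One possible adjustment, sketched below under that assumption, is to filter the input with the same stop-word rule before building the context. The stop-word set here is a small illustrative subset, not NLTK's full list.

```python
toy_stop_words = {'in', 'where', 'we', 'our'}  # illustrative subset, not NLTK's full list

raw_tokens = ['in', 'fair', 'verona', 'where', 'we', '<DELETED>', 'our', 'scene']

# Keep the placeholder, drop stop words, mirroring the cleaning applied to training data
filtered = [t for t in raw_tokens if t == '<DELETED>' or t not in toy_stop_words]

idx = filtered.index('<DELETED>')
bigram_context = tuple(filtered[max(0, idx - 1):idx])
print(filtered)
print(bigram_context)
```

Whether the resulting context was actually observed in training still depends on the corpus, so a `<UNK>` result remains possible.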


#### 3.2 - Load and Clean Test Data

Import the `WS_test.txt` file using the `load_text` function, then apply the `clean_text` function to prepare the data for correction:

- Execute `load_text` to import the content of `WS_test.txt` and store it in a variable named `test_text`.
- Apply `clean_text` to `test_text` to obtain a tokenized and cleaned list of words, including `<DELETED>` placeholders, and store it in a variable named `cleaned_test_text`.

In [29]:
def load_and_clean_test_data():
    try:
        test_text = load_text('WS_test.txt')
        cleaned_test_text = clean_text(test_text)

        print("\nTest Data Statistics:")
        print(f"Total tokens in test text: {len(cleaned_test_text)}")
        print(f"Number of <DELETED> placeholders: {cleaned_test_text.count('<DELETED>')}")

        return test_text, cleaned_test_text
        
    except Exception as e:
        print(f"Error loading test data: {str(e)}")
        return None, None

if __name__ == "__main__":
    try:
        test_text, cleaned_test_text = load_and_clean_test_data()
        
        if cleaned_test_text:
            print("\nSample of cleaned test text (first 50 tokens):")
            print(' '.join(cleaned_test_text[:50]))
            
            print("\nExample placeholders in context:")
            for i, token in enumerate(cleaned_test_text):
                if token == '<DELETED>' and i > 2 and i < len(cleaned_test_text) - 3:
                    context = ' '.join(cleaned_test_text[i-3:i+4])
                    print(f"Context: ...{context}...")
                    if i > 10:
                        break
                        
    except Exception as e:
        print(f"Error in main execution: {str(e)}")



Test Data Statistics:
Total tokens in test text: 46857
Number of <DELETED> placeholders: 0

Sample of cleaned test text (first 50 tokens):
till nobles armed commons take thou vial bed let nurse lie thee thy chamber provided project gutenberg etext illinois benedictine college lady percy get ground deleted king moth behold sun-beamed eyes- commercially prohibited commercial distribution includes speak language 't may grow sprout high heaven recordation noble husband shall stiff stark

Example placeholders in context:


#### 3.3 - Applying Bigram Model

Apply the `correct_text_with_ngrams` function to the `cleaned_test_text` using the bigram `bigram_prob_dist` object created before. Save the output to a variable named `corrected_test_text_bigram`.

In [31]:
try:
    corrected_test_text_bigram = correct_text_with_ngrams(cleaned_test_text, bigram_prob_dist, 2)

    print("\nBigram Model Correction Results:")
    print(f"Original text length: {len(cleaned_test_text)} tokens")
    print(f"Corrected text length: {len(corrected_test_text_bigram)} tokens")

    print("\nExample corrections (original -> corrected):")
    for i, (orig, corr) in enumerate(zip(cleaned_test_text, corrected_test_text_bigram)):
        if orig == '<DELETED>':
            start = max(0, i - 3)
            end = min(len(cleaned_test_text), i + 4)
            orig_context = ' '.join(cleaned_test_text[start:end])
            corr_context = ' '.join(corrected_test_text_bigram[start:end])
            print(f"\nOriginal:  ...{orig_context}...")
            print(f"Corrected: ...{corr_context}...")
            if i > 20:
                break

except Exception as e:
    print(f"Error applying bigram correction: {str(e)}")



Bigram Model Correction Results:
Original text length: 46857 tokens
Corrected text length: 46857 tokens

Example corrections (original -> corrected):


#### 3.4 Applying Fivegram Model

Apply the `correct_text_with_ngrams` function to the `cleaned_test_text` using the fivegram `fivegram_prob_dist` object created before. Save the output to a variable named `corrected_test_text_fivegram`.

In [33]:
try:
    corrected_test_text_fivegram = correct_text_with_ngrams(cleaned_test_text, fivegram_prob_dist, 5)

    print("\nFivegram Model Correction Results:")
    print(f"Original text length: {len(cleaned_test_text)} tokens")
    print(f"Corrected text length: {len(corrected_test_text_fivegram)} tokens")

    print("\nExample corrections (original -> corrected):")
    for i, (orig, corr) in enumerate(zip(cleaned_test_text, corrected_test_text_fivegram)):
        if orig == '<DELETED>':
            start = max(0, i - 5)
            end = min(len(cleaned_test_text), i + 6)
            orig_context = ' '.join(cleaned_test_text[start:end])
            corr_context = ' '.join(corrected_test_text_fivegram[start:end])
            print(f"\nOriginal:  ...{orig_context}...")
            print(f"Corrected: ...{corr_context}...")
            if i > 20:
                break

except Exception as e:
    print(f"Error applying fivegram correction: {str(e)}")



Fivegram Model Correction Results:
Original text length: 46857 tokens
Corrected text length: 46857 tokens

Example corrections (original -> corrected):


### Evaluation

Evaluating the performance of our N-gram model is crucial to understanding its effectiveness in text correction tasks. In this section, we will calculate the accuracy of our model by comparing the predicted text against a validation text that contains the correct words.

#### Objectives:

1. **Import and Clean Validation Data**: Load and preprocess the `WS_validation.txt` file to obtain a clean list of tokens for accuracy comparison.

2. **Accuracy Calculation**: Use the supplied function `calculate_accuracy` to calculate the accuracy for both models.

#### Tasks:

1. **Load Validation Text**: Use the `load_text` function to import the `WS_validation.txt` file.

2. **Clean Validation Text**: Apply the `clean_text` function to the imported validation text to produce a list of clean tokens for comparison.

3. **Accuracy Calculation**: The `calculate_accuracy` function takes:

   - `test_tokens`: A list of tokens containing the `<DELETED>` placeholder (before correction).
   - `corrected_tokens`: A list of tokens that have been corrected by the N-gram model (either bigram or fivegram).
   - `validation_tokens`: A list of clean tokens from the validation text.
   
   The function should return the accuracy as a float, calculated as the number of correct predictions divided by the total number of predictions.
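
The calculation described above can be sketched as follows. This is not the supplied `calculate_accuracy`; the name `calculate_accuracy_sketch`, the `placeholder` parameter, and the choice to return `0.0` when there are no placeholders are assumptions for illustration. It also assumes the three token lists are aligned position by position.

```python
def calculate_accuracy_sketch(test_tokens, corrected_tokens, validation_tokens,
                              placeholder="<DELETED>"):
    """Fraction of placeholder positions where the prediction matches the validation token."""
    # Only positions that held a placeholder count as predictions.
    positions = [i for i, tok in enumerate(test_tokens) if tok == placeholder]
    if not positions:
        return 0.0  # assumption: no predictions means an accuracy of 0.0
    correct = sum(
        1
        for i in positions
        if i < len(corrected_tokens)
        and i < len(validation_tokens)
        and corrected_tokens[i] == validation_tokens[i]
    )
    return correct / len(positions)


# Toy data, invented for illustration only.
test = ["thou", "<DELETED>", "awake", "<DELETED>"]
corrected = ["thou", "art", "awake", "now"]
validation = ["thou", "shalt", "awake", "now"]
print(calculate_accuracy_sketch(test, corrected, validation))
# 0.5
```

Here one of the two placeholders ("now") was predicted correctly, so the accuracy is 1 / 2.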

#### Steps to Follow:

1. **Import and Clean Validation Data**:

   - Use the `load_text` function to load the `WS_validation.txt` as the validation text data. Store it as `validation_text`.
   - Apply the `clean_text` function to the loaded `validation_text` to obtain a list of tokens for accuracy comparison, named `cleaned_validation_text`.

2. **Accuracy Calculation**:

   - Call `calculate_accuracy` once per model and store the resulting accuracy scores in variables named `bigram_accuracy` and `fivegram_accuracy`, respectively.




In [35]:
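def clean_text_for_evaluation(text):
    # Lowercase and split on whitespace; keep <DELETED> placeholders intact
    # and strip non-alphanumeric characters from every other token.
    cleaned_tokens = []
    for token in text.lower().split():
        if token == '<deleted>':
            cleaned_tokens.append('<DELETED>')
            continue
        stripped = ''.join(c for c in token if c.isalnum())
        if stripped:  # drop tokens that consisted only of punctuation
            cleaned_tokens.append(stripped)
    return cleaned_tokens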

try:
    print("Loading and cleaning texts...")
    
    test_text = load_text('WS_test.txt')
    cleaned_test_text = clean_text_for_evaluation(test_text)
    
    validation_text = load_text('WS_validation.txt')
    cleaned_validation_text = clean_text_for_evaluation(validation_text)
    
    print(f"\nInitial text lengths:")
    print(f"Test text: {len(cleaned_test_text)} tokens")
    print(f"Validation text: {len(cleaned_validation_text)} tokens")
    
    print("\nApplying n-gram corrections...")
    corrected_test_text_bigram = correct_text_with_ngrams(cleaned_test_text, bigram_prob_dist, 2)
    corrected_test_text_fivegram = correct_text_with_ngrams(cleaned_test_text, fivegram_prob_dist, 5)
    
    deleted_positions = [i for i, token in enumerate(cleaned_test_text) if token == '<DELETED>']
    total_deleted = len(deleted_positions)
    
    print(f"\nFound {total_deleted} <DELETED> tokens to evaluate")
    
    bigram_correct = sum(
        1 for pos in deleted_positions
        if pos < len(cleaned_validation_text) and pos < len(corrected_test_text_bigram)
        and corrected_test_text_bigram[pos] == cleaned_validation_text[pos]
    )

    fivegram_correct = sum(
        1 for pos in deleted_positions
        if pos < len(cleaned_validation_text) and pos < len(corrected_test_text_fivegram)
        and corrected_test_text_fivegram[pos] == cleaned_validation_text[pos]
    )
    
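    # Store the scores under the variable names the assignment asks for;
    # guard against division by zero when no placeholders were found.
    bigram_accuracy = bigram_correct / total_deleted if total_deleted else 0.0
    fivegram_accuracy = fivegram_correct / total_deleted if total_deleted else 0.0

    print("\nModel Evaluation Results:")
    print("-" * 40)
    print(f"Bigram Model Accuracy:   {bigram_accuracy:.4f}")
    print(f"Fivegram Model Accuracy: {fivegram_accuracy:.4f}")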
    
    print("\nExample Predictions:")
    print("-" * 40)
    for pos in deleted_positions[:5]:  
        context_start = max(0, pos - 3)
        context_end = min(len(cleaned_test_text), pos + 4)
        context = ' '.join(cleaned_test_text[context_start:context_end])
        
        print(f"\nContext: ...{context}...")
        print(f"Correct word:     {cleaned_validation_text[pos]}")
        print(f"Bigram predict:   {corrected_test_text_bigram[pos]}")
        print(f"Fivegram predict: {corrected_test_text_fivegram[pos]}")

except Exception as e:
    print(f"\nError in evaluation: {str(e)}")
    import traceback
    print(traceback.format_exc())
    
print("\nDebug Information:")
print(f"Number of <DELETED> tokens: {sum(1 for t in cleaned_test_text if t == '<DELETED>')}")
print(f"Test text sample (first 50 tokens):")
print(' '.join(cleaned_test_text[:50]))
print(f"\nValidation text sample (first 50 tokens):")
print(' '.join(cleaned_validation_text[:50]))


Loading and cleaning texts...

Initial text lengths:
Test text: 85187 tokens
Validation text: 85187 tokens

Applying n-gram corrections...

Found 1740 <DELETED> tokens to evaluate

Model Evaluation Results:
----------------------------------------
Bigram Model Accuracy:   0.0448
Fivegram Model Accuracy: 0.0178

Example Predictions:
----------------------------------------

Context: ...get ground and <DELETED> of the king...
Correct word:     vantage
Bigram predict:   <UNK>
Fivegram predict: <UNK>

Context: ...stark and cold <DELETED> like death each...
Correct word:     appear
Bigram predict:   blood
Fivegram predict: <UNK>

Context: ...time against thou <DELETED> awake and this...
Correct word:     shalt
Bigram predict:   art
Fivegram predict: <UNK>

Context: ...thee from this <DELETED> shame knowest sir...
Correct word:     present
Bigram predict:   <UNK>
Fivegram predict: <UNK>

Context: ...dry round old <DELETED> knights it angred...
Correct word:     withered
Bigram predict:   man

### Export Models for CodeGrade evaluation

Using the "pickle" library:

- Export the model `unigram_model` as "unigram_model_japanese.pkl".
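
A quick way to confirm that an export worked is to load the file back and compare it with the original object. The sketch below shows that round trip with a small stand-in dictionary; the model contents and the file name `unigram_model_example.pkl` are made up for illustration, and a temporary directory is used so that no real export is overwritten.

```python
import os
import pickle
import tempfile

# Stand-in for a unigram model: word -> probability (values invented).
model = {"thou": 0.5, "art": 0.25, "shalt": 0.25}

# Write to a temporary directory so existing exports are left untouched.
path = os.path.join(tempfile.mkdtemp(), "unigram_model_example.pkl")

with open(path, "wb") as file:
    pickle.dump(model, file)

with open(path, "rb") as file:
    restored = pickle.load(file)

print(restored == model)
# True
```

Only unpickle files from sources you trust, since `pickle.load` can execute arbitrary code embedded in the file.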

In [37]:
import pickle
import os
from collections import Counter
import string

def clean_text(text):
    # Lowercase, split on whitespace, strip ASCII punctuation, and drop empty tokens.
    stripped = (
        ''.join(char for char in token if char not in string.punctuation)
        for token in text.lower().split()
    )
    return [token for token in stripped if token]

text = """Your Japanese text here"""  

tokens = clean_text(text)

def create_unigram_model(tokens):
    unigram_counts = Counter(tokens)
    total_tokens = sum(unigram_counts.values())
    return {word: count / total_tokens for word, count in unigram_counts.items()}

unigram_model = create_unigram_model(tokens)

try:
    print("Exporting unigram model...")
    with open('unigram_model_japanese.pkl', 'wb') as file:
        pickle.dump(unigram_model, file)
    print("Successfully exported unigram model to 'unigram_model_japanese.pkl'")
except Exception as e:
    print(f"Error exporting unigram model: {str(e)}")

try:
    print("\nExporting fivegram model...")
    with open('fivegram_prob_dist.pkl', 'wb') as file:
        pickle.dump(fivegram_prob_dist, file)
    print("Successfully exported fivegram model to 'fivegram_prob_dist.pkl'")
except Exception as e:
    print(f"Error exporting fivegram model: {str(e)}")

print("\nVerifying exported files:")
print(f"unigram_model_japanese.pkl exists: {os.path.exists('unigram_model_japanese.pkl')}")
print(f"fivegram_prob_dist.pkl exists: {os.path.exists('fivegram_prob_dist.pkl')}")


Exporting unigram model...
Successfully exported unigram model to 'unigram_model_japanese.pkl'

Exporting fivegram model...
Successfully exported fivegram model to 'fivegram_prob_dist.pkl'

Verifying exported files:
unigram_model_japanese.pkl exists: True
fivegram_prob_dist.pkl exists: True


This material is for enrolled students' academic use only and protected under U.S. Copyright Laws. This content must not be shared outside the confines of this course, in line with Eastern University's academic integrity policies. Unauthorized reproduction, distribution, or transmission of this material, including but not limited to posting on third-party platforms like GitHub, is strictly prohibited and may lead to disciplinary action. You may not alter or remove any copyright or other notice from copies of any content taken from BrightSpace or Eastern University’s website.

© Copyright Notice 2024, Eastern University - All Rights Reserved