# Aris Tsilifonis - mtn2323 - NLP Assignment 1 part 2

The second part of the project builds n-gram language models that can generate sentences. Bigram and trigram models are trained with add-k smoothing, and their performance is measured by perplexity on a held-out test set. Based on the probability distribution learned from the training set, each model predicts the next word of a sentence from the words that precede it.

In [1]:
import nltk
import random
import math
from nltk.util import ngrams
from collections import defaultdict, Counter
from nltk.corpus import treebank
import re
nltk.download("treebank")

[nltk_data] Downloading package treebank to /root/nltk_data...
[nltk_data]   Package treebank is already up-to-date!


True

## Download dataset, form train and test set
Download the Treebank corpus from the NLTK library and split it into a train set and a test set.

In [2]:
documents = treebank.fileids()
train_documents = documents[:170] # 170 news files in train set
test_documents = documents[170:] # 29 remaining news files in test set

## Create dataset
Three versions of the corpus will be created. The first one keeps the
original casing, so it contains both lowercase and uppercase characters. The
second one consists of lowercase characters only. The last one, called abstract
digits, replaces every digit with the \# symbol.
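
As a rough illustration of the three versions, the sketch below applies the same two transformations to a single tokenized sentence. The sentence and the helper name `abstract_digits` are made up for this example. The helper substitutes digits token by token, which gives the same tokens as joining and re-splitting the sentence as long as no token contains whitespace.

```python
import re

DIGIT_SYMBOL = "#"  # same replacement symbol as in the notebook

def abstract_digits(sentence):
    """Replace every digit in a tokenized sentence with DIGIT_SYMBOL."""
    return [re.sub(r"\d", DIGIT_SYMBOL, token) for token in sentence]

# Hypothetical toy sentence, not taken from the corpus
toy_sentence = ["Sales", "rose", "12.5", "%", "in", "1989", "."]

lowercased = [token.lower() for token in toy_sentence]
abstracted = abstract_digits(toy_sentence)

print(toy_sentence)  # original version
print(lowercased)    # lowercased version
print(abstracted)    # abstract digits version
```

Numbers of the same shape, such as all four-digit years, collapse into a single token in the abstract digits version.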

In [3]:
# Define the symbol to replace digits with
digit_replacement_symbol = "#"

def transform_documents(input_documents):
    original_documents = treebank.sents(input_documents)
    lowercased_documents = [list(map(str.lower, sentence)) for sentence in original_documents]
    # Replace digits in each sentence of original_documents
    # Join the words into a string, replace digits, then split back into words
    abstractdigit_documents = [
        re.sub(r'\d', digit_replacement_symbol, ' '.join(sentence)).split()
        for sentence in original_documents
    ]
    return abstractdigit_documents, lowercased_documents, original_documents

abstractdigit_train, lowercased_train, original_train = transform_documents(train_documents)
abstractdigit_test, lowercased_test, original_test = transform_documents(test_documents)

# Generate vocabulary

One vocabulary is created for each version of the dataset (original, lowercased, abstract digits). The generate_vocabulary function uses a Counter to measure the frequency of each token in the training corpus and returns the set of unique words that occur at least 3 times. The variables bigr and trigr hold the order of the n-grams (2 and 3). ka and kb are the two values of k tried for add-k smoothing (1 and 0.01). num_sent is the number of sentences that each n-gram model will generate.
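
To make the frequency threshold concrete, here is a minimal sketch on a three-sentence toy corpus. The corpus, the helper name `build_vocabulary` and the threshold of 2 are chosen only for this illustration.

```python
from collections import Counter

def build_vocabulary(sentences, min_count):
    """Return the set of tokens that occur at least min_count times."""
    counts = Counter(token for sentence in sentences for token in sentence)
    return {token for token, count in counts.items() if count >= min_count}

# Hypothetical toy corpus of tokenized sentences
toy_corpus = [
    ["the", "market", "fell"],
    ["the", "market", "rose"],
    ["the", "dollar", "fell"],
]

vocabulary = build_vocabulary(toy_corpus, min_count=2)
print(sorted(vocabulary))
```

Here "rose" and "dollar" occur only once, so they fall outside the vocabulary and would later be mapped to the out-of-vocabulary token.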

In [4]:
def generate_vocabulary(frequency_threshold, documents):
    word_counter = Counter()
    for sentence in documents:
        word_counter.update(sentence)
    return set(word for word, frequency in word_counter.items() if frequency >= frequency_threshold)

bigr, trigr,  min_frequency, ka, kb, num_sent = 2, 3, 3, 1, 0.01, 3

vocabulary_original = generate_vocabulary(min_frequency, original_train)
vocabulary_lowercased = generate_vocabulary(min_frequency, lowercased_train)
vocabulary_abstractdigit = generate_vocabulary(min_frequency, abstractdigit_train)

# Preprocess text

This function pads every sentence of the corpus with a start and an end symbol, ```<BOS>``` and ```<EOS>``` respectively. It then creates n-grams of the order given by the argument "n". N-grams in which the end symbol appears anywhere except the last position, or the start symbol appears anywhere except the first position, are discarded. With a single start and end symbol per sentence, as used here, this filter acts only as a safeguard. Finally, every word that is not in the specified vocabulary is replaced with the ```<UNK>``` token, while the start and end symbols are always kept.
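
The padding and out-of-vocabulary replacement described above can be sketched on one sentence as follows. This is a simplified illustration: `make_ngrams` is a plain-Python stand-in for `nltk.util.ngrams`, `pad_and_map` is a hypothetical helper, and unknown tokens are mapped before the n-grams are formed, which gives the same tuples as mapping them afterwards.

```python
def make_ngrams(tokens, n):
    """Plain-Python stand-in for nltk.util.ngrams: all contiguous n-token tuples."""
    return list(zip(*(tokens[i:] for i in range(n))))

def pad_and_map(sentence, vocabulary, n, bos="<BOS>", eos="<EOS>", unk="<UNK>"):
    """Pad one sentence, map out-of-vocabulary tokens to unk, return its n-grams."""
    padded = [bos] + sentence + [eos]
    mapped = [t if t in vocabulary or t in (bos, eos) else unk for t in padded]
    return make_ngrams(mapped, n)

# Hypothetical vocabulary and sentence
toy_vocabulary = {"the", "market", "fell"}
toy_bigrams = pad_and_map(["the", "dollar", "fell"], toy_vocabulary, n=2)
toy_trigrams = pad_and_map(["the", "dollar", "fell"], toy_vocabulary, n=3)

print(toy_bigrams)
print(toy_trigrams)
```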



In [5]:
def preprocess_text(start_symbol, end_symbol,oov_label, documents, vocabulary, n):
    # Initialize the list for storing the processed n-grams
    ngrams_processed = []

    for sentence in documents:
        # Add start and end symbols to each sentence before generating n-grams
        padded_sentence = [start_symbol] + sentence + [end_symbol]

        # Generate n-grams for the padded sentence
        sentence_ngrams = list(ngrams(padded_sentence, n))

        # Directly filter and append the n-grams that meet the criteria to the result list
        ngrams_processed.extend([ngram for ngram in sentence_ngrams if not (
            ngram[:n-1].count(end_symbol) > 0 or ngram[1:].count(start_symbol) > 0)])

    ngrams_replaced = []  # Initialize an empty list to hold the processed n-grams

    # Loop through each n-gram in the list of processed n-grams
    for ngram in ngrams_processed:
        replaced_ngram = []  # Initialize a list to hold the tokens of the current n-gram after processing

        # Loop through each token in the current n-gram
        for token in ngram:
            # Check if the token is in the vocabulary or is a start/end symbol
            if token in vocabulary or token in {start_symbol, end_symbol}:
                replaced_ngram.append(token)  # Keep the token as it is
            else:
                replaced_ngram.append(oov_label)  # Replace the token with the out-of-vocabulary label

        # Add the processed n-gram (as a tuple) to the list of replaced n-grams
        ngrams_replaced.append(tuple(replaced_ngram))


    return ngrams_replaced

# Train and test n-gram model

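The first function, compute_perplexity, computes the perplexity of the n-gram model from the smoothed n-gram probabilities and the base (prefix) probabilities. The second function, initialize_model, estimates those probabilities: the likelihood of each suffix (the last word of an n-gram) given its prefix (the preceding words), with add-k smoothing. To compute them, it counts how often each prefix (base) and each prefix-suffix pair (follow) occurs in the training n-grams. It returns two dictionaries: the smoothed probabilities of the n-grams seen in training and, for each seen prefix, the base probability given to any suffix that was not seen after it. If the prefix itself never occurred in training, compute_perplexity falls back to a uniform probability of 1 divided by the vocabulary size.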
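
Written out, the add-k estimate used by the code corresponds to the formulas below. The notation is introduced only for this illustration: $h$ is the prefix, $w$ the suffix, $C(\cdot)$ a count over the training n-grams, $k$ the smoothing factor and $|V|$ the size of the vocabulary set passed to the function.

$$
P(w \mid h) = \frac{C(h, w) + k}{C(h) + k\,|V|}
$$

For a suffix that was never seen after a known prefix, $C(h, w) = 0$, so the estimate reduces to the base probability

$$
P_{\text{base}}(h) = \frac{k}{C(h) + k\,|V|},
$$

and for a prefix that never occurred in training the code uses $1 / |V|$.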

To measure how well the model performs, we calculate its perplexity on n-grams from the test set. Lower perplexity means that the model assigns higher probability to the test data, so it indicates better performance.
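
The following sketch shows how perplexity reacts to the smoothing factor on a single made-up example. The counts (a prefix seen 10 times, followed 4 times by the word of interest) and the vocabulary size of 1000 are assumptions for illustration, and the helper names are not part of the notebook.

```python
import math

def perplexity(probabilities):
    """exp of the negative mean log-probability of the test n-grams."""
    log_total = sum(math.log(p) for p in probabilities)
    return math.exp(-log_total / len(probabilities))

def add_k_probability(pair_count, prefix_count, k, vocab_size):
    """Add-k estimate: (pair_count + k) / (prefix_count + k * vocab_size)."""
    return (pair_count + k) / (prefix_count + k * vocab_size)

# Hypothetical counts: prefix seen 10 times, followed 4 times by the suffix
p_k1 = add_k_probability(4, 10, k=1, vocab_size=1000)
p_k001 = add_k_probability(4, 10, k=0.01, vocab_size=1000)

# Perplexity over five identical hypothetical test n-grams
print(p_k1, perplexity([p_k1] * 5))
print(p_k001, perplexity([p_k001] * 5))
```

With sparse counts and a large vocabulary, k=1 moves most of the probability mass to unseen words, which lowers the probability of seen n-grams and raises perplexity. A small k keeps the estimate closer to the observed relative frequency.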

In [6]:
import math
from collections import defaultdict, Counter

def compute_perplexity(tested_ngrams, vocabulary_set, model, base_probs):
    log_probability_total = sum(
        math.log(model.get(gram[:-1], {}).get(gram[-1], base_probs.get(gram[:-1], 1 / len(vocabulary_set))))
        for gram in tested_ngrams
    )
    return math.exp(-log_probability_total / len(tested_ngrams))

def initialize_model(sequence_of_ngrams, smoothing_factor, vocabulary_set):
    language_model = defaultdict(lambda: defaultdict(float))
    total_counts = Counter(sequence_of_ngrams)
    base_probabilities = {}

    # Calculate base counts and probabilities
    base_counts = defaultdict(int)
    for seq, freq in total_counts.items():
        base = seq[:-1]
        base_counts[base] += freq
        language_model[base][seq[-1]] = freq  # Temporarily store raw frequency

    # Adjust counts for smoothing and calculate base probabilities
    vocab_length = len(vocabulary_set)
    for base, total in base_counts.items():
        base_probabilities[base] = smoothing_factor / (total + smoothing_factor * vocab_length)
        for follow in language_model[base]:
            language_model[base][follow] = (language_model[base][follow] + smoothing_factor) / (total + smoothing_factor * vocab_length)

    return language_model, base_probabilities

def process_text(train_sequences, vocabulary_set, test_sequences, method_flag, n_value, smoothing_factor):
    trained_model, probabilities = initialize_model(train_sequences, smoothing_factor, vocabulary_set)
    perplexity_value = compute_perplexity( test_sequences, vocabulary_set, trained_model, probabilities)

    print(f"{'Bigram' if n_value == 2 else 'Trigram'} model | Method = {method_flag} | Smoothing k = {smoothing_factor} | Perplexity = {perplexity_value}")

    return trained_model


# Generate sentences

The function starts each sentence from a randomly chosen prefix that begins with the init symbol. Since the model predicts a word from its prefix, the last words of the sentence so far are used to produce the next word. The next word is sampled from the words seen after that prefix, weighted by the probabilities of the n-gram model. The ```<UNK>``` token is never chosen because it is not in the allowed vocabulary. At each iteration, the program checks whether the stop symbol has been reached, which ends the sentence. If the algorithm does not find any valid next word, it appends the stop symbol, ends the current sentence and moves on to the next one, until the requested number of sentences has been generated.
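
A minimal bigram-only sketch of this sampling loop is shown below. The toy model and the helper name `sample_sentence` are hypothetical, and the sketch omits the vocabulary filter used in the notebook.

```python
import random

# Hypothetical bigram model: prefix tuple -> {next word: probability}
toy_model = {
    ("<BOS>",): {"the": 1.0},
    ("the",): {"market": 0.6, "dollar": 0.4},
    ("market",): {"fell": 0.7, "<EOS>": 0.3},
    ("dollar",): {"fell": 1.0},
    ("fell",): {"<EOS>": 1.0},
}

def sample_sentence(model, bos="<BOS>", eos="<EOS>"):
    """Sample one word at a time until eos is drawn or the prefix is unknown."""
    text = [bos]
    while text[-1] != eos:
        candidates = model.get((text[-1],))
        if not candidates:
            # No known continuation: close the sentence early
            text.append(eos)
            break
        words, weights = zip(*candidates.items())
        # Weighted random choice of the next word
        text.append(random.choices(words, weights=weights)[0])
    return text

random.seed(0)
for _ in range(3):
    print(" ".join(sample_sentence(toy_model)))
```

Because the next word is drawn at random, repeated calls can produce different sentences from the same model.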

In [7]:
def generate_sentences(init, stop, model, order, allowed_words, sentences_count):
    for index in range(sentences_count):
        # Starting with a randomly chosen n-gram that begins with the initial token.
        current_ngram = random.choice([key for key in model if key[0] == init])
        text = list(current_ngram[:order-1])

        # Build the sentence.
        while text[-1] != stop:
            current_prefix = tuple(text[-order+1:])
            # Stop if the current prefix is not in the model or cannot proceed further.
            if current_prefix not in model or not model[current_prefix]:
                text += [stop]
                break

            # Filter candidates based on vocabulary and end token, then choose the next word.
            valid_next_words = []
            for word, prob in model.get(current_prefix, {}).items():
                if word in allowed_words or word == stop:
                    valid_next_words.append((word, prob))

            if not valid_next_words:
                text += [stop]
                break

            # Section for weighted selection:
            new_word, probability = zip(*valid_next_words)
            next_word = random.choices(new_word, weights=probability)[0]
            text.append(next_word)

        # Output the generated sentence.
        formatted_sentence = " ".join(text)
        print(f"Sentence {index + 1}: {formatted_sentence}")

In [8]:
train0 = preprocess_text("<BOS>", "<EOS>", "<UNK>",original_train, vocabulary_original, bigr)
test0 = preprocess_text("<BOS>", "<EOS>", "<UNK>", original_test, vocabulary_original, bigr)
bigram_model_original_ka = process_text(train0 , vocabulary_original, test0 , "original", bigr, ka)

Bigram model | Method = original | Smoothing k = 1 | Perplexity = 383.50361532871557


## Generating sentences with bigram model trained on original text, smoothing factor=1

The sentences generated by this model have problems with grammar and syntax. Their content is not very meaningful, but there is some level of logical structure and coherence. Punctuation is used, but not very accurately.

The first sentence is very short and lacks the meaningful context that a verb could provide. Also, the opening quotation marks at the start of the sentence were not needed. The sentence ends with a full stop, which is correct punctuation.

The second sentence lacks meaning too. It refers to some nations' rights plan and to something named Moleculon. It is not coherent, since it contains many unrelated parts, and it is not grammatically correct.

The third sentence has slightly better meaning than the previous two, although its syntax and grammar are not accurate. The sentence suggests some warning signals based on evidence provided by test scores. Also, the part "Michael P. Sullivan , or had reached a regulatory life on what the House and consented to own tax rates of the paper on cost-sharing" could raise concern about tax rates and cost sharing. Finally, the ```<UNK>``` token is not included in the text.


In [9]:
generate_sentences("<BOS>", "<EOS>",bigram_model_original_ka, bigr, vocabulary_original, num_sent)

Sentence 1: <BOS> `` really only modestly . <EOS>
Sentence 2: <BOS> These nations over rights plan if its Moleculon 's in the company . <EOS>


In [10]:
train1 = preprocess_text("<BOS>", "<EOS>", "<UNK>", lowercased_train, vocabulary_lowercased, bigr)
test1 = preprocess_text("<BOS>", "<EOS>", "<UNK>", lowercased_test, vocabulary_lowercased, bigr)
bigram_model_lowercased_ka = process_text(train1, vocabulary_lowercased, test1, "lowercase", bigr, ka)

Bigram model | Method = lowercase | Smoothing k = 1 | Perplexity = 383.9460197558427


In [11]:
generate_sentences("<BOS>", "<EOS>",bigram_model_lowercased_ka, bigr, vocabulary_lowercased, num_sent)

Sentence 1: <BOS> she says *t*-1 . <EOS>
Sentence 2: <BOS> that *t*-1 . <EOS>
Sentence 3: <BOS> as profit *ich*-1 said 0 they add it in a magazine , st. louis sullivan , he owns and connecticut will continue *-3 to 8 1\/2 % rate on 30-year bond prices closed up 1 order *-1 talking , editor and gyrations of high-tech medical association through another only offers a slowing in october at $ 500 stocks in fiscal 1990 , from # 6 5\/8 , '' <EOS>


In [12]:
train2 = preprocess_text("<BOS>", "<EOS>", "<UNK>",original_train, vocabulary_original, bigr)
test2 = preprocess_text("<BOS>", "<EOS>", "<UNK>",original_test, vocabulary_original, bigr)
bigram_model_original_kb = process_text(train2 , vocabulary_original, test2,"original", bigr, kb)

Bigram model | Method = original | Smoothing k = 0.01 | Perplexity = 137.81108464477174


## Generating sentences with bigram model trained on original text, smoothing factor=0.01

Coherence is improved in this experiment, since the words follow a better logical structure. The content is more meaningful than when the smoothing factor was 1.

The first sentence consists of two clauses separated by a comma. The first part tells us that the screen shows a European economic slowdown in the top 10%. The second part suggests reforming law enforcement and economics. The content makes more sense than before, but it is still not very clear.

The second sentence is not very accurate and lacks meaningful context. One can understand that the troubled Heritage Media proposed something involving a California Health Organization. Also, "we can swap dealings" is a grammatically correct phrase
with proper syntax, but it does not fit in this sentence.

The third sentence has no coherence, and the numbers in it are confusing.
There are probably some claims about thoughts regarding Goldman Sachs, but no reasonable context is provided. "it should set" and "I know" show that correct syntax was used in places, but the sentence as a whole does not convey any meaningful message.

In [35]:
generate_sentences("<BOS>", "<EOS>", bigram_model_original_kb, bigr, vocabulary_original, num_sent)

Sentence 1: <BOS> But the screen shows two European economic slowdown in the top 10 % , '' said 0 insurance reform law enforcement and its economics . <EOS>
Sentence 2: <BOS> After troubled Heritage Media proposed * with a California Health Organization and we can swap dealings . <EOS>
Sentence 3: <BOS> This is thought 0 it should set *-1 $ 15,000 *U* and I know 0 *T*-1 Goldman Sachs . <EOS>


In [14]:
train3 = preprocess_text("<BOS>","<EOS>", "<UNK>",lowercased_train, vocabulary_lowercased, bigr)
test3 = preprocess_text("<BOS>", "<EOS>", "<UNK>", lowercased_test, vocabulary_lowercased, bigr)
bigram_model_lowercased_kb = process_text(train3 , vocabulary_lowercased, test3 ,"lowercase", bigr, kb)

Bigram model | Method = lowercase | Smoothing k = 0.01 | Perplexity = 143.78868465313255


In [15]:
generate_sentences("<BOS>", "<EOS>", bigram_model_lowercased_kb, bigr, vocabulary_lowercased, num_sent)

Sentence 1: <BOS> the concept 's a compensation program trading halt proposal comes as a woman in the previous contract modestly as the former colleagues that drug administration officials began in the first three with other requirements that post a woman in an editorial product was full membership in other with fewer than a package designed * to consider them . <EOS>
Sentence 2: <BOS> `` side of the usx case . <EOS>
Sentence 3: <BOS> attorneys have wa '' mr. baldwin , including net income for common shares . <EOS>


In [16]:
train4 = preprocess_text("<BOS>", "<EOS>", "<UNK>", original_train, vocabulary_original, trigr)
test4 = preprocess_text("<BOS>", "<EOS>", "<UNK>",  original_test, vocabulary_original, trigr)
trigram_model_original_ka = process_text(train4, vocabulary_original, test4, "original", trigr, ka)

Trigram model | Method = original | Smoothing k = 1 | Perplexity = 1504.6100128907715


## Generating sentences with trigram model trained on original text, smoothing factor=1

Coherence and logical structure are further improved in this experiment. The content is more meaningful than when bigrams were used.

The first sentence conveys a very clear message and is meaningful as a whole. It claims that Taiwan has improved its standing thanks to some company's work on clean-air legislation. Syntax, grammar and punctuation are all correct. Overall, the result is very satisfying since it conveys a logical message.

Unfortunately, the second sentence contains only one word, so not much information can be extracted from it. Nevertheless, the start and end tokens are used correctly, and the ```<UNK>``` token is avoided here too.

The third sentence is problematic. Although punctuation is used accurately and the syntax is correct, the sentence lacks proper meaning and does not convey any logical message. It could suggest something about the earnings of Cray Research, but no concrete meaning can be extracted. The grammar is wrong, since the plural form should be used (i.e. spinoffs instead of spinoff).

In [17]:
generate_sentences("<BOS>", "<EOS>",trigram_model_original_ka, trigr,  vocabulary_original, num_sent)


Sentence 1: <BOS> Taiwan has improved its standing with the company 's work on clean-air legislation . <EOS>
Sentence 2: <BOS> FEDERAL <EOS>
Sentence 3: <BOS> However , none of the spinoff caused Cray Research 's earnings . <EOS>


In [18]:
train5 = preprocess_text("<BOS>", "<EOS>", "<UNK>", lowercased_train, vocabulary_lowercased, trigr)
test5 = preprocess_text("<BOS>", "<EOS>", "<UNK>", lowercased_test, vocabulary_lowercased, trigr)
trigram_model_lowercased_ka = process_text(train5,  vocabulary_lowercased, test5, "lowercase", trigr, ka)


Trigram model | Method = lowercase | Smoothing k = 1 | Perplexity = 1470.5319718904307


In [19]:
generate_sentences("<BOS>", "<EOS>", trigram_model_lowercased_ka , trigr, vocabulary_lowercased, num_sent)


Sentence 1: <BOS> thus , an announcer talks about the new , lower in seoul , will be eligible for duty-free treatment . <EOS>
Sentence 2: <BOS> dividend growth next year based on actual <EOS>
Sentence 3: <BOS> 20 billion *u* of japanese investors as a result , ms. ganes . <EOS>


In [20]:
train6 = preprocess_text("<BOS>", "<EOS>", "<UNK>", original_train, vocabulary_original, trigr)
test6 = preprocess_text("<BOS>", "<EOS>", "<UNK>", original_test, vocabulary_original, trigr)
trigram_model_orginal_kb  = process_text(train6, vocabulary_original, test6, "original", trigr, kb)

Trigram model | Method = original | Smoothing k = 0.01 | Perplexity = 463.80467915524156


## Generating sentences with trigram model trained on original text, smoothing factor=0.01

This experiment suggests that trigram models produce shorter sentences than bigram models. Several of the short outputs end without final punctuation, which indicates that generation stopped early: the function appends the end symbol when the current prefix has no valid continuation, for example when the only word seen after it is ```<UNK>```. There is some coherence in this model too.
The first sentence has a logical structure, but the other two are confusing.

The first sentence is very logical and conveys a clear message. It states that Mary Beth's mother did not see any signs that inventories are excessive. Grammar, syntax and punctuation are entirely correct, although the closing quotation marks at the end of the sentence were not needed. Excessive inventory is inventory that exceeds projected demand and is therefore not expected to sell. Given the financial news content of the dataset, this sentence makes a lot of sense.

However, the second sentence is very short, and not many conclusions can be drawn from it.

Unfortunately, the third sentence is not correct. It lacks logical structure and syntax, showing that the model can be ineffective. The first part of the sentence suggests that there are signs of growth, but it does not state clearly why this could happen. It also states that an institution might need help, but the meaning is very vague. The grammar is not significantly incorrect, while the syntax definitely is. There are some characters that were not needed in the sentence, such as '*'.

In [47]:
generate_sentences("<BOS>", "<EOS>", trigram_model_orginal_kb , trigr, vocabulary_original, num_sent)

Sentence 1: <BOS> Mary Beth 's mother did n't see any signs that inventories are excessive . '' <EOS>
Sentence 2: <BOS> -LRB- During its <EOS>
Sentence 3: <BOS> While there were signs that growth is coming to a round of bilateral economic talks scheduled * for delivery last night to a savings institution needs your help now ! <EOS>


In [22]:
train7 = preprocess_text("<BOS>", "<EOS>", "<UNK>", lowercased_train, vocabulary_lowercased, trigr)
test7 = preprocess_text("<BOS>", "<EOS>", "<UNK>", lowercased_test, vocabulary_lowercased, trigr)
trigram_model_lowercased_kb = process_text(train7 ,vocabulary_lowercased, test7 , "lowercase", trigr, kb)

Trigram model | Method = lowercase | Smoothing k = 0.01 | Perplexity = 461.769817591675


In [23]:
generate_sentences("<BOS>", "<EOS>", trigram_model_lowercased_kb, trigr, vocabulary_lowercased, num_sent)

Sentence 1: <BOS> douglas madison , a problem created * by president bush 's job performance and 85 % of its new customers each week , u.s. trade representative carla hills to use their pill must still pay their share of a rival . <EOS>
Sentence 2: <BOS> he has a yield of about $ 1.1 billion *u* in food aid had <EOS>
Sentence 3: <BOS> associates say 0 *t*-1 was put $ 15,000 *u* fine ; john p. <EOS>


In [24]:
train8 = preprocess_text("<BOS>", "<EOS>", "<UNK>", abstractdigit_train, vocabulary_abstractdigit, bigr)
test8 = preprocess_text("<BOS>", "<EOS>", "<UNK>", abstractdigit_test, vocabulary_abstractdigit, bigr)
bigram_model_abstarctdigit_ka = process_text(train8 , vocabulary_abstractdigit, test8 , "abstract digit", bigr, ka)

Bigram model | Method = abstract digit | Smoothing k = 1 | Perplexity = 351.3990885618452


In [25]:
generate_sentences("<BOS>", "<EOS>", bigram_model_abstarctdigit_ka, bigr, vocabulary_abstractdigit, num_sent)

Sentence 1: <BOS> The bids for the U.S. Embassy . <EOS>
Sentence 2: <BOS> Douglas Madison , St. Mary Beth 's probably will direct an interview . <EOS>
Sentence 3: <BOS> One , with analysts say # Chrysler Corp. , Philadelphia Fed officials could use the union 's school was for results *T*-# to send your own intended *-# to set *-# *-# to increase in the very least costs throughout the IRS in computers and testing . <EOS>


In [26]:
train9 = preprocess_text("<BOS>", "<EOS>", "<UNK>", abstractdigit_train, vocabulary_abstractdigit, bigr)
test9 = preprocess_text("<BOS>", "<EOS>", "<UNK>", abstractdigit_test, vocabulary_abstractdigit, bigr)
bigram_model_abstarctdigit_kb = process_text(train9 , vocabulary_abstractdigit, test9 , "abstract digit", bigr, kb)

Bigram model | Method = abstract digit | Smoothing k = 0.01 | Perplexity = 125.38175697463285


In [27]:
generate_sentences("<BOS>", "<EOS>", bigram_model_abstarctdigit_kb, bigr, vocabulary_abstractdigit, num_sent)

Sentence 1: <BOS> The cases , however , teacher just an analyst who *T*-## came together not be available *ICH*-# on alcohol and Scott Paper gained ##.## billion yen compared with their daily contracts traded portfolios are common form of UAL stock index , Calif. , five bells is n't *?* the situation , Republican Rudolph Giuliani 's meeting in October #### second incentive plan the fourth among investors have had been hit the many warrants , who *T*-### are no more customers by major markets themselves with cheaper electrical current-carrying capacity ### million to join a Chemical Banking Corp. , deal is that firms ' `` Insurance Institute . <EOS>
Sentence 2: <BOS> In an appeal , referred the appropriations bill is what *T*-## took over policy , or anyone has `` really need the rates , both Massachusetts banks . '' <EOS>
Sentence 3: <BOS> Mrs. Yeargin 's short-term investor is just as well below #### , if you want *-# to cut the Bank . <EOS>


In [28]:
train10 = preprocess_text("<BOS>", "<EOS>", "<UNK>", abstractdigit_train, vocabulary_abstractdigit, trigr)
test10 = preprocess_text("<BOS>", "<EOS>", "<UNK>", abstractdigit_test, vocabulary_abstractdigit, trigr)
trigram_model_abstarctdigit_ka = process_text(train10 , vocabulary_abstractdigit, test10 , "abstract digit", trigr, ka)

Trigram model | Method = abstract digit | Smoothing k = 1 | Perplexity = 1389.017984976645


In [29]:
generate_sentences("<BOS>", "<EOS>", trigram_model_abstarctdigit_ka, trigr, vocabulary_abstractdigit, num_sent)

Sentence 1: <BOS> Judge Curry added an additional $ ## million *U* a share , on the way for # ### million *U* a bottle . <EOS>
Sentence 2: <BOS> Sir Peter will succeed Mr. McAlpine could be lowered *-# soon , immediately bid up shares of common stock of record Nov. ## , vice president with Morgan Stanley there , '' so `` by request , '' he said # its trucks and minivans to be signed *-## by the Giuliani ads , Mr. Yamamoto said , `` your TV ad needs *-# to appear in the futures market . <EOS>
Sentence 3: <BOS> Macmillan\/McGraw says # the significant drop in interest rates and the medical schools of <EOS>


In [30]:
train11 = preprocess_text("<BOS>", "<EOS>", "<UNK>", abstractdigit_train, vocabulary_abstractdigit, trigr)
test11 = preprocess_text("<BOS>", "<EOS>", "<UNK>", abstractdigit_test, vocabulary_abstractdigit, trigr)
trigram_model_abstarctdigit_kb = process_text(train11 , vocabulary_abstractdigit, test11 , "abstract digit", trigr, kb)

Trigram model | Method = abstract digit | Smoothing k = 0.01 | Perplexity = 412.4322411400487


In [31]:
generate_sentences("<BOS>", "<EOS>", trigram_model_abstarctdigit_kb, trigr, vocabulary_abstractdigit, num_sent)

Sentence 1: <BOS> Taiwan has improved its standing with the language '' in Asia to $ #.# billion *U* of high-yield , <EOS>
Sentence 2: <BOS> Charles D. <EOS>
Sentence 3: <BOS> USX said # they may very well not <EOS>


| Language Model | Original Text | Lowercased Text | Abstract Digits Text |
|----|----|----|----|
| Bigrams (k=1) | 383.50361532871557 | 383.9460197558427 | 351.3990885618452 |
| Bigrams (k=0.01) | 137.81108464477174 | 143.78868465313255 | 125.38175697463285 |
| Trigrams (k=1) | 1504.6100128907715 | 1470.5319718904307 | 1389.017984976645 |
| Trigrams (k=0.01) | 463.80467915524156 | 461.769817591675 | 412.4322411400487 |

From the table above, it is observed that the better of the two smoothing factors for both bigrams and trigrams is 0.01. Bigrams always have lower perplexity than trigrams in this experiment, which means that the bigram models predict the test set better. A likely reason is that trigram counts are much sparser, so add-k smoothing moves more probability mass to unseen trigrams. Lowercasing slightly increases the perplexity for bigrams (with k=0.01, from about 137.8 to 143.8). In contrast, it slightly reduces the perplexity for trigrams compared to the original text. So, lowercased text is more suitable for trigrams than for bigrams. The abstract digit transformation improves on the scores of the original text in all of the experiments, regardless of the smoothing factor k. Its perplexity scores are consistently the lowest in the experiment, which means that the model has the best performance on this type of data.
Regarding the range of the values, perplexity can be read as the number of equally likely words the model is choosing between at each step. Values of 120-150, the lowest obtained here, indicate that the model predicts the text relatively well. When the values range from roughly 350 to 460, the effectiveness of the model drops, but it still captures part of the structure of the corpus. Perplexity of 1000 to 1500 shows that the model's accuracy is poor and that it cannot predict the text properly.
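
One way to build intuition for these ranges is to note that a model which spreads its probability uniformly over V words has a perplexity of exactly V. The short check below uses made-up vocabulary sizes close to the ranges discussed above and a hypothetical test set of 50 n-grams.

```python
import math

def perplexity(probabilities):
    """exp of the negative mean log-probability of the test n-grams."""
    log_total = sum(math.log(p) for p in probabilities)
    return math.exp(-log_total / len(probabilities))

# Hypothetical uniform models over V equally likely words
for vocab_size in (125, 400, 1500):
    uniform_probabilities = [1 / vocab_size] * 50
    print(vocab_size, round(perplexity(uniform_probabilities), 6))
```

Under this reading, a perplexity near 125 means the model is about as uncertain as a uniform choice among 125 words, while a perplexity near 1500 corresponds to a uniform choice among 1500 words.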




# ChatGPT-4 Prompt

I commented out the code that was not working and added some lines to make it functional. The edits can be seen by reading the code below.

---------------------------------

Prompt:

I'm working on a natural language processing project and need to preprocess my text data. Could you provide a Python function that takes in several parameters including a start symbol, an end symbol, an out-of-vocabulary label, a list of sentences (documents), a vocabulary set, and an integer n for generating n-grams? The function should add start and end symbols to each sentence, generate n-grams, and filter out any n-grams that have the start symbol not at the beginning or the end symbol not at the end. Additionally, it should replace any tokens not found in the provided vocabulary with the out-of-vocabulary label. Finally, the function should return the list of processed n-grams. It's important that the start and end symbols are treated as valid tokens and not replaced


### Follow this algorithm:
###  Initialize the list for storing the processed n-grams
###  Add start and end symbols to each sentence before generating n-grams
###         Generate n-grams for the padded sentence
###         Directly filter and append the n-grams that meet the criteria to the result list
### Initialize an empty list to hold the processed n-grams
###     Loop through each n-gram in the list of processed n-grams
    
###         Initialize a list to hold the tokens of the current n-gram after processing

###         Loop through each token in the current n-gram
###           Keep the token as it is
     
###                  Replace the token with the out-of-vocabulary label
###        Add the processed n-gram (as a tuple) to the list of replaced n-grams

In [32]:
# Removed these lines (with star)
# *from typing import List, Set, Tuple

# *def generate_ngrams(sentences: List[str], vocab: Set[str], start_symbol: str, end_symbol: str, out_of_vocab: str, n: int) -> List[Tuple[str, ...]]:
# *    def ngrams(input_list, n):
# *        return [tuple(input_list[i:i+n]) for i in range(len(input_list)-n+1)]
def preprocess_text2(start_symbol, end_symbol,out_of_vocab, sentences, vocab, n):
    processed_ngrams = []
    # Adding start and end symbols, and generating n-grams
    for sentence in sentences:

        # Removed .split() from sentence, it was not needed
        padded_sentence = [start_symbol] * (n-1) + sentence + [end_symbol] * (n-1)
        sentence_ngrams = ngrams(padded_sentence, n)

        # added this line
        # * Directly filter and append the n-grams that meet the criteria to the result list
        processed_ngrams.extend([ngram for ngram in sentence_ngrams if not (
            ngram[:n-1].count(end_symbol) > 0 or ngram[1:].count(start_symbol) > 0)])

        # Removed these lines
        # # *Filtering out n-grams with misplaced start and end symbols
        # *for ngram in sentence_ngrams:
        #  *   if ngram[0] == start_symbol and ngram[-1] == end_symbol:
        #   *      processed_ngrams.append(ngram)

    # Replace tokens not in vocabulary
    replaced_ngrams = []
    for ngram in processed_ngrams:
        replaced_ngram = []
        for token in ngram:
            if token in vocab or token in [start_symbol, end_symbol]:
                replaced_ngram.append(token)
            else:
                replaced_ngram.append(out_of_vocab)
        replaced_ngrams.append(tuple(replaced_ngram))

    return replaced_ngrams


# ChatGPT-4 prompts

First prompt:
I need a Python script that accomplishes the following tasks for working with n-gram language models, specifically designed to calculate the perplexity of a language model on a set of test data:

Compute Perplexity: Given a list of tested n-grams, a set representing the vocabulary, a pre-trained language model, and a dictionary of base probabilities for unseen n-grams, calculate the perplexity of the model on the test data. Perplexity should be computed as the exponent of the negative average log probability across all tested n-grams.

Initialize Model: Initialize a language model from a sequence of n-grams. This involves:

Counting the occurrences of each n-gram in the training data.
Applying a smoothing factor to adjust for n-grams that do not appear in the training data, ensuring that the model can handle unseen n-grams.
Calculating the smoothed probabilities for each n-gram and the base probability for unseen n-grams within known contexts.
The model should be represented as a nested dictionary where the first level corresponds to the (n-1)-gram contexts, and the second level maps the nth word of the n-gram to its probability given the context.

Process Text: This function should tie everything together. It takes as input:

The sequences of n-grams from the training data.
A vocabulary set derived from the training data.
The sequences of n-grams from the test data.
A method flag indicating the type of smoothing applied ('Laplace' for example).
An n_value indicating whether the model is a bigram or trigram model.
A smoothing factor to be applied during model initialization.
The function should train the model with the training data and then compute and print the perplexity of the test data.

Please include appropriate comments in the code to explain the purpose and functionality of each part


------------------------------------
Second prompt:

Input is of this form:
```[('<BOS>', '<UNK>', '<UNK>'), ('<UNK>', '<UNK>', ','), ('<UNK>', ',', '61')```

In [33]:
import math
from collections import defaultdict

def train_model2(training_ngrams, vocabulary, n_value, smoothing='Laplace', smoothing_factor=1):
    """
    Initialize the language model from the sequence of n-grams.

    :param training_ngrams: A list of n-grams from the training data.
    :param vocabulary: A set representing the unique words in the training data, including special tokens.
    :param n_value: The n in n-gram (e.g., 2 for bigrams, 3 for trigrams).
    :param smoothing: The type of smoothing applied ('Laplace' for now).
    :param smoothing_factor: The smoothing factor to adjust for unseen n-grams.
    :return: A nested dictionary representing the n-gram model and base probabilities.
    """
    model = defaultdict(lambda: defaultdict(float))
    context_counts = defaultdict(int)

    # Count occurrences of each n-gram and its context
    for ngram in training_ngrams:
        context = ngram[:-1]
        word = ngram[-1]
        model[context][word] += 1
        context_counts[context] += 1

    # Calculate probabilities with smoothing for each context
    for context, words in model.items():
        total_count = context_counts[context]
        denominator = total_count + (smoothing_factor * len(vocabulary))
        for word in words:
            words[word] = (words[word] + smoothing_factor) / denominator

    # Calculate base probability for unseen n-grams in any context
    base_probability = smoothing_factor / (len(vocabulary) + smoothing_factor * len(vocabulary))

    return model, base_probability

def compute_perplexity2(test_ngrams, model, base_probability):
    """
    Compute the perplexity of the model on the test data.

    :param test_ngrams: A list of tested n-grams, including special tokens.
    :param model: The trained n-gram model.
    :param base_probability: Base probability for unseen n-grams.
    :return: The perplexity score.
    """
    log_probability_sum = 0
    for ngram in test_ngrams:
        context = ngram[:-1]
        word = ngram[-1]
        probability = model[context].get(word, base_probability)
        log_probability_sum += math.log(probability)

    average_log_probability = log_probability_sum / len(test_ngrams)
    perplexity = math.exp(-average_log_probability)

    return perplexity

def process_text2(training_ngrams, vocabulary, test_ngrams, n_value, smoothing='Laplace', smoothing_factor=1):
    """
    Train the model with the training data and compute the perplexity of the test data.

    :param training_ngrams: Sequences of n-grams from the training data, including special tokens.
    :param vocabulary: A set derived from the training data, including special tokens.
    :param test_ngrams: Sequences of n-grams from the test data, including special tokens.
    :param n_value: Indicates the n-gram model (e.g., 2 for bigrams, 3 for trigrams).
    :param smoothing: Type of smoothing applied.
    :param smoothing_factor: Smoothing factor during model initialization.
    """
    # Initialize and train the model
    model, base_probability = train_model2(training_ngrams, vocabulary, n_value, smoothing, smoothing_factor)

    # Compute and print the perplexity of the test data
    perplexity = compute_perplexity2(test_ngrams, model, base_probability)
    print(f"Perplexity: {perplexity} of {'Original text with Bigrams' if n_value == 2 else 'Original text with Trigrams'} and k={smoothing_factor} ")

# Example Usage
# Note: only this main block and the print statement in process_text2 were changed,
# so that the functions are called with the notebook's own variables
if __name__ == "__main__":

    train0new = preprocess_text2("<BOS>", "<EOS>", "<UNK>", original_train, vocabulary_original, bigr)
    test0new = preprocess_text2("<BOS>", "<EOS>", "<UNK>", original_test, vocabulary_original, bigr)

    train1new = preprocess_text2("<BOS>", "<EOS>", "<UNK>", original_train, vocabulary_original, trigr)
    test1new = preprocess_text2("<BOS>", "<EOS>", "<UNK>", original_test, vocabulary_original, trigr)


    # process_text2 prints the perplexity and returns None, so these variables hold None
    bigram_model_original_ka2 = process_text2(train0new, vocabulary_original, test0new, bigr, 'Laplace', ka)

    bigram_model_original_kb2 = process_text2(train0new, vocabulary_original, test0new, bigr, 'Laplace', kb)

    trigram_model_original_ka2 = process_text2(train1new, vocabulary_original, test1new, trigr, 'Laplace', ka)

    trigram_model_original_kb2 = process_text2(train1new, vocabulary_original, test1new, trigr, 'Laplace', kb)


Perplexity: 437.91990460192983 of Original text with Bigrams and k=1 
Perplexity: 296.8463590843814 of Original text with Bigrams and k=0.01 
Perplexity: 2228.697161847104 of Original text with Trigrams and k=1 
Perplexity: 5519.48279959992 of Original text with Trigrams and k=0.01 


# Comments on the ChatGPT code

The results show a similar trend to the original code: on bigrams, the smoothing factor k=0.01 performs better (lower perplexity) than k=1. The bigram scores differ from the initial results by only 50-100, which is not a substantial difference. The trigram scores are significantly worse, at 500-5000 units higher than the initial results, and in this run k=0.01 gives a higher trigram perplexity than k=1. This presumably happens because the perplexity in this implementation is calculated in a more straightforward way: every n-gram that was not seen in training receives the same constant `base_probability`, regardless of its context.
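
To make the role of the fallback probability concrete, here is a minimal sketch of add-k smoothing in which an unseen n-gram is scored from its own context count, k / (count(context) + k·|V|), rather than from a single constant such as the `base_probability` returned by `train_model2`. The function names, the toy n-grams and the vocabulary size are assumptions made for illustration. This is not the code that produced the numbers above, and it makes no claim about what perplexities it would give on the treebank data.

```python
import math
from collections import Counter


def add_k_probability(ngram, ngram_counts, context_counts, vocab_size, k):
    """Add-k estimate of P(word | context); also covers unseen n-grams.

    Counter returns 0 for missing keys, so an unseen n-gram in a known
    context gets k / (count(context) + k * V), and an n-gram with an
    unseen context gets k / (k * V) = 1 / V.
    """
    context = ngram[:-1]
    return (ngram_counts[ngram] + k) / (context_counts[context] + k * vocab_size)


def perplexity(test_ngrams, ngram_counts, context_counts, vocab_size, k):
    """exp of the negative average log probability over the test n-grams."""
    log_sum = sum(
        math.log(add_k_probability(g, ngram_counts, context_counts, vocab_size, k))
        for g in test_ngrams
    )
    return math.exp(-log_sum / len(test_ngrams))


# Toy bigram data (hypothetical): two sentences, "a b" and "a"
train = [("<BOS>", "a"), ("a", "b"), ("b", "<EOS>"), ("<BOS>", "a"), ("a", "<EOS>")]
ngram_counts = Counter(train)
context_counts = Counter(g[:-1] for g in train)
vocab_size = 3  # assumed set of predictable tokens: a, b, <EOS>

print(add_k_probability(("a", "b"), ngram_counts, context_counts, vocab_size, 1))  # 0.4
print(add_k_probability(("a", "a"), ngram_counts, context_counts, vocab_size, 1))  # 0.2
print(perplexity([("a", "b")], ngram_counts, context_counts, vocab_size, 1))
```

With this formulation the probabilities of all vocabulary words given a fixed context sum to 1, which a constant fallback does not guarantee in general.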