# **My Personal Language Model**

_Developed by: Paolo Edni Andryn V. Espiritu_
<br>
_Course: CSC714M | Theories In Natural Language Processing_

In this project, I build a **probabilistic language model** from a dataset of my Messenger conversations from 2021 to 2024. The objective is a model that captures how I speak to people (e.g., peers, friends, family, and even strangers).

As suggested in class, I use **Jadesse Chan's** statistical trigram language model, available at https://github.com/jadessechan/Text-Prediction.


## Import Libraries


In [None]:
import helper
import re
import unicodedata
import string
import random
import nltk
from nltk.probability import ConditionalFreqDist
from nltk.tokenize import word_tokenize

## Data Preparation

The first step is preparing the data for the language model. I wrote a scraper that retrieves one user's messages given the directory of the downloaded Messenger folder (`path`) and the sender's name (`sender_name`). The `config` file used by the scraper lists texts that I specifically excluded; you may create your own config file for your case.


In [None]:
# Run the scraper

path = "./messenger_data"
sender_name = "Paolo Espiritu"

helper.scrape_messenger_data(path, sender_name)
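
The internals of `helper.scrape_messenger_data` are not shown in this notebook, so the following is only a rough sketch of the kind of filtering such a scraper might perform. It assumes each conversation file in the export is JSON with a `messages` list whose entries carry `sender_name` and `content` fields; the function name `collect_sender_messages` and the `excluded` argument (standing in for the config file) are hypothetical.

```python
import json


def collect_sender_messages(conversation, sender_name, excluded=()):
    """Return the text of every message written by sender_name.

    `conversation` is a dict parsed from one export file (assumed layout).
    `excluded` is a hypothetical stand-in for the config file's blocklist.
    """
    kept = []
    for message in conversation.get("messages", []):
        text = message.get("content")
        # Skip other senders and entries without text (e.g. photos).
        if message.get("sender_name") != sender_name or not text:
            continue
        # Skip messages containing any excluded phrase.
        if any(phrase in text for phrase in excluded):
            continue
        kept.append(text)
    return kept


# Tiny in-memory example instead of a real export file.
sample = json.loads("""
{"messages": [
  {"sender_name": "Paolo Espiritu", "content": "see you later"},
  {"sender_name": "Someone Else", "content": "ok"},
  {"sender_name": "Paolo Espiritu", "content": "You sent an attachment."},
  {"sender_name": "Paolo Espiritu"}
]}
""")
messages = collect_sender_messages(
    sample, "Paolo Espiritu", excluded=["sent an attachment"]
)
print(messages)
```

Keeping the filtering in a function that takes already-parsed data makes it easy to try on a few hand-written messages before running it over a full export.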

Next, let us check the size of the corpus: the number of tokens and the size of its vocabulary (the number of distinct tokens). For this, we use the **word_tokenize** function from the **nltk** library.


In [None]:
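dataset_path = "./dataset.txt"

# word_tokenize needs NLTK's Punkt tokenizer data; newer NLTK releases
# may ask for "punkt_tab" instead of "punkt".
nltk.download("punkt")

# Assumes dataset.txt was written as UTF-8
with open(dataset_path, encoding="utf-8") as input_file:
    tokens = word_tokenize(input_file.read())

print("tokens: {} | vocabulary size: {}".format(len(tokens), len(set(tokens))))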
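
To make the two numbers concrete, here is a small sketch on a made-up sentence. It uses `str.split` instead of `word_tokenize` so it runs without NLTK data, which means punctuation handling differs from the cell above; the sample text and the `type_token_ratio` name are illustrative assumptions.

```python
from collections import Counter

sample_text = "see you later see you soon"  # made-up sample, not from the dataset
sample_tokens = sample_text.split()

counts = Counter(sample_tokens)
num_tokens = len(sample_tokens)      # every occurrence counts
vocabulary_size = len(counts)        # distinct word types only
type_token_ratio = vocabulary_size / num_tokens

print("tokens:", num_tokens, "| vocabulary size:", vocabulary_size)
print("most common:", counts.most_common(2))
```

A low ratio of vocabulary size to token count suggests that a few words are repeated often, which is typical of chat data.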

## Statistical Trigram Language Model

With `dataset.txt` generated, we can proceed to Jadesse Chan's statistical language model. In this section, the model extends a phrase entered by the user. Ideally, the output should sound like the person whose messages make up the dataset. Note that, as adapted in the cell below, the active code conditions each prediction on the previous word only (bigram counts); the original trigram lines are left commented out.

In my case, I need to come up with phrases that I often use, so that I can judge whether the output sounds like something I would say.
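
As a minimal sketch of the idea behind the model, the snippet below counts which word follows which in a toy sentence and turns those counts into conditional probabilities, P(next | previous) = count(previous, next) / count(previous, any word). It uses plain dictionaries rather than NLTK's `ConditionalFreqDist`, conditions on one previous word only, and the toy corpus and the function name `build_bigram_probabilities` are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def build_bigram_probabilities(words):
    """Map each word to a dict of {next_word: probability}."""
    counts = defaultdict(Counter)
    for previous, following in zip(words, words[1:]):
        counts[previous][following] += 1

    probabilities = {}
    for previous, followers in counts.items():
        total = sum(followers.values())
        probabilities[previous] = {
            word: count / total for word, count in followers.items()
        }
    return probabilities


toy_words = "i am here i am late i will go".split()  # toy corpus
toy_model = build_bigram_probabilities(toy_words)

# Candidates after "i", ranked from most to least likely.
ranked = sorted(toy_model["i"], key=toy_model["i"].get, reverse=True)
print(toy_model["i"], ranked)
```

Here "am" follows "i" twice and "will" follows it once, so "am" receives two thirds of the probability mass for that context.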


In [None]:
"""
    Install NLTK resources
"""

nltk.download("punkt")
nltk.download("wordnet")


"""
    Normalize text, remove unnecessary characters, 
    perform regex parsing, and make lowercase
"""


def filter(text):
    # normalize text
    text = (
        unicodedata.normalize("NFKD", text)
        .encode("ascii", "ignore")
        .decode("utf-8", "ignore")
    )
    # replace HTML tags with ' '
    text = re.sub(r"<.*?>", " ", text)
    # remove punctuation
    text = text.translate(str.maketrans("", "", string.punctuation))
    # keep letters only (digits and other symbols become spaces)
    text = re.sub(r"[^a-zA-Z]", " ", text)
    # replace newline with space
    text = re.sub("\n", " ", text)
    # lower case
    text = text.lower()
    # split and join the words
    text = " ".join(text.split())

    return text


"""
    Tokenize remaining words
    and perform lemmatization
"""


def clean(text):
    tokens = nltk.word_tokenize(text)
    wnl = nltk.stem.WordNetLemmatizer()

    # lemmatize each token
    return [wnl.lemmatize(word) for word in tokens]


"""
    Make a language model using a dictionary, trigrams, 
    and calculate word probabilities
"""


def n_gram_model(text):
    trigrams = list(
        nltk.ngrams(
            text,
            3,
            pad_left=True,
            pad_right=True,
            left_pad_symbol="<s>",
            right_pad_symbol="</s>",
        )
    )
    bigrams = list(
        nltk.ngrams(
            text,
            2,
            pad_left=True,
            pad_right=True,
            left_pad_symbol="<s>",
            right_pad_symbol="</s>",
        )
    )

    # N-gram Statistics
    # get freq dist of trigrams
    # freq_tri = nltk.FreqDist(trigrams)
    freq_bi = nltk.FreqDist(bigrams)
    # freq_tri.plot(30, cumulative=False)
    freq_bi.plot(30, cumulative=False)
    # print("Most common trigrams: ", freq_tri.most_common(5))
    print("Most common bigrams: ", freq_bi.most_common(5))

    # make conditional frequencies dictionary
    cfdist = ConditionalFreqDist()

    for w1, w2 in bigrams:
        cfdist[w1][w2] += 1
    # transform frequencies to probabilities
    for w1 in cfdist:
        total_count = float(sum(cfdist[w1].values()))
        for w2 in cfdist[w1]:
            cfdist[w1][w2] /= total_count

    return cfdist


"""
    Generate predictions from the Conditional Frequency Distribution
    dictionary (param: model), append weighted random choice to user's phrase,
    allow option to generate more words following the prediction
"""


def predict(model, user_input):
    user_input = filter(user_input)
    user_input = user_input.split()

    # w1 = len(user_input) - 2
    # w2 = len(user_input) - 1
    # prev_words = user_input[w1 : w2 + 1]

    # # display prediction from highest to lowest maximum likelihood
    # prediction = sorted(
    #     dict(model[prev_words[0], prev_words[1]]),
    #     key=lambda x: dict(model[prev_words[0], prev_words[1]])[x],
    #     reverse=True,
    # )
    # Get the last word from the user input
    prev_word = (
        user_input[-1] if user_input else "<s>"
    )  # Use '<s>' if user_input is empty

    # Display predictions from highest to lowest maximum likelihood
    prediction = sorted(
        model[prev_word], key=lambda x: model[prev_word][x], reverse=True
    )

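    # The active model is built from bigrams (one word of context)
    print("Bigram model predictions: ", prediction)

    # Stop if the last word never appeared in the dataset
    if not prediction:
        print("No known continuation for:", prev_word)
        return

    word = []
    weight = []
    # Look up the single previous word; prev_words only exists in the
    # commented-out trigram code above
    for key, prob in model[prev_word].items():
        word.append(key)
        weight.append(prob)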

    # pick from a weighted random probability of predictions
    next_word = random.choices(word, weights=weight, k=1)
    # add predicted word to user input
    user_input.append(next_word[0])
    print(" ".join(user_input))

    ask = input(
        "Do you want to generate another word? (type 'y' for yes or 'n' for no): "
    )
    if ask.lower() == "y":
        # pass the phrase back as plain text, not as the repr of a list
        predict(model, " ".join(user_input))
    elif ask.lower() == "n":
        print("done")


def use_model(path):
    # read the whole dataset; the context manager closes the file
    with open(path, "r") as file:
        text = file.read()

    # pre-process text
    print("Filtering...")
    words = filter(text)
    print("Cleaning...")
    words = clean(words)

    # make language model
    print("Making model...")
    model = n_gram_model(words)

    print("Enter a phrase: ")
    user_input = input()
    predict(model, user_input)


if __name__ == "__main__":
    path = "./dataset.txt"

    use_model(path)
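
Because `predict` waits for keyboard input after every word, it can be awkward to run in batch. A possible non-interactive variant is sketched below: it extends a seed phrase by sampling one word at a time with `random.choices` and stops when the last word has no known followers. The function name `generate`, the `max_words` and `rng` parameters, and the hand-written `toy_model` dictionary are hypothetical; a model returned by `n_gram_model` is also indexed by the previous word, so it may work the same way, but that is not tested here.

```python
import random


def generate(model, seed_phrase, max_words=5, rng=None):
    """Append up to max_words sampled words to seed_phrase."""
    rng = rng or random.Random()
    words = seed_phrase.lower().split()
    for _ in range(max_words):
        previous = words[-1] if words else "<s>"
        followers = model.get(previous, {})
        if not followers:
            break  # unseen word: nothing to sample from
        candidates = list(followers.keys())
        weights = list(followers.values())
        words.append(rng.choices(candidates, weights=weights, k=1)[0])
    return " ".join(words)


# Hand-written toy model: {previous_word: {next_word: probability}}.
toy_model = {
    "<s>": {"see": 1.0},
    "see": {"you": 1.0},
    "you": {"later": 0.5, "soon": 0.5},
}

sentence = generate(toy_model, "see", max_words=5, rng=random.Random(0))
print(sentence)
```

Passing a seeded `random.Random` instance makes a run repeatable, which helps when comparing outputs for the same phrase across versions of the dataset.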