# Language Model and Application for Spelling Error Correction

### Objective:
Develop a simple English syntax error correction program.

### Tasks:
a) Build a language model based on n-grams using the Laplace smoothing method for the following models:
  - 1-gram
  - 2-gram
  - 3-gram

b) Calculate the probability of a sentence and compute the Perplexity of a sentence based on 1-gram, 2-gram, and 3-gram models.

c) Analyze the results (Provide your own examples of spelling errors and calculate the probability of two similar sentences, where one has the correct word order and the other has an incorrect word order).
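
For reference, the quantities in tasks (a) and (b) can be written as follows. This is a sketch of the usual add-one formulation; the symbols are introduced here for illustration: $c(\cdot)$ is a count in the training data, $V$ is the vocabulary size, $T$ is the total number of training tokens, and $N$ is the number of tokens in the sentence being scored.

$$
P_{\text{Laplace}}(w_i) = \frac{c(w_i) + 1}{T + V}
\qquad
P_{\text{Laplace}}(w_i \mid w_{i-n+1}, \dots, w_{i-1}) = \frac{c(w_{i-n+1}, \dots, w_i) + 1}{c(w_{i-n+1}, \dots, w_{i-1}) + V}
$$

$$
P(W) = \prod_{i} P_{\text{Laplace}}(w_i \mid w_{i-n+1}, \dots, w_{i-1})
\qquad
\mathrm{PP}(W) = P(W)^{-1/N}
$$

A lower perplexity means the model finds the sentence less surprising.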


## Import Libraries

In [4]:
!pip install contractions


In [5]:
import string
import re
import nltk
nltk.download('punkt_tab')
import contractions
from collections import Counter

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


## Data Loading

In [12]:
with open('/content/tedtalk.txt') as file:
  docs = file.read()

## Data Preprocessing

In [13]:
vocab = set()     # Set of unique tokens (the vocabulary)
token_count = 0   # Total number of tokens

# Convert text to lowercase
def text_lowercase(text):
    text = text.lower()
    return text

# Split into sentences
def sent_tokenize(text):
    sentences = nltk.sent_tokenize(text)
    return sentences

# Expand contractions, then keep only digits, the letters a to z, and the characters . ! ?
def remove_punctuation(text):
    text = contractions.fix(text)
    text = re.sub(r'[^a-z0-9.!?]', ' ', text)
    text = re.sub(r'\s+', ' ', text)
    return text

# Tokenize the text into words
def tokenize(text):
    tokens = nltk.word_tokenize(text)
    return ['<s>'] + tokens + ['</s>']

# Preprocess the text; note that every call also updates the global vocab and token_count
def preprocess_text(text):
    global token_count
    text = text_lowercase(text)
    sentences = sent_tokenize(text)
    processed_sentences = []
    for sentence in sentences:
        sentence = remove_punctuation(sentence)
        sentence = tokenize(sentence)
        token_count += len(sentence)
        vocab.update(sentence)
        processed_sentences.append(sentence)

    return processed_sentences

preprocess_data = preprocess_text(docs)
print("Number of unique words:", len(vocab))
print("Number of tokens:", token_count)

Number of unique words: 69371
Number of tokens: 8718249
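
The preprocessing above keeps its statistics in global variables. As a minimal sketch of an alternative, the pipeline can return the vocabulary and token count instead. The names `preprocess`, `corpus_stats`, and `vocab_demo` are hypothetical, and the regular expressions are a simplified stand-in for `nltk` and `contractions` (contractions are not expanded here), so the counts would not match the notebook's exactly.

```python
import re
from collections import Counter

def preprocess(text):
    """Lowercase, split into sentences, tokenize, and add <s> ... </s> markers."""
    text = text.lower()
    # Naive sentence split after ., ! or ? (stand-in for nltk.sent_tokenize)
    raw_sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    sentences = []
    for raw in raw_sentences:
        # Keep alphanumeric words and the characters . ! ? as separate tokens
        tokens = re.findall(r'[a-z0-9]+|[.!?]', raw)
        if tokens:
            sentences.append(['<s>'] + tokens + ['</s>'])
    return sentences

def corpus_stats(sentences):
    """Return (vocabulary set, total token count) without touching any globals."""
    counts = Counter(tok for sent in sentences for tok in sent)
    return set(counts), sum(counts.values())

sents = preprocess("I want to talk. Do you want to listen?")
vocab_demo, n_tokens = corpus_stats(sents)
print(len(vocab_demo), n_tokens)
```

Because nothing is mutated outside the functions, scoring a new sentence later cannot change the training statistics.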


## Build a language model based on n-grams using the Laplace smoothing method

In [20]:
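class NgramModel:
    def __init__(self, n, v, token_count):
        self.n = n                      # Order of the model (1, 2, or 3)
        self.ngrams = Counter()         # Counts of full n-grams
        self.context = Counter()        # Counts of (n-1)-token contexts
        self.vocab = v                  # Vocabulary size V
        self.token_count = token_count  # Total number of training tokens

    # Train the model by counting n-grams and their contexts
    def train(self, corpus):
        for tokens in corpus:
            for i in range(len(tokens) - self.n + 1):
                ngram = tuple(tokens[i:i+self.n])  # Create n-gram
                self.ngrams[ngram] += 1
                self.context[ngram[:-1]] += 1

    # Calculate the probability of an n-gram using Laplace (add-one) smoothing
    def laplace_smoothing(self, ngram):
        if self.n != 1:
            return (self.ngrams[ngram] + 1) / (self.context[ngram[:-1]] + self.vocab)
        else:
            return (self.ngrams[ngram] + 1) / (self.token_count + self.vocab)

    # Tokenize the first sentence of the input with the same steps as
    # preprocess_text, but without updating the global vocab and token_count
    def sentence_tokens(self, sentence):
        first = sent_tokenize(text_lowercase(sentence))[0]
        return tokenize(remove_punctuation(first))

    # Calculate the probability of a given sentence as a product of n-gram probabilities.
    # With a single <s> marker, models with n > 2 start at the n-gram (<s>, w1, ..., w(n-1)).
    def sentence_probability(self, sentence):
        prob = 1
        tokens = self.sentence_tokens(sentence)
        for i in range(len(tokens) - self.n + 1):
            ngram = tuple(tokens[i:i+self.n])
            laplace = self.laplace_smoothing(ngram)
            prob = prob * laplace
        return prob

    # Calculate the perplexity of a given sentence.
    # N counts every token, including <s> and </s>.
    def sentence_perplexity(self, sentence):
        tokens = self.sentence_tokens(sentence)
        N = len(tokens)
        prob = self.sentence_probability(sentence)
        return prob ** (-1 / N)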
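
Multiplying many small probabilities can underflow to zero for long sentences. A common remedy, sketched here under the assumption that the smoothed n-gram probabilities are already available as a list, is to sum logarithms instead. The function `log_prob_and_perplexity` and the sample probabilities are hypothetical and not part of the model above.

```python
import math

def log_prob_and_perplexity(ngram_probs, n_tokens):
    """Sum log-probabilities instead of multiplying raw probabilities."""
    log_prob = sum(math.log(p) for p in ngram_probs)
    # exp(-log P / N) is the same quantity as P ** (-1 / N)
    perplexity = math.exp(-log_prob / n_tokens)
    return log_prob, perplexity

# Hypothetical smoothed probabilities for the n-grams of one short sentence
probs = [0.1, 0.2, 0.05, 0.1]
log_p, pp = log_prob_and_perplexity(probs, n_tokens=4)

# A long sequence whose raw product (1e-500) underflows to 0.0 as a float
tiny = [1e-5] * 100
raw_product = math.prod(tiny)
log_p_long, pp_long = log_prob_and_perplexity(tiny, n_tokens=100)
print(pp, raw_product, pp_long)
```

With the raw product equal to `0.0`, `prob ** (-1 / N)` would raise a `ZeroDivisionError`, whereas the log-space version still returns a finite perplexity.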

## Calculate the probability of a sentence and compute the Perplexity of a sentence

In [21]:
# Train unigram, bigram, and trigram models
unigram_model = NgramModel(n=1, v=len(vocab), token_count=token_count)
bigram_model = NgramModel(n=2, v=len(vocab), token_count=token_count)
trigram_model = NgramModel(n=3, v=len(vocab), token_count=token_count)

unigram_model.train(preprocess_data)
bigram_model.train(preprocess_data)
trigram_model.train(preprocess_data)
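
One sanity check for a Laplace-smoothed model is that, for a fixed context, the smoothed probabilities over the whole vocabulary sum to 1. The following is a minimal sketch on a toy corpus; the corpus, `vocab_demo`, and `p_laplace` are hypothetical and independent of the models trained above.

```python
from collections import Counter

# Toy corpus, already tokenized with sentence boundary markers
corpus = [['<s>', 'i', 'want', 'to', 'talk', '</s>'],
          ['<s>', 'i', 'want', 'to', 'listen', '</s>']]
vocab_demo = {tok for sent in corpus for tok in sent}
V = len(vocab_demo)

# Count bigrams and their one-word contexts
bigrams, contexts = Counter(), Counter()
for sent in corpus:
    for prev, word in zip(sent, sent[1:]):
        bigrams[(prev, word)] += 1
        contexts[prev] += 1

def p_laplace(prev, word):
    """Add-one smoothed estimate of P(word | prev)."""
    return (bigrams[(prev, word)] + 1) / (contexts[prev] + V)

# Sum the smoothed probabilities of every vocabulary word after 'want'
total = sum(p_laplace('want', w) for w in vocab_demo)
print(total)
```

The same check applies to any context, seen or unseen, because each of the $V$ words receives one extra count in the numerator and the denominator grows by $V$.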

In [22]:
# Sample sentences for evaluation
correct_sentence = "I want to give a speech at Ted Talk."
incorrect_sentence = "I want give a speech to at Ted Talk."

# Calculate probabilities and perplexities of each sentence
for model, name in zip([unigram_model, bigram_model, trigram_model], ["Unigram", "Bigram", "Trigram"]):
    print(f"{name} Model:")
    print(f"  Probability of correct sentence: {model.sentence_probability(correct_sentence)}")
    print(f"  Perplexity of correct sentence: {model.sentence_perplexity(correct_sentence)}")

    print(f"  Probability of incorrect sentence: {model.sentence_probability(incorrect_sentence)}")
    print(f"  Perplexity of incorrect sentence: {model.sentence_perplexity(incorrect_sentence)}")
    print(100*'-')

Unigram Model:
  Probability of correct sentence: 6.525214922979809e-30
  Perplexity of correct sentence: 270.4687527281362
  Probability of incorrect sentence: 6.52521492297981e-30
  Perplexity of incorrect sentence: 270.4687527281362
----------------------------------------------------------------------------------------------------
Bigram Model:
  Probability of correct sentence: 4.0739919098866776e-25
  Perplexity of correct sentence: 107.77010843045838
  Probability of incorrect sentence: 1.72312816050699e-30
  Perplexity of incorrect sentence: 302.20863434985387
----------------------------------------------------------------------------------------------------
Trigram Model:
  Probability of correct sentence: 3.2913209604134965e-33
  Perplexity of correct sentence: 509.1969735230111
  Probability of incorrect sentence: 1.192217162207629e-40
  Perplexity of incorrect sentence: 2123.0996850122556
----------------------------------------------------------------------------------------------------

## Analyze the results
- **Unigram Model:**
  - The probabilities of the correct and incorrect sentences are **identical** (up to floating-point rounding in the last digit), because a unigram model treats words as independent and ignores context. Both sentences contain exactly the same words.
  - Perplexity is high (~270) for both sentences, indicating that the model is not very accurate at predicting the next word.

- **Bigram Model:**
  - The correct sentence has a **higher probability** than the incorrect sentence (4.07e-25 vs. 1.72e-30) because the model takes adjacent word pairs into account.
  - Perplexity:
    - Correct sentence: lower (~108), meaning the model is more confident.
    - Incorrect sentence: higher (~302), showing that the model can detect the incorrect word order.

- **Trigram Model:**
  - The correct sentence has a **significantly higher probability** than the incorrect sentence (3.29e-33 vs. 1.19e-40) because the model captures more context.
  - Perplexity:
    - Correct sentence: much higher than with the bigram model (~509). A trigram model needs three-word sequences to occur in the training data, and these trigrams are unseen in the training set because the training data is limited. Laplace smoothing with a large vocabulary (69,371 unique words) then assigns them very small probabilities, resulting in high perplexity.
    - Incorrect sentence: extremely high (~2123), reflecting a severe mismatch with the training data: many of its trigrams do not exist in the training set.
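
The contrast between the unigram and bigram results can be reproduced on a toy corpus. This is a minimal sketch with hypothetical data and helper names (`unigram_prob`, `bigram_prob`), not the TED Talk corpus: the unigram score depends only on which words occur, while the bigram score drops when the word order produces unseen pairs.

```python
import math
from collections import Counter

# Toy training corpus, already tokenized with sentence boundary markers
corpus = [['<s>', 'i', 'want', 'to', 'give', 'a', 'speech', '</s>'],
          ['<s>', 'i', 'want', 'to', 'talk', '</s>']]
V = len({t for s in corpus for t in s})   # vocabulary size
T = sum(len(s) for s in corpus)           # total number of tokens

uni = Counter(t for s in corpus for t in s)
bi = Counter(pair for s in corpus for pair in zip(s, s[1:]))
ctx = Counter(t for s in corpus for t in s[:-1])  # counts of bigram contexts

def unigram_prob(tokens):
    """Product of add-one smoothed unigram probabilities."""
    return math.prod((uni[t] + 1) / (T + V) for t in tokens)

def bigram_prob(tokens):
    """Product of add-one smoothed bigram probabilities."""
    return math.prod((bi[(a, b)] + 1) / (ctx[a] + V)
                     for a, b in zip(tokens, tokens[1:]))

good = ['<s>', 'i', 'want', 'to', 'give', 'a', 'speech', '</s>']
bad = ['<s>', 'i', 'want', 'give', 'a', 'speech', 'to', '</s>']

print(unigram_prob(good), unigram_prob(bad))  # equal up to rounding
print(bigram_prob(good), bigram_prob(bad))    # correct order scores higher
```

The two unigram products contain the same factors in a different order, so any difference between them is floating-point rounding, which matches the last-digit difference in the unigram output above.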
