# NLP ASSIGNMENT

# Build an auto-complete system. Auto-complete is something you may see every day:

When you google something, you often have suggestions to help you complete
your search.
When you are writing an email, you get suggestions telling you possible endings
to your sentence.

Develop the prototype of such a system.

Q1.1 Load and Preprocess Data

Q1.2 Develop n-gram based language models 

Q1.3 Perplexity 

Q1.4 Build an auto-complete system


In [1]:
import math
import random
import numpy as np
import pandas as pd
import nltk
nltk.data.path.append('.')
from nltk.tokenize import word_tokenize


nltk.download('punkt')


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Bindu\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

# 1.1 Load and Preprocess Data

Load the Data.

This section involves working with Twitter data. Begin by loading the data and examining the first few sentences. The data consists of a long string containing numerous tweets, with each tweet separated by a line break ("\n").

In [2]:
with open('C:\\Users\\Bindu\\Documents\\NLP\\twitter.txt','r',encoding='utf-8') as f:
    data=f.read()
print("Data type:", type(data))
print("Number of letters:", len(data))
print("First 30 letters of the data")
print("-------")
display(data[0:30])
print("-------")

print("Last 30 letters of the data")
print("-------")
display(data[-30:])
print("-------")

Data type: <class 'str'>
Number of letters: 3335477
First 30 letters of the data
-------


'How are you? Btw thanks for th'

-------
Last 30 letters of the data
-------


' after 5 was a TERRIBLE idea.\n'

-------


Pre-process the Data

To pre-process the Twitter data, follow these steps:

1. **Split the data into sentences**: Use the line break (`"\n"`) as the delimiter to separate the data into individual sentences.
2. **Tokenize each sentence**: Split each sentence into tokens (words). In this context, "tokens" and "words" are used interchangeably.
3. **Assign sentences to train or test sets**: Divide the sentences into training and testing datasets.
4. **Identify common tokens**: Find tokens that appear at least \( N \) times in the training data.
5. **Replace rare tokens**: Replace tokens that appear less than \( N \) times with the `<unk>` marker.

Note: Validation data is omitted for simplicity in this exercise. In real-world applications, a validation set would typically be used to tune the model during training.



# EXERCISE1-SPLIT THE DATA INTO SENTENCES



**Hint**: Use the `str.split` method to split the data into sentences based on the `"\n"` delimiter.

In [3]:
def split_to_sentences(data):
    # Each tweet sits on its own line, so split on the line break.
    sentences = data.split("\n")
    # Trim surrounding whitespace and drop empty lines (e.g. after a trailing "\n").
    sentences = [s.strip() for s in sentences]
    sentences = [s for s in sentences if len(s) > 0]

    return sentences

x = "hello !\nmy name is siri"
print(x)
print(split_to_sentences(x))


hello !
my name is siri
['hello !', 'my name is siri']


# EXERCISE2-TOKENIZING GIVEN SENTENCE


The next step involves tokenizing the sentences, which means splitting each sentence into a list of words. Follow these steps:

Convert Tokens to Lowercase: Convert all tokens to lowercase to ensure that words capitalized at the start of a sentence are treated the same as their lowercase versions.

Tokenize Sentences: Use nltk.word_tokenize to split each sentence into tokens. This approach handles punctuation and edge cases more effectively than str.split.

Append Tokenized Sentences: Collect each list of tokenized words into a larger list of tokenized sentences.

Hints:

Use str.lower to convert strings to lowercase.
Use nltk.word_tokenize to handle tokenization properly and account for punctuation.
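
To see why `str.split` alone falls short, the sketch below compares it with a simple regex tokenizer. The regex is only a stand-in chosen so that the example runs without NLTK data; it is not how `nltk.word_tokenize` works internally, and the helper name `simple_tokenize` is hypothetical.

```python
import re

def simple_tokenize(sentence):
    """Hypothetical stand-in tokenizer: lowercase, then split words from punctuation."""
    # \w+ matches a run of letters/digits; [^\w\s] matches one punctuation character.
    return re.findall(r"\w+|[^\w\s]", sentence.lower())

sentence = "HELLO, how are you?"
whitespace_tokens = sentence.lower().split()
regex_tokens = simple_tokenize(sentence)

print(whitespace_tokens)  # punctuation stays glued to the words
print(regex_tokens)       # punctuation becomes separate tokens
```

With whitespace splitting, `hello,` and `hello` would be counted as two different words, which is the main reason to prefer a real tokenizer here.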

In [4]:
def tokenize_sentences(sentences):
    result = []
    for sentence in sentences:
        lowercase_sentence = sentence.lower() 
        tokens = word_tokenize(lowercase_sentence)  
        result.append(tokens)
    return result

sentences = ["Hey there, what's up?", "HELLO, how are you?", "Call sir to take classes."]
print(tokenize_sentences(sentences))


[['hey', 'there', ',', 'what', "'s", 'up', '?'], ['hello', ',', 'how', 'are', 'you', '?'], ['call', 'sir', 'to', 'take', 'classes', '.']]


# EXERCISE3-TRAINING AND TESTING DATA

To split the data into training and test sets, follow these general steps:

Determine the Split Ratio: Decide how to split the data (e.g., 80% training, 20% testing).

Shuffle the Data: Shuffle the list of tokenized sentences to ensure randomness.

Split the Data: Divide the shuffled list into training and test sets based on the chosen ratio.
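
The three steps above can be sketched as a small helper. The name `split_train_test` and its default arguments are assumptions made for illustration; unlike an in-place shuffle, this version copies the list and uses its own random generator, so neither the caller's data nor the global random state changes.

```python
import random

def split_train_test(items, train_fraction=0.8, seed=87):
    """Hypothetical helper: return (train, test) without mutating `items`."""
    shuffled = list(items)                 # copy so the caller's list keeps its order
    random.Random(seed).shuffle(shuffled)  # local RNG; global random state is untouched
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

toy_sentences = [[f"token{i}"] for i in range(10)]
train, test = split_train_test(toy_sentences)
print(len(train), len(test))
```

Fixing the seed makes the split reproducible, which helps when comparing models across runs.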

In [5]:


def get_tokenized_data(data):
    sentences = data.split("\n")
    tokenized_sentences = [word_tokenize(sentence.lower()) for sentence in sentences]
    
    return tokenized_sentences
x = "The quick brown fox jumps over the lazy dog.\nPython is a powerful programming language.\nData science is an exciting field."
tokenized_data = get_tokenized_data(x)
print(tokenized_data)


[['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', '.'], ['python', 'is', 'a', 'powerful', 'programming', 'language', '.'], ['data', 'science', 'is', 'an', 'exciting', 'field', '.']]


In [6]:
tokenized_data = get_tokenized_data(data)
random.seed(87)
random.shuffle(tokenized_data)

train_size = int(len(tokenized_data) * 0.8)
train_data = tokenized_data[0:train_size]
test_data = tokenized_data[train_size:]
print("{} data are split into {} train and {} test set".format(
    len(tokenized_data), len(train_data), len(test_data)))

print("First training sample:")
print(train_data[0])
      
print("First test sample")
print(test_data[0])

47962 data are split into 38369 train and 9593 test set
First training sample:
['i', '❤', 'and', 'her', 'boo', 'relationship', 'they', 'so', 'thuged', 'out', '!', '!']
First test sample
['better', 'than', '``', 'misplaced', 'quotation', 'marks', "''", 'rt', ':', 'lately', 'i', 'find', 'that', 'i', 'am', 'adding', 'too', 'many', 'exclamation', 'points', 'where', 'they', 'do', "n't", 'belong', '!']


# EXERCISE4 - WORD COUNT IN SENTENCES


In this exercise, the goal is to focus on words that appear at least N times in the data. This involves counting the frequency of each word and filtering out words that don't meet the frequency threshold.

Here’s how to approach this:

Count Word Frequencies: Use a double for-loop to count how many times each word appears in the data.

Filter Words: Keep only those words that appear at least N times.
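
As an alternative to the explicit double loop, the same counting and filtering can be sketched with `collections.Counter`. The function names `count_words_with_counter` and `keep_frequent` are hypothetical and only serve this illustration.

```python
from collections import Counter
from itertools import chain

def count_words_with_counter(tokenized_sentences):
    """Count every token across all sentences; returns a plain dict."""
    return dict(Counter(chain.from_iterable(tokenized_sentences)))

def keep_frequent(word_counts, n):
    """Return the words whose count is at least n."""
    return [word for word, count in word_counts.items() if count >= n]

toy = [["sky", "is", "blue", "."], ["roses", "are", "red", "."]]
counts = count_words_with_counter(toy)
print(counts["."], keep_frequent(counts, 2))
```

`chain.from_iterable` flattens the list of sentences lazily, so no intermediate list of all tokens is built.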

In [7]:
from collections import defaultdict


def count_words(tokenized_sentences):
    word_counts = defaultdict(int)
    for sentence in tokenized_sentences:
        for word in sentence:
            word_counts[word] += 1
    return dict(word_counts)
x = "Hello world.\nWelcome to the world of programming.\nHello again!"
tokenized_sentences = get_tokenized_data(x)
word_freq = count_words(tokenized_sentences)

print("Word Frequencies:", word_freq)


Word Frequencies: {'hello': 2, 'world': 2, '.': 2, 'welcome': 1, 'to': 1, 'the': 1, 'of': 1, 'programming': 1, 'again': 1, '!': 1}


### Handling Out-of-Vocabulary (OOV) Words

When building an autocomplete model, encountering a word that wasn’t seen during training presents a challenge. This "unknown" word, or out-of-vocabulary (OOV) word, makes it difficult for the model to predict the next word since there are no counts for it.

To address this issue, use a special token, `<unk>`, to represent all unknown words. Here’s how to modify the training data to handle OOV words:

1. **Identify Frequent Words**: Determine which words appear frequently in the training data. These words will form the "closed vocabulary."

2. **Replace Rare Words**: Convert all words that are not in the closed vocabulary to the token `<unk>`.

3. **Calculate the OOV Rate**: The percentage of words in the test set that are not in the closed vocabulary, and are therefore replaced with `<unk>`, is called the OOV rate.

By following these steps, the model can handle unknown words more effectively during prediction.
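
The OOV rate described above can be computed with a few lines. This is a minimal sketch; the helper name `oov_rate` and the toy data are assumptions made for illustration.

```python
def oov_rate(test_sentences, vocabulary):
    """Hypothetical helper: fraction of test tokens that are not in the vocabulary."""
    vocab = set(vocabulary)  # set membership checks are fast
    total = sum(len(sentence) for sentence in test_sentences)
    if total == 0:
        return 0.0  # avoid dividing by zero on an empty test set
    unknown = sum(1 for sentence in test_sentences
                  for token in sentence if token not in vocab)
    return unknown / total

toy_vocab = ["water", "is", "blue", ".", "trees", "are", "green"]
toy_test = [["lips", "are", "pink", "."]]
rate = oov_rate(toy_test, toy_vocab)
print(f"OOV rate: {rate:.2%}")
```

A high OOV rate on the test set suggests that the count threshold is too strict or that the training data does not cover the test domain well.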

# EXERCISE5 - COUNT_THRESHOLD AND CLOSED VOCABULARY

To create a function that identifies and returns a closed vocabulary list based on a count threshold, follow these steps:

Count Word Frequencies: First, count the frequency of each word in the text document.

Filter Words: Keep only those words whose count is greater than or equal to the specified threshold.

Return the Closed Vocabulary: Return the list of words that meet the threshold.

In [8]:

def get_words_with_nplus_frequency(tokenized_sentences, count_threshold):
    word_counts = count_words(tokenized_sentences)
    closed_vocab = [word for word, count in word_counts.items() if count >= count_threshold]
    return closed_vocab
tokenized_sentences = [['sky', 'is', 'blue', '.'],
                       ['leaves', 'are', 'green', '.'],
                       ['roses', 'are', 'red', '.']]
min_count = 2
closed_vocab = get_words_with_nplus_frequency(tokenized_sentences, min_count)

print("Closed Vocabulary:")
print(closed_vocab)

Closed Vocabulary:
['.', 'are']


# EXERCISE6 - CLOSED VOCABULARY CREATION: WORDS APPEARING AT LEAST COUNT_THRESHOLD TIMES

HANDLING UNKNOWN WORDS: REPLACING NON-CLOSED VOCABULARY WORDS WITH `<unk>`

To replace words not in the closed vocabulary with the token `<unk>`, follow these steps:

Identify Closed Vocabulary: Determine which words appear at least `count_threshold` times.

Replace Unknown Words: For words not in the closed vocabulary, replace them with `<unk>`.

In [9]:
def replace_out_of_vocab_tokens(tokenized_sentences, vocabulary, unknown_token="<unk>"):
    # A set gives fast membership checks.
    vocabulary = set(vocabulary)
    replaced_tokenized_sentences = []

    for sentence in tokenized_sentences:
        replaced_sentence = []
        for token in sentence:
            if token in vocabulary:
                # Known token: keep it unchanged.
                replaced_sentence.append(token)
            else:
                # Unknown token: substitute the placeholder.
                replaced_sentence.append(unknown_token)
        replaced_tokenized_sentences.append(replaced_sentence)

    return replaced_tokenized_sentences


tokenized_sentences = [["dogs", "run"], ["cats", "sleep"]]
vocabulary = ["dogs", "sleep"]
print("Original sentence:")
print(tokenized_sentences)
print("tokenized_sentences with less frequent words converted to '<unk>':")
print( replace_out_of_vocab_tokens(tokenized_sentences, vocabulary))


Original sentence:
[['dogs', 'run'], ['cats', 'sleep']]
tokenized_sentences with less frequent words converted to '<unk>':
[['dogs', '<unk>'], ['<unk>', 'sleep']]


In [10]:

example_sentences = [
    ['apple', 'banana', 'apple', 'grape', 'banana', 'fruit', 'apple', 'banana', 'fruit'],
    ['cat', 'chases', 'mouse']
]


vocabulary = ['apple', 'banana', 'cat']

updated_sentences = replace_out_of_vocab_tokens(example_sentences, vocabulary)

print("Processed Sentences with '<unk>':")
print(updated_sentences)


Processed Sentences with '<unk>':
[['apple', 'banana', 'apple', '<unk>', 'banana', '<unk>', 'apple', 'banana', '<unk>'], ['cat', '<unk>', '<unk>']]


# EXERCISE7 - PROCESSING DATA: COMBINING FUNCTIONS TO HANDLE UNKNOWN TOKENS

To process the data, follow these steps:

Identify Closed Vocabulary: Find tokens in the training data that appear at least `count_threshold` times.

Replace Unknown Tokens: In both the training and test data, replace every token that is not in this closed vocabulary (that is, every token that appears fewer than `count_threshold` times in the training data) with `<unk>`.

In [11]:
def process_and_replace_tokens(train_sentences, test_sentences, count_threshold):
  
    vocab = get_words_with_nplus_frequency(train_sentences, count_threshold)
    
    train_sentences_replaced = replace_out_of_vocab_tokens(train_sentences, vocab)
    
    test_sentences_replaced = replace_out_of_vocab_tokens(test_sentences, vocab)
    
    return train_sentences_replaced, test_sentences_replaced, vocab

tmp_train = [['water', 'is', 'blue', '.'],
             ['trees', 'are', 'green']]

tmp_test = [['lips', 'are', 'pick', '.']]

tmp_train_repl, tmp_test_repl, tmp_vocab = process_and_replace_tokens(tmp_train, tmp_test, count_threshold=1)

print("tmp_train_repl")
print(tmp_train_repl)
print()
print("tmp_test_repl")
print(tmp_test_repl)
print()
print("tmp_vocab")
print(tmp_vocab)

tmp_train_repl
[['water', 'is', 'blue', '.'], ['trees', 'are', 'green']]

tmp_test_repl
[['<unk>', 'are', '<unk>', '.']]

tmp_vocab
['water', 'is', 'blue', '.', 'trees', 'are', 'green']


## To preprocess the train and test data, run the process_and_replace_tokens function with the chosen minimum_freq and then print the results. Here’s a step-by-step guide on how to achieve this:

Check the helper functions: Ensure that process_and_replace_tokens, get_words_with_nplus_frequency, and replace_out_of_vocab_tokens are defined and working correctly.

Preprocess the data: Call process_and_replace_tokens with your training and test datasets along with the minimum_freq value.

Print the results: Display the first preprocessed training sample, the first preprocessed test sample, the first 10 vocabulary words, and the size of the vocabulary.

In [12]:

minimum_freq = 2
train_data_processed, test_data_processed, vocabulary = process_and_replace_tokens(train_data, test_data, minimum_freq)

print("First preprocessed training sample:")
print(train_data_processed[0])
print()
print("First preprocessed test sample:")
print(test_data_processed[0])
print()
print("First 10 vocabulary:")
print(vocabulary[0:10])
print()
print("Size of vocabulary:", len(vocabulary))

First preprocessed training sample:
['i', '❤', 'and', 'her', 'boo', 'relationship', 'they', 'so', '<unk>', 'out', '!', '!']

First preprocessed test sample:
['better', 'than', '``', 'misplaced', '<unk>', 'marks', "''", 'rt', ':', 'lately', 'i', 'find', 'that', 'i', 'am', 'adding', 'too', 'many', 'exclamation', 'points', 'where', 'they', 'do', "n't", 'belong', '!']

First 10 vocabulary:
['i', '❤', 'and', 'her', 'boo', 'relationship', 'they', 'so', 'out', '!']

Size of vocabulary: 14931


We have completed the preprocessing phase of the assignment. The objects train_data_processed, test_data_processed, and vocabulary will be utilized in the subsequent exercises.

# 1.2 Develop n-gram based language models

# EXERCISE8 - IMPLEMENTING FUNCTION THAT COMPUTES THE N-GRAMS OF ARBITRARY NUMBER N.


Next, you'll implement a function to compute the counts of n-grams for any value of \( n \).

When calculating the counts for n-grams, first prepare the sentence by adding start markers `<s>` at the beginning to signify the start of the sentence.

For an n-gram model, prepend n-1 start tokens so that the first word of the sentence also has a full context. For instance, in a trigram model (n=3), you prepend two start tokens: the sentence "I like food" becomes "`<s>` `<s>` I like food". A bigram model (n=2) needs only one start token. Also, append an end token `<e>` to signal the end of the sentence, allowing the model to know when to conclude.

Technically, you'll store these counts in a dictionary where:
- The key is a tuple of n words (instead of a list).
- The value is the count of occurrences of this n-gram.

Using a tuple as the key is preferable because tuples are immutable and thus suitable for dictionary keys, whereas lists are mutable and not allowed as dictionary keys.

Hints:
- To prepend or append tokens, you can create lists and concatenate them using the `+` operator.
- To create a list with repeated values, use syntax like `['a'] * 3` to generate `['a', 'a', 'a']`.
- To determine the range for index 'i', consider this example: For a bigram model (n=2) with a padded sentence length of N=5 (three words plus one start token and one end token), the valid index positions are [0, 1, 2, 3, 4]. The largest index 'i' for starting a bigram is 3, as the words at positions 3 and 4 form the bigram.

Remember, the `range()` function excludes the maximum value; for example, `range(3)` produces (0, 1, 2) but does not include 3.
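
The hints above can be tried out on a single sentence before writing the full counting function. This is a minimal sketch that follows the padding used by `count_n_grams` in this notebook (n-1 start tokens and one end token); the helper name `padded_n_grams` is hypothetical.

```python
def padded_n_grams(sentence, n, start_token="<s>", end_token="<e>"):
    """Sketch: pad a tokenized sentence and list its n-grams as tuples."""
    padded = [start_token] * (n - 1) + sentence + [end_token]
    # The last valid start index is len(padded) - n, so range stops one past it.
    return [tuple(padded[i:i + n]) for i in range(len(padded) - n + 1)]

bigrams = padded_n_grams(["i", "like", "food"], 2)
print(bigrams)

counts = {}
for n_gram in bigrams:
    # Tuples are hashable, so they work as dictionary keys.
    counts[n_gram] = counts.get(n_gram, 0) + 1
```

The padded sentence has five tokens, so the bigram start indices run from 0 to 3, which gives four bigrams.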

In [13]:
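def count_n_grams(data, n, start_token='<s>', end_token='<e>'):
    # Maps each n-gram (a tuple of n tokens) to the number of times it occurs.
    n_grams = {}

    for sentence in data:
        # Pad with n-1 start tokens and one end token.
        sentence = [start_token] * (n - 1) + sentence + [end_token]
        # Convert to a tuple so that slices can be used as dictionary keys.
        sentence = tuple(sentence)

        # The last n-gram starts at index len(sentence) - n.
        for i in range(len(sentence) - n + 1):
            n_gram = sentence[i:i + n]
            if n_gram in n_grams:
                n_grams[n_gram] += 1
            else:
                n_grams[n_gram] = 1

    return n_grams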


sentences = [['i', 'like', 'a', 'cat'],
             ['this', 'dog', 'is', 'like', 'a', 'cat']]
print("Uni-gram:")
print(count_n_grams(sentences, 1))
print("Bi-gram:")
print(count_n_grams(sentences, 2))


Uni-gram:
{('i',): 1, ('like',): 2, ('a',): 2, ('cat',): 2, ('<e>',): 2, ('this',): 1, ('dog',): 1, ('is',): 1}
Bi-gram:
{('<s>', 'i'): 1, ('i', 'like'): 1, ('like', 'a'): 2, ('a', 'cat'): 2, ('cat', '<e>'): 2, ('<s>', 'this'): 1, ('this', 'dog'): 1, ('dog', 'is'): 1, ('is', 'like'): 1}


# EXERCISE9 - ESTIMATE THE PROBABILITY OF THE WORD GIVEN THE PRIOR 'N' WORDS USING N-GRAM COUNTS

Here are additional hints:

- To define a tuple with just one value, include a comma after the value. For example, `('apple',)` is a tuple containing a single string `'apple'`.
- To concatenate two tuples, use the `+` operator. For example, `('apple',) + ('banana',)` results in `('apple', 'banana')`.
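
The estimate used in this exercise is add-k smoothing: \( \hat{P}(w \mid \text{context}) = \frac{C(\text{context}, w) + k}{C(\text{context}) + k\,|V|} \), where \( C \) is a count and \( |V| \) is the vocabulary size. The sketch below works through the arithmetic for the toy sentences used in this notebook; the function name `add_k_probability` is hypothetical, and the counts are entered by hand.

```python
def add_k_probability(n_plus1_count, n_count, vocabulary_size, k=1.0):
    """Add-k smoothed estimate: (C(context, word) + k) / (C(context) + k * |V|)."""
    return (n_plus1_count + k) / (n_count + k * vocabulary_size)

# Toy counts from the two example sentences: 'a' occurs twice,
# 'a cat' occurs twice, and there are 7 distinct words.
p_seen = add_k_probability(n_plus1_count=2, n_count=2, vocabulary_size=7)

# A bigram that never occurs, such as 'a dog', still gets a non-zero probability.
p_unseen = add_k_probability(n_plus1_count=0, n_count=2, vocabulary_size=7)

print(f"{p_seen:.4f} {p_unseen:.4f}")
```

Without the added k, any unseen n-gram would get probability zero, which would make the perplexity of a sentence containing it undefined.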

In [14]:
def estimate_probability(word, previous_n_gram,
                         n_gram_counts, n_plus1_gram_counts, vocabulary_size, k=1.0):
    # Tuples are hashable, so they can be used as dictionary keys.
    previous_n_gram = tuple(previous_n_gram)

    # Denominator: count of the previous n-gram plus k for every word in the vocabulary.
    previous_n_gram_count = n_gram_counts.get(previous_n_gram, 0)
    denominator = previous_n_gram_count + k * vocabulary_size

    # Numerator: count of the previous n-gram followed by the word, plus k.
    n_plus1_gram = previous_n_gram + (word,)
    n_plus1_gram_count = n_plus1_gram_counts.get(n_plus1_gram, 0)
    numerator = n_plus1_gram_count + k

    probability = numerator / denominator

    return probability


sentences = [['i', 'like', 'a', 'cat'],
             ['this', 'dog', 'is', 'like', 'a', 'cat']]
unique_words = list(set(sentences[0] + sentences[1]))

unigram_counts = count_n_grams(sentences, 1)
bigram_counts = count_n_grams(sentences, 2)
tmp_prob = estimate_probability("cat", ["a"], unigram_counts, bigram_counts, len(unique_words), k=1)

print(f"The estimated probability of word 'cat' given the previous n-gram 'a' is: {tmp_prob:.4f}")


The estimated probability of word 'cat' given the previous n-gram 'a' is: 0.3333


# Estimate probabilities for all words

The function defined below loops over all words in vocabulary to calculate probabilities for all possible words.

In [15]:
def estimate_probabilities(previous_n_gram, n_gram_counts, n_plus1_gram_counts, vocabulary, k=1.0):
   
    previous_n_gram = tuple(previous_n_gram)
  
    vocabulary = vocabulary + ["<e>", "<unk>"]
    vocabulary_size = len(vocabulary)
    
    probabilities = {}
    for word in vocabulary:
        probability = estimate_probability(word, previous_n_gram, 
                                           n_gram_counts, n_plus1_gram_counts, 
                                           vocabulary_size, k=k)
        probabilities[word] = probability

    return probabilities

sentences = [['i', 'like', 'a', 'cat'],
             ['this', 'dog', 'is', 'like', 'a', 'cat']]
unique_words = list(set(sentences[0] + sentences[1]))
unigram_counts = count_n_grams(sentences, 1)
bigram_counts = count_n_grams(sentences, 2)
estimate_probabilities("a", unigram_counts, bigram_counts, unique_words, k=1)

{'a': 0.09090909090909091,
 'i': 0.09090909090909091,
 'like': 0.09090909090909091,
 'this': 0.09090909090909091,
 'dog': 0.09090909090909091,
 'cat': 0.2727272727272727,
 'is': 0.09090909090909091,
 '<e>': 0.09090909090909091,
 '<unk>': 0.09090909090909091}

In [16]:

trigram_counts = count_n_grams(sentences, 3)
estimate_probabilities(["<s>", "<s>"], bigram_counts, trigram_counts, unique_words, k=1)

{'a': 0.1111111111111111,
 'i': 0.2222222222222222,
 'like': 0.1111111111111111,
 'this': 0.2222222222222222,
 'dog': 0.1111111111111111,
 'cat': 0.1111111111111111,
 'is': 0.1111111111111111,
 '<e>': 0.1111111111111111,
 '<unk>': 0.1111111111111111}

### Count and probability matrices

As we've discussed, the n-gram counts computed are adequate for calculating the probabilities of the next word.

To make these probabilities more intuitive, they can be presented as count or probability matrices. The functions in the following cells return these matrices.

In [17]:
def make_count_matrix(n_plus1_gram_counts, vocabulary):

    vocabulary = vocabulary + ["<e>", "<unk>"]
    
    n_grams = []
    for n_plus1_gram in n_plus1_gram_counts.keys():
        n_gram = n_plus1_gram[0:-1]
        n_grams.append(n_gram)
    n_grams = list(set(n_grams))
    
  
    row_index = {n_gram:i for i, n_gram in enumerate(n_grams)}
   
    col_index = {word:j for j, word in enumerate(vocabulary)}
    
    nrow = len(n_grams)
    ncol = len(vocabulary)
    count_matrix = np.zeros((nrow, ncol))
    for n_plus1_gram, count in n_plus1_gram_counts.items():
        n_gram = n_plus1_gram[0:-1]
        word = n_plus1_gram[-1]
        if word not in vocabulary:
            continue
        i = row_index[n_gram]
        j = col_index[word]
        count_matrix[i, j] = count
    
    count_matrix = pd.DataFrame(count_matrix, index=n_grams, columns=vocabulary)
    return count_matrix
sentences = [['i', 'like', 'a', 'cat'],
                 ['this', 'dog', 'is', 'like', 'a', 'cat']]
unique_words = list(set(sentences[0] + sentences[1]))
bigram_counts = count_n_grams(sentences, 2)

print('bigram counts')
display(make_count_matrix(bigram_counts, unique_words))

bigram counts


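| | a | i | like | this | dog | cat | is | `<e>` | `<unk>` |
|---|---|---|---|---|---|---|---|---|---|
| `(a,)` | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 | 0.0 | 0.0 |
| `(cat,)` | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 |
| `(like,)` | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| `(dog,)` | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| `(<s>,)` | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| `(is,)` | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| `(i,)` | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| `(this,)` | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |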


In [18]:
print('\ntrigram counts')
trigram_counts = count_n_grams(sentences, 3)
display(make_count_matrix(trigram_counts, unique_words))


trigram counts


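| | a | i | like | this | dog | cat | is | `<e>` | `<unk>` |
|---|---|---|---|---|---|---|---|---|---|
| `(<s>, <s>)` | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| `(this, dog)` | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| `(<s>, i)` | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| `(dog, is)` | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| `(<s>, this)` | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| `(is, like)` | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| `(i, like)` | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| `(a, cat)` | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 |
| `(like, a)` | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 | 0.0 | 0.0 |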


In [19]:
import pandas as pd
from collections import defaultdict

def make_count_matrix(n_plus1_gram_counts, vocabulary):
   
    n_gram_keys = list(set([n_gram[:-1] for n_gram in n_plus1_gram_counts.keys()]))
    
    
    count_matrix = pd.DataFrame(0, index=pd.MultiIndex.from_tuples(n_gram_keys), columns=vocabulary)

  
    for n_plus1_gram, count in n_plus1_gram_counts.items():
        n_gram = n_plus1_gram[:-1]
        word = n_plus1_gram[-1]    
        if n_gram in count_matrix.index and word in count_matrix.columns:
            count_matrix.at[n_gram, word] = count

    return count_matrix

def make_probability_matrix(n_plus1_gram_counts, vocabulary, k):
  
    count_matrix = make_count_matrix(n_plus1_gram_counts, vocabulary)
    

    count_matrix += k
    
   
    prob_matrix = count_matrix.div(count_matrix.sum(axis=1), axis=0)
    return prob_matrix

def count_n_grams(sentences, n):
    n_gram_counts = defaultdict(int)
    for sentence in sentences:
        sentence = ['<s>'] * (n-1) + sentence + ['</s>'] 
        n_grams = zip(*[sentence[i:] for i in range(n)])
        for n_gram in n_grams:
            n_gram_counts[n_gram] += 1
    return n_gram_counts


sentences = [['i', 'like', 'a', 'cat'],
             ['this', 'dog', 'is', 'like', 'a', 'cat']]

unique_words = list(set(sentences[0] + sentences[1]))


bigram_counts = count_n_grams(sentences, 2)
print("Bigram probabilities")
bigram_prob_matrix = make_probability_matrix(bigram_counts, unique_words, k=1)
display(bigram_prob_matrix)

trigram_counts = count_n_grams(sentences, 3)
print("Trigram probabilities")
trigram_prob_matrix = make_probability_matrix(trigram_counts, unique_words, k=1)
display(trigram_prob_matrix)


Bigram probabilities


| | a | i | like | this | dog | cat | is |
|---|---|---|---|---|---|---|---|
| a | 0.111111 | 0.111111 | 0.111111 | 0.111111 | 0.111111 | 0.333333 | 0.111111 |
| cat | 0.142857 | 0.142857 | 0.142857 | 0.142857 | 0.142857 | 0.142857 | 0.142857 |
| like | 0.333333 | 0.111111 | 0.111111 | 0.111111 | 0.111111 | 0.111111 | 0.111111 |
| dog | 0.125 | 0.125 | 0.125 | 0.125 | 0.125 | 0.125 | 0.25 |
| `<s>` | 0.111111 | 0.222222 | 0.111111 | 0.222222 | 0.111111 | 0.111111 | 0.111111 |
| is | 0.125 | 0.125 | 0.25 | 0.125 | 0.125 | 0.125 | 0.125 |
| i | 0.125 | 0.125 | 0.25 | 0.125 | 0.125 | 0.125 | 0.125 |
| this | 0.125 | 0.125 | 0.125 | 0.125 | 0.25 | 0.125 | 0.125 |


Trigram probabilities


| | | a | i | like | this | dog | cat | is |
|---|---|---|---|---|---|---|---|---|
| `<s>` | `<s>` | 0.111111 | 0.222222 | 0.111111 | 0.222222 | 0.111111 | 0.111111 | 0.111111 |
| this | dog | 0.125 | 0.125 | 0.125 | 0.125 | 0.125 | 0.125 | 0.25 |
| `<s>` | i | 0.125 | 0.125 | 0.25 | 0.125 | 0.125 | 0.125 | 0.125 |
| dog | is | 0.125 | 0.125 | 0.25 | 0.125 | 0.125 | 0.125 | 0.125 |
| `<s>` | this | 0.125 | 0.125 | 0.125 | 0.125 | 0.25 | 0.125 | 0.125 |
| is | like | 0.25 | 0.125 | 0.125 | 0.125 | 0.125 | 0.125 | 0.125 |
| i | like | 0.25 | 0.125 | 0.125 | 0.125 | 0.125 | 0.125 | 0.125 |
| a | cat | 0.142857 | 0.142857 | 0.142857 | 0.142857 | 0.142857 | 0.142857 | 0.142857 |
| like | a | 0.111111 | 0.111111 | 0.111111 | 0.111111 | 0.111111 | 0.333333 | 0.111111 |


# 1.3 Perplexity

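Perplexity is the inverse probability that the model assigns to a sentence, normalized by the number of tokens. The higher the probabilities are, the lower the perplexity will be.

The more the n-grams tell us about the sentence, the lower the perplexity score will be.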
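
For a sequence of \( N \) tokens with per-token probabilities \( p_1, \dots, p_N \), perplexity can be written as \( \left( \prod_{t=1}^{N} \frac{1}{p_t} \right)^{1/N} \). The sketch below computes the same quantity in log space, which is a common way to avoid numerical underflow on long sentences. The function name `perplexity_from_probabilities` is hypothetical, and \( N \) is simply the number of probabilities passed in, which is a simplification of how padded sentences are handled in this notebook.

```python
import math

def perplexity_from_probabilities(probabilities):
    """Perplexity of a sequence given the probability assigned to each of its N tokens."""
    n = len(probabilities)
    # Summing logs avoids the underflow that a long product of small numbers can cause.
    log_sum = sum(math.log(p) for p in probabilities)
    return math.exp(-log_sum / n)

confident = perplexity_from_probabilities([0.5, 0.5, 0.5, 0.5])
uncertain = perplexity_from_probabilities([0.1, 0.1, 0.1, 0.1])
print(confident, uncertain)
```

A model that assigns 0.5 to every token has a perplexity of about 2, while one that assigns 0.1 has a perplexity of about 10, which illustrates that higher probabilities give lower perplexity.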

# EXERCISE 10 - Compute the perplexity score given an N-gram count matrix and a sentence.

In [20]:
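def calculate_perplexity(sentence, n_gram_counts, n_plus1_gram_counts, vocabulary_size, k=1.0):
    # Length of the previous n-gram, read from any key of the counts dictionary.
    n = len(list(n_gram_counts.keys())[0])

    # Pad the sentence with n start tokens and one end token.
    sentence = ["<s>"] * n + sentence + ["<e>"]

    # Number of tokens, including the start and end tokens.
    N = len(sentence)

    # Running product of the inverse probabilities.
    product_pi = 1.0

    for t in range(n, N):
        # The n tokens that precede position t.
        n_gram = tuple(sentence[t-n:t])
        # The token whose probability is being estimated.
        word = sentence[t]

        probability = estimate_probability(word, n_gram, n_gram_counts, n_plus1_gram_counts, vocabulary_size, k=k)

        product_pi *= 1 / probability

    # Take the N-th root of the product.
    perplexity = product_pi ** (1 / N)

    return perplexity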

sentences = [['i', 'like', 'a', 'cat'],
             ['this', 'dog', 'is', 'like', 'a', 'cat']]
unique_words = list(set(sentences[0] + sentences[1]))

unigram_counts = count_n_grams(sentences, 1)
bigram_counts = count_n_grams(sentences, 2)

perplexity_train1 = calculate_perplexity(sentences[0],
                                         unigram_counts, bigram_counts,
                                         len(unique_words), k=1.0)
print(f"Perplexity for first train sample: {perplexity_train1:.4f}")

test_sentence = ['i', 'like', 'a', 'dog']
perplexity_test = calculate_perplexity(test_sentence,
                                       unigram_counts, bigram_counts,
                                       len(unique_words), k=1.0)
print(f"Perplexity for test sample: {perplexity_test:.4f}")


Perplexity for first train sample: 3.2293
Perplexity for test sample: 3.8027


# 1.4 Build an auto-complete system

# Exercise 11- Compute probabilities for all possible next words and suggest the most likely one.

Compute the probabilities for all potential next words and identify the most likely one.

This function also accepts an optional argument, `start_with`, which specifies the initial letters of the next words.

Hints:
- `estimate_probabilities` returns a dictionary where each key is a word, and the corresponding value is the probability of that word.
- Use `str1.startswith(str2)` to check if a string begins with the specified letters. For example, `'learning'.startswith('lea')` returns `True`, while `'learning'.startswith('ear')` returns `False`. You can use the default values for the additional parameters of `str.startswith()` in this context.
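
The two hints can be combined in a few lines: filter the probability dictionary by prefix, then take the most probable remaining word. This is a minimal sketch; the helper name `best_match` is hypothetical and the probabilities are made-up values for illustration, not model output.

```python
def best_match(probabilities, start_with=None):
    """Hypothetical helper: most probable word, optionally restricted to a prefix."""
    candidates = {
        word: prob for word, prob in probabilities.items()
        if not start_with or word.startswith(start_with)
    }
    if not candidates:
        return None, 0.0  # nothing matches the prefix
    word = max(candidates, key=candidates.get)
    return word, candidates[word]

toy_probabilities = {"a": 0.27, "cat": 0.09, "dog": 0.09, "car": 0.12}
print(best_match(toy_probabilities))
print(best_match(toy_probabilities, start_with="c"))
```

Returning `(None, 0.0)` when no word matches lets the caller decide how to handle an empty suggestion.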

In [21]:
def suggest_a_word(previous_tokens, n_gram_counts, n_plus1_gram_counts, vocabulary, k=1.0, start_with=None):
    
    n = len(list(n_gram_counts.keys())[0]) 
    
   
    previous_n_gram = previous_tokens[-n:]

    probabilities = estimate_probabilities(previous_n_gram,
                                           n_gram_counts, n_plus1_gram_counts,
                                           vocabulary, k=k)
   
    suggestion = None
    max_prob = 0
    
    for word, prob in probabilities.items():  
      
        if start_with:  
            if not word.startswith(start_with):
                continue
        
        if prob > max_prob:
            suggestion = word
            
            max_prob = prob
    return suggestion, max_prob


sentences = [['i', 'like', 'a', 'cat'],
             ['this', 'dog', 'is', 'like', 'a', 'cat']]
unique_words = list(set(sentences[0] + sentences[1]))

unigram_counts = count_n_grams(sentences, 1)
bigram_counts = count_n_grams(sentences, 2)

previous_tokens = ["i", "like"]
tmp_suggest1 = suggest_a_word(previous_tokens, unigram_counts, bigram_counts, unique_words, k=1.0)
print(f"The previous words are 'i like',\n\tand the suggested word is `{tmp_suggest1[0]}` with a probability of {tmp_suggest1[1]:.4f}")

print()

tmp_starts_with = 'c'
tmp_suggest2 = suggest_a_word(previous_tokens, unigram_counts, bigram_counts, unique_words, k=1.0, start_with=tmp_starts_with)
print(f"The previous words are 'i like', the suggestion must start with `{tmp_starts_with}`\n\tand the suggested word is `{tmp_suggest2[0]}` with a probability of {tmp_suggest2[1]:.4f}")



The previous words are 'i like',
	and the suggested word is `a` with a probability of 0.2727

The previous words are 'i like', the suggestion must start with `c`
	and the suggested word is `cat` with a probability of 0.0909


## Get multiple suggestions
The function defined below loops over various n-gram models to get multiple suggestions.

In [22]:
def get_suggestions(previous_tokens, n_gram_counts_list, vocabulary, k=1.0, start_with=None):
    
    model_counts = len(n_gram_counts_list)
    suggestions = []
    for i in range(model_counts - 1):
        n_gram_counts = n_gram_counts_list[i]
        n_plus1_gram_counts = n_gram_counts_list[i + 1]
        
        suggestion = suggest_a_word(previous_tokens, n_gram_counts,
                                    n_plus1_gram_counts, vocabulary,
                                    k=k, start_with=start_with)
        suggestions.append(suggestion)
    return suggestions


def count_n_grams(sentences, n):
    
    n_grams = {}
    for sentence in sentences:
        sentence_length = len(sentence)
        for i in range(sentence_length - n + 1):
            n_gram = tuple(sentence[i:i + n])
            if n_gram in n_grams:
                n_grams[n_gram] += 1
            else:
                n_grams[n_gram] = 1
    return n_grams

sentences = [['i', 'like', 'a', 'cat'],
             ['this', 'dog', 'is', 'like', 'a', 'cat']]
unique_words = list(set(sentences[0] + sentences[1]))

unigram_counts = count_n_grams(sentences, 1)
bigram_counts = count_n_grams(sentences, 2)
trigram_counts = count_n_grams(sentences, 3)
quadgram_counts = count_n_grams(sentences, 4)
quintgram_counts = count_n_grams(sentences, 5)

n_gram_counts_list = [unigram_counts, bigram_counts, trigram_counts, quadgram_counts, quintgram_counts]
previous_tokens = ["i", "like"]
tmp_suggest3 = get_suggestions(previous_tokens, n_gram_counts_list, unique_words, k=1.0)

print(f"The previous words are 'i like', the suggestions are:")
for suggestion in tmp_suggest3:
    print(f"Suggested word: {suggestion[0]}, Probability: {suggestion[1]:.4f}")


The previous words are 'i like', the suggestions are:
Suggested word: a, Probability: 0.2727
Suggested word: a, Probability: 0.2000
Suggested word: a, Probability: 0.1111
Suggested word: a, Probability: 0.1111
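
The probabilities printed above are consistent with add-k smoothing, assuming the 7 unique words passed in are extended by two special tokens inside `suggest_a_word`, which gives a vocabulary of 9. The sketch below reproduces the arithmetic under that assumption; `add_k_probability` is a hypothetical helper written for illustration, not the notebook's own function.

```python
def add_k_probability(n_plus1_count, n_count, vocabulary_size, k=1.0):
    """Add-k smoothed estimate: (C(context, word) + k) / (C(context) + k * |V|)."""
    return (n_plus1_count + k) / (n_count + k * vocabulary_size)

# Assumption: 7 unique words plus 2 special tokens (end-of-sentence, unknown).
vocabulary_size = 9

# Unigram/bigram pair: C('like') = 2 and C('like', 'a') = 2
p_bigram = add_k_probability(2, 2, vocabulary_size)
# Bigram/trigram pair: C('i', 'like') = 1 and C('i', 'like', 'a') = 1
p_trigram = add_k_probability(1, 1, vocabulary_size)
# Higher-order pairs: the context was never counted, so both counts are 0
p_unseen = add_k_probability(0, 0, vocabulary_size)

print(f"{p_bigram:.4f} {p_trigram:.4f} {p_unseen:.4f}")  # 0.2727 0.2000 0.1111
```

Under this reading, the two highest-order models assign the same 1/9 to every word, so the `a` they report looks like a tie-break rather than evidence from the counts.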


## n-grams of varying lengths (unigrams, bigrams, trigrams, 4-grams, 5-grams)

In [23]:
from IPython.display import display  # Use this in Jupyter Notebook

def count_n_grams(sentences, n):
    
    n_grams = {}
    for sentence in sentences:
        sentence_length = len(sentence)
        for i in range(sentence_length - n + 1):
            n_gram = tuple(sentence[i:i + n])
            if n_gram in n_grams:
                n_grams[n_gram] += 1
            else:
                n_grams[n_gram] = 1
    return n_grams


train_data_processed = [
    ['i', 'am', 'happy'],
    ['i', 'want', 'to', 'go'],
    ['how', 'are', 'you', 'doing'],
    ['i', 'am', 'going', 'to', 'school'],
    ['hey', 'how', 'are', 'you']
]

vocabulary = list(set([word for sentence in train_data_processed for word in sentence]))


n_gram_counts_list = []
for n in range(1, 6):
    print("Computing n-gram counts with n =", n, "...")
    n_model_counts = count_n_grams(train_data_processed, n)
    n_gram_counts_list.append(n_model_counts)

previous_tokens_list = [
    ["i", "am", "to"],
    ["i", "want", "to", "go"],
    ["hey", "how", "are"],
    ["hey", "how", "are", "you"]
]

for previous_tokens in previous_tokens_list:
    suggestions = get_suggestions(previous_tokens, n_gram_counts_list, vocabulary, k=1.0)
    print(f"The previous words are {previous_tokens}, the suggestions are:")
    display(suggestions)


previous_tokens = ["hey", "how", "are", "you"]
suggestions_with_start = get_suggestions(previous_tokens, n_gram_counts_list, vocabulary, k=1.0, start_with="d")
print(f"The previous words are {previous_tokens}, the suggestions are:")
display(suggestions_with_start)


Computing n-gram counts with n = 1 ...
Computing n-gram counts with n = 2 ...
Computing n-gram counts with n = 3 ...
Computing n-gram counts with n = 4 ...
Computing n-gram counts with n = 5 ...
The previous words are ['i', 'am', 'to'], the suggestions are:


[('go', 0.11764705882352941),
 ('want', 0.06666666666666667),
 ('want', 0.06666666666666667),
 ('want', 0.06666666666666667)]

The previous words are ['i', 'want', 'to', 'go'], the suggestions are:


[('want', 0.0625), ('want', 0.0625), ('want', 0.0625), ('want', 0.0625)]

The previous words are ['hey', 'how', 'are'], the suggestions are:


[('you', 0.17647058823529413),
 ('you', 0.17647058823529413),
 ('you', 0.125),
 ('want', 0.06666666666666667)]

The previous words are ['hey', 'how', 'are', 'you'], the suggestions are:


[('doing', 0.11764705882352941),
 ('doing', 0.11764705882352941),
 ('doing', 0.11764705882352941),
 ('want', 0.0625)]

The previous words are ['hey', 'how', 'are', 'you'], the suggestions are:


[('doing', 0.11764705882352941),
 ('doing', 0.11764705882352941),
 ('doing', 0.11764705882352941),
 ('doing', 0.0625)]
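
Because `get_suggestions` returns one `(word, probability)` pair per model, a caller still has to choose among them. One simple option, sketched below as an assumption rather than part of the assignment, is to keep the pair with the highest probability; `pick_best` is a hypothetical helper.

```python
def pick_best(suggestions):
    """Return the (word, probability) pair with the highest probability.

    Ties go to the earliest (lowest-order) model, because max() keeps the
    first maximal element it encounters.
    """
    if not suggestions:
        return None
    return max(suggestions, key=lambda pair: pair[1])

# Values copied from the output for ['hey', 'how', 'are'] above.
suggestions = [('you', 0.17647058823529413),
               ('you', 0.17647058823529413),
               ('you', 0.125),
               ('want', 0.06666666666666667)]
print(pick_best(suggestions))  # ('you', 0.17647058823529413)
```

Smoothed probabilities from models of different order are not strictly comparable, so this rule is a convenience, not a principled ranking.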

## Results: which n-gram model has the best accuracy?

In [24]:
import pandas as pd
from IPython.display import display

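# Simplified get_suggestions for evaluation. It shadows the earlier list-based
# version: it takes a single n-gram count dictionary and returns every word
# seen after the context, without probabilities.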
def get_suggestions(previous_tokens, n_gram_counts, vocabulary, k=1.0, start_with=None):
    suggestions = []
    n = len(next(iter(n_gram_counts.keys())))
    for n_gram, count in n_gram_counts.items():
        if len(n_gram) == n and previous_tokens[-(n-1):] == list(n_gram[:-1]):
            if start_with is None or n_gram[-1].startswith(start_with):
                suggestions.append(n_gram[-1])
    return suggestions

def evaluate_accuracy(n_gram_counts, test_data):
    correct_predictions = 0
    total_predictions = 0
    
    for sentence in test_data:
        for i in range(len(sentence) - 1):
            context = tuple(sentence[:i+1])
            true_next_word = sentence[i+1]
            
            suggestions = get_suggestions(list(context), n_gram_counts, vocabulary, k=1.0)
            print(f"Context: {context}, True Next Word: {true_next_word}, Suggestions: {suggestions}")
            
            if true_next_word in suggestions:
                correct_predictions += 1
            total_predictions += 1
    
    accuracy = correct_predictions / total_predictions if total_predictions > 0 else 0
    return accuracy

# Test data
test_data = [
    ['i', 'am', 'happy'],
    ['i', 'want', 'to', 'go'],
    ['how', 'are', 'you', 'doing'],
    ['i', 'am', 'going', 'to', 'school'],
    ['hey', 'how', 'are', 'you']
]

# Example n-gram counts list (replace with actual counts).
# Each dictionary must hold keys of a single length n, because get_suggestions
# infers n from the first key and skips keys of any other length.
n_gram_counts_list = [
    {('i',): 5, ('am',): 4, ('happy',): 2, ('want',): 3, ('to',): 5, ('go',): 2, ('how',): 3, ('are',): 4, ('you',): 5, ('doing',): 1, ('going',): 2, ('school',): 1, ('hey',): 2},
    {('i', 'am'): 4, ('am', 'happy'): 2, ('i', 'want'): 2, ('want', 'to'): 2, ('to', 'go'): 2, ('how', 'are'): 2, ('are', 'you'): 3, ('you', 'doing'): 1, ('hey', 'how'): 2},
    {('i', 'am', 'happy'): 2, ('am', 'happy', 'to'): 1, ('happy', 'to', 'go'): 1, ('to', 'go', 'to'): 1, ('go', 'to', 'school'): 1, ('how', 'are', 'you'): 3, ('are', 'you', 'doing'): 1, ('you', 'doing', 'well'): 1},
    # Add actual counts for n=4, n=5, etc.
]

# Evaluate each n-gram model
accuracy_results = []
for n, n_gram_counts in enumerate(n_gram_counts_list, start=1):
    print(f"Evaluating accuracy for n-gram model with n = {n}...")
    accuracy = evaluate_accuracy(n_gram_counts, test_data)
    accuracy_results.append({'n': n, 'Accuracy': accuracy})

# Create a DataFrame for better visualization
accuracy_df = pd.DataFrame(accuracy_results)
display(accuracy_df)


Evaluating accuracy for n-gram model with n = 1...
Context: ('i',), True Next Word: am, Suggestions: []
Context: ('i', 'am'), True Next Word: happy, Suggestions: []
Context: ('i',), True Next Word: want, Suggestions: []
Context: ('i', 'want'), True Next Word: to, Suggestions: []
Context: ('i', 'want', 'to'), True Next Word: go, Suggestions: []
Context: ('how',), True Next Word: are, Suggestions: []
Context: ('how', 'are'), True Next Word: you, Suggestions: []
Context: ('how', 'are', 'you'), True Next Word: doing, Suggestions: []
Context: ('i',), True Next Word: am, Suggestions: []
Context: ('i', 'am'), True Next Word: going, Suggestions: []
Context: ('i', 'am', 'going'), True Next Word: to, Suggestions: []
Context: ('i', 'am', 'going', 'to'), True Next Word: school, Suggestions: []
Context: ('hey',), True Next Word: how, Suggestions: []
Context: ('hey', 'how'), True Next Word: are, Suggestions: []
Context: ('hey', 'how', 'are'), True Next Word: you, Suggestions: []
Evaluating accuracy 

|   | n | Accuracy |
|---|---|----------|
| 0 | 1 | 0.0 |
| 1 | 2 | 0.8 |
| 2 | 3 | 0.266667 |
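
The 0.0 for n = 1 is most likely a slicing artifact rather than a property of unigrams. With n = 1, `previous_tokens[-(n-1):]` becomes `previous_tokens[-0:]`, which is the whole list, so it never equals the empty context `[]`. A minimal sketch of the effect and one possible guard follows; `context_for` is a hypothetical helper, not a function from the notebook.

```python
def context_for(previous_tokens, n):
    """Return the last n-1 tokens as a tuple; an empty tuple for a unigram model."""
    if n <= 1:
        return ()
    return tuple(previous_tokens[-(n - 1):])

tokens = ['how', 'are', 'you']
print(tokens[-0:])             # ['how', 'are', 'you'] : the whole list, not []
print(context_for(tokens, 1))  # ()
print(context_for(tokens, 3))  # ('are', 'you')
```

With such a guard a unigram model would match every word in its counts, so a membership test would then score it trivially high; a ranked or top-1 comparison would be needed for a meaningful figure.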


In [1]:
def load_data(file_path):
    with open(file_path, 'r', encoding='utf-8') as f:
        sentences = [line.strip().split() for line in f if line.strip()]
    return sentences

# Load the dataset
file_path = 'C:\\Users\\Bindu\\Documents\\NLP\\twitter.txt'
data = load_data(file_path)

# Print a few sentences to check
print("Sample sentences from the dataset:")
for sentence in data[:5]:
    print(sentence)


Sample sentences from the dataset:
['How', 'are', 'you?', 'Btw', 'thanks', 'for', 'the', 'RT.', 'You', 'gonna', 'be', 'in', 'DC', 'anytime', 'soon?', 'Love', 'to', 'see', 'you.', 'Been', 'way,', 'way', 'too', 'long.']
['When', 'you', 'meet', 'someone', 'special...', "you'll", 'know.', 'Your', 'heart', 'will', 'beat', 'more', 'rapidly', 'and', "you'll", 'smile', 'for', 'no', 'reason.']
["they've", 'decided', 'its', 'more', 'fun', 'if', 'I', "don't."]
['So', 'Tired', 'D;', 'Played', 'Lazer', 'Tag', '&', 'Ran', 'A', 'LOT', 'D;', 'Ughh', 'Going', 'To', 'Sleep', 'Like', 'In', '5', 'Minutes', ';)']
['Words', 'from', 'a', 'complete', 'stranger!', 'Made', 'my', 'birthday', 'even', 'better', ':)']


In [2]:
def get_suggestions(previous_tokens, n_gram_counts, vocabulary, k=1.0, start_with=None):
    suggestions = []
    n = len(next(iter(n_gram_counts.keys())))
    for n_gram, count in n_gram_counts.items():
        if len(n_gram) == n and previous_tokens[-(n-1):] == list(n_gram[:-1]):
            if start_with is None or n_gram[-1].startswith(start_with):
                suggestions.append(n_gram[-1])
    return suggestions


In [3]:
import pandas as pd
from IPython.display import display
from collections import defaultdict

# Function to count n-grams
def count_n_grams(sentences, n):
    n_grams = defaultdict(int)
    for sentence in sentences:
        sentence_length = len(sentence)
        for i in range(sentence_length - n + 1):
            n_gram = tuple(sentence[i:i + n])
            n_grams[n_gram] += 1
    return n_grams

# Function to get suggestions based on n-grams
def get_suggestions(previous_tokens, n_gram_counts, vocabulary, k=1.0, start_with=None):
    suggestions = []
    n = len(next(iter(n_gram_counts.keys())))
    previous_tokens_tuple = tuple(previous_tokens[-(n-1):])
    for n_gram, count in n_gram_counts.items():
        if len(n_gram) == n and previous_tokens_tuple == tuple(n_gram[:-1]):
            if start_with is None or n_gram[-1].startswith(start_with):
                suggestions.append(n_gram[-1])
    return suggestions

# Function to evaluate accuracy
def evaluate_accuracy(n_gram_counts, test_data):
    correct_predictions = 0
    total_predictions = 0
    
    for sentence in test_data:
        for i in range(len(sentence) - 1):
            context = tuple(sentence[:i+1])
            true_next_word = sentence[i+1]
            
            suggestions = get_suggestions(list(context), n_gram_counts, vocabulary, k=1.0)
            if true_next_word in suggestions:
                correct_predictions += 1
            total_predictions += 1

    accuracy = correct_predictions / total_predictions if total_predictions > 0 else 0
    return accuracy

# Load data from file
def load_data(file_path):
    with open(file_path, 'r', encoding='utf-8') as f:
        sentences = [line.strip().split() for line in f if line.strip()]
    return sentences

# Path to the dataset
file_path = 'C:\\Users\\Bindu\\Documents\\NLP\\twitter.txt'

# Load and preprocess the dataset
data = load_data(file_path)

# Use only the first 5 sentences, for both counting and testing
subset_data = data[:5]

# Generate n-gram counts for different n
n_gram_counts_list = []
for n in range(1, 6):  # Example for n=1 to n=5
    print(f"Generating n-gram counts for n = {n}...")
    n_gram_counts = count_n_grams(subset_data, n)
    n_gram_counts_list.append(n_gram_counts)

# Generate vocabulary
vocabulary = list(set([word for sentence in subset_data for word in sentence]))

# Evaluate each n-gram model
accuracy_results = []
for n, n_gram_counts in enumerate(n_gram_counts_list, start=1):
    print(f"Evaluating accuracy for n-gram model with n = {n}...")
    accuracy = evaluate_accuracy(n_gram_counts, subset_data)
    accuracy_results.append({'n': n, 'Accuracy': accuracy})

# Create a DataFrame for better visualization
accuracy_df = pd.DataFrame(accuracy_results)
display(accuracy_df)


Generating n-gram counts for n = 1...
Generating n-gram counts for n = 2...
Generating n-gram counts for n = 3...
Generating n-gram counts for n = 4...
Generating n-gram counts for n = 5...
Evaluating accuracy for n-gram model with n = 1...
Evaluating accuracy for n-gram model with n = 2...
Evaluating accuracy for n-gram model with n = 3...
Evaluating accuracy for n-gram model with n = 4...
Evaluating accuracy for n-gram model with n = 5...


|   | n | Accuracy |
|---|---|----------|
| 0 | 1 | 0.0 |
| 1 | 2 | 1.0 |
| 2 | 3 | 0.935065 |
| 3 | 4 | 0.87013 |
| 4 | 5 | 0.805195 |
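
These figures are measured on the same five tweets that were used to build the counts, so every context that is long enough is guaranteed to have been seen. A hedged sketch of a held-out split follows; `split_sentences`, the 80/20 ratio, and the seed are illustrative assumptions, not choices made in the notebook.

```python
import random

def split_sentences(sentences, test_fraction=0.2, seed=87):
    """Shuffle a copy of the sentences and split it into (train, test) lists."""
    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)
    test_size = int(len(shuffled) * test_fraction)
    return shuffled[test_size:], shuffled[:test_size]

sentences = [['i', 'am', 'happy'],
             ['i', 'want', 'to', 'go'],
             ['how', 'are', 'you', 'doing'],
             ['i', 'am', 'going', 'to', 'school'],
             ['hey', 'how', 'are', 'you']]
train, test = split_sentences(sentences)
print(len(train), len(test))
```

Counts would then be built from `train` only, and accuracy measured on `test`.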


In [4]:
import pandas as pd
from IPython.display import display
from collections import defaultdict

# Function to count n-grams
def count_n_grams(sentences, n):
    n_grams = defaultdict(int)
    for sentence in sentences:
        sentence_length = len(sentence)
        for i in range(sentence_length - n + 1):
            n_gram = tuple(sentence[i:i + n])
            n_grams[n_gram] += 1
    return n_grams

# Function to get suggestions based on n-grams
def get_suggestions(previous_tokens, n_gram_counts, vocabulary, k=1.0, start_with=None):
    suggestions = []
    n = len(next(iter(n_gram_counts.keys())))
    previous_tokens_tuple = tuple(previous_tokens[-(n-1):])
    for n_gram, count in n_gram_counts.items():
        if len(n_gram) == n and previous_tokens_tuple == tuple(n_gram[:-1]):
            if start_with is None or n_gram[-1].startswith(start_with):
                suggestions.append(n_gram[-1])
    return suggestions

# Function to evaluate accuracy
def evaluate_accuracy(n_gram_counts, test_data):
    correct_predictions = 0
    total_predictions = 0
    
    for sentence in test_data:
        print(f"\nEvaluating sentence: {' '.join(sentence)}")
        for i in range(len(sentence) - 1):
            context = tuple(sentence[:i+1])
            true_next_word = sentence[i+1]
            
            suggestions = get_suggestions(list(context), n_gram_counts, vocabulary, k=1.0)
            print(f"Context: {context}, True Next Word: {true_next_word}, Suggestions: {suggestions}")
            
            if true_next_word in suggestions:
                correct_predictions += 1
            total_predictions += 1

    accuracy = correct_predictions / total_predictions if total_predictions > 0 else 0
    return accuracy

# Load data from file
def load_data(file_path):
    with open(file_path, 'r', encoding='utf-8') as f:
        sentences = [line.strip().split() for line in f if line.strip()]
    return sentences

# Path to the dataset
file_path = 'C:\\Users\\Bindu\\Documents\\NLP\\twitter.txt'

# Load and preprocess the dataset
data = load_data(file_path)

# Use only the first 5 sentences, for both counting and testing
subset_data = data[:5]

# Generate n-gram counts for different n
n_gram_counts_list = []
for n in range(1, 6):  # Example for n=1 to n=5
    print(f"Generating n-gram counts for n = {n}...")
    n_gram_counts = count_n_grams(subset_data, n)
    n_gram_counts_list.append(n_gram_counts)

# Generate vocabulary
vocabulary = list(set([word for sentence in subset_data for word in sentence]))

# Evaluate each n-gram model
accuracy_results = []
for n, n_gram_counts in enumerate(n_gram_counts_list, start=1):
    print(f"\nEvaluating accuracy for n-gram model with n = {n}...")
    accuracy = evaluate_accuracy(n_gram_counts, subset_data)
    accuracy_results.append({'n': n, 'Accuracy': accuracy})

# Create a DataFrame for better visualization
accuracy_df = pd.DataFrame(accuracy_results)
display(accuracy_df)


Generating n-gram counts for n = 1...
Generating n-gram counts for n = 2...
Generating n-gram counts for n = 3...
Generating n-gram counts for n = 4...
Generating n-gram counts for n = 5...

Evaluating accuracy for n-gram model with n = 1...

Evaluating sentence: How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long.
Context: ('How',), True Next Word: are, Suggestions: []
Context: ('How', 'are'), True Next Word: you?, Suggestions: []
Context: ('How', 'are', 'you?'), True Next Word: Btw, Suggestions: []
Context: ('How', 'are', 'you?', 'Btw'), True Next Word: thanks, Suggestions: []
Context: ('How', 'are', 'you?', 'Btw', 'thanks'), True Next Word: for, Suggestions: []
Context: ('How', 'are', 'you?', 'Btw', 'thanks', 'for'), True Next Word: the, Suggestions: []
Context: ('How', 'are', 'you?', 'Btw', 'thanks', 'for', 'the'), True Next Word: RT., Suggestions: []
Context: ('How', 'are', 'you?', 'Btw', 'thanks', 'for', 'the', 'RT.'), True

|   | n | Accuracy |
|---|---|----------|
| 0 | 1 | 0.0 |
| 1 | 2 | 1.0 |
| 2 | 3 | 0.935065 |
| 3 | 4 | 0.87013 |
| 4 | 5 | 0.805195 |


In [5]:
import pandas as pd
from collections import defaultdict

# Function to count n-grams
def count_n_grams(sentences, n):
    n_grams = defaultdict(int)
    for sentence in sentences:
        sentence_length = len(sentence)
        for i in range(sentence_length - n + 1):
            n_gram = tuple(sentence[i:i + n])
            n_grams[n_gram] += 1
    return n_grams

# Dummy implementation of get_suggestions function
def get_suggestions(previous_tokens, n_gram_counts, vocabulary, k=1.0, start_with=None):
    # Generate suggestions based on the most frequent n-gram that matches the context
    suggestions = []
    n = len(next(iter(n_gram_counts.keys())))
    previous_tokens_tuple = tuple(previous_tokens[-(n-1):])
    candidates = {n_gram[-1]: count for n_gram, count in n_gram_counts.items() if n_gram[:-1] == previous_tokens_tuple}
    sorted_candidates = sorted(candidates.items(), key=lambda x: x[1], reverse=True)
    suggestions = [word for word, _ in sorted_candidates]
    return suggestions[:3]  # Return top 3 suggestions for simplicity

# Function to evaluate accuracy
def evaluate_accuracy(n_gram_counts, test_data):
    correct_predictions = 0
    total_predictions = 0
    
    for sentence in test_data:
        for i in range(len(sentence) - 1):
            context = tuple(sentence[:i+1])
            true_next_word = sentence[i+1]
            
            suggestions = get_suggestions(list(context), n_gram_counts, vocabulary, k=1.0)
            
            if true_next_word in suggestions:
                correct_predictions += 1
            total_predictions += 1

    accuracy = correct_predictions / total_predictions if total_predictions > 0 else 0
    return accuracy

# Load data from file
def load_data(file_path):
    with open(file_path, 'r', encoding='utf-8') as f:
        sentences = [line.strip().split() for line in f if line.strip()]
    return sentences

# Path to the dataset
file_path = 'C:\\Users\\Bindu\\Documents\\NLP\\twitter.txt'

# Load and preprocess the dataset
data = load_data(file_path)

# Use only the first 5 sentences, for both counting and testing
subset_data = data[:5]

# Generate vocabulary
vocabulary = list(set([word for sentence in subset_data for word in sentence]))

# Generate n-gram counts for different n
n_gram_counts_list = []
for n in range(1, 6):  # Example for n=1 to n=5
    n_gram_counts = count_n_grams(subset_data, n)
    n_gram_counts_list.append(n_gram_counts)

# Evaluate each n-gram model and store results
results = []

for idx, sentence in enumerate(subset_data):
    sentence_result = {'Sentence': ' '.join(sentence)}
    best_accuracy = 0
    best_n = 0
    
    for n, n_gram_counts in enumerate(n_gram_counts_list, start=1):
        accuracy = evaluate_accuracy(n_gram_counts, [sentence])
        sentence_result[f'{n}-gram Accuracy'] = accuracy
        
        if accuracy > best_accuracy:
            best_accuracy = accuracy
            best_n = n
    
    sentence_result['Best N-gram'] = f'{best_n}-gram' if best_n > 0 else 'None'
    results.append(sentence_result)

# Create a DataFrame for better visualization
accuracy_df = pd.DataFrame(results)
display(accuracy_df)


|   | Sentence | 1-gram Accuracy | 2-gram Accuracy | 3-gram Accuracy | 4-gram Accuracy | 5-gram Accuracy | Best N-gram |
|---|----------|-----------------|-----------------|-----------------|-----------------|-----------------|-------------|
| 0 | How are you? Btw thanks for the RT. You gonna ... | 0.0 | 1.0 | 0.956522 | 0.913043 | 0.869565 | 2-gram |
| 1 | When you meet someone special... you'll know. ... | 0.0 | 1.0 | 0.944444 | 0.888889 | 0.833333 | 2-gram |
| 2 | they've decided its more fun if I don't. | 0.0 | 1.0 | 0.857143 | 0.714286 | 0.571429 | 2-gram |
| 3 | So Tired D; Played Lazer Tag & Ran A LOT D; Ug... | 0.0 | 1.0 | 0.947368 | 0.894737 | 0.842105 | 2-gram |
| 4 | Words from a complete stranger! Made my birthd... | 0.0 | 1.0 | 0.9 | 0.8 | 0.7 | 2-gram |
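
The per-sentence figures follow a simple pattern, which suggests that the drop with larger n reflects missing context at the start of each tweet rather than a weaker model. A sentence of L tokens yields L - 1 predictions, and an n-gram model with n >= 2 has no full (n-1)-token context for the first n - 2 of them. The sketch below reproduces the table under that reading; `expected_accuracy` is a hypothetical helper, and the token counts are taken from the sample sentences printed earlier.

```python
def expected_accuracy(sentence_length, n):
    """Accuracy when only predictions with a full (n-1)-token context succeed.

    n = 1 is reported as 0.0 to mirror the table, where the unigram model
    never produces a suggestion.
    """
    predictions = sentence_length - 1
    if predictions <= 0 or n < 2:
        return 0.0
    hits = max(predictions - (n - 2), 0)
    return hits / predictions

# Token counts of the five tweets after whitespace splitting.
lengths = [24, 19, 8, 20, 11]
for n in range(1, 6):
    row = [round(expected_accuracy(length, n), 6) for length in lengths]
    print(f"{n}-gram:", row)

# Pooled over all five tweets, as in the earlier aggregate table.
total = sum(length - 1 for length in lengths)
pooled_trigram = sum(expected_accuracy(length, 3) * (length - 1) for length in lengths) / total
print(round(pooled_trigram, 6))
```

Since the evaluation sentences are also the training sentences, this says little about how the models would rank on unseen tweets.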
