# Task 1 - NGram Models

## 1. Write a python program that **takes a dataset and generates n-grams** to a given value of n. Use the generated n-grams to conduct a **frequency analysis**. What is the most common n-grams in the provided text? Relate the results to the theory behind n-grams and their importance in language representation.

In [None]:
# I will use this dataset: https://www.kaggle.com/datasets/jp797498e/twitter-entity-sentiment-analysis

import pandas as pd
from collections import Counter
import re

# Function to read the dataset
def read_csv(file_path):
    return pd.read_csv(file_path)

# Function to generate n-grams
def generate_ngrams(text, n):
    # Clean and tokenize text
    text = re.sub(r'\W+', ' ', text)
    tokens = text.lower().split()

    # Generate n-grams
    ngrams = zip(*[tokens[i:] for i in range(n)])
    ngrams = [' '.join(ngram) for ngram in ngrams]

    return ngrams

# Function for n-gram frequency analysis
def ngram_frequency_analysis(text, n):
    ngrams = generate_ngrams(text, n)
    ngram_freq = Counter(ngrams)

    return ngram_freq

# Read csv file
file_path = '/content/twitter_training_clean.csv'  # Replace with the path to your file
df = read_csv(file_path)

# Select the text column you want to analyze
column_to_analyze = 'Tweet content'  # Make sure that the column exists in your dataset
text_data = ' '.join(df[column_to_analyze].dropna().astype(str).tolist())

n = 3  # Change this value to generate different n-grams (e.g., 1 for unigrams, 2 for bigrams, etc.)
ngram_freq = ngram_frequency_analysis(text_data, n)

# Find the most common n-grams
most_common_ngrams = ngram_freq.most_common(10)  # Top 10 most common n-grams

print("Most common n-grams:")
for ngram, freq in most_common_ngrams:
    print(f"{ngram}: {freq}")

# Relating the results to the theory behind n-grams
# N-grams are a fundamental concept in text mining and natural language processing. They represent
# sequences of words or tokens in a given text and are used to capture the context and relationship
# between words. For instance, unigrams (n=1) give the frequency of individual words, bigrams (n=2)
# give the frequency of word pairs, and trigrams (n=3) give the frequency of word triplets.

# The frequency analysis of n-grams is crucial for understanding language patterns, identifying common phrases,
# and building language models. In tasks such as text prediction, speech recognition, and machine translation,
# n-grams help in modeling the likelihood of a word given the previous words, thus enabling more accurate predictions
# and translations.

# In the example text provided, the most common trigrams (n=3) help identify prevalent phrases and contextual word
# associations within the text. This information can be used to improve algorithms in natural language processing
# applications.

Most common n-grams:
pic twitter com: 3511
https t co: 2085
red dead redemption: 1077
i can t: 937
_ _ _: 936
call of duty: 835
italy italy italy: 761
i don t: 735
dead redemption 2: 708
league of legends: 651


**Explanation**

**N-grams Theory and Importance**

1. **Definition:**

*   N-grams are contiguous sequences of 'n' items (words, characters, etc.) from a given text.
*   For example, in the sentence "I love NLP", unigrams are \["I", "love", "NLP"\], bigrams are \["I love", "love NLP"\], and trigrams are \["I love NLP"\].

2. **Importance in Language Represention:**

* **Context Capture:** N-grams capture the local context of words. Higher-order n-grams (bigrams, trigrams) provide more context than unigrams.
* **Predictive Modeling:** In language models, n-grams are used to predict the next word given the previous 'n-1' words, improving the accuracy of predictions.
* **Text Analysis:** Frequency analysis of n-grams helps identify common phrases, understand text structure, and extract meaningful patterns.
* **Applications:** N-grams are used in various NLP tasks like text generation, machine translation, sentiment analysis, and more.

By analyzing the most common n-grams in a text, we can gain insights into the text's structure, common phrases, and thematic elements. This information is crucial for building effective natural language processing systems.






### Analysis of the Most Common N-grams

1. **pic twitter com: 3511**
   - This is a common combination in tweets containing links to Twitter images. The prefix “pic” usually precedes “twitter.com” in shared image links.

2. **https t co: 2085**.
   - Similar to the above, this n-gram is common in tweets that include links shortened using Twitter's URL shortening service, `t.co`.

3. **red dead redemption: 1077**
   - This is a name of a popular game, indicating that the game `red dead redemption` was a common theme in the tweets analyzed.

4. **i can t: 937**
   - A common n-gram in everyday language, showing a phrase commonly used in conversation.

5. **_ _ _: 936**
   - This could be an artifact of some formatting in the tweets (e.g., underscore characters used for emphasis or separators). It should be investigated further.

6. **call of duty: 835**
   - Another name of a popular game, indicating that “Call of Duty” was also a frequent theme in tweets.

7. **italy italy italy: 761**
   - This could be a repeating pattern used in some tweets, perhaps related to events in Italy or Italy-related spam.

8. **i don t: 735**
   - Another common phrase in everyday language, reflecting how people speak and express negations.

9. **dead redemption 2: 708**
   - Part of the name of the game “Red Dead Redemption 2”, reinforcing that this game was a much-discussed topic.

10. **league of legends: 651**
    - The name of another popular game, “League of Legends,” which was also a recurring theme in tweets.



## 2. Implement a **simple language model** using n-grams. Use the unigrams, bigrams and / or trigrams that you have found to compute the MLE probability of a word given some context. Apply it for for at least **four different pairs of word and context**. Apply some **smoothing** and justify why you are doing it. Play with **different smoothing K's** and explain the differences between the methods you try. Do all methods agree about which n-gram has higher probability? Comment on your results.

In [None]:
# Select the text column you want to analyze
column_to_analyze = 'Tweet content'  # Make sure that the column exists in your dataset
text_data = ' '.join(df[column_to_analyze].dropna().astype(str).tolist())

# Generate n-grams for smoothing
unigrams = generate_ngrams(text_data, 1)
bigrams = generate_ngrams(text_data, 2)
trigrams = generate_ngrams(text_data, 3)

# Calculate the frequencies of the n-grams for smoothing
unigram_freq = Counter(unigrams)
bigram_freq = Counter(bigrams)
trigram_freq = Counter(trigrams)

# Calculate the frequencies of the n-gram contexts for smoothing
bigram_context_freq = Counter([bigram.split(' ')[0] for bigram in bigrams])
trigram_context_freq = Counter([tuple(trigram.split(' ')[0:2]) for trigram in trigrams])

# Vocabulary size for smoothing
vocab_size = len(set(unigrams))

# Select word-context pairs based on frequent n-grams
def select_word_context_pairs(frequent_ngrams, context_size):
    word_context_pairs = []
    for ngram, freq in frequent_ngrams.most_common(4):
        words = ngram.split()
        word = words[-1]
        context = words[:-1]
        word_context_pairs.append((word, tuple(context)))  # Convert context to tuple
    return word_context_pairs

# Get the most common bigrams and trigrams
most_common_bigrams = bigram_freq.most_common(4)
most_common_trigrams = trigram_freq.most_common(4)

# Combine the most common bigrams and trigrams to form word-context pairs
word_context_pairs = select_word_context_pairs(bigram_freq, 1) + select_word_context_pairs(trigram_freq, 2)

# Ensure unique pairs
word_context_pairs = list(set(word_context_pairs))

# Function for calculating the maximum likelihood estimate (MLE) with smoothing
def calculate_mle_with_smoothing(word, context, ngram_freq, ngram_context_freq, vocab_size, k=1):
    context_str = ' '.join(context)
    context_freq = ngram_context_freq.get(context_str, 0)
    word_given_context_freq = ngram_freq.get(' '.join(context + (word,)), 0)  # Concatenate context with word as tuple

    # Apply Laplace smoothing
    mle_prob = (word_given_context_freq + k) / (context_freq + k * vocab_size)

    return mle_prob

# Apply smoothing and calculate the maximum likelihood probability for each word pair and context
for word, context in word_context_pairs:
    context_str = ' '.join(context)
    print(f"Word: '{word}', Context: '{context_str}'")

    print("Unigram MLE probabilities:")
    for k in [0.1, 0.5, 1, 5]:
        unigram_mle_prob = calculate_mle_with_smoothing(word, context, unigram_freq, unigram_freq, vocab_size, k)
        print(f"K={k}: {unigram_mle_prob:.6f}")
    print()

    print("Bigram MLE probabilities:")
    for k in [0.1, 0.5, 1, 5]:
        bigram_mle_prob = calculate_mle_with_smoothing(word, context[:1], bigram_freq, bigram_context_freq, vocab_size, k)
        print(f"K={k}: {bigram_mle_prob:.6f}")
    print()

    print("Trigram MLE probabilities:")
    for k in [0.1, 0.5, 1, 5]:
        trigram_mle_prob = calculate_mle_with_smoothing(word, context[:2], trigram_freq, trigram_context_freq, vocab_size, k)
        print(f"K={k}: {trigram_mle_prob:.6f}")
    print()

Word: 't', Context: 'i can'
Unigram MLE probabilities:
K=0.1: 0.000032
K=0.5: 0.000032
K=1: 0.000032
K=5: 0.000032

Bigram MLE probabilities:
K=0.1: 0.000461
K=0.5: 0.000358
K=1: 0.000282
K=5: 0.000120

Trigram MLE probabilities:
K=0.1: 0.301173
K=0.5: 0.060260
K=1: 0.030146
K=5: 0.006055

Word: 'twitter', Context: 'pic'
Unigram MLE probabilities:
K=0.1: 0.000014
K=0.5: 0.000025
K=1: 0.000028
K=5: 0.000031

Bigram MLE probabilities:
K=0.1: 0.476760
K=0.5: 0.177254
K=1: 0.099299
K=5: 0.021999

Trigram MLE probabilities:
K=0.1: 0.000032
K=0.5: 0.000032
K=1: 0.000032
K=5: 0.000032

Word: 'com', Context: 'twitter'
Unigram MLE probabilities:
K=0.1: 0.000014
K=0.5: 0.000025
K=1: 0.000028
K=5: 0.000031

Bigram MLE probabilities:
K=0.1: 0.505432
K=0.5: 0.187464
K=1: 0.104952
K=5: 0.023235

Trigram MLE probabilities:
K=0.1: 0.000032
K=0.5: 0.000032
K=1: 0.000032
K=5: 0.000032

Word: 'm', Context: 'i'
Unigram MLE probabilities:
K=0.1: 0.000003
K=0.5: 0.000010
K=1: 0.000015
K=5: 0.000026

Bigram 

Smoothing is essential in language modeling to handle the issue of zero probabilities for unseen n-grams. Without smoothing, any n-gram not present in the training data would have a probability of zero, which would drastically affect the performance of the model.

We applied Laplace smoothing (additive smoothing) with different K-values to observe how the probabilities change. The values tested were K=0.1, 0.5, 1, and 5.


#### Observations and Analysis

1. **Effect of Smoothing on Unigram Probabilities:**
   - The unigram probabilities are very small and do not vary significantly with different K-values. This is because unigrams have a larger context (the entire vocabulary), making individual word probabilities small.

2. **Effect of Smoothing on Bigram Probabilities:**
   - The bigram probabilities are higher than unigram probabilities, as they consider the preceding word, which reduces the context size.
   - Higher K-values tend to reduce the bigram probabilities because the additional counts from smoothing distribute the probability mass more evenly across the vocabulary.

3. **Effect of Smoothing on Trigram Probabilities:**
   - Trigram probabilities are generally higher than both unigram and bigram probabilities when the context is well-defined and frequent.
   - For some word-context pairs, smoothing with higher K-values drastically reduces the probability. This indicates that the trigram model heavily relies on the context, and adding more smoothing dilutes this effect.

#### Agreement Among Models

The different n-gram models do not always agree on which n-gram has the higher probability. For example:
- For the word 't' in the context 'i can', the trigram model assigns a high probability with lower K-values, while the bigram and unigram models assign much lower probabilities.
- For the word 'com' in the context 'pic twitter', the trigram model assigns a very high probability with lower K-values, indicating strong context dependence.

#### Conclusion

The n-gram models (unigram, bigram, trigram) show varying probabilities based on the context and the amount of smoothing applied. The trigram model tends to provide higher probabilities for well-defined contexts but is more sensitive to the choice of K-value for smoothing. The unigram model is the least sensitive to context, while the bigram model provides a middle ground.

Smoothing is crucial for handling unseen n-grams, and different K-values help balance the trade-off between assigning too much probability mass to seen n-grams and distributing it too evenly across the entire vocabulary. This balance is essential for building robust language models.

## 3. Use the frequency of n-grams to **predict the next Word in a sentence.** Try your code on a use case of your choice. Explain how you do it, justify the choice of the language model and discuss its limitations.

In [None]:
# Function to predict the next word
def predict_next_word(context, ngram_freq, ngram_size):
    context_str = ' '.join(context[-(ngram_size-1):])  # Get the relevant part of the context
    candidates = {ngram.split()[-1]: freq for ngram, freq in ngram_freq.items() if ngram.startswith(context_str)}

    if not candidates:
        return None

    # Return the word with the highest frequency
    return max(candidates, key=candidates.get)

# Use case: Predict the next word for a given context
context = ['red', 'dead']  # Example context
next_word_bigram = predict_next_word(context, bigram_freq, 2)
next_word_trigram = predict_next_word(context, trigram_freq, 3)

print(f"Given context: {' '.join(context)}")
print(f"Next word prediction using bigram model: {next_word_bigram}")
print(f"Next word prediction using trigram model: {next_word_trigram}")

Given context: red dead
Next word prediction using bigram model: redemption
Next word prediction using trigram model: redemption


To predict the next word in a sentence using n-gram frequencies, we can employ a simple language model based on the n-grams we've generated. Here, I'll demonstrate how to use bigram and trigram models for this task. We'll start by calculating the conditional probabilities of potential next words given the current context (last one or two words).

**Predicting the Next Word:**
   - The `predict_next_word` function takes the context (last word for bigrams, last two words for trigrams), searches the n-grams that match the context, and selects the next word with the highest frequency.

### Justification for Choice of Language Model

- **Bigram Model:**
  - The bigram model considers only the last word of the context. It is computationally less expensive and works well for cases where the immediate predecessor word is highly indicative of the next word.

- **Trigram Model:**
  - The trigram model takes into account the last two words, capturing more context and providing better predictions for cases where the next word depends on a two-word phrase.

### Limitations

1. **Data Sparsity:**
   - Trigram models suffer from data sparsity, as the probability estimates for trigrams with low counts can be unreliable. This is where smoothing techniques help, but even with smoothing, rare trigrams can skew predictions.

2. **Fixed Context Size:**
   - Both bigram and trigram models have a fixed context size, limiting their ability to capture longer-range dependencies in language.

3. **Vocabulary Size:**
   - Large vocabulary sizes can lead to many unseen n-grams in the training data, again necessitating smoothing.

4. **No Semantic Understanding:**
   - N-gram models do not capture the semantic meaning of words, which can lead to grammatically correct but semantically nonsensical predictions.

### Use Case and Predictions

In the given example, for the context `['red', 'dead']`, the bigram model might predict 'redemption' (if it follows 'dead' frequently), while the trigram model can provide more accurate predictions if the phrase 'red dead' is common in the dataset.

### Conclusion

While n-gram models are simple and provide a baseline for language modeling, they have limitations that can be addressed with more advanced models like neural language models, which capture longer dependencies and semantic meanings better.

## 4. Extend the language model to perform basic **sentiment analysis**. Perform some descriptive analysis and then use the n-grams and Word frequency to predict the sentiment of the sentences in your text (positive, negative and, if you want, neutral) instead of just predicting the next Word. Don't forget to evaluate the performance of your model and comment on the results.

In [None]:
# Print some descriptive statistics
print(f"Total number of unigrams: {len(unigram_freq)}")
print(f"Total number of bigrams: {len(bigram_freq)}")
print(f"Total number of trigrams: {len(trigram_freq)}")
print(f"Most common unigrams: {unigram_freq.most_common(10)}")
print(f"Most common bigrams: {bigram_freq.most_common(10)}")
print(f"Most common trigrams: {trigram_freq.most_common(10)}")

Total number of unigrams: 31115
Total number of bigrams: 338250
Total number of trigrams: 693150
Most common unigrams: [('the', 44611), ('i', 36164), ('to', 29042), ('and', 26712), ('a', 24307), ('of', 19528), ('it', 17941), ('is', 17883), ('in', 15795), ('for', 15672)]
Most common bigrams: [('i m', 4531), ('it s', 3716), ('twitter com', 3708), ('pic twitter', 3511), ('of the', 2862), ('in the', 2842), ('can t', 2377), ('don t', 2369), ('and i', 2197), ('t co', 2087)]
Most common trigrams: [('pic twitter com', 3511), ('https t co', 2085), ('red dead redemption', 1077), ('i can t', 937), ('_ _ _', 936), ('call of duty', 835), ('italy italy italy', 761), ('i don t', 735), ('dead redemption 2', 708), ('league of legends', 651)]


**Label Sentiments**

We have sentiment labels (positive, negative, neutral) for each tweet in the dataset so we will use these labels to train a sentiment classifier.