## Overview & Objectives:

- What is an n‑gram language model.

- How to generate text using n‑gram probabilities.

- Our plan to compare outputs for different values of n.

## Data Preparation:

- Loading or defining a sample corpus.

- Tokenizing and cleaning up the text.

## Building the n‑gram Model:

- Creating a dictionary (or frequency table) of n‑grams.

- Calculating probabilities for each possible next word given a context.

- Writing functions that build the model for any given n value.

## Text Generation:

- Implementing a function to generate text using our n‑gram probabilities.

- Using random sampling (weighted by probabilities) to choose the next word.

- Comparing text outputs with different n (e.g., unigram, bigram, trigram).

## Visualization:

- Creating bar plots (or similar plots) to visualize frequency distributions of n‑grams.

- Exploring how the distribution changes as n increases.


# Overview & Objectives:


### What is an n‑gram language model?
- An n‑gram language model predicts the next word in a sequence from the previous (n−1) words. In other words, it assumes each word depends only on the n−1 words immediately before it.

- If n=1, it’s a unigram model → looks at each word independently.

- If n=2, it’s a bigram model → looks at the one previous word.

- If n=3, it’s a trigram model → looks at the previous two words.

And so on...

#### Example:
- For the sentence: "I love coding in Python"

Unigrams: ["I", "love", "coding", "in", "Python"]

Bigrams: ["I love", "love coding", "coding in", "in Python"]

Trigrams: ["I love coding", "love coding in", "coding in Python"]
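
The lists above can be reproduced with a sliding window over the tokens. This is a minimal sketch that assumes plain whitespace tokenization; `extract_ngrams` is an illustrative helper, not a function used elsewhere in this notebook.

```python
def extract_ngrams(sentence, n):
    """Return all n-grams of a whitespace-tokenized sentence as strings."""
    tokens = sentence.split()
    # A sentence with T tokens has T - n + 1 n-grams, hence the "+ 1".
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


sentence = "I love coding in Python"
for n in (1, 2, 3):
    print(n, extract_ngrams(sentence, n))
```

If n is larger than the number of tokens, the range is empty and the function returns an empty list.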




# ✍️ How to Generate Text Using N-Gram Probabilities

### 1. Train the Model  
Count all the n-grams in the training data.

---

### 2. Estimate Probabilities  
Use Maximum Likelihood Estimation (MLE):

$$P(w_n \mid w_{n-1}, \ldots, w_1) = \frac{\text{Count}(w_1, \ldots, w_n)}{\text{Count}(w_1, \ldots, w_{n-1})}$$
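
As a worked example of this estimate for the bigram case (n=2), the sketch below computes probabilities on a made-up three-sentence corpus. The corpus and the helper name `mle_probability` are illustrative assumptions, not part of the notebook's data or code.

```python
from collections import Counter

# Toy corpus, invented for illustration only.
corpus = ["i love coding", "i love python", "i hate bugs"]

unigram_counts = Counter()
bigram_counts = Counter()
for sentence in corpus:
    tokens = sentence.split()
    unigram_counts.update(tokens)
    bigram_counts.update(zip(tokens, tokens[1:]))


def mle_probability(context, word):
    """P(word | context) = Count(context, word) / Count(context)."""
    if unigram_counts[context] == 0:
        return 0.0  # unseen context: the ratio is undefined, so return 0 here
    return bigram_counts[(context, word)] / unigram_counts[context]


# "i" occurs 3 times and is followed by "love" in 2 of them.
print(mle_probability("i", "love"))
# "love" occurs 2 times and is followed by "python" once.
print(mle_probability("love", "python"))
```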


---

### 3. Generate Text  
- Start with a seed sequence of **(n−1)** words.  
- Predict the **next word** based on the previous context.  
- Keep adding predicted words by sliding the window forward. 
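
The three steps above can be sketched end to end on a toy corpus. This is a hedged illustration, not the notebook's implementation: `build_next_word_counts` and `generate` are hypothetical helpers, the corpus is invented, and the next word is drawn by weighted random sampling with `random.choices`.

```python
import random
from collections import Counter, defaultdict


def build_next_word_counts(sentences, n):
    """Map each (n-1)-word context to a Counter of the words that follow it."""
    next_words = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i in range(len(tokens) - n + 1):
            context = tuple(tokens[i:i + n - 1])
            next_words[context][tokens[i + n - 1]] += 1
    return next_words


def generate(next_words, seed, n, max_words=20, rng=random):
    """Extend `seed` by sampling each next word in proportion to its count."""
    words = list(seed)
    for _ in range(max_words):
        # The context is the last n-1 words (an empty tuple for a unigram model).
        context = tuple(words[len(words) - (n - 1):])
        candidates = next_words.get(context)
        if not candidates:
            break  # context never seen in training: stop generating
        choices, weights = zip(*candidates.items())
        words.append(rng.choices(choices, weights=weights, k=1)[0])
    return " ".join(words)


toy_corpus = ["i love coding in python", "i love reading about python"]
model = build_next_word_counts(toy_corpus, n=2)
print(generate(model, seed=["i"], n=2, rng=random.Random(0)))
```

Sampling by count is equivalent to sampling by the MLE probability, because the weights for a given context only need to be proportional to the probabilities.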

In [1]:
import pandas as pd

In [2]:
# Start with a helper that collects every n-gram in the corpus.
def count_all_ngram(texts, n_gram):  # texts: list of sentences (strings)
    # Add start and end tokens (</s> and </e>) to lowercased copies,
    # so the caller's list is not modified.
    texts = ['</s> ' + text.lower() + ' </e>' for text in texts]

    # Collect n-grams with a sliding window.
    ngram_list = []
    for text in texts:
        # Tokenize the sentence on whitespace.
        text_split = text.split()
        # "+ 1" so that the last n-gram (the one ending in </e>) is included.
        for i in range(len(text_split) - n_gram + 1):
            ngram_list.append(text_split[i:i + n_gram])

    return ngram_list
    

In [3]:
path_to_data = ['/kaggle/input/sentiment-labelled-sentences-data-set/sentiment labelled sentences/amazon_cells_labelled.txt',
               '/kaggle/input/sentiment-labelled-sentences-data-set/sentiment labelled sentences/imdb_labelled.txt',
                '/kaggle/input/sentiment-labelled-sentences-data-set/sentiment labelled sentences/yelp_labelled.txt']
path_to_data2 = [
                '/kaggle/input/bible/t_asv.csv',
                '/kaggle/input/bible/t_bbe.csv',
                # '/kaggle/input/bible/t_dby.csv',
                '/kaggle/input/bible/t_kjv.csv',
                '/kaggle/input/bible/t_wbt.csv',
                '/kaggle/input/bible/t_web.csv',
                '/kaggle/input/bible/t_ylt.csv'
               ]

In [4]:
df1 = pd.concat([pd.read_csv(path, sep='\t', header=None, on_bad_lines="skip" ) for path in path_to_data] )


In [5]:
df2 = pd.concat([pd.read_csv(path) for path in path_to_data2])

In [6]:
df2 

| | id | b | c | v | t |
|---|---|---|---|---|---|
| 0 | 1001001 | 1 | 1 | 1 | In the beginning God created the heavens and t... |
| 1 | 1001002 | 1 | 1 | 2 | And the earth was waste and void; and darkness... |
| 2 | 1001003 | 1 | 1 | 3 | And God said, Let there be light: and there wa... |
| 3 | 1001004 | 1 | 1 | 4 | And God saw the light, that it was good: and G... |
| 4 | 1001005 | 1 | 1 | 5 | And God called the light Day, and the darkness... |
| ... | ... | ... | ... | ... | ... |
| 31098 | 66022017 | 66 | 22 | 17 | And the Spirit and the Bride say, Come; and he... |
| 31099 | 66022018 | 66 | 22 | 18 | `For I testify to every one hearing the words ... |
| 31100 | 66022019 | 66 | 22 | 19 | and if any one may take away from the words of... |
| 31101 | 66022020 | 66 | 22 | 20 | he saith -- who is testifying these things -- ... |


In [7]:
df1

| | 0 | 1 |
|---|---|---|
| 0 | So there is no way for me to plug it in here i... | 0 |
| 1 | Good case, Excellent value. | 1 |
| 2 | Great for the jawbone. | 1 |
| 3 | Tied to charger for conversations lasting more... | 0 |
| 4 | The mic is great. | 1 |
| ... | ... | ... |
| 995 | I think food should have flavor and texture an... | 0 |
| 996 | Appetite instantly gone. | 0 |
| 997 | Overall I was not impressed and would not go b... | 0 |
| 998 | The whole experience was underwhelming, and I ... | 0 |


In [8]:
df  = pd.concat([df1[0],df2['t']])

In [9]:
df

0        So there is no way for me to plug it in here i...
1                              Good case, Excellent value.
2                                   Great for the jawbone.
3        Tied to charger for conversations lasting more...
4                                        The mic is great.
                               ...                        
31098    And the Spirit and the Bride say, Come; and he...
31099    `For I testify to every one hearing the words ...
31100    and if any one may take away from the words of...
31101    he saith -- who is testifying these things -- ...
31102    The grace of our Lord Jesus Christ `is' with y...
Length: 189368, dtype: object

In [10]:
# To inspect all n-grams, or the frequency of one of them, count within the returned list:
# count_all_ngram(df[0].to_list(), 2)
# count_all_ngram(df[0].to_list(), 1).count(['so'])

In [11]:
# Vocabulary: every distinct whitespace-separated token (case and punctuation are kept as-is).
vocab = {word for sentence in df.to_list() for word in sentence.split()}

In [13]:
len(vocab)

68592
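
Splitting on whitespace keeps case and punctuation attached to words, so tokens such as `generation` and `generation,` end up as separate vocabulary entries. One possible cleanup step is a small regex tokenizer. It is sketched here as an assumption rather than something the notebook does, and `simple_tokenize` is a hypothetical helper.

```python
import re


def simple_tokenize(text):
    """Lowercase the text and return word tokens without surrounding punctuation."""
    # Letters/digits, optionally joined by apostrophes (e.g. "it's").
    return re.findall(r"[a-z0-9]+(?:'[a-z0-9]+)*", text.lower())


print(simple_tokenize("A generation, and a Generation; it's here."))
# ['a', 'generation', 'and', 'a', 'generation', "it's", 'here']
```

Using a tokenizer like this for both the vocabulary and the n-gram counts would keep the two consistent, at the cost of changing the vocabulary size reported above.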

In [15]:
from collections import defaultdict, Counter

dict_all_n = {}  # maps n -> Counter of n-gram tuples
n_gram = 8       # exclusive upper bound: counts are built for n = 1..7

for i in range(1, n_gram):
    print(i)  # progress indicator
    all_ngrams = [tuple(item) for item in count_all_ngram(df.to_list(), i)]

    # Count all n-grams of this order in a single pass
    dict_all_n[i] = Counter(all_ngrams)


1
2
3
4
5
6
7


In [16]:
# dict_all_n

In [26]:
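def n_gram_model(words, n):
    """Return {next_word: probability} given the last n words of `words`.

    Here n is the context length, so (n+1)-gram counts are needed (n <= 6 with the counts above).
    """
    if len(words) >= n:
        words = words[-n:]
        print(words, -n)
    else:
        return {}  # not enough context words

    context_count = dict_all_n[n][tuple(words)]
    # Count(context + word) / (Count(context) + 1); the +1 avoids division by zero for unseen contexts.
    probs = {word: dict_all_n[n + 1][tuple(words + [word])] / (context_count + 1) for word in vocab}

    # Keep only words that were observed after the context, most likely first.
    return dict(sorted({k: v for k, v in probs.items() if v != 0}.items(), key=lambda item: -item[1]))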
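
The dictionary comprehension in `n_gram_model` scans the entire vocabulary for every query. One possible alternative is to group the counts by context once and then normalize only the continuations that were actually observed. The sketch assumes the counts are stored as a `Counter` keyed by word tuples, as `dict_all_n[n]` is. The names `index_by_context` and `next_word_distribution` are hypothetical. This variant also divides by the sum of observed continuation counts rather than by `Count(context) + 1`, so its values differ slightly from the function above.

```python
from collections import Counter, defaultdict


def index_by_context(ngram_counts):
    """Group n-gram counts by their first n-1 words."""
    by_context = defaultdict(Counter)
    for ngram, count in ngram_counts.items():
        by_context[ngram[:-1]][ngram[-1]] = count
    return by_context


def next_word_distribution(by_context, words):
    """Return {word: probability} for words observed after `words`, most likely first."""
    followers = by_context.get(tuple(words), Counter())
    total = sum(followers.values())
    if total == 0:
        return {}
    return {word: count / total for word, count in followers.most_common()}


# Toy trigram counts, invented for illustration.
toy_counts = Counter({("there", "is", "a"): 3, ("there", "is", "no"): 1, ("it", "is", "a"): 2})
index = index_by_context(toy_counts)
print(next_word_distribution(index, ["there", "is"]))  # {'a': 0.75, 'no': 0.25}
```

Each query then costs time proportional to the number of distinct continuations of the context instead of the vocabulary size.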
    
    

In [28]:
n_gram_model(['there', 'is', 'a'], 3)  # probability of each possible next word given this context

['there', 'is', 'a'] -3


{'word': 0.14915966386554622,
 'man': 0.04201680672268908,
 'god': 0.029411764705882353,
 'generation': 0.0273109243697479,
 'time': 0.025210084033613446,
 'way': 0.025210084033613446,
 'certain': 0.023109243697478993,
 'great': 0.02100840336134454,
 'sin': 0.018907563025210083,
 'prophet': 0.01680672268907563,
 'lion': 0.01680672268907563,
 'generation,': 0.014705882352941176,
 'reward': 0.014705882352941176,
 'natural': 0.012605042016806723,
 'morning': 0.012605042016806723,
 'multitude': 0.012605042016806723,
 'son': 0.01050420168067227,
 'crying': 0.01050420168067227,
 'king': 0.01050420168067227,
 'sound': 0.01050420168067227,
 'noise': 0.01050420168067227,
 'feast': 0.01050420168067227,
 'place': 0.01050420168067227,
 'vanity': 0.01050420168067227,
 'white': 0.01050420168067227,
 'cup,': 0.008403361344537815,
 'wicked': 0.008403361344537815,
 'people': 0.008403361344537815,
 'league': 0.008403361344537815,
 'kinsman': 0.008403361344537815,
 'cry,': 0.008403361344537815,
 'season,

In [None]:
# dict_all_n[2]