**N-Gram**

An n-gram is a contiguous sequence of n items from a given sample of text or speech. In simpler terms, it's a group of n words that appear together in a text. For example, in the sentence "the quick brown fox", "the quick", "quick brown", and "brown fox" are 2-grams (bigrams), and "the quick brown" and "quick brown fox" are 3-grams (trigrams). N-grams are commonly used in natural language processing tasks like language modeling, text prediction, and spam filtering.

# **Uni-Gram**
The most basic form of an N-gram in natural language processing (NLP) is a unigram, sometimes referred to as a 1-gram. It is made up of discrete words or tokens inside a text; each unigram represents a single word on its own, without reference to context or other words nearby.

In [12]:
import re

with open('big.txt','r')as fd:
  lines = fd.readlines()

  words = []

  for line in lines:
    words += re.findall('\w+',line.lower())


  def get_pairs(words):
    data = []

    for i in range(len(words)-1):
      data.append(' '.join(words[i:i+2]))

    return data

data = get_pairs(words)

  words += re.findall('\w+',line.lower())


# **Creating Probability Distributions**

To predict the next word in NLP, we harness probability distributions, particularly conditional probabilities. The process involves:

1. Counting Occurrences: Just as with basic unigram pairs, we analyze a text corpus to tally the frequency of these bigrams.

2. Conditional Probability: This is where the real magic happens. Given a sequence of N-1 words (in the case of bigrams, just one preceding word), probability distributions estimate the likelihood of different words occurring next.

3. Using Frequency: The frequency and patterns of unigram pairs in a training corpus play a crucial role in calculating these probabilities. Words that often follow each other are assigned higher probabilities.

In [5]:
import re
import pandas as pd
import numpy as np
from tqdm import tqdm

Finding Occurence Probabilities

In [13]:
# "data" is the words pairs we made earlier
a = np.array(data)

# use numpy to find unique pairs and their counts
pair, count = np.unique(a, return_counts=True)

print(pair)

print(f'Unique pairs: {len(pair)}')

print('-'*30)

print('Total pairs:',len(data))

unique_parts = list(set(data))

print('-'*30)

prob_dist = []

for i in range(len(pair)):
  prob_dist.append([pair[i], count[i], pair[i].split(' ')[-1]])

print(len(prob_dist))

# print(prob_dist)

# In summary, this cell takes the raw list of word pairs, identifies the unique pairs and how many times each appears, and then organizes this information into a new list called prob_dist which is set up for further calculations related to probability distributions (like calculating the probability of the second word given the first).



['0 05' '0 25' '0 45' ... 'zweck ist' 'zygoma in' 'zygomatic and']
Unique pairs: 390694
------------------------------
Total pairs: 1115584
------------------------------
390694


# **Next word prediction with Uni-Gram**

Predicting the Next Word

The 'predict' function within the code snippet plays a crucial role in the next word prediction. It works as follows:

1. For a given input word, the function searches through the probability distribution data frame ('df') to identify uni-gram pairs that start with the input word.

2. The function creates a new data frame ('df_pred') to store these identified pairs, along with their frequencies and potential next words.

3. It sorts the 'df_pred' data frame by frequency in descending order, revealing which words will likely follow the input word.

4. Finally, the function returns a list of the top five most probable next words.

In [18]:
df = pd.DataFrame(prob_dist, columns=['pair', 'freq', 'out'])

df = df[df['freq'] >=5]

df.head()

def predict(word):
  df_pred = []

  for i in df.values:

    if i[0].split(' ')[0] == word:
      # print(i)
      df_pred.append([i[0], i[1], i[2]])

  df_pred = pd.DataFrame(df_pred, columns=['in','freq','out'])
  return list(df_pred.sort_values(by='freq', ascending=False).head()['out'].values)



In [22]:
predict('he')

['had', 'was', 'said', 'is', 'would']

# **Next Word Prediction - Auto Generated**

**Auto-Generated Sequencing**

Auto-generated sequencing is the automated approach to next-word prediction. In this method, we use a starting word and iteratively predict the next word in the sequence, allowing the process to continue seamlessly. The code provided demonstrates this technique:

In [23]:
word = 'one'

for i in range(20):
  pred = predict(word)
  word = pred[0]
  print(word, end=' ')

of the same time to the same time to the same time to the same time to the same time 

**Manual Selection**

On the other hand, manual selection introduces human interaction into the prediction process. Users are presented with a list of potential next words, and they select which word they want to proceed with. The code for this manual sequencing approach is as follows:

In [24]:
word = 'this'

preds = []
preds.append(word)

for i in range(5):
  pred = []

  pred = predict(word)
  print(pred)

  word = pred[int(input("Enter the number: "))]

  preds.append(word)

print('-'*20)
print(' '.join(preds))
print('-'*20)

['is', 'was', 'and', 'way', 'time']
Enter the number: 1
['a', 'the', 'not', 'in', 'to']
Enter the number: 1
['same', 'french', 'first', 'old', 'emperor']
Enter the number: 1
['and', 'army', 'had', 'were', 'revolution']
Enter the number: 1
['and', 'was', 'of', 'to', 'he']
Enter the number: 1
--------------------
this was the french army was
--------------------


# **Working with Bi-Gram, Tri-Gram and N-Gram**

**Finding the Pairs: Bi-Gram, Tri-Gram, and N-Gram**

The first step in text analysis involves identifying pairs of words using N-grams. The code provided focuses on extracting N-grams of varying lengths, specifically four-grams in this case. Here's how it works:

In [26]:
# code
def get_pairs(words, n):
    n = n + 1  # To consider N words as a single N-gram
    data = []
    for i in range(len(words) - n):
        data.append(' '.join(words[i:i + n]))
    return data

This code generates N-grams by sliding a "window" of N words across the text, effectively creating pairs of words. For example, in the sentence "I love natural language processing," a four-gram analysis would produce pairs like "I love natural," "love natural language," and so on.



---



**Finding Occurrence Probabilities**

Once we have identified these pairs, the next step is to calculate occurrence probabilities. This is vital for understanding which words are more likely to follow others in a given context. The code for this process is as follows:

In [28]:
# code
def get_prob_dist(data):
    a = np.array(data)
    pair, count = np.unique(a, return_counts=True)
    unique_pairs = list(set(data))
    prob_dist = []
    for i in range(len(unique_pairs)):
        prob_dist.append([unique_pairs[i], ' '.join(unique_pairs[i].split(' ')[:-1]), unique_pairs[i].split(' ')[-1], count[i]])
    return prob_dist

This code uses NumPy to process the generated N-grams and create a probability distribution. It counts the occurrences of each unique N-gram and records the individual words as well as their frequencies.

In [29]:
# Generate N-grams (four-grams in this case)
data = get_pairs(words, 4)

# Calculate occurrence probabilities for the N-grams
prob_dist = get_prob_dist(data)

# **Sentence Generation with Bi, Tri and N-Gram**

**Predicting the Words**

Before diving into sentence generation, we need to understand how N-gram models are used to predict the next word. The code snippet below demonstrates this process:

In [30]:
df = pd.DataFrame(prob_dist, columns = ['seq', 'inp', 'out', 'freq'])

def predict(word):

  if len(df[df['inp'] == word]):

    df_ = df[df['inp'] == word]

    top_predictions = df_.sort_values(by='freq').head()['out'].values
    return top_predictions
  else:
    print('seq is not present')

predict('this is a beautiful')

array(['country'], dtype=object)

This code uses a DataFrame to store N-gram sequences, input words, output words, and their frequencies. The 'predict' function predicts the next word given an input word or sequence. It checks if the input word is present in the DataFrame and returns the most probable following words based on their frequencies.



---



**Prediction with Auto Sequencing**

For more extended sentence generation, the 'pred_seq' function takes a seed sequence and predicts the subsequent words to create a sentence:

In [31]:
# code
# Function to predict a sequence of words based on an initial input sequence and a number of words to predict
def pred_seq(seq, n):
    output = []  # Initialize an empty list to store the predicted sequence
    output.append(seq)  # Append the initial input sequence to the output list

    for i in range(n):
        pred = predict(seq)  # Predict the next word based on the current sequence
        seq = ' '.join(seq.split(' ')[1:]) + ' ' + pred[0]  # Update the sequence by removing the first word and adding the predicted word
        output.append(pred[0])  # Append the predicted word to the output list

    return ' '.join(output)  # Return the generated sequence as a string
pred_seq('of the united states', 50)
# Example usage:
# To generate a sequence of 5 predicted words based on an initial sequence 'apple is a fruit', you can call the function like this:
# generated_sequence = pred_seq('apple is a fruit', 5)
# The 'generated_sequence' variable will contain the sequence of words, including the initial input and the predicted words.

'of the united states although it was well known had doubts about the propriety of american action in hawaii for the purpose of making an inquiry into the matter he sent a special commissioner to the islands on the basis of beginnings made five years before at anna pavlovna s an unacknowledged sense of'