<a href="https://colab.research.google.com/github/Anggunasr/MSBA2425/blob/main/Language_Models.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Language Modeling
Language modeling in natural language processing (NLP) plays a pivotal role in the development of intelligent systems that can understand and generate human language. Essentially, a language model aims to predict the likelihood of a sequence of words or the probability of the next word given a specific context. By capturing the underlying structure and patterns in textual data, language models facilitate various NLP tasks, such as machine translation, text summarization, sentiment analysis, and conversational AI.


# Probabilistic Language Modeling with n-grams
Probabilistic language modeling using n-grams is a fundamental approach in NLP that leverages the statistical properties of text to predict word sequences. An n-gram model represents text as contiguous sequences of n words, where the context for predicting the next word is limited to the previous n-1 words. For instance, a bigram (n=2) model restricts the context to a single preceding word, while a trigram (n=3) model considers the two preceding words.
N-gram models estimate the probabilities of word sequences by counting how often each sequence occurs in a given corpus. They rely on the Markov assumption, which states that the probability of the next word depends only on the preceding n-1 words. This keeps the number of contexts to count manageable and simplifies computation.
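
Written out, the Markov assumption and the count-based estimate look as follows. The notation is an assumption of this sketch rather than something defined in the notebook: $w_i$ is the $i$-th word, $m$ is the sequence length, and $C(\cdot)$ is the number of times a word sequence occurs in the corpus.

$$
P(w_1, \dots, w_m) = \prod_{i=1}^{m} P(w_i \mid w_1, \dots, w_{i-1}) \approx \prod_{i=1}^{m} P(w_i \mid w_{i-n+1}, \dots, w_{i-1})
$$

$$
\hat{P}(w_i \mid w_{i-n+1}, \dots, w_{i-1}) = \frac{C(w_{i-n+1}, \dots, w_i)}{C(w_{i-n+1}, \dots, w_{i-1})}
$$

For a trigram model (n=3), the estimate becomes $\hat{P}(w_i \mid w_{i-2}, w_{i-1}) = C(w_{i-2}, w_{i-1}, w_i) \,/\, C(w_{i-2}, w_{i-1})$.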

Despite their simplicity, n-gram models have been widely used in various NLP tasks, such as speech recognition, machine translation, and text generation. However, they have limitations, including data sparsity and the inability to capture long-range dependencies in text. The emergence of more sophisticated techniques like deep learning-based language models has shifted the focus, but n-gram models still hold relevance as a foundation for understanding language modeling and its development.
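
To make the data sparsity problem concrete, here is a minimal sketch on a made-up two-sentence corpus. The corpus and the `bigram_prob` helper are assumptions for illustration only. A perfectly reasonable word pair that never occurs in the training text gets a probability of zero.

```python
from collections import Counter

# Toy corpus (an assumption for illustration; not the novel used in this session)
corpus = "the cat sat on the mat . the dog slept on the rug .".split()

# Count bigrams and the contexts (first words) they start from
bigram_counts = Counter(zip(corpus, corpus[1:]))
context_counts = Counter(corpus[:-1])

def bigram_prob(w1, w2):
    """Count-based estimate of P(w2 | w1); 0.0 if w1 was never seen as a context."""
    if context_counts[w1] == 0:
        return 0.0
    return bigram_counts[(w1, w2)] / context_counts[w1]

print(bigram_prob("the", "cat"))    # seen bigram
print(bigram_prob("dog", "slept"))  # seen bigram
print(bigram_prob("dog", "sat"))    # plausible English, but unseen in this corpus
```

The last call returns zero even though "dog sat" is ordinary English. With larger n the number of possible contexts grows quickly, so unseen sequences become more common.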

In this session, we will explore n-gram-based language modeling.


# Example: A Jane Austen Novel
We will explore ‘next word prediction’ and ‘text generation’ based on the Jane Austen novel ‘Sense and Sensibility’. Let’s first look at all the books that are available.


## View Books and Download

In [3]:
# !pip install --upgrade jax jaxlib numpy==1.24.3


**Select Runtime > Restart session after installing jax**

In [1]:
import nltk
nltk.download('gutenberg')  # Make sure the Gutenberg corpus is downloaded
from nltk.corpus import gutenberg

[nltk_data] Downloading package gutenberg to /root/nltk_data...
[nltk_data]   Unzipping corpora/gutenberg.zip.


In [2]:
# List available texts in the Gutenberg corpus
print(gutenberg.fileids())

['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt', 'blake-poems.txt', 'bryant-stories.txt', 'burgess-busterbrown.txt', 'carroll-alice.txt', 'chesterton-ball.txt', 'chesterton-brown.txt', 'chesterton-thursday.txt', 'edgeworth-parents.txt', 'melville-moby_dick.txt', 'milton-paradise.txt', 'shakespeare-caesar.txt', 'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt', 'whitman-leaves.txt']


In [18]:
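# Load the raw text of a novel.
# Use 'austen-sense.txt' for "Sense and Sensibility" (the worked example);
# the run shown here loads "Moby Dick" instead.
# sas = gutenberg.raw('austen-sense.txt')
sas = gutenberg.raw('melville-moby_dick.txt')

# Print the first 5000 characters of the loaded text
print(sas[:5000])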


[Moby Dick by Herman Melville 1851]


ETYMOLOGY.

(Supplied by a Late Consumptive Usher to a Grammar School)

The pale Usher--threadbare in coat, heart, body, and brain; I see him
now.  He was ever dusting his old lexicons and grammars, with a queer
handkerchief, mockingly embellished with all the gay flags of all the
known nations of the world.  He loved to dust his old grammars; it
somehow mildly reminded him of his mortality.

"While you take in hand to school others, and to teach them by what
name a whale-fish is to be called in our tongue leaving out, through
ignorance, the letter H, which almost alone maketh the signification
of the word, you deliver that which is not true." --HACKLUYT

"WHALE. ... Sw. and Dan. HVAL.  This animal is named from roundness
or rolling; for in Dan. HVALT is arched or vaulted." --WEBSTER'S
DICTIONARY

"WHALE. ... It is more immediately from the Dut. and Ger. WALLEN;
A.S. WALW-IAN, to roll, to wallow." --RICHARDSON'S DICTIONARY


## n-gram Model for Next Word Prediction and Text Generation

The key steps are as follows.

* **Step 1: Preprocess the Text**

Start by tokenizing the text.

* **Step 2: Build the N-gram Model**

Create a trigram model, which will be used for predicting the next word based on the previous two words.

* **Step 3: Next Word Prediction**

Write a function that takes two words as input and predicts the most probable next word.

* **Step 4: Text Generation**

Using the trigram model, generate text by iteratively predicting the next word.

In [19]:
import nltk
from nltk import word_tokenize, ngrams
from collections import defaultdict, Counter
nltk.download('punkt')
nltk.download('punkt_tab')

# Tokenize the text
tokens = word_tokenize(sas.lower())  # Convert to lower case

# Generate trigrams from the tokens
trigrams = list(ngrams(tokens, 3))
trigram_freq = defaultdict(Counter)

# Populate the frequencies of trigrams
for w1, w2, w3 in trigrams:
    trigram_freq[(w1, w2)][w3] += 1

# Function to predict the next word
def predict_next_word(w1, w2):
    if (w1, w2) in trigram_freq:
        # Get the most common next word for the given bigram (w1, w2)
        return trigram_freq[(w1, w2)].most_common(1)[0][0]
    else:
        return None

# Function to generate text
def generate_text(start_words, num_words):
    if len(start_words) < 2:
        return "Please provide at least two starting words."

    generated_words = list(start_words)
    for _ in range(num_words):
        next_word = predict_next_word(generated_words[-2], generated_words[-1])
        if next_word is None:
            break  # Break if no next word is found
        generated_words.append(next_word)

    return ' '.join(generated_words)



[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
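
`trigram_freq` stores raw counts. As a rough sketch of how those counts relate to the probabilities discussed earlier, the snippet below rebuilds the same structure on a few made-up tokens and normalizes the counts for one context. The toy sentence, `toy_freq`, and the `next_word_probs` helper are assumptions for illustration, and `str.split` stands in for `word_tokenize`.

```python
from collections import Counter, defaultdict

# Toy tokens stand in for the tokenized novel (an assumption for illustration)
tokens = "i am sure i am glad i am sure".split()

# Same structure as trigram_freq: (w1, w2) -> Counter of following words
toy_freq = defaultdict(Counter)
for w1, w2, w3 in zip(tokens, tokens[1:], tokens[2:]):
    toy_freq[(w1, w2)][w3] += 1

def next_word_probs(freq, w1, w2):
    """Return {word: P(word | w1, w2)} from raw counts; empty dict for an unseen context."""
    counts = freq.get((w1, w2), Counter())
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

print(next_word_probs(toy_freq, "i", "am"))
```

Here "sure" follows "i am" twice and "glad" once, so the estimated probabilities are 2/3 and 1/3. Picking the most common entry, as `predict_next_word` does, is the same as picking the word with the highest estimated probability.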


In [5]:
trigrams

[('[', 'sense', 'and'),
 ('sense', 'and', 'sensibility'),
 ('and', 'sensibility', 'by'),
 ('sensibility', 'by', 'jane'),
 ('by', 'jane', 'austen'),
 ('jane', 'austen', '1811'),
 ('austen', '1811', ']'),
 ('1811', ']', 'chapter'),
 (']', 'chapter', '1'),
 ('chapter', '1', 'the'),
 ('1', 'the', 'family'),
 ('the', 'family', 'of'),
 ('family', 'of', 'dashwood'),
 ('of', 'dashwood', 'had'),
 ('dashwood', 'had', 'long'),
 ('had', 'long', 'been'),
 ('long', 'been', 'settled'),
 ('been', 'settled', 'in'),
 ('settled', 'in', 'sussex'),
 ('in', 'sussex', '.'),
 ('sussex', '.', 'their'),
 ('.', 'their', 'estate'),
 ('their', 'estate', 'was'),
 ('estate', 'was', 'large'),
 ('was', 'large', ','),
 ('large', ',', 'and'),
 (',', 'and', 'their'),
 ('and', 'their', 'residence'),
 ('their', 'residence', 'was'),
 ('residence', 'was', 'at'),
 ('was', 'at', 'norland'),
 ('at', 'norland', 'park'),
 ('norland', 'park', ','),
 ('park', ',', 'in'),
 (',', 'in', 'the'),
 ('in', 'the', 'centre'),
 ('the', 'cent

In [6]:
len(trigrams)

141437

### Test the Results

In [13]:
# Example usage of the prediction function
print("Next word:", predict_next_word('by', 'the'))

Next word: entrance


In [8]:
# Example usage of the text generation function
start_words = ("the", "more")
generate_text(start_words, 100)

"the more easily reconciled , by the entrance of the house , and the two miss steeles , as she had been in the world . '' `` i am sure i would not be in town , and the two miss steeles , as she had been in the world . '' `` i am sure i would not be in town , and the two miss steeles , as she had been in the world . '' `` i am sure i would not be in town , and the two miss steeles , as she had been in the"

### Improving Text Generation
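Always picking the most frequent next word makes generation deterministic: whenever a pair of words recurs, the same continuation follows. The generated text therefore falls into a loop and becomes repetitive and uninteresting, as the output above shows.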
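
The looping behavior can be reproduced on a tiny example. This is a sketch on made-up tokens. The letters and the `greedy_generate` helper are illustrative assumptions, not part of the notebook.

```python
from collections import Counter, defaultdict

# Made-up token stream in which "a b" is followed by "c" twice and "d" once
tokens = "a b c a b c a b d".split()

freq = defaultdict(Counter)
for w1, w2, w3 in zip(tokens, tokens[1:], tokens[2:]):
    freq[(w1, w2)][w3] += 1

def greedy_generate(start, num_words):
    """Extend `start` by always taking the most frequent next word."""
    words = list(start)
    for _ in range(num_words):
        counts = freq.get((words[-2], words[-1]))
        if not counts:
            break  # unseen context: stop generating
        words.append(counts.most_common(1)[0][0])
    return words

print(" ".join(greedy_generate(("a", "b"), 7)))
```

The context `("a", "b")` always yields "c", which leads back to `("a", "b")` two steps later, so the less frequent continuation "d" is never produced.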

We will modify the `predict_next_word` function to sample the next word from the observed distribution of next words instead of always taking the most frequent one. More likely words are still favored, but the same context no longer always yields the same word.
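
As a small illustration of sampling in proportion to counts, the sketch below draws many samples from one hypothetical counter. The counts and the `sample_next_word` helper are assumptions for illustration. It uses `random.choices` with weights, which is one possible way to do weighted sampling.

```python
import random
from collections import Counter

# Hypothetical counts of words seen after some two-word context
next_word_counts = Counter({"whale": 6, "ship": 3, "sea": 1})

def sample_next_word(counts, rng=random):
    """Draw one word with probability proportional to its count."""
    words = list(counts.keys())
    weights = list(counts.values())
    return rng.choices(words, weights=weights, k=1)[0]

rng = random.Random(0)  # fixed seed so the run is repeatable
samples = Counter(sample_next_word(next_word_counts, rng) for _ in range(10_000))
print(samples.most_common())
```

Over many draws, "whale" should appear roughly 60% of the time, "ship" about 30%, and "sea" about 10%. Frequent words still dominate, but rarer continuations also get a chance.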


In [20]:
import random

# Function to predict the next word with randomness
def predict_next_word(w1, w2):
    if (w1, w2) in trigram_freq:
        next_words = list(trigram_freq[(w1, w2)].elements())
        return random.choice(next_words) if next_words else None
    else:
        return None

# Function to generate text with randomness
def generate_text(start_words, num_words):
    if len(start_words) < 2:
        return "Please provide at least two starting words."

    generated_words = list(start_words)
    for _ in range(num_words):
        next_word = predict_next_word(generated_words[-2], generated_words[-1])
        if next_word is None:
            break  # Break if no next word is found
        generated_words.append(next_word)

    return ' '.join(generated_words)



In [25]:
# Example usage of the text generation function
# (only the last two start words serve as context for the first prediction)
start_words = ("it", "was", "the")
generate_text(start_words, 500)

"it was the immovable strain upon the fair face of the harpoon as compared with the leviathan -- to the deck . the frenzies of the rigging lived . the shavings into the victory 's plank where nelson fell . `` spread yourselves , if any strange face were visible ; for your englishman is rather reserved , and in this enchanted mood , thy subtlest thinkings may be deemed pre-eminently presuming and ridiculous . doubtless one leading reason why you do n't aggravate me -- let us squeeze ourselves universally into the bows , and give him much joy . for not only were the strong , unstaggering breeze abounded so , cutting my boat in certain books , whose allurements cover nothing but mist . and about thirty more behind it all came out of thyself , ishmael , that is , an interval of some sort , they have finally bestirred themselves ; the peculiar stair-like formation of all his successive meetings with various tints , seemed vacating itself of life that lives in a fog -- yea , and lying in var

# Your turn

Build next word prediction and text generation with n-grams for the novel Moby Dick, by Herman Melville. You can load it from the Gutenberg corpus with the file ID `melville-moby_dick.txt`.
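
If you want to experiment with different values of n, one possible starting point is a builder that works for any n of 2 or more. This is only a sketch: `build_ngram_model`, `most_likely_next`, and the toy tokens are assumptions for illustration, and you would replace the toy tokens with the tokenized novel.

```python
from collections import Counter, defaultdict

def build_ngram_model(tokens, n):
    """Map each (n-1)-word context to a Counter of the words that follow it."""
    if n < 2:
        raise ValueError("n must be at least 2")
    model = defaultdict(Counter)
    for i in range(len(tokens) - n + 1):
        context = tuple(tokens[i:i + n - 1])
        model[context][tokens[i + n - 1]] += 1
    return model

def most_likely_next(model, context):
    """Return the most frequent follower of `context`, or None if it was never seen."""
    counts = model.get(tuple(context))
    return counts.most_common(1)[0][0] if counts else None

# Made-up tokens for a quick check; swap in the tokenized novel for the exercise
toy_tokens = "call me ishmael . call me ahab . call me ishmael .".split()
bigram_model = build_ngram_model(toy_tokens, 2)
trigram_model = build_ngram_model(toy_tokens, 3)

print(most_likely_next(bigram_model, ["me"]))
print(most_likely_next(trigram_model, ["call", "me"]))
```

Larger n gives more specific contexts but fewer examples per context, so it is worth comparing the generated text for a few values of n.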