<a href="https://colab.research.google.com/github/SRINIDHISAGI/NLP1/blob/main/NLP1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Colab is making it easier than ever to integrate powerful Generative AI capabilities into your projects. We are launching a public preview of a simple and intuitive Python library (google.colab.ai) that gives access to state-of-the-art language models directly within Colab environments for Pro and Pro+ subscribers. This means subscribers can spend less time on configuration and setup and more time bringing their ideas to life. With just a few lines of code, you can now perform a variety of tasks:
- Generate text
- Translate languages
- Write creative content
- Categorize text
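
As a rough sketch of the last item, text categorization can be framed as a prompt that asks the model to answer with one label from a fixed list. The helper names below (`build_category_prompt`, `categorize`) are hypothetical and not part of google.colab.ai. The `generate` argument is any function that maps a prompt string to a reply string, and a stand-in is used here so the example runs outside Colab.

```python
from typing import Callable, Sequence


def build_category_prompt(text: str, labels: Sequence[str]) -> str:
    """Build a prompt asking for exactly one label (hypothetical helper)."""
    options = ", ".join(labels)
    return (
        f"Classify the text into one of these categories: {options}.\n"
        f"Answer with the category name only.\n\nText: {text}"
    )


def categorize(text: str, labels: Sequence[str],
               generate: Callable[[str], str]) -> str:
    """Return the first known label found in the model reply, else 'unknown'."""
    reply = generate(build_category_prompt(text, labels)).strip().lower()
    for label in labels:
        if label.lower() in reply:
            return label
    return "unknown"


# Stand-in for a real model call (assumption): always answers "Sports".
def fake_generate(prompt: str) -> str:
    return "Sports"


label = categorize(
    "The striker scored twice in the final.",
    ["sports", "politics", "tech"],
    fake_generate,
)
print(label)
```

In a Pro or Pro+ Colab runtime, passing `ai.generate_text` as `generate` should behave the same way, since the cells further down show it taking a prompt string and returning the response text.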

Happy Coding!


[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/googlecolab/colabtools/blob/main/notebooks/Getting_started_with_google_colab_ai.ipynb)

In [None]:
from collections import defaultdict

toy_corpus = [
    "low", "low", "low", "low", "low",
    "lowest", "lowest",
    "newer", "newer", "newer", "newer", "newer", "newer",
    "wider", "wider", "wider",
    "new", "new"
]
corpus = [" ".join(list(word)) + " _" for word in toy_corpus]
corpus = [tuple(token.split(" ")) for token in corpus]

def get_stats(corpus):
    """Count how often each adjacent symbol pair occurs across the corpus."""
    pairs = defaultdict(int)
    for word in corpus:
        for i in range(len(word)-1):
            pairs[(word[i], word[i+1])] += 1
    return pairs

def merge_pair(pair, corpus):
    """Replace every occurrence of `pair` in each word with the merged symbol."""
    new_corpus = []
    replacement = "".join(pair)
    for word in corpus:
        new_word = []
        i = 0
        while i < len(word):
            # Merge when the current and next symbols match the pair.
            if i < len(word)-1 and (word[i], word[i+1]) == pair:
                new_word.append(replacement)
                i += 2
            else:
                new_word.append(word[i])
                i += 1
        new_corpus.append(tuple(new_word))
    return new_corpus

vocab = set([sym for word in corpus for sym in word])
print("Initial vocab:", vocab)

for step in range(1, 4):
    pairs = get_stats(corpus)
    best = max(pairs, key=pairs.get)
    corpus = merge_pair(best, corpus)
    vocab.add("".join(best))
    print(f"\nStep {step}: merge {best} → {''.join(best)}")
    print("Corpus sample:", corpus[:8])
    print("Updated vocab:", vocab)


Initial vocab: {'l', 'w', 'i', 'o', 'n', 's', 'e', 't', 'r', '_', 'd'}

Step 1: merge ('e', 'r') → er
Corpus sample: [('l', 'o', 'w', '_'), ('l', 'o', 'w', '_'), ('l', 'o', 'w', '_'), ('l', 'o', 'w', '_'), ('l', 'o', 'w', '_'), ('l', 'o', 'w', 'e', 's', 't', '_'), ('l', 'o', 'w', 'e', 's', 't', '_'), ('n', 'e', 'w', 'er', '_')]
Updated vocab: {'l', 'w', 'i', 'o', 'n', 's', 'e', 't', 'r', '_', 'd', 'er'}

Step 2: merge ('er', '_') → er_
Corpus sample: [('l', 'o', 'w', '_'), ('l', 'o', 'w', '_'), ('l', 'o', 'w', '_'), ('l', 'o', 'w', '_'), ('l', 'o', 'w', '_'), ('l', 'o', 'w', 'e', 's', 't', '_'), ('l', 'o', 'w', 'e', 's', 't', '_'), ('n', 'e', 'w', 'er_')]
Updated vocab: {'l', 'w', 'i', 'o', 'n', 's', 'e', 't', 'er_', 'r', '_', 'd', 'er'}

Step 3: merge ('n', 'e') → ne
Corpus sample: [('l', 'o', 'w', '_'), ('l', 'o', 'w', '_'), ('l', 'o', 'w', '_'), ('l', 'o', 'w', '_'), ('l', 'o', 'w', '_'), ('l', 'o', 'w', 'e', 's', 't', '_'), ('l', 'o', 'w', 'e', 's', 't', '_'), ('ne', 'w', 'er_')]
U

In [None]:
from collections import defaultdict

toy_corpus = [
    "low", "low", "low", "low", "low",
    "lowest", "lowest",
    "newer", "newer", "newer", "newer", "newer", "newer",
    "wider", "wider", "wider",
    "new", "new"
]

def get_stats(corpus):
    pairs = defaultdict(int)
    for word in corpus:
        for i in range(len(word)-1):
            pairs[(word[i], word[i+1])] += 1
    return pairs

def merge_pair(pair, corpus):
    new_corpus = []
    replacement = "".join(pair)
    for word in corpus:
        new_word = []
        i = 0
        while i < len(word):
            if i < len(word)-1 and (word[i], word[i+1]) == pair:
                new_word.append(replacement)
                i += 2
            else:
                new_word.append(word[i])
                i += 1
        new_corpus.append(new_word)
    return new_corpus

class BPE:
    def __init__(self, num_merges):
        self.num_merges = num_merges
        self.merges = []

    def fit(self, corpus_words):
        corpus = [list(word) + ["_"] for word in corpus_words]
        for i in range(self.num_merges):
            pairs = get_stats(corpus)
            if not pairs:
                break
            best = max(pairs, key=pairs.get)
            corpus = merge_pair(best, corpus)
            self.merges.append(best)
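            # Counts distinct symbols currently present in the corpus, so the number
            # can drop when a merge consumes every occurrence of a symbol.
            print(f"Step {i+1}: merge {best} → {''.join(best)} (vocab size={len(set(x for w in corpus for x in w))})")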
        self.corpus = corpus

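    def segment(self, word):
        """Segment a word by replaying the learned merges in the order they were learned."""
        tokens = list(word) + ["_"]
        for pair in self.merges:
            i = 0
            while i < len(tokens)-1:
                if (tokens[i], tokens[i+1]) == pair:
                    # Collapse the matching pair in place; stay at i to check the new symbol.
                    tokens[i:i+2] = ["".join(pair)]
                else:
                    i += 1
        return tokens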

print("=== Q3.2 Mini-BPE Learner ===")
toy_bpe = BPE(num_merges=10)
toy_bpe.fit(toy_corpus)

test_words = ["new", "newer", "lowest", "widest", "newestest"]
for w in test_words:
    print(f"{w} → {toy_bpe.segment(w)}")

=== Q3.2 Mini-BPE Learner ===
Step 1: merge ('e', 'r') → er (vocab size=11)
Step 2: merge ('er', '_') → er_ (vocab size=11)
Step 3: merge ('n', 'e') → ne (vocab size=11)
Step 4: merge ('ne', 'w') → new (vocab size=11)
Step 5: merge ('l', 'o') → lo (vocab size=10)
Step 6: merge ('lo', 'w') → low (vocab size=10)
Step 7: merge ('new', 'er_') → newer_ (vocab size=11)
Step 8: merge ('low', '_') → low_ (vocab size=12)
Step 9: merge ('w', 'i') → wi (vocab size=11)
Step 10: merge ('wi', 'd') → wid (vocab size=10)
new → ['new', '_']
newer → ['newer_']
lowest → ['low', 'e', 's', 't', '_']
widest → ['wid', 'e', 's', 't', '_']
newestest → ['new', 'e', 's', 't', 'e', 's', 't', '_']
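
The segmentations above keep the end-of-word marker `_` inside the token list. As a small sketch (the `detokenize` helper is hypothetical and not part of the cell above), joining the tokens and dropping the trailing marker recovers the original word:

```python
def detokenize(tokens, end_marker="_"):
    """Join BPE tokens and strip the trailing end-of-word marker, if present."""
    word = "".join(tokens)
    if word.endswith(end_marker):
        word = word[:-len(end_marker)]
    return word


print(detokenize(["low", "e", "s", "t", "_"]))
print(detokenize(["newer_"]))
```

This assumes `_` never appears inside the words themselves, which holds for the toy corpus here.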


In [None]:
import re
from collections import defaultdict

paragraph = """Data engineering collects and prepares data for analysis.
It involves cleaning, transforming, and moving data between systems.
Good data pipelines make analytics and machine learning reliable and repeatable.
Engineers design systems for scale, latency, and fault tolerance."""

words = re.findall(r"\b\w+\b", paragraph.lower())

def get_stats(corpus):
    pairs = defaultdict(int)
    for word in corpus:
        for i in range(len(word)-1):
            pairs[(word[i], word[i+1])] += 1
    return pairs

def merge_pair(pair, corpus):
    new_corpus = []
    replacement = "".join(pair)
    for word in corpus:
        new_word = []
        i = 0
        while i < len(word):
            if i < len(word)-1 and (word[i], word[i+1]) == pair:
                new_word.append(replacement)
                i += 2
            else:
                new_word.append(word[i])
                i += 1
        new_corpus.append(new_word)
    return new_corpus

class BPE:
    def __init__(self, num_merges):
        self.num_merges = num_merges
        self.merges = []
        self.vocab = set()

    def fit(self, corpus_words):
        corpus = [list(word) + ["_"] for word in corpus_words]
        vocab = set(sym for w in corpus for sym in w)

        for i in range(self.num_merges):
            pairs = get_stats(corpus)
            if not pairs:
                break
            best = max(pairs, key=pairs.get)
            self.merges.append(best)
            corpus = merge_pair(best, corpus)
            vocab.add("".join(best))
            print(f"Step {i+1}: merge {best} → {''.join(best)}")
        self.vocab = vocab
        self.corpus = corpus

    def segment(self, word):
        tokens = list(word) + ["_"]
        for pair in self.merges:
            i = 0
            while i < len(tokens)-1:
                if (tokens[i], tokens[i+1]) == pair:
                    tokens[i:i+2] = ["".join(pair)]
                else:
                    i += 1
        return tokens

para_bpe = BPE(num_merges=30)
para_bpe.fit(words)

print("\nTop 5 merges:", ["".join(p) for p in para_bpe.merges[:5]])
longest = sorted(para_bpe.vocab, key=lambda x: len(x), reverse=True)[:5]
print("5 longest tokens:", longest)

for w in ["data", "transforming", "fault", "pipelines", "reliable"]:
    print(f"{w} → {para_bpe.segment(w)}")




Step 1: merge ('i', 'n') → in
Step 2: merge ('a', 'n') → an
Step 3: merge ('s', '_') → s_
Step 4: merge ('l', 'e') → le
Step 5: merge ('a', 't') → at
Step 6: merge ('d', '_') → d_
Step 7: merge ('at', 'a') → ata
Step 8: merge ('in', 'g') → ing
Step 9: merge ('ing', '_') → ing_
Step 10: merge ('an', 'd_') → and_
Step 11: merge ('d', 'ata') → data
Step 12: merge ('data', '_') → data_
Step 13: merge ('e', 'n') → en
Step 14: merge ('in', 'e') → ine
Step 15: merge ('r', 'e') → re
Step 16: merge ('f', 'o') → fo
Step 17: merge ('fo', 'r') → for
Step 18: merge ('y', 's') → ys
Step 19: merge ('le', '_') → le_
Step 20: merge ('en', 'g') → eng
Step 21: merge ('eng', 'ine') → engine
Step 22: merge ('engine', 'e') → enginee
Step 23: merge ('enginee', 'r') → engineer
Step 24: merge ('o', 'l') → ol
Step 25: merge ('re', 'p') → rep
Step 26: merge ('for', '_') → for_
Step 27: merge ('an', 'a') → ana
Step 28: merge ('ana', 'l') → anal
Step 29: merge ('t', '_') → t_
Step 30: merge ('r', 'an') → ran

Top 

In [None]:
# @title List available models
from google.colab import ai

ai.list_models()

['google/gemini-2.0-flash',
 'google/gemini-2.0-flash-lite',
 'google/gemini-2.5-flash',
 'google/gemini-2.5-flash-lite',
 'google/gemini-2.5-pro',
 'google/gemma-3-12b',
 'google/gemma-3-1b',
 'google/gemma-3-27b',
 'google/gemma-3-4b']

Choosing a Model
The model names give you a hint about their capabilities and intended use:

Pro: These are the most capable models, ideal for complex reasoning, creative tasks, and detailed analysis.

Flash: These models are optimized for high speed and efficiency, making them great for summarization, chat applications, and tasks requiring rapid responses.

Gemma: These are lightweight, open-weight models suitable for a variety of text generation tasks and are great for experimentation.
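
Because the family hint is part of each model name, the list returned above can be filtered with a plain substring match. The sketch below hard-codes the names from the earlier output so it runs outside Colab; `models_matching` is a hypothetical helper, not part of google.colab.ai.

```python
# Model names copied from the ai.list_models() output above.
AVAILABLE = [
    "google/gemini-2.0-flash",
    "google/gemini-2.0-flash-lite",
    "google/gemini-2.5-flash",
    "google/gemini-2.5-flash-lite",
    "google/gemini-2.5-pro",
    "google/gemma-3-12b",
    "google/gemma-3-1b",
    "google/gemma-3-27b",
    "google/gemma-3-4b",
]


def models_matching(models, keyword):
    """Return the model names that contain `keyword` (case-insensitive)."""
    keyword = keyword.lower()
    return [name for name in models if keyword in name.lower()]


for family in ("pro", "flash", "gemma"):
    print(family, "->", models_matching(AVAILABLE, family))
```

Note that "flash" also matches the flash-lite variants. In a live session, `ai.list_models()` could be passed in place of `AVAILABLE`.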

In [None]:
# @title Simple batch generation example
# Only text-to-text input/output is supported
from google.colab import ai

response = ai.generate_text("What is the capital of France?")
print(response)

The capital of France is **Paris**.



In [None]:
# @title Choose a different model
from google.colab import ai

response = ai.generate_text("What is the capital of England", model_name='google/gemini-2.0-flash-lite')
print(response)

The capital of England is **London**.



For longer text generations, you can stream the response. This displays the output token by token as it's generated, rather than waiting for the entire response to complete. This provides a more interactive and responsive experience. To enable this, simply set stream=True.
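
To make the chunk-by-chunk behaviour concrete, here is a sketch that echoes each piece as it arrives and also keeps the full text for later use. `fake_stream` is a hypothetical stand-in for the streaming iterator so the example runs outside Colab, and `collect_stream` is likewise an illustrative helper rather than part of the library.

```python
from typing import Iterable, Iterator


def fake_stream(text: str, chunk_size: int = 8) -> Iterator[str]:
    """Yield `text` in small pieces, imitating a streamed response (assumption)."""
    for start in range(0, len(text), chunk_size):
        yield text[start:start + chunk_size]


def collect_stream(chunks: Iterable[str], echo: bool = True) -> str:
    """Print each chunk as it arrives and return the concatenated text."""
    parts = []
    for chunk in chunks:
        if echo:
            print(chunk, end="", flush=True)
        parts.append(chunk)
    return "".join(parts)


story = collect_stream(fake_stream("The lighthouse keeper was a man of routine."))
```

In Colab, the iterator from `ai.generate_text(prompt, stream=True)` could be passed to `collect_stream` instead of `fake_stream(...)`.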

In [None]:
# @title Simple streaming example
from google.colab import ai

stream = ai.generate_text("Tell me a short story.", stream=True)
for text in stream:
  print(text, end='')

The lighthouse keeper, Silas, was a man of routine. Every night, for fifty years, he'd lit the lamp, a beacon against the treacherous rocks that gnawed at the coastline. The sea was his companion, his enemy, and his only confidante. He knew its moods better than his own.

One stormy night, the wind howled like a banshee. The waves crashed against the tower, shaking it to its core. Silas, clinging to the railing, felt a fear he hadn't experienced in decades. This wasn't just a storm; this was a monster.

Suddenly, a small, wooden boat, tossed about like a toy, appeared in the raging sea. He squinted, his heart leaping into his throat. A child. Alone.

Ignoring the raging tempest, Silas raced down the winding stairs, his old bones protesting with every step. He launched his small rescue boat, a fragile craft against the fury of the storm.

Fighting the waves, he reached the child. A girl, no older than seven, clung to the wreckage, her face white with terror. With a strength born of desp

In [None]:
#@title Text formatting setup
# This code is not required by google.colab.ai, but it is useful for formatting text chunks.
import sys

class LineWrapper:
    def __init__(self, max_length=80):
        self.max_length = max_length
        self.current_line_length = 0

    def print(self, text_chunk):
        i = 0
        n = len(text_chunk)
        while i < n:
            start_index = i
            while i < n and text_chunk[i] not in ' \n': # Find end of word
                i += 1
            current_word = text_chunk[start_index:i]

            delimiter = ""
            if i < n: # If not end of chunk, we found a delimiter
                delimiter = text_chunk[i]
                i += 1 # Consume delimiter

            if current_word:
                needs_leading_space = (self.current_line_length > 0)

                # Case 1: Word itself is too long for a line (must be broken)
                if len(current_word) > self.max_length:
                    if needs_leading_space: # Newline if current line has content
                        sys.stdout.write('\n')
                        self.current_line_length = 0
                    for char_val in current_word: # Break the long word
                        if self.current_line_length >= self.max_length:
                            sys.stdout.write('\n')
                            self.current_line_length = 0
                        sys.stdout.write(char_val)
                        self.current_line_length += 1
                # Case 2: Word doesn't fit on current line (print on new line)
                elif self.current_line_length + (1 if needs_leading_space else 0) + len(current_word) > self.max_length:
                    sys.stdout.write('\n')
                    sys.stdout.write(current_word)
                    self.current_line_length = len(current_word)
                # Case 3: Word fits on current line
                else:
                    if needs_leading_space:
                        # Define punctuation that should not have a leading space
                        # when they form an entire "word" (token) following another word.
                        no_leading_space_punctuation = {
                            ",", ".", ";", ":", "!", "?",        # Standard sentence punctuation
                            ")", "]", "}",                     # Closing brackets
                            "'s", "'S", "'re", "'RE", "'ve", "'VE", # Common contractions
                            "'m", "'M", "'ll", "'LL", "'d", "'D",
                            "n't", "N'T",
                            "...", "…"                          # Ellipses
                        }
                        if current_word not in no_leading_space_punctuation:
                            sys.stdout.write(' ')
                            self.current_line_length += 1
                    sys.stdout.write(current_word)
                    self.current_line_length += len(current_word)

            if delimiter == '\n':
                sys.stdout.write('\n')
                self.current_line_length = 0
            elif delimiter == ' ':
                # If line is full and a space delimiter arrives, it implies a wrap.
                if self.current_line_length >= self.max_length:
                    sys.stdout.write('\n')
                    self.current_line_length = 0

        sys.stdout.flush()


In [None]:
# @title Formatted streaming example
from google.colab import ai

wrapper = LineWrapper()
for chunk in ai.generate_text('Give me a long winded description about the evolution of the Roman Empire.', model_name='google/gemini-2.0-flash', stream=True):
  wrapper.print(chunk)

Alright, settle in, because the Roman Empire’s evolution wasn't a tidy, linear
process. It was a centuries-long, tumultuous transformation, marked by
breathtaking innovation, brutal power struggles, and a slow, creeping societal
decay. We're talking about a journey from a humble city-state in the Italian
peninsula to a sprawling, multifaceted empire that left an indelible mark on
law, language, architecture, governance, and even our very understanding of the
world.

It all began, as legend would have it, with Romulus and Remus, twin brothers
raised by a she-wolf, who founded the city of Rome in 753 BCE. Now, that’s just
a legend, but it serves to highlight the foundational spirit of Rome: ambition,
strength, and a certain ruthlessness. Initially, Rome was ruled by a monarchy, a
system eventually deemed unsatisfactory by the powerful patrician class. This
led to the **Roman Republic**, established around 509 BCE, a watershed moment
that would define the early character of Rome.

The Rep