<div style="width: 100%; overflow: hidden;">
    <div style="width: 150px; float: left;"> <img src="https://raw.githubusercontent.com/DataForScience/Networks/master/data/D4Sci_logo_ball.png" alt="Data For Science, Inc" align="left" border="0" width=150px> </div>
    <div style="float: left; margin-left: 10px;"> <h1>ChatGPT And Friends</h1>
<h1>Language Models</h1>
        <p>Bruno Gonçalves<br/>
        <a href="http://www.data4sci.com/">www.data4sci.com</a><br/>
            @bgoncalves, @data4sci</p></div>
</div>

In [1]:
from collections import Counter, defaultdict
import random

import pandas as pd
import numpy as np

import matplotlib
import matplotlib.pyplot as plt 

import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import reuters
from nltk import bigrams, trigrams

import tqdm as tq
from tqdm.notebook import tqdm

import watermark

%load_ext watermark
%matplotlib inline

We start by printing out the versions of the libraries we're using, for future reference.

In [2]:
%watermark -n -v -m -g -iv

Python implementation: CPython
Python version       : 3.11.7
IPython version      : 8.12.3

Compiler    : Clang 14.0.6 
OS          : Darwin
Release     : 23.5.0
Machine     : arm64
Processor   : arm
CPU cores   : 16
Architecture: 64bit

Git hash: 21b9940cec1a0c4befc502f81feeba07c252d364

json      : 2.0.9
nltk      : 3.8.1
tqdm      : 4.66.4
pandas    : 2.1.4
numpy     : 1.26.4
matplotlib: 3.8.0
watermark : 2.4.3



Load the default figure style.

In [3]:
plt.style.use('d4sci.mplstyle')
colors = plt.rcParams['axes.prop_cycle'].by_key()['color']

# Segmentation

In [4]:
Pword = pd.read_csv('data/count_1w.txt.gz', sep='\t', header=None, index_col=0)
norm = float(Pword.sum().iloc[0])
Pword/=norm

pword_dict = defaultdict(lambda: 1/norm)
pword_dict.update(dict(Pword.reset_index().values))

In [5]:
def Pwords(words):
    "Probability of words, assuming each word is independent of others."
    return np.prod([float(pword_dict[w]) for w in words])
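
Multiplying many small probabilities can underflow to zero for long word sequences. A minimal sketch of the same independence assumption in log space follows. The count table `toy_counts` is invented for illustration, and `toy_pword` and `log_pwords` are hypothetical names that are not part of the notebook.

```python
import math
from collections import defaultdict

# Toy unigram counts, invented for illustration only
toy_counts = {"choose": 50, "spain": 30, "chooses": 5, "pain": 15}
toy_norm = float(sum(toy_counts.values()))

# Unseen words get the same 1/N floor used by pword_dict above
toy_pword = defaultdict(lambda: 1 / toy_norm)
toy_pword.update({w: c / toy_norm for w, c in toy_counts.items()})

def log_pwords(words):
    "Log-probability of words, assuming each word is independent of others."
    return sum(math.log(toy_pword[w]) for w in words)

# Comparing log-probabilities ranks segmentations the same way as the raw product
print(log_pwords(["choose", "spain"]) > log_pwords(["chooses", "pain"]))
```

Since the logarithm is monotonic, `max(candidates, key=log_pwords)` would pick the same segmentation as the product-based score.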

In [6]:
def splits(text, verbose=False):
    "Return all (first, rest) splits of text, where first is never empty."
    s = [(text[:i], text[i:]) 
            for i in range(1, len(text)+1)]
    
    if verbose:
        print(s)
    
    return s

In [7]:
def segment(text):
    "Return a list of words that is the most probable segmentation of text."
    if not text: 
        return []
    else:
        candidates = ([first] + segment(rest) 
                      for (first, rest) in splits(text))
        return max(candidates, key=Pwords)
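
The recursive search revisits the same suffix many times, so its cost grows exponentially with the length of the text. One common remedy is to cache the best segmentation of each suffix. The sketch below is self-contained and uses invented counts; `toy_counts`, `toy_p`, and `segment_memo` are hypothetical names, not the notebook's own code.

```python
from functools import lru_cache

# Toy unigram counts, invented for illustration only
toy_counts = {"choose": 50, "spain": 30, "chooses": 5, "pain": 15}
toy_norm = float(sum(toy_counts.values()))

def toy_p(word):
    # Unseen words get the same 1/N floor used by pword_dict above
    return toy_counts.get(word, 1) / toy_norm

@lru_cache(maxsize=None)
def segment_memo(text):
    "Return (probability, words) for the most probable segmentation of text."
    if not text:
        return 1.0, ()
    best = (0.0, ())
    for i in range(1, len(text) + 1):
        first, rest = text[:i], text[i:]
        # Each suffix is solved once and then served from the cache
        p_rest, words_rest = segment_memo(rest)
        candidate = (toy_p(first) * p_rest, (first,) + words_rest)
        if candidate[0] > best[0]:
            best = candidate
    return best

prob, words = segment_memo("choosespain")
print(list(words), prob)
```

The function returns tuples rather than lists because cached results are shared between calls and should not be mutated.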

In [8]:
segment('choosespain')

['choose', 'spain']

In [9]:
Pwords(['choose', 'spain'])

1.3415578949797063e-08

In [10]:
Pwords(['chooses', 'pain'])

3.9488948824561327e-10

# "Small" Language Model

In [10]:
model = defaultdict(lambda: defaultdict(lambda: 0))

We start by counting the number of trigram co-occurrences. Padding each sentence on both sides with `None` marks where it starts and ends:

In [11]:
list(trigrams(["the", "united", "states", "of", "america"], pad_right=True, pad_left=True))

[(None, None, 'the'),
 (None, 'the', 'united'),
 ('the', 'united', 'states'),
 ('united', 'states', 'of'),
 ('states', 'of', 'america'),
 ('of', 'america', None),
 ('america', None, None)]

In [12]:
for sentence in tqdm(reuters.sents(), total=54_716):
    for w1, w2, w3 in trigrams(sentence, pad_right=True, pad_left=True):
        context = (w1, w2)
        model[context][w3] += 1


We then normalize the counts for each bigram context so that they become probabilities.

In [13]:
for context in model:
    total_count = float(sum(model[context].values()))

    for w3 in model[context]:
        model[context][w3] /= total_count
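
To see the counting and normalization steps end to end without downloading the Reuters corpus, here is a minimal sketch on four made-up sentences. The helper `padded_trigrams` is a hypothetical stand-in written to mimic the padded `trigrams` call above, and `toy_sentences` and `toy_model` are invented for illustration.

```python
from collections import defaultdict

# Made-up corpus, for illustration only
toy_sentences = [
    ["the", "cat", "sat"],
    ["the", "cat", "ran"],
    ["the", "cat", "sat"],
    ["the", "dog", "ran"],
]

def padded_trigrams(sentence):
    "Return trigrams of sentence with two None pads on each side."
    padded = [None, None] + list(sentence) + [None, None]
    return zip(padded, padded[1:], padded[2:])

toy_model = defaultdict(lambda: defaultdict(float))

# Step 1: count how often each word follows each two-word context
for sentence in toy_sentences:
    for w1, w2, w3 in padded_trigrams(sentence):
        toy_model[(w1, w2)][w3] += 1

# Step 2: divide by the total count of each context
for context in toy_model:
    total = sum(toy_model[context].values())
    for w3 in toy_model[context]:
        toy_model[context][w3] /= total

print(dict(toy_model[("the", "cat")]))
```

After normalization, the probabilities stored under every context sum to one, which is what the sampling step later relies on.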

Our language model is just a weighted mapping between each bigram and the possible next words.

In [27]:
model[("United", "States")]

defaultdict(<function __main__.<lambda>.<locals>.<lambda>()>,
            {'.': 0.10950080515297907,
             'where': 0.001610305958132045,
             'and': 0.11594202898550725,
             'in': 0.01610305958132045,
             'citizen': 0.001610305958132045,
             'could': 0.008051529790660225,
             'is': 0.03864734299516908,
             'has': 0.04669887278582931,
             'took': 0.00322061191626409,
             ',': 0.15780998389694043,
             'from': 0.00644122383252818,
             'decided': 0.001610305958132045,
             'agreed': 0.00644122383252818,
             'should': 0.014492753623188406,
             '."': 0.004830917874396135,
             'had': 0.00966183574879227,
             "'": 0.008051529790660225,
             'it': 0.00322061191626409,
             'imposes': 0.00322061191626409,
             'despite': 0.001610305958132045,
             ',"': 0.014492753623188406,
             'markets': 0.001610305958132045,
     

In [15]:
model[("United", "Kingdom")]

defaultdict(<function __main__.<lambda>.<locals>.<lambda>()>,
            {',': 0.21428571428571427,
             'and': 0.21428571428571427,
             'blender': 0.07142857142857142,
             ')': 0.14285714285714285,
             'company': 0.07142857142857142,
             'operations': 0.07142857142857142,
             'assets': 0.07142857142857142,
             'Ltd': 0.07142857142857142,
             '.': 0.07142857142857142})
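
Because each context maps to a plain dictionary of next-word probabilities, ranking the candidates only takes a sort. A minimal sketch with a made-up distribution follows; `top_next_words` and `toy_distribution` are hypothetical names introduced here for illustration.

```python
def top_next_words(distribution, k=3):
    "Return the k most probable (word, probability) pairs, highest first."
    ranked = sorted(distribution.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]

# Made-up next-word distribution, for illustration only
toy_distribution = {",": 0.4, "and": 0.3, "company": 0.2, ".": 0.1}

print(top_next_words(toy_distribution, k=2))
# [(',', 0.4), ('and', 0.3)]
```

With the notebook's model in memory, the same helper could be called as `top_next_words(model[("United", "Kingdom")])`.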

This is all we need to generate new text starting from a bigram prompt. We simply perform a random walk on this weighted graph, beginning at the initial prompt:

In [16]:
def generate_sentence_from_prompt(prompt, zero_temperature=False):
    text = [*prompt]

    # Don't impose any fixed sentence length
    while True:
        # The current node we're in is just the one that accounts
        # for the last two words in the text
        context = tuple(text[-2:])

        # Stop if this context never appeared in the corpus: there is
        # nothing to sample from (and we avoid adding an empty entry to model)
        if context not in model:
            break

        # We extract the list of possible next words and their probabilities
        words = []
        probs = []

        for word, prob in model[context].items():
            words.append(word)
            probs.append(prob)

        if zero_temperature:
            # Temperature = 0: always take the most probable word
            pos = np.argmax(probs)
        else:
            # Temperature = 1: choose one word proportionally to its probability
            selection = np.random.multinomial(1, probs)
            pos = np.argmax(selection)

        word = words[pos]

        # Append the new word to our running text
        text.append(word)

        # Stop when we hit two None tokens in a row, which represent the end of a sentence
        if text[-2:] == [None, None]:
            break
        
        # Make sure we don't run forever
        if len(text) > 100:
            break
                
    return " ".join([t for t in text if t])
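
The `zero_temperature` flag switches between two extremes: always taking the most probable word, or sampling in proportion to the stored probabilities. A minimal sketch of how intermediate temperatures could be handled is shown below, by rescaling each probability as `p ** (1 / T)` before sampling. This is an illustrative extension rather than part of the notebook, and `sample_with_temperature` is a hypothetical helper that uses only the standard library.

```python
import random

def sample_with_temperature(words, probs, temperature=1.0, rng=random):
    "Pick one word after rescaling probabilities as p ** (1 / T)."
    if temperature == 0:
        # Temperature = 0: deterministic, take the most probable word
        best = max(range(len(probs)), key=lambda i: probs[i])
        return words[best]
    # T < 1 sharpens the distribution, T > 1 flattens it, T = 1 leaves it unchanged
    weights = [p ** (1.0 / temperature) for p in probs]
    # random.choices normalizes the weights internally
    return rng.choices(words, weights=weights, k=1)[0]

# Made-up candidates, for illustration only
toy_words = [",", "and", "company"]
toy_probs = [0.5, 0.3, 0.2]

print(sample_with_temperature(toy_words, toy_probs, temperature=0))
print(sample_with_temperature(toy_words, toy_probs, temperature=0.5))
```

Passing a seeded `random.Random` instance as `rng` makes the sampled output reproducible.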

In [20]:
generate_sentence_from_prompt(('United', 'States'))

'United States as well as Amsterdam , they could become competitive .'

United States => as

United States as -> States as => well

United States as well -> as well => as

United States as well as -> well as => Amsterdam

In [21]:
generate_sentence_from_prompt(('Today', 'the'))

"Today the company ' s absolutely no reason why the 7 - 3 ports Greece 40 , 000 vs 3 , 580 1st half Shr loss six cts Net profit 2 , 1 . 24 dlrs Oper net 15 . 7 PCT STAKE ( General Agreement on Tariffs and Trade that South Africa ' s & lt ; TOTE . O > 2ND QTR SHR 2 . 3 pct ."

In [22]:
generate_sentence_from_prompt(('financial', 'markets'))

'financial markets , and sources close to the union late last month to trading profits meant that the ITC or and placing matching buy orders at that point of view , telling them of his two - year high of 72 pct average annual growth of exports ( of Commons in the year due to affiliates .'

<center>
     <img src="https://raw.githubusercontent.com/DataForScience/Networks/master/data/D4Sci_logo_full.png" alt="Data For Science, Inc" align="center" border="0" width=300px> 
</center>