<div style="width: 100%; overflow: hidden;">
    <div style="width: 150px; float: left;"> <img src="https://raw.githubusercontent.com/DataForScience/Networks/master/data/D4Sci_logo_ball.png" alt="Data For Science, Inc" align="left" border="0" width=150px> </div>
    <div style="float: left; margin-left: 10px;"> <h1>ChatGPT And Friends</h1>
<h1>Language Models</h1>
        <p>Bruno Gonçalves<br/>
        <a href="http://www.data4sci.com/">www.data4sci.com</a><br/>
            @bgoncalves, @data4sci</p></div>
</div>

In [1]:
from collections import Counter, defaultdict
import random

import pandas as pd
import numpy as np

import matplotlib
import matplotlib.pyplot as plt 

import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import reuters
from nltk import bigrams, trigrams

import tqdm as tq
from tqdm.notebook import tqdm

import watermark

%load_ext watermark
%matplotlib inline

We start by printing out the versions of the libraries we're using, for future reference.

In [2]:
%watermark -n -v -m -g -iv

Python implementation: CPython
Python version       : 3.11.7
IPython version      : 8.21.0

Compiler    : Clang 15.0.0 (clang-1500.1.0.2.5)
OS          : Darwin
Release     : 23.3.0
Machine     : x86_64
Processor   : i386
CPU cores   : 16
Architecture: 64bit

Git hash: 1f87e80538ad172ebadf16b8ffe7f1e01f363ed6

matplotlib: 3.8.2
tqdm      : 4.66.1
json      : 2.0.9
pandas    : 2.2.0
watermark : 2.4.3
nltk      : 3.6.6
numpy     : 1.26.4



Load default figure style

In [3]:
plt.style.use('d4sci.mplstyle')
colors = plt.rcParams['axes.prop_cycle'].by_key()['color']

# Segmentation

In [4]:
Pword = pd.read_csv('data/count_1w.txt.gz', sep='\t', header=None, index_col=0)
norm = float(Pword.sum().iloc[0])
Pword/=norm

pword_dict = defaultdict(lambda: 1/norm)
pword_dict.update(dict(Pword.reset_index().values))

In [5]:
def Pwords(words):
    "Probability of words, assuming each word is independent of others."
    return np.prod([float(pword_dict[w]) for w in words])

In [6]:
def splits(text):
    "Return all (first, rest) pairs that split text, with first non-empty."
    return [(text[:i], text[i:])
            for i in range(1, len(text)+1)]

In [7]:
def segment(text):
    "Return a list of words that is the most probable segmentation of text."
    if not text: 
        return []
    else:
        candidates = ([first] + segment(rest) 
                      for (first, rest) in splits(text))
        return max(candidates, key=Pwords)

In [8]:
segment('choosespain')

['choose', 'spain']
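
The recursive `segment` above re-solves the same suffixes many times, so its cost grows exponentially with the length of the text. A minimal sketch of a memoized variant follows. The counts in `toy_counts` are made up for illustration (they are not the `count_1w` data), and `toy_log_pwords` and `segment_memo` are hypothetical names. It also sums log-probabilities instead of multiplying raw probabilities, which should help avoid underflow on longer inputs.

```python
from functools import lru_cache
import math

# Hypothetical toy unigram counts, made up for this sketch
toy_counts = {"choose": 200, "spain": 150, "chooses": 10,
              "pain": 100, "a": 300, "in": 240}
toy_norm = float(sum(toy_counts.values()))

def toy_log_pwords(words):
    "Log-probability of words, assuming each word is independent of others."
    # Unseen words get a count of 1, mirroring the 1/norm default used above
    return sum(math.log(toy_counts.get(w, 1) / toy_norm) for w in words)

@lru_cache(maxsize=None)
def segment_memo(text):
    "Return the most probable segmentation of text as a tuple of words."
    if not text:
        return ()
    # Tuples are returned so that cached results cannot be mutated by callers
    candidates = ((text[:i],) + segment_memo(text[i:])
                  for i in range(1, len(text) + 1))
    return max(candidates, key=toy_log_pwords)

print(segment_memo("choosespain"))  # ('choose', 'spain')
```

With the cache in place each suffix is segmented only once, so the number of distinct subproblems is linear in the length of the text.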

# "Small" Language Model

In [9]:
model = defaultdict(lambda: defaultdict(lambda: 0))

We start by counting the number of trigram co-occurrences.

In [10]:
for sentence in tqdm(reuters.sents(), total=54_711):
    for w1, w2, w3 in trigrams(sentence, pad_right=True, pad_left=True):
        bigram = (w1, w2)
        model[bigram][w3] += 1


We then normalize the counts for each bigram into probabilities.

In [11]:
for bigram in model:
    total_count = float(sum(model[bigram].values()))

    for w3 in model[bigram]:
        model[bigram][w3] /= total_count
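
The two steps above, counting and normalizing, can be checked by hand on a toy corpus. The sketch below is an illustration only: `toy_sentences` is made up, and `padded_trigrams` is a hypothetical pure-Python stand-in for `trigrams(sentence, pad_left=True, pad_right=True)`, assuming that call pads each side with two `None` tokens.

```python
from collections import defaultdict

def padded_trigrams(tokens):
    "Yield trigrams over tokens padded with two None tokens on each side."
    padded = [None, None] + list(tokens) + [None, None]
    return zip(padded, padded[1:], padded[2:])

# Made-up corpus, small enough to verify by hand
toy_sentences = [
    ["the", "cat", "sat"],
    ["the", "cat", "ran"],
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
]

# Step 1: count how often each word follows each bigram
toy_model = defaultdict(lambda: defaultdict(float))
for sentence in toy_sentences:
    for w1, w2, w3 in padded_trigrams(sentence):
        toy_model[(w1, w2)][w3] += 1

# Step 2: turn the counts for each bigram into probabilities
for context in toy_model:
    total = sum(toy_model[context].values())
    for w3 in toy_model[context]:
        toy_model[context][w3] /= total

print(dict(toy_model[("the", "cat")]))
```

In this corpus "the cat" is followed by "sat" twice and by "ran" once, so the two probabilities come out as two thirds and one third.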

Our language model is just a weighted mapping between each bigram and the possible next words.

In [12]:
model[("United", "States")]

defaultdict(<function __main__.<lambda>.<locals>.<lambda>()>,
            {'.': 0.10950080515297907,
             'where': 0.001610305958132045,
             'and': 0.11594202898550725,
             'in': 0.01610305958132045,
             'citizen': 0.001610305958132045,
             'could': 0.008051529790660225,
             'is': 0.03864734299516908,
             'has': 0.04669887278582931,
             'took': 0.00322061191626409,
             ',': 0.15780998389694043,
             'from': 0.00644122383252818,
             'decided': 0.001610305958132045,
             'agreed': 0.00644122383252818,
             'should': 0.014492753623188406,
             '."': 0.004830917874396135,
             'had': 0.00966183574879227,
             "'": 0.008051529790660225,
             'it': 0.00322061191626409,
             'imposes': 0.00322061191626409,
             'despite': 0.001610305958132045,
             ',"': 0.014492753623188406,
             'markets': 0.001610305958132045,
     

In [13]:
model[("United", "Kingdom")]

defaultdict(<function __main__.<lambda>.<locals>.<lambda>()>,
            {',': 0.21428571428571427,
             'and': 0.21428571428571427,
             'blender': 0.07142857142857142,
             ')': 0.14285714285714285,
             'company': 0.07142857142857142,
             'operations': 0.07142857142857142,
             'assets': 0.07142857142857142,
             'Ltd': 0.07142857142857142,
             '.': 0.07142857142857142})

This is all we need to generate new text starting from a bigram prompt. We simply perform a random walk on this weighted graph, starting from an initial prompt:

In [14]:
def generate_sentence_from_prompt(prompt, zero_temperature=False):
    text = [*prompt]

    # Don't impose any fixed sentence length
    while True:
        # The current node is just the one that accounts
        # for the last two words in the text
        bigram = tuple(text[-2:])

        # We extract the list of possible next words and their probabilities
        words = []
        probs = []

        for word, prob in model[bigram].items():
            words.append(word)
            probs.append(prob)

        # Stop if the model has never seen this bigram
        if not words:
            break

        if zero_temperature:
            # Temperature = 0: always pick the most likely word
            pos = np.argmax(probs)
        else:
            # Temperature = 1: choose one word proportionally to its probability
            selection = np.random.multinomial(1, probs)
            pos = np.argmax(selection)

        word = words[pos]

        # Append the new word to our running text
        text.append(word)

        # Stop when we hit two None tokens in a row, which represent the end of a sentence
        if text[-2:] == [None, None]:
            break

        # Make sure we don't run forever
        if len(text) > 100:
            break

    return " ".join([t for t in text if t])

In [15]:
generate_sentence_from_prompt(('United', 'States'))

'United States wanted it understood by all potential purchasers of the total outstanding .'

In [16]:
generate_sentence_from_prompt(('Today', 'the'))

'Today the World Bank President Barber Conable called on the table ," except those moves , analysts said the exchange cover scheme were 361 mln dlrs in cash and notes , with an injunction against the U . S . Agriculture Secretary Richard Lyng said it sold nearly its entire 47 pct to 7 , 983 , 000 vs profit 1 , 332 . 1 mln .'

In [17]:
generate_sentence_from_prompt(('financial', 'markets'))

'financial markets , and said it signed a letter of intent on expansion to lay fibre optic cables between Japan and other OPEC member has a long - term debt of about three million people from cities and crowded river deltas to the capital markets .'
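
The `zero_temperature` flag switches between two extremes: always taking the most likely word (temperature 0) and sampling in proportion to the model probabilities (temperature 1). A common way to interpolate between them is to raise each probability to the power 1/T and renormalize. The sketch below illustrates that idea; `apply_temperature` and `sample_index` are hypothetical helpers, not part of the notebook, and they assume all input probabilities are strictly positive, as is the case for the normalized trigram counts.

```python
import numpy as np

def apply_temperature(probs, temperature):
    "Rescale a probability vector by a temperature; 0 means greedy."
    probs = np.asarray(probs, dtype=float)
    if temperature == 0:
        # Put all the mass on the most likely entry
        greedy = np.zeros_like(probs)
        greedy[np.argmax(probs)] = 1.0
        return greedy
    # p ** (1 / T), computed in log space for numerical stability
    logits = np.log(probs) / temperature
    logits -= logits.max()
    scaled = np.exp(logits)
    return scaled / scaled.sum()

def sample_index(probs, temperature=1.0, rng=None):
    "Sample a position from probs after temperature rescaling."
    rng = np.random.default_rng() if rng is None else rng
    return int(rng.choice(len(probs), p=apply_temperature(probs, temperature)))

toy_probs = [0.5, 0.3, 0.2]  # made-up distribution over three words
print(apply_temperature(toy_probs, 1.0))
print(apply_temperature(toy_probs, 0.5))
```

At temperature 1 the distribution is returned unchanged, while temperatures below 1 sharpen it toward the most likely word and temperatures above 1 flatten it.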

<center>
     <img src="https://raw.githubusercontent.com/DataForScience/Networks/master/data/D4Sci_logo_full.png" alt="Data For Science, Inc" align="center" border="0" width=300px> 
</center>