# Text Generation

This section covers only the n-gram model in detail. Other methods are presented as demos of working open-source examples, because training them takes too long.

## N-gram model

In this example, we use the Wall Street Journal corpus to generate new sentences. The corpus is available as part of the NLTK library. This example is based on [Erroll Wood's work](https://github.com/errollw/gengram).

In [10]:
from nltk.book import *

wall_street = text7.tokens

We need to clean the text up and delete all meaningless words/characters. The easiest way is to use regular expressions:

In [11]:
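import re

tokens = wall_street

def cleanup():
    # Keep only tokens whose first character is a letter, a digit,
    # or sentence-ending punctuation (. ! ?)
    compiled_pattern = re.compile(r"^[a-zA-Z0-9.!?]")
    clean = list(filter(compiled_pattern.match, tokens))
    return clean

tokens = cleanup()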
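
To see what the filter keeps and drops, here is a minimal sketch on a handful of made-up tokens. The list `sample_tokens` is hypothetical and only imitates the kind of tokens found in the corpus. Note that the pattern checks just the first character of each token.

```python
import re

# Hypothetical sample; "," and "*-1" imitate punctuation and
# annotation markers that the filter is meant to drop.
sample_tokens = ["Pierre", "Vinken", ",", "61", "years", "old",
                 "*-1", "will", "join", "."]

# Same pattern as in cleanup(): the first character must be a letter,
# a digit, or one of . ! ?
keep = re.compile(r"^[a-zA-Z0-9.!?]")
clean_sample = [tok for tok in sample_tokens if keep.match(tok)]

print(clean_sample)
```

Sentence-ending punctuation is kept on purpose, since the generator later uses it to detect where a sentence stops.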

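The next step is to build n-grams: we slide a window of N tokens over the text and collect every run of N adjacent tokens. With N = 3, as set in the final cell, each n-gram is a list of three consecutive tokens. You can print the n-grams to inspect them.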

In [12]:
def build_ngrams():
    ngrams = []
    for i in range(len(tokens)-N+1):
        ngrams.append(tokens[i:i+N])
    return ngrams
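
As a small illustration of the sliding window, the sketch below builds trigrams from a toy sentence. The helper `build_ngrams_from` is a hypothetical variant that takes the token list and window size as arguments instead of reading the globals `tokens` and `N`.

```python
def build_ngrams_from(tokens, n):
    """Return every run of n adjacent tokens (hypothetical helper)."""
    return [tokens[i:i + n] for i in range(len(tokens) - n + 1)]

toy = ["the", "cat", "sat", "on", "the", "mat", "."]
trigrams = build_ngrams_from(toy, 3)

for gram in trigrams:
    print(gram)
```

A list of 7 tokens yields 7 - 3 + 1 = 5 trigrams, which is why the loop runs over `len(tokens) - N + 1` positions.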

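The next step is to count how often each token follows a given sequence of N-1 tokens. For every n-gram, the first N-1 tokens form the key and the last token is the continuation; if the same continuation occurs more than once after a key, its count is incremented. There are 85826 n-grams, which yield 54677 frequency entries.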

In [7]:
def ngram_freqs(ngrams):
    # Maps each (N-1)-token prefix to a dict of {next token: count}
    counts = {}

    for ngram in ngrams:
        # Prefix tokens are joined with the global separator SEP
        token_seq = SEP.join(ngram[:-1])
        last_token = ngram[-1]

        if token_seq not in counts:
            counts[token_seq] = {}

        if last_token not in counts[token_seq]:
            counts[token_seq][last_token] = 0

        counts[token_seq][last_token] += 1

    return counts
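
The shape of the resulting table is easier to see on toy data. The sketch below is an alternative formulation using `collections.defaultdict` and `collections.Counter`, not the code used in this notebook; the function name `ngram_freqs_counter` and the toy n-grams are made up for illustration.

```python
from collections import Counter, defaultdict

def ngram_freqs_counter(ngrams, sep=" "):
    """Count continuations per prefix (hypothetical alternative)."""
    counts = defaultdict(Counter)
    for ngram in ngrams:
        prefix = sep.join(ngram[:-1])   # first N-1 tokens
        counts[prefix][ngram[-1]] += 1  # last token is the continuation
    return counts

toy_ngrams = [
    ["we", "have", "a"],
    ["we", "have", "the"],
    ["we", "have", "a"],
    ["have", "a", "plan"],
]
toy_counts = ngram_freqs_counter(toy_ngrams)

print(dict(toy_counts["we have"]))
```

Here four n-grams collapse into two prefixes, which mirrors why the number of frequency entries is smaller than the number of n-grams.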

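We choose the next word by looking up the N-1 most recent tokens and sampling one of the tokens that followed them in the corpus, weighted by how often each one occurred. The chosen word is then appended to the text.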

In [8]:
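import random

def next_word(text, N, counts):
    # Use the last N-1 tokens of the text as the lookup key
    token_seq = SEP.join(text.split()[-(N-1):])
    choices = counts[token_seq].items()

    # Weighted random choice: pick a point in [0, total] and walk
    # through the cumulative counts until we pass it
    total = sum(weight for choice, weight in choices)
    r = random.uniform(0, total)
    upto = 0
    for choice, weight in choices:
        upto += weight
        if upto > r:
            return choice
    assert False  # should not reach here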
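
The same weighted sampling can also be expressed with the standard library's `random.choices`, which accepts a `weights` argument. The sketch below is an assumption-labelled alternative, not the notebook's method; `next_word_choices` and `toy_counts` are hypothetical names.

```python
import random

def next_word_choices(text, n, counts, sep=" "):
    """Sample the next token in proportion to its count (hypothetical)."""
    prefix = sep.join(text.split()[-(n - 1):])
    followers = counts[prefix]  # raises KeyError if the prefix is unseen
    words = list(followers.keys())
    weights = list(followers.values())
    return random.choices(words, weights=weights, k=1)[0]

# Toy table: "a" followed "we have" twice, "the" once.
toy_counts = {"we have": {"a": 2, "the": 1}}

random.seed(0)  # fixed seed only to make the demo repeatable
word = next_word_choices("we have", 3, toy_counts)
print(word)
```

With these counts, "a" should be drawn about twice as often as "the" over many calls. Either version fails with a `KeyError` if the most recent N-1 tokens never occurred together in the corpus.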

We need to set up a few parameters: the window size N, the number of sentences to generate, and the start of the first sentence. The start string must consist of N-1 words that occur together in our n-gram list. Note that the code lower-cases the start string before using it, so the lower-cased form must be a key in the frequency table.

In [14]:
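import random

N = 3                # window size: each n-gram holds N tokens
SEP = " "            # separator used to join tokens into keys
sentence_count = 5   # number of sentences to generate

ngrams = build_ngrams()
start_seq = "We have"

counts = ngram_freqs(ngrams)

# Fall back to a random prefix when no start string is given
if start_seq is None:
    start_seq = random.choice(list(counts.keys()))
generated = start_seq.lower()

sentences = 0
while sentences < sentence_count:
    generated += SEP + next_word(generated, N, counts)
    # A sentence is complete when the text ends with '.', '!' or '?'
    if generated.endswith(('.', '!', '?')):
        sentences += 1

print(generated)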

we have managed to get to vote for a line-item veto is characterized as a fledgling in the first appropriations bill in conference will be type F safety shape a four-foot-high concrete slab with no new issues of negotiable C.D.s usually on amounts of 1 million and 30 million barrels from new fields free of tobacco smoke . In addition Upjohn is offering a fixed-rate return and the beauty of a patient suffering from Parkinson disease . The Ginnie Mae 9 issue for November delivery ended at 58.64 cents a share or 55 million after an 89.9 million pretax charge mostly related to problems under a great deal of stress most of its personal computer lines by 5 to 17 billion yen . Douglas Madison a corporate trader with Bank of Japan said 0 it might not be feasible for large-scale commercial use . Still unresolved is Sony effort to promote vehicle occupant safety in light of other important economic issues will be assumed by Chairman Jay B.
