**Using business articles from the BBC News classification dataset, you are going to 'spin' an article: take an existing business article and swap some of its words for alternatives that fit the same context.**

In [1]:
import numpy as np
import pandas as pd

# Useful when printing results - wraps the text onscreen
import textwrap

# Tokenize string & De-tokenize back into single string
import nltk
from nltk import word_tokenize
from nltk.tokenize.treebank import TreebankWordDetokenizer

In [2]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\shmel\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [3]:
df = pd.read_csv('data/bbc_text_cls.csv')

In [4]:
df.head()

Unnamed: 0,text,labels
0,Ad sales boost Time Warner profit\n\nQuarterly...,business
1,Dollar gains on Greenspan speech\n\nThe dollar...,business
2,Yukos unit buyer faces loan claim\n\nThe owner...,business
3,High fuel prices hit BA's profits\n\nBritish A...,business
4,Pernod takeover talk lifts Domecq\n\nShares in...,business


In [5]:
df['labels'].value_counts()

sport            511
business         510
politics         417
tech             401
entertainment    386
Name: labels, dtype: int64

In [6]:
# We only want the business-related articles

texts = df[df['labels'] == 'business']['text']

In [7]:
texts.head()

0    Ad sales boost Time Warner profit\n\nQuarterly...
1    Dollar gains on Greenspan speech\n\nThe dollar...
2    Yukos unit buyer faces loan claim\n\nThe owner...
3    High fuel prices hit BA's profits\n\nBritish A...
4    Pernod takeover talk lifts Domecq\n\nShares in...
Name: text, dtype: object

In [16]:
texts.iloc[0].split('\n')

['Ad sales boost Time Warner profit',
 '',
 'Quarterly profits at US media giant TimeWarner jumped 76% to $1.13bn (£600m) for the three months to December, from $639m year-earlier.',
 '',
 'The firm, which is now one of the biggest investors in Google, benefited from sales of high-speed internet connections and higher advert sales. TimeWarner said fourth quarter sales rose 2% to $11.1bn from $10.9bn. Its profits were buoyed by one-off gains which offset a profit dip at Warner Bros, and less users for AOL.',
 '',
 "Time Warner said on Friday that it now owns 8% of search-engine Google. But its own internet business, AOL, had has mixed fortunes. It lost 464,000 subscribers in the fourth quarter profits were lower than in the preceding three quarters. However, the company said AOL's underlying profit before exceptional items rose 8% on the back of stronger internet advertising revenues. It hopes to increase subscribers by offering the online service free to TimeWarner internet customers a

In [9]:
# Paragraphs are separated by a blank line ('\n\n'), and the title is separated from the body in the same way

# Create word probability dictionary

In [10]:
# Collect word counts then find probability and populate probability dictionary (dictionary of dictionaries):
# Dictionary Key: (w(t-1), w(t+1)) 
# Dictionary Value: {w(t): count(w(t))}

probs = {} 

for doc in texts:
    lines = doc.split("\n") 
    
    # Loop through each paragraph and tokenize the lines
    for line in lines:
        tokens = word_tokenize(line) 
        
        # Slide over the tokens and grab 3 consecutive tokens at a time
        for i in range(len(tokens) - 2):
            t_0 = tokens[i] 
            t_1 = tokens[i + 1] 
            t_2 = tokens[i + 2] 
            # Form the dictionary key and add to dictionary
            key = (t_0, t_2) 
            if key not in probs:
                probs[key] = {}
            
            # Add count value for middle token with key given
            if t_1 not in probs[key]:
                probs[key][t_1] = 1 
            else:
                probs[key][t_1] += 1

In [12]:
dict(list(probs.items())[:3])

{('Ad', 'boost'): {'sales': 1},
 ('sales', 'Time'): {'boost': 1},
 ('boost', 'Warner'): {'Time': 1}}
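The same counting step can be written more compactly with `collections.defaultdict` and `collections.Counter`. This is a minimal sketch rather than the notebook's own code: the helper name `build_counts` and the toy sentences are made up for illustration, and `str.split` stands in for `word_tokenize` so the example runs without any NLTK data.

```python
from collections import Counter, defaultdict

def build_counts(token_lists):
    """Count middle words for every (previous word, next word) context."""
    counts = defaultdict(Counter)
    for tokens in token_lists:
        # Zipping three offset views yields every run of 3 consecutive tokens
        for t_0, t_1, t_2 in zip(tokens, tokens[1:], tokens[2:]):
            counts[(t_0, t_2)][t_1] += 1
    return counts

# Hypothetical toy corpus, whitespace-tokenized (an assumption for illustration)
toy_lines = ["US media giant", "US oil giant", "US oil giant"]
toy_counts = build_counts(line.split() for line in toy_lines)
print(dict(toy_counts[("US", "giant")]))  # {'media': 1, 'oil': 2}
```

A `Counter` removes the need for the explicit "if key not in dictionary" checks, at the cost of hiding the bookkeeping that the loop above spells out.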

In [13]:
# Normalize counts to probabilities

# Loop through each key-value pair
for key, d in probs.items():
    total = sum(d.values()) 
    
    # Loop through middle words and calculate probability
    for k, v in d.items():
        d[k] = v / total

In [15]:
dict(list(probs.items())[:10])

{('Ad', 'boost'): {'sales': 1.0},
 ('sales', 'Time'): {'boost': 1.0},
 ('boost', 'Warner'): {'Time': 1.0},
 ('Time', 'profit'): {'Warner': 1.0},
 ('Quarterly', 'at'): {'profits': 1.0},
 ('profits', 'US'): {'at': 1.0},
 ('at', 'media'): {'US': 1.0},
 ('US', 'giant'): {'media': 0.1,
  'telecoms': 0.1,
  'banking': 0.2,
  'foods': 0.1,
  'retail': 0.1,
  'oil': 0.2,
  'mortgage': 0.1,
  'agrochemical': 0.1},
 ('media', 'TimeWarner'): {'giant': 1.0},
 ('giant', 'jumped'): {'TimeWarner': 1.0}}
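As a sanity check, every inner dictionary should sum to 1 after normalization. The sketch below runs on made-up counts; the `normalize` helper and the toy numbers are illustrative assumptions, and unlike the loop above it returns a new dictionary instead of overwriting the counts.

```python
import math

def normalize(counts):
    """Turn {context: {word: count}} into {context: {word: probability}}."""
    probs = {}
    for key, d in counts.items():
        total = sum(d.values())
        probs[key] = {word: count / total for word, count in d.items()}
    return probs

# Hypothetical counts for a single context
toy_counts = {("US", "giant"): {"media": 1, "oil": 2, "banking": 1}}
toy_probs = normalize(toy_counts)
print(toy_probs[("US", "giant")])  # {'media': 0.25, 'oil': 0.5, 'banking': 0.25}

# Each distribution should sum to 1, up to floating-point rounding
assert all(math.isclose(sum(d.values()), 1.0) for d in toy_probs.values())
```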

In [17]:
# Note that many contexts have a single middle word with probability 1.0, which means there is no alternative - don't bother to replace these
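To see how many contexts offer no alternative, you can measure the share of keys whose distribution holds exactly one word. A small sketch, assuming a dictionary shaped like `probs`; the helper name `single_candidate_share` and the toy values are made up for illustration.

```python
def single_candidate_share(probs):
    """Fraction of contexts that offer exactly one middle word."""
    if not probs:
        return 0.0
    singles = sum(1 for d in probs.values() if len(d) == 1)
    return singles / len(probs)

# Hypothetical, simplified distributions in the same shape as probs
toy_probs = {
    ("Ad", "boost"): {"sales": 1.0},
    ("sales", "Time"): {"boost": 1.0},
    ("US", "giant"): {"media": 0.5, "oil": 0.5},
    ("Time", "profit"): {"Warner": 1.0},
}
print(single_candidate_share(toy_probs))  # 0.75
```

A high share means most positions in an article can never be spun, whatever replacement rate is chosen.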

Your Markov model is ready! 

# 'Spin' your article

You are going to spin your article **paragraph-by-paragraph**. You could spin sentence-by-sentence instead, but working on whole paragraphs keeps the structure of the original article as intact as possible.

You need to:
1. Tokenize each paragraph in order to spin the article
2. Once the article has been spun, turn the list of tokens back into a single string using NLTK's **`TreebankWordDetokenizer`** class, which decides whether whitespace belongs between two tokens, e.g. a space between two words ("the end") but none between a word and the full stop that follows it ("end. The")



In [18]:
# Function to spin each paragraph by calling spin_line function (defined below), and join paragraphs together to make article 
# Input is existing article

def spin_document(doc):
    # Split document into lines (paragraphs) 
    lines = doc.split("\n") 
    output = [] 
    
    # Loop through each paragraph
    for line in lines:
        # Check if line has content and spin line
        if line:
            new_line = spin_line(line)
        else:
            new_line = line 
        
        output.append(new_line)
            
    return "\n".join(output)
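The point of splitting on `"\n"` and rejoining is that blank lines pass through untouched, so the title and paragraph breaks stay where they were. The sketch below shows this with a stand-in line function; `spin_document_with` is a hypothetical variant that takes the line function as an argument, and `str.upper` is only a placeholder for `spin_line`.

```python
def spin_document_with(doc, spin_line_fn):
    """Apply spin_line_fn to every non-empty line and keep empty lines as-is."""
    lines = doc.split("\n")
    return "\n".join(spin_line_fn(line) if line else line for line in lines)

doc = "Title\n\nFirst paragraph.\n\nSecond paragraph."
spun = spin_document_with(doc, str.upper)  # str.upper stands in for spin_line
print(spun.split("\n"))
# ['TITLE', '', 'FIRST PARAGRAPH.', '', 'SECOND PARAGRAPH.']
```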

In [19]:
# Set up detokenizer object

detokenizer = TreebankWordDetokenizer()

In [20]:
# Pick the first body paragraph of the first article (index 2, after the title and a blank line)

texts.iloc[0].split("\n")[2]

'Quarterly profits at US media giant TimeWarner jumped 76% to $1.13bn (£600m) for the three months to December, from $639m year-earlier.'

In [21]:
# Tokenize paragraph then detokenize!

detokenizer.detokenize(word_tokenize(texts.iloc[0].split("\n")[2]))

'Quarterly profits at US media giant TimeWarner jumped 76% to $1.13bn (£600m) for the three months to December, from $639m year-earlier.'

In [22]:
# The result is identical to the original, which shows that punctuation and spacing survive the round trip for this paragraph

### Random sampling of words from probability distribution

In [23]:
# Sample a word from a {word: probability} dictionary (same approach as used with text generation)

def sample_word(d):
    p0 = np.random.random() 
    cumulative = 0 
    
    for t, p in d.items():
        cumulative += p 
        if p0 < cumulative:
            return t 
    
    # Should never get here if the probabilities sum to 1
    raise AssertionError("sample_word: probabilities do not sum to 1")
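An alternative to the hand-written cumulative loop is the standard library's `random.choices`, which accepts relative weights and therefore does not depend on the probabilities summing to exactly 1. This is a sketch of that alternative, not the function used in the rest of the notebook; the name `sample_word_stdlib` and the example distribution are assumptions for illustration.

```python
import random

def sample_word_stdlib(d, rng=random):
    """Draw one key of d with probability proportional to its value."""
    words = list(d)
    weights = list(d.values())
    return rng.choices(words, weights=weights, k=1)[0]

rng = random.Random(0)  # Seeded generator so the run is repeatable
dist = {"media": 0.1, "banking": 0.2, "oil": 0.7}  # Hypothetical distribution
draws = [sample_word_stdlib(dist, rng) for _ in range(1000)]

# Empirical counts should roughly follow the weights
print({word: draws.count(word) for word in dist})
```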

In [24]:
# spin_line takes a paragraph as input, then:
# - tokenizes the paragraph and slides over the tokens 3 at a time,
# - if the context (first and third token) has more than one candidate middle word, then with probability 0.3
#   it samples a middle word from the distribution and skips ahead two tokens, i.e. spinning!
# In order to compare results and see how the function works, the sampled word is appended in < > brackets
# next to the original word instead of replacing it. Because of the skip, two words in a row are never replaced

def spin_line(line):
    tokens = word_tokenize(line)
    # Nothing to spin if the line yields no tokens (e.g. whitespace only)
    if not tokens:
        return line
    i = 0
    output = [tokens[0]]
    
    # Grab 3 words at a time
    while i < (len(tokens) - 2):
        t_0 = tokens[i] 
        t_1 = tokens[i + 1] 
        t_2 = tokens[i + 2] 
        key = (t_0, t_2) 
        p_dist = probs.get(key, {})  # Empty if this context was never seen when building probs
        
        if len(p_dist) > 1 and np.random.random() < 0.3:
            # Sample a candidate middle word; keep the original and show the candidate in < > next to it
            middle = sample_word(p_dist)
            output.append(t_1)
            output.append("<" + middle + ">")
            output.append(t_2)
            # Don't replace the 3rd token since the sampled middle word depended on it
            # Both t_1 and t_2 have been written out, so skip ahead 2 steps in the index
            i += 2
        else:
            # Do not replace this middle word 
            output.append(t_1) 
            i += 1 
            
    # Append the final token, unless the last step was a replacement that already appended it as t_2
    if i == len(tokens) - 2:
        output.append(tokens[-1]) 
    
    return detokenizer.detokenize(output)

In [25]:
np.random.seed(42)

**Randomly select an article (document) to input to the `spin_document` function, which calls the `spin_line` function, which in turn calls the `sample_word` function:**

Use the `textwrap.fill()` function to print the output nicely.

In [28]:
# Randomly select article by index number within number of total articles
i = np.random.choice(texts.shape[0])

# Find document at random index location
doc = texts.iloc[i]

# Spin new document from random document
new_doc = spin_document(doc)

In [29]:
# Use textwrap.fill() to print output

print(textwrap.fill(new_doc, replace_whitespace=False, fix_sentence_endings=True))

Weak dollar trims Cadbury profits

The world's biggest confectionery
<confectionery> firm, Cadbury Schweppes, has <have> reported a modest
rise in profits <Asia> after the weak <falling> dollar took a bite out
of its <positive> results.

Underlying pre-tax profits rose 1% <%> to
£933m <£6.2bn> ($1.78bn) in 2004, but would have been 8% higher if
currency movements were stripped out . The owner <owner> of brands
<products> such as Dairy Milk, Dr Pepper and Snapple generates more
than 80% of its sales <sales> outside the UK <news>. Cadbury said it
was confident <finding> it would hit its targets for 2005 . <.> "While
the external commercial environment remains competitive <high>, we
<which> are confident that we have the strategy <pound>, brands and
people <Russia> to deliver within our goal ranges in 2005, <,>" said
chief executive Todd Stitzer.

The modest profit rise had been
expected by analysts after the company <company> said in December
<Europe> that the poor summer weather had hit

In [None]:
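# You'll notice that sometimes the sampled word is the same as the original word, because the original word
# is itself one of the candidates in the distribution
# The 0.3 threshold only controls how often a replacement is attempted (setting it to 0 would disable spinning)
# If you want to force a change, exclude the original word from the candidates before sampling

# Notice that the detokenizer fails at the end of some sentences by leaving whitespace before the full stop - you may
# want to add a transformation rule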
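Both notes can be addressed with small helpers. The following is a sketch under stated assumptions, not part of the notebook's pipeline: `sample_different` and `fix_punctuation_spacing` are hypothetical names, the sampler uses the standard library's `random.choices` rather than `sample_word`, and the regular expression is one possible transformation rule that only touches spaces or tabs directly in front of `. , ; : ! ?`.

```python
import random
import re

def sample_different(p_dist, original, rng=random):
    """Sample a middle word other than `original`; return None if there is no alternative."""
    candidates = {w: p for w, p in p_dist.items() if w != original}
    if not candidates:
        return None
    # random.choices treats weights as relative, so no renormalization is needed
    return rng.choices(list(candidates), weights=list(candidates.values()), k=1)[0]

def fix_punctuation_spacing(text):
    """Remove spaces left in front of sentence punctuation, keeping line breaks intact."""
    return re.sub(r"[ \t]+([.,;:!?])(?=\s|$)", r"\1", text)

print(sample_different({"media": 0.5, "oil": 0.5}, "media"))  # oil
print(fix_punctuation_spacing("currency movements were stripped out . The owner"))
# currency movements were stripped out. The owner
```

Inside `spin_line`, a `None` result would mean the context has no alternative and the original word should be kept.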

# Extension exercises:

### POS tags

Take parts of speech into account by tagging each token, so that a word is only replaced by a candidate with the same tag.
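One possible way to do this is to make the tag of the middle word part of the dictionary key, so candidates are grouped by context and by part of speech. The sketch below assumes the sentences have already been tagged, for example by a tagger such as `nltk.pos_tag`; the helper name `build_pos_counts` and the hand-written Penn Treebank style tags are illustrative assumptions.

```python
from collections import Counter, defaultdict

def build_pos_counts(tagged_sentences):
    """Count middle words per (previous word, middle tag, next word) context."""
    counts = defaultdict(Counter)
    for sent in tagged_sentences:
        # Each element of sent is a (word, tag) pair
        for (w_0, _), (w_1, tag_1), (w_2, _) in zip(sent, sent[1:], sent[2:]):
            counts[(w_0, tag_1, w_2)][w_1] += 1
    return counts

# Hypothetical pre-tagged input (tags written by hand for illustration)
tagged = [
    [("US", "NNP"), ("media", "NN"), ("giant", "NN")],
    [("US", "NNP"), ("banking", "NN"), ("giant", "NN")],
    [("US", "NNP"), ("based", "VBN"), ("giant", "NN")],
]
pos_counts = build_pos_counts(tagged)
print(dict(pos_counts[("US", "NN", "giant")]))  # {'media': 1, 'banking': 1}
```

With this key, a noun in the original text can only be swapped for another noun seen in the same context.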

### Synonyms

Use a dictionary of synonyms to supply replacement candidates.

### Larger context window

You can change the size of the context window by including more preceding words and more following words around the word in question. Whether this helps can only be judged by first understanding how the 'spinner' behaves in its present state. Note that this adds more dimensions: the keys become longer tuples, so each distinct context is observed less often.

The number of words before and after does not have to be symmetrical, i.e. you can have one previous word and two following words.
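A generalized counting step with an adjustable, possibly asymmetric window could look like the sketch below. The function name `build_context_counts` and its parameters `n_before` and `n_after` are hypothetical, and `str.split` stands in for `word_tokenize` so the example is self-contained.

```python
from collections import Counter, defaultdict

def build_context_counts(token_lists, n_before=1, n_after=1):
    """Count middle words keyed by (tuple of preceding words, tuple of following words)."""
    counts = defaultdict(Counter)
    for tokens in token_lists:
        # Only positions with a full window on both sides are counted
        for i in range(n_before, len(tokens) - n_after):
            before = tuple(tokens[i - n_before:i])
            after = tuple(tokens[i + 1:i + 1 + n_after])
            counts[(before, after)][tokens[i]] += 1
    return counts

tokens = "the weak dollar took a bite".split()

# One previous word and two following words
wide_counts = build_context_counts([tokens], n_before=1, n_after=2)
print(dict(wide_counts[(("the",), ("dollar", "took"))]))  # {'weak': 1}
```

With `n_before=1` and `n_after=1` this reduces to the same contexts as the original model, except that each side of the key is wrapped in a one-element tuple.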