## Building an article spinner

1. Problem: how to replace enough words in an article so that it differs sufficiently from the original yet still makes sense.


For article spinning, we need to predict the middle word from its two neighbors:

$$
p(w_t \mid w_{t-1}, w_{t+1}) = \frac{\text{count}(w_{t-1} \to w_t \to w_{t+1})}{\text{count}(w_{t-1} \to \text{ANY} \to w_{t+1})}
$$
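
As a rough illustration of the ratio above, the following sketch counts middle words on a tiny made-up corpus. The corpus, the whitespace tokenization (standing in for `word_tokenize`), and the names `counts` and `middle_prob` are assumptions for this example only.

```python
from collections import Counter, defaultdict

# Toy corpus, made up for illustration; str.split stands in for word_tokenize.
corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "the cat lay on the mat",
]

# (w_prev, w_next) -> Counter of the middle words seen between them
counts = defaultdict(Counter)
for sentence in corpus:
    tokens = sentence.split()
    for prev, mid, nxt in zip(tokens, tokens[1:], tokens[2:]):
        counts[(prev, nxt)][mid] += 1

def middle_prob(prev, mid, nxt):
    """count(prev -> mid -> nxt) / count(prev -> ANY -> nxt)."""
    context = counts.get((prev, nxt), Counter())
    total = sum(context.values())
    return context[mid] / total if total else 0.0

# "cat ? on" was seen with "sat" once and "lay" once
print(middle_prob("cat", "sat", "on"))  # 0.5
```

A context that only ever appears with one middle word gets probability 1 for that word, which is why such contexts offer nothing to swap in.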

In [18]:
!wget -nc https://lazyprogrammer.me/course_files/nlp/bbc_text_cls.csv
nltk.download('punkt')

--2024-07-14 17:13:27--  https://lazyprogrammer.me/course_files/nlp/bbc_text_cls.csv
Resolving lazyprogrammer.me (lazyprogrammer.me)... 104.21.23.210, 172.67.213.166
Connecting to lazyprogrammer.me (lazyprogrammer.me)|104.21.23.210|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5085081 (4,8M) [text/csv]
Saving to: 'bbc_text_cls.csv'


True

In [1]:
# TODO: support a variable context size (1, 2, etc.)
# and a minimum acceptable probability for changing a word

import pandas as pd
import numpy as np
import textwrap
import nltk
from nltk import word_tokenize
from nltk.tokenize.treebank import TreebankWordDetokenizer

In [2]:
# import dataset
df = pd.read_csv('bbc_text_cls.csv')#, on_bad_lines='skip')
df.head()

labels = set(df['labels'])
print(labels)

# train only from chosen labels
label = 'politics'

texts = df[df['labels'] == label]['text']
print(texts.head()) 


{'business', 'politics', 'entertainment', 'sport', 'tech'}
896    Labour plans maternity pay rise\n\nMaternity p...
897    Watchdog probes e-mail deletions\n\nThe inform...
898    Hewitt decries 'career sexism'\n\nPlans to ext...
899    Labour chooses Manchester\n\nThe Labour Party ...
900    Brown ally rejects Budget spree\n\nChancellor ...
Name: text, dtype: object


In [5]:
# collect counts
# NOTE: this is a debug run. The print calls and break statements stop
# after the first trigram so the structure can be inspected; remove them
# to count over the whole corpus.

probs = {}  # key: (w_{t-1}, w_{t+1}), value: {w_t: count}

for doc in texts:
    lines = doc.split("\n")
    for line in lines:
        tokens = word_tokenize(line)
        for i in range(len(tokens)-2):
            t_0 = tokens[i]
            t_1 = tokens[i+1]
            t_2 = tokens[i+2]
            key = (t_0, t_2)
            print(key)
            if key not in probs:
                probs[key] = {}

            print(t_0)
            print(t_1)
            print(t_2)
            # add count for middle token
            if t_1 not in probs[key]:
                probs[key][t_1] = 1
            else:
                probs[key][t_1] += 1
            break
        break
    break
probs

('Labour', 'maternity')
Labour
plans
maternity


{('Labour', 'maternity'): {'plans': 1}}

In [4]:
# normalize probabilities
for key, d in probs.items():
    # d should represent a distribution
    total = sum(d.values())
    # divide each count in this key's inner dictionary by the total
    for k, v in d.items():
        d[k] = v/total
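
The normalization step can also be written so that it builds a new dictionary and leaves the raw counts untouched, which may be convenient when cells are re-run out of order. This is only a sketch on made-up counts; `toy_counts` and `toy_probs` are hypothetical names, not part of the notebook.

```python
# Made-up counts in the same shape as probs: (w_prev, w_next) -> {w_mid: count}
toy_counts = {
    ("Labour", "maternity"): {"plans": 3, "backs": 1},
    ("the", "pay"): {"maternity": 2},
}

# Build a separate dictionary of distributions; toy_counts is not modified.
toy_probs = {
    key: {word: count / sum(d.values()) for word, count in d.items()}
    for key, d in toy_counts.items()
}

# Sanity check: every inner dictionary should sum to 1.
for key, d in toy_probs.items():
    assert abs(sum(d.values()) - 1.0) < 1e-9

print(toy_probs[("Labour", "maternity")])  # {'plans': 0.75, 'backs': 0.25}
```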

In [5]:
# text is split on paragraphs
texts.iloc[0].split("\n")

['Labour plans maternity pay rise',
 '',
 'Maternity pay for new mothers is to rise by £1,400 as part of new proposals announced by the Trade and Industry Secretary Patricia Hewitt.',
 '',
 'It would mean paid leave would be increased to nine months by 2007, Ms Hewitt told GMTV\'s Sunday programme. Other plans include letting maternity pay be given to fathers and extending rights to parents of older children. The Tories dismissed the maternity pay plan as "desperate", while the Liberal Democrats said it was misdirected.',
 '',
 'Ms Hewitt said: "We have already doubled the length of maternity pay, it was 13 weeks when we were elected, we have already taken it up to 26 weeks. "We are going to extend the pay to nine months by 2007 and the aim is to get it right up to the full 12 months by the end of the next Parliament." She said new mothers were already entitled to 12 months leave, but that many women could not take it as only six of those months were paid. "We have made a firm commitme

In [6]:
def spin_document(doc):
    # split document into lines
    lines = doc.split("\n")
    output = []
    for line in lines:
        if line:
            new_line = spin_line(line)
        else:
            new_line = line
        output.append(new_line)
        
    return "\n".join(output)
                

In [7]:
detokenizer = TreebankWordDetokenizer()
texts.iloc[0].split("\n")[2]
detokenizer.detokenize(word_tokenize(texts.iloc[0].split("\n")[2]))

'Maternity pay for new mothers is to rise by £1,400 as part of new proposals announced by the Trade and Industry Secretary Patricia Hewitt.'

In [8]:
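def sample_word(d):
    # draw a token from the distribution d (token -> probability)
    # by walking the cumulative sum until it passes a uniform draw
    p0 = np.random.random()
    cumulative = 0
    for t, p in d.items():
        cumulative += p
        if p0 < cumulative:
            return t
    # only reached if the cumulative sum never exceeded p0
    assert False, "probabilities in d should sum to 1"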
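
For comparison, the same kind of weighted draw can be done with the standard library's `random.choices`. The sketch below is not the notebook's method; `sample_word_stdlib` and `toy_dist` are hypothetical names, and the loop only checks that the empirical frequencies land near the given probabilities.

```python
import random
from collections import Counter

def sample_word_stdlib(d, rng=random):
    """Draw one token from d (token -> probability) using random.choices."""
    tokens = list(d)
    weights = list(d.values())
    return rng.choices(tokens, weights=weights, k=1)[0]

# Made-up distribution for illustration
toy_dist = {"plans": 0.5, "backs": 0.3, "rejects": 0.2}

rng = random.Random(0)  # seeded generator so the run is repeatable
n_draws = 10_000
draws = Counter(sample_word_stdlib(toy_dist, rng) for _ in range(n_draws))
freqs = {t: draws[t] / n_draws for t in toy_dist}
print(freqs)  # each frequency should be close to its probability
```

Because `random.choices` only needs relative weights, it does not fail when rounding leaves the probabilities summing to slightly less than 1.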

In [21]:
def spin_line(line):
    tokens = word_tokenize(line)
    i = 0
    output = [tokens[0]]

    while i < (len(tokens)-2):
        t_0 = tokens[i]
        t_1 = tokens[i+1]
        t_2 = tokens[i+2]
        key = (t_0, t_2)

        # unseen contexts get an empty distribution, so the word is kept
        p_dist = probs.get(key, {})
        if len(p_dist) > 1 and np.random.random() < 0.3:
            # keep the original middle word and show the sampled
            # replacement next to it in angle brackets
            middle = sample_word(p_dist)
            output.append(t_1)
            output.append("<"+middle+">")
            output.append(t_2)

            # skip 2 steps so that adjacent words are not both replaced
            i += 2

        else:
            # don't replace middle word
            output.append(t_1)
            i += 1

    # append the final token
    if i == (len(tokens)-2):
        output.append(tokens[-1])
    return detokenizer.detokenize(output)
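
The TODO at the top mentions a minimum acceptable probability for changing a word, and the sample output further down shows draws that simply repeat the original word (for example `European <European>`). One possible way to address both, sketched here under assumptions, is to filter the distribution before sampling. `filter_candidates` and `min_prob` are hypothetical names, and the threshold of 0.1 is arbitrary.

```python
def filter_candidates(p_dist, original, min_prob=0.1):
    """Drop the original word and low-probability words, then renormalize.

    Returns an empty dict when nothing survives, which a caller could treat
    as "keep the original word".
    """
    kept = {w: p for w, p in p_dist.items() if w != original and p >= min_prob}
    total = sum(kept.values())
    return {w: p / total for w, p in kept.items()} if kept else {}

# Made-up distribution for the middle word of some context
toy = {"campaigned": 0.6, "ran": 0.3, "stood": 0.08, "I": 0.02}

print(filter_candidates(toy, "campaigned"))        # {'ran': 1.0}
print(filter_candidates({"only": 1.0}, "only"))    # {}
```

The filtered dictionary has the same token-to-probability shape that `sample_word` expects, so it could be sampled from in the same way whenever it is non-empty.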

In [22]:
np.random.seed(1234)
i = np.random.choice(texts.shape[0])
doc = texts.iloc[i]
new_doc = spin_document(doc)
print(textwrap.fill(new_doc, replace_whitespace=False,fix_sentence_endings=True))

UKIP outspent Labour on EU poll

The UK Independence Party outspent
both Labour and the Liberal Democrats in the European <European>
elections, new figures show <">.

UKIP, which campaigned <campaigned>
on a slogan of "Say <can> no to Europe" <has>, spent £2.36m <yearly>
on the campaign - second only <place> to the Conservatives' £3.13m .
The campaign took UKIP into third place with an extra 10 MEPs . Labour
<Australia>'s campaign cost £1.7m, the Lib Dems' £1.19m and the
Greens' £404,000, according to figures revealed <revealed> by the
Electoral Commission on Wednesday . Much of the UKIP funding came from
Yorkshire millionaire Sir Paul Sykes <Sykes>, who helped bankroll the
party's billboard <billboard> campaign . Critics <I> have accused the
party of effectively buying votes <votes>. But <Possibly> a UKIP
spokesman said Labour and <explaining> the Conservatives had spent
£10m between them on <attending> the last general election . "With the
advantages <end> of public <the> money the o