# Article Spinner Project

## Introduction

The goal of this project is to create an **article spinner**, a tool that generates a rewritten version of an article by substituting words with alternatives while trying to preserve the original context and meaning. This technique is often used in content creation to produce varied versions of a text, for example for SEO or creative writing.

An article spinner works by leveraging natural language processing (NLP) techniques to identify suitable replacements for words or phrases based on their context in a sentence. In this project, we achieve this by building a **probability model** that predicts the likelihood of a word appearing in a specific position based on its surrounding words.

We are using the [BBC Full Text Document Classification Dataset](https://www.kaggle.com/shivamkushwaha/bbc-full-text-document-classification) as the input data for building and testing our model.

## How the Article Spinner is Generated

1. **Building the Probability Model:**  
   A **frequentist approach** is used to construct a probability model. This model calculates the conditional probability of a word appearing at a specific position, given the word that precedes it (`p-1`) and the word that follows it (`p+1`). Mathematically, this can be expressed as:  
   \[
   P(\text{word}_{p} | \text{word}_{p-1}, \text{word}_{p+1})
   \]  
   The model uses word frequencies from the dataset to estimate these probabilities.

2. **Sampling New Words:**  
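   Using the probability model, the spinner samples alternative words for positions in the article. Replacements are drawn in proportion to their probability in the given context, so words that were frequently seen between the same two neighbours are chosen more often. Stop words and punctuation are left unchanged, and only a fraction of the remaining words is considered for replacement.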

3. **Reconstruction of the Article:**  
   The newly sampled words are assembled to create a "spun" version of the original article. This version maintains the original structure and message but uses different wording to achieve variation.
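
The three steps above can be illustrated on a toy corpus. The following is a minimal sketch, not the notebook's implementation: the three-sentence corpus and the names `build_model` and `spin_position` are made up for illustration, and whitespace splitting stands in for a real tokenizer.

```python
import random
from collections import Counter, defaultdict

# Toy corpus (hypothetical); each sentence is already split into tokens.
corpus = [
    "the cat sat down".split(),
    "the dog sat down".split(),
    "the cat sat still".split(),
]

def build_model(sentences):
    """Estimate P(word_p | word_{p-1}, word_{p+1}) from relative frequencies."""
    counts = defaultdict(Counter)
    for tokens in sentences:
        for p in range(1, len(tokens) - 1):
            counts[(tokens[p - 1], tokens[p + 1])][tokens[p]] += 1
    model = {}
    for context, counter in counts.items():
        total = sum(counter.values())
        model[context] = {word: n / total for word, n in counter.items()}
    return model

def spin_position(model, tokens, p, rng=random):
    """Sample a replacement for tokens[p] from its (previous, next) context."""
    dist = model.get((tokens[p - 1], tokens[p + 1]))
    if dist is None:
        return tokens[p]  # unseen context: keep the original word
    words = list(dist)
    return rng.choices(words, weights=[dist[w] for w in words], k=1)[0]

model = build_model(corpus)
print(model[("the", "sat")])
print(spin_position(model, "the dog sat down".split(), 1))
```

In this toy corpus, "cat" appears between "the" and "sat" twice and "dog" once, so the model assigns them probabilities 2/3 and 1/3 in that context.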

## Dataset Details

The project uses the BBC dataset, which contains 2,225 articles in five categories: sport, business, politics, tech, and entertainment. In this notebook, the probability model is built from the 510 business articles only, so the word associations it learns reflect business news.

By the end of the project, we aim to have a functional article spinner that demonstrates the power of NLP techniques in generating context-aware variations of textual content.


In [1]:
import string

import numpy as np # random sampling
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import nltk
nltk.download('stopwords')
# Note: word_tokenize also needs the Punkt tokenizer data
# ('punkt', or 'punkt_tab' depending on the NLTK version)
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.tokenize.treebank import TreebankWordDetokenizer

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\giode\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [4]:
data_dir = './data/'  # renamed from `dir` to avoid shadowing the built-in

file = 'bbc_text_cls.csv'

df0 = pd.read_csv(data_dir+file)

# Punctuation list
punctuations = string.punctuation

# Stopwords list
stop_words = set(stopwords.words('english'))

print(punctuations)
print(stop_words)

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
{'because', "she's", 'a', 'about', 'wouldn', 'yourself', 'did', 'ourselves', 'other', "weren't", 'do', "isn't", 'further', 'hers', 'same', 'mightn', 'them', 'have', "doesn't", "should've", "won't", 'when', 'were', 'i', 'shan', 'am', 'until', 'what', 'no', 'this', 'it', 'those', 'mustn', "wasn't", 'during', 'we', 'your', 'you', 'whom', "that'll", 'but', 'down', 'under', 'very', 'over', 'they', "didn't", 'her', 'to', 'more', 'will', 'now', 'all', 'in', 'below', 'me', "shan't", 'above', "aren't", 'doing', 'yours', "you'll", 'few', 'by', 'with', 'ain', "mightn't", 'being', 'ma', 'isn', 'o', 'as', 'most', 'be', 'are', 'couldn', 'such', 'off', 'after', 'then', "hadn't", 'just', 'needn', 'him', 'herself', 'theirs', 'for', 'the', 'or', 'up', 'if', 'any', 'some', "you've", 'where', 'own', 'can', 'should', 'he', 'against', 'each', 'haven', 'there', 'why', 'that', 'll', 'yourselves', 'been', 'which', 'nor', 'so', 'was', 'shouldn', 'ours', 'his', 'on', "wouldn't", 

In [5]:
df0.head()

Unnamed: 0,text,labels
0,Ad sales boost Time Warner profit\n\nQuarterly...,business
1,Dollar gains on Greenspan speech\n\nThe dollar...,business
2,Yukos unit buyer faces loan claim\n\nThe owner...,business
3,High fuel prices hit BA's profits\n\nBritish A...,business
4,Pernod takeover talk lifts Domecq\n\nShares in...,business


In [6]:
df0['labels'].value_counts()

labels
sport            511
business         510
politics         417
tech             401
entertainment    386
Name: count, dtype: int64

In [7]:
#Get the business articles

df = df0[df0['labels']=='business']['text']

In [8]:
df.shape

(510,)

In [9]:
df.head()

0    Ad sales boost Time Warner profit\n\nQuarterly...
1    Dollar gains on Greenspan speech\n\nThe dollar...
2    Yukos unit buyer faces loan claim\n\nThe owner...
3    High fuel prices hit BA's profits\n\nBritish A...
4    Pernod takeover talk lifts Domecq\n\nShares in...
Name: text, dtype: object

In [10]:
df[0]

'Ad sales boost Time Warner profit\n\nQuarterly profits at US media giant TimeWarner jumped 76% to $1.13bn (£600m) for the three months to December, from $639m year-earlier.\n\nThe firm, which is now one of the biggest investors in Google, benefited from sales of high-speed internet connections and higher advert sales. TimeWarner said fourth quarter sales rose 2% to $11.1bn from $10.9bn. Its profits were buoyed by one-off gains which offset a profit dip at Warner Bros, and less users for AOL.\n\nTime Warner said on Friday that it now owns 8% of search-engine Google. But its own internet business, AOL, had has mixed fortunes. It lost 464,000 subscribers in the fourth quarter profits were lower than in the preceding three quarters. However, the company said AOL\'s underlying profit before exceptional items rose 8% on the back of stronger internet advertising revenues. It hopes to increase subscribers by offering the online service free to TimeWarner internet customers and will try to sig

In [11]:
print(df[0])

Ad sales boost Time Warner profit

Quarterly profits at US media giant TimeWarner jumped 76% to $1.13bn (£600m) for the three months to December, from $639m year-earlier.

The firm, which is now one of the biggest investors in Google, benefited from sales of high-speed internet connections and higher advert sales. TimeWarner said fourth quarter sales rose 2% to $11.1bn from $10.9bn. Its profits were buoyed by one-off gains which offset a profit dip at Warner Bros, and less users for AOL.

Time Warner said on Friday that it now owns 8% of search-engine Google. But its own internet business, AOL, had has mixed fortunes. It lost 464,000 subscribers in the fourth quarter profits were lower than in the preceding three quarters. However, the company said AOL's underlying profit before exceptional items rose 8% on the back of stronger internet advertising revenues. It hopes to increase subscribers by offering the online service free to TimeWarner internet customers and will try to sign up AOL

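The text uses "\n\n" to separate paragraphs. The "\'" that appears in the raw string output is just an escaped apostrophe (for example in the possessive "AOL's"); it prints as a plain "'". Since the goal is to spin the article, we leave apostrophes untouched, and we do not remove punctuation either.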
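
To make the two observations concrete, here is a small sketch on a made-up string (the `sample` text is hypothetical, not taken from the dataset). It shows that splitting on a blank line recovers the paragraphs, and that the backslash before the apostrophe exists only in the source code or `repr`, not in the stored text.

```python
# Hypothetical three-paragraph document in the same layout as the BBC articles.
sample = 'Title line\n\nFirst paragraph about AOL\'s profit.\n\nSecond paragraph.'

# Paragraphs are separated by a blank line, i.e. two newline characters.
paragraphs = sample.split("\n\n")
print(len(paragraphs))

# The backslash is only an escape in the string literal; the text holds a plain apostrophe.
print("\\" in sample)
print(paragraphs[1])
```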

In [12]:
#Tokenize the corpus:

df_token = df.apply(lambda x: word_tokenize(x))

In [13]:
df_token[0][:15]

['Ad',
 'sales',
 'boost',
 'Time',
 'Warner',
 'profit',
 'Quarterly',
 'profits',
 'at',
 'US',
 'media',
 'giant',
 'TimeWarner',
 'jumped',
 '76']

In [15]:
"\n\n" in df_token[1] #Check if paragraph was tokenized

False

In [16]:
vocab = set([word for token_list in df_token for word in token_list])
V = len(vocab)
print(f"Vocabulary Size: {V} words")

Vocabulary Size: 14703 words


In [17]:
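#Create an encoder (word -> index) and decoder (index -> word)
word2idx = {}
idx2word = []
count = 0
for word in vocab:
    word2idx[word] = count  # each word gets its own index, matching its position in idx2word
    idx2word.append(word)
    count+=1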
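
As a sketch of what the encoder and decoder are meant to do, the toy example below builds both mappings with `enumerate` and checks that they invert each other. The small `toy_vocab` set and the `toy_` names are hypothetical. Sorting the vocabulary first is an optional choice that makes the indices reproducible, since the iteration order of a set of strings can change between Python sessions.

```python
# Hypothetical mini vocabulary.
toy_vocab = {"profit", "sales", "boost", "Warner"}

# Decoder: position in the list is the index. Encoder: the inverse mapping.
toy_idx2word = sorted(toy_vocab)
toy_word2idx = {word: idx for idx, word in enumerate(toy_idx2word)}

# Round trip: words -> indices -> words.
encoded = [toy_word2idx[w] for w in ["sales", "boost", "profit"]]
decoded = [toy_idx2word[i] for i in encoded]
print(encoded, decoded)
```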

In [18]:
#Now we want to estimate P(word_i | word_{i-1}, word_{i+1}).
#We use a nested dictionary: prob[(word_{i-1}, word_{i+1})][word_i] = number of times word_i appears between that pair

prob = {}
for token_list in df_token:

    for i in range(1,len(token_list)-1):

        if (token_list[i-1],token_list[i+1]) not in prob:
            prob[ (token_list[i-1],token_list[i+1]) ] = {}
        if token_list[i] not in prob[ (token_list[i-1],token_list[i+1]) ]:
            prob[ (token_list[i-1],token_list[i+1]) ][token_list[i]] = 0

        prob[(token_list[i-1],token_list[i+1])][token_list[i]] += 1

In [19]:
# Normalize the counts of each context into probabilities

for key in prob.keys():

    total = sum(prob[key].values())

    for key2, value2 in prob[key].items():
        prob[key][key2] = value2/total
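
A quick way to check the normalization step is to confirm that the probabilities of every context sum to 1. The sketch below does this on made-up counts; `toy_counts`, `normalize`, and `is_normalized` are illustrative names, not part of the notebook.

```python
import math

# Hypothetical counts: context (previous word, next word) -> {middle word: count}.
toy_counts = {
    ("the", "sat"): {"cat": 2, "dog": 1},
    ("cat", "down"): {"sat": 4},
}

def normalize(counts):
    """Return a new dict where each context's counts are divided by their total."""
    return {
        ctx: {word: n / sum(dist.values()) for word, n in dist.items()}
        for ctx, dist in counts.items()
    }

def is_normalized(model, tol=1e-9):
    """True if every context's probabilities sum to 1 within a tolerance."""
    return all(
        math.isclose(sum(dist.values()), 1.0, abs_tol=tol)
        for dist in model.values()
    )

toy_prob = normalize(toy_counts)
print(is_normalized(toy_prob))
```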

In [20]:
#Article Spinner
class ArticleSpinner():
    def __init__(self,prob):
        self.prob = prob

    def _random_sampler(self,probs):
        # Inverse-CDF sampling: return the first index whose cumulative
        # probability exceeds a uniform random draw
        rdn = np.random.rand()

        idx = 0
        cum_prob = probs[0]

        # The bound on idx guards against round-off in the cumulative sum
        while cum_prob<=rdn and idx < len(probs)-1:
            idx+=1
            cum_prob+=probs[idx]

        return idx

    def __call__(self, doc):

        detokenizer = TreebankWordDetokenizer()
        self.old_doc = doc

        paragraphs = doc.split('\n\n')

        new_doc = []
        for paragraph in paragraphs:

            tokens = word_tokenize(paragraph)

            new_tokens = tokens.copy()

            for i in range(1,len(tokens)-1):
                if (tokens[i] not in punctuations) and (tokens[i] not in stop_words):

                    # Candidate words seen between this (previous, next) pair;
                    # skip contexts that never occurred in the training data
                    candidates = self.prob.get( (tokens[i-1], tokens[i+1]) )
                    if candidates is None:
                        continue

                    list_probs = list(candidates.values())

                    # Skip contexts with a single candidate, and attempt a
                    # replacement for only about 30% of the remaining words
                    if len(list_probs) == 1 or np.random.random() > 0.3:
                        continue

                    idx_rdn = self._random_sampler(list_probs)

                    new_word = list(candidates.keys())[idx_rdn]

                    if new_word in punctuations or new_word in stop_words:
                        continue

                    # Mark the substitution as {original/replacement}
                    new_tokens[i] = '{'+tokens[i]+'/'+new_word+'}'

            new_doc.append(detokenizer.detokenize(new_tokens))

        return "\n\n".join(new_doc)
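
The sampler in the class above is an instance of inverse-CDF sampling: draw a uniform number and pick the first index whose cumulative probability exceeds it. The sketch below shows the same idea with the standard library, as an alternative formulation rather than the notebook's code. The function name `sample_index` is hypothetical, and the uniform draw `u` is passed in explicitly so the behaviour can be checked deterministically.

```python
from bisect import bisect_right
from itertools import accumulate

def sample_index(probs, u):
    """Return the first index whose cumulative probability is greater than u."""
    cumulative = list(accumulate(probs))
    idx = bisect_right(cumulative, u)
    # Clamp in case round-off leaves the final cumulative sum at or below u.
    return min(idx, len(probs) - 1)

probs = [0.25, 0.5, 0.25]
for u in (0.1, 0.25, 0.6, 0.9):
    print(u, sample_index(probs, u))
```

In practice `u` would come from a uniform random generator such as `random.random()` or `np.random.rand()`.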

In [21]:
art_spinner = ArticleSpinner(prob)

In [22]:
print(art_spinner(df[0]))

Ad sales boost Time Warner profit

Quarterly profits at US {media/telecoms} giant TimeWarner jumped {76/1.8}% to $1.13bn (£600m) for the {three/busiest} months to {December/September}, from $639m year-earlier.

The firm, which is now {one/part} of the {biggest/biggest} investors in Google, benefited from {sales/members} of {high-speed/high-speed} internet connections and higher advert sales . TimeWarner said fourth quarter sales rose {2/1}% to $11.1bn from $10.9bn . Its profits were buoyed by one-off gains which offset a profit dip at Warner Bros, and less users for {AOL/months}.

Time Warner said on Friday that it now owns 8% of search-engine Google . But its own {internet/consulting} business, {AOL/Colombia}, had has mixed fortunes . It lost 464,000 subscribers in the {fourth/fourth} quarter profits were lower than in the preceding three quarters . However, the {company/US} said AOL's underlying profit before exceptional items rose 8% on the back of stronger internet advertising reve

In [23]:
print(art_spinner(df[105]))

Golden rule 'intact' says ex-aide

Chancellor Gordon Brown will meet his golden economic rule "with a {margin/deal} to spare", {according/dipping} to his {former/finance} chief economic adviser.

Formerly one of Mr Brown's closest Treasury aides, {Ed/Ed} Balls hinted at a Budget giveaway on {16/21} {March/January}. He said he {hoped/hoped} more {would/would} be done to build on {current/current} tax credit rules . {Any/Any} rate rise ahead of an expected May election would not {affect/praise} the {Labour/Labour} Party's chances of winning, he added . {Last/Last} July, {Mr/Mr} Balls won the right to step down from his Treasury position and run for {parliament/everybody}, defending the Labour stronghold of Normanton in West Yorkshire.

Mr {Balls/Ebbers} {rejected/told} the {allegation/revelation} that Mr {Brown/Elgindy} had been sidelined in the election campaign, saying he was playing a "{different/significant}" {role/attached} to the {one/charges} he {played/participated} in the last t
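
The spinner marks each substitution as `{original/replacement}` so that the changes are easy to inspect. To obtain plain text, the markup has to be resolved to one of the two words. A minimal sketch, assuming neither word contains `{`, `}` or `/` (the helper name `resolve_markup` is hypothetical and not part of the notebook):

```python
import re

# Matches {original/replacement} where neither side contains braces or a slash.
MARKUP = re.compile(r"\{([^{}/]+)/([^{}/]+)\}")

def resolve_markup(text, keep_original=False):
    """Replace each {original/replacement} marker with one of its two words."""
    group = 1 if keep_original else 2
    return MARKUP.sub(lambda m: m.group(group), text)

spun = "Quarterly profits at US {media/telecoms} giant TimeWarner jumped {76/1.8}%"
print(resolve_markup(spun))
print(resolve_markup(spun, keep_original=True))
```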