
In [23]:
# Download the BBC full-text document classification dataset as a single CSV file.
# Original dataset: https://www.kaggle.com/shivamkushwaha/bbc-full-text-document-classification
# The -nc flag skips the download when the file already exists.
!wget -nc https://lazyprogrammer.me/course_files/nlp/bbc_text_cls.csv

File ‘bbc_text_cls.csv’ already there; not retrieving.



In [2]:
# Importing necessary libraries for text processing and analysis
import numpy as np
import pandas as pd
import textwrap
import nltk
from nltk import word_tokenize
from nltk.tokenize.treebank import TreebankWordDetokenizer

In [24]:
# Downloading the NLTK tokenizer models for text processing
# (newer NLTK releases may ask for nltk.download('punkt_tab') instead)
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [25]:
# Reading the dataset from the CSV file into a pandas DataFrame
df = pd.read_csv('bbc_text_cls.csv')

In [26]:
# Displaying the first few rows of the DataFrame for initial exploration
df.head()

Unnamed: 0,text,labels
0,Ad sales boost Time Warner profit\n\nQuarterly...,business
1,Dollar gains on Greenspan speech\n\nThe dollar...,business
2,Yukos unit buyer faces loan claim\n\nThe owner...,business
3,High fuel prices hit BA's profits\n\nBritish A...,business
4,Pernod takeover talk lifts Domecq\n\nShares in...,business


In [27]:
# Extracting unique labels from the 'labels' column of the DataFrame
labels = set(df['labels'])

In [28]:
# Displaying the set of unique labels present in the dataset
labels

{'business', 'entertainment', 'politics', 'sport', 'tech'}

In [29]:
# Selecting a specific label for training data extraction
label = 'sport'

In [30]:
# Extracting text data associated with the selected label from the DataFrame
texts = df[df['labels'] == label]['text']
texts.head()

1313    Claxton hunting first major medal\n\nBritish h...
1314    O'Sullivan could run in Worlds\n\nSonia O'Sull...
1315    Greene sets sights on world title\n\nMaurice G...
1316    IAAF launches fight against drugs\n\nThe IAAF ...
1317    Dibaba breaks 5,000m world record\n\nEthiopia'...
Name: text, dtype: object

In [32]:
import nltk
from nltk import word_tokenize

def collect_probs(texts):
    """
    Collects middle-word counts for every (previous word, next word) context.

    Args:
        texts (iterable of str): The text documents, e.g. a pandas Series.

    Returns:
        dict: Raw counts, not yet probabilities (they are normalized in a later cell).
              The keys are tuples of the form (w(t-1), w(t+1)),
              and the values are dictionaries with w(t) as key and count(w(t)) as value.
    """

    probs = {}  # key: (w(t-1), w(t+1)), value: {w(t): count(w(t))}

    for doc in texts:
        lines = doc.split("\n")
        for line in lines:
            tokens = word_tokenize(line)
            for i in range(len(tokens) - 2):  # because I need 3 words in a row
                t_0 = tokens[i]
                t_1 = tokens[i + 1]
                t_2 = tokens[i + 2]
                key = (t_0, t_2)
                if key not in probs:
                    probs[key] = {}

                # add count for middle token
                if t_1 not in probs[key]:
                    probs[key][t_1] = 1
                else:
                    probs[key][t_1] += 1

    return probs
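
To make the structure of the returned dictionary concrete, here is a minimal sketch on a two-line toy corpus. The helper name `collect_counts_toy` and the toy sentences are illustrative assumptions, and plain `str.split` stands in for `word_tokenize` so that the example runs without downloading any tokenizer data.

```python
from collections import defaultdict


def collect_counts_toy(lines):
    """Count middle words per (previous, next) context on pre-split lines."""
    counts = defaultdict(lambda: defaultdict(int))
    for line in lines:
        # Simplification: whitespace tokenizer instead of word_tokenize
        tokens = line.split()
        # Slide a window of three consecutive tokens over the line
        for prev, middle, nxt in zip(tokens, tokens[1:], tokens[2:]):
            counts[(prev, nxt)][middle] += 1
    # Convert to plain dicts so the result prints cleanly
    return {key: dict(inner) for key, inner in counts.items()}


toy_lines = [
    "the cat sat on the mat",
    "the dog sat on the rug",
]
toy_counts = collect_counts_toy(toy_lines)
print(toy_counts[("the", "sat")])  # two candidate middle words
print(toy_counts[("sat", "the")])  # one candidate, seen twice
```

Only contexts with more than one candidate middle word, such as `("the", "sat")` here, can ever lead to a replacement later on.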


In [33]:
probs = collect_probs(texts)

In [34]:
# Normalize the counts in place so that each inner dict becomes a probability distribution
for key, d in probs.items():
    # After this loop, the values of d sum to 1
    total = sum(d.values())
    for k, v in d.items():
        d[k] = v / total
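
The normalization step divides each count by the total for its context. The sketch below does the same on a copy instead of in place. The function name `normalize_counts` is hypothetical, and the raw counts 3, 1, 1 are assumed values chosen only to be consistent with the 0.6 / 0.2 / 0.2 split printed for `('first', 'medal')`; the notebook does not show the raw counts.

```python
def normalize_counts(counts):
    """Return a new dict mapping each context to a probability distribution."""
    normalized = {}
    for key, inner in counts.items():
        total = sum(inner.values())
        normalized[key] = {word: count / total for word, count in inner.items()}
    return normalized


# Assumed raw counts for illustration only
toy_counts = {("first", "medal"): {"major": 3, "relay": 1, "championship": 1}}
toy_probs = normalize_counts(toy_counts)
print(toy_probs[("first", "medal")])
```

Returning a new dictionary keeps the raw counts available, which can be convenient if the counting cell is expensive to re-run.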

In [35]:
probs

{('Claxton', 'first'): {'hunting': 1.0},
 ('hunting', 'major'): {'first': 1.0},
 ('first', 'medal'): {'major': 0.6, 'relay': 0.2, 'championship': 0.2},
 ('British', 'Sarah'): {'hurdler': 1.0},
 ('hurdler', 'Claxton'): {'Sarah': 1.0},
 ('Sarah', 'is'): {'Claxton': 1.0},
 ('Claxton', 'confident'): {'is': 1.0},
 ('is', 'she'): {'confident': 0.2857142857142857,
  'hopeful': 0.14285714285714285,
  'unlikely': 0.14285714285714285,
  'certain': 0.14285714285714285,
  'that': 0.14285714285714285,
  'because': 0.14285714285714285},
 ('confident', 'can'): {'she': 0.14285714285714285,
  'Hansen': 0.14285714285714285,
  'he': 0.14285714285714285,
  'we': 0.42857142857142855,
  'Real': 0.14285714285714285},
 ('she', 'win'): {'can': 1.0},
 ('can', 'her'): {'win': 0.5, 'surpass': 0.5},
 ('win', 'first'): {'her': 0.2727272727272727,
  'his': 0.45454545454545453,
  'their': 0.09090909090909091,
  'the': 0.18181818181818182},
 ('her', 'major'): {'first': 1.0},
 ('major', 'at'): {'medal': 0.3333333333333

In [36]:
# Splitting the text content of the first document into lines using newline characters
texts.iloc[0].split("\n")

['Claxton hunting first major medal',
 '',
 "British hurdler Sarah Claxton is confident she can win her first major medal at next month's European Indoor Championships in Madrid.",
 '',
 'The 25-year-old has already smashed the British record over 60m hurdles twice this season, setting a new mark of 7.96 seconds to win the AAAs title. "I am quite confident," said Claxton. "But I take each race as it comes. "As long as I keep up my training but not do too much I think there is a chance of a medal." Claxton has won the national 60m hurdles title for the past three years but has struggled to translate her domestic success to the international stage. Now, the Scotland-born athlete owns the equal fifth-fastest time in the world this year. And at last week\'s Birmingham Grand Prix, Claxton left European medal favourite Russian Irina Shevchenko trailing in sixth spot.',
 '',
 'For the first time, Claxton has only been preparing for a campaign over the hurdles - which could explain her leap in

In [37]:
def spin_document(doc):
    """
    Applies text spinning to a given document.

    This function splits the input document into lines (paragraphs) and applies
    the 'spin_line' function to each line. The spun lines are then rejoined into a new document.

    Args:
        doc (str): The input document as a single string.

    Returns:
        str: The spun document, where each line has been spun individually.
    """

    # Splitting the document into lines (paragraphs)
    lines = doc.split("\n")
    output = []
    for line in lines:
        if line:
            new_line = spin_line(line)  # 'spin_line' is defined in a later cell
        else:
            new_line = line
        output.append(new_line)
    return "\n".join(output)


In [38]:
# Creating a TreebankWordDetokenizer instance for detokenization
detokenizer = TreebankWordDetokenizer()

In [39]:
def sample_word(d):
    """
    Samples a word from a given distribution.

    This function takes a dictionary representing a word distribution, where the keys are words
    and the values are corresponding probabilities. It samples a word based on the given distribution.

    Args:
        d (dict): A dictionary representing a word distribution with words as keys and probabilities as values.

    Returns:
        str: A sampled word from the distribution.

    Raises:
        AssertionError: If no word is selected, which can only happen when the
                        probabilities in d sum to less than 1 (for example, an
                        unnormalized or empty distribution).
    """

    p0 = np.random.random()
    cumulative = 0
    for t, p in d.items():
        cumulative += p
        if p0 < cumulative:
            return t
    assert False  # should never get here
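
As a cross-check on the cumulative-sum sampler above, the same draw can be sketched with NumPy's `Generator.choice`. This is an alternative, not the notebook's method: the name `sample_word_choice`, the seed, and the example distribution are assumptions for illustration. Drawing many samples shows the empirical frequencies approaching the target probabilities.

```python
import numpy as np


def sample_word_choice(d, rng):
    """Sample a key of d with probability given by its value."""
    words = list(d.keys())
    weights = np.array(list(d.values()), dtype=float)
    # Renormalize to guard against small floating-point drift
    weights = weights / weights.sum()
    # Draw an index rather than a word so the word list stays a plain Python list
    return words[rng.choice(len(words), p=weights)]


rng = np.random.default_rng(0)
dist = {"major": 0.6, "relay": 0.2, "championship": 0.2}
draws = [sample_word_choice(dist, rng) for _ in range(10_000)]
freqs = {word: draws.count(word) / len(draws) for word in dist}
print(freqs)
```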


In [40]:
def spin_line(line):
    """
    Apply text spinning to a given line of text.

    This function takes a line of text, tokenizes it, and applies a word spinning process.
    Each middle word is replaced, with probability 0.3, by a word sampled from the
    distribution for its (previous word, next word) context, provided that context
    offers more than one candidate. The global 'probs' dictionary supplies these distributions.

    Args:
        line (str): The input line of text.

    Returns:
        str: The spun line of text after the word spinning process.
    """

    tokens = word_tokenize(line)
    if not tokens:
        # Nothing to spin (e.g. a whitespace-only line)
        return line
    i = 0
    output = [tokens[0]]
    while i < (len(tokens) - 2):
        t_0 = tokens[i]
        t_1 = tokens[i + 1]
        t_2 = tokens[i + 2]
        key = (t_0, t_2)
        # Contexts never seen in training get an empty distribution and are left unchanged
        p_dist = probs.get(key, {})
        if len(p_dist) > 1 and np.random.random() < 0.3:
            # Replace the middle word
            middle = sample_word(p_dist)
            # output.append(t_1)  # Uncomment to also keep the original word for comparison
            output.append(middle)  # New word to be added
            output.append(t_2)

            # Skip ahead 2 steps: t_2 served as the right-hand context for this
            # replacement, so it is kept as-is and is never replaced itself
            i += 2
        else:
            # Keep the middle word as it is
            output.append(t_1)
            i += 1
    # Append the final token unless a replacement at the very end already emitted it as t_2
    if i == len(tokens) - 2:
        output.append(tokens[-1])
    return detokenizer.detokenize(output)
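
The loop above can be exercised in isolation on a toy distribution. The sketch below is a simplified variant under stated assumptions: it works on a pre-split token list, uses the standard-library `random` module rather than NumPy, and the names `spin_tokens`, `replace_prob`, and `toy_probs` are hypothetical. Setting `replace_prob=1.0` forces a replacement wherever one is possible, which makes the control flow easy to follow.

```python
import random


def spin_tokens(tokens, probs, replace_prob=0.3, rng=random):
    """Spin a token list; contexts missing from probs are left unchanged."""
    if not tokens:
        return []
    output = [tokens[0]]
    i = 0
    while i < len(tokens) - 2:
        key = (tokens[i], tokens[i + 2])
        dist = probs.get(key, {})  # unseen context -> empty distribution
        if len(dist) > 1 and rng.random() < replace_prob:
            words, weights = zip(*dist.items())
            output.append(rng.choices(words, weights=weights)[0])
            output.append(tokens[i + 2])  # right-hand context is kept as-is
            i += 2
        else:
            output.append(tokens[i + 1])
            i += 1
    # One token is still pending unless a final replacement already emitted it
    if i == len(tokens) - 2:
        output.append(tokens[-1])
    return output


toy_probs = {("the", "sat"): {"cat": 0.5, "dog": 0.5}}
tokens = "the cat sat on the mat".split()
spun = spin_tokens(tokens, toy_probs, replace_prob=1.0, rng=random.Random(0))
print(spun)
```

Whatever word is sampled, the output always has the same number of tokens as the input, because every step either copies one token or swaps exactly one middle word.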


In [41]:
# Setting the random seed for reproducibility
np.random.seed(42)

In [42]:
# Selecting a random index to choose a document from the 'texts' Series
i = np.random.choice(texts.shape[0])

# Extracting the selected document at index 'i'
doc = texts.iloc[i]

# Applying text spinning to the selected document
new_doc = spin_document(doc)


In [43]:
# Printing the original document with text wrapping
print(textwrap.fill(
    doc, replace_whitespace=False, fix_sentence_endings=True))

Duff ruled out of Barcelona clash

Chelsea's Damien Duff has been
ruled out of Wednesday's Champions League clash with Barcelona at the
Nou Camp.

Duff sustained a knee injury in the FA Cup defeat at
Newcastle and manager Jose Mourinho said: "He cannot run.  His injury
is very painful, so he is out."  But Mourinho has revealed defender
Willian Gallas and striker Didier Drogba will be in the starting line-
up.  The Blues boss took the unusual step of naming his side a day
before the match, with Jole Cole named in midfield.  Mourinho said:
"We have one more session but I think Drogba will play, and Gallas
will play.  "Drogba trained on Monday with no problems and will do the
same on Tuesday.  Gallas feels he can play and wants to play.  We are
protecting him still but he will be okay to play."  Drogba, Chelsea's
£24m striker, has missed the last three weeks through injury.

Cech,
Ferreira, Carvalho, Terry, Gallas, Tiago, Makelele, Lampard, Cole,
Drogba, Gudjohnsen.


In [44]:
# Printing the spun document with text wrapping
print(textwrap.fill(
    new_doc, replace_whitespace=False, fix_sentence_endings=True))


Duff ruled out of her clash

Chelsea's Damien Duff has been ruled out
on Wednesday's Champions League clash with victory at the Nou Camp.
Duff had a knee injury at the FA Cup defeat at sprinting and manager
Jose Mourinho said: "I cannot run . His injury is very painful, so he
pulled out ." But Mourinho has revealed defender Willian Gallas and
striker Didier Drogba will be in the starting line-up . The Toon boss
took the unusual step of naming his side a day before the match, with
Jole Cole named in midfield . Mourinho said: "We have gained more, but
I think Drogba will play, and Gallas will play . "Drogba trained on
Monday with no problems and will do the same on Tuesday . He feels he
can play and wants to France . We are protecting him still but it will
be challenging to respond ." Drogba, Chelsea's £24m striker, has
missed the last three weeks through injury.

Cech, Shields, Carvalho,
Terry, Gallas, Tiago, Fortune, Lampard, Shaw, Drogba, Horvath.
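
To see exactly which words were swapped, the original and spun text can be compared token by token. This is a minimal sketch using the standard-library `difflib`; the helper name `changed_tokens` is hypothetical, whitespace splitting is a simplification, and the two strings are the headline lines from the outputs above.

```python
import difflib


def changed_tokens(original, spun):
    """Return (old_tokens, new_tokens) pairs for each span that differs."""
    a, b = original.split(), spun.split()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    return [
        (a[i1:i2], b[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


original_title = "Duff ruled out of Barcelona clash"
spun_title = "Duff ruled out of her clash"
print(changed_tokens(original_title, spun_title))
# → [(['Barcelona'], ['her'])]
```

Applied to whole documents, the length of the returned list gives a rough count of how many spots the spinner touched.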
