In [1]:
# https://www.kaggle.com/shivamkushwaha/bbc-full-text-document-classification
!wget -nc https://lazyprogrammer.me/course_files/nlp/bbc_text_cls.csv

File ‘bbc_text_cls.csv’ already there; not retrieving.



In [2]:
!pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [3]:
import numpy as np
import pandas as pd
import textwrap
from pprint import pprint

from transformers import pipeline

In [4]:
df = pd.read_csv("bbc_text_cls.csv")

In [5]:
df.head()

Unnamed: 0,text,labels
0,Ad sales boost Time Warner profit\n\nQuarterly...,business
1,Dollar gains on Greenspan speech\n\nThe dollar...,business
2,Yukos unit buyer faces loan claim\n\nThe owner...,business
3,High fuel prices hit BA's profits\n\nBritish A...,business
4,Pernod takeover talk lifts Domecq\n\nShares in...,business


In [6]:
labels = set(df.labels)
labels

{'business', 'entertainment', 'politics', 'sport', 'tech'}

In [7]:
# Pick a label
label = 'business'

In [8]:
texts = df[df["labels"]==label].text
texts.head()

0    Ad sales boost Time Warner profit\n\nQuarterly...
1    Dollar gains on Greenspan speech\n\nThe dollar...
2    Yukos unit buyer faces loan claim\n\nThe owner...
3    High fuel prices hit BA's profits\n\nBritish A...
4    Pernod takeover talk lifts Domecq\n\nShares in...
Name: text, dtype: object

In [9]:
np.random.seed(1234)

In [10]:
i = np.random.choice(texts.shape[0])  # random position; not used below
doc = texts.loc[0]  # take the article with index label 0 instead

In [11]:
print(textwrap.fill(doc, replace_whitespace=False, fix_sentence_endings=True))

Ad sales boost Time Warner profit

Quarterly profits at US media giant
TimeWarner jumped 76% to $1.13bn (£600m) for the three months to
December, from $639m year-earlier.

The firm, which is now one of the
biggest investors in Google, benefited from sales of high-speed
internet connections and higher advert sales.  TimeWarner said fourth
quarter sales rose 2% to $11.1bn from $10.9bn.  Its profits were
buoyed by one-off gains which offset a profit dip at Warner Bros, and
less users for AOL.

Time Warner said on Friday that it now owns 8% of
search-engine Google.  But its own internet business, AOL, had has
mixed fortunes.  It lost 464,000 subscribers in the fourth quarter
profits were lower than in the preceding three quarters.  However, the
company said AOL's underlying profit before exceptional items rose 8%
on the back of stronger internet advertising revenues.  It hopes to
increase subscribers by offering the online service free to TimeWarner
internet customers and will try to sign 
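The wrapped output above breaks some lines early (for example after "media giant"). With `replace_whitespace=False`, `textwrap.fill` keeps the embedded newlines but still counts them as ordinary characters when measuring line width. A minimal sketch of one alternative is to wrap each paragraph on its own. The helper name `wrap_paragraphs` and the short `sample` string are illustrative assumptions, not part of the original notebook.

```python
import textwrap


def wrap_paragraphs(text, width=70):
    """Wrap each blank-line-separated paragraph separately.

    Hypothetical helper: splitting first means embedded newlines never
    count toward the width of a wrapped line.
    """
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    wrapped = [
        textwrap.fill(p, width=width, fix_sentence_endings=True)
        for p in paragraphs
    ]
    return "\n\n".join(wrapped)


# Short excerpt of the article, used only for illustration.
sample = (
    "Ad sales boost Time Warner profit\n\n"
    "Quarterly profits at US media giant TimeWarner jumped 76% to $1.13bn "
    "(£600m) for the three months to December, from $639m year-earlier."
)

print(wrap_paragraphs(sample))
```

Calling `print(wrap_paragraphs(doc))` would apply the same idea to the full article, assuming paragraphs are separated by blank lines as they are in this dataset.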

In [12]:
# The article order may change, so save the article text here
doc1 = """
Ad sales boost Time Warner profit

Quarterly profits at US media giant
TimeWarner jumped 76% to $1.13bn (£600m) for the three months to
December, from $639m year-earlier.

The firm, which is now one of the
biggest investors in Google, benefited from sales of high-speed
internet connections and higher advert sales.  TimeWarner said fourth
quarter sales rose 2% to $11.1bn from $10.9bn.  Its profits were
buoyed by one-off gains which offset a profit dip at Warner Bros, and
less users for AOL.

Time Warner said on Friday that it now owns 8% of
search-engine Google.  But its own internet business, AOL, had has
mixed fortunes.  It lost 464,000 subscribers in the fourth quarter
profits were lower than in the preceding three quarters.  However, the
company said AOL's underlying profit before exceptional items rose 8%
on the back of stronger internet advertising revenues.  It hopes to
increase subscribers by offering the online service free to TimeWarner
internet customers and will try to sign up AOL's existing customers
for high-speed broadband.  TimeWarner also has to restate 2000 and
2003 results following a probe by the US Securities Exchange
Commission (SEC), which is close to concluding.

Time Warner's fourth
quarter profits were slightly better than analysts' expectations.  But
its film division saw profits slump 27% to $284m, helped by box-office
flops Alexander and Catwoman, a sharp contrast to year-earlier, when
the third and final film in the Lord of the Rings trilogy boosted
results.  For the full-year, TimeWarner posted a profit of $3.36bn, up
27% from its 2003 performance, while revenues grew 6.4% to $42.09bn.
"Our financial performance was strong, meeting or exceeding all of our
full-year objectives and greatly enhancing our flexibility," chairman
and chief executive Richard Parsons said.  For 2005, TimeWarner is
projecting operating earnings growth of around 5%, and also expects
higher revenue and wider profit margins.

TimeWarner is to restate its
accounts as part of efforts to resolve an inquiry into AOL by US
market regulators.  It has already offered to pay $300m to settle
charges, in a deal that is under review by the SEC. The company said
it was unable to estimate the amount it needed to set aside for legal
reserves, which it previously set at $500m.  It intends to adjust the
way it accounts for a deal with German music publisher Bertelsmann's
purchase of a stake in AOL Europe, which it had reported as
advertising revenue.  It will now book the sale of its stake in AOL
Europe as a loss on the value of that stake.
"""

In [13]:
mlm = pipeline("fill-mask")

No model was supplied, defaulted to distilroberta-base and revision ec58a5b (https://huggingface.co/distilroberta-base).
Using a pipeline without specifying a model name and revision in production is not recommended.
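The cells that follow build each masked input by hand, retyping the sentence with one word swapped for `<mask>`. A small helper can do the substitution instead. This is only a sketch: `mask_word` is a hypothetical name, it splits on single spaces, and it assumes the mask token is `<mask>`, as it is for RoBERTa-style models such as distilroberta-base.

```python
def mask_word(sentence, target, mask_token="<mask>"):
    """Replace the first whole-word occurrence of `target` with the mask token.

    Hypothetical helper; words are compared after splitting on single spaces,
    so punctuation attached to a word is part of that word.
    """
    words = sentence.split(" ")
    if target not in words:
        raise ValueError(f"{target!r} not found as a whole word")
    words[words.index(target)] = mask_token
    return " ".join(words)


sentence = (
    "Quarterly profits at US media giant "
    "TimeWarner jumped 76% to $1.13bn (£600m) for the three months to "
    "December, from $639m year-earlier."
)

masked = mask_word(sentence, "profits")
print(masked)
```

The result can be passed straight to the pipeline, for example `pprint(mlm(mask_word(sentence, "jumped")))`.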


In [14]:
mlm("Ad sales boost Time Warner <mask>")

[{'score': 0.975560188293457,
  'token': 14433,
  'token_str': ' Cable',
  'sequence': 'Ad sales boost Time Warner Cable'},
 {'score': 0.0036186028737574816,
  'token': 14641,
  'token_str': ' Networks',
  'sequence': 'Ad sales boost Time Warner Networks'},
 {'score': 0.0027795506175607443,
  'token': 6076,
  'token_str': ' Communications',
  'sequence': 'Ad sales boost Time Warner Communications'},
 {'score': 0.001571234199218452,
  'token': 603,
  'token_str': ' Inc',
  'sequence': 'Ad sales boost Time Warner Inc'},
 {'score': 0.0012234053574502468,
  'token': 4,
  'token_str': '.',
  'sequence': 'Ad sales boost Time Warner.'}]
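The raw list of dictionaries is verbose, so it can help to reduce it to token and score pairs. The sketch below reuses the five predictions printed above as literal data so that it runs without loading the model; `summarize` is a hypothetical helper name.

```python
# Predictions copied from the output of the previous cell.
predictions = [
    {"score": 0.975560188293457, "token": 14433, "token_str": " Cable",
     "sequence": "Ad sales boost Time Warner Cable"},
    {"score": 0.0036186028737574816, "token": 14641, "token_str": " Networks",
     "sequence": "Ad sales boost Time Warner Networks"},
    {"score": 0.0027795506175607443, "token": 6076, "token_str": " Communications",
     "sequence": "Ad sales boost Time Warner Communications"},
    {"score": 0.001571234199218452, "token": 603, "token_str": " Inc",
     "sequence": "Ad sales boost Time Warner Inc"},
    {"score": 0.0012234053574502468, "token": 4, "token_str": ".",
     "sequence": "Ad sales boost Time Warner."},
]


def summarize(preds):
    """Return (token, rounded score) pairs from fill-mask style results."""
    return [(p["token_str"].strip(), round(p["score"], 4)) for p in preds]


for token, score in summarize(predictions):
    print(f"{token:<16}{score:.4f}")
```

With the live pipeline, `summarize(mlm("Ad sales boost Time Warner <mask>"))` should give the same kind of compact view. Here the model strongly prefers "Cable" over the headline's actual last word, "profit", which does not appear among the top five predictions.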

In [15]:
text = (
    "Quarterly <mask> at US media giant "
    "TimeWarner jumped 76% to $1.13bn (£600m) for the three months to "
    "December, from $639m year-earlier."
)
pprint(mlm(text))

[{'score': 0.346161425113678,
  'sequence': 'Quarterly revenues at US media giant TimeWarner jumped 76% to '
              '$1.13bn (£600m) for the three months to December, from $639m '
              'year-earlier.',
  'token': 3883,
  'token_str': ' revenues'},
 {'score': 0.2683636248111725,
  'sequence': 'Quarterly earnings at US media giant TimeWarner jumped 76% to '
              '$1.13bn (£600m) for the three months to December, from $639m '
              'year-earlier.',
  'token': 1107,
  'token_str': ' earnings'},
 {'score': 0.19676564633846283,
  'sequence': 'Quarterly revenue at US media giant TimeWarner jumped 76% to '
              '$1.13bn (£600m) for the three months to December, from $639m '
              'year-earlier.',
  'token': 903,
  'token_str': ' revenue'},
 {'score': 0.0765751451253891,
  'sequence': 'Quarterly profits at US media giant TimeWarner jumped 76% to '
              '$1.13bn (£600m) for the three months to December, from $639m '
              'year-e

In [16]:
text = (
    "Quarterly profit at US media giant "
    "TimeWarner jumped 76% to $1.13bn (£600m) for the three <mask> to "
    "December, from $639m year-earlier."
)
pprint(mlm(text))

[{'score': 0.5059128999710083,
  'sequence': 'Quarterly profit at US media giant TimeWarner jumped 76% to '
              '$1.13bn (£600m) for the three months to December, from $639m '
              'year-earlier.',
  'token': 377,
  'token_str': ' months'},
 {'score': 0.4703284800052643,
  'sequence': 'Quarterly profit at US media giant TimeWarner jumped 76% to '
              '$1.13bn (£600m) for the three quarters to December, from $639m '
              'year-earlier.',
  'token': 5666,
  'token_str': ' quarters'},
 {'score': 0.015661416575312614,
  'sequence': 'Quarterly profit at US media giant TimeWarner jumped 76% to '
              '$1.13bn (£600m) for the three years to December, from $639m '
              'year-earlier.',
  'token': 107,
  'token_str': ' years'},
 {'score': 0.0035203234292566776,
  'sequence': 'Quarterly profit at US media giant TimeWarner jumped 76% to '
              '$1.13bn (£600m) for the three quarter to December, from $639m '
              'year-earli

In [17]:
text = (
    "Quarterly profit at US media giant "
    "TimeWarner <mask> 76% to $1.13bn (£600m) for the three months to "
    "December, from $639m year-earlier."
)
pprint(mlm(text))

[{'score': 0.37313273549079895,
  'sequence': 'Quarterly profit at US media giant TimeWarner rose 76% to '
              '$1.13bn (£600m) for the three months to December, from $639m '
              'year-earlier.',
  'token': 1458,
  'token_str': ' rose'},
 {'score': 0.12948235869407654,
  'sequence': 'Quarterly profit at US media giant TimeWarner fell 76% to '
              '$1.13bn (£600m) for the three months to December, from $639m '
              'year-earlier.',
  'token': 1064,
  'token_str': ' fell'},
 {'score': 0.06897484511137009,
  'sequence': 'Quarterly profit at US media giant TimeWarner soared 76% to '
              '$1.13bn (£600m) for the three months to December, from $639m '
              'year-earlier.',
  'token': 14622,
  'token_str': ' soared'},
 {'score': 0.05553628131747246,
  'sequence': 'Quarterly profit at US media giant TimeWarner slumped 76% to '
              '$1.13bn (£600m) for the three months to December, from $639m '
              'year-earlier.',
  

In [18]:
text = (
    "Quarterly profit at US <mask> giant "
    "TimeWarner jumped 76% to $1.13bn (£600m) for the three months to "
    "December, from $639m year-earlier."
)
pprint(mlm(text))

[{'score': 0.23745384812355042,
  'sequence': 'Quarterly profit at US telecom giant TimeWarner jumped 76% to '
              '$1.13bn (£600m) for the three months to December, from $639m '
              'year-earlier.',
  'token': 9146,
  'token_str': ' telecom'},
 {'score': 0.13893131911754608,
  'sequence': 'Quarterly profit at US tech giant TimeWarner jumped 76% to '
              '$1.13bn (£600m) for the three months to December, from $639m '
              'year-earlier.',
  'token': 2903,
  'token_str': ' tech'},
 {'score': 0.11504440009593964,
  'sequence': 'Quarterly profit at US media giant TimeWarner jumped 76% to '
              '$1.13bn (£600m) for the three months to December, from $639m '
              'year-earlier.',
  'token': 433,
  'token_str': ' media'},
 {'score': 0.07908613234758377,
  'sequence': 'Quarterly profit at US telecommunications giant TimeWarner '
              'jumped 76% to $1.13bn (£600m) for the three months to December, '
              'from $639m y

In [19]:
text = (
    "Quarterly profit at US media giant "
    "TimeWarner jumped 76% to $1.13bn (£600m) for the three months to "
    "December, <mask> $639m year-earlier."
)
pprint(mlm(text))

[{'score': 0.29505041241645813,
  'sequence': 'Quarterly profit at US media giant TimeWarner jumped 76% to '
              '$1.13bn (£600m) for the three months to December, versus $639m '
              'year-earlier.',
  'token': 4411,
  'token_str': ' versus'},
 {'score': 0.11734861880540848,
  'sequence': 'Quarterly profit at US media giant TimeWarner jumped 76% to '
              '$1.13bn (£600m) for the three months to December, topping $639m '
              'year-earlier.',
  'token': 11744,
  'token_str': ' topping'},
 {'score': 0.04699510335922241,
  'sequence': 'Quarterly profit at US media giant TimeWarner jumped 76% to '
              '$1.13bn (£600m) for the three months to December, hitting $639m '
              'year-earlier.',
  'token': 3022,
  'token_str': ' hitting'},
 {'score': 0.04089583083987236,
  'sequence': 'Quarterly profit at US media giant TimeWarner jumped 76% to '
              '$1.13bn (£600m) for the three months to December, from $639m '
              'y
