In [1]:
import numpy as np
import pandas as pd
import textwrap
from pprint import pprint

from transformers import pipeline



In [5]:
# create a DataFrame with the bbc_text_cls.csv file as its source
df = pd.read_csv("bbc_text_cls.csv")

In [6]:
# display the first rows of df to see what it looks like
df.head()

Unnamed: 0,text,labels
0,Ad sales boost Time Warner profit\n\nQuarterly...,business
1,Dollar gains on Greenspan speech\n\nThe dollar...,business
2,Yukos unit buyer faces loan claim\n\nThe owner...,business
3,High fuel prices hit BA's profits\n\nBritish A...,business
4,Pernod takeover talk lifts Domecq\n\nShares in...,business


In [7]:
# print the unique labels in df to see which categories it contains
print(df['labels'].unique())

['business' 'entertainment' 'politics' 'sport' 'tech']


In [9]:
# select the "business" label and assign it to the label variable
label = 'business'
# df['labels'] == label builds a boolean mask: each entry is True if the
# corresponding row in df['labels'] equals the value in label.
# Indexing df with that mask keeps only the matching rows, and ['text'] then
# selects the text column, so texts is a pandas Series (not a new DataFrame).
texts = df[df['labels'] == label]['text']
# display the first rows of the filtered Series
texts.head()


0    Ad sales boost Time Warner profit\n\nQuarterly...
1    Dollar gains on Greenspan speech\n\nThe dollar...
2    Yukos unit buyer faces loan claim\n\nThe owner...
3    High fuel prices hit BA's profits\n\nBritish A...
4    Pernod takeover talk lifts Domecq\n\nShares in...
Name: text, dtype: object
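
The mask-then-select step above can be tried on a small in-memory table. This is only a sketch: the `toy` DataFrame and its rows are made up for illustration and are not taken from bbc_text_cls.csv. It also shows that the filtered Series keeps the original row labels, which is why a later positional lookup uses `.iloc` rather than `.loc`.

```python
import pandas as pd

# Hypothetical stand-in for the BBC data: same column names, invented rows.
toy = pd.DataFrame({
    "text": ["Profits rise", "Cup final preview", "Shares slip", "New phone launched"],
    "labels": ["business", "sport", "business", "tech"],
})

# Boolean mask: True where the label matches.
mask = toy["labels"] == "business"

# .loc with a mask and a column name does the row filter and the
# column selection in a single step.
business_texts = toy.loc[mask, "text"]

print(mask.tolist())
print(business_texts)

# The original index labels (0 and 2) survive the filter, so the second
# matching row is position 1 but label 2.
second_by_position = business_texts.iloc[1]
```

Using `.loc[mask, "text"]` is equivalent here to chaining `toy[mask]["text"]`; it just avoids building the intermediate filtered DataFrame.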

In [15]:
# set the random seed so the results can be replicated:
# the random number functions will then generate the same sequence of numbers each time the code is run
np.random.seed(1234)
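
A quick way to see what the seed does is to reseed and draw twice. This is a minimal sketch; the range of 10 and the sample size of 5 are arbitrary values chosen for illustration.

```python
import numpy as np

# Seed, then draw five integers from 0..9.
np.random.seed(1234)
first = np.random.choice(10, size=5)

# Reseeding with the same value restarts the sequence,
# so the same five integers come out again.
np.random.seed(1234)
second = np.random.choice(10, size=5)

# Drawing again WITHOUT reseeding continues the sequence instead of
# restarting it, so these values are generally different.
third = np.random.choice(10, size=5)

print(first, second, third)
```

In a notebook this means that rerunning a sampling cell without rerunning the seed cell will usually select a different row.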

In [19]:
# randomly select a position i among the rows of texts;
# np.random.choice(n) returns a random integer from 0 to n - 1
length_of_col = len(texts)
i = np.random.choice(length_of_col)
# fetch the document at position i and assign it to the doc variable
doc = texts.iloc[i]
# print the result
print(doc)

Business confidence dips in Japan

Business confidence among Japanese manufacturers has weakened for the first time since March 2003, the quarterly Tankan survey has found.

Slower economic growth, rising oil prices, a stronger yen and weaker exports were blamed for the fall. December's confidence level was below that seen in September, the Bank of Japan said. However, September's reading was the strongest for 13 years. "The economy is at a pause but unlikely to fall", the economy minister said. "It will feel a bit slower (next year) than this year, and growth may be a bit more gentle but the situation is that the recovery will continue," said economy minister Heizo Takenaka. In the Bank of Japan's December survey, the balance of big manufacturers saying business conditions are better, minus those saying they are worse, was 22, down from 26 in September.

Japan's economy grew by just 0.1% in the three months to September, according revised data issued this month. With the recovery slow

In [23]:
# print doc wrapped to the default width of 70 characters;
# replace_whitespace=False keeps the original newlines instead of turning them into spaces,
# and fix_sentence_endings=True puts two spaces after each sentence
print(textwrap.fill(doc, replace_whitespace=False, fix_sentence_endings=True))

Orange colour clash set for court

A row over the colour orange could
hit the courts after mobile phone giant Orange launched action against
a new mobile venture from Easyjet's founder.

Orange said it was
starting proceedings against the Easymobile service for trademark
infringement.  Easymobile uses Easygroup's orange branding.  Founder
Stelios Haji-Ioannou has pledged to contest the action.  The move
comes after the two sides failed to come to an agreement after six
months of talks.  Orange claims the new low-cost mobile service has
infringed its rights regarding the use of the colour orange and could
confuse customers - known as "passing off".

"Our brand, and the
rights associated with it are extremely important to us," Orange said
in a statement.  "In the absence of any firm commitment from Easy, we
have been left with no choice but to start an action for trademark
infringement and passing off."  However, Mr Haji-Ioannou, who plans to
launch Easymobile next month, vowed to fight 
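
In the wrapped output above, some lines break early (for example after "orange could") because `textwrap.fill` counts the preserved newlines as ordinary characters. One possible alternative, sketched below, is to wrap each paragraph separately and rejoin them. The helper name `wrap_paragraphs` is made up for this example, and the sample string is a shortened copy of the article text.

```python
import textwrap


def wrap_paragraphs(text, width=70):
    """Wrap every blank-line-separated paragraph on its own."""
    paragraphs = text.split("\n\n")
    wrapped = [
        textwrap.fill(p, width=width, fix_sentence_endings=True)
        for p in paragraphs
    ]
    # Put the blank lines back between the wrapped paragraphs.
    return "\n\n".join(wrapped)


sample = (
    "Orange colour clash set for court\n\n"
    "A row over the colour orange could hit the courts after mobile phone "
    "giant Orange launched action against a new mobile venture from "
    "Easyjet's founder."
)

wrapped = wrap_paragraphs(sample)
print(wrapped)
```

With this approach every line is filled up to the width limit, and the paragraph breaks of the article are kept.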

In [24]:
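# build a fill-mask (masked language modelling) pipeline;
# no model name is passed, so the library falls back to its default model
mlm = pipeline('fill-mask')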

No model was supplied, defaulted to distilroberta-base and revision ec58a5b (https://huggingface.co/distilroberta-base).
Using a pipeline without specifying a model name and revision in production is not recommended.


In [30]:
# ask the model to fill in the <mask> token; the result lists the top candidates with their scores
mlm('Bombardier chief to leave <mask>')

[{'score': 0.06950822472572327,
  'token': 633,
  'token_str': ' job',
  'sequence': 'Bombardier chief to leave job'},
 {'score': 0.0669306293129921,
  'token': 1470,
  'token_str': ' France',
  'sequence': 'Bombardier chief to leave France'},
 {'score': 0.05273548886179924,
  'token': 558,
  'token_str': ' office',
  'sequence': 'Bombardier chief to leave office'},
 {'score': 0.025822913274168968,
  'token': 2201,
  'token_str': ' Paris',
  'sequence': 'Bombardier chief to leave Paris'},
 {'score': 0.0213684793561697,
  'token': 896,
  'token_str': ' Canada',
  'sequence': 'Bombardier chief to leave Canada'}]
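
The result is a plain list of dictionaries, so it can be processed with ordinary Python. The sketch below copies the scores and token strings from the output above into a literal list (so it runs without downloading the model) and pulls out the best candidate. The variable names are chosen for illustration only.

```python
# Scores and tokens copied from the fill-mask output shown above.
predictions = [
    {"score": 0.06950822472572327, "token_str": " job"},
    {"score": 0.0669306293129921, "token_str": " France"},
    {"score": 0.05273548886179924, "token_str": " office"},
    {"score": 0.025822913274168968, "token_str": " Paris"},
    {"score": 0.0213684793561697, "token_str": " Canada"},
]

# Highest-scoring candidate.
best = max(predictions, key=lambda p: p["score"])

# token_str carries a leading space, so strip it before display.
candidates = [(p["token_str"].strip(), round(p["score"], 3)) for p in predictions]

# Probability mass covered by the five listed candidates.
total = sum(p["score"] for p in predictions)

print(best["token_str"].strip())
print(candidates)
print(round(total, 3))
```

The five candidates together account for less than a quarter of the probability mass, which suggests the model is far from certain about this headline.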

In [33]:
# adjacent string literals inside parentheses are joined automatically
text = (
    "A row over the colour <mask> could "
    "hit the courts after mobile phone giant Orange launched action against "
    "a new mobile venture from Easyjet's founder."
)

mlm(text)

[{'score': 0.6035970449447632,
  'token': 3552,
  'token_str': ' scheme',
  'sequence': "A row over the colour scheme could hit the courts after mobile phone giant Orange launched action against a new mobile venture from Easyjet's founder."},
 {'score': 0.04088282212615013,
  'token': 14284,
  'token_str': ' codes',
  'sequence': "A row over the colour codes could hit the courts after mobile phone giant Orange launched action against a new mobile venture from Easyjet's founder."},
 {'score': 0.0332195945084095,
  'token': 24943,
  'token_str': ' palette',
  'sequence': "A row over the colour palette could hit the courts after mobile phone giant Orange launched action against a new mobile venture from Easyjet's founder."},
 {'score': 0.01915041357278824,
  'token': 3260,
  'token_str': ' code',
  'sequence': "A row over the colour code could hit the courts after mobile phone giant Orange launched action against a new mobile venture from Easyjet's founder."},
 {'score': 0.012898219749331