In [19]:
# https://www.bbc.com/news/articles/c8jy2dpv722o
article = '''
Spain fines budget airlines including Ryanair €179m

Spain has fined five budget airlines a total of €179m (£149m) for "abusive practices" including charging for hand luggage.

Ryanair has been given the largest fine of €108m (£90m), followed by EasyJet's penalty of €29m (£24m).

Vueling, Norwegian and Volotea were issued with sanctions by Spain's Consumer Rights Ministry on Friday.

The ministry said it plans to ban practices such as charging extra for carry-on hand luggage and reserving seats for children.

The fines are the biggest sanction issued by the ministry, and follow an investigation into the budget airline industry.

The ministry said it had upheld fines that were first announced in May after dismissing appeals lodged by the companies.

Vueling, the budget arm of British Airways owner IAG, has been fined €39m (£32m), while Norwegian Airlines and Volotea have been fined €1.6m (£1.3m) and €1.2m (£1m) respectively.

The fines were issued because the airlines were found to have provided misleading information and were not transparent with prices, "which hinders consumers' ability to compare offers" and make informed decisions, the ministry said.

Ryanair was accused of violating a range of consumer rights, including charging for larger carry-on luggage, seat selection, and asking for "a disproportionate amount" to print boarding passes at terminals.

Each fine was calculated based on the "illicit profit" obtained by each airline from these practices.

Ryanair boss Michael O'Leary said the fines were "illegal" and "baseless", adding that he will appeal the case and take it to the EU courts.

"Ryanair has for many years used bag fees and airport check-in fees to change passenger behaviour and we pass on these cost savings in the form of lower fares to consumers," he said.

Easyjet and Norwegian said they would also appeal the decision.

The Spanish airline industry watchdog, ALA, plans a further appeal and has called the ministry's decision "nonsense", arguing the fine infringes EU free market rules.

But Andrés Barragán, secretary general for consumer affairs and gambling at the ministry, defended the fines, saying the government's decision was based on Spanish and EU law.

"It is an abuse to charge €20 for just printing the boarding card in the airport, [it's] something no one wants," he told the BBC's World Business Report programme.

"This is a problem consumers are facing not only in Spain but in other EU countries."

Consumer rights association Facua, which has campaigned against the fees for six years, said the decision was "historic".
'''

In [20]:
import joblib
import pickle
import re
import numpy as np
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

from tensorflow.keras.models import load_model
from tensorflow.keras.preprocessing.sequence import pad_sequences

from transformers import pipeline

!pip install sumy
from sumy.summarizers.text_rank import TextRankSummarizer
from sumy.summarizers.lsa import LsaSummarizer
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer



In [21]:
tdl_model = joblib.load('svm_model.pkl')
tdl_vectorizer = joblib.load('tfidf_vectorizer.pkl')

In [22]:
nltk.download('stopwords')

def stop_words():
  all_stopwords = stopwords.words('english')
  all_stopwords.remove('not')

  return all_stopwords

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [23]:
def clean_stem_text(text):
  # replace every non-alphabetic character with a space
  cleaned_text = re.sub('[^a-zA-Z]', ' ', text)

  # convert uppercase characters to lowercase
  cleaned_text = cleaned_text.lower()

  # split text into words
  tokens = cleaned_text.split()

  # stem each word, skipping stopwords (the set is built once, not per word)
  ps = PorterStemmer()
  all_stopwords = set(stop_words())
  stemmed_text = [ps.stem(word) for word in tokens
                  if word not in all_stopwords]
  # join the stemmed words into a single space-separated string
  stemmed_text = ' '.join(stemmed_text)

  return stemmed_text

In [97]:
normalized_article = clean_stem_text(article)
vectorized_article = tdl_vectorizer.transform([normalized_article]).toarray()
probabilities = tdl_model.predict_proba(vectorized_article)[0]

# Find the index of the highest probability
max_index = np.argmax(probabilities)

# Find the indices of all probabilities within 0.01 of the maximum
# (this always includes max_index itself)
threshold = 0.01
close_indices = np.where(np.abs(probabilities - probabilities[max_index]) <= threshold)[0]

# close_indices is a NumPy array, so `[max_index] + close_indices` would add
# element-wise instead of concatenating; use the indices directly
prediction_indices = close_indices

# Map the predicted indices to colour-tagged category labels
categories = [':red[business]', ':orange[entertainment]', ':green[politics]', ':blue[sport]', ':violet[tech]']
predicted_categories = [f'{categories[i]} (Confidence: {probabilities[i]:.2%})' for i in prediction_indices]

In [98]:
', '.join(predicted_categories)

':red[business] (Confidence: 83.16%)'
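
The near-tie selection in the cell above can be sketched as a small standalone helper. This is only an illustration: `select_near_top` is a hypothetical name that does not appear in the notebook, and the probabilities below are made up rather than produced by the SVM.

```python
import numpy as np

def select_near_top(probabilities, labels, threshold=0.01):
    """Return (label, probability) pairs within `threshold` of the top score."""
    probabilities = np.asarray(probabilities, dtype=float)
    top = probabilities.max()
    # Indices whose probability is within `threshold` of the maximum.
    # The arg-max always satisfies this, so no separate merge step is needed.
    close = np.flatnonzero(top - probabilities <= threshold)
    # Order the selected indices from most to least probable for display.
    close = close[np.argsort(-probabilities[close], kind='stable')]
    return [(labels[i], float(probabilities[i])) for i in close]

labels = ['business', 'entertainment', 'politics', 'sport', 'tech']

# Made-up probabilities: 'politics' and 'tech' are tied for first place.
tied = [0.05, 0.02, 0.46, 0.01, 0.46]
print(select_near_top(tied, labels))

# Made-up probabilities with one clear winner.
clear = [0.80, 0.10, 0.05, 0.03, 0.02]
print(select_near_top(clear, labels))
```

Returning plain tuples keeps the selection logic separate from the `:color[...]` markup, which can then be applied in a single formatting step.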

In [26]:
# Load the saved model, tokenizer, and label encoder
dpl_model = load_model("cnn_model.keras")

with open("cnn_tokenizer.pkl", "rb") as handle:
    dpl_tokenizer = pickle.load(handle)

with open("cnn_label_encoder.pkl", "rb") as handle:
    dpl_label_encoder = pickle.load(handle)

In [114]:
max_length = 200

sequence = dpl_tokenizer.texts_to_sequences([article])
padded_sequence = pad_sequences(sequence, maxlen=max_length, padding='post')

# Predict the category probabilities
probabilities = dpl_model.predict(padded_sequence)[0]

# Find the index of the highest probability
max_index = np.argmax(probabilities)

# Find the indices of all probabilities within 0.01 of the maximum
# (this always includes max_index itself)
threshold = 0.01
close_indices = np.where(np.abs(probabilities - probabilities[max_index]) <= threshold)[0]

# close_indices is a NumPy array, so `[max_index] + close_indices` would add
# element-wise instead of concatenating; use the indices directly
prediction_indices = close_indices

colors = {
    'business': 'red',
    'entertainment': 'orange',
    'politics': 'green',
    'sport': 'blue',
    'tech': 'violet'
}

predicted_categories =  [f'{dpl_label_encoder.inverse_transform([i])[0]} (Confidence: {probabilities[i]:.2%})'  for i in prediction_indices]
colored_predictions =  [f':{colors[predicted_category.split()[0]]}[{predicted_category}]'  for predicted_category in predicted_categories]


1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 21ms/step


In [115]:
', '.join(colored_predictions)
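
The padding step in the CNN cell above uses `padding='post'` and leaves `truncating` at its Keras default of `'pre'`. If the tokenized article is longer than `max_length`, only its last 200 token ids are kept. The NumPy sketch below mimics that behaviour for a single sequence. `pad_post` is a hypothetical helper written for illustration, not part of Keras, and the token ids are made up.

```python
import numpy as np

def pad_post(sequence, maxlen, truncating='pre', value=0):
    """Mimic pad_sequences(..., padding='post') for one list of token ids."""
    seq = list(sequence)
    if len(seq) > maxlen:
        # 'pre' drops ids from the start; 'post' drops them from the end.
        seq = seq[-maxlen:] if truncating == 'pre' else seq[:maxlen]
    # Start from a row of padding values and copy the ids to the front.
    padded = np.full(maxlen, value, dtype='int32')
    padded[:len(seq)] = seq
    return padded

short_ids = [7, 3, 9]            # shorter than maxlen: zeros are appended
long_ids = list(range(1, 9))     # longer than maxlen: ids must be dropped

print(pad_post(short_ids, maxlen=5))
print(pad_post(long_ids, maxlen=5))                      # keeps the last 5 ids
print(pad_post(long_ids, maxlen=5, truncating='post'))   # keeps the first 5 ids
```

For news text, where the opening paragraphs usually carry the topic, passing `truncating='post'` to `pad_sequences` may be worth trying, provided it matches how the model was trained.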



In [30]:
pretrained_summarizer = pipeline("summarization", model="google/pegasus-multi_news")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Some weights of PegasusForConditionalGeneration were not initialized from the model checkpoint at google/pegasus-multi_news and are newly initialized: ['model.decoder.embed_positions.weight', 'model.encoder.embed_positions.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



In [31]:
pretrained_summarizer = pipeline("summarization", model="facebook/bart-large-cnn")


In [32]:
summary = pretrained_summarizer(article, max_length=150, min_length=30, do_sample=False)

In [33]:
summary[0]['summary_text']

"Spain fines budget airlines including Ryanair €179m (£149m) for 'abusive practices' Ryanair has been given the largest fine of €108m (£90m), followed by EasyJet's penalty of €29m (£24m) Vueling, Norwegian and Volotea were issued with sanctions by Spain's Consumer Rights Ministry."

In [34]:
nltk.download('punkt_tab')

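def text_rank_summary(doc):
    # Graph-based extractive summary (TextRank)
    summarizer = TextRankSummarizer()
    # Skip the first line of the text (blank for `article`, which opens with a newline)
    parser = PlaintextParser.from_string(doc.split("\n", 1)[1], Tokenizer("english"))
    # Keep the three highest-ranked sentences
    summary = summarizer(parser.document, sentences_count=3)

    # Concatenate the sentences, each followed by a space
    sentence = ''
    for s in summary:
        sentence += str(s) + ' '

    return sentence


def lsa_summary(doc):
    # Extractive summary based on latent semantic analysis (LSA)
    summarizer = LsaSummarizer()
    # Skip the first line of the text (blank for `article`, which opens with a newline)
    parser = PlaintextParser.from_string(doc.split("\n", 1)[1], Tokenizer("english"))
    # Keep the three highest-ranked sentences
    summary = summarizer(parser.document, sentences_count=3)

    # Concatenate the sentences, each followed by a space
    sentence = ''
    for s in summary:
        sentence += str(s) + ' '

    return sentence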

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


In [35]:
text_rank_summary(article)

'The fines are the biggest sanction issued by the ministry, and follow an investigation into the budget airline industry. The fines were issued because the airlines were found to have provided misleading information and were not transparent with prices, "which hinders consumers\' ability to compare offers" and make informed decisions, the ministry said. But Andrés Barragán, secretary general for consumer affairs and gambling at the ministry, defended the fines, saying the government\'s decision was based on Spanish and EU law. '

In [36]:
lsa_summary(article)

'Vueling, Norwegian and Volotea were issued with sanctions by Spain\'s Consumer Rights Ministry on Friday. Ryanair was accused of violating a range of consumer rights, including charging for larger carry-on luggage, seat selection, and asking for "a disproportionate amount" to print boarding passes at terminals. "Ryanair has for many years used bag fees and airport check-in fees to change passenger behaviour and we pass on these cost savings in the form of lower fares to consumers," he said. '