In [None]:
import spacy
import pandas as pd
from transformers import pipeline

# Text classification

Human readers understand many aspects of natural language through context and wider knowledge, but computers can miss these on a first reading. Luckily, there are many pre-trained classification tools that can label text automatically.

Once again, we'll work with an example dataset of news articles about `climate change` or `global warming`.

In [None]:
df_news = pd.read_json('data/cc_gw_news_blogs_2021-10-01_2021-10-31.json')

We're going to start with the `spacy` library for our initial classification purposes as it automatically includes several useful functions for labelling text. Here we're going to define a simple pipeline over English using the default settings and apply it to the first article in the dataset.

In [None]:
nlp = spacy.load("en_core_web_sm")
doc = nlp(df_news.body.iloc[0])
doc[:100]

Applying this pipeline does a few useful things, including parsing the text into a series of tokens (typically words). We've seen the first 100 tokens above.

As mentioned before, this pipeline has applied a few standard classification methods to the tokens in the text. Let's now look at part-of-speech tagging, which labels words by the role they play in the sentence. Here is a [list of the labels](https://universaldependencies.org/u/pos/) that spacy applies.

In [None]:
for token in doc[:10]:
    print(token,token.pos_)

Another component of the default spacy pipeline is named entity recognition (NER). NER looks for specific parts of the text that refer to people, places and other important objects. You may find a [list of standard classes](https://dataknowsall.com/blog/ner.html) useful.

In [None]:
for e in doc.ents[:10]:
    print(e.text,e.label_)

You'll notice that the entities found by NER are more specific than the full set of tokens seen through part-of-speech tagging, recognising key countries, people and dates.

It should be noted that these two methods are not perfect solutions given the messiness of natural language. Typographic errors and other spelling mistakes can cause these methods to fail - see the following example, where we replace all `a`s with `e`s in the text.

In [None]:
doc2 = nlp(df_news.body.iloc[0].replace('a','e'))
for e in doc2.ents[:10]:
    print(e.text,e.label_)

In this case we're seeing a few kinds of errors that illustrate a little more about how the methods work. The model was still able to recognise `Justin Trudeeu` as a `PERSON` based on how it appears in the text. `Cenede` wasn't picked up at all as an entity - this is because place names are typically recognised based on a list (and hence `Sundey` being registered as such).

In other cases, you may find that the same entity text is assigned different labels in different contexts. Sometimes this makes sense: for example, `Trump` can refer to both a person and an organisation. Sometimes it is simply a labelling error. How to handle this is best decided on a case-by-case basis, but you will usually find that one entity type accounts for the majority of uses.
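One way to act on the majority usage is to count the labels each entity text receives and keep the most frequent one. The sketch below is only illustrative: the `entity_pairs` list is made-up sample data standing in for `(e.text, e.label_)` pairs collected from `doc.ents`, and `majority_labels` is a hypothetical helper, not part of spacy.

```python
from collections import Counter, defaultdict

# Hypothetical (text, label) pairs, standing in for
# [(e.text, e.label_) for e in doc.ents] gathered over many articles.
entity_pairs = [
    ("Trump", "PERSON"),
    ("Trump", "PERSON"),
    ("Trump", "ORG"),
    ("Canada", "GPE"),
    ("Canada", "GPE"),
]


def majority_labels(pairs):
    """Map each entity text to the label it receives most often."""
    counts = defaultdict(Counter)
    for text, label in pairs:
        counts[text][label] += 1
    # most_common(1) returns a one-item list: [(label, count)]
    return {text: c.most_common(1)[0][0] for text, c in counts.items()}


resolved = majority_labels(entity_pairs)
print(resolved)
```

If two labels are tied, `Counter.most_common` keeps the one that was seen first, so a tie may deserve a manual check rather than an automatic choice.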

One other point to be aware of with large corpora is the speed of spacy models. If you use the default pipeline, you will often include many components that are not necessary for your purposes. You can easily customise the pipeline to include only the components you are interested in. More information can be found in the [pipeline documentation](https://spacy.io/usage/processing-pipelines).

In [None]:
# nlp.pipe expects an iterable of texts (not a single string) and yields one Doc per text
docs = list(nlp.pipe(df_news.title, disable=["tok2vec", "tagger", "parser", "attribute_ruler", "lemmatizer"]))

# Sentiment analysis

Another common means of labelling and classifying text is through sentiment analysis. This technique uses pre-trained models to rate text as positive, neutral or negative. There are many different methods for this, using a range of different approaches. Some are based on sets of words with known valence, while others leverage word embeddings.
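To make the word-list approach concrete, here is a minimal sketch of a lexicon-based scorer. It is not one of the pre-trained tools used in this notebook: the `VALENCE` dictionary and its scores are invented for illustration, and `lexicon_sentiment` is a hypothetical helper. Real lexicons are far larger and usually handle negation and intensifiers as well.

```python
import re

# Toy valence lexicon: these words and scores are invented for illustration
# and are not taken from any published sentiment lexicon.
VALENCE = {
    "love": 2,
    "great": 1,
    "good": 1,
    "bad": -1,
    "annoying": -2,
    "terrible": -2,
}


def lexicon_sentiment(text):
    """Sum the valence of known words and map the total to a label."""
    words = re.findall(r"[a-z']+", text.lower())
    score = sum(VALENCE.get(word, 0) for word in words)
    if score > 0:
        return "positive", score
    if score < 0:
        return "negative", score
    return "neutral", score


print(lexicon_sentiment("I love this!"))
print(lexicon_sentiment("You're really annoying me"))
print(lexicon_sentiment("The report was published on Monday"))
```

A scorer like this is fast and easy to inspect, but it ignores word order and context, which is one reason embedding-based models are often preferred.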

While there are methods that can be integrated into a spacy pipeline, this is a good opportunity to introduce [Hugging Face](https://huggingface.co/). Hugging Face is a repository of thousands of pretrained models for many different purposes, well integrated into Python through the `transformers` library. Let's try one built on the RoBERTa language model.

In [None]:
sentiment_analysis = pipeline("sentiment-analysis",model="siebert/sentiment-roberta-large-english")
print(sentiment_analysis("I love this!"))
print(sentiment_analysis("You're really annoying me"))

Classifiers can do more than determine whether a text is positive or negative. For example, they can label texts as relevant or not to a given topic. Take [EnvironmentalBERT](https://huggingface.co/ESGBERT/EnvironmentalBERT-environmental), which determines whether a text is about environmental concerns and ESG.

In [None]:
from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline
 
tokenizer_name = "ESGBERT/EnvironmentalBERT-environmental"
model_name = "ESGBERT/EnvironmentalBERT-environmental"
 
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(tokenizer_name, max_len=512)
 
pipe = pipeline("text-classification", model=model, tokenizer=tokenizer)

print(pipe("Scope 1 emissions are reported here on a like-for-like basis against the 2013 baseline and exclude emissions from additional vehicles used during repairs.", padding=True, truncation=True))
print(pipe("I hope England win the 2024 Euros.", padding=True, truncation=True))

The wide range of models available on Hugging Face means that there is often already a tool for a given purpose, so you do not need to train a bespoke model. The only catch is that anyone can train and upload a model to Hugging Face, so check the model page carefully to understand the training data and what the outputs mean.

# Exercises

Find the 10 most common entities across the first 1000 articles. Hint: you may find the `collections` library useful.

Different part-of-speech tags will appear with different frequencies. Find the articles among the first 1000 articles that have the highest and lowest proportion of proper nouns in their bodies.

News media is sometimes criticised as focused on negative stories. How are the sentiment labels and class probabilities split over the first 1000 headlines? Does this trend tell you anything about the classes or your chosen model?