# Workshop Use Case 1: Text Insights

Author: Simon van Baal

Date: 2024-04-04

In [1]:
# Load in the packages.

# This allows us to access Hugging Face models via an easy-to-use function.
from transformers import pipeline

# This helps us use data frames to manipulate and save data.
import pandas as pd

In [2]:
# Create a sentiment-analysis pipeline; we can call it like a function.
sentiment = pipeline('sentiment-analysis')
# If you don't specify a model, it will pick one for you!
# Here: distilbert/distilbert-base-uncased-finetuned-sst-2-english

# Run it on two sentences.
sentiment(['HuggingFace makes NLP easy!',
           'Coding is so difficult. Bleh.'])

# The output below shows a label (POSITIVE/NEGATIVE) for each sentence and a
# score, which is the model's confidence in that label. Here, the model is
# over 99% sure on both counts, but play around with the words and see what
# happens!



No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
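
The warning above recommends naming the model and revision explicitly, so that results do not change if the library's default changes. A minimal sketch of how that could look, reusing the model names and revision hashes printed in this notebook's own warnings. The names `PINNED_MODELS` and `load_pinned_pipeline` are illustrative and not part of the transformers library.

```python
# Model names and revisions copied from the "No model was supplied" warnings
# printed in this notebook. PINNED_MODELS and load_pinned_pipeline are
# illustrative names, not part of the transformers library.
PINNED_MODELS = {
    "sentiment-analysis": (
        "distilbert/distilbert-base-uncased-finetuned-sst-2-english", "af0f99b"),
    "zero-shot-classification": ("facebook/bart-large-mnli", "c626438"),
    "summarization": ("sshleifer/distilbart-cnn-12-6", "a4f8f3e"),
    "ner": ("dbmdz/bert-large-cased-finetuned-conll03-english", "f2482bf"),
}


def load_pinned_pipeline(task):
    """Build a pipeline for `task` with an explicit model and revision."""
    # Imported here so the dictionary above can be inspected without
    # loading transformers.
    from transformers import pipeline

    model, revision = PINNED_MODELS[task]
    return pipeline(task, model=model, revision=revision)


# Example (downloads the model on first use):
# sentiment = load_pinned_pipeline("sentiment-analysis")
```

Calling `load_pinned_pipeline("sentiment-analysis")` should behave like the unpinned call in the cell above, but without the warning.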



[{'label': 'POSITIVE', 'score': 0.9935070276260376},
 {'label': 'NEGATIVE', 'score': 0.9995618462562561}]
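
The pandas import at the top is there to manipulate and save data, so here is a small sketch of how the output above could be turned into a data frame. The sentences and results are copied from this notebook; the 0.99 threshold and the file name in the last comment are arbitrary examples.

```python
import pandas as pd

# The two input sentences and the output shown above, copied from this notebook.
texts = ['HuggingFace makes NLP easy!',
         'Coding is so difficult. Bleh.']
results = [{'label': 'POSITIVE', 'score': 0.9935070276260376},
           {'label': 'NEGATIVE', 'score': 0.9995618462562561}]

# Each dict becomes one row; add the original text as the first column.
df = pd.DataFrame(results)
df.insert(0, 'text', texts)

# Keep only the predictions the model is at least 99% confident about.
confident = df[df['score'] >= 0.99]

print(df)
# To save the table, something like df.to_csv('sentiment.csv', index=False)
# would work; the file name is just an example.
```

In your own analysis you would replace `results` with the return value of `sentiment(texts)`.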

In [3]:
# Zero-shot classification: we provide candidate labels and the model judges
# which one best fits a piece of text.
classifier = pipeline('zero-shot-classification')

# Let's see if the following sentence belongs under work, social or economic
# psychology :)
classifier("Workshops are so boring - hopefully this increases my productivity!",
           candidate_labels=["work", "social", "economic"])

# What do you think the output says?

No model was supplied, defaulted to facebook/bart-large-mnli and revision c626438 (https://huggingface.co/facebook/bart-large-mnli).
Using a pipeline without specifying a model name and revision in production is not recommended.



{'sequence': 'Workshops are so boring - hopefully this increases my productivity!',
 'labels': ['work', 'economic', 'social'],
 'scores': [0.7377646565437317, 0.20721063017845154, 0.05502470210194588]}
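
To read this output: `labels` and `scores` are parallel lists, sorted from the best-fitting label to the worst, so the model puts the sentence under "work" with a score of about 74%. The sketch below pairs them up; it only reuses the dictionary printed above, and the variable names are illustrative.

```python
# Output of the zero-shot classifier, copied from the cell above.
result = {
    'sequence': 'Workshops are so boring - hopefully this increases my productivity!',
    'labels': ['work', 'economic', 'social'],
    'scores': [0.7377646565437317, 0.20721063017845154, 0.05502470210194588],
}

# Labels and scores are parallel lists, sorted from best to worst fit.
label_scores = dict(zip(result['labels'], result['scores']))
top_label = result['labels'][0]

for label, score in label_scores.items():
    print(f"{label:>10}: {score:.1%}")

# In this output the scores are shared out over the candidate labels,
# so they add up to (roughly) 1.
total = sum(result['scores'])
```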

In [4]:
# If you have longer texts, you may want a short summary of what is in them.

summarizer = pipeline("summarization")

# Exercise: see if you can make it work on your own, following the pattern
# used in the cells above.


No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.
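
One possible way to make the summarizer work, if you want to compare with your own attempt. This is a sketch rather than the workshop's official answer: the helper name `summarize_texts` and the length limits are assumptions chosen for illustration.

```python
def summarize_texts(summarizer, texts, max_length=60, min_length=10):
    """Run a summarization pipeline and return only the summary strings.

    `summarizer` is the object created by pipeline("summarization");
    the length limits are counted in tokens and are example values.
    """
    outputs = summarizer(texts, max_length=max_length,
                         min_length=min_length, do_sample=False)
    # The pipeline returns one dict per input text, with the summary
    # stored under the key 'summary_text'.
    return [output['summary_text'] for output in outputs]


# Example, using the summarizer defined above and a text of your choice:
# long_text = "..."  # paste a paragraph or two here
# print(summarize_texts(summarizer, [long_text]))
```

Passing the pipeline in as an argument keeps the helper independent of which summarization model you load.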



In [5]:
# Or perhaps you want to know whether there are any named entities in a text.
# Then you could play around with this:
name_finder = pipeline("ner")

# It can find persons, organisations, places and miscellaneous entities.


No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision f2482bf (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.



Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
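
To play around with `name_finder`, it helps to group what it finds by entity type. A minimal sketch, assuming the usual output format of the `ner` pipeline (a list of dicts with a `word` and either an `entity` or an `entity_group` key). The helper `group_entities` is an illustrative name, and the example input is hand-written rather than real model output.

```python
from collections import defaultdict


def group_entities(entities):
    """Group NER results by entity type, e.g. {'PER': ['Simon']}."""
    grouped = defaultdict(list)
    for entity in entities:
        # With aggregation_strategy="simple" the type is under 'entity_group';
        # without it, each token has an 'entity' tag such as 'I-PER'.
        tag = entity.get('entity_group') or entity['entity']
        # Strip the B-/I- prefix that marks the position within an entity.
        entity_type = tag.split('-')[-1]
        grouped[entity_type].append(entity['word'])
    return dict(grouped)


# Hand-written input in the pipeline's output format (not real model output;
# real results also carry 'score', 'start' and 'end' fields).
example = [
    {'entity_group': 'PER', 'word': 'Simon'},
    {'entity_group': 'ORG', 'word': 'Hugging Face'},
]
print(group_entities(example))

# Example with the real model (downloads it on first use):
# name_finder = pipeline("ner", aggregation_strategy="simple")
# print(group_entities(name_finder("Your sentence here.")))
```

Without `aggregation_strategy="simple"` the pipeline reports one entry per token, so multi-word names arrive in pieces; the helper still groups them by type.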

