<a href="https://colab.research.google.com/github/TyrealQ/AI-Conversation/blob/main/Basic_NLP/Basic_NLP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Overview**

This codebook collects code and step-by-step guides for fundamental Natural Language Processing (NLP) applications built on Hugging Face's Transformers library. It covers sentiment analysis, named entity recognition, and text summarization, and closes with a short introduction to topic modeling. Each section pairs brief explanations with code snippets that run pre-trained transformer models without any additional training. It is aimed at beginners and intermediate users who want a practical reference for applying these models.

Code authored by: **[Tyreal Qian](https://tyrealq.github.io/)**

## Install dependencies

In [None]:
# %%capture
!pip install transformers
!pip install datasets
!pip install tqdm
!pip install vaderSentiment
!pip install scikit-learn


In [None]:
# Use transformers to load pre-trained models for NLP tasks
from transformers import pipeline

# Use VADER for lexicon-based sentiment analysis
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

# Use datasets to easily load and preprocess NLP datasets
from datasets import load_dataset

# Use tqdm to display progress bars for loops and processing tasks
from tqdm import tqdm

# Use sklearn for evaluating model performance (accuracy, precision, recall, F1-score)
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

## **Sentiment Analysis**
Sentiment Analysis is an NLP task that determines the emotional tone of a text. Approaches include:

- Lexicon-based → Uses predefined word lists with sentiment scores (e.g., VADER, SentiWordNet).
- Machine Learning → Employs classifiers like Naïve Bayes or SVM trained on labeled sentiment data.
- Deep Learning → Utilizes models like LSTMs, CNNs, and Transformer-based architectures (e.g., BERT, RoBERTa) for more nuanced analysis.

Common categories:
- Positive → e.g., *Dr. Q is amazing!*
- Negative → e.g., *I hated Dr. Q.*
- Neutral → e.g., *Dr. Q was okay, nothing special.*
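
To make the lexicon-based approach concrete, here is a minimal sketch of the idea: sum per-word sentiment weights and map the total to a label. The word list, its weights, the `threshold` value, and the function name `toy_lexicon_sentiment` are all hypothetical and chosen for illustration; real lexicons such as VADER are far larger and also handle negation, intensifiers, and punctuation.

```python
# Toy lexicon: word -> sentiment weight (hypothetical values, not VADER's lexicon)
TOY_LEXICON = {
    "amazing": 3.0,
    "best": 3.0,
    "okay": 0.5,
    "hated": -3.0,
    "terrible": -3.0,
}


def toy_lexicon_sentiment(text, threshold=1.0):
    """Sum the weights of known words and map the total to a label."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    total = sum(TOY_LEXICON.get(w, 0.0) for w in words)
    if total >= threshold:
        return "positive"
    if total <= -threshold:
        return "negative"
    return "neutral"


for sentence in [
    "Dr. Q is amazing!",
    "I hated Dr. Q.",
    "Dr. Q was okay, nothing special.",
]:
    print(sentence, "->", toy_lexicon_sentiment(sentence))
```

Words missing from the lexicon contribute nothing to the total, which is why lexicon-based methods often fall back to "neutral" on informal or domain-specific text.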

### HF Pipeline

- [Hugging Face Text Classification Models](https://huggingface.co/models?pipeline_tag=text-classification&sort=trending)<br> A collection of models that can analyze and classify text. These models can determine if a review is positive or negative, detect spam, or recognize emotions in text.

- [Hugging Face Transformers Documentation](https://huggingface.co/docs/transformers/en/index)<br>A user-friendly guide for using Hugging Face's "Transformers" library. It provides pre-trained models for tasks like text understanding, translation, and more—helping developers apply AI without needing to train models from scratch.



In [None]:
# Load sentiment analysis models
pipe1 = pipeline("text-classification", model="cardiffnlp/twitter-roberta-base-sentiment-latest")
pipe2 = pipeline("text-classification", model="clapAI/modernBERT-base-multilingual-sentiment")
pipe3 = SentimentIntensityAnalyzer()

# Function to convert VADER scores to categorical labels
def vader_to_label(text):
    scores = pipe3.polarity_scores(text)
    compound = scores["compound"]
    label = "negative" if compound < -0.05 else "neutral" if -0.05 <= compound <= 0.05 else "positive"
    return {"label": label, "compound score": compound}


Some weights of the model checkpoint at cardiffnlp/twitter-roberta-base-sentiment-latest were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).



Device set to use cpu



Device set to use cpu


In [None]:
# EN: Dr. Q is the best!
# EN: Dr. Q is the bestttttt!
# ZH: Dr. Q 是最棒的！
# FR: Dr. Q est le meilleur !
# ES: ¡Dr. Q es el mejor!
# JA: Dr. Q は最高です！
# KO: Dr. Q가 최고예요!

text = "Dr. Q is the best!"

# Get results from each model
result1 = pipe1(text)
result2 = pipe2(text)
result3 = vader_to_label(text)

# Print results
print("RoBERTa:", result1)
print("modernBERT:", result2)
print("VADER:", result3)

RoBERTa: [{'label': 'positive', 'score': 0.9885984659194946}]
modernBERT: [{'label': 'positive', 'score': 0.8801090717315674}]
VADER: {'label': 'neutral', 'compound score': 0.0}


Dataset: [tyrealqian/Stadium_RoBERTa_eval](https://huggingface.co/datasets/tyrealqian/Stadium_RoBERTa_eval)

This dataset is hosted on HF and is designed for evaluating text classification models in the context of college football stadium reviews. It contains text data along with corresponding labels, making it useful for testing and benchmarking models.


In [None]:
# Load the dataset
ds = load_dataset("tyrealqian/Stadium_RoBERTa_eval")

# Check the dataset structure
print(ds)


DatasetDict({
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 500
    })
})


In [None]:
# Extract the 'text' column as a list
texts = ds["test"]["text"]

# Display first 10 samples
print(texts[:10])

['They pack them in there tight', 'biggest and baddest football stadium on the land', 'clean restrooms', 'Bring a water bottle - you can bring in an empty bottle and fill from the water fountain', 'Beaver stadium is a treat to watch a football game', 'As you approach the stadium, it is unimpressive and un-inspiring', 'The best place to be on a fall Saturday even in the rain', 'the players', 'The flea market is once a month, so there are only 1 or 2 months when you should not go', 'I have been asked many times by friends, some who are season ticket holders and others first-time game attendees, to offer up some examples of things to shoot in/around the stadium']


In [None]:
# Process Transformer models in batches with progress bars
def get_transformer_predictions(pipe, texts, model_name, batch_size=10):
    results = []
    with tqdm(total=len(texts), desc=f"Processing {model_name}", leave=True) as pbar:
        for i in range(0, len(texts), batch_size):
            batch = texts[i : i + batch_size]
            batch_results = pipe(batch)
            results.extend(batch_results)
            pbar.update(len(batch))
    return results

# Run models on the first 10 texts
texts_to_process = texts[:10]
results1 = get_transformer_predictions(pipe1, texts_to_process, "RoBERTa")
results2 = get_transformer_predictions(pipe2, texts_to_process, "modernBERT")

# Process VADER
results3 = []
with tqdm(total=len(texts_to_process), desc="Processing VADER") as pbar:
    for text in texts_to_process:
        results3.append(vader_to_label(text))
        pbar.update(1)

Processing RoBERTa: 100%|██████████| 10/10 [00:01<00:00,  5.45it/s]
Processing modernBERT: 100%|██████████| 10/10 [00:01<00:00,  5.21it/s]
Processing VADER: 100%|██████████| 10/10 [00:00<00:00, 5118.75it/s]


In [None]:
# Display results
print(results1)
print(results2)
print(results3)

[{'label': 'neutral', 'score': 0.7514070868492126}, {'label': 'negative', 'score': 0.8973892331123352}, {'label': 'neutral', 'score': 0.6642183065414429}, {'label': 'neutral', 'score': 0.8351931571960449}, {'label': 'positive', 'score': 0.9491836428642273}, {'label': 'negative', 'score': 0.8138198852539062}, {'label': 'positive', 'score': 0.9694348573684692}, {'label': 'neutral', 'score': 0.7940726280212402}, {'label': 'neutral', 'score': 0.5529196858406067}, {'label': 'neutral', 'score': 0.8566815853118896}]
[{'label': 'neutral', 'score': 0.5236952900886536}, {'label': 'negative', 'score': 0.457106351852417}, {'label': 'neutral', 'score': 0.5250077247619629}, {'label': 'positive', 'score': 0.48136577010154724}, {'label': 'neutral', 'score': 0.4832203686237335}, {'label': 'negative', 'score': 0.6712276339530945}, {'label': 'positive', 'score': 0.928337812423706}, {'label': 'positive', 'score': 0.7350209951400757}, {'label': 'positive', 'score': 0.5722312927246094}, {'label': 'neutral',

In [None]:
# Print results for comparison
for i, text in enumerate(texts_to_process):
    print(f"\nText: {text}")
    print(f"RoBERTa: {results1[i]['label']}; Confidence: {results1[i]['score']:.3f}")
    print(f"modernBERT: {results2[i]['label'].lower()}; Confidence: {results2[i]['score']:.3f}")
    print(f"VADER: {results3[i]['label']}; Compound: {results3[i]['compound score']:.3f}")
    print("-" * 60)


Text: They pack them in there tight
RoBERTa: neutral; Confidence: 0.751
modernBERT: neutral; Confidence: 0.524
VADER: neutral; Compound: 0.000
------------------------------------------------------------

Text: biggest and baddest football stadium on the land
RoBERTa: negative; Confidence: 0.897
modernBERT: negative; Confidence: 0.457
VADER: neutral; Compound: 0.000
------------------------------------------------------------

Text: clean restrooms
RoBERTa: neutral; Confidence: 0.664
modernBERT: neutral; Confidence: 0.525
VADER: positive; Compound: 0.402
------------------------------------------------------------

Text: Bring a water bottle - you can bring in an empty bottle and fill from the water fountain
RoBERTa: neutral; Confidence: 0.835
modernBERT: positive; Confidence: 0.481
VADER: negative; Compound: -0.202
------------------------------------------------------------

Text: Beaver stadium is a treat to watch a football game
RoBERTa: positive; Confidence: 0.949
modernBERT: neu

### Evaluation

To assess the performance of the models, we compute several key evaluation metrics: <br><br>

- Accuracy: Measures the proportion of correct predictions out of all predictions. It is calculated as:

  $$
  \text{Accuracy} = \frac{\text{Correct Predictions}}{\text{Total Predictions}}
  $$

- Precision: Indicates how many of the predicted positive (or negative/neutral) instances are actually correct. It is defined as:

  $$
  \text{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}}
  $$

- Recall: Measures the model's ability to correctly identify all relevant instances. It is given by:

  $$
  \text{Recall} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}}
  $$

- F1-score: The harmonic mean of precision and recall, balancing both measures. It is computed as:

  $$
  \text{F1-score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}
  $$

<br>Each of these metrics provides insights into the effectiveness of the sentiment models, with a higher score indicating better performance.
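
The formulas above can be checked by hand on a small example. The sketch below computes per-class precision, recall, and F1 in plain Python and then averages them weighted by class support, which is what `average='weighted'` requests from scikit-learn. The labels are hypothetical, the helper names are made up for illustration, and undefined ratios are set to 0 here, which is an assumption that differs from the `zero_division=1` setting used in this notebook.

```python
from collections import Counter


def per_class_metrics(y_true, y_pred, cls):
    """Precision, recall, and F1 for one class, treating it as the 'positive' class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == cls and p == cls)
    fp = sum(1 for t, p in pairs if t != cls and p == cls)
    fn = sum(1 for t, p in pairs if t == cls and p != cls)
    precision = tp / (tp + fp) if tp + fp else 0.0  # assumption: undefined -> 0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def weighted_metrics(y_true, y_pred):
    """Accuracy plus support-weighted precision, recall, and F1."""
    n = len(y_true)
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n
    precision = recall = f1 = 0.0
    for cls, support in Counter(y_true).items():
        p, r, f = per_class_metrics(y_true, y_pred, cls)
        weight = support / n
        precision += weight * p
        recall += weight * r
        f1 += weight * f
    return accuracy, precision, recall, f1


# Hypothetical labels: 0 = negative, 1 = neutral, 2 = positive
y_true = [0, 0, 1, 1, 1, 2, 2, 2, 2, 2]
y_pred = [0, 1, 1, 1, 1, 2, 2, 2, 1, 0]

acc, prec, rec, f1 = weighted_metrics(y_true, y_pred)
print(f"Accuracy={acc:.3f} Precision={prec:.3f} Recall={rec:.3f} F1={f1:.3f}")
```

Support-weighted recall always equals accuracy, because each class's recall is multiplied by that class's share of the data. This is why the Accuracy and Recall rows match in a weighted-average comparison.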


In [None]:
# Load a dataset
dataset = load_dataset("tyrealqian/Stadium_RoBERTa_eval")

# Use all texts in the test split (500 rows) and their ground-truth labels
test_texts = dataset["test"]["text"]
test_labels = dataset["test"]["label"]

In [None]:
# Display first 10 samples
print(test_texts[:10])
print(test_labels[:10])

['They pack them in there tight', 'biggest and baddest football stadium on the land', 'clean restrooms', 'Bring a water bottle - you can bring in an empty bottle and fill from the water fountain', 'Beaver stadium is a treat to watch a football game', 'As you approach the stadium, it is unimpressive and un-inspiring', 'The best place to be on a fall Saturday even in the rain', 'the players', 'The flea market is once a month, so there are only 1 or 2 months when you should not go', 'I have been asked many times by friends, some who are season ticket holders and others first-time game attendees, to offer up some examples of things to shoot in/around the stadium']
[1, 2, 2, 1, 2, 0, 2, 1, 1, 1]


In [None]:
# Load the sentiment analysis pipeline with the pre-trained models
pipe1 = pipeline("text-classification", model="cardiffnlp/twitter-roberta-base-sentiment-latest")
pipe2 = pipeline("text-classification", model="clapAI/modernBERT-base-multilingual-sentiment")
pipe3 = SentimentIntensityAnalyzer()

# Function to convert VADER scores to categorical labels
def vader_to_label(text):
    score = pipe3.polarity_scores(text)["compound"]
    return 0 if score < -0.05 else 1 if -0.05 <= score <= 0.05 else 2  # Negative, Neutral, Positive

# Function to get predictions from transformer pipelines
def get_transformer_predictions(pipe, texts, model_name):
    results = []
    for text in tqdm(texts, desc=f"Processing {model_name}", leave=True):
        results.append(pipe(text)[0])
    predicted_labels = [
        0 if res['label'].lower() == 'negative' else 1 if res['label'].lower() == 'neutral' else 2
        for res in results
    ]
    return predicted_labels

# Generate predictions from all three approaches
predicted_labels_roberta = get_transformer_predictions(pipe1, test_texts, "RoBERTa")
predicted_labels_modernBERT = get_transformer_predictions(pipe2, test_texts, "modernBERT")
predicted_labels_vader = [vader_to_label(text) for text in tqdm(test_texts, desc="Processing VADER")]

Device set to use cpu
Device set to use cpu
Processing RoBERTa: 100%|██████████| 500/500 [01:01<00:00,  8.12it/s]
Processing modernBERT: 100%|██████████| 500/500 [01:22<00:00,  6.08it/s]
Processing VADER: 100%|██████████| 500/500 [00:00<00:00, 19236.22it/s]
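
The label conversion above sends every label that is not "negative" or "neutral" to class 2. As an optional alternative, here is a sketch of a stricter mapping that raises an error on an unexpected label, so a model with a different label scheme is not silently scored as positive. The names `LABEL_TO_ID`, `label_to_id`, and `compound_to_id` are hypothetical, and the sample outputs are invented to mimic the pipeline's output format.

```python
# Assumed label scheme, matching the dataset: 0 = negative, 1 = neutral, 2 = positive
LABEL_TO_ID = {"negative": 0, "neutral": 1, "positive": 2}


def label_to_id(label):
    """Map a model's string label to an integer id, failing loudly on unknown labels."""
    key = label.strip().lower()
    if key not in LABEL_TO_ID:
        raise ValueError(f"Unexpected label: {label!r}")
    return LABEL_TO_ID[key]


def compound_to_id(compound, threshold=0.05):
    """Map a VADER-style compound score to an id using the +/-0.05 cutoffs."""
    if compound < -threshold:
        return 0
    if compound > threshold:
        return 2
    return 1


# Hypothetical pipeline-style outputs
sample_outputs = [
    {"label": "Positive", "score": 0.93},
    {"label": "neutral", "score": 0.61},
]
print([label_to_id(r["label"]) for r in sample_outputs])
print([compound_to_id(c) for c in (-0.2, 0.0, 0.4)])
```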


In [None]:
# Function to evaluate accuracy, precision, recall, and F1-score
def evaluate_model(test_labels, predicted_labels):
    accuracy = accuracy_score(test_labels, predicted_labels)
    precision, recall, f1, _ = precision_recall_fscore_support(
        test_labels, predicted_labels, average='weighted', zero_division=1
    )
    return accuracy, precision, recall, f1

# Evaluate all models
metrics_roberta = evaluate_model(test_labels, predicted_labels_roberta)
metrics_modernBERT = evaluate_model(test_labels, predicted_labels_modernBERT)
metrics_vader = evaluate_model(test_labels, predicted_labels_vader)

# Print results in a formatted table
print("\nModel Performance Comparison:\n")
print(f"{'Metric':<15}{'RoBERTa':<12}{'modernBERT':<12}{'VADER':<12}")
print("-" * 45)
print(f"{'Accuracy':<15}{metrics_roberta[0]:<12.3f}{metrics_modernBERT[0]:<12.3f}{metrics_vader[0]:.3f}")
print(f"{'Precision':<15}{metrics_roberta[1]:<12.3f}{metrics_modernBERT[1]:<12.3f}{metrics_vader[1]:.3f}")
print(f"{'Recall':<15}{metrics_roberta[2]:<12.3f}{metrics_modernBERT[2]:<12.3f}{metrics_vader[2]:.3f}")
print(f"{'F1-score':<15}{metrics_roberta[3]:<12.3f}{metrics_modernBERT[3]:<12.3f}{metrics_vader[3]:.3f}")


Model Performance Comparison:

Metric         RoBERTa     modernBERT  VADER       
---------------------------------------------
Accuracy       0.784       0.594       0.664
Precision      0.835       0.751       0.728
Recall         0.784       0.594       0.664
F1-score       0.798       0.633       0.684


### Next: Fine-tune a text classification model

## Named Entity Recognition (NER)

NER is a Natural Language Processing (NLP) task that identifies and classifies entities in text into predefined categories such as:

- PER (Person) → e.g., *Dr. Q*
- ORG (Organization) → e.g., *NASSM*
- LOC (Location) → e.g., *San Diego*
- MISC (Miscellaneous) → e.g., *AI*

In [None]:
text = '''
February NASSM Conversations The February installment of NASSM Conversations will dive into something we are all becoming more familiar with daily: AI. This session’s moderators, Drs. Yizhou Tyreal Qian from Louisiana State University and Mike Naraine from Brock University will discuss the various applications of AI in sport management. Date: Monday, March 3, 2-3 pm ET.
'''

# Load the NER pipeline
ner = pipeline(
    "ner",
    model="xlm-roberta-large-finetuned-conll03-english",

    # Change model here if needed:
    # "Jean-Baptiste/roberta-large-ner-english"
    # "xlm-roberta-large-finetuned-conll03-english"
    # "dbmdz/bert-large-cased-finetuned-conll03-english"

    aggregation_strategy="simple"  # Define how tokens are grouped into entities

    # Aggregation strategies:
    # "none"    -> No aggregation (raw prediction for every token)
    # "simple"  -> Groups adjacent tokens that share the same entity type
    # "first"   -> Word-level grouping; each word takes the label of its first token
    # "average" -> Word-level grouping; token scores are averaged before picking the label
    # "max"     -> Word-level grouping; each word takes the label of its highest-scoring token
)

In [None]:
# Run NER on the text and collect the recognized entities in a DataFrame
import pandas as pd

df = pd.DataFrame(ner(text))
df

In [None]:
print(df)
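
To illustrate what grouping adjacent tokens means for `aggregation_strategy="simple"`, here is a simplified sketch in plain Python. It is not the Transformers implementation: the function name `merge_adjacent_entities`, the input format, and the scores are hypothetical. It assumes each token carries a `B-`/`I-` prefixed tag and its position in the sentence, and that the merged score is the mean of the token scores.

```python
def merge_adjacent_entities(token_predictions):
    """Group adjacent tokens of the same entity type into one span."""
    groups = []
    for tok in token_predictions:
        # "B-PER" -> prefix "B", entity_type "PER"
        prefix, _, entity_type = tok["entity"].partition("-")
        last = groups[-1] if groups else None
        continues_last = (
            last is not None
            and last["entity_group"] == entity_type
            and tok["index"] == last["last_index"] + 1
            and prefix != "B"  # a "B-" tag always opens a new entity
        )
        if continues_last:
            last["words"].append(tok["word"])
            last["scores"].append(tok["score"])
            last["last_index"] = tok["index"]
        else:
            groups.append({
                "entity_group": entity_type,
                "words": [tok["word"]],
                "scores": [tok["score"]],
                "last_index": tok["index"],
            })
    return [
        {
            "entity_group": g["entity_group"],
            "word": " ".join(g["words"]),
            "score": sum(g["scores"]) / len(g["scores"]),
        }
        for g in groups
    ]


# Hypothetical token-level predictions (scores are made up)
tokens = [
    {"word": "Mike", "entity": "B-PER", "score": 0.99, "index": 1},
    {"word": "Naraine", "entity": "I-PER", "score": 0.97, "index": 2},
    {"word": "Brock", "entity": "B-ORG", "score": 0.95, "index": 4},
    {"word": "University", "entity": "I-ORG", "score": 0.93, "index": 5},
]

for entity in merge_adjacent_entities(tokens):
    print(entity)
```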

## **Text Summarization**
Text Summarization condenses long text into a shorter version while retaining key information. Approaches include:  

- Extractive Summarization → Selects the most important sentences from the text (e.g., TextRank, LexRank).  
- Abstractive Summarization → Generates new sentences to convey the main ideas (e.g., BART, T5).  

Examples:  
- Original Text: *DeepSeek announced a groundbreaking open-source model, making cutting-edge AI more accessible to everyone and driving innovation across industries.*
- Extractive Summary: *DeepSeek announced a groundbreaking model.*
- Abstractive Summary: *A new industry-changing model was introduced.*
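
As a rough illustration of the extractive approach, the sketch below scores each sentence by the average corpus frequency of its non-stopword terms and keeps the top-scoring sentences in their original order. This is a simple frequency heuristic, not TextRank or LexRank; the stopword list, the function name `extractive_summary`, and the sample text are hypothetical.

```python
import re
from collections import Counter

# Small hypothetical stopword list for illustration
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it", "for", "on", "was"}


def content_words(text):
    """Lowercased word tokens with stopwords removed."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]


def extractive_summary(text, num_sentences=1):
    """Return the highest-scoring sentences, copied verbatim from the input."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s.strip()]
    freq = Counter(content_words(text))

    def score(sentence):
        tokens = content_words(sentence)
        return sum(freq[w] for w in tokens) / len(tokens) if tokens else 0.0

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:num_sentences])  # restore original sentence order
    return " ".join(sentences[i] for i in keep)


doc = "Transformers power modern NLP. Transformers summarize text. The weather was mild."
print(extractive_summary(doc, num_sentences=1))
```

Because the output is assembled from existing sentences, it can never state something the source does not say. An abstractive model generates new text and carries no such guarantee.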

In [None]:
# Load text summarization pipeline
text_summarizer = pipeline(
    "summarization",
    model="google/pegasus-xsum",  # Pre-trained model for summarization

    # Change model here if needed:
    # "google/pegasus-xsum"
    # "philschmid/bart-large-cnn-samsum"
    # "Falconsai/text_summarization"

    device=-1  # -1 runs on CPU; set to 0 to use the first GPU for faster processing
)

In [None]:
# Sample text for summarization
text = '''
The mission of the North American Society for Sport Management (NASSM) is to promote, stimulate, and encourage study, research, scholarly writing, and professional development in the area of sport management. Members of the Society are concerned about the theoretical and applied aspects of management theory and practice specifically related to sport, exercise, dance, and play, as these fields are pursued by all sectors of the population. The Society endeavors to support and cooperate with local, regional, national, and international organizations that have similar purposes and organizes and administers conferences to promote its purposes.
'''

# Generate summary
summary = text_summarizer(
    text,
    max_length=50,  # Maximum summary length, in tokens
    min_length=10,  # Minimum summary length, in tokens
    num_beams=4,  # Number of beams for beam search (more beams = wider search, slower generation)
    do_sample=False,  # Disable sampling so the summary is deterministic
    temperature=0.5,  # Sampling temperature; has no effect while do_sample=False
    truncation=True,  # Truncate input that exceeds the model's maximum length
    clean_up_tokenization_spaces=True  # Remove extra spaces in output
)

In [None]:
print(summary)

In [None]:
# Print summarized text
print(summary[0]["summary_text"])

The North American Society for Sport Management (NASSM) is a not-for-profit society with members in the United States, Canada, and Mexico.


## **Topic Modeling**
Topic Modeling is an NLP technique for discovering hidden themes in a collection of texts. Approaches include:

- Traditional Methods → Uses probabilistic models like Latent Dirichlet Allocation (LDA) and Non-Negative Matrix Factorization (NMF) to cluster words into topics.
- BERTopic → A more advanced method that combines BERT-style document embeddings, dimensionality reduction (e.g., UMAP), clustering (e.g., HDBSCAN), and class-based TF-IDF (c-TF-IDF) representations for more coherent and interpretable topic extraction.

Use cases:
- LDA/NMF → Suitable for structured, well-separated topics.
- BERTopic → Excels at capturing nuanced, contextual topics in large datasets.
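
The last stage of a BERTopic-style pipeline, describing each cluster by its most characteristic words, can be sketched without any model. The example below assumes the documents have already been embedded and clustered, and applies a class-based TF-IDF weighting: term frequency within a cluster times `log(1 + A / f)`, where `A` is the average number of words per cluster and `f` is the word's total count across clusters. The exact formula, the function names, and the toy documents are assumptions for illustration, not BERTopic's code.

```python
import math
from collections import Counter


def class_tfidf(clusters):
    """clusters: dict mapping cluster id -> list of documents (strings)."""
    # Treat each cluster as one long document and count its words
    class_counts = {
        c: Counter(" ".join(docs).lower().split()) for c, docs in clusters.items()
    }
    total_per_word = Counter()
    for counts in class_counts.values():
        total_per_word.update(counts)
    avg_words = sum(sum(c.values()) for c in class_counts.values()) / len(class_counts)

    weights = {}
    for c, counts in class_counts.items():
        n_words = sum(counts.values())
        weights[c] = {
            w: (k / n_words) * math.log(1 + avg_words / total_per_word[w])
            for w, k in counts.items()
        }
    return weights


def top_words(weights, k=3):
    """Top-k words per cluster; ties are broken alphabetically."""
    return {
        c: [w for w, _ in sorted(ws.items(), key=lambda kv: (-kv[1], kv[0]))[:k]]
        for c, ws in weights.items()
    }


# Hypothetical documents, already grouped into two clusters
clusters = {
    0: ["great seats in the stadium", "the stadium has great seats"],
    1: ["parking costs too much", "the parking lot fills early"],
}

print(top_words(class_tfidf(clusters), k=3))
```

Words that appear in every cluster, such as "the", receive a lower weight than words concentrated in one cluster, which is what makes the resulting topic descriptions readable.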

### Demo: Esports at the 2023 Asian Games  

This project applied LLM-enhanced BERTopic modeling to analyze esports stakeholders' perceptions regarding the inclusion of esports as a medal event.

Explore the full analysis and demos in this [GitHub Repository](https://github.com/TyrealQ/Twitter-Perceptions-Esports-2023-Asian-Games_HICSS-58).