<a href="https://colab.research.google.com/github/JJingLu/CBS5055-Generative-Artificial-Intelligence-for-Innovative-Communications/blob/main/Workshop_3_Foundation_for_Social_Media_Data_Acquisition_and_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Workshop 3: Foundation for Social Media Data Acquisition and Analysis CBS5055

**Instructor: Jessie Lu**  

Welcome to Workshop 3!  

In today's session, you will learn the foundational skills for acquiring and analyzing social media-style text data using Python.  

The specific dataset we will explore today is **tweet_eval**, originally created by CardiffNLP and published on Hugging Face. It features English Twitter text for seven classic NLP classification tasks, including sentiment analysis, hate speech detection and stance detection, with uniformly formatted train, validation and test splits. This dataset is widely used in social media NLP research, especially for model training and benchmarking on noisy, short user-generated textual content.

You can view and explore the dataset directly here: https://huggingface.co/datasets/cardiffnlp/tweet_eval

In [None]:
# =================================================
# Part 1: Install & Import Libraries
# =================================================

# Install necessary Python packages quietly, without verbose output.
!pip install datasets pandas matplotlib wordcloud --quiet

# Import the load_dataset function from the 'datasets' library.
# 'datasets' for loading data from Hugging Face.
from datasets import load_dataset

# Import the pandas library, aliasing it as 'pd' for convenience.
# 'pandas' for data manipulation and analysis.
import pandas as pd

# Import the matplotlib.pyplot module, aliasing it as 'plt' for convenience.
# 'matplotlib' for creating static, interactive, and animated visualizations.
import matplotlib.pyplot as plt

# Import the WordCloud class from the 'wordcloud' library.
# 'wordcloud' for generating word cloud images.
from wordcloud import WordCloud

# Print a message to confirm that the packages have been successfully installed and imported.
print("✅ Packages installed and imported")

In [None]:
# =================================================
# Part 2: Load TweetEval Sentiment Dataset
# =================================================

# Load the 'tweet_eval' dataset from Hugging Face, specifically the 'sentiment' subset.
# The 'load_dataset' function downloads and caches the dataset.
dataset = load_dataset("cardiffnlp/tweet_eval", "sentiment")
# Print a header indicating that dataset information will follow.
print("\n📦 Dataset Info:")
# Print the loaded dataset object, which typically shows its structure (e.g., train, test, validation splits and features).
print(dataset)

In [None]:
# =================================================
# Part 3: Convert to Pandas DataFrame
# =================================================

# Convert the 'train' split of the loaded dataset into a pandas DataFrame.
# This makes it easier to perform data manipulation and analysis using pandas.
df_tweets = dataset["train"].to_pandas()

# Print a header indicating that a preview of the DataFrame will follow.
print("\n🔍 First 5 rows preview:")
# Display the first 5 rows of the DataFrame.
# 'display()' is used in Colab/Jupyter for rich output.
display(df_tweets.head())

# Print the shape of the DataFrame (number of rows, number of columns).
print("\n📐 Shape (rows, columns):", df_tweets.shape)
# Print a list of all column names in the DataFrame.
print("📋 Columns:", df_tweets.columns.tolist())

In [None]:
# =================================================
# Part 4: Understand the Data Structure
# =================================================
# Columns:
# - text: tweet text
# - label: sentiment integer (0 = negative, 1 = neutral, 2 = positive)

# Print a header indicating that data types will follow.
print("\n🧠 Data types:")
# Display the data types of each column in the DataFrame.
display(df_tweets.dtypes)

In [None]:
# =================================================
# Part 5: Map Sentiment Labels to Names
# =================================================

# Define a dictionary to map numerical sentiment labels to descriptive string names.
sentiment_map = {0: "negative", 1: "neutral", 2: "positive"}
# Create a new column 'sentiment' in the DataFrame by applying the sentiment_map to the 'label' column.
df_tweets["sentiment"] = df_tweets["label"].map(sentiment_map)

# Print a header indicating that a preview with sentiment names will follow.
print("\n📂 With sentiment names:")
# Display the first 5 rows of the DataFrame, showing only the 'text' and the new 'sentiment' columns.
display(df_tweets[["text","sentiment"]].head())
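
A quick sanity check after a label mapping can catch silent gaps: `Series.map` with a dictionary leaves `NaN` wherever a key is missing. The sketch below uses a small hypothetical frame named `toy` (not the real dataset), with one deliberately invalid label, to show how such rows could be counted.

```python
import pandas as pd

# Hypothetical toy frame standing in for df_tweets; label 3 is deliberately invalid.
toy = pd.DataFrame({"text": ["good", "meh", "bad", "???"], "label": [2, 1, 0, 3]})
sentiment_map = {0: "negative", 1: "neutral", 2: "positive"}

# Same mapping step as in the cell above.
toy["sentiment"] = toy["label"].map(sentiment_map)

# Series.map leaves NaN where a label has no entry in the dictionary.
unmapped = toy[toy["sentiment"].isna()]
n_unmapped = len(unmapped)
print(f"Unmapped rows: {n_unmapped}")
```

On the real `df_tweets`, the same `isna()` check should report zero unmapped rows if the labels are only 0, 1 and 2.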

In [None]:
# =================================================
# Part 6: Sentiment Distribution
# =================================================

# Print a header indicating that sentiment distribution information will follow.
print("\n📊 Sentiment distribution:")
# Calculate the count of each unique value in the 'sentiment' column.
# This gives the distribution of negative, neutral, and positive tweets.
dist = df_tweets["sentiment"].value_counts()
# Display the calculated sentiment distribution.
display(dist)

# Create a new figure for the plot.
plt.figure()
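# Plot the sentiment distribution as a bar chart.
# value_counts() sorts by frequency, so reindex to a fixed order first;
# otherwise the colors would not line up with the intended sentiments.
dist.reindex(["negative", "neutral", "positive"]).plot(kind="bar", color=["red", "gray", "green"])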
# Set the title of the plot.
plt.title("Sentiment Distribution in TweetEval – Training Set")
# Set the label for the x-axis.
plt.xlabel("Sentiment")
# Set the label for the y-axis.
plt.ylabel("Number of Tweets")
# Display the plot.
plt.show()
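
Raw counts are easier to interpret next to proportions, especially when classes are imbalanced. The following is a minimal sketch on a hypothetical Series called `toy_sentiment`; the names `order` and `summary` are illustrative and do not come from the notebook.

```python
import pandas as pd

# Hypothetical stand-in for df_tweets["sentiment"].
toy_sentiment = pd.Series(
    ["neutral", "positive", "neutral", "negative", "positive", "neutral"]
)

# A fixed label order keeps tables and charts comparable across runs.
order = ["negative", "neutral", "positive"]

counts = toy_sentiment.value_counts().reindex(order)
# normalize=True returns each class as a fraction of all rows.
shares = toy_sentiment.value_counts(normalize=True).reindex(order).round(3)

summary = pd.DataFrame({"count": counts, "share": shares})
print(summary)
```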

In [None]:
# =================================================
# Part 7: Word Frequency (All Tweets)
# =================================================

# Print a header indicating the purpose of this section.
print("\n📈 Most common words in all tweets:")
# Calculate the top 10 most common words across all tweets.
# .str.lower(): Converts all text to lowercase to treat words like 'The' and 'the' as the same.
# .str.split(): Splits each tweet text into a list of words.
# .explode(): Transforms each element of a list-like entry to a separate row, creating a Series of individual words.
# .value_counts(): Counts the occurrences of each unique word.
# .head(10): Selects the top 10 most frequent words.
top10 = (
    df_tweets["text"]
    .str.lower()
    .str.split()
    .explode()
    .value_counts()
    .head(10)
)
# Display the top 10 most common words and their counts.
display(top10)
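
Splitting on whitespace tends to surface function words ("the", "to") and tokens with punctuation attached. One possible refinement, sketched below with the standard library only, drops mentions, punctuation and a few stop words before counting. The mini-corpus `toy_tweets`, the helper `tokenize` and the tiny `STOP_WORDS` set are all assumptions for illustration, not a standard resource.

```python
import re
from collections import Counter

# Hypothetical mini-corpus standing in for df_tweets["text"].
toy_tweets = [
    "@user I love the new phone!",
    "The phone is great, love it",
    "@user the battery is bad...",
]

# Small illustrative stop-word list (an assumption, not a standard resource).
STOP_WORDS = {"the", "is", "it", "i", "a", "to", "and"}

def tokenize(text):
    """Lowercase, drop @mentions, and keep runs of letters or apostrophes."""
    text = re.sub(r"@\w+", " ", text.lower())
    return [w for w in re.findall(r"[a-z']+", text) if w not in STOP_WORDS]

# Count every remaining token across all tweets.
word_counts = Counter(w for t in toy_tweets for w in tokenize(t))
print(word_counts.most_common(3))
```

For real analysis, a fuller stop-word list (for example `wordcloud.STOPWORDS`) would likely be a better choice than this hand-written set.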

In [None]:
# =================================================
# Part 8: Compare Positive vs Negative Language
# =================================================

# Filter the DataFrame to get text from negative sentiment tweets.
neg_text = df_tweets[df_tweets["sentiment"]=="negative"]["text"]
# Filter the DataFrame to get text from positive sentiment tweets.
pos_text = df_tweets[df_tweets["sentiment"]=="positive"]["text"]

# Join all negative tweets into a single string, dropping any missing values.
neg_words = " ".join(neg_text.dropna().tolist())
# Join all positive tweets into a single string, dropping any missing values.
pos_words = " ".join(pos_text.dropna().tolist())

# Generate a word cloud for negative tweets.
# Configure width, height, and background color for the word cloud.
neg_cloud = WordCloud(width=600, height=300, background_color="white").generate(neg_words)
# Generate a word cloud for positive tweets.
# Configure width, height, and background color for the word cloud.
pos_cloud = WordCloud(width=600, height=300, background_color="white").generate(pos_words)

# Create a figure with a specific size for displaying two plots side-by-side.
plt.figure(figsize=(10, 4))
# Create the first subplot for the negative word cloud.
plt.subplot(1, 2, 1)
# Display the negative word cloud image; set its title and turn off axis.
plt.imshow(neg_cloud); plt.title("Negative Tweets Word Cloud"); plt.axis("off")
# Create the second subplot for the positive word cloud.
plt.subplot(1, 2, 2)
# Display the positive word cloud image; set its title and turn off axis.
plt.imshow(pos_cloud); plt.title("Positive Tweets Word Cloud"); plt.axis("off")
# Show the plots.
plt.show()
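
Word clouds give a visual impression; a numeric comparison can complement them. The sketch below counts words in two hypothetical lists (`neg_samples`, `pos_samples`) and ranks words by the difference in raw counts. The helper `count_words` is illustrative, and raw differences ignore class size, so on the real data relative frequencies may be the fairer measure.

```python
from collections import Counter

# Hypothetical examples standing in for neg_text and pos_text.
neg_samples = ["this is bad", "bad service and bad food"]
pos_samples = ["this is great", "great food and great people"]

def count_words(texts):
    """Count lowercase whitespace-separated tokens across all texts."""
    return Counter(w for t in texts for w in t.lower().split())

neg_counts = count_words(neg_samples)
pos_counts = count_words(pos_samples)

# Positive values lean positive, negative values lean negative, 0 is shared equally.
vocab = set(neg_counts) | set(pos_counts)
diff = {w: pos_counts[w] - neg_counts[w] for w in vocab}

most_positive = max(diff, key=diff.get)
most_negative = min(diff, key=diff.get)
print(most_positive, most_negative)
```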

In [None]:
# =================================================
# Part 9: Show Sample Tweets by Sentiment
# =================================================

def show_samples(sentiment, n=3):
    """Print n randomly chosen tweets with the given sentiment name."""
    # Print a header indicating the sentiment for the samples.
    print(f"\n📍 {sentiment.upper()} tweet samples:")
    # Filter the DataFrame to get tweets of the specified sentiment and randomly sample 'n' of them.
    samples = df_tweets[df_tweets["sentiment"] == sentiment].sample(n)
    # Iterate through the sampled tweets and print their text.
    for text in samples["text"]:
        print("-" * 60)
        print(text)

# Call the function to display sample negative tweets.
show_samples("negative")
# Call the function to display sample positive tweets.
show_samples("positive")
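
Random samples change on every run, which can make class discussion harder to follow. A possible variant, sketched here on a hypothetical frame `toy`, fixes the seed through `random_state` and caps the sample size at the number of available rows. The function name `sample_texts` and its `seed` parameter are assumptions, not part of the notebook.

```python
import pandas as pd

# Hypothetical toy frame standing in for df_tweets.
toy = pd.DataFrame({
    "text": ["a", "b", "c", "d", "e"],
    "sentiment": ["negative", "positive", "negative", "positive", "negative"],
})

def sample_texts(df, sentiment, n=3, seed=42):
    """Return up to n tweet texts for one sentiment, reproducibly."""
    subset = df[df["sentiment"] == sentiment]
    # sample() raises an error if n exceeds the number of rows, so cap it.
    k = min(n, len(subset))
    return subset.sample(k, random_state=seed)["text"].tolist()

first = sample_texts(toy, "positive")
second = sample_texts(toy, "positive")
print(first == second)
```

Passing a different `seed` would give a different but still repeatable selection.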