# Sentiment Analysis with Hugging Face

This project applies **zero-shot classification** using Hugging Face Transformers to categorize news articles into themes such as politics, sports, business, science, and more. The workflow involves cleaning and preparing the dataset, predicting categories with a DistilBERT MNLI model, merging predictions with metadata (countries, cities, nationalities), and generating insightful visualizations. Interactive charts and maps built with Plotly highlight the distribution of categories globally and across specific locations, while the enriched dataset is exported for further exploration in Streamlit.


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import json

import geonamescache
import geopandas as gpd
from shapely.geometry import Point

import plotly.express as px

from transformers import pipeline 
# import country_converter as coco

import torch

In [2]:
# import dataset for prediction 
df_pred = pd.read_csv('dataset/new_data.csv')
df_pred.head()

| | text | countries | nationalities | cities |
|---|---|---|---|---|
| 0 | Ukraine: Angry Zelensky vows to punish Russian... | Ukraine | Ukrainian | |
| 1 | Ukraine: Angry Zelensky vows to punish Russian... | | Russian | |
| 2 | War in Ukraine: Taking cover in a town under a... | Ukraine | Russian | Irpin |
| 3 | Ukraine war 'catastrophic for global food' One... | Ukraine | | |
| 4 | Manchester Arena bombing: Saffie Roussos's par... | | | |


In [3]:
df_pred.shape

(47087, 4)

In [4]:
# drop rows with duplicate text so that each article
# is classified only once by the Hugging Face model
df_pred = df_pred.drop_duplicates(subset='text')
df_pred.shape

(39653, 4)
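`drop_duplicates(subset='text')` keeps the first row for each distinct text and discards the rest, so metadata that appears only on a later duplicate (for example a second nationality attached to the same headline) is absent from the deduplicated frame. A minimal sketch with made-up rows (the headlines and values are illustrative, not taken from the dataset):

```python
import pandas as pd

# Hypothetical rows: the same headline appears twice with different metadata.
demo = pd.DataFrame({
    "text": ["Headline A", "Headline A", "Headline B"],
    "nationalities": ["Ukrainian", "Russian", None],
})

# keep="first" is the default: the first row per distinct text survives.
deduped = demo.drop_duplicates(subset="text")

print(deduped)
```

This is why the predictions are later merged back onto the full, non-deduplicated data: the merge restores the metadata rows that deduplication set aside.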

In [5]:
df_pred.nationalities.value_counts()[:20]

nationalities
Russian          660
British          590
Tory             378
Israeli          341
Scottish         220
French           204
Ukrainian        203
American         153
European         150
Indian           127
Chinese          114
Australian       107
Irish            102
German            94
Spanish           80
Italian           73
Ukrainians        68
Republican        66
Conservatives     66
Canadian          63
Name: count, dtype: int64

In [6]:
# capitalize
# df_pred.nationalities = df_pred.nationalities.str.capitalize()

In [7]:
# Some values in 'nationalities' mean the same thing, e.g. Russian and Russians,
# Ukrainian and Ukrainians, Nigerian and Nigerians.
# Strip the trailing 's' so these variants are counted together.
# Note: rstrip('s') removes every trailing 's', not just one.

df_pred.nationalities = df_pred.nationalities.str.rstrip('s')
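Because `rstrip('s')` removes every trailing `s`, a value that legitimately ends in a double `s` would be truncated. The sketch below compares it with a hypothetical helper, `strip_plural_s`, that removes a single trailing `s` only when the preceding character is not also an `s`. The sample values are illustrative and the rule is a heuristic, not a full plural-to-singular conversion; applying it to the notebook would change the counts shown.

```python
import re

def strip_plural_s(value: str) -> str:
    """Remove one trailing 's' unless the word ends in 'ss' (heuristic)."""
    return re.sub(r"(?<!s)s$", "", value)

# Illustrative values; "Swiss" shows where rstrip('s') goes too far.
samples = ["Russians", "Ukrainians", "Swiss", "Conservatives"]

naive = [s.rstrip("s") for s in samples]
careful = [strip_plural_s(s) for s in samples]

for original, a, b in zip(samples, naive, careful):
    print(f"{original:<14} rstrip: {a:<13} regex: {b}")

# Assumed pandas equivalent of the helper:
# df_pred.nationalities.str.replace(r"(?<!s)s$", "", regex=True)
```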

In [8]:
df_pred.nationalities.value_counts()[:20]

nationalities
Russian         722
British         590
Israeli         382
Tory            378
Ukrainian       271
Scottish        220
American        206
French          204
European        151
Indian          134
Australian      123
Chinese         114
Republican      113
Palestinian     103
German          103
Irish           102
Conservative     98
Spanish          80
Italian          77
Canadian         66
Name: count, dtype: int64

## Predictions using HuggingFace

###### Using a random sample of 10K records for prediction rather than all 39,653 deduplicated records
###### Running on CPU

In [9]:
# df_pred = df_pred.drop_duplicates(subset='processed_desc')
df_pred = df_pred.sample(frac=1)
df_pred = df_pred.iloc[:10_000, :]
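Shuffling with `sample(frac=1)` and then slicing draws a different subset on every run, so the saved predictions cannot be regenerated for the same rows. One possible alternative, sketched on a stand-in frame, is to request the sample size directly and fix a seed. The seed value 42 is an arbitrary choice for illustration and is not used in the original notebook.

```python
import pandas as pd

# Stand-in frame; in the notebook this would be the deduplicated df_pred.
demo = pd.DataFrame({"text": [f"article {i}" for i in range(100)]})

# n sets the sample size; random_state makes the draw repeatable.
sample_a = demo.sample(n=10, random_state=42)
sample_b = demo.sample(n=10, random_state=42)

print(sample_a.index.equals(sample_b.index))
```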

In [10]:
df_pred.drop(columns=['nationalities', 'cities', 'countries'], inplace=True)

In [11]:
df_pred.head()

| | text |
|---|---|
| 16096 | Hampshire firefighters in 25-hour earthquake r... |
| 37476 | Boys detained for zombie-knife killing of teen... |
| 2482 | World Snooker Championship: Pigeon disrupts th... |
| 37076 | Man Utd fight back twice to beat Sheff Utd lat... |
| 20945 | Champions League final 2023: Fans react to Man... |


##### Uncomment the code in the cells below to rerun the predictions.
##### Running on CPU 

In [None]:
# print(torch.__version__)
# print("CUDA available:", torch.cuda.is_available())


2.8.0+cpu
CUDA available: False


In [None]:
# # Load classifier
# device = 0 if torch.cuda.is_available() else -1
# classifier = pipeline(
#     "zero-shot-classification",
#     model="./distilbert-mnli-model",
#     tokenizer="./distilbert-mnli-model",
#     device=device
# )

Device set to use cpu


In [None]:
# candidate_labels = ["sports", "politics", "business", "science", "climate", "weather", "entertainment", "travel", "crime", "war", "technology", "health", "education", "accidents"]

# # Batch function
# def classify_batch(texts, labels):
#     results = classifier(texts, labels)
#     # Results is a list of dicts when you pass multiple texts
#     return [r["labels"][0] for r in results]

# # Batch process DataFrame
# batch_size = 32   # increase to 64/128 if you have GPU
# predictions = []

# for i in range(0, len(df_pred), batch_size):
#     batch_texts = df_pred["text"].iloc[i:i+batch_size].tolist()
#     preds = classify_batch(batch_texts, candidate_labels)
#     predictions.extend(preds)
    
#     # (Optional) progress
#     print(f"Processed {i+len(batch_texts)} / {len(df_pred)}")

# # Add predictions back
# df_pred["predicted_label"] = predictions


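# # Optional checkpoint, placed inside the loop: every 100 batches,
# # save the rows classified so far together with their labels.
# # if (i // batch_size) % 100 == 0:
# #     df_pred.iloc[:len(predictions)].assign(predicted_label=predictions).to_csv("dataset/prediction.csv", index=False)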
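For each input text, the zero-shot pipeline returns a dictionary whose `labels` and `scores` lists are sorted from most to least likely, which is why `r["labels"][0]` is the predicted category. The sketch below mimics that output shape with a stand-in function, `fake_classifier`, so it runs without a model. The toy scoring rule, the `min_score` threshold, and the helper names are assumptions for illustration, not part of the notebook.

```python
from typing import Dict, List, Optional, Tuple

def fake_classifier(texts: List[str], labels: List[str]) -> List[Dict]:
    """Stand-in that mimics the shape of the zero-shot pipeline output."""
    results = []
    for text in texts:
        # Toy scoring: a label scores 0.9 if it occurs in the text, else 0.1.
        scored = sorted(
            ((0.9 if label in text.lower() else 0.1, label) for label in labels),
            key=lambda pair: pair[0],
            reverse=True,
        )
        results.append({
            "sequence": text,
            "labels": [label for _, label in scored],
            "scores": [score for score, _ in scored],
        })
    return results

def classify_in_batches(
    texts: List[str], labels: List[str], batch_size: int = 2, min_score: float = 0.5
) -> List[Tuple[Optional[str], float]]:
    """Return (label, score) per text; the label is None below min_score."""
    predictions = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        for result in fake_classifier(batch, labels):
            top_label, top_score = result["labels"][0], result["scores"][0]
            kept = top_label if top_score >= min_score else None
            predictions.append((kept, top_score))
    return predictions

texts = ["Latest sports results", "New health guidance issued", "Untitled note"]
preds = classify_in_batches(texts, ["sports", "health"])
print(preds)
```

Keeping the top score alongside the label makes it possible to inspect or filter low-confidence predictions later, which the label-only version cannot do.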


Processed 32 / 10000
Processed 64 / 10000
Processed 96 / 10000
Processed 128 / 10000
Processed 160 / 10000
Processed 192 / 10000
Processed 224 / 10000
Processed 256 / 10000
Processed 288 / 10000
Processed 320 / 10000
Processed 352 / 10000
Processed 384 / 10000
Processed 416 / 10000
Processed 448 / 10000
Processed 480 / 10000
Processed 512 / 10000
Processed 544 / 10000
Processed 576 / 10000
Processed 608 / 10000
Processed 640 / 10000
Processed 672 / 10000
Processed 704 / 10000
Processed 736 / 10000
Processed 768 / 10000
Processed 800 / 10000
Processed 832 / 10000
Processed 864 / 10000
Processed 896 / 10000
Processed 928 / 10000
Processed 960 / 10000
Processed 992 / 10000
Processed 1024 / 10000
Processed 1056 / 10000
Processed 1088 / 10000
Processed 1120 / 10000
Processed 1152 / 10000
Processed 1184 / 10000
Processed 1216 / 10000
Processed 1248 / 10000
Processed 1280 / 10000
Processed 1312 / 10000
Processed 1344 / 10000
Processed 1376 / 10000
Processed 1408 / 10000
Processed 1440 / 10000

In [None]:
# df_pred.to_csv("dataset/prediction.csv", index=False)

#### Labels Selected for Prediction 

* "sports" 
* "politics"
* "business" 
* "science" 
* "climate" 
* "weather" 
* "entertainment"
* "travel"
* "crime"
* "war"
* "technology"
* "health"
* "education"
* "accidents"


In [18]:
df_preds = pd.read_csv('dataset/prediction.csv')
# df_preds.head()

In [19]:
# df_preds.head(10)

In [20]:
df_preds.predicted_label.value_counts()

predicted_label
politics         2085
climate          2043
sports           1027
entertainment     866
health            649
travel            598
war               566
weather           542
crime             533
accidents         522
business          314
education         114
technology         92
science            49
Name: count, dtype: int64
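The counts above sum to 10,000, so each one converts directly to a share of the sample. The short calculation below copies the counts from the output and shows that politics and climate together account for roughly 41% of the predictions.

```python
# Counts copied from the value_counts() output above.
counts = {
    "politics": 2085, "climate": 2043, "sports": 1027, "entertainment": 866,
    "health": 649, "travel": 598, "war": 566, "weather": 542, "crime": 533,
    "accidents": 522, "business": 314, "education": 114, "technology": 92,
    "science": 49,
}

total = sum(counts.values())
shares = {label: round(100 * n / total, 2) for label, n in counts.items()}

for label, pct in shares.items():
    print(f"{label:<14}{pct:>6.2f}%")

top_two = counts["politics"] + counts["climate"]
print(f"politics + climate: {100 * top_two / total:.2f}% of {total} rows")
```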

In [21]:
df_pred1 = pd.read_csv('dataset/new_data.csv')
df_pred1.head(2)

| | text | countries | nationalities | cities |
|---|---|---|---|---|
| 0 | Ukraine: Angry Zelensky vows to punish Russian... | Ukraine | Ukrainian | |
| 1 | Ukraine: Angry Zelensky vows to punish Russian... | | Russian | |


In [22]:
df_pred1.nationalities = df_pred1.nationalities.str.rstrip('s')
df_pred1.nationalities.value_counts()[:4]

nationalities
Russian    943
British    597
Israeli    425
Tory       395
Name: count, dtype: int64

In [23]:
# Merge the predicted categories back onto the full dataset (duplicate texts included)
# to recover the country, nationality and city metadata

df_new_pred = df_pred1.merge(df_preds,
                             on = 'text',
                             how = 'left')

In [24]:
df_new_pred = df_new_pred.dropna(subset='predicted_label')

In [25]:
# for i, row in df_preds.iterrows():
#     for j, x_row in df_pred1.iterrows():
#         if row[0] == x_row[0]:
#             df_preds['nationalities'] = df_pred1['nationalities']
#             df_preds['cities'] = df_pred1['cities']
#             df_preds['countries'] = df_pred1['countries']
#     # print(i, row[0])
# # for i, row in df_preds.iterrows():
# #     print(row[0])

In [26]:
x =  df_new_pred.shape[0]
y = df_preds.shape[0]

print(f"Combined dataset has {x} rows \nPredicted value dataset has {y} rows")

Combined dataset has 11871 rows 
Predicted value dataset has 10000 rows
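The combined dataset has more rows than the prediction file because the full data still contains duplicate texts: every copy of a predicted text picks up the same label in the left merge. A small sketch with hypothetical rows shows the effect. The `validate="many_to_one"` argument is an optional safeguard added here, not something the notebook uses; it raises an error if any text has more than one prediction.

```python
import pandas as pd

# Hypothetical stand-ins for df_pred1 (full data) and df_preds (predictions).
full = pd.DataFrame({
    "text": ["Headline A", "Headline A", "Headline B", "Headline C"],
    "nationalities": ["Ukrainian", "Russian", None, "British"],
})
preds = pd.DataFrame({
    "text": ["Headline A", "Headline B"],
    "predicted_label": ["war", "sports"],
})

# Left merge, then drop the rows that received no prediction.
merged = full.merge(preds, on="text", how="left", validate="many_to_one")
merged = merged.dropna(subset=["predicted_label"])

print(len(preds), "predictions ->", len(merged), "merged rows")
```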


In [27]:
# capitalize the first letter of each value
# (str.capitalize also lowercases the rest, so 'New York' becomes 'New york')
df_new_pred.nationalities = df_new_pred.nationalities.str.capitalize()
df_new_pred.cities = df_new_pred.cities.str.capitalize()
df_new_pred.predicted_label = df_new_pred.predicted_label.str.capitalize()

In [28]:
# Mapping of non-standard → official country names
# country_fix = {
#     "UK": "United Kingdom",
#     "USA": "United States",
#     "UAE": "United Arab Emirates",
#     "South Korea": "Korea, Republic of",
#     "North Korea": "Korea, Democratic People's Republic of",
#     }

# # Apply fixes
# df_preds["countries"] = df_preds["countries"].replace(country_fix)

In [29]:
# sets of unique values, used for membership checks in the plotting functions
country_list = set(df_new_pred.countries)
label_list = set(df_new_pred.predicted_label)
city_list = set(df_new_pred.cities)
nat_list = set(df_new_pred.nationalities)
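The metadata columns contain missing values, so sets built from the raw column also contain the missing-value marker. That does not affect membership checks against strings, but it matters if the sets are later reused to list the available options, for instance in a selection widget (an assumption about later use, not something shown in this notebook). A sketch on a stand-in column:

```python
import pandas as pd

# Stand-in column with one missing city.
demo = pd.DataFrame({"cities": ["Irpin", None, "Gaza", "Irpin"]})

raw_set = set(demo["cities"])              # includes the missing value
clean_set = set(demo["cities"].dropna())   # strings only

print(sorted(clean_set))
```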

In [30]:
# n_labels = df_new_pred.predicted_label.value_counts(normalize=True).reset_index()
# n_labels

In [31]:
n_labels = df_new_pred.predicted_label.value_counts().reset_index()

fig = px.bar(n_labels, x="predicted_label", y="count") 

    
fig.update_layout(
    title_text=" ",
    xaxis_title="\n Category Labels",
    yaxis_title="Number of Occurrences",
    # yaxis_range=[0,10]
)

fig.update_traces(marker_color='#873260')

fig.show()

### Which news categories are associated with a particular country?

In [32]:
def country_labels(country):
    """
    Plot the distribution of news categories (predicted labels)
    associated with a given country.

    Steps:
    1. Filter df_new_pred for rows matching the selected country.
    2. Count how many times each predicted label (category) appears.
    3. Plot the results as a bar chart.
    """

    if country in country_list:
        # Filter rows for the selected country and count label frequencies
        df = (
            df_new_pred[df_new_pred['countries'] == country]['predicted_label']
            .value_counts()  # count each category
            .reset_index()   # convert to DataFrame
        )

        # Rename columns for clarity
        df.columns = ['Category', 'Count']

        # Create a bar chart of categories vs. counts
        fig = px.bar(df, x="Category", y="Count")

        # Customize chart layout
        fig.update_layout(
            title_text=" ",                       # optional: empty title
            xaxis_title="\n News Category",       # label for x-axis
            yaxis_title="Number of Occurrences",  # label for y-axis
            # yaxis_range=[0,10]                  # optional fixed y-axis range
        )

        # Customize bar color
        fig.update_traces(marker_color='#873260')

        # Show the plot
        fig.show()

    else:
        # If the country is not found in your country_list
        print(f"{country} could not be sourced!")


# Example usage
country_labels('Italy')


In [33]:
# dfu = df_preds.cities.value_counts()[:20].reset_index()
# fig = px.bar(dfu, x="cities", y="count") 
# fig.update_layout(
# title_text="Cities Mentioned Mostly in News",
# xaxis_title="\n Category Labels",
# yaxis_title="Number of Occurence",
# # yaxis_range=[0,10]
# )

# fig.update_traces(marker_color='#6D8196')

# fig.show()

### Which news categories are associated with a particular city?

In [34]:
def city_labels(city):
    """
    Plot the distribution of news categories (predicted labels)
    associated with a given city.

    Steps:
    1. Filter df_new_pred for rows matching the selected city.
    2. Count how many times each predicted label (category) appears.
    3. Plot the results as a bar chart.
    """

    if city in city_list:
        # Filter rows for the selected city and count label frequencies
        df = (
            df_new_pred[df_new_pred['cities'] == city]['predicted_label']
            .value_counts()    # count occurrences of each category
            .reset_index()     # convert Series → DataFrame
        )

        # With pandas >= 2.0, value_counts().reset_index() yields the columns
        # ["predicted_label", "count"]; older versions yield ["index", "predicted_label"].
        # On an older version, rename them: df.columns = ["predicted_label", "count"]

        # Create a bar chart of categories vs. counts
        fig = px.bar(df, x="predicted_label", y="count")

        # Customize layout (title, axis labels, hide y-ticks if desired)
        fig.update_layout(
            title_text=f"News often related to {city}",
            xaxis_title="\n News Category",
            yaxis=dict(title="Occurrence", showticklabels=False, ticks="")
            # yaxis_range=[0,10]   # optional: fix the y-axis range
        )

        # Customize bar color
        fig.update_traces(marker_color='#6D8196')

        # Show the plot
        fig.show()
    else:
        # If the city is not found in your city_list
        print(f"{city} could not be sourced!")


# Example usage
city_labels('Gaza')
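The column names produced by `value_counts().reset_index()` depend on the installed pandas version, so plotting code that refers to `"predicted_label"` and `"count"` only works on pandas 2.0 or later. One way to avoid the dependency, sketched below on a stand-in Series, is to name both columns explicitly. The names `Category` and `Count` follow the ones used in `country_labels`.

```python
import pandas as pd

# Stand-in for the filtered predicted_label column.
labels = pd.Series(["Politics", "War", "Politics"], name="predicted_label")

counts = (
    labels.value_counts()
    .rename_axis("Category")      # name the index explicitly
    .reset_index(name="Count")    # name the counts column explicitly
)

print(counts.columns.tolist())
```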


### Which news categories are associated with a particular nationality?

In [35]:
# nat_list, the set of unique nationalities in df_new_pred, was built in an earlier cell



def nat_labels(nat):
    """
    Plot the distribution of news categories (predicted labels)
    associated with a given nationality.

    Steps:
    1. Check if the input nationality exists in the dataset.
    2. Filter df_new_pred for rows matching that nationality.
    3. Count how many times each predicted label (category) appears.
    4. Plot the results as a bar chart.
    """

    if nat in nat_list:
        # Filter rows by nationality and count label frequencies
        df = (
            df_new_pred[df_new_pred['nationalities'] == nat]['predicted_label']
            .value_counts()    # count each category
            .reset_index()     # convert Series → DataFrame
        )

        # Optional: rename columns for clarity
        # df.columns = ["predicted_label", "count"]

        # Create bar chart
        fig = px.bar(df, x="predicted_label", y="count")

        # Customize chart layout
        fig.update_layout(
            title_text=f"News category often related to a {nat}",
            xaxis_title="\n News Category",
            yaxis=dict(title="Occurrence", showticklabels=False, ticks="")
            # yaxis_range=[0,10]   # optional: lock y-axis range
        )

        # Customize bar color
        fig.update_traces(marker_color='#6D8196')

        # Show the plot
        fig.show()

    else:
        # If nationality not found in dataset
        print(f"{nat} could not be found")


# Example usage
nat_labels('Chinese')
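`country_labels`, `city_labels` and `nat_labels` share the same filter-and-count step and differ only in the column they filter on. A possible refactoring, sketched here with a hypothetical helper `label_counts` and made-up rows, isolates that step so it can be checked without drawing a chart. Each plotting function could then pass its result to `px.bar`.

```python
import pandas as pd

def label_counts(df: pd.DataFrame, column: str, value: str) -> pd.DataFrame:
    """Count predicted labels for the rows where df[column] equals value."""
    subset = df.loc[df[column] == value, "predicted_label"]
    return (
        subset.value_counts()
        .rename_axis("Category")
        .reset_index(name="Count")
    )

# Made-up rows standing in for df_new_pred.
demo = pd.DataFrame({
    "countries": ["Italy", "Italy", "France", "Italy"],
    "predicted_label": ["Sports", "Politics", "Sports", "Sports"],
})

italy = label_counts(demo, "countries", "Italy")
print(italy)
```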


## Determine Prevalence of Categories per Country

In [36]:
def label_country(label):
    """
    Plot the global spread of a given news category (predicted label)
    by mapping its mentions across countries.

    Steps:
    1. Check if the input label exists in the dataset.
    2. Filter df_new_pred for rows matching that label.
    3. Count how many times each country is associated with the label.
    4. Match country names with geonamescache ISO3 codes.
    5. Plot a choropleth world map where color intensity = frequency.
    """

    if label in label_list:
        # Filter rows for the selected label, group by country, and count occurrences
        dfc = (
            df_new_pred[df_new_pred['predicted_label'] == label]
            .groupby('predicted_label')['countries']
            .value_counts()      # count how often each country appears
            .reset_index()       # convert Series → DataFrame
        )

        # Rename columns for clarity
        dfc.columns = ['labels', 'country', 'count']

        # Load geonamescache data for countries
        gc = geonamescache.GeonamesCache()
        countries_dict = gc.get_countries()

        # Convert geonamescache dictionary → DataFrame
        gcountries = pd.DataFrame.from_dict(countries_dict, orient="index")
        gcountries = gcountries[["name", "iso3"]]  # keep only useful columns

        # Normalize names for consistent merging (capitalize first letter)
        dfc["country_cap"] = dfc["country"].str.capitalize()
        gcountries["name_cap"] = gcountries["name"].str.capitalize()

        # Merge label-country counts with official geonames country info
        merged = dfc.merge(
            gcountries, 
            left_on="country_cap", 
            right_on="name_cap", 
            how="left"
        )

        # --- Choropleth map ---
        plot = px.choropleth(
            merged,
            locations="iso3",                # ISO3 country codes
            color="count",                   # frequency as color intensity
            color_continuous_scale="ylorrd", # color palette
            title=f"Global Spread of '{label}' mentions"
        )

        # Fix the figure size and projection
        plot.update_layout(
            width=1000,
            height=600,
            geo=dict(
                projection_type="natural earth",  # natural earth projection
                showframe=False,
                showcoastlines=True,
                showcountries=True,
                projection_scale=1,               # keep globe scaling
                center=dict(lat=0, lon=0)         # center map
            )
        )

        # Disable zooming/dragging
        plot.update_layout(dragmode=False)

        # Auto-fit the map bounds to locations
        plot.update_geos(fitbounds="locations", visible=False)

        # Show the final plot
        plot.show()

    else:
        # If label not in your label_list, show warning
        print(f"Please select a valid category.\n'{label}' is not among the available categories")


# Example usage
label_country('Technology')
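The left merge in `label_country` matches on country names, so any country written differently from the lookup table (such as "UK" for "United Kingdom", one of the cases listed in the commented-out `country_fix` mapping) ends up with a missing `iso3` code and is silently left off the map. The sketch below uses a tiny hand-made lookup in place of geonamescache, plus a hypothetical `aliases` dictionary, to show how unmatched names can be listed and repaired before plotting.

```python
import pandas as pd

# Hypothetical counts and a tiny stand-in for the geonamescache country table.
dfc = pd.DataFrame({"country": ["Italy", "UK", "France"], "count": [5, 9, 3]})
lookup = pd.DataFrame({
    "name": ["Italy", "France", "United Kingdom"],
    "iso3": ["ITA", "FRA", "GBR"],
})

# Illustrative alias map; extend it with whatever the unmatched list reveals.
aliases = {"UK": "United Kingdom"}

def attach_iso3(counts, names, alias_map=None):
    """Left-merge ISO3 codes onto the counts after applying optional aliases."""
    alias_map = alias_map or {}
    out = counts.copy()
    out["country_norm"] = out["country"].map(lambda c: alias_map.get(c, c))
    return out.merge(names, left_on="country_norm", right_on="name", how="left")

before = attach_iso3(dfc, lookup)
after = attach_iso3(dfc, lookup, aliases)

# Rows without an ISO3 code would not be drawn on the choropleth.
unmatched = before.loc[before["iso3"].isna(), "country"].tolist()
print("Unmatched without aliases:", unmatched)
print(after[["country", "iso3", "count"]])
```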


In [37]:
# export new data for streamlit app 
df_new_pred.to_csv('dataset/pred_streamlit.csv')
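Called without `index=False`, `to_csv` writes the DataFrame index as an extra, unnamed first column, which comes back as `Unnamed: 0` when the file is read again. Whether the Streamlit app relies on that column is not shown here, so the sketch below only demonstrates the difference on a stand-in frame written to an in-memory buffer.

```python
import io
import pandas as pd

# Stand-in frame with a non-contiguous index, as left behind by dropna().
demo = pd.DataFrame(
    {"text": ["a", "b"], "predicted_label": ["War", "Sports"]},
    index=[3, 7],
)

# Default behaviour: the index is written as an unnamed first column.
with_index = io.StringIO()
demo.to_csv(with_index)
with_index.seek(0)
cols_with = pd.read_csv(with_index).columns.tolist()

# index=False writes only the data columns.
without_index = io.StringIO()
demo.to_csv(without_index, index=False)
without_index.seek(0)
cols_without = pd.read_csv(without_index).columns.tolist()

print(cols_with)
print(cols_without)
```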