This notebook continues the analysis of the RedPajama dataset. It was run on Google Colab with a T4 GPU to support the more advanced NLP techniques.

Spark is not used in this part of the project because **GPU-based NLP libraries like HuggingFace Transformers and SpaCy are not natively supported in distributed Spark environments**.

Instead, the data is handled with Pandas so that SpaCy, the sentiment engine (HuggingFace Transformers), and the modelling steps can make full use of GPU acceleration.

Remark: progress bars were added intentionally to monitor the long-running NLP steps and to check latency.

Multiple brands were tested to scale up the project and gain more insights: Primark, ASOS, Burberry, River Island, Reiss, Superdry, Ted Baker, Zara, Vivienne Westwood, and John Lewis.


# Table of Contents

1. [Notebook Setup](#1-Notebook-Setup)  
2. [Advanced Text Processing](#2-Advanced-Text-Processing)  
3. [Keyword-based Filtering and Additional Feature Engineering](#3-Keyword-based-Filtering-and-Additional-Feature-Engineering)  
4. [Keyword Extraction](#4-Keyword-Extraction)  
5. [Sentiment Classification and Modelling](#5-Sentiment-Classification-and-Modelling)  
6. [Correlation Analysis](#6-Correlation-Analysis)  
7. [Hypothesis Testing](#7-Hypothesis-Testing)


# 1 Notebook Setup



In [None]:
!pip install -r ../requirements.txt

<div style="border: 2px solid #ffcc00; padding: 10px; border-radius: 6px; background-color: #fff9e6;">
  <strong>Note:</strong> Some libraries (like <code>thinc</code>, <code>spaCy</code>, <code>cupy</code>) were <strong>compiled against NumPy 1.x</strong> and are <strong>not yet compatible with NumPy 2.x</strong>. NumPy is therefore downgraded for this NLP and modelling task; the environment normally uses NumPy 2.0.x.
</div>

In [None]:
!pip install "numpy<2.0"
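
A quick guard can confirm that the pin took effect before the heavier imports run. The following is a minimal sketch: the helper name `is_numpy_1x` is hypothetical, and it only parses a version string of the kind returned by `np.__version__`. If NumPy was already imported in the session, the kernel usually needs a restart before the downgraded version is picked up.

```python
def is_numpy_1x(version: str) -> bool:
    """Return True when a version string such as '1.26.4' has major version 1."""
    major = int(version.split(".")[0])
    return major == 1


# Literal version strings for illustration; in the notebook one would pass np.__version__
for v in ("1.26.4", "2.0.1"):
    status = "compatible" if is_numpy_1x(v) else "needs downgrade"
    print(v, "->", status)
```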

In [None]:
# Core Libraries
import os
import re
import gc
import sys
import ast
import psutil
import numpy as np
import datetime as dt
import dateutil
from itertools import islice

# Data Handling
import pandas as pd
from pathlib import Path

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud
import wordcloud
from IPython.display import display, HTML

# Statistics & Math
from scipy.stats import spearmanr
from scipy.special import softmax

# NLP & Transformers
import spacy
import torch
import transformers
from transformers import (
    pipeline,
    AutoTokenizer,
    AutoModelForSequenceClassification
)

# Utilities
from collections import Counter
from tqdm import tqdm
from tqdm.notebook import tqdm as notebook_tqdm
notebook_tqdm.pandas()  # Enable tqdm for pandas

# External Tools
import gdown


In [None]:
# Download the spaCy model inside the notebook (only needed once) to save time when building the Docker image
import spacy.cli
spacy.cli.download("en_core_web_sm")

In [None]:
# keybert
from keybert import KeyBERT

## Version Check

In [None]:
print("Python:", sys.version)
print("NumPy:", np.__version__)
print("Pandas:", pd.__version__)

In [None]:
print("Torch:", torch.__version__)
print("Transformers:", transformers.__version__)
print("spaCy:", spacy.__version__)
print("WordCloud:", wordcloud.__version__)

## File path

In [None]:
# Get and display the current working directory for file verification
# The environment is based on a Jupyter container with the default 'jovyan' user, as configured by the server.
main_path = os.getcwd()
current_directory = os.path.dirname(main_path) + "/"
current_directory

In [None]:
df = pd.read_parquet(current_directory + 'data/csv_data/600k_selected_brand_redpajama.parquet')

In [None]:
df.head()

## DeepSeek Fixed Ranking

In [None]:
# Brand ranking categories in the UK 2023
top_rk = ("primark", "asos", "burberry")
med_rk = ("river island", "reiss")
low_rk = ("superdry", "ted baker")

# Brands appearing once
one_app = ("zara", "Vivienne Westwood", "john lewis")

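# Find brands that appear in more than one rank tier.
# A three-way intersection would only catch brands present in all tiers,
# so combine the pairwise intersections instead.
top_set, med_set, low_set = set(top_rk), set(med_rk), set(low_rk)
twice_app = (top_set & med_set) | (top_set & low_set) | (med_set & low_set)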

# 2 Advanced Text Processing

<div style="border-left: 4px solid #cc0000; padding: 1em; background-color: #ffe6e6; border-radius: 4px; margin-top: 1em;">
  <strong>⚠️🚫 Warning:</strong><br><br>
  This process may take a significant amount of time to complete, potentially up to several days, if the instance has no GPU and runs on CPU only. Causes include model inference latency, system resource constraints, and unexpected runtime issues (e.g., Java heap space errors, Spark job failures, or kernel crashes).<br><br>

  If sentiment analysis steps are interrupted, rerunning the process may be required from the beginning.<br><br>
  <em>Note: The separator cell below is intentionally added to help you pause and verify environment readiness before proceeding.</em>
</div>


## SpaCy

**SpaCy** is a fast, open-source library for advanced **Natural Language Processing (NLP)** in Python. Its key features include:
- **Tokenization** – Split text into words
- **Lemmatization** – Reduce words to their base form
- **Part-of-Speech Tagging** – Identify word types (noun, verb, etc.)
- **Named Entity Recognition (NER)** – Detect names, places, dates, etc.
- **Dependency Parsing** – Understand sentence structure


SpaCy Named Entity Recognition (NER) Labels

| Label | Description |
|-------|-------------|
| PERSON     | People, including fictional |
| NORP       | Nationalities, religious and political groups |
| FAC        | Facilities (e.g., buildings, airports, highways) |
| ORG        | Organizations (e.g., companies, agencies, institutions) |
| GPE        | Countries, cities, states (Geopolitical Entities) |
| LOC        | Non-GPE locations (e.g., mountain ranges, bodies of water) |
| PRODUCT    | Products (e.g., vehicles, devices, food) |
| EVENT      | Named events (e.g., World War II, Olympics) |
| WORK_OF_ART| Titles of creative works (books, songs, films) |
| LAW        | Named legal documents (e.g., treaties, laws) |
| LANGUAGE   | Any named language |
| DATE       | Absolute or relative dates (e.g., "2022", "next week") |
| TIME       | Times smaller than a day (e.g., "2 PM", "morning") |
| PERCENT    | Percentage values (e.g., "50%") |
| MONEY      | Monetary values (e.g., "$100", "€20") |
| QUANTITY   | Measurements (e.g., "10 kg", "5 miles") |
| ORDINAL    | First, second, third, etc. |
| CARDINAL   | Numerical values (e.g., "one", "100") |


## Extract entities; Named Entity Recognition (NER)

In [None]:
nlp = spacy.load("en_core_web_sm")  # Load a small English NLP model from spaCy

doc = nlp("H&M launched a new clothing line in Paris.")

for ent in doc.ents:
    print(ent.text, ent.label_)
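
Once entities are available as `(text, label)` pairs, the labels in the table above can be used to keep only the types of interest, for example organisations. The sketch below uses made-up pairs rather than real model output, and `entities_with_label` is a hypothetical helper.

```python
from collections import Counter

# Hypothetical (text, label) pairs in the same shape as ent.text / ent.label_
sample_entities = [
    ("Primark", "ORG"),
    ("London", "GPE"),
    ("Burberry", "ORG"),
    ("2023", "DATE"),
    ("Primark", "ORG"),
]


def entities_with_label(entities, label):
    """Keep only the entity texts whose label matches `label`."""
    return [text for text, ent_label in entities if ent_label == label]


orgs = entities_with_label(sample_entities, "ORG")
print(orgs)

# Count how often each label occurs in the sample
print(Counter(label for _, label in sample_entities))
```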

## Sample from DataFrame

In [None]:
sample_text = df['content'].iloc[0]  # sample from the first row

doc = nlp(sample_text)

for ent in doc.ents:
    print(ent.text, ent.label_)

## Apply to all dataframe

Apply batch processing to the entire DataFrame using `nlp.pipe`, which is significantly faster than calling `nlp` row by row with `.apply`.
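
To illustrate what a `batch_size` setting means, the sketch below groups a stream of texts into fixed-size batches with `itertools.islice`. It is a conceptual illustration only, not spaCy's internal implementation, and `batched` is a hypothetical helper.

```python
from itertools import islice


def batched(items, batch_size):
    """Yield successive lists of at most `batch_size` items from any iterable."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


texts = [f"page {i}" for i in range(7)]
batches = list(batched(texts, batch_size=3))

# The last batch is shorter when the total is not a multiple of batch_size
print([len(b) for b in batches])
```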

In [None]:
# Check available RAM before applying NLP techniques
import psutil
print(f"Available memory: {psutil.virtual_memory().available / 1e9:.2f} GB")

In [None]:
from tqdm import tqdm  # in Jupyter, `from tqdm.notebook import tqdm` gives a widget-based bar

texts = df['content'].tolist()
entities = []

for doc in tqdm(nlp.pipe(texts, batch_size=100), total=len(texts)):  # add tqdm wrapper
    entities.append([(ent.text, ent.label_) for ent in doc.ents])

df['entities'] = entities

In [None]:
print(df["entities"].iloc[0])

In [None]:
df["entities"].apply(lambda x: isinstance(x, list) and all(isinstance(i, tuple) and len(i) == 2 for i in x)).all()

In [None]:
type(df["entities"].iloc[0])

In [None]:
df["entities"].iloc[0][0]

In [None]:
type(df["entities"].iloc[0][0])

## Text Preprocessing; Lemmatization + Stopwords Removal

In [None]:
# Lemmatize each token and drop stop words and punctuation
def preprocess_spacy(text):
    doc = nlp(text)
    return " ".join([token.lemma_ for token in doc if not token.is_stop and not token.is_punct])

# Apply with progress bar
df['processed_text'] = df['content'].progress_apply(preprocess_spacy)

In [None]:
# Replace old content features
df["content"] = df["processed_text"]
df = df.drop(columns=["processed_text"])

# 3 Keyword-based Filtering and Additional Feature Engineering

Additional feature engineering has been applied for further filtering and analysis. The VADER polarity feature was removed because it provided no useful insight.

## is_relevant

In [None]:
# Define domain-specific keywords for filtering retail/fashion-related pages
keywords = [
    'dress', 'shirt', 't-shirt', 'jeans', 'pants', 'trousers', 'blouse',
    'jacket', 'coat', 'skirt', 'shorts', 'sweater', 'cardigan', 'fashion',
    'trend', 'style', 'sale', 'discount', 'shop', 'shopping', 'quality',
    'cheap', 'price', 'return', 'delivery', 'fit', 'comfortable', 'look'
]

# Define filtering function
def is_relevant(row):
    # Convert content to lowercase
    content = row['content'].lower()

    # Check if any keyword exists in the content
    has_keyword = any(kw in content for kw in keywords)

    # Content must be longer than 50 characters (filter out short/noise content)
    enough_length = len(content) > 50

    # Return True only if all conditions are met
    return has_keyword and enough_length

# Apply filtering to your DataFrame
df['is_relevant'] = df.progress_apply(is_relevant, axis=1)
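
Note that `kw in content` is a substring test, so `'fit'` also matches `'profit'`. If whole-word matching is preferred, a regular expression with word boundaries is one option. The sketch below shows that alternative under this assumption; it is not the filter used above, and `KEYWORDS`, `has_whole_word_keyword`, and the sample strings are illustrative.

```python
import re

KEYWORDS = ["dress", "fit", "shop", "t-shirt"]

# \b anchors the match at word boundaries; re.escape protects the hyphen in "t-shirt"
KEYWORD_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(kw) for kw in KEYWORDS) + r")\b",
    flags=re.IGNORECASE,
)


def has_whole_word_keyword(text: str) -> bool:
    """Return True when any keyword occurs as a whole word in `text`."""
    return KEYWORD_PATTERN.search(text) is not None


print("fit" in "quarterly profit rose")                  # substring test
print(has_whole_word_keyword("quarterly profit rose"))    # whole-word test
print(has_whole_word_keyword("The dress is a great fit"))
```

The trade-off is that inflected forms such as "shops" no longer match "shop" unless they are listed explicitly or the text has been lemmatized first.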


In [None]:
# Keep only the relevant pages (copy to avoid SettingWithCopyWarning on later column assignments)
df = df[df["is_relevant"]].copy()

## Co-mentioned Brands; Extract brands mentioned in the same page

The `co_mentioned_brands` feature captures other brand names mentioned on the same page as the target `brand_name`. For example, if `brand_name = 'Primark'` and the page also includes `Reiss` and `Burberry`, then `co_mentioned_brands = ['Reiss', 'Burberry']`. The main brand is excluded from this list.
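
The rule can be tried on a toy input first. This is a minimal sketch with made-up entities; `co_mentions` and `BRANDS` are hypothetical names, and matching is done case-insensitively so that mixed-case brand names are still found.

```python
BRANDS = ["primark", "reiss", "burberry", "Vivienne Westwood"]


def co_mentions(entities, primary_brand, brands=BRANDS):
    """Return the sorted, de-duplicated brands other than `primary_brand`."""
    known = {b.lower() for b in brands}
    primary = primary_brand.lower()
    found = {
        name.lower()
        for name, _label in entities
        if name.lower() in known and name.lower() != primary
    }
    return sorted(found)


entities = [
    ("Primark", "ORG"), ("Reiss", "ORG"), ("Burberry", "ORG"),
    ("Reiss", "ORG"), ("London", "GPE"),
]
result = co_mentions(entities, "Primark")
print(result)
```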

In [None]:
# Define a list of target brands we want to track co-mentions for
brands = ['primark', 'asos', 'burberry', 'river island', 'reiss', 'superdry', 'ted baker', 'zara', 'Vivienne Westwood', 'john lewis']
# Lowercased lookup set, so mixed-case entries such as 'Vivienne Westwood' still match
brands_lower = {b.lower() for b in brands}

# Function to extract co-mentioned brands from a list of named entities
def extract_co_mentions_from_entities(entities, primary_brand):
    mentions = set()  # Other brands mentioned in the same text (a set removes duplicates)
    for name, label in entities:  # Each entity is a (name, label) tuple, like ('primark', 'ORG')
        name_lower = name.lower()  # Convert entity name to lowercase for case-insensitive comparison
        # Check if the name is one of the known brands, and not the same as the main brand for the row
        if name_lower in brands_lower and name_lower != primary_brand.lower():
            mentions.add(name_lower)  # Add it to the co-mention set
    return list(mentions)  # Return a list so the column holds plain Python lists

# Apply the co-mention extraction function to each row of the DataFrame
df['co_mentioned_brands'] = df.progress_apply(
    lambda row: extract_co_mentions_from_entities(row['entities'], row['brand_name']),
    axis=1  # Apply the function row by row
)

In [None]:
# astype(bool) maps empty lists and None to False, so this keeps rows with at least one co-mention
df[df["co_mentioned_brands"].astype(bool)].head()

In [None]:
# Count rows with at least one co-mentioned brand
count = df["co_mentioned_brands"].astype(bool).sum()
print(count)

No insight was gained from `co_mentioned_brands`.

In [None]:
# Print both values; a notebook cell only displays its last bare expression
print(df.shape)
print(list(df.columns))

In [None]:
df.head()

## Save as a checkpoint

**Note:** After downloading the dataset once, the code block below should be intentionally hidden to prevent unintentional multiple downloads. This is considered a best practice to avoid redundant operations that may consume unnecessary resources or result in duplicate file writes.

In [None]:
os.getcwd()

## Define main roots

In [None]:
# Create output path (current_directory is the parent of the working directory, defined in Notebook Setup)
csv_path = Path(current_directory) / "data/csv_data/nlp-added-nlp-features-1.csv"

# Save DataFrame
df.to_csv(csv_path, index=False)

In [None]:
csv_path

In [None]:
# Read CSV
df = pd.read_csv(csv_path)
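
Writing to CSV stores list-valued columns such as `entities` and `co_mentioned_brands` as their string representation, so they come back as plain strings after reading. The sketch below reproduces that round trip with the standard library and made-up values, and shows how `ast.literal_eval` can restore the list.

```python
import ast
import csv
import io

row = {"brand_name": "zara", "co_mentioned_brands": ["asos", "reiss"]}

# Write one row to an in-memory CSV; the list is serialised via str()
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(row))
writer.writeheader()
writer.writerow(row)

# Read it back: every value is now a string
buffer.seek(0)
loaded = next(csv.DictReader(buffer))
print(type(loaded["co_mentioned_brands"]).__name__, loaded["co_mentioned_brands"])

# literal_eval parses Python literals only, which makes it safer than eval here
restored = ast.literal_eval(loaded["co_mentioned_brands"])
print(restored)
```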

# 4 Keyword Extraction

## KeyBERT: Keyword Extraction with BERT (Theme-based Insight)

KeyBERT is a minimal and easy-to-use keyword extraction technique that leverages BERT embeddings to identify the most relevant keywords or keyphrases in a document. It uses contextual word embeddings from BERT to compare document embeddings with candidate keyword embeddings, selecting those with the highest semantic similarity.

### Key Features:
- Utilizes pre-trained BERT models for semantic similarity
- Extracts keywords that are contextually relevant
- Supports customization with different embedding models and parameters
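
The ranking idea can be sketched with plain cosine similarity. The vectors below are made-up three-dimensional stand-ins for real embeddings, and `rank_candidates` is a hypothetical helper; KeyBERT itself obtains the embeddings from a language model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_candidates(doc_vec, candidates, top_n=2):
    """Score every candidate against the document vector and keep the best top_n."""
    scored = [(word, cosine_similarity(doc_vec, vec)) for word, vec in candidates.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


doc_embedding = [0.9, 0.1, 0.3]          # invented document vector
candidate_embeddings = {                  # invented candidate vectors
    "sustainability": [0.8, 0.2, 0.4],
    "checkout": [0.1, 0.9, 0.1],
    "tailoring": [0.7, 0.0, 0.6],
}
top = rank_candidates(doc_embedding, candidate_embeddings)
print([word for word, _ in top])
```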

In [None]:
kw_model = KeyBERT()

df['keywords'] = df['content'].progress_apply(
    lambda x: [kw[0] for kw in kw_model.extract_keywords(x, top_n=5)]
)

## Save as a checkpoint

**Note:** After downloading the dataset once, the code block below should be intentionally hidden to prevent unintentional multiple downloads. This is considered a best practice to avoid redundant operations that may consume unnecessary resources or result in duplicate file writes.

In [None]:
# Create output path (current_directory is the parent of the working directory, defined in Notebook Setup)
csv_path = Path(current_directory) / "data/csv_data/nlp-added-nlp-features-2.csv"

# Save DataFrame
df.to_csv(csv_path, index=False)

In [None]:
# Load DataFrame
df = pd.read_csv(csv_path)

In [None]:
df.head()

## Convert columns to correct data types

In [None]:
# Convert numeric columns
numeric_cols = ['content_length', 'mention_count']
df[numeric_cols] = df[numeric_cols].apply(pd.to_numeric, errors='coerce')

# Convert boolean columns
df['has_ugc_keyword'] = df['has_ugc_keyword'].astype(bool)
df['is_relevant'] = df['is_relevant'].astype(bool)

# Remaining object columns should generally stay as string (object) unless a specific conversion is needed

# Verify the changes
print(df.dtypes)

## Rename column

In [None]:
df.rename(columns={'content_length': 'page_length'}, inplace=True)

## Top N words by brand

In [None]:
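# The keywords column holds strings after the CSV round trip; parse them safely with ast.literal_eval
all_keywords = [kw for kws in df['keywords'] for kw in ast.literal_eval(kws)]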
keyword_counts = Counter(all_keywords)

# top 20 keywords
top_keywords = keyword_counts.most_common(20)
labels, values = zip(*top_keywords)

plt.figure(figsize=(12, 6))
plt.barh(labels[::-1], values[::-1])
plt.title("Top 20 Keywords Overall")
plt.xlabel("Frequency")
plt.tight_layout()
plt.show()

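## Filtering out neutral keywords

These are mostly the same generic retail terms that were used to build the `is_relevant` feature, so they carry little brand-specific signal and are removed here.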

In [None]:
neutral_words = {
    'blouse', 'cardigan', 'coat', 'delivery', 'discount', 'dress',
    'fashion', 'fit', 'jacket', 'jeans', 'look', 'pants', 'price',
    'return', 'sale', 'shirt', 'shop', 'shopping', 'shorts',
    'skirt', 'style', 'sweater', 't-shirt', 'trousers', 'trend','outfit', 'wear'
}

In [None]:
# Flatten keywords and filter out neutral words
all_keywords = [
    kw for kws in df['keywords']
    for kw in ast.literal_eval(kws)  # safer than eval for stringified lists
    if kw not in neutral_words
]

# Count keywords
keyword_counts = Counter(all_keywords)

# Get top 20
top_keywords = keyword_counts.most_common(20)
labels, values = zip(*top_keywords)

# Plot
plt.figure(figsize=(12, 6))
plt.barh(labels[::-1], values[::-1])
plt.title("Top 20 Keywords Overall (Filtered)")
plt.xlabel("Frequency")
plt.tight_layout()
plt.show()


In [None]:
brand_keywords = df.groupby('brand_name')['keywords'].apply(
    lambda x: [kw for sublist in x for kw in ast.literal_eval(sublist) if kw not in neutral_words]
)

top_brand_keywords = brand_keywords.apply(lambda x: Counter(x).most_common(20))
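
The per-brand cells that follow repeat one pattern: filter neutral words, count, and take the top N. That pattern can be expressed as a single loop. The sketch below uses plain dictionaries and made-up keywords rather than the notebook's DataFrame; `top_keywords_by_brand` is a hypothetical helper.

```python
from collections import Counter

NEUTRAL = {"dress", "sale"}

# Hypothetical brand -> flat keyword list, mirroring the shape of brand_keywords
brand_keyword_lists = {
    "zara": ["denim", "dress", "denim", "linen"],
    "asos": ["sale", "denim", "festival", "festival"],
}


def top_keywords_by_brand(keyword_lists, neutral_words, top_n=2):
    """Return {brand: [(keyword, count), ...]} with neutral words removed."""
    result = {}
    for brand, kws in keyword_lists.items():
        counts = Counter(kw for kw in kws if kw not in neutral_words)
        result[brand] = counts.most_common(top_n)
    return result


top_by_brand = top_keywords_by_brand(brand_keyword_lists, NEUTRAL)
for brand, pairs in top_by_brand.items():
    print(brand, pairs)
```

Keeping the results in one dictionary keyed by brand also avoids overwriting shared variables such as `keyword_counts` from one brand to the next.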

In [None]:
# Selected brand
brand = "zara"
keywords = dict(top_brand_keywords[brand])

plt.figure(figsize=(8,4))
sns.barplot(x=list(keywords.values()), y=list(keywords.keys()))
plt.title(f"Top Keywords for {brand}")
plt.xlabel("Frequency")
plt.tight_layout()
plt.show()

In [None]:
# Selected brand
brand = "zara"
filtered_keywords = [kw for kw in brand_keywords[brand] if kw not in neutral_words]

# Counter to Dataframe
keyword_counts = Counter(filtered_keywords)
df_top_keywords_zara = pd.DataFrame(keyword_counts.most_common(40), columns=["keyword", "count"])

df_top_keywords_zara

In [None]:
# Selected brand
brand = "john lewis"
keywords = dict(top_brand_keywords[brand])

plt.figure(figsize=(8,4))
sns.barplot(x=list(keywords.values()), y=list(keywords.keys()))
plt.title(f"Top Keywords for {brand}")
plt.xlabel("Frequency")
plt.tight_layout()
plt.show()

In [None]:
# Selected brand
brand = "john lewis"
filtered_keywords = [kw for kw in brand_keywords[brand] if kw not in neutral_words]

# Counter to Dataframe
keyword_counts = Counter(filtered_keywords)
df_top_keywords_john_lewis = pd.DataFrame(keyword_counts.most_common(40), columns=["keyword", "count"])

df_top_keywords_john_lewis

In [None]:
# Selected brand
brand = "Vivienne Westwood"
keywords = dict(top_brand_keywords[brand])

plt.figure(figsize=(8,4))
sns.barplot(x=list(keywords.values()), y=list(keywords.keys()))
plt.title(f"Top Keywords for {brand}")
plt.xlabel("Frequency")
plt.tight_layout()
plt.show()

In [None]:
# Selected brand
brand = "Vivienne Westwood"
filtered_keywords = [kw for kw in brand_keywords[brand] if kw not in neutral_words]

# Counter to Dataframe
keyword_counts = Counter(filtered_keywords)
df_top_keywords_vivienne_westwood = pd.DataFrame(keyword_counts.most_common(40), columns=["keyword", "count"])

df_top_keywords_vivienne_westwood

In [None]:
# Selected brand
brand = "ted baker"
keywords = dict(top_brand_keywords[brand])

plt.figure(figsize=(8,4))
sns.barplot(x=list(keywords.values()), y=list(keywords.keys()))
plt.title(f"Top Keywords for {brand}")
plt.xlabel("Frequency")
plt.tight_layout()
plt.show()

In [None]:
# Selected brand
brand = "ted baker"
filtered_keywords = [kw for kw in brand_keywords[brand] if kw not in neutral_words]

# Counter to Dataframe
keyword_counts = Counter(filtered_keywords)
df_top_keywords_ted_baker = pd.DataFrame(keyword_counts.most_common(40), columns=["keyword", "count"])

df_top_keywords_ted_baker

In [None]:
# Selected brand
brand = "superdry"
keywords = dict(top_brand_keywords[brand])

plt.figure(figsize=(8,4))
sns.barplot(x=list(keywords.values()), y=list(keywords.keys()))
plt.title(f"Top Keywords for {brand}")
plt.xlabel("Frequency")
plt.tight_layout()
plt.show()

In [None]:
# Selected brand
brand = "superdry"
filtered_keywords = [kw for kw in brand_keywords[brand] if kw not in neutral_words]

# Counter to Dataframe
keyword_counts = Counter(filtered_keywords)
df_top_keywords_superdry = pd.DataFrame(keyword_counts.most_common(40), columns=["keyword", "count"])

df_top_keywords_superdry

In [None]:
# Selected brand
brand = "reiss"
keywords = dict(top_brand_keywords[brand])

plt.figure(figsize=(8,4))
sns.barplot(x=list(keywords.values()), y=list(keywords.keys()))
plt.title(f"Top Keywords for {brand}")
plt.xlabel("Frequency")
plt.tight_layout()
plt.show()

In [None]:
# Selected brand
brand = "reiss"
filtered_keywords = [kw for kw in brand_keywords[brand] if kw not in neutral_words]

# Counter to Dataframe
keyword_counts = Counter(filtered_keywords)
df_top_keywords_reiss = pd.DataFrame(keyword_counts.most_common(40), columns=["keyword", "count"])

df_top_keywords_reiss

In [None]:
# Selected brand
brand = "river island"
keywords = dict(top_brand_keywords[brand])

plt.figure(figsize=(8,4))
sns.barplot(x=list(keywords.values()), y=list(keywords.keys()))
plt.title(f"Top Keywords for {brand}")
plt.xlabel("Frequency")
plt.tight_layout()
plt.show()

In [None]:
# Selected brand
brand = "river island"
filtered_keywords = [kw for kw in brand_keywords[brand] if kw not in neutral_words]

# Counter to Dataframe
keyword_counts = Counter(filtered_keywords)
df_top_keywords_river_island = pd.DataFrame(keyword_counts.most_common(40), columns=["keyword", "count"])

df_top_keywords_river_island

In [None]:
# Selected brand
brand = "burberry"
keywords = dict(top_brand_keywords[brand])

plt.figure(figsize=(8,4))
sns.barplot(x=list(keywords.values()), y=list(keywords.keys()))
plt.title(f"Top Keywords for {brand}")
plt.xlabel("Frequency")
plt.tight_layout()
plt.show()

In [None]:
brand = "burberry"
filtered_keywords = [kw for kw in brand_keywords[brand] if kw not in neutral_words]

# Counter to Dataframe
keyword_counts = Counter(filtered_keywords)
df_top_keywords_burberry = pd.DataFrame(keyword_counts.most_common(40), columns=["keyword", "count"])

df_top_keywords_burberry

In [None]:
# Selected brand
brand = "asos"
keywords = dict(top_brand_keywords[brand])

plt.figure(figsize=(8,4))
sns.barplot(x=list(keywords.values()), y=list(keywords.keys()))
plt.title(f"Top Keywords for {brand}")
plt.xlabel("Frequency")
plt.tight_layout()
plt.show()

In [None]:
brand = "asos"
filtered_keywords = [kw for kw in brand_keywords[brand] if kw not in neutral_words]

# Counter to Dataframe
keyword_counts = Counter(filtered_keywords)
df_top_keywords_asos = pd.DataFrame(keyword_counts.most_common(40), columns=["keyword", "count"])

df_top_keywords_asos

In [None]:
# Selected brand
brand = "primark"
keywords = dict(top_brand_keywords[brand])

plt.figure(figsize=(8,4))
sns.barplot(x=list(keywords.values()), y=list(keywords.keys()))
plt.title(f"Top Keywords for {brand}")
plt.xlabel("Frequency")
plt.tight_layout()
plt.show()

In [None]:
brand = "primark"
filtered_keywords = [kw for kw in brand_keywords[brand] if kw not in neutral_words]

# Counter to Dataframe
keyword_counts = Counter(filtered_keywords)
df_top_keywords_primark = pd.DataFrame(keyword_counts.most_common(40), columns=["keyword", "count"])

df_top_keywords_primark

In [None]:
# Selected brand
brand = "Vivienne Westwood"
keywords = dict(top_brand_keywords[brand])

plt.figure(figsize=(8,4))
sns.barplot(x=list(keywords.values()), y=list(keywords.keys()))
plt.title(f"Top Keywords for {brand}")
plt.xlabel("Frequency")
plt.tight_layout()
plt.show()

In [None]:
brand = "Vivienne Westwood"
filtered_keywords = [kw for kw in brand_keywords[brand] if kw not in neutral_words]

# Counter to Dataframe
keyword_counts = Counter(filtered_keywords)
df_top_keywords_vivienne_westwood = pd.DataFrame(keyword_counts.most_common(40), columns=["keyword", "count"])

df_top_keywords_vivienne_westwood

In [None]:
# Label each dataframe with its brand
df_top_keywords_zara['brand'] = 'zara'
df_top_keywords_vivienne_westwood['brand'] = 'Vivienne Westwood'
df_top_keywords_john_lewis['brand'] = 'john lewis'
df_top_keywords_primark['brand'] = 'primark'
df_top_keywords_asos['brand'] = 'asos'
df_top_keywords_burberry['brand'] = 'burberry'
df_top_keywords_superdry['brand'] = 'superdry'
df_top_keywords_ted_baker['brand'] = 'ted baker'
df_top_keywords_river_island['brand'] = 'river island'
df_top_keywords_reiss['brand'] = 'reiss'

# Combine all into one dataframe
df_all = pd.concat([
    df_top_keywords_zara,
    df_top_keywords_vivienne_westwood,
    df_top_keywords_john_lewis,
    df_top_keywords_primark,
    df_top_keywords_asos,
    df_top_keywords_burberry,
    df_top_keywords_superdry,
    df_top_keywords_ted_baker,
    df_top_keywords_river_island,
    df_top_keywords_reiss
], ignore_index=True)

#  Create a pivot table with keywords across all brands
pivot_df = df_all.pivot_table(
    index='keyword',
    columns='brand',
    values='count',
    fill_value=0
).reset_index()

# Find keywords that appear in **all brands**
brands = [
    'zara', 'Vivienne Westwood', 'john lewis', 'primark',
    'asos', 'burberry', 'superdry', 'ted baker', 'river island', 'reiss'
]

common_keywords = pivot_df[
    pivot_df[brands].gt(0).all(axis=1)  # keyword appears in all brands
].sort_values(by=brands, ascending=False)

# Find keywords unique to **only one** brand
unique_keywords = pivot_df[
    pivot_df[brands].gt(0).sum(axis=1) == 1  # keyword appears in only one brand
]
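
The common/unique logic can be checked on a small example with sets. The keywords below are made up; the point is only that "common" is an intersection across all brands, while "unique" means a keyword is listed by exactly one brand.

```python
from collections import Counter

# Hypothetical brand -> set of top keywords
brand_top = {
    "zara": {"denim", "linen", "collection"},
    "asos": {"denim", "festival", "collection"},
    "reiss": {"tailoring", "denim", "collection"},
}

# Keywords present for every brand: intersection of all sets
common = set.intersection(*brand_top.values())

# Keywords present for exactly one brand: count how many brands list each keyword
brand_counts = Counter(kw for kws in brand_top.values() for kw in kws)
unique = {kw for kw, n in brand_counts.items() if n == 1}

print(sorted(common))
print(sorted(unique))
```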

In [None]:
print(common_keywords)

In [None]:
from IPython.display import display, HTML

# Create HTML layout with side-by-side display and section headers
html_output = f"""
<h2>Keyword Comparison Across Brands</h2>

<div style="display: flex; gap: 40px;">

  <div style="flex: 1;">
    <h3>Common Keywords (All Brands)</h3>
    {common_keywords.to_html(index=False)}
  </div>

  <div style="flex: 1;">
    <h3>Unique Keywords (Only One Brand)</h3>
    {unique_keywords.to_html(index=False)}
  </div>

</div>
"""

display(HTML(html_output))

## Heatmap of Top Keywords

In [None]:
# Create a long-form DataFrame with brand, keyword, and count
brand_keyword_df = pd.DataFrame([
    (brand, kw, count)
    for brand, kws in brand_keywords.items()  # brand_keywords: dict of brand → list of keywords
    for kw, count in Counter(kws).items()     # count keywords for each brand
    if kw not in neutral_words                # exclude neutral words
], columns=['brand', 'kw', 'count'])          # name the columns

# Pivot to make a matrix: rows = keywords, columns = brands, values = counts
pivot = brand_keyword_df.pivot_table(
    index='kw', columns='brand', values='count', fill_value=0
)

# Select top 20 keywords by total count across all brands
top_keywords = pivot.sum(axis=1).sort_values(ascending=False).head(20).index
pivot = pivot.loc[top_keywords]  # Filter to keep only top keywords

# Plot heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(pivot, annot=True, fmt=".0f", cmap='YlGnBu')
plt.title("Top Keywords Across Brands (Filtered)")
plt.ylabel("Keyword")
plt.xlabel("Brand")
plt.tight_layout()
plt.show()


## Keyword Word Cloud - Overall top keyword

In [None]:
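# keyword_counts was overwritten by the per-brand cells, so recount the overall (filtered) keywords
overall_counts = Counter(all_keywords)
# Use a name other than `wordcloud` to avoid shadowing the imported module
wc = WordCloud(width=800, height=400, background_color='white').generate_from_frequencies(overall_counts)

plt.figure(figsize=(10, 5))
plt.imshow(wc, interpolation='bilinear')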
plt.axis('off')
plt.title("Keyword Word Cloud")
plt.show()

## Keyword Word Cloud per Brand 

Keywords shared by all brands are filtered out before each brand's word cloud is generated.

In [None]:
# Add 'brand' column to each dataframe
df_top_keywords_zara['brand'] = 'zara'
df_top_keywords_vivienne_westwood['brand'] = 'Vivienne Westwood'
df_top_keywords_john_lewis['brand'] = 'john lewis'
df_top_keywords_primark['brand'] = 'primark'
df_top_keywords_asos['brand'] = 'asos'
df_top_keywords_burberry['brand'] = 'burberry'
df_top_keywords_superdry['brand'] = 'superdry'
df_top_keywords_ted_baker['brand'] = 'ted baker'
df_top_keywords_river_island['brand'] = 'river island'
df_top_keywords_reiss['brand'] = 'reiss'

# Combine all dataframes into one
df_all = pd.concat([
    df_top_keywords_zara,
    df_top_keywords_vivienne_westwood,
    df_top_keywords_john_lewis,
    df_top_keywords_primark,
    df_top_keywords_asos,
    df_top_keywords_burberry,
    df_top_keywords_superdry,
    df_top_keywords_ted_baker,
    df_top_keywords_river_island,
    df_top_keywords_reiss
], ignore_index=True)

# Create pivot table
brands = [
    'zara', 'Vivienne Westwood', 'john lewis', 'primark',
    'asos', 'burberry', 'superdry', 'ted baker', 'river island', 'reiss'
]

pivot_df = df_all.pivot_table(
    index='keyword',
    columns='brand',
    values='count',
    fill_value=0
).reset_index()

# Identify common keywords (those that appear in **all brands**)
common_keywords = pivot_df[
    pivot_df[brands].gt(0).all(axis=1)
]['keyword']

# Filter out common keywords from df_all
df_filtered = df_all[~df_all['keyword'].isin(common_keywords)]

# Word cloud function
def generate_wordcloud(df, brand_name):
    brand_df = df[df['brand'] == brand_name]
    word_freq = dict(zip(brand_df['keyword'], brand_df['count']))
    if word_freq:  # Only generate if there are words left after filtering
        wc = WordCloud(width=800, height=400, background_color='white').generate_from_frequencies(word_freq)
        plt.figure(figsize=(10, 5))
        plt.imshow(wc, interpolation='bilinear')
        plt.title(f'Word Cloud for {brand_name} (Excluding Common Keywords)', fontsize=14)
        plt.axis('off')
        plt.show()
    else:
        print(f"No keywords available for {brand_name} after filtering.")

# Generate word clouds for each brand
for brand in brands:
    generate_wordcloud(df_filtered, brand)

# 5 Sentiment Classification and Modelling

Because the polarity scores from VADER were too weak, this part of the project replaces them with a sentiment engine from Hugging Face, a more advanced technique for assessing the sentiment toward each brand.

Methods
- Uses Hugging Face Transformers
- Loads a pretrained model via Hugging Face Hub
- Applies pipeline() to abstract away tokenizer/model handling
- Returns label and score per input
- Is commonly used in production-ready sentiment tasks

## First Sentiment Model

#### Sentiment Engine; [Twitter-roBERTa-base for Sentiment Analysis - UPDATED (2022)](https://huggingface.co/cardiffnlp/twitter-roberta-base-sentiment-latest)
**`cardiffnlp/twitter-roberta-base-sentiment-latest`**

This model is a fine-tuned version of the RoBERTa-base architecture, adapted for sentiment analysis on social media content, particularly Twitter. It was trained on approximately 124 million tweets collected between January 2018 and December 2021, then fine-tuned for sentiment analysis with the TweetEval benchmark, which is widely used for evaluating classification tasks on short-form, informal text. In this project it is applied to retail brand sentiment.


- Trained on approximately 124 million tweets, making it well-suited for short, informal, and noisy web content.
- Fine-tuned for sentiment analysis with the TweetEval benchmark, which is widely accepted for sentiment classification.
- Adaptable to a variety of content types including editorial, blog posts, and general online discourse.
- Integrates with the Hugging Face `pipeline()` for efficient implementation.
- Generalizes broadly thanks to training on diverse social media content, so it can plausibly be applied to non-review data as well.


### Key Characteristics

- **Base Architecture**: RoBERTa-base (pretrained transformer model by Facebook AI)
- **Training Data**: ~124M tweets (Jan 2018 – Dec 2021)
- **Task**: Sentiment analysis (text classification)
- **Fine-tuning Benchmark**: TweetEval
- **Language**: English

### Label Mapping

The model produces sentiment scores using the following label mapping:

- `0` → Negative  
- `1` → Neutral  
- `2` → Positive

When used via the Hugging Face `transformers` pipeline with `pipeline("sentiment-analysis", model=...)`, it automatically maps these internal numeric labels to descriptive class names (e.g., `label='positive'`).



Each prediction from a Hugging Face sentiment model such as `cardiffnlp/twitter-roberta-base-sentiment-latest` also includes a confidence score: the probability the model assigns to its predicted label (also called the **confidence probability**).
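
How a label and its confidence fall out of the raw model outputs can be sketched with a hand-written softmax. The logits below are invented for illustration, and the label order follows the mapping listed under Label Mapping.

```python
import math

LABELS = ["negative", "neutral", "positive"]  # index 0, 1, 2


def softmax(logits):
    """Convert raw scores to probabilities; subtracting the max keeps exp() stable."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def label_with_confidence(logits):
    """Return the most probable label and the probability assigned to it."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[best], probs[best]


label, confidence = label_with_confidence([-1.2, 0.3, 2.1])  # made-up logits
print(label, round(confidence, 3))
```

The confidence is simply the largest softmax probability, which is why it can later be used as a threshold for discarding uncertain predictions.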

### Use Cases

- Sentiment classification for brand monitoring
- Social media opinion mining
- Public sentiment tracking over time
- Fine-grained analysis in retail, politics, and events

### Integration

This model is available on the Hugging Face Hub and can be loaded with the `transformers` library, either through `pipeline()` or through `AutoTokenizer` and `AutoModelForSequenceClassification`.
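
A minimal loading sketch, assuming the `transformers` package is installed and the Hub is reachable. The wrapper name `load_sentiment_pipeline` is hypothetical; the import sits inside the function so that defining the helper does not itself trigger a model download.

```python
MODEL_NAME = "cardiffnlp/twitter-roberta-base-sentiment-latest"


def load_sentiment_pipeline(model_name: str = MODEL_NAME):
    """Build a sentiment pipeline; the model is downloaded on first use."""
    # Imported lazily so that this helper can be defined without transformers loaded
    from transformers import pipeline

    return pipeline("sentiment-analysis", model=model_name, tokenizer=model_name)


# Usage (requires network access and the transformers package):
# classifier = load_sentiment_pipeline()
# classifier("The new collection is lovely.")  # returns a list of dicts with 'label' and 'score'
```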


## Limitations of `cardiffnlp/twitter-roberta-base-sentiment-latest`

- **English-Only Support**  
  The model is trained exclusively on English tweets and does not support other languages.

- **Limited Contextual Understanding**  
  It may misinterpret sarcasm, irony, idioms, or other nuanced language elements.

- **Training Data Constraints**  
  Based on tweets from 2018–2021; may not reflect recent slang or language trends.

- **Token Limitations**  
  Can only process up to 512 tokens per input. Longer texts are truncated, possibly omitting important context.

- **Preprocessing Requirements**  
  Inputs should replace usernames with `@user` and URLs with `http` for accurate results, matching the format used during training.

- **Domain Specificity**  
  Optimized for Twitter content—performance may degrade on data outside this domain (e.g., formal text, customer reviews).
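
The preprocessing requirement above can be sketched as a small helper. This is an illustrative version of the masking convention; the name `preprocess_tweet` is an assumption and the function is not applied elsewhere in this notebook.

```python
def preprocess_tweet(text):
    """Mask user mentions and links the way the training data was masked."""
    tokens = []
    for token in str(text).split(" "):
        if token.startswith("@") and len(token) > 1:
            token = "@user"   # anonymize usernames
        elif token.startswith("http"):
            token = "http"    # collapse URLs to a placeholder
        tokens.append(token)
    return " ".join(tokens)

# Hypothetical input text
print(preprocess_tweet("@asos_help my order is late https://example.com/track"))
```

For web-page content such as this dataset, mentions and links are probably rarer than in tweets, so the effect of this step may be small.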



In [None]:
# Copy the original DataFrame
df_robert = df.copy()

In [None]:
# Load model and tokenizer
model_name = "cardiffnlp/twitter-roberta-base-sentiment-latest"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Define sentiment labels
sentiment_labels = ['negative', 'neutral', 'positive']

# Function to get sentiment and confidence
def get_sentiment_with_confidence(text):
    text = str(text).strip().replace('\n', ' ')
    encoded_input = tokenizer(text, return_tensors='pt', truncation=True, max_length=512)
    with torch.no_grad():
        output = model(**encoded_input)
    scores = output.logits[0].numpy()
    probs = softmax(scores)
    pred_index = probs.argmax()
    return sentiment_labels[pred_index], float(probs[pred_index])

# Apply function with progress bar
tqdm.pandas(desc="Processing sentiments")
df_robert[['sentiment', 'confidence']] = df_robert['content'].progress_apply(
    lambda x: pd.Series(get_sentiment_with_confidence(x))
)

# View result
print(df_robert[['content', 'sentiment', 'confidence']])


In [None]:
df_robert.head()

## Save as a checkpoint

**Note:** After the dataset has been downloaded once, the code block below should be hidden to prevent accidental repeated downloads. This avoids redundant operations that consume unnecessary resources or produce duplicate file writes.

In [None]:
# Create Output path/folder name
csv_path = current_path.parent / "data/csv_data/robert-sentiment.csv"

# Save DataFrame
df_robert.to_csv(csv_path, index=False)

In [None]:
# Read CSV
df_robert = pd.read_csv(csv_path)

In [None]:
df_filtered_robert = df_robert[df_robert['confidence'] >= 0.75].copy()

In [None]:
df_filtered_robert.head()

## Sentiment Distribution Per Brand

In [None]:
# Filter out neutral sentiment
df_pos_neg = df_filtered_robert[df_filtered_robert['sentiment'].isin(['positive', 'negative'])]

# Count sentiment per brand
counts = df_pos_neg.groupby(['brand_name', 'sentiment']).size().reset_index(name='count')

# Compute total counts per brand
counts['total'] = counts.groupby('brand_name')['count'].transform('sum')

# Compute percentage
counts['percentage'] = (counts['count'] / counts['total']) * 100

# Final summary table
summary_df = counts[['brand_name', 'sentiment', 'percentage']]
summary_df

## Average Confidence Per Sentiment Type

In [None]:
# Create pivot table for average confidence score
avg_confidence_table = df_filtered_robert.pivot_table(
    index='sentiment',
    columns='brand_name',
    values='confidence',
    aggfunc='mean'
)

# Format as percentage with 2 decimal places
avg_confidence_table = (avg_confidence_table * 100).applymap(lambda x: f"{x:.2f}%")

# Display the table
print("Average Confidence Score per Sentiment and Brand:\n")
avg_confidence_table

## Net Sentiment Score (NSS) by Brand
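
NSS is computed here as the share of positive mentions minus the share of negative mentions, with neutral mentions included in the total (matching the cell below). A minimal sketch of that arithmetic, using made-up counts and a hypothetical helper name:

```python
def net_sentiment_score(positive, negative, neutral=0):
    """NSS in percentage points: %positive - %negative over all mentions."""
    total = positive + negative + neutral
    if total == 0:
        return 0.0  # assumption: report 0 when a brand has no mentions
    return (positive - negative) / total * 100

# Hypothetical counts for one brand
print(f"{net_sentiment_score(positive=60, negative=15, neutral=25):.2f}%")  # 45.00%
```

The score ranges from -100 (all negative) to +100 (all positive); a large neutral share pulls it toward 0.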

In [None]:
# Count sentiment occurrences per brand
sentiment_counts = df_filtered_robert.groupby(['brand_name', 'sentiment']).size().unstack(fill_value=0)

# Calculate total mentions per brand
sentiment_counts['total'] = sentiment_counts.sum(axis=1)

# Calculate percentages
sentiment_counts['%positive'] = (sentiment_counts.get('positive', 0) / sentiment_counts['total']) * 100
sentiment_counts['%negative'] = (sentiment_counts.get('negative', 0) / sentiment_counts['total']) * 100

# Calculate NSS
sentiment_counts['NSS'] = sentiment_counts['%positive'] - sentiment_counts['%negative']

# Format NSS as percentage with 2 decimal places
sentiment_counts['NSS'] = sentiment_counts['NSS'].map(lambda x: f"{x:.2f}%")

# Display only NSS per brand
nss_scores = sentiment_counts[['NSS']]

print("Net Sentiment Score (NSS) per Brand:\n")
print(nss_scores)

## Second Sentiment Model


**`DistilBERT Clothing Review Model`**

- Trained specifically on customer reviews from a structured dataset. It looks suitable at first, but the content features in this project are generic, unstructured text.
- Optimized for opinion mining in product feedback rather than unstructured website or article content.
- Narrow domain scope may lead to reduced performance on broader fashion-related discussions.

## Limitations of `ongaunjie/distilbert-cloths-sentiment`

- **Domain-specific training**  
  The model is fine-tuned on women's clothing reviews, so it may underperform on other domains such as electronics or general social media text.

- **Limited label set**  
  The sentiment classification is limited to three categories: `positive`, `neutral`, and `negative`. It does not support nuanced sentiment like sarcasm, mixed emotion, or multi-aspect sentiment.

- **Short to medium text**  
  It performs best on short to medium-length reviews (1-3 sentences). Long product descriptions or multi-topic paragraphs may be truncated or misclassified.

- **English-only**  
  The model is trained on English text and may not generalize well to other languages or multilingual content.

- **No aspect-based sentiment**  
  It does not identify sentiment by product features (e.g., "quality is good but size is small"), which is often required in fine-grained retail analysis.

- **Batch inference may vary**  
  On large batches or very noisy input, the model may return inconsistent results or fail unless preprocessed properly.


In [None]:
# Copy the original DataFrame
df_bert = df.copy()

In [None]:
# Load model and tokenizer
model_name = "ongaunjie/distilbert-cloths-sentiment"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Map class indices to labels using the model config
id2label = model.config.id2label

# Function to get sentiment and confidence
def get_sentiment_with_confidence(text):
    text = str(text).strip().replace('\n', ' ')
    encoded_input = tokenizer(text, return_tensors='pt', truncation=True, max_length=512)
    with torch.no_grad():
        output = model(**encoded_input)
    scores = output.logits[0].numpy()
    probs = softmax(scores)
    # Look the label up by index instead of relying on dict value order
    pred_index = int(probs.argmax())
    return id2label[pred_index], float(probs[pred_index])

# Apply with progress bar
tqdm.pandas(desc="Analyzing sentiment")
df_bert[['sentiment', 'confidence']] = df_bert['content'].progress_apply(
    lambda x: pd.Series(get_sentiment_with_confidence(x))
)

# Display results
print(df_bert[['content', 'sentiment', 'confidence']])


## Save as a checkpoint

**Note**: The code block below may have been intentionally hidden to prevent accidental repeated downloads of the dataset during testing. This avoids redundant operations that consume unnecessary resources or produce duplicate file writes.

In [None]:
# Create Output path/folder name
csv_path = current_path.parent / "data/csv_data/bert-sentiment.csv"

# Save DataFrame
df_bert.to_csv(csv_path, index=False)

In [None]:
# Load DataFrame
df_bert = pd.read_csv(csv_path)

In [None]:
df_bert.head()

## Drop the old web_category to revise the logic

In [None]:
df.head()

## Convert Data to correct category

In [None]:
# Use the lowercased source domain as the URL host
df['url_host'] = df['source_domain'].astype(str).str.lower()

# Drop the old web_category column if it exists
if 'web_category' in df.columns:
    df.drop(columns=['web_category'], inplace=True)

# Define the revised categorization logic
def categorize_website(row):
    domain = row['url_host']
    url = str(row['url']).lower()

    if re.search(r'cnn|bbc|nytimes|reuters|guardian|forbes|bloomberg|ft\.com|cnbc|npr|washingtonpost|wsj', domain) \
       or re.search(r'/news|/breaking|/politics|/world|/article|/headlines', url):
        return 'News'
    elif re.search(r'/blog|/post|/mypage|/mystory|/user|/forum|/profile|/comment|/thread', url) \
         or re.search(r'wordpress|medium|blogspot|tumblr|livejournal', domain):
        return 'Blogs & Community'
    elif re.search(r'/shop|/product|/buy|/cart|/checkout|/store|/item|/deal|/brand|/collection|/sale|/pricing', url) \
         or re.search(r'amazon|ebay|bestbuy|alibaba|etsy|shopify|shein|zara|nike|adidas', domain):
        return 'E-commerce & Commercial'
    elif re.search(r'\.edu$', domain) \
         or re.search(r'edu|university|college|khanacademy|coursera|edx|mit|harvard|stanford', domain) \
         or re.search(r'/learn|/curriculum|/syllabus|/classroom', url):
        return 'Education'
    elif re.search(r'netflix|hulu|spotify|imdb|rottentomatoes|disney|youtube|vimeo|soundcloud', domain) \
         or re.search(r'/music|/tv|/movies|/video|/trailer|/playlist|/watch', url):
        return 'Media & Entertainment'
    elif re.search(r'\.gov$', domain) \
         or re.search(r'gov|nasa|cdc|whitehouse|senate|house\.gov|europa\.eu', domain) \
         or re.search(r'/regulation|/policy|/bill|/law|/agency', url):
        return 'Government'
    elif re.search(r'/user|/status|/likes|/shares|/posts', url) \
        or re.search(r'reddit|facebook|twitter|tiktok|linkedin|pinterest|instagram|snapchat', domain):
        return 'Social'
    elif re.search(r'/investor|/financial|/annual-report|/results|/statement|/earnings|/balance|/report|/10-k|/sec', url) \
         or re.search(r'nasdaq|bloomberg|yahoo\.finance|marketwatch|investopedia', domain):
        return 'Financial'
    elif re.search(r'wikipedia\.org', domain):
        return 'Education'
    elif re.search(r'mozilla\.org', domain):
        return 'Media & Entertainment'
    elif re.search(r'archive\.org', domain):
        return 'Education'
    elif re.search(r'who\.int', domain):
        return 'Government'
    else:
        return 'Others'

# Apply new logic to create updated web_category column
df['web_category'] = df.apply(categorize_website, axis=1)

In [None]:
df['web_category'].value_counts()

No rows are assigned the `Social` web category in the sample RedPajama dataset.
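
One possible contributor, offered as an observation rather than a confirmed cause: `categorize_website` is a first-match-wins `if`/`elif` chain, so a URL containing `/user` or `/post` is claimed by the Blogs & Community rule before the Social rule is reached. A reduced sketch with only those two URL rules (the URLs and the helper name `first_match` are hypothetical):

```python
import re

# Reduced rule list in the same order as categorize_website (URL patterns only)
RULES = [
    ("Blogs & Community", r"/blog|/post|/user|/forum|/profile|/comment|/thread"),
    ("Social", r"/user|/status|/likes|/shares|/posts"),
]

def first_match(url):
    """Return the first category whose pattern matches, like an if/elif chain."""
    for category, pattern in RULES:
        if re.search(pattern, url.lower()):
            return category
    return "Others"

print(first_match("https://example.com/user/123/posts"))  # Blogs & Community
print(first_match("https://example.com/status/9"))        # Social
```

Only `/status`, `/likes`, and `/shares` paths can reach the Social rule through the URL; the domain patterns of the full function are not modelled here.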

In [None]:
df.head()

In [None]:
# Lowercase the url column and convert to string just in case
url_series = df['url'].astype(str).str.lower()

# Check if any row matches the 'Social' pattern
has_social_match = url_series.str.contains(
    r'reddit|facebook|twitter|tiktok|linkedin|pinterest|instagram|snapchat|/user|/status|/likes|/shares|/posts',
    regex=True
)

print("Any match:", has_social_match.any())
print("Total matches:", has_social_match.sum())

In [None]:
# Count web categories and exclude 'Others'
category_counts = df['web_category'].value_counts()
category_counts = category_counts[category_counts.index != 'Others']

# Plot horizontal bar chart
plt.figure(figsize=(10, 6))
category_counts.plot(kind='barh', color='steelblue', edgecolor='black')
plt.title('Document Count by Web Category (Excluding "Others")')
plt.xlabel('Number of Documents')
plt.ylabel('Web Category')
plt.gca().invert_yaxis()  # Highest count on top
plt.grid(axis='x', linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()

In [None]:
df.info()

In [None]:
# Convert numeric columns
numeric_cols = ['content_length', 'mention_count']
df[numeric_cols] = df[numeric_cols].apply(pd.to_numeric, errors='coerce')

# Convert boolean columns
df['has_ugc_keyword'] = df['has_ugc_keyword'].astype(bool)
df['is_relevant'] = df['is_relevant'].astype(bool)

# Remaining object columns should generally stay as string (object) unless a specific conversion is needed

# Verify the changes
print(df.dtypes)

## Box Plot of page_length

In [None]:
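# page_length is the renamed content_length column (no-op if already renamed)
df.rename(columns={'content_length': 'page_length'}, inplace=True)
print("Min page_length:", df['page_length'].min())
print("Max page_length:", df['page_length'].max())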

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 6))
sns.boxplot(data=df, x='brand_name', y='page_length')
plt.title('Box Plot of Page Length by Brand')
plt.xlabel('Brand')
plt.ylabel('Page Length')
plt.grid(True, axis='y')
plt.tight_layout()
plt.show()

## Change page length to categorical data

The numeric `page_length` column is kept alongside the new categorical bin.
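
A small sketch of the quantile binning idea, assuming pandas is available and using made-up page lengths. `pd.qcut` with `q=5` cuts at the 20th, 40th, 60th, and 80th percentiles, so each bin holds roughly the same number of rows.

```python
import pandas as pd

# Hypothetical page lengths; the real column comes from the dataset
lengths = pd.Series([120, 450, 800, 1500, 2300, 3100, 4800, 7600, 12000, 25000])

# Five equal-count bins, labelled from shortest to longest
bins = pd.qcut(lengths, q=5, labels=["very_short", "short", "medium", "long", "very_long"])

# Keep the numeric column next to its categorical bin
demo = pd.DataFrame({"page_length": lengths, "page_length_bin": bins})
print(demo["page_length_bin"].value_counts().sort_index())
```

Note that `pd.qcut` raises an error when quantile edges coincide, which can happen if many rows share one value (for example, many zeros after filling missing lengths with 0).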

In [None]:
df.info()

In [None]:
# Cast to string first so .str works even if the column is already numeric
df['page_length'] = pd.to_numeric(df['page_length'].astype(str).str.replace(',', ''), errors='coerce').fillna(0).astype(int)
df['word_count'] = pd.to_numeric(df['word_count'].astype(str).str.replace(',', ''), errors='coerce').fillna(0).astype(int)

In [None]:
df.info()

In [None]:
df['page_length_bin'] = pd.qcut(df['page_length'], q=5, labels=['very_short', 'short', 'medium', 'long', 'very_long'])

In [None]:
df.head()

In [None]:
df.info()

In [None]:
# Rename column
df.rename(columns={'source_domain': 'url_domain'}, inplace=True)
df.rename(columns={'content_length': 'page_length'}, inplace=True)

In [None]:
null_count = df['page_length'].isnull().sum()
print("Number of rows with null page_length:", null_count)

In [None]:
df.count()

## Check co_mentioned_brands

In [None]:
df_non_null_comention = df[
    df['co_mentioned_brands'].notnull() &
    (df['co_mentioned_brands'].str.strip() != '[]') &
    (df['co_mentioned_brands'].str.strip() != '')
]

df_non_null_comention

In [None]:
df_non_null_comention.count()

In [None]:
df.info()

Too few rows contain co-mentions to support further analysis. With a larger dataset, co-mentions across brands could be examined in more depth.

# 6 Correlation Analysis

In [None]:
df_corr = df.copy()

In [None]:
df['url_tld'].unique()

In [None]:
# Drop irrelevant columns to do one-hot encoding
drop_cols = ['url', 'content', 'url_domain',
            'entities', 'is_relevant', 'keywords', 'co_mentioned_brands', 'url_host']
df_corr.drop(columns=drop_cols, inplace=True)

In [None]:
df_corr.info()

In [None]:
df_corr.head()

In [None]:
# One-hot encode 'brand_name', 'has_ugc_keyword', 'web_category', 'url_tld', 'page_length_bin'
df_corr = pd.get_dummies(df_corr, columns=['brand_name', 'has_ugc_keyword', 'web_category', 'url_tld', 'page_length_bin'], drop_first=True)

In [None]:
# Check data type
print(df_corr.dtypes)

In [None]:
# Compute the correlation matrix
df_corr_matrix = df_corr.corr()

# Display the correlation matrix
df_corr_matrix

In [None]:
# Unstack the matrix, drop self-correlations, and sort
top_10_corr = (
    df_corr_matrix.unstack()
    .reset_index()
    .rename(columns={"level_0": "Feature_1", "level_1": "Feature_2", 0: "Correlation"})
)

# Remove duplicates (like A,B and B,A)
top_10_corr = top_10_corr[top_10_corr["Feature_1"] != top_10_corr["Feature_2"]]
top_10_corr["Pairs"] = top_10_corr[["Feature_1", "Feature_2"]].apply(lambda row: tuple(sorted(row)), axis=1)
top_10_corr = top_10_corr.drop_duplicates(subset="Pairs")

# Get top 10 highest absolute correlations
top_10_corr = top_10_corr.reindex(top_10_corr["Correlation"].abs().sort_values(ascending=False).index)
top_10_corr = top_10_corr.head(10)

# Display
top_10_corr[["Feature_1", "Feature_2", "Correlation"]]


In [None]:
correlation_matrix = df_corr.corr()

# Identify brand and web feature columns
brand_cols = [col for col in df_corr.columns if col.startswith('brand_name_')]
web_feature_cols = [col for col in df_corr.columns if any(prefix in col for prefix in ['web_category_', 'url_tld_', 'mention_count', 'has_ugc_keyword', 'page_length_bin'])]
# Extract correlation values between brands and web features
filtered_corr = correlation_matrix.loc[brand_cols, web_feature_cols]

# Keep only values with abs(correlation) > 0.2
threshold = 0.2
high_corr_only = filtered_corr.where(filtered_corr.abs() > threshold)

# Drop rows and columns where all values are NaN (i.e., below threshold)
high_corr_only.dropna(how='all', axis=0, inplace=True)
high_corr_only.dropna(how='all', axis=1, inplace=True)

# Plot only the high-correlation values
import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(12, 6))
sns.heatmap(high_corr_only, annot=True, cmap='coolwarm', fmt=".2f", cbar=True)
plt.title(f'Correlation (|r| > {threshold}): Brand Name vs Web Characteristics')
plt.tight_layout()
plt.show()


In [None]:
df_corr[['brand_name_primark', 'url_tld_uk']].corr()

In [None]:
# Calculate full correlation matrix
correlation_matrix = df_corr.corr()

# Filter only brand and url_tld columns
brand_cols = [col for col in df_corr.columns if col.startswith('brand_name_')]
url_tld_cols = [col for col in df_corr.columns if col.startswith('url_tld_')]

# Extract correlation values between brands and url_tld
filtered_corr = correlation_matrix.loc[brand_cols, url_tld_cols]

# Keep only high correlation values (threshold)
threshold = 0.2
high_corr_only = filtered_corr.where(filtered_corr.abs() > threshold)

# Drop rows and columns where all values are NaN
high_corr_only.dropna(how='all', axis=0, inplace=True)
high_corr_only.dropna(how='all', axis=1, inplace=True)

# Plot heatmap
import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 5))
sns.heatmap(high_corr_only, annot=True, cmap='coolwarm', fmt=".2f", cbar=True)
plt.title('Filtered Correlation: Brand Name vs Significant Web Characteristics')
plt.tight_layout()
plt.show()

In [None]:
df = pd.read_csv("data/csv_data/nlp-added-nlp-features-2.csv")

In [None]:
df.head()

In [None]:
# Count of missing (null) values per column
df.isnull().sum()

In [None]:
# Summary of missing values: count and percentage
missing_summary = pd.DataFrame({
    'Missing Count': df.isnull().sum(),
    'Missing %': (df.isnull().sum() / len(df)) * 100
})

# Sort descending by most missing
missing_summary = missing_summary[missing_summary['Missing Count'] > 0].sort_values(by='Missing Count', ascending=False)

print(missing_summary)

In [None]:
df = pd.read_csv("data/csv_data/nlp-added-nlp-features-2.csv")

In [None]:
df.dtypes

In [None]:
df.info()

## Convert to correct category

In [None]:
# Convert boolean columns
bool_cols = ['has_ugc_keyword', 'is_relevant']
df[bool_cols] = df[bool_cols].astype(bool)

# Convert float columns (already float64, just showing for clarity)
float_cols = ['language_score', 'language_perplexity', 'text_entropy']
df[float_cols] = df[float_cols].astype(float)

# Convert integer column (already int64, just showing for clarity)
df['mention_count'] = pd.to_numeric(df['mention_count'], errors='coerce')

# Convert numeric columns stored as object
object_numeric_cols = ['content_length', 'word_count']
df[object_numeric_cols] = df[object_numeric_cols].apply(pd.to_numeric, errors='coerce')

# Convert other object columns to string
string_cols = [
    'url', 'source_domain', 'url_tld', 'web_category', 'content',
    'brand_name', 'entities', 'co_mentioned_brands', 'keywords'
]
df[string_cols] = df[string_cols].astype(str)

# Optional: verify the changes
print(df.dtypes)

In [None]:
df.info()

In [None]:
df.describe()

In [None]:
df.head()

In [None]:
# Normalize brand names to lowercase for consistency
df['brand_name'] = df['brand_name'].str.lower()

# Count token frequency per brand
token_freq_df = df['brand_name'].value_counts().reset_index()
token_freq_df.columns = ['brand_name', 'token_freq']

# Add token frequency back to main DataFrame
df = df.merge(token_freq_df, on='brand_name', how='left')

# Define top-ranked brands
top_ranked = ['zara', 'burberry', 'asos', 'vivienne westwood', 'reiss']
df['is_top_ranked'] = df['brand_name'].isin(top_ranked).astype(int)

In [None]:
correlation = df[['token_freq', 'is_top_ranked']].corr().iloc[0, 1]
print(f"Correlation between token frequency and top-rank status: {correlation:.2f}")

## Heatmap

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Select only content-related numeric columns
content_features = ['content_length', 'language_score', 'language_perplexity',
                    'text_entropy', 'word_count', 'mention_count']

# Filter the DataFrame to keep only these columns
content_df = df[content_features]

# Drop rows with missing values in any of the selected columns
content_df = content_df.dropna()

# Compute the correlation matrix between selected features
corr_matrix = content_df.corr()

# Plot the heatmap
plt.figure(figsize=(8, 6))  # Set the size of the heatmap
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt=".2f", square=True)
plt.title("Correlation Heatmap of Content-Related Features")
plt.tight_layout()
plt.show()

# 7 Hypothesis Testing

After monitoring DeepSeek responses and matching patterns in them, a hypothesis framework was developed to analyze the pretraining dataset.

### Correlation Methods Used for Hypothesis Testing

| Method     | Use Case                                    | Assumptions                            |
|------------|---------------------------------------------|-----------------------------------------|
| **Pearson**  | Measures linear correlation between numeric values | Assumes a linear relationship; significance tests assume approximate normality |
| **Spearman** | Measures correlation based on rank order     | **Non-parametric** (no distribution assumptions) |

### Why Spearman?

Given the **small sample size (n = 3 brands)** and the potential **non-linear** nature of relationships, **Spearman is more appropriate**. It evaluates whether increases in one variable are associated with consistent changes in the ranking of the other — making it **robust and conservative** for this use case.
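
To illustrate the difference, here is a dependency-free sketch that computes Spearman's rho as Pearson's r on ranks. The data are made up (monotonic but far from linear), the helper names are hypothetical, and ties are not handled.

```python
def ranks(values):
    """Rank values from 1..n (assumes no ties, for simplicity)."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result

def pearson(x, y):
    """Plain Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman's rho is Pearson's r computed on the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical, monotonic but non-linear data
x_demo = [1, 2, 3, 4, 5]
y_demo = [1, 2, 3, 4, 1000]
print(round(pearson(x_demo, y_demo), 3), round(spearman(x_demo, y_demo), 3))
```

Spearman stays at its maximum because the ordering is perfectly preserved, while Pearson is pulled down by the single extreme value.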


## Generic Prompt for brand ranking and Specific brand provided

<div style="
    background-color: #e7f3fe;
    border-left: 6px solid #2196F3;
    padding: 15px;
    font-family: Arial, sans-serif;
    font-size: 15px;">
    💡 <b>Note:</b> Because of the small sample size (n=3), the <code>p-value</code> is mostly not significant and statistical significance cannot be established. The observed correlation (r = ...) should be interpreted as exploratory insight rather than conclusive evidence.

</div>


**Disclaimer: The p-value is not considered here because of the extremely small sample size. Correlation results are exploratory and serve to illustrate directional trends, not to confirm causal or predictive relationships.**
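
The point about sample size can be made concrete with an exact permutation argument. The sketch below enumerates every possible ordering of n tie-free ranks and reports the smallest two-sided p-value a perfect correlation could achieve. This is an illustration under the assumption of an exact permutation test; a library may report a different, approximate p-value.

```python
from itertools import permutations

def spearman_rho(rank_x, rank_y):
    """Spearman's rho from two tie-free rank lists (classic d^2 formula)."""
    n = len(rank_x)
    d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

def min_two_sided_p(n):
    """Smallest exact two-sided p-value attainable with n tie-free points."""
    base = tuple(range(1, n + 1))
    rhos = [spearman_rho(base, perm) for perm in permutations(base)]
    # Share of orderings at least as extreme as a perfect correlation
    extreme = sum(1 for r in rhos if abs(r) >= 1 - 1e-12)
    return extreme / len(rhos)

for n in (3, 4, 5):
    print(n, round(min_two_sided_p(n), 4))
```

With three brands there are only six possible orderings, so even a perfect rank correlation cannot fall below p = 2/6, which is why the results are treated as exploratory.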

## Hypothesis testing 1 - Token Frequency (sum of mention_count) Correlates with Rank Stability

Goal: Test whether online presence, measured by token count (`mention_count`), is associated with a higher LLM ranking.

When DeepSeek is asked a general brand-ranking question, the brands below come out as top-ranked. This ranking is kept fixed for testing until the prompt changes.

In [None]:
df_hypo = df.copy()

**Assumption**: Brands that appear frequently in pretraining data, measured via total token/mention count, may be ranked higher by LLMs, regardless of the surrounding context or semantic framing.

**Metric Used**:
- mention_count (sum) = proxy for token frequency
- avg_rank = fixed brand ranking (lower is better)

In [None]:
# Group by brand_name to get total mentions and number of rows with each brand
mention_counts_df = df_hypo.groupby("brand_name").agg(
    mention_count=('mention_count', 'sum'),
    brand_occurrence_count=('brand_name', 'count')
).reset_index()

mention_counts_df

In [None]:
# Brand ranking categories in the UK 2023
top_rk = ("primark", "asos", "burberry")
med_rk = ("river island", "reiss")
low_rk = ("superdry", "ted baker")

# Brands appearing once
one_app = ("zara", "Vivienne Westwood", "john lewis")

In [None]:
# Define static ranking: lower rank = better
static_rank_data = [
    ("Vivienne Westwood", 1),
    ("Reiss", 2),
    ("ASOS", 3),
    ("Primark", 4),
]

In [None]:
# Convert both to lowercase
mention_counts_df['brand_name'] = mention_counts_df['brand_name'].str.lower()
static_rank_data = [(brand.lower(), rank) for brand, rank in static_rank_data]

In [None]:
# Static Rank info
rank_df = pd.DataFrame(static_rank_data, columns=["brand_name", "avg_rank"])

# Merge mention counts with rank info
joined_df = pd.merge(mention_counts_df, rank_df, on="brand_name")

# Spearman Correlation: Total Mention Count vs LLM Rank
corr_s1, pval_s1 = spearmanr(joined_df["mention_count"], joined_df["avg_rank"])
print(f"Spearman Correlation (mention count vs rank): r = {corr_s1:.3f}, p = {pval_s1:.3f}")

# Scatter plot
plt.figure(figsize=(8, 5))
sns.scatterplot(data=joined_df, x="mention_count", y="avg_rank", s=120, color="royalblue")

# Annotate points
for _, row in joined_df.iterrows():
    plt.text(row["mention_count"], row["avg_rank"] + 0.1, row["brand_name"], ha='center', fontsize=10)

plt.title("Total Mention Count vs Brand Rank", fontsize=14)
plt.xlabel("Total Mention Count", fontsize=12)
plt.ylabel("Average Rank (Lower = Better)", fontsize=12)
plt.gca().invert_yaxis()  # Lower rank (1) at the top
plt.grid(True)
plt.tight_layout()
plt.show()

# Bias Analysis

## Hypothesis testing 2 - Formal vs Informal Content Bias (Unresolved: many irrelevant pages)

Goal: LLM output leans toward a formal tone, likely learned from news and research content. The LLM might therefore rank brands higher when they appear in a formal tone or on reliable web pages.

For this hypothesis, `web_category` is used to classify content as formal or informal, and the proportion of formal content (e.g., News, Financial) versus informal content (e.g., Blogs, Community) is calculated.

In [None]:
df_hypo.groupby(['brand_name', 'web_category'])['mention_count'].sum().unstack(fill_value=0)

In [None]:
df_hypo.head()

In [None]:
# Categorize formal and informal web categories
formal_cats = ['Education', 'Financial', 'Government', 'News']
informal_cats = ['Blogs & Community', 'Media & Entertainment', 'Social', 'Others', 'E-commerce & Commercial']

In [None]:
total_docs = len(df_hypo)
formal_docs = df_hypo['web_category'].isin(formal_cats).sum()
formal_ratio = formal_docs / total_docs
print(f"Formal category ratio: {formal_ratio:.2%}")

In [None]:
# Group the original DataFrame to get formal mentions only
df_formal_mentions = df_hypo[df_hypo['web_category'].isin(formal_cats)]

# Sum formal mentions per brand
df_formal_sum = df_formal_mentions.groupby('brand_name')['mention_count'].sum()

# Compute % share for each brand
df_formal_pct = df_formal_sum / df_formal_sum.sum() * 100

In [None]:
pd.set_option('display.max_columns', None)  # Show all columns

In [None]:
df_formal_mentions.head()

### Human Evaluation Check

Most of the returned pages are irrelevant to the sentiment analysis; only a few refer to the brand in a sentiment context.

In [None]:
df_formal_mentions[
    df_formal_mentions['web_category'].isin(["Education", "Financial", "Government"])
][['url', 'url_domain','content']].head(10)

In [None]:
fashion_keywords = [
    "dress", "clothing", "style", "fashion", "outfit", "wear",
    "collection", "trend", "runway", "model", "lookbook", "wardrobe",
    "design", "brand", "shop", "retail"
]

def contains_zara_and_3_fashion_words(text):
    if not isinstance(text, str):
        return False
    text_lower = text.lower()
    zara_match = re.search(r'\bzara\b', text_lower)
    count = sum(1 for word in fashion_keywords if word in text_lower)
    return bool(zara_match) and count >= 3

filtered_df = df_formal_mentions[
    df_formal_mentions['content'].apply(contains_zara_and_3_fashion_words)
][['brand_name', 'content']]

# Show all text in 'content'
pd.set_option('display.max_colwidth', None)
filtered_df.head(12)


In [None]:
# Visualize the Ratios
plt.figure(figsize=(6, 5))
df_formal_pct.plot(kind='bar', title='Share of Formal Mentions by Brand', ylabel='Percentage (%)')
plt.xticks(rotation=0)
plt.tight_layout()
plt.show()

In [None]:
df_formal_mentions[
    df_formal_mentions['web_category'].isin(["Education", "Financial", "Government"])
][['content','url', 'url_domain']].head(10)

### Filtering Relevant Pages
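
The filter in the next cell counts keywords by substring, so "shop" also matches inside "workshop" and "dress" inside "addresses". As a hedged alternative (not the method used in this notebook), a whole-word variant could look like the sketch below; the shortened keyword list and the name `count_fashion_keywords` are illustrative.

```python
import re

# Shortened keyword list for illustration
FASHION_KEYWORDS = ["dress", "style", "fashion", "wear", "model", "shop"]

# One compiled pattern; \b keeps "shop" from matching inside "workshop"
KEYWORD_RE = re.compile(r"\b(?:" + "|".join(map(re.escape, FASHION_KEYWORDS)) + r")\b")

def count_fashion_keywords(text):
    """Count distinct fashion keywords that appear as whole words."""
    if not isinstance(text, str):
        return 0
    return len(set(KEYWORD_RE.findall(text.lower())))

print(count_fashion_keywords("Zara swears the remodel of the workshop addresses delays"))  # 0
print(count_fashion_keywords("Zara dress in a new style at the shop"))                     # 3
```

The trade-off is that whole-word matching misses inflected forms such as "dresses", so the keyword list would need plurals added if this variant were adopted.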

In [None]:
fashion_keywords = [
    "dress", "clothing", "style", "fashion", "outfit", "wear",
    "collection", "trend", "runway", "model", "lookbook", "wardrobe",
    "design", "brand", "shop", "retail"
]

def contains_zara_and_3_fashion_words(text):
    if not isinstance(text, str):
        return False
    text_lower = text.lower()
    zara_match = re.search(r'\bzara\b', text_lower)
    count = sum(1 for word in fashion_keywords if word in text_lower)
    return bool(zara_match) and count >= 3

filtered_df = df_formal_mentions[
    df_formal_mentions['content'].apply(contains_zara_and_3_fashion_words)
][['brand_name', 'content']]

# Show all text in 'content'
pd.set_option('display.max_colwidth', None)
filtered_df.head(3)

## Hypothesis 3: Location Bias from TLD

Goal: Check whether certain brands are more common under specific TLDs (.uk, .fr, etc.) and whether LLM responses show any location bias.

In [None]:
crosstab_df = pd.crosstab(df_hypo['brand_name'], df_hypo['url_tld'], normalize='index')

## Heatmap for TLD

In [None]:
# Filter top N TLDs by overall usage (column sum)
top_tlds = crosstab_df.sum(axis=0).sort_values(ascending=False).head(10).index
filtered = crosstab_df[top_tlds]

# Heatmap
plt.figure(figsize=(10, 4))
sns.heatmap(filtered, annot=True, cmap='Blues', fmt=".2%", linewidths=0.5)
plt.title("Brand Presence Across Top 10 TLDs (Normalized by Brand)", fontsize=14)
plt.xlabel("TLD")
plt.ylabel("Brand")
plt.tight_layout()
plt.show()

## Remove .com

In [None]:
# Remove 'com' column first
crosstab_excl_com = crosstab_df.drop(columns='com', errors='ignore')

# Re-normalize so each row sums to 1 again (without 'com')
crosstab_excl_com = crosstab_excl_com.div(crosstab_excl_com.sum(axis=1), axis=0)

# Pick top TLDs excluding 'com' by global presence
top_tlds = crosstab_excl_com.sum(axis=0).sort_values(ascending=False).head(10).index
filtered = crosstab_excl_com[top_tlds]

# Plot heatmap
plt.figure(figsize=(10, 4))
sns.heatmap(filtered, annot=True, cmap='Blues', fmt=".2%", linewidths=0.5)
plt.title("Brand Presence Across Top TLDs (Excluding .com, Normalized by Brand)", fontsize=14)
plt.xlabel("TLD")
plt.ylabel("Brand")
plt.tight_layout()
plt.show()

## Top-Level Domain (TLD) Descriptions

Below are brief descriptions of the TLDs shown in the heatmap above:

- **.co.uk**: United Kingdom – Used by commercial entities based in the UK.
- **.org**: Organization – Commonly used by non-profit and non-governmental organizations.
- **.net**: Network – Originally for network providers, now broadly used.
- **.co**: Colombia – Country-code TLD often marketed globally for “company”.
- **.ca**: Canada – The official country code TLD for Canada.
- **.info**: Information – Generic TLD intended for informational websites.
- **.com.au**: Australia – Commercial organizations in Australia.
- **.ie**: Ireland – Country code TLD for Ireland.
- **.ru**: Russia – Country code TLD for the Russian Federation.
- **.in**: India – Country code TLD for India.


For all details of top-level domains, including gTLDs worldwide, please see [Link](https://www.iana.org/domains/root/db).

## Hypothesis 4: Revenue Context Bias

Goal: Check whether top-ranked brands co-occur with financial context, since a reasoning model may prioritize market value, revenue, and financial statements, and its responses tend to reference numerical figures.

In this hypothesis, the ranking from the **R1 model** is highlighted because of its nature as a reasoning model and its focus on understanding.

In [None]:
static_rank_data_rev_bias = [
    ("vivienne westwood", 1),
    ("burberry", 2),
    ("reiss", 3),
    ("ted baker", 4),
    ("john lewis", 5),
    ("zara", 6),
    ("river island", 7),
    ("asos", 8),
    ("superdry", 9),
    ("primark", 10)
]

static_rank_df = pd.DataFrame(static_rank_data_rev_bias, columns=["brand_name", "rank"])

In [None]:
df_hypo.info()

In [None]:
financial_keywords = [
    'revenue', 'billion', 'million', 'sales', 'usd', '£',
    'profit', 'net income', 'earnings', 'financial statement', 'market value'
]

df_hypo['contains_financial_context'] = df_hypo['content'].str.lower().apply(
    lambda text: any(keyword in text for keyword in financial_keywords)
)

df_hypo['content_word_count'] = df_hypo['content'].str.split().str.len()

brand_context_bias = df_hypo.groupby('brand_name').agg(
    total_mentions=('content', 'count'),
    financial_context_mentions=('contains_financial_context', 'sum'),
    avg_word_count=('content_word_count', 'mean'),
    avg_word_count_financial=('content_word_count', lambda x: x[df_hypo.loc[x.index, 'contains_financial_context']].mean())
)

brand_context_bias['pct_financial_context'] = (
    brand_context_bias['financial_context_mentions'] / brand_context_bias['total_mentions']
) * 100

brand_context_bias_sorted = brand_context_bias.sort_values(by='pct_financial_context', ascending=False)

# Round numeric columns to 2 decimals
brand_context_bias_sorted = brand_context_bias_sorted.round(2)

# Format percentage with % sign
brand_context_bias_sorted['pct_financial_context'] = brand_context_bias_sorted['pct_financial_context'].astype(str) + '%'

# Display
from IPython.display import display
display(brand_context_bias_sorted.head(10))
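
The keyword flag above uses plain substring matching, so a term such as `profit` also matches inside `nonprofit`, and `million` inside `millionaire`. The sketch below compares that behaviour with a word-boundary regex on a few hypothetical toy rows (not taken from the dataset). The names `pattern`, `toy`, `substring_flag`, and `word_flag` are illustrative assumptions, not part of the notebook.

```python
import re
import pandas as pd

# Same keyword list as in the cell above.
financial_keywords = [
    'revenue', 'billion', 'million', 'sales', 'usd', '£',
    'profit', 'net income', 'earnings', 'financial statement', 'market value'
]

# '£' is a symbol, so it is matched as-is; the other terms must not be
# preceded or followed by a word character.
word_keywords = [k for k in financial_keywords if k != '£']
pattern = re.compile(
    r"(?<!\w)(?:" + "|".join(re.escape(k) for k in word_keywords) + r")(?!\w)|£",
    flags=re.IGNORECASE,
)

# Hypothetical toy rows for illustration only.
toy = pd.DataFrame({"content": [
    "Brand X reported revenue of £10 million",
    "A nonprofit run by a millionaire",
    None,
]})

text = toy["content"].fillna("")

# Substring matching, as used in the notebook.
toy["substring_flag"] = text.str.lower().apply(
    lambda t: any(k in t for k in financial_keywords)
)

# Word-boundary matching.
toy["word_flag"] = text.apply(lambda t: bool(pattern.search(t)))

print(toy[["substring_flag", "word_flag"]])
```

The stricter pattern trades recall for precision: it no longer matches plurals such as `profits` unless they are added to the keyword list, so which variant fits better depends on how the flag is used.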


In [None]:
# Merge on brand_name; reset_index() turns the groupby index into a regular column
brand_context_bias_ranked = pd.merge(
    brand_context_bias.reset_index(), static_rank_df, on='brand_name', how='left'
)

# Sort by financial context %
brand_context_bias_sorted = brand_context_bias_ranked.sort_values(by='pct_financial_context', ascending=False)

# Round numeric columns to 2 decimals
brand_context_bias_sorted = brand_context_bias_sorted.round(2)

# Format the percentage column (create a display version)
brand_context_bias_sorted['pct_financial_context_display'] = brand_context_bias_sorted['pct_financial_context'].astype(str) + '%'

# Display top 10 brands (with rank and formatted %)
from IPython.display import display
display(brand_context_bias_sorted.head(10))
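
A left merge silently leaves `rank` as NaN for any brand whose name differs between the two frames, for example in letter case. The sketch below shows one way to surface such rows with the `indicator=True` option of `merge`, using hypothetical toy frames. The frame names `stats` and `ranks` and all values are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical toy frames; the numbers are illustrative only.
stats = pd.DataFrame({
    "brand_name": ["vivienne westwood", "zara"],
    "pct_financial_context": [12.5, 8.0],
})
ranks = pd.DataFrame({
    "brand_name": ["Vivienne Westwood", "zara"],
    "rank": [1, 6],
})

def unmatched_brands(left, right):
    """Return brand names from `left` that found no partner in `right`."""
    merged = left.merge(right, on="brand_name", how="left", indicator=True)
    return merged.loc[merged["_merge"] != "both", "brand_name"].tolist()

# Raw keys: the differently cased name does not match.
unmatched_raw = unmatched_brands(stats, ranks)

# Normalise the join key on both sides, then merge again.
stats_clean = stats.assign(brand_name=stats["brand_name"].str.strip().str.lower())
ranks_clean = ranks.assign(brand_name=ranks["brand_name"].str.strip().str.lower())
unmatched_clean = unmatched_brands(stats_clean, ranks_clean)
merged_clean = stats_clean.merge(ranks_clean, on="brand_name", how="left")

print("unmatched before normalising:", unmatched_raw)
print("unmatched after normalising:", unmatched_clean)
```

Running a check like this before plotting helps, since a missing rank would otherwise only show up later as an error or a gap in the chart.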

In [None]:
import matplotlib.pyplot as plt

# Sort by rank (ascending = best to worst)
rank_sorted_df = brand_context_bias_sorted.sort_values(by='rank')

# Plot
plt.figure(figsize=(10, 6))
bars = plt.barh(
    y=[f"{row['brand_name']} (#{int(row['rank'])})" for _, row in rank_sorted_df.iterrows()],
    width=rank_sorted_df['pct_financial_context']
)

# Add text labels (percentages) at the end of each bar
for bar, pct in zip(bars, rank_sorted_df['pct_financial_context_display']):
    width = bar.get_width()
    plt.text(width + 1, bar.get_y() + bar.get_height() / 2, pct, va='center', fontsize=9)

# Labels and title
plt.ylabel("Brand (Static Rank)")
plt.xlabel("Share of Mentions in Financial Context (%)")
plt.title("Financial Context Share by Brand (Rank 1–10)")
plt.gca().invert_yaxis()  # Rank 1 at the top
plt.grid(axis='x', linestyle='--', alpha=0.6)
plt.tight_layout()

# Show the chart
plt.show()

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import spearmanr

# Use clean DataFrame (rank_sorted_df should already be defined and sorted by rank)
df_corr = rank_sorted_df.copy()

# Spearman correlation
r_spear, p_spear = spearmanr(df_corr['pct_financial_context'], df_corr['rank'])
print(f"Spearman Correlation: r = {r_spear:.2f}, p = {p_spear:.3f}")

# Plot
plt.figure(figsize=(8, 6))
sns.regplot(
    x='pct_financial_context',
    y='rank',
    data=df_corr,
    scatter=True,
    ci=None,
    line_kws={'color': 'red'}
)

# Annotate brands
for _, row in df_corr.iterrows():
    plt.text(row['pct_financial_context'] + 0.5, row['rank'], row['brand_name'], fontsize=8)

# Axis labels and title
plt.xlabel("Share of Mentions in Financial Context (%)")
plt.ylabel("Static Rank (Lower = Better)")
plt.title(f"Correlation: Financial Context % vs Rank (r = {r_spear:.2f})")
plt.gca().invert_yaxis()  # Rank 1 at the top
plt.grid(True)
plt.tight_layout()
plt.show()
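
To make the statistic above easier to read, the sketch below computes Spearman's correlation on a small set of hypothetical values and checks that it equals Pearson's correlation applied to the ranks of each variable. The arrays `pct` and `rank` are invented for illustration and are not results from this notebook.

```python
import numpy as np
from scipy.stats import pearsonr, rankdata, spearmanr

# Hypothetical values, chosen only to illustrate the calculation.
pct = np.array([30.0, 22.0, 25.0, 10.0, 5.0])  # financial-context share (%)
rank = np.array([1, 2, 3, 4, 5])               # static rank, 1 = best

# Spearman's rank correlation and its two-sided p-value.
rho, p_value = spearmanr(pct, rank)

# Spearman's rho is Pearson's r computed on the ranks of each variable.
r_on_ranks, _ = pearsonr(rankdata(pct), rankdata(rank))

print(f"Spearman rho = {rho:.2f}")
print(f"Pearson r on ranks = {r_on_ranks:.2f}")
```

Because rank 1 is the best rank, a negative coefficient means that brands with a higher financial-context share tend to be ranked better. With only ten brands, the p-value should be read with caution, since small samples leave a lot of uncertainty around the estimate.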


In [None]:
df_hypo['contains_financial_context'].value_counts(normalize=True)

In [None]:
df_hypo.head()

In [None]:
df_hypo.info()

-- End of the notebook --