# Embeddings (Traditional Statistical Vector-based Embeddings)

Traditional statistical vector-based embeddings are foundational techniques in natural language processing (NLP) that represent text using statistical measures such as word counts and co-occurrence. Common methods include:

### 1. Bag of Words (BoW)
- **Description**: Represents text by the occurrence (count) of each word in the document without considering the word order or context.
- **Implementation**: Typically uses a count vectorizer, such as scikit-learn's `CountVectorizer`.
- **Characteristics**: Produces sparse vectors where each dimension corresponds to a specific term from the vocabulary and the value is the word count.
- **Use Cases**: Simple and effective for basic text classification and clustering tasks.

### 2. Term Frequency-Inverse Document Frequency (TF-IDF)
- **Description**: Enhances the Bag of Words model by weighting terms based on their frequency in a document and their inverse frequency across all documents in the corpus.
- **Implementation**: Typically uses a TF-IDF vectorizer, such as scikit-learn's `TfidfVectorizer`.
- **Characteristics**: Produces sparse vectors with weighted values, reducing the impact of common words and highlighting important terms.
- **Use Cases**: Widely used in information retrieval and text mining.
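
The weighting can be sketched by hand with NumPy. This uses the textbook definition (term frequency multiplied by `log(N / df)`) on an illustrative count matrix. It is an assumption for clarity rather than what the notebook runs: scikit-learn's `TfidfVectorizer` by default applies a smoothed IDF and L2-normalises each row, so its values will differ.

```python
import numpy as np

# Toy term-count matrix: rows are documents, columns are terms (illustrative)
counts = np.array(
    [
        [2, 1, 0],
        [1, 0, 3],
        [1, 0, 0],
    ],
    dtype=float,
)

n_docs = counts.shape[0]
tf = counts / counts.sum(axis=1, keepdims=True)  # frequency within each document
doc_freq = (counts > 0).sum(axis=0)  # number of documents containing each term
idf = np.log(n_docs / doc_freq)  # textbook inverse document frequency
tfidf = tf * idf

print(np.round(tfidf, 3))
```

The first term appears in every document, so its IDF is zero and its whole column vanishes, which is the "reducing the impact of common words" effect described above.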

### 3. Latent Semantic Analysis (LSA) or Latent Semantic Indexing (LSI)
- **Description**: Applies Singular Value Decomposition (SVD) to the term-document matrix (typically after applying TF-IDF) to reduce dimensions and capture latent semantic relationships between terms.
- **Implementation**: Perform SVD on the term-document matrix.
- **Characteristics**: Transforms high-dimensional sparse vectors into lower-dimensional dense vectors.
- **Use Cases**: Useful for topic modeling and capturing underlying semantic structures.
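
A minimal LSA sketch, assuming scikit-learn's `TfidfVectorizer` followed by `TruncatedSVD`. The corpus and the choice of two components are illustrative only:

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Toy corpus (illustrative only)
corpus = [
    "eat more fruit and vegetables every day",
    "vegetables and fruit are part of a healthy diet",
    "regular exercise keeps the heart healthy",
    "walking is a simple form of exercise",
]

# TF-IDF produces a sparse matrix; truncated SVD projects it to 2 dense dimensions
lsa = make_pipeline(TfidfVectorizer(), TruncatedSVD(n_components=2, random_state=0))
Z = lsa.fit_transform(corpus)

print(Z.shape)  # (number of documents, number of components)
```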

### 4. Latent Dirichlet Allocation (LDA)
- **Description**: A generative probabilistic model that represents documents as mixtures of topics and topics as mixtures of words.
- **Implementation**: Infers topic distributions with approximate inference, such as variational Bayes (used by scikit-learn's `LatentDirichletAllocation`) or Gibbs sampling.
- **Characteristics**: Produces dense vectors representing the distribution of topics in each document.
- **Use Cases**: Widely used for topic modeling and discovering abstract topics in large text corpora.
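
A minimal sketch of document-topic vectors, assuming scikit-learn's `CountVectorizer` and `LatentDirichletAllocation`. The corpus and the number of topics are illustrative, and a corpus this small will not yield meaningful topics:

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline

# Toy corpus (illustrative only)
corpus = [
    "eat more fruit and vegetables every day",
    "vegetables and fruit are part of a healthy diet",
    "regular exercise keeps the heart healthy",
    "walking is a simple form of exercise",
]

# LDA works on raw counts, so it is paired with a count vectorizer
lda = make_pipeline(
    CountVectorizer(),
    LatentDirichletAllocation(n_components=2, random_state=0),
)
theta = lda.fit_transform(corpus)  # one topic distribution per document

print(theta.shape)
print(np.round(theta.sum(axis=1), 6))  # each row sums to 1
```

Because each row is a probability distribution over topics, the embedding dimension equals the number of topics.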

### 5. Pointwise Mutual Information (PMI)
- **Description**: Measures the association between a pair of words by comparing the probability of their co-occurrence to the probabilities of their individual occurrences.
- **Implementation**: Uses co-occurrence matrices.
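- **Characteristics**: Produces a word-by-word association matrix whose rows can serve as word vectors. These vectors are high-dimensional and are often sparsified (for example by keeping only positive PMI values) or reduced with SVD.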
- **Use Cases**: Useful for capturing word associations and semantic relationships.
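
PMI is not part of the pipeline in this notebook, so the following is only a sketch of the computation with NumPy. The co-occurrence counts are made up for illustration, and clipping negative values to zero (positive PMI) is one common convention rather than the only option:

```python
import numpy as np

# Toy symmetric word-word co-occurrence counts (illustrative values)
words = ["healthy", "diet", "exercise"]
cooc = np.array(
    [
        [0, 4, 2],
        [4, 0, 0],
        [2, 0, 0],
    ],
    dtype=float,
)

total = cooc.sum()
p_xy = cooc / total  # joint probability of each word pair
p_x = cooc.sum(axis=1, keepdims=True) / total  # marginal of the row word
p_y = cooc.sum(axis=0, keepdims=True) / total  # marginal of the column word

# PMI(x, y) = log2(P(x, y) / (P(x) * P(y))); pairs that never co-occur give -inf
with np.errstate(divide="ignore"):
    pmi = np.log2(p_xy / (p_x * p_y))

ppmi = np.maximum(pmi, 0.0)  # positive PMI: clip negatives and -inf to 0

print(ppmi)
```

Each row of `ppmi` can then be read as a vector for the corresponding entry in `words`.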

In [None]:
import ast
import os
import pickle
import random

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from alive_progress import alive_bar
from scipy.sparse import csr_matrix
from sklearn.decomposition import LatentDirichletAllocation, TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import (
    cosine_similarity,
    euclidean_distances,
    manhattan_distances,
)
from sklearn.pipeline import make_pipeline

In [None]:
statistical_methods = {
    "bow": (CountVectorizer,),
    "tfidf": (TfidfVectorizer,),
    "lsa": (TfidfVectorizer, TruncatedSVD),
    "lda": (CountVectorizer, LatentDirichletAllocation),
}

similarity_metrics = {
    "cosine": cosine_similarity,
    "euclidean": euclidean_distances,
    "dot": np.dot,
    "manhattan": manhattan_distances,
}

In [None]:
# Parameters
CONTRIBUTOR: str = "Health Promotion Board"
CATEGORY: str = "live-healthy"
METHOD: str = "bow"
KWARGS: dict = {"max_features": 384}
METRIC: str = "dot"

In [None]:
CLEAN_DATA_PATH = os.path.join("..", "data", "healthhub_small_clean")

CLEANED_CHUNK_ID_LIST_PATH = os.path.join(
    CLEAN_DATA_PATH, "healthhub_chunk_id_list_small_clean.pkl"
)
CLEANED_SOURCE_LIST_PATH = os.path.join(
    CLEAN_DATA_PATH, "healthhub_source_list_small_clean.pkl"
)
CLEANED_DOMAIN_LIST_PATH = os.path.join(
    CLEAN_DATA_PATH, "healthhub_domain_list_small_clean.pkl"
)
CLEANED_TITLE_LIST_PATH = os.path.join(
    CLEAN_DATA_PATH, "healthhub_title_list_small_clean.pkl"
)
CLEANED_CONTRIBUTOR_LIST_PATH = os.path.join(
    CLEAN_DATA_PATH, "healthhub_contributor_list_small_clean.pkl"
)
CLEANED_CONTENT_LIST_PATH = os.path.join(
    CLEAN_DATA_PATH, "healthhub_content_list_small_clean.pkl"
)
CLEANED_CATEGORY_LIST_PATH = os.path.join(
    CLEAN_DATA_PATH, "healthhub_category_list_small_clean.pkl"
)

OUTPUT_CM_PATH = os.path.join(
    "..",
    "artifacts",
    "outputs",
    f"{METHOD}_{'_'.join([f'{k}_{v}' for k, v in KWARGS.items()])}_{METRIC}_cm.png",
)
OUTPUT_SIM_PATH = os.path.join(
    "..",
    "artifacts",
    "outputs",
    "statistical_vector_based_embeddings_similarity_scores.xlsx",
)

SHEET_NAME = f"{METHOD}_{METRIC}"

## Load Metadata

In [None]:
with open(CLEANED_CHUNK_ID_LIST_PATH, "rb") as file:
    loaded_chunk_id = pickle.load(file)  # list of chunk ids

with open(CLEANED_SOURCE_LIST_PATH, "rb") as file:
    loaded_source = pickle.load(file)  # list of hyperlinks

with open(CLEANED_DOMAIN_LIST_PATH, "rb") as file:
    loaded_domain = pickle.load(file)  # website domain

with open(CLEANED_TITLE_LIST_PATH, "rb") as file:
    loaded_title = pickle.load(file)  # list of titles each chunk belongs to

with open(CLEANED_CONTRIBUTOR_LIST_PATH, "rb") as file:
    loaded_contributor = pickle.load(file)  # list of contributors

with open(CLEANED_CONTENT_LIST_PATH, "rb") as file:
    loaded_content = pickle.load(file)  # list of chunks of contents

with open(CLEANED_CATEGORY_LIST_PATH, "rb") as file:
    loaded_category = pickle.load(file)  # list of categories

## Create Dataframe

In [None]:
df = pd.DataFrame(
    {
        "chunk_id": loaded_chunk_id,
        "doc_source": loaded_source,
        "doc_domain": loaded_domain,
        "doc_title": loaded_title,
        "contributor": loaded_contributor,
        "text": loaded_content,
        "category": loaded_category,
    }
)

df = df[df["contributor"] == CONTRIBUTOR].reset_index(drop=True)
df = df[df["doc_source"].apply(lambda x: x.split("/")[3] == CATEGORY)].reset_index(
    drop=True
)

print(df.shape)
df.head()

## Combine Chunks into Single Articles

In [None]:
df["combined_text"] = None

with alive_bar(df["doc_source"].nunique(), force_tty=True) as bar:
    for source in df["doc_source"].unique():
        combined_text = " ".join(df.query("doc_source == @source")["text"].values)
        indices = df.query("doc_source == @source").index.values
        df.loc[indices, "combined_text"] = combined_text
        bar()

# After combining chunks into one article, keep a single row per article
df = df[~df["doc_source"].duplicated()].reset_index(drop=True)
df["chunk_id"] = df["chunk_id"].apply(lambda x: "_".join(x.split("_")[:-1]))

df
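
The same chunk-combining step can also be expressed with a pandas `groupby`. This is an alternative sketch on a made-up three-row frame, not the notebook's code; `chunks` and `articles` are hypothetical names:

```python
import pandas as pd

# Toy chunk table: two chunks from one article, one chunk from another
chunks = pd.DataFrame(
    {
        "doc_source": ["url_a", "url_a", "url_b"],
        "text": ["first chunk", "second chunk", "only chunk"],
    }
)

# Join all chunks that share a source into a single article text;
# sort=False keeps articles in order of first appearance
articles = (
    chunks.groupby("doc_source", sort=False)["text"]
    .agg(" ".join)
    .rename("combined_text")
    .reset_index()
)

print(articles)
```

This yields one row per article directly, so no separate de-duplication step is needed, though other per-article columns would have to be aggregated or merged back in.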

## Load Ground Truth Dataframe

In [None]:
ground_df = pd.read_excel(
    os.path.join(
        "..", "data", "Synapxe Content Prioritisation - Live Healthy_020724.xlsx"
    ),
    sheet_name="All Live Healthy",
    index_col=False,
)

ground_truth_col = "Combine Group ID"

ground_df = ground_df[ground_df[ground_truth_col].notna()].reset_index(drop=True)
ground_df[ground_truth_col] = ground_df[ground_truth_col].astype(int)

# Merge dfs so we can get the document title and content
merge_df = pd.merge(ground_df, df, how="inner", left_on="URL", right_on="doc_source")

col_of_int = ["Combine Group ID", "Page Title", "Meta Description", *df.columns]
final_df = merge_df[col_of_int]

print(final_df.shape)
final_df.head()

In [None]:
def generate_statistical_embeddings(
    corpus: list[str], method: str, **kwargs
) -> tuple[csr_matrix, pd.DataFrame] | tuple[np.ndarray, None]:
    components = statistical_methods.get(method)
    if components is None:
        raise ValueError(
            f"Unknown method {method!r}; expected one of {list(statistical_methods)}"
        )

    df = None

    if len(components) == 1:
        # Single vectorizer (bow, tfidf): kwargs configure the vectorizer
        vectorizer = components[0](**kwargs)
        print(vectorizer)
        X = vectorizer.fit_transform(corpus)

        # Use the learned vocabulary terms as column headers
        feature_names = vectorizer.get_feature_names_out()
        # Combine header titles and weights
        df = pd.DataFrame(X.toarray(), columns=feature_names)

    else:
        # Vectorizer followed by a decomposition (lsa, lda): kwargs configure
        # the decomposition step only (e.g. n_components)
        pipeline = make_pipeline(components[0](), components[1](**kwargs))
        print(pipeline)
        X = pipeline.fit_transform(corpus)

    return X, df

In [None]:
X, mat_df = generate_statistical_embeddings(
    final_df["combined_text"].to_list(), method=METHOD, **KWARGS
)

print(X.shape)  # (num_docs, emb_dim)
if mat_df is not None:
    display(mat_df.head(7))

In [None]:
# Compute similarity matrix
similarity_metric = similarity_metrics[METRIC]

if METRIC == "dot":
    similarities = X @ X.T
    if not isinstance(similarities, np.ndarray):  # sparse result from bow/tfidf
        similarities = similarities.toarray()
elif METRIC in ["euclidean", "manhattan"]:
    distances = similarity_metric(X, X)
    # https://stats.stackexchange.com/questions/158279/how-i-can-convert-distance-euclidean-to-similarity-score#:~:text=If,is%20commonly%20used.
    similarities = 1 / (1 + distances)
else:
    similarities = similarity_metric(X, X)

print(similarities.shape)  # (num_docs, num_docs)
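
The distance-to-similarity conversion used for the Euclidean and Manhattan metrics can be checked on a tiny example. This sketch computes Euclidean distances with plain NumPy broadcasting instead of scikit-learn, and the vectors are illustrative:

```python
import numpy as np

# Three toy document vectors; the first and third are identical
X = np.array([[0.0, 0.0], [3.0, 4.0], [0.0, 0.0]])

# Pairwise Euclidean distances via broadcasting
diff = X[:, None, :] - X[None, :, :]
distances = np.sqrt((diff**2).sum(axis=-1))

# Distance 0 maps to similarity 1; larger distances approach 0
similarities = 1 / (1 + distances)

print(similarities)
```

The mapping is monotonic, so it preserves the ranking of neighbours, but the resulting scores are not directly comparable with cosine similarities.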

In [None]:
if METHOD == "bow" and METRIC == "dot":
    similarities = np.divide(similarities, similarities.max(), casting="same_kind")
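
Dividing by the global maximum bounds the scores to [0, 1] but keeps the dot product's sensitivity to document length. The sketch below, on illustrative count vectors, contrasts that scaling with cosine similarity, which normalises each pair by the vector lengths:

```python
import numpy as np

# Toy count vectors: the second is the first one doubled (a "longer" document)
counts = np.array([[1.0, 0.0, 2.0], [2.0, 0.0, 4.0], [0.0, 3.0, 0.0]])

dot = counts @ counts.T
scaled = dot / dot.max()  # global scaling, as in the cell above

# Cosine similarity: dot product of L2-normalised rows
unit = counts / np.linalg.norm(counts, axis=1, keepdims=True)
cosine = unit @ unit.T

print(np.round(scaled, 2))
print(np.round(cosine, 2))
```

With global scaling, a short document's similarity to itself can be lower than its similarity to a longer document with the same word proportions, whereas cosine gives 1 in both cases.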

In [None]:
# Function to darken a hex color


def darken_hex_color(hex_color, factor=0.7):
    # Ensure factor is between 0 and 1
    factor = max(0, min(1, factor))

    # Convert hex color to RGB
    r = int(hex_color[1:3], 16)
    g = int(hex_color[3:5], 16)
    b = int(hex_color[5:7], 16)

    # Darken the color
    r = int(r * factor)
    g = int(g * factor)
    b = int(b * factor)

    # Convert RGB back to hex
    darkened_color = f"#{r:02x}{g:02x}{b:02x}".upper()

    return darkened_color

In [None]:
article_titles = final_df.loc[:, "doc_title"].tolist()

start = 0
end = 20

cutoff_similarities = similarities[start:end, start:end]
cutoff_article_titles = article_titles[start:end]

# Generate random colours
hexadecimal_alphabets = "0123456789ABCDEF"
ground_truth_cluster_ids = final_df.iloc[start:end]["Combine Group ID"].unique()
colours = {
    cluster_id: darken_hex_color(
        "#" + "".join([random.choice(hexadecimal_alphabets) for _ in range(6)])
    )
    for cluster_id in ground_truth_cluster_ids
}


plt.subplots(figsize=(20, 18))
ax = sns.heatmap(
    cutoff_similarities,
    xticklabels=cutoff_article_titles,
    yticklabels=cutoff_article_titles,
    annot=True,
    fmt=".2g",
)

for x_tick_label, y_tick_label in zip(
    ax.axes.get_xticklabels(), ax.axes.get_yticklabels()
):
    ground_truth_cluster_id = (
        final_df[final_df["doc_title"] == y_tick_label.get_text()]["Combine Group ID"]
        .values[0]
        .astype(int)
    )
    colour = colours[ground_truth_cluster_id]
    y_tick_label.set_color(colour)
    x_tick_label.set_color(colour)

ax.set_title(f"Method: {METHOD}, kwargs: {KWARGS}, metric: {METRIC}", fontsize=20)
plt.tight_layout()
plt.show()

In [None]:
ax.figure.savefig(OUTPUT_CM_PATH, dpi=400)

## Save Similarity Scores

In [None]:
sim_df = pd.DataFrame(similarities)

sim_df.index = merge_df["Page Title"]
sim_df.columns = merge_df["Page Title"]

# Store kwargs as index name
sim_df.index.name = str(KWARGS)
sim_df.columns.name = None

In [None]:
if os.path.isfile(OUTPUT_SIM_PATH):
    with pd.ExcelWriter(
        OUTPUT_SIM_PATH, mode="a", engine="openpyxl", if_sheet_exists="replace"
    ) as writer:  # Open with pd.ExcelWriter
        sim_df.to_excel(writer, sheet_name=SHEET_NAME)
else:
    sim_df.to_excel(OUTPUT_SIM_PATH, sheet_name=SHEET_NAME)

In [None]:
# List the sheets in the workbook; the context manager closes the file handle
with pd.ExcelFile(OUTPUT_SIM_PATH) as file:
    print(file.sheet_names)

In [None]:
# The index name written earlier (str(KWARGS)) is read back as the first column header
tmp = pd.read_excel(OUTPUT_SIM_PATH, sheet_name=SHEET_NAME)
print(ast.literal_eval(tmp.columns[0]))
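
The round trip relies on `str()` of a plain dictionary being a valid Python literal. A minimal standalone sketch, using a dictionary with the same shape as `KWARGS`:

```python
import ast

kwargs = {"max_features": 384}  # same shape as the KWARGS parameter
header = str(kwargs)  # the string stored as the index name
restored = ast.literal_eval(header)  # safely parse the literal back to a dict

print(restored)
```

This only works while the values are literals such as numbers, strings, tuples or lists; objects like callables would not survive the conversion.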