## **Building a Document Topic Classifier**

> Next, we use the entities and the connections between them, taken from the bipartite entity-document graph, to **train multi-label classifiers** `that predict document topics`. We will compare two approaches:  
- `A shallow machine-learning approach`: embed every node of the bipartite graph, then use the embeddings as features for a traditional classifier such as a **Random Forest classifier**  
- `A more integrated and differentiable approach`: a graph neural network applied to the heterogeneous graph

In [1]:
from nltk.corpus import reuters
import pandas as pd
from langdetect import detect
import numpy as np
import spacy
import re

corpus = pd.DataFrame([
    {"id": _id,
     "text": reuters.raw(_id).replace("\n", ""), 
     "label": reuters.categories(_id)}
    for _id in reuters.fileids()
 ])
# corpus = corpus.loc[:4]

# Clean the Text
def clean_text(text):
    # Remove escape characters
    text = text.replace("\n", "")
    # Convert to lowercase
    text = text.lower()
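    # Strip angle brackets around company tickers; note that the corpus stores the
    # opening bracket as the literal entity "&lt;", which this pattern does not match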
    text = re.sub(r'<(.*?)>', r'\1', text)
    return text
corpus['clean_text']=corpus["text"].apply(clean_text)
# corpus["clean_text"] = corpus["text"].apply(
#     lambda x: x.replace("\n", "")
#  )

# Detect the language of each article
def getLanguage(text: str):
    try:
        return detect(text)
    except Exception:
        # langdetect raises on empty or feature-less text; mark those as missing
        return np.nan
corpus["language"] = corpus["text"].apply(getLanguage)

# load the model NLP and apply to the clean text
nlp = spacy.load('en_core_web_md')
corpus["parsed"] = corpus["clean_text"]\
.apply(nlp)

# Extract keywords from the corpus
# (gensim.summarization was removed in gensim 4.0, so this import needs gensim < 4)
from gensim.summarization import keywords
corpus['keywords'] = corpus["clean_text"].apply(
    lambda text: keywords(text, split=True, scores=True,
                          pos_filter=('NN', 'JJ'), lemmatize=True)
)


In [2]:
def extractEntities(ents, minValue=1, typeFilters=["GPE", "ORG", "PERSON"]):
    entities = pd.DataFrame([
        {
            "lemma": e.lemma_,
            "lower": e.lemma_.lower(),
            "type": e.label_
        } for e in ents if hasattr(e, "label_")
    ])
    if len(entities) == 0:
        return pd.DataFrame()
    g = entities.groupby(["type", "lower"])
    summary = pd.concat({
        "alias": g.apply(lambda x: x["lemma"].unique()),
        "count": g["lower"].count()
    }, axis=1)
    
    # Keep entities mentioned more than minValue times, restricted to typeFilters
    filtered_summary = summary[summary["count"] > minValue]
    filtered_summary = filtered_summary[filtered_summary.index.get_level_values('type').isin(typeFilters)]
    
    return filtered_summary

def getOrEmpty(parsed, _type):
    try:
        return list(parsed.loc[_type]["count"]\
            .sort_values(ascending=False).to_dict().items())
    except:
        return []

def toField(ents):
    typeFilters = ["GPE", "ORG", "PERSON"]
    parsed = extractEntities(ents, 1, typeFilters)
    return pd.Series({_type: getOrEmpty(parsed, _type)
                      for _type in typeFilters})


> When training a topic classifier, we restrict our attention to documents that carry at least one of the labels we want to predict. So, first we take the **10 most common topics across the corpus**.

In [3]:
from collections import Counter
topics = Counter(
 [label 
 for document_labels in corpus["label"] 
 for label in document_labels]
).most_common(10)

topics

[('earn', 3964),
 ('acq', 2369),
 ('money-fx', 717),
 ('grain', 582),
 ('crude', 578),
 ('trade', 485),
 ('interest', 478),
 ('ship', 286),
 ('wheat', 283),
 ('corn', 237)]

In [4]:
topicsList = [topic[0] for topic in topics]
topicsSet = set(topicsList)
dataset = corpus[corpus["label"].apply(
 lambda x: len(topicsSet.intersection(x))>0
)]

In [5]:
dataset

Unnamed: 0,id,text,label,clean_text,language,parsed,keywords
0,test/14826,ASIAN EXPORTERS FEAR DAMAGE FROM U.S.-JAPAN RI...,[trade],asian exporters fear damage from u.s.-japan ri...,en,"(asian, exporters, fear, damage, from, u.s.-ja...","[(trading, 0.4615130639538527), (said, 0.31598..."
1,test/14828,CHINA DAILY SAYS VERMIN EAT 7-12 PCT GRAIN STO...,[grain],china daily says vermin eat 7-12 pct grain sto...,en,"(china, daily, says, vermin, eat, 7, -, 12, pc...","[(vermin, 0.312061438028717), (daily, 0.261102..."
2,test/14829,JAPAN TO REVISE LONG-TERM ENERGY DEMAND DOWNWA...,"[crude, nat-gas]",japan to revise long-term energy demand downwa...,en,"(japan, to, revise, long, -, term, energy, dem...","[(energy demand, 0.36686090947000344), (nuclea..."
3,test/14832,THAI TRADE DEFICIT WIDENS IN FIRST QUARTER Th...,"[corn, grain, rice, rubber, sugar, tin, trade]",thai trade deficit widens in first quarter th...,en,"(thai, trade, deficit, widens, in, first, quar...","[(pct, 0.5457455609144308), (export, 0.2656069..."
5,test/14839,AUSTRALIAN FOREIGN SHIP BAN ENDS BUT NSW PORTS...,[ship],australian foreign ship ban ends but nsw ports...,en,"(australian, foreign, ship, ban, ends, but, ns...","[(dispute shipping, 0.28151707573325435), (nsw..."
...,...,...,...,...,...,...,...
10783,training/999,U.K. MONEY MARKET SHORTAGE FORECAST REVISED DO...,"[interest, money-fx]",u.k. money market shortage forecast revised do...,en,"(u.k, ., money, market, shortage, forecast, re...","[(forecast, 0.3364640504513776), (market, 0.33..."
10784,training/9992,KNIGHT-RIDDER INC &lt;KRN> SETS QUARTERLY Qtl...,[earn],knight-ridder inc &lt;krn> sets quarterly qtl...,en,"(knight, -, ridder, inc, &, lt;krn, >, sets, q...","[(sets, 0.33086349685229766), (april, 0.330863..."
10785,training/9993,TECHNITROL INC &lt;TNL> SETS QUARTERLY Qtly d...,[earn],technitrol inc &lt;tnl> sets quarterly qtly d...,en,"(technitrol, inc, &, lt;tnl, >, sets, quarterl...","[(april, 0.47842480045583535), (sets, 0.336643..."
10786,training/9994,NATIONWIDE CELLULAR SERVICE INC &lt;NCEL> 4TH ...,[earn],nationwide cellular service inc &lt;ncel> 4th ...,en,"(nationwide, cellular, service, inc, &, lt;nce...","[(shrs, 0.48295618305741017), (loss, 0.4437435..."


> Now that we have extracted a structured dataset, we are ready to train our topic models and evaluate their performance.

## **Shallow Learning Methods**

> Now we will turn the dataset into a bipartite graph.

> **Some keywords**:  
- `Embedding`: a vector representation of an item; here it represents a node of the graph (a document, keyword, or entity) so that its relationships in the graph are reflected in the vector  
- `Grid Search Cross Validation`: a technique for tuning the hyperparameters of an ML model (settings that are not learned from the data and must be specified beforehand). It searches through *a predefined set of hyperparameter values*, evaluates each combination using cross-validation, and identifies the combination that yields the best performance.


In [6]:

entities = dataset["parsed"].apply(lambda x: toField(x.ents))
merged = pd.concat([dataset, entities], axis=1)


edges = pd.DataFrame([
    {"source": _id, "target": keyword, "weight": score, "type": _type}
    for _id, row in merged.iterrows()
    for _type in ["keywords", "GPE", "ORG", "PERSON"]
    for (keyword, score) in row[_type]
])

import networkx as nx

# Assuming 'edges' is a DataFrame with columns 'source' and 'target'
G = nx.Graph()

# Add nodes with bipartite attribute
G.add_nodes_from(edges["source"].unique(), bipartite=0) #1st set -- Document
G.add_nodes_from(edges["target"].unique(), bipartite=1) #2nd set -- Keywords of Texts

# Add edges
G.add_edges_from([(row["source"], row["target"]) for _, row in edges.iterrows()])


# take the two set from graph and define as doc-node and entity-node
document_nodes = {n
        for n, d in G.nodes(data=True)
        if d["bipartite"] == 0}
entity_nodes = {n
        for n, d in G.nodes(data=True)
        if d["bipartite"] == 1}




> Then we fit a node2vec model on the graph, which embeds every node of the graph into a vector.

In [7]:
from node2vec import Node2Vec

node2vec = Node2Vec(G, dimensions=10, workers=6)  # Use the number of available CPU cores
model = node2vec.fit(window=20)
embeddings = model.wv



Computing transition probabilities: 100%|██████████| 19836/19836 [00:46<00:00, 426.50it/s]  
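> node2vec learns embeddings by sampling random walks over the graph and feeding them, like sentences, to a skip-gram model. The sketch below illustrates only the walk-sampling step, and only in its simplest form: uniform random walks, which is what node2vec reduces to on an unweighted graph when its return and in-out parameters are both 1. The toy graph, the node names, and the helper `random_walk` are made up for illustration and are not part of the `node2vec` package.

```python
import random

# Toy bipartite graph as an adjacency list: documents on one side,
# keywords/entities on the other (all names are hypothetical).
toy_graph = {
    "doc/1": ["wheat", "export"],
    "doc/2": ["wheat", "corn"],
    "doc/3": ["export"],
    "wheat": ["doc/1", "doc/2"],
    "export": ["doc/1", "doc/3"],
    "corn": ["doc/2"],
}

def random_walk(graph, start, walk_length, rng):
    """Uniform random walk: at each step, move to a randomly chosen neighbour."""
    walk = [start]
    while len(walk) < walk_length:
        neighbours = graph[walk[-1]]
        if not neighbours:  # dead end: stop early
            break
        walk.append(rng.choice(neighbours))
    return walk

rng = random.Random(42)
# Two walks of length 5 starting from every node
walks = [random_walk(toy_graph, node, walk_length=5, rng=rng)
         for node in toy_graph
         for _ in range(2)]
print(walks[0])
```

> Because the graph is bipartite, every walk alternates between document nodes and keyword/entity nodes, so documents that share many keywords tend to appear in the same walks and end up with similar embeddings.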


In [8]:
embeddings.vectors

array([[ 0.0529803 , -0.28791276, -0.01597374, ..., -0.3570909 ,
        -0.29400387, -0.02616971],
       [ 0.5733612 , -0.47963986,  0.23776445, ..., -0.02688649,
        -0.12002277,  0.2997235 ],
       [ 0.67850226, -0.78957736,  0.09599695, ...,  0.09617652,
        -0.42856067,  0.0775395 ],
       ...,
       [-0.94714403,  0.05905417, -0.81775063, ...,  0.19457108,
        -0.23917626,  0.00524572],
       [-0.5830384 , -0.84181184, -0.4153224 , ..., -1.0292932 ,
        -1.0040509 ,  0.44983572],
       [ 0.63257056, -0.4008611 , -0.03733917, ..., -1.3890903 ,
        -1.7813815 , -0.9665787 ]], dtype=float32)

> Next, to make the graph-based ML workflow more efficient, we precompute the embeddings and save them to disk in **pickle format**, with a filename based on the `dimensions` and `window` parameters, so that the optimization process can reuse them.

In [9]:
# dimensions=10
# window=20
# pd.DataFrame(embeddings.vectors, index=embeddings.index2word)\
#     .to_pickle(f"./embeddings/bipartiteGraphEmbeddings_{dimensions}_{window}.p")
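> As a rough sketch of how several embedding files could be precomputed for the later grid search, the loop below enumerates `(dimensions, window)` combinations and derives one file name per combination. The helper `embedding_path` and the candidate values are assumptions for illustration; the file name pattern follows the `graphEmbeddings_10_20.p` file that the later cells load.

```python
from itertools import product

def embedding_path(dimensions, window):
    """Hypothetical helper: file name encoding the two node2vec settings."""
    return f"graphEmbeddings_{dimensions}_{window}.p"

# Candidate settings (illustrative values, not taken from the notebook)
dimensions_grid = [10, 20, 30]
window_grid = [10, 20]

paths = []
for dimensions, window in product(dimensions_grid, window_grid):
    path = embedding_path(dimensions, window)
    # In the notebook, this is where node2vec would be fitted with these
    # settings and the resulting vectors pickled to `path`.
    paths.append(path)

print(paths)
```

> Encoding the settings in the file name means a later `glob("graphEmbeddings_*")` call can pick up every precomputed file without a separate registry.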


In [10]:
corpus1= pd.read_pickle('graphEmbeddings_10_20.p')
corpus1

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
said,-0.068634,-0.108728,0.279320,0.502636,-0.352387,-0.141525,0.122002,-0.174865,0.270644,-0.536599
mln,0.806816,-0.031284,0.831844,0.338622,-0.450020,-0.301777,0.314351,-0.175108,0.585830,-0.098098
net,0.927049,0.026581,1.356840,0.432589,-0.294489,-0.322550,0.490149,-0.077926,0.527095,0.199232
u.s.,-0.144313,0.180694,0.365246,0.834751,-0.384607,-0.121756,-0.214999,0.088578,0.340159,-0.201328
dlrs,0.167431,-0.229972,0.686198,0.227649,-0.532566,-0.382044,0.405773,0.072278,0.351097,-0.298843
...,...,...,...,...,...,...,...,...,...,...
liedtke,0.161181,-0.653445,0.384399,0.404352,-0.495372,-0.831178,1.201550,-0.669810,-0.097228,-0.398567
minerals properties,-0.126188,-0.666247,1.043137,-0.001206,-1.271500,0.104546,-0.107410,-1.483430,1.339801,0.451293
sand technology,0.562546,-0.375001,0.410943,0.634887,-0.779979,-0.991260,0.834942,-0.043466,0.543723,0.262858
schlecht,-0.109026,0.547412,1.187024,0.542943,0.514080,-0.357270,-0.807087,-0.077646,0.053627,-0.773175


In [11]:
print("=====Bipartite Graph=====")
print("Number of nodes:", G.number_of_nodes())
print("Number of edges:", G.number_of_edges())
print("Average degree:", sum(dict(G.degree()).values()) / G.number_of_nodes())

=====Bipartite Graph=====
Number of nodes: 19836
Number of edges: 54064
Average degree: 5.45109901189756


> Next, we define a transformer class that can be used in the grid search cross-validation process.

In [12]:
for i in topicsList: print(i, end=', ')

earn, acq, money-fx, grain, crude, trade, interest, ship, wheat, corn, 

In [13]:
import pandas as pd
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.multioutput import MultiOutputClassifier
from sklearn.base import BaseEstimator, TransformerMixin

class EmbeddingsTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, embeddings_file):
        # Only store the path here: GridSearchCV clones the estimator and changes
        # this parameter through set_params, so the file must be loaded in fit
        self.embeddings_file = embeddings_file

    def fit(self, X, y=None):
        # Load the precomputed node embeddings (index: node name, columns: dimensions)
        self.embeddings_ = pd.read_pickle(self.embeddings_file)
        return self

    def transform(self, documents):
        return np.array([self.get_document_vector(document) for document in documents])

    def get_word_vector(self, word):
        # Tokens without an embedding are mapped to a zero vector
        if word in self.embeddings_.index:
            return self.embeddings_.loc[word].values
        else:
            return np.zeros_like(self.embeddings_.values[0])

    def get_document_vector(self, document):
        # Represent a document as the mean of its token vectors
        word_vectors = [self.get_word_vector(f'{word}') for word in document]
        document_vector = np.mean(word_vectors, axis=0)
        return document_vector
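
> To make the averaging step concrete, here is a small standalone sketch of what `get_document_vector` computes. The three-dimensional vectors and the tokens are invented for illustration; the real embeddings come from the pickled node2vec output.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings for two known graph nodes
toy_embeddings = {
    "wheat": np.array([1.0, 0.0, 2.0]),
    "export": np.array([3.0, 2.0, 0.0]),
}
dim = 3

def word_vector(word):
    # Unknown tokens fall back to a zero vector, as in get_word_vector
    return toy_embeddings.get(word, np.zeros(dim))

def document_vector(tokens):
    # The document vector is the element-wise mean over all token vectors
    return np.mean([word_vector(t) for t in tokens], axis=0)

vec = document_vector(["wheat", "export", "unknown", "unknown"])
print(vec.tolist())  # [1.0, 0.5, 0.5]
```

> Note that tokens without an embedding still count in the mean, so a document with many out-of-vocabulary tokens is pulled toward the zero vector.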
    

> To build the model training pipeline, we split our corpus into training and test sets. The Reuters dataset already tells us which articles belong to the test set and which to the training set through the prefix of each document id.

In [14]:
# Step 4: Split the dataset into training and test sets
def train_test_split(corpus):
    train_mask = corpus['id'].str.contains("training/")
    test_mask = corpus['id'].str.contains("test/")

    train = corpus[train_mask]
    test = corpus[test_mask]

    return train, test

train, test = train_test_split(dataset)



In [15]:
train.head(3)

Unnamed: 0,id,text,label,clean_text,language,parsed,keywords
3020,training/10,COMPUTER TERMINAL SYSTEMS &lt;CPML> COMPLETES ...,[acq],computer terminal systems &lt;cpml> completes ...,en,"(computer, terminal, systems, &, lt;cpml, >, c...","[(price, 0.2331391311935398), (said, 0.2093850..."
3022,training/1000,NATIONAL AMUSEMENTS AGAIN UPS VIACOM &lt;VIA> ...,[acq],national amusements again ups viacom &lt;via> ...,en,"(national, amusements, again, ups, viacom, &, ...","[(viacom, 0.46000329134063395), (holdings, 0.2..."
3023,training/10000,ROGERS &lt;ROG> SEES 1ST QTR NET UP SIGNIFICAN...,[earn],rogers &lt;rog> sees 1st qtr net up significan...,en,"(rogers, &, lt;rog, >, sees, 1st, qtr, net, up...","[(quarter, 0.3204543230825579), (rogers, 0.296..."


In [16]:
# Check data types and shapes
print(type(train), type(test))
print(train.shape, test.shape)

<class 'pandas.core.frame.DataFrame'> <class 'pandas.core.frame.DataFrame'>
(6489, 7) (2545, 7)


> After extracting the train and test sets, we define `get_features`, which returns the spaCy-parsed documents (`corpus['parsed']`, i.e. the cleaned text after applying the NLP model), and `get_labels`, which encodes the topics of each article as a vector with one entry per selected topic: 1 if the article carries that topic and 0 otherwise.

In [17]:
def get_features(corpus):
    return corpus["parsed"]

def get_labels(corpus, topicsList):
    return corpus["label"].apply(
        lambda labels: pd.Series(
            {label: 1 for label in labels}
        ).reindex(topicsList).fillna(0)
    )[topicsList]

def get_features_and_labels(corpus):
    return get_features(corpus), get_labels(corpus, topicsList)

features, labels = get_features_and_labels(train)
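
> The encoding performed by `get_labels` can be illustrated without pandas. The sketch below is a simplified stand-in, not the notebook's implementation: the helper `multi_hot`, the shortened topic list, and the example documents are assumptions made for illustration.

```python
# Shortened, hypothetical topic list (the notebook uses the top 10 topics)
topics = ["earn", "acq", "grain"]

def multi_hot(document_labels, topics):
    """One entry per topic: 1.0 if the document carries it, else 0.0."""
    present = set(document_labels)
    return [1.0 if topic in present else 0.0 for topic in topics]

example_documents = [["acq"], ["grain", "rice"], ["earn", "acq"]]
rows = [multi_hot(doc_labels, topics) for doc_labels in example_documents]
print(rows)  # [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 1.0, 0.0]]
```

> Labels outside the selected topics (here `rice`) are silently dropped, which mirrors the effect of `reindex(topicsList)` in `get_labels`.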

In [18]:
labels.head(3)

Unnamed: 0,earn,acq,money-fx,grain,crude,trade,interest,ship,wheat,corn
3020,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3022,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3023,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [19]:
# Check data types and shapes
print(type(features), type(labels))
print(features.shape, labels.shape)

<class 'pandas.core.series.Series'> <class 'pandas.core.frame.DataFrame'>
(6489,) (6489, 10)


> Now we put everything in a pipeline, which chains multiple steps together so that the data flows from one step to the next:
- First, the `EmbeddingsTransformer` class defined above embeds each input document into a vector.  
- Next, the classifier is trained on the transformed data.  
- At prediction time, the pipeline applies the same transformation before calling the model, so we do not need to invoke the transformer ourselves. 

In [20]:
# Step 6: Instantiate the modeling pipeline
pipeline = Pipeline([
    ("embeddings", EmbeddingsTransformer("graphEmbeddings_10_20.p")),  # Load the embeddings file
    ("model", MultiOutputClassifier(RandomForestClassifier()))
])

In [21]:
# Step 7: Define the parameter space and perform grid search
from sklearn.metrics import f1_score, make_scorer
from glob import glob
param_grid = {
    "embeddings__embeddings_file": glob("graphEmbeddings_*"),
    "model__estimator__n_estimators": [50, 100],
    "model__estimator__max_features": [0.2, 0.3],
}

# GridSearchCV calls a scoring callable as scorer(estimator, X, y), so a
# (y_true, y_pred) metric has to be wrapped with make_scorer
grid_search = GridSearchCV(pipeline, param_grid=param_grid, cv=5, n_jobs=-1,
                           scoring=make_scorer(f1_score, average='weighted'))
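
> To get a feel for the cost of this search, the sketch below expands a parameter grid by hand and counts the resulting model fits. It assumes that `glob` finds a single embeddings file; with more files on disk the counts grow proportionally.

```python
from itertools import product

param_grid = {
    # Assumption: only one precomputed embeddings file is present
    "embeddings__embeddings_file": ["graphEmbeddings_10_20.p"],
    "model__estimator__n_estimators": [50, 100],
    "model__estimator__max_features": [0.2, 0.3],
}

# Every combination of one value per parameter is one candidate
names = list(param_grid)
candidates = [dict(zip(names, values))
              for values in product(*param_grid.values())]

cv = 5  # each candidate is trained and scored once per fold
n_fits = len(candidates) * cv
print(len(candidates), n_fits)  # 4 20
```

> With the default `refit=True`, `GridSearchCV` additionally retrains the best candidate once on the full training set, which is the model exposed as `best_estimator_`.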



In [22]:
features

3020     (computer, terminal, systems, &, lt;cpml, >, c...
3022     (national, amusements, again, ups, viacom, &, ...
3023     (rogers, &, lt;rog, >, sees, 1st, qtr, net, up...
3024     (island, telephone, share, split, approved,  ,...
3025     (u.k, ., growing, impatient, with, japan, -, t...
                               ...                        
10783    (u.k, ., money, market, shortage, forecast, re...
10784    (knight, -, ridder, inc, &, lt;krn, >, sets, q...
10785    (technitrol, inc, &, lt;tnl, >, sets, quarterl...
10786    (nationwide, cellular, service, inc, &, lt;nce...
10787    (&, lt;a.h.a, ., automotive, technologies, cor...
Name: parsed, Length: 6489, dtype: object

In [23]:
# Step 8: Train the topic model
model = grid_search.fit(features, labels)



In [24]:
model.best_estimator_

> Next we **evaluate the performance of the model**, a random forest classifier whose hyperparameters were selected by **grid search**, on the test set that we split off from the dataset.

In [25]:
def get_predictions(model, features, topicsList):
    return pd.DataFrame(
        model.predict(features), 
        columns=topicsList, 
        index=features.index
    )

In [26]:
preds = get_predictions(model, get_features(test), topicsList)
labels = get_labels(test, topicsList)

In [27]:
preds

Unnamed: 0,earn,acq,money-fx,grain,crude,trade,interest,ship,wheat,corn
0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...
3012,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
3013,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
3014,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3015,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [28]:
labels

Unnamed: 0,earn,acq,money-fx,grain,crude,trade,interest,ship,wheat,corn
0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...
3012,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
3013,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
3014,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3015,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [29]:
(labels - preds).abs().sum()

earn         65.0
acq         108.0
money-fx     80.0
grain        30.0
crude        47.0
trade        52.0
interest     76.0
ship         49.0
wheat        48.0
corn         47.0
dtype: float64

In [30]:
(labels != preds).sum()

earn         65
acq         108
money-fx     80
grain        30
crude        47
trade        52
interest     76
ship         49
wheat        48
corn         47
dtype: int64

In [31]:
accuracy = 1 - (labels - preds).abs().sum().sum() / labels.abs().sum().sum()
accuracy

0.7839971295299606

The accuracy is calculated as follows:

\begin{equation}
\text{Accuracy} = 1 - \frac{\sum_{i,j} |\text{labels}_{ij} - \text{preds}_{ij}|}{\sum_{i,j} |\text{labels}_{ij}|}
\end{equation}

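A value around **0.78** does not mean that 78\% of the labels in the test set are predicted correctly. The numerator counts every mismatch between a true and a predicted label (false positives plus false negatives), while the denominator counts only the positive labels. The score therefore measures errors relative to the number of topic assignments that actually exist in the test set. This is stricter than element-wise accuracy, which would divide by all document-topic cells, most of which are correctly predicted zeros.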
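
The formula can be checked against numbers printed elsewhere in this notebook. The sketch below copies the per-topic mismatch counts from cell `In [30]`, takes the total number of positive test labels from the support column of the classification report (2787), and uses the test-set shape from cell `In [16]` (2545 documents, 10 topics). The variable names are chosen for illustration only.

```python
# Per-topic mismatch counts, copied from the output of cell In [30]
mismatches = {
    "earn": 65, "acq": 108, "money-fx": 80, "grain": 30, "crude": 47,
    "trade": 52, "interest": 76, "ship": 49, "wheat": 48, "corn": 47,
}
n_errors = sum(mismatches.values())

n_positive_labels = 2787          # total support in the classification report
n_documents, n_topics = 2545, 10  # shape of the test label matrix

# Score used in the notebook: errors relative to the positive labels
ratio = 1 - n_errors / n_positive_labels
# Element-wise accuracy: errors relative to all document-topic cells
cell_accuracy = 1 - n_errors / (n_documents * n_topics)

print(n_errors, round(ratio, 3), round(cell_accuracy, 3))  # 602 0.784 0.976
```

The first value reproduces the 0.78 reported above, while the element-wise accuracy is much higher because most cells of the label matrix are zeros that the model predicts correctly.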



> Now let's look at the precision, recall, and F1-score of this model.

In [32]:
from sklearn.metrics import classification_report

In [33]:
print(classification_report(labels, preds))

              precision    recall  f1-score   support

           0       0.97      0.97      0.97      1087
           1       0.96      0.88      0.92       719
           2       0.77      0.79      0.78       179
           3       0.95      0.84      0.89       149
           4       0.93      0.81      0.87       189
           5       0.87      0.66      0.75       117
           6       0.90      0.47      0.62       131
           7       0.83      0.56      0.67        89
           8       0.72      0.54      0.61        71
           9       0.70      0.29      0.41        56

   micro avg       0.94      0.84      0.89      2787
   macro avg       0.86      0.68      0.75      2787
weighted avg       0.93      0.84      0.88      2787
 samples avg       0.88      0.87      0.87      2787
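
> Rows 0 to 9 of the report follow the column order of `topicsList` (0 is `earn`, 9 is `corn`), since no names were passed; supplying `target_names=topicsList` to `classification_report` should print the topic names instead. As a sanity check on how the summary rows are built, the sketch below recomputes the macro and weighted F1 averages from the rounded per-topic values copied from the report. Because the inputs are rounded, agreement is only expected to two decimals.

```python
# (f1-score, support) per topic, copied from rows 0-9 of the report
rows = [
    (0.97, 1087), (0.92, 719), (0.78, 179), (0.89, 149), (0.87, 189),
    (0.75, 117), (0.62, 131), (0.67, 89), (0.61, 71), (0.41, 56),
]

# Macro average: every topic counts equally
macro_f1 = sum(f1 for f1, _ in rows) / len(rows)

# Weighted average: every topic counts in proportion to its support
total_support = sum(support for _, support in rows)
weighted_f1 = sum(f1 * support for f1, support in rows) / total_support

print(total_support, round(macro_f1, 2), round(weighted_f1, 2))  # 2787 0.75 0.88
```

> The gap between the two averages reflects the class imbalance: the frequent topics `earn` and `acq` are predicted well and dominate the weighted average, while rare topics such as `corn` drag the macro average down.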



