# Shallow-Learning Topic Modelling

This notebook shows how to build a topic model with a shallow-learning approach. It uses the results and the embeddings obtained from the document-document projection of the bipartite graph.

**NOTE: Run this notebook only after the 01_nlp_graph_creation notebook, because it reuses some of the results computed there.**

### Load Dataset

In [1]:
import pandas as pd

In [2]:
corpus = pd.read_pickle("corpus.p")

In [3]:
from collections import Counter
topics = Counter([label for document_labels in corpus["label"] for label in document_labels]).most_common(10)

In [4]:
topics

[('earn', 3964),
 ('acq', 2369),
 ('money-fx', 717),
 ('grain', 582),
 ('crude', 578),
 ('trade', 485),
 ('interest', 478),
 ('ship', 286),
 ('wheat', 283),
 ('corn', 237)]

In [5]:
topicsList = [topic[0] for topic in topics]
topicsSet = set(topicsList)
dataset = corpus[corpus["label"].apply(lambda x: len(topicsSet.intersection(x))>0)]
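
The filter above keeps a document when at least one of its labels is among the ten most frequent topics. A minimal sketch of the same logic on a made-up three-document frame (the `toy` frame, its index and its labels are illustrative and not taken from the corpus):

```python
import pandas as pd

# Hypothetical miniature corpus; only the "label" column matters here.
toy = pd.DataFrame(
    {"label": [["earn"], ["cocoa"], ["grain", "cocoa"]]},
    index=["doc_a", "doc_b", "doc_c"],
)
toy_topics = {"earn", "grain"}

# Keep rows whose label list shares at least one element with the topic set.
mask = toy["label"].apply(lambda labels: len(toy_topics.intersection(labels)) > 0)
toy_dataset = toy[mask]
print(list(toy_dataset.index))  # ['doc_a', 'doc_c']
```

Note that a kept document retains all of its labels, including the ones outside the topic set.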

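Create a class that "simulates" the training of the embeddings: `fit` only loads precomputed embeddings from a file, which lets the file name be tuned like any other hyperparameter.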

In [6]:
from sklearn.base import BaseEstimator

class EmbeddingsTransformer(BaseEstimator):
    
    def __init__(self, embeddings_file):
        self.embeddings_file = embeddings_file
        
    def fit(self, *args, **kwargs):
        self.embeddings = pd.read_pickle(self.embeddings_file)
        return self
        
    def transform(self, X):
        return self.embeddings.loc[X.index]
    
    def fit_transform(self, X, y=None):
        return self.fit().transform(X)
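
To illustrate how such a lookup-style transformer behaves, here is a self-contained sketch. It repeats the class (with `y` made optional) and runs it against a tiny, made-up embeddings table written to a temporary file; the file name, document ids and vector values are all assumptions for illustration.

```python
import os
import tempfile

import pandas as pd
from sklearn.base import BaseEstimator


class EmbeddingsTransformer(BaseEstimator):
    """'Fitting' only loads a pickled DataFrame of precomputed embeddings."""

    def __init__(self, embeddings_file):
        self.embeddings_file = embeddings_file

    def fit(self, *args, **kwargs):
        self.embeddings = pd.read_pickle(self.embeddings_file)
        return self

    def transform(self, X):
        # Rows are selected by the index of X; the values of X are ignored.
        return self.embeddings.loc[X.index]

    def fit_transform(self, X, y=None):
        return self.fit().transform(X)


# Hypothetical 2-dimensional embeddings for three documents.
toy_embeddings = pd.DataFrame(
    [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],
    index=["doc_a", "doc_b", "doc_c"],
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_embeddings.p")
    toy_embeddings.to_pickle(path)

    X = pd.Series(["some text", "other text"], index=["doc_c", "doc_a"])
    looked_up = EmbeddingsTransformer(path).fit_transform(X)

print(looked_up)
```

Because `transform` relies on `.loc`, any document id missing from the embeddings file raises a `KeyError`, so the input has to be restricted to known ids first.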



In [None]:
from glob import glob 
files = glob("./embeddings/*")

In [15]:
graphEmbeddings = EmbeddingsTransformer(files[0]).fit()

### Train/Test split

In [16]:
def get_labels(corpus, topicsList=topicsList):
    return corpus["label"].apply(
        lambda labels: pd.Series({label: 1 for label in labels}).reindex(topicsList).fillna(0)
    )[topicsList]
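
`get_labels` turns each document's label list into a row of 0/1 indicators, one column per topic. As a possible alternative (not what this notebook uses), scikit-learn's `MultiLabelBinarizer` builds the same kind of matrix; the sketch below uses made-up labels and a shortened topic list.

```python
import warnings

import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

toy_topics = ["earn", "acq", "grain"]
toy_labels = pd.Series(
    [["earn"], ["grain", "cocoa"], ["acq", "earn"]],
    index=["doc_a", "doc_b", "doc_c"],
)

# Fixing `classes` pins the column order and ignores labels outside the list.
binarizer = MultiLabelBinarizer(classes=toy_topics)
with warnings.catch_warnings():
    # scikit-learn warns about the unknown label "cocoa"; silence it here.
    warnings.simplefilter("ignore", UserWarning)
    matrix = binarizer.fit_transform(toy_labels)

indicator = pd.DataFrame(matrix, columns=toy_topics, index=toy_labels.index)
print(indicator)
```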

In [17]:
def get_features(corpus):
    # Return the parsed documents; the pipeline's "embeddings" step maps their
    # index to vectors, so graphEmbeddings.transform is not called here.
    return corpus["parsed"]

In [18]:
def get_features_and_labels(corpus):
    return get_features(corpus), get_labels(corpus)

In [19]:
def train_test_split(corpus):
    graphIndex = [index for index in corpus.index if index in graphEmbeddings.embeddings.index]
    
    train_idx = [idx for idx in graphIndex if "training/" in idx]
    test_idx = [idx for idx in graphIndex if "test/" in idx]
    return corpus.loc[train_idx], corpus.loc[test_idx]

In [20]:
train, test = train_test_split(dataset)

### Build the model and cross-validation

In [21]:
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier 
from sklearn.multioutput import MultiOutputClassifier

In [22]:
model = MultiOutputClassifier(RandomForestClassifier())

In [23]:
pipeline = Pipeline([
    ("embeddings", graphEmbeddings),
    ("model", model)
])

In [24]:
from sklearn.model_selection import GridSearchCV

In [25]:
from sklearn.model_selection import RandomizedSearchCV

In [27]:
files

['./bipartiteGraphEmbeddings_20_10.p',
 './bipartiteGraphEmbeddings_10.p',
 './bipartiteGraphEmbeddings_20_30.p',
 './bipartiteGraphEmbeddings_20.p',
 './bipartiteGraphEmbeddings_20_20.p',
 './bipartiteGraphEmbeddings_30.p']

In [237]:
param_grid = {
    "embeddings__embeddings_file": files,
    "model__estimator__n_estimators": [50, 100],
    # "auto" (an alias of "sqrt" for classifiers) was removed in scikit-learn 1.3
    "model__estimator__max_features": [0.2, 0.3, "sqrt"],
    #"model__estimator__max_depth": [3, 5]
}
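
The keys of the grid follow scikit-learn's `<step>__<parameter>` convention, and `MultiOutputClassifier` exposes the wrapped classifier under `estimator`, hence `model__estimator__...`. A small sketch, with placeholder file names standing in for the real embedding files, shows how many candidates such a grid expands to:

```python
from sklearn.model_selection import ParameterGrid

# Hypothetical file names standing in for the six embedding files.
toy_files = [f"embeddings_{i}.p" for i in range(6)]

toy_grid = {
    "embeddings__embeddings_file": toy_files,
    "model__estimator__n_estimators": [50, 100],
    "model__estimator__max_features": [0.2, 0.3, "sqrt"],
}

candidates = list(ParameterGrid(toy_grid))
print(len(candidates))  # 6 * 2 * 3 = 36 combinations
```

With 5-fold cross-validation this amounts to 180 fits, plus one final refit on the best candidate.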

In [255]:
features, labels = get_features_and_labels(train)

In [256]:
from sklearn.metrics import f1_score 

In [257]:
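# A scoring callable must have the signature (estimator, X, y); a plain
# (y_true, y_pred) lambda fails inside the search and every score becomes nan.
# The built-in "f1_weighted" scorer computes f1_score(..., average="weighted").
grid_search = GridSearchCV(pipeline, param_grid=param_grid, cv=5, n_jobs=-1,
                           scoring="f1_weighted")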
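
`GridSearchCV` accepts either a scorer name or a callable that is invoked as `scorer(estimator, X, y)`. The sketch below, on made-up multi-label data with a `DummyClassifier` standing in for the real model, shows how `make_scorer` adapts a metric of the form `f(y_true, y_pred)` to that signature.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import f1_score, make_scorer

# Toy multi-label problem: 4 samples, 2 labels (values are made up).
X = np.zeros((4, 1))
y = np.array([[1, 0], [1, 1], [0, 1], [1, 1]])

# make_scorer wraps the metric so it can be called as scorer(estimator, X, y).
weighted_f1 = make_scorer(f1_score, average="weighted")

# The dummy model predicts the most frequent value of each label: [1, 1].
clf = DummyClassifier(strategy="most_frequent").fit(X, y)
score = weighted_f1(clf, X, y)
print(score)
```

Each label has precision 3/4 and recall 1 here, so the weighted F1 is 6/7.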

In [258]:
model = grid_search.fit(features, labels)



In [259]:
model

GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('embeddings',
                                        EmbeddingsTransformer(embeddings_file='bipartiteGraphEmbeddings_20.p')),
                                       ('model',
                                        MultiOutputClassifier(estimator=RandomForestClassifier(class_weight='balanced')))]),
             n_jobs=-1,
             param_grid={'embeddings__embeddings_file': ['./bipartiteGraphEmbeddings_20_10.p',
                                                         './bipartiteGraphEmbeddings_10.p',
                                                         './bipartiteGraphEmbeddings_20_30.p',
                                                         './bipartiteGraphEmbeddings_20.p',
                                                         './bipartiteGraphEmbeddings_20_20.p',
                                                         './bipartiteGraphEmbeddings_30.p'],
                         'model__estimator__max_featur

In [260]:
model.best_params_

{'embeddings__embeddings_file': './bipartiteGraphEmbeddings_20_10.p',
 'model__estimator__max_features': 0.2,
 'model__estimator__n_estimators': 50}

### Evaluate performance

In [261]:
def get_predictions(model, features):
    return pd.DataFrame(
        model.predict(features), 
        columns=topicsList, 
        index=features.index
    )

In [262]:
preds = get_predictions(model, get_features(test))
labels = get_labels(test)

In [263]:
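# Despite its name, this is a score: 1 - (wrong label assignments / true labels).
# Higher is better; false positives and false negatives both count as wrong.
errors = 1 - (labels - preds).abs().sum().sum() / labels.abs().sum().sum()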

In [264]:
errors

0.6702547542160029
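
To make the quantity above concrete, here is a worked example on two made-up indicator tables (three documents, two topics). It assumes the same formula: one minus the number of mismatched cells divided by the number of true labels.

```python
import pandas as pd

toy_truth = pd.DataFrame({"earn": [1, 0, 1], "acq": [0, 1, 1]})
toy_preds = pd.DataFrame({"earn": [1, 0, 0], "acq": [0, 1, 1]})

# One wrong cell (a missed "earn") out of four true labels.
wrong = (toy_truth - toy_preds).abs().sum().sum()
total_true = toy_truth.abs().sum().sum()
score = 1 - wrong / total_true
print(score)  # 0.75
```

Since false positives also add to the numerator, the value is not bounded below by zero: a model that predicts many spurious labels can score negative.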

In [265]:
from sklearn.metrics import classification_report

In [266]:
print(classification_report(labels, preds))

              precision    recall  f1-score   support

           0       0.97      0.94      0.95      1087
           1       0.93      0.74      0.83       719
           2       0.79      0.45      0.57       179
           3       0.96      0.64      0.77       149
           4       0.95      0.59      0.73       189
           5       0.95      0.45      0.61       117
           6       0.87      0.41      0.56       131
           7       0.83      0.21      0.34        89
           8       0.69      0.34      0.45        71
           9       0.61      0.25      0.35        56

   micro avg       0.94      0.72      0.81      2787
   macro avg       0.85      0.50      0.62      2787
weighted avg       0.92      0.72      0.79      2787
 samples avg       0.76      0.75      0.75      2787
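
The report above identifies each topic only by its column position (0 to 9, in the order of `topicsList`). A minimal sketch on made-up data shows how `target_names` could be passed so the rows carry topic names instead; `toy_topics` and the arrays are illustrative.

```python
import numpy as np
from sklearn.metrics import classification_report

toy_topics = ["earn", "acq"]
y_true = np.array([[1, 0], [1, 1], [0, 1], [1, 0]])
y_pred = np.array([[1, 0], [1, 1], [0, 1], [1, 1]])

# zero_division=0 scores undefined precision/recall as 0 without a warning.
report = classification_report(
    y_true, y_pred, target_names=toy_topics, zero_division=0
)
print(report)
```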



