# lbl2vec 
### Made by Taiki Papandreou and Thomas Meeusen

This notebook illustrates how Lbl2Vec works on the provided Hugging Face dataset. The original notebook was made by @sebischair.

Link to the GitHub repository: https://github.com/sebischair/Lbl2Vec

### Data Set: 
https://github.com/mhjabreel/CharCnn_Keras/tree/master/data/ag_news_csv

The AG's news topic classification dataset is constructed by choosing the 4 largest classes from the original corpus. Each class contains 30,000 training samples and 1,900 test samples, for a total of 120,000 training samples and 7,600 test samples.

The classes are: 

* World
* Sports
* Business
* Science/Technology

#### For more information on how to use Lbl2Vec, visit the [API Guide](https://lbl2vec.readthedocs.io/en/latest/api.html#)

In [1]:
from lbl2vec import Lbl2Vec
import pandas as pd
from gensim.utils import simple_preprocess
from gensim.models.doc2vec import TaggedDocument
from gensim.parsing.preprocessing import strip_tags
from sklearn.metrics import f1_score

### Load data

In [2]:
# load train data
ag_train = pd.read_csv('data/train.csv',sep=',',header=None, names=['class','title','description'], skiprows=1)

# load test data
ag_test = pd.read_csv('data/test.csv',sep=',',header=None, names=['class','title','description'], skiprows=1)

# load labels with keywords
labels = pd.read_csv('data/labels.csv',sep=';')

# split keywords on spaces and store them as a list
labels['keywords'] = labels['keywords'].apply(lambda x: x.split(' '))

# convert keywords to lowercase
labels['keywords'] = labels['keywords'].apply(lambda keywords: [keyword.lower() for keyword in keywords])

# get number of keywords for each class
labels['number_of_keywords'] = labels['keywords'].apply(len)

In [3]:
# ag_train['class']=ag_train['class'].astype(str)
# ag_test['class']=ag_test['class'].astype(str)

In [4]:
labels

| | class_index | class_name | keywords | number_of_keywords |
|---|---|---|---|---|
| 0 | 0 | FakeAccidents | [police,, president,, u.s.,, state,, govern, ,... | 8 |
| 1 | 1 | RealAccidents | [crash, haiti, fire, rescue, quake, flood, pla... | 18 |
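
The `keywords` output above contains entries such as `police,` with a trailing comma, which suggests that the keyword strings mix commas and spaces as separators. Because the `tokenize` function used later strips punctuation, a keyword like `police,` cannot match any document token. The following is a minimal sketch of a more tolerant split, assuming keywords are separated by commas and/or whitespace. The helper name `split_keywords` and the sample string are hypothetical and not part of the original notebook.

```python
import re

def split_keywords(raw):
    """Split a keyword string on commas and/or whitespace, lowercase, drop empties."""
    return [kw.lower() for kw in re.split(r"[,\s]+", raw.strip()) if kw]

# Hypothetical sample string that mimics the first row of the labels output
sample = "police, president, state, govern"

print(sample.split(" "))       # naive split keeps the trailing commas
print(split_keywords(sample))  # tolerant split removes them
```

If this were adopted, `number_of_keywords` and the trained label embeddings would change, so the scores reported in this notebook would no longer apply as-is.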


### Tokenize data

In [5]:
# doc: document text string
# returns the tokenized document
# strip_tags removes markup tags from the text
# simple_preprocess converts a document into a list of lowercase tokens, ignoring tokens that are too short or too long
# simple_preprocess also removes numerical values as well as punctuation characters
def tokenize(doc):
    return simple_preprocess(strip_tags(doc), deacc=True, min_len=2, max_len=15)
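
To see what this preprocessing does to a single string, here is a rough standard-library approximation of the steps described in the comments above (tag removal, lowercasing, accent removal, dropping digits and punctuation, length filtering). It is a sketch for illustration only and not gensim's implementation; the name `approx_tokenize` and the sample sentence are made up, and edge cases such as underscores may be handled differently by gensim.

```python
import re
import unicodedata

def approx_tokenize(doc, min_len=2, max_len=15):
    """Rough stdlib approximation of tokenize(); not gensim's implementation."""
    text = re.sub(r"<[^>]+>", "", doc)                 # drop markup tags
    text = unicodedata.normalize("NFD", text.lower())  # separate accents from letters
    text = "".join(ch for ch in text if unicodedata.category(ch) != "Mn")
    tokens = re.findall(r"[^\W\d_]+", text)            # keep runs of letters only
    return [t for t in tokens if min_len <= len(t) <= max_len]

print(approx_tokenize("<b>Café</b> prices rose 3% in 2004, a record."))
# ['cafe', 'prices', 'rose', 'in', 'record']
```

Note that the one-letter token `a` and the numbers are gone, which is why keywords should also be plain lowercase words of at least two letters.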

In [6]:
# add data set type column
ag_train['data_set_type'] = 'train'
ag_test['data_set_type'] = 'test'

# concat train and test data
ag_full_corpus = pd.concat([ag_train,ag_test]).reset_index(drop=True)

In [7]:
# tokenize and tag documents combined title + description for Lbl2Vec training
ag_full_corpus['tagged_docs'] = ag_full_corpus.apply(lambda row: TaggedDocument(tokenize(row['title'] + '. ' + row['description']), [str(row.name)]), axis=1)
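
The tagging step pairs each token list with a one-element tag list holding the row index as a string; the same string is stored as `doc_key` in the next cell, which is what later allows predictions to be merged back onto the corpus. The sketch below mirrors that structure with a stand-in `namedtuple` (assumed to have the same `words` and `tags` fields as gensim's `TaggedDocument`), two made-up rows, and a crude whitespace split in place of `tokenize`.

```python
from collections import namedtuple

# Stand-in for gensim's TaggedDocument (assumption: a namedtuple with words and tags)
TaggedDoc = namedtuple("TaggedDoc", ["words", "tags"])

# Hypothetical (title, description) pairs
rows = [
    ("Stocks rally", "Markets closed higher on Friday"),
    ("Cup final", "The home team won on penalties"),
]

# One tagged document per row: title and description joined, tag = row index as string
tagged = [
    TaggedDoc(words=f"{title}. {desc}".lower().split(), tags=[str(i)])
    for i, (title, desc) in enumerate(rows)
]

for doc in tagged:
    print(doc.tags, doc.words[:3])
```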

In [8]:
# add doc_key column
ag_full_corpus['doc_key'] = ag_full_corpus.index.astype(str)

In [9]:
# add class_name column
ag_full_corpus = ag_full_corpus.merge(labels, left_on='class', right_on='class_index', how='left').drop(['class', 'keywords'], axis=1)

In [10]:
ag_full_corpus.head()

| | title | description | data_set_type | tagged_docs | doc_key | class_index | class_name | number_of_keywords |
|---|---|---|---|---|---|---|---|---|
| 0 | U.N. meeting in Hong Kong to draw up new conve... | HONG KONG, China (CNN) -- The United Nation's ... | train | ([meeting, in, hong, kong, to, draw, up, new, ... | 0 | 0 | FakeAccidents | 8 |
| 1 | Eleven people have been killed already this mo... | (CNN) -- One 12-year-old Virginia boy was play... | train | ([eleven, people, have, been, killed, already,... | 1 | 0 | FakeAccidents | 8 |
| 2 | Sean Mulveyhill, Kayla Narey and Austin Renaud... | Northampton, Massachusetts (CNN) -- Three tee... | train | ([sean, mulveyhill, kayla, narey, and, austin,... | 2 | 0 | FakeAccidents | 8 |
| 3 | Gordon Brown: Afghanistan campaign crucial to ... | LONDON, England (CNN) -- British Prime Ministe... | train | ([gordon, brown, afghanistan, campaign, crucia... | 3 | 0 | FakeAccidents | 8 |
| 4 | Dave Matthews: "I just see [racism] everywhere... | LOS ANGELES, California (CNN) -- Watching the ... | train | ([dave, matthews, just, see, racism, everywher... | 4 | 0 | FakeAccidents | 8 |


# Train Lbl2Vec

Train a new model from scratch with the following parameters:
* keywords_list : iterable list of lists with descriptive keywords for each topic.
* tagged_documents : iterable list of [gensim.models.doc2vec.TaggedDocument](https://radimrehurek.com/gensim/models/doc2vec.html#gensim.models.doc2vec.TaggedDocument) elements. Each element consists of one document.
* label_names : iterable list of custom names for each label. Label names and keywords of the same topic must have the same index.
* similarity_threshold : only documents whose similarity to the respective description keywords is higher than this threshold are used to calculate the label embedding.
* min_num_docs : minimum number of documents that are used to calculate the label embedding. 
* epochs : number of iterations over the corpus.
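
To make `similarity_threshold` and `min_num_docs` more concrete, the following is a simplified sketch of how a label vector could be derived from the parameters listed above. It assumes that the keyword vectors are averaged, that documents are ranked by cosine similarity to that average, and that the label embedding is the mean of the selected document vectors. The toy 2-D vectors and the helper names (`cosine`, `mean_vector`, `label_embedding`) are invented for illustration; the actual Lbl2Vec implementation may differ in its details.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def mean_vector(vectors):
    """Component-wise mean of a list of vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def label_embedding(keyword_vecs, doc_vecs, similarity_threshold, min_num_docs):
    """Simplified illustration, not the library's code."""
    query = mean_vector(keyword_vecs)
    ranked = sorted(doc_vecs, key=lambda d: cosine(query, d), reverse=True)
    # keep documents that are more similar to the keywords than the threshold
    selected = [d for d in ranked if cosine(query, d) > similarity_threshold]
    # fall back to the top-ranked documents if too few pass the threshold
    if len(selected) < min_num_docs:
        selected = ranked[:min_num_docs]
    return mean_vector(selected)

# Toy 2-D vectors (hypothetical)
keyword_vecs = [[1.0, 0.0], [0.8, 0.2]]
doc_vecs = [[0.9, 0.1], [0.7, 0.3], [0.0, 1.0], [-1.0, 0.0]]

emb = label_embedding(keyword_vecs, doc_vecs, similarity_threshold=0.30, min_num_docs=2)
print(emb)
```

In this toy example only the first two documents pass the threshold, so the label vector lands between them and is unaffected by the two unrelated documents.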

In [11]:
# init model with parameters
lbl2vec_model = Lbl2Vec(keywords_list=list(labels['keywords']), tagged_documents=ag_full_corpus['tagged_docs'], label_names=list(labels['class_name']), similarity_threshold=0.30, min_num_docs=100, epochs=10)

In [12]:
# train model
lbl2vec_model.fit()

2022-06-20 13:27:09,354 - Lbl2Vec - INFO - Train document and word embeddings
2022-06-20 13:37:28,294 - Lbl2Vec - INFO - Train label embeddings


# Predict document topics of documents used to train Lbl2Vec

Compute the similarity between the learned vector of each document that was used to train the model and each of the learned label vectors. The scores are cosine similarities and therefore lie in the range [-1, 1].
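
As a short reminder of where that range comes from, write $d$ for a document vector and $l$ for a label vector (notation introduced here, not taken from the library). The bound follows from the Cauchy-Schwarz inequality:

$$
\cos(d, l) = \frac{d \cdot l}{\lVert d \rVert \, \lVert l \rVert},
\qquad
|d \cdot l| \le \lVert d \rVert \, \lVert l \rVert
\;\Rightarrow\;
-1 \le \cos(d, l) \le 1
$$

A score near 1 means the document points in almost the same direction as the label vector, a score near 0 means the two are unrelated, and a negative score means they point in opposing directions.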

In [13]:
# predict similarity scores
model_docs_lbl_similarities = lbl2vec_model.predict_model_docs()

2022-06-20 13:37:28,632 - Lbl2Vec - INFO - Get document embeddings from model
2022-06-20 13:37:28,727 - Lbl2Vec - INFO - Calculate document<->label similarities


In [14]:
model_docs_lbl_similarities.head()

| | doc_key | most_similar_label | highest_similarity_score | FakeAccidents | RealAccidents |
|---|---|---|---|---|---|
| 0 | 0 | RealAccidents | 0.20486 | 0.088236 | 0.20486 |
| 1 | 1 | RealAccidents | 0.367604 | 0.180544 | 0.367604 |
| 2 | 2 | FakeAccidents | 0.240857 | 0.240857 | 0.206558 |
| 3 | 3 | FakeAccidents | 0.341001 | 0.341001 | 0.237668 |
| 4 | 4 | RealAccidents | 0.135777 | 0.085611 | 0.135777 |
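
In the rows shown above, `most_similar_label` is the label column with the largest value and `highest_similarity_score` is that value. The sketch below reproduces this relationship for two of the rows using plain dictionaries; the helper name `most_similar` is hypothetical, and the assumption that the library derives the columns exactly this way is based only on the output shown here.

```python
# Rows copied from the prediction output above: doc_key -> {label: cosine similarity}
scores = {
    "0": {"FakeAccidents": 0.088236, "RealAccidents": 0.20486},
    "2": {"FakeAccidents": 0.240857, "RealAccidents": 0.206558},
}

def most_similar(label_scores):
    """Return (label, score) for the label with the highest similarity."""
    label = max(label_scores, key=label_scores.get)
    return label, label_scores[label]

for doc_key, label_scores in scores.items():
    print(doc_key, *most_similar(label_scores))
# 0 RealAccidents 0.20486
# 2 FakeAccidents 0.240857
```

Note how small the margin is for document 2 (0.240857 versus 0.206558): predictions with nearly equal scores are the ones most likely to flip between labels.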


## Evaluate prediction of documents used to train Lbl2Vec

In [15]:
# merge DataFrames to compare the predicted and true topic labels
evaluation_train = model_docs_lbl_similarities.merge(ag_full_corpus[ag_full_corpus['data_set_type']=='train'], left_on='doc_key', right_on='doc_key')

In [16]:
y_true_train = evaluation_train['class_name']
y_pred_train = evaluation_train['most_similar_label']
print('F1 score:',f1_score(y_true_train, y_pred_train, average='micro'))

F1 score: 0.357
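
Because every document here has exactly one true label and one predicted label, the micro-averaged F1 score is the same number as plain accuracy: each wrong prediction counts once as a false positive (for the predicted class) and once as a false negative (for the true class). The sketch below checks this by hand on a made-up set of five predictions; the function name `micro_f1` and the toy labels are illustrative and are not the notebook's data.

```python
from collections import Counter

def micro_f1(y_true, y_pred):
    """Micro-averaged F1 for single-label classification, computed from pooled counts."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative
    n_tp, n_fp, n_fn = sum(tp.values()), sum(fp.values()), sum(fn.values())
    return 2 * n_tp / (2 * n_tp + n_fp + n_fn)

# Toy labels (hypothetical)
y_true = ["FakeAccidents", "FakeAccidents", "RealAccidents", "RealAccidents", "RealAccidents"]
y_pred = ["RealAccidents", "FakeAccidents", "RealAccidents", "FakeAccidents", "RealAccidents"]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(micro_f1(y_true, y_pred), accuracy)
```

The reported score can therefore be read as the fraction of documents whose predicted label matches the true label.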


# Well then how about test data?

In [17]:
# predict similarity scores of new test documents (they were not used during Lbl2Vec training)
new_docs_lbl_similarities = lbl2vec_model.predict_new_docs(tagged_docs=ag_full_corpus['tagged_docs'][ag_full_corpus['data_set_type']=='test'])

2022-06-20 13:37:41,336 - Lbl2Vec - INFO - Calculate document embeddings
2022-06-20 13:38:01,701 - Lbl2Vec - INFO - Calculate document<->label similarities


In [18]:
# merge DataFrames to compare the predicted and true topic labels
evaluation_test = new_docs_lbl_similarities.merge(ag_full_corpus[ag_full_corpus['data_set_type']=='test'], left_on='doc_key', right_on='doc_key')

In [19]:
y_true_test = evaluation_test['class_name']
y_pred_test = evaluation_test['most_similar_label']
print('F1 score:',f1_score(y_true_test, y_pred_test, average='micro'))

F1 score: 0.336
