<a href="https://colab.research.google.com/github/Ankur3107/Machine-Learning-Notes/blob/master/topic_modeling/Kitty_Human_in_the_loop_Classifier.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Tutorial: Kitty: Human-in-the-loop Classifier

(last updated 20-09-2021)

In this tutorial, we use **Kitty** to classify documents with a human-in-the-loop approach supported by Contextualized Topic Models.

![](https://raw.githubusercontent.com/MilaNLProc/contextualized-topic-models/master/img/logo_kitty.png)

## Side Note: Contextualized Topic Models

![](https://raw.githubusercontent.com/MilaNLProc/contextualized-topic-models/master/img/logo.png)

What are Contextualized Topic Models? **CTMs** are a family of topic models that combine the expressive power of BERT embeddings with the unsupervised capabilities of topic models to get topics out of documents. 

## Python Package

You can find our package [here](https://github.com/MilaNLProc/contextualized-topic-models).

![https://github.com/MilaNLProc/contextualized-topic-models/actions](https://github.com/MilaNLProc/contextualized-topic-models/workflows/Python%20package/badge.svg) ![https://pypi.python.org/pypi/contextualized_topic_models](https://img.shields.io/pypi/v/contextualized_topic_models.svg) ![https://pepy.tech/badge/contextualized-topic-models](https://pepy.tech/badge/contextualized-topic-models)



In [None]:
%%capture
!pip install contextualized-topic-models==2.2.0

### Restart Runtime

In [None]:
from contextualized_topic_models.models.kitty_classifier import Kitty
from contextualized_topic_models.utils.data_preparation import TopicModelDataPreparation
from contextualized_topic_models.utils.preprocessing import WhiteSpacePreprocessing
import nltk
import torch
import random
import numpy as np
nltk.download("stopwords")

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True


We now fix the random seeds so that we can replicate the results.


In [None]:
torch.manual_seed(10)
torch.cuda.manual_seed(10)
np.random.seed(10)
random.seed(10)
nltk.download('stopwords')
torch.backends.cudnn.enabled = False
torch.backends.cudnn.deterministic = True

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
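
The seeding calls above can be bundled into one helper so that every run starts from the same state. This is only a minimal sketch: `seed_everything` is a hypothetical name and not part of the package, and the NumPy and PyTorch steps are skipped when those libraries are not installed.

```python
import random


def seed_everything(seed: int) -> None:
    """Seed Python, NumPy and PyTorch (when available) with the same value."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # NumPy not installed: nothing to seed
    try:
        import torch
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)  # silently ignored without CUDA
        torch.backends.cudnn.deterministic = True
    except ImportError:
        pass  # PyTorch not installed: nothing to seed


seed_everything(10)
```

Calling the helper again with the same seed before a second training run should make the two runs draw the same random numbers.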


## Download Sample Data

In [None]:
%%capture
!wget https://raw.githubusercontent.com/vinid/data/master/dbpedia_sample_abstract_20k_unprep.txt


In [None]:
with open("dbpedia_sample_abstract_20k_unprep.txt", encoding="utf-8") as f:
    training = [line.strip() for line in f]

In [None]:
!head dbpedia_sample_abstract_20k_unprep.txt

The Mid-Peninsula Highway is a proposed freeway across the Niagara Peninsula in the Canadian province of Ontario. Although plans for a highway connecting Hamilton to Fort Erie south of the Niagara Escarpment have surfaced for decades,it was not until The Niagara Frontier International Gateway Study was published by the Ministry
Monte Zucker (died March 15, 2007) was an American photographer. He specialized in wedding photography, entering it as a profession in 1947. In the 1970s he operated a studio in Silver Spring, Maryland. Later he lived in Florida. He was Brides Magazine's Wedding Photographer of the Year for 1990 and
Henry Howard, 13th Earl of Suffolk, 6th Earl of Berkshire (8 August 1779 – 10 August 1779) was a British peer, the son of Henry Howard, 12th Earl of Suffolk. His father died on 7 March 1779, leaving behind his pregnant widow. The Earldom of Suffolk became dormant until she
Marinko Matošević (Croatian pronunciation: [mariŋko matoʃeʋit͡ɕ]; born 8 August 1985) is an Aus

## Train

The first training run with Kitty downloads the embedding model and its supporting files. We run Kitty with an English embedding model (paraphrase-distilroberta-base-v2) and specify the language so that some pre-processing can be applied to the text.



In [None]:
kt = Kitty()
kt.train(training, topics=5, embedding_model="paraphrase-distilroberta-base-v2", language="english")

Epoch: [10/10]	 Seen Samples: [20000/20000]	Train Loss: 134.7000938720703	Time: 0:00:01.120163: : 10it [00:11,  1.13s/it]
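
To give a feel for what language-dependent pre-processing usually involves, here is a rough stand-in that lowercases text, strips punctuation and drops stopwords. It is not Kitty's actual implementation: `simple_preprocess` and the tiny `TOY_STOPWORDS` set are assumptions made for illustration, whereas the notebook itself downloads the full NLTK stopword list.

```python
import string

# Tiny illustrative stopword list (an assumption, not the NLTK list).
TOY_STOPWORDS = {"the", "of", "is", "a", "in", "that", "for", "very"}


def simple_preprocess(text, stopwords=TOY_STOPWORDS):
    """Lowercase, strip ASCII punctuation, and drop stopwords."""
    lowered = text.lower()
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    return [token for token in no_punct.split() if token not in stopwords]


tokens = simple_preprocess("The village of Puza is a very nice village in Italy.")
print(tokens)  # ['village', 'puza', 'nice', 'village', 'italy']
```

The topic words shown later are drawn from a cleaned vocabulary of this kind, which is why they contain no function words.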


# Let's see the topics 

We can check the topics that the model has learned.

In [None]:
print(kt.pretty_print_word_classes())

0	family, plant, types, type, moth
1	district, mi, area, village, west
2	released, series, television, album, film
3	school, station, historic, public, states
4	born, football, team, played, season


# Let's assign the topics to labels

Note: with newer versions of the packages, this mapping can change due to randomness. If that happens, you just need to update the labels.

In [None]:
kt.assigned_classes = {0 : "nature", 1 : "location", 2 : "entertainment", 3 : "shop/offices", 4: "sport"}
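
Because topic ids can shift between runs, it can help to check a mapping against the topic words before relying on it. The sketch below parses the tab-separated layout shown in the output above and flags labels whose "anchor" word is no longer in the topic they point to. `parse_word_classes`, `stale_labels` and the `anchor_words` dictionary are hypothetical helpers, not part of the package.

```python
def parse_word_classes(printed):
    """Turn 'id<TAB>w1, w2, ...' lines into {id: [w1, w2, ...]}."""
    topics = {}
    for line in printed.splitlines():
        if not line.strip():
            continue  # skip blank lines
        topic_id, words = line.split("\t", 1)
        topics[int(topic_id)] = [word.strip() for word in words.split(",")]
    return topics


def stale_labels(topics, assigned, anchor_words):
    """Return labels whose anchor word is missing from their assigned topic."""
    stale = []
    for topic_id, label in assigned.items():
        anchor = anchor_words.get(label)
        if anchor is not None and anchor not in topics.get(topic_id, []):
            stale.append(label)
    return stale


# The topic listing printed earlier in this notebook.
printed = (
    "0\tfamily, plant, types, type, moth\n"
    "1\tdistrict, mi, area, village, west\n"
    "2\treleased, series, television, album, film\n"
    "3\tschool, station, historic, public, states\n"
    "4\tborn, football, team, played, season\n"
)
topics = parse_word_classes(printed)
assigned = {0: "nature", 1: "location", 2: "entertainment", 3: "shop/offices", 4: "sport"}
anchor_words = {"nature": "plant", "location": "village", "sport": "football"}  # illustrative choice
print(stale_labels(topics, assigned, anchor_words))  # []
```

In the notebook, the string returned by `kt.pretty_print_word_classes()` could be passed to `parse_word_classes` instead of the hard-coded sample, assuming it keeps the layout printed above.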

# Let's predict the labels of new documents

In [None]:
kt.predict(["the village of Puza is a very nice village in Italy", "Pussetto is a soccer player that currently plays for Udinese Calcio"])

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Sampling: [20/20]: : 20it [00:02,  6.91it/s]


['location', 'sport']

Note that nothing prevents you from mapping multiple topics to the same label. You don't even need to map all the topics: unmapped topics are automatically mapped to "other".

In [None]:
kt.assigned_classes = {0 : "nature", 1 : "location", 3 : "location", 4: "sport"}
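
The many-to-one mapping and the "other" fallback can be pictured with a few lines of plain Python. This sketch assumes that a prediction picks the most probable topic for each document and then looks up its label. The function name and the toy topic distributions are made up for illustration and do not come from the model.

```python
def labels_from_distributions(distributions, assigned_classes, fallback="other"):
    """Pick the most probable topic per document and map it to a label."""
    labels = []
    for dist in distributions:
        top_topic = max(range(len(dist)), key=dist.__getitem__)
        labels.append(assigned_classes.get(top_topic, fallback))
    return labels


assigned = {0: "nature", 1: "location", 3: "location", 4: "sport"}

# Toy topic distributions over 5 topics (invented numbers).
toy_distributions = [
    [0.05, 0.70, 0.05, 0.15, 0.05],  # topic 1 wins
    [0.10, 0.10, 0.60, 0.10, 0.10],  # topic 2 wins, but it has no label
    [0.10, 0.10, 0.10, 0.60, 0.10],  # topic 3 wins
]
labels = labels_from_distributions(toy_distributions, assigned)
print(labels)  # ['location', 'other', 'location']
```

Topics 1 and 3 both resolve to "location", while the unmapped topic 2 falls back to "other".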

# Interactive Mapping

We also have a very simple widget that can help you feed the mapped labels to Kitty!

You just need to fill the empty fields with the labels you want and click "Save". You don't even need to fill all the fields, just the ones you are interested in.

In [None]:
kt.widget_annotation()

Text(value='', description='0 -  season, film, competition, event, tournament, game, cup, women, annual, team'…

Text(value='', description='1 -  west, located, population, station, km, county, south, village, census, capit…

Text(value='', description='2 -  french, term, long, lead, released, use, album, songwriter, studio, rock', la…

Text(value='', description='3 -  mm, lake, south, family, moth, range, navy, war, politician, general', layout…

Text(value='', description='4 -  played, born, competed, professional, career, olympics, player, university, r…

Button(description='Save', style=ButtonStyle(button_color='lightgreen'))

In [None]:
kt.assigned_classes

{0: 'other', 1: 'location', 2: 'other', 3: 'other', 4: 'sport'}

Thus, we can use this on our data

In [None]:
kt.predict(["the village of Puza is a very nice village in Italy", "Pussetto is a soccer player that currently plays for Udinese Calcio"])
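
When several topics are left as "other", it is worth checking how much of your data ends up there. The helper below is a small sketch that computes the share of each predicted label. `label_shares` is a hypothetical name, and the example list is invented rather than produced by the model.

```python
from collections import Counter


def label_shares(labels):
    """Return the fraction of predictions that fall under each label."""
    if not labels:
        return {}
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}


# Invented predictions, standing in for the output of kt.predict(...).
example_predictions = ["location", "sport", "other", "other"]
shares = label_shares(example_predictions)
print(shares)  # {'location': 0.25, 'sport': 0.25, 'other': 0.5}
```

A large "other" share suggests labelling more topics in the widget.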

# Cross-Lingual Classification

We can use the cross-lingual capabilities of the underlying ZeroShotTM model to train a cross-lingual classifier! Here we train in English with multilingual embeddings and then test on Italian data.

In [None]:
kt = Kitty()
kt.train(training, topics=5, embedding_model="paraphrase-multilingual-mpnet-base-v2", language="english")

Batches:   0%|          | 0/100 [00:00<?, ?it/s]

Epoch: [10/10]	 Seen Samples: [200000/200000]	Train Loss: 126.84143291015624	Time: 0:00:08.464196: : 10it [01:25,  8.52s/it]


In [None]:
kt.widget_annotation()

Text(value='', description='0 -  born, league, football, cup, september, played, team, season, player, champio…

Text(value='', description='1 -  village, district, area, mi, km, county, west, lies, kilometres, north', layo…

Text(value='', description='2 -  nigeria, mm, divided, moth, wide, humans, plant, costa, discovered, fish', la…

Text(value='', description='3 -  school, historic, building, states, station, high, state, united, built, hous…

Text(value='', description='4 -  album, released, series, film, band, music, studio, novel, directed, american…

Button(description='Save', style=ButtonStyle(button_color='lightgreen'))

In [None]:
kt.predict(["Pussetto è un calciatore dell'Udinese Calcio",  "Pussetto is a soccer player that currently plays for Udinese Calcio"])

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Sampling: [20/20]: : 20it [00:04,  4.27it/s]


['sports', 'sports']
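
A simple way to sanity-check the cross-lingual behaviour is to feed parallel sentences, as in the cell above, and measure how often both languages receive the same label. The sketch below assumes the predictions alternate between the two languages. `pairwise_agreement` is a hypothetical helper, not part of the package.

```python
def pairwise_agreement(predictions):
    """Share of (language A, language B) pairs that received the same label.

    `predictions` must alternate: [a_1, b_1, a_2, b_2, ...].
    """
    if not predictions or len(predictions) % 2 != 0:
        raise ValueError("expected a non-empty list with an even number of labels")
    pairs = list(zip(predictions[0::2], predictions[1::2]))
    matches = sum(first == second for first, second in pairs)
    return matches / len(pairs)


# The output shown above: one Italian/English pair, both labelled 'sports'.
print(pairwise_agreement(["sports", "sports"]))  # 1.0
```

With a larger set of translated sentences, a value well below 1.0 would indicate that the labels do not transfer cleanly across languages.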