# **Topic Identification**
Topic identification is the challenge of automatically finding topics
in a given text. This can be done in supervised and unsupervised ways. For example, an algorithm labels newspaper articles with known topics such
as "sports," "politics," or "culture." In this case, we have predefined topics and labeled training data and can train our model in a supervised way. This is called topic classification. If we do not know the topics in advance and want our algorithm to find clusters of similar texts in an unsupervised way, the task is called topic modeling or topic discovery [[1]](#scrollTo=1eUuDaNxZ_ms).


This notebook shows examples of supervised topic classification with ``simpletransformers``.

## **Supervised topic classification with ``simpletransformers``**

In this section, we show how to train and evaluate our own topic classification model using the ``simpletransformers`` library.

We perform the following steps:
* Install the ``simpletransformers`` library
* Import other libraries and packages: ``pandas``, ``ClassificationModel``, ``train_test_split``, and ``preprocessing``
* Download dataset from Kaggle
* Create a general classification model
* Fine-tune the general model
* Evaluate the fine-tuned model
* Make predictions for a given text

### Install ``simpletransformers``
First, we install the ``simpletransformers`` library. This library is based on the Hugging Face transformers library. ``simpletransformers`` helps us to quickly train and evaluate transformer models. 

In this notebook, we use the following functions of the ``simpletransformers`` library:
* ``ClassificationModel()`` to create a general classification model
* ``train_model()`` to fine-tune the general model
* ``eval_model()`` to evaluate the fine-tuned model
* ``predict()`` to make predictions for a given text

**Note:**<br>
Deep Learning (DL) models typically run on CUDA-enabled GPUs as the performance is better compared to running on a CPU [[5]](https://simpletransformers.ai/docs/usage/#enablingdisabling-cuda). CUDA is a parallel computing platform created by NVIDIA.

On all ``simpletransformers`` models, CUDA is enabled by default, so you need a CUDA-capable GPU to proceed. If you are using Google Colab, CUDA is pre-installed and you only need to select a GPU runtime: in the Colab top menu, click "Runtime/Change runtime type" and choose "GPU".
If you want to run the code without CUDA, disable it in the ["Create classification model"](#scrollTo=DmfjRTC21KCt) step.
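
Before creating the model, it can help to check whether a CUDA device is actually visible. The following is a minimal sketch, not part of the original notebook: the helper name `cuda_available` is hypothetical, and the check assumes that PyTorch (the backend used by ``simpletransformers``) reports GPU availability through `torch.cuda.is_available()`.

```python
import importlib.util


def cuda_available() -> bool:
    """Return True only if PyTorch is installed and reports a usable CUDA device."""
    # Hypothetical helper for illustration; not part of simpletransformers
    if importlib.util.find_spec("torch") is None:
        return False  # PyTorch is not installed, so CUDA cannot be used
    import torch
    return torch.cuda.is_available()


use_cuda = cuda_available()
print("CUDA available:", use_cuda)
```

The resulting boolean could be passed as the ``use_cuda`` argument when the classification model is created.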


In [None]:
# Install the simpletransformers library
# Important: If you see a button "RESTART RUNTIME" after installing simpletransformers, click on this button to restart the runtime.
!pip install simpletransformers


Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting simpletransformers
  Downloading simpletransformers-0.63.7-py3-none-any.whl (249 kB)
Collecting wandb>=0.10.32
  Downloading wandb-0.12.21-py2.py3-none-any.whl (1.8 MB)
Collecting datasets
  Downloading datasets-2.3.2-py3-none-any.whl (362 kB)
Collecting sentencepiece
  Downloading sentencepiece-0.1.96-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB)
Collecting tokenizers
  Downloading tokenizers-0.12.1-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.6 MB)
Collecting streamlit
  Downloading streamlit-1.10.0-py2.py3-none-any.whl (9.1 MB)

### Import libraries and packages
We import all necessary libraries: In addition to ``simpletransformers``,
we need functions from ``sklearn`` and ``pandas`` to process our dataset [[1]](#scrollTo=1eUuDaNxZ_ms).


In [None]:
# Import the pandas library
import pandas as pd

# Import the "ClassificationModel" class for text classification tasks
from simpletransformers.classification import ClassificationModel

# Import the "train_test_split" function from the sklearn library
from sklearn.model_selection import train_test_split

# Import the "preprocessing" package from sklearn
## We use the "LabelEncoder()" function of this package to convert string labels into numerical values
from sklearn import preprocessing


### Download dataset
We download the dataset from [kaggle.com](https://www.kaggle.com). For this, we must sign up for an account first. After logging in, we follow these steps:

##### 1- Create Kaggle API token

For authenticating our Colab account to download datasets from Kaggle, we create an API token at ``https://www.kaggle.com/<username>/account``.

For that, go to the 'Account' tab of your user profile and 
select 'Create API Token'. This will trigger the download of ``kaggle.json``, a file containing your API credentials. 

##### 2- Create folders

Create a Kaggle folder in the Colab environment.

In [None]:
# Create 'kaggle' folder
!mkdir '/content/kaggle'

In [None]:
# Prepare folders in the Colab environment
import os
os.makedirs('/root/.kaggle', exist_ok=True)  # does not fail if the folder already exists
os.chdir('/root/.kaggle')

##### 3- Upload Kaggle API token

In [None]:
# After downloading the API token from kaggle.com, upload it to Colab
from google.colab import files
files.upload()

Saving kaggle.json to kaggle.json


{'kaggle.json': b'{"username":"emrahyener","key":"<redacted>"}'}

##### 4- Allocate permission

In [None]:
# Allocate the required permission for this API token
## This code modifies the access such that only you can access and read the kaggle.json file
## The permission mode 0o600 (octal) means "the owner can read and write; nobody else has access"
os.chmod('/root/.kaggle/kaggle.json', 0o600)

# Get back to the Kaggle folder
os.chdir('/content/kaggle')
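
Permission modes passed to `os.chmod()` are interpreted as numbers, so the familiar "600" has to be written as the octal literal `0o600`. The sketch below illustrates the difference on a temporary file instead of the real `kaggle.json`; it assumes a POSIX system such as the Colab environment.

```python
import os
import stat
import tempfile

# Create a throwaway file so the real API token is not touched
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    path = tmp.name

# Octal 0o600: read and write for the owner, no access for group and others
os.chmod(path, 0o600)
mode = stat.S_IMODE(os.stat(path).st_mode)
print("Mode after chmod:", oct(mode))

# The decimal number 600 is a different bit pattern when written in octal
print("Decimal 600 in octal:", oct(600))

os.remove(path)
```

Passing the decimal value 600 would therefore set a different and unintended set of permission bits.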

##### 5- Download and unpack dataset

In [None]:
# Download dataset
!kaggle datasets download -d rmisra/news-category-dataset

Downloading news-category-dataset.zip to /content/kaggle
 20% 5.00M/25.4M [00:00<00:01, 15.9MB/s]
100% 25.4M/25.4M [00:00<00:00, 68.2MB/s]


In [None]:
# Extract dataset
!unzip news-category-dataset.zip

Archive:  news-category-dataset.zip
  inflating: News_Category_Dataset_v2.json  


In [None]:
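# Get back to the default ('/content') location
## "!cd .." would only change the directory of a temporary subshell, so we use os.chdir() instead
os.chdir('/content')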

### Data preparation
After downloading the ``news-category-dataset`` file from Kaggle, we have extracted the ``News_Category_Dataset_v2.json`` file which contains our news articles labeled with the topics. The content will be used as training and test sets.

To use this data for fine-tuning and testing of our classification model with ``simpletransformers``, the labeled news articles need to be provided in a Pandas DataFrame structure with 2 columns: One column contains the text and the other one contains the labels. The text column should be ``str`` (string). The label column should be ``int`` (integer).

#### Convert dataset to Pandas DataFrame
As we have explained above, our topic classification model expects its input as a Pandas DataFrame.

First, we use the ``read_json()`` function to convert the ``News_Category_Dataset_v2.json`` file into a Pandas DataFrame ``df``.




In [None]:
# Read data from JSON
df = pd.read_json("/content/kaggle/News_Category_Dataset_v2.json", orient="records", lines=True)
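
The ``lines=True`` argument tells ``read_json()`` that the file holds one JSON object per line (the "JSON Lines" layout) instead of a single JSON array. A minimal sketch with two made-up records, which are illustrative only and not taken from the dataset:

```python
import io

import pandas as pd

# Two made-up records in JSON Lines layout: one JSON object per line
sample = io.StringIO(
    '{"category": "CRIME", "headline": "Example headline A"}\n'
    '{"category": "SPORTS", "headline": "Example headline B"}\n'
)

# Each line becomes one row, each key becomes one column
sample_df = pd.read_json(sample, orient="records", lines=True)
print(sample_df.shape)
print(list(sample_df.columns))
```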

#### List the content of the dataset
Now the DataFrame ``df`` contains the complete dataset. Below, we list the first three rows to see the content of our dataset with the ``head()`` function.




In [None]:
# List the first three rows
df.head(3)

|   | category | headline | authors | link | short_description | date |
|---|---|---|---|---|---|---|
| 0 | CRIME | There Were 2 Mass Shootings In Texas Last Week... | Melissa Jeltsen | https://www.huffingtonpost.com/entry/texas-ama... | She left her husband. He killed their children... | 2018-05-26 |
| 1 | ENTERTAINMENT | Will Smith Joins Diplo And Nicky Jam For The 2... | Andy McDonald | https://www.huffingtonpost.com/entry/will-smit... | Of course it has a song. | 2018-05-26 |
| 2 | ENTERTAINMENT | Hugh Grant Marries For The First Time At Age 57 | Ron Dicker | https://www.huffingtonpost.com/entry/hugh-gran... | The actor and his longtime girlfriend Anna Ebe... | 2018-05-26 |


#### Create a new empty DataFrame
As we see above, our dataset ``df`` contains 6 columns, and some of them contain data that is irrelevant for our task.

In the [Data preparation](#scrollTo=W0du7Fa21C-1) section, we have explained that we prepare a DataFrame with two columns: One column contains the text and the other one contains the labels. For this reason, we create a new empty Pandas DataFrame and create two columns as ``text`` and ``labels``. Then we extract the data we need from the dataset ``df``.

In [None]:
# Create a new DataFrame
data = pd.DataFrame()

#### Define columns of the DataFrame

We have created a new empty DataFrame ``data``. Now we create two columns as ``text`` and ``labels``.


First of all, we have to decide on the content of our ``text`` column. It will contain information about the related news article. If we look at our dataset ``df``, we see that two columns contain information about the news: ``headline`` and ``short_description``. For example, let's look at the ``headline`` and ``short_description`` columns of the first row:

In [None]:
# Print the "headline" and "short_description" columns of the first row
print(" Headline: ",df["headline"][0],"\n","Short Description: ", df["short_description"][0])

 Headline:  There Were 2 Mass Shootings In Texas Last Week, But Only 1 On TV 
 Short Description:  She left her husband. He killed their children. Just another day in America.


To collect all information available, we use both columns for our ``text`` column. For this, we concatenate the ``headline`` and ``short_description`` columns of the dataset ``df``. 

In [None]:
# Create "text" column
# Concatenate the "headline" and "short_description" columns
data["text"] = df.headline + " " + df.short_description

The ``text`` column is ready for our topic classification model. Now we prepare the ``labels`` column. For this, we use the ``category`` column of the dataset ``df``.

In [None]:
# Create "labels" column and extract all labels from the "category" column of the dataset "df"
data["labels"] = df.category

Now the DataFrame ``data`` has two columns. We print the first 3 rows of the DataFrame ``data`` to see the content.

In [None]:
# Print the first three rows of the dataframe
data.head(3)

|   | text | labels |
|---|---|---|
| 0 | There Were 2 Mass Shootings In Texas Last Week... | CRIME |
| 1 | Will Smith Joins Diplo And Nicky Jam For The 2... | ENTERTAINMENT |
| 2 | Hugh Grant Marries For The First Time At Age 5... | ENTERTAINMENT |


As explained in the [Data preparation](#scrollTo=W0du7Fa21C-1) section, the ``labels`` column should be ``int`` (integer). However, the ``labels`` column of our DataFrame contains string values.

To convert string values into integers, we perform the following steps:
* Create a list of the unique labels
* Encode these unique labels as integer values using ``LabelEncoder()``
* Update the DataFrame ``data`` by replacing all string values in the ``labels`` column with integer values.

#### Create a unique labels list

We create a list which contains only the unique labels in the ``labels`` column. For this, we use the ``unique()`` function.

In [None]:
# List unique labels from the DataFrame "data" and save it to a new list "unique_labels"
unique_labels = list(data["labels"].unique())

# Print unique labels
for each in unique_labels:
  print(each)

CRIME
ENTERTAINMENT
WORLD NEWS
IMPACT
POLITICS
WEIRD NEWS
BLACK VOICES
WOMEN
COMEDY
QUEER VOICES
SPORTS
BUSINESS
TRAVEL
MEDIA
TECH
RELIGION
SCIENCE
LATINO VOICES
EDUCATION
COLLEGE
PARENTS
ARTS & CULTURE
STYLE
GREEN
TASTE
HEALTHY LIVING
THE WORLDPOST
GOOD NEWS
WORLDPOST
FIFTY
ARTS
WELLNESS
PARENTING
HOME & LIVING
STYLE & BEAUTY
DIVORCE
WEDDINGS
FOOD & DRINK
MONEY
ENVIRONMENT
CULTURE & ARTS


#### Encode labels as integers

We have created the list ``unique_labels`` which contains only the unique labels. Now we convert these labels to integer values.  

For this purpose, we use the ``LabelEncoder`` class of the ``sklearn`` library to encode our labels as integers. The encoder sorts the labels alphabetically and numbers them starting from 0.



In [None]:
# Convert the labels in the "unique_labels" list to numerical values
le = preprocessing.LabelEncoder()
le.fit(unique_labels)

LabelEncoder()

#### Update the DataFrame with encoded labels
We have fitted the ``LabelEncoder`` object ``le`` on all unique labels, so it now knows which integer belongs to which label.

Now we update all values in the ``labels`` column of the DataFrame ``data`` to prepare it for our topic classification model. For this, we use the ``transform()`` method of the fitted encoder ``le``.

In [None]:
# Delete string labels in the "labels" column of the DataFrame "data" and write encoded integer values instead.
data["labels"] = le.transform(data["labels"])

# Print the first three rows of the DataFrame "data"
data.head(3)

|   | text | labels |
|---|---|---|
| 0 | There Were 2 Mass Shootings In Texas Last Week... | 6 |
| 1 | Will Smith Joins Diplo And Nicky Jam For The 2... | 10 |
| 2 | Hugh Grant Marries For The First Time At Age 5... | 10 |
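
To make the encoding step more concrete, here is a small sketch on a toy label list. The names `toy_labels` and `toy_le` are illustrative and not part of the notebook. It shows that the encoder assigns integers in alphabetical order and that ``inverse_transform()`` maps integers back to strings.

```python
from sklearn import preprocessing

# Toy labels for illustration only; duplicates are allowed when fitting
toy_labels = ["CRIME", "ENTERTAINMENT", "ARTS", "CRIME"]

toy_le = preprocessing.LabelEncoder()
toy_le.fit(toy_labels)

# The learned classes are stored in sorted order; the position is the integer code
classes = toy_le.classes_.tolist()
print(classes)

# Strings to integers
encoded = toy_le.transform(["CRIME", "ENTERTAINMENT"]).tolist()
print(encoded)

# Integers back to strings
decoded = toy_le.inverse_transform([0, 2]).tolist()
print(decoded)
```

Because the fitted encoder can translate in both directions, it could also be used to turn predicted integers back into category names.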


#### Create a dictionary to keep labels as string and integer

This step is optional. 

Above, we have converted all labels to integer values. After training and evaluation, our model predicts a label for a given text and returns it as an integer. To understand the meaning of the predicted labels, we create a dictionary that maps each encoded label to its string value. We use this dictionary ``categories_dict`` in the [Prediction](#scrollTo=vcUjnz5U7Zpq&) step to convert predicted integer labels to the corresponding string values.

In [None]:
# Create a dictionary that maps each integer label to its string label
## Both lists are in order of first appearance in the dataset, so their entries line up
categories_dict = {}
unique_labels_str = unique_labels
unique_labels_int = list(data["labels"].unique())

for label_int, label_str in zip(unique_labels_int, unique_labels_str):
    categories_dict[label_int] = label_str

# Print the keys and values of the dictionary
for key, value in categories_dict.items():
  print(key, " : ", value)

6  :  CRIME
10  :  ENTERTAINMENT
39  :  WORLD NEWS
18  :  IMPACT
24  :  POLITICS
36  :  WEIRD NEWS
2  :  BLACK VOICES
38  :  WOMEN
5  :  COMEDY
25  :  QUEER VOICES
28  :  SPORTS
3  :  BUSINESS
34  :  TRAVEL
20  :  MEDIA
32  :  TECH
26  :  RELIGION
27  :  SCIENCE
19  :  LATINO VOICES
9  :  EDUCATION
4  :  COLLEGE
23  :  PARENTS
1  :  ARTS & CULTURE
29  :  STYLE
15  :  GREEN
31  :  TASTE
16  :  HEALTHY LIVING
33  :  THE WORLDPOST
14  :  GOOD NEWS
40  :  WORLDPOST
12  :  FIFTY
0  :  ARTS
37  :  WELLNESS
22  :  PARENTING
17  :  HOME & LIVING
30  :  STYLE & BEAUTY
8  :  DIVORCE
35  :  WEDDINGS
13  :  FOOD & DRINK
21  :  MONEY
11  :  ENVIRONMENT
7  :  CULTURE & ARTS


#### Create training and evaluation set
We split our DataFrame ``data`` into a training set (80%) and an evaluation set (20%) using the
``train_test_split()`` function of the ``sklearn`` library. Please note that, to simplify this demonstration, we do not create a
separate test set for the final evaluation [[1]](#scrollTo=1eUuDaNxZ_ms).

In [None]:
# Create training and evaluation datasets
## test_size=0.2 means that the size of the evaluation dataset is 20%
## and the training dataset is 80%
train_df, eval_df = train_test_split(data, test_size=0.2)
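
The split above is random, so each run produces different sets. The sketch below is an optional variation on toy data, not what the notebook does: ``random_state`` makes the split reproducible, and ``stratify`` keeps the label proportions the same in both sets. The toy DataFrame and the value 42 are illustrative assumptions.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 10 made-up texts with two equally frequent labels
toy = pd.DataFrame({
    "text": [f"article {i}" for i in range(10)],
    "labels": [0] * 5 + [1] * 5,
})

# random_state fixes the shuffling; stratify preserves the label distribution
toy_train, toy_eval = train_test_split(
    toy, test_size=0.2, random_state=42, stratify=toy["labels"]
)

print(len(toy_train), len(toy_eval))
print(sorted(toy_eval["labels"].tolist()))
```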

### Create classification model
Now, we create our classification model. We use the ``bert-base-uncased`` model from the ``bert`` model family. The number of labels (categories) is set
through the ``num_labels`` parameter [[1]](#scrollTo=1eUuDaNxZ_ms).

**NOTE:** 
On all ``simpletransformers`` models, CUDA is enabled by default. You can disable it if needed. Both options are shown below. We recommend creating your model with CUDA.

Option-1: With CUDA (Recommended)

In [None]:
# Create a classification model
## We use "bert" classification model
## We choose "bert-base-uncased" (lowercase) "bert" model
## "num_labels" specifies the number of labels or classes in the dataset

model = ClassificationModel('bert',
                            'bert-base-uncased',
                            num_labels=len(unique_labels))

Downloading:   0%|          | 0.00/570 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/420M [00:00<?, ?B/s]

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.dense.weight', 'cls.predictions.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

Downloading:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/226k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/455k [00:00<?, ?B/s]

Option-2: Without CUDA


In [None]:
# Uncomment and run this code instead of Option-1 to create the model without CUDA:

#model = ClassificationModel('bert',
#                            'bert-base-uncased',
#                            num_labels=len(unique_labels),
#                            use_cuda=False)

### Train model

We train our model with the ``train_model()`` function of the ``simpletransformers`` library.

When we start training our model, it automatically downloads the pre-trained
``bert`` model, initializes its parameters and preprocesses our training data using a
subword tokenizer before the actual training process is started [[1]](#scrollTo=1eUuDaNxZ_ms).

**NOTE:** 
Depending on the GPU settings, the training of this model can take up to 2 hours.

In [None]:
# Train the model
model.train_model(train_df)

  0%|          | 0/160682 [00:00<?, ?it/s]



Epoch:   0%|          | 0/1 [00:00<?, ?it/s]

Running Epoch 0 of 1:   0%|          | 0/20086 [00:00<?, ?it/s]

(20086, 1.9774431613032162)
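
The numbers in the progress bars can be cross-checked with a little arithmetic. The sketch below assumes a batch size of 8 for both training and evaluation. This value is not stated in the notebook and is inferred only because it reproduces the reported step counts (the evaluation counts appear in the Evaluation section).

```python
import math

batch_size = 8           # assumption: not stated in the notebook
train_examples = 160682  # from the training progress bar
eval_examples = 40171    # from the evaluation progress bar

# One optimization step processes one batch; the last batch may be smaller
train_steps = math.ceil(train_examples / batch_size)
eval_steps = math.ceil(eval_examples / batch_size)

# Share of the data that went into the training set
train_share = train_examples / (train_examples + eval_examples)

print(train_steps, eval_steps, round(train_share, 2))
```

Under this assumption, the result matches the 20086 training steps and 5022 evaluation steps shown in the outputs, and the share matches the 80/20 split.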

### Evaluation

We evaluate the model with the ``eval_model()`` function of the ``simpletransformers`` library.

In [None]:
# Evaluate the model
result, model_outputs, wrong_preds  = model.eval_model(eval_df)

  0%|          | 0/40171 [00:00<?, ?it/s]

Running Evaluation:   0%|          | 0/5022 [00:00<?, ?it/s]
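
The notebook does not print the evaluation results. One way to inspect them is to print ``result`` and to derive an accuracy from ``model_outputs``. The sketch below uses made-up scores and labels, and it assumes that ``model_outputs`` holds one row of raw per-class scores for each evaluation example, with the highest score marking the predicted class.

```python
import numpy as np

# Made-up raw scores for 4 examples and 3 classes (illustration only)
toy_outputs = np.array([
    [2.0, 0.1, -1.0],
    [0.3, 1.5, 0.2],
    [0.1, 0.2, 3.0],
    [1.2, 0.9, 0.1],
])
toy_true = np.array([0, 1, 2, 1])  # made-up true labels

# The predicted class is the index of the highest score in each row
toy_pred = toy_outputs.argmax(axis=1)

# Accuracy is the fraction of examples where prediction and truth agree
accuracy = float((toy_pred == toy_true).mean())
print(toy_pred.tolist(), accuracy)
```

With the real data, the same calculation could be applied to ``model_outputs`` and ``eval_df["labels"]``.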

### Prediction

We make predictions on unlabelled data with the ``predict()`` function of the ``simpletransformers`` library.

In [42]:
# Predict the label of a given string
prediction1, raw_outputs = model.predict(["Chase Bank mortgage review: Low down payments available for those who don’t qualify for a VA loan."])

# Print the predicted label as integer and string
print("Predicted label as integer: ",int(prediction1[0]))
print("Predicted label as string: ",categories_dict[int(prediction1[0])])

  0%|          | 0/1 [00:00<?, ?it/s]

  0%|          | 0/1 [00:00<?, ?it/s]

Predicted label as integer:  3
Predicted label as string:  BUSINESS


In [43]:
# Predict the label of a given string
prediction2, raw_outputs = model.predict(["President Biden revealed the NASA telescope's image of ancient galaxies whose light has been traveling 13 billion years to reach us."])

# Print the predicted label as integer and string
print("Predicted label as integer: ",int(prediction2[0]))
print("Predicted label as string: ",categories_dict[int(prediction2[0])])

  0%|          | 0/1 [00:00<?, ?it/s]

  0%|          | 0/1 [00:00<?, ?it/s]

Predicted label as integer:  27
Predicted label as string:  SCIENCE


In [45]:
# Predict the label of a given string
prediction3, raw_outputs = model.predict(["A dentist is on trial in Denver for the death of his wife during a safari trip to Zambia. "
                                          "His wife's death was called into question after he was accused of having an affair."])

# Print the predicted label as integer and string
print("Predicted label as integer: ",int(prediction3[0]))
print("Predicted label as string: ",categories_dict[int(prediction3[0])])

  0%|          | 0/1 [00:00<?, ?it/s]

  0%|          | 0/1 [00:00<?, ?it/s]

Predicted label as integer:  6
Predicted label as string:  CRIME
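
Besides the predicted label, ``predict()`` also returns ``raw_outputs``. Assuming these are unnormalized per-class scores, a softmax turns them into values that sum to 1 and can be read as a rough confidence. The sketch below uses made-up scores and a made-up three-entry dictionary, not real model output.

```python
import math


def softmax(scores):
    """Convert raw scores into positive values that sum to 1."""
    highest = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up scores and label names for illustration only
toy_scores = [1.0, 3.0, 0.5]
toy_names = {0: "BUSINESS", 1: "CRIME", 2: "SCIENCE"}

probs = softmax(toy_scores)

# Index of the most probable class
best = max(range(len(probs)), key=probs.__getitem__)
print(toy_names[best], round(probs[best], 3))
```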


# **References**

- [1] Course Book "NLP and Computer Vision" (DLMAINLPCV01)
- [2] https://www.nltk.org/api/nltk.html#nltk.wsd.lesk
- [3] https://en.wikipedia.org/wiki/WordNet
- [4] https://radimrehurek.com/gensim/auto_examples/tutorials/run_lda.html
- [5] https://simpletransformers.ai/docs/usage/#enablingdisabling-cuda


Copyright © 2022 IU International University of Applied Sciences