# **Topic Identification**
Topic identification is the task of automatically finding topics in a given text. This can be done in a supervised or an unsupervised way. For example, an algorithm can label newspaper articles with known topics such as "sports," "politics," or "culture." In this case, we have predefined topics and labeled training data, so we can train our model in a supervised way. This is called topic classification. If we do not know the topics in advance and want our algorithm to find clusters of similar topics, we are dealing with topic modeling or topic discovery, which works in an unsupervised way [[1]](#scrollTo=1eUuDaNxZ_ms).


This notebook shows examples of supervised topic classification with ``simpletransformers`` [[2]](https://simpletransformers.ai/about/).


## **Supervised topic classification with ``simpletransformers``**

In this section, we show how to train and evaluate our own topic classification model using the ``simpletransformers`` library.

We perform the following steps:
* Install the ``simpletransformers`` library
* Import other libraries and packages: ``pandas``, ``ClassificationModel``, ``train_test_split``, and ``preprocessing``
* Download dataset from Kaggle
* Create a general classification model
* Fine-tune the general model
* Evaluate the fine-tuned model
* Make predictions for a given text

### Install ``simpletransformers``
First, we install the ``simpletransformers`` library. This library is based on the Hugging Face transformers library [[3]](https://huggingface.co/docs/transformers/index). ``simpletransformers`` helps us to quickly train and evaluate transformer models. For more details about ``simpletransformers``, please refer to [[2]](https://simpletransformers.ai/about/).


In this notebook, we use the following functions of the ``simpletransformers`` library:
* ``ClassificationModel()`` to create a general classification model
* ``train_model()`` to fine-tune the general model
* ``eval_model()`` to evaluate the fine-tuned model
* ``predict()`` to make predictions for a given text

**Note:**<br>
Deep Learning (DL) models typically run on CUDA-enabled GPUs as the performance is better compared to running on a CPU [[4]](https://simpletransformers.ai/docs/usage/#enablingdisabling-cuda). CUDA is a parallel computing platform created by NVIDIA.

On all ``simpletransformers`` models, CUDA is enabled by default, so we need a CUDA-capable GPU to proceed. If we use Google Colab, CUDA is pre-installed, but we still need to enable the GPU in the Colab top menu by clicking on "Runtime/Change runtime type" and choosing "GPU".

To proceed without CUDA, we run the following code cells unchanged and then disable CUDA in ["Create classification model"](#scrollTo=DmfjRTC21KCt).
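
Before choosing between the two options, it can help to check whether a CUDA device is visible at all. The following is a minimal sketch, assuming PyTorch (which ``simpletransformers`` builds on) is installed. The helper name ``cuda_is_available`` is hypothetical and not part of ``simpletransformers``.

```python
def cuda_is_available() -> bool:
    """Return True only if PyTorch is installed and reports a CUDA device."""
    try:
        import torch  # assumption: PyTorch is present in the runtime
    except ImportError:
        return False
    return bool(torch.cuda.is_available())


# The result can later be passed as use_cuda=... when creating the model
use_cuda = cuda_is_available()
print("CUDA available:", use_cuda)
```

If this prints ``False`` in Colab, the GPU runtime is most likely not enabled yet.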


In [1]:
# Install the simpletransformers library
# Important: If you see a button "RESTART RUNTIME" after installing simpletransformers, click on this button to restart the runtime.
!pip install simpletransformers


Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting simpletransformers
  Downloading simpletransformers-0.63.7-py3-none-any.whl (249 kB)
Collecting seqeval
  Downloading seqeval-1.2.2.tar.gz (43 kB)
Collecting transformers>=4.6.0
  Downloading transformers-4.20.1-py3-none-any.whl (4.4 MB)
Collecting tokenizers
  Downloading tokenizers-0.12.1-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.6 MB)
Collecting sentencepiece
  Downloading sentencepiece-0.1.96-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB)
Collecting streamlit
  Downloading streamlit-1.10.0-py2.py3-none-any.whl (9.1 MB)

### Import libraries and packages
We import all necessary libraries: In addition to ``simpletransformers``,
we need functions from ``sklearn`` and ``pandas`` to process our dataset. For more details about the ``sklearn`` and ``pandas`` libraries, please refer to [[5]](https://scikit-learn.org/stable/) and [[6]](https://pandas.pydata.org/).


In [1]:
# Import the pandas library
import pandas as pd

# Import the "ClassificationModel" class for text classification tasks
from simpletransformers.classification import ClassificationModel

# Import the "train_test_split" function from the sklearn library
from sklearn.model_selection import train_test_split

# Import the "preprocessing" module from sklearn
## We use the "LabelEncoder" class of this module to convert string labels into integer values
from sklearn import preprocessing


### Download dataset
We download the dataset from [kaggle.com](https://www.kaggle.com). For this, we sign up for an account first. After logging in, we follow these steps:

##### 1- Create Kaggle API token

For authenticating our Colab account to download datasets from Kaggle, we create an API token at ``https://www.kaggle.com/<username>/account``.

For that, we go to the 'Account' tab of our user profile and 
select 'Create API Token'. This will trigger the download of ``kaggle.json``, a file containing our API credentials. 

##### 2- Create folders

We create a Kaggle folder in Colab.

In [2]:
# Create 'kaggle' folder
## The "-p" option avoids an error if the folder already exists
!mkdir -p '/content/kaggle'

In [3]:
# Prepare folders in Colab
import os
## "exist_ok=True" avoids an error if the folder already exists
os.makedirs('/root/.kaggle', exist_ok=True)
os.chdir('/root/.kaggle')

##### 3- Upload Kaggle API token
After downloading the API token from kaggle.com, we upload it to Colab.

In [4]:
# After downloading the API token from kaggle.com, upload it to Colab
from google.colab import files
files.upload()

Saving kaggle.json to kaggle.json


{'kaggle.json': b'{"username":"<username>","key":"<redacted>"}'}

##### 4- Allocate permission
We allocate the required permission for the API token.

In [5]:
# Allocate the required permission for the API token
## This code restricts access so that only the owner of the kaggle.json file can read and write it
## The permission code must be written as the octal literal 0o600 ("the owner can read and write");
## the decimal number 600 would set a different permission pattern
os.chmod('/root/.kaggle/kaggle.json', 0o600)

# Get back to the Kaggle folder
os.chdir('/content/kaggle')
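
To make the permission code more tangible, the following sketch sets the mode on a throwaway temporary file (standing in for ``kaggle.json``) and reads it back. It assumes a POSIX system such as the Linux runtime used by Colab. The variable names are illustrative only.

```python
import os
import stat
import tempfile

# Create a throwaway file that stands in for kaggle.json
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    token_path = tmp.name

# 0o600 is an octal literal: read and write for the owner, nothing for anyone else
os.chmod(token_path, 0o600)
file_mode = os.stat(token_path).st_mode
mode = stat.S_IMODE(file_mode)

print(oct(mode))                 # permission bits in octal notation
print(stat.filemode(file_mode))  # the same bits in "ls -l" notation

# The decimal integer 600 corresponds to a different octal pattern
print(oct(600))

# Clean up the temporary file
os.remove(token_path)
```

Python interprets the second argument of ``os.chmod()`` as a number, so the ``0o`` prefix is what marks it as octal.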

##### 5- Download and unpack dataset
The dataset is stored on a Kaggle server as the compressed file ``news-category-dataset.zip``. We download this file and extract ``News_Category_Dataset_v2.json`` from it.

In [6]:
# Download dataset
!kaggle datasets download -d rmisra/news-category-dataset

Downloading news-category-dataset.zip to /content/kaggle
 35% 9.00M/25.4M [00:00<00:00, 23.8MB/s]
100% 25.4M/25.4M [00:00<00:00, 64.1MB/s]


In [7]:
# Extract dataset
!unzip news-category-dataset.zip

Archive:  news-category-dataset.zip
  inflating: News_Category_Dataset_v2.json  


In [8]:
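# Get back to the default ('/content') location
## We use the "%cd" magic because "!cd" runs in a temporary subshell
## and does not change the working directory of the notebook
%cd /content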

### Data preparation
After downloading the ``news-category-dataset`` file from Kaggle, we have extracted the ``News_Category_Dataset_v2.json`` file which contains our news articles labeled with the topics. The content will be used as training and test sets.

To use this data for fine-tuning and testing our classification model with ``simpletransformers``, the labeled news articles need to be provided in a Pandas DataFrame structure with 2 columns: One column contains the text and the other one contains the labels. The text column should be ``str`` (string). The label column should be ``int`` (integer).
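
The expected structure can be illustrated with a tiny hand-made DataFrame. This is a minimal sketch: the example rows are made up, and the helper ``has_expected_format`` is hypothetical and not part of ``simpletransformers``.

```python
import pandas as pd
from pandas.api.types import is_integer_dtype


def has_expected_format(frame: pd.DataFrame) -> bool:
    """Check for a 'text' column of strings and a 'labels' column of integers."""
    if not {"text", "labels"}.issubset(frame.columns):
        return False
    texts_are_strings = all(isinstance(value, str) for value in frame["text"])
    labels_are_integers = is_integer_dtype(frame["labels"])
    return texts_are_strings and labels_are_integers


# Made-up example rows in the two-column layout described above
toy = pd.DataFrame({
    "text": ["Team wins the cup final", "Parliament passes new budget"],
    "labels": [0, 1],
})

print(has_expected_format(toy))
# String labels do not satisfy the check until they are encoded as integers
print(has_expected_format(toy.assign(labels=["SPORTS", "POLITICS"])))
```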

#### Convert dataset to Pandas DataFrame
As explained above, our topic classification model expects its input as a Pandas DataFrame.

First, we use the ``read_json()`` function to convert the ``News_Category_Dataset_v2.json`` file into a Pandas DataFrame ``df``.




In [9]:
# Read data from JSON
df = pd.read_json("/content/kaggle/News_Category_Dataset_v2.json", orient="records", lines=True)
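
The ``lines=True`` argument tells ``read_json()`` that the file holds one JSON object per line rather than a single JSON array. The sketch below shows this on two made-up records held in memory; the record contents are illustrative and not taken from the dataset.

```python
import io

import pandas as pd

# Two made-up records in JSON Lines format: one JSON object per line
json_lines = io.StringIO(
    '{"category": "CRIME", "headline": "Example headline A"}\n'
    '{"category": "SPORTS", "headline": "Example headline B"}\n'
)

# Each line becomes one row, each key becomes one column
toy_df = pd.read_json(json_lines, orient="records", lines=True)
print(toy_df.shape)
print(list(toy_df.columns))
```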

#### List the content of the dataset
Now the DataFrame ``df`` contains the complete dataset. Below, we list the first three rows to see the content of our dataset with the ``head()`` function.




In [10]:
# List the first three rows
df.head(3)

| | category | headline | authors | link | short_description | date |
|---|---|---|---|---|---|---|
| 0 | CRIME | There Were 2 Mass Shootings In Texas Last Week... | Melissa Jeltsen | https://www.huffingtonpost.com/entry/texas-ama... | She left her husband. He killed their children... | 2018-05-26 |
| 1 | ENTERTAINMENT | Will Smith Joins Diplo And Nicky Jam For The 2... | Andy McDonald | https://www.huffingtonpost.com/entry/will-smit... | Of course it has a song. | 2018-05-26 |
| 2 | ENTERTAINMENT | Hugh Grant Marries For The First Time At Age 57 | Ron Dicker | https://www.huffingtonpost.com/entry/hugh-gran... | The actor and his longtime girlfriend Anna Ebe... | 2018-05-26 |


#### Create a new empty DataFrame
As we see above, our dataset ``df`` contains 6 columns and some of them contain data which is irrelevant for our task. 

In the [Data preparation](#scrollTo=W0du7Fa21C-1) section, we have explained that we prepare a DataFrame with two columns: One column contains the text and the other one contains the labels. For this reason, we create a new empty Pandas DataFrame and create two columns as ``text`` and ``labels``. Then we extract only the text and label data which we need from the dataset ``df``.

In [51]:
# Create a new DataFrame
data = pd.DataFrame()

#### Define columns of the DataFrame

We have created a new empty DataFrame ``data``. Now we create the columns ``text`` and ``labels``.

##### 1- Define the ``text`` column 

As the following code cell shows, the ``headline`` and ``short_description`` columns of the dataset ``df`` contain information about the news:

In [52]:
# Print the "headline" and "short_description" columns of the first row
print(" Headline: ",df["headline"][0],"\n","Short Description: ", df["short_description"][0])

 Headline:  There Were 2 Mass Shootings In Texas Last Week, But Only 1 On TV 
 Short Description:  She left her husband. He killed their children. Just another day in America.


To define the ``text`` column, we concatenate the ``headline`` and ``short_description`` columns.

In [53]:
# Create "text" column
# Concatenate the "headline" and "short_description" columns
data["text"] = df.headline + " " + df.short_description

##### 2- Define the ``labels`` column 
To define the ``labels`` column, we extract all labels from the ``category`` column of the dataset ``df``.

In [54]:
# Create the "labels" column and extract all labels from the "category" column of the dataset "df"
data["labels"] = df.category

Now the DataFrame ``data`` has two columns, ``text`` and ``labels``. We print the first 3 rows to see the content.

In [55]:
# Print the first three rows of the DataFrame
data.head(3)

| | text | labels |
|---|---|---|
| 0 | There Were 2 Mass Shootings In Texas Last Week... | CRIME |
| 1 | Will Smith Joins Diplo And Nicky Jam For The 2... | ENTERTAINMENT |
| 2 | Hugh Grant Marries For The First Time At Age 5... | ENTERTAINMENT |


The DataFrame ``data`` is not ready for the topic classification model yet. As explained in the [Data preparation](#scrollTo=W0du7Fa21C-1) section, the ``labels`` column should be ``int`` (integer). However, it currently contains string values.

In the "Perform label encoding" step, we convert labels into integer format.



##### 3- Perform label encoding

To perform label encoding, we first create a list which contains only the unique labels in the ``labels`` column. For this, we use the ``unique()`` function.

Then we use the ``LabelEncoder`` class of the ``sklearn`` library and its ``fit()`` and ``transform()`` methods to convert the ``labels`` column into integer format.

In [56]:
# List unique labels from the DataFrame "data" and save it to a new list "unique_labels"
unique_labels = list(data["labels"].unique())

# Convert the labels in the "unique_labels" list into integer values
le = preprocessing.LabelEncoder()
le.fit(unique_labels)

# Delete string labels in the "labels" column of the DataFrame "data" and write integer values instead.
data["labels"] = le.transform(data["labels"])

# Print the first three rows of the DataFrame "data"
data.head(3)

| | text | labels |
|---|---|---|
| 0 | There Were 2 Mass Shootings In Texas Last Week... | 6 |
| 1 | Will Smith Joins Diplo And Nicky Jam For The 2... | 10 |
| 2 | Hugh Grant Marries For The First Time At Age 5... | 10 |
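
How the encoder assigns integers can be seen on a small made-up label list. This sketch assumes nothing beyond ``sklearn``; the toy labels are illustrative. ``LabelEncoder`` sorts the unique labels and numbers them in that order, and ``inverse_transform()`` maps integers back to strings.

```python
from sklearn import preprocessing

# Made-up string labels, including a repeated one
toy_labels = ["SPORTS", "CRIME", "POLITICS", "CRIME"]

encoder = preprocessing.LabelEncoder()
encoder.fit(toy_labels)

# Integer codes follow the sorted order of the unique labels
encoded = encoder.transform(toy_labels)
# The mapping can be reversed at any time
decoded = encoder.inverse_transform(encoded)

print(list(encoder.classes_))
print(list(encoded))
print(list(decoded))
```

This sorted numbering is why CRIME receives the label 6 and ENTERTAINMENT the label 10 in the table above.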


#### Create a dictionary to keep labels as string and integer

As we see above, we have converted the ``labels`` column into integer format, which makes it hard to tell what each integer label means.

After the model training and evaluation processes, our model will predict a label for a given text and return it as an integer.

To understand the meaning of the integers, we create a dictionary ``categories_dict`` which contains labels as integers and strings. This dictionary will be used at the [Prediction](#scrollTo=vcUjnz5U7Zpq&) step.

In [57]:
# Create a dictionary that maps each integer label to its string label
## "le.classes_" holds the string labels in the order of their integer codes
categories_dict = {index: label for index, label in enumerate(le.classes_)}

# Print the keys and values of the dictionary
for key, value in sorted(categories_dict.items()):
    print(key, " : ", value)

0  :  ARTS
1  :  ARTS & CULTURE
2  :  BLACK VOICES
3  :  BUSINESS
4  :  COLLEGE
5  :  COMEDY
6  :  CRIME
7  :  CULTURE & ARTS
8  :  DIVORCE
9  :  EDUCATION
10  :  ENTERTAINMENT
11  :  ENVIRONMENT
12  :  FIFTY
13  :  FOOD & DRINK
14  :  GOOD NEWS
15  :  GREEN
16  :  HEALTHY LIVING
17  :  HOME & LIVING
18  :  IMPACT
19  :  LATINO VOICES
20  :  MEDIA
21  :  MONEY
22  :  PARENTING
23  :  PARENTS
24  :  POLITICS
25  :  QUEER VOICES
26  :  RELIGION
27  :  SCIENCE
28  :  SPORTS
29  :  STYLE
30  :  STYLE & BEAUTY
31  :  TASTE
32  :  TECH
33  :  THE WORLDPOST
34  :  TRAVEL
35  :  WEDDINGS
36  :  WEIRD NEWS
37  :  WELLNESS
38  :  WOMEN
39  :  WORLD NEWS
40  :  WORLDPOST


#### Create training and evaluation set
We split our DataFrame ``data`` into a training set (80%) and an evaluation set (20%) using the
``train_test_split()`` function of the ``sklearn`` library. Please note that, to simplify this demonstration, we do not create a
separate test set for the final evaluation [[1]](#scrollTo=1eUuDaNxZ_ms).

In [20]:
# Create training and evaluation datasets
## test_size=0.2 means that the size of the evaluation dataset is 20%
## and the training dataset is 80%
train_df, eval_df = train_test_split(data, test_size=0.2)
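
The split above is random, so it differs between runs and does not guarantee that every category keeps its share in both sets. As a sketch of one possible variation (not what this notebook does), ``train_test_split()`` accepts ``random_state`` for reproducibility and ``stratify`` to preserve the label proportions. The toy data below is made up.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data: ten texts with two evenly distributed labels
toy = pd.DataFrame({
    "text": [f"example text {i}" for i in range(10)],
    "labels": [0, 1] * 5,
})

# random_state fixes the shuffle; stratify keeps the label ratio in both parts
toy_train, toy_eval = train_test_split(
    toy, test_size=0.2, random_state=42, stratify=toy["labels"]
)

print(len(toy_train), len(toy_eval))
print(sorted(toy_eval["labels"]))
```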

### Create classification model
Now, we create our classification model. We use the ``bert-base-uncased`` model from the ``bert`` model family. The number of labels (categories) is set
through the ``num_labels`` parameter [[1]](#scrollTo=1eUuDaNxZ_ms).

**NOTE:** 
On all ``simpletransformers`` models, CUDA is enabled by default which is recommended. If a system is not able to operate with CUDA, we can disable it. Below we find code for both options. 

Option-1: With CUDA (Recommended)

In [21]:
# Create a classification model
## We use the "bert" model type
## We choose the pre-trained "bert-base-uncased" model, which works on lowercased text
## "num_labels" specifies the number of labels or classes in the dataset

model = ClassificationModel('bert',
                            'bert-base-uncased',
                            num_labels=len(unique_labels))

Downloading:   0%|          | 0.00/570 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/420M [00:00<?, ?B/s]

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

Downloading:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/226k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/455k [00:00<?, ?B/s]

Option-2: Without CUDA


In [22]:
# Uncomment and run this code instead of Option-1 to disable CUDA:

#model = ClassificationModel('bert',
#                            'bert-base-uncased',
#                            num_labels=len(unique_labels),
#                            use_cuda=False)

### Train model

We train our model with the ``train_model()`` function of the ``simpletransformers`` library.

When we start training our model, it automatically downloads the pre-trained
``bert`` model, initializes its parameters and preprocesses our training data using a
subword tokenizer before the actual training process is started [[1]](#scrollTo=1eUuDaNxZ_ms).

**NOTE:** 
Depending on the GPU settings, the training of this model can take up to 2 hours.

In [23]:
# Train the model
model.train_model(train_df)

  0%|          | 0/160682 [00:00<?, ?it/s]



Epoch:   0%|          | 0/1 [00:00<?, ?it/s]

Running Epoch 0 of 1:   0%|          | 0/20086 [00:00<?, ?it/s]

(20086, 1.9819615015669743)
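
The numbers in the progress bars can be checked by hand. The sketch below assumes a batch size of 8 for both training and evaluation; the notebook does not set this value explicitly, so it is an assumption that happens to match the output. The helper ``steps_per_epoch`` is hypothetical.

```python
import math


def steps_per_epoch(num_examples: int, batch_size: int) -> int:
    """Number of batches needed to see every example once."""
    return math.ceil(num_examples / batch_size)


BATCH_SIZE = 8  # assumption: not set explicitly in this notebook

train_examples = 160682  # size of the training set shown in the output above
eval_examples = 40171    # size of the evaluation set shown in the Evaluation output

train_steps = steps_per_epoch(train_examples, BATCH_SIZE)
eval_steps = steps_per_epoch(eval_examples, BATCH_SIZE)

print(train_steps)
print(eval_steps)
# Share of the evaluation set in the whole dataset
print(round(eval_examples / (train_examples + eval_examples), 3))
```

Under this assumption, the first element of the returned tuple corresponds to the number of training steps, which equals the number of batches in the single epoch.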

### Evaluation

We evaluate the model with the ``eval_model()`` function of the ``simpletransformers`` library.

In [24]:
# Evaluate the model
result, model_outputs, wrong_preds = model.eval_model(eval_df)

  0%|          | 0/40171 [00:00<?, ?it/s]

Running Evaluation:   0%|          | 0/5022 [00:00<?, ?it/s]
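
The cell above stores the evaluation results without displaying them. As a rough sketch of how raw model outputs relate to an accuracy score, the example below uses made-up scores for three classes. It assumes that ``model_outputs`` has one row per text and one column per label, and that the column with the highest score is the predicted label.

```python
import numpy as np

# Made-up raw outputs: four texts, three classes
toy_outputs = np.array([
    [2.0, 0.1, -1.0],
    [0.3, 1.5, 0.2],
    [0.1, 0.2, 3.0],
    [1.2, 0.9, 0.1],
])
# Made-up true labels for the same four texts
toy_true = np.array([0, 1, 2, 1])

# The predicted label is the column with the highest score in each row
toy_pred = toy_outputs.argmax(axis=1)

# Accuracy is the share of texts whose predicted label equals the true label
accuracy = float((toy_pred == toy_true).mean())

print(toy_pred.tolist())
print(accuracy)
```

In the notebook itself, ``print(result)`` would show the metrics computed by ``eval_model()``.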

### Prediction

We make predictions on a given text with the ``predict()`` function of the ``simpletransformers`` library.

In [49]:
# Predict the label of a given string
prediction1, raw_outputs = model.predict(["Chase Bank mortgage review: Low down payments available for those who don’t qualify for a VA loan."])

# Print the predicted label as integer and string
## "predict()" returns one label per input text, so we take the first element
print("Predicted label as integer: ",int(prediction1[0]))
print("Predicted label as string: ",categories_dict[int(prediction1[0])])

  0%|          | 0/1 [00:00<?, ?it/s]

  0%|          | 0/1 [00:00<?, ?it/s]

Predicted label as integer:  21
Predicted label as string:  MONEY


In [26]:
# Predict the label of a given string
prediction2, raw_outputs = model.predict(["President Biden revealed the NASA telescope's image of ancient galaxies whose light has been traveling 13 billion years to reach us."])

# Print the predicted label as integer and string
## "predict()" returns one label per input text, so we take the first element
print("Predicted label as integer: ",int(prediction2[0]))
print("Predicted label as string: ",categories_dict[int(prediction2[0])])

  0%|          | 0/1 [00:00<?, ?it/s]

  0%|          | 0/1 [00:00<?, ?it/s]

Predicted label as integer:  27
Predicted label as string:  SCIENCE


In [27]:
# Predict the label of a given string
## Adjacent string literals are joined automatically, which avoids extra whitespace inside the text
prediction3, raw_outputs = model.predict(["A dentist is on trial in Denver for the death of his wife during a safari trip to Zambia. "
                                          "His wife's death was called into question after he was accused of having an affair."])

# Print the predicted label as integer and string
## "predict()" returns one label per input text, so we take the first element
print("Predicted label as integer: ",int(prediction3[0]))
print("Predicted label as string: ",categories_dict[int(prediction3[0])])

  0%|          | 0/1 [00:00<?, ?it/s]

  0%|          | 0/1 [00:00<?, ?it/s]

Predicted label as integer:  6
Predicted label as string:  CRIME
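
Besides the predicted label, ``predict()`` also returns ``raw_outputs``, which the cells above do not use. The sketch below shows one common way to turn such raw scores into probabilities with a softmax function. It assumes that the raw outputs are unnormalized scores with one value per label; the scores and the small label dictionary are made up.

```python
import numpy as np


def softmax(scores: np.ndarray) -> np.ndarray:
    """Convert raw scores into probabilities that sum to 1."""
    shifted = scores - scores.max()  # subtract the maximum for numerical stability
    exponentials = np.exp(shifted)
    return exponentials / exponentials.sum()


# Made-up raw scores for one text and three labels
toy_scores = np.array([0.5, 3.0, 1.0])
toy_names = {0: "CRIME", 1: "MONEY", 2: "SCIENCE"}

probabilities = softmax(toy_scores)

# Sort label indices from most to least probable
ranking = np.argsort(probabilities)[::-1]

# Show the two most probable labels with their probabilities
for index in ranking[:2]:
    print(toy_names[int(index)], round(float(probabilities[index]), 3))
```

Looking at the runner-up labels can be helpful for this dataset, because several categories overlap, for example ARTS, ARTS & CULTURE, and CULTURE & ARTS.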


# **References**

- [1] Course Book "NLP and Computer Vision" (DLMAINLPCV01)
- [2] https://simpletransformers.ai/about/
- [3] https://huggingface.co/docs/transformers/index
- [4] https://simpletransformers.ai/docs/usage/#enablingdisabling-cuda
- [5] https://scikit-learn.org/stable/
- [6] https://pandas.pydata.org/

Copyright © 2022 IU International University of Applied Sciences