Data Science 2 - Seminar Work - prediction through classification

The goal of this seminar work is to create a prediction through classification of the arxiv dataset. This dataset is a dataset of scholarly articels from the vast branches of science. It contains 1.7 millions data sampels with 176 diffrent categories and with this structure:

With this dataset I tried to make a prediction about the category of an article through the abstract of the article. In this case it was particularly difficult as this is a multilabel dataset. Specifically, this means that an article can belong to several categories.

In order to find the best possible model for this application, I tried several models with only 100000 sampels of the 1.7 million arxiv dataset. They were Decision Tree, Random Forest, Bagging Model, LinearSVC, Power Set SVC Model, Boosting Model, Multinomial Naive Bayes Model and k-nearest neighbors. Based on the metric results of the classification report, I decided in favor of the LinearSVC. 

First, we need to import the needed libraries like nltk, scikitlearn and pandas. Those are necessary for the handling of dataframes and to work with text.

In [1]:
from nltk.corpus import stopwords
import nltk
import pandas as pd
from nltk.stem import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer
from sklearn import model_selection
import warnings
import datetime

nltk.download('wordnet')
warnings.filterwarnings("ignore")

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\marin\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


We need to load the arxiv dataset into a dataframe. In a previous step, the dataset was already downloaded as a JSON file. After that, we split the dataset into train and test for abstract and categories. We will give out the timestamp of diffrent steps of our process to have a good overview over the computing time of these steps.

In [2]:
print("START")
print(datetime.datetime.now())
df = pd.read_json('arxiv-metadata-oai-snapshot.json', lines=True)
df.dropna()
train_abstracts, test_abstracts, train_categories, test_categories = model_selection.train_test_split(df['abstract'], df['categories'],
                                                                    test_size=0.2, random_state=42)

print("Training sampels:", len(train_abstracts))
print("Testing sampels:", len(test_abstracts))

START
2021-01-05 21:32:14.891001
Training sampels: 1412550
Testing sampels: 353138


After the splitting, we need to take care of the categories. In our case, we have a multilabel dataset. Therefore, one article can be assigned to more than one category. For this, we use the MultiLabelBinarizer to transform the multi-categories. After that, we create new dataframes which contain only the abtract texts, which we need in the next step.

In [3]:
from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
train_labels = mlb.fit_transform(train_categories)
test_labels = mlb.transform(test_categories)

trainData = {"abstract": train_abstracts}
testData = {"abstract": test_abstracts}
trainDf = pd.DataFrame(trainData, columns=["abstract"])
testDf = pd.DataFrame(testData, columns=["abstract"])

Here, we carry out the preprocessing of the abstract text. For this, we use the WordNetLemmatizer and the PorterStemmer from nltk. First we replace the line breacks with empty. After that, we convert the text to lowercase and tokenize it. Then we need to check if the string contains only alphabetic characters only. Next, we need to perform the lemmatization and stemming to convert the words into their origin form. Finally, we remove the word which are shorter than 2 characters and stopwords because they don't give us value for our later steps.

In [4]:
lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer()
stopwords = set(words.rstrip() for words in stopwords.words('english'))

def preprocessing(text):
    text = text.replace("\n", " ")
    tokens = nltk.tokenize.word_tokenize(text.lower())
    tokens = [token for token in tokens if token.isalpha()]
    tokens = [lemmatizer.lemmatize(token) for token in tokens]
    tokens = [stemmer.stem(token) for token in tokens]
    tokens = [token for token in tokens if len(token) > 2]
    tokens = [token for token in tokens if token not in stopwords]
    cleanedText = " ".join(tokens)
    return cleanedText

def cleaning(df):
    data = df.copy()
    data["abstract"] = data["abstract"].apply(preprocessing)
    return data

After the preprocessing of the text, we need to create tfidf vectors as an input for our classification later with the training and testing dataframes. For this, we use the TfidfVectorizer.

In [5]:
cleanedTrainData = cleaning(trainDf)
cleanedTestData = cleaning(testDf)
print("DATA PREPROCESSED")
print(datetime.datetime.now())
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import metrics

vectorizer = TfidfVectorizer()
vectorised_train_abstracts = vectorizer.fit_transform(cleanedTrainData["abstract"])
vectorised_test_abstracts = vectorizer.transform(cleanedTestData["abstract"])
print("DATA VECTORIZED")
print(datetime.datetime.now())

DATA PREPROCESSED
2021-01-05 22:48:01.284221
DATA VECTORIZED
2021-01-05 22:49:13.152078


As the last step, we train the LinearSVC with our training data. We also need to use the OneVsRestClassifier, which creates a classifier for each categories. After the training process is finished, we predict the abstracts from the testing dataframe. From this prediction, we can compare the prediction result with the original categories. To have a good overview about the performance of our classification, we give a classification report. It contains the precision, recall and f1 and we give also the accuracy and hamming loss.

In [6]:
from sklearn.multiclass import OneVsRestClassifier

from sklearn.metrics import classification_report, accuracy_score, hamming_loss
from sklearn.svm import LinearSVC

print("CLASSIFICATION START")
print(datetime.datetime.now())
svmClassifier = OneVsRestClassifier(LinearSVC(), n_jobs=-1)
svmClassifier.fit(vectorised_train_abstracts, train_labels)

svmPreds = svmClassifier.predict(vectorised_test_abstracts)
print("Classification Report:")
print(classification_report(test_labels, svmPreds))
print("Accuracy: ", accuracy_score(test_labels, svmPreds))
print("Hamming Loss: ", hamming_loss(test_labels, svmPreds))
print("END")
print(datetime.datetime.now())

CLASSIFICATION START
2021-01-05 22:49:13.160028
Classification Report:
              precision    recall  f1-score   support

           0       0.62      0.50      0.55    150525
           1       0.96      0.95      0.96    223989
           2       0.92      0.95      0.93    263754
           3       0.77      0.54      0.64     55793
           4       0.71      0.36      0.48      2228
           5       0.75      0.53      0.62     62883
           6       0.78      0.46      0.58     23186
           7       0.74      0.42      0.54     21260
           8       0.76      0.32      0.45      5971
           9       0.77      0.56      0.65     48603
          10       0.73      0.34      0.46     13413
          11       0.75      0.45      0.57     24988
          12       0.69      0.18      0.29       901
          13       0.79      0.64      0.71     23390
          14       0.67      0.26      0.37     36384
          15       0.78      0.47      0.59     18892
          

If we consider the computation time, the recall, precision and f1-score, LinearSVC is a good classifier for this use case. It took only 1 hour and 22 minutes do load 1.7 million datasampels, preprocess the text, train the classifier and test the classifier in comparison to the total computation time of decision tree with 8 hours and 24 minutes.

An alternative to this approach would be to use the deep learning model BERT or achieve text classification with an RNN (recurrent neural network). For these alternative there are examples for text classification on the tensorflow webpage. With this and the large dataset, even better results could be achieved. The downside of these alternatives is that they need a lot more time to train.