# Task 1. Classification 

To perform the classification we use the following modules (see requirements.txt to set up the virtual environment):

In [13]:
import pandas as pd
import nltk
from nltk.corpus import PlaintextCorpusReader
from sklearn.model_selection import train_test_split
from simpletransformers.classification import ClassificationModel



We build the corpus based on the challenge documents using nltk:

In [6]:
docs_dir = './documents_challenge/'
documents = PlaintextCorpusReader(docs_dir, '.*')

We define a `labeler` function that inspects each file's path (its corpus file id) to see which directory the text is stored in. We use this directory as the label for the text; a path that matches none of the known directories gets -1.

In [4]:
def labeler(title):
    if 'APR/' in title:
        return 0
    elif 'Conference_papers/' in title:
        return 1
    elif 'PAN11/' in title:
        return 2
    elif 'Wikipedia/' in title:
        return 3
    else:
        return -1
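
The same lookup can also be written as a table, which keeps the directory names and their numeric labels in one place. This is only a sketch: `LABELS`, `label_from_path`, and the example file ids are hypothetical names chosen for illustration, not part of the notebook.

```python
# Hypothetical table-driven variant of `labeler`; all names here are illustrative.
LABELS = {"APR": 0, "Conference_papers": 1, "PAN11": 2, "Wikipedia": 3}


def label_from_path(file_id):
    """Return the label of the first known directory found in file_id, or -1."""
    for directory, label in LABELS.items():
        if f"{directory}/" in file_id:
            return label
    return -1


# Made-up file ids, only to show the behaviour.
examples = ["APR/en/some_review.txt", "Wikipedia/fr/some_article.txt", "other/readme.txt"]
labels = [label_from_path(file_id) for file_id in examples]
print(labels)  # [0, 3, -1]
```

Keeping the mapping in a dictionary also makes it easy to translate numeric labels back to directory names when inspecting predictions later.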

To feed the data to the BERT model we need it in tabular form, so the `generate_table` function builds a pandas DataFrame with a *Text* column for the text (truncated to its first 35,000 characters) and a *Category* column for the label:

In [7]:
def generate_table(corpus):
    data_table = {'Text' : [], 'Category' : []}
    for title in corpus.fileids():
        data_table['Text'].append(corpus.raw(title)[0:35000])
        data_table['Category'].append(labeler(title))
    data = pd.DataFrame.from_dict(data_table)
    return data

In [8]:
tabular_data = generate_table(documents)

We check that the table has one row per document and the two expected columns:

In [9]:
tabular_data.shape

(23128, 2)

No document received the fallback label -1, so every file was assigned to one of the four categories:

In [30]:
tabular_data[tabular_data.Category==-1]

Unnamed: 0,Text,Category


We split the data into training and test sets. Since the task does not specify how to split, we split at random and hold out 10% of the documents for testing:

In [11]:
train_df, test_df = train_test_split(tabular_data, test_size = 0.10)
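
A purely random split does not guarantee that each category has the same share in both sets, and it changes on every run. The sketch below illustrates, with the standard library only, what a seeded, stratified split does. The function `stratified_split` and the toy labels are hypothetical; in the notebook the likely equivalent would be passing the `stratify` and `random_state` parameters to `train_test_split`.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.10, seed=0):
    """Split row indices so each class keeps roughly the same share in both sets."""
    rng = random.Random(seed)  # a fixed seed makes the split reproducible
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels (not the challenge data): 50, 30 and 20 documents per class.
toy_labels = [0] * 50 + [1] * 30 + [2] * 20
train_idx, test_idx = stratified_split(toy_labels, test_size=0.10, seed=0)
print(len(train_idx), len(test_idx))
```

With the toy labels, each class contributes 10% of its documents to the test set, so the class proportions are the same in both parts.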

We use a BERT model. It requires its own preprocessing (tokenization of the input text), which the `simpletransformers` library automates. Because the model comes pre-trained, a few epochs of fine-tuning are enough to reach good performance:

In [19]:
# Define the training hyperparameters
train_args = {
    "reprocess_input_data": True,
    "fp16": False,
    "num_train_epochs": 3,
}

# Create a ClassificationModel
model = ClassificationModel(
    "bert", "bert-base-multilingual-uncased",
    num_labels=4,
    args=train_args,
    use_cuda=False
)

Some weights of the model checkpoint at bert-base-multilingual-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model 

Once the model is created we need to train it. On my computer (no GPU), training for 3 epochs took around 14 hours:

In [20]:
model.train_model(train_df)


Once the model is trained, we evaluate it on the test set that was held out from training:

In [21]:
from sklearn.metrics import f1_score, accuracy_score


def f1_multiclass(labels, preds):
    return f1_score(labels, preds, average='micro')
    
result, model_outputs, wrong_predictions = model.eval_model(test_df, f1=f1_multiclass, acc=accuracy_score)

  "Dataframe headers not specified. Falling back to using column 0 as text and column 1 as labels."



In [22]:
result

{'mcc': 0.9976673740932693,
 'f1': 0.9987029831387808,
 'acc': 0.9987029831387808,
 'eval_loss': 0.009810411997907444}

In [37]:
len(wrong_predictions)

3
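
The `f1` and `acc` values in the result are identical. That is expected for micro-averaged F1 on a single-label multiclass task: every misclassified document counts once as a false positive (for the predicted class) and once as a false negative (for the true class), so pooled precision and recall both equal accuracy. The sketch below checks this on toy data and then reproduces the reported accuracy from the counts. The helper functions and toy lists are illustrative assumptions, not the `sklearn` implementation.

```python
def micro_f1(labels, preds):
    """Micro-averaged F1: pool TP, FP and FN over all classes before dividing."""
    classes = set(labels) | set(preds)
    tp = fp = fn = 0
    for c in classes:
        tp += sum(1 for y, p in zip(labels, preds) if y == c and p == c)
        fp += sum(1 for y, p in zip(labels, preds) if y != c and p == c)
        fn += sum(1 for y, p in zip(labels, preds) if y == c and p != c)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def accuracy(labels, preds):
    """Fraction of documents whose predicted label equals the true label."""
    return sum(1 for y, p in zip(labels, preds) if y == p) / len(labels)


# Toy data (not the notebook's predictions): 8 documents, 2 errors.
toy_labels = [0, 0, 1, 1, 2, 2, 3, 3]
toy_preds = [0, 1, 1, 1, 2, 0, 3, 3]
print(micro_f1(toy_labels, toy_preds), accuracy(toy_labels, toy_preds))

# 3 errors among the 2313 held-out documents (10% of 23128, rounded up).
reported_accuracy = (2313 - 3) / 2313
print(reported_accuracy)
```

The last value agrees with the `acc` entry of the result above, which confirms that the three wrong predictions account for the whole error.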

As we can see, with only 3 epochs the model performs very well, misclassifying only three test documents. Let's look at the misclassified texts:

In [31]:
wrong_predictions[0].text_a

"Que des femmes désaeuvrées aient des fantasmes, sado-masochistes ou non, de prostitution ou simplement de sexualité avec des inconnus, est une absolue banalité. Elles sont comme les hommes en dernière analyse, avec la différence que c'est le désaeuvrement qui semble cultiver ces fantasmes chez les femmes, comme une façon de se doter d'une activité, alors que c'est plutôt l'hyperactivité qui semble produire la floraison des fantasmes chez l'homme, comme un divertissement dans le travail ou une distraction du travail. Du moins c'est la vision que nous donne Bunuel de ces fantasmes où se rencontrent les femmes désaeuvrées et les hommes hyperactifs. Par contre que l'on classifie ces femmes de grandes bourgeoises est une aberration. Belle de Jour n'est que la femme d'un médecin-chirurgien hospitalier. Ce n'est pas là une grande bourgeoise. Simplement une femme de la classe moyenne supérieure qui a le privilège de pouvoir ne rien faire. Ceci étant, cette femme va tomber amoureuse bien sûr d

In [34]:
wrong_predictions[0].label

0

In [32]:
wrong_predictions[1].text_a

'Turns out that some economic entrepreneurs are in the process of exploiting people. (All people. O.K. the world.) They seem to have a different physical appearance than you or I and for the sake of keeping the exploration covert, have electronically masked themselves to appear normal.'

In [35]:
wrong_predictions[1].label

0

In [33]:
wrong_predictions[2].text_a

'A mysterious yank (The Quiet Man) arrives on the train and asks for directions to Innisfree. This quiet man turns out to be returning to his home after a hard life in America. There he purchases his former home to the shegrim of the neighbor "Red" Will Danaher (Victor McLagen); Danaher covets the house himself. To ad insult to injury it looks like Danaher\'s sister Mary Kate Danaher (Maureen O\'Hara) may be destined to marry our quiet man Sean Thornton (John Wayne.)\n We learn a few Irish no-nos; such as you do not play patty fingers with the holy water. I will not go through the whole story as it is fun to watch it unfold. However there is a good example of horse sense as the horse knows to stop at the pub for Michaeleen Oge Flynn (Barry Fitzgerald) whom has a very dry throat.\n The scenery along the bay and the fields gives the story a run for its money.\n Be aware that different distributors have different quality of this product so the rating is for the Movie alone not the packagi

In [36]:
wrong_predictions[2].label

0
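
All three misclassified texts have true label 0, which `labeler` assigns to the `APR/` directory. A tally like the following makes that pattern visible at a glance. It is a sketch: the `SimpleNamespace` objects are stand-ins that only mimic the `text_a` and `label` attributes used above (with shortened texts), and `LABEL_NAMES` is a hypothetical helper list.

```python
from collections import Counter
from types import SimpleNamespace

# Directory names in the order used by `labeler` (labels 0 to 3).
LABEL_NAMES = ["APR", "Conference_papers", "PAN11", "Wikipedia"]

# Stand-ins for the three misclassified examples shown above (texts shortened).
wrong = [
    SimpleNamespace(text_a="Que des femmes ...", label=0),
    SimpleNamespace(text_a="Turns out that some economic entrepreneurs ...", label=0),
    SimpleNamespace(text_a="A mysterious yank (The Quiet Man) ...", label=0),
]

# Count how often each true category appears among the errors.
true_label_counts = Counter(LABEL_NAMES[example.label] for example in wrong)
print(true_label_counts)
```

In the notebook, the same comprehension could presumably be run over `wrong_predictions` directly, since its elements expose the same two attributes.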