This notebook constructs the dataset for the binary classifier Suitable <-> Non-Suitable.

In [6]:
import os
import time
import json
import utils
import parse
import fasttext
import numpy as np
import pandas as pd
from scipy import stats
from sklearn.metrics import f1_score
import matplotlib.pyplot as plt

suitable_path = "/home/peterr/macocu/task5_webgenres/data/original/dataset/dataset.json"
nonsuitable_path = "/home/peterr/macocu/task5_webgenres/data/original/dataset/not_suitable_dataset.json"

with open(suitable_path) as f:
    suitable_content = json.load(f)
with open(nonsuitable_path) as f:
    nonsuitable_content = json.load(f)

I will first produce a tabular dataset with as much data as possible. Paragraphs will be joined with the "<p/>" tag. The duplicate tag will be disregarded.

In [7]:
for item in suitable_content:
    item["paragraphs"] = " <p/> ".join([i["text"] for i in item["paragraphs"]])
for item in nonsuitable_content:
    item["paragraphs"] = " <p/> ".join([i["text"] for i in item["paragraphs"]])
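To make the joining step concrete, here is a minimal sketch on a made-up record. The record contents, the `duplicate` key name, and the helper `join_paragraphs` are assumptions for illustration only; the real files may be structured differently.

```python
# Hypothetical record; only the "text" key is known from the cell above,
# the "duplicate" key is an assumed name for the tag that gets dropped.
record = {
    "id": 0,
    "paragraphs": [
        {"text": "First paragraph.", "duplicate": False},
        {"text": "Second paragraph.", "duplicate": True},
    ],
}


def join_paragraphs(paragraphs, sep=" <p/> "):
    """Join paragraph texts with the separator, ignoring every other key."""
    return sep.join(p["text"] for p in paragraphs)


joined = join_paragraphs(record["paragraphs"])
print(joined)  # First paragraph. <p/> Second paragraph.
```

Both paragraphs are kept regardless of the value of the assumed `duplicate` key, which is what "disregarded" means here.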

In [10]:
suitable_df = pd.DataFrame(data=suitable_content)
nonsuitable_df = pd.DataFrame(data=nonsuitable_content)

suitable_df["suitable"] = 1
nonsuitable_df["suitable"] = 0

df = pd.concat([suitable_df, nonsuitable_df], ignore_index=True)
df.head()

Unnamed: 0,id,url,crawled,primary,secondary,tertiary,hard,paragraphs,suitable
0,3949,http://www.pomurje.si/aktualno/sport/zimska-li...,2014,News/Reporting,,,False,"Šport <p/> Zimska liga malega nogometa sobota,...",1
1,3726,http://www.ss-sezana.si/sss/index.php?option=c...,2014,Information/Explanation,,,False,JEDILNIK <p/> Iskalnik <p/> Poglavitni cilj pr...,1
2,5621,http://www.kamnik-starejsi.si/novice/144-sodel...,2014,Promotion of Services,Opinion/Argumentation,Information/Explanation,False,Projekt INNOVAge in zavod Oreli <p/> Zavod Ore...,1
3,3776,http://www.radiocelje.si/novica.php?id=13007&a...,2014,News/Reporting,,,False,"V novembru, mesecu preprečevanja odvisnosti, b...",1
4,2102,http://www.mtv.si/novice/selena-gomez-ponudila...,2014,Opinionated News,,,False,Selena Gomez ponudila v poslušanje novi album ...,1


In [12]:
df.to_csv("data/interim/suitable_tabular.csv", index=False)

In [14]:
sum(df.suitable==0)

123

In [15]:
from sklearn.model_selection import train_test_split

train, test = train_test_split(df, train_size = 0.6, random_state=42)

In [16]:
sum(train.suitable)/train.shape[0]

0.8903703703703704

In [17]:
sum(test.suitable)/test.shape[0]

0.8911111111111111

The share of suitable instances is almost identical in the two splits (about 89% in each), even though the split was not explicitly stratified.
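The split above does not pass a `stratify` argument, so matching class proportions are not guaranteed for other seeds or split sizes. A minimal sketch of an explicitly stratified split, run on a made-up toy frame (the `toy` data is purely illustrative and not the real dataset), could look like this:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 90 suitable and 10 non-suitable rows (illustrative only).
toy = pd.DataFrame({
    "paragraphs": [f"text {i}" for i in range(100)],
    "suitable": [1] * 90 + [0] * 10,
})

# stratify= keeps the class ratio the same in both parts of the split.
toy_train, toy_test = train_test_split(
    toy, train_size=0.6, random_state=42, stratify=toy["suitable"]
)

print(toy_train["suitable"].mean(), toy_test["suitable"].mean())
```

On the real data the same call would use `df` and `df["suitable"]` in place of the toy frame.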

In [28]:
def train_model(train_df):
    # Imported here so the heavy dependency is only loaded when training starts.
    from simpletransformers.classification import ClassificationModel

    model_args = {
        "num_train_epochs": 30,
        "learning_rate": 1e-5,
        "overwrite_output_dir": True,
        "train_batch_size": 32,
        "no_save": True,    # do not write checkpoints to disk
        "no_cache": True,   # do not cache tokenized features
        "save_steps": -1,
        "max_seq_length": 512,
        "silent": True,
    }

    model = ClassificationModel(
        "camembert", "EMBEDDIA/sloberta",
        num_labels=2,
        use_cuda=True,
        args=model_args,
    )
    model.train_model(train_df)
    return model

train.loc[:, "labels"] = train.loc[:, "suitable"]


In [29]:
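# Train 15 models from scratch and store each run's test predictions.
for i in range(15):
    model = train_model(train.loc[:, ["paragraphs", "labels"]])
    # predict() returns (predictions, raw_outputs); keep only the predictions.
    y_pred = model.predict(test.paragraphs.tolist())[0]
    test[f"run_{i}"] = y_pred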

Some weights of the model checkpoint at EMBEDDIA/sloberta were not used when initializing CamembertForSequenceClassification: ['lm_head.bias', 'lm_head.layer_norm.bias', 'lm_head.layer_norm.weight', 'lm_head.decoder.bias', 'lm_head.decoder.weight', 'lm_head.dense.weight', 'lm_head.dense.bias']
- This IS expected if you are initializing CamembertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing CamembertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of CamembertForSequenceClassification were not initialized from the model checkpoint at EMBEDDIA/sloberta and are newly initialized: ['classifier.out_proj.weight', 'roberta.pooler.dense.wei

In [30]:
cols = [c for c in test.columns if c != "paragraphs"]

test.loc[:, cols].to_csv("backup_25.csv", index=False)

In [32]:
test.columns

Index(['id', 'url', 'crawled', 'primary', 'secondary', 'tertiary', 'hard',
       'paragraphs', 'suitable', 'run_0', 'run_1', 'run_2', 'run_3', 'run_4',
       'run_5', 'run_6', 'run_7', 'run_8', 'run_9', 'run_10', 'run_11',
       'run_12', 'run_13', 'run_14'],
      dtype='object')

In [33]:
test.loc[:, ["suitable", "run_1"]]

Unnamed: 0,suitable,run_1
1089,0,0
1103,0,1
739,1,1
140,1,1
1018,0,1
...,...,...
1079,0,1
529,1,1
1121,0,1
7,1,1


In [40]:
from sklearn.metrics import f1_score, confusion_matrix

In [41]:
for i in range(15):
    print(f1_score(test.suitable, test[f"run_{i}"]))
    print(confusion_matrix(test.suitable, test[f"run_{i}"]))

0.9569377990430622
[[ 14  35]
 [  1 400]]
0.9520383693045562
[[ 13  36]
 [  4 397]]
0.9563106796116504
[[ 20  29]
 [  7 394]]
0.9552599758162031
[[ 18  31]
 [  6 395]]
0.9490291262135923
[[ 17  32]
 [ 10 391]]
0.9542168674698795
[[ 16  33]
 [  5 396]]
0.9591346153846153
[[ 17  32]
 [  2 399]]
0.9527272727272728
[[ 18  31]
 [  8 393]]
0.9501822600243014
[[ 18  31]
 [ 10 391]]
0.9513381995133819
[[ 19  30]
 [ 10 391]]
0.9577804583835947
[[ 18  31]
 [  4 397]]
0.9542168674698795
[[ 16  33]
 [  5 396]]
0.9577804583835947
[[ 18  31]
 [  4 397]]
0.9554753309265944
[[ 16  33]
 [  4 397]]
0.9550425273390036
[[ 20  29]
 [  8 393]]
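The F1 scores printed above are computed for the positive (suitable) class only, so they say little about how well the minority class is recognized. As a sketch, per-class and macro F1 can be derived directly from a confusion matrix in the layout that `sklearn.metrics.confusion_matrix` uses (rows are true labels, columns are predictions). The helper name `f1_per_class` is my own; the matrix is the one printed for the first run.

```python
def f1_per_class(cm):
    """Return (F1 for class 0, F1 for class 1) from a 2x2 confusion matrix.

    Layout: cm[true][pred], with labels ordered [0, 1].
    """
    (tn, fp), (fn, tp) = cm
    f1_pos = 2 * tp / (2 * tp + fp + fn)  # class 1 (suitable)
    f1_neg = 2 * tn / (2 * tn + fp + fn)  # class 0 (non-suitable)
    return f1_neg, f1_pos


cm_run_0 = [[14, 35], [1, 400]]  # confusion matrix of the first run
f1_neg, f1_pos = f1_per_class(cm_run_0)
macro_f1 = (f1_neg + f1_pos) / 2

print(f"{f1_neg:.4f} {f1_pos:.4f} {macro_f1:.4f}")  # 0.4375 0.9569 0.6972
```

For this run, 35 of the 49 non-suitable test instances are predicted as suitable, which is why the F1 for class 0 is so much lower than the reported score.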
