# Learning on various text representations - shortened

This notebook was used to run all of the experiments, based on the functions defined in utils.py. To inspect the step-by-step procedure, see the *3-Learning-On-Various_Representations.ipynb* file.

## Preparing the dataset for FastText

Importing the necessary libraries

In [None]:
!pip install parse

In [1]:
import json
import pandas as pd
from copy import deepcopy
import re
from tqdm import tqdm
import fasttext as ft
import parse
import numpy as np
from sklearn.metrics import f1_score
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt
from utils import *

In [2]:
# Import the file with additional text representations (only the paragraphs marked to be kept
# in the original corpus are included)

with open("/kaggle/input/ginco-with-additional-text-representations/Language-Processed-GINCO.json") as f:
    dataset = json.load(f)

dataset[0]

In [4]:
dataset[0].keys()

### Pre-processing dataset

Here we can create additional representations if we wish (see the notebook *2-Language-Processing-of-GINCO*).

1. Remove punctuation from each token

In [None]:
from string import punctuation

# build the translation table once: it deletes every punctuation character
table = str.maketrans('', '', punctuation)

for instance in tqdm(dataset):
    text = instance["baseline_text"]

    # split text into tokens by white space
    tokens = text.split()

    # remove punctuation from each token
    tokens = [word.translate(table) for word in tokens]

    # add a new key with punctuation removed
    instance["nopunctuation"] = " ".join(tokens)

2. Remove numbers from each token

3. Remove stopwords

4. Apply all (lowercase, remove punctuation, numbers, stopwords)
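
Steps 2 to 4 could be sketched along the same lines as step 1. This is a minimal sketch, not the code from notebook 2: the helper names, the key names `nonumbers`, `nostopwords` and `all_preprocessing`, and the tiny stopword set are all assumptions made for illustration, and `toy_dataset` stands in for the real `dataset`.

```python
import re
from string import punctuation

# Hypothetical, deliberately tiny stopword set; a real run would need a
# full stopword list for the language of the corpus.
STOPWORDS = {"the", "in", "a"}

PUNCT_TABLE = str.maketrans("", "", punctuation)


def remove_punctuation(text):
    """Strip punctuation from every token and drop tokens that become empty."""
    tokens = (token.translate(PUNCT_TABLE) for token in text.split())
    return " ".join(token for token in tokens if token)


def remove_numbers(text):
    """Strip digits from every token and drop tokens that become empty."""
    tokens = (re.sub(r"\d+", "", token) for token in text.split())
    return " ".join(token for token in tokens if token)


def remove_stopwords(text, stopwords=STOPWORDS):
    """Drop tokens whose lowercase form is in the stopword set."""
    return " ".join(t for t in text.split() if t.lower() not in stopwords)


def apply_all(text, stopwords=STOPWORDS):
    """Lowercase, then remove punctuation, numbers and stopwords."""
    text = remove_punctuation(text.lower())
    text = remove_numbers(text)
    return remove_stopwords(text, stopwords)


# Toy stand-in for the real dataset (a list of dicts with "baseline_text").
toy_dataset = [{"baseline_text": "The cat, aged 12 years, sat in 2 boxes!"}]

for instance in toy_dataset:
    text = instance["baseline_text"]
    # Key names are assumptions, chosen to mirror "nopunctuation".
    instance["nonumbers"] = remove_numbers(text)
    instance["nostopwords"] = remove_stopwords(text)
    instance["all_preprocessing"] = apply_all(text)
```

Lowercasing before the stopword lookup keeps the stopword set small, since it only needs lowercase entries.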

### Downcasting number of labels

In these experiments, we will not use all of the texts but only texts from the 5 main categories: some categories will be merged into them, whereas categories with a very low frequency will be discarded. Additionally, the texts marked as hard will be discarded (see notebook *1-Preparing_Data_Hyperparameter_Search*).

We will start with a reduced set of labels (primary_level_3), then merge News and Opinionated News, and discard some of the labels.

In [5]:
# merge News and Opinionated News
for i in dataset:
    if i["primary_level_3"] == "Opinionated News" or i["primary_level_3"] == "News/Reporting":
        i["primary_level_3"] = "News"

Let's create a train:test:dev split that contains only the wanted labels.

In [6]:
downcasted_labels = ['Information/Explanation', 'Promotion', 'News', 'Forum', 'Opinion/Argumentation']

train = [i for i in dataset if i["split"] == "train" and i["primary_level_3"] in downcasted_labels and not i["hard"]]
test = [i for i in dataset if i["split"] == "test" and i["primary_level_3"] in downcasted_labels and not i["hard"]]
dev = [i for i in dataset if i["split"] == "dev" and i["primary_level_3"] in downcasted_labels and not i["hard"]]

print("The train-test-dev splits consist of the following numbers of examples:", len(train), len(test), len(dev))

In [7]:
print(f"Number of all texts is {len(train)+len(test)+len(dev)}")
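
After filtering, it can also be useful to check how the remaining labels are distributed in each split. The following is a minimal sketch: `label_distribution` is a hypothetical helper (it is not part of utils.py as far as this notebook shows), and `toy_train` stands in for the real `train` list.

```python
from collections import Counter


def label_distribution(split, label_key="primary_level_3"):
    """Count how many texts each label has in one split."""
    return Counter(instance[label_key] for instance in split)


# Toy stand-in for the real train split.
toy_train = [
    {"primary_level_3": "News"},
    {"primary_level_3": "News"},
    {"primary_level_3": "Forum"},
]

# Print labels from most to least frequent.
for label, count in label_distribution(toy_train).most_common():
    print(f"{label}: {count}")
```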

### Creating FastText texts

Use the function fastText_files(representation) from utils.py.

This function creates and saves the test and train files from the test, train and dev splits of the dataset (named test, dev and train), using the "primary_level_3" labels and the chosen text representation.
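
For orientation, fastText's supervised mode expects one example per line, with the label marked by the `__label__` prefix and followed by the text. The sketch below shows how such a file could be written; it is not the implementation in utils.py, and the helper names, the file name `toy.train` and the toy instances are assumptions.

```python
import os
import tempfile


def to_fasttext_line(instance, representation, label_key="primary_level_3"):
    """Format one instance as '__label__<label> <text>' on a single line."""
    # fastText reads one example per line, so collapse newlines and
    # repeated whitespace inside the text.
    text = " ".join(instance[representation].split())
    # Labels are whitespace-delimited, so replace spaces defensively.
    label = instance[label_key].replace(" ", "_")
    return f"__label__{label} {text}"


def write_fasttext_file(instances, representation, path):
    """Write all instances to path in fastText format and return the path."""
    with open(path, "w", encoding="utf-8") as f:
        for instance in instances:
            f.write(to_fasttext_line(instance, representation) + "\n")
    return path


# Toy stand-in for the real train split.
toy_train = [
    {"primary_level_3": "News", "lowercase": "prime minister\nvisits the city"},
    {"primary_level_3": "Opinion/Argumentation", "lowercase": "i think this is wrong"},
]

out_dir = tempfile.mkdtemp()
train_path = write_fasttext_file(
    toy_train, "lowercase", os.path.join(out_dir, "toy.train")
)

with open(train_path, encoding="utf-8") as f:
    lines = f.read().splitlines()
```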

In [14]:
representation = fastText_files('lowercase')

In [10]:
# Define the label list:
LABELS = representation[0]

LABELS

# Train a fastText model

## Input the data:

In [18]:
FT_train_file = representation[1]
FT_test_file = representation[2]

## Training

Use the trainFastText(representation) function from utils.py.
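
The body of trainFastText is not shown in this notebook. As a rough sketch of what such a step might involve, the code below trains a supervised model with `fasttext.train_supervised`, predicts the top label for each test line, and computes micro and macro F1 with scikit-learn. The names `read_fasttext_file` and `train_and_score`, the result keys, and the `epoch=25` default are assumptions, not the author's settings.

```python
import os
import tempfile


def read_fasttext_file(path, prefix="__label__"):
    """Split each '__label__<label> <text>' line into (label, text)."""
    labels, texts = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            label, _, text = line.partition(" ")
            labels.append(label[len(prefix):])
            texts.append(text)
    return labels, texts


def train_and_score(train_file, test_file, labels, epoch=25):
    """Train a supervised fastText model and return micro and macro F1."""
    # Imported here so the parsing helper is usable without these packages.
    import fasttext
    from sklearn.metrics import f1_score

    model = fasttext.train_supervised(input=train_file, epoch=epoch, verbose=0)

    y_true, texts = read_fasttext_file(test_file)
    # predict() returns (labels, probabilities); keep the top label only.
    y_pred = [model.predict(text)[0][0][len("__label__"):] for text in texts]

    return {
        "microF1": f1_score(y_true, y_pred, labels=labels, average="micro"),
        "macroF1": f1_score(y_true, y_pred, labels=labels, average="macro"),
    }


# Small demo of the parsing helper on a throwaway file.
demo_path = os.path.join(tempfile.mkdtemp(), "toy.test")
with open(demo_path, "w", encoding="utf-8") as f:
    f.write("__label__News some text here\n__label__Forum hello there\n")
demo_labels, demo_texts = read_fasttext_file(demo_path)
```

With the files defined above, a call could look like `train_and_score(FT_train_file, FT_test_file, downcasted_labels)`, assuming the label names in the files match the plain names in `downcasted_labels`.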

In [None]:
final_results = list()

In [None]:
final_results