# Preprocessing data 

- This notebook contains the preprocessing steps for FinnSentiment, a dataset used here for a text classification task.
- To run the text classification with a dataset other than FinnSentiment, you may need to adapt this notebook to your case. For example, FinnSentiment is distributed as a TSV file, whereas a CSV file can be loaded with the following line:
```python
df = pd.read_csv(cfg.data_folder+"filename.csv")
```
- This notebook splits the data into the subsets needed in later phases of the text classification: separate sets are created for pretraining (MLM) and finetuning. Duplicate and NaN values are also checked and removed.
- Finetuning (i.e. training the text classifier) requires annotations, or labels, that indicate a class for each data sample. For pretraining, labels are omitted.
- The dataset used here is the openly available [FinnSentiment](https://korp.csc.fi/download/finsen/src/). Reference: \
Lindén, K., Jauhiainen, T., & Hardwick, S. (2023). FinnSentiment: A Finnish Social Media Corpus for Sentiment Polarity Annotation. Language Resources and Evaluation, 57(2), 581-609. https://doi.org/10.1007/s10579-023-09644-5 https://arxiv.org/pdf/2012.02613.pdf

## 1. Import libraries, define configuration class, and helper functions

If some libraries are not installed, install them from the command prompt, e.g.
```
pip install transformers
```

In [1]:
from transformers import AutoTokenizer
import transformers
import numpy as np
import pandas as pd
from sklearn import model_selection
from sklearn.model_selection import train_test_split
import csv
print(f'transformers version: {transformers.__version__}')

transformers version: 4.16.2


In [2]:
class cfg():
    model_name = "TurkuNLP/bert-base-finnish-cased-v1" #"TurkuNLP/bert-large-finnish-cased-v1" 
    data_folder = "./" #alternatively use a full path: data_folder = "/path/to/data/"

In [3]:
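#create k folds: adds a "kfold" column telling in which fold (0..num_splits-1) each row is used for validation
def create_kfolds(data, num_splits, random_seed):
    #work on a copy with a clean 0..n-1 index: the caller's dataframe is left untouched,
    #and the row positions returned by KFold match the index labels used by .loc below
    data = data.reset_index(drop=True)
    data["kfold"] = -1
    kf = model_selection.KFold(n_splits=num_splits, shuffle=True, random_state=random_seed)
    for f, (t_, v_) in enumerate(kf.split(X=data)):
        data.loc[v_, 'kfold'] = f
    return data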

## 2. Taking a look at the data


**README file from [here](https://korp.csc.fi/download/finsen/src/README.txt).**

NAME: FinnSentiment 1.1, source
LICENSE: This corpus is licensed with CC BY 4.0

For more information see http://urn.fi/urn:nbn:fi:lb-2023012701

FinnSentiment is a Finnish social media corpus for sentiment polarity annotation. 27,000 sentence data set annotated independently with sentiment polarity by three native annotators.

The creation of the corpus is documented in K. Lindén, T. Jauhiainen, S. Hardwick (2023): FinnSentiment - A Finnish Social Media Corpus for Sentiment Polarity Annotation. Language Resources and Evaluation 2023.

This is a supplementary release containing additional re-annotations done by the authors of the article.

The corpus is available in a utf-8 encoded TSV (tab-separated values) file with columns as indicated in the following list. In the list, "split" refers to the cross-validation split to which a sentence belongs, and "batch" to the work package the sentence belongs to. Indexes to the original corpus are strings consisting of a filename, like comments2008c.vrt, a space character, and a sentence id number in the file.

The additional annotations, new in this release, are in columns 6-14. In each case, only some sentences were re-annotated, and missing values are indicated by empty fields.

Columns 6-8 contain annotations for 1000 randomly selected sentences done by author A, author B, and author C respectively. Annotations are -1, 0 or 1.

Columns 9-11 contain annotations for 505 sentences where there was a strong disagreement between the original annotators, ie. both positive and negative annotations were given. The annotators are again author A, author B and author C, and annotations are again -1, 0 or 1.

Columns 12-14 contain annotations for 100 random sentences each from those sentences which had a derived score (column 5) of 1, 2, 3, 4, and 5, for a total of 500 sentences. Annotations are 1-5, and annotators are as in the previous case.


| Column | Column name                            | Range / data type      |
|--------|----------------------------------------|------------------------|
| 1      | A sentiment                            | [-1, 1]                |
| 2      | B sentiment                            | [-1, 1]                |
| 3      | C sentiment                            | [-1, 1]                |
| 4      | majority value                         | [-1, 1]                |
| 5      | derived value                          | [1, 5]                 |
| 6      | Author A random sentence sentiment     | [-1, 1]                |
| 7      | Author B random sentence sentiment     | [-1, 1]                |
| 8      | Author C random sentence sentiment     | [-1, 1]                |
| 9      | Author A strong disagree sentiment     | [-1, 1]                |
| 10     | Author B strong disagree sentiment     | [-1, 1]                |
| 11     | Author C strong disagree sentiment     | [-1, 1]                |
| 12     | Author A derived score sentiment       | [1, 5]                 |
| 13     | Author B derived score sentiment       | [1, 5]                 |
| 14     | Author C derived score sentiment       | [1, 5]                 |
| 15     | pre-annotated sentiment smiley         | [-1, 1]                |
| 16     | pre-annotated sentiment product review | [-1, 1]                |
| 17     | split #                                | [1, 20]                |
| 18     | batch #                                | [1,9]                  |
| 19     | index in original corpus               | Filename & sentence id |
| 20     | sentence text                          | Raw string             |

Loading FinnSentiment 1.1.

Loading the file with the read_csv function from the pandas library gives the following result. This is usually a solid way to load a dataset in CSV, TSV or TXT format (remember to set the delimiter to "\t" for TSV files, or to whichever delimiter your file uses).

In [4]:
df = pd.read_csv(cfg.data_folder+"FinnSentiment-1.1.tsv", delimiter="\t",
                 names=['A_sentiment', 'B_sentiment', 'C_sentiment', 'majority_value', 'derived_value',
                        'Author_A_random_sentence_sentiment',
                        'Author_B_random_sentence_sentiment',
                        'Author_C_random_sentence_sentiment',
                        'Author_A_strong_disagree_sentiment',
                        'Author_B_strong_disagree_sentiment',
                        'Author_C_strong_disagree_sentiment',
                        'Author_A_derived_score_sentiment',
                        'Author_B_derived_score_sentiment',
                        'Author_C_derived_score_sentiment',
                        'pre-annotated_sentiment_smiley',
                        'pre-annotated_sentiment_product_review',
                        'split', 'batch', 'index_in_original_corpus', 'sentence_text'])
df

Unnamed: 0,A_sentiment,B_sentiment,C_sentiment,majority_value,derived_value,Author_A_random_sentence_sentiment,Author_B_random_sentence_sentiment,Author_C_random_sentence_sentiment,Author_A_strong_disagree_sentiment,Author_B_strong_disagree_sentiment,Author_C_strong_disagree_sentiment,Author_A_derived_score_sentiment,Author_B_derived_score_sentiment,Author_C_derived_score_sentiment,pre-annotated_sentiment_smiley,pre-annotated_sentiment_product_review,split,batch,index_in_original_corpus,sentence_text
0,1,0,1,1,4,-1.0,-1.0,0.0,,,,4.0,4.0,4.0,0,-1,1,1,comments2008c.vrt 2145269,- Tervetuloa skotlantiin...
1,0,1,0,0,4,,,,,,,2.0,2.0,3.0,0,-1,12,1,comments2011c.vrt 3247745,"...... No, oikein sopiva sattumaha se vaan oli..."
2,0,0,0,0,3,,,,,,,3.0,3.0,3.0,0,-1,14,1,comments2007c.vrt 3792960,40.
3,1,1,1,1,5,,,,,,,3.0,3.0,3.0,0,1,7,1,comments2010d.vrt 2351708,Kyseessä voi olla loppuelämäsi nainen.
4,1,1,1,1,5,,,,,,,4.0,4.0,4.0,0,1,12,1,comments2007d.vrt 1701675,Sinne vaan ocean clubiin iskemään!
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
23867,-1,0,0,0,2,,,,,,,,,,0,1,15,9,threads2015a.vrt 5122042,sais nauraa.
23868,-1,0,-1,-1,2,,,,,,,,,,0,-1,10,9,comments2015d.vrt 3337506,"Menkää töihin, Jonnekin muuannekin kun niihin ..."
23869,0,0,0,0,3,,,,,,,,,,0,0,14,9,comments2011b.vrt 960278,Ja tiedätkö että minä joka olen Jumalan armost...
23870,-1,-1,0,-1,2,,,,,,,,,,0,0,14,9,comments2018a.vrt 2653599,Ei näiltä happopäiltä jotka itseään kutsuvat s...


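Note that the dataframe now has 23 872 rows instead of 27 000! Some rows have been merged, most likely because quote characters inside the sentences are interpreted as field quoting. So let's try another way: reading the file with the csv module and with quoting disabled.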

In [5]:
d=[]

with open(cfg.data_folder+"FinnSentiment-1.1.tsv", encoding="utf-8", newline="") as tsvfile:
    tsvreader = csv.reader(tsvfile, delimiter="\t", quoting=csv.QUOTE_NONE)
    for line in tsvreader:
        d.append(line)

df = pd.DataFrame(data=d, columns=['A_sentiment', 'B_sentiment', 'C_sentiment', 'majority_value', 'derived_value',
                        'Author_A_random_sentence_sentiment',
                        'Author_B_random_sentence_sentiment',
                        'Author_C_random_sentence_sentiment',
                        'Author_A_strong_disagree_sentiment',
                        'Author_B_strong_disagree_sentiment',
                        'Author_C_strong_disagree_sentiment',
                        'Author_A_derived_score_sentiment',
                        'Author_B_derived_score_sentiment',
                        'Author_C_derived_score_sentiment',
                        'pre-annotated_sentiment_smiley',
                        'pre-annotated_sentiment_product_review',
                        'split', 'batch', 'index_in_original_corpus', 'sentence_text'])
df

Unnamed: 0,A_sentiment,B_sentiment,C_sentiment,majority_value,derived_value,Author_A_random_sentence_sentiment,Author_B_random_sentence_sentiment,Author_C_random_sentence_sentiment,Author_A_strong_disagree_sentiment,Author_B_strong_disagree_sentiment,Author_C_strong_disagree_sentiment,Author_A_derived_score_sentiment,Author_B_derived_score_sentiment,Author_C_derived_score_sentiment,pre-annotated_sentiment_smiley,pre-annotated_sentiment_product_review,split,batch,index_in_original_corpus,sentence_text
0,1,0,1,1,4,-1,-1,0,,,,4,4,4,0,-1,1,1,comments2008c.vrt 2145269,- Tervetuloa skotlantiin...
1,0,1,0,0,4,,,,,,,2,2,3,0,-1,12,1,comments2011c.vrt 3247745,"...... No, oikein sopiva sattumaha se vaan oli..."
2,0,0,0,0,3,,,,,,,3,3,3,0,-1,14,1,comments2007c.vrt 3792960,40.
3,1,1,1,1,5,,,,,,,3,3,3,0,1,7,1,comments2010d.vrt 2351708,Kyseessä voi olla loppuelämäsi nainen.
4,1,1,1,1,5,,,,,,,4,4,4,0,1,12,1,comments2007d.vrt 1701675,Sinne vaan ocean clubiin iskemään!
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
26995,-1,0,0,0,2,,,,,,,,,,0,1,15,9,threads2015a.vrt 5122042,sais nauraa.
26996,-1,0,-1,-1,2,,,,,,,,,,0,-1,10,9,comments2015d.vrt 3337506,"Menkää töihin, Jonnekin muuannekin kun niihin ..."
26997,0,0,0,0,3,,,,,,,,,,0,0,14,9,comments2011b.vrt 960278,Ja tiedätkö että minä joka olen Jumalan armost...
26998,-1,-1,0,-1,2,,,,,,,,,,0,0,14,9,comments2018a.vrt 2653599,Ei näiltä happopäiltä jotka itseään kutsuvat s...
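
The difference in row counts most likely comes from quote handling: by default, read_csv treats `"` as a quoting character, so a sentence with an unmatched quote can swallow the following lines into a single field. A minimal sketch on made-up data (the three sentences below are invented for illustration and are not taken from FinnSentiment) shows the effect. It also suggests that passing `quoting=csv.QUOTE_NONE` to read_csv should give the same literal reading as the csv.reader approach above:

```python
import csv
import io

import pandas as pd

# Made-up TSV content: the first sentence opens a quote that is only closed on the third line.
raw = '1\t"He said hello\n0\tplain sentence\n-1\tshe answered" and left\n'

# Default quoting: everything between the two quote characters becomes one field,
# so several lines collapse into a single row.
df_default = pd.read_csv(io.StringIO(raw), delimiter="\t", names=["label", "text"])

# QUOTE_NONE: quote characters are ordinary text, so every line stays a separate row.
df_literal = pd.read_csv(io.StringIO(raw), delimiter="\t", names=["label", "text"],
                         quoting=csv.QUOTE_NONE)

print("rows with default quoting:", len(df_default))
print("rows with QUOTE_NONE:", len(df_literal))
```

Comparing the resulting row count with the one stated in the dataset's README is a cheap sanity check after loading.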


Loading FinnSentiment2020, the release without the re-annotation columns.

In [6]:
# "mangled" data: 23 872 rows instead of 27 000
# df = pd.read_csv(cfg.data_folder+"FinnSentiment2020.tsv", delimiter="\t",
#                              names=['A_sentiment', 'B_sentiment', 'C_sentiment',
#                                    'majority_value', 'derived_value', 'sentiment_smiley',
#                                    'sentiment_productreview', 'split', 'batch',
#                                    'orig_index', 'sentence_text'])
# df

In [6]:
d=[]

with open(cfg.data_folder+"FinnSentiment2020.tsv", encoding="utf-8", newline="") as tsvfile:
    tsvreader = csv.reader(tsvfile, delimiter="\t", quoting=csv.QUOTE_NONE)
    for line in tsvreader:
        d.append(line)

df = pd.DataFrame(data=d, columns=['A_sentiment', 'B_sentiment', 'C_sentiment',
                           'majority_value', 'derived_value', 'sentiment_smiley',
                           'sentiment_productreview', 'split', 'batch',
                           'orig_index', 'sentence_text'])
df

Unnamed: 0,A_sentiment,B_sentiment,C_sentiment,majority_value,derived_value,sentiment_smiley,sentiment_productreview,split,batch,orig_index,sentence_text
0,1,0,1,1,4,0,-1,1,1,comments2008c.vrt 2145269,- Tervetuloa skotlantiin...
1,0,1,0,0,4,0,-1,12,1,comments2011c.vrt 3247745,"...... No, oikein sopiva sattumaha se vaan oli..."
2,0,0,0,0,3,0,-1,14,1,comments2007c.vrt 3792960,40.
3,1,1,1,1,5,0,1,7,1,comments2010d.vrt 2351708,Kyseessä voi olla loppuelämäsi nainen.
4,1,1,1,1,5,0,1,12,1,comments2007d.vrt 1701675,Sinne vaan ocean clubiin iskemään!
...,...,...,...,...,...,...,...,...,...,...,...
26995,-1,0,0,0,2,0,1,15,9,threads2015a.vrt 5122042,sais nauraa.
26996,-1,0,-1,-1,2,0,-1,10,9,comments2015d.vrt 3337506,"Menkää töihin, Jonnekin muuannekin kun niihin ..."
26997,0,0,0,0,3,0,0,14,9,comments2011b.vrt 960278,Ja tiedätkö että minä joka olen Jumalan armost...
26998,-1,-1,0,-1,2,0,0,14,9,comments2018a.vrt 2653599,Ei näiltä happopäiltä jotka itseään kutsuvat s...


Let's pick only majority_value and sentence_text, and rename them to label and text.

In [6]:
df = df[['majority_value','sentence_text']]
df = df.rename({'sentence_text': 'text', 'majority_value': 'label'}, axis=1)
df

Unnamed: 0,label,text
0,1,- Tervetuloa skotlantiin...
1,0,"...... No, oikein sopiva sattumaha se vaan oli..."
2,0,40.
3,1,Kyseessä voi olla loppuelämäsi nainen.
4,1,Sinne vaan ocean clubiin iskemään!
...,...,...
26995,0,sais nauraa.
26996,-1,"Menkää töihin, Jonnekin muuannekin kun niihin ..."
26997,0,Ja tiedätkö että minä joka olen Jumalan armost...
26998,-1,Ei näiltä happopäiltä jotka itseään kutsuvat s...


Let's see some examples of data samples belonging to each class. The class labels include:

- 0 = neutral
- 1 = positive
- -1 = negative

In [7]:
df[df.label=="0"][:5].text.values

array(['...... No, oikein sopiva sattumaha se vaan oli, vai mitä?', '40.',
       'Kamppi, Kontula, Kluuvi', 'huomista päivää odotellen..',
       'Ihmiset joutuvat joskus monenlaisien päätelmien kohteeksi.'],
      dtype=object)

In [8]:
df[df.label=="1"][:5].text.values

array(['- Tervetuloa skotlantiin...',
       'Kyseessä voi olla loppuelämäsi nainen.',
       'Sinne vaan ocean clubiin iskemään!',
       'Itsekin pidän Keskustan kampanjointia ihan hyvänä.',
       'Muutenkin suosittelen kaikille asiasta kiinnostuneille tuota Mark "Gravy" Robertsin mainiota paperia.'],
      dtype=object)

In [9]:
df[df.label=="-1"][:5].text.values

array(['en haluaisi että kissani vuotaa.. =)',
       'Nyt olisi lääkitys paikallaan.',
       'Eniten pelkään sitä, että jos mies vain koko ajan siirtää perheenperustamista vuosilla eteenpäin, kunnes emme enää saakkaan lapsia..tiedä häntä.',
       'Työstä kuuluu maksaa palkkaa vai meinaatko että palkkatyöläiset menisivät 9eurolla hommiin ja maksaisivat sillä laskunsa.Ei hyvää päivää mihin maamme on vajonnut ja hallitus ajaa tälläistä politiikkaa.',
       '"Vastahan sinä myönsit, ettei tuolla ominaisuuden hyödyllisyydellä - eikä siten vahingollisuudella - ole mitään tekemistä asian kanssa!'],
      dtype=object)

Check whether there are any duplicate or NaN values, especially in the text column.

In [10]:
#check and print the amount of duplicate and nan values in a dataframe
def check_dupl_nan(df):
    print('NaNs: ',df.isna().sum())
    for i in df.columns:
        print('Duplicates in {0}: {1}'.format(i,df[i].duplicated().sum()))

check_dupl_nan(df)

NaNs:  label    0
text     0
dtype: int64
Duplicates in label: 26997
Duplicates in text: 479


Let's remove duplicates (and NaNs) in the text column.

In [11]:
#remove duplicate and nan values in a dataframe
#nan values in nan_columns, duplicate values in dupl_columns
#note that this function was specifically created to edit one or two columns, and may not suit other uses
def remove_dupl_nan(df, nan_columns=['text'], dupl_columns=['text']):
    df = df.dropna(subset=nan_columns)
    df = df.drop_duplicates(subset=dupl_columns)
    df = df.reset_index(drop=True)
    return df

df = remove_dupl_nan(df)
df

Unnamed: 0,label,text
0,1,- Tervetuloa skotlantiin...
1,0,"...... No, oikein sopiva sattumaha se vaan oli..."
2,0,40.
3,1,Kyseessä voi olla loppuelämäsi nainen.
4,1,Sinne vaan ocean clubiin iskemään!
...,...,...
26516,0,sais nauraa.
26517,-1,"Menkää töihin, Jonnekin muuannekin kun niihin ..."
26518,0,Ja tiedätkö että minä joka olen Jumalan armost...
26519,-1,Ei näiltä happopäiltä jotka itseään kutsuvat s...
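
Note that drop_duplicates keeps the first occurrence of each text, so if the same sentence appears with different labels, only the first label survives. A small sketch on invented rows of how one might count such conflicts before dropping anything (the toy dataframe and the names `labels_per_text` and `conflicts` are illustrative assumptions, not part of this notebook):

```python
import pandas as pd

# Invented rows: "Hei" occurs with two different labels, "eli" twice with the same label.
toy = pd.DataFrame({
    "label": ["0", "1", "0", "-1", "0"],
    "text": ["Hei", "Hei", "eli", "alus", "eli"],
})

# Number of distinct labels attached to each distinct text.
labels_per_text = toy.groupby("text")["label"].nunique()

# Texts that occur with more than one label.
conflicts = labels_per_text[labels_per_text > 1]
print(conflicts)

# Keeping the first occurrence of each text, as remove_dupl_nan does.
deduplicated = toy.drop_duplicates(subset=["text"]).reset_index(drop=True)
print(deduplicated)
```

If conflicts turn out to be common in your own data, dropping all conflicting rows may be a safer choice than keeping the first one.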


Print a few metrics about token lengths, i.e. how long each data sample (row) is after tokenization. A token can be a word, a part of a word, or a punctuation mark, for example.

In [12]:
#check the length of data samples (each row in dataframe)
#length is calculated from tokenized form of text data (samples)
def check_token_length(df, print_lengths=True):
    tokenizer = AutoTokenizer.from_pretrained(cfg.model_name)
    x = df["text"].values
    
    # Encode each text sample into a list of token ids
    encoded = [tokenizer.encode(sent, add_special_tokens=True) for sent in x]

    # Find the maximum, minimum, mean and median length
    t_lengths=[len(sent) for sent in encoded]
    max_len = max(t_lengths)
    mean_len = np.mean(t_lengths)
    median_len = np.median(t_lengths)
    min_len = min(t_lengths)
    
    if print_lengths:
        print('Min length: ', min_len)
        print('Mean length: ', mean_len)
        print('Median length: ', median_len)
        print('Max length: ', max_len)
    
    return t_lengths

t_lengths = check_token_length(df)

Token indices sequence length is longer than the specified maximum sequence length for this model (1163 > 512). Running this sequence through the model will result in indexing errors


Min length:  3
Mean length:  18.891746163417668
Median length:  15.0
Max length:  2338


Note that you can specify the number of tokens taken into account in model training by adjusting the maximum sequence length (max_seq_length in the pretraining notebook, and max_len in the finetuning notebook).
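
When choosing a maximum sequence length, it can help to see what share of the samples fits within a candidate limit without truncation. A minimal sketch, assuming a list of token lengths such as `t_lengths` above (the numbers below are made up, and `coverage` is a hypothetical helper, not part of these notebooks):

```python
import numpy as np

def coverage(lengths, max_len):
    """Return the fraction of samples whose token count is at most max_len."""
    lengths = np.asarray(lengths)
    return float(np.mean(lengths <= max_len))

# Made-up token lengths standing in for t_lengths.
example_lengths = [3, 7, 11, 15, 18, 25, 33, 64, 200, 1163]

for candidate in (32, 64, 128, 512):
    share = coverage(example_lengths, candidate)
    print(f"max length {candidate}: {share:.0%} of samples fit without truncation")
```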

Let's take a closer look at the data and add the token length of each row to the dataframe.

In [13]:
df['length']=t_lengths
df

Unnamed: 0,label,text,length
0,1,- Tervetuloa skotlantiin...,11
1,0,"...... No, oikein sopiva sattumaha se vaan oli...",21
2,0,40.,4
3,1,Kyseessä voi olla loppuelämäsi nainen.,9
4,1,Sinne vaan ocean clubiin iskemään!,12
...,...,...,...
26516,0,sais nauraa.,5
26517,-1,"Menkää töihin, Jonnekin muuannekin kun niihin ...",20
26518,0,Ja tiedätkö että minä joka olen Jumalan armost...,32
26519,-1,Ei näiltä happopäiltä jotka itseään kutsuvat s...,23


In [14]:
df_sorted = df.sort_values(by="length")
df_sorted = df_sorted.reset_index(drop=True)
df_sorted

Unnamed: 0,label,text,length
0,0,alus,3
1,0,Hei,3
2,0,eli,3
3,0,A,3
4,0,e,3
...,...,...,...
26516,0,".............................,.-”................",952
26517,0,Running processes: C:\\WINDOWS\\System32\\smss...,1049
26518,0,Running processes: C:\\WINDOWS\\System32\\smss...,1163
26519,0,O4 - HKLM\\..\\Run: [Symantec NetDriver Monito...,1663


Let's check the deciles (10-quantiles) to get a better view of the sentence length distribution.

In [15]:
for i in np.linspace(0.1, 1.0, num=10):
    print('{0:.1g}: {1}'.format(i,np.quantile(df_sorted.length.values, i)))
    #print(f'{i:.2}: {np.quantile(df_sorted.length.values, i)}')

0.1: 7.0
0.2: 9.0
0.3: 11.0
0.4: 13.0
0.5: 15.0
0.6: 18.0
0.7: 21.0
0.8: 25.0
0.9: 33.0
1: 2338.0


In [16]:
tokenizer = AutoTokenizer.from_pretrained(cfg.model_name)
some_samples = df[:10].text.values
print(len(some_samples))
some_samples

10


array(['- Tervetuloa skotlantiin...',
       '...... No, oikein sopiva sattumaha se vaan oli, vai mitä?', '40.',
       'Kyseessä voi olla loppuelämäsi nainen.',
       'Sinne vaan ocean clubiin iskemään!',
       'Itsekin pidän Keskustan kampanjointia ihan hyvänä.',
       'Kamppi, Kontula, Kluuvi', 'huomista päivää odotellen..',
       'en haluaisi että kissani vuotaa.. =)',
       'Nyt olisi lääkitys paikallaan.'], dtype=object)

Tokenizing the array shows how the input for BERT models is constructed. The input embeddings consist of 1) the token embeddings, 2) the segment embeddings, and 3) the position embeddings (the picture is Figure 2 from [Devlin et al. (2019)](https://arxiv.org/pdf/1810.04805.pdf)). Note that the following special tokens can be added: [CLS] indicating the start of the input, and [SEP] the end of a sentence (token ids 102 and 103, respectively, in the output below).


![](devlinetal_figure2_bertinputembeddings.png)

In [17]:
for i in some_samples:
    print(f'{i}: {tokenizer.encode(i, add_special_tokens=True)}')

- Tervetuloa skotlantiin...: [102, 166, 14414, 27594, 44856, 32612, 107, 111, 111, 111, 103]
...... No, oikein sopiva sattumaha se vaan oli, vai mitä?: [102, 111, 111, 111, 111, 111, 111, 1174, 119, 1374, 5727, 42146, 394, 199, 559, 280, 119, 1317, 382, 305, 103]
40.: [102, 2502, 111, 103]
Kyseessä voi olla loppuelämäsi nainen.: [102, 6574, 326, 439, 45797, 197, 2125, 111, 103]
Sinne vaan ocean clubiin iskemään!: [102, 15965, 559, 114, 40198, 50009, 1211, 19890, 107, 39036, 380, 103]
Itsekin pidän Keskustan kampanjointia ihan hyvänä.: [102, 15218, 6623, 9921, 24114, 5839, 572, 10619, 111, 103]
Kamppi, Kontula, Kluuvi: [102, 29433, 443, 119, 2716, 22757, 119, 10933, 154, 313, 103]
huomista päivää odotellen..: [102, 4531, 734, 2859, 1792, 3700, 111, 111, 103]
en haluaisi että kissani vuotaa.. =): [102, 228, 5010, 206, 9856, 50007, 23506, 111, 111, 2199, 308, 103]
Nyt olisi lääkitys paikallaan.: [102, 1130, 527, 22504, 9278, 111, 103]
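
The sum of the three embeddings described above can be sketched with NumPy. This is a toy illustration with random tables and a tiny hidden size, not the actual weights or code of the Finnish BERT model. The table sizes are made-up assumptions, and the real model also applies layer normalization and dropout after the sum:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up sizes: the vocabulary only needs to exceed the largest token id used below.
vocab_size, max_positions, num_segments, hidden = 60_000, 512, 2, 8

token_table = rng.normal(size=(vocab_size, hidden))
segment_table = rng.normal(size=(num_segments, hidden))
position_table = rng.normal(size=(max_positions, hidden))

# Token ids of "40." as printed above: 102 ([CLS]), two tokens, 103 ([SEP]).
input_ids = np.array([102, 2502, 111, 103])
segment_ids = np.zeros_like(input_ids)    # a single sentence: every token is in segment 0
position_ids = np.arange(len(input_ids))  # 0, 1, 2, 3

# Each input position is the element-wise sum of its three embedding vectors.
input_embeddings = (token_table[input_ids]
                    + segment_table[segment_ids]
                    + position_table[position_ids])
print(input_embeddings.shape)
```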


Let's change the label values from [-1, 0, 1] to [2, 0, 1] for convenience, so that all class labels are non-negative integers.

In [18]:
df.label.unique()

array(['1', '0', '-1'], dtype=object)

The column type may vary depending on how the data was loaded: read_csv infers integers, whereas csv.reader returns every field as a string. Let's check the type, and change the -1 to 2 after that.

In [19]:
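if df.label.dtype=='int64':
    df.label = df.label.replace(-1,2)
elif df.label.dtype=='str' or df.label.dtype=='O':
    #string labels: replace the literal text '-1' (not a regular expression), then cast to integers
    df.label = df.label.str.replace('-1','2',regex=False)
    df = df.astype({'label': 'int'})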
df

Unnamed: 0,label,text,length
0,1,- Tervetuloa skotlantiin...,11
1,0,"...... No, oikein sopiva sattumaha se vaan oli...",21
2,0,40.,4
3,1,Kyseessä voi olla loppuelämäsi nainen.,9
4,1,Sinne vaan ocean clubiin iskemään!,12
...,...,...,...
26516,0,sais nauraa.,5
26517,2,"Menkää töihin, Jonnekin muuannekin kun niihin ...",20
26518,0,Ja tiedätkö että minä joka olen Jumalan armost...,32
26519,2,Ei näiltä happopäiltä jotka itseään kutsuvat s...,23
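
An alternative that keeps the mapping explicit is a dictionary, which can later be inverted to turn predicted class ids back into the original polarity values. A sketch on a toy column (`label2id` and `id2label` are illustrative names chosen here, not names used elsewhere in these notebooks):

```python
import pandas as pd

toy = pd.DataFrame({"label": ["1", "0", "-1", "0"]})

# Original polarity value -> class id used for training.
label2id = {-1: 2, 0: 0, 1: 1}
# Inverse mapping: class id -> original polarity value.
id2label = {class_id: polarity for polarity, class_id in label2id.items()}

# Cast to int first so that string and integer labels are handled the same way.
toy["label"] = toy["label"].astype(int).map(label2id)
print(toy["label"].tolist())
print(id2label[2])
```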


Finally, let's see how many samples we have for each class (0, 1 and 2, i.e. neutral, positive and negative).

In [20]:
#check how many samples there are for each class
def check_class_distribution(df, print_lengths=True):
    all_class_dist=[]
    for i in sorted(df.label.unique()):
        class_dist = len(df[df.label==i])
        if print_lengths:
            print('For label {0} there is {1} data samples'.format(i, class_dist))
        all_class_dist.append(class_dist)
    return all_class_dist

_ = check_class_distribution(df)

For label 0 there is 19505 data samples
For label 1 there is 2928 data samples
For label 2 there is 4088 data samples
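
The same counts, together with the relative share of each class, can also be obtained with pandas' value_counts. A short sketch on invented labels:

```python
import pandas as pd

# Invented labels standing in for df.label.
toy_labels = pd.Series([0, 0, 0, 1, 2, 0, 2, 0], name="label")

# Absolute counts and relative shares per class, ordered by class id.
counts = toy_labels.value_counts().sort_index()
shares = toy_labels.value_counts(normalize=True).sort_index()

summary = pd.DataFrame({"count": counts, "share": shares})
print(summary)
```

The shares make class imbalance easy to spot, which is worth keeping in mind when splitting the data and when interpreting accuracy later on.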


## 3. Creating dataset for pretraining (MLM)

Let's pick only the text column, since labels are unnecessary for pretraining (MLM). Then let's create folds to divide the data into two subsets: a validation set and a training set.

In [11]:
df_pretrain = df[['text']]
df_pretrain = create_kfolds(df_pretrain, num_splits=5, random_seed=2023)
df_pretrain

Unnamed: 0,text,kfold
0,- Tervetuloa skotlantiin...,3
1,"...... No, oikein sopiva sattumaha se vaan oli...",4
2,40.,3
3,Kyseessä voi olla loppuelämäsi nainen.,4
4,Sinne vaan ocean clubiin iskemään!,2
...,...,...
26516,sais nauraa.,0
26517,"Menkää töihin, Jonnekin muuannekin kun niihin ...",1
26518,Ja tiedätkö että minä joka olen Jumalan armost...,2
26519,Ei näiltä happopäiltä jotka itseään kutsuvat s...,4


Let's pick one fold for the validation set and use the other folds for the training set.

In [12]:
one_fold = 3
mlm_val = df_pretrain[df_pretrain.kfold==one_fold]
mlm_train = df_pretrain[df_pretrain.kfold!=one_fold]
mlm_val = mlm_val[['text']]
mlm_train = mlm_train[['text']]
mlm_val = mlm_val.reset_index(drop=True)
mlm_train = mlm_train.reset_index(drop=True)

In [13]:
print("Dataset lengths")
print(f"validation set: {len(mlm_val)}")
print(f"training set: {len(mlm_train)}")

Dataset lengths
validation set: 5304
training set: 21217
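
With five folds, each fold holds roughly one fifth of the rows, which matches the sizes above (5304 of 26521 rows in the validation set). A toy sketch of the same idea, using scikit-learn's KFold directly on 11 made-up rows:

```python
import pandas as pd
from sklearn.model_selection import KFold

# 11 made-up rows, so that the folds cannot all have the same size.
toy = pd.DataFrame({"text": [f"sentence {i}" for i in range(11)]})
toy["kfold"] = -1

kf = KFold(n_splits=5, shuffle=True, random_state=2023)
for fold, (_, val_idx) in enumerate(kf.split(toy)):
    # val_idx holds row positions; they equal the labels of the default 0..n-1 index.
    toy.loc[val_idx, "kfold"] = fold

# Every row is assigned to exactly one fold; sizes differ by at most one row.
fold_sizes = toy["kfold"].value_counts().sort_index()
print(fold_sizes.tolist())
```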


In [14]:
mlm_val

Unnamed: 0,text
0,- Tervetuloa skotlantiin...
1,40.
2,"""Koska naiset yleensä arvostavat miehessä enit..."
3,Työstä kuuluu maksaa palkkaa vai meinaatko ett...
4,"Tämähän ei maksa kaupungille mitään, mutta ope..."
...,...
5299,"Haluan myös kiittää lääkeyhtiöitä kuten Wyeth,..."
5300,Onhan teillä oma Temppelinnekin jo….
5301,Siinä mielessä seksuaalisuutta ei tarvita lisä...
5302,"Keskiansio, mitä se nyt uutisissa olikaan, mel..."


In [15]:
mlm_train

Unnamed: 0,text
0,"...... No, oikein sopiva sattumaha se vaan oli..."
1,Kyseessä voi olla loppuelämäsi nainen.
2,Sinne vaan ocean clubiin iskemään!
3,Itsekin pidän Keskustan kampanjointia ihan hyv...
4,"Kamppi, Kontula, Kluuvi"
...,...
21212,sais nauraa.
21213,"Menkää töihin, Jonnekin muuannekin kun niihin ..."
21214,Ja tiedätkö että minä joka olen Jumalan armost...
21215,Ei näiltä happopäiltä jotka itseään kutsuvat s...


Finally, let's save the datasets.

In [16]:
mlm_train.to_csv(cfg.data_folder+'mlm_train.csv', index=False)
mlm_val.to_csv(cfg.data_folder+'mlm_valid.csv', index=False)

## 4. Creating dataset for finetuning

For finetuning, let's divide the data into two subsets: training and testing sets. Let's try sklearn's train_test_split function this time.

Consider also other data splitting methods, such as stratified k-fold, which [preserves the percentage of samples for each class](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html).
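
For a single train/test split, train_test_split also accepts a `stratify` argument that keeps the class proportions (approximately) equal in both subsets. A sketch on a made-up imbalanced dataframe, not on FinnSentiment itself:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data with a 70/10/20 class distribution.
toy = pd.DataFrame({
    "label": [0] * 70 + [1] * 10 + [2] * 20,
    "text": [f"sentence {i}" for i in range(100)],
})

# stratify=toy["label"] asks for the same class proportions in both subsets.
train_part, test_part = train_test_split(
    toy, test_size=0.2, random_state=2023, stratify=toy["label"]
)

print(train_part["label"].value_counts().sort_index().tolist())
print(test_part["label"].value_counts().sort_index().tolist())
```

Without stratification the proportions are only preserved on average, which matters most for the smallest class.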

In [17]:
training_data, testing_data = train_test_split(df, test_size=0.2, random_state=2023)
training_data = training_data.reset_index(drop=True)
testing_data = testing_data.reset_index(drop=True)

In [18]:
training_data

Unnamed: 0,label,text
0,2,– Ei...
1,0,"Kurkist kehtoon, kuinka kultaa Lapsi paljon tu..."
2,0,Entä jos miehellä ei enää seiso?
3,0,Eihän ala-ikäiset saa muutakaan tehdä ilman va...
4,0,"""Mies on naisen pää, koska Allah on toisia suo..."
...,...,...
21211,0,"En voi käyttä samaa nimeä, koska Jumanmi Keklr..."
21212,0,Metsästysharrastusta kait tarkoitettiin päättö...
21213,0,Monet näistä hurjannäköisistä roduista eivät o...
21214,0,Puolilämpimällä koneella kierrokset 2000:ssa h...


In [19]:
testing_data

Unnamed: 0,label,text
0,0,"Ensinnäkin, korvikkeen saa lämmittää mikrossa,..."
1,0,"En tiedä, miksi trollaat asialla, joka on help..."
2,0,Nyt todella tiedän mitä glamour elämä on...
3,0,Kissa oli saanut olla itse valitsemansa ajan e...
4,2,Mietippä rehellisesti tiedätkö oikeasti narsis...
...,...,...
5300,0,itse olen nelikymppinen ja aion sairaanhoitoal...
5301,0,"Niin siis ""kannattaako"" jonkin harvinaisemman ..."
5302,0,Mulle kävi niin että se vaan yhtäkkii laitto s...
5303,2,"""Kalle ei ole Turtolan kanssa keskustellutkaan..."


In [20]:
print("Testing data distribution")
_ = check_class_distribution(testing_data)
print("Training data distribution")
_ = check_class_distribution(training_data)

Testing data distribution
For label 0 there is 3956 data samples
For label 1 there is 560 data samples
For label 2 there is 789 data samples
Training data distribution
For label 0 there is 15549 data samples
For label 1 there is 2368 data samples
For label 2 there is 3299 data samples


Finally, let's save the datasets.

In [21]:
testing_data.to_csv(cfg.data_folder+'finetune_testset.csv', index=False)
training_data.to_csv(cfg.data_folder+'finetune_trainset.csv', index=False)