Credits to: https://www.kaggle.com/cheongwoongkang/distilbert-qa-starter-cross-validation

## Problem Formulation
I formulate this task as an extractive question answering problem in the style of SQuAD.  
Given a question and a context, the model is trained to find the answer span in the context.

Therefore, I use sentiment as the question, text as the context, and selected_text as the answer.
- Question: sentiment
- Context: text
- Answer: selected_text
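
To make the mapping concrete, here is a minimal sketch of one tweet turned into a SQuAD-style entry. The row contents and the helper name `to_squad_entry` are made up for illustration; only the column names come from train.csv. For brevity it records the first occurrence of the answer only.

```python
# Hypothetical example row; only the column names come from train.csv.
row = {
    "textID": "example-id",
    "text": "I love this song so much",
    "selected_text": "love this song",
    "sentiment": "positive",
}

def to_squad_entry(row):
    """Map one tweet to a SQuAD-style entry (question = sentiment)."""
    context = row["text"]
    answer = row["selected_text"]
    return {
        "title": "None",
        "paragraphs": [{
            "context": context,
            "qas": [{
                "question": row["sentiment"],
                "id": row["textID"],
                "is_impossible": False,
                # Character offset of the answer inside the context.
                "answers": [{"answer_start": context.find(answer), "text": answer}],
            }],
        }],
    }

entry = to_squad_entry(row)
qa = entry["paragraphs"][0]["qas"][0]
print(qa["question"], qa["answers"])
```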

## Hyperparameters & Options 

In [1]:
# Hyperparameters
batch_size = 64 # batch size
lr = 5e-5 # learning rate
epochs = 2 # number of epochs
max_seq_len = 128 # max sequence length
doc_stride = 64 # document stride

# Options
cross_validation = True # whether to use cross-validation
K = 5 # number of CV splits
post_processing = True # whether to use post-processing
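
To see how max_seq_len and doc_stride interact, the sketch below splits a long context into overlapping windows. It is simplified from the sliding-window logic of the original BERT SQuAD preprocessing; the exact behavior of run_squad.py in transformers 2.5.1 may differ, and the helper name `doc_windows` is hypothetical.

```python
def doc_windows(num_doc_tokens, num_query_tokens, max_seq_len=128, doc_stride=64):
    """Return (start, length) windows over the context tokens.

    Assumption: 3 positions are reserved for special tokens
    ([CLS], [SEP], [SEP]), as in the original BERT preprocessing.
    """
    max_doc_tokens = max_seq_len - num_query_tokens - 3
    if max_doc_tokens <= 0:
        raise ValueError("max_seq_len is too small for the question")
    windows = []
    start = 0
    while start < num_doc_tokens:
        length = min(num_doc_tokens - start, max_doc_tokens)
        windows.append((start, length))
        if start + length == num_doc_tokens:
            break  # the last window reaches the end of the context
        start += min(length, doc_stride)
    return windows

short = doc_windows(num_doc_tokens=30, num_query_tokens=1)
long = doc_windows(num_doc_tokens=300, num_query_tokens=1)
print(short)  # a single window covers the whole context
print(long)   # several overlapping windows
```

A context shorter than the window fits in one piece, so doc_stride only matters for inputs longer than max_seq_len allows.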

## Import Packages

In [2]:
import numpy as np
import pandas as pd
import json
import os

## Data Preprocessing
### Load Data

In [3]:
pd_train = pd.read_csv('/kaggle/input/tweet-sentiment-extraction/train.csv')
pd_test = pd.read_csv('/kaggle/input/tweet-sentiment-extraction/test.csv')

In [4]:
np_train = np.array(pd_train)
np_test = np.array(pd_test)

### K-fold Split
Split the data into K folds for cross-validation. A fixed random seed is used for reproducibility.

In [5]:
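# Given a data size, return the train/valid indices for K splits.
# Note: if num_examples is not divisible by K, the last num_examples % K
# shuffled examples never appear in a validation fold.
# The sentiment counts of each validation fold are printed from the global np_train.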
def split_data(num_examples, K):
    np.random.seed(0)
    idx = np.arange(num_examples)
    np.random.shuffle(idx)
    
    boundary = num_examples // K
    splits = [{} for _ in range(K)]
    for i in range(K):
        splits[i]['valid_idx'] = idx[i*boundary:(i+1)*boundary]
        splits[i]['train_idx'] = np.concatenate((idx[:i*boundary], idx[(i+1)*boundary:]))

        valid = np_train[splits[i]['valid_idx']]
        d = {'neutral':0, 'positive':0, 'negative':0}
        for line in valid:
            d[line[-1]] += 1
        print(d)
        
    return splits

In [6]:
splits = split_data(len(np_train), K)

{'neutral': 2272, 'positive': 1688, 'negative': 1537}
{'neutral': 2243, 'positive': 1687, 'negative': 1567}
{'neutral': 2192, 'positive': 1733, 'negative': 1572}
{'neutral': 2210, 'positive': 1748, 'negative': 1539}
{'neutral': 2200, 'positive': 1726, 'negative': 1571}
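
As an alternative sketch (not the split used in this notebook), `np.array_split` distributes the remainder across folds, so every example lands in exactly one validation fold. The function name `kfold_indices` is hypothetical, and K is assumed to be at least 2.

```python
import numpy as np

def kfold_indices(num_examples, k, seed=0):
    """Return k dicts with disjoint 'valid_idx' arrays that cover all examples."""
    rng = np.random.RandomState(seed)
    idx = rng.permutation(num_examples)
    folds = np.array_split(idx, k)  # fold sizes differ by at most one
    splits = []
    for i in range(k):
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        splits.append({"train_idx": train_idx, "valid_idx": folds[i]})
    return splits

splits_demo = kfold_indices(11, 3)
sizes = [len(s["valid_idx"]) for s in splits_demo]
print(sizes)
```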


### Convert Data to SQuAD-style
In this part, I convert the data into the SQuAD format.  
I think most of the errors in the dataset are irreducible, so I do not apply additional preprocessing to handle them. Rows whose text is missing (NaN) are printed and skipped.

In [7]:
# Convert data to SQuAD-style
def convert_data(data, directory, filename):
    def find_all(input_str, search_str):
        l1 = []
        length = len(input_str)
        index = 0
        while index < length:
            i = input_str.find(search_str, index)
            if i == -1:
                return l1
            l1.append(i)
            index = i + 1
        return l1
    
    output = {}
    output['version'] = 'v1.0'
    output['data'] = []
    
    for line in data:
        paragraphs = []
        context = line[1]
        qas = []
        question = line[-1]
        qid = line[0]
        answers = []
        answer = line[2]
        # Skip rows whose text is missing (pandas loads it as a float NaN).
        if not isinstance(context, str):
            print(context, type(context))
            continue
        answer_starts = find_all(context, answer)
        for answer_start in answer_starts:
            answers.append({'answer_start': answer_start, 'text': answer})
        qas.append({'question': question, 'id': qid, 'is_impossible': False, 'answers': answers})

        paragraphs.append({'context': context, 'qas': qas})
        output['data'].append({'title': 'None', 'paragraphs': paragraphs})

    if not os.path.exists(directory):
        os.makedirs(directory)

    with open(os.path.join(directory, filename), 'w') as outfile:
        json.dump(output, outfile)
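
The sketch below isolates the `find_all` helper to show what ends up in the `answers` list when the selected text occurs more than once in a tweet. The example strings are made up.

```python
def find_all(input_str, search_str):
    """Return the start index of every (possibly overlapping) match."""
    starts = []
    index = 0
    while index < len(input_str):
        i = input_str.find(search_str, index)
        if i == -1:
            break
        starts.append(i)
        index = i + 1  # step by one so overlapping matches are found too
    return starts

context = "so sad, so very sad"  # hypothetical tweet
answer = "sad"
answers = [{"answer_start": s, "text": answer} for s in find_all(context, answer)]
print(answers)
```

If the answer never occurs in the context, the list is empty.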

In [8]:
# convert k-fold train data
for i, split in enumerate(splits):
    data = np_train[split['train_idx']]
    directory = 'split_' + str(i+1)
    filename = 'train.json'
    convert_data(data, directory, filename)

nan <class 'float'>
nan <class 'float'>
nan <class 'float'>
nan <class 'float'>


In [9]:
# convert original train/test data
data = np_train
directory = 'original'
filename = 'train.json'
convert_data(data, directory, filename)

data = np_test
filename = 'test.json'
convert_data(data, directory, filename)

nan <class 'float'>


## Finetuning
Install the transformers package (v2.5.1, formerly named pytorch-transformers) from [huggingface](https://github.com/huggingface/transformers).

In [10]:
!cd /kaggle/input/pytorchtransformers/transformers-2.5.1; pip install .

Processing /kaggle/input/pytorchtransformers/transformers-2.5.1
Building wheels for collected packages: transformers
  Building wheel for transformers (setup.py) ... done
  Created wheel for transformers: filename=transformers-2.5.1-py3-none-any.whl size=498878 sha256=a20b51141194483b94d6abdf91524847da250393ff1736662816c06908a08024
  Stored in directory: /root/.cache/pip/wheels/b0/58/32/c9de6e928489e1884c4a65f0aaf8a6ebe5484617eaadbea3b6
Successfully built transformers
Installing collected packages: transformers
  Attempting uninstall: transformers
    Found existing installation: transformers 2.7.0
    Uninstalling transformers-2.7.0:
      Successfully uninstalled transformers-2.7.0
Successfully installed transformers-2.5.1


### Cross-Validation
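Finetune one QA model per split. Each model is trained on its split's train.json and predicts on the whole original training set, so that both train and valid scores can be computed afterwards.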

In [11]:
def run_script(train_file, predict_file, batch_size=16, lr=5e-5, epochs=2, max_seq_len=128, doc_stride=64):
    !python /kaggle/input/pytorchtransformers/transformers-2.5.1/examples/run_squad.py \
    --model_type distilbert \
    --model_name_or_path distilbert-base-uncased \
    --cache_dir /kaggle/input/cached-distilbert-base-uncased/cache \
    --do_lower_case \
    --do_train \
    --do_eval \
    --train_file=$train_file \
    --predict_file=$predict_file \
    --overwrite_cache \
    --learning_rate=$lr \
    --num_train_epochs=$epochs \
    --max_seq_length=$max_seq_len \
    --doc_stride=$doc_stride \
    --output_dir ./results \
    --overwrite_output_dir \
    --per_gpu_eval_batch_size=$batch_size \
    --per_gpu_train_batch_size=$batch_size \
    --save_steps=100000
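
The cell above relies on IPython's `!` syntax and `$variable` interpolation, so it only works inside a notebook. As a rough sketch for a plain Python script, the same flags can be assembled into an argument list for `subprocess.run`. The helper names `build_squad_command` and `run_squad` are hypothetical; the flags and paths are copied from the cell above.

```python
import subprocess

RUN_SQUAD = "/kaggle/input/pytorchtransformers/transformers-2.5.1/examples/run_squad.py"

def build_squad_command(train_file, predict_file, batch_size=16, lr=5e-5,
                        epochs=2, max_seq_len=128, doc_stride=64):
    """Build the run_squad.py command line as a list of arguments."""
    return [
        "python", RUN_SQUAD,
        "--model_type", "distilbert",
        "--model_name_or_path", "distilbert-base-uncased",
        "--cache_dir", "/kaggle/input/cached-distilbert-base-uncased/cache",
        "--do_lower_case", "--do_train", "--do_eval",
        f"--train_file={train_file}",
        f"--predict_file={predict_file}",
        "--overwrite_cache",
        f"--learning_rate={lr}",
        f"--num_train_epochs={epochs}",
        f"--max_seq_length={max_seq_len}",
        f"--doc_stride={doc_stride}",
        "--output_dir", "./results",
        "--overwrite_output_dir",
        f"--per_gpu_eval_batch_size={batch_size}",
        f"--per_gpu_train_batch_size={batch_size}",
        "--save_steps=100000",
    ]

def run_squad(*args, **kwargs):
    """Run the script and raise if it exits with an error (not called here)."""
    subprocess.run(build_squad_command(*args, **kwargs), check=True)

cmd = build_squad_command("split_1/train.json", "original/train.json", batch_size=64)
print(" ".join(cmd))
```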

In [12]:
!mkdir results

In [13]:
if cross_validation:
    for i in range(1, K+1):
        train_file = "split_" + str(i) + "/train.json"
        predict_file = "original/train.json"
        run_script(train_file, predict_file, batch_size, lr, epochs, max_seq_len, doc_stride)
        !mv "results/predictions_.json" "results/predictions_"$i".json"

2020-04-05 08:29:31.117090: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
100%|██████████████████████████████████| 21989/21989 [00:01<00:00, 12905.34it/s]
convert squad examples to features: 100%|█| 21989/21989 [00:44<00:00, 488.66it/s
add example index and unique id: 100%|█| 21989/21989 [00:00<00:00, 525995.35it/s
Epoch:   0%|                                              | 0/2 [00:00<?, ?it/s]
Iteration:   0%|                                        | 0/344 [00:00<?, ?it/s]
Iteration:   0%|                                | 1/344 [00:01<06:38,  1.16s/it]
Iteration:   1%|▏                               | 2/344 [00:01<05:15,  1.09it/s]
Iteration:   1%|▎                               | 3/344 [00:01<04:17,  1.32it/s]
Iteration:   1%|▎                               | 4/344 [00:02<03:37,  1.56it/s]
Iteration:   1%|▍                               | 5/344 [00:02<03:08,  1.79it/s]
Iteration:   2%

### Evaluation
Calculate train/valid scores with the word-level Jaccard similarity.

In [14]:
def jaccard(str1, str2): 
    a = set(str1.lower().split()) 
    b = set(str2.lower().split())
    c = a.intersection(b)
    return float(len(c)) / (len(a) + len(b) - len(c))
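
As a worked example, the score is the size of the intersection of the two lowercased word sets divided by the size of their union. The function is repeated here so the snippet runs on its own; the example strings are made up.

```python
def jaccard(str1, str2):
    a = set(str1.lower().split())
    b = set(str2.lower().split())
    c = a.intersection(b)
    return float(len(c)) / (len(a) + len(b) - len(c))

# 2 shared words, 4 words in the union
print(jaccard("I love this song", "love this"))  # 0.5
# Case and word order do not matter
print(jaccard("Love THIS", "this love"))         # 1.0
# No shared words
print(jaccard("so sad", "happy"))                # 0.0
```

Because only the word sets are compared, lowercasing, reordering, or deduplicating the words of a prediction leaves its score unchanged. Note that the function raises ZeroDivisionError when both strings contain no words.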

In [15]:
def evaluate(splits, np_train, post_processing=False):
    K = len(splits)
    predictions = [json.load(open('results/predictions_' + str(i+1) + '.json', 'r')) for i in range(K)]

    train_score = [{'neutral':[], 'positive':[], 'negative':[], 'total':[]} for _ in range(K+1)]
    valid_score = [{'neutral':[], 'positive':[], 'negative':[], 'total':[]} for _ in range(K+1)]

    for train_idx, line in enumerate(np_train):
        text_id = line[0]
        text = line[1]
        answer = line[2]
        sentiment = line[-1]

        if type(text) != str:
            continue

        for i, prediction in enumerate(predictions):
            if text_id not in prediction:
                print('key error:', text_id)
                continue
            else:
                if post_processing and (sentiment == 'neutral' or len(text.split()) < 4): # post-processing
                    score = jaccard(answer, text)
                else:
                    score = jaccard(answer, prediction[text_id])

                if train_idx in splits[i]['valid_idx']:
                    valid_score[i][sentiment].append(score)
                    valid_score[i]['total'].append(score)
                    valid_score[K][sentiment].append(score)
                    valid_score[K]['total'].append(score)

                else:
                    train_score[i][sentiment].append(score)
                    train_score[i]['total'].append(score)
                    train_score[K][sentiment].append(score)
                    train_score[K]['total'].append(score)

    for i, score_dict in enumerate([train_score, valid_score]):
        if i == 0:
            print('train score \n')
        else:
            print('valid score \n')
        for j in range(K+1):
            for sentiment in ['neutral', 'positive', 'negative', 'total']:
                score = np.array(score_dict[j][sentiment])
                if j < K:
                    print('split', j+1)
                else:
                    print('all data')
                print(sentiment + ' - ' + str(len(score)) + ' examples, average score: ' + str(score.mean()))
            print()

In [16]:
if cross_validation:
    evaluate(splits, np_train, post_processing)

train score 

split 1
neutral - 8846 examples, average score: 0.9695459627071343
split 1
positive - 6894 examples, average score: 0.5615043951474108
split 1
negative - 6249 examples, average score: 0.5794303550477854
split 1
total - 21989 examples, average score: 0.7307506105301365

split 2
neutral - 8874 examples, average score: 0.9700491933222442
split 2
positive - 6895 examples, average score: 0.5611708256409118
split 2
negative - 6219 examples, average score: 0.5782213161020319
split 2
total - 21988 examples, average score: 0.7310099940501281

split 3
neutral - 8925 examples, average score: 0.9689940947351157
split 3
positive - 6849 examples, average score: 0.5647490910059514
split 3
negative - 6214 examples, average score: 0.5805615281327627
split 3
total - 21988 examples, average score: 0.7333021718950179

split 4
neutral - 8907 examples, average score: 0.9700262516409257
split 4
positive - 6834 examples, average score: 0.5597554039874609
split 4
negative - 6247 examples, average

### Test
Finetune a model on the full training set and predict on the test set.

In [17]:
train_file = "original/train.json"
predict_file = "original/test.json"
run_script(train_file, predict_file, batch_size, lr, epochs, max_seq_len, doc_stride)
!mv results/predictions_.json results/test_predictions.json

2020-04-05 09:26:28.303148: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
100%|██████████████████████████████████| 27485/27485 [00:02<00:00, 12513.27it/s]
convert squad examples to features: 100%|█| 27485/27485 [00:56<00:00, 486.41it/s
add example index and unique id: 100%|█| 27485/27485 [00:00<00:00, 561245.78it/s
Epoch:   0%|                                              | 0/2 [00:00<?, ?it/s]
Iteration:   0%|                                        | 0/430 [00:00<?, ?it/s]
Iteration:   0%|                                | 1/430 [00:00<04:25,  1.61it/s]
Iteration:   0%|▏                               | 2/430 [00:00<03:53,  1.83it/s]
Iteration:   1%|▏                               | 3/430 [00:01<03:29,  2.03it/s]
Iteration:   1%|▎                               | 4/430 [00:01<03:13,  2.20it/s]
Iteration:   1%|▎                               | 5/430 [00:02<03:02,  2.34it/s]
Iteration:   1%

## Submission

In [18]:
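def f(selected):
    # Lowercase and deduplicate the words. The Jaccard metric compares
    # lowercased word sets, so the score is unaffected, but the original
    # word order is lost.
    return " ".join(set(selected.lower().split()))

# Copy predictions to the submission file.
with open('results/test_predictions.json', 'r') as infile:
    predictions = json.load(infile)
submission = pd.read_csv('/kaggle/input/tweet-sentiment-extraction/sample_submission.csv')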
for i in range(len(submission)):
    id_ = submission['textID'][i]
    if post_processing and (pd_test['sentiment'][i] == 'neutral' or len(pd_test['text'][i].split()) < 4): # post-processing
        submission.loc[i, 'selected_text'] = f(pd_test['text'][i])
    else:
        submission.loc[i, 'selected_text'] = f(predictions[id_])
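
The post-processing rule used in the loop above (and in `evaluate`) can be read as a small standalone function: for neutral tweets, or tweets with fewer than 4 words, the whole text is used instead of the model's span. This is only a sketch; the name `choose_selected_text` and the `min_words` parameter are hypothetical.

```python
def choose_selected_text(sentiment, text, predicted, min_words=4):
    """Return the whole tweet for neutral or very short tweets, else the model span."""
    if sentiment == "neutral" or len(text.split()) < min_words:
        return text
    return predicted

# Hypothetical examples
print(choose_selected_text("neutral", "heading to work now i guess", "work"))
print(choose_selected_text("positive", "so happy", "happy"))
print(choose_selected_text("positive", "i really love this new song", "love"))
```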

In [19]:
submission.head()

|   | textID | selected_text |
|---|---|---|
| 0 | 11aa4945ff | i wish |
| 1 | fd1db57dc0 | done.haha. i'm |
| 2 | 2524332d66 | concerned i'm |
| 3 | 0fb19285b2 | worry. hey guys working need to it's no |
| 4 | e6c9e5e3ab | 26th february |


In [20]:
# Save the submission file.
submission.to_csv('submission.csv', index=False)