<a href="https://colab.research.google.com/github/DianaMoyano1/NLP-Sentiment_Extraction_Challenge/blob/master/Tutorial_SingleM_Template.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# SECTION 1: Setup


### Mount Your Own Gdrive

The command below asks you to authorize your Google account and gives you a temporary access code to paste into the field.

In [2]:
# Mount your local Google drive and show the models you have
from google.colab import drive
drive.mount('/gdrive')
%ls '/gdrive/My Drive/Colab Notebooks/tweet-sentiment-extraction/models' 

Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3aietf%3awg%3aoauth%3a2.0%3aoob&response_type=code&scope=email%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdocs.test%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive.photos.readonly%20https%3a%2f%2fwww.googleapis.com%2fauth%2fpeopleapi.readonly

Enter your authorization code:
··········
Mounted at /gdrive


In [3]:
#Install the following packages. The --quiet flag reduces the amount of output
!pip install transformers==2.11.0 --quiet
!pip install tensorflow==2.2.0 --quiet
!pip install tensorboardX --quiet
!pip install simpletransformers --quiet



### Setup NVIDIA APEX

A tool that enables mixed-precision training in PyTorch (the framework underlying SimpleTransformers). More info here: https://github.com/NVIDIA/apex
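
Because the APEX build can fail or be skipped, it may help to derive the `fp16` training argument from whether the `apex` package is actually importable. The following is a minimal sketch, not part of the original notebook: `choose_fp16` is a hypothetical helper, and it assumes that the `fp16` option used later in `args_train` depends on APEX in the pinned library versions.

```python
import importlib.util

def choose_fp16(want_fp16=True, module_name="apex"):
    """Return True only if mixed precision is requested and the module is importable."""
    # find_spec returns None when a top-level module cannot be found.
    available = importlib.util.find_spec(module_name) is not None
    return bool(want_fp16 and available)

# Hypothetical usage: feed the result into the training arguments.
args_sketch = {"fp16": choose_fp16(want_fp16=True)}
print(args_sketch)
```

With this check, a missing APEX install falls back to full precision instead of failing when training starts.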

In [4]:
%%writefile setup.sh
git clone https://github.com/NVIDIA/apex
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./apex

Writing setup.sh


In [5]:
#TODO --> Find more info
#This will take 7-10 minutes to run
import timeit
start = timeit.default_timer()

!sh setup.sh --quiet

stop = timeit.default_timer()
print('Time: ', stop - start)  

Cloning into 'apex'...
remote: Enumerating objects: 19, done.
remote: Counting objects: 100% (19/19), done.
remote: Compressing objects: 100% (19/19), done.
remote: Total 7274 (delta 9), reused 6 (delta 0), pack-reused 7255
Receiving objects: 100% (7274/7274), 13.87 MiB | 24.66 MiB/s, done.
Resolving deltas: 100% (4909/4909), done.
  cmdoptions.check_install_build_global(options)
Created temporary directory: /tmp/pip-ephem-wheel-cache-263ezpas
Created temporary directory: /tmp/pip-req-tracker-18sco4pt
Created requirements tracker '/tmp/pip-req-tracker-18sco4pt'
Created temporary directory: /tmp/pip-install-s__idw5p
Processing ./apex
  Created temporary directory: /tmp/pip-req-build-u35pwtcl
  Added file:///content/apex to build tracker '/tmp/pip-req-tracker-18sco4pt'
    Running setup.py (path:/tmp/pip-req-build-u35pwtcl/setup.py) egg_info for package from file:///content/apex
    Running command python setup.py egg_info


    torch.__version__  = 1.5.0+cu101


    running 

### Import Packages

In [6]:
#Import packages
from os.path import join
import numpy as np 
import pandas as pd 
from apex import amp
import json


use_cuda = True ##If True, GPU will be used

### Load the Data



Before running the command below, make sure you have:
- Created a *'tweet-sentiment-extraction'* folder inside the *'Colab Notebooks'* directory
- Uploaded the *train.csv* and *test.csv* files to the *'tweet-sentiment-extraction'* folder 

Finally, make sure you have a folder called *'models'* inside the *'tweet-sentiment-extraction'* directory
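
The checklist above can be verified programmatically before loading the data. This is a minimal sketch under the folder layout described in this notebook; `missing_inputs` is a hypothetical helper name, not part of any library.

```python
import os

def missing_inputs(base_dir, files=("train.csv", "test.csv"), folders=("models",)):
    """Return the expected files and folders that do not exist under base_dir."""
    missing = [f for f in files if not os.path.isfile(os.path.join(base_dir, f))]
    missing += [d for d in folders if not os.path.isdir(os.path.join(base_dir, d))]
    return missing

# Path taken from the notebook; adjust it if your Drive layout differs.
BASE_DIR = "/gdrive/My Drive/Colab Notebooks/tweet-sentiment-extraction"
print("Missing:", missing_inputs(BASE_DIR))
```

An empty list means everything the notebook expects is in place.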

In [7]:
train_df = pd.read_csv('/gdrive/My Drive/Colab Notebooks/tweet-sentiment-extraction/train.csv')
test_df = pd.read_csv('/gdrive/My Drive/Colab Notebooks/tweet-sentiment-extraction/test.csv')

In [9]:
train_df.head()

Unnamed: 0,textID,text,selected_text,sentiment
0,cb774db0d1,"I`d have responded, if I were going","I`d have responded, if I were going",neutral
1,549e992a42,Sooo SAD I will miss you here in San Diego!!!,Sooo SAD,negative
2,088c60f138,my boss is bullying me...,bullying me,negative
3,9642c003ef,what interview! leave me alone,leave me alone,negative
4,358bd9e861,"Sons of ****, why couldn`t they put them on t...","Sons of ****,",negative


In [10]:
test_df.head()

Unnamed: 0,textID,text,sentiment
0,f87dea47db,Last session of the day http://twitpic.com/67ezh,neutral
1,96d74cb729,Shanghai is also really exciting (precisely -...,positive
2,eee518ae67,"Recession hit Veronique Branquinho, she has to...",negative
3,01082688c6,happy bday!,positive
4,33987a8ee5,http://twitpic.com/4w75p - I like it!!,positive


# Prepare the Data

Split into train and validation sets

In [11]:
from sklearn.model_selection import train_test_split

train_df, val_df = train_test_split(train_df, test_size=0.2, random_state = 42)

In [12]:
#drop selected_text column from the validation dataset (it will be added back once we are comparing it to our predictions)
val_df_new = val_df.drop('selected_text', axis=1)

In [13]:
print(train_df.shape)
print(val_df_new.shape)
print(test_df.shape)

(21984, 4)
(5497, 3)
(3534, 3)
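
The training and validation row counts above follow from the 80/20 split. As a rough arithmetic check, assuming `train_test_split` rounds the test share up and assigns the remaining rows to training when only a float `test_size` is given, the sizes can be reproduced as follows. `split_sizes` is a hypothetical helper.

```python
import math

def split_sizes(n_rows, test_size=0.2):
    """Return (n_train, n_test), rounding the test share up."""
    n_test = math.ceil(n_rows * test_size)
    return n_rows - n_test, n_test

# 21984 + 5497 rows were reported for the training and validation sets.
print(split_sizes(21984 + 5497))  # (21984, 5497)
```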


In [14]:
train = np.array(train_df)
val = np.array(val_df_new)
test = np.array(test_df)

### Initiate the SimpleTransformers Task



The SimpleTransformers library supports numerous tasks:  


- Sequence Classification
- Token Classification (NER)
- Question Answering
- Language Model Fine-Tuning
- Language Model Training
- Language Generation
- T5 Model
- Seq2Seq Tasks
- Multi-Modal Classification
- Conversational AI

In this case, we are performing a <ins>Question Answering</ins> task.

Supported model types:

- ALBERT
- BERT
- DistilBERT
- ELECTRA
- XLM
- XLNet

In [15]:
# Import the Question Answering model
from simpletransformers.question_answering import QuestionAnsweringModel

### Format the data under the SimpleTransformers *Question Answering* schema



To input the dataset, we need to map each column to a specific input:
- Context: The entire tweet
- Question: The sentiment (positive, negative or neutral). In other words, we are asking *"What part of the entire tweet best represents this sentiment?"*
- Answer: The label, i.e. the extracted text

The formatted data is assigned to the variables *qa_train*, *qa_val* and *qa_test* respectively.
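
To make the mapping concrete, here is a minimal sketch that builds a single record from one training row. The field names mirror those used by the functions in this notebook; `to_qa_record` itself is a hypothetical helper written only for illustration.

```python
def to_qa_record(text_id, text, selected_text, sentiment):
    """Build one SQuAD-style record: tweet as context, sentiment as question."""
    start = text.find(selected_text)  # -1 if the span is not found
    answers = []
    if start != -1:
        answers.append({"answer_start": start, "text": selected_text.lower()})
    qas = [{"question": sentiment, "id": text_id,
            "is_impossible": False, "answers": answers}]
    return {"context": text.lower(), "qas": qas}

# Second row of train_df.head() shown earlier.
record = to_qa_record("549e992a42",
                      "Sooo SAD I will miss you here in San Diego!!!",
                      "Sooo SAD", "negative")
print(record)
```

The answer position is located in the original text before lowercasing, so `answer_start` still points at the same characters in the lowercased context.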



In [16]:
#@title Create list for training

## Adapted from https://www.kaggle.com/cheongwoongkang/roberta-baseline-starter-simple-postprocessing
def find_all(input_str, search_str):
    l1 = []
    length = len(input_str)
    index = 0
    while index < length:
        i = input_str.find(search_str, index)
        if i == -1:
            return l1
        l1.append(i)
        index = i + 1
    return l1

def do_qa_train(train):
    # Each row is [textID, text, selected_text, sentiment]
    output = []
    for line in train:
        context = line[1]
        question = line[-1]
        qid = line[0]
        answer = line[2]
        # Skip (and print) rows with missing values, e.g. NaN text
        if type(answer) != str or type(context) != str or type(question) != str:
            print(context, type(context))
            print(answer, type(answer))
            print(question, type(question))
            continue
        # Keep only the first occurrence of the answer inside the tweet
        answers = []
        answer_starts = find_all(context, answer)
        if answer_starts:
            answers.append({'answer_start': answer_starts[0], 'text': answer.lower()})
        qas = [{'question': question, 'id': qid, 'is_impossible': False, 'answers': answers}]

        output.append({'context': context.lower(), 'qas': qas})

    return output

qa_train = do_qa_train(train)


nan <class 'float'>
nan <class 'float'>
neutral <class 'str'>


In [28]:
qa_train[1:3]


[{'context': ' you should.',
  'qas': [{'answers': [{'answer_start': 1, 'text': 'you should.'}],
    'id': '415660cb0e',
    'is_impossible': False,
    'question': 'neutral'}]},
 {'context': 'back at school again. almost weekend. oh wait, i gotta work from eight to four tonight',
  'qas': [{'answers': [{'answer_start': 0,
      'text': 'back at school again. almost weekend. oh wait, i gotta work from eight to four tonight'}],
    'id': '4fdc228bbe',
    'is_impossible': False,
    'question': 'neutral'}]}]

In [17]:
#@title Create val list
## Adapted from https://www.kaggle.com/cheongwoongkang/roberta-baseline-starter-simple-postprocessing
def do_qa_val(val):
    output = []
    for line in val:
        context = line[1]
        qas = []
        question = line[-1]
        qid = line[0]
        # Skip (and print) rows with missing values; validation rows carry no answer to print
        if type(context) != str or type(question) != str:
            print(context, type(context))
            print(question, type(question))
            continue
        answers = []
        answers.append({'answer_start': 1000000, 'text': '__None__'})
        qas.append({'question': question, 'id': qid, 'is_impossible': False, 'answers': answers})
        output.append({'context': context.lower(), 'qas': qas})
    return output

qa_val = do_qa_val(val)

In [18]:
#@title Create test list
## Adapted from https://www.kaggle.com/cheongwoongkang/roberta-baseline-starter-simple-postprocessing
def do_qa_test(test):
    output = []
    for line in test:
        context = line[1]
        qas = []
        question = line[-1]
        qid = line[0]
        # Skip (and print) rows with missing values; test rows carry no answer to print
        if type(context) != str or type(question) != str:
            print(context, type(context))
            print(question, type(question))
            continue
        answers = []
        answers.append({'answer_start': 1000000, 'text': '__None__'})
        qas.append({'question': question, 'id': qid, 'is_impossible': False, 'answers': answers})
        output.append({'context': context.lower(), 'qas': qas})
    return output

qa_test = do_qa_test(test)

### Load, Train and Evaluate a SimpleTransformers' Pre-Trained Model **<ins>OR</ins>** Load and Evaluate a Richardson's Pre-Trained Model  

Follow the section that best applies to your case.  

# OPTION 1: Load, train and evaluate a SimpleTransformers' Pre-trained Model

Create a folder that will contain the new model's PyTorch and hyperparameter files. Follow the instructions below to name the *'NAME_OF_MODEL'* folder:


>>**Basic Structure:**

>>\<Name>_\<Model>_\<Version>  

>>>Where:
- Name: Your name
- Model: Based on the model names used in the official Transformers site: https://huggingface.co/transformers/pretrained_models.html
- Version: For notebooks with same name and model but different hyperparameters, include the version (A, B, C...)
  
  >>>Examples:
  - Lucas_distilroberta-base_A
  - Lucas_distilroberta-base_B
  - Landis_bert_A  


Supported model types for Question Answering:

- ALBERT
- BERT
- DistilBERT
- ELECTRA
- XLM
- XLNet

Related link: https://huggingface.co/transformers/pretrained_models.html

In [31]:
# Change this BEFORE RUNNING *********************************************************************************************
YOUR_NAME = 'richardson'
YOUR_LETTER = 'A'     # identify your model A,B,C,D,E...
MODEL_ARCHITECTURE = 'distilbert'
MODEL_NAME = 'distilbert-base-uncased-distilled-squad'
# ************************************************************************************************************************

#Don't change below lines:
NAME_OF_MODEL = YOUR_NAME + '_' + MODEL_NAME + '_' + YOUR_LETTER 


ROOT = '/gdrive/My Drive/Colab Notebooks/tweet-sentiment-extraction/models' 
FULL_PATH = join(ROOT, NAME_OF_MODEL)

The command below creates the folder where all of the model's files will be stored.

In [32]:
#Change directory to "tweet-sentiment-extraction/models"
%cd '{ROOT}'
#Create the folder where the model components will be saved. If a folder with the same name already exists, mkdir reports an error
%mkdir '{NAME_OF_MODEL}' 
#Change the workspace to the recently created folder
%cd '{FULL_PATH}' 

/gdrive/My Drive/Colab Notebooks/tweet-sentiment-extraction/models
/gdrive/My Drive/Colab Notebooks/tweet-sentiment-extraction/models/richardson_distilbert-base-uncased-distilled-squad_A
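
If the folder may already exist, a plain-Python alternative to the shell magics avoids the `mkdir` error. This is a minimal sketch demonstrated in a temporary directory; `ensure_model_dir` is a hypothetical helper, and in the notebook the first argument would be `ROOT`.

```python
import os
import tempfile

def ensure_model_dir(root, name_of_model):
    """Create root/name_of_model if needed and return its full path."""
    full_path = os.path.join(root, name_of_model)
    os.makedirs(full_path, exist_ok=True)  # no error if it already exists
    return full_path

# Demonstrated in a temporary directory; in the notebook, root would be ROOT.
with tempfile.TemporaryDirectory() as tmp_root:
    name = "richardson_distilbert-base-uncased-distilled-squad_A"
    first = ensure_model_dir(tmp_root, name)
    second = ensure_model_dir(tmp_root, name)  # second call is a no-op
    print(first == second, os.path.isdir(first))
```

Reusing an existing folder means a later training run may overwrite its contents, since `overwrite_output_dir` is set to `True` in the training arguments.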


In [33]:
#For more arguments, refer to this link --> https://simpletransformers.ai/docs/usage/#configuring-a-simple-transformers-model

args_train={'reprocess_input_data': True,
'overwrite_output_dir': True,
'learning_rate': 5e-5,
'num_train_epochs': 1,
'max_seq_length': 192,
'doc_stride': 64,
'fp16': False,
}

#Create the model (this downloads the pre-trained weights)
model = QuestionAnsweringModel(MODEL_ARCHITECTURE, MODEL_NAME, args=args_train, use_cuda=use_cuda)


In [34]:
#Train the model
import timeit
start = timeit.default_timer()

model.train_model(qa_train)

stop = timeit.default_timer()
print('Time: ', stop - start)  

INFO:simpletransformers.question_answering.question_answering_model: Converting to features started.
convert squad examples to features: 100%|██████████| 21983/21983 [00:18<00:00, 1164.25it/s]
add example index and unique id: 100%|██████████| 21983/21983 [00:00<00:00, 785110.57it/s]


Running loss: 1.887723
Running loss: 0.412387
Running loss: 0.855325

INFO:simpletransformers.question_answering.question_answering_model: Training of distilbert model complete. Saved to outputs/.


Time:  322.10444504800034


In [35]:
#Predict the evaluation and test sets
predictions_val = model.predict(qa_val)
predictions_test = model.predict(qa_test)


INFO:simpletransformers.question_answering.question_answering_model: Converting to features started.
convert squad examples to features: 100%|██████████| 5497/5497 [00:04<00:00, 1223.29it/s]
add example index and unique id: 100%|██████████| 5497/5497 [00:00<00:00, 806805.79it/s]






INFO:simpletransformers.question_answering.question_answering_model: Converting to features started.
convert squad examples to features: 100%|██████████| 3534/3534 [00:02<00:00, 1182.21it/s]
add example index and unique id: 100%|██████████| 3534/3534 [00:00<00:00, 598653.89it/s]






Let's check the structure of the predictions

In [50]:
#Show up to 100 characters per column before truncating long texts
pd.set_option('display.max_colwidth',100)

#Each ID contains multiple predicted extractions and their corresponding probabilities (prediction with highest probability is first)
pd.DataFrame.from_dict(predictions_val)[1]

0    {'id': 'ef42dee96c', 'answer': ['resting had a whole day of walking', 'resting had a whole day',...
1    {'id': 'ef42dee96c', 'probability': [0.9983912693956408, 0.0005431288872461738, 0.00043454031598...
Name: 1, dtype: object

The commands below select the extracted text with the highest likelihood (the first item), as well as its corresponding probability.
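
The same selection can be sketched with plain Python on a mock object shaped like the output above: a pair of lists, one holding the candidate answers per ID and one holding their probabilities. `top_predictions` is a hypothetical helper, and the sketch assumes candidates are sorted with the most likely one first, as the output suggests.

```python
def top_predictions(predictions):
    """Return {id: (best_answer, best_probability)} from an (answers, probabilities) pair."""
    answers, probabilities = predictions
    best = {}
    for ans, prob in zip(answers, probabilities):
        # Candidates are assumed to be sorted with the most likely one first.
        best[ans["id"]] = (ans["answer"][0], prob["probability"][0])
    return best

# Mock input shaped like the output above; values are shortened for illustration.
mock = (
    [{"id": "ef42dee96c",
      "answer": ["resting had a whole day of walking", "resting had a whole day"]}],
    [{"id": "ef42dee96c", "probability": [0.998, 0.0005]}],
)
print(top_predictions(mock))
```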

In [38]:
#@title Obtain output with the highest prob - Validation set

#Validation Set highest probability output
predictions_df_val = pd.DataFrame.from_dict(predictions_val)
text_val = pd.DataFrame(predictions_val[0])
prob_val = pd.DataFrame(predictions_val[1])
prop1_val = prob_val['probability'].tolist()
prop2_val = pd.DataFrame(prop1_val)
text1_val = text_val['answer'].tolist()
text2_val = pd.DataFrame(text1_val)

In [51]:
#@title Obtain output with the highest prob - Test set
predictions_df_test = pd.DataFrame.from_dict(predictions_test)
text_test = pd.DataFrame(predictions_test[0])
prob_test = pd.DataFrame(predictions_test[1])
prop1_test = prob_test['probability'].tolist()
prop2_test = pd.DataFrame(prop1_test)
text1_test = text_test['answer'].tolist()
text2_test = pd.DataFrame(text1_test)

In [52]:
# Make a copy of the validation and test sets so that we are not modifying the original sets
sub_val_df = val_df.copy()
sub_test_df = test_df.copy()

In [53]:
#Add the predicted result to the copied data frames 
sub_val_df['predicted_selected_text'] = text2_val[0].values
sub_test_df['predicted_selected_text'] = text2_test[0].values

In [54]:
#Add the probability of the prediction
sub_val_df['prob'] = prop2_val[0].values
sub_test_df['prob'] = prop2_test[0].values
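
The assignments above match predictions to rows by position, which assumes every row produced exactly one prediction in the same order. Because the formatting functions skip rows whose text is not a string, a more defensive variant could match on `textID` instead. This is a minimal sketch with plain Python structures; `align_by_id` and the ID `0000000000` are made up for illustration.

```python
def align_by_id(text_ids, best_by_id, default=""):
    """Return predictions ordered like text_ids, using default for missing IDs."""
    return [best_by_id.get(text_id, default) for text_id in text_ids]

# Hypothetical example: the second ID has no prediction (e.g. its row was skipped).
ids = ["f87dea47db", "0000000000", "01082688c6"]
best = {"f87dea47db": "last session of the day", "01082688c6": "happy bday!"}
print(align_by_id(ids, best))
```

The resulting list always has the same length as the ID column, so it can be assigned to a data frame without a length mismatch.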

## Evaluate Validation Set with Jaccard Score

In [55]:
# Check head of dataset
sub_val_df.head()

Unnamed: 0,textID,text,selected_text,sentiment,predicted_selected_text,prob
1588,a7f72a928a,WOOOOOOOOOO are you coming to Nottingham at any point? lovelovelove<3,t? lovelovelove,positive,lovelovelove,0.323002
23879,ef42dee96c,resting had a whole day of walking,resting had a whole day of walking,neutral,resting had a whole day of walking,0.998391
6561,07d17131b1,"was in Palawan a couple of days ago, i`ll try to post pictures tom.","was in Palawan a couple of days ago, i`ll try to post pictures tom.",neutral,"was in palawan a couple of days ago, i`ll try to post pictures tom.",0.975718
2602,2820205db5,I know! I`m so slow its horrible. DON`T TELL ON ME!,horrible.,negative,horrible.,0.340405
4003,7d3ce4363c,"Glad I went out, glad I didn`t leave early, and glad to be afterpartying it up @ Beth`s I`m back!",glad,positive,glad,0.234629


In [56]:
#Make a copy of the original validation set and reset indexes
df_js=sub_val_df.copy()
df_js=df_js.reset_index()

In [57]:
#Define the Jaccard Score function
def jaccard(str1, str2): 
    a = set(str1.lower().split()) 
    b = set(str2.lower().split())
    c = a.intersection(b)
    return float(len(c)) / (len(a) + len(b) - len(c))

In [58]:
#Obtain JS for the entire set
results = []
for i in range(len(df_js)):
    score = jaccard(df_js['selected_text'].iloc[i], df_js['predicted_selected_text'].iloc[i])
    results.append(score)
    
Jaccard_score = sum(results) / len(results)
Jaccard_score

0.7043270849968862
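
As a worked example of the metric, the score is the number of shared lowercase tokens divided by the number of distinct tokens across both strings. The sketch below repeats the notebook's `jaccard` function so that it runs on its own and applies it to the first validation row shown above.

```python
def jaccard(str1, str2):
    a = set(str1.lower().split())
    b = set(str2.lower().split())
    c = a.intersection(b)
    return float(len(c)) / (len(a) + len(b) - len(c))

# First validation row: the label has two tokens, the prediction has one.
print(jaccard("t? lovelovelove", "lovelovelove"))  # 0.5
# Identical strings score 1.0 regardless of case.
print(jaccard("Happy bday!", "happy bday!"))  # 1.0
```

Note that the function divides by zero when both strings contain no tokens, so empty labels paired with empty predictions would need separate handling.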

## Prepare and Submit Test Set

In [None]:
# Check head of dataset
sub_test_df.head()

Unnamed: 0,textID,text,sentiment,predicted_selected_text,prob
0,f87dea47db,Last session of the day http://twitpic.com/67ezh,neutral,last session of the day,0.27224
1,96d74cb729,Shanghai is also really exciting (precisely -...,positive,exciting,0.114143
2,eee518ae67,"Recession hit Veronique Branquinho, she has to...",negative,shame!,0.29143
3,01082688c6,happy bday!,positive,happy bday!,0.469685
4,33987a8ee5,http://twitpic.com/4w75p - I like it!!,positive,i like it!!,0.215219


In [None]:
#Prepare file for submission
final_test=sub_test_df[['textID','predicted_selected_text']]
final_test.columns=['textID','selected_text']
final_test.head()

Unnamed: 0,textID,selected_text
0,f87dea47db,last session of the day
1,96d74cb729,exciting
2,eee518ae67,shame!
3,01082688c6,happy bday!
4,33987a8ee5,i like it!!


In [None]:
#Write the submission file
final_test[['textID','selected_text']].to_csv('submission.csv', index=False)
print("Submission successful")

Submission successful


## Save trained model arguments and other files

In [60]:
#This line creates a JSON file that is required to load the model in the future
with open('args_train.json', 'w') as fp: 
    json.dump(args_train, fp)
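
To confirm that the saved file can be read back unchanged, a round trip can be sketched as follows. The dictionary here is a shortened stand-in for `args_train`, and a temporary directory is used so that nothing is written to Drive.

```python
import json
import os
import tempfile

# Shortened stand-in for args_train, for illustration only.
args_sketch = {"learning_rate": 5e-5, "num_train_epochs": 1, "fp16": False}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "args_train.json")
    with open(path, "w") as fp:
        json.dump(args_sketch, fp)
    with open(path) as fp:
        restored = json.load(fp)

print(restored == args_sketch)
```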

In [None]:
#Additional files if required
"""from google.colab import files
sub_val_df.to_csv('sub_val.csv') 
files.download('sub_val.csv')
sub_test_df.to_csv('sub_test.csv') 
files.download('sub_test.csv')
train_df.to_csv("new_train_df")"""

#### TODO To assess via Jaccard Score, please refer to the last part of this notebook

# OPTION 2: Load and Evaluate a Richardson's Pre-Trained Model

#### DistilBERT --> A faster yet powerful version of BERT
https://arxiv.org/abs/1910.01108

#### SQuAD --> Stanford Question Answering Dataset
https://rajpurkar.github.io/SQuAD-explorer/

In [61]:

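ROOT= '/gdrive/My Drive/Colab Notebooks/tweet-sentiment-extraction/models'
FOLDER_NAME= 'richardson_distilbert-base-uncased-distilled-squad_A'
#Needed when Option 1 was skipped; must match the architecture of the saved model
MODEL_ARCHITECTURE = 'distilbert'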

FULL_PATH = join(ROOT, FOLDER_NAME)

#Change the workspace to the model folder
%cd '{FULL_PATH}' 

#Load the model's arguments list (required to setup the existing model) 
with open('args_train.json') as json_file: 
    train_args = json.load(json_file) 

/gdrive/My Drive/Colab Notebooks/tweet-sentiment-extraction/models/richardson_distilbert-base-uncased-distilled-squad_A


## Setup loaded model

In [62]:
#Load the model
loaded_model = QuestionAnsweringModel(MODEL_ARCHITECTURE, 'outputs/', args=train_args, use_cuda=use_cuda)

In [63]:
#Predict the evaluation and test sets
predictions_val = loaded_model.predict(qa_val)
predictions_test = loaded_model.predict(qa_test)

INFO:simpletransformers.question_answering.question_answering_model: Converting to features started.
convert squad examples to features: 100%|██████████| 5497/5497 [00:04<00:00, 1140.94it/s]
add example index and unique id: 100%|██████████| 5497/5497 [00:00<00:00, 821788.18it/s]






INFO:simpletransformers.question_answering.question_answering_model: Converting to features started.
convert squad examples to features: 100%|██████████| 3534/3534 [00:03<00:00, 1145.94it/s]
add example index and unique id: 100%|██████████| 3534/3534 [00:00<00:00, 734122.65it/s]






Let's check the structure of the predictions

In [64]:
#Show up to 100 characters per column before truncating long texts
pd.set_option('display.max_colwidth',100)

#Each ID contains multiple predicted extractions and their corresponding probabilities (prediction with highest probability is first)
pd.DataFrame.from_dict(predictions_val)[1]

0    {'id': 'ef42dee96c', 'answer': ['resting had a whole day of walking', 'resting had a whole day',...
1    {'id': 'ef42dee96c', 'probability': [0.9983912693956408, 0.0005431288872461738, 0.00043454031598...
Name: 1, dtype: object

The commands below select the extracted text with the highest likelihood (the first item), as well as its corresponding probability.

In [65]:
#@title Obtain output with the highest prob - Validation set
predictions_df_val = pd.DataFrame.from_dict(predictions_val)
text_val = pd.DataFrame(predictions_val[0])
prob_val = pd.DataFrame(predictions_val[1])
prop1_val = prob_val['probability'].tolist()
prop2_val = pd.DataFrame(prop1_val)
text1_val = text_val['answer'].tolist()
text2_val = pd.DataFrame(text1_val)


In [66]:
#@title Obtain output with the highest prob - Test set
predictions_df_test = pd.DataFrame.from_dict(predictions_test)
text_test = pd.DataFrame(predictions_test[0])
prob_test = pd.DataFrame(predictions_test[1])
prop1_test = prob_test['probability'].tolist()
prop2_test = pd.DataFrame(prop1_test)
text1_test = text_test['answer'].tolist()
text2_test = pd.DataFrame(text1_test)

In [67]:
# Make a copy of the validation and test sets so that we are not modifying the original sets
sub_val_df = val_df.copy()
sub_test_df = test_df.copy()

In [68]:
#Add the predicted result to the copied data frames 
sub_val_df['predicted_selected_text'] = text2_val[0].values
sub_test_df['predicted_selected_text'] = text2_test[0].values

In [69]:
#Add the probability of the prediction
sub_val_df['prob'] = prop2_val[0].values
sub_test_df['prob'] = prop2_test[0].values

## Evaluate Validation Set with Jaccard Score

In [70]:
# Check head of dataset
sub_val_df.head()

Unnamed: 0,textID,text,selected_text,sentiment,predicted_selected_text,prob
1588,a7f72a928a,WOOOOOOOOOO are you coming to Nottingham at any point? lovelovelove<3,t? lovelovelove,positive,lovelovelove,0.323002
23879,ef42dee96c,resting had a whole day of walking,resting had a whole day of walking,neutral,resting had a whole day of walking,0.998391
6561,07d17131b1,"was in Palawan a couple of days ago, i`ll try to post pictures tom.","was in Palawan a couple of days ago, i`ll try to post pictures tom.",neutral,"was in palawan a couple of days ago, i`ll try to post pictures tom.",0.975718
2602,2820205db5,I know! I`m so slow its horrible. DON`T TELL ON ME!,horrible.,negative,horrible.,0.340405
4003,7d3ce4363c,"Glad I went out, glad I didn`t leave early, and glad to be afterpartying it up @ Beth`s I`m back!",glad,positive,glad,0.234629


In [71]:
#Make a copy of the original validation set and reset indexes
df_js=sub_val_df.copy()
df_js=df_js.reset_index()

In [72]:
#Define the Jaccard Score function
def jaccard(str1, str2): 
    a = set(str1.lower().split()) 
    b = set(str2.lower().split())
    c = a.intersection(b)
    return float(len(c)) / (len(a) + len(b) - len(c))

In [73]:
#Obtain JS for the entire set
results = []
for i in range(len(df_js)):
    score = jaccard(df_js['selected_text'].iloc[i], df_js['predicted_selected_text'].iloc[i])
    results.append(score)
    
Jaccard_score = sum(results) / len(results)
Jaccard_score

0.7043270849968862

## Prepare and Submit Test Set

In [None]:
# Check head of dataset
sub_test_df.head()

Unnamed: 0,textID,text,sentiment,selected_text_results
0,f87dea47db,Last session of the day http://twitpic.com/67ezh,neutral,last session of the day http://twitpic.com/67ezh
1,96d74cb729,Shanghai is also really exciting (precisely -...,positive,shanghai is also really exciting
2,eee518ae67,"Recession hit Veronique Branquinho, she has to...",negative,such a shame!
3,01082688c6,happy bday!,positive,happy bday!
4,33987a8ee5,http://twitpic.com/4w75p - I like it!!,positive,i like it!!


In [None]:
#Prepare file for submission
final_test=sub_test_df[['textID','predicted_selected_text']]
final_test.columns=['textID','selected_text']
final_test.head()

Unnamed: 0,textID,selected_text
0,f87dea47db,last session of the day http://twitpic.com/67ezh
1,96d74cb729,shanghai is also really exciting
2,eee518ae67,such a shame!
3,01082688c6,happy bday!
4,33987a8ee5,i like it!!


In [None]:
#Submit
final_test[['textID','selected_text']].to_csv('submission.csv', index=False)
print("Submission successful")

Submission successful


In [None]:
#Additional files if required
"""from google.colab import files
sub_val_df.to_csv('sub_val.csv') 
files.download('sub_val.csv')
sub_test_df.to_csv('sub_test.csv') 
files.download('sub_test.csv')
train_df.to_csv("new_train_df")"""