<a href="https://colab.research.google.com/github/DianaMoyano1/NLP-Sentiment_Extraction_Challenge/blob/master/diana_bert-large-cased_A.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# SECTION 1: Setup


### Mount Your Own Gdrive

The command below asks you to authorize access to your Google account and gives you a temporary access code to paste into the prompt.

In [1]:
# Mount your local Google drive and show the models you have
from google.colab import drive
drive.mount('/gdrive')
%ls '/gdrive/My Drive/Colab Notebooks/tweet-sentiment-extraction/models' 

Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3aietf%3awg%3aoauth%3a2.0%3aoob&response_type=code&scope=email%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdocs.test%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive.photos.readonly%20https%3a%2f%2fwww.googleapis.com%2fauth%2fpeopleapi.readonly

Enter your authorization code:
··········
Mounted at /gdrive
[0m[01;34mDiana_bert-base-cased_A[0m/
[01;34mdiana_bert-base-uncased_A[0m/
[01;34mDiana_distilbert-base-cased_A[0m/
[01;34mdiana_distilbert-base-cased-distilled-squad_A[0m/
[01;34mDiana_distilbert-base-uncased_A[0m/
[01;34mdiana_distilbert-base-uncased-distilled-squad_A[0m/
[01;34mdiana_distilbert-base-uncased-distilled-squad_B[0m/
[01;34mdiana_distilbert-base-uncased-distilled-squad_C[0m/
[01;34mdiana_distilbert-base-uncased-distilled-squad_D[0m/


In [2]:
#install the following first
!pip install transformers==2.11.0 --quiet
!pip install tensorflow==2.2.0 --quiet
!pip install tensorboardX --quiet
!pip install simpletransformers --quiet



### Setup NVIDIA APEX

Tool to enable mixed precision training. More info here: https://github.com/NVIDIA/apex

In [3]:
%%writefile setup.sh
git clone https://github.com/NVIDIA/apex
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./apex

Writing setup.sh


In [4]:
#this will take 10mins to run
import timeit
start = timeit.default_timer()

!sh setup.sh --quiet

stop = timeit.default_timer()
print('Time: ', stop - start)  

Cloning into 'apex'...
remote: Enumerating objects: 7255, done.
remote: Total 7255 (delta 0), reused 0 (delta 0), pack-reused 7255
Receiving objects: 100% (7255/7255), 13.86 MiB | 26.28 MiB/s, done.
Resolving deltas: 100% (4900/4900), done.
  cmdoptions.check_install_build_global(options)
Created temporary directory: /tmp/pip-ephem-wheel-cache-pmvi5bfj
Created temporary directory: /tmp/pip-req-tracker-ytbpqu6r
Created requirements tracker '/tmp/pip-req-tracker-ytbpqu6r'
Created temporary directory: /tmp/pip-install-6j6wifad
Processing ./apex
  Created temporary directory: /tmp/pip-req-build-tla4g8sp
  Added file:///content/apex to build tracker '/tmp/pip-req-tracker-ytbpqu6r'
    Running setup.py (path:/tmp/pip-req-build-tla4g8sp/setup.py) egg_info for package from file:///content/apex
    Running command python setup.py egg_info


    torch.__version__  = 1.5.0+cu101


    running egg_info
    creating /tmp/pip-req-build-tla4g8sp/pip-egg-info/apex.egg-info
    writing /tmp/pip-r

### Import Packages

In [0]:
#Import packages
from os.path import join
import numpy as np 
import pandas as pd 
from apex import amp
#from glob import glob
#import os
#from random import random
#from pathlib import Path
import json
#import torch
#from transformers import AutoModel, AutoTokenizer, BertTokenizer, AutoModelForQuestionAnswering
#from transformers import TFBertModel, BertModel, DistilBertModel, XLNetModel, RobertaModel
#from tensorboardX import SummaryWriter
#from torch.utils.data import DataLoader, RandomSampler, SequentialSampler, TensorDataset
#from transformers import AdamW, get_linear_schedule_with_warmup

#from IPython.core.interactiveshell import InteractiveShell
#InteractiveShell.ast_node_interactivity = "all"

#from os.path import join


use_cuda = True ##If True, GPU will be used

### Load the Data



Before running the command below, make sure you have:
- Created a *'tweet-sentiment-extraction'* folder inside the *'Colab Notebooks'* directory
- Uploaded the *train.csv* and *test.csv* files to the *'tweet-sentiment-extraction'* folder 

Finally, make sure you have a folder called *'models'* inside the *'tweet-sentiment-extraction'* directory

In [0]:
train_df = pd.read_csv('/gdrive/My Drive/Colab Notebooks/tweet-sentiment-extraction/train.csv')
test_df = pd.read_csv('/gdrive/My Drive/Colab Notebooks/tweet-sentiment-extraction/test.csv')



#sub_df = pd.read_csv('/gdrive/My Drive/Colab Notebooks/tweet-sentiment-extraction/sample_submission.csv') #Optional

### Prepare the Data

Split into train and validation sets

In [0]:
from sklearn.model_selection import train_test_split

train_df, val_df = train_test_split(train_df, test_size=0.2, random_state = 42)

In [0]:
#drop selected_text column from the validation dataset (it will be later compared to the ground truth)
val_df_new = val_df.drop('selected_text', axis=1)

In [9]:
print(train_df.shape)
print(val_df_new.shape)
print(test_df.shape)

(21984, 4)
(5497, 3)
(3534, 3)
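
The printed shapes are consistent with how a 0.2 validation fraction gets rounded. The sketch below assumes the validation size is rounded up and the remainder goes to training (which is how scikit-learn treats a float `test_size`), and that *train.csv* held 27,481 rows, a figure inferred from the two shapes above rather than stated in the notebook.

```python
import math

n_rows = 21984 + 5497   # total rows in train.csv, inferred from the printed shapes
test_size = 0.2

n_val = math.ceil(test_size * n_rows)   # validation rows: the fraction is rounded up
n_train = n_rows - n_val                # everything else is kept for training

print(n_rows, n_train, n_val)
```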


In [0]:
train = np.array(train_df)
val = np.array(val_df_new)
test = np.array(test_df)

### Initiate the SimpleTransformers Task



The SimpleTransformers library supports numerous tasks:  


- Sequence Classification
- Token Classification (NER)
- Question Answering
- Language Model Fine-Tuning
- Language Model Training
- Language Generation
- T5 Model
- Seq2Seq Tasks
- Multi-Modal Classification
- Conversational AI

In this case, we are performing a <ins>Question Answering</ins> task.

Supported model types:

- ALBERT
- BERT
- DistilBERT
- ELECTRA
- XLM
- XLNet

In [0]:
# Import the Question Answering model
from simpletransformers.question_answering import QuestionAnsweringModel

### Format the data under the SimpleTransformer's *Question&Answer* schema 



To build the input, each column is mapped to a field of the schema:
- Context: The entire tweet
- Question: The sentiment (positive, negative or neutral). In other words, we are asking *"What part of the entire tweet best represents this sentiment?"*
- Answer: The label, i.e. the extracted text. For the validation and test sets it holds a placeholder, since the text is what the model predicts.

The formatted data is assigned to the variables *qa_train*, *qa_val* and *qa_test* respectively.
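
As a rough illustration of the mapping described above, the sketch below builds a single record in this nested layout. The helper name `to_qa_record`, its arguments and the sample tweet are hypothetical; the field names and the `'__None__'` placeholder follow the cells in this notebook.

```python
def to_qa_record(text_id, text, sentiment, selected_text=None):
    """Build one question-answering record (illustrative helper, not part of the library)."""
    context = text.lower()
    if selected_text is None:
        # Validation/test rows carry a placeholder answer
        answers = [{'answer_start': 1000000, 'text': '__None__'}]
    else:
        answer = selected_text.lower()
        # Character offset of the labelled span inside the tweet
        answers = [{'answer_start': context.find(answer), 'text': answer}]
    qas = [{'question': sentiment, 'id': text_id, 'is_impossible': False, 'answers': answers}]
    return {'context': context, 'qas': qas}

record = to_qa_record('demo-id', 'Happy bday to you!', 'positive', 'Happy bday')
print(record)
```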



In [12]:
#@title Create list for training

## Adapted from https://www.kaggle.com/cheongwoongkang/roberta-baseline-starter-simple-postprocessing
def find_all(input_str, search_str):
    l1 = []
    length = len(input_str)
    index = 0
    while index < length:
        i = input_str.find(search_str, index)
        if i == -1:
            return l1
        l1.append(i)
        index = i + 1
    return l1

def do_qa_train(train):

    output = []
    for line in train:
        context = line[1]

        qas = []
        question = line[-1]
        qid = line[0]
        answers = []
        answer = line[2]
        if type(answer) != str or type(context) != str or type(question) != str:
            print(context, type(context))
            print(answer, type(answer))
            print(question, type(question))
            continue
        answer_starts = find_all(context, answer)
        for answer_start in answer_starts:
            answers.append({'answer_start': answer_start, 'text': answer.lower()})
            break
        qas.append({'question': question, 'id': qid, 'is_impossible': False, 'answers': answers})

        output.append({'context': context.lower(), 'qas': qas})
        
    return output

qa_train = do_qa_train(train)


nan <class 'float'>
nan <class 'float'>
neutral <class 'str'>
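
To make the role of `find_all` concrete, here is a small standalone restatement run on a made-up tweet. It returns every start offset of the answer inside the context; the training list only keeps the first one, as the `break` in `do_qa_train` shows. The sample string is hypothetical.

```python
def find_all(input_str, search_str):
    """Return every start index of search_str in input_str (overlapping matches included)."""
    starts = []
    index = 0
    while index < len(input_str):
        i = input_str.find(search_str, index)
        if i == -1:
            break
        starts.append(i)
        index = i + 1
    return starts

tweet = "glad I went out, glad I stayed"   # made-up example text
offsets = find_all(tweet, "glad")
first_offset = offsets[0] if offsets else None   # only the first match is used for training
print(offsets, first_offset)
```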


In [0]:
#@title Create val list
## Adapted from https://www.kaggle.com/cheongwoongkang/roberta-baseline-starter-simple-postprocessing
def do_qa_val(val):
    output = []
    for line in val:
        context = line[1]
        qas = []
        question = line[-1]
        qid = line[0]
        if type(context) != str or type(question) != str:
            # Skip rows with a missing tweet or sentiment (there is no answer column here)
            print(context, type(context))
            print(question, type(question))
            continue
        answers = []
        answers.append({'answer_start': 1000000, 'text': '__None__'})
        qas.append({'question': question, 'id': qid, 'is_impossible': False, 'answers': answers})
        output.append({'context': context.lower(), 'qas': qas})
    return output

qa_val = do_qa_val(val)

In [0]:
#@title Create test list
## Adapted from https://www.kaggle.com/cheongwoongkang/roberta-baseline-starter-simple-postprocessing
def do_qa_test(test):
    output = []
    for line in test:
        context = line[1]
        qas = []
        question = line[-1]
        qid = line[0]
        if type(context) != str or type(question) != str:
            # Skip rows with a missing tweet or sentiment (there is no answer column here)
            print(context, type(context))
            print(question, type(question))
            continue
        answers = []
        answers.append({'answer_start': 1000000, 'text': '__None__'})
        qas.append({'question': question, 'id': qid, 'is_impossible': False, 'answers': answers})
        output.append({'context': context.lower(), 'qas': qas})
    return output

qa_test = do_qa_test(test)

### Create a Logging Module --> More info [here](https://realpython.com/python-logging/#:~:text=The%20Logging%20Module,-The%20logging%20module&text=It%20is%20used%20by%20most,homogeneous%20log%20for%20your%20application.&text=With%20the%20logging%20module%20imported,that%20you%20want%20to%20see.)


Logs provide developers with an extra set of eyes that are constantly looking at the flow that an application is going through. They can store information, like which user or IP accessed the application.  

With the logging module imported, you can use something called a “logger” to log messages that you want to see. By default, there are 5 standard levels indicating the severity of events.
- DEBUG
- INFO
- WARNING
- ERROR
- CRITICAL

In this case, the root logger is set to INFO, while the `transformers` logger is raised to WARNING to reduce its output.

In [0]:
import logging

logging.basicConfig(level=logging.INFO)
transformers_logger = logging.getLogger("transformers")
transformers_logger.setLevel(logging.WARNING)
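
The effect of giving one named logger a stricter level than the root can be sketched as follows. The logger names `demo_app` and `demo_noisy_lib` and the `ListHandler` class are hypothetical stand-ins, used only so the filtered messages can be inspected.

```python
import logging

class ListHandler(logging.Handler):
    """Collect emitted records in a list so the filtering can be inspected."""
    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append((record.name, record.levelname, record.getMessage()))

handler = ListHandler()
root = logging.getLogger()
root.addHandler(handler)
root.setLevel(logging.INFO)              # default threshold for every logger

noisy = logging.getLogger("demo_noisy_lib")
noisy.setLevel(logging.WARNING)          # stricter threshold for one chatty library

logging.getLogger("demo_app").info("kept: INFO passes the root threshold")
noisy.info("dropped: below this logger's WARNING threshold")
noisy.warning("kept: WARNING passes")

print(handler.messages)
```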

### Train a SimpleTransformers Model **<ins>OR</ins>** Load an Existing Richardson Model  

Select the option that best applies to your case.  


>#### Option 1: Train a SimpleTransformers Model

>1. Create a folder that will contain the new model's PyTorch and hyperparameter files. Follow the instructions below to name the *'NAME_OF_MODEL'* folder:


>>>**Basic Structure:**

>>>\<Name>_\<Model>_\<Version>  

>>>>Where:
- Name: Your name
- Model: Based on the model names used in the official Transformers site: https://huggingface.co/transformers/pretrained_models.html
- Version: For notebooks with same name and model but different hyperparameters, include the version (A, B, C...)
  
  >>>>Examples:
  - Lucas_distilroberta-base_A
  - Lucas_distilroberta-base_B
  - Landis_bert_A  

>2. Follow **OPTION 1** section



---



  >#### Option 2: Load an Existing Richardson Model

  >1. Under *'NAME_OF_MODEL'*, enter the name of the model folder you want to load
  >2. Skip *OPTION 1* and follow **OPTION 2** section





In [0]:
# Change this BEFORE RUNNING *********************************************************************************************
YOUR_NAME = 'diana'
YOUR_LETTER = 'A'     # identify your model A,B,C,D,E...
MODEL_ARCHITECTURE = 'bert'
MODEL_NAME = 'bert-large-cased'
# ************************************************************************************************************************

NAME_OF_MODEL = YOUR_NAME + '_' + MODEL_NAME + '_' + YOUR_LETTER 

#Don't change this:
ROOT = '/gdrive/My Drive/Colab Notebooks/tweet-sentiment-extraction/models' 
FULL_PATH = join(ROOT, NAME_OF_MODEL)
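
A minimal sketch of the naming convention, showing how two versions of the same model end up in separate folders. The helper `model_folder` is hypothetical; only the `<Name>_<Model>_<Version>` pattern and the root path come from this notebook.

```python
import posixpath

def model_folder(name, model, version, root):
    """Join <Name>_<Model>_<Version> onto the models directory (illustrative helper)."""
    return posixpath.join(root, f"{name}_{model}_{version}")

ROOT = '/gdrive/My Drive/Colab Notebooks/tweet-sentiment-extraction/models'
path_a = model_folder('diana', 'bert-large-cased', 'A', ROOT)
path_b = model_folder('diana', 'bert-large-cased', 'B', ROOT)   # same model, different hyperparameters
print(path_a)
print(path_b)
```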

# OPTION 1: Load, train and evaluate a SimpleTransformers Pre-trained Model

#### Next Steps
At this point, you should save your work:

1. Save a copy of this notebook in GitHub with the same name you used under *'NAME_OF_MODEL'* 
2. Go to the Experiments project and add a note with the following info


>*   Name you enter under NAME_OF_MODEL
>*   Jaccard Score (once you have it)
>*   List of arguments (you'll find them in SECTION 2 under <ins>**args_train**</ins>)



Related link: https://huggingface.co/transformers/pretrained_models.html

In [33]:
#Change the workspace to the "tweet-sentiment-extraction/models" folder
%cd '{ROOT}'
#It creates the folder where the model components will be saved. If you have a folder with the same name, it will give you an error
%mkdir '{NAME_OF_MODEL}' 
#Change the workspace to the recently created folder
%cd '{FULL_PATH}' 

/gdrive/My Drive/Colab Notebooks/tweet-sentiment-extraction/models
/gdrive/My Drive/Colab Notebooks/tweet-sentiment-extraction/models/diana_bert-large-cased_A


In [34]:
#For more arguments, refer to this link --> https://simpletransformers.ai/docs/usage/#configuring-a-simple-transformers-model

args_train={'reprocess_input_data': True,
'overwrite_output_dir': True,
'learning_rate': 5e-5,
'num_train_epochs': 1,
'max_seq_length': 192,
'doc_stride': 64,
'fp16': False,
}

#Create the model (training happens in the next cell)
model = QuestionAnsweringModel(MODEL_ARCHITECTURE, MODEL_NAME, args=args_train, use_cuda=use_cuda)


In [35]:
#Train the model
import timeit
start = timeit.default_timer()

model.train_model(qa_train)

stop = timeit.default_timer()
print('Time: ', stop - start)  

INFO:simpletransformers.question_answering.question_answering_model: Converting to features started.
convert squad examples to features: 100%|██████████| 21983/21983 [00:18<00:00, 1208.51it/s]
add example index and unique id: 100%|██████████| 21983/21983 [00:00<00:00, 840788.46it/s]


[Progress bar: Epoch, max=1]

[Progress bar: Current iteration, max=2748]

Running loss: 2.800294



Running loss: 1.514127



Running loss: 0.697862



INFO:simpletransformers.question_answering.question_answering_model: Training of bert model complete. Saved to outputs/.


Time:  1721.5348657289996
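
The iteration counts reported by the progress bars (2748 for training, then 688 and 442 for the two prediction runs) can be reproduced from the example counts in the logs. The batch size of 8 used below is an assumption: it is not set in `args_train`, but it is the value that matches all three counts.

```python
import math

batch_size = 8   # assumption: not set in args_train; chosen because it reproduces the logged counts

def n_batches(n_examples, batch_size):
    """Number of batches needed to cover n_examples, with a partial final batch."""
    return math.ceil(n_examples / batch_size)

train_steps = n_batches(21983, batch_size)   # training examples after one row was skipped
val_steps = n_batches(5497, batch_size)
test_steps = n_batches(3534, batch_size)
minutes = 1721.5348657289996 / 60            # training time printed above, in minutes

print(train_steps, val_steps, test_steps, round(minutes, 1))
```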


In [36]:
#Predict the evaluation and test sets
predictions_val = model.predict(qa_val)
predictions_test = model.predict(qa_test)


INFO:simpletransformers.question_answering.question_answering_model: Converting to features started.
convert squad examples to features: 100%|██████████| 5497/5497 [00:05<00:00, 921.23it/s]
add example index and unique id: 100%|██████████| 5497/5497 [00:00<00:00, 633617.93it/s]


[Progress bar: max=688]




INFO:simpletransformers.question_answering.question_answering_model: Converting to features started.
convert squad examples to features: 100%|██████████| 3534/3534 [00:02<00:00, 1198.91it/s]
add example index and unique id: 100%|██████████| 3534/3534 [00:00<00:00, 681627.44it/s]


[Progress bar: max=442]




In [0]:
#@title Obtain output with the highest prob - Validation set

#Validation Set highest probability output
predictions_df_val = pd.DataFrame.from_dict(predictions_val)
text_val = pd.DataFrame(predictions_val[0])
prob_val = pd.DataFrame(predictions_val[1])
prop1_val = prob_val['probability'].tolist()
prop2_val = pd.DataFrame(prop1_val)
text1_val = text_val['answer'].tolist()
text2_val = pd.DataFrame(text1_val)

In [0]:
#@title Obtain output with the highest prob - Test set
predictions_df_test = pd.DataFrame.from_dict(predictions_test)
text_test = pd.DataFrame(predictions_test[0])
prob_test = pd.DataFrame(predictions_test[1])
prop1_test = prob_test['probability'].tolist()
prop2_test = pd.DataFrame(prop1_test)
text1_test = text_test['answer'].tolist()
text2_test = pd.DataFrame(text1_test)

In [0]:
# Make a copy of the validation and test sets so that we are not modifying the original sets
sub_val_df = val_df.copy()
sub_test_df = test_df.copy()

In [0]:
#Add the predicted result to the copied data frames 
sub_val_df['predicted_selected_text'] = text2_val[0].values
sub_test_df['predicted_selected_text'] = text2_test[0].values

In [0]:
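#Add the probability of the prediction
#NOTE: this reads column 1 of the probability table, while the predicted text above is read from column 0
sub_val_df['prob'] = prop2_val[1].values
sub_test_df['prob'] = prop2_test[1].values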
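
The cells above unpack the prediction output through several intermediate DataFrames. A more direct sketch is shown below on hand-written stand-in data. The layout (a pair of lists whose items hold an `'id'` plus a list under `'answer'` or `'probability'`) is an assumption inferred from the column names used above, and the ordering of candidates from best to worst is also assumed; the values are invented.

```python
# Hypothetical stand-in for the (answers, probabilities) pair returned by model.predict
answers = [
    {'id': 'a1', 'answer': ['glad', 'glad i went out']},
    {'id': 'b2', 'answer': ['horrible.', 'its horrible.']},
]
probabilities = [
    {'id': 'a1', 'probability': [0.61, 0.12]},
    {'id': 'b2', 'probability': [0.48, 0.27]},
]

# Take index 0 from both lists so each answer stays paired with its own score
top = {
    a['id']: (a['answer'][0], p['probability'][0])
    for a, p in zip(answers, probabilities)
}
print(top)
```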

## Evaluate Validation Set with Jaccard Score

In [42]:
# Check head of dataset
sub_val_df.head()

Unnamed: 0,textID,text,selected_text,sentiment,predicted_selected_text,prob
1588,a7f72a928a,WOOOOOOOOOO are you coming to Nottingham at...,t? lovelovelove,positive,woooooooooo are you coming to nottingham at an...,0.212806
23879,ef42dee96c,resting had a whole day of walking,resting had a whole day of walking,neutral,resting had a whole day of walking,0.000497
6561,07d17131b1,"was in Palawan a couple of days ago, i`ll try ...","was in Palawan a couple of days ago, i`ll try ...",neutral,"was in palawan a couple of days ago, i`ll try ...",0.012782
2602,2820205db5,I know! I`m so slow its horrible. DON`T TELL ...,horrible.,negative,its horrible.,0.26507
4003,7d3ce4363c,"Glad I went out, glad I didn`t leave early, an...",glad,positive,glad,0.2134


In [0]:
#Make a copy of the original validation set and reset indexes
df_js=sub_val_df.copy()
df_js=df_js.reset_index()

In [0]:
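#Define the Jaccard Score function: shared words divided by all distinct words
def jaccard(str1, str2): 
    a = set(str1.lower().split()) 
    b = set(str2.lower().split())
    if not a and not b:
        return 0.5  # both strings empty: avoid division by zero (0.5 is an assumed convention)
    c = a.intersection(b)
    return float(len(c)) / (len(a) + len(b) - len(c))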
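
A short worked example of the score on word sets, using label and prediction pairs from the validation preview above. The function is restated with a guard for two empty strings; that guard and its 0.5 return value are an assumed convention, not something the notebook specifies.

```python
def jaccard(str1, str2):
    """Word-level Jaccard similarity: |A ∩ B| / |A ∪ B|."""
    a = set(str1.lower().split())
    b = set(str2.lower().split())
    if not a and not b:
        return 0.5  # assumed convention for two empty strings
    c = a.intersection(b)
    return float(len(c)) / (len(a) + len(b) - len(c))

exact = jaccard("glad", "glad")                          # 1 shared word of 1 distinct -> 1.0
partial = jaccard("horrible.", "its horrible.")          # 1 shared word of 2 distinct -> 0.5
shorter = jaccard("t? lovelovelove", "lovelovelove")     # 1 shared word of 2 distinct -> 0.5
mean_score = (exact + partial + shorter) / 3

print(exact, partial, shorter, mean_score)
```

Because the comparison is made on whitespace-separated tokens, punctuation attached to a word (`horrible.` versus `horrible`) counts as a different token.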

In [45]:
#Obtain JS for the entire set
results = []
for i in range(len(df_js)):
    score = jaccard(df_js['selected_text'].iloc[i], df_js['predicted_selected_text'].iloc[i])
    results.append(score)
    
Jaccard_score = sum(results) / len(results)
Jaccard_score

0.7032950352614399

## Prepare and Submit Test Set

In [0]:
# Check head of dataset
sub_test_df.head()

Unnamed: 0,textID,text,sentiment,predicted_selected_text,prob
0,f87dea47db,Last session of the day http://twitpic.com/67ezh,neutral,last session of the day,0.27224
1,96d74cb729,Shanghai is also really exciting (precisely -...,positive,exciting,0.114143
2,eee518ae67,"Recession hit Veronique Branquinho, she has to...",negative,shame!,0.29143
3,01082688c6,happy bday!,positive,happy bday!,0.469685
4,33987a8ee5,http://twitpic.com/4w75p - I like it!!,positive,i like it!!,0.215219


In [0]:
#Prepare file for submission
final_test=sub_test_df[['textID','predicted_selected_text']]
final_test.columns=['textID','selected_text']
final_test.head()

Unnamed: 0,textID,selected_text
0,f87dea47db,last session of the day
1,96d74cb729,exciting
2,eee518ae67,shame!
3,01082688c6,happy bday!
4,33987a8ee5,i like it!!


In [0]:
#Submit
final_test[['textID','selected_text']].to_csv('submission.csv', index=False)
print("Submission successful")

Submission successful


## Save trained model arguments and other files

In [0]:
#This line creates a JSON file that is required to load the model in the future
with open('args_train.json', 'w') as fp: 
    json.dump(args_train, fp)
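
The saved file is what a later session reads back to rebuild the model with the same settings. A minimal round-trip sketch, using a temporary directory as a stand-in for the model folder:

```python
import json
import os
import tempfile

args_train = {'reprocess_input_data': True, 'overwrite_output_dir': True,
              'learning_rate': 5e-5, 'num_train_epochs': 1,
              'max_seq_length': 192, 'doc_stride': 64, 'fp16': False}

with tempfile.TemporaryDirectory() as tmp_dir:   # stand-in for the model folder on Drive
    args_path = os.path.join(tmp_dir, 'args_train.json')
    with open(args_path, 'w') as fp:
        json.dump(args_train, fp)                # save the training arguments
    with open(args_path) as fp:
        restored_args = json.load(fp)            # load them back as a plain dict

print(restored_args == args_train)
```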

In [0]:
#Additional files if required
"""from google.colab import files
sub_val_df.to_csv('sub_val.csv') 
files.download('sub_val.csv')
sub_test_df.to_csv('sub_test.csv') 
files.download('sub_test.csv')
train_df.to_csv("new_train_df")"""

#### TODO To assess via Jaccard Score, please refer to the last part of this notebook

# OPTION 2: Load and Evaluate a Richardson Pre-Trained Model

In [30]:
%cd ..

/gdrive/My Drive/Colab Notebooks/tweet-sentiment-extraction


In [32]:

ROOT = '/gdrive/My Drive/Colab Notebooks/tweet-sentiment-extraction/models' #Don't change

FULL_PATH = join(ROOT, NAME_OF_MODEL)

#Change the workspace to the model folder
%cd '{FULL_PATH}' 

#Load the model's arguments list (required to setup the existing model) 
with open('args_train.json') as json_file: 
    train_args = json.load(json_file) 

[Errno 2] No such file or directory: '/gdrive/My Drive/Colab Notebooks/tweet-sentiment-extraction/models/diana_bert-large-cased_A'
/gdrive/My Drive/Colab Notebooks/tweet-sentiment-extraction


FileNotFoundError: ignored

## Setup loaded model

In [0]:
#Load the model
loaded_model = QuestionAnsweringModel(MODEL_ARCHITECTURE, 'outputs/', args=train_args, use_cuda=use_cuda)

In [19]:
#Predict the evaluation and test sets
predictions_val = loaded_model.predict(qa_val)
predictions_test = loaded_model.predict(qa_test)

INFO:simpletransformers.question_answering.question_answering_model: Converting to features started.
convert squad examples to features: 100%|██████████| 5497/5497 [00:04<00:00, 1270.25it/s]
add example index and unique id: 100%|██████████| 5497/5497 [00:00<00:00, 729392.25it/s]


[Progress bar: max=688]




INFO:simpletransformers.question_answering.question_answering_model: Converting to features started.
convert squad examples to features: 100%|██████████| 3534/3534 [00:02<00:00, 1215.71it/s]
add example index and unique id: 100%|██████████| 3534/3534 [00:00<00:00, 791809.31it/s]


[Progress bar: max=442]




In [0]:
#@title Obtain output with the highest prob - Validation set
predictions_df_val = pd.DataFrame.from_dict(predictions_val)
text_val = pd.DataFrame(predictions_val[0])
prob_val = pd.DataFrame(predictions_val[1])
prop1_val = prob_val['probability'].tolist()
prop2_val = pd.DataFrame(prop1_val)
text1_val = text_val['answer'].tolist()
text2_val = pd.DataFrame(text1_val)


In [0]:
#@title Obtain output with the highest prob - Test set
predictions_df_test = pd.DataFrame.from_dict(predictions_test)
text_test = pd.DataFrame(predictions_test[0])
prob_test = pd.DataFrame(predictions_test[1])
prop1_test = prob_test['probability'].tolist()
prop2_test = pd.DataFrame(prop1_test)
text1_test = text_test['answer'].tolist()
text2_test = pd.DataFrame(text1_test)

In [0]:
# Make a copy of the validation and test sets so that we are not modifying the original sets
sub_val_df = val_df.copy()
sub_test_df = test_df.copy()

In [0]:
#Add the predicted result to the copied data frames 
sub_val_df['predicted_selected_text'] = text2_val[0].values
sub_test_df['predicted_selected_text'] = text2_test[0].values

In [0]:
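#Add the probability of the prediction
#NOTE: this reads column 1 of the probability table, while the predicted text above is read from column 0
sub_val_df['prob'] = prop2_val[1].values
sub_test_df['prob'] = prop2_test[1].values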

## Evaluate Validation Set with Jaccard Score

In [28]:
# Check head of dataset
sub_val_df.head()

Unnamed: 0,textID,text,selected_text,sentiment,predicted_selected_text,prob
1588,a7f72a928a,WOOOOOOOOOO are you coming to Nottingham at...,t? lovelovelove,positive,lovelovelove,0.306367
23879,ef42dee96c,resting had a whole day of walking,resting had a whole day of walking,neutral,resting had a whole day of walking,0.000298
6561,07d17131b1,"was in Palawan a couple of days ago, i`ll try ...","was in Palawan a couple of days ago, i`ll try ...",neutral,"was in palawan a couple of days ago, i`ll try ...",0.008895
2602,2820205db5,I know! I`m so slow its horrible. DON`T TELL ...,horrible.,negative,horrible.,0.177021
4003,7d3ce4363c,"Glad I went out, glad I didn`t leave early, an...",glad,positive,glad,0.234319


In [0]:
#Make a copy of the original validation set and reset indexes
df_js=sub_val_df.copy()
df_js=df_js.reset_index()

In [0]:
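#Define the Jaccard Score function: shared words divided by all distinct words
def jaccard(str1, str2): 
    a = set(str1.lower().split()) 
    b = set(str2.lower().split())
    if not a and not b:
        return 0.5  # both strings empty: avoid division by zero (0.5 is an assumed convention)
    c = a.intersection(b)
    return float(len(c)) / (len(a) + len(b) - len(c))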

In [27]:
#Obtain JS for the entire set
results = []
for i in range(len(df_js)):
    score = jaccard(df_js['selected_text'].iloc[i], df_js['predicted_selected_text'].iloc[i])
    results.append(score)
    
Jaccard_score = sum(results) / len(results)
Jaccard_score

0.7066765232598925

## Prepare and Submit Test Set

In [0]:
# Check head of dataset
sub_test_df.head()

Unnamed: 0,textID,text,sentiment,selected_text_results
0,f87dea47db,Last session of the day http://twitpic.com/67ezh,neutral,last session of the day http://twitpic.com/67ezh
1,96d74cb729,Shanghai is also really exciting (precisely -...,positive,shanghai is also really exciting
2,eee518ae67,"Recession hit Veronique Branquinho, she has to...",negative,such a shame!
3,01082688c6,happy bday!,positive,happy bday!
4,33987a8ee5,http://twitpic.com/4w75p - I like it!!,positive,i like it!!


In [0]:
#Prepare file for submission
final_test=sub_test_df[['textID','predicted_selected_text']]
final_test.columns=['textID','selected_text']
final_test.head()

Unnamed: 0,textID,selected_text
0,f87dea47db,last session of the day http://twitpic.com/67ezh
1,96d74cb729,shanghai is also really exciting
2,eee518ae67,such a shame!
3,01082688c6,happy bday!
4,33987a8ee5,i like it!!


In [0]:
#Submit
final_test[['textID','selected_text']].to_csv('submission.csv', index=False)
print("Submission successful")

Submission successful


In [0]:
#Additional files if required
"""from google.colab import files
sub_val_df.to_csv('sub_val.csv') 
files.download('sub_val.csv')
sub_test_df.to_csv('sub_test.csv') 
files.download('sub_test.csv')
train_df.to_csv("new_train_df")"""