<a href="https://colab.research.google.com/github/DianaMoyano1/NLP-Sentiment_Extraction_Challenge/blob/master/Template_SingleModel.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# SECTION 1: Setup


### Mount Your Own Gdrive

The command below asks you to authorize access to your Google account and gives you a temporary access code to paste into the prompt.

In [1]:
# Mount your local Google drive and show the models you have
from google.colab import drive
drive.mount('/gdrive')
%ls '/gdrive/My Drive/Colab Notebooks/tweet-sentiment-extraction/models' 

Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3aietf%3awg%3aoauth%3a2.0%3aoob&response_type=code&scope=email%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdocs.test%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive.photos.readonly%20https%3a%2f%2fwww.googleapis.com%2fauth%2fpeopleapi.readonly

Enter your authorization code:
··········
Mounted at /gdrive
[0m[01;34mDiana_bert-base-cased_A[0m/    [01;34mDiana_distilbert-base-cased_A[0m/
[01;34mdiana_bert-base-uncased_A[0m/  [01;34mDiana_distilbert-base-uncased_A[0m/


In [2]:
#install the following first
!pip install transformers==2.11.0 --quiet
!pip install tensorflow==2.2.0 --quiet
!pip install tensorboardX --quiet
!pip install simpletransformers --quiet

  Building wheel for sacremoses (setup.py) ... done
  Building wheel for seqeval (setup.py) ... done


### Setup NVIDIA APEX

Tool to enable mixed precision training. More info here: https://github.com/NVIDIA/apex

In [3]:
%%writefile setup.sh
git clone https://github.com/NVIDIA/apex
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./apex

Writing setup.sh


In [4]:
#this will take 10mins to run
import timeit
start = timeit.default_timer()

!sh setup.sh --quiet

stop = timeit.default_timer()
print('Time: ', stop - start)  

Cloning into 'apex'...
remote: Enumerating objects: 151, done.
remote: Counting objects: 100% (151/151), done.
remote: Compressing objects: 100% (96/96), done.
remote: Total 7163 (delta 112), reused 87 (delta 55), pack-reused 7012
Receiving objects: 100% (7163/7163), 13.83 MiB | 25.43 MiB/s, done.
Resolving deltas: 100% (4830/4830), done.
  cmdoptions.check_install_build_global(options)
Created temporary directory: /tmp/pip-ephem-wheel-cache-sxcvqfv3
Created temporary directory: /tmp/pip-req-tracker-f1j2vfw5
Created requirements tracker '/tmp/pip-req-tracker-f1j2vfw5'
Created temporary directory: /tmp/pip-install-q_dy0jdy
Processing ./apex
  Created temporary directory: /tmp/pip-req-build-jtuhedas
  Added file:///content/apex to build tracker '/tmp/pip-req-tracker-f1j2vfw5'
    Running setup.py (path:/tmp/pip-req-build-jtuhedas/setup.py) egg_info for package from file:///content/apex
    Running command python setup.py egg_info


    torch.__version__  = 1.5.0+cu101


    r

### Import Packages

In [0]:
#Import packages
from os.path import join
import numpy as np 
import pandas as pd 
from apex import amp
#from glob import glob
#import os
#from random import random
#from pathlib import Path
import json
#import torch
#from transformers import AutoModel, AutoTokenizer, BertTokenizer, AutoModelForQuestionAnswering
#from transformers import TFBertModel, BertModel, DistilBertModel, XLNetModel, RobertaModel
#from tensorboardX import SummaryWriter
#from torch.utils.data import DataLoader, RandomSampler, SequentialSampler, TensorDataset
#from transformers import AdamW, get_linear_schedule_with_warmup

#from IPython.core.interactiveshell import InteractiveShell
#InteractiveShell.ast_node_interactivity = "all"

#from os.path import join


use_cuda = True ##If True, GPU will be used

### Load the Data



Before running the command below, make sure you have:
- Created a *'tweet-sentiment-extraction'* folder inside the *'Colab Notebooks'* directory
- Uploaded the *train.csv* and *test.csv* files to the *'tweet-sentiment-extraction'* folder 

Finally, make sure you have a folder called *'models'* inside the *'tweet-sentiment-extraction'* directory
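
The checklist above can be verified programmatically before loading anything. The following is a minimal sketch, not part of the original workflow: `missing_inputs` is a hypothetical helper, and the default names simply mirror the files and folder listed above.

```python
from pathlib import Path

def missing_inputs(base_dir, required=("train.csv", "test.csv", "models")):
    """Return the required names that do not exist under base_dir (hypothetical helper)."""
    base = Path(base_dir)
    return [name for name in required if not (base / name).exists()]

# Example call (the path is the one used in this notebook and only exists after mounting Drive):
# missing_inputs('/gdrive/My Drive/Colab Notebooks/tweet-sentiment-extraction')
```

An empty list means everything is in place; otherwise the returned names tell you what still needs to be uploaded or created.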

In [0]:
train_df = pd.read_csv('/gdrive/My Drive/Colab Notebooks/tweet-sentiment-extraction/train.csv')
test_df = pd.read_csv('/gdrive/My Drive/Colab Notebooks/tweet-sentiment-extraction/test.csv')



#sub_df = pd.read_csv('/gdrive/My Drive/Colab Notebooks/tweet-sentiment-extraction/sample_submission.csv') #Optional

### Prepare the Data

Split into train and validation sets

In [0]:
from sklearn.model_selection import train_test_split

train_df, val_df = train_test_split(train_df, test_size=0.2, random_state = 42)

In [0]:
#drop selected_text column from the validation dataset (it will be later compared to the ground truth)
val_df_new = val_df.drop('selected_text', axis=1)

In [0]:
print(train_df.shape)
print(val_df_new.shape)
print(test_df.shape)

In [0]:
train = np.array(train_df)
val = np.array(val_df_new)
test = np.array(test_df)

### Initiate the SimpleTransformers Task



The SimpleTransformers library supports numerous tasks:  


- Sequence Classification
- Token Classification (NER)
- Question Answering
- Language Model Fine-Tuning
- Language Model Training
- Language Generation
- T5 Model
- Seq2Seq Tasks
- Multi-Modal Classification
- Conversational AI

In this case, we are performing a <ins>Question&Answer</ins> task

In [0]:
# Import the Question&Answering model
from simpletransformers.question_answering import QuestionAnsweringModel

### Format the data under the SimpleTransformer's *Question&Answer* schema 



To input the dataset, we need to map each column to a specific input:
- Context: The entire tweet
- Question: The sentiment (positive, negative or neutral). In other words, we are asking *\"What part of the entire tweet best represents this sentiment?\"*
- Answer: The label, i.e. the extracted text **TODO (for validation and test, a placeholder is used because the answer is the model's prediction)**

The formatted data is assigned to the variables *qa_train, qa_val* and *qa_test* respectively
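
To make the mapping concrete, here is a minimal sketch of a single record in this schema. `make_qa_record` is a hypothetical helper written for illustration only; unlike the functions in the cells below, it lowercases the tweet before searching for the answer, and it does not handle the case where the answer is not found (`str.find` would return -1).

```python
def make_qa_record(text_id, tweet, sentiment, selected_text=None):
    """Build one question-answering record from a tweet row (illustrative helper)."""
    context = tweet.lower()
    if selected_text is None:
        # No label available (validation/test): use the same placeholder as the notebook
        answers = [{'answer_start': 1000000, 'text': '__None__'}]
    else:
        answer = selected_text.lower()
        answers = [{'answer_start': context.find(answer), 'text': answer}]
    qas = [{'question': sentiment, 'id': text_id,
            'is_impossible': False, 'answers': answers}]
    return {'context': context, 'qas': qas}

# Labelled (training-style) record
record = make_qa_record('8a24c189a8', 'Is leaving Utah today Super Sad Face',
                        'negative', 'Super Sad Face')
# Unlabelled (validation/test-style) record
unlabelled = make_qa_record('8a24c189a8', 'Is leaving Utah today Super Sad Face',
                            'negative')
```

Each record holds one context and a list of questions about it; here every tweet carries exactly one question, its sentiment.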



In [11]:
#@title Create list for training

## Adapted from https://www.kaggle.com/cheongwoongkang/roberta-baseline-starter-simple-postprocessing
def find_all(input_str, search_str):
    l1 = []
    length = len(input_str)
    index = 0
    while index < length:
        i = input_str.find(search_str, index)
        if i == -1:
            return l1
        l1.append(i)
        index = i + 1
    return l1

def do_qa_train(train):

    output = []
    for line in train:
        context = line[1]

        qas = []
        question = line[-1]
        qid = line[0]
        answers = []
        answer = line[2]
        if type(answer) != str or type(context) != str or type(question) != str:
            print(context, type(context))
            print(answer, type(answer))
            print(question, type(question))
            continue
        answer_starts = find_all(context, answer)
        for answer_start in answer_starts:
            answers.append({'answer_start': answer_start, 'text': answer.lower()})
            break
        qas.append({'question': question, 'id': qid, 'is_impossible': False, 'answers': answers})

        output.append({'context': context.lower(), 'qas': qas})
        
    return output

qa_train = do_qa_train(train)


nan <class 'float'>
nan <class 'float'>
neutral <class 'str'>
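
The three lines printed above come from a training row whose text and selected text are missing (NaN is a float, not a string), so `do_qa_train` reports it and skips it. A minimal sketch of the same type check on toy rows, assuming the column order used in this notebook (textID, text, selected_text, sentiment); `is_complete` and the sample rows are illustrative only:

```python
import math

# Toy rows in the notebook's column order: textID, text, selected_text, sentiment
rows = [
    ['id1', 'good morning everyone', 'good', 'positive'],
    ['id2', math.nan, math.nan, 'neutral'],  # missing text, like the row printed above
]

def is_complete(row):
    """True when text, selected_text and sentiment are all strings."""
    return all(isinstance(value, str) for value in (row[1], row[2], row[-1]))

kept = [row for row in rows if is_complete(row)]
skipped = [row[0] for row in rows if not is_complete(row)]
```

Only complete rows end up in the training list, so the number of training records can be one lower than the number of rows in *train_df*.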


In [0]:
#@title Create val list
## Adapted from https://www.kaggle.com/cheongwoongkang/roberta-baseline-starter-simple-postprocessing
def do_qa_val(val):
    output = []
    for line in val:
        context = line[1]
        qas = []
        question = line[-1]
        qid = line[0]
        if type(context) != str or type(question) != str:
            # Skip rows with missing text or sentiment (no answer exists here to print)
            print(context, type(context))
            print(question, type(question))
            continue
        answers = []
        answers.append({'answer_start': 1000000, 'text': '__None__'})
        qas.append({'question': question, 'id': qid, 'is_impossible': False, 'answers': answers})
        output.append({'context': context.lower(), 'qas': qas})
    return output

qa_val = do_qa_val(val)

In [0]:
#@title Create test list
## Adapted from https://www.kaggle.com/cheongwoongkang/roberta-baseline-starter-simple-postprocessing
def do_qa_test(test):
    output = []
    for line in test:
        context = line[1]
        qas = []
        question = line[-1]
        qid = line[0]
        if type(context) != str or type(question) != str:
            # Skip rows with missing text or sentiment (no answer exists here to print)
            print(context, type(context))
            print(question, type(question))
            continue
        answers = []
        answers.append({'answer_start': 1000000, 'text': '__None__'})
        qas.append({'question': question, 'id': qid, 'is_impossible': False, 'answers': answers})
        output.append({'context': context.lower(), 'qas': qas})
    return output

qa_test = do_qa_test(test)

### Create a Logging Module --> More info [here](https://realpython.com/python-logging/)


Logs act as an extra set of eyes that constantly watch the flow an application goes through. They can store information such as which user or IP accessed the application.  

With the logging module imported, you can use a “logger” to record the messages you want to see. By default, there are 5 standard levels indicating the severity of events, in increasing order:
- DEBUG
- INFO
- WARNING
- ERROR
- CRITICAL

In this case, the root logger is set to INFO and the *transformers* logger to WARNING
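
The effect of combining two levels can be seen in a small sketch. The logger names `demo_library` and `other_module` are hypothetical stand-ins, and `force=True` (Python 3.8+) is used here only so the example behaves the same when logging was already configured:

```python
import logging

# Root logger at INFO; force=True replaces any handlers configured earlier
logging.basicConfig(level=logging.INFO, force=True)

# A named logger can be made stricter than the root logger
demo_logger = logging.getLogger("demo_library")  # hypothetical name
demo_logger.setLevel(logging.WARNING)

demo_logger.info("suppressed: INFO is below this logger's WARNING threshold")
demo_logger.warning("emitted: WARNING meets the threshold")

# Loggers without their own level inherit INFO from the root logger
other_logger = logging.getLogger("other_module")  # hypothetical name
other_logger.info("emitted: inherits the root level")
```

This is why the notebook still shows INFO messages from SimpleTransformers while the more verbose output of the underlying transformers library is reduced to warnings and errors.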

In [0]:
import logging

logging.basicConfig(level=logging.INFO)
transformers_logger = logging.getLogger("transformers")
transformers_logger.setLevel(logging.WARNING)

### Train a SimpleTransformers Model **<ins>OR</ins>** Load an Existing Richardson Model  

Select the option that best applies to your case.  


>#### Option 1: Train a SimpleTransformers Model

>1. Create a folder that will contain the new model's PyTorch and hyperparameter files. Follow the instructions below to name the *'NAME_OF_MODEL'* folder:


>>>**Basic Structure:**

>>>\<Name>_\<Model>_\<Version>  

>>>>Where:
- Name: Your name
- Model: Based on the model names used in the official Transformers site: https://huggingface.co/transformers/pretrained_models.html
- Version: For notebooks with same name and model but different hyperparameters, include the version (A, B, C...)
  
  >>>>Examples:
  - Lucas_distilroberta-base_A
  - Lucas_distilroberta-base_B
  - Landis_bert_A  

>2. Follow **OPTION 1** section



---



  >#### Option 2: Load an Existing Richardson Model

  >1. Under *'NAME_OF_MODEL'*, enter the name of the model folder you want to load
  >2. Skip *OPTION 1* and follow **OPTION 2** section





In [0]:
# Change this BEFORE RUNNING *********************************************************************************************
YOUR_NAME = 'diana'
YOUR_LETTER = 'A'     # identify your model A,B,C,D,E...
MODEL_ARCHITECTURE = 'bert'
MODEL_NAME = 'bert-base-uncased'
# ************************************************************************************************************************

NAME_OF_MODEL = YOUR_NAME + '_' + MODEL_NAME + '_' + YOUR_LETTER 

#Don't change this:
ROOT = '/gdrive/My Drive/Colab Notebooks/tweet-sentiment-extraction/models' 
FULL_PATH = join(ROOT, NAME_OF_MODEL)

# OPTION 1: Load, train and evaluate a SimpleTransformers Pre-trained Model

#### Next Steps
At this point, you should be saving your work:

1. Save a copy of this notebook in GitHub with the same name you used under *'NAME_OF_MODEL'* 
2. Go to the Experiments project and add a note with the following info


>*   Name you enter under NAME_OF_MODEL
>*   Jaccard Score (once you have it)
>*   List of arguments (you'll find them in SECTION 2 under <ins>**args_train**</ins>)



Supported model types for Question&Answering:

- ALBERT
- BERT
- DistilBERT
- ELECTRA
- XLM
- XLNet

Related link: https://huggingface.co/transformers/pretrained_models.html

In [0]:
#Change the workspace to the "tweet-sentiment-extraction/models" folder
%cd '{ROOT}'
#It creates the folder where the model components will be saved. If you have a folder with the same name, it will give you an error
%mkdir '{NAME_OF_MODEL}' 
#Change the workspace to the recently created folder
%cd '{FULL_PATH}' 

/gdrive/My Drive/Colab Notebooks/tweet-sentiment-extraction/models
/gdrive/My Drive/Colab Notebooks/tweet-sentiment-extraction/models/lucas_bert-base-cased_A


In [0]:
#For more arguments, refer to this link --> https://simpletransformers.ai/docs/usage/#configuring-a-simple-transformers-model

args_train={'reprocess_input_data': True,
'overwrite_output_dir': True,
'learning_rate': 5e-5,
'num_train_epochs': 1,
'max_seq_length': 192,
'doc_stride': 64,
'fp16': False,
}

#Instantiate the model (training happens in the next cell)
model = QuestionAnsweringModel(MODEL_ARCHITECTURE, MODEL_NAME, args=args_train, use_cuda=use_cuda)

In [0]:
#Train the model
import timeit
start = timeit.default_timer()

model.train_model(qa_train)

stop = timeit.default_timer()
print('Time: ', stop - start)  

In [0]:
#Predict the evaluation and test sets
predictions_val = model.predict(qa_val)
predictions_test = model.predict(qa_test)


In [0]:
#@title --------TODO Output with highest prob - Val and Test

#Validation Set highest probability output
predictions_df_val = pd.DataFrame.from_dict(predictions_val)
text_val = pd.DataFrame(predictions_val[0])
prob_val = pd.DataFrame(predictions_val[1])
prop1_val = prob_val['probability'].tolist()
prop2_val = pd.DataFrame(prop1_val)
text1_val = text_val['answer'].tolist()
text2_val = pd.DataFrame(text1_val)

#Test highest probability output
predictions_df_test = pd.DataFrame.from_dict(predictions_test)
text_test = pd.DataFrame(predictions_test[0])
prob_test = pd.DataFrame(predictions_test[1])
prop1_test = prob_test['probability'].tolist()
prop2_test = pd.DataFrame(prop1_test)
text1_test = text_test['answer'].tolist()
text2_test = pd.DataFrame(text1_test)
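
The DataFrame steps above boil down to taking the first candidate answer, and its probability, for every example. A minimal sketch on mock data: the structure (a list of answer dicts and a list of probability dicts) is inferred from how the cell indexes `predictions_val[0]` and `predictions_val[1]`, and the assumption that candidates are ordered from most to least probable is not confirmed by this notebook.

```python
# Mock of the two lists that the cell above reads as predictions[0] and predictions[1]
mock_answers = [
    {'id': 'a1', 'answer': ['sad face', 'super sad face']},
    {'id': 'b2', 'answer': ['glad', 'glad i went out']},
]
mock_probabilities = [
    {'id': 'a1', 'probability': [0.7, 0.2]},
    {'id': 'b2', 'probability': [0.9, 0.05]},
]

# Keep only the first candidate per example (assumed to be the most probable one)
top_answers = [item['answer'][0] for item in mock_answers]
top_probs = [item['probability'][0] for item in mock_probabilities]
```

`top_answers` plays the role of column 0 of *text2_val* and *text2_test*, which is what gets written to *selected_text_results* in the following cells.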

In [0]:
# Make a copy of the validation and test sets so that we are not modifying the original sets
sub_val_df = val_df.copy()
sub_test_df = test_df.copy()

In [0]:
#Add the predicted result to the copied data frames 
sub_val_df['selected_text_results'] = text2_val[0].values
sub_test_df['selected_text_results'] = text2_test[0].values

In [0]:
# Check head of dataset
sub_test_df.head()

## Save trained model arguments and other files

In [0]:
"""from google.colab import files
sub_val_df.to_csv('sub_val.csv') 
files.download('sub_val.csv')
sub_test_df.to_csv('sub_test.csv') 
files.download('sub_test.csv')
train_df.to_csv("new_train_df")"""

In [0]:
#This line creates a JSON file that is required when loading the model
with open('args_train.json', 'w') as fp: 
    json.dump(args_train, fp)
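
A quick way to convince yourself that the saved file can rebuild the same arguments is a round trip through JSON. This sketch uses a temporary folder and a shortened, illustrative copy of the arguments rather than the real model folder:

```python
import json
import os
import tempfile

# Shortened copy of the training arguments, for illustration only
args_demo = {'learning_rate': 5e-5, 'num_train_epochs': 1,
             'max_seq_length': 192, 'doc_stride': 64, 'fp16': False}

# Write to a temporary folder so the real model folder is left untouched
demo_path = os.path.join(tempfile.mkdtemp(), 'args_train.json')
with open(demo_path, 'w') as fp:
    json.dump(args_demo, fp)

# Reading the file back yields an equal dictionary
with open(demo_path) as fp:
    restored_args = json.load(fp)
```

Numbers and booleans survive the round trip unchanged, which is what OPTION 2 relies on when it reads *args_train.json*.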

#### TODO To assess via Jaccard Score, please refer to the last part of this notebook

# OPTION 2: Load and Evaluate a Richardson Pre-Trained Model

In [16]:

ROOT = '/gdrive/My Drive/Colab Notebooks/tweet-sentiment-extraction/models' #Don't change

FULL_PATH = join(ROOT, NAME_OF_MODEL)

#Change the workspace to the model folder
%cd '{FULL_PATH}' 

#Load the model's arguments list (required to setup the existing model) 
with open('args_train.json') as json_file: 
    train_args = json.load(json_file) 

/gdrive/My Drive/Colab Notebooks/tweet-sentiment-extraction/models/diana_bert-base-uncased_A


#### Setup loaded model

Supported model types for Question&Answering:

- ALBERT
- BERT
- DistilBERT
- ELECTRA
- XLM
- XLNet

Related link: https://huggingface.co/transformers/pretrained_models.html

In [0]:
loaded_model = QuestionAnsweringModel(MODEL_ARCHITECTURE, 'outputs/', args=train_args, use_cuda=use_cuda)

In [18]:
predictions_val = loaded_model.predict(qa_val)
predictions_test = loaded_model.predict(qa_test)

INFO:simpletransformers.question_answering.question_answering_model: Converting to features started.
convert squad examples to features: 100%|██████████| 5497/5497 [00:04<00:00, 1151.29it/s]
add example index and unique id: 100%|██████████| 5497/5497 [00:00<00:00, 775072.75it/s]






INFO:simpletransformers.question_answering.question_answering_model: Converting to features started.
convert squad examples to features: 100%|██████████| 3534/3534 [00:03<00:00, 1114.55it/s]
add example index and unique id: 100%|██████████| 3534/3534 [00:00<00:00, 726779.62it/s]






In [0]:
#@title Output with highest prob - Val and Test
#val
predictions_df_val = pd.DataFrame.from_dict(predictions_val)
text_val = pd.DataFrame(predictions_val[0])
prob_val = pd.DataFrame(predictions_val[1])
prop1_val = prob_val['probability'].tolist()
prop2_val = pd.DataFrame(prop1_val)
text1_val = text_val['answer'].tolist()
text2_val = pd.DataFrame(text1_val)
#test
predictions_df_test = pd.DataFrame.from_dict(predictions_test)
text_test = pd.DataFrame(predictions_test[0])
prob_test = pd.DataFrame(predictions_test[1])
prop1_test = prob_test['probability'].tolist()
prop2_test = pd.DataFrame(prop1_test)
text1_test = text_test['answer'].tolist()
text2_test = pd.DataFrame(text1_test)

In [0]:
sub_val_df = val_df.copy()
sub_test_df = test_df.copy()

In [36]:
sub_val_df

Unnamed: 0,textID,text,selected_text,sentiment
1588,a7f72a928a,WOOOOOOOOOO are you coming to Nottingham at...,t? lovelovelove,positive
23879,ef42dee96c,resting had a whole day of walking,resting had a whole day of walking,neutral
6561,07d17131b1,"was in Palawan a couple of days ago, i`ll try ...","was in Palawan a couple of days ago, i`ll try ...",neutral
2602,2820205db5,I know! I`m so slow its horrible. DON`T TELL ...,horrible.,negative
4003,7d3ce4363c,"Glad I went out, glad I didn`t leave early, an...",glad,positive
...,...,...,...,...
616,4ccb7ef67c,"either way, you always tend to make my #follo...",you do rock that much,positive
4504,33231252fd,That`s cause Ovie is the one man team and whe...,That`s cause Ovie is the one man team and when...,neutral
9887,8a24c189a8,Is leaving Utah today Super Sad Face,Super Sad Face,negative
19734,21e7f12c06,I love it when it rains on me when im golfing,love it wh,positive


In [0]:
#create files to export 
sub_val_df['selected_text_results'] = text2_val[0].values
sub_test_df['selected_text_results'] = text2_test[0].values

In [0]:
from google.colab import files
sub_val_df.to_csv('sub_val_loaded.csv') 
files.download('sub_val_loaded.csv')
sub_test_df.to_csv('sub_test_loaded.csv') 
files.download('sub_test_loaded.csv')
train_df.to_csv("new_train_df_loaded")

In [38]:
sub_val_df

Unnamed: 0,textID,text,selected_text,sentiment,selected_text_results
1588,a7f72a928a,WOOOOOOOOOO are you coming to Nottingham at...,t? lovelovelove,positive,lovelovelove<3
23879,ef42dee96c,resting had a whole day of walking,resting had a whole day of walking,neutral,resting had a whole day of walking
6561,07d17131b1,"was in Palawan a couple of days ago, i`ll try ...","was in Palawan a couple of days ago, i`ll try ...",neutral,"was in palawan a couple of days ago, i`ll try ..."
2602,2820205db5,I know! I`m so slow its horrible. DON`T TELL ...,horrible.,negative,horrible.
4003,7d3ce4363c,"Glad I went out, glad I didn`t leave early, an...",glad,positive,glad
...,...,...,...,...,...
616,4ccb7ef67c,"either way, you always tend to make my #follo...",you do rock that much,positive,you do rock that much
4504,33231252fd,That`s cause Ovie is the one man team and whe...,That`s cause Ovie is the one man team and when...,neutral,that`s cause ovie is the one man team and when...
9887,8a24c189a8,Is leaving Utah today Super Sad Face,Super Sad Face,negative,sad face
19734,21e7f12c06,I love it when it rains on me when im golfing,love it wh,positive,i love


# Jaccard Score Evaluation

In [0]:
#Define the Jaccard Score function
def jaccard(str1, str2): 
    a = set(str1.lower().split()) 
    b = set(str2.lower().split())
    c = a.intersection(b)
    return float(len(c)) / (len(a) + len(b) - len(c))
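
To see how the score behaves, here is the same function applied to a few ground-truth and prediction pairs taken from the validation table shown earlier. The function is repeated so the sketch runs on its own:

```python
def jaccard(str1, str2):
    """Word-level Jaccard similarity, identical to the function above."""
    a = set(str1.lower().split())
    b = set(str2.lower().split())
    c = a.intersection(b)
    return float(len(c)) / (len(a) + len(b) - len(c))

# Exact match: both word sets are identical
exact = jaccard("horrible.", "horrible.")                  # 1 / (1 + 1 - 1) = 1.0
# Partial match: 2 shared words out of 3 distinct words
partial = jaccard("Super Sad Face", "sad face")            # 2 / (3 + 2 - 2)
# One shared word ("love") out of 4 distinct words
weak = jaccard("love it wh", "i love")                     # 1 / (3 + 2 - 1) = 0.25
# No shared word: "lovelovelove" and "lovelovelove<3" are different tokens
none = jaccard("t? lovelovelove", "lovelovelove<3")        # 0 / (2 + 1 - 0) = 0.0
```

Because the comparison is made on whole whitespace-separated tokens, a prediction that differs by a single attached character scores zero for that word. Note also that the function divides by zero when both strings contain no words.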

In [0]:
#Make a copy of the original sets and reset indexes
df=sub_val_df.copy()
df = df.reset_index()

In [60]:
df.head()

Unnamed: 0,index,textID,text,selected_text,sentiment,selected_text_results
0,1588,a7f72a928a,WOOOOOOOOOO are you coming to Nottingham at...,t? lovelovelove,positive,lovelovelove<3
1,23879,ef42dee96c,resting had a whole day of walking,resting had a whole day of walking,neutral,resting had a whole day of walking
2,6561,07d17131b1,"was in Palawan a couple of days ago, i`ll try ...","was in Palawan a couple of days ago, i`ll try ...",neutral,"was in palawan a couple of days ago, i`ll try ..."
3,2602,2820205db5,I know! I`m so slow its horrible. DON`T TELL ...,horrible.,negative,horrible.
4,4003,7d3ce4363c,"Glad I went out, glad I didn`t leave early, an...",glad,positive,glad


In [64]:
#Obtain JS for the entire set
results = []
for i in range(len(df)):
    score = jaccard(df['selected_text'].iloc[i], df['selected_text_results'].iloc[i])
    results.append(score)
    
Jaccard_score = sum(results) / len(results)
Jaccard_score

0.7004602310828985