<a href="https://colab.research.google.com/github/MatteoGuglielmi-tech/Polarity-and-Subjectivity-Detection/blob/main/src/MyModel/BERT-Fine-Tuning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# BERT Embedding Fine Tuning

In [4]:
!pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.21.3-py3-none-any.whl (4.7 MB)
[K     |████████████████████████████████| 4.7 MB 12.8 MB/s 
Collecting tokenizers!=0.11.3,<0.13,>=0.11.1
  Downloading tokenizers-0.12.1-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.6 MB)
[K     |████████████████████████████████| 6.6 MB 66.5 MB/s 
[?25hCollecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.9.1-py3-none-any.whl (120 kB)
[K     |████████████████████████████████| 120 kB 85.4 MB/s 
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.9.1 tokenizers-0.12.1 transformers-4.21.3


In [5]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [7]:
rootdir = '/content/gdrive/MyDrive/Colab Notebooks/Polarity-Subjectivity-Detection/'

In [1]:
import tensorflow as tf

# Get the GPU device name
device = tf.test.gpu_device_name()

if 'GPU' in device:
  print(f'GPU available : {device}')
else :
  device = "cpu"
  raise SystemError("GPU not found, use CPU instead")

GPU available : /device:GPU:0


In [8]:
import pandas as pd


# loading dataset
movie_reviews = pd.read_csv(rootdir+'movie_rews.csv')
subj_obj_dataset = pd.read_csv(rootdir+'subj_obj_dataset.csv')

In [9]:
movie_reviews

Unnamed: 0.1,Unnamed: 0,text,pos,neg
0,0,films adapted comic books plenty success wheth...,1,0
1,1,every movie comes along suspect studio every i...,1,0
2,2,got mail works alot better deserves order make...,1,0
3,3,jaws rare film grabs attention shows single im...,1,0
4,4,moviemaking lot like general manager nfl team ...,1,0
...,...,...,...,...
1995,1995,anything stigmata taken warning releasing simi...,0,1
1996,1996,john boorman zardoz goofy cinematic debacle fu...,0,1
1997,1997,kids hall acquired taste took least season wat...,0,1
1998,1998,time john carpenter great horror director cour...,0,1


In [10]:
subj_obj_dataset

Unnamed: 0.1,Unnamed: 0,text,tag
0,0,"smart and alert , thirteen conversations about...",subj
1,1,"color , musical bounce and warm seas lapping o...",subj
2,2,it is not a mass-market entertainment but an u...,subj
3,3,a light-hearted french film about the spiritua...,subj
4,4,my wife is an actress has its moments in looki...,subj
...,...,...,...
9995,9995,"in the end , they discover that balance in lif...",obj
9996,9996,a counterfeit 1000 tomin bank note is passed i...,obj
9997,9997,enter the beautiful and mysterious secret agen...,obj
9998,9998,after listening to a missionary from china spe...,obj


### Major commands :
- .tokenize(sent)
- .convert_tokens_to_ids(tokenized_sent)
- .encode.plus() [source](https://huggingface.co/docs/transformers/internal/tokenization_utils#transformers.tokenization_utils_base.PreTrainedTokenizerBase.batch_encode_plus)

In [103]:
# BERT model script from: huggingface.co
from transformers import BertTokenizer, BertModel
from typing import Tuple, List, Dict
import numpy as np


tokenizer = BertTokenizer.from_pretrained("bert-base-uncased", do_lower_case=True)
model = BertModel.from_pretrained("bert-base-uncased")

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [217]:
def embedding(dataset : pd.DataFrame, sentence_column : str) -> Dict:
  #embeddings = {'embedding' : []}
  for idx, sent in enumerate(dataset[sentence_column][:3]):
    print(f"\n{sent}")
    sent_encoding = tokenizer.encode_plus(sent, # untokenized sentence 
                              add_special_tokens = True,  # add '[CLS]' and '[SEP]'
                              truncation = True,  # truncate to maximum length
                              padding = "max_length",  # pad to maximum admissible sentence
                              return_attention_mask = True,  # return attention mask
                              return_tensors = "tf")["input_ids"]  # returns tensorflow constant obj
    if idx ==0:
      df = pd.DataFrame(sent_encoding.numpy().ravel(), columns=['embedding'])
    else :
      df2 = pd.DataFrame(sent_encoding.numpy().ravel(), columns=['embedding'])
      pd.concat([df,df2], axis=1)
    #print(sent_encoding.numpy().flatten())
    #embeddings['embedding'].append(sent_encoding.numpy().flatten())
    #df_embedding = pd.DataFrame.from_dict(embedding)
    #print(embeddings['embedding'])
  return df

In [218]:
rev_emb = embedding(movie_reviews, 'text')


films adapted comic books plenty success whether superheroes batman superman spawn geared toward kids casper arthouse crowd ghost world never reallyneg beenneg aneg comicneg bookneg likeneg fromneg hellneg beforeneg starters created alan moore eddie campbell brought medium whole new level mid 80s 12 part series called watchmen say moore campbell thoroughly researched subject jack ripper would like saying michael jackson starting look little odd book graphic novel 500 pages long includes nearly 30 consist nothing butneg footnotesneg words dismiss film source get past whole comic book thing might find another stumbling block hell directors albert allen hughes getting hughes brothers direct seems almost ludicrous casting carrot top well anything riddle better direct film set ghetto features really violent street crime mad geniuses behind menace ii society ghetto question course whitechapel 1888 london east end filthy sooty place whores called unfortunates starting get little nervous myst

In [219]:
rev_emb

Unnamed: 0,embedding
0,101
1,3152
2,5967
3,5021
4,2808
...,...
507,2015
508,28925
509,3533
510,14913


In [153]:
print(rev_emb['embedding'])

[]


In [141]:
print(len(rev_emb["embedding"]))

512


In [143]:
df_rev_emb = pd.DataFrame.from_dict(rev_emb['embedding'], orient='columns')

In [144]:
df_rev_emb

Unnamed: 0,0
0,101
1,2048
2,2283
3,4364
4,3960
...,...
507,0
508,0
509,0
510,0
