# <center> Sentiment Analysis </center>
We assess the classification accuracy of a well-tuned base BERT Transformer built with TensorFlow.<br> We will use the [Rotten Tomatoes movie reviews dataset](https://www.kaggle.com/c/sentiment-analysis-on-movie-reviews/data) for the analysis.<br>
This exercise is performed across three phases:
<ol>
    <li> Data Transformation - Data Loading + EDA + Tokenization </li>
    <li> Model Building and Training </li>
    <li> Prediction </li>
</ol>

### Part 3 - Prediction

In [4]:
import tensorflow as tf
import numpy as np
import pandas as pd
from transformers import BertTokenizer

#### Load test data

In [5]:
df_test=pd.read_csv('test.tsv', delimiter='\t')

In [6]:
df_test.head()

Unnamed: 0,PhraseId,SentenceId,Phrase
0,156061,8545,An intermittently pleasing but mostly routine ...
1,156062,8545,An intermittently pleasing but mostly routine ...
2,156063,8545,An
3,156064,8545,intermittently pleasing but mostly routine effort
4,156065,8545,intermittently pleasing but mostly routine


The test data does not include sentiment labels; these are what we will predict with our model.

#### Data Cleaning

Kaggle expects a simple CSV containing the PhraseId and the Sentiment.
From the Kaggle website:
    <i>"We expect the solution file to have 66292 prediction rows. This file should have a header row."</i>

In [7]:
df_test.describe()

Unnamed: 0,PhraseId,SentenceId
count,66292.0,66292.0
mean,189206.5,10114.909144
std,19136.99636,966.787807
min,156061.0,8545.0
25%,172633.75,9266.0
50%,189206.5,10086.0
75%,205779.25,10941.0
max,222352.0,11855.0
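Since Kaggle expects one prediction per PhraseId, it can help to confirm that the IDs are unique and that no phrase is missing before tokenizing. The sketch below is illustrative only: `summarize_test_frame` is a hypothetical helper, and `toy_test` is a made-up stand-in for `df_test` with deliberately bad rows.

```python
import pandas as pd

def summarize_test_frame(df: pd.DataFrame) -> dict:
    """Return basic integrity counts for a frame with PhraseId and Phrase columns."""
    return {
        "n_rows": len(df),
        "n_missing_phrases": int(df["Phrase"].isna().sum()),
        "n_duplicate_ids": int(df["PhraseId"].duplicated().sum()),
    }

# Toy stand-in for df_test (made-up values, not the real data).
# It contains one missing phrase and one repeated PhraseId on purpose.
toy_test = pd.DataFrame({
    "PhraseId": [1, 2, 3, 3],
    "SentenceId": [1, 1, 1, 2],
    "Phrase": ["An", "intermittently pleasing", None, "routine effort"],
})

summary = summarize_test_frame(toy_test)
print(summary)
```

On the real data, the same helper could be called as `summarize_test_frame(df_test)`; both of the last two counts would be expected to be zero.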


#### Tokenization


In [5]:
tokenizer=BertTokenizer.from_pretrained('bert-base-cased')

In [9]:
tokens=tokenizer(df_test['Phrase'].tolist(),max_length=512,padding='max_length',add_special_tokens=True,truncation=True,return_tensors='tf')

In [10]:
tokens.keys()

dict_keys(['input_ids', 'token_type_ids', 'attention_mask'])

In [11]:
dict_token={'ids':tokens['input_ids'],'masks':tokens['attention_mask']}
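To make the roles of `input_ids` and `attention_mask` concrete, here is a minimal NumPy sketch of what `padding='max_length'` with `truncation=True` does conceptually. It is not the tokenizer's actual implementation: `pad_with_mask` is a hypothetical helper, the middle token ID is made up, and real truncation also keeps the closing `[SEP]` token, which this toy version does not.

```python
import numpy as np

def pad_with_mask(token_ids, max_length, pad_id=0):
    """Truncate or right-pad token IDs and build the matching attention mask."""
    ids = list(token_ids)[:max_length]      # naive truncation
    n_real = len(ids)
    padded = np.full(max_length, pad_id, dtype=np.int32)
    padded[:n_real] = ids                   # real tokens first, padding after
    mask = np.zeros(max_length, dtype=np.int32)
    mask[:n_real] = 1                       # 1 = real token, 0 = padding
    return padded, mask

# 101 and 102 are the [CLS] and [SEP] IDs in the BERT vocabulary;
# 1760 is an arbitrary placeholder for a word-piece ID.
toy_ids, toy_mask = pad_with_mask([101, 1760, 102], max_length=8)
print(toy_ids)
print(toy_mask)
```

The mask lets the model ignore padded positions, which is why both arrays are passed to the model together under the input names `ids` and `masks`.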

#### Load Model

In [13]:
model=tf.keras.models.load_model('BERT_sentiment_analysis')

In [14]:
model.summary()

Model: "model_1"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 ids (InputLayer)               [(None, 512)]        0           []                               
                                                                                                  
 masks (InputLayer)             [(None, 512)]        0           []                               
                                                                                                  
 bert (TFBertMainLayer)         TFBaseModelOutputWi  108310272   ['ids[0][0]',                    
                                thPooling(last_hidd               'masks[0][0]']                  
                                en_state=(None, 512                                               
                                , 768),                                                     

In [15]:
test_sentiments=model.predict(dict_token)

In [16]:
sentiments=np.argmax(test_sentiments,axis=1)

In [17]:
len(sentiments)

66292

In [18]:
sentiments

array([3, 1, 2, ..., 1, 2, 1])
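The `np.argmax(..., axis=1)` step picks, for each row, the index of the highest score. The sketch below uses made-up scores and assumes the model emits one score per sentiment class, with classes labelled 0 to 4 as in the Kaggle dataset.

```python
import numpy as np

# Made-up scores for three phrases over five sentiment classes (0 to 4).
toy_scores = np.array([
    [0.05, 0.10, 0.15, 0.60, 0.10],
    [0.10, 0.55, 0.20, 0.10, 0.05],
    [0.05, 0.15, 0.70, 0.05, 0.05],
])

# axis=1 takes the argmax across classes, giving one label per row.
toy_labels = np.argmax(toy_scores, axis=1)

# The winning score itself can serve as a rough confidence value.
toy_confidence = toy_scores.max(axis=1)

print(toy_labels)  # [3 1 2]
```

Using `axis=0` instead would return one index per class column, which is not what is wanted here.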

In [19]:
df_check=pd.DataFrame(sentiments)

In [20]:
df_check.head()

Unnamed: 0,0
0,3
1,1
2,2
3,2
4,1


In [21]:
df_check['PhraseId']=df_test['PhraseId']

In [22]:
df_check.head()

Unnamed: 0,0,PhraseId
0,3,156061
1,1,156062
2,2,156063
3,2,156064
4,1,156065


In [23]:
df_check.columns=['Sentiment','PhraseId']

In [24]:
df_final=df_check.reindex(columns= ['PhraseId','Sentiment'])

In [25]:
df_final

Unnamed: 0,PhraseId,Sentiment
0,156061,3
1,156062,1
2,156063,2
3,156064,2
4,156065,1
...,...,...
66287,222348,1
66288,222349,1
66289,222350,1
66290,222351,2


In [26]:
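# index=False keeps the row index out of the file, leaving only the two expected columns
df_final.to_csv('test_sentiments_final32.csv', index=False)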
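As an alternative to the rename-and-reindex steps, the submission frame can be built in one step and checked by reading the CSV back. This is a sketch with made-up values: `toy_ids` and `toy_sentiments` stand in for `df_test['PhraseId']` and `sentiments`, and an in-memory buffer replaces the file on disk.

```python
import io

import numpy as np
import pandas as pd

# Toy stand-ins for df_test['PhraseId'] and the predicted labels.
toy_ids = pd.Series([156061, 156062, 156063])
toy_sentiments = np.array([3, 1, 2])

# Build the two-column frame directly, in the order Kaggle expects.
submission = pd.DataFrame({"PhraseId": toy_ids, "Sentiment": toy_sentiments})

# Write without the row index so the header is exactly the two column names.
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)

# Read the text back to confirm the header and the row count.
round_trip = pd.read_csv(io.StringIO(csv_text))
print(list(round_trip.columns), len(round_trip))
```

Without `index=False`, pandas writes the row index as an extra unnamed first column, so the file would no longer contain only the PhraseId and the Sentiment.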