# Introduction


In this project we'll perform sentiment analysis on the Rotten Tomatoes dataset, which is attached in this repo.

The main task is multi-class text classification from the Movie Reviews competition. The dataset contains 156,060 phrases, each of which has to be classified into one of 5 classes. The sentiment labels are:

0 → Negative <br>
1 → Somewhat negative <br>
2 → Neutral <br>
3 → Somewhat positive <br>
4 → Positive <br>



We will compare the performance of several models and determine which works best.

# Steps to be followed

1. Importing necessary libraries
2. Opening the train and test datasets as pandas DataFrames and performing exploratory data analysis on the train data
3. Performing pre-processing
4. Splitting the train data into train and validation sets (the test set is already given)
5. Applying different models <br>
   a) BERT <br>
   b) RoBERTa (Robustly Optimized BERT Pre-training Approach) <br>
6. Comparing performance of different models
7. Final Analysis

### Step 1. Importing necessary libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
## other libraries will be imported as and when required

### Step 2. Opening the dataset and performing EDA

In [None]:
train_df = pd.read_csv('../input/sentimentdata/train.tsv/train.tsv' , sep='\t')
test_df = pd.read_csv('../input/sentimentdata/test.tsv/test.tsv' , sep = '\t')

In [None]:
train_df.head()

In [None]:
print(train_df.shape)
#print(train_df.info)
print(train_df.columns)
print(train_df.isnull().sum())

In [None]:
train_df['Sentiment'].value_counts()


In [None]:
# Map the numeric labels to readable names, then count each class
label_names = {0: 'Negative', 1: 'Somewhat negative', 2: 'Neutral',
               3: 'Somewhat positive', 4: 'Positive'}
sentiment_counts = train_df['Sentiment'].map(label_names).value_counts()
# Plotting the Series directly avoids depending on the column names that
# reset_index() produces, which differ between pandas versions
sentiment_counts.plot(kind='pie', title='Pie chart of Sentiment Class',
          autopct='%1.1f%%', shadow=False, legend = False, fontsize=14, figsize=(12,12))

***Insights*** <br>
The classes are imbalanced, so a plain random split is not appropriate. We'll use <tt>**StratifiedShuffleSplit()**</tt> to keep the class distribution the same in every split.

In [None]:
f, (ax1, ax2, ax3, ax4, ax5) = plt.subplots(1,5,figsize=(25,8))
ax1.hist(train_df[train_df['Sentiment'] == 0]['Phrase'].str.split().map(lambda x: len(x)), bins=50, color='b')
ax1.set_title('Negative Reviews')

ax2.hist(train_df[train_df['Sentiment'] == 1]['Phrase'].str.split().map(lambda x: len(x)), bins=50, color='r')
ax2.set_title('Somewhat Negative Reviews')

ax3.hist(train_df[train_df['Sentiment'] == 2]['Phrase'].str.split().map(lambda x: len(x)), bins=50, color='g')
ax3.set_title('Neutral Reviews')

ax4.hist(train_df[train_df['Sentiment'] == 3]['Phrase'].str.split().map(lambda x: len(x)), bins=50, color='y')
ax4.set_title('Somewhat Positive Reviews')

ax5.hist(train_df[train_df['Sentiment'] == 4]['Phrase'].str.split().map(lambda x: len(x)), bins=50, color='k')
ax5.set_title('Positive Reviews')

f.suptitle('Histogram number of words in reviews')

In [None]:
train_df['Phrase'].str.split().map(lambda x: len(x)).max()

***Insights*** <br>
Through these graphs we can see that most phrases, in every class, are short: roughly 5-20 words. The longest phrase has 52 words.
If we tokenized by word, a `max_length` of 52 would cover every phrase. Transformers use sub-word tokenization, however, so the token count can exceed the word count and reach 60 or more depending on the words used. We have to account for this when modeling, because longer sequences make training significantly slower, so we need a trade-off between training time and performance.
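
To make the word-versus-token gap concrete, here is a minimal sketch of greedy longest-match sub-word splitting in the style of WordPiece. The five-entry `TOY_VOCAB` and the helper names `wordpiece` and `encode` are assumptions made up for illustration; the real BERT tokenizer has a far larger vocabulary and extra normalisation steps.

```python
# Toy illustration only: a hypothetical five-entry vocabulary, not BERT's real one.
TOY_VOCAB = {"an", "film", "un", "##forget", "##table"}

def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of a single word into sub-word pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it
        while end > start:
            candidate = word[start:end] if start == 0 else "##" + word[start:end]
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:  # no piece fits: the whole word becomes unknown
            return [unk]
        pieces.append(match)
        start = end
    return pieces

def encode(phrase, vocab):
    """Split on whitespace, then into sub-words, and add the two special tokens."""
    tokens = ["[CLS]"]
    for word in phrase.lower().split():
        tokens.extend(wordpiece(word, vocab))
    return tokens + ["[SEP]"]

phrase = "An unforgettable film"
tokens = encode(phrase, TOY_VOCAB)
print(len(phrase.split()), "words ->", len(tokens), "tokens:", tokens)
```

Three words become seven tokens here: one word splits into three pieces and two special tokens are added. This is why `max_length` should be chosen from token counts rather than word counts.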

In [None]:

# Word count of every phrase
phrase_lengths = train_df['Phrase'].str.split().map(len)

# How many phrases reach each length threshold (the comparison is >=)
for threshold in (20, 30, 40, 50):
    print(f'Number of sentences which contain {threshold} or more words: ', (phrase_lengths >= threshold).sum())
    print(' ')

print('Number of sentences which contain 52 words: ', (phrase_lengths == 52).sum())
print(' ')

***Insights*** <br>
We can remove sentences with 40 or more words. There are very few of them, so they won't contribute much to training, and removing them lets us use a shorter sequence length, which speeds up computation.
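
One way to weigh a cutoff is to look at the share of phrases it keeps. The sketch below does this on a made-up list of word counts; `coverage` and `toy_lengths` are hypothetical names used only for this example.

```python
import numpy as np

def coverage(lengths, cutoff):
    """Fraction of phrases whose word count is strictly below `cutoff`."""
    lengths = np.asarray(lengths)
    return float((lengths < cutoff).mean())

# Made-up word counts standing in for the real phrase lengths
toy_lengths = [3, 5, 7, 8, 12, 15, 19, 24, 41, 52]

for cutoff in (20, 30, 40, 50):
    print(f"cutoff {cutoff}: keeps {coverage(toy_lengths, cutoff):.0%} of phrases")
```

On the notebook's data the same function could be called as `coverage(train_df['Phrase'].str.split().map(len), 40)`.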

In [None]:
train_df['len'] = train_df['Phrase'].str.split().map(lambda x: len(x))
print(train_df.shape)

train_df = train_df[train_df['len'] <40 ]
print(train_df.shape)

In [None]:
156060 - 155708

### Step 4. Splitting the filtered data into train and validation sets (the test set is already provided)

In [None]:
train_df['Sentiment'].value_counts()

The imbalance is still present after filtering, so we use <tt>**StratifiedShuffleSplit()**</tt> rather than a plain random split to keep the class distribution the same in both sets.

In [None]:
from sklearn.model_selection import StratifiedShuffleSplit
X = train_df.drop('Sentiment',axis=1)
y = train_df['Sentiment']
# A single stratified split is enough; 10% of the rows go to validation
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.1, random_state=0)

train_index, val_index = next(sss.split(X , y))
X_train = X.iloc[train_index]
y_train = y.iloc[train_index]
X_val = X.iloc[val_index]
y_val = y.iloc[val_index]

In [None]:
print('Train distribution')
print(y_train.value_counts())
print(y_train.shape[0])
print("\n")
print('Val distribution')
print(y_val.value_counts())
print(y_val.shape[0])
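
The raw counts above can be hard to compare by eye. As a rough sketch, class proportions can be compared directly; the example runs on made-up imbalanced labels, and `class_proportions`, `toy_X` and `toy_y` are hypothetical names for this illustration only.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

def class_proportions(labels, n_classes=5):
    """Share of samples in each class, as an array of length n_classes."""
    counts = np.bincount(labels, minlength=n_classes)
    return counts / counts.sum()

# Made-up imbalanced labels: the middle class dominates
toy_y = np.repeat([0, 1, 2, 3, 4], [10, 30, 80, 30, 10])
toy_X = np.zeros((len(toy_y), 1))  # features are irrelevant for the split itself

splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.1, random_state=0)
train_idx, val_idx = next(splitter.split(toy_X, toy_y))

full = class_proportions(toy_y)
val = class_proportions(toy_y[val_idx])
max_gap = np.abs(full - val).max()

print("full data :", np.round(full, 3))
print("validation:", np.round(val, 3))
print("largest gap in class share:", round(float(max_gap), 4))
```

A gap close to zero indicates that the validation set mirrors the class balance of the full data.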

In [None]:
from tensorflow.keras.utils import to_categorical
y_train = to_categorical(y_train)
y_val = to_categorical(y_val)
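
`to_categorical` turns each integer label into a one-hot row, which is the target format that `CategoricalCrossentropy` expects. The sketch below reproduces the idea with NumPy on made-up labels. It is an illustration of the encoding, not Keras's implementation.

```python
import numpy as np

# Made-up integer labels, one per sample
labels = np.array([0, 2, 4, 1, 3])

# Row i of the identity matrix is the one-hot vector for class i
one_hot = np.eye(5, dtype="float32")[labels]
print(one_hot)

# argmax along the class axis recovers the original integer labels
recovered = one_hot.argmax(axis=1)
print(recovered)
```

The `argmax` step is the same operation used later to turn one-hot targets and model outputs back into class indices.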

### Step 5. Applying Transformer Model
### a) BERT

In [None]:
from tensorflow.keras.layers import Input, Dropout, Dense
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.initializers import TruncatedNormal
from tensorflow.keras.losses import CategoricalCrossentropy
from tensorflow.keras.metrics import CategoricalAccuracy

from tensorflow.keras.callbacks import ReduceLROnPlateau , ModelCheckpoint , EarlyStopping

import pandas as pd


In [None]:
from transformers import TFBertModel,  BertConfig, BertTokenizerFast

model_name = 'bert-base-uncased'
max_length = 45

config = BertConfig.from_pretrained(model_name)
bert_tokenizer = BertTokenizerFast.from_pretrained(pretrained_model_name_or_path=model_name , config=config)
transformer_bert_model = TFBertModel.from_pretrained(model_name , config = config)

In [None]:
# Positional access: label 0 is only guaranteed to exist before rows are filtered
sample_text = train_df['Phrase'].iloc[0]
print(sample_text)
print(bert_tokenizer(sample_text))

#### Building the model

In [None]:


input_ids = Input(shape = (max_length,) , name = 'input_ids' , dtype = 'int32')

#transformer_bert_model.trainable = False
# Use the Transformers BERT model as a layer in a Keras model.
# Index 1 of its output is the pooled [CLS] representation.
# Note: only input_ids are passed, so no attention mask is applied to padding.
pooled = transformer_bert_model(input_ids)[1]

# training=False keeps this dropout layer inactive, even during fit()
dropout = Dropout(config.hidden_dropout_prob, name='pooled_output')
pooled_output = dropout(pooled, training=False)

# Classification head: one logit per sentiment class (no softmax here)
Sentiments = Dense(units=5, kernel_initializer=TruncatedNormal(stddev=config.initializer_range), name='Sentiment')(pooled_output)
outputs = Sentiments
bert_model = Model(inputs=input_ids, outputs=outputs, name='Bert-SentimentNetwork')


In [None]:
bert_model.summary()

In [None]:
optimizer = Adam(learning_rate=5e-05,epsilon=1e-08,decay=0.01,clipnorm=1.0)
loss = {'Sentiment': CategoricalCrossentropy(from_logits = True)}
bert_model.compile(optimizer = optimizer, loss = loss, metrics = ['accuracy'])



# padding='max_length' pads every phrase to exactly max_length tokens, so the
# arrays always match the Input(shape=(max_length,)) layer of the model.
# padding=True would only pad to the longest phrase in each call.
x_train = bert_tokenizer(
          text=X_train['Phrase'].to_list(),
          add_special_tokens=True,
          max_length=max_length,
          truncation=True,
          padding='max_length',
          return_tensors='tf',
          return_token_type_ids = False,
          return_attention_mask = True,
          verbose = True)

x_val = bert_tokenizer(
          text=X_val['Phrase'].to_list(),
          add_special_tokens=True,
          max_length=max_length,
          truncation=True,
          padding='max_length',
          return_tensors='tf',
          return_token_type_ids = False,
          return_attention_mask = True,
          verbose = True)


In [None]:
# callbacks = [
#     EarlyStopping(patience=5),
#     ReduceLROnPlateau(factor=0.3, patience=3, min_lr=0.00001 ),
#     ModelCheckpoint('bert_model.h5')
# ]

In [None]:
# Fit the model
history = bert_model.fit(
    x=x_train['input_ids'],
    y= y_train,
    validation_data=(x_val['input_ids'], y_val),
    batch_size=256,
    epochs=10,
    )

In [None]:
y_val_pred = bert_model.predict(x_val['input_ids'])



In [None]:
y_val_pred.shape
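
Because the `Sentiment` layer has no activation and the loss uses `from_logits = True`, the values returned by `predict` are raw logits rather than probabilities. The following is a minimal NumPy sketch of converting logits to probabilities with a softmax. The `toy_logits` array is made up for illustration and stands in for `y_val_pred`.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax; subtracting the row maximum keeps exp() numerically stable."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

# Made-up logits for two phrases and five classes
toy_logits = np.array([[2.0, 0.5, -1.0, 0.0, 1.0],
                       [-0.3, 0.1, 3.2, 0.4, -2.0]])

probs = softmax(toy_logits)
print(np.round(probs, 3))
print("predicted classes:", probs.argmax(axis=1))
```

Softmax preserves the ordering within each row, so taking `argmax` on the logits gives the same predicted class as taking it on the probabilities.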

In [None]:
y_val_pred_max = np.argmax(y_val_pred , axis = 1)
y_val_gt_max = np.argmax(y_val , axis = 1)

print(y_val_pred_max.shape)
print(y_val_gt_max.shape)


In [None]:
from sklearn.metrics import classification_report,confusion_matrix
# classification_report expects the true labels first, then the predictions
report = classification_report(y_val_gt_max, y_val_pred_max)

print(report)

In [None]:
import seaborn as sns
# Rows are true labels, columns are predicted labels; fmt='d' shows whole counts
sns.heatmap(confusion_matrix(y_val_gt_max , y_val_pred_max) , annot=True, fmt='d')
plt.show()
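
With imbalanced classes, raw counts in the confusion matrix are dominated by the largest class. A possible complement is to normalise each row so that the diagonal shows per-class recall. The sketch below uses made-up label arrays; `toy_true` and `toy_pred` are hypothetical stand-ins for `y_val_gt_max` and `y_val_pred_max`.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up true and predicted labels for twelve phrases
toy_true = np.array([0, 0, 1, 1, 1, 2, 2, 2, 2, 3, 3, 4])
toy_pred = np.array([0, 1, 1, 1, 2, 2, 2, 2, 3, 3, 2, 4])

# normalize="true" divides each row by the number of true samples in that class
cm = confusion_matrix(toy_true, toy_pred, labels=[0, 1, 2, 3, 4], normalize="true")
per_class_recall = np.diag(cm)

print(np.round(cm, 2))
print("recall per class:", np.round(per_class_recall, 2))
```

The normalised matrix can be passed to `sns.heatmap` in the same way as the count matrix, using a float format such as `fmt='.2f'`.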

### b) RoBERTa

In [None]:
from transformers import RobertaTokenizer, TFRobertaModel, RobertaConfig 

model_name = 'roberta-base'
max_length = 40

config = RobertaConfig.from_pretrained(model_name)
config.output_hidden_states = False

roberta_tokenizer = RobertaTokenizer.from_pretrained(pretrained_model_name_or_path = model_name, config = config)

transformer_roberta_model = TFRobertaModel.from_pretrained(model_name, config = config)

In [None]:
input_ids = Input(shape = (max_length,) , name = 'input_ids' , dtype = 'int32')

#transformer_roberta_model.trainable = False
# Use the Transformers RoBERTa model as a layer in a Keras model.
# Index 1 of its output is the pooled representation of the first token.
# Note: only input_ids are passed, so no attention mask is applied to padding.
pooled = transformer_roberta_model(input_ids)[1]

# training=False keeps this dropout layer inactive, even during fit()
dropout = Dropout(config.hidden_dropout_prob, name='pooled_output')
pooled_output = dropout(pooled, training=False)

# Classification head: one logit per sentiment class (no softmax here)
Sentiments = Dense(units=5, kernel_initializer=TruncatedNormal(stddev=config.initializer_range), name='Sentiment')(pooled_output)
outputs = Sentiments
roberta_model = Model(inputs=input_ids, outputs=outputs, name='RoBERTa_Sentiment')

In [None]:
roberta_model.summary()

In [None]:
optimizer = Adam(learning_rate=5e-05,epsilon=1e-08,decay=0.01,clipnorm=1.0)
loss = {'Sentiment': CategoricalCrossentropy(from_logits = True)}
roberta_model.compile(optimizer = optimizer, loss = loss, metrics = ['accuracy'])

# padding='max_length' pads every phrase to exactly max_length tokens, so the
# arrays always match the Input(shape=(max_length,)) layer of the model.
# padding=True would only pad to the longest phrase in each call.
x_train = roberta_tokenizer(
          text=X_train['Phrase'].to_list(),
          add_special_tokens=True,
          max_length=max_length,
          truncation=True,
          padding='max_length',
          return_tensors='tf',
          return_token_type_ids = False,
          return_attention_mask = True,
          verbose = True)

x_val = roberta_tokenizer(
          text=X_val['Phrase'].to_list(),
          add_special_tokens=True,
          max_length=max_length,
          truncation=True,
          padding='max_length',
          return_tensors='tf',
          return_token_type_ids = False,
          return_attention_mask = True,
          verbose = True)

In [None]:
# Monitor validation accuracy so stopping and checkpointing follow the same metric
callbacks = [
    EarlyStopping(patience=5, verbose=1 , monitor = 'val_accuracy', mode='max'),
    ReduceLROnPlateau(factor=0.3, patience=3, min_lr=0.00001),
    ModelCheckpoint('roberta_model.h5',mode='max', verbose=1,monitor="val_accuracy" ,save_best_only=True, save_weights_only=False)
]

In [None]:
history = roberta_model.fit(
    x=x_train['input_ids'],
    y= y_train,
    validation_data=(x_val['input_ids'], y_val),
    batch_size=256,
    epochs=10,
    #callbacks = callbacks
)

### Step 6. Evaluating performance of both models on val set

In [None]:
# Predict with the RoBERTa model; it takes only the input_ids tensor
y_val_pred = roberta_model.predict(x_val['input_ids'])

In [None]:
y_val_pred.shape

In [None]:
y_val_pred_max = np.argmax(y_val_pred , axis = 1)
y_val_gt_max = np.argmax(y_val , axis = 1)

print(y_val_pred_max.shape)
print(y_val_gt_max.shape)

In [None]:
from sklearn.metrics import classification_report,confusion_matrix
# classification_report expects the true labels first, then the predictions
report = classification_report(y_val_gt_max, y_val_pred_max)

print(report)

In [None]:
import seaborn as sns
# Rows are true labels, columns are predicted labels; fmt='d' shows whole counts
sns.heatmap(confusion_matrix(y_val_gt_max , y_val_pred_max) , annot=True, fmt='d')
plt.show()