<a href="https://colab.research.google.com/github/BalavSha/Natural-Language-Processing/blob/main/Sentiment_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# <center><u>**Sentiment Analysis**</u></center>

In [2]:
!pip install --upgrade nbformat nbconvert

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


<a href = "https://www.kaggle.com/datasets/charunisa/chatgpt-sentiment-analysis">Link to the Dataset</a>

## **Import the required libraries and download the dataset**

**Import required libraries**

In [None]:
!pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.28.1-py3-none-any.whl (7.0 MB)
Collecting huggingface-hub<1.0,>=0.11.0
  Downloading huggingface_hub-0.13.4-py3-none-any.whl (200 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.3-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.13.4 tokenizers-0.13.3 transformers-4.28.1


In [None]:
import os
import pandas as pd
import numpy as np
import tensorflow as tf
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.models import Model
from transformers import BertTokenizer, TFBertModel
from sklearn.model_selection import train_test_split

**Download/Upload the Dataset**

In [None]:
# Mount Google Drive to access dataset file
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## **Load and preprocess the dataset**


**Load the Dataset from Google Drive**

In [None]:
# Load dataset from CSV file
df = pd.read_csv("/content/drive/MyDrive/Sentiment Analysis/chatgpt_sentiments.csv", index_col=0)

# display first few rows
df.head()

Unnamed: 0,tweets,labels
0,ChatGPT: Optimizing Language Models for Dialog...,neutral
1,"Try talking with ChatGPT, our new AI system wh...",good
2,ChatGPT: Optimizing Language Models for Dialog...,neutral
3,"THRILLED to share that ChatGPT, our new model ...",good
4,"As of 2 minutes ago, @OpenAI released their ne...",bad


**Remove rows with missing values, if there are any**

In [None]:
# Remove rows with missing values
df.dropna(inplace=True)

In [None]:
# check the missing values
df.isna().sum()

tweets    0
labels    0
dtype: int64

In [None]:
# remove rows having neutral sentiment
# .copy() gives an independent DataFrame, so later column assignments do not trigger SettingWithCopyWarning
df = df[df["labels"].isin(["good", "bad"])].copy()
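
Dropping the neutral rows changes the class balance, which is worth checking before training because accuracy can look deceptively high on a skewed dataset. A minimal sketch of such a check, using a small made-up list of labels in place of `df["labels"]` (the helper name `class_balance` is hypothetical and not part of the notebook):

```python
from collections import Counter

def class_balance(labels):
    """Return the share of each label as a fraction of all rows."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}

# Made-up labels for illustration only; in the notebook this
# would be something like df["labels"].tolist()
sample_labels = ["good", "bad", "bad", "good", "bad"]
balance = class_balance(sample_labels)
print(balance)
```

If one class dominates, a stratified split or a metric other than plain accuracy may be a better fit.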

**Convert Sentiment labels to numerical values**

In [None]:
df["labels"] = df["labels"].replace({"good":1, "bad":0})

# display first few rows
df.head()

Unnamed: 0,tweets,labels
1,"Try talking with ChatGPT, our new AI system wh...",1
3,"THRILLED to share that ChatGPT, our new model ...",1
4,"As of 2 minutes ago, @OpenAI released their ne...",0
5,"Just launched ChatGPT, our new AI system which...",1
6,"As of 2 minutes ago, @OpenAI released their ne...",0


**Tokenize the text data using the BERT tokenizer**

In [None]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')


df['tokens'] = df['tweets'].apply(lambda x: tokenizer.encode(x, add_special_tokens=True))

# display first few rows
df.head()

Unnamed: 0,tweets,labels,tokens
1,"Try talking with ChatGPT, our new AI system wh...",1,"[101, 3046, 3331, 2007, 11834, 21600, 2102, 10..."
3,"THRILLED to share that ChatGPT, our new model ...",1,"[101, 16082, 2000, 3745, 2008, 11834, 21600, 2..."
4,"As of 2 minutes ago, @OpenAI released their ne...",0,"[101, 2004, 1997, 1016, 2781, 3283, 1010, 1030..."
5,"Just launched ChatGPT, our new AI system which...",1,"[101, 2074, 3390, 11834, 21600, 2102, 1010, 22..."
6,"As of 2 minutes ago, @OpenAI released their ne...",0,"[101, 2004, 1997, 1016, 2781, 3283, 1010, 1030..."


**Pad or Truncate the tokenized sequences to a fixed length**

In [None]:
# Pad or truncate the tokenized sequences to a fixed length of 128
max_length = 128

# Truncate to max_length, then right-pad shorter sequences with 0 (BERT's [PAD] token ID)
df['tokens'] = df['tokens'].apply(lambda x: x[:max_length] + [0] * max(0, max_length - len(x)))

# display first few rows
df.head()

Unnamed: 0,tweets,labels,tokens
1,"Try talking with ChatGPT, our new AI system wh...",1,"[101, 3046, 3331, 2007, 11834, 21600, 2102, 10..."
3,"THRILLED to share that ChatGPT, our new model ...",1,"[101, 16082, 2000, 3745, 2008, 11834, 21600, 2..."
4,"As of 2 minutes ago, @OpenAI released their ne...",0,"[101, 2004, 1997, 1016, 2781, 3283, 1010, 1030..."
5,"Just launched ChatGPT, our new AI system which...",1,"[101, 2074, 3390, 11834, 21600, 2102, 1010, 22..."
6,"As of 2 minutes ago, @OpenAI released their ne...",0,"[101, 2004, 1997, 1016, 2781, 3283, 1010, 1030..."
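
Two side effects of the slicing approach are easy to miss: a sequence longer than `max_length` loses its trailing `[SEP]` token, and the model later receives no attention mask, so padding positions are treated like real tokens. The sketch below shows one possible way to handle both in plain Python. The helper `pad_or_truncate` and its `keep_sep` flag are hypothetical, and the IDs 0 and 102 are the `[PAD]` and `[SEP]` entries of the `bert-base-uncased` vocabulary. Whether either change affects accuracy on this dataset is not something the notebook measures.

```python
PAD_ID = 0    # [PAD] in the bert-base-uncased vocabulary
SEP_ID = 102  # [SEP] in the bert-base-uncased vocabulary

def pad_or_truncate(ids, max_length, keep_sep=True):
    """Return (ids, attention_mask), each exactly max_length long."""
    ids = list(ids)
    if len(ids) > max_length:
        ids = ids[:max_length]
        if keep_sep:
            # restore the end-of-sequence marker lost by truncation
            ids[-1] = SEP_ID
    # 1 marks a real token, 0 marks padding
    mask = [1] * len(ids) + [0] * (max_length - len(ids))
    ids = ids + [PAD_ID] * (max_length - len(ids))
    return ids, mask

short_ids, short_mask = pad_or_truncate([101, 2023, 102], max_length=5)
long_ids, long_mask = pad_or_truncate([101, 1, 2, 3, 4, 5, 102], max_length=5)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

The mask would only be useful if the model were also given a second input for it; the architecture in this notebook takes `input_ids` alone.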


**Split the dataset into training, validation, and testing sets**

In [None]:
# Split the dataset into training, validation, and test sets
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)

train_df, val_df = train_test_split(train_df, test_size=0.2, random_state=42)
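
Because the second call splits the already reduced training set, the validation share is 20% of the remaining 80%, not 20% of the whole dataset. A short worked calculation of the resulting proportions (the variable names are illustrative only):

```python
from fractions import Fraction

test_size = Fraction(1, 5)  # first split: test_size=0.2 of all rows
val_size = Fraction(1, 5)   # second split: test_size=0.2 of the remainder

test_share = test_size
val_share = (1 - test_size) * val_size
train_share = 1 - test_share - val_share

print(f"train={float(train_share):.2f}, "
      f"val={float(val_share):.2f}, "
      f"test={float(test_share):.2f}")
```

So the data ends up split roughly 64% / 16% / 20%; actual row counts may differ by one because of rounding.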

**Convert the tokenized Sequences to numpy arrays**

In [None]:
# Convert the tokenized sequences to numpy arrays
train_tokens = np.array(train_df['tokens'].tolist())
train_labels = np.array(train_df['labels'].tolist())

val_tokens = np.array(val_df["tokens"].tolist())
val_labels = np.array(val_df["labels"].tolist())

test_tokens = np.array(test_df["tokens"].tolist())
test_labels = np.array(test_df["labels"].tolist())

## **Load a pre-trained BERT model and add a classification layer on top**

In [None]:
# Import necessary libraries
import tensorflow as tf
from transformers import TFBertModel

**Define the function to create Model Architecture**

In [None]:
# Define the model architecture
def create_model():

    # Keras input layer that receives the padded token IDs
    input_ids = tf.keras.layers.Input(shape=(max_length,), dtype=tf.int32, name='input_ids')

    # load the pre-trained BERT model
    bert = TFBertModel.from_pretrained('bert-base-uncased')

    # run BERT on the token IDs; element [0] of the output is the final
    # hidden state of every token, with shape (batch, max_length, 768)
    sequence_output = bert(input_ids)[0]

    # take the hidden state of the first token ([CLS]),
    # which BERT uses as the summary vector for classification tasks
    cls_token = sequence_output[:, 0, :]

    # dense output layer with a sigmoid activation, giving a probability in [0, 1]
    output = tf.keras.layers.Dense(1, activation='sigmoid')(cls_token)

    # build the Keras model from the input layer `input_ids` and the output layer `output`
    model = tf.keras.models.Model(inputs=input_ids, outputs=output)

    return model

## **Train the model on the training set**

**Create an instance of the Model**

In [None]:
model = create_model()
model.summary()

Some layers from the model checkpoint at bert-base-uncased were not used when initializing TFBertModel: ['nsp___cls', 'mlm___cls']
- This IS expected if you are initializing TFBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFBertModel were initialized from the model checkpoint at bert-base-uncased.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions without further training.


Model: "model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_ids (InputLayer)      [(None, 128)]             0         
                                                                 
 tf_bert_model (TFBertModel)  TFBaseModelOutputWithPoo  109482240
                             lingAndCrossAttentions(l            
                             ast_hidden_state=(None,             
                             128, 768),                          
                              pooler_output=(None, 76            
                             8),                                 
                              past_key_values=None, h            
                             idden_states=None, atten            
                             tions=None, cross_attent            
                             ions=None)                          
                                                             

**Compile the Model for Training**

In [None]:
# Compile the model
# define optimizer for the model
optimizer = tf.keras.optimizers.Adam(learning_rate=2e-5)

# define loss function
# the output layer already applies a sigmoid, so the loss receives probabilities, not logits
loss = tf.keras.losses.BinaryCrossentropy(from_logits=False)

# define evaluation metrics
metrics = tf.metrics.BinaryAccuracy()

# compile optimizer, loss function, and evaluation metrics
model.compile(optimizer=optimizer, loss=loss, metrics=[metrics])
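
The `from_logits` flag tells the loss whether the model output is a raw score (a logit) or already a probability. The final `Dense` layer in `create_model` uses a sigmoid activation, so its output is a probability. The sketch below, written in plain Python and not using Keras, shows that the two conventions agree when applied consistently and give a different value when a probability is mistakenly treated as a logit. The function names are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bce_from_probability(p, y):
    """Binary cross-entropy when the model output p is a probability."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def bce_from_logit(z, y):
    """Numerically stable binary cross-entropy when the output z is a logit."""
    return max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))

z, y = 2.0, 1
loss_prob = bce_from_probability(sigmoid(z), y)
loss_logit = bce_from_logit(z, y)
# mixing conventions: a probability passed where a logit is expected
loss_mixed = bce_from_logit(sigmoid(z), y)

print(loss_prob, loss_logit, loss_mixed)
```

The first two values match, while the third does not, which is why the activation of the last layer and the `from_logits` setting should be chosen together.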

In [None]:
# display the compiled model before training
model.summary()


**Train the Model with Training set**

In [None]:
# Train the model
history = model.fit(
    train_tokens, train_labels, 
    validation_data=(val_tokens, val_labels), 
    epochs=3, 
    batch_size=32
    )

Epoch 1/3
Epoch 2/3
Epoch 3/3


## **Save the model for future use**

**Save the Trained Model**

In [None]:
# Save the trained model
model.save("/content/drive/MyDrive/Sentiment_Analysis/saved_model/sentiment_analyzer.h5")

**Load the Trained Model**

> The model contains `TFBertModel`, which is not a built-in Keras layer, so Keras cannot reconstruct it from the saved file on its own.

> To load the saved model, pass `TFBertModel` through the `custom_objects` argument of `load_model`, as below:

```
# Save the model
model.save('my_model.h5')

# Load the model with custom object scope
from transformers import TFBertModel
from tensorflow.keras.models import load_model

custom_objects = {'TFBertModel': TFBertModel.from_pretrained('bert-base-uncased')}
loaded_model = load_model('my_model.h5', custom_objects=custom_objects)
```

In [None]:
# Load the model with custom object scope
from transformers import TFBertModel
from tensorflow.keras.models import load_model

saved_model = "/content/drive/MyDrive/Sentiment_Analysis/saved_model/sentiment_analyzer.h5"

custom_objects = {'TFBertModel': TFBertModel.from_pretrained('bert-base-uncased')}
loaded_model = load_model(saved_model, custom_objects=custom_objects)

Some layers from the model checkpoint at bert-base-uncased were not used when initializing TFBertModel: ['nsp___cls', 'mlm___cls']
- This IS expected if you are initializing TFBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFBertModel were initialized from the model checkpoint at bert-base-uncased.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions without further training.


## **Evaluate the model on the validation set**

In [None]:
# Evaluate the model on the Validation set
val_loss, val_accuracy = loaded_model.evaluate(val_tokens, val_labels, batch_size=32)

print('Validation Loss:', val_loss)
print('Validation Accuracy:', val_accuracy)

Validation Loss: 0.03762894496321678
Validation Accuracy: 0.9877141714096069


## **Evaluate the final model on the testing set**

In [None]:
# Evaluate the model on the Test set
test_loss, test_accuracy = loaded_model.evaluate(test_tokens, test_labels, batch_size=32)

print("Test Loss:", test_loss)
print("Test Accuracy:", test_accuracy)

Test Loss: 0.039910733699798584
Test Accuracy: 0.9889506101608276
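
Accuracy alone does not show how errors are distributed between the two classes. A minimal sketch of a per-class breakdown in plain Python follows. The labels and predictions are made up for illustration; in the notebook they would come from `test_labels` and from thresholding `loaded_model.predict(test_tokens)` at 0.5. The helper `binary_report` is hypothetical.

```python
def binary_report(y_true, y_pred):
    """Confusion counts plus precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"tp": tp, "tn": tn, "fp": fp, "fn": fn,
            "precision": precision, "recall": recall}

# Made-up values for illustration only
y_true = [1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0]
report = binary_report(y_true, y_pred)
print(report)
```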


## **Make a Prediction on New Data**

**Define the function to Preprocess input text for prediction**

In [None]:
# Define a function to preprocess input text
def preprocess_text(text):

    # tokenize the text data
    tokens = tokenizer.encode(text, add_special_tokens=True)

    # truncate to max_length, then right-pad shorter sequences with 0 (BERT's [PAD] token ID)
    tokens = tokens[:max_length] + [0] * max(0, max_length - len(tokens))
    
    return np.array(tokens).reshape(1, -1)

**Define a function to make a prediction on an input text**

In [None]:
# Define a function to make predictions on input text
def predict_sentiment(text):

    # tokenize the input text
    tokens = preprocess_text(text)

    # make a prediction with saved model
    prediction = loaded_model.predict(tokens)[0][0]

    # classify the prediction as "Positive" or "Negative"
    sentiment = "Positive" if prediction >= 0.5 else "Negative"

    return sentiment, prediction
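
The 0.5 cut-off forces every input into one of the two classes, even when the score sits close to the boundary. Since the neutral tweets were removed before training, one possible refinement is to report scores near 0.5 separately. The sketch below assumes a hypothetical helper `label_with_margin`; the `"Uncertain"` label and the default margin of 0.1 are illustrative choices, not part of the notebook.

```python
def label_with_margin(probability, margin=0.1):
    """Map a sigmoid score to a label, leaving a band around 0.5 undecided."""
    if probability >= 0.5 + margin:
        return "Positive"
    if probability <= 0.5 - margin:
        return "Negative"
    return "Uncertain"

for score in (0.9999639, 0.55, 0.1):
    print(score, label_with_margin(score))
```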

**Test the Trained model on some Sample text**

In [None]:
# Test the model on some sample text 1
text = "This movie was great! I really enjoyed it."
sentiment, prediction = predict_sentiment(text)
print('Text:', text)
print('Sentiment:', sentiment)
print('Prediction:', prediction)

Text: This movie was great! I really enjoyed it.
Sentiment: Positive
Prediction: 0.9999639


In [None]:
# Test the model on some sample text 2
text = "This app is informative and helpful for students as well as staff in the campus"
sentiment, prediction = predict_sentiment(text)
print('Text:', text)
print('Sentiment:', sentiment)
print('Prediction:', prediction)

Text: This app is informative and helpful for students as well as staff in the campus
Sentiment: Positive
Prediction: 0.9993185


In [None]:
# Test the model on some sample text 3
text = "Balav is an Amazing Person though he is unpredictable."
sentiment, prediction = predict_sentiment(text)
print('Text:', text)
print('Sentiment:', sentiment)
print('Prediction:', prediction)

Text: Balav is an Amazing Person though he is unpredictable.
Sentiment: Positive
Prediction: 0.99928385


In [None]:
# Test the model on some sample text 4
text = "Balav likes Deep Learning though he isn't interested in Web development"

sentiment, prediction = predict_sentiment(text)

print("Text:", text)
print("Sentiment:", sentiment)
print("Prediction:", prediction)

Text: Balav likes Deep Learning though he isn't interested in Web development
Sentiment: Positive
Prediction: 0.99964786


# <center>**... The End ...**</center>


---

---