<a href="https://colab.research.google.com/github/BalavSha/Natural-Language-Processing/blob/main/Sentiment_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# <center><u>**Sentiment Analysis**</u></center>

## **Import the required libraries and download the dataset**

**-> Install and import the required libraries**

In [None]:
!pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.28.1-py3-none-any.whl (7.0 MB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.3-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)
Collecting huggingface-hub<1.0,>=0.11.0
  Downloading huggingface_hub-0.13.4-py3-none-any.whl (200 kB)
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.13.4 tokenizers-0.13.3 transformers-4.28.1


In [60]:
import os
import pandas as pd
import tensorflow as tf
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.models import Model
from transformers import BertTokenizer, TFBertModel
from sklearn.model_selection import train_test_split

**-> Mount Google Drive to access the Dataset**

In [61]:
# Mount Google Drive to access dataset file
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## **Load and preprocess the dataset**


**-> Load the Dataset from Google Drive**

In [66]:
# Load dataset from CSV file
df = pd.read_csv("/content/drive/MyDrive/Sentiment Analysis/chatgpt_sentiments.csv", index_col=0)

# display first few rows
df.head()

Unnamed: 0,tweets,labels
0,ChatGPT: Optimizing Language Models for Dialog...,neutral
1,"Try talking with ChatGPT, our new AI system wh...",good
2,ChatGPT: Optimizing Language Models for Dialog...,neutral
3,"THRILLED to share that ChatGPT, our new model ...",good
4,"As of 2 minutes ago, @OpenAI released their ne...",bad


**-> Remove rows with missing values, if any**

In [68]:
# Remove rows with missing values
df.dropna(inplace=True)

In [69]:
# Verify that no missing values remain in any column
df.isna().sum()

tweets    0
labels    0
dtype: int64

**-> Convert Sentiment labels to numerical values**

In [72]:
# Encode the sentiment strings as integers: good -> 1, neutral -> 0, bad -> -1
df["labels"] = df["labels"].replace({"good":1, "neutral":0, "bad":-1})

# display first few rows
df.head()

Unnamed: 0,tweets,labels
0,ChatGPT: Optimizing Language Models for Dialog...,0
1,"Try talking with ChatGPT, our new AI system wh...",1
2,ChatGPT: Optimizing Language Models for Dialog...,0
3,"THRILLED to share that ChatGPT, our new model ...",1
4,"As of 2 minutes ago, @OpenAI released their ne...",-1
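
The mapping above yields the labels -1, 0 and 1. If the classification layer is later trained with Keras's `sparse_categorical_crossentropy` loss, the targets need to be integer class indices starting at 0, so a non-negative mapping may be more convenient. The sketch below is an assumption about that later step, not part of the original notebook; `LABEL_TO_ID`, `ID_TO_LABEL` and `encode_labels` are hypothetical names.

```python
# Hypothetical alternative mapping: class indices 0..2 instead of -1/0/1.
LABEL_TO_ID = {"bad": 0, "neutral": 1, "good": 2}
ID_TO_LABEL = {idx: name for name, idx in LABEL_TO_ID.items()}


def encode_labels(labels):
    """Map sentiment strings to class indices, failing loudly on unknown labels."""
    unknown = sorted(set(labels) - set(LABEL_TO_ID))
    if unknown:
        raise ValueError(f"Unexpected labels: {unknown}")
    return [LABEL_TO_ID[label] for label in labels]


# Labels of the first five rows shown in the table above.
sample = ["neutral", "good", "neutral", "good", "bad"]
encoded = encode_labels(sample)
print(encoded)  # [1, 2, 1, 2, 0]
```

On the DataFrame itself, `df["labels"].map(LABEL_TO_ID)` would apply the same mapping; note that `map` produces `NaN` for any label missing from the dictionary instead of raising an error.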


**-> Tokenize the text data using the BERT tokenizer**

In [73]:
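# Load the WordPiece tokenizer that matches the bert-base-uncased checkpoint
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Encode each tweet as a list of token ids, wrapped in [CLS] (101) and [SEP] (102)
df['tokens'] = df['tweets'].apply(lambda x: tokenizer.encode(x, add_special_tokens=True))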

# display first few rows
df.head()

Unnamed: 0,tweets,labels,tokens
0,ChatGPT: Optimizing Language Models for Dialog...,0,"[101, 11834, 21600, 2102, 1024, 23569, 27605, ..."
1,"Try talking with ChatGPT, our new AI system wh...",1,"[101, 3046, 3331, 2007, 11834, 21600, 2102, 10..."
2,ChatGPT: Optimizing Language Models for Dialog...,0,"[101, 11834, 21600, 2102, 1024, 23569, 27605, ..."
3,"THRILLED to share that ChatGPT, our new model ...",1,"[101, 16082, 2000, 3745, 2008, 11834, 21600, 2..."
4,"As of 2 minutes ago, @OpenAI released their ne...",-1,"[101, 2004, 1997, 1016, 2781, 3283, 1010, 1030..."


**-> Pad or Truncate the tokenized sequences to a fixed length**

In [74]:
# Pad with 0 (BERT's [PAD] token id) or truncate so every sequence has exactly 128 ids
max_length = 128

# [0] * n is an empty list when n <= 0, so one expression covers both padding and truncation
df['tokens'] = df['tokens'].apply(lambda x: x[:max_length] + [0] * (max_length - len(x)))

# display first few rows
df.head()

Unnamed: 0,tweets,labels,tokens
0,ChatGPT: Optimizing Language Models for Dialog...,0,"[101, 11834, 21600, 2102, 1024, 23569, 27605, ..."
1,"Try talking with ChatGPT, our new AI system wh...",1,"[101, 3046, 3331, 2007, 11834, 21600, 2102, 10..."
2,ChatGPT: Optimizing Language Models for Dialog...,0,"[101, 11834, 21600, 2102, 1024, 23569, 27605, ..."
3,"THRILLED to share that ChatGPT, our new model ...",1,"[101, 16082, 2000, 3745, 2008, 11834, 21600, 2..."
4,"As of 2 minutes ago, @OpenAI released their ne...",-1,"[101, 2004, 1997, 1016, 2781, 3283, 1010, 1030..."
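
Padded ids alone do not tell BERT which positions hold real tokens; the model also accepts an attention mask with 1 for real tokens and 0 for padding. Plain slicing also drops the closing `[SEP]` id (102) whenever a sequence is longer than `max_length`. The following is a minimal pure-Python sketch that handles both cases. `pad_or_truncate` is a hypothetical helper, and it assumes every input list already ends with `[SEP]`, as the output of `tokenizer.encode(..., add_special_tokens=True)` does.

```python
PAD_ID = 0    # [PAD] id in the bert-base-uncased vocabulary
SEP_ID = 102  # [SEP] id in the bert-base-uncased vocabulary


def pad_or_truncate(token_ids, max_length=128):
    """Return (input_ids, attention_mask), both exactly max_length long.

    Assumes max_length >= 1 and that token_ids ends with SEP_ID.
    """
    ids = list(token_ids[:max_length])
    # If truncation cut off the closing [SEP], put it back in the last slot.
    if len(token_ids) > max_length:
        ids[-1] = SEP_ID
    n_real = len(ids)
    n_pad = max_length - n_real
    input_ids = ids + [PAD_ID] * n_pad
    attention_mask = [1] * n_real + [0] * n_pad
    return input_ids, attention_mask


short_ids, short_mask = pad_or_truncate([101, 3046, 3331, 102], max_length=6)
long_ids, long_mask = pad_or_truncate([101, 7, 8, 9, 10, 11, 102], max_length=5)
print(short_ids, short_mask)  # [101, 3046, 3331, 102, 0, 0] [1, 1, 1, 1, 0, 0]
print(long_ids, long_mask)    # [101, 7, 8, 9, 102] [1, 1, 1, 1, 1]
```

In practice, calling the tokenizer directly with `padding="max_length"`, `truncation=True` and `max_length=128` should produce both `input_ids` and `attention_mask` in one step, which avoids maintaining a helper like this.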


## **Split the dataset into training, validation, and testing sets**
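
One common way to obtain three subsets is to call scikit-learn's `train_test_split` twice: first to hold out a temporary set, then to divide that set into validation and test halves. The sketch below uses toy lists in place of the tokenized tweets, and the 80/10/10 ratio, the `random_state` value, and the stratification are illustrative assumptions rather than the notebook's own settings.

```python
from sklearn.model_selection import train_test_split

# Toy stand-ins for the tokenized tweets and their labels (assumed data).
features = [[i, i + 1] for i in range(100)]
labels = [i % 3 for i in range(100)]

# First split: 80% train, 20% held out, keeping class proportions similar.
X_train, X_temp, y_train, y_temp = train_test_split(
    features, labels, test_size=0.2, random_state=42, stratify=labels
)

# Second split: divide the held-out 20% evenly into validation and test.
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, random_state=42, stratify=y_temp
)

print(len(X_train), len(X_val), len(X_test))  # 80 10 10
```

Stratifying on the labels keeps the share of each sentiment class roughly equal across the three subsets, which can matter when one class is much rarer than the others.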

## **Load a pre-trained BERT model and add a classification layer on top**

## **Train the model on the training set**

## **Evaluate the model on the validation set**

## **Fine-tune the model by adjusting the hyperparameters**

## **Evaluate the final model on the testing set**

## **Save the model for future use**