<a href="https://colab.research.google.com/github/Avilez-dev-11/Projects-in-ML-AI/blob/main/homework4.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Task 2 (75 points):**
In this task, you will pick a dataset (time-series or any other form of
sequential data) and an associated problem that can be solved via sequence models. You must
describe why you need sequence models to solve this problem. Include a link to the dataset
source. Next, you should pick an RNN framework that you would use to solve this problem (This
framework can be in TensorFlow, PyTorch or any other Python Package).

In [None]:
# Libraries
import numpy as np
import pandas as pd
import tensorflow as tf
import matplotlib.pyplot as plt

df = pd.read_csv("/kaggle/input/chatgpt-sentiment-analysis/file.csv")
SEED = 5555

In [None]:
df.head()

Unnamed: 0.1,Unnamed: 0,tweets,labels
0,0,ChatGPT: Optimizing Language Models for Dialog...,neutral
1,1,"Try talking with ChatGPT, our new AI system wh...",good
2,2,ChatGPT: Optimizing Language Models for Dialog...,neutral
3,3,"THRILLED to share that ChatGPT, our new model ...",good
4,4,"As of 2 minutes ago, @OpenAI released their ne...",bad


In [None]:
df.isnull().sum()

Unnamed: 0    0
tweets        0
labels        0
dtype: int64

The dataset contains no null values, so no imputation or row removal is needed before training.

In [None]:
df = df.drop(columns = ["Unnamed: 0"])
df

Unnamed: 0,tweets,labels
0,ChatGPT: Optimizing Language Models for Dialog...,neutral
1,"Try talking with ChatGPT, our new AI system wh...",good
2,ChatGPT: Optimizing Language Models for Dialog...,neutral
3,"THRILLED to share that ChatGPT, our new model ...",good
4,"As of 2 minutes ago, @OpenAI released their ne...",bad
...,...,...
219289,Other Software Projects Are Now Trying to Repl...,bad
219290,I asked #ChatGPT to write a #NYE Joke for SEOs...,good
219291,chatgpt is being disassembled until it can onl...,bad
219292,2023 predictions by #chatGPT. Nothing really s...,bad


The leftover column `Unnamed: 0` was dropped. It duplicates the row index, an artifact of how the CSV was saved, and carries no information relevant to the sentiment analysis task.

In [None]:
len(df[df["labels"]== "bad"]) # negative

107796

In [None]:
len(df[df["labels"]== "neutral"]) # neutral

55487

In [None]:
len(df[df["labels"]== "good"]) # positive

56011

Counting the labels reveals a class imbalance: 107,796 negative tweets against 56,011 positive and 55,487 neutral. To balance the classes before training, the negative class was downsampled: negative tweets were randomly removed until that class matched the size of the smallest class (neutral, 55,487 tweets).

In [None]:
from sklearn.utils import resample

# Index labels of the negative tweets, and the size of the neutral class
negative_indices = df[df['labels'] == 'bad'].index
num_neutral = len(df[df['labels'] == 'neutral'])

# Number of negative tweets to remove so that 'bad' matches 'neutral' in size
num_negative_to_remove = len(negative_indices) - num_neutral

# Put the candidate indices in a DataFrame so they can be passed to resample
negatives_to_remove = negative_indices.to_frame(name='index')

# Randomly pick the negative tweets to drop (only if 'bad' is the larger class)
if num_negative_to_remove > 0:
    rows_to_drop = resample(negatives_to_remove,
                            replace=False,  # Sample without replacement
                            n_samples=num_negative_to_remove,
                            random_state=42)  # For reproducibility

    # Drop the selected rows from the original DataFrame
    df = df.drop(rows_to_drop['index'])

# Check the new distribution
print("Sentiment distribution after undersampling:")
print(df['labels'].value_counts())

Sentiment distribution after undersampling:
labels
good       56011
neutral    55487
bad        55487
Name: count, dtype: int64


In [None]:
df = df[df["labels"] != "neutral"].copy()  # remove neutral tweets; copy so later assignments do not act on a view
len(df)

111498

In [None]:
df["labels"] = df["labels"].map({"bad": 0, "good": 1})  # 0 = negative, 1 = positive


The dataset originally labels ChatGPT tweets as bad, neutral, or good. Because the goal is to separate clearly negative from clearly positive sentiment, the neutral tweets were removed and the remaining labels were encoded as 0 (bad) and 1 (good). After this filtering, 111,498 tweets remain, which is still ample data for training.

**Part 1 (30 points):** Implement your RNN either using an existing framework OR you can
implement your own RNN cell structure. In either case, describe the structure of your
RNN and the activation functions you are using for each time step and in the output
layer. Define a metric you will use to measure the performance of your model (NOTE:
Performance should be measured both for the validation set and the test set).

In [None]:
df

Unnamed: 0,tweets,labels
1,"Try talking with ChatGPT, our new AI system wh...",1
3,"THRILLED to share that ChatGPT, our new model ...",1
4,"As of 2 minutes ago, @OpenAI released their ne...",0
5,"Just launched ChatGPT, our new AI system which...",1
7,ChatGPT coming out strong refusing to help me ...,1
...,...,...
219285,Podcast returns in 2023! 🐈🌙\n.\n#ai #chatgpt #...,0
219286,There's now an open source alternative to Chat...,1
219287,One of my new favorite thing to do with #ChatG...,1
219290,I asked #ChatGPT to write a #NYE Joke for SEOs...,1


In [None]:
import re

def clean_comment(comment: str) -> str:
    """
    Cleans text by removing every character that is not a letter, digit,
    whitespace, or basic punctuation (. ! ? , and parentheses). The "#" and
    "@" symbols are removed, but the hashtag or mention text itself is kept.

    Args:
        comment: The text to be cleaned.

    Returns:
        The cleaned text.
    """

    # Regular expression matching runs of characters outside the allowed set
    pattern = r"[^a-zA-Z0-9\s\.!?,\(\)]+"

    # Delete every matched run
    cleaned_text = re.sub(pattern, "", comment)

    # Remove leading and trailing whitespace
    cleaned_text = cleaned_text.strip()

    return cleaned_text

# Apply the cleaning function to the "tweets" column
df["tweets"] = df["tweets"].apply(clean_comment)
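
As a quick check of what the cleaning step keeps and drops, the sketch below applies the same regular expression to two sample strings. The strings are made up for illustration (they are not rows from the dataset), and `clean` is a hypothetical stand-in for `clean_comment`. The second string contains a literal backslash followed by `n`, which is one plausible explanation for fragments such as `n.nai` in the cleaned output: the backslash is removed while the letter is kept.

```python
import re

# Same character whitelist as in clean_comment
PATTERN = r"[^a-zA-Z0-9\s\.!?,\(\)]+"

def clean(text: str) -> str:
    """Drop every character outside the whitelist, then trim whitespace."""
    return re.sub(PATTERN, "", text).strip()

# Hypothetical sample tweets, made up for illustration
hashtag_example = clean("I asked #ChatGPT to write a #NYE joke, @OpenAI!")
# "\\n" is a literal backslash plus the letter n, not a newline character
escaped_newline_example = clean("Podcast returns in 2023!\\n.\\n#ai #chatgpt")

print(hashtag_example)
print(escaped_newline_example)
```

Only the `#` and `@` symbols disappear; the tag and mention words stay in the text and end up in the vocabulary.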


In [None]:
df

Unnamed: 0,tweets,labels
1,"Try talking with ChatGPT, our new AI system wh...",1
3,"THRILLED to share that ChatGPT, our new model ...",1
4,"As of 2 minutes ago, OpenAI released their new...",0
5,"Just launched ChatGPT, our new AI system which...",1
7,ChatGPT coming out strong refusing to help me ...,1
...,...,...
219285,Podcast returns in 2023! n.nai chatgpt artific...,0
219286,Theres now an open source alternative to ChatG...,1
219287,One of my new favorite thing to do with ChatGP...,1
219290,I asked ChatGPT to write a NYE Joke for SEOs a...,1


In [None]:
from sklearn.model_selection import train_test_split

In [None]:
def convert_to_tfds(dataframe):

  dataset = tf.data.Dataset.from_tensor_slices((dataframe['tweets'], dataframe['labels']))
  dataset = dataset.shuffle(buffer_size=len(dataframe), seed=0)
  return dataset.batch(64).prefetch(tf.data.AUTOTUNE)

training_set = df.copy()

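# Hold out 10% for validation, then 10% of the remainder for testing
# (roughly an 81% / 10% / 9% train / validation / test split)
train, dev = train_test_split(training_set, test_size=0.1, random_state = 0)
train, test = train_test_split(train, test_size = 0.1, random_state = 0)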

train_ds = convert_to_tfds(train)
valid_ds = convert_to_tfds(dev)
test_ds = convert_to_tfds(test)

In [None]:
encoder = tf.keras.layers.TextVectorization()
encoder.adapt(train_ds.map(lambda text, label: text))

In [None]:
len(encoder.get_vocabulary())

160530

In [None]:
model = tf.keras.Sequential([
        encoder,
        tf.keras.layers.Embedding(
            input_dim = len(encoder.get_vocabulary()),
            output_dim = 64,
            mask_zero = True
        ),
        tf.keras.layers.Bidirectional(tf.keras.layers.SimpleRNN(64, activation='relu')),
        tf.keras.layers.Dense(64, activation='relu'),
        tf.keras.layers.Dense(1)
])

In [None]:
model.compile(loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              optimizer=tf.keras.optimizers.Adam(1e-4),
              metrics=['accuracy'])

In [None]:
history = model.fit(train_ds, epochs=5,
                    validation_data=valid_ds,
                    validation_steps=30)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


In [None]:
# Get Loss and Accuracy of test set
loss, accuracy = model.evaluate(test_ds)

print('Loss:', loss)
print('Accuracy:', accuracy)

Loss: 0.18675735592842102
Accuracy: 0.9422022700309753


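This model is a bidirectional recurrent neural network (BRNN). A `TextVectorization` layer tokenizes the raw tweets, an embedding layer maps each token to a 64-dimensional vector, and a bidirectional `SimpleRNN` with 64 units per direction reads the sequence, applying a ReLU activation at each time step. A 64-unit dense layer with ReLU and a single-unit output layer follow. The output layer has no activation and produces a logit; the loss applies the sigmoid internally (`from_logits=True`). Performance is measured with accuracy.

Training the model required significant computation time, but the training accuracy increased steadily with additional epochs. On the test set the model reached about 94% accuracy with a binary cross-entropy loss of about 0.19 (a loss value, not a percentage). Validation accuracy was computed on only 30 batches (1,920 tweets) per epoch because of `validation_steps=30`, so it is a noisier estimate than the test figure. A training accuracy well above the validation or test accuracy would point to overfitting rather than underfitting.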
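
The recurrence that a bidirectional `SimpleRNN` layer with a ReLU activation computes can be sketched in NumPy as follows. This is a minimal illustration under assumptions, not the Keras implementation: the sizes are toy values, the weights are random, and the names `simple_rnn_forward`, `W_x`, `W_h`, and `b` are hypothetical. Masking and batching are left out.

```python
import numpy as np

def simple_rnn_forward(inputs, W_x, W_h, b):
    """Run a ReLU recurrence over one sequence and return the final hidden state.

    inputs: array of shape (time_steps, input_dim)
    W_x:    input-to-hidden weights, shape (input_dim, units)
    W_h:    hidden-to-hidden weights, shape (units, units)
    b:      bias, shape (units,)
    """
    h = np.zeros(W_h.shape[0])
    for x_t in inputs:
        # h_t = relu(x_t W_x + h_{t-1} W_h + b)
        h = np.maximum(0.0, x_t @ W_x + h @ W_h + b)
    return h

# Toy sizes and random weights (assumptions for illustration only)
rng = np.random.default_rng(0)
time_steps, input_dim, units = 5, 4, 3
sequence = rng.normal(size=(time_steps, input_dim))

# Each direction has its own set of weights
fwd = (rng.normal(size=(input_dim, units)), rng.normal(size=(units, units)), np.zeros(units))
bwd = (rng.normal(size=(input_dim, units)), rng.normal(size=(units, units)), np.zeros(units))

forward_state = simple_rnn_forward(sequence, *fwd)
backward_state = simple_rnn_forward(sequence[::-1], *bwd)  # reads the sequence right to left

# The bidirectional wrapper concatenates the two final states
bidirectional_output = np.concatenate([forward_state, backward_state])
print(bidirectional_output.shape)
```

With 64 units per direction, as in the model above, the concatenated vector passed to the dense layers has 128 entries.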

**Part 2 (35 points):** Update your network from part 1 with first an LSTM and then a GRU
based cell structure (You can treat both as 2 separate implementations). Re-do the
training and performance evaluation. What are the major differences you notice? Why
do you think those differences exist between the 3 implementations (basic RNN, LSTM
and GRU)?

In [None]:
# LSTM Implementation

model = tf.keras.Sequential([
        encoder,
        tf.keras.layers.Embedding(
            input_dim = len(encoder.get_vocabulary()),
            output_dim = 64,
            mask_zero = True
        ),
        tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64, activation='relu')),
        tf.keras.layers.Dense(64, activation='relu'),
        tf.keras.layers.Dense(1)
])

In [None]:
model.compile(loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              optimizer=tf.keras.optimizers.Adam(1e-4),
              metrics=['accuracy'])

In [None]:
history = model.fit(train_ds, epochs=5,
                    validation_data=valid_ds,
                    validation_steps=30)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


In [None]:
# LSTM Implementation - Get Loss and Accuracy of test set
loss, accuracy = model.evaluate(test_ds)

print('Loss:', loss)
print('Accuracy:', accuracy)

Loss: 0.16315485537052155
Accuracy: 0.9532635807991028


In [None]:
# GRU Implementation

model = tf.keras.Sequential([
        encoder,
        tf.keras.layers.Embedding(
            input_dim = len(encoder.get_vocabulary()),
            output_dim = 64,
            mask_zero = True
        ),
        tf.keras.layers.Bidirectional(tf.keras.layers.GRU(64, activation='relu')),
        tf.keras.layers.Dense(64, activation='relu'),
        tf.keras.layers.Dense(1)
])

In [None]:
model.compile(loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              optimizer=tf.keras.optimizers.Adam(1e-4),
              metrics=['accuracy'])

In [None]:
history = model.fit(train_ds, epochs=5,
                    validation_data=valid_ds,
                    validation_steps=30)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


In [None]:
# GRU Implementation - Get Loss and Accuracy of test set
loss, accuracy = model.evaluate(test_ds)

print('Loss:', loss)
print('Accuracy:', accuracy)

Loss: 0.26874470710754395
Accuracy: 0.9251619577407837


**Comparison of RNN Architectures:**

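* **LSTM:** Compared to the base RNN, the LSTM model achieves a lower test loss (0.163 vs. 0.187) and a higher test accuracy (95.3% vs. 94.2%), the best result of the three models.
* **GRU:** The GRU model performs worst of the three, with a test loss of 0.269 and a test accuracy of 92.5%.
* **Training vs. validation:** For both models the validation accuracy surpasses the training accuracy. This does not indicate overfitting, which shows up as the opposite pattern; it is more consistent with a model that is still undertrained, or with noise from validating on only 30 batches per epoch.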

**Potential Contributing Factor:**

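* The wide variation in tweet length (from a single word to long, multi-sentence tweets) might have contributed to the different performance of these three architectures. LSTM and GRU cells use gates to control what is kept in their state, which mitigates the vanishing-gradient problem of the basic RNN on longer sequences, and the LSTM result is consistent with this. The weaker GRU result may reflect the training setup (5 epochs, a learning rate of 1e-4, and a ReLU activation in place of the default tanh) rather than the cell itself.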
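
One concrete difference between the three cells is their size. The sketch below counts the trainable parameters of a single recurrent layer, assuming the standard Keras weight layouts (including the GRU default `reset_after=True`, which keeps two bias vectors per weight set). The function names are hypothetical helpers; the inputs of 64 match the embedding dimension and unit count used above.

```python
def simple_rnn_params(input_dim: int, units: int) -> int:
    # One weight set: input kernel + recurrent kernel + bias
    return units * (input_dim + units + 1)

def lstm_params(input_dim: int, units: int) -> int:
    # Four weight sets: input, forget, and output gates plus the candidate cell state
    return 4 * units * (input_dim + units + 1)

def gru_params(input_dim: int, units: int) -> int:
    # Three weight sets: update gate, reset gate, candidate state.
    # Assumes reset_after=True, which adds a second bias vector per set
    return 3 * units * (input_dim + units + 2)

input_dim, units = 64, 64
for name, count in [("SimpleRNN", simple_rnn_params(input_dim, units)),
                    ("LSTM", lstm_params(input_dim, units)),
                    ("GRU", gru_params(input_dim, units))]:
    # The Bidirectional wrapper trains two copies of the layer, one per direction
    print(f"{name}: {count} per direction, {2 * count} bidirectional")
```

Under these assumptions the LSTM layer has four times the recurrent parameters of the basic RNN and the GRU about three times, which affects both training time and how quickly each model fits the data at a fixed learning rate and epoch count.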

**Part 3 (10 points):** Can you use the traditional feed-forward network to solve the same
problem. Why or why not? (Hint: Can time series data be converted to usual features
that can be used as input to a feed-forward network?)

A traditional feed-forward network can be applied to this problem, but only after each tweet is converted into a fixed-length feature vector, for example word counts (bag of words) or an average of word embeddings. For sentiment analysis tasks involving text, recurrent neural networks (RNNs) are generally preferred because such features discard the inherent sequential nature of language. Here's why:

* **Temporal Dependence:** Sentences and phrases rely heavily on the order and context of words to convey meaning. A feed-forward network given order-free features has no memory of earlier words, so it can miss this crucial aspect.
* **Memory Capability:** RNNs possess a memory mechanism that allows them to retain information from previous words in a sequence. This enables them to analyze the contextual relationships between words and capture how their order influences sentiment.
* **Pattern Recognition:** This memory capability empowers RNNs to identify sequential patterns within text. For example, the phrase "not a good movie" conveys a different sentiment than "a good movie, not." RNNs can use their memory to recognize such patterns and determine the overall sentiment more accurately.

Therefore, considering the temporal dependence of language and the importance of contextual relationships in sentiment analysis, RNNs emerge as a more suitable choice compared to traditional feed-forward networks.
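
The word-order point can be illustrated with a small sketch. It assumes a bag-of-words representation, one common way to turn text into fixed-length features for a feed-forward network; the helper `bag_of_words` is hypothetical. The two phrases from the example above produce identical features, so a network that sees only these counts cannot tell them apart.

```python
import re
from collections import Counter

def bag_of_words(text: str) -> Counter:
    """Count lowercase word tokens, discarding word order and punctuation."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))

first = bag_of_words("not a good movie")
second = bag_of_words("a good movie, not.")

# Both phrases reduce to the same word counts
same_features = first == second
print(same_features)
```

An RNN reads the tokens in order, so the position of "not" relative to "good" remains available to the model.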

# **Task 3 (25 points):**
In this task, use any of the pre-trained word embeddings. The Word2vec embedding link
provided with the lecture notes can be useful to get started. Write your own code/function that
uses these embeddings and outputs cosine similarity and a dissimilarity score for any 2 pair of
words (read as user input). The dissimilarity score should be defined by you. You either can
have your own idea of a dissimilarity score or refer to literature (cite the paper you used). In
either case clearly describe how this score helps determine the dissimilarity between 2 words.

In [None]:
import tensorflow_hub as hub
module_url = "https://tfhub.dev/google/universal-sentence-encoder/4"
embeddings = hub.KerasLayer(module_url)

In [None]:
def simFunction():
  # Read the two words to compare from user input
  x = input('Please enter first word: ')
  y = input('Please enter second word: ')
  # Embed each word with the pre-trained encoder (one vector per input string)
  embed_x = embeddings([x])[0].numpy()
  embed_y = embeddings([y])[0].numpy()
  # Cosine similarity: dot product divided by the product of the vector norms
  similarity = np.inner(embed_x, embed_y)/(np.linalg.norm(embed_x)*np.linalg.norm(embed_y))
  # Dissimilarity score: cosine distance
  dissimilarity = 1 - similarity
  print(f'Cosine similarity of {x} and {y} is {similarity}.')
  print(f'Dissimilarity of {x} and {y} is {dissimilarity}.')

In [None]:
simFunction()

Please enter first word:  happy
Please enter second word:  good


Cosine similarity of happy and good is 0.612677812576294.
Dissimilarity of happy and good is 0.38732218742370605.


In [None]:
simFunction()

Please enter first word:  love
Please enter second word:  hate


Cosine similarity of love and hate is 0.5902369618415833.
Dissimilarity of love and hate is 0.40976303815841675.


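The dissimilarity score used here is the cosine distance, defined as 1 minus the cosine similarity. Cosine similarity measures the angle between two embedding vectors and ignores their lengths, so the distance is 0 for vectors pointing in the same direction, 1 for orthogonal vectors, and 2 for opposite vectors; a larger score means the two words are less alike. The embeddings here come from the Universal Sentence Encoder, and the score reflects how related two words are in usage rather than whether they agree in meaning: the antonyms love and hate score 0.41, close to the 0.39 of happy and good.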
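
The range of the score can be checked on toy vectors, without downloading any embedding model. The sketch below is a minimal illustration; the function names are hypothetical, and the two-dimensional vectors stand in for real embeddings.

```python
import numpy as np

def cosine_similarity(u, v) -> float:
    """Cosine of the angle between two non-zero vectors."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def cosine_distance(u, v) -> float:
    """Dissimilarity score: 1 minus the cosine similarity."""
    return 1.0 - cosine_similarity(u, v)

# Toy vectors chosen to hit the three reference points of the score
same_direction = cosine_distance([1.0, 2.0], [2.0, 4.0])   # parallel vectors
orthogonal = cosine_distance([1.0, 0.0], [0.0, 1.0])       # vectors at a right angle
opposite = cosine_distance([1.0, 2.0], [-1.0, -2.0])       # vectors pointing apart

print(same_direction, orthogonal, opposite)
```

Scaling a vector does not change the score, which is why `[1, 2]` and `[2, 4]` count as identical under this measure.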