In [None]:
import pandas as pd

# Loading the comments dataset
comments_df = pd.read_csv("comments.csv")
sample_size = 300
comment_sample = comments_df.sample(n=sample_size, random_state=42)
comment_sample.to_csv("comment_sample.csv", index=False)

In [None]:
df = pd.read_csv("comment_sample.csv")
label_counts = df["Label"].value_counts()

print(label_counts)


### Sample Labeling for Testing

In preparation for testing my fine-tuned DistilBERT model, I manually labeled a sample of 300 YouTube comments with one of three sentiment classes: 0 for negative, 1 for positive, and 2 for neutral. Because the model was fine-tuned only for positive and negative sentiment, I remove the neutral class before evaluation to focus on the two target sentiments.

The positive and negative classes turned out to be roughly balanced: the sample contains 72 positive and 77 negative comments, and the remaining 151 were labeled neutral. This labeled sample is used to assess how well the model classifies sentiment in YouTube comments.
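
As a minimal sketch of the filtering and balance check described above, the snippet below uses a small made-up DataFrame rather than the real sample. The toy rows and the `label_names` mapping are assumptions for illustration only; the column names `Comment` and `Label` follow the ones used in this notebook.

```python
import pandas as pd

# Hypothetical toy rows standing in for the manually labeled sample
toy = pd.DataFrame({
    "Comment": ["love it", "terrible audio", "first", "great video", "so boring"],
    "Label": [1, 0, 2, 1, 0],
})

# Label scheme used in this notebook: 0 = negative, 1 = positive, 2 = neutral
label_names = {0: "negative", 1: "positive", 2: "neutral"}

# Keep only the two classes the binary model was trained on
binary = toy[toy["Label"].isin([0, 1])].copy()

# Share of each remaining class, to check that the two are roughly balanced
balance = binary["Label"].map(label_names).value_counts(normalize=True)
print(balance)
```

Applied to the counts reported above (72 positive, 77 negative), the same calculation gives roughly 48% positive and 52% negative.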

In [None]:
df_test = df.drop(columns="Comment ID")

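# Dropping neutral comments (label 2); .copy() avoids a SettingWithCopyWarning
# when new columns are added to df_test later
df_test = df_test[df_test["Label"] != 2].copy()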

df_test.head()

In [None]:
import tensorflow as tf
from transformers import DistilBertTokenizer

tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")

# Extracting the comments from the test dataset
youtube_comments = df_test["Comment"].tolist()

# Tokenizing the YouTube comments using the same tokenizer
tokenized_youtube_comments = tokenizer(
    youtube_comments,
    padding=True,
    truncation=True,
    max_length=103,
    return_tensors="tf"
)

# Extracting input IDs and attention mask for the test dataset
input_ids_test = tokenized_youtube_comments["input_ids"]
attention_mask_test = tokenized_youtube_comments["attention_mask"]

# Adding tokenized data to the original DataFrame for the test dataset
df_test["input_ids"] = input_ids_test.numpy().tolist()
df_test["attention_mask"] = attention_mask_test.numpy().tolist()

In [None]:
df_test.head()

In [None]:
# Extracting input IDs and attention masks from the DataFrame
input_ids_test = df_test["input_ids"].tolist()
attention_mask_test = df_test["attention_mask"].tolist()

# Converting lists to TensorFlow tensors
input_ids_tensor = tf.convert_to_tensor(input_ids_test, dtype=tf.int32)
attention_mask_tensor = tf.convert_to_tensor(attention_mask_test, dtype=tf.int32)

# Displaying the shapes of the tensors
print("Input IDs Tensor Shape:", input_ids_tensor.shape)
print("Attention Mask Tensor Shape:", attention_mask_tensor.shape)

In [None]:
model_path = "../NLP_model/best_sentiment_model"

loaded_model = tf.keras.models.load_model(model_path)

In [None]:
type(loaded_model)

The fine-tuned DistilBERT model, trained specifically for binary sentiment classification, is loaded here with its task-specific configuration intact, so it can predict sentiment based on what it learned from similar data.

In the following cells, the loaded model predicts the sentiment of the prepared YouTube comments. The aim is to see how well it discerns sentiment in user-generated content on YouTube.
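
To make the prediction and assessment steps concrete, here is a rough NumPy sketch of turning logits into labels and comparing them with manual labels. The logits and `true_labels` below are made-up values, not model output, and the sketch assumes that output index 0 corresponds to negative and index 1 to positive, matching the labeling scheme in this notebook.

```python
import numpy as np

# Hypothetical logits for five comments (columns: negative, positive)
logits = np.array([
    [2.0, -1.0],
    [0.1, 0.3],
    [-0.5, 1.5],
    [1.2, 0.8],
    [-2.0, 0.0],
])

# Hypothetical manual labels for the same five comments
true_labels = np.array([0, 1, 1, 1, 0])

# Numerically stable softmax over the class axis
shifted = logits - logits.max(axis=-1, keepdims=True)
probs = np.exp(shifted) / np.exp(shifted).sum(axis=-1, keepdims=True)

# Predicted class = index of the highest probability
pred = probs.argmax(axis=-1)

# Fraction of comments where prediction and manual label agree
accuracy = (pred == true_labels).mean()

# 2x2 confusion matrix: rows = true label, columns = predicted label
confusion = np.zeros((2, 2), dtype=int)
np.add.at(confusion, (true_labels, pred), 1)

print("Accuracy:", accuracy)
print(confusion)
```

Because softmax preserves the ordering of the logits, taking the argmax of the logits directly yields the same labels; the probabilities are only needed when a confidence score is of interest.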

In [None]:
print(loaded_model.signatures)

In [None]:
# Making predictions using the best model

infer = loaded_model.signatures["serving_default"]

predictions = infer(
    input_ids=input_ids_tensor,
    attention_mask=attention_mask_tensor
)

logits = predictions["dense"]

# Converting logits to probabilities using softmax
probabilities = tf.nn.softmax(logits, axis=-1)

# Getting the predicted label values
predicted_labels = tf.argmax(probabilities, axis=-1)

print("Predicted Labels:", predicted_labels.numpy())

In [None]:
# Counting how many comments were assigned to each predicted label
unique_labels, _, counts = tf.unique_with_counts(predicted_labels)

for label, count in zip(unique_labels.numpy(), counts.numpy()):
    print(f"Label {label}: {count} occurrences")