# DATASET FOR FINE-TUNING 

To create the dataset for fine-tuning, I used the YouTube Data API to fetch comments from specific videos.

1. **API Initialization**: Import required libraries and establish a connection to the YouTube API using a developer key.

2. **Comment Retrieval Function**:
   - Define `get_video_comments` to fetch comments for a given `video_id`.
   - Retrieve top-level comments and paginate through all available comments, capturing data such as the comment text, like count, reply count, and publication date.

3. **Data Handling**:
   - For each video ID, retrieve comments and store them in separate pandas DataFrames for individual analysis.
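
For step 1, hard-coding the developer key in a notebook makes it easy to leak when the notebook is shared. A minimal sketch of reading it from the environment instead, assuming a variable named `YOUTUBE_API_KEY` and a helper named `load_api_key` (both hypothetical, not part of the original notebook):

```python
import os


def load_api_key(env_var="YOUTUBE_API_KEY"):
    """Read the developer key from an environment variable.

    The variable name is an assumption made for this sketch.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable first")
    return key
```

The returned value could then be passed to `build` as `developerKey=load_api_key()`.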

In [1]:
from googleapiclient.discovery import build
import pandas as pd

# Set up API key and YouTube API service
API_KEY = "YOUR_API"  # replace with your own API key before running
youtube = build("youtube", "v3", developerKey=API_KEY)


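# Fetch every top-level comment of a video, one page at a time
def get_video_comments(video_id, max_results=50):
    """Return a list of dicts (text, likes, replies, date), one per top-level comment.

    max_results is the page size of each request (the API accepts 1-100),
    not a cap on the total: the loop below follows every page.
    """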
    comments = []
    request = youtube.commentThreads().list(
        part="snippet", videoId=video_id, maxResults=max_results, textFormat="plainText"
    )

    while request:
        response = request.execute()
        for item in response["items"]:
            comment_snippet = item["snippet"]["topLevelComment"]["snippet"]
            comment_data = {
                "text": comment_snippet["textDisplay"],
                "likes": comment_snippet.get("likeCount", 0),
                "replies": item["snippet"].get("totalReplyCount", 0),
                "date": comment_snippet["publishedAt"],
            }
            comments.append(comment_data)

        # list_next returns None when there are no more pages, which ends the loop
        request = youtube.commentThreads().list_next(request, response)

    return comments


# list of video IDs:
# 1. "October 7 | Al Jazeera Investigations", '_0atzea-mPY'
# 2. "Hamas militant's bodycam shows how attacks on Israel began", 'nDn10nDnk_k'
# 3. "Investigating war crimes in Gaza I Al Jazeera Investigations", 'kPE6vbKix6A'
# 4. "‘People in Gaza feel abandoned by the world’: 40,000 Palestinians killed as ceasefire talks resume", 'Da_Ll7P5kYU'
# 5. "Gaza towers collapse after explosion", 'c5tWYj_Y60w'
# 6. "Israel-Gaza: At least half of Gaza's buildings damaged or destroyed, new analysis shows | BBC News", 'cONhigj9Po4'

videos = [
    "_0atzea-mPY",
    "nDn10nDnk_k",
    "kPE6vbKix6A",
    "Da_Ll7P5kYU",
    "c5tWYj_Y60w",
    "cONhigj9Po4",
]

# Create separate DataFrames for each video
df_video_1 = pd.DataFrame(get_video_comments(videos[0]))
df_video_2 = pd.DataFrame(get_video_comments(videos[1]))
df_video_3 = pd.DataFrame(get_video_comments(videos[2]))
df_video_4 = pd.DataFrame(get_video_comments(videos[3]))
df_video_5 = pd.DataFrame(get_video_comments(videos[4]))
df_video_6 = pd.DataFrame(get_video_comments(videos[5]))
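
The comment dicts returned by `get_video_comments` do not record which video they came from, so that link is lost as soon as comments from several videos are combined. A minimal sketch of one way to keep it, assuming a hypothetical helper `tag_comments` and a `video_id` field that the original notebook does not have:

```python
def tag_comments(comments, video_id):
    """Return copies of the comment dicts with a 'video_id' key added."""
    return [{**comment, "video_id": video_id} for comment in comments]


# Stand-in for the output of get_video_comments, so the sketch runs offline
sample = [
    {"text": "first", "likes": 0, "replies": 0, "date": "2024-01-01T00:00:00Z"},
    {"text": "second", "likes": 2, "replies": 1, "date": "2024-01-02T00:00:00Z"},
]
tagged = tag_comments(sample, "VIDEO_ID_PLACEHOLDER")
```

In the notebook this would look like `pd.DataFrame(tag_comments(get_video_comments(vid), vid))` for each `vid` in `videos`.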


I then concatenated the comments from all six videos into a single dataset.

In [2]:
all_videos_df = pd.concat(
    [df_video_1, df_video_2, df_video_3, df_video_4, df_video_5, df_video_6],
    ignore_index=True,
)
all_videos_df.head()

Unnamed: 0,text,likes,replies,date
0,HAMAS got turned into HUMmus 😂😂😂😂,0,0,2025-03-01T03:42:43Z
1,Very biased what can you expect from islamic t...,0,0,2025-03-01T01:27:00Z
2,a simple word imbarrassing,0,0,2025-02-28T09:43:00Z
3,Go live and live in these ternnols,0,0,2025-02-27T23:33:36Z
4,When they debate if it’s a baby or 18 year old...,2,0,2025-02-27T10:05:15Z


In [3]:
all_videos_df.shape

(24479, 4)

I shuffled the comments, using a fixed seed so that the order is reproducible.

In [4]:
# Shuffle the DataFrame rows
shuffled_df = all_videos_df.sample(frac=1, random_state=42).reset_index(drop=True)

# Display the shuffled DataFrame
shuffled_df


Unnamed: 0,text,likes,replies,date
0,AJ admin is deleting messages that are critica...,0,1,2024-10-24T17:34:07Z
1,Bring down the caliphate. Rid the world of bar...,0,0,2023-10-09T18:34:10Z
2,I see some hero \nSome hero of palestainian wo...,2,0,2024-10-03T12:33:13Z
3,I am sorry to say this conflict probably will ...,2,0,2024-08-18T23:47:42Z
4,Palestina livre.,0,0,2024-04-24T16:47:50Z
...,...,...,...,...
24474,Bayangkan berapa mayat di reruntuhan bangunan ...,0,0,2023-12-31T00:15:53Z
24475,The IDF should have been able to prevent that ...,2,0,2024-03-20T17:15:47Z
24476,They are still repeating the mass rape and bab...,1,0,2024-09-05T10:12:38Z
24477,Absolutely heartbreaking.,0,0,2024-10-05T03:06:12Z


I used the `langdetect` library to keep only the comments written in English.

In [5]:
from langdetect import detect, DetectorFactory

# Ensure language detection is deterministic
DetectorFactory.seed = 0


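def is_english(text):
    """Return True only if langdetect classifies the text as English."""
    try:
        return detect(text) == "en"
    except Exception:
        # detect() raises on empty, non-string, or letter-free text (e.g. emoji only)
        return False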


# Apply filtering
df_filtered = shuffled_df[shuffled_df["text"].apply(is_english)].reset_index(drop=True)

# Display the filtered DataFrame
df_filtered

Unnamed: 0,text,likes,replies,date
0,AJ admin is deleting messages that are critica...,0,1,2024-10-24T17:34:07Z
1,Bring down the caliphate. Rid the world of bar...,0,0,2023-10-09T18:34:10Z
2,I see some hero \nSome hero of palestainian wo...,2,0,2024-10-03T12:33:13Z
3,I am sorry to say this conflict probably will ...,2,0,2024-08-18T23:47:42Z
4,How did Jews go from the likes of Viktor Frank...,1,0,2024-10-04T13:19:31Z
...,...,...,...,...
19116,"Al Jazeera is a war crime, hybrid war of misin...",0,0,2024-10-20T22:06:20Z
19117,The IDF should have been able to prevent that ...,2,0,2024-03-20T17:15:47Z
19118,They are still repeating the mass rape and bab...,1,0,2024-09-05T10:12:38Z
19119,Absolutely heartbreaking.,0,0,2024-10-05T03:06:12Z
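
Language detectors tend to be less reliable on very short or noisy strings, which are common in YouTube comments. A sketch of a pre-check that could run before `is_english`, assuming a hypothetical helper `normalize_for_detection` and an arbitrary minimum of 3 words (neither is part of the original pipeline):

```python
import re

URL_RE = re.compile(r"https?://\S+")


def normalize_for_detection(text, min_words=3):
    """Strip URLs and extra whitespace; return None if too little text remains.

    min_words is an arbitrary threshold chosen for this sketch.
    """
    if not isinstance(text, str):
        return None
    cleaned = URL_RE.sub(" ", text)
    cleaned = " ".join(cleaned.split())
    # Count only tokens that contain at least one letter
    words = [w for w in cleaned.split(" ") if any(ch.isalpha() for ch in w)]
    return cleaned if len(words) >= min_words else None
```

A threshold like this is a trade-off: it also discards short but valid English comments such as two-word reactions, so it should be tuned against the data rather than taken as given.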


I added a `sentiment` column, filled with 0 as a placeholder to be replaced during manual labeling.

In [None]:
df_filtered["sentiment"] = 0

In [7]:
# df_filtered.to_csv("df_merged.csv", index=False)

The file `df_merged.csv` contains the comments that I manually labeled with the sentiment they express. These labeled comments form the fine-tuning dataset used to train the BERT model.
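
Before fine-tuning, it can help to check how the manual labels are distributed across the `sentiment` column. A minimal sketch using only the standard library; `label_counts` is a hypothetical helper, and the label values in the demo data are made up because the labeling scheme is not described here:

```python
import csv
import io
from collections import Counter


def label_counts(csv_file):
    """Count how often each value appears in the 'sentiment' column."""
    reader = csv.DictReader(csv_file)
    return Counter(row["sentiment"] for row in reader)


# Tiny in-memory stand-in for df_merged.csv; the label values are invented
demo = io.StringIO(
    "text,likes,replies,date,sentiment\n"
    "example one,0,0,2024-01-01T00:00:00Z,1\n"
    "example two,3,1,2024-01-02T00:00:00Z,0\n"
    "example three,1,0,2024-01-03T00:00:00Z,1\n"
)
counts = label_counts(demo)
```

With the real file, the same function can be called on `open("df_merged.csv", newline="", encoding="utf-8")`. Values read this way are strings, so labels appear as `"0"`, `"1"`, and so on.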

# INFERENCE DATASET 

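These are the datasets that I'll use for sentiment analysis and topic modelling. For each video, I fetch the comments, keep only the English ones, add a placeholder `sentiment` column, and export the result to its own CSV file.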

1. October 7, Al Jazeera Investigations documentary (20/03/2024) (9000 comments)

In [6]:
id_aljazera_2024 = "_0atzea-mPY"
aljazera_2024 = pd.DataFrame(get_video_comments(id_aljazera_2024))

In [18]:
aljazera_2024 = aljazera_2024[aljazera_2024["text"].apply(is_english)].reset_index(
    drop=True
)

# create sentiment column
aljazera_2024["sentiment"] = 0


# export csv
aljazera_2024.to_csv("aljazera_2024.csv", index=False)

2. Latest live update: over 700 people killed by Israeli forces (10/10/2023) (5000 comments)

In [8]:
id_aljazera_2023 = "XZHXUvBcyhE"
aljazera_2023 = pd.DataFrame(get_video_comments(id_aljazera_2023))

In [19]:
aljazera_2023 = aljazera_2023[aljazera_2023["text"].apply(is_english)].reset_index(
    drop=True
)

# create sentiment column
aljazera_2023["sentiment"] = 0


# export csv
aljazera_2023.to_csv("aljazera_2023.csv", index=False)

3. CNN: People in Gaza feel abandoned, 40,000 Palestinians killed (15/8/2024) (4000 comments)

In [10]:
id_cnn_2024 = "Da_Ll7P5kYU"
cnn_2024 = pd.DataFrame(get_video_comments(id_cnn_2024))

In [20]:
cnn_2024 = cnn_2024[cnn_2024["text"].apply(is_english)].reset_index(drop=True)

# create sentiment column
cnn_2024["sentiment"] = 0

# export csv
cnn_2024.to_csv("cnn_2024.csv", index=False)

4. CNN: Surprise attack on Israel (9/10/2023) (3000 comments)

In [12]:
id_cnn_2023 = "PuTn9g-KfR0"
cnn_2023 = pd.DataFrame(get_video_comments(id_cnn_2023))

In [21]:
cnn_2023 = cnn_2023[cnn_2023["text"].apply(is_english)].reset_index(drop=True)

# create sentiment column
cnn_2023["sentiment"] = 0

# export csv
cnn_2023.to_csv("cnn_2023.csv", index=False)

5. Gaza before and after 7 October (6/10/2024) (1000 comments)

In [14]:
id_sky_2024 = "eSKq2IjmmJc"
sky_2024 = pd.DataFrame(get_video_comments(id_sky_2024))

In [22]:
sky_2024 = sky_2024[sky_2024["text"].apply(is_english)].reset_index(drop=True)

# create sentiment column
sky_2024["sentiment"] = 0

# export csv
sky_2024.to_csv("sky_2024.csv", index=False)

6. Bombs rain down on Gaza (10/10/2023) (20000 comments)

In [16]:
id_sky_2023 = "kBf3jm8OKyo"
sky_2023 = pd.DataFrame(get_video_comments(id_sky_2023))

In [23]:
sky_2023 = sky_2023[sky_2023["text"].apply(is_english)].reset_index(drop=True)

# create sentiment column
sky_2023["sentiment"] = 0

# export csv
sky_2023.to_csv("sky_2023.csv", index=False)
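
The six blocks above repeat the same three steps: filter to English, add a placeholder `sentiment` column, export to CSV. A sketch of how that could be factored into reusable functions, using plain dicts and the standard `csv` module rather than pandas so that it runs without the API; `build_inference_rows` and `write_rows` are hypothetical names, not part of the original notebook:

```python
import csv


def build_inference_rows(comments, keep):
    """Keep the comments accepted by `keep` and add a placeholder sentiment."""
    return [{**c, "sentiment": 0} for c in comments if keep(c["text"])]


def write_rows(rows, path):
    """Write the rows to a CSV file with the same columns as the notebook."""
    fields = ["text", "likes", "replies", "date", "sentiment"]
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fields)
        writer.writeheader()
        writer.writerows(rows)


# Offline stand-ins for get_video_comments(...) and is_english
fake_comments = [
    {"text": "keep me", "likes": 1, "replies": 0, "date": "2024-01-01T00:00:00Z"},
    {"text": "drop me", "likes": 0, "replies": 0, "date": "2024-01-02T00:00:00Z"},
]
rows = build_inference_rows(fake_comments, keep=lambda text: text.startswith("keep"))
```

In the notebook, `keep` would be `is_english` and `comments` the output of `get_video_comments`, so each video reduces to one call per function.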