## Social Data Science Project

###  <span style="color:#044e8a"> The Impact of Video Length and Interactivity on YouTube Channel Size and Video Popularity:</span> 
### <span style="color:#044e8a"> Analysis of Greece's Most-Viewed Channels </span>

<span style="color: white; background-color: #044e8a; padding: 2px; border-radius: 3px;">
Objective: </span>

To explore how video length, content type, and interactivity affect channel size and video popularity among the most-viewed YouTube channels in Greece using data from the YouTube API.

Byun et al. (2023). The effect of YouTube comment interaction on video engagement: focusing on interactivity centralization and creators' interactivity. Available at: https://www.emerald.com/insight/content/doi/10.1108/oir-04-2022-0217/full/html

<span style="color: white; background-color: #044e8a; padding: 2px; border-radius: 3px;">
TABLE OF CONTENTS</span>

|**No.**| **Chapter** | **Sections** |
|:-------:|:-------------------|:-----------------|
|**1**|**Data Collection**||
|||**1.1. YouTube API**|
|||**1.2. YouTube API Call 1:** Channel Ids & Data|
|||**1.3. YouTube API Call 2:** Video Data|
|||**1.4. YouTube API Call 3:** Comments & Threads Data|
|**2**|**Data Cleaning & Processing**||
|||**2.1.** **Dataset functions as a relational dataset**, linking the channel_data and video_data to represent how these entities are interconnected|
|||**2.2.** **Dataset** presenting all the **channel basic info and statistics**|
|||**2.3.** **Dataset** presenting basic information and statistics for **all videos uploaded by the channels**|
|||**2.4.1.** **Dataset** presenting **comments and threads data** for 100 randomly sampled videos|
|||**2.4.2.** **Anonymization of Usernames** column of Dataset presenting comments and threads data for 100 randomly sampled videos|

<span style="color: white; background-color: #145c02; padding: 2px; border-radius: 3px;">
Libraries used: </span>

*This section lists all the libraries required for data collection, cleaning, analysis, and visualization in this project.*

<span style="color:#145c02"> **Data Collection** </span> 

In [1]:
from googleapiclient.discovery import build #for YouTube API access
from bs4 import BeautifulSoup #for HTML parsing during web scraping
import requests #for making HTTP requests
import urllib.parse #for encoding URLs
from fake_useragent import UserAgent #for generating random user agents when scraping

In [2]:
import time #for adding delays between API or scraping requests
import random #for using random wait times in scraping to avoid detection
from tqdm import tqdm #for progress bars during loops
import re #for regular expressions to clean or parse text
import json #for handling JSON responses from the API

<span style="color:#145c02"> **Data Cleaning, Processing, Analysis** </span> 

In [3]:
import pandas as pd #for working with structured datasets
import numpy as np #for numerical computations
import datetime  # For handling date/time conversions and computations
from isodate import parse_duration #for handling ISO 8601 format commonly used in APIs

In [4]:
import hashlib #provides secure hashing algorithms like sha256 for anonymizing sensitive data

In [5]:
from collections import Counter #for counting occurrences of items (e.g., word frequencies)

In [6]:
import statsmodels.api as sm #for statistical analysis and regression modeling
from statsmodels.formula.api import ols #for building regression models using formulas

In [7]:
from statsmodels.stats.outliers_influence import variance_inflation_factor #for checking multicollinearity among predictors

In [8]:
from sklearn.preprocessing import StandardScaler #for normalizing/standardizing data
from sklearn.model_selection import train_test_split #for splitting data into training/testing sets
from sklearn.metrics import mean_squared_error, accuracy_score #for model evaluation

<span style="color:#145c02">**Visualization**</span> 

In [9]:
import matplotlib as mpl #core library for visualizations (aliased as mpl so that pyplot's plt alias is not overwritten)
import matplotlib.pyplot as plt #for plotting and customizing figures
import seaborn as sns #for statistical data visualizations
import plotly.express as px #for interactive and advanced plots
from wordcloud import WordCloud #for generating word clouds in text analysis

# <span style="color:#1d2224"> **1 | Data Collection** </span>

# <span style="color:#940000"> **1.1. | YouTube API** </span> 

The **YouTube Data API (v3)**, provided by Google, gives programmatic **access to public channel, video, and comment data**, which is highly valuable for researchers. Detailed **documentation** is available on the official **YouTube for Developers page**: https://developers.google.com/youtube/v3/docs

*It includes comprehensive instructions about the types of data you can access, the data format, usage limitations, and other useful information.*

In [251]:
api_key = 'myKEY'
youtube= build('youtube', 'v3', developerKey=api_key)
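
Hard-coding the key in a notebook makes it easy to leak when the notebook is shared. A minimal sketch of an alternative, assuming the key has been exported as an environment variable; the variable name `YOUTUBE_API_KEY` and the helper `load_api_key` are hypothetical and not part of the original project:

```python
import os

def load_api_key(env_var="YOUTUBE_API_KEY"):
    """Return the API key stored in the given environment variable (hypothetical name)."""
    key = os.environ.get(env_var)
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable before running the notebook")
    return key

# usage sketch (assumes the variable is set in the shell that started Jupyter):
# api_key = load_api_key()
# youtube = build('youtube', 'v3', developerKey=api_key)
```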

### <span style="color:#940000"> **1.2. | YouTube API Call 1: Channel Ids & Data** </span> 

Function to get channel IDs from the creators' YouTube usernames

In [22]:
def get_channel_ids_from_usernames_via_search(youtube, usernames):
    results = []
    for username in usernames:
        try:
            request = youtube.search().list(
                part="snippet",
                q=username,
                type="channel",
                maxResults=1
            )
            response = request.execute()
            if 'items' in response and len(response['items']) > 0:
                results.append({'Username': username, 'Channel_ID': response['items'][0]['id']['channelId']})
            else:
                results.append({'Username': username, 'Channel_ID': None})
        except Exception as e:
            print(f"Error retrieving channel ID for username '{username}': {e}")
            results.append({'Username': username, 'Channel_ID': None})

    return results

In [23]:
results = get_channel_ids_from_usernames_via_search(youtube, channel_usernames)

In [24]:
channel_names_ids = pd.DataFrame(results)

In [25]:
channel_names_ids.to_csv('Channel_Names_IDs.csv',index=False)

In [43]:
channel_names_ids.head(5)

Unnamed: 0,Username,Channel_ID
0,,UC97CozuZ0Zr9TP7Xfo1ISlw
1,katerinalioliouofficial,UCP8dWXxd3pGUG8S8DiHQDVg
2,zenith.greece,UCwvPiopiZ1v9ygZvN88k2JA
3,mcdonaldsgreece6193,UCDohft-LyivpohtUgw8Ukww
4,deigr,UCByqwwMSyHsd3xbWCMls1Gg


In [99]:
channel_names_ids.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51 entries, 0 to 50
Data columns (total 2 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Username    51 non-null     object
 1   Channel_ID  51 non-null     object
dtypes: object(2)
memory usage: 948.0+ bytes


<span style="color: white; background-color: red; padding: 2px; border-radius: 3px;">
Note: </span>

At this stage, I will remove specific channels associated with brands, such as energy providers (e.g., dei.gr) and global fast-food companies (e.g., McDonald's).

*These types of channels are typically used by companies to upload and share advertisements rather than organic content. Since these videos are promoted by algorithms to reach specific view thresholds, they are not representative of natural audience engagement. Including such channels in the analysis would introduce inconsistencies, especially when comparing interactivity and channel growth with other creators.*

I will also exclude record companies and singers/artists, as this type of content belongs to a specific category where user engagement and behavior differ significantly from entertainment-focused channels.

Therefore, **this project will focus exclusively on creators who produce Greek content** within categories such as entertainment (excluding music), education, comedy, beauty, and similar fields.

In [39]:
channels_removed = [
    'zenith.greece', 
    'mcdonaldsgreece6193',
    'deigr',
    'nerdom_gr',
    'goodvybzmusic',
    'luigigr',
    'ant1823',
    'eurobankgroup',
    'offbeatrecordsgr',
    'katerinalioliouofficial',
    'nikosportokaloglouofficial',
    'cknd_records',
    'lilatrianti',
    'N/A'
]

In [47]:
channels_ids_cleaned = channel_names_ids[~channel_names_ids['Username'].isin(channels_removed)]

In [49]:
channels_ids_cleaned.head()

Unnamed: 0,Username,Channel_ID
6,gl_show,UC5_mz3DN_m9dTlsyH1Hy2BA
7,xristosraftas,UCA9IL3pf55xRhTJ4fRxAzEw
8,unboxholics,UCjBCvQBVTh4XjPwtSMQNcFg
9,persadpopara,UC5ISG0x-4OwC2VrnRxWLwzA
10,george_tala,UChOKNVWJLqGN2DB1ZcpNHKA


In [119]:
channels_ids_cleaned.to_csv('Channel_Names_IDs.csv',index=False)

In [100]:
channels_ids_cleaned.info()

<class 'pandas.core.frame.DataFrame'>
Index: 37 entries, 6 to 50
Data columns (total 2 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Username    37 non-null     object
 1   Channel_ID  37 non-null     object
dtypes: object(2)
memory usage: 888.0+ bytes


In [50]:
channellist_ids = channels_ids_cleaned['Channel_ID'].tolist()

In [53]:
channellist_ids[5:10]

['UC9tvpZxjAnJC4xFLcSHpr6A',
 'UCWAl9Nd_PB9vp_KfivDgd4A',
 'UC1KjWRBCUGvxDkrhu_-UBWg',
 'UCHxGlwqoQf6jUQZX4WYIFHA',
 'UC1cDpj_OXMMlpEmWpDbaGsg']

Function to get Channel Data

In [74]:
def get_channel_stats(youtube, channel_ids):
    all_data = []
    #request basic info, statistics and the uploads playlist for all channels in one call
    request = youtube.channels().list(
        part='snippet,contentDetails,statistics',
        id=','.join(channel_ids)
    )
    response = request.execute()

    #collect one record per returned channel
    for item in response['items']:
        data = dict(
            Channel_name=item['snippet']['title'],
            Description=item['snippet']['description'],
            Subscribers=item['statistics']['subscriberCount'],
            ViewCount=item['statistics']['viewCount'],
            Total_Videos=item['statistics']['videoCount'],
            Playlist_id=item['contentDetails']['relatedPlaylists']['uploads'],
        )
        all_data.append(data)

    return all_data
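
The function above sends every channel ID in a single request, which works for the 37 channels used here. For longer lists, one hedged option is to split the IDs into batches, in the same way the video-details function in section 1.3 requests videos 50 at a time. `chunked` is a hypothetical helper, and the batch size of 50 is an assumption carried over from that function:

```python
def chunked(items, size=50):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

# usage sketch: one get_channel_stats call per batch, results concatenated
# all_rows = []
# for batch in chunked(channellist_ids):
#     all_rows.extend(get_channel_stats(youtube, batch))
```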

In [55]:
channel_all_data= get_channel_stats(youtube, channellist_ids)

In [56]:
channelData= pd.DataFrame(channel_all_data)

In [58]:
channelData.to_csv('channelData.csv',index=False)

In [66]:
channelData.head()

Unnamed: 0,Channel_name,Description,Subscribers,ViewCount,Total_Videos,Playlist_id
0,Greekonomics,Πως η Οικονομία επηρεάζει την Κοινωνία!\n\nΤα ...,231000,13154190,60,UU1KjWRBCUGvxDkrhu_-UBWg
1,Dat Lilly,Νέο βίντεο κάθε Κυριακή ❤\nThank's for being h...,521000,100038173,190,UU9WYita8NlpXTcmn_js5YzQ
2,Pavlos Makris,Εδώ για να σε διασκεδάσω!😊\nΠάτα το Like & Sub...,12500,1611596,33,UUhWPS3NiUzeRmh8DBmtvYpA
3,Eponimos,ναι.,392000,95878785,513,UUFOasUEk9Pkr8YeJxGc88Lw
4,Unboxholics,TIME WELL WASTED.\nGaming | Tech | Cinema | En...,1070000,442841695,1546,UUjBCvQBVTh4XjPwtSMQNcFg


In [70]:
channelData['Total_Videos'] = channelData['Total_Videos'].astype('int')

In [71]:
channelData['Total_Videos'].sum()

13970

### <span style="color:#940000"> **1.3 | YouTube API Call 2: Video Data** </span> 

To access and retrieve video data, the process begins by collecting the playlist ID corresponding to the playlist that contains all videos uploaded to a channel since its creation. Once the playlist ID is obtained, the individual video IDs within the playlist can be retrieved. These video IDs are then used to gather detailed information for each video.

In [67]:
playlist_ids = channelData['Playlist_id'].tolist()

Since the **total number of videos uploaded by all channels is 13,970**, roughly the same number of **video IDs** is expected from the 37 uploads playlists collected above. The channel-level count and the playlist contents need not match exactly: the call below returns 13,780 IDs.

Function to get all the video IDs from each playlist ID

In [77]:
def get_videos_ids(youtube, playlist_ids):
    all_video_data = []

    for playlist_id in tqdm(playlist_ids, desc="Getting video IDs from playlists"):
        try:
            next_page_token = None
            more_pages = True

            while more_pages:
                request = youtube.playlistItems().list(
                    part='snippet,contentDetails',
                    playlistId=playlist_id,
                    maxResults=50,
                    pageToken=next_page_token
                )
                response = request.execute()

                for item in response['items']:
                    video_data = {
                        "Playlist_Id": playlist_id,
                        "Video_Id": item['contentDetails']['videoId'],
                        "Channel_Id": item['snippet']['channelId']
                    }
                    all_video_data.append(video_data)

                next_page_token = response.get('nextPageToken')
                more_pages = next_page_token is not None

        except Exception as e:
            print(f"Error getting video IDs from playlist '{playlist_id}': {e}")
    
    return pd.DataFrame(all_video_data)

In [78]:
video_ids = get_videos_ids(youtube, playlist_ids)

Getting video IDs from playlists: 100%|█████████| 37/37 [01:15<00:00,  2.04s/it]


In [80]:
video_ids.to_csv('Video_ids.csv',index=False)

In [83]:
video_ids.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13780 entries, 0 to 13779
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Playlist_Id  13780 non-null  object
 1   Video_Id     13780 non-null  object
 2   Channel_Id   13780 non-null  object
dtypes: object(3)
memory usage: 323.1+ KB


Function to get the details of every video uploaded to the channels

In [88]:
def get_video_details(youtube, video_ids):

    all_videos_stats = []
    video_ids2 = video_ids['Video_Id'].tolist()
    
    for i in tqdm(range(0, len(video_ids), 50), desc="Fetching video details"):
        try:
            request = youtube.videos().list(
                part='snippet,statistics,contentDetails',
                id=','.join(video_ids2[i:i+50])
            )
            response = request.execute()

            for video in response['items']:
                matching_row = video_ids.loc[video_ids['Video_Id'] == video['id']].iloc[0]
                video_stats = {
                    "Id": video['id'],
                    "Title": video['snippet']['title'],
                    "Published_Date": video['snippet']['publishedAt'],
                    "Description": video['snippet']['description'],
                    "Tags": video['snippet'].get('tags', []),
                    "Views": video['statistics'].get('viewCount', 0),
                    "Likes": video['statistics'].get('likeCount', 0),
                    #dislike counts are no longer public, so this falls back to 0 for every video
                    "Dislikes": video['statistics'].get('dislikeCount', 0),
                    "Comments": video['statistics'].get('commentCount', 0),
                    "Channel_Id": matching_row['Channel_Id'],
                    "Playlist_Id": matching_row['Playlist_Id'],
                    "Video_Length": video['contentDetails']['duration']
                }
                all_videos_stats.append(video_stats)
        except Exception as e:
            print(f"Error fetching details for videos batch: {e}")
    
    return pd.DataFrame(all_videos_stats)

In [89]:
videoData = get_video_details(youtube, video_ids)

Fetching video details: 100%|█████████████████| 276/276 [01:16<00:00,  3.62it/s]


In [90]:
videoData.to_csv('VideoData.csv',index=False)

In [92]:
videoData.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13780 entries, 0 to 13779
Data columns (total 12 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   Id              13780 non-null  object
 1   Title           13780 non-null  object
 2   Published_Date  13780 non-null  object
 3   Description     13780 non-null  object
 4   Tags            13780 non-null  object
 5   Views           13780 non-null  object
 6   Likes           13780 non-null  object
 7   Dislikes        13780 non-null  int64 
 8   Comments        13780 non-null  object
 9   Channel_Id      13780 non-null  object
 10  Playlist_Id     13780 non-null  object
 11  Video_Length    13780 non-null  object
dtypes: int64(1), object(11)
memory usage: 1.3+ MB


In [93]:
videoData['Comments'] = videoData['Comments'].astype('int')

In [94]:
videoData['Comments'].sum()

6989308

### <span style="color:#940000"> **1.4. | YouTube API Call 3: Comments & Threads Data** </span> 

At this stage, to **track whether the creator interacts with the audience while collecting comments and threads**, the **channels_ids_cleaned data frame can be merged with the video_ids data frame**. This merge adds the Username of the channel owner to each video, enabling comparison between the creator's username and the authorDisplayName of comments or replies.

In [101]:
merged_data = pd.merge(video_ids, channels_ids_cleaned, how='left', left_on='Channel_Id', right_on='Channel_ID')

In [104]:
merged_data.head()

Unnamed: 0,Playlist_Id,Video_Id,Channel_Id,Username,Channel_ID
0,UU1KjWRBCUGvxDkrhu_-UBWg,qIifbbutkl0,UC1KjWRBCUGvxDkrhu_-UBWg,greekonomics,UC1KjWRBCUGvxDkrhu_-UBWg
1,UU1KjWRBCUGvxDkrhu_-UBWg,JLQNJPg9lH4,UC1KjWRBCUGvxDkrhu_-UBWg,greekonomics,UC1KjWRBCUGvxDkrhu_-UBWg
2,UU1KjWRBCUGvxDkrhu_-UBWg,y6DFcf_-EzI,UC1KjWRBCUGvxDkrhu_-UBWg,greekonomics,UC1KjWRBCUGvxDkrhu_-UBWg
3,UU1KjWRBCUGvxDkrhu_-UBWg,KXOTXhH0-SM,UC1KjWRBCUGvxDkrhu_-UBWg,greekonomics,UC1KjWRBCUGvxDkrhu_-UBWg
4,UU1KjWRBCUGvxDkrhu_-UBWg,IBKZJupo064,UC1KjWRBCUGvxDkrhu_-UBWg,greekonomics,UC1KjWRBCUGvxDkrhu_-UBWg


In [103]:
merged_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13780 entries, 0 to 13779
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Playlist_Id  13780 non-null  object
 1   Video_Id     13780 non-null  object
 2   Channel_Id   13780 non-null  object
 3   Username     13780 non-null  object
 4   Channel_ID   13780 non-null  object
dtypes: object(5)
memory usage: 538.4+ KB


In [118]:
merged_data.to_csv('mergedchannelvideodata.csv',index=False)

In [249]:
merged_data= pd.read_csv('mergedchannelvideodata.csv')

Since the total number of **comments** posted across all videos from the channels is **6,989,308**, the same number of comment entries would be expected when collecting comments from the **13,780 videos** belonging to the **37 channels** gathered above.


<span style="color: white; background-color: red; padding: 2px; border-radius: 3px;">
However: </span>

Due to the **API's quota limitations and time constraints**, and given the exceptionally large total number of comments, **a random sample of 100 videos** is selected for the comments analysis. This keeps the dataset manageable while giving every video the same chance of being included.

In [250]:
sampled_video_ids = merged_data.sample(n=100)
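
`sample(n=100)` draws a different set of videos on every run, so the comment dataset cannot be regenerated exactly. A small sketch of a reproducible draw, using a toy data frame in place of `merged_data`; the seed value 42 is an arbitrary assumption:

```python
import pandas as pd

# toy stand-in for merged_data (made-up IDs, for illustration only)
toy = pd.DataFrame({"Video_Id": [f"vid{i}" for i in range(1000)]})

# fixing random_state makes the draw repeatable across runs
sample_a = toy.sample(n=100, random_state=42)
sample_b = toy.sample(n=100, random_state=42)

# both draws contain the same rows in the same order
print(sample_a["Video_Id"].tolist() == sample_b["Video_Id"].tolist())
```

Recording the seed next to the saved CSV would let the same 100 videos be selected again later.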

Function to get all the comments and threads for the sampled videos

In [252]:
def get_video_comments(youtube, video_data):
    all_video_comments = []

    for _, row in tqdm(video_data.iterrows(), desc="Getting comments", total=len(video_data)):
        video_id = row['Video_Id']
        channel_id = row['Channel_Id']
        creator_username = row['Username']
        next_page_token = None

        while True:
            try:
                request = youtube.commentThreads().list(
                    part='snippet,replies',
                    videoId=video_id,
                    textFormat='plainText',
                    maxResults=100,
                    pageToken=next_page_token
                )

                response = request.execute()

                if 'error' in response:
                    print(f"Error: {response['error']['message']}")
                    break

                for item in response.get('items', []):
                    comment = item['snippet']['topLevelComment']['snippet']
                    reply_count = 0
                    creator_reply_count = 0

                    if 'replies' in item and 'comments' in item['replies']:
                        replies = item['replies']['comments']
                        reply_count = len(replies)
                        creator_reply_count = sum(
                            1 for reply in replies 
                            if reply['snippet']['authorDisplayName'] == creator_username
                        )

                    comment_data = dict(
                        Video_ID=video_id,
                        Channel_ID=channel_id,
                        User_name=comment['authorDisplayName'],
                        Comment=comment['textDisplay'],
                        Comment_likes=comment['likeCount'],
                        Published_Date=comment['publishedAt'],
                        Total_Replies=reply_count,
                        Creator_Replies=creator_reply_count
                    )

                    all_video_comments.append(comment_data)

                next_page_token = response.get('nextPageToken')
                if not next_page_token:
                    break

                time.sleep(1)

            except Exception as e:
                print(f"Exception getting comments from video {video_id}: {e}")
                break
            
    return all_video_comments
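
The function above identifies creator replies by comparing `authorDisplayName` with the channel username, which can miss replies when the display name differs from the handle. It also counts only the replies embedded in the thread, which may be a subset of all replies. A hedged alternative compares the reply author's channel ID with the video's channel ID and reads `totalReplyCount` from the thread snippet. The helper name `count_replies` and the sample thread are made up for illustration; the field names follow the commentThreads resource as I understand it and should be checked against the API reference before use:

```python
def count_replies(thread, channel_id):
    """Return (total_replies, creator_replies) for one commentThreads item."""
    # totalReplyCount covers all replies, even those not embedded in the thread
    total = thread["snippet"].get("totalReplyCount", 0)
    embedded = thread.get("replies", {}).get("comments", [])
    # match on the author's channel ID instead of the display name;
    # only the embedded replies can be inspected here
    by_creator = sum(
        1 for reply in embedded
        if reply["snippet"].get("authorChannelId", {}).get("value") == channel_id
    )
    return total, by_creator

# made-up thread shaped like an API response item
example_thread = {
    "snippet": {"totalReplyCount": 3},
    "replies": {"comments": [
        {"snippet": {"authorDisplayName": "Creator", "authorChannelId": {"value": "UC_creator"}}},
        {"snippet": {"authorDisplayName": "viewer", "authorChannelId": {"value": "UC_viewer"}}},
    ]},
}
print(count_replies(example_thread, "UC_creator"))  # → (3, 1)
```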

In [None]:
video_comments = get_video_comments(youtube, sampled_video_ids)

<span style="color: white; background-color: red; padding: 2px; border-radius: 3px;">
Note: </span>

Comments can be disabled for several reasons, including the creator's choice or compliance with YouTube's policies on Kids Content. As a result, some of the sampled videos returned an **HttpError 403** with the **reason 'commentsDisabled'**, indicating that comments are not allowed on those videos.

In [254]:
commentsData = pd.DataFrame(video_comments)

In [255]:
commentsData.to_csv('CommentsData.csv',index=False)

In [256]:
commentsData.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 31115 entries, 0 to 31114
Data columns (total 8 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   Video_ID         31115 non-null  object
 1   Channel_ID       31115 non-null  object
 2   User_name        31115 non-null  object
 3   Comment          31115 non-null  object
 4   Comment_likes    31115 non-null  int64 
 5   Published_Date   31115 non-null  object
 6   Total_Replies    31115 non-null  int64 
 7   Creator_Replies  31115 non-null  int64 
dtypes: int64(3), object(5)
memory usage: 1.9+ MB


# <span style="color:#1d2224"> **2 | Data Cleaning & Processing** </span>

At this stage, all collected datasets will be thoroughly reviewed to ensure data quality and readiness for analysis. This process includes checking for missing values, verifying and correcting data types, and creating any required variables. The goal is to ensure that all necessary information is present, accurate, and formatted appropriately to proceed with the analysis seamlessly.

## <span style="color:black"> **2.1. | Dataset functions as a relational dataset, linking the channel_data and video_data to represent how these entities are interconnected** </span> 

In [8]:
info_ids = pd.read_csv('mergedchannelvideodata.csv')

In [24]:
info_ids.head(2)

Unnamed: 0,Playlist_Id,Video_Id,Channel_Id,Username,Channel_ID
0,UU1KjWRBCUGvxDkrhu_-UBWg,qIifbbutkl0,UC1KjWRBCUGvxDkrhu_-UBWg,greekonomics,UC1KjWRBCUGvxDkrhu_-UBWg
1,UU1KjWRBCUGvxDkrhu_-UBWg,JLQNJPg9lH4,UC1KjWRBCUGvxDkrhu_-UBWg,greekonomics,UC1KjWRBCUGvxDkrhu_-UBWg


In [23]:
info_ids.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13780 entries, 0 to 13779
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Playlist_Id  13780 non-null  object
 1   Video_Id     13780 non-null  object
 2   Channel_Id   13780 non-null  object
 3   Username     13780 non-null  object
 4   Channel_ID   13780 non-null  object
dtypes: object(5)
memory usage: 538.4+ KB


In [25]:
info_ids = info_ids.drop(columns=['Channel_ID'], errors='ignore')

In [26]:
info_ids.to_csv('mergedchannelvideodata.csv',index=False)

## <span style="color:black"> **2.2. | Dataset presenting all the channel basic info and statistics** </span> 

In [9]:
channel_data = pd.read_csv('channelData.csv')

In [28]:
channel_data.head(2)

Unnamed: 0,Channel_name,Description,Subscribers,ViewCount,Total_Videos,Playlist_id
0,Greekonomics,Πως η Οικονομία επηρεάζει την Κοινωνία!\n\nΤα ...,231000,13154190,60,UU1KjWRBCUGvxDkrhu_-UBWg
1,Dat Lilly,Νέο βίντεο κάθε Κυριακή ❤\nThank's for being h...,521000,100038173,190,UU9WYita8NlpXTcmn_js5YzQ


In [47]:
channel_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 37 entries, 0 to 36
Data columns (total 7 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   Channel_name   37 non-null     object
 1   Description    37 non-null     object
 2   Subscribers    37 non-null     int64 
 3   ViewCount      37 non-null     int64 
 4   Total_Videos   37 non-null     int64 
 5   Playlist_id    37 non-null     object
 6   Description_c  37 non-null     object
dtypes: int64(3), object(4)
memory usage: 2.2+ KB


In [264]:
def preprocess_greek_text(text):
    #remove punctuation, emoji and other symbols; \w is Unicode-aware, so Greek and Latin letters, digits and underscores are kept
    text = re.sub(r'[^\w\sΑ-Ωα-ωΆ-Ώά-ώA-Za-z]', '', text)
    #replace newlines with spaces and trim leading/trailing whitespace
    text = text.replace('\n', ' ').strip()
    #convert to lowercase
    text = text.lower()
    return text
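
The function above turns each newline into a space but does not merge runs of spaces, so a description with blank lines keeps double spaces. If single-spaced text is wanted, for example before building word clouds, a possible variant is sketched below. `preprocess_text_collapsed` is a hypothetical name and is not used elsewhere in the notebook:

```python
import re

def preprocess_text_collapsed(text):
    # drop everything that is not a letter, digit, underscore or whitespace
    text = re.sub(r'[^\w\s]', '', text)
    # collapse any run of whitespace (spaces, tabs, newlines) into one space
    text = re.sub(r'\s+', ' ', text).strip()
    return text.lower()

print(preprocess_text_collapsed("Γεια   σου!\n\nHello, World 2024"))
# → γεια σου hello world 2024
```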

In [44]:
channel_data['Description'] = channel_data['Description'].fillna('').astype(str)

In [50]:
channel_data['Description_c'] = channel_data['Description'].apply(preprocess_greek_text)

In [51]:
channel_data.head()

Unnamed: 0,Channel_name,Description,Subscribers,ViewCount,Total_Videos,Playlist_id,Description_c
0,Greekonomics,Πως η Οικονομία επηρεάζει την Κοινωνία!\n\nΤα ...,231000,13154190,60,UU1KjWRBCUGvxDkrhu_-UBWg,πως η οικονομία επηρεάζει την κοινωνία τα βίν...
1,Dat Lilly,Νέο βίντεο κάθε Κυριακή ❤\nThank's for being h...,521000,100038173,190,UU9WYita8NlpXTcmn_js5YzQ,νέο βίντεο κάθε κυριακή thanks for being here
2,Pavlos Makris,Εδώ για να σε διασκεδάσω!😊\nΠάτα το Like & Sub...,12500,1611596,33,UUhWPS3NiUzeRmh8DBmtvYpA,εδώ για να σε διασκεδάσω πάτα το like subscri...
3,Eponimos,ναι.,392000,95878785,513,UUFOasUEk9Pkr8YeJxGc88Lw,ναι
4,Unboxholics,TIME WELL WASTED.\nGaming | Tech | Cinema | En...,1070000,442841695,1546,UUjBCvQBVTh4XjPwtSMQNcFg,time well wasted gaming tech cinema enterta...


In [24]:
channel_data.to_csv('channelData.csv',index=False)

## <span style="color:black"> **2.3. | Dataset presenting basic information and statistics for all videos uploaded by the channels** </span> 

In [86]:
video_data = pd.read_csv('VideoData.csv')

In [7]:
video_data.head(2)

Unnamed: 0,Id,Title,Published_Date,Description,Tags,Views,Likes,Dislikes,Comments,Channel_Id,Playlist_Id,Video_Length
0,qIifbbutkl0,Το Παγκόσμιο Μέλλον του Χρήματος | Greekonomic...,2024-11-22T12:44:42Z,Ένα ταξίδι στο μέλλον του χρηματοπιστωτικού συ...,[],172515,14860,0,1332,UC1KjWRBCUGvxDkrhu_-UBWg,UU1KjWRBCUGvxDkrhu_-UBWg,PT37M24S
1,JLQNJPg9lH4,"Η ""Κολομβία"" της Ευρώπης | Greekonomics #45",2024-09-22T08:44:24Z,Ευχαριστώ την Freedom24 που στηρίζει το κανάλι...,[],549229,41163,0,4127,UC1KjWRBCUGvxDkrhu_-UBWg,UU1KjWRBCUGvxDkrhu_-UBWg,PT19M58S


In [59]:
video_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13780 entries, 0 to 13779
Data columns (total 20 columns):
 #   Column                 Non-Null Count  Dtype   
---  ------                 --------------  -----   
 0   Id                     13780 non-null  object  
 1   Title                  13780 non-null  object  
 2   Published_Date         13780 non-null  object  
 3   Description            13780 non-null  object  
 4   Tags                   13780 non-null  object  
 5   Views                  13780 non-null  int64   
 6   Likes                  13780 non-null  int64   
 7   Dislikes               13780 non-null  int64   
 8   Comments               13780 non-null  int64   
 9   Channel_Id             13780 non-null  object  
 10  Playlist_Id            13780 non-null  object  
 11  Video_Length           13780 non-null  object  
 12  Published_Year         13780 non-null  int32   
 13  Published_Month        13780 non-null  int32   
 14  Description_c          13780 non-null 

In [9]:
video_data['Published_Date'] = pd.to_datetime(video_data['Published_Date'])
video_data['Published_Date'] = video_data['Published_Date'].dt.tz_convert('Europe/Athens')

In [10]:
video_data['Published_Year'] = video_data['Published_Date'].dt.year
video_data['Published_Month'] = video_data['Published_Date'].dt.month

In [11]:
video_data['Published_Date'] = video_data['Published_Date'].dt.date

In [17]:
video_data['Description'] = video_data['Description'].fillna('').astype(str)
video_data['Description_c'] = video_data['Description'].apply(preprocess_greek_text)

In [18]:
def convert_video_length(duration): #to convert ISO 8601 duration to total seconds
    parsed_duration = parse_duration(duration) 
    total_seconds = int(parsed_duration.total_seconds())
    return total_seconds
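
The API returns durations as ISO 8601 strings such as `PT37M24S` (37 minutes, 24 seconds). To make the conversion concrete, here is a standard-library sketch that handles only the day/hour/minute/second forms. It is an illustration, not a replacement for `isodate.parse_duration`, and the helper name `iso_duration_to_seconds` is hypothetical:

```python
import re

# optional days, then an optional time part with hours, minutes and seconds
_DURATION = re.compile(r'^P(?:(\d+)D)?(?:T(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?)?$')

def iso_duration_to_seconds(duration):
    match = _DURATION.match(duration)
    if match is None:
        raise ValueError(f"Unsupported duration: {duration!r}")
    # missing components are treated as zero
    days, hours, minutes, seconds = (int(g) if g else 0 for g in match.groups())
    return days * 86400 + hours * 3600 + minutes * 60 + seconds

print(iso_duration_to_seconds('PT37M24S'))  # 37*60 + 24 = 2244
print(iso_duration_to_seconds('PT19M58S'))  # 19*60 + 58 = 1198
```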

In [19]:
video_data['Video_Length_Seconds'] = video_data['Video_Length'].apply(convert_video_length)
video_data['Video_Length_HH_MM_SS'] = pd.to_timedelta(video_data['Video_Length_Seconds'], unit='s')
video_data['Video_Length_HH_MM_SS'] = video_data['Video_Length_HH_MM_SS'].astype(str).str.replace('0 days ', '')

In [39]:
video_data['Comments_Presence'] = (video_data['Comments'] > 0).astype(int)

In [42]:
#define refined bins and labels
bins = [0, 240, 1200, 3600, float('inf')]  # 0-4 min, 4-20 min, 20-60 min, >60 min
labels = ['short', 'medium', 'long', 'super long']

#create the categorical variable based on video duration
video_data['Video_Length_Category'] = pd.cut(video_data['Video_Length_Seconds'], bins=bins, labels=labels)
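
`pd.cut` uses right-closed intervals by default, so with these bins a video of exactly 240 seconds is labelled short, while a duration of 0 seconds falls outside the first interval and becomes missing. The toy example below shows both behaviours and the effect of `include_lowest=True`; the duration values are made up for illustration:

```python
import pandas as pd

bins = [0, 240, 1200, 3600, float('inf')]
labels = ['short', 'medium', 'long', 'super long']
toy_seconds = pd.Series([0, 240, 241, 1200, 3600, 3601])  # made-up durations

# default: intervals are (0, 240], (240, 1200], ... so 0 is not covered
default_cut = pd.cut(toy_seconds, bins=bins, labels=labels)
# include_lowest=True closes the first interval on the left as well
inclusive_cut = pd.cut(toy_seconds, bins=bins, labels=labels, include_lowest=True)

print(default_cut.isna().tolist())
print(inclusive_cut.astype(str).tolist())
```

Whether zero-length entries should be kept as short videos or dropped is a judgment call for the analysis; the sketch only makes the boundary behaviour visible.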

In [None]:
median_views = video_data['Views'].median()
video_data['Popular'] = (video_data['Views'] > median_views).astype(int)

In [271]:
channelinfo = pd.read_csv('Channel_Names_IDs.csv')

In [272]:
channelinfo.head()

Unnamed: 0,Username,Channel_ID
0,gl_show,UC5_mz3DN_m9dTlsyH1Hy2BA
1,xristosraftas,UCA9IL3pf55xRhTJ4fRxAzEw
2,unboxholics,UCjBCvQBVTh4XjPwtSMQNcFg
3,persadpopara,UC5ISG0x-4OwC2VrnRxWLwzA
4,george_tala,UChOKNVWJLqGN2DB1ZcpNHKA


In [273]:
#create a mapping of Channel_Id to Channel_Username
channel_mapping = channelinfo.set_index('Channel_ID')['Username']

#add Channel_Username by looking up each video's Channel_Id in the mapping
video_data['Channel_Username'] = video_data['Channel_Id'].map(channel_mapping)

In [260]:
video_data.to_csv('VideoData.csv',index=False)

## <span style="color:black"> **2.4.1. | Dataset presenting comments and threads data for 100 randomly sampled videos** </span> 

In [257]:
comments_data = pd.read_csv('CommentsData.csv')

In [282]:
comments_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 31115 entries, 0 to 31114
Data columns (total 12 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   Video_ID          31115 non-null  object
 1   Channel_ID        31115 non-null  object
 2   User_name         31112 non-null  object
 3   Comment           31099 non-null  object
 4   Comment_likes     31115 non-null  int64 
 5   Published_Date    31115 non-null  object
 6   Total_Replies     31115 non-null  int64 
 7   Creator_Replies   31115 non-null  int64 
 8   Published_Year    31115 non-null  int64 
 9   Published_Month   31115 non-null  int64 
 10  Comment_p         30288 non-null  object
 11  Replies_Presence  31115 non-null  int64 
dtypes: int64(6), object(6)
memory usage: 2.8+ MB


In [260]:
comments_data['Published_Date'] = pd.to_datetime(comments_data['Published_Date'])
comments_data['Published_Date'] = comments_data['Published_Date'].dt.tz_convert('Europe/Athens')

In [261]:
comments_data['Published_Year'] = comments_data['Published_Date'].dt.year
comments_data['Published_Month'] = comments_data['Published_Date'].dt.month

In [262]:
comments_data['Published_Date'] = comments_data['Published_Date'].dt.date

In [281]:
comments_data['Replies_Presence'] = (comments_data['Total_Replies'] > 0).astype(int)

In [265]:
comments_data['Comment'] = comments_data['Comment'].fillna('').astype(str)
comments_data['Comment_p'] = comments_data['Comment'].apply(preprocess_greek_text)

In [283]:
comments_data.to_csv('CommentsDataP.csv',index=False)

### <span style="color:#940000"> **2.4.2. | Anonymization of Usernames column of Dataset presenting comments and threads data for 100 randomly sampled videos** </span> 

Because **comments and usernames can qualify as personal data under Article 4(1) of Regulation (EU) 2016/679 (GDPR)**, they are **processed** in this research **on the legal basis of Article 6(1)(f)**, which allows processing for legitimate interests. Here the legitimate interest is academic research carried out for a specific exam project. In line with the principle of **data minimization** in **Article 5(1)(c)**, comments were collected only for a **random sample** of 100 videos (31,115 comment rows, including usernames), so that only data necessary for the research purpose was processed.

Because usernames are personal data that could identify individuals, a **pseudonymization technique** (SHA-256 hashing of commenter usernames) was applied to reduce that risk. This is in line with **Article 32(1)(a)**, which lists pseudonymization among the technical and organizational measures for securing personal data, and with Recital 28, which notes that pseudonymization can reduce the risks to data subjects. Under Recital 26, pseudonymized data remain personal data, so the hashed usernames are still handled as such.

source: https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=celex:32016R0679

In [274]:
#set of creator channel usernames (a set, so that `in` tests the values; `in` on a pandas Series tests the index labels)
creator_channels = set(channelinfo['Username'])

#function to pseudonymize usernames
def anonymize_username(username, creators):
    if pd.isna(username):  # Handle missing values
        return 'Unknown'
    if username in creators:
        return username  # Retain the creator's name
    # Pseudonymize other usernames using SHA-256 hashing
    return hashlib.sha256(str(username).encode()).hexdigest()

#apply pseudonymization
comments_data['User_name'] = comments_data['User_name'].apply(lambda x: anonymize_username(x, creator_channels))
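
A plain SHA-256 digest of a username can be matched by hashing candidate usernames and comparing the results, since usernames are public. One common hardening, sketched here under the assumption that a secret key is stored outside the shared dataset, is a keyed hash (HMAC). The function name `pseudonymize` and the key are placeholders, not values from this project, and missing values are represented by `None` rather than `NaN` to keep the sketch free of pandas:

```python
import hashlib
import hmac

def pseudonymize(username, creators, secret_key):
    """Keep creator names; replace other usernames with an HMAC-SHA256 digest."""
    if username is None:
        return 'Unknown'
    if username in creators:
        return username
    return hmac.new(secret_key, str(username).encode('utf-8'), hashlib.sha256).hexdigest()

creators = {'gl_show'}          # a set, so membership tests the values
key = b'replace-with-a-secret'  # placeholder key, kept out of the published data

print(pseudonymize('gl_show', creators, key))           # creator name is retained
print(len(pseudonymize('some_viewer', creators, key)))  # hex digest of 64 characters
```

The same username always maps to the same digest under one key, so comment counts per user can still be computed.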

In [276]:
comments_data.head(2)

Unnamed: 0,Video_ID,Channel_ID,User_name,Comment,Comment_likes,Published_Date,Total_Replies,Creator_Replies,Published_Year,Published_Month,Comment_p
0,pxv2GXvEFqY,UCFOasUEk9Pkr8YeJxGc88Lw,5dfadffa8ffe5ba97828ebc3c70d18c49f69cea80d70b4...,Cfv 0:15,0,2024-05-11,0,0,2024,5,cfv 015
1,pxv2GXvEFqY,UCFOasUEk9Pkr8YeJxGc88Lw,d4d5d861b27a239f2bb8b84a6ec6436073c28cd35338c5...,Φίλε δε τραγούδησες της Ελλάδας,0,2023-07-27,0,0,2023,7,φίλε δε τραγούδησες της ελλάδας
