# IMT 547 Project Part II: Data Preprocessing

Chesie Yu

02/18/2024

<style type = "text/css">  
    body {
        font-family: "Serif"; 
        font-size: 12pt;
    }
    em {
        color: #4E7F9E;
    }
    strong {
        color: #436D87;
    }
    li {
        color: #4E7F9E;
    }
    ul {
        color: #4E7F9E;
    }
    img {
        display: block;
        margin: auto;
    } 
    .jp-RenderedHTMLCommon a:link { 
        color: #94C1C9;
    }
    .jp-RenderedHTMLCommon a:visited { 
        color: #94C1C9;
    }
    .jp-RenderedHTMLCommon code {
        color: #4E7F9E;
    }  
    .mark {
        color: #B00D00;
        background-color: #FFF7B1;
    }
</style>

_This notebook outlines the **data preprocessing** process for the **YouTube Gaming Comment Toxicity** project._    

**Components**  
1. **Data Cleaning**: Data cleaning procedures including handling missing values and converting data types.    
2. **Text Preprocessing**: Text cleaning measures including text standardization, irrelevant content removal, stopwords removal, and tokenization.   
3. **Data Labeling**: Perspective API toxicity annotations and VADER/TextBlob/Empath sentiment scoring.  

**Functions**   
- **`clean(text)`**: Performs text preprocessing steps on a given document.  
- **`build_client(api_key)`**: Build a client for a given Perspective API key.  
- **`perspective_toxicity(comments)`**: Compute Perspective toxicity scores for a given list of texts. Support throttling management w/ client reuse, key rotation, and exponential backoff.   
- **`vader_sentiment(text)`**: Compute VADER sentiment scores for a given text.    
- **`textblob_sentiment(text)`**: Compute TextBlob sentiment scores for a given text.   
- **`empath_sentiment(text)`**: Compute Empath sentiment scores for a given text.   

In [1]:
# Import the libraries
import json
import random
import re
import time

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import contractions
import nltk
from nltk.corpus import stopwords

## 1. Load the Data

### Channel

In [74]:
# Load the data
channel = pd.read_csv("../data/channel.csv")
channel.head(2)

Unnamed: 0,channel_id,channel_name,channel_description,channel_country,channel_uploads_id,channel_viewcount,channel_subscribercount,channel_videocount
0,UC-lHJZR3Gqxm24_Vd_AJ5Yw,PewDiePie,I make videos.,JP,UU-lHJZR3Gqxm24_Vd_AJ5Yw,29238461440,111000000,4753
1,UC7_YxT-KID8kRbqZo7MyscQ,Markiplier,"Hi, I'm Markiplier. I make videos. \n\nFrom qu...",US,UU7_YxT-KID8kRbqZo7MyscQ,21204065899,36400000,5576


In [75]:
# Check the dimensions
print(f"Number of rows: {channel.shape[0]}\n"
      f"Number of columns: {channel.shape[1]}\n")

# Check for missing values
print(f"Number of missing values: {channel.isna().sum().sum()}")

Number of rows: 33
Number of columns: 8

Number of missing values: 0


### Video

In [76]:
# Load the data
video = pd.read_csv("../data/video.csv")
video.head(2)

Unnamed: 0,channel_id,video_id,video_title,video_creation_time,video_description,video_tags,video_viewcount,video_likecount,video_commentcount,video_subtitle_path,video_subtitle
0,UC-lHJZR3Gqxm24_Vd_AJ5Yw,F-yEoHL7MYY,I t̶r̶i̶e̶d̶ ̶t̶o̶ beat Elden Ring Without Dyi...,2022-04-30T16:40:18Z,🌏 Get exclusive NordVPN deal here ➵ https://N...,"['pewdiepie', 'pewds', 'pewdie']",11582061,473895,15142.0,../subtitle/F-yEoHL7MYY.en.json3,i have beaten all souls games without dying a ...
1,UC-lHJZR3Gqxm24_Vd_AJ5Yw,PV4NGwn_xdI,Elden Ring #1 - 0 DEATH PLAYTHROUGH,2022-02-24T15:00:10Z,NEW animated wallpapers on GX 👇👇👇\nhttps://ope...,"['pewdiepie', 'pewds', 'pewdie', 'elden ring',...",5178533,192094,4313.0,../subtitle/PV4NGwn_xdI.en.json3,ah you ready yes we're ready eldon ring baby l...


In [77]:
# Check the dimensions
print(f"Number of rows: {video.shape[0]}\n"
      f"Number of columns: {video.shape[1]}\n")

# Check for missing values
print(f"Number of missing values: {video.isna().sum().sum()}")

Number of rows: 1452
Number of columns: 11

Number of missing values: 180


### Comment

In [78]:
# Load the data
comment = pd.read_csv("../data/comment.csv")
comment.head(2)

Unnamed: 0,video_id,comment_id,comment_author_id,comment_text,comment_time,comment_likecount,comment_replycount
0,F-yEoHL7MYY,UgwN1kGXwi9M7jeOb0d4AaABAg,UCLHsZ4X7YemjxRrvq0AI4LA,"Damn dude, even with mimic I think it would ta...",2022-05-02T19:37:22Z,9892.0,47.0
1,F-yEoHL7MYY,UgwW5nWEkxdES-g3hk54AaABAg,UCJ9VDCLZDmeJIU3Branlstg,This is the pewds that I thought he’d turn int...,2022-12-14T23:36:11Z,6290.0,9.0


In [79]:
# Check the dimensions
print(f"Number of rows: {comment.shape[0]}\n"
      f"Number of columns: {comment.shape[1]}\n")

# Check for missing values
print(f"Number of missing values: {comment.isna().sum().sum()}")

Number of rows: 140107
Number of columns: 7

Number of missing values: 39


_The dataset contains **140,637 comments** collected from action and non-action gaming videos on YouTube.  It features **17 columns** on metadata associated with the videos and comments.  **877 missing entries** are detected in this dataset; in the subsequent sections, we will address these data quality concerns._    

<br>

## 2. Data Cleaning

### Handle Missing Values

_Given that the missing entries account for only **0.624%** of the dataset, we will employ the **deletion** method to handle these missings.  By eliminitating rows that contain missing values, we ensure that our analysis is based on **complete and accurate** information._    

In [8]:
# Check the missings
video.isna().sum()

channel_id              0
video_id                0
video_title             0
video_creation_time     0
video_description      17
video_tags              0
video_viewcount         0
video_likecount         0
video_commentcount      3
video_subtitle_path    80
video_subtitle         80
dtype: int64

In [9]:
# Check the missings
comment.isna().sum()

video_id               0
comment_id             3
comment_author_id      3
comment_text          12
comment_time           7
comment_likecount      7
comment_replycount     7
dtype: int64

In [10]:
# Inspect invalid entries
comment_na = comment[comment.isna().any(axis=1)]
comment_na.loc[37540:37545]

Unnamed: 0,video_id,comment_id,comment_author_id,comment_text,comment_time,comment_likecount,comment_replycount


In [11]:
# Inspect more invalid entries
comment_na.loc[91742:91743]

Unnamed: 0,video_id,comment_id,comment_author_id,comment_text,comment_time,comment_likecount,comment_replycount


In [12]:
# Remove the missings
video.dropna(inplace=True)
video.shape

(1354, 11)

In [13]:
# Remove the missings
comment.dropna(inplace=True)
comment.shape

(140091, 7)

### Convert Data Types

*Note that the **`video_creation_time`** and **`comment_time`** are represented as **objects**; since these two columns represent dates and times, we will convert them to the more appropriate type **`datetime`** for efficient anlaysis.*  

In [14]:
# Check the data types
video.dtypes

channel_id              object
video_id                object
video_title             object
video_creation_time     object
video_description       object
video_tags              object
video_viewcount          int64
video_likecount          int64
video_commentcount     float64
video_subtitle_path     object
video_subtitle          object
dtype: object

In [15]:
# Check the data types
comment.dtypes

video_id               object
comment_id             object
comment_author_id      object
comment_text           object
comment_time           object
comment_likecount     float64
comment_replycount    float64
dtype: object

In [16]:
# Convert to datetime
video["video_creation_time"] = pd.to_datetime(video["video_creation_time"])
comment["comment_time"] = pd.to_datetime(comment["comment_time"])

In [17]:
# Extract the comments
comments = comment["comment_text"]
comments[:5]

0    Damn dude, even with mimic I think it would ta...
1    This is the pewds that I thought he’d turn int...
2    This is actually awesome. Can't believe a meme...
3    Wow, didn't even know Pewds had this analytica...
4    Damn, i cant believe it took me 11 months afte...
Name: comment_text, dtype: object

In [18]:
# Extract the subtitles
subtitles = video["video_subtitle"]
subtitles[:5]

0    i have beaten all souls games without dying a ...
1    ah you ready yes we're ready eldon ring baby l...
2    but the new gta game is iron and it's not what...
3    [Music] hello good morning gamers early today ...
4    [Music] so [Music] yes welcome welcome welcome...
Name: video_subtitle, dtype: object

<br>

## 3. Feature Labeling

### Toxicity Annotations

_Acquiring the toxicity labels is crucial for analyzing toxicity in comments.  However, manually annotating nearly 140,000 comments is **impractical** given the large volume and resource limitations.  Thus, to effectively **quantify the level of toxicity** in comments, we will leverage the **[Perspective API](https://perspectiveapi.com/)** to obtain our true labels._   

**Quota Limits and Throttling Management**  

*The Perspective API, however, enforces a **[quota limit](https://developers.perspectiveapi.com/s/about-the-api-limits-and-errors?language=en_US)** of **1 query per second (QPS)** for each project.  Despite the **lack of batch processing** support, we have devised a **throttling management** strategy that incorporates **key rotation** and **exponential backoff** to efficiently manage this constraint.*    

_Our approach involves cycling through **10 different API keys** and their respective **pre-built clients**, enhancing our query capacity within the API's quota restrictions.  Furthermore, an **exponential backoff** mechanism is enforced to manage **retries** following any quota breaches or server errors.  This method will **systematically increase the delay between subsequent requests**, thereby minimizing the likelihood of succesive failures and mitigating the impact on the API server._  

_Additional features such as **logging** and **exception handling** are integrated to support **monitoring** and **troubleshooting**, facilitating a smooth and efficient data lebeling process.  These measures collectively **reduce the projected processing time** from an initial estimate of **2.26 days** to approximately **4 hours**._    

In [19]:
# Import the libraries
import itertools
import logging
from googleapiclient import discovery
from googleapiclient.errors import HttpError

In [20]:
# Configure logging to file
logging.basicConfig(
    filename="../logs/toxicity.log",
    level=logging.INFO,  # Log info, warning, error, critical
    format="%(asctime)s - %(levelname)s - %(message)s",
    filemode="w"  # Overwrite on each run
)

In [21]:
# The Perspective API keys
PERSPECTIVE_API_KEYS = [
    "AIzaSyAMpL8JpwPU4c1nEGKCiBAiGp979r6o4-4",  # perspective-api-414709
    "AIzaSyD_-Oiitvk4OL5zgvX90Nn5TcoA23TrMlM",  # perspective-api-414723
    "AIzaSyCLQ0SAdw0-xKDEqGyTcBPO7yApPF2M3R0",  # perspe-414800
    "AIzaSyDTzo_CBwQ_5zVDojWSBMnH1jI_F6rEs7s",  # precise-antenna-414801
    "AIzaSyAt70Atcrnx2bfvFuPTwtvOV8Nf2PBPx4A",  # sound-datum-414801
    "AIzaSyBgO09nuuysiO7YNqexVZiskWhJPSv5t3A",  # perspective-api-414710
    "AIzaSyBFU4rFCLaCAVuQ0i4K3QhF_f9wBV4gBm4",  # perspective-api-414800
    "AIzaSyC8kMo6iX7iXX_lj8gx8IM0LuNS8p94UA4",  # shaped-canyon-414800
    "AIzaSyAhRHCYoYkRkQkco4NzhNuKT7Zm92BKOS8",  # perspective-api-414801
    "AIzaSyCr_b9CLWmy9Rt0f0ME74ZZmh3uT6gAwpk"  # hardy-order-414801
]

def build_client(api_key):
    """
    Build a client for a given Perspective API key.
    """
    # Create a client object
    # Reference: https://developers.google.com/codelabs/setup-perspective-api#4
    client = discovery.build(
        "commentanalyzer",  # Name
        "vlalpha1",  # Version
        developerKey=api_key,
        discoveryServiceUrl="https://commentanalyzer.googleapis.com/$discovery/rest?version=v1alpha1",
        static_discovery=False
    )
    return client

# Pre-build a client for each API key
clients = {key: build_client(key) for key in PERSPECTIVE_API_KEYS}

# Set up the iterator
api_key_iterator = itertools.cycle(PERSPECTIVE_API_KEYS)

In [22]:
def split_text(text, size=40, overlap=10):
    """
    Split the given text into overlapping chunks.
    """
    chunks = []
    words = text.split()
    total_words = len(words)
    start = 0
    
    while start < total_words:
        # Set the end index
        end = start + size
        # Check if this is the last chunk
        if end >= total_words:
            # Add the last chunk and break
            chunks.append(" ".join(words[start:total_words]))
            break
        else:
            chunks.append(" ".join(words[start:end]))
        # Update the start index
        start = end - overlap
    return chunks

In [23]:
def calculate_proportions(chunks, overlap=10):
    """
    Calculate the proportion of each chunk.  
    Assign lower weight to overlapping words.
    """
    proportions = []
    n = len(chunks)

    # Calculate total number of words
    total_words = sum(len(chunk.split()) for chunk in chunks) - (n - 1) * overlap

    # If only one chunk, set proportion to 1
    if n == 1:
        proportions.append(1)
        return proportions
        
    for i, chunk in enumerate(chunks):
        count = len(chunk.split())
        # Number of overlaps for middle chunks
        overlap_count = overlap
        # Number of overlaps for first and last chunk
        if i == 0 or i == n - 1:
            overlap_count = overlap / 2
        
        # Compute the proportion
        proportions.append((count - overlap_count) / total_words)

    return proportions

In [24]:
def perspective_toxicity(texts, prefix, size=40, overlap=10, split=False):
    """
    Compute Perspective toxicity scores for a given list of texts.
    Support throttling management w/ client reuse, key rotation, and 
    exponential backoff.
    """
    # Empty list to store toxicity scores
    scores = []

    # Loop through the texts
    for index, text in enumerate(texts):  
        # Default to single chunk 
        chunks = [text]
        proportions = [1]
        
        # If text splitting is enabled
        if split:
            chunks = split_text(text, size, overlap)
            proportions = calculate_proportions(chunks, overlap)
            logging.info(f"Processing subtitle #{index}: {len(text.split())} words; {len(chunks)} chunk(s).")

        # Empty list to store chunk scores
        temp_scores = []
        
        for chunk_index, (chunk, proportion) in enumerate(zip(chunks, proportions)):
            # Specify the text and attributes
            analyze_request = {
                "comment": {"text": chunk},
                "languages": ["en"],
                "requestedAttributes": {
                    "TOXICITY": {},
                    "SEVERE_TOXICITY": {},
                    "IDENTITY_ATTACK": {},
                    "INSULT": {},
                    "PROFANITY": {},
                    "THREAT": {}}
            }
        
            # Reset attempt count for each text
            attempt = 0
            # Attempts allowed
            max_attempts = len(PERSPECTIVE_API_KEYS) * 5  # Number of keys * Attempts per key

            # While retry attempts are not exhausted
            while attempt < max_attempts:
                # Rotate to the next API key
                current_key = next(api_key_iterator)
                client = clients[current_key]

                try:
                    res = client.comments().analyze(body=analyze_request).execute()
                    s = res["attributeScores"]
                    temp_scores.append({
                        "toxicity": s["TOXICITY"]["summaryScore"]["value"] * proportion,
                        "severe_toxicity": s["SEVERE_TOXICITY"]["summaryScore"]["value"] * proportion,
                        "identity_attack": s["IDENTITY_ATTACK"]["summaryScore"]["value"] * proportion,
                        "insult": s["INSULT"]["summaryScore"]["value"] * proportion,
                        "profanity": s["PROFANITY"]["summaryScore"]["value"] * proportion,
                        "threat": s["THREAT"]["summaryScore"]["value"] * proportion,
                        "language": res["detectedLanguages"]
                    })
                    if split:
                        logging.info(f"Success for subtitle #{index} part {chunk_index + 1}/{len(chunks)} with key {current_key} on attempt {attempt + 1}")
                    else:
                        logging.info(f"Success for comment #{index} with key {current_key} on attempt {attempt + 1}")
                    # Break the loop if successful
                    break

                # Http errors
                except HttpError as e:
                    # Rate limit errors
                    if e.resp.status == 429:
                        logging.warning(f"HTTP 429 Rate limit exceeded for text #{index} with key '{current_key}' on attempt {attempt + 1}. Retrying with exponential backoff.")
                    else:
                        logging.warning(f"HTTP error for text #{index} with key '{current_key}' on attempt {attempt + 1}: {e}. Retrying with exponential backoff.")
                # Timeout errors
                except TimeoutError:
                    logging.warning(f"TimeoutError for text #{index} with key '{current_key}' on attempt {attempt + 1}. Retrying with exponential backoff.")
                # Unexpected errors
                except Exception as e:
                    logging.warning(f"Unexpected error for text #{index} with key '{current_key}' on attempt {attempt + 1}: {e}. Retrying with exponential backoff.")

                # Exponential backoff + random jitter
                sleep_time = (2 ** (attempt // len(PERSPECTIVE_API_KEYS))) + random.uniform(0, 1)
                time.sleep(sleep_time)
                attempt += 1

                # Check if all retry attempts are exhausted
                if attempt == max_attempts:
                    logging.error(f"Max attempts reached for text #{index} with key {current_key}. Moving to the next text.")
    
        # Compute the weighted average for each score
        aggregated_score = {f"{prefix}{k}": sum(temp[k] for temp in temp_scores) \
                            for k in temp_scores[0].keys() \
                            if k != "language"}
        # Use the detected languages from the first chunk
        aggregated_score[f"{prefix}language"] = temp_scores[0]["language"]
        scores.append(aggregated_score)
    
    # Convert to DataFrame
    toxicity_scores = pd.DataFrame(scores)
    
    return toxicity_scores

#### Comments

In [25]:
# Start timing
start_time = time.time()

# Compute Perspective API toxicity scores for each comment
toxicity_scores = perspective_toxicity(texts=comments, prefix="comment_", split=False)

# End timing
print(f"Runtime: {time.time() - start_time:.4f}\n")
toxicity_scores.head()

Runtime: 16782.8708



Unnamed: 0,comment_toxicity,comment_severe_toxicity,comment_identity_attack,comment_insult,comment_profanity,comment_threat,comment_language
0,0.543256,0.03411,0.028629,0.338892,0.627178,0.049451,[en]
1,0.077668,0.00412,0.011765,0.019741,0.039407,0.010991,[en]
2,0.146031,0.006599,0.009027,0.021933,0.064458,0.055594,[en]
3,0.081625,0.004139,0.010137,0.017228,0.041161,0.021127,[en]
4,0.45703,0.024115,0.028017,0.154143,0.508187,0.011379,[en]


In [26]:
# Combine into one DataFrame
for column in toxicity_scores.columns:
    comment[column] = toxicity_scores[column].values
comment.head(3)

Unnamed: 0,video_id,comment_id,comment_author_id,comment_text,comment_time,comment_likecount,comment_replycount,comment_toxicity,comment_severe_toxicity,comment_identity_attack,comment_insult,comment_profanity,comment_threat,comment_language
0,F-yEoHL7MYY,UgwN1kGXwi9M7jeOb0d4AaABAg,UCLHsZ4X7YemjxRrvq0AI4LA,"Damn dude, even with mimic I think it would ta...",2022-05-02 19:37:22+00:00,9892.0,47.0,0.543256,0.03411,0.028629,0.338892,0.627178,0.049451,[en]
1,F-yEoHL7MYY,UgwW5nWEkxdES-g3hk54AaABAg,UCJ9VDCLZDmeJIU3Branlstg,This is the pewds that I thought he’d turn int...,2022-12-14 23:36:11+00:00,6290.0,9.0,0.077668,0.00412,0.011765,0.019741,0.039407,0.010991,[en]
2,F-yEoHL7MYY,UgyGe0HN8toQWUZZtCl4AaABAg,UCs-mo1206PASdacjDqfdLng,This is actually awesome. Can't believe a meme...,2022-12-31 18:16:36+00:00,5082.0,54.0,0.146031,0.006599,0.009027,0.021933,0.064458,0.055594,[en]


In [27]:
# Check the dimensions
comment.shape

(140091, 14)

#### Subtitles

In [28]:
# Start timing
start_time = time.time()

# Compute Perspective API toxicity scores for each comment
toxicity_scores = perspective_toxicity(texts=subtitles, prefix="video_",
                                       size=100, overlap=20, split=True)

# End timing
print(f"Runtime: {time.time() - start_time:.4f}\n")
toxicity_scores.head()

Runtime: 8818.8637



Unnamed: 0,video_toxicity,video_severe_toxicity,video_identity_attack,video_insult,video_profanity,video_threat,video_language
0,0.358079,0.071938,0.056368,0.148396,0.244498,0.224883,[en]
1,0.500542,0.155937,0.098798,0.280781,0.373053,0.261945,[en]
2,0.4074,0.082654,0.078979,0.231179,0.271065,0.105283,[en]
3,0.400487,0.093281,0.070456,0.196307,0.288111,0.19028,[en]
4,0.432568,0.120564,0.077785,0.209814,0.314366,0.241215,[en]


In [29]:
# Combine into one DataFrame
for column in toxicity_scores.columns:
    video[column] = toxicity_scores[column].values
video.head(3)

Unnamed: 0,channel_id,video_id,video_title,video_creation_time,video_description,video_tags,video_viewcount,video_likecount,video_commentcount,video_subtitle_path,video_subtitle,video_toxicity,video_severe_toxicity,video_identity_attack,video_insult,video_profanity,video_threat,video_language
0,UC-lHJZR3Gqxm24_Vd_AJ5Yw,F-yEoHL7MYY,I t̶r̶i̶e̶d̶ ̶t̶o̶ beat Elden Ring Without Dyi...,2022-04-30 16:40:18+00:00,🌏 Get exclusive NordVPN deal here ➵ https://N...,"['pewdiepie', 'pewds', 'pewdie']",11582061,473895,15142.0,../subtitle/F-yEoHL7MYY.en.json3,i have beaten all souls games without dying a ...,0.358079,0.071938,0.056368,0.148396,0.244498,0.224883,[en]
1,UC-lHJZR3Gqxm24_Vd_AJ5Yw,PV4NGwn_xdI,Elden Ring #1 - 0 DEATH PLAYTHROUGH,2022-02-24 15:00:10+00:00,NEW animated wallpapers on GX 👇👇👇\nhttps://ope...,"['pewdiepie', 'pewds', 'pewdie', 'elden ring',...",5178533,192094,4313.0,../subtitle/PV4NGwn_xdI.en.json3,ah you ready yes we're ready eldon ring baby l...,0.500542,0.155937,0.098798,0.280781,0.373053,0.261945,[en]
2,UC-lHJZR3Gqxm24_Vd_AJ5Yw,CF3jK8ai0l4,New GTA Definitive Edition is hilariously bad,2021-11-19 17:15:01+00:00,Grand Theft Auto: The Trilogy is not epic bros...,"['pewdiepie', 'pewds', 'pewdie', 'Grand Theft ...",4053785,282859,9072.0,../subtitle/CF3jK8ai0l4.en.json3,but the new gta game is iron and it's not what...,0.4074,0.082654,0.078979,0.231179,0.271065,0.105283,[en]


In [30]:
# Check the dimensions
video.shape

(1354, 18)

### Sentiment Scoring

_To further investigate the **emotional dynamics** of the comments, we will generate the **sentiment scores** using **VADER**, **TextBlob**, and **Empath**.  Note that our initial analysis with Empath will concentrate on positive and negative emotions; yet if time allows, we hope to extend our examination to encompass all Empath categories in the future, aiming for a more nuanced understanding of the prevalent themes within YouTube gaming comments._  

In [31]:
# Import the libraries
from nltk.corpus import opinion_lexicon
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
from textblob import TextBlob
from empath import Empath

In [32]:
def vader_sentiment(text, prefix):
    """
    Compute VADER sentiment scores for a given text.
    """
    # Initialize the analyzer
    analyzer = SentimentIntensityAnalyzer()
    
    # Compute the scores
    scores = analyzer.polarity_scores(text)
    vader_scores = {f"{prefix}{k}": v for k, v in scores.items()}
    
    return vader_scores

In [33]:
def textblob_sentiment(text, prefix):
    """
    Compute TextBlob sentiment scores for a given text.
    """
    # Initialize the analyzer
    blob = TextBlob(text)
    
    # Compute the scores
    textblob_scores = {
        f"{prefix}polarity": blob.sentiment.subjectivity, 
        f"{prefix}subjectivity": blob.sentiment.subjectivity
    }
    
    return textblob_scores

In [34]:
def empath_sentiment(text, prefix):
    """
    Compute Empath sentiment scores for a given text.
    """
    # Initialize the analyzer
    lexicon = Empath()
    
    # Compute the scores
    categories = lexicon.analyze(text, normalize=True)
    empath_scores = {f"{prefix}{k}":v for k, v in categories.items()}
    
    return empath_scores

#### Comments

In [35]:
# Start timing
start_time = time.time()

# Compute VADER sentiment scores for each comment
vader_scores = comments.apply(lambda x: vader_sentiment(x, "comment_"))\
                       .apply(pd.Series)

# End timing
print(f"Runtime: {time.time() - start_time:.4f}\n")
vader_scores.head()

Runtime: 857.4708



Unnamed: 0,comment_neg,comment_neu,comment_pos,comment_compound
0,0.087,0.743,0.17,0.3707
1,0.0,0.868,0.132,0.5859
2,0.086,0.613,0.301,0.7906
3,0.0,0.698,0.302,0.9358
4,0.029,0.813,0.158,0.8761


In [36]:
# Compute TextBlob sentiment scores for each comment
textblob_scores = comments.apply(lambda x: textblob_sentiment(x, "comment_"))\
                          .apply(pd.Series)
textblob_scores.head()

Unnamed: 0,comment_polarity,comment_subjectivity
0,0.4,0.4
1,0.345238,0.345238
2,0.583333,0.583333
3,0.56,0.56
4,0.675,0.675


In [37]:
# Start timing
start_time = time.time()

# Compute Empath sentiment scores for each comment
empath_scores = comments.apply(lambda x: empath_sentiment(x, "comment_"))\
                        .apply(pd.Series)

# End timing
print(f"Runtime: {time.time() - start_time:.4f}\n")
empath_scores.head()

Runtime: 1532.8357



Unnamed: 0,comment_help,comment_office,comment_dance,comment_money,comment_wedding,comment_domestic_work,comment_sleep,comment_medical_emergency,comment_cold,comment_hate,...,comment_weapon,comment_children,comment_monster,comment_ocean,comment_giving,comment_contentment,comment_writing,comment_rural,comment_positive_emotion,comment_musical
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.038462,0.0,0.0,0.038462,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.027027,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.027027
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.018519,0.0


In [38]:
# Combine into one DataFrame
comment = pd.concat([comment, vader_scores, textblob_scores, empath_scores], axis=1)
comment.head(3)

Unnamed: 0,video_id,comment_id,comment_author_id,comment_text,comment_time,comment_likecount,comment_replycount,comment_toxicity,comment_severe_toxicity,comment_identity_attack,...,comment_weapon,comment_children,comment_monster,comment_ocean,comment_giving,comment_contentment,comment_writing,comment_rural,comment_positive_emotion,comment_musical
0,F-yEoHL7MYY,UgwN1kGXwi9M7jeOb0d4AaABAg,UCLHsZ4X7YemjxRrvq0AI4LA,"Damn dude, even with mimic I think it would ta...",2022-05-02 19:37:22+00:00,9892.0,47.0,0.543256,0.03411,0.028629,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,F-yEoHL7MYY,UgwW5nWEkxdES-g3hk54AaABAg,UCJ9VDCLZDmeJIU3Branlstg,This is the pewds that I thought he’d turn int...,2022-12-14 23:36:11+00:00,6290.0,9.0,0.077668,0.00412,0.011765,...,0.0,0.0,0.0,0.0,0.0,0.038462,0.0,0.0,0.038462,0.0
2,F-yEoHL7MYY,UgyGe0HN8toQWUZZtCl4AaABAg,UCs-mo1206PASdacjDqfdLng,This is actually awesome. Can't believe a meme...,2022-12-31 18:16:36+00:00,5082.0,54.0,0.146031,0.006599,0.009027,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [39]:
# Check the dimensions
comment.shape

(140091, 214)

In [40]:
# Write to CSV
comment.to_csv("../data/comment-labeled.csv", index=False)

#### Subtitles

In [41]:
# Start timing
start_time = time.time()

# Compute VADER sentiment scores for each subtitle
vader_scores = subtitles.apply(lambda x: vader_sentiment(x, "video_"))\
                        .apply(pd.Series)

# End timing
print(f"Runtime: {time.time() - start_time:.4f}\n")
vader_scores.head()

Runtime: 2966.0401



Unnamed: 0,video_neg,video_neu,video_pos,video_compound
0,0.134,0.723,0.143,0.9951
1,0.143,0.668,0.19,0.9976
2,0.105,0.66,0.235,0.9997
3,0.112,0.705,0.183,1.0
4,0.109,0.706,0.185,1.0


In [42]:
# Compute TextBlob sentiment scores for each subtitle
textblob_scores = subtitles.apply(lambda x: textblob_sentiment(x, "video_"))\
                          .apply(pd.Series)
textblob_scores.head()

Unnamed: 0,video_polarity,video_subjectivity
0,0.495382,0.495382
1,0.576037,0.576037
2,0.560633,0.560633
3,0.514297,0.514297
4,0.508676,0.508676


In [43]:
# Start timing
start_time = time.time()

# Compute Empath sentiment scores for each subtitle
empath_scores = subtitles.apply(lambda x: empath_sentiment(x, "video_"))\
                        .apply(pd.Series)

# End timing
print(f"Runtime: {time.time() - start_time:.4f}\n")
empath_scores.head()

Runtime: 16.1475



Unnamed: 0,video_help,video_office,video_dance,video_money,video_wedding,video_domestic_work,video_sleep,video_medical_emergency,video_cold,video_hate,...,video_weapon,video_children,video_monster,video_ocean,video_giving,video_contentment,video_writing,video_rural,video_positive_emotion,video_musical
0,0.001459,0.001571,0.001459,0.001795,0.001795,0.000561,0.001346,0.004488,0.002468,0.004937,...,0.008527,0.005834,0.001122,0.000337,0.004039,0.000673,0.000561,0.000785,0.007966,0.001683
1,0.001245,0.001245,0.000622,0.001867,0.002489,0.0,0.001867,0.002489,0.004978,0.001867,...,0.006223,0.011201,0.001245,0.006223,0.004356,0.0,0.000622,0.001245,0.011201,0.001867
2,0.0,0.000783,0.000783,0.003915,0.0,0.000783,0.0,0.000783,0.000783,0.01018,...,0.001566,0.004699,0.002349,0.001566,0.002349,0.000783,0.000783,0.0,0.005482,0.005482
3,0.001393,0.001393,8.7e-05,0.002699,0.002177,0.001654,0.001219,0.00148,0.004179,0.002699,...,0.007314,0.00862,0.000871,0.001741,0.007662,0.000871,0.000261,0.001741,0.00714,0.001393
4,0.001107,0.001006,0.000302,0.00513,0.001609,0.001509,0.001107,0.001207,0.002917,0.003823,...,0.012272,0.006136,0.00171,0.008349,0.008651,0.000604,0.000503,0.001006,0.005432,0.000402


In [44]:
# Combine into one DataFrame
video = pd.concat([video, vader_scores, textblob_scores, empath_scores], axis=1)
video.head(3)

Unnamed: 0,channel_id,video_id,video_title,video_creation_time,video_description,video_tags,video_viewcount,video_likecount,video_commentcount,video_subtitle_path,...,video_weapon,video_children,video_monster,video_ocean,video_giving,video_contentment,video_writing,video_rural,video_positive_emotion,video_musical
0,UC-lHJZR3Gqxm24_Vd_AJ5Yw,F-yEoHL7MYY,I t̶r̶i̶e̶d̶ ̶t̶o̶ beat Elden Ring Without Dyi...,2022-04-30 16:40:18+00:00,🌏 Get exclusive NordVPN deal here ➵ https://N...,"['pewdiepie', 'pewds', 'pewdie']",11582061,473895,15142.0,../subtitle/F-yEoHL7MYY.en.json3,...,0.008527,0.005834,0.001122,0.000337,0.004039,0.000673,0.000561,0.000785,0.007966,0.001683
1,UC-lHJZR3Gqxm24_Vd_AJ5Yw,PV4NGwn_xdI,Elden Ring #1 - 0 DEATH PLAYTHROUGH,2022-02-24 15:00:10+00:00,NEW animated wallpapers on GX 👇👇👇\nhttps://ope...,"['pewdiepie', 'pewds', 'pewdie', 'elden ring',...",5178533,192094,4313.0,../subtitle/PV4NGwn_xdI.en.json3,...,0.006223,0.011201,0.001245,0.006223,0.004356,0.0,0.000622,0.001245,0.011201,0.001867
2,UC-lHJZR3Gqxm24_Vd_AJ5Yw,CF3jK8ai0l4,New GTA Definitive Edition is hilariously bad,2021-11-19 17:15:01+00:00,Grand Theft Auto: The Trilogy is not epic bros...,"['pewdiepie', 'pewds', 'pewdie', 'Grand Theft ...",4053785,282859,9072.0,../subtitle/CF3jK8ai0l4.en.json3,...,0.001566,0.004699,0.002349,0.001566,0.002349,0.000783,0.000783,0.0,0.005482,0.005482


In [45]:
# Check the dimensions
video.shape

(1354, 218)

In [46]:
# Write to CSV
video.to_csv("../data/video-labeled.csv", index=False)

#### All YouTube Data

In [72]:
# Load the data
yt_all = pd.read_csv("../data/yt.csv")
yt_all.head(2)

Unnamed: 0,channel_id,channel_name,channel_description,channel_country,channel_uploads_id,channel_viewcount,channel_subscribercount,channel_videocount,video_id,video_title,...,video_likecount,video_commentcount,video_subtitle_path,comment_id,comment_author_id,comment_text,comment_time,comment_likecount,comment_replycount,genre
0,UC-lHJZR3Gqxm24_Vd_AJ5Yw,PewDiePie,I make videos.,JP,UU-lHJZR3Gqxm24_Vd_AJ5Yw,29238460000.0,111000000.0,4753.0,F-yEoHL7MYY,I t̶r̶i̶e̶d̶ ̶t̶o̶ beat Elden Ring Without Dyi...,...,473895.0,15142.0,../subtitle/F-yEoHL7MYY.en.json3,UgwN1kGXwi9M7jeOb0d4AaABAg,UCLHsZ4X7YemjxRrvq0AI4LA,"Damn dude, even with mimic I think it would ta...",2022-05-02T19:37:22Z,9892.0,47.0,action
1,UC-lHJZR3Gqxm24_Vd_AJ5Yw,PewDiePie,I make videos.,JP,UU-lHJZR3Gqxm24_Vd_AJ5Yw,29238460000.0,111000000.0,4753.0,F-yEoHL7MYY,I t̶r̶i̶e̶d̶ ̶t̶o̶ beat Elden Ring Without Dyi...,...,473895.0,15142.0,../subtitle/F-yEoHL7MYY.en.json3,UgwW5nWEkxdES-g3hk54AaABAg,UCJ9VDCLZDmeJIU3Branlstg,This is the pewds that I thought he’d turn int...,2022-12-14T23:36:11Z,6290.0,9.0,action


In [73]:
# Merge channel, video, and comments information
channel_videos = pd.merge(channel, video.drop(columns=["video_subtitle"]), 
                          on="channel_id", how="left")
yt = pd.merge(channel_videos, comment, on="video_id", how="left")
yt.head(3)

Unnamed: 0,channel_id,channel_name,channel_description,channel_country,channel_uploads_id,channel_viewcount,channel_subscribercount,channel_videocount,video_id,video_title,...,comment_ocean,comment_giving,comment_contentment,comment_writing,comment_rural,comment_positive_emotion,comment_musical,comment_cleaned,comment_tokenized,video_genre
0,UC-lHJZR3Gqxm24_Vd_AJ5Yw,PewDiePie,I make videos.,JP,UU-lHJZR3Gqxm24_Vd_AJ5Yw,29238461440,111000000,4753,F-yEoHL7MYY,I t̶r̶i̶e̶d̶ ̶t̶o̶ beat Elden Ring Without Dyi...,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,damn dude even mimic think would take approxim...,"[damn, dude, even, mimic, think, would, take, ...",action
1,UC-lHJZR3Gqxm24_Vd_AJ5Yw,PewDiePie,I make videos.,JP,UU-lHJZR3Gqxm24_Vd_AJ5Yw,29238461440,111000000,4753,F-yEoHL7MYY,I t̶r̶i̶e̶d̶ ̶t̶o̶ beat Elden Ring Without Dyi...,...,0.0,0.0,0.038462,0.0,0.0,0.038462,0.0,pewds thought would turn gaming early channel ...,"[pewds, thought, would, turn, gaming, early, c...",action
2,UC-lHJZR3Gqxm24_Vd_AJ5Yw,PewDiePie,I make videos.,JP,UU-lHJZR3Gqxm24_Vd_AJ5Yw,29238461440,111000000,4753,F-yEoHL7MYY,I t̶r̶i̶e̶d̶ ̶t̶o̶ beat Elden Ring Without Dyi...,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,actually awesome cannot believe meme became tr...,"[actually, awesome, can, not, believe, meme, b...",action


In [48]:
# Check the dimensions
yt.shape

(132592, 437)

In [49]:
# Write to CSV
yt.to_csv("../data/yt-labeled.csv", index=False)

<br>

## 4. Text Preprocessing

### Filter English Comments

_To **align** our analysis with the interests of the English-speaking YouTube gaming community, we intend to employ the **[spacy-langdetect](https://pypi.org/project/spacy-langdetect/)** tool to **filter our dataset for English comments** only.  However, our initial attempt to implement a code solution from SpaCy's documentation was unsuccessful; if time permits, we will explore alternative methods to isolate English comments for our analysis._    

In [50]:
# Inspect non-English videos
len(video[video["video_language"].apply(lambda x: "en" not in x)])

0

In [51]:
# Remove non-English videos
video = video[video["video_language"].apply(lambda x: "en" in x)]

In [52]:
# Check the dimensions
video.shape

(1354, 218)

In [53]:
# Inspect non-English comments
len(comment[comment["comment_language"].apply(lambda x: "en" not in x)])

3837

In [54]:
# Inspect non-English comments
comment[comment["comment_language"].apply(lambda x: "en" not in x)].head()

Unnamed: 0,video_id,comment_id,comment_author_id,comment_text,comment_time,comment_likecount,comment_replycount,comment_toxicity,comment_severe_toxicity,comment_identity_attack,...,comment_weapon,comment_children,comment_monster,comment_ocean,comment_giving,comment_contentment,comment_writing,comment_rural,comment_positive_emotion,comment_musical
903,Dy3ege2nD6M,Ugw_KYxpcQBlXGdN2Oh4AaABAg,UCEU6y10PTmsVpLPmutO4L6g,Seruuu berasa ikut naik motor... Yeeyyy,2020-12-14 01:35:53+00:00,0.0,0.0,0.010744,0.000854,0.001119,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1002,iROAUHfCrnU,UgzlNvwjnXvpxbcs8zt4AaABAg,UCEU6y10PTmsVpLPmutO4L6g,Wih keren juga nih game,2020-12-20 10:55:30+00:00,0.0,0.0,0.004901,0.000529,0.000906,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1075,iROAUHfCrnU,UgwNZwmVT5yCLM8As-V4AaABAg,UC1R2DBFNbKrX5aQ-yRDW2fA,Me sorprende que de 108 M de suscriptores solo...,2020-12-21 02:22:13+00:00,5.0,3.0,0.004335,0.000269,0.000666,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.058824
1115,zQNPaRcvivg,UgxsqCdR_tl495HFfLJ4AaABAg,UCEU6y10PTmsVpLPmutO4L6g,gamenya kok serem sih.. Abis tembak tembakan s...,2020-12-15 16:14:04+00:00,0.0,0.0,0.20033,0.011597,0.007326,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1166,zQNPaRcvivg,UgxAD47-s87bAZtI0Yp4AaABAg,UCT2TIET5hixQiVMP-rqFIcA,36:23,2020-12-15 07:26:31+00:00,516.0,14.0,0.03192,0.002785,0.007363,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [55]:
# Remove non-English comments
comment = comment[comment["comment_language"].apply(lambda x: "en" in x)]

In [56]:
# Check the dimensions
comment.shape

(136254, 214)

### Text Cleaning

_To preserve the **most relevant information**, we will undertake a series of text preprocessing steps to refine our corpus for analysis._  

_This initial step involves **text standardization** to ensure that the text will be **consistently understood** by analytical tools.  All texts will be converted to **lowercase**; **contractions** will be expanded to their full forms using the [`contractions`](https://pypi.org/project/contractions/) library._  

_Next, we will **remove the URLs, mentions, hashtags, and non-alphabetic characters** to eliminate the noise in data.  Common English **stopwords** will also be removed as they do not possess significant information.  Note the **potential caveat** in this procedure: the elimination of these elements could result in loss of certain nuances in text._  

In [57]:
# Function for text preprocessing
def clean(text):
    """
    Performs text preprocessing steps on a given document.
    """
    # Convert to lowercase
    text = text.lower()
    # Remove contractions
    text = contractions.fix(text)
    
    # Remove URLs
    text = re.sub(r"http\S+", "", text)
    # Remove mentions
    text = re.sub(r"(?<![@\w])@(\w{1,25})", "", text)
    # Remove hashtags
    text = re.sub(r"(?<![#\w])#(\w{1,25})", "", text)
    # Remove new line characters
    text = re.sub("\n", " ", text)
    
    # Remove non-alphabetic characters
    text = re.sub(r"[^a-zA-Z\s]", "", text)
    
    # Remove extra spaces
    text = re.sub(r"\s+", " ", text)
    
    # Remove stop words
    stop_words = set(stopwords.words("english"))
    text = " ".join([word for word in text.split() if word not in stop_words])
    
    return text

In [58]:
# Clean the comments
print(comments[0])
comments_cleaned = comments.apply(clean)
print(comments_cleaned[0])

Damn dude, even with mimic I think it would take me approximately 3 lifetimes to complete a no death run... you really are a gamer god.
damn dude even mimic think would take approximately lifetimes complete death run really gamer god


In [59]:
# Clean the subtitles
subtitles_cleaned = subtitles.apply(clean)

### Tokenization

*Using `word_tokenizer`, we will **tokenize** the text into smaller pieces.  This process will be crucial for **analyzing term frequency** or **identifying common themes** within the corpous as the analysis progresses.*  

In [60]:
# Import the libraries
from nltk.tokenize import word_tokenize

#### Comments

In [61]:
# Tokenize the comments
comments_tokenized = comments_cleaned.apply(word_tokenize)
comments_tokenized[0]

['damn',
 'dude',
 'even',
 'mimic',
 'think',
 'would',
 'take',
 'approximately',
 'lifetimes',
 'complete',
 'death',
 'run',
 'really',
 'gamer',
 'god']

In [62]:
# Combine into one DataFrame
comment["comment_cleaned"] = comments_cleaned
comment["comment_tokenized"] = comments_tokenized
comment.head(3)

Unnamed: 0,video_id,comment_id,comment_author_id,comment_text,comment_time,comment_likecount,comment_replycount,comment_toxicity,comment_severe_toxicity,comment_identity_attack,...,comment_monster,comment_ocean,comment_giving,comment_contentment,comment_writing,comment_rural,comment_positive_emotion,comment_musical,comment_cleaned,comment_tokenized
0,F-yEoHL7MYY,UgwN1kGXwi9M7jeOb0d4AaABAg,UCLHsZ4X7YemjxRrvq0AI4LA,"Damn dude, even with mimic I think it would ta...",2022-05-02 19:37:22+00:00,9892.0,47.0,0.543256,0.03411,0.028629,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,damn dude even mimic think would take approxim...,"[damn, dude, even, mimic, think, would, take, ..."
1,F-yEoHL7MYY,UgwW5nWEkxdES-g3hk54AaABAg,UCJ9VDCLZDmeJIU3Branlstg,This is the pewds that I thought he’d turn int...,2022-12-14 23:36:11+00:00,6290.0,9.0,0.077668,0.00412,0.011765,...,0.0,0.0,0.0,0.038462,0.0,0.0,0.038462,0.0,pewds thought would turn gaming early channel ...,"[pewds, thought, would, turn, gaming, early, c..."
2,F-yEoHL7MYY,UgyGe0HN8toQWUZZtCl4AaABAg,UCs-mo1206PASdacjDqfdLng,This is actually awesome. Can't believe a meme...,2022-12-31 18:16:36+00:00,5082.0,54.0,0.146031,0.006599,0.009027,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,actually awesome cannot believe meme became tr...,"[actually, awesome, can, not, believe, meme, b..."


In [63]:
# Remove the missings
comment.dropna(inplace=True)
comment.shape

(136254, 216)

In [64]:
# Write to CSV
comment.to_csv("../data/comment-cleaned.csv", index=False)

#### Subtitles

In [65]:
# Tokenize the subtitles
subtitles_tokenized = subtitles_cleaned.apply(word_tokenize)

In [66]:
# Combine into one DataFrame
video["video_subtitle_cleaned"] = subtitles_cleaned
video["video_subtitle_tokenized"] = subtitles_tokenized
video.head(3)

Unnamed: 0,channel_id,video_id,video_title,video_creation_time,video_description,video_tags,video_viewcount,video_likecount,video_commentcount,video_subtitle_path,...,video_monster,video_ocean,video_giving,video_contentment,video_writing,video_rural,video_positive_emotion,video_musical,video_subtitle_cleaned,video_subtitle_tokenized
0,UC-lHJZR3Gqxm24_Vd_AJ5Yw,F-yEoHL7MYY,I t̶r̶i̶e̶d̶ ̶t̶o̶ beat Elden Ring Without Dyi...,2022-04-30 16:40:18+00:00,🌏 Get exclusive NordVPN deal here ➵ https://N...,"['pewdiepie', 'pewds', 'pewdie']",11582061,473895,15142.0,../subtitle/F-yEoHL7MYY.en.json3,...,0.001122,0.000337,0.004039,0.000673,0.000561,0.000785,0.007966,0.001683,beaten souls games without dying single time m...,"[beaten, souls, games, without, dying, single,..."
1,UC-lHJZR3Gqxm24_Vd_AJ5Yw,PV4NGwn_xdI,Elden Ring #1 - 0 DEATH PLAYTHROUGH,2022-02-24 15:00:10+00:00,NEW animated wallpapers on GX 👇👇👇\nhttps://ope...,"['pewdiepie', 'pewds', 'pewdie', 'elden ring',...",5178533,192094,4313.0,../subtitle/PV4NGwn_xdI.en.json3,...,0.001245,0.006223,0.004356,0.0,0.000622,0.001245,0.011201,0.001867,ah ready yes ready eldon ring baby likely unli...,"[ah, ready, yes, ready, eldon, ring, baby, lik..."
2,UC-lHJZR3Gqxm24_Vd_AJ5Yw,CF3jK8ai0l4,New GTA Definitive Edition is hilariously bad,2021-11-19 17:15:01+00:00,Grand Theft Auto: The Trilogy is not epic bros...,"['pewdiepie', 'pewds', 'pewdie', 'Grand Theft ...",4053785,282859,9072.0,../subtitle/CF3jK8ai0l4.en.json3,...,0.002349,0.001566,0.002349,0.000783,0.000783,0.0,0.005482,0.005482,new gta game iron whatever one wanted new gta ...,"[new, gta, game, iron, whatever, one, wanted, ..."


In [67]:
# Check the dimensions
video.dropna(inplace=True)
video.shape

(1354, 220)

In [68]:
# Write to CSV
video.to_csv("../data/video-cleaned.csv", index=False)

#### All YouTube Data

In [80]:
# Merge channel, video, and comments information
channel_videos = pd.merge(
    channel, 
    video.drop(columns=["video_subtitle", "video_subtitle_cleaned", "video_subtitle_tokenized"]),       
    on="channel_id", 
    how="left"
)
yt = pd.merge(channel_videos, comment, on="video_id", how="left")
yt.head(3)

KeyError: "['video_subtitle_cleaned', 'video_subtitle_tokenized'] not found in axis"

In [70]:
# Check the dimensions
yt.shape

(129011, 439)

In [71]:
# Write to CSV
yt.to_csv("../data/yt-cleaned.csv", index=False)

_The labeled dataset contains **138,996 rows** and **33 columns**.  In `03-preliminary anlaysis`, we will begin to explore the dataset, examining its **distribution** through **exploratory data analysis** and **visualizations**._   