# IMT 547 Final Project: Phase 2

Team 1: Chesie Yu, Hongfan Lu, Bella Wei

02/27/2024

## Project Overview: YouTube Gaming Comments Toxicity

**Team name**: _Team 1_  
**Team members**: _Chesie Yu, Hongfan Lu, Bella Wei_  

**Problem Description**:  
Toxicity in the gaming community is a prevalent problem that hinders the harmonious development of the gaming industry. Our objective is to tackle this concern by exploring whether the game category (Action and Non-Action) serves as a primary determinant of toxicity levels in YouTube video comments. This study focuses on the observational perspective rather than the player angle. If proven, it can offer valuable insights for gaming community management, game design, and the design of social media platforms.  

**Initial RQ**:
- Do videos of action games attract significantly more toxic comments than videos of non-action games on YouTube?  
- Is game category a primary predictor of toxic comments on YouTube?  

**Initial Findings in Phase 1**:
- Videos of action games attract slightly more toxic comments than videos of non-action games, but the effect is not very strong. 
- Game category is not a primary predictor of toxic comments on YouTube.


___

## Project Progress in Phase 2

## Qualitative:

### Question, Problem, Hypothesis, Claim, Context, Motivation    
**Question**: 
- Do videos with higher toxicity in the transcript trigger more toxic comments?
- Do YouTubers who frequently use toxic language elicit more toxic comments?
- What are the most common topics in toxic videos compared to harmonious videos?

**Hypothesis**:  
- Videos with higher toxicity in the transcript are likely to elicit more toxic comments.
- YouTubers who use more toxic language are likely to trigger more toxic comments.
- Topics in toxic videos are expected to involve more violence, war, and fights compared to harmonious videos.

**Motivation**:  
In this phase, we focus on investigating the relationship between video content and the comment section. If we can identify patterns or causal relationships in the video content and user interactions within the comment section, we can offer insightful suggestions to social media platforms on how to prevent toxic comments. This could involve censoring potentially harmful video content before it leads to toxic interactions and has a widespread negative impact.

### Definitions, Data, Methods to be Used  

**Data**:  

We scraped videos from 33 English-speaking YouTube channels drawn from a top-100 list. For each channel, we selected 30 videos of action games and 30 videos of non-action games. In Phase 2, we also retrieved the transcript of each video for further analysis and comparison.


**Methods**:  

We use the yt-dlp package to retrieve YouTube video transcripts. We then split each lengthy transcript into small overlapping chunks to preserve context. Each chunk is scored with the Perspective API, and the chunk scores are combined using different weights for different parts of the transcript.


### Assumptions, Biases  

**Assumptions**:
- YouTube comments reflect free speech and are not heavily filtered.   
- YouTube transcripts are complete and not heavily filtered.

**Biases**:
- We use a predefined keyword list to fetch videos, which may not be fully representative.  
- We retrieve 100 comments per video in YouTube's default order, which may introduce bias.
- We found that the toxicity in the transcript is much higher than in the comment section. This difference may be attributed to the fact that oral language is more prone to being toxic than written comments. Additionally, some language used during gameplay may not necessarily imply bad intentions and could be less serious in nature.

## Quantitative: (Procedural Section & New things we do)

### Data Collection Procedure  

Data science is an **iterative process**. :) This phase of our project introduces several **adjustments to our original data collection procedures**, **improving the accuracy** and **broadening the scope** of our investigation.  Here is an overview of the changes we have implemented: 

**Selection Criteria**  

- **Keyword List Revision**: The keyword **"batman"** was removed from our list of action games, to **eliminate confusion** and **mitigate misclassification**, as "batman" could refer to the character rather than the game.   

- **Channel Selection Modification**: From our initial list curated from [SocialBook's Top 100 Gaming YouTubers](https://socialbook.io/youtube-channel-rank/top-100-gaming-youtubers), we have **excluded one channel** that we identified as not predominantly focusing on video gaming content, ensuring our focus on the **English-speaking gaming community**.  

**Additional Features**  
- **Channel Features**: We collected an **expanded set of channel features**: `["channel_id", "channel_name", "channel_description", "channel_country", "channel_uploads_id", "channel_viewcount", "channel_subscribercount", "channel_video_count"]`  

- **Video Features**: **New video attributes** have been included: `["video_duration", "video_live"]`, offering insights into video length and whether the content was streamed live.     

- **Video Subtitle**: We added a new function, **`get_video_subtitle`**, to the existing data collection pipeline.  By utilizing **`yt-dlp`** to download **automatic English subtitles** for all videos, we aim to investigate the correlation between **video language and comment toxicity**.  
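
The subtitle step could look roughly like the sketch below. The function name `get_video_subtitle` comes from the pipeline, but its body, its parameters, the helper `build_subtitle_options`, and the `subtitles` output folder are assumptions made for illustration; only the `yt-dlp` option keys are taken from that library.

```python
from pathlib import Path


def build_subtitle_options(output_dir: str) -> dict:
    """Return yt-dlp options that fetch only automatic English subtitles."""
    return {
        "skip_download": True,       # do not download the video itself
        "writeautomaticsub": True,   # request auto-generated captions
        "subtitleslangs": ["en"],    # English only
        "subtitlesformat": "vtt",    # WebVTT output
        "outtmpl": str(Path(output_dir) / "%(id)s.%(ext)s"),
        "quiet": True,
    }


def get_video_subtitle(video_id: str, output_dir: str = "subtitles") -> Path:
    """Download automatic English subtitles for one video (needs network access)."""
    # Imported lazily so the options helper can be used without yt-dlp installed.
    from yt_dlp import YoutubeDL

    url = f"https://www.youtube.com/watch?v={video_id}"
    with YoutubeDL(build_subtitle_options(output_dir)) as ydl:
        ydl.download([url])
    # With this output template, subtitle files are named <id>.<lang>.<format>.
    return Path(output_dir) / f"{video_id}.en.vtt"
```

Setting `skip_download` keeps the run lightweight, since only the caption file is needed for transcript analysis.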

**Performance Improvement**  

- **Batch Processing**: We introduced batch processing in **channel and video information retrieval**, through `channels().list()` and `videos().list()` methods, **reducing the number of API calls** required and thereby **lowering quota usage from 49.25% to 34.75%**.  

- **Additional Logging**: We have implemented additional logging mechanisms to facilitate the debugging process, ensuring a smoother data collection workflow.  
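
A minimal sketch of the batching idea, assuming a `googleapiclient` service object named `youtube` and a batch size of 50 IDs per `videos().list()` call. The helper names `batched` and `fetch_video_items` are hypothetical and not part of the project code.

```python
from typing import Iterator, List

MAX_IDS_PER_CALL = 50  # assumed per-call ID limit for list requests


def batched(ids: List[str], size: int = MAX_IDS_PER_CALL) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` IDs."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def fetch_video_items(youtube, video_ids: List[str]) -> List[dict]:
    """Fetch video resources with one API call per batch instead of one per video."""
    items: List[dict] = []
    for batch in batched(video_ids):
        response = youtube.videos().list(
            part="snippet,contentDetails,statistics",
            id=",".join(batch),  # comma-separated IDs in a single request
        ).execute()
        items.extend(response.get("items", []))
    return items
```

With 120 video IDs, this issues 3 requests rather than 120, which is where the quota saving comes from.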

**Bug Fixes**  

- **Video Statistics Imputation**: We noticed that video statistics that were unavailable were **incorrectly imputed as "0"**.  This practice affected only a tiny fraction of videos, yet it could lead to **misrepresentation of the actual engagement levels**.  We corrected this error to ensure accuracy in our data representation.     
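
The imputation fix can be illustrated with a small sketch. It assumes the statistics arrive as a dictionary of string values in which unavailable fields are simply absent; the helper `parse_stat` and the column names `video_viewcount` and `video_commentcount` are hypothetical.

```python
from typing import Optional


def parse_stat(statistics: dict, key: str) -> Optional[int]:
    """Return the statistic as an int, or None when it is unavailable."""
    value = statistics.get(key)
    # A missing field stays missing; only a real "0" becomes 0.
    return int(value) if value is not None else None


# Toy example: the comment count is absent from the response.
stats = {"viewCount": "1523", "likeCount": "40"}
row = {
    "video_viewcount": parse_stat(stats, "viewCount"),
    "video_commentcount": parse_stat(stats, "commentCount"),
}
```

Keeping the value as `None` lets later steps tell "no data" apart from a genuine zero when computing engagement.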

The final dataset consists of **136,463 comments** and **1,407** unique videos, encompassing **25 channel, video, and comment features**.  In the subsequent section, we will list the updates in preprocessing steps that addressed data quality concerns.  

### Data Preprocessing

_To ensure the **integrity and relevance** of our data, we have also refined our preprocessing methodologies and procedures.  Below is an outline of the significant updates made during this phase:_    

**Data Cleaning**

- **ISO 8601 Duration**: Video durations, coded in ISO 8601 format, were parsed into the **total number of seconds** to facilitate the analysis.  

- **Invalid Entries**: We identified a few invalid entries due to **improper handling of carriage returns (\r)** when writing to CSV.  These were cleaned up in the process of removing missing data.  

- **Duplicate Entries**: We identified 4 videos that were categorized as both action and non-action in the previous data.  Here we introduced a **check for duplicates to prevent overlapping genres**.   

- **Sparse Channels**: We intended to remove channels with a **very low count of videos** to **avoid bias in channel-level analyses**.  However, we chose to retain these at this stage to explore interesting patterns from certain channels (IShowSpeed) that would have otherwise been removed.   
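
As an illustration of the duration parsing step, here is a minimal sketch that converts durations such as `PT1H2M3S` into seconds. The function name `duration_to_seconds` is hypothetical, and the pattern assumes only day, hour, minute, and second components appear (no weeks, months, or fractional values).

```python
import re

# Matches e.g. "PT45S", "PT1H2M3S", "P1DT2H"; all components are optional.
_DURATION = re.compile(
    r"^P(?:(?P<days>\d+)D)?"
    r"(?:T(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?(?:(?P<seconds>\d+)S)?)?$"
)


def duration_to_seconds(duration: str) -> int:
    """Convert an ISO 8601 duration string into a total number of seconds."""
    match = _DURATION.match(duration)
    if match is None:
        raise ValueError(f"Unsupported ISO 8601 duration: {duration!r}")
    parts = {name: int(value or 0) for name, value in match.groupdict().items()}
    return (
        parts["days"] * 86400
        + parts["hours"] * 3600
        + parts["minutes"] * 60
        + parts["seconds"]
    )
```

For example, `duration_to_seconds("PT1H2M3S")` evaluates to `3723`.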

**Feature Engineering**   

We have derived some new features by **transforming existing data**:  
- **`video_game`**: Identifies the **game mentioned** in the video title.  

- **`video_blocked_wordcount`, `video_blocked_proportion`**: Counts and proportion of **censored words in video subtitles**.  We noticed that **YouTube automatically blocks certain profanities** - given that this would likely **systematically lower perceived video toxicity scores**, we extracted these metrics as a supplementary dimension to video toxicity assessment.  

- **`video_speed`**: Calculates the words per second within videos.  
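
The subtitle-derived features could be computed along these lines. This is only a sketch: the marker string `[ __ ]` is an assumption about how censored words appear in the automatic subtitles, and the helper names are hypothetical.

```python
from typing import Optional

BLOCKED_MARKER = "[ __ ]"  # assumption: placeholder shown for censored words


def count_words(subtitle: str, marker: str = BLOCKED_MARKER) -> int:
    """Count whitespace-separated words, treating each marker as one word."""
    return len(subtitle.replace(marker, " BLOCKED ").split())


def blocked_word_features(subtitle: str, marker: str = BLOCKED_MARKER) -> dict:
    """Count censored words and their share of all words in the subtitle."""
    blocked = subtitle.count(marker)
    total = count_words(subtitle, marker)
    return {
        "video_blocked_wordcount": blocked,
        "video_blocked_proportion": blocked / total if total else 0.0,
    }


def video_speed(subtitle: str, duration_seconds: Optional[float]) -> Optional[float]:
    """Words per second; None when the duration is missing or zero."""
    if not duration_seconds:
        return None
    return count_words(subtitle) / duration_seconds
```

Returning `None` for a missing duration avoids a division by zero and keeps the value distinguishable from a genuinely slow video.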

**Content Labeling: Toxicity Annotation**   
- **Video Toxicity**: We have added annotations for video transcripts to assess toxicity levels.  However, Perspective API is **unable to process text longer than 20480 bytes**; moreover, it works best for **comment-length documents**, as it was originally trained on comments.  To overcome this issue, we implemented two additional functions.  In **`split_text`**, we split the transcript into **overlapping chunks of 100 words**, **preserving the context** and accounting for the fact that these **chunks are not independent**.  In **`calculate_proportion`**, we compute the **weight** of each chunk, assigning **lower weights to overlapping words**.  We then take the **weighted average** of the chunk scores as the **"video toxicity"**.  

- **Raw Text**: Taking into account the feedback from phase 1, we applied the Perspective API **directly on raw texts** to **retain the original context**.  
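
The chunk-and-weight scheme for video toxicity can be sketched as follows. The names `split_text` and `calculate_proportion` are borrowed from the pipeline, but the signatures, the 20-word overlap, and the exact weighting rule (each word counts as 1 divided by the number of chunks that contain it) are assumptions made for illustration, not the project's actual implementation. The per-chunk scores are assumed to come from the Perspective API.

```python
from typing import List, Tuple


def chunk_spans(n_words: int, chunk_size: int = 100, overlap: int = 20) -> List[Tuple[int, int]]:
    """Return (start, end) word indices of overlapping chunks."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    spans = []
    for start in range(0, n_words, step):
        end = min(start + chunk_size, n_words)
        spans.append((start, end))
        if end == n_words:
            break
    return spans


def split_text(text: str, chunk_size: int = 100, overlap: int = 20) -> List[List[str]]:
    """Split a transcript into overlapping word chunks."""
    words = text.split()
    return [words[s:e] for s, e in chunk_spans(len(words), chunk_size, overlap)]


def calculate_proportion(n_words: int, chunk_size: int = 100, overlap: int = 20) -> List[float]:
    """Weight per chunk; words shared by k chunks contribute 1/k to each."""
    spans = chunk_spans(n_words, chunk_size, overlap)
    coverage = [0] * n_words
    for start, end in spans:
        for i in range(start, end):
            coverage[i] += 1
    return [
        sum(1 / coverage[i] for i in range(start, end)) / n_words
        for start, end in spans
    ]


def weighted_toxicity(scores: List[float], weights: List[float]) -> float:
    """Weighted average of per-chunk scores (weights are expected to sum to 1)."""
    return sum(score * weight for score, weight in zip(scores, weights))
```

Under this rule the weights always sum to 1, so the result stays on the same 0 to 1 scale as the individual chunk scores.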

**Text Preprocessing**   

- **English Comments**: Because the `spacy-langdetect` tool was not effective for our purposes, we used the **`detectedLanguage` attribute from Perspective API** to filter out non-English comments.  

- **Text Cleaning**: We segmented text cleaning into **three levels** - **minimal, moderate, and tokenization** - to **tailor preprocessing needs** for sentiment analysis and potential future modeling.  

**Content Labeling: Sentiment Evaluation**   

- **Video Sentiment**: We have extracted sentiment scores for video transcripts in addition to comments.  

- VADER was run on minimally cleaned text, as it is **tailored to [social media text](https://github.com/cjhutto/vaderSentiment/blob/master/vaderSentiment/vaderSentiment.py#L517)** with its ability to **handle punctuation, all caps, slang, and emoji**.  

- TextBlob and Empath analyses were conducted on moderately cleaned text, as we could not find evidence in their documentation or papers of how well they support these social media text nuances.   

- **Empath**: We collected **all 194 empath categories** to examine the **influence of video topic/theme on comment toxicity** - i.e., is toxicity more prevalent when certain topics are mentioned in the video?  

The final labeled dataset contains **124,704 rows and 448 columns**, with **1,301 unique videos** extracted from **32 unique channels**.  

### Preliminary Analysis

`03-preliminary-analysis` was also updated with **visualizations of video features** in addition to comment features.  We also raised the toxicity threshold from 0.3 to 0.5.  

<br>

## Data Analysis Visualization

### Investigate the relationship between transcript toxicity and comment toxicity through each game.
### Visualize the distribution of toxicity in transcript and in comments

1. GroupBy Game

- Scatterplot

<img src="Feb-27/ByGame_Scatter.png" width="50%">


Across games, there is a positive correlation between video toxicity and comment toxicity. Although the points are sparse (compared to the plot grouped by channel/YouTuber), the `regplot` fit still shows a positive relationship.

- Barplot

<img src="Feb-27/ByGame_Barplot.png" width="50%">


From this bar plot, the correlation between comment and transcript toxicity is less obvious. However, we can still observe that comment toxicity (purple bars) is, on average, slightly higher in the top half of the graph than in the lower half. 

<img src="Feb-27/KDE_Distribution.png" width="50%">


Overall, there are a few observations from the plots above:

1. There is a positive relationship between video transcript toxicity and comment section toxicity when grouped by game.
2. This suggests that some games are indeed more toxic than others, although we have yet to test this hypothesis statistically.
3. This pattern is observed independently of each YouTuber's individual effect on the comment section.
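
One way to put a number on the grouped relationship before any formal test is to average both toxicity scores per group and compute a Pearson correlation over the group means. The sketch below does this in plain Python; the column names `video_game`, `video_toxicity`, and `comment_toxicity` and the helper names are assumptions, and the rows are made-up toy values rather than project results.

```python
from collections import defaultdict
from math import sqrt
from typing import Dict, List, Tuple


def mean(values: List[float]) -> float:
    return sum(values) / len(values)


def group_means(rows: List[dict], key: str) -> Dict[str, Tuple[float, float]]:
    """Average video and comment toxicity per group (e.g. per game or channel)."""
    grouped = defaultdict(lambda: ([], []))
    for row in rows:
        video_scores, comment_scores = grouped[row[key]]
        video_scores.append(row["video_toxicity"])
        comment_scores.append(row["comment_toxicity"])
    return {g: (mean(v), mean(c)) for g, (v, c) in grouped.items()}


def pearson_r(xs: List[float], ys: List[float]) -> float:
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sx * sy)


# Toy rows for illustration only (not taken from the dataset).
rows = [
    {"video_game": "A", "video_toxicity": 0.10, "comment_toxicity": 0.05},
    {"video_game": "A", "video_toxicity": 0.30, "comment_toxicity": 0.09},
    {"video_game": "B", "video_toxicity": 0.40, "comment_toxicity": 0.12},
    {"video_game": "C", "video_toxicity": 0.60, "comment_toxicity": 0.20},
]
means = group_means(rows, key="video_game")
r = pearson_r([m[0] for m in means.values()], [m[1] for m in means.values()])
```

Passing a channel column as `key` gives the same summary at the YouTuber level, which makes the two groupings directly comparable.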

2. GroupBy YouTuber

<img src="Feb-27/ByChannel_Scatter.png" width="60%">

The scatter plot above is much denser, and the positive slope of the regression line is steeper than in the scatter plot grouped by game. This suggests that grouping by channel has stronger explanatory power, indicating that a YouTuber's individual language in transcripts might have a bigger impact on comment section behavior.

<img src="Feb-27/ByGame_Barplot.png" width="60%">

<img src="Feb-27/KDE_Youtuber.png" width="70%">

## Investigate the most common topics in toxic videos

<img src="Feb-27/Topics_highest_toxic_videos.png" width="70%">

We sorted the list of videos by toxicity, identified the 10 videos with the highest toxicity scores, and determined the 10 most common topics among them. The topic keywords are: negative_emotion, weapon, violence, friends, swearing, pain, war, death, kill, and hate.

<img src="Feb-27/Barplot_highest.png" width="70%">

## Investigate the most common topics in non-toxic videos

<img src="Feb-27/Topics_lowest_toxic_videos.png" width="70%">

We sorted the list of videos by toxicity, identified the 10 videos with the lowest toxicity scores, and determined the 10 most common topics among them. The topic keywords are: musical, music, dance, fun, art, listen, sound, hearing, noise, and positive_emotion.

<img src="Feb-27/Barplot_lowest.png" width="70%">
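
The topic ranking for the most and least toxic videos could be reproduced along these lines. This sketch assumes each video is a dictionary with a `video_toxicity` score and an `empath` mapping from category to score, and it ranks topics by the sum of their scores over the selected videos; the function name `top_topics` and that aggregation rule are assumptions, since the exact definition of "most common" may differ in the project code.

```python
from collections import defaultdict
from typing import List


def top_topics(
    videos: List[dict],
    n_videos: int = 10,
    n_topics: int = 10,
    most_toxic: bool = True,
) -> List[str]:
    """Return the highest-scoring topics among the most (or least) toxic videos."""
    ranked = sorted(videos, key=lambda v: v["video_toxicity"], reverse=most_toxic)
    totals = defaultdict(float)
    for video in ranked[:n_videos]:
        for topic, score in video["empath"].items():
            totals[topic] += score  # aggregate topic scores across selected videos
    ordered = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return [topic for topic, _ in ordered[:n_topics]]
```

Calling the function once with `most_toxic=True` and once with `most_toxic=False` yields the two topic lists that the plots compare.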

___

## Qualitative  

### Answer, update question/claim, summary, re-contextualization, story, relate to domain knowledge  

**Answer**:   

From our further exploration, we can answer our updated research questions:
- Videos with higher toxicity in the transcript generally tend to trigger more toxic comments, but the relationship is not strong, and we cannot claim it is causal. Transcript toxicity is not an accurate predictor of comment toxicity. The distribution of comment toxicity appears somewhat random and may be influenced by factors other than the toxicity of the transcript.
- YouTubers who use more toxic language are likely to trigger more toxic comments, but this is not the only determinant.
- Surprisingly, the common topics in the most toxic and least toxic videos exhibit clear differences and patterns. Toxic videos frequently cover negative topics such as 'violence,' 'war,' and 'kill,' while non-toxic videos predominantly feature positive topics like 'music,' 'dance,' 'art,' and 'fun.' Intuitively, it can be observed that the topics in toxic videos are more negative in nature compared to non-toxic videos.

**Updated Question**:

- What are other primary factors or predictors of online gaming toxicity?
- How can our insights inform practical strategies for fostering a healthier online interaction environment?
            
**Relate to Domain Knowledge**:   

- From our toxicity calculation using the Perspective API, we discovered that the toxicity in the video transcripts is much higher than in the comments. This difference may be attributed to the common and non-harmful use of certain language within the gaming community while playing games. Such language does not necessarily imply malicious intent, prompting us to explore methods for detecting the genuine intent behind spoken language in gaming, beyond the literal words.

___

## Credit listing:

### Chesie Yu    

- Building code structure   
- Writing code/documentations for data collection, data cleaning, text processing, toxicity and sentiment labeling, visualization and preliminary analysis     
         
### Hongfan Lu  
- Writing code for data collection, processing, preliminary analysis and visualization    
- Collecting gaming YouTuber channel information   
- Generating observations and ideas for future exploration  
         
### Bella Wei  
- Conducting data collection, data analysis and visualization
- Writing the ProjectPhase2-Summary-report  
     
### Collaboration    
- Idea Generation  
- Literature Review  
- Research Question Formation  
- Workflow Design  
- Data collection  
- Data collection result Interpretation and Discussion  
- Future direction exploration   