# IMT 547 Final Project: Phase 1

Team 1: Chesie Yu, Hongfan Lu, Bella Wei

02/19/2024

## Project Overview: YouTube Gaming Comments Toxicity

**Team name**: _Team 1_  
**Team members**: _Chesie Yu, Hongfan Lu, Bella Wei_  

**Problem Description**:  
Toxicity in the gaming community is a prevalent problem that hinders the healthy development of the gaming industry. We examine whether game category (action vs. non-action) is a primary determinant of toxicity levels in YouTube video comments. The study takes the perspective of viewers who watch gameplay videos rather than that of players in real-time games. If the hypothesis holds, the findings can inform gaming community management, game design, and the design of social media platforms.  

### RQ1:  
 
- Do videos of action games attract significantly more toxic comments than videos of non-action games on YouTube?  
- We are interested in comparing the comment sections of action and non-action gaming videos on YouTube.  
- We hypothesize that the two kinds of comment sections will show different types of emotions.  
- Moreover, profanity may occur more often in the comments on action gaming videos.   

### RQ2:   

- Which kinds of gaming videos attract the most toxic comments? Is there a pattern behind them?  
- How do YouTubers' content, characteristics, and behaviors influence toxicity in the comment section?  

___

## Qualitative:

### Question, Problem, Hypothesis, Claim, Context, Motivation    

**Question**:      
- Do videos of action games attract significantly more toxic comments than videos of non-action games on YouTube?  
- Is game category a primary predictor of toxic comments on YouTube?   


**Hypothesis**:  

We hypothesize that action games (which include violent scenes) trigger significantly more toxicity on YouTube, and that different categories of games show different levels of toxicity.   


**Context**:

Online gaming toxicity refers to harmful and negative conversations within the gaming community, and it is a serious issue across social media platforms. This behavior often involves offensive language, insults, harassment, or even threats exchanged among players. It damages the harmony of the gaming community and of social media platforms, fosters resentment and animosity, and adversely affects people’s mental health, especially that of teenagers.    


**Motivation**:  

Addressing and mitigating online gaming toxicity is crucial for fostering a positive and inclusive gaming culture. If we can identify the main factors and predictors of online gaming toxicity, we can prevent and regulate the problem more efficiently, providing valuable insights for the gaming industry, game design, gaming community management, social media platforms, and gamers. This would contribute to a healthier and more enjoyable online gaming environment while also helping to prevent cyberbullying and game-related mental health problems.  

### Definitions, Data, Methods to be Used  

**Definitions**:  

Online gaming toxicity refers to the use of harmful and negative language within the gaming community. We focus on the YouTube platform, specifically on comments under gaming videos in each category. The novelty of our approach is that we address the issue from the perspective of people watching gameplay rather than people playing in real time.  

**Data**:  

We collected videos from 33 English-speaking YouTube channels drawn from a list of the top 100. For each channel, we selected up to 30 videos of action games and up to 30 videos of non-action games. This yielded 63442 comment records for action games and 75760 comment records for non-action games (139202 comment records in total).    

**Methods**:  

In the data collection phase, we use a YouTube API key to access the chosen YouTube channels. We clean the collected text by removing missing values and unnecessary tokens through tokenization and regular expressions. We then use the Perspective API to assess toxicity levels and NLP sentiment analysis tools such as VADER, TextBlob, and Empath to evaluate the emotional tone of the comments. Finally, we apply data visualization techniques to present our analysis results.   


### Assumptions, Biases  

**Assumptions**:
We assume that YouTube comments reflect free expression and are not heavily filtered.   

**Biases**:
We use a predefined keyword list to fetch videos, which may not be representative enough.  
We retrieve up to 100 comments per video in YouTube's default order, which may introduce bias.  

___

## Quantitative: 

### Data Collection Procedure   

1\. Use the **YouTube Data API** to access YouTube comments  
    
2\. Keyword selection  

**Keywords for Action Games**:   
`action_keywords = ["call of duty", "gta", "the last of us", "god of war", "batman", "red dead redemption", "assassin's creed", "star wars jedi", "resident evil", "cyberpunk", "fallout", "tomb raider", "elden ring"]`     

**Keywords for Non-Action Games**:   
`nonaction_keywords = ["minecraft", "pokemon go", "just dance", "it takes two", "uncharted", "brawl stars"]`    
            
3\. Identify 33 English-speaking YouTubers from a list of the top 100 (ranked by number of subscribers)  

4\. For each YouTuber, select up to 30 videos of action games and up to 30 videos of non-action games  

5\. Extract up to 100 comments per video, ordered by relevance (YouTube's internal ranking algorithm)  
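
The keyword-based selection in steps 2 and 4 can be sketched as follows. This is a minimal illustration rather than the notebook code: `select_videos` and the sample titles are hypothetical, and we assume that matching is a case-insensitive substring test on the video title.

```python
# Hypothetical sketch of keyword-based video selection (not the notebook code).
action_keywords = ["call of duty", "gta", "the last of us", "god of war",
                   "batman", "red dead redemption", "assassin's creed",
                   "star wars jedi", "resident evil", "cyberpunk", "fallout",
                   "tomb raider", "elden ring"]
nonaction_keywords = ["minecraft", "pokemon go", "just dance", "it takes two",
                      "uncharted", "brawl stars"]


def select_videos(titles, keywords, max_videos=30):
    """Return up to max_videos titles that contain at least one keyword."""
    selected = []
    for title in titles:
        if len(selected) >= max_videos:
            break
        lowered = title.lower()
        # Assumption: a video matches when any keyword is a substring of its title.
        if any(keyword in lowered for keyword in keywords):
            selected.append(title)
    return selected


# Made-up titles, for illustration only.
sample_titles = [
    "GTA 5 Funny Moments",
    "Minecraft Hardcore Day 100",
    "Elden Ring Boss Fight",
    "Cooking Vlog",
]
action_titles = select_videos(sample_titles, action_keywords)
nonaction_titles = select_videos(sample_titles, nonaction_keywords)
```

Substring matching is simple but permissive: a title that mentions two games matches both keyword lists, which is one way a video could end up associated with more than one game.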

### Notebook: 01-data-collection  

The detailed data collection code and description can be found in **`01-data-collection.ipynb`**.  Here we include only a brief overview for readability.  Our data collection phase is divided into 3 components:  

1\. **Authentication & Configuration**: Prepared the necessary tools and configuration for our later data collection  
- Set up the YouTube API key  
- Install and import necessary libraries  
- Configure the logging system  
- Initialize a client that communicates with the YouTube API    


2\. **Utility Function**: Encapsulated a series of necessary operations into functions.   
- **`get_uploads_id(channel_id)`**: Fetch the uploads playlist ID for a given YouTube channel.  
- **`get_video_ids(uploads_id, max_videos=30, keywords="")`**: Fetch up to 30 video IDs that match the keywords from a given uploads playlist.   
- **`get_video_info(video_ids)`**: Fetch the necessary video information for a list of YouTube videos.  
     **Video Features**: ["channel_id", "channel_name", "video_id", "video_title", "video_creation_time", "video_description", "video_tags", "video_viewcount", "video_likecount", "video_commentcount"]             
- **`get_video_comments(video_ids, max_comments=100)`**: Fetch comments (at most 100 each) for a list of videos.  
     **Comment Features**: ["video_id", "comment_id", "comment_author_id", "comment_text", "comment_time", "comment_likecount", "comment_replycount"]     
- **`get_youtube_data(channel_ids, max_videos=30, max_comments=100, keywords="")`**: Main function. Fetch videos and comments for a list of channels.   
     

3\. **Data Collection**     
- Set the parameters (at most 100 comments per video and at most 30 videos per YouTube channel) and the keyword lists for the ‘Action’ and ‘Non-Action’ game categories  
- Read our self-built list of the ‘top 100 popular gaming YouTubers’ as the target channels, keeping only English-speaking channels  
- Collect and combine comment information for action and non-action games   
- Retrieve and save 139202 comment records   
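
As an illustration of the first utility function, the sketch below looks up a channel's uploads playlist through the YouTube Data API v3 `channels.list` endpoint. It is a sketch under assumptions, not the notebook implementation: here the client is passed in explicitly as `youtube` (the notebook's `get_uploads_id` takes only `channel_id`), and the client is assumed to come from `googleapiclient.discovery.build("youtube", "v3", developerKey=...)`.

```python
def get_uploads_id(youtube, channel_id):
    """Return the uploads playlist ID for a channel, or None if not found.

    Assumption: `youtube` is a YouTube Data API v3 client, e.g. one created
    with googleapiclient.discovery.build("youtube", "v3", developerKey=KEY).
    """
    response = youtube.channels().list(
        part="contentDetails",  # contentDetails carries the related playlists
        id=channel_id,
    ).execute()
    items = response.get("items", [])
    if not items:
        # Unknown or inaccessible channel: nothing to fetch.
        return None
    return items[0]["contentDetails"]["relatedPlaylists"]["uploads"]
```

Passing the client as an argument keeps the function easy to test with a stub object and avoids relying on a global client.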

### Data Processing

### Notebook: 02-preprocessing    

Our data preprocessing is divided into 3 components:
    
1\. **Data Cleaning**    
- **Handle Missing Values**: **991 rows with missing values; 138211 rows left after removing NA**   
- **Convert Data Types**: Convert time-related columns to the datetime data type.   
    Comments range in time from **'2011-04-22T01:05:52Z'** to **'2024-02-18T20:15:11Z'**.  
- A total of 1435 unique videos are included   

![YouTube Data Summary Statistics](summary-statistics.png)

2\. **Text Preprocessing**
- Keep English comments only 
- Clean the text by removing unimportant content such as URLs, mentions, and stop words
- Tokenize the text and save the results to the dataframe  

![Text Cleaning](text-cleaning.png)

3\. **Data Labeling**  

- Toxicity scores from the Perspective API  
- Sentiment scores by VADER, TextBlob and Empath  
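
The cleaning and text preprocessing steps above can be sketched with the standard library alone. This is a hedged illustration, not the code in `02-preprocessing`: the function names, the tiny stop-word set, and the regular expressions are assumptions, and the English-language filter is omitted.

```python
import re
from datetime import datetime

# Tiny illustrative stop-word set; the notebook's actual list is not shown here.
STOP_WORDS = {"a", "an", "the", "is", "this", "to", "and", "of", "at"}

URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")
MENTION_PATTERN = re.compile(r"@\w+")
TOKEN_PATTERN = re.compile(r"[a-z']+")


def clean_text(text):
    """Lowercase, strip URLs and @mentions, tokenize, and drop stop words."""
    text = URL_PATTERN.sub(" ", text.lower())
    text = MENTION_PATTERN.sub(" ", text)
    tokens = TOKEN_PATTERN.findall(text)
    return [token for token in tokens if token not in STOP_WORDS]


def preprocess(rows):
    """Drop rows with missing fields, parse timestamps, and add tokens."""
    cleaned = []
    for row in rows:
        # Skip any record with a missing (None or empty) value.
        if any(value is None or value == "" for value in row.values()):
            continue
        record = dict(row)
        # Timestamps in this report look like '2024-02-18T20:15:11Z'.
        record["comment_time"] = datetime.strptime(
            row["comment_time"], "%Y-%m-%dT%H:%M:%SZ"
        )
        record["tokens"] = clean_text(row["comment_text"])
        cleaned.append(record)
    return cleaned


# Made-up rows, for illustration only.
rows = [
    {"comment_text": "@mike this boss fight is insane https://youtu.be/abc",
     "comment_time": "2024-02-18T20:15:11Z"},
    {"comment_text": None, "comment_time": "2011-04-22T01:05:52Z"},
]
result = preprocess(rows)
```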

#### Toxicity Annotations: Perspective API

![Perspective API Keys](perspective-keys.png)

![Perspective head](perspective-api-head.png)

![Toxicity Distribution](01a-toxicity-distribution.png)

**The graph above shows the percentage of comments flagged on each of the six Perspective API dimensions.**   

- `is_toxicity` (score > 0.3) flags around 13% of all comments and is the largest of the six categories.   

- `is_identity_attack` (score > 0.3) flags the fewest comments of the six.
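
The flagging rule behind these percentages is a simple threshold on each score. A minimal sketch, assuming the scores are already available as lists of floats; `flag_rate` and the sample values are hypothetical and only illustrate the > 0.3 cutoff:

```python
THRESHOLD = 0.3  # cutoff reported in this section


def flag_rate(scores, threshold=THRESHOLD):
    """Return the share of scores strictly above the threshold."""
    if not scores:
        return 0.0
    flagged = sum(1 for score in scores if score > threshold)
    return flagged / len(scores)


# Made-up Perspective-style scores, for illustration only.
sample_scores = {
    "toxicity": [0.05, 0.45, 0.10, 0.80],
    "identity_attack": [0.01, 0.02, 0.35, 0.03],
}
rates = {name: flag_rate(values) for name, values in sample_scores.items()}
```

Because the cutoff is a choice made in this analysis, rerunning `flag_rate` with other thresholds is a quick way to check how sensitive the reported percentages are to it.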

#### Sentiment Scoring: VADER/TextBlob/Empath

![Vader by genre](02a-sentiment-vader-by-genre.png)

![TextBlob by genre](02b-sentiment-textblob-by-genre.png)

![Empath by genre](02c-sentiment-empath-by-genre.png)

**The three graphs above show the sentiment scores measured by VADER, TextBlob, and Empath, respectively.**  

- The overall distributions of sentiment scores for action and non-action videos are extremely close, almost overlapping.   

- The vertical dotted lines, which mark the mean sentiment score for action and for non-action videos, are also very close to each other.   

- From VADER, we can say that comments on action videos show more sentiment, positive and negative alike, and less neutrality.  
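
One way to check the VADER observation is to turn compound scores into positive, neutral, and negative labels and compare the shares by genre. The sketch below assumes the compound scores are already computed and uses the cutoffs conventionally recommended for VADER (positive at 0.05 or above, negative at -0.05 or below); the function names and sample scores are hypothetical, and the notebook may summarize VADER output differently.

```python
from collections import Counter


def label_compound(score):
    """Map a VADER compound score to a sentiment label."""
    if score >= 0.05:
        return "positive"
    if score <= -0.05:
        return "negative"
    return "neutral"


def label_shares(compound_scores):
    """Return the share of positive, neutral, and negative comments."""
    total = len(compound_scores)
    if total == 0:
        return {"positive": 0.0, "neutral": 0.0, "negative": 0.0}
    counts = Counter(label_compound(score) for score in compound_scores)
    return {label: counts[label] / total
            for label in ("positive", "neutral", "negative")}


# Made-up compound scores, for illustration only.
action_shares = label_shares([0.6, -0.4, 0.0, 0.3])
nonaction_shares = label_shares([0.0, 0.02, 0.5, -0.2])
```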

### Preliminary Analysis

#### Toxicity by Game

![Toxicity By Games](toxicity-by-game.png)

**In the bar chart above:**  

- Action games tend to cluster on the left-hand side, which indicates higher toxicity scores.  

- Non-action games such as Minecraft tend to cluster on the side with lower toxicity scores.  

- Note: many game names are combinations of several games, such as 'call of duty, minecraft'. This is because Minecraft allows players to use skins from other games.  

- In other cases where multiple games are captured together, the video may cover both games, but such cases are rare overall.  
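
Combined labels such as 'call of duty, minecraft' arise naturally if a video is labeled with every keyword found in its title. The sketch below shows that labeling and a per-game mean toxicity. It is an assumption about how the grouping could work, not the notebook code: `game_label`, `mean_toxicity_by_game`, and the sample records are hypothetical.

```python
from collections import defaultdict


def game_label(title, keywords):
    """Join every keyword found in the title into a single label."""
    lowered = title.lower()
    matches = [keyword for keyword in keywords if keyword in lowered]
    return ", ".join(matches)


def mean_toxicity_by_game(records, keywords):
    """Average toxicity per game label over (video_title, toxicity) pairs."""
    totals = defaultdict(lambda: [0.0, 0])  # label -> [sum, count]
    for title, toxicity in records:
        label = game_label(title, keywords)
        if not label:
            continue  # no keyword matched this title
        totals[label][0] += toxicity
        totals[label][1] += 1
    return {label: total / count for label, (total, count) in totals.items()}


# Made-up (video title, comment toxicity) pairs, for illustration only.
keywords = ["call of duty", "minecraft"]
records = [
    ("Call of Duty skins in Minecraft", 0.2),
    ("Minecraft survival", 0.1),
    ("Minecraft build battle", 0.3),
    ("Call of Duty ranked", 0.4),
]
means = mean_toxicity_by_game(records, keywords)
```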

#### Toxicity by Channel

![Toxicity by YouTubers](toxicity-by-channel.png)

**In the bar graph above, we observe a clear difference in mean toxicity scores across YouTubers.**

To investigate whether a higher proportion of action videos leads to a higher toxicity score, we plotted the charts below.  

![Action/Non-Action Proportion](action-non-action-proportion-by-channel.png)

**YouTubers who play mostly non-action games appear to have lower toxicity scores.**  

Below is a more detailed breakdown.  

![action_vid_percent_toxicity](action-video-toxicity-proportion.png)

![Action/Non-Action videos posted by each YouTuber](action-video-proportion-vs-toxicity.png)

**The scatter plot with a trend line above shows a positive relationship between the percentage of action videos on a YouTuber's channel and the channel's mean toxicity score.**   

- However, there are a few exceptions, such as 'Tommyinit', who mainly posts non-action game videos but has a fairly high toxicity score.   

- We suspect the underlying reasons might be his content, his characteristics, or his language habits.  

- We plan to investigate further by analyzing video transcripts so that we can relate video content to the atmosphere of the comment section.  
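
The trend line in the scatter plot can be reproduced with an ordinary least-squares fit, and the Pearson correlation gives a rough sense of how strong the relationship is. The sketch below is a plain-Python illustration with made-up channel values; `fit_trend_line` is a hypothetical helper, and the plotting library used in the notebook may fit the line differently.

```python
def fit_trend_line(xs, ys):
    """Least-squares fit y = slope * x + intercept, plus Pearson's r."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        raise ValueError("both variables must vary")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / (sxx * syy) ** 0.5
    return slope, intercept, r


# Made-up per-channel values, for illustration only:
# share of action videos and mean toxicity score.
action_share = [0.1, 0.4, 0.5, 0.8, 0.9]
mean_toxicity = [0.08, 0.11, 0.10, 0.15, 0.14]
slope, intercept, r = fit_trend_line(action_share, mean_toxicity)
```

With only 33 channels, a single outlier can shift the slope noticeably, so it is worth refitting with and without the exceptional channels before drawing conclusions.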

#### Word Clouds: All Comments

![wordcloud_all](04a-wordcloud-all-comments.png)

#### Word Cloud: Action Game Comments

![wordcloud_action](04b-wordcloud-action.png)

#### Word Cloud: Non-Action Game Comments

![wordcloud_nonaction](04c-wordcloud-nonaction.png)

**We do not observe striking differences among these 3 word clouds, which suggests that language use in comments on action and non-action games is not very different.**  
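
A word cloud is driven by word frequencies, so the visual impression can be backed by a simple number: the overlap between the most frequent words in each group. The sketch below is a hypothetical illustration on made-up tokens; `top_words` and `vocabulary_overlap` are not part of the notebooks.

```python
from collections import Counter


def top_words(token_lists, n=3):
    """Return the n most frequent tokens across a list of token lists."""
    counts = Counter(token for tokens in token_lists for token in tokens)
    return [word for word, _ in counts.most_common(n)]


def vocabulary_overlap(tokens_a, tokens_b, n=3):
    """Jaccard overlap between the top-n words of two comment groups."""
    top_a = set(top_words(tokens_a, n))
    top_b = set(top_words(tokens_b, n))
    union = top_a | top_b
    if not union:
        return 0.0
    return len(top_a & top_b) / len(union)


# Made-up tokenized comments, for illustration only.
action_tokens = [["game", "love", "boss"], ["game", "love"], ["game"]]
nonaction_tokens = [["game", "love", "build"], ["game", "love"], ["game"]]
overlap_top3 = vocabulary_overlap(action_tokens, nonaction_tokens, n=3)
overlap_top2 = vocabulary_overlap(action_tokens, nonaction_tokens, n=2)
```

An overlap close to 1 for a reasonably large `n` would support the impression that the two groups use similar vocabulary.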

___

## Qualitative  

### Answer, update question/claim, summary, re-contextualization, story, relate to domain knowledge  

**Answer**:   
From our preliminary analysis, action games tend to have slightly higher toxicity, but game category may not be a reliable determinant and its influence is not strong. The sentiment of comments on action and non-action games does not differ noticeably, although action games tend to arouse stronger emotions.  

**Updated Question**:    
What are other primary factors or predictors of online gaming toxicity?
Which factors influence the overall sentiment and atmosphere of a YouTube comment section: the YouTuber's content, topic, characteristics, or language habits?
            
**Relate to Domain Knowledge**:   
This finding challenges the common assumption that game category is a strong factor influencing toxicity levels and sentiment. It prompts a deeper exploration of the factors contributing to online gaming toxicity. One possible reason is that viewers of gaming videos are a different group from game players. Another is that the culture of the YouTube gaming community may be positive and kind overall, which makes the comments fairly positive.  

### Uncertainty, limitations, caveats  

**Uncertainty and limitations**:
- We are unsure how representative our data are, even though we collected a large number of comments (130000+) for analysis.    

- Our dataset is derived exclusively from top YouTube gaming channels, potentially introducing bias and limiting the generalizability of our findings to broader gaming communities.    

- The predefined keyword list used for video selection might not cover the entire spectrum of gaming content and may omit potentially influential videos.  


**Caveats**:  
- Our study focuses on English-language comments, limiting applicability to non-English gaming communities.  

- The Perspective API's predefined toxicity thresholds may not fully align with subjective perceptions of offensive content in the gaming community.  

### New problems, next steps  
**New problems**:  

We observed a large imbalance in like counts among comments under the same video: certain comments received exceptionally high like counts. What factors drive such popular comments?    

Because YouTube applies its own comment filtering mechanism, the prevalence of 'toxic' comments may be limited. Can we observe clear patterns if we analyze other emotional dimensions such as disappointment, discouragement, or dissatisfaction?    

We clearly observed different atmospheres across YouTube channels and videos. What drives these differences?     

**Next steps**:
- Explore other factors that might influence toxicity levels: content, language habits, YouTuber characteristics, and topic.  
    - Collect video transcripts for video content analysis, examining the relationship between content and toxicity.   
    - Collect video transcripts for YouTuber language use analysis, investigating the relationship between language use and toxicity.      
    - Define gaming YouTuber characteristics, such as serious, humorous, educational, interactive, and enthusiastic, and explore the relationship between these characteristics and toxicity.    
    - Define video topics, such as game reviews, game walkthroughs and guidance, game comparisons (such as 'I quit A for B'), game first impressions/reactions, etc., and explore the relationship between these topics and toxicity.   

                
- Explore the factors influencing engagement in gaming videos and comments.  
    - Collect information about the popularity of videos/comments, such as like-count and reply-count.  
    - Analyze engagement data to understand which types of gaming videos and comments are more likely to go viral.  
                

___

## Credit listing:

### Chesie Yu    

- Building code structure   
- Writing code/documentation for data collection, data cleaning, text processing, toxicity and sentiment labeling, visualization and preliminary analysis     
         
### Hongfan Lu  
- Writing code for data collection, processing, preliminary analysis and visualization    
- Collecting gaming YouTuber channel information   
- Generating observations and ideas for future exploration  
         
### Bella Wei  
- Conducting data collection and analysis  
- Writing the ProjectPhase1-Summary-report  
     
### Collaboration    
- Idea Generation  
- Literature Review  
- Research Question Formation  
- Workflow Design  
- Data collection  
- Data collection result Interpretation and Discussion  
- Future direction exploration   