# IMT 547 - Term Project Phase 1

#### Title: YouTube Gaming Comments Toxicity
    Team name: Group 1
    Team members: Chesie Yu, Hongfan Lu, Bella Wei
    Problem Description:
        Toxicity in the gaming community is a prevalent problem that hinders the healthy development of the gaming industry. Our objective is to tackle this concern by exploring whether the game category (Action and Non-Action) is a primary determinant of toxicity levels in YouTube video comments. This study takes the perspective of viewers watching gaming videos rather than that of players in live games. If the hypothesis is supported, the findings can offer valuable insights for gaming community management, game design, and the design of social media platforms.

#### RQ1:
    - Do videos of action games attract significantly more toxic comments than videos of non-action games on YouTube?
    - Our project investigates the comment sections of action and non-action gaming videos on YouTube.
    - We hypothesize that the two categories will show different types of emotion in their comment sections.
    - Moreover, the occurrence of profanity might be higher under action gaming videos.

#### RQ2:
    - Which kinds of gaming videos attract the most toxic comments? Is there a pattern behind them?
    - How do YouTubers' content, characteristics, and behaviors influence toxicity in the comment section?

# Preliminary Analysis

#### Report format using QQQ
    Qualitative:
        Question, problem, hypothesis, claim, context, motivation
        Definitions, data, methods to be used
        Rationale, assumptions, biases
    Quantitative:
        Data processing, analysis, visualization
        Documented code and results
        Summary visuals
    Qualitative:
        Answer, update question/claim, summary, re-contextualization, story, relate to domain knowledge
        Uncertainty, limitations, caveats
        New problems, next steps
    Repeat. QQQ-QQQ-QQQ-...
        Break down a large problem into parts
        Alternative approaches to a problem
        Sequence of related problems, "vignettes"
        Follow-up problems

## Qualitative:

    ·Question, Problem, Hypothesis, Claim, Context, Motivation:
        - Question:
            Do videos of action games attract significantly more toxic comments than videos of non-action games on YouTube?
            Is game category a primary predictor of toxic comments on YouTube?
        - Hypothesis:
            We hypothesize that action games (including those with violent scenes) trigger significantly more toxicity on YouTube, and that different categories of games show different levels of toxicity.
        - Context:
            Online gaming toxicity refers to harmful and negative conversations within the gaming community, presenting a serious issue across social media platforms. This behavior often involves the use of offensive language, insults, harassment, or even threats exchanged among players. It negatively impacts the harmony of the gaming community and social media platforms, fostering resentment and animosity among people and adversely affecting people’s mental health, especially teenagers.
        - Motivation:
            Addressing and mitigating online gaming toxicity is crucial for fostering a positive and inclusive gaming culture. If we can identify the main factors and predictors of online gaming toxicity, we can prevent and regulate this problem more efficiently, providing valuable insights for the gaming industry, game design, gaming community management, social media platforms, and gamers. This would contribute to a healthier and more enjoyable online gaming environment while also helping to prevent cyberbullying and game-related mental health problems.
        
    ·Definitions, Data, Methods to be Used:
        - Definitions:
            Online gaming toxicity refers to the use of harmful and negative language within the gaming community. Our focus is on the YouTube platform, specifically the comments under gaming videos of each category. The novelty of our research is that we approach this issue from the perspective of watching games rather than real-time gameplay.
        - Data:
            We scraped videos from 33 English-speaking YouTube channels drawn from a list of the top 100. For each channel, we selected up to 30 videos for action games and up to 30 for non-action games. We obtained 63442 comment records for action games and 75760 comment records for non-action games (139202 comment records in total).
        - Methods:
            In the data collection phase, we use a YouTube API key to access the chosen YouTube channels. We clean the collected text by removing missing values and unnecessary tokens through tokenization and regular expressions. Subsequently, we use the Perspective API to assess toxicity levels and NLP sentiment analysis tools such as Vader, TextBlob, and Empath to evaluate the emotional tone of the comments. Finally, we apply data visualization techniques to present our analysis results.
        
    ·Assumptions, Biases:
        - Assumptions:
            YouTube comments are largely free speech and are not heavily filtered.
        - Biases:
            We use a predefined keyword list to fetch videos, which may not be representative enough.
            We retrieve up to 100 comments under each video, in the default order provided by YouTube, which may introduce bias.

## Quantitative: 

### Data collection steps:

    1. Use the YouTube API to access YouTube comments
    
    2. Define keywords for each game category:
        - Keywords for ActionGames:
            action_keywords = [
            "call of duty", "gta", "the last of us", "god of war", "batman",
            "red dead redemption", "assassin's creed", "star wars jedi",
            "resident evil", "cyberpunk", "fallout", "tomb raider", "elden ring"]
        - Keywords for NonActionGames:
            nonaction_keywords = [
            "minecraft", "pokemon go", "just dance", "it takes two", "uncharted",
            "brawl stars"]
            
    3. Select 33 English-speaking YouTubers from a list of the top 100 (ranked by number of subscribers)
    
    4. For each YouTuber, select up to 30 videos for action games and up to 30 for non-action games
    
    5. Use Relevance ordering (YouTube's internal algorithm) to extract up to 100 comments from each video
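
The keyword step above can be sketched as a simple case-insensitive substring match. This is a minimal illustration, not the notebook's actual code: matching against the video title, and the helper names `matched_keywords` and `game_name`, are assumptions made for the example.

```python
# Keyword lists copied from the data collection steps above.
action_keywords = [
    "call of duty", "gta", "the last of us", "god of war", "batman",
    "red dead redemption", "assassin's creed", "star wars jedi",
    "resident evil", "cyberpunk", "fallout", "tomb raider", "elden ring"]
nonaction_keywords = [
    "minecraft", "pokemon go", "just dance", "it takes two", "uncharted",
    "brawl stars"]


def matched_keywords(title, keywords):
    """Return every keyword that occurs in the title (case-insensitive).

    Assumption: keywords are matched against the video title only.
    """
    lowered = title.lower()
    return [kw for kw in keywords if kw in lowered]


def game_name(title):
    """Join all matched keywords, so one video may carry several game names."""
    hits = matched_keywords(title, action_keywords + nonaction_keywords)
    return ", ".join(hits)


print(game_name("Call of Duty skins in MINECRAFT"))
```

A video that matches keywords from both lists ends up with a combined name, which is one way a single video can be associated with several games.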

### Description for notebook 01-data-collection

Description of code 1- Data Collection:
    
    Our data collection phase is divided into 3 modules:
        1. Authentication & Configuration: Prepare the necessary tools and configuration for the later data collection
            - Set up the YouTube API key
            - Install and import necessary libraries
            - Configure the logging system
            - Initialize a client that can communicate with the YouTube API
            
        2. Utility Functions: Encapsulate a series of necessary operations into functions.
            - get_uploads_id(channel_id): Fetch the uploads playlist ID for a given YouTube channel.
            - get_video_ids(uploads_id, max_videos=30, keywords=""): Fetch up to 30 IDs of videos containing a keyword from a given uploads playlist.
            - get_video_info(video_ids): Fetch the necessary video info for a list of YouTube videos.
                The info contains the fields (["channel_id", "channel_name", "video_id", "video_title", "video_creation_time", "video_description", "video_tags", "video_viewcount", "video_likecount", "video_commentcount"])
            - get_video_comments(video_ids, max_comments=100): Fetch comments (100 max each) for a list of videos.
                The comment info contains the fields (["video_id", "comment_id", "comment_author_id", "comment_text", "comment_time", "comment_likecount", "comment_replycount"])
            - get_youtube_data(channel_ids, max_videos=30, max_comments=100, keywords=""): Main function. Fetch videos and comments for a list of channels.
            
        3. Data Collection
            - Set up the parameters (max 100 comments under max 30 videos for each YouTube channel) and the keyword lists for the 'Action' and 'Non-Action' game categories
            - Read a self-built document, 'top 100 popular gaming YouTubers', as our target channels, keeping only English-speaking channels
            - Collect and combine comment info for action and non-action games
            - Get and save 139202 comment records
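
Two of the utility functions above could look roughly like the sketch below. It is not the notebook's code. It assumes `youtube` is a client created with `googleapiclient.discovery.build("youtube", "v3", developerKey=...)`, passes that client explicitly (the functions listed above take no client argument), and uses a hypothetical single-video helper named `get_top_comments` in place of the list-based `get_video_comments`.

```python
def get_uploads_id(youtube, channel_id):
    """Return the uploads playlist ID for a channel, or None if it is missing."""
    response = youtube.channels().list(
        part="contentDetails", id=channel_id
    ).execute()
    items = response.get("items", [])
    if not items:
        return None
    return items[0]["contentDetails"]["relatedPlaylists"]["uploads"]


def get_top_comments(youtube, video_id, max_comments=100):
    """Return up to `max_comments` top-level comments in relevance order.

    Only one page is requested, which is enough for at most 100 comments.
    """
    response = youtube.commentThreads().list(
        part="snippet",
        videoId=video_id,
        maxResults=min(max_comments, 100),  # one page holds at most 100 items
        order="relevance",
        textFormat="plainText",
    ).execute()

    rows = []
    for item in response.get("items", []):
        top = item["snippet"]["topLevelComment"]
        snippet = top["snippet"]
        rows.append({
            "video_id": video_id,
            "comment_id": top["id"],
            "comment_author_id": snippet.get("authorChannelId", {}).get("value"),
            "comment_text": snippet["textDisplay"],
            "comment_time": snippet["publishedAt"],
            "comment_likecount": snippet["likeCount"],
            "comment_replycount": item["snippet"]["totalReplyCount"],
        })
    return rows
```

Passing the client as an argument keeps the functions easy to exercise with a stub object, without spending API quota.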

### Description for notebook 02-preprocessing

Our data preprocessing is divided into 3 modules:
    
1. Data Cleaning
    - Check missing values: **991 rows with missing values; 138211 rows left after removing NA**
    - Check data types: convert time-related columns to the datetime data type:
        Comments span the time range **('2011-04-22T01:05:52Z', '2024-02-18T20:15:11Z')**
    - A total of 1435 unique videos are included

![YouTube Data Summary Statistics](Summary_Statistics.png)
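
The cleaning step can be illustrated with the standard library alone. This is a sketch under assumptions: rows are plain dictionaries instead of the notebook's dataframe, a missing value is represented as `None`, and the helper names are made up for the example. With a dataframe, the same two operations would normally be a drop of NA rows followed by a datetime conversion.

```python
from datetime import datetime, timezone


def parse_youtube_time(value):
    """Parse a timestamp such as '2011-04-22T01:05:52Z' into an aware datetime."""
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


def drop_missing_and_parse_times(rows, time_field="comment_time"):
    """Drop rows with any missing value, then convert the time field."""
    cleaned = []
    for row in rows:
        if any(value is None for value in row.values()):
            continue  # skip rows with a missing value
        fixed = dict(row)
        fixed[time_field] = parse_youtube_time(row[time_field])
        cleaned.append(fixed)
    return cleaned


rows = [
    {"comment_id": "a", "comment_text": "nice", "comment_time": "2011-04-22T01:05:52Z"},
    {"comment_id": "b", "comment_text": None, "comment_time": "2024-02-18T20:15:11Z"},
]
cleaned = drop_missing_and_parse_times(rows)
print(len(cleaned), cleaned[0]["comment_time"].isoformat())
```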

2. Text Preprocessing
    - Keep English comments only
    - Clean the text by removing unimportant content such as URLs, mentions, and stop words.
    - Tokenize the text and save the results into the dataframe

![Text Cleaning](Text_Cleaning.png)
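
A minimal sketch of the text cleaning described above, using only regular expressions. The stop-word set here is a tiny illustrative sample, not the list used in the notebook, and the patterns and the function name `clean_and_tokenize` are assumptions.

```python
import re

# Illustrative stop words only; a real run would use a full list.
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "of", "and", "this", "it"}

URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")
MENTION_PATTERN = re.compile(r"@\w+")
TOKEN_PATTERN = re.compile(r"[a-z0-9]+(?:'[a-z]+)?")


def clean_and_tokenize(text):
    """Lowercase, strip URLs and mentions, tokenize, and drop stop words."""
    text = text.lower()
    text = URL_PATTERN.sub(" ", text)
    text = MENTION_PATTERN.sub(" ", text)
    tokens = TOKEN_PATTERN.findall(text)
    return [token for token in tokens if token not in STOP_WORDS]


print(clean_and_tokenize("@Bob this game is AWESOME!! https://youtu.be/abc"))
# ['game', 'awesome']
```

URLs and mentions are removed before tokenizing so that their fragments do not survive as tokens.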

3. Data Labeling
    - Toxicity scores from the Perspective API
    - Sentiment scores from Vader, TextBlob, and Empath

#### Perspective API

![Perspective API Keys](Perspective_keys.png)

![Perspective head](Perspective_API_head.png)
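
For reference, the request and response shape of a Perspective API call can be sketched as below. No network call is made here. The six attribute names are an assumption about which dimensions were requested (the report only names toxicity and identity attack), and `build_request_body` and `extract_scores` are hypothetical helpers.

```python
# Endpoint that the JSON body would be POSTed to, with an API key as a query parameter.
ANALYZE_URL = "https://commentanalyzer.googleapis.com/v1alpha1/comments:analyze"

# Assumed set of six attributes; the report does not list all of them.
ATTRIBUTES = (
    "TOXICITY", "SEVERE_TOXICITY", "IDENTITY_ATTACK",
    "INSULT", "PROFANITY", "THREAT",
)


def build_request_body(text, attributes=ATTRIBUTES):
    """Build the JSON body for one comment."""
    return {
        "comment": {"text": text},
        "languages": ["en"],
        "requestedAttributes": {name: {} for name in attributes},
    }


def extract_scores(response):
    """Map each returned attribute to its summary score (a value in [0, 1])."""
    scores = response.get("attributeScores", {})
    return {
        name.lower(): data["summaryScore"]["value"]
        for name, data in scores.items()
    }


sample_response = {
    "attributeScores": {
        "TOXICITY": {"summaryScore": {"value": 0.42, "type": "PROBABILITY"}},
        "IDENTITY_ATTACK": {"summaryScore": {"value": 0.03, "type": "PROBABILITY"}},
    }
}
print(extract_scores(sample_response))
```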

![Toxicity Distribution](01a-toxicity-distribution.png)

#### The graph above shows the percentage of comments tagged on each of the six dimensions of the Perspective API:
- is_toxicity (> 0.3) flags around 20% of all comments and is the largest bucket.
- is_identity_attack (> 0.3) flags the fewest comments among the six identifiers.
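
The `is_*` flags above follow from thresholding each score at 0.3. A small sketch of that step and of the percentage calculation, with made-up scores and hypothetical helper names:

```python
THRESHOLD = 0.3  # a score strictly above this value is flagged


def flag_comments(scored_comments, threshold=THRESHOLD):
    """Turn each score dict into a dict of is_<attribute> booleans."""
    return [
        {f"is_{name}": score > threshold for name, score in scores.items()}
        for scores in scored_comments
    ]


def flagged_share(flags, name):
    """Percentage of comments flagged for one attribute."""
    if not flags:
        return 0.0
    return 100.0 * sum(row[name] for row in flags) / len(flags)


# Made-up scores for four comments, for illustration only.
scores = [
    {"toxicity": 0.50, "identity_attack": 0.10},
    {"toxicity": 0.20, "identity_attack": 0.05},
    {"toxicity": 0.31, "identity_attack": 0.40},
    {"toxicity": 0.30, "identity_attack": 0.00},
]
flags = flag_comments(scores)
print(flagged_share(flags, "is_toxicity"), flagged_share(flags, "is_identity_attack"))
# 50.0 25.0
```

A score of exactly 0.3 is not flagged, because the comparison is strict.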

![Vader by genre](02a-sentiment-vader-by-genre.jpg)

![TextBlob by genre](02b-sentiment-textbblob-by-genre.jpg)

![Empath by genre](02c-sentiment-empath-by-genre.jpg)

#### The three graphs above show the sentiment scores measured by Vader, TextBlob, and Empath respectively.
- The overall distributions of sentiment scores for action and non-action videos are extremely close, almost overlapping.
- The vertical dotted lines, which mark the mean sentiment score for action and non-action videos, are also quite close to each other.
- From Vader, action videos appear to show somewhat more sentiment, positive and negative alike, and less neutrality.

![Toxicity By Games](Toxic_By_Name.png)

#### In the chart above,
    - there is a pattern that action games tend to cluster on the left-hand side, which indicates higher toxicity scores,
    - while non-action games, like minecraft, tend to cluster on the side with lower toxicity scores.
    - P.S. Many game names are a combination of several games, like 'call of duty, minecraft'. This is because minecraft allows players to use skins from other games.
    - In other cases where multiple games are captured together, the video may cover both games, but these situations are rare overall.

![Toxicity by YouTubers](Toxic_By_Channel.png)

#### In the bar graph above, we observe an obvious difference in the mean toxicity scores of different YouTubers.

    To investigate whether a higher proportion of action videos leads to a higher toxicity score, we plotted the charts below

![Action/Non-Action Proportion](ActionNonActionProportion.png)

#### Some YouTubers who play mostly non-action games seem to have lower toxicity scores

    Below is a more detailed breakdown

![action_vid_percent_toxicity](action_vid_percent_toxicity.png)

![Action/Non-Action videos posted by each YouTuber](Percent_of_Action_Video_Channel.png)

#### We plotted a scatter plot with a trend line above. We can observe a positive relationship between the percentage of action videos on a YouTuber's channel and the mean toxicity score
    - However, there are a few exceptions, like 'Tommyinit'. This YouTuber mainly posts non-action game videos, but his toxicity score is fairly high.
    - We assume the underlying reasons might be his content, his characteristics, or his language habits.
    - We plan to investigate further by analyzing video transcripts so that we can relate video content to the atmosphere of the comment section.
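
A trend line of this kind is an ordinary least-squares fit, and channels far above the line can be found from the residuals. The sketch below is an illustration with made-up numbers and hypothetical channel names. It is not the code that produced the plot.

```python
def fit_trend_line(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need two equally long sequences with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def residuals(names, xs, ys):
    """Distance of each channel above (+) or below (-) the trend line."""
    slope, intercept = fit_trend_line(xs, ys)
    return {
        name: y - (slope * x + intercept)
        for name, x, y in zip(names, xs, ys)
    }


# Made-up values: percent of action videos and mean toxicity per channel.
names = ["channel_a", "channel_b", "channel_c", "channel_d"]
percent_action = [10.0, 40.0, 70.0, 20.0]
mean_toxicity = [0.10, 0.16, 0.22, 0.25]

res = residuals(names, percent_action, mean_toxicity)
print(max(res, key=res.get))  # channel furthest above the trend line
```

A channel with a large positive residual has more toxic comments than its share of action videos would suggest, which is the kind of exception noted above.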

## Qualitative:

    ·Answer, update question/claim, summary, re-contextualization, story, relate to domain knowledge
        - Answer: Our preliminary analysis suggests that game category is not a significant factor in online gaming toxicity. The toxicity, emotional tone, and word cloud analyses for action and non-action games show no substantial differences.
        - Updated Question:
            What are the primary factors or predictors of online gaming toxicity?
            
        - Relate to Domain Knowledge: This finding challenges the common assumption that game category alone influences toxicity levels. It prompts a deeper exploration of the factors contributing to online gaming toxicity. One possible reason is that viewers of gaming videos are a different group from the people who play the games. Another is that the culture of the YouTube gaming community may be more positive and kind than that of other gaming communities.


    ·Uncertainty, limitations, caveats
        - Uncertainty and limitations:
            - We are not sure how representative our data is, even though we fetched a large number of comments for analysis.
            - Our dataset is derived exclusively from top YouTube gaming channels, potentially introducing bias and limiting the generalizability of our findings to broader gaming communities.
            - The predefined keyword list used for video selection might not encompass the entire spectrum of gaming content, omitting potential influential videos.
        - Caveats:
            - Our study focuses on English-language comments, limiting applicability to non-English gaming communities.
            - The Perspective API's predefined toxicity thresholds may not fully align with subjective perceptions of offensive content in the gaming community.

    ·New problems, next steps
        - New problems:
            We observed a significant imbalance in like counts among comments under the same video. Certain comments received exceptionally high like counts. What factors drive this phenomenon?
            Considering that YouTube employs its own comment filtering mechanism, the prevalence of 'toxic' comments may be limited. Can we observe clear patterns if we analyze other emotional dimensions such as disappointment, discouragement, or dissatisfaction?
            
        - Next steps:


