# Sentiment Analysis for Wuthering Waves
## Data Collection, Data Cleansing, and Preprocessing
__Harits Mughni Zakinu__

---



# Background

**Wuthering Waves** is an action RPG released by **Kuro Games in May 2024**, featuring a theme of action and adventure. The game has garnered significant attention due to its unique blend of character-driven narratives and stunning visual storytelling. With a growing fan base and increasing engagement on its official YouTube channel, understanding player sentiment has become crucial for the developers to tailor future updates and content.

# Objective

Perform sentiment analysis on comments from the official Wuthering Waves YouTube channel, focusing on two aspects, characters and story, to understand player reactions. The classifiers used are a Random Forest Classifier and a Support Vector Machine.

# Install Packages

In [None]:
!pip install google-api-python-client

In [None]:
!pip install vaderSentiment

In [None]:
!pip install emoji

In [None]:
!pip install langdetect

# Import Libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import re
import string
import emoji

import nltk
from bs4 import BeautifulSoup
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
from googleapiclient.discovery import build
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

import warnings
warnings.filterwarnings('ignore')

%matplotlib inline

In [None]:
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('vader_lexicon')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package vader_lexicon to /root/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


True

# Data Collection

Comments are taken from the official Wuthering Waves YouTube channel and saved in CSV format. The videos in question are titled as follows:

- “Wuthering Waves | Resonator Showcase | Changli — SEIZE THE INITIATIVE”
- “Wuthering Waves Resonator Showcase | Jinhsi — HEED MY CALLING”
- “Wuthering Waves | Resonator Showcase | Yinlin — SURRENDER OR DIE”
- “Wuthering Waves | Resonator Showcase | Jiyan — THROUGH THE DARKEST OF NIGHTS”
- “Wuthering Waves | Resonator Showcase | Lingyang — Debut”
- “Wuthering Waves | Resonator Showcase | Encore — BAA BAA!!”
- “Wuthering Waves Story Cinematics | Pursuit”
- “Wuthering Waves Story Cinematics | Daybreak”
- “Wuthering Waves Featured Cinematics | AS FATE HAS DECREED”
- “Wuthering Waves Story Cinematics | Battle Beneath the Crescent”
- “Wuthering Waves Featured Cinematics | SAVING LIGHT”
- “Opening Cinematics | Set Sail”

In [None]:
#def video_comments(video_id, api_key):
#    """
#    Retrieve comments from a YouTube video using the YouTube Data API.

#    Args:
#    video_id (str): The ID of the YouTube video.
#    api_key (str): Your YouTube Data API key.

#    Returns:
#    list: A list of comments with each comment represented as a list
#          containing the published date, user name, and the comment text.
#    """

#    comments = []

#    # Initialize the YouTube API client
#    youtube = build('youtube', 'v3', developerKey=api_key)
#    video_response = youtube.commentThreads().list(
#        part='snippet,replies',
#        videoId=video_id,
#        maxResults=100  # Maximum number of results per page
#    ).execute()

#    # Walk through every page of results
#    while video_response:
#        for item in video_response['items']:
#            published = item['snippet']['topLevelComment']['snippet']['publishedAt']
#            user = item['snippet']['topLevelComment']['snippet']['authorDisplayName']
#            comment = item['snippet']['topLevelComment']['snippet']['textDisplay']
#            comments.append([published, user, comment])

#            # Replies returned with the thread (the API may include only a subset of them)
#            replycount = item['snippet']['totalReplyCount']
#            if replycount > 0 and 'replies' in item:
#                for reply in item['replies']['comments']:
#                    published = reply['snippet']['publishedAt']
#                    user = reply['snippet']['authorDisplayName']
#                    repl = reply['snippet']['textDisplay']
#                    comments.append([published, user, repl])

#        # Request the next page, if any, with the same page size
#        if 'nextPageToken' in video_response:
#            video_response = youtube.commentThreads().list(
#                part='snippet,replies',
#                pageToken=video_response['nextPageToken'],
#                videoId=video_id,
#                maxResults=100
#            ).execute()
#        else:
#            break

#    return comments

# api_key = 'API_KEYS'

# Video IDs
#video_ids_character = [
#    "jYjxjy1l6Co",
#    "0caRWqAQFMc",
#    "TImtNKeNk78",
#    "wnxtQsHOy1k",
#    "ptSvUTfGmNI",
#    "u_hHNpp6qs0"
#]

#video_ids_story = [
#    "gA0Z1qd0ymQ",
#    "fYKlOyzU740",
#    "TvxwtXIrf1g",
#    "JWy3zoOMLxM",
#    "4i1IRQ6SGTM",
#    "xOn_aOXq1ZY"
#]

## Collect comments from the character-aspect videos
#comments_character = []
#for video_id in video_ids_character:
#    comments = video_comments(video_id, api_key)
#    comments_character.extend(comments)

#df_char = pd.DataFrame(comments_character, columns=['publishedAt', 'authorDisplayName', 'textDisplay'])
#df_char.to_csv('/content/drive/MyDrive/Dataset/yt_comments_character.csv', index=False)

## Collect comments from the story-aspect videos
#comments_story = []
#for video_id in video_ids_story:
#    comments = video_comments(video_id, api_key)
#    comments_story.extend(comments)

#df_story = pd.DataFrame(comments_story, columns=['publishedAt', 'authorDisplayName', 'textDisplay'])
#df_story.to_csv('/content/drive/MyDrive/Dataset/yt_comments_story.csv', index=False)
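
The flattening logic in the collector above can be checked without an API key. The sketch below is not part of the original notebook: `rows_from_thread` is a hypothetical helper, and `mock_item` is an invented dictionary shaped like one entry of `response['items']`, with made-up values.

```python
def rows_from_thread(item):
    """Flatten one commentThreads item into [publishedAt, author, text] rows."""
    top = item["snippet"]["topLevelComment"]["snippet"]
    rows = [[top["publishedAt"], top["authorDisplayName"], top["textDisplay"]]]
    # Guard against a missing 'replies' key instead of indexing it directly.
    for reply in item.get("replies", {}).get("comments", []):
        snippet = reply["snippet"]
        rows.append([snippet["publishedAt"], snippet["authorDisplayName"], snippet["textDisplay"]])
    return rows

# Invented item shaped like one entry of response['items']; all values are made up.
mock_item = {
    "snippet": {
        "totalReplyCount": 1,
        "topLevelComment": {"snippet": {
            "publishedAt": "2024-01-01T00:00:00Z",
            "authorDisplayName": "@example_viewer",
            "textDisplay": "Great showcase!",
        }},
    },
    "replies": {"comments": [{"snippet": {
        "publishedAt": "2024-01-01T00:05:00Z",
        "authorDisplayName": "@example_reply",
        "textDisplay": "Agree",
    }}]},
}

rows = rows_from_thread(mock_item)
print(len(rows))  # one top-level comment plus one reply
```

Each row has the same three fields that the notebook later writes to the `publishedAt`, `authorDisplayName`, and `textDisplay` columns.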

In [None]:
df_char = pd.read_csv('/content/drive/MyDrive/Dataset/yt_comments_character.csv')
df_story = pd.read_csv('/content/drive/MyDrive/Dataset/yt_comments_story.csv')

In [None]:
df_char

Unnamed: 0,publishedAt,authorDisplayName,textDisplay
0,2024-08-12T18:35:15Z,@erenop2319,Finally Got her after loosing the 50-50 welp a...
1,2024-08-12T15:13:30Z,@user-ie9xx5yo2z,I have to say that Changli&#39;s character mod...
2,2024-08-13T02:05:49Z,@avej99,I mean thats not hard to do tbh
3,2024-08-13T17:47:23Z,@WuWa-oh7rd,Agree
4,2024-08-12T06:05:15Z,@DonLuy,"<a href=""https://www.youtube.com/watch?v=jYjxj..."
...,...,...,...
14031,2024-04-22T10:02:20Z,@Ureli06,1 month lefttt
14032,2024-04-22T10:01:56Z,@D-mail_user.,2
14033,2024-04-22T10:01:55Z,@ShinkaiCh,I&#39;M HERE
14034,2024-04-22T10:01:40Z,@abdo77066,the first


In [None]:
df_char.shape

(14036, 3)

In [None]:
df_story

Unnamed: 0,publishedAt,authorDisplayName,textDisplay
0,2024-08-14T08:18:31Z,@vrika4280,"I just finished the quest, and both this and J..."
1,2024-08-13T01:16:21Z,@prevailege,I really like her design! It&#39;s just PERFECT!!
2,2024-08-12T06:51:42Z,@Ew-wth,"Look at the replay amounts for each scene, lol..."
3,2024-08-11T19:42:25Z,@S.k.Editor-14,BRO I CAN&#39;T THE HAND MOMENT 😭✋🏻💦
4,2024-08-09T19:33:09Z,@CrimRui,I love it! Great stuff.
...,...,...,...
10732,2024-02-18T03:01:01Z,@misakoblades,Yo this looks cool
10733,2024-02-18T03:00:59Z,@dq9168,Cool
10734,2024-02-18T03:00:56Z,@1Yeji,first
10735,2024-02-18T03:00:55Z,@Rarrinn,WTF-


In [None]:
df_story.shape

(10737, 3)

# Labeling Data

Each comment is labeled from its VADER compound score, using the thresholds established by the vaderSentiment method:

- Positive Sentiment: compound score >= 0.05
- Neutral Sentiment: (compound score > -0.05) and (compound score < 0.05)
- Negative Sentiment: compound score <= -0.05
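
As a quick check of how these thresholds behave at the boundaries, here is a minimal sketch. `label_from_compound` is a hypothetical helper name, and the sample list mixes compound scores that appear in this notebook's outputs with the two boundary values.

```python
from collections import Counter

POS_THRESHOLD = 0.05
NEG_THRESHOLD = -0.05

def label_from_compound(score: float) -> str:
    """Hypothetical helper mirroring the three rules listed above."""
    if score >= POS_THRESHOLD:
        return "positive"
    if score <= NEG_THRESHOLD:
        return "negative"
    return "neutral"

# Scores seen in the notebook outputs, plus the two boundary values.
sample_scores = [0.7964, 0.0762, 0.05, 0.0, -0.05, -0.5859]
labels = [label_from_compound(s) for s in sample_scores]
label_counts = Counter(labels)

print(labels)
print(label_counts)
```

Both boundary values fall outside the neutral band: a score of exactly 0.05 is positive and exactly -0.05 is negative.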

In [None]:
def labeling(sentences):
  analyzer = SentimentIntensityAnalyzer()
  data = []
  for sentence in sentences:
    # Missing comments are read from the CSV as float NaN; cast to str so VADER can score them
    if not isinstance(sentence, str):
      sentence = str(sentence)
    vs = analyzer.polarity_scores(sentence)
    data.append(vs)
  return data

df_char['scores'] = labeling(df_char['textDisplay'])
df_story['scores'] = labeling(df_story['textDisplay'])

df_char['neg_score'] = [score['neg'] for score in df_char['scores']]
df_char['pos_score'] = [score['pos'] for score in df_char['scores']]
df_char['neu_score'] = [score['neu'] for score in df_char['scores']]

df_story['neg_score'] = [score['neg'] for score in df_story['scores']]
df_story['pos_score'] = [score['pos'] for score in df_story['scores']]
df_story['neu_score'] = [score['neu'] for score in df_story['scores']]

df_char['compound_score'] = [score['compound'] for score in df_char['scores']]
df_story['compound_score'] = [score['compound'] for score in df_story['scores']]

def get_sentiment_3category(score):
    if score >= 0.05:
        return 'positive'
    elif score <= -0.05:
        return 'negative'
    else:
        return 'neutral'

df_char['sentiment'] = df_char['compound_score'].apply(get_sentiment_3category)
df_story['sentiment'] = df_story['compound_score'].apply(get_sentiment_3category)

In [None]:
df_char

Unnamed: 0,publishedAt,authorDisplayName,textDisplay,scores,neg_score,pos_score,neu_score,compound_score,sentiment
0,2024-08-12T18:35:15Z,@erenop2319,Finally Got her after loosing the 50-50 welp a...,"{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound...",0.0,0.000,1.000,0.0000,neutral
1,2024-08-12T15:13:30Z,@user-ie9xx5yo2z,I have to say that Changli&#39;s character mod...,"{'neg': 0.0, 'neu': 0.628, 'pos': 0.372, 'comp...",0.0,0.372,0.628,0.7964,positive
2,2024-08-13T02:05:49Z,@avej99,I mean thats not hard to do tbh,"{'neg': 0.0, 'neu': 0.844, 'pos': 0.156, 'comp...",0.0,0.156,0.844,0.0762,positive
3,2024-08-13T17:47:23Z,@WuWa-oh7rd,Agree,"{'neg': 0.0, 'neu': 0.0, 'pos': 1.0, 'compound...",0.0,1.000,0.000,0.3612,positive
4,2024-08-12T06:05:15Z,@DonLuy,"<a href=""https://www.youtube.com/watch?v=jYjxj...","{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound...",0.0,0.000,1.000,0.0000,neutral
...,...,...,...,...,...,...,...,...,...
14031,2024-04-22T10:02:20Z,@Ureli06,1 month lefttt,"{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound...",0.0,0.000,1.000,0.0000,neutral
14032,2024-04-22T10:01:56Z,@D-mail_user.,2,"{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound...",0.0,0.000,1.000,0.0000,neutral
14033,2024-04-22T10:01:55Z,@ShinkaiCh,I&#39;M HERE,"{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound...",0.0,0.000,1.000,0.0000,neutral
14034,2024-04-22T10:01:40Z,@abdo77066,the first,"{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound...",0.0,0.000,1.000,0.0000,neutral


In [None]:
df_story

Unnamed: 0,publishedAt,authorDisplayName,textDisplay,scores,neg_score,pos_score,neu_score,compound_score,sentiment
0,2024-08-14T08:18:31Z,@vrika4280,"I just finished the quest, and both this and J...","{'neg': 0.0, 'neu': 0.528, 'pos': 0.472, 'comp...",0.000,0.472,0.528,0.9701,positive
1,2024-08-13T01:16:21Z,@prevailege,I really like her design! It&#39;s just PERFECT!!,"{'neg': 0.0, 'neu': 0.425, 'pos': 0.575, 'comp...",0.000,0.575,0.425,0.8443,positive
2,2024-08-12T06:51:42Z,@Ew-wth,"Look at the replay amounts for each scene, lol...","{'neg': 0.0, 'neu': 0.781, 'pos': 0.219, 'comp...",0.000,0.219,0.781,0.4215,positive
3,2024-08-11T19:42:25Z,@S.k.Editor-14,BRO I CAN&#39;T THE HAND MOMENT 😭✋🏻💦,"{'neg': 0.133, 'neu': 0.56, 'pos': 0.307, 'com...",0.133,0.307,0.560,0.6166,positive
4,2024-08-09T19:33:09Z,@CrimRui,I love it! Great stuff.,"{'neg': 0.0, 'neu': 0.259, 'pos': 0.741, 'comp...",0.000,0.741,0.259,0.8622,positive
...,...,...,...,...,...,...,...,...,...
10732,2024-02-18T03:01:01Z,@misakoblades,Yo this looks cool,"{'neg': 0.0, 'neu': 0.566, 'pos': 0.434, 'comp...",0.000,0.434,0.566,0.3182,positive
10733,2024-02-18T03:00:59Z,@dq9168,Cool,"{'neg': 0.0, 'neu': 0.0, 'pos': 1.0, 'compound...",0.000,1.000,0.000,0.3182,positive
10734,2024-02-18T03:00:56Z,@1Yeji,first,"{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound...",0.000,0.000,1.000,0.0000,neutral
10735,2024-02-18T03:00:55Z,@Rarrinn,WTF-,"{'neg': 1.0, 'neu': 0.0, 'pos': 0.0, 'compound...",1.000,0.000,0.000,-0.5859,negative


# Data Cleansing & Preprocessing

## Character Aspect

In [None]:
df_char.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14036 entries, 0 to 14035
Data columns (total 9 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   publishedAt        14036 non-null  object 
 1   authorDisplayName  14008 non-null  object 
 2   textDisplay        14011 non-null  object 
 3   scores             14036 non-null  object 
 4   neg_score          14036 non-null  float64
 5   pos_score          14036 non-null  float64
 6   neu_score          14036 non-null  float64
 7   compound_score     14036 non-null  float64
 8   sentiment          14036 non-null  object 
dtypes: float64(4), object(5)
memory usage: 987.0+ KB


In [None]:
df_char.isnull().sum()

Unnamed: 0,0
publishedAt,0
authorDisplayName,28
textDisplay,25
scores,0
neg_score,0
pos_score,0
neu_score,0
compound_score,0
sentiment,0


In [None]:
df_char.describe()

Unnamed: 0,neg_score,pos_score,neu_score,compound_score
count,14036.0,14036.0,14036.0,14036.0
mean,0.073598,0.1906,0.735802,0.186428
std,0.154544,0.236303,0.255211,0.443977
min,0.0,0.0,0.0,-0.9957
25%,0.0,0.0,0.556,0.0
50%,0.0,0.092,0.755,0.0
75%,0.081,0.333,1.0,0.5719
max,1.0,1.0,1.0,0.9999


### Remove Duplicated Comments

In [None]:
df_char[df_char.duplicated(['authorDisplayName', 'textDisplay', 'publishedAt'])].shape

(5, 9)

In [None]:
df_char.drop_duplicates(['authorDisplayName', 'textDisplay', 'publishedAt'], inplace=True)
df_char.reset_index(drop=True, inplace=True)
df_char.shape

(14031, 9)

### Preprocessing

Next, we will perform preprocessing to transform our text into a more digestible form for our classifier. The steps are as follows:

- Remove HTML tags
- Use a regular expression to remove special characters, numbers, and emojis
- Lowercase words
- Collapse characters repeated three or more times
- Tokenize with NLTK
- Use NLTK to remove stopwords
- Lemmatization
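
The regex-based steps can be traced on a single string before running them on the whole dataset. The sketch below uses only the standard library: stripping tags with a regex and decoding entities with `html.unescape` is a stand-in for BeautifulSoup, and `trace_cleaning` and the sample sentence are invented for illustration.

```python
import html
import re

def trace_cleaning(raw_text):
    """Return the intermediate result of each regex-based cleaning step."""
    # Stand-in for BeautifulSoup(...).get_text(): drop tags, then decode entities.
    no_tags = html.unescape(re.sub(r"<[^>]+>", " ", raw_text))
    # Replace anything that is not an ASCII letter with a space.
    letters_only = re.sub(r"[^a-zA-Z]", " ", no_tags)
    # Lowercase and split on whitespace.
    words = letters_only.lower().split()
    # Collapse any character repeated three or more times into one.
    collapsed = [re.sub(r"(.)\1{2,}", r"\1", w) for w in words]
    return {"no_tags": no_tags, "words": words, "collapsed": collapsed}

steps = trace_cleaning("Sooo goood!!! <br>1 month lefttt &#39;til release 🔥")
print(steps["words"])      # ['sooo', 'goood', 'month', 'lefttt', 'til', 'release']
print(steps["collapsed"])  # ['so', 'god', 'month', 'left', 'til', 'release']
```

Note that "goood" collapses to "god" rather than "good", because the pattern keeps a single copy of the repeated character. Keeping two copies (`r"\1\1"`) is a common alternative, though it would turn "lefttt" into "leftt".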

In [None]:
def char_preprocessed_text(char_raw_text):

  # Missing comments are read as float NaN; return an empty string for any non-string input
  if not isinstance(char_raw_text, str):
    return ''

  # Remove HTML tags and decode entities such as &#39;
  char_comments_text = BeautifulSoup(char_raw_text, 'html.parser').get_text()

  # Replace everything that is not an ASCII letter (digits, punctuation, emojis, non-Latin text) with a space
  char_words_only = re.sub("[^a-zA-Z]", " ", char_comments_text)

  # Convert to lowercase and split on whitespace
  words = char_words_only.lower().split()

  # Collapse any character repeated three or more times into one (e.g. 'lefttt' -> 'left')
  words = [re.sub(r'(.)\1{2,}', r'\1', word) for word in words]

  # Tokenization
  tokens = word_tokenize(" ".join(words))

  # Removing stopwords like the, in, of etc.
  char_stop_words = set(stopwords.words('english'))
  char_fil_words = [word for word in tokens if word not in char_stop_words]

  # Lemmatization
  lemmatizer = WordNetLemmatizer()
  lemmatized_words = [lemmatizer.lemmatize(word) for word in char_fil_words]

  return " ".join(lemmatized_words)

In [None]:
# Pre-process the raw text
df_char['textDisplayAdj'] = df_char['textDisplay'].map(char_preprocessed_text)

In [None]:
df_char

Unnamed: 0,publishedAt,authorDisplayName,textDisplay,scores,neg_score,pos_score,neu_score,compound_score,sentiment,textDisplayAdj
0,2024-08-12T18:35:15Z,@erenop2319,Finally Got her after loosing the 50-50 welp a...,"{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound...",0.0,0.000,1.000,0.0000,neutral,finally got loosing welp anyways deserves
1,2024-08-12T15:13:30Z,@user-ie9xx5yo2z,I have to say that Changli&#39;s character mod...,"{'neg': 0.0, 'neu': 0.628, 'pos': 0.372, 'comp...",0.0,0.372,0.628,0.7964,positive,say changli character modeling better genshin
2,2024-08-13T02:05:49Z,@avej99,I mean thats not hard to do tbh,"{'neg': 0.0, 'neu': 0.844, 'pos': 0.156, 'comp...",0.0,0.156,0.844,0.0762,positive,mean thats hard tbh
3,2024-08-13T17:47:23Z,@WuWa-oh7rd,Agree,"{'neg': 0.0, 'neu': 0.0, 'pos': 1.0, 'compound...",0.0,1.000,0.000,0.3612,positive,agree
4,2024-08-12T06:05:15Z,@DonLuy,"<a href=""https://www.youtube.com/watch?v=jYjxj...","{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound...",0.0,0.000,1.000,0.0000,neutral,gojo
...,...,...,...,...,...,...,...,...,...,...
14026,2024-04-22T10:02:20Z,@Ureli06,1 month lefttt,"{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound...",0.0,0.000,1.000,0.0000,neutral,month left
14027,2024-04-22T10:01:56Z,@D-mail_user.,2,"{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound...",0.0,0.000,1.000,0.0000,neutral,
14028,2024-04-22T10:01:55Z,@ShinkaiCh,I&#39;M HERE,"{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound...",0.0,0.000,1.000,0.0000,neutral,
14029,2024-04-22T10:01:40Z,@abdo77066,the first,"{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound...",0.0,0.000,1.000,0.0000,neutral,first


In [None]:
pd.set_option('display.max_colwidth', None)
df_char[['textDisplay' , 'textDisplayAdj']]

Unnamed: 0,textDisplay,textDisplayAdj
0,Finally Got her after loosing the 50-50 welp anyways she deserves,finally got loosing welp anyways deserves
1,I have to say that Changli&#39;s character modeling is better than genshin❤,say changli character modeling better genshin
2,I mean thats not hard to do tbh,mean thats hard tbh
3,Agree,agree
4,"<a href=""https://www.youtube.com/watch?v=jYjxjy1l6Co&amp;t=138"">2:18</a> gojo?",gojo
...,...,...
14026,1 month lefttt,month left
14027,2,
14028,I&#39;M HERE,
14029,the first,first


### Cleansing

In [None]:
# Count the tokens left after preprocessing; zero means the comment had no usable English text
df_char['textDisplay_clean'] = df_char['textDisplayAdj'].str.split().map(len)

In [None]:
df_char[df_char['textDisplay_clean']==0].shape

(693, 11)

In [None]:
# Inspect the comments that were reduced to nothing (emojis, non-Latin text, stopword-only comments)
df_char[df_char['textDisplay_clean']==0]['textDisplay']

Unnamed: 0,textDisplay
39,😭😭
66,唉，摘桃 唉，意难平😮‍💨
70,🎉
79,🔥🔥🔥🔥🔥
81,"霓裳,长离"
...,...
14010,❤❤❤❤❤❤
14016,here
14018,👀
14027,2


In [None]:
# Drop the comments that have no remaining tokens
df_char = df_char.drop(df_char[df_char['textDisplay_clean']==0].index)

# Reindex the dataframe
df_char.reset_index(drop=True, inplace=True)
df_char.shape

(13338, 11)

In [None]:
from langdetect import detect, DetectorFactory

DetectorFactory.seed = 0  # langdetect is non-deterministic by default; fix the seed for reproducible results

def is_english(text):
    try:
        return detect(text) == 'en'
    except Exception:  # detect() raises when the text has no detectable features
        return False

# Create a boolean mask indicating which comments are in English
english_mask_char = df_char['textDisplay'].apply(is_english)

# Keep the non-English comments for inspection; select them before filtering,
# while the mask and the DataFrame still share the same index
non_english_textDisplay_char = df_char.loc[~english_mask_char, 'textDisplay']

# Filter the DataFrame to keep only English comments
df_char = df_char[english_mask_char].copy()

# Reset the index to ensure it is sequential and starts at 0
df_char.reset_index(drop=True, inplace=True)
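
The order of operations matters when a boolean mask is reused: pandas aligns a mask to a frame by index labels, so the non-English rows are safest to select from the unfiltered frame, before the index is reset. A minimal sketch on a toy frame follows. `looks_english` is a hypothetical ASCII-only stand-in for the language detector, used here only so the example runs without extra packages.

```python
import pandas as pd

def looks_english(text: str) -> bool:
    """Hypothetical stand-in for a language detector: ASCII-only text passes."""
    return text.isascii()

toy = pd.DataFrame({"textDisplay": ["first", "感覺不如原神", "Cool", "霓裳,长离"]})

english_mask = toy["textDisplay"].apply(looks_english)

# Select both subsets from the unfiltered frame, while the mask and the
# frame still share the same index labels.
non_english = toy.loc[~english_mask, "textDisplay"]
english_only = toy[english_mask].reset_index(drop=True)

print(non_english.tolist())
print(english_only["textDisplay"].tolist())
```

Applying `~english_mask` after the filter and `reset_index` would match the mask's original labels against the new 0-based index and pick unrelated rows.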

In [None]:
df_char

Unnamed: 0,publishedAt,authorDisplayName,textDisplay,scores,neg_score,pos_score,neu_score,compound_score,sentiment,textDisplayAdj,textDisplay_clean
0,2024-08-12T18:35:15Z,@erenop2319,Finally Got her after loosing the 50-50 welp anyways she deserves,"{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}",0.000,0.000,1.000,0.0000,neutral,finally got loosing welp anyways deserves,6
1,2024-08-12T15:13:30Z,@user-ie9xx5yo2z,I have to say that Changli&#39;s character modeling is better than genshin❤,"{'neg': 0.0, 'neu': 0.628, 'pos': 0.372, 'compound': 0.7964}",0.000,0.372,0.628,0.7964,positive,say changli character modeling better genshin,6
2,2024-08-13T02:05:49Z,@avej99,I mean thats not hard to do tbh,"{'neg': 0.0, 'neu': 0.844, 'pos': 0.156, 'compound': 0.0762}",0.000,0.156,0.844,0.0762,positive,mean thats hard tbh,4
3,2024-08-09T09:48:36Z,@istgifusupportnehema...,OMG I HAVE CHANGLI AND HELP HER COMBAT ANIMS R SO GOOD?.???? AAAA😭😭😭😭 <br><br>offtopic but does anyone know how to get to the mianloong chamber from loong&#39;s rest?? im lost😭,"{'neg': 0.306, 'neu': 0.602, 'pos': 0.092, 'compound': -0.8962}",0.306,0.092,0.602,-0.8962,negative,omg changli help combat anims r good offtopic anyone know get mianloong chamber loong rest im lost,17
4,2024-08-08T10:37:45Z,@finchXY,Lost 50/50 but will try again on rerun 😫,"{'neg': 0.407, 'neu': 0.593, 'pos': 0.0, 'compound': -0.6705}",0.407,0.000,0.593,-0.6705,negative,lost try rerun,3
...,...,...,...,...,...,...,...,...,...,...,...
9734,2024-04-22T10:48:43Z,@detlefswelt4546,What? 1 month....16 days? What are you talking about?,"{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}",0.000,0.000,1.000,0.0000,neutral,month day talking,3
9735,2024-04-22T10:50:12Z,@onii3146,27 may right?,"{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}",0.000,0.000,1.000,0.0000,neutral,may right,2
9736,2024-04-22T10:02:20Z,@Ureli06,1 month lefttt,"{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}",0.000,0.000,1.000,0.0000,neutral,month left,2
9737,2024-04-22T10:01:40Z,@abdo77066,the first,"{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}",0.000,0.000,1.000,0.0000,neutral,first,1


In [None]:
# Drop the helper column textDisplay_clean, then reorder the columns so textDisplayAdj sits next to textDisplay
df_char = df_char.drop('textDisplay_clean', axis=1)
df_char = df_char[['authorDisplayName', 'textDisplay', 'textDisplayAdj',
                   'scores', 'neg_score', 'pos_score', 'neu_score',
                   'compound_score', 'sentiment', 'publishedAt']]

In [None]:
df_char

Unnamed: 0,authorDisplayName,textDisplay,textDisplayAdj,scores,neg_score,pos_score,neu_score,compound_score,sentiment,publishedAt
0,@erenop2319,Finally Got her after loosing the 50-50 welp anyways she deserves,finally got loosing welp anyways deserves,"{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}",0.000,0.000,1.000,0.0000,neutral,2024-08-12T18:35:15Z
1,@user-ie9xx5yo2z,I have to say that Changli&#39;s character modeling is better than genshin❤,say changli character modeling better genshin,"{'neg': 0.0, 'neu': 0.628, 'pos': 0.372, 'compound': 0.7964}",0.000,0.372,0.628,0.7964,positive,2024-08-12T15:13:30Z
2,@avej99,I mean thats not hard to do tbh,mean thats hard tbh,"{'neg': 0.0, 'neu': 0.844, 'pos': 0.156, 'compound': 0.0762}",0.000,0.156,0.844,0.0762,positive,2024-08-13T02:05:49Z
3,@istgifusupportnehema...,OMG I HAVE CHANGLI AND HELP HER COMBAT ANIMS R SO GOOD?.???? AAAA😭😭😭😭 <br><br>offtopic but does anyone know how to get to the mianloong chamber from loong&#39;s rest?? im lost😭,omg changli help combat anims r good offtopic anyone know get mianloong chamber loong rest im lost,"{'neg': 0.306, 'neu': 0.602, 'pos': 0.092, 'compound': -0.8962}",0.306,0.092,0.602,-0.8962,negative,2024-08-09T09:48:36Z
4,@finchXY,Lost 50/50 but will try again on rerun 😫,lost try rerun,"{'neg': 0.407, 'neu': 0.593, 'pos': 0.0, 'compound': -0.6705}",0.407,0.000,0.593,-0.6705,negative,2024-08-08T10:37:45Z
...,...,...,...,...,...,...,...,...,...,...
9734,@detlefswelt4546,What? 1 month....16 days? What are you talking about?,month day talking,"{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}",0.000,0.000,1.000,0.0000,neutral,2024-04-22T10:48:43Z
9735,@onii3146,27 may right?,may right,"{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}",0.000,0.000,1.000,0.0000,neutral,2024-04-22T10:50:12Z
9736,@Ureli06,1 month lefttt,month left,"{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}",0.000,0.000,1.000,0.0000,neutral,2024-04-22T10:02:20Z
9737,@abdo77066,the first,first,"{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}",0.000,0.000,1.000,0.0000,neutral,2024-04-22T10:01:40Z


### Save Clean Data

In [None]:
df_char.to_csv('yt_comments_char_clean.csv', index=False) #Save the data

## Story Aspect

In [None]:
df_story.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10737 entries, 0 to 10736
Data columns (total 9 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   publishedAt        10737 non-null  object 
 1   authorDisplayName  10734 non-null  object 
 2   textDisplay        10735 non-null  object 
 3   scores             10737 non-null  object 
 4   neg_score          10737 non-null  float64
 5   pos_score          10737 non-null  float64
 6   neu_score          10737 non-null  float64
 7   compound_score     10737 non-null  float64
 8   sentiment          10737 non-null  object 
dtypes: float64(4), object(5)
memory usage: 755.1+ KB


In [None]:
df_story.isnull().sum()

Unnamed: 0,0
publishedAt,0
authorDisplayName,3
textDisplay,2
scores,0
neg_score,0
pos_score,0
neu_score,0
compound_score,0
sentiment,0


In [None]:
df_story.describe()

Unnamed: 0,neg_score,pos_score,neu_score,compound_score
count,10737.0,10737.0,10737.0,10737.0
mean,0.074005,0.176716,0.749279,0.182553
std,0.157343,0.227318,0.251038,0.444893
min,0.0,0.0,0.0,-0.9906
25%,0.0,0.0,0.58,0.0
50%,0.0,0.078,0.78,0.0
75%,0.079,0.3,1.0,0.5719
max,1.0,1.0,1.0,0.9997


### Remove Duplicated Comments

In [None]:
df_story[df_story.duplicated(['authorDisplayName', 'textDisplay', 'publishedAt'])].shape

(1, 9)

In [None]:
df_story.drop_duplicates(['authorDisplayName', 'textDisplay', 'publishedAt'], inplace=True)
df_story.reset_index(drop=True, inplace=True)
df_story.shape

(10736, 9)

### Preprocessing

Next, we will perform preprocessing to transform our text into a more digestible form for our classifier. The steps are as follows:

- Remove HTML tags
- Use a regular expression to remove special characters, numbers, and emojis
- Lowercase words
- Collapse characters repeated three or more times
- Tokenize with NLTK
- Use NLTK to remove stopwords
- Lemmatization
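
One side effect of stopword removal shows up in the character-aspect output: "I mean thats not hard to do tbh" becomes "mean thats hard tbh", so the negation is lost. If that matters for the downstream classifier, one option is to keep negation words. This is an assumption for illustration, not something this notebook does. The sketch uses a tiny, hypothetical stopword set in place of the NLTK list.

```python
# Tiny, hypothetical stopword set; the notebook uses stopwords.words('english').
STOP_WORDS = {"i", "to", "do", "not", "the", "is", "it"}
NEGATIONS = {"not", "no", "never"}

def remove_stopwords(tokens, keep_negations=False):
    """Filter stopwords, optionally leaving negation words in place."""
    stop_words = STOP_WORDS - NEGATIONS if keep_negations else STOP_WORDS
    return [t for t in tokens if t not in stop_words]

tokens = "i mean thats not hard to do tbh".split()
without_negation = remove_stopwords(tokens)
with_negation = remove_stopwords(tokens, keep_negations=True)

print(without_negation)  # ['mean', 'thats', 'hard', 'tbh']
print(with_negation)     # ['mean', 'thats', 'not', 'hard', 'tbh']
```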

In [None]:
def story_preprocessed_text(story_raw_text):

  # Missing comments are read as float NaN; return an empty string for any non-string input
  if not isinstance(story_raw_text, str):
    return ''

  # Remove HTML tags and decode entities such as &#39;
  story_comments_text = BeautifulSoup(story_raw_text, 'html.parser').get_text()

  # Replace everything that is not an ASCII letter (digits, punctuation, emojis, non-Latin text) with a space
  story_words_only = re.sub("[^a-zA-Z]", " ", story_comments_text)

  # Convert to lowercase and split on whitespace
  words = story_words_only.lower().split()

  # Collapse any character repeated three or more times into one (e.g. 'lefttt' -> 'left')
  words = [re.sub(r'(.)\1{2,}', r'\1', word) for word in words]

  # Tokenization
  tokens = word_tokenize(" ".join(words))

  # Removing stopwords like the, in, of etc.
  story_stop_words = set(stopwords.words('english'))
  story_fil_words = [word for word in tokens if word not in story_stop_words]

  # Lemmatization
  lemmatizer = WordNetLemmatizer()
  lemmatized_words = [lemmatizer.lemmatize(word) for word in story_fil_words]

  return " ".join(lemmatized_words)

In [None]:
# Pre-process the raw text
df_story['textDisplayAdj'] = df_story['textDisplay'].map(story_preprocessed_text)

In [None]:
df_story

Unnamed: 0,publishedAt,authorDisplayName,textDisplay,scores,neg_score,pos_score,neu_score,compound_score,sentiment,textDisplayAdj
0,2024-08-14T08:18:31Z,@vrika4280,"I just finished the quest, and both this and Jinhsi&#39;s, improved so much over Yinlin&#39;s imo, awesome work! Glad this game exists and hope you continue to have great success!","{'neg': 0.0, 'neu': 0.528, 'pos': 0.472, 'compound': 0.9701}",0.000,0.472,0.528,0.9701,positive,finished quest jinhsi improved much yinlin imo awesome work glad game exists hope continue great success
1,2024-08-13T01:16:21Z,@prevailege,I really like her design! It&#39;s just PERFECT!!,"{'neg': 0.0, 'neu': 0.425, 'pos': 0.575, 'compound': 0.8443}",0.000,0.575,0.425,0.8443,positive,really like design perfect
2,2024-08-12T06:51:42Z,@Ew-wth,"Look at the replay amounts for each scene, lol. Just saying.","{'neg': 0.0, 'neu': 0.781, 'pos': 0.219, 'compound': 0.4215}",0.000,0.219,0.781,0.4215,positive,look replay amount scene lol saying
3,2024-08-11T19:42:25Z,@S.k.Editor-14,BRO I CAN&#39;T THE HAND MOMENT 😭✋🏻💦,"{'neg': 0.133, 'neu': 0.56, 'pos': 0.307, 'compound': 0.6166}",0.133,0.307,0.560,0.6166,positive,bro hand moment
4,2024-08-09T19:33:09Z,@CrimRui,I love it! Great stuff.,"{'neg': 0.0, 'neu': 0.259, 'pos': 0.741, 'compound': 0.8622}",0.000,0.741,0.259,0.8622,positive,love great stuff
...,...,...,...,...,...,...,...,...,...,...
10731,2024-02-18T03:01:01Z,@misakoblades,Yo this looks cool,"{'neg': 0.0, 'neu': 0.566, 'pos': 0.434, 'compound': 0.3182}",0.000,0.434,0.566,0.3182,positive,yo look cool
10732,2024-02-18T03:00:59Z,@dq9168,Cool,"{'neg': 0.0, 'neu': 0.0, 'pos': 1.0, 'compound': 0.3182}",0.000,1.000,0.000,0.3182,positive,cool
10733,2024-02-18T03:00:56Z,@1Yeji,first,"{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}",0.000,0.000,1.000,0.0000,neutral,first
10734,2024-02-18T03:00:55Z,@Rarrinn,WTF-,"{'neg': 1.0, 'neu': 0.0, 'pos': 0.0, 'compound': -0.5859}",1.000,0.000,0.000,-0.5859,negative,wtf


In [None]:
pd.set_option('display.max_colwidth', None)
df_story[['textDisplay' , 'textDisplayAdj']]

Unnamed: 0,textDisplay,textDisplayAdj
0,"I just finished the quest, and both this and Jinhsi&#39;s, improved so much over Yinlin&#39;s imo, awesome work! Glad this game exists and hope you continue to have great success!",finished quest jinhsi improved much yinlin imo awesome work glad game exists hope continue great success
1,I really like her design! It&#39;s just PERFECT!!,really like design perfect
2,"Look at the replay amounts for each scene, lol. Just saying.",look replay amount scene lol saying
3,BRO I CAN&#39;T THE HAND MOMENT 😭✋🏻💦,bro hand moment
4,I love it! Great stuff.,love great stuff
...,...,...
10731,Yo this looks cool,yo look cool
10732,Cool,cool
10733,first,first
10734,WTF-,wtf


### Cleansing

In [None]:
# Count the tokens left after preprocessing; zero means the comment had no usable English text
df_story['textDisplay_clean'] = df_story['textDisplayAdj'].str.split().map(len)

In [None]:
df_story[df_story['textDisplay_clean']==0].shape

(483, 11)

In [None]:
# Inspect the comments that were reduced to nothing (emojis, non-Latin text, stopword-only comments)
df_story[df_story['textDisplay_clean']==0]['textDisplay']

Unnamed: 0,textDisplay
8,感覺不如原神
37,好心你做个人吧，没有事情做了…..我的休息时间就是想玩点东西，搞点活动，还有联觉等级的奖励太少了！
66,抄襲薩爾達的垃圾遊戲
86,呜朝绝对是国产之光🇨🇳故事讲的是的呜朝当皇帝，😃😃😃😃😃😃
114,🎉🎉
...,...
10654,🔥🔥🔥💯
10655,❤❤
10685,😱
10706,same


In [None]:
# Drop the comments that have no remaining tokens
df_story = df_story.drop(df_story[df_story['textDisplay_clean']==0].index)

# Reindex the dataframe
df_story.reset_index(drop=True, inplace=True)
df_story.shape

(10253, 11)

In [None]:
from langdetect import detect, DetectorFactory

DetectorFactory.seed = 0  # langdetect is non-deterministic by default; fix the seed for reproducible results

def is_english(text):
    try:
        return detect(text) == 'en'
    except Exception:  # detect() raises when the text has no detectable features
        return False

# Create a boolean mask indicating which comments are in English
english_mask_story = df_story['textDisplay'].apply(is_english)

# Keep the non-English comments for inspection; select them before filtering,
# while the mask and the DataFrame still share the same index
non_english_textDisplay_story = df_story.loc[~english_mask_story, 'textDisplay']

# Filter the DataFrame to keep only English comments
df_story = df_story[english_mask_story].copy()

# Reset the index to ensure it is sequential and starts at 0
df_story.reset_index(drop=True, inplace=True)

In [None]:
df_story

Unnamed: 0,publishedAt,authorDisplayName,textDisplay,scores,neg_score,pos_score,neu_score,compound_score,sentiment,textDisplayAdj,textDisplay_clean
0,2024-08-14T08:18:31Z,@vrika4280,"I just finished the quest, and both this and Jinhsi&#39;s, improved so much over Yinlin&#39;s imo, awesome work! Glad this game exists and hope you continue to have great success!","{'neg': 0.0, 'neu': 0.528, 'pos': 0.472, 'compound': 0.9701}",0.000,0.472,0.528,0.9701,positive,finished quest jinhsi improved much yinlin imo awesome work glad game exists hope continue great success,16
1,2024-08-13T01:16:21Z,@prevailege,I really like her design! It&#39;s just PERFECT!!,"{'neg': 0.0, 'neu': 0.425, 'pos': 0.575, 'compound': 0.8443}",0.000,0.575,0.425,0.8443,positive,really like design perfect,4
2,2024-08-12T06:51:42Z,@Ew-wth,"Look at the replay amounts for each scene, lol. Just saying.","{'neg': 0.0, 'neu': 0.781, 'pos': 0.219, 'compound': 0.4215}",0.000,0.219,0.781,0.4215,positive,look replay amount scene lol saying,6
3,2024-08-11T19:42:25Z,@S.k.Editor-14,BRO I CAN&#39;T THE HAND MOMENT 😭✋🏻💦,"{'neg': 0.133, 'neu': 0.56, 'pos': 0.307, 'compound': 0.6166}",0.133,0.307,0.560,0.6166,positive,bro hand moment,3
4,2024-08-09T19:33:09Z,@CrimRui,I love it! Great stuff.,"{'neg': 0.0, 'neu': 0.259, 'pos': 0.741, 'compound': 0.8622}",0.000,0.741,0.259,0.8622,positive,love great stuff,3
...,...,...,...,...,...,...,...,...,...,...,...
7771,2024-02-18T03:09:47Z,@monkeydluffy5054,Let them burn the kitchen for this game,"{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}",0.000,0.000,1.000,0.0000,neutral,let burn kitchen game,4
7772,2024-02-18T03:12:59Z,@isen47,Based on their character models they happen to be baking too.,"{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}",0.000,0.000,1.000,0.0000,neutral,based character model happen baking,5
7773,2024-02-18T03:01:01Z,@misakoblades,Yo this looks cool,"{'neg': 0.0, 'neu': 0.566, 'pos': 0.434, 'compound': 0.3182}",0.000,0.434,0.566,0.3182,positive,yo look cool,3
7774,2024-02-18T03:00:59Z,@dq9168,Cool,"{'neg': 0.0, 'neu': 0.0, 'pos': 1.0, 'compound': 0.3182}",0.000,1.000,0.000,0.3182,positive,cool,1


In [None]:
# Drop the helper column textDisplay_clean, then reorder the columns so textDisplayAdj sits next to textDisplay
df_story = df_story.drop('textDisplay_clean', axis=1)
df_story = df_story[['authorDisplayName', 'textDisplay', 'textDisplayAdj',
                     'scores', 'neg_score', 'pos_score', 'neu_score',
                     'compound_score', 'sentiment', 'publishedAt']]

In [None]:
df_story

Unnamed: 0,authorDisplayName,textDisplay,textDisplayAdj,scores,neg_score,pos_score,neu_score,compound_score,sentiment,publishedAt
0,@vrika4280,"I just finished the quest, and both this and Jinhsi&#39;s, improved so much over Yinlin&#39;s imo, awesome work! Glad this game exists and hope you continue to have great success!",finished quest jinhsi improved much yinlin imo awesome work glad game exists hope continue great success,"{'neg': 0.0, 'neu': 0.528, 'pos': 0.472, 'compound': 0.9701}",0.000,0.472,0.528,0.9701,positive,2024-08-14T08:18:31Z
1,@prevailege,I really like her design! It&#39;s just PERFECT!!,really like design perfect,"{'neg': 0.0, 'neu': 0.425, 'pos': 0.575, 'compound': 0.8443}",0.000,0.575,0.425,0.8443,positive,2024-08-13T01:16:21Z
2,@Ew-wth,"Look at the replay amounts for each scene, lol. Just saying.",look replay amount scene lol saying,"{'neg': 0.0, 'neu': 0.781, 'pos': 0.219, 'compound': 0.4215}",0.000,0.219,0.781,0.4215,positive,2024-08-12T06:51:42Z
3,@S.k.Editor-14,BRO I CAN&#39;T THE HAND MOMENT 😭✋🏻💦,bro hand moment,"{'neg': 0.133, 'neu': 0.56, 'pos': 0.307, 'compound': 0.6166}",0.133,0.307,0.560,0.6166,positive,2024-08-11T19:42:25Z
4,@CrimRui,I love it! Great stuff.,love great stuff,"{'neg': 0.0, 'neu': 0.259, 'pos': 0.741, 'compound': 0.8622}",0.000,0.741,0.259,0.8622,positive,2024-08-09T19:33:09Z
...,...,...,...,...,...,...,...,...,...,...
7771,@monkeydluffy5054,Let them burn the kitchen for this game,let burn kitchen game,"{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}",0.000,0.000,1.000,0.0000,neutral,2024-02-18T03:09:47Z
7772,@isen47,Based on their character models they happen to be baking too.,based character model happen baking,"{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}",0.000,0.000,1.000,0.0000,neutral,2024-02-18T03:12:59Z
7773,@misakoblades,Yo this looks cool,yo look cool,"{'neg': 0.0, 'neu': 0.566, 'pos': 0.434, 'compound': 0.3182}",0.000,0.434,0.566,0.3182,positive,2024-02-18T03:01:01Z
7774,@dq9168,Cool,cool,"{'neg': 0.0, 'neu': 0.0, 'pos': 1.0, 'compound': 0.3182}",0.000,1.000,0.000,0.3182,positive,2024-02-18T03:00:59Z


### Save Clean Data

In [None]:
df_story.to_csv('yt_comments_story_clean.csv', index=False) #Save the data