# From Passion to Profit: Exploring Emotions as Indicators of Monetized Message Engagement
Jasmine Guo \
Data Science Institute \
DS 5780 Natural Language Processing \
Dr. Scott Crossley \
April 23, 2024

# Data Processing 
This is an auxiliary notebook to the main notebook. Its purpose is to process the raw data into dataframes that the main notebook will use. The main notebook contains further data wrangling, but those steps are specific to its analysis and are therefore kept there.

The process in this notebook is as follows:
- Reading all the data from the local drive.
- Binding the files of each message type into a single dataframe.
- Filtering the data to keep only English comments and only entries that are actual messages rather than system notifications.
- Saving the dataframes to .csv files.

In [23]:
# loading the library 
import pandas as pd
from langdetect import detect
import os

We will begin by loading the data, which are stored in subdirectories of the working directory. The regular messages are in the "regular" folder, and the monetized messages are in the "superchat" folder, a name taken from YouTube's naming convention.

In [24]:
# loading the data

# folder path
regularFP = "data/regular/"
monetizedFP = "data/superchat/"

# list of dataframes
regularDF = []
monetizedDF = []

# looping through the directory, extract the file and append it to the respective list
for filename in os.listdir(regularFP):
    file_path = os.path.join(regularFP, filename)
    df = pd.read_table(file_path, header = None)
    regularDF.append(df) 

for filename in os.listdir(monetizedFP):
    file_path = os.path.join(monetizedFP, filename) 
    df = pd.read_table(file_path, header = None)
    monetizedDF.append(df) 
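
The two loops differ only in the folder they read, so they could be folded into one helper. The following is a minimal sketch, assuming every file in a folder is a plain-text chat log with one message per line. `load_folder` is a hypothetical name that is not used elsewhere in this notebook, and the demo writes its own throwaway files with made-up messages instead of touching `data/`.

```python
import os
import tempfile

import pandas as pd


def load_folder(folder):
    """Read every file in `folder` into one single-column dataframe."""
    frames = []
    # sorted() makes the read order deterministic across platforms
    for filename in sorted(os.listdir(folder)):
        file_path = os.path.join(folder, filename)
        if not os.path.isfile(file_path):
            continue  # skip sub-folders
        frames.append(pd.read_table(file_path, header=None))
    # ignore_index=True gives the combined frame a fresh 0..n-1 index
    return pd.concat(frames, ignore_index=True)


# demo on throwaway files (made-up messages, not the real chat logs)
with tempfile.TemporaryDirectory() as tmp:
    demo_files = [
        ("a.txt", ["0:01 | Ann: hello", "0:02 | Bob: hi"]),
        ("b.txt", ["0:03 | Cy: good stream"]),
    ]
    for name, lines in demo_files:
        with open(os.path.join(tmp, name), "w", encoding="utf-8") as fh:
            fh.write("\n".join(lines) + "\n")
    demoDF = load_folder(tmp)
```

With such a helper, `load_folder(regularFP)` would return an already concatenated frame, so a separate `pd.concat` call would not be needed for that folder.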


In [25]:
# concatenate the data into one dataframe
regularDF = pd.concat(regularDF)
monetizedDF = pd.concat(monetizedDF)

Next, we will split the data into their respective columns. The data are stored in .txt files without a consistent format, so each line was loaded as a single string. \
Between the timestamp and name columns, the delimiter is "|", and between the name and comment columns, the delimiter is a colon.

In [26]:
# split the regular data into timestamp, name and comment
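# using n = 1 to split only upon the first occurrence
# "|" is escaped because pandas treats a multi-character pattern as a regex, where a bare "|" means "or"
# (this assumes the lines use " | " with surrounding spaces)
timeStampDF = regularDF[0].str.split(r" \| ", expand=True, n=1)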
commentDF = timeStampDF[1].str.split(":", expand=True, n=1)
regularSplitDF = pd.concat([timeStampDF[0], commentDF], axis=1)
regularSplitDF.columns = ['Timestamp', 'Name', 'Comment']

In [27]:
# split the monetized data into timestamp, name and comment
# using n = 1 to split only upon the first occurrence
timeStampDF = monetizedDF[0].str.split("|", expand=True, n=1)
commentDF = timeStampDF[1].str.split(":", expand=True, n=1)
monetizedSplitDF = pd.concat([timeStampDF[0], commentDF], axis=1)
monetizedSplitDF.columns = ['Timestamp', 'Name', 'Comment']
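
A worked example makes the delimiter handling concrete. One detail worth knowing is that pandas treats a split pattern longer than one character as a regular expression by default, and in a regex `|` means "or", so a literal `" | "` delimiter has to be escaped. The sample lines below are invented, and the layout `timestamp | name: comment` is an assumption about the log format.

```python
import pandas as pd

# invented sample lines in the assumed "timestamp | name: comment" layout
raw = pd.Series([
    "12:01 | Ann: hello there",
    "12:02 | Bob: note: colons in the comment survive",
    "12:03 | membership notice without a comment",
])

# escaped pipe = literal " | "; n=1 splits on the first match only
stampDF = raw.str.split(r" \| ", expand=True, n=1)
# a single-character pattern is treated as a literal by default
bodyDF = stampDF[1].str.split(":", expand=True, n=1)

demoSplitDF = pd.concat([stampDF[0], bodyDF], axis=1)
demoSplitDF.columns = ["Timestamp", "Name", "Comment"]
# the text after ":" keeps its leading space, so strip it
demoSplitDF["Comment"] = demoSplitDF["Comment"].str.strip()
```

The third line has no colon after the name, so its `Comment` comes out missing. Rows like that are what a later `dropna` on `Comment` removes.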

Next, we will further process the data to remove null values and system notifications about memberships.

In [28]:
# removing NaN values
regularSplitDF.dropna(subset = ["Comment"], inplace=True)
monetizedSplitDF.dropna(subset = ["Comment"], inplace=True)

In [29]:
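# removing system notifications about gifted memberships
# .copy() avoids a SettingWithCopyWarning when a column is added later
monetizedSplitDF = monetizedSplitDF[~monetizedSplitDF['Comment'].str.contains('Gifted')].copy()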

In [30]:
print(f"The number of regular comments is: {len(regularSplitDF)}")
print(f"The number of monetized comments is: {len(monetizedSplitDF)}")

The number of regular comments is: 1381915
The number of monetized comments is: 6001


There are currently over 1.3 million regular comments but only about 6,000 monetized comments. We will randomly sample 20,000 regular messages to reduce the scope of the analysis.

In [31]:
regularSplitDFS = regularSplitDF.sample(n=20000)
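
Without a seed, `sample` draws a different subset on every run. If a repeatable subset is wanted, fixing `random_state` is one option. Below is a small sketch on a made-up frame. The seed value 42 is an arbitrary assumption, not something the original analysis used.

```python
import pandas as pd

# made-up stand-in for regularSplitDF
toyDF = pd.DataFrame({"Comment": [f"message {i}" for i in range(100)]})

# the same random_state yields the same rows on every run
firstDraw = toyDF.sample(n=10, random_state=42)
secondDraw = toyDF.sample(n=10, random_state=42)
```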

We will now filter out non-English comments.

In [32]:
def detect_language(text):
    try:
        return detect(text)
    except Exception:
        return "Error"

# detect the language of each comment (non-English comments are filtered out in the next cell)
regularSplitDFS['Language'] = regularSplitDFS['Comment'].apply(detect_language)
monetizedSplitDF['Language'] = monetizedSplitDF['Comment'].apply(detect_language)
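
One possible refinement of the wrapper above is to skip inputs that are too short to classify and to map every failure to a single label. This is a sketch only. `make_safe_detector`, the `"unknown"` label, and the `min_chars=3` threshold are illustrative assumptions, and a stand-in detector is used so the example runs without langdetect installed.

```python
def make_safe_detector(detect_fn, min_chars=3):
    """Wrap a language detector so bad input yields "unknown" instead of raising."""
    def safe_detect(text):
        # non-strings (e.g. NaN) and very short strings carry too little signal
        if not isinstance(text, str) or len(text.strip()) < min_chars:
            return "unknown"
        try:
            return detect_fn(text)
        except Exception:
            # Exception (rather than a bare except) still lets KeyboardInterrupt through
            return "unknown"
    return safe_detect


# stand-in detector for the demo; in the notebook one would pass
# langdetect's detect instead: make_safe_detector(detect)
def fake_detect(text):
    if text.isdigit():
        raise ValueError("no language features")
    return "en"


safe_detect = make_safe_detector(fake_detect)
results = [safe_detect("hello everyone"), safe_detect("12345"), safe_detect(None)]
```

Note that langdetect is not deterministic by default. Its README suggests setting `DetectorFactory.seed = 0` before calling `detect` when repeatable labels are needed.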

In [33]:
regularEN = regularSplitDFS[regularSplitDFS['Language'] == 'en']
monetizedEN = monetizedSplitDF[monetizedSplitDF['Language'] == 'en']

We will now save the dataframes to .csv files. The files will be further wrangled for use within the main notebook.

In [34]:
regularEN.to_csv('data/regularEN.csv', index=False)
monetizedEN.to_csv('data/monetizedEN.csv', index=False)