# 04 Text Processing

This notebook contains the main text processing of the streamer chats. It is a comprehensive pipeline that tries to extract the most meaningful data from the chats while preserving most of the raw text.

In [1]:
import json
import os
import re
from collections import defaultdict
from glob import glob

import emoji
import nltk
from nltk.corpus import stopwords
from tqdm import tqdm

# The following NLTK packages need to be downloaded once:
# nltk.download('punkt')
# nltk.download('stopwords')
# nltk.download('punkt_tab')

In [2]:
root_path = os.path.dirname(os.getcwd())
data_path = os.path.join(root_path, "data", "mention_network_chats")
data_path

'c:\\Users\\Andreas\\OneDrive\\Desktop\\Social Graphs and Interactions\\Final Project\\social_graphs_project\\data\\mention_network_chats'

Here we define the main text processing functions. The steps include:
- Removing the streamer's own name (and likely aliases) from the text, since it would otherwise dominate the word counts in the following analysis.
- Skipping Twitch system messages, as they carry no meaning for the streamer-viewer interaction.
- Defining additional stopwords to reduce noise from filler terms.
- Dropping tokens that contain three or more consecutive identical characters (for example `loooool`).
- Removing URLs.
- Removing mentions, as these were dominated by chat administrators and carried no linguistic meaning for the analysis.

*Note: this pipeline was developed with heavy assistance from generative AI.*
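The repeated-character step can be sketched in isolation as follows. In the pipeline, `squash_spam` only flags tokens that contain a character repeated three or more times, and `preprocess_text` then drops them. The helper `collapse_repeats` is a hypothetical alternative that is not part of this notebook: it keeps the token and shortens the run instead. The sample tokens are made up for illustration.

```python
import re

# Same character three or more times in a row
REPEAT_PATTERN = re.compile(r'(.)\1{2,}')

def has_repeated_run(token):
    """True if the token contains a run of 3+ identical characters."""
    return REPEAT_PATTERN.search(token) is not None

def collapse_repeats(token):
    """Hypothetical alternative: shorten each run of 3+ identical characters to one."""
    return REPEAT_PATTERN.sub(r'\1', token)

tokens = ["nice", "loooool", "wwwww", "cool"]  # made-up sample tokens

# Behaviour used in the pipeline: flagged tokens are dropped entirely.
kept = [t for t in tokens if not has_repeated_run(t)]

# Alternative: keep every token but collapse the long runs.
collapsed = [collapse_repeats(t) for t in tokens]

print(kept)
print(collapsed)
```

Dropping is the more aggressive choice: it removes spam such as `wwwww`, but it also discards stretched words like `loooool` that collapsing would map back to `lol`.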

In [None]:


stopwords_english = set(stopwords.words('english'))

twitch_system_words = { # Common Twitch system messages
    'welcome', 'joined', 'left', 'hosted', 'hosting',
    'subscribed', 'gifted', 'sub', 'prime', 'tier', 
    'months', 'streak', 'resubscribed', 'raided'
}

chat_filler_stopwords = {
    'lol', 'lmao', 'like', 'hai', 'mhm', 'yeah', 'yawn', 
    'bro', 'back', 'get', 'one', 'good', 'want', 'think', 
    'know', 'im', 'thats', 'dont', 'cant', 'tho', 'oh', 'uh',
    'ok', 'okay', 'right', 'maybe', 'well', 'see', 'bot',
    'watch', 'streams', 'turn', 'stop', 'please', 'got', 'ass', 'fire', 'call',
    'kekw', 'lul', 'pog', 'poggers', 'monka', 'ez', 'monkas', 'pepe', 'xd', 
    'omegalul', 'gasm', 'hi', 'www', 'omg', 'lo', 'om', 'wo', 'na', 'ww'

}

combined_stopwords = stopwords_english.union(twitch_system_words).union(chat_filler_stopwords)


SYSTEM_PHRASES = {
    'subscribed with prime',  
    'subscribed for',
    'just subscribed',
    'gifted a subscription', 
    'gifted a sub',
    'gifted sub',
    'gift',
    'donate',
    'consecutive streams',
    '!uptime',
    '?song'
}


def generate_alias_set(streamer_name):
    """
    Generates a set of potential aliases by simplifying the complex streamer name.
    """
    aliases = {streamer_name.lower()}
    
    # 1. Remove one leading/trailing number, symbol, or common affix such as 'tv', 'live' or 'gaming'
    simple_name = re.sub(r'(\d+|tv|live|gaming|ow|hd|_|-)$', '', streamer_name, flags=re.IGNORECASE)
    simple_name = re.sub(r'^(\d+|tv|live|gaming|ow|hd|_|-)', '', simple_name, flags=re.IGNORECASE)

    if simple_name.lower() != streamer_name.lower() and len(simple_name) >= 3:
        aliases.add(simple_name.lower())
    
    # 2. Generate prefixes of the simplified name
    if len(simple_name) >= 3:
        for i in range(3, len(simple_name) + 1):
            aliases.add(simple_name[:i].lower())
            
    # 3. Heuristic for complex names: The common short alias is often the last part.
    if len(simple_name) > 4:
        short_alias = simple_name[-3:].lower() # Take the last 3 characters
        if short_alias.isalpha(): # Only include if it's not just numbers/symbols
            aliases.add(short_alias)
            
    return aliases

def remove_streamer_name(text, streamer):
    if not streamer:
        return text

    aliases_to_remove = generate_alias_set(streamer)

    # Escape and sort by length descending
    escaped_aliases = [re.escape(alias) for alias in aliases_to_remove]
    escaped_aliases.sort(key=len, reverse=True)
    pattern_str = '|'.join(escaped_aliases)
    
    # Case-insensitive substitution; note that aliases are also matched inside longer words
    cleaned_text = re.sub(pattern_str, " ", text, flags=re.IGNORECASE) 
    
    return cleaned_text


def is_auto_message(text):
    """
    Checks if a message body contains known system/alert phrases.
    """
    lowered_text = text.lower()
    
    # Check for known system phrases
    for phrase in SYSTEM_PHRASES:
        if phrase in lowered_text:
            return True

    return False

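def squash_spam(token):
    """
    Returns True if the token contains three or more consecutive identical
    characters (e.g. 'loooool'). Such tokens are dropped in preprocess_text.
    """
    # A run of 3+ identical characters implies the token has at least 3 characters
    return re.search(r'(.)\1{2,}', token) is not None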


def remove_mentions(text):
    """
    Removes all user mentions (e.g., @MukkingAround, @OmegaTooYew) 
    from the text, replacing them with a space.
    """
    pattern = r'@\S+'
    return re.sub(pattern, ' ', text).strip()

def normalize_urls(text):
    """Removes http(s):// and www. URLs from the text."""
    url_pattern = r'https?://\S+|www\.\S+'
    return re.sub(url_pattern, '', text)


def preprocess_text(text, streamer):
    text = normalize_urls(text)
    text = remove_mentions(text)
    text = remove_streamer_name(text, streamer)
    text = emoji.replace_emoji(text, replace='') # Strip emojis (replaced with an empty string)
    lowered_text = text.lower() # lowercase for tokenization
    text_cleaned_aggressive = re.sub(r'[^a-z0-9\s]', ' ', lowered_text) # Remove non-alphanumeric characters aggressively
    tokens = nltk.word_tokenize(text_cleaned_aggressive)
    tokens = [token for token in tokens if not squash_spam(token)]
    tokens = [token for token in tokens if token.isalnum()] # Keep only alphanumeric tokens
    tokens = [token for token in tokens if token not in combined_stopwords]
    return tokens
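One property of `remove_streamer_name` is worth illustrating: because the alias pattern has no word boundaries, short aliases are also removed from inside unrelated words. The sketch below compares that behaviour with a whole-word variant. The function `remove_aliases_whole_word` is hypothetical and not part of the pipeline, and the alias set is written out by hand (it is what the prefix rule in `generate_alias_set` would be expected to yield for the streamer `lacy`).

```python
import re

def remove_aliases_whole_word(text, aliases):
    """Replace aliases with a space, but only where they appear as whole words."""
    if not aliases:
        return text
    # Longest alias first, so that "lacy" is tried before its prefix "lac".
    escaped = sorted((re.escape(a) for a in aliases), key=len, reverse=True)
    pattern = r'\b(?:' + '|'.join(escaped) + r')\b'
    return re.sub(pattern, ' ', text, flags=re.IGNORECASE)

aliases = {"lacy", "lac"}      # hand-written alias set, for illustration only
message = "nice place lacy"    # made-up chat message

# Substring matching, as in remove_streamer_name: "lac" is also cut out of "place".
substring_result = re.sub("lacy|lac", " ", message, flags=re.IGNORECASE)

# Whole-word matching leaves "place" intact and removes only the name itself.
whole_word_result = remove_aliases_whole_word(message, aliases)

print(substring_result.split())
print(whole_word_result.split())
```

The trade-off is that `\b` only matches between word and non-word characters, so the whole-word variant would miss a name that is glued to other letters, which the substring version catches.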

The main loop runs through the JSON chat files, where the comments for each stream are stored. Each comment is checked against conditions that cause it to be skipped: for example, if its first fragment is an emoticon or its text matches a system message. The remaining comments are processed with the text cleaning pipeline defined above.

In [4]:
# Get all chat files
chat_files = list(glob(os.path.join(data_path, '*_chat.json')))
tokenized_chat = defaultdict(list)

for i, chat_file in tqdm(enumerate(chat_files), total=len(chat_files)):
    # Extract streamer name from filename (portable across operating systems)
    filename = os.path.splitext(os.path.basename(chat_file))[0]
    if 'latest' in filename:
        continue
    source_streamer = filename.rsplit('_', 5)[0].lower() # Extract streamer 
    try:
        with open(chat_file, 'r', encoding='utf-8') as f:
            chat_data = json.load(f)    
        comments = chat_data.get('comments', [])
        for comment in tqdm(comments, desc=f"Processing comments for {source_streamer}"):
            fragments = comment['message']['fragments']
            if not fragments:
                continue
            if fragments[0]['emoticon']: # Skip comments that start with an emoticon
                continue
            text = fragments[0]['text'] # Only the first fragment's text is used
            if is_auto_message(text): # Check for system messages
                continue  
            tokens = preprocess_text(text, source_streamer) # Preprocess text
            tokenized_chat[source_streamer] += tokens
    except Exception as e:
        print(f"[{i}/{len(chat_files)}] {source_streamer:15s}: Error - {e}")


Processing comments for 39daph: 100%|██████████| 14693/14693 [00:01<00:00, 13696.38it/s]
Processing comments for abdulhd: 100%|██████████| 1474/1474 [00:00<00:00, 12106.76it/s]
Processing comments for aceu: 100%|██████████| 1498/1498 [00:00<00:00, 11719.76it/s]
Processing comments for adapt: 100%|██████████| 99262/99262 [00:07<00:00, 14161.80it/s]
Processing comments for admiralbahroo: 100%|██████████| 9436/9436 [00:00<00:00, 12263.62it/s]
Processing comments for agent00: 100%|██████████| 19122/19122 [00:01<00:00, 14419.55it/s]
Processing comments for agurin: 100%|██████████| 1421/1421 [00:00<00:00, 10707.14it/s]
Processing comments for ahmpy: 100%|██████████| 17919/17919 [00:01<00:00, 11488.14it/s]
Processing comments for alinity: 100%|██████████| 13865/13865 [00:00<00:00, 14011.47it/s]
Processing comments for alois_nl: 100%|██████████| 15015/15015 [00:01<00:00, 11122.16it/s]
Processing comments for alveussanctuary: 100%|██████████| 18590/18590 [00:01<00:00, 13318.32it/s]
Processing c

[378/935] jasontheween   : Error - Expecting value: line 1 column 1 (char 0)


Processing comments for jasontheween: 100%|██████████| 485854/485854 [00:34<00:00, 14108.01it/s]
Processing comments for jay3: 100%|██████████| 9560/9560 [00:00<00:00, 13365.43it/s]
Processing comments for jeefhs: 100%|██████████| 2474/2474 [00:00<00:00, 12325.92it/s]
Processing comments for jenazad: 100%|██████████| 85505/85505 [00:05<00:00, 15427.83it/s]
Processing comments for jerma985: 100%|██████████| 53780/53780 [00:03<00:00, 13538.14it/s]
Processing comments for jingggxd: 100%|██████████| 4788/4788 [00:00<00:00, 12784.25it/s]
Processing comments for jinnytty: 100%|██████████| 37585/37585 [00:02<00:00, 15983.22it/s]
Processing comments for joe_bartolozzi: 100%|██████████| 35074/35074 [00:02<00:00, 11707.45it/s]
Processing comments for johnstone: 100%|██████████| 16448/16448 [00:01<00:00, 12245.47it/s]
Processing comments for jokerdtv: 100%|██████████| 10072/10072 [00:01<00:00, 8662.66it/s]
Processing comments for joshseki: 100%|██████████| 9568/9568 [00:00<00:00, 11712.52it/s]
Pr

[443/935] lacy           : Error - Expecting value: line 1 column 1 (char 0)


Processing comments for lacy: 100%|██████████| 80458/80458 [00:04<00:00, 17347.81it/s]
Processing comments for laynalazar: 100%|██████████| 20044/20044 [00:01<00:00, 15434.57it/s]
Processing comments for leaksworld_: 100%|██████████| 7785/7785 [00:00<00:00, 13863.24it/s]
Processing comments for lec: 100%|██████████| 5247/5247 [00:00<00:00, 11951.58it/s]
Processing comments for lilaggy: 100%|██████████| 5315/5315 [00:00<00:00, 11280.28it/s]
Processing comments for lilsimsie: 100%|██████████| 11685/11685 [00:01<00:00, 10115.47it/s]
Processing comments for limealicious: 100%|██████████| 16126/16126 [00:01<00:00, 15156.77it/s]
Processing comments for limmy: 100%|██████████| 20671/20671 [00:01<00:00, 15909.24it/s]
Processing comments for lirik: 100%|██████████| 50065/50065 [00:03<00:00, 12568.37it/s]
Processing comments for littlespacerock: 100%|██████████| 2280/2280 [00:00<00:00, 11212.22it/s]
Processing comments for lobanjicaa: 100%|██████████| 21883/21883 [00:02<00:00, 10526.96it/s]
Proc
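The two `Expecting value: line 1 column 1 (char 0)` errors in the output above are the message `json.load` reports when the input does not begin with a JSON value, for example when a file is empty. Whether that is the cause for these two files cannot be told from the output alone. As a minimal sketch, a hypothetical helper `load_comments` (not part of this notebook) could make such files explicit instead of relying on the broad `except Exception`. It assumes, like the loop above, that the top-level JSON value is an object with a `comments` key. The file names are made up.

```python
import json
import os
import tempfile

def load_comments(path):
    """Return the 'comments' list of a chat file, or [] if the file is empty or not valid JSON."""
    if os.path.getsize(path) == 0:
        return []  # json.load would fail with "Expecting value: line 1 column 1 (char 0)"
    try:
        with open(path, 'r', encoding='utf-8') as f:
            chat_data = json.load(f)
    except json.JSONDecodeError:
        return []
    return chat_data.get('comments', [])

# Demonstration with two temporary files: one empty, one valid.
with tempfile.TemporaryDirectory() as tmp_dir:
    empty_path = os.path.join(tmp_dir, "empty_chat.json")
    valid_path = os.path.join(tmp_dir, "valid_chat.json")
    open(empty_path, 'w', encoding='utf-8').close()
    with open(valid_path, 'w', encoding='utf-8') as f:
        json.dump({"comments": [{"message": {"fragments": []}}]}, f)

    empty_comments = load_comments(empty_path)
    valid_comments = load_comments(valid_path)

print(len(empty_comments), len(valid_comments))
```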

The result is a dictionary with streamers as keys and, as values, a list of all tokens extracted from their respective streams.
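As a quick sanity check on this structure, the most frequent token per streamer can be inspected with `collections.Counter`. The sketch below uses a made-up miniature dictionary of the same shape; the streamer names and tokens are placeholders, not data from this project.

```python
from collections import Counter

# Hypothetical miniature of the streamer -> tokens mapping.
example_tokens = {
    "streamer_a": ["game", "clutch", "game", "aim"],
    "streamer_b": ["music", "song", "music", "music"],
}

# Most frequent token (with its count) for each streamer.
top_token = {
    streamer: Counter(tokens).most_common(1)[0]
    for streamer, tokens in example_tokens.items()
}

print(top_token)
```

If a streamer name or a stopword-like filler shows up at the top, that is a hint that the alias or stopword sets need extending.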

In [5]:
streamer_comments = dict(tokenized_chat) # defaultdict to regular dict

Export for further analysis.

In [6]:
with open(os.path.join(root_path, "data", "streamer_tokenized_comments.json"), 'w', encoding='utf-8') as f: # Export text_processed data
    json.dump(streamer_comments, f)