# Enriched Columns
In this section we derive additional columns. These are different in the sense that they are all transformations of the "comment" column of the clean dataset. They will be useful for analyses that we will run later.

In [None]:
import polars as pl
from datetime import date
import os
import sys
from tqdm import tqdm
# load project directory to path
sys.path.append(os.path.abspath(os.path.join(os.getcwd(), "..")))

from src.preprocessing import *
from paths import Paths
import config

print(f"Working with the channel handle: {config.channel_handle}.")
date_today = date.today()
channel_paths = Paths(channel_handle=config.channel_handle, date_obj=date_today)

Working with the channel handle: kurzgesagt.


# Derived columns
We derive the following columns from the comment text:
- Simple tokens (stopwords kept)
- Tokens without stopwords
- Emojis only
- Mentions (tokens that start with @, e.g. @username)
- Hashtags, although these are rare in YouTube comments

The functions that produce these derived columns live in the `src/preprocessing.py` file. Look there for the details of what each function does.
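To give a rough idea of what such helpers can look like, here is a minimal sketch based on regular expressions. The names `sketch_mentions`, `sketch_hashtags`, `sketch_tokens` and the tiny stopword set are hypothetical and only for illustration; the real `extract_mentions`, `extract_hashtags` and `tokenize_mixed` in `src/preprocessing.py` may work differently.

```python
import re

# Hypothetical stand-ins for the helpers in src/preprocessing.py;
# the real implementations may differ.
MENTION_RE = re.compile(r"@\w+")   # an "@" followed by word characters
HASHTAG_RE = re.compile(r"#\w+")   # a "#" followed by word characters
WORD_RE = re.compile(r"[a-z]+")    # runs of ASCII letters only
TOY_STOPWORDS = {"the", "a", "is", "but", "in"}  # illustrative, not a real list


def sketch_mentions(text: str) -> list[str]:
    """Return every @mention found in the text."""
    return MENTION_RE.findall(text)


def sketch_hashtags(text: str) -> list[str]:
    """Return every #hashtag found in the text."""
    return HASHTAG_RE.findall(text)


def sketch_tokens(text: str, keep_stopwords: bool = True) -> list[str]:
    """Lowercase the text and split it into alphabetic tokens."""
    tokens = WORD_RE.findall(text.lower())
    if keep_stopwords:
        return tokens
    return [t for t in tokens if t not in TOY_STOPWORDS]


sample = "@@Chivas6 But the video is great #science"
print(sketch_mentions(sample))
print(sketch_hashtags(sample))
print(sketch_tokens(sample, keep_stopwords=False))
```

Patterns this simple also match things that are not real mentions or hashtags (the domain part of an e-mail address, for instance), so treat them as a starting point only.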

In [2]:
# Each enricher takes the DataFrame and returns one derived Series
# computed from the "comment" column.
def add_tokens_simple(df: pl.DataFrame) -> pl.Series:
    return df["comment"].map_elements(lambda t: tokenize_mixed(t, keep_stopwords=True))

def add_tokens_wo_stop(df: pl.DataFrame) -> pl.Series:
    return df["comment"].map_elements(lambda t: tokenize_mixed(t, keep_stopwords=False))

def add_emojis(df: pl.DataFrame) -> pl.Series:
    return df["comment"].map_elements(extract_emojis)

def add_mentions(df: pl.DataFrame) -> pl.Series:
    return df["comment"].map_elements(extract_mentions)

def add_hashtags(df: pl.DataFrame) -> pl.Series:
    return df["comment"].map_elements(extract_hashtags)

# def add_tokens_lemmatized(df: pl.DataFrame) -> pl.Series:
#     return df["comment"].map_elements(lambda t: lemmatize_tokens(tokenize_mixed(t)))

ENRICHERS = {
    "tokens_simple": add_tokens_simple,
    "tokens_wo_stop": add_tokens_wo_stop,
    "emojis": add_emojis,
    "mentions": add_mentions,
    "hashtags": add_hashtags,
    # "tokens_lemmatized": add_tokens_lemmatized,
    # ... add as needed
}

## Future enriched columns
When a new column is added to the enrichment process, the following function provides a way to add or patch columns: it does not touch columns already in the dataset and only adds those that are missing.

In [3]:
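def add_or_patch_columns(df: pl.DataFrame, enrichers: dict) -> pl.DataFrame:
    """Add every column in `enrichers` that `df` lacks; leave existing columns untouched."""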
    # Figure out which enrichers are missing
    missing = [col for col in enrichers if col not in df.columns]
    if missing:
        print(f"Adding '{missing}'")
    else:
        print("All columns present")
        return df

    # Apply only the missing enrichers
    for col in tqdm(missing, desc="Adding missing columns", unit="col"):
        df = df.with_columns([
            enrichers[col](df).alias(col)
        ])

    return df

# Reading from the clean comments file
Enriched files are partitioned in the same way as the clean comments files: by day. Every clean file for a specific day generates one enriched parquet file with all the desired derived columns. For now we only need the original comment from the clean comments parquet. We also keep the comment ID so that every derived value can be traced back to its original comment.

In [4]:
df = pl.read_parquet(channel_paths.clean_comments_file_path, columns=['comment_id', 'comment'])
df.sample(3)

comment_id,comment
str,str
"""Ugzwc2k1UTzpSCw1wq54AaABAg.8fZ…","""Lucian MacAndrew the holy Qur…"
"""UgixBTWBLENWT3gCoAEC.80UF8ZksD…","""007VitaminD Murder is bad, sex…"
"""UgivQIBkqYD753gCoAEC""","""Kurzgesagt, you da man"""


Running the function that adds the derived columns

In [5]:
df = add_or_patch_columns(df, ENRICHERS)

Adding '['tokens_simple', 'tokens_wo_stop', 'emojis', 'mentions', 'hashtags']'


Adding missing columns: 100%|██████████| 5/5 [00:43<00:00,  8.61s/col]


This is how our dataset looks now.

In [12]:
df.sample(5)

comment_id,comment,tokens_simple,tokens_wo_stop,emojis,mentions,hashtags
str,str,list[str],list[str],list[str],list[str],list[str]
"""UgwVii-xFX3XsVM6YPh4AaABAg""","""Looks like a subnautica fish😅""","[""looks"", ""like"", … ""fish""]","[""looks"", ""like"", … ""fish""]","[""😅""]",[],[]
"""Ugh_qztkNYvDTngCoAEC.8JjzTsy9q…","""@@Chivas6 But human beings are…","[""chivas"", ""but"", … ""existence""]","[""chivas"", ""human"", … ""existence""]",[],"[""@Chivas6""]",[]
"""Ugx1fZD5ySCw7dOdMx54AaABAg.9UH…","""@@Jesuisunknown in the mighty …","[""jesuisunknown"", ""in"", … ""lol""]","[""jesuisunknown"", ""mighty"", … ""lol""]",[],"[""@Jesuisunknown""]",[]
"""Ugw9BBNy2rh0s0pcsUx4AaABAg""","""I didn't see that coming 4:16""","[""i"", ""didn"", … ""coming""]","[""see"", ""coming""]",[],[],[]
"""Ugwy_-dij3dYe6JWufp4AaABAg.9F8…","""@@rashidahmad3086 the book of …","[""rashidahmad"", ""the"", … ""that""]","[""rashidahmad"", ""book"", … ""way""]",[],"[""@rashidahmad3086""]",[]


Estimated in-memory size of the DataFrame, as reported by Polars. This is not the size of the parquet file on disk.

In [13]:
print(f"{df.estimated_size()/1e6:.2f} MB")

146.32 MB


## Saving the results back to parquet
Now that we have the enrichers in place, we are ready to save our results in a parquet file. These files go into the `data/processed/enriched` directory and are identified by the channel handle and the date of the clean parquet file. For example:
- `comments/kurzgesagt_comments_2025_09_03.parquet` will become `enriched/kurzgesagt_enriched_comments_2025_09_03.parquet`.

We won't save the original comment in the enriched file, since it can be fetched from the clean comments file.
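The naming convention above can be expressed as a small path helper. This is only a sketch: `enriched_path_for` is a hypothetical function written for this example, and the project's `Paths` class may build its paths in a different way.

```python
from pathlib import Path


def enriched_path_for(clean_path: Path) -> Path:
    """Map a clean comments file to its enriched counterpart (hypothetical helper).

    Assumes the clean file is named '<handle>_comments_<date>.parquet' and lives
    in a directory whose sibling directory is called 'enriched'.
    """
    handle, sep, date_part = clean_path.name.partition("_comments_")
    if not sep:
        raise ValueError(f"Unexpected clean file name: {clean_path.name}")
    enriched_dir = clean_path.parent.with_name("enriched")
    return enriched_dir / f"{handle}_enriched_comments_{date_part}"


clean = Path("comments/kurzgesagt_comments_2025_09_03.parquet")
print(enriched_path_for(clean).as_posix())
```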

In [14]:
cols_to_exclude = {"comment"}
df = df.select([col for col in df.columns if col not in cols_to_exclude])

In [16]:
df.write_parquet(channel_paths.enriched_comments_file_path, compression='zstd')

# Bulk add columns
The following section is meant for the case where a new column is added to the analysis and needs to be added to the existing enriched files.

In [12]:
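# Note: enriched files do not store "comment", so it must be joined back
# from the clean file before the enrichers can run.
# for file in channel_paths.list_enriched_files():
#     print(f'Checking file at {file}')
#     df = pl.read_parquet(file)
#     df = add_or_patch_columns(df, enrichers=ENRICHERS)
#     # Write back to the file that was just read, not to today's enriched path
#     df.write_parquet(file, compression='zstd')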