<h3 style="text-align:center; font-weight:bold;">Create dataset</h3>

<p style="text-align:right; font-style:italic;">Last Edited: December 17th, 2025</p>

This notebook contains code to create the final dataset (`finalD.csv`) from a larger collection of TikTok videos from stay-at-home mom (SAHM) creators. This notebook creates the following variables: engagement, age_group, relatability, and emotions. 

- Section 0: Anonymize dataset for public repository
- Section 1: Filter dataset (US, 2024, minimum activity) → `data/data0.csv`
- Section 2: Create age variable + sample 1,000 videos per age group → `data/data1.csv`
- Section 3: Create engagement metric → `data/data2.csv`
- Section 4: Relatability scoring → `data/data3.csv` (and `data/test_relatability.csv`)
- Section 5: Emotion scoring → `data/data4.csv` (and `data/test_emotion.csv`)
- Section 6: Variable selection → `data/finalD.csv`

`Section 0: Anonymize dataset for public repository`

In [3]:
### DATASET ANONYMIZED FOR PUBLIC REPOSITORY

import pandas as pd

# Process data1.csv through data4.csv
for i in range(1, 5):
    df = pd.read_csv(f"data/data{i}.csv")
    df.drop(["id", "create_time", "video_description"], axis=1, inplace=True)
    df.to_csv(f"data/data{i}.csv", index=False)

# Process finalD.csv
df = pd.read_csv("data/finalD.csv")
df.drop(["create_time", "comment_count", "share_count", "view_count"], axis=1, inplace=True)
df.to_csv("data/finalD.csv", index=False)

`Section 1: Filter dataset`

In [1]:
### Import relevant libraries

import json, csv
import pandas as pd
from datetime import datetime

In [2]:
### Convert jsonl file to csv file

with open("data/initialD.jsonl","r",encoding="utf-8") as infile, open("data/data0.csv","w",encoding="utf-8",newline="") as outfile:
    writer = csv.DictWriter(outfile,fieldnames=["id","username","create_time","like_count","comment_count","share_count","view_count","hashtags","video_description"])
    writer.writeheader()
    for line in infile:
        video=json.loads(line)
        if video.get("region_code","").lower()!="us":
            continue
        create_time=datetime.fromtimestamp(video["create_time"])
        if create_time.year!=2024:
            continue
        if video.get("view_count",0)==0: #should have been viewed at least once (most of these are just failed data collection)
            continue
        if video.get("like_count",0)==0: #need at least some engagement
            continue
        if video.get("comment_count",0)==0: #need at least some engagement
            continue
        row={
            "id":video.get("id",""),
            "username":video.get("username",""),
            "create_time":create_time.isoformat(),
            "like_count":video.get("like_count",0),
            "comment_count":video.get("comment_count",0),
            "share_count":video.get("share_count",0),
            "view_count":video.get("view_count",0),
            "hashtags":video.get("hashtag_names",""),
            "video_description":video.get("video_description","")
        }
        writer.writerow(row)
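
A note on the year filter: `datetime.fromtimestamp` without a `tz` argument converts to the local timezone of the machine running the notebook, so a video posted within a few hours of New Year can fall into a different year on a different machine. The sketch below shows a UTC-based variant. It assumes `create_time` is a Unix timestamp in seconds (which is how the cell above treats it); the helper name `creation_year_utc` is hypothetical and not part of the notebook.

```python
from datetime import datetime, timezone

def creation_year_utc(epoch_seconds):
    """Return the calendar year of a Unix timestamp, interpreted in UTC."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc).year

# 1704067200 is 2024-01-01 00:00:00 UTC; one second earlier is still 2023 in UTC.
last_second_2023 = 1704067199
first_second_2024 = 1704067200
print(creation_year_utc(last_second_2023), creation_year_utc(first_second_2024))
```

Whether UTC or a US timezone is the better reference for "posted in 2024" is a design choice; the point is only that pinning the timezone makes the filter reproducible.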

In [5]:
### Import and check csv file (does not show id, username, create_time, video_description)

data0 = pd.read_csv("data/data0.csv", usecols=lambda c: c not in ["id", "username", "create_time", "video_description"])
print("Number of videos:", len(data0))
data0.head()

Number of videos: 865517


Unnamed: 0,like_count,comment_count,share_count,view_count,hashtags
0,9,2,0,403,"['newyearseve', 'fyp', 'sahm', 'happynewyearseve', 'babynumber3', 'momsoftiktok', 'capcut', 'feb2024']"
1,1671,25,12,41702,"['breakfast', 'morningroutine', 'coffeeaddict', 'frenchtoast', 'vlog', 'newborns', 'sahm', 'momsoftiktok', 'unboxwithme', 'embercup🔥']"
2,113,10,1,4284,"['happynewyear', 'sahm', 'minivlog', 'momsoftiktok', 'hello2024', '2023recap', 'endof2023season']"
3,3,2,0,238,"['dayinthelife', 'sahm', 'momsoftiktok', 'wivesoftiktok', 'blackmomsoftiktok']"
4,37,1,0,602,"['momlife', 'iconic', 'momoftwo', 'sahm', 'homeschooling', 'homeschoolmom', 'millennialmom']"


`Section 2: Create age-variable`

In [6]:
### Import relevant libraries

import ast, random
import pandas as pd
from collections import Counter

In [7]:
### Import dataset (does not show id, username, create_time, video_description)

data0 = pd.read_csv("data/data0.csv", usecols=lambda c: c not in ["id", "username", "create_time", "video_description"])
data0.head()

Unnamed: 0,like_count,comment_count,share_count,view_count,hashtags
0,9,2,0,403,"['newyearseve', 'fyp', 'sahm', 'happynewyearseve', 'babynumber3', 'momsoftiktok', 'capcut', 'feb2024']"
1,1671,25,12,41702,"['breakfast', 'morningroutine', 'coffeeaddict', 'frenchtoast', 'vlog', 'newborns', 'sahm', 'momsoftiktok', 'unboxwithme', 'embercup🔥']"
2,113,10,1,4284,"['happynewyear', 'sahm', 'minivlog', 'momsoftiktok', 'hello2024', '2023recap', 'endof2023season']"
3,3,2,0,238,"['dayinthelife', 'sahm', 'momsoftiktok', 'wivesoftiktok', 'blackmomsoftiktok']"
4,37,1,0,602,"['momlife', 'iconic', 'momoftwo', 'sahm', 'homeschooling', 'homeschoolmom', 'millennialmom']"


In [9]:
### Check the top 80 hashtags

data0["hashtags_parsed"]=data0["hashtags"].apply(lambda x: ast.literal_eval(x) if isinstance(x,str) else x)
all_hashtags=[tag for tags in data0["hashtags_parsed"] for tag in tags]
top80=Counter(all_hashtags).most_common(80)
cols={}
for i in range(8):
    chunk=top80[i*10:(i+1)*10]
    cols[f"Col_{i+1}"]=[f"{tag}: {count}" for tag,count in chunk]
pd.DataFrame(cols)

Unnamed: 0,Col_1,Col_2,Col_3,Col_4,Col_5,Col_6,Col_7,Col_8
0,sahm: 865517,foryoupage: 67072,girlmom: 41952,vlog: 25740,family: 21151,coffee: 17369,pregnant: 13607,dinner: 11741
1,fyp: 311958,boymom: 62184,creatorsearchinsights: 38501,momsover30: 25067,babiesoftiktok: 21063,moms: 16908,wife: 13606,morningvlog: 11433
2,momsoftiktok: 308863,fypã‚·: 57033,fypage: 36683,firsttimemom: 24899,relatable: 20878,funny: 16819,viralvideo: 13098,2under2: 11274
3,momlife: 206506,viral: 50072,sahmtok: 35430,cleantok: 24611,fyppppppppppppppppppppppp: 20634,youngmom: 16614,love: 12911,momvlog: 11252
4,momtok: 177002,motherhood: 47411,teamwork: 34071,ootd: 23847,dayinmylife: 20484,postpartum: 15878,dayinthelife: 12706,parati: 11143
5,sahmlife: 153796,trending: 46002,contentcreator: 30657,mama: 23634,ditl: 19925,tiktok: 15214,cooking: 12221,kids: 10882
6,sahmsoftiktok: 91138,capcut: 45938,roadto10k: 29674,baby: 22526,momof3: 19345,newborn: 14952,babyboy: 12211,newmom: 10604
7,mom: 79817,fypã‚·ã‚šviral: 44744,morningroutine: 27677,christmas: 22140,momcontent: 18734,workingmom: 14528,cleanwithme: 12110,breakfast: 10502
8,toddlermom: 78679,stayathomemom: 42973,tiktokshop: 26565,momof2: 21855,grwm: 17425,digitalmarketing: 13802,lifestyle: 11974,ditlofamom: 10194
9,foryou: 70816,toddlersoftiktok: 42367,fy: 25891,toddler: 21832,cleaning: 17411,workfromhome: 13658,babytok: 11939,roadto5k: 10144


In [10]:
### Check hashtags, find more via the next codeblock

def count_hashtags(df, tag_list):
    return {tag: df["hashtags_parsed"].apply(lambda x: tag in x).sum() for tag in tag_list}

age1 = ["pregnancy", "pregnant", "pregnantlife", "thirdtrimester", "pregnancyjourney", "pregnanttiktok"] #pre-birth
print("Age group 1: ", count_hashtags(data0, age1))

age2 = ["newborn", "newbornbaby", "baby", "4monthsold"] #0 to 1 years old
print("Age group 2: ", count_hashtags(data0, age2))

age3 = ["toddler", "toddlermom", "toddlertok", "toddlersoftiktok", "toddleractivities", "toddlermama", "preschool"] #1 to 5 years old (combined) toddlerlife
print("Age group 3: ", count_hashtags(data0, age3))

age4 = ["kindergarten", "kindergartener", "elementary", "elementaryteacher", "elementaryschool", "firstgrade", "secondgrade", "preteen", "kindergartenlife", "5thgrade",
        "kindergartenmom", "kindergartenteacher", "homeschoolkindergarten", "kindergartenhomeschool", "elementaryeducation", "preteenmom", "elementarymusic", "4thgrade"] #5 to 11 years old
print("Age group 4: ", count_hashtags(data0, age4))

age5 = ["teens", "teen",  "teenager", "teenagers", "middleschool", "highschool", "momofteens", "raisingteens", "teenagersbelike", "momsofteens", "teensoftiktok"] #12 to 18 years old
print("Age group 5: ", count_hashtags(data0, age5))

# ages = ([f"{w}weeksold" for w in range(0,13)] + [f"{w}monthsold" for w in range(0,13)] + [f"{m}months" for m in range(0,13)] + [f"{y}yearsold" for y in range(1,19)])
# print("Ages: ", count_hashtags(data0, ages))

#age1 previous: 'prenatal': 96, 'morningsickness': 94, 'pregnantbelly': 677, 'pregnant🤰': 605, 'pregnanttok': 696, 'expectingmom': 557, 'pregnantcontent': 22
#age2 previous: 'infant': 951, 'infants': 73, 'newbornsleep': 60, 'newbornnights': 37, 'newbornmornings': 1, 'newbornbabies': 163, 'newbornsoftiktok': 344, 'newbornlife': 819, 'newborntoys': 4, 'newbornstage': 107
#age3 previous: 'toddlersbelike': 1498, 'preschooler': 271, 'preschoolers': 107, 'preschoolmama': 16, 'preschoolersbelike': 10
#age4 previous:  'tweens': 17,  '1stgrade': 17, 'primaryschool': 2, 'elementarymath': 4, 'gradeschoolers': 1
#age5 previous: 'adolescence': 0, 'middleschoolers': 5, 'parentingpreteens': 1
#previous considered: adultkids, college, collegekids, offtocollege, teenmoms & teenmama (two meanings)

Age group 1:  {'pregnancy': 7827, 'pregnant': 13607, 'pregnantlife': 5193, 'thirdtrimester': 2013, 'pregnancyjourney': 1941, 'pregnanttiktok': 3652}
Age group 2:  {'newborn': 14952, 'newbornbaby': 1921, 'baby': 22526, '4monthsold': 986}
Age group 3:  {'toddler': 21832, 'toddlermom': 78679, 'toddlertok': 7199, 'toddlersoftiktok': 42367, 'toddleractivities': 3855, 'toddlermama': 1897, 'preschool': 1994}
Age group 4:  {'kindergarten': 908, 'kindergartener': 11, 'elementary': 67, 'elementaryteacher': 71, 'elementaryschool': 155, 'firstgrade': 95, 'secondgrade': 55, 'preteen': 67, 'kindergartenlife': 55, '5thgrade': 20, 'kindergartenmom': 30, 'kindergartenteacher': 85, 'homeschoolkindergarten': 36, 'kindergartenhomeschool': 30, 'elementaryeducation': 18, 'preteenmom': 24, 'elementarymusic': 8, '4thgrade': 15}
Age group 5:  {'teens': 265, 'teen': 431, 'teenager': 275, 'teenagers': 241, 'middleschool': 71, 'highschool': 192, 'momofteens': 536, 'raisingteens': 44, 'teenagersbelike': 102, 'moms

In [109]:
### Helpful function: Find list of hashtags

word = "teens"

hashtags_matching = []
for tags in data0["hashtags_parsed"]:
    hashtags_matching.extend([tag for tag in tags if word in tag.lower()])
hashtag_counts = Counter(hashtags_matching)

pd.set_option("display.max_colwidth", None)
top_hashtags = pd.DataFrame(hashtag_counts.items(), columns=["hashtag", "count"]).sort_values("count", ascending=False)
top_hashtags.head(15)

Unnamed: 0,hashtag,count
2,momofteens,536
9,teens,265
5,momsofteens,99
23,momsofteensandtweens,61
10,teensoftiktok,49
1,raisingteens,44
4,teensbelike,37
58,giftsforteens,22
0,teensahm,21
72,provingteenswrong,17


In [59]:
### Helpful function: Search for other hashtags given a hashtag

word = "earlychildhood" #change depending on what we are looking for
pd.set_option("display.max_colwidth", None)
data0[data0["hashtags_parsed"].apply(lambda x: word in x)][["hashtags_parsed"]].head(10) #can change number

Unnamed: 0,hashtags_parsed
21821,"[baby, longhair, sick, teacher, ladybug, momlife, babyfever, stayathomemom, caregiver, sahm, earlychildhood, mascararoutine, babyhairstyle, girlmom, amazoninfluencer, amazonstorefron]"
44541,"[family, kids, baby, childhood, doctor, play, learning, children, viral, parenting, foryou, education, momlife, fyp, parenthood, mentalhealth, montessori, childdevelopment, childcare, earlylearning, sahm, earlyyears, earlychildhood, parentingtips, occupationaltherapy, sensoryplay, foryoupage, earlychildhoodeducation, momsoftiktok, finemotorskills, gentleparenting, learningthroughplay, positiveparenting, fypã‚·, playbasedlearning]"
65297,"[attaboy, prek, preschool, earlylearning, sahm, earlychildhood, momsoftiktok, parentsoftiktok, hyperlexia, toddlerwriting, preschoolwriting]"
111958,"[baby, parenting, maternityleave, sahm, sahmlife, toddlermom, earlychildhood, ecse, earlychildhoodeducation, momsoftiktok, parentsoftiktok, dadsoftiktok, toddlersoftiktok, teachermama, specialeducationteacher]"
149184,"[mylife, momlife, wifelife, childcare, sahm, sahmlife, earlychildhood, contentcreator, earlychildhoodeducation, sahmsoftiktok, simplyakins, childcarecrisis, simplyakinsgang, simplyakinsmom, faithfamilyempire, obedientwifeandmom]"
153717,"[outdoors, counting, play, numbers, learn, explore, chalk, sahm, earlychildhood, momsoftiktok, toddlersoftiktok, learningthroughplay]"
156960,"[sahm, earlychildhood, momtok, muffintin, babiesoftiktok, learningthroughplay]"
165623,"[kids, elementary, kindergarten, preschool, homeschool, firstgrade, sahm, earlychildhood, busymom]"
165634,"[kids, elementary, kindergarten, preschool, homeschool, firstgrade, sahm, earlychildhood, busymom]"
165767,"[kids, elementary, kindergarten, preschool, homeschool, firstgrade, sahm, earlychildhood, busymom]"


In [12]:
### Helpful function: Count the number of videos in each age group that do NOT contain any hashtags from other age groups

age_groups = {"age1": age1, "age2": age2, "age3": age3, "age4": age4, "age5": age5}

def contains_any(tags, tag_list):
    return any(t in tags for t in tag_list)

age_group_col = []
for tags in data0["hashtags_parsed"]:
    assigned = None
    for group_name, tags_list in age_groups.items():
        if contains_any(tags, tags_list):
            others = [t for g, lst in age_groups.items() if g != group_name for t in lst] #check it does NOT appear in any other groups
            if not contains_any(tags, others):
                assigned = group_name
                break
    age_group_col.append(assigned)
data0["age_group"] = age_group_col
data_age = data0.dropna(subset=["age_group"]) 
data_age = data_age.drop('hashtags_parsed', axis=1) #gonna be a string later anyway
data_age["age_group"].value_counts()

age_group
age3    107800
age2     27321
age1     14961
age5      1512
age4      1007
Name: count, dtype: int64
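
The exclusive-assignment rule used above (keep a video only if its hashtags match exactly one age group) can also be written with set intersections. This is a minimal sketch, assuming the tag lists of the groups do not overlap; the shortened tag sets and the helper name `assign_exclusive` are illustrative and not taken from the notebook.

```python
# Hypothetical, shortened tag sets; the notebook's age1..age5 lists are longer.
age_groups = {
    "age1": {"pregnant", "pregnancy"},
    "age2": {"newborn", "baby"},
    "age3": {"toddler", "toddlermom"},
}

def assign_exclusive(tags, groups):
    """Return the single group whose tags appear in `tags`, else None."""
    matched = [name for name, group_tags in groups.items() if group_tags & set(tags)]
    return matched[0] if len(matched) == 1 else None

only_age1 = assign_exclusive(["sahm", "pregnant"], age_groups)
two_groups = assign_exclusive(["sahm", "pregnant", "toddler"], age_groups)
no_group = assign_exclusive(["sahm", "fyp"], age_groups)
print(only_age1, two_groups, no_group)
```

Videos that mention two age groups (for example a pregnant mother of a toddler) are dropped rather than assigned to the first match, which keeps the groups mutually exclusive at the cost of sample size.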

In [13]:
## Randomly select 1000 videos from each age group

random.seed(123) #note: seeds Python's random module only; the draw below is made reproducible by random_state=42
samples = []
for group in ["age1", "age2", "age3", "age4", "age5"]:
    group_df = data_age[data_age["age_group"] == group]
    samples.append(group_df.sample(n=1000, random_state=42)) 
data1 = pd.concat(samples, ignore_index=True)

print(data1["age_group"].value_counts()) #double check
data1.head()

age_group
age1    1000
age2    1000
age3    1000
age4    1000
age5    1000
Name: count, dtype: int64


Unnamed: 0,username,like_count,comment_count,share_count,view_count,hashtags,age_group
0,anon_0001,76,34,0,656,"['momlife', 'firsttimemom', 'sahm', 'newmom', 'roadto10k', 'motherhoodunplugged', 'momlifebelike', 'momtok', 'momsoftiktok', 'pregnanttiktok', 'growingmyaccount', 'motherhoodjourney', 'fypã‚·ã‚šviral', 'motherhoodunfiltered', 'scrunchymama', 'momlifevibes', 'pregnanttok']",age1
1,anon_0002,32,4,0,1010,"['pregnancy', 'ultrasound', 'sahm', 'momsoftiktok', 'momtoks', 'relatablemomcontent', 'sahmsoftiktok', 'pregnancytiktok', 'momtoktakeover', 'momtoktakeover2024']",age1
2,anon_0003,1086,74,85,50305,"['pregnancy', 'pregnant', 'ttc', 'momlife', 'fyp', 'stayathomemom', 'sahm', 'newmom', '23weekspregnant', 'firstpregnancy', 'fertility', 'foryoupage']",age1
3,anon_0004,2088,24,5,28514,"['pregnant', 'dinnertime', 'familyof5', 'onabudget', 'whatsfordinner', 'sahm', 'lowincome', 'momof3']",age1
4,anon_0005,165,17,3,3012,"['breakfast', 'earlymorning', 'pregnant', 'fy', 'fyp', 'CookWithMe', 'bluecollar', 'sahm', 'easyrecipe', 'makebreakfastwithme', 'bluecollarlife', 'fypã‚·', 'fypã‚·ã‚šviral', 'bluecollarwife']",age1
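
The per-group loop above can be condensed with `groupby(...).sample(...)`, available in pandas 1.1 and later. The sketch below uses a small toy frame (illustrative data only) to show that every group ends up with the same number of rows; it is not guaranteed to select the same rows as the loop in the notebook.

```python
import pandas as pd

# Toy frame with unequal group sizes (illustrative data only).
toy = pd.DataFrame({
    "age_group": ["age1"] * 5 + ["age2"] * 8 + ["age3"] * 3,
    "view_count": range(16),
})

# Draw the same number of rows from every group; random_state makes the draw repeatable.
balanced = toy.groupby("age_group").sample(n=3, random_state=42)
group_sizes = balanced["age_group"].value_counts().to_dict()
print(group_sizes)
```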


In [113]:
### Export dataset

data1.to_csv("data/data1.csv", index=False)

`Section 3: Engagement metric`

In [14]:
### Import relevant library and dataset

import pandas as pd

data1 = pd.read_csv("data/data1.csv")
data1.head()

Unnamed: 0,username,like_count,comment_count,share_count,view_count,hashtags,age_group
0,anon_0001,76,34,0,656,"['momlife', 'firsttimemom', 'sahm', 'newmom', 'roadto10k', 'motherhoodunplugged', 'momlifebelike', 'momtok', 'momsoftiktok', 'pregnanttiktok', 'growingmyaccount', 'motherhoodjourney', 'fypã‚·ã‚šviral', 'motherhoodunfiltered', 'scrunchymama', 'momlifevibes', 'pregnanttok']",age1
1,anon_0002,32,4,0,1010,"['pregnancy', 'ultrasound', 'sahm', 'momsoftiktok', 'momtoks', 'relatablemomcontent', 'sahmsoftiktok', 'pregnancytiktok', 'momtoktakeover', 'momtoktakeover2024']",age1
2,anon_0003,1086,74,85,50305,"['pregnancy', 'pregnant', 'ttc', 'momlife', 'fyp', 'stayathomemom', 'sahm', 'newmom', '23weekspregnant', 'firstpregnancy', 'fertility', 'foryoupage']",age1
3,anon_0004,2088,24,5,28514,"['pregnant', 'dinnertime', 'familyof5', 'onabudget', 'whatsfordinner', 'sahm', 'lowincome', 'momof3']",age1
4,anon_0005,165,17,3,3012,"['breakfast', 'earlymorning', 'pregnant', 'fy', 'fyp', 'CookWithMe', 'bluecollar', 'sahm', 'easyrecipe', 'makebreakfastwithme', 'bluecollarlife', 'fypã‚·', 'fypã‚·ã‚šviral', 'bluecollarwife']",age1


In [115]:
### Create engagement metric variable

data1['engagement'] = (data1['like_count'] + data1['comment_count'] + data1['share_count']) / data1['view_count']
# data1 = data1.drop(columns=['like_count', 'comment_count', 'share_count', 'view_count'])
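
As a worked check of the engagement formula, the first row of the sample shown above (76 likes, 34 comments, 0 shares, 656 views) gives (76 + 34 + 0) / 656 ≈ 0.1677. The helper name `engagement_rate` below is hypothetical; the arithmetic is the same as in the cell above.

```python
def engagement_rate(likes, comments, shares, views):
    """(likes + comments + shares) / views, the ratio used for the engagement column."""
    return (likes + comments + shares) / views

# First row of the sample: 76 likes, 34 comments, 0 shares, 656 views.
rate = engagement_rate(76, 34, 0, 656)
print(round(rate, 6))
```

Division by zero cannot occur here because Section 1 already removed videos with `view_count` equal to 0.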

In [116]:
### Reorder columns

cols = list(data1.columns)
if 'create_time' in cols:
    idx = cols.index('create_time') + 1
    cols.insert(idx, cols.pop(cols.index('engagement')))
data1 = data1[cols]

In [120]:
### Export dataset

data1.to_csv("data/data2.csv", index=False)

`Section 4: Relatability`

In [121]:
### Import relevant libraries

import os
import pandas as pd
from transformers import pipeline

In [15]:
### Import dataset

data2 = pd.read_csv("data/data2.csv")
data2.head()

Unnamed: 0,username,engagement,like_count,comment_count,share_count,view_count,hashtags,age_group
0,anon_0001,0.167683,76,34,0,656,"['momlife', 'firsttimemom', 'sahm', 'newmom', 'roadto10k', 'motherhoodunplugged', 'momlifebelike', 'momtok', 'momsoftiktok', 'pregnanttiktok', 'growingmyaccount', 'motherhoodjourney', 'fypã‚·ã‚šviral', 'motherhoodunfiltered', 'scrunchymama', 'momlifevibes', 'pregnanttok']",age1
1,anon_0002,0.035644,32,4,0,1010,"['pregnancy', 'ultrasound', 'sahm', 'momsoftiktok', 'momtoks', 'relatablemomcontent', 'sahmsoftiktok', 'pregnancytiktok', 'momtoktakeover', 'momtoktakeover2024']",age1
2,anon_0003,0.024749,1086,74,85,50305,"['pregnancy', 'pregnant', 'ttc', 'momlife', 'fyp', 'stayathomemom', 'sahm', 'newmom', '23weekspregnant', 'firstpregnancy', 'fertility', 'foryoupage']",age1
3,anon_0004,0.074244,2088,24,5,28514,"['pregnant', 'dinnertime', 'familyof5', 'onabudget', 'whatsfordinner', 'sahm', 'lowincome', 'momof3']",age1
4,anon_0005,0.061421,165,17,3,3012,"['breakfast', 'earlymorning', 'pregnant', 'fy', 'fyp', 'CookWithMe', 'bluecollar', 'sahm', 'easyrecipe', 'makebreakfastwithme', 'bluecollarlife', 'fypã‚·', 'fypã‚·ã‚šviral', 'bluecollarwife']",age1


In [123]:
### Load model for relatability

print("Loading relatability scoring model...")
relatability_classifier = pipeline("zero-shot-classification", model="typeform/distilbert-base-uncased-mnli", truncation=True)

Loading relatability scoring model...


The `xla_device` argument has been deprecated in v4.4.0 of Transformers. It is ignored and you can safely remove it from your `config.json` file.
Device set to use mps:0


In [53]:
### Apply to dataset

def analyze_relatability(data, output_file, batch_size=500):
    
    columns = list(data.columns) + ['relatability_label', 'relatability_score']
    
    if os.path.exists(output_file):
        with open(output_file, encoding='utf-8') as f: #close the file handle after counting
            processed_rows = sum(1 for _ in f) - 1
        print(f"Resuming from row {processed_rows}...")
    else:
        pd.DataFrame(columns=columns).to_csv(output_file, index=False, encoding='utf-8')
        processed_rows = 0
        print("Starting fresh...")

    num_batches = (len(data) // batch_size) + 1
    start_batch = processed_rows // batch_size

    candidate_labels = ["relatable content", "personal experience", "shared experience", "everyday life situation", "common problem"]

    for i in range(start_batch, num_batches):
        start_idx = i * batch_size
        end_idx = min((i + 1) * batch_size, len(data))
        batch = data.iloc[start_idx:end_idx].copy()

        if batch.empty:
            continue

        texts = batch['video_description'].fillna("").astype(str).tolist() #missing descriptions become "" rather than the string "nan"
        batch_results = []
        for text in texts:
            if text.strip() == "":
                batch_results.append({"relatability_label": "none", "relatability_score": 0.0})
            else:
                try:
                    result = relatability_classifier(str(text), candidate_labels)
                    relatability_score = max(result['scores'])
                    top_label_idx = result['scores'].index(relatability_score)
                    top_label = result['labels'][top_label_idx]
                    batch_results.append({"relatability_label": top_label, "relatability_score": relatability_score})
                except Exception as e:
                    print(f"Error analyzing relatability for text: {e}")
                    batch_results.append({"relatability_label": "none", "relatability_score": 0.0})
        
        batch['relatability_label'] = [r['relatability_label'] for r in batch_results]
        batch['relatability_score'] = [r['relatability_score'] for r in batch_results]
        batch.to_csv(output_file, mode='a', header=False, index=False, encoding='utf-8')
        print(f"Finished batch {i+1}/{num_batches} ({end_idx}/{len(data)} rows)")
    print("All batches processed and saved to:", output_file)
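
For interpreting `relatability_score`: assuming the zero-shot pipeline runs in its default single-label mode (`multi_label=False`), the scores it returns are normalized across the candidate labels and sum to 1. The maximum score therefore says which of the five labels dominates for a description, and it can never fall below 1/5. The sketch below illustrates this with a plain softmax over made-up logits; it does not call the model.

```python
import math

def softmax(values):
    """Normalize raw scores into probabilities that sum to 1."""
    shifted = [v - max(values) for v in values]  # subtract the max for numerical stability
    exps = [math.exp(v) for v in shifted]
    total = sum(exps)
    return [e / total for e in exps]

labels = ["relatable content", "personal experience", "shared experience",
          "everyday life situation", "common problem"]
# Made-up entailment logits for one description (not real model output).
logits = [2.0, 0.5, 0.1, -0.3, -1.0]

scores = softmax(logits)
top_score = max(scores)
top_label = labels[scores.index(top_score)]
print(top_label, round(sum(scores), 6), top_score >= 1 / len(labels))
```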

In [124]:
### Test data

test_data = data2.head(10)
analyze_relatability(test_data, "data/test_relatability.csv")

Starting fresh...
Finished batch 1/1 (10/10 rows)
All batches processed and saved to: data/test_relatability.csv
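
One caveat about the resume logic: it counts physical lines in the output file. If a text field such as `video_description` contains an embedded newline, one record spans several lines and the line count overstates the number of processed rows. The sketch below, using an in-memory CSV with illustrative values, contrasts the line-based count with a record-based count from `csv.reader`.

```python
import csv
import io

# In-memory CSV with a header and two records; the second description spans two lines.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["username", "video_description"])
writer.writerow(["anon_0001", "morning routine"])
writer.writerow(["anon_0002", "day in the life\npart 2"])
text = buffer.getvalue()

# Line-based count (what the resume logic does) versus record-based count.
physical_lines = len(text.splitlines()) - 1
records = sum(1 for _ in csv.reader(io.StringIO(text, newline=""))) - 1
print(physical_lines, records)
```

When every run starts fresh, as in the logs here, the difference does not matter; it only affects resuming after an interruption.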


In [130]:
### Full data

analyze_relatability(data2, "data/data3.csv")

Starting fresh...
Finished batch 1/11 (500/5000 rows)
Finished batch 2/11 (1000/5000 rows)
Finished batch 3/11 (1500/5000 rows)
Finished batch 4/11 (2000/5000 rows)
Finished batch 5/11 (2500/5000 rows)
Finished batch 6/11 (3000/5000 rows)
Finished batch 7/11 (3500/5000 rows)
Finished batch 8/11 (4000/5000 rows)
Finished batch 9/11 (4500/5000 rows)
Finished batch 10/11 (5000/5000 rows)
All batches processed and saved to: data/data3.csv
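
The log above reports 11 batches although only 10 are processed. This follows from `(len(data) // batch_size) + 1`, which adds one batch even when the row count is an exact multiple of the batch size; the extra batch is empty and skipped by the `if batch.empty` check. A small sketch comparing that expression with ceiling division (function names are hypothetical):

```python
import math

def batch_count_floor_plus_one(n_rows, batch_size):
    """Count used in the loop above: always one more than the floor division."""
    return (n_rows // batch_size) + 1

def batch_count_ceil(n_rows, batch_size):
    """Ceiling division: the number of non-empty batches."""
    return math.ceil(n_rows / batch_size)

print(batch_count_floor_plus_one(5000, 500), batch_count_ceil(5000, 500))
print(batch_count_floor_plus_one(10, 500), batch_count_ceil(10, 500))
```

The two expressions agree whenever the row count is not a multiple of the batch size, so the results are unaffected; only the denominator in the progress message differs.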


`Section 5: Emotion`

In [131]:
### Import relevant libraries

import os
import pandas as pd
from transformers import pipeline 

In [16]:
### Import dataset

data3 = pd.read_csv("data/data3.csv")
data3.head()

Unnamed: 0,username,engagement,like_count,comment_count,share_count,view_count,hashtags,age_group,relatability_label,relatability_score
0,anon_0001,0.167683,76,34,0,656,"['momlife', 'firsttimemom', 'sahm', 'newmom', 'roadto10k', 'motherhoodunplugged', 'momlifebelike', 'momtok', 'momsoftiktok', 'pregnanttiktok', 'growingmyaccount', 'motherhoodjourney', 'fypã‚·ã‚šviral', 'motherhoodunfiltered', 'scrunchymama', 'momlifevibes', 'pregnanttok']",age1,relatable content,0.955621
1,anon_0002,0.035644,32,4,0,1010,"['pregnancy', 'ultrasound', 'sahm', 'momsoftiktok', 'momtoks', 'relatablemomcontent', 'sahmsoftiktok', 'pregnancytiktok', 'momtoktakeover', 'momtoktakeover2024']",age1,relatable content,0.981253
2,anon_0003,0.024749,1086,74,85,50305,"['pregnancy', 'pregnant', 'ttc', 'momlife', 'fyp', 'stayathomemom', 'sahm', 'newmom', '23weekspregnant', 'firstpregnancy', 'fertility', 'foryoupage']",age1,relatable content,0.878635
3,anon_0004,0.074244,2088,24,5,28514,"['pregnant', 'dinnertime', 'familyof5', 'onabudget', 'whatsfordinner', 'sahm', 'lowincome', 'momof3']",age1,relatable content,0.967317
4,anon_0005,0.061421,165,17,3,3012,"['breakfast', 'earlymorning', 'pregnant', 'fy', 'fyp', 'CookWithMe', 'bluecollar', 'sahm', 'easyrecipe', 'makebreakfastwithme', 'bluecollarlife', 'fypã‚·', 'fypã‚·ã‚šviral', 'bluecollarwife']",age1,relatable content,0.526099


In [137]:
### Load pre-trained emotion classification model

print("Loading emotion classification model...")
emotion_classifier = pipeline("text-classification", model="cirimus/modernbert-base-go-emotions", top_k=None, truncation=True)

Loading emotion classification model...


Device set to use mps:0


In [138]:
### Apply to dataset

emotion_labels = ["admiration","amusement","anger","annoyance","approval","caring","confusion", "curiosity","desire","disappointment",
    "disapproval","disgust","embarrassment", "excitement","fear","gratitude","grief","joy","love","nervousness","optimism",
    "pride","realization","relief","remorse","sadness","surprise","neutral"]

def analyze_emotion(data, output_file, batch_size=500):
    columns = list(data.columns) + emotion_labels
    if os.path.exists(output_file):
        with open(output_file, encoding='utf-8') as f: #close the file handle after counting
            processed_rows = sum(1 for _ in f) - 1
        print(f"Resuming from row {processed_rows}...", flush=True)
    else:
        pd.DataFrame(columns=columns).to_csv(output_file, index=False, encoding='utf-8')
        processed_rows = 0
        print("Starting fresh...", flush=True)
    num_batches = (len(data) // batch_size) + 1
    start_batch = processed_rows // batch_size
    for i in range(start_batch, num_batches):
        start_idx = i * batch_size
        end_idx = min((i + 1) * batch_size, len(data))
        batch = data.iloc[start_idx:end_idx].copy()
        if batch.empty:
            continue
        texts = batch['video_description'].astype(str).tolist()
        all_results = emotion_classifier(texts)
        rows = []
        for result in all_results:
            row = {label:0.0 for label in emotion_labels}
            for e in result:
                row[e["label"]] = e["score"]
            rows.append(row)
        for label in emotion_labels:
            batch[label] = [r[label] for r in rows]
        batch.to_csv(output_file, mode='a', header=False, index=False, encoding='utf-8')
        print(f"Finished batch {i+1}/{num_batches} ({end_idx}/{len(data)} rows)", flush=True)
    print("All batches processed and saved to:", output_file, flush=True)
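
The inner loop of `analyze_emotion` turns the per-text classifier output into one column per emotion. The sketch below replays that step on a hand-written stand-in for the output, assuming (as the loop does) that with `top_k=None` each input yields a list of `{"label", "score"}` dictionaries. The shortened label list and the scores are illustrative, not real model output.

```python
# Shortened label list; the notebook uses all 28 emotion labels.
emotion_labels = ["joy", "sadness", "neutral"]

# Hand-written stand-in for the classifier output for two texts (not real scores).
all_results = [
    [{"label": "neutral", "score": 0.7}, {"label": "joy", "score": 0.2}, {"label": "sadness", "score": 0.1}],
    [{"label": "joy", "score": 0.9}, {"label": "neutral", "score": 0.1}],
]

rows = []
for result in all_results:
    row = {label: 0.0 for label in emotion_labels}  # default for labels missing from the output
    for entry in result:
        row[entry["label"]] = entry["score"]
    rows.append(row)

print(rows[1])
```

Starting each row from zeros keeps the column set fixed even if a label is absent from a result.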

In [142]:
### Test data

test_data = data3.head(10)
analyze_emotion(test_data, "data/test_emotion.csv")

Starting fresh...
Finished batch 1/1 (10/10 rows)
All batches processed and saved to: data/test_emotion.csv


In [143]:
### Full data (took 344m 0.4s --> 5.73 hours)

analyze_emotion(data3, "data/data4.csv")

Starting fresh...
Finished batch 1/11 (500/5000 rows)
Finished batch 2/11 (1000/5000 rows)
Finished batch 3/11 (1500/5000 rows)
Finished batch 4/11 (2000/5000 rows)
Finished batch 5/11 (2500/5000 rows)
Finished batch 6/11 (3000/5000 rows)
Finished batch 7/11 (3500/5000 rows)
Finished batch 8/11 (4000/5000 rows)
Finished batch 9/11 (4500/5000 rows)
Finished batch 10/11 (5000/5000 rows)
All batches processed and saved to: data/data4.csv


`Section 6: Variable selection`

In [7]:
### Import relevant libraries

import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

In [18]:
### Import dataset

data4 = pd.read_csv("data/data4.csv")
data4.head()

Unnamed: 0,username,engagement,like_count,comment_count,share_count,view_count,hashtags,age_group,relatability_label,relatability_score,...,love,nervousness,optimism,pride,realization,relief,remorse,sadness,surprise,neutral
0,anon_0001,0.167683,76,34,0,656,"['momlife', 'firsttimemom', 'sahm', 'newmom', 'roadto10k', 'motherhoodunplugged', 'momlifebelike', 'momtok', 'momsoftiktok', 'pregnanttiktok', 'growingmyaccount', 'motherhoodjourney', 'fypã‚·ã‚šviral', 'motherhoodunfiltered', 'scrunchymama', 'momlifevibes', 'pregnanttok']",age1,relatable content,0.955621,...,0.002649,0.000209,0.005396,0.000315,0.009154,0.000232,0.000669,0.002343,0.004203,0.681188
1,anon_0002,0.035644,32,4,0,1010,"['pregnancy', 'ultrasound', 'sahm', 'momsoftiktok', 'momtoks', 'relatablemomcontent', 'sahmsoftiktok', 'pregnancytiktok', 'momtoktakeover', 'momtoktakeover2024']",age1,relatable content,0.981253,...,0.001729,0.000284,0.001688,0.000724,0.008073,0.00048,0.000604,0.002883,0.003812,0.961542
2,anon_0003,0.024749,1086,74,85,50305,"['pregnancy', 'pregnant', 'ttc', 'momlife', 'fyp', 'stayathomemom', 'sahm', 'newmom', '23weekspregnant', 'firstpregnancy', 'fertility', 'foryoupage']",age1,relatable content,0.878635,...,0.001891,0.000313,0.001142,0.000189,0.009771,0.000358,0.00183,0.015346,0.006977,0.481108
3,anon_0004,0.074244,2088,24,5,28514,"['pregnant', 'dinnertime', 'familyof5', 'onabudget', 'whatsfordinner', 'sahm', 'lowincome', 'momof3']",age1,relatable content,0.967317,...,0.003766,0.000393,0.002119,0.003368,0.022148,0.0083,0.001892,0.008402,0.000545,0.120321
4,anon_0005,0.061421,165,17,3,3012,"['breakfast', 'earlymorning', 'pregnant', 'fy', 'fyp', 'CookWithMe', 'bluecollar', 'sahm', 'easyrecipe', 'makebreakfastwithme', 'bluecollarlife', 'fypã‚·', 'fypã‚·ã‚šviral', 'bluecollarwife']",age1,relatable content,0.526099,...,0.097502,0.000734,0.002674,0.000243,0.014961,0.001394,0.030267,0.120188,0.002904,0.315595


In [9]:
### Basic info

print("Number of videos:", len(data4))

Number of videos: 5000


In [31]:
### Multicollinearity

def calc_vif(df, cols):
    x=df[cols].assign(const=1)
    return pd.DataFrame({
        "variable":cols,
        "vif":[variance_inflation_factor(x.values,i) for i in range(len(cols))]
    })

num = data4.select_dtypes(include=["number"])
calc_vif(num, num.columns).sort_values("vif", ascending=False)

Unnamed: 0,variable,vif
2,like_count,6.838414
5,view_count,6.711678
3,comment_count,1.694928
4,share_count,1.135448
0,id,1.007526
12,caring,1.001206
17,disapproval,1.000685
11,approval,1.000507
7,admiration,1.000492
15,desire,1.000229
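
To make the VIF values above easier to read: the VIF of a column is 1 / (1 − R²), where R² comes from regressing that column on all other columns plus an intercept. The sketch below computes this directly with NumPy on synthetic data (the column names and numbers are invented for illustration); it is an independent re-derivation, not the statsmodels implementation.

```python
import numpy as np

def vif_by_regression(X):
    """VIF_j = 1 / (1 - R^2_j), regressing column j on the other columns plus an intercept."""
    n_rows, n_cols = X.shape
    vifs = []
    for j in range(n_cols):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        design = np.column_stack([others, np.ones(n_rows)])
        coef, *_ = np.linalg.lstsq(design, y, rcond=None)
        residuals = y - design @ coef
        r_squared = 1.0 - residuals.var() / y.var()
        vifs.append(1.0 / (1.0 - r_squared))
    return vifs

rng = np.random.default_rng(0)
views = rng.normal(size=500)
likes = views + 0.3 * rng.normal(size=500)  # strongly tied to views (synthetic)
shares = rng.normal(size=500)               # unrelated column (synthetic)
vifs = vif_by_regression(np.column_stack([likes, views, shares]))
print([round(v, 2) for v in vifs])
```

Two columns that largely move together both receive a high VIF, which is the pattern `like_count` and `view_count` show in the table above; dropping one of them is usually enough to bring the other down.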


In [74]:
### Drop high vif columns

data4 = data4.drop('like_count', axis=1)
num = data4.select_dtypes(include=["number"])
calc_vif(num, num.columns).sort_values("vif", ascending=False)

Unnamed: 0,variable,vif
2,comment_count,1.708115
3,share_count,1.138448
0,id,1.007008
11,caring,1.001216
16,disapproval,1.000691
10,approval,1.000513
6,admiration,1.000498
14,desire,1.000233
13,curiosity,1.000184
1,engagement,1.000163


In [21]:
### Drop columns that won't be used in analyses

data4 = data4.drop(['id',  'hashtags', 'video_description', 'relatability_label'], axis=1)
data4.head()

Unnamed: 0,username,engagement,age_group,relatability_score,admiration,amusement,anger,annoyance,approval,caring,...,love,nervousness,optimism,pride,realization,relief,remorse,sadness,surprise,neutral
0,anon_0001,0.167683,age1,0.955621,0.006489,0.002753,0.001264,0.003769,0.012369,0.000699,...,0.002649,0.000209,0.005396,0.000315,0.009154,0.000232,0.000669,0.002343,0.004203,0.681188
1,anon_0002,0.035644,age1,0.981253,0.006649,0.001622,0.00061,0.001963,0.014375,0.000633,...,0.001729,0.000284,0.001688,0.000724,0.008073,0.00048,0.000604,0.002883,0.003812,0.961542
2,anon_0003,0.024749,age1,0.878635,0.001476,0.000802,0.00187,0.004185,0.004589,0.001859,...,0.001891,0.000313,0.001142,0.000189,0.009771,0.000358,0.00183,0.015346,0.006977,0.481108
3,anon_0004,0.074244,age1,0.967317,0.046669,0.000551,0.000666,0.008453,0.015583,0.0035,...,0.003766,0.000393,0.002119,0.003368,0.022148,0.0083,0.001892,0.008402,0.000545,0.120321
4,anon_0005,0.061421,age1,0.526099,0.008951,0.009532,0.003827,0.022288,0.007278,0.009099,...,0.097502,0.000734,0.002674,0.000243,0.014961,0.001394,0.030267,0.120188,0.002904,0.315595


In [34]:
### Export dataset

data4.to_csv("data/finalD.csv", index=False)