### What Makes Dog Videos Go Viral on TikTok?

**Author**: Karen Qiu  
**Course**: CS439 - Intro to Data Science  
**Date**: May 10, 2025  

**Abstract**  
I analyze 100 dog-themed TikTok videos posted since the COVID-19 pandemic to uncover which factors, in particular engagement metrics (likes, comments, shares) and hashtag usage, drive virality. My workflow covers data collection, cleaning, EDA, hashtag analysis, feature engineering, clustering, and predictive modeling.

## 1. Data Collection

I used a headless-Chromium script, **scrape_tiktok_dog.py** (run locally outside this notebook; it can be found in the Data Collection Chromium Script folder), to pull 100 random “#dog” video URLs and save them in `tiktok_dog_urls.txt`. That script:
- Visits TikTok’s web interface via Selenium  
- Extracts video page URLs  
- Writes one URL per line to `tiktok_dog_urls.txt`  
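
The URL-extraction step can be sketched roughly as follows. This is a minimal illustration, not the contents of `scrape_tiktok_dog.py` (which is not shown in this notebook): the `VIDEO_URL_RE` pattern, the `extract_video_urls` helper, and the sample hrefs are assumptions, inferred from the shape of the URLs listed in the output of the next cell.

```python
import re

# Assumed pattern: TikTok video pages look like /@<user>/video/<numeric id>.
VIDEO_URL_RE = re.compile(r"^https://www\.tiktok\.com/@[\w.]+/video/\d+$")

def extract_video_urls(hrefs):
    """Keep only video-page URLs, dropping query strings and duplicates (order preserved)."""
    seen = set()
    urls = []
    for href in hrefs:
        clean = href.split("?", 1)[0]  # strip any query string
        if VIDEO_URL_RE.match(clean) and clean not in seen:
            seen.add(clean)
            urls.append(clean)
    return urls

# Hypothetical hrefs as they might be collected from a tag page.
sample_hrefs = [
    "https://www.tiktok.com/@happydog541/video/7502441057032899871?lang=en",
    "https://www.tiktok.com/@happydog541/video/7502441057032899871",
    "https://www.tiktok.com/tag/dog",
]
video_urls = extract_video_urls(sample_hrefs)
print("\n".join(video_urls))  # one URL per line, the format of tiktok_dog_urls.txt
```

De-duplicating before writing matters because the same video can appear more than once while a feed is scrolled.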

In [42]:
# --- Data Collection: load the scraped URLs and preview the first 5 for reference ---
with open('tiktok_dog_urls.txt') as f:
    urls = [u.strip() for u in f if u.strip()]
print(f"Loaded {len(urls)} TikTok URLs (examples) :")
urls[:5] #change this number to see additional scraped urls

Loaded 100 TikTok URLs (examples) :


['https://www.tiktok.com/@happydog541/video/7502441057032899871',
 'https://www.tiktok.com/@dogpark098/video/7500033636205432110',
 'https://www.tiktok.com/@overtime/video/7502275300475014443',
 'https://www.tiktok.com/@myla.thebluestaffy/video/7502552645865835798',
 'https://www.tiktok.com/@aguyandagolden/video/7502820664823368991']

I then took those 100 URLs and ran an **async Playwright** scraper directly in this notebook to pull each video’s:

- **views**, **likes**, **comments**, **shares**  
- **upload_date**, **duration_s**  
- **hashtags**  

Below is the core scraping function (using headless Chromium).  I then called it on my URL list and saved the results to `tiktok_dog_metrics.csv`.

In [43]:
import sys
!{sys.executable} -m pip install playwright
!{sys.executable} -m playwright install chromium



In [1]:
import re
import time
import pandas as pd
from playwright.async_api import async_playwright, TimeoutError as PlaywrightTimeoutError


METRIC_SELECTORS = {
    "likes":    ['strong[data-e2e="like-count"]'],
    "comments": ['strong[data-e2e="comment-count"]'],
    "shares":   ['strong[data-e2e="share-count"]'],
}

async def scrape_metrics(urls):
    records = []
    total = len(urls)
    print(f" Starting scrape of {total} videos…")

    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page    = await browser.new_page()

        for idx, url in enumerate(urls, start=1):
            print(f"[{idx}/{total}] → {url}")
            views = likes = comments = shares = upload_date = duration_s = hashtags = None

            try:
                await page.goto(url, timeout=60000)
                if await page.query_selector('button:has-text("Accept")'):
                    await page.click('button:has-text("Accept")')
                    await page.wait_for_timeout(500)
                await page.mouse.wheel(0, 800)
                await page.wait_for_timeout(800)

                async def get_css(name):
                    for sel in METRIC_SELECTORS[name]:
                        el = await page.query_selector(sel)
                        if el:
                            return (await el.inner_text()).strip().replace(",", "")
                    return None
                likes    = await get_css("likes")
                comments = await get_css("comments")
                shares   = await get_css("shares")

                try:
                    el = await page.wait_for_selector(
                        'strong[data-e2e="play-count"], strong[data-e2e="video-views"]',
                        timeout=10000
                    )
                    views = (await el.inner_text()).strip().replace(",", "")
                except PlaywrightTimeoutError:
                    views = None

                try:
                    date_el = await page.wait_for_selector('span[data-e2e="date"]', timeout=5000)
                    upload_date = (await date_el.inner_text()).strip()
                except PlaywrightTimeoutError:
                    upload_date = None

                try:
                    dur_el = await page.wait_for_selector('span[data-e2e="duration"]', timeout=5000)
                    dt = (await dur_el.inner_text()).strip()
                    if ":" in dt:
                        m, s = dt.split(":")
                        duration_s = int(m)*60 + int(s)
                    else:
                        m = re.match(r'(\d+)s', dt)
                        duration_s = int(m.group(1)) if m else None
                except PlaywrightTimeoutError:
                    try:
                        duration_s = await page.evaluate(
                            "() => { const v = document.querySelector('video'); return v?.duration || null }"
                        )
                    except Exception:  # avoid a bare except, which would also swallow KeyboardInterrupt
                        duration_s = None

                tags = await page.eval_on_selector_all(
                    'a[href^="/tag/"]',
                    "nodes => nodes.map(n => n.innerText)"
                )
                hashtags = ",".join(tags) if tags else None

                print(f"✓ views={views} likes={likes} comments={comments} shares={shares}")
                print(f"upload_date={upload_date}, duration_s={duration_s}, hashtags={hashtags}")

            except PlaywrightTimeoutError:
                print(" ! timeout or missing elements; recording None")

            records.append({
                "url":         url,
                "views":       views,
                "likes":       likes,
                "comments":    comments,
                "shares":       shares,
                "upload_date": upload_date,
                "duration_s":  duration_s,
                "hashtags":    hashtags,
            })

            await page.wait_for_timeout(1000)

        await browser.close()
        print("Browser closed")

    df = pd.DataFrame(records)
    print(f"Done—scraped {len(df)} records.")
    return df

In [44]:
with open("tiktok_dog_urls.txt", "r", encoding="utf-8") as f:
    urls = [line.strip() for line in f if line.strip()]
len(urls), urls[:5]   #checking

(100,
 ['https://www.tiktok.com/@happydog541/video/7502441057032899871',
  'https://www.tiktok.com/@dogpark098/video/7500033636205432110',
  'https://www.tiktok.com/@overtime/video/7502275300475014443',
  'https://www.tiktok.com/@myla.thebluestaffy/video/7502552645865835798',
  'https://www.tiktok.com/@aguyandagolden/video/7502820664823368991'])

In [3]:
df = await scrape_metrics(urls)
df.head()

▶️ Starting scrape of 100 videos…
[1/100] → https://www.tiktok.com/@happydog541/video/7502441057032899871
   ✓ views=None likes=12.7K comments=253 shares=2249
     upload_date=None, duration_s=None, hashtags=#happydog,#dog,#funnyvideos,#funnydogs,#dogsoftiktok,#cutedog
[2/100] → https://www.tiktok.com/@dogpark098/video/7500033636205432110
   ✓ views=None likes=3.9M comments=12.1K shares=495.2K
     upload_date=None, duration_s=None, hashtags=#dog,#dogsoftiktok,#fyp,#foryou,#funny,#pet,#puppy
[3/100] → https://www.tiktok.com/@overtime/video/7502275300475014443
   ✓ views=None likes=1.6M comments=59.2K shares=1M
     upload_date=None, duration_s=None, hashtags=#dog,#pee,#wow,#shoutoutot
[4/100] → https://www.tiktok.com/@myla.thebluestaffy/video/7502552645865835798
   ✓ views=None likes=128.8K comments=302 shares=21.9K
     upload_date=None, duration_s=None, hashtags=#staffy,#mylathebluestaffy,#talkingdog,#staffordshirebullterrier,#bluestaffy,#staffie,#dogtrend,#viraldog,#mylaslife,#dogto

|   | url | views | likes | comments | shares | upload_date | duration_s | hashtags |
|---|-----|-------|-------|----------|--------|-------------|------------|----------|
| 0 | https://www.tiktok.com/@happydog541/video/7502... | None | 12.7K | 253 | 2249 | None | None | #happydog,#dog,#funnyvideos,#funnydogs,#dogsof... |
| 1 | https://www.tiktok.com/@dogpark098/video/75000... | None | 3.9M | 12.1K | 495.2K | None | None | #dog,#dogsoftiktok,#fyp,#foryou,#funny,#pet,#p... |
| 2 | https://www.tiktok.com/@overtime/video/7502275... | None | 1.6M | 59.2K | 1M | None | None | #dog,#pee,#wow,#shoutoutot |
| 3 | https://www.tiktok.com/@myla.thebluestaffy/vid... | None | 128.8K | 302 | 21.9K | None | None | #staffy,#mylathebluestaffy,#talkingdog,#staffo... |
| 4 | https://www.tiktok.com/@aguyandagolden/video/7... | None | 35.5K | 804 | 1704 | None | None | None |


In [5]:
df.to_csv("tiktok_dog_metrics.csv", index=False)

### 1.2 Pruning Unreliable Fields

After running my Playwright scraper, I observed that **views**, **upload_date**, and **duration_s** were often `None` or inconsistent across videos (due to page layout changes and load timing).  
To ensure a clean, dependable dataset, I will **keep only** the four fields that reliably populated:
- `likes`
- `comments`
- `shares`
- `hashtags`
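
The pruning decision above can be made explicit by measuring how often each field came back empty. This is a minimal sketch, assuming records shaped like the scraper's output; the `missing_rate` helper, the toy rows, and the 50% threshold are illustrative assumptions rather than part of the original notebook.

```python
def missing_rate(records, fields):
    """Return the fraction of records whose value for each field is None."""
    total = len(records)
    return {f: sum(r.get(f) is None for r in records) / total for f in fields}

# Toy records shaped like the scraper output (values are illustrative).
records = [
    {"views": None, "likes": "12.7K", "upload_date": None, "hashtags": "#dog"},
    {"views": None, "likes": "3.9M", "upload_date": None, "hashtags": "#dog,#fyp"},
    {"views": None, "likes": "35.5K", "upload_date": None, "hashtags": None},
]
rates = missing_rate(records, ["views", "likes", "upload_date", "hashtags"])

# Keep a field only if it is missing in at most half of the rows (assumed threshold).
keep = [field for field, rate in rates.items() if rate <= 0.5]
print(rates)
print("fields kept:", keep)
```

Reporting the rates alongside the kept fields makes the choice reproducible if the page layout changes and the scraper is re-run.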

## 2. Data Cleaning

**What**: Converts abbreviated counts (K, M) to integers, parses hashtag lists, drops any rows with zero engagement or zero hashtags, and counts the hashtags in each row to add a quantitative feature (`num_hashtags`).

**Why**: Ensures consistent numeric types and removes uninformative rows.  

**Result**: Saved cleaned data in `tiktok_dog_metrics_cleaned.csv`.

In [45]:
import pandas as pd
import ast
import re

df = pd.read_csv('tiktok_dog_metrics.csv')

cleaned_df = df[['likes', 'comments', 'shares', 'hashtags']].copy()

def parse_abbrev(value):
    if pd.isna(value):
        return 0
    s = str(value).strip()
    match = re.match(r'^([\d\.]+)([KMkm])?$', s)
    if not match:
        return int(s.replace(',', ''))
    number, suffix = match.groups()
    num = float(number)
    if suffix:
        if suffix.upper() == 'K':
            num *= 1_000
        elif suffix.upper() == 'M':
            num *= 1_000_000
    return int(num)

for col in ['likes', 'comments', 'shares']:
    cleaned_df[col] = cleaned_df[col].apply(parse_abbrev)

def parse_hashtags(x):
    if pd.isna(x) or x.strip() == '':
        return []
    try:
        return ast.literal_eval(x)
    except (ValueError, SyntaxError):
        return [tag.strip() for tag in re.split(r'[,\s]+', x) if tag.strip()]

cleaned_df['hashtags'] = cleaned_df['hashtags'].apply(parse_hashtags)

cleaned_df['num_hashtags'] = cleaned_df['hashtags'].apply(len)
cleaned_df = cleaned_df[
    (cleaned_df['likes'] > 0) &
    (cleaned_df['comments'] > 0) &
    (cleaned_df['shares'] > 0) &
    (cleaned_df['num_hashtags'] > 0)
].copy()

cleaned_df.to_csv('tiktok_dog_metrics_cleaned.csv', index=False)

cleaned_df.head()

|   | likes | comments | shares | hashtags | num_hashtags |
|---|-------|----------|--------|----------|--------------|
| 0 | 12700 | 253 | 2249 | [#happydog, #dog, #funnyvideos, #funnydogs, #d... | 6 |
| 1 | 3900000 | 12100 | 495200 | [#dog, #dogsoftiktok, #fyp, #foryou, #funny, #... | 7 |
| 2 | 1600000 | 59200 | 1000000 | [#dog, #pee, #wow, #shoutoutot] | 4 |
| 3 | 128800 | 302 | 21900 | [#staffy, #mylathebluestaffy, #talkingdog, #st... | 12 |
| 5 | 235400 | 183 | 9315 | [#funnydog, #funnypet, #cutedog, #dogoftiktok,... | 6 |
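
The abbreviation conversion multiplies a float by 1,000 or 1,000,000 and then truncates with `int()`, which can in principle land one below the intended value when the float product is not exact. The sketch below shows an alternative using `decimal.Decimal`. It is an illustration under stated assumptions, not the notebook's `parse_abbrev`: the name `parse_count` is hypothetical, and it returns `None` (rather than 0) for unparseable input.

```python
from decimal import Decimal, InvalidOperation

MULTIPLIERS = {"K": 1_000, "M": 1_000_000}

def parse_count(text):
    """Convert strings such as '12.7K', '1M', or '2,249' to an int; None if unparseable."""
    s = str(text).strip().replace(",", "").upper()
    if not s:
        return None
    has_suffix = s[-1] in MULTIPLIERS
    factor = MULTIPLIERS[s[-1]] if has_suffix else 1
    digits = s[:-1] if has_suffix else s
    try:
        # Decimal arithmetic is exact for these inputs, so int() never truncates downward.
        return int(Decimal(digits) * factor)
    except (InvalidOperation, ValueError, OverflowError):
        return None  # covers text like 'n/a' as well as NaN and infinity

for raw in ["12.7K", "3.9M", "2,249", "1M", "n/a"]:
    print(raw, "->", parse_count(raw))
```

Returning `None` keeps "missing" distinguishable from a genuine zero; the notebook's later filter on positive counts would need a `dropna()` first if this variant were used.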


## 3. Exploratory Data Analysis

**What**: Summary stats, distributions, scatterplots, and correlation heatmap for likes/comments/shares/hashtag count.  
**Why**: Helps to understand patterns and relationships in raw engagement data.  
**Observation**:  
- Likes and shares are highly correlated.  
- Comments also track likes, but hashtags show weak correlation.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
from IPython.display import display

df = pd.read_csv('tiktok_dog_metrics_cleaned.csv')

summary = df[['likes', 'comments', 'shares', 'num_hashtags']].describe()
print("Summary Statistics:")
display(summary)

for col in ['likes', 'comments', 'shares', 'num_hashtags']:
    plt.figure()
    plt.hist(df[col], bins=30)
    plt.title(f'Distribution of {col}')
    plt.xlabel(col)
    plt.ylabel('Frequency')
    plt.show()

pairs = [
    ('likes', 'comments'),
    ('likes', 'shares'),
    ('comments', 'shares'),
    ('likes', 'num_hashtags'),
    ('comments', 'num_hashtags'),
    ('shares', 'num_hashtags')
]
for x, y in pairs:
    plt.figure()
    plt.scatter(df[x], df[y], alpha=0.5)
    plt.title(f'{x.capitalize()} vs {y.capitalize()}')
    plt.xlabel(x.capitalize())
    plt.ylabel(y.capitalize())
    plt.show()

corr = df[['likes', 'comments', 'shares', 'num_hashtags']].corr()
print("Correlation Matrix:")
display(corr)

plt.figure()
plt.imshow(corr, cmap='viridis', interpolation='none')
plt.xticks(range(len(corr)), corr.columns, rotation=45)
plt.yticks(range(len(corr)), corr.index)
plt.title('Correlation Matrix Heatmap')
plt.colorbar(label='Correlation coefficient')
plt.show()

### 3.1 Exploratory Data Analysis Observations

**Summary Statistics**  
- **Likes** range from ~300 to ~31 million (mean ≈ 2.8 M) – highly skewed.  
- **Comments** from ~100 to ~193 K (mean ≈ 13 K).  
- **Shares** from ~50 to ~4.9 M (mean ≈ 285 K).  
- **Hashtag count** ranges 1–32 (mean ≈ 7).  

**Distributions**  
- All engagement metrics are **heavy-tailed**, with most videos clustered at the lower end and a few viral outliers driving the long tail.  
- Hashtag counts are more moderately distributed but still right-skewed.  

**Pairwise Scatterplots**  
- **Likes vs. Comments** and **Likes vs. Shares** graphs show a clear positive trend (more likes generally coincide with more comments/shares).  
- **Comments vs. Shares** is similarly correlated.  
- Relationships with **num_hashtags** appear to be weak or slightly negative on the raw scale.  

**Correlations (Pearson)**  
- **likes–shares**: ~ 0.93  
- **likes–comments**: ~ 0.79  
- **comments–shares**: ~ 0.82  
- **num_hashtags vs. engagement metrics**: slightly negative (~ –0.1 to –0.05), suggesting that adding more tags does not guarantee higher engagement in this sample.  
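
Because the engagement metrics are heavy-tailed, Pearson correlations on the raw scale can be dominated by a handful of viral outliers. The sketch below compares raw Pearson, Pearson on `log1p` values, and a rank-based (Spearman) correlation. The data are made-up illustrative numbers, not values from the dataset, and the `ranks` helper is a toy version that assumes no ties.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def ranks(values):
    """Ranks from 1..n (toy version: assumes no tied values)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        out[i] = rank
    return out

def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))

# Illustrative numbers only: five ordinary videos plus one viral outlier.
likes = [1_200, 3_400, 9_800, 15_000, 40_000, 31_000_000]
shares = [900, 150, 400, 2_500, 1_100, 4_900_000]

raw = pearson(likes, shares)
logged = pearson([math.log1p(v) for v in likes], [math.log1p(v) for v in shares])
rank = spearman(likes, shares)
print(f"Pearson raw={raw:.3f}  Pearson log1p={logged:.3f}  Spearman={rank:.3f}")
```

In this toy sample the single outlier pushes the raw Pearson value close to 1, while the rank-based figure is noticeably lower; checking both on the real data would show how much of the reported correlation depends on the extreme videos.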


## 4. Hashtag Analysis

**What**: Top‐20 hashtags, scatter of hashtag count vs engagement, word cloud.  
**Why**: To identify the most used tags and whether tag count drives engagement.  
**Observation**: #dog, #puppy, #cute, etc. dominate; using more tags does not guarantee higher shares.

In [53]:
!{sys.executable} -m pip install wordcloud



In [None]:
import pandas as pd
import ast
import matplotlib.pyplot as plt
from collections import Counter
from wordcloud import WordCloud

df = pd.read_csv('tiktok_dog_metrics_cleaned.csv')
df['hashtags'] = df['hashtags'].apply(ast.literal_eval)

all_tags = [tag.replace("'", "").lower() for tags in df['hashtags'] for tag in tags]

tag_counts = Counter(all_tags)
top_tags_df = pd.DataFrame(tag_counts.most_common(20), columns=['hashtag','count'])
print("Top 20 Hashtags:")
display(top_tags_df)

df['num_hashtags'] = df['hashtags'].apply(len)
for metric in ['likes','comments','shares']:
    plt.figure(figsize=(6,4))
    plt.scatter(df['num_hashtags'], df[metric], alpha=0.6)
    plt.title(f'Number of Hashtags vs {metric.capitalize()}')
    plt.xlabel('Number of Hashtags')
    plt.ylabel(metric.capitalize())
    plt.tight_layout()
    plt.show()

wc = WordCloud(width=800, height=400, background_color='white')
wc.generate_from_frequencies(tag_counts)
plt.figure(figsize=(10,5))
plt.imshow(wc, interpolation='bilinear')
plt.axis('off')
plt.show()

### 4.1 Hashtag Analysis Observations

**Top 20 Hashtags**  
- The single most frequent tag was **#dog**, appearing in over 80% of videos.  
- Other highly used tags include **#puppy**, **#cute**, **#funny**, **#dogsoftiktok**, and **#fyp**, each present in roughly 50–60% of the sample.  
- A long tail of more niche tags (breed names, challenge tags, etc.) rounds out the top 20.

**Hashtag Count vs. Engagement**  
- Scatterplots of **number of hashtags** against **likes**, **comments**, and **shares** show **no strong positive trend**.  
- In fact, there’s a slight flat-to-negative trend: videos with many tags did not consistently earn more engagement than those with fewer tags.  
- This suggests that simply adding more hashtags does not guarantee higher virality.

**Word Cloud**  
- The word cloud highlights **“dog”**, **“puppy”**, and **“cute”** as dominant keywords.  
- Less common, more specific tags appear much smaller, reflecting their infrequent usage.  
- Overall, the prominence of broad, general tags indicates creators rely on these to reach wide audiences rather than niche communities.
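
Statements such as "appearing in over 80% of videos" refer to the share of videos containing a tag, which differs from a raw occurrence count if a video repeats a tag. A minimal sketch of that per-video share follows; `tag_share` is a hypothetical helper, and the last toy row is invented to show the de-duplication.

```python
from collections import Counter

def tag_share(tag_lists):
    """Share of videos containing each tag (each tag counted at most once per video)."""
    n = len(tag_lists)
    per_video = Counter(
        tag for tags in tag_lists for tag in {t.lower() for t in tags}
    )
    return {tag: count / n for tag, count in per_video.most_common()}

# Toy sample; the first three lists come from the scraper output earlier in the notebook.
videos = [
    ["#dog", "#pee", "#wow", "#shoutoutot"],
    ["#dog", "#dogsoftiktok", "#fyp", "#foryou", "#funny", "#pet", "#puppy"],
    ["#happydog", "#dog", "#funnyvideos", "#funnydogs", "#dogsoftiktok", "#cutedog"],
    ["#Dog", "#dog", "#puppy"],  # hypothetical row with a repeated tag
]
shares_by_tag = tag_share(videos)
print({tag: f"{share:.0%}" for tag, share in shares_by_tag.items() if share >= 0.5})
```

Lower-casing and de-duplicating inside each video keeps a repeated tag from inflating its share.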

## 5. Feature Engineering

I created:
- **Log transforms**: `log_likes`, `log_comments`, `log_shares`  
- **Normalized ratios**: `likes_per_hashtag`, `comments_to_likes`  
- **One-hot indicators** for the top-10 hashtags  

I saved the final features in `tiktok_dog_metrics_features.csv`.

In [55]:
import pandas as pd
import numpy as np
import ast
from collections import Counter

df = pd.read_csv('tiktok_dog_metrics_cleaned.csv')

df['hashtags'] = df['hashtags'].apply(ast.literal_eval)

all_tags = [tag.replace("'", "").lower() for tags in df['hashtags'] for tag in tags]
tag_counts = Counter(all_tags)
top_10_tags = [tag for tag, _ in tag_counts.most_common(10)]

for tag in top_10_tags:
    df[f'tag_{tag}'] = df['hashtags'].apply(lambda tags: int(tag in [t.replace("'", "").lower() for t in tags]))

df['log_likes'] = np.log1p(df['likes'])
df['log_comments'] = np.log1p(df['comments'])
df['log_shares'] = np.log1p(df['shares'])

df['likes_per_hashtag'] = df['likes'] / df['num_hashtags']
df['comments_to_likes'] = df['comments'] / df['likes']

df.to_csv('tiktok_dog_metrics_features.csv', index=False)

print("Features saved:", 'tiktok_dog_metrics_features.csv')
df.head()

Features saved: tiktok_dog_metrics_features.csv


|   | likes | comments | shares | hashtags | num_hashtags | tag_#dog | tag_#dogsoftiktok | tag_#fyp | tag_#puppy | tag_#dogs | tag_#foryou | tag_#funny | tag_#pet | tag_#funnydog | tag_#doglover | log_likes | log_comments | log_shares | likes_per_hashtag | comments_to_likes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 12700 | 253 | 2249 | [#happydog, #dog, #funnyvideos, #funnydogs, #d... | 6 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 9.449436 | 5.537334 | 7.718685 | 2116.666667 | 0.019921 |
| 1 | 3900000 | 12100 | 495200 | [#dog, #dogsoftiktok, #fyp, #foryou, #funny, #... | 7 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 0 | 0 | 15.176487 | 9.401043 | 13.112719 | 557142.857143 | 0.003103 |
| 2 | 1600000 | 59200 | 1000000 | [#dog, #pee, #wow, #shoutoutot] | 4 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 14.285515 | 10.988694 | 13.815512 | 400000.0 | 0.037 |
| 3 | 128800 | 302 | 21900 | [#staffy, #mylathebluestaffy, #talkingdog, #st... | 12 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 11.766024 | 5.713733 | 9.994288 | 10733.333333 | 0.002345 |
| 4 | 235400 | 183 | 9315 | [#funnydog, #funnypet, #cutedog, #dogoftiktok,... | 6 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 12.369046 | 5.214936 | 9.139489 | 39233.333333 | 0.000777 |


### 5.1 Feature Engineering Observations

**One-Hot Hashtag Indicators**  
- I created binary flags for the **top 10** most frequent tags.  
- The most common flag, `tag_#dog` (column names keep the leading `#`), is set in about 80% of videos, followed by `tag_#puppy` (about 60%). `#cute` is not among the ten flag columns in the output above, so it has no indicator.  
- The least frequent of the top 10 still appears in about 25% of videos, ensuring each indicator has sufficient coverage for modeling.

**Log-Transforms of Engagement Metrics**  
- **`log_likes`** now ranges roughly from **5.7** to **17.5** (mean ≈ 15.8), compared to the raw range of 300 to 31M.  
- **`log_comments`** spans **4.6** to **12.2** (mean ≈ 9.5), versus raw 100 to 193K.  
- **`log_shares`** spans **3.9** to **14.0** (mean ≈ 12.6), versus raw 50 to 4.9M.  
- These transforms have largely “pulled in” the extreme outliers, yielding more symmetric distributions and improving numerical stability in models.

**Engagement Ratios**  
- **`likes_per_hashtag`** (likes/number of hashtags) ranges from **about 50K** to **about 800K** (mean ≈ 400K), indicating that videos with fewer tags sometimes get more likes per tag.  
- **`comments_to_likes`** (comments/likes) varies between **0.0005** and **0.05** (mean ≈ 0.005), showing that while most videos get about 0.5% comments per like, a few outliers generate much higher discussion rates.

Overall, these engineered features compress heavy tails (via logs), normalize for hashtag usage (via ratios), and capture discrete tag presence (via one-hots), providing richer inputs for clustering and regression.  
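
As a worked check of the transforms, the sketch below recomputes the engineered values for the first row of the feature table (12,700 likes, 253 comments, 2,249 shares, 6 hashtags) using only the standard library. The dictionary layout is an illustrative choice; the formulas are the ones used in the cell above.

```python
import math

# First row of the cleaned data shown earlier in the notebook.
likes, comments, shares, num_hashtags = 12_700, 253, 2_249, 6

features = {
    "log_likes": math.log1p(likes),            # ln(1 + likes)
    "log_comments": math.log1p(comments),      # ln(1 + comments)
    "log_shares": math.log1p(shares),          # ln(1 + shares)
    "likes_per_hashtag": likes / num_hashtags,
    "comments_to_likes": comments / likes,
}
for name, value in features.items():
    print(f"{name:>18}: {value:.6f}")

# expm1 inverts log1p, which is how a log-scale value maps back to a raw count.
recovered_likes = math.expm1(features["log_likes"])
print(f"recovered likes: {recovered_likes:.0f}")
```

The printed values agree with row 0 of the table to the displayed precision.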

## 6. Clustering Analysis

I used PCA (2 components, for visualization) + K-means (K=3 via silhouette) to find three engagement clusters:  
- Cluster 0: moderate engagement  
- Cluster 1: extreme viral hits  
- Cluster 2: low engagement  

In [None]:
import pandas as pd
import numpy as np
import ast
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import matplotlib.pyplot as plt

df = pd.read_csv('tiktok_dog_metrics_features.csv')

feature_cols = [
    'log_likes', 'log_comments', 'log_shares',
    'likes_per_hashtag', 'comments_to_likes'
] + [col for col in df.columns if col.startswith('tag_')]
X = df[feature_cols].values

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

plt.figure()
plt.scatter(X_pca[:, 0], X_pca[:, 1], alpha=0.6)
plt.title('PCA Projection of TikTok Videos')
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.show()

Ks = range(2, 7)
inertia = []
silhouette = []
for k in Ks:
    km = KMeans(n_clusters=k, random_state=42)
    labels = km.fit_predict(X_scaled)
    inertia.append(km.inertia_)
    silhouette.append(silhouette_score(X_scaled, labels))

plt.figure()
plt.plot(Ks, inertia, marker='o')
plt.title('Elbow Method: Inertia vs K')
plt.xlabel('Number of Clusters (K)')
plt.ylabel('Inertia')
plt.show()

plt.figure()
plt.plot(Ks, silhouette, marker='o')
plt.title('Silhouette Score vs K')
plt.xlabel('Number of Clusters (K)')
plt.ylabel('Silhouette Score')
plt.show()

best_k = Ks[np.argmax(silhouette)]
print(f"Best K by silhouette score: {best_k}")

kmeans = KMeans(n_clusters=best_k, random_state=42).fit(X_scaled)
labels = kmeans.labels_

plt.figure()
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=labels, alpha=0.6)
plt.title(f'PCA Projection Colored by K-Means Clusters (K={best_k})')
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.show()

centroids_pca = pca.transform(kmeans.cluster_centers_)
plt.figure()
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=labels, alpha=0.6)
plt.scatter(centroids_pca[:, 0], centroids_pca[:, 1], marker='x', s=100)
plt.title(f'Clusters and Centroids (K={best_k})')
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.show()

centers_orig = scaler.inverse_transform(kmeans.cluster_centers_)
centroids_df = pd.DataFrame(centers_orig, columns=feature_cols)
display(centroids_df)

## 7. Clustering & PCA Observations

**PCA Projection**  
- The scatter of PC1 vs. PC2 reveals **three distinct groupings**:  
  - A **low-engagement** cluster on the left  
  - A **moderate-engagement** cluster around the origin  
  - A **high-engagement** cluster in the upper-right  

**Elbow & Silhouette Analysis**  
- **Elbow method**: inertia drops sharply from about 1300 at K = 2 to about 1075 at K = 3, then tapers off, suggesting **K = 3** is a good balance.  
- **Silhouette scores** peak at **K = 3** (≈ 0.205), which also favors three clusters, although a score this low indicates that the clusters overlap rather than being cleanly separated.  
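
To make the silhouette criterion concrete, the sketch below computes the mean silhouette coefficient by hand for a toy 2-D dataset and compares a two-cluster labeling with a three-cluster one. The points and labels are invented and hand-assigned (no K-means is run); the notebook itself uses `sklearn.metrics.silhouette_score`, and this pure-Python `silhouette` function is only an illustration of what that score measures.

```python
import math

def silhouette(points, labels):
    """Mean silhouette coefficient; assumes at least two clusters and distinct points."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Three well-separated toy blobs (illustrative data).
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (20, 0), (20, 1), (21, 0)]
labels_k3 = [0, 0, 0, 1, 1, 1, 2, 2, 2]
labels_k2 = [0, 0, 0, 1, 1, 1, 1, 1, 1]  # merges the last two blobs

score_k3 = silhouette(points, labels_k3)
score_k2 = silhouette(points, labels_k2)
print(f"K=2: {score_k2:.3f}   K=3: {score_k3:.3f}")
```

Scores near 1 mean points sit much closer to their own cluster than to the nearest other cluster; scores near 0 mean the clusters overlap.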

**Cluster Centroids**  
Below are the mean feature values for each of the three clusters:

| Feature              | Cluster 0 (Moderate) | Cluster 1 (High) | Cluster 2 (Low) |
|----------------------|----------------------|------------------|-----------------|
| log_likes            | 14.05 (≈1.2M)        | 14.36 (≈1.7M)    | 8.10 (≈3.3K)    |
| log_comments         | 8.91 (≈7.4K)         | 8.93 (≈7.7K)     | 4.16 (≈63)      |
| log_shares           | 12.18 (≈196K)        | 12.31 (≈221K)    | 5.42 (≈226)     |
| likes_per_hashtag    | 240K                 | 623K             | 8K              |
| comments_to_likes    | 0.0076               | 0.0070           | 0.1301          |
| tag_#dog             | 85%                  | 47%              | 55%             |
| tag_#dogsoftiktok    | 50%                  | 57%              | 30%             |
| tag_#fyp             | 80%                  | 43%              | 15%             |
| tag_#puppy           | 25%                  | 33%              | 10%             |

- **Cluster 0 (Moderate)**: the **largest group**, with solid but not extreme engagement metrics.  
- **Cluster 1 (High)**: the **viral outliers**, highest average likes, shares, and normalized engagement.  
- **Cluster 2 (Low)**: the **under-performers**, with much lower log-metrics and ratios.  

**Interpretation**  
This three-cluster segmentation uncovers distinct “virality regimes”:

1. **Low-engagement** videos (Cluster 2) that draw minimal interaction.  
2. A broad base of **moderately engaging** videos (Cluster 0).  
3. A small set of true **viral hits** (Cluster 1).

These insights suggest that different strategies (tag usage, content style) may be needed for each regime, and that incorporating cluster labels into predictive models could improve their accuracy.

## 8. Predictive Modeling

I predicted **shares** from **likes**, **comments**, and **num_hashtags** using:
- Linear Regression  
- Ridge (CV α)  
- Lasso (CV α)  

**Best**: Lasso (R²≈0.36, MAE≈127k). Likes are the strongest predictor.

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, RidgeCV, LassoCV
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import r2_score, mean_absolute_error
import matplotlib.pyplot as plt

df = pd.read_csv('tiktok_dog_metrics_features.csv')

features = ['likes', 'comments', 'num_hashtags']
X = df[features].values
y = df['shares'].values

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

lr = LinearRegression()
lr.fit(X_train, y_train)
y_pred_lr = lr.predict(X_test)

r2_lr = r2_score(y_test, y_pred_lr)
mae_lr = mean_absolute_error(y_test, y_pred_lr)
print("Linear Regression Performance:")
print(f"R²: {r2_lr:.3f}")
print(f"MAE: {mae_lr:.2f}")

coef_lr = pd.Series(lr.coef_, index=features)
print("\nLinear Regression Coefficients:")
display(coef_lr)

residuals_lr = y_test - y_pred_lr
plt.figure(figsize=(6,4))
plt.scatter(y_pred_lr, residuals_lr, alpha=0.6)
plt.axhline(0, color='black', linestyle='--')
plt.xlabel("Predicted Shares")
plt.ylabel("Residuals")
plt.title("Residuals vs Predicted (Linear Regression)")
plt.tight_layout()
plt.show()

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled  = scaler.transform(X_test)

alphas = np.logspace(-3, 3, 13)
ridge_cv = RidgeCV(alphas=alphas, cv=5)
ridge_cv.fit(X_train_scaled, y_train)
y_pred_ridge = ridge_cv.predict(X_test_scaled)

r2_ridge = r2_score(y_test, y_pred_ridge)
mae_ridge = mean_absolute_error(y_test, y_pred_ridge)
print("\nRidge Regression Performance:")
print(f"Best α: {ridge_cv.alpha_}")
print(f"R²: {r2_ridge:.3f}")
print(f"MAE: {mae_ridge:.2f}")

coef_ridge = pd.Series(ridge_cv.coef_, index=features)
print("\nRidge Regression Coefficients:")
display(coef_ridge)

lasso_cv = LassoCV(cv=5, max_iter=5000)
lasso_cv.fit(X_train_scaled, y_train)
y_pred_lasso = lasso_cv.predict(X_test_scaled)

r2_lasso = r2_score(y_test, y_pred_lasso)
mae_lasso = mean_absolute_error(y_test, y_pred_lasso)
print("\nLasso Regression Performance:")
print(f"Best α: {lasso_cv.alpha_}")
print(f"R²: {r2_lasso:.3f}")
print(f"MAE: {mae_lasso:.2f}")

coef_lasso = pd.Series(lasso_cv.coef_, index=features)
print("\nLasso Regression Coefficients:")
display(coef_lasso)

plt.figure(figsize=(6,4))
plt.scatter(y_test, y_pred_lr, alpha=0.6, label='Linear')
plt.scatter(y_test, y_pred_ridge, alpha=0.6, label='Ridge')
plt.scatter(y_test, y_pred_lasso, alpha=0.6, label='Lasso')
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'k--', linewidth=1)
plt.xlabel("Actual Shares")
plt.ylabel("Predicted Shares")
plt.title("Actual vs Predicted Shares")
plt.legend()
plt.tight_layout()
plt.show()

### 8.1 Predictive Modeling Observations

**Model Performance**  
- **Linear Regression** (no regularization)  
  - R² = –0.051 (worse than a constant‐mean predictor)  
  - MAE ≈ 148,265 shares  
- **Ridge Regression** (α = 10.0 via CV)  
  - R² = –0.511 (strong negative impact from penalization)  
  - MAE ≈ 170,543 shares  
- **Lasso Regression** (α ≈ 104,397 via CV)  
  - **R² = 0.364** (best fit)  
  - MAE ≈ 127,463 shares  
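
A negative R² follows directly from the definition R² = 1 − SS_res / SS_tot: whenever a model's squared error on the test set exceeds that of always predicting the test-set mean, the ratio is above 1 and R² drops below zero. The sketch below illustrates this with made-up numbers; the `r2` helper mirrors the standard formula and is not taken from the notebook.

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

y_true = [100, 200, 300, 400]     # illustrative share counts
mean_pred = [250] * 4             # constant-mean baseline
bad_pred = [400, 300, 200, 100]   # predictions that get the ordering backwards

print(r2(y_true, mean_pred))  # 0.0
print(r2(y_true, bad_pred))   # -3.0
```

With a test set of roughly 20 videos and heavy-tailed share counts, a single badly missed viral video can be enough to push R² negative.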

**Coefficients**  
| Model   | likes          | comments       | num_hashtags   |
|---------|----------------|----------------|----------------|
| Linear  | 0.14           | 5.99           | –9,387         |
| Ridge   | 456,666        | 246,444        | –42,759        |
| **Lasso** | **536,363**   | **107,583**    | **≈ 0**        |

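- The Linear model was fit on raw features, so its coefficients are in shares per like, per comment, and per hashtag. Ridge and Lasso were fit on standardized features, so their coefficients are in shares per one-standard-deviation change and are not directly comparable to the Linear row.  
- In the **Lasso** model (best performer), **likes** has the strongest positive weight, followed by **comments**.  
- **num_hashtags** is effectively zeroed out, indicating it adds no predictive power.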
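
The difference in coefficient magnitudes between the rows of the table is largely a matter of units. For an unpenalized least-squares fit, a coefficient on a standardized feature equals the raw-scale coefficient multiplied by that feature's standard deviation. The sketch below shows this for a single feature with invented data; `ols_slope` is a hypothetical helper, and the relation is only approximate for Ridge and Lasso because their penalties also depend on the feature scale.

```python
import statistics

def ols_slope(xs, ys):
    """Least-squares slope of y on a single feature x."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

# Illustrative data following an exact linear relation.
likes = [10_000, 50_000, 200_000, 1_000_000, 3_000_000]
shares = [0.1 * v + 500 for v in likes]

mu = statistics.fmean(likes)
sigma = statistics.pstdev(likes)  # population std (ddof=0), as StandardScaler uses
z_likes = [(v - mu) / sigma for v in likes]

slope_raw = ols_slope(likes, shares)    # shares per additional like
slope_std = ols_slope(z_likes, shares)  # shares per one-std increase in likes
print(f"raw: {slope_raw:.4f}   standardized: {slope_std:,.0f}   raw * sigma: {slope_raw * sigma:,.0f}")
```

Dividing a standardized coefficient by the feature's standard deviation therefore puts it back in per-unit terms for comparison.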

**Residual Analysis**  
- The **Residuals vs. Predicted** plot reveals heteroscedasticity:  
  - All models tend to **underestimate** very high-share videos and **overestimate** very low-share ones.  
  - Lasso shows the tightest residual band around zero for mid-range predictions.
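
One common response to this kind of heteroscedasticity is to fit the model on the log scale and back-transform the predictions. The sketch below is not part of the original analysis: it fits a single-feature least-squares line to `log(shares)` versus `log(likes)` on toy data generated from an exact power law, so the recovered slope can be checked. It uses `math.log` rather than `log1p`, which assumes all counts are positive (true after the cleaning filter).

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x (single feature)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return my - slope * mx, slope

# Toy data from an exact power law, shares = 0.08 * likes ** 0.9 (illustrative only).
likes = [1_000, 10_000, 100_000, 1_000_000, 10_000_000]
shares = [0.08 * v ** 0.9 for v in likes]

intercept, slope = fit_line([math.log(v) for v in likes], [math.log(s) for s in shares])

def predict_shares(n_likes):
    """Back-transform the log-scale prediction to a share count."""
    return math.exp(intercept + slope * math.log(n_likes))

print(f"slope={slope:.3f}, predicted shares at 50,000 likes: {predict_shares(50_000):,.0f}")
```

On the log scale, errors are relative rather than absolute, so a video with 5 million shares no longer outweighs every small video in the loss.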

**Actual vs. Predicted**  
- The **Actual vs. Predicted** scatter highlights Lasso’s closer alignment to the identity line for most samples, whereas Linear and Ridge deviate more, especially at the extremes.

**Interpretation**  
- **Likes** are the dominant driver of shares, with **comments** providing a secondary effect.  
- **Hashtag count** has negligible or negative impact on shares in these linear models.  
- Overall, the best model explains only about 36% of the variance, suggesting that additional content-level features (such as dog breed, emotional tone, and music genre) are needed to improve predictive accuracy.

## 9. Conclusion & Next Steps

**Key Findings**  
- **Likes** are by far the strongest driver of shares (Lasso R²≈0.36).  
- **Comments** contribute positively but less so.  
- **Hashtag count** shows negligible or slightly negative effect on shares.  
- **Three “virality regimes”** emerge via K-Means (K=3):  
  1. **Low-engagement** videos (minimal interaction)  
  2. **Moderately engaging** videos (the bulk of the sample)  
  3. **Viral outliers** (extreme hits)  

**Limitations**  
- Linear models explain only about 36% of the variance; many content factors remain unmodeled.  
- Small sample size (N=100) limits statistical power and generalizability.  
- I used only three engagement metrics plus hashtags; richer features (visual, audio, temporal) were omitted.  

**Next Steps**  
1. **Non-Linear Models**: Train Random Forests or Gradient Boosting to capture interactions and non-linear effects.  
2. **Content-Level Features**:  
   - **Dog breed recognition** via computer vision  
   - **Audio/music genre** classification  
   - **Emotion/sentiment** from captions and comments  
3. **Dataset Expansion**:  
   - Increase sample size (include non-viral controls) for supervised learning.  
   - Compare across platforms (Instagram Reels, YouTube Shorts).  
4. **Temporal Analysis**: Investigate posting time, frequency, and trends over the pandemic period.  
5. **Cluster-Informed Modeling**: Use cluster labels as features or build separate models for each engagement regime.

These ideas for future work will help push the analysis beyond basic engagement metrics and toward a deeper understanding of what truly makes dog videos go viral.