# 📊 D1: Deep Feature Engineering for Multimodal YouTube Clickability

In this notebook we build **deep, multimodal feature sets** for our YouTube Clickability Study.

We will create **three new feature matrices**:

- `youtube_features_structured_deep.parquet` → rich handcrafted + semantic title/channel features  
- `youtube_features_text_deep.parquet` → Sentence-BERT embeddings (768-dim), compressed to 128 dims with PCA  
- `youtube_features_image_deep.parquet` → CLIP ViT-B/32 thumbnail embeddings + visual metadata, compressed to 128 dims with PCA  

We will **reuse** our existing targets:

- `youtube_target_regression.parquet`  
- `youtube_target_classification.parquet` *(top-25% cutoff, already created earlier)*  

These files are designed specifically for the **multimodal deep NN** used in `D2_multimodal_deep_nn.ipynb`.

## 0. 📦 Install Dependencies

This cell installs all Python packages needed for **deep feature engineering**:

- `requests`, `tqdm`, `pillow` → downloading thumbnails & progress bars  
- `sentence-transformers` → Sentence-BERT for text embeddings  
- `transformers` → CLIP model for vision embeddings  
- `timm` → EfficientNet for thumbnail embeddings (installed, but not used in the cells below)  
- `opencv-python` → face detection  
- `pytesseract` → thumbnail OCR (text density); requires the Tesseract binary to be installed separately  
- `textstat`, `textblob`, `vaderSentiment` → readability & sentiment signals  

> If some are already installed, pip will just say **"Requirement already satisfied."**

In [1]:
%pip install --quiet \
    requests \
    tqdm \
    pillow \
    sentence-transformers \
    transformers \
    timm \
    opencv-python \
    pytesseract \
    textstat \
    textblob \
    vaderSentiment

print("✅ Dependencies installed (or already satisfied).")

Note: you may need to restart the kernel to use updated packages.
✅ Dependencies installed (or already satisfied).


## 1. 📂 Setup Paths, Device, and Load Clean Dataset

Here we:

1. Define project paths (`base`, `processed_path`)
2. Load the cleaned dataset: `youtube_clean_final.parquet`
3. Detect whether we can use **CUDA**, **MPS**, or CPU (Sentence-BERT is always forced to CPU for safety)

In [2]:
from pathlib import Path
import pandas as pd
import torch

# Paths
base = Path.cwd().parent
processed_path = base / "data" / "processed"
processed_path.mkdir(parents=True, exist_ok=True)

# Load cleaned dataset
df = pd.read_parquet(processed_path / "youtube_clean_final.parquet")
print("Loaded dataset:", df.shape)

# Device selection
if torch.cuda.is_available():
    device = torch.device("cuda")
elif getattr(torch.backends, "mps", None) and torch.backends.mps.is_available():
    device = torch.device("mps")
else:
    device = torch.device("cpu")

print("Vision device:", device)
print("Text device:  cpu (SentenceTransformers safer on CPU)")

df.head()

Loaded dataset: (5742, 19)
Vision device: mps
Text device:  cpu (SentenceTransformers safer on CPU)


Unnamed: 0,video_id,trending_date,title,channel_title,category_id,publish_time,tags,views,likes,dislikes,comment_count,thumbnail_link,comments_disabled,ratings_disabled,video_error_or_removed,description,subscribers,views_per_subscriber,views_per_subscriber_log
0,-0CMnp02rNY,18.11.06,Mindy Kaling's Daughter Had the Perfect Reacti...,TheEllenShow,24,2018-06-04 13:00:00+00:00,"ellen|""ellen degeneres""|""the ellen show""|""elle...",800359,9773,332,423,https://i.ytimg.com/vi/-0CMnp02rNY/default.jpg,False,False,False,Ocean's 8 star Mindy Kaling dished on bringing...,23760020.0,0.033685,0.03313
1,-0NYY8cqdiQ,18.01.02,Megan Mullally Didn't Notice the Interesting P...,TheEllenShow,24,2018-01-29 14:00:39+00:00,"megan mullally|""megan""|""mullally""|""will and gr...",563746,4429,54,94,https://i.ytimg.com/vi/-0NYY8cqdiQ/default.jpg,False,False,False,Ellen and Megan Mullally have known each other...,23760020.0,0.023727,0.02345
2,-1Hm41N0dUs,18.01.05,Cast of Avengers: Infinity War Draws Their Cha...,Jimmy Kimmel Live,23,2018-04-27 07:30:02+00:00,"jimmy|""jimmy kimmel""|""jimmy kimmel live""|""late...",2058516,41248,580,1484,https://i.ytimg.com/vi/-1Hm41N0dUs/default.jpg,False,False,False,"Benedict Cumberbatch, Don Cheadle, Elizabeth O...",11262900.0,0.18277,0.167859
3,-1yT-K3c6YI,17.02.12,YOUTUBER QUIZ + TRUTH OR DARE W/ THE MERRELL T...,Molly Burke,22,2017-11-28 18:30:43+00:00,"youtube quiz|""youtuber quiz""|""truth or dare""|""...",231341,7734,212,846,https://i.ytimg.com/vi/-1yT-K3c6YI/default.jpg,False,False,False,Check out the video we did on the Merrell Twin...,274004.0,0.844295,0.612097
4,-2RVw2_QyxQ,17.16.11,2017 Champions Showdown: Day 3,Saint Louis Chess Club,27,2017-11-12 02:39:01+00:00,"Chess|""Saint Louis""|""Club""",71089,460,27,20,https://i.ytimg.com/vi/-2RVw2_QyxQ/default.jpg,False,False,False,The Saint Louis Chess Club hosts a series of f...,147718.0,0.481245,0.392883


## 2. 🧹 Basic Title Cleaning

We ensure the `title` column is:

- Cast to string  
- Free of leading/trailing whitespace  

This keeps downstream text processing consistent.

In [3]:
df["title"] = df["title"].astype(str).str.strip()
print("Sample titles:")
df["title"].head()

Sample titles:


0    Mindy Kaling's Daughter Had the Perfect Reacti...
1    Megan Mullally Didn't Notice the Interesting P...
2    Cast of Avengers: Infinity War Draws Their Cha...
3    YOUTUBER QUIZ + TRUTH OR DARE W/ THE MERRELL T...
4                       2017 Champions Showdown: Day 3
Name: title, dtype: object

## 3. 🧱 Structured Deep Features (Handcrafted + Functional)

We build a **richer structured feature set**, including:

### **Original structured features**
- `title_length`
- `word_count`
- `caps_ratio`
- `has_question`, `has_exclamation`, `has_number`
- `avg_word_len`
- `sentiment_vader`
- `subscribers`

### **Functional deep structured features**
- `sentiment_tb` → TextBlob polarity  
- `readability` → Flesch Reading Ease  
- `emoji_count`  
- `punctuation_intensity`

Saved to:  
`youtube_features_structured_deep.parquet`
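
As a quick illustration of the two ratio features, the sketch below recomputes `caps_ratio` and `punctuation_intensity` for a single title, using the same definitions as the feature cell in this section. The example title is made up for illustration and is not taken from the dataset.

```python
PUNCT_CHARS = "!?.,:;"


def caps_ratio(title: str) -> float:
    """Share of characters in the title that are uppercase letters."""
    return sum(c.isupper() for c in title) / len(title) if title else 0.0


def punctuation_intensity(title: str) -> float:
    """Share of characters that are one of the tracked punctuation marks."""
    return sum(title.count(p) for p in PUNCT_CHARS) / len(title) if title else 0.0


# Hypothetical title: 18 characters, 8 uppercase letters, 2 punctuation marks.
example = "WOW! Is this REAL?"
example_caps = caps_ratio(example)
example_punct = punctuation_intensity(example)
print(len(example), round(example_caps, 3), round(example_punct, 3))
```

Both ratios are normalized by the full title length (spaces included), so a short all-caps title and a long one with the same number of capitals score very differently.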

In [25]:
import numpy as np
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
from textblob import TextBlob
import textstat

analyzer = SentimentIntensityAnalyzer()
titles = df["title"].astype(str)

def count_emojis(text):
    emoji_chars = "😀😃😄😁😆😅😂🤣😊😍😎🔥💀✨🙏💯🎉🥶🥵🤯😱😨😢😭😡"
    return sum(ch in emoji_chars for ch in text)

df_struct_deep = pd.DataFrame(index=df.index)

# Original structured
df_struct_deep["title_length"] = titles.apply(len)
df_struct_deep["word_count"] = titles.apply(lambda x: len(x.split()))
df_struct_deep["caps_ratio"] = titles.apply(
    lambda x: sum(c.isupper() for c in x) / len(x) if len(x) else 0
)
df_struct_deep["has_question"] = titles.apply(lambda x: int("?" in x))
df_struct_deep["has_exclamation"] = titles.apply(lambda x: int("!" in x))
df_struct_deep["has_number"] = titles.apply(lambda x: int(any(c.isdigit() for c in x)))
df_struct_deep["avg_word_len"] = titles.apply(
    lambda x: np.mean([len(w) for w in x.split()]) if x.split() else 0
)
df_struct_deep["sentiment_vader"] = titles.apply(
    lambda x: analyzer.polarity_scores(x)["compound"]
)
df_struct_deep["subscribers"] = df["subscribers"].astype(float)

# Functional features
df_struct_deep["sentiment_tb"] = titles.apply(lambda x: TextBlob(x).sentiment.polarity)
df_struct_deep["readability"] = titles.apply(
    lambda x: textstat.flesch_reading_ease(x) if x.strip() else 0
)
df_struct_deep["emoji_count"] = titles.apply(count_emojis)
punct_chars = "!?.,:;"
df_struct_deep["punctuation_intensity"] = titles.apply(
    lambda x: sum(x.count(p) for p in punct_chars) / len(x) if len(x) else 0
)
struct_path = processed_path / "youtube_features_structured_deep.parquet"
df_struct_deep.to_parquet(struct_path, index=False)

print("Structured DEEP feature shape:", df_struct_deep.shape)
df_struct_deep.head()

Structured DEEP feature shape: (5742, 13)


Unnamed: 0,title_length,word_count,caps_ratio,has_question,has_exclamation,has_number,avg_word_len,sentiment_vader,subscribers,sentiment_tb,readability,emoji_count,punctuation_intensity
0,74,11,0.121622,0,0,0,5.818182,0.5719,23760020.0,1.0,49.542727,0,0.0
1,75,10,0.106667,0,0,0,6.6,-0.3089,23760020.0,0.5,27.485,0,0.0
2,53,8,0.132075,0,0,0,5.75,-0.5994,11262900.0,0.0,61.24,0,0.018868
3,51,10,0.764706,0,1,0,4.2,0.5147,274004.0,0.0,75.5,0,0.019608
4,30,5,0.1,0,0,1,5.2,0.5267,147718.0,0.0,66.4,0,0.033333


## 4. 📝 Text Deep Features: Sentence-BERT + PCA-128

We generate strong text embeddings by combining **title + tags + description** into a single semantic string:

### 📘 Sentence-BERT `all-mpnet-base-v2`
- 768-dimensional semantic embedding  
- Captures meaning, sentiment, keywords, writing style  

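### ⭐ PCA → **128 dimensions**
- Retains most of the embedding variance (the exact share is not measured in this notebook; `pca_text.explained_variance_ratio_` reports it)  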
- Speeds up training  
- Reduces model size & overfitting  
- Better stability for multimodal fusion  
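
How much variance 128 components actually keep depends on the data, so it is worth measuring rather than assuming. The sketch below shows the check on a synthetic stand-in matrix (random numbers, not real Sentence-BERT output); the variable names `fake_embeddings` and `pca_check` are placeholders for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)

# Synthetic stand-in for the (n_samples, 768) Sentence-BERT matrix.
fake_embeddings = rng.normal(size=(500, 768)).astype(np.float32)

pca_check = PCA(n_components=128, random_state=42)
pca_check.fit(fake_embeddings)

# Each entry is the fraction of total variance explained by one component;
# their sum is the share retained after compression.
retained = float(pca_check.explained_variance_ratio_.sum())
print(f"Variance retained by 128 components: {retained:.3f}")
```

In this notebook the same check would be `pca_text.explained_variance_ratio_.sum()` after the PCA step has been fitted. Real sentence embeddings are far more correlated than random noise, so the retained share there should be considerably higher than on this synthetic matrix.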

### 📦 Saved Output  
Final 128-dim text embeddings stored to:

`youtube_features_text_deep.parquet`

In [39]:
from sentence_transformers import SentenceTransformer
from sklearn.decomposition import PCA
import numpy as np
import pandas as pd

# Build text rows (TITLE + TAGS + DESCRIPTION)
def build_text_row(row):
    return f"[TITLE] {row['title']} [TAGS] {row['tags']} [DESC] {row['description']}"

texts = df.apply(build_text_row, axis=1).tolist()
print("Total text rows:", len(texts))

# Load MPNet (Sentence-BERT)
print("📦 Loading Sentence-BERT (MPNet)...")
model_text = SentenceTransformer("sentence-transformers/all-mpnet-base-v2", device="cpu")

# Encode → 768-dim embeddings
print("🔍 Encoding text with MPNet...")
text_embeddings = model_text.encode(
    texts,
    batch_size=32,
    convert_to_numpy=True,
    show_progress_bar=True
)

print("Raw text embedding shape:", text_embeddings.shape)

# PCA → 128 dims
print("⚙️ Applying PCA → 128 dims...")
pca_text = PCA(n_components=128, random_state=42)
text_pca = pca_text.fit_transform(text_embeddings).astype(np.float32)

print("✓ PCA complete. New shape:", text_pca.shape)

# Save to parquet
X_text_deep = pd.DataFrame(
    text_pca,
    columns=[f"text_pca_{i}" for i in range(128)],
    index=df.index
)

text_path = processed_path / "youtube_features_text_deep.parquet"
X_text_deep.to_parquet(text_path, index=False)

print("🎉 Saved PCA-compressed text features:", X_text_deep.shape)

X_text_deep.head()

Total text rows: 5742
📦 Loading Sentence-BERT (MPNet)...
🔍 Encoding text with MPNet...


Batches:   0%|          | 0/180 [00:00<?, ?it/s]

Raw text embedding shape: (5742, 768)
⚙️ Applying PCA → 128 dims...
✓ PCA complete. New shape: (5742, 128)
🎉 Saved PCA-compressed text features: (5742, 128)


Unnamed: 0,text_pca_0,text_pca_1,text_pca_2,text_pca_3,text_pca_4,text_pca_5,text_pca_6,text_pca_7,text_pca_8,text_pca_9,...,text_pca_118,text_pca_119,text_pca_120,text_pca_121,text_pca_122,text_pca_123,text_pca_124,text_pca_125,text_pca_126,text_pca_127
0,-0.151295,-0.14798,0.383545,-0.006988,0.083549,0.061882,0.089808,-0.13111,-0.015959,-0.133157,...,-0.074844,0.074155,-0.024881,-0.016199,-0.009139,-0.006956,0.03427,0.017859,0.021243,-0.008997
1,-0.058435,-0.145014,0.396311,-0.041779,-0.01831,0.153744,0.093258,-0.123044,-0.017802,0.025689,...,0.011435,0.023881,-0.053888,0.002444,0.019642,-0.011896,0.048322,0.002644,0.015007,0.036443
2,-0.152892,-0.187936,0.103694,0.339933,-0.044565,-0.154826,0.028612,0.098123,-0.113959,-0.004525,...,0.070238,0.025376,0.001864,0.006263,0.012419,0.03818,-0.002867,0.017639,-0.033473,-0.017865
3,0.116747,0.252121,0.182934,-0.024036,-0.05901,-0.122711,0.080005,0.088937,0.070096,-0.056868,...,0.065549,0.025949,0.049594,-0.011026,0.006787,-0.013862,0.026179,-0.053807,0.059425,0.016043
4,-0.075288,-0.090466,-0.221809,-0.067399,0.117314,-0.198868,-0.130149,0.015747,-0.157611,-0.006242,...,-0.002234,0.024061,0.003783,-0.012131,-0.015217,-0.027122,0.076398,-0.051127,0.033308,0.021247


## 5. 🖼 Image Deep Features: CLIP ViT-B/32 + Visual Metadata + PCA-128

We extract deep visual features from each YouTube thumbnail using **CLIP ViT-B/32**, a fast and highly semantic vision encoder.

### 🧠 CLIP ViT-B/32 (512-dim)
- Recognizes objects, style, composition, and scene structure  
- Much faster than ViT-L/14, and still a reasonable choice for clickability prediction  
- Outputs a **512-dim L2-normalized embedding**

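### 🧩 Visual Metadata (4 interpretable features)
- **brightness**: mean pixel intensity across all channels  
- **saturation**: mean of the HSV saturation channel  
- **face_count**: number of faces found by OpenCV's frontal-face Haar cascade  
- **text_density**: fraction of the image area covered by OCR-detected text boxes (capped at 0.5; 0 if OCR fails)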
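
The `text_density` logic in this section's code cell can be sketched in isolation as follows. The dictionary mimics the `text` / `width` / `height` keys that `pytesseract.image_to_data(..., output_type=Output.DICT)` returns; the values in `mock_ocr` and the helper name `text_density_from_boxes` are invented for illustration, and no OCR is run here.

```python
def text_density_from_boxes(ocr_data: dict, img_width: int, img_height: int,
                            cap: float = 0.5) -> float:
    """Fraction of the image area covered by non-empty OCR boxes, capped at `cap`."""
    total_area = img_width * img_height
    if total_area <= 0:
        return 0.0

    text_area = 0
    for text, w, h in zip(ocr_data["text"], ocr_data["width"], ocr_data["height"]):
        if not str(text).strip():
            continue  # skip layout-only rows that carry no recognized text
        w, h = int(w), int(h)
        if w > 0 and h > 0:
            text_area += w * h

    return min(text_area / total_area, cap)


# Hand-built stand-in for OCR output on a 480x360 thumbnail.
mock_ocr = {
    "text": ["", "SUBSCRIBE", "  ", "NOW"],
    "width": [480, 200, 10, 80],
    "height": [360, 40, 10, 40],
}
density = text_density_from_boxes(mock_ocr, img_width=480, img_height=360)
print(round(density, 4))
```

Boxes are summed without checking for overlap, so the value is an upper bound on true text coverage; the cap keeps a few oversized boxes from dominating the feature.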

### ⭐ PCA → 128 dims (recommended)
We compress `[512 + 4] = 516` raw features into **128-dim** using PCA:
- Reduces noise and speeds up training
- Lowers overfitting  
- More stable for multimodal fusion

### 📦 Saved Output  
`youtube_features_image_deep.parquet`, with **128 columns**.
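
One caveat: PCA is sensitive to feature scale. The CLIP embedding is L2-normalized, so each coordinate is small, while brightness and saturation range up to 255. The sketch below uses synthetic data (not real thumbnails) to show how much of the total variance the four metadata columns carry before and after z-scoring every column. The standardization step is an assumption added for illustration; it is not part of the pipeline in this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Synthetic stand-ins: unit-norm "CLIP-like" vectors plus four metadata
# columns on their natural scales.
clip_like = rng.normal(size=(n, 512))
clip_like /= np.linalg.norm(clip_like, axis=1, keepdims=True)
meta_raw = np.column_stack([
    rng.uniform(0, 255, n),   # brightness
    rng.uniform(0, 255, n),   # saturation
    rng.integers(0, 4, n),    # face_count
    rng.uniform(0, 0.5, n),   # text_density
])


def metadata_variance_share(matrix: np.ndarray, n_meta: int = 4) -> float:
    """Fraction of the total variance carried by the last n_meta columns."""
    per_column = matrix.var(axis=0)
    return float(per_column[-n_meta:].sum() / per_column.sum())


combined_raw = np.hstack([clip_like, meta_raw])
# Z-score every column so each one contributes unit variance to PCA.
combined_scaled = (combined_raw - combined_raw.mean(axis=0)) / combined_raw.std(axis=0)

share_raw = metadata_variance_share(combined_raw)
share_scaled = metadata_variance_share(combined_scaled)
print(f"metadata share of variance, raw:    {share_raw:.4f}")
print(f"metadata share of variance, scaled: {share_scaled:.4f}")
```

In the notebook itself, the analogous step would be to apply scikit-learn's `StandardScaler` to `X_img_raw` before `pca.fit_transform`. Whether that improves the downstream model is an empirical question.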


In [40]:
# ==============================================================
# 5. 🖼 Image Deep Features: CLIP ViT-B/32 + Visual Metadata + OCR + PCA-128
# ==============================================================

from io import BytesIO
from PIL import Image
from tqdm.auto import tqdm
import numpy as np
import pandas as pd
from pathlib import Path
import requests
import cv2
import pytesseract
from sklearn.decomposition import PCA
import torch
from transformers import CLIPProcessor, CLIPModel

print("🔌 Vision device:", device)

# --------------------------------------------------------------
# Load CLIP ViT-B/32
# --------------------------------------------------------------
print("📦 Loading CLIP ViT-B/32...")
clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device)
clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
clip_model.eval()


# --------------------------------------------------------------
# Helper: download_img (fetches the hqdefault thumbnail; blank image on failure)
# --------------------------------------------------------------
def download_img(url: str) -> Image.Image:
    try:
        if not isinstance(url, str) or not url.startswith("http"):
            raise ValueError("Bad URL")
        url_hq = url.replace("/default.jpg", "/hqdefault.jpg")
        resp = requests.get(url_hq, timeout=7)
        resp.raise_for_status()
        return Image.open(BytesIO(resp.content)).convert("RGB")
    except Exception:
        # fallback blank image
        return Image.fromarray(np.zeros((224, 224, 3), dtype=np.uint8))


# --------------------------------------------------------------
# Helper: OCR-safe visual metadata
# --------------------------------------------------------------
def compute_metadata(img: Image.Image):

    img_cv = cv2.cvtColor(np.array(img), cv2.COLOR_RGB2BGR)

    # 1. Brightness
    brightness = float(img_cv.mean())

    # 2. Saturation
    hsv = cv2.cvtColor(img_cv, cv2.COLOR_BGR2HSV)
    saturation = float(hsv[:, :, 1].mean())

    # 3. Face Count
    gray = cv2.cvtColor(img_cv, cv2.COLOR_BGR2GRAY)
    face_cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
    )
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    face_count = int(len(faces))

    # 4. OCR Text Density - SAFE VERSION
    img_rgb = cv2.cvtColor(img_cv, cv2.COLOR_BGR2RGB)

    try:
        # Faster OCR config
        ocr_data = pytesseract.image_to_data(
            img_rgb,
            config="--oem 3 --psm 6",
            output_type=pytesseract.Output.DICT,
        )
    except Exception:
        return brightness, saturation, face_count, 0.0

    text_area = 0
    total_area = img_rgb.shape[0] * img_rgb.shape[1]

    n_items = len(ocr_data.get("text", []))

    for i in range(n_items):
        text = ocr_data["text"][i].strip()

        if text == "":
            continue

        # ---- skip missing/invalid OCR values ----
        try:
            w = int(ocr_data["width"][i])
            h = int(ocr_data["height"][i])
        except (KeyError, IndexError, TypeError, ValueError):
            continue

        if w <= 0 or h <= 0:
            continue

        text_area += w * h

    text_density = text_area / total_area if total_area > 0 else 0.0

    # stability clip
    text_density = min(text_density, 0.5)

    return brightness, saturation, face_count, float(text_density)


# --------------------------------------------------------------
# Helper: CLIP batch inference
# --------------------------------------------------------------
def clip_batch_embed(batch):
    if len(batch) == 0:
        return np.zeros((0, 512), dtype=np.float32)  # ViT-B/32 image features are 512-dim

    inputs = clip_processor(images=batch, return_tensors="pt", padding=True).to(device)

    with torch.no_grad():
        feats = clip_model.get_image_features(**inputs)
        feats = feats / feats.norm(dim=-1, keepdim=True)

    return feats.cpu().numpy().astype(np.float32)


# ==============================================================
# 5.1 Process thumbnails
# ==============================================================

print("\n🔍 Step 1/3 - Downloading thumbnails + computing metadata...")

thumb_urls = df["thumbnail_link"].tolist()
BATCH_SIZE = 16

brightness_list = []
saturation_list = []
face_count_list = []
text_density_list = []

batched_clip_results = []
batch_imgs = []

pbar = tqdm(total=len(thumb_urls), desc="Images processed", unit="img")

for url in thumb_urls:
    img = download_img(url)

    # Metadata
    br, sat, fc, td = compute_metadata(img)
    brightness_list.append(br)
    saturation_list.append(sat)
    face_count_list.append(fc)
    text_density_list.append(td)

    batch_imgs.append(img)

    if len(batch_imgs) == BATCH_SIZE:
        batch_emb = clip_batch_embed(batch_imgs)
        batched_clip_results.append(batch_emb)
        batch_imgs = []

    pbar.update(1)

# leftover
if len(batch_imgs) > 0:
    batched_clip_results.append(clip_batch_embed(batch_imgs))

pbar.close()

X_clip = np.vstack(batched_clip_results)
print(f"✓ CLIP embeddings complete: {X_clip.shape}")


# ==============================================================
# 5.2 Combine with metadata
# ==============================================================

meta_array = np.stack(
    [brightness_list, saturation_list, face_count_list, text_density_list],
    axis=1,
).astype(np.float32)

X_img_raw = np.hstack([X_clip, meta_array])
print("Combined feature matrix:", X_img_raw.shape)


# ==============================================================
# 5.3 PCA → 128 dims
# ==============================================================

print("\n⚙️ Step 2/3 - Running PCA → 128 dims...")
pca = PCA(n_components=128, random_state=42)
X_img_pca = pca.fit_transform(X_img_raw).astype(np.float32)
print("✓ PCA complete → shape:", X_img_pca.shape)


# ==============================================================
# 5.4 Save
# ==============================================================

img_cols = [f"img_pca_{i}" for i in range(128)]
X_image_deep = pd.DataFrame(X_img_pca, columns=img_cols, index=df.index)

img_path = processed_path / "youtube_features_image_deep.parquet"
X_image_deep.to_parquet(img_path, index=False)

print(f"🎉 Step 3/3 - Final image deep features saved: {X_image_deep.shape}")

🔌 Vision device: mps
📦 Loading CLIP ViT-B/32...

🔍 Step 1/3 - Downloading thumbnails + computing metadata...


Images processed:   0%|          | 0/5742 [00:00<?, ?img/s]

✓ CLIP embeddings complete: (5742, 512)
Combined feature matrix: (5742, 516)

⚙️ Step 2/3 - Running PCA → 128 dims...
✓ PCA complete → shape: (5742, 128)
🎉 Step 3/3 - Final image deep features saved: (5742, 128)


## 6. 💾 Deep Feature Matrices: Summary (No Saving Here)

All **deep multimodal feature matrices** have now been generated and saved in their
respective sections earlier in this notebook:

- `youtube_features_structured_deep.parquet`  
  → enriched handcrafted + semantic title/channel features  

- `youtube_features_text_deep.parquet`  
  → Sentence-BERT embeddings of title + tags + description (768-dim), compressed to 128 dims with PCA  

- `youtube_features_image_deep.parquet`  
  → CLIP ViT-B/32 embeddings (512-dim) + visual metadata, compressed to 128 dims with PCA  

Our targets were created previously and remain unchanged:

- `youtube_target_regression.parquet`  
  → continuous *views_per_subscriber* metric  
- `youtube_target_classification.parquet`  
  → binary clickability label (top-25% cutoff)

These **five datasets** collectively serve as the complete input to  
`D2_multimodal_deep_nn.ipynb`, where we train the full multimodal deep neural network.
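
Because every feature file is written with `index=False`, the five tables are aligned only by row order. A minimal pre-training check, sketched here on toy DataFrames, is to confirm that all row counts agree. The helper name `check_same_length` and the toy frames are hypothetical, not part of the notebook.

```python
import pandas as pd


def check_same_length(frames: dict) -> int:
    """Return the shared row count, or raise if any table differs."""
    lengths = {name: len(frame) for name, frame in frames.items()}
    if len(set(lengths.values())) != 1:
        raise ValueError(f"Row counts differ: {lengths}")
    return next(iter(lengths.values()))


# Toy stand-ins for the feature and target tables.
toy_frames = {
    "structured": pd.DataFrame({"title_length": [10, 20, 30]}),
    "text": pd.DataFrame({"text_pca_0": [0.1, 0.2, 0.3]}),
    "image": pd.DataFrame({"img_pca_0": [0.4, 0.5, 0.6]}),
    "target": pd.DataFrame({"y": [0, 1, 0]}),
}
n_rows = check_same_length(toy_frames)
print("All tables share", n_rows, "rows")
```

Matching row counts are necessary but not sufficient: if any table were filtered or re-sorted between notebooks, the rows would no longer correspond. Keeping `video_id` as a column in each file would allow an explicit join instead.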



In [41]:
# Display shapes for confirmation
print("📐 Deep Feature Summary")
print("Structured deep →", df_struct_deep.shape)
print("Text deep       →", X_text_deep.shape)
print("Image deep      →", X_image_deep.shape)

print("\n🎯 Targets")
print("Regression target shape     →", pd.read_parquet(processed_path / "youtube_target_regression.parquet").shape)
print("Classification target shape →", pd.read_parquet(processed_path / "youtube_target_classification.parquet").shape)

print("\n✅ All deep features are ready for D2_multimodal_deep_nn.ipynb")

📐 Deep Feature Summary
Structured deep → (5742, 13)
Text deep       → (5742, 128)
Image deep      → (5742, 128)

🎯 Targets
Regression target shape     → (5742, 1)
Classification target shape → (5742, 1)

✅ All deep features are ready for D2_multimodal_deep_nn.ipynb
