### Features we generated:

| Feature               | Description                            |
| --------------------- | -------------------------------------- |
| `engagement_rate`     | `(likes + comments)/views`             |
| `like_rate`           | `likes/views`                          |
| `comment_rate`        | `comments/views`                       |
| `duration_sec`        | Convert ISO 8601 `duration` to seconds |
| `views_per_second`    | `views/duration_sec`                   |
| `comments_per_second` | `comments/duration_sec`                |


In [9]:
import pandas as pd
import isodate

# Load video metadata
df_videos = pd.read_csv("../data/raw/videos/only_1_minute_videos.csv")

# Convert duration to seconds
df_videos["duration_sec"] = df_videos["duration"].apply(lambda x: isodate.parse_duration(x).total_seconds())

# Engagement metrics
df_videos["engagement_rate"] = (df_videos["likes"] + df_videos["comments"]) / df_videos["views"]
df_videos["like_rate"] = df_videos["likes"] / df_videos["views"]
df_videos["comment_rate"] = df_videos["comments"] / df_videos["views"]
df_videos["views_per_second"] = df_videos["views"] / df_videos["duration_sec"]
df_videos["comments_per_second"] = df_videos["comments"] / df_videos["duration_sec"]

# Save processed video features
df_videos.to_csv("../data/processed/video_features.csv", index=False)
df_videos.head()


Unnamed: 0,video_id,title,published_at,views,likes,comments,duration,duration_sec,engagement_rate,like_rate,comment_rate,views_per_second,comments_per_second
0,l5PTG1m9vEE,She Used To Be A Man,2020-04-20T12:00:21Z,5205053,81166,7098,PT1M5S,65.0,0.016957,0.015594,0.001364,80077.738462,109.2
1,7wQ93t7q8ss,Women VS. Men,2020-04-17T12:00:04Z,1797149,45202,3550,PT1M14S,74.0,0.027127,0.025152,0.001975,24285.797297,47.972973
2,qsMiiLHVDe8,Why I Wake Up Early,2020-04-15T12:00:23Z,3184634,64681,5124,PT1M11S,71.0,0.021919,0.02031,0.001609,44854.0,72.169014
3,mn2UAPYD5PM,How Armenia Teaches Kids,2020-04-14T12:00:24Z,604307,13521,1062,PT1M10S,70.0,0.024132,0.022374,0.001757,8632.957143,15.171429
4,ZuLC4j_ohdw,The Hidden Cost Of Japan,2020-04-13T12:00:32Z,2230513,45776,2212,PT1M7S,67.0,0.021514,0.020523,0.000992,33291.238806,33.014925


### Comment Features:

| Feature            | Description                |
| ------------------ | -------------------------- |
| `comment_length`   | Number of words in comment |
| `top_commenters`   | Count of comments per user |
| `like_per_comment` | Like count on comment      |


In [10]:
df_comments = pd.read_csv("../data/raw/comments/only_1_minute_comments.csv")

# Comment length
df_comments["comment_length"] = df_comments["comment_text"].str.split().str.len()

# Top commenters
top_commenters = df_comments.groupby("author").size().reset_index(name="comment_count")
top_commenters = top_commenters.sort_values("comment_count", ascending=False)

# Save processed comments
df_comments.to_csv("../data/processed/comment_features.csv", index=False)
top_commenters.to_csv("../data/processed/top_commenters.csv", index=False)


### Transcript Features:

| Feature            | Description                              |
| ------------------ | ---------------------------------------- |
| `word_count`       | Total words in transcript                |
| `sentence_count`   | Total sentences (split by `.`, `?`, `!`) |
| `words_per_second` | `word_count/duration_sec`                |


In [11]:
import pandas as pd
import re

df_transcripts = pd.read_csv("../data/raw/transcripts/only_1_minute_manual.csv")
df_videos = pd.read_csv("../data/processed/video_features.csv")

# Merge transcripts with video durations
df_transcripts = df_transcripts.merge(df_videos[["video_id", "duration_sec"]], on="video_id", how="left")

# Word count
df_transcripts["word_count"] = df_transcripts["transcript"].str.split().str.len()

# Sentence count
df_transcripts["sentence_count"] = df_transcripts["transcript"].apply(lambda x: len(re.split(r'[.!?]', str(x))) if pd.notnull(x) else 0)

# Words per second
df_transcripts["words_per_second"] = df_transcripts["word_count"] / df_transcripts["duration_sec"]

# Save processed transcripts
df_transcripts.to_csv("../data/processed/transcript_features.csv", index=False)
df_transcripts.head()


Unnamed: 0,video_id,transcript,duration_sec,word_count,sentence_count,words_per_second
0,l5PTG1m9vEE,"Meet Angie. Hi, my name is Angie and I used to...",65.0,167,16,2.569231
1,7wQ93t7q8ss,hi I'm a man and I'm a woman and after traveli...,74.0,176,1,2.378378
2,qsMiiLHVDe8,hey let me ask you a question what what time d...,71.0,157,9,2.211268
3,mn2UAPYD5PM,everyone knows that children learn very quickl...,70.0,144,1,2.057143
4,ZuLC4j_ohdw,"Hi. When I was 14 years old, my dad would comp...",67.0,148,12,2.208955


## We now have:

1.  video_features.csv → KPI metrics per video

2.  comment_features.csv → detailed comment stats

3.  top_commenters.csv → key audience members

4.  transcript_features.csv → content efficiency metrics