# Synthetic Video Dataset Generation

This notebook generates a synthetic dataset of video metadata: duration, views, likes, comments, and an engagement level derived from the like-to-view ratio. Later steps use the dataset to build ML models.


In [1]:
import pandas as pd
import numpy as np

np.random.seed(42)


In [2]:
# Generating synthetic dataset of 10,000 rows
n = 10000

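durations = np.random.randint(5, 600, n)  # duration in seconds, 5 to 599 (upper bound is exclusive)
views = np.random.randint(100, 1000000, n)  # 100 to 999,999 views
# Likes are 1-10% of views and comments are 0.1-2% of views, truncated to integers
likes = (views * np.random.uniform(0.01, 0.10, n)).astype(int)
comments = (views * np.random.uniform(0.001, 0.02, n)).astype(int)

# Label by like-to-view ratio: above 7% is "High", above 3% is "Medium", otherwise "Low"
engagement_level = np.where(likes > views * 0.07, "High",
                     np.where(likes > views * 0.03, "Medium", "Low"))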

df = pd.DataFrame({
    "duration": durations,
    "views": views,
    "likes": likes,
    "comments": comments,
    "engagement_level": engagement_level
})

df.head()


,duration,views,likes,comments,engagement_level
0,107,864410,66724,3768,High
1,440,274483,22623,3115,High
2,275,833154,63556,9591,High
3,111,571053,53267,3036,High
4,76,818806,41116,11894,Medium
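
Because the like-to-view ratio is drawn uniformly from [0.01, 0.10] and the labels cut that range at 0.03 and 0.07, the expected class shares are about 2/9 "Low", 4/9 "Medium", and 3/9 "High". The sketch below checks this on a fresh sample. It is an illustrative sanity check, not part of the original notebook: `df_check` and `summary` are hypothetical names, and since it skips the `durations` draw, its random values will not match the rows shown above.

```python
import numpy as np
import pandas as pd

np.random.seed(42)
n = 10000

# Same generation rules as the cell above, restricted to the columns the labels depend on
views = np.random.randint(100, 1000000, n)
likes = (views * np.random.uniform(0.01, 0.10, n)).astype(int)
engagement_level = np.where(likes > views * 0.07, "High",
                            np.where(likes > views * 0.03, "Medium", "Low"))

df_check = pd.DataFrame({
    "views": views,
    "likes": likes,
    "engagement_level": engagement_level,
})

# Observed class shares versus the shares implied by the uniform ratio
observed = df_check["engagement_level"].value_counts(normalize=True)
expected = pd.Series({"Low": 2 / 9, "Medium": 4 / 9, "High": 3 / 9})
summary = pd.DataFrame({"observed": observed, "expected": expected})

print(summary.round(3))
```

The classes are moderately imbalanced by construction, which is worth keeping in mind when choosing evaluation metrics for the later models.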


In [3]:
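# Saving the synthetic dataset to the raw data folder
from pathlib import Path

save_path = "../data/raw/synthetic_videos.csv"
Path(save_path).parent.mkdir(parents=True, exist_ok=True)  # create the folder if it is missing
df.to_csv(save_path, index=False)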

print("Synthetic dataset saved successfully at:", save_path)


Synthetic dataset saved successfully at: ../data/raw/synthetic_videos.csv
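
One way to confirm that the file was written as intended is to read it back and compare it with the frame in memory. The sketch below is a minimal illustration under stated assumptions: it uses a temporary directory and a small stand-in frame (`sample`, with values copied from the preview above) so that it runs without the `../data/raw` folder. The names `sample` and `reloaded` are hypothetical.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Small stand-in for df, using the first rows of the preview above
sample = pd.DataFrame({
    "duration": [107, 440, 275],
    "views": [864410, 274483, 833154],
    "likes": [66724, 22623, 63556],
    "comments": [3768, 3115, 9591],
    "engagement_level": ["High", "High", "High"],
})

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "raw" / "synthetic_videos.csv"
    path.parent.mkdir(parents=True, exist_ok=True)

    # index=False keeps the row index out of the file, so no extra column appears on reload
    sample.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

print(list(reloaded.columns))
print(reloaded.shape)
```

Writing with `index=False` matters here: without it, the reloaded frame would carry an extra unnamed index column.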
