# TikTok Video Popularity Prediction

This notebook predicts whether a TikTok video will be popular using a Random Forest classifier. It walks through data loading, feature engineering, model training, evaluation, feature-importance inspection, and a simple fairness check based on whether the content creator is verified.

In [None]:
# Import libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
import matplotlib.pyplot as plt

# For fairness check
import numpy as np


In [None]:
# Load dataset
# The CSV file must be in the same directory as this notebook
DATA_FILE = 'tiktok_video_performance_v2.csv'
df = pd.read_csv(DATA_FILE)

# Display first few rows
df.head()


In [None]:
# Feature engineering
# Define a binary target 'popular' from like counts:
# 1 if a video's likes are strictly above the median, else 0
# (ties at the median and missing like counts are labeled 0)
like_threshold = df['like_count'].median()
df['popular'] = (df['like_count'] > like_threshold).astype(int)

# Select features for prediction
feature_cols = ['play_count', 'comment_count', 'share_count', 'author_follower_count']
X = df[feature_cols].fillna(0)
y = df['popular']
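
The median split above uses a strict `>`, so a video whose like count equals the median is labeled not popular. The following is a minimal, dependency-free sketch of that rule on made-up like counts. The values and the helper name `label_popular` are illustrative assumptions, not part of the dataset or the notebook.

```python
from statistics import median

def label_popular(like_counts):
    """Label each count 1 if it is strictly above the median, else 0."""
    threshold = median(like_counts)
    return [int(count > threshold) for count in like_counts]

# Hypothetical like counts with a tie at the median (50)
demo_likes = [10, 50, 50, 200, 1000]
demo_labels = label_popular(demo_likes)
print(demo_labels)
```

When several videos tie at the median, the two classes are not exactly balanced, which is one reason to check `df['popular'].value_counts()` before training.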


In [None]:
# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)

# Initialize and train the model
rf_clf = RandomForestClassifier(n_estimators=200, random_state=42)
rf_clf.fit(X_train, y_train)

# Make predictions
y_pred = rf_clf.predict(X_test)
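
The `stratify=y` argument asks `train_test_split` to keep the class proportions roughly equal in the training and test sets. A small sketch on synthetic labels makes the effect visible. The 80/20 class mix and the `*_demo` names are assumptions chosen for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 80 negatives, 20 positives
X_demo = np.arange(100).reshape(-1, 1)
y_demo = np.array([0] * 80 + [1] * 20)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

# Share of positives in each split; stratification keeps both close to 0.20
train_pos_share = y_tr.mean()
test_pos_share = y_te.mean()
print(f"train: {train_pos_share:.2f}, test: {test_pos_share:.2f}")
```

Without `stratify`, a small or imbalanced dataset can end up with noticeably different class shares in the two splits, which makes test metrics harder to interpret.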


In [None]:
# Evaluate the model
print("Classification Report:")
print(classification_report(y_test, y_pred))

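# Confusion matrix plot (rows: true class, columns: predicted class)
cm = confusion_matrix(y_test, y_pred, labels=[0, 1])
fig, ax = plt.subplots()
im = ax.imshow(cm, interpolation='nearest', cmap='Blues')
fig.colorbar(im, ax=ax)
ax.set_title("Confusion Matrix")
ax.set_xlabel('Predicted')
ax.set_ylabel('True')
ax.set_xticks([0, 1])
ax.set_yticks([0, 1])
ax.set_xticklabels(['not popular', 'popular'])
ax.set_yticklabels(['not popular', 'popular'])
# Use white text on dark cells and black text on light cells
text_threshold = cm.max() / 2
for i in range(cm.shape[0]):
    for j in range(cm.shape[1]):
        color = 'white' if cm[i, j] > text_threshold else 'black'
        ax.text(j, i, cm[i, j], ha='center', va='center', color=color)
plt.show()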
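
The precision, recall, and F1 values in the classification report can be derived directly from the confusion matrix. The sketch below shows the arithmetic for the positive class, assuming scikit-learn's layout (rows are true classes, columns are predicted classes). The counts in `demo_cm` and the helper `binary_metrics` are hypothetical and are not results from this notebook.

```python
def binary_metrics(cm):
    """Compute precision, recall, and F1 for the positive class.

    `cm` follows scikit-learn's layout: rows are true classes,
    columns are predicted classes, so cm[1][1] counts true positives.
    """
    tn, fp = cm[0]
    fn, tp = cm[1]
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Hypothetical counts, not results from this notebook
demo_cm = [[40, 10],
           [5, 45]]
precision, recall, f1 = binary_metrics(demo_cm)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```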


In [None]:
# Feature importances
importances = rf_clf.feature_importances_
feature_importance_df = pd.DataFrame({
    'feature': feature_cols,
    'importance': importances
}).sort_values(by='importance', ascending=False)
feature_importance_df
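
The `feature_importances_` attribute of a Random Forest is impurity-based and computed from the training data. Permutation importance on held-out data is a common cross-check: it measures how much the score drops when one feature's values are shuffled. The sketch below runs on synthetic data with one informative feature and one noise feature. All names and values here are assumptions for illustration, not the notebook's data.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: the label depends only on `signal`; `noise` is irrelevant
rng = np.random.default_rng(0)
n_samples = 400
signal = rng.normal(size=n_samples)
noise = rng.normal(size=n_samples)
X_demo = np.column_stack([signal, noise])
y_demo = (signal > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo
)
model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# Mean drop in test accuracy when each feature is shuffled
result = permutation_importance(model, X_te, y_te, n_repeats=5, random_state=0)
perm_means = dict(zip(["signal", "noise"], result.importances_mean))
print(perm_means)
```

In the notebook itself, the same call would take `rf_clf`, `X_test`, and `y_test`, and the resulting ranking can be compared with the impurity-based one.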


In [None]:
# Fairness check based on creator verification status (if available)
if 'author_verified' in df.columns:
    test_df = df.loc[X_test.index]
    test_df = test_df.assign(predicted=y_pred)

    # Compute positive prediction rates for verified vs non-verified
    rates = test_df.groupby('author_verified')['predicted'].mean()
    print("Positive prediction rates by verified status:")
    print(rates)

    if len(rates) == 2:
        diff = abs(rates.iloc[1] - rates.iloc[0])
        print(f"Difference in positive rates: {diff:.3f}")
else:
    print("Column 'author_verified' not found. Fairness check skipped.")
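
The check above compares how often each group receives a positive prediction. A dependency-free sketch of the same arithmetic on made-up values is shown here. The flags, predictions, and the helper `positive_rate_by_group` are hypothetical and serve only to make the calculation explicit.

```python
from collections import defaultdict

def positive_rate_by_group(groups, predictions):
    """Return {group: share of positive predictions} for each group."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for group, pred in zip(groups, predictions):
        totals[group] += 1
        positives[group] += pred
    return {group: positives[group] / totals[group] for group in totals}

# Hypothetical verification flags and model predictions
demo_verified = [True, True, True, True, False, False, False, False, False, False]
demo_predicted = [1, 1, 1, 0, 1, 0, 0, 1, 0, 0]

rates = positive_rate_by_group(demo_verified, demo_predicted)
parity_gap = abs(rates[True] - rates[False])
print(rates, f"gap={parity_gap:.3f}")
```

A gap near 0 means both groups receive positive predictions at similar rates. It does not account for differences in the true popularity rate between the groups, so it is best read alongside the per-group sample sizes.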


## Conclusion

This notebook trains a Random Forest classifier to predict whether a TikTok video's like count exceeds the dataset median and evaluates it on a held-out 30% test split. The feature importances show which inputs (plays, comments, shares, followers) the model relies on most; they describe the model's behavior, not what causes a video to become popular. A simple fairness check compares the rate of positive predictions for verified and non-verified creators. Depending on the results, you can iterate with more features or different models to improve accuracy and fairness.