## Step 1: Load and Inspect the Data  

**What we expect:** We will load the dataset containing TikTok video performance metrics to understand its structure and review sample records. We expect to see columns such as play counts, comment counts, share counts, like counts, and follower counts.  

**Instructions:** Load the CSV file `tiktok_video_performance_v2.csv` into a pandas DataFrame. Display the first few rows and review the shape of the data to understand the number of samples and features available.


In [None]:
# Load the dataset
import pandas as pd

df = pd.read_csv('tiktok_video_performance_v2.csv')
print('Dataset shape:', df.shape)
df.head()

### What we learned from Step 1  

Loading the dataset gives us the initial shape and a preview of the data. We can see the number of rows (videos) and columns (features). This helps us plan feature engineering and model training in later steps.
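Beyond the shape and the first rows, a per-column summary of types and missing values can make the later feature selection easier to plan. The sketch below is one possible way to do that; the helper name `summarize_columns` and the sample rows are hypothetical stand-ins, not part of the dataset. In the notebook you would pass the loaded `df` instead.

```python
import pandas as pd


def summarize_columns(frame: pd.DataFrame) -> pd.DataFrame:
    """Return dtype, missing-value count, and distinct-value count per column."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "n_missing": frame.isna().sum(),
        "n_unique": frame.nunique(),  # NaN is not counted as a distinct value
    })


# Hypothetical stand-in rows (assumption); replace `sample` with `df` in the notebook.
sample = pd.DataFrame({
    "play_count": [1200, 560, None, 98000],
    "like_count": [300, 45, 12, 15000],
    "author_verified": [False, False, True, True],
})

summary = summarize_columns(sample)
print(summary)
```

Columns with many missing values or an unexpected dtype (for example, counts stored as text) are worth cleaning before Step 2.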

## Step 2: Engineer Features and Target  

**What we expect:** We will create a target variable `popular` based on whether a video’s like count is at or above the median. We also select relevant numerical features that might predict popularity, such as play count, comment count, share count, and follower count.  

**Instructions:** Compute the median of the `like_count` column to define a binary target. Then select a subset of feature columns that will be used to train our model. Display basic statistics of the features to understand their distribution.


In [None]:
# Engineer target and select features

# Create target: 1 if like_count >= median, else 0
median_likes = df['like_count'].median()
df['popular'] = (df['like_count'] >= median_likes).astype(int)

# Select relevant numerical features
feature_cols = ['play_count', 'comment_count', 'share_count', 'author_follower_count']
X = df[feature_cols]
y = df['popular']

# Show basic statistics of features
X.describe()

### What we learned from Step 2  

By creating the `popular` target and selecting numerical features, we prepared the dataset for model training. The summary statistics show the scale and spread of each feature. Random Forests do not require feature scaling, so these statistics are mainly useful for spotting outliers, implausible values, and missing data before training.
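Because the target uses `>=`, every video whose like count equals the median lands in the positive class, so the split is not guaranteed to be 50/50. The following sketch illustrates this with hypothetical like counts; the helper `median_split` is an illustrative name, not something the notebook defines.

```python
import pandas as pd


def median_split(values: pd.Series) -> pd.Series:
    """Label each value 1 if it is at or above the series median, else 0."""
    return (values >= values.median()).astype(int)


# Hypothetical like counts with several ties at the median value (assumption).
likes = pd.Series([0, 5, 5, 5, 5, 9, 40])
labels = median_split(likes)

# Share of each class; with ties at the median, class 1 can be much larger.
balance = labels.value_counts(normalize=True).sort_index()
print(balance)
```

Running `df['popular'].value_counts(normalize=True)` in the notebook gives the same check on the real data.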

## Step 3: Train a Random Forest Classifier  

**What we expect:** Using a tree-based ensemble model like Random Forest should capture non-linear relationships between features and the target. We expect the model to perform reasonably well on the classification task.  

**Instructions:** Split the data into training and test sets. Train a `RandomForestClassifier` on the training data. After training, output the first few predictions on the test set to see how the model classifies examples.


In [None]:
# Train a Random Forest classifier
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and train the model
rf_clf = RandomForestClassifier(n_estimators=100, random_state=42)
rf_clf.fit(X_train, y_train)

# Predict on the test set
y_pred = rf_clf.predict(X_test)
y_pred[:10]

### What we learned from Step 3  

Training the Random Forest classifier provides us with a predictive model capable of classifying videos as popular or not based on the selected features. The first few predictions show the output format (an array of 0/1 labels) and confirm that training completed; they say nothing yet about accuracy, which Step 4 measures.
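The split above is purely random. If the classes turn out to be imbalanced, one common variant is a stratified split, and `predict_proba` can be used to see how confident the model is rather than only the 0/1 label. The sketch below shows both on hypothetical stand-in data (the names `X_demo`, `y_demo`, and the generated play counts are assumptions, not the notebook's data).

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 200 videos, roughly 30% labelled popular.
rng = np.random.default_rng(0)
plays = rng.integers(0, 10_000, size=200)
X_demo = pd.DataFrame({"play_count": plays})
y_demo = pd.Series((plays >= np.percentile(plays, 70)).astype(int), name="popular")

# stratify=y_demo keeps the class proportions nearly equal in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)
print("popular share, train vs. test:", y_tr.mean(), y_te.mean())

model = RandomForestClassifier(n_estimators=50, random_state=42)
model.fit(X_tr, y_tr)

# Columns follow model.classes_ ([0, 1] here), so column 1 is P(popular = 1).
proba_popular = model.predict_proba(X_te)[:, 1]
print(proba_popular[:5])
```

In the notebook, the equivalent change would be adding `stratify=y` to the existing `train_test_split` call and calling `rf_clf.predict_proba(X_test)`.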

## Step 4: Evaluate the Model and Interpret Results  

**What we expect:** A classification report and confusion matrix will show how well the model distinguishes popular videos. Feature importances will indicate which metrics are most influential in predicting popularity.  

**Instructions:** Generate a classification report and confusion matrix for the test set. Also extract feature importances from the trained model and display them in a table. Interpret these metrics to understand the model’s strengths and weaknesses.


In [None]:
# Evaluate the model and interpret results
from sklearn.metrics import classification_report, confusion_matrix

# Classification report
report = classification_report(y_test, y_pred, output_dict=False)
print(report)

# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
print('Confusion matrix:')
print(cm)

# Feature importances
importances = rf_clf.feature_importances_
feature_importances = pd.DataFrame({'feature': feature_cols, 'importance': importances})
feature_importances.sort_values(by='importance', ascending=False)


### What we learned from Step 4  

The classification report and confusion matrix show how well the model separates popular from non-popular videos on the test set. Precision is the share of videos predicted popular that really are popular; recall is the share of popular videos that the model finds. The confusion matrix gives the underlying counts of true negatives, false positives, false negatives, and true positives. Feature importances show which variables the Random Forest relies on most when splitting; they describe the model’s behavior and are not evidence that a variable causes popularity.
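To make the link between the confusion matrix and the report concrete, the sketch below derives precision and recall by hand from the four cell counts and compares them with scikit-learn's own scores. The labels are hypothetical, chosen only to keep the arithmetic easy to follow.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical true labels and predictions for ten videos (assumption).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_hat = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# For binary labels, rows are true classes and columns are predicted classes,
# so ravel() yields the cells in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

precision = tp / (tp + fp)  # of the videos predicted popular, how many are
recall = tp / (tp + fn)     # of the popular videos, how many were found

print(f"tn={tn} fp={fp} fn={fn} tp={tp}")
print(f"precision={precision:.2f} recall={recall:.2f}")
```

The same unpacking applies to the `cm` array computed in the cell above.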

## Step 5: Fairness Check  

**What we expect:** We want to check whether the model treats creators differently based on verification status. One simple criterion is that the positive prediction rate should be similar for verified and non-verified creators (often called demographic parity); it is only one of several possible fairness definitions.  

**Instructions:** If the dataset includes an `author_verified` column, append predictions to the test set and compute the mean predicted popularity for each group. Compare these rates to identify any potential bias. If the column is missing, note that fairness cannot be evaluated.


In [None]:
# Fairness check
if 'author_verified' in df.columns:
    # Append predictions to test set
    X_test_with_group = X_test.copy()
    X_test_with_group['author_verified'] = df.loc[X_test.index, 'author_verified']
    X_test_with_group['pred'] = y_pred

    # Compute positive rate for each group
    rates = X_test_with_group.groupby('author_verified')['pred'].mean()
    print("Positive prediction rate by group:")
    print(rates)
else:
    print("The column 'author_verified' does not exist in the dataset, so we cannot perform a fairness check.")

### What we learned from Step 5  

If the `author_verified` column exists, the output shows the positive prediction rate for verified and non-verified creators. A large gap between these rates may indicate bias, but it can also reflect a real difference in how often each group’s videos are popular, so compare it with the actual rate of `popular` in each group before drawing conclusions. If the column is absent, the fairness check cannot be performed and we note that the dataset does not include this information.
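The positive prediction rate alone does not show whether the model is equally accurate for both groups. A possible extension, sketched below under stated assumptions, also reports each group's size and true positive rate (recall). The helper `group_rates` and all labels are hypothetical; in the notebook the inputs would be `y_test`, `y_pred`, and the `author_verified` values for the test rows.

```python
import pandas as pd


def group_rates(y_true, y_pred, group) -> pd.DataFrame:
    """Per-group size, positive prediction rate, and true positive rate."""
    frame = pd.DataFrame({
        "y_true": list(y_true),
        "y_pred": list(y_pred),
        "group": list(group),
    })
    by_group = frame.groupby("group")
    # Recall per group: mean prediction among rows whose true label is 1.
    tpr = frame[frame["y_true"] == 1].groupby("group")["y_pred"].mean()
    return pd.DataFrame({
        "n": by_group.size(),
        "positive_rate": by_group["y_pred"].mean(),
        "true_positive_rate": tpr,
    })


# Hypothetical labels and verification flags for eight test videos (assumption).
y_true = [1, 1, 0, 0, 1, 0, 1, 0]
y_hat = [1, 1, 1, 0, 1, 0, 0, 0]
verified = [True, True, True, True, False, False, False, False]

rates = group_rates(y_true, y_hat, verified)
parity_gap = rates["positive_rate"].max() - rates["positive_rate"].min()

print(rates)
print("positive-rate gap:", parity_gap)
```

Small groups produce noisy rates, so the `n` column is worth reading before interpreting any gap.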

## Conclusion  

In this notebook we built a pipeline to classify whether a TikTok video is popular, defined as a like count at or above the median, based on features such as play count, comment count, share count, and follower count. We followed a structured approach: loading and exploring the data, engineering features, training a Random Forest classifier, evaluating its performance, interpreting feature importance, and checking for fairness. This step-by-step process produced a working baseline model and included a basic fairness check on verification status, which is a starting point for, not a substitute for, a fuller responsible-AI review.