## Step 1: Load and Inspect the Data  

**What we expect:** We will load the dataset containing TikTok video performance metrics to understand its structure and review sample records. We expect to see columns such as Views, Likes, Comments, Shares, Duration, Hashtags_Count, and follower counts (if available).  

**Instructions:** Load the CSV file `tiktok_video_performance_v2.csv` into a pandas DataFrame using `pandas.read_csv()`. Display the first few rows and review the shape of the data to understand the number of samples and features available.

In [ ]:
# Load the dataset
import pandas as pd

df = pd.read_csv('tiktok_video_performance_v2.csv')
print('Dataset shape:', df.shape)
df.head()

**What we learned from Step 1**  

Loading the dataset gives us the initial shape and a preview of the data. We can see the number of rows (videos) and columns (features). This helps us plan feature engineering and model training in later steps.
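
Beyond `shape` and `head()`, a quick look at dtypes and missing values can be useful before feature engineering. The sketch below uses a small hand-made DataFrame as a stand-in for the real file; its column names and values are illustrative assumptions, not the actual contents of `tiktok_video_performance_v2.csv`, and `summarize` is a hypothetical helper.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real dataset; values are made up for illustration.
sample = pd.DataFrame({
    'Views': [1200, 560, 98000, 4300],
    'Likes': [150, 40, 12000, np.nan],
    'Comments': [12, 3, 800, 45],
    'Shares': [5, 0, 430, 20],
})

def summarize(frame):
    """Return one row per column with its dtype and missing-value count."""
    return pd.DataFrame({
        'dtype': frame.dtypes.astype(str),
        'n_missing': frame.isna().sum(),
    })

summary = summarize(sample)
print(summary)
```

On the real data, the same helper could be called as `summarize(df)` to spot columns that need cleaning before Step 2.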

## Step 2: Engineer Features and Target  

**What we expect:** We will create a target variable **popular** based on whether a video’s like count is at or above the median. We also select relevant numerical features that might predict popularity, such as Views, Comments, Shares, and Follower count (if available).  

**Instructions:** Compute the median of the `Likes` column to define a binary target. Then select a subset of feature columns that will be used to train our model. Display basic statistics of the features to understand their distribution.

In [ ]:
# Engineer target and select features

# Create target: 1 if Likes >= median, else 0
median_likes = df['Likes'].median()
df['popular'] = (df['Likes'] >= median_likes).astype(int)

# Select relevant numerical features
feature_cols = ['Views', 'Comments', 'Shares']
if 'Followers' in df.columns:
    feature_cols.append('Followers')

X = df[feature_cols]
y = df['popular']

# Show basic statistics of features
X.describe()

**What we learned from Step 2**  

By creating the **popular** target based on the median of likes and selecting numerical features, we prepared the dataset for model training. The summary statistics provide insights into the scale and distribution of each feature, which may help when tuning models.
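
Because the target is defined with `>=`, every video tied at the median is labeled popular, so the two classes are not guaranteed to split exactly 50/50. A minimal sketch with made-up like counts (not taken from the dataset) shows one way to check the balance:

```python
import pandas as pd

# Hypothetical like counts; the real values come from the dataset's Likes column.
likes = pd.Series([10, 25, 25, 25, 25, 40, 300])

median_likes = likes.median()
popular = (likes >= median_likes).astype(int)

# Share of each class; ties at the median are all labeled 1.
balance = popular.value_counts(normalize=True).sort_index()
print('Median likes:', median_likes)
print(balance)
```

If the real data shows a strong skew of this kind, a strict `>` threshold or a stratified split may be worth considering.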

## Step 3: Train a Random Forest Classifier  

**What we expect:** We'll split the data into training and test sets, then train a Random Forest classifier to predict video popularity. We expect the model to capture relationships between features (views, comments, shares, followers) and the popularity target.  

**Instructions:** Use `train_test_split` to create training and test sets. Instantiate a `RandomForestClassifier`, fit it on the training data, and generate predictions for the test set.

In [ ]:
# Train a Random Forest classifier
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train model
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# Predictions
y_pred = rf.predict(X_test)
y_pred[:10]

**What we learned from Step 3**  

Training the Random Forest classifier allows us to capture nonlinear relationships between the features and the target. The model outputs a prediction for whether a video is likely to be popular based on the features. At this stage, we have a trained model and some initial predictions.
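
A single train/test split can give a noisy picture of performance. The sketch below is one possible variation, not the workflow used above: it stratifies the split and adds 5-fold cross-validation on the training portion. It runs on synthetic data from `make_classification` as a stand-in for the TikTok features, and the names `X_demo`, `y_demo`, and `model` are placeholders.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in for X and y; the real notebook would use the TikTok features.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=4, n_informative=3, n_redundant=1, random_state=42
)

# stratify keeps the class proportions similar in the train and test sets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

model = RandomForestClassifier(n_estimators=100, random_state=42)

# 5-fold cross-validated accuracy, computed on the training portion only.
cv_scores = cross_val_score(model, X_tr, y_tr, cv=5, scoring='accuracy')
print('CV accuracy: %.3f +/- %.3f' % (cv_scores.mean(), cv_scores.std()))
```

Keeping the test set out of cross-validation leaves it available for a final, untouched evaluation.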

## Step 4: Evaluate the Model and Interpret Results  

**What we expect:** We will assess the model’s performance on the test set using metrics such as accuracy, precision, recall, and F1-score. We will also visualize the confusion matrix and examine feature importances to understand which variables influence the model most.  

**Instructions:** Use `classification_report` and `confusion_matrix` from scikit-learn to evaluate the predictions. Plot the confusion matrix using `matplotlib` and compute feature importances from the Random Forest. Present the results in a readable format.

In [ ]:
# Evaluate model performance and interpret results
from sklearn.metrics import classification_report, confusion_matrix
from IPython.display import display  # explicit import so the cell also runs outside Jupyter
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

# Classification report
report = classification_report(y_test, y_pred, output_dict=True)
report_df = pd.DataFrame(report).transpose()
print('Classification Report:')
display(report_df)

# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
plt.figure()
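# Rows are actual classes, columns are predicted classes (0 = not popular, 1 = popular)
labels = ['Not popular', 'Popular']
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=labels, yticklabels=labels)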
plt.title('Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()

# Feature importances
importances = rf.feature_importances_
importance_df = pd.DataFrame({'feature': feature_cols, 'importance': importances}).sort_values(by='importance', ascending=False)
print('Feature Importances:')
display(importance_df)

**What we learned from Step 4**  

The classification report and confusion matrix show how accurately the model distinguishes popular from non-popular videos and which kind of error (false positives or false negatives) dominates. The feature importances highlight which metrics (e.g., views, comments, shares, followers) most influence the model’s predictions. Random Forest importances are impurity-based and can be skewed toward correlated or high-cardinality features, so they describe the model's behavior rather than what causes popularity.
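
As a cross-check on the built-in importances, permutation importance measures how much held-out accuracy drops when a feature's values are shuffled. The sketch below is an optional addition, not part of the steps above; it uses synthetic data and placeholder feature names (`f0` to `f3`) in place of the TikTok columns.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; the feature names are placeholders.
names = ['f0', 'f1', 'f2', 'f3']
X_demo, y_demo = make_classification(
    n_samples=300, n_features=4, n_informative=2, n_redundant=0, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Shuffle each feature on the held-out set and measure the drop in accuracy.
result = permutation_importance(model, X_te, y_te, n_repeats=10, random_state=0)

comparison = pd.DataFrame({
    'feature': names,
    'impurity_importance': model.feature_importances_,
    'permutation_importance': result.importances_mean,
}).sort_values('permutation_importance', ascending=False)
print(comparison)
```

When the two rankings disagree noticeably, the permutation scores are usually the safer guide because they are computed on data the model did not see during training.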

## Step 5: Fairness Check  

**What we expect:** We will check whether the model exhibits bias across different groups. If the dataset includes a column like `author_verified`, we can compare the positive prediction rates for verified vs. non-verified creators.  

**Instructions:** If the `author_verified` column exists, calculate the proportion of videos predicted as popular for both verified and non-verified creators. Report the difference and discuss potential bias.

In [ ]:
# Fairness check
if 'author_verified' in df.columns:
    df_test = X_test.copy()
    df_test['author_verified'] = df.loc[X_test.index, 'author_verified']
    df_test['prediction'] = y_pred
    # positive rate by verification status
    rates = df_test.groupby('author_verified')['prediction'].mean()
    print('Positive prediction rates by author verification status:')
    print(rates)
    if len(rates) == 2:
        bias = abs(rates.iloc[0] - rates.iloc[1])
        print('Difference in positive rates:', bias)
else:
    print('No author_verified column found; skipping fairness check.')

**What we learned from Step 5**  

By comparing positive prediction rates across groups (e.g., verified vs. non-verified creators), we can detect potential bias in the model. A large gap between groups may indicate that the model favors one group, although it can also reflect genuine differences in the underlying popularity rates. A small gap is only evidence of parity on this one metric; it does not rule out other disparities, such as different error rates per group.
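
Positive prediction rate is only one fairness metric. As an illustration, the sketch below also computes the true positive rate per group, using made-up labels and predictions rather than the model's output. The helper `group_rates` and the string group labels are assumptions for this example.

```python
import pandas as pd

def group_rates(y_true, y_pred, group):
    """Per-group selection rate and true positive rate (TPR)."""
    frame = pd.DataFrame({'y_true': y_true, 'y_pred': y_pred, 'group': group})
    selection = frame.groupby('group')['y_pred'].mean()
    # TPR: among truly popular videos, the share predicted popular.
    tpr = frame[frame['y_true'] == 1].groupby('group')['y_pred'].mean()
    return pd.DataFrame({'selection_rate': selection, 'tpr': tpr})

# Made-up labels and predictions for illustration only.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_hat = [1, 0, 1, 0, 0, 1, 1, 0]
status = ['verified'] * 4 + ['unverified'] * 4

rates_by_group = group_rates(y_true, y_hat, status)
print(rates_by_group)
gap = rates_by_group['selection_rate'].max() - rates_by_group['selection_rate'].min()
print('Selection-rate gap:', gap)
```

In this toy data both groups have the same selection rate, yet their true positive rates differ, which is why a single metric should not be read as a full fairness assessment.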

## Conclusion  

In this notebook, we built a pipeline to predict whether a TikTok video will be popular using a Random Forest classifier. We followed a step-by-step approach inspired by the Spotify workshop examples, including expectations and reflections for each step. We loaded and explored the dataset, engineered features and a target, trained a model, evaluated it, and performed a fairness check. The model's performance and feature importances show which metrics the model relies on when predicting popularity, and the fairness check tests whether positive prediction rates differ between groups such as verified and non-verified creators.