# Unlocking YouTube Channel Performance Secrets — Notebook

This notebook follows the project described in the provided PDF to analyze YouTube channel performance and build a predictive model for **Estimated Revenue (USD)**. It includes import, cleaning, feature engineering, visualization, modeling, evaluation, and model export steps.

In [None]:
# 1) Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.impute import SimpleImputer
import joblib
warnings.filterwarnings('ignore')
print('Libraries imported')

**Description:** Import the libraries used for data handling, visualization, modeling, and saving artifacts. This matches the PDF's recommended stack.

In [None]:
# 2) Load dataset
# Update this path to your CSV file downloaded from the link in the PDF
data_path = 'youtube_channel_real_performance_analytics.csv'
try:
    df = pd.read_csv(data_path)
    print('Loaded data with shape:', df.shape)
    display(df.head())
except FileNotFoundError:
    print('File not found. Please download the dataset from the PDF link and set data_path accordingly.')

**Description:** Load the dataset (the PDF shows an example Kaggle CSV). If you haven't downloaded it yet, use the Google Drive link provided in the PDF.

In [None]:
# 3) Basic info and missing values
if 'df' in globals():
    df.info()  # info() prints its summary directly and returns None
    print('\nMissing values per column:')
    print(df.isnull().sum().sort_values(ascending=False).head(30))
    display(df.describe().T)

**Description:** Inspect data types, missing values, and summary statistics to plan cleaning. The PDF indicates there are 70 columns and no nulls in the sample, but always check your own copy.

In [None]:
# 4) Parse datetimes & ensure numeric duration
if 'df' in globals():
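    # Convert publish time; unparseable values become NaT instead of raising,
    # so the .dt accessors used in the next step still work
    df['Video Publish Time'] = pd.to_datetime(df['Video Publish Time'], errors='coerce')
    print('Missing or unparseable publish times:', df['Video Publish Time'].isna().sum())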

    # Ensure Video Duration is numeric (seconds) as in the PDF
    if df['Video Duration'].dtype == object:
        # try to coerce
        df['Video Duration'] = pd.to_numeric(df['Video Duration'], errors='coerce')

    print('Video Publish Time dtype:', df['Video Publish Time'].dtype)
    print('Video Duration dtype:', df['Video Duration'].dtype)
    display(df[['Video Duration','Video Publish Time']].head())

**Description:** Convert 'Video Publish Time' to datetime and ensure 'Video Duration' is numeric (seconds). The PDF shows both columns as present and important.
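
To see what `errors='coerce'` does with malformed entries, here is a minimal sketch on a toy frame. The values are invented for illustration and do not come from the dataset.

```python
import pandas as pd

# Toy rows (hypothetical values) with one malformed entry per column
toy = pd.DataFrame({
    'Video Publish Time': ['2021-03-01 14:30:00', 'not a date'],
    'Video Duration': ['201.0', 'n/a'],
})

# errors='coerce' turns unparseable entries into NaT / NaN instead of raising
toy['Video Publish Time'] = pd.to_datetime(toy['Video Publish Time'], errors='coerce')
toy['Video Duration'] = pd.to_numeric(toy['Video Duration'], errors='coerce')

print(toy.dtypes)
print(toy.isna().sum())
```

Counting the resulting missing values, as the last line does, is a quick way to check how many rows failed to parse before moving on.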

In [None]:
# 5) Feature engineering
if 'df' in globals():
    # Revenue per View
    df['Revenue_per_View'] = df['Estimated Revenue (USD)'] / df['Views'].replace(0, np.nan)
    # Engagement Rate (Likes + Shares + Comments) / Views *100
    for col in ['Likes','Shares','Comments']:
        if col not in df.columns:
            df[col] = 0
    df['Engagement_Rate'] = (df['Likes'] + df['Shares'] + df['Comments']) / df['Views'].replace(0, np.nan) * 100
    # Average View Percentage and Average View Duration exist in the dataset and are useful
    # Extract publish hour, dayofweek
    df['publish_hour'] = df['Video Publish Time'].dt.hour
    df['publish_dayofweek'] = df['Video Publish Time'].dt.day_name()
    display(df[['Revenue_per_View','Engagement_Rate','publish_hour','publish_dayofweek']].head())

**Description:** Create `Revenue_per_View`, `Engagement_Rate`, and extract publish hour/day. These features often drive revenue and engagement as suggested in the PDF.
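
As a worked example of the two ratio features, the sketch below applies the same formulas to two made-up rows. The second row has zero views, to show how replacing 0 with NaN avoids a division by zero.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; the second has zero views to exercise the division guard
toy = pd.DataFrame({
    'Views': [1000, 0],
    'Likes': [40, 0],
    'Shares': [5, 0],
    'Comments': [5, 0],
    'Estimated Revenue (USD)': [2.5, 0.0],
})

views = toy['Views'].replace(0, np.nan)  # zero views -> NaN rather than inf
toy['Revenue_per_View'] = toy['Estimated Revenue (USD)'] / views
toy['Engagement_Rate'] = (toy['Likes'] + toy['Shares'] + toy['Comments']) / views * 100

print(toy[['Revenue_per_View', 'Engagement_Rate']])
```

For the first row this gives 2.5 / 1000 = 0.0025 USD per view and (40 + 5 + 5) / 1000 * 100 = 5.0 percent engagement. The zero-view row yields NaN for both features.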

In [None]:
# 6) EDA: distributions and correlation
if 'df' in globals():
    plt.figure(figsize=(10,5))
    sns.histplot(df['Estimated Revenue (USD)'], bins=40, kde=True)
    plt.title('Estimated Revenue Distribution')
    plt.show()

    plt.figure(figsize=(10,5))
    sns.scatterplot(x='Views', y='Estimated Revenue (USD)', data=df, alpha=0.6)
    plt.title('Revenue vs Views')
    plt.xscale('log')
    plt.yscale('log')
    plt.show()

    # Correlation heatmap for numeric features
    numeric_df = df.select_dtypes(include=[np.number])
    corr = numeric_df.corr()
    plt.figure(figsize=(12,10))
    sns.heatmap(corr, cmap='coolwarm', center=0)
    plt.title('Correlation matrix (numeric)')
    plt.show()

**Description:** Visualize the revenue distribution, revenue vs. views, and the correlation matrix. The PDF recommends all three for identifying strong predictors.
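
A heatmap over many numeric columns can be hard to read. A compact complement is to rank features by their absolute correlation with the target. The sketch below does this on synthetic data; the `Noise` column and the 0.002 revenue-per-view slope are assumptions made up for the example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
views = rng.uniform(1_000, 100_000, n)

# Synthetic frame: revenue depends on views, 'Noise' is unrelated by construction
toy = pd.DataFrame({
    'Views': views,
    'Noise': rng.normal(size=n),
    'Estimated Revenue (USD)': views * 0.002 + rng.normal(scale=5, size=n),
})

target = 'Estimated Revenue (USD)'
ranked = (
    toy.corr()[target]
    .drop(target)                              # drop the target's self-correlation
    .sort_values(key=np.abs, ascending=False)  # strongest relationships first
)
print(ranked)
```

On the real frame the same pattern would be `df.select_dtypes(include=[np.number]).corr()[target]`. Keep in mind that Pearson correlation only captures linear relationships.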

In [None]:
# 7) Prepare features & target
if 'df' in globals():
    # Choose a reasonable set of features (can expand later)
    features = [
        'Views','Subscribers','Likes','Shares','Comments',
        'Average View Duration','Average View Percentage (%)',
        'Video Duration','Impressions','Video Thumbnail CTR (%)',
        'publish_hour','Engagement_Rate'
    ]
    # some column names may have slightly different spellings in your CSV; adapt as needed
    present_features = [f for f in features if f in df.columns]
    print('Using features:', present_features)
    X = df[present_features].copy()
    y = df['Estimated Revenue (USD)']

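    # Split first, then fit the imputer on the training rows only so that
    # test-set statistics do not leak into preprocessing
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    imputer = SimpleImputer(strategy='median')
    X_train = pd.DataFrame(imputer.fit_transform(X_train), columns=X.columns, index=X_train.index)
    X_test = pd.DataFrame(imputer.transform(X_test), columns=X.columns, index=X_test.index)
    print('Train shape:', X_train.shape, 'Test shape:', X_test.shape)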

**Description:** Select features for the regression target (Estimated Revenue) and impute missing numeric values with the median. The PDF uses a similar set for modeling.
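
One way to guarantee that imputation statistics come only from training rows is to wrap the imputer and the model in a scikit-learn `Pipeline`. The following is a minimal sketch on synthetic data, not the notebook's actual setup. The names `X_toy`, `y_toy`, and `pipe` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(42)
X_toy = pd.DataFrame({
    'Views': rng.uniform(100, 10_000, 120),
    'Likes': rng.uniform(0, 500, 120),
})
y_toy = X_toy['Views'] * 0.003 + rng.normal(scale=0.5, size=120)
X_toy.loc[::10, 'Likes'] = np.nan  # inject some missing values

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)

pipe = Pipeline([
    ('impute', SimpleImputer(strategy='median')),
    ('rf', RandomForestRegressor(n_estimators=50, random_state=42)),
])
pipe.fit(X_tr, y_tr)  # the imputer's medians are learned from X_tr only
print('Test R2:', round(pipe.score(X_te, y_te), 3))
```

A pipeline also keeps preprocessing and model together as a single object, which simplifies cross-validation and export.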

In [None]:
# 8) Train Random Forest Regressor
if 'X_train' in globals():
    model = RandomForestRegressor(n_estimators=200, random_state=42, n_jobs=-1)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    rmse = np.sqrt(mse)
    r2 = r2_score(y_test, y_pred)
    print(f'RMSE: {rmse:.4f}')
    print(f'R2: {r2:.4f}')

**Description:** Train a baseline Random Forest regressor and evaluate using RMSE and R². The PDF demonstrates similar modeling steps.
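
To make the two metrics concrete, the sketch below computes them by hand on four made-up revenue values and compares the results with scikit-learn's functions. RMSE is the square root of the mean squared error. R² is 1 minus the ratio of the residual sum of squares to the total sum of squares around the mean.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

y_true = np.array([10.0, 20.0, 30.0, 40.0])  # hypothetical revenues (USD)
y_hat = np.array([12.0, 18.0, 33.0, 37.0])   # hypothetical predictions

# Squared errors are 4, 4, 9, 9 -> mean 6.5 -> RMSE = sqrt(6.5)
rmse = np.sqrt(np.mean((y_true - y_hat) ** 2))

# Residual sum of squares is 26; total sum of squares around the mean (25) is 500
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot  # 1 - 26/500 = 0.948

print(f'RMSE: {rmse:.4f}  (sklearn: {np.sqrt(mean_squared_error(y_true, y_hat)):.4f})')
print(f'R2:   {r2:.4f}  (sklearn: {r2_score(y_true, y_hat):.4f})')
```

RMSE is expressed in the target's units (USD here), so it is easiest to judge against the typical revenue per video. R² is unitless and compares the model with always predicting the mean.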

In [None]:
# 9) Feature importance
if 'model' in globals():
    importances = model.feature_importances_
    fi = pd.DataFrame({'feature': X_train.columns, 'importance': importances}).sort_values('importance', ascending=False)
    display(fi)
    plt.figure(figsize=(8,6))
    sns.barplot(x='importance', y='feature', data=fi.head(15))
    plt.title('Top features by importance')
    plt.show()

**Description:** Show which features the model finds most predictive of revenue. Use this to generate recommendations (e.g., improve thumbnail CTR or average view duration), treating them as hypotheses to test, since importance reflects association rather than causation.
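
The impurity-based `feature_importances_` values are computed from the training data and can favor features with many distinct values. A common cross-check is permutation importance on held-out data. Below is a minimal sketch on synthetic data, where `Noise` is an unrelated column added purely for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X_toy = pd.DataFrame({
    'Views': rng.uniform(100, 10_000, 300),
    'Noise': rng.normal(size=300),  # unrelated to the target by construction
})
y_toy = X_toy['Views'] * 0.003 + rng.normal(scale=0.5, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.25, random_state=0)
rf = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# Shuffle each column of the test set in turn and measure the drop in score
result = permutation_importance(rf, X_te, y_te, n_repeats=5, random_state=0)
perm = pd.Series(result.importances_mean, index=X_te.columns).sort_values(ascending=False)
print(perm)
```

If the two rankings disagree strongly on the real data, the permutation result on the test set is usually the safer basis for recommendations.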

In [None]:
# 10) Save model and preprocessing objects
if 'model' in globals():
    joblib.dump(model, 'youtube_revenue_rf.pkl')
    joblib.dump(imputer, 'youtube_imputer.pkl')
    print('Saved model and imputer: youtube_revenue_rf.pkl, youtube_imputer.pkl')

**Description:** Persist the trained model and imputer for later prediction or deployment. The PDF suggests exporting artifacts.
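
Before relying on a saved artifact, it can help to confirm that it reloads and reproduces the same predictions. The sketch below runs such a round trip on a small synthetic model. The file name `toy_rf.pkl` and the temporary directory are placeholders, not paths used by this notebook.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
X_toy = rng.uniform(0, 1, size=(50, 3))
y_toy = X_toy @ np.array([1.0, 2.0, 3.0])

rf = RandomForestRegressor(n_estimators=20, random_state=1).fit(X_toy, y_toy)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'toy_rf.pkl')  # hypothetical file name
    joblib.dump(rf, path)
    restored = joblib.load(path)

# The restored model should give the same predictions as the original
same = np.allclose(rf.predict(X_toy), restored.predict(X_toy))
print('Predictions identical after reload:', same)
```

Pickled scikit-learn models are not guaranteed to load across library versions, so it is worth recording the scikit-learn version alongside the files.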

In [None]:
# 11) Predict helper example
def predict_sample(sample_dict):
    # Keys should match present_features; any missing feature is set to NaN and imputed
    s = pd.DataFrame([sample_dict])
    # ensure columns
    for c in X.columns:
        if c not in s.columns:
            s[c] = np.nan
    s = s[X.columns]
    s_imp = pd.DataFrame(imputer.transform(s), columns=s.columns)
    model_loaded = joblib.load('youtube_revenue_rf.pkl')
    pred = model_loaded.predict(s_imp)[0]
    return pred

# Example usage (uncomment and update values):
# sample = {'Views':10000,'Subscribers':50,'Likes':200,'Shares':5,'Comments':10,'Average View Duration':120,'Average View Percentage (%)':50,'Video Duration':300,'Impressions':15000,'Video Thumbnail CTR (%)':3,'publish_hour':15,'Engagement_Rate':2}
# print('Predicted Estimated Revenue (USD):', predict_sample(sample))

**Description:** Helper function that preprocesses a single sample and returns the predicted revenue using the saved artifacts. Update `sample` with your own values and run.

## Recommendations & Next steps

- Perform hyperparameter tuning (GridSearchCV / RandomizedSearchCV) to improve model quality.
- Add time-series analysis if you want to forecast revenue over time per channel.
- Use advanced features: text features from title/description, thumbnail images (computer vision), or sequence models for temporal trends.
- Create dashboards (Streamlit / Dash / Tableau) to present findings to creators.
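
The first recommendation, hyperparameter tuning, could be sketched as follows with `GridSearchCV`. The grid values and the synthetic data are assumptions chosen to keep the example fast; a real search on the dataset would likely use a wider grid and more folds.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(7)
X_toy = rng.uniform(0, 1, size=(150, 4))
y_toy = 5 * X_toy[:, 0] + 2 * X_toy[:, 1] + rng.normal(scale=0.1, size=150)

# Small illustrative grid (hypothetical values)
param_grid = {
    'n_estimators': [50, 100],
    'max_depth': [None, 5],
}
search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    cv=3,
    scoring='neg_root_mean_squared_error',  # higher is better, so RMSE is negated
)
search.fit(X_toy, y_toy)

print('Best params:', search.best_params_)
print('Best CV RMSE:', -search.best_score_)
```

In the notebook, `X_toy` and `y_toy` would be replaced by `X_train` and `y_train`, keeping the test set untouched until the final evaluation.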

Refer to the PDF for the original project outline and dataset description.