# Predicting High-Performance Social Posts (Binary) — Periospot

**Question:** Using only **pre-posting** signals, can we predict whether a post will be a **High Performer**?  

**Decisions:** Threshold chosen empirically during EDA; Tree ensembles (RF/XGBoost) as main tuned family.  

**Anti-leakage:** No post-outcome columns (impressions/engagements/reach) in features.


In [None]:
# TODO: imports
# import pandas as pd, numpy as np
# from pathlib import Path
# from src import features, labeling, evaluation, utils

# TODO: define paths
# ROOT = Path("..").resolve().parent / "periospot-ml-performance"
# RAW = ROOT / "data" / "raw"
# ART = ROOT / "artifacts"
# HINT: ensure ART exists
# ART.mkdir(parents=True, exist_ok=True)


## Load Data & Initial Checks

We load post-level data. We'll inspect columns, NA rates, and confirm which columns are "pre-posting" vs "post-outcome".


In [None]:
# TODO: load raw post performance CSV; inspect head, columns, dtypes, NA %
# df = pd.read_csv(RAW / "post_performance.csv", low_memory=False)
# df.head()
# df.columns.tolist()
# df.isna().mean().sort_values(ascending=False).head(30)


## Define Target Candidate(s)

We propose **Engagement Rate (per Impression)** as the continuous outcome to explore. We'll examine its distribution, detect outliers, and decide how to binarize later.


In [None]:
# TODO: verify the target column exists; explore distribution
# target_col = "Engagement Rate (per Impression)"
# assert target_col in df.columns
# df[target_col].describe()
# Plot histogram/log-hist (matplotlib)


## Identify Pre-Posting Features (No Leakage)

We will use **only features knowable before pressing publish**:

- Network, Post Type, Content Type, Profile (categorical)
- Posting timestamp → hour, weekday, month, season, year
- Caption text (Post): length, hashtag_count, mention_count, url_count
- Optional: rolling account-level activity windows (past 7/30/90 days aggregates) **computed from history prior to each post** (avoid peeking into future)

We explicitly **exclude**: impressions, reach, engagements, clicks, video views, saves, shares, etc. Any feature derived from those would leak.
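
The optional rolling windows in the list above are the easiest place to leak future information by accident. Here is a minimal sketch of a leakage-safe count, assuming a datetime `Date` column with no missing values and a `Profile` column. The helper name `prior_post_counts` and the `30D` default are illustrative, not part of the project code.

```python
import pandas as pd


def prior_post_counts(df, window="30D", profile_col="Profile", date_col="Date"):
    """For each post, count the same profile's posts in the window before it.

    closed="left" makes the window [t - window, t), so the post itself and
    anything published later never contribute.
    """
    counts = pd.Series(0.0, index=df.index)
    for _, group in df.sort_values(date_col).groupby(profile_col):
        ones = pd.Series(1.0, index=pd.DatetimeIndex(group[date_col]))
        rolled = ones.rolling(window, closed="left").sum().fillna(0.0)
        # Write back by original row label so the input order is preserved.
        counts.loc[group.index] = rolled.to_numpy()
    return counts


# Toy frame, deliberately out of date order to show index alignment.
demo = pd.DataFrame(
    {
        "Profile": ["A", "B", "A", "A"],
        "Date": pd.to_datetime(
            ["2024-01-10", "2024-01-05", "2024-01-01", "2024-03-01"]
        ),
    }
)
demo["posts_prev_30d"] = prior_post_counts(demo)
```

Only the first row has an earlier post by the same profile within 30 days, so it is the only row with a non-zero count.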


In [None]:
# TODO: parse date; keep only necessary columns at this stage
# - Parse 'Date' to datetime
# - Strip/normalize categorical text
# - Create a clean frame df_clean with a subset of columns we might use
# HINTS:
# df['Date'] = pd.to_datetime(df['Date'], errors='coerce')
# df['Network'] = df['Network'].str.strip()
# df['Post Type'] = df['Post Type'].str.strip()
# df['Content Type'] = df['Content Type'].str.strip()
# df['Profile'] = df['Profile'].str.strip()


## Feature Engineering (Pre-Posting Only)

We derive engineered features and keep an explicit list of predictors.


In [None]:
# TODO: implement feature builders in src/features.py and call them here
# - time features: hour, weekday, month, is_weekend
# - text features from 'Post': char_len, word_len, hashtag_count (#...), mention_count (@...), url_count (http/https), emoji_count (optional)
# - categorical encodings: one-hot or ordinal for Network, Post Type, Content Type, Profile
# - OPTIONAL: lag features by Profile (e.g., past-30-day posting frequency) — careful with temporal leakage
# HINTS:
# df_feat = features.build_preposting_features(df)
# predictors = [...]  # set by your builder
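
One possible shape for the builder named in the hint, as a sketch rather than the project's actual `src/features.py`. It assumes the columns are called `Date` and `Post`, as elsewhere in this notebook. The regular expressions are simple approximations (for example, `@\w+` also matches the domain part of an email address).

```python
import pandas as pd

URL_PATTERN = r"https?://\S+"


def build_preposting_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add time and caption features derived only from 'Date' and 'Post'."""
    out = df.copy()

    # Time features; unparseable dates become NaT and produce NaN
    # (is_weekend falls back to 0 for those rows).
    date = pd.to_datetime(out["Date"], errors="coerce")
    out["hour"] = date.dt.hour
    out["weekday"] = date.dt.weekday  # Monday=0 ... Sunday=6
    out["month"] = date.dt.month
    out["is_weekend"] = (date.dt.weekday >= 5).astype(int)

    # Caption features; missing captions are treated as empty strings.
    text = out["Post"].fillna("").astype(str)
    out["char_len"] = text.str.len()
    out["word_len"] = text.str.split().str.len()
    out["url_count"] = text.str.count(URL_PATTERN)
    # Remove URLs first so fragments like ".../#top" are not counted as hashtags.
    no_urls = text.str.replace(URL_PATTERN, " ", regex=True)
    out["hashtag_count"] = no_urls.str.count(r"#\w+")
    out["mention_count"] = no_urls.str.count(r"@\w+")
    return out


# Toy input with made-up captions.
demo = pd.DataFrame(
    {
        "Date": ["2024-06-01 09:30:00", "2024-06-03 18:00:00"],
        "Post": ["New case! #perio #implants @periospot https://example.com/post", None],
    }
)
feat = build_preposting_features(demo)
predictors = ["hour", "weekday", "month", "is_weekend", "char_len",
              "word_len", "url_count", "hashtag_count", "mention_count"]
```

Categorical encoding is left out here on purpose, since it fits more naturally inside the modelling pipeline, where the encoder is fitted on training rows only.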


## Train/Test Split Strategy (Temporal)

We simulate production by training on **past**, testing on **future**.  

Proposed split: Train on posts dated ≤ 2024-12-31; Test on posts dated ≥ 2025-01-01.


In [None]:
# TODO: implement temporal split
# cutoff = pd.Timestamp("2025-01-01")
# train_idx = df_feat['Date'] < cutoff
# test_idx  = df_feat['Date'] >= cutoff
# X_train, X_test = df_feat.loc[train_idx, predictors], df_feat.loc[test_idx, predictors]
# NOTE: the masks come from df_feat, so this only works if df_feat keeps df's index unchanged
# y_cont_train = df.loc[train_idx, target_col]
# y_cont_test  = df.loc[test_idx, target_col]
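
A small sketch of the split as a reusable function, assuming a `Date` column. The name `temporal_split` is hypothetical. Rows whose date cannot be parsed compare as False on both sides, so they end up in neither set and are worth counting before they silently disappear.

```python
import pandas as pd


def temporal_split(df, date_col="Date", cutoff="2025-01-01"):
    """Return boolean masks: train strictly before cutoff, test on or after it."""
    dates = pd.to_datetime(df[date_col], errors="coerce")
    cutoff_ts = pd.Timestamp(cutoff)
    train_mask = dates < cutoff_ts
    test_mask = dates >= cutoff_ts
    return train_mask, test_mask


# Toy frame: one post on each side of the cutoff and one with a missing date.
demo = pd.DataFrame({"Date": ["2024-12-31 23:59", "2025-01-01 00:00", None]})
train_mask, test_mask = temporal_split(demo)
n_dropped = int((~train_mask & ~test_mask).sum())  # rows in neither split
```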


## Choose Binary Threshold During EDA

From the **training set only**, pick a threshold rule:

- Percentile (e.g., top 20% engagement rate ⇒ label 1)
- Or a data-driven mixture/robust rule

We never use test data to define the threshold.


In [None]:
# TODO: implement make_label_from_percentile in src/labeling.py and apply to y_cont_train
# pct = 0.80  # example; tune after inspecting distribution
# y_train = labeling.make_label_from_percentile(y_cont_train, pct=pct)
# y_test  = (y_cont_test >= y_cont_train.quantile(pct)).astype(int)  # use train threshold for test
# Check positivity rate
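
A minimal sketch of the labeling helper, assuming it returns only the labels, as the hint above suggests. The companion `apply_threshold` is a hypothetical addition that makes it explicit that test labels reuse the training cutoff. Rows with a missing engagement rate compare as False and would be labeled 0, so drop them beforehand.

```python
import pandas as pd


def make_label_from_percentile(y_cont: pd.Series, pct: float = 0.80) -> pd.Series:
    """Label 1 where the value is at or above its own pct-quantile."""
    return (y_cont >= y_cont.quantile(pct)).astype(int)


def apply_threshold(y_cont: pd.Series, threshold: float) -> pd.Series:
    """Label 1 where the value is at or above a threshold fixed elsewhere."""
    return (y_cont >= threshold).astype(int)


# Made-up engagement rates, for illustration only.
y_cont_train = pd.Series([0.01, 0.02, 0.03, 0.04, 0.10])
y_cont_test = pd.Series([0.05, 0.06])

pct = 0.80
threshold = float(y_cont_train.quantile(pct))  # learned from train only
y_train = make_label_from_percentile(y_cont_train, pct=pct)
y_test = apply_threshold(y_cont_test, threshold)
positive_rate_train = float(y_train.mean())
```

With the default linear interpolation the cutoff falls between 0.04 and 0.10, so only the largest training value is labeled positive.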


## Baseline Models (Sanity)

Establish simple baselines:

- Majority class
- Logistic Regression (quick baseline)

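The majority-class baseline sets the floor: a ROC-AUC of 0.5 and a PR-AUC equal to the positive rate. Logistic regression should beat that floor, though probably by a modest margin.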


In [None]:
# TODO: majority baseline metrics; then fit LogisticRegression with simple preprocessing (OneHot for categoricals)
# HINT: use ColumnTransformer + Pipeline; score ROC-AUC, PR-AUC, F1
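
The two baselines could be wired up roughly as follows. The data is a synthetic stand-in, not Periospot data, and the column names are borrowed from this notebook only for readability. The last 100 rows play the role of the future test period.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data with an artificial label rule.
rng = np.random.default_rng(0)
n = 400
X = pd.DataFrame(
    {
        "Network": rng.choice(["Instagram", "X", "Threads"], size=n),
        "hour": rng.integers(0, 24, size=n),
        "char_len": rng.integers(10, 500, size=n),
    }
)
y = ((X["Network"] == "Instagram") & (X["hour"] >= 12)).astype(int)
X_train, X_test = X.iloc[:300], X.iloc[300:]
y_train, y_test = y.iloc[:300], y.iloc[300:]

# One-hot encode categoricals, scale numerics; unseen categories are ignored.
preprocess = ColumnTransformer(
    [
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["Network"]),
        ("num", StandardScaler(), ["hour", "char_len"]),
    ]
)
models = {
    "majority": DummyClassifier(strategy="most_frequent"),
    "logreg": Pipeline(
        [("prep", preprocess), ("clf", LogisticRegression(max_iter=1000))]
    ),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    proba = model.predict_proba(X_test)[:, 1]
    scores[name] = {
        "roc_auc": float(roc_auc_score(y_test, proba)),
        "pr_auc": float(average_precision_score(y_test, proba)),
    }
```

Keeping the encoder inside the `Pipeline` means it is fitted on training rows only, which matters again once cross-validation is introduced.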


## Tree Ensembles (RF, XGBoost)

Main family: tree ensembles.  

We'll start with untuned models, then GridSearchCV on key hyperparameters.


In [None]:
# TODO: fit RandomForestClassifier and XGBClassifier with default-ish params
# - Compute predicted probabilities on test
# - Evaluate ROC-AUC and PR-AUC


## Hyperparameter Tuning

Use GridSearchCV (or RandomizedSearchCV) with **Stratified K-Fold** on the training set.  

Keep the search space small: with many rows, every added grid combination costs one full model fit per fold.


In [None]:
# TODO: GridSearchCV for RF and XGB
# HINTS:
# - RF: n_estimators, max_depth, min_samples_split, max_features
# - XGB: n_estimators, max_depth, learning_rate, subsample, colsample_bytree, reg_alpha, reg_lambda
# - Use scoring='average_precision' or 'roc_auc'
# Save best params to artifacts
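
A sketch of the random-forest search on synthetic data. The grid values are placeholders chosen to run quickly, not recommendations. `XGBClassifier` follows the scikit-learn estimator interface, so it should slot into the same `GridSearchCV` call with its own grid, but that part is left out here to avoid an extra dependency.

```python
import json

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic, imbalanced stand-in for the training set.
X_train, y_train = make_classification(
    n_samples=300, n_features=8, n_informative=4, weights=[0.8, 0.2], random_state=0
)

# Deliberately tiny grid: 2 * 2 * 2 = 8 combinations, 3 folds each.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [3, None],
    "min_samples_split": [2, 10],
}
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    scoring="average_precision",
    cv=cv,
)
search.fit(X_train, y_train)

# best_params_ holds plain Python values here, so it serializes directly.
best_params_json = json.dumps(search.best_params_, indent=2)
```

After fitting, `search.best_estimator_` is already refit on the full training set with the winning parameters.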


## Threshold Tuning for Business Goal

Business objective: **maximize recall of High Performers** at an acceptable precision.  

We'll tune the decision threshold on validation folds or on the train set via cross-val predictions.


In [None]:
# TODO: produce y_proba on validation folds (e.g., cross-val predictions on train), never on the test set; then scan thresholds
# Choose a threshold that achieves target recall (e.g., ≥ 0.75) with the highest precision
# evaluation.threshold_by_recall(...)
# Report threshold, precision, recall, F1
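
One way the threshold scan might look, built on scikit-learn's `precision_recall_curve`. The function name matches the `evaluation.threshold_by_recall` placeholder above, but its signature and return value are assumptions.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve


def threshold_by_recall(y_true, y_proba, target_recall=0.75):
    """Among thresholds reaching target_recall, pick the most precise one."""
    precision, recall, thresholds = precision_recall_curve(y_true, y_proba)
    # The curve ends with an extra (precision=1, recall=0) point that has no
    # threshold attached, so drop it to align the three arrays.
    precision, recall = precision[:-1], recall[:-1]
    candidates = np.flatnonzero(recall >= target_recall)
    if candidates.size == 0:
        raise ValueError("no threshold reaches the target recall")
    # argmax returns the first maximum, i.e. the lowest qualifying threshold
    # on ties, which favours recall.
    best = candidates[np.argmax(precision[candidates])]
    p, r = float(precision[best]), float(recall[best])
    f1 = 2 * p * r / (p + r) if (p + r) > 0 else 0.0
    return {"threshold": float(thresholds[best]), "precision": p, "recall": r, "f1": f1}


# Toy validation scores, for illustration only.
y_val = np.array([0, 0, 1, 0, 1, 1])
proba_val = np.array([0.1, 0.2, 0.3, 0.4, 0.6, 0.9])
chosen = threshold_by_recall(y_val, proba_val, target_recall=0.75)
```

In the toy data, any threshold above 0.3 loses the positive scored 0.3 and drops recall to 2/3, so the scan settles on 0.3 with precision 0.75 and recall 1.0.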


## Final Evaluation (Test)

Lock the threshold and evaluate on the test period.  

Report: ROC-AUC, PR-AUC, Precision, Recall, F1, Confusion Matrix. Plot ROC & PR curves.


In [None]:
# TODO: compute and print metrics; plot ROC and PR curves; show confusion matrix
# Save figures to artifacts
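
A compact sketch of the metric report at a locked threshold. The helper name `evaluate_at_threshold` is hypothetical. Plotting is omitted to keep the example dependency-free. With matplotlib installed, scikit-learn's `RocCurveDisplay.from_predictions` and `PrecisionRecallDisplay.from_predictions` are one option for the curves.

```python
import numpy as np
from sklearn.metrics import (
    average_precision_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
)


def evaluate_at_threshold(y_true, y_proba, threshold):
    """Ranking metrics on probabilities plus class metrics at a fixed threshold."""
    y_true = np.asarray(y_true)
    y_proba = np.asarray(y_proba)
    y_pred = (y_proba >= threshold).astype(int)
    return {
        "roc_auc": float(roc_auc_score(y_true, y_proba)),
        "pr_auc": float(average_precision_score(y_true, y_proba)),
        "precision": float(precision_score(y_true, y_pred, zero_division=0)),
        "recall": float(recall_score(y_true, y_pred, zero_division=0)),
        "f1": float(f1_score(y_true, y_pred, zero_division=0)),
        # Rows are true classes, columns are predicted classes: [[TN, FP], [FN, TP]].
        "confusion_matrix": confusion_matrix(y_true, y_pred, labels=[0, 1]).tolist(),
    }


# Toy test-period scores, for illustration only.
y_test = [0, 0, 1, 0, 1, 1]
proba_test = [0.1, 0.2, 0.3, 0.4, 0.6, 0.9]
report = evaluate_at_threshold(y_test, proba_test, threshold=0.3)
```

Every value in the returned dictionary is a plain Python type, so the report can be written to JSON without further conversion.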


## Feature Importance & Explainability

Compute feature importances (model-based) and optionally SHAP to interpret key drivers.  

Explain what matters: time-of-day? network? content type? caption length?


In [None]:
# TODO: model.feature_importances_ (RF/XGB). Optional: SHAP summary plot.
# Save importance table to artifacts
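
A sketch of the importance table on synthetic data. The feature names are placeholders. The commented `to_csv` line shows where the artifact would be written, assuming the `ART` path from the first cell. Impurity-based importances tend to favour high-cardinality features, so treat the ranking as a first look rather than a verdict.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the engineered training matrix.
X, y = make_classification(n_samples=300, n_features=6, n_informative=3, random_state=0)
feature_names = [f"feat_{i}" for i in range(X.shape[1])]  # placeholder names

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# feature_importances_ follows the column order of the training matrix.
importance = (
    pd.DataFrame({"feature": feature_names, "importance": model.feature_importances_})
    .sort_values("importance", ascending=False)
    .reset_index(drop=True)
)
# importance.to_csv(ART / "feature_importance.csv", index=False)
```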


## Robustness Checks

- Year-by-year performance stability  
- Per-network slices (X vs Instagram vs Threads)  
- Per-profile slices


In [None]:
# TODO: compute metrics by Network and by Profile on test set
# HINT: loop over unique groups; filter and re-evaluate with the fixed threshold
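
The slice loop could be sketched like this, assuming the test frame carries the true label and predicted probability in columns named `y_true` and `y_proba` (both names are hypothetical). Only threshold-based metrics are computed, because small slices may contain a single class, which leaves ROC-AUC undefined.

```python
import pandas as pd
from sklearn.metrics import precision_score, recall_score


def metrics_by_group(frame, group_col, threshold, y_col="y_true", proba_col="y_proba"):
    """Precision and recall per group at one fixed decision threshold."""
    rows = []
    for group, part in frame.groupby(group_col):
        y_pred = (part[proba_col] >= threshold).astype(int)
        rows.append(
            {
                group_col: group,
                "n": len(part),
                "positives": int(part[y_col].sum()),
                "precision": precision_score(part[y_col], y_pred, zero_division=0),
                "recall": recall_score(part[y_col], y_pred, zero_division=0),
            }
        )
    return pd.DataFrame(rows)


# Toy test frame, for illustration only.
test_frame = pd.DataFrame(
    {
        "Network": ["Instagram", "Instagram", "Instagram", "X", "X", "X"],
        "y_true": [1, 0, 1, 0, 1, 0],
        "y_proba": [0.9, 0.4, 0.2, 0.1, 0.8, 0.7],
    }
)
by_network = metrics_by_group(test_frame, "Network", threshold=0.5)
```

Reporting `n` and `positives` alongside each metric makes it easier to spot slices that are too small to interpret.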


## Conclusions

- What did we learn about drivers of high performance?
- Are results as expected?
- Business trade-offs for threshold choice (recall vs precision)
- Next steps: richer text features, topic modeling, uplift vs baseline scheduling, online learning


In [None]:
# TODO: persist: best params JSON, chosen threshold, metrics JSON, importance CSVs
# HINT: use json.dump(...) and df.to_csv(...)
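
A minimal persistence sketch. The helper name `save_json`, the file name and the values are placeholders, and a temporary directory stands in for the `ART` folder. NumPy scalars are not JSON-serializable, so cast them with `float(...)` or `int(...)` before saving.

```python
import json
import tempfile
from pathlib import Path


def save_json(obj, path):
    """Write obj as pretty-printed JSON, creating parent folders if needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as fh:
        json.dump(obj, fh, indent=2, sort_keys=True)
    return path


# Placeholder values for illustration only, not results from this project.
run_summary = {
    "best_params": {"max_depth": None, "n_estimators": 100},
    "threshold": 0.5,
}

with tempfile.TemporaryDirectory() as tmp:  # stands in for the ART folder
    saved = save_json(run_summary, Path(tmp) / "artifacts" / "run_summary.json")
    loaded = json.loads(saved.read_text(encoding="utf-8"))
```

Reading the file back right after writing it is a cheap check that the artifact round-trips.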
