# Threshold Analysis for Repository Activity Labeling

This notebook demonstrates the methodology used to select optimal thresholds for labeling repositories as active or inactive.

## Approach

1. **Compute weighted activity score** using multiple metrics
2. **Create pseudo-labels** from extreme percentiles (top 25% = active, bottom 25% = inactive)
3. **Evaluate different threshold selection methods**:
   - F1-maximization
   - Youden's J statistic
   - Precision/Recall tradeoffs
4. **Visualize precision-recall and ROC curves**
5. **Analyze activity score distributions**

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import (
    precision_recall_curve, 
    roc_curve, 
    auc, 
    f1_score,
    confusion_matrix
)

# Set style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

# Random seed for reproducibility
np.random.seed(42)

## 1. Load Quarterly Data

In [None]:
# Load quarterly aggregated data
data_path = '../data/processed/quarters.parquet'

try:
    df = pd.read_parquet(data_path)
    print(f"✅ Loaded {len(df):,} records")
    print(f"   Unique repositories: {df['repo_id'].nunique():,}")
    print(f"   Columns: {list(df.columns)}")
except FileNotFoundError:
    print(f"❌ Data file not found: {data_path}")
    print("   Please run: python preprocessing/aggregate_quarters.py")
    df = None

In [None]:
# Display sample
if df is not None:
    display(df.head())
    print("\nData types:")
    display(df.dtypes)

## 2. Compute Activity Scores

### Metric Weights (Rationale)

We use a weighted combination of metrics:

- **Commits (weight=1.0)**: Direct indicator of code activity
- **Pull Requests (weight=2.0)**: High-value collaboration, code review
- **Issues (weight=0.5)**: Community engagement, bug reports
- **Stars (weight=0.1)**: Interest signal, but not active contribution
- **Forks (weight=0.3)**: Potential for external development

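Each metric is log-transformed with `log1p` (that is, log(1 + x)) before weighting, which dampens the heavy right skew typical of count data and keeps zero counts at zero.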
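
As a minimal sketch of the scoring rule described above, the snippet below computes the weighted sum of `log1p` values on a tiny hand-made table. The column names and counts are hypothetical and chosen only for illustration; they are not taken from the real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names and counts are illustrative only.
toy = pd.DataFrame({
    "commits": [0, 9, 99],
    "pull_requests": [0, 0, 9],
})
toy_weights = {"commits": 1.0, "pull_requests": 2.0}

# score = sum over metrics of weight * log(1 + count)
toy_score = sum(w * np.log1p(toy[col]) for col, w in toy_weights.items())
print(toy_score.round(4).tolist())
```

A row with all-zero counts scores exactly zero, and the log transform means that going from 9 to 99 commits adds the same amount to the score as going from 0 to 9.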

In [None]:
if df is not None:
    # Define weights
    weights = {
        'commits': 1.0,
        'commit': 1.0,
        'pull_requests': 2.0,
        'pulls': 2.0,
        'pr': 2.0,
        'issues': 0.5,
        'issue': 0.5,
        'stars': 0.1,
        'star': 0.1,
        'forks': 0.3,
        'fork': 0.3,
    }
    
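    # Identify available metric columns by substring match on the column name.
    # Note: short keys such as 'pr' or 'star' can also match unrelated numeric
    # columns, so check the printed list below before trusting the scores.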
    available_metrics = {}
    for col in df.columns:
        col_lower = col.lower()
        for metric_key, weight in weights.items():
            if metric_key in col_lower and pd.api.types.is_numeric_dtype(df[col]):
                available_metrics[col] = weight
                break
    
    print(f"Available metrics: {list(available_metrics.keys())}")
    print(f"Weights: {available_metrics}")
    
    # Compute weighted score
    df['activity_score'] = 0.0
    for col, weight in available_metrics.items():
        normalized = np.log1p(df[col].fillna(0))
        df['activity_score'] += weight * normalized
    
    print("\nActivity score statistics:")
    print(df['activity_score'].describe())

## 3. Visualize Activity Score Distribution

In [None]:
if df is not None:
    fig, axes = plt.subplots(1, 2, figsize=(15, 5))
    
    # Histogram
    axes[0].hist(df['activity_score'], bins=100, edgecolor='black', alpha=0.7)
    axes[0].set_xlabel('Activity Score')
    axes[0].set_ylabel('Frequency')
    axes[0].set_title('Distribution of Activity Scores')
    axes[0].axvline(df['activity_score'].median(), color='red', linestyle='--', label='Median')
    axes[0].axvline(df['activity_score'].mean(), color='green', linestyle='--', label='Mean')
    axes[0].legend()
    
    # Box plot
    axes[1].boxplot(df['activity_score'], vert=True)
    axes[1].set_ylabel('Activity Score')
    axes[1].set_title('Activity Score Box Plot')
    axes[1].grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.savefig('../data/processed/activity_score_distribution.png', dpi=150, bbox_inches='tight')
    plt.show()
    
    # Print percentiles
    percentiles = [10, 25, 50, 75, 90, 95, 99]
    print("\nActivity Score Percentiles:")
    for p in percentiles:
        val = np.percentile(df['activity_score'], p)
        print(f"  {p}th: {val:.4f}")

## 4. Create Pseudo-Labels for Validation

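We build pseudo-labels from the extremes of the score distribution:
- **Top 25%** of scores → treated as active (label=1)
- **Bottom 25%** of scores → treated as inactive (label=0)
- Middle 50% → ambiguous (not used for threshold tuning)

Because these labels are derived from the activity score itself, the two groups are separable by construction. The metrics in the following sections therefore check the internal consistency of a threshold rather than its accuracy against independent ground truth.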
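
The labeling rule above can be sketched on a small array. The scores below are made up for illustration; only the 25th/75th percentile rule comes from this notebook.

```python
import numpy as np

# Hypothetical scores 1.0 .. 8.0, used only to illustrate the rule.
toy_scores = np.arange(1.0, 9.0)

# 25th and 75th percentiles define the "inactive" and "active" extremes.
low, high = np.percentile(toy_scores, [25, 75])

# Keep only the extremes; the middle of the distribution is left unlabeled.
keep = (toy_scores <= low) | (toy_scores >= high)
toy_labels = (toy_scores[keep] >= high).astype(int)

print(toy_scores[keep].tolist(), toy_labels.tolist())
```

Here the scores 1, 2 and 7, 8 are kept and labeled 0 and 1 respectively, while the four middle values are dropped.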

In [None]:
if df is not None:
    scores = df['activity_score'].values
    
    # Define percentile thresholds
    top_percentile = np.percentile(scores, 75)
    bottom_percentile = np.percentile(scores, 25)
    
    print(f"Top 25% threshold: {top_percentile:.4f}")
    print(f"Bottom 25% threshold: {bottom_percentile:.4f}")
    
    # Create validation set (only extreme values)
    validation_mask = (scores >= top_percentile) | (scores <= bottom_percentile)
    val_scores = scores[validation_mask]
    val_labels = (scores[validation_mask] >= top_percentile).astype(int)
    
    print(f"\nValidation set size: {len(val_scores):,}")
    print(f"  Active (label=1): {val_labels.sum():,}")
    print(f"  Inactive (label=0): {(1 - val_labels).sum():,}")
    
    # Visualize
    plt.figure(figsize=(12, 6))
    plt.hist(scores[scores <= bottom_percentile], bins=50, alpha=0.6, label='Inactive', color='red')
    plt.hist(scores[scores >= top_percentile], bins=50, alpha=0.6, label='Active', color='green')
    plt.axvline(bottom_percentile, color='red', linestyle='--', linewidth=2)
    plt.axvline(top_percentile, color='green', linestyle='--', linewidth=2)
    plt.xlabel('Activity Score')
    plt.ylabel('Frequency')
    plt.title('Pseudo-Labels: Active vs Inactive')
    plt.legend()
    plt.savefig('../data/processed/pseudo_labels_distribution.png', dpi=150, bbox_inches='tight')
    plt.show()

## 5. Compute Precision-Recall Curve

In [None]:
if df is not None:
    # Compute precision-recall curve
    precision, recall, thresholds_pr = precision_recall_curve(val_labels, val_scores)
    
    # Compute F1 scores
    f1_scores = 2 * (precision * recall) / (precision + recall + 1e-10)
    
    # Find optimal threshold (max F1)
    best_idx = np.argmax(f1_scores[:-1])
    optimal_threshold_f1 = thresholds_pr[best_idx]
    optimal_f1 = f1_scores[best_idx]
    optimal_precision = precision[best_idx]
    optimal_recall = recall[best_idx]
    
    print(f"Optimal Threshold (max F1): {optimal_threshold_f1:.4f}")
    print(f"  F1 Score: {optimal_f1:.4f}")
    print(f"  Precision: {optimal_precision:.4f}")
    print(f"  Recall: {optimal_recall:.4f}")
    
    # Plot PR curve
    plt.figure(figsize=(10, 6))
    plt.plot(recall, precision, linewidth=2, label='PR Curve')
    plt.scatter([optimal_recall], [optimal_precision], s=200, c='red', marker='*', 
                label=f'Optimal (F1={optimal_f1:.3f})', zorder=5)
    plt.xlabel('Recall')
    plt.ylabel('Precision')
    plt.title('Precision-Recall Curve')
    plt.legend()
    plt.grid(True, alpha=0.3)
    plt.savefig('../data/processed/precision_recall_curve.png', dpi=150, bbox_inches='tight')
    plt.show()
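
To make the F1-maximization step concrete, here is a small sketch on hand-made labels and scores (both hypothetical). It uses the same `precision_recall_curve` call and the same `[:-1]` slice as the cell above, since the final precision/recall pair returned by scikit-learn has no associated threshold.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Hypothetical labels and scores with some overlap between the classes.
y_true = np.array([0, 0, 1, 0, 1, 1])
y_score = np.array([0.1, 0.3, 0.4, 0.5, 0.7, 0.9])

prec, rec, thr = precision_recall_curve(y_true, y_score)

# F1 at every candidate threshold; the small constant avoids division by zero.
f1 = 2 * prec * rec / (prec + rec + 1e-10)

# The last (precision=1, recall=0) point has no threshold, so exclude it.
best = np.argmax(f1[:-1])
print(f"threshold={thr[best]:.2f}  F1={f1[best]:.3f}")
```

In this toy case the best cutoff is 0.4: it recovers all three positives at the cost of one false positive, which gives a higher F1 than the stricter cutoff at 0.7.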

## 6. Compute ROC Curve and Youden's J Statistic

In [None]:
if df is not None:
    # Compute ROC curve
    fpr, tpr, thresholds_roc = roc_curve(val_labels, val_scores)
    roc_auc = auc(fpr, tpr)
    
    # Compute Youden's J statistic
    youden_j = tpr - fpr
    best_idx_youden = np.argmax(youden_j)
    optimal_threshold_youden = thresholds_roc[best_idx_youden]
    optimal_youden = youden_j[best_idx_youden]
    
    print(f"Optimal Threshold (Youden's J): {optimal_threshold_youden:.4f}")
    print(f"  Youden's J: {optimal_youden:.4f}")
    print(f"  TPR: {tpr[best_idx_youden]:.4f}")
    print(f"  FPR: {fpr[best_idx_youden]:.4f}")
    print(f"  ROC AUC: {roc_auc:.4f}")
    
    # Plot ROC curve
    plt.figure(figsize=(10, 6))
    plt.plot(fpr, tpr, linewidth=2, label=f'ROC Curve (AUC={roc_auc:.3f})')
    plt.plot([0, 1], [0, 1], 'k--', linewidth=1, label='Random')
    plt.scatter([fpr[best_idx_youden]], [tpr[best_idx_youden]], s=200, c='red', marker='*',
                label=f'Optimal (J={optimal_youden:.3f})', zorder=5)
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('ROC Curve')
    plt.legend()
    plt.grid(True, alpha=0.3)
    plt.savefig('../data/processed/roc_curve.png', dpi=150, bbox_inches='tight')
    plt.show()
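
Youden's J is simply TPR minus FPR, evaluated at each ROC threshold. The sketch below applies it to hypothetical labels and scores; `drop_intermediate=False` is passed only so that every distinct score appears as a candidate threshold in this tiny example.

```python
import numpy as np
from sklearn.metrics import roc_curve

# Hypothetical labels and scores: five positives, four negatives.
y_true = np.array([0, 0, 0, 1, 0, 1, 1, 1, 1])
y_score = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9])

fpr_toy, tpr_toy, thr_toy = roc_curve(y_true, y_score, drop_intermediate=False)

# Youden's J = sensitivity + specificity - 1 = TPR - FPR
j = tpr_toy - fpr_toy
best = np.argmax(j)
print(f"threshold={thr_toy[best]:.2f}  J={j[best]:.2f}")
```

The maximum falls at a cutoff of 0.6, where four of five positives are caught with no false positives (J = 0.8). Lowering the cutoff to 0.4 catches the last positive but admits one negative, giving J = 0.75.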

## 7. Compare Threshold Selection Methods

In [None]:
if df is not None:
    # Create comparison table
    threshold_methods = {
        'F1 Maximization': optimal_threshold_f1,
        'Youden\'s J': optimal_threshold_youden,
        'Percentile (50th)': np.percentile(scores, 50),
        'Percentile (75th)': np.percentile(scores, 75),
    }
    
    # Extra metrics not imported in the first cell; import once, outside the loop
    from sklearn.metrics import precision_score, recall_score, accuracy_score
    
    comparison_results = []
    
    for method_name, threshold in threshold_methods.items():
        # Apply threshold to validation set
        pred_labels = (val_scores >= threshold).astype(int)
        
        # Compute metrics
        
        precision_val = precision_score(val_labels, pred_labels, zero_division=0)
        recall_val = recall_score(val_labels, pred_labels, zero_division=0)
        f1_val = f1_score(val_labels, pred_labels, zero_division=0)
        accuracy_val = accuracy_score(val_labels, pred_labels)
        
        # Apply to full dataset
        active_pct = (scores >= threshold).mean() * 100
        
        comparison_results.append({
            'Method': method_name,
            'Threshold': threshold,
            'F1': f1_val,
            'Precision': precision_val,
            'Recall': recall_val,
            'Accuracy': accuracy_val,
            'Active %': active_pct
        })
    
    comparison_df = pd.DataFrame(comparison_results)
    print("\nThreshold Method Comparison:")
    display(comparison_df.style.format({
        'Threshold': '{:.4f}',
        'F1': '{:.4f}',
        'Precision': '{:.4f}',
        'Recall': '{:.4f}',
        'Accuracy': '{:.4f}',
        'Active %': '{:.2f}'
    }))

## 8. Recommended Threshold

Based on the comparison above, we recommend the threshold from **F1-maximization**: F1 is the harmonic mean of precision and recall, so it penalizes a threshold that favors one at the expense of the other.

In [None]:
if df is not None:
    recommended_threshold = optimal_threshold_f1
    
    print("="*60)
    print("RECOMMENDED THRESHOLD")
    print("="*60)
    print(f"Threshold: {recommended_threshold:.4f}")
    print("Method: F1 Maximization")
    print(f"F1 Score: {optimal_f1:.4f}")
    print(f"Precision: {optimal_precision:.4f}")
    print(f"Recall: {optimal_recall:.4f}")
    print("="*60)
    
    # Save threshold to file
    import json
    threshold_config = {
        'threshold': float(recommended_threshold),
        'method': 'f1_maximization',
        'f1_score': float(optimal_f1),
        'precision': float(optimal_precision),
        'recall': float(optimal_recall)
    }
    
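    # Make sure the target directory exists before writing
    import os
    os.makedirs('../config', exist_ok=True)
    with open('../config/threshold_config.json', 'w') as f:
        json.dump(threshold_config, f, indent=2)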
    
    print("\n✅ Threshold configuration saved to: ../config/threshold_config.json")
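
A downstream script could read the saved configuration back and apply it to new scores. The sketch below is one possible way to do that; it writes a stand-in config to a temporary directory so that it runs on its own, and the threshold value and scores are made up.

```python
import json
import tempfile
from pathlib import Path

import numpy as np

# Hypothetical config mirroring two of the keys written above; values are made up.
example_config = {"threshold": 2.5, "method": "f1_maximization"}

with tempfile.TemporaryDirectory() as tmp:
    config_path = Path(tmp) / "threshold_config.json"
    config_path.write_text(json.dumps(example_config, indent=2))

    # Read the configuration back, as a consumer of the file would.
    loaded = json.loads(config_path.read_text())

# Apply the stored threshold with the same >= rule used in this notebook.
new_scores = np.array([0.0, 2.5, 4.1])
is_active = (new_scores >= loaded["threshold"]).astype(int)
print(is_active.tolist())
```

Note that a score exactly equal to the threshold counts as active, matching the `>=` comparison used when the labels are assigned.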

## 9. Visualize Final Labeling

Apply the recommended threshold and visualize the resulting labels.

In [None]:
if df is not None:
    # Apply threshold
    df['is_active'] = (df['activity_score'] >= recommended_threshold).astype(int)
    
    # Statistics
    active_count = df['is_active'].sum()
    inactive_count = len(df) - active_count
    active_pct = (active_count / len(df)) * 100
    
    print(f"Total records: {len(df):,}")
    print(f"Active: {active_count:,} ({active_pct:.1f}%)")
    print(f"Inactive: {inactive_count:,} ({100 - active_pct:.1f}%)")
    
    # Plot
    fig, axes = plt.subplots(1, 2, figsize=(15, 5))
    
    # Distribution with threshold
    axes[0].hist(df[df['is_active'] == 0]['activity_score'], bins=50, alpha=0.6, 
                 label='Inactive', color='red')
    axes[0].hist(df[df['is_active'] == 1]['activity_score'], bins=50, alpha=0.6, 
                 label='Active', color='green')
    axes[0].axvline(recommended_threshold, color='black', linestyle='--', linewidth=2,
                    label=f'Threshold={recommended_threshold:.2f}')
    axes[0].set_xlabel('Activity Score')
    axes[0].set_ylabel('Frequency')
    axes[0].set_title('Final Activity Labels')
    axes[0].legend()
    
    # Bar chart
    axes[1].bar(['Inactive', 'Active'], [inactive_count, active_count], 
                color=['red', 'green'], alpha=0.6)
    axes[1].set_ylabel('Count')
    axes[1].set_title('Label Distribution')
    for i, (label, count) in enumerate([('Inactive', inactive_count), ('Active', active_count)]):
        pct = (count / len(df)) * 100
        axes[1].text(i, count, f"{count:,}\n({pct:.1f}%)", ha='center', va='bottom')
    
    plt.tight_layout()
    plt.savefig('../data/processed/final_labels.png', dpi=150, bbox_inches='tight')
    plt.show()

## Summary

This notebook demonstrated:

1. ✅ Computation of weighted activity scores from multiple metrics
2. ✅ Creation of pseudo-labels from extreme percentiles
3. ✅ Evaluation of multiple threshold selection methods
4. ✅ Recommendation of F1-maximization method
5. ✅ Visualization of precision-recall and ROC curves
6. ✅ Final labeling with selected threshold

The selected threshold balances precision and recall on the pseudo-labeled validation set. The resulting `is_active` labels can then be used by downstream forecasting and classification tasks.