# MD003 - Session 4: Moltbook Karma Prediction Pipeline

**Objective**: End-to-end data engineering pipeline to predict user karma on moltbook.com

**Target Variable**: `users.karma` (regression)

---

## 1. Environment Setup and Scope Definition

### 1.1 Imports and Setup

In [None]:
import sys
from pathlib import Path

# Add project root to path (parent of notebooks/)
project_root = Path.cwd().parent
if str(project_root) not in sys.path:
    sys.path.insert(0, str(project_root))

# Configure logging
import logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

# Core imports
import polars as pl

from config.settings import settings
settings.ensure_directories()

print(f"Project root: {settings.project_root}")
print(f"Data directory: {settings.data_dir}")

### 1.2 Case Study Definition

**Domain**: Moltbook.com, a social network for AI agents

**Entities**:
- `users`: Agent profiles (target: karma)
- `posts`: Posts published in communities
- `comments`: Comments on posts
- `sub_molt`: Topic-based communities

**Target Variable**: `karma`, the user's reputation score (regression)

---

## 2. Web Scraping with Playwright

### 2.1 Database Initialization

In [None]:
from src.database.connection import init_database, check_database_exists
from src.database.operations import DatabaseOperations
from src.database.models import User, Post, Comment, SubMolt

# Initialize database if needed
if not check_database_exists():
    print("Initializing database...")
    init_database()
else:
    print("Database already exists")

db_ops = DatabaseOperations()
print(f"Current counts - Users: {db_ops.count(User)}, Posts: {db_ops.count(Post)}")

### 2.2 Running the Scraper

In [None]:
from src.scraper.scrapers import MoltbookScraper

# Configure scraping limits
MAX_USERS = 50
MAX_POSTS = 200

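# Run scraper (use headless=False to see the browser)
# The result is stored under its own name so that the training results
# assigned to `results` in section 6 do not overwrite it.
with MoltbookScraper(headless=True) as scraper:
    scrape_results = scraper.scrape_all(
        max_users=MAX_USERS,
        max_posts=MAX_POSTS,
    )

print("\nScraping complete:")
print(f"  Users: {scrape_results['users']}")
print(f"  SubMolts: {scrape_results['submolts']}")
print(f"  Posts: {scrape_results['posts']}")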

---

## 3. Exploratory Data Analysis (EDA)

### 3.1 Loading Data with Polars

In [None]:
from src.processing.silver import load_table_to_lazy

# Load data using Polars LazyFrames
users_lf = load_table_to_lazy("users")
posts_lf = load_table_to_lazy("posts")
comments_lf = load_table_to_lazy("comments")

# Collect for EDA
users_df = users_lf.collect()
posts_df = posts_lf.collect()

print(f"Users: {len(users_df)} records")
print(f"Posts: {len(posts_df)} records")

### 3.2 Descriptive Statistics for Karma

In [None]:
# Karma distribution
karma_stats = users_df.select([
    pl.col("karma").mean().alias("mean"),
    pl.col("karma").median().alias("median"),
    pl.col("karma").std().alias("std"),
    pl.col("karma").min().alias("min"),
    pl.col("karma").max().alias("max"),
    pl.col("karma").quantile(0.25).alias("q25"),
    pl.col("karma").quantile(0.75).alias("q75"),
])

print("Karma Statistics:")
print(karma_stats)

### 3.3 Variable Distributions

In [None]:
# User statistics
user_stats = users_df.select([
    pl.col("followers").mean().alias("avg_followers"),
    pl.col("following").mean().alias("avg_following"),
    pl.col("description").is_not_null().sum().alias("users_with_description"),
    pl.col("human_owner").is_not_null().sum().alias("users_with_owner"),
])

print("User Profile Statistics:")
print(user_stats)

---

## 4. Data Cleaning and Preparation (Silver Layer)

### 4.1 Cleaning Pipeline with Polars Lazy

In [None]:
from src.processing.silver import build_silver_layer

# Build silver layer - cleaned data
silver_results = build_silver_layer()

print("Silver Layer Built:")
for table, count in silver_results.items():
    print(f"  {table}: {count} records")

# Verify output
silver_users = pl.read_parquet(settings.silver_dir / "users.parquet")
print(f"\nSilver users sample:")
print(silver_users.head(3))

---

## 5. Feature Engineering (Gold Layer)

### 5.1 Feature Engineering with Polars Lazy

In [None]:
from src.processing.gold import build_gold_layer

# Build gold layer - engineered features
gold_results = build_gold_layer()

print("Gold Layer Built:")
print(f"  User features: {gold_results['user_features']} records")
print(f"  Feature columns: {gold_results['feature_columns']}")

### 5.2 Feature Description

In [None]:
from src.processing.gold import get_modeling_data

# Load gold layer
features_df = get_modeling_data()

print("Feature Columns:")
for col in features_df.columns:
    dtype = features_df[col].dtype
    print(f"  {col}: {dtype}")

print(f"\nFeature Statistics:")
print(features_df.describe())

---

## 6. Modeling with H2O AutoML

### 6.1 Model Training

In [None]:
from src.models.trainer import H2OTrainer, FEATURE_COLUMNS

# Initialize trainer
trainer = H2OTrainer(
    max_models=10,
    max_runtime_secs=300,
)

# Train model
print("Training H2O AutoML model...")
print(f"Features: {FEATURE_COLUMNS}")

results = trainer.train(
    data=features_df,
    target="karma",
    features=FEATURE_COLUMNS,
)

print(f"\nBest Model: {results['model_id']}")

### 6.2 Model Evaluation

In [None]:
print("Model Evaluation Metrics:")
print(f"  MAE:  {results['mae']:.4f}")
print(f"  RMSE: {results['rmse']:.4f}")
print(f"  R2:   {results['r2']:.4f}")
print(f"\n  Train samples: {results['train_size']}")
print(f"  Test samples:  {results['test_size']}")
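
For reference, the three metrics above have standard definitions: MAE is the mean absolute error, RMSE is the square root of the mean squared error, and R2 is one minus the ratio of residual to total sum of squares. How `H2OTrainer` computes them internally is not shown here, so the helper below is only a minimal sketch; the name `regression_metrics` and the sample values are hypothetical.

```python
import math


def regression_metrics(actual, predicted):
    """Return MAE, RMSE, and R2 for two equal-length numeric sequences."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("actual and predicted must be non-empty and equal in length")

    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]

    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)  # residual sum of squares
    rmse = math.sqrt(ss_res / n)

    mean_actual = sum(actual) / n
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)  # total sum of squares
    # R2 is undefined when the target is constant
    r2 = 1 - ss_res / ss_tot if ss_tot > 0 else float("nan")

    return {"mae": mae, "rmse": rmse, "r2": r2}


# Made-up karma values, purely for illustration
metrics = regression_metrics([10.0, 20.0, 30.0, 40.0], [12.0, 18.0, 33.0, 37.0])
print(metrics)
```

RMSE penalizes large errors more heavily than MAE, so a wide gap between the two usually points to a few users whose karma is predicted badly.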

### 6.3 Predictions

In [None]:
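# Generate predictions on the full feature set. This includes the rows used
# for training, so the comparison below is not an out-of-sample evaluation.
predictions = trainer.predict(features_df)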

# Compare actual vs predicted
comparison = predictions.select(["name", "karma", "karma_predicted"]).head(10)
print("Actual vs Predicted Karma:")
print(comparison)

### 6.4 Saving the Model

In [None]:
# Save model
model_path = trainer.save_model()
print(f"Model saved to: {model_path}")

# Save predictions
pred_path = settings.models_dir / "predictions.parquet"
predictions.write_parquet(pred_path)
print(f"Predictions saved to: {pred_path}")

---

## 7. Conclusions

### 7.1 Pipeline Summary

In [None]:
print("=" * 50)
print("PIPELINE SUMMARY")
print("=" * 50)
print("\n1. Web Scraping:")
print(f"   - Users: {db_ops.count(User)}")
print(f"   - Posts: {db_ops.count(Post)}")
print(f"   - SubMolts: {db_ops.count(SubMolt)}")
print("\n2. Processing:")
print(f"   - Silver layer: {settings.silver_dir}")
print(f"   - Gold layer: {settings.gold_dir}")
print("\n3. Modeling:")
print("   - Algorithm: H2O AutoML")
print("   - Target: karma (regression)")
# Format metrics only when present; applying ':.4f' to an 'N/A' fallback
# string would raise ValueError.
mae, r2 = results.get("mae"), results.get("r2")
print(f"   - MAE: {mae:.4f}" if mae is not None else "   - MAE: N/A")
print(f"   - R2: {r2:.4f}" if r2 is not None else "   - R2: N/A")
print("\n" + "=" * 50)

### 7.2 Observations

1. **Web Scraping**: Playwright drives a real browser and executes JavaScript, which is required for single-page applications (SPAs) such as moltbook.com

2. **Processing with Polars**: Lazy evaluation lets Polars optimize the full query plan before running it, which reduces memory use and improves performance

3. **Feature Engineering**: Derived features (follower_ratio, total_activity) capture user engagement

4. **Modeling**: H2O AutoML trains several candidate models and automates the selection of the best one

5. **Limitations**:
   - The dataset is too small for production use
   - Some features may be highly correlated
   - Karma may depend on factors that were not captured (time on the platform, content quality)