# MD003 - Session 4: Moltbook Karma Prediction Pipeline

**Objective**: End-to-end data engineering pipeline to predict user karma on moltbook.com

**Target variable**: `users.karma` (regression)

---

## 1. Environment Setup and Scope Definition

### 1.1 Imports and Setup

In [None]:
import sys
from pathlib import Path
project_root = Path.cwd().parent
if str(project_root) not in sys.path:
    sys.path.insert(0, str(project_root))

import logging
import polars as pl
from config.settings import settings

logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
settings.ensure_directories()
print(f"Project root: {settings.project_root}")
print(f"Data directory: {settings.data_dir}")

Project root: c:\Users\Paulina Peralta\Desktop\moltbook-karma
Data directory: c:\Users\Paulina Peralta\Desktop\moltbook-karma\data


### 1.2 Case Study Definition

**Domain**: Moltbook.com, a social network for AI agents

**Entities**:
- `users`: Agent profiles (target: karma)
- `posts`: Posts published in communities
- `comments`: Comments on posts
- `sub_molt`: Topic-based communities

**Target variable**: `karma`, the user's reputation score (regression)

---

## 2. Web Scraping with Playwright

### 2.1 Database Initialization

In [2]:
from src.database.connection import init_database, check_database_exists
from src.database.operations import DatabaseOperations
from src.database.models import User, Post, Comment, SubMolt

if not check_database_exists():
    print("Initializing database...")
    init_database()
else:
    print("Database already exists")

db_ops = DatabaseOperations()
print(f"Current counts - Users: {db_ops.count(User)}, Posts: {db_ops.count(Post)}")

Database already exists
Current counts - Users: 981, Posts: 1242


### 2.2 Running the Scraper

In [3]:
from src.scraper.scrapers import MoltbookScraper
MAX_USERS = 50
MAX_POSTS = 200

with MoltbookScraper(headless=True) as scraper:
    results = scraper.scrape_all(max_users=MAX_USERS, max_posts=MAX_POSTS)

print(f"\nScraping complete:")
print(f"  Users: {results['users']}")
print(f"  SubMolts: {results['submolts']}")
print(f"  Posts: {results['posts']}")

2026-02-09 16:37:59,858 - INFO - Starting Playwright browser (headless=True)


Error: It looks like you are using Playwright Sync API inside the asyncio loop.
Please use the Async API instead.
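
The error above appears because the Jupyter kernel already runs an asyncio event loop in its main thread, and Playwright's sync API refuses to start inside one. One possible workaround, sketched here under assumptions, is to run the synchronous scraper in a worker thread, which has no running loop. `run_in_worker_thread` and `has_running_loop` are hypothetical helpers, not part of the project, and the commented usage assumes `MoltbookScraper` behaves as in the cell above.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor


def run_in_worker_thread(fn, *args, **kwargs):
    """Run a blocking callable in a fresh thread and return its result.

    A new thread has no running asyncio loop, so sync-only APIs that
    refuse to run inside a loop can be called from a notebook this way.
    """
    with ThreadPoolExecutor(max_workers=1) as pool:
        return pool.submit(fn, *args, **kwargs).result()


def has_running_loop() -> bool:
    """Report whether the current thread is inside a running asyncio loop."""
    try:
        asyncio.get_running_loop()
        return True
    except RuntimeError:
        return False


# The worker thread never sees the notebook's event loop.
print(run_in_worker_thread(has_running_loop))

# Hypothetical usage with the scraper from the cell above (assumed API):
# def scrape():
#     with MoltbookScraper(headless=True) as scraper:
#         return scraper.scrape_all(max_users=MAX_USERS, max_posts=MAX_POSTS)
# results = run_in_worker_thread(scrape)
```

The call blocks the notebook cell until the scrape finishes, which matches the behavior the original cell intended. Switching the scraper to Playwright's async API, as the error message suggests, is the other option.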

---

## 3. Exploratory Data Analysis (EDA)

### 3.1 Loading Data with Polars

In [6]:
from src.processing.silver import load_table_to_lazy

# Load data using Polars LazyFrames
users_lf = load_table_to_lazy("users")
posts_lf = load_table_to_lazy("posts")
comments_lf = load_table_to_lazy("comments")

# Collect for EDA
users_df = users_lf.collect()
posts_df = posts_lf.collect()

print(f"Users: {len(users_df)} records")
print(f"Posts: {len(posts_df)} records")

2026-02-09 16:44:38,147 - INFO - Loaded 981 rows from users
2026-02-09 16:44:38,167 - INFO - Loaded 1242 rows from posts
2026-02-09 16:44:38,178 - INFO - Loaded 644 rows from comments


Users: 981 records
Posts: 1242 records


### 3.2 Descriptive Statistics for Karma

In [7]:
# Karma distribution
karma_stats = users_df.select([
    pl.col("karma").mean().alias("mean"),
    pl.col("karma").median().alias("median"),
    pl.col("karma").std().alias("std"),
    pl.col("karma").min().alias("min"),
    pl.col("karma").max().alias("max"),
    pl.col("karma").quantile(0.25).alias("q25"),
    pl.col("karma").quantile(0.75).alias("q75"),
])

print("Karma Statistics:")
print(karma_stats)

Karma Statistics:
shape: (1, 7)
┌─────────────┬────────┬──────────────┬─────┬────────┬─────┬─────┐
│ mean        ┆ median ┆ std          ┆ min ┆ max    ┆ q25 ┆ q75 │
│ ---         ┆ ---    ┆ ---          ┆ --- ┆ ---    ┆ --- ┆ --- │
│ f64         ┆ f64    ┆ f64          ┆ i64 ┆ i64    ┆ f64 ┆ f64 │
╞═════════════╪════════╪══════════════╪═════╪════════╪═════╪═════╡
│ 6553.063201 ┆ 0.0    ┆ 44460.967544 ┆ 0   ┆ 500002 ┆ 0.0 ┆ 0.0 │
└─────────────┴────────┴──────────────┴─────┴────────┴─────┴─────┘
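
The table shows a median and 75th percentile of 0 against a mean of about 6553 and a maximum of 500002, so karma is zero-inflated and strongly right-skewed. A minimal sketch of two checks that are often useful for such a target follows: the share of zeros and a `log1p` transform. The helper names and the toy values are assumptions for illustration; they are not the scraped data.

```python
import math


def zero_share(values):
    """Fraction of entries equal to zero."""
    return sum(1 for v in values if v == 0) / len(values)


def log1p_transform(values):
    """Apply log(1 + x), which maps 0 to 0 and compresses large values."""
    return [math.log1p(v) for v in values]


# Toy values for illustration only; these are NOT the scraped karma values.
toy_karma = [0, 0, 0, 0, 0, 0, 12, 150, 4000, 500002]

print(f"share of zeros: {zero_share(toy_karma):.2f}")
print([round(v, 2) for v in log1p_transform(toy_karma)])
```

Whether training on a transformed target would improve this pipeline is not something the notebook tests; the sketch only shows how the skew could be quantified.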


### 3.3 Variable Distributions

In [8]:
# User statistics
user_stats = users_df.select([
    pl.col("followers").mean().alias("avg_followers"),
    pl.col("following").mean().alias("avg_following"),
    pl.col("description").is_not_null().sum().alias("users_with_description"),
    pl.col("human_owner").is_not_null().sum().alias("users_with_owner"),
])

print("User Profile Statistics:")
print(user_stats)

User Profile Statistics:
shape: (1, 4)
┌───────────────┬───────────────┬────────────────────────┬──────────────────┐
│ avg_followers ┆ avg_following ┆ users_with_description ┆ users_with_owner │
│ ---           ┆ ---           ┆ ---                    ┆ ---              │
│ f64           ┆ f64           ┆ u32                    ┆ u32              │
╞═══════════════╪═══════════════╪════════════════════════╪══════════════════╡
│ 0.462793      ┆ 0.046891      ┆ 41                     ┆ 41               │
└───────────────┴───────────────┴────────────────────────┴──────────────────┘


---

## 4. Data Cleaning and Preparation (Silver Layer)

### 4.1 Cleaning Pipeline with Polars Lazy

In [9]:
from src.processing.silver import build_silver_layer

# Build silver layer - cleaned data
silver_results = build_silver_layer()

print("Silver Layer Built:")
for table, count in silver_results.items():
    print(f"  {table}: {count} records")

# Verify output
silver_users = pl.read_parquet(settings.silver_dir / "users.parquet")
print(f"\nSilver users sample:")
print(silver_users.head(3))

2026-02-09 16:44:47,539 - INFO - Loaded 981 rows from users
2026-02-09 16:44:47,619 - INFO - Wrote 981 users to c:\Users\Paulina Peralta\Desktop\moltbook-karma\data\silver\users.parquet
2026-02-09 16:44:47,639 - INFO - Loaded 1242 rows from posts
2026-02-09 16:44:47,671 - INFO - Wrote 1242 posts to c:\Users\Paulina Peralta\Desktop\moltbook-karma\data\silver\posts.parquet
2026-02-09 16:44:47,682 - INFO - Loaded 644 rows from comments
2026-02-09 16:44:47,692 - INFO - Wrote 644 comments to c:\Users\Paulina Peralta\Desktop\moltbook-karma\data\silver\comments.parquet
2026-02-09 16:44:47,700 - INFO - Loaded 55 rows from sub_molt
2026-02-09 16:44:47,706 - INFO - Wrote 55 submolts to c:\Users\Paulina Peralta\Desktop\moltbook-karma\data\silver\submolts.parquet
2026-02-09 16:44:47,709 - INFO - Silver layer build complete: {'users': 981, 'posts': 1242, 'comments': 644, 'submolts': 55}


Silver Layer Built:
  users: 981 records
  posts: 1242 records
  comments: 644 records
  submolts: 55 records

Silver users sample:
shape: (3, 9)
┌─────────────┬─────────────┬───────┬────────────┬───┬────────┬───────────┬───────────┬────────────┐
│ id_user     ┆ name        ┆ karma ┆ descriptio ┆ … ┆ joined ┆ followers ┆ following ┆ scraped_at │
│ ---         ┆ ---         ┆ ---   ┆ n          ┆   ┆ ---    ┆ ---       ┆ ---       ┆ ---        │
│ str         ┆ str         ┆ i64   ┆ ---        ┆   ┆ str    ┆ i64       ┆ i64       ┆ str        │
│             ┆             ┆       ┆ str        ┆   ┆        ┆           ┆           ┆            │
╞═════════════╪═════════════╪═══════╪════════════╪═══╪════════╪═══════════╪═══════════╪════════════╡
│ user_7ca81d ┆ AureliusPro ┆ 0     ┆            ┆ … ┆ null   ┆ 0         ┆ 0         ┆ 2026-02-09 │
│ d06237      ┆ tocol       ┆       ┆            ┆   ┆        ┆           ┆           ┆ T19:05:22. │
│             ┆             ┆       ┆         

---

## 5. Feature Engineering (Gold Layer)

### 5.1 Feature Engineering with Polars Lazy

In [10]:
from src.processing.gold import build_gold_layer

# Build gold layer - engineered features
gold_results = build_gold_layer()

print("Gold Layer Built:")
print(f"  User features: {gold_results['user_features']} records")
print(f"  Feature columns: {gold_results['feature_columns']}")

2026-02-09 16:44:52,150 - INFO - Loaded users from silver layer
2026-02-09 16:44:52,152 - INFO - Loaded posts from silver layer
2026-02-09 16:44:52,152 - INFO - Loaded comments from silver layer
2026-02-09 16:44:52,154 - INFO - Loaded submolts from silver layer
2026-02-09 16:44:52,237 - INFO - Wrote 981 user feature records with 21 columns to c:\Users\Paulina Peralta\Desktop\moltbook-karma\data\gold\user_features.parquet
2026-02-09 16:44:52,239 - INFO - Feature columns: ['id_user', 'name', 'karma', 'followers', 'following', 'follower_ratio', 'has_description', 'has_human_owner', 'description_length', 'post_count', 'total_post_rating', 'avg_post_rating', 'max_post_rating', 'avg_title_length', 'avg_post_desc_length', 'comment_count', 'total_comment_rating', 'avg_comment_rating', 'avg_comment_length', 'total_activity', 'total_rating']
2026-02-09 16:44:52,239 - INFO - Gold layer build complete: {'user_features': 981, 'feature_columns': 21}


Gold Layer Built:
  User features: 981 records
  Feature columns: 21
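
The column names in the log hint at how some features are derived, but the notebook does not show the definitions inside `src.processing.gold`. The sketch below illustrates one plausible derivation for a single user record in plain Python. Every formula in it, including the `+ 1` in the ratio denominator, is an assumption for illustration, and `derive_user_features` is a hypothetical helper.

```python
def derive_user_features(user, post_count, comment_count):
    """Derive a few gold-layer-style features for one user record.

    All formulas here are assumptions for illustration; the real
    definitions live in src.processing.gold and are not shown.
    """
    description = user.get("description") or ""
    return {
        # +1 in the denominator avoids division by zero (assumed convention).
        "follower_ratio": user["followers"] / (user["following"] + 1),
        # 1 when a non-blank description exists, else 0.
        "has_description": int(bool(description.strip())),
        "description_length": len(description),
        "total_activity": post_count + comment_count,
    }


example = derive_user_features(
    {"followers": 10, "following": 4, "description": "An AI agent"},
    post_count=3,
    comment_count=2,
)
print(example)
```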


### 5.2 Feature Descriptions

In [11]:
from src.processing.gold import get_modeling_data

# Load gold layer
features_df = get_modeling_data()

print("Feature Columns:")
for col in features_df.columns:
    dtype = features_df[col].dtype
    print(f"  {col}: {dtype}")

print(f"\nFeature Statistics:")
print(features_df.describe())

Feature Columns:
  id_user: String
  name: String
  karma: Int64
  followers: Int64
  following: Int64
  follower_ratio: Float64
  has_description: Int32
  has_human_owner: Int32
  description_length: UInt32
  post_count: UInt32
  total_post_rating: Int64
  avg_post_rating: Float64
  max_post_rating: Int64
  avg_title_length: Float64
  avg_post_desc_length: Float64
  comment_count: UInt32
  total_comment_rating: Int64
  avg_comment_rating: Float64
  avg_comment_length: Float64
  total_activity: UInt32
  total_rating: Int64

Feature Statistics:
shape: (9, 22)
┌────────────┬────────────┬──────┬────────────┬───┬────────────┬───────────┬───────────┬───────────┐
│ statistic  ┆ id_user    ┆ name ┆ karma      ┆ … ┆ avg_commen ┆ avg_comme ┆ total_act ┆ total_rat │
│ ---        ┆ ---        ┆ ---  ┆ ---        ┆   ┆ t_rating   ┆ nt_length ┆ ivity     ┆ ing       │
│ str        ┆ str        ┆ str  ┆ f64        ┆   ┆ ---        ┆ ---       ┆ ---       ┆ ---       │
│            ┆            ┆    

---

## 6. Modeling with H2O AutoML

### 6.1 Model Training

In [13]:
!pip install H2O

Collecting H2O
  Using cached h2o-3.46.0.9-py2.py3-none-any.whl.metadata (2.1 kB)
Using cached h2o-3.46.0.9-py2.py3-none-any.whl (266.0 MB)
Installing collected packages: H2O
Successfully installed H2O-3.46.0.9




In [14]:
from src.models.trainer import H2OTrainer, FEATURE_COLUMNS

# Initialize trainer
trainer = H2OTrainer(max_models=10, max_runtime_secs=300)

# Train model
print("Training H2O AutoML model...")
print(f"Features: {FEATURE_COLUMNS}")

results = trainer.train(data=features_df, target="karma", features=FEATURE_COLUMNS)
print(f"\nBest Model: {results['model_id']}")

Training H2O AutoML model...
Features: ['followers', 'following', 'follower_ratio', 'has_description', 'has_human_owner', 'description_length', 'post_count', 'total_post_rating', 'avg_post_rating', 'max_post_rating', 'avg_title_length', 'avg_post_desc_length', 'comment_count', 'total_comment_rating', 'avg_comment_rating', 'avg_comment_length', 'total_activity', 'total_rating']
Checking whether there is an H2O instance running at http://localhost:54321..... not found.
Attempting to start a local H2O server...
; Java HotSpot(TM) 64-Bit Server VM (build 25.461-b11, mixed mode)
  Starting server from C:\Users\Paulina Peralta\anaconda3\Lib\site-packages\h2o\backend\bin\h2o.jar
  Ice root: C:\Users\PAULIN~1\AppData\Local\Temp\tmppivf28ph
  JVM stdout: C:\Users\PAULIN~1\AppData\Local\Temp\tmppivf28ph\h2o_Paulina_Peralta_started_from_python.out
  JVM stderr: C:\Users\PAULIN~1\AppData\Local\Temp\tmppivf28ph\h2o_Paulina_Peralta_started_from_python.err
  Server is running at http://127.0.0.1:5432

H2O_cluster_uptime:         07 secs
H2O_cluster_timezone:       -03:00
H2O_data_parsing_timezone:  UTC
H2O_cluster_version:        3.46.0.9
H2O_cluster_version_age:    2 months and 16 days
H2O_cluster_name:           H2O_from_python_Paulina_Peralta_lt8mxu
H2O_cluster_total_nodes:    1
H2O_cluster_free_memory:    3.549 Gb
H2O_cluster_total_cores:    16
H2O_cluster_allowed_cores:  16


2026-02-09 16:46:20,011 - INFO - H2O initialized
2026-02-09 16:46:20,011 - INFO - Training with 18 features, 981 samples


Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%


2026-02-09 16:46:21,670 - INFO - Train size: 773, Test size: 208


AutoML progress: |
16:46:21.780: User specified a validation frame with cross-validation still enabled. Please note that the models will still be validated using cross-validation only, the validation frame will be used to provide purely informative validation metrics on the trained models.
16:46:21.799: AutoML: XGBoost is not available; skipping it.
16:46:21.869: _train param, Dropping bad and constant columns: [avg_comment_rating, total_comment_rating]


16:46:22.530: _train param, Dropping bad and constant columns: [avg_comment_rating, total_comment_rating]

█
16:46:23.693: _train param, Dropping bad and constant columns: [avg_comment_rating, total_comment_rating]

█
16:46:24.122: _train param, Dropping bad and constant columns: [avg_comment_rating, total_comment_rating]

█
16:46:24.630: _train param, Dropping bad and constant columns: [avg_comment_rating, total_comment_rating]
16:46:24.933: _train param, Dropping bad and constant columns: [avg_comment_rating, total_comment_rating]



2026-02-09 16:46:28,869 - INFO - Best model: GBM_2_AutoML_1_20260209_164621
2026-02-09 16:46:28,889 - INFO - Training complete - MAE: 3234.8072, RMSE: 23050.9321, R2: 0.6363



Best Model: GBM_2_AutoML_1_20260209_164621
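
The H2O log above reports that `avg_comment_rating` and `total_comment_rating` were dropped as bad or constant columns. A check like the following sketch could surface such columns before training, so they can be investigated upstream. `constant_columns` is a hypothetical helper, and the toy values are invented for illustration; the actual column contents are not shown in the notebook.

```python
def constant_columns(columns):
    """Return names of columns whose non-null values are all identical."""
    constant = []
    for name, values in columns.items():
        distinct = {v for v in values if v is not None}
        # Zero or one distinct value means the column carries no signal.
        if len(distinct) <= 1:
            constant.append(name)
    return constant


# Toy columns for illustration only; the values are made up.
toy_columns = {
    "followers": [0, 3, 1],
    "avg_comment_rating": [0.0, 0.0, 0.0],
    "total_comment_rating": [0, None, 0],
}

print(constant_columns(toy_columns))
```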


### 6.2 Model Evaluation

In [15]:
print("Model Evaluation Metrics:")
print(f"  MAE:  {results['mae']:.4f}")
print(f"  RMSE: {results['rmse']:.4f}")
print(f"  R2:   {results['r2']:.4f}")
print(f"\n  Train samples: {results['train_size']}")
print(f"  Test samples:  {results['test_size']}")

Model Evaluation Metrics:
  MAE:  3234.8072
  RMSE: 23050.9321
  R2:   0.6363

  Train samples: 773
  Test samples:  208
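
For reference, the three metrics can be computed by hand as in the sketch below. RMSE squares each error before averaging, so it is never smaller than MAE; the large gap reported above (RMSE of about 23051 against MAE of about 3235) suggests that a few large errors dominate. `regression_metrics` is a hypothetical helper, the toy numbers are not taken from the model, and how `H2OTrainer` computes its metrics internally is not shown in the notebook.

```python
import math


def regression_metrics(actual, predicted):
    """Compute MAE, RMSE and R2 from paired actual/predicted values."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    mean_actual = sum(actual) / n
    ss_res = sum(e * e for e in errors)                   # residual sum of squares
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)  # total sum of squares
    return {
        "mae": sum(abs(e) for e in errors) / n,
        "rmse": math.sqrt(ss_res / n),
        "r2": 1.0 - ss_res / ss_tot,
    }


# Toy values for illustration only.
toy_actual = [0, 0, 10, 100]
toy_predicted = [5, 5, 10, 80]

metrics = regression_metrics(toy_actual, toy_predicted)
print({name: round(value, 4) for name, value in metrics.items()})
```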


### 6.3 Predictions

In [16]:
# Generate predictions
predictions = trainer.predict(features_df)

# Compare actual vs predicted
comparison = predictions.select(["name", "karma", "karma_predicted"]).head(10)
print("Actual vs Predicted Karma:")
print(comparison)

Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%
gbm prediction progress: |███████████████████████████████████████████████████████| (done) 100%
Actual vs Predicted Karma:
shape: (10, 3)
┌──────────────────┬───────┬─────────────────┐
│ name             ┆ karma ┆ karma_predicted │
│ ---              ┆ ---   ┆ ---             │
│ str              ┆ i64   ┆ f64             │
╞══════════════════╪═══════╪═════════════════╡
│ AureliusProtocol ┆ 0     ┆ 176.243253      │
│ ZenithGarcia     ┆ 0     ┆ 176.243253      │
│ NimbusDrifts     ┆ 0     ┆ 176.243253      │
│ MBC20MintPoster  ┆ 0     ┆ 176.243253      │
│ DivineLuna       ┆ 0     ┆ 176.243253      │
│ Clawd            ┆ 0     ┆ 176.243253      │
│ NebulaBot2026    ┆ 0     ┆ 176.243253      │
│ Virgil_DT        ┆ 0     ┆ 176.243253      │
│ Shellraiser      ┆ 0     ┆ 176.243253      │
│ PenkoAI          ┆ 0     ┆ 176.243253      │
└──────────────────┴───────┴─────────────────┘
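
All ten rows shown have zero karma and receive the same prediction of about 176. Note also that `trainer.predict(features_df)` scores all 981 rows, which include the 773 rows used for training. One way to put the model's error in context, sketched below under assumptions, is to compare it with a constant baseline that always predicts the median. The helper names and toy numbers are hypothetical and do not come from the notebook's results.

```python
import statistics


def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def median_baseline_mae(actual):
    """MAE of a model that always predicts the median of the actual values."""
    median = statistics.median(actual)
    return mean_absolute_error(actual, [median] * len(actual))


# Toy values for illustration only.
toy_actual = [0, 0, 0, 0, 1000]
toy_model_predictions = [176, 176, 176, 176, 900]

model_mae = mean_absolute_error(toy_actual, toy_model_predictions)
baseline_mae = median_baseline_mae(toy_actual)
print(f"model MAE: {model_mae:.1f}, median baseline MAE: {baseline_mae:.1f}")
```

With a target whose median is 0, the median baseline can be hard to beat on MAE, so the comparison is most informative when run on the held-out test rows only.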





### 6.4 Saving the Model

In [17]:
# Save model
model_path = trainer.save_model()
print(f"Model saved to: {model_path}")

# Save predictions
pred_path = settings.models_dir / "predictions.parquet"
predictions.write_parquet(pred_path)
print(f"Predictions saved to: {pred_path}")

2026-02-09 16:46:54,142 - INFO - Saved model to C:\Users\Paulina Peralta\Desktop\moltbook-karma\data\models\GBM_2_AutoML_1_20260209_164621


Model saved to: C:\Users\Paulina Peralta\Desktop\moltbook-karma\data\models\GBM_2_AutoML_1_20260209_164621
Predictions saved to: c:\Users\Paulina Peralta\Desktop\moltbook-karma\data\models\predictions.parquet


---

## 7. Conclusions

### 7.1 Pipeline Summary

In [18]:
print("=" * 50)
print("PIPELINE SUMMARY")
print("=" * 50)
print("\n1. Web Scraping:")
print(f"   - Users: {db_ops.count(User)}")
print(f"   - Posts: {db_ops.count(Post)}")
print(f"   - SubMolts: {db_ops.count(SubMolt)}")
print("\n2. Processing:")
print(f"   - Silver layer: {settings.silver_dir}")
print(f"   - Gold layer: {settings.gold_dir}")
print("\n3. Modeling:")
print("   - Algorithm: H2O AutoML")
print("   - Target: karma (regression)")
# Format metrics only when present; applying ':.4f' to the string 'N/A' would raise ValueError.
mae, r2 = results.get('mae'), results.get('r2')
print(f"   - MAE: {mae:.4f}" if mae is not None else "   - MAE: N/A")
print(f"   - R2: {r2:.4f}" if r2 is not None else "   - R2: N/A")
print("\n" + "=" * 50)

PIPELINE SUMMARY

1. Web Scraping:
   - Users: 981
   - Posts: 1242
   - SubMolts: 55

2. Processing:
   - Silver layer: c:\Users\Paulina Peralta\Desktop\moltbook-karma\data\silver
   - Gold layer: c:\Users\Paulina Peralta\Desktop\moltbook-karma\data\gold

3. Modeling:
   - Algorithm: H2O AutoML
   - Target: karma (regression)
   - MAE: 3234.8072
   - R2: 0.6363



### 7.2 Observations

1. **Web Scraping**: Playwright renders JavaScript, which is needed for single-page applications such as moltbook.com.

2. **Processing with Polars**: Lazy evaluation defers execution until `collect()`, which lets Polars optimize the query plan for memory use and speed.

3. **Feature Engineering**: Derived features such as `follower_ratio` and `total_activity` are intended to capture engagement.

4. **Modeling**: H2O AutoML automates the selection of the best-performing algorithm.

5. **Limitations**:
   - The dataset (981 users) is small for production use
   - Some features may be highly correlated
   - Karma may depend on factors not captured here (time on the platform, content quality)
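
The concern about correlated features can be checked directly. The sketch below computes pairwise Pearson correlations in plain Python and reports pairs above a threshold. The helper names, the 0.9 threshold, and the toy values are assumptions for illustration; the notebook itself does not run this check.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient; NaN when either input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if norm_x == 0 or norm_y == 0:
        return float("nan")
    return cov / (norm_x * norm_y)


def highly_correlated_pairs(columns, threshold=0.9):
    """List (name_a, name_b, r) for column pairs with |r| >= threshold."""
    names = list(columns)
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson(columns[a], columns[b])
            if not math.isnan(r) and abs(r) >= threshold:
                pairs.append((a, b, r))
    return pairs


# Toy columns for illustration only; the values are made up.
toy_features = {
    "post_count": [1, 2, 3, 4],
    "total_activity": [2, 4, 6, 8],
    "followers": [5, 1, 4, 2],
}

for a, b, r in highly_correlated_pairs(toy_features):
    print(f"{a} ~ {b}: r = {r:.3f}")
```

Features built as sums of other features, as the name `total_activity` suggests, are natural candidates for this kind of check.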