Instructor-provided code

In [1]:
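import numpy as np

def click_prob(u, it):
    """Return a click probability for user row `u` and item row `it`, clipped to [0.001, 0.5]."""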
    base = 0.05
    premium_boost = 0.02 if u["is_premium"] else 0.0
    age_boost = 0.01 if (25 <= u["age"] <= 40) else 0.0
    price_penalty = 0.02 if it["price"] > 120 else 0.0
    cat_affinity = {
        "Electronics": 0.01 if u["country_code"] in ["US"] else -0.005,
        "Home": 0.01 if u["country_code"] in ["CA","UK"] else 0.0,
        "Beauty": 0.005,
        "Clothing": 0.003,
        "Books": 0.004
    }[it["category"]]
    p = base + premium_boost + age_boost + cat_affinity - price_penalty
    return np.clip(p, 0.001, 0.5)

In [2]:
#probs = interactions.apply(lambda r: click_prob(r[["user_id","age","is_premium","country_code"]], 
#                                           r[["item_id","price","category"]]), axis=1)

In [3]:
#interactions["clicked"] = rng.binomial(n=1, p=probs, size=n_interactions)
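
The two commented-out cells above depend on an `interactions` table, an `rng`, and `n_interactions` that are not defined in this notebook. The sketch below shows one way they could be wired together. The `users` and `items` tables, the seed, and all sizes are assumptions made for illustration, not part of the instructor's setup.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)                      # assumed seed
n_users, n_items, n_interactions = 200, 100, 1000   # assumed sizes

def click_prob(u, it):
    # Same logic as the instructor-provided function above.
    base = 0.05
    premium_boost = 0.02 if u["is_premium"] else 0.0
    age_boost = 0.01 if (25 <= u["age"] <= 40) else 0.0
    price_penalty = 0.02 if it["price"] > 120 else 0.0
    cat_affinity = {
        "Electronics": 0.01 if u["country_code"] in ["US"] else -0.005,
        "Home": 0.01 if u["country_code"] in ["CA", "UK"] else 0.0,
        "Beauty": 0.005,
        "Clothing": 0.003,
        "Books": 0.004,
    }[it["category"]]
    p = base + premium_boost + age_boost + cat_affinity - price_penalty
    return np.clip(p, 0.001, 0.5)

# Hypothetical user and item tables following the schema.
users = pd.DataFrame({
    "user_id": np.arange(n_users),
    "age": rng.integers(18, 76, n_users),
    "is_premium": rng.random(n_users) < 0.5,
    "country_code": rng.choice(["US", "CA", "UK", "IN"], n_users),
})
items = pd.DataFrame({
    "item_id": np.arange(n_items),
    "price": rng.uniform(5.0, 500.0, n_items).round(2),
    "category": rng.choice(
        ["Electronics", "Home", "Beauty", "Clothing", "Books"], n_items
    ),
})

# Pair a random user with a random item for every interaction.
interactions = pd.concat(
    [
        users.iloc[rng.integers(0, n_users, n_interactions)].reset_index(drop=True),
        items.iloc[rng.integers(0, n_items, n_interactions)].reset_index(drop=True),
    ],
    axis=1,
)

# Each row carries both user and item columns, so it can play both roles.
probs = interactions.apply(lambda r: click_prob(r, r), axis=1)
interactions["clicked"] = rng.binomial(n=1, p=probs, size=n_interactions)
```

With these boosts and penalties the probabilities stay between 0.025 and 0.09, so clicks are rare and the resulting label is imbalanced.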

Provide Python code to generate random data using the following schema:

- user_id
- age
- is_premium
- country_code (e.g. US, CA, UK, IN)
- item_id
- price
- category (e.g. Electronics, Home, Beauty, Clothing, Books)

Here’s a clean, self-contained Python example that generates random data following your schema. It uses standard libraries plus `pandas` (optional but convenient).

In [4]:
import random
import uuid
import pandas as pd

# Configuration
NUM_ROWS = 100

COUNTRIES = ["US", "CA", "UK", "IN"]
CATEGORIES = ["Electronics", "Home", "Beauty", "Clothing", "Books"]

def generate_random_data(n=NUM_ROWS):
    data = []

    for _ in range(n):
        record = {
            "user_id": str(uuid.uuid4()),
            "age": random.randint(18, 75),
            "is_premium": random.choice([True, False]),
            "country_code": random.choice(COUNTRIES),
            "item_id": str(uuid.uuid4()),
            "price": round(random.uniform(5.0, 500.0), 2),
            "category": random.choice(CATEGORIES),
        }
        data.append(record)

    return data

# Generate data
records = generate_random_data()

# Convert to DataFrame (optional)
df = pd.DataFrame(records)

# Preview
print(df.head())

                                user_id  age  is_premium country_code  \
0  e1091c99-f461-42f6-ab51-1456c1c41237   39        True           US   
1  855d457e-ef70-4fa0-8cfe-340d05b1e65d   41       False           IN   
2  7d0adfb8-d94a-4960-9df5-3d33d101e238   62        True           UK   
3  bbbe10a5-6bcd-4d3f-85f8-cf8065c2e58b   64        True           UK   
4  c58675d2-91d5-4c2a-aa9e-ed9b090ab96c   62       False           IN   

                                item_id   price     category  
0  baac2c7b-e47f-4c55-95a8-36b64cd39a2b  465.44       Beauty  
1  049f7cbe-a7ea-451f-b3a5-36415127e24c  237.96  Electronics  
2  9bb19667-2c46-40b2-af6d-2128203f909a  300.12        Books  
3  16089fed-a057-4880-a754-04fb345a96f8  326.54  Electronics  
4  decc7547-4af1-447e-b1f1-dfac1e71b57e  274.48  Electronics  


**Notes**

- `uuid.uuid4()` generates random 128-bit identifiers, so collisions between `user_id` or `item_id` values are practically impossible
- Price is a float with 2 decimal places
- Age range and price range can be adjusted easily
- If you don’t need pandas, you can work directly with the list of dictionaries
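
As a sketch of how the ranges mentioned in these notes could be made adjustable and the output reproducible, the variant below takes them as parameters. The function name `generate_records`, its parameters, and the seeded UUID construction are assumptions for illustration and do not appear in the original cell.

```python
import random
import uuid

COUNTRIES = ["US", "CA", "UK", "IN"]
CATEGORIES = ["Electronics", "Home", "Beauty", "Clothing", "Books"]

def generate_records(n, seed=None, age_range=(18, 75), price_range=(5.0, 500.0)):
    """Hypothetical parameterized variant of the generator above."""
    rng = random.Random(seed)  # private generator, so the global state is untouched

    def new_id():
        # Build a version-4 UUID from seeded random bits so runs are repeatable.
        return str(uuid.UUID(int=rng.getrandbits(128), version=4))

    records = []
    for _ in range(n):
        records.append({
            "user_id": new_id(),
            "age": rng.randint(*age_range),
            "is_premium": rng.choice([True, False]),
            "country_code": rng.choice(COUNTRIES),
            "item_id": new_id(),
            "price": round(rng.uniform(*price_range), 2),
            "category": rng.choice(CATEGORIES),
        })
    return records

# Two calls with the same seed return identical records.
sample = generate_records(5, seed=1)
```

Note that `uuid.uuid4()` itself ignores `random.seed`, which is why the sketch derives the identifiers from the seeded generator instead.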

If you want this exported to CSV/Parquet, skewed distributions (e.g., more US users), or category-specific price ranges, I can extend it.

Assume this schema is used in an ecommerce site. Provide Python code that generates random user interactions for use in a machine learning training session.

Below is a practical Python example that simulates **ecommerce user–item interactions** suitable for ML training (e.g., recommendation systems, conversion prediction, churn modeling).

The code generates **behavioral events** (views, add-to-cart, purchases), drawing each event type independently with fixed probabilities of 0.70, 0.20, and 0.10.

---

**Interaction Schema (Generated)**

Each interaction includes:

- `user_id`
- `item_id`
- `event_type` (view, add_to_cart, purchase)
- `timestamp`
- `price`
- `category`
- `country_code`
- `is_premium`
- `label_purchase` (useful for supervised ML)

In [5]:
import random
import uuid
from datetime import datetime, timedelta
import pandas as pd

# Configuration
NUM_USERS = 200
NUM_ITEMS = 100
NUM_INTERACTIONS = 2000

COUNTRIES = ["US", "CA", "UK", "IN"]
CATEGORIES = ["Electronics", "Home", "Beauty", "Clothing", "Books"]
EVENT_TYPES = ["view", "add_to_cart", "purchase"]

EVENT_PROBABILITIES = {
    "view": 0.70,
    "add_to_cart": 0.20,
    "purchase": 0.10
}

# Generate users
users = [
    {
        "user_id": str(uuid.uuid4()),
        "age": random.randint(18, 75),
        "is_premium": random.choice([True, False]),
        "country_code": random.choice(COUNTRIES),
    }
    for _ in range(NUM_USERS)
]

# Generate items
items = [
    {
        "item_id": str(uuid.uuid4()),
        "category": random.choice(CATEGORIES),
        "price": round(random.uniform(5.0, 500.0), 2),
    }
    for _ in range(NUM_ITEMS)
]

def random_timestamp(days_back=60):
    return datetime.now() - timedelta(
        days=random.randint(0, days_back),
        seconds=random.randint(0, 86400),
    )

# Generate interactions
interactions = []

for _ in range(NUM_INTERACTIONS):
    user = random.choice(users)
    item = random.choice(items)

    event_type = random.choices(
        population=EVENT_TYPES,
        weights=[EVENT_PROBABILITIES[e] for e in EVENT_TYPES],
    )[0]

    interaction = {
        "user_id": user["user_id"],
        "item_id": item["item_id"],
        "event_type": event_type,
        "timestamp": random_timestamp(),
        "price": item["price"],
        "category": item["category"],
        "country_code": user["country_code"],
        "is_premium": user["is_premium"],
        # Supervised learning label
        "label_purchase": 1 if event_type == "purchase" else 0,
    }

    interactions.append(interaction)

# Convert to DataFrame
df = pd.DataFrame(interactions)

# Sort chronologically (important for sequence models)
df = df.sort_values("timestamp").reset_index(drop=True)

print(df.head())

                                user_id                               item_id  \
0  86ae5279-a69e-477a-b0ba-9507eccebdb9  c6866b8f-e318-48b3-9b3d-870641cbc3c7   
1  465b40ef-b62b-4523-a11a-0054529eb29e  99ebd32d-de92-4625-9e4c-ce8b285b356a   
2  6cbe609d-4e33-4123-a1df-35922393954e  9e30fe9d-ee9b-4301-b63b-b40d70807faa   
3  0cc1fe10-8ea2-45b7-b1e4-5bf702453003  3af2441a-090c-4b36-a9d2-ff1eb15305f7   
4  7fa857bc-20e0-4336-9996-91c8288982da  455df015-8728-4196-8844-b2ee943576b0   

    event_type                  timestamp   price     category country_code  \
0  add_to_cart 2025-11-13 17:16:48.623091  261.32       Beauty           IN   
1         view 2025-11-13 18:35:56.624591  345.17  Electronics           UK   
2         view 2025-11-13 19:05:51.621537  353.21         Home           US   
3         view 2025-11-13 20:06:07.623591  365.60       Beauty           CA   
4  add_to_cart 2025-11-13 22:28:17.624591  328.80        Books           UK   

   is_premium  label_purchase  
0     

**How This Can Be Used for ML**

- **Binary classification** → predict `label_purchase`
- **Recommendation systems** → user–item interaction matrix
- **Sequence models (LSTM / Transformers)** → ordered `event_type` per user
- **Feature engineering** → session counts, conversion rates, recency
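
Two of the uses listed above, the user–item interaction matrix and per-user conversion features, can be sketched with plain pandas. The toy `events` frame below is a hand-written stand-in for the generated `df`, and the feature names are illustrative assumptions.

```python
import pandas as pd

# Toy stand-in for the generated interactions.
events = pd.DataFrame({
    "user_id": ["u1", "u1", "u1", "u2", "u2", "u3"],
    "item_id": ["i1", "i1", "i2", "i1", "i3", "i2"],
    "event_type": ["view", "purchase", "view", "view", "add_to_cart", "view"],
})
events["label_purchase"] = (events["event_type"] == "purchase").astype(int)

# User-item interaction matrix: number of events per (user, item) pair.
interaction_matrix = pd.crosstab(events["user_id"], events["item_id"])

# Per-user features: event count, purchase count, conversion rate.
user_features = events.groupby("user_id").agg(
    n_events=("event_type", "size"),
    n_purchases=("label_purchase", "sum"),
)
user_features["conversion_rate"] = (
    user_features["n_purchases"] / user_features["n_events"]
)

print(interaction_matrix)
print(user_features)
```

On the full dataset the matrix would be mostly zeros, so a sparse representation may be a better fit for recommender models.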

---

**Possible Extensions**

- Session IDs
- Funnel logic (view → cart → purchase dependency)
- Category-specific pricing
- User preference bias
- Time-of-day or seasonal effects

If you tell me your **ML task (ranking, prediction, sequence modeling)**, I can tailor the data exactly for that use case.

The goal is to predict user behavior.

Below is a **behavior-aware data generator** specifically designed for **predicting user behavior** (e.g., _will the user purchase next?_, _will they add to cart?_).
It produces **sequential interactions**, **realistic funnels**, and **supervised labels** suitable for most ML models.

---

**Modeling Assumptions**

- Users have latent preferences for categories
- Behavior follows a funnel: view → add_to_cart → purchase
- Premium users convert more often
- Past behavior influences future actions
- Output supports **classification, ranking, and sequence models**
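
To make the funnel and premium assumptions concrete, the short calculation below works out the conversion probabilities implied by the constants the generator cell uses (`VIEW_TO_CART`, `CART_TO_PURCHASE`, `PREMIUM_BOOST`, and affinities of 1.0 or 0.3). The helper `funnel_probs` is a hypothetical name introduced only for this illustration.

```python
VIEW_TO_CART = 0.25
CART_TO_PURCHASE = 0.30
PREMIUM_BOOST = 1.4

def funnel_probs(affinity, is_premium):
    """Return (P(cart | view), P(purchase | cart), P(purchase | view))."""
    boost = PREMIUM_BOOST if is_premium else 1.0
    p_cart = VIEW_TO_CART * affinity * boost
    p_purchase_given_cart = CART_TO_PURCHASE * affinity * boost
    return p_cart, p_purchase_given_cart, p_cart * p_purchase_given_cart

for affinity in (1.0, 0.3):
    for is_premium in (True, False):
        p_cart, p_buy, p_total = funnel_probs(affinity, is_premium)
        print(
            f"affinity={affinity:.1f} premium={is_premium!s:5} "
            f"cart={p_cart:.3f} purchase|cart={p_buy:.3f} purchase|view={p_total:.5f}"
        )
```

A premium user viewing an item in the preferred category ends up purchasing with probability 0.35 × 0.42 = 0.147, while a non-premium user viewing any other category does so with probability 0.075 × 0.09 = 0.00675, so purchases are a small minority of all rows.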

---

**Output Schema (Training-Ready)**

Each row represents **one interaction**:

- `user_id`
- `item_id`
- `event_type` (view, add_to_cart, purchase)
- `timestamp`
- `category`
- `price`
- `country_code`
- `is_premium`
- `user_category_affinity`
- `label_next_purchase` ← **prediction target** (in the code below it is set to 1 only on the `purchase` row itself, and 0 on every other row)

In [6]:
import random
import uuid
from datetime import datetime, timedelta
import pandas as pd

# -----------------------
# Configuration
# -----------------------
NUM_USERS = 300
NUM_ITEMS = 200
MAX_EVENTS_PER_USER = 30

CATEGORIES = ["Electronics", "Home", "Beauty", "Clothing", "Books"]
COUNTRIES = ["US", "CA", "UK", "IN"]

# Base probabilities
VIEW_TO_CART = 0.25
CART_TO_PURCHASE = 0.30
PREMIUM_BOOST = 1.4

# -----------------------
# Users with preferences
# -----------------------
users = []
for _ in range(NUM_USERS):
    preferred_category = random.choice(CATEGORIES)
    users.append({
        "user_id": str(uuid.uuid4()),
        "country_code": random.choice(COUNTRIES),
        "is_premium": random.choice([True, False]),
        "preferred_category": preferred_category
    })

# -----------------------
# Items
# -----------------------
items = []
for _ in range(NUM_ITEMS):
    items.append({
        "item_id": str(uuid.uuid4()),
        "category": random.choice(CATEGORIES),
        "price": round(random.uniform(10, 600), 2)
    })

# -----------------------
# Helper functions
# -----------------------
def category_affinity(user, item):
    return 1.0 if user["preferred_category"] == item["category"] else 0.3

def event_time(base_time, step):
    return base_time + timedelta(minutes=step * random.randint(5, 30))

# -----------------------
# Generate interactions
# -----------------------
rows = []
base_time = datetime.now() - timedelta(days=60)

for user in users:
    num_events = random.randint(10, MAX_EVENTS_PER_USER)
    current_time = base_time

    for _ in range(num_events):
        item = random.choice(items)
        affinity = category_affinity(user, item)

        # View event
        rows.append({
            "user_id": user["user_id"],
            "item_id": item["item_id"],
            "event_type": "view",
            "timestamp": current_time,
            "category": item["category"],
            "price": item["price"],
            "country_code": user["country_code"],
            "is_premium": user["is_premium"],
            "user_category_affinity": affinity,
            "label_next_purchase": 0
        })

        # Add to cart
        cart_prob = VIEW_TO_CART * affinity
        if user["is_premium"]:
            cart_prob *= PREMIUM_BOOST

        if random.random() < cart_prob:
            current_time = event_time(current_time, 1)
            rows.append({
                **rows[-1],
                "event_type": "add_to_cart",
                "timestamp": current_time
            })

            # Purchase
            purchase_prob = CART_TO_PURCHASE * affinity
            if user["is_premium"]:
                purchase_prob *= PREMIUM_BOOST

            if random.random() < purchase_prob:
                current_time = event_time(current_time, 1)
                rows.append({
                    **rows[-1],
                    "event_type": "purchase",
                    "timestamp": current_time,
                    "label_next_purchase": 1
                })

        current_time = event_time(current_time, 1)

# -----------------------
# Final dataset
# -----------------------
df = pd.DataFrame(rows).sort_values(
    ["user_id", "timestamp"]
).reset_index(drop=True)

print(df.head())

                                user_id                               item_id  \
0  00225e25-6302-437d-b447-2ffedf1b6bc0  3477f190-0d35-44f4-b33e-d8214f78ac07   
1  00225e25-6302-437d-b447-2ffedf1b6bc0  422ee63c-0193-4fba-919f-099aab636793   
2  00225e25-6302-437d-b447-2ffedf1b6bc0  422ee63c-0193-4fba-919f-099aab636793   
3  00225e25-6302-437d-b447-2ffedf1b6bc0  cd51df58-fe2e-468c-8ff2-33e19686f990   
4  00225e25-6302-437d-b447-2ffedf1b6bc0  cd51df58-fe2e-468c-8ff2-33e19686f990   

    event_type                  timestamp     category   price country_code  \
0         view 2025-11-14 17:16:35.650615  Electronics  576.40           IN   
1         view 2025-11-14 17:44:35.650615         Home  260.27           IN   
2  add_to_cart 2025-11-14 18:13:35.650615         Home  260.27           IN   
3         view 2025-11-14 18:37:35.650615  Electronics  118.71           IN   
4  add_to_cart 2025-11-14 19:02:35.650615  Electronics  118.71           IN   

   is_premium  user_category_affinity 

**Prediction Targets You Can Use**

| Goal                     | Label                 |
| ------------------------ | --------------------- |
| Will user purchase next? | `label_next_purchase` |
| Funnel prediction        | `event_type`          |
| Next-item recommendation | `item_id`             |
| Conversion likelihood    | binary on purchase    |
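
In the generator above, `label_next_purchase` is 1 only on rows whose `event_type` is already `purchase`. If the goal is to predict whether the *following* event is a purchase, one option is to derive the label by shifting each user's event sequence. The sketch below assumes that reading of the task; the column name `next_is_purchase` and the toy `events` frame are illustrative.

```python
import pandas as pd

# Toy event log; timestamps are already ordered within each user.
events = pd.DataFrame({
    "user_id": ["u1", "u1", "u1", "u1", "u2", "u2", "u2"],
    "timestamp": pd.to_datetime([
        "2025-01-01 10:00", "2025-01-01 10:05", "2025-01-01 10:10",
        "2025-01-01 10:30", "2025-01-01 11:00", "2025-01-01 11:20",
        "2025-01-01 11:40",
    ]),
    "event_type": [
        "view", "add_to_cart", "purchase", "view",
        "view", "add_to_cart", "view",
    ],
})

events = events.sort_values(["user_id", "timestamp"]).reset_index(drop=True)

# Look one step ahead within each user; the last event has no successor.
next_event = events.groupby("user_id")["event_type"].shift(-1).fillna("none")
events["next_is_purchase"] = (next_event == "purchase").astype(int)

print(events)
```

Here only the `add_to_cart` row of `u1` gets a positive label, because it is the only row directly followed by a purchase. The row's own `event_type` then becomes a legitimate feature rather than a copy of the target.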

---

**Ready for Models Like**

- Logistic Regression / XGBoost
- RNN / LSTM / Transformer
- Matrix Factorization
- Session-based recommenders

---

**Want this adapted for:**

- **Next-event prediction**
- **Session-based recommendation**
- **Offline RL / bandits**
- **Cold-start simulation**

Tell me the **model type** and I’ll tune the generator precisely.

Provide Python code for two models, logistic regression and random forest. Use what's available in `sklearn`. Use its `Pipeline` where appropriate.

Below is a **complete, sklearn-native training example** using the interaction data you generated, with **two models**:

1. **Logistic Regression** (strong baseline, interpretable)
2. **Random Forest** (non-linear, interaction-aware)

Both use `sklearn.pipeline.Pipeline` and `ColumnTransformer` for proper preprocessing.

The target is:

> `label_next_purchase` → _Will the user purchase?_

---

**Feature Assumptions**

Input `DataFrame df` contains:

numerical:
- price
- user_category_affinity

categorical:
- event_type
- category
- country_code
- is_premium

target:
- label_next_purchase

---

**Shared Preprocessing Pipeline**

In [7]:
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report, roc_auc_score

In [8]:
# Target and features
TARGET = "label_next_purchase"

NUM_FEATURES = [
    "price",
    "user_category_affinity",
]

CAT_FEATURES = [
    "event_type",
    "category",
    "country_code",
    "is_premium",
]

X = df[NUM_FEATURES + CAT_FEATURES]
y = df[TARGET]

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    stratify=y,
    random_state=42
)

# Preprocessing
numeric_transformer = Pipeline(steps=[
    ("scaler", StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ("onehot", OneHotEncoder(
        handle_unknown="ignore",
        sparse_output=False
    ))
])

preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, NUM_FEATURES),
        ("cat", categorical_transformer, CAT_FEATURES),
    ]
)

1️⃣ **Logistic Regression Pipeline**

In [9]:
from sklearn.linear_model import LogisticRegression

logreg_pipeline = Pipeline(steps=[
    ("preprocess", preprocessor),
    ("model", LogisticRegression(
        max_iter=1000,
        class_weight="balanced"
    ))
])

logreg_pipeline.fit(X_train, y_train)

y_pred = logreg_pipeline.predict(X_test)
y_proba = logreg_pipeline.predict_proba(X_test)[:, 1]

print("Logistic Regression Results")
print(classification_report(y_test, y_pred))
print("ROC AUC:", roc_auc_score(y_test, y_proba))

Logistic Regression Results
              precision    recall  f1-score   support

           0       1.00      1.00      1.00      1372
           1       1.00      1.00      1.00        37

    accuracy                           1.00      1409
   macro avg       1.00      1.00      1.00      1409
weighted avg       1.00      1.00      1.00      1409

ROC AUC: 1.0


**Why Logistic Regression?**

- Strong baseline
- Fast training
- Probabilistic output
- Feature interpretability

---

2️⃣ **Random Forest Pipeline**

In [10]:
from sklearn.ensemble import RandomForestClassifier

rf_pipeline = Pipeline(steps=[
    ("preprocess", preprocessor),
    ("model", RandomForestClassifier(
        n_estimators=200,
        max_depth=12,
        min_samples_leaf=5,
        class_weight="balanced",
        random_state=42,
        n_jobs=-1
    ))
])

rf_pipeline.fit(X_train, y_train)

y_pred = rf_pipeline.predict(X_test)
y_proba = rf_pipeline.predict_proba(X_test)[:, 1]

print("Random Forest Results")
print(classification_report(y_test, y_pred))
print("ROC AUC:", roc_auc_score(y_test, y_proba))

Random Forest Results
              precision    recall  f1-score   support

           0       1.00      1.00      1.00      1372
           1       1.00      1.00      1.00        37

    accuracy                           1.00      1409
   macro avg       1.00      1.00      1.00      1409
weighted avg       1.00      1.00      1.00      1409

ROC AUC: 1.0


**Why Random Forest?**

- Captures non-linear behavior
- Handles feature interactions well
- Robust to noise
- Minimal feature engineering needed
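
Both models report perfect scores. Given how the generator sets the label (1 exactly on `purchase` rows) and that `event_type` is among the input features, the target can be read directly off a single feature, which would explain an ROC AUC of 1.0. The sketch below is one way to check for that kind of leakage; `determined_by` is a hypothetical helper, and the toy `rows` frame stands in for the real `df`.

```python
import pandas as pd

def determined_by(df, feature, label):
    """Return True if every value of `feature` maps to a single `label` value."""
    return bool(df.groupby(feature)[label].nunique().max() == 1)

# Toy rows shaped like the generator output.
rows = pd.DataFrame({
    "event_type": ["view", "add_to_cart", "purchase", "view", "view", "add_to_cart"],
    "category": ["Home", "Home", "Home", "Books", "Books", "Books"],
    "label_next_purchase": [0, 0, 1, 0, 0, 0],
})

print(pd.crosstab(rows["event_type"], rows["label_next_purchase"]))
print("event_type leaks label:", determined_by(rows, "event_type", "label_next_purchase"))
print("category leaks label:  ", determined_by(rows, "category", "label_next_purchase"))
```

If the same check returns `True` for `event_type` on the real data, either remove `event_type` from `CAT_FEATURES` or redefine the target so that it refers to a future event.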

---

**Recommended Next Steps**

- Hyperparameter tuning (`GridSearchCV`)
- Feature importance analysis
- Time-based train/test split
- Session-level aggregation
- Gradient boosting (XGBoost / LightGBM)
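
Two of these steps, a time-based split and hyperparameter tuning with `GridSearchCV`, could look roughly like the sketch below. It runs on a small synthetic `toy` frame rather than the generated `df`; the column names, the label rule, and the parameter grid are assumptions chosen only to keep the example self-contained.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.default_rng(0)
n = 400

# Synthetic stand-in data: one row per hour, cheaper items convert more often.
toy = pd.DataFrame({
    "timestamp": pd.Timestamp("2025-01-01") + pd.to_timedelta(np.arange(n), unit="h"),
    "price": rng.uniform(10, 600, n),
    "category": rng.choice(["Books", "Home", "Beauty"], n),
})
toy["label"] = (rng.random(n) < np.where(toy["price"] < 200, 0.5, 0.1)).astype(int)

# Time-based split: train on the earliest 80%, test on the most recent 20%.
toy = toy.sort_values("timestamp")
cutoff = int(len(toy) * 0.8)
train, test = toy.iloc[:cutoff], toy.iloc[cutoff:]
features = ["price", "category"]

pipeline = Pipeline(steps=[
    ("preprocess", ColumnTransformer(transformers=[
        ("num", StandardScaler(), ["price"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["category"]),
    ])),
    ("model", LogisticRegression(max_iter=1000)),
])

# Tune the regularization strength with folds that also respect time order.
search = GridSearchCV(
    pipeline,
    param_grid={"model__C": [0.1, 1.0, 10.0]},
    scoring="roc_auc",
    cv=TimeSeriesSplit(n_splits=3),
)
search.fit(train[features], train["label"])

test_auc = search.score(test[features], test["label"])
print("Best params:", search.best_params_)
print("Held-out ROC AUC:", round(test_auc, 3))
```

Splitting by time rather than at random keeps future interactions out of the training set, which matters once features such as recency or session counts are added.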

If you want, I can:

- Add **cross-validation**
- Add **calibration**
- Adapt this for **next-event prediction**
- Convert to **ranking loss**