### Data Initialization and Preprocessing Setup

This cell performs the first preprocessing stage: it reads the compressed dataset, extracts profile-level and post-level fields, and writes them to a JSON file.

1. **Imports Required Libraries**:
   - `gzip`: Facilitates reading and writing compressed files in gzip format.
   - `json`: Allows parsing and manipulation of JSON data structures.
   - `datetime`: Provides tools for working with date and time.

2. **Defines File Paths**:
   - `input_file`: Specifies the input dataset, a gzipped JSONL file named `training-dataset.jsonl.gz`.
   - `output_file`: Specifies the name of the output file, `extracted_data.json`, where processed data will be stored.

3. **Initializes Data Structures**:
   - `extracted_data`: An empty list intended to hold extracted and processed records from the dataset.
   - `day_mapping`: A dictionary mapping lowercase weekday names (e.g., "monday") to integers (0–6, with Monday as 0) for uniform representation.
   - `media_type_mapping`: A dictionary that converts media types (e.g., "VIDEO", "IMAGE") into numeric codes (0–2).

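4. **Processes the Input File**:
   - Opens the gzipped JSONL input file in text mode using `gzip.open`.
   - Iterates through each line of the file, assuming each line contains a valid JSON object.
   - Parses the JSON data and extracts profile attributes (`username`, `is_private`, `follower_count`, `following_count`).
   - For each post, parses the timestamp into a weekday and an hour (0–23), maps `media_type` to its numeric code, and collects `like_count`. The value stored in `time_indexes` is the hour alone; the combined `day * 24 + hour` index is computed but not saved.
   - Writes the collected records to `output_file` as indented JSON.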


In [1]:
import gzip
import json
from datetime import datetime

# Input and output file paths
input_file = "training-dataset.jsonl.gz"
output_file = "extracted_data.json"

# Initialize a list to store the extracted data
extracted_data = []

day_mapping = {
    "monday": 0,
    "tuesday": 1,
    "wednesday": 2,
    "thursday": 3,
    "friday": 4,
    "saturday": 5,
    "sunday": 6
}

media_type_mapping = {
    "VIDEO": 0,
    "IMAGE": 1,
    "CAROUSEL_ALBUM": 2
}

# Process the gzipped JSONL file
with gzip.open(input_file, 'rt', encoding='utf-8') as f:
    for line in f:
        record = json.loads(line)  # Parse each JSON line
        
        # Extract profile details
        profile_data = {
            "username": record["profile"].get("username"),
            "is_private": record["profile"].get("is_private"),
            "follower_count": record["profile"].get("follower_count"),
            "following_count": record["profile"].get("following_count")
        }
        
        # Extract post details and transform timestamp into day of week and hour interval
        like_counts = []
        media_types = []
        time_indexes = []
        for post in record.get("posts", []):
            # Parse timestamp
            timestamp = post.get("timestamp")
            if timestamp:
                dt = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S")
                day_of_week = dt.strftime("%A")  # Get the day of the week
                day_of_week = day_mapping[day_of_week.lower()]
                hour_interval = dt.hour  # Extract the hour (0-23)
                # Combined weekday/hour index (0-167); computed but not stored below
                time_index = day_of_week * 24 + hour_interval
            else:
                # Missing timestamp: skip the arithmetic, which would fail on None
                day_of_week = None
                hour_interval = None
                time_index = None

            # Note: the hour alone is what gets stored under "time_indexes"
            time_indexes.append(hour_interval)
            media_types.append(media_type_mapping[post.get("media_type")])
            like_counts.append(post.get("like_count"))
        
        # Combine profile and post data
        extracted_data.append({
            "profile": profile_data,
            "time_indexes": time_indexes,
            "media_types": media_types,
            "like_counts":like_counts
        })

# Save the extracted data to a new JSON file
with open(output_file, 'w', encoding='utf-8') as f:
    json.dump(extracted_data, f, indent=4)

print(f"Extracted data written to {output_file}")


Extracted data written to extracted_data.json
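
As a side note on the timestamp handling in this cell, the weekday lookup can also be done without a name table. The following is a minimal sketch rather than the notebook's code: it assumes the same `%Y-%m-%d %H:%M:%S` format, and `to_time_index` is a hypothetical helper name. `datetime.weekday()` returns 0 for Monday through 6 for Sunday, which matches the numbering in `day_mapping`.

```python
from datetime import datetime


def to_time_index(timestamp):
    """Return (weekday, hour, combined index), or None if the timestamp is missing.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    if not timestamp:
        return None
    dt = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S")
    weekday = dt.weekday()   # Monday == 0 ... Sunday == 6
    hour = dt.hour           # 0-23
    return weekday, hour, weekday * 24 + hour   # combined index in 0-167


print(to_time_index("2024-01-01 09:30:00"))  # 2024-01-01 was a Monday
print(to_time_index(None))
```

Unlike `strftime("%A")`, this does not depend on the system locale, and returning `None` for a missing timestamp avoids doing arithmetic on `None`.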


### Data Loading, Transformation, and Regression Pipeline Implementation

This cell defines two main functions and executes the regression pipeline for modeling and evaluating the dataset.

---

1. **Imports Required Libraries**:
   - `json`: Parses JSON data from files.
   - `pandas`: Provides tools for data manipulation and analysis.
   - `sklearn`: Includes modules for data splitting, machine learning models, and performance metrics.

2. **Function 1: `load_and_prepare_data`**:
   - **Purpose**: Reads a JSON file (`extracted_data.json`) containing profile-level and post-level data, transforms it into a flattened structure, and returns it as a pandas DataFrame.
   - **Key Steps**:
     - Reads the JSON data from the file.
     - Extracts user-level information (`username`, `is_private`, `follower_count`, etc.).
     - Combines post-level details (`time_indexes`, `media_types`, `like_counts`) with user data.
     - Converts the processed data into a tabular format using pandas.

3. **Function 2: `run_regression_pipeline`**:
   - **Purpose**: Executes a complete workflow, including data cleaning, feature engineering, model training, and evaluation.
   - **Key Steps**:
     - Loads the dataset using the `load_and_prepare_data` function.
     - Cleans the data (e.g., removes rows with missing `like_count` values).
     - Encodes categorical features (`media_type`) using one-hot encoding.
     - Splits the data into training and testing sets (80-20 split).
     - Trains a `RandomForestRegressor` on the training data.
     - Evaluates the model using test data and calculates performance metrics (`Mean Squared Error`, `R² Score`).
     - Performs 5-fold cross-validation and reports average metrics.

4. **Pipeline Execution**:
   - The `run_regression_pipeline` function is called within the `if __name__ == "__main__":` block to execute the full pipeline:
     - Loads and preprocesses the data.
     - Trains the model.
     - Evaluates the model and prints the results.
     - Returns the trained model for potential reuse.
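
The flattening step can be illustrated on a tiny in-memory example. This is a sketch under assumptions: the two blocks are made-up data in the shape of `extracted_data.json`, and `flatten_blocks` is a hypothetical name. It also shows one property worth knowing: `zip` stops at the shortest list, so a block whose lists differ in length silently loses its trailing posts.

```python
def flatten_blocks(data_blocks):
    """Turn one-block-per-user records into one-row-per-post dictionaries."""
    rows = []
    for block in data_blocks:
        profile = block["profile"]
        posts = zip(
            block.get("time_indexes", []),
            block.get("media_types", []),
            block.get("like_counts", []),
        )
        for hour, media_type, likes in posts:
            rows.append({
                "username": profile.get("username"),
                "follower_count": profile.get("follower_count", 0),
                "time_index": hour,
                "media_type": media_type,
                "like_count": likes,
            })
    return rows


# Made-up sample data for illustration only
sample = [
    {"profile": {"username": "a", "follower_count": 100},
     "time_indexes": [9, 18], "media_types": [1, 0], "like_counts": [5, 7]},
    {"profile": {"username": "b", "follower_count": 50},
     "time_indexes": [12, 13], "media_types": [2, 2], "like_counts": [3]},  # one like_count missing
]

rows = flatten_blocks(sample)
print(len(rows))  # user "b" contributes one row, not two
```

Passing such a list of dictionaries to `pd.DataFrame` yields the tabular form described above.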


In [2]:
import json
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score, KFold
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score

# 1. LOAD & COMBINE THE DATA
# ---------------------------------------------------------------------------
def load_and_prepare_data(json_path='extracted_data.json'):
    """
    Expects a JSON file containing a list of data blocks, 
    each with 'profile' and lists of 'time_indexes', 'media_types', 'like_counts'.
    
    Example structure of one data block:
    
    {
        "profile": {
            "username": "...",
            "is_private": false,
            "follower_count": ...,
            "following_count": ...
        },
        "time_indexes": [...],
        "media_types": [...],
        "like_counts": [...]
    }
    
    Returns a single DataFrame with columns:
        username, is_private, follower_count, following_count,
        time_index, media_type, like_count
    """
    
    with open(json_path, 'r') as f:
        data_blocks = json.load(f)  # list of user blocks
    
    all_rows = []
    for block in data_blocks:
        profile = block['profile']
        follower_count = profile.get('follower_count', 0)
        following_count = profile.get('following_count', 0)
        is_private = profile.get('is_private', False)
        username = profile.get('username', None)
        
        time_indexes = block.get('time_indexes', [])
        media_types = block.get('media_types', [])
        like_counts = block.get('like_counts', [])
        
        # Combine post-level data with user-level data
        for t_idx, m_type, likes in zip(time_indexes, media_types, like_counts):
            row = {
                'username': username,
                'is_private': is_private,
                'follower_count': follower_count,
                'following_count': following_count,
                'time_index': t_idx,
                'media_type': m_type,
                'like_count': likes
            }
            all_rows.append(row)
    
    # Convert to DataFrame
    df = pd.DataFrame(all_rows)
    return df

# 2. BUILD A PIPELINE-LIKE WORKFLOW
# ---------------------------------------------------------------------------
def run_regression_pipeline(json_path='extracted_data.json'):
    # A) Load data
    df = load_and_prepare_data(json_path)
    
    # B) Basic cleaning / filtering (optional)
    # For example, remove rows with missing like_count
    df = df.dropna(subset=['like_count'])
    
    # Convert is_private to int (if it varies in your dataset)
    df['is_private'] = df['is_private'].astype(int)
    
    # One-hot encode media_type
    df = pd.get_dummies(df, columns=['media_type'], prefix='media_type')
    
    # C) Define features and target
    feature_cols = [
        'is_private',
        'follower_count',
        'following_count',
        'time_index',
        # Add the one-hot columns for media_type
        # We don't know how many distinct media_types, so let's just grab them dynamically:
    ] + [col for col in df.columns if col.startswith('media_type_')]
    
    X = df[feature_cols]
    y = df['like_count']
    
    # D) Split data into train/test
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )
    
    # E) Train a Random Forest (example regressor)
    rf = RandomForestRegressor(n_estimators=100, random_state=42)
    rf.fit(X_train, y_train)
    
    # F) Evaluate on the test set
    y_pred = rf.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    print("Test MSE:", mse)
    print("Test R^2:", r2)
    
    # G) (Optional) Cross-validation
    kf = KFold(n_splits=5, shuffle=True, random_state=42)
    cv_scores_r2 = cross_val_score(rf, X, y, cv=kf, scoring='r2')
    cv_scores_mse = cross_val_score(rf, X, y, cv=kf, scoring='neg_mean_squared_error')
    
    print("\nCross-validation R^2 Scores:", cv_scores_r2)
    print("Mean R^2:", cv_scores_r2.mean())
    print("\nCross-validation MSE Scores:", -cv_scores_mse)
    print("Mean MSE:", -cv_scores_mse.mean())

    return rf  # return the trained model if you want further use

# 3. RUN THE PIPELINE
# ---------------------------------------------------------------------------
if __name__ == "__main__":
    # Trains a model on the 80% training split, evaluates it on the held-out 20%,
    # runs 5-fold cross-validation, and prints the results
    model = run_regression_pipeline("extracted_data.json")


Test MSE: 1051879570.2377697
Test R^2: 0.5155767769333439

Cross-validation R^2 Scores: [0.52471075 0.03195472 0.47179404 0.24972152 0.61995331]
Mean R^2: 0.3796268694344282

Cross-validation MSE Scores: [1.03204600e+09 3.46435394e+09 2.37201360e+09 5.06540545e+09
 1.24997184e+09]
Mean MSE: 2636758165.9296446
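
For reference, the two reported metrics can be computed by hand. The sketch below uses made-up numbers and hypothetical helper names; it follows the standard definitions (mean of squared errors, and one minus the ratio of residual to total sum of squares) and is not scikit-learn's implementation.

```python
def mse(y_true, y_pred):
    """Mean of the squared prediction errors."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """1 - SS_res / SS_tot: share of the target's variance the predictions explain."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Made-up values for illustration only
y_true = [10, 20, 30, 40]
y_pred = [12, 18, 33, 39]

print("MSE:", mse(y_true, y_pred))
print("R^2:", r2(y_true, y_pred))
```

Because MSE is expressed in squared like counts, a few posts with very large counts can dominate it, which is one plausible reason the fold scores vary so widely. The `neg_mean_squared_error` scorer returns the negated MSE so that higher is always better, which is why the cell flips the sign before printing.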


### Comprehensive Machine Learning Pipeline with Random Forest and XGBoost

This cell implements a complete machine learning pipeline, including data preprocessing, hyperparameter tuning, and model evaluation using Random Forest and XGBoost regressors.

---

1. **Imports Required Libraries**:
   - `numpy` and `pandas`: Essential tools for numerical operations and data manipulation.
   - `sklearn`: Modules for data preprocessing, splitting, model training, evaluation, and hyperparameter tuning.
   - `xgboost`: Library for the XGBoost regressor, optimized for gradient boosting.

2. **Function 1: `load_and_prepare_data`**:
   - **Purpose**: Reads and preprocesses the dataset from a JSON file.
   - **Key Steps**:
     - Loads user-level and post-level data.
     - One-hot encodes the `media_type` feature.
     - Drops rows with a missing `like_count` and removes the `username` column.
     - Outputs a pandas DataFrame with all relevant features.

3. **Function 2: `run_random_forest`**:
   - **Purpose**: Tunes a Random Forest regressor using `RandomizedSearchCV`.
   - **Key Steps**:
     - Preprocesses numeric features using a `StandardScaler`.
     - Defines a pipeline for scaling and modeling.
     - Conducts hyperparameter optimization using cross-validation.
     - Returns the best model and search results.

4. **Function 3: `run_xgboost`**:
   - **Purpose**: Tunes an XGBoost regressor using `RandomizedSearchCV`.
   - **Key Steps**:
     - Similar preprocessing pipeline as Random Forest.
     - Optimizes hyperparameters specific to XGBoost (e.g., learning rate, subsample).
     - Returns the best model and search results.

5. **Function 4: `run_regression_pipeline`**:
   - **Purpose**: Orchestrates the entire machine learning workflow.
   - **Key Steps**:
     - Loads and preprocesses the dataset.
     - Applies a `log1p` transformation to the target (`like_count`) to reduce skew.
     - Splits the data into training and testing sets.
     - Tunes both Random Forest and XGBoost models.
     - Evaluates both tuned models on the test set, inverting the transform with `expm1` so that MSE and R² are reported on the original scale.
     - Chooses the better model based on test MSE and returns it.

6. **Execution (`if __name__ == "__main__")**:
   - Calls the `run_regression_pipeline` function to execute the complete workflow.
   - Prints evaluation metrics and determines the superior model (Random Forest or XGBoost).
   - Returns the best model and its `RandomizedSearchCV` object for further use.
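
The log transform and its inverse can be checked in isolation. This is a minimal sketch using the standard library's `math.log1p` and `math.expm1` (the cell itself uses the NumPy functions of the same names), with made-up like counts:

```python
import math

# Made-up like counts spanning several orders of magnitude
likes = [0, 9, 99, 99999]

logged = [math.log1p(v) for v in likes]      # log(1 + v); defined for v == 0
restored = [math.expm1(v) for v in logged]   # exp(v) - 1; inverse of log1p

for original, log_value, back in zip(likes, logged, restored):
    print(original, round(log_value, 3), round(back, 3))
```

Since `log1p(0)` is exactly 0, posts with no likes need no special handling. Errors measured on the log scale are not comparable with errors on the original scale: in this cell the cross-validation scores are computed on log-transformed targets, while the test MSE is computed after inverting the transform.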


In [3]:
import json
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split, KFold
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer

# XGBoost
from xgboost import XGBRegressor

###############################################################################
# 1. LOADING & PREPARING DATA
###############################################################################
def load_and_prepare_data(json_path='extracted_data.json'):
    """
    Expects a JSON file containing a list of data blocks,
    each with 'profile' and lists of 'time_indexes', 'media_types', 'like_counts'.
    
    Returns a single DataFrame with columns:
        [username, is_private, follower_count, following_count,
         time_index, media_type_*, like_count]
    """
    with open(json_path, 'r') as f:
        data_blocks = json.load(f)
    
    all_rows = []
    for block in data_blocks:
        profile = block['profile']
        follower_count = profile.get('follower_count', 0)
        following_count = profile.get('following_count', 0)
        is_private = profile.get('is_private', False)
        username = profile.get('username', None)
        
        time_indexes = block.get('time_indexes', [])
        media_types = block.get('media_types', [])
        like_counts = block.get('like_counts', [])
        
        for t_idx, m_type, likes in zip(time_indexes, media_types, like_counts):
            row = {
                'username': username,
                'is_private': int(is_private),   # convert bool->int
                'follower_count': follower_count,
                'following_count': following_count,
                'time_index': t_idx,
                'media_type': m_type,
                'like_count': likes
            }
            all_rows.append(row)
    
    df = pd.DataFrame(all_rows)
    
    # Drop rows with missing like_count (if any)
    df.dropna(subset=['like_count'], inplace=True)
    
    # One-hot encode media_type
    df = pd.get_dummies(df, columns=['media_type'], prefix='media_type')
    df = df.drop(columns=["username"])  
    
    return df

###############################################################################
# 2. PIPELINE & HYPERPARAM TUNING FOR RANDOM FOREST
###############################################################################
def run_random_forest(X_train, y_train, numeric_features):
    """
    Perform RandomizedSearchCV on a RandomForestRegressor with 
    some typical hyperparameters, using 5-fold cross-validation.
    
    Returns: best_model, search object
    """
    # Define a preprocessing pipeline for numeric columns
    numeric_transformer = Pipeline([
        ('scaler', StandardScaler())
    ])
    
    preprocessor = ColumnTransformer(
        transformers=[
            ('num', numeric_transformer, numeric_features)
        ]
    )
    
    # Create a pipeline: Preprocessing + Regressor
    pipeline = Pipeline([
        ('preprocessor', preprocessor),
        ('regressor', RandomForestRegressor(random_state=42))
    ])
    
    # Define parameter grid / distributions
    param_dist = {
        'regressor__n_estimators': [100, 200, 300, 500],
        'regressor__max_depth': [None, 10, 20, 30],
        'regressor__min_samples_split': [2, 5, 10],
        'regressor__min_samples_leaf': [1, 2, 4],
    }
    
    random_search = RandomizedSearchCV(
        pipeline,
        param_distributions=param_dist,
        n_iter=20,
        scoring='neg_mean_squared_error',
        cv=5,
        verbose=1,
        random_state=42,
        n_jobs=-1
    )
    
    random_search.fit(X_train, y_train)
    
    print("[Random Forest] Best Params:", random_search.best_params_)
    print("[Random Forest] Best CV Score (neg MSE):", random_search.best_score_)
    
    best_model = random_search.best_estimator_
    return best_model, random_search


###############################################################################
# 3. PIPELINE & HYPERPARAM TUNING FOR XGBoost
###############################################################################
def run_xgboost(X_train, y_train, numeric_features):
    """
    Perform RandomizedSearchCV on an XGBRegressor with 
    typical hyperparameters, using 5-fold cross-validation.
    
    Returns: best_model, search object
    """
    # Preprocessing pipeline
    numeric_transformer = Pipeline([
        ('scaler', StandardScaler())
    ])
    
    preprocessor = ColumnTransformer(
        transformers=[
            ('num', numeric_transformer, numeric_features)
        ]
    )
    
    pipeline = Pipeline([
        ('preprocessor', preprocessor),
        ('regressor', XGBRegressor(random_state=42, use_label_encoder=False,
                                   eval_metric='rmse'))
    ])
    
    # Define parameter distributions
    param_dist = {
        'regressor__n_estimators': [100, 200, 300, 500],
        'regressor__max_depth': [3, 5, 7, 10],
        'regressor__learning_rate': [0.01, 0.05, 0.1, 0.3],
        'regressor__subsample': [0.6, 0.8, 1.0],
        'regressor__colsample_bytree': [0.6, 0.8, 1.0]
    }
    
    random_search = RandomizedSearchCV(
        pipeline,
        param_distributions=param_dist,
        n_iter=20,
        scoring='neg_mean_squared_error',
        cv=5,
        verbose=1,
        random_state=42,
        n_jobs=-1
    )
    
    random_search.fit(X_train, y_train)
    
    print("[XGBoost] Best Params:", random_search.best_params_)
    print("[XGBoost] Best CV Score (neg MSE):", random_search.best_score_)
    
    best_model = random_search.best_estimator_
    return best_model, random_search


###############################################################################
# 4. MAIN WORKFLOW
###############################################################################
def run_regression_pipeline(json_path='extracted_data.json', test_size=0.2, random_state=42):
    """
    1. Load data
    2. Log-transform the target
    3. Train/Test split
    4. Hyperparameter tuning for both Random Forest & XGBoost
    5. Evaluate best model on test set
    """
    # A) Load & Prepare Data
    df = load_and_prepare_data(json_path)
    
    # B) Define features and target
    feature_cols = [c for c in df.columns if c != 'like_count']
    X = df[feature_cols]
    y = df['like_count']
    
    # C) Log Transform the target
    #    Convert y -> log(y+1) to reduce skew
    y_log = np.log1p(y)
    
    # D) Train/Test Split
    X_train, X_test, y_train_log, y_test_log = train_test_split(
        X, y_log, test_size=test_size, random_state=random_state
    )
    
    # Recover the untransformed test target by index, so it stays aligned with
    # y_test_log without relying on a second, identically seeded split
    y_test = y.loc[y_test_log.index]
    
    # E) Identify numeric features (assuming all are numeric/dummy-coded)
    numeric_features = list(X.columns)
    
    # F) Run Random Forest Tuning
    print("========== RANDOM FOREST TUNING ==========")
    rf_best_model, rf_search = run_random_forest(X_train, y_train_log, numeric_features)
    
    # G) Run XGBoost Tuning
    print("\n========== XGBOOST TUNING ==========")
    xgb_best_model, xgb_search = run_xgboost(X_train, y_train_log, numeric_features)
    
    # H) Evaluate each best model on TEST set
    #    We'll invert the log transform with np.expm1
    # ---------------------------------------------------
    print("\n========== EVALUATING BEST MODELS ON TEST SET ==========")
    
    # --- Random Forest ---
    y_pred_log_rf = rf_best_model.predict(X_test)
    y_pred_rf = np.expm1(y_pred_log_rf)
    mse_rf = mean_squared_error(y_test, y_pred_rf)
    r2_rf = r2_score(y_test, y_pred_rf)
    print("[Random Forest] Test MSE:", mse_rf)
    print("[Random Forest] Test R^2:", r2_rf)
    
    # --- XGBoost ---
    y_pred_log_xgb = xgb_best_model.predict(X_test)
    y_pred_xgb = np.expm1(y_pred_log_xgb)
    mse_xgb = mean_squared_error(y_test, y_pred_xgb)
    r2_xgb = r2_score(y_test, y_pred_xgb)
    print("[XGBoost] Test MSE:", mse_xgb)
    print("[XGBoost] Test R^2:", r2_xgb)
    
    # I) Choose which model to return
    # If we only want the best performing model, we can pick it by MSE or R^2:
    if mse_xgb < mse_rf:
        print("\n**XGBoost performed better on the TEST set**")
        best_model = xgb_best_model
        search_obj = xgb_search
    else:
        print("\n**Random Forest performed better on the TEST set**")
        best_model = rf_best_model
        search_obj = rf_search
    
    return best_model, search_obj

###############################################################################
# 5. ENTRY POINT
###############################################################################
if __name__ == "__main__":
    model, search_obj = run_regression_pipeline("extracted_data.json")


Fitting 5 folds for each of 20 candidates, totalling 100 fits
[Random Forest] Best Params: {'regressor__n_estimators': 200, 'regressor__min_samples_split': 10, 'regressor__min_samples_leaf': 1, 'regressor__max_depth': 30}
[Random Forest] Best CV Score (neg MSE): -0.5232709996800945

Fitting 5 folds for each of 20 candidates, totalling 100 fits


Parameters: { "use_label_encoder" } are not used.



[XGBoost] Best Params: {'regressor__subsample': 1.0, 'regressor__n_estimators': 300, 'regressor__max_depth': 10, 'regressor__learning_rate': 0.3, 'regressor__colsample_bytree': 0.8}
[XGBoost] Best CV Score (neg MSE): -0.6163268144830145

[Random Forest] Test MSE: 696320078.4031826
[Random Forest] Test R^2: 0.6793229698435448
[XGBoost] Test MSE: 812575443.1124936
[XGBoost] Test R^2: 0.6257837624430779

**Random Forest performed better on the TEST set**
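
To put `n_iter=20` in context, the size of each search space can be counted directly. The sketch below copies the candidate lists from the cell above (with the `regressor__` prefix dropped for brevity) and multiplies their lengths; `grid_size` is a hypothetical helper.

```python
import math


def grid_size(param_dist):
    """Number of distinct combinations in a dictionary of candidate lists."""
    return math.prod(len(values) for values in param_dist.values())


rf_params = {
    "n_estimators": [100, 200, 300, 500],
    "max_depth": [None, 10, 20, 30],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
}

xgb_params = {
    "n_estimators": [100, 200, 300, 500],
    "max_depth": [3, 5, 7, 10],
    "learning_rate": [0.01, 0.05, 0.1, 0.3],
    "subsample": [0.6, 0.8, 1.0],
    "colsample_bytree": [0.6, 0.8, 1.0],
}

for name, params in [("Random Forest", rf_params), ("XGBoost", xgb_params)]:
    total = grid_size(params)
    print(f"{name}: {total} combinations, 20 sampled ({20 / total:.1%})")
```

The random search therefore covers a noticeably smaller share of the XGBoost space than of the Random Forest space, so the comparison between the two tuned models depends in part on which combinations happened to be sampled.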


### Advanced Regression Pipeline with Multi-Model Tuning and Comparison

This cell implements an enhanced regression pipeline that includes advanced preprocessing, feature engineering, and hyperparameter tuning for multiple models (Random Forest, XGBoost, LightGBM, and CatBoost).

---

1. **Imports Required Libraries**:
   - Libraries for data manipulation (`numpy`, `pandas`) and model evaluation (`sklearn`).
   - Advanced machine learning libraries: `xgboost`, `lightgbm`, and `catboost`.

2. **Function 1: `load_and_prepare_data`**:
   - **Purpose**: Loads and preprocesses the dataset from a JSON file.
   - **Key Steps**:
     - Extracts features and target from the JSON structure.
     - Creates a new feature, `follower_following_ratio`, by dividing follower count by (following count + 1).
     - Caps the `like_count` at its 99th percentile to handle outliers.
     - One-hot encodes the `media_type` feature.
     - Returns a cleaned and feature-engineered DataFrame.

3. **Function 2: `tune_and_evaluate_model`**:
   - **Purpose**: Conducts hyperparameter tuning for a given model using `RandomizedSearchCV` and evaluates its performance.
   - **Key Steps**:
     - Creates a pipeline for preprocessing (scaling numeric features) and modeling.
     - Uses a parameter grid to search for the best hyperparameters with cross-validation.
     - Evaluates the tuned model on the test set and computes metrics (MSE and R²).
     - Returns the best model and its performance metrics.

4. **Function 3: `run_regression_pipeline`**:
   - **Purpose**: Orchestrates the end-to-end workflow, including model comparison.
   - **Key Steps**:
     - Loads and preprocesses the dataset.
     - Defines parameter grids for Random Forest, XGBoost, LightGBM, and CatBoost.
     - Tunes each model using `tune_and_evaluate_model` and evaluates performance on a test set.
     - Ranks the models by test MSE and reports the best one together with its R².
     - Outputs the best model and results for further use.

5. **Execution (`if __name__ == "__main__")**:
   - Calls the `run_regression_pipeline` function to execute the full workflow.
   - Performs model tuning and comparison.
   - Prints the best model based on MSE and its performance metrics.
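
The two feature-engineering steps can be tried on toy numbers. This is a sketch with made-up values using only the standard library, whereas the cell uses pandas. It assumes that the `inclusive` method of `statistics.quantiles`, which interpolates linearly between data points, behaves like the default interpolation of pandas' `quantile`.

```python
import statistics

# Ratio feature: the +1 keeps the division defined when following_count is 0
follower_count, following_count = 1500, 0
ratio = follower_count / (following_count + 1)

# Made-up like counts: 99 ordinary posts plus one extreme value
likes = list(range(1, 100)) + [10000]

# 99th percentile: the last of the 99 cut points that split the data into 100 groups
cap = statistics.quantiles(likes, n=100, method="inclusive")[98]

# Values above the cap are replaced by the cap; everything else is untouched
capped = [min(v, cap) for v in likes]

print("ratio:", ratio)
print("cap:", cap)
print("max before/after:", max(likes), max(capped))
```

Capping changes only the extreme tail, so it limits the influence of viral posts on a squared-error metric without discarding any rows.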


In [6]:
import json
import numpy as np
import pandas as pd

# Sklearn utilities
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

# Regressors
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor
from catboost import CatBoostRegressor

###############################################################################
# 1. LOADING, FEATURE ENGINEERING, AND OUTLIER HANDLING
###############################################################################
def load_and_prepare_data(json_path='extracted_data.json'):
    """
    1) Reads a JSON file with data blocks, each containing:
        - profile.follower_count, profile.following_count
        - time_indexes, media_types, like_counts
    2) Creates a DataFrame with:
        - 'follower_following_ratio'
        - one-hot-encoded 'media_type'
        - 'like_count' capped at the 99th percentile
    """
    with open(json_path, 'r') as f:
        data_blocks = json.load(f)
    
    all_rows = []
    for block in data_blocks:
        profile = block['profile']
        follower_count = profile.get('follower_count', 0)
        following_count = profile.get('following_count', 0)
        is_private = int(profile.get('is_private', False))
        username = profile.get('username', None)
        
        time_indexes = block.get('time_indexes', [])
        media_types = block.get('media_types', [])
        like_counts = block.get('like_counts', [])
        
        for t_idx, m_type, likes in zip(time_indexes, media_types, like_counts):
            row = {
                'username': username,
                'is_private': is_private,
                'follower_count': follower_count,
                'following_count': following_count,
                'time_index': t_idx,
                'media_type': m_type,
                'like_count': likes
            }
            all_rows.append(row)
    
    df = pd.DataFrame(all_rows)
    # Drop rows with missing like_count if any
    df.dropna(subset=['like_count'], inplace=True)

    # One-hot encode media_type
    df = pd.get_dummies(df, columns=['media_type'], prefix='media_type')

    # Create a ratio feature: follower_count / (following_count + 1); the +1 avoids division by zero
    df['follower_following_ratio'] = df['follower_count'] / (df['following_count'] + 1)

    # Outlier handling: cap 'like_count' at 99th percentile
    cap_value = df['like_count'].quantile(0.99)
    df.loc[df['like_count'] > cap_value, 'like_count'] = cap_value

    return df

###############################################################################
# 2. DEFINE HELPER FUNCTION TO RUN RANDOMIZED SEARCH ON A GIVEN MODEL
###############################################################################
from sklearn.model_selection import KFold

def tune_and_evaluate_model(model, param_dist, X_train, y_train, X_test, y_test, model_name="Model", n_iter=50):
    """
    Runs RandomizedSearchCV on the given model + param_dist.
    - n_iter can be large (e.g., 50 or 100) if you have enough time (~4 hours).
    - Uses 5-fold CV, negative MSE as the scoring.
    
    Returns best_estimator, MSE, R^2 on test set.
    """
    # We'll do a simple pipeline: scaling + model
    numeric_transformer = Pipeline([('scaler', StandardScaler())])
    preprocessor = ColumnTransformer(
        transformers=[('num', numeric_transformer, list(X_train.columns))],
        remainder='drop'
    )
    
    pipe = Pipeline([
        ('preprocessor', preprocessor),
        ('regressor', model)
    ])
    
    # We'll use KFold CV with shuffle=True to get robust performance
    cv = KFold(n_splits=5, shuffle=True, random_state=42)
    
    random_search = RandomizedSearchCV(
        pipe,
        param_distributions=param_dist,
        n_iter=n_iter,
        scoring='neg_mean_squared_error',
        cv=cv,
        verbose=1,
        random_state=42,
        n_jobs=-1  # Use all available cores
    )
    
    random_search.fit(X_train, y_train)
    
    best_est = random_search.best_estimator_
    y_pred = best_est.predict(X_test)
    
    mse = mean_squared_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    
    print(f"\n[{model_name}] Best Params:")
    print(random_search.best_params_)
    print(f"[{model_name}] Test MSE: {mse:.4f}")
    print(f"[{model_name}] Test R^2:  {r2:.4f}")
    
    return best_est, mse, r2

###############################################################################
# 3. MAIN PIPELINE: LOAD DATA, (OPTIONAL) LOG-TRANSFORM, RUN MODELS
###############################################################################
def run_regression_pipeline(json_path='extracted_data.json'):
    df = load_and_prepare_data(json_path)

    # Features & Target
    # Drop 'username' (an identifier, not a numeric feature) along with the target
    X = df.drop(columns=['username', 'like_count'])
    y = df['like_count']

    # OPTIONAL: log-transform 'like_count' if beneficial
    # y = np.log1p(y)

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )
    
    # If log-transform is used above, remember to apply np.log1p to y_train, y_test
    # and for final predictions you'd do np.expm1(...) on the predicted values.

    # ========================
    # 3.1 Define BIG param grids
    # ========================

    # A) Random Forest
    rf_param_dist = {
        'regressor__n_estimators': [100, 300, 500],
        'regressor__max_depth': [None, 10, 20],
        'regressor__min_samples_split': [2, 5, 10, 20],
        'regressor__min_samples_leaf': [1, 3, 5],
        'regressor__max_features': [None, 'sqrt', 'log2', 0.5],
    }

    # B) XGBoost
    xgb_param_dist = {
        'regressor__n_estimators': [100, 300, 500],
        'regressor__max_depth': [3, 6, 10],
        'regressor__learning_rate': [0.01, 0.05, 0.1, 0.2],
        'regressor__subsample': [0.6, 0.8, 1.0],
        'regressor__colsample_bytree': [0.6, 0.8, 1.0],
        'regressor__gamma': [0, 0.5, 1, 2],
        'regressor__reg_alpha': [0, 1, 10],
        'regressor__reg_lambda': [1, 3, 10],
    }

    # C) LightGBM
    lgb_param_dist = {
        'regressor__n_estimators': [100, 300, 500, 1000],
        'regressor__max_depth': [-1, 10, 15, 30],
        'regressor__num_leaves': [31, 127, 255],
        'regressor__learning_rate': [0.01, 0.05, 0.1, 0.2],
        'regressor__subsample': [0.6, 0.8, 1.0],
        'regressor__colsample_bytree': [0.6, 0.8, 1.0],
        'regressor__reg_alpha': [0, 0.1, 1, 5, 10],
        'regressor__reg_lambda': [0, 3, 10],
    }

    # D) CatBoost
    cat_param_dist = {
        'regressor__iterations': [100, 300, 500, 1000],
        'regressor__depth': [4, 5, 6, 7, 8, 9, 10],
        'regressor__learning_rate': [0.01, 0.1, 0.15, 0.2],
        'regressor__l2_leaf_reg': [1, 5, 20],
        'regressor__subsample': [0.6, 0.8, 1.0],
    }

    # ========================
    # 3.2 Set up the models
    # ========================
    models = {
        "RandomForest": (
            RandomForestRegressor(random_state=42),
            rf_param_dist
        ),
        "XGBoost": (
            XGBRegressor(
                random_state=42,
                # use_label_encoder was a classifier-only option; recent XGBoost
                # versions report it as unused, so it is omitted here.
                eval_metric='rmse'
            ),
            xgb_param_dist
        ),
        "LightGBM": (
            LGBMRegressor(random_state=42),
            lgb_param_dist
        ),
        "CatBoost": (
            CatBoostRegressor(random_state=42, silent=True),
            cat_param_dist
        ),
    }
    
    # We'll store results for each model
    results = []

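    # Number of parameter combinations the random search samples per model.
    # Each candidate is cross-validated, so runtime grows linearly with this value;
    # 50 to 100 is feasible with a time budget of about 4 hours.
    N_ITER_SEARCH = 50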

    for model_name, (model, param_dist) in models.items():
        print(f"\n========== TUNING: {model_name} ==========")
        best_estimator, mse, r2 = tune_and_evaluate_model(
            model,
            param_dist,
            X_train,
            y_train,
            X_test,
            y_test,
            model_name=model_name,
            n_iter=N_ITER_SEARCH
        )
        results.append((model_name, best_estimator, mse, r2))

    # Sort by MSE ascending
    results.sort(key=lambda x: x[2])
    best_model_name, best_estimator, best_mse, best_r2 = results[0]
    
    print("\n======================================")
    print(f"** Best Model by MSE: {best_model_name} **")
    print(f"MSE: {best_mse:.4f}, R^2: {best_r2:.4f}")
    print("======================================")
    
    return best_estimator, results

###############################################################################
# 4. ENTRY POINT
###############################################################################
if __name__ == "__main__":
    best_model, all_results = run_regression_pipeline("extracted_data.json")



Fitting 5 folds for each of 50 candidates, totalling 250 fits

[RandomForest] Best Params:
{'regressor__n_estimators': 100, 'regressor__min_samples_split': 20, 'regressor__min_samples_leaf': 1, 'regressor__max_features': None, 'regressor__max_depth': None}
[RandomForest] Test MSE: 68289592.3657
[RandomForest] Test R^2:  0.8006

Fitting 5 folds for each of 50 candidates, totalling 250 fits


Parameters: { "use_label_encoder" } are not used.

[XGBoost] Best Params:
{'regressor__subsample': 0.8, 'regressor__reg_lambda': 10, 'regressor__reg_alpha': 0, 'regressor__n_estimators': 500, 'regressor__max_depth': 10, 'regressor__learning_rate': 0.05, 'regressor__gamma': 0, 'regressor__colsample_bytree': 0.6}
[XGBoost] Test MSE: 69212654.2728
[XGBoost] Test R^2:  0.7979

Fitting 5 folds for each of 50 candidates, totalling 250 fits
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.004163 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 802
[LightGBM] [Info] Number of data points in the train set: 146466, number of used features: 8
[LightGBM] [Info] Start training from score 4369.537770

[LightGBM] Best Params:
{'regressor__subsample': 0.8, 'regressor__reg_lambda': 10, 'regressor__reg_alpha': 0.1, 'regressor__num_leaves': 255, 'regressor__n_estimators': 100, 'regressor__max_depth': -1, '

9 fits failed out of a total of 250.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
2 fits failed with the following error:
Traceback (most recent call last):
  File "c:\Users\borab\AppData\Local\Programs\Python\Python312\Lib\site-packages\sklearn\model_selection\_validation.py", line 895, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "c:\Users\borab\AppData\Local\Programs\Python\Python312\Lib\site-packages\sklearn\base.py", line 1474, in wrapper
    return fit_method(estimator, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\borab\AppData\Local\Programs\Python\Python312\Lib\site-packages\sklearn\pipeline.py", line 475, in fit
    self._final_estimator.fit(Xt, y, **last_st


[CatBoost] Best Params:
{'regressor__subsample': 0.8, 'regressor__learning_rate': 0.2, 'regressor__l2_leaf_reg': 20, 'regressor__iterations': 1000, 'regressor__depth': 8}
[CatBoost] Test MSE: 70684150.5739
[CatBoost] Test R^2:  0.7936

** Best Model by MSE: RandomForest **
MSE: 68289592.3657, R^2: 0.8006
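
Because MSE is measured in squared likes, the scores above are easier to interpret as a root-mean-squared error in the target's own units. The sketch below converts only the test MSE values printed above; the LightGBM scores are cut off in the recorded output and are therefore left out. The dictionary names `reported_mse` and `reported_rmse` are illustrative and do not appear in the notebook.

```python
import math

# Test MSE values copied from the output above.
reported_mse = {
    "RandomForest": 68289592.3657,
    "XGBoost": 69212654.2728,
    "CatBoost": 70684150.5739,
}

# RMSE is the square root of MSE and is expressed in likes, like the target.
reported_rmse = {name: math.sqrt(mse) for name, mse in reported_mse.items()}

# Print the models from lowest to highest RMSE.
for name, rmse in sorted(reported_rmse.items(), key=lambda item: item[1]):
    print(f"{name}: RMSE ~ {rmse:.0f} likes")
```

For context, LightGBM's log reports a starting score of about 4370, which for a squared-error objective is typically the mean of the training target. An RMSE of roughly 8264 likes is therefore large relative to a typical post, which suggests the error may be dominated by a small number of posts with very high like counts.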


### Prediction Script for Regression Model

This cell implements a script for loading a trained regression model, preparing test data, making predictions, and saving the results in a structured JSON format.
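
The loading step assumes that `best_model.pkl` was written with `pickle.dump` by the training code, which is not shown in this section. A minimal sketch of that round trip follows; it uses a plain dictionary as a stand-in for the fitted pipeline and a temporary directory instead of the real path, both of which are assumptions made for illustration.

```python
import os
import pickle
import tempfile

# Stand-in for the fitted pipeline; any picklable object behaves the same way.
stand_in_model = {"name": "RandomForest", "test_r2": 0.8006}

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "best_model.pkl")

    # Save: pickle requires the file to be opened in binary write mode.
    with open(model_path, "wb") as f:
        pickle.dump(stand_in_model, f)

    # Load: mirrors what load_trained_model() does with the real file.
    with open(model_path, "rb") as f:
        restored = pickle.load(f)

print(restored == stand_in_model)
```

Unpickling a real pipeline additionally requires that the libraries defining its classes (for example scikit-learn) are importable in the loading environment, ideally in the same versions used for training.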

---

1. **Function 1: `load_trained_model`**:
   - **Purpose**: Loads a pre-trained regression model from a pickle file.
   - **Key Steps**:
     - Reads the model file (e.g., `best_model.pkl`) using Python's `pickle` library.
     - Returns the deserialized model object.

2. **Function 2: `prepare_test_data`**:
   - **Purpose**: Processes the test dataset to match the structure used during training.
   - **Key Steps**:
     - Reads the test data file (`test-regression-round3.jsonl`) containing posts with features such as `media_type`, `timestamp`, and `id`.
     - Maps `media_type` values to numeric codes and applies one-hot encoding (`media_type_0`, `media_type_1`, `media_type_2`).
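     - Derives `time_index` as `day_of_week * 24 + hour` from `timestamp` (falling back to 0 when the timestamp is missing or unparseable) and sets `is_private`, `follower_count`, `following_count`, and `follower_following_ratio` to 0, since the script does not read profile fields from the test records.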
     - Ensures all required columns (e.g., one-hot-encoded features) exist in the output DataFrame.
     - Outputs a processed DataFrame ready for prediction.

3. **Function 3: `predict_like_count`**:
   - **Purpose**: Predicts the `like_count` for posts using the trained regression model.
   - **Key Steps**:
     - Extracts feature columns from the test DataFrame.
     - Uses the model's `predict` method to generate predictions.
     - Returns a dictionary mapping `post_id` to the predicted `like_count`.

4. **Function 4: `main`**:
   - **Purpose**: Executes the entire prediction workflow.
   - **Key Steps**:
     - Loads the trained model (`best_model.pkl`).
     - Prepares the test data from `test-regression-round3.jsonl`.
     - Makes predictions and stores them in a dictionary.
     - Saves the predictions as a JSON object in `prediction-regression-round3.json`.

5. **Execution (`if __name__ == "__main__"`)**:
   - Calls the `main` function to perform predictions.
   - Outputs the prediction results to `prediction-regression-round3.json`.
   - Provides debug information during test data preparation for verification purposes.
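
As a worked example of the `time_index` feature described in step 2, the sketch below applies the same formula (`day_of_week * 24 + hour`) to the sample timestamp from the script's docstring. The helper name `time_index_from_timestamp` is illustrative and is not part of the script.

```python
from datetime import datetime


def time_index_from_timestamp(time_str, fmt="%Y-%m-%d %H:%M:%S"):
    """Return day_of_week * 24 + hour, or 0 if the timestamp is missing or malformed."""
    if not time_str:
        return 0
    try:
        dt = datetime.strptime(time_str, fmt)
    except ValueError:
        return 0
    # weekday(): Monday=0 ... Sunday=6, so the index ranges over 0..167.
    return dt.weekday() * 24 + dt.hour


# 2023-11-01 is a Wednesday (weekday 2), so the index is 2 * 24 + 12 = 60.
print(time_index_from_timestamp("2023-11-01 12:43:50"))  # 60
print(time_index_from_timestamp("not a timestamp"))      # 0
```

Note that the fallback value 0 coincides with the valid index for Monday between 00:00 and 00:59, so unparseable timestamps are indistinguishable from posts made in that hour.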



In [22]:
# predict_regression_round3.py

import json
import pickle
import numpy as np
import pandas as pd
from datetime import datetime

def load_trained_model(model_path="best_model.pkl"):
    """
    Loads the trained regression model (pipeline) from disk.
    """
    with open(model_path, "rb") as f:
        model = pickle.load(f)
    return model


def prepare_test_data(input_file="test-regression-round3.jsonl"):
    """
    Expects lines (or a file) shaped like:
    
        {
          "id": "17893302182329646",
          "caption": "...",
          "comments_count": 0,
          "media_type": "IMAGE",   # or "VIDEO", "CAROUSEL_ALBUM"
          "media_url": "...",
          "timestamp": "2023-11-01 12:43:50",
          "username": "some_user"
        }

    We produce columns that match training, specifically:
      - post_id => from record["id"]
      - is_private => 0 if not in test data
      - follower_count, following_count => 0 if not in test data
      - time_index => day_of_week * 24 + hour from 'timestamp'
      - follower_following_ratio => same logic from training
      - media_type => numeric code => get_dummies => media_type_0,1,2
    """

    # Map from string-based media_type to numeric code used in training
    media_type_map = {
        "VIDEO": 0,
        "IMAGE": 1,
        "CAROUSEL_ALBUM": 2
    }

    all_rows = []

    with open(input_file, "r", encoding="utf-8") as fh:
        for line_num, line in enumerate(fh, start=1):
            line = line.strip()
            if not line:
                continue  # skip blank lines, e.g. a trailing newline at end of file
            record = json.loads(line)

            post_id = record.get("id", f"line_{line_num}")

            # Convert string "IMAGE"/"VIDEO"/"CAROUSEL_ALBUM" to numeric
            raw_media_type = record.get("media_type", "IMAGE")
            numeric_mtype = media_type_map.get(raw_media_type, 1)  # fallback=1 => IMAGE

            # Build time_index from 'timestamp' => day_of_week*24 + hour
            time_str = record.get("timestamp", "")
            if time_str:
                try:
                    dt = datetime.strptime(time_str, "%Y-%m-%d %H:%M:%S")
                    day_of_week = dt.weekday()  # Monday=0, Sunday=6
                    hour = dt.hour
                    time_index = day_of_week * 24 + hour
                except ValueError:
                    time_index = 0
            else:
                time_index = 0

            # Profile fields are not read from the test records; use 0 for all of them
            follower_count = 0
            following_count = 0
            ratio = follower_count / (following_count + 1)
            is_private = 0

            row = {
                "post_id": post_id,
                "is_private": is_private,
                "follower_count": follower_count,
                "following_count": following_count,
                "time_index": time_index,
                "follower_following_ratio": ratio,
                "media_type": numeric_mtype
            }
            all_rows.append(row)

    df = pd.DataFrame(all_rows)
    print("DEBUG - final df columns before get_dummies:", df.columns.tolist())
    print("DEBUG - sample rows:\n", df.head(5))

    # One-hot encode the numeric "media_type"
    df = pd.get_dummies(df, columns=["media_type"], prefix="media_type")

    # Ensure columns for all 3 possible types: media_type_0, media_type_1, media_type_2
    for col_name in ["media_type_0", "media_type_1", "media_type_2"]:
        if col_name not in df.columns:
            df[col_name] = 0

    print("DEBUG - final df columns after get_dummies:", df.columns.tolist())
    print("DEBUG - sample rows post-dummies:\n", df.head(5))

    return df


def predict_like_count(model, df):
    """
    Input:
      - model: the fitted pipeline/regressor
      - df: DataFrame with columns used for training + 'post_id'
    Returns a dict { post_id: predicted_like_count }
    """
    if "post_id" not in df.columns:
        raise KeyError("No column 'post_id' found in the DataFrame.")

    post_ids = df["post_id"]
    feature_cols = [c for c in df.columns if c != "post_id"]

    X_test = df[feature_cols]
    y_pred = model.predict(X_test)

    # If log transform was used, do y_pred = np.expm1(y_pred)

    results = {}
    for i, pid in enumerate(post_ids):
        results[str(pid)] = int(y_pred[i])
    return results


def main():
    # 1) Load the best model
    model = load_trained_model("best_model.pkl")

    # 2) Prepare test data from a file named "test-regression-round3.jsonl"
    df_test = prepare_test_data("test-regression-round3.jsonl")

    # 3) Predict
    results_dict = predict_like_count(model, df_test)

    # 4) Write a single JSON object to "prediction-regression-round3.json"
    with open("prediction-regression-round3.json", "w", encoding="utf-8") as f:
        # indent=4 ensures each post_id is on its own line
        json.dump(results_dict, f, indent=4)

    print("Predictions saved to prediction-regression-round3.json")


if __name__ == "__main__":
    main()


DEBUG - final df columns before get_dummies: ['post_id', 'is_private', 'follower_count', 'following_count', 'time_index', 'follower_following_ratio', 'media_type']
DEBUG - sample rows:
              post_id  is_private  follower_count  following_count  time_index  \
0  18299464882193238           0               0                0          60   
1  17870639199008459           0               0                0          11   
2  17976060503438195           0               0                0          13   
3  17980348256173250           0               0                0          10   
4  18030944311530609           0               0                0          85   

   follower_following_ratio  media_type  
0                       0.0           2  
1                       0.0           0  
2                       0.0           1  
3                       0.0           2  
4                       0.0           1  
DEBUG - final df columns after get_dummies: ['post_id', 'is_private', 'foll