# Demo: Using `littlelogger` with Parallel Tools

`GridSearchCV` and other parallel tools use `joblib` to run tasks on multiple CPU cores at once. When several processes append to the same log file at the same time, their writes can interleave and corrupt the file. This is a classic race condition.

`littlelogger` solves this by appending the **Process ID (PID)** to the log file name (for example, `parallel_run.jsonl.14620`). Each parallel worker process has its own PID, so each one writes to its *own* file.

This notebook demonstrates this feature using `joblib` directly.
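To make the PID-suffix idea concrete, here is a minimal sketch of a decorator that writes one JSON line per call to `<log_file>.<pid>`. It is an illustration only: `pid_log_run` is a hypothetical name, and the record fields are assumptions modelled on the log columns shown later in this notebook, not `littlelogger`'s actual implementation.

```python
import functools
import json
import os
import time


def pid_log_run(log_file):
    """Hypothetical stand-in for a PID-aware logging decorator (not littlelogger's code)."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(**params):
            start = time.perf_counter()
            metrics = func(**params)
            record = {
                "function_name": func.__name__,
                "runtime_seconds": time.perf_counter() - start,
                "params": params,
                "metrics": metrics,
            }
            # Each process appends to its own file: <log_file>.<pid>
            path = f"{log_file}.{os.getpid()}"
            with open(path, "a", encoding="utf-8") as fh:
                fh.write(json.dumps(record) + "\n")
            return metrics
        return wrapper
    return decorator
```

Because the file name is computed inside the wrapper at call time, it picks up the PID of whichever process actually runs the function, not the PID of the process that defined it.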

### 1. Setup

First, we import our tools, set the base log file name, and remove any log files left over from earlier runs.

In [1]:
import os
import glob
import time
import pandas as pd

# Import joblib for parallel processing
from joblib import Parallel, delayed

# Import our logger
from littlelogger import log_run

# Define a base log file name
LOG_FILE = "parallel_run.jsonl"

# --- Clean up old log files for this demo ---
print("Cleaning up old logs...")
for f in glob.glob(f"{LOG_FILE}.*"):
    os.remove(f)
    print(f"Removed {f}")

Cleaning up old logs...


### 2. Define the Decorated Function

We decorate our `train_model` function just like before. We'll add a `time.sleep(1)` to simulate real work.

In [2]:
@log_run(log_file=LOG_FILE)
def train_model(learning_rate, n_estimators):
    """A mock training function that sleeps for 1 sec."""
    print(f"  Running model with lr={learning_rate}, n_estimators={n_estimators}...")

    # Simulate 1 second of work
    time.sleep(1)

    # Calculate a mock F1 score from the parameters
    f1 = 0.8 + (learning_rate * 0.1) - (n_estimators * 0.0001)

    print(f"  Finished model with lr={learning_rate}. F1 = {f1:.4f}")
    return {"f1_score": round(f1, 4)}


### 3. Define the Parameter Grid

Each entry is one parameter combination, similar to the combinations `GridSearchCV` would generate from its `param_grid`.

In [3]:
param_grid = [
    {'learning_rate': 0.1, 'n_estimators': 100},
    {'learning_rate': 0.1, 'n_estimators': 200},
    {'learning_rate': 0.05, 'n_estimators': 100},
    {'learning_rate': 0.05, 'n_estimators': 200},
    {'learning_rate': 0.01, 'n_estimators': 300},
    {'learning_rate': 0.01, 'n_estimators': 500},
]

print(f"Created a grid of {len(param_grid)} experiments.")

Created a grid of 6 experiments.
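The grid above is written out by hand. As a sketch of how such a list can be generated from a dict of lists, the hypothetical helper `expand_grid` below builds the Cartesian product with `itertools.product`. scikit-learn's `ParameterGrid` does a similar expansion. Note that the last two entries of the hand-written grid are not part of a full product, so this reproduces only the first four combinations.

```python
from itertools import product


def expand_grid(grid):
    """Expand a dict of lists into a list of keyword-argument dicts."""
    keys = list(grid)
    return [dict(zip(keys, values)) for values in product(*(grid[k] for k in keys))]


combos = expand_grid({"learning_rate": [0.1, 0.05], "n_estimators": [100, 200]})
print(len(combos))  # 4
print(combos[0])    # {'learning_rate': 0.1, 'n_estimators': 100}
```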


### 4. Run in Parallel!

We'll use `joblib` to run all 6 experiments in parallel. `n_jobs=-1` tells it to use all available CPU cores.

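(If the 6 jobs finish in noticeably less than 6 seconds, you know they ran in parallel! Starting the worker processes adds some overhead, so the total is usually more than the ideal 1 second.)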

In [4]:
print("--- Starting parallel run... ---")

start_time = time.time()

Parallel(n_jobs=-1)(
    delayed(train_model)(**params) for params in param_grid
)

end_time = time.time()
print(f"\n--- Parallel run finished in {end_time - start_time:.2f} seconds --- ")

--- Starting parallel run... ---

--- Parallel run finished in 4.28 seconds --- 
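As a rough guide to what timing to expect, the sketch below computes an idealized lower bound: with fewer workers than tasks, the tasks run in waves of equal length. The helper `ideal_wall_time` is hypothetical and ignores process startup and scheduling overhead, which is most likely why the 4.28 seconds measured above is higher than the ideal.

```python
import math


def ideal_wall_time(n_tasks, n_workers, seconds_per_task=1.0):
    """Lower bound on wall time when tasks run in equal-length waves."""
    waves = math.ceil(n_tasks / n_workers)
    return waves * seconds_per_task


for workers in (1, 2, 4, 6):
    print(workers, ideal_wall_time(6, workers))
# 1 6.0
# 2 3.0
# 4 2.0
# 6 1.0
```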


### 5. See the Result: Multiple Log Files

Now, look in your directory. `littlelogger` has created a separate file for each worker process, with the PID as the file name suffix. Since no two processes shared a file, their writes could not interleave.

In [5]:
log_files = glob.glob(f"{LOG_FILE}.*")

print(f"Found {len(log_files)} log files:")
for f in log_files:
    print(f) 

Found 6 log files:
parallel_run.jsonl.14620
parallel_run.jsonl.15996
parallel_run.jsonl.2016
parallel_run.jsonl.23292
parallel_run.jsonl.4532
parallel_run.jsonl.4772
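Seeing one file per PID shows that the writers were separated. To check the contents as well, one option is to parse every line of every file. The sketch below uses only the standard library; `count_valid_lines` and `check_logs` are hypothetical helper names, not part of `littlelogger`.

```python
import glob
import json


def count_valid_lines(path):
    """Return (valid, invalid) counts of JSON lines in one log file."""
    valid = invalid = 0
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            if not line.strip():
                continue  # ignore blank lines
            try:
                json.loads(line)
                valid += 1
            except json.JSONDecodeError:
                invalid += 1
    return valid, invalid


def check_logs(pattern):
    """Map each matching file to its (valid, invalid) line counts."""
    return {path: count_valid_lines(path) for path in sorted(glob.glob(pattern))}
```

In this notebook, `check_logs(f"{LOG_FILE}.*")` should report zero invalid lines for every file if the logs are intact.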


### 6. The Payoff: Combine and Analyze

Here is the simple 3-line pattern to combine all these files into one master `pandas` DataFrame for analysis.

In [6]:
# 1. Find all log files that match the pattern
log_files = glob.glob(f"{LOG_FILE}.*")

# 2. Read them all into a list of DataFrames
df_list = [pd.read_json(f, lines=True) for f in log_files]

# 3. Combine them into one master DataFrame!
df_raw = pd.concat(df_list, ignore_index=True)

print(f"Combined {len(log_files)} log files into a single DataFrame with {len(df_raw)} runs.")
df_raw.head()

Combined 6 log files into a single DataFrame with 6 runs.


|   | timestamp | function_name | runtime_seconds | params | metrics |
|---|---|---|---|---|---|
| 0 | 2025-11-16 16:47:49+00:00 | train_model | 1.000443 | {'learning_rate': 0.05, 'n_estimators': 200} | {'f1_score': 0.785} |
| 1 | 2025-11-16 16:47:49+00:00 | train_model | 1.000705 | {'learning_rate': 0.01, 'n_estimators': 500} | {'f1_score': 0.751} |
| 2 | 2025-11-16 16:47:49+00:00 | train_model | 1.000741 | {'learning_rate': 0.01, 'n_estimators': 300} | {'f1_score': 0.771} |
| 3 | 2025-11-16 16:47:49+00:00 | train_model | 1.00119 | {'learning_rate': 0.1, 'n_estimators': 200} | {'f1_score': 0.79} |
| 4 | 2025-11-16 16:47:49+00:00 | train_model | 1.000646 | {'learning_rate': 0.05, 'n_estimators': 100} | {'f1_score': 0.795} |
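The combined DataFrame no longer shows which worker produced which row. If that matters, a sketch like the following reads the same files with the standard library and tags each record with the PID taken from the file name suffix. `load_records` and the `worker_pid` field are hypothetical names introduced for illustration.

```python
import glob
import json


def load_records(base_name):
    """Read every <base_name>.<pid> file and tag each record with its PID."""
    records = []
    for path in sorted(glob.glob(f"{base_name}.*")):
        suffix = path.rsplit(".", 1)[1]
        if not suffix.isdigit():
            continue  # skip files that do not end in a numeric PID
        with open(path, encoding="utf-8") as fh:
            for line in fh:
                if line.strip():
                    record = json.loads(line)
                    record["worker_pid"] = int(suffix)
                    records.append(record)
    return records
```

Passing the result to `pd.DataFrame(load_records(LOG_FILE))` gives a DataFrame with an extra `worker_pid` column. It also handles the case where no log files exist: an empty list yields an empty DataFrame, whereas `pd.concat` raises a `ValueError` on an empty list.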


### 7. Flatten and Find the Best Model

Now that we have our combined `df_raw`, we can analyze it just like we did in the simple demo.

In [None]:
# Flatten the 'params' and 'metrics' columns
df_params = pd.json_normalize(df_raw['params']).add_prefix('param_')
df_metrics = pd.json_normalize(df_raw['metrics']).add_prefix('metric_')

# Join them all together
df_analysis = pd.concat([
    df_raw.drop(['params', 'metrics'], axis=1), 
    df_params, 
    df_metrics
], axis=1)

# Sort by F1 score to find the best run!
df_sorted = df_analysis.sort_values(by="metric_f1_score", ascending=False)

df_sorted
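To see what the flattening step does without pandas, here is a small pure-Python sketch of the same idea: nested `params` and `metrics` dicts become top-level keys with `param_` and `metric_` prefixes. `flatten_record` is a hypothetical helper, and the two sample records reuse values from the combined log shown earlier.

```python
def flatten_record(record):
    """Copy a log record, expanding 'params' and 'metrics' into prefixed keys."""
    flat = {k: v for k, v in record.items() if k not in ("params", "metrics")}
    for key, value in record.get("params", {}).items():
        flat[f"param_{key}"] = value
    for key, value in record.get("metrics", {}).items():
        flat[f"metric_{key}"] = value
    return flat


runs = [
    {"function_name": "train_model",
     "params": {"learning_rate": 0.05, "n_estimators": 200},
     "metrics": {"f1_score": 0.785}},
    {"function_name": "train_model",
     "params": {"learning_rate": 0.05, "n_estimators": 100},
     "metrics": {"f1_score": 0.795}},
]

# Pick the run with the highest F1 score
best = max((flatten_record(r) for r in runs), key=lambda r: r["metric_f1_score"])
print(best["param_n_estimators"], best["metric_f1_score"])  # 100 0.795
```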