# Roman Microlensing Data Challenge 2: Nexus Workflow
***
## Learning Goals

By the end of this tutorial, you, the participant, will be able to:

- Load official Data Challenge light curve data from cloud storage.
- Install required software in the Roman Research Nexus (RRN).
- Initialize a submission project using the `microlens-submit` tool.
- Perform a microlensing model fit using `MulensModel`.
- Package your model results, plots, and notes into a standardized solution file.
- Validate and export your final submission for evaluation.

## Introduction

This notebook provides the complete, end-to-end workflow for submitting an entry to the Roman Microlensing Data Challenge 2. Following these steps is mandatory for a valid submission.

The process involves accessing data directly from the cloud, performing your analysis within the RRN, and using the `microlens-submit` package to standardize your results. This ensures a level playing field and allows the evaluation committee to process all submissions efficiently.

### Important Links
- **Microlens-Submit Documentation:** [Read the Docs](https://microlens-submit.readthedocs.io/en/latest/)
- **RRN Teams & Servers:** For information on setting up a collaborative team server, please see the teams page
  <img src="../../../images/icons/team.svg" style="vertical-align: bottom; width:1.5em; margin-right:0.5em;"/> [Working on a Team](../../../markdown/teams.md)
- **RRN Software Guide:** [Installing Extra Software](./software.md)


## 1. Setup: Installing Required Libraries

First, we install the necessary Python packages into our environment. This command reads the `requirements.txt` file you should have placed in this same directory.

In [None]:
!pip install -r requirements.txt

## 2. Imports

Now we import all the libraries we'll need for this workflow.

In [None]:
import microlens_submit
import MulensModel
import s3fs
import numpy as np
import matplotlib.pyplot as plt
from pathlib import Path
import time
import emcee
import multiprocessing as mp
from multiprocessing.pool import ThreadPool
from IPython.display import HTML

## 3. Data Access

We will now load a light curve directly from the challenge's S3 bucket. You do not need to download anything. We will use `s3fs` to stream the data directly into memory, as shown in the `data_discovery_and_access.md` guide.

In [None]:
# This is the official, secure path to the data challenge files.
TIER = "2018-test"
EVENT_ID = "EVENT001"
DATA_URI = f"s3://roman-dc2-data-public/{TIER}/{EVENT_ID}.dat"

# Establish connection to S3
fs = s3fs.S3FileSystem(anon=True)

# Load the data with header and band column
print(f"Loading data from {DATA_URI}...")
with fs.open(DATA_URI, 'r') as f:
    data = np.genfromtxt(f, names=True, dtype=None, encoding=None)

# Get unique bands
bands = np.unique(data['band'])
print(f"Bands found: {bands}")

# Plot each band separately
plt.figure(figsize=(8, 5))
for band in bands:
    mask = data['band'] == band
    plt.errorbar(
        data['HJD'][mask], data['mag'][mask], yerr=data['mag_err'][mask],
        fmt='.', label=f'Band {band}', alpha=0.7
    )

plt.gca().invert_yaxis()
plt.title(f"{EVENT_ID} Light Curve by Band")
plt.xlabel("HJD")
plt.ylabel("Magnitude")
plt.legend(title="Band")
plt.show()
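
To see what `np.genfromtxt(..., names=True)` returns without touching S3, here is a minimal sketch that parses a small in-memory table. The column names match those used above (`HJD`, `mag`, `mag_err`, `band`); the numeric values and band labels are made up for illustration and are not challenge data.

```python
import io

import numpy as np

# Hypothetical stand-in for a challenge file: a header row plus four epochs.
text = """HJD mag mag_err band
2458234.5 18.20 0.02 W146
2458235.5 18.10 0.02 W146
2458236.5 19.40 0.05 Z087
2458237.5 17.90 0.02 W146
"""

# names=True reads the header row; dtype=None infers a type per column.
table = np.genfromtxt(io.StringIO(text), names=True, dtype=None, encoding=None)

# The result is a NumPy structured array: index by column name, mask by row.
w146 = table[table["band"] == "W146"]
print(table.dtype.names)
print(len(w146), "epochs in band W146")
```

Because the result is a structured array rather than a DataFrame, columns such as `table["HJD"]` are already plain NumPy arrays.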

## 4. Initialize Your Submission Project

To initialize a submission object in Python, use `microlens_submit.load(project_path)`.
If the directory at `project_path` does not exist, it will be created with the correct structure and a blank submission.
You can then set the required metadata (like `team_name`, `tier`, etc.) on the returned `Submission` object, and call `.save()` to persist it.

In [None]:
# Define your project directory (here, relative to the current working directory)
project_path = Path("./my_dc2_submission")
# To locate your submission project in the current working directory explicitly, use:
# project_path = Path.cwd() / "my_dc2_submission"
# To locate your submission project in your home directory, use:
# project_path = Path.home() / "my_dc2_submission"

# Now, load the project into our session
submission = microlens_submit.load(project_path)  # creates the project directory if it doesn't exist
submission.team_name = "The Transiting Poachers"
submission.tier = TIER
print(f"\nProject for '{submission.team_name}' loaded successfully.")

# You can expect saving at this point to result in warnings about missing info
submission.save()

After initializing or loading your microlens-submit project and initializing your linked git repository, set the `repo_url` attribute on your `Submission`. If the directory already exists, `load()` loads the existing project rather than overwriting it. If the directory exists but contains no submission content, the content is added when you run `submission.save()`.

In [None]:
# initialize your git repo
!git init
!git add .
!git commit -m "Initial commit"
!git branch -m main
!git remote add origin https://github.com/yourusername/your-repo.git
!git push -u origin main

# set the repo url in the submission object
submission.repo_url = "https://github.com/yourusername/your-repo.git"

> You may leave this repository private during the data challenge, but we ask that you make it public once submissions close, to ensure it is accessible to evaluators.

If you are using the Nexus and want to auto-populate your hardware info (as the CLI `nexus-init` command does), you can call:

In [None]:
# add your hardware info
submission.autofill_nexus_info()

This will attempt to detect and fill in hardware details if running in the Roman Research Nexus environment, but you can always set or override any values manually.

```python
# add your hardware info
submission.hardware_info = {
    "cpu_model": "Intel(R) Xeon(R) CPU E5-2670 v3 @ 2.30GHz",
    "num_cores": 16,
    "memory_gb": 64,
    "platform": "Linux",
    # ...any other relevant info
}
```

A valid submission will include the attributes:
* `team_name` to know who the submission belongs to
* `tier` to validate event IDs
* `repo_url` for reproducibility and evaluation
* `hardware_info` for benchmarking purposes

You can continue to edit your project without all of this information, but it will not be "valid" and you will not be able to submit. You can check the validity of your entire project at any time using `submission.run_validation()`.
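
As a rough illustration of what "valid" means for the four attributes listed above, the sketch below checks a plain dictionary for missing fields. The helper `missing_metadata` is hypothetical and is not part of `microlens-submit`; the tool's own check is `submission.run_validation()`.

```python
# Hypothetical pre-flight check; it does not replace submission.run_validation().
REQUIRED_FIELDS = ("team_name", "tier", "repo_url", "hardware_info")


def missing_metadata(metadata):
    """Return the required fields that are absent or empty in `metadata`."""
    return [name for name in REQUIRED_FIELDS if not metadata.get(name)]


draft = {"team_name": "The Transiting Poachers", "tier": "2018-test"}
print(missing_metadata(draft))  # → ['repo_url', 'hardware_info']
```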

In [None]:
# save the project
submission.save()

The result is a project directory with the following structure:

```
   <project_dir>/
   ├── submission.json
   └── events/
```

As you add events, solutions, notes, and figures, they will be populated inside this project as follows:

```
   <project_dir>/
   ├── dossier/
   │   ├── index.html
   │   ├── <event_id>.html
   │   ├── <solution_id>.html
   │   ├── full_dossier_report.html
   │   └── assets/
   ├── submission.json
   └── events/
       └── <event_id>/
           ├── event.json
           └── solutions/
               ├── <lightcurve image>  # your generated lightcurve plots (optional)
               ├── <lens plane image>  # your generated lens-plane diagram (optional)
               ├── <posteriors>.csv    # your generated posteriors (optional)
               ├── <solution_id>.md    # your notes (optional)
               └── <solution_id>.json
```

> A note for later: We recommend you save generated lightcurve plots, lens-plane diagrams, and notes according to this layout, but the submission tool accepts any save location, so long as you record each file's path relative to your `<project_dir>` in the `<solution_id>.json` (set through the `microlens_submit` tool or manually).
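
A minimal sketch of turning a full file path into a project-relative one with `pathlib`. The directory and file names here are placeholders chosen for illustration.

```python
from pathlib import Path

# Placeholder locations; substitute your own project directory and file.
project_dir = Path("my_dc2_submission")
plot_file = project_dir / "events" / "EVENT001" / "solutions" / "lightcurve.png"

# relative_to() strips the project prefix; as_posix() uses forward slashes.
relative = plot_file.relative_to(project_dir).as_posix()
print(relative)  # → events/EVENT001/solutions/lightcurve.png
```

`relative_to()` raises `ValueError` if the file is not inside the project directory, which is a useful early warning that a path will not resolve for evaluators.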

## 5. Microlensing Model Fitting

This is where the science happens. Define your `MulensModel` and use your preferred fitting algorithm (e.g., MCMC) to find the best-fit parameters for the event data. 

### 5.1. Define the Model
As an example, we are going to do a preliminary fit using only one band and a point-source point-lens (PSPL) model.

In [None]:
# 5.1. Define the model (e.g., Point Source Point Lens)
# Use only the first band for this preliminary fit
first_band = bands[0]
band_mask = data['band'] == first_band
band_data = data[band_mask]

# Create MulensModel data object for the first band
# (band_data is a NumPy structured array, so its columns are plain ndarrays)
my_data = MulensModel.MulensData(
    data_list=[band_data['HJD'],
               band_data['mag'],
               band_data['mag_err']],
    phot_fmt='mag',
    bandpass='H'
)

# Initial model parameters (no parallax for preliminary fit)
# Start t_0 at the brightest (minimum-magnitude) epoch
initial_params = {
    't_0': float(band_data['HJD'][np.argmin(band_data['mag'])]),
    'u_0': 0.1,
    't_E': 25.,
}

pspl_model = MulensModel.Model(initial_params)

# Create the event object
event_object = MulensModel.Event(datasets=my_data, model=pspl_model)
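
For intuition about what the three parameters control, the standard point-source point-lens (Paczyński) magnification can be written in a few lines. This is a standalone sketch for illustration, not how `MulensModel` is implemented internally; the epochs used are arbitrary.

```python
import math


def pspl_magnification(t, t_0, u_0, t_E):
    """Point-source point-lens magnification at time t."""
    tau = (t - t_0) / t_E                # time from peak in units of t_E
    u = math.sqrt(u_0**2 + tau**2)       # lens-source separation (Einstein radii)
    return (u**2 + 2.0) / (u * math.sqrt(u**2 + 4.0))


# Same starting guesses as above for u_0 and t_E; t_0 = 100 is arbitrary.
peak = pspl_magnification(t=100.0, t_0=100.0, u_0=0.1, t_E=25.0)
baseline = pspl_magnification(t=1100.0, t_0=100.0, u_0=0.1, t_E=25.0)
print(f"{peak:.2f}")  # → 10.04
print(f"{baseline:.4f}")
```

A smaller `u_0` gives a higher peak, `t_E` sets the width of the event, and far from `t_0` the magnification tends to 1.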

### 5.2. Setting up the Fitting Procedure

We will use `emcee` to drive this fit, so we can demonstrate more attributes of the submission object, like the `posterior_path` and `cpu_time` vs `wall_time`.

In [None]:
# 5.2. Define the likelihood function for emcee
def log_likelihood(theta, event_object, parameters_to_fit):
    """Log likelihood function for emcee"""
    try:
        # Update model parameters
        for i, param_name in enumerate(parameters_to_fit):
            if param_name == 'log_rho':
                # Convert log_rho back to rho
                event_object.model.parameters.rho = 10**theta[i]
            else:
                setattr(event_object.model.parameters, param_name, theta[i])
        
        # Fit the fluxes given the current model parameters
        event_object.fit_fluxes()
        
        # Get the source and blend fluxes
        ([F_S], F_B) = event_object.get_flux_for_dataset(event_object.datasets[0])
        
        # Calculate chi-squared
        chi2 = event_object.get_chi2()
        
        # Add flux priors if needed (optional)
        penalty = 0.0
        if F_B <= 0:
            penalty = ((F_B / 100)**2)  # Penalize negative blend flux
        if F_S <= 0 or (F_S + F_B) <= 0:
            return -np.inf  # Reject non-physical fluxes

        return -0.5 * chi2 - penalty
    except Exception:
        # Treat any model evaluation failure as zero probability
        return -np.inf

# Set up emcee
nwalkers = 32
ndim = 3  # t_0, u_0, t_E
nsteps = 1000
burnin = 200
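
The likelihood above relies on a linear flux fit followed by a chi-squared evaluation. The sketch below shows that idea in isolation for a single dataset, assuming the observed flux is `F_S * A + F_B`. The function name `linear_flux_fit` and the numbers are illustrative assumptions; this is not `MulensModel`'s implementation.

```python
def linear_flux_fit(magnification, flux, flux_err):
    """Weighted least-squares solution of flux = F_S * A + F_B."""
    w = [1.0 / e**2 for e in flux_err]
    s_1 = sum(w)
    s_a = sum(wi * a for wi, a in zip(w, magnification))
    s_aa = sum(wi * a * a for wi, a in zip(w, magnification))
    s_f = sum(wi * f for wi, f in zip(w, flux))
    s_af = sum(wi * a * f for wi, a, f in zip(w, magnification, flux))
    det = s_aa * s_1 - s_a**2
    f_s = (s_af * s_1 - s_a * s_f) / det
    f_b = (s_aa * s_f - s_a * s_af) / det
    return f_s, f_b


# Noise-free synthetic data generated with F_S = 3.0 and F_B = 1.5.
A = [1.0, 2.0, 5.0, 10.0]
flux = [3.0 * a + 1.5 for a in A]
err = [0.1] * len(A)

f_s, f_b = linear_flux_fit(A, flux, err)
chi2 = sum(((f - (f_s * a + f_b)) / e) ** 2 for f, a, e in zip(flux, A, err))
log_like = -0.5 * chi2   # Gaussian log-likelihood up to an additive constant
print(f_s, f_b, log_like)
```

Because the fluxes enter linearly, they can be solved for exactly at every MCMC step, which leaves only the non-linear parameters (`t_0`, `u_0`, `t_E`) for the sampler.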

### 5.3. Performing the Fit

We will use multithreading with `ThreadPool` from the `multiprocessing` library for this fit, because it is more compatible with notebooks. However, it caches the state of your notebook and can cause unexpected behaviours when you edit code related to your fit and re-run. For production fits, it is better to run a script and use process-based parallelism with `multiprocessing.Pool`.

In [None]:
# 5.3. Run MCMC fitting
print("Starting MCMC fit...")
start_time = time.time()
start_cpu = time.process_time()

# Set up parameters to fit
parameters_to_fit = ["t_0", "u_0", "t_E"]
ndim = len(parameters_to_fit)

# Initial positions (slightly perturbed from initial guess)
pos = np.array([
    initial_params['t_0'], initial_params['u_0'], initial_params['t_E']
]) + 1e-4 * np.random.randn(nwalkers, ndim)

# Use ThreadPool instead of Pool for notebooks
n_cores = mp.cpu_count()
print(f"Using {n_cores} threads for MCMC")

# Create the thread pool
with ThreadPool(processes=n_cores) as pool:
    sampler = emcee.EnsembleSampler(
        nwalkers, ndim, log_likelihood, 
        args=(event_object, parameters_to_fit),
        pool=pool
    )
    sampler.run_mcmc(pos, nsteps, progress=True)

# Calculate timing
wall_time_hours = (time.time() - start_time) / 3600
cpu_time_hours = (time.process_time() - start_cpu) / 3600

# Get "best fit" parameters (median of posterior)
# get_chain() discards the burn-in steps and flattens the walkers
samples = sampler.get_chain(discard=burnin, flat=True)
best_fit_params = {
    "t0": float(np.median(samples[:, 0])),
    "u0": float(np.median(samples[:, 1])),
    "tE": float(np.median(samples[:, 2])),
}

# Update the model with best fit parameters and calculate fluxes
# (MulensModel names such as "t_0" map to submission keys such as "t0")
for param_name in parameters_to_fit:
    setattr(event_object.model.parameters, param_name,
            best_fit_params[param_name.replace('_', '')])

# Fit fluxes using MulensModel
event_object.fit_fluxes()
([F_S], F_B) = event_object.get_flux_for_dataset(event_object.datasets[0])

best_fit_params[f"F{first_band}_S"] = F_S
best_fit_params[f"F{first_band}_B"] = F_B

# Calculate log likelihood at best fit
best_fit_log_likelihood = log_likelihood([
    best_fit_params["t0"], 
    best_fit_params["u0"], 
    best_fit_params["tE"]
], event_object, parameters_to_fit)

n_data_points = len(my_data.time)

print(f"Best-fit parameters obtained: {best_fit_params}")
print(f"Wall time: {wall_time_hours:.3f} hours")
print(f"CPU time: {cpu_time_hours:.3f} hours")
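
To make the burn-in and summary step concrete, here is a sketch on a synthetic chain with the `(nsteps, nwalkers, ndim)` layout that emcee's `get_chain()` returns. The chain values are random draws, not the output of a real fit, and the 16th/84th percentile interval is one common convention rather than a challenge requirement.

```python
import numpy as np

# Synthetic chain: random draws around made-up parameter values.
rng = np.random.default_rng(seed=0)
nsteps, nwalkers, ndim = 500, 8, 3
chain = rng.normal(loc=[100.0, 0.1, 25.0], scale=[0.5, 0.01, 1.0],
                   size=(nsteps, nwalkers, ndim))

burnin = 100
flat = chain[burnin:].reshape(-1, ndim)   # drop burn-in, merge walkers

# Median and a central 68% interval for each parameter.
lower, median, upper = np.percentile(flat, [16, 50, 84], axis=0)
for name, lo, med, hi in zip(["t0", "u0", "tE"], lower, median, upper):
    print(f"{name} = {med:.3f} +{hi - med:.3f} / -{med - lo:.3f}")
```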

### 5.4. Results
Let's save and plot all the fit results.

In [None]:
# 5.4. Create the initial solution and event objects
event = submission.get_event(EVENT_ID)
solution = event.add_solution(
    model_type="1S1L",
    parameters=best_fit_params,
    alias=f"Preliminary PSPL - Band {first_band}"
)

# We create the solution now because we need its generated solution ID for our file paths

In [None]:
# Save posteriors
posterior_dir = Path(project_path) / "events" / EVENT_ID / "solutions" / solution.solution_id
posterior_dir.mkdir(parents=True, exist_ok=True)
posterior_path = posterior_dir / "posteriors.csv"

# Save posterior samples with column headers (no parallax parameters)
posterior_data = np.column_stack([
    samples[:, 0], samples[:, 1], samples[:, 2]
])
np.savetxt(posterior_path, posterior_data, 
           delimiter=',', 
           header='t0,u0,tE',
           comments='')

# Save lightcurve plot
plt.figure(figsize=(10, 6))
plt.errorbar(my_data.time, my_data.mag, yerr=my_data.err_mag, fmt='.', 
             color='k', alpha=0.5, label='Data')

# Plot best fit model using the event object
event_object.plot_model(color='r', label='Best Fit')
plt.legend()
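# Brighter (smaller) magnitudes go up; plot_model may already have inverted the axis
if not plt.gca().yaxis_inverted():
    plt.gca().invert_yaxis()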
plt.title(f"{EVENT_ID} Best Fit - Band {first_band}")
plt.xlabel("HJD")
plt.ylabel("Magnitude")

lightcurve_path = posterior_dir / "lightcurve.png"
plt.savefig(lightcurve_path, dpi=150, bbox_inches='tight')
plt.show()

# Save caustic diagram (for PSPL, this will be empty since no caustics)
plt.figure(figsize=(8, 8))
event_object.plot_caustics()
plt.title(f"{EVENT_ID} Caustic Diagram - Band {first_band}")
caustic_path = posterior_dir / "caustic.png"
plt.savefig(caustic_path, dpi=150, bbox_inches='tight')
plt.show()
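
A quick way to confirm that a posterior file is readable with named columns is to write and re-read a small example. The sketch below uses a temporary directory and two made-up samples; only the header convention (`t0,u0,tE`, no leading `#`) mirrors the cell above.

```python
import tempfile
from pathlib import Path

import numpy as np

# Two made-up posterior draws with columns t0, u0, tE.
samples = np.array([[100.1, 0.11, 24.8],
                    [99.9, 0.09, 25.3]])

with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "posteriors.csv"
    # comments='' keeps the header line free of a leading '#'.
    np.savetxt(csv_path, samples, delimiter=",", header="t0,u0,tE", comments="")
    restored = np.genfromtxt(csv_path, delimiter=",", names=True)

print(restored.dtype.names)  # → ('t0', 'u0', 'tE')
print(restored["tE"])
```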

### 5.5. Solution Management

Now, we package these results into the submission object.

A solution entry has the following attributes:
* `model_type`
* `parameters` matching the indicated `model_type` and `higher_order_effects`
* `higher_order_effects` (if any are present in the model)
* `log_likelihood` (the fit’s log-likelihood value; $-\chi^2/2$)
* `relative_probability` (optional, probability of this solution being the true model)
* `n_data_points` (number of data points used in the fit)
* `compute_info` containing
  - `cpu_hours`
  - `wall_time_hours`
  - `dependencies` (list of installed Python packages, auto-captured)
  - `git_info` (commit, branch, is_dirty; auto-captured if in a git repo)
* `t_ref` (reference time, if required for indicated higher-order effects)
* `bands` (if the fit is band-dependent)
* `alias` (optional, for human-readable, editable solution names)
* `notes_path` (path to Markdown notes file, optional but recommended)
* `parameter_uncertainties` (uncertainties for parameters)
* `physical_parameters` (optional, derived physical parameters)
* `posterior_path` (optional, path to posterior samples)
* `lightcurve_plot_path` (optional, path to lightcurve plot image)
* `lens_plane_plot_path` (optional, path to lens plane plot image)
* `used_astrometry` (optional, whether astrometric data was used)
* `used_postage_stamps` (optional, whether postage stamp data was used)
* `limb_darkening_model` and `limb_darkening_coeffs` (if relevant)

A solution requires only these fields to be valid:
* `model_type`
* `parameters` matching the indicated `model_type` with no `higher_order_effects`
* `log_likelihood` (the fit’s log-likelihood value; $-\chi^2/2$)
* `n_data_points` (number of data points used in the fit)
* `compute_info` containing at least:
  - `cpu_hours`
  - `wall_time_hours`

You can set these `compute_info` values using the method
`solution.set_compute_info(cpu_hours=..., wall_time_hours=...)`.
This method also captures:
* The current Python environment (`pip freeze` output)
* Git repository info (commit, branch, dirty status)

You can call `set_compute_info()` with either or both values, or leave them blank if you want only the environment info.

We calculated these compute times in the previous section using `time.time()` and `time.process_time()`. `time.time()` returns the wall-clock time, whereas `time.process_time()` returns the total CPU time (in seconds) that the current process has used since it started. So the difference between two `time.time()` calls tells you the time elapsed (in seconds) between the calls, while the difference between two `time.process_time()` calls tells you the CPU time elapsed, summed over all threads and not including time spent sleeping or waiting for I/O. For example, a process that keeps 4 cores fully busy for 1 hour accumulates a CPU time of 4 hours.
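
The difference between the two clocks is easy to see with a short sketch that sleeps and then computes. The sleep duration and loop size are arbitrary choices for illustration.

```python
import time

start_wall = time.time()
start_cpu = time.process_time()

time.sleep(0.2)                              # waiting: only the wall clock advances
total = sum(i * i for i in range(200_000))   # computing: both clocks advance

wall_seconds = time.time() - start_wall
cpu_seconds = time.process_time() - start_cpu
print(f"wall: {wall_seconds:.2f} s, cpu: {cpu_seconds:.2f} s")

# Convert to hours, the unit used for compute_info.
wall_hours = wall_seconds / 3600
cpu_hours = cpu_seconds / 3600
```

In a single-threaded run like this one, CPU time is smaller than wall time because the sleep is not counted. In a multi-threaded fit it can exceed wall time.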

In [None]:
# Set compute info
# automatically detects the current python environment and git repository state
solution.set_compute_info(cpu_hours=cpu_time_hours, wall_time_hours=wall_time_hours)

# Set other metadata
# Use the value computed in section 5.3, not the log_likelihood function itself
solution.log_likelihood = best_fit_log_likelihood
solution.n_data_points = n_data_points
solution.bands = [str(first_band)]

# Set file references
solution.posterior_path = str(posterior_path.relative_to(project_path))
solution.lightcurve_plot_path = str(lightcurve_path.relative_to(project_path))
solution.lens_plane_plot_path = str(caustic_path.relative_to(project_path))

# Set notes
solution.set_notes(f"""
# Preliminary Fit Analysis

This is a preliminary PSPL fit without parallax to determine approximate parameters.

## Fit Details
- **Model Type**: 1S1L (no parallax)
- **Band Used**: {first_band}
- **Data Points**: {n_data_points}
- **MCMC Walkers**: {nwalkers}
- **MCMC Steps**: {nsteps}
- **Burn-in**: {burnin} steps

This solution is marked as inactive as it represents a preliminary analysis.
""")

# Save everything
submission.save()

print(f"✅ Solution '{solution.alias}' ({solution.solution_id}) created and saved!")
print(f"   - Posteriors saved to: {posterior_path}")
print(f"   - Lightcurve plot saved to: {lightcurve_path}")
print(f"   - Caustic diagram saved to: {caustic_path}")
print(f"   - Solution active: {solution.is_active}")

## 6. Dossier and Solution Management

Next, we will generate a human-readable HTML report (a 'dossier') to review the project, and then manage which solutions stay active. Validation and export of the final `.zip` file follow in section 7.

In [None]:
# 1. Create the HTML Dossier for your own records
from microlens_submit.dossier import generate_dashboard_html

# Write the dossier inside the project directory, matching the layout in section 4
dossier_path = project_path / "dossier"
generate_dashboard_html(submission, dossier_path, open=False)
print(f"HTML Dossier report generated at: {dossier_path}/index.html")

### 6.1. Display the Dashboard

In [None]:
# Display the dashboard inline (requires IPython)

# Read the generated HTML file from the dossier directory defined above
with open(dossier_path / "index.html", "r", encoding="utf-8") as f:
    html = f.read()

# Show the dashboard in the notebook
HTML(html)

### 6.2. Display the Event Page

In [None]:
# Read the generated HTML file
with open(dossier_path / f"{event.event_id}.html", "r", encoding="utf-8") as f:
    html = f.read()

# Show the event page in the notebook
HTML(html)

### 6.3. Display the Solution Page

In [None]:
# Read the generated HTML file
with open(dossier_path / f"{solution.solution_id}.html", "r", encoding="utf-8") as f:
    html = f.read()

# Show the solution page in the notebook
HTML(html)

### 6.4. Display the Full Report

This is the report that will be generated for the evaluators, with placeholders for sections that are not meant for participants. You can generate this report to view the full state of your project as you go, in the format in which it will be evaluated. 

In [None]:
# Read the generated HTML file
with open(dossier_path / "full_dossier_report.html", "r", encoding="utf-8") as f:
    html = f.read()

# Show the full report in the notebook
HTML(html)

### But what if you don't like a fit?

You can soft delete a fit without removing it from your local project by deactivating it:

In [None]:
# Deactivate the solution
# we are doing this because this is a preliminary fit and we don't want it to 
# be used in the final report
solution.is_active = False

You can also hard delete a solution by calling `event.remove_solution(solution_id, force=True)` or by navigating to the solution in the project directory and deleting it manually.

For these changes to persist outside of the active Python objects, you must save the submission again.

In [None]:
submission.save()

## 7. Validation and Submitting the Project

This is the final step. We will validate the submission for completeness, and export the final `.zip` file to the official submission location.

In [None]:
# 1. Validate the submission
print("\nRunning validation...")
warnings = submission.run_validation()
if warnings:
    print("⚠️ Validation Warnings Found:")
    for w in warnings:
        print(f"  - {w}")
else:
    print("✅ Submission is valid and ready for export!")

# 2. Export the final submission file
# Placeholder: the organizers will replace this with the real, secure S3 URI.
secure_submission_uri = "s3://roman-dc2-submissions-private/"
output_filename = f"{submission.team_name.replace(' ', '_')}_{EVENT_ID}.zip"
full_output_path = f"{secure_submission_uri}{output_filename}"

print(f"\nExporting submission to: {full_output_path}")
try:
    # Note: Exporting directly to S3 would require s3fs integration
    # in microlens-submit, or a temporary local file.
    # For now, we export locally and then would manually upload.
    local_export_path = project_path / output_filename
    submission.export(str(local_export_path))
    print(f"\n🎉 Successfully exported to {local_export_path}")
    print("You would now upload this file to the secure submission location.")
except ValueError as e:
    print(f"❌ Export failed: {e}")

## About this Notebook
**Author(s):** Amber Malpas (Data Challenge Organizer) <br>
**Keyword(s):** Roman, Microlensing, Data Challenge, Workflow <br>
**Last Updated:** July 2025
***