# Optional: Programming DMP Generation
In this notebook I will use the [Blablador API](https://sdlaml.pages.jsc.fz-juelich.de/ai/guides/blablador_api_access/#step-1-register-on-gitlab) to turn a fictitious project description and a skeleton for a Data Management Plan (DMP) into a project-specific DMP. To rerun this notebook, you need a Blablador API key stored in the environment variable `BLABLADOR_API_KEY`. Also make sure to execute the notebook in an environment where the [openai Python library](https://pypi.org/project/openai/) is installed, e.g. using `pip install openai`.
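
Before running the cells below, it can help to confirm that the key is visible to the Python process. The following is a minimal sketch; the helper name `api_key_available` is hypothetical and not part of the original notebook.

```python
import os

def api_key_available(env=os.environ, name="BLABLADOR_API_KEY"):
    """Return True if the environment variable `name` is set and non-empty.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    return bool(env.get(name, "").strip())

if not api_key_available():
    print("BLABLADOR_API_KEY is not set; the prompt cells below will fail.")
```

If the message is printed, set the variable in the shell that starts Jupyter and restart the kernel.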

In [1]:
import openai
from IPython.display import display, Markdown

We define a helper function to send a prompt to Blablador and retrieve the result. ([source](https://scads.github.io/generative-ai-notebooks/15_endpoint_apis/03_blablador_endpoint.html))

In [2]:
def prompt(message: str, model: str = "1 - Llama3 405 on WestAI with 4b quantization") -> str:
    """A prompt helper function that sends a message to Blablador (FZ Jülich)
    and returns only the text response.
    """
    import os
    import openai

    # Set up the connection to the LLM. The URL and key are passed to the
    # constructor so that the client does not depend on OPENAI_API_KEY
    # (assuming BLABLADOR_API_KEY is set).
    client = openai.OpenAI(
        base_url="https://helmholtz-blablador.fz-juelich.de:8000/v1",
        api_key=os.environ.get("BLABLADOR_API_KEY"),
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": message}],
    )

    # Extract the answer text from the first choice
    return response.choices[0].message.content
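
The default model name above reflects what Blablador served when this notebook was written and may change over time. As a hedged sketch, the available model identifiers could be listed through the `models.list()` call of the openai library, assuming Blablador's OpenAI-compatible endpoint supports it. `make_blablador_client` and `list_model_ids` are hypothetical helper names, not part of the original notebook.

```python
import os

def make_blablador_client():
    """Create an OpenAI-compatible client for Blablador (same URL as above)."""
    import openai  # imported here so list_model_ids works without the library
    return openai.OpenAI(
        base_url="https://helmholtz-blablador.fz-juelich.de:8000/v1",
        api_key=os.environ.get("BLABLADOR_API_KEY"),
    )

def list_model_ids(client):
    """Return the sorted identifiers of all models the endpoint reports."""
    return sorted(model.id for model in client.models.list())

# Usage (requires network access and a valid key):
# print("\n".join(list_model_ids(make_blablador_client())))
```

One of the returned identifiers can then be passed as the `model` argument of `prompt`.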

## Asking the LLM about DMPs

In [3]:
result = prompt("""
Give me a short list of typical sections of a Data Management Plan. 
Write bullet points in markdown format and no detailed explanation.
""")

display(Markdown(result))

* Data Types and Formats
* Data Collection and Storage
* Data Sharing and Access
* Data Quality and Validation
* Data Security and Backup
* Data Archiving and Preservation
* Data Sharing and Reuse
* Roles and Responsibilities

In [4]:
result = prompt("""
What is commonly described in a section about "Backup and Archiving" in a 
Data Management Plan? Answer in 3 sentences.
""")

display(Markdown(result))

In a Data Management Plan, the "Backup and Archiving" section typically describes the procedures for creating and managing backup copies of research data to ensure its availability and integrity in case of data loss or corruption. This section may outline the frequency and method of backups, the type of storage media used, and the location of backup storage. Additionally, it may discuss long-term archiving plans, including the format and storage of data for preservation and potential reuse.

## Our project description
In the following cell you find a description of a fictitious project. It contains all aspects of such a project that came to my mind when I thought about the sections the LLM listed above. It is structured chronologically, listing things that happen early in the project first and moving towards the publication of a manuscript, code and data.

In [5]:
project_description = """
In our project we investigate the underlying physical principles for Gastrulation 
in Tribolium castaneum embryo development. Therefore, we use light-sheet microscopes
to acquire 3D timelapse imaging data. We store this data in the NGFF file format. 
After acquisition, two scientists, typically a PhD student and a post-doc or 
group leader look into the data together and decide if the dataset will be analyzed 
in detail. In case yes, we upload the data to an Omero-Server, a research data 
management solution specifically developed for microscopy imaging data. Data on 
this server is automatically backed-up by the compute center of our university. We then login 
to the Jupyter Lab server of the institute where we analyze the data. Analysis results
are also stored in the Omero-Server next to the imaging data results belong to. The
Python analysis code we write is stored in the institutional git-server. Also this 
server is backed up by the compute center. When the project advances, we start writing
a manuscript using Overleaf, an online service for collaborative manuscript editing 
based on latex files. After every writing session, we save back the changed manuscript 
to the institutional git server. As soon as the manuscript is finished and 
submitted to the bioRxiv, a preprint server in the life-sciences, we also publish the 
project-related code by marking the project on the git-server as public. We also
tag the code with a release version. At the same time we publish the imaging data 
by submitting a copy of the dataset from the Omero-Server to zenodo.org, a 
community-driven repository for research data funded by the European Union. Another 
copy of the data, the code and the manuscript is stored on the institutional archive 
server. This server, maintained by the compute center, guarantees to archive data for 
15 years. Documents and data we publish are licensed under the CC-BY 4.0 license. The code 
we publish is licensed BSD3. The entire project and all steps of the data life-cycle 
are documented in an institutional labnotebook where every user has to pay 10 Euro 
per month. Four people will work on the project. The compute center estimates the 
costs for storage and maintenance of the infrastructure to 20k Euro and half a 
position of an IT specialist. The project duration is four years.
"""

We can then use this project description as part of a prompt to the LLM to turn the unstructured text into a DMP.

In [6]:
result = prompt(f"""
You are a professional grant proposal writer. In the following comes a description of 
a common project in our "Tribolium Development" Research Group at the University. 
Your task is to reformulate this project description into a Data Management Plan.

{project_description}
""")

display(Markdown(result))

**Data Management Plan for "Tribolium Development" Research Group**

**Project Overview**

Our research group aims to investigate the underlying physical principles for Gastrulation in Tribolium castaneum embryo development. This project will generate large amounts of 3D timelapse imaging data, which will be analyzed and published in accordance with the principles of open science.

**Data Generation and Storage**

* 3D timelapse imaging data will be acquired using light-sheet microscopes and stored in the NGFF file format.
* All imaging data will be uploaded to an Omero-Server, a research data management solution specifically developed for microscopy imaging data, which is automatically backed up by the compute center of our university.
* The Omero-Server will store both raw and analyzed data, ensuring that all data is properly versioned and linked to relevant metadata.

**Data Analysis and Code Management**

* Data analysis will be performed using Jupyter Lab server, which is connected to the Omero-Server.
* Analysis code will be written in Python and stored in the institutional git-server, which is backed up by the compute center.
* All code will be version-controlled, and changes will be tracked through the git-server.

**Collaboration and Manuscript Writing**

* Manuscripts will be written collaboratively using Overleaf, an online service for collaborative manuscript editing based on LaTeX files.
* Manuscript versions will be saved regularly to the institutional git-server.

**Data and Code Publication**

* Upon manuscript submission to bioRxiv, a preprint server in the life sciences, the project-related code will be made public by marking the project on the git-server as public and tagging the code with a release version.
* Imaging data will be published by submitting a copy of the dataset from the Omero-Server to Zenodo.org, a community-driven repository for research data funded by the European Union.

**Long-term Preservation and Archiving**

* A copy of the data, code, and manuscript will be stored on the institutional archive server, which is maintained by the compute center and guarantees data archiving for 15 years.

**Licensing and Access**

* All published data and documents will be licensed under the CC-BY 4.0 license.
* Published code will be licensed under the BSD3 license.

**Documentation and Project Management**

* The entire project and all steps of the data life cycle will be documented in an institutional lab notebook, accessible to all project members.
* Four people will work on the project, and the compute center estimates the costs for storage and maintenance of the infrastructure to be 20k Euro and half a position of an IT specialist.

**Project Timeline and Funding**

* The project duration is four years.
* Funding for the project has been allocated to cover the costs of storage, maintenance, and personnel.

By following this Data Management Plan, our research group aims to ensure that all data and code generated during this project are properly stored, analyzed, and published in accordance with the principles of open science, and that all data and materials are preserved for long-term access and reuse.

## Combining information and structure
We next modify the prompt to also specify the structure we need. This structure may differ from funding agency to funding agency, so this step is crucial for customizing the DMP according to the given formal requirements.
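
One way to make this customization reusable is to keep the required headings in a list and assemble the prompt from it. This is a minimal sketch under that assumption; `build_dmp_prompt` is a hypothetical helper, and its wording only loosely mirrors the prompt used in this notebook.

```python
def build_dmp_prompt(project_description, headings, title="Data Management Plan"):
    """Assemble a DMP prompt from a project description and required headings.

    Hypothetical helper for illustration; `headings` is the list of
    second-level section titles a funding agency asks for.
    """
    structure = "\n".join([f"# {title}"] + [f"## {h}" for h in headings])
    return (
        "You are a professional grant proposal writer. In the following comes "
        "a description of a common project in our research group. "
        "Your task is to reformulate this project description into a "
        "Data Management Plan.\n\n"
        f"{project_description.strip()}\n\n"
        "The required structure for the data management plan is:\n\n"
        f"{structure}\n\n"
        "Use Markdown for headlines and text style.\n"
    )

example = build_dmp_prompt(
    "We acquire 3D timelapse imaging data.",
    ["Data description", "Responsibilities and resources"],
)
print(example)
```

Swapping the heading list is then enough to target a different funding agency, and the returned string can be passed to the `prompt` helper.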

In [7]:
result = prompt(f"""
You are a professional grant proposal writer. In the following comes a description of 
a common project in our "Tribolium Development" Research Group at the University. 
Your task is to reformulate this project description into a Data Management Plan.

{project_description}

The required structure for the data management plan we need to write is as follows:

# Data Management Plan
## Data description
## Documentation and data quality
## Storage and technical archiving the project
## Legal obligations and conditions 
## Data exchange and long-term data accessibility
## Responsibilities and resources

Use Markdown for headlines and text style.
""")

display(Markdown(result))

# Data Management Plan

## Data description

In this project, we will generate 3D timelapse imaging data of Tribolium castaneum embryo development using light-sheet microscopes. The data will be stored in the NGFF file format. Additionally, we will produce analysis results, Python code for data analysis, and a manuscript.

## Documentation and data quality

* All data and analysis steps will be documented in an institutional lab notebook, accessible to all project members at a cost of €10 per month per user.
* Data quality will be ensured through collaborative review by at least two scientists (PhD student and post-doc or group leader) before deciding on further analysis.
* Analysis code will be stored in an institutional Git-server, with regular backups by the compute center.

## Storage and technical archiving the project

* Imaging data will be stored on an Omero-Server, a research data management solution specifically developed for microscopy imaging data, with automatic backups by the compute center.
* Analysis results will be stored alongside the imaging data on the Omero-Server.
* Python analysis code will be stored in the institutional Git-server, with regular backups by the compute center.
* Manuscript drafts will be stored on the institutional Git-server, with regular backups by the compute center.
* A copy of the dataset, code, and manuscript will be archived on the institutional archive server, maintained by the compute center, which guarantees data archiving for 15 years.

## Legal obligations and conditions

* All published data, code, and manuscripts will be licensed under CC-BY 4.0 (data and manuscripts) or BSD3 (code).
* The project will comply with the European Union's data protection regulations.

## Data exchange and long-term data accessibility

* Imaging data will be published on zenodo.org, a community-driven repository for research data, upon manuscript submission to bioRxiv.
* Analysis code will be made publicly available on the institutional Git-server, with a release version tag, upon manuscript submission.
* The institutional archive server will ensure long-term data accessibility for at least 15 years.

## Responsibilities and resources

* Four project members will be responsible for data management and analysis.
* The compute center will provide storage and maintenance of the infrastructure at an estimated cost of €20,000 and half a position of an IT specialist.
* The project duration is four years.
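
Because the model's output is not guaranteed to follow the requested structure, a quick check can confirm that all required headings appear. The sketch below is an illustration only; `missing_headings` is a hypothetical helper that compares second-level Markdown heading lines against the list used in the prompt above.

```python
import re

REQUIRED_HEADINGS = [
    "Data description",
    "Documentation and data quality",
    "Storage and technical archiving the project",
    "Legal obligations and conditions",
    "Data exchange and long-term data accessibility",
    "Responsibilities and resources",
]

def missing_headings(markdown_text, required=REQUIRED_HEADINGS):
    """Return the required headings that do not occur as '## ...' lines."""
    found = {
        match.group(1).strip().lower()
        for match in re.finditer(r"^##\s+(.+)$", markdown_text, flags=re.MULTILINE)
    }
    return [h for h in required if h.lower() not in found]

# A deliberately incomplete sample: only the first section is present.
sample = "# Data Management Plan\n\n## Data description\n\nText.\n"
print(missing_headings(sample))
```

Calling `missing_headings(result)` on the generated DMP returns an empty list when every requested section is present; otherwise the prompt can be rerun or refined.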