## Conversion of .yml to DALIA format
This notebook contains the workflow for converting the training materials (the [nfdi4bioimage.yml](resources/nfdi4bioimage.yml) file) to the DALIA format.

Language detection, which is also needed for the DALIA format, is outsourced to a [different notebook](scripts/Export_DALIA_language.ipynb) to speed up the process.

#### Load the YAML file as a pandas DataFrame

In [1]:
import pandas as pd

# This notebook exports selected data as a CSV file
source = "../resources/"

destination = '../docs/export/DALIA_training_materials.csv'

from generate_link_lists import load_dataframe

df = load_dataframe(source)
df.head()

Unnamed: 0,authors,name,proficiency_level,tags,type,url,license,event_date,event_location,description,num_downloads,publication_date,fingerprint,author,submission_date
0,[Elisabeth Kugler],Sharing Your Poster on Figshare: A Community G...,novice,"[Sharing, Research Data Management]",[Blog Post],https://focalplane.biologists.com/2023/07/26/s...,,,,,,,,,
1,[Marcelo Zoccoler],Running Deep-Learning Scripts in the BiA-PoL O...,proficient,"[Python, Artificial Intelligence, Bioimage Ana...",[Blog Post],https://biapol.github.io/blog/marcelo_zoccoler...,CC-BY-4.0,,,,,,,,
2,[Robert Haase],Browsing the Open Microscopy Image Data Resour...,competent,"[OMERO, Python]",[Blog Post],https://biapol.github.io/blog/robert_haase/bro...,CC-BY-4.0,,,,,,,,
3,[Mara Lampert],Getting started with Mambaforge and Python,novice,"[Python, Conda, Mamba]",[Blog Post],https://biapol.github.io/blog/mara_lampert/get...,CC-BY-4.0,,,,,,,,
4,[Jennifer Waters],Promoting Data Management at the Nikon Imaging...,novice,[Research Data Management],[Blog Post],https://datamanagement.hms.harvard.edu/news/pr...,,,,,,,,,


#### 1. For entries that have an author column, copy its value into the authors column

In [2]:
#check which entries have 'author' column
df[df['author'].notna()]

Unnamed: 0,authors,name,proficiency_level,tags,type,url,license,event_date,event_location,description,num_downloads,publication_date,fingerprint,author,submission_date
394,,Virtual-I2K-2024-multiview-stitcher,advanced beginner,"[Big Data, Bioimageanalysis]","[Github repository, Tutorial]",[https://github.com/m-albert/Virtual-I2K-2024-...,BSD-3-CLAUSE,,,Repository accompanying the multiview-stitcher...,,2024-10-30T07:38:11+00:00,,Marvin Albert,
397,,Prompt-Engineering-LLMs-Course,,"[Llms, Prompt Engineering, Code Generation]","[Github repository, Tutorial]",https://github.com/HelmholtzAI-Consultants-Mun...,MIT,,,,,2024-09-11T07:45:30+00:00,,Isra Mekki,


In [3]:
# Iterate over rows to change the information to the authors column
for index, entry in df[df['author'].notna()].iterrows():
    df.loc[index, 'authors'] = entry['author']
    
df[df['author'].notna()]

Unnamed: 0,authors,name,proficiency_level,tags,type,url,license,event_date,event_location,description,num_downloads,publication_date,fingerprint,author,submission_date
394,Marvin Albert,Virtual-I2K-2024-multiview-stitcher,advanced beginner,"[Big Data, Bioimageanalysis]","[Github repository, Tutorial]",[https://github.com/m-albert/Virtual-I2K-2024-...,BSD-3-CLAUSE,,,Repository accompanying the multiview-stitcher...,,2024-10-30T07:38:11+00:00,,Marvin Albert,
397,Isra Mekki,Prompt-Engineering-LLMs-Course,,"[Llms, Prompt Engineering, Code Generation]","[Github repository, Tutorial]",https://github.com/HelmholtzAI-Consultants-Mun...,MIT,,,,,2024-09-11T07:45:30+00:00,,Isra Mekki,
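
The same copy can probably be done without a row loop. The following is a minimal sketch on a made-up toy frame (the names `toy` and `has_author` and all values are illustrative, not part of the notebook); it assumes that `author` should overwrite `authors` wherever `author` is filled, as the loop above does.

```python
import pandas as pd

# Toy frame with the same two columns as the notebook's df (values are made up)
toy = pd.DataFrame({
    "authors": [["Jane Doe"], None, None],
    "author": [None, "John Smith", None],
})

# Rows where the singular 'author' column is filled
has_author = toy["author"].notna()

# Copy those values into 'authors' in one step instead of looping over rows
toy.loc[has_author, "authors"] = toy.loc[has_author, "author"]

print(toy["authors"].tolist())
```

Rows without an `author` value keep whatever `authors` already contained.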


#### 2. Exclude entries without mandatory attributes (License, Authors, Title, Link)

In [4]:
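# Keep only entries with all mandatory attributes; .copy() yields an independent
# DataFrame so that later column assignments do not raise SettingWithCopyWarning
data = df[
    ~df['license'].str.lower().isin(['unknown'])
    & df['license'].notna()
    & df['authors'].notna()
    & df['name'].notna()
    & df['url'].notna()
].copy()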
data.head()

Unnamed: 0,authors,name,proficiency_level,tags,type,url,license,event_date,event_location,description,num_downloads,publication_date,fingerprint,author,submission_date
1,[Marcelo Zoccoler],Running Deep-Learning Scripts in the BiA-PoL O...,proficient,"[Python, Artificial Intelligence, Bioimage Ana...",[Blog Post],https://biapol.github.io/blog/marcelo_zoccoler...,CC-BY-4.0,,,,,,,,
2,[Robert Haase],Browsing the Open Microscopy Image Data Resour...,competent,"[OMERO, Python]",[Blog Post],https://biapol.github.io/blog/robert_haase/bro...,CC-BY-4.0,,,,,,,,
3,[Mara Lampert],Getting started with Mambaforge and Python,novice,"[Python, Conda, Mamba]",[Blog Post],https://biapol.github.io/blog/mara_lampert/get...,CC-BY-4.0,,,,,,,,
9,[Robert Haase],Managing Scientific Python environments using ...,novice,"[Python, Conda, Mamba]",[Blog Post],https://focalplane.biologists.com/2022/12/08/m...,CC-BY-4.0,,,,,,,,
29,[Robert Haase et al.],BioImage Analysis Notebooks,advanced beginner,"[Python, Bioimage Analysis]","[Book, Notebook]",https://haesleinhuepf.github.io/BioImageAnalys...,"[CC-BY-4.0, BSD-3-CLAUSE]",,,,,,,,


In [5]:
print(f'Total number of entries found: {len(df)}')
print(f'Number of entries found with all mandatory entries: {len(data)}')

Total number of entries found: 698
Number of entries found with all mandatory entries: 465
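
To make the filter in step 2 easier to follow, here is a small sketch on invented data (the frame `toy` and its values are assumptions for illustration). It shows the three cases that matter: a license of `UNKNOWN` is dropped, a missing license is dropped, and a list-valued license is kept, because `.str.lower()` yields a missing value for list cells and a missing value is not equal to `'unknown'`.

```python
import pandas as pd

# Made-up entries covering the relevant license cases
toy = pd.DataFrame({
    "name": ["a", "b", "c", "d"],
    "authors": [["A"], ["B"], ["C"], ["D"]],
    "url": ["u1", "u2", "u3", "u4"],
    "license": ["CC-BY-4.0", "UNKNOWN", None, ["CC-BY-4.0", "BSD-3-CLAUSE"]],
})

mandatory = ["license", "authors", "name", "url"]

# True where the license is not the literal string 'unknown' (case-insensitive)
not_unknown = ~toy["license"].str.lower().isin(["unknown"])

# True where every mandatory column holds a value
complete = toy[mandatory].notna().all(axis=1)

kept = toy[not_unknown & complete]
print(kept["name"].tolist())
```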


#### 3. Change the format of the **Tags** and **License** columns to fit the DALIA format

In [7]:
data["tags"] = data["tags"].apply(lambda x: ' * '.join(x) if isinstance(x, list) else x) #Tags
data["license"] = data["license"].apply(lambda x: ' * '.join(x) if isinstance(x, list) else x) #License
data.head()

Unnamed: 0,authors,name,proficiency_level,tags,type,url,license,event_date,event_location,description,num_downloads,publication_date,fingerprint,author,submission_date
1,[Marcelo Zoccoler],Running Deep-Learning Scripts in the BiA-PoL O...,proficient,Python * Artificial Intelligence * Bioimage An...,[Blog Post],https://biapol.github.io/blog/marcelo_zoccoler...,CC-BY-4.0,,,,,,,,
2,[Robert Haase],Browsing the Open Microscopy Image Data Resour...,competent,OMERO * Python,[Blog Post],https://biapol.github.io/blog/robert_haase/bro...,CC-BY-4.0,,,,,,,,
3,[Mara Lampert],Getting started with Mambaforge and Python,novice,Python * Conda * Mamba,[Blog Post],https://biapol.github.io/blog/mara_lampert/get...,CC-BY-4.0,,,,,,,,
9,[Robert Haase],Managing Scientific Python environments using ...,novice,Python * Conda * Mamba,[Blog Post],https://focalplane.biologists.com/2022/12/08/m...,CC-BY-4.0,,,,,,,,
29,[Robert Haase et al.],BioImage Analysis Notebooks,advanced beginner,Python * Bioimage Analysis,"[Book, Notebook]",https://haesleinhuepf.github.io/BioImageAnalys...,CC-BY-4.0 * BSD-3-CLAUSE,,,,,,,,


In [8]:
# Map the License Entries to valid input
license_mapping = {
    'APACHE-2.0 LICENSE' : 'Apache-2.0',
    'CC0 1.0 UNIVERSAL' : 'CC0-1.0',
    'CC-BY-4.0 * BSD-3-CLAUSE' : 'CC-BY-4.0 * BSD-3-Clause',
    'CC0 (MOSTLY, BUT CAN DIFFER DEPENDING ON RESOURCE)' : 'CC0-1.0',
    'CCY-BY-SA-4.0' : 'CC-BY-SA-4.0',
    'YOUTTUBE STANDARD LICENSE' : 'YOUTUBE STANDARD LICENSE',
    'CC-BY-NC-SA' : 'CC-BY-NC-SA-4.0',
    'BSD3-CLAUSE' : 'BSD-3-Clause',
    'CC-ZERO' : 'CC0-1.0',
    'BSD 3-Clause "New" or "Revised" License' : 'BSD-3-Clause',
    'cc-by-4.0' : 'CC-BY-4.0',
    'Creative Commons Attribution Share Alike 4.0 International' : 'CC-BY-SA-4.0',
    'GNU General Public License v3.0' : 'GPL-3.0-only',
    'CC BY-NC-SA 4.0' : 'CC-BY-NC-SA-4.0',
    'BSD-3-CLAUSE' : 'BSD-3-Clause',
    'BSD-2-CLAUSE' : 'BSD-2-Clause',
    'APACHE-2.0' : 'Apache-2.0'
}
data["license"] = data["license"].replace(license_mapping)

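
`Series.replace` with a dictionary matches whole cell values, which is why the combined string `'CC-BY-4.0 * BSD-3-CLAUSE'` needs its own entry in the mapping. One possible alternative, sketched here under assumptions, is to normalize each license inside a joined string separately. The names `LICENSE_MAPPING` and `normalize_licenses` are hypothetical, and the dictionary is only a short excerpt with upper-cased keys.

```python
# Short excerpt of the notebook's mapping; keys are upper-cased here so that
# the lookup below can ignore capitalisation (an assumption of this sketch)
LICENSE_MAPPING = {
    "BSD-3-CLAUSE": "BSD-3-Clause",
    "APACHE-2.0": "Apache-2.0",
    "CC-ZERO": "CC0-1.0",
}

def normalize_licenses(value, sep=" * "):
    """Normalise every license inside a ' * '-joined string separately."""
    if not isinstance(value, str):
        return value  # leave None/NaN untouched
    # Split on the bare '*' and strip, so stray spaces around it do not matter
    parts = [p.strip() for p in value.split(sep.strip())]
    # Unknown licenses are passed through unchanged
    return sep.join(LICENSE_MAPPING.get(p.upper(), p) for p in parts if p)

print(normalize_licenses("CC-BY-4.0 * BSD-3-CLAUSE"))
```

With a helper like this, each spelling only has to be listed once, regardless of which combinations occur in the data.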


#### 4. Map the **Type** column to the **LearningResourceType** and **MediaType** columns

In [9]:
# Create Mapping for the Type Column:
type_to_learning_resource = {
    "Application": "Software Application",
    "Big Data": "Data",
    "Bioimage Analysis": "Other",
    "Blog": "Web Page",
    "Blog Post": "Text",
    "Book": "Book",
    "Book Chapter": "Book",
    "Code": None,
    "Collection": "Other",
    "Conference Abstract": "Text",
    "Data": "Data",
    "Document": "Text",
    "Documentation": "Text",
    "Event": "Other",
    "Forum Post": "Text",
    "Github Repository": "Other",
    "Jupyter Book": "Code Notebook",
    "Notebook": "Code Notebook",
    "Online Course": "Course",
    "Online Tutorial": "Tutorial",
    "Open Source Software": "Software Application",
    "Poster": "Poster",
    "Practicals": "Course",
    "Preprint": "Text",
    "Presentation": "Presentation",
    "Publication": "Article",
    "Python": None,
    "Report": "Report",
    "Slide": "Presentation",
    "Slides": "Presentation",
    "Tutorial": "Tutorial",
    "Video": None,
    "Videos": None,
    "Website": "Web Page",
    "Workshop": "Course",
    "Youtube Channel": "Other"
}

In [10]:
type_to_media_type = {
    "Application": None,
    "Big Data": None,
    "Bioimage Analysis": None,
    "Blog": "text",
    "Blog Post": "text",
    "Book": "text",
    "Book Chapter": "text",
    "Code": "code",
    "Collection": None,
    "Conference Abstract": "text",
    "Data": None,
    "Document": "text",
    "Documentation": "text",
    "Event": None,
    "Forum Post": "text",
    "Github Repository": None,
    "Jupyter Book": "code",
    "Notebook": "code",
    "Online Course": None,
    "Online Tutorial": None,
    "Open Source Software": None,
    "Poster": None,
    "Practicals": None,
    "Preprint": "text",
    "Presentation": "presentation",
    "Publication": "text",
    "Python": None,
    "Report": "text",
    "Slide": "presentation",
    "Slides": "presentation",
    "Tutorial": None,
    "Video": "video",
    "Videos": "video",
    "Website": None,
    "Workshop": None,
    "Youtube Channel": "video"
}

In [11]:
def map_types(entry, mapping):
    """Map a type entry (string or list of strings) to a ' * '-joined string."""
    # Skip empty or NaN rows
    if entry is None or (isinstance(entry, float) and pd.isna(entry)):
        return ""
    items = entry if isinstance(entry, list) else [entry]
    # dict.fromkeys drops duplicates but keeps first-seen order, so the output
    # is the same on every run (a plain set has no guaranteed order)
    matches = dict.fromkeys(mapping.get(item) for item in items)
    return " * ".join(m for m in matches if m is not None)

def map_learning_resource(entry):
    return map_types(entry, type_to_learning_resource)

def map_media_type(entry):
    return map_types(entry, type_to_media_type)

# Apply the mapping functions
data["LearningResourceType"] = data["type"].apply(map_learning_resource)
data["MediaType"] = data["type"].apply(map_media_type)

data.head()

Unnamed: 0,authors,name,proficiency_level,tags,type,url,license,event_date,event_location,description,num_downloads,publication_date,fingerprint,author,submission_date,LearningResourceType,MediaType
1,[Marcelo Zoccoler],Running Deep-Learning Scripts in the BiA-PoL O...,proficient,Python * Artificial Intelligence * Bioimage An...,[Blog Post],https://biapol.github.io/blog/marcelo_zoccoler...,CC-BY-4.0,,,,,,,,,Text,text
2,[Robert Haase],Browsing the Open Microscopy Image Data Resour...,competent,OMERO * Python,[Blog Post],https://biapol.github.io/blog/robert_haase/bro...,CC-BY-4.0,,,,,,,,,Text,text
3,[Mara Lampert],Getting started with Mambaforge and Python,novice,Python * Conda * Mamba,[Blog Post],https://biapol.github.io/blog/mara_lampert/get...,CC-BY-4.0,,,,,,,,,Text,text
9,[Robert Haase],Managing Scientific Python environments using ...,novice,Python * Conda * Mamba,[Blog Post],https://focalplane.biologists.com/2022/12/08/m...,CC-BY-4.0,,,,,,,,,Text,text
29,[Robert Haase et al.],BioImage Analysis Notebooks,advanced beginner,Python * Bioimage Analysis,"[Book, Notebook]",https://haesleinhuepf.github.io/BioImageAnalys...,CC-BY-4.0 * BSD-3-Clause,,,,,,,,,Book * Code Notebook,text * code
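
The lookups in step 4 are case-sensitive. The rows shown in step 1 list the type as `Github repository`, while the mapping key is `Github Repository`, so such an entry would be skipped without notice. The sketch below shows one way to make the lookup tolerant of capitalisation and to list type names that have no key at all. The helpers `lookup_type` and `unmapped_types` and the type name `Podcast` are hypothetical, and the dictionary is only an excerpt.

```python
# Excerpt of the mapping defined in step 4
type_to_learning_resource = {
    "Github Repository": "Other",
    "Tutorial": "Tutorial",
    "Video": None,
}

# Index the mapping by case-folded key so that capitalisation variants still match
folded = {key.casefold(): value for key, value in type_to_learning_resource.items()}

def lookup_type(item, default=None):
    """Return the mapped value for a type name, ignoring capitalisation."""
    return folded.get(str(item).casefold(), default)

def unmapped_types(entries):
    """Collect type names that have no key in the mapping (useful for spotting gaps)."""
    seen = set()
    for entry in entries:
        items = entry if isinstance(entry, list) else [entry]
        seen.update(i for i in items if isinstance(i, str) and i.casefold() not in folded)
    return sorted(seen)

print(lookup_type("Github repository"))
print(unmapped_types([["Github repository", "Tutorial"], "Podcast", None]))
```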


#### 5. Change the author names to fit the DALIA format (for persons: surname, given name; for organizations: organization name)

In [12]:
import re

def normalize_author_format(authors):
    # Helper function to reformat a single name
    def reformat_name(name):
        # Names that already contain a comma are assumed to be "Surname, Given name"
        if "," in name:
            return name.strip()
        # Convert "Given name Surname" to "Surname, Given name"
        parts = name.split()
        et_al = ['et', 'al.']
        if len(parts) == 2 and all(p not in et_al for p in parts):
            return f"{parts[1]}, {parts[0]}"
        # Three parts: keep first and middle name, separated by a space
        if len(parts) == 3 and all(p not in et_al for p in parts):
            return f"{parts[2]}, {parts[0]} {parts[1]}"
        return name.strip()  # Return unchanged if not a simple name format

    # Convert single strings to lists for uniform processing
    if isinstance(authors, str):
        # Split inline lists such as "Given Surname, Given Surname and Given Surname"
        authors = re.split(r",\s*|\*|\band\b", authors)
    # Ensure all elements are stripped strings and drop empty fragments
    authors = [str(a).strip() for a in authors if str(a).strip()]

    # Reformat each author and join the results with " * "
    return " * ".join(reformat_name(author) for author in authors)


# Apply the normalization function
data["Authors"] = data["authors"].apply(normalize_author_format)

data.head()

Unnamed: 0,authors,name,proficiency_level,tags,type,url,license,event_date,event_location,description,num_downloads,publication_date,fingerprint,author,submission_date,LearningResourceType,MediaType,Authors
1,[Marcelo Zoccoler],Running Deep-Learning Scripts in the BiA-PoL O...,proficient,Python * Artificial Intelligence * Bioimage An...,[Blog Post],https://biapol.github.io/blog/marcelo_zoccoler...,CC-BY-4.0,,,,,,,,,Text,text,"Zoccoler, Marcelo"
2,[Robert Haase],Browsing the Open Microscopy Image Data Resour...,competent,OMERO * Python,[Blog Post],https://biapol.github.io/blog/robert_haase/bro...,CC-BY-4.0,,,,,,,,,Text,text,"Haase, Robert"
3,[Mara Lampert],Getting started with Mambaforge and Python,novice,Python * Conda * Mamba,[Blog Post],https://biapol.github.io/blog/mara_lampert/get...,CC-BY-4.0,,,,,,,,,Text,text,"Lampert, Mara"
9,[Robert Haase],Managing Scientific Python environments using ...,novice,Python * Conda * Mamba,[Blog Post],https://focalplane.biologists.com/2022/12/08/m...,CC-BY-4.0,,,,,,,,,Text,text,"Haase, Robert"
29,[Robert Haase et al.],BioImage Analysis Notebooks,advanced beginner,Python * Bioimage Analysis,"[Book, Notebook]",https://haesleinhuepf.github.io/BioImageAnalys...,CC-BY-4.0 * BSD-3-Clause,,,,,,,,,Book * Code Notebook,text * code,Robert Haase et al.
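
The reordering rule of step 5 can also be sketched in a more compact form. The function below is an illustrative alternative, not the notebook's function: it treats the last word as the surname, so it also handles names with more than three parts, which the function above leaves unchanged. The name `surname_first` and the example name "Anna Maria Muster" are made up.

```python
def surname_first(name):
    """Turn 'Given [Middle] Surname' into 'Surname, Given [Middle]' (simplified sketch)."""
    name = name.strip()
    # Leave names that already contain a comma or end in 'et al.' unchanged
    if "," in name or name.endswith("et al."):
        return name
    # Everything before the last space counts as given name(s)
    given, _, surname = name.rpartition(" ")
    return f"{surname}, {given}" if given else name

for example in ["Mara Lampert", "Anna Maria Muster", "Robert Haase et al.", "NFDI4BIOIMAGE"]:
    print(surname_first(example))
```

Like the notebook's function, this sketch cannot tell a person from an organization whose name consists of several words, so such organization names would need to be excluded or handled separately.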


#### 6. Rename the columns that already fit the DALIA format to their corresponding DALIA names

In [13]:
# Rename columns
data = data.rename(columns={'name': 'Title', 'license': 'License', 'url': 'Link', 'description': 'Description', 'publication_date': 'PublicationDate', 'tags': 'Keywords'})

# Remove unwanted columns with no important data
data = data.drop(columns=['event_date', 'event_location', 'num_downloads', 'submission_date', 'fingerprint', 'author', 'type', 'authors'])

data.head()

Unnamed: 0,Title,proficiency_level,Keywords,Link,License,Description,PublicationDate,LearningResourceType,MediaType,Authors
1,Running Deep-Learning Scripts in the BiA-PoL O...,proficient,Python * Artificial Intelligence * Bioimage An...,https://biapol.github.io/blog/marcelo_zoccoler...,CC-BY-4.0,,,Text,text,"Zoccoler, Marcelo"
2,Browsing the Open Microscopy Image Data Resour...,competent,OMERO * Python,https://biapol.github.io/blog/robert_haase/bro...,CC-BY-4.0,,,Text,text,"Haase, Robert"
3,Getting started with Mambaforge and Python,novice,Python * Conda * Mamba,https://biapol.github.io/blog/mara_lampert/get...,CC-BY-4.0,,,Text,text,"Lampert, Mara"
9,Managing Scientific Python environments using ...,novice,Python * Conda * Mamba,https://focalplane.biologists.com/2022/12/08/m...,CC-BY-4.0,,,Text,text,"Haase, Robert"
29,BioImage Analysis Notebooks,advanced beginner,Python * Bioimage Analysis,https://haesleinhuepf.github.io/BioImageAnalys...,CC-BY-4.0 * BSD-3-Clause,,,Book * Code Notebook,text * code,Robert Haase et al.


#### 7. Introduce the **Community Column**: NFDI4BioImage if it is listed in the tags

In [14]:
def include_community(entry):
    # Keywords were joined with ' * ' in step 3, so split strings back into single tags
    if isinstance(entry, str):
        entry = entry.split(' * ')
    if isinstance(entry, list):
        if any(e.strip().lower() == 'nfdi4bioimage' for e in entry if isinstance(e, str)):
            return 'NFDI4Bioimage'
    return None


# Apply the function
data['Community'] = data['Keywords'].apply(include_community)
data.head()

Unnamed: 0,Title,proficiency_level,Keywords,Link,License,Description,PublicationDate,LearningResourceType,MediaType,Authors,Community
1,Running Deep-Learning Scripts in the BiA-PoL O...,proficient,Python * Artificial Intelligence * Bioimage An...,https://biapol.github.io/blog/marcelo_zoccoler...,CC-BY-4.0,,,Text,text,"Zoccoler, Marcelo",
2,Browsing the Open Microscopy Image Data Resour...,competent,OMERO * Python,https://biapol.github.io/blog/robert_haase/bro...,CC-BY-4.0,,,Text,text,"Haase, Robert",
3,Getting started with Mambaforge and Python,novice,Python * Conda * Mamba,https://biapol.github.io/blog/mara_lampert/get...,CC-BY-4.0,,,Text,text,"Lampert, Mara",
9,Managing Scientific Python environments using ...,novice,Python * Conda * Mamba,https://focalplane.biologists.com/2022/12/08/m...,CC-BY-4.0,,,Text,text,"Haase, Robert",
29,BioImage Analysis Notebooks,advanced beginner,Python * Bioimage Analysis,https://haesleinhuepf.github.io/BioImageAnalys...,CC-BY-4.0 * BSD-3-Clause,,,Book * Code Notebook,text * code,Robert Haase et al.,


#### 8. Correct the format of the **Link** column

In [15]:
# Make * Delimiter for the Links if there is more than one for some entries
data["Link"] = data["Link"].apply(lambda x: ' * '.join(x) if isinstance(x, list) else x) #URL
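
The same list-joining lambda is used for the tags, license and link columns. As a sketch, it could be factored into one small helper. The name `join_if_list` is hypothetical, and converting elements with `str()` is an added assumption so that non-string list items do not break the join.

```python
def join_if_list(value, sep=" * "):
    """Join list entries with the ' * ' separator; pass every other value through unchanged."""
    if isinstance(value, list):
        return sep.join(str(v) for v in value)
    return value

print(join_if_list(["https://example.org/a", "https://example.org/b"]))
print(join_if_list("https://example.org/a"))
```

It would then be applied in the same way as the lambda, for example `data["Link"].apply(join_if_list)`.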

### Export the data to a CSV file that fits the DALIA format

In [16]:
# save selected data
data.to_csv(destination, index=False)

num_rows = data.shape[0]
print(f"Exported {num_rows} rows.")

Exported 465 rows.
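
After the export, a quick check of the written file can confirm that no mandatory field ended up empty. The sketch below uses only the standard library and runs on an inline sample instead of the real file. The helper `rows_missing_mandatory`, the sample rows and the example.org links are invented for illustration; the column names are the ones produced in steps 5 and 6.

```python
import csv
import io

# The mandatory attributes named in step 2, under their DALIA column names
MANDATORY = ["Title", "Authors", "License", "Link"]

def rows_missing_mandatory(csv_text):
    """Return (row number, missing columns) for every row with an empty mandatory field."""
    problems = []
    for number, row in enumerate(csv.DictReader(io.StringIO(csv_text)), start=1):
        missing = [col for col in MANDATORY if not (row.get(col) or "").strip()]
        if missing:
            problems.append((number, missing))
    return problems

# Made-up sample in the exported layout; the second row lacks a license
sample = (
    "Title,Authors,License,Link\n"
    'Getting started,"Lampert, Mara",CC-BY-4.0,https://example.org/a\n'
    "No license,Someone,,https://example.org/b\n"
)
print(rows_missing_mandatory(sample))
```

For the real export, the text of the file at `destination` would be passed in instead of `sample`.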
