<img src="https://github.com/LinkedEarth/Logos/blob/master/PyLiPD/pyLiPD_logo1_transparent.png?raw=true" width ="800">

# Creating LiPD files from a tabular template

## Authors

[Deborah Khider](https://orcid.org/0000-0001-7501-8430)


## Preamble

If you only plan to create a single LiPD file, we recommend using the [LiPD Playground](https://lipd.net/playground). This tutorial is intended for users who wish to programmatically create multiple files from a template.

In this example, we use [this templated file](https://github.com/LinkedEarth/pylipdTutorials/blob/main/data/Oman.Tian.2023.xlsx). You can repurpose the Excel template as needed; it is only meant as an example.

### Goals

* Create a LiPD-formatted Dataset from an Excel template
* Add an ensemble table
* Save the Dataset to a file

Reading Time: 10 minutes

### Keywords

LiPD, LinkedEarth Ontology, Object-Oriented Programming

### Pre-requisites

An understanding of OOP and the LinkedEarth Ontology. Completion of the [Dataset class example](L3_dataset_class.ipynb). An understanding of how to [edit LiPD files](L3_editing.ipynb) can also be useful.

## Data Description

- Tian, Y., Fleitmann, D., Zhang, Q., Sha, L., Wassenburg, J. A., Axelsson, J., … Cheng, H. (2023). Holocene climate change in southern Oman deciphered by speleothem records and climate model simulations. Nature Communications, 14(1), 4718. doi:[10.1038/s41467-023-40454-z](https://www.nature.com/articles/s41467-023-40454-z). 

## Demonstration

Let's import the necessary packages. 

In [19]:
from pylipd.classes.dataset import Dataset
from pylipd.classes.archivetype import ArchiveTypeConstants
from pylipd.classes.funding import Funding
from pylipd.classes.interpretation import Interpretation
from pylipd.classes.interpretationvariable import InterpretationVariableConstants
from pylipd.classes.location import Location
from pylipd.classes.paleodata import PaleoData
from pylipd.classes.datatable import DataTable
from pylipd.classes.paleounit import PaleoUnitConstants
from pylipd.classes.paleovariable import PaleoVariableConstants
from pylipd.classes.person import Person
from pylipd.classes.publication import Publication
from pylipd.classes.resolution import Resolution
from pylipd.classes.variable import Variable
from pylipd.classes.model import Model
from pylipd.classes.chrondata import ChronData

import pandas as pd
import json

import re

### Opening our template file

The Excel file contains the following sheets:
- `About`
- `Guidelines`
- `Metadata`
- `paleo1measurementTable1`
- `chron1measurementTable1`
- `Lists`

The information we are interested in is contained in `Metadata`, `paleo1measurementTable1` and `chron1measurementTable1`. Notice that the last two sheets follow the [LiPD nomenclature](https://lipd.net/playground) closely, which helps keep track of the tables and where to insert them. However, you may choose any names that are convenient for you.
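
Because the data sheets follow a regular naming scheme, the sheet name itself can tell a script where a table belongs. The sketch below is only an illustration: `parse_sheet_name` and `SHEET_PATTERN` are hypothetical names, not part of PyLiPD, and the pattern is an assumption based solely on the two sheet names in this template. Other table names may need a different expression.

```python
import re

# Assumed pattern: <section><n><kind>Table<m>, e.g. "paleo1measurementTable1".
# This is inferred from the template's sheet names, not taken from a specification.
SHEET_PATTERN = re.compile(r"^(paleo|chron)(\d+)([A-Za-z]+)Table(\d+)$")


def parse_sheet_name(name):
    """Return (section, section_number, table_kind, table_number), or None."""
    match = SHEET_PATTERN.match(name)
    if match is None:
        return None
    section, section_number, kind, table_number = match.groups()
    return section, int(section_number), kind, int(table_number)


# Sheet names listed in the template
sheet_names = [
    "About",
    "Guidelines",
    "Metadata",
    "paleo1measurementTable1",
    "chron1measurementTable1",
    "Lists",
]

# Keep only the sheets whose names match the assumed pattern
data_sheets = {}
for name in sheet_names:
    parsed = parse_sheet_name(name)
    if parsed is not None:
        data_sheets[name] = parsed

for name, parsed in data_sheets.items():
    print(name, "->", parsed)
```

Sheets such as `About` or `Lists` do not match and are skipped, so only the two measurement tables remain in `data_sheets`.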

Let's start with the root metadata portion.

#### Metadata

In [15]:
def read_metadata(df):
    # Check for empty rows across all columns
    empty_rows = df.isnull().all(axis=1)
    
    # Initialize the start index of the first table
    start_idx = 0
    tables = []
    
    # Iterate through the indices of the DataFrame
    for idx in empty_rows[empty_rows].index:
        # Slice from the current start index to the row before the empty row
        if not df[start_idx:idx].empty:
            # Copy the slice so later changes never touch the original DataFrame
            current_table = df[start_idx:idx].copy()
            # Check if the table should use its first row as header
            if start_idx != 0:  # Skip header adjustment for the first table
                current_table.columns = current_table.iloc[0]  # Set first row as header
                # Drop the header row from the data and renumber the remaining rows
                current_table = current_table[1:].reset_index(drop=True)
            tables.append(current_table)
        # Update start_idx to the row after the current empty row
        start_idx = idx + 1
    
    # Handle the last table, if any, after the last empty row to the end of the DataFrame
    if start_idx < len(df):
        current_table = df[start_idx:].copy()
        if start_idx != 0:  # With no empty rows, the single table already has its header
            current_table.columns = current_table.iloc[0]  # Set first row as header
            # Drop the header row from the data and renumber the remaining rows
            current_table = current_table[1:].reset_index(drop=True)
        tables.append(current_table)

    # Assign the tables by position; this assumes the sheet holds at least four blocks, in this order
    root = tables[0]
    pub = tables[1]
    geo = tables[2]
    fund = tables[3]

    return root, pub, geo, fund    
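
The function above treats every fully empty row as a separator between tables and, for every table after the first, promotes the first row to column headers. To see that idea on a small scale, here is a minimal sketch on a toy DataFrame. It uses a different formulation (a cumulative count of empty rows plus `groupby`) rather than the loop above, and `split_on_empty_rows` along with all the placeholder values are hypothetical.

```python
import pandas as pd


def split_on_empty_rows(df):
    """Split a DataFrame into blocks separated by rows where every cell is empty."""
    is_empty = df.isnull().all(axis=1)
    # Rows between two separators share the same running count of empty rows
    block_id = is_empty.cumsum()
    kept = df[~is_empty]
    return [block.reset_index(drop=True) for _, block in kept.groupby(block_id[~is_empty])]


# Toy sheet: three blocks separated by empty rows (placeholder values only)
toy = pd.DataFrame({
    "key": ["root_a", None, "pub_col1", "pub_val1", None, "geo_col1", "geo_val1"],
    "value": ["root_b", None, "pub_col2", "pub_val2", None, "geo_col2", "geo_val2"],
})

blocks = split_on_empty_rows(toy)

# The first block keeps the sheet's own header; later blocks use their first row
tables = [blocks[0]]
for block in blocks[1:]:
    promoted = block[1:].reset_index(drop=True)
    promoted.columns = block.iloc[0].tolist()
    tables.append(promoted)

for table in tables:
    print(table, end="\n\n")
```

The second and third tables end up with `pub_col1`/`pub_col2` and `geo_col1`/`geo_col2` as column names, which mirrors the header adjustment that `read_metadata` applies to the publication, geo and funding blocks.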

In [16]:
file_path = "../data/Oman.Tian.2023.xlsx"
sheet_name = 'Metadata'

df = pd.read_excel(file_path, sheet_name=sheet_name)

In [17]:
# get the various tables
root, pub, geo, fund = read_metadata(df)

The next step is to create an empty [`Dataset`](https://pylipd.readthedocs.io/en/latest/api.html#pylipd.classes.dataset.Dataset) object so we can start storing the information. 

In [None]:
ds = Dataset()