# Convert xlsx to LaTeX
In this notebook we convert the first sheet of the Excel file `gene_benchmark/tasks/task_descriptions.xlsx` into a set of LaTeX tables, one per task family, for easy insertion into the appendix of the related manuscript.

In [1]:
import pandas as pd

In [2]:
tasks = pd.read_excel("../gene_benchmark/tasks/task_descriptions.xlsx")
tasks.head()

Unnamed: 0,Task Name,Family,Type,Size,Description,Origin,Link
0,Bivalent vs non-methylated,Genomic Properties,binary,133.0,Does the gene go through methylation or is it ...,GenePT,https://www.biorxiv.org/content/10.1101/2023.1...
1,Chromosome,Genomic Properties,categorical,19784.0,Chromosome,human protein atlas,https://www.proteinatlas.org/
2,Dosage sensitive vs insensitive tf,Genomic Properties,binary,487.0,Is gene expression affected by the number of c...,GenePT,https://www.biorxiv.org/content/10.1101/2023.1...
3,Lys4-only-methylated vs non-methylated,Genomic Properties,binary,171.0,Does gene go through Lys4 methylation,GenePT,http://humantfs.ccbr.utoronto.ca/download.php
4,Protein class,Genomic Properties,multi label,19784.0,Protein class(es) of the gene product accordin...,human protein atlas,https://www.proteinatlas.org/


In [3]:
# Replace the NaN size of the Pathology prognostics task (row 49) with 0 so the column can be converted to int
tasks.loc[49, "Size"] = 0

In [4]:
tasks["Size"] = pd.to_numeric(tasks["Size"], downcast="integer")  # convert to integer
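The two cells above depend on the missing size sitting at row label 49, which breaks silently if rows are added or reordered in the spreadsheet. A minimal sketch of a position-independent alternative follows. The `demo` frame and the `size_as_int` helper are hypothetical stand-ins, not part of the notebook. Note that this version always produces `int64`, whereas `pd.to_numeric(..., downcast="integer")` picks the smallest integer dtype that fits.

```python
import pandas as pd

# Hypothetical stand-in for the task sheet; the real data comes from task_descriptions.xlsx.
demo = pd.DataFrame(
    {
        "Task Name": ["Chromosome", "Pathology prognostics", "Protein class"],
        "Size": [19784.0, float("nan"), 19784.0],
    }
)


def size_as_int(frame: pd.DataFrame, column: str = "Size") -> pd.DataFrame:
    """Return a copy whose `column` has NaN replaced by 0 and an integer dtype."""
    result = frame.copy()
    # fillna handles every missing entry, wherever it sits in the frame
    result[column] = result[column].fillna(0).astype("int64")
    return result


converted = size_as_int(demo)
print(converted)
```

Whether 0 is the right placeholder for an unknown size is a judgment call carried over from the original cell; an empty string in the rendered table may be preferable for the manuscript.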

In [5]:
def print_task_family_table(task_family: str, description: pd.DataFrame):
    """Print a LaTeX table environment describing the tasks of one family."""
    print(r"\begin{table}")
    print(
        rf"\caption{{Detailed description for the benchmarking tasks belonging to the family {task_family}}}"
    )
    print(rf"\label{{tab:task_descriptions_{task_family.lower().replace(' ', '_')}}}")
    print(r"\centering")
    print(
        description.to_latex(
            columns=["Task Name", "Type", "Size", "Description", "Origin"], index=False
        )
    )
    print(r"\end{table}")  # close the environment opened above

In [6]:
for task_family, details in tasks.groupby(tasks["Family"]):
    print_task_family_table(task_family, details)

\begin{table}
\caption{Detailed description for the benchmarking tasks belonging to the family Biological Processes}
\label{tab:task_descriptions_biological_processes}
\centering
\begin{tabular}{llrll}
\toprule
Task Name & Type & Size & Description & Origin \\
\midrule
Biological process & multi label  & 10796 & UniProt keywords indicating involvement in a particular biological process & human protein atlas  \\
Ccd protein & binary & 1429 & Cell cycle dependent (CCD) proteins in the FUCCI U-2 OS cell line & human protein atlas  \\
Ccd transcript & binary & 1631 & Cell cycle dependent (CCD) genes by RNA expression in the FUCCI u-2 OS cell line & human protein atlas  \\
Disease involvement & multi label  & 5837 & UniProt keywords for disease, cancer, and FDA approved drug targets & human protein atlas  \\
Gene-disease association & regression & 411569 & Disease-gene association score as derived from the open targets platform & open targets \\
Hla class i vs class ii & binary & 44 & Iden