## SRP121892

**paper:** [PMID: 29456143](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11098552/) - Neocortical association cell types in the forebrain of birds and alligators, 2018

**date, curator:** 2024-10-10, Sara Carsanaro

**notes**
- emailed stevendb@uchicago.edu to clarify developmental stage - all E14, SRA is incorrect
- infoOrgan was supplemented from the paper: each sample originally just said brain, so I added the brain regions specified in the paper
- paper doesn't specify protocol but does say RNA Selection is polyA

### annotation summary
Run this after annotation is complete.

In [26]:
anat_summary = library[['infoOrgan', 'anatId', 'anatName', 'anatAnnotationStatus', 'comment']]
unique_anat = anat_summary.drop_duplicates()
display_df(unique_anat)

Unnamed: 0,infoOrgan,anatId,anatName,anatAnnotationStatus,comment
0,brain - arcopallium,UBERON:0007350,arcopallium,perfect match,"PMID: 29456143, emailed stevendb@uchicago.edu to clarify developmental stage - all E14, SRA is incorrect"
1,brain - basorostralis,UBERON:0000203,pallium,missing child term,"nucleus basorostralis of the pallium (PMID: 15116397), annotated as pallium since this doesn't appear to be in uberon and homologous structure is not known; PMID: 29456143, emailed stevendb@uchicago.edu to clarify developmental stage - all E14, SRA is incorrect"
2,brain - entopallium,UBERON:0014759,entopallium,perfect match,"PMID: 29456143, emailed stevendb@uchicago.edu to clarify developmental stage - all E14, SRA is incorrect"
3,brain - Field L,UBERON:0007334,nidopallium,missing child term,"Field L is the nidopallial region containing the primary auditory thalamorecipient zone (PMID: 15116397), annotated as nidopallium since this doesn't appear to be in uberon and homologous structure is not known; PMID: 29456143, emailed stevendb@uchicago.edu to clarify developmental stage - all E14, SRA is incorrect"
4,brain - HA (hyperpallium apicale),UBERON:0014757,hyperpallium apicale,perfect match,"PMID: 29456143, emailed stevendb@uchicago.edu to clarify developmental stage - all E14, SRA is incorrect"
5,brain - anterior mesopallium,UBERON:0007349,mesopallium,missing child term,"PMID: 29456143, emailed stevendb@uchicago.edu to clarify developmental stage - all E14, SRA is incorrect"
6,brain - posterior nidopallium,UBERON:0007334,nidopallium,missing child term,"PMID: 29456143, emailed stevendb@uchicago.edu to clarify developmental stage - all E14, SRA is incorrect"


In [25]:
dev_summary = library[['infoStage', 'stageId', 'stageName', 'stageAnnotationStatus']]
unique_dev = dev_summary.drop_duplicates()
display_df(unique_dev)

Unnamed: 0,infoStage,stageId,stageName,stageAnnotationStatus
0,Embryonic day 14,GgalDv:0000054,Hamburger Hamilton stage 40,perfect match


### set variables, import packages, define functions

In [1]:
experiment_id = "SRP121892"

path_to_create_exp_script = "/Users/scarsana/Desktop/git/scRNA-Seq/scripts/Create_ExpLib_tables.py" 
experiment_type = "bulk"

path_to_output_main = "/Users/scarsana/Desktop/git/expression-annotations/Notebooks/bulk/" 
path_to_output = "{}{}/".format(path_to_output_main, experiment_id)
library_path_from_script = "{}RNASeqLibrary_{}.tsv".format(path_to_output, experiment_id)
experiment_path_from_script = "{}RNASeqExperiment_{}.tsv".format(path_to_output, experiment_id)
library_to_add_path = "{}complete_RNASeqLibrary_{}.tsv".format(path_to_output, experiment_id)
experiment_to_add_path = "{}complete_RNASeqExperiment_{}.tsv".format(path_to_output, experiment_id)
script_file = "{}.ipynb".format(experiment_id)
commit_message_exp = '"adding annotated bulk experiment {}"'.format(experiment_id)
commit_message_py = '"adding annotation files for {} to notebook folder"'.format(experiment_id)


## to add to git
path_to_git_annotations = "/Users/scarsana/Desktop/git/expression-annotations/RNA_Seq/"
git_library_path = "{}RNASeqLibrary.tsv".format(path_to_git_annotations)
git_experiment_path = "{}RNASeqExperiment.tsv".format(path_to_git_annotations)

library_cols = ['#libraryId', 'experimentId', 'platform', 'SRSId', 'anatId', 'anatName', 'stageId', 'stageName', 'url_GSM', 'infoOrgan', 'infoStage', 'anatAnnotationStatus', 'anatBiologicalStatus', 'stageAnnotationStatus', 'sex', 'strain', 'genotype', 'speciesId', 'protocol', 'protocolType', 'RNASelection', 'globin_reduction', 'replicate', 'lib_name', 'sampleName', 'sampleAge_value', 'sampleAge_unit', 'PATOid', 'PATOname','comment', 'condition', 'physiologicalStatus', 'annotatorId', 'lastModificationDate']

In [2]:
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
import pandas as pd
import numpy as np
from IPython.display import display, HTML
import os
import csv

# displays df with the scrollbar next to the DataFrame
def display_df(df):
    pd.set_option("display.max_rows", None)
    pd.set_option("display.max_columns", None)
    display(HTML("<div style='height: 300px; overflow: auto; width: fit-content'>" +
        df.style.to_html(index=False) + "</div>"))

# compares two columns in a dataframe (case insensitive) and prints return_col for the rows that are not equal
def compare_columns(df, col1, col2, return_col):
    compare_return = df[col1].str.lower() != df[col2].str.lower()
    if not compare_return.any():
        print("The two columns are equal (case insensitive)")
    else:
        print("The following rows are not equal: ")
        print(df.loc[compare_return, return_col])

# fixes formatting of file to match libreoffice settings/historic file format
def update_format(path):
    with open(path, 'r') as file:
        filedata = file.read()
    # Replace the target string
    filedata = filedata.replace("\t\"\"", "\t")
    # Write the file out again
    with open(path, 'w') as file:
        file.write(filedata)

# checks for duplicate values in a specific column and prints those values + the corresponding library id
def dup_check(df, column):
    # boolean mask: True for every row whose value in `column` occurs more than once
    is_dup = df.duplicated(subset=[column], keep=False)
    if not is_dup.any():
        print("no duplicate values in " + column)
    elif column != '#libraryId':
        print(df.loc[is_dup, ['#libraryId', column]].sort_values(by=column))
    else:
        print(df.loc[is_dup, ['#libraryId']])

# prints all unique values in a specific column
def unique_sorted(df, column):
    unique = df[column].unique()
    unique.sort()
    print(unique)
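
The two checking helpers above each rest on a single boolean mask. The sketch below shows those masks on a toy table; the library IDs and names in it are hypothetical and are not part of this experiment.

```python
import pandas as pd

# hypothetical toy rows, only for illustrating the masks
toy = pd.DataFrame({
    "#libraryId": ["LIB1", "LIB2", "LIB3"],
    "lib_name": ["Gga_HA", "gga_ha", "Gga_FieldL"],
    "lib_name_2": ["Gga_HA", "Gga_HA", "Gga_Field_L"],
    "SRSId": ["SRS1", "SRS1", "SRS2"],
})

# mask behind compare_columns: True where the two columns differ, ignoring case
mismatch = toy["lib_name"].str.lower() != toy["lib_name_2"].str.lower()

# mask behind dup_check: keep=False marks every member of a duplicated group
is_dup = toy.duplicated(subset=["SRSId"], keep=False)

print(toy.loc[mismatch, "#libraryId"].tolist())
print(toy.loc[is_dup, "#libraryId"].tolist())
```

With `keep=False`, both LIB1 and LIB2 are reported for the shared `SRSId`, which is what makes the duplicate easy to trace back to its libraries.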

### script

In [3]:
! python3 $path_to_create_exp_script $experiment_id $path_to_output $experiment_type

  all_protoc = [w.replace('(', '\(') for w in all_protoc]
  all_protoc = [w.replace(')', '\)') for w in all_protoc] 
Be patient, it may take a few minutes.
curl: (22) The requested URL returned error: 500
[m[?7h[4l>7[r[?1;3;4;6l8[31m[1m[7m ERROR: [m[?7h[4l>7[r[?1;3;4;6l8[31m[1m curl command failed ( Thu Oct 10 15:55:51 CEST 2024 ) with: 22[m[?7h[4l>7[r[?1;3;4;6l8
[34mhttps://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi -d id=29456143&dbfrom=pubmed&cmd=prlinks&tool=edirect&edirect=16.2&edirect_os=Darwin&email=scarsana%40SORGE42778[m[?7h[4l>7[r[?1;3;4;6l8
HTTP/1.1 500 Internal Server Error
[34mnquire -url https://eutils.ncbi.nlm.nih.gov/entrez/eutils/ elink.fcgi -id 29456143 -dbfrom pubmed -cmd prlinks -tool edirect -edirect 16.2 -edirect_os Darwin -email scarsana@SORGE42778
[m[31mEMPTY RESULT[m
[34mSECOND ATTEMPT[m
curl: (22) The requested URL returned error: 500
[m[?7h[4l>7[r[?1;3;4;6l8[31m[1m[7m ERROR: [m[?7h[4l>7[r[?1;

### library annotations

In [5]:
library = pd.read_csv(library_path_from_script, sep='\t', index_col=False, keep_default_na=False, na_values=['NULL','null', 'nan','NaN'], dtype=object)
display_df(library)

Unnamed: 0,#libraryId,experimentId,platform,SRSId,anatId,anatName,stageId,stageName,url_GSM,infoOrgan,infoStage,anatAnnotationStatus,anatBiologicalStatus,stageAnnotationStatus,sex,strain,genotype,speciesId,protocol,protocolType,RNASelection,globin_reduction,replicate,lib_name,sampleName,sampleAge_value,sampleAge_unit,PATOid,PATOname,comment,condition,physiologicalStatus,annotatorId,lastModificationDate,library_contruction_protocol,source_qc,lib_name_2,lib_name_3,source_name,individual,infoStage_2,infoStage_3
0,SRX3334825,SRP121892,Illumina HiSeq 2000,SRS2637233,UBERON:0007350,arcopallium,,,,brain - arcopallium,Embryonic day 14,perfect match,not documented,,not collected,White leghorn,,9031,,,,,,Gga_arcopallium,SAMN07839988,,,,,"PMID: 29456143, emailed stevendb@uchicago.edu to clarify developmental stage - all E14, SRA is incorrect",,,,10/10/2024,"RNA integrity was analyzed with a Bioanalyzer 2100; only samples with clean rRNA peaks and little to no degradation were used. Total RNA was polyA selected and directionally sequenced at the University of Chicago Genomics Facility on an Illumina HiSeq2000 per manufacturer's instructions, generating 100bp paired-end reads with an insert size of 300bp",,Gga_arcopallium,,,,E14,
1,SRX3334824,SRP121892,Illumina HiSeq 2000,SRS2637232,UBERON:0000203,pallium,,,,brain - basorostralis,Embryonic day 14,missing child term,not documented,,not collected,White leghorn,,9031,,,,,,Gga_basorostralis,SAMN07839989,,,,,"nucleus basorostralis of the pallium (PMID: 15116397), annotated as pallium since this doesn't appear to be in uberon and homologous structure is not known; PMID: 29456143, emailed stevendb@uchicago.edu to clarify developmental stage - all E14, SRA is incorrect",,,,10/10/2024,"RNA integrity was analyzed with a Bioanalyzer 2100; only samples with clean rRNA peaks and little to no degradation were used. Total RNA was polyA selected and directionally sequenced at the University of Chicago Genomics Facility on an Illumina HiSeq2000 per manufacturer's instructions, generating 100bp paired-end reads with an insert size of 300bp",,Gga_basorostralis,,,,E15,
2,SRX3334823,SRP121892,Illumina HiSeq 2000,SRS2637229,UBERON:0014759,entopallium,,,,brain - entopallium,Embryonic day 14,perfect match,not documented,,not collected,White leghorn,,9031,,,,,,Gga_entopallium,SAMN07839990,,,,,"PMID: 29456143, emailed stevendb@uchicago.edu to clarify developmental stage - all E14, SRA is incorrect",,,,10/10/2024,"RNA integrity was analyzed with a Bioanalyzer 2100; only samples with clean rRNA peaks and little to no degradation were used. Total RNA was polyA selected and directionally sequenced at the University of Chicago Genomics Facility on an Illumina HiSeq2000 per manufacturer's instructions, generating 100bp paired-end reads with an insert size of 300bp",,Gga_entopallium,,,,E16,
3,SRX3334822,SRP121892,Illumina HiSeq 2000,SRS2637228,UBERON:0007334,nidopallium,,,,brain - Field L,Embryonic day 14,missing child term,not documented,,not collected,White leghorn,,9031,,,,,,Gga_FieldL,SAMN07839991,,,,,"Field L is the nidopallial region containing the primary auditory thalamorecipient zone (PMID: 15116397), annotated as nidopallium since this doesn't appear to be in uberon and homologous structure is not known; PMID: 29456143, emailed stevendb@uchicago.edu to clarify developmental stage - all E14, SRA is incorrect",,,,10/10/2024,"RNA integrity was analyzed with a Bioanalyzer 2100; only samples with clean rRNA peaks and little to no degradation were used. Total RNA was polyA selected and directionally sequenced at the University of Chicago Genomics Facility on an Illumina HiSeq2000 per manufacturer's instructions, generating 100bp paired-end reads with an insert size of 300bp",,Gga_FieldL,,,,E17,
4,SRX3334821,SRP121892,Illumina HiSeq 2000,SRS2637227,UBERON:0014757,hyperpallium apicale,,,,brain - HA (hyperpallium apicale),Embryonic day 14,perfect match,not documented,,not collected,White leghorn,,9031,,,,,,Gga_HA,SAMN07839992,,,,,"PMID: 29456143, emailed stevendb@uchicago.edu to clarify developmental stage - all E14, SRA is incorrect",,,,10/10/2024,"RNA integrity was analyzed with a Bioanalyzer 2100; only samples with clean rRNA peaks and little to no degradation were used. Total RNA was polyA selected and directionally sequenced at the University of Chicago Genomics Facility on an Illumina HiSeq2000 per manufacturer's instructions, generating 100bp paired-end reads with an insert size of 300bp",,Gga_HA,,,,E18,
5,SRX3334820,SRP121892,Illumina HiSeq 2000,SRS2637226,UBERON:0007349,mesopallium,,,,brain - anterior mesopallium,Embryonic day 14,missing child term,not documented,,not collected,White leghorn,,9031,,,,,,Gga_mesopallium,SAMN07839993,,,,,"PMID: 29456143, emailed stevendb@uchicago.edu to clarify developmental stage - all E14, SRA is incorrect",,,,10/10/2024,"RNA integrity was analyzed with a Bioanalyzer 2100; only samples with clean rRNA peaks and little to no degradation were used. Total RNA was polyA selected and directionally sequenced at the University of Chicago Genomics Facility on an Illumina HiSeq2000 per manufacturer's instructions, generating 100bp paired-end reads with an insert size of 300bp",,Gga_mesopallium,,,,E19,
6,SRX3334819,SRP121892,Illumina HiSeq 2000,SRS2637225,UBERON:0007334,nidopallium,,,,brain - posterior nidopallium,Embryonic day 14,missing child term,not documented,,not collected,White leghorn,,9031,,,,,,Gga_Nidopallium,SAMN07839994,,,,,"PMID: 29456143, emailed stevendb@uchicago.edu to clarify developmental stage - all E14, SRA is incorrect",,,,10/10/2024,"RNA integrity was analyzed with a Bioanalyzer 2100; only samples with clean rRNA peaks and little to no degradation were used. Total RNA was polyA selected and directionally sequenced at the University of Chicago Genomics Facility on an Illumina HiSeq2000 per manufacturer's instructions, generating 100bp paired-end reads with an insert size of 300bp",,Gga_Nidopallium,,,,E20,
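
The `read_csv` call above turns off pandas' default NA list and names its own. A minimal sketch of what that means for individual cells, using a hypothetical two-row TSV rather than the real library file:

```python
import io
import pandas as pd

# hypothetical TSV content; the last field of each row is deliberately empty
tsv = (
    "#libraryId\tsex\tstrain\tgenotype\n"
    "LIB1\tNA\tWhite leghorn\t\n"
    "LIB2\tNULL\tnan\t\n"
)

toy = pd.read_csv(
    io.StringIO(tsv),
    sep="\t",
    index_col=False,
    keep_default_na=False,                     # "NA" and "" are no longer treated as missing
    na_values=["NULL", "null", "nan", "NaN"],  # only these strings become NaN
    dtype=object,
)

print(toy)
```

Under these settings the string `NA` survives as text, empty fields stay empty strings, and only the four listed spellings are read as missing.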


#### anatomical entity
* [uberon ols](https://www.ebi.ac.uk/ols4/ontologies/uberon)
* infoOrgan was supplemented from the paper: each sample originally just said brain, so I added the brain regions specified in the paper
* manually did anatId, anatName, anatAnnotationStatus, and anatBiologicalStatus - see comments for extra details
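
The anatomy columns in this notebook were filled in by hand. As a possible alternative, the same assignments could be written as a lookup keyed on `infoOrgan`. This is only a sketch: the helper name `annotate_anatomy` and the `ANAT_MAP` dictionary are hypothetical, and the term IDs are copied from the annotation summary table.

```python
import pandas as pd

# infoOrgan -> (anatId, anatName, anatAnnotationStatus), copied from the summary table
ANAT_MAP = {
    "brain - arcopallium": ("UBERON:0007350", "arcopallium", "perfect match"),
    "brain - basorostralis": ("UBERON:0000203", "pallium", "missing child term"),
    "brain - entopallium": ("UBERON:0014759", "entopallium", "perfect match"),
    "brain - Field L": ("UBERON:0007334", "nidopallium", "missing child term"),
    "brain - HA (hyperpallium apicale)": ("UBERON:0014757", "hyperpallium apicale", "perfect match"),
    "brain - anterior mesopallium": ("UBERON:0007349", "mesopallium", "missing child term"),
    "brain - posterior nidopallium": ("UBERON:0007334", "nidopallium", "missing child term"),
}

def annotate_anatomy(df, mapping=ANAT_MAP):
    """Hypothetical helper: fill the anatomy columns from infoOrgan."""
    # refuse to guess: every infoOrgan value must have an explicit mapping
    unmapped = sorted(set(df["infoOrgan"]) - set(mapping))
    if unmapped:
        raise KeyError(f"no anatomy mapping for: {unmapped}")
    out = df.copy()
    for position, column in enumerate(["anatId", "anatName", "anatAnnotationStatus"]):
        out[column] = out["infoOrgan"].map(lambda organ: mapping[organ][position])
    return out

toy = pd.DataFrame({"infoOrgan": ["brain - Field L", "brain - arcopallium"]})
annotated = annotate_anatomy(toy)
print(annotated)
```

Raising on unmapped values keeps a new or misspelled `infoOrgan` from silently ending up without an annotation.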

In [6]:
unique_sorted(library, "infoOrgan")

['brain - Field L' 'brain - HA (hyperpallium apicale)'
 'brain - anterior mesopallium' 'brain - arcopallium'
 'brain - basorostralis' 'brain - entopallium'
 'brain - posterior nidopallium']


In [7]:
# view manual annotations
display_df(library)

Unnamed: 0,#libraryId,experimentId,platform,SRSId,anatId,anatName,stageId,stageName,url_GSM,infoOrgan,infoStage,anatAnnotationStatus,anatBiologicalStatus,stageAnnotationStatus,sex,strain,genotype,speciesId,protocol,protocolType,RNASelection,globin_reduction,replicate,lib_name,sampleName,sampleAge_value,sampleAge_unit,PATOid,PATOname,comment,condition,physiologicalStatus,annotatorId,lastModificationDate,library_contruction_protocol,source_qc,lib_name_2,lib_name_3,source_name,individual,infoStage_2,infoStage_3
0,SRX3334825,SRP121892,Illumina HiSeq 2000,SRS2637233,UBERON:0007350,arcopallium,,,,brain - arcopallium,Embryonic day 14,perfect match,not documented,,not collected,White leghorn,,9031,,,,,,Gga_arcopallium,SAMN07839988,,,,,"PMID: 29456143, emailed stevendb@uchicago.edu to clarify developmental stage - all E14, SRA is incorrect",,,,10/10/2024,"RNA integrity was analyzed with a Bioanalyzer 2100; only samples with clean rRNA peaks and little to no degradation were used. Total RNA was polyA selected and directionally sequenced at the University of Chicago Genomics Facility on an Illumina HiSeq2000 per manufacturer's instructions, generating 100bp paired-end reads with an insert size of 300bp",,Gga_arcopallium,,,,E14,
1,SRX3334824,SRP121892,Illumina HiSeq 2000,SRS2637232,UBERON:0000203,pallium,,,,brain - basorostralis,Embryonic day 14,missing child term,not documented,,not collected,White leghorn,,9031,,,,,,Gga_basorostralis,SAMN07839989,,,,,"nucleus basorostralis of the pallium (PMID: 15116397), annotated as pallium since this doesn't appear to be in uberon and homologous structure is not known; PMID: 29456143, emailed stevendb@uchicago.edu to clarify developmental stage - all E14, SRA is incorrect",,,,10/10/2024,"RNA integrity was analyzed with a Bioanalyzer 2100; only samples with clean rRNA peaks and little to no degradation were used. Total RNA was polyA selected and directionally sequenced at the University of Chicago Genomics Facility on an Illumina HiSeq2000 per manufacturer's instructions, generating 100bp paired-end reads with an insert size of 300bp",,Gga_basorostralis,,,,E15,
2,SRX3334823,SRP121892,Illumina HiSeq 2000,SRS2637229,UBERON:0014759,entopallium,,,,brain - entopallium,Embryonic day 14,perfect match,not documented,,not collected,White leghorn,,9031,,,,,,Gga_entopallium,SAMN07839990,,,,,"PMID: 29456143, emailed stevendb@uchicago.edu to clarify developmental stage - all E14, SRA is incorrect",,,,10/10/2024,"RNA integrity was analyzed with a Bioanalyzer 2100; only samples with clean rRNA peaks and little to no degradation were used. Total RNA was polyA selected and directionally sequenced at the University of Chicago Genomics Facility on an Illumina HiSeq2000 per manufacturer's instructions, generating 100bp paired-end reads with an insert size of 300bp",,Gga_entopallium,,,,E16,
3,SRX3334822,SRP121892,Illumina HiSeq 2000,SRS2637228,UBERON:0007334,nidopallium,,,,brain - Field L,Embryonic day 14,missing child term,not documented,,not collected,White leghorn,,9031,,,,,,Gga_FieldL,SAMN07839991,,,,,"Field L is the nidopallial region containing the primary auditory thalamorecipient zone (PMID: 15116397), annotated as nidopallium since this doesn't appear to be in uberon and homologous structure is not known; PMID: 29456143, emailed stevendb@uchicago.edu to clarify developmental stage - all E14, SRA is incorrect",,,,10/10/2024,"RNA integrity was analyzed with a Bioanalyzer 2100; only samples with clean rRNA peaks and little to no degradation were used. Total RNA was polyA selected and directionally sequenced at the University of Chicago Genomics Facility on an Illumina HiSeq2000 per manufacturer's instructions, generating 100bp paired-end reads with an insert size of 300bp",,Gga_FieldL,,,,E17,
4,SRX3334821,SRP121892,Illumina HiSeq 2000,SRS2637227,UBERON:0014757,hyperpallium apicale,,,,brain - HA (hyperpallium apicale),Embryonic day 14,perfect match,not documented,,not collected,White leghorn,,9031,,,,,,Gga_HA,SAMN07839992,,,,,"PMID: 29456143, emailed stevendb@uchicago.edu to clarify developmental stage - all E14, SRA is incorrect",,,,10/10/2024,"RNA integrity was analyzed with a Bioanalyzer 2100; only samples with clean rRNA peaks and little to no degradation were used. Total RNA was polyA selected and directionally sequenced at the University of Chicago Genomics Facility on an Illumina HiSeq2000 per manufacturer's instructions, generating 100bp paired-end reads with an insert size of 300bp",,Gga_HA,,,,E18,
5,SRX3334820,SRP121892,Illumina HiSeq 2000,SRS2637226,UBERON:0007349,mesopallium,,,,brain - anterior mesopallium,Embryonic day 14,missing child term,not documented,,not collected,White leghorn,,9031,,,,,,Gga_mesopallium,SAMN07839993,,,,,"PMID: 29456143, emailed stevendb@uchicago.edu to clarify developmental stage - all E14, SRA is incorrect",,,,10/10/2024,"RNA integrity was analyzed with a Bioanalyzer 2100; only samples with clean rRNA peaks and little to no degradation were used. Total RNA was polyA selected and directionally sequenced at the University of Chicago Genomics Facility on an Illumina HiSeq2000 per manufacturer's instructions, generating 100bp paired-end reads with an insert size of 300bp",,Gga_mesopallium,,,,E19,
6,SRX3334819,SRP121892,Illumina HiSeq 2000,SRS2637225,UBERON:0007334,nidopallium,,,,brain - posterior nidopallium,Embryonic day 14,missing child term,not documented,,not collected,White leghorn,,9031,,,,,,Gga_Nidopallium,SAMN07839994,,,,,"PMID: 29456143, emailed stevendb@uchicago.edu to clarify developmental stage - all E14, SRA is incorrect",,,,10/10/2024,"RNA integrity was analyzed with a Bioanalyzer 2100; only samples with clean rRNA peaks and little to no degradation were used. Total RNA was polyA selected and directionally sequenced at the University of Chicago Genomics Facility on an Illumina HiSeq2000 per manufacturer's instructions, generating 100bp paired-end reads with an insert size of 300bp",,Gga_Nidopallium,,,,E20,


#### stage
- [species specific developmental ontologies](https://github.com/obophenotype/developmental-stage-ontologies/tree/master/src)

In [8]:
unique_sorted(library, "infoStage")

['Embryonic day 14']


In [9]:
# all libraries are E14, so they share one stage
library.loc[:,'stageId'] = 'GgalDv:0000054'
library.loc[:,'stageName'] = 'Hamburger Hamilton stage 40'
# GgalDv:0000054 is usually obtained after 14.0 days of incubation
# allowed status values: perfect match, missing child term, other
library.loc[:,'stageAnnotationStatus'] = 'perfect match'


# view
display_df(library)

Unnamed: 0,#libraryId,experimentId,platform,SRSId,anatId,anatName,stageId,stageName,url_GSM,infoOrgan,infoStage,anatAnnotationStatus,anatBiologicalStatus,stageAnnotationStatus,sex,strain,genotype,speciesId,protocol,protocolType,RNASelection,globin_reduction,replicate,lib_name,sampleName,sampleAge_value,sampleAge_unit,PATOid,PATOname,comment,condition,physiologicalStatus,annotatorId,lastModificationDate,library_contruction_protocol,source_qc,lib_name_2,lib_name_3,source_name,individual,infoStage_2,infoStage_3
0,SRX3334825,SRP121892,Illumina HiSeq 2000,SRS2637233,UBERON:0007350,arcopallium,GgalDv:0000054,Hamburger Hamilton stage 40,,brain - arcopallium,Embryonic day 14,perfect match,not documented,perfect match,not collected,White leghorn,,9031,,,,,,Gga_arcopallium,SAMN07839988,,,,,"PMID: 29456143, emailed stevendb@uchicago.edu to clarify developmental stage - all E14, SRA is incorrect",,,,10/10/2024,"RNA integrity was analyzed with a Bioanalyzer 2100; only samples with clean rRNA peaks and little to no degradation were used. Total RNA was polyA selected and directionally sequenced at the University of Chicago Genomics Facility on an Illumina HiSeq2000 per manufacturer's instructions, generating 100bp paired-end reads with an insert size of 300bp",,Gga_arcopallium,,,,E14,
1,SRX3334824,SRP121892,Illumina HiSeq 2000,SRS2637232,UBERON:0000203,pallium,GgalDv:0000054,Hamburger Hamilton stage 40,,brain - basorostralis,Embryonic day 14,missing child term,not documented,perfect match,not collected,White leghorn,,9031,,,,,,Gga_basorostralis,SAMN07839989,,,,,"nucleus basorostralis of the pallium (PMID: 15116397), annotated as pallium since this doesn't appear to be in uberon and homologous structure is not known; PMID: 29456143, emailed stevendb@uchicago.edu to clarify developmental stage - all E14, SRA is incorrect",,,,10/10/2024,"RNA integrity was analyzed with a Bioanalyzer 2100; only samples with clean rRNA peaks and little to no degradation were used. Total RNA was polyA selected and directionally sequenced at the University of Chicago Genomics Facility on an Illumina HiSeq2000 per manufacturer's instructions, generating 100bp paired-end reads with an insert size of 300bp",,Gga_basorostralis,,,,E15,
2,SRX3334823,SRP121892,Illumina HiSeq 2000,SRS2637229,UBERON:0014759,entopallium,GgalDv:0000054,Hamburger Hamilton stage 40,,brain - entopallium,Embryonic day 14,perfect match,not documented,perfect match,not collected,White leghorn,,9031,,,,,,Gga_entopallium,SAMN07839990,,,,,"PMID: 29456143, emailed stevendb@uchicago.edu to clarify developmental stage - all E14, SRA is incorrect",,,,10/10/2024,"RNA integrity was analyzed with a Bioanalyzer 2100; only samples with clean rRNA peaks and little to no degradation were used. Total RNA was polyA selected and directionally sequenced at the University of Chicago Genomics Facility on an Illumina HiSeq2000 per manufacturer's instructions, generating 100bp paired-end reads with an insert size of 300bp",,Gga_entopallium,,,,E16,
3,SRX3334822,SRP121892,Illumina HiSeq 2000,SRS2637228,UBERON:0007334,nidopallium,GgalDv:0000054,Hamburger Hamilton stage 40,,brain - Field L,Embryonic day 14,missing child term,not documented,perfect match,not collected,White leghorn,,9031,,,,,,Gga_FieldL,SAMN07839991,,,,,"Field L is the nidopallial region containing the primary auditory thalamorecipient zone (PMID: 15116397), annotated as nidopallium since this doesn't appear to be in uberon and homologous structure is not known; PMID: 29456143, emailed stevendb@uchicago.edu to clarify developmental stage - all E14, SRA is incorrect",,,,10/10/2024,"RNA integrity was analyzed with a Bioanalyzer 2100; only samples with clean rRNA peaks and little to no degradation were used. Total RNA was polyA selected and directionally sequenced at the University of Chicago Genomics Facility on an Illumina HiSeq2000 per manufacturer's instructions, generating 100bp paired-end reads with an insert size of 300bp",,Gga_FieldL,,,,E17,
4,SRX3334821,SRP121892,Illumina HiSeq 2000,SRS2637227,UBERON:0014757,hyperpallium apicale,GgalDv:0000054,Hamburger Hamilton stage 40,,brain - HA (hyperpallium apicale),Embryonic day 14,perfect match,not documented,perfect match,not collected,White leghorn,,9031,,,,,,Gga_HA,SAMN07839992,,,,,"PMID: 29456143, emailed stevendb@uchicago.edu to clarify developmental stage - all E14, SRA is incorrect",,,,10/10/2024,"RNA integrity was analyzed with a Bioanalyzer 2100; only samples with clean rRNA peaks and little to no degradation were used. Total RNA was polyA selected and directionally sequenced at the University of Chicago Genomics Facility on an Illumina HiSeq2000 per manufacturer's instructions, generating 100bp paired-end reads with an insert size of 300bp",,Gga_HA,,,,E18,
5,SRX3334820,SRP121892,Illumina HiSeq 2000,SRS2637226,UBERON:0007349,mesopallium,GgalDv:0000054,Hamburger Hamilton stage 40,,brain - anterior mesopallium,Embryonic day 14,missing child term,not documented,perfect match,not collected,White leghorn,,9031,,,,,,Gga_mesopallium,SAMN07839993,,,,,"PMID: 29456143, emailed stevendb@uchicago.edu to clarify developmental stage - all E14, SRA is incorrect",,,,10/10/2024,"RNA integrity was analyzed with a Bioanalyzer 2100; only samples with clean rRNA peaks and little to no degradation were used. Total RNA was polyA selected and directionally sequenced at the University of Chicago Genomics Facility on an Illumina HiSeq2000 per manufacturer's instructions, generating 100bp paired-end reads with an insert size of 300bp",,Gga_mesopallium,,,,E19,
6,SRX3334819,SRP121892,Illumina HiSeq 2000,SRS2637225,UBERON:0007334,nidopallium,GgalDv:0000054,Hamburger Hamilton stage 40,,brain - posterior nidopallium,Embryonic day 14,missing child term,not documented,perfect match,not collected,White leghorn,,9031,,,,,,Gga_Nidopallium,SAMN07839994,,,,,"PMID: 29456143, emailed stevendb@uchicago.edu to clarify developmental stage - all E14, SRA is incorrect",,,,10/10/2024,"RNA integrity was analyzed with a Bioanalyzer 2100; only samples with clean rRNA peaks and little to no degradation were used. Total RNA was polyA selected and directionally sequenced at the University of Chicago Genomics Facility on an Illumina HiSeq2000 per manufacturer's instructions, generating 100bp paired-end reads with an insert size of 300bp",,Gga_Nidopallium,,,,E20,
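
The table keeps two stage columns that disagree: `infoStage` holds the curated value (E14 for every library, as confirmed by email), while `infoStage_2` runs from E14 to E20. A small sketch of how such disagreements could be flagged follows. It assumes `infoStage_2` is the value submitted to SRA, and the helper `embryonic_day` is hypothetical; only the first three libraries are reproduced.

```python
import re
import pandas as pd

def embryonic_day(text):
    """Hypothetical helper: pull the day number out of 'Embryonic day 14' or 'E14'."""
    match = re.search(r"(?:Embryonic day\s*|E)(\d+)", str(text))
    return int(match.group(1)) if match else None

# first three libraries of this experiment, values copied from the table above
toy = pd.DataFrame({
    "#libraryId": ["SRX3334825", "SRX3334824", "SRX3334823"],
    "infoStage": ["Embryonic day 14"] * 3,
    "infoStage_2": ["E14", "E15", "E16"],
})

toy["curated_day"] = toy["infoStage"].map(embryonic_day)
toy["sra_day"] = toy["infoStage_2"].map(embryonic_day)

# rows where the submitted stage disagrees with the curated one
flagged = toy[toy["curated_day"] != toy["sra_day"]]
print(flagged[["#libraryId", "infoStage", "infoStage_2"]])
```

A check like this only surfaces the mismatch; which value is right still has to come from the paper or the authors, as it did here.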


#### sex, strain, genotype, speciesId
- uniprot [strain list](https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/docs/strains)
- uniprot [species list](https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/docs/speclist)
- bgee [strain mapping](https://gitlab.sib.swiss/Bgee/expression-annotations/-/tree/develop/Strains?ref_type=heads)

In [11]:
library.loc[library["sex"] == "not collected", "sex"] = "NA"

# strain White leghorn is correct as is
#library.loc[:,'strain'] = ''

#library.loc[:,'genotype'] = ''

#library.loc[:,'speciesId'] = ''

# view
display_df(library)

Unnamed: 0,#libraryId,experimentId,platform,SRSId,anatId,anatName,stageId,stageName,url_GSM,infoOrgan,infoStage,anatAnnotationStatus,anatBiologicalStatus,stageAnnotationStatus,sex,strain,genotype,speciesId,protocol,protocolType,RNASelection,globin_reduction,replicate,lib_name,sampleName,sampleAge_value,sampleAge_unit,PATOid,PATOname,comment,condition,physiologicalStatus,annotatorId,lastModificationDate,library_contruction_protocol,source_qc,lib_name_2,lib_name_3,source_name,individual,infoStage_2,infoStage_3
0,SRX3334825,SRP121892,Illumina HiSeq 2000,SRS2637233,UBERON:0007350,arcopallium,GgalDv:0000054,Hamburger Hamilton stage 40,,brain - arcopallium,Embryonic day 14,perfect match,not documented,perfect match,,White leghorn,,9031,,,,,,Gga_arcopallium,SAMN07839988,,,,,"PMID: 29456143, emailed stevendb@uchicago.edu to clarify developmental stage - all E14, SRA is incorrect",,,,10/10/2024,"RNA integrity was analyzed with a Bioanalyzer 2100; only samples with clean rRNA peaks and little to no degradation were used. Total RNA was polyA selected and directionally sequenced at the University of Chicago Genomics Facility on an Illumina HiSeq2000 per manufacturer's instructions, generating 100bp paired-end reads with an insert size of 300bp",,Gga_arcopallium,,,,E14,
1,SRX3334824,SRP121892,Illumina HiSeq 2000,SRS2637232,UBERON:0000203,pallium,GgalDv:0000054,Hamburger Hamilton stage 40,,brain - basorostralis,Embryonic day 14,missing child term,not documented,perfect match,,White leghorn,,9031,,,,,,Gga_basorostralis,SAMN07839989,,,,,"nucleus basorostralis of the pallium (PMID: 15116397), annotated as pallium since this doesn't appear to be in uberon and homologous structure is not known; PMID: 29456143, emailed stevendb@uchicago.edu to clarify developmental stage - all E14, SRA is incorrect",,,,10/10/2024,"RNA integrity was analyzed with a Bioanalyzer 2100; only samples with clean rRNA peaks and little to no degradation were used. Total RNA was polyA selected and directionally sequenced at the University of Chicago Genomics Facility on an Illumina HiSeq2000 per manufacturer's instructions, generating 100bp paired-end reads with an insert size of 300bp",,Gga_basorostralis,,,,E15,
2,SRX3334823,SRP121892,Illumina HiSeq 2000,SRS2637229,UBERON:0014759,entopallium,GgalDv:0000054,Hamburger Hamilton stage 40,,brain - entopallium,Embryonic day 14,perfect match,not documented,perfect match,,White leghorn,,9031,,,,,,Gga_entopallium,SAMN07839990,,,,,"PMID: 29456143, emailed stevendb@uchicago.edu to clarify developmental stage - all E14, SRA is incorrect",,,,10/10/2024,"RNA integrity was analyzed with a Bioanalyzer 2100; only samples with clean rRNA peaks and little to no degradation were used. Total RNA was polyA selected and directionally sequenced at the University of Chicago Genomics Facility on an Illumina HiSeq2000 per manufacturer's instructions, generating 100bp paired-end reads with an insert size of 300bp",,Gga_entopallium,,,,E16,
3,SRX3334822,SRP121892,Illumina HiSeq 2000,SRS2637228,UBERON:0007334,nidopallium,GgalDv:0000054,Hamburger Hamilton stage 40,,brain - Field L,Embryonic day 14,missing child term,not documented,perfect match,,White leghorn,,9031,,,,,,Gga_FieldL,SAMN07839991,,,,,"Field L is the nidopallial region containing the primary auditory thalamorecipient zone (PMID: 15116397), annotated as nidopallium since this doesn't appear to be in uberon and homologous structure is not known; PMID: 29456143, emailed stevendb@uchicago.edu to clarify developmental stage - all E14, SRA is incorrect",,,,10/10/2024,"RNA integrity was analyzed with a Bioanalyzer 2100; only samples with clean rRNA peaks and little to no degradation were used. Total RNA was polyA selected and directionally sequenced at the University of Chicago Genomics Facility on an Illumina HiSeq2000 per manufacturer's instructions, generating 100bp paired-end reads with an insert size of 300bp",,Gga_FieldL,,,,E17,
4,SRX3334821,SRP121892,Illumina HiSeq 2000,SRS2637227,UBERON:0014757,hyperpallium apicale,GgalDv:0000054,Hamburger Hamilton stage 40,,brain - HA (hyperpallium apicale),Embryonic day 14,perfect match,not documented,perfect match,,White leghorn,,9031,,,,,,Gga_HA,SAMN07839992,,,,,"PMID: 29456143, emailed stevendb@uchicago.edu to clarify developmental stage - all E14, SRA is incorrect",,,,10/10/2024,"RNA integrity was analyzed with a Bioanalyzer 2100; only samples with clean rRNA peaks and little to no degradation were used. Total RNA was polyA selected and directionally sequenced at the University of Chicago Genomics Facility on an Illumina HiSeq2000 per manufacturer's instructions, generating 100bp paired-end reads with an insert size of 300bp",,Gga_HA,,,,E18,
5,SRX3334820,SRP121892,Illumina HiSeq 2000,SRS2637226,UBERON:0007349,mesopallium,GgalDv:0000054,Hamburger Hamilton stage 40,,brain - anterior mesopallium,Embryonic day 14,missing child term,not documented,perfect match,,White leghorn,,9031,,,,,,Gga_mesopallium,SAMN07839993,,,,,"PMID: 29456143, emailed stevendb@uchicago.edu to clarify developmental stage - all E14, SRA is incorrect",,,,10/10/2024,"RNA integrity was analyzed with a Bioanalyzer 2100; only samples with clean rRNA peaks and little to no degradation were used. Total RNA was polyA selected and directionally sequenced at the University of Chicago Genomics Facility on an Illumina HiSeq2000 per manufacturer's instructions, generating 100bp paired-end reads with an insert size of 300bp",,Gga_mesopallium,,,,E19,
6,SRX3334819,SRP121892,Illumina HiSeq 2000,SRS2637225,UBERON:0007334,nidopallium,GgalDv:0000054,Hamburger Hamilton stage 40,,brain - posterior nidopallium,Embryonic day 14,missing child term,not documented,perfect match,,White leghorn,,9031,,,,,,Gga_Nidopallium,SAMN07839994,,,,,"PMID: 29456143, emailed stevendb@uchicago.edu to clarify developmental stage - all E14, SRA is incorrect",,,,10/10/2024,"RNA integrity was analyzed with a Bioanalyzer 2100; only samples with clean rRNA peaks and little to no degradation were used. Total RNA was polyA selected and directionally sequenced at the University of Chicago Genomics Facility on an Illumina HiSeq2000 per manufacturer's instructions, generating 100bp paired-end reads with an insert size of 300bp",,Gga_Nidopallium,,,,E20,


#### protocol
see [bulk kits](https://gitlab.sib.swiss/Bgee/scRNA-Seq/-/blob/main/scripts/bulk_kits.csv) for some common protocols

In [12]:
# making these variables because we use them again in the experiment file
#my_protocol = ''
# full_length or 3'
#my_protocolType = ''

#library.loc[:,'protocol'] = my_protocol
#library.loc[:,'protocolType'] = my_protocolType
# polyA, ribo-minus, miRNA, lncRNA, circRNA
library.loc[:,'RNASelection'] = 'polyA'

# view
display_df(library)

Unnamed: 0,#libraryId,experimentId,platform,SRSId,anatId,anatName,stageId,stageName,url_GSM,infoOrgan,infoStage,anatAnnotationStatus,anatBiologicalStatus,stageAnnotationStatus,sex,strain,genotype,speciesId,protocol,protocolType,RNASelection,globin_reduction,replicate,lib_name,sampleName,sampleAge_value,sampleAge_unit,PATOid,PATOname,comment,condition,physiologicalStatus,annotatorId,lastModificationDate,library_contruction_protocol,source_qc,lib_name_2,lib_name_3,source_name,individual,infoStage_2,infoStage_3
0,SRX3334825,SRP121892,Illumina HiSeq 2000,SRS2637233,UBERON:0007350,arcopallium,GgalDv:0000054,Hamburger Hamilton stage 40,,brain - arcopallium,Embryonic day 14,perfect match,not documented,perfect match,,White leghorn,,9031,,,polyA,,,Gga_arcopallium,SAMN07839988,,,,,"PMID: 29456143, emailed stevendb@uchicago.edu to clarify developmental stage - all E14, SRA is incorrect",,,,10/10/2024,"RNA integrity was analyzed with a Bioanalyzer 2100; only samples with clean rRNA peaks and little to no degradation were used. Total RNA was polyA selected and directionally sequenced at the University of Chicago Genomics Facility on an Illumina HiSeq2000 per manufacturer's instructions, generating 100bp paired-end reads with an insert size of 300bp",,Gga_arcopallium,,,,E14,
1,SRX3334824,SRP121892,Illumina HiSeq 2000,SRS2637232,UBERON:0000203,pallium,GgalDv:0000054,Hamburger Hamilton stage 40,,brain - basorostralis,Embryonic day 14,missing child term,not documented,perfect match,,White leghorn,,9031,,,polyA,,,Gga_basorostralis,SAMN07839989,,,,,"nucleus basorostralis of the pallium (PMID: 15116397), annotated as pallium since this doesn't appear to be in uberon and homologous structure is not known; PMID: 29456143, emailed stevendb@uchicago.edu to clarify developmental stage - all E14, SRA is incorrect",,,,10/10/2024,"RNA integrity was analyzed with a Bioanalyzer 2100; only samples with clean rRNA peaks and little to no degradation were used. Total RNA was polyA selected and directionally sequenced at the University of Chicago Genomics Facility on an Illumina HiSeq2000 per manufacturer's instructions, generating 100bp paired-end reads with an insert size of 300bp",,Gga_basorostralis,,,,E15,
2,SRX3334823,SRP121892,Illumina HiSeq 2000,SRS2637229,UBERON:0014759,entopallium,GgalDv:0000054,Hamburger Hamilton stage 40,,brain - entopallium,Embryonic day 14,perfect match,not documented,perfect match,,White leghorn,,9031,,,polyA,,,Gga_entopallium,SAMN07839990,,,,,"PMID: 29456143, emailed stevendb@uchicago.edu to clarify developmental stage - all E14, SRA is incorrect",,,,10/10/2024,"RNA integrity was analyzed with a Bioanalyzer 2100; only samples with clean rRNA peaks and little to no degradation were used. Total RNA was polyA selected and directionally sequenced at the University of Chicago Genomics Facility on an Illumina HiSeq2000 per manufacturer's instructions, generating 100bp paired-end reads with an insert size of 300bp",,Gga_entopallium,,,,E16,
3,SRX3334822,SRP121892,Illumina HiSeq 2000,SRS2637228,UBERON:0007334,nidopallium,GgalDv:0000054,Hamburger Hamilton stage 40,,brain - Field L,Embryonic day 14,missing child term,not documented,perfect match,,White leghorn,,9031,,,polyA,,,Gga_FieldL,SAMN07839991,,,,,"Field L is the nidopallial region containing the primary auditory thalamorecipient zone (PMID: 15116397), annotated as nidopallium since this doesn't appear to be in uberon and homologous structure is not known; PMID: 29456143, emailed stevendb@uchicago.edu to clarify developmental stage - all E14, SRA is incorrect",,,,10/10/2024,"RNA integrity was analyzed with a Bioanalyzer 2100; only samples with clean rRNA peaks and little to no degradation were used. Total RNA was polyA selected and directionally sequenced at the University of Chicago Genomics Facility on an Illumina HiSeq2000 per manufacturer's instructions, generating 100bp paired-end reads with an insert size of 300bp",,Gga_FieldL,,,,E17,
4,SRX3334821,SRP121892,Illumina HiSeq 2000,SRS2637227,UBERON:0014757,hyperpallium apicale,GgalDv:0000054,Hamburger Hamilton stage 40,,brain - HA (hyperpallium apicale),Embryonic day 14,perfect match,not documented,perfect match,,White leghorn,,9031,,,polyA,,,Gga_HA,SAMN07839992,,,,,"PMID: 29456143, emailed stevendb@uchicago.edu to clarify developmental stage - all E14, SRA is incorrect",,,,10/10/2024,"RNA integrity was analyzed with a Bioanalyzer 2100; only samples with clean rRNA peaks and little to no degradation were used. Total RNA was polyA selected and directionally sequenced at the University of Chicago Genomics Facility on an Illumina HiSeq2000 per manufacturer's instructions, generating 100bp paired-end reads with an insert size of 300bp",,Gga_HA,,,,E18,
5,SRX3334820,SRP121892,Illumina HiSeq 2000,SRS2637226,UBERON:0007349,mesopallium,GgalDv:0000054,Hamburger Hamilton stage 40,,brain - anterior mesopallium,Embryonic day 14,missing child term,not documented,perfect match,,White leghorn,,9031,,,polyA,,,Gga_mesopallium,SAMN07839993,,,,,"PMID: 29456143, emailed stevendb@uchicago.edu to clarify developmental stage - all E14, SRA is incorrect",,,,10/10/2024,"RNA integrity was analyzed with a Bioanalyzer 2100; only samples with clean rRNA peaks and little to no degradation were used. Total RNA was polyA selected and directionally sequenced at the University of Chicago Genomics Facility on an Illumina HiSeq2000 per manufacturer's instructions, generating 100bp paired-end reads with an insert size of 300bp",,Gga_mesopallium,,,,E19,
6,SRX3334819,SRP121892,Illumina HiSeq 2000,SRS2637225,UBERON:0007334,nidopallium,GgalDv:0000054,Hamburger Hamilton stage 40,,brain - posterior nidopallium,Embryonic day 14,missing child term,not documented,perfect match,,White leghorn,,9031,,,polyA,,,Gga_Nidopallium,SAMN07839994,,,,,"PMID: 29456143, emailed stevendb@uchicago.edu to clarify developmental stage - all E14, SRA is incorrect",,,,10/10/2024,"RNA integrity was analyzed with a Bioanalyzer 2100; only samples with clean rRNA peaks and little to no degradation were used. Total RNA was polyA selected and directionally sequenced at the University of Chicago Genomics Facility on an Illumina HiSeq2000 per manufacturer's instructions, generating 100bp paired-end reads with an insert size of 300bp",,Gga_Nidopallium,,,,E20,
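
The allowed `RNASelection` values are listed in the comment of the protocol cell. A minimal sketch of a check against that list, assuming the rows are available as plain dicts (for example via `library.to_dict("records")`). `ALLOWED_RNA_SELECTION`, `invalid_rna_selection` and the library IDs in the example are hypothetical and not part of the annotation scripts:

```python
# Hypothetical helper, not part of the annotation scripts.
# Allowed values copied from the comment in the protocol cell.
ALLOWED_RNA_SELECTION = {"polyA", "ribo-minus", "miRNA", "lncRNA", "circRNA"}


def invalid_rna_selection(rows):
    """Return (#libraryId, value) pairs whose RNASelection is not an allowed value."""
    return [
        (row["#libraryId"], row["RNASelection"])
        for row in rows
        if row["RNASelection"] not in ALLOWED_RNA_SELECTION
    ]


# Illustrative rows only; the second one carries a deliberately wrong value.
rna_example_rows = [
    {"#libraryId": "LIB_OK", "RNASelection": "polyA"},
    {"#libraryId": "LIB_BAD", "RNASelection": "polya"},
]
bad_rna_selection = invalid_rna_selection(rna_example_rows)
print(bad_rna_selection)
```

The comparison is case-sensitive on purpose, so a lowercase `polya` is reported rather than silently accepted.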


#### globin, replicates

In [13]:
# check for duplicate SRSId values
dup_check(library, "SRSId")

no duplicate values in SRSId
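
`dup_check` is defined elsewhere in the notebook and its implementation is not shown here. A rough sketch of an equivalent check on plain dict rows, assuming only that the goal is to list values that occur more than once in a column. `find_duplicates` and the repeated value in the example are hypothetical:

```python
from collections import Counter


def find_duplicates(rows, column):
    """Return the sorted values that appear more than once in the given column."""
    counts = Counter(row[column] for row in rows)
    return sorted(value for value, n in counts.items() if n > 1)


# Illustrative rows only; SRS_B is repeated on purpose.
srs_example_rows = [
    {"SRSId": "SRS_A"},
    {"SRSId": "SRS_B"},
    {"SRSId": "SRS_B"},
]
duplicate_srs = find_duplicates(srs_example_rows, "SRSId")
print(duplicate_srs or "no duplicate values in SRSId")
```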


In [None]:
#library.loc[:,'globin_reduction'] = 'Y'

# replicates
#library.loc[library["#libraryId"] == "old", "replicate"] = "1"
#library.loc[library["#libraryId"].isin(["one", "two"]), "replicate"] = "1"

# view
display_df(library)

#### sample age, pato, physiological status

In [14]:
library.loc[:,'sampleAge_value'] = '14'
library.loc[:,'sampleAge_unit'] = 'embryonic day'

# ex. castrated male
#library.loc[:,'PATOid'] = ''
#library.loc[:,'PATOname'] = ''

# ex. castrated, pregnant, pre-smoltification, post-smoltification, laying eggs
#library.loc[:,'physiologicalStatus'] = ''

# view
display_df(library)

Unnamed: 0,#libraryId,experimentId,platform,SRSId,anatId,anatName,stageId,stageName,url_GSM,infoOrgan,infoStage,anatAnnotationStatus,anatBiologicalStatus,stageAnnotationStatus,sex,strain,genotype,speciesId,protocol,protocolType,RNASelection,globin_reduction,replicate,lib_name,sampleName,sampleAge_value,sampleAge_unit,PATOid,PATOname,comment,condition,physiologicalStatus,annotatorId,lastModificationDate,library_contruction_protocol,source_qc,lib_name_2,lib_name_3,source_name,individual,infoStage_2,infoStage_3
0,SRX3334825,SRP121892,Illumina HiSeq 2000,SRS2637233,UBERON:0007350,arcopallium,GgalDv:0000054,Hamburger Hamilton stage 40,,brain - arcopallium,Embryonic day 14,perfect match,not documented,perfect match,,White leghorn,,9031,,,polyA,,,Gga_arcopallium,SAMN07839988,14,embryonic day,,,"PMID: 29456143, emailed stevendb@uchicago.edu to clarify developmental stage - all E14, SRA is incorrect",,,,10/10/2024,"RNA integrity was analyzed with a Bioanalyzer 2100; only samples with clean rRNA peaks and little to no degradation were used. Total RNA was polyA selected and directionally sequenced at the University of Chicago Genomics Facility on an Illumina HiSeq2000 per manufacturer's instructions, generating 100bp paired-end reads with an insert size of 300bp",,Gga_arcopallium,,,,E14,
1,SRX3334824,SRP121892,Illumina HiSeq 2000,SRS2637232,UBERON:0000203,pallium,GgalDv:0000054,Hamburger Hamilton stage 40,,brain - basorostralis,Embryonic day 14,missing child term,not documented,perfect match,,White leghorn,,9031,,,polyA,,,Gga_basorostralis,SAMN07839989,14,embryonic day,,,"nucleus basorostralis of the pallium (PMID: 15116397), annotated as pallium since this doesn't appear to be in uberon and homologous structure is not known; PMID: 29456143, emailed stevendb@uchicago.edu to clarify developmental stage - all E14, SRA is incorrect",,,,10/10/2024,"RNA integrity was analyzed with a Bioanalyzer 2100; only samples with clean rRNA peaks and little to no degradation were used. Total RNA was polyA selected and directionally sequenced at the University of Chicago Genomics Facility on an Illumina HiSeq2000 per manufacturer's instructions, generating 100bp paired-end reads with an insert size of 300bp",,Gga_basorostralis,,,,E15,
2,SRX3334823,SRP121892,Illumina HiSeq 2000,SRS2637229,UBERON:0014759,entopallium,GgalDv:0000054,Hamburger Hamilton stage 40,,brain - entopallium,Embryonic day 14,perfect match,not documented,perfect match,,White leghorn,,9031,,,polyA,,,Gga_entopallium,SAMN07839990,14,embryonic day,,,"PMID: 29456143, emailed stevendb@uchicago.edu to clarify developmental stage - all E14, SRA is incorrect",,,,10/10/2024,"RNA integrity was analyzed with a Bioanalyzer 2100; only samples with clean rRNA peaks and little to no degradation were used. Total RNA was polyA selected and directionally sequenced at the University of Chicago Genomics Facility on an Illumina HiSeq2000 per manufacturer's instructions, generating 100bp paired-end reads with an insert size of 300bp",,Gga_entopallium,,,,E16,
3,SRX3334822,SRP121892,Illumina HiSeq 2000,SRS2637228,UBERON:0007334,nidopallium,GgalDv:0000054,Hamburger Hamilton stage 40,,brain - Field L,Embryonic day 14,missing child term,not documented,perfect match,,White leghorn,,9031,,,polyA,,,Gga_FieldL,SAMN07839991,14,embryonic day,,,"Field L is the nidopallial region containing the primary auditory thalamorecipient zone (PMID: 15116397), annotated as nidopallium since this doesn't appear to be in uberon and homologous structure is not known; PMID: 29456143, emailed stevendb@uchicago.edu to clarify developmental stage - all E14, SRA is incorrect",,,,10/10/2024,"RNA integrity was analyzed with a Bioanalyzer 2100; only samples with clean rRNA peaks and little to no degradation were used. Total RNA was polyA selected and directionally sequenced at the University of Chicago Genomics Facility on an Illumina HiSeq2000 per manufacturer's instructions, generating 100bp paired-end reads with an insert size of 300bp",,Gga_FieldL,,,,E17,
4,SRX3334821,SRP121892,Illumina HiSeq 2000,SRS2637227,UBERON:0014757,hyperpallium apicale,GgalDv:0000054,Hamburger Hamilton stage 40,,brain - HA (hyperpallium apicale),Embryonic day 14,perfect match,not documented,perfect match,,White leghorn,,9031,,,polyA,,,Gga_HA,SAMN07839992,14,embryonic day,,,"PMID: 29456143, emailed stevendb@uchicago.edu to clarify developmental stage - all E14, SRA is incorrect",,,,10/10/2024,"RNA integrity was analyzed with a Bioanalyzer 2100; only samples with clean rRNA peaks and little to no degradation were used. Total RNA was polyA selected and directionally sequenced at the University of Chicago Genomics Facility on an Illumina HiSeq2000 per manufacturer's instructions, generating 100bp paired-end reads with an insert size of 300bp",,Gga_HA,,,,E18,
5,SRX3334820,SRP121892,Illumina HiSeq 2000,SRS2637226,UBERON:0007349,mesopallium,GgalDv:0000054,Hamburger Hamilton stage 40,,brain - anterior mesopallium,Embryonic day 14,missing child term,not documented,perfect match,,White leghorn,,9031,,,polyA,,,Gga_mesopallium,SAMN07839993,14,embryonic day,,,"PMID: 29456143, emailed stevendb@uchicago.edu to clarify developmental stage - all E14, SRA is incorrect",,,,10/10/2024,"RNA integrity was analyzed with a Bioanalyzer 2100; only samples with clean rRNA peaks and little to no degradation were used. Total RNA was polyA selected and directionally sequenced at the University of Chicago Genomics Facility on an Illumina HiSeq2000 per manufacturer's instructions, generating 100bp paired-end reads with an insert size of 300bp",,Gga_mesopallium,,,,E19,
6,SRX3334819,SRP121892,Illumina HiSeq 2000,SRS2637225,UBERON:0007334,nidopallium,GgalDv:0000054,Hamburger Hamilton stage 40,,brain - posterior nidopallium,Embryonic day 14,missing child term,not documented,perfect match,,White leghorn,,9031,,,polyA,,,Gga_Nidopallium,SAMN07839994,14,embryonic day,,,"PMID: 29456143, emailed stevendb@uchicago.edu to clarify developmental stage - all E14, SRA is incorrect",,,,10/10/2024,"RNA integrity was analyzed with a Bioanalyzer 2100; only samples with clean rRNA peaks and little to no degradation were used. Total RNA was polyA selected and directionally sequenced at the University of Chicago Genomics Facility on an Illumina HiSeq2000 per manufacturer's instructions, generating 100bp paired-end reads with an insert size of 300bp",,Gga_Nidopallium,,,,E20,
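
Because `sampleAge_value` is typed in by hand, it can drift from the day stated in `infoStage`. A small sketch of a consistency check, assuming `infoStage` contains the day as its first number (as in "Embryonic day 14") and that both fields are strings. `stage_age_mismatches` and the example rows are hypothetical:

```python
import re


def stage_age_mismatches(rows):
    """Return #libraryId values whose infoStage day differs from sampleAge_value."""
    mismatched = []
    for row in rows:
        day = re.search(r"\d+", row["infoStage"])  # first number in the free text
        if day is None or day.group() != row["sampleAge_value"]:
            mismatched.append(row["#libraryId"])
    return mismatched


# Illustrative rows only; the second one is deliberately inconsistent.
age_example_rows = [
    {"#libraryId": "LIB_OK", "infoStage": "Embryonic day 14", "sampleAge_value": "14"},
    {"#libraryId": "LIB_BAD", "infoStage": "Embryonic day 14", "sampleAge_value": "15"},
]
age_mismatches = stage_age_mismatches(age_example_rows)
print(age_mismatches)
```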


#### condition

In [None]:
# ex. control, diet, light, reproductive capacity, time post mortem, time post feeding, 
# exercise details, menstruation, personality, litter size 
#library.loc[library["condition"] == "old", "condition"] = "new"

# view
display_df(library)

#### annotator id, last modification date

In [15]:
library.loc[:,'annotatorId'] = 'SAC'
library.loc[:,'lastModificationDate'] = '2024-10-14'

# view
display_df(library)

Unnamed: 0,#libraryId,experimentId,platform,SRSId,anatId,anatName,stageId,stageName,url_GSM,infoOrgan,infoStage,anatAnnotationStatus,anatBiologicalStatus,stageAnnotationStatus,sex,strain,genotype,speciesId,protocol,protocolType,RNASelection,globin_reduction,replicate,lib_name,sampleName,sampleAge_value,sampleAge_unit,PATOid,PATOname,comment,condition,physiologicalStatus,annotatorId,lastModificationDate,library_contruction_protocol,source_qc,lib_name_2,lib_name_3,source_name,individual,infoStage_2,infoStage_3
0,SRX3334825,SRP121892,Illumina HiSeq 2000,SRS2637233,UBERON:0007350,arcopallium,GgalDv:0000054,Hamburger Hamilton stage 40,,brain - arcopallium,Embryonic day 14,perfect match,not documented,perfect match,,White leghorn,,9031,,,polyA,,,Gga_arcopallium,SAMN07839988,14,embryonic day,,,"PMID: 29456143, emailed stevendb@uchicago.edu to clarify developmental stage - all E14, SRA is incorrect",,,SAC,2024-10-14,"RNA integrity was analyzed with a Bioanalyzer 2100; only samples with clean rRNA peaks and little to no degradation were used. Total RNA was polyA selected and directionally sequenced at the University of Chicago Genomics Facility on an Illumina HiSeq2000 per manufacturer's instructions, generating 100bp paired-end reads with an insert size of 300bp",,Gga_arcopallium,,,,E14,
1,SRX3334824,SRP121892,Illumina HiSeq 2000,SRS2637232,UBERON:0000203,pallium,GgalDv:0000054,Hamburger Hamilton stage 40,,brain - basorostralis,Embryonic day 14,missing child term,not documented,perfect match,,White leghorn,,9031,,,polyA,,,Gga_basorostralis,SAMN07839989,14,embryonic day,,,"nucleus basorostralis of the pallium (PMID: 15116397), annotated as pallium since this doesn't appear to be in uberon and homologous structure is not known; PMID: 29456143, emailed stevendb@uchicago.edu to clarify developmental stage - all E14, SRA is incorrect",,,SAC,2024-10-14,"RNA integrity was analyzed with a Bioanalyzer 2100; only samples with clean rRNA peaks and little to no degradation were used. Total RNA was polyA selected and directionally sequenced at the University of Chicago Genomics Facility on an Illumina HiSeq2000 per manufacturer's instructions, generating 100bp paired-end reads with an insert size of 300bp",,Gga_basorostralis,,,,E15,
2,SRX3334823,SRP121892,Illumina HiSeq 2000,SRS2637229,UBERON:0014759,entopallium,GgalDv:0000054,Hamburger Hamilton stage 40,,brain - entopallium,Embryonic day 14,perfect match,not documented,perfect match,,White leghorn,,9031,,,polyA,,,Gga_entopallium,SAMN07839990,14,embryonic day,,,"PMID: 29456143, emailed stevendb@uchicago.edu to clarify developmental stage - all E14, SRA is incorrect",,,SAC,2024-10-14,"RNA integrity was analyzed with a Bioanalyzer 2100; only samples with clean rRNA peaks and little to no degradation were used. Total RNA was polyA selected and directionally sequenced at the University of Chicago Genomics Facility on an Illumina HiSeq2000 per manufacturer's instructions, generating 100bp paired-end reads with an insert size of 300bp",,Gga_entopallium,,,,E16,
3,SRX3334822,SRP121892,Illumina HiSeq 2000,SRS2637228,UBERON:0007334,nidopallium,GgalDv:0000054,Hamburger Hamilton stage 40,,brain - Field L,Embryonic day 14,missing child term,not documented,perfect match,,White leghorn,,9031,,,polyA,,,Gga_FieldL,SAMN07839991,14,embryonic day,,,"Field L is the nidopallial region containing the primary auditory thalamorecipient zone (PMID: 15116397), annotated as nidopallium since this doesn't appear to be in uberon and homologous structure is not known; PMID: 29456143, emailed stevendb@uchicago.edu to clarify developmental stage - all E14, SRA is incorrect",,,SAC,2024-10-14,"RNA integrity was analyzed with a Bioanalyzer 2100; only samples with clean rRNA peaks and little to no degradation were used. Total RNA was polyA selected and directionally sequenced at the University of Chicago Genomics Facility on an Illumina HiSeq2000 per manufacturer's instructions, generating 100bp paired-end reads with an insert size of 300bp",,Gga_FieldL,,,,E17,
4,SRX3334821,SRP121892,Illumina HiSeq 2000,SRS2637227,UBERON:0014757,hyperpallium apicale,GgalDv:0000054,Hamburger Hamilton stage 40,,brain - HA (hyperpallium apicale),Embryonic day 14,perfect match,not documented,perfect match,,White leghorn,,9031,,,polyA,,,Gga_HA,SAMN07839992,14,embryonic day,,,"PMID: 29456143, emailed stevendb@uchicago.edu to clarify developmental stage - all E14, SRA is incorrect",,,SAC,2024-10-14,"RNA integrity was analyzed with a Bioanalyzer 2100; only samples with clean rRNA peaks and little to no degradation were used. Total RNA was polyA selected and directionally sequenced at the University of Chicago Genomics Facility on an Illumina HiSeq2000 per manufacturer's instructions, generating 100bp paired-end reads with an insert size of 300bp",,Gga_HA,,,,E18,
5,SRX3334820,SRP121892,Illumina HiSeq 2000,SRS2637226,UBERON:0007349,mesopallium,GgalDv:0000054,Hamburger Hamilton stage 40,,brain - anterior mesopallium,Embryonic day 14,missing child term,not documented,perfect match,,White leghorn,,9031,,,polyA,,,Gga_mesopallium,SAMN07839993,14,embryonic day,,,"PMID: 29456143, emailed stevendb@uchicago.edu to clarify developmental stage - all E14, SRA is incorrect",,,SAC,2024-10-14,"RNA integrity was analyzed with a Bioanalyzer 2100; only samples with clean rRNA peaks and little to no degradation were used. Total RNA was polyA selected and directionally sequenced at the University of Chicago Genomics Facility on an Illumina HiSeq2000 per manufacturer's instructions, generating 100bp paired-end reads with an insert size of 300bp",,Gga_mesopallium,,,,E19,
6,SRX3334819,SRP121892,Illumina HiSeq 2000,SRS2637225,UBERON:0007334,nidopallium,GgalDv:0000054,Hamburger Hamilton stage 40,,brain - posterior nidopallium,Embryonic day 14,missing child term,not documented,perfect match,,White leghorn,,9031,,,polyA,,,Gga_Nidopallium,SAMN07839994,14,embryonic day,,,"PMID: 29456143, emailed stevendb@uchicago.edu to clarify developmental stage - all E14, SRA is incorrect",,,SAC,2024-10-14,"RNA integrity was analyzed with a Bioanalyzer 2100; only samples with clean rRNA peaks and little to no degradation were used. Total RNA was polyA selected and directionally sequenced at the University of Chicago Genomics Facility on an Illumina HiSeq2000 per manufacturer's instructions, generating 100bp paired-end reads with an insert size of 300bp",,Gga_Nidopallium,,,,E20,
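
The `lastModificationDate` column held `10/10/2024` before this cell and `2024-10-14` after it. A minimal sketch for checking that a value is in `YYYY-MM-DD` form, assuming that is the intended format. `is_iso_date` is a hypothetical helper:

```python
from datetime import datetime


def is_iso_date(value):
    """Return True only for zero-padded YYYY-MM-DD strings that are real dates."""
    try:
        parsed = datetime.strptime(value, "%Y-%m-%d")
    except ValueError:
        return False
    # Round-trip to reject unpadded forms such as "2024-1-4".
    return parsed.strftime("%Y-%m-%d") == value


print(is_iso_date("2024-10-14"))
print(is_iso_date("10/10/2024"))
```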


#### comments

In [None]:
#library.loc[:,'comment'] = 'PMID: 29456143, emailed stevendb@uchicago.edu to clarify developmental stage - all E14, SRA is incorrect'

#### save complete file with correct columns

In [16]:
library_file_complete = library[library_cols]
library_file_complete.to_csv(library_to_add_path, sep="\t", index=False, quoting=csv.QUOTE_ALL)

# view
display_df(library_file_complete)

Unnamed: 0,#libraryId,experimentId,platform,SRSId,anatId,anatName,stageId,stageName,url_GSM,infoOrgan,infoStage,anatAnnotationStatus,anatBiologicalStatus,stageAnnotationStatus,sex,strain,genotype,speciesId,protocol,protocolType,RNASelection,globin_reduction,replicate,lib_name,sampleName,sampleAge_value,sampleAge_unit,PATOid,PATOname,comment,condition,physiologicalStatus,annotatorId,lastModificationDate
0,SRX3334825,SRP121892,Illumina HiSeq 2000,SRS2637233,UBERON:0007350,arcopallium,GgalDv:0000054,Hamburger Hamilton stage 40,,brain - arcopallium,Embryonic day 14,perfect match,not documented,perfect match,,White leghorn,,9031,,,polyA,,,Gga_arcopallium,SAMN07839988,14,embryonic day,,,"PMID: 29456143, emailed stevendb@uchicago.edu to clarify developmental stage - all E14, SRA is incorrect",,,SAC,2024-10-14
1,SRX3334824,SRP121892,Illumina HiSeq 2000,SRS2637232,UBERON:0000203,pallium,GgalDv:0000054,Hamburger Hamilton stage 40,,brain - basorostralis,Embryonic day 14,missing child term,not documented,perfect match,,White leghorn,,9031,,,polyA,,,Gga_basorostralis,SAMN07839989,14,embryonic day,,,"nucleus basorostralis of the pallium (PMID: 15116397), annotated as pallium since this doesn't appear to be in uberon and homologous structure is not known; PMID: 29456143, emailed stevendb@uchicago.edu to clarify developmental stage - all E14, SRA is incorrect",,,SAC,2024-10-14
2,SRX3334823,SRP121892,Illumina HiSeq 2000,SRS2637229,UBERON:0014759,entopallium,GgalDv:0000054,Hamburger Hamilton stage 40,,brain - entopallium,Embryonic day 14,perfect match,not documented,perfect match,,White leghorn,,9031,,,polyA,,,Gga_entopallium,SAMN07839990,14,embryonic day,,,"PMID: 29456143, emailed stevendb@uchicago.edu to clarify developmental stage - all E14, SRA is incorrect",,,SAC,2024-10-14
3,SRX3334822,SRP121892,Illumina HiSeq 2000,SRS2637228,UBERON:0007334,nidopallium,GgalDv:0000054,Hamburger Hamilton stage 40,,brain - Field L,Embryonic day 14,missing child term,not documented,perfect match,,White leghorn,,9031,,,polyA,,,Gga_FieldL,SAMN07839991,14,embryonic day,,,"Field L is the nidopallial region containing the primary auditory thalamorecipient zone (PMID: 15116397), annotated as nidopallium since this doesn't appear to be in uberon and homologous structure is not known; PMID: 29456143, emailed stevendb@uchicago.edu to clarify developmental stage - all E14, SRA is incorrect",,,SAC,2024-10-14
4,SRX3334821,SRP121892,Illumina HiSeq 2000,SRS2637227,UBERON:0014757,hyperpallium apicale,GgalDv:0000054,Hamburger Hamilton stage 40,,brain - HA (hyperpallium apicale),Embryonic day 14,perfect match,not documented,perfect match,,White leghorn,,9031,,,polyA,,,Gga_HA,SAMN07839992,14,embryonic day,,,"PMID: 29456143, emailed stevendb@uchicago.edu to clarify developmental stage - all E14, SRA is incorrect",,,SAC,2024-10-14
5,SRX3334820,SRP121892,Illumina HiSeq 2000,SRS2637226,UBERON:0007349,mesopallium,GgalDv:0000054,Hamburger Hamilton stage 40,,brain - anterior mesopallium,Embryonic day 14,missing child term,not documented,perfect match,,White leghorn,,9031,,,polyA,,,Gga_mesopallium,SAMN07839993,14,embryonic day,,,"PMID: 29456143, emailed stevendb@uchicago.edu to clarify developmental stage - all E14, SRA is incorrect",,,SAC,2024-10-14
6,SRX3334819,SRP121892,Illumina HiSeq 2000,SRS2637225,UBERON:0007334,nidopallium,GgalDv:0000054,Hamburger Hamilton stage 40,,brain - posterior nidopallium,Embryonic day 14,missing child term,not documented,perfect match,,White leghorn,,9031,,,polyA,,,Gga_Nidopallium,SAMN07839994,14,embryonic day,,,"PMID: 29456143, emailed stevendb@uchicago.edu to clarify developmental stage - all E14, SRA is incorrect",,,SAC,2024-10-14
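
The `comment` field contains commas and semicolons, so it is worth confirming that a tab-separated file with every field quoted reads back unchanged. A sketch of that round trip using only the standard `csv` module instead of `DataFrame.to_csv`, with an in-memory buffer in place of `library_to_add_path`. The single row is a shortened illustration, not the full annotation:

```python
import csv
import io

columns = ["#libraryId", "anatName", "comment"]
roundtrip_rows = [
    {
        "#libraryId": "SRX3334824",
        "anatName": "pallium",
        "comment": "nucleus basorostralis of the pallium (PMID: 15116397); all E14, SRA is incorrect",
    },
]

# Write with the same separator and quoting as the save cell.
buffer = io.StringIO()
writer = csv.DictWriter(
    buffer,
    fieldnames=columns,
    delimiter="\t",
    quoting=csv.QUOTE_ALL,
    lineterminator="\n",
)
writer.writeheader()
writer.writerows(roundtrip_rows)

# Read the text back and compare it with what was written.
buffer.seek(0)
read_back = [dict(row) for row in csv.DictReader(buffer, delimiter="\t")]
print(read_back == roundtrip_rows)
```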


### experiment annotations

In [17]:
experiment = pd.read_csv(experiment_path_from_script, sep='\t', index_col=False, keep_default_na=False, na_values=['NULL','null', 'nan','NaN'], dtype=object)
display_df(experiment)

Unnamed: 0,#experimentId,experimentName,experimentDescription,experimentSource,experimentStatus,projectTags,numberOfAnnotatedLibraries,protocol,protocolType,GSE,Bioproject,PMID,reference_url,DOI,xrefs,comment
0,SRP121892,Chicken telencephalon RNAseq,Embryonic chicken telencephalon nuclei were isolated for RNAseq to identify transcripts differentially expressed across different brain regions.,SRA,,,,,,,PRJNA416004,,,"Error: Unable to retrieve data, Status code 404",,


#### experiment and protocol details

In [18]:
# number of rows in the complete library file,
# which should equal the number of annotated libraries
ann_lib = len(library_file_complete.index)
ann_lib

7

In [19]:
# partial or total
experiment.loc[:,'experimentStatus'] = 'total'
#experiment.loc[:,'projectTags'] = '' 
# see above cell, also can add as free text
experiment.loc[:,'numberOfAnnotatedLibraries'] = ann_lib

# these variables should already exist from the protocol cell; if not, enter the values as free text
#experiment.loc[:,'protocol'] = my_protocol
#experiment.loc[:,'protocolType'] = my_protocolType

display_df(experiment)

Unnamed: 0,#experimentId,experimentName,experimentDescription,experimentSource,experimentStatus,projectTags,numberOfAnnotatedLibraries,protocol,protocolType,GSE,Bioproject,PMID,reference_url,DOI,xrefs,comment
0,SRP121892,Chicken telencephalon RNAseq,Embryonic chicken telencephalon nuclei were isolated for RNAseq to identify transcripts differentially expressed across different brain regions.,SRA,total,,7,,,,PRJNA416004,,,"Error: Unable to retrieve data, Status code 404",,


#### paper and xrefs

In [20]:
#experiment.loc[:,'GSE'] = ''
#experiment.loc[:,'Bioproject'] = '' 
experiment.loc[:,'PMID'] = '29456143'
experiment.loc[:,'reference_url'] = 'https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11098552/'
experiment.loc[:,'DOI'] = '10.1016/j.cub.2018.01.036'
#experiment.loc[:,'xrefs'] = ''

display_df(experiment)

Unnamed: 0,#experimentId,experimentName,experimentDescription,experimentSource,experimentStatus,projectTags,numberOfAnnotatedLibraries,protocol,protocolType,GSE,Bioproject,PMID,reference_url,DOI,xrefs,comment
0,SRP121892,Chicken telencephalon RNAseq,Embryonic chicken telencephalon nuclei were isolated for RNAseq to identify transcripts differentially expressed across different brain regions.,SRA,total,,7,,,,PRJNA416004,29456143,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11098552/,10.1016/j.cub.2018.01.036,,
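
The `DOI` column initially held a retrieval error message rather than an identifier, so a quick format check before saving may help. A sketch, assuming a PMID is all digits and using a loose heuristic pattern for DOIs (`10.`, a numeric registrant code, a slash, then a suffix). `reference_problems` and both patterns are hypothetical, and the check says nothing about whether the identifiers resolve:

```python
import re

PMID_PATTERN = re.compile(r"^\d+$")
# Loose heuristic, not a full DOI grammar.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")


def reference_problems(pmid, doi):
    """Return the names of the fields that do not look like valid identifiers."""
    problems = []
    if not PMID_PATTERN.match(pmid):
        problems.append("PMID")
    if not DOI_PATTERN.match(doi):
        problems.append("DOI")
    return problems


print(reference_problems("29456143", "10.1016/j.cub.2018.01.036"))
print(reference_problems("29456143", "Error: Unable to retrieve data, Status code 404"))
```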


#### comments

In [21]:
experiment.loc[:,'comment'] = 'emailed stevendb@uchicago.edu to clarify developmental stage - all E14, SRA is incorrect'

display_df(experiment)

Unnamed: 0,#experimentId,experimentName,experimentDescription,experimentSource,experimentStatus,projectTags,numberOfAnnotatedLibraries,protocol,protocolType,GSE,Bioproject,PMID,reference_url,DOI,xrefs,comment
0,SRP121892,Chicken telencephalon RNAseq,Embryonic chicken telencephalon nuclei were isolated for RNAseq to identify transcripts differentially expressed across different brain regions.,SRA,total,,7,,,,PRJNA416004,29456143,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11098552/,10.1016/j.cub.2018.01.036,,"emailed stevendb@uchicago.edu to clarify developmental stage - all E14, SRA is incorrect"


#### save complete file

In [22]:
experiment.to_csv(experiment_to_add_path, sep="\t", index=False, quoting=csv.QUOTE_ALL)

### added to git separately
This annotation was used as a demo example with Anne; QA and the addition to git were done together and were not documented.