Compiling and comparing databases (GEO_based) and validating the metadata
Requires Python 3 (> 3.0); create an environment with pandas installed.
This script receives four dataframes:
- GEO (e.g. GEO_metadata_2023_91930_stand.csv)
- NGS-QC (e.g. NGS_HS_ChipSeq_nodup_33233.csv)
- ChIP-Atlas (e.g. CA_hg38_Hs_GSM_GSE_2022_antigenclass_2023_01_24.csv)
- CistromeDB (e.g. Cistrome_filter_human_noENC.csv)
The script returns a dataframe (.csv file) containing all metadata merged from these four databases.
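The merge step can be pictured with a minimal pandas sketch. The column names (`GSM`, `GEO_target`, `NGS_target`, `CA_target`, `C_target`), the toy rows, and the choice of a left join onto GEO are assumptions made for illustration; the actual keys and join type used by the script are not stated here.

```python
from functools import reduce

import pandas as pd

# Hypothetical toy tables; the real column names are not given in this README.
geo = pd.DataFrame({"GSM": ["GSM1", "GSM2", "GSM3"],
                    "GEO_target": ["H3K4me3", "input", "H3K27ac"]})
ngs = pd.DataFrame({"GSM": ["GSM1", "GSM3"],
                    "NGS_target": ["h3k4me3", "h3k27ac"]})
ca = pd.DataFrame({"GSM": ["GSM1", "GSM2"],
                   "CA_target": ["H3K4me3", "Input control"]})
cistrome = pd.DataFrame({"GSM": ["GSM3"], "C_target": ["H3K27ac"]})

# Assumption: GEO is the reference table, so every other database is
# left-joined onto it by sample accession (GSM).
merged = reduce(
    lambda left, right: left.merge(right, on="GSM", how="left"),
    [geo, ngs, ca, cistrome],
)

# Samples missing from a database simply get NaN in that database's columns.
merged.to_csv("merged_metadata_example.csv", index=False)
```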
It also returns a filtered dataframe with the samples (GSM) associated with the IHEC histones of interest (h3k4me3, h3k4me1, h3k27me3, h3k27ac, h3k9me3, h3k36me3) and the input samples belonging to the same experiment (GSE).
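One way to express this filter in pandas is sketched below. The column names `GSE` and `target_stand`, the label `input`, and the toy rows are hypothetical; only the list of six histone marks comes from the text above.

```python
import pandas as pd

# The six IHEC histone marks listed above.
IHEC_HISTONES = {"h3k4me3", "h3k4me1", "h3k27me3", "h3k27ac", "h3k9me3", "h3k36me3"}

# Hypothetical merged table; column names are assumptions.
df = pd.DataFrame({
    "GSM": ["GSM1", "GSM2", "GSM3", "GSM4", "GSM5"],
    "GSE": ["GSE1", "GSE1", "GSE2", "GSE2", "GSE3"],
    "target_stand": ["h3k4me3", "input", "ctcf", "input", "h3k27ac"],
})

# Rows whose target is one of the histones of interest.
is_histone = df["target_stand"].isin(IHEC_HISTONES)

# Experiments (GSE) that contain at least one such histone sample.
gse_with_histone = set(df.loc[is_histone, "GSE"])

# Inputs are kept only when they belong to one of those experiments.
is_matched_input = (df["target_stand"] == "input") & df["GSE"].isin(gse_with_histone)

hist_inp = df[is_histone | is_matched_input].reset_index(drop=True)
```

In this toy table the input GSM4 is dropped because its experiment (GSE2) has no histone sample of interest.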
In addition, four files are generated, one per database (df_hist_inp_SPECIFIC_DB.csv).
All target columns from these four databases will be standardized. The new columns are suffixed with _stand.
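As a rough illustration of what a standardized `_stand` column could look like, the sketch below normalizes a target column and stores the result next to the original. The normalization rule (lower-casing, stripping separators, collapsing input variants) and the column name `CA_target` are assumptions, not the script's actual rules.

```python
import pandas as pd


def standardize_target(value):
    """Lower-case a target label and strip separators (illustrative rule only)."""
    if pd.isna(value):
        return value  # keep missing values as they are
    cleaned = str(value).strip().lower()
    for sep in ("-", "_", " "):
        cleaned = cleaned.replace(sep, "")
    # Assumption: any label starting with "input" is an input control.
    return "input" if cleaned.startswith("input") else cleaned


# Hypothetical raw column from one database.
df = pd.DataFrame({"CA_target": ["H3K4me3", "H3K27-ac", "Input control", None]})

# The standardized values go into a new column with the _stand suffix,
# leaving the original column untouched.
df["CA_target_stand"] = df["CA_target"].map(standardize_target)
```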
A script to merge several dataframes from different databases, standardize and filter the Histones and Input samples
optional arguments:
  -h, --help            show this help message and exit
  -g GEO, --geo GEO     GEO metadata csv file generated by GEO-Metadata script
  -n NGS, --ngs NGS     NGS-QC metadata csv file generated by NGS-QC-extraction script
  -c CA, --ca CA        ChIP-Atlas metadata csv file generated by ChIP-Atlas-extraction script
  -C CISTROME, --cistrome CISTROME
                        Cistrome metadata csv file generated by Cistrome-extraction script
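For readers unfamiliar with argparse, a parser that produces help text of this shape could look roughly like the sketch below. The flags and help strings are taken from the help output above; the function name `build_parser` is hypothetical, and whether the real script marks any option as required is not stated, so none is marked here.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented options (a sketch, not the script itself)."""
    parser = argparse.ArgumentParser(
        description="A script to merge several dataframes from different "
                    "databases, standardize and filter the Histones and Input samples"
    )
    parser.add_argument("-g", "--geo",
                        help="GEO metadata csv file generated by GEO-Metadata script")
    parser.add_argument("-n", "--ngs",
                        help="NGS-QC metadata csv file generated by NGS-QC-extraction script")
    parser.add_argument("-c", "--ca",
                        help="ChIP-Atlas metadata csv file generated by ChIP-Atlas-extraction script")
    # Note the upper-case -C: argparse treats -c and -C as distinct flags.
    parser.add_argument("-C", "--cistrome",
                        help="Cistrome metadata csv file generated by Cistrome-extraction script")
    return parser


# Parse an explicit argument list using the example file names from this README.
args = build_parser().parse_args([
    "-g", "GEO_metadata_2023_91930_stand.csv",
    "-n", "NGS_HS_ChipSeq_nodup_33233.csv",
    "-c", "CA_hg38_Hs_GSM_GSE_2022_antigenclass_2023_01_24.csv",
    "-C", "Cistrome_filter_human_noENC.csv",
])
```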
To see how to submit this script via Slurm, check sh/files/run_comparison_dbs.sh.
Script to merge the table generated by main_stand_dbs.py with the EpiClass prediction tables.
python merge_prediction.py Histones_basedDBs.tsv ChIP_Atlas_pred_EpiLaP.csv
The next script requires the output of merge_prediction.py, so run merge_prediction.py first.
Script to generate a table with additional columns recording how many databases agree or disagree with the EpiClass prediction, and how many samples agree or disagree among the databases.
The additional columns are added to the output file (e.g. Histones_allDBs_CA_pred_comparisonDBs.tsv):
python main.py Histones_allDBs_CA_pred.tsv Histones_allDBs_CA_pred_comparisonDBs.tsv
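The kind of agreement columns described above can be sketched in pandas as follows. All column names (`*_target_stand`, `predicted_target`, `n_agree_pred`, `n_disagree_pred`, `dbs_unanimous`) and the toy rows are hypothetical, and treating a missing annotation as neither agreement nor disagreement is an assumption.

```python
import pandas as pd

# Hypothetical standardized target columns, one per database.
db_cols = ["GEO_target_stand", "NGS_target_stand", "CA_target_stand", "C_target_stand"]

df = pd.DataFrame({
    "GEO_target_stand": ["h3k4me3", "h3k27ac"],
    "NGS_target_stand": ["h3k4me3", None],
    "CA_target_stand": ["h3k4me1", "h3k27ac"],
    "C_target_stand": [None, "h3k27ac"],
    "predicted_target": ["h3k4me3", "h3k27me3"],
})

labels = df[db_cols]
has_label = labels.notna()

# Compare every database column with the prediction, row by row.
matches = labels.eq(df["predicted_target"], axis=0)

# Databases that agree with the prediction.
df["n_agree_pred"] = matches.sum(axis=1)
# Databases that have a label but disagree; missing labels count as neither.
df["n_disagree_pred"] = (has_label & ~matches).sum(axis=1)
# True when all databases that annotate the sample give the same label.
df["dbs_unanimous"] = labels.nunique(axis=1) == 1
```

In the second toy row the three annotating databases agree with each other but all disagree with the prediction, which is the kind of case such columns make easy to find.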
This script removes columns from the Hist_DBs_predCA table to make it easier to manipulate. The list of removed columns is defined in the script.
python clean_cols_histDbsPred.py Histones_DBs_filled_CA_pred_consensus_ENCODE_upset.tsv
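Dropping a fixed list of columns is a one-liner in pandas; the sketch below shows the idea. The column names in `COLS_TO_DROP` and in the toy table are hypothetical, since the real list lives in clean_cols_histDbsPred.py.

```python
import pandas as pd

# Hypothetical column names; the real list is defined in the script.
COLS_TO_DROP = ["GEO_raw_title", "CA_raw_description", "not_in_table"]

table = pd.DataFrame({
    "GSM": ["GSM1", "GSM2"],
    "GEO_raw_title": ["a", "b"],
    "CA_raw_description": ["c", "d"],
    "predicted_target": ["h3k4me3", "input"],
})

# errors="ignore" skips names that are absent instead of raising KeyError.
cleaned = table.drop(columns=COLS_TO_DROP, errors="ignore")
```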
Script to merge all predictions generated by EpiClass associated with ChIP-Atlas (e.g. assay_prediction.csv, sex_prediction.csv, biomaterial_prediction.csv). Hint: all prediction .csv files should be in the same folder as the script. Pass the folder containing all outputs (tsv files from EpiClass) as the argument.
python merge_metadata_epilap.py <path_to_pred_outputs>
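Merging one prediction file per category into a single wide table could be sketched as below. The function `merge_predictions`, the key column `sample_id`, the `predicted` column, and the choice to prefix columns with the category name are all assumptions; only the `*_prediction.csv` file naming comes from the examples above.

```python
import tempfile
from pathlib import Path

import pandas as pd


def merge_predictions(folder, key="sample_id"):
    """Outer-merge every *_prediction.csv in `folder` on a shared key (hypothetical)."""
    merged = None
    for path in sorted(Path(folder).glob("*_prediction.csv")):
        category = path.name.replace("_prediction.csv", "")
        pred = pd.read_csv(path)
        # Prefix non-key columns with the category to avoid name clashes.
        pred = pred.rename(
            columns={c: f"{category}_{c}" for c in pred.columns if c != key}
        )
        merged = pred if merged is None else merged.merge(pred, on=key, how="outer")
    return merged


# Demonstration with two tiny made-up prediction files in a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    pd.DataFrame({"sample_id": ["s1", "s2"], "predicted": ["h3k4me3", "input"]}).to_csv(
        Path(tmp) / "assay_prediction.csv", index=False)
    pd.DataFrame({"sample_id": ["s1", "s2"], "predicted": ["female", "male"]}).to_csv(
        Path(tmp) / "sex_prediction.csv", index=False)
    merged_preds = merge_predictions(tmp)
```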