GFrosi/Comparing_Databases_GEO

Comparing_Databases_GEO

Compile and compare GEO-based databases, and validate their metadata.

Requirements

Python > 3.0, in an environment with pandas installed.

stand-histones

main_stand_dbs.py

This script receives four dataframes:

  • GEO (e.g. GEO_metadata_2023_91930_stand.csv)
  • NGS-QC (e.g. NGS_HS_ChipSeq_nodup_33233.csv)
  • ChIP-Atlas (e.g. CA_hg38_Hs_GSM_GSE_2022_antigenclass_2023_01_24.csv)
  • CistromeDB (e.g. Cistrome_filter_human_noENC.csv)

  1. The script returns a dataframe (.csv file) containing all metadata merged from these four databases.

  2. It also returns a filtered dataframe with the samples (GSM) associated with the IHEC histone marks of interest (h3k4me3, h3k4me1, h3k27me3, h3k27ac, h3k9me3, h3k36me3) and the inputs belonging to the same experiment (GSE).

  3. Four additional files are also generated (df_hist_inp_SPECIFIC_DB.csv).

  4. All target columns from these four databases are standardized. Each new column takes the name of the original column followed by _stand.
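The standardization and filtering steps above can be sketched as follows. This is a minimal illustration, not the code in main_stand_dbs.py: the column names (GSM, GSE, target), the normalization rule (lower-casing and dropping separators) and the helper names are all assumptions made for the example.

```python
import pandas as pd

# Histone marks of interest (IHEC), as listed above.
IHEC_HISTONES = {"h3k4me3", "h3k4me1", "h3k27me3", "h3k27ac", "h3k9me3", "h3k36me3"}


def standardize_target(value):
    """Lower-case a target label and drop separators (assumed rule)."""
    text = str(value).strip().lower()
    for sep in ("-", "_", " "):
        text = text.replace(sep, "")
    return text


def filter_histones_and_inputs(df, target_col="target", gse_col="GSE"):
    """Keep histone samples of interest plus inputs from the same GSE."""
    out = df.copy()
    stand_col = f"{target_col}_stand"  # new column, suffixed with _stand
    out[stand_col] = out[target_col].map(standardize_target)
    is_hist = out[stand_col].isin(IHEC_HISTONES)
    # Experiments (GSE) that contain at least one histone sample of interest.
    gse_with_hist = set(out.loc[is_hist, gse_col])
    is_input = (out[stand_col] == "input") & out[gse_col].isin(gse_with_hist)
    return out[is_hist | is_input].reset_index(drop=True)


# Toy table with hypothetical column names.
toy = pd.DataFrame({
    "GSM": ["GSM1", "GSM2", "GSM3", "GSM4"],
    "GSE": ["GSE1", "GSE1", "GSE2", "GSE2"],
    "target": ["H3K4me3", "Input", "CTCF", "input"],
})
filtered = filter_histones_and_inputs(toy)
```

In this toy table, the input GSM4 is dropped because its experiment (GSE2) contains no histone sample of interest.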

Usage

A script to merge several dataframes from different databases, standardize and
filter the Histones and Input samples

optional arguments:
  -h, --help            show this help message and exit
  -g GEO, --geo GEO     GEO metadata csv file generated by GEO-Metadata script
  -n NGS, --ngs NGS     NGS-QC metadata csv file generated by NGS-QC-
                        extraction script
  -c CA, --ca CA        ChIP-Atlas metadata csv file generated by ChIP-Atla-
                        extraction script
  -C CISTROME, --cistrome CISTROME
                        Cistrome metadata csv file generated by Cistrome-
                        extraction script
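
A parser consistent with the help text above could be declared roughly as below. This is a sketch reconstructed from the help output only: whether the real script marks the four options as required is not stated, so required=True is an assumption, and the file names in the example call are placeholders.

```python
import argparse


def build_parser():
    """Build a parser whose options mirror the help text shown above."""
    parser = argparse.ArgumentParser(
        description=(
            "A script to merge several dataframes from different databases, "
            "standardize and filter the Histones and Input samples"
        )
    )
    # required=True is an assumption; the help output does not say.
    parser.add_argument("-g", "--geo", required=True,
                        help="GEO metadata csv file generated by GEO-Metadata script")
    parser.add_argument("-n", "--ngs", required=True,
                        help="NGS-QC metadata csv file")
    parser.add_argument("-c", "--ca", required=True,
                        help="ChIP-Atlas metadata csv file")
    parser.add_argument("-C", "--cistrome", required=True,
                        help="Cistrome metadata csv file")
    return parser


# Placeholder file names, parsed from an explicit list instead of sys.argv.
args = build_parser().parse_args(
    ["-g", "geo.csv", "-n", "ngs.csv", "-c", "ca.csv", "-C", "cistrome.csv"]
)
```

Note that -c (ChIP-Atlas) and -C (Cistrome) differ only in case.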

For an example of submitting this script via Slurm, see sh/files/run_comparison_dbs.sh

merge_prediction.py

Script to merge the table generated by main_stand_dbs.py with the EpiClass prediction tables.

Usage

python merge_prediction.py Histones_basedDBs.tsv ChIP_Atlas_pred_EpiLaP.csv
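
The merge performed here can be pictured as a left join of the metadata table with the prediction table. The sketch below is only illustrative: the shared key (GSM), the predicted_assay column and the merge_predictions helper are hypothetical, and the actual script may join on a different identifier.

```python
import io

import pandas as pd

# Tiny in-memory stand-ins for the .tsv metadata and .csv prediction files.
metadata_tsv = "GSM\ttarget_stand\nGSM1\th3k4me3\nGSM2\tinput\n"
prediction_csv = "GSM,predicted_assay\nGSM1,h3k4me3\n"


def merge_predictions(metadata, predictions, key="GSM"):
    """Left-join predictions onto metadata, keeping every metadata row."""
    # validate raises MergeError if the key is duplicated on either side.
    return metadata.merge(predictions, on=key, how="left", validate="one_to_one")


metadata = pd.read_csv(io.StringIO(metadata_tsv), sep="\t")
predictions = pd.read_csv(io.StringIO(prediction_csv))
merged = merge_predictions(metadata, predictions)
```

With a left join, samples without a prediction (GSM2 here) are kept and their prediction columns are left empty.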

compare_dbs_prediction/

main.py

Run merge_prediction.py before running this script.

Script to generate a table with columns reporting how many databases agree or disagree with the EpiClass prediction, and how many samples agree or disagree among the databases themselves.

The additional columns are written to the output file (e.g. Histones_allDBs_CA_pred_comparisonDBs.tsv):

python main.py Histones_allDBs_CA_pred.tsv Histones_allDBs_CA_pred_comparisonDBs.tsv
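
One way to compute such agreement columns is sketched below. The column names (GEO_stand, NGS_stand, CA_stand, Cistrome_stand, predicted_assay) and the output column names are assumptions for illustration; the columns actually produced by main.py may differ.

```python
import pandas as pd

# Hypothetical names for the standardized per-database target columns.
DB_COLS = ["GEO_stand", "NGS_stand", "CA_stand", "Cistrome_stand"]


def add_agreement_counts(df, db_cols=DB_COLS, pred_col="predicted_assay"):
    """Count, per sample, the databases that agree/disagree with the prediction."""
    out = df.copy()
    reported = out[db_cols].notna()  # databases that annotate this sample
    agree = out[db_cols].eq(out[pred_col], axis=0) & reported
    out["n_dbs_agree"] = agree.sum(axis=1)
    out["n_dbs_disagree"] = (reported & ~agree).sum(axis=1)
    # True when all reporting databases give the same label.
    out["dbs_consistent"] = out[db_cols].nunique(axis=1) <= 1
    return out


toy = pd.DataFrame({
    "GEO_stand": ["h3k4me3", "input"],
    "NGS_stand": ["h3k4me3", None],
    "CA_stand": ["h3k27ac", "input"],
    "Cistrome_stand": [None, "input"],
    "predicted_assay": ["h3k4me3", "input"],
})
compared = add_agreement_counts(toy)
```

Databases that do not annotate a sample are counted as neither agreeing nor disagreeing in this sketch.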

clean_cols_histDbsPred.py

This script removes columns from the Hist_DBs_predCA table to make it easier to work with. The list of removed columns is defined in the script.

python clean_cols_histDbsPred.py Histones_DBs_filled_CA_pred_consensus_ENCODE_upset.tsv
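
Dropping a fixed list of columns with pandas might look like the sketch below. The column names are placeholders, since the real list lives inside clean_cols_histDbsPred.py.

```python
import pandas as pd

# Hypothetical column names; the real list is defined inside the script.
COLS_TO_DROP = ["notes", "raw_description"]


def drop_columns(df, cols=COLS_TO_DROP):
    """Return a copy of df without the listed columns."""
    # errors="ignore" skips names that are absent from the table.
    return df.drop(columns=list(cols), errors="ignore")


wide = pd.DataFrame({"GSM": ["GSM1"], "notes": ["x"], "target_stand": ["h3k4me3"]})
slim = drop_columns(wide)
```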

merge_metadata_epilap.py

Script to merge all predictions generated by EpiClass that are associated with the ChIP-Atlas predictions (e.g. assay_prediction.csv, sex_prediction.csv, biomaterial_prediction.csv). Hint: all prediction files should be in the same folder as the script. Pass the folder containing all outputs (tsv files from EpiClass) as the argument.

Usage

python merge_metadata_epilap.py <path_to_pred_outputs>
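
Merging every prediction file found in a folder can be sketched as follows. The file pattern, the shared key column (sample_id), the predicted column and the task-name prefixing are assumptions chosen for the example, not the behaviour of merge_metadata_epilap.py.

```python
import tempfile
from functools import reduce
from pathlib import Path

import pandas as pd


def merge_prediction_folder(folder, key="sample_id", pattern="*_prediction.csv"):
    """Outer-merge all prediction files in a folder on a shared key column."""
    paths = sorted(Path(folder).glob(pattern))
    if not paths:
        raise FileNotFoundError(f"no files matching {pattern} in {folder}")
    frames = []
    for path in paths:
        task = path.stem.replace("_prediction", "")  # e.g. "assay", "sex"
        frame = pd.read_csv(path)
        # Prefix non-key columns with the task name to avoid collisions.
        renamed = {c: f"{task}_{c}" for c in frame.columns if c != key}
        frames.append(frame.rename(columns=renamed))
    return reduce(lambda left, right: left.merge(right, on=key, how="outer"), frames)


# Demo on two small files written to a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    pd.DataFrame(
        {"sample_id": ["s1", "s2"], "predicted": ["h3k4me3", "input"]}
    ).to_csv(Path(tmp) / "assay_prediction.csv", index=False)
    pd.DataFrame(
        {"sample_id": ["s1", "s2"], "predicted": ["female", "male"]}
    ).to_csv(Path(tmp) / "sex_prediction.csv", index=False)
    merged_preds = merge_prediction_folder(tmp)
```

An outer merge keeps samples that appear in only some of the prediction files.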
