ipsc preprocess pipeline

This pipeline processes and quantifies fastq files into organized allelic count matrices and quantile-normalized expression matrices. It is run through 'preprocess_driver_key.sh', which splits the pipeline into 7 steps that can be run in series. To run a step, remove the "if false; then" wrapper around the desired part. The code is commented, so it should be clear from 'preprocess_driver_key.sh' what each section does.

Deliverables

Important output files from this pipeline (using the directory names defined in 'preprocess_driver_key.sh'):

1. $preprocess_total_expression_dir"quantile_normalized.txt" contains the quantile-normalized expression matrix for all samples across all protein-coding autosomal genes that have at least 10 samples with RPKM >= 0.1 and counts >= 6.

2. $preprocess_total_expression_dir"rpkm.txt" contains the raw RPKM expression matrix for all samples across all protein-coding autosomal genes that have at least 10 samples with RPKM >= 0.1 and counts >= 6.

3. $covariate_dir"principal_components_10.txt" contains the matrix of loadings of the first 10 PCs for all samples.

4. $covariate_dir"processed_covariates_categorical.txt" contains an organized matrix of covariates

5. $covariate_dir"sva_loadings.txt" contains results of running sva (while inputting time to allow sva to make factors independent of time)

6. $visualize_total_expression_dir: a lot of plots describing total expression quantification

7. $genotype_dir"YRI_genotype.vcf": imputed-dosage based genotypes for our 10 cell lines

8. $genotype_dir"YRI_het_prob_genotype.vcf": heterozygous probabilities for our 10 cell lines (based on impute2)

9. $processed_allelic_counts_dir"allelic_counts_gene_mapped_het_prob_*.txt", where * is the heterozygous probability threshold used to call heterozygous sites. I computed these matrices for several different thresholds. Each cell of a matrix contains refAlleleCounts_totalCounts. Sites are currently not filtered (this will be changed).

10. $visualize_allelic_counts_dir contains plots describing allelic count quantification

11. $fastq_input_dir"fastq_mapping.txt" contains the mapping from sample_ids (cellLine_timeStep) to the fastq files that make up each sample.

12. $preprocess_total_expression_dir"quant_expr_sva_corrected.txt" contains the residual quantile-normalized expression matrix after regressing out the sva latent factors.
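
The gene filter described in items 1 and 2 can be sketched as follows. This is a minimal illustration and not the code in 'preprocess_total_expression.R': the function name and its arguments are hypothetical, and it assumes that both thresholds must hold in the same sample for that sample to count toward the 10.

```python
def passes_expression_filter(rpkm_row, count_row,
                             min_samples=10, min_rpkm=0.1, min_count=6):
    """Decide whether one gene is kept.

    rpkm_row and count_row are hypothetical inputs: the RPKM values and raw
    read counts of a single gene, in the same sample order.
    ASSUMPTION: a sample counts only if it meets both thresholds at once.
    """
    if len(rpkm_row) != len(count_row):
        raise ValueError("rpkm_row and count_row must have the same length")
    # Count the samples in which the gene is expressed above both thresholds.
    n_passing = sum(
        1 for rpkm, count in zip(rpkm_row, count_row)
        if rpkm >= min_rpkm and count >= min_count
    )
    return n_passing >= min_samples


if __name__ == "__main__":
    # Toy gene measured in 12 samples: 10 pass both thresholds, 2 do not.
    rpkm = [0.5] * 10 + [0.0] * 2
    counts = [6] * 10 + [0] * 2
    print(passes_expression_filter(rpkm, counts))
```

Applying the function to every protein-coding autosomal gene and keeping the rows that return True would give the gene set shared by the two expression matrices.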

Let me know if something isn't clear or you think you found a mistake!
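
For readers working with the allelic count matrices from item 9, here is a minimal sketch of how a single refAlleleCounts_totalCounts cell could be read. The helper names are hypothetical, and the sketch assumes every cell holds two non-negative integers joined by an underscore. How missing sites are encoded is not described here, so that case is not handled.

```python
def parse_allelic_cell(cell):
    """Split one cell such as "7_20" into (ref_allele_count, total_count).

    ASSUMPTION: the cell is exactly two integers separated by "_".
    """
    ref_str, total_str = cell.strip().split("_")
    ref, total = int(ref_str), int(total_str)
    if ref < 0 or ref > total:
        raise ValueError("reference count must lie between 0 and total: " + cell)
    return ref, total


def ref_allele_fraction(cell):
    """Return the reference allele fraction, or None when no reads cover the site."""
    ref, total = parse_allelic_cell(cell)
    if total == 0:
        return None
    return ref / total


if __name__ == "__main__":
    print(parse_allelic_cell("7_20"))
    print(ref_allele_fraction("5_20"))
```

Returning None for sites with zero total reads keeps uncovered sites distinct from sites where the reference allele was simply never observed.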

Computer cluster

This pipeline was written to run on the Midway RCC cluster.

Authors

Ben Strober -- BennyStrobes

