Placenta_XCI

Understanding patterns of X chromosome inactivation in full term human placenta

Processing placenta data

Directory placenta

00_sexcheck

Snakefile sexcheck.snakefile

01_process_dna

Generate config files
- python generate_json_config_dna_females.py: take in the input file female_sample_ids.csv and output the config file process_dna_females_config.json (for female placentas)
- python generate_json_config_dna_males.py: take in the input file male_sample_ids.csv and output the config file process_dna_males_config.json (for male placenta)
Snakemake files:
- for mapping and genotype variants
- process_dna_females.snakefile and process_dna_males.snakefile:

02_run_asereadcounter

Config file: asereadcounter_config.json
Snakemake file: asereadcounter.snakefile
Output directory: 02_run_asereadcounter/asereadcounter

03_analyze_ase

Subdirectories: scripts and results
Calculate (unphased) allele balance:
- Python script calc_allele_balance.py
- Config file: analyze_ase_config.json
- Snakemake file: analyze_ase.snakefile
Calculate median allele balance per individual:
- Python script calc_median_allele_balance_placenta_decidua.py (see Bash script run_calc_median_allele_balance_placenta_decidua.sh)

04_phasing

Phasing strategy:
- For each pair of placenta (site A and site B):
  - Subset to contain shared expressed variants
  - Using the site with more variants where allele balance is greater than 0.8
  - Generate a haplotype by adding all the biased allele together. If the allele balance is equal to 0.5, pick at random
  - Calculate allele balance using the phased data
Steps:
1. For each pair of placenta (site A and B), find shared variants between site A and site B:
2. python subset_paired_placentas_for_shared_variants.py chrX > chrX_summary_stats.txt
3. python subset_paired_placentas_for_shared_variants.py chr8 > chr8_summary_stats.txt
4. Results are in directory paired_placentas_shared_variants/
5. Run snakefile: snakemake --snakefile phase.snakefile to compute allele balance for phased data
6. Concat for plotting Figure 2:
```
cd 04_phasing/phased_allele_balance/
cat *chrX*allele_balance_summary.tsv | grep -v sample_id | sort -n -r -k 3,3 > all_placenta_chrX_phased_allele_balance.tsv
```

05_pca

Contain files for generating the PCA

Processing GTEx data

Directory gtex
In this directory, we are analyzing the ASEReadCounter counts from GTEx version 8.

01_download_data

Download the file participant.tsv from anvil project website. This file has information about the sample id
Download the file sample.tsv from anvil project website. This file has information about the rna id and which tissue
Obtain a list of individuals
There are 979 individuals
Run the python script obtain_individuals_list.py
Download using the file download_asereadcounter_count.sh
After downloading, I noticed that there are some files with this message inside: No such object: fc-secure-ff8156a3-ddf3-42e4-9211-0fd89da62108/GTEx_Analysis_2017-06-05_v8_ASE_WASP_chrX_raw_counts_by_subject/GTEX-1J8EW.v8.readcounts.chrX.txt.gz. We want to remove these individuals from further analyses. Therefore, we need to know which are these individuals.
Run the python script: python check_corrupted_files.py
The outfile is failed_files.txt. There are 147 individuals without ASEReadCounter results.
Generate a config file from the file sample.tsv: python generate_config.py
Remove the individuals in the failed_files.txt
Only keep the females 1. Download: wget https://storage.googleapis.com/gtex_analysis_v8/annotations/GTEx_Analysis_v8_Annotations_SubjectPhenotypesDS.txt

02_analyze_gtex_counts

Snakemake file: analyze_gtex_counts.snakefile

Subset each downloaded count file for each tissue

Because each count file includes all of the tissues for an individual, I need to subset for each tissue
Use the python script subset_gtex_counts.py. See snakemake rule subset_gtex_counts (line 9)

calculate allele balance 1.See snakemake rule calc_allele_balance (line 24)
Calculate median allele balance per tissue:
See snakemake rule calc_median_allele_balance_per_tissue (line 36)
Find tissues where there are at least 10 samples per tissue: python find_tissues_more_than_10_samples_per_tissue.py

Processing GTEx heart data

Directory heart

1. Find subject ids with rnaseq data for both heart left ventricle and heart atrial appendage

python find_samples_with_2_hearts.py

2. Calculate the proportion of skewed variants per sample

python calc_prop_variants_skewed_per_sample.py

3. Employ a phasing strategy for heart in the same way as placenta

python /scratch/tphung3/Placenta_XCI/heart/subset_paired_hearts_for_shared_variants.py chrX. Results are in directory paired_hearts_shared_variants/
Use the snakemake file phase.snakefile. Results are in directory phased_allele_balance/

cat *chrX* | grep -v sample_id | awk '{print$1"\t"$3"\t"$2}' | sort -n -r -k 3,3 > all_heart_chrX_phased_allele_balance.tsv

Analysis at the gene level

In this directory, I am analyzing genes that escape XCI for placenta and gtex tissues using the individuals that show skewed allele balance (median allele balance is greater than 0.8)
Directory gene_level
- Sub-directories: gtex_counts and specific_gene_analysis

wes_genotyping

Process the placenta skewed samples

Convert skewed samples to bed file format

Use the python script convert_asereadcounter_to_bed.py
See the snakemake rule convert_asereadcounter_to_bed_placenta

Convert gtf file to bed file format

python convert_gtf_to_bed.py

Output is /scratch/tphung3/Placenta_XCI/gene_level/wes_genotyping/gtf_bed/gencode.v29.annotation.chrX.bed

Use bedtools to find where on the genes the variants are
Snakemake rule bedtools_intersect
Remove duplicated: snakemake rule find_unique_lines_after_bedtools

gtex_counts

Find samples that are skewed in GTEX tissues, placenta, decidua females, and decidua males

python /scratch/tphung3/Placenta_XCI/gene_level/gtex_counts/scripts/find_samples_skewed.py

The config file is /scratch/tphung3/Placenta_XCI/gene_level/gtex_counts/escape_genes_config.json
There are 525 samples that are skewed in the GTEx

Tabulate how many individuals there are for each tissue that are highly skewed

python /scratch/tphung3/PlacentaSexDiff/E_escape_genes/gtex_counts/scripts/tabulate_individuals.py

The result file is tabulate_individuals.csv

Convert skewed samples to bed file format
Rule convert_asereadcounter_to_bed_gtex in escape_genes.snakefile
Use bedtools to find where on the genes the variants are
Snakemake rule bedtools_intersect
Remove duplicated: snakemake rule find_unique_lines_after_bedtools

Assign escape status by combining all of the gtex samples

Find genes that have at least one heterozygous and expressed variant across all skewed samples
python /scratch/tphung3/Placenta_XCI/gene_level/gtex_counts/scripts/find_expressed_genes.py produces the output file /scratch/tphung3/Placenta_XCI/gene_level/gtex_counts/expressed_genes_all_samples.txt that lists all of the genes with at least one heterozygous and expressed variant across all samples. There are 689 genes.
For each skewed individual in gtex, compute mean allele balance for each gene. For the placenta and decidua sample, I have already done this step here /scratch/tphung3/Placenta_XCI/gene_level/wes_genotyping/asereadcounter_geneinfo/chrX.

Use the Python script compute_allele_balance_per_gene.py
See snakemake rule compute_allele_balance_per_gene_gtex

Run python make_allele_count_per_gene.py
Add genes to config: python add_genes_to_config.py
See snakemake rule plot_per_gene_allele_balance_compare_gtex_placenta_decidua
Categorize genes into inactivated, escape, or variable escape for gtex, placenta, decidua females, and decidua males.
Use Python script categorize_genes.py - This script categorizes the genes, remove NA, and also sort for plotting heatmaps

Generate plots for manuscript

Directory manuscript_plots
Figure 2: scripts/figure_2.R
Figure 2C: scripts/figure_2C.R
Figure 3: scripts/figure_3.R
Figure 4: scripts/figure_4.R
Supplementary figures:
- Figure S2: scripts/figure_s2.R
- Figure S3: scripts/figure_s3_pca.R
- Figure S5: ./scripts/run_figure_s5_determine_threshold.sh
- Figure S6: ./scripts/figure_s6_nonPAR_males.R
- Figure S7: ./scripts/figure_s7_xci_entire_X.R
- Figure S8: scripts/figure_s8.R
- Figure S9:
  - Directory: /scratch/tphung3/Placenta_XCI/gene_level/gtex_counts/
  - python scripts/compute_escaping_samples_prop_per_gene.py chrX_escaping_samples_prop_per_gene.tsv
  - Run R script /scratch/tphung3/Placenta_XCI/manuscript_plots/scripts/figure_s9_chrX_escaping_samples_prop_per_gene.R
- Figure S10:
  - Directory: /scratch/tphung3/Placenta_XCI/gene_level/gtex_counts/
  - python scripts/generate_data_for_gene_heatmap.py
  - Run R script /scratch/tphung3/Placenta_XCI/manuscript_plots/scripts/figure_s10.R
- Figure S11:
  - Directory: /scratch/tphung3/Placenta_XCI/gene_level/female_male_log2ratio/
  - python find_log2ratio_genes.py
  - Run R script scripts/figure_s11_log2ratio.R

Name		Name	Last commit message	Last commit date
Latest commit History 13 Commits
gene_level		gene_level
gtex		gtex
heart		heart
manuscript_plots		manuscript_plots
misc		misc
placenta		placenta
LICENSE		LICENSE
README.md		README.md

License

SexChrLab/Placenta_XCI

Folders and files

Latest commit

History

Repository files navigation

Placenta_XCI

Processing placenta data

00_sexcheck

01_process_dna

02_run_asereadcounter

03_analyze_ase

04_phasing

05_pca

Processing GTEx data

01_download_data

02_analyze_gtex_counts

Processing GTEx heart data

1. Find subject ids with rnaseq data for both heart left ventricle and heart atrial appendage

2. Calculate the proportion of skewed variants per sample

3. Employ a phasing strategy for heart in the same way as placenta

Analysis at the gene level

wes_genotyping

gtex_counts

Assign escape status by combining all of the gtex samples

Generate plots for manuscript

About

Resources

License

Stars

Watchers

Forks

Languages