## WebSTR_project

#### WebSTR-API access
The WebSTR API getting started guide (https://github.com/acg-team/webSTR-API/blob/main/GETTING_STARTED.md) explains how to query the data from your browser or from your code. Check out the Python examples at the bottom of that page, then try searching for STRs by a single gene name, by multiple gene names, and by genomic region. Check whether you get the same results as in the web portal.

In [9]:
import requests
import pandas as pd

Search for STRs by a single gene name or by multiple gene names:

In [10]:
### your code here

Search for STRs by genomic region:

In [11]:
### your code here
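
One way to wrap both kinds of search in a single helper is sketched below. It assumes the `/repeats/` endpoint and the `gene_names` and `region_query` parameters described in the getting started guide. The base URL and the helper names `repeats_request` and `fetch_repeats` are illustrative assumptions, so check them against the guide before relying on this.

```python
import requests
import pandas as pd

# Assumption: base URL, endpoint and parameter names follow the getting
# started guide; verify them there before use.
WEBSTR_API = "http://webstr-api.ucsd.edu"


def repeats_request(gene_names=None, region=None):
    """Build (url, params) for a repeats query by gene names or by region."""
    if isinstance(gene_names, str):
        gene_names = [gene_names]  # accept a single gene name as well
    params = {}
    if gene_names:
        params["gene_names"] = list(gene_names)
    if region:
        params["region_query"] = region
    if not params:
        raise ValueError("provide gene_names or region")
    return f"{WEBSTR_API}/repeats/", params


def fetch_repeats(gene_names=None, region=None, timeout=60):
    """Send the query and return the JSON records as a DataFrame."""
    url, params = repeats_request(gene_names, region)
    resp = requests.get(url, params=params, timeout=timeout)
    resp.raise_for_status()  # fail loudly on HTTP errors
    return pd.DataFrame.from_records(resp.json())
```

Calling `fetch_repeats(gene_names=["MSH2", "MSH6"])` would then return one row per STR, which you can compare with the web portal. For a region search, pass a region string in the format the guide specifies.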

#### STR Variation in CRC

Short tandem repeats (STRs), also called microsatellites, are abundant throughout the genome and polymorphic among individuals. In colorectal cancer (CRC), microsatellite instability (MSI) is a hypermutable phenotype caused by deficiencies in the DNA mismatch repair (MMR) system. Certain genes have been found to be associated with MMR. For more background on MSI in CRC, see this paper: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3037515/

With STRs genotyped from TCGA CRC normal and tumor samples, we can explore STR length variation in CRC. We will focus on three MMR-related genes (MSH2, MSH6, PMS1).

To get started, mount your Google Drive and upload the files *normal_sample_chr2.csv* and *tumor_sample_chr2.csv* to it.

In [14]:
from google.colab import drive
drive.mount('/content/drive')

bn_chr2 = pd.read_csv("/content/drive/MyDrive/normal_sample_chr2.csv")
pt_chr2 = pd.read_csv("/content/drive/MyDrive/tumor_sample_chr2.csv")

Mounted at /content/drive


Before any further analysis, it is good practice to get a basic overview of the datasets. For example, how many unique samples and STRs does each dataset contain? How many STRs are there for each repeat unit size? Which type of STR is most abundant?

In [15]:
pt_chr2.head()

Unnamed: 0,patient,allele_a,allele_b,sample_type,chr,start,end,period,ref
0,sample1,9,9,Primary Tumor,chr2,10777693,10777710,2,9
1,sample2,9,9,Primary Tumor,chr2,10777693,10777710,2,9
2,sample3,9,9,Primary Tumor,chr2,10777693,10777710,2,9
3,sample4,9,9,Primary Tumor,chr2,10777693,10777710,2,9
4,sample5,9,9,Primary Tumor,chr2,10777693,10777710,2,9


In [16]:
### your code here
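
A minimal sketch of such an overview follows. It assumes the column names shown in the `pt_chr2.head()` output above. The `toy` frame and the helper name `summarize_strs` are made up for illustration, and an STR is identified here by its `chr`, `start` and `end`.

```python
import pandas as pd

# Made-up stand-in with the same columns as pt_chr2 (not real data).
toy = pd.DataFrame({
    "patient": ["sample1", "sample2", "sample1", "sample2", "sample3"],
    "chr": ["chr2"] * 5,
    "start": [10777693, 10777693, 20000000, 20000000, 20000000],
    "end": [10777710, 10777710, 20000011, 20000011, 20000011],
    "period": [2, 2, 1, 1, 1],
})


def summarize_strs(df):
    """Count unique samples, unique STR loci, and loci per repeat unit size."""
    # Each locus appears once per sample, so deduplicate before counting.
    loci = df.drop_duplicates(subset=["chr", "start", "end"])
    return {
        "n_samples": df["patient"].nunique(),
        "n_strs": len(loci),
        "strs_per_period": loci["period"].value_counts().sort_index().to_dict(),
    }


summary = summarize_strs(toy)
print(summary)
```

The same call on `pt_chr2` and `bn_chr2` gives the numbers asked for above. The most abundant STR type is the `period` with the largest count.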

With the WebSTR API, search for STRs in the MMR genes (**MSH2, MSH6, PMS1**), as you did in the first section.


In [17]:
### your code here

Since we are going to investigate STR variation in CRC, keep only the STRs from the panel **gangstr_crc_hg38**.

In [18]:
### your code here
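
The filtering step could look like the sketch below. The rows are made up, and the column names `repeat_id`, `gene_name` and `panel` are assumptions about the API response, so adjust them to the columns you actually receive.

```python
import pandas as pd

# Made-up rows standing in for an API result. The column names are
# assumptions about what the API returns.
repeats = pd.DataFrame({
    "repeat_id": [1, 2, 3, 4],
    "gene_name": ["MSH2", "MSH2", "MSH6", "PMS1"],
    "panel": [
        "gangstr_crc_hg38",
        "ensembletr_hg38",
        "gangstr_crc_hg38",
        "gangstr_crc_hg38",
    ],
})

# Keep only the CRC panel, then count the remaining STRs per gene.
crc = repeats[repeats["panel"] == "gangstr_crc_hg38"]
strs_per_gene = crc.groupby("gene_name").size()
print(strs_per_gene)
```

On the real query result, the per-gene counts are a quick check that the filter worked as intended.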

After filtering, you should get **212 STRs** for MSH2, **185 STRs** for MSH6 and **127 STRs** for PMS1. Using the *pt_chr2* dataset you imported earlier, check the STR length distribution in the CRC tumor samples and then select the most variable STR for each gene. For the selected STRs, compare the length distribution between tumor and normal samples.

In [19]:
import seaborn as sns

In [20]:
### your code here
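
The sketch below shows one possible way to pick a most variable STR and line up tumor and normal alleles for comparison. "Most variable" is taken here to mean the largest standard deviation of allele lengths, which is an assumption; other definitions, such as the number of distinct alleles, are equally defensible. The data and the helper names are made up.

```python
import pandas as pd

LOCUS = ["chr", "start", "end"]


def to_long(df):
    """One row per allele: stack allele_a and allele_b into a 'length' column."""
    return df.melt(
        id_vars=["patient", "sample_type", *LOCUS],
        value_vars=["allele_a", "allele_b"],
        value_name="length",
    )


def most_variable_locus(df):
    """Locus whose allele lengths have the largest standard deviation."""
    spread = to_long(df).groupby(LOCUS)["length"].std()
    return spread.idxmax()


def select_locus(df, locus):
    """Rows of df that belong to one (chr, start, end) locus."""
    chrom, start, end = locus
    mask = (df["chr"] == chrom) & (df["start"] == start) & (df["end"] == end)
    return df[mask]


def make_toy(sample_type, alleles_b):
    """Made-up genotypes for three patients at two loci (not real TCGA data)."""
    return pd.DataFrame({
        "patient": ["s1", "s2", "s3"] * 2,
        "sample_type": [sample_type] * 6,
        "chr": ["chr2"] * 6,
        "start": [100] * 3 + [500] * 3,
        "end": [117] * 3 + [520] * 3,
        "allele_a": [9, 9, 9, 10, 10, 8],
        "allele_b": [9, 9, 9] + alleles_b,
    })


tumor = make_toy("Primary Tumor", [10, 13, 10])
normal = make_toy("Normal", [10, 10, 8])

locus = most_variable_locus(tumor)
combined = pd.concat([select_locus(tumor, locus), select_locus(normal, locus)])
long = to_long(combined)
```

To work gene by gene, first restrict the tumor table to the STRs of that gene. Whether the API coordinates match the `start` and `end` columns of the CSV files exactly is something to verify. A call such as `sns.histplot(data=long, x="length", hue="sample_type", multiple="dodge", discrete=True)` is one way to draw the comparison.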

#### VCF file from EnsembleTR

The ensembletr_hg38 panel is based on the GRCh38 reference assembly and contains 1.7 million unique autosomal STRs, derived from a combined set of TRs genotyped by four separate methods (HipSTR, GangSTR, ExpansionHunter and AdVNTR). You can download EnsembleTR calls for samples from the 1000 Genomes Project and H3Africa here: https://github.com/gymrek-lab/EnsembleTR. For this project you will need the version II VCF file for chromosome 2 and its tbi index file.

First, install **cyvcf2**, a VCF parser for Python:

In [12]:
!pip install cyvcf2
from cyvcf2 import VCF

Collecting cyvcf2
  Downloading cyvcf2-0.30.22-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.2 MB)
Collecting coloredlogs (from cyvcf2)
  Downloading coloredlogs-15.0.1-py2.py3-none-any.whl (46 kB)
Collecting humanfriendly>=9.1 (from coloredlogs->cyvcf2)
  Downloading humanfriendly-10.0-py2.py3-none-any.whl (86 kB)
Installing collected packages: humanfriendly, coloredlogs, cyvcf2
Successfully installed coloredlogs-15.0.1 cyvcf2-0.30.22 humanfriendly-10.0


Don't forget to upload the VCF file and the tbi file you just downloaded to your Google Drive.

In [21]:
vcf_ch2 = VCF("/content/drive/MyDrive/ensemble_chr2_filtered.vcf.gz")

For more information on the VCF file format, see the VCF spec (https://samtools.github.io/hts-specs/VCFv4.2.pdf). Then try to extract the STR genotypes using cyvcf2 (https://brentp.github.io/cyvcf2/index.html). Here are a few examples to get you started.

In [22]:
example = next(vcf_ch2)

In [None]:
example.POS, example.ALT, example.CHROM, example.REF, example.end,

(10567,
 ['CCCTAAC', 'CCCTAACCCCTAACCCCTAACCCCTAACCCCTAACCCCTAAC', 'CCCTAACCCCTAAC'],
 'chr2',
 'CCCTAACCCCTAACCCCTAAC',
 10587)

In [23]:
example.FORMAT

['GT', 'GB', 'NCOPY', 'EXP', 'SCORE', 'GTS', 'ALS', 'INPUTS', 'FILTER']

In [24]:
example.format("ALS")

array(['0|2', '0|2', '0|2', ..., '.', '0|2', '0|2'], dtype='<U8')

In [25]:
# START, END, PERIOD, RU, METHODS, HRUN, HET, HWEP, AC, REFAC
example.INFO.get("HRUN")

4
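
To turn the fields above into repeat lengths, one option is to approximate each allele's copy number as its sequence length divided by the repeat period, then look up each sample's alleles by genotype index. The sketch below uses the REF and ALT sequences from the example record. The period of 7 is read off the visible CCCTAAC motif and is an assumption (in practice take it from the `PERIOD` INFO field), and the genotypes are made up. The `NCOPY` FORMAT field may already carry copy numbers, but its encoding is not shown here, so it is not used.

```python
def copies_per_allele(ref, alts, period):
    """Approximate copy number per allele as len(sequence) // period.

    Index 0 is REF and indices 1.. are the ALT alleles, matching GT indices.
    """
    return [len(seq) // period for seq in [ref, *alts]]


def sample_copy_numbers(genotypes, copies):
    """Map cyvcf2-style genotypes ([allele, ..., phased]) to copy numbers.

    Missing alleles (index -1) become None.
    """
    result = []
    for *alleles, _phased in genotypes:
        result.append(tuple(copies[a] if a >= 0 else None for a in alleles))
    return result


# REF and ALT from the example record above; period=7 is an assumption.
ref = "CCCTAACCCCTAACCCCTAAC"
alts = ["CCCTAAC", "CCCTAACCCCTAACCCCTAACCCCTAACCCCTAACCCCTAAC", "CCCTAACCCCTAAC"]
copies = copies_per_allele(ref, alts, period=7)

# Made-up genotypes: homozygous reference, REF plus third ALT, missing call.
calls = sample_copy_numbers([[0, 0, True], [0, 3, True], [-1, -1, False]], copies)
print(copies, calls)
```

With a real record, `example.REF`, `example.ALT`, `example.INFO.get("PERIOD")` and `example.genotypes` supply the inputs. Integer division is only an approximation for impure repeats.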

Once you understand the structure of the VCF file and how to parse it, search for STRs in the MMR genes (MSH2, MSH6, PMS1) via the WebSTR API and keep only the STRs from the **EnsembleTR** panel. As before, select the most variable STR for each gene and visualise its length distribution across all samples.

To pull the records for a selected STR from the VCF file, run a region query like this:

In [32]:
rq = vcf_ch2('chr2:47398679-47401268')

In [None]:
### your code here
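
A possible way to summarise the records returned by a region query is sketched below. It counts the copy numbers seen across all called alleles of each record and reports the position with the most distinct copy numbers. This is only one definition of "most variable" and is an assumption, as is deriving the copy number from sequence length and `PERIOD`. The `SimpleNamespace` objects are made-up stand-ins that expose only the record attributes the code reads.

```python
from collections import Counter
from types import SimpleNamespace


def copy_number_counts(record):
    """Count repeat copy numbers over all called alleles of one VCF record."""
    period = int(record.INFO.get("PERIOD"))
    copies = [len(seq) // period for seq in [record.REF, *record.ALT]]
    counts = Counter()
    for *alleles, _phased in record.genotypes:
        # Skip missing alleles, which cyvcf2 reports as -1.
        counts.update(copies[a] for a in alleles if a >= 0)
    return counts


def most_variable(records):
    """Return (POS, counts) for the record with the most distinct copy numbers."""
    scored = [(rec.POS, copy_number_counts(rec)) for rec in records]
    return max(scored, key=lambda pair: len(pair[1]), default=None)


# Made-up stand-ins for two records (not real EnsembleTR calls).
rec_a = SimpleNamespace(
    POS=100, REF="ACAC", ALT=["ACACAC"], INFO={"PERIOD": 2},
    genotypes=[[0, 0, False], [0, 1, False], [-1, -1, False]],
)
rec_b = SimpleNamespace(
    POS=200, REF="AAAA", ALT=[], INFO={"PERIOD": 1},
    genotypes=[[0, 0, True]],
)

best = most_variable([rec_a, rec_b])
print(best)
```

With the real file, `most_variable(vcf_ch2('chr2:47398679-47401268'))` would consume the region query directly. The returned counter can be turned into a bar plot of copy number against allele count.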