# H+, or how to build a perfect human.

**Project 5.**\
Lab journal by Anna Ogurtsova

---

*For this project, I imagined living in the not-too-distant future, where transhumanism is widely accepted and I am allowed to use CRISPR-Cas9 on humans. I can simply order a DIY kit to make any corrections to my DNA (actually, I can order one now, but only for E. coli). What would I change?*



**Dataset**: [“GitHub Guy”](https://github.com/msporny/dna)

### Step 1. File conversion

For the analysis I needed to convert 23andMe's raw data into the standard VCF format. I used [plink](https://www.cog-genomics.org/plink/), a program widely used in population genetics. There are other options, such as `bcftools convert`.

In [None]:
!mkdir project_5
!wget http://s3.amazonaws.com/plink1-assets/plink_linux_x86_64_20231211.zip
!unzip plink_linux_x86_64_20231211.zip -d plink
!sudo mv plink/plink /usr/local/bin
!plink --version #PLINK v1.90b7.2 64-bit (11 Dec 2023)

First I removed all SNPs corresponding to deletions and insertions to make the file compatible with annotation tools.

In [None]:
!plink --23file ManuSporny-genome.txt --recode vcf --out snps_clean --output-chr MT --snps-only just-acgt

The resulting file contains every genotyped position, but only the variable positions are of interest, so I filtered them out with bcftools.

In [None]:
!bcftools view -v snps snps_clean.vcf -o snps_variable.vcf

Then I counted the number of SNPs before and after filtering:

In [None]:
!grep -v "^#" snps_clean.vcf | wc -l    # 965887
!grep -v "^#" snps_variable.vcf| wc -l  # 287174

This means that the genome carries SNPs at 287,174 of the 965,887 genotyped positions.
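
As a rough cross-check of the two counts above, the same tally can be sketched in Python. This is only a minimal sketch, not what `bcftools` does internally: it assumes that non-variable sites carry `.` in the ALT column of the plink-generated VCF, and `count_records` is a hypothetical helper name.

```python
def count_records(vcf_lines):
    """Return (total, variable) counts for the data lines of a VCF.

    Assumption: a site is "variable" when its ALT column is not ".".
    """
    total = variable = 0
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        total += 1
        if fields[4] != ".":  # column 5 of a VCF record is ALT
            variable += 1
    return total, variable


# Tiny made-up example, not taken from the real file
example = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSAMPLE",
    "1\t100\trs1\tA\t.\t.\t.\tPR\tGT\t0/0",
    "1\t200\trs2\tA\tG\t.\t.\tPR\tGT\t0/1",
]
print(count_records(example))  # (2, 1)
```

With the real counts, 287174 / 965887 is roughly 29.7% of the genotyped positions.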

### Step 2. Origins, haplogroups.

**Establish maternal (mtDNA) and paternal (Y-chromosome) haplogroups and, optionally, probable ethnicity.**

1. There are many ready-made tools from colleagues at [the International Society of Genetic Genealogy](https://isogg.org/wiki/MtDNA_tools).\
For example, for [mtDNA](https://dna.jameslick.com/mthap/) - shows all SNPs that distinguish the haplogroup, and takes 23andMe input.







**Most probably this person belongs to haplogroup M (M6a is the closest match).
Haplogroup M6 is found mainly in South Asia, with the highest concentrations in mid-eastern India and Kashmir.**

2. For the Y chromosome there is also a lot of interesting stuff - [for example](https://isogg.org/wiki/Y-DNA_tools)\
For raw 23andMe, for example, [this one](https://ytree.morleydna.com/extractFromAutosomal)

### Step 3. Annotation - sex and eye colour

Task: determine the person's sex and eye colour.
This person is male, because his genome contains a Y chromosome.

SNPs specific to the Y chromosome:
```
rs2032597
rs13447352
rs2032658
rs9786184
rs9786176
```

In [None]:
!grep -E "rs2032597|rs13447352|rs2032658|rs9786184|rs9786176" ManuSporny-genome.txt

Output:\
rs2032597	Y	13357186	A\
rs2032658	Y	14091377	G\
rs13447352	Y	21159241	A
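
The same check can be sketched in Python by counting how many Y-chromosome markers have an actual call. This is a minimal sketch under two assumptions: the raw file uses the tab-separated layout `rsid, chromosome, position, genotype`, and missing calls are written as `--`. The helper name `count_y_calls` and the no-call row in the example are hypothetical.

```python
def count_y_calls(raw_lines):
    """Return (called, no_call) counts for chromosome Y rows of a 23andMe-style file."""
    called = no_call = 0
    for line in raw_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comments and blank lines
        rsid, chrom, pos, genotype = line.rstrip("\n").split("\t")[:4]
        if chrom != "Y":
            continue
        if set(genotype) <= {"-"}:  # "--", "-" or empty means no call
            no_call += 1
        else:
            called += 1
    return called, no_call


example = [
    "# rsid\tchromosome\tposition\tgenotype",
    "rs2032597\tY\t13357186\tA",
    "rs2032658\tY\t14091377\tG",
    "rs13447352\tY\t21159241\tA",
    "rsHYPOTHETICAL\tY\t1\t--",       # made-up no-call row for illustration
    "rs12913832\t15\t26039213\tAG",   # autosomal row, ignored
]
print(count_y_calls(example))  # (3, 1)
```

A sample with many called Y markers is consistent with a male genome; no threshold is suggested here, since the source does not give one.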


For eye colour I used this [article](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3694299/)

In [None]:
!grep -E "rs12203592|rs12896399|rs12913832|rs16891982|rs611947" ManuSporny-genome.txt > extracted_snps_raw.txt


Most probably our guy has brown eyes:

rs16891982	5	33987450	CG\
rs12203592	6	341321	CC\
rs12896399	14	91843416	GT\
**rs12913832**	15	26039213	**AG** (not blue)
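
The reasoning behind "not blue" can be sketched as a lookup on rs12913832 alone. This is a deliberately simplified sketch: it assumes the commonly described pattern where GG at rs12913832 goes with blue eyes and any A allele makes blue unlikely, whereas the article combines several SNPs. The function names are hypothetical.

```python
def genotypes_by_rsid(raw_lines, wanted):
    """Collect genotypes for the requested rsIDs from 23andMe-style lines."""
    found = {}
    for line in raw_lines:
        if line.startswith("#") or not line.strip():
            continue
        rsid, _chrom, _pos, genotype = line.rstrip("\n").split("\t")[:4]
        if rsid in wanted:
            found[rsid] = genotype
    return found


def rs12913832_call(genotype):
    """Single-marker rule of thumb (an assumption, not the article's full model)."""
    if genotype == "GG":
        return "blue likely"
    if "A" in genotype:
        return "not blue (brown more likely)"
    return "undetermined"


rows = [
    "rs16891982\t5\t33987450\tCG",
    "rs12203592\t6\t341321\tCC",
    "rs12896399\t14\t91843416\tGT",
    "rs12913832\t15\t26039213\tAG",
]
calls = genotypes_by_rsid(rows, {"rs12913832"})
print(rs12913832_call(calls["rs12913832"]))  # not blue (brown more likely)
```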

### Step 4. Annotation of all SNPs, selection of clinically relevant ones.

#### **a) The variant with SnpEff/SnpSift**

In [None]:
!snpEff GRCh37.75 snps_clean.vcf > snps_snpeff_trial.vcf
!snpSift annotate -v clinvar.vcf snps_clean.vcf > snps_clean_snpsift_clinvar.vcf
!grep CLNDN snps_clean_snpsift_clinvar.vcf > filtered_diagnosis.vcf

After comparison with the ClinVar database, the following potential diagnoses were obtained:

In [None]:
!awk -F'\t' '/^#/ {next} {split($8, info, ";"); clndn=""; clnsig=""; for (i in info) {split(info[i], kv, "="); \
if (kv[1] == "CLNDN") {clndn = kv[2]} if (kv[1] == "CLNSIG") {clnsig = kv[2]}} if (clndn != "" && clnsig != "") {print clndn "\t" clnsig}}' filtered_diagnosis.vcf

```Output
Amyotrophic_lateral_sclerosis_type_10|TARDBP-related_frontotemporal_dementia	Uncertain_significance
not_provided	Benign
Inborn_genetic_diseases	Uncertain_significance
Inborn_genetic_diseases	Uncertain_significance
not_provided	Benign
Cardiovascular_phenotype	Likely_benign
Generalized_epilepsy_with_febrile_seizures_plus,_type_7|Neuropathy,_hereditary_sensory_and_autonomic,_type_2A	Uncertain_significance
Dilated_cardiomyopathy_1G|Autosomal_recessive_limb-girdle_muscular_dystrophy_type_2J	Uncertain_significance
Inborn_genetic_diseases	Likely_benign
**not_provided|Familial_adenomatous_polyposis_1|Hereditary_cancer-predisposing_syndrome	Uncertain_significance**
Inborn_genetic_diseases	Uncertain_significance
not_provided	Likely_benign
not_provided	Uncertain_significance
Cardiovascular_phenotype	Likely_benign
Inborn_genetic_diseases	Uncertain_significance
not_provided	Likely_benign
not_provided|Early_infantile_epileptic_encephalopathy_with_suppression_bursts	Likely_benign
Telangiectasia,_hereditary_hemorrhagic,_type_2	Likely_benign
Retinoblastoma	Likely_benign
**Spastic_paraplegia_52,_autosomal_recessive	Uncertain_significance**
Inborn_genetic_diseases	Uncertain_significance
Congenital_myasthenic_syndrome_4A	Uncertain_significance
Inborn_genetic_diseases	Uncertain_significance
**Neurofibromatosis,_type_1|Cardiovascular_phenotype|Hereditary_cancer-predisposing_syndrome	Uncertain_significance**
**Malignant_tumor_of_prostate	Uncertain_significance**
not_provided	Uncertain_significance
Rhabdoid_tumor_predisposition_syndrome_2	Likely_benign
Developmental_and_epileptic_encephalopathy,_30	Likely_benign
```

**Among these variants, 16 are of unknown significance, 2 are benign and 10 are likely benign. No variant is pathogenic.**
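
The tally above was done by eye; a small Python sketch can do the same count from the annotated VCF. It assumes the ClinVar annotations sit in the INFO column (column 8) as `CLNDN=` and `CLNSIG=` entries, as in the awk command. The helper names and the example records are hypothetical.

```python
from collections import Counter


def info_to_dict(info):
    """Turn 'A=1;B=2;FLAG' into {'A': '1', 'B': '2', 'FLAG': True}."""
    out = {}
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        out[key] = value if sep else True
    return out


def tally_clnsig(vcf_lines):
    """Count CLNSIG values over records that have both CLNDN and CLNSIG."""
    counts = Counter()
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue
        info = info_to_dict(line.rstrip("\n").split("\t")[7])
        if "CLNDN" in info and "CLNSIG" in info:
            counts[info["CLNSIG"]] += 1
    return counts


# Made-up records for illustration only
example = [
    "1\t100\trs1\tA\tG\t.\t.\tPR;CLNDN=not_provided;CLNSIG=Benign\tGT\t0/1",
    "2\t200\trs2\tC\tT\t.\t.\tPR;CLNDN=Retinoblastoma;CLNSIG=Likely_benign\tGT\t0/1",
    "3\t300\trs3\tG\tA\t.\t.\tPR;CLNDN=not_provided;CLNSIG=Benign\tGT\t1/1",
    "4\t400\trs4\tT\tC\t.\t.\tPR\tGT\t0/1",
]
print(tally_clnsig(example))
```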

In [None]:
!snpSift gwasCat -db gwas_catalog_v1.0-associations_e111_r2024-02-11.tsv snps_clean.vcf > snps_clean_gwascat.vcf
!grep GWASCAT_TRAIT snps_clean_gwascat.vcf > GWAS_traits.vcf

To filter only those factors from a GWAS dataset that suggest a strong positive association, I looked at the Odds Ratio (OR) or Beta values along with their corresponding p-values.

In [None]:
%%bash
awk -F'\t' '{
    split($8, info, ";");
    or_beta = 0; p_value = 1;
    for (i in info) {
        split(info[i], kv, "=");
        if (kv[1] == "GWASCAT_OR_BETA") {or_beta = kv[2]}
        if (kv[1] == "GWASCAT_P_VALUE") {p_value = kv[2]}
    }
    if (or_beta > 1 && p_value < 0.05) {print}
}' GWAS_traits.vcf > significant_GWAS_traits.vcf
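
For readers more comfortable with Python, the same filter can be sketched as follows. This is a rough equivalent rather than a drop-in replacement: it uses the INFO keys `GWASCAT_OR_BETA` and `GWASCAT_P_VALUE` from the awk script, treats missing or non-numeric values as a failed filter, and the function name and example INFO strings are hypothetical.

```python
def passes_gwas_filter(info, min_effect=1.0, max_p=0.05):
    """Return True when OR/beta > min_effect and p-value < max_p."""
    fields = dict(item.split("=", 1) for item in info.split(";") if "=" in item)
    try:
        effect = float(fields["GWASCAT_OR_BETA"])
        p_value = float(fields["GWASCAT_P_VALUE"])
    except (KeyError, ValueError):
        return False  # key missing or value not a number
    return effect > min_effect and p_value < max_p


# Made-up INFO strings for illustration only
print(passes_gwas_filter("PR;GWASCAT_OR_BETA=1.37;GWASCAT_P_VALUE=3e-9"))  # True
print(passes_gwas_filter("PR;GWASCAT_OR_BETA=0.8;GWASCAT_P_VALUE=1e-10"))  # False
print(passes_gwas_filter("PR;GWASCAT_TRAIT=Schizophrenia"))                # False
```

Keeping the thresholds as parameters makes it easy to try a stricter p-value cut-off than 0.05.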


The resulting SNPs and their associated traits:

```
rs11708202	0/1	Alcohol_consumption_x_playing_computer_games_interaction
rs6825410	0/1	Diastolic_blood_pressure_in_combination_therapy__beta_blocker_and_thiazide_diuretic_
rs11739417	0/1	Youthful_appearance__self-reported_
rs9356704	0/1	Ulcerative_colitis
rs6980713	0/1	Glaucoma__primary_open-angle_
rs2497219	0/1	Schizophrenia
rs17782124	0/1	Macroalbuminuria_in_type_1_diabetes
```

For each of these SNPs, the VCF REF allele is "A", the ALT allele is "G", and the genotype is heterozygous.

#### **b) VEP (Variant Effect Predictor)**

VEP is easier to run [online](https://grch37.ensembl.org/Homo_sapiens/Tools/VEP/Results?tl=omwUvl8wDdI7EuCo-9893558), using the matching genome build: GRCh37 or, for newer data, GRCh38.



Command line equivalent:

In [None]:
!./vep --af --appris --biotype --buffer_size 500 --check_existing --distance 5000 --mane --polyphen b --pubmed --regulatory --show_ref_allele --sift b --species homo_sapiens --symbol --transcript_version --tsl --uploaded_allele --cache --input_file [input_data] --output_file [output_file] --port 3337

I looked at the "CLIN_SIG" column to find potential risk factors, but there were none:

In [None]:
!awk '($32!="-") ' vep_output.txt | grep risk_factor | cut -f 1-3 | sort | uniq  #no risk factors
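
The awk command relies on CLIN_SIG being column 32, which depends on the options VEP was run with. A sketch that looks the column up by name is less fragile. It assumes the tab-separated VEP output has a single header line starting with `#` (and metadata lines starting with `##`), and that multiple clinical significance terms are joined by `,` or `&`. The function name and the example rows are hypothetical.

```python
import re


def rows_with_clin_sig(lines, term="risk_factor"):
    """Return sorted unique first-three-column tuples whose CLIN_SIG contains term."""
    header = None
    hits = set()
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # blank or metadata line
        if line.startswith("#"):
            header = line[1:].split("\t")  # column names
            continue
        if header is None:
            continue  # data before a header cannot be interpreted
        values = line.split("\t")
        row = dict(zip(header, values))
        terms = re.split(r"[,&]", row.get("CLIN_SIG", "-"))
        if term in terms:
            hits.add(tuple(values[:3]))
    return sorted(hits)


# Made-up rows for illustration only
example = [
    "## ENSEMBL VARIANT EFFECT PREDICTOR",
    "#Uploaded_variation\tLocation\tAllele\tCLIN_SIG",
    "rsA\t1:100\tG\t-",
    "rsB\t2:200\tT\tbenign,risk_factor",
    "rsB\t2:200\tT\tbenign,risk_factor",
]
print(rows_with_clin_sig(example))  # [('rsB', '2:200', 'T')]
```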

### Step 5. Genome editing


#### **Corrections**

##### a. **rs2802292**

Our subject is heterozygous at position 109015211 on chromosome 6.
```
6	109015211	rs2802292	G	T	.	.	PR	GT	0/1
```

If we were to change his genotype to homozygous GG at this position (G is the allele associated with longevity), he would be 1.5 to 2.7 times more likely to live to 100.

This SNP lies in the FOXO3 gene, which is involved in the regulation of oxidative stress, insulin sensitivity, and cellular apoptosis. SNPs in FOXO3 have been linked to longevity in various populations.

##### b. **rs6983267**
```
8	128482487	rs6983267	G	T	.	.	PR	GT	0/1
```
This genotype carries an increased risk of prostate cancer: the (G;T) risk genotype yields an odds ratio for developing prostate cancer of 1.37 (CI: 1.18-1.59, p = 3.4×10⁻⁵) and may account for 22.2% of population attributable risk.
If we correct this genotype to homozygous TT, the cancer risk would drop to the baseline level.
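
To get a feel for what an odds ratio of 1.37 means in absolute terms, it can be converted into a risk for a given baseline. The sketch below uses the standard relation between odds and probability. The 10% baseline is purely an illustrative assumption, not a figure from the source, and `risk_from_odds_ratio` is a hypothetical helper name.

```python
def risk_from_odds_ratio(baseline_risk, odds_ratio):
    """Convert a baseline probability and an odds ratio into the exposed probability."""
    if not 0 < baseline_risk < 1:
        raise ValueError("baseline_risk must be strictly between 0 and 1")
    baseline_odds = baseline_risk / (1 - baseline_risk)
    exposed_odds = baseline_odds * odds_ratio
    return exposed_odds / (1 + exposed_odds)


baseline = 0.10  # assumed baseline risk, for illustration only
risk_gt = risk_from_odds_ratio(baseline, 1.37)
print(f"baseline {baseline:.3f} -> with OR 1.37: {risk_gt:.3f}")
```

Under this assumed baseline the risk rises from 10% to about 13%, which is also what an edit back to TT would undo.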

##### c. **rs4680**

rs4680 (Val158Met) is a well-studied SNP in the COMT gene. The COMT gene codes for the COMT enzyme, which breaks down dopamine in the brain's prefrontal cortex. The wild-type allele is (G), coding for valine; the (A) substitution changes the amino acid to methionine.

The heterozygous variant has intermediate dopamine levels.

```
22	18331271	rs4680	A	G	.	.	PR	GT	0/1
```
If we change the genotype to homozygous AA, GitHub guy will acquire an advantage in memory and attention tasks.

