In [10]:
import pandas as pd
import numpy as np

In [1]:
import json
import sys
sys.path.append('./src/')
from utilities import *
from run import main

Usage:
               python ./src/run.py {data | data-test | data-ftp} 


In [2]:
%load_ext autoreload

In [3]:
%autoreload 2

In [None]:
# Equivalent to python ./src/run.py data
#--------------------------------------------
# Load config
with open('config/data.json') as f:
    cfg = json.load(f)

# Download data
datapath = cfg['datapath']
download(datapath, **cfg['data'])

# Process data
process(datapath, cfg['data']['ref_file'], cfg['options'])

# Introduction
<p>
    For this project, I am attempting to replicate a result from the 1000 Genomes Project: clustering individuals by population to show that geographically close populations are also genetically similar. Despite its name, the 1000 Genomes dataset contains genomes for roughly 2,500 individuals. The data was collected using deep sequencing technologies, which improved while the project was in progress; this is why the project was able to sequence more individuals than originally planned.
</p>

# Data ingestion
<p>
    The 1000 Genomes data is ingested as the project's completed VCF files. A VCF file combines all the genomes into one compressed format that records only the positions in the genome that vary, because storing every full genome would be highly repetitive. I filtered and merged these files to keep only the most common SNPs (single nucleotide polymorphisms), then ran PCA on the result and combined the principal components with the geographical (population) data. The goal is to show that genetic distance is correlated with geographical population: people from the same population are genetically more similar than people from different populations.
</p>
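
The PCA step described above can be sketched with NumPy alone. This is a minimal illustration, not the code in `./src/`: the function name `pca_project`, the matrix layout (individuals in rows, SNPs in columns, entries counting alternate alleles), and the synthetic two-population data are all assumptions made for the example.

```python
import numpy as np

def pca_project(genotypes, n_components=2):
    """Project individuals onto the top principal components.

    genotypes: (n_individuals, n_snps) matrix of alternate-allele
    counts (0, 1, or 2). This layout is an assumption for the sketch.
    """
    X = np.asarray(genotypes, dtype=float)
    X = X - X.mean(axis=0)  # center each SNP column
    # SVD of the centered matrix; the rows of Vt are the principal axes.
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    # Coordinates of each individual along the leading components.
    return U[:, :n_components] * S[:n_components]

# Synthetic data (made up): two "populations" whose SNPs have
# different allele frequencies.
rng = np.random.default_rng(0)
freq_a = rng.uniform(0.1, 0.9, size=200)
freq_b = rng.uniform(0.1, 0.9, size=200)
pop_a = rng.binomial(2, freq_a, size=(30, 200))
pop_b = rng.binomial(2, freq_b, size=(30, 200))

coords = pca_project(np.vstack([pop_a, pop_b]))
print(coords.shape)
```

In this toy setting the first component separates the two groups, which is the same effect the population plot in this notebook relies on.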

# Specifics on filtering
<p>
    Filtering the data was straightforward. Because some regions of the genome vary more than others, I kept only the highly variable sites: SNPs that are seen in more than 5% of the data.
</p>
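
One way to read the 5% rule is as a threshold on the alternate-allele frequency of each SNP. The sketch below takes that reading; the function name `filter_common_snps`, the genotype-count matrix, and the toy values are hypothetical and are not taken from the project's `process` step.

```python
import numpy as np

def filter_common_snps(genotypes, min_freq=0.05):
    """Keep SNP rows whose alternate-allele frequency exceeds min_freq.

    genotypes: (n_snps, n_individuals) array of alternate-allele
    counts per individual (0, 1, or 2). Assumed layout.
    """
    genotypes = np.asarray(genotypes)
    # Each diploid individual carries two alleles, so the frequency is
    # the mean count divided by 2.
    alt_freq = genotypes.mean(axis=1) / 2.0
    keep = alt_freq > min_freq
    return genotypes[keep], keep

# Toy data (made up): 3 SNPs across 10 individuals.
toy = np.array([
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],  # never observed
    [1, 1, 0, 2, 0, 0, 1, 0, 0, 0],  # 5 of 20 alleles
    [2, 2, 2, 2, 2, 2, 2, 2, 2, 2],  # every allele
])

filtered, keep = filter_common_snps(toy)
print(keep)
```

A stricter variant would threshold the minor-allele frequency, `min(f, 1 - f)`, which also drops sites where nearly everyone carries the alternate allele.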

# PCA Plot
![plot](data/plots/final.png)

# Results
<p>
    The plot shows clearly that populations cluster together. The PCA coordinates numerically quantify individuals' genetic distance from each other, and individuals from the same population tend to fall close together. This supports the belief that genetic variation depends on population, i.e. individuals from the same population are genetically more similar than individuals from different populations. The finding matters because a study of how genetic variation affects a certain phenotype has to account for the variation caused simply by differences in geography; otherwise it risks measuring geographical differences rather than the phenotype.
</p>

# Limitations
<p>
    The 1000 Genomes Project studies only a limited number of populations. Expanding the study to more populations around the world would help show which populations are more closely related than others. Geographic proximity alone may not determine genetic similarity. For example, bordering countries like North Korea and South Korea might be less similar than countries like Portugal and Spain, because the amount of migration between each pair of neighbors is vastly different, and less migration could mean less genetic closeness between populations.
</p>

# Conclusion
<p>
    Even with a relatively small amount of data, it is straightforward to show that genetic distance and geographical location/ancestry are correlated. In the PCA plot, populations tend to cluster together, and individuals are genetically more similar to people with similar ancestry than to people with different ancestry.
</p>