## Week 3: Tutorial

## Goal: Investigate archaic ancestry in modern humans

### Set your individual

In [None]:
# REPLACE with your individual
my_individual = 'NA18974'

## Installing requirements

Connect to GitHub and load the necessary data and tools (runtime: about 2 min).

In [None]:
%%bash
# clone the course repo, build IBDmix, and set up the bundled tools
cd /content/
rm -rf Spring-2024 IBDmix
git clone https://github.com/CCB293/Spring-2024
git clone https://github.com/PrincetonUniversity/IBDmix.git
cd IBDmix
mkdir build
cd build
cmake ..
cmake --build .
cd /content/
ln -s /content/IBDmix/build/src/ibdmix /content/Spring-2024/bin/ibdmix
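export LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu:$LD_LIBRARY_PATH
# -y answers the install prompt automatically (the cell is non-interactive)
apt-get install -y libgsl-dev
# -f overwrites the link if it already exists, so the cell can be re-run
ln -sf /usr/lib/x86_64-linux-gnu/libgsl.so /usr/lib/x86_64-linux-gnu/libgsl.so.0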
cd /content
chmod +x Spring-2024/bin/smartpca
chmod +x Spring-2024/bin/admixture
chmod +x Spring-2024/bin/tabix
chmod +x Spring-2024/bin/vcftools
chmod +x Spring-2024/bin/bcftools
cd Spring-2024/data/1000G_archaic/ && unzip 1000G_archaic.geno.zip && gunzip altai_22_sub.gz
echo "Packages installed"

In [None]:
# load the libraries
import matplotlib.pyplot as plt
from collections import Counter
import numpy as np
import pandas as pd
import colorsys
import seaborn as sns
from IPython.display import Image
import os
import json

Define plotting functions

In [None]:
# define plotting functions

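def plot_slod_chrom(subdf, rl, rh):
  """Plot slod for one individual's segments between positions rl and rh (bp).

  Assumes the rows of subdf are sorted by start and do not overlap.
  """
  # segment boundaries in order: start1, end1, start2, end2, ...
  x = np.array(sorted(subdf['start'].to_list() + subdf['end'].to_list()))
  # value paired with each boundary: 0 at a start, the segment's slod at its end
  y = []
  for i in subdf['slod'].to_list():
    y.append(0)
    y.append(i)
  y = np.array(y)
  # keep only the boundaries inside the plotting window
  in_range = (x > rl) & (x < rh)
  xp = x[in_range]
  y = y[in_range]
  # sample every 1 kb: use the value at the next boundary at or after each position
  grid = np.arange(rl, rh, 1000)
  yp = []
  for j in grid:
    hits = y[xp >= j]
    if len(hits) > 0:
      yp.append(hits[0])
    elif len(y) > 0:
      yp.append(y[-1])
    else:
      # no segment boundary falls in the window
      yp.append(0)
  plt.figure(figsize=(10, 3))
  plt.scatter(grid, yp, s=3)
  # slod = 4 is the cutoff used for filtering later in this tutorial
  plt.axhline(y=4, color='red', linestyle='--')
  plt.ylabel('slod')
  plt.xlabel('Genomic position (bp)')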


In [None]:
# set environment variable
import os
os.environ['PATH'] += ":/content/Spring-2024/bin"
!echo $PATH
# set current directory
%cd /content/Spring-2024/data/1000G_archaic/

# Analysis of archaic introgression

## IBDmix
Chen, Lu, et al. "Identifying and interpreting apparent Neanderthal ancestry in African individuals." Cell 180.4 (2020): 677-687.
https://www.sciencedirect.com/science/article/pii/S0092867420300593#sec4

Usage: `!../../bin/ibdmix -g 'altai_22_sub' -d 0 -t -i -o 'altai_22_sub_output'`

In [None]:
# Main idea for identifying segments of archaic ancestry
Image(filename='IBDmix.png', height=500)

$\mathrm{LOD} = \log\left( \frac{P(\mathrm{Data} \mid \mathrm{IBD})}{P(\mathrm{Data} \mid \mathrm{nonIBD})} \right)$

Assume the observed data (for alleles A, a) is AA in Neanderthal, Aa in your individual.

$P(\mathrm{Data} \mid \mathrm{IBD}) = P_O(AA, Aa \mid \mathrm{IBD})$

Parameters in probability calculation:
* mutation rate
* divergence time between groups
* genotyping error (sequencing, algorithm, etc.)

SLOD = cumulative LOD for all SNPs in a region
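
To make the SLOD definition concrete, here is a minimal sketch with made-up per-SNP LOD values. It is not output from IBDmix and does not reproduce how IBDmix chooses segment boundaries; it only shows that the SLOD of a region is the sum of the LOD scores of the SNPs inside it. The positions, scores, and the helper name `region_slod` are illustrative assumptions.

```python
import numpy as np

# Made-up SNP positions (bp) and per-SNP LOD scores, for illustration only
positions = np.array([100, 250, 400, 900, 1500, 2100])
lod = np.array([0.8, 1.2, -0.3, 1.5, 2.0, -1.1])

def region_slod(positions, lod, start, end):
    """Sum the LOD scores of all SNPs with start <= position <= end."""
    inside = (positions >= start) & (positions <= end)
    return float(lod[inside].sum())

slod_region = region_slod(positions, lod, 100, 1500)
# same cutoff (slod > 4) that is used for filtering later in this tutorial
passes_threshold = slod_region > 4
print(round(slod_region, 2), passes_threshold)
```

SNPs with negative LOD lower the running total, so extending a region across sites that look non-IBD reduces its SLOD.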

In [None]:
!../../bin/ibdmix --help

In [None]:
!head -n 5 'altai_22_sub'

### Run IBDmix


In [None]:
# -g = genotype file name, -d = LOD score threshold
# -t = additional summary stats, -i = regions are inclusive [start, end]
# -o = output file name
!../../bin/ibdmix -g 'altai_22_sub' -d 0 -t -i -o 'altai_22_sub_output'

In [None]:
# check output file
# raw string so that \s is read as a regex, not a Python string escape
raw_output = pd.read_csv('altai_22_sub_output', sep=r'\s+')
raw_output.head()

In [None]:
# plot slod across the genome
# function usage: plot_slod_chrom(dataframe, start_pos, end_pos)
plot_slod_chrom(raw_output[raw_output.ID == my_individual], 16e6, 18e6)

Filter the data by `slod > 4`

In [None]:
# filter by slod
raw_output['length'] = raw_output['end'] - raw_output['start']
raw_output['Archaic_proportion'] = raw_output['length'] / 2908180
filter_slod = raw_output[raw_output.slod > 4]

### Stop! Check your understanding
1. How many regions are identified as IBD with Neanderthal for your individual in this subset of data?
2. What is the mean length of 'Archaic' segments on this chromosome?
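
One way to approach these questions is sketched below on a made-up table, not the real output. It assumes only the column names this notebook already uses (`ID`, `start`, `end`, `slod`); the individual IDs and values are invented for illustration.

```python
import pandas as pd

# Made-up segments with the columns used in this notebook
toy = pd.DataFrame({
    'ID':    ['IND1', 'IND1', 'IND1', 'IND2'],
    'start': [1000, 50000, 90000, 2000],
    'end':   [21000, 60000, 150000, 42000],
    'slod':  [5.5, 1.2, 8.0, 4.6],
})
toy['length'] = toy['end'] - toy['start']

# keep segments above the slod cutoff for a single individual
mine = toy[(toy['slod'] > 4) & (toy['ID'] == 'IND1')]
n_segments = len(mine)
mean_length = mine['length'].mean()
print(n_segments, mean_length)
```

On the real data, the same pattern would be applied to `filter_slod` with `my_individual` in place of `'IND1'`.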

### Exercise: population level summary statistics

#### Merge datasets so that `filter_slod` dataframe includes population information for each individual

In [None]:
# Get individuals dataframe
# sep=r'\s+' splits on any whitespace (delim_whitespace is deprecated in recent pandas)
individuals = pd.read_csv('1000G_archaic.ind', sep=r'\s+', header=None, names=['individual', 'sex', 'population'])
population_info = pd.read_csv('population_info.csv')
individuals = individuals.merge(population_info, on='population', how='left').dropna()
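
The merge asked for in the next cell can follow the same pattern as the one above. Here is a minimal sketch on made-up tables, assuming the segment table identifies individuals in an `ID` column and the individuals table in an `individual` column, as in this notebook. The `continent` column name is hypothetical; check `population_info.csv` for the real column names.

```python
import pandas as pd

# Made-up segment table (stands in for filter_slod)
segments = pd.DataFrame({
    'ID': ['IND1', 'IND1', 'IND2', 'IND3'],
    'length': [20000, 60000, 40000, 15000],
})
# Made-up individuals table (stands in for individuals)
people = pd.DataFrame({
    'individual': ['IND1', 'IND2'],
    'population': ['POP_A', 'POP_B'],
    'continent': ['CONT_X', 'CONT_Y'],  # hypothetical column name
})

# inner join: segments whose ID has no match in `people` (IND3) are dropped
merged = segments.merge(people, left_on='ID', right_on='individual', how='inner')
print(merged)
```

Using `how='left'` instead would keep unmatched segments with missing population values, which is useful for spotting individuals that failed to match.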

In [None]:
# merge filter_slod with individuals

#### Plot a histogram of 'Archaic' segment lengths for the population of your individual.

In [None]:
# check your population

In [None]:
# histogram

#### Plot a boxplot of 'Archaic_proportion' for all continental groups
Use `seaborn` package
https://seaborn.pydata.org/generated/seaborn.boxplot.html
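
For the outlier step that follows the boxplot, one common convention is to drop values more than 1.5 times the interquartile range (IQR) beyond the quartiles of their own group. The sketch below applies that rule to made-up data with pandas only. The group labels, values, and the helper name `drop_outliers` are illustrative assumptions, and this is one possible approach rather than the intended solution.

```python
import pandas as pd

# Made-up values; group A contains one extreme value (0.200)
toy = pd.DataFrame({
    'group': ['A'] * 5 + ['B'] * 5,
    'Archaic_proportion': [0.010, 0.012, 0.011, 0.013, 0.200,
                           0.020, 0.022, 0.021, 0.023, 0.024],
})

def drop_outliers(df, value_col, group_col):
    """Keep rows within 1.5 * IQR of their own group's quartiles."""
    grouped = df.groupby(group_col)[value_col]
    # transform broadcasts each group's quartile back onto its rows
    q1 = grouped.transform(lambda s: s.quantile(0.25))
    q3 = grouped.transform(lambda s: s.quantile(0.75))
    iqr = q3 - q1
    keep = (df[value_col] >= q1 - 1.5 * iqr) & (df[value_col] <= q3 + 1.5 * iqr)
    return df[keep]

trimmed = drop_outliers(toy, 'Archaic_proportion', 'group')
print(len(toy), len(trimmed))
```

If the goal is only a cleaner figure, passing `showfliers=False` to `sns.boxplot` hides the outlier points without removing any rows from the data.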

In [None]:
# boxplot
import seaborn as sns


In [None]:
# remove outliers to make the figure better


#### Plot a boxplot of 'Archaic_proportion' for all populations of your continental group

In [None]:
# boxplot

In [None]:
# remove outliers to make the figure better