# PCA Yemen samples - Context: Reich (Allen Datasets)

This analysis uses 169 Yemen samples together with samples from Chad and Pakistan.
The Reich (Allen) datasets contain many samples from countries surrounding the Arabian Peninsula (amongst many others).
This context data is taken from the [Reich lab](https://reich.hms.harvard.edu/allen-ancient-dna-resource-aadr-downloadable-genotypes-present-day-and-ancient-dna-data) web page.
The tables below list the number of variants shared between each study dataset and each context dataset.


|Study | Context | Overlap (variants)|
|---|---|---|
|Yemen | HGDP | 329K|
|Yemen | Reich "1240K" (Ancient) | 358K|
|Yemen | Reich Human Origins | 118K|
|Yemen | UKB | 91741|
|Yemen | UAE | 609685|
|UAE | HGDP | 552476|
|UAE | Reich Human Origins | 179364|
|UAE | Reich 1240K | 596568|
|UAE | UKB | 132315|


|Study 1 | Study 2 | Context | Overlap (variants)|
|---|---|---|---|
|Yemen | UAE | HGDP | 328303|
|Yemen | UAE | Reich Human Origins | 118080|
|Yemen | UAE | Reich 1240K | 358610|
|Yemen | UAE | UKB | 91473|



## Data preprocessing:
### Merging datasets for contextualization
* Identify overlapping variants (based purely on rs-IDs) and write them to `*.snps` files (see `variantIntersections2.py`)
* Merge the datasets with PLINK (see `runPCA.sh`)
* Conduct the PCA with Docker/flashPCA (see Syafiq's wiki)
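
The rs-ID intersection step can be sketched as follows. This is a minimal illustration, not the contents of `variantIntersections2.py` (which is not shown here). It assumes PLINK `.bim` files, where the variant ID is the second whitespace-separated column; the function names `read_rsids` and `write_snps` are hypothetical.

```python
from pathlib import Path


def read_rsids(bim_path):
    """Return the set of variant IDs (column 2) of a PLINK .bim file."""
    ids = set()
    with open(bim_path) as handle:
        for line in handle:
            fields = line.split()
            if len(fields) >= 2:
                ids.add(fields[1])
    return ids


def write_snps(bim_paths, out_path):
    """Write the IDs shared by all .bim files, one per line; return the count."""
    shared = set.intersection(*(read_rsids(p) for p in bim_paths))
    Path(out_path).write_text("\n".join(sorted(shared)) + "\n")
    return len(shared)
```

A call such as `write_snps(['yemen.bim', 'reichHO.bim'], 'yemen_reichHO.snps')` (file names made up for illustration) would produce a variant list that can be passed to PLINK's `--extract` option before merging.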

In [7]:
from collections import Counter

import pandas as pd
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

from IPython.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))

%matplotlib notebook

### Comparison between the two Reich datasets
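The 1240K+HO dataset (which includes "Human Origins") has far more samples from nearby countries (e.g. 107 versus 2 from Yemen in the table below). The downside is that the variant overlap with the Yemen data drops considerably (118K versus 358K, see the overlap table above).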

In [8]:
pd.read_csv('Reich/reich1240K_vs_HO.csv', sep='\t')

Unnamed: 0,Country,ReichHO,1240K
0,Ethiopia,13,1
1,Israel,369,236
2,Jordan,49,39
3,Lebanon,65,37
4,Sudan,4,1
5,Syria,24,16
6,Yemen,107,2


## Yemen metadata
Extract the region information for the Yemen samples. It is later combined with the context metadata and joined to the PCA table.

In [9]:
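import re

def readZallouaMetadata(manifest, selcols, colnames=('Id', 'Region')):
    """Read a sample manifest and return the selected columns, indexed by sample Id."""
    # Skip the first 8 lines of the manifest before parsing the table
    regionalContext = pd.read_csv(manifest, skiprows=8, sep=',', skip_blank_lines=True)
    regionalContext = regionalContext.iloc[:, selcols]
    regionalContext.columns = list(colnames)
    regionalContext.set_index('Id', inplace=True)
    # Drop samples without a value in the non-Id column(s)
    return regionalContext.dropna(subset=list(colnames[1:]))

yemenMeta = readZallouaMetadata("Metadata/3577stdy_manifest_3450_190315.csv.gz", [2,3], colnames=['Id','RegionId'])

## some cleanup
# Use the first run of non-digit characters of RegionId as the region label
# (raw string, since '\D' is not a valid escape in a normal string literal)
yemenMeta["Region"] = [re.findall(r'\D+', sup)[0] for sup in yemenMeta['RegionId']]
yemenMeta = yemenMeta[['Region']]
yemenMeta = yemenMeta[yemenMeta.Region!='Empty']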
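
To illustrate the cleanup step: the region label is the first run of non-digit characters in `RegionId`. The sketch below wraps this in a helper that also tolerates values without any non-digit characters. The helper name `region_label` and the example value `'Rsa12'` are assumptions for illustration, since the actual `RegionId` format is not shown here.

```python
import re


def region_label(region_id):
    """Return the first run of non-digit characters, or None if there is none."""
    match = re.search(r'\D+', str(region_id))
    return match.group(0) if match else None


print(region_label('Rsa12'))  # Rsa
print(region_label('42'))     # None
```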

In [10]:
# Chad/Pakistan context - loading Metadata
chadContext = readZallouaMetadata("Metadata/yemcha_manifest_3558_160415.csv.gz", [2,8])
pakistan = readZallouaMetadata("Metadata/sc_egyptlc_manifest_1764_050613.csv.gz", [2,8])
meta = pd.concat([yemenMeta, chadContext, pakistan])
print(Counter(meta.Region))

Counter({'Chad': 276, 'Rsa': 48, 'Pakistan': 48, 'Ibb': 25, 'Hdr': 25, 'Tiz': 24, 'Dhm': 10, 'Abyn': 10, 'Sad': 10, 'Mhw': 10, 'Lahj': 10, 'Jwf': 10, 'San': 10, 'Mrb': 10, 'Amr': 10, 'Haj': 10, 'Shb': 10, 'Dal': 9, 'Byd': 7})


## Reich: Dataset 1240K+HO (Ancient DNA + Human Origins)
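The context is large (13K samples), so we downsample it (the PCA itself is done on all samples).
Downsampling policy: keep up to 107 Yemen samples (relevance=2), randomly pick up to 25 samples from each nearby country (relevance=1), and randomly pick up to 10 samples from each of the other selected countries (relevance=0).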

In [11]:
reichMeta=pd.read_csv("Reich/v44.3_HO_public.anno", sep='\t')
reichMeta.columns = ['Index', 'Id', 'Id2',
       'Publication', 'contact',
       'Date',
       'Full_Date',
       'Group_Label', 'Locality', 'Region', 'Lat', 'Long',
       'Data source', 'Cov_autosm',
       'SNPs_autosm', 'Sex',
       'Library_type',
       'ASSESSMENT']
reichMeta
## Ancient DNA filter
#reichMeta = reichMeta[((reichMeta.Date>1000) & (reichMeta.SNPs_autosm>100000))]#sort_values(by='Date', ascending=False).head(20)

Unnamed: 0,Index,Id,Id2,Publication,contact,Date,Full_Date,Group_Label,Locality,Region,Lat,Long,Data source,Cov_autosm,SNPs_autosm,Sex,Library_type,ASSESSMENT
0,1798,MAL-005,MAL-005,SkoglundCell2017,Garrett Hellenthal / Saioa Lopez / Mark Thomas...,0,..,Malawi_Yao,Dedza // Yao,Malawi,-14.166667,34.33333,Fall2015,..,585645,M,..,PASS (genotyping)
1,1799,MAL-009,MAL-009,SkoglundCell2017,Garrett Hellenthal / Saioa Lopez / Mark Thomas...,0,..,Malawi_Yao,Machinga // Yao,Malawi,-14.862605,35.574122,Fall2015,..,582189,M,..,PASS (genotyping)
2,1800,MAL-011,MAL-011,SkoglundCell2017,Garrett Hellenthal / Saioa Lopez / Mark Thomas...,0,..,Malawi_Chewa,Mchinga // Chichewa,Malawi,-14.862605,35.574122,Fall2015,..,579844,M,..,PASS (genotyping)
3,1801,MAL-012,MAL-012,SkoglundCell2017,Garrett Hellenthal / Saioa Lopez / Mark Thomas...,0,..,Malawi_Chewa,Salima // Chichewa,Malawi,-13.75,34.5,Fall2015,..,585204,M,..,PASS (genotyping)
4,1802,MAL-014,MAL-014,SkoglundCell2017,Garrett Hellenthal / Saioa Lopez / Mark Thomas...,0,..,Malawi_Chewa,Nambuma // Chichewa,Malawi,-13.703473,33.597743,Fall2015,..,584410,M,..,PASS (genotyping)
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
13192,31953,VK94.SG,VK94,MargaryanWillerslevNature2020,"Nielsen, Rasmus; Werge, Thomas; Willerslev, Eske",1000,800-1100 CE,Denmark_Viking.SG,"Sealand, Gl. Lejre",Denmark,55.62,11.97,Shotgun,0.14,63365,F,ds.minus,PASS (literature)
13193,31954,VK95.SG,VK95,MargaryanWillerslevNature2020,"Nielsen, Rasmus; Werge, Thomas; Willerslev, Eske",850,900-1300 CE,Iceland_Viking.SG,Hofstadir,Iceland,65.61,-17.16,Shotgun,1.32,422417,M,ds.minus,PASS (literature)
13194,31955,VK98.SG,VK98,MargaryanWillerslevNature2020,"Nielsen, Rasmus; Werge, Thomas; Willerslev, Eske",850,900-1300 CE,Iceland_Viking.SG,Hofstadir,Iceland,65.61,-17.16,Shotgun,2.49,524046,M,ds.minus,PASS (literature)
13195,31956,VK99.SG,VK99,MargaryanWillerslevNature2020,"Nielsen, Rasmus; Werge, Thomas; Willerslev, Eske",850,900-1300 CE,Iceland_Viking.SG,Hofstadir,Iceland,65.61,-17.16,Shotgun,0.74,320753,F,ds.minus,PASS (literature)


In [13]:
## rich annotation, for the time being, just selecting country (region)

reichMeta = reichMeta[['Id','Region','Group_Label']].set_index('Id')
countryStats = Counter(reichMeta.Region)
reichMeta

Unnamed: 0_level_0,Region,Group_Label
Id,Unnamed: 1_level_1,Unnamed: 2_level_1
MAL-005,Malawi,Malawi_Yao
MAL-009,Malawi,Malawi_Yao
MAL-011,Malawi,Malawi_Chewa
MAL-012,Malawi,Malawi_Chewa
MAL-014,Malawi,Malawi_Chewa
...,...,...
VK94.SG,Denmark,Denmark_Viking.SG
VK95.SG,Iceland,Iceland_Viking.SG
VK98.SG,Iceland,Iceland_Viking.SG
VK99.SG,Iceland,Iceland_Viking.SG


In [325]:
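# Countries close to Yemen; these get a larger sample cap than the remaining context
closeContext = 'Ethiopia,Lebanon,Eritrea,Israel,Sudan,Syria,Yemen,Jordan,Iraq,Iran,Egypt,Saudi Arabia'.split(',')
def relevance(country):
    """Relevance tier: 2 = Yemen, 1 = nearby country, 0 = any other country."""
    if country =='Yemen': return 2
    if country in closeContext: return 1
    return 0
def downsample(country, caps=(20, 50, 107)):
    """Randomly keep at most caps[relevance(country)] samples of a country."""
    df = reichMeta[reichMeta.Region==country]
    samplesize = caps[relevance(country)]
    if samplesize < df.shape[0]:
        return df.sample(samplesize)
    return df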

selCountries = 'Armenia,Ukraine,Pakistan,Turkey,Congo,Italy,China,Morocco,India,Tunisia'.split(',') + closeContext
reichMeta50 = pd.concat([downsample(country, [10, 25, 107]) for country in countryStats if country in selCountries])
Counter(reichMeta50.Region).most_common()

[('Yemen', 107),
 ('Israel', 25),
 ('Iran', 25),
 ('Lebanon', 25),
 ('Jordan', 25),
 ('Egypt', 25),
 ('Syria', 24),
 ('Iraq', 14),
 ('Ethiopia', 13),
 ('Armenia', 10),
 ('Turkey', 10),
 ('Ukraine', 10),
 ('Pakistan', 10),
 ('Congo', 10),
 ('Italy', 10),
 ('China', 10),
 ('Morocco', 10),
 ('Saudi Arabia', 10),
 ('India', 10),
 ('Tunisia', 10),
 ('Sudan', 4),
 ('Eritrea', 3)]
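
Because `DataFrame.sample` draws a different subset on every run unless a seed is given, the selection above is not reproducible as written. A minimal sketch of a seeded variant follows. The function name `downsample_seeded`, the toy data, and the seed value are assumptions for illustration, not part of the original analysis.

```python
import pandas as pd


def downsample_seeded(meta, caps, default_cap, seed=0):
    """Keep at most caps.get(region, default_cap) rows per region, reproducibly."""
    parts = []
    for region, group in meta.groupby('Region', sort=False):
        cap = caps.get(region, default_cap)
        if len(group) > cap:
            # A fixed random_state makes the draw identical across runs
            group = group.sample(n=cap, random_state=seed)
        parts.append(group)
    return pd.concat(parts)


# Toy metadata: 5 samples from one region, 2 from another
toy = pd.DataFrame(
    {'Region': ['Yemen'] * 5 + ['Italy'] * 2},
    index=[f'sample{i}' for i in range(7)],
)
subset = downsample_seeded(toy, caps={'Yemen': 3}, default_cap=1)
print(subset.Region.value_counts().to_dict())
```

Fixing the seed keeps the sample IDs shown in later tables and plots stable between notebook runs.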

In [326]:
#Simple combination
meta = pd.concat([yemenMeta, reichMeta50])
reichMeta50

Unnamed: 0_level_0,Region
Id,Unnamed: 1_level_1
EZI-074,Armenia
ARM001,Armenia
EZI-010,Armenia
armenia139,Armenia
ASR_005,Armenia
...,...
S_Iraqi_Jew-2.DG,Iraq
S_Iraqi_Jew-1.DG,Iraq
Eritrean_1,Eritrea
Eritrean_2,Eritrea


In [1]:
from utils import colors2, markers

### Loading PCA data
The merged dataset was run through flashPCA. The resulting principal components for the combined dataset are shown below.

In [5]:
pca = pd.read_csv('YemenReichHO/pcs.txt', delimiter='\t')


In [6]:
pca.head()

Unnamed: 0,FID,IID,PC1,PC2,PC3,PC4,PC5,PC6,PC7,PC8,PC9,PC10
0,1,MAL-005,-0.668563,-0.186939,-0.012376,-0.008212,0.002309,-0.016566,-0.028582,0.014858,0.011621,-0.004653
1,2,MAL-009,-0.652698,-0.185005,-0.006781,-0.010956,-0.005993,-0.021205,-0.02028,0.004516,0.008802,-0.006408
2,3,MAL-011,-0.671137,-0.187152,-0.009245,-0.013635,0.005009,-0.013627,-0.014207,0.006091,0.016281,-0.007094
3,4,MAL-012,-0.665634,-0.184766,-0.006079,-0.004144,0.000172,-0.016001,-0.032014,0.008432,0.020338,-0.012785
4,5,MAL-014,-0.667511,-0.187095,-0.004709,-0.011141,-0.004765,-0.018777,-0.034707,0.002222,0.002565,-0.007879


In [329]:
# Keep only the part after the last '_' (e.g. the 3577STDY... id of the urn:wtsi:... entries)
ids = [iid.split('_')[-1] for iid in pca['IID']]
pca['FID1'] = ids
pca.set_index('FID1', inplace=True)
pca

Unnamed: 0_level_0,FID,IID,PC1,PC2,PC3,PC4,PC5,PC6,PC7,PC8,PC9,PC10
FID1,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1
MAL-005,1,MAL-005,-0.668563,-0.186939,-0.012376,-0.008212,0.002309,-0.016566,-0.028582,0.014858,0.011621,-0.004653
MAL-009,2,MAL-009,-0.652698,-0.185005,-0.006781,-0.010956,-0.005993,-0.021205,-0.020280,0.004516,0.008802,-0.006408
MAL-011,3,MAL-011,-0.671137,-0.187152,-0.009245,-0.013635,0.005009,-0.013627,-0.014207,0.006091,0.016281,-0.007094
MAL-012,4,MAL-012,-0.665634,-0.184766,-0.006079,-0.004144,0.000172,-0.016001,-0.032014,0.008432,0.020338,-0.012785
MAL-014,5,MAL-014,-0.667511,-0.187095,-0.004709,-0.011141,-0.004765,-0.018777,-0.034707,0.002222,0.002565,-0.007879
...,...,...,...,...,...,...,...,...,...,...,...,...
3577STDY6068568,urn:wtsi:402769_H09_3577STDY6068568,urn:wtsi:402769_H09_3577STDY6068568,-0.139141,0.127716,0.040863,0.011073,-0.116100,0.046633,0.016714,0.062719,0.005606,-0.013736
3577STDY6068600,urn:wtsi:402769_H10_3577STDY6068600,urn:wtsi:402769_H10_3577STDY6068600,-0.082048,0.159256,0.038233,0.003569,-0.128213,0.042367,0.025431,0.069612,0.009617,-0.010890
3577STDY6068625,urn:wtsi:402769_H11_3577STDY6068625,urn:wtsi:402769_H11_3577STDY6068625,-0.109379,0.145012,0.041254,0.007771,-0.127986,0.046414,0.022386,0.074517,0.011113,-0.008102
3577STDY6068642,urn:wtsi:402769_H12_3577STDY6068642,urn:wtsi:402769_H12_3577STDY6068642,-0.067380,0.176812,0.045047,0.005572,-0.137002,0.050298,0.028667,0.067988,-0.000506,-0.010602
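
Note that splitting every `IID` on `_` also shortens context identifiers that contain an underscore themselves (for example `Eritrean_1` or `ASR_005` from the downsampled table above), so those samples would no longer match the metadata index in a later join. A minimal sketch of a more conservative normalization, which only strips the `urn:wtsi:` style prefix, is shown below. The helper name `normalize_iid` is hypothetical, and it assumes that only the Yemen study IDs carry this prefix.

```python
def normalize_iid(iid, prefix='urn:wtsi:'):
    """Strip the plate/well prefix from urn:wtsi IDs; leave other IDs untouched."""
    if iid.startswith(prefix):
        return iid.rsplit('_', 1)[-1]
    return iid


for iid in ['urn:wtsi:402769_H09_3577STDY6068568', 'Eritrean_1', 'MAL-005']:
    print(iid, '->', normalize_iid(iid))
```

With such a helper, `pca['FID1'] = pca['IID'].map(normalize_iid)` could replace the list comprehension.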


In [330]:
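## Including admix data, concat dataframes
k = 12  # number of ancestral populations (K) in the admixture result
admix =  "Admixture/YemenHO/yemen_clean_reichHO4.%s.Q" % k
q = pd.read_csv(admix, header=None, sep=' ')
# The .Q file has no sample IDs; rows are assumed to be in the same order as the PCA table
q.index = pca.index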
pops = ['Pop%02d'%i for i in range(k)]
q.columns = pops
q

Unnamed: 0_level_0,Pop00,Pop01,Pop02,Pop03,Pop04,Pop05,Pop06,Pop07,Pop08,Pop09,Pop10,Pop11
FID1,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1
MAL-005,0.000010,0.000010,0.000010,0.000010,0.000010,0.028327,0.000010,0.000010,0.003226,0.955406,0.000010,0.012962
MAL-009,0.000010,0.000477,0.000010,0.000010,0.000010,0.027632,0.000010,0.000012,0.002015,0.954374,0.000010,0.015430
MAL-011,0.000963,0.006671,0.000010,0.000834,0.000010,0.036228,0.000010,0.000010,0.000010,0.950177,0.000010,0.005068
MAL-012,0.000010,0.000010,0.000010,0.003965,0.004643,0.022376,0.000010,0.000010,0.000010,0.960076,0.000010,0.008871
MAL-014,0.000087,0.002577,0.000010,0.000010,0.000010,0.021992,0.000010,0.000010,0.000010,0.965851,0.000010,0.009423
...,...,...,...,...,...,...,...,...,...,...,...,...
3577STDY6068568,0.037113,0.002255,0.001714,0.017097,0.015108,0.006066,0.017859,0.203411,0.000010,0.192776,0.000491,0.506100
3577STDY6068600,0.027027,0.027276,0.003619,0.005126,0.024519,0.001236,0.010862,0.235752,0.019285,0.119312,0.000010,0.525975
3577STDY6068625,0.025257,0.016251,0.002803,0.015455,0.007714,0.000010,0.002243,0.222281,0.017475,0.157613,0.000010,0.532887
3577STDY6068642,0.030690,0.027128,0.003910,0.006031,0.022336,0.004154,0.018254,0.242268,0.001633,0.089835,0.000010,0.553751
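
Since the `.Q` file carries no sample identifiers, a quick consistency check before assigning the index can catch a mismatch early. The following is a minimal sketch on synthetic data. The helper name `check_q` and the tolerance are assumptions, not part of the original workflow.

```python
import numpy as np
import pandas as pd


def check_q(q, n_samples, atol=1e-3):
    """Verify one row per sample and that ancestry proportions sum to ~1 per row."""
    if len(q) != n_samples:
        raise ValueError(f'expected {n_samples} rows, found {len(q)}')
    row_sums = q.sum(axis=1).to_numpy()
    if not np.allclose(row_sums, 1.0, atol=atol):
        raise ValueError('ancestry proportions do not sum to 1 for every sample')
    return True


# Synthetic example with K=3 populations and 2 samples
q_demo = pd.DataFrame([[0.2, 0.3, 0.5], [0.9, 0.05, 0.05]])
print(check_q(q_demo, n_samples=2))
```

In the notebook this would be called as `check_q(q, n_samples=len(pca))` before `q.index = pca.index`. It cannot detect a reordering of rows, only a wrong row count or malformed proportions.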


In [331]:
pca = pd.concat([pca, q], axis=1)
pca

Unnamed: 0_level_0,FID,IID,PC1,PC2,PC3,PC4,PC5,PC6,PC7,PC8,...,Pop02,Pop03,Pop04,Pop05,Pop06,Pop07,Pop08,Pop09,Pop10,Pop11
FID1,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
MAL-005,1,MAL-005,-0.668563,-0.186939,-0.012376,-0.008212,0.002309,-0.016566,-0.028582,0.014858,...,0.000010,0.000010,0.000010,0.028327,0.000010,0.000010,0.003226,0.955406,0.000010,0.012962
MAL-009,2,MAL-009,-0.652698,-0.185005,-0.006781,-0.010956,-0.005993,-0.021205,-0.020280,0.004516,...,0.000010,0.000010,0.000010,0.027632,0.000010,0.000012,0.002015,0.954374,0.000010,0.015430
MAL-011,3,MAL-011,-0.671137,-0.187152,-0.009245,-0.013635,0.005009,-0.013627,-0.014207,0.006091,...,0.000010,0.000834,0.000010,0.036228,0.000010,0.000010,0.000010,0.950177,0.000010,0.005068
MAL-012,4,MAL-012,-0.665634,-0.184766,-0.006079,-0.004144,0.000172,-0.016001,-0.032014,0.008432,...,0.000010,0.003965,0.004643,0.022376,0.000010,0.000010,0.000010,0.960076,0.000010,0.008871
MAL-014,5,MAL-014,-0.667511,-0.187095,-0.004709,-0.011141,-0.004765,-0.018777,-0.034707,0.002222,...,0.000010,0.000010,0.000010,0.021992,0.000010,0.000010,0.000010,0.965851,0.000010,0.009423
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3577STDY6068568,urn:wtsi:402769_H09_3577STDY6068568,urn:wtsi:402769_H09_3577STDY6068568,-0.139141,0.127716,0.040863,0.011073,-0.116100,0.046633,0.016714,0.062719,...,0.001714,0.017097,0.015108,0.006066,0.017859,0.203411,0.000010,0.192776,0.000491,0.506100
3577STDY6068600,urn:wtsi:402769_H10_3577STDY6068600,urn:wtsi:402769_H10_3577STDY6068600,-0.082048,0.159256,0.038233,0.003569,-0.128213,0.042367,0.025431,0.069612,...,0.003619,0.005126,0.024519,0.001236,0.010862,0.235752,0.019285,0.119312,0.000010,0.525975
3577STDY6068625,urn:wtsi:402769_H11_3577STDY6068625,urn:wtsi:402769_H11_3577STDY6068625,-0.109379,0.145012,0.041254,0.007771,-0.127986,0.046414,0.022386,0.074517,...,0.002803,0.015455,0.007714,0.000010,0.002243,0.222281,0.017475,0.157613,0.000010,0.532887
3577STDY6068642,urn:wtsi:402769_H12_3577STDY6068642,urn:wtsi:402769_H12_3577STDY6068642,-0.067380,0.176812,0.045047,0.005572,-0.137002,0.050298,0.028667,0.067988,...,0.003910,0.006031,0.022336,0.004154,0.018254,0.242268,0.001633,0.089835,0.000010,0.553751


In [332]:

ddf = pca.join(meta).dropna(subset=['Region'])
x = ddf[['PC1', 'PC2']].values
y = ddf['Region'].values
ddf

Unnamed: 0,FID,IID,PC1,PC2,PC3,PC4,PC5,PC6,PC7,PC8,...,Pop03,Pop04,Pop05,Pop06,Pop07,Pop08,Pop09,Pop10,Pop11,Region
3577STDY6068360,urn:wtsi:402768_A04_3577STDY6068360,urn:wtsi:402768_A04_3577STDY6068360,0.040019,0.129034,0.024720,0.075059,-0.045210,-0.028730,0.011798,0.021927,...,0.000010,0.241095,0.000010,0.000010,0.241566,0.010580,0.000010,0.209171,0.286707,Dal
3577STDY6068363,urn:wtsi:402768_B04_3577STDY6068363,urn:wtsi:402768_B04_3577STDY6068363,-0.081108,0.166468,0.042266,0.014726,-0.131718,0.043034,0.019084,0.056632,...,0.004235,0.031158,0.000010,0.014886,0.228387,0.000010,0.115170,0.014057,0.539281,Rsa
3577STDY6068364,urn:wtsi:402768_C04_3577STDY6068364,urn:wtsi:402768_C04_3577STDY6068364,-0.089888,0.167659,0.036224,0.012653,-0.131091,0.043207,0.024396,0.061888,...,0.003357,0.017301,0.004249,0.002878,0.226198,0.000010,0.118692,0.018615,0.541203,Rsa
3577STDY6068366,urn:wtsi:402768_D04_3577STDY6068366,urn:wtsi:402768_D04_3577STDY6068366,0.016327,0.123590,0.018694,0.046293,-0.058834,-0.005855,0.013593,0.028394,...,0.000010,0.185189,0.000010,0.000010,0.227247,0.002731,0.002764,0.146307,0.371392,Abyn
3577STDY6068367,urn:wtsi:402768_E04_3577STDY6068367,urn:wtsi:402768_E04_3577STDY6068367,-0.016454,0.173945,0.033367,0.028361,-0.122320,0.030789,0.031928,0.062608,...,0.000353,0.055599,0.003172,0.000010,0.240484,0.003779,0.026177,0.039694,0.577682,Sad
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
syria464,2090,syria464,-0.019923,0.182190,0.036389,0.027161,-0.095037,0.014229,0.025274,0.045190,...,0.000453,0.123974,0.003388,0.010675,0.277256,0.015056,0.050203,0.061638,0.394871,Syria
syria485,2111,syria485,-0.013214,0.192612,0.027955,0.018871,-0.078744,0.016143,0.013852,0.033427,...,0.005591,0.143307,0.000010,0.006617,0.295985,0.000010,0.048272,0.054234,0.330496,Syria
syria520,2122,syria520,-0.037624,0.197090,0.037525,0.001757,-0.084431,0.017726,0.019637,0.045405,...,0.000172,0.121271,0.003837,0.004173,0.320589,0.009133,0.072912,0.015476,0.358569,Syria
syria6,2154,syria6,-0.034033,0.180094,0.030262,0.023172,-0.089102,0.014303,0.024594,0.044767,...,0.004908,0.137202,0.003189,0.007571,0.266026,0.006833,0.069084,0.039515,0.397492,Syria


In [333]:
meta.shape

(648, 1)

In [334]:
from math import pi
from bokeh.palettes import Category20c, Paired


In [335]:
from bokeh.io import output_notebook
from bokeh.io import output_file, show
from bokeh.plotting import figure
from bokeh.transform import cumsum

output_notebook()

x_range=ddf.PC1.min(),ddf.PC1.max()
y_range=ddf.PC2.min(),ddf.PC2.max()

In [336]:
colorDict = dict(zip(pops, Paired[k]))

In [337]:
yemeni = ddf[ddf.Region.isin(list(markers))]
sortedPops = yemeni[pops].sum(axis=0).sort_values(ascending=False).keys()
colors = [colorDict[pop] for pop in sortedPops]

yemeniS = yemeni[sortedPops].sort_values(by=list(sortedPops), ascending=False)
ax=yemeniS.plot.bar(stacked=True, color=colors, width=1)
#sort_values(by=pops)


### Yemen data in Reich (Vyas mainly)

In [338]:
yemeni = ddf[ddf.Region=='Yemen']
yemeniS = yemeni[sortedPops].sort_values(by=list(sortedPops), ascending=False)
ax=yemeniS.plot.bar(stacked=True, color=colors, width=1)


### Admixture of worldwide context (local + a few selected distant locations)

In [319]:
countriesSortedGeo =  [ ('China', 10), ('India', 10), ('Pakistan', 10), ('Iran', 25), ('Iraq', 14), ('Saudi Arabia', 10),  ('Israel', 25), ('Lebanon', 25), ('Jordan', 25), ('Syria', 24), ('Morocco', 10), ('Tunisia', 10), ('Egypt', 25), ('Ethiopia', 13), ('Sudan', 4), ('Eritrea', 3), ('Armenia', 10), ('Turkey', 10), ('Ukraine', 10), ('Italy', 10), ('Congo', 10)]
others = pd.concat([ddf[ddf.Region==country] for country, _ in countriesSortedGeo])
others = others.iloc[::2].set_index('Region')
#others = others.sort_values(by=list(sortedPops), ascending=False).set_index('Region')
others[sortedPops].plot.bar(stacked=True, color=colors, width=1)


In [308]:
sortedPops

Index(['Pop11', 'Pop07', 'Pop09', 'Pop00', 'Pop04', 'Pop01', 'Pop10', 'Pop02',
       'Pop08', 'Pop03', 'Pop06', 'Pop05'],
      dtype='object')

In [342]:
p = figure(plot_height=1200, plot_width=1600, title="PCA + Admixture", x_range=x_range, y_range=y_range,
    tooltips=[("@pop", "@value"), ("Sample", "@sample"), ("Region", "@region")])
# one pie chart per sample:
# taking a row from the master table -> df, adding some convenience columns
for sample in list(ddf.index):
    region = ddf.loc[sample, 'Region']
    #if not region in markers.keys(): continue
    data = ddf.loc[sample, pops].reset_index(name='value').rename(columns={'index':'pop'})
    data['angle'] = data['value']/data['value'].sum() * 2*pi
    data['color'] = Paired[k]
    data['sample'] = sample
    data['line_color'] = colors2[region]
    data['region'] = region
    # Separate names, so the PC1/PC2 arrays x and y defined earlier are not overwritten
    px = ddf.loc[sample, 'PC1']
    py = ddf.loc[sample, 'PC2']
    radius = 0.002 if region in markers.keys() else 0.001

    p.wedge(x=px, y=py, radius=radius,
            start_angle=cumsum('angle', include_zero=True), end_angle=cumsum('angle'),
            line_color='line_color', fill_color='color',  line_width=3, source=data)
   

show(p)
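
Each sample is drawn as a small pie chart whose wedges are derived from the admixture proportions: every proportion is scaled to a fraction of 2π, and the running sum of these angles gives the start and end angle of each wedge (this is what `cumsum('angle', include_zero=True)` and `cumsum('angle')` compute on the Bokeh side). A minimal NumPy sketch of the same arithmetic, using made-up proportions:

```python
import numpy as np

proportions = np.array([0.5, 0.25, 0.25])  # made-up ancestry proportions
angles = proportions / proportions.sum() * 2 * np.pi

end_angles = np.cumsum(angles)                           # like cumsum('angle')
start_angles = np.concatenate(([0.0], end_angles[:-1]))  # like cumsum('angle', include_zero=True)

for start, end in zip(start_angles, end_angles):
    print(f'{start:.3f} -> {end:.3f}')
```

The last end angle equals 2π, so the wedges of one sample always close the full circle regardless of rounding in the `.Q` values.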

In [30]:
for region in Counter(ddf.Region):
    n = sum(y == region)
    alpha = 1 if region in markers.keys() else 0.1
    plt.scatter(
        x[y==region].T[0], x[y==region].T[1], label=region,
        c=colors2[region],
        marker=markers.get(region, '+'),
        alpha=alpha
    )
_ = plt.legend(ncol=3)


In [15]:
ddf = pca.join(meta).dropna(subset=['Region'])
x = ddf[['PC1', 'PC2']].values
y = ddf['Region'].values

for region in Counter(ddf.Region):
    n = sum(y == region)
    alpha = 1 if region in markers.keys() else 0.3
    plt.scatter(
        x[y==region].T[0], x[y==region].T[1], label=region,
        c=colors2[region],
        marker=markers.get(region, '+'),
        alpha=alpha
    )
#plt.legend(ncol=3)

In [16]:
plt.show()

In [103]:
import pandas_bokeh

In [104]:
pd.set_option('plotting.backend', 'pandas_bokeh')

In [105]:
pandas_bokeh.output_notebook()

In [106]:
from bokeh.models.widgets import DataTable, TableColumn
from bokeh.models import ColumnDataSource

In [113]:
ddf1 = ddf['PC1 PC2 Region'.split()]
data_table = DataTable(
    columns=[TableColumn(field=Ci, title=Ci) for Ci in ddf1.columns],
    source=ColumnDataSource(ddf1),
    height=800,
)

In [115]:
p_scatter = ddf1.plot_bokeh.scatter(
    x="PC1",
    y="PC2",
    category="Region",
    title="PCA Yemen - Human Origins",
    show_figure=True,
)

In [111]:
ddf.columns.values

array(['FID', 'IID', 'PC1', 'PC2', 'PC3', 'PC4', 'PC5', 'PC6', 'PC7',
       'PC8', 'PC9', 'PC10', 'Region'], dtype=object)

In [267]:
from bokeh.palettes import Paired

In [268]:
Paired[12]

('#a6cee3',
 '#1f78b4',
 '#b2df8a',
 '#33a02c',
 '#fb9a99',
 '#e31a1c',
 '#fdbf6f',
 '#ff7f00',
 '#cab2d6',
 '#6a3d9a',
 '#ffff99',
 '#b15928')