# Phylogeny of *Muscari* using genome wide ddRAD data
In this Jupyter Notebook I document the assembly parameters and steps used for the phylogenetic study of **Böhnert *et al.*** (in prep.) **Phylogeny based generic reclassification of *Muscari* sensu lato (Asparagaceae) using plastid and genomic DNA**. The analysis parameters for the multispecies coalescent analysis using SVDquartets (= tetRAD) and for the PCA are documented as well.

This notebook, its code, and its explanations are based on the work of Deren Eaton and Isaac Overcast and are adapted from the [ipyrad documentation](https://ipyrad.readthedocs.io/en/master/index.html). Please check out their project on [GitHub](https://github.com/dereneaton/ipyrad) as well as the original publication: [Eaton & Overcast (2020) *Bioinformatics*](https://academic.oup.com/bioinformatics/article/36/8/2592/5697088).

To reproduce the analyses in this notebook, (Mini)Conda, `ipyrad`, and `toytree` must be installed. See the respective instructions for [ipyrad](https://ipyrad.readthedocs.io/en/master/index.html) and [toytree](https://toytree.readthedocs.io/en/latest/index.html) for more information, or use the following two lines for a quick start:

```
conda install ipyrad -c conda-forge -c bioconda
conda install toytree -c conda-forge
```

In [None]:
## import Packages and check for versions
import ipyrad as ip
import ipyrad.analysis as ipa
import ipyparallel as ipp
import pandas as pd
import toytree
import toyplot

## print versions of ipyrad and toytree
print("ipyrad v. {}".format(ip.__version__))
print("toytree v. {}".format(toytree.__version__))

## print Version of Python
from platform import python_version
print("Python v.", python_version())
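
The import cell above fails at the first missing package. As a minimal sketch, the installation can also be checked without importing anything, using only the standard library. The helper name `installed_versions` is hypothetical and not part of ipyrad or toytree.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Return {package name: version string, or None if it is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


## the packages imported in the cell above
report = installed_versions(["ipyrad", "ipyparallel", "pandas", "toytree", "toyplot"])
for name, ver in report.items():
    print("{:<12} {}".format(name, ver if ver is not None else "NOT INSTALLED"))
```

Any package reported as `NOT INSTALLED` should be added to the active conda environment before continuing.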

#### Parallel processes on independent Python kernels
To start a parallel client you must run the command-line program `ipcluster`. This will essentially start a number of independent Python processes (kernels) which we can then send bits of work to do. The cluster can be stopped and restarted independently of this notebook, which is convenient for working on a cluster where connecting to many cores is not always immediately available.

Open a terminal, activate the correct conda environment and type the following command to start an `ipcluster` instance with `N` engines.

In [3]:
## ipcluster start --n=16

In [None]:
## connect to cluster
ipyclient = ipp.Client()
print(ip.cluster_info(ipyclient))

## Data Assembly
RADseq data can be downloaded here: **XXXX**
### Create an Assembly object and modify *ipyrad* params file
This object stores the parameters of the assembly and the organization of the data

In [None]:
## Provide a name for the assembly
data = ip.Assembly("Muscari")

In [None]:
## set parameters
data.set_params("project_dir", "Mus_Assembly")
data.set_params("sorted_fastq_path", "./Mus_fastq/*.fastq.gz")
data.set_params("clust_threshold", "0.85")
data.set_params("max_Hs_consens", (0.05))
data.set_params("restriction_overhang", ('TGCAG', 'GGCC'))
data.set_params("output_formats", "*")
data.set_params("datatype", "ddrad")

## see / print all parameters
data.get_params()

### Assemble the data from step 1 to 6

In [None]:
## run steps 1 & 2 of the assembly
data.run("12", force = True)

In [None]:
## set clustering threshold to 85 % and run assembly steps 3-6
data_clust85 = data.branch("data_clust85")
data_clust85.set_params("clust_threshold", 0.85)
data_clust85.run("3456", force = True)

In [None]:
## show assembly stats up to step 6
data_clust85.stats

## show assembly stats up to step 6, sorted by
## the number of consensus reads (uncomment)
# data_clust85.stats.sort_values(by=['reads_consens'])

In [None]:
## set clustering threshold to 90 % and run assembly steps 3-6
data_clust90 = data.branch("data_clust90")
data_clust90.set_params("clust_threshold", 0.90)
data_clust90.run("3456", force = True)

In [None]:
## show assembly stats up to step 6
data_clust90.stats

## show assembly stats up to step 6, sorted by
## the number of consensus reads (uncomment)
# data_clust90.stats.sort_values(by=['reads_consens'])

In [None]:
## set clustering threshold to 95 % and run assembly steps 3-6
data_clust95 = data.branch("data_clust95")
data_clust95.set_params("clust_threshold", 0.95)
data_clust95.run("3456", force = True)

In [None]:
## show assembly stats up to step 6
data_clust95.stats

## show assembly stats up to step 6, sorted by
## the number of consensus reads (uncomment)
# data_clust95.stats.sort_values(by=['reads_consens'])
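
The three branching cells above differ only in the clustering threshold. A minimal sketch of the same steps as one loop is given here; it uses only the `branch`, `set_params` and `run` calls already shown in this notebook. The function name `branch_by_threshold` is hypothetical.

```python
def branch_by_threshold(assembly, thresholds, steps="3456"):
    """Branch `assembly` once per clustering threshold and run `steps` on each branch."""
    branches = {}
    for threshold in thresholds:
        ## 0.85 -> "data_clust85", 0.90 -> "data_clust90", ...
        name = "data_clust{}".format(round(threshold * 100))
        branch = assembly.branch(name)
        branch.set_params("clust_threshold", threshold)
        branch.run(steps, force=True)
        branches[name] = branch
    return branches


## usage in this notebook, after steps 1 and 2 have finished (uncomment):
# branches = branch_by_threshold(data, [0.85, 0.90, 0.95])
# branches["data_clust90"].stats
```

Keeping the branches in a dictionary makes it easy to loop over them again for step 7.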

### Final assembly with different `min_samples_locus` settings for different analyses

1. Phylogenetic analysis 
    - RAxML
    - tetRAD
2. Population analysis of selected clades
    - PCA

#### When coming back later, load the assembly objects to continue after step 6

In [None]:
## load assembly objects when coming back
data = ip.load_json("./Mus_Assembly/Muscari.json")
data_clust85 = ip.load_json("./Mus_Assembly/data_clust85.json")
data_clust90 = ip.load_json("./Mus_Assembly/data_clust90.json")
data_clust95 = ip.load_json("./Mus_Assembly/data_clust95.json")

## check again the stat-file sorted by number of consensus reads
#data_clust85.stats.sort_values(by=['reads_consens'])

## check names if needed
#data_clust85.stats

### 1. Assembly for Phylogenetic analysis
If sequencing failed for some samples, those samples should be excluded before running step 7. One symptom of sequencing failure is a low read number (e.g., < 1000 or < 10000 reads after step 6, depending on the average read number). Samples that fall outside the target group must also be excluded from the analysis.

Enter the names of the samples you want to remove in the inner list of the `keep_list` list comprehension and make a new subset using the `branch` function:

In [None]:
## exclude samples from assembly with ...
keep_list = [i for i in data.samples.keys() if i not in [
    ## ... low read number (e.g., < 10000 )
    "", "",
    
    ## ... other samples to exclude
    "", "", "",
]]

## make a new data branch from the keep_list
data = data.branch("data", subsamples = keep_list, force = True)

## double check taxon sampling
#data.stats.sort_values(by=['reads_consens'])
data.stats

However, this was not needed in this case.
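
If samples do have to be dropped, the exclusion list need not be typed by hand. The following is a minimal sketch that derives `keep_list` from a read-count cutoff. The function name `samples_to_keep`, the sample names and the read counts are all made up for illustration. In the notebook the counts could presumably be taken from `data.stats["reads_consens"].to_dict()`, assuming `stats` is a pandas DataFrame as the `sort_values` calls suggest.

```python
def samples_to_keep(reads_consens, min_reads=10000, exclude=()):
    """Return the sorted names of samples with enough reads that are not explicitly excluded."""
    return sorted(
        name for name, n_reads in reads_consens.items()
        if n_reads >= min_reads and name not in exclude
    )


## made-up consensus read counts, for illustration only
reads = {"sample_A": 45210, "sample_B": 812, "sample_C": 30977, "sample_D": 28654}

## sample_B falls below the cutoff, sample_D is excluded by name
keep_list = samples_to_keep(reads, min_reads=10000, exclude={"sample_D"})
print(keep_list)
```

The resulting list can be passed to `branch` through the `subsamples` argument exactly as in the cell above.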


Next, let's define several levels of missing data to see how they affect the final assembly and the topology and support in the subsequent phylogenetic analyses.
1. Set the number of outgroup taxa.
2. Define the percentages of ingroup samples that must share a locus (the complement of the allowed missing data).
3. Run a for loop to calculate how many samples correspond to each percentage.

In [265]:
## First check number of remaining samples (total minus the 4 outgroup samples)
ingroup = data.stats.state.count() - 4
print("Number of ingroup taxa:", ingroup)
print("Calculate different sets of missing data:")

## for loop to calculate different values for min_samples_locus
percent = [10, 15, 20, 25, 30, 35, 40]
for i in percent:
    res = ingroup / 100 * i
    print(i,"% = ", round(res))

Number of ingroup taxa: 39
Calculate different sets of missing data:
10 % =  4
15 % =  6
20 % =  8
25 % =  10
30 % =  12
35 % =  14
40 % =  16


In [None]:
## Clustering threshold 85
## -----------------------
## Run the final assembly step 7 in a for loop with different min_samples_locus
## values based on the number of remaining samples MINUS outgroup

## First, make a dictionary with the minimum percentage of ingroup samples sharing
## a locus as keys (e.g., 15 = up to 85 % missing data) and the corresponding
## minimum number of ingroup samples as values
sample_dict = {10: 4,
               15: 6,
               20: 8,
               25: 10,
               30: 12,
               35: 14,
               40: 16}

## loop over the dictionary 
for key, value in sample_dict.items():
    newname = "pops{}_clust85".format(key)
    newdata = data_clust85.branch(newname)
    newdata.populations = {
        "ingroup":  (value, [i for i in newdata.samples if "B" not in i]),
        "outgroup": (0,     [i for i in newdata.samples if "B" in i]),
         }
    ## run the final step on every iteration of the loop
    newdata.run("7", force = True)
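
The loop above assigns every sample whose name contains a capital "B" to the outgroup. That works for the *Bellevalia* and *Brimeura* samples, but it would also catch any ingroup sample with a "B" elsewhere in its name (for example in a voucher code). A minimal sketch of a stricter split by genus prefix follows. It assumes that sample names start with the genus, as the four outgroup names used in this notebook do. The function name `split_populations` and the two *Muscari* names are hypothetical.

```python
OUTGROUP_GENERA = ("Bellevalia_", "Brimeura_")


def split_populations(sample_names, min_ingroup, outgroup_prefixes=OUTGROUP_GENERA):
    """Build a populations dictionary in the (min_samples, [names]) layout used above."""
    outgroup = [s for s in sample_names if s.startswith(outgroup_prefixes)]
    ingroup = [s for s in sample_names if not s.startswith(outgroup_prefixes)]
    return {"ingroup": (min_ingroup, ingroup), "outgroup": (0, outgroup)}


names = [
    "Bellevalia_dubia_W6083", "Bellevalia_paradoxa_ED1272",
    "Bellevalia_speciosa_W6085", "Brimeura_amethystina_W6084",
    "Muscari_example_B1", "Muscari_example_X2",   ## hypothetical ingroup names
]
pops = split_populations(names, min_ingroup=8)
print(pops["outgroup"][1])
print(pops["ingroup"][1])
```

In the loop, the result could be assigned as `newdata.populations = split_populations(list(newdata.samples), value)`.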

In [None]:
## Clustering threshold 90
## -----------------------
## Run the final assembly step 7 in a for loop with different min_samples_locus
## values based on the number of remaining samples MINUS outgroup

## First, make a dictionary with the minimum percentage of ingroup samples sharing
## a locus as keys (e.g., 15 = up to 85 % missing data) and the corresponding
## minimum number of ingroup samples as values
sample_dict = {10: 4,
               15: 6,
               20: 8,
               25: 10,
               30: 12,
               35: 14,
               40: 16}

## loop over the dictionary 
for key, value in sample_dict.items():
    newname = "pops{}_clust90".format(key)
    newdata = data_clust90.branch(newname)
    newdata.populations = {
        "ingroup":  (value, [i for i in newdata.samples if "B" not in i]),
        "outgroup": (0,     [i for i in newdata.samples if "B" in i]),
         }
    ## run the final step on every iteration of the loop
    newdata.run("7", force = True)

In [None]:
## Clustering threshold 95
## -----------------------
## Run the final assembly step 7 in a for loop with different min_samples_locus
## values based on the number of remaining samples MINUS outgroup

## First, make a dictionary with the minimum percentage of ingroup samples sharing
## a locus as keys (e.g., 15 = up to 85 % missing data) and the corresponding
## minimum number of ingroup samples as values
sample_dict = {10: 4,
               15: 6,
               20: 8,
               25: 10,
               30: 12,
               35: 14,
               40: 16}

## loop over the dictionary 
for key, value in sample_dict.items():
    newname = "pops{}_clust95".format(key)
    newdata = data_clust95.branch(newname)
    newdata.populations = {
        "ingroup":  (value, [i for i in newdata.samples if "B" not in i]),
        "outgroup": (0,     [i for i in newdata.samples if "B" in i]),
         }
    ## run the final step on every iteration of the loop
    newdata.run("7", force = True)

### 2. Assembly for PCA analysis with outgroups removed

In [None]:
## load assembly object when coming back
data_clust90 = ip.load_json("./Mus_Assembly/data_clust90.json")

## check name
#data_clust90.stats

In [None]:
## Exclude outgroup samples from assembly
keep_list = [i for i in data_clust90.samples.keys() if i not in [
    "Bellevalia_dubia_W6083", "Bellevalia_paradoxa_ED1272",
    "Bellevalia_speciosa_W6085", "Brimeura_amethystina_W6084"
]]

## make a new data branch from the keep_list
nout_clust90 = data_clust90.branch("nout_clust90", subsamples = keep_list, force = True)

## double check taxon sampling
#nout_clust90.stats.sort_values(by=['reads_consens'])
nout_clust90.stats

In [None]:
## run final assembly without outgroups; a locus must be present in at least 20 samples
nout_clust90.set_params("min_samples_locus", 20)
nout_clust90.run("7", force = True)

## Phylogenetic downstream analysis


### 1. RAxML

The analyses for this study were performed directly with the RAxML command-line tool using a custom script.
The results of those analyses with different clustering thresholds and levels of missing data are plotted below:
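
The custom script itself is not part of this notebook. Purely as an illustration, the sketch below assembles one RAxML command line per assembly without running it. The binary name, model, seeds, thread count, number of bootstrap replicates and the output directory are all assumptions, not the settings used in the study. The only detail taken from the notebook is that the result files are named `RAxML_bipartitions.<run name>`, which RAxML writes for a rapid-bootstrap plus ML search (`-f a`).

```python
import shlex


def raxml_command(phylip_file, run_name, workdir, threads=4, bootstraps=100,
                  binary="raxmlHPC-PTHREADS-AVX2"):
    """Return the argument list for a rapid-bootstrap plus ML search (assumed settings)."""
    return [
        binary,
        "-f", "a",              ## rapid bootstrap analysis and best-tree search in one run
        "-m", "GTRGAMMA",       ## substitution model (assumption)
        "-N", str(bootstraps),  ## number of bootstrap replicates (assumption)
        "-x", "12345",          ## bootstrap random seed (assumption)
        "-p", "54321",          ## parsimony random seed (assumption)
        "-T", str(threads),
        "-s", phylip_file,
        "-n", run_name,
        "-w", workdir,          ## RAxML expects an absolute path here
    ]


for pct in [15, 20, 25, 30, 35, 40]:
    name = "pops{}_clust90".format(pct)
    cmd = raxml_command(
        "./Mus_Assembly/{0}_outfiles/{0}.phy".format(name),
        name + ".phy",
        "/absolute/path/to/raxml_output",   ## placeholder directory
    )
    print(shlex.join(cmd))
```

The printed lines can be pasted into a shell script or passed to `subprocess.run` once the settings have been replaced by the ones actually intended.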

#### Plot RAxML `clust85` trees together

In [None]:
## Plot all six clust85 RAxML trees together

## Load trees
tre15 = toytree.tree("/home/tim/GBS/Muscari/Mus_Analysis/Mus_RAxML/Mus_RAxML_clust85_20210812/RAxML_bipartitions.pops_15.phy")
tre20 = toytree.tree("/home/tim/GBS/Muscari/Mus_Analysis/Mus_RAxML/Mus_RAxML_clust85_20210812/RAxML_bipartitions.pops_20.phy")
tre25 = toytree.tree("/home/tim/GBS/Muscari/Mus_Analysis/Mus_RAxML/Mus_RAxML_clust85_20210812/RAxML_bipartitions.pops_25.phy")
tre30 = toytree.tree("/home/tim/GBS/Muscari/Mus_Analysis/Mus_RAxML/Mus_RAxML_clust85_20210812/RAxML_bipartitions.pops_30.phy")
tre35 = toytree.tree("/home/tim/GBS/Muscari/Mus_Analysis/Mus_RAxML/Mus_RAxML_clust85_20210812/RAxML_bipartitions.pops_35.phy")
tre40 = toytree.tree("/home/tim/GBS/Muscari/Mus_Analysis/Mus_RAxML/Mus_RAxML_clust85_20210812/RAxML_bipartitions.pops_40.phy")

tre15 = tre15.root(wildcard = "Brimeura")
tre20 = tre20.root(wildcard = "Brimeura")
tre25 = tre25.root(wildcard = "Brimeura")
tre30 = tre30.root(wildcard = "Brimeura")
tre35 = tre35.root(wildcard = "Brimeura")
tre40 = tre40.root(wildcard = "Brimeura")


## set dimensions of the canvas
canvas = toyplot.Canvas(width = 2000, height = 2000)

## dissect canvas into multiple cartesian areas (x1, x2, y1, y2)
ax0 = canvas.cartesian(bounds=('2%',  '30%', '5%',  '47.5%'))
ax1 = canvas.cartesian(bounds=('33%', '63%', '5%',  '47.5%'))
ax2 = canvas.cartesian(bounds=('66%', '96%', '5%',  '47.5%'))
ax3 = canvas.cartesian(bounds=('2%',  '30%', '50%', '97.5%'))
ax4 = canvas.cartesian(bounds=('33%', '63%', '50%', '97.5%'))
ax5 = canvas.cartesian(bounds=('66%', '96%', '50%', '97.5%'))

# call draw with the 'axes' argument to pass it to a specific cartesian area
style = {
    "tip_labels_align": True,
    "tip_labels_style": {"font-size": "11px"},
    "node_labels_style":{"font-size": "12px",
                        "baseline-shift": "7px",
                        "-toyplot-anchor-shift": "-13px"},
}
tre15.ladderize(1).draw(
    axes = ax0,
    **style,
    node_sizes = 0,
    node_labels = 'support');

tre20.ladderize(1).draw(
    axes = ax1,
    **style,
    node_sizes = 0,
    node_labels = 'support');

tre25.ladderize(1).draw(
    axes = ax2,
    **style,
    node_sizes = 0,
    node_labels = 'support');

tre30.ladderize(1).draw(
    axes = ax3,
    **style,
    node_sizes = 0,
    node_labels = 'support');

tre35.ladderize(1).draw(
    axes = ax4,
    **style,
    node_sizes = 0,
    node_labels = 'support');

tre40.ladderize(1).draw(
    axes = ax5,
    **style,
    node_sizes = 0,
    node_labels = 'support');

## hide the axes (e.g, ticks and splines)
ax0.show = False; ax1.show = False; ax2.show = False;
ax3.show = False; ax4.show = False; ax5.show = False;

## add names for the single trees
canvas.text(1000, 50, '<b><i>Muscari</i></b> — RAxML — Clustering threshold 85 %', style = {"font-size": "24px"})
canvas.text(150, 125, '85 % missing data', style={"font-size": "18px"})
canvas.text(800, 125, '80 % missing data', style={"font-size": "18px"})
canvas.text(1450, 125, '75 % missing data', style={"font-size": "18px"})
canvas.text(150, 1025, '70 % missing data', style={"font-size": "18px"})
canvas.text(800, 1025, '65 % missing data', style={"font-size": "18px"})
canvas.text(1450, 1025, '60 % missing data', style={"font-size": "18px"});

In [61]:
import toyplot.pdf
toyplot.pdf.render(canvas, "/home/tim/GBS/Muscari/Mus_Analysis/Mus_RAxML/RAxML_Figures/Suppl-Fig_Mus_RAxML_clust85_20210812_15-20-25-30-35-40_Anno.pdf");
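
The plotting cell above repeats the same draw call six times and hard-codes the panel titles. A minimal sketch of how the panel bounds and titles could be derived from the list of assemblies is given here. The function name `panel_layout` is hypothetical; the bounds are the ones used above, and each title is 100 minus the percentage of ingroup samples required per locus.

```python
def panel_layout(percents):
    """Pair each assembly with its panel bounds and title on the 2 x 3 canvas used above."""
    x_bounds = [("2%", "30%"), ("33%", "63%"), ("66%", "96%")]
    y_bounds = [("5%", "47.5%"), ("50%", "97.5%")]
    layout = []
    for idx, pct in enumerate(percents):
        row, col = divmod(idx, len(x_bounds))
        bounds = x_bounds[col] + y_bounds[row]   ## (xmin, xmax, ymin, ymax)
        title = "{} % missing data".format(100 - pct)
        layout.append((pct, bounds, title))
    return layout


for pct, bounds, title in panel_layout([15, 20, 25, 30, 35, 40]):
    print(pct, bounds, title)

## usage with the objects of the cell above (uncomment):
# trees = {15: tre15, 20: tre20, 25: tre25, 30: tre30, 35: tre35, 40: tre40}
# for pct, bounds, title in panel_layout(list(trees)):
#     ax = canvas.cartesian(bounds=bounds)
#     trees[pct].ladderize(1).draw(axes=ax, **style, node_sizes=0, node_labels='support')
#     ax.show = False
```

With this layout the same loop can be reused for the `clust90` and `clust95` figures by swapping the tree dictionary.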

#### Plot RAxML `clust90` trees together

In [None]:
## Plot all six clust90 RAxML trees together

## Load trees
tre15 = toytree.tree("/home/tim/GBS/Muscari/Mus_Analysis/Mus_RAxML/Mus_RAxML_clust90_20210816/RAxML_bipartitions.pops15_clust90.phy")
tre20 = toytree.tree("/home/tim/GBS/Muscari/Mus_Analysis/Mus_RAxML/Mus_RAxML_clust90_20210816/RAxML_bipartitions.pops20_clust90.phy")
tre25 = toytree.tree("/home/tim/GBS/Muscari/Mus_Analysis/Mus_RAxML/Mus_RAxML_clust90_20210816/RAxML_bipartitions.pops25_clust90.phy")
tre30 = toytree.tree("/home/tim/GBS/Muscari/Mus_Analysis/Mus_RAxML/Mus_RAxML_clust90_20210816/RAxML_bipartitions.pops30_clust90.phy")
tre35 = toytree.tree("/home/tim/GBS/Muscari/Mus_Analysis/Mus_RAxML/Mus_RAxML_clust90_20210816/RAxML_bipartitions.pops35_clust90.phy")
tre40 = toytree.tree("/home/tim/GBS/Muscari/Mus_Analysis/Mus_RAxML/Mus_RAxML_clust90_20210816/RAxML_bipartitions.pops40_clust90.phy")

tre15 = tre15.root(wildcard = "Brimeura")
tre20 = tre20.root(wildcard = "Brimeura")
tre25 = tre25.root(wildcard = "Brimeura")
tre30 = tre30.root(wildcard = "Brimeura")
tre35 = tre35.root(wildcard = "Brimeura")
tre40 = tre40.root(wildcard = "Brimeura")


## set dimensions of the canvas
canvas = toyplot.Canvas(width = 2000, height = 2000)

## dissect canvas into multiple cartesian areas (x1, x2, y1, y2)
ax0 = canvas.cartesian(bounds=('2%',  '30%', '5%',  '47.5%'))
ax1 = canvas.cartesian(bounds=('33%', '63%', '5%',  '47.5%'))
ax2 = canvas.cartesian(bounds=('66%', '96%', '5%',  '47.5%'))
ax3 = canvas.cartesian(bounds=('2%',  '30%', '50%', '97.5%'))
ax4 = canvas.cartesian(bounds=('33%', '63%', '50%', '97.5%'))
ax5 = canvas.cartesian(bounds=('66%', '96%', '50%', '97.5%'))

# call draw with the 'axes' argument to pass it to a specific cartesian area
style = {
    "tip_labels_align": True,
    "tip_labels_style": {"font-size": "11px"},
    "node_labels_style":{"font-size": "12px",
                        "baseline-shift": "7px",
                        "-toyplot-anchor-shift": "-13px"},
}
tre15.ladderize(1).draw(
    axes = ax0,
    **style,
    node_sizes = 0,
    node_labels = 'support');

tre20.ladderize(1).draw(
    axes = ax1,
    **style,
    node_sizes = 0,
    node_labels = 'support');

tre25.ladderize(1).draw(
    axes = ax2,
    **style,
    node_sizes = 0,
    node_labels = 'support');

tre30.ladderize(1).draw(
    axes = ax3,
    **style,
    node_sizes = 0,
    node_labels = 'support');

tre35.ladderize(1).draw(
    axes = ax4,
    **style,
    node_sizes = 0,
    node_labels = 'support');

tre40.ladderize(1).draw(
    axes = ax5,
    **style,
    node_sizes = 0,
    node_labels = 'support');

# hide the axes (e.g, ticks and splines)
ax0.show = False; ax1.show = False; ax2.show = False;
ax3.show = False; ax4.show = False; ax5.show = False;

## add names for the single trees
canvas.text(1000, 50, 'RAxML — Clustering threshold 90 %', style = {"font-size": "24px"})
canvas.text(150, 125, '85 % missing data', style={"font-size": "18px"})
canvas.text(800, 125, '80 % missing data', style={"font-size": "18px"})
canvas.text(1450, 125, '75 % missing data', style={"font-size": "18px"})
canvas.text(150, 1025, '70 % missing data', style={"font-size": "18px"})
canvas.text(800, 1025, '65 % missing data', style={"font-size": "18px"})
canvas.text(1450, 1025, '60 % missing data', style={"font-size": "18px"});

In [None]:
import toyplot.pdf
toyplot.pdf.render(canvas, "/home/tim/GBS/Muscari/Mus_Analysis/Mus_RAxML/RAxML_Figures/Suppl-Fig_Mus_RAxML_clust90_20210816_15-20-25-30-35-40_Anno.pdf");

#### Plot RAxML `clust95` trees together

In [None]:
## Plot all six clust95 RAxML trees together

## Load trees
tre15 = toytree.tree("/home/tim/GBS/Muscari/Mus_Analysis/Mus_RAxML/Mus_RAxML_clust95_20210823/RAxML_bipartitions.pops15_clust95.phy")
tre20 = toytree.tree("/home/tim/GBS/Muscari/Mus_Analysis/Mus_RAxML/Mus_RAxML_clust95_20210823/RAxML_bipartitions.pops20_clust95.phy")
tre25 = toytree.tree("/home/tim/GBS/Muscari/Mus_Analysis/Mus_RAxML/Mus_RAxML_clust95_20210823/RAxML_bipartitions.pops25_clust95.phy")
tre30 = toytree.tree("/home/tim/GBS/Muscari/Mus_Analysis/Mus_RAxML/Mus_RAxML_clust95_20210823/RAxML_bipartitions.pops30_clust95.phy")
tre35 = toytree.tree("/home/tim/GBS/Muscari/Mus_Analysis/Mus_RAxML/Mus_RAxML_clust95_20210823/RAxML_bipartitions.pops35_clust95.phy")
tre40 = toytree.tree("/home/tim/GBS/Muscari/Mus_Analysis/Mus_RAxML/Mus_RAxML_clust95_20210823/RAxML_bipartitions.pops40_clust95.phy")

tre15 = tre15.root(wildcard = "Brimeura")
tre20 = tre20.root(wildcard = "Brimeura")
tre25 = tre25.root(wildcard = "Brimeura")
tre30 = tre30.root(wildcard = "Brimeura")
tre35 = tre35.root(wildcard = "Brimeura")
tre40 = tre40.root(wildcard = "Brimeura")


## set dimensions of the canvas
canvas = toyplot.Canvas(width = 2000, height = 2000)

## dissect canvas into multiple cartesian areas (x1, x2, y1, y2)
ax0 = canvas.cartesian(bounds=('2%',  '30%', '5%',  '47.5%'))
ax1 = canvas.cartesian(bounds=('33%', '63%', '5%',  '47.5%'))
ax2 = canvas.cartesian(bounds=('66%', '96%', '5%',  '47.5%'))
ax3 = canvas.cartesian(bounds=('2%',  '30%', '50%', '97.5%'))
ax4 = canvas.cartesian(bounds=('33%', '63%', '50%', '97.5%'))
ax5 = canvas.cartesian(bounds=('66%', '96%', '50%', '97.5%'))

# call draw with the 'axes' argument to pass it to a specific cartesian area
style = {
    "tip_labels_align": True,
    "tip_labels_style": {"font-size": "11px"},
    "node_labels_style":{"font-size": "12px",
                        "baseline-shift": "7px",
                        "-toyplot-anchor-shift": "-13px"},
}
tre15.ladderize(1).draw(
    axes = ax0,
    **style,
    node_sizes = 0,
    node_labels = 'support');

tre20.ladderize(1).draw(
    axes = ax1,
    **style,
    node_sizes = 0,
    node_labels = 'support');

tre25.ladderize(1).draw(
    axes = ax2,
    **style,
    node_sizes = 0,
    node_labels = 'support');

tre30.ladderize(1).draw(
    axes = ax3,
    **style,
    node_sizes = 0,
    node_labels = 'support');

tre35.ladderize(1).draw(
    axes = ax4,
    **style,
    node_sizes = 0,
    node_labels = 'support');

tre40.ladderize(1).draw(
    axes = ax5,
    **style,
    node_sizes = 0,
    node_labels = 'support');

# hide the axes (e.g, ticks and splines)
ax0.show = False; ax1.show = False; ax2.show = False;
ax3.show = False; ax4.show = False; ax5.show = False;

## add names for the single trees
canvas.text(1000, 50, 'RAxML — Clustering threshold 95 %', style = {"font-size": "24px"})
canvas.text(150, 125, '85 % missing data', style={"font-size": "18px"})
canvas.text(800, 125, '80 % missing data', style={"font-size": "18px"})
canvas.text(1450, 125, '75 % missing data', style={"font-size": "18px"})
canvas.text(150, 1025, '70 % missing data', style={"font-size": "18px"})
canvas.text(800, 1025, '65 % missing data', style={"font-size": "18px"})
canvas.text(1450, 1025, '60 % missing data', style={"font-size": "18px"});

In [None]:
import toyplot.pdf
toyplot.pdf.render(canvas, "/home/tim/GBS/Muscari/Mus_Analysis/Mus_RAxML/RAxML_Figures/Suppl-Fig_Mus_RAxML_clust95_20210823_15-20-25-30-35-40_Anno.pdf");

### tetRAD
#### Run multiple tetRAD analyses in a for loop
##### Run tetRAD with clustering threshold `clust85` & Plot trees together

In [36]:
## store the paths to the *.snps.hdf5 files as values in a dictionary with assembly names as keys
snps_files = {
    "pop15": "/home/tim/GBS/Muscari/Mus_Assembly/pops15_clust85_outfiles/pops15_clust85.snps.hdf5",
    "pop20": "/home/tim/GBS/Muscari/Mus_Assembly/pops20_clust85_outfiles/pops20_clust85.snps.hdf5",
    "pop25": "/home/tim/GBS/Muscari/Mus_Assembly/pops25_clust85_outfiles/pops25_clust85.snps.hdf5",
    "pop30": "/home/tim/GBS/Muscari/Mus_Assembly/pops30_clust85_outfiles/pops30_clust85.snps.hdf5",
    "pop35": "/home/tim/GBS/Muscari/Mus_Assembly/pops35_clust85_outfiles/pops35_clust85.snps.hdf5",
    "pop40": "/home/tim/GBS/Muscari/Mus_Assembly/pops40_clust85_outfiles/pops40_clust85.snps.hdf5"
}

In [None]:
## Iterate through the dictionary and run a tetRAD analysis for each assembly

for key, value in snps_files.items():
    tet = ipa.tetrad(
        name = "Mus_tet_clust85_" + str(key),
        data = value,
        workdir = "./Mus_Analysis/Mus_tetRAD/tet_clust85",
        nquartets = 1e6,
        nboots = 200)
    ## run 
    tet.run(auto = True, force = True)
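
The path dictionary follows a fixed pattern, so it can also be generated instead of typed out for each clustering threshold. A minimal sketch, in which the function name `snps_paths` is hypothetical and the directory layout is copied from the paths above:

```python
import posixpath


def snps_paths(assembly_dir, threshold, percents):
    """Return {"pop<percent>": path to the matching *.snps.hdf5 file}."""
    paths = {}
    for pct in percents:
        name = "pops{}_clust{}".format(pct, threshold)
        paths["pop{}".format(pct)] = posixpath.join(
            assembly_dir, name + "_outfiles", name + ".snps.hdf5")
    return paths


paths = snps_paths("/home/tim/GBS/Muscari/Mus_Assembly", 85, [15, 20, 25, 30, 35, 40])
for key, path in paths.items():
    print(key, path)
```

Calling the function with `90` or `95` yields the dictionaries used for the other two clustering thresholds.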

In [None]:
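## Plot all six clust85 tetRAD coalescent trees together
## Load trees (the loop above names its output "Mus_tet_clust85_<key>";
## adjust the file names below if the analysis is rerun)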
tet15 = toytree.tree("./Mus_Analysis/Mus_tetRAD/tet_clust85/Mus_tet_pop15.tree.cons").root(wildcard = "Brimeura")
tet20 = toytree.tree("./Mus_Analysis/Mus_tetRAD/tet_clust85/Mus_tet_pop20.tree.cons").root(wildcard = "Brimeura")
tet25 = toytree.tree("./Mus_Analysis/Mus_tetRAD/tet_clust85/Mus_tet_pop25.tree.cons").root(wildcard = "Brimeura")
tet30 = toytree.tree("./Mus_Analysis/Mus_tetRAD/tet_clust85/Mus_tet_pop30.tree.cons").root(wildcard = "Brimeura")
tet35 = toytree.tree("./Mus_Analysis/Mus_tetRAD/tet_clust85/Mus_tet_pop35.tree.cons").root(wildcard = "Brimeura")
tet40 = toytree.tree("./Mus_Analysis/Mus_tetRAD/tet_clust85/Mus_tet_pop40.tree.cons").root(wildcard = "Brimeura")

## set dimensions of the canvas
canvas = toyplot.Canvas(width = 2000, height = 2000)

## dissect canvas into multiple cartesian areas (x1, x2, y1, y2)
ax0 = canvas.cartesian(bounds=('2%',  '30%', '5%',  '47.5%'))
ax1 = canvas.cartesian(bounds=('33%', '63%', '5%',  '47.5%'))
ax2 = canvas.cartesian(bounds=('66%', '96%', '5%',  '47.5%'))
ax3 = canvas.cartesian(bounds=('2%',  '30%', '50%', '97.5%'))
ax4 = canvas.cartesian(bounds=('33%', '63%', '50%', '97.5%'))
ax5 = canvas.cartesian(bounds=('66%', '96%', '50%', '97.5%'))

## define style once and use it for every tree
style = {
    "tip_labels_align": True,
    "tip_labels_style": {"font-size": "11px"},
    "node_labels_style":{"font-size": "12px",
                        "baseline-shift": "7px",
                        "-toyplot-anchor-shift": "-13px"},
}
tet15.ladderize(1).draw(
    axes = ax0,
    **style,
    node_sizes = 0,
    node_labels = 'support');

tet20.ladderize(1).draw(
    axes = ax1,
    **style,
    node_sizes = 0,
    node_labels = 'support');

tet25.ladderize(1).draw(
    axes = ax2,
    **style,
    node_sizes = 0,
    node_labels = 'support');

tet30.ladderize(1).draw(
    axes = ax3,
    **style,
    node_sizes = 0,
    node_labels = 'support');

tet35.ladderize(1).draw(
    axes = ax4,
    **style,
    node_sizes = 0,
    node_labels = 'support');

tet40.ladderize(1).draw(
    axes = ax5,
    **style,
    node_sizes = 0,
    node_labels = 'support');

## hide the axes (e.g, ticks and splines)
ax0.show = False; ax1.show = False; ax2.show = False;
ax3.show = False; ax4.show = False; ax5.show = False;

## add names for the single trees
canvas.text(1000, 50, 'tetRAD/SVDQuartet — Clustering threshold 85 %', style = {"font-size": "24px"})
canvas.text(150, 125, '85 % missing data', style={"font-size": "18px"})
canvas.text(800, 125, '80 % missing data', style={"font-size": "18px"})
canvas.text(1450, 125, '75 % missing data', style={"font-size": "18px"})
canvas.text(150, 1025, '70 % missing data', style={"font-size": "18px"})
canvas.text(800, 1025, '65 % missing data', style={"font-size": "18px"})
canvas.text(1450, 1025, '60 % missing data', style={"font-size": "18px"});

In [8]:
import toyplot.pdf
toyplot.pdf.render(canvas, "/home/tim/GBS/Muscari/Mus_Analysis/Mus_tetRAD/tetRAD_Figures/Suppl-Fig_Mus_tetRAD-consens_clust85_20210811_15-20-25-30-35-40_Anno.pdf");

##### Run tetRAD with clustering threshold `clust90` & Plot trees together

In [None]:
## store the paths to the *.snps.hdf5 files as values in a dictionary with assembly names as keys
snps_files = {
    "pop15": "/home/tim/GBS/Muscari/Mus_Assembly/pops15_clust90_outfiles/pops15_clust90.snps.hdf5",
    "pop20": "/home/tim/GBS/Muscari/Mus_Assembly/pops20_clust90_outfiles/pops20_clust90.snps.hdf5",
    "pop25": "/home/tim/GBS/Muscari/Mus_Assembly/pops25_clust90_outfiles/pops25_clust90.snps.hdf5",
    "pop30": "/home/tim/GBS/Muscari/Mus_Assembly/pops30_clust90_outfiles/pops30_clust90.snps.hdf5",
    "pop35": "/home/tim/GBS/Muscari/Mus_Assembly/pops35_clust90_outfiles/pops35_clust90.snps.hdf5",
    "pop40": "/home/tim/GBS/Muscari/Mus_Assembly/pops40_clust90_outfiles/pops40_clust90.snps.hdf5"
}

In [None]:
## Iterate through the dictionary and run a tetRAD analysis for each assembly

for key, value in snps_files.items():
    tet = ipa.tetrad(
        name = "Mus_tet_clust90_" + str(key),
        data = value,
        workdir = "./Mus_Analysis/Mus_tetRAD/tet_clust90",
        nquartets = 1e6,
        nboots = 200)
    ## run 
    tet.run(auto = True, force = True)

In [None]:
## Plot all six clust90 tetRAD coalescent trees together
## Load trees
tet15 = toytree.tree("./Mus_Analysis/Mus_tetRAD/tet_clust90/Mus_tet_clust90_pop15.tree.cons").root(wildcard = "Brimeura")
tet20 = toytree.tree("./Mus_Analysis/Mus_tetRAD/tet_clust90/Mus_tet_clust90_pop20.tree.cons").root(wildcard = "Brimeura")
tet25 = toytree.tree("./Mus_Analysis/Mus_tetRAD/tet_clust90/Mus_tet_clust90_pop25.tree.cons").root(wildcard = "Brimeura")
tet30 = toytree.tree("./Mus_Analysis/Mus_tetRAD/tet_clust90/Mus_tet_clust90_pop30.tree.cons").root(wildcard = "Brimeura")
tet35 = toytree.tree("./Mus_Analysis/Mus_tetRAD/tet_clust90/Mus_tet_clust90_pop35.tree.cons").root(wildcard = "Brimeura")
tet40 = toytree.tree("./Mus_Analysis/Mus_tetRAD/tet_clust90/Mus_tet_clust90_pop40.tree.cons").root(wildcard = "Brimeura")

## set dimensions of the canvas
canvas = toyplot.Canvas(width = 2000, height = 2000)

## dissect canvas into multiple cartesian areas (x1, x2, y1, y2)
ax0 = canvas.cartesian(bounds=('2%',  '30%', '5%',  '47.5%'))
ax1 = canvas.cartesian(bounds=('33%', '63%', '5%',  '47.5%'))
ax2 = canvas.cartesian(bounds=('66%', '96%', '5%',  '47.5%'))
ax3 = canvas.cartesian(bounds=('2%',  '30%', '50%', '97.5%'))
ax4 = canvas.cartesian(bounds=('33%', '63%', '50%', '97.5%'))
ax5 = canvas.cartesian(bounds=('66%', '96%', '50%', '97.5%'))

## define style once and use it for every tree
style = {
    "tip_labels_align": True,
    "tip_labels_style": {"font-size": "11px"},
    "node_labels_style":{"font-size": "12px",
                        "baseline-shift": "7px",
                        "-toyplot-anchor-shift": "-13px"},
}
tet15.ladderize(1).draw(
    axes = ax0,
    **style,
    node_sizes = 0,
    node_labels = 'support');

tet20.ladderize(1).draw(
    axes = ax1,
    **style,
    node_sizes = 0,
    node_labels = 'support');

tet25.ladderize(1).draw(
    axes = ax2,
    **style,
    node_sizes = 0,
    node_labels = 'support');

tet30.ladderize(1).draw(
    axes = ax3,
    **style,
    node_sizes = 0,
    node_labels = 'support');

tet35.ladderize(1).draw(
    axes = ax4,
    **style,
    node_sizes = 0,
    node_labels = 'support');

tet40.ladderize(1).draw(
    axes = ax5,
    **style,
    node_sizes = 0,
    node_labels = 'support');

## hide the axes (e.g, ticks and splines)
ax0.show = False; ax1.show = False; ax2.show = False;
ax3.show = False; ax4.show = False; ax5.show = False;

## add names for the single trees
canvas.text(1000, 50, 'tetRAD/SVDQuartet — Clustering threshold 90 %', style = {"font-size": "24px"})
canvas.text(150, 125, '85 % missing data', style={"font-size": "18px"})
canvas.text(800, 125, '80 % missing data', style={"font-size": "18px"})
canvas.text(1450, 125, '75 % missing data', style={"font-size": "18px"})
canvas.text(150, 1025, '70 % missing data', style={"font-size": "18px"})
canvas.text(800, 1025, '65 % missing data', style={"font-size": "18px"})
canvas.text(1450, 1025, '60 % missing data', style={"font-size": "18px"});

In [None]:
import toyplot.pdf
toyplot.pdf.render(canvas, "/home/tim/GBS/Muscari/Mus_Analysis/Mus_tetRAD/tetRAD_Figures/Suppl-Fig_Mus_tetRAD-consens_clust90_20210816_15-20-25-30-35-40_Anno.pdf");

##### Run tetRAD with clustering threshold `clust95` & Plot trees together

In [None]:
## store the paths to the *.snps.hdf5 files as values in a dictionary with assembly names as keys
snps_files = {
    "pop15": "/home/tim/GBS/Muscari/Mus_Assembly/pops15_clust95_outfiles/pops15_clust95.snps.hdf5",
    "pop20": "/home/tim/GBS/Muscari/Mus_Assembly/pops20_clust95_outfiles/pops20_clust95.snps.hdf5",
    "pop25": "/home/tim/GBS/Muscari/Mus_Assembly/pops25_clust95_outfiles/pops25_clust95.snps.hdf5",
    "pop30": "/home/tim/GBS/Muscari/Mus_Assembly/pops30_clust95_outfiles/pops30_clust95.snps.hdf5",
    "pop35": "/home/tim/GBS/Muscari/Mus_Assembly/pops35_clust95_outfiles/pops35_clust95.snps.hdf5",
    "pop40": "/home/tim/GBS/Muscari/Mus_Assembly/pops40_clust95_outfiles/pops40_clust95.snps.hdf5"
}

In [None]:
## Iterate through the dictionary and run a tetRAD analysis for each assembly

for key, value in snps_files.items():
    tet = ipa.tetrad(
        name = "Mus_tet_clust95_" + str(key),
        data = value,
        workdir = "./Mus_Analysis/Mus_tetRAD/tet_clust95",
        nquartets = 1e6,
        nboots = 200)
    ## run 
    tet.run(auto = True, force = True)

In [None]:
## Plot all six clust95 tetRAD coalescent trees together
## Load trees
tet15 = toytree.tree("./Mus_Analysis/Mus_tetRAD/tet_clust95/Mus_tet_clust95_pop15.tree.cons").root(wildcard = "Brimeura")
tet20 = toytree.tree("./Mus_Analysis/Mus_tetRAD/tet_clust95/Mus_tet_clust95_pop20.tree.cons").root(wildcard = "Brimeura")
tet25 = toytree.tree("./Mus_Analysis/Mus_tetRAD/tet_clust95/Mus_tet_clust95_pop25.tree.cons").root(wildcard = "Brimeura")
tet30 = toytree.tree("./Mus_Analysis/Mus_tetRAD/tet_clust95/Mus_tet_clust95_pop30.tree.cons").root(wildcard = "Brimeura")
tet35 = toytree.tree("./Mus_Analysis/Mus_tetRAD/tet_clust95/Mus_tet_clust95_pop35.tree.cons").root(wildcard = "Brimeura")
tet40 = toytree.tree("./Mus_Analysis/Mus_tetRAD/tet_clust95/Mus_tet_clust95_pop40.tree.cons").root(wildcard = "Brimeura")

## set dimensions of the canvas
canvas = toyplot.Canvas(width = 2000, height = 2000)

## dissect canvas into multiple cartesian areas (x1, x2, y1, y2)
ax0 = canvas.cartesian(bounds=('2%',  '30%', '5%',  '47.5%'))
ax1 = canvas.cartesian(bounds=('33%', '63%', '5%',  '47.5%'))
ax2 = canvas.cartesian(bounds=('66%', '96%', '5%',  '47.5%'))
ax3 = canvas.cartesian(bounds=('2%',  '30%', '50%', '97.5%'))
ax4 = canvas.cartesian(bounds=('33%', '63%', '50%', '97.5%'))
ax5 = canvas.cartesian(bounds=('66%', '96%', '50%', '97.5%'))

## define the style once and use it for every tree
style = {
    "tip_labels_align": True,
    "tip_labels_style": {"font-size": "11px"},
    "node_labels_style":{"font-size": "12px",
                        "baseline-shift": "7px",
                        "-toyplot-anchor-shift": "-13px"},
}
tet15.ladderize(1).draw(
    axes = ax0,
    **style,
    node_sizes = 0,
    node_labels = 'support');

tet20.ladderize(1).draw(
    axes = ax1,
    **style,
    node_sizes = 0,
    node_labels = 'support');

tet25.ladderize(1).draw(
    axes = ax2,
    **style,
    node_sizes = 0,
    node_labels = 'support');

tet30.ladderize(1).draw(
    axes = ax3,
    **style,
    node_sizes = 0,
    node_labels = 'support');

tet35.ladderize(1).draw(
    axes = ax4,
    **style,
    node_sizes = 0,
    node_labels = 'support');

tet40.ladderize(1).draw(
    axes = ax5,
    **style,
    node_sizes = 0,
    node_labels = 'support');

## hide the axes (e.g., ticks and spines)
ax0.show = False; ax1.show = False; ax2.show = False;
ax3.show = False; ax4.show = False; ax5.show = False;

## add names for the single trees
canvas.text(1000, 50, 'tetRAD/SVDQuartet — Clustering threshold 95 %', style = {"font-size": "24px"})
canvas.text(150, 125, '85 % missing data', style={"font-size": "18px"})
canvas.text(800, 125, '80 % missing data', style={"font-size": "18px"})
canvas.text(1450, 125, '75 % missing data', style={"font-size": "18px"})
canvas.text(150, 1025, '70 % missing data', style={"font-size": "18px"})
canvas.text(800, 1025, '65 % missing data', style={"font-size": "18px"})
canvas.text(1450, 1025, '60 % missing data', style={"font-size": "18px"});
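
The six `bounds` tuples in the cell above describe a grid of two rows and three columns. As a minimal sketch, such a grid could also be computed. The helper `grid_bounds` and its `margin` and `gap` parameters are hypothetical, and the evenly spaced values it returns are not identical to the hand-tuned bounds used above.

```python
def grid_bounds(nrows, ncols, margin=2.0, gap=3.0):
    """Return (x1, x2, y1, y2) percent strings, row by row, for an even grid.

    margin: outer margin in percent of the canvas (assumed value)
    gap:    space between neighbouring panels in percent (assumed value)
    """
    width = (100 - 2 * margin - (ncols - 1) * gap) / ncols
    height = (100 - 2 * margin - (nrows - 1) * gap) / nrows
    bounds = []
    for row in range(nrows):
        y1 = margin + row * (height + gap)
        for col in range(ncols):
            x1 = margin + col * (width + gap)
            bounds.append(
                (f"{x1:g}%", f"{x1 + width:g}%", f"{y1:g}%", f"{y1 + height:g}%")
            )
    return bounds

bounds = grid_bounds(nrows=2, ncols=3)
for b in bounds:
    print(b)
```

Each tuple could then be passed to `canvas.cartesian(bounds=...)`, and the six trees drawn in one loop over paired trees and axes rather than in six separate `draw` calls.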

In [None]:
import toyplot.pdf
toyplot.pdf.render(canvas, "/home/tim/GBS/Muscari/Mus_Analysis/Mus_tetRAD/tetRAD_Figures/Suppl-Fig_Mus_tetRAD-consens_clust95_20210824_15-20-25-30-35-40_Anno.pdf");

#### Plot specific tetRAD trees
##### Plot cloud tree with custom tip order

In [22]:
treeorder = ["Brimeura_amethystina_W6084", "Bellevalia_paradoxa_ED1272",
           "Bellevalia_dubia_W6083", "Bellevalia_speciosa_W6085",
           "Muscari_racemosum_ED1258", "Muscari_macrocarpum_ED1252",
           "Pseudomuscari_chalusicum_ED1255", "Pseudomuscari_azureum_ED1270",
           "Muscari_parviflorum_ED1245", "Pseudomuscari_inconstrictum_ED3234",
           "Muscari_commutatum_ED3538", "Muscari_sivrihisardaghlarensis_ED1278",
           "Muscari_anatolicum_W6087", "Muscari_vularlii_ED3232",
           "Muscari_discolor_ED1266", "Pseudomuscari_pallens_ED1267",
           "Pseudomuscari_coeruleum_ED1261", "Muscari_adilii_W6090",
           "Muscari_armeniacum_ED1244", "Muscari_armeniacum_W6089",
           "Muscari_neglectum_ED1253", "Muscari_baeticum_ED1281",
           "Muscari_botryoides_ED1279", "Muscari_neglectum_ED1254",
           "Muscari_pulchellum_ED3231", "Muscari_kerkis_ED1280",
           "Muscari_bourgaei_ED1259", "Muscari_latifolium_ED1265",
           "Leopoldia_tenuiflora_ED1263", "Leopoldia_longipes_ED3233",
           "Muscari_massayanum_ED1251", "Leopoldia_neumannii_ED1243",
           "Leopoldia_neumannii_ED1607", "Muscari_mirum_ED1250",
           "Leopoldia_matritensis_ED1282", "Leopoldia_spreitzenhoferi_ED1248",
           "Leopoldia_cycladica_W6082", "Leopoldia_weissii_W6081",
           "Leopoldia_caucasica_ED1262", "Leopoldia_comosa_ED3539",
           "Leopoldia_comosa_ED3965", "Leopoldia_comosa_ED1274", "Leopoldia_comosa_ED1256"]

In [None]:
## Load the 200 bootstrap trees from the pops30 tetRAD analysis and root them
tetcloud30 = toytree.mtree("/home/tim/GBS/Muscari/Mus_Analysis/Mus_tetRAD/tet_clust90/Mus_tet_clust90_pop30.tree.boots")
tetcloud30.treelist = [i.root(["Brimeura_amethystina_W6084"]) for i in tetcloud30.treelist]

## plot the rooted bootstrap trees as a cloud tree
canvas, axes, mark = tetcloud30.draw_cloud_tree(
    height = 600,
    width = 400,
    
    ## define a fixed tip order to make it comparable with the cons tree
    fixed_order = treeorder,
    use_edge_lengths = False,
    edge_style = {"stroke-opacity": 0.05,
                  "stroke-width": 1}
);
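
A hand-typed tip order of more than 40 names is easy to get wrong, and a misspelled or duplicated name will most likely make the cloud tree plot fail or mislead. The following is a minimal sketch of a sanity check. The helper `check_tip_order` is hypothetical, and the short name lists are only a toy example built from sample names used in this notebook.

```python
from collections import Counter

def check_tip_order(order, tip_names):
    """Report duplicates in `order` and mismatches against `tip_names`."""
    counts = Counter(order)
    duplicates = sorted(name for name, n in counts.items() if n > 1)
    missing = sorted(set(tip_names) - set(order))   # in the tree, not in the list
    unknown = sorted(set(order) - set(tip_names))   # in the list, not in the tree
    return {"duplicates": duplicates, "missing": missing, "unknown": unknown}

# Toy example: one name is duplicated and one tip is missing from the order
tips = ["Brimeura_amethystina_W6084", "Bellevalia_dubia_W6083", "Muscari_mirum_ED1250"]
order = ["Brimeura_amethystina_W6084", "Bellevalia_dubia_W6083", "Bellevalia_dubia_W6083"]

report = check_tip_order(order, tips)
print(report)
```

In the notebook, the tip names could be taken from one of the loaded trees, for example with `tetcloud30.treelist[0].get_tip_labels()`, and compared against `treeorder`.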


#### Plot consensus tree against cloud tree

In [None]:
## Load the tetRAD consensus tree and root it with Brimeura
constree30 = toytree.tree("./Mus_Analysis/Mus_tetRAD/tet_clust90/Mus_tet_clust90_pop30.tree.cons" ).root(wildcard = "Brimeura")

## Load the tetRAD bootstrap trees and root them with Brimeura
tetcloud30 = toytree.mtree("/home/tim/GBS/Muscari/Mus_Analysis/Mus_tetRAD/tet_clust90/Mus_tet_clust90_pop30.tree.boots")
tetcloud30.treelist = [i.root(["Brimeura_amethystina_W6084"]) for i in tetcloud30.treelist]

## set dimensions of the canvas
canvas = toyplot.Canvas(width = 1300, height = 900)

## dissect canvas into multiple cartesian areas (x1, x2, y1, y2)
ax0 = canvas.cartesian(bounds=('5%',  '47.5%', '5%',  '95%'))
ax1 = canvas.cartesian(bounds=('52.5%', '95%', '5%',  '95%'))

# call draw with the 'axes' argument to pass it to a specific cartesian area
style = {"tip_labels_align": True,
         "tip_labels_style": {"font-size": "12px"},
         "node_labels_style":{"font-size": "12px",
                              "baseline-shift": "7px",
                              "-toyplot-anchor-shift": "-13px"},
}

cstyle = {"tip_labels_align": True,
          "layout": 'l',
          "tip_labels_style": {"font-size": "12px"},
          "node_labels_style":{"font-size": "12px",
                               "baseline-shift": "7px",
                               "-toyplot-anchor-shift": "-13px"},
}

constree30.ladderize(1).draw(
    axes = ax0,
    **style,
    node_sizes = 0,
    node_labels = 'support');

## plot the rooted bootstrap trees as a cloud tree
tetcloud30.draw_cloud_tree(
    axes = ax1,
    fixed_order = treeorder,  ## define a fixed tip order to make it comparable with the cons tree
    **cstyle,
    use_edge_lengths = False,
    #tip_labels = False,
    edge_style = {"stroke-opacity": 0.05,
                  "stroke-width": 1}
);

# hide the axes (e.g., ticks and spines)
ax0.show = False; ax1.show = False;

In [38]:
import toyplot.pdf
toyplot.pdf.render(canvas, "/home/tim/GBS/Muscari/Mus_Analysis/Mus_tetRAD/tetRAD_Figures/Fig_Mus_tet_clust90_cons-cloud_20210816_pops30.pdf");

#### Plot RAxML tree against tetRAD consensus tree

In [None]:
## Load the tetRAD consensus tree and root it with Brimeura
#constree30 = toytree.tree("./Mus_Analysis/Mus_tetRAD/tet_clust90/Mus_tet_clust90_pop30.tree.cons" ).root(wildcard = "Brimeura")
constree30 = toytree.tree("/home/tim/GBS/Muscari/Mus_Analysis/Mus_tetRAD/tet_clust85/Mus_tet_pop20.tree.cons").root(wildcard = "Brimeura")

tre = toytree.tree("/home/tim/GBS/Muscari/Mus_Analysis/Mus_RAxML/Mus_RAxML_clust85_20210812/RAxML_bipartitions.pops_20.phy")
rtre = tre.root(wildcard = "Brimeura")

## Define the leucantha clade to be rotated in the tree
comosa = ["Leopoldia_cycladica_W6082", "Leopoldia_weissii_W6081", 
          "Leopoldia_spreitzenhoferi_ED1248", "Leopoldia_matritensis_ED1282", "Leopoldia_caucasica_ED1262"]

## set dimensions of the canvas
canvas = toyplot.Canvas(width = 1400, height = 900)

## dissect canvas into multiple cartesian areas (x1, x2, y1, y2)
ax0 = canvas.cartesian(bounds=('5%',  '60%', '5%',  '95%'))
ax1 = canvas.cartesian(bounds=('57.5%', '95%', '5%',  '95%'))

# call draw with the 'axes' argument to pass it to a specific cartesian area
style = {"tip_labels_align": True,
         "tip_labels_style": {"font-size": "12px"},
         "node_labels_style":{"font-size": "12px",
                              "baseline-shift": "7px",
                              "-toyplot-anchor-shift": "-13px"},
}

cstyle = {"tip_labels_align": True,
          "layout": 'l',
          "tip_labels_style": {"font-size": "12px"},
          "node_labels_style":{"font-size": "12px",
                               "baseline-shift": "7px",
                               "-toyplot-anchor-shift": "13px"},
}

#rotate_node(wildcard = "comosa").

rtre.ladderize(1).draw(
    axes = ax0,
    **style,
    node_labels = 'support',
    node_sizes = 0,
    );



constree30.ladderize(1).draw(
    axes = ax1,
    **cstyle,
    node_sizes = 0,
    node_labels = 'support');

# hide the axes (e.g., ticks and spines)
ax0.show = False; ax1.show = False;

In [None]:
import toyplot.pdf
toyplot.pdf.render(canvas, "/home/tim/GBS/Muscari/Mus_Analysis/FiguresForPaper/Fig_Mus_RAxML_tet_clust85_pops30.pdf");

## 2. Principal component analysis (PCA) of ***Muscari*** with outgroups removed


In [None]:
## load the SNP data (*.snps.hdf5) of the assembly without outgroups for the PCA
dataclust90 = "/home/tim/GBS/Muscari/Mus_Assembly/nout_clust90_outfiles/nout_clust90.snps.hdf5"

Assign the samples to five clades according to the results of the phylogenetic reconstructions.

In [34]:
# group individuals into populations
imap = {
    "Leop": ["Leopoldia_tenuiflora_ED1263", "Muscari_massayanum_ED1251", "Leopoldia_longipes_ED3233", 
             "Leopoldia_neumannii_ED1243", "Leopoldia_neumannii_ED1607", "Muscari_mirum_ED1250",
             "Leopoldia_caucasica_ED1262", "Leopoldia_matritensis_ED1282", "Leopoldia_comosa_ED3539",
             "Leopoldia_comosa_ED1274", "Leopoldia_comosa_ED3965", "Leopoldia_comosa_ED1256",
             "Leopoldia_weissii_W6081", "Leopoldia_weissii_ED1608", "Leopoldia_cycladica_W6082",
             "Leopoldia_spreitzenhoferi_ED1248"],
    "Musc": ["Pseudomuscari_pallens_ED1267", "Pseudomuscari_coeruleum_ED1261", 
             "Muscari_sivrihisardaghlarensis_ED1278", "Muscari_anatolicum_W6087", "Muscari_vularlii_ED3232",
             "Muscari_discolor_ED1266", "Muscari_adilii_W6090", "Muscari_armeniacum_ED1244", 
             "Muscari_armeniacum_W6089", "Muscari_neglectum_ED1253", "Muscari_neglectum_ED1254",
             "Muscari_baeticum_ED1281", "Muscari_botryoides_ED1279", "Muscari_commutatum_ED3538"],
    "Pull": ["Muscari_pulchellum_ED3231", "Muscari_kerkis_ED1280", "Muscari_bourgaei_ED1259", "Muscari_latifolium_ED1265"],
    "Pseu": ["Pseudomuscari_chalusicum_ED1255", "Pseudomuscari_inconstrictum_ED3234",
             "Pseudomuscari_azureum_ED1270", "Muscari_parviflorum_ED1245"],
    "Mosc": ["Muscari_racemosum_ED1258", "Muscari_macrocarpum_ED1252"],
}

# require that 50% of samples have data in each group
minmap = {i: 0.5 for i in imap}
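
Because the `imap` dictionary is typed by hand, a sample could end up in two groups by accident. The following is a minimal sketch of a check that inverts the mapping and reports group sizes. The helper `invert_imap` is hypothetical, and the small dictionary is a toy subset of the groups defined above.

```python
def invert_imap(imap):
    """Return {sample: group}; raise if a sample is assigned to two groups."""
    sample_to_group = {}
    for group, samples in imap.items():
        for sample in samples:
            if sample in sample_to_group:
                raise ValueError(
                    f"{sample} is in both {sample_to_group[sample]} and {group}"
                )
            sample_to_group[sample] = group
    return sample_to_group

# Toy subset of the imap used in this notebook
toy_imap = {
    "Pull": ["Muscari_pulchellum_ED3231", "Muscari_kerkis_ED1280"],
    "Mosc": ["Muscari_racemosum_ED1258", "Muscari_macrocarpum_ED1252"],
}

sample_to_group = invert_imap(toy_imap)
group_sizes = {group: len(samples) for group, samples in toy_imap.items()}
print(sample_to_group)
print(group_sizes)
```

The inverted mapping is also convenient later on, for example to label the rows of a table of PC axes with their group.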

Run a first PCA to test different parameter settings and to check whether the results are stable.

In [None]:
# init pca object with input data and (optional) parameter options
pca = ipa.pca(
    data = dataclust90,
    imap = imap,
    minmap = minmap,
    mincov = 0.25,
    impute_method = "sample",
)

In [None]:
# run the PCA analysis
pca.run()

In [None]:
## store the PC axes as a dataframe
df = pd.DataFrame(pca.pcaxes[0], index=pca.names)

## write the PC axes to a CSV file
df.to_csv("pca_analysis.csv")

## show the first ten samples and the first 10 PC axes
df.iloc[:10, :10].round(2)

### Running a PCA with replicate subsampling of unlinked SNPs
This was used for publication.

In [None]:
## init pca object
pca2 = ipa.pca(
    data = dataclust90,
    imap = imap,
    minmap = minmap,
    mincov = 0.5,
    impute_method = "sample",
)

## run 25 replicates with a fixed seed (settings above: impute_method="sample", mincov=0.5)
pca2.run(nreplicates = 25, seed = 123)

## plot different combinations
pca2.draw(0, 2);
pca2.draw(0, 1);

In [None]:
## save the plot of the replicated PCA (pca2), which was used for publication
pca2.draw(outfile = "/home/tim/GBS/Muscari/Mus_Analysis/FiguresForPaper/Fig_Mus_PCA.pdf");