# Tutorial of MendelAimSelection

## Julia version

Current code supports Julia version 1.0+ 


## When to use MendelAimSelection

This [Julia](http://julialang.org/) package selects the SNPs in your data that are most informative for predicting ancestry, that is, the best Ancestry Informative Markers (AIMs).

MendelAimSelection is one component of the umbrella [OpenMendel](https://openmendel.github.io) project.

## Background
Modern genetic studies often include people of many ethnicities or of mixed ethnicity. The potential for confounding ethnicity with disease risk is well known. MendelAimSelection uses an extension of an algorithm described by [Rosenberg et al.](https://www.ncbi.nlm.nih.gov/pubmed/14631557) to quickly find the N most informative AIMs within the data set.

*Rosenberg NA, Li LM, Ward R, Pritchard JK (2003) Informativeness of genetic markers
for inference of ancestry. Amer J Hum Genet 73:1402–1422.*
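
To make the idea of an "informative" marker concrete, here is a minimal sketch of the informativeness for assignment statistic, I_n, from Rosenberg et al. (2003). For a marker with allele frequencies p_ij in K populations and mean frequency p_j across populations, I_n is the sum over alleles j of -p_j log p_j + (1/K) Σ_i p_ij log p_ij. This illustrates the published statistic only. It is not taken from the MendelAimSelection source, whose extension of the algorithm may differ, and the function name `informativeness` and the matrix layout are assumptions made for this example.

```julia
# Informativeness for assignment (I_n) of a single marker.
# freqs[i, j] is the frequency of allele j in population i; each row sums to 1.
# Hypothetical helper for illustration; not part of MendelAimSelection.
function informativeness(freqs::AbstractMatrix{<:Real})
    K = size(freqs, 1)                        # number of populations
    xlogx(x) = x > 0 ? x * log(x) : 0.0       # convention: 0 * log(0) = 0
    total = 0.0
    for j in 1:size(freqs, 2)
        pbar = sum(freqs[:, j]) / K           # mean frequency of allele j
        total += -xlogx(pbar) + sum(xlogx, freqs[:, j]) / K
    end
    return total
end

# A SNP fixed for different alleles in two populations reaches the maximum, log(2).
fixed = informativeness([1.0 0.0; 0.0 1.0])

# A SNP with identical frequencies in both populations carries no information.
flat = informativeness([0.3 0.7; 0.3 0.7])
```

I_n ranges from 0, when all populations share the same allele frequencies, to log(K), when each population is fixed for its own allele.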

## Installation

*Note: Since the OpenMendel packages are not yet registered, the three OpenMendel packages (1) [SnpArrays](https://openmendel.github.io/SnpArrays.jl/latest/), (2) [MendelSearch](https://openmendel.github.io/MendelSearch.jl), and (3) [MendelBase](https://openmendel.github.io/MendelBase.jl) **must** be installed before any other OpenMendel package is installed. It is easiest if these three packages are installed in the above order.*

If you have not already installed MendelAimSelection, then within Julia, use the package manager to install it:

`pkg> add https://github.com/OpenMendel/MendelAimSelection.jl.git`

Or, once the OpenMendel packages are registered, simply use:

`pkg> add MendelAimSelection`

This package supports Julia v1.0+

## Input Files
The MendelAimSelection analysis package accepts the following input files. Example input files can be found in the [data](https://github.com/OpenMendel/MendelAimSelection.jl/tree/master/data) subfolder of the MendelAimSelection project. An analysis won't always need every file type below. The input for all examples in this tutorial was obtained from the 1000 Genomes Project.

* [Control File](https://openmendel.github.io/MendelAimSelection.jl/#control-file): Specifies the names of your data input and output files and any optional parameters (*keywords*) for the analysis. For a list of common keywords, see the [Keywords Table](https://openmendel.github.io/MendelBase.jl/#keywords-table). The Control file is optional. If you don't use a Control file, you will enter your keywords directly in the command line.
* [Locus File](https://openmendel.github.io/MendelBase.jl/#locus-file): Names and describes the genetic loci in your data.
* [Pedigree File](https://openmendel.github.io/MendelBase.jl/#pedigree-file): Gives information about your individuals, such as name, sex, family structure, and ancestry.
* [Phenotype File](https://openmendel.github.io/MendelBase.jl/#phenotype-file): Lists the available phenotypes.
* [SNP Definition File](https://openmendel.github.io/MendelBase.jl/#snp-definition-file): Defines your SNPs with information such as SNP name, chromosome, position, allele names, and allele frequencies.
* [SNP Data File](https://openmendel.github.io/MendelBase.jl/#snp-data-file): Holds the genotypes for your data set. Must be a standard binary PLINK BED file in SNP-major format. If you have a SNP data file, you must also have a SNP definition file.

## Control file
The Control file is a text file consisting of keywords and their assigned values. The format of the Control file is:

	Keyword = Keyword_Value(s)

Below is an example of a simple Control file to run AIM Selection:

	#
	# Input and Output files.
	#
	field_separator = ' '
	pedigree_file = 1000genomes_chr1_eas.ped
	
	plink_field_separator = '	'
	plink_input_basename = 1000genomes_chr1_eas
	
	output_field_separator = ','
	output_file = 1000genomes_chr1_eas Output.txt
	#
	# Analysis parameters for AIM Selection option.
	#
    
In the example above, the keywords specify the input and output files (*1000genomes_chr1_eas.bed*, *1000genomes_chr1_eas.snp*, *1000genomes_chr1_eas.ped* and *1000genomes_chr1_eas Output.txt*) along with their field separators. The text after the '=' is the keyword value. The names of keywords are *not* case sensitive. (The keyword values *may* be case sensitive.) A list of OpenMendel keywords common to most analysis packages can be found [here](https://openmendel.github.io/MendelBase.jl/#keywords-table).
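
The `Keyword = Keyword_Value(s)` format is simple enough to illustrate with a few lines of Julia. The sketch below is not the parser that OpenMendel uses. The function name `read_control_lines` is hypothetical, and the sketch assumes only what is described above: lines starting with `#` are comments, the keyword is the text before the first `=`, and keyword names are case insensitive while values are kept as written.

```julia
# Collect keyword/value pairs from the lines of a Control file.
# Hypothetical helper for illustration; OpenMendel's own parser may behave differently.
function read_control_lines(lines)
    keywords = Dict{String,String}()
    for line in lines
        text = strip(line)
        # Skip blank lines and comment lines.
        (isempty(text) || startswith(text, "#")) && continue
        # Split on the first '=' only, so values may themselves contain '='.
        parts = split(text, "="; limit = 2)
        length(parts) == 2 || continue
        # Keyword names are case insensitive; values are stored unchanged.
        keywords[lowercase(strip(parts[1]))] = String(strip(parts[2]))
    end
    return keywords
end

example = [
    "#",
    "# Input and Output files.",
    "#",
    "Pedigree_File = 1000genomes_chr1_eas.ped",
    "output_file = 1000genomes_chr1_eas Output.txt",
]
kw = read_control_lines(example)
```

To apply the same function to a file on disk, pass `eachline("Control_file.txt")` instead of the vector of strings.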

## Data Files
AIM Selection requires a [Control file](https://openmendel.github.io/MendelBase.jl/#control-file) and a [Pedigree file](https://openmendel.github.io/MendelBase.jl/#pedigree-file). Genotype data can be included in the Pedigree file, in which case a [Locus file](https://openmendel.github.io/MendelBase.jl/#locus-file) is required. Alternatively, genotype data can be provided in a [SNP Data file](https://openmendel.github.io/MendelBase.jl/#snp-data-file), in which case a [SNP Definition file](https://openmendel.github.io/MendelBase.jl/#snp-definition-file) is required. Mendel AIM Selection will also accept [PLINK format](http://zzz.bwh.harvard.edu/plink) FAM and BIM files. Details on the format and contents of the Control and data files can be found on the [MendelBase](https://openmendel.github.io/MendelBase.jl) documentation page. There are example data files in the AIM Selection [data](https://github.com/OpenMendel/MendelAIMSelection.jl/tree/master/data) folder.

## Using PLINK compressed file as input

MendelAimSelection accepts [PLINK binary format](https://www.cog-genomics.org/plink2/formats#bed) as input, in which case all three files of the triplet (`data.bim`, `data.bed`, `data.fam`) must be present. In this tutorial, there are no examples that use PLINK binary format to import pedigree and SNP information. If such files are available, you can import the data by specifying the following in the control file:

`plink_input_basename = data` 

However, a PLINK .fam file sometimes contains person IDs (2nd column of the .fam file) that are not unique across pedigrees, which is currently **not** permitted in OpenMendel. A person's ID cannot be repeated in other pedigrees, even if it is contextually clear that they are different persons. This will be fixed in the near future.
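
Until then, it can help to check a .fam file for repeated person IDs before running the analysis. The sketch below is one way to do so, assuming the standard whitespace-delimited PLINK .fam layout in which the first column is the pedigree (family) ID and the second is the person ID. The function name `duplicated_person_ids` is hypothetical and not part of any OpenMendel package.

```julia
# Return the person IDs that appear in more than one pedigree of a .fam file.
# Hypothetical helper for illustration; not part of OpenMendel.
function duplicated_person_ids(fam_lines)
    pedigrees_by_person = Dict{String,Set{String}}()
    for line in fam_lines
        fields = split(line)                  # .fam files are whitespace-delimited
        length(fields) >= 2 || continue       # skip blank or malformed lines
        pedigree, person = String(fields[1]), String(fields[2])
        push!(get!(pedigrees_by_person, person, Set{String}()), pedigree)
    end
    return sort([id for (id, peds) in pedigrees_by_person if length(peds) > 1])
end

# Person "1" appears in both pedigree F1 and pedigree F2.
fam = ["F1 1 0 0 1 -9", "F1 2 0 0 2 -9", "F2 1 0 0 1 -9"]
clashes = duplicated_person_ids(fam)
```

For a file on disk, call `duplicated_person_ids(eachline("data.fam"))`. An empty result means that every person ID is confined to a single pedigree.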

## Running the Analysis

To run this analysis package, first launch Julia. Then load the package with the command:

`julia> using MendelAimSelection`

Next, if necessary, change to the directory containing your files, for example,

`julia> cd(expanduser("~/path/to/data/files/"))`

Finally, to run the analysis using the parameters in your Control file, for example, Control_file.txt, use the command:

`julia> AimSelection("Control_file.txt")`

*Note: The package is called* MendelAimSelection *but the analysis function is called simply* AimSelection.

## Output Files
Each option will create output files specific to that option, and will save them to the same directory that holds the input data files.

# Example 1

### Step 0: Load the OpenMendel package and then go to the directory containing the data files:
First we load the MendelAimSelection package.

	using MendelAimSelection

In this example we go to the directory containing the example data files that come with this package.

	cd(MendelAimSelection.datadir())
	pwd()

### Step 1: Preparing the pedigree files:
Recall the structure of a [valid pedigree file](https://openmendel.github.io/MendelBase.jl/#pedigree-file). Note that we require a header line. Let's examine the first few lines of such a file:

	;head -10 "1000genomes_chr1_eas.ped"

In this example we have unrelated individuals who were genotyped as part of the [1000 Genomes Project](http://www.internationalgenome.org/about). They come from 5 different ethnic groups: Southern Han Chinese (CHS), Chinese Dai in Xishuangbanna, China (CDX), Kinh in Ho Chi Minh City, Vietnam (KHV), Han Chinese in Beijing, China (CHB), and Japanese in Tokyo, Japan (JPT). We want to find SNPs that differ in their allele frequencies among two or more of these groups.
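
To see what "differ in their allele frequencies" means in practice, the sketch below computes the frequency of one allele of a single SNP within each group. It is an illustration only and does not reproduce how MendelAimSelection reads or processes genotypes. The function name `allele_frequencies_by_group` is hypothetical, and the genotype coding (0, 1, or 2 copies of the counted allele, with `missing` for untyped individuals) is an assumption made for this example.

```julia
# Per-group frequency of the counted allele at one SNP.
# dosages[k] is the number of copies (0, 1, or 2) carried by individual k,
# or `missing` if untyped; groups[k] is that individual's group label.
# Hypothetical helper for illustration; not part of MendelAimSelection.
function allele_frequencies_by_group(dosages, groups)
    copies = Dict{String,Int}()     # copies of the counted allele per group
    alleles = Dict{String,Int}()    # total typed alleles per group
    for (d, g) in zip(dosages, groups)
        d === missing && continue   # untyped individuals contribute nothing
        copies[g] = get(copies, g, 0) + d
        alleles[g] = get(alleles, g, 0) + 2
    end
    return Dict(g => copies[g] / alleles[g] for g in keys(alleles))
end

dosages = [0, 1, 2, 2, missing]
groups = ["CHB", "CHB", "JPT", "JPT", "JPT"]
freqs = allele_frequencies_by_group(dosages, groups)
```

In this toy data set the allele is rare in CHB and fixed in JPT, which is the kind of contrast that makes a SNP useful as an AIM.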

### Step 2: Preparing the control file
A control file gives specific instructions to `MendelAimSelection`. To select the SNPs that are most informative for predicting ancestry in your data (the best Ancestry Informative Markers), a minimal control file looks like the following. In this example the data come from chromosome 1.

	;cat "AIM 1000genomes_chr1 Control.txt"

### Step 3: Run the analysis in the Julia REPL or directly in a notebook


	AimSelection("AIM 1000genomes_chr1 Control.txt")

### Step 4: Output File

`MendelAimSelection` should have generated the file `1000genomes_chr1_eas Output.txt` in your local directory. This file lists your markers, and gives an AIMRank for each marker (see below).

	;cat "1000genomes_chr1_eas Output.txt"

### Step 5: Interpreting the result

`MendelAimSelection` uses the genotypes and ethnicities in your pedigree to assign a score - the *AIMRank* - to each marker. The higher the AIMRank, the better the marker at differentiating two or more of the ethnic groups. Note that the rankings may change if the ethnic groups in your pedigree change.
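
If you only need the N best markers, you can sort the output by AIMRank. The sketch below is a rough illustration that uses only base Julia. It assumes a comma-separated file with a header line and a column literally named `AIMRank`; the actual column names and layout of the output file are not shown in this tutorial, so check them against your own output first. The function name `top_rows` is hypothetical. Following the statement above, rows are sorted so that the highest AIMRank comes first.

```julia
# Return the n rows with the highest value in the rank column.
# Hypothetical helper; the column name "AIMRank" is an assumption about the file header.
function top_rows(lines, n; rank_column = "AIMRank")
    # Split each non-empty line on commas and trim surrounding whitespace.
    rows = [String.(strip.(split(line, ','))) for line in lines if !isempty(strip(line))]
    header, records = rows[1], rows[2:end]
    col = findfirst(isequal(rank_column), header)
    col === nothing && error("no column named $rank_column")
    # Highest rank first.
    sort!(records; by = r -> parse(Float64, r[col]), rev = true)
    return records[1:min(n, length(records))]
end

# Made-up rows, for illustration only.
toy = ["SNP,AIMRank", "rs1,3", "rs2,10", "rs3,7"]
best = top_rows(toy, 2)
```

For the real file, pass `eachline("1000genomes_chr1_eas Output.txt")` as the first argument.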

## Citation

If you use this analysis package in your research, please cite the following reference in the resulting publications:

*Lange K, Papp JC, Sinsheimer JS, Sripracha R, Zhou H, Sobel EM (2013) Mendel: The Swiss army knife of genetic analysis programs. Bioinformatics 29:1568-1570.*

## Acknowledgments

This project is supported by the National Institutes of Health under NIGMS awards R01GM053275 and R25GM103774 and NHGRI award R01HG006139.