# Tutorial of MendelKinship 
### last update: 2/7/2019

## Installation instructions

`MendelKinship` currently supports Julia versions 1.0 and 1.1, but it is an unregistered package. To install it, press `]` to enter the package manager mode and add the following packages:

```
add https://github.com/OpenMendel/SnpArrays.jl/
add https://github.com/OpenMendel/MendelSearch.jl
add https://github.com/OpenMendel/MendelBase.jl
add https://github.com/biona001/MendelKinship.jl
```

You will also need a few registered packages. Add them by typing:

```
add PlotlyJS Statistics StatsBase CSV ORCA DataFrames
```
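
The same installation can also be scripted from ordinary Julia code through the `Pkg` API. The following is a minimal sketch; the helper name `install_tutorial_dependencies` is made up for illustration and is not part of any OpenMendel package.

```julia
using Pkg

# URLs of the unregistered packages, copied from the instructions above.
unregistered_urls = [
    "https://github.com/OpenMendel/SnpArrays.jl/",
    "https://github.com/OpenMendel/MendelSearch.jl",
    "https://github.com/OpenMendel/MendelBase.jl",
    "https://github.com/biona001/MendelKinship.jl",
]

# Registered packages used later in this tutorial.
registered_packages = ["PlotlyJS", "Statistics", "StatsBase", "CSV", "ORCA", "DataFrames"]

# Hypothetical helper: install everything with one call (requires network access).
function install_tutorial_dependencies()
    for url in unregistered_urls
        Pkg.add(Pkg.PackageSpec(url=url))
    end
    Pkg.add(registered_packages)
end
```

Calling `install_tutorial_dependencies()` once should be equivalent to typing the `add` commands above in package manager mode.
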
For reproducibility, the machine information is listed below:

In [1]:
versioninfo()

Julia Version 1.0.3
Commit 099e826241 (2018-12-18 01:34 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin14.5.0)
  CPU: Intel(R) Core(TM) i7-3740QM CPU @ 2.70GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.0 (ORCJIT, ivybridge)


## When to use MendelKinship

`MendelKinship.jl` is capable of calculating the theoretical kinship coefficient $\Phi_{ij}$ as long as a [valid pedigree structure](https://openmendel.github.io/MendelBase.jl/#pedigree-file) is provided. When SNP markers are available, `MendelKinship.jl` can also calculate empirical kinship coefficients using GRM, robust GRM, or Method of Moments methods (see [this paper](https://onlinelibrary.wiley.com/doi/abs/10.1002/gepi.20584) and [this paper](https://academic.oup.com/bioinformatics/article/26/22/2867/228512) for details). Here we recommend the Robust GRM or MoM (default) method because their estimates are more robust in the presence of rare alleles. 
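
To make the idea of an empirical kinship estimate concrete, here is a minimal sketch of the classical GRM estimate, $\hat\Phi_{ij} = \frac{1}{2S}\sum_{k=1}^{S} \frac{(x_{ik}-2p_k)(x_{jk}-2p_k)}{2p_k(1-p_k)}$, where $x_{ik}$ is the allele count of person $i$ at SNP $k$ and $p_k$ is the allele frequency. This is not the code `MendelKinship` runs: the function name, the dense dosage matrix, the lack of missing-genotype handling, and the toy data are all assumptions made for illustration.

```julia
using Statistics

# Sketch of the classical GRM kinship estimate.
# X[i, k] is the number of copies (0, 1, or 2) of one allele carried by person i at SNP k.
function grm_kinship(X::AbstractMatrix{<:Real})
    p = vec(mean(X, dims=1)) ./ 2             # sample allele frequency of each SNP
    keep = findall(q -> 0 < q < 1, p)         # drop monomorphic SNPs (zero variance)
    pk = p[keep]'                             # 1 x S row of retained frequencies
    Z = (X[:, keep] .- 2 .* pk) ./ sqrt.(2 .* pk .* (1 .- pk))
    return (Z * Z') ./ (2 * length(keep))
end

# Toy data: 4 people, 3 SNPs (made-up values).
X = [0 1 2; 2 1 0; 1 1 1; 0 2 1]
Φ = grm_kinship(X)
```

Because each term is divided by $2p_k(1-p_k)$, SNPs with rare alleles can dominate the sum, which is one way to see why the plain GRM estimate is less stable when rare alleles are present.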

`MendelKinship` can optionally compare the empirical and theoretical kinships to check suspect pedigree structures and reveal hidden relatedness. The comparison can also reveal sample mix-ups or other laboratory errors that lead to inaccurate empirical kinships. The result is saved in a table sorted in descending order of the deviance between theoretical and empirical kinship. We optionally output 2 interactive plots that allow users to quickly pinpoint the pairs with the greatest theoretical vs empirical deviance.

## Data used in Examples

The input for all examples in this tutorial can be obtained from the free application [Mendel v16](http://software.genetics.ucla.edu/download?package=1) option 29a. These data were obtained from the 1000 Genomes Project and contain 85 people and 253141 SNPs, half of which have MAF $< 0.05$. Using these founders' genotypes, we simulated 127 extra people, resulting in 27 pedigrees and 212 people. Although the 85 individuals are treated as founders, they were actually somewhat related, and this is reflected in the kinship comparison in the 2nd example below. For more information on this dataset, please see Mendel's documentation example 29.4.

## Using PLINK compressed file as input

MendelKinship additionally accepts the [PLINK binary format](https://www.cog-genomics.org/plink2/formats#bed) as input, in which case all three files of the triplet (`data.bim`, `data.bed`, `data.fam`) must be present. No example in this tutorial uses these files to import pedigree and SNP information, but if they are available, one can import the data by specifying the following in the control file:

`plink_input_basename = data` 
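
Before running an analysis, one can check that the whole triplet is present. The following is a small sketch; `missing_plink_files` is a hypothetical helper written for this tutorial, not a function provided by `MendelKinship`.

```julia
# Hypothetical helper: return the PLINK files that are missing for a given basename.
function missing_plink_files(base::AbstractString)
    return [base * ext for ext in (".bed", ".bim", ".fam") if !isfile(base * ext)]
end

# An empty result means that data.bed, data.bim, and data.fam all exist.
absent = missing_plink_files("data")
```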

However, the .fam file sometimes contains person IDs (2nd column of the .fam file) that are not unique across pedigrees, which is currently **not** permitted in MendelKinship. A person's ID cannot be repeated in other pedigrees, even if it is contextually clear that they are different persons. This will be fixed in the near future.
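
A quick way to detect this problem is to scan the .fam file for person IDs that appear under more than one family ID. The sketch below assumes the usual whitespace-delimited .fam layout (family ID in column 1, person ID in column 2); the function name and the example lines are invented for illustration.

```julia
# Hypothetical helper: person IDs that occur in more than one pedigree of a .fam file.
function ids_shared_across_pedigrees(fam_lines)
    families = Dict{String, Set{String}}()    # person ID => set of family IDs
    for line in fam_lines
        fields = split(line)
        length(fields) < 2 && continue        # skip blank or malformed lines
        push!(get!(families, String(fields[2]), Set{String}()), String(fields[1]))
    end
    return sort([id for (id, fams) in families if length(fams) > 1])
end

# Made-up example: person ID 16 is used in both pedigree 1 and pedigree 2.
example = ["1 16 0 0 2 -9", "1 8228 0 0 2 -9", "2 16 0 0 1 -9"]
shared = ids_shared_across_pedigrees(example)
```

For a real file, `ids_shared_across_pedigrees(eachline("data.fam"))` would perform the same check; any ID it returns would need to be renamed before the file is used.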

## Analysis keywords available to users 

| Keyword | Default Value | Allowed value | Description |
| --- | --- | --- | --- |
|`kinship_output_file` | Kinship_Output_File.txt | User defined file name | OpenMendel generated output file with table of kinship coefficients |
|`repetitions` | 1 | Integer | Repetitions for sharing statistics |
|`xlinked_analysis` | false | boolean | Whether markers are on the X chromosome |
|`compare_kinships` | false | boolean | Whether we want to compare theoretical vs empiric kinship |
|`kinship_plot` | "" | User defined file name | A user specified name for a plot comparing theoretical and empiric kinship value |
|`z_score_plot` | "" | User defined file name | A user specified name for a plot of Fisher's z statistic |
|`grm_method` | MoM | GRM, MoM, Robust | Method used for empiric kinship calculation. Defaults to `MoM`, but users can choose the more common `GRM` or Robust GRM methods instead. (**Warning:** Based on our experience, Fisher's z score is very unreliable if the GRM method is used for rare (maf < 0.2) SNPs) |
|`maf_threshold` | 0.01 | Real number between 0 and 1 | The minor allele frequency threshold for the GRM computation |
|`deviant_pairs` | false | Integer less than $n(n+1)/2$ | Number of top deviant pairs (theoretical vs empiric kinship) the user wants to keep |

A list of OpenMendel keywords common to most analysis packages can be found [here](https://openmendel.github.io/MendelBase.jl/#keywords-table).
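
Keywords are set in a control file, one `keyword = value` pair per line, as in the examples later in this tutorial. As a sketch of how several of the keywords above combine, the snippet below writes such lines from Julia. The helper `write_control_file` and the particular values chosen (`Robust`, `0.05`, `20`) are illustrative assumptions, not recommendations.

```julia
# Hypothetical helper: write keyword/value pairs in the `keyword = value` format.
function write_control_file(io::IO, keywords)
    for (key, value) in keywords
        println(io, key, " = ", value)
    end
end

# File names are taken from the examples in this tutorial; the other values are made up.
keywords = [
    "pedigree_file" => "Ped29a.in",
    "snpdata_file" => "SNP_data29a.bed",
    "snpdefinition_file" => "SNP_def29a_converted.txt",
    "compare_kinships" => true,
    "grm_method" => "Robust",
    "maf_threshold" => 0.05,
    "deviant_pairs" => 20,
]

# Render to a string; use open("my_control.txt", "w") to write a file instead.
control_text = sprint(write_control_file, keywords)
```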

# Example 1: Theoretical Kinship Coefficient Calculation 

### Step 1: Preparing the pedigree files:
Recall what constitutes a [valid pedigree structure](https://openmendel.github.io/MendelBase.jl/#pedigree-file). Note that we require a header line. The extension `.in` has no particular meaning. Let's examine the first few lines of such a file:

In [2]:
;head -10 "Ped29a.in"

Pedigree,Person,Mother,Father,Sex,,,simTrait
  1       ,  16      ,          ,          ,  F       ,          ,  29.20564,
  1       ,  8228    ,          ,          ,  F       ,          ,  31.80179,
  1       ,  17008   ,          ,          ,  M       ,          ,  37.82143,
  1       ,  9218    ,  17008   ,  16      ,  M       ,          ,  35.08036,
  1       ,  3226    ,  9218    ,  8228    ,  F       ,          ,  28.32902,
  2       ,  29      ,          ,          ,  F       ,          ,  36.17929,
  2       ,  2294    ,          ,          ,  M       ,          ,  42.88099,
  2       ,  3416    ,          ,          ,  M       ,          ,  40.98316,
  2       ,  17893   ,  2294    ,  29      ,  F       ,          ,  35.55038,
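
In this layout, founders are the rows whose two parent fields (columns 3 and 4) are empty. The following sketch shows how such lines can be split and founders picked out; the helper names are invented here, and `MendelKinship` reads pedigree files through its own routines rather than through this code.

```julia
# Split one comma-separated pedigree line into whitespace-trimmed fields.
parse_pedigree_line(line) = strip.(split(line, ','))

# A founder has both parent fields (columns 3 and 4) empty.
is_founder(fields) = isempty(fields[3]) && isempty(fields[4])

# Three lines copied from the listing above.
lines = [
    "  1       ,  16      ,          ,          ,  F       ,          ,  29.20564,",
    "  1       ,  17008   ,          ,          ,  M       ,          ,  37.82143,",
    "  1       ,  9218    ,  17008   ,  16      ,  M       ,          ,  35.08036,",
]
records = parse_pedigree_line.(lines)
founders = [String(r[2]) for r in records if is_founder(r)]
```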


### Step 2: Preparing the control file
A control file gives specific instructions to `MendelKinship`. To perform a theoretical kinship calculation, a minimal control file looks like the following:

In [3]:
;cat "control_just_theoretical_29a.txt"

#
# Input and Output files.
#
pedigree_file = Ped29a.in
#
# Analysis parameters for Kinship option.
#
kinship_output_file = just_theoretical_output.txt

### Step 3: Run the analysis in Julia REPL or directly in notebook

We used the package Suppressor to hide warnings; these warnings will disappear once `MendelKinship` is fully updated for Julia version 1.0. Informative warnings and `MendelKinship` messages are nevertheless often printed, so it is best practice for new users to at least review them.

In [4]:
using MendelKinship
Kinship("control_just_theoretical_29a.txt")

 
 
     Welcome to OpenMendel's
     Kinship analysis option
        version 0.2.0
 
 
Reading the data.

The current working directory is "/Users/biona001/Benjamin_Folder/UCLA/research/open mendel related/Tutorials/Kinship".

Keywords modified by the user:

  control_file = control_just_theoretical_29a.txt
  kinship_output_file = just_theoretical_output.txt
  pedigree_file = Ped29a.in
 


Unnamed: 0_level_0,Pedigree,Person1,Person2,Kinship,Delta7,delta1,delta2,delta3,delta4,delta5,delta6,delta7,delta8,delta9
Unnamed: 0_level_1,String,String,String,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64
1,1,16,16,0.5,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
2,1,16,17008,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
3,1,16,3226,0.125,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
4,1,16,8228,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
5,1,16,9218,0.25,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
6,1,17008,17008,0.5,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
7,1,17008,3226,0.125,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
8,1,17008,9218,0.25,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
9,1,3226,3226,0.5,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
10,1,8228,17008,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0


 
Analyzing the data.

 
 
Mendel's analysis is finished.



### Step 4: Interpreting the result

`MendelKinship` should have generated the file `just_theoretical_output.txt` in your local directory. One can open the file directly, or import it into the Julia environment for ease of manipulation using the DataFrames package. The 4th column contains the desired theoretical kinship coefficient. The 5th column contains the (deterministically) estimated Delta7 matrix. The 6th through 14th columns contain the (stochastically) estimated Jacquard's 9 identity coefficients.
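
The kinship values in the 4th column can be checked by hand with the standard kinship recursion: $\Phi_{ii} = \frac{1}{2}(1 + \Phi_{ab})$ and $\Phi_{ij} = \frac{1}{2}(\Phi_{ia} + \Phi_{ib})$, where $a$ and $b$ are the parents of $j$ and $j$ is not an ancestor of $i$. The sketch below applies it to pedigree 1 from the listing above. It is an independent illustration, not the algorithm as implemented in `MendelKinship`, and it assumes founders are unrelated and non-inbred.

```julia
# Kinship recursion; `people` must list parents before their children.
# `parents` maps each non-founder to its two parents; founders are absent from the map.
function theoretical_kinship(people, parents)
    n = length(people)
    index = Dict(p => i for (i, p) in enumerate(people))
    Φ = zeros(n, n)
    for j in 1:n
        if haskey(parents, people[j])
            pa, pb = parents[people[j]]
            a, b = index[pa], index[pb]
            for i in 1:j-1
                Φ[i, j] = Φ[j, i] = (Φ[i, a] + Φ[i, b]) / 2
            end
            Φ[j, j] = (1 + Φ[a, b]) / 2
        else
            Φ[j, j] = 0.5                     # non-inbred founder
        end
    end
    return Φ, index
end

# Pedigree 1 from Ped29a.in, with parent fields copied from the file.
people = ["16", "8228", "17008", "9218", "3226"]
parents = Dict("9218" => ("17008", "16"), "3226" => ("9218", "8228"))
Φ, index = theoretical_kinship(people, parents)
```

For example, `Φ[index["16"], index["9218"]]` and `Φ[index["16"], index["3226"]]` can be compared against rows 5 and 3 of the output table.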

# Example 2: Compare theoretical/empirical kinship values

When both pedigree structure and *complete* SNP information are available, we can compare theoretical and empirical kinship coefficients. In practice, however, we often have individuals without genotype information who must nevertheless be included in the pedigree structure. `MendelKinship` does not handle this situation yet, but an analysis option that supports such data is being developed. For now you can impute genotypes, but keep in mind that the relationship comparison for individuals who lack all genotype information will not be meaningful.

### Step 1: Prepare pedigree file and SNP data file

The pedigree file is the same as the pedigree file in the previous example. The SNP definition file requires a header row and should have appropriately placed commas. It may be informative to compare the following SNP definition file with the original "SNP_def29a.in" in Mendel Option 29a.

In [5]:
;head -10 "SNP_def29a_converted.txt"

Locus,Chromosome,Basepairs,Allele1,Allele2
rs3020701,19,90974,1,2
rs56343121,19,91106,1,2
rs143501051,19,93542,1,2
rs56182540,19,95981,1,2
rs7260412,19,105021,1,2
rs11669393,19,107866,1,2
rs181646587,19,107894,1,2
rs8106297,19,107958,1,2
rs8106302,19,107962,1,2


#### Non-binary PLINK users

In this case the SNP data must be stored in a PLINK BED file in SNP-major format, with an accompanying SNP definition file. For an explanation of what these are, see the [MendelBase documentation](https://openmendel.github.io/MendelBase.jl/).

#### Binary PLINK file users

If you have "data.bim", "data.bed", and "data.fam" (i.e. the triplet of PLINK files), then you can replace the 3 fields `snpdata_file`, `snpdefinition_file`, and `pedigree_file` in the next step with just 1 field:

`plink_input_basename = data`.

### Step 2: Preparing control file

The following control file tells MendelKinship to compare theoretical kinship and empirical kinship, and output 2 interactive plots stored in .html format. 

In [6]:
;cat "control_compare_29a.txt"

#
# Input and Output files.
#
snpdata_file = SNP_data29a.bed
snpdefinition_file = SNP_def29a_converted.txt
pedigree_file = Ped29a.in
#
# Analysis parameters for Kinship option.
#
compare_kinships = true
kinship_plot = kinship_plot
z_score_plot = z_score_plot

### Step 3: Running the analysis

In [7]:
using MendelKinship
Kinship("control_compare_29a.txt")

 
 
     Welcome to OpenMendel's
     Kinship analysis option
        version 0.2.0
 
 
Reading the data.

The current working directory is "/Users/biona001/Benjamin_Folder/UCLA/research/open mendel related/Tutorials/Kinship".

Keywords modified by the user:

  compare_kinships = true
  control_file = control_compare_29a.txt
  kinship_plot = kinship_plot
  pedigree_file = Ped29a.in
  snpdata_file = SNP_data29a.bed
  snpdefinition_file = SNP_def29a_converted.txt
  z_score_plot = z_score_plot
 
 
Analyzing the data.

Kinship plot saved.


| Row | Pedigree1 | Pedigree2 | Person1 | Person2 | theoretical_kinship | empiric_kinship | fishers_zscore |
| --- | --- | --- | --- | --- | --- | --- | --- |
| | String | String | String | String | Float64 | Float64 | Float64 |
| 1 | 14 | 14 | 26732 | 264 | 0.0 | 0.109552 | 5.31942 |
| 2 | 31 | 31 | 15884 | 19770 | 0.25 | 0.150364 | -4.2355 |
| 3 | 23 | 23 | 9943 | 392 | 0.125 | 0.0225133 | -4.20155 |
| 4 | 25 | 14 | 22041 | 16636 | 0.0 | 0.0969715 | 4.75137 |
| 5 | 25 | 25 | 11822 | 24192 | 0.25 | 0.159229 | -3.82975 |
| 6 | 14 | 14 | 25732 | 264 | 0.125 | 0.216622 | 4.62515 |
| 7 | 25 | 25 | 3012 | 3016 | 0.125 | 0.213888 | 4.4971 |
| 8 | 25 | 17 | 23404 | 12004 | 0.0 | -0.0896437 | -3.60943 |
| 9 | 25 | 23 | 23404 | 19279 | 0.0 | -0.0877967 | -3.52627 |
| 10 | 17 | 14 | 26857 | 264 | 0.0 | 0.0859953 | 4.25691 |


Fisher's plot saved.
 
 
Mendel's analysis is finished.



### Step 4: Interpreting the Result

Founder pairs, which have 0 theoretical kinship, often exhibit a non-zero empirical kinship. In the first row, persons 26732 and 264 have 0 theoretical kinship, but their empirical kinship is close to 0.125 = 1/8. That is, these 2 people, whom we initially thought to be unrelated, may be half siblings, grandparent-grandchild, or an avuncular pair. On the other hand, the 8th row has a founder pair with an empirical kinship of about $-0.09$ (i.e. they are very *un*related), suggesting that the moment estimator may have a large variance. There may also have been a sample mix-up. Another explanation is that we are using only one chromosome's worth of data, so the kinship estimates may be imprecise.
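
As a rough aid for this kind of reading, one can look up which textbook relationship has an expected kinship closest to an estimate. The sketch below is only a heuristic: the reference list is limited to a few common relationships in a non-inbred population, the function name is invented, and the lookup ignores the variance of the estimator discussed above.

```julia
# Expected kinship coefficients for a few common relationships (non-inbred population).
reference_kinships = [
    (0.25, "parent-offspring or full siblings"),
    (0.125, "half siblings, grandparent-grandchild, or avuncular"),
    (0.0625, "first cousins"),
    (0.0, "unrelated"),
]

# Return the reference entry whose expected kinship is closest to the estimate `phi`.
function nearest_relationship(phi)
    _, k = findmin([abs(phi - ref) for (ref, _) in reference_kinships])
    return reference_kinships[k]
end

row1 = nearest_relationship(0.109552)     # empirical kinship from row 1 of the table
row8 = nearest_relationship(-0.0896437)   # empirical kinship from row 8 of the table
```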

# Interactive Plots and Tables

`MendelKinship` automatically generates 2 figures and 1 table to allow the user to easily compare theoretical and empirical kinship, detect outliers, and observe skewness in the distribution. Figures are saved in `.html` format to enable interactive sessions. To summarize,

+ The table containing all pairwise comparisons of theoretical and empirical kinship is stored in `kinship_file_output.txt`. The table is sorted in descending order of the deviance between the theoretical and empiric kinship. The last column lists [Fisher's Z statistic](https://en.wikipedia.org/wiki/Fisher_transformation) (i.e. the number of standard deviations away from the mean).

+ The 2 plots are stored in .html format and should be generated automatically in your directory. These figures can be examined interactively via Jupyter notebook, as demonstrated below, or opened directly in the browser.
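
For readers unfamiliar with Fisher's transformation, the sketch below shows the transformation itself, $z = \operatorname{atanh}(r) = \frac{1}{2}\ln\frac{1+r}{1-r}$, followed by standardization to mean 0 and variance 1. This tutorial does not specify which quantity `MendelKinship` transforms internally, so the input vector here is a made-up set of correlation-like values and the whole block should be read as an assumption-laden illustration rather than the package's procedure.

```julia
using Statistics

# Fisher's transformation of a correlation-like value r in (-1, 1).
fisher_transform(r) = atanh(r)

# Standardize a vector to sample mean 0 and sample variance 1.
standardize(v) = (v .- mean(v)) ./ std(v)

r = [0.10, -0.05, 0.02, 0.20, -0.12]      # made-up values for illustration
z = standardize(fisher_transform.(r))
```

After standardization, each entry of `z` can be read as a number of standard deviations away from the mean.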

### Generated Interactive Plots part 1:

The first interactive plot allows the user to quickly identify which pairs of persons have an empirical kinship that deviates most from their expected (theoretical) kinship. The midpoint is marked with an orange dot for interpretability. As an example, the first row in the table above is the highest point in the leftmost spread. Careful readers might observe a wider spread among pairs with 0 theoretical kinship. This is expected: most people are not related to each other, so many more of the comparisons have 0 expected kinship.

In [8]:
using MendelKinship, PlotlyJS, CSV

#import the files created from the previous example
result = CSV.read("kinship_file_output.txt")
name = Vector{String}(undef, size(result, 1))

# label the data points according to the persons names
for i in 1:length(name)
    name[i] = "Person1=" * string(result[i, 3]) * ", " * "Person2=" * string(result[i, 4])
end

#create interactive graph
function compare_kinship_plot()
    trace1 = scatter(;x=result[:theoretical_kinship], 
        y=result[:empiric_kinship], mode="markers", 
        name="empiric kinship", text=name)
    
    trace2 = scatter(;x=[1/2, 1/4, 1/8, 1/16, 1/32, 1/64, 1/128, 0.0],
        y=[1/2, 1/4, 1/8, 1/16, 1/32, 1/64, 1/128, 0.0], 
        mode="markers", name="marker for midpoint")
        
    layout = Layout(;title="Compare empiric vs theoretical kinship",hovermode="closest", 
        xaxis=attr(title="Theoretical kinship (θ)", showgrid=false, zeroline=false),
        yaxis=attr(title="Empiric Kinship", zeroline=false))
    
    data = [trace1, trace2]
    plot(data, layout)
end
compare_kinship_plot()

### Generated Interactive Plots part 2:

After comparing the theoretical and empirical kinships, as in the previous graph or directly in the output table, one may wonder whether the observed differences between the two statistics are significant. As explained in our main OpenMendel paper (section 7), this difference can be quantified by Fisher's z transformation, which should give us samples from a standard normal distribution $N(0, 1)$. We plotted this statistic in plot 2, and at first glance the distribution is approximately normal. In Julia, we can reproduce the plot and then verify normality by computing some summary statistics:

In [9]:
function fishers_transform()
    trace1 = histogram(x=result[:fishers_zscore], text=name)
    data = [trace1]
    
    layout = Layout(barmode="overlay", 
        title="Z-score plot for Fisher's statistic",
        xaxis=attr(title="Standard deviations"),
        yaxis=attr(title="count"))
    
    plot(data, layout)
end
fishers_transform()

### Compute mean and variance

We can verify that Fisher's statistic is approximately normal by checking its first four moments:

In [10]:
using Statistics, StatsBase

my_zscore = convert(Vector{Float64}, result[:fishers_zscore])
mean(my_zscore), var(my_zscore)

(-1.2084702388691578e-16, 1.0)

### Compute skewness and excess kurtosis

In [11]:
skewness(my_zscore), kurtosis(my_zscore)

(0.0923976421835524, 0.10657222736224581)
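
For a standard normal sample, both skewness and excess kurtosis should be close to 0, which is what the values above indicate. To show what these two numbers measure, here is a sketch that computes them from central moments using only the `Statistics` standard library. It assumes the usual population-moment definitions (dividing by $n$); the function names are invented for illustration.

```julia
using Statistics

# Central moment of order k, dividing by n.
central_moment(v, k) = mean((v .- mean(v)) .^ k)

# Skewness: third central moment scaled by the variance to the power 3/2.
sample_skewness(v) = central_moment(v, 3) / central_moment(v, 2)^1.5

# Excess kurtosis: fourth central moment scaled by the squared variance, minus 3.
excess_kurtosis(v) = central_moment(v, 4) / central_moment(v, 2)^2 - 3

v = [1.0, 2.0, 3.0, 4.0, 5.0]             # small symmetric example
s, k = sample_skewness(v), excess_kurtosis(v)
```

The symmetric example has zero skewness, and its negative excess kurtosis reflects tails lighter than those of a normal distribution.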

## Conclusions
MendelKinship provides a rapid way to calculate theoretical kinships, which require accurate pedigrees, and empirical kinships, which require genotypes at multiple markers. Further, it can compare these kinships when both the pedigrees and markers are available.