# FastSMC minimal working example notebook

This notebook demonstrates a minimal working example of the FastSMC Python bindings, in which sensible default parameters are set automatically.

Please make sure you have installed the Python bindings by following the instructions in `../README.md` before attempting to run this notebook.

The example dataset was simulated using the setup described in the paper: SNP data for 150 diploid individuals over a chromosomal region of 30 Mb, with the recombination rate from chromosome 2 and a European demographic model (see https://www.nature.com/articles/s41467-020-19588-x for more details).

1) Import `asmc`, which is installed with the Python bindings, and locate the example data directory.

In [None]:
# Only FastSMC is used in this notebook, so import it explicitly
from asmc.asmc import FastSMC

import pathlib
import tempfile

data_dir = pathlib.Path('.').resolve().parent / 'ASMC_data'

2) Specify paths for the input (an example is provided in a submodule of this repository), the decoding quantities file, and the output. The input is expected to consist of the following files (make sure the map file is in the right format, as described in https://github.com/PalamaraLab/ASMC/blob/main/docs/fastsmc.md#input-file-formats):
- `<input_files_root>.hap.gz`
- `<input_files_root>.map`
- `<input_files_root>.samples`
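
Before running FastSMC it can help to confirm that all three files are present. The following is a minimal sketch, not part of `asmc`: the helper name `missing_input_files` and the demo file names are hypothetical, and the check only tests that the files exist, not that their contents are valid.

```python
import pathlib
import tempfile

# Suffixes of the three input files listed above.
EXPECTED_SUFFIXES = ('.hap.gz', '.map', '.samples')


def missing_input_files(files_root):
    """Return the expected input paths that do not exist for this root.

    Hypothetical helper for illustration; it is not provided by asmc.
    """
    root = str(files_root)
    return [root + suffix for suffix in EXPECTED_SUFFIXES
            if not pathlib.Path(root + suffix).is_file()]


# Demonstration on a throwaway directory that contains only a .map file.
with tempfile.TemporaryDirectory() as tmp:
    demo_root = str(pathlib.Path(tmp) / 'example')
    pathlib.Path(demo_root + '.map').touch()
    demo_missing = missing_input_files(demo_root)

# Show only the file names of the missing inputs.
print([pathlib.Path(p).name for p in demo_missing])
```

With the paths used in this notebook, calling `missing_input_files(input_files_root)` after the next cell should return an empty list if the example data submodule has been checked out.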

In [None]:
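# Common prefix of the .hap.gz, .map and .samples input files
input_files_root = str(data_dir / 'examples' / 'fastsmc' / 'example')
# Precomputed decoding quantities for the example data
dq_file = str(data_dir / 'decoding_quantities' / '30-100-2000_CEU.decodingQuantities.gz')
# Only the unique path name is kept; it serves as the prefix of the output files
output_files_root = tempfile.TemporaryDirectory().name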

3) Create the Python FastSMC object and run it. This should only take a few seconds.

In [None]:
fast_smc = FastSMC(in_dir=input_files_root, dq_file=dq_file, out_dir=output_files_root)
fast_smc.run()

4) Read the output, add column names, and filter out IBD segments with a low IBD score (here, 0.1 or below). Note that for a large analysis, the full output is unlikely to fit in memory. See `fastsmc.ipynb` for an example that reads the output line by line.

In [None]:
%config InlineBackend.figure_formats = ['svg']

import pandas as pd
import matplotlib.pyplot as plt

data = pd.read_csv(output_files_root + '.1.1.FastSMC.ibd.gz', sep='\t', header=None)

data.columns = ['ind1_famid', 'ind1_id', 'ind1_hap', 'ind2_famid', 'ind2_id', 'ind2_hap', 'chromosome',
                'ibd_start', 'ibd_end', 'length_in_cM', 'ibd_score', 'post_est', 'map_est']

filtered = data[data['ibd_score'] > 0.1]
filtered
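
For outputs too large to load at once, the same filter can be applied while streaming the file. The sketch below uses only the standard library and assumes the tab-separated, 13-column layout named in the cell above. The function name `iter_high_score_segments` is hypothetical, and the two demo rows are synthetic values made up for illustration, not FastSMC results.

```python
import csv
import gzip
import os
import tempfile

# Column order as assigned to the DataFrame in the cell above.
COLUMNS = ['ind1_famid', 'ind1_id', 'ind1_hap', 'ind2_famid', 'ind2_id', 'ind2_hap',
           'chromosome', 'ibd_start', 'ibd_end', 'length_in_cM', 'ibd_score',
           'post_est', 'map_est']


def iter_high_score_segments(ibd_path, min_score=0.1):
    """Yield one dict per segment whose ibd_score exceeds min_score.

    Hypothetical helper: rows are read one at a time, so memory use does
    not grow with the size of the file. Values are left as strings.
    """
    with gzip.open(ibd_path, 'rt', newline='') as handle:
        for row in csv.reader(handle, delimiter='\t'):
            record = dict(zip(COLUMNS, row))
            if float(record['ibd_score']) > min_score:
                yield record


# Demonstration on two synthetic rows (made-up values, not real output).
demo_rows = [
    ['1_1', '1_1', '1', '1_2', '1_2', '2', '1', '1000', '5000', '1.5', '0.95', '40.0', '25'],
    ['1_3', '1_3', '2', '1_4', '1_4', '1', '1', '2000', '3000', '0.4', '0.02', '90.0', '80'],
]
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, 'demo.ibd.gz')
    with gzip.open(demo_path, 'wt', newline='') as handle:
        csv.writer(handle, delimiter='\t', lineterminator='\n').writerows(demo_rows)
    kept = list(iter_high_score_segments(demo_path))

print(len(kept), kept[0]['ind1_id'], kept[0]['ibd_score'])
```

A generator is used so that a caller can aggregate statistics (for example, counting segments per age bin) without ever holding the full table in memory.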

5) Visualise the data: here we simply plot histograms of the MAP age estimates and the IBD segment lengths.

In [None]:
plt.xlabel("MAP age estimate (in generations)")
filtered['map_est'].hist(range=(0, 100))
plt.gca().set_yscale('linear')

In [None]:
plt.xlabel("IBD segment length (in cM)")
filtered['length_in_cM'].hist(range=(0, 15))
plt.gca().set_yscale('log')