# Methylation Simulation Tutorial

This Jupyter notebook is a tutorial on simulating methylation data based on a real-world dataset.

First we import the necessary libraries.

In [1]:
import pandas as pd

from methylation_simulation import simulate_methyl_data, beta_to_m

Now we load a real-world dataset that we will use as a basis for our simulation.

In [2]:
realworld_data_path = 'realworld_data.tsv'
realworld_data = pd.read_csv(realworld_data_path, sep='\t', index_col=0)
realworld_data.head()

,0,1,2,3,4,5,6,7,8,9,...,25,26,27,28,29,30,31,32,33,34
cg00000029,0.211696,0.114783,0.536747,0.165536,0.160484,0.119399,0.506022,0.144923,0.124337,0.114317,...,0.15231,0.12658,0.054912,0.078791,0.064953,0.078761,0.078993,0.056633,0.087432,0.074412
cg00000103,0.84568,0.684253,0.814065,0.782214,0.467743,0.744215,0.758936,0.42949,0.393501,0.177543,...,0.240991,0.305295,0.095659,0.073466,0.098659,0.322047,0.263926,0.083063,0.111879,0.397761
cg00000109,0.941455,0.904912,0.904288,0.913657,0.916486,0.936152,0.894816,0.930336,0.939587,0.915736,...,0.715386,0.954416,0.928096,0.925447,0.935925,0.832409,0.860549,0.899125,0.918758,0.908527
cg00000155,0.963702,0.948639,0.956757,0.952712,0.954658,0.943447,0.938406,0.954278,0.953473,0.948928,...,0.914553,0.95855,0.959193,0.951843,0.94688,0.938784,0.94396,0.948379,0.953269,0.945896
cg00000158,0.741913,0.947776,0.963727,0.956434,0.939483,0.922293,0.922727,0.884528,0.919704,0.949012,...,0.592271,0.968543,0.957689,0.957843,0.953568,0.884271,0.837953,0.93622,0.956689,0.926489


The real-world dataset is a matrix of methylation values for different sites and observations. The rows represent the sites (for example, cg00000029) and the columns represent the observations.

We transpose the matrix so that the rows represent the observations and the columns represent the sites, which is a common data layout in statistics and computer science.

In [3]:
realworld_data = realworld_data.transpose()
# Show the first 10 observations and the first 10 sites
realworld_data.iloc[0:10,0:10]

,cg00000029,cg00000103,cg00000109,cg00000155,cg00000158,cg00000165,cg00000221,cg00000236,cg00000292,cg00000321
0,0.211696,0.84568,0.941455,0.963702,0.741913,0.092816,0.932019,0.934972,0.616315,0.367206
1,0.114783,0.684253,0.904912,0.948639,0.947776,0.194754,0.69998,0.828949,0.743092,0.56063
2,0.536747,0.814065,0.904288,0.956757,0.963727,0.250221,0.921995,0.891282,0.777081,0.296366
3,0.165536,0.782214,0.913657,0.952712,0.956434,0.501044,0.770795,0.848735,0.593127,0.298586
4,0.160484,0.467743,0.916486,0.954658,0.939483,0.682332,0.894427,0.850622,0.636629,0.442481
5,0.119399,0.744215,0.936152,0.943447,0.922293,0.4359,0.908002,0.935981,0.786134,0.497686
6,0.506022,0.758936,0.894816,0.938406,0.922727,0.342186,0.733169,0.905944,0.756986,0.417028
7,0.144923,0.42949,0.930336,0.954278,0.884528,0.423646,0.911338,0.907349,0.556394,0.350503
8,0.124337,0.393501,0.939587,0.953473,0.919704,0.847006,0.75039,0.935111,0.739626,0.785203
9,0.114317,0.177543,0.915736,0.948928,0.949012,0.860114,0.840515,0.931237,0.476269,0.832337


Next, we use the real-world data to simulate a simple dataset without dependencies between the sites.

The function samples *n_sites* sites from *realworld_data*, estimates the alpha and beta parameters of a beta distribution for each sampled site, and then draws *n_observations* observations from a beta distribution with these parameters.

In [4]:
simulated_methylation = simulate_methyl_data(
    realworld_data=realworld_data,
    n_sites=1000,
    n_observations=20,
    dependencies=False,
)

print(simulated_methylation)

[[0.77256737 0.71086086 0.56665025 ... 0.77183857 0.01654558 0.71611709]
 [0.7047468  0.71542005 0.6135414  ... 0.26233695 0.05076698 0.91823004]
 [0.89942615 0.96246651 0.84133557 ... 0.52018045 0.01921148 0.63705596]
 ...
 [0.52577532 0.91783685 0.75812309 ... 0.38336757 0.07444793 0.48851897]
 [0.7205487  0.91969425 0.64596037 ... 0.13863992 0.04259136 0.89658947]
 [0.286677   0.843527   0.70537606 ... 0.51049211 0.05917981 0.75539647]]
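To make the per-site sampling step more concrete, here is a minimal sketch for a single site. It assumes a method-of-moments estimator for the beta parameters; the estimator that `simulate_methyl_data` actually uses is not shown in this tutorial and may differ. The helper name `estimate_beta_params` is hypothetical, and the input values are the first ten observations of cg00000029 from the table above.

```python
import numpy as np


def estimate_beta_params(values):
    """Method-of-moments estimates of (alpha, beta) for values in (0, 1).

    Hypothetical helper for illustration; not part of methylation_simulation.
    """
    values = np.asarray(values, dtype=float)
    mean = values.mean()
    var = values.var(ddof=1)
    # The estimates are only valid when var < mean * (1 - mean).
    common = mean * (1.0 - mean) / var - 1.0
    return mean * common, (1.0 - mean) * common


# First ten observations of site cg00000029 (taken from the table above)
site_values = np.array([0.211696, 0.114783, 0.536747, 0.165536, 0.160484,
                        0.119399, 0.506022, 0.144923, 0.124337, 0.114317])

alpha, beta = estimate_beta_params(site_values)

# Draw 20 simulated observations for this site from the fitted distribution
rng = np.random.default_rng(0)
simulated_site = rng.beta(alpha, beta, size=20)
print(alpha, beta)
print(simulated_site)
```

With this estimator, the mean of the fitted distribution, alpha / (alpha + beta), equals the sample mean of the site by construction. Repeating the procedure independently for each sampled site would give a matrix without dependencies between sites.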


However, methylation values at different sites are known to be correlated. These correlations often arise because CpG sites cluster into CpG islands, where methylation patterns tend to be locally similar.

We can simulate this by setting the *dependencies* parameter to True. In addition, one must provide the *bin_size* parameter, which determines the size of the islands, and the *correlation_coefficient_distribution* parameter, which specifies the correlation coefficients. The function then creates islands of CpG sites with the given correlation coefficients.

In this example we set the island size to 300 and pass three pairs of correlation coefficients: (-0.85, -0.7), (-0.1, 0.1) and (0.7, 0.85).

In [5]:
simulated_methylation_with_dependencies = simulate_methyl_data(
    realworld_data=realworld_data,
    n_sites=1000,
    n_observations=20,
    dependencies=True,
    bin_size=300,
    correlation_coefficient_distribution=[(-0.85, -0.7), (-0.1, 0.1), (0.7, 0.85)],
)

print(simulated_methylation_with_dependencies)

[[0.20564482 0.17135183 0.07669515 ... 0.93422679 0.26050225 0.028543  ]
 [0.09556597 0.1853918  0.07204463 ... 0.89221833 0.73016897 0.03157517]
 [0.02484979 0.17794881 0.07433803 ... 0.90929445 0.53598004 0.01897325]
 ...
 [0.58314858 0.01962967 0.05019246 ... 0.9276064  0.1882652  0.02429068]
 [0.28204065 0.0323591  0.06593179 ... 0.90444739 0.30327369 0.020248  ]
 [0.09325758 0.27333958 0.06762173 ... 0.90874401 0.31817611 0.0185957 ]]
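How `simulate_methyl_data` induces the dependencies internally is not described in this tutorial. As an illustration of one possible approach, the sketch below builds a single small island by rank reordering: independent beta draws for each site are reordered so that their ranks follow correlated normal draws. This keeps every site's beta-distributed values unchanged while introducing correlation between sites. The single correlation value `rho`, the island size of 5, and the beta parameters are all assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(42)
n_observations, bin_size, rho = 500, 5, 0.8  # assumed values for illustration

# Target correlation structure within one island: every pair correlated by rho
corr = np.full((bin_size, bin_size), rho)
np.fill_diagonal(corr, 1.0)

# Correlated standard normal draws that carry the dependence structure
z = rng.multivariate_normal(np.zeros(bin_size), corr, size=n_observations)

# Independent beta draws per site (hypothetical alpha and beta parameters)
alphas = np.array([2.0, 5.0, 1.5, 4.0, 3.0])
betas = np.array([5.0, 2.0, 4.0, 2.0, 3.0])
independent = rng.beta(alphas, betas, size=(n_observations, bin_size))

# Reorder each site's draws so that their ranks match the ranks of z
ranks = z.argsort(axis=0).argsort(axis=0)
correlated = np.take_along_axis(np.sort(independent, axis=0), ranks, axis=0)

print(np.corrcoef(independent, rowvar=False).round(2))
print(np.corrcoef(correlated, rowvar=False).round(2))
```

The first printed matrix should show off-diagonal correlations near zero, while the second should show clearly positive ones. The resulting Pearson correlations will generally be somewhat lower than `rho`, because the reordering matches ranks rather than values.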


Often one is interested in methylation data on the M-value scale, since it is statistically better suited to downstream analyses. The methylation values can be transformed as follows:

In [6]:
simulated_methylation_mscale = beta_to_m(simulated_methylation)
print(simulated_methylation_mscale)

simulated_methylation_with_dependencies_mscale = beta_to_m(simulated_methylation_with_dependencies)
print(simulated_methylation_with_dependencies_mscale)

[[ 1.76422146  1.29780326  0.38692667 ...  1.75824425 -5.89334079
   1.33489952]
 [ 1.25515234  1.32995655  0.66684682 ... -1.49154094 -4.2248
   3.48921289]
 [ 3.16074966  4.68048591  2.40670256 ...  0.11652023 -5.6739015
   0.81167296]
 ...
 [ 0.14887569  3.48167426  1.64815908 ... -0.68568244 -3.63601057
  -0.06626615]
 [ 1.36649903  3.51757918  0.86753481 ... -2.63527369 -4.49050197
   3.11606442]
 [-1.31512931  2.43052055  1.25951793 ...  0.06055655 -3.99074199
   1.62678879]]
[[-1.94962949 -2.27379809 -3.58959976 ...  3.82820077 -1.5052499
  -5.08894126]
 [-3.24244632 -2.13552892 -3.68709254 ...  3.04928498  1.43617408
  -4.93877752]
 [-5.29431914 -2.20776598 -3.63831312 ...  3.32548476  0.20799239
  -5.69225345]
 ...
 [ 0.48433029 -5.64221897 -4.2420925  ...  3.67957867 -2.10824212
  -5.32797647]
 [-1.34799904 -4.9022284  -3.8244818  ...  3.24266948 -1.19997171
  -5.59656514]
 [-3.28139993 -1.41058698 -3.78535655 ...  3.31588285 -1.09957366
  -5.72180687]]
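The printed values are consistent with the usual definition of the M-value as the base-2 logit of the beta value, M = log2(beta / (1 - beta)). The sketch below implements that formula together with its inverse. The function names `beta_to_m_sketch` and `m_to_beta_sketch` are hypothetical and are not part of the package; whether `beta_to_m` handles edge cases such as values of exactly 0 or 1 differently is not shown here.

```python
import numpy as np


def beta_to_m_sketch(beta_values):
    """Convert beta values in (0, 1) to M-values: log2(beta / (1 - beta))."""
    beta_values = np.asarray(beta_values, dtype=float)
    return np.log2(beta_values / (1.0 - beta_values))


def m_to_beta_sketch(m_values):
    """Inverse transformation: beta = 2**M / (2**M + 1)."""
    m_values = np.asarray(m_values, dtype=float)
    return 2.0 ** m_values / (2.0 ** m_values + 1.0)


# First three simulated beta values from the output of In [4]
example_betas = np.array([0.77256737, 0.71086086, 0.56665025])
example_m = beta_to_m_sketch(example_betas)
print(example_m)
print(m_to_beta_sketch(example_m))
```

Beta values above 0.5 map to positive M-values and values below 0.5 map to negative ones, which matches the signs in the output above.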
