Blood Methylome-Based Epigenetic Clock

This program analyzes methylation levels at six CpG sites in the genome of blood cells to produce a prediction of an individual's biological age, using different machine learning and deep learning models. The code and analysis can be found in the Epigenetic_Clock.ipynb notebook.


Scientists distinguish two main types of age: chronological age and biological age. Chronological age is how long you have been alive (years since birth), whereas biological age is a rough estimate of how healthy your body is, based on measurements of various biomarkers. For more information, check out my recent article on biological age: What is Biological Age?

One way of determining biological age (also described in the article) is to measure the methylation levels in your genome. At locations along your DNA known as CpG sites, various proteins can add or remove methyl groups to control which genes are expressed as proteins and which are not. As you age, however, the systems responsible for maintaining this control of the genome begin to break down, leading to errors in methylation. As a result, some sites accumulate methyl tags with age, while others lose them.

Dr. Steve Horvath, a longevity researcher at the University of California, Los Angeles, used this progression to develop a system known as the Horvath aging clock, which produces an accurate estimate of your biological age. For example, after analyzing the methylation levels in a sample of your cells, the clock might tell you that your biological age is around 30, which means that your methylation levels (and, by extension, your body's health) are similar to those of the average 30-year-old. If your actual age is around 40, this suggests you are living a healthy life, whereas if you are only 20 years old, it suggests you are living an unhealthy one. This information equips people with the knowledge to take control of their lifestyle and live a healthy life.
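The comparison in the example above is just a subtraction. A minimal sketch, assuming hypothetical helper names (`age_gap`, `describe_gap`) and an arbitrary one-year tolerance, none of which come from the notebook:

```python
def age_gap(predicted_age: float, chronological_age: float) -> float:
    """Predicted (biological) age minus chronological age, in years."""
    return predicted_age - chronological_age


def describe_gap(gap: float, tolerance: float = 1.0) -> str:
    """Label the gap; the tolerance is an illustrative assumption."""
    if gap < -tolerance:
        return "biologically younger than chronological age"
    if gap > tolerance:
        return "biologically older than chronological age"
    return "roughly in line with chronological age"


# The two cases from the text: predicted age 30, actual age 40 or 20.
gap_at_40 = age_gap(30, 40)
gap_at_20 = age_gap(30, 20)
print(gap_at_40, describe_gap(gap_at_40))
print(gap_at_20, describe_gap(gap_at_20))
```

A negative gap corresponds to the "healthy" case described above, and a positive gap to the "unhealthy" one.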

Brief Explanation of This Project

In this project, I aim to partially replicate the epigenetic clock developed by Horvath, using methylation data taken from blood samples to predict an individual's age. The datasets and methylation sites used in this project were chosen based on the research paper by Li et al., Human Age Prediction Based on DNA Methylation Using a Gradient Boosting Regressor. Other than the differences in the datasets used, there is one notable difference between the paper and this code:
This notebook contains three different ML regression algorithms and a deep neural network, and compares the performance of each model on the training and testing datasets. The four models developed in this notebook are: Multivariable Linear Regression, Random Forest Regression, Gradient Boosting Regression, and a Deep Neural Network.
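To give a rough idea of what such a comparison looks like, here is a minimal sketch using scikit-learn. It is not the notebook's code: the data is synthetic, `StandardScaler` is an assumed choice of scaler, mean absolute error is an assumed metric, and `MLPRegressor` stands in for the deep neural network (the notebook may well use a different framework and architecture).

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: six methylation levels in [0, 1] plus a 0/1 sex column.
rng = np.random.default_rng(0)
n_samples = 400
X = rng.uniform(0.0, 1.0, size=(n_samples, 7))
X[:, 6] = rng.integers(0, 2, size=n_samples)
age = 20 + 60 * X[:, :6].mean(axis=1) + rng.normal(0.0, 2.0, size=n_samples)

X_train, X_test, y_train, y_test = train_test_split(
    X, age, test_size=0.2, random_state=0
)

# Fit the scaler on the training split only, then apply it to both splits.
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

models = {
    "Multivariable Linear Regression": LinearRegression(),
    "Random Forest Regression": RandomForestRegressor(n_estimators=100, random_state=0),
    "Gradient Boosting Regression": GradientBoostingRegressor(random_state=0),
    # Stand-in for the deep neural network; layer sizes are arbitrary.
    "Neural Network (MLP)": MLPRegressor(
        hidden_layer_sizes=(32, 16), max_iter=2000, random_state=0
    ),
}

# Train each model and record (train MAE, test MAE) in years.
results = {}
for name, model in models.items():
    model.fit(X_train_s, y_train)
    results[name] = (
        mean_absolute_error(y_train, model.predict(X_train_s)),
        mean_absolute_error(y_test, model.predict(X_test_s)),
    )

for name, (train_mae, test_mae) in results.items():
    print(f"{name:<32} train MAE = {train_mae:5.2f}   test MAE = {test_mae:5.2f}")
```

Reporting both training and testing error side by side makes overfitting visible: tree ensembles often show a much lower training error than testing error.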

Data Summary

The healthy patient data was accessed from: GSE20067, GSE20236, GSE20242, GSE27097, GSE27317, GSE32149, GSE34257, GSE34869, GSE36064, GSE36642, GSE37008, GSE41169, GSE53128, GSE65638, all of which were used in the paper by Li et al. The diseased patient data was accessed from: GSE20067, GSE32148, GSE40005, GSE41037, GSE49904, once again all from the list of diseased datasets recommended in the paper.

When accessing these datasets from their sources using R (code not included in this repository), I stored only the columns for the six CpG sites listed in the paper, which the paper found to be correlated with ageing using Pearson correlation analysis. The sites are: cg09809672, cg22736354, cg02228185, cg01820374, cg06493994, cg19761273.
The original GSE datasets also recorded sex as a categorical variable (Male or Female), which must be converted into numerical form before regression can be carried out. Thus, in the sex column of the dataset, 1 represents females and 0 represents males.
Some of the diseased datasets above also contained healthy patients, which would have resulted in inaccurate plots and residuals for the predictions. While obtaining and saving the data with R, I therefore excluded the healthy patients from these datasets, which reduced the number of patients available to display the predictive capabilities of the model.
Other forms of data cleaning, such as removing text from the age column, were also required, but are not crucial enough to delve into.
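The cleaning itself was done in R and is not part of this repository, but the same steps can be sketched in pandas. Everything about the table below is an assumption made for illustration: the column names (`age`, `sex`, `disease_status`), the `control` label for healthy patients, and the raw text formats.

```python
import pandas as pd

# Hypothetical raw rows from a diseased dataset.
raw = pd.DataFrame(
    {
        "age": ["34 years", "age: 51", "27", None],
        "sex": ["Female", "Male", "Female", "Male"],
        "disease_status": ["case", "control", "case", "case"],
    }
)

clean = raw.copy()

# Strip text from the age column by keeping the first number found in each entry.
age_text = clean["age"].str.extract(r"(\d+(?:\.\d+)?)")[0]
clean["age"] = pd.to_numeric(age_text, errors="coerce")

# Encode sex numerically: 1 represents females, 0 represents males.
clean["sex"] = clean["sex"].map({"Female": 1, "Male": 0})

# Exclude healthy patients from the diseased dataset.
clean = clean[clean["disease_status"] != "control"].reset_index(drop=True)

print(clean)
```

Rows whose age cannot be parsed end up as missing values rather than raising an error, so they can be dropped in a later step.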

The Datasets folder contains all the datasets used to form the final healthy dataset, Healthy_Methylation_Dataset.csv, and the final diseased dataset, Disease_Methylation_Dataset.csv. The final healthy dataset has 1440 patients, but many of them lack crucial data (such as methylation levels or age) and are thus dropped in the code, bringing the count down to 1334 patients. The final diseased dataset has 632 patients.
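Dropping incomplete patients, as in the reduction from 1440 to 1334 rows, might look something like the sketch below. The CpG site names come from the list above; the `age` and `sex` column names and the toy values are assumptions, and the notebook's actual code may differ.

```python
import numpy as np
import pandas as pd

CPG_SITES = [
    "cg09809672", "cg22736354", "cg02228185",
    "cg01820374", "cg06493994", "cg19761273",
]

# Toy table with four patients: one is missing its age, one its methylation levels.
df = pd.DataFrame({site: [0.41, 0.52, np.nan, 0.63] for site in CPG_SITES})
df["age"] = [25.0, np.nan, 40.0, 61.0]
df["sex"] = [1, 0, 1, 0]

# Keep only patients with every CpG value and an age present.
required = CPG_SITES + ["age"]
complete = df.dropna(subset=required).reset_index(drop=True)

print(f"{len(df)} patients before, {len(complete)} after dropping incomplete rows")
```

With the real file, the toy table would be replaced by something like `pd.read_csv("Healthy_Methylation_Dataset.csv")`, assuming its columns are named as above.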

Explaining the Notebook

The notebook starts by importing the necessary modules and initializing the scaler. These two steps must be performed whether you're experimenting with both the healthy-patient and diseased-patient sections of the notebook or just one of them. After that, the healthy data is cleaned and used to train the four models, each of which is evaluated on a test set. From this section, the Random Forest Regression model is saved for later use if it has higher accuracy than the model already stored. In the next section, the Random Forest Regression model is applied to the diseased patients as a whole and to individual subsets for each disease. This section can use either the model created in the healthy-patient section or, if you are only working with the diseased section, the saved model.
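The "save only if better" step could be sketched as follows. This is an illustration under stated assumptions rather than the notebook's implementation: the helper `save_if_better`, the use of joblib, the file name, storing the score alongside the model, and R² on the test set as the accuracy measure are all hypothetical choices.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split


def save_if_better(model, score, path):
    """Save the model unless a stored model with an equal or higher score exists."""
    path = Path(path)
    if path.exists():
        stored = joblib.load(path)
        if stored["score"] >= score:
            return False
    joblib.dump({"model": model, "score": score}, path)
    return True


# Synthetic stand-in for the healthy dataset (six CpG sites plus sex).
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(200, 7))
y = 20 + 60 * X[:, :6].mean(axis=1) + rng.normal(0.0, 2.0, size=200)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_train, y_train)
score = model.score(X_test, y_test)  # R^2 on the held-out test set

# A temporary directory keeps the sketch from touching real files.
model_path = Path(tempfile.mkdtemp()) / "random_forest.joblib"
first_save = save_if_better(model, score, model_path)   # nothing stored yet
second_save = save_if_better(model, score, model_path)  # not strictly better

# Later (for example, in the diseased-patient section) the model can be reloaded.
loaded_model = joblib.load(model_path)["model"]
print(first_save, second_save)
```

Keeping the score in the same file as the model means the comparison does not require re-evaluating the stored model each time.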

Run the Code Yourself

Launch the repository and notebook using the Binder link below (it opens the Binder webpage with a loading screen while it connects to a server). Click the button and wait for the server to connect. Once connected, it opens a Jupyter notebook in your browser with the repository. Click the Epigenetic_Clock.ipynb file to open the code; you can then run and edit it yourself (without any changes being made to my server). Loading may take some time depending on whether any servers are available to run the code, so please be patient.


