# Genes Location Retrieval
### Laurence Nickel (i6257119)

Libraries used: 
* biomaRt (version: '2.54.1')
* data.table (version: '1.14.8') 

## Introduction

Within this notebook, the location data of the genes present within the gene expression data files will be retrieved. The human reference genome 'hg19' (human genome version 19), which is also the reference genome used by the 'IlluminaHumanMethylation450kanno.ilmn12.hg19' annotation package, is used to retrieve the locations of the genes. A reference genome is generally considered to be a good estimate of the gene locations in the majority of individuals of that species.

For reproducibility purposes, the human reference genome 'hg19' has been retrieved on the date: 17-May-23 (14:19).

### Importing libraries

Before we start the retrieval of the locations of the genes, we should first import the libraries that will be used throughout this notebook. The Bioconductor package 'biomaRt' is installed through 'BiocManager', while 'data.table' is installed from CRAN with 'install.packages()'.

In [1]:
# Checking whether the package 'BiocManager' has already been installed and installing it if it has not been installed yet.
if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")


cat("Starting the installing of the libraries...")


# Using 'BiocManager' to install the following libraries (which are also mentioned in the introduction of this notebook).
BiocManager::install('biomaRt')

# Using the R command 'install.packages()' to install the remaining necessary libraries.
install.packages("data.table")


cat("Finishing the installing of the libraries.")

Bioconductor version '3.16' is out-of-date; the current release version '3.17'
  is available with R version '4.3'; see https://bioconductor.org/install



Starting the installing of the libraries...

'getOption("repos")' replaces Bioconductor standard repositories, see
'help("repositories", package = "BiocManager")' for details.
Replacement repositories:
    CRAN: https://cran.r-project.org

Bioconductor version 3.16 (BiocManager 1.30.20), R 4.2.3 (2023-03-15 ucrt)

"package(s) not installed when version(s) same as or greater than current; use
  `force = TRUE` to re-install: 'biomaRt'"
Installation paths not writeable, unable to update packages
  path: C:/Program Files/R/R-4.2.3/library
  packages:
    class, KernSmooth, lattice, MASS, Matrix, nnet, survival

Old packages: 'cachem', 'DelayedArray', 'dplyr', 'evaluate', 'fs', 'httpuv',
  'httr', 'httr2', 'later', 'profvis', 'rlang', 'sass', 'testthat', 'tzdb',
  'vctrs', 'viridisLite', 'vroom', 'waldo', 'xfun', 'xml2'

Installing package into 'C:/Users/laure/AppData/Local/R/win-library/4.2'
(as 'lib' is unspecified)



package 'data.table' successfully unpacked and MD5 sums checked


"cannot remove prior installation of package 'data.table'"
"problem copying C:\Users\laure\AppData\Local\R\win-library\4.2\00LOCK\data.table\libs\x64\data_table.dll to C:\Users\laure\AppData\Local\R\win-library\4.2\data.table\libs\x64\data_table.dll: Permission denied"
"restored 'data.table'"



The downloaded binary packages are in
	C:\Users\laure\AppData\Local\Temp\RtmpEN9DXP\downloaded_packages
Finishing the installing of the libraries.

Now that all the libraries have been installed, we can load them into this notebook by using the command 'library()'. To verify that these libraries have been loaded into this notebook, we can use the command 'packageVersion()', which displays the version of the package currently installed.

In [2]:
# Loading the following libraries (which are also mentioned in the introduction of this notebook) into this notebook. 
library(biomaRt)
library(data.table)


# Retrieving the version of the packages to verify they have been correctly loaded into this notebook.
cat("The library 'biomaRt' has been loaded into the notebook with its version being:")
packageVersion("biomaRt")

cat("The library 'data.table' has been loaded into the notebook with its version being:")
packageVersion("data.table")

The library 'biomaRt' has been loaded into the notebook with its version being:

[1] '2.54.1'

The library 'data.table' has been loaded into the notebook with its version being:

[1] '1.14.8'
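
The version check above can also be automated. The following is a minimal sketch, assuming we only want to know whether an installed package is at least as new as a given version; the helper name 'has_min_version()' is hypothetical and not part of this notebook.

```r
# Hypothetical helper (not part of the notebook): returns TRUE when the package
# is installed and its version is at least 'min_version'.
has_min_version <- function(package, min_version) {
  requireNamespace(package, quietly = TRUE) &&
    utils::packageVersion(package) >= min_version
}

# 'stats' ships with every R installation, so this call needs no extra packages.
has_min_version("stats", "3.0.0")
```

For this notebook, the same helper could be called as 'has_min_version("biomaRt", "2.54.1")' and 'has_min_version("data.table", "1.14.8")'.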

### Defining the data directories

In addition, we need to define the data directories from which the files will be loaded and in which the resulting file will be stored. Please note that these paths need to be changed to the directories on your own machine before running the notebook.

In [3]:
data_directory_combined_cleaned_files = "C:/Users/laure/OneDrive/Documenten/Bachelor Thesis Data/combined_cleaned_data"
data_directory_location_files = "C:/Users/laure/OneDrive/Documenten/Bachelor Thesis Data/location_data"

## Retrieving the Locations of the Genes

Within this section, the locations of the genes will be retrieved.

#### Loading the 'gene_expression_data_cleaned_sorted.csv' file into this notebook

To start the location retrieval of the genes, we first need to load the file 'gene_expression_data_cleaned_sorted.csv' from the directory 'data_directory_combined_cleaned_files'. This can be achieved by calling the function 'fread()' with as argument the path of the file that should be loaded into this notebook as a data table. To explicitly call the function 'fread()' from the 'data.table' package, the '::' operator is used.

In [4]:
# The path of the gene expression data file 'gene_expression_data_cleaned_sorted.csv' to be loaded into this notebook.
path <- file.path(data_directory_combined_cleaned_files, "gene_expression_data_cleaned_sorted.csv")

# If the 'path' defined above does not point to a file, the execution of this code block is terminated and an error message
# is displayed.
if (!file.exists(path)) {
  stop("File not found: ", path)
}

# Loading the gene expression data as a data table into this notebook by calling the function 'fread()'.
gene_expression_data <- data.table::fread(path)

cat("The gene expression data:")
gene_expression_data

The gene expression data:

Gene ID,Gene Name,TCGA-06-0125-01A-01,TCGA-06-0125-02A-11,TCGA-06-0152-02A-01,TCGA-06-0171-02A-11,TCGA-06-0190-01A-01,TCGA-06-0190-02A-01,TCGA-06-0210-01A-01,TCGA-06-0210-02A-01,⋯,TCGA-32-1980-01A-01,TCGA-32-5222-01A-01,TCGA-41-5651-01A-01,TCGA-76-4925-01A-01,TCGA-76-4926-01B-01,TCGA-76-4927-01A-01,TCGA-76-4928-01B-01,TCGA-76-4929-01A-01,TCGA-76-4931-01A-01,TCGA-76-4932-01A-01
<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,⋯,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
ENSG00000000003.15,TSPAN6,76.7833,82.9215,50.9843,30.1774,70.0066,40.8241,81.7969,47.4205,⋯,24.3036,105.1259,62.3738,78.7018,113.3587,74.8750,66.0662,117.6090,71.2790,92.9495
ENSG00000000005.6,TNMD,0.4035,0.4189,0.1591,0.1364,0.2893,1.1342,0.1334,0.5607,⋯,0.4089,0.3746,0.3449,0.4110,2.5210,0.2551,0.2063,5.6656,0.2270,0.3167
ENSG00000000419.13,DPM1,97.4399,63.4521,98.1366,55.5088,80.6678,85.4610,80.0162,49.7091,⋯,74.3346,143.1418,118.7632,169.7759,68.5659,64.4713,89.2847,90.8538,84.3173,66.3511
ENSG00000000457.14,SCYL3,6.5428,5.4929,6.3809,5.0426,4.8642,4.0998,4.8043,4.8995,⋯,5.0076,7.0083,9.4073,6.7280,6.4974,5.4887,4.2106,7.6404,6.6323,7.0174
ENSG00000000460.17,C1orf112,5.9849,3.1369,5.7963,2.7663,6.9529,5.4879,2.8904,4.2395,⋯,3.2461,7.2234,6.9749,5.8939,3.9360,4.6042,3.8754,2.3827,5.9630,5.1545
ENSG00000000938.13,FGR,7.0651,16.4290,21.9912,65.6843,25.5317,11.9973,26.2095,17.0881,⋯,11.1232,36.0053,9.5427,3.8866,15.2813,11.1335,15.9864,23.5793,10.5260,13.9823
ENSG00000000971.16,CFH,16.9301,16.4273,19.3073,50.7959,24.9163,45.3018,12.3576,14.5879,⋯,17.2011,8.1233,4.4526,11.4120,22.3823,13.0230,6.9736,10.0796,9.3104,3.4071
ENSG00000001036.14,FUCA2,63.7275,66.9871,73.1009,65.8725,101.1771,96.1271,57.4271,50.5623,⋯,41.8926,73.5414,42.2461,55.6078,91.3001,79.9233,84.9414,29.7248,28.4352,65.8406
ENSG00000001084.13,GCLC,14.5107,22.8885,9.4840,20.9938,17.4425,9.7493,12.6304,14.7003,⋯,14.4338,10.8659,25.2573,11.7874,22.3965,16.0202,9.9544,18.8568,16.0662,11.1008
ENSG00000001167.14,NFYA,24.2783,18.1955,24.5692,22.3543,28.6862,30.9364,19.3338,22.7468,⋯,15.1850,21.6339,32.9123,18.0061,21.8243,25.2603,19.4110,47.1435,27.2980,22.2000


Since we only need the IDs of the genes and not the gene expression values of the samples to retrieve the locations of the genes, we only need the column 'Gene ID' from the data table 'gene_expression_data'. This can be achieved by using the R notation 'data_table$column_name'. Because the column name contains a space, it has to be quoted.

In [5]:
# Retrieving the genes by accessing the elements in the column 'Gene ID'.
genes <- gene_expression_data$'Gene ID'

cat("The genes:")
genes

The genes:
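
As a small illustration of accessing a column whose name contains a space, the sketch below uses a made-up two-row table (the variable 'toy_expression' is hypothetical). It assumes nothing beyond base R and shows that the quoted '$' notation and the '[[ ]]' notation return the same vector.

```r
# Made-up toy table; check.names = FALSE keeps the space in the column names.
toy_expression <- data.frame(
  "Gene ID" = c("ENSG00000000003.15", "ENSG00000000005.6"),
  "Gene Name" = c("TSPAN6", "TNMD"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Two equivalent ways to extract a column with a space in its name.
ids_dollar   <- toy_expression$`Gene ID`
ids_brackets <- toy_expression[["Gene ID"]]

identical(ids_dollar, ids_brackets)
```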

Each of the gene IDs above consists of the stable Ensembl gene ID followed by a dot and a version number (for example, 'ENSG00000000003.15'). We can shorten the gene IDs by removing the version number, because the 'ensembl_gene_id' attribute of the Ensembl BioMart database, from which we will retrieve the gene annotation data later in this notebook, contains the stable IDs without a version suffix and describes each gene as it exists in the release being queried. To achieve this, we can call the function 'gsub()', which replaces a dot followed by one or more digits at the end of the gene ID with an empty string, such that only the stable part of the gene ID remains.

In [6]:
# Replacing the part of the gene IDs that contain a dot followed by any number of values with an empty string such that only 
# the part of the gene IDs remain that actually encode the gene.
genes <- gsub("\\.\\d+$", "", genes)

cat("The genes after replacing the part of the gene IDs that contain a dot followed by any number of values with an empty string such that only the part of the gene IDs remain that actually encode the gene:")
genes

The genes after replacing the part of the gene IDs that contain a dot followed by any number of values with an empty string such that only the part of the gene IDs remain that actually encode the gene:
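
The effect of the pattern can be seen on a few example IDs. This is a minimal sketch: the first two IDs are taken from the table above, while the third one is an illustrative ID with a trailing suffix, included only to show that the pattern leaves an ID unchanged when the version number is not at the very end. Whether such IDs occur in the actual data is not checked here.

```r
# Two IDs from the gene expression data and one illustrative ID with a suffix.
versioned_ids <- c(
  "ENSG00000000003.15",
  "ENSG00000000005.6",
  "ENSG00000182378.14_PAR_Y"  # illustrative only
)

# Remove a dot followed by one or more digits at the END of the ID.
stripped_ids <- gsub("\\.\\d+$", "", versioned_ids)

stripped_ids
# [1] "ENSG00000000003" "ENSG00000000005" "ENSG00000182378.14_PAR_Y"
```

Because the pattern is anchored with '$', an ID that does not end in digits is returned as-is and would later show up as missing from the annotation data rather than being silently altered.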

Next, we can check whether all the genes are unique to ensure that by removing the version IDs we have not created duplicate gene IDs. This can be achieved by calling the function 'unique()'.

In [7]:
# Checking whether all the genes in the 'genes' array are unique.
unique_genes <- unique(genes)

if (length(unique_genes) == nrow(gene_expression_data)) {
  cat("All genes are unique.")
} else {
  cat("There are duplicated genes.")
}

All genes are unique.

As we can see from the output above, all the genes are unique even after having removed the version IDs.
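
If duplicates had been present, it would be useful to know which IDs are affected. The sketch below shows one way to list them on made-up data, using the base R functions 'anyDuplicated()' and 'duplicated()'; the variable names are hypothetical.

```r
# Made-up gene IDs in which the first ID occurs twice.
toy_genes <- c("ENSG00000000003", "ENSG00000000005", "ENSG00000000003")

# anyDuplicated() returns 0 when every element is unique.
has_duplicates <- anyDuplicated(toy_genes) > 0

# duplicated() marks the second and later occurrences of each value.
duplicated_ids <- unique(toy_genes[duplicated(toy_genes)])

has_duplicates   # TRUE
duplicated_ids   # "ENSG00000000003"
```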

#### Retrieving the locations

The first step towards retrieving the location data of the genes is to retrieve the annotation information for each gene. First, the Ensembl BioMart database needs to be set up by calling the function 'useMart' which originates from the loaded 'biomaRt' library. 

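It is important to note here that the exact locations of the genes will not be the same for every sample, as the genomes of two samples differ because of duplications and deletions. We can, however, use a reference genome, which is generally considered to be a good estimate of the positions of the genes in the majority of individuals of that species. The intended reference is the Human Genome version 19 assembly (hg19, also known as GRCh37), which is the same one that was used to retrieve the locations of the CpG sites. Note that calling 'useMart("ensembl", ...)' without a 'host' argument connects to the current Ensembl release, which is built on the GRCh38 (hg38) assembly; hg19 coordinates are served by the separate GRCh37 archive (grch37.ensembl.org). The assembly behind the connection should therefore be verified before the retrieved coordinates are combined with hg19-based CpG locations.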

In [8]:
# Setting up the Ensembl BioMart database to retrieve the gene locations.
ensembl_database <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")
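
If hg19 (GRCh37) coordinates are required, the assembly can be selected explicitly. The following is a minimal sketch, assuming network access and an installed 'biomaRt' package; it uses the 'GRCh' argument of 'useEnsembl()', and the wrapper name 'connect_to_grch37()' is hypothetical and not part of this notebook. The wrapper only opens the connection when it is called, so defining it has no side effects.

```r
# Hypothetical wrapper (not part of the notebook): opens a BioMart connection
# to the GRCh37 (hg19) archive of Ensembl instead of the current release.
connect_to_grch37 <- function() {
  if (!requireNamespace("biomaRt", quietly = TRUE)) {
    stop("The package 'biomaRt' is required for this sketch.")
  }
  biomaRt::useEnsembl(
    biomart = "ensembl",
    dataset = "hsapiens_gene_ensembl",
    GRCh    = 37  # request the GRCh37 archive
  )
}

# Usage (requires an internet connection):
# ensembl_database_grch37 <- connect_to_grch37()
```

Gene counts and coordinates retrieved from the GRCh37 archive can differ from those of the current release, so the outputs shown in this notebook would not necessarily be reproduced with this connection.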

Next, we can retrieve the annotation information for each gene by calling the function 'getBM()' which also originates from the loaded 'biomaRt' library. As arguments to this function, we include which database should be used (which is the 'ensembl_database') and which columns should be retrieved. The following six columns are retrieved: 'ensembl_gene_id', 'external_gene_name', 'chromosome_name', 'start_position', 'end_position', and 'strand'.

In [9]:
# Retrieve the gene locations using the getBM function
annotation_genes <- getBM(attributes = c("ensembl_gene_id", "external_gene_name", "chromosome_name", "start_position", "end_position", "strand"), mart = ensembl_database)

# Converting the retrieved data frame to a data table to improve readability.
annotation_genes <- data.table::as.data.table(annotation_genes)

cat("The annotation information for the genes present in the Ensembl BioMart database:")
annotation_genes

The annotation information for the genes present in the Ensembl BioMart database:

ensembl_gene_id,external_gene_name,chromosome_name,start_position,end_position,strand
<chr>,<chr>,<chr>,<int>,<int>,<int>
ENSG00000210049,MT-TF,MT,577,647,1
ENSG00000211459,MT-RNR1,MT,648,1601,1
ENSG00000210077,MT-TV,MT,1602,1670,1
ENSG00000210082,MT-RNR2,MT,1671,3229,1
ENSG00000209082,MT-TL1,MT,3230,3304,1
ENSG00000198888,MT-ND1,MT,3307,4262,1
ENSG00000210100,MT-TI,MT,4263,4331,1
ENSG00000210107,MT-TQ,MT,4329,4400,-1
ENSG00000210112,MT-TM,MT,4402,4469,1
ENSG00000198763,MT-ND2,MT,4470,5511,1
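
The call above downloads the annotation for every gene in the dataset. As an alternative, 'getBM()' accepts the arguments 'filters' and 'values', which restrict the query to a given set of IDs. The sketch below assumes network access and an existing mart object; the function name 'fetch_gene_locations()' is hypothetical and not part of this notebook.

```r
# Hypothetical helper (not part of the notebook): retrieves the six location
# attributes only for the supplied Ensembl gene IDs.
fetch_gene_locations <- function(gene_ids, mart) {
  if (!requireNamespace("biomaRt", quietly = TRUE)) {
    stop("The package 'biomaRt' is required for this sketch.")
  }
  biomaRt::getBM(
    attributes = c("ensembl_gene_id", "external_gene_name", "chromosome_name",
                   "start_position", "end_position", "strand"),
    filters    = "ensembl_gene_id",  # filter on the unversioned gene ID
    values     = gene_ids,
    mart       = mart
  )
}

# Usage (requires an internet connection):
# annotation_subset <- fetch_gene_locations(genes, ensembl_database)
```

With this variant, genes that are absent from the database simply do not appear in the result, so the check for missing genes in the next step remains necessary.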


Now that we have successfully loaded the annotation information for the genes present in the Ensembl BioMart database, we can check whether all the genes from the gene expression data, stored in the variable 'genes', are also present in the annotation data stored in the variable 'annotation_genes'. This can be achieved by applying the '%in%' operator, which in the code below checks whether the 'genes' appear in the 'ensembl_gene_id' column of the 'annotation_genes' data table. If a gene is not present in that column, it will be marked as 'TRUE' within the logical vector 'missing_genes'.

In [10]:
# Checking whether the 'genes' appear in the 'ensembl_gene_id' column in the 'annotation_genes' data table.
missing_genes <- !genes %in% annotation_genes$ensembl_gene_id

# Retrieving the genes which are not present in the 'ensembl_gene_id' column in the 'annotation_genes' data table.
genes_missing <- genes[missing_genes]

# Displaying the missing genes.
if (length(genes_missing) == 0) {
    cat("All the genes are present in the annotation data.")
} else {
    cat("The following genes are not present in the annotation data:\n")
    cat(genes_missing)
    cat("\n\nThe number of genes that are not present in the annotation data:\n")
    cat(length(genes_missing))
}

The following genes are not present in the annotation data:
ENSG00000112096 ENSG00000130723 ENSG00000203812 ENSG00000204092 ENSG00000204805 ENSG00000215271 ENSG00000221995 ENSG00000224739 ENSG00000225178 ENSG00000226380 ENSG00000226403 ENSG00000228139 ENSG00000228906 ENSG00000235245 ENSG00000236166 ENSG00000237838 ENSG00000239467 ENSG00000239665 ENSG00000253878 ENSG00000254615 ENSG00000255633 ENSG00000255823 ENSG00000256045 ENSG00000256427 ENSG00000256618 ENSG00000259834 ENSG00000261490 ENSG00000261534 ENSG00000261737 ENSG00000261963 ENSG00000269028 ENSG00000269894 ENSG00000270030 ENSG00000270178 ENSG00000270195 ENSG00000271043 ENSG00000271870 ENSG00000271895 ENSG00000272040 ENSG00000272196 ENSG00000272370 ENSG00000272551 ENSG00000273301 ENSG00000273837 ENSG00000274031 ENSG00000275560 ENSG00000277050 ENSG00000277203 ENSG00000279400 ENSG00000280374 ENSG00000286699 ENSG00000287116 ENSG00000287686 ENSG00000288617

The number of genes that are not present in the annotation data:
54

As we can see from the output above, for 54 of the 20437 genes in the gene expression data no location information is present in the Ensembl BioMart database. Since this is only roughly 0.26 percent of the total number of genes, we can discard these genes from all the gene expression datasets in the directory 'data_directory_combined_cleaned_files' as well as from the array of genes called 'genes'.
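
The membership check and the reported percentage can be reproduced on a small scale. The sketch below uses made-up IDs (all 'toy_' variables are hypothetical) to show that negating '%in%' selects the IDs absent from the annotation, and then recomputes the share from the two counts reported above.

```r
# Made-up IDs: 'ENSG02' is absent from the toy annotation.
toy_genes          <- c("ENSG01", "ENSG02", "ENSG03")
toy_annotation_ids <- c("ENSG01", "ENSG03", "ENSG04")

# TRUE marks a gene that does not occur in the annotation.
toy_is_missing <- !toy_genes %in% toy_annotation_ids
toy_missing    <- toy_genes[toy_is_missing]
toy_missing
# [1] "ENSG02"

# Share of missing genes, using the counts reported in this notebook.
missing_share_percent <- round(100 * 54 / 20437, 2)
missing_share_percent
# [1] 0.26
```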

Before we delete these genes, we also want to check whether there are any genes which, although present in the annotation data, have missing or unexpected values in the columns we want to retrieve: the chromosome, start position, end position, and strand. All genes that need to be deleted are collected in the array 'genes_to_delete', such that we can later delete all of them at once.

In [11]:
# Assigning the genes to be deleted to the array 'genes_to_delete'.
genes_to_delete = genes_missing

##### The chromosome

For the chromosome, we first check whether there are any genes for which no chromosome information is present within the 'annotation_genes'. This can be achieved by finding the subset of rows (genes) within the 'annotation_genes' data table where the gene is present in the 'genes' character array and has a missing value for that column. The function 'subset()' can help to find these subsets.

In [12]:
# Retrieving the genes for which the column 'chromosome_name' representing the chromosome has a missing value.
missing_genes_chromosome <- subset(annotation_genes, ensembl_gene_id %in% genes & is.na(chromosome_name))

# Checking whether the number of genes in the 'missing_genes_chromosome' is equal to 0.
if (nrow(missing_genes_chromosome) == 0) {
    cat("None of the genes present in the 'genes' character array have a missing value for the 'chromosome_name' column.\n")
} else {
    cat("The following genes that are present in the 'genes' character array have a missing value for the 'chromosome_name' column:\n")
    print(missing_genes_chromosome)
    cat("\n\nThe number of genes that have a missing value for the 'chromosome_name' column:\n")
    cat(nrow(missing_genes_chromosome))
}

None of the genes present in the 'genes' character array have a missing value for the 'chromosome_name' column.


As we can see from the output above, none of the genes present in the 'genes' character array have a missing value for the 'chromosome_name' column of the 'annotation_genes' data table. Next, we need to verify whether all the values in the 'chromosome_name' column correspond to values that we would expect to see. In this case, we expect the values to be between 1 and 22 (as we also want to exclude the sex chromosomes).

In [13]:
# Checking whether all the values in the 'chromosome_name' column correspond to values that we would expect to see. In this 
# case, we expect the values to be between 1 and 22 (as we also want to exclude the sex chromosomes).
expected_chromosomes <- c(1:22)
genes_unexpected_chromosome <- subset(annotation_genes, ensembl_gene_id %in% genes & !chromosome_name %in% expected_chromosomes)

# Checking whether the number of genes in the 'genes_unexpected_chromosome' is equal to 0.
if (nrow(genes_unexpected_chromosome) == 0) {
    cat("None of the genes present in the 'genes' character array have an unexpected value for the 'chromosome_name' column.\n")
} else {
    cat("The following genes that are present in the 'genes' character array have an unexpected value for the 'chromosome_name' column:\n")
    print(genes_unexpected_chromosome)
    cat("\n\nThe number of genes that have an unexpected value for the 'chromosome_name' column:\n")
    cat(nrow(genes_unexpected_chromosome))
}

The following genes that are present in the 'genes' character array have an unexpected value for the 'chromosome_name' column:
     ensembl_gene_id external_gene_name chromosome_name start_position
  1: ENSG00000210049              MT-TF              MT            577
  2: ENSG00000211459            MT-RNR1              MT            648
  3: ENSG00000210082            MT-RNR2              MT           1671
  4: ENSG00000209082             MT-TL1              MT           3230
  5: ENSG00000198888             MT-ND1              MT           3307
 ---                                                                  
748: ENSG00000101871               MID1               X       10445310
749: ENSG00000124313             IQSEC2               X       53225828
750: ENSG00000077264               PAK3               X      110944285
751: ENSG00000147394             ZNF185               X      152898067
752: ENSG00000126756                UXT               X       47651796
     end_position str

As we can see from the output above, there are 752 genes present in the 'genes' character array which have an unexpected value for the 'chromosome_name' column of the 'annotation_genes' data table. This number is so high because most of them are located on the X and Y chromosomes or on the mitochondrial DNA, which consists of a single circular chromosome. Since we do not want to investigate the genes located on the X and Y chromosomes and on the mitochondrial DNA, we can add them to the 'genes_to_delete' array such that these are deleted later as well.
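
To see which chromosome names cause genes to be flagged, the unexpected values can be tabulated. The sketch below works on a made-up annotation table ('toy_annotation' is hypothetical) and compares the character column against the autosome names stored as character strings, which avoids relying on implicit type conversion.

```r
# Made-up annotation with two autosomal and two non-autosomal genes.
toy_annotation <- data.frame(
  ensembl_gene_id = c("G1", "G2", "G3", "G4"),
  chromosome_name = c("1", "22", "X", "MT"),
  stringsAsFactors = FALSE
)

# The chromosome names are stored as text, so we compare against "1" ... "22".
autosomes    <- as.character(1:22)
is_autosomal <- toy_annotation$chromosome_name %in% autosomes

# Count how often each unexpected chromosome name occurs.
unexpected_counts <- table(toy_annotation$chromosome_name[!is_autosomal])
unexpected_counts
```

Applied to 'genes_unexpected_chromosome$chromosome_name', the same 'table()' call would show how the 752 genes are distributed over the unexpected chromosome names.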

In [14]:
# Adding the 'genes_unexpected_chromosome' to the 'genes_to_delete' array. 
genes_to_delete <- c(genes_to_delete, genes_unexpected_chromosome$ensembl_gene_id)

##### The start position

For the start position, we first check whether there are any genes for which no start position is present within the 'annotation_genes'. This can be achieved by finding the subset of rows (genes) within the 'annotation_genes' data table where the gene is present in the 'genes' character array and has a missing value for that column. The function 'subset()' can help to find these subsets.

In [15]:
# Retrieving the genes for which the column 'start_position' representing the start position has a missing value.
missing_genes_start_position <- subset(annotation_genes, ensembl_gene_id %in% genes & is.na(start_position))

# Checking whether the number of genes in the 'missing_genes_start_position' is equal to 0.
if (nrow(missing_genes_start_position) == 0) {
    cat("None of the genes present in the 'genes' character array have a missing value for the 'start_position' column.\n")
} else {
    cat("The following genes that are present in the 'genes' character array have a missing value for the 'start_position' column:\n")
    print(missing_genes_start_position)
    cat("\n\nThe number of genes that have a missing value for the 'start_position' column:\n")
    cat(nrow(missing_genes_start_position))
}

None of the genes present in the 'genes' character array have a missing value for the 'start_position' column.


As we can see from the output above, none of the genes present in the 'genes' character array have a missing value for the 'start_position' column of the 'annotation_genes' data table. Next, we need to verify whether all the values in the 'start_position' column correspond to values that we would expect to see. In this case, we expect the values to be higher than 0 at all times and that the value is always lower than the value in the 'end_position' column.

In [16]:
# Checking whether all the values in the 'start_position' column correspond to values that we would expect to see. In this 
# case, we expect the values to be higher than 0 at all times and that the value is always lower than the value in the 
# 'end_position' column. A gene is flagged when at least one of these two expectations is violated (logical OR).
genes_unexpected_start_position <- subset(annotation_genes, ensembl_gene_id %in% genes & (start_position <= 0 | start_position >= end_position))

# Checking whether the number of genes in the 'genes_unexpected_start_position' is equal to 0.
if (nrow(genes_unexpected_start_position) == 0) {
    cat("None of the genes present in the 'genes' character array have an unexpected value for the 'start_position' column.\n")
} else {
    cat("The following genes that are present in the 'genes' character array have an unexpected value for the 'start_position' column:\n")
    print(genes_unexpected_start_position)
    cat("\n\nThe number of genes that have an unexpected value for the 'start_position' column:\n")
    cat(nrow(genes_unexpected_start_position))
}

None of the genes present in the 'genes' character array have an unexpected value for the 'start_position' column.


As we can see from the output above, none of the genes present in the 'genes' character array have an unexpected value for the 'start_position' column of the 'annotation_genes' data table. Therefore, no additional genes need to be deleted.
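
A gene has unexpected coordinates when at least one of the two expectations is violated, which corresponds to a logical OR of the two violations. The sketch below illustrates this on made-up coordinates ('toy_annotation' is hypothetical) and contrasts it with a condition that requires both violations at once, which would miss every invalid row in this example.

```r
# Made-up coordinates: G1 is valid, G2 starts at 0, G3 starts after its end.
toy_annotation <- data.frame(
  ensembl_gene_id = c("G1", "G2", "G3"),
  start_position  = c(100L, 0L, 500L),
  end_position    = c(200L, 50L, 400L),
  stringsAsFactors = FALSE
)

# Flag a row when EITHER expectation is violated.
invalid <- with(toy_annotation,
                start_position <= 0 | start_position >= end_position)

# Requiring BOTH violations at the same time misses G2 and G3.
both_violated <- with(toy_annotation,
                      !(start_position > 0) & !(start_position < end_position))

toy_annotation$ensembl_gene_id[invalid]
# [1] "G2" "G3"
any(both_violated)
# [1] FALSE
```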

##### The end position

For the end position, we first check whether there are any genes for which no end position is present within the 'annotation_genes'. This can be achieved by finding the subset of rows (genes) within the 'annotation_genes' data table where the gene is present in the 'genes' character array and has a missing value for that column. The function 'subset()' can help to find these subsets.

In [17]:
# Retrieving the genes for which the column 'end_position' representing the end position has a missing value.
missing_genes_end_position <- subset(annotation_genes, ensembl_gene_id %in% genes & is.na(end_position))

# Checking whether the number of genes in the 'missing_genes_end_position' is equal to 0.
if (nrow(missing_genes_end_position) == 0) {
    cat("None of the genes present in the 'genes' character array have a missing value for the 'end_position' column.\n")
} else {
    cat("The following genes that are present in the 'genes' character array have a missing value for the 'end_position' column:\n")
    print(missing_genes_end_position)
    cat("\n\nThe number of genes that have a missing value for the 'end_position' column:\n")
    cat(nrow(missing_genes_end_position))
}

None of the genes present in the 'genes' character array have a missing value for the 'end_position' column.


As we can see from the output above, none of the genes present in the 'genes' character array have a missing value for the 'end_position' column of the 'annotation_genes' data table. Next, we need to verify whether all the values in the 'end_position' column correspond to values that we would expect to see. In this case, we expect the values to be higher than 0 at all times and that the value is always higher than the value in the 'start_position' column.

In [18]:
# Checking whether all the values in the 'end_position' column correspond to values that we would expect to see. In this 
# case, we expect the values to be higher than 0 at all times and that the value is always higher than the value in the 
# 'start_position' column. A gene is flagged when at least one of these two expectations is violated (logical OR).
genes_unexpected_end_position <- subset(annotation_genes, ensembl_gene_id %in% genes & (end_position <= 0 | end_position <= start_position))

# Checking whether the number of genes in the 'genes_unexpected_end_position' is equal to 0.
if (nrow(genes_unexpected_end_position) == 0) {
    cat("None of the genes present in the 'genes' character array have an unexpected value for the 'end_position' column.\n")
} else {
    cat("The following genes that are present in the 'genes' character array have an unexpected value for the 'end_position' column:\n")
    print(genes_unexpected_end_position)
    cat("\n\nThe number of genes that have an unexpected value for the 'end_position' column:\n")
    cat(nrow(genes_unexpected_end_position))
}

None of the genes present in the 'genes' character array have an unexpected value for the 'end_position' column.


As we can see from the output above, none of the genes present in the 'genes' character array have an unexpected value for the 'end_position' column of the 'annotation_genes' data table. Therefore, no additional genes need to be deleted.

##### The strand

For the strand, we first check whether there are any genes for which no strand information is present within the 'annotation_genes'. This can be achieved by finding the subset of rows (genes) within the 'annotation_genes' data table where the gene is present in the 'genes' character array and has a missing value for that column. The function 'subset()' can help to find these subsets.

In [19]:
# Retrieving the genes for which the column 'strand' representing the strand has a missing value.
missing_genes_strand <- subset(annotation_genes, ensembl_gene_id %in% genes & is.na(strand))

# Checking whether the number of genes in the 'missing_genes_strand' is equal to 0.
if (nrow(missing_genes_strand) == 0) {
    cat("None of the genes present in the 'genes' character array have a missing value for the 'strand' column.\n")
} else {
    cat("The following genes that are present in the 'genes' character array have a missing value for the 'strand' column:\n")
    print(missing_genes_strand)
    cat("\n\nThe number of genes that have a missing value for the 'strand' column:\n")
    cat(nrow(missing_genes_strand))
}

None of the genes present in the 'genes' character array have a missing value for the 'strand' column.


As we can see from the output above, none of the genes present in the 'genes' character array have a missing value for the 'strand' column of the 'annotation_genes' data table. Next, we need to verify whether all the values in the 'strand' column correspond to values that we would expect to see. In this case, we expect the values to be equal to either 1 or -1.

In [20]:
# Checking whether all the values in the 'strand' column correspond to values that we would expect to see. In this case, we
# expect the values to be equal to either 1 or -1, so a gene is flagged when its strand is not one of these two values.
genes_unexpected_strand <- subset(annotation_genes, ensembl_gene_id %in% genes & !(strand %in% c(-1, 1)))

# Checking whether the number of genes in the 'genes_unexpected_strand' is equal to 0.
if (nrow(genes_unexpected_strand) == 0) {
    cat("None of the genes present in the 'genes' character array have an unexpected value for the 'strand' column.\n")
} else {
    cat("The following genes that are present in the 'genes' character array have an unexpected value for the 'strand' column:\n")
    print(genes_unexpected_strand)
    cat("\n\nThe number of genes that have an unexpected value for the 'strand' column:\n")
    cat(nrow(genes_unexpected_strand))
}

None of the genes present in the 'genes' character array have an unexpected value for the 'strand' column.


As we can see from the output above, none of the genes present in the 'genes' character array have an unexpected value for the 'strand' column of the 'annotation_genes' data table. Therefore, no additional genes need to be deleted.
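
A strand value is unexpected when it is neither 1 nor -1. The sketch below shows this membership test on made-up strand values ('toy_strand' is hypothetical). It also shows why a condition of the form 'strand != -1 | strand != 1' cannot be used for this purpose: every number differs from at least one of the two values, so that expression is TRUE for all rows.

```r
# Made-up strand values; the third one (0) is not a valid strand.
toy_strand <- c(1L, -1L, 0L, 1L)

# TRUE marks a value that is neither -1 nor 1.
unexpected <- !(toy_strand %in% c(-1L, 1L))
unexpected
# [1] FALSE FALSE  TRUE FALSE

# This expression is TRUE for every value, valid or not.
always_true <- toy_strand != -1 | toy_strand != 1
all(always_true)
# [1] TRUE
```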

<br></br>
<br></br>
We can retrieve for how many genes there is some location information missing or unexpected values occur within the desired columns by calling the function 'length()'.

In [21]:
cat("The number of genes for which there is some location information missing or unexpected values occur within the desired columns causing us to delete the genes:\n")
length(genes_to_delete)

The number of genes for which there is some location information missing or unexpected values occur within the desired columns causing us to delete the genes:


As we can see from the output above, there are 806 genes for which there is some location information missing or unexpected values occur within the desired columns. These genes are deleted from the gene expression datasets and the array 'genes' below.
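
In this notebook, the missing genes and the genes on unexpected chromosomes cannot overlap, since the former are absent from the annotation data and the latter are present in it. If further checks were added whose results could overlap, combining the vectors with 'c()' would count some genes twice. The sketch below, on made-up IDs, shows that 'union()' avoids this; all variable names are hypothetical.

```r
# Made-up results of two checks that both flag the gene 'G2'.
toy_flagged_first  <- c("G1", "G2")
toy_flagged_second <- c("G2", "G3")

# c() keeps the repeated ID, union() keeps each ID once.
combined_with_c     <- c(toy_flagged_first, toy_flagged_second)
combined_with_union <- union(toy_flagged_first, toy_flagged_second)

length(combined_with_c)      # 4
length(combined_with_union)  # 3
```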

##### Discarding the 806 genes from all the gene expression datasets

__The file 'gene_expression_data_cleaned_sorted.csv':__

To remove the 806 genes present in the array 'genes_to_delete' from the file 'gene_expression_data_cleaned_sorted.csv', we can alter the 'gene_expression_data' data table to delete the rows that represent these genes. This can be achieved by indexing the 'gene_expression_data' data table based on whether the genes appear within the 'genes_to_delete' array.

In [22]:
# Removing the genes present in the 'genes_to_delete' array from the 'gene_expression_data' data table by indexing the 
# 'gene_expression_data' data table based on whether the genes appear within the 'genes_to_delete' array.
gene_expression_data_genes_removed <- gene_expression_data[!(sub("\\..*", "", gene_expression_data$"Gene ID") %in% genes_to_delete),]

print("The gene expression data after the genes have been removed:")
gene_expression_data_genes_removed

[1] "The gene expression data after the genes have been removed:"


Gene ID,Gene Name,TCGA-06-0125-01A-01,TCGA-06-0125-02A-11,TCGA-06-0152-02A-01,TCGA-06-0171-02A-11,TCGA-06-0190-01A-01,TCGA-06-0190-02A-01,TCGA-06-0210-01A-01,TCGA-06-0210-02A-01,⋯,TCGA-32-1980-01A-01,TCGA-32-5222-01A-01,TCGA-41-5651-01A-01,TCGA-76-4925-01A-01,TCGA-76-4926-01B-01,TCGA-76-4927-01A-01,TCGA-76-4928-01B-01,TCGA-76-4929-01A-01,TCGA-76-4931-01A-01,TCGA-76-4932-01A-01
<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,⋯,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
ENSG00000000419.13,DPM1,97.4399,63.4521,98.1366,55.5088,80.6678,85.4610,80.0162,49.7091,⋯,74.3346,143.1418,118.7632,169.7759,68.5659,64.4713,89.2847,90.8538,84.3173,66.3511
ENSG00000000457.14,SCYL3,6.5428,5.4929,6.3809,5.0426,4.8642,4.0998,4.8043,4.8995,⋯,5.0076,7.0083,9.4073,6.7280,6.4974,5.4887,4.2106,7.6404,6.6323,7.0174
ENSG00000000460.17,C1orf112,5.9849,3.1369,5.7963,2.7663,6.9529,5.4879,2.8904,4.2395,⋯,3.2461,7.2234,6.9749,5.8939,3.9360,4.6042,3.8754,2.3827,5.9630,5.1545
ENSG00000000938.13,FGR,7.0651,16.4290,21.9912,65.6843,25.5317,11.9973,26.2095,17.0881,⋯,11.1232,36.0053,9.5427,3.8866,15.2813,11.1335,15.9864,23.5793,10.5260,13.9823
ENSG00000000971.16,CFH,16.9301,16.4273,19.3073,50.7959,24.9163,45.3018,12.3576,14.5879,⋯,17.2011,8.1233,4.4526,11.4120,22.3823,13.0230,6.9736,10.0796,9.3104,3.4071
ENSG00000001036.14,FUCA2,63.7275,66.9871,73.1009,65.8725,101.1771,96.1271,57.4271,50.5623,⋯,41.8926,73.5414,42.2461,55.6078,91.3001,79.9233,84.9414,29.7248,28.4352,65.8406
ENSG00000001084.13,GCLC,14.5107,22.8885,9.4840,20.9938,17.4425,9.7493,12.6304,14.7003,⋯,14.4338,10.8659,25.2573,11.7874,22.3965,16.0202,9.9544,18.8568,16.0662,11.1008
ENSG00000001167.14,NFYA,24.2783,18.1955,24.5692,22.3543,28.6862,30.9364,19.3338,22.7468,⋯,15.1850,21.6339,32.9123,18.0061,21.8243,25.2603,19.4110,47.1435,27.2980,22.2000
ENSG00000001460.18,STPG1,0.9795,1.4115,1.3706,4.6538,5.4792,3.6715,1.4185,2.1311,⋯,7.6503,0.8688,1.0287,7.0489,6.1533,8.1108,3.5411,5.5205,3.2969,1.3547
ENSG00000001461.17,NIPAL3,22.7208,23.8983,16.2227,30.2300,32.1510,14.5839,8.9311,19.2154,⋯,46.9597,16.2843,16.2530,33.7971,21.6787,33.9425,18.9278,18.4699,14.0138,29.7655


Next, we can strip the version suffix (a dot followed by one or more digits) from the gene IDs in the 'gene_expression_data_genes_removed' data table, such that only the part of each gene ID that identifies the gene remains. This allows us to later look up these genes in the 'genes_location_data' data table, which will be defined later, by comparing the gene IDs directly rather than having to remove part of one of the strings first.

In [23]:
# Stripping the version suffix (a dot followed by one or more digits at the end of the string) from the gene IDs present in 
# the 'gene_expression_data_genes_removed' data table such that only the part of the gene IDs remains that identifies the 
# gene.
gene_expression_data_genes_removed$'Gene ID' <- gsub("\\.\\d+$", "", gene_expression_data_genes_removed$'Gene ID')

print("The gene expression data after the gene IDs have been replaced:")
gene_expression_data_genes_removed

[1] "The gene expression data after the gene IDs have been replaced:"


Gene ID,Gene Name,TCGA-06-0125-01A-01,TCGA-06-0125-02A-11,TCGA-06-0152-02A-01,TCGA-06-0171-02A-11,TCGA-06-0190-01A-01,TCGA-06-0190-02A-01,TCGA-06-0210-01A-01,TCGA-06-0210-02A-01,⋯,TCGA-32-1980-01A-01,TCGA-32-5222-01A-01,TCGA-41-5651-01A-01,TCGA-76-4925-01A-01,TCGA-76-4926-01B-01,TCGA-76-4927-01A-01,TCGA-76-4928-01B-01,TCGA-76-4929-01A-01,TCGA-76-4931-01A-01,TCGA-76-4932-01A-01
<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,⋯,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
ENSG00000000419,DPM1,97.4399,63.4521,98.1366,55.5088,80.6678,85.4610,80.0162,49.7091,⋯,74.3346,143.1418,118.7632,169.7759,68.5659,64.4713,89.2847,90.8538,84.3173,66.3511
ENSG00000000457,SCYL3,6.5428,5.4929,6.3809,5.0426,4.8642,4.0998,4.8043,4.8995,⋯,5.0076,7.0083,9.4073,6.7280,6.4974,5.4887,4.2106,7.6404,6.6323,7.0174
ENSG00000000460,C1orf112,5.9849,3.1369,5.7963,2.7663,6.9529,5.4879,2.8904,4.2395,⋯,3.2461,7.2234,6.9749,5.8939,3.9360,4.6042,3.8754,2.3827,5.9630,5.1545
ENSG00000000938,FGR,7.0651,16.4290,21.9912,65.6843,25.5317,11.9973,26.2095,17.0881,⋯,11.1232,36.0053,9.5427,3.8866,15.2813,11.1335,15.9864,23.5793,10.5260,13.9823
ENSG00000000971,CFH,16.9301,16.4273,19.3073,50.7959,24.9163,45.3018,12.3576,14.5879,⋯,17.2011,8.1233,4.4526,11.4120,22.3823,13.0230,6.9736,10.0796,9.3104,3.4071
ENSG00000001036,FUCA2,63.7275,66.9871,73.1009,65.8725,101.1771,96.1271,57.4271,50.5623,⋯,41.8926,73.5414,42.2461,55.6078,91.3001,79.9233,84.9414,29.7248,28.4352,65.8406
ENSG00000001084,GCLC,14.5107,22.8885,9.4840,20.9938,17.4425,9.7493,12.6304,14.7003,⋯,14.4338,10.8659,25.2573,11.7874,22.3965,16.0202,9.9544,18.8568,16.0662,11.1008
ENSG00000001167,NFYA,24.2783,18.1955,24.5692,22.3543,28.6862,30.9364,19.3338,22.7468,⋯,15.1850,21.6339,32.9123,18.0061,21.8243,25.2603,19.4110,47.1435,27.2980,22.2000
ENSG00000001460,STPG1,0.9795,1.4115,1.3706,4.6538,5.4792,3.6715,1.4185,2.1311,⋯,7.6503,0.8688,1.0287,7.0489,6.1533,8.1108,3.5411,5.5205,3.2969,1.3547
ENSG00000001461,NIPAL3,22.7208,23.8983,16.2227,30.2300,32.1510,14.5839,8.9311,19.2154,⋯,46.9597,16.2843,16.2530,33.7971,21.6787,33.9425,18.9278,18.4699,14.0138,29.7655
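
The two cells above use slightly different patterns to strip the version suffix: 'sub()' with the pattern "\\..*" when filtering and 'gsub()' with the pattern "\\.\\d+$" when renaming. For IDs of the form 'ENSG00000000419.13' both give the same result, but they would differ for an ID that carries extra text after the version number. The following minimal sketch illustrates this on a small made-up vector; the third ID and its '_SUFFIX' ending are hypothetical and only serve as an example.

```r
# Toy gene IDs; the third one is a hypothetical ID with text after the version.
example_ids <- c("ENSG00000000419.13", "ENSG00000000457.14", "ENSG00000000001.5_SUFFIX")

# Pattern used for filtering: drops everything from the first dot onwards.
stripped_for_filtering <- sub("\\..*", "", example_ids)

# Pattern used for renaming: drops only a trailing dot followed by digits.
stripped_for_renaming <- gsub("\\.\\d+$", "", example_ids)

# The IDs for which the two patterns disagree.
ids_with_disagreement <- example_ids[stripped_for_filtering != stripped_for_renaming]
ids_with_disagreement
```

If the same comparison returns an empty vector for the real 'Gene ID' column, the two patterns are interchangeable for this dataset.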


To be sure, we can check that we have not created any duplicate genes by removing part of the gene IDs. This can be achieved by calling the function 'unique()', which returns the unique gene IDs, and checking whether the number of unique gene IDs is equal to the number of records present in the 'gene_expression_data_genes_removed' data table.

In [24]:
# Checking whether all the gene IDs present in the 'Gene ID' column of the 'gene_expression_data_genes_removed' data table
# are unique.
unique_genes <- unique(gene_expression_data_genes_removed$'Gene ID')

if (length(unique_genes) == nrow(gene_expression_data_genes_removed)) {
  cat("All genes are unique.")
} else {
  cat("There are duplicated genes.")
}

All genes are unique.
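
The check above only reports whether duplicates exist. If it ever failed, it would be useful to know which gene IDs are affected. A minimal sketch of how this could be done with the base R functions 'anyDuplicated()' and 'duplicated()', using a small made-up vector that contains one duplicate on purpose:

```r
# Toy vector of version-stripped gene IDs containing one duplicate on purpose.
example_ids <- c("ENSG00000000419", "ENSG00000000457", "ENSG00000000419")

# 'anyDuplicated()' returns 0 when all elements are unique and otherwise the
# index of the first duplicated element.
first_duplicate_index <- anyDuplicated(example_ids)

# 'duplicated()' flags every repeated occurrence, so the affected IDs can be listed.
duplicated_ids <- unique(example_ids[duplicated(example_ids)])
duplicated_ids
```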

In addition, we can check whether any of the gene IDs present in the 'Gene ID' column of the 'gene_expression_data_genes_removed' data table cannot be mapped to a gene ID in the 'annotation_genes' data table.

In [25]:
# Checking whether any of the gene IDs present in the 'Gene ID' column of the 'gene_expression_data_genes_removed' data 
# table cannot be mapped to a gene ID in the 'annotation_genes' data table.
gene_ids_not_mapped <- !gene_expression_data_genes_removed$'Gene ID' %in% annotation_genes$ensembl_gene_id

# Retrieving the gene IDs that cannot be mapped to a gene ID in the 'ensembl_gene_id' column of the 'annotation_genes' 
# data table.
gene_ids <- gene_expression_data_genes_removed$'Gene ID'[gene_ids_not_mapped]

# Displaying the gene IDs that cannot be mapped to a gene ID in the annotation data.
if (length(gene_ids) == 0) {
    cat("All the gene IDs can be mapped to a gene ID in the annotation data.")
} else {
    cat("The following gene IDs cannot be mapped to a gene ID in the annotation data:\n")
    cat(gene_ids)
    cat("\n\nThe number of gene IDs that cannot be mapped to a gene ID in the annotation data:\n")
    cat(length(gene_ids))
}

All the gene IDs can be mapped to a gene ID in the annotation data.

As we can see from the output above, all of the gene IDs present in the 'Gene ID' column of the 'gene_expression_data_genes_removed' data table can be mapped to a gene ID in the 'annotation_genes' data table.
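
The same mapping check can also be expressed with set operations. The following is a minimal sketch on made-up vectors, not the code used in this notebook; the ID 'ENSG99999999999' is a hypothetical ID that is deliberately absent from the toy annotation.

```r
# Toy gene IDs from an expression table and from an annotation table.
expression_ids <- c("ENSG00000000419", "ENSG00000000457", "ENSG99999999999")
annotation_ids <- c("ENSG00000000457", "ENSG00000000419", "ENSG00000000460")

# The IDs that occur in the expression data but not in the annotation data.
unmapped_ids <- setdiff(expression_ids, annotation_ids)

# A single logical summarising whether every expression ID can be mapped.
all_mapped <- all(expression_ids %in% annotation_ids)
unmapped_ids
```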

Next, we can store the resulting data table 'gene_expression_data_genes_removed' featuring this information. This can be achieved by calling the function 'fwrite()' from the 'data.table' library which takes as arguments the data table to be stored and the location where the data table should be stored.

In [26]:
# The path where the 'gene_expression_data_genes_removed' data table should be stored.
path <- file.path(data_directory_combined_cleaned_files, "gene_expression_data_cleaned_sorted_genes_removed.csv")

# If the 'path' defined above already points to a file, the existing file is not overwritten and a message is displayed 
# informing that the file already exists.
if (file.exists(path)) {
    cat(paste("There is already a file present at the path: ", path))
} else {
    # Writing the data table to a CSV file.
    data.table::fwrite(gene_expression_data_genes_removed, path)
    cat(paste("The file has been created at the path: ", path))
}

The file has been created at the path:  C:/Users/laure/OneDrive/Documenten/Bachelor Thesis Data/combined_cleaned_data/gene_expression_data_cleaned_sorted_genes_removed.csv
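
Note that the guard above leaves an existing file untouched, so a file from an earlier run is kept even if the data table has changed since. The sketch below wraps the same pattern in a small helper with an explicit switch for overwriting and demonstrates it on a temporary file. The helper name 'write_if_absent' and its 'overwrite' parameter are hypothetical and not part of the 'data.table' library.

```r
library(data.table)

# Hypothetical helper: writes 'dt' to 'path' unless a file already exists there
# and 'overwrite' is FALSE. Returns TRUE when a file was written.
write_if_absent <- function(dt, path, overwrite = FALSE) {
  if (file.exists(path) && !overwrite) {
    return(FALSE)
  }
  data.table::fwrite(dt, path)
  TRUE
}

# A small made-up table and a temporary path to demonstrate the helper.
example_table <- data.table("Gene ID" = c("ENSG00000000419", "ENSG00000000457"),
                            sample_a = c(97.4399, 6.5428))
example_path <- tempfile(fileext = ".csv")

first_write <- write_if_absent(example_table, example_path)   # The file is created.
second_write <- write_if_absent(example_table, example_path)  # The file is kept as-is.

# Reading the file back to confirm that the contents survive the round trip.
reloaded <- data.table::fread(example_path)
```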

<br></br>
<br></br>
__The file 'gene_expression_data_cleaned_log2_transformed_sorted.csv':__

To remove the 806 genes present in the array 'genes_to_delete' from the file 'gene_expression_data_cleaned_log2_transformed_sorted.csv', we first need to load the file from the directory 'data_directory_combined_cleaned_files'. This can be achieved by calling the function 'fread()' with the path of the file as its argument, which loads the file into this notebook as a data table. To explicitly call the function 'fread()' from the 'data.table' package, the '::' operator is used.

In [27]:
# The path of the gene expression data file 'gene_expression_data_cleaned_log2_transformed_sorted.csv' to be loaded into this 
# notebook.
path <- file.path(data_directory_combined_cleaned_files, "gene_expression_data_cleaned_log2_transformed_sorted.csv")

# If the 'path' defined above does not point to a file, the execution of this code block is terminated and an error message
# is displayed.
if (!file.exists(path)) {
  stop("File not found: ", path)
}

# Loading the gene expression log2 transformed data as a data table into this notebook by calling the function 'fread()'.
gene_expression_data_log2_transformed <- data.table::fread(path)

cat("The gene_expression log2 transformed data:")
gene_expression_data_log2_transformed

The gene_expression log2 transformed data:

Gene ID,Gene Name,TCGA-06-0125-01A-01,TCGA-06-0125-02A-11,TCGA-06-0152-02A-01,TCGA-06-0171-02A-11,TCGA-06-0190-01A-01,TCGA-06-0190-02A-01,TCGA-06-0210-01A-01,TCGA-06-0210-02A-01,⋯,TCGA-32-1980-01A-01,TCGA-32-5222-01A-01,TCGA-41-5651-01A-01,TCGA-76-4925-01A-01,TCGA-76-4926-01B-01,TCGA-76-4927-01A-01,TCGA-76-4928-01B-01,TCGA-76-4929-01A-01,TCGA-76-4931-01A-01,TCGA-76-4932-01A-01
<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,⋯,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
ENSG00000000003.15,TSPAN6,6.2813885,6.3909686,5.7000041,4.9624287,6.1498812,5.3862626,6.3715048,5.5975461,⋯,4.6612707,6.7296330,5.9858146,6.31654040,6.8374223,6.2455527,6.0675140,6.8900697,6.1755046,6.5538136
ENSG00000000005.6,TNMD,0.4890291,0.5047729,0.2130050,0.1844707,0.3665880,1.0936954,0.1806571,0.6421932,⋯,0.4945692,0.4590119,0.4274989,0.49671799,1.8159852,0.3278023,0.2705887,2.7367347,0.2951352,0.3969267
ENSG00000000419.13,DPM1,6.6211713,6.0101555,6.6313459,5.8204036,6.3516955,6.4339776,6.3401385,5.6641728,⋯,6.2352407,7.1713450,6.9040409,7.41596059,6.1203084,6.0327907,6.4964096,6.5212675,6.4147664,6.0736296
ENSG00000000457.14,SCYL3,2.9151002,2.6988630,2.8837967,2.5951694,2.5519343,2.3504407,2.5371221,2.5605927,⋯,2.5867888,3.0014960,3.3795239,2.95009509,2.9063904,2.6979295,2.3814495,3.1110981,2.9321179,3.0031345
ENSG00000000460.17,C1orf112,2.8042395,2.0485501,2.7647495,1.9131479,2.9914810,2.6977516,1.9599185,2.3894291,⋯,2.0861383,3.0397350,2.9954664,2.78532037,2.3033424,2.4865084,2.2855206,1.7581752,2.7997090,2.6216417
ENSG00000000938.13,FGR,3.0116924,4.1234179,4.5230099,6.0592752,4.7296452,3.7001401,4.7660385,4.1769690,⋯,3.5996987,5.2096600,3.3981725,2.28883102,4.0251440,3.6009239,4.0863082,4.6193719,3.5268200,3.9051872
ENSG00000000971.16,CFH,4.1643116,4.1232772,4.3439265,5.6947660,4.6957879,5.5329964,3.7395889,3.9623547,⋯,4.1859537,3.1895558,2.4469443,3.63366370,4.5473449,3.8097231,2.9952312,3.4698339,3.3660284,2.1398296
ENSG00000001036.14,FUCA2,6.0163069,6.0871891,6.2114192,6.0633411,6.6749281,6.6018020,5.8685658,5.6882447,⋯,5.4226569,6.2199700,5.4344981,5.82292895,6.5282603,6.3384832,6.4252814,4.9413317,4.8794705,6.0626528
ENSG00000001084.13,GCLC,3.9551919,4.5782444,3.3901174,4.4590250,4.2049623,3.4261708,3.7687560,3.9727202,⋯,3.9480214,3.5687496,4.7146467,3.67665105,4.5482208,4.0891761,3.4534386,4.3115612,4.0930700,3.5970305
ENSG00000001167.14,NFYA,4.6598275,4.2626962,4.6763351,4.5456163,4.8917205,4.9971298,4.3458079,4.5696612,⋯,4.0165855,4.5004133,5.0837367,4.24839062,4.5124987,4.7148115,4.3512750,5.5892691,4.8226282,4.5360529
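
Comparing the values above with the untransformed values shown earlier suggests that the transformation is log2(x + 1) rather than a plain log2(x). This is an inference from the displayed numbers and is not stated in this notebook. A minimal check, using the values of the gene 'DPM1' for the first two samples as they are displayed in both tables:

```r
# Untransformed and log2 transformed values of 'DPM1' for the first two samples,
# copied from the tables displayed in this notebook.
raw_values <- c(97.4399, 63.4521)
log2_values <- c(6.6211713, 6.0101555)

# If the transformation is log2(x + 1), this difference should be close to zero.
max_abs_difference <- max(abs(log2(raw_values + 1) - log2_values))
max_abs_difference
```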


Now that the file has been loaded, we can alter the 'gene_expression_data_log2_transformed' data table to delete the rows that represent these genes. This can be achieved by indexing the 'gene_expression_data_log2_transformed' data table based on whether the genes appear within the 'genes_to_delete' array.

In [28]:
# Removing the genes present in the 'genes_to_delete' array from the 'gene_expression_data_log2_transformed' data table by 
# indexing this data table based on whether the genes appear within the 'genes_to_delete' array.
gene_expression_data_log2_transformed_genes_removed <- gene_expression_data_log2_transformed[!(sub("\\..*", "", gene_expression_data_log2_transformed$"Gene ID") %in% genes_to_delete),]

print("The gene expression log2 transformed data after the genes have been removed:")
gene_expression_data_log2_transformed_genes_removed

[1] "The gene expression log2 transformed data after the genes have been removed:"


Gene ID,Gene Name,TCGA-06-0125-01A-01,TCGA-06-0125-02A-11,TCGA-06-0152-02A-01,TCGA-06-0171-02A-11,TCGA-06-0190-01A-01,TCGA-06-0190-02A-01,TCGA-06-0210-01A-01,TCGA-06-0210-02A-01,⋯,TCGA-32-1980-01A-01,TCGA-32-5222-01A-01,TCGA-41-5651-01A-01,TCGA-76-4925-01A-01,TCGA-76-4926-01B-01,TCGA-76-4927-01A-01,TCGA-76-4928-01B-01,TCGA-76-4929-01A-01,TCGA-76-4931-01A-01,TCGA-76-4932-01A-01
<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,⋯,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
ENSG00000000419.13,DPM1,6.6211713,6.0101555,6.6313459,5.8204036,6.3516955,6.4339776,6.3401385,5.6641728,⋯,6.2352407,7.1713450,6.9040409,7.41596059,6.1203084,6.0327907,6.4964096,6.5212675,6.414766,6.0736296
ENSG00000000457.14,SCYL3,2.9151002,2.6988630,2.8837967,2.5951694,2.5519343,2.3504407,2.5371221,2.5605927,⋯,2.5867888,3.0014960,3.3795239,2.95009509,2.9063904,2.6979295,2.3814495,3.1110981,2.932118,3.0031345
ENSG00000000460.17,C1orf112,2.8042395,2.0485501,2.7647495,1.9131479,2.9914810,2.6977516,1.9599185,2.3894291,⋯,2.0861383,3.0397350,2.9954664,2.78532037,2.3033424,2.4865084,2.2855206,1.7581752,2.799709,2.6216417
ENSG00000000938.13,FGR,3.0116924,4.1234179,4.5230099,6.0592752,4.7296452,3.7001401,4.7660385,4.1769690,⋯,3.5996987,5.2096600,3.3981725,2.28883102,4.0251440,3.6009239,4.0863082,4.6193719,3.526820,3.9051872
ENSG00000000971.16,CFH,4.1643116,4.1232772,4.3439265,5.6947660,4.6957879,5.5329964,3.7395889,3.9623547,⋯,4.1859537,3.1895558,2.4469443,3.63366370,4.5473449,3.8097231,2.9952312,3.4698339,3.366028,2.1398296
ENSG00000001036.14,FUCA2,6.0163069,6.0871891,6.2114192,6.0633411,6.6749281,6.6018020,5.8685658,5.6882447,⋯,5.4226569,6.2199700,5.4344981,5.82292895,6.5282603,6.3384832,6.4252814,4.9413317,4.879471,6.0626528
ENSG00000001084.13,GCLC,3.9551919,4.5782444,3.3901174,4.4590250,4.2049623,3.4261708,3.7687560,3.9727202,⋯,3.9480214,3.5687496,4.7146467,3.67665105,4.5482208,4.0891761,3.4534386,4.3115612,4.093070,3.5970305
ENSG00000001167.14,NFYA,4.6598275,4.2626962,4.6763351,4.5456163,4.8917205,4.9971298,4.3458079,4.5696612,⋯,4.0165855,4.5004133,5.0837367,4.24839062,4.5124987,4.7148115,4.3512750,5.5892691,4.822628,4.5360529
ENSG00000001460.18,STPG1,0.9851361,1.2699308,1.2452523,2.4992208,2.6958157,2.2238859,1.2741125,1.6466696,⋯,3.1127502,0.9021122,1.0205555,3.00879163,2.8386089,3.1875777,2.1830418,2.7049826,2.103296,1.2355433
ENSG00000001461.17,NIPAL3,4.5680808,4.6379753,4.1062394,4.9648607,5.0509805,3.9619844,3.3119535,4.3373828,⋯,5.5837507,4.1113903,4.1087753,5.12089517,4.5032660,5.1269109,4.3167105,4.2831736,3.908217,4.9432415


Next, we can strip the version suffix (a dot followed by one or more digits) from the gene IDs in the 'gene_expression_data_log2_transformed_genes_removed' data table, such that only the part of each gene ID that identifies the gene remains. This allows us to later look up these genes in the 'genes_location_data' data table, which will be defined later, by comparing the gene IDs directly rather than having to remove part of one of the strings first.

In [29]:
# Stripping the version suffix (a dot followed by one or more digits at the end of the string) from the gene IDs present in 
# the 'gene_expression_data_log2_transformed_genes_removed' data table such that only the part of the gene IDs remains that 
# identifies the gene.
gene_expression_data_log2_transformed_genes_removed$'Gene ID' <- gsub("\\.\\d+$", "", gene_expression_data_log2_transformed_genes_removed$'Gene ID')

print("The gene expression data after the gene IDs have been replaced:")
gene_expression_data_log2_transformed_genes_removed

[1] "The gene expression data after the gene IDs have been replaced:"


Gene ID,Gene Name,TCGA-06-0125-01A-01,TCGA-06-0125-02A-11,TCGA-06-0152-02A-01,TCGA-06-0171-02A-11,TCGA-06-0190-01A-01,TCGA-06-0190-02A-01,TCGA-06-0210-01A-01,TCGA-06-0210-02A-01,⋯,TCGA-32-1980-01A-01,TCGA-32-5222-01A-01,TCGA-41-5651-01A-01,TCGA-76-4925-01A-01,TCGA-76-4926-01B-01,TCGA-76-4927-01A-01,TCGA-76-4928-01B-01,TCGA-76-4929-01A-01,TCGA-76-4931-01A-01,TCGA-76-4932-01A-01
<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,⋯,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
ENSG00000000419,DPM1,6.6211713,6.0101555,6.6313459,5.8204036,6.3516955,6.4339776,6.3401385,5.6641728,⋯,6.2352407,7.1713450,6.9040409,7.41596059,6.1203084,6.0327907,6.4964096,6.5212675,6.414766,6.0736296
ENSG00000000457,SCYL3,2.9151002,2.6988630,2.8837967,2.5951694,2.5519343,2.3504407,2.5371221,2.5605927,⋯,2.5867888,3.0014960,3.3795239,2.95009509,2.9063904,2.6979295,2.3814495,3.1110981,2.932118,3.0031345
ENSG00000000460,C1orf112,2.8042395,2.0485501,2.7647495,1.9131479,2.9914810,2.6977516,1.9599185,2.3894291,⋯,2.0861383,3.0397350,2.9954664,2.78532037,2.3033424,2.4865084,2.2855206,1.7581752,2.799709,2.6216417
ENSG00000000938,FGR,3.0116924,4.1234179,4.5230099,6.0592752,4.7296452,3.7001401,4.7660385,4.1769690,⋯,3.5996987,5.2096600,3.3981725,2.28883102,4.0251440,3.6009239,4.0863082,4.6193719,3.526820,3.9051872
ENSG00000000971,CFH,4.1643116,4.1232772,4.3439265,5.6947660,4.6957879,5.5329964,3.7395889,3.9623547,⋯,4.1859537,3.1895558,2.4469443,3.63366370,4.5473449,3.8097231,2.9952312,3.4698339,3.366028,2.1398296
ENSG00000001036,FUCA2,6.0163069,6.0871891,6.2114192,6.0633411,6.6749281,6.6018020,5.8685658,5.6882447,⋯,5.4226569,6.2199700,5.4344981,5.82292895,6.5282603,6.3384832,6.4252814,4.9413317,4.879471,6.0626528
ENSG00000001084,GCLC,3.9551919,4.5782444,3.3901174,4.4590250,4.2049623,3.4261708,3.7687560,3.9727202,⋯,3.9480214,3.5687496,4.7146467,3.67665105,4.5482208,4.0891761,3.4534386,4.3115612,4.093070,3.5970305
ENSG00000001167,NFYA,4.6598275,4.2626962,4.6763351,4.5456163,4.8917205,4.9971298,4.3458079,4.5696612,⋯,4.0165855,4.5004133,5.0837367,4.24839062,4.5124987,4.7148115,4.3512750,5.5892691,4.822628,4.5360529
ENSG00000001460,STPG1,0.9851361,1.2699308,1.2452523,2.4992208,2.6958157,2.2238859,1.2741125,1.6466696,⋯,3.1127502,0.9021122,1.0205555,3.00879163,2.8386089,3.1875777,2.1830418,2.7049826,2.103296,1.2355433
ENSG00000001461,NIPAL3,4.5680808,4.6379753,4.1062394,4.9648607,5.0509805,3.9619844,3.3119535,4.3373828,⋯,5.5837507,4.1113903,4.1087753,5.12089517,4.5032660,5.1269109,4.3167105,4.2831736,3.908217,4.9432415


To be sure, we can check that we have not created any duplicate genes by removing part of the gene IDs. This can be achieved by calling the function 'unique()', which returns the unique gene IDs, and checking whether the number of unique gene IDs is equal to the number of records present in the 'gene_expression_data_log2_transformed_genes_removed' data table.

In [30]:
# Checking whether all the gene IDs present in the 'Gene ID' column of the 'gene_expression_data_log2_transformed_genes_removed' 
# data table are unique.
unique_genes <- unique(gene_expression_data_log2_transformed_genes_removed$'Gene ID')

if (length(unique_genes) == nrow(gene_expression_data_log2_transformed_genes_removed)) {
  cat("All genes are unique.")
} else {
  cat("There are duplicated genes.")
}

All genes are unique.

In addition, we can check whether any of the gene IDs present in the 'Gene ID' column of the 'gene_expression_data_log2_transformed_genes_removed' data table cannot be mapped to a gene ID in the 'annotation_genes' data table.

In [31]:
# Checking whether any of the gene IDs present in the 'Gene ID' column of the 'gene_expression_data_log2_transformed_genes_removed' 
# data table cannot be mapped to a gene ID in the 'annotation_genes' data table.
gene_ids_not_mapped <- !gene_expression_data_log2_transformed_genes_removed$'Gene ID' %in% annotation_genes$ensembl_gene_id

# Retrieving the gene IDs that cannot be mapped to a gene ID in the 'ensembl_gene_id' column of the 'annotation_genes' 
# data table.
gene_ids <- gene_expression_data_log2_transformed_genes_removed$'Gene ID'[gene_ids_not_mapped]

# Displaying the gene IDs that cannot be mapped to a gene ID in the annotation data.
if (length(gene_ids) == 0) {
    cat("All the gene IDs can be mapped to a gene ID in the annotation data.")
} else {
    cat("The following gene IDs cannot be mapped to a gene ID in the annotation data:\n")
    cat(gene_ids)
    cat("\n\nThe number of gene IDs that cannot be mapped to a gene ID in the annotation data:\n")
    cat(length(gene_ids))
}

All the gene IDs can be mapped to a gene ID in the annotation data.

As we can see from the output above, all of the gene IDs present in the 'Gene ID' column of the 'gene_expression_data_log2_transformed_genes_removed' data table can be mapped to a gene ID in the 'annotation_genes' data table.

Next, we can store the resulting data table 'gene_expression_data_log2_transformed_genes_removed' featuring this information. This can be achieved by calling the function 'fwrite()' from the 'data.table' library which takes as arguments the data table to be stored and the location where the data table should be stored.

In [32]:
# The path where the 'gene_expression_data_log2_transformed_genes_removed' data table should be stored.
path <- file.path(data_directory_combined_cleaned_files, "gene_expression_data_cleaned_log2_transformed_sorted_genes_removed.csv")

# If the 'path' defined above already points to a file, the existing file is not overwritten and a message is displayed 
# informing that the file already exists.
if (file.exists(path)) {
    cat(paste("There is already a file present at the path: ", path))
} else {
    # Writing the data table to a CSV file.
    data.table::fwrite(gene_expression_data_log2_transformed_genes_removed, path)
    cat(paste("The file has been created at the path: ", path))
}

The file has been created at the path:  C:/Users/laure/OneDrive/Documenten/Bachelor Thesis Data/combined_cleaned_data/gene_expression_data_cleaned_log2_transformed_sorted_genes_removed.csv

##### Discarding the 806 genes from the array of genes 'genes'

To remove the 806 genes present in the array 'genes_to_delete' from the array 'genes', we can call the function 'setdiff()', which returns all the elements in the 'genes' array that are not present in the 'genes_to_delete' array.

In [33]:
# Removing the genes present in the 'genes_to_delete' array from the 'genes' array.
genes <- setdiff(genes, genes_to_delete)
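
One property of 'setdiff()' that is worth keeping in mind is that it treats its arguments as sets, so any duplicates in the first argument are collapsed as well. Logical indexing with '%in%', as used for the data tables above, keeps them. The following minimal sketch shows the difference on made-up IDs ('ENSG_A', 'ENSG_B' and 'ENSG_C' are placeholders, not real gene IDs):

```r
# Toy vector with a duplicated element and a toy vector of genes to delete.
example_genes <- c("ENSG_A", "ENSG_B", "ENSG_B", "ENSG_C")
example_genes_to_delete <- c("ENSG_C")

# 'setdiff()' removes 'ENSG_C' and also collapses the duplicated 'ENSG_B'.
with_setdiff <- setdiff(example_genes, example_genes_to_delete)

# Logical indexing removes 'ENSG_C' but keeps both occurrences of 'ENSG_B'.
with_indexing <- example_genes[!example_genes %in% example_genes_to_delete]
```

For an array without duplicates both approaches give the same result.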

We can check whether indeed all the 806 genes present in the 'genes_to_delete' array have been removed from the array of genes called 'genes' by running the following code.

In [34]:
# Checking whether the 'genes' appear in the 'genes_to_delete' array.
missing_genes <- genes %in% genes_to_delete

# Retrieving the genes that are still present in the 'genes_to_delete' array.
genes_missing <- genes[missing_genes]

# Displaying the missing genes.
if (length(genes_missing) == 0) {
    cat("All the genes present in the 'genes_to_delete' array have been removed.")
} else {
    cat("The following genes present in the 'genes_to_delete' array have not been removed:\n")
    cat(genes_missing)
    cat("\n\nThe number of genes present in the 'genes_to_delete' array that have not been removed:\n")
    cat(length(genes_missing))
}

All the genes present in the 'genes_to_delete' array have been removed.

As we can see from the output above, all 806 genes have indeed been removed from the array 'genes'. This means that the annotation data 'annotation_genes' now holds the necessary information for every remaining gene, and we can continue with the next step: using this annotation data to retrieve the locations of the genes.

Before we do this, we should first create a data table to store the location information for each of the genes present in the array 'genes'. This can be achieved by calling the function 'data.table()' from the 'data.table' library. Within this function call, we set the column 'gene' to contain the genes present in the array 'genes'; the remaining columns are created empty and are filled in later.

In [35]:
# Creating a data table to store the location information for each of the genes present in the array 'genes' and adding all
# these genes to the column 'gene'.
genes_location_data <- data.table::data.table(gene = genes, chromosome = character(), start_position = numeric(), end_position = numeric(), strand = character())

cat("The 'genes_location_data' data table:")
genes_location_data

"Item 2 has 0 rows but longest item has 19631; filled with NA"
"Item 3 has 0 rows but longest item has 19631; filled with NA"
"Item 4 has 0 rows but longest item has 19631; filled with NA"
"Item 5 has 0 rows but longest item has 19631; filled with NA"


The 'genes_location_data' data table:

gene,chromosome,start_position,end_position,strand
<chr>,<chr>,<dbl>,<dbl>,<chr>
ENSG00000000419,,,,
ENSG00000000457,,,,
ENSG00000000460,,,,
ENSG00000000938,,,,
ENSG00000000971,,,,
ENSG00000001036,,,,
ENSG00000001084,,,,
ENSG00000001167,,,,
ENSG00000001460,,,,
ENSG00000001461,,,,


To verify whether all the genes from the 'genes' character array have been added to the 'genes_location_data' data table, we can check whether the number of rows corresponds to the number of elements in the 'genes' character array.

In [36]:
# Checking whether all the genes from the 'genes' character array have been added to the 'genes_location_data' data table.
if (nrow(genes_location_data) == length(genes)) {
    cat("All of the genes from the 'genes' character array have been added.")
} else {
    cat("Not all of the genes from the 'genes' character array have been added.")
}

All of the genes from the 'genes' character array have been added.

Now, we can add the location data for each of the genes by retrieving the data from the 'annotation_genes' data table and insert it into the 'genes_location_data' data table.

Since we do not want to retrieve two records for a single gene, we need to check whether all the genes in the 'annotation_genes' data table are unique. This can be achieved by calling the function 'unique()' on the column that contains the gene IDs.

In [37]:
# Checking whether all the genes in the 'annotation_genes' data table are unique.
unique_genes_annotation <- unique(annotation_genes$ensembl_gene_id)

if (length(unique_genes_annotation) == nrow(annotation_genes)) {
  cat("All genes in the 'annotation_genes' data table are unique.")
} else {
  cat("There are non-unique genes in the 'annotation_genes' data table.")
}

All genes in the 'annotation_genes' data table are unique.

Now that we have verified that all genes in the 'annotation_genes' data table are unique, we can safely retrieve a single record for each gene. This can be achieved by first calling the function 'match()', which returns for each gene in the 'gene' column of the 'genes_location_data' data table the index of the corresponding row in the 'ensembl_gene_id' column of the 'annotation_genes' data table. From these matching rows, the chromosome, the starting position, the ending position, and the strand can be retrieved to fill in the missing values in the 'genes_location_data' data table.

To be consistent with how the location data is represented for both the CpG sites and the genes, we transform the values 1 and -1 for the 'strand' column to the values '+' and '-' respectively and add the string 'chr' before the chromosome numbers.

In [38]:
# Matching the genes in the 'genes_location_data' data table to the 'annotation_genes' data table.
matching_rows <- match(genes_location_data$gene, annotation_genes$ensembl_gene_id)

# Filling in the missing values in the 'genes_location_data' data table with the chromosome, start position, end position, 
# and strand from the 'annotation_genes' data table.
genes_location_data$chromosome <- paste0("chr", annotation_genes$chromosome_name[matching_rows])
genes_location_data$start_position <- as.numeric(annotation_genes$start_position[matching_rows])
genes_location_data$end_position <- as.numeric(annotation_genes$end_position[matching_rows])
genes_location_data$strand <- ifelse(annotation_genes$strand[matching_rows] == 1, '+', '-')
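
To make the behaviour of 'match()' and the two conversions concrete, the following minimal sketch applies them to a small made-up annotation table. All IDs are placeholders, and 'ENSG_C' is deliberately absent from the toy annotation to show what would happen to a gene without a match: 'match()' returns NA for it, which 'paste0()' turns into the string 'chrNA' and 'ifelse()' leaves as NA. In this notebook such cases should not occur, since the genes without annotation were removed earlier.

```r
# Toy annotation table in a different order than the genes that are looked up.
example_annotation <- data.frame(ensembl_gene_id = c("ENSG_B", "ENSG_A"),
                                 chromosome_name = c("2", "1"),
                                 strand = c(-1, 1),
                                 stringsAsFactors = FALSE)
example_query <- c("ENSG_A", "ENSG_B", "ENSG_C")

# For each queried gene, the row index in the annotation table (NA if absent).
example_rows <- match(example_query, example_annotation$ensembl_gene_id)

# The same conversions as above: prefixing 'chr' and mapping 1 / -1 to '+' / '-'.
example_chromosome <- paste0("chr", example_annotation$chromosome_name[example_rows])
example_strand <- ifelse(example_annotation$strand[example_rows] == 1, "+", "-")
```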

To improve the readability and structure of the data table, we can sort it based on the 'chromosome' column followed by the 'start_position' column. To define the order in which the chromosomes should be sorted, the 'chromosome' column is first converted with the function 'factor()'. To actually order the data table, we can call the function 'order()'.

In [39]:
# Defining the order in which the chromosomes should be sorted by calling the function 'factor()'.
genes_location_data$chromosome <- factor(genes_location_data$chromosome, levels = c("chr1", "chr2", "chr3", "chr4", "chr5", "chr6", "chr7", "chr8", "chr9", "chr10", "chr11", "chr12", "chr13", "chr14", "chr15", "chr16", "chr17", "chr18", "chr19", "chr20", "chr21", "chr22", "chrX", "chrY"))

# Sorting the table based on the chromosome and the position to increase the structure by calling the function 'order()'.
genes_location_data_sorted <- genes_location_data[order(chromosome, start_position)]

cat("The resulting 'genes_location_data_sorted' data table:")
genes_location_data_sorted

The resulting 'genes_location_data_sorted' data table:

gene,chromosome,start_position,end_position,strand
<chr>,<fct>,<dbl>,<dbl>,<chr>
ENSG00000227232,chr1,14404,29570,-
ENSG00000278267,chr1,17369,17436,-
ENSG00000268903,chr1,135141,135895,-
ENSG00000269981,chr1,137682,137965,-
ENSG00000279457,chr1,185217,195411,-
ENSG00000225972,chr1,629062,629433,+
ENSG00000225630,chr1,629640,630683,+
ENSG00000237973,chr1,631074,632616,+
ENSG00000229344,chr1,632757,633438,+
ENSG00000240409,chr1,633535,633741,+
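
The conversion to a factor matters for two reasons: it makes 'chr2' sort before 'chr10', which plain character sorting would not do, and it turns any value that is not listed in the levels into NA. The following minimal sketch shows both effects on a made-up vector; the value 'chrMT' is only included as an example of a name outside the listed levels, and the levels are built with 'paste0()' as a shorter way of writing the same 24 names.

```r
# Toy chromosome names, including one that is not among the factor levels.
example_chromosomes <- c("chr10", "chr2", "chrX", "chr1", "chrMT")

# The same 24 levels as used above, written more compactly.
chromosome_levels <- paste0("chr", c(1:22, "X", "Y"))
example_factor <- factor(example_chromosomes, levels = chromosome_levels)

# Values outside the levels become NA and are placed last by 'order()'.
number_of_unlisted <- sum(is.na(example_factor))
sorted_chromosomes <- example_chromosomes[order(example_factor)]
sorted_chromosomes
```

A quick 'sum(is.na(...))' on the real 'chromosome' column after the conversion would confirm that no gene was assigned to an unlisted chromosome name.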


## Storing the Resulting Genes Location Data

Now that the location data of the genes has been retrieved, we can store the resulting data table 'genes_location_data_sorted'. This can be achieved by calling the function 'fwrite()' from the 'data.table' library, which takes as arguments the data table to be stored and the location where the data table should be stored.

In [40]:
# The path where the 'genes_location_data_sorted' data table should be stored.
path <- file.path(data_directory_location_files, "genes_location_data.csv")

# If the 'path' defined above already points to a file, the existing file is not overwritten and a message is displayed 
# informing that the file already exists.
if (file.exists(path)) {
    cat(paste("There is already a file present at the path: ", path))
} else {
    # Writing the data table to a CSV file.
    data.table::fwrite(genes_location_data_sorted, path)
    cat(paste("The file has been created at the path: ", path))
}

There is already a file present at the path:  C:/Users/laure/OneDrive/Documenten/Bachelor Thesis Data/location_data/genes_location_data.csv