In [1]:
from matplotlib import pyplot as pp
import numpy as np

In [2]:
import main

In [3]:
m = main.Module()

### Dataset

Below is the list of datasets collected by Steffi. Let's examine them.

In [4]:
for p in m.cfg['data']['dataset']:
    print(p)

{'pattern': 'pattern_1', 'geo_tag': 'GSE111061'}
{'pattern': 'pattern_1', 'geo_tag': 'GSE80688'}
{'pattern': 'pattern_1', 'geo_tag': 'GSE148346'}
{'pattern': 'pattern_1', 'geo_tag': 'GSE157538'}
{'pattern': 'pattern_1', 'geo_tag': 'GSE65127'}
{'pattern': 'pattern_2a', 'geo_tag': 'GSE236758'}
{'pattern': 'pattern_2a', 'geo_tag': 'GSE6281'}
{'pattern': 'pattern_2a', 'geo_tag': 'GSE168735', 'should_load': False}
{'pattern': 'pattern_3', 'geo_tag': 'GSE53795'}
{'pattern': 'pattern_3', 'geo_tag': 'GSE11792'}
{'pattern': 'pattern_3', 'geo_tag': 'GSE10433'}
{'pattern': 'pattern_3', 'geo_tag': 'GSE6475'}
{'pattern': 'pattern_3', 'geo_tag': 'GSE122592'}
{'pattern': 'pattern_3', 'geo_tag': 'GSE115099'}
{'pattern': 'pattern_3', 'geo_tag': 'GSE108110'}
{'pattern': 'pattern_4a', 'geo_tag': 'GSE212954'}
{'pattern': 'pattern_4a', 'geo_tag': 'GSE151464'}
{'pattern': 'pattern_4a', 'geo_tag': 'GSE202855'}
{'pattern': 'pattern_4a', 'geo_tag': 'GSE188952'}
{'pattern': 'pattern_4a', 'geo_tag': 'GSE173900'}

### Data type

As I have discovered, some of the datasets are stored as RNA and some as SRA. The RNA datasets are easier to access because the **GEOparse** package can read them directly. For the SRA datasets, the information we are looking for is stored in the supplementary files, and I have not yet been able to load them. Let's see which type of data each dataset contains.

In [6]:
m.examine_type()

1 GSE111061 pattern_1 SRA
2 GSE80688 pattern_1 SRA
3 GSE148346 pattern_1 RNA
4 GSE157538 pattern_1 SRA
5 GSE65127 pattern_1 RNA
6 GSE236758 pattern_2a SRA
7 GSE6281 pattern_2a RNA
8 GSE168735 pattern_2a RNA
9 GSE53795 pattern_3 RNA
10 GSE11792 pattern_3 RNA
11 GSE10433 pattern_3 RNA
12 GSE6475 pattern_3 RNA
13 GSE122592 pattern_3 SRA
14 GSE115099 pattern_3 SRA
15 GSE108110 pattern_3 RNA
16 GSE212954 pattern_4a SRA
17 GSE151464 pattern_4a SRA
18 GSE202855 pattern_4a SRA
19 GSE188952 pattern_4a SRA
20 GSE173900 pattern_4a SRA
21 GSE158395 pattern_4a SRA
22 GSE92566 pattern_4a RNA
23 GSE234987 pattern_4a SRA
24 GSE153011 pattern_4a RNA
25 GSE166863 pattern_4a SRA
26 GSE166861 pattern_4a SRA
27 GSE9285 pattern_4a RNA
28 GSE158923 pattern_4b SRA
29 GSE169146 pattern_4b SRA
30 GSE32887 pattern_4b RNA
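
As a rough illustration of how a label like the ones above could be derived, here is a minimal sketch. It assumes that each sample carries a metadata dictionary with a `type` entry holding a list of strings such as `RNA` or `SRA`, which is how I understand GEOparse to expose the sample type. The function name `series_data_type` is hypothetical and is not necessarily what `m.examine_type()` does internally.

```python
from collections import Counter


def series_data_type(sample_metadata):
    """Return the most common sample type in a series.

    sample_metadata: dict mapping a sample id to its metadata dict.
    Assumption: each metadata dict has a 'type' entry that is a list
    of strings, e.g. ['RNA'] or ['SRA'].
    """
    counts = Counter()
    for meta in sample_metadata.values():
        # A missing 'type' entry contributes nothing to the count.
        for value in meta.get("type", []):
            counts[value] += 1
    if not counts:
        return "unknown"
    # most_common(1) returns a list with one (value, count) pair.
    return counts.most_common(1)[0][0]
```

With a series object loaded through GEOparse, the input could presumably be built as `{name: gsm.metadata for name, gsm in gse.gsms.items()}`, but I have not verified this against every dataset in the list.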


To deal with the SRA data, we either have to figure out how each dataset is loaded or find a way to convert SRA to RNA.

For now, we will ignore the SRA data and first make sure we understand how to read the RNA data. This leaves us with the following datasets:

In [14]:
m.examine_type(ignore_sra=True)

1 GSE148346 pattern_1 RNA
Data shape [ num samples, num genes ] : (129, 30147)
2 GSE65127 pattern_1 RNA
Data shape [ num samples, num genes ] : (40, 54675)
3 GSE6281 pattern_2a RNA
Data shape [ num samples, num genes ] : (34, 54675)
4 GSE168735 pattern_2a RNA
Data shape [ num samples, num genes ] : (103, 21201)
5 GSE53795 pattern_3 RNA
Data shape [ num samples, num genes ] : (24, 54675)
6 GSE11792 pattern_3 RNA
Data shape [ num samples, num genes ] : (16, 22277)
7 GSE10433 pattern_3 RNA
Data shape [ num samples, num genes ] : (12, 22277)
8 GSE6475 pattern_3 RNA
Data shape [ num samples, num genes ] : (18, 22277)
9 GSE108110 pattern_3 RNA
Data shape [ num samples, num genes ] : (54, 54675)
10 GSE92566 pattern_4a RNA
Data shape [ num samples, num genes ] : (7, 54675)
11 GSE153011 pattern_4a RNA
Data shape [ num samples, num genes ] : (30, 45234)
12 GSE9285 pattern_4a RNA
Data shape [ num samples, num genes ] : (75, 28495)
13 GSE32887 pattern_4b RNA
Data shape [ num samples, num genes ] :
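
The shape convention used above, rows as samples and columns as genes, can be illustrated with a small sketch. The function `build_matrix` and its input layout (one dictionary of gene name to value per sample) are assumptions made for illustration only and are not part of `main.Module`.

```python
import numpy as np


def build_matrix(sample_tables):
    """Arrange per-sample values into a [num samples, num genes] array.

    sample_tables: dict mapping a sample id to a dict of
    gene name -> expression value (hypothetical input layout).
    Genes missing from a sample are filled with NaN.
    """
    sample_ids = sorted(sample_tables)
    # Union of all gene names seen in any sample, in sorted order.
    gene_ids = sorted(set().union(*(t.keys() for t in sample_tables.values())))
    column = {gene: j for j, gene in enumerate(gene_ids)}

    data = np.full((len(sample_ids), len(gene_ids)), np.nan)
    for i, sample in enumerate(sample_ids):
        for gene, value in sample_tables[sample].items():
            data[i, column[gene]] = value
    return sample_ids, gene_ids, data
```

Calling `build_matrix(...)[2].shape` then gives the same kind of `(num samples, num genes)` tuple that is printed for each dataset.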

### Genes

Now we can look at the number of genes in each dataset and how the gene names are encoded. If we look at **4 - GSE168735**, **11 - GSE153011** and **12 - GSE9285**, the gene names are in a different format, so I need a way to convert gene names from one naming system to another. We will ignore these datasets for the time being, until I figure out how to convert their gene names to the standardized format. We also have to match the gene names in our brain dataset to the ones found in these datasets.

In [19]:
m.examine_genes(ignore_sra=True)

1 GSE148346 pattern_1 RNA
Number of genes : 30147
['1007_s_at' '1053_at' '117_at' ... 'AFFX-r2-Ec-bioD-5_at'
 'AFFX-r2-P1-cre-3_at' 'AFFX-r2-P1-cre-5_at']
 ****************** 
2 GSE65127 pattern_1 RNA
Number of genes : 54675
['1007_s_at' '1053_at' '117_at' ... 'AFFX-r2-Ec-bioD-5_at'
 'AFFX-r2-P1-cre-3_at' 'AFFX-r2-P1-cre-5_at']
 ****************** 
3 GSE6281 pattern_2a RNA
Number of genes : 54675
['1007_s_at' '1053_at' '117_at' ... 'AFFX-r2-Ec-bioD-5_at'
 'AFFX-r2-P1-cre-3_at' 'AFFX-r2-P1-cre-5_at']
 ****************** 
4 GSE168735 pattern_2a RNA
Number of genes : 21201
['10000_at_AB_1:51' '10000_at_AB_52:60' '10001_at_AB_1:4;6:19' ...
 '999_at_AB_12:22' '999_at_AB_1:11' '9_at']
 ****************** 
5 GSE53795 pattern_3 RNA
Number of genes : 54675
['1007_s_at' '1053_at' '117_at' ... 'AFFX-r2-Ec-bioD-5_at'
 'AFFX-r2-P1-cre-3_at' 'AFFX-r2-P1-cre-5_at']
 ****************** 
6 GSE11792 pattern_3 RNA
Number of genes : 22277
['1007_s_at' '1053_at' '117_at' ... 'AFFX-r2-Ec-bioD-5_at'
 'AFFX-r

Let's take the genes from **GSE65127**, which includes 54675 genes, as our target gene pool. We want to know what fraction of the genes in each of the other datasets is included in this gene pool:

In [27]:
m = main.Module()
m.load_dataset()

100%|████████████████████████████████████| 30/30 [00:28<00:00,  1.06it/s]


In [30]:
m.examine_gene_pool()

1 pattern_1 GSE148346  - Number of genes in the pool :  30147  - Fraction of the genes in the pool :  1.0
2 pattern_1 GSE65127  - Number of genes in the pool :  54675  - Fraction of the genes in the pool :  1.0
3 pattern_2a GSE6281  - Number of genes in the pool :  54675  - Fraction of the genes in the pool :  1.0
4 pattern_2a GSE168735  - Number of genes in the pool :  189  - Fraction of the genes in the pool :  0.008914673836139805
5 pattern_3 GSE53795  - Number of genes in the pool :  54675  - Fraction of the genes in the pool :  1.0
6 pattern_3 GSE11792  - Number of genes in the pool :  22277  - Fraction of the genes in the pool :  1.0
7 pattern_3 GSE10433  - Number of genes in the pool :  22277  - Fraction of the genes in the pool :  1.0
8 pattern_3 GSE6475  - Number of genes in the pool :  22277  - Fraction of the genes in the pool :  1.0
9 pattern_3 GSE108110  - Number of genes in the pool :  54675  - Fraction of the genes in the pool :  1.0
10 pattern_4a GSE92566  - Number of g
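
The two numbers reported per dataset above can be computed with a few lines. This is a minimal sketch of the idea, not necessarily how `m.examine_gene_pool()` is implemented; the function name `pool_coverage` is hypothetical.

```python
def pool_coverage(dataset_genes, pool_genes):
    """Count how many of a dataset's gene names appear in the pool.

    Returns (number of dataset genes found in the pool,
             that number divided by the dataset's gene count).
    """
    pool = set(pool_genes)          # set lookup keeps the check fast
    genes = list(dataset_genes)
    n_in_pool = sum(1 for gene in genes if gene in pool)
    # Guard against an empty dataset to avoid dividing by zero.
    fraction = n_in_pool / len(genes) if genes else 0.0
    return n_in_pool, fraction
```

For **GSE168735**, for example, 189 of its 21201 gene names are found in the pool, which gives the fraction of roughly 0.0089 shown in the output.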

This confirms that datasets **4 - GSE168735**, **11 - GSE153011** and **12 - GSE9285** use a different gene-name format from the rest of the datasets: barely any of their gene names appear in the gene pool. We need a conversion, if one is possible.

This will probably require a database that maps one naming system to the other.
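
One way such a conversion could work, assuming we can obtain a lookup table from each dataset's gene names to a common naming system, is sketched below. The lookup table, the placeholder names and the function `collapse_to_genes` are all hypothetical. Averaging the values that map to the same target name is only one possible choice.

```python
from collections import defaultdict
from statistics import mean


def collapse_to_genes(values_by_name, name_to_gene):
    """Translate gene names with a lookup table and merge duplicates.

    values_by_name: dict of original gene name -> expression value.
    name_to_gene:   dict of original gene name -> common gene name
                    (hypothetical lookup table).
    Returns (dict of common gene name -> mean value,
             list of original names with no entry in the table).
    """
    grouped = defaultdict(list)
    unmapped = []
    for name, value in values_by_name.items():
        gene = name_to_gene.get(name)
        if gene is None:
            unmapped.append(name)   # keep track of what we could not convert
        else:
            grouped[gene].append(value)
    collapsed = {gene: mean(vals) for gene, vals in grouped.items()}
    return collapsed, unmapped


# Toy example with placeholder names, not real annotation data.
table = {"p1": "GENE_A", "p2": "GENE_A", "p3": "GENE_B"}
values = {"p1": 2.0, "p2": 4.0, "p3": 1.0, "p4": 9.0}
converted, missing = collapse_to_genes(values, table)
```

Returning the unmapped names alongside the converted values makes it easy to see how much of a dataset is lost in the conversion.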