This notebook provides the functionality to retrieve gene expression data from the Gene Expression Omnibus (GEO), apply data normalization techniques, and generate an output table in the .tsv format.

The example uses the Ulloa-Montoya dataset (GEO series GSE35640, platform GPL570).

# Import Python base packages

In [24]:
%load_ext autoreload
%matplotlib inline
%config IPCompleter.use_jedi = False

import pathlib
import subprocess
import logging
import os
import pandas as pd
import csv



In [25]:
from portraits.mapping import get_gs_for_probes_from_3col, get_expressions_for_gs


In [26]:
import warnings
warnings.filterwarnings("ignore")

In [27]:
%load_ext rpy2.ipython



# Get the example data from GEO

To use a different dataset, set the GSE and PLATFORM variables to the corresponding GEO series and platform accessions.

In [28]:
GSE = 'GSE35640'
PLATFORM = 'GPL570'
EXPRESSION_MATRIX = 'Test/expression.tsv'
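Changing GSE or PLATFORM also changes the download paths, because the GEO FTP server groups accessions into range directories (GSE35640 sits under GSE35nnn, GPL570 under GPLnnn). The following is a minimal sketch of how those paths could be derived from the accession. The helper names `geo_stub`, `raw_tar_url` and `annot_url` are hypothetical and are not part of this notebook or of any library.

```python
import re

GEO_FTP = 'ftp://ftp.ncbi.nlm.nih.gov/geo'


def geo_stub(accession):
    """Return the range directory for an accession, e.g. 'GSE35640' -> 'GSE35nnn'."""
    match = re.fullmatch(r'(GSE|GPL)(\d+)', accession)
    if match is None:
        raise ValueError(f'Unexpected accession: {accession!r}')
    prefix, digits = match.groups()
    # Accessions with three or fewer digits share the bare '<prefix>nnn' directory.
    return f'{prefix}{digits[:-3]}nnn'


def raw_tar_url(gse):
    """URL of the supplementary RAW archive of a series (hypothetical helper)."""
    return f'{GEO_FTP}/series/{geo_stub(gse)}/{gse}/suppl/{gse}_RAW.tar'


def annot_url(gpl):
    """URL of the SOFT annotation table of a platform (hypothetical helper)."""
    return f'{GEO_FTP}/platforms/{geo_stub(gpl)}/{gpl}/annot/{gpl}.annot.gz'


print(raw_tar_url('GSE35640'))
print(annot_url('GPL570'))
```

Not every series provides a `_RAW.tar` archive and not every platform has an annotation table, so the download can still fail for other accessions.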

Initialize and create a temporary directory for the CEL files

In [29]:
current_dir = pathlib.Path().parent.absolute()
dir_to_process = str(current_dir / 'TMPDIR')

In [30]:
os.makedirs(dir_to_process, exist_ok=True)

In [31]:
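# GEO groups series into range directories: GSE35640 lives under GSE35nnn
series_stub = f'{GSE[:3]}{GSE[3:-3]}nnn'

with open(os.devnull, "w") as f:
    subprocess.run([
        'wget',
        f'ftp://ftp.ncbi.nlm.nih.gov/geo/series/{series_stub}/{GSE}/suppl/{GSE}_RAW.tar'
    ], stdout=f, stderr=subprocess.STDOUT);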

In [32]:
subprocess.run([
    'tar',
    '-xf',
    f'{GSE}_RAW.tar',
    '-C', dir_to_process
])
subprocess.run([
    'rm',
    f'{GSE}_RAW.tar'
])

CompletedProcess(args=['rm', 'GSE35640_RAW.tar'], returncode=0)
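The `tar` and `rm` calls above rely on command-line tools being installed. As a portable alternative, the same extract-then-delete step could be sketched with the standard library alone. The function name `extract_and_remove` and the demo file names are hypothetical. The demo builds a tiny archive locally so that it runs without network access.

```python
import pathlib
import tarfile
import tempfile


def extract_and_remove(archive_path, target_dir):
    """Extract a tar archive into target_dir, delete the archive, return the extracted file names."""
    archive_path = pathlib.Path(archive_path)
    target_dir = pathlib.Path(target_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path) as archive:
        archive.extractall(target_dir)
    archive_path.unlink()
    return sorted(p.name for p in target_dir.iterdir())


# Demonstration on a placeholder archive (not real GEO data).
with tempfile.TemporaryDirectory() as tmp:
    tmp = pathlib.Path(tmp)
    cel = tmp / 'GSM0000001_demo_sample_1.CEL.gz'  # hypothetical file name
    cel.write_bytes(b'placeholder')
    archive = tmp / 'DEMO_RAW.tar'
    with tarfile.open(archive, 'w') as tar:
        tar.add(cel, arcname=cel.name)
    cel.unlink()

    extracted = extract_and_remove(archive, tmp / 'TMPDIR')
    archive_still_exists = archive.exists()

print(extracted)
```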

In [33]:
os.listdir(dir_to_process)

['GSM872328_MAGE008_sample_1.CEL.gz',
 'GSM872329_MAGE008_sample_2.CEL.gz',
 'GSM872330_MAGE008_sample_3.CEL.gz',
 'GSM872331_MAGE008_sample_4.CEL.gz',
 'GSM872332_MAGE008_sample_5.CEL.gz',
 'GSM872333_MAGE008_sample_6.CEL.gz',
 'GSM872334_MAGE008_sample_7.CEL.gz',
 'GSM872335_MAGE008_sample_8.CEL.gz',
 'GSM872336_MAGE008_sample_9.CEL.gz',
 'GSM872337_MAGE008_sample_10.CEL.gz',
 'GSM872338_MAGE008_sample_11.CEL.gz',
 'GSM872339_MAGE008_sample_12.CEL.gz',
 'GSM872340_MAGE008_sample_13.CEL.gz',
 'GSM872341_MAGE008_sample_14.CEL.gz',
 'GSM872342_MAGE008_sample_15.CEL.gz',
 'GSM872343_MAGE008_sample_16.CEL.gz',
 'GSM872344_MAGE008_sample_17.CEL.gz',
 'GSM872345_MAGE008_sample_18.CEL.gz',
 'GSM872346_MAGE008_sample_19.CEL.gz',
 'GSM872347_MAGE008_sample_20.CEL.gz',
 'GSM872348_MAGE008_sample_21.CEL.gz',
 'GSM872349_MAGE008_sample_22.CEL.gz',
 'GSM872350_MAGE008_sample_23.CEL.gz',
 'GSM872351_MAGE008_sample_24.CEL.gz',
 'GSM872352_MAGE008_sample_25.CEL.gz',
 'GSM872353_MAGE008_sample_26.CEL.

## Extracting expression values from CEL files

* For Affymetrix (affy) arrays without special probes, use RMA.
* For GPL570 and GPL96, use GCRMA (the `gcrma` function used in this notebook).
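The rule above can be written down as a small lookup so that the choice follows the PLATFORM variable. This is only a sketch of the rule as stated here. The name `normalization_method` is hypothetical, and the set of platforms contains just the two named in the text.

```python
# Platforms for which this notebook recommends GCRMA (taken from the text above).
GCRMA_PLATFORMS = {'GPL570', 'GPL96'}


def normalization_method(platform):
    """Return 'gcrma' for the platforms listed above and 'rma' otherwise."""
    return 'gcrma' if platform in GCRMA_PLATFORMS else 'rma'


print(normalization_method('GPL570'))
```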

In [36]:
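# Load a previously saved expression table (tab-separated, row labels in the first column)
annotated_expression = pd.read_csv('Test/expressions.tsv', sep='\t', index_col=0)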

In [None]:
%%R

# Load required R packages
library(affy)
library(annotate)
library(gcrma)

In [None]:
%%R -i dir_to_process -o normalized_expression

# Bulk read CEL files
raw_expression <- ReadAffy(celfile.path = dir_to_process)

# Normalize expression using GCRMA
gcrma_normalized <- gcrma(raw_expression)

# Retrieve expressions from dataset
normalized_expression <- as.data.frame(exprs(gcrma_normalized))

In [None]:
normalized_expression.head()

In [None]:
# Trim column names to the part before the first underscore (the GSM accession) to make the table more readable.

normalized_expression.columns = normalized_expression.columns.str.split('_').str[0]
normalized_expression.head()
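To show the effect of the trimming step in isolation, here is a small sketch on a toy table. The column names follow the CEL file names listed earlier, while the expression values are invented for illustration.

```python
import pandas as pd

# Toy table: probes in rows, CEL file names as columns (values are made up).
demo = pd.DataFrame(
    [[7.1, 6.9], [3.2, 3.4]],
    index=['1007_s_at', '1053_at'],
    columns=[
        'GSM872328_MAGE008_sample_1.CEL.gz',
        'GSM872329_MAGE008_sample_2.CEL.gz',
    ],
)

# Keep only the part before the first underscore, i.e. the GSM accession.
demo.columns = demo.columns.str.split('_').str[0]
trimmed = list(demo.columns)
print(trimmed)
```

The trimmed names stay unique only if every file starts with a distinct GSM accession, which can be checked with `demo.columns.is_unique`.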

In [None]:
# Delete the temporary directory with the CEL files

subprocess.run(['rm', '-r', dir_to_process])

## Converting probe ids to HUGO gene symbols

To download the SOFT file manually, follow these steps (we use the Ulloa-Montoya dataset as an example):

* Go to https://www.ncbi.nlm.nih.gov/geo/
* Type the GPL platform accession (GPL570 in our example) in the search bar
* Go to https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GPL570
* Find the Annotation SOFT table at the bottom of the page
* Click the button to download the SOFT annotation table for your platform (GPL570.annot.gz in our example)
* Upload the file to the environment.

To download the SOFT annotation table from the Jupyter notebook instead, set PLATFORM to the appropriate GPL accession.

In [37]:
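# GEO groups platforms into range directories: GPL570 lives under GPLnnn
platform_stub = f'{PLATFORM[:3]}{PLATFORM[3:-3]}nnn'

with open(os.devnull, "w") as f:
    subprocess.run([
        'wget',
        f'ftp://ftp.ncbi.nlm.nih.gov/geo/platforms/{platform_stub}/{PLATFORM}/annot/{PLATFORM}.annot.gz'
    ], stdout=f, stderr=subprocess.STDOUT)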

Once you have downloaded the SOFT file, extract the data from it. Decompress GPL570.annot.gz to get GPL570.annot.

In [38]:
subprocess.run(['gunzip', f'{PLATFORM}.annot.gz'])

CompletedProcess(args=['gunzip', 'GPL570.annot.gz'], returncode=0)

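Remove the header from the SOFT file so that the remaining lines can be read as a plain table. The sed command deletes everything up to and including the column-name line that starts with ID.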

In [39]:
subprocess.run(f"sed '1,/^ID/d' {PLATFORM}.annot > {PLATFORM}.beheaded.annot", shell = True)

CompletedProcess(args="sed '1,/^ID/d' GPL570.annot > GPL570.beheaded.annot", returncode=0)
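For environments without `sed`, the header removal could be sketched in plain Python. The function name `strip_annot_header` is hypothetical and the sample text is a toy stand-in for an annotation file, not its real content. One difference from the sed range is that this version also stops at an ID line in the very first position, which the sed expression would skip.

```python
import io


def strip_annot_header(lines):
    """Yield only the lines that follow the first line starting with 'ID'."""
    seen_header = False
    for line in lines:
        if seen_header:
            yield line
        elif line.startswith('ID'):
            seen_header = True


# Toy input: two metadata lines, the column-name line, then two data rows.
sample = io.StringIO(
    '^Annotation\n'
    '!platform_table_begin\n'
    'ID\tGene title\tGene symbol\tENTREZ_GENE_ID\n'
    '1053_at\ttitle placeholder\tRFC2\t5982\n'
    '117_at\ttitle placeholder\tHSPA6\t3310\n'
)

rows = list(strip_annot_header(sample))
print(rows)
```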

Turn the SOFT file into a 3-column file:

* 1st column: probe ID
* 2nd column: gene symbol (kept as is, with '///' between multiple symbols)
* 3rd column: Entrez ID (not needed for the study)

In [40]:
gene_SOFT_annotations = pd.read_csv(f'{PLATFORM}.beheaded.annot', sep='\t', header=None)

# Keep the probe ID, gene symbol and Entrez ID columns
gene_SOFT_annotations = gene_SOFT_annotations.iloc[:, [0, 2, 3]]

# Rename columns to 0, 1, 2
gene_SOFT_annotations = gene_SOFT_annotations.rename(columns={2: 1, 3: 2})
gene_SOFT_annotations.to_csv(f'{PLATFORM}.3col', sep='\t', index=False, quoting=csv.QUOTE_NONNUMERIC)

In [41]:
gene_SOFT_annotations.head()

| | 0 | 1 | 2 |
|---|---|---|---|
| 0 | 1007_s_at | MIR4640///DDR1 | 100616237///780 |
| 1 | 1053_at | RFC2 | 5982 |
| 2 | 117_at | HSPA6 | 3310 |
| 3 | 121_at | PAX8 | 7849 |
| 4 | 1255_g_at | GUCA1A | 2978 |
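Some probes, such as 1007_s_at above, list several gene symbols separated by '///'. The sketch below shows two generic ways such entries could be handled with pandas: keeping only probes that map to exactly one gene, or expanding each probe into one row per gene. It is an illustration only and makes no claim about how `get_gs_for_probes_from_3col` treats these probes.

```python
import pandas as pd

# First rows of the 3-column table shown above.
annotations = pd.DataFrame({
    0: ['1007_s_at', '1053_at', '117_at'],
    1: ['MIR4640///DDR1', 'RFC2', 'HSPA6'],
    2: ['100616237///780', '5982', '3310'],
})

# Split the gene symbol column into lists, indexed by probe ID.
symbols = annotations.set_index(0)[1].dropna().str.split('///')

# Option 1: keep only probes that map to a single gene.
unique_probes = {probe: genes[0] for probe, genes in symbols.items() if len(genes) == 1}

# Option 2: one row per (probe, gene) pair.
long_table = symbols.explode()

print(unique_probes)
print(long_table)
```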


Delete the intermediate annotation files

In [42]:
subprocess.run(['rm', f'{PLATFORM}.annot'])
subprocess.run(['rm', f'{PLATFORM}.beheaded.annot'])

CompletedProcess(args=['rm', 'GPL570.beheaded.annot'], returncode=0)

In [None]:
probes_gs_dict = get_gs_for_probes_from_3col(f'{PLATFORM}.3col', normalized_expression.index.tolist())

In [None]:
pd.Series(probes_gs_dict).head(10)

In [None]:
series = pd.Series(probes_gs_dict)
annotated_expression = get_expressions_for_gs(series, normalized_expression, 'max').T.sort_index()

annotated_expression.to_csv(EXPRESSION_MATRIX, sep='\t', index=True)
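When several probes map to the same gene, the probe-level rows have to be collapsed into one gene-level row. The sketch below shows one common interpretation of a 'max' rule, taking the per-sample maximum across the probes of a gene with a pandas groupby. This is an assumption made for illustration: the actual behavior of `get_expressions_for_gs` is not documented here, and all probe names, gene names and values are invented.

```python
import pandas as pd

# Toy probe-level matrix: probes in rows, samples in columns (made-up values).
expression = pd.DataFrame(
    {'GSM1': [5.0, 7.0, 3.0], 'GSM2': [6.0, 4.0, 2.0]},
    index=['probe_a', 'probe_b', 'probe_c'],
)

# Hypothetical probe-to-gene mapping; two probes share GENE1.
probe_to_gene = pd.Series({'probe_a': 'GENE1', 'probe_b': 'GENE1', 'probe_c': 'GENE2'})

# Per gene and per sample, keep the maximum value across its probes.
gene_level = expression.groupby(probe_to_gene).max()

# Transpose to samples in rows and genes in columns, sorted by sample name.
samples_by_genes = gene_level.T.sort_index()
print(samples_by_genes)
```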