### Data Preprocessing
The data downloaded from GEO comes either as tab-delimited text or as a loosely structured series matrix file. This notebook converts all data to CSV format and extracts the metadata from the series matrix file.
- In this notebook, we do by hand everything needed so that all downstream processes can run seamlessly and automatically.

### Import Libraries

In [116]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import gzip

### Load Series Matrix Data

In [117]:
# path to your downloaded file
metadata_path = '../downloads/GSE53697_series_matrix.txt.gz'

# prepare metadata
with gzip.open(metadata_path, 'rt') as file:
    metadata_df = pd.read_csv(file, sep='\t', skiprows = 30)
    metadata_df.index = metadata_df.iloc[:,0]
    metadata_df = metadata_df.iloc[:,1:].T
    
metadata_df.head()

!Sample_title,!Sample_geo_accession,!Sample_status,!Sample_submission_date,!Sample_last_update_date,!Sample_type,!Sample_channel_count,!Sample_source_name_ch1,!Sample_organism_ch1,!Sample_characteristics_ch1,!Sample_characteristics_ch1.1,...,!Sample_instrument_model,!Sample_library_selection,!Sample_library_source,!Sample_library_strategy,!Sample_relation,!Sample_relation.1,!Sample_supplementary_file_1,!series_matrix_table_begin,ID_REF,!series_matrix_table_end
RNAseq_Ctrl_1,GSM1885080,Public on Feb 17 2016,Sep 16 2015,May 15 2019,SRA,1,control_brain,Homo sapiens,disease status: control,tissue: brain,...,Illumina HiSeq 2500,cDNA,transcriptomic,RNA-Seq,BioSample: https://www.ncbi.nlm.nih.gov/biosam...,SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX...,NONE,,GSM1885080,
RNAseq_Ctrl_2,GSM1885081,Public on Feb 17 2016,Sep 16 2015,May 15 2019,SRA,1,control_brain,Homo sapiens,disease status: control,tissue: brain,...,Illumina HiSeq 2500,cDNA,transcriptomic,RNA-Seq,BioSample: https://www.ncbi.nlm.nih.gov/biosam...,SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX...,NONE,,GSM1885081,
RNAseq_Ctrl_3,GSM1885082,Public on Feb 17 2016,Sep 16 2015,May 15 2019,SRA,1,control_brain,Homo sapiens,disease status: control,tissue: brain,...,Illumina HiSeq 2500,cDNA,transcriptomic,RNA-Seq,BioSample: https://www.ncbi.nlm.nih.gov/biosam...,SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX...,NONE,,GSM1885082,
RNAseq_Ctrl_4,GSM1885083,Public on Feb 17 2016,Sep 16 2015,May 15 2019,SRA,1,control_brain,Homo sapiens,disease status: control,tissue: brain,...,Illumina HiSeq 2500,cDNA,transcriptomic,RNA-Seq,BioSample: https://www.ncbi.nlm.nih.gov/biosam...,SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX...,NONE,,GSM1885083,
RNAseq_Ctrl_5,GSM1885084,Public on Feb 17 2016,Sep 16 2015,May 15 2019,SRA,1,control_brain,Homo sapiens,disease status: control,tissue: brain,...,Illumina HiSeq 2500,cDNA,transcriptomic,RNA-Seq,BioSample: https://www.ncbi.nlm.nih.gov/biosam...,SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX...,NONE,,GSM1885084,


**Observation**   
* Series matrix files differ in the number of header rows, so some rows must be skipped for the code above to work correctly. Open the file manually to see how many rows to skip.
* The resulting dataframe must be explored manually to find the relevant information.
* In this case, we extract columns 0, 8 and 16 (sample accession, disease status and sample description).
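
The number of rows to skip does not have to be found by eye. As a minimal sketch, assuming the sample block of a series matrix file starts at the line beginning with `!Sample_title` (as the table above suggests), the helper below counts the lines that precede it. The function name `count_rows_before_samples` and the demo lines are hypothetical and only illustrate the idea.

```python
def count_rows_before_samples(lines, marker="!Sample_title"):
    """Return how many lines come before the first line starting with `marker`."""
    for row_number, line in enumerate(lines):
        if line.startswith(marker):
            return row_number
    raise ValueError(f"No line starts with {marker!r}")

# Made-up stand-in for the top of a series matrix file (not real GEO content)
demo_lines = [
    '!Series_title\t"demo"\n',
    '!Series_geo_accession\t"GSE0"\n',
    '!Sample_title\t"RNAseq_Ctrl_1"\t"RNAseq_Ctrl_2"\n',
]
rows_to_skip = count_rows_before_samples(demo_lines)
print(rows_to_skip)
```

On a real file, the same function could be called on the handle returned by `gzip.open(metadata_path, 'rt')`, and the result compared with the value passed to `skiprows` before relying on it.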

In [118]:
metadata_df = metadata_df.iloc[:,[0, 8,16]]
metadata_df.head()

!Sample_title,!Sample_geo_accession,!Sample_characteristics_ch1,!Sample_description
RNAseq_Ctrl_1,GSM1885080,disease status: control,C1
RNAseq_Ctrl_2,GSM1885081,disease status: control,C2
RNAseq_Ctrl_3,GSM1885082,disease status: control,C3
RNAseq_Ctrl_4,GSM1885083,disease status: control,C4
RNAseq_Ctrl_5,GSM1885084,disease status: control,C5


In [119]:
# reset index
metadata_df = metadata_df.rename_axis(None, axis=1)
metadata_df.reset_index(drop = True, inplace = True)
metadata_df.head()

Unnamed: 0,!Sample_geo_accession,!Sample_characteristics_ch1,!Sample_description
0,GSM1885080,disease status: control,C1
1,GSM1885081,disease status: control,C2
2,GSM1885082,disease status: control,C3
3,GSM1885083,disease status: control,C4
4,GSM1885084,disease status: control,C5


In [120]:
# rename columns
metadata_df.columns = ["sampleAccession", "Type", "sampleName"]
metadata_df.head()

Unnamed: 0,sampleAccession,Type,sampleName
0,GSM1885080,disease status: control,C1
1,GSM1885081,disease status: control,C2
2,GSM1885082,disease status: control,C3
3,GSM1885083,disease status: control,C4
4,GSM1885084,disease status: control,C5


In [121]:
# fix Type column
metadata_df["Type"] = metadata_df["Type"].map(lambda x: "healthy control" if "control" in x else "Alzheimer")
metadata_df.head()

Unnamed: 0,sampleAccession,Type,sampleName
0,GSM1885080,healthy control,C1
1,GSM1885081,healthy control,C2
2,GSM1885082,healthy control,C3
3,GSM1885083,healthy control,C4
4,GSM1885084,healthy control,C5


In [122]:
metadata_df["Type"].unique()

array(['healthy control', 'Alzheimer'], dtype=object)
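
The substring test above labels every value that does not contain "control" as Alzheimer, so an unexpected value would be relabelled silently. A minimal sketch of a stricter first step, assuming each characteristics string has the `key: value` form seen in the table ("disease status: control"): split the string once and inspect the value before recoding it. The helper name `split_characteristic` is hypothetical.

```python
def split_characteristic(text):
    """Split a 'key: value' characteristics string into a (key, value) tuple."""
    key, separator, value = text.partition(":")
    if not separator:
        raise ValueError(f"Expected 'key: value', got {text!r}")
    return key.strip(), value.strip()

key, value = split_characteristic("disease status: control")
print(key, "->", value)
```

Applied to the raw column before recoding, this would expose the distinct raw values and make any unexpected category visible.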

In [123]:
# sort table by accession
metadata_df = metadata_df.sort_values(by = "sampleAccession")
metadata_df.head()

Unnamed: 0,sampleAccession,Type,sampleName
0,GSM1885080,healthy control,C1
1,GSM1885081,healthy control,C2
2,GSM1885082,healthy control,C3
3,GSM1885083,healthy control,C4
4,GSM1885084,healthy control,C5


### Rename Read Files to Match Sample Accessions

In [124]:
# get reads 
import os
list_files = os.listdir("../downloads/Reads") 

In [125]:
list_files[:10]

['SRR2422934_1.fastq',
 'SRR2422933_2.fastq',
 'SRR2422930_1.fastq',
 'SRR2422928_1.fastq',
 'SRR2422934_2.fastq',
 'SRR2422918_1.fastq',
 'SRR2422918_2.fastq',
 'SRR2422932_2.fastq',
 'SRR2422929_1.fastq',
 'SRR2422931_1.fastq']

**Observation**  
* We use featureCounts to produce the count data in our pipeline.
* featureCounts uses the file names to label the sample columns in the count table.
* We want to be able to map or rename the sample columns easily using the metadata.
* Therefore, we rename the raw read files with the `sampleAccession` values from the metadata.

In [126]:
# get full file paths
file_paths = ["../downloads/Reads/"+file for file in list_files]
file_paths[:10]

['../downloads/Reads/SRR2422934_1.fastq',
 '../downloads/Reads/SRR2422933_2.fastq',
 '../downloads/Reads/SRR2422930_1.fastq',
 '../downloads/Reads/SRR2422928_1.fastq',
 '../downloads/Reads/SRR2422934_2.fastq',
 '../downloads/Reads/SRR2422918_1.fastq',
 '../downloads/Reads/SRR2422918_2.fastq',
 '../downloads/Reads/SRR2422932_2.fastq',
 '../downloads/Reads/SRR2422929_1.fastq',
 '../downloads/Reads/SRR2422931_1.fastq']

In [129]:
# extract sample ids
name_list1 = sorted(set([file_path.split("/")[-1].split("_")[0] for file_path in file_paths]))
name_list1[:10]

['SRR2422918',
 'SRR2422919',
 'SRR2422920',
 'SRR2422921',
 'SRR2422922',
 'SRR2422923',
 'SRR2422924',
 'SRR2422925',
 'SRR2422926',
 'SRR2422927']

In [130]:
# now map names
name_list2 = sorted(metadata_df["sampleAccession"])
# create map
mapping = dict(zip(name_list1, name_list2))
mapping

{'SRR2422918': 'GSM1885080',
 'SRR2422919': 'GSM1885081',
 'SRR2422920': 'GSM1885082',
 'SRR2422921': 'GSM1885083',
 'SRR2422922': 'GSM1885084',
 'SRR2422923': 'GSM1885085',
 'SRR2422924': 'GSM1885086',
 'SRR2422925': 'GSM1885087',
 'SRR2422926': 'GSM1885088',
 'SRR2422927': 'GSM1885089',
 'SRR2422928': 'GSM1885090',
 'SRR2422929': 'GSM1885091',
 'SRR2422930': 'GSM1885092',
 'SRR2422931': 'GSM1885093',
 'SRR2422932': 'GSM1885094',
 'SRR2422933': 'GSM1885095',
 'SRR2422934': 'GSM1885096'}
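
Pairing two sorted lists with `zip` assumes that run IDs and sample accessions were assigned in the same order, and `zip` silently drops unmatched items when the lists differ in length. A minimal sketch of a guarded version follows. `build_mapping` is a hypothetical helper, and the order assumption itself still needs checking against the SRA run table.

```python
def build_mapping(run_ids, sample_ids):
    """Pair sorted run IDs with sorted sample accessions, refusing unequal lengths."""
    run_ids = sorted(set(run_ids))
    sample_ids = sorted(set(sample_ids))
    if len(run_ids) != len(sample_ids):
        raise ValueError(
            f"{len(run_ids)} run IDs cannot be paired with "
            f"{len(sample_ids)} sample accessions"
        )
    return dict(zip(run_ids, sample_ids))

# First three pairs from the mapping shown above, passed in unsorted order
demo_mapping = build_mapping(
    ["SRR2422920", "SRR2422918", "SRR2422919"],
    ["GSM1885080", "GSM1885081", "GSM1885082"],
)
print(demo_mapping)
```

Failing loudly on a length mismatch is a deliberate choice here: a missing or extra read file would otherwise shift every later pairing without any warning.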

In [131]:
# now rename files
# Function to replace keys in file paths with corresponding values from the mapping
def rename_file(file_path, mapping):
    for old_name, new_name in mapping.items():
        if old_name in file_path:
            new_path = file_path.replace(old_name, new_name)
            os.rename(file_path, new_path)
            print(f"Renamed {file_path} to {new_path}")
            return

# Iterate over the file paths and rename them using the mapping
for file_path in file_paths:
    rename_file(file_path, mapping)


Renamed ../downloads/Reads/SRR2422934_1.fastq to ../downloads/Reads/GSM1885096_1.fastq
Renamed ../downloads/Reads/SRR2422933_2.fastq to ../downloads/Reads/GSM1885095_2.fastq
Renamed ../downloads/Reads/SRR2422930_1.fastq to ../downloads/Reads/GSM1885092_1.fastq
Renamed ../downloads/Reads/SRR2422928_1.fastq to ../downloads/Reads/GSM1885090_1.fastq
Renamed ../downloads/Reads/SRR2422934_2.fastq to ../downloads/Reads/GSM1885096_2.fastq
Renamed ../downloads/Reads/SRR2422918_1.fastq to ../downloads/Reads/GSM1885080_1.fastq
Renamed ../downloads/Reads/SRR2422918_2.fastq to ../downloads/Reads/GSM1885080_2.fastq
Renamed ../downloads/Reads/SRR2422932_2.fastq to ../downloads/Reads/GSM1885094_2.fastq
Renamed ../downloads/Reads/SRR2422929_1.fastq to ../downloads/Reads/GSM1885091_1.fastq
Renamed ../downloads/Reads/SRR2422931_1.fastq to ../downloads/Reads/GSM1885093_1.fastq
Renamed ../downloads/Reads/SRR2422919_2.fastq to ../downloads/Reads/GSM1885081_2.fastq
Renamed ../downloads/Reads/SRR2422924_2.fas

In [100]:
# save metadata
metadata_df.to_csv("../data/metadata.csv", index = False)