## `filter-data.ipynb`

This Jupyter notebook produces three .csv files (plus two variants that also exclude lesional samples from psoriasis patients) in which:
- information on non-sequenced samples is discarded
- patients without any sequenced sample are discarded as well

The .csv files are written once the entire notebook has run. To run it, click `Kernel` $\rightarrow$ `Restart & Run All` in the menu above.
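
The two filtering rules listed above can be sketched on toy data. This is a minimal illustration, not the notebook's actual code: the tables, the IDs, and the simplified column name `patient_id` are hypothetical stand-ins for the real MAARS columns.

```python
import pandas as pd

# Hypothetical toy tables; 'patient_id' is a simplified stand-in column name.
samples = pd.DataFrame({
    'sample_id': ['S1', 'S2', 'S3', 'S4'],
    'patient_id': ['P1', 'P1', 'P2', 'P3'],
})
patients = pd.DataFrame({'patient_id': ['P1', 'P2', 'P3']})
sequenced_ids = ['S1', 'S4']  # sample IDs present in the expression matrix

# Rule 1: keep only the samples that were sequenced.
kept_samples = samples[samples['sample_id'].isin(sequenced_ids)]

# Rule 2: keep only the patients that still have at least one sample.
kept_patients = patients[patients['patient_id'].isin(kept_samples['patient_id'])]

print(kept_samples['sample_id'].tolist())
print(kept_patients['patient_id'].tolist())
```

In this toy case patient `P2` is dropped because its only sample, `S3`, was not sequenced.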

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import os, time

%matplotlib inline
%config InlineBackend.figure_format = 'retina'
plt.style.use('seaborn-darkgrid')  # named 'seaborn-v0_8-darkgrid' in matplotlib >= 3.6

#### Creating the `filter-data` folder

In [2]:
try:
    os.mkdir('../filter-data')
    print('Folder \'filter-data\' created!')
except FileExistsError as e:
    print('Folder already exists!')

Folder already exists!


#### Loading data

- `data1`: sample characteristics
- `data2`: patient information
- `norm_data`: transcriptome gene expression

In [3]:
data1 = pd.read_csv('../data/M2PHDS_19-20_OMICS_CLIN_DATA_MAARS_all_Fri_Apr_04_14h_CEST_2014.csv', sep='\t')
data2 = pd.read_csv('../data/M2PHDS_19-20_OMICS_CLIN_DATA_MAARS_AD_full_20190131_12-34-49.csv', sep='\t')
norm_data = pd.read_csv('../data/M2PHDS_19-20_OMICS_TRANSC_MAARS_normTranscriptome_618samples_16042014.txt', sep='\t')

Verify that every sequenced sample (each column of `norm_data`) has an entry in the `data1` DataFrame.

In [4]:
all(s in data1['sample_id'].to_list() for s in list(norm_data))

True
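
If the check ever returned `False`, it would help to know which sample IDs are missing. A minimal sketch of one way to list them, using hypothetical toy frames (`toy_norm`, `toy_data1`) in place of `norm_data` and `data1`:

```python
import pandas as pd

# Hypothetical toy data standing in for `norm_data` and `data1`.
toy_norm = pd.DataFrame({'S1': [0.1, 0.2], 'S2': [0.3, 0.4], 'S9': [0.5, 0.6]})
toy_data1 = pd.DataFrame({'sample_id': ['S1', 'S2', 'S3']})

# Columns of the expression matrix that have no row in the sample table.
missing = toy_norm.columns.difference(toy_data1['sample_id'])

print(list(missing))   # IDs without sample characteristics
print(missing.empty)   # True only when every column is covered
```

Here `S9` is reported as missing, so `missing.empty` is `False`.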

In [5]:
# Keep only the samples whose ID appears as a column of the expression matrix
sequenced_samples = data1[data1['sample_id'].isin(norm_data.columns)]
print('Proportion of samples that have been sequenced: {}%'.format(np.around(len(sequenced_samples)*100/len(data1), 2)))

Proportion of samples that have been sequenced: 46.92%


#### Generating new files

In [6]:
# Keep only the patients that have at least one sequenced sample
sequenced_patients = data2[data2['patient#Identification#MAARS identifier (MAARS_identifier)'].isin(sequenced_samples['MAARS_identifier'])]

In [7]:
sequenced_samples.to_csv('../filter-data/all.csv', index=False)
sequenced_patients.to_csv('../filter-data/ad_full.csv', index=False)
# `index` is a boolean flag: True writes the row labels (sample IDs) as the first column
norm_data.T.to_csv('../filter-data/transcriptome.csv', index=True)
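
In `DataFrame.to_csv`, `index` is a boolean flag that controls whether the row labels are written; it does not accept replacement labels. If shortened labels are wanted in the output, they have to be changed on the DataFrame before writing. The sketch below shows this on a hypothetical toy matrix; the `_at` suffix and the idea of stripping the last three characters of the gene labels are assumptions made for illustration only.

```python
import io
import pandas as pd

# Hypothetical toy matrix: rows are genes, columns are samples.
toy = pd.DataFrame({'S1': [1.0, 2.0], 'S2': [3.0, 4.0]}, index=['g1_at', 'g2_at'])

# index=True simply writes the current row labels (here the sample IDs).
buf = io.StringIO()
toy.T.to_csv(buf, index=True)
print(buf.getvalue())

# To change labels, rename them first. After transposing, the gene labels
# are the columns, so the 3-character suffix is stripped from the columns.
relabelled = toy.T.rename(columns=lambda s: s[:-3])
buf2 = io.StringIO()
relabelled.to_csv(buf2, index=True)
print(buf2.getvalue())
```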

Discard lesional samples from patients with psoriasis. The remaining samples are written to `transcriptome2.csv` and `all2.csv`.

In [8]:
# Flag lesional (LES) samples from psoriasis (PSO) patients, then keep all the others
is_pso_les = (sequenced_samples['clinical_group'] == 'PSO') & (sequenced_samples['lesional'] == 'LES')
not_pso_les = sequenced_samples.loc[~is_pso_les, 'sample_id']
norm_data2 = norm_data.loc[:, norm_data.columns.isin(not_pso_les)]
# `index` is a boolean flag: True writes the row labels (sample IDs) as the first column
norm_data2.T.to_csv('../filter-data/transcriptome2.csv', index=True)

In [9]:
# Same exclusion rule as above, applied to the sample characteristics table
sequenced_samples[~((sequenced_samples['clinical_group'] == 'PSO') & (sequenced_samples['lesional'] == 'LES'))].to_csv('../filter-data/all2.csv', index=False)
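
The exclusion rule used in the last two cells can be checked on toy data, to confirm that the sample table and the expression matrix keep the same set of samples. This is a minimal sketch: the frames `meta` and `expr` are hypothetical, and the values `'AD'`, `'CTRL'` and `'NON_LES'` are assumed placeholders (only `'PSO'` and `'LES'` appear in the notebook).

```python
import pandas as pd

# Hypothetical sample table; 'AD', 'CTRL' and 'NON_LES' are assumed values.
meta = pd.DataFrame({
    'sample_id': ['S1', 'S2', 'S3', 'S4'],
    'clinical_group': ['PSO', 'PSO', 'AD', 'CTRL'],
    'lesional': ['LES', 'NON_LES', 'LES', 'NON_LES'],
})
# Hypothetical expression matrix: one gene, one column per sample.
expr = pd.DataFrame({'S1': [1.0], 'S2': [2.0], 'S3': [3.0], 'S4': [4.0]}, index=['g1'])

# Flag lesional samples from psoriasis patients and drop them from both tables.
is_pso_les = (meta['clinical_group'] == 'PSO') & (meta['lesional'] == 'LES')
meta_kept = meta[~is_pso_les]
expr_kept = expr.loc[:, expr.columns.isin(meta_kept['sample_id'])]

print(meta_kept['sample_id'].tolist())
print(expr_kept.columns.tolist())
```

Only `S1` (psoriasis and lesional) is removed; the lesional `AD` sample and the non-lesional psoriasis sample are both kept.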