# Data analysis of the Emerson repertoires

This notebook shows how to use 'emerson_data_analysis.py' to read and process the original Emerson datasets.

The notebook contains:
1. Filtering of the datasets
2. Extraction of the top clones

## Imports

- _numpy_ for math operations

- _pandas_ for DataFrame operations

- _emerson_data_analysis_ for the operations on the original datasets

In [1]:
import numpy as np
import pandas as pd
import emerson_data_analysis as eda

## Directories

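The directories depend on the user's machine. We need the following two:
- read_path : where the original datasets are stored
- save_path : where the filtered datasets are written

The cells below build file paths by string concatenation, so both paths must end with a trailing slash.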

In [2]:
read_path = '/home/pablo/Documentos/internship/emerson/RAW/'
save_path = '/home/pablo/Documentos/internship/emerson/files_red_bis/'

## Loading data

We need the following file:

- file_patients : .txt file listing the dataset file names of all the patients, one name per line. It must be stored in at least one of the two dataset directories (ideally in both). The cell below reads it from save_path.
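If the list file does not exist yet, it could be generated from the contents of a dataset directory. The following is a minimal sketch, not part of 'emerson_data_analysis.py': the helper name `write_patient_list` and its default arguments are assumptions made for illustration, and it simply writes every file name matching `*.tsv.gz` to the list, one per line.

```python
from pathlib import Path


def write_patient_list(data_dir, list_name="reduced_patients.txt", pattern="*.tsv.gz"):
    """Hypothetical helper: list the dataset files found in data_dir.

    Writes one file name per line to data_dir/list_name and returns the names.
    """
    data_dir = Path(data_dir)
    # Sort so that the order is reproducible across runs and machines.
    names = sorted(p.name for p in data_dir.glob(pattern))
    (data_dir / list_name).write_text("\n".join(names) + "\n")
    return names
```

A file written this way has no header row, which matches the `names=['file_patients']` argument passed to `pd.read_csv` in the next cell.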

In [3]:
file_patients = 'reduced_patients.txt'
frame_patients = pd.read_csv(save_path + file_patients, names=['file_patients'])

In [4]:
frame_patients

Unnamed: 0,file_patients
0,HIP00110.tsv.gz
1,HIP00169.tsv.gz
2,HIP00594.tsv.gz
3,HIP00602.tsv.gz
4,HIP00614.tsv.gz
...,...
627,HIP17887.tsv.gz
628,HIP19048.tsv.gz
629,HIP19089.tsv.gz
630,HIP19716.tsv.gz


## 1. Filtering of the datasets

The function 'reduce_data' filters an original dataset and saves the result in save_path.

We provide an example for n = 1 patient. The code extends to more patients by increasing n.

### A. Original dataset

In [5]:
original_patient = pd.read_csv(read_path + frame_patients.file_patients[0], sep='\t', low_memory=False)
original_patient

Unnamed: 0,nucleotide,aminoAcid,count (templates/reads),frequencyCount (%),cdr3Length,vMaxResolved,vFamilyName,vGeneName,vGeneAllele,vFamilyTies,...,jOrphon,vFunction,dFunction,jFunction,fractionNucleated,vAlignLength,vAlignSubstitutionCount,vAlignSubstitutionIndexes,vAlignSubstitutionGeneThreePrimeIndexes,vSeqWithMutations
0,ACTCTGACTGTGAGCAACATGAGCCCTGAAGACAGCAGCATATATC...,CSVEESYEQYF,10,7.661348e-05,33,TCRBV29-01*01,TCRBV29,TCRBV29-01,1.0,,...,,,,,,,,,,
1,GAATGTGAGCACCTTGGAGCTGGGGGACTCGGCCCTTTATCTTTGC...,,3088,1.919402e-02,38,TCRBV05-01*01,TCRBV05,TCRBV05-01,1.0,,...,,,,,,,,,,
2,GCTACCAGCTCCCAGACATCTGTGTACTTCTGTGCCACCACGGGTA...,CATTGTSGGPSQSTQYF,1772,1.094612e-02,51,TCRBV10-03*01,TCRBV10,TCRBV10-03,1.0,,...,,,,,,,,,,
3,ATCCAGCGCACAGAGCAGGGGGACTCGGCCATGTATCTCTGTGCCA...,CASSLRVGGYGYTF,1763,1.084118e-02,42,TCRBV07-09,TCRBV07,TCRBV07-09,,,...,,,,,,,,,,
4,TGCAGCAAGAAGACTCAGCTGCGTATCTCTGCACCAGCAGCCAAGG...,,1241,7.660116e-03,52,TCRBV01-01*01,TCRBV01,TCRBV01-01,1.0,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
130935,AAACCTGAGCTCTCTGGAGCTGGGGGACTCAGCTTTGTATTTCTGT...,,1,4.926912e-07,38,TCRBV09-01,TCRBV09,TCRBV09-01,,,...,,,,,,,,,,
130936,AAACCTGAGCTCTCTGGAGCTGGGGGACTCAGCTTTGTACTTCTGT...,,1,4.926912e-07,38,TCRBV09-01,TCRBV09,TCRBV09-01,,,...,,,,,,,,,,
130937,AAACCTGAGCTCTCTGGAGCTGGGGACTCAGCTTTGTATTTCTGTG...,CASSVASNTEAFF,1,4.926912e-07,39,TCRBV09-01,TCRBV09,TCRBV09-01,,,...,,,,,,,,,,
130938,AAACCCGAGCTCTCTGGAGCTGGGGGACTCAGCTTTGTATTTCTGT...,,1,4.926912e-07,38,TCRBV09-01,TCRBV09,TCRBV09-01,,,...,,,,,,,,,,


### B. Filtered dataset

In [6]:
n = 1  # number of patients to process; here just one

for j in range(n):
    eda.reduce_data(frame_patients.file_patients[j], read_path, save_path)

In [7]:
reduced_patient = pd.read_csv(save_path + frame_patients.file_patients[0], sep='\t', low_memory=False)
reduced_patient

Unnamed: 0,aa,count,frequency
0,ADYSNQPQHF,1,2.778731e-06
1,AEDGIRVGGYGYTF,40,4.662092e-05
2,AEGGIRDSTDTQYF,2,1.234991e-06
3,AEGGIRVGGYGYTF,63,5.186964e-05
4,AEGGLRVGGYGYTF,15,9.262435e-06
...,...,...,...
96987,YASTYRTQDSPLHF,1,1.234991e-06
96988,YASVGLAIAGIRETQYF,1,6.174957e-07
96989,YSARVTGSNEQYF,1,6.174957e-07
96990,YSEATDVRIEAFF,1,4.322470e-06
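
The internals of 'reduce_data' are not shown in this notebook. Judging only from the columns of the output above (aa, count, frequency), a filter of this kind could be sketched as follows. Everything in this block is an assumption: the function name `reduce_frame_sketch`, the rule of dropping rows without an amino-acid sequence, the merging of identical sequences, and the way the frequency is recomputed. The real function may apply different or additional criteria, and its frequency column need not follow this rule.

```python
import pandas as pd


def reduce_frame_sketch(raw):
    """Illustrative stand-in for a filtering step; NOT the code of eda.reduce_data."""
    count_col = "count (templates/reads)"
    # Assumption: rows without an amino-acid sequence are discarded.
    kept = raw.dropna(subset=["aminoAcid"])
    # Assumption: rows sharing an amino-acid sequence are merged by summing counts.
    # groupby sorts its keys, so the result is ordered alphabetically by sequence.
    reduced = (
        kept.groupby("aminoAcid", as_index=False)[count_col]
        .sum()
        .rename(columns={"aminoAcid": "aa", count_col: "count"})
    )
    # Assumption: frequency is the share of each clone in the retained total.
    reduced["frequency"] = reduced["count"] / reduced["count"].sum()
    return reduced
```

Applied to `original_patient`, this returns a three-column DataFrame with the same layout as `reduced_patient`, although the values are not expected to match those produced by 'reduce_data'.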


## 2. Top clones data

The function 'top_clone_creator' gathers the first L top clones.

We use L = 10 top clones. To cover the whole repertoire, use L = len(frame_patients).

In [8]:
L = 10
top_clones = eda.top_clone_creator(save_path, frame_patients, L)

In [9]:
del top_clones['special']  # drop the 'special' column before displaying the table
top_clones

Unnamed: 0,aa,count,frequency
81997,CATTGTSGGPSQSTQYF,1805,0.013764
25960,CASSKLTGDSGANVLTF,3344,0.026423
105106,CASSSTLMNTEAFF,5168,0.02329
120776,CASSRWVQGNTEAFF,1527,0.006771
57321,CASSRRANEQFF,17917,0.011298
127065,CASSSRENTEAFF,1031,0.006054
87794,CASSQDPLRRGSGNTIYF,25260,0.073578
52100,CASSYSPDRLSGGYTF,20966,0.029973
92938,CASSLGTSGRLGTDTQYF,1838,0.008629
15275,CASSEGMYTEAFF,1697,0.019845
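
The notebook does not show how 'top_clone_creator' selects its rows. As a rough illustration, and assuming that a top clone is the row with the highest count in one filtered dataset, the selection could be sketched as below. The helper name `top_clones_sketch` is hypothetical and the block is not the implementation used by 'emerson_data_analysis.py'.

```python
import pandas as pd


def top_clones_sketch(frames):
    """Illustrative only: collect the highest-count row of each filtered DataFrame."""
    # idxmax returns the index label of the largest count; the double brackets
    # keep the selection as a one-row DataFrame, so the original label survives.
    tops = [df.loc[[df["count"].idxmax()]] for df in frames]
    return pd.concat(tops)
```

Here `frames` would be a list of filtered datasets, each loaded with `pd.read_csv(save_path + name, sep='\t')` as in section 1.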
