Step 1 - Import python libraries

In [1]:
import pandas as pd
import glob 
import os 

Step 2 - Load and read your data file (note: run steps 2 and 3 once for each sample file to convert them all to the pyTCR standardized format)

In [2]:
# Specify the path to your file in Google Drive or locally
filePath = "../data/samples/989003BW_TCRB.tsv"

targetFilename = os.path.basename(filePath)

df_samples = pd.read_table(filePath, low_memory=False, engine="c")

df_samples.head()

Unnamed: 0,sample_name,frequency,templates,amino_acid,rearrangement,v_resolved,d_resolved,j_resolved,hospitalized
0,989003BW_TCRB,0.006787,3907,CASSLDRETVYGYTF,AACGCCTTGGAGCTGGACGACTCGGCCCTGTATCTCTGTGCCAGCA...,TCRBV05-04*01,TCRBD02-01*02,TCRBJ01-02*01,True
1,989003BW_TCRB,0.006403,3686,CASSLTSGSLNEQFF,ACATCGGCCCAAAAGAACCCGACAGCTTTCTATCTCTGTGCCAGTA...,TCRBV19-01*01,TCRBD02-01,TCRBJ02-01*01,True
2,989003BW_TCRB,0.002458,1415,CASSQGYEQYF,AATCTTCACATCAATTCCCTGGAGCTTGGTGACTCTGCTGTGTATT...,TCRBV03-01/03-02*01,unknown,TCRBJ02-07*01,True
3,989003BW_TCRB,0.00206,1186,CSASDLGGRLDTQYF,ACCAGTGCCCATCCTGAAGACAGCAGCTTCTACATCTGCAGTGCTA...,TCRBV20-01*04,TCRBD02-01*02,TCRBJ02-03*01,True
4,989003BW_TCRB,0.001801,1037,CASSLVAGGFEQYF,GTGACATCGGCCCAAAAGAACCCGACAGCTTTCTATCTCTGTGCCA...,TCRBV19-01*01,TCRBD02-01*02,TCRBJ02-07*01,True


Step 3 - Convert data to the pyTCR standardized format:
| column | name | description                                    |
|--:|:---------|:------------------------------------------------|
| 1   | `sample`  | The name of the sample                       |
| 2   | `freq`    | The share of clonotypes in the sample        |
| 3   | `#count`  | The number of reads                          |
| 4   | `cdr3aa`  | CDR3 amino acid clonotype                    |
| 5   | `cdr3nt`  | CDR3 nucleotide                              |
| 6   | `v`       | V gene                                       |
| 7   | `d`       | D gene                                       |
| 8   | `j`       | J gene                                       |
| ... | optional fields | any other fields intended for your use |

- Modify `required_columns` below so that it lists the column names in your data that correspond to pyTCR's columns, in the same order as in the table above
- The following code will create a new `.csv` file with the correct pyTCR column names in the current directory, named after the input file with `.csv` appended (for example, `989003BW_TCRB.tsv.csv`)
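
The positional rename in the following cell only lines up if every listed column is actually present in the input, in the stated order. A minimal sketch of an explicit name-to-name mapping with an up-front check is given here; the helper name `to_pytcr_columns` and the toy data are illustrative assumptions, not part of pyTCR.

```python
import pandas as pd

# pyTCR standard names, in the order given in the table above
PYTCR_NAMES = ['sample', 'freq', '#count', 'cdr3aa', 'cdr3nt', 'v', 'd', 'j']


def to_pytcr_columns(df, required_columns, optional_columns=()):
    """Hypothetical helper: select and rename columns, failing loudly on missing ones."""
    if len(required_columns) != len(PYTCR_NAMES):
        raise ValueError(f"expected {len(PYTCR_NAMES)} required columns")
    wanted = list(required_columns) + list(optional_columns)
    missing = [c for c in wanted if c not in df.columns]
    if missing:
        raise KeyError(f"columns not found in input: {missing}")
    # Map each source name to its pyTCR name; optional columns keep their names
    mapping = dict(zip(required_columns, PYTCR_NAMES))
    return df[wanted].rename(columns=mapping)


# Toy input standing in for one sample file (values are made up)
toy = pd.DataFrame({
    'sample_name': ['s1'], 'frequency': [0.5], 'templates': [10],
    'amino_acid': ['CASSF'], 'rearrangement': ['TGTGCC'],
    'v_resolved': ['V1'], 'd_resolved': ['D1'], 'j_resolved': ['J1'],
    'hospitalized': [True], 'extra': [0],
})
required = ['sample_name', 'frequency', 'templates', 'amino_acid',
            'rearrangement', 'v_resolved', 'd_resolved', 'j_resolved']
converted = to_pytcr_columns(toy, required, ['hospitalized'])
print(list(converted.columns))
```

Mapping by name rather than by position means a reordered or incomplete input raises an error instead of producing mislabelled columns.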

In [3]:
# Enter the column names from your data that represent the required pyTCR columns
required_columns = [
'sample_name','frequency', 'templates',
'amino_acid', 'rearrangement', 'v_resolved' , 'd_resolved', 'j_resolved'
]

optional_columns = ['hospitalized']

# Select by label so that a missing column raises a KeyError instead of being silently dropped
df_new = df_samples[required_columns + optional_columns].copy()

# Rename the columns to pyTCR standard names
df_new.columns = [
'sample','freq', '#count', 'cdr3aa',
'cdr3nt', 'v', 'd', 'j'] + optional_columns

df_new.to_csv(f'./{targetFilename}.csv', na_rep='.', index=False)
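
Because steps 2 and 3 have to be repeated for every sample file, they can be wrapped in a loop. The following is a rough sketch under some assumptions: the function name `convert_all` and the `*.tsv` pattern are illustrative choices, and every input file is assumed to share the same column names. Output files follow the same naming as above (input filename with `.csv` appended).

```python
import glob
import os

import pandas as pd

PYTCR_NAMES = ['sample', 'freq', '#count', 'cdr3aa', 'cdr3nt', 'v', 'd', 'j']


def convert_all(input_dir, output_dir, required_columns, optional_columns=()):
    """Hypothetical helper: run steps 2 and 3 for every .tsv file in input_dir."""
    optional_columns = list(optional_columns)
    written = []
    for file_path in sorted(glob.glob(os.path.join(input_dir, '*.tsv'))):
        df = pd.read_table(file_path, low_memory=False)
        # Index with a list so that a missing column raises KeyError
        df_new = df[list(required_columns) + optional_columns].copy()
        df_new.columns = PYTCR_NAMES + optional_columns
        out_path = os.path.join(output_dir, os.path.basename(file_path) + '.csv')
        df_new.to_csv(out_path, na_rep='.', index=False)
        written.append(out_path)
    return written
```

With the variables defined above, a call such as `convert_all('../data/samples', '.', required_columns, optional_columns)` would write one converted `.csv` per sample into the current directory.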

Step 4 - Combine all sample files

Set the `sample` column of each `.csv` file in the current directory to that file's name (the part before the first dot), then concatenate all files into `combined_data.csv`
- This is useful when converting data from other formats that do not contain a column with a sample name

In [4]:
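# Collect the per-sample .csv files, skipping the combined output of any earlier run
globbed_files = [f for f in glob.glob("*.csv") if f != "combined_data.csv"]

data = []

for csv_path in globbed_files:
    dataframe = pd.read_csv(csv_path)
    # Use the filename up to the first dot as the sample name
    dataframe['sample'] = os.path.basename(csv_path).split('.')[0]
    data.append(dataframe)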

combined_data = pd.concat(data)
combined_data.to_csv("combined_data.csv", index=False)

# Reload the combined file with `sample` as the index to inspect the result
df = pd.read_csv("combined_data.csv", index_col=[0])

df

sample,freq,#count,cdr3aa,cdr3nt,v,d,j,hospitalized
989003BW_TCRB,0.006787,3907,CASSLDRETVYGYTF,AACGCCTTGGAGCTGGACGACTCGGCCCTGTATCTCTGTGCCAGCA...,TCRBV05-04*01,TCRBD02-01*02,TCRBJ01-02*01,True
989003BW_TCRB,0.006403,3686,CASSLTSGSLNEQFF,ACATCGGCCCAAAAGAACCCGACAGCTTTCTATCTCTGTGCCAGTA...,TCRBV19-01*01,TCRBD02-01,TCRBJ02-01*01,True
989003BW_TCRB,0.002458,1415,CASSQGYEQYF,AATCTTCACATCAATTCCCTGGAGCTTGGTGACTCTGCTGTGTATT...,TCRBV03-01/03-02*01,unknown,TCRBJ02-07*01,True
989003BW_TCRB,0.002060,1186,CSASDLGGRLDTQYF,ACCAGTGCCCATCCTGAAGACAGCAGCTTCTACATCTGCAGTGCTA...,TCRBV20-01*04,TCRBD02-01*02,TCRBJ02-03*01,True
989003BW_TCRB,0.001801,1037,CASSLVAGGFEQYF,GTGACATCGGCCCAAAAGAACCCGACAGCTTTCTATCTCTGTGCCA...,TCRBV19-01*01,TCRBD02-01*02,TCRBJ02-07*01,True
...,...,...,...,...,...,...,...,...
989003BW_TCRB,0.000002,1,CASSPVGSYQPQHF,ATCCAGCCCTCAGAACCCAGGGACTCAGCTGTGTACTTCTGTGCCA...,TCRBV12,unknown,TCRBJ01-05*01,True
989003BW_TCRB,0.000002,1,CASSETQGARGKLFF,CAGCCCTCAGAACCCAGGGACTCAGCTGTGTACTTCTGTGCCAGCA...,TCRBV12,TCRBD01-01*01,TCRBJ01-04*01,True
989003BW_TCRB,0.000002,1,CSVGDRVVGYTF,CTGACTGTGAGCAACATGAGCCCTGAAGACAGCAGCATATATCTCT...,TCRBV29-01,TCRBD01-01*01,TCRBJ01-02*01,True
989003BW_TCRB,0.000002,1,CASTTEGRVYYGCTF,AACGCCTTGTTGCTGGGGGACTCGGCCCTGTATCTCTGTGCCAGCA...,TCRBV05-05*01,unknown,TCRBJ01-02*01,True
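
For files in the current directory, the sample name above is everything before the first dot in the filename, so `989003BW_TCRB.tsv.csv` yields `989003BW_TCRB`, but a name that itself contains a dot would be cut short. A small sketch of this rule next to one possible alternative that strips only known extensions; the helper names and the suffix list are assumptions for illustration.

```python
import os


def sample_from_first_dot(path):
    """Rule used in the loop above: keep the filename up to the first dot."""
    return os.path.basename(path).split('.')[0]


def sample_from_known_suffixes(path, suffixes=('.csv', '.tsv', '.txt')):
    """Alternative (assumed suffix list): strip only recognised trailing extensions."""
    name = os.path.basename(path)
    stripped = True
    while stripped:
        stripped = False
        for suffix in suffixes:
            if name.endswith(suffix):
                name = name[:-len(suffix)]
                stripped = True
    return name


print(sample_from_first_dot('989003BW_TCRB.tsv.csv'))         # 989003BW_TCRB
print(sample_from_known_suffixes('989003BW_TCRB.tsv.csv'))    # 989003BW_TCRB
print(sample_from_first_dot('patient.v2_TCRB.tsv.csv'))       # patient
print(sample_from_known_suffixes('patient.v2_TCRB.tsv.csv'))  # patient.v2_TCRB
```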


Convert .csv file to .tsv file

In [5]:
targetFileExtension = 'tsv'

df = pd.read_csv("combined_data.csv", low_memory=False, engine="c")

# Strip the extension to get the base name for the new file
file = os.path.splitext("combined_data.csv")[0]

newFile = f'{file}.{targetFileExtension}'

# Save new file to current directory
df.to_csv(newFile, sep='\t', na_rep='.', index=False)
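
Both output files write missing values as `.` (via `na_rep='.'`). When reading, pandas does not treat a lone `.` as missing by default, so code that loads the file back may want to pass `na_values` explicitly. A minimal sketch, with a made-up toy frame and file name:

```python
import os
import tempfile

import pandas as pd

# Toy frame with one missing D gene (values are made up)
toy = pd.DataFrame({'sample': ['s1', 's1'], 'd': ['TCRBD01-01*01', None]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'toy.tsv')
    toy.to_csv(path, sep='\t', na_rep='.', index=False)

    as_text = pd.read_csv(path, sep='\t')                      # '.' kept as a string
    as_missing = pd.read_csv(path, sep='\t', na_values=['.'])  # '.' parsed as NaN

print(as_text['d'].tolist())            # ['TCRBD01-01*01', '.']
print(as_missing['d'].isna().tolist())  # [False, True]
```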