#### Create Public/Private label in dataset
This notebook creates the Private/Public label in the dataset.
The label is created by counting the number of samples in which each sequence appears.
If the sequence appears in more than one sample, it is labeled as Public (1); otherwise it is labeled as Private (0).
The threshold for Public can be changed in the function `label_public_private`.
The output file is saved in the same directory as the input file.



In [1]:
cd ..

/home/ubuntu/CVC


#### Set Environment

In [7]:
%load_ext autoreload
%autoreload 2

In [10]:
import pandas as pd
import collections
from lab_notebooks.utils import TRANSFORMER, DEVICE, DATA_DIR

#### Load and Prepare Data

In [3]:
# edit according to the desired file
data_file_name = "data_file_name.csv"

In [None]:
# load data
data_dir = DATA_DIR + data_file_name
tcrb_data = pd.read_csv(data_dir, engine="pyarrow", index_col=0)
tcrb_data

In [89]:
# edit column names
tcrb_data.rename(columns={'amino_acid': 'aaSeqCDR3'}, inplace=True)
tcrb_data.rename(columns={'sample': 'SampleName'}, inplace=True)

# drop duplicate rows
tcrb_data_no_duplicates = tcrb_data.drop_duplicates()

In [93]:
# count the number of times each aaSeqCDR3 appears; each sequence appears once in each sample,
# so repeated occurrences come from different samples
num_of_occurances_df = tcrb_data_no_duplicates['aaSeqCDR3'].value_counts().to_frame()
len(num_of_occurances_df.index)
num_of_occurances_df.head()

aaSeqCDR3,count
CASSLGETQYF,100
CASSLGDTQYF,100
CASSPSTDTQYF,97
CASSLGYEQYF,97
CASSLGGYEQYF,96
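
The counting step above can be illustrated on a toy table. This is a minimal sketch, not the notebook's data: the rows and the names `toy`, `toy_counts` and `toy_nunique` are hypothetical. It assumes the table holds only the sequence and sample columns; if the real table has further columns that differ within one sample, `value_counts()` would count those rows separately, and counting distinct samples per sequence would be the safer choice.

```python
import pandas as pd

# Toy data (hypothetical): one row per (sequence, sample) observation,
# including one exact duplicate row.
toy = pd.DataFrame({
    "aaSeqCDR3": ["CASSLGETQYF", "CASSLGETQYF", "CASSLGETQYF", "CASSPSTDTQYF"],
    "SampleName": ["s1", "s1", "s2", "s1"],
})

# After dropping exact duplicates, each (sequence, sample) pair appears once,
# so value_counts() gives the number of samples per sequence.
toy_counts = toy.drop_duplicates()["aaSeqCDR3"].value_counts()

# Cross-check by counting distinct samples per sequence directly.
toy_nunique = toy.groupby("aaSeqCDR3")["SampleName"].nunique()
```

On this toy table both counts agree, which is the property the notebook's comment relies on.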


In [None]:
num_of_occurances_df.index.name = 'Sequences'
num_of_occurances_df.reset_index(inplace=True)
num_of_occurances_df = num_of_occurances_df.rename(columns={'count': 'Appearances'})
num_of_occurances_df

#### Label as Public/Private

In [103]:
# if the number of appearances is larger than the threshold (default 1), the sequence is public (1),
# otherwise it's private (0); change the default threshold (e.g. 10, 50, 100) if desired
def label_public_private(row, threshold=1):
    if row['Appearances'] > threshold:
        return 1
    return 0

In [None]:
num_of_occurances_df['Private_Public_label'] = \
    num_of_occurances_df.apply(label_public_private, axis=1)
num_of_occurances_df
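
A row-wise `apply` over millions of rows can be slow. As a possible alternative, the same label can be computed with a vectorized comparison. The sketch below uses a small hypothetical frame (`toy_counts_df`) and a hypothetical `threshold` variable; it is not the notebook's original method.

```python
import pandas as pd

# Toy stand-in for the counts table (hypothetical values).
toy_counts_df = pd.DataFrame({
    "Sequences": ["CASSLGETQYF", "CASSLGDTQYF", "CASSPSTDTQYF"],
    "Appearances": [100, 1, 2],
})

# Hypothetical threshold: sequences with more appearances than this are public.
threshold = 1

# The boolean comparison runs on the whole column at once;
# astype(int) maps True/False to 1/0 (public/private).
toy_counts_df["Private_Public_label"] = (
    toy_counts_df["Appearances"] > threshold
).astype(int)
```

The result matches the row-wise function for the same threshold, so the two approaches should be interchangeable.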

In [105]:
collections.Counter(num_of_occurances_df['Private_Public_label'])

Counter({1: 1518211, 0: 10494911})

##### Merge Dataframes

In [108]:
tcrb_data_no_duplicates = tcrb_data_no_duplicates.rename(columns={'aaSeqCDR3': 'Sequences'})
# merge the two dataframes num_of_occurances_df and tcrb_data_no_duplicates on Sequences
merged_df = pd.merge(num_of_occurances_df, tcrb_data_no_duplicates, on='Sequences')
# remove SampleName column from merged_df and drop duplicates
merged_df_no_duplicates = merged_df.drop(['SampleName'], axis=1)
merged_df_no_duplicates = merged_df_no_duplicates.drop_duplicates()
len(merged_df_no_duplicates.index)

12026806
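
The merged row count (12,026,806) is larger than the number of labeled sequences reported by the `Counter` output (1,518,211 + 10,494,911 = 12,013,122). If both outputs come from the same run, this suggests that some sequences keep more than one distinct combination of the remaining columns after `SampleName` is dropped. The sketch below shows one way such sequences could be found; the toy frame and the `j_gene` column name are hypothetical.

```python
import pandas as pd

# Toy stand-in for the merged table; "j_gene" is a hypothetical extra column.
toy_merged = pd.DataFrame({
    "Sequences": ["CASSLGETQYF", "CASSLGETQYF", "CASSPSTDTQYF"],
    "Appearances": [2, 2, 1],
    "Private_Public_label": [1, 1, 0],
    "j_gene": ["TRBJ2-5", "TRBJ2-3", "TRBJ2-3"],
}).drop_duplicates()

# Number of rows that remain for each sequence.
rows_per_sequence = toy_merged.groupby("Sequences").size()

# Sequences that still occupy more than one row.
repeated = rows_per_sequence[rows_per_sequence > 1]

# Rows beyond one per unique sequence.
n_extra_rows = len(toy_merged) - toy_merged["Sequences"].nunique()
```

Whether such repeated sequences should be kept as separate rows depends on how the exported file is used downstream.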

#### Export to csv

In [125]:
# output to csv
output_path = "data/12M_j_gene_pub_priv_label.csv"
merged_df_no_duplicates.to_csv(output_path)