<font size="8">HLA-A*02:01 selection, shuffled dataset</font>

Create a combined DataFrame with all structures, then select the 9-mers bound to HLA-A*02:01.

In [7]:
import pandas as pd

def load_and_preprocess_dataframe(df):
    # Keep 9-mer peptides for HLA-A*02:01; .copy() avoids pandas' SettingWithCopyWarning
    df = df.loc[(df.peptide.str.len() == 9) & (df.allele == "HLA-A*02:01")].copy()
    # Label an entry as a binder (1) when its measurement value is below 500
    df["binder"] = df.measurement_value.apply(lambda x: int(x < 500))
    return df

train_val = pd.read_csv(r'C:\Users\gijst\vscode\3DVac\Cluster\BA_pMHCI_human_quantitative_only_eq_shuffled_train_validation.csv')
test = pd.read_csv(r'C:\Users\gijst\vscode\3DVac\Cluster\BA_pMHCI_human_quantitative_only_eq_shuffled_test.csv')

hla_all = pd.concat([train_val, test], ignore_index=True)
print(f"Total number of HLA entries: {len(hla_all)}")
hla_a2 = load_and_preprocess_dataframe(hla_all)
hla_a2.to_csv("y:/data/hla_a_02_01.csv", index=False)
print(f"Total number of HLA-A*02:01 9-mers: {len(hla_a2)}")

Total number of HLA entries: 100206
Total number of HLA-A*02:01 9-mers: 8356
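
As a quick sanity check of the selection and labelling logic, the sketch below applies the same filter to a small hand-made frame. The rows and the helper name `select_and_label` are hypothetical; the column names (`peptide`, `allele`, `measurement_value`) and the cutoff of 500 are the ones used in the cell above.

```python
import pandas as pd

def select_and_label(df, allele="HLA-A*02:01", length=9, threshold=500):
    """Keep peptides of the given length and allele, then add a 0/1 binder column."""
    mask = (df["peptide"].str.len() == length) & (df["allele"] == allele)
    selected = df.loc[mask].copy()  # explicit copy, so adding a column is safe
    selected["binder"] = (selected["measurement_value"] < threshold).astype(int)
    return selected

# Hypothetical toy rows: one strong 9-mer, one weak 9-mer, one 8-mer, one other allele
toy = pd.DataFrame({
    "peptide": ["A" * 9, "C" * 9, "D" * 8, "E" * 9],
    "allele": ["HLA-A*02:01", "HLA-A*02:01", "HLA-A*02:01", "HLA-B*07:02"],
    "measurement_value": [12.0, 5000.0, 30.0, 40.0],
})

labelled = select_and_label(toy)
print(labelled[["peptide", "measurement_value", "binder"]])
```

Only the first two rows survive the filter, and only the first of them is labelled as a binder.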


Remove the IDs for which the structure couldn't be created

I've checked separately why these structures couldn't be created: in some of them the valence of certain atoms is greater than permitted. As a result, 131 structures are not taken into account.

In [5]:
import h5py

subset = hla_a2["ID"].tolist()
h5_path = r"Y:\data/proteins.hdf5"
with h5py.File(h5_path, "r") as h5_f:
    # IDs of the modelled cases, stored as bytes in the HDF5 file
    modelled_ids = {modelled_id.decode("utf8") for modelled_id in h5_f["ids"][:]}
# Selected IDs that have no modelled structure
missing_ids = [i for i in subset if i not in modelled_ids]
print(f"Number of IDs for which no structure could be created: {len(missing_ids)}")

hla_a2_filtered = hla_a2[~hla_a2['ID'].isin(missing_ids)]
print(f"Number of HLA-A*02:01 entries with a structure: {len(hla_a2_filtered)}")

Number of IDs for which no structure could be created: 131
Number of HLA-A*02:01 entries with a structure: 8225
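
The bookkeeping of this step can be checked without the HDF5 file. The sketch below uses made-up IDs and a hypothetical helper, `partition_ids`, to show the invariant that the kept and missing IDs together account for every selected ID (here 8356 = 8225 + 131).

```python
def partition_ids(selected_ids, modelled_ids):
    """Split selected IDs into those with and without a modelled structure."""
    modelled = set(modelled_ids)  # set lookup instead of scanning a list
    kept = [i for i in selected_ids if i in modelled]
    missing = [i for i in selected_ids if i not in modelled]
    return kept, missing

# Hypothetical IDs; the modelled ones are bytes, as when read from an HDF5 dataset
selected = ["BA-1", "BA-2", "BA-3", "BA-4"]
modelled_raw = [b"BA-1", b"BA-3", b"BA-9"]
modelled = [m.decode("utf8") for m in modelled_raw]

kept, missing = partition_ids(selected, modelled)
assert len(kept) + len(missing) == len(selected)
print(kept, missing)
```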


Now that a DataFrame exists with all the structures I've been using, I need to create a DataFrame for each of the experiments.

Split used for shuffled dataset

In [14]:
from sklearn.model_selection import train_test_split

def load_and_preprocess_dataframe2(df):
    # .copy() avoids pandas' SettingWithCopyWarning when the "binder" column is added
    df = df.loc[(df.peptide.str.len() == 9) & (df.allele == "HLA-A*02:01")].copy()
    ids = df.ID.tolist()
    df["binder"] = df.measurement_value.apply(lambda x: int(x < 500))
    return df, ids

def split_data(df, ids):
    # 80/20 split, stratified on the binder label and reproducible via random_state
    training_ids, validation_ids = train_test_split(ids, test_size=0.2, stratify=df.binder, random_state=1)
    print(f"len of train_ids: {len(training_ids)}")
    print(f"len of validation_ids: {len(validation_ids)}")
    return training_ids, validation_ids

train_val_df, train_val_ids = load_and_preprocess_dataframe2(train_val)
training_ids, validation_ids = split_data(train_val_df, train_val_ids)  # split once, reuse both parts

shuffled = {"train": training_ids,
            "val": validation_ids,
            "test": load_and_preprocess_dataframe2(test)[1]}

# Turn the dictionary into a long-format DataFrame with two columns: datatype, ID
data_list = [{'datatype': key, 'ID': val} for key, values in shuffled.items() for val in values]
df = pd.DataFrame(data_list)

len of train_ids: 6015
len of validation_ids: 1504
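
To illustrate what the stratified split guarantees, the sketch below runs `train_test_split` on made-up, perfectly balanced data. The IDs and labels are hypothetical; the arguments (`test_size=0.2`, `stratify`, `random_state=1`) mirror the ones used in `split_data`.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 IDs with alternating 0/1 labels (50 of each class)
ids = [f"case-{n}" for n in range(100)]
labels = [n % 2 for n in range(100)]
label_of = dict(zip(ids, labels))

train_ids, val_ids = train_test_split(ids, test_size=0.2, stratify=labels, random_state=1)

# Stratification keeps the class ratio the same in both parts
train_counts = Counter(label_of[i] for i in train_ids)
val_counts = Counter(label_of[i] for i in val_ids)
print(train_counts, val_counts)

# A fixed integer random_state makes the split reproducible across calls
train_again, val_again = train_test_split(ids, test_size=0.2, stratify=labels, random_state=1)
print(train_again == train_ids and val_again == val_ids)
```

Because the split is reproducible for a fixed `random_state`, calling the split function more than once with the same inputs returns the same IDs.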


Now that the original IDs are known, I can create the DataFrame that contains only the IDs with a structure.

In [17]:
# Create a boolean mask
mask = ~df['ID'].isin(missing_ids)

# Apply the mask to the DataFrame
df_filtered = df[mask]
print(len(df_filtered))
df_filtered.to_csv('Y:/data/Shuffled_dataset_IDs_datatype.csv', index=False)

8225
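
A check that fits at this point is confirming that, after filtering, every ID belongs to exactly one split. The sketch below shows the idea on a made-up dictionary of splits. All IDs are hypothetical; only the `datatype` and `ID` column names come from the cells above.

```python
import pandas as pd

# Hypothetical split dictionary and list of IDs without a structure
splits = {"train": ["id1", "id2", "id3"], "val": ["id4"], "test": ["id5", "id6"]}
missing = ["id2", "id6"]

# Long format: one row per (datatype, ID) pair
long_df = pd.DataFrame(
    [{"datatype": name, "ID": i} for name, ids in splits.items() for i in ids]
)

# Drop the IDs without a structure
filtered = long_df[~long_df["ID"].isin(missing)]

# No ID should occur in more than one split
assert filtered["ID"].is_unique
split_sizes = filtered["datatype"].value_counts().to_dict()
print(split_sizes)
```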


Add a column that tells whether each ID is a binder, then count the binders and non-binders per split

In [21]:
merged_df = pd.merge(df_filtered, hla_a2, on='ID', how='inner')
result_df = merged_df.groupby(['datatype'])['binder'].agg(binders='sum', non_binders=lambda x: len(x) - sum(x))
print(len(merged_df))
# Reset index to make 'datatype' a column
result_df = result_df.reset_index()
result_df.to_csv('Y:/data/Shuffled_dataset_binder_non_binder.csv', index=False)
# Print the result DataFrame
print(result_df)

8225
3
  datatype  binders  non_binders
0     test      404          418
1    train     2948         2978
2      val      735          742
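
The per-split counts can also be cross-checked with `pd.crosstab`, which tabulates two columns against each other. The sketch below uses a hypothetical toy frame with the same `datatype` and `binder` columns; it is an alternative view of the same counts, not the aggregation used above.

```python
import pandas as pd

# Hypothetical toy data with the same two columns as merged_df
toy = pd.DataFrame({
    "datatype": ["train", "train", "train", "val", "test", "test"],
    "binder": [1, 0, 1, 1, 0, 0],
})

# Rows are splits, columns are the binder label; absent combinations count as 0
counts = pd.crosstab(toy["datatype"], toy["binder"]).rename(
    columns={0: "non_binders", 1: "binders"}
)
print(counts)
```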
