## MAIN

Here, I will walk you through the first five steps of the pipeline:

Step 1.  01_get_contact_matrix/run.py <br>
Step 2.  02_apply_threshhold/run.py <br>
Step 3.  03_finalize_dataset/run.py <br>
Step 4.  04_map/run.py <br>
Step 5.  05_combine/run.py <br>

These scripts will take you from the initial simulations (found at ```Library/CloudStorage/Box-Box/Summerinternship_2024/WNT-WLS-project/Data_Backup```) to the initial 2D matrix that is used in the ML pipeline (```05_combine/output/ML_input.csv```).

Each of the script folders, their usage, and their output are explained separately below. You may also refer to the ```README.md``` file.

In [1]:
import os

##################
# CHANGE THESE VARIABLES
##################

# System name, used to get the directory of the trajectory
# (from the Dropbox folder; hard-coded to the masauer2 directory)
system = "8a"
cutoff = 12  # Distance cutoff (Å)
workingDir = os.getcwd()  # Current directory
timestep = 20  # ps
tFinal = 1500000  # ps

## Step 1 - Compute Contact Distance Matrix

For each trajectory, compute the contact distance matrix between all C-$\alpha$ atoms. Then reshape the matrix from (nFrames x nCalpha x nCalpha) to (nFrames x nCalpha*nCalpha), so that each row is a frame and each column is an atom pair.

OUTPUT -> For each trajectory, we get a ```.npy``` file of the format ```WNT{wnt_protein_name}_distances_iter{chunk_num}.npy```. For example, the trajectory ```Data_Backup/wnt1/copy01/dcd_files/Wnt1WlsPc_copy_01_run_108.dcd``` produces ```output/WNT1_distances_iter108.npy```.
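
The reshape can be illustrated with a minimal NumPy sketch. It uses random coordinates in place of a real trajectory, and the helper name `pairwise_ca_distances` is hypothetical; the actual `run.py` may load coordinates and compute distances differently.

```python
import numpy as np


def pairwise_ca_distances(coords):
    """Return flattened pairwise distances, shape (nFrames, nCalpha * nCalpha).

    coords: array of shape (nFrames, nCalpha, 3) holding C-alpha positions.
    (Hypothetical helper, not taken from the pipeline scripts.)
    """
    # Broadcast to get the displacement between every pair of atoms per frame:
    # (nFrames, nCalpha, 1, 3) - (nFrames, 1, nCalpha, 3)
    diff = coords[:, :, None, :] - coords[:, None, :, :]
    dist = np.linalg.norm(diff, axis=-1)  # (nFrames, nCalpha, nCalpha)
    n_frames, n_ca, _ = dist.shape
    # Row-major reshape: one row per frame, one column per (i, j) pair
    return dist.reshape(n_frames, n_ca * n_ca)


rng = np.random.default_rng(0)
coords = rng.normal(size=(4, 5, 3))  # 4 frames, 5 C-alpha atoms (synthetic)
flat = pairwise_ca_distances(coords)
print(flat.shape)  # (4, 25)
```

With this row-major layout, column `k` corresponds to the atom pair `(i, j) = divmod(k, nCalpha)`.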

In [7]:
arguments = f"-sys {system} -dir {workingDir} -dt {timestep}"
%run 01_get_contact_matrix/run.py $arguments

['/Users/masauer2/Library/CloudStorage/Box-Box/Summerinternship_2024/WNT-WLS-project/Data_Backup/wnt1/copy01/dcd_files/Wnt1WlsPc_copy_01_run_001.dcd', '/Users/masauer2/Library/CloudStorage/Box-Box/Summerinternship_2024/WNT-WLS-project/Data_Backup/wnt1/copy01/dcd_files/Wnt1WlsPc_copy_01_run_002.dcd', '/Users/masauer2/Library/CloudStorage/Box-Box/Summerinternship_2024/WNT-WLS-project/Data_Backup/wnt1/copy01/dcd_files/Wnt1WlsPc_copy_01_run_003.dcd', '/Users/masauer2/Library/CloudStorage/Box-Box/Summerinternship_2024/WNT-WLS-project/Data_Backup/wnt1/copy01/dcd_files/Wnt1WlsPc_copy_01_run_004.dcd', '/Users/masauer2/Library/CloudStorage/Box-Box/Summerinternship_2024/WNT-WLS-project/Data_Backup/wnt1/copy01/dcd_files/Wnt1WlsPc_copy_01_run_005.dcd', '/Users/masauer2/Library/CloudStorage/Box-Box/Summerinternship_2024/WNT-WLS-project/Data_Backup/wnt1/copy01/dcd_files/Wnt1WlsPc_copy_01_run_006.dcd', '/Users/masauer2/Library/CloudStorage/Box-Box/Summerinternship_2024/WNT-WLS-project/Data_Backup/wnt

KeyboardInterrupt: 

## Step 2 - Apply Distance Threshold

Iterate over the second half of each trajectory and record the indices of the atom pairs that are within the cutoff at least once.

INPUT -> the ```.npy``` files found at ```01_get_contact_matrix/output``` <br>
OUTPUT -> a ```.txt``` file of the form ```output/WNT{wnt_protein_name}_idx_thresh{threshhold}.txt```. This file contains the column indices of the contact pairs (columns of the matrices in ```01_get_contact_matrix/output```) that are within the threshold.
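
As a rough sketch of the selection rule, assuming the flattened matrix from Step 1 is held in memory as a single array: the function name `columns_within_cutoff` is hypothetical, and whether the real script uses a strict `<` or `<=` comparison is not stated, so `<` here is an assumption.

```python
import numpy as np


def columns_within_cutoff(flat_distances, cutoff):
    """Indices of columns that drop below `cutoff` at least once
    in the second half of the frames. (Hypothetical helper.)"""
    n_frames = flat_distances.shape[0]
    second_half = flat_distances[n_frames // 2:]
    # True for each column with at least one frame under the cutoff
    hit = (second_half < cutoff).any(axis=0)
    return np.flatnonzero(hit)


# Toy data: 4 frames, 3 atom-pair columns
flat = np.array([
    [5.0, 20.0, 30.0],   # first half: ignored
    [20.0, 20.0, 30.0],  # first half: ignored
    [20.0, 11.0, 30.0],  # second half
    [20.0, 20.0, 30.0],  # second half
])
idx = columns_within_cutoff(flat, cutoff=12)
print(idx)  # [1]
```

Column 0 is only within the cutoff during the first half, so it is not selected; column 1 is within the cutoff once in the second half, which is enough.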

In [19]:
arguments = f"-sys {system} -cutoff {cutoff} -dir {workingDir} -dt {timestep} -tf {tFinal}"
%run 02_apply_threshhold/run.py $arguments

['WNT5a_distances_iter0.npy', 'WNT5a_distances_iter1.npy', 'WNT5a_distances_iter2.npy', 'WNT5a_distances_iter3.npy', 'WNT5a_distances_iter4.npy', 'WNT5a_distances_iter5.npy', 'WNT5a_distances_iter6.npy', 'WNT5a_distances_iter7.npy', 'WNT5a_distances_iter8.npy', 'WNT5a_distances_iter9.npy', 'WNT5a_distances_iter10.npy', 'WNT5a_distances_iter11.npy', 'WNT5a_distances_iter12.npy', 'WNT5a_distances_iter13.npy', 'WNT5a_distances_iter14.npy', 'WNT5a_distances_iter15.npy', 'WNT5a_distances_iter16.npy', 'WNT5a_distances_iter17.npy', 'WNT5a_distances_iter18.npy', 'WNT5a_distances_iter19.npy', 'WNT5a_distances_iter20.npy', 'WNT5a_distances_iter21.npy', 'WNT5a_distances_iter22.npy', 'WNT5a_distances_iter23.npy', 'WNT5a_distances_iter24.npy', 'WNT5a_distances_iter25.npy', 'WNT5a_distances_iter26.npy', 'WNT5a_distances_iter27.npy', 'WNT5a_distances_iter28.npy', 'WNT5a_distances_iter29.npy', 'WNT5a_distances_iter30.npy', 'WNT5a_distances_iter31.npy', 'WNT5a_distances_iter32.npy', 'WNT5a_distances_it

## Step 3 - Finalize Dataset

Given the indices of the C-$\alpha$ atom pairs that are within the distance threshold, filter the original dataset generated in Step 1 so that only those atom pairs are kept.

INPUT -> the `.npy` files found at `01_get_contact_matrix/output` and the `.txt` files found at `02_apply_threshhold/output` <br>
OUTPUT:<br>
1. `.txt` file of the form `WNT{wnt_protein_name}_threshhold{threshhold}_labels.txt` (a list of the column names of the parsed matrix)
2. `.npy` file of the form `WNT{wnt_protein_name}_threshhold{threshhold}_matrix.npy` (the matrix that contains only the contact pairs within the threshold)
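
The filtering amounts to column selection on the flattened matrix. The sketch below assumes a row-major Step 1 layout and invents an `i_j` label format for illustration; the label format actually written by `03_finalize_dataset/run.py` is not specified here.

```python
import numpy as np


def select_contact_columns(flat_distances, idx, n_ca):
    """Keep only the columns in `idx` and build one label per kept column.

    The 'i_j' label format is hypothetical; i and j are the atom indices
    recovered from the flattened column index.
    """
    matrix = flat_distances[:, idx]
    labels = [f"{int(k) // n_ca}_{int(k) % n_ca}" for k in idx]
    return matrix, labels


# Toy data: 3 frames, 2 atoms -> 4 flattened columns
flat = np.arange(12, dtype=float).reshape(3, 4)
matrix, labels = select_contact_columns(flat, np.array([1, 2]), n_ca=2)
print(matrix.shape)  # (3, 2)
print(labels)        # ['0_1', '1_0']
```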


In [20]:
arguments = f"-sys {system} -cutoff {cutoff} -dir {workingDir} -dt {timestep} -tf {tFinal}"
%run 03_finalize_dataset/run.py $arguments

['WNT5a_distances_iter0.npy', 'WNT5a_distances_iter1.npy', 'WNT5a_distances_iter2.npy', 'WNT5a_distances_iter3.npy', 'WNT5a_distances_iter4.npy', 'WNT5a_distances_iter5.npy', 'WNT5a_distances_iter6.npy', 'WNT5a_distances_iter7.npy', 'WNT5a_distances_iter8.npy', 'WNT5a_distances_iter9.npy', 'WNT5a_distances_iter10.npy', 'WNT5a_distances_iter11.npy', 'WNT5a_distances_iter12.npy', 'WNT5a_distances_iter13.npy', 'WNT5a_distances_iter14.npy', 'WNT5a_distances_iter15.npy', 'WNT5a_distances_iter16.npy', 'WNT5a_distances_iter17.npy', 'WNT5a_distances_iter18.npy', 'WNT5a_distances_iter19.npy', 'WNT5a_distances_iter20.npy', 'WNT5a_distances_iter21.npy', 'WNT5a_distances_iter22.npy', 'WNT5a_distances_iter23.npy', 'WNT5a_distances_iter24.npy', 'WNT5a_distances_iter25.npy', 'WNT5a_distances_iter26.npy', 'WNT5a_distances_iter27.npy', 'WNT5a_distances_iter28.npy', 'WNT5a_distances_iter29.npy', 'WNT5a_distances_iter30.npy', 'WNT5a_distances_iter31.npy', 'WNT5a_distances_iter32.npy', 'WNT5a_distances_it

## Step 4 - Map

Input:<br>
1. The maps at ```00_map/output/{wnt_from}_to_{wnt_to}.csv``` (where the indices are mapped from ```wnt_from``` to ```wnt_to```). Each map is a 3-column CSV: the first column contains the indices of ```wnt_from```, and the second column contains the best-matching indices of ```wnt_to```.

Output:<br>
1. New labels at ```output/WNT{SYSTEM}_threshhold{distance_threshhold}_labels.txt```. These essentially replace the labels generated in ```03_finalize_dataset```.
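
A minimal sketch of how such a map could be applied to the labels follows. Several details are assumptions made for illustration only: the CSV is read without a header row, the third column is ignored because its contents are not described, labels use a hypothetical `i_j` format, and only the first index of each pair is remapped. The real `04_map/run.py` may differ on any of these points.

```python
import csv
import io


def load_index_map(csv_text):
    """Build {wnt_from index: wnt_to index} from the first two CSV columns."""
    reader = csv.reader(io.StringIO(csv_text))
    return {int(row[0]): int(row[1]) for row in reader if row}


def remap_first_index(labels, index_map):
    """Rewrite hypothetical 'i_j' labels by mapping i; j is left unchanged.

    Indices missing from the map are kept as they are (an assumption).
    """
    new_labels = []
    for label in labels:
        i, j = label.split("_")
        new_labels.append(f"{index_map.get(int(i), int(i))}_{j}")
    return new_labels


# Made-up map rows: wnt_from index, wnt_to index, unused third column
csv_text = "10,12,0.9\n11,13,0.8\n"
index_map = load_index_map(csv_text)
new_labels = remap_first_index(["10_200", "11_201"], index_map)
print(new_labels)  # ['12_200', '13_201']
```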

In [32]:
arguments = f"-sys {system} -cutoff {cutoff} -dir {workingDir}"
%run 04_map/run.py $arguments

There are 2041 contact pairs in Wnt8a.
After conversion, there are 2041 contact pairs in Wnt8a.
136 unique contact pairs in Wnt8a and 237 unique pairs on WntLess.


## Step 5 - Combine

Inputs:
- the `txt` labels generated in `scripts/04_map/output`.
- the `npy` matrix containing the contact pairs within the `12Å` threshold generated in `scripts/03_finalize_dataset/output`.

Output: ```output/ML_input.csv```, the final input containing the contact distances for each frame (rows) and each contact pair within the threshold (columns).
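
The layout of the output can be sketched as follows. This handles a single system only and does not attempt to show how several systems are merged; the `frame` column name, the helper `write_ml_input`, and the toy values are all hypothetical.

```python
import csv
import os
import tempfile

import numpy as np


def write_ml_input(path, matrix, labels):
    """Write one row per frame and one column per contact pair."""
    if matrix.shape[1] != len(labels):
        raise ValueError("number of labels must match number of columns")
    with open(path, "w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(["frame", *labels])  # header row
        for frame, row in enumerate(matrix):
            writer.writerow([frame, *row.tolist()])


# Toy data: 2 frames, 2 contact pairs
matrix = np.array([[8.5, 11.2], [9.1, 12.4]])
labels = ["12_200", "13_201"]
out_path = os.path.join(tempfile.mkdtemp(), "ML_input.csv")
write_ml_input(out_path, matrix, labels)

with open(out_path) as fh:
    print(fh.read())
```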

In [None]:
arguments = f"-cutoff {cutoff} -dir {workingDir}"
%run 05_combine/run.py $arguments


## Step 6 - Autocorrelation Function

You can run ```run_acf.py``` to generate the autocorrelation function of each feature (stored as ```output/wnt{wnt_protein_name}_acf_t12_numpy.npy```). <br>
Then, run ```plot_acf.py``` to average over all the individual autocorrelation functions and plot the averaged ACF.
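
For reference, a per-feature autocorrelation followed by averaging might look like the sketch below. It uses a direct, normalized estimator on synthetic data; the estimator, the `max_lag` parameter, and the function name are assumptions rather than a description of what `run_acf.py` does.

```python
import numpy as np


def autocorrelation(x, max_lag):
    """Normalized autocorrelation of a 1D series for lags 0..max_lag-1."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    n = len(x)
    var = np.dot(x, x) / n
    acf = np.empty(max_lag)
    for lag in range(max_lag):
        # Average product of the series with a copy shifted by `lag`
        acf[lag] = np.dot(x[:n - lag], x[lag:]) / ((n - lag) * var)
    return acf


rng = np.random.default_rng(0)
# Synthetic features: random walks, so values are correlated in time
features = rng.normal(size=(500, 3)).cumsum(axis=0)

# One ACF per feature (column), then the average over features
acfs = np.array([autocorrelation(features[:, j], max_lag=50)
                 for j in range(features.shape[1])])
mean_acf = acfs.mean(axis=0)
print(acfs.shape)  # (3, 50)
```

By construction the ACF equals 1 at lag 0; the lag at which the averaged ACF decays gives a sense of how far apart frames must be before they can be treated as roughly independent.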

## Step 7 - Preprocess
This is a large Jupyter notebook that reads in ```05_combine/output/ML_input.csv``` and performs the following steps. <br>
1. Read in the input data
2. Do the train/test splits based on the ACF results
3. Get the Spearman correlation matrix for the training set
4. Perform the clustering (first on the indices, second on the Spearman distance matrix)
5. Pull features for each subcluster and get the new train/test sets
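
The Spearman step in the list above can be sketched with NumPy alone. Two assumptions are made here: ranks are computed with a double argsort, which ignores ties, and the distance is taken as `1 - |rho|`, which is one common choice and not necessarily the one used in the notebook.

```python
import numpy as np


def spearman_distance(X):
    """Spearman correlation between the columns of X, plus a derived distance.

    Ranking by double argsort assumes no tied values, which is reasonable
    for continuous distances but is still an assumption.
    """
    ranks = X.argsort(axis=0).argsort(axis=0).astype(float)
    rho = np.corrcoef(ranks, rowvar=False)  # Pearson on ranks = Spearman
    return rho, 1.0 - np.abs(rho)


rng = np.random.default_rng(0)
a = rng.normal(size=200)
# Columns 0 and 1 are monotonically related; column 2 is independent noise
X_train = np.column_stack([a, a ** 3, rng.normal(size=200)])
rho, dist = spearman_distance(X_train)
print(np.round(rho, 2))
```

A distance matrix of this kind could then be handed to a hierarchical clustering routine, for example `scipy.cluster.hierarchy.linkage` after conversion to condensed form with `scipy.spatial.distance.squareform`.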

## Step 8 - Build the model
Read in the training data and run a grid search over the model hyperparameters.

## Step 9 - Evaluate the model
Compute a learning curve using the optimized hyperparameters.