Updating Chromatin Profiling Based on Issue #43 #51

Open · wants to merge 6 commits into base: main
72 changes: 70 additions & 2 deletions README.md
@@ -334,17 +334,85 @@ python -m evals/instruction_tuned_genomics

### Chromatin Profile

You'll need to see the [DeepSea paper](https://www.nature.com/articles/nmeth.3547) and the [Sei framework repo](https://github.com/FunctionLab/sei-framework) for details on how to download and preprocess the data.

For a more detailed walkthrough, follow the steps below.

1. Clone or download the [build-deepsea-training-dataset repo](https://github.com/jakublipinski/build-deepsea-training-dataset).

2. Clone or download the [Sei framework](https://github.com/FunctionLab/sei-framework.git).

3. Step into the Sei framework directory and run its setup script:

```
sh ./download_data.sh
```

This should download data into the `resources` folder.
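
To confirm the download worked, here's a minimal check, assuming the hg19 FASTA lands somewhere under `resources/` (the exact filename depends on the Sei release, so the sketch searches rather than hard-codes it):

```
import glob

# Look for the hg19 FASTA anywhere under resources/; the exact filename
# depends on the Sei release, so search instead of hard-coding it.
matches = glob.glob('resources/**/*hg19*fa*', recursive=True)
print(matches or 'no hg19 FASTA found - re-check the download step')
```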

4. Next, step into `build-deepsea-training-dataset` and follow the repo's instructions to build the dataset in debugging mode; the commands are reproduced below. The `--hg19` flag must be given the path to an hg19 FASTA (`.fa`) file: use the hg19 file the Sei framework downloaded into `resources` in step 3. You can modify the other parameters for further customization.


```
git clone git@github.com:jakublipinski/build-deepsea-training-dataset.git
cd build-deepsea-training-dataset/data
xargs -L 1 curl -C - -O -L < deepsea_data.urls
find ./ -name \*.gz -exec gunzip {} \;
cd ..
```
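
Before building, it's worth confirming the downloads decompressed cleanly. A rough check (the expected file set is defined by `deepsea_data.urls`, so only counts are inspected here):

```
import glob

# All .gz archives should have been replaced by decompressed files,
# and the BED files used by build.py should now sit under data/.
leftover = glob.glob('data/**/*.gz', recursive=True)
beds = glob.glob('data/*.bed')
print(len(beds), 'bed files;', len(leftover), 'archives still compressed')
```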

```
mkdir out

python build.py \
--metadata_file data/deepsea_metadata.tsv \
--pos data/allTFs.pos.bed \
--beds_folder data/ \
--hg19 [path to FA file] \
--train_size 2200000 \
--valid_size 4000 \
--train_filename out/train.mat \
--valid_filename out/valid.mat \
--test_filename out/test.mat \
--train_data_filename out/train_data.npy \
--train_labels_filename out/train_labels.npy \
--valid_data_filename out/valid_data.npy \
--valid_labels_filename out/valid_labels.npy \
--test_data_filename out/test_data.npy \
--test_labels_filename out/test_labels.npy \
--save_debug_info True
```
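
Once `build.py` finishes, a quick sanity check on the arrays it wrote (file names taken from the command above; exact shapes depend on your size settings):

```
import numpy as np

# Spot-check the training arrays written by build.py (paths from the command above).
train_data = np.load('out/train_data.npy')
train_labels = np.load('out/train_labels.npy')
print(train_data.shape, train_data.dtype)
print(train_labels.shape, train_labels.dtype)
# One label row per sequence, so the row counts should match.
assert train_data.shape[0] == train_labels.shape[0]
```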

5. Run the following Python script to create your coordinate target files. Pass the path to your `build-deepsea-training-dataset` directory as the first argument, and make a note of where the output files land for step 8.

```
python ./create_coords.py [path to build-deepsea-training-dataset]
```
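
To verify the output, you can peek at one of the generated files. The column names below follow the `create_coords.py` script included in this diff: `Chr_No`/`Start`/`End` plus one `y_<accession>` column per chromatin target:

```
import pandas as pd

# Inspect one coordinate/target file written by create_coords.py.
coords = pd.read_csv('val_coords_targets.csv', index_col=0)
print(coords[['Chr_No', 'Start', 'End']].head())
print(coords.filter(like='y_').shape[1], 'target columns')
```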


6. Now step into the Sei framework and follow its chromatin profile prediction steps, specifically the command below. The `<input-file>` is the BED or FASTA input file downloaded in step 3, which should be in the `resources` directory within the Sei framework. For `<genome>`, this example is geared toward hg19; hg38 works as well, but you will need to change the earlier steps accordingly. The output directory is your choice.

```
sh 1_sequence_prediction.sh <input-file> <genome> <output-dir> --cuda
```
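
If you want to inspect what Sei produced before moving on, the sketch below simply enumerates the HDF5 contents. The dataset names inside the files vary by Sei version, so nothing is assumed about them; replace `<output-dir>` with the directory you passed above:

```
import glob
import h5py

# List every dataset inside the HDF5 files Sei wrote to the output directory.
for path in glob.glob('<output-dir>/chromatin-profiles-hdf5/*.h5'):
    with h5py.File(path, 'r') as f:
        f.visititems(lambda name, obj: print(path, name, getattr(obj, 'shape', '')))
```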



7. This should create a folder called `chromatin-profiles-hdf5` in your output directory, along with several other files.

8. Copy your coordinate files from step 5 into the `chromatin-profiles-hdf5` folder.

9. Now go back to HyenaDNA and, assuming you have already set it up, run the following.

example chromatin profile run:
```
python -m train wandb=null experiment=hg38/chromatin_profile dataset.ref_genome_path=/path/to/fasta/hg38.ml.fa dataset.data_path=/path/to/chromatin_profile dataset.ref_genome_version=hg38
```

- `dataset.ref_genome_path` # path to a human ref genome file (the input sequences)
- `dataset.ref_genome_version` # the version of the ref genome (hg38 or hg19, we use hg38)
- `dataset.data_path` # path to the labels of the dataset

For `dataset.data_path`, use the path to your `chromatin-profiles-hdf5` folder. The hg19 `.fa` file can be found in `resources` or in your output directory alongside the chromatin profiles. Since this walkthrough targets hg19, set `dataset.ref_genome_version=hg19` and point `dataset.ref_genome_path` at the hg19 FASTA instead of the hg38 values shown in the example.


### Species Classification
16 changes: 16 additions & 0 deletions create_coords.py
@@ -0,0 +1,16 @@
import os
import sys

import pandas as pd


def create_coord_target_files(repo_path, file, name):
    # target columns come from the metadata file shipped with the
    # build-deepsea-training-dataset repo
    metadata = os.path.join(repo_path, 'data/deepsea_metadata.tsv')
    target_cols = pd.read_csv(metadata, sep='\t')['File accession'].tolist()
    colnames = target_cols + ['Chr_No', 'Start', 'End']
    df = pd.read_csv(os.path.join(repo_path, file), usecols=colnames, header=0)
    df.drop_duplicates(inplace=True)
    df.reset_index(drop=True, inplace=True)
    # prefix the target columns so they are distinguishable from the coordinates
    df.rename(columns={k: f'y_{k}' for k in target_cols}, inplace=True)
    df.to_csv(f'{name}_coords_targets.csv')


# path to the build-deepsea-training-dataset checkout; the debug_*.tsv files
# written by build.py with --save_debug_info are expected at its top level
path_to_deepsea_data_repo = sys.argv[1]
create_coord_target_files(path_to_deepsea_data_repo, 'debug_valid.tsv', 'val')
create_coord_target_files(path_to_deepsea_data_repo, 'debug_test.tsv', 'test')
create_coord_target_files(path_to_deepsea_data_repo, 'debug_train.tsv', 'train')