<a href="https://colab.research.google.com/github/PeterDeWeirdt/defense_predictor/blob/main/defense_predictor_example.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# DefensePredictor: A Machine Learning Model to Discover Novel Prokaryotic Immune Systems

Example usage of the [defense_predictor](https://github.com/PeterDeWeirdt/defense_predictor) Python package. DefensePredictor is a machine-learning model that leverages embeddings from a protein language model, [ESM2](https://github.com/facebookresearch/esm), to classify proteins as anti-phage defensive.

We recommend using a GPU runtime for this notebook to accelerate predictions.
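Before installing anything, it can help to confirm that the runtime actually exposes a GPU. The following is a minimal sketch that uses only the standard library and assumes an NVIDIA GPU whose driver ships the `nvidia-smi` tool; the helper name `gpu_visible` is hypothetical and not part of `defense_predictor`.

```python
import shutil
import subprocess


def gpu_visible():
    """Return True if nvidia-smi is installed and lists at least one GPU."""
    if shutil.which("nvidia-smi") is None:
        return False
    # `nvidia-smi -L` prints one line per GPU, e.g. "GPU 0: <name> (UUID: ...)"
    result = subprocess.run(["nvidia-smi", "-L"], capture_output=True, text=True)
    return result.returncode == 0 and "GPU" in result.stdout


if gpu_visible():
    print("GPU detected")
else:
    print("No GPU detected; consider switching the runtime type")
```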

## Create a virtual environment and install defense_predictor

In [1]:
!pip install virtualenv

Collecting virtualenv
  Downloading virtualenv-20.28.1-py3-none-any.whl.metadata (4.5 kB)
Collecting distlib<1,>=0.3.7 (from virtualenv)
  Downloading distlib-0.3.9-py2.py3-none-any.whl.metadata (5.2 kB)
Downloading virtualenv-20.28.1-py3-none-any.whl (4.3 MB)
Downloading distlib-0.3.9-py2.py3-none-any.whl (468 kB)

In [2]:
!virtualenv defense_predictor

created virtual environment CPython3.10.12.final.0-64 in 1274ms
  creator CPython3Posix(dest=/content/defense_predictor, clear=False, no_vcs_ignore=False, global=False)
  seeder FromAppData(download=False, pip=bundle, setuptools=bundle, wheel=bundle, via=copy, app_data_dir=/root/.local/share/virtualenv)
    added seed packages: pip==24.3.1, setuptools==75.6.0, wheel==0.45.1
  activators BashActivator,CShellActivator,FishActivator,NushellActivator,PowerShellActivator,PythonActivator


In [3]:
!source defense_predictor/bin/activate;pip install defense_predictor

Collecting defense_predictor
  Downloading defense_predictor-0.1.1-py3-none-any.whl.metadata (3.9 kB)
Collecting fair-esm<3.0.0,>=2.0.0 (from defense_predictor)
  Downloading fair_esm-2.0.0-py3-none-any.whl.metadata (37 kB)
Collecting joblib<2.0.0,>=1.4.2 (from defense_predictor)
  Downloading joblib-1.4.2-py3-none-any.whl.metadata (5.4 kB)
Collecting lightgbm<5.0.0,>=4.5.0 (from defense_predictor)
  Downloading lightgbm-4.5.0-py3-none-manylinux_2_28_x86_64.whl.metadata (17 kB)
Collecting numpy<3.0.0,>=2.1.2 (from defense_predictor)
  Downloading numpy-2.2.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (62 kB)
Collecting pandas<3.0.0,>=2.2.3 (from defense_predictor)
  Downloading pandas-2.2.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (89 kB)
Collecting requests<3.0.0,>=2.32.3 (from defense_predictor)
  Downloading requests-2.32.3-py3-none-any.whl.metadata (4.6 kB)
Collecting scikit-learn<2.0.0,>=1.5.2 (from defense_predictor)
  Downloadi

In [4]:
!source defense_predictor/bin/activate;defense_predictor_download

Downloading model weights
Downloading ESM2 weights


## Run DefensePredictor

Set `ftp_link_base` to the NCBI FTP directory of your assembly of interest to get predictions for that genome.

In [5]:
ftp_link_base = 'https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/005/845/GCF_000005845.2_ASM584v2'

In [6]:
assembly_num = ftp_link_base.split('/')[-1]
ft_link = f'{ftp_link_base}/{assembly_num}_feature_table.txt.gz'
fna_link = f'{ftp_link_base}/{assembly_num}_cds_from_genomic.fna.gz'
faa_link = f'{ftp_link_base}/{assembly_num}_protein.faa.gz'

We can download the necessary files directly from the NCBI FTP directory.

In [7]:
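input_file_list = list()
for link in [ft_link, fna_link, faa_link]:
  # Fetch the gzipped file into the working directory
  !wget {link}
  f_name = link.split('/')[-1]
  # Decompress in place; gunzip removes the .gz file
  !gunzip {f_name}
  # Record the decompressed filename (strip the trailing '.gz')
  input_file_list.append(f_name[:-3])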

--2025-01-13 20:47:26--  https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/005/845/GCF_000005845.2_ASM584v2/GCF_000005845.2_ASM584v2_feature_table.txt.gz
Resolving ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)... 130.14.250.10, 130.14.250.11, 130.14.250.12, ...
Connecting to ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)|130.14.250.10|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 231979 (227K) [application/x-gzip]
Saving to: ‘GCF_000005845.2_ASM584v2_feature_table.txt.gz’


2025-01-13 20:47:26 (646 KB/s) - ‘GCF_000005845.2_ASM584v2_feature_table.txt.gz’ saved [231979/231979]

--2025-01-13 20:47:27--  https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/005/845/GCF_000005845.2_ASM584v2/GCF_000005845.2_ASM584v2_cds_from_genomic.fna.gz
Resolving ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)... 130.14.250.10, 130.14.250.11, 130.14.250.12, ...
Connecting to ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)|130.14.250.10|:443... connected.
HTTP request sent, awaiting response... 20
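The same download step can also be done in pure Python, which avoids depending on `wget` and `gunzip` being available. This is a sketch using only the standard library; the helper names `assembly_urls` and `fetch_and_gunzip` are hypothetical, and the three file suffixes are the ones used in the cells above.

```python
import gzip
import shutil
import urllib.request
from pathlib import Path

# File suffixes taken from the link-building cell above
SUFFIXES = ("feature_table.txt", "cds_from_genomic.fna", "protein.faa")


def assembly_urls(ftp_link_base):
    """Build the three gzipped input URLs from an assembly's base URL."""
    base = ftp_link_base.rstrip("/")
    assembly_num = base.split("/")[-1]
    return [f"{base}/{assembly_num}_{suffix}.gz" for suffix in SUFFIXES]


def fetch_and_gunzip(url, out_dir="."):
    """Download one .gz file and write its decompressed contents to out_dir."""
    name = url.rsplit("/", 1)[-1]
    if name.endswith(".gz"):
        name = name[: -len(".gz")]
    out_path = Path(out_dir) / name
    # Stream the response through gzip so the archive never touches disk
    with urllib.request.urlopen(url) as response:
        with gzip.GzipFile(fileobj=response, mode="rb") as gz:
            with open(out_path, "wb") as out:
                shutil.copyfileobj(gz, out)
    return str(out_path)
```

In the notebook, this could stand in for the loop above with `input_file_list = [fetch_and_gunzip(u) for u in assembly_urls(ftp_link_base)]`, which preserves the feature table, CDS, protein order that the later command expects.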

We can now run `defense_predictor` on our files and save the output as a CSV file in the working directory. The predicted probability and log-odds of defense for each protein are in the columns `defense_probability` and `defense_log_odds`, respectively. We recommend a log-odds cutoff of `7.2` to confidently call a protein defensive.
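To get a feel for how strict this cutoff is, a log-odds value can be converted to a probability. The sketch below assumes that `defense_log_odds` is the natural-log odds, ln(p / (1 - p)), which is the usual convention but is not stated in this notebook; the function names are hypothetical.

```python
import math


def log_odds_to_probability(log_odds):
    """Convert natural-log odds to a probability with the logistic function."""
    return 1.0 / (1.0 + math.exp(-log_odds))


def probability_to_log_odds(p):
    """Inverse of the above: ln(p / (1 - p)) for 0 < p < 1."""
    return math.log(p / (1.0 - p))


# Assumption: the recommended cutoff is expressed in natural-log odds
cutoff_probability = log_odds_to_probability(7.2)
print(f"{cutoff_probability:.5f}")
```

Under that assumption, a log-odds of `7.2` corresponds to a probability of roughly 0.9993, so the cutoff keeps only very high-confidence calls.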

In [8]:
!source defense_predictor/bin/activate;defense_predictor \
--ncbi_feature_table {input_file_list[0]} \
--ncbi_cds_from_genomic {input_file_list[1]} \
--ncbi_protein_fasta {input_file_list[2]} \
--output {assembly_num}_defense_predictor_output.csv

Reading data
Getting ESM2 encodings
100% 347/347 [03:36<00:00,  1.61it/s]
Calculating remaining features
Making predictions
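Once the run finishes, the output CSV can be filtered with the cutoff recommended above. This sketch uses only the standard library, so it runs in the notebook kernel without the packages installed in the virtual environment. It relies only on the `defense_log_odds` column named above; the helper name `confident_defense_hits` is hypothetical, and treating the cutoff as exclusive (`>`) is an assumption.

```python
import csv

LOG_ODDS_CUTOFF = 7.2  # cutoff recommended in the text above


def confident_defense_hits(csv_path, cutoff=LOG_ODDS_CUTOFF):
    """Return rows whose defense_log_odds exceeds the cutoff, highest first."""
    with open(csv_path, newline="") as handle:
        rows = list(csv.DictReader(handle))
    # Assumption: the cutoff is exclusive; use >= if it should be inclusive
    hits = [row for row in rows if float(row["defense_log_odds"]) > cutoff]
    hits.sort(key=lambda row: float(row["defense_log_odds"]), reverse=True)
    return hits
```

For the run above, `hits = confident_defense_hits(f"{assembly_num}_defense_predictor_output.csv")` would return one dictionary per predicted defensive protein, each keyed by the CSV's column names.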
