<a href="https://colab.research.google.com/github/PeterDeWeirdt/defense_predictor/blob/main/defense_predictor_example.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# DefensePredictor: A Machine Learning Model to Discover Novel Prokaryotic Immune Systems

Example usage of the [defense_predictor](https://github.com/PeterDeWeirdt/defense_predictor) Python package. DefensePredictor is a machine-learning model that uses embeddings from a protein language model, [ESM2](https://github.com/facebookresearch/esm), to classify proteins as anti-phage defensive.

We recommend using a GPU runtime for this notebook to accelerate predictions.
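
As a quick sanity check, the sketch below reports whether a CUDA GPU is visible to PyTorch in the current runtime. It assumes PyTorch is importable in the notebook kernel (Colab runtimes normally ship with it) and falls back to `False` otherwise. The helper name `gpu_available` is ours and is not part of `defense_predictor`.

```python
import importlib.util


def gpu_available() -> bool:
    """Return True if PyTorch is installed and can see a CUDA device."""
    # Hypothetical helper for illustration; not part of defense_predictor.
    if importlib.util.find_spec("torch") is None:
        return False
    import torch
    return torch.cuda.is_available()


print("GPU available:", gpu_available())
```

If this prints `False` on Colab, switching the runtime type to a GPU should change the result after the runtime restarts.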

## Create a virtual environment and install defense_predictor

In [1]:
!pip install virtualenv

Collecting virtualenv
  Downloading virtualenv-20.28.1-py3-none-any.whl.metadata (4.5 kB)
Collecting distlib<1,>=0.3.7 (from virtualenv)
  Downloading distlib-0.3.9-py2.py3-none-any.whl.metadata (5.2 kB)
Downloading virtualenv-20.28.1-py3-none-any.whl (4.3 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 4.3/4.3 MB 39.4 MB/s eta 0:00:00
Downloading distlib-0.3.9-py2.py3-none-any.whl (468 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 469.0/469.0 kB 32.4 MB/s eta 0:00:00
Installing collected packages: distlib, virtualenv
Successfully installed distlib-0.3.9 virtualenv-20.28.1


In [2]:
!virtualenv defense_predictor

created virtual environment CPython3.10.12.final.0-64 in 1665ms
  creator CPython3Posix(dest=/content/defense_predictor, clear=False, no_vcs_ignore=False, global=False)
  seeder FromAppData(download=False, pip=bundle, setuptools=bundle, wheel=bundle, via=copy, app_data_dir=/root/.local/share/virtualenv)
    added seed packages: pip==24.3.1, setuptools==75.6.0, wheel==0.45.1
  activators BashActivator,CShellActivator,FishActivator,NushellActivator,PowerShellActivator,PythonActivator


In [3]:
!source defense_predictor/bin/activate;pip install git+https://github.com/PeterDeWeirdt/defense_predictor.git

Collecting git+https://github.com/PeterDeWeirdt/defense_predictor.git
  Cloning https://github.com/PeterDeWeirdt/defense_predictor.git to /tmp/pip-req-build-gn05e2si
  Running command git clone --filter=blob:none --quiet https://github.com/PeterDeWeirdt/defense_predictor.git /tmp/pip-req-build-gn05e2si
  Resolved https://github.com/PeterDeWeirdt/defense_predictor.git to commit 1653ad9dc95f4c26b71240d070555496fb093066
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting fair-esm<3.0.0,>=2.0.0 (from defense-predictor==0.1.1)
  Downloading fair_esm-2.0.0-py3-none-any.whl.metadata (37 kB)
Collecting joblib<2.0.0,>=1.4.2 (from defense-predictor==0.1.1)
  Downloading joblib-1.4.2-py3-none-any.whl.metadata (5.4 kB)
Collecting lightgbm<5.0.0,>=4.5.0 (from defense-predictor==0.1.1)
  Downloading lightgbm-4.5.0-py3-none-manylinux_2_28_x86_64.whl.metadata (17 kB)
Collec

## Run DefensePredictor

Set `ftp_link_base` to the NCBI FTP directory of your assembly of interest to get predictions.

In [4]:
ftp_link_base = 'https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/005/845/GCF_000005845.2_ASM584v2'

In [5]:
assembly_num = ftp_link_base.split('/')[-1]
ft_link = f'{ftp_link_base}/{assembly_num}_feature_table.txt.gz'
fna_link = f'{ftp_link_base}/{assembly_num}_cds_from_genomic.fna.gz'
faa_link = f'{ftp_link_base}/{assembly_num}_protein.faa.gz'
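
One pitfall with the cell above: if `ftp_link_base` is pasted with a trailing slash, `split('/')[-1]` returns an empty string and the file names come out wrong. The sketch below wraps the same link construction in a helper that tolerates a trailing slash. The function name `build_ncbi_links` and the dictionary keys are our own choices for illustration, not part of `defense_predictor`.

```python
def build_ncbi_links(ftp_link_base: str) -> dict:
    """Build the three download links used in this notebook from an assembly directory URL."""
    # Hypothetical helper; mirrors the f-strings in the cell above.
    base = ftp_link_base.rstrip('/')        # tolerate a trailing slash
    assembly_num = base.split('/')[-1]      # e.g. 'GCF_000005845.2_ASM584v2'
    if not assembly_num:
        raise ValueError(f'Could not find an assembly name in {ftp_link_base!r}')
    suffixes = {
        'feature_table': 'feature_table.txt.gz',
        'cds_from_genomic': 'cds_from_genomic.fna.gz',
        'protein_fasta': 'protein.faa.gz',
    }
    return {name: f'{base}/{assembly_num}_{suffix}' for name, suffix in suffixes.items()}


links = build_ncbi_links(
    'https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/005/845/GCF_000005845.2_ASM584v2/'
)
for name, link in links.items():
    print(name, link)
```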

We can download the necessary files directly from the FTP directory.

In [6]:
input_file_list = []
for link in [ft_link, fna_link, faa_link]:
  !wget {link}
  f_name = link.split('/')[-1]
  !gunzip {f_name}
  # Record the decompressed file name (drop the trailing '.gz')
  input_file_list.append(f_name[:-3])

--2025-01-09 00:38:13--  https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/005/845/GCF_000005845.2_ASM584v2/GCF_000005845.2_ASM584v2_feature_table.txt.gz
Resolving ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)... 130.14.250.10, 130.14.250.11, 130.14.250.12, ...
Connecting to ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)|130.14.250.10|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 231979 (227K) [application/x-gzip]
Saving to: ‘GCF_000005845.2_ASM584v2_feature_table.txt.gz’


2025-01-09 00:38:14 (871 KB/s) - ‘GCF_000005845.2_ASM584v2_feature_table.txt.gz’ saved [231979/231979]

--2025-01-09 00:38:14--  https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/005/845/GCF_000005845.2_ASM584v2/GCF_000005845.2_ASM584v2_cds_from_genomic.fna.gz
Resolving ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)... 130.14.250.10, 130.14.250.11, 130.14.250.12, ...
Connecting to ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)|130.14.250.10|:443... connected.
HTTP request sent, awaiting response... 20
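
The download cell relies on the notebook `!` shell escape together with `wget` and `gunzip`. Outside a notebook, or on a system without those tools, the same two steps could be done with the Python standard library. This is a minimal sketch, not the notebook's method, and the helper names `gunzip_file` and `download_and_gunzip` are ours.

```python
import gzip
import shutil
import urllib.request
from pathlib import Path


def gunzip_file(gz_path):
    """Decompress a .gz file next to itself, delete the archive, return the new path."""
    gz_path = Path(gz_path)
    out_path = gz_path.with_suffix('')  # drops only the trailing '.gz'
    with gzip.open(gz_path, 'rb') as src, open(out_path, 'wb') as dst:
        shutil.copyfileobj(src, dst)
    gz_path.unlink()  # like `gunzip`, remove the compressed copy
    return out_path


def download_and_gunzip(link, dest_dir='.'):
    """Download `link` into `dest_dir` and return the path of the decompressed file."""
    gz_path = Path(dest_dir) / link.split('/')[-1]
    urllib.request.urlretrieve(link, gz_path)
    return gunzip_file(gz_path)
```

With these helpers, the loop above could be written as `input_file_list = [str(download_and_gunzip(link)) for link in [ft_link, fna_link, faa_link]]`.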

We can now run `defense_predictor` on our files and save outputs to the `files` folder. The predicted probability and log-odds of defense for each protein are in the columns `defense_probability` and `defense_log_odds`, respectively. We recommend using a log-odds cutoff of `7.2` to confidently call proteins as defensive.

In [7]:
!source defense_predictor/bin/activate;defense_predictor \
--ncbi_feature_table {input_file_list[0]} \
--ncbi_cds_from_genomic {input_file_list[1]} \
--ncbi_protein_fasta {input_file_list[2]} \
--output {assembly_num}_defense_predictor_output.csv

Reading data
Getting ESM2 encodings
100% 347/347 [03:26<00:00,  1.68it/s]
Calculating remaining features
Making predictions
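
Once the run finishes, the recommended cutoff can be applied to the output CSV. The sketch below keeps rows whose `defense_log_odds` is at or above `7.2` (treating the cutoff as inclusive is our assumption). It also converts a log-odds value to a probability, assuming the log-odds are natural-log odds. Under that assumption, `7.2` corresponds to a probability of roughly 0.999. The toy CSV and its `protein_id` column are made up for illustration; the real output may use different identifier columns.

```python
import csv
import io
import math

LOG_ODDS_CUTOFF = 7.2  # cutoff recommended in the text above


def log_odds_to_probability(log_odds):
    """Logistic function; assumes natural-log odds."""
    return 1.0 / (1.0 + math.exp(-log_odds))


def confident_defense_rows(csv_file, cutoff=LOG_ODDS_CUTOFF):
    """Return rows (as dicts) whose defense_log_odds is >= cutoff."""
    reader = csv.DictReader(csv_file)
    return [row for row in reader if float(row['defense_log_odds']) >= cutoff]


# Toy stand-in for the real output file; values and `protein_id` are hypothetical.
toy_csv = io.StringIO(
    "protein_id,defense_log_odds\n"
    "toy_protein_A,9.1\n"
    "toy_protein_B,-3.4\n"
)
hits = confident_defense_rows(toy_csv)
print([row['protein_id'] for row in hits])
print(round(log_odds_to_probability(LOG_ODDS_CUTOFF), 4))
```

To apply this to the real results, open the output file with `open(f'{assembly_num}_defense_predictor_output.csv', newline='')` and pass the file object to `confident_defense_rows`.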
