Jupyter-notebook for scraping protein structures from the Uniprot database.

DanYamamotoEvans/SPDBUE


Scraping PDB files from UniProt Entries (SPDBUE)

This is a Jupyter notebook for scraping protein structure files from UniProt database entries.

With the advent of AlphaFold2 and other accurate structure prediction methods, the UniProt database has started to list predicted structures within its entries. For now, AlphaFold2 provides a download page with proteome-wide predictions for dozens of species. When focusing on variants represented in isoforms and/or orthologs, however, it is tedious to click through every UniProt entry and download the predictions. I therefore prepared a Python script that scrapes all the structure predictions for a given search result.

Please feel free to contact me through this repository for bugs and issues.
:Dan

News

  • 2022-01-12 Version 1.0.0 made public!

Installation

Please make sure you have suitable versions of Python and pip installed before starting.

Python version >=3.5
pip    version >= 1.1.0

Dependencies :

requests version >=2.27.1
beautifulsoup4 version >=4.10.0
openpyxl version >=3.0.9
tqdm version >=2.2.4
selenium version >=3.11.0
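
To confirm that an environment meets these minimums before opening the notebook, a small check along the following lines may help. This is only a sketch and not part of SPDBUE: the helper names (`parse_version`, `installed_version`, `check_requirements`) are made up for illustration, the version comparison is deliberately simple (numeric parts only), and `importlib.metadata` requires Python 3.8 or newer, which is stricter than the minimum listed above.

```python
from importlib.metadata import PackageNotFoundError, version  # Python >= 3.8

# Minimum versions copied from the dependency list above.
MINIMUM_VERSIONS = {
    "requests": "2.27.1",
    "beautifulsoup4": "4.10.0",
    "openpyxl": "3.0.9",
    "tqdm": "2.2.4",
    "selenium": "3.11.0",
}


def parse_version(text):
    """Turn '4.10.0' into (4, 10, 0); stops at the first non-numeric part."""
    parts = []
    for piece in text.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def installed_version(name):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(name)
    except PackageNotFoundError:
        return None


def check_requirements(minimums=MINIMUM_VERSIONS, lookup=installed_version):
    """Return {package: problem} for every missing or outdated package."""
    problems = {}
    for name, minimum in minimums.items():
        found = lookup(name)
        if found is None:
            problems[name] = "not installed"
        elif parse_version(found) < parse_version(minimum):
            problems[name] = "found {}, need >= {}".format(found, minimum)
    return problems


if __name__ == "__main__":
    # Prints one line per problem; prints nothing when all minimums are met.
    for name, problem in check_requirements().items():
        print(name + ": " + problem)
```

An empty result means every listed package is present at or above its minimum version.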

To install these packages, first clone this repository by

git clone https://github.com/DanYamamotoEvans/SPDBUE.git

Next, go to the location of the SPDBUE folder in the terminal, and install the dependencies by

pip install .

Other core programs to install:

pip install jupyterlab

Now you're ready to scrape the PDB files!

Usage

Overview

Inputs:

  • Excel file from Uniprot search result
  • Output directory name
  • Fasta file from Uniprot search result (optional)

Outputs:

  • A directory with all the PDB files
  • A summary file mapping each UniProt ID to its PDB files. If a FASTA file is provided, the summary also includes the protein sequence.

Step 1. Enter your query in Uniprot

Go to the UniProt database and search for the proteins of interest by entering a query in the search bar. Hit search to get the results. You can then select the proteins of interest from the search result.

Step 2. Download the search result

Above the table of search results, you will see a download button. Click it and select 'Excel'. The download will begin shortly. Make sure you rename the file once the download is complete.


Option

  • You can also download the FASTA file for the same search result and provide it as input, so that the sequence information is easy to access.
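
As an illustration of how the sequence information can be matched to UniProt IDs, the sketch below reads FASTA lines into a dictionary keyed by accession. It is not taken from the notebook: the function name `read_uniprot_fasta` is hypothetical, and it assumes the usual UniProt header layout `>db|ACCESSION|ENTRY_NAME description`, falling back to the first word of the header otherwise.

```python
def read_uniprot_fasta(lines):
    """Map UniProt accession -> sequence from an iterable of FASTA lines."""
    chunks_by_accession = {}
    accession = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]
            fields = header.split("|")
            if len(fields) >= 3:
                # e.g. 'sp|P69905|HBA_HUMAN Hemoglobin subunit alpha ...'
                accession = fields[1]
            else:
                # Non-UniProt header: use the first word as the identifier.
                accession = header.split(" ", 1)[0]
            chunks_by_accession[accession] = []
        elif accession is not None:
            # Sequence lines are wrapped, so collect and join them later.
            chunks_by_accession[accession].append(line)
    return {acc: "".join(chunks) for acc, chunks in chunks_by_accession.items()}
```

Since an open file is an iterable of lines, the function can be called directly on a file handle opened on the downloaded FASTA file.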

Tips

  • Make sure you use the Advanced option to restrict the search to UniProt entries with 3D structures.
  • The scraper has to wait for the JavaScript on each entry page to load, which makes it slow (~10 sec per protein). Make sure your search returns fewer than 1000 entries.
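
To get a feel for what the ~10 sec per protein figure means for a given search result, the arithmetic can be sketched as below. The function name `estimate_runtime` is made up for this example, and the per-entry time is only the rough figure quoted above, so the result is an estimate at best.

```python
import math

SECONDS_PER_ENTRY = 10  # rough figure quoted in the tip above


def estimate_runtime(n_entries, seconds_per_entry=SECONDS_PER_ENTRY):
    """Return the estimated scraping time as (hours, minutes), rounded up."""
    total_minutes = math.ceil(n_entries * seconds_per_entry / 60)
    return divmod(total_minutes, 60)


hours, minutes = estimate_runtime(1000)
print("1000 entries: about {} h {} min".format(hours, minutes))
# prints "1000 entries: about 2 h 47 min"
```

At the suggested upper limit of 1000 entries, a run therefore takes close to three hours.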


Step 3. Run the Jupyter-notebook

  1. Open your terminal, and go to the SPDBUE directory.
  2. Open Jupyter Notebook by entering
jupyter-notebook

This will open your browser.

  3. Click and open SPDBUE_main.ipynb.
  4. Change path, xlsx_f, and fasta_f to your directory/files. Make sure you place the xlsx and fasta files in the indicated directory.
  5. Execute all cells from the menu bar: Cell -> Run All.
  6. The script will create a new directory under path, named after the xlsx file (without the extension). Within it, there will be a summary file and a folder with all the PDB files.
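
The relationship between the notebook variables and the output location can be sketched as follows. This is an illustration under assumptions, not the notebook's own code: it assumes that path is a directory and that xlsx_f and fasta_f are file names inside it, the helper names `missing_inputs` and `plan_output` are hypothetical, and the summary file and PDB folder names used here are placeholders, since the actual names are not stated above.

```python
from pathlib import Path


def missing_inputs(path, xlsx_f, fasta_f=None):
    """Return the input files that are not present in the working directory."""
    names = [xlsx_f] + ([fasta_f] if fasta_f else [])
    return [name for name in names if not (Path(path) / name).is_file()]


def plan_output(path, xlsx_f, summary_name="summary.tsv", pdb_dir_name="pdb"):
    """Return the locations a run is expected to produce.

    summary_name and pdb_dir_name are placeholder names (assumptions).
    """
    # Output directory: the xlsx file name without its extension, under path.
    out_dir = Path(path) / Path(xlsx_f).stem
    return {
        "output_dir": out_dir,
        "summary": out_dir / summary_name,
        "pdb_dir": out_dir / pdb_dir_name,
    }
```

Checking `missing_inputs` before choosing Run All gives an early warning if the Excel or FASTA file was not placed in the indicated directory.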

By default, the notebook points to a test file that downloads 10 entries. I recommend running the script with this file before trying your own list. To see the result when a FASTA file is provided, uncomment the fasta_f line by deleting the '#' before fasta_f.

Citation

If you make use of an AlphaFold prediction, please cite the following papers:

  • Jumper, J. et al. Highly accurate protein structure prediction with AlphaFold. Nature (2021).
  • Varadi, M. et al. AlphaFold Protein Structure Database: massively expanding the structural coverage of protein-sequence space with high-accuracy models. Nucleic Acids Research (2021).
