Scraping PDB files from UniProt Entries (SPDBUE)

This is a Jupyter notebook for scraping protein structure files from UniProt database entries.

With the advent of AlphaFold2 and other accurate structure-prediction methods, the UniProt database has started to list structure predictions within its entries. For now, AlphaFold2 provides a download page with proteome-wide predictions for dozens of species. When focusing on variants represented in isoforms and/or orthologs, however, it is tedious to click through every UniProt entry and download the predictions. I therefore prepared a Python script that scrapes all the structure predictions for a given search result.

Please feel free to contact me through this repository for bugs and issues.
:Dan

News

2022-01-12 Version 1.0.0 made public!

Installation

Please make sure you have suitable versions of Python and pip before starting.

Python version >=3.5
pip version >=1.1.0

Dependencies :

requests version >=2.27.1
beautifulsoup4 version >=4.10.0
openpyxl version >=3.0.9
tqdm version >=2.2.4
selenium version >=3.11.0

To install these packages, first clone this repository by

git clone https://github.com/DanYamamotoEvans/SPDBUE.git

Next, go to the location of the SPDBUE folder in the terminal, and install the dependencies by

pip install .
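After installation, one possible way to confirm that the dependencies listed above are present is the short check below. It is only a sketch and not part of the repository: the helper name `installed_version` is made up for illustration, and it relies on `importlib.metadata`, which assumes Python 3.8 or newer (newer than the minimum version stated above).

```python
from importlib import metadata  # assumption: Python 3.8+ (stdlib module)

# Distribution names taken from the dependency list in this README.
REQUIRED = ["requests", "beautifulsoup4", "openpyxl", "tqdm", "selenium"]


def installed_version(package):
    """Return the installed version string of `package`, or None if it is missing.

    Hypothetical helper for illustration; it is not defined in SPDBUE itself.
    """
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


for name in REQUIRED:
    version = installed_version(name)
    print(f"{name}: {version if version is not None else 'NOT INSTALLED'}")
```

Any package reported as missing can be installed by re-running the pip command above.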

Other core programs to install:

Jupyter-notebook

pip install jupyterlab

Now you're ready to scrape the PDB files!

Usage

Overview

Inputs:

Excel file from the UniProt search result
Output directory name
FASTA file from the UniProt search result (optional)

Outputs:

A directory with all the PDB files
A summary file mapping each UniProt ID to its PDB files. If a FASTA file is provided, the summary also includes the protein sequence.
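To illustrate how the optional FASTA input can supply sequences keyed by UniProt ID, here is a minimal sketch that reads UniProt-style FASTA headers (`>sp|ACCESSION|ENTRY_NAME ...`) into a dictionary. The function name `read_uniprot_fasta` is hypothetical, and the notebook may parse the file differently; the example sequence is truncated for illustration.

```python
from io import StringIO


def read_uniprot_fasta(handle):
    """Return {accession: sequence} from a UniProt-style FASTA stream.

    Hypothetical helper for illustration; not necessarily how the notebook does it.
    """
    sequences = {}
    accession = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # UniProt headers look like ">sp|P69905|HBA_HUMAN Hemoglobin ..."
            parts = line[1:].split("|")
            if len(parts) >= 3:
                accession = parts[1]
            else:
                # Fallback for non-UniProt headers: use the first word.
                accession = (line[1:].split() or [""])[0]
            sequences[accession] = ""
        elif accession is not None:
            # Sequence lines may be wrapped, so concatenate them.
            sequences[accession] += line
    return sequences


# Small in-memory example (sequence truncated for illustration).
example = StringIO(
    ">sp|P69905|HBA_HUMAN Hemoglobin subunit alpha OS=Homo sapiens\n"
    "MVLSPADKTN\n"
    "VKAAWGKVGA\n"
)
print(read_uniprot_fasta(example))
```

With a real download, the same function could be called on an open file handle instead of the in-memory example.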

Step 1. Enter your query in UniProt

Go to the UniProt database and search for your proteins of interest by entering a query in the search bar. Hit search to get the results. You can then select the proteins of interest from the search results.

Step 2. Download the search result

Above the table of search results, you will see a download button. Click it and select 'Excel'. The download will begin shortly. Make sure you rename the file once the download is complete.

Option

You can also download the FASTA file and provide it as input so that the sequence information is easy to access.

Tips

Make sure you use the Advanced option to restrict the search to UniProt entries with 3D structures.
The scraper has to wait for the JavaScript on each page to load, which makes it slow (~10 sec per protein). Make sure your search returns fewer than 1000 entries.
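To see why the entry limit matters, a rough runtime estimate can be computed from the ~10 seconds per protein quoted above. The function name is made up for this sketch, and the per-entry time is only the approximate figure from this README; actual speed will depend on your connection and machine.

```python
def estimate_runtime_hours(n_entries, seconds_per_entry=10.0):
    """Rough scraping time in hours, assuming a constant time per entry.

    The default of 10 s per entry is the approximate figure quoted in this README.
    """
    return n_entries * seconds_per_entry / 3600.0


for n in (10, 100, 1000):
    print(f"{n:>5} entries: about {estimate_runtime_hours(n):.2f} h")
```

At the quoted rate, a search result near the 1000-entry limit would take a few hours to scrape.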

Step 3. Run the Jupyter-notebook

Open your terminal, and go to the SPDBUE directory.
Open jupyter-notebook by entering

jupyter-notebook

This will open your browser.

Click and open SPDBUE_main.ipynb.
Change path, xlsx_f, and fasta_f to your directory/files. Make sure you place the xlsx and fasta in the indicated directory.
Execute all cells from the menu bar: Cell -> Run all.
The script will create a new directory under the directory you set in path, named after the xlsx file (without the extension). Within it, there will be a summary file and a folder with all the PDB files.
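The relationship between the three settings and the output location can be sketched as follows. The variable names `path`, `xlsx_f`, and `fasta_f` come from the notebook, but the values and the helper `output_dir_for` are hypothetical placeholders; the names of the summary file and the PDB folder inside the output directory are not specified here, so they are left out.

```python
from pathlib import Path

# Hypothetical example values; replace them with your own directory and files.
path = Path("test")              # directory that holds the input files
xlsx_f = "uniprot_result.xlsx"   # Excel file downloaded from UniProt
fasta_f = None                   # optional FASTA file, e.g. "uniprot_result.fasta"


def output_dir_for(path, xlsx_f):
    """Directory named after the xlsx file (without extension), placed under `path`.

    Illustrative helper only; the notebook may build this path differently.
    """
    return Path(path) / Path(xlsx_f).stem


out_dir = output_dir_for(path, xlsx_f)
print(out_dir)
```

Renaming the downloaded Excel file before running the notebook therefore also determines the name of the output directory.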

By default, the notebook points to a test file that downloads 10 entries. I recommend running the script with this file before trying your own list. To see the result when a FASTA file is supplied, uncomment the fasta_f line by deleting the '#' in front of it.

Citation

If you make use of an AlphaFold prediction, please cite the following papers: Jumper, J et al. Highly accurate protein structure prediction with AlphaFold. Nature (2021). Varadi, M et al. AlphaFold Protein Structure Database: massively expanding the structural coverage of protein-sequence space with high-accuracy models. Nucleic Acids Research (2021).
