# Example 1: Getting protein information and running BLAST searches

In [1]:
%reload_ext autoreload
%autoreload 2

from pyEED.core.proteinsequence import ProteinSequence
from pyEED.ncbi.utils import get_nucleotide_sequences

## Query NCBI

The pyEED library is centered around the `ProteinSequence` object, which integrates the available information on a protein: its amino acid sequence, the corresponding nucleotide sequence, and the regions and sites within these sequences. A `ProteinSequence` can be initialized directly from an NCBI protein accession number.
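To make the idea of one object bundling sequence, nucleotide sequence, regions, and sites more concrete, here is a minimal sketch using plain dataclasses. It is not pyEED's actual class definition: `ProteinRecord`, `Region`, `Site`, their fields, and the toy values are all hypothetical and only illustrate the kind of structure described above.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Region:
    """A contiguous stretch of a sequence (hypothetical; 0-based, end-exclusive)."""
    name: str
    start: int
    end: int


@dataclass
class Site:
    """Individual positions of interest, e.g. an active site (hypothetical)."""
    name: str
    positions: List[int]


@dataclass
class ProteinRecord:
    """Illustrative stand-in for the information a ProteinSequence bundles."""
    accession: str
    sequence: str
    nucleotide_sequence: Optional[str] = None
    regions: List[Region] = field(default_factory=list)
    sites: List[Site] = field(default_factory=list)

    def region_sequence(self, region: Region) -> str:
        """Return the residues covered by a region."""
        return self.sequence[region.start:region.end]


record = ProteinRecord(
    accession="EXAMPLE_1",     # made-up accession, not a real NCBI record
    sequence="MTTYFNYPSKELQ",  # made-up toy sequence
    regions=[Region(name="toy region", start=2, end=6)],
    sites=[Site(name="toy site", positions=[4])],
)
print(record.region_sequence(record.regions[0]))
```

Keeping annotations next to the sequence they refer to means a region or site can always be resolved against the right residues, which is the main appeal of a single integrating object.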

In [4]:
aldolase = ProteinSequence.from_ncbi("NP_001287541.1")
print(aldolase)

AttributeError: 'SeqFeature' object has no attribute 'description'

## BLAST search

In [3]:
blast_results = aldolase.blast(n_hits=15)

Running blast search for aldolase 1, isoform M from Drosophila melanogaster...


Fetching protein sequences:   0%|          | 0/15 [00:00<?, ?it/s]

ID: AEB39622.1
Name: AEB39622
Description: FI14722p [Drosophila melanogaster]
Number of features: 7
/topology=linear
/data_file_division=INV
/date=12-APR-2011
/accessions=['AEB39622']
/sequence_version=1
/db_source=accession BT126261.1
/keywords=['']
/source=Drosophila melanogaster (fruit fly)
/organism=Drosophila melanogaster
/taxonomy=['Eukaryota', 'Metazoa', 'Ecdysozoa', 'Arthropoda', 'Hexapoda', 'Insecta', 'Pterygota', 'Neoptera', 'Endopterygota', 'Diptera', 'Brachycera', 'Muscomorpha', 'Ephydroidea', 'Drosophilidae', 'Drosophila', 'Sophophora']
/references=[Reference(title='Direct Submission', ...)]
/comment=Sequence submitted by:
Berkeley Drosophila Genome Project
Lawrence Berkeley National Laboratory
Berkeley, CA 94720
This clone was sequenced as part of a high-throughput process to
sequence clones from the Drosophila Gene Collection. The sequence
has been subjected to integrity checks for sequence accuracy,
presence of a polyA tail and contiguity within 200 kb in the
genome. Th

Fetching protein sequences:   7%|▋         | 1/15 [00:02<00:30,  2.21s/it]

ID: NP_001287541.1
Name: NP_001287541
Description: aldolase 1, isoform M [Drosophila melanogaster]
Database cross-references: BioProject:PRJNA164, BioSample:SAMN02803731
Number of features: 7
/topology=linear
/data_file_division=INV
/date=15-NOV-2022
/accessions=['NP_001287541']
/sequence_version=1
/db_source=REFSEQ: accession NM_001300612.1
/keywords=['RefSeq']
/source=Drosophila melanogaster (fruit fly)
/organism=Drosophila melanogaster
/taxonomy=['Eukaryota', 'Metazoa', 'Ecdysozoa', 'Arthropoda', 'Hexapoda', 'Insecta', 'Pterygota', 'Neoptera', 'Endopterygota', 'Diptera', 'Brachycera', 'Muscomorpha', 'Ephydroidea', 'Drosophilidae', 'Drosophila', 'Sophophora']
/references=[Reference(title='Gene Model Annotations for Drosophila melanogaster: Impact of High-Throughput Data', ...), Reference(title='Gene Model Annotations for Drosophila melanogaster: The Rule-Benders', ...), Reference(title='The Release 6 reference sequence of the Drosophila melanogaster genome', ...), Reference(title='Se

Fetching protein sequences:  13%|█▎        | 2/15 [00:04<00:26,  2.01s/it]

ID: XP_032576247.1
Name: XP_032576247
Description: fructose-bisphosphate aldolase isoform X1 [Drosophila sechellia]
Database cross-references: BioProject:PRJNA609127
Number of features: 7
/topology=linear
/data_file_division=INV
/date=03-MAR-2020
/accessions=['XP_032576247']
/sequence_version=1
/db_source=REFSEQ: accession XM_032720356.1
/keywords=['RefSeq']
/source=Drosophila sechellia
/organism=Drosophila sechellia
/taxonomy=['Eukaryota', 'Metazoa', 'Ecdysozoa', 'Arthropoda', 'Hexapoda', 'Insecta', 'Pterygota', 'Neoptera', 'Endopterygota', 'Diptera', 'Brachycera', 'Muscomorpha', 'Ephydroidea', 'Drosophilidae', 'Drosophila', 'Sophophora']
/comment=MODEL REFSEQ:  This record is predicted by automated computational
analysis. This record is derived from a genomic sequence
(NC_045952.1) annotated using gene prediction method: Gnomon,
supported by EST evidence.
Also see:
    Documentation of NCBI's Annotation Process
COMPLETENESS: full length.
/structured_comment=defaultdict(<class 'dict'>

Fetching protein sequences:  20%|██        | 3/15 [00:05<00:23,  1.93s/it]

ID: XP_016036385.1
Name: XP_016036385
Description: fructose-bisphosphate aldolase isoform X1 [Drosophila simulans]
Database cross-references: BioProject:PRJNA695671
Number of features: 7
/topology=linear
/data_file_division=INV
/date=03-NOV-2021
/accessions=['XP_016036385']
/sequence_version=1
/db_source=REFSEQ: accession XM_016174648.3
/keywords=['RefSeq']
/source=Drosophila simulans
/organism=Drosophila simulans
/taxonomy=['Eukaryota', 'Metazoa', 'Ecdysozoa', 'Arthropoda', 'Hexapoda', 'Insecta', 'Pterygota', 'Neoptera', 'Endopterygota', 'Diptera', 'Brachycera', 'Muscomorpha', 'Ephydroidea', 'Drosophilidae', 'Drosophila', 'Sophophora']
/comment=MODEL REFSEQ:  This record is predicted by automated computational
analysis. This record is derived from a genomic sequence
(NC_052523.2) annotated using gene prediction method: Gnomon,
supported by EST evidence.
Also see:
    Documentation of NCBI's Annotation Process
COMPLETENESS: full length.
/structured_comment=defaultdict(<class 'dict'>, {

Fetching protein sequences:  27%|██▋       | 4/15 [00:08<00:23,  2.14s/it]

ID: 1FBA_A
Name: 1FBA_A
Description: Chain A, FRUCTOSE 1,6-BISPHOSPHATE ALDOLASE
Number of features: 28
/topology=linear
/data_file_division=INV
/date=01-DEC-2020
/accessions=['1FBA_A']
/db_source=pdb: molecule 1FBA, chain A, release Jul 13, 2011; deposition: Jun 8, 1992; class: LYASE(ALDEHYDE); source: Mmdb_id: 73594, Pdb_id 1: 1FBA; Exp. method: X-ray Diffraction.
/keywords=['']
/source=Drosophila melanogaster (fruit fly)
/organism=Drosophila melanogaster
/taxonomy=['Eukaryota', 'Metazoa', 'Ecdysozoa', 'Arthropoda', 'Hexapoda', 'Insecta', 'Pterygota', 'Neoptera', 'Endopterygota', 'Diptera', 'Brachycera', 'Muscomorpha', 'Ephydroidea', 'Drosophilidae', 'Drosophila', 'Sophophora']
/references=[Reference(title='Fructose-1,6-bisphosphate aldolase from Drosophila melanogaster: primary structure analysis, secondary structure prediction, and comparison with vertebrate aldolases', ...), Reference(title='The crystal structure of fructose-1,6-bisphosphate aldolase from Drosophila melanogaster a




UnboundLocalError: local variable 'protein_name' referenced before assignment
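An `UnboundLocalError` of this kind usually means that a local variable is assigned only inside a conditional branch or loop that was never entered for a particular record. The traceback is not shown here, so the exact cause in pyEED is an assumption. The sketch below uses a hypothetical function and a simplified dict-based feature layout (not pyEED's or Biopython's real objects) to show the general pattern and one way to guard against it.

```python
from typing import Dict, List, Optional


def extract_protein_name(features: List[Dict]) -> Optional[str]:
    """Return the product name of the first 'Protein' feature, or None.

    `features` is a simplified, hypothetical stand-in for parsed record
    features: each is a dict with a "type" and a "qualifiers" mapping.
    """
    protein_name = None  # bind the name up front so it always exists
    for feature in features:
        if feature.get("type") == "Protein":
            products = feature.get("qualifiers", {}).get("product", [])
            if products:
                protein_name = products[0]
                break
    return protein_name


with_name = [{"type": "Protein", "qualifiers": {"product": ["aldolase"]}}]
without_name = [{"type": "Region", "qualifiers": {}}]

print(extract_protein_name(with_name))
print(extract_protein_name(without_name))
```

Returning `None` for records that lack the expected annotation lets the caller decide whether to skip the record or fall back to another field, instead of aborting the whole batch of hits.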

In [None]:
get_nucleotide_sequences(blast_results)

## Storing `ProteinSequence`s in a database



In [None]:
from sdRDM import DataModel
from sdrdm_database import DBConnector, create_tables

### Setting up a local MySQL database

First, set up a local MySQL database by running it in a Docker container.

>[!NOTE]
>
>If Docker is not installed on your system, please follow the instructions on the [Docker website](https://docs.docker.com/get-docker/).


If this notebook is run on a macOS system with an M1 chip, run the following command in the terminal first:

>```bash
>export DOCKER_DEFAULT_PLATFORM=linux/amd64 
>```

Next, navigate to the directory where this notebook is located and run the following command to start the Docker container:

>```bash
>docker compose up -d
>```
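A freshly started container may need a few seconds before the database accepts connections. The following sketch, which uses only the Python standard library, polls a TCP port until it opens. The helper name `wait_for_port` and its defaults are assumptions made for illustration, and an open port only shows that something is listening, not that MySQL has finished initializing.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 30.0, interval: float = 0.5) -> bool:
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

With the connection settings used in this notebook, a call such as `wait_for_port("localhost", 3306)` could be placed before the first database cell.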

### Connect to the database

In [None]:
# Establish a connection to the database
db = DBConnector(
    username="root",
    password="root",
    host="localhost",
    db_name="db",
    port=3306,
    dbtype="mysql",
)
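Hardcoded credentials are fine for a throwaway local container, but it can be convenient to read them from the environment instead. The sketch below builds the keyword arguments for the connection call above; the helper name and the `PYEED_DB_*` variable names are hypothetical, and the defaults simply mirror the values used in this notebook.

```python
import os
from typing import Any, Dict, Mapping


def db_settings_from_env(env: Mapping[str, str] = os.environ) -> Dict[str, Any]:
    """Collect connection settings, falling back to this notebook's defaults."""
    return {
        "username": env.get("PYEED_DB_USER", "root"),
        "password": env.get("PYEED_DB_PASSWORD", "root"),
        "host": env.get("PYEED_DB_HOST", "localhost"),
        "db_name": env.get("PYEED_DB_NAME", "db"),
        "port": int(env.get("PYEED_DB_PORT", "3306")),  # env values are strings
        "dbtype": "mysql",
    }


# Passing a plain dict instead of os.environ makes the helper easy to try out.
settings = db_settings_from_env({"PYEED_DB_PORT": "3307"})
print(settings["host"], settings["port"])
```

The resulting dictionary uses the same keyword names as the cell above, so it could be passed on as `DBConnector(**db_settings_from_env())`.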

In [None]:
# Create the tables in the database
create_tables(db_connector=db, model=ProteinSequence)

### Populate the database

In [None]:
# Transform the BLAST results into a list of dictionaries
blast_result_data = [result.dict() for result in blast_results]

# Insert the data into the database
# The table name must match a table created by create_tables() above.
db.connection.insert("Reactant", blast_result_data)
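For larger result sets it may be preferable to insert rows in batches rather than in one call. This is a generic sketch, not part of pyEED or sdrdm_database; the `chunked` helper and the batch size are assumptions chosen for illustration.

```python
from typing import Dict, Iterator, List


def chunked(rows: List[Dict], size: int) -> Iterator[List[Dict]]:
    """Yield consecutive slices of `rows` containing at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


rows = [{"id": i} for i in range(5)]  # toy rows standing in for result dicts
batches = list(chunked(rows, 2))
print([len(batch) for batch in batches])
```

Each batch could then be handed to the same insert call used above, which keeps individual statements small and makes it easier to see which rows failed if an insert is rejected.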