# Get rich sequence information

## Acquire sequence information based on accession id(s)

**Single accession ID**

A single sequence can be retrieved with the `get_id` function. It takes an accession ID as input and returns the sequence as a `ProteinRecord` object.  
The `ProteinRecord` object holds the sequence as a string, along with additional information such as its `Organism` and any `Region` or `Site` annotations. A `ProteinRecord` can also be assembled by hand, as the first cell below shows.


In [4]:
from pyeed.core import ProteinRecord
from pyeed.core import Organism

my_organism = Organism(name="E. coli", taxonomy_id=9094)

E300 = ProteinRecord(id="E300_wefef", sequence="SDFSGWRSB", organism=my_organism, name="my sequence")

E300.add_to_sites(
    positions=[1,2,3],
    name="active site"
)

print(E300)



ProteinRecord
├── id = E300_wefef
├── name = my sequence
├── organism
│   └── Organism
│       ├── id = f1e41f5c-3e17-477f-a453-837f4fd09ca7
│       ├── taxonomy_id = 9094
│       └── name = E. coli
├── sequence = SDFSGWRSB
└── sites
    └── 0
        └── Site
            ├── id = 277fed27-a7b4-451a-a822-cbfe3e88c1a2
            ├── name = active site
            └── positions = [1, 2, 3, ...]



In [1]:
from pyeed.core import ProteinRecord


matHM = ProteinRecord.get_id("MBP1912539.1")

**Multiple accession IDs**

To load multiple sequences at once, the `get_ids` function can be used. The function takes a list of accession IDs as input and returns a list of `ProteinRecord` objects.

In [3]:
import json

# Load the saved ids from json
with open("ids.json", "r") as f:
    ids = json.load(f)

# Get the protein info for each id
proteins = ProteinRecord.get_ids(ids)
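The cell above expects `ids.json` to contain a flat JSON list of accession ID strings. As a minimal sketch, such a file could be written and checked with the standard library alone. The helpers `save_ids` and `load_ids` are hypothetical and not part of `pyeed`; the two accession IDs are the ones that appear elsewhere on this page.

```python
import json
from pathlib import Path


def save_ids(ids, path="ids.json"):
    """Write accession IDs as a JSON list, dropping duplicates but keeping order."""
    unique_ids = list(dict.fromkeys(ids))
    Path(path).write_text(json.dumps(unique_ids, indent=2))
    return unique_ids


def load_ids(path="ids.json"):
    """Read the file back and check that it is a flat list of strings."""
    ids = json.loads(Path(path).read_text())
    if not isinstance(ids, list) or not all(isinstance(i, str) for i in ids):
        raise ValueError(f"{path} must contain a JSON list of strings")
    return ids


saved = save_ids(["MBP1912539.1", "WP_068323110.1", "MBP1912539.1"])
print(load_ids())
```

Removing duplicates before saving avoids requesting the same record twice.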


## Search for similar sequences with BLAST

The `ncbi_blast` method can be used to perform a BLAST search on the NCBI server. The method can be applied to a `ProteinRecord` object and returns a list of `ProteinRecord` objects that represent the hits of the BLAST search.
The search can be customized with `n_hits` (maximum number of hits), `e_value` (E-value threshold), `db` (database to query), `matrix` (substitution matrix), and `identity` (identity a hit must reach to be accepted).

<div class="admonition warning">
    <p class="admonition-title">NCBI BLAST service might be slow</p>
    <p>Because of the way NCBI handles requests to its BLAST API, the service is quite slow. During peak working hours, a single search might take more than 15 minutes.</p>
</div>

In [4]:
blast_results = matHM.ncbi_blast(
    n_hits=100,
    e_value=0.05,
    db="swissprot",
    matrix="BLOSUM62",
    identity=0.5,
)
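To illustrate what an identity threshold such as `identity=0.5` means, the sketch below computes a BLAST-style identity (identical columns divided by alignment length) for a toy pairwise alignment. It is not `pyeed` code. It assumes that `identity` is a fraction between 0 and 1 and acts as a minimum; how `pyeed` or NCBI evaluates it internally is not shown here. The function name and the toy alignment are made up for illustration.

```python
def blast_style_identity(aligned_query: str, aligned_hit: str) -> float:
    """Fraction of alignment columns that hold identical residues.

    Gap columns ("-") count toward the alignment length but never as matches.
    """
    if len(aligned_query) != len(aligned_hit):
        raise ValueError("aligned sequences must have equal length")
    if not aligned_query:
        return 0.0
    matches = sum(
        q == h and q != "-" for q, h in zip(aligned_query, aligned_hit)
    )
    return matches / len(aligned_query)


# Toy alignment, invented for illustration only (not a real BLAST result).
toy_query = "MARNIVVEEI"
toy_hit = "MARN-VVDEI"

min_identity = 0.5  # assumed to play the role of `identity=0.5` above
identity = blast_style_identity(toy_query, toy_hit)
print(f"identity = {identity:.2f}, accepted = {identity >= min_identity}")
```

In this toy case, 8 of 10 alignment columns match, so the hit would pass a threshold of 0.5.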


## Inspect objects

Printing any `pyeed` object with `print` displays all the information available for it as a tree. This is useful for inspecting the object and its attributes.

In [6]:
print(blast_results[3])

ProteinRecord
├── id = WP_068323110.1
├── name = methionine adenosyltransferase
├── organism
│   └── Organism
│       ├── id = 95652fe4-fb9a-40e7-b98a-1dfad46d8e56
│       ├── taxonomy_id = 1609559
│       ├── name = Pyrococcus kukulkanii
│       ├── domain = Archaea
│       ├── phylum = Euryarchaeota
│       ├── tax_class = Thermococci
│       ├── order = Thermococcales
│       ├── family = Thermococcaceae
│       └── genus = Pyrococcus
├── sequence = MARNIVVEEIVRTPVEMQKVELVERKGIGHPDSIADGIAEAVSRALCREYIKRYGVILHHNTDQVEVVGGRAYPKFGGGEVVKPIYILLSGRAVELVDQELFPVHEVAIRAAKEYLKKNIRHLDVENHVVIDSRIGQGSVDLVSVFNKAKENPIPLANDTSFGVGFAPLTETERLVLETERLLNSEKFKKEYPAVGEDIKVMGLRKGDEIDLTIAAAIVDSEVANPKEYMEVKDKIKETVEELAKDITSRKVNIYVNTADDPKKDIYYITVTGTSAEAGDDGSVGRGNRVNGLITPNRHMSMEAAAGKNPVSHVGKIYNILAMFIANDIAKALPVEEVYVRILSQIGKPIDQPLVASIQVIPKQGHTVKEFEKDAYAIADEWLANITKIQKMILEDKITVF
├── [94