# Find HBA in nt
This notebook shows how to use pandas DataFrames to find the *HBA1* gene in nucleotide sequences.

## Installation
First, install the API library into your virtual environment:

In [None]:
%pip install --quiet ncbi-cloudblast-api

For this demo, you also need to install `pandas` and `matplotlib`:

In [None]:
%pip install --quiet pandas==0.24.2 matplotlib

We also need to enable inline matplotlib output in the notebook:

In [None]:
%matplotlib inline

In [None]:
import matplotlib.pyplot as plt

plt.rcParams["figure.figsize"] = (20,10)

## Before you start
To use this library, you must provide the address of a CloudBlast API service endpoint:

In [None]:
API_ADDRESS = '35.221.15.226:5000'

## Perform a Blast Search

In [None]:
from ncbi_cloudblast_api.api_client import APIClient

if not API_ADDRESS:
    raise ValueError("Please set value for API_ADDRESS in the previous step.")

client = APIClient(API_ADDRESS)

`NM_000558.5` is the mRNA for human hemoglobin subunit alpha 1 (*HBA1*).

In [None]:
query = "NM_000558.5"

print(f'Running Blast search for: {query} ...')

# The "search" method waits for the Blast search to complete
# and then returns the result.
res = client.search(accession=query)
print("Done.")

## Show only the first (strongest) match for each organism.

In [None]:
# A slice of search result for some selected fields
df = res.as_dataframe()[["qaccver", "saccver", "pident", "length", "evalue", "bitscore", "staxid", "qstart"]]

In [None]:
# First 20 HSPs (in default sort order); head() alone would show only 5
df.head(20)

Keep only the first match for each organism. The number of rows in the result is the number of unique organisms:

In [None]:
df.drop_duplicates(subset="staxid", keep="first")
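
The cell above treats the first row for each `staxid` as the strongest match, which depends on the order in which the result was returned. As a minimal sketch, assuming bit score is the measure of strength you want, you can sort explicitly before dropping duplicates. The `toy` table and its values are made up for illustration and are not real search results:

```python
import pandas as pd

# Toy stand-in for the search result; all values are invented for illustration.
toy = pd.DataFrame({
    "saccver": ["A1", "A2", "B1", "B2", "C1"],
    "staxid": [9606, 9606, 9598, 9598, 9544],
    "bitscore": [900.0, 1050.0, 980.0, 700.0, 650.0],
})

# Sort so the highest bit score comes first, then keep one row per taxonomy ID.
best_per_taxid = (
    toy.sort_values("bitscore", ascending=False)
       .drop_duplicates(subset="staxid", keep="first")
       .reset_index(drop=True)
)
print(best_per_taxid)
```

The same two calls should work on `df` directly, since it already contains the `bitscore` and `staxid` columns.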

In [None]:
df['staxid'].value_counts()
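
If you only need the number of unique organisms rather than the rows themselves, `Series.nunique()` returns it directly. The sketch below uses a small invented `toy_staxid` series (an assumption for illustration, not real search output) to show how it relates to `value_counts()`:

```python
import pandas as pd

# Invented taxonomy IDs standing in for df['staxid'].
toy_staxid = pd.Series([9606, 9606, 9598, 9598, 9544], name="staxid")

# Number of distinct taxonomy IDs.
n_organisms = toy_staxid.nunique()

# Number of matches for each taxonomy ID, most frequent first.
matches_per_taxid = toy_staxid.value_counts()

print(n_organisms)
print(matches_per_taxid)
```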

Number of matches per taxonomic node.

In [None]:
# Name the returned Axes `ax` so it does not shadow the pyplot module imported as `plt`.
ax = df['staxid'].value_counts().plot.bar(rot=0)
ax.set_ylabel("number");
ax.set_xlabel("tax_id");
ax.set_title("Number of matches per taxonomic node");
ax.set_xticks([]);

Number of matches per subject sequence.

In [None]:
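# Name the returned Axes `ax` so it does not shadow the pyplot module imported as `plt`.
# The axis labels are set explicitly below, so no x/y arguments are passed to plot.bar().
ax = df['saccver'].value_counts().plot.bar(rot=0)
ax.set_ylabel("number");
ax.set_xlabel("subject acc");
ax.set_title("Number of matches per subject sequence");
ax.set_xticks([]);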