# Build an On-Disk Database
AnnSQL can build two types of databases. The first is a simple in-memory database for smaller datasets. The second is an on-disk database, which this notebook demonstrates how to build. An on-disk AnnSQL database lets you query, filter, and run basic statistics on larger-than-memory datasets from a laptop.



### Install the AnnSQL package

```bash
pip install annsql
```

### Import Libraries

In [1]:
from AnnSQL import AnnSQL
from AnnSQL.MakeDb import MakeDb
import scanpy as sc
import os

### Load the dataset
Here, we load the sample pbmc3k raw dataset provided by Scanpy. **Note**: Very large datasets must be opened in AnnData backed mode, which is fully supported. When a dataset is opened in backed mode, the database is built in chunks. Depending on the size of your dataset and your compute resources, this process may take some time.

In [2]:
adata = sc.datasets.pbmc3k()
adata.var_names_make_unique()
print(adata)

AnnData object with n_obs × n_vars = 2700 × 32738
    var: 'gene_ids'
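
A rough way to judge whether backed mode is worth using is to estimate how large the expression matrix would be if every value were stored densely. The sketch below does that arithmetic for the shape printed above. The helper `dense_bytes` and the 4- and 8-byte value widths are assumptions made for illustration; the actual footprint depends on how AnnData and AnnSQL store the data (for example, sparse matrices or compression).

```python
def dense_bytes(n_obs, n_vars, bytes_per_value=4):
    """Bytes needed to hold an n_obs x n_vars matrix with no sparsity or compression."""
    return n_obs * n_vars * bytes_per_value

# Shape reported by print(adata) for pbmc3k.
n_obs, n_vars = 2700, 32738

# Assumed value widths: 4 bytes (float32) and 8 bytes (float64).
for width in (4, 8):
    gb = dense_bytes(n_obs, n_vars, width) / 1e9
    print(f"{width} bytes/value: {gb:.2f} GB")
```

pbmc3k is small enough to fit in memory either way. The same estimate scales linearly with the number of cells, which is why much larger datasets call for backed mode.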


### Build the AnnSQL database

In [3]:
# Reopen the dataset in backed mode so it is read from disk in chunks.
# Assumes Scanpy cached the download at its default location, data/pbmc3k_raw.h5ad.
adata = sc.read_h5ad("data/pbmc3k_raw.h5ad", backed="r")

# Remove any existing database files so the build starts fresh (for testing purposes only).
if os.path.exists("db/pbmc3k.asql"):
    os.remove("db/pbmc3k.asql")
if os.path.exists("db/pbmc3k.asql.wal"):
    os.remove("db/pbmc3k.asql.wal")

# High system memory (>24 GB)
MakeDb(adata=adata, db_name="pbmc3k", db_path="db/", chunk_size=5000)

# Medium system memory (12-24 GB)
# MakeDb(adata=adata, db_name="pbmc3k", db_path="db/", chunk_size=2500)

# Low system memory (<=12 GB)
# MakeDb(adata=adata, db_name="pbmc3k", db_path="db/", chunk_size=1000, make_buffer_file=True)

Time to make var_names unique:  23.616740942001343
Time to create X table structure:  0.24507379531860352
Starting backed mode X table data insert. Total rows: 2700
Processed chunk 0-2699 in 4.261802673339844 seconds

Too close for missiles, switching to guns
Creating X table from buffer file.
This may take a while...
Time to create X table from buffer: 67.55312919616699
Finished inserting X data.


<AnnSQL.MakeDb.MakeDb at 0x7e32c02339b0>
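
The build log above reports a single chunk, `0-2699`, because all 2700 rows fit within one chunk. The sketch below shows how row ranges might be derived from `chunk_size`. It is a plain-Python illustration, not AnnSQL's actual implementation: the helper `chunk_ranges` is hypothetical, and the inclusive `start-end` convention is assumed from the log line.

```python
def chunk_ranges(n_rows, chunk_size):
    """Yield inclusive (start, end) row ranges that together cover n_rows rows."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, n_rows, chunk_size):
        # The last chunk may be shorter than chunk_size.
        end = min(start + chunk_size, n_rows) - 1
        yield start, end

# Compare the three chunk sizes suggested in the build cell for 2700 rows.
for size in (5000, 2500, 1000):
    print(size, list(chunk_ranges(2700, size)))
```

Smaller chunks keep fewer rows in memory at once, at the cost of more insert passes.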

### Open the Database
Below, we instantiate the AnnSQL class with the `db` parameter pointing to our newly created database. By default, database files use the `.asql` extension.

In [4]:
asql = AnnSQL(db="db/pbmc3k.asql")

### Query the Database

In [5]:
asql.query("SELECT * FROM X LIMIT 5")

| | cell_id | MIR1302_10 | FAM138A | OR4F5 | RP11_34P13_7 | RP11_34P13_8 | AL627309_1 | RP11_34P13_14 | RP11_34P13_9 | AP006222_2 | ... | KIR3DL2_1 | AL590523_1 | CT476828_1 | PNRC2_1 | SRSF10_1 | AC145205_1 | BAGE5 | CU459201_1 | AC002321_2 | AC002321_1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | AAACATACAACCAC-1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | AAACATTGAGCTAC-1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | AAACATTGATCAGC-1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | AAACCGTGCTTCCG-1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | AAACCGTGTATGCG-1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |


### Calculate total counts per gene

In [6]:
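# Total counts per gene: the subquery drops cell_id with EXCLUDE,
# then COLUMNS(*) applies SUM to every remaining (gene) column.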
asql.query("SELECT SUM(COLUMNS(*)) FROM (SELECT * EXCLUDE (cell_id) FROM X)")

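| | MIR1302_10 | FAM138A | OR4F5 | RP11_34P13_7 | RP11_34P13_8 | AL627309_1 | RP11_34P13_14 | RP11_34P13_9 | AP006222_2 | RP4_669L17_10 | ... | KIR3DL2_1 | AL590523_1 | CT476828_1 | PNRC2_1 | SRSF10_1 | AC145205_1 | BAGE5 | CU459201_1 | AC002321_2 | AC002321_1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 9.0 | 0.0 | 0.0 | 3.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 116.0 | 70.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |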
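
The query above totals every gene at once. When only a few genes are of interest, the SQL can instead be assembled from a list of column names. The sketch below is one way to do that. The helper `sum_query` is hypothetical, and the gene names are taken from the output above. So that it runs without AnnSQL installed, the query is executed against a toy in-memory SQLite table with made-up values, not against the real `X` table.

```python
import sqlite3

def sum_query(genes, table="X"):
    """Build a SELECT statement that totals each listed gene column."""
    for name in genes:
        # Reject names that would break out of the quoted identifier.
        if '"' in name:
            raise ValueError(f"invalid column name: {name!r}")
    cols = ", ".join(f'SUM("{g}") AS "{g}"' for g in genes)
    return f'SELECT {cols} FROM "{table}"'

genes = ["AL627309_1", "PNRC2_1", "SRSF10_1"]
sql = sum_query(genes)
print(sql)

# Toy stand-in for the X table; these values are invented for the demo.
con = sqlite3.connect(":memory:")
con.execute(
    'CREATE TABLE X (cell_id TEXT, "AL627309_1" REAL, "PNRC2_1" REAL, "SRSF10_1" REAL)'
)
con.executemany(
    "INSERT INTO X VALUES (?, ?, ?, ?)",
    [("c1", 0.0, 2.0, 1.0), ("c2", 1.0, 3.0, 0.0)],
)
totals = con.execute(sql).fetchone()
print(totals)
```

With the database opened as shown earlier, the same string could be passed to `asql.query(sql)` to total just those genes.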
