# Database Homework

**Authors:** [Tony Kabilan Okeke](mailto:tko35@drexel.edu),
  [Cooper Molloy](mailto:cdm348@drexel.edu)  
**Date:** October 22, 2022

In [1]:
# Notebook set up
%load_ext autoreload
%autoreload 2

# Import libraries
from urllib.request import urlretrieve
from pathlib import Path
import pandas as pd
import sqlite3
import random
import string

# Import BMES module
import sys, os
sys.path.append(os.environ['BMESAHMETDIR'])
import bmes

In [2]:
# Function Definitions

def downloadurl(url, filename=''):
    """
    Download a file from a url and save it as filename.
    bmes.downloadurl generates errors on my machine, so I wrote my own.
    """

    if not filename:
        filename = url.split('/')[-1]
    path = (Path(bmes.datadir()) / filename).resolve()
    if not path.exists():
        urlretrieve(url, path)
    return str(path)


def select(conn, query, show=False):
    """
    Run a SELECT query and return the results as a pandas DataFrame
    """
    
    cur = conn.cursor()
    cur.execute(query)
    rows = cur.fetchall()
    if len(rows) == 0:
        print("No rows returned for query")
        return None
    else:
        df = pd.DataFrame(rows)
        df.columns = [x[0] for x in cur.description]
        if show: display(df)
        return df


def randomname():
    """
    Generate a random 16-letter name
    """

    # ascii_letters already contains both lowercase and uppercase letters
    characters = string.ascii_letters
    return ''.join(random.choices(characters, k=16))

## [20pt]  Yeast Apoptosis Genes

Write a GO query to find the names of yeast genes that are associated with the
"execution phase of apoptosis". Here, we define "yeast" as any organism under 
the genus '*Saccharomyces*'.

* Fetch the results of your GO query from the web and display them as output 
  from your python/Matlab code.

In [3]:
# Download the database
URL = "http://sacan.biomed.drexel.edu/ftp/binf/godb.sqlite"
godbfile = downloadurl(URL)

# Query the database
db = sqlite3.connect(godbfile)
df = select(db, """
    SELECT DISTINCT(GENE.symbol)
    FROM
        term AS T1,
        graph_path AS GP,
        term AS T2,
        association AS A,
        gene_product AS GENE,
        species AS S
    WHERE
        T1.name LIKE '%execution phase of apoptosis%' AND
        GP.term1_id = T1.id AND
        T2.id = GP.term2_id AND
        A.term_id = T2.id AND
        GENE.id = A.gene_product_id AND
        S.genus = 'Saccharomyces' AND
        GENE.species_id = S.id
""")
db.close()

print("""
The following genes were found to be associated with the "execution
phase of apoptosis" in yeast (Genus: Saccharomyces):
""")
print(', '.join(df['symbol']))


The following genes were found to be associated with the "execution
phase of apoptosis" in yeast (Genus: Saccharomyces):

NUC1, YBL055C
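
The comma-separated `FROM` list above is an implicit inner join. As a rough sketch, the same traversal can be written with explicit `JOIN ... ON` clauses, which keeps each join condition next to the table it belongs to. The example below runs against a toy in-memory database: the table and column names mirror the query above, but the rows and gene symbols (`GENE_A`, `GENE_B`, `GENE_C`) are hypothetical and are not taken from `godb.sqlite`.

```python
import sqlite3

# Toy in-memory stand-in for godb.sqlite. Only the columns used by the query
# are created, and every row below is hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE term (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE graph_path (term1_id INTEGER, term2_id INTEGER);
    CREATE TABLE association (term_id INTEGER, gene_product_id INTEGER);
    CREATE TABLE gene_product (id INTEGER PRIMARY KEY, symbol TEXT, species_id INTEGER);
    CREATE TABLE species (id INTEGER PRIMARY KEY, genus TEXT);

    INSERT INTO term VALUES (1, 'execution phase of apoptosis'), (2, 'toy child term');
    INSERT INTO graph_path VALUES (1, 1), (1, 2);
    INSERT INTO species VALUES (1, 'Saccharomyces'), (2, 'Homo');
    INSERT INTO gene_product VALUES (1, 'GENE_A', 1), (2, 'GENE_B', 2), (3, 'GENE_C', 1);
    INSERT INTO association VALUES (2, 1), (1, 2), (2, 3);
""")

# Same logic as the notebook query, with each join condition made explicit.
rows = conn.execute("""
    SELECT DISTINCT gene.symbol
    FROM term AS t1
    JOIN graph_path   AS gp   ON gp.term1_id = t1.id
    JOIN term         AS t2   ON t2.id = gp.term2_id
    JOIN association  AS a    ON a.term_id = t2.id
    JOIN gene_product AS gene ON gene.id = a.gene_product_id
    JOIN species      AS s    ON s.id = gene.species_id
    WHERE t1.name LIKE '%execution phase of apoptosis%'
      AND s.genus = 'Saccharomyces'
    ORDER BY gene.symbol
""").fetchall()
conn.close()

symbols = [symbol for (symbol,) in rows]
print(symbols)
```

In the toy data, `GENE_B` is annotated to the term but belongs to a different genus, so only the two hypothetical *Saccharomyces* genes are returned.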


## [60pt]  mirdb

### Download file & Set up db connection

This section is sufficient for downloading the data file and setting up the 
database connection. You may make changes/improvements or keep it as is.

In the remainder of this problem, you need to use the `mirtxtfile` and `db` 
variables created here.

In [4]:
# Download the database
URL = 'http://sacan.biomed.drexel.edu/lib/exe/fetch.php?rev=&media=course:bcomp2:db:homework_mirdb_dog75.txt'
mirtxtfile = downloadurl(URL, 'mirdb_dog75.txt')

# Connect to the database
mirdbfile = bmes.datadir() + '/mirdb_dog.sqlite'
db = sqlite3.connect(mirdbfile)

### [30 pt]  Create a Database from the mirdb Data

* Any downloaded files should be stored elsewhere on your computer (i.e., in a
  "Temporary" directory). 
* Store the database elsewhere (in "Temporary" directory) on your computer; not 
  within the same folder as your assignment. 

**Note:** If your database creation code does not work, you may use a database
created by the instructor.

In [5]:
# Uncomment the following lines to use the instructor's database. If you 
# are using the instructor's database, we will assume that your database 
# creation code does not work.

# mirdbfile='http://sacan.biomed.drexel.edu/lib/exe/fetch.php?rev=&media=course:bcomp2:db:homework_mirdb_dog.sqlite';
# mirdbfile=downloadurl(mirdbfile);
# db = sqlite3.connect(mirdbfile)

In [9]:
# Delete the table if it already exists
db.execute("DROP TABLE IF EXISTS mir2target")

# Create fresh mir2target table
db.execute("""
    CREATE TABLE IF NOT EXISTS mir2target (
        mirna TEXT(16) COLLATE NOCASE,
        target TEXT(12) COLLATE NOCASE,
        score FLOAT
    )
""")

# Populate the table with data from the tab-separated text file
with open(mirtxtfile, 'r') as file:
    for line in file:
        fields = line.strip().split('\t')
        db.execute("INSERT INTO mir2target VALUES (?, ?, ?)", fields)

db.commit()

# Query the database to check that the data was loaded correctly
select(db, show=True, query="""
    SELECT * FROM mir2target LIMIT 10
""");

|   | mirna | target | score |
|---|---|---|---|
| 0 | cfa-miR-133b | XM_014115527 | 77.9277 |
| 1 | cfa-miR-133b | NM_001003204 | 89.6964 |
| 2 | cfa-miR-133b | XM_005615722 | 89.6925 |
| 3 | cfa-miR-133b | XM_014112508 | 77.1277 |
| 4 | cfa-miR-133b | XM_014120898 | 79.1733 |
| 5 | cfa-miR-133b | XM_546818 | 95.5241 |
| 6 | cfa-miR-133b | XM_014109410 | 75.0686 |
| 7 | cfa-miR-133b | XM_014120614 | 98.5764 |
| 8 | cfa-miR-133b | XM_005640208 | 94.9956 |
| 9 | cfa-miR-133b | XM_846751 | 77.7379 |
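
The loading step above inserts one row per `execute()` call. A possible alternative, sketched here under assumptions, is to parse the tab-separated file with the `csv` module and pass all rows to `executemany()` inside a single transaction. The sample data is a small in-memory stand-in for the real file: the first two rows are copied from the output above, and the third row (`cfa-miR-example`) is made up for illustration.

```python
import csv
import io
import sqlite3

# In-memory stand-in for the tab-separated mirdb text file.
# The last row is hypothetical and exists only for this example.
sample = io.StringIO(
    "cfa-miR-133b\tXM_014115527\t77.9277\n"
    "cfa-miR-133b\tNM_001003204\t89.6964\n"
    "cfa-miR-example\tXM_000000\t50.0\n"
)

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE mir2target (
        mirna  TEXT COLLATE NOCASE,
        target TEXT COLLATE NOCASE,
        score  REAL
    )
""")

# Parse every line into a (mirna, target, score) tuple, converting the score
# to a float so it is not stored as text.
reader = csv.reader(sample, delimiter="\t")
records = [(mirna, target, float(score)) for mirna, target, score in reader]

# "with conn" commits on success and rolls back if an insert raises.
with conn:
    conn.executemany("INSERT INTO mir2target VALUES (?, ?, ?)", records)

n_rows = conn.execute("SELECT COUNT(*) FROM mir2target").fetchone()[0]
print(n_rows)
conn.close()
```

Converting the score explicitly is a design choice for this sketch; the notebook code passes the raw strings and relies on the column's declared type instead.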


### [10 pt]  Find miRNAs for a target

In [10]:
# How many miRNAs are predicted to target XM_532324?

res = select(db, """
    SELECT DISTINCT(mirna) FROM mir2target
    WHERE target = 'XM_532324'
""")
print(f"{res.shape[0]} miRNAs are predicted to target XM_532324")

13 miRNAs are predicted to target XM_532324


In [11]:
# Show at most 10 miRNAs that are predicted to target XM_532324.

print("The following miRNAs are predicted to target XM_532324:")
print('\n'.join(res['mirna'][:10]))

The following miRNAs are predicted to target XM_532324:
cfa-miR-30c
cfa-miR-1185
cfa-miR-342
cfa-miR-8881
cfa-miR-30d
cfa-miR-653
cfa-miR-8824
cfa-miR-30a
cfa-miR-19b
cfa-miR-30b
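
The count above is derived from the shape of the returned DataFrame, which only works when at least one row comes back (the `select` helper returns `None` otherwise). As a hedged alternative, the count can be computed in SQL with `COUNT(DISTINCT ...)` and a `?` placeholder for the target. The sketch below uses a toy in-memory table with hypothetical names (`mir-a`, `GENE_1`, and so on), and `count_mirnas` is an illustrative helper, not part of the notebook.

```python
import sqlite3

# Toy table with the same columns as mir2target; all rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE mir2target (
        mirna  TEXT COLLATE NOCASE,
        target TEXT COLLATE NOCASE,
        score  REAL
    )
""")
conn.executemany(
    "INSERT INTO mir2target VALUES (?, ?, ?)",
    [
        ("mir-a", "GENE_1", 90.0),
        ("mir-b", "GENE_1", 85.0),
        ("mir-a", "GENE_2", 70.0),
    ],
)


def count_mirnas(conn, target):
    """Count distinct miRNAs for a target, binding the target as a parameter."""
    row = conn.execute(
        "SELECT COUNT(DISTINCT mirna) FROM mir2target WHERE target = ?",
        (target,),
    ).fetchone()
    return row[0]


# The NOCASE collation on the column makes the comparison case-insensitive.
n = count_mirnas(conn, "gene_1")
# A target with no rows yields 0 instead of an empty result set.
n_missing = count_mirnas(conn, "GENE_3")
conn.close()
print(n, n_missing)
```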


### [10 pt]  Find targets for a miRNA

In [12]:
# How many predicted targets of cfa-let-7a have a prediction score 
# of at least 80?

res = select(db, """
    SELECT DISTINCT(target) FROM mir2target
    WHERE mirna = 'cfa-let-7a' AND score >= 80
""")
print(f"{res.shape[0]} targets of cfa-let-7a have a prediction score "
      "of at least 80.")

303 targets of cfa-let-7a have a prediction score of at least 80.


In [13]:
# Show at most 10 predicted targets of cfa-let-7a that have a prediction 
# score of at least 80.

print("The following targets of cfa-let-7a have a prediction score "
      "of at least 80:")
print('\n'.join(res['target'][:10]))

The following targets of cfa-let-7a have a prediction score of at least 80:
XM_014119515
XM_847837
XM_014111346
XM_541808
XM_005621935
XM_014118125
XM_847579
XM_005630512
XM_005618982
XM_014114613


### [10 pt]  Summarize miRNAs and target counts

In [14]:
# List the miRNAs and the number of their targets.
# (Each row of the result should contain a distinct miRNA). 
# (Use count() and GROUP BY). Show only top 10 rows of the result.

select(db, show=True, query="""
    SELECT 
        mirna, 
        COUNT(*) AS target_count 
    FROM mir2target
    GROUP BY mirna
    ORDER BY target_count DESC
    LIMIT 10
""");

|   | mirna | target_count |
|---|---|---|
| 0 | cfa-miR-30c | 1545 |
| 1 | cfa-miR-126 | 976 |
| 2 | cfa-miR-137 | 851 |
| 3 | cfa-miR-96 | 682 |
| 4 | cfa-miR-568 | 648 |
| 5 | cfa-let-7e | 487 |
| 6 | cfa-let-7a | 487 |
| 7 | cfa-miR-194 | 459 |
| 8 | cfa-miR-361 | 402 |
| 9 | cfa-miR-133b | 390 |


In [15]:
# Close the database connection
db.close()

## [20 pt]  Performance Comparison - Excel vs. Database

In this section, you are asked to compare the performance of adding & retrieving 
data in a database table vs. in an Excel file.

In [16]:
# Define paths to files
xlsfile = bmes.tempdir() + '/hwdb_performance.xlsx'
dbfile = bmes.tempdir() + '/hwdb_performance.sqlite'

# Delete the files if they already exist, so the performance analysis 
# starts fresh. 
if bmes.isfile(xlsfile): os.remove(xlsfile)
if bmes.isfile(dbfile):  os.remove(dbfile)

### `xls_insertname(xlsfile, name)`

Create an external file `xls_insertname.py` for the function 
`xls_insertname(xlsfile, name)` that:

- Creates the Excel file xlsfile if it does not already exist and writes the 
  header row containing "id" and "name" (without the quotes) as the column names
- Adds the name to the Excel file as a new row, along with its id. You need 
  to automatically determine a unique integer id for this new name being 
  added (similar to SQL's auto-increment feature).
- Returns the total number of names in the table (not including the header row, 
  but including the new row that you just added).

In [17]:
# Testing the xls_insertname() to make sure it works:
from xls_insertname import xls_insertname
print(xls_insertname(xlsfile, randomname()))
print(xls_insertname(xlsfile, randomname()))

1
2


### `db_insertname(dbfile, name)`

Create an external file `db_insertname.py` for the function 
`db_insertname(dbfile, name)` that:

- Creates the database table "names" if the database or the table does not 
  already exist. The table needs to have the columns "id" and "name" (without 
  the quotes).
- Adds the name to the names table as a new row. You should not determine the 
  id of this new row yourself; the database must assign it automatically.
- You may assume that name will be at most 16 characters.
- Returns the total number of names in the table (not including the header row, 
  but including the new row that you just added).
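
The notebook imports `db_insertname` from an external file that is not shown here. The following is only a minimal sketch of one way the requirements above could be met with `sqlite3`; it is not the authors' implementation. The function name `db_insertname_sketch`, the demo file name, and the sample names are all hypothetical.

```python
import os
import sqlite3
import tempfile


def db_insertname_sketch(dbfile, name):
    """Insert a name and return the row count (hypothetical sketch)."""
    conn = sqlite3.connect(dbfile)  # creates the file if it does not exist
    try:
        # AUTOINCREMENT lets SQLite assign the id; the caller never sets it.
        conn.execute("""
            CREATE TABLE IF NOT EXISTS names (
                id   INTEGER PRIMARY KEY AUTOINCREMENT,
                name TEXT
            )
        """)
        conn.execute("INSERT INTO names (name) VALUES (?)", (name,))
        conn.commit()
        return conn.execute("SELECT COUNT(*) FROM names").fetchone()[0]
    finally:
        conn.close()


# Demo in a fresh temporary directory so no existing file is touched.
demo_db = os.path.join(tempfile.mkdtemp(), "names_demo.sqlite")
first = db_insertname_sketch(demo_db, "alice")
second = db_insertname_sketch(demo_db, "bob")
print(first, second)
```

Opening and closing the connection on every call mirrors the function signature in the assignment, which receives a file path rather than an open connection.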

In [18]:
# Testing the db_insertname() to make sure it works:
from db_insertname import db_insertname
print(db_insertname(dbfile, randomname()))
print(db_insertname(dbfile, randomname()))

1
2


### Run & Time Multiple Inserts

In [19]:
# Use bmes.tic() & bmes.toc() to identify how long it takes to make 1000 
# insertions of random 16-character names using xls_insertname(). 
# Report the total elapsed time.

bmes.tic();
for _ in range(1000):
    xls_insertname(xlsfile, randomname());
bmes.toc();

Elapsed: 38.80 sec.


In [20]:
# Use tic() & toc() to identify how long it takes to make 1000 
# insertions of random 16-character names using db_insertname(). 
# Report the total elapsed time.

bmes.tic();
for _ in range(1000):
    db_insertname(dbfile, randomname());
bmes.toc();

Elapsed: 7.42 sec.
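
`bmes.tic()` and `bmes.toc()` come from the course module. Where that module is unavailable, a comparable measurement can be sketched with `time.perf_counter()` from the standard library. In the example below, `time_call` and `insert_one` are hypothetical helpers, and the inserts go to an in-memory database, so the elapsed time it reports is not comparable to the file-based timings above.

```python
import sqlite3
import time


def time_call(func, repeats):
    """Call func() `repeats` times and return the elapsed wall-clock seconds."""
    start = time.perf_counter()
    for _ in range(repeats):
        func()
    return time.perf_counter() - start


# In-memory table used only to have something to time.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE names (id INTEGER PRIMARY KEY AUTOINCREMENT, name TEXT)"
)


def insert_one():
    # One insert and one commit per call, like a per-call insert function.
    conn.execute("INSERT INTO names (name) VALUES (?)", ("placeholder",))
    conn.commit()


elapsed = time_call(insert_one, 1000)
inserted = conn.execute("SELECT COUNT(*) FROM names").fetchone()[0]
conn.close()
print(f"Elapsed: {elapsed:.2f} sec. for {inserted} inserts")
```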
