# Making PrimaryMicroRNA and MatureMicroRNA Tables

The first step in building the database was to initialize two of its tables. The Python package <a href="https://github.com/mkleehammer/pyodbc/wiki">pyodbc</a> was used to connect to an MS SQL Server database from Python, and <a href="https://www.mysql.com/products/connector/">mysql.connector</a> was used to connect to a MySQL database. All of the code for connecting to and editing these databases is contained in my module data_processing.

In [1]:
import data_processing as dp

def create_miR_tables(db_name, sql_version="MySQL", firewall=False):
    """
        Creates the PrimaryMicroRNA and MatureMicroRNA tables which will be filled in later
        
        Will check with user and re-create tables which already exist if user desires.
    """
    db_con = dp.DatabaseConnection(sql_version, db_name=db_name, firewall=firewall)
    db_con.make_table("PrimaryMicroRNA", {"PriMiRName": ["varchar(50)", "NOT NULL"], 
                                      "PriID": ["varchar(50)", "NOT NULL"],
                                      "Chr": ["varchar(2)"], "GenomeStart": ["Int"],
                                      "GenomeEnd": ["Int"], "ChrStrand": ["nchar(1)"],
                                      "StemLoopSeq": ["varchar(200)"], "LongSeq": ["varchar(250)"], 
                                      "MiRFamily": ["varchar(50)"], "RNAfold": ["VARCHAR(200)"],
                                      "HighConfidence": ["NCHAR(1)"]},
                  other_conditions=["PRIMARY KEY (PriID)"])
    
    db_con.make_table("MatureMicroRNA", {"MatMiRName": ["VARCHAR(50)", "NOT NULL"], "MatID": ["VARCHAR(50)", "NOT NULL"], 
                                     "PriID": ["VARCHAR(50)", "NOT NULL"], "MatStart": ["INT"], "MatEnd": ["INT"],
                                     "MatSeq": ["VARCHAR(50)"]}, 
                  other_conditions=["PRIMARY KEY (MatID)", "FOREIGN KEY (PriID) REFERENCES PrimaryMicroRNA (PriID)"])
    db_con.close_cursor()
    db_con.close_connection()

In [2]:
create_miR_tables("miR-test", firewall=True)

The table PrimaryMicroRNA already exists. Would you like to drop and recreate it? (Y/N)Y
Deleting table PrimaryMicroRNA
The table(s) [u'MatureMicroRNA'] have foreign key contstrants on table PrimaryMicroRNA. 
It is necessary to drop these tables to drop PrimaryMicroRNA. 
Would you like to continue and drop these tables? (Y/N)Y
Table PrimaryMicroRNA sucessfully deleted
Creating table PrimaryMicroRNA
CREATE TABLE PrimaryMicroRNA (ChrStrand nchar(1),
    PriID varchar(50) NOT NULL,
    Chr varchar(2),
    StemLoopSeq varchar(200),
    HighConfidence NCHAR(1),
    GenomeEnd Int,
    PriMiRName varchar(50) NOT NULL,
    GenomeStart Int,
    RNAfold VARCHAR(200),
    LongSeq varchar(250),
    MiRFamily varchar(50),
    PRIMARY KEY (PriID)) ENGINE=InnoDB;
Sucessfully created table PrimaryMicroRNA
Creating table MatureMicroRNA
CREATE TABLE MatureMicroRNA (MatEnd INT,
    MatSeq VARCHAR(50),
    MatStart INT,
    PriID VARCHAR(50) NOT NULL,
    MatID VARCHAR(50) NOT NULL,
    MatMiRName VARCHAR
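
The prompt above about dropping MatureMicroRNA before PrimaryMicroRNA follows from the foreign key on PriID. The sketch below illustrates that relationship on a cut-down version of the two tables. It is only an illustration under stated assumptions: it uses Python's built-in sqlite3 with an in-memory database in place of MySQL or MS SQL Server, it does not use data_processing, and the names and IDs are placeholders rather than real miRBase records.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for the MySQL / MS SQL Server database.
con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked

# Cut-down versions of the two tables, keeping only the key columns.
con.execute("""
    CREATE TABLE PrimaryMicroRNA (
        PriMiRName varchar(50) NOT NULL,
        PriID      varchar(50) NOT NULL,
        PRIMARY KEY (PriID)
    )""")
con.execute("""
    CREATE TABLE MatureMicroRNA (
        MatMiRName varchar(50) NOT NULL,
        MatID      varchar(50) NOT NULL,
        PriID      varchar(50) NOT NULL,
        PRIMARY KEY (MatID),
        FOREIGN KEY (PriID) REFERENCES PrimaryMicroRNA (PriID)
    )""")


def try_insert_mature(mat_name, mat_id, pri_id):
    """Return True if the row is accepted, False if a constraint rejects it."""
    try:
        with con:  # commits on success, rolls back on error
            con.execute("INSERT INTO MatureMicroRNA VALUES (?, ?, ?)",
                        (mat_name, mat_id, pri_id))
        return True
    except sqlite3.IntegrityError:
        return False


# Placeholder records, not real miRBase entries.
orphan_accepted = try_insert_mature("example-mat", "MAT_EXAMPLE_1", "PRI_EXAMPLE_1")

with con:
    con.execute("INSERT INTO PrimaryMicroRNA VALUES (?, ?)",
                ("example-pri", "PRI_EXAMPLE_1"))
linked_accepted = try_insert_mature("example-mat", "MAT_EXAMPLE_1", "PRI_EXAMPLE_1")

print(orphan_accepted, linked_accepted)
```

A mature miRNA row is rejected until its primary miRNA exists, which is also why the primary table has to be filled before the mature table during import.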

# Importing Data from miRBase

The tables were initially filled with known data about the miRNAs. 

The first file to be imported is the basic list of known primary and mature microRNAs. This data was downloaded from <a href="http://www.mirbase.org/ftp.shtml">miRBase version 21</a>. 

## hsa.gff3 Import

The hsa.gff3 file, which contains all of the human microRNAs and their genomic locations, was used to fill in the name, ID, chromosome, strand, genomic start and genomic end of the primary and mature miRNAs. It was also used to link the mature miRNAs to their corresponding primary miRNA. 

The hsa.gff3 file has the following format (for more information on gff3 files, see <a href="http://www.sequenceontology.org/gff3.shtml">this site</a>): 

* <b>Column 1</b>: chromosome
* <b>Column 2</b>: . (source; unused in this file)
* <b>Column 3</b>: feature type, either miRNA_primary_transcript or miRNA
* <b>Column 4</b>: genomic start (hg38)
* <b>Column 5</b>: genomic end (hg38)
* <b>Column 6</b>: . (score; unused in this file)
* <b>Column 7</b>: strand of the chromosome
* <b>Column 8</b>: . (phase; unused in this file)
* <b>Column 9</b>: attributes
    * <b>ID</b>: accession number
    * <b>Alias</b>: secondary accession number
    * <b>Name</b>: miRNA name
    * <b>Derives_from</b>: in the case of mature miRNAs, the corresponding primary miRNA accession number
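
To make the column layout above concrete, here is a minimal sketch that splits one record into its fields and turns the attributes column into a dictionary. It is not the import code used for the database; `parse_gff3_line` is a hypothetical helper, and the two sample records are illustrative lines written in the format described above rather than lines checked against the downloaded file.

```python
def parse_gff3_line(line):
    """Split one GFF3 record into the fields of interest plus an attribute dict."""
    columns = line.rstrip("\r\n").split("\t")
    if len(columns) != 9:
        raise ValueError("expected 9 tab-separated columns, got %d" % len(columns))

    # Column 9 is a ';'-separated list of key=value pairs.
    attributes = {}
    for pair in columns[8].split(";"):
        if pair:
            key, _, value = pair.partition("=")
            attributes[key] = value

    # Drop the 'chr' prefix so that 'chr1' is stored as '1'.
    chrom = columns[0]
    if chrom.startswith("chr"):
        chrom = chrom[3:]

    return {
        "chr": chrom,
        "type": columns[2],
        "start": int(columns[3]),
        "end": int(columns[4]),
        "strand": columns[6],
        "attributes": attributes,
    }


# Illustrative records (assumed values, one with a Windows-style line ending).
primary_line = ("chr1\t.\tmiRNA_primary_transcript\t17369\t17436\t.\t-\t.\t"
                "ID=MI0022705;Alias=MI0022705;Name=hsa-mir-6859-1\n")
mature_line = ("chr1\t.\tmiRNA\t17409\t17431\t.\t-\t.\t"
               "ID=MIMAT0027618;Alias=MIMAT0027618;Name=hsa-miR-6859-5p;"
               "Derives_from=MI0022705\r\n")

primary = parse_gff3_line(primary_line)
mature = parse_gff3_line(mature_line)
print(primary["attributes"]["Name"], mature["attributes"]["Derives_from"])
```

Reading the attributes by key, rather than by position, keeps the parse independent of the order in which ID, Alias, Name and Derives_from appear.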

In [1]:
import re
import data_processing as dp

def import_microRNAs(hsaMiRLoc, db_name, sql_version="MySQL", firewall=False):
    """
        Adds the miRNAs and their genomic locations to a database
    """
    
    db_con = dp.DatabaseConnection(sql_version, db_name=db_name, firewall=firewall)
        
    # Clears both tables before beginning import
    db_con.clear_table("MatureMicroRNA")
    db_con.clear_table("PrimaryMicroRNA")

    # Regular expression to find the pri-miR ID and name from the info string
    priRE = re.compile('^ID=(MI[0-9]{7});.*?;Name=(.*?)$')
    # Regular expression to find the mat-miR ID, mat-miR name and corresponding pri-miR from the info string
    matRE = re.compile('^ID=(.*?);.*?;Name=(.*?);Derives_from=(MI[0-9]{7})$')
    # Regular expression to find chromosome name
    chrRE = re.compile('^chr(.*?)$')
    
    with open(hsaMiRLoc, "r") as miRFile:
        pri_dict = {"PriMiRName": [], "PriID": [], "Chr": [], "GenomeStart": [], "GenomeEnd": [], 
                    "ChrStrand": []}
        mat_dict = {"MatID": [], "MatMiRName": [], "PriID": [], "MatStart": [], "MatEnd": []}
        for line in miRFile:
            # Skip header and comment lines
            if line[0] == '#':
                continue
            
            # Strip the line ending first so a stray '\r' cannot end up in the last field
            elements = line.rstrip('\r\n').split('\t')
            # If the miRNA is annotated as a primary miRNA
            if elements[2] == 'miRNA_primary_transcript':
                m = priRE.match(elements[8]) # The 9th element in the line includes several attributes
                priID = m.group(1)
                priName = m.group(2)
                m2 = chrRE.match(elements[0]) # Finds the chromosome number
                chromosome = m2.group(1)
                pri_dict["PriMiRName"] += [priName]
                pri_dict["PriID"] += [priID]
                pri_dict["Chr"] += [chromosome]
                pri_dict["GenomeStart"] += [int(elements[3])]
                pri_dict["GenomeEnd"] += [int(elements[4])]
                pri_dict["ChrStrand"] += [elements[6]]

            # If it's a mature miRNA
            else:
                m = matRE.match(elements[8]) # The 9th element in the line includes several attributes
                matID = m.group(1)
                matName = m.group(2)
                priID = m.group(3)
                mat_dict["MatID"] += [matID]
                mat_dict["MatMiRName"] += [matName]
                mat_dict["PriID"] += [priID]
                mat_dict["MatStart"] += [int(elements[3])]
                mat_dict["MatEnd"] += [int(elements[4])]
                
        db_con.make_many_rows(pri_dict, "PrimaryMicroRNA")
        db_con.make_many_rows(mat_dict, "MatureMicroRNA")
    db_con.close_cursor()
    db_con.close_connection()

In [2]:
hsaMiRLoc = "From miRBase/hsa.gff3.txt"
import_microRNAs(hsaMiRLoc, "miR-test", firewall=True)

## miRNA Confidence Import

Since many of the miRNA annotations come from small RNA deep sequencing experiments and may not represent true miRNAs, miRBase includes a list of high confidence miRNAs. <a href="https://doi.org/10.1093/nar/gkt1181">This paper</a> and <a href="http://www.mirbase.org/blog/2014/07/high-confidence-mirna-set-available-for-mirbase-21/">this blog post</a> explain in detail how miRNAs obtain a high confidence annotation. In general, deep sequencing experiments must show that these miRNAs meet the following criteria:
* \>10 deep sequencing reads which map to each arm of the hairpin or >5 reads per arm and >100 total reads
* Predicted mature miRNAs from each arm have 0 to 4-nt 3' overhangs, which result from Dicer and Drosha cleavage
* \>50% of the reads for each arm have same 5' end
* Hairpin folding free energy < -0.2 kcal/mol/nt
* \>60% of predicted mature miRNAs base pair in the hairpin

NOTE: Annotation of high confidence miRNAs is automated and is unrelated to the number of studies which support the existence of the miRNA (e.g. hsa-mir-21 is not annotated as high confidence).

Those primary miRNAs which are considered high confidence get a 'T' in the HighConfidence column. All other primary miRNAs will have 'NULL' in that column.
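
The criteria listed above can be read as a single yes/no test on a hairpin. The sketch below encodes only the thresholds as they are summarized in this section. It is not the miRBase annotation pipeline: `HairpinEvidence`, its field names, and the example numbers are all hypothetical and chosen purely for illustration.

```python
from collections import namedtuple

# Hypothetical container for the evidence summarized in the list above.
HairpinEvidence = namedtuple("HairpinEvidence", [
    "arm_reads",         # (reads on one arm, reads on the other arm)
    "total_reads",       # all reads mapping to the hairpin
    "overhangs_nt",      # 3' overhang, in nt, at each end of the mature duplex
    "same_5p_fraction",  # per arm, fraction of reads sharing the same 5' end
    "energy_per_nt",     # hairpin folding free energy in kcal/mol/nt
    "paired_fraction",   # per arm, fraction of the mature sequence that is base paired
])


def meets_listed_criteria(e):
    """Apply the thresholds exactly as they are listed in the text."""
    enough_reads = all(r > 10 for r in e.arm_reads) or (
        all(r > 5 for r in e.arm_reads) and e.total_reads > 100)
    return (enough_reads
            and all(0 <= o <= 4 for o in e.overhangs_nt)
            and all(f > 0.5 for f in e.same_5p_fraction)
            and e.energy_per_nt < -0.2
            and all(f > 0.6 for f in e.paired_fraction))


# Made-up numbers, not measurements for any real miRNA.
well_supported = HairpinEvidence((25, 14), 39, (2, 2), (0.8, 0.7), -0.35, (0.9, 0.85))
rescued_by_total = well_supported._replace(arm_reads=(8, 6), total_reads=150)
too_few_reads = well_supported._replace(arm_reads=(8, 6), total_reads=40)
weak_fold = well_supported._replace(energy_per_nt=-0.1)

for evidence in (well_supported, rescued_by_total, too_few_reads, weak_fold):
    print(meets_listed_criteria(evidence))
```

Every input to this test comes from sequencing reads or the predicted fold, which is consistent with the note that the annotation does not depend on how well studied a miRNA is.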

In [1]:
import data_processing as dp

def import_high_conf(highConfFile, db_name, sql_version="MySQL", firewall=False):
    """
    Imports the high confidence miRNAs from miRBase v21
    """
    db_con = dp.DatabaseConnection(sql_version, db_name=db_name, firewall=firewall)
    
    with open(highConfFile, "r") as f:
        conf_dict = {"HighConfidence": []}
        pri_dict = {"PriID": []}
        for line in f:
            # Only look at human high confidence miRNAs
            if line[:5] == ">hsa-":
                elements = line.split(" ")
                priID = elements[1]
                conf_dict["HighConfidence"] += ["T"]
                pri_dict["PriID"] += [priID]
    db_con.update_many_rows(conf_dict, pri_dict, "PrimaryMicroRNA")
    db_con.close_cursor()
    db_con.close_connection()

In [2]:
highConfFile = "From miRBase/high_conf_hairpin.fa"
import_high_conf(highConfFile, "miR-test", firewall=True)
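
For readers without access to data_processing, the sketch below shows one way the same step could look with only the standard library: collect the accession numbers from the human FASTA headers, then flag the matching rows with a batched UPDATE. The internals of update_many_rows are not shown in this notebook, so this is an assumption about an equivalent operation, not a description of it. It uses sqlite3 in memory instead of MySQL, and the header lines, sequences and accession numbers are placeholders.

```python
import sqlite3


def high_conf_accessions(lines, species_prefix=">hsa-"):
    """Return the accession (second field) of every header line for one species."""
    accessions = []
    for line in lines:
        if line.startswith(species_prefix):
            accessions.append(line.split()[1])
    return accessions


# Placeholder FASTA content in the '>name accession description' header layout.
fasta_lines = [
    ">hsa-mir-example-1 ACC0001 Homo sapiens example-1 stem-loop\n",
    "ACGUACGU\n",
    ">mmu-mir-example-1 ACC0002 Mus musculus example-1 stem-loop\n",
    "ACGUACGU\n",
    ">hsa-mir-example-3 ACC0003 Homo sapiens example-3 stem-loop\n",
    "ACGUACGU\n",
]
accessions = high_conf_accessions(fasta_lines)

# Assumption: in-memory SQLite stands in for the real database.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE PrimaryMicroRNA ("
            "PriID varchar(50) PRIMARY KEY, HighConfidence nchar(1))")
with con:
    con.executemany("INSERT INTO PrimaryMicroRNA (PriID) VALUES (?)",
                    [("ACC0001",), ("ACC0003",), ("ACC0004",)])
    # One parameterized UPDATE per high confidence accession.
    con.executemany("UPDATE PrimaryMicroRNA SET HighConfidence = ? WHERE PriID = ?",
                    [("T", acc) for acc in accessions])

flags = dict(con.execute("SELECT PriID, HighConfidence FROM PrimaryMicroRNA"))
print(flags)
```

Rows that are never matched keep a NULL in HighConfidence, which Python reports as None.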

## miRNA Family Import

Next, the information about which miRNAs belong to which family is added. The miFam.dat file was also downloaded from <a href="http://www.mirbase.org/ftp.shtml">miRBase version 21</a> and converted into a text file using Excel because of differences in how line breaks are handled on Linux vs. Windows machines. This file lists, for all species, the miRNAs which belong to each miRNA family. Each family is headed by two lines: *AC* followed by the family ID, and *ID* followed by the family name. The miRNAs which belong to that family are then listed in the format *MI*, miRNA accession number, and miRNA name. After each family, a divider of *//* is included. The code below was used to fill in this information for the human primary miRNAs which belong to a known family of miRNAs.
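
The record layout described above can be sketched as a small parser that maps each human accession number to its family name. This is an illustration rather than the import code: `parse_families` is a hypothetical helper and the family and accession values are placeholders. It splits on any whitespace instead of tabs, on the assumption that this lets one parser read both space-aligned and tab-separated versions of the file, with either style of line ending.

```python
def parse_families(lines, species_prefix="hsa-"):
    """Map miRNA accession -> family name for one species."""
    family = None
    result = {}
    for line in lines:
        fields = line.split()  # any run of spaces or tabs; also drops '\r' and '\n'
        if not fields:
            continue
        tag = fields[0]
        if tag == "ID" and len(fields) >= 2:
            # The ID line carries the family name for the MI lines that follow.
            family = fields[1]
        elif tag == "MI" and len(fields) >= 3 and fields[2].startswith(species_prefix):
            result[fields[1]] = family
        elif tag == "//":
            # End of this family's record.
            family = None
    return result


# Placeholder records: one space-aligned, one tab-separated.
family_lines = [
    "AC   FAM0001\n",
    "ID   example-fam-a\n",
    "MI   ACC0001  hsa-mir-example-1\n",
    "MI   ACC0002  mmu-mir-example-1\r\n",
    "//\n",
    "AC\tFAM0002\n",
    "ID\texample-fam-b\n",
    "MI\tACC0003\thsa-mir-example-2\n",
    "//\n",
]
families = parse_families(family_lines)
print(families)
```

Matching the species with a prefix test is slightly stricter than a substring test, since it cannot match a name that merely contains "hsa" somewhere in the middle.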

In [4]:
import data_processing as dp

def import_miR_family(famFile, db_name, sql_version="MySQL", firewall=False):
    """
        Imports the miRNA family data and fills in the PrimaryMicroRNA MiRFamily column
    """
    db_con = dp.DatabaseConnection(sql_version, db_name=db_name, firewall=firewall)
    
    with open(famFile, "r") as f:
        fam_dict = {"MiRFamily": []}
        pri_dict = {"PriID": []}
        for line in f:
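            # Strip the line ending first so it cannot end up in the family name
            elements = line.rstrip('\r\n').split('\t')
            # The ID line carries the family name and heads the group of miRNAs which belong to that family
            if elements[0] == "ID":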
                family = elements[1]
            # Only looks at human miRNAs
            elif elements[0] == "MI" and "hsa" in elements[2]:
                priID = elements[1]
                fam_dict["MiRFamily"] += [family]
                pri_dict["PriID"] += [priID]
                
    db_con.update_many_rows(fam_dict, pri_dict, "PrimaryMicroRNA")
    db_con.close_cursor()
    db_con.close_connection()

In [5]:
famFile = "From miRBase/miFam.txt"
import_miR_family(famFile, "miR-test", firewall=True)