# Introduction to Databases and Structured Query Language (SQL)

As Data Scientists, you will frequently want to store data in an organized, structured manner that allows you to do complex queries.  Because you are good Data Scientists, [**you do not use Excel!!!**](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-5-80)

In this course, we will only discuss **Relational Databases**, because those are the most common in bioinformatics.  (There are other kinds!!).  So when I say "database" I mean "relational database".

Databases are used to store information in a manner that, when used properly, is:

* highly structured
* constrained (i.e. detects errors)
* transactional (i.e. can undo a command if it discovers a problem)
* indexed (for speed of search)
* searchable
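
To make "constrained" and "transactional" a little more concrete, here is a minimal sketch in Python. It is only an illustration under some assumptions: it uses SQLite (through Python's built-in `sqlite3` module) as a stand-in for the MySQL server we use in this course, and the `student` table is hypothetical. You do not need to understand the SQL yet.

```python
import sqlite3

# Assumption: an in-memory SQLite database stands in for the MySQL server.
conn = sqlite3.connect(":memory:")
conn.execute("create table student (id INTEGER PRIMARY KEY, name VARCHAR(30) NOT NULL)")
conn.commit()

# Constrained: the NOT NULL rule rejects a row that has no name.
try:
    conn.execute("insert into student (name) values (NULL)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

# Transactional: an insert that has not been committed can be undone.
conn.execute("insert into student (name) values ('Mark Wilkinson')")
conn.rollback()
remaining = conn.execute("select count(*) from student").fetchone()[0]

print("bad row rejected:", rejected)
print("rows left after rollback:", remaining)
```

The database, not your own code, is what catches the bad row and undoes the insert.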
  
The core concept of a database is a **Table**.  Tables contain one particular "kind" of information (e.g. a Table could represent a Student, a University, a Book, or a Clinical Record).

Tables contain **Rows** and **Columns** where, generally, every column represents a "feature" of that information (e.g. a Student table might have **["name", "gender", "studentID", "age"]** as its columns/features).  Every row represents an "individual", and its values for each feature (e.g. a Row in a Student table might have **["Mark Wilkinson", "M", "163483", "35"]** as its values).

A Database may have many Tables that represent various kinds of related information.  For example, a library database might have a Books table, a Publishers table, and a Locations table.  A Book has a Publisher, and a Location, so the tables need to be connected to one another.  This is achieved using **keys**.  Generally, every row (individual) in a table has a unique identifier (usually a number), and this is called its **key**.  Because it is unique, it is possible to refer unambiguously to that individual record.
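
If keys feel abstract, this small sketch in plain Python (no database involved) mimics the idea. The book and publisher names are invented for illustration, and each "table" is simply a dictionary whose dictionary keys play the role of the row keys.

```python
# Hypothetical library data: each "table" maps a unique key to one row.
publishers = {
    1: {"name": "Example Press"},
    2: {"name": "Sample Books"},
}
books = {
    1: {"title": "Intro to Databases", "publisher_id": 2},
    2: {"title": "Seeds and Stocks", "publisher_id": 1},
}

def publisher_of(book_key):
    """Follow the key stored in a book row to the matching publisher row."""
    publisher_key = books[book_key]["publisher_id"]
    return publishers[publisher_key]["name"]

print(publisher_of(1))  # Sample Books
```

The book row does not copy the publisher's details; it stores only the publisher's key, and that is enough to find the right record.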

I think the easiest way to learn about databases and SQL is to start building one!  We will use the MySQL Docker Container that we created in the previous lesson.  We are going to create a Germplasm database (seed stocks).  It will contain information about the seed (its amount, its harvest date, its location), the germplasm (its species, the allele it carries), and the genetics related to that allele (the gene_id, the gene name, the protein name, and a link to the GenBank record).

(if that container isn't running, please **docker start course-mysql** now!)

**Note:  This Jupyter Notebook is running the Python kernel.  This allows us to use some nice tools in Python (the sql extension and SqlMagic) that provide access to the mysql database server from inside of the Notebook.  You don't need to know any Python to do this.  Note also that you can do exactly the same commands in your Terminal window.**

To connect to the MySQL Docker Container from your terminal window, type:

     mysql -h 127.0.0.1 -P 3306 --protocol=tcp -u root -p
 
(then enter your password 'root' to access the database)
 
<pre>


</pre>
# SQL

Structured Query Language is a way to interact with a database server.  It is used to create, delete, edit, fill, and query tables and their contents.  

First, we will learn the SQL commands that allow us to explore the database server, and create new databases and tables.  Later, we will use SQL to put information into those tables.  Finally, we will use SQL to query those tables.


## Python SQL Extension

The commands below are used to connect to the MySQL server in our Docker Container.  You need to execute them ONCE.  In every subsequent Jupyter code window, you will have access to the database.

All SQL commands are preceded by 

     %sql 
     
(**only in the Python extension!  Not in your terminal window!**)

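All SQL commands end with a ";" (the mysql client in your terminal needs it to know that the command is complete; with %sql a single command will also run without it, as you will see in some cells below).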

In [2]:
%load_ext sql
%config SqlMagic.autocommit=False
%sql mysql+pymysql://root:root@127.0.0.1:3306/mysql
#%sql mysql+pymysql://anonymous@ensembldb.ensembl.org/homo_sapiens_core_92_38
            

'Connected: root@mysql'
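
The connection string in that cell packs several settings into one URL-style line. As a rough sketch (using only Python's standard library, not the sql extension itself), we can pull it apart to see what each piece means. The labels in the dictionary are my own names for the parts.

```python
from urllib.parse import urlsplit

# The connection string passed to %sql above.
url = urlsplit("mysql+pymysql://root:root@127.0.0.1:3306/mysql")

parts = {
    "dialect+driver": url.scheme,      # database type, plus the Python driver used to reach it
    "user": url.username,
    "password": url.password,
    "host": url.hostname,
    "port": url.port,
    "database": url.path.lstrip("/"),  # the database selected when we connect
}

for name, value in parts.items():
    print(f"{name}: {value}")
```

These are the same values you give to the terminal command (host 127.0.0.1, port 3306, user root, password root).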

## show databases

**show databases** is the command to see what databases exist in the server.  The ones you see now are the default databases that MySQL uses to organize itself.  _**DO NOT TOUCH THESE DATABASES**_

In [5]:
%sql show databases;



 * mysql+pymysql://root:***@127.0.0.1:3306/mysql
4 rows affected.


Database
information_schema
mysql
performance_schema
sys


## create database

The command to create a database is **create database** (surprise!  ;-) )

We will create a database called "germplasm"




In [9]:
%sql create database germplasm;
%sql show databases


 * mysql+pymysql://root:***@127.0.0.1:3306/mysql
1 rows affected.
 * mysql+pymysql://root:***@127.0.0.1:3306/mysql
5 rows affected.


Database
information_schema
germplasm
mysql
performance_schema
sys


## use database_name

The **use** command tells the server which database you want to interact with.  Here we will use the database we just created.

In [11]:
%sql use germplasm



 * mysql+pymysql://root:***@127.0.0.1:3306/mysql
0 rows affected.


[]

## show tables

The show tables command shows what tables the database contains (right now, none!)

In [12]:
%sql show tables

 * mysql+pymysql://root:***@127.0.0.1:3306/mysql
0 rows affected.


Tables_in_germplasm


# Planning your data structure

This is the hard part.  What does our data "look like" in a well-structured, relational format?

Starting simply:

<center>stock table</center>

  amount  |  date  |  location  
 --- | --- | --- 
  5  | 10/5/2013 |  Room 2234  
  9.8  | 12/1/2015 |  Room 998  


-----------------------------


<center>germplasm table</center>

  taxonid  |  allele
 --- | --- 
  4150  | def-1
  3701  | ap3
  
--------------------------------

<center>gene table</center>

  gene  |  gene_name  |  embl
 --- | ---  | --- 
  DEF  | Deficiens  | https://www.ebi.ac.uk/ena/data/view/AB516402
  AP3  | Apetala3   |   https://www.ebi.ac.uk/ena/data/view/AF056541
  
  
  


## add indexes

It is usually a good idea to give every table an index column (a unique id for each row, which serves as its key), so let's add that first:


<center>stock table</center>

id  |  amount  |  date  |  location  
---  | --- | --- | --- 
1   |  5  | 10/5/2013 |  Room 2234  
2   |   9.8  | 12/1/2015 |  Room 998  


-----------------------------


<center>germplasm table</center>

id  |  taxonid  |  allele
--- |  --- | --- 
1  |  4150  | def-1
2  |   3701  | ap3
  
--------------------------------

<center>gene table</center>

id  |  gene  |  gene_name  |  embl
--- | --- | ---  | --- 
1  |  DEF  | Deficiens  | https://www.ebi.ac.uk/ena/data/view/AB516402
2  |  AP3  | Apetala3   |   https://www.ebi.ac.uk/ena/data/view/AF056541
  


##  find linkages

* Every germplasm has a stock record.  This is a 1:1 relationship.
* Every germplasm represents a specific gene.  This is a 1:1 relationship.

So every germplasm row must point to the index (id) of a stock row, and also to the index (id) of a gene row.

Adding that into our tables we have:



<center>stock table</center>

id  |  amount  |  date  |  location  
---  | --- | --- | --- 
1   |  5  | 10/5/2013 |  Room 2234  
2   |   9.8  | 12/1/2015 |  Room 998  


-----------------------------


<center>germplasm table</center>

id  |  taxonid  |  allele  |  stock_id  |  gene_id
--- |  --- | ---  | --- | ---
1  |  4150  | def-1  | 2   |  1
2  |   3701  | ap3   | 1   |  2
  
--------------------------------

<center>gene table</center>

id  |  gene  |  gene_name  |  embl
--- | --- | ---  | --- 
1  |  DEF  | Deficiens  | https://www.ebi.ac.uk/ena/data/view/AB516402
2  |  AP3  | Apetala3   |   https://www.ebi.ac.uk/ena/data/view/AF056541
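
Before building anything, it can be worth checking that the planned links really are 1:1. The following is only a sketch in plain Python: the rows are copied from the germplasm table above (using the column name `gene_id`, as in the table definitions later in this lesson), and `check_links` is a helper I made up for this purpose.

```python
from collections import Counter

# Rows copied from the planned germplasm table.
germplasm = [
    {"id": 1, "taxonid": 4150, "allele": "def-1", "stock_id": 2, "gene_id": 1},
    {"id": 2, "taxonid": 3701, "allele": "ap3", "stock_id": 1, "gene_id": 2},
]
stock_ids = {1, 2}  # ids present in the stock table
gene_ids = {1, 2}   # ids present in the gene table

def check_links(rows, column, valid_ids):
    """Return (dangling, reused): ids that point at no row, and ids used more than once."""
    counts = Counter(row[column] for row in rows)
    dangling = sorted(i for i in counts if i not in valid_ids)
    reused = sorted(i for i, n in counts.items() if n > 1)
    return dangling, reused

print(check_links(germplasm, "stock_id", stock_ids))  # ([], [])
print(check_links(germplasm, "gene_id", gene_ids))    # ([], [])
```

Two empty lists mean that every link points at an existing row and that no stock or gene is shared by two germplasm rows, which is what 1:1 requires.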
  


## data types in MySQL

I will not discuss [all MySQL Datatypes](https://dev.mysql.com/doc/refman/5.7/en/data-types.html); we will look only at the ones we need.  We need:

* Integers (type INTEGER)
* Floating point (type FLOAT)
* Date  (type DATE [in yyyy-mm-dd format](https://dev.mysql.com/doc/refman/5.7/en/datetime.html) )
* Characters (small, variable-length --> type [VARCHAR(x)](https://dev.mysql.com/doc/refman/5.7/en/char.html) )
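
The planning tables write dates like 10/5/2013, whereas the DATE type expects yyyy-mm-dd. Here is a small sketch of the conversion in Python. It assumes the planned dates are written day/month/year, which matches the values loaded into the stock table later in this lesson. `to_mysql_date` is just a helper name I chose.

```python
from datetime import datetime

def to_mysql_date(text):
    """Convert a day/month/year string such as '10/5/2013' to 'yyyy-mm-dd'."""
    # Assumption: the input is day/month/year, NOT month/day/year.
    return datetime.strptime(text, "%d/%m/%Y").date().isoformat()

for planned in ["10/5/2013", "12/1/2015"]:
    print(planned, "->", to_mysql_date(planned))
```

If your own dates are month/day/year, swap `%d` and `%m` in the format string. The database cannot guess which one you meant.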

<pre>


</pre>
## create table 

Tables are created using the **create table** command (surprise!)

The [syntax of create table](https://dev.mysql.com/doc/refman/5.7/en/create-table.html) can be quite complicated, but we are only going to do the simplest examples.

    create table table_name (column_name column_definition)
    
Column definitions include the data type, plus other options such as whether the column is allowed to be null (blank), or whether it should be treated as an "index" column.

Examples are easier to understand than words... so here are our table definitions:

    

In [20]:
#%sql drop table stock
%sql create table stock(id INTEGER NOT NULL AUTO_INCREMENT PRIMARY KEY, amount FLOAT NOT NULL, date DATE NOT NULL, location VARCHAR(20) NOT NULL);
%sql describe stock


 * mysql+pymysql://root:***@127.0.0.1:3306/mysql
0 rows affected.
 * mysql+pymysql://root:***@127.0.0.1:3306/mysql
0 rows affected.
 * mysql+pymysql://root:***@127.0.0.1:3306/mysql
4 rows affected.


Field,Type,Null,Key,Default,Extra
id,int(11),NO,PRI,,auto_increment
amount,float,NO,,,
date,date,NO,,,
location,varchar(20),NO,,,


In [23]:
#%sql drop table germplasm
%sql create table germplasm(id INTEGER NOT NULL AUTO_INCREMENT PRIMARY KEY, taxonid INTEGER NOT NULL, allele VARCHAR(10) NOT NULL, stock_id INTEGER NOT NULL, gene_id INTEGER NOT NULL);
%sql describe germplasm


 * mysql+pymysql://root:***@127.0.0.1:3306/mysql
0 rows affected.
 * mysql+pymysql://root:***@127.0.0.1:3306/mysql
5 rows affected.


Field,Type,Null,Key,Default,Extra
id,int(11),NO,PRI,,auto_increment
taxonid,int(11),NO,,,
allele,varchar(10),NO,,,
stock_id,int(11),NO,,,
gene_id,int(11),NO,,,


In [24]:
#%sql drop table gene
%sql create table gene(id INTEGER NOT NULL AUTO_INCREMENT PRIMARY KEY, gene VARCHAR(10) NOT NULL, gene_name VARCHAR(30) NOT NULL, embl VARCHAR(70) NOT NULL);
%sql describe gene

 * mysql+pymysql://root:***@127.0.0.1:3306/mysql
0 rows affected.
 * mysql+pymysql://root:***@127.0.0.1:3306/mysql
4 rows affected.


Field,Type,Null,Key,Default,Extra
id,int(11),NO,PRI,,auto_increment
gene,varchar(10),NO,,,
gene_name,varchar(30),NO,,,
embl,varchar(70),NO,,,
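
If you want to experiment with these definitions without touching the MySQL server, the sketch below creates the same three tables in a throwaway SQLite database from Python. This is an approximation, not the MySQL behaviour: SQLite spells the option `AUTOINCREMENT` (one word, directly after `INTEGER PRIMARY KEY`), and `PRAGMA table_info` is only a rough counterpart of **describe**.

```python
import sqlite3

# Assumption: a throwaway in-memory SQLite database, not the course's MySQL server.
conn = sqlite3.connect(":memory:")

conn.execute("""create table stock(
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    amount FLOAT NOT NULL,
    date DATE NOT NULL,
    location VARCHAR(20) NOT NULL)""")

conn.execute("""create table germplasm(
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    taxonid INTEGER NOT NULL,
    allele VARCHAR(10) NOT NULL,
    stock_id INTEGER NOT NULL,
    gene_id INTEGER NOT NULL)""")

conn.execute("""create table gene(
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    gene VARCHAR(10) NOT NULL,
    gene_name VARCHAR(30) NOT NULL,
    embl VARCHAR(70) NOT NULL)""")

# Each result row is (cid, name, type, notnull, default_value, pk).
columns = conn.execute("PRAGMA table_info(germplasm)").fetchall()
for cid, name, col_type, notnull, default, pk in columns:
    print(name, col_type, "NOT NULL" if notnull else "", "PRIMARY KEY" if pk else "")
```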


## loading data

There are many ways to import data into MySQL.  If you have data in another (identical) MySQL database, you can "dump" the data, and then import it directly.  If you have tab- or comma-delimited files (tsv, csv), you can **sometimes** import them directly.  You can also enter data using SQL itself.  This is usually the safest way when you have to keep multiple tables synchronized (as we do, since the germplasm table is "linked to" the other two tables).

## insert into

The command to load data is:

    insert into table_name (field1, field2, field3) values (value1, value2, value3)
    
Now... what data do we need to add, in what order?

The germplasm table needs the ID number from both the gene table and the stock table, so we cannot enter the germplasm information first.  We must therefore enter the gene and stock data first.


In [25]:
# NOTE - we DO NOT put data into the "id" column!  This column is auto_increment, so it "magically" creates its own value
%sql insert into gene (gene, gene_name, embl) values ('DEF', 'Deficiens', 'https://www.ebi.ac.uk/ena/data/view/AB516402');
%sql insert into gene (gene, gene_name, embl) values ('AP3', 'Apetala3', 'https://www.ebi.ac.uk/ena/data/view/AF056541');


 * mysql+pymysql://root:***@127.0.0.1:3306/mysql
1 rows affected.
 * mysql+pymysql://root:***@127.0.0.1:3306/mysql
1 rows affected.




In [28]:
%sql select last_insert_id();  # just to show you that this function exists!

 * mysql+pymysql://root:***@127.0.0.1:3306/mysql
1 rows affected.


last_insert_id()
2


In [29]:
%sql insert into stock(amount, date, location) values (5, '2013-05-10', 'Room 2234');
%sql insert into stock(amount, date, location) values (9.8, '2015-01-12', 'Room 998');


 * mysql+pymysql://root:***@127.0.0.1:3306/mysql
1 rows affected.
 * mysql+pymysql://root:***@127.0.0.1:3306/mysql
1 rows affected.


[]
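
The `last_insert_id()` function shown above hints at one way to keep linked tables synchronized: capture each new id right after the insert, and reuse it in the row that points to it. Here is a hedged sketch of that idea in Python. It runs against a throwaway SQLite database rather than our MySQL server, so it uses SQLite's `AUTOINCREMENT` spelling, `?` placeholders, and `cursor.lastrowid` in place of `last_insert_id()`.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for MySQL; table layouts follow this lesson.
conn = sqlite3.connect(":memory:")
conn.executescript("""
create table gene(id INTEGER PRIMARY KEY AUTOINCREMENT, gene VARCHAR(10) NOT NULL,
                  gene_name VARCHAR(30) NOT NULL, embl VARCHAR(70) NOT NULL);
create table stock(id INTEGER PRIMARY KEY AUTOINCREMENT, amount FLOAT NOT NULL,
                   date DATE NOT NULL, location VARCHAR(20) NOT NULL);
create table germplasm(id INTEGER PRIMARY KEY AUTOINCREMENT, taxonid INTEGER NOT NULL,
                       allele VARCHAR(10) NOT NULL, stock_id INTEGER NOT NULL,
                       gene_id INTEGER NOT NULL);
""")
cur = conn.cursor()

# The gene and stock rows go in first, because germplasm needs their ids.
cur.execute("insert into gene (gene, gene_name, embl) values (?, ?, ?)",
            ("DEF", "Deficiens", "https://www.ebi.ac.uk/ena/data/view/AB516402"))
def_gene_id = cur.lastrowid  # id generated for the DEF gene

cur.execute("insert into stock (amount, date, location) values (?, ?, ?)",
            (5, "2013-05-10", "Room 2234"))
cur.execute("insert into stock (amount, date, location) values (?, ?, ?)",
            (9.8, "2015-01-12", "Room 998"))
def_stock_id = cur.lastrowid  # id generated for the Room 998 stock

# Now the germplasm row can point at both of them.
cur.execute("insert into germplasm (taxonid, allele, stock_id, gene_id) values (?, ?, ?, ?)",
            (4150, "def-1", def_stock_id, def_gene_id))
conn.commit()

germplasm_row = conn.execute("select * from germplasm").fetchone()
print(germplasm_row)
```

Capturing the id at insert time avoids looking it up afterwards, but it only works if you read it immediately after the insert it belongs to.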

#### Almost ready!  

We now need to know the index numbers from the stock and gene tables that correspond to the data for the germplasm table.  For this, we need to learn another command:  **select**

## Select statements

**Select** is the command used to query the database.  We will look at it in more detail later, but for now all you need to know is that the most basic structure is:

     select * from table_name
     


In [33]:
%sql select * from stock;  # notice that the id number was automatically generated


 * mysql+pymysql://root:***@127.0.0.1:3306/mysql
2 rows affected.


id,amount,date,location
1,5.0,2013-05-10,Room 2234
2,9.8,2015-01-12,Room 998


In [32]:
%sql select * from gene;

 * mysql+pymysql://root:***@127.0.0.1:3306/mysql
2 rows affected.


id,gene,gene_name,embl
1,DEF,Deficiens,https://www.ebi.ac.uk/ena/data/view/AB516402
2,AP3,Apetala3,https://www.ebi.ac.uk/ena/data/view/AF056541
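
The %sql extension prints the result for us, but the same **select** can also hand its rows to a program. As a rough sketch (again with a throwaway SQLite database standing in for MySQL, and a small copy of the gene table), this is how the rows look from Python, and how we could look up an id by gene symbol. The `id_for_gene` dictionary is my own addition.

```python
import sqlite3

# Assumption: in-memory SQLite copy of the gene table, not the MySQL server.
conn = sqlite3.connect(":memory:")
conn.execute("""create table gene(
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    gene VARCHAR(10) NOT NULL,
    gene_name VARCHAR(30) NOT NULL,
    embl VARCHAR(70) NOT NULL)""")
conn.executemany(
    "insert into gene (gene, gene_name, embl) values (?, ?, ?)",
    [
        ("DEF", "Deficiens", "https://www.ebi.ac.uk/ena/data/view/AB516402"),
        ("AP3", "Apetala3", "https://www.ebi.ac.uk/ena/data/view/AF056541"),
    ],
)

cur = conn.execute("select * from gene")
header = [col[0] for col in cur.description]  # column names, in table order
rows = cur.fetchall()                         # one tuple per row

print(",".join(header))
for row in rows:
    print(",".join(str(value) for value in row))

# Build a lookup from gene symbol to its automatically generated id.
id_for_gene = {gene: gene_id for gene_id, gene, _name, _embl in rows}
print(id_for_gene["AP3"])
```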


<pre>


</pre>

Just a reminder, our germplasm data is:

id  |  taxonid  |  allele  |  stock_id  |  gene_id
--- | --- | --- | --- | ---
1  |  4150  | def-1  |   |  
2  |  3701  | ap3  |   |  