In [None]:
from IPython.display import display, HTML
display(HTML("<style>.container { width:80% !important; }</style>"))

# Lecture 14 - Interfacing with SQLite

# Table of Contents
* [Lecture 14 - Interfacing with SQLite](#Lecture-14---Interfacing-with-SQLite)
	* [Content](#Content)
	* [Learning Outcomes](#Learning-Outcomes)
	* [Connecting SQLite to the Database](#Connecting-SQLite-to-the-Database)
		* [(1) iterate through each record in the data frame](#%281%29-iterate-through-each-record-in-the-data-frame)
		* [(2) construct a tuple that will contain all the data from each row](#%282%29-construct-a-tuple-that-will-contain-all-the-data-from-each-row)
		* [(3) construct a SQL string containing the SQL insert statement](#%283%29-construct-a-SQL-string-containing-the-SQL-insert-statement)
		* [(4) execute the SQL string  statement together with the data](#%284%29-execute-the-SQL-string--statement-together-with-the-data)
	* [Connecting DataFrames with SQLite](#Connecting-DataFrames-with-SQLite)


---

### Learning Outcomes

At the end of this lecture, you should be able to:

* connect to SQLite using Python scripts  
* create a database and select it using Python 
* create tables in a selected database
* construct insert statements with data from a dataframe
* execute inserts into tables
* construct and execute select statements using Python scripts

When our data is heavily relational, it can be more advantageous to store and work with it in a relational database.

SQLite is a software library that implements a self-contained, serverless, zero-configuration, transactional SQL database engine. Unlike MySQL, Oracle or Microsoft SQL Server, SQLite does not require a separate server process to act as the RDBMS.

SQLite is simply a library that the application calls directly. It is the most widely deployed database engine in the world.

In [None]:
import pandas as pd
import sqlite3
import datetime as dt
import numpy as np

### Connecting SQLite to the Database

We use the function sqlite3.connect to connect to a database. Passing the argument ":memory:" creates a temporary database in RAM; passing a file name opens that file, creating it if it does not exist yet.

In [None]:
# Create a database in RAM
connection = sqlite3.connect(':memory:')

In [None]:
# Creates or opens a file called mySQLiteDB.sl3 with a SQLite3 DB
connection = sqlite3.connect('../datasets/mySQLiteDB.sl3')

A cursor object gives us a 'handle' to the specified database and allows us to execute commands and traverse the records from the result set.  

In [None]:
cursor = connection.cursor()
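
As a minimal sketch of what "traversing the records from the result set" can look like, the cell below uses a throwaway in-memory database. The table name `demo` and its contents are hypothetical and exist only for this illustration.

```python
import sqlite3

# Throwaway in-memory database, so the lecture's database file is not touched
demo_connection = sqlite3.connect(':memory:')
demo_cursor = demo_connection.cursor()

# 'demo' is a hypothetical table used only for this illustration
demo_cursor.execute("CREATE TABLE demo (id INTEGER PRIMARY KEY, label TEXT)")
demo_cursor.executemany("INSERT INTO demo (label) VALUES (?)",
                        [('a',), ('b',), ('c',)])

# After a SELECT, the cursor holds the result set and hands rows out in order
demo_cursor.execute("SELECT id, label FROM demo ORDER BY id")
first_row = demo_cursor.fetchone()       # the next row, or None when exhausted
remaining_rows = demo_cursor.fetchall()  # every row not yet consumed

print(first_row)
print(remaining_rows)

demo_cursor.close()
demo_connection.close()
```

Each row comes back as a plain tuple, and rows that have already been fetched are not returned again.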


With the cursor, we can now execute commands to create SQL tables.

We will use the population example from the previous lectures to demonstrate how a table based on this example can be created and its data can be inserted. 

In [None]:
data = pd.DataFrame({'population':[3778000, 19138000, 20000, 447000, 4433000, 22680000, 10900, 549598],
                     'year':[2000, 2000, 2000, 2000, 2014, 2014, 2014, 2014],
                     'nation':['New Zealand', 'Australia', 'Cook Islands', 'Solomon Islands', 
                                'New Zealand', 'Australia', 'Cook Islands', 'Solomon Islands']})
data

We can now create a DB table to store this data.



In [None]:
national_populations = """
    CREATE TABLE national_populations (
      entry INTEGER PRIMARY KEY ,
      nation VARCHAR(20) NOT NULL,
      population INTEGER(10) NOT NULL,
      year date NOT NULL
    ) 
    """

national_populations

In [None]:
cursor.execute("DROP TABLE IF EXISTS national_populations")
cursor.execute(national_populations)

If we perform any operation that changes the database, as opposed to only querying it, we need to commit those changes via the .commit() method before we close the connection. Otherwise, pending changes may be lost.

In [None]:
connection.commit()
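
To illustrate why the commit matters, here is a small sketch on a separate in-memory database (the table `demo` is hypothetical). It assumes the default transaction handling of the sqlite3 module, where an INSERT implicitly opens a transaction that stays pending until commit or rollback.

```python
import sqlite3

demo_connection = sqlite3.connect(':memory:')
demo_cursor = demo_connection.cursor()
demo_cursor.execute("CREATE TABLE demo (label TEXT)")
demo_connection.commit()

# This insert is made permanent by the commit that follows it
demo_cursor.execute("INSERT INTO demo (label) VALUES ('kept')")
demo_connection.commit()

# This insert is still pending, so rollback() discards it
demo_cursor.execute("INSERT INTO demo (label) VALUES ('discarded')")
demo_connection.rollback()

labels = [row[0] for row in demo_cursor.execute("SELECT label FROM demo")]
print(labels)

demo_connection.close()
```

Only the committed row survives; closing a connection with uncommitted changes has the same effect as the rollback here.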

We can now begin inserting data from a data frame into the table.

In [None]:
data

Of course, we could perform the row insertions manually, one by one, by writing out the SQL statement as a string with all the values embedded.

In [None]:
sql_statement = """
            INSERT INTO national_populations 
            (nation, population, year) 
            VALUES ('New Zealand', 3778000, '2000-01-01')
            """

We then execute the SQL statement by passing it to the *execute()* method as an argument, followed by a call to commit.

In [None]:
cursor.execute(sql_statement)
connection.commit()

If you are using Firefox, you can install a plugin that lets you view your database tables graphically:

https://addons.mozilla.org/en-US/firefox/addon/sqlite-manager-webext/

https://add0n.com/sqlite-manager.html?version=0.2.2&type=install

**Exercise:** Write code to insert the second row of the above data frame into the database.

**Exercise:** Turn to the person next to you and discuss the potential issues with the above approach to inserting data into a database if you are faced with millions of records.

Clearly, this approach to inserting data does not scale to bigger, real-world problems.

What is needed is a more automated approach.

What we are after is a construct that lets us mark which parts of the INSERT statement string are placeholders, to be substituted with actual values at execution time.

Below is an example of how we can create placeholders that will be replaced with values from a data frame. The placeholders are represented by the *?*.

In [None]:
add_national_entry = """
            INSERT INTO national_populations 
            (nation, population, year) 
            VALUES (?, ?, ?)
                     """

Next, we need to create a tuple holding the 3 values in the order in which they will be substituted for each *?*.

In [None]:
substitution_values = ('Cook Islands', 20000,  dt.date(2010, 1, 1))
substitution_values

We then call the execute method on the cursor with the above arguments:

In [None]:
cursor.execute(add_national_entry, substitution_values)
connection.commit()
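
Besides enabling automation, placeholders also take care of quoting. The sketch below is an illustration on a separate in-memory database; the table `demo_populations`, the name "O'Brien Atoll" and the value 123 are all made up for the example. It compares a placeholder insert with a statement built by pasting the value into the SQL text.

```python
import sqlite3

demo_connection = sqlite3.connect(':memory:')
demo_cursor = demo_connection.cursor()
demo_cursor.execute(
    "CREATE TABLE demo_populations (nation TEXT, population INTEGER)")

# Made-up value: the apostrophe would end the SQL string literal too early
tricky_name = "O'Brien Atoll"

# With placeholders, the value is passed separately and needs no escaping
demo_cursor.execute(
    "INSERT INTO demo_populations (nation, population) VALUES (?, ?)",
    (tricky_name, 123))

# Pasting the value into the SQL text produces a malformed statement
naive_sql = ("INSERT INTO demo_populations (nation, population) "
             "VALUES ('%s', %d)" % (tricky_name, 123))
try:
    demo_cursor.execute(naive_sql)
    naive_failed = False
except sqlite3.OperationalError as error:
    naive_failed = True
    print("Naive statement failed:", error)

stored = demo_cursor.execute(
    "SELECT nation FROM demo_populations").fetchall()
print(stored)

demo_connection.close()
```

The same mechanism is what protects against SQL injection when values come from outside the program.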

**Exercise**: Following the example above, write code that inserts the values 'Germany', '84000000', 2014 into the table.

We now have the tools and the mechanism to automate this entire process. For this we will need to use iteration.

Below are the steps we must follow:

##### (1) iterate through each record in the data frame

In [None]:
data.iterrows()

In [None]:
for index, row in data.iterrows():
    print("iteration: ", index)
    print(row['nation'], row['population'], row['year'])

##### (2) construct a tuple that will contain all the data from each row

In [None]:
for index, row in data.iterrows():
    national_entry_data = (row['nation'], row['population'], dt.date(row['year'], 1, 1))
    print(national_entry_data)

##### (3) construct a SQL string containing the SQL insert statement

In [None]:
add_national_entry = """
                            INSERT INTO national_populations 
                            (nation, population, year) 
                            VALUES (?, ?, ?)
                     """

for index, row in data.iterrows():
    print(add_national_entry)

##### (4) execute the SQL string  statement together with the data

In [None]:
add_national_entry = """
                            INSERT INTO national_populations 
                            (nation, population, year) 
                            VALUES (?, ?, ?)
                        """

for index, row in data.iterrows():
    substitution_values = (row['nation'], row['population'], dt.date(row['year'], 1, 1))
    cursor.execute(add_national_entry, substitution_values)
    
# !!!!!! VERY IMPORTANT !!!!!!
# without the commit below, the inserts are not saved to the database
connection.commit()

We could also have done a multiple insert:

In [None]:
cursor.executemany(
      """INSERT INTO national_populations (nation, population, year) 
      VALUES (?, ?, ?)""",
      [
      ('China', 1382323332,  dt.date(2016, 1, 1)),
      ('USA', 324118787,  dt.date(2016, 1, 1)),
      ('Russia', 143439832,  dt.date(2016, 1, 1))
      ] )

connection.commit()
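
The loop from step (4) and *executemany()* can be combined. The following is a sketch under some assumptions: it uses a small copy of the data frame and an in-memory database, and it stores the date as an ISO string via *isoformat()* instead of relying on the sqlite3 module's default date conversion. The names `demo_data` and `rows` are illustrative.

```python
import datetime as dt
import sqlite3

import pandas as pd

# Small illustrative copy of the first two rows of the lecture's data frame
demo_data = pd.DataFrame({'nation': ['New Zealand', 'Australia'],
                          'population': [3778000, 19138000],
                          'year': [2000, 2000]})

demo_connection = sqlite3.connect(':memory:')
demo_connection.execute("""
    CREATE TABLE national_populations (
      entry INTEGER PRIMARY KEY,
      nation VARCHAR(20) NOT NULL,
      population INTEGER(10) NOT NULL,
      year date NOT NULL
    )""")

# Build one tuple per row. The int() casts turn NumPy integers into plain
# Python ints, which sqlite3 binds as SQLite integers.
rows = [(str(r.nation), int(r.population),
         dt.date(int(r.year), 1, 1).isoformat())
        for r in demo_data.itertuples(index=False)]

demo_connection.executemany(
    """INSERT INTO national_populations (nation, population, year)
       VALUES (?, ?, ?)""", rows)
demo_connection.commit()

inserted = demo_connection.execute(
    "SELECT nation, population, year "
    "FROM national_populations ORDER BY entry").fetchall()
print(inserted)

demo_connection.close()
```

For large data frames, a single *executemany()* call tends to be more convenient than calling *execute()* once per row, although both produce the same table contents.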


We can now query that table into which we have just inserted data.

In [None]:
query = ("SELECT * FROM national_populations ")
cursor.execute(query)

In [None]:
for (entry, nation, population, year) in cursor:
    print(nation, population, year)
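
Unpacking each row by position works, but it breaks as soon as the column order of the query changes. As a hedged alternative, the sqlite3 module provides the *sqlite3.Row* row factory, which allows access by column name. The sketch below uses an in-memory database and a hypothetical table `demo_populations`.

```python
import sqlite3

demo_connection = sqlite3.connect(':memory:')
# Rows returned by cursors of this connection now support access by name
demo_connection.row_factory = sqlite3.Row
demo_cursor = demo_connection.cursor()

demo_cursor.execute(
    "CREATE TABLE demo_populations "
    "(nation TEXT, population INTEGER, year TEXT)")
demo_cursor.execute(
    "INSERT INTO demo_populations VALUES (?, ?, ?)",
    ('New Zealand', 3778000, '2000-01-01'))

demo_cursor.execute("SELECT nation, population, year FROM demo_populations")
row = demo_cursor.fetchone()

column_names = row.keys()
print(column_names)
print(row['nation'], row['population'], row['year'])

demo_connection.close()
```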

**Exercise**: Write code that queries the table above by selecting countries which have populations above 3 million. Return nation name and population only.

We can get some metadata about our national_populations table using the PRAGMA command. The PRAGMA table_info(tableName) command returns one row for each column in the named table. The columns of the result set are the column order number, the column name, the data type, whether or not the column is declared NOT NULL, the default value for the column, and whether the column is part of the primary key.

In [None]:
info = cursor.execute('PRAGMA table_info(national_populations)')
for c in info:
        print(c[0], c[1], c[2], c[3], c[4], c[5])
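
The six fields can also be unpacked by name, which makes the output easier to work with. This is a small sketch on an in-memory copy of the table schema; the dictionary layout is just one possible choice.

```python
import sqlite3

demo_connection = sqlite3.connect(':memory:')
demo_connection.execute("""
    CREATE TABLE national_populations (
      entry INTEGER PRIMARY KEY,
      nation VARCHAR(20) NOT NULL,
      population INTEGER(10) NOT NULL,
      year date NOT NULL
    )""")

# Each row of table_info is (cid, name, type, notnull, dflt_value, pk)
columns = {}
table_info = demo_connection.execute(
    "PRAGMA table_info(national_populations)")
for cid, name, col_type, notnull, default_value, pk in table_info:
    columns[name] = {'type': col_type,
                     'not_null': bool(notnull),
                     'primary_key': bool(pk)}

for name, details in columns.items():
    print(name, details)

demo_connection.close()
```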

We can create an index on a column of a SQLite table:

In [None]:
sql = ("CREATE INDEX ind_year ON national_populations (year);")
cursor.execute(sql)
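
One way to check that an index exists, and that SQLite intends to use it, is sketched below on an in-memory copy of the schema. It relies on PRAGMA index_list and EXPLAIN QUERY PLAN. The exact wording of the plan text differs between SQLite versions, so the code only looks for the index name in it.

```python
import sqlite3

demo_connection = sqlite3.connect(':memory:')
demo_connection.execute("""
    CREATE TABLE national_populations (
      entry INTEGER PRIMARY KEY,
      nation VARCHAR(20) NOT NULL,
      population INTEGER(10) NOT NULL,
      year date NOT NULL
    )""")
demo_connection.execute(
    "CREATE INDEX ind_year ON national_populations (year)")

# The second field of each index_list row is the index name
index_names = [row[1] for row in demo_connection.execute(
    "PRAGMA index_list(national_populations)")]
print(index_names)

# The last field of each EXPLAIN QUERY PLAN row is a human-readable detail
plan = demo_connection.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT nation FROM national_populations WHERE year = ?",
    ('2014-01-01',)).fetchall()
plan_details = [row[-1] for row in plan]
print(plan_details)

demo_connection.close()
```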

Once finished using a database, release its resources by closing both the cursor and the connection.

In [None]:
cursor.close()
connection.close()
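
Closing can also be automated. The sketch below assumes the standard library helper *contextlib.closing* and a hypothetical table `demo`. Note that using a connection directly in a *with* statement only commits or rolls back the transaction; it does not close the connection, which is why *closing* is wrapped around it.

```python
import sqlite3
from contextlib import closing

with closing(sqlite3.connect(':memory:')) as demo_connection:
    # 'with demo_connection' commits on success and rolls back on an exception
    with demo_connection:
        demo_connection.execute("CREATE TABLE demo (label TEXT)")
        demo_connection.execute("INSERT INTO demo VALUES ('a')")
    row_count = demo_connection.execute(
        "SELECT COUNT(*) FROM demo").fetchone()[0]

print(row_count)

# Outside the outer block the connection has been closed
try:
    demo_connection.execute("SELECT 1")
    still_open = True
except sqlite3.ProgrammingError:
    still_open = False
print(still_open)
```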

**Exercise**: Create a database and a table schema to store the data from the adult_mortality_rate_by_cause.csv, adult_mortality_rates.csv, child_mortality_rates.csv, and total_health_expenditure_percent_per_capita_of_gdp_by_country_per_year.csv datasets cleaned in the previous tutorials. 

Write a script to insert the data from a data frame into the database.



### Connecting DataFrames with SQLite

We can read the result of a SQL query into a DataFrame with *read_sql_query*, which returns a DataFrame corresponding to the result set of the query string. Optionally, provide an index_col parameter to use one of the columns as the index; otherwise, the default integer index is used.

In [None]:
connection = sqlite3.connect("../datasets/mySQLiteDB.sl3")
   

In [None]:
df_sql = pd.read_sql_query('SELECT * '
                           'FROM national_populations '
                           'WHERE population > 1000000 '
                           'LIMIT 3', connection)
df_sql
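
The optional arguments of *read_sql_query* can be sketched as follows, on an in-memory database with a hypothetical table `demo_populations`. The *params* argument passes values for *?* placeholders, in the same way as *execute()*, and *index_col* selects the column that becomes the index.

```python
import sqlite3

import pandas as pd

demo_connection = sqlite3.connect(':memory:')
demo_connection.execute(
    "CREATE TABLE demo_populations "
    "(nation TEXT, population INTEGER, year INTEGER)")
demo_connection.executemany(
    "INSERT INTO demo_populations VALUES (?, ?, ?)",
    [('New Zealand', 3778000, 2000),
     ('Cook Islands', 20000, 2000),
     ('Australia', 19138000, 2000)])
demo_connection.commit()

# The threshold is supplied through params rather than pasted into the SQL
demo_df = pd.read_sql_query(
    'SELECT nation, population '
    'FROM demo_populations '
    'WHERE population > ? '
    'ORDER BY population DESC',
    demo_connection, params=(1000000,), index_col='nation')
print(demo_df)

demo_connection.close()
```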

In [None]:
data

In [None]:
df_sql = pd.read_sql_query('SELECT nation, AVG(population) AS avg_population '
                           'FROM national_populations '
                           'GROUP BY nation '
                           'ORDER BY avg_population DESC '
                           'LIMIT 6', connection)

df_sql

We can write a dataframe directly to a SQLite database table without specifying column types: 

In [None]:
df_sql.to_sql('temp_results', connection, if_exists='append', index=False)

In [None]:
pd.read_sql_query('SELECT * FROM temp_results', connection)
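
The *if_exists* argument decides what happens when the target table is already there. A minimal sketch with a made-up two-row frame and an in-memory database shows the difference between 'append' and 'replace'.

```python
import sqlite3

import pandas as pd

demo_connection = sqlite3.connect(':memory:')
demo_df = pd.DataFrame({'nation': ['New Zealand', 'Australia'],
                        'population': [3778000, 19138000]})

count_query = 'SELECT COUNT(*) AS n FROM temp_results'

# 'append' adds the rows again on every call, so running it twice duplicates them
demo_df.to_sql('temp_results', demo_connection, if_exists='append', index=False)
demo_df.to_sql('temp_results', demo_connection, if_exists='append', index=False)
rows_after_append = int(
    pd.read_sql_query(count_query, demo_connection)['n'].iloc[0])

# 'replace' drops the existing table and recreates it from the frame
demo_df.to_sql('temp_results', demo_connection, if_exists='replace', index=False)
rows_after_replace = int(
    pd.read_sql_query(count_query, demo_connection)['n'].iloc[0])

print(rows_after_append, rows_after_replace)

demo_connection.close()
```

This matters when re-running a notebook cell: with 'append', every re-run adds another copy of the rows to the table.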

**Exercise:** Import country_info.csv and convert it into a SQLite database using the above approach. Then write a SQL query that counts, for each currency, the number of countries that use it, and list the top 10 in a dataframe. 

In [None]:
# Environment notes (Anaconda3-2018.12, Python 3.7): installing nbextensions.
# These are terminal commands, not Python; run them in a shell or Anaconda Prompt.
#
# activate base
# conda install -c conda-forge jupyter_contrib_nbextensions
#
# jupyter contrib nbextension install --user
# jupyter nbextension enable spellchecker/main
# jupyter nbextension enable codefolding/main
# conda install -c conda-forge jupyter_nbextensions_configurator
# jupyter nbextensions_configurator enable --user

In [None]:
%%javascript
require(['base/js/utils'],
function(utils) {
   utils.load_extensions('calico-spell-check', 'calico-document-tools', 'calico-cell-tools');
});

https://pypi.org/project/ipython-sql/

In [None]:
!pip install ipython-sql

In [None]:
%load_ext sql

In [None]:
%%sql sqlite://
CREATE TABLE writer (first_name, last_name, year_of_death);
INSERT INTO writer VALUES ('William', 'Shakespeare', 1616);
INSERT INTO writer VALUES ('Bertolt', 'Brecht', 1956);

In [None]:
%sql select * from writer

https://github.com/ipython/ipython/wiki/Extensions-Index


https://jupyter-contrib-nbextensions.readthedocs.io/en/latest/

Jupyter
https://www.dataquest.io/blog/advanced-jupyter-notebooks-tutorial/
