### Hello, databases

Databases are _very_ common software tools. They are fundamental infrastructure underlying the digital world. 

In this module, you will learn to use databases and understand how they work. This notebook will help you get set up with one particular database, [SQLite](https://www.sqlite.org/index.html). However, it's good to remember that SQLite is just one kind of database. There are lots of others too. We'll talk about them at the end of this notebook.

### Installation and setup

If you are using Anaconda, then it is very easy to import SQLite. The `sqlite3` module is part of Python's standard library, so it is included in standard conda environments.

In [3]:
import sqlite3 as sql 

If you were able to import `sqlite3` in the prior cell, congrats, you are all set up! If you are not using conda, you might have to do some more work to install and configure `sqlite3`.
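
As a quick sanity check, you can also print the version of the SQLite library that your Python installation uses. This is only a small sketch: `sqlite_version` is an attribute of the standard `sqlite3` module, the variable name `sqlite_library_version` is made up for this example, and the exact version string will differ from machine to machine.

```python
import sqlite3 as sql

# Version of the underlying SQLite library, as a string like "3.x.y"
sqlite_library_version = sql.sqlite_version
print("SQLite library version:", sqlite_library_version)
```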

In [14]:
## This cell shows you how to create a database

dbname = "shapedatabase"     # name the database

conn  = sql.connect(dbname)    # Connect to the database
cur = conn.cursor()            # A cursor is the object you use to send SQL statements
                                    # to the database and to read back the results
 
'''
We interact with databases via SQL statements. 
A SQL statement is a short command, either requesting information 
from a database or updating a database. The SQL statement below
creates a table called "shapes" in the database "shapedatabase"
'''

SQLStatement = '''
CREATE TABLE shapes ( 
id INTEGER PRIMARY KEY AUTOINCREMENT, 
shape VARCHAR, 
color VARCHAR 
)'''     

cur.execute(SQLStatement) 

<sqlite3.Cursor at 0x7f959f765dc0>

**Question** 

Try running `cur.execute(SQLStatement)` a second time. What happens? Why do you think this occurs?

[Type your answer here]

In [42]:
### This code prints out the structure of the database 

SQLStatement = '''pragma table_info('shapes')'''     # this command requests the schema for the table shapes 

sth = cur.execute(SQLStatement).fetchall()

print("cid", "name", "type", "primary key")
for s in sth:
    print(s[0], s[1], s[2], s[-1])

cid name type primary key
0 id INTEGER 1
1 shape VARCHAR 0
2 color VARCHAR 0


**Questions** 

How many rows are printed out above? Why do you think that is the case?

[Type your answer here]

What do you think the cid column might represent?

[Type your answer here]

Why is there a 1 in the primary key column for the first row?

[Type your answer here]
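
Besides `pragma table_info`, SQLite keeps a catalog of the tables in a database in a built-in table called `sqlite_master`. The sketch below lists the table names in a database. To avoid touching `shapedatabase`, it assumes a throwaway in-memory database (the special name `":memory:"`) and a hypothetical table called `demo_shapes`.

```python
import sqlite3 as sql

# ":memory:" creates a temporary database that disappears when the connection closes
demo_conn = sql.connect(":memory:")
demo_cur = demo_conn.cursor()

# hypothetical table, used only for this illustration
demo_cur.execute(
    "CREATE TABLE demo_shapes (id INTEGER PRIMARY KEY, shape VARCHAR, color VARCHAR)"
)

# sqlite_master has one row per table, index, view, or trigger
table_query = "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
table_names = [row[0] for row in demo_cur.execute(table_query)]
print(table_names)

demo_conn.close()
```

Running the same query against your own connection would show which tables currently exist in `shapedatabase`.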


### Hello, INSERT statement

In [43]:
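# create the table used in this example (does nothing if it already exists)
cur.execute('''
CREATE TABLE IF NOT EXISTS atable (
id INTEGER PRIMARY KEY AUTOINCREMENT,
some_key VARCHAR,
some_value VARCHAR
)''')

insert_data = '''
INSERT INTO atable (some_key, some_value) VALUES (?,?)'''

# insert one row; each ? is filled in from the tuple, in order
cur.execute(insert_data, ('foo', 'bar'))

# insert several rows with a single call
some_data = (
    ('fob', 'baz'),
    ('zoo', 'bee')
    )
cur.executemany(insert_data, some_data)

# insert a row without some_value; the missing column is stored as NULL
cur.execute('INSERT INTO atable (some_key) VALUES (?)', ('noval',))

conn.commit()    # save the inserted rows to the database file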

In [None]:
query = ''' 
SELECT id, some_key, some_value FROM atable''' 
 
sth = cur.execute(query) 
results = sth.fetchall() 
for i, key, val in results: 
	print(i, key, val) 
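
Two details about inserting and querying are worth a short sketch. First, inserted rows are only saved once you call `commit()` on the connection. Second, `?` placeholders work in `SELECT` statements too, which lets you filter rows safely. The example below assumes a throwaway in-memory database and hypothetical names (`demo_conn`, `demo_shapes`); it is an illustration, not part of the assignment.

```python
import sqlite3 as sql

demo_conn = sql.connect(":memory:")   # temporary database, nothing is written to disk
demo_cur = demo_conn.cursor()

demo_cur.execute("""
CREATE TABLE demo_shapes (
id INTEGER PRIMARY KEY AUTOINCREMENT,
shape VARCHAR,
color VARCHAR
)""")

rows = [("circle", "red"), ("square", "blue"), ("triangle", "red")]
demo_cur.executemany("INSERT INTO demo_shapes (shape, color) VALUES (?,?)", rows)
demo_conn.commit()                    # make the inserts permanent

# the ? is filled in from the tuple, just like in an INSERT
red_query = "SELECT id, shape FROM demo_shapes WHERE color = ? ORDER BY id"
red_shapes = demo_cur.execute(red_query, ("red",)).fetchall()
print(red_shapes)

demo_conn.close()
```

Only the two red shapes come back, with the ids they were assigned at insert time.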

In this notebook, you got set up with SQLite. Before wrapping up, it is worth pointing out that there are lots and lots of different kinds of databases. We picked SQLite for this assignment set because it is easy to get working. In your career in information science, you will have many database choices.

Postgres and MySQL are standard, established, and popular open-source relational databases. Postgres in particular is a great choice for a reliable and performant database to support many applications (beware, Postgres can be slightly annoying to set up). Beyond reliable favorites like Postgres and MySQL, there are lots and lots of other kinds of databases, which fill different software niches and use cases. For instance, standard databases store records on disk, but VoltDB stores records in memory. Some databases like Oracle (expensive, “enterprise” software) might support complex permission structures (i.e., who can access what record), which are needed by very large organizations. CockroachDB replicates information in many places, to ensure that information stays accessible. Google BigQuery supports a SQL-like API for records stored on Google Cloud.

In this module, we will also talk about software systems that store information like databases, but might not support a database-like API. For instance, there is a system called Hadoop, which stores information in files rather than tables and processes it with MapReduce, a programming model that originated at Google. Systems like MongoDB store records as JSON-like documents, rather than as rows in a table.

In general, when people say “database” they mean something that supports a SQL-like API and guarantees four properties anytime you interact with the API. These properties are known as [ACID](https://en.wikipedia.org/wiki/ACID): atomicity, consistency, isolation, and durability. There are many, many articles online explaining the ACID properties.
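
To make one of these properties concrete, here is a minimal sketch of atomicity (the "A" in ACID) using `sqlite3`. It assumes a throwaway in-memory database and a hypothetical `demo_accounts` table. A transfer between two accounts is interrupted halfway by a simulated error, and the database undoes the half-finished work. The sketch relies on the fact that a `sqlite3` connection used in a `with` block commits if the block succeeds and rolls back if an exception escapes.

```python
import sqlite3 as sql

demo_conn = sql.connect(":memory:")
demo_conn.execute("CREATE TABLE demo_accounts (name VARCHAR PRIMARY KEY, balance INTEGER)")
demo_conn.executemany(
    "INSERT INTO demo_accounts (name, balance) VALUES (?,?)",
    [("alice", 100), ("bob", 0)],
)
demo_conn.commit()

try:
    with demo_conn:  # one transaction: all of it happens, or none of it
        demo_conn.execute(
            "UPDATE demo_accounts SET balance = balance - 50 WHERE name = ?", ("alice",)
        )
        # simulate a crash before the matching credit to bob is written
        raise RuntimeError("simulated failure halfway through the transfer")
except RuntimeError:
    pass  # the with block has already rolled the transaction back

balances = demo_conn.execute(
    "SELECT name, balance FROM demo_accounts ORDER BY name"
).fetchall()
print(balances)
demo_conn.close()
```

Alice still has her original balance of 100, because the debit was rolled back along with the rest of the failed transaction.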

**Question** 

VoltDB stores records in memory but Postgres stores records on disk. Why do you think someone designed an in-memory database? What might be the advantage of this sort of software? What kinds of applications might be suited to VoltDB? (Hint: accessing disk is slow)

The point of all that is not to overwhelm you! Instead, imagine that you are walking into the tool aisle at Home Depot. There are hundreds and hundreds of different tools on the shelves. When you install SQLite, you are essentially picking one particular kind of wrench. That's just a good thing to keep in mind. 