### Creating and modifying databases
#### Creating a table
To create an empty database, we can write:

In [3]:
import pandas as pd
import sqlite3

db = sqlite3.connect("my_database.db")

Recall that this was the same command we used to open up a database or SQLite file. The command opens up the file if such a file exists and creates the file otherwise. So, in this case, it will create the file 'my_database.db' in your working directory.
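As a small illustration of this open-or-create behavior, here is a sketch that works in a temporary directory, so it does not touch your working directory. The file name demo.db and the table note are made up for this example.

```python
import os
import sqlite3
import tempfile
from contextlib import closing

# Work in a throwaway directory so the sketch leaves no files behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo.db")  # hypothetical file name
    existed_before = os.path.exists(path)

    # First connect: the file does not exist yet, so it gets created.
    with closing(sqlite3.connect(path)) as first:
        first.execute("CREATE TABLE note (body TEXT)")  # hypothetical table
        first.commit()
    existed_after = os.path.exists(path)

    # Second connect: the same call now opens the existing file.
    with closing(sqlite3.connect(path)) as second:
        tables = second.execute(
            "SELECT name FROM sqlite_master WHERE type='table'"
        ).fetchall()

print(existed_before, existed_after, tables)
```

The table created through the first connection is still there when we connect a second time, which shows that the second call opened the existing file rather than creating a new one.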

To actually create a table, we have to write a query. The query must give the name of the table and the name of each column followed by its data type. Here is an example:

In [4]:
query = "CREATE TABLE customer (id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT, age INTEGER)"

By including the keyword PRIMARY KEY after the column id, we specify that this column uniquely identifies each row and is the basis of the index for looking things up in this table. This means that selecting rows by id will be very fast.
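A primary key also has to be unique. The following sketch, which uses a throwaway in-memory database (the special name ":memory:") and made-up names, shows that SQLite rejects a second row with an id that is already taken.

```python
import sqlite3

# ":memory:" gives a temporary database that lives only in RAM.
db_demo = sqlite3.connect(":memory:")
db_demo.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, first_name TEXT)")
db_demo.execute("INSERT INTO customer VALUES (1, 'Ada')")  # made-up row

try:
    # Same id as the row above, so the primary key constraint is violated.
    db_demo.execute("INSERT INTO customer VALUES (1, 'Grace')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

print(duplicate_rejected)
db_demo.close()
```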

To run this query, we will show you a different way that does not rely on the pandas read_sql_query() function. Instead, it works by creating a Cursor object from the Connection object db as follows:

In [5]:
cursor = db.cursor()

And now this cursor object has a method called execute through which we can run our query.

In [6]:
cursor.execute(query)


<sqlite3.Cursor at 0x263795bb650>

Now we can check that the table was created correctly by using a query we have already seen before

In [7]:
cursor.execute("SELECT name FROM sqlite_master WHERE type='table';")

<sqlite3.Cursor at 0x263795bb650>

This query returns the names of all the tables in our database. However, the results are not automatically returned to us as they were when we used pandas; we must run an additional command to get them. The fetchall() method returns all the results of a query as a list of tuples, one tuple per row.

In [8]:
results = cursor.fetchall()
print(results)

[('customer',)]
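Besides fetchall(), a cursor can hand back rows one at a time. Here is a minimal sketch, using an in-memory database and a second made-up table called orders, of fetchone() and of looping over the cursor directly.

```python
import sqlite3

demo = sqlite3.connect(":memory:")
cur = demo.cursor()
cur.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, first_name TEXT)")
cur.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER)")  # made-up table

cur.execute("SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")
first = cur.fetchone()        # the next row as a tuple
rest = [row for row in cur]   # iterating the cursor yields the remaining rows
after = cur.fetchone()        # None once the results are used up

print(first, rest, after)
demo.close()
```

Fetching rows one by one can be handy when a query returns more rows than we want to hold in memory at once.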


#### Adding rows
Let’s now try to enter some data into our newly created database. We can insert rows into our database using the INSERT query. We must specify one value for each column in the correct order. Here is an example

In [9]:
query = "INSERT INTO customer VALUES (701, 'Mackenzie', 'Fox', 35)"
cursor.execute(query)

<sqlite3.Cursor at 0x263795bb650>

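At this point the new row is visible only through our own connection. The insert is part of an open transaction: other connections to the database file will not see the row, and the change is lost if we close the connection, until we commit the transaction.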

In [10]:
db.commit()
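To see what commit() changes, here is a sketch with two connections to the same file. It assumes the default transaction handling of the sqlite3 module; the file name commit_demo.db and the helper count_rows are made up for this example.

```python
import os
import sqlite3
import tempfile
from contextlib import closing


def count_rows(connection):
    """Hypothetical helper: number of rows in the customer table."""
    # fetchall() reads the whole result, so no read lock is left behind.
    return connection.execute("SELECT COUNT(*) FROM customer").fetchall()[0][0]


with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "commit_demo.db")
    with closing(sqlite3.connect(path)) as writer, closing(sqlite3.connect(path)) as reader:
        writer.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, first_name TEXT)")
        writer.commit()

        # The INSERT opens a transaction on the writer connection.
        writer.execute("INSERT INTO customer VALUES (701, 'Mackenzie')")
        writer_before = count_rows(writer)  # the writer sees its own change
        reader_before = count_rows(reader)  # the other connection does not

        writer.commit()
        reader_after = count_rows(reader)   # now the row is visible everywhere

print(writer_before, reader_before, reader_after)
```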

Let’s now check that the changes were made correctly. We have already seen how to select all the rows from a given table

In [12]:
cursor.execute("SELECT * FROM customer;")
results = cursor.fetchall()
print(results)

[(701, 'Mackenzie', 'Fox', 35)]
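When the values come from Python variables, the sqlite3 module lets us pass them separately from the query text by writing ? placeholders. The sketch below uses an in-memory copy of the customer table; the third customer is made up to show that a name containing a quote character needs no special handling.

```python
import sqlite3

demo = sqlite3.connect(":memory:")
cur = demo.cursor()
cur.execute(
    "CREATE TABLE customer (id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT, age INTEGER)"
)

# One ? per value; the values are passed as a tuple.
cur.execute("INSERT INTO customer VALUES (?, ?, ?, ?)", (701, "Mackenzie", "Fox", 35))

# executemany() runs the same query once for each tuple in the list.
more = [(702, "Emily", "Joy", 49), (703, "Noah", "O'Neil", 27)]  # 703 is made up
cur.executemany("INSERT INTO customer VALUES (?, ?, ?, ?)", more)
demo.commit()

cur.execute("SELECT id, last_name FROM customer ORDER BY id")
rows = cur.fetchall()
print(rows)
demo.close()
```

Placeholders are generally preferable to building the query string by hand, because the module takes care of quoting the values.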


#### Adding columns
In addition to rows, we can also add columns to a table at any point in time. Let’s suppose we want to add a column giving the city of each customer. We can do this as follows:

In [13]:
cursor.execute("ALTER TABLE customer ADD COLUMN city TEXT DEFAULT 'Geneva';")
db.commit()

The DEFAULT keyword lets us specify a default value, which is applied to all of the existing rows and to any later row that is inserted without a value for this column. If we now rerun our "SELECT * FROM customer;" query and fetch the results, we will see that the new column was added correctly:

In [14]:
cursor.execute("SELECT * FROM customer;")
results = cursor.fetchall()
print(results)

[(701, 'Mackenzie', 'Fox', 35, 'Geneva')]
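The default value also applies to rows added later, as long as the INSERT query lists the columns it supplies and leaves city out. Here is a sketch on an in-memory copy of the table; customer 703 and the city Basel are made up for the example.

```python
import sqlite3

demo = sqlite3.connect(":memory:")
cur = demo.cursor()
cur.execute(
    "CREATE TABLE customer (id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT, age INTEGER)"
)
cur.execute("INSERT INTO customer VALUES (701, 'Mackenzie', 'Fox', 35)")
cur.execute("ALTER TABLE customer ADD COLUMN city TEXT DEFAULT 'Geneva'")

# city is not listed, so the default value is used for this row.
cur.execute(
    "INSERT INTO customer (id, first_name, last_name, age) VALUES (702, 'Emily', 'Joy', 49)"
)
# All five values are given, so the default is not used here.
cur.execute("INSERT INTO customer VALUES (703, 'Noah', 'Keller', 27, 'Basel')")
demo.commit()

cur.execute("SELECT id, city FROM customer ORDER BY id")
cities = cur.fetchall()
print(cities)
demo.close()
```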


#### Adding data from a pandas DataFrame

Most of the time we will not be creating a database from scratch, but taking data from an existing format and moving it into a database. Below we take a look at adding data to a database from an existing pandas DataFrame.

Let’s start by defining a DataFrame

In [15]:
import pandas as pd

df2 = pd.DataFrame(
    [[702, "Emily", "Joy", 49, "Geneva"]],
    columns=["id", "first_name", "last_name", "age", "city"],
)

Aside: We take this opportunity to point out a little trick when defining a DataFrame: we have usually defined DataFrames so far from Python dictionaries, but we can also do it by just passing a list of lists as seen above. This method has the advantage that the order of the columns is stated explicitly in the columns argument. With a dictionary, the column order follows the insertion order of the keys in Python 3.7 and later, but older versions gave no such guarantee.

To transfer this data into a database we can use the pandas method to_sql(). We must give the name of the table that we want the data to be stored in, and the connection to the database. We can add the argument if_exists to tell pandas what to do if the table already exists. In this case, let’s try adding the data from this DataFrame to the already existing table customer in the database db.

In [16]:
df2.to_sql(name="customer", con=db, if_exists="append", index=False)

The table customer already exists in this case, and we have asked pandas to just append the data to it. The argument index=False tells pandas not to write the index of the DataFrame as a column in the table; by default (index=True) the index would be written.

Let’s check our results by querying our database one more time



In [17]:
cursor.execute("SELECT * FROM customer;")
results = cursor.fetchall()
print(results)

[(701, 'Mackenzie', 'Fox', 35, 'Geneva'), (702, 'Emily', 'Joy', 49, 'Geneva')]
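The if_exists argument accepts "fail" (the default), "append" and "replace". The sketch below tries each of them on an in-memory database with a shortened, made-up version of the DataFrame.

```python
import sqlite3

import pandas as pd

demo = sqlite3.connect(":memory:")
frame = pd.DataFrame(
    [[702, "Emily", "Joy"]], columns=["id", "first_name", "last_name"]
)

# The table does not exist yet, so the default if_exists="fail" creates it.
frame.to_sql(name="customer", con=demo, index=False)

# "append" adds the rows to the existing table.
frame.to_sql(name="customer", con=demo, if_exists="append", index=False)
rows_after_append = len(pd.read_sql_query("SELECT * FROM customer", demo))

# "replace" drops the table and writes it again from the DataFrame.
frame.to_sql(name="customer", con=demo, if_exists="replace", index=False)
rows_after_replace = len(pd.read_sql_query("SELECT * FROM customer", demo))

# With the default "fail", pandas raises an error if the table exists.
try:
    frame.to_sql(name="customer", con=demo, index=False)
    fail_raised = False
except ValueError:
    fail_raised = True

print(rows_after_append, rows_after_replace, fail_raised)
demo.close()
```

Note that a table created by to_sql() has no primary key, which is why the same id can be appended twice in this sketch.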


#### Updating rows
We can update any row in our database with a special SQL query that uses the UPDATE keyword. We must specify the name of the table we want to update and then pass the new values for each column that we want updated. We can specify which rows the update applies to by using the WHERE clause that you have seen before.

Let’s look at an example. Let’s say we want to change the city of the customer Emily Joy in our table from Geneva to Zurich. We can identify this row using the id column. Our query then looks like this

In [18]:
update = """
UPDATE customer
SET city='Zurich'
WHERE id=702;
"""

In [19]:
cursor.execute(update)
db.commit()

In [20]:
# Check results
cursor.execute("SELECT * FROM customer;")
results = cursor.fetchall()
print(results)

[(701, 'Mackenzie', 'Fox', 35, 'Geneva'), (702, 'Emily', 'Joy', 49, 'Zurich')]
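After an UPDATE, the cursor attribute rowcount tells us how many rows were changed, which is a quick way to check that the WHERE clause matched what we expected. Here is a sketch on an in-memory table with fewer columns; the id 999 and the city Bern are made up to show an update that matches nothing.

```python
import sqlite3

demo = sqlite3.connect(":memory:")
cur = demo.cursor()
cur.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, first_name TEXT, city TEXT)")
cur.executemany(
    "INSERT INTO customer VALUES (?, ?, ?)",
    [(701, "Mackenzie", "Geneva"), (702, "Emily", "Geneva")],
)

# The new value and the id are passed through ? placeholders.
cur.execute("UPDATE customer SET city = ? WHERE id = ?", ("Zurich", 702))
matched = cur.rowcount  # one row has id 702

cur.execute("UPDATE customer SET city = ? WHERE id = ?", ("Bern", 999))
missed = cur.rowcount   # no row has id 999, so nothing changes
demo.commit()

cur.execute("SELECT id, city FROM customer ORDER BY id")
rows = cur.fetchall()
print(matched, missed, rows)
demo.close()
```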


#### Deleting rows
Rows can be deleted using a query with the keyword DELETE. For example, we can delete the second row of our table by executing the following query:

In [21]:
delete = """
DELETE FROM customer
WHERE id=702;
"""
cursor.execute(delete)
db.commit()

Checking the table again now shows that only the first row remains:

In [22]:
cursor.execute("SELECT * FROM customer;")
results = cursor.fetchall()
print(results)

[(701, 'Mackenzie', 'Fox', 35, 'Geneva')]
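Until we call commit(), a deletion can still be undone with the rollback() method of the connection. The following sketch shows this on an in-memory table, assuming the default transaction handling of the sqlite3 module; the helper count_rows is made up for this example.

```python
import sqlite3

demo = sqlite3.connect(":memory:")
cur = demo.cursor()
cur.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, first_name TEXT)")
cur.executemany(
    "INSERT INTO customer VALUES (?, ?)", [(701, "Mackenzie"), (702, "Emily")]
)
demo.commit()  # the two rows are now saved


def count_rows():
    """Hypothetical helper: number of rows in the customer table."""
    return demo.execute("SELECT COUNT(*) FROM customer").fetchall()[0][0]


cur.execute("DELETE FROM customer WHERE id = ?", (702,))
deleted = cur.rowcount  # number of rows removed by the query
during = count_rows()   # the row is gone inside the open transaction

demo.rollback()         # undo everything since the last commit
after = count_rows()    # the deleted row is back

print(deleted, during, after)
demo.close()
```

Keep in mind that a DELETE query without a WHERE clause removes every row of the table, so it is worth checking the condition before committing.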


#### Creating a database from a CSV file
The CSV file used in this example, c2_songs.csv, is very small, just for the purposes of this demonstration. However, to extract the rows that satisfy some condition from a large CSV file, we must usually load the entire file as a DataFrame and then check our condition against every row. All of this happens in the working memory (RAM), and if the file is large this can cause serious problems. In such instances, it is better to load the file into a database, which is stored on disk but is still easy and fast to interact with.

We can load CSV files into a database using the to_sql() method from pandas once again. It needs a connection object for the database we are writing to, so let’s start by setting this up:

In [23]:
db = sqlite3.connect("songs.db")
cursor = db.cursor()

A common practice is to load the CSV file in chunks, which can be more memory-efficient than trying to import the whole thing at once. The read_csv() function from pandas has a parameter called chunksize that allows us to do this. If we set chunksize=k then the CSV file will be broken into chunks of k rows which we can then load one at a time.
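To get a feel for how the chunks are formed, here is a small self-contained sketch. It reads a made-up five-row CSV from a string instead of a file and writes each chunk to an in-memory database.

```python
import io
import sqlite3

import pandas as pd

# Made-up CSV content with five data rows, held in a string.
csv_text = "Musician,Name\nA,Song 1\nA,Song 2\nB,Song 3\nB,Song 4\nC,Song 5\n"

demo = sqlite3.connect(":memory:")
sizes = []
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    sizes.append(len(chunk))  # each chunk is an ordinary DataFrame
    chunk.to_sql(name="data", con=demo, if_exists="append", index=False)

total = demo.execute("SELECT COUNT(*) FROM data").fetchall()[0][0]
print(sizes, total)
demo.close()
```

The last chunk is simply shorter when the number of rows is not a multiple of the chunk size, and all five rows end up in the table.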

Let’s give the following a try

In [24]:
for chunk in pd.read_csv("c2_songs.csv", chunksize=4):
    chunk.to_sql(name="data", con=db, if_exists="append", index=False)
    print(chunk.iloc[0, 2])

Stairway to Heaven
Black Dog
All My Love
Rebel Rebel
Golden Years


We have set up a for loop that reads 4 rows of the CSV file at a time and appends them to a table called data in the database db. For each chunk, we print the entry in its first row and third column so you can see how the chunks were divided.

These are the songs in the first row of each chunk. Now we can quickly check that this worked correctly by importing the data back as a DataFrame

In [25]:
pd.read_sql_query("SELECT * FROM data;", db)

Unnamed: 0,Musician,Genre,Name,Decade,Minutes
0,Led Zeppelin,hard rock,Stairway to Heaven,70,08:02
1,Led Zeppelin,hard rock,Kashmir,70,08:37
2,Led Zeppelin,hard rock,Immigrant Song,70,02:26
3,Led Zeppelin,hard rock,Whole Lotta Love,60,05:33
4,Led Zeppelin,hard rock,Black Dog,70,04:55
5,Led Zeppelin,hard rock,Good Times Bad Times,60,02:43
6,Led Zeppelin,hard rock,Moby Dick,60,04:25
7,Led Zeppelin,hard rock,Ramble On,60,04:35
8,Led Zeppelin,hard rock,All My Love,70,05:53
9,Led Zeppelin,hard rock,The Song Remains the Same,70,05:24
