# Database Loading

There are several ways to load the data into your database.

You could manually convert the data to CSV files and load it from CSV, or you could take a programmatic approach with Python.
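
As a rough illustration of the CSV route, the sketch below reads CSV text with Python's standard `csv` module and inserts the rows into an SQLite table. The `Artist` table layout, its column names, and the in-memory database are assumptions made for this example; the three sample rows come from the Artist tab shown later in this notebook.

```python
import csv
import io
import sqlite3

# Stand-in for a CSV file exported from the Artist tab (no header row).
csv_text = """Queen,Rock,1970
The Rolling Stones,Rock,1962
Prince,Rock,1958
"""

# In-memory database so the sketch leaves no file behind.
connection = sqlite3.connect(":memory:")
cursor = connection.cursor()

# Hypothetical table layout; match it to your own schema.
cursor.execute(
    "CREATE TABLE Artist ("
    "ArtistID INTEGER PRIMARY KEY, Name TEXT, Genre TEXT, Year INTEGER)"
)

# csv.reader yields each line as a list of strings, so the year is converted to int.
reader = csv.reader(io.StringIO(csv_text))
rows = [(name, genre, int(year)) for name, genre, year in reader]

# executemany runs the INSERT once per row; SQLite assigns ArtistID itself.
cursor.executemany("INSERT INTO Artist (Name, Genre, Year) VALUES (?, ?, ?)", rows)
connection.commit()

loaded = cursor.execute(
    "SELECT ArtistID, Name, Genre, Year FROM Artist ORDER BY ArtistID"
).fetchall()
print(loaded)
```

With a real file, `io.StringIO(csv_text)` would be replaced by `open(path, newline='')`.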


## Python + Excel + SQLite

A second option is to have Python load the data for you.

The first step is to pull the data from Excel into Python. For this, we will use the [OpenPyXL](https://openpyxl.readthedocs.io/en/stable/) library.

[This link](https://automatetheboringstuff.com/chapter12/) is a good read, as your time permits!



In [1]:
import openpyxl as xl
import pandas as pd
wb = xl.load_workbook(filename='../../../datasets/Module4Data.xlsx')

print(xl.__version__)  # This needs to be OpenPyXL 2.4.0 or later
# Show the tabs (worksheet names); sheetnames replaces the deprecated get_sheet_names()
print(wb.sheetnames)


2.4.8
['Artist', 'Albums', 'Songs']


Now that you have the data from the tabs (worksheets) in data structures, you can manipulate it and then put it into the database.
Note: This can take a couple of tries to get right. Don't be afraid to remove all your data, or your entire database, and start over.
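
Starting over can be as simple as emptying a table, or dropping and recreating it. A minimal sketch, assuming a hypothetical `Artist` table and an in-memory database (with a file-backed database you could also delete the file):

```python
import sqlite3

connection = sqlite3.connect(":memory:")
cursor = connection.cursor()

# Hypothetical schema, used only for this illustration.
create_sql = (
    "CREATE TABLE Artist ("
    "ArtistID INTEGER PRIMARY KEY, Name TEXT, Genre TEXT, Year INTEGER)"
)
cursor.execute(create_sql)
cursor.execute("INSERT INTO Artist VALUES (?, ?, ?, ?)", (0, "Queen", "Rock", 1970))
connection.commit()

# Option 1: keep the table, remove every row.
cursor.execute("DELETE FROM Artist")
connection.commit()
rows_after_delete = cursor.execute("SELECT COUNT(*) FROM Artist").fetchone()[0]

# Option 2: drop the table and recreate it from the schema.
cursor.execute("DROP TABLE IF EXISTS Artist")
cursor.execute(create_sql)
connection.commit()
rows_after_recreate = cursor.execute("SELECT COUNT(*) FROM Artist").fetchone()[0]

print(rows_after_delete, rows_after_recreate)
```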

To make things a little easier, we are going to move over to a pandas DataFrame: http://openpyxl.readthedocs.io/en/2.4/pandas.html

In [2]:
import openpyxl as xl
import pandas as pd
wb = xl.load_workbook(filename='../../../datasets/Module4Data.xlsx')
# If you try to load the workbook in read-only mode, you will get an error. If you find this
# curious, I suggest you print(type(ws)) below, with and without read_only=True.
# wb = xl.load_workbook(filename='../datasets/Module4Data.xlsx', read_only=True)

ws = wb['Artist']  # ws is a regular Worksheet here; read-only mode returns a different worksheet type

# http://openpyxl.readthedocs.io/en/2.4/pandas.html
df = pd.DataFrame(ws.values)
print(df)


                        0                  1     2
0                   Queen               Rock  1970
1      The Rolling Stones               Rock  1962
2                  Prince               Rock  1958
3             The Beatles               Rock  1960
4                 Nirvana             Grunge  1987
5               Pearl Jam             Grunge  1990
6             Soundgarden             Grunge  1984
7   Red Hot Chili Peppers          Funk Rock  1983
8        Jane’s Addiction   Alternative Rock  1985
9                No Doubt           Ska Punk  1986
10                   Bush   Alternative Rock  1992
11    Stone Temple Pilots   Alternative Rock  1989
12       System of a Down  Alternative Metal  1994
13                Dr. Dre            Hip Hop  1984
14              Radiohead   Alternative Rock  1985
15              Green Day          Punk Rock  1986
16           Foo Fighters   Alternative Rock  1994


In [4]:
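# Alternative: use pandas directly
import pandas  # needs an Excel reader engine installed (xlrd in older pandas, openpyxl in newer releases)
# By default read_excel treats the first row as the header, which is why 'Queen' appears
# as the column labels in the output; pass header=None to keep that row as data.
df = pandas.read_excel('../../../datasets/Module4Data.xlsx', sheet_name='Artist')
print(df)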

                    Queen               Rock  1970
0      The Rolling Stones               Rock  1962
1                  Prince               Rock  1958
2             The Beatles               Rock  1960
3                 Nirvana             Grunge  1987
4               Pearl Jam             Grunge  1990
5             Soundgarden             Grunge  1984
6   Red Hot Chili Peppers          Funk Rock  1983
7        Jane’s Addiction   Alternative Rock  1985
8                No Doubt           Ska Punk  1986
9                    Bush   Alternative Rock  1992
10    Stone Temple Pilots   Alternative Rock  1989
11       System of a Down  Alternative Metal  1994
12                Dr. Dre            Hip Hop  1984
13              Radiohead   Alternative Rock  1985
14              Green Day          Punk Rock  1986
15           Foo Fighters   Alternative Rock  1994


#### Consider the task of loading the entire Excel file into the DB.  

  1. Should you have a DataFrame for each tab?
  1. Should you load everything at once?
  1. If the file is massive, you will need to use the memory-efficient read-only mode. This increases the complexity of the code, but it also improves scalability. Please consider this an advanced exercise for later.
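
One possible shape for the points above: a sketch that opens the workbook in read-only mode and collects the rows of every tab into a dictionary keyed by sheet name. The function name `load_all_sheets` is hypothetical, and `iter_rows(values_only=True)` assumes OpenPyXL 2.6 or later, which is newer than the 2.4.8 shown earlier.

```python
import openpyxl


def load_all_sheets(path):
    """Return {sheet name: list of row tuples} for every tab in the workbook."""
    # read_only=True streams rows instead of building the whole workbook in memory.
    wb = openpyxl.load_workbook(filename=path, read_only=True)
    try:
        tables = {}
        for name in wb.sheetnames:
            ws = wb[name]
            # values_only=True yields plain cell values rather than cell objects.
            tables[name] = [tuple(row) for row in ws.iter_rows(values_only=True)]
        return tables
    finally:
        # Read-only workbooks keep the file handle open until closed explicitly.
        wb.close()
```

Calling `load_all_sheets('../../../datasets/Module4Data.xlsx')` would give one list per tab, and each list can be passed to `pd.DataFrame(...)`. Note that collecting every row into a list still holds all the data in memory; for a truly massive file you would insert rows into the database in batches while iterating.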


## Data Loading Considerations 

It is important to review your database design. One thing you must ensure is that there are no cyclic foreign key relationships.

Basically, you cannot load a row that has a foreign key reference to data that has not yet been loaded. In the case of this lab, you cannot load __Songs__ that are properly associated with an __Album__ without first loading the album data.
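
The load-order rule can be seen in a small sketch. The two-table schema, the column names, and the placeholder titles below are assumptions for illustration, not the lab's actual design. Note that SQLite only enforces foreign keys on a connection after `PRAGMA foreign_keys = ON`.

```python
import sqlite3

connection = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is switched on.
connection.execute("PRAGMA foreign_keys = ON")

# Hypothetical two-table schema: each song points at an album.
connection.execute("CREATE TABLE Album (AlbumID INTEGER PRIMARY KEY, Title TEXT)")
connection.execute(
    "CREATE TABLE Songs ("
    "SongID INTEGER PRIMARY KEY, Title TEXT, "
    "AlbumID INTEGER NOT NULL REFERENCES Album(AlbumID))"
)

# Loading the child row first fails: album 1 does not exist yet.
try:
    connection.execute("INSERT INTO Songs VALUES (1, 'Example Song', 1)")
    child_first_failed = False
except sqlite3.IntegrityError as error:
    child_first_failed = True
    print("Rejected:", error)

# Loading the parent first, then the child, succeeds.
connection.execute("INSERT INTO Album VALUES (1, 'Example Album')")
connection.execute("INSERT INTO Songs VALUES (1, 'Example Song', 1)")
connection.commit()

song_count = connection.execute("SELECT COUNT(*) FROM Songs").fetchone()[0]
print(song_count)
```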


### The cell below loads the Artist table. More work would be needed to keep the linkages between Artist IDs and songs, using additional Python data structures.



In [5]:
import sqlite3
import numpy

databaseFilename = '../databases/songs.db'
connection = sqlite3.connect(databaseFilename)
cursor = connection.cursor()

sqlite3.register_adapter(numpy.int64, int)
sqlite3.register_adapter(numpy.float64, float)

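# name=None makes itertuples yield plain tuples (the string 'None' is not a valid
# namedtuple name); index=True puts the row index first, where it fills the first column.
for row in df.itertuples(index=True, name=None):
    print(row)
    cursor.execute('INSERT INTO Artist VALUES(?,?,?,?)', row)

# Save (commit) the changes, then release the database file
connection.commit()
connection.close()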


(0, 'The Rolling Stones', 'Rock', 1962)
(1, 'Prince', 'Rock', 1958)
(2, 'The Beatles', 'Rock', 1960)
(3, 'Nirvana', 'Grunge', 1987)
(4, 'Pearl Jam', 'Grunge', 1990)
(5, 'Soundgarden', 'Grunge', 1984)
(6, 'Red Hot Chili Peppers', 'Funk Rock', 1983)
(7, 'Jane’s Addiction', 'Alternative Rock', 1985)
(8, 'No Doubt', 'Ska Punk', 1986)
(9, 'Bush', 'Alternative Rock', 1992)
(10, 'Stone Temple Pilots', 'Alternative Rock', 1989)
(11, 'System of a Down', 'Alternative Metal', 1994)
(12, 'Dr. Dre', 'Hip Hop', 1984)
(13, 'Radiohead', 'Alternative Rock', 1985)
(14, 'Green Day', 'Punk Rock', 1986)
(15, 'Foo Fighters', 'Alternative Rock', 1994)
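
One way to keep the linkages mentioned above is a dictionary that maps each artist name to the ID it received on insert. This is only a sketch: the `Artist` and `Album` schemas are hypothetical, the database is in memory, and the two album rows are illustrative values that were not read from the workbook.

```python
import sqlite3

connection = sqlite3.connect(":memory:")
cursor = connection.cursor()

# Hypothetical schema for illustration.
cursor.execute(
    "CREATE TABLE Artist ("
    "ArtistID INTEGER PRIMARY KEY, Name TEXT, Genre TEXT, Year INTEGER)"
)
cursor.execute(
    "CREATE TABLE Album ("
    "AlbumID INTEGER PRIMARY KEY, Title TEXT, "
    "ArtistID INTEGER REFERENCES Artist(ArtistID))"
)

artists = [("Queen", "Rock", 1970), ("Nirvana", "Grunge", 1987)]
# Illustrative album rows, not taken from the workbook.
albums = [("Nevermind", "Nirvana"), ("A Night at the Opera", "Queen")]

# Remember the ID assigned to each artist as it is inserted.
artist_ids = {}
for name, genre, year in artists:
    cursor.execute(
        "INSERT INTO Artist (Name, Genre, Year) VALUES (?, ?, ?)",
        (name, genre, year),
    )
    artist_ids[name] = cursor.lastrowid

# Translate each album's artist name into the stored ArtistID.
for title, artist_name in albums:
    cursor.execute(
        "INSERT INTO Album (Title, ArtistID) VALUES (?, ?)",
        (title, artist_ids[artist_name]),
    )
connection.commit()

linked = cursor.execute(
    "SELECT Album.Title, Artist.Name FROM Album "
    "JOIN Artist ON Artist.ArtistID = Album.ArtistID ORDER BY Album.Title"
).fetchall()
print(linked)
```

The same pattern extends one level down: a second dictionary from album title to `AlbumID` would let song rows be linked to their albums.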


# PLEASE SAVE YOUR NOTEBOOK