In [1]:
import sqlite3 as sq3
import pandas.io.sql as pds
import pandas as pd

In [2]:
db_path = './data/classic_rock.db'
con = sq3.connect(db_path)
con

<sqlite3.Connection at 0x1bacaceaa70>

To start, get all data from the SQLite database by running a `SELECT * FROM rock_songs` query. Because `read_sql()` reads the result directly into a dataframe, we can inspect it with the standard pandas functions such as `info()` and `head()`.

In [6]:
query = """SELECT * FROM rock_songs;"""
data = pds.read_sql(query, con)
display(data.info())
display(data.head())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1650 entries, 0 to 1649
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Song          1650 non-null   object 
 1   Artist        1650 non-null   object 
 2   Release_Year  1650 non-null   float64
 3   PlayCount     1650 non-null   int64  
dtypes: float64(1), int64(1), object(2)
memory usage: 51.7+ KB


None

Unnamed: 0,Song,Artist,Release_Year,PlayCount
0,Caught Up in You,.38 Special,1982.0,82
1,Hold On Loosely,.38 Special,1981.0,85
2,Rockin' Into the Night,.38 Special,1980.0,18
3,Art For Arts Sake,10cc,1975.0,1
4,Kryptonite,3 Doors Down,2000.0,13
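The `None` that appears between the `info()` summary and the table is not part of the data. `DataFrame.info()` prints its summary itself and returns `None`, which `display()` then shows. A minimal sketch of that behaviour follows; the in-memory table and its two rows are hypothetical stand-ins for `classic_rock.db`:

```python
import sqlite3

import pandas as pd

# Hypothetical stand-in for classic_rock.db: a tiny in-memory table.
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE rock_songs "
    "(Song TEXT, Artist TEXT, Release_Year REAL, PlayCount INTEGER)"
)
con.executemany(
    "INSERT INTO rock_songs VALUES (?, ?, ?, ?)",
    [
        ("Song A", "Artist X", 1975.0, 10),
        ("Song B", "Artist Y", 1981.0, 20),
    ],
)

data = pd.read_sql("SELECT * FROM rock_songs;", con)

# info() writes the summary to stdout and returns None,
# so it can be called on its own, without display().
result = data.info()
print(result)
```

Calling `data.info()` directly gives the same summary without the extra `None` line.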


We can also run any generic SQLite-compatible query. Below we select all artists and count how many songs each has in the database using `COUNT(*)`. This function counts all rows in each group, and the result is stored in a new column named `num_songs`. We also compute the average play count per group with `AVG(PlayCount)`. We group by `Artist` and `Release_Year` (otherwise each pair would occur multiple times in the data) and order by number of songs, in descending order.

In [7]:
query = """
SELECT Artist, Release_Year, COUNT(*) AS num_songs, AVG(PlayCount) AS avg_plays
FROM rock_songs
GROUP BY Artist, Release_Year
ORDER BY num_songs desc;
"""
data = pds.read_sql(query, con)
display(data.info())
display(data.head())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 908 entries, 0 to 907
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Artist        908 non-null    object 
 1   Release_Year  908 non-null    float64
 2   num_songs     908 non-null    int64  
 3   avg_plays     908 non-null    float64
dtypes: float64(2), int64(1), object(1)
memory usage: 28.5+ KB


None

Unnamed: 0,Artist,Release_Year,num_songs,avg_plays
0,The Beatles,1967.0,23,6.565217
1,Led Zeppelin,1969.0,18,21.0
2,The Beatles,1965.0,15,3.8
3,The Beatles,1968.0,13,13.0
4,The Beatles,1969.0,13,15.0
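As a cross-check, the same aggregation can be expressed with pandas `groupby` after loading the raw table. This is only a sketch on a small hypothetical in-memory table (the song and artist names are made up), not the notebook's actual data:

```python
import sqlite3

import pandas as pd

# Hypothetical toy data standing in for the rock_songs table.
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE rock_songs "
    "(Song TEXT, Artist TEXT, Release_Year REAL, PlayCount INTEGER)"
)
con.executemany(
    "INSERT INTO rock_songs VALUES (?, ?, ?, ?)",
    [
        ("Song A", "Artist X", 1975.0, 10),
        ("Song B", "Artist X", 1975.0, 20),
        ("Song C", "Artist X", 1976.0, 5),
        ("Song D", "Artist Y", 1981.0, 7),
    ],
)

# Aggregation done by SQLite.
sql_result = pd.read_sql(
    """
    SELECT Artist, Release_Year, COUNT(*) AS num_songs, AVG(PlayCount) AS avg_plays
    FROM rock_songs
    GROUP BY Artist, Release_Year
    ORDER BY num_songs DESC;
    """,
    con,
)

# The same aggregation done by pandas on the raw rows.
raw = pd.read_sql("SELECT * FROM rock_songs;", con)
pandas_result = (
    raw.groupby(["Artist", "Release_Year"], as_index=False)
    .agg(num_songs=("Song", "size"), avg_plays=("PlayCount", "mean"))
    .sort_values("num_songs", ascending=False)
    .reset_index(drop=True)
)

print(sql_result)
print(pandas_result)
```

Doing the aggregation in SQL means only the grouped rows are transferred into the dataframe, while the pandas version needs the full table in memory first.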


The `read_sql()` function also accepts additional keyword arguments, such as `coerce_float` and `parse_dates`. With these we can force the data to take a specific form, for example by ensuring that all numeric data is in a valid float format, or that dates are parsed into a desired date format. In the output above we can see that the release year is stored as a `float64` value.

Since `Release_Year` is encoded as a float, the usual `parse_dates` parameter does not work here, so we need to handle the conversion manually. Because only the year is provided, we don't need a full datetime object. This case can be handled by simply casting the column to an integer.

In [18]:
query = """
SELECT Artist, Release_Year, COUNT(*) AS num_songs, AVG(PlayCount) AS avg_plays
FROM rock_songs
GROUP BY Artist, Release_Year
ORDER BY num_songs desc;
"""
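# coerce_float=True is the default for read_sql(); passing it only makes that explicit.
data = pds.read_sql(query, con, coerce_float=True)
# Only the year is stored, so a plain integer is enough (this raises if a year is missing).
data["Release_Year"] = data["Release_Year"].astype(int)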
display(data.info())
display(data.head())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 908 entries, 0 to 907
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Artist        908 non-null    object 
 1   Release_Year  908 non-null    int64  
 2   num_songs     908 non-null    int64  
 3   avg_plays     908 non-null    float64
dtypes: float64(1), int64(2), object(1)
memory usage: 28.5+ KB


None

Unnamed: 0,Artist,Release_Year,num_songs,avg_plays
0,The Beatles,1967,23,6.565217
1,Led Zeppelin,1969,18,21.0
2,The Beatles,1965,15,3.8
3,The Beatles,1968,13,13.0
4,The Beatles,1969,13,15.0
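Two related options are worth sketching. First, `read_sql()` takes a `params` argument for query placeholders (`?` for `sqlite3`), which avoids building SQL strings by hand. Second, `astype(int)` raises if a year is missing, whereas the nullable `"Int64"` dtype keeps missing values as `<NA>`. The table contents below, including the row with a missing year, are hypothetical:

```python
import sqlite3

import pandas as pd

# Hypothetical toy data; one row deliberately has no release year.
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE rock_songs "
    "(Song TEXT, Artist TEXT, Release_Year REAL, PlayCount INTEGER)"
)
con.executemany(
    "INSERT INTO rock_songs VALUES (?, ?, ?, ?)",
    [
        ("Song A", "Artist X", 1975.0, 10),
        ("Song B", "Artist X", None, 20),
        ("Song C", "Artist Y", 1981.0, 7),
    ],
)

# sqlite3 uses "?" placeholders; the values are supplied through params.
query = """
SELECT Song, Artist, Release_Year
FROM rock_songs
WHERE Artist = ?
ORDER BY Song;
"""
data = pd.read_sql(query, con, params=("Artist X",))

# The missing year arrives as NaN in a float64 column.
# The nullable "Int64" dtype converts the years and keeps the gap as <NA>.
data["Release_Year"] = data["Release_Year"].astype("Int64")
print(data)
print(data.dtypes)
```

If every year is known to be present, the plain `astype(int)` cast used above is sufficient.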
