# Machine Learning Foundation

## Section 1, Part a: Reading Data


### Learning Objective(s)

*   Create a SQL database connection to a sample SQL database, and read records from that database
*   Explore common input parameters

### Packages

*   [Pandas](https://pandas.pydata.org/pandas-docs/stable/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMML0232ENSkillsNetwork30654641-2022-01-01)
*   [Pandas.read_sql](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_sql.html?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMML0232ENSkillsNetwork30654641-2022-01-01)
*   [SQLite3](https://docs.python.org/3.6/library/sqlite3.html?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMML0232ENSkillsNetwork30654641-2022-01-01)


## Simple data reads

Structured Query Language (SQL) is an [ANSI specification](https://docs.oracle.com/database/121/SQLRF/ap_standard_sql001.htm?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMML0232ENSkillsNetwork30654641-2022-01-01#SQLRF55514) implemented by various databases. SQL is a powerful language for interacting with large databases efficiently, and it offers a largely consistent experience across many database products. We'll be using SQLite, a lightweight, file-based database engine, for this example. SQLite implements its own slightly modified dialect of SQL, which may differ from what you're used to.


In [1]:
# Import required libraries for database operations and data manipulation
import sqlite3 as sq3          # SQLite3 for database operations
import pandas.io.sql as pds    # Pandas SQL tools for reading SQL queries into DataFrames
import pandas as pd            # Pandas for data manipulation and analysis

### Database connections

Our first step will be to create a connection to our SQL database. A few common SQL databases used with Python include:

*   Microsoft SQL Server
*   Postgres
*   MySQL
*   AWS Redshift
*   AWS Aurora
*   Oracle DB
*   Teradata
*   Db2 Family
*   Many, many others

Each of these databases will require a slightly different setup, and may require credentials (username & password), tokens, or other access requirements. We'll be using `sqlite3` to connect to our database, but other connection packages include:

*   [`SQLAlchemy`](https://www.sqlalchemy.org/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMML0232ENSkillsNetwork30654641-2022-01-01) (most common)
*   [`psycopg2`](http://initd.org/psycopg/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMML0232ENSkillsNetwork30654641-2022-01-01)
*   [`MySQLdb`](http://mysql-python.sourceforge.net/MySQLdb.html?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMML0232ENSkillsNetwork30654641-2022-01-01)
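
Drivers such as `psycopg2` and `MySQLdb` implement the same Python DB-API 2.0 interface (PEP 249) as `sqlite3`, so the connect / cursor / execute / fetch pattern carries over; what mainly changes is the connection arguments and the placeholder style. A minimal sketch of that pattern, using an in-memory SQLite database so it runs without any server. The `songs` table and its two rows are assumptions made up for illustration, not the course dataset:

```python
import sqlite3

# In-memory database, so nothing is written to disk (illustration only)
con = sqlite3.connect(":memory:")
cur = con.cursor()

# Hypothetical table and rows, not the course dataset
cur.execute("CREATE TABLE songs (title TEXT, artist TEXT)")
cur.executemany(
    "INSERT INTO songs VALUES (?, ?)",
    [("Kryptonite", "3 Doors Down"), ("Hold On Loosely", ".38 Special")],
)
con.commit()

# The same cursor / execute / fetchall steps apply to other DB-API drivers
cur.execute("SELECT title FROM songs ORDER BY title")
rows = cur.fetchall()
print(rows)

con.close()
```

With another driver, the call to `connect` would typically take a host, database name, and credentials instead of a file path.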


In [2]:
# Download the database using Python's requests library
import os
import requests

# Create data directory if it doesn't exist
os.makedirs('data', exist_ok=True)

# URL of the database file
url = 'https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-ML0232EN-SkillsNetwork/asset/classic_rock.db'

# Download the file if it doesn't exist
if not os.path.exists('data/classic_rock.db'):
    print("Downloading database file...")
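    response = requests.get(url)
    response.raise_for_status()  # Stop on an HTTP error instead of saving an error page as the database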
    with open('data/classic_rock.db', 'wb') as f:
        f.write(response.content)
    print("Download completed!")
else:
    print("Database file already exists.")

Downloading database file...
Download completed!


## Downloading the Sample Database

In this section, we'll download a sample SQLite database containing information about classic rock songs. We'll use Python's `requests` library instead of shell commands to ensure cross-platform compatibility. The database will be stored in a `data` directory in our current working folder.

**Expected Output:**
- A new `data` directory will be created (if it doesn't exist)
- The database file will be downloaded (if it doesn't exist)
- You'll see a message indicating the download status
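
A failed or partial download can leave a file on disk that is not actually a database. One possible sanity check, sketched here under the assumption that a header check is enough for this purpose: every SQLite 3 database file begins with the 16-byte string `SQLite format 3\0`. The helper name `looks_like_sqlite` and the throwaway files are hypothetical, introduced only for this example:

```python
import os
import sqlite3
import tempfile

# The first 16 bytes of every SQLite 3 database file
SQLITE_MAGIC = b"SQLite format 3\x00"


def looks_like_sqlite(path):
    """Return True if the file at `path` starts with the SQLite 3 header."""
    with open(path, "rb") as f:
        return f.read(16) == SQLITE_MAGIC


# Demonstration on throwaway files in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    # A real (tiny) SQLite database
    demo_path = os.path.join(tmp, "demo.db")
    con = sqlite3.connect(demo_path)
    con.execute("CREATE TABLE t (x INTEGER)")
    con.commit()
    con.close()
    is_valid = looks_like_sqlite(demo_path)

    # A file that is not a database, e.g. a saved error page
    bad_path = os.path.join(tmp, "not_a_db.txt")
    with open(bad_path, "wb") as f:
        f.write(b"<html>404</html>")
    is_invalid = looks_like_sqlite(bad_path)

print(is_valid, is_invalid)
```

In this notebook the same helper could be called as `looks_like_sqlite(os.path.join('data', 'classic_rock.db'))` after the download cell has run.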

In [3]:
# Initialize path to SQLite database using os.path for cross-platform compatibility
import os

# Define the path to the database file
path = os.path.join('data', 'classic_rock.db')

# Create a connection to the SQLite database
# This will create a new database if it doesn't exist, or connect to an existing one
con = sq3.connect(path)

print(f"Successfully connected to database at: {path}")

# We now have a live connection to our SQL database

Successfully connected to database at: data\classic_rock.db


## Creating the Database Connection

Now we'll establish a connection to our SQLite database. SQLite is a lightweight, file-based database that's perfect for learning and small applications. We'll use Python's built-in `sqlite3` library to create the connection.

**Expected Output:**
- A connection object will be created
- A confirmation message will show the database path
- No errors should occur if the database file exists and is valid
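
Once connected, it can help to check which tables and columns the database contains before writing queries. A minimal sketch using SQLite's built-in `sqlite_master` catalog and `PRAGMA table_info`; the in-memory database and the two-column `rock_songs` table are stand-ins created for illustration, not the real file:

```python
import sqlite3
from contextlib import closing

# closing() makes sure con.close() runs when the block exits
with closing(sqlite3.connect(":memory:")) as con:
    # Hypothetical stand-in for the course table
    con.execute("CREATE TABLE rock_songs (Song TEXT, Artist TEXT)")

    # sqlite_master is SQLite's catalog of schema objects
    tables = [
        row[0]
        for row in con.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        )
    ]

    # PRAGMA table_info returns one row per column; index 1 is the column name
    columns = [row[1] for row in con.execute("PRAGMA table_info(rock_songs)")]

print(tables)
print(columns)
```

The same two queries can be run against the notebook's `con` object to confirm that `rock_songs` exists and to list its columns.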

In [4]:
con

<sqlite3.Connection at 0x1d0a359a980>

### Reading data

Now that we've got a connection to our database, we can run queries and load their results as pandas DataFrames.


In [5]:
# Write a simple SQL query to select all columns and rows from the rock_songs table
query = '''
SELECT * 
FROM rock_songs;
'''

# Execute the query and load results into a pandas DataFrame
# This uses pandas.io.sql.read_sql which combines query execution and DataFrame creation
observations = pds.read_sql(query, con)

# Display the first 5 rows of the DataFrame using head()
# This helps us quickly inspect the structure and content of our data
observations.head()

| | Song | Artist | Release_Year | PlayCount |
|---|---|---|---|---|
| 0 | Caught Up in You | .38 Special | 1982.0 | 82 |
| 1 | Hold On Loosely | .38 Special | 1981.0 | 85 |
| 2 | Rockin' Into the Night | .38 Special | 1980.0 | 18 |
| 3 | Art For Arts Sake | 10cc | 1975.0 | 1 |
| 4 | Kryptonite | 3 Doors Down | 2000.0 | 13 |


## Basic Data Query and Display

Let's start with a simple query to see what our data looks like. We'll select all columns and rows from the `rock_songs` table and display the first few entries. This is a common first step in data exploration to understand the structure and content of our dataset.

**Expected Output:**
- A DataFrame containing all columns from the rock_songs table
- The first 5 rows of data will be displayed
- You should see information about songs including artist, release year, and play count
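
Another commonly used input to `pandas.read_sql` is `params`, which passes values to the database driver separately from the SQL text rather than formatting them into the string. A minimal sketch, assuming the `?` placeholder style that `sqlite3` uses; the in-memory table is a stand-in holding three rows copied from the sample output in this section:

```python
import sqlite3
import pandas as pd

# Stand-in database with three rows taken from the sample output
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE rock_songs "
    "(Song TEXT, Artist TEXT, Release_Year REAL, PlayCount INTEGER)"
)
con.executemany(
    "INSERT INTO rock_songs VALUES (?, ?, ?, ?)",
    [
        ("Caught Up in You", ".38 Special", 1982.0, 82),
        ("Hold On Loosely", ".38 Special", 1981.0, 85),
        ("Kryptonite", "3 Doors Down", 2000.0, 13),
    ],
)
con.commit()

# The ? placeholder is filled in by the driver from `params`
query = """
SELECT Song, PlayCount
FROM rock_songs
WHERE Artist = ?
ORDER BY PlayCount DESC;
"""
by_artist = pd.read_sql(query, con, params=(".38 Special",))
con.close()

print(by_artist)
```

Other drivers use different placeholder styles (for example `%s`), so the query text may need adjusting when switching databases.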

In [6]:
# Write a more complex SQL query to analyze song data by artist and year
query = '''
SELECT 
    Artist,                     -- Group by artist
    Release_Year,              -- and release year
    COUNT(*) AS num_songs,     -- Count number of songs
    AVG(PlayCount) AS avg_plays  -- Calculate average play count
FROM rock_songs
GROUP BY Artist, Release_Year   -- Group the results
ORDER BY num_songs desc;        -- Sort by number of songs in descending order
'''

# Execute the query and store results in a DataFrame
observations = pds.read_sql(query, con)

# Display the first 5 rows of the results
# This shows us the artists with the most songs in a given year
observations.head()

| | Artist | Release_Year | num_songs | avg_plays |
|---|---|---|---|---|
| 0 | The Beatles | 1967.0 | 23 | 6.565217 |
| 1 | Led Zeppelin | 1969.0 | 18 | 21.0 |
| 2 | The Beatles | 1965.0 | 15 | 3.8 |
| 3 | The Beatles | 1968.0 | 13 | 13.0 |
| 4 | The Beatles | 1969.0 | 13 | 15.0 |


## Advanced SQL Query with Aggregation

Now we'll perform a more complex analysis of our data using SQL aggregation functions. This query will:
1. Group songs by artist and release year
2. Count the number of songs per group
3. Calculate the average play count
4. Sort the results by number of songs

**Expected Output:**
- A DataFrame with aggregated statistics
- Columns: Artist, Release_Year, num_songs (count), avg_plays
- Results sorted by number of songs in descending order
- First 5 rows showing artists with the most songs in a given year
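
The same aggregation can also be done in pandas after loading the raw rows, which is a useful cross-check of what the SQL does. A rough sketch on made-up data (the artists, songs, and play counts below are placeholders, not values from the course database). Extra tie-breaker sort keys are added on both sides so the two results come back in the same order; the query above sorts only by `num_songs`, so the order of ties is not guaranteed there:

```python
import sqlite3
import pandas as pd

# Made-up rows for illustration only
rows = [
    ("Song A", "Artist X", 1970, 10),
    ("Song B", "Artist X", 1970, 20),
    ("Song C", "Artist X", 1971, 5),
    ("Song D", "Artist Y", 1970, 8),
]
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE rock_songs "
    "(Song TEXT, Artist TEXT, Release_Year INTEGER, PlayCount INTEGER)"
)
con.executemany("INSERT INTO rock_songs VALUES (?, ?, ?, ?)", rows)
con.commit()

# Aggregation done by the database
sql_result = pd.read_sql(
    """
    SELECT Artist, Release_Year, COUNT(*) AS num_songs, AVG(PlayCount) AS avg_plays
    FROM rock_songs
    GROUP BY Artist, Release_Year
    ORDER BY num_songs DESC, Artist, Release_Year;
    """,
    con,
)

# Same aggregation done in pandas on the raw rows
songs = pd.read_sql("SELECT * FROM rock_songs;", con)
con.close()

pandas_result = (
    songs.groupby(["Artist", "Release_Year"])
    .agg(num_songs=("Song", "size"), avg_plays=("PlayCount", "mean"))
    .reset_index()
    .sort_values(
        ["num_songs", "Artist", "Release_Year"], ascending=[False, True, True]
    )
    .reset_index(drop=True)
)

print(sql_result)
print(pandas_result)
```

Pushing the aggregation into SQL usually means less data crosses the connection, while the pandas version is convenient once the rows are already in memory.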

## Common parameters

A number of common parameters control how `read_sql` loads and formats SQL data:

*   `coerce_float`: Attempt to convert non-string, non-numeric values (such as `decimal.Decimal`) to floats
*   `parse_dates`: List of columns to parse as dates
*   `chunksize`: Number of rows to include in each chunk

Let's have a look at using some of these parameters.


In [7]:
# Define the query to group and analyze song data
query='''
SELECT Artist, Release_Year, COUNT(*) AS num_songs, AVG(PlayCount) AS avg_plays  
    FROM rock_songs
    GROUP BY Artist, Release_Year
    ORDER BY num_songs desc;
'''

# Execute the query with additional parameters to demonstrate pandas.read_sql features
observations_generator = pds.read_sql(
    query,
    con,
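    coerce_float=True,     # Convert non-string, non-numeric values (e.g. Decimal) to float where possible
    parse_dates=['Release_Year'],  # Parse Release_Year as datetime; numeric values are read as seconds since the Unix epoch
    chunksize=5           # Return an iterator that yields DataFrames of 5 rows each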
)

# Iterate through the first 5 chunks of results
# This demonstrates how to process large datasets in manageable pieces
for index, observations in enumerate(observations_generator):
    if index >= 5:  # Stop after the first 5 chunks instead of reading the rest
        break
    print(f'Chunk {index + 1}:')  # Print chunk number
    display(observations)         # Display chunk contents

Chunk 1:

| | Artist | Release_Year | num_songs | avg_plays |
|---|---|---|---|---|
| 0 | The Beatles | 1970-01-01 00:32:47 | 23 | 6.565217 |
| 1 | Led Zeppelin | 1970-01-01 00:32:49 | 18 | 21.0 |
| 2 | The Beatles | 1970-01-01 00:32:45 | 15 | 3.8 |
| 3 | The Beatles | 1970-01-01 00:32:48 | 13 | 13.0 |
| 4 | The Beatles | 1970-01-01 00:32:49 | 13 | 15.0 |

Chunk 2:

| | Artist | Release_Year | num_songs | avg_plays |
|---|---|---|---|---|
| 0 | Led Zeppelin | 1970-01-01 00:32:50 | 12 | 13.166667 |
| 1 | Led Zeppelin | 1970-01-01 00:32:55 | 12 | 14.166667 |
| 2 | Pink Floyd | 1970-01-01 00:32:59 | 11 | 41.454545 |
| 3 | Pink Floyd | 1970-01-01 00:32:53 | 10 | 29.1 |
| 4 | The Doors | 1970-01-01 00:32:47 | 10 | 28.9 |

Chunk 3:

| | Artist | Release_Year | num_songs | avg_plays |
|---|---|---|---|---|
| 0 | Fleetwood Mac | 1970-01-01 00:32:57 | 9 | 35.666667 |
| 1 | Jimi Hendrix | 1970-01-01 00:32:47 | 9 | 24.888889 |
| 2 | The Beatles | 1970-01-01 00:32:43 | 9 | 2.444444 |
| 3 | The Beatles | 1970-01-01 00:32:44 | 9 | 3.111111 |
| 4 | Elton John | 1970-01-01 00:32:53 | 8 | 18.5 |

Chunk 4:

| | Artist | Release_Year | num_songs | avg_plays |
|---|---|---|---|---|
| 0 | Led Zeppelin | 1970-01-01 00:32:51 | 8 | 47.75 |
| 1 | Led Zeppelin | 1970-01-01 00:32:53 | 8 | 34.125 |
| 2 | Boston | 1970-01-01 00:32:56 | 7 | 69.285714 |
| 3 | Rolling Stones | 1970-01-01 00:32:49 | 7 | 36.142857 |
| 4 | Van Halen | 1970-01-01 00:32:58 | 7 | 51.142857 |

Chunk 5:

| | Artist | Release_Year | num_songs | avg_plays |
|---|---|---|---|---|
| 0 | Bruce Springsteen | 1970-01-01 00:32:55 | 6 | 7.666667 |
| 1 | Bruce Springsteen | 1970-01-01 00:33:04 | 6 | 11.5 |
| 2 | Creedence Clearwater Revival | 1970-01-01 00:32:49 | 6 | 23.833333 |
| 3 | Creedence Clearwater Revival | 1970-01-01 00:32:50 | 6 | 18.833333 |
| 4 | Def Leppard | 1970-01-01 00:33:07 | 6 | 32.0 |


## Working with Large Datasets using Chunks

This section demonstrates how to handle large datasets efficiently using pandas' chunking capability. Instead of loading all data at once, we'll process it in smaller chunks. This is particularly useful when working with large datasets that might not fit in memory.
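
As a rough illustration of why chunking helps, a statistic can be accumulated one chunk at a time so that only a few rows are held in memory at once. The sketch below assumes a made-up `plays` table with ten rows (the values 1 through 10) in an in-memory database; with `chunksize=4`, `read_sql` yields chunks of 4, 4, and 2 rows:

```python
import sqlite3
import pandas as pd

# Made-up table with the values 1..10, for illustration only
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE plays (PlayCount INTEGER)")
con.executemany("INSERT INTO plays VALUES (?)", [(n,) for n in range(1, 11)])
con.commit()

total_plays = 0
n_rows = 0
n_chunks = 0

# With chunksize set, read_sql returns an iterator of DataFrames
for chunk in pd.read_sql("SELECT PlayCount FROM plays;", con, chunksize=4):
    total_plays += int(chunk["PlayCount"].sum())  # running sum
    n_rows += len(chunk)                          # running row count
    n_chunks += 1

con.close()

# The overall mean is recovered from the running totals
mean_plays = total_plays / n_rows
print(n_chunks, total_plays, mean_plays)
```

Sums and counts combine cleanly across chunks; statistics such as medians do not, and need a different approach.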

Key features demonstrated:
- Using `chunksize` parameter to process data in smaller pieces
- Converting data types automatically with `coerce_float`
- Parsing dates with `parse_dates`

**Expected Output:**
- Five chunks of data, each containing 5 rows
- Each chunk will show the same columns as before but processed in smaller batches
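- Release_Year will be converted to datetime format; because the column holds plain year numbers, pandas reads them as seconds since the Unix epoch, so 1967 appears as 1970-01-01 00:32:47 rather than as a date in 1967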
- Numeric columns will be stored as float types where appropriate
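
In the chunk output, every parsed `Release_Year` lands on 1970-01-01 with only the seconds varying, which indicates the numeric years were treated as offsets from the Unix epoch rather than as calendar years. One possible way to get real dates, sketched here on a small hand-made DataFrame, is to convert the column after loading. The `Release_Date` column name and the missing value are assumptions added for illustration:

```python
import pandas as pd

# Years stored as floats, as in the Release_Year column; None marks a missing year
df = pd.DataFrame({"Release_Year": [1982.0, 1981.0, None, 2000.0]})

# Drop missing years, turn 1982.0 into the string "1982", then parse with an explicit format
years = df["Release_Year"].dropna().astype(int).astype(str)

# Assignment aligns on the index, so the row with a missing year becomes NaT
df["Release_Date"] = pd.to_datetime(years, format="%Y")

print(df)
```

Each parsed value falls on January 1 of its year, since the source column carries no month or day.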

### Machine Learning Foundation (C) 2020 IBM Corporation
