## Answering Business Questions using SQL

This is a small project where I'll be using SQL to answer some business questions. 

The main purpose here is to further develop my SQL understanding.

---

### Schema

We'll be using the **chinook** database. 

Chinook is a record store, and thus its database contains information on `customer`, `invoice`, `artist`, etc.

The database's schema can be seen below:

<img src="chinook-schema.svg" alt="Drawing" style="width: 500px;" align = "left"/>

### Creating Helper Functions

For this, we'll be using the `sqlite3` and `pandas` libraries. Let's import these.

In [1]:
import pandas as pd

import sqlite3

We'll next create a function that uses `pandas` to make running queries simpler and to display the returned results more clearly. We'll also wrap the connection in a `with` statement. Note that for `sqlite3` this commits or rolls back any open transaction when the block exits; it does not close the connection itself.

With our function below, what's happening is:

1. The SQL query is passed to the function
2. Our database `chinook` is connected to
3. The query is run on the database using `pandas` and returned as a dataframe
4. The `with` block exits, finalising the transaction

In [2]:
def run_query(query):
    with sqlite3.connect('chinook.db') as conn:
        return pd.read_sql(query, conn)
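
Because a `sqlite3` connection used as a context manager only manages the transaction, one option for actually closing it is `contextlib.closing` from the standard library. The following is a minimal sketch rather than a replacement for `run_query`: the name `run_query_rows` and the in-memory database are assumptions for illustration, and plain `sqlite3` stands in for `pandas` so the example is self-contained.

```python
import sqlite3
from contextlib import closing


def run_query_rows(query, db_path=":memory:"):
    """Run a query and return its rows, closing the connection afterwards."""
    # closing() calls conn.close() when the block exits; a bare
    # `with sqlite3.connect(...)` would only commit or roll back.
    with closing(sqlite3.connect(db_path)) as conn:
        return conn.execute(query).fetchall()


rows = run_query_rows("SELECT 1 AS one, 'chinook' AS name;")
print(rows)

# For comparison: after a plain `with`, the connection is still usable.
conn = sqlite3.connect(":memory:")
with conn:
    conn.execute("CREATE TABLE t (x INTEGER);")
still_open = conn.execute("SELECT COUNT(*) FROM t;").fetchone()[0] == 0
conn.close()
```

For a short-lived notebook the difference rarely matters, but closing explicitly avoids holding file handles open across many queries.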

We'll now create another function which will allow us to run queries that don't return results (e.g. `CREATE VIEW`). We don't need to use `pandas` for this.

In [3]:
def run_command(command):
    with sqlite3.connect('chinook.db') as conn:
        conn.isolation_level = None
        conn.execute(command)

Finally, we'll create a function that'll return a list of all views and tables within our database.

In [4]:
def show_tables():
    tables = run_query("""SELECT name, type FROM sqlite_master
                          WHERE type IN ('table', 'view');""")
    print(tables)

In [5]:
show_tables()

              name   type
0            album  table
1           artist  table
2         customer  table
3         employee  table
4            genre  table
5          invoice  table
6     invoice_line  table
7       media_type  table
8         playlist  table
9   playlist_track  table
10           track  table
11     top_5_names   view
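
As a quick illustration of how a command that returns no rows and a listing of `sqlite_master` fit together, here is a small sketch that creates a view and then lists it. It runs against a throwaway in-memory database rather than `chinook.db`, and the `demo` table and `big_demo` view are made-up names used only for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.isolation_level = None  # autocommit mode: statements take effect immediately

# Statements like these return no rows, so there is nothing to load into a dataframe.
conn.execute("CREATE TABLE demo (id INTEGER PRIMARY KEY, amount REAL);")
conn.execute("INSERT INTO demo (amount) VALUES (1.5), (20.0), (7.25);")
conn.execute("CREATE VIEW big_demo AS SELECT * FROM demo WHERE amount > 5;")

# sqlite_master lists every table and view in the database.
objects = conn.execute(
    "SELECT name, type FROM sqlite_master "
    "WHERE type IN ('table', 'view') ORDER BY name;"
).fetchall()
view_rows = conn.execute("SELECT amount FROM big_demo ORDER BY amount;").fetchall()
conn.close()

print(objects)
print(view_rows)
```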


### Selecting Albums to Purchase

The Chinook store has signed a deal with a new record label.

We want to select the first three albums that will be added to the store, from a list of four. 

All four albums are by artists that don't have any tracks in the store right now - we have the artist names, and the genre of music they produce:

<img src="artists.png" alt="Drawing" style="width: 200px" align="left"/>

The record label specializes in artists from the USA, and they have given Chinook some money to advertise the new albums in the USA, so we're interested in finding out which genres sell the best in the USA.

Let's write a query to find out which genres sell the most tracks in the USA.

In [6]:
c = """WITH genre_total AS
                (
                SELECT
                    g.name,
                    COUNT(t.track_id) absolute_total
                FROM genre g 
                INNER JOIN track t ON t.genre_id = g.genre_id
                INNER JOIN invoice_line il ON il.track_id = t.track_id
                INNER JOIN invoice i ON i.invoice_id = il.invoice_id
                INNER JOIN customer c ON c.customer_id = i.customer_id
                WHERE c.country = 'USA'
                GROUP BY 1
                )
             SELECT 
                 name genre, 
                 absolute_total total,
                 ROUND(CAST(absolute_total AS float) / 
                 (SELECT SUM(absolute_total) FROM genre_total) * 100, 0) percentage
             FROM genre_total
             ORDER BY 2 DESC
             LIMIT 10;
          """

run_query(c)

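| | genre | total | percentage |
|---|---|---|---|
| 0 | Rock | 561 | 53.0 |
| 1 | Alternative & Punk | 130 | 12.0 |
| 2 | Metal | 124 | 12.0 |
| 3 | R&B/Soul | 53 | 5.0 |
| 4 | Blues | 36 | 3.0 |
| 5 | Alternative | 35 | 3.0 |
| 6 | Latin | 22 | 2.0 |
| 7 | Pop | 22 | 2.0 |
| 8 | Hip Hop/Rap | 20 | 2.0 |
| 9 | Jazz | 14 | 1.0 |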


Here we can see the top 10 most popular genres in the USA. We could also have added a `WHERE` clause to keep only the genres that we're interested in:

In [7]:
c = """WITH genre_total AS
                (
                SELECT
                    g.name,
                    COUNT(t.track_id) absolute_total
                FROM genre g 
                INNER JOIN track t ON t.genre_id = g.genre_id
                INNER JOIN invoice_line il ON il.track_id = t.track_id
                INNER JOIN invoice i ON i.invoice_id = il.invoice_id
                INNER JOIN customer c ON c.customer_id = i.customer_id
                WHERE c.country = 'USA'
                GROUP BY 1
                )
             SELECT 
                 name genre, 
                 absolute_total total,
                 ROUND(CAST(absolute_total AS float) / 
                 (SELECT SUM(absolute_total) FROM genre_total) * 100, 0) percentage
             FROM genre_total
             WHERE genre LIKE '%Hip%' OR genre LIKE '%Pop%' OR genre LIKE '%Punk%' OR genre LIKE '%Blues%'
             ORDER BY 2 DESC
             LIMIT 10;
          """

run_query(c)

| | genre | total | percentage |
|---|---|---|---|
| 0 | Alternative & Punk | 130 | 12.0 |
| 1 | Blues | 36 | 3.0 |
| 2 | Pop | 22 | 2.0 |
| 3 | Hip Hop/Rap | 20 | 2.0 |


Ranking the four candidate genres by USA sales (Alternative & Punk, Blues, Pop, then Hip Hop/Rap), the artists we may want to go with are:

1. Red Tone
2. Slim Jim Bites
3. Meteor and the Girls
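
One detail of the queries above is worth spelling out: the `percentage` column divides by the total in the `genre_total` CTE, so filtering genres in the outer `WHERE` clause does not change the denominator. The sketch below shows this on made-up counts (the three genre rows are invented for illustration and are not Chinook data).

```python
import sqlite3

QUERY = """
WITH genre_total(name, absolute_total) AS (
    -- Toy counts, not taken from the Chinook database
    VALUES ('Rock', 6), ('Pop', 3), ('Blues', 1)
)
SELECT
    name AS genre,
    absolute_total AS total,
    ROUND(CAST(absolute_total AS FLOAT) /
          (SELECT SUM(absolute_total) FROM genre_total) * 100, 0) AS percentage
FROM genre_total
WHERE name IN ('Pop', 'Blues')   -- filters rows, but the subquery still sums all genres
ORDER BY total DESC;
"""

conn = sqlite3.connect(":memory:")
genre_share = conn.execute(QUERY).fetchall()
conn.close()

print(genre_share)
```

Pop is reported as 30% of all sales here, not 75% of the filtered rows, which matches how the filtered Chinook result keeps the same percentages as the unfiltered one.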

### Analysing Employee Sales Performance

Each customer for the Chinook store gets assigned to a sales support agent within the company when they first make a purchase.

Let's analyse the purchases of customers belonging to each employee to see if any sales support agent is performing either better or worse than the others.

We'll write a query that finds the total dollar amount of sales assigned to each sales support agent within the company.

In [8]:
c = """WITH subquery AS 
            (
            SELECT
                (e.first_name || ' ' || e.last_name) support_rep,
                e.hire_date,
                -- Note: COUNT(*) counts joined invoice rows, not distinct customers
                COUNT(*) total_customers,
                SUM(i.total) sales
            FROM customer c
            INNER JOIN invoice i ON i.customer_id = c.customer_id
            INNER JOIN employee e ON e.employee_id = c.support_rep_id
            WHERE e.title = 'Sales Support Agent'
            GROUP BY 1 ORDER BY 4 DESC
            )
            SELECT
                *,
                ROUND(sales / (SELECT SUM(sales) FROM subquery) * 100, 0) percentage
            FROM subquery
          """

run_query(c)

| | support_rep | hire_date | total_customers | sales | percentage |
|---|---|---|---|---|---|
| 0 | Jane Peacock | 2017-04-01 00:00:00 | 212 | 1731.51 | 37.0 |
| 1 | Margaret Park | 2017-05-03 00:00:00 | 214 | 1584.0 | 34.0 |
| 2 | Steve Johnson | 2017-10-17 00:00:00 | 188 | 1393.92 | 30.0 |


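We can make a few observations here. We only have three support reps. Of these, Jane Peacock has made the most sales and Steve Johnson the least, with Margaret Park in the middle.

However, when we look at the hire date of each rep, we can see that Jane and Margaret started within about a month of each other, whereas Steve started around six months later. Note that `total_customers` is computed with `COUNT(*)` over the joined invoice rows, so it counts invoices rather than distinct customers. On that measure, Jane and Margaret have handled roughly the same number of invoices, with Steve handling fewer, as we'd expect given his later start.

Based on this, we might argue that Jane has performed the best in terms of sales, with Margaret performing the worst: she has been employed almost as long as Jane and has slightly more invoices, yet her sales total is noticeably lower.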
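
A quick way to put the reps on a more even footing is to divide each rep's sales by the number of invoice rows behind it (the `total_customers` column, which `COUNT(*)` computes over the joined invoices). The figures below are copied from the output above; the `reps` dictionary is just an illustrative container, and this is a rough sanity check rather than a full performance measure.

```python
# Figures copied from the query output: (invoice rows, total sales in dollars).
reps = {
    "Jane Peacock": (212, 1731.51),
    "Margaret Park": (214, 1584.00),
    "Steve Johnson": (188, 1393.92),
}

# Average value of each invoice handled by a rep.
avg_per_invoice = {
    name: sales / invoices for name, (invoices, sales) in reps.items()
}

# Print the reps from highest to lowest average invoice value.
for name, avg in sorted(avg_per_invoice.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {avg:.2f} per invoice")
```

On this measure Jane comes out ahead, while Margaret and Steve are almost level despite Steve's later start.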

### Analysing Sales by Country

We will now write a query that collates data on purchases from different countries (using the country recorded in the `customer` table).

Where a country has only one customer, we'll collect these into an "Other" group.

We'll sort the results by total sales from highest to lowest, with the "Other" group at the very bottom.

For each country, we'll include:

* total number of customers
* total value of sales
* average value of sales per customer
* average order value

In [9]:
c = """WITH full_data AS 
                (
                SELECT 
                    c.country,
                    c.customer_id,
                    i.invoice_id,
                    il.unit_price
                FROM customer c
                INNER JOIN invoice i ON i.customer_id = c.customer_id
                INNER JOIN invoice_line il ON il.invoice_id = i.invoice_id
                ),
            -- This calculates a lot of our statistics, and groups countries with 1 customer under "Other"
            country_other AS
                (
                SELECT
                    country,
                    COUNT(DISTINCT(customer_id)) total_customers,
                    SUM(unit_price) total_sales_value,
                    ROUND(SUM(unit_price) / COUNT(DISTINCT(customer_id)), 2) avg_per_customer,
                    ROUND(SUM(unit_price) / COUNT(DISTINCT(invoice_id)), 2) avg_per_order,
                    CASE 
                        WHEN COUNT(DISTINCT(customer_id)) = 1 THEN 'Other'
                        ELSE country
                    END AS country_final
                FROM full_data
                GROUP BY 1
                ),
            -- This adds a sort column, which we'll use to move "Other" country values to the bottom
            country_sort AS
                (
                SELECT 
                    country_final country,
                    SUM(total_customers) total_customers,
                    total_sales_value,
                    avg_per_customer,
                    avg_per_order, 
                    CASE 
                        WHEN country_final = 'Other' THEN 0
                        ELSE 1
                    END AS sort
                FROM country_other
                GROUP BY 1
                )
        SELECT 
            country "Country",
            total_customers "Customers",
            total_sales_value "Sales Value",
            avg_per_customer "Avg per Customer",
            avg_per_order "Avg per Order"
        FROM country_sort
        ORDER BY sort DESC, total_sales_value DESC
       """

run_query(c)

| | Country | Customers | Sales Value | Avg per Customer | Avg per Order |
|---|---|---|---|---|---|
| 0 | USA | 13 | 1040.49 | 80.04 | 7.94 |
| 1 | Canada | 8 | 535.59 | 66.95 | 7.05 |
| 2 | Brazil | 5 | 427.68 | 85.54 | 7.01 |
| 3 | France | 5 | 389.07 | 77.81 | 7.78 |
| 4 | Germany | 4 | 334.62 | 83.66 | 8.16 |
| 5 | Czech Republic | 2 | 273.24 | 136.62 | 9.11 |
| 6 | United Kingdom | 3 | 245.52 | 81.84 | 8.77 |
| 7 | Portugal | 2 | 185.13 | 92.57 | 6.38 |
| 8 | India | 2 | 183.15 | 91.58 | 8.72 |
| 9 | Other | 15 | 39.6 | 39.6 | 7.92 |


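Our query was a little longer and more complex than it needed to be (something I'll come back and improve at a later point), but it gives us most of what we wanted. One caveat is the "Other" row: `country_sort` groups by `country_final` without aggregating the sales columns, so the 39.6 shown there is taken from a single one-customer country rather than summed across all 15.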

From the results, the three biggest markets by total sales value are the **USA**, **Canada**, and **Brazil**.

However, we can also note that the average spend per customer in the **Czech Republic** (136.62) is high compared to other countries. There's potential here, and marketing efforts directed at this market may prove fruitful. That said, there are only 2 customers, and with such a small sample both could turn out to be outliers as the customer count grows.
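
The "Other" row in the output lists 15 customers but only 39.6 in sales, which suggests its sales figures come from one country rather than the whole group. One possible way to avoid this is to label each customer as "Other" before aggregating, so that every statistic is computed over the full group. The sketch below is a simplified variant, not the original query: it runs on a tiny invented dataset, and it sums `invoice.total` instead of `invoice_line.unit_price` on the assumption that the two agree.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, country TEXT);
    CREATE TABLE invoice (invoice_id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL);
    -- Invented data: two USA customers, one each in Chile and Peru
    INSERT INTO customer VALUES (1, 'USA'), (2, 'USA'), (3, 'Chile'), (4, 'Peru');
    INSERT INTO invoice VALUES (1, 1, 10.0), (2, 1, 5.0), (3, 2, 6.0), (4, 3, 2.0), (5, 4, 4.0);
""")

QUERY = """
WITH customer_country AS (
    -- Relabel customers from single-customer countries *before* aggregating
    SELECT
        c.customer_id,
        CASE
            WHEN (SELECT COUNT(*) FROM customer c2 WHERE c2.country = c.country) = 1
                THEN 'Other'
            ELSE c.country
        END AS country
    FROM customer c
)
SELECT
    cc.country,
    COUNT(DISTINCT cc.customer_id) AS customers,
    ROUND(SUM(i.total), 2) AS sales_value,
    ROUND(SUM(i.total) / COUNT(DISTINCT cc.customer_id), 2) AS avg_per_customer,
    ROUND(SUM(i.total) / COUNT(DISTINCT i.invoice_id), 2) AS avg_per_order
FROM customer_country cc
INNER JOIN invoice i ON i.customer_id = cc.customer_id
GROUP BY cc.country
-- The comparison is 0 for real countries and 1 for 'Other', so 'Other' sorts last
ORDER BY cc.country = 'Other', sales_value DESC;
"""

country_rows = conn.execute(QUERY).fetchall()
conn.close()

for row in country_rows:
    print(row)
```

Here the "Other" row sums both single-customer countries (2.0 + 4.0), and the extra sort column is no longer needed.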

### Customers and Country

Here we will use a couple of subqueries, along with the `WITH` clause to find the customer from each country that has spent the most money at Chinook.

In [10]:
c = """WITH 
            customer_total AS
                (
                SELECT 
                    c.customer_id,
                    i.total
                FROM customer AS c
                INNER JOIN invoice i ON i.customer_id = c.customer_id
                ),  
            customer_total_sum AS 
                (
                SELECT
                    customer_id,
                    SUM(total) total
                FROM customer_total
                GROUP BY 1
                )
        SELECT 
            c.country Country,
            c.first_name || ' ' || c.last_name "Name",
            cts.total Total
        FROM customer_total_sum AS cts
        INNER JOIN customer c ON c.customer_id = cts.customer_id
        GROUP BY 1 ORDER BY 3 DESC LIMIT 5
    """

run_query(c)

| | Country | Name | Total |
|---|---|---|---|
| 0 | Czech Republic | František Wichterlová | 144.54 |
| 1 | Ireland | Hugh O'Reilly | 114.84 |
| 2 | India | Manoj Pareek | 111.87 |
| 3 | Brazil | Luís Gonçalves | 108.9 |
| 4 | Portugal | João Fernandes | 102.96 |


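The query returns five rows, one customer per country, sorted by money spent. One caveat: because it groups by country while selecting the non-aggregated name and total columns, SQLite takes those values from an arbitrary row in each group, so for countries with several customers the customer shown is not guaranteed to be the biggest spender. This sort of information may still prove useful going forward if we wanted to identify why certain customers are spending more than others.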
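
To be sure of getting the highest-spending customer in each country, one option is to compute each country's maximum total separately and join back to it. The sketch below is a minimal illustration on invented customers and invoices, not the original query; the CTE name `country_max` is an assumption, and ties would return more than one row per country.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT, country TEXT
    );
    CREATE TABLE invoice (invoice_id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL);
    -- Invented data: two customers in Brazil, one in the USA
    INSERT INTO customer VALUES
        (1, 'Ana', 'Silva', 'Brazil'),
        (2, 'Bea', 'Costa', 'Brazil'),
        (3, 'Carl', 'Jones', 'USA');
    INSERT INTO invoice VALUES (1, 1, 5.0), (2, 1, 5.0), (3, 2, 12.0), (4, 3, 7.0);
""")

QUERY = """
WITH customer_total AS (
    -- Total spend per customer
    SELECT
        c.customer_id,
        c.country,
        c.first_name || ' ' || c.last_name AS name,
        SUM(i.total) AS total
    FROM customer c
    INNER JOIN invoice i ON i.customer_id = c.customer_id
    GROUP BY c.customer_id
),
country_max AS (
    -- Highest customer total within each country
    SELECT country, MAX(total) AS max_total
    FROM customer_total
    GROUP BY country
)
SELECT ct.country, ct.name, ct.total
FROM customer_total ct
INNER JOIN country_max cm
    ON cm.country = ct.country AND cm.max_total = ct.total
ORDER BY ct.total DESC;
"""

top_customers = conn.execute(QUERY).fetchall()
conn.close()

for row in top_customers:
    print(row)
```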

### Albums vs Individual Tracks

Management are currently considering a new purchasing strategy to save money. The strategy they are considering is to purchase only the most popular tracks from each album from record companies, instead of purchasing every track from an album.

Let's now find out what percentage of purchases are individual tracks vs whole albums.

In order to answer the question, we're going to have to **identify whether each invoice has all the tracks from an album**. We can do this by **getting the list of tracks from an invoice and comparing it to the list of tracks from an album**. 

We can find the album to compare the purchase to by looking up the album that one of the purchased tracks belongs to. It doesn't matter which track we pick, since if it's an album purchase, that album will be the same for all tracks.

Let's now write a query that **categorizes each invoice as either an album purchase or not**, and calculates the following summary statistics:

* **Number of invoices**
* **Percentage of invoices**

We'll then make a recommendation on whether Chinook should continue to buy full albums from record companies.

In [45]:
c = """WITH invoice_first_track AS (
              SELECT 
                    il.invoice_id invoice_id, 
                    MIN(il.track_id) first_track_id 
              FROM 
                invoice_line il 
              GROUP BY 1 ),
            repeat AS (
                SELECT 
                  t.track_id 
                FROM 
                  track t 
                WHERE 
                  t.album_id = (SELECT t2.album_id FROM track t2 WHERE t2.track_id = ifs.first_track_id) 
                  
                EXCEPT
                
                SELECT 
                  il2.track_id 
                FROM 
                  invoice_line il2 
                WHERE 
                  il2.invoice_id = ifs.invoice_id
                  ) IS NULL 
                  
                  AND (
                    SELECT 
                      il2.track_id 
                    FROM 
                      invoice_line il2 
                    WHERE 
                      il2.invoice_id = ifs.invoice_id 
                EXCEPT 
                SELECT 
                  t.track_id 
                FROM 
                  track t 
                WHERE 
                  t.album_id = (
                SELECT 
                  t2.album_id 
                FROM 
                  track t2 
                WHERE 
                  t2.track_id = ifs.first_track_id
                  ))
      
            SELECT * FROM invoice
       """

run_query(c)

DatabaseError: Execution failed on sql 'WITH invoice_first_track AS (
              SELECT 
                    il.invoice_id invoice_id, 
                    MIN(il.track_id) first_track_id 
              FROM 
                invoice_line il 
              GROUP BY 1 ),
            repeat AS (
                SELECT 
                  t.track_id 
                FROM 
                  track t 
                WHERE 
                  t.album_id = (
                        SELECT 
                          t2.album_id 
                        FROM 
                          track t2 
                        WHERE 
                          t2.track_id = ifs.first_track_id
                                ) 
                EXCEPT 
                SELECT 
                  il2.track_id 
                FROM 
                  invoice_line il2 
                WHERE 
                  il2.invoice_id = ifs.invoice_id
                  ) IS NULL 
                  AND (
                    SELECT 
                      il2.track_id 
                    FROM 
                      invoice_line il2 
                    WHERE 
                      il2.invoice_id = ifs.invoice_id 
                EXCEPT 
                SELECT 
                  t.track_id 
                FROM 
                  track t 
                WHERE 
                  t.album_id = (
                SELECT 
                  t2.album_id 
                FROM 
                  track t2 
                WHERE 
                  t2.track_id = ifs.first_track_id
                  )))
      
            SELECT * FROM invoice
       ': near "IS": syntax error
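
The syntax error comes from the shape of the query: `repeat AS ( ... )` is a CTE definition, so SQLite does not accept `IS NULL` after its closing parenthesis, and the alias `ifs` is never defined. One way to make the idea work is to move the two `EXCEPT` comparisons into a `CASE` expression inside a query that reads from `invoice_first_track AS ifs`. The sketch below tries this on a tiny invented dataset (two albums, four invoices); the CTE name `invoice_category` is an assumption, and the numbers are not Chinook results.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE track (track_id INTEGER PRIMARY KEY, album_id INTEGER);
    CREATE TABLE invoice_line (invoice_id INTEGER, track_id INTEGER);
    -- Album 1 holds tracks 1-3, album 2 holds tracks 4-5
    INSERT INTO track VALUES (1, 1), (2, 1), (3, 1), (4, 2), (5, 2);
    -- Invoices 1 and 3 are whole albums; invoices 2 and 4 are loose tracks
    INSERT INTO invoice_line VALUES
        (1, 1), (1, 2), (1, 3),
        (2, 1), (2, 4),
        (3, 4), (3, 5),
        (4, 2);
""")

QUERY = """
WITH invoice_first_track AS (
    SELECT invoice_id, MIN(track_id) AS first_track_id
    FROM invoice_line
    GROUP BY invoice_id
),
invoice_category AS (
    SELECT
        ifs.invoice_id,
        CASE
            WHEN (
                -- Album tracks missing from the invoice (NULL when none are missing)
                SELECT t.track_id FROM track t
                WHERE t.album_id = (SELECT t2.album_id FROM track t2
                                    WHERE t2.track_id = ifs.first_track_id)
                EXCEPT
                SELECT il2.track_id FROM invoice_line il2
                WHERE il2.invoice_id = ifs.invoice_id
            ) IS NULL
            AND (
                -- Invoice tracks that are not on the album (NULL when there are none)
                SELECT il2.track_id FROM invoice_line il2
                WHERE il2.invoice_id = ifs.invoice_id
                EXCEPT
                SELECT t.track_id FROM track t
                WHERE t.album_id = (SELECT t2.album_id FROM track t2
                                    WHERE t2.track_id = ifs.first_track_id)
            ) IS NULL
            THEN 'yes'
            ELSE 'no'
        END AS album_purchase
    FROM invoice_first_track ifs
)
SELECT
    album_purchase,
    COUNT(*) AS number_of_invoices,
    ROUND(COUNT(*) * 100.0 / (SELECT COUNT(*) FROM invoice_category), 1) AS percentage
FROM invoice_category
GROUP BY album_purchase
ORDER BY album_purchase;
"""

album_summary = conn.execute(QUERY).fetchall()
conn.close()

for row in album_summary:
    print(row)
```

An invoice counts as an album purchase only when both set differences are empty, that is, when the invoice's tracks and the album's tracks match exactly.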

In [8]:
import pandas as pd
import sqlite3

In [16]:
def run_query(query):
    with sqlite3.connect('my_database.db') as con:
        return pd.read_sql(query, con)

In [17]:
df = run_query("SELECT country FROM my_table LIMIT 3;")
df

| | country |
|---|---|
| 0 | Brazil |
| 1 | Germany |
| 2 | Canada |
