# 📈  Answering business questions about a record store using SQL
In this project we look at Chinook, a sample SQLite database that contains data about a record store. We will try to answer questions the business might have when making decisions about marketing or pricing strategy. This project is one of the guided projects in Dataquest's data engineering course. The Chinook database is available here: https://www.sqlitetutorial.net/sqlite-sample-database/

We will give guidance in the following areas:
- How should we decide on new artists based on their music genre?
- Which employees perform best, and are there any predictors of their success?
- What should the strategy be in different markets, based on country?
- Is buying whole albums advantageous for the company? How can the pricing strategy be adapted based on this information?

In [8]:
%%capture
%load_ext sql
%sql sqlite:///chinook.db

In [9]:
import pandas as pd
import sqlite3

Before diving into the business questions, we will create 3 helper functions that let us run queries quickly and display tabular results as pandas DataFrames.

In [11]:
db = "chinook.db"

# Creating helper functions

# This function is used for displaying queries as pandas dataframe which enables nicer table formatting
def run_query(q, database):
    with sqlite3.connect(database) as conn:
        return pd.read_sql(q, conn)
    
# This function is used for executing all queries that don't output a table like CREATE, DROP, ALTER etc.
def run_query_no_table(q, database):
    with sqlite3.connect(database) as conn:
        conn.isolation_level = None
        conn.execute(q)

# This function returns all tables in the database for progress checking
def return_all(database):
    query = """SELECT name, type FROM sqlite_master WHERE type in ('table', 'view')"""
    return run_query(query, database)

return_all(db)

Unnamed: 0,name,type
0,albums,table
1,sqlite_sequence,table
2,artists,table
3,customers,table
4,employees,table
5,genres,table
6,invoices,table
7,invoice_items,table
8,media_types,table
9,playlists,table
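
Because `CREATE VIEW` raises an error when the view already exists, re-running a notebook cell that defines a view will fail. A minimal sketch of one way around this is shown below, using SQLite's `IF NOT EXISTS` clause. The `create_view` helper and the in-memory toy table are assumptions for illustration and are not part of the original notebook.

```python
import sqlite3

def create_view(conn, name, select_sql):
    # Hypothetical helper: only interpolate trusted, hard-coded view names here.
    conn.execute(f"CREATE VIEW IF NOT EXISTS {name} AS {select_sql}")

# In-memory toy database so the sketch runs without chinook.db
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (CustomerId INTEGER, Country TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "USA"), (2, "Chile")])

# Calling the helper twice does not raise, unlike a plain CREATE VIEW
select_sql = "SELECT CustomerId FROM customers WHERE Country = 'USA'"
create_view(conn, "us_customers", select_sql)
create_view(conn, "us_customers", select_sql)

views = conn.execute("SELECT name FROM sqlite_master WHERE type = 'view'").fetchall()
us_rows = conn.execute("SELECT CustomerId FROM us_customers").fetchall()
```

Note that `IF NOT EXISTS` keeps the old definition, so when a view's query changes, `DROP VIEW IF EXISTS` followed by `CREATE VIEW` would be needed instead.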


## Question 1:
The business has received a proposal for 4 new US artists to be sold by the record store, but the store wants to start with only 3 of them. The store isn't currently selling any of their music, so we don't have real-life sales data for their work.

The only thing we know, besides the name of each artist, is the genre of their music. To make an informed decision, we will therefore look at the distribution of track sales to US customers grouped by genre. We will then rank the genres and pick the 3 artists whose music falls in the best-selling ones.

Find the genres that sell the most tracks in the US and display them in a pandas table.

In [63]:
# This query gives us a list of all tracks bought by US customers
view_query = """
CREATE VIEW US_tracks AS
SELECT trackId, Quantity
FROM invoices AS i JOIN invoice_items ii ON i.invoiceId = ii.invoiceId
WHERE i.CustomerId IN
    (SELECT CustomerId
    FROM customers
    WHERE Country = 'USA')
"""
run_query_no_table(view_query, db)

# This query returns a list of tracks and their respective genres
view_query = """
CREATE VIEW tracks_by_genres AS
SELECT g.name as genre, trackId
FROM genres g JOIN tracks t ON g.GenreId = t.GenreId
"""
run_query_no_table(view_query, db)

# This query puts together the two queries above by aggregating the number of purchased tracks per each genre
# The result is then sorted based on the share of sales per genre
query = """
SELECT Genre, SUM(ut.Quantity) AS No_Of_Tracks, (1.0*COUNT(*))/(1.0*(SELECT  COUNT(*) FROM US_tracks))*100 AS Share
FROM US_tracks ut LEFT JOIN tracks_by_genres tg ON ut.trackId = tg.trackId
GROUP BY genre
ORDER BY Share DESC
"""

run_query(query, db)

Unnamed: 0,genre,No_Of_Tracks,Share
0,Rock,157,31.781377
1,Latin,91,18.421053
2,Metal,64,12.955466
3,Alternative & Punk,50,10.121457
4,Jazz,22,4.453441
5,Blues,15,3.036437
6,TV Shows,14,2.834008
7,R&B/Soul,12,2.42915
8,Comedy,8,1.619433
9,Classical,8,1.619433
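
The Share column above is computed with `COUNT(*)`, which matches No_Of_Tracks only as long as every invoice line has a Quantity of 1. As a hedged sketch on made-up toy data (the table `us_tracks_toy` and its rows are assumptions, not Chinook data), the share can also be based on `SUM(Quantity)` so that both columns use the same measure:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE us_tracks_toy (genre TEXT, Quantity INTEGER)")
conn.executemany(
    "INSERT INTO us_tracks_toy VALUES (?, ?)",
    [("Rock", 1), ("Rock", 2), ("Jazz", 1)],
)

# Numerator and denominator both count purchased units, not invoice lines
query = """
SELECT genre,
       SUM(Quantity) AS No_Of_Tracks,
       100.0 * SUM(Quantity) / (SELECT SUM(Quantity) FROM us_tracks_toy) AS Share
FROM us_tracks_toy
GROUP BY genre
ORDER BY Share DESC
"""
shares = conn.execute(query).fetchall()
```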


As we can see from the result above, none of the new artists produce music in the record store's top 3 genres by sales. However, if we had to choose purely based on the ranking, we would pick Red Tone, Slim Jim Bites and Meteor and the Girls, as these bands have a better chance of succeeding with the customers of this particular record store.

An alternative would be to look at these artists' sales history at other shops and whether their music has reached good chart positions. However, this won't work if the artists are new.

We should also look beyond US customers: music is a universal language, and people might not decide what to buy simply based on the artist's home country.

## Question 2:
The business wants to analyze the performance of its sales reps (total sales per rep) and look for any indicators that predict their success.

The things that could influence a rep's sales include:
- the distribution of their customers, such as country or average spend
- the manager of the employee
- the country, age and company tenure of the employee
- the number of customers that each rep has

First, we will briefly explore the two main tables, customers and employees. This will help us understand which columns we need in order to explain not only total sales but also the influencing factors.

Let's start with the customers and look at the country distribution and some basic statistics to see whether a customer's country influences sales value.

In [166]:
query = """
SELECT Country, COUNT(DISTINCT c.CustomerId) AS Customers, COUNT(InvoiceId) AS Invoices, 
COUNT(InvoiceId)/COUNT(DISTINCT c.CustomerId) AS Invoices_per_customer, SUM(Total) AS Sales,
SUM(Total)/COUNT(InvoiceId) AS Sales_per_invoice
FROM customers c JOIN invoices i ON c.customerId = i.customerId
GROUP BY Country
ORDER BY Sales_per_invoice DESC
"""
run_query(query, db)

Unnamed: 0,Country,Customers,Invoices,Invoices_per_customer,Sales,Sales_per_invoice
0,Chile,1,7,7,46.62,6.66
1,Ireland,1,7,7,45.62,6.517143
2,Hungary,1,7,7,45.62,6.517143
3,Czech Republic,2,14,7,90.24,6.445714
4,Austria,1,7,7,42.62,6.088571
5,Finland,1,7,7,41.62,5.945714
6,Netherlands,1,7,7,40.62,5.802857
7,India,2,13,6,75.26,5.789231
8,USA,13,91,7,523.06,5.747912
9,Norway,1,7,7,39.62,5.66
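
One detail in the Invoices_per_customer column: both `COUNT` values are integers, so SQLite performs integer division and drops the fractional part. That is why India, with 13 invoices for 2 customers, shows 6 rather than 6.5. A small sketch of the difference, run against an empty in-memory database:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# 13 invoices across 2 customers, as in the India row of the table above.
# Multiplying by 1.0 forces floating-point division.
int_div, real_div = conn.execute("SELECT 13 / 2, 1.0 * 13 / 2").fetchone()
```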


Based on the result above, we can see that most of the customers come from English-speaking countries, which at first glance might lead us to believe that country has an impact on the performance of our employees. However, the calculated averages show that customers are fairly consistent in the number of orders, with only subtle differences in average order value (AOV). We can therefore attempt to cluster the countries into HIGH, MED and LOW value groups and check whether any employee has significantly more customers from one of these buckets.

In [169]:
# We will start by making a view out of our previous query so that we can get some statistics and create the clusters
# The clustering is very simple: (MAX - MIN) / number of clusters, which I'm setting to 3 based on the data preview above

view_query = """
CREATE VIEW customers_by_country AS
SELECT Country, COUNT(DISTINCT c.CustomerId) AS Customers, COUNT(InvoiceId) AS Invoices, 
COUNT(InvoiceId)/COUNT(DISTINCT c.CustomerId) AS Invoices_per_customer, SUM(Total) AS Sales,
SUM(Total)/COUNT(InvoiceId) AS Sales_per_invoice
FROM customers c JOIN invoices i ON c.customerId = i.customerId
GROUP BY Country
ORDER BY Sales_per_invoice DESC
"""
# run_query_no_table(view_query, db)

query = """
SELECT MAX(Sales_per_invoice), MIN(Sales_per_invoice), 
(MAX(Sales_per_invoice) - MIN(Sales_per_invoice))/3 AS Interval_size
FROM customers_by_country
"""
run_query(query, db)

Unnamed: 0,MAX(Sales_per_invoice),MIN(Sales_per_invoice),Interval_size
0,6.66,5.374286,0.428571
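
The interval size above turns into bucket boundaries by stepping up from the minimum. A minimal sketch of that equal-width binning in plain Python follows; the helper names `equal_width_cuts` and `label` are assumptions for illustration. With the MAX and MIN values from the output, the boundaries come out at roughly 5.80 and 6.23.

```python
def equal_width_cuts(lo, hi, n_bins=3):
    # Boundaries between n_bins equally wide intervals on [lo, hi]
    width = (hi - lo) / n_bins
    return [lo + width * k for k in range(1, n_bins)]

def label(value, cuts):
    # Hypothetical helper mirroring a LOW / MED / HIGH CASE expression
    low_cut, high_cut = cuts
    if value > high_cut:
        return "HIGH"
    if value < low_cut:
        return "LOW"
    return "MED"

# MAX and MIN of Sales_per_invoice taken from the query output above
cuts = equal_width_cuts(5.374286, 6.66)
labels = [label(v, cuts) for v in (6.66, 6.0, 5.374286)]
```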


In [176]:
# So our intervals will be HIGH for AOV > 6.2, LOW for AOV < 5.8 and MED for the rest

query = """
SELECT e.EmployeeId, Country_value,
COUNT(DISTINCT c.customerId) AS No_Of_Customers
FROM employees e JOIN customers c ON c.SupportRepId = e.EmployeeId JOIN invoices i ON i.CustomerId = c.CustomerId 
JOIN 
(SELECT Country, Sales,
    CASE 
        WHEN Sales_per_invoice > 6.2 THEN 'HIGH'
        WHEN Sales_per_invoice < 5.8 THEN 'LOW'
        ELSE 'MED'
    END AS Country_value
    FROM customers_by_country
) cc ON cc.country = c.country
WHERE Title = 'Sales Support Agent'
GROUP BY e.EmployeeId, Country_value
"""
run_query(query, db)

Unnamed: 0,EmployeeId,Country_value,No_Of_Customers
0,3,HIGH,2
1,3,LOW,18
2,3,MED,1
3,4,HIGH,1
4,4,LOW,19
5,5,HIGH,2
6,5,LOW,14
7,5,MED,2


Based on this result, we would expect employee number 5 to perform best, followed by number 3, with number 4 finishing last. However, the differences are so small that this clustering probably won't be a good predictor, as we will see later when we look at the actual ranking.

One more thing we can check is the distribution of sales across individual customers, which we could then group into High, Med and Low value based on the sales they generate. Maybe some customers tend to place larger orders and we just cannot see it when the data is grouped by country.

In [86]:
query = """
SELECT c.CustomerId, COUNT(i.InvoiceId) AS Invoices, SUM(Total) AS Sales,
SUM(Total)/COUNT(InvoiceId) AS Sales_per_invoice
FROM customers c JOIN invoices i ON c.customerId = i.customerId
GROUP BY c.CustomerId
"""
run_query(query, db)

Unnamed: 0,CustomerId,Invoices,Sales,Sales_per_invoice
0,1,7,39.62,5.66
1,2,7,37.62,5.374286
2,3,7,39.62,5.66
3,4,7,39.62,5.66
4,5,7,40.62,5.802857
5,6,7,49.62,7.088571
6,7,7,42.62,6.088571
7,8,7,37.62,5.374286
8,9,7,37.62,5.374286
9,10,7,37.62,5.374286


This closer look gives a very similar picture to the country-level one. All orders have a fairly even sales value, so clustering would probably produce an underwhelming result. We have already demonstrated above how the clustering could be done, so we will skip it here.

In this scenario we don't have a lot of data, so we can inspect the whole dataset. For larger tables it would be more advantageous to look at the AVG, MIN and MAX order value, which would show straight away that the spread is very small and probably has very little impact on the performance of our sales reps.
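
A summary query of that kind might look like the sketch below. It runs on a made-up in-memory `invoices` table (the four rows are assumptions, not Chinook data) so that it works without the database file.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE invoices (InvoiceId INTEGER, Total REAL)")
conn.executemany(
    "INSERT INTO invoices VALUES (?, ?)",
    [(1, 2.0), (2, 4.0), (3, 6.0), (4, 8.0)],
)

# One row summarising the spread of order values
query = """
SELECT AVG(Total) AS Avg_order_value,
       MIN(Total) AS Min_order_value,
       MAX(Total) AS Max_order_value
FROM invoices
"""
avg_total, min_total, max_total = conn.execute(query).fetchone()
```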

Now let's look at the employees table:

In [87]:
query = """
SELECT * 
FROM employees
"""
run_query(query, db)

Unnamed: 0,EmployeeId,LastName,FirstName,Title,ReportsTo,BirthDate,HireDate,Address,City,State,Country,PostalCode,Phone,Fax,Email
0,1,Adams,Andrew,General Manager,,1962-02-18 00:00:00,2002-08-14 00:00:00,11120 Jasper Ave NW,Edmonton,AB,Canada,T5K 2N1,+1 (780) 428-9482,+1 (780) 428-3457,andrew@chinookcorp.com
1,2,Edwards,Nancy,Sales Manager,1.0,1958-12-08 00:00:00,2002-05-01 00:00:00,825 8 Ave SW,Calgary,AB,Canada,T2P 2T3,+1 (403) 262-3443,+1 (403) 262-3322,nancy@chinookcorp.com
2,3,Peacock,Jane,Sales Support Agent,2.0,1973-08-29 00:00:00,2002-04-01 00:00:00,1111 6 Ave SW,Calgary,AB,Canada,T2P 5M5,+1 (403) 262-3443,+1 (403) 262-6712,jane@chinookcorp.com
3,4,Park,Margaret,Sales Support Agent,2.0,1947-09-19 00:00:00,2003-05-03 00:00:00,683 10 Street SW,Calgary,AB,Canada,T2P 5G3,+1 (403) 263-4423,+1 (403) 263-4289,margaret@chinookcorp.com
4,5,Johnson,Steve,Sales Support Agent,2.0,1965-03-03 00:00:00,2003-10-17 00:00:00,7727B 41 Ave,Calgary,AB,Canada,T3B 1Y7,1 (780) 836-9987,1 (780) 836-9543,steve@chinookcorp.com
5,6,Mitchell,Michael,IT Manager,1.0,1973-07-01 00:00:00,2003-10-17 00:00:00,5827 Bowness Road NW,Calgary,AB,Canada,T3B 0C5,+1 (403) 246-9887,+1 (403) 246-9899,michael@chinookcorp.com
6,7,King,Robert,IT Staff,6.0,1970-05-29 00:00:00,2004-01-02 00:00:00,590 Columbia Boulevard West,Lethbridge,AB,Canada,T1K 5N8,+1 (403) 456-9986,+1 (403) 456-8485,robert@chinookcorp.com
7,8,Callahan,Laura,IT Staff,6.0,1968-01-09 00:00:00,2004-03-04 00:00:00,923 7 ST NW,Lethbridge,AB,Canada,T1H 1Y8,+1 (403) 467-3351,+1 (403) 467-8772,laura@chinookcorp.com


Again, since we don't have that many employees, we can study the data line by line. All sales agents have the same manager, so the manager cannot explain differences in their performance. All of them are based in Canada, so country doesn't give any rep an advantage either. The two remaining factors to examine are age and how long each rep has been with the company.

In [102]:
# One year up/down in terms of age and tenure will not bring significant difference in this case therefore the 
# approximation of age below is sufficient at this point
query = """
SELECT e.EmployeeId, (DATE() - BirthDate) AS Age, (DATE() - HireDate) AS Tenure, 
COUNT(DISTINCT c.customerId) AS No_Of_Customers, COUNT(i.invoiceId) AS No_Of_Invoices, SUM(Total) AS Sales
FROM employees e JOIN customers c ON c.SupportRepId = e.EmployeeId JOIN invoices i ON i.CustomerId = c.CustomerId
WHERE Title = 'Sales Support Agent'
GROUP BY e.EmployeeId
"""
run_query(query, db)

Unnamed: 0,EmployeeId,Age,Tenure,No_Of_Customers,No_Of_Invoices,Sales
0,3,50,21,21,146,833.04
1,4,76,20,20,140,775.4
2,5,58,20,18,126,720.16
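
The `DATE() - BirthDate` expression works because SQLite converts each date string to a number using its leading digits, so the subtraction is effectively a difference of years. The sketch below compares that approximation with a variant that also checks whether the birthday has already passed. The fixed reference date is an assumption chosen so the result is repeatable; the birth date is Jane Peacock's from the employees table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Fixed reference date (assumption) instead of DATE(), for a repeatable result
ref, birth = "2023-06-01", "1973-08-29 00:00:00"

query = """
SELECT ? - ? AS Approx_age,
       (strftime('%Y', ?) - strftime('%Y', ?))
       - (strftime('%m-%d', ?) < strftime('%m-%d', ?)) AS Exact_age
"""
# Approx_age subtracts the leading years only; Exact_age subtracts one more
# year when the reference month-day falls before the birthday month-day.
approx_age, exact_age = conn.execute(
    query, (ref, birth, ref, birth, ref, birth)
).fetchone()
```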


We don't have many employees, so it is hard to infer relationships between the data points we have, but based on everything we have seen so far we can conclude the following:
1. Having customers from a certain country doesn't seem to be advantageous for a sales rep, since all countries had a very similar average order value and average number of orders per customer
2. All employees are from the same country and have the same manager, so these factors don't influence employee performance either
3. Age also doesn't seem to have an impact: although the youngest employee had the highest sales, the trend doesn't continue for the second-best performer, who is significantly older
4. Finally, the biggest predictor of higher sales is simply the number of customers, which, as we learned from analyzing customer invoices, means more orders (an average of 7 per customer)

We would need more data points in order to confirm all 4 points above.
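
As a quick sanity check on point 4, dividing each rep's sales by their number of customers (figures copied from the employee query output) gives roughly 38.8 to 40.0 per customer. The small spread suggests that total sales mostly track customer count. This is only a back-of-the-envelope sketch on three data points.

```python
# (EmployeeId, No_Of_Customers, Sales) copied from the employee query output
reps = [(3, 21, 833.04), (4, 20, 775.40), (5, 18, 720.16)]

# Sales per customer for each rep, rounded to cents
sales_per_customer = {
    emp_id: round(sales / customers, 2) for emp_id, customers, sales in reps
}

# Gap between the highest and lowest per-customer figure
spread = round(
    max(sales_per_customer.values()) - min(sales_per_customer.values()), 2
)
```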

## Question 3:
In this part we are going to revisit the customer and country distribution we discussed earlier, only this time we are going to group countries with a small customer count (customer count = 1) into a category called Other. For each country we will calculate the following:
- Total number of customers
- Total value of sales
- Average value of sales per customer
- Average order value

Let's first create a view with this modified country field (we will only keep those fields needed for later calculations):

In [118]:
view_query = """
CREATE VIEW new_customers AS
SELECT COUNT(DISTINCT c.CustomerId) AS Customers, COUNT(InvoiceId) AS Invoices, SUM(Total) AS Sales,
CASE 
    WHEN COUNT(DISTINCT c.CustomerId) = 1 THEN 'Other'
    ELSE Country
END AS New_Country
FROM Customers c JOIN invoices i ON c.customerId = i.customerId
GROUP BY Country
"""

run_query_no_table(view_query, db)

Now we have a view with the aggregated values needed for the required calculations, as well as our new "Other" value in the country column. Let's now try to answer the questions that we asked above:

In [123]:
query = """
SELECT New_Country, SUM(Customers) as Total_number_of_customers, SUM(Sales) as Total_value_of_sales, 
SUM(Sales)/SUM(customers) AS Average_sales_per_customer, SUM(Sales)/SUM(Invoices) AS Average_order_value
FROM new_customers
GROUP BY New_Country
ORDER BY Average_sales_per_customer DESC
"""

run_query(query,db)

Unnamed: 0,New_Country,Total_number_of_customers,Total_value_of_sales,Average_sales_per_customer,Average_order_value
0,Czech Republic,2,90.24,45.12,6.445714
1,Other,15,604.3,40.286667,5.755238
2,USA,13,523.06,40.235385,5.747912
3,Germany,4,156.48,39.12,5.588571
4,France,5,195.1,39.02,5.574286
5,Portugal,2,77.24,38.62,5.517143
6,Brazil,5,190.1,38.02,5.431429
7,Canada,8,303.96,37.995,5.427857
8,India,2,75.26,37.63,5.789231
9,United Kingdom,3,112.86,37.62,5.374286
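
The two-step grouping used here (aggregate per country, relabel single-customer countries, then aggregate again) can be tried on a small made-up dataset. The sketch below uses a common table expression instead of a view. The toy rows are assumptions and only borrow Chinook's column names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (CustomerId INTEGER, Country TEXT);
CREATE TABLE invoices (InvoiceId INTEGER, CustomerId INTEGER, Total REAL);
INSERT INTO customers VALUES (1, 'USA'), (2, 'USA'), (3, 'Chile'), (4, 'Norway');
INSERT INTO invoices VALUES (1, 1, 10.0), (2, 2, 20.0), (3, 3, 5.0), (4, 4, 7.0);
""")

# Step 1 (CTE): aggregate per country and relabel single-customer countries.
# Step 2 (outer query): aggregate again on the new label.
query = """
WITH per_country AS (
    SELECT CASE
               WHEN COUNT(DISTINCT c.CustomerId) = 1 THEN 'Other'
               ELSE c.Country
           END AS New_Country,
           COUNT(DISTINCT c.CustomerId) AS Customers,
           SUM(i.Total) AS Sales
    FROM customers c JOIN invoices i ON c.CustomerId = i.CustomerId
    GROUP BY c.Country
)
SELECT New_Country, SUM(Customers), SUM(Sales)
FROM per_country
GROUP BY New_Country
ORDER BY New_Country
"""
grouped = conn.execute(query).fetchall()
```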


From the data above we can give the following recommendations:
- AOV and average sales per customer are highest in the Czech Republic, but we have only 2 customers there. If we could win more Czech customers who behave like our current ones, we might be able to accelerate our growth.
- Another interesting strategy would be to push for more orders in India, which has the second-largest AOV. Building better relationships with our existing clients there could be a strategy moving forward.
- Still, sales appear to be mostly correlated with the number of customers we have in a given country. Our marketing team should therefore focus on gaining new customers in countries where we have a small presence, such as the Czech Republic, Portugal and India. In countries where we already have a lot of customers, we could experiment with pricing to boost AOV or run promotions to increase the number of orders per customer.

## Question 4:
Management would like to change the pricing strategy for buying albums vs individual tracks, so we need to find out what share of invoices are album purchases and what share contain only individual tracks.

In [165]:
# Customers can buy either one full album or individual tracks (they can also buy a whole album by buying all of
# its tracks individually)
# The price for buying an album vs all of its tracks individually is the same

# Step 1: get the number of tracks per albumId
# Step 2: for each invoice, compare the number of tracks on the invoice with the summed track counts of the albums
# those tracks belong to. Because a customer cannot buy 1 album and then add other tracks to it in one purchase,
# we treat the invoice as an album purchase when the two numbers are equal.

view_query = """
CREATE VIEW album_tracks AS
SELECT AlbumId, COUNT(TrackId) AS No_of_tracks
FROM tracks
GROUP BY AlbumId
"""

# run_query_no_table(view_query,db)

view_query = """
CREATE VIEW categorized_invocies AS
SELECT InvoiceId, COUNT(i.TrackId), SUM(No_of_tracks),
CASE 
    WHEN COUNT(i.TrackId) = SUM(No_of_tracks) THEN 'Album'
    ELSE 'Individual tracks'
END AS Purchase_type
FROM invoice_items i JOIN tracks t ON i.trackId = t.trackId JOIN album_tracks a ON a.albumId = t.albumId
GROUP BY InvoiceId
"""

# run_query_no_table(view_query,db)

query = """
SELECT Purchase_type, COUNT(InvoiceId) AS Invoices, 1.0*COUNT(InvoiceId)/(SELECT COUNT(*) FROM categorized_invocies)*100 AS Share
FROM categorized_invocies
GROUP BY Purchase_type
"""
run_query(query, db)

Unnamed: 0,Purchase_type,Invoices,Share
0,Album,6,1.456311
1,Individual tracks,406,98.543689


Based on the result above, only 6 invoices consisted purely of albums, which suggests that it might be wise to stop purchasing full albums and focus more on songs that do well in the charts.

Alternatively, if the company wanted to boost its AOV, a good pricing strategy might be to make albums cost less than buying all of their songs individually. Some customers who already want several songs from one album might then pay the extra dollar in exchange for getting more songs.
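
One caveat about the album check in the categorized_invocies view: because every invoice line is joined to its album's track count, `SUM(No_of_tracks)` adds the album size once per purchased track. A full purchase of a 3-track album therefore gives a COUNT of 3 but a SUM of 9, so the two can only match when every track on the invoice comes from a single-track album. The sketch below shows both numbers side by side with one possible stricter check, on made-up toy tables that only borrow Chinook's column names. It assumes that an album purchase means exactly one album per invoice, which is an assumption about the business rule rather than something the notebook confirms.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tracks (TrackId INTEGER, AlbumId INTEGER);
CREATE TABLE invoice_items (InvoiceId INTEGER, TrackId INTEGER);
-- Toy data: album 1 has 3 tracks, album 2 has 2 tracks
INSERT INTO tracks VALUES (1, 1), (2, 1), (3, 1), (4, 2), (5, 2);
-- Invoice 1: all of album 1; 2: a mix; 3: all of album 2; 4: a single track
INSERT INTO invoice_items VALUES
    (1, 1), (1, 2), (1, 3), (2, 1), (2, 4), (3, 4), (3, 5), (4, 1);
""")

# 'Album' only when all tracks come from one album and cover it completely
query = """
SELECT ii.InvoiceId,
       COUNT(ii.TrackId) AS Tracks_bought,
       SUM(a.No_of_tracks) AS Summed_album_sizes,
       CASE
           WHEN COUNT(DISTINCT t.AlbumId) = 1
                AND COUNT(DISTINCT ii.TrackId) = MAX(a.No_of_tracks) THEN 'Album'
           ELSE 'Individual tracks'
       END AS Purchase_type
FROM invoice_items ii
JOIN tracks t ON ii.TrackId = t.TrackId
JOIN (SELECT AlbumId, COUNT(*) AS No_of_tracks FROM tracks GROUP BY AlbumId) a
     ON a.AlbumId = t.AlbumId
GROUP BY ii.InvoiceId
ORDER BY ii.InvoiceId
"""
rows = conn.execute(query).fetchall()
```

If the same effect holds on the real data, the 6 "Album" invoices reported above may undercount album purchases, so the share would be worth recomputing before acting on it.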