## Answering Business Questions using SQL
---

### Introduction

In this guided project, we'll use the Chinook database, which is provided as a SQLite database file called `chinook.db`. The Chinook database contains information about a fictional digital music shop, a kind of mini-iTunes store.

First, we connect to the database.

In [1]:
%%capture
%load_ext sql
%sql sqlite:///chinook.db

Now let's list the tables and views in the database.

In [2]:
%%sql
DROP VIEW IF EXISTS track_artist;

SELECT name, type
  FROM sqlite_master
 WHERE type IN ("table","view");

 * sqlite:///chinook.db
Done.
Done.


name,type
albums,table
sqlite_sequence,table
artists,table
customers,table
employees,table
genres,table
invoices,table
invoice_items,table
media_types,table
playlists,table


The database schema is shown below.

![Image](https://s3.amazonaws.com/dq-content/191/chinook-schema.svg)

In [3]:
%%html
<!-- left alignment for the table below -->
<style>
  table {margin-left: 0 !important;}
</style>

### Popular genres in the USA

Let's pretend that the Chinook store has signed a deal with a record label. Under the deal, three albums from the following list of four can be added to the store:

| Artist Name          | Genre   |
|----------------------|---------|
| Regal                | Hip-Hop |
| Red Tone             | Punk    |
| Meteor and the Girls | Pop     |
| Slim Jim Bites       | Blues   |

Our task is to pick the most profitable albums. To do that, we'll find out which genres are the most popular by counting the number of tracks sold.
Since the label specializes in artists from the **USA**, we'll look only at sales in that country.

In [5]:
%%sql
WITH invoice_tracks AS
                 (SELECT il.InvoiceId, il.TrackId,
                         il.quantity AS track_qty
                    FROM invoice_items AS il
                         INNER JOIN invoices AS i
                         ON il.InvoiceId = i.InvoiceId
                  
                   WHERE i.BillingCountry = "USA"
                 ),
    
     track_genres AS
                 (SELECT t.TrackId,
                         g.name AS genre
                    FROM tracks AS t
                         INNER JOIN genres AS g
                         ON t.GenreId = g.GenreId
                 )

SELECT tg.genre,
       SUM(it.track_qty) AS tracks_sold,
       ROUND(CAST(SUM(it.track_qty) AS Float)*100/(
                                               SELECT SUM(it.track_qty)
                                                 FROM invoice_tracks AS it
                                                   ),
                      2) || "%" AS tracks_sold_pct
  FROM invoice_tracks AS it
       LEFT JOIN track_genres AS tg
       ON tg.TrackId = it.TrackId
        
 GROUP BY 1
 ORDER BY 2 DESC;

 * sqlite:///chinook.db
Done.


genre,tracks_sold,tracks_sold_pct
Rock,157,31.78%
Latin,91,18.42%
Metal,64,12.96%
Alternative & Punk,50,10.12%
Jazz,22,4.45%
Blues,15,3.04%
TV Shows,14,2.83%
R&B/Soul,12,2.43%
Comedy,8,1.62%
Classical,8,1.62%


The table above shows the number of tracks sold in the USA by genre. Comparing it with the label's list gives the following recommendations:
* **Red Tone** - Punk falls under Alternative & Punk, which accounts for **10.12%** of sales
* **Slim Jim Bites** - Blues tracks account for **3.04%** of sales
* **Meteor and the Girls** - Pop is not among the ten genres listed above, so it accounts for at most **1.62%** of sales

These three albums should bring the most profit to the Chinook store. The remaining album, by **Regal**, is Hip-Hop, a genre that sells even less than Pop.
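The comparison between the label's list and the per-genre sales can also be done in code. The sketch below is only an illustration using Python's built-in `sqlite3` module: `genre_sales` is a hypothetical in-memory table with invented totals standing in for the query result, and the genre spellings (`Hip Hop/Rap`, `Alternative & Punk`) are assumptions about how the store labels these styles.

```python
import sqlite3

# Artist -> store genre. The genre spellings are assumptions; adjust them
# to whatever the genres table actually contains.
candidates = {
    "Regal": "Hip Hop/Rap",
    "Red Tone": "Alternative & Punk",
    "Meteor and the Girls": "Pop",
    "Slim Jim Bites": "Blues",
}

conn = sqlite3.connect(":memory:")
# Invented per-genre totals, standing in for the result of the genre query.
conn.executescript("""
    CREATE TABLE genre_sales (genre TEXT PRIMARY KEY, tracks_sold INTEGER);
    INSERT INTO genre_sales VALUES
        ('Rock', 100), ('Alternative & Punk', 40), ('Blues', 12),
        ('Pop', 9), ('Hip Hop/Rap', 7);
""")

# Fetch the totals for the candidate genres only.
placeholders = ", ".join("?" for _ in candidates)
sold = dict(conn.execute(
    f"SELECT genre, tracks_sold FROM genre_sales WHERE genre IN ({placeholders})",
    list(candidates.values()),
))

# Rank the four artists by their genre's sales and keep the best three.
ranked = sorted(candidates, key=lambda artist: sold.get(candidates[artist], 0), reverse=True)
top_three = ranked[:3]
print(top_three)
```

A genre with no sales at all is missing from `sold`, so `sold.get(..., 0)` ranks it last instead of raising an error.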

### Best sales support agent

Each customer for the Chinook store gets assigned to a sales support agent within the company when they first make a purchase. We've been asked to analyze the purchases of customers belonging to each employee to see if any sales support agent is performing either better or worse than the others.

Let's write the corresponding query.

In [6]:
%%sql
WITH customer_sales AS
                 (SELECT c.SupportRepId, c.CustomerId,
                         i.total
                    FROM customers AS c
                         LEFT JOIN invoices AS i
                         ON c.CustomerId = i.CustomerId
                 )

SELECT e.FirstName || " " || e.LastName AS employee_name,
       DATE() - e.birthdate AS age,
       DATE(e.HireDate) AS hire_date,
       ROUND(SUM(cs.total), 0) AS sales,
       COUNT(DISTINCT cs.CustomerId) AS num_of_customers,
       ROUND(AVG(cs.total), 2) AS avg_single_sale,
       ROUND(SUM(cs.total)/COUNT(DISTINCT cs.CustomerId), 2) AS avg_sale_per_customer
  FROM employees AS e
       INNER JOIN customer_sales AS cs
       ON e.EmployeeId = cs.SupportRepId
        
 GROUP BY 1
 ORDER BY 4 DESC;

 * sqlite:///chinook.db
Done.


employee_name,age,hire_date,sales,num_of_customers,avg_single_sale,avg_sale_per_customer
Jane Peacock,48,2002-04-01,833.0,21,5.71,39.67
Margaret Park,74,2003-05-03,775.0,20,5.54,38.77
Steve Johnson,56,2003-10-17,720.0,18,5.72,40.01


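According to the table above, **Jane Peacock** is the best sales support agent by volume. Her stats:
* total sales - **833**
* number of customers - **21**
* average single sale - **5.71**
* average sales per customer - **39.67**

Her two averages are nearly level with those of the other agents (Steve Johnson's are marginally higher), so her lead comes from serving more customers. She's the most experienced agent because she was hired first, and she's also the youngest. That probably explains her high results.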
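One caveat about the `age` column: `DATE() - e.birthdate` subtracts two text values, and SQLite converts each to a number from its leading digits. The result is therefore the difference between the two years, regardless of whether the birthday has already passed. The sketch below shows the effect using Python's `sqlite3` module, with hypothetical dates and a fixed reference date instead of today's.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical dates, chosen so that the birthday has not yet occurred in the
# reference year. They are not taken from the employees table.
birthdate = "1973-08-29 00:00:00"
reference = "2020-06-01"

# Text minus text: SQLite reads the leading digits of each string (2020 and
# 1973), so only the years are compared.
year_diff = conn.execute("SELECT ? - ?", (reference, birthdate)).fetchone()[0]

# Comparing 'YYYY.MMDD' values keeps month and day in the fraction, so the
# truncated difference is the number of completed years.
full_years = conn.execute(
    "SELECT CAST(strftime('%Y.%m%d', ?) - strftime('%Y.%m%d', ?) AS INTEGER)",
    (reference, birthdate),
).fetchone()[0]

print(year_diff, full_years)
```

With these dates the first expression gives 47 and the second 46. The one-year gap does not change the ranking of the agents, but it matters if the exact ages are reported.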

### Country sales analysis

Our next task is to analyze the sales data for customers from each country, using the country value from the **customers table**. In particular, we have been asked to calculate, for each country, the:

* total number of customers
* total value of sales
* average value of sales per customer
* average order value

We should also group all countries that have only one customer as **"Other"** and put this group at the end of the table.
Let's write the query.

In [7]:
%%sql
WITH other_countries AS
                 (SELECT AVG(i.total) AS avg_order_value,
                         SUM(i.total) AS total,
                         COUNT(DISTINCT c.CustomerId) AS customers,
                         CASE
                             WHEN COUNT(DISTINCT c.CustomerId) = 1 THEN "Other"
                             ELSE c.country
                         END AS country
                    FROM customers AS c
                         LEFT JOIN invoices AS i
                         ON c.CustomerId = i.CustomerId
                   GROUP BY c.country
                 ),
    
     world AS
            (SELECT AVG(i.total) AS avg_order_value,
                    SUM(i.total) AS total,
                    COUNT(DISTINCT c.CustomerId) AS customers,
                    "World" AS country
               FROM customers AS c
                    LEFT JOIN invoices AS i
                    ON c.CustomerId = i.CustomerId
            ),
            
     final_countries AS
            (SELECT *
               FROM other_countries
            
             UNION
            
             SELECT *
               FROM world
            ) 
             

SELECT country,
       SUM(customers) AS customers,
       ROUND(SUM(total), 2) AS total_sales_value,
       ROUND(SUM(total)*100/(
                             SELECT SUM(total)
                               FROM other_countries
                            ),
                 2) || "%" AS total_sales_pct,
       ROUND(SUM(total)/SUM(customers), 2) AS avg_sales_per_customer,
       ROUND(AVG(avg_order_value), 2) AS avg_order_value
  FROM final_countries
 GROUP BY 1
 ORDER BY CASE
              WHEN country = "Other" THEN 1
              WHEN country = "World" THEN 2
              ELSE 0
          END, 3 DESC;

 * sqlite:///chinook.db
Done.


country,customers,total_sales_value,total_sales_pct,avg_sales_per_customer,avg_order_value
USA,13,523.06,22.46%,40.24,5.75
Canada,8,303.96,13.05%,38.0,5.43
France,5,195.1,8.38%,39.02,5.57
Brazil,5,190.1,8.16%,38.02,5.43
Germany,4,156.48,6.72%,39.12,5.59
United Kingdom,3,112.86,4.85%,37.62,5.37
Czech Republic,2,90.24,3.88%,45.12,6.45
Portugal,2,77.24,3.32%,38.62,5.52
India,2,75.26,3.23%,37.63,5.79
Other,9,370.58,15.91%,41.18,5.88


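There are some quite interesting findings.

1. Sales in the USA brought Chinook about **523** dollars, **22%** of all revenue. That is comparable with the single-customer countries grouped as **Other**: the nine named countries sum to 74.05% of sales, which leaves about **26%** for **Other**. The **Other** row itself shows only **15.91%** because `UNION` in `final_countries` merges single-customer countries whose rows are identical; `UNION ALL` would keep them all.
2. Customers from the Czech Republic are real music lovers. They have the highest average order value, **6.45**, and the highest average sales per customer, **45.12**.
3. In addition, the query computes the same values for the whole **world**, so we can easily compare the sales results of any country with the global results.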
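The grouping of single-customer countries can also be written so that every customer is labelled first and the aggregates are computed afterwards. The sketch below is one possible arrangement, not the notebook's query: it uses Python's `sqlite3` module with invented rows and hypothetical snake_case table and column names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Invented rows for illustration; this is not the Chinook data.
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, country TEXT);
    CREATE TABLE invoices (invoice_id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL);
    INSERT INTO customers VALUES
        (1, 'USA'), (2, 'USA'), (3, 'Chile'), (4, 'Norway'), (5, 'India'), (6, 'India');
    INSERT INTO invoices (customer_id, total) VALUES
        (1, 10.0), (1, 6.0), (2, 5.0), (3, 8.0), (4, 4.0), (5, 3.0), (6, 3.0), (6, 6.0);
""")

rows = conn.execute("""
    WITH per_country AS (
        SELECT country, COUNT(*) AS n_customers
          FROM customers
         GROUP BY country
    ),
    labelled AS (
        -- Relabel each customer whose country has exactly one customer.
        SELECT c.customer_id,
               CASE WHEN pc.n_customers = 1 THEN 'Other' ELSE c.country END AS country_group
          FROM customers AS c
               INNER JOIN per_country AS pc ON pc.country = c.country
    )
    SELECT l.country_group,
           COUNT(DISTINCT l.customer_id) AS customers,
           ROUND(SUM(i.total), 2) AS total_sales,
           ROUND(SUM(i.total) / COUNT(DISTINCT l.customer_id), 2) AS avg_sales_per_customer,
           ROUND(AVG(i.total), 2) AS avg_order_value
      FROM labelled AS l
           INNER JOIN invoices AS i ON i.customer_id = l.customer_id
     GROUP BY l.country_group
     -- The comparison is 0 for named countries and 1 for 'Other', so 'Other' sorts last.
     ORDER BY l.country_group = 'Other', total_sales DESC
""").fetchall()

for row in rows:
    print(row)
```

Because the label is attached per customer before grouping, no set operation is involved and no row can be merged away. The average order value of the "Other" group is also computed over its individual orders rather than as an average of per-country averages.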

### Individual tracks vs whole albums

The Chinook store is set up in a way that allows customers to make purchases in one of two ways:

* purchase a whole album
* purchase a collection of one or more individual tracks

Management are currently considering changing their purchasing strategy to save money. The strategy they are considering is to purchase only the most popular tracks from each album from record companies, instead of purchasing every track from an album.

We have been asked to find out what percentage of purchases are individual tracks vs whole albums, so that management can use this data to understand the effect this decision might have on overall revenue.

In [8]:
%%sql
WITH tracks_invoice AS
                  (SELECT il.InvoiceId,
                          t.AlbumId,
                          il.TrackId                          
                     FROM invoice_items AS il
                          LEFT JOIN tracks AS t
                          ON il.TrackId = t.TrackId
                  ),
    
     tracks_album AS
                  (SELECT a.AlbumId,
                          t.TrackId
                     FROM albums AS a
                          LEFT JOIN tracks AS t
                          ON a.AlbumId = t.AlbumId
                  )
            
SELECT CASE
           WHEN (
                 SELECT TrackId
                   FROM tracks_invoice AS ti_in
                  WHERE ti_in.InvoiceId = ti.InvoiceId
                 EXCEPT
                 SELECT TrackId
                   FROM tracks_album AS ta
                WHERE ta.AlbumId = ti.AlbumId
                 ) IS NULL
            AND (
                 SELECT TrackId
                   FROM tracks_album AS ta
                 WHERE ta.AlbumId = ti.AlbumId
                 EXCEPT
                 SELECT TrackId
                   FROM tracks_invoice AS ti_in
                 WHERE ti_in.InvoiceId = ti.InvoiceId
                ) IS NULL THEN "Full album"
            ELSE "Tracks"
        END AS invoice_type,
       COUNT(DISTINCT ti.InvoiceId) AS invoice_qty,
       ROUND(COUNT(DISTINCT ti.InvoiceId)*100/
                                              (SELECT CAST(COUNT(*) AS Float)
                                                 FROM invoices
                                              ), 1) || "%" AS invoice_pct
  FROM tracks_invoice AS ti
 GROUP BY 1;

 * sqlite:///chinook.db
Done.


invoice_type,invoice_qty,invoice_pct
Full album,2,0.5%
Tracks,410,99.5%


Only **0.5%** of invoices (2 invoices) are whole-album purchases. The other **99.5%** are purchases of individual tracks.
Based on that, we can suggest that the new purchasing strategy will succeed and save Chinook money.
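The classification rests on a set comparison: an invoice counts as an album purchase only when the set of tracks on the invoice and the set of tracks on the album are equal, that is, when both `EXCEPT` differences are empty. The sketch below restates that idea on invented rows using Python's `sqlite3` module. The snake_case names and the use of `MIN(album_id)` to pick one album per invoice are assumptions made for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Invented rows: album 1 has tracks 1-3, album 2 has tracks 4-5.
conn.executescript("""
    CREATE TABLE tracks (track_id INTEGER PRIMARY KEY, album_id INTEGER);
    CREATE TABLE invoice_items (invoice_id INTEGER, track_id INTEGER);
    INSERT INTO tracks VALUES (1, 1), (2, 1), (3, 1), (4, 2), (5, 2);
    INSERT INTO invoice_items VALUES
        (100, 1), (100, 2), (100, 3),   -- every track of album 1
        (101, 1), (101, 4),             -- tracks from two albums
        (102, 4);                       -- only part of album 2
""")

rows = conn.execute("""
    WITH invoice_album AS (
        -- One candidate album per invoice.
        SELECT ii.invoice_id, MIN(t.album_id) AS album_id
          FROM invoice_items AS ii
               INNER JOIN tracks AS t ON t.track_id = ii.track_id
         GROUP BY ii.invoice_id
    )
    SELECT ia.invoice_id,
           CASE
               -- No invoice track lies outside the album ...
               WHEN NOT EXISTS (SELECT track_id FROM invoice_items WHERE invoice_id = ia.invoice_id
                                EXCEPT
                                SELECT track_id FROM tracks WHERE album_id = ia.album_id)
               -- ... and no album track is missing from the invoice.
                AND NOT EXISTS (SELECT track_id FROM tracks WHERE album_id = ia.album_id
                                EXCEPT
                                SELECT track_id FROM invoice_items WHERE invoice_id = ia.invoice_id)
               THEN 'Full album'
               ELSE 'Tracks'
           END AS invoice_type
      FROM invoice_album AS ia
     ORDER BY ia.invoice_id
""").fetchall()

invoice_types = dict(rows)
print(invoice_types)
```

`NOT EXISTS` states the emptiness test directly, whereas comparing a scalar subquery with `IS NULL` relies on an empty subquery evaluating to `NULL`.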

### Most popular artist

To answer this question, we'll use the following criterion:
>  The artist should appear in the most playlists.

Let's first create a view.

In [9]:
%%sql
CREATE VIEW track_artist AS
                                  SELECT ar.name, t.TrackId, pt.PlaylistId
                                     FROM tracks AS t
                                          LEFT JOIN albums AS al
                                          ON t.AlbumId = al.AlbumId
                   
                                          LEFT JOIN artists AS ar
                                          ON al.ArtistId = ar.ArtistId
                    
                                          LEFT JOIN playlist_track AS pt
                                          ON t.TrackId = pt.TrackId
                                  

 * sqlite:///chinook.db
Done.


[]

Now let's apply the criterion above to the view we created.

In [10]:
%%sql
SELECT name,
       COUNT(DISTINCT PlaylistId) AS in_playlist,
       ROUND(COUNT(DISTINCT PlaylistId)*100/(
                                        SELECT CAST(COUNT(DISTINCT PlaylistId) AS Float)
                                          FROM track_artist
                                       ), 1) || "%" AS in_playlist_pct
  FROM track_artist
 GROUP BY 1
 ORDER BY 2 DESC
 LIMIT 10;

 * sqlite:///chinook.db
Done.


Name,in_playlist,in_playlist_pct
Eugene Ormandy,7,50.0%
The King's Singers,6,42.9%
English Concert & Trevor Pinnock,6,42.9%
Berliner Philharmoniker & Herbert Von Karajan,6,42.9%
Academy of St. Martin in the Fields & Sir Neville Marriner,6,42.9%
Yo-Yo Ma,5,35.7%
Wilhelm Kempff,5,35.7%
Ton Koopman,5,35.7%
"Sir Georg Solti, Sumi Jo & Wiener Philharmoniker",5,35.7%
Sir Georg Solti & Wiener Philharmoniker,5,35.7%


**Eugene Ormandy**'s tracks appear in **7** playlists, more than any other artist's. But what if we took a different criterion?

> The artist's tracks should have the most playlist appearances in total.

In [11]:
%%sql
SELECT name,
       COUNT(PlaylistId) AS in_playlist,
       ROUND(COUNT(PlaylistId)*100/(
                                    SELECT CAST(COUNT(PlaylistId) AS Float)
                                      FROM track_artist
                                     ), 1) || "%" AS in_playlist_pct
  FROM track_artist
 GROUP BY 1
 ORDER BY 2 DESC
 LIMIT 10;

 * sqlite:///chinook.db
Done.


Name,in_playlist,in_playlist_pct
Iron Maiden,516,5.9%
U2,333,3.8%
Metallica,296,3.4%
Led Zeppelin,252,2.9%
Deep Purple,226,2.6%
Lost,184,2.1%
Pearl Jam,177,2.0%
Faith No More,145,1.7%
Eric Clapton,145,1.7%
Lenny Kravitz,143,1.6%


According to the table above, **Iron Maiden** is the most popular artist by this criterion. Their tracks appear **516** times in the playlists.
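The two rankings differ only in the aggregate used: `COUNT(DISTINCT PlaylistId)` counts how many different playlists feature an artist, while `COUNT(PlaylistId)` counts every track-playlist pairing. The sketch below shows the difference on invented rows shaped like the `track_artist` view, using Python's `sqlite3` module. The artist names and snake_case columns are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Invented rows: one row per (track, playlist) pairing, with NULL when a
# track is in no playlist at all.
conn.executescript("""
    CREATE TABLE track_artist (name TEXT, track_id INTEGER, playlist_id INTEGER);
    INSERT INTO track_artist VALUES
        ('Artist A', 1, 1), ('Artist A', 1, 2), ('Artist A', 1, 3),
        ('Artist B', 2, 1), ('Artist B', 3, 1), ('Artist B', 4, 1), ('Artist B', 5, 2),
        ('Artist C', 6, NULL);
""")

rows = conn.execute("""
    SELECT name,
           COUNT(DISTINCT playlist_id) AS playlists,    -- different playlists
           COUNT(playlist_id)          AS appearances   -- track-playlist pairings
      FROM track_artist
     GROUP BY name
     ORDER BY name
""").fetchall()

for row in rows:
    print(row)
```

Artist A leads on distinct playlists (one track in three playlists) while Artist B leads on appearances (four tracks across two playlists). Both aggregates ignore `NULL`, so an artist whose tracks are in no playlist scores zero on each.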

### Purchased vs not purchased tracks

There are a lot of tracks in the Chinook store. Obviously, some of them have never been sold. Let's find out how many.

In [12]:
%%sql
SELECT CASE
           WHEN il.InvoiceId IS NULL THEN "Not purchased"
           ELSE "Purchased"
       END AS purchased_or_not,
       COUNT(DISTINCT t.TrackId) AS track_qty,
       ROUND(COUNT(DISTINCT t.TrackId)*100/(
                                    SELECT CAST(COUNT(*) AS Float)
                                      FROM tracks
                                   ), 1) || "%" AS track_pct
  FROM tracks AS t
       LEFT JOIN invoice_items AS il
       ON t.TrackId = il.TrackId
 GROUP BY 1;

 * sqlite:///chinook.db
Done.


purchased_or_not,track_qty,track_pct
Not purchased,1519,43.4%
Purchased,1984,56.6%


Now we can see that **43.4%** of tracks have never been sold. Let's explore the genres of the purchased tracks. We'll find out how many of the tracks in stock have been sold in each genre.

In [13]:
%%sql
WITH tracks_genre_invoice AS
                      (SELECT g.name AS genre,
                              t.TrackId,
                              il.InvoiceId
                         FROM tracks AS t
                              LEFT JOIN genres AS g
                              ON t.GenreId = g.GenreId
                       
                              LEFT JOIN invoice_items AS il
                              ON t.TrackId = il.TrackId
                      ),
    
    sold_tracks_count AS
                      (SELECT tgi.genre,
                              COUNT(DISTINCT tgi.TrackId) AS tracks_in_stock,
    
                              COUNT(DISTINCT CASE
                                                 WHEN tgi.InvoiceId IS NOT NULL THEN tgi.TrackId 
                                              END) AS tracks_sold,
        
                              ROUND(COUNT(DISTINCT CASE
                                                        WHEN tgi.InvoiceId IS NOT NULL THEN tgi.TrackId 
                                                     END
                                                        )*100/COUNT(DISTINCT tgi.TrackId), 1) AS track_sold_pct
                         FROM tracks_genre_invoice AS tgi

                        GROUP BY 1
                      ),
    
    united_sales AS
                   (SELECT *
                      FROM sold_tracks_count

                    UNION

                    SELECT 'TOP_10' AS genre,
                           SUM(tracks_in_stock) AS tracks_in_stock,
                           SUM(tracks_sold) AS tracks_sold,
                           AVG(track_sold_pct) AS track_sold_pct
                      FROM (
                            SELECT *
                              FROM sold_tracks_count
                             ORDER BY 4 DESC
                             LIMIT 10
                           )
    
                    UNION

                    SELECT 'BOTTOM_10' AS genre,
                           SUM(tracks_in_stock) AS tracks_in_stock,
                           SUM(tracks_sold) AS tracks_sold,
                           AVG(track_sold_pct) AS track_sold_pct
                      FROM (
                            SELECT *
                              FROM sold_tracks_count
                             ORDER BY 4
                             LIMIT 10
                           )
                   )

SELECT *
  FROM united_sales
 ORDER BY CASE
              WHEN genre = "TOP_10" THEN 1
              WHEN genre = "BOTTOM_10" THEN 2
              ELSE 0
          END, 4 DESC;

 * sqlite:///chinook.db
Done.


genre,tracks_in_stock,tracks_sold,track_sold_pct
Bossa Nova,15,14,93.0
Sci Fi & Fantasy,26,20,76.0
Blues,81,53,65.0
Alternative & Punk,332,203,61.0
Metal,374,231,61.0
R&B/Soul,61,37,60.0
Latin,579,340,58.0
Rock,1297,745,57.0
Pop,48,26,54.0
Jazz,130,68,52.0


Clearly, **10** genres are really unpopular: only **11%** or less of their tracks in stock have ever been sold. I suggest removing them from the Chinook store entirely, which would be **312** tracks. It also seems that some of them are not actual music tracks at all.

I also suggest cutting the number of **Latin** tracks in half, which would remove about **290** tracks.

In place of the removed tracks, the Chinook store should **increase** the number of tracks from the **top 10** genres in the table above.
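The whole-number percentages in the genre table (93.0, 76.0 and so on) are what integer division produces: when SQLite divides one integer by another, the result is truncated before `ROUND` is applied. The small sketch below uses the Bossa Nova counts from the table (14 tracks sold out of 15 in stock), run through Python's `sqlite3` module.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# 14 of 15 tracks sold, as in the Bossa Nova row of the genre table.
integer_pct, float_pct = conn.execute(
    "SELECT ROUND(14 * 100 / 15, 1),"     # integer / integer: truncated to 93
    "       ROUND(14 * 100.0 / 15, 1)"    # 100.0 makes the division floating point
).fetchone()

print(integer_pct, float_pct)
```

Multiplying by `100.0`, or casting one operand to a float as the other queries in this notebook do, keeps the decimal part. That can matter when genres are ranked by this column, because truncation creates ties between genres with close percentages.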

### Protected vs non-protected media

The last question we are going to answer is

* Do protected vs non-protected media types have an effect on popularity?

We'll measure popularity by amount of sales.

In [14]:
%%sql
SELECT mt.name AS media_type,
       CASE
           WHEN il.InvoiceId IS NULL THEN "Not purchased"
           ELSE "Purchased"
       END AS purchased_or_not,
       COUNT(DISTINCT t.TrackId) AS track_qty,
       ROUND(COUNT(DISTINCT t.TrackId)*100/(
                                    SELECT CAST(COUNT(*) AS Float)
                                      FROM tracks AS t
                                           LEFT JOIN media_types AS mt_in
                                           ON t.MediaTypeId = mt_in.MediaTypeId
                                     WHERE mt_in.name = mt.name
                                   ), 1) || "%" AS track_pct
  FROM tracks AS t
       LEFT JOIN invoice_items AS il
       ON t.TrackId = il.TrackId
        
       LEFT JOIN media_types AS mt
       ON t.MediaTypeId = mt.MediaTypeId
 GROUP BY 1, 2
HAVING mt.name LIKE "%Protected%";

 * sqlite:///chinook.db
Done.


media_type,purchased_or_not,track_qty,track_pct
Protected AAC audio file,Not purchased,108,45.6%
Protected AAC audio file,Purchased,129,54.4%
Protected MPEG-4 video file,Not purchased,111,51.9%
Protected MPEG-4 video file,Purchased,103,48.1%


There are only two types of protected files:

* Protected AAC audio file
* Protected MPEG-4 video file

Also let's remember that **56.6%** of all tracks in the Chinook store have been purchased.

**Protected MPEG-4 video** files sell worst: only **48.1%** of them have been purchased. These files are actually videos rather than music tracks, so they could probably be removed from the store.

**Protected AAC audio** files do better: **54.4%** of them have been purchased, but that is still below the store-wide average.

So protected files do not seem to be more popular than unprotected ones. If anything, the opposite is true.
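A direct way to test this conclusion is to put protected and unprotected media side by side in a single result. The sketch below is a minimal illustration on invented rows using Python's `sqlite3` module. The snake_case table and column names are hypothetical, and the `'Protected%'` prefix test assumes that protected media types are named the way the two types listed above are.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Invented rows for illustration; this is not the Chinook data.
conn.executescript("""
    CREATE TABLE media_types (media_type_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE tracks (track_id INTEGER PRIMARY KEY, media_type_id INTEGER);
    CREATE TABLE invoice_items (invoice_id INTEGER, track_id INTEGER);
    INSERT INTO media_types VALUES (1, 'MPEG audio file'), (2, 'Protected AAC audio file');
    INSERT INTO tracks VALUES (1, 1), (2, 1), (3, 1), (4, 1), (5, 2), (6, 2);
    INSERT INTO invoice_items VALUES (10, 1), (10, 2), (11, 2), (12, 3), (12, 5);
""")

rows = conn.execute("""
    SELECT CASE
               WHEN mt.name LIKE 'Protected%' THEN 'Protected'
               ELSE 'Unprotected'
           END AS protection,
           COUNT(DISTINCT t.track_id)  AS tracks_in_stock,
           -- ii.track_id is NULL for tracks that were never sold, so it is not counted.
           COUNT(DISTINCT ii.track_id) AS tracks_sold,
           ROUND(COUNT(DISTINCT ii.track_id) * 100.0 / COUNT(DISTINCT t.track_id), 1) AS sold_pct
      FROM tracks AS t
           INNER JOIN media_types AS mt ON mt.media_type_id = t.media_type_id
           LEFT JOIN invoice_items AS ii ON ii.track_id = t.track_id
     GROUP BY protection
     ORDER BY protection
""").fetchall()

for row in rows:
    print(row)
```

Each group gets one row holding both the stock and the sold count, so the two sold percentages can be compared without referring back to the store-wide average.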

### Conclusions

1. We've found the most popular genres in the USA and made some recommendations for the Chinook store:
    * **Red Tone** - Punk falls under Alternative & Punk, which accounts for **10.12%** of sales
    * **Slim Jim Bites** - Blues tracks account for **3.04%** of sales
    * **Meteor and the Girls** - Pop is not among the ten best-selling genres, so it accounts for at most **1.62%** of sales
    
2. We've found the best sales agent by volume. It's **Jane Peacock** with:
    * number of customers - **21**
    * average single sale - **5.71**
    * average sales per customer - **39.67**
    
3. We've analysed country sales and found the following:
    * Sales in the USA brought Chinook about **523** dollars, **22%** of all revenue. That is comparable with the single-customer countries grouped as **Other**, which account for the roughly **26%** of sales not covered by the nine named countries.
    * Customers from the Czech Republic are real music lovers. They have the highest average order value, **6.45**, and the highest average sales per customer, **45.12**.
    * In addition, the query computes the same values for the whole **world**, so we can easily compare the sales results of any country with the global results.
    
4. We've explored album and individual track sales. Only **0.5%** of invoices are whole-album purchases. The other **99.5%** are purchases of individual tracks.

5. We've found the most popular artists using two different criteria:
    * **Eugene Ormandy**. His tracks appear in **7** playlists
    * **Iron Maiden**. Their tracks appear **516** times in the playlists
    
6. We've found that only **56.6%** of tracks have been sold. To improve this result, we've suggested cutting the number of **Latin** tracks in half and removing the tracks of 10 genres from the store entirely (less than **11%** of their tracks were sold):
    - Soundtrack
    - TV Shows
    - Drama
    - Bossa Nova
    - Comedy
    - Opera
    - Rock And Roll
    - Sci Fi & Fantasy
    - Science Fiction
    - World
    
7. We've explored protected files in the Chinook store. They are not more popular than unprotected ones. **Protected MPEG-4 video** files sell worst, with only **48.1%** of them purchased, and could probably be removed from the store.