# Finding our best-performing salespeople and products

## Introduction

**Business Context.** You work for AdventureWorks, a company that sells outdoor sporting equipment. The company has many different locations and has been recording each location's sales of its various products. You, their new data scientist, have been tasked with the question: **"What are our best products and salespeople, and how can we use this information to improve our overall performance?"**

You have been given access to the relevant data files with documentation from the IT department. Your job is to extract meaningful insights from these data files to help increase sales. First, you will look at the best products and try to see how different products behave in different categories. Second, you will analyze the best salespeople to see if the commission percentage motivates them to sell more.

**Business Problem.** Your task is to **construct a database from the provided CSV files and then write queries in SQL to carry out the requested analysis**.

**Analytical Context.** You are given the data (stored in the ```data/csvs``` folder) as a set of separate CSV files, each one representing a table. You will build a new PostgreSQL database from these files using AWS RDS.

The company has been pretty vague about how they expect you to extract insights, but you have come up with the following plan of attack:

1. Create the database and ensure you can run basic queries against it
2. Look at how product ratings and total sales are related
3. See how products sell in different subcategories (bikes, helmets, socks, etc.)
4. Calculate which salespeople have performed the best in the past year
5. See whether each salesperson's total sales are correlated with their commission percentage

Of course, this is only your initial plan. As you explore the database, your strategy will change.

## Setting up AWS

In this case, we'll assume that the company has given you an entry-level laptop, which is not capable of running a PostgreSQL server locally. Therefore, you should set up a cloud database, connect to it from `psql`, and run the analysis via `psql` or directly from the notebook.

### Question (20 min):

Repeat the steps in Case 12.3 to create a new RDS instance with a PostgreSQL database.

## Overview of the data

The data for the case is contained in the ```./data/csvs``` directory; specifically, it is the ```AdventureWorks``` sample data provided by Microsoft. We will be focusing on the Sales and Production categories. Complete documentation for the original data (of which you have only a subset) can be found [here](https://dataedo.com/download/AdventureWorks.pdf). 

**Product Tables:**
* **Product**: one row per product that the company sells
* **ProductReview**: one row per rating and review left by customers
* **ProductModelProductDescriptionCulture**: a link between products and their longer descriptions also indicating a "culture" - which language and region the product is for
* **ProductDescription**: a longer description of each product, for a specific region
* **ProductCategory**: the broad categories that products fit into
* **ProductSubCategory**: the narrower subcategories that products fit into

**Sales Tables:**
* **SalesPerson**: one row per salesperson, including information on their commission and performance
* **SalesOrderHeader**: one row per sale summarizing the sale
* **SalesOrderDetail**: many rows per sale, detailing each product that forms part of the sale
* **SalesTerritory**: the different territories where products are sold, including performance

**Region Tables:**
* **CountryRegionCurrency**: the currency used by each region
* **CurrencyRate**: the average and closing exchange rates for each currency compared to the USD

## Setting up `ipython-sql` and `pgspecial`

Jupyter notebooks are usually used to run Python code, but with an add-on they can run SQL directly against a database too. Install the extensions `ipython-sql` and `pgspecial` through `pip` (you may have to restart the notebook after doing this); the `adventureworks` database itself is created in a later step:

In [1]:
!pip3 install ipython-sql pgspecial


Now load the sql add-on and connect to the database as follows. You'll need to change the username (`postgres`), password (`mysecretpassword`), host (`localhost`), and database name (`postgres`) to what you used when setting up your RDS instance:

In [1]:
%load_ext sql
%sql postgresql://postgres:mysecretpassword@localhost/postgres

You should now be able to run SQL directly from any Jupyter notebook cell by starting the cell with a line that states `%%sql`. For example (once you have a database with some tables, which we'll only create later):

```sql
%%sql

SELECT * FROM product LIMIT 10;
```

**Note:** Unlike `pandas`, which automatically truncates output for large DataFrames, the SQL plug-in gives you exactly what you ask for. If you run `SELECT * FROM` a table with a million rows and no `LIMIT` clause, it will output all million rows and probably freeze your notebook. It's good practice to always use a `LIMIT` clause, even when it's not strictly needed, to avoid any mishaps.

## Creating the database and adding the tables

Now, let's create a database called `adventureworks`. (If you do this through the notebook, you'll have to add the line `end;` before your `create database` command, as the add-on runs everything in transactions.)

You'll need to add a table for each of the CSV files. Spend some time looking at the different CSV files and getting used to how they reference each other and what headers they create. Then, you'll need to write an appropriate `CREATE TABLE` command with appropriate types. You can figure out the types by inspecting the CSV files and/or referencing the documentation.
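
As a rough aid for the type-inspection step, here is a minimal Python sketch that guesses a column type from the values in a CSV. It is only an illustration: the sample text is made up (it is not the contents of the real `salesperson.csv`), `guess_type` and `guess_columns` are hypothetical helper names, and the guesses should always be checked against the documentation.

```python
import csv
import io
from datetime import date

# Made-up sample standing in for the first lines of a CSV file;
# the values are invented for illustration only.
SAMPLE = """businessentityid,territoryid,commissionpct,rowguid,modifieddate
274,,0.0,abc-123,2010-12-28
275,2,0.012,def-456,2011-05-24
"""

def guess_type(values):
    """Guess a PostgreSQL column type from a list of string values."""
    non_empty = [v for v in values if v != ""]  # empty cells are treated as NULL
    if not non_empty:
        return "TEXT"
    casters = ((int, "INTEGER"), (float, "FLOAT"), (date.fromisoformat, "DATE"))
    for cast, sql_type in casters:
        try:
            for v in non_empty:
                cast(v)  # raises ValueError if the value does not fit this type
            return sql_type
        except ValueError:
            continue
    return "TEXT"  # fall back to TEXT when nothing narrower fits

def guess_columns(csv_text):
    """Map each header name to a guessed type, using all data rows."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    return {name: guess_type([r[i] for r in data]) for i, name in enumerate(header)}

guessed = guess_columns(SAMPLE)
for name, sql_type in guessed.items():
    print(f"{name} {sql_type}")
```

A guess based on a few rows can be too narrow (for example, a column that looks like `INTEGER` in the first rows but contains text later), so it is worth running it over the whole file before writing the `CREATE TABLE` statement.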

### Exercise 1: (30 min)

Write all of the commands that you need to

* Create the database
* Create the tables
* Import the data from the CSVs

**Hint:** As an example, to add data for the `salesperson` table, you would use the following commands:

1. Create table (can be run from Jupyter Notebook or the `psql` command line interface):
```sql
CREATE TABLE salesperson (
    businessentityid INTEGER,
    territoryid INTEGER,
    salesquota INTEGER,
    bonus INTEGER,
    commissionpct FLOAT,
    salesytd FLOAT,
    saleslastyear FLOAT,
    rowguid TEXT,
    modifieddate DATE
    );
```

2. Copy data (has to be run from the `psql` shell):

```sql
\copy salesperson FROM 'data/csvs/salesperson.csv' with (format CSV, header true, delimiter ',');
```

**Answer.** One possible solution is shown below:

In [2]:
%%sql

-- CREATE THE DATABASE
end;
create database adventureworks;

 * postgresql://postgres:***@localhost/postgres
Done.
Done.


[]

Connect to the new database that we just created:

In [3]:
%sql postgresql://postgres:mysecretpassword@localhost/adventureworks

In [4]:
%%sql
-- ADDING THE TABLES

CREATE TABLE product (
    productid INTEGER,
    NAME TEXT,
    productnumber TEXT,
    makeflag BOOLEAN,
    finishedgoodsflag BOOLEAN,
    color TEXT,
    safetystocklevel INTEGER,
    reorderpoint INTEGER,
    standardcost FLOAT,
    listprice FLOAT,
    size TEXT,
    sizeunitmeasurecode TEXT,
    weightunitmeasurecode TEXT,
    weight FLOAT,
    daystomanufacture INTEGER,
    productline TEXT,
    class TEXT,
    style TEXT,
    productsubcategoryid INTEGER,
    productmodelid INTEGER,
    sellstartdate DATE,
    sellenddate DATE,
    discontinueddate DATE,
    rowguid TEXT,
    modifieddate DATE
    );

CREATE TABLE productmodelproductdescriptionculture (
    productmodelid INTEGER,
    productdescriptionid INTEGER,
    cultureid TEXT,
    modifieddate DATE
    );

CREATE TABLE productdescription (
    productdescriptionid INTEGER,
    description TEXT,
    rowguid TEXT,
    modifieddate DATE
    );

CREATE TABLE productreview (
    productreviewid INTEGER,
    productid INTEGER,
    reviewername TEXT,
    reviewdate DATE,
    emailaddress TEXT,
    rating INTEGER,
    comments TEXT,
    modifieddate DATE
    );

CREATE TABLE productcategory (
    productcategoryid INTEGER,
    name TEXT,
    rowguid TEXT,
    modifieddate DATE
    );

CREATE TABLE productsubcategory (
    productsubcategoryid INTEGER,
    productcategoryid INTEGER,
    name TEXT,
    rowguid TEXT,
    modifieddate DATE
    );
    
CREATE TABLE salesperson (
    businessentityid INTEGER,
    territoryid INTEGER,
    salesquota INTEGER,
    bonus INTEGER,
    commissionpct FLOAT,
    salesytd FLOAT,
    saleslastyear FLOAT,
    rowguid TEXT,
    modifieddate DATE
    );

CREATE TABLE salesorderdetail (
    salesorderid INTEGER,
    salesorderdetailid INTEGER,
    carriertrackingnumber TEXT,
    orderqty INTEGER,
    productid INTEGER,
    specialofferid INTEGER,
    unitprice FLOAT,
    unitpricediscount FLOAT,
    rowguid TEXT,
    modifieddate DATE
    );

CREATE TABLE salesorderheader (
    salesorderid INTEGER,
    revisionnumber INTEGER,
    orderdate DATE,
    duedate DATE,
    shipdate DATE,
    STATUS TEXT,
    onlineorderflag BOOLEAN,
    purchaseordernumber TEXT,
    accountnumber TEXT,
    customerid INTEGER,
    salespersonid INTEGER,
    territoryid INTEGER,
    billtoaddressid INTEGER,
    shiptoaddressid INTEGER,
    shipmethodid INTEGER,
    creditcardid INTEGER,
    creditcardapprovalcode TEXT,
    currencyrateid INTEGER,
    subtotal FLOAT,
    taxamt FLOAT,
    freight FLOAT,
    totaldue FLOAT,
    comment TEXT,
    rowguid TEXT,
    modifieddate DATE
    );

CREATE TABLE salesterritory (
    territoryid INTEGER,
    name TEXT,
    countryregioncode TEXT,
    "group" TEXT,
    salesytd FLOAT,
    saleslastyear FLOAT,
    costytd FLOAT,
    costlastyear FLOAT,
    rowguid TEXT,
    modifieddate DATE
    );
    
CREATE TABLE countryregioncurrency (
    countryregioncode TEXT,
    currencycode TEXT,
    modifieddate DATE
    );

CREATE TABLE currencyrate (
    currencyrateid INTEGER,
    currencyratedate DATE,
    fromcurrencycode TEXT,
    tocurrencycode TEXT,
    averagerate FLOAT,
    endofdayrate FLOAT,
    modifieddate DATE
    );

 * postgresql://postgres:***@localhost/adventureworks
   postgresql://postgres:***@localhost/postgres
Done.
Done.
Done.
Done.
Done.
Done.
Done.
Done.
Done.
Done.
Done.
Done.


[]

```sql

-- copying the data (to be run from a psql shell in Terminal / Command Prompt that is open in the main project directory with a `data` subfolder).

\copy product from 'data/csvs/product.csv' with (format CSV, header true, delimiter ',');
\copy productreview from 'data/csvs/productreview.csv' with (format CSV, header true, delimiter ',');
\copy productmodelproductdescriptionculture from 'data/csvs/productmodelproductdescriptionculture.csv' with (format CSV, header true, delimiter ',');
\copy productdescription from 'data/csvs/productdescription.csv' with (format CSV, header true, delimiter ',');
\copy productcategory from 'data/csvs/productcategory.csv' with (format CSV, header true, delimiter ',');
\copy productsubcategory from 'data/csvs/productsubcategory.csv' with (format CSV, header true, delimiter ',');
\copy salesperson from 'data/csvs/salesperson.csv' with (format CSV, header true, delimiter ',');
\copy salesorderheader from 'data/csvs/salesorderheader.csv' with (format CSV, header true, delimiter ',');
\copy salesorderdetail from 'data/csvs/salesorderdetail.csv' with (format CSV, header true, delimiter ',');
\copy salesterritory from 'data/csvs/salesterritory.csv' with (format CSV, header true, delimiter ',');
\copy countryregioncurrency from 'data/csvs/countryregioncurrency.csv' with (format CSV, header true, delimiter ',');
\copy currencyrate from 'data/csvs/currencyrate.csv' with (format CSV, header true, delimiter ',');
```


In [5]:
%%sql

-- CHECK TO MAKE SURE THE DATA IS LOADED AS EXPECTED
select * from product limit 10;

 * postgresql://postgres:***@localhost/adventureworks
   postgresql://postgres:***@localhost/postgres
10 rows affected.


productid,name,productnumber,makeflag,finishedgoodsflag,color,safetystocklevel,reorderpoint,standardcost,listprice,size,sizeunitmeasurecode,weightunitmeasurecode,weight,daystomanufacture,productline,class,style,productsubcategoryid,productmodelid,sellstartdate,sellenddate,discontinueddate,rowguid,modifieddate
1,Adjustable Race,AR-5381,False,False,,1000,750,0.0,0.0,,,,,0,,,,,,2008-04-30,,,694215b7-08f7-4c0d-acb1-d734ba44c0c8,2014-02-08
2,Bearing Ball,BA-8327,False,False,,1000,750,0.0,0.0,,,,,0,,,,,,2008-04-30,,,58ae3c20-4f3a-4749-a7d4-d568806cc537,2014-02-08
3,BB Ball Bearing,BE-2349,True,False,,800,600,0.0,0.0,,,,,1,,,,,,2008-04-30,,,9c21aed2-5bfa-4f18-bcb8-f11638dc2e4e,2014-02-08
4,Headset Ball Bearings,BE-2908,False,False,,800,600,0.0,0.0,,,,,0,,,,,,2008-04-30,,,ecfed6cb-51ff-49b5-b06c-7d8ac834db8b,2014-02-08
316,Blade,BL-2036,True,False,,800,600,0.0,0.0,,,,,1,,,,,,2008-04-30,,,e73e9750-603b-4131-89f5-3dd15ed5ff80,2014-02-08
317,LL Crankarm,CA-5965,False,False,Black,500,375,0.0,0.0,,,,,0,,L,,,,2008-04-30,,,3c9d10b7-a6b2-4774-9963-c19dcee72fea,2014-02-08
318,ML Crankarm,CA-6738,False,False,Black,500,375,0.0,0.0,,,,,0,,M,,,,2008-04-30,,,eabb9a92-fa07-4eab-8955-f0517b4a4ca7,2014-02-08
319,HL Crankarm,CA-7457,False,False,Black,500,375,0.0,0.0,,,,,0,,,,,,2008-04-30,,,7d3fd384-4f29-484b-86fa-4206e276fe58,2014-02-08
320,Chainring Bolts,CB-2903,False,False,Silver,1000,750,0.0,0.0,,,,,0,,,,,,2008-04-30,,,7be38e48-b7d6-4486-888e-f53c26735101,2014-02-08
321,Chainring Nut,CN-6137,False,False,Silver,1000,750,0.0,0.0,,,,,0,,,,,,2008-04-30,,,3314b1d7-ef69-4431-b6dd-dc75268bd5df,2014-02-08


## Finding our most popular products

As discussed, the company would like to know which of its products are the most popular among customers. You hypothesize that the average rating given in reviews is correlated with the number of sales of a particular product (that is, products with higher ratings have more sales).

### Exercise 2: (15 min)

Using the ```product``` and ```productreview``` tables, ```JOIN``` them and rank the products according to their average review rating. What are the names and IDs of the top 5 products?

**Answer.** One possible solution is shown below.

In [6]:
%%sql
SELECT product.productid, name, round(avg(rating), 2) as avgrating, count(*) as num_ratings
FROM product inner join productreview
ON productreview.productid = product.productid
GROUP BY product.productid, name
ORDER BY avgrating DESC
LIMIT 5;

 * postgresql://postgres:***@localhost/adventureworks
   postgresql://postgres:***@localhost/postgres
3 rows affected.


productid,name,avgrating,num_ratings
798,"Road-550-W Yellow, 40",5.0,1
709,"Mountain Bike Socks, M",5.0,1
937,HL Mountain Pedal,3.0,2


### Exercise 3: (30 min)

Much to your disappointment, there are only three products with ratings and only four reviews in total! This is nowhere near enough to perform an analysis of the correlation between reviews and total sales.

Nevertheless, your manager wants the **English description** of these products for an upcoming sale. Use the documentation provided above if you need help navigating the structure to extract this!

**Hint:** You'll notice that the value for `cultureid` in the `productmodelproductdescriptionculture` table often has extra trailing spaces which makes it difficult to reliably get descriptions of a specific language. You should first modify this table before writing the `SELECT` statement to get the descriptions that your manager wants. To do this, you can use an `UPDATE` statement with Postgres's [`TRIM`](https://w3resource.com/PostgreSQL/trim-function.php) function.
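
To see why the trailing spaces matter, here is a small self-contained sketch. It uses Python's built-in `sqlite3` module as a stand-in for PostgreSQL (an assumption made purely so the example runs anywhere), and the table `culture_demo` and its rows are made up for illustration.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for the PostgreSQL instance.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE culture_demo (productmodelid INTEGER, cultureid TEXT)")
conn.executemany(
    "INSERT INTO culture_demo VALUES (?, ?)",
    [(1, "en    "), (2, "fr    "), (3, "en")],  # two values carry trailing spaces
)

def count_english():
    """Count rows whose cultureid is exactly 'en'."""
    return conn.execute(
        "SELECT count(*) FROM culture_demo WHERE cultureid = 'en'"
    ).fetchone()[0]

before = count_english()  # the padded 'en    ' row is not matched
conn.execute("UPDATE culture_demo SET cultureid = TRIM(cultureid)")
after = count_english()   # both English rows match once the padding is gone

print(before, after)
```

The equality test only matches the padded row after the `UPDATE`, which is why cleaning the column once is more reliable than remembering to trim in every later query.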

**Answer.** One possible solution is given below:

In [7]:
%%sql

UPDATE productmodelproductdescriptionculture set cultureid = TRIM(cultureid);

 * postgresql://postgres:***@localhost/adventureworks
   postgresql://postgres:***@localhost/postgres
762 rows affected.


[]

In [9]:
%%sql
SELECT "name",
       description
FROM productdescription pd
INNER JOIN productmodelproductdescriptionculture pm ON pm.productdescriptionid=pd.productdescriptionid
INNER JOIN product ON product.productmodelid = pm.productmodelid
WHERE productid IN (798,709,937)
  AND cultureid = 'en'

 * postgresql://postgres:***@localhost/adventureworks
   postgresql://postgres:***@localhost/postgres
3 rows affected.


name,description
"Road-550-W Yellow, 40","Same technology as all of our Road series bikes, but the frame is sized for a woman. Perfect all-around bike for road or racing."
HL Mountain Pedal,Stainless steel; designed to shed mud easily.
"Mountain Bike Socks, M",Combination of natural and synthetic fibers stays dry and provides just the right cushioning.


### Exercise 4: (30 min)

Since we cannot infer the most popular products from the reviews, we will go with an alternative strategy.

Get the model ID, name, description, and total number of sales for each product and display the top-10 selling products. You can infer how often products have been sold by looking at the `salesorderdetail` table (each row might indicate more than one sale, so take note of `OrderQty`).

**Answer.** One possible solution is shown below:

In [10]:
%%sql 

WITH english_descriptions AS
  (SELECT productmodelid,
          description
   FROM productmodelproductdescriptionculture pmpdc
   INNER JOIN productdescription pd ON pd.productdescriptionid = pmpdc.productdescriptionid
   AND cultureid = 'en')
SELECT product.productmodelid,
       description,
       product.name,
       sum(orderqty) AS total_orders
FROM product
INNER JOIN salesorderdetail ON product.productid = salesorderdetail.productid
INNER JOIN english_descriptions ON product.productmodelid = english_descriptions.productmodelid
GROUP BY product.productmodelid,
         name,
         description
ORDER BY total_orders DESC
LIMIT 10

 * postgresql://postgres:***@localhost/adventureworks
   postgresql://postgres:***@localhost/postgres
10 rows affected.


productmodelid,description,name,total_orders
2,Traditional style with a flip-up brim; one-size fits all.,AWC Logo Cap,8311
111,AWC logo water bottle - holds 30 oz; leak-proof.,Water Bottle - 30 oz.,6815
33,"Universal fit, well-vented, lightweight , snap-on visor.","Sport-100 Helmet, Blue",6743
11,Unisex long-sleeve AWC logo microfiber cycling jersey,"Long-Sleeve Logo Jersey, L",6592
33,"Universal fit, well-vented, lightweight , snap-on visor.","Sport-100 Helmet, Black",6532
33,"Universal fit, well-vented, lightweight , snap-on visor.","Sport-100 Helmet, Red",6266
1,"Light-weight, wind-resistant, packs to fit into a pocket.","Classic Vest, S",4247
114,"Includes 8 different size patches, glue and sandpaper.",Patch Kit/8 Patches,3865
32,"Short sleeve classic breathable jersey with superior moisture control, front zipper, and 3 back pockets.","Short-Sleeve Classic Jersey, XL",3864
11,Unisex long-sleeve AWC logo microfiber cycling jersey,"Long-Sleeve Logo Jersey, M",3636


### Exercise 5: (30 min)

Let's look at the correlation between quantity sold and price for each item in each subcategory. Some subcategories don't have enough sales to make the correlation meaningful, so only look at the top 10 subcategories by total quantity of sales.

Once you've looked at the data, make a hypothesis about what causes any positive or negative correlations between price and quantity, and explain this in 2-3 sentences.

**Hint:** You'll need to calculate the total quantities from `salesorderdetail` again and group the products by subcategory. It'll probably be easier if you use at least two [CTEs](https://www.postgresql.org/docs/9.1/queries-with.html). You can calculate the correlation in PostgreSQL by using the built-in [```corr()```](https://www.postgresql.org/docs/9.4/functions-aggregate.html) function.

**Answer.** One possible solution is shown below:

In [11]:
%%sql

WITH product_qtys
AS (
    SELECT productid,
        SUM(orderqty) AS quantity
    FROM salesorderdetail
    GROUP BY productid
    ),
product_price_qty
AS (
    SELECT pc.name AS category,
        ps.name AS subcategory,
        p.listprice,
        sum(product_qtys.quantity) AS quantity
    FROM product p
    INNER JOIN product_qtys
        ON p.productid = product_qtys.productid
    INNER JOIN productsubcategory ps
        ON p.productsubcategoryid = ps.productsubcategoryid
    INNER JOIN productcategory pc
        ON ps.productcategoryid = pc.productcategoryid
    GROUP BY pc.name,
        ps.name,
        p.listprice
    )
SELECT subcategory,
    corr(ppq.listprice, ppq.quantity) AS corr,
    sum(quantity) AS total_qty
FROM product_price_qty ppq
GROUP BY subcategory
ORDER BY total_qty DESC limit 10;

 * postgresql://postgres:***@localhost/adventureworks
   postgresql://postgres:***@localhost/postgres
10 rows affected.


subcategory,corr,total_qty
Road Bikes,-0.3797962198574496,47196
Mountain Bikes,0.3226668926506684,28321
Jerseys,-1.0,22711
Helmets,,19541
Tires and Tubes,-0.8514230857944957,18006
Touring Bikes,0.3944448355142475,14751
Gloves,-1.0,13012
Road Frames,-0.9380370882692972,11753
Mountain Frames,0.6165208815431658,11621
Bottles and Cages,-0.9701687537690749,10552


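We can see a negative correlation for more commoditized items such as road bikes, clothing, and parts. It's likely that shoppers for these items are price-sensitive: the items are substitutable, and buyers probably don't care much about specific brands or very high quality. Be cautious with the extreme values, though. A correlation of exactly -1.0 (Jerseys, Gloves) is what you get when only two price points are compared, and the empty value for Helmets means the correlation could not be computed at all, most likely because that subcategory has too few distinct price points.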

In contrast, more differentiated items such as mountain bikes and touring bikes exhibit a positive correlation between price and quantity. It's likely that buyers of these items are professionals or enthusiasts who care more about quality and specific brands than price.
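
The extreme values in the output above (exactly -1.0 for Jerseys and Gloves, and an empty value for Helmets) are easier to interpret with the correlation formula in hand. The sketch below computes Pearson's r from scratch in Python; the (price, quantity) points are made up, `pearson` is a hypothetical helper, and returning `None` for the undefined cases is an assumption meant to mimic the empty value in the output.

```python
from math import sqrt

def pearson(xs, ys):
    """Return Pearson's r for paired values, or None when it is undefined."""
    n = len(xs)
    if n < 2:
        return None  # a single point has no spread to correlate
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None  # no variation in one variable: r is undefined
    return cov / sqrt(var_x * var_y)

# Made-up (list price, quantity) points, for illustration only.
one_point = pearson([35.0], [19000])
two_points = pearson([50.0, 54.0], [12000, 10000])
many_points = pearson([10.0, 20.0, 30.0, 40.0], [90, 70, 80, 20])

print(one_point, two_points, many_points)
```

With a single price point the correlation is undefined, and with exactly two points it is always +1 or -1 because a straight line fits any two points perfectly. Only the subcategories with several distinct price points give a correlation that carries real information.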

## Finding our top salespeople

As mentioned earlier, we want to find our best salespeople and see whether or not we can incentivize them in an appropriate manner. Namely, we want to determine if the commission percentage we give them motivates them to make more and bigger sales.

### Exercise 5: (10 min)

Find the top five performing salespeople by using the `salesytd` (Sales, year-to-date) column. (We only need to know the `businessentityid` for each salesperson as this uniquely identifies each.) Why might you be skeptical of these numbers right now?

**Answer.** One possible solution is shown below:

In [12]:
%%sql

SELECT BusinessEntityID, SalesYTD FROM SalesPerson ORDER BY SalesYTD DESC LIMIT 5;

 * postgresql://postgres:***@localhost/adventureworks
   postgresql://postgres:***@localhost/postgres
5 rows affected.


businessentityid,salesytd
276,4251368.5497
289,4116871.2277
275,3763178.1787
277,3189418.3662
290,3121616.3202


These numbers are stored directly in the table rather than calculated from the individual sales records. We don't yet know how or when this column is updated, so it's wise to remain skeptical until we have checked it against the underlying orders.

### Exercise 6: (15 min)

Using ```salesorderheader```, find the top 5 salespeople who made the most sales **in the most recent year** (2014). (There is a column called `subtotal` - use that.) Sales that do not have an associated salesperson should be excluded from your calculations and final output. All orders that were made within the 2014 calendar year should be included.

**Hint:** You can use the syntax `'1970-01-01'::date` to generate an arbitrary date in PostgreSQL and compare this to specific dates in the tables.
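
The "within the 2014 calendar year" condition can be read as a half-open date range: on or after 1 January 2014 and strictly before 1 January 2015. A minimal Python sketch of that logic follows; the dates are made up and `in_calendar_year` is a hypothetical helper, not part of the case materials.

```python
from datetime import date

def in_calendar_year(order_date, year):
    """True when order_date falls on or after 1 Jan of year and before 1 Jan of year + 1."""
    return date(year, 1, 1) <= order_date < date(year + 1, 1, 1)

# Made-up order dates, for illustration only.
sample_dates = [
    date(2013, 12, 31),
    date(2014, 1, 1),
    date(2014, 6, 30),
    date(2015, 1, 1),
]

kept = [d for d in sample_dates if in_calendar_year(d, 2014)]
print(kept)
```

When the data contains nothing later than the year of interest, the lower bound alone gives the same rows, but stating both bounds keeps the query correct if newer orders are loaded later.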

**Answer.** One possible solution is shown below:

In [13]:
%%sql 

SELECT salespersonid, round(SUM(subtotal)) AS totalsales
FROM salesorderheader soh
WHERE soh.orderdate >= '2014-01-01'::date
AND soh.SalesPersonID is not NULL
GROUP BY SalesPersonID
ORDER BY TotalSales DESC
LIMIT 5;

 * postgresql://postgres:***@localhost/adventureworks
   postgresql://postgres:***@localhost/postgres
5 rows affected.


salespersonid,totalsales
289,1382997.0
276,1271089.0
275,1057247.0
282,1044811.0
277,1040093.0


We see right away that there are discrepancies between the two sales totals. For the remainder of this case, use this dynamically-calculated total as the authoritative answer.

### Exercise 7: (30 min)

Looking at the documentation, you will see that `subtotal` in the ```salesorderheader``` table is calculated from other tables in the database. To validate this figure (instead of trusting it blindly), let's calculate `subtotal` manually. Using the ```salesorderdetail``` and ```salesorderheader``` tables, calculate the sales for each salesperson for **this past year** (2014) and display results for the top 5 salespeople.

**Hint:** You will have to ```JOIN``` ```salesorderdetail``` on ```salesorderheader``` to get the salesperson, calculate line totals for each sale using appropriate discounts, then sum all the line totals to get the total sale. You will want to use ```WITH``` clauses again to keep things sane.
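
The two-step aggregation described in the hint can be sketched in plain Python before writing it in SQL. All rows below are made up, and the dictionaries only stand in for the `salesorderdetail` and `salesorderheader` tables.

```python
from collections import defaultdict

# Made-up rows, for illustration only.
# salesorderdetail: (salesorderid, unitprice, unitpricediscount, orderqty)
detail_rows = [
    (1, 100.0, 0.0, 2),
    (1, 50.0, 0.5, 4),
    (2, 20.0, 0.25, 1),
    (3, 10.0, 0.0, 3),
]
# salesorderheader: salesorderid -> salespersonid (None means no salesperson)
header = {1: 275, 2: 275, 3: None}

# Step 1: compute each discounted line total and sum them per order.
order_totals = defaultdict(float)
for salesorderid, unitprice, discount, qty in detail_rows:
    order_totals[salesorderid] += unitprice * (1 - discount) * qty

# Step 2: sum order totals per salesperson, skipping orders without one.
sales_by_person = defaultdict(float)
for salesorderid, total in order_totals.items():
    salespersonid = header[salesorderid]
    if salespersonid is not None:
        sales_by_person[salespersonid] += total

print(dict(order_totals))
print(dict(sales_by_person))
```

Each loop corresponds to one `WITH` clause: the first groups by order, the second joins to the header and groups by salesperson.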

**Answer.** One possible solution is shown below:

In [14]:
%%sql
WITH orders
AS (
    SELECT salesorderid,
        sum(unitprice * (1 - unitpricediscount) * orderqty) AS ordertotal
    FROM salesorderdetail
    GROUP BY salesorderid
    ),
salespersontotalsales
AS (
    SELECT salespersonid,
        sum(ordertotal) AS totalsales
    FROM orders o
    INNER JOIN salesorderheader soh
        ON o.salesorderid = soh.salesorderid
    WHERE soh.orderdate >= make_date(2014, 1, 1)
        AND soh.salespersonid IS NOT NULL
    GROUP BY salespersonid
    )
SELECT *
FROM salespersontotalsales
ORDER BY totalsales DESC LIMIT 5;


 * postgresql://postgres:***@localhost/adventureworks
   postgresql://postgres:***@localhost/postgres
5 rows affected.


salespersonid,totalsales
289,1382996.5839100005
276,1271088.521461
275,1057247.3785719995
282,1044810.827687
277,1040093.4069010002


### Exercise 8: (30 min)

Using ```corr()```, see if there is a positive relationship between total sales and commission percentage.

**Answer.** One possible solution is shown below:

In [15]:
%%sql 

WITH orders
AS (
    SELECT salesorderid,
        sum(unitprice * (1 - unitpricediscount) * orderqty) AS ordertotal
    FROM salesorderdetail
    GROUP BY salesorderid
    ),
salespersontotalsales
AS (
    SELECT salespersonid,
        sum(ordertotal) AS totalsales
    FROM orders o
    INNER JOIN salesorderheader soh
        ON o.salesorderid = soh.salesorderid
    GROUP BY salespersonid
    )
SELECT corr(spts.totalsales, sp.commissionpct) AS correlation
FROM salespersontotalsales spts
JOIN salesperson sp
    ON sp.businessentityid = spts.salespersonid;


 * postgresql://postgres:***@localhost/adventureworks
   postgresql://postgres:***@localhost/postgres
1 rows affected.


correlation
0.4377704110963032


### Exercise 9: (20 min)

Remember how we mentioned that products were sold in many regions? This is why you had to work with the `culture` value before to get the English language descriptions. To make matters worse, you are told the sales are recorded in **local** currency, so your previous analysis is flawed, and you must convert all amounts to USD if you wish to compare the different salespeople fairly!

Use the `countryregioncurrency` table in combination with the `salesperson` and `salesterritory` tables to figure out the relevant currency code for each of the top salespeople.

**Answer.** One possible solution is shown below:

In [16]:
%%sql

WITH salespersonwithcurrency
AS (
    SELECT a.businessentityid,
        crc.currencycode
    FROM (
        SELECT sp.businessentityid,
            st.countryregioncode
        FROM salesperson sp
        INNER JOIN salesterritory st
            ON sp.territoryid = st.territoryid
        ) a
    INNER JOIN countryregioncurrency crc
        ON crc.countryregioncode = a.countryregioncode
    )
SELECT *
FROM salespersonwithcurrency LIMIT 5;

 * postgresql://postgres:***@localhost/adventureworks
   postgresql://postgres:***@localhost/postgres
5 rows affected.


businessentityid,currencycode
275,USD
276,USD
277,USD
278,CAD
279,USD


### Exercise 10: (45 min)

Now that we have the currency codes associated with each salesperson, redo Exercise 7 to take the currency exchange into account. If there are salespeople in the top 5 that weren't there before, explain why.

**Hint:** The rates in the ```currencyrate``` table always go from `FromCurrencyCode=USD` to `ToCurrencyCode=<Desired Currency Code>`, and they are listed for every day. When calculating line totals, use the `AverageRate` for the day of the order. You should be able to reuse a lot of Exercise 7.
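
Because the rates are quoted from USD to the local currency, converting a local amount back to USD means dividing by the rate. The sketch below illustrates the arithmetic with made-up rates (they are not real exchange rates); `to_usd` is a hypothetical helper, and passing USD amounts through unchanged is an assumption of this sketch.

```python
# Made-up rates, quoted as units of foreign currency per 1 USD on a given date.
rates = {
    ("2014-03-01", "GBP"): 0.5,
    ("2014-03-01", "CAD"): 1.25,
}

def to_usd(amount_local, currency, order_date):
    """Convert a local-currency amount to USD using that day's rate."""
    if currency == "USD":
        return amount_local  # assumption: USD amounts need no conversion
    return amount_local / rates[(order_date, currency)]

gbp_sale = to_usd(500.0, "GBP", "2014-03-01")  # rate below 1: USD value is larger
cad_sale = to_usd(500.0, "CAD", "2014-03-01")  # rate above 1: USD value is smaller

print(gbp_sale, cad_sale)
```

The same nominal amount is worth more in USD when the rate is below 1 and less when it is above 1, which is why rankings can change after the adjustment.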

**Answer.** One possible solution is shown below:

In [17]:
%%sql 



WITH orders
AS (
    SELECT salesorderid,
        sum(unitprice * (1 - unitpricediscount) * orderqty) AS ordertotal
    FROM salesorderdetail
    GROUP BY salesorderid
    ),
salespersonwithcurrency
AS (
    SELECT a.businessentityid,
        crc.currencycode
    FROM (
        SELECT sp.businessentityid,
            st.countryregioncode
        FROM salesperson sp
        INNER JOIN salesterritory st
            ON sp.territoryid = st.territoryid
        ) a
    INNER JOIN countryregioncurrency crc
        ON crc.countryregioncode = a.countryregioncode
    ),
orderswithcurrency
AS (
    SELECT a.salespersonid,
        a.ordertotal,
        a.orderdate,
        spwc.currencycode
    FROM (
        SELECT *
        FROM orders o
        INNER JOIN salesorderheader soh
            ON o.salesorderid = soh.salesorderid
        WHERE soh.orderdate >= '2014-01-01'::date
            AND soh.salespersonid IS NOT NULL
        ) a
    INNER JOIN salespersonwithcurrency spwc
        ON spwc.businessentityid = a.salespersonid
    ),
orderswithcurrencyrate
AS (
    SELECT owc.salespersonid,
        owc.ordertotal,
        owc.ordertotal / cr.averagerate AS ordertotaladjusted,
        owc.orderdate,
        owc.currencycode,
        cr.averagerate
    FROM orderswithcurrency owc
    INNER JOIN currencyrate cr
        ON cr.tocurrencycode = owc.currencycode
    WHERE cr.currencyratedate = owc.orderdate
    ),
salespersontotalsalesadjusted
AS (
    SELECT salespersonid,
        sum(ordertotaladjusted) AS totalsalesadjusted
    FROM orderswithcurrencyrate
    GROUP BY salespersonid
    )
SELECT *
FROM salespersontotalsalesadjusted
ORDER BY totalsalesadjusted DESC LIMIT 5;



 * postgresql://postgres:***@localhost/adventureworks
   postgresql://postgres:***@localhost/postgres
5 rows affected.


salespersonid,totalsalesadjusted
289,2146418.4023948
276,1271088.521461001
275,1057247.3785720002
277,1040093.406901
290,844392.7295821795


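Salesperson `289` was already at the top of the previous list, but their adjusted total is substantially higher (about 2.15 million versus 1.38 million). Their sales are recorded in GBP (British pound), a currency that is stronger than the US dollar, so the unadjusted figure understated them. The newcomer to the top 5 is `290`, who replaces `282`: converting to USD changes the relative standing of salespeople whose sales are recorded in other currencies.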

### Exercise 11: (15 min)

How does the correlation from Exercise 8 change once you've adjusted for the currency?

**Answer.** One possible solution is shown below:

In [18]:
%%sql

WITH orders
AS (
    SELECT salesorderid,
        sum(unitprice * (1 - unitpricediscount) * orderqty) AS ordertotal
    FROM salesorderdetail
    GROUP BY salesorderid
    ),
salespersonwithcurrency
AS (
    SELECT a.businessentityid,
        crc.currencycode
    FROM (
        SELECT sp.businessentityid,
            st.countryregioncode
        FROM salesperson sp
        INNER JOIN salesterritory st
            ON sp.territoryid = st.territoryid
        ) a
    INNER JOIN countryregioncurrency crc
        ON crc.countryregioncode = a.countryregioncode
    ),
orderswithcurrency
AS (
    SELECT a.salespersonid,
        a.ordertotal,
        a.orderdate,
        spwc.currencycode
    FROM (
        SELECT *
        FROM orders o
        INNER JOIN salesorderheader soh
            ON o.salesorderid = soh.salesorderid
        WHERE soh.orderdate >= '2014-01-01'::date
            AND soh.salespersonid IS NOT NULL
        ) a
    INNER JOIN salespersonwithcurrency spwc
        ON spwc.businessentityid = a.salespersonid
    ),
orderswithcurrencyrate
AS (
    SELECT owc.salespersonid,
        owc.ordertotal,
        owc.ordertotal / cr.averagerate AS ordertotaladjusted,
        owc.orderdate,
        owc.currencycode,
        cr.averagerate
    FROM orderswithcurrency owc
    INNER JOIN currencyrate cr
        ON cr.tocurrencycode = owc.currencycode
    WHERE cr.currencyratedate = owc.orderdate
    ),
salespersontotalsalesadjusted
AS (
    SELECT salespersonid,
        sum(ordertotaladjusted) AS totalsalesadjusted
    FROM orderswithcurrencyrate
    GROUP BY salespersonid
    )
    
SELECT corr(sptsa.totalsalesadjusted, sp.commissionpct) AS correlation
FROM salespersontotalsalesadjusted sptsa
JOIN salesperson sp
    ON sp.businessentityid = sptsa.salespersonid;



 * postgresql://postgres:***@localhost/adventureworks
   postgresql://postgres:***@localhost/postgres
1 rows affected.


correlation
0.3734734962578998


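The correlation has gone down (from about 0.44 to about 0.37), which suggests that commission percentage is less strongly associated with total sales than it first appeared, although the relationship is still positive. Keep in mind that the Exercise 8 query covered all orders while this one is restricted to 2014, so the two figures are not strictly comparable, and that a correlation alone does not show that higher commissions cause higher sales.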