<h1>Finding our best-performing salespeople and products</h1>

<h2>Introduction</h2>

<p><strong>Business Context.</strong> You work for AdventureWorks, a company that sells outdoor sporting equipment. The company has many different locations and has been recording each location's sales of its various products. You, their new data scientist, have been tasked with the question: <strong>"What are our best products and salespeople, and how can we use this information to improve our overall performance?"</strong></p>
<p>You have been given access to the relevant data files with documentation from the IT department. Your job is to extract meaningful insights from these data files to help increase sales. First, you will look at the best products and try to see how different products behave in different categories. Second, you will analyze the best salespeople to see if the commission percentage motivates them to sell more.</p>

<p><strong>Business Problem.</strong> Your task is to <strong>construct a database from the provided CSV files and then write queries in SQL to carry out the requested analysis</strong>.</p>

<p><strong>Analytical Context.</strong> You are given the data (stored in the <code>data/csvs</code> folder) as a set of separate CSV files, each one representing a table. You will build a new PostgreSQL database from these files using AWS RDS.</p>
<p>The company has been pretty vague about how they expect you to extract insights, but you have come up with the following plan of attack:</p>
<ol>
<li>Create the database and ensure you can run basic queries against it</li>
<li>Look at how product ratings and total sales are related</li>
<li>See how products sell in different subcategories (bikes, helmets, socks, etc.)</li>
<li>Calculate which salespeople have performed the best in the past year</li>
<li>See if each salesperson's total sales are correlated with their commission percentage</li>
</ol>
<p>Of course, this is only your initial plan. As you explore the database, your strategy will change.</p>

<h2>Setting up AWS</h2>

<p>In this case, we'll assume that the company has given you an entry-level laptop that is not capable of running a PostgreSQL server locally. Therefore, you should set up a cloud database, connect to it from <code>psql</code>, and run the analysis either via <code>psql</code> or directly from the notebook.</p>

<h3>Question:</h3>
<p>Repeat the steps in Case 12.3 to create a new RDS instance with a PostgreSQL database.</p>

<h2>Overview of the data</h2>

<p>The data for the case is contained in the <code>./data/csvs</code> directory; specifically, it is the <code>AdventureWorks</code> sample data provided by Microsoft. We will be focusing on the Sales and Production categories. Complete documentation for the original data (of which you have only a subset) can be found <a href="https://dataedo.com/download/AdventureWorks.pdf">here</a>. </p>
<p><strong>Product Tables:</strong></p>
<ul>
<li><strong>Product</strong>: one row per product that the company sells</li>
<li><strong>ProductReview</strong>: one row per rating and review left by customers</li>
<li><strong>ProductModelProductDescriptionCulture</strong>: a link between products and their longer descriptions also indicating a "culture" - which language and region the product is for</li>
<li><strong>ProductDescription</strong>: a longer description of each product, for a specific region</li>
<li><strong>ProductCategory</strong>: the broad categories that products fit into</li>
<li><strong>ProductSubCategory</strong>: the narrower subcategories that products fit into</li>
</ul>
<p><strong>Sales Tables:</strong></p>
<ul>
<li><strong>SalesPerson</strong>: one row per salesperson, including information on their commission and performance</li>
<li><strong>SalesOrderHeader</strong>: one row per sale summarizing the sale</li>
<li><strong>SalesOrderDetail</strong>: many rows per sale, detailing each product that forms part of the sale</li>
<li><strong>SalesTerritory</strong>: the different territories where products are sold, including performance</li>
</ul>
<p><strong>Region Tables:</strong></p>
<ul>
<li><strong>CountryRegionCurrency</strong>: the currency used by each region</li>
<li><strong>CurrencyRate</strong>: the average and closing exchange rates for each currency compared to the USD</li>
</ul>

<h2>Using <code>ipython-sql</code> and <code>pgspecial</code></h2>
<p>Jupyter notebook is usually used to run Python code, but with an add-on it can run SQL directly against a database too. The extensions <code>ipython-sql</code> and <code>pgspecial</code> will let you do this.</p>

<p>Load the SQL add-on and connect to the database as follows. The connection string has the form <code>postgresql://username:password@host/database</code>. You'll need to change the username (<code>postgres</code>), password (<code>mysecretpassword</code>), host, and database name (<code>adventuretime</code>) to what you used when setting up your RDS instance:</p>

In [1]:
%load_ext sql
%sql postgresql://postgres:mysecretpassword@database-1.cz1gewn5ss3s.us-east-2.rds.amazonaws.com/adventuretime

'Connected: postgres@adventuretime'

<p>You should now be able to run SQL directly from any Jupyter notebook cell by starting the cell with a line that states <code>%%sql</code>. For example (once you have a database with some tables, which we'll only create later):</p>
<div class="codehilite"><pre><span></span><code><span class="o">%%</span><span class="k">sql</span>

<span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">product</span> <span class="k">LIMIT</span> <span class="mi">10</span><span class="p">;</span>
</code></pre></div>


<p><strong>Note:</strong> Unlike <code>pandas</code>, which automatically truncates output for large DataFrames, the SQL plug-in gives you exactly what you ask for. If you run <code>SELECT * FROM</code> on a table with a million rows and no <code>LIMIT</code> clause, it will output all million rows and probably freeze your notebook. To avoid mishaps, it's good practice to always use a <code>LIMIT</code> clause, even when you don't expect to need it.</p>

<h2>Creating the database and adding the tables</h2>
<p>Now, let's create a database called <code>adventuretime</code>. If you do this through the notebook, you'll have to add the line <code>end;</code> before your <code>CREATE DATABASE</code> command, because the add-on runs everything in transactions and PostgreSQL does not allow <code>CREATE DATABASE</code> inside a transaction block.</p>
<p>You'll need to add a table for each of the CSV files. Spend some time looking at the different CSV files and getting used to how they reference each other and what headers they create. Then, you'll need to write an appropriate <code>CREATE TABLE</code> command with appropriate types. You can figure out the types by inspecting the CSV files and/or referencing the documentation.</p>
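<p>One way to get a first guess at the column types is to read a few rows of each CSV and test which conversions succeed. The sketch below is only illustrative: the three-row <code>SAMPLE</code> excerpt and the helper <code>guess_type</code> are hypothetical, and the guesses should still be checked against the documentation (for example, monetary columns may be better stored as <code>MONEY</code> or <code>NUMERIC</code>).</p>

```python
import csv
import io
from datetime import date

# Hypothetical three-row excerpt; the real files live in data/csvs.
SAMPLE = """businessentityid,commissionpct,modifieddate
274,0.0,2010-12-28
275,0.012,2011-05-24
276,0.015,2011-05-24
"""


def guess_type(values):
    """Return a rough PostgreSQL type name for a list of strings."""
    non_empty = [v for v in values if v != ""]
    if not non_empty:
        return "TEXT"  # nothing to inspect, so fall back to the safest type
    # Try the strictest conversion first and relax step by step.
    for caster, sql_type in (
        (int, "INTEGER"),
        (float, "FLOAT"),
        (date.fromisoformat, "DATE"),
    ):
        try:
            for v in non_empty:
                caster(v)
            return sql_type
        except ValueError:
            continue
    return "TEXT"


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
guessed = {col: guess_type([r[col] for r in rows]) for col in rows[0]}
for col, sql_type in guessed.items():
    print(f"{col} {sql_type}")
```

<p>To try this on a real file, replace <code>io.StringIO(SAMPLE)</code> with an open file handle for one of the CSVs.</p>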

<h3>Exercise 1:</h3>
<p>Write all of the commands that you need to</p>
<ul>
<li>Create the database</li>
<li>Create the tables</li>
<li>Import the data from the CSVs</li>
</ul>
<p><strong>Hint:</strong> As an example, to add data for the <code>salesperson</code> table, you would use the following commands:</p>
<ol>
<li>Create table (can be run from Jupyter Notebook or the <code>psql</code> command line interface):</li>

<div class="codehilite"><pre><span></span><code><span class="k">CREATE</span> <span class="k">TABLE</span> <span class="n">salesperson</span> <span class="p">(</span>
    <span class="n">businessentityid</span> <span class="nb">INTEGER</span><span class="p">,</span>
    <span class="n">territoryid</span> <span class="nb">INTEGER</span><span class="p">,</span>
    <span class="n">salesquota</span> <span class="nb">INTEGER</span><span class="p">,</span>
    <span class="n">bonus</span> <span class="nb">INTEGER</span><span class="p">,</span>
    <span class="n">commissionpct</span> <span class="nb">FLOAT</span><span class="p">,</span>
    <span class="n">salesytd</span> <span class="nb">FLOAT</span><span class="p">,</span>
    <span class="n">saleslastyear</span> <span class="nb">FLOAT</span><span class="p">,</span>
    <span class="n">rowguid</span> <span class="nb">TEXT</span><span class="p">,</span>
    <span class="n">modifieddate</span> <span class="nb">DATE</span>
    <span class="p">);</span>
</code></pre></div>



<li>Copy data (has to be run from the <code>psql</code> shell):</li>
</ol>
<div class="codehilite"><pre><span></span><code><span class="err">\</span><span class="k">copy</span> <span class="n">salesperson</span> <span class="k">FROM</span> <span class="s1">&#39;data/csvs/salesperson.csv&#39;</span> <span class="k">with</span> <span class="p">(</span><span class="n">format</span> <span class="n">CSV</span><span class="p">,</span> <span class="n">header</span> <span class="k">true</span><span class="p">,</span> <span class="k">delimiter</span> <span class="s1">&#39;,&#39;</span><span class="p">);</span>
</code></pre></div>

## 1.

```sql
%%sql
end;

CREATE DATABASE adventuretime;
```

```sql
%%sql
end;

CREATE TABLE IF NOT EXISTS salesperson (
    businessentityid INTEGER,
    territoryid INTEGER,
    salesquota MONEY,
    bonus MONEY,
    commissionpct FLOAT,
    salesytd MONEY,
    saleslastyear MONEY,
    rowguid TEXT,
    modifieddate DATE
    );

CREATE TABLE IF NOT EXISTS salesterritory (
    territoryid INTEGER,
    name VARCHAR(50),
    countryregioncode VARCHAR(3),
    group1 VARCHAR(50),  -- named group1 because "group" is a reserved word in SQL
    salesytd MONEY,
    saleslastyear MONEY,
    costytd MONEY,
    costlastyear MONEY,
    rowguid TEXT,
    modifieddate DATE
    );

CREATE TABLE IF NOT EXISTS salesorderdetail (
    salesorderid INTEGER,
    salesorderdetailid INTEGER,
    carriertrackingnumber VARCHAR(25),
    orderqty INT2,
    productid INT,
    specialofferid INTEGER,
    unitprice MONEY,
    unitpricediscount MONEY,
    rowguid TEXT,
    modifieddate DATE
    );

CREATE TABLE IF NOT EXISTS salesorderheader (
    salesorderid INTEGER,
    revisionnumber INTEGER,
    orderdate DATE,
    duedate DATE,
    shipdate DATE,
    status INTEGER,
    onlineorderflag CHAR(1),
    purchaseordernumber VARCHAR(25),
    accountnumber VARCHAR(15),
    customerid INTEGER,
    salespersonid INTEGER,
    territoryid INTEGER,
    billtoaddressid INTEGER,
    shiptoaddressid INTEGER,
    shipmethodid INTEGER,
    creditcardid INTEGER,
    creditcardapprovalcode VARCHAR(15),
    currencyrateid INTEGER,
    subtotal MONEY,
    taxamt MONEY,
    freight MONEY,
    totaldue MONEY,
    comment VARCHAR(128),
    rowguid TEXT,
    modifieddate DATE
    );
```

```sql 
%%sql
end;

CREATE TABLE IF NOT EXISTS product (
    productid INT,
    name VARCHAR(50),
    productnumber VARCHAR(25),
    makeflag CHAR(2),
    finishedgoodsflag CHAR(2),
    color VARCHAR(15),
    safetystocklevel INT,
    reorderpoint INT,
    standardcost MONEY,
    listprice MONEY,
    size VARCHAR(5),
    sizeunitmeasurecode CHAR(3),
    weightunitmeasurecode CHAR(3),
    weight FLOAT,
    daystomanufacture INT,
    productline CHAR(2),
    class CHAR(2),
    style CHAR(2),
    productsubcategoryid INT,
    productmodelid INT,
    sellstartdate DATE,
    sellenddate DATE,
    discontinueddate DATE,
    rowguid TEXT,
    modifieddate DATE
    );

CREATE TABLE IF NOT EXISTS productcategory (
    productcategoryid INT,
    name VARCHAR(50),
    rowguid TEXT,
    modifieddate DATE
    );
    
CREATE TABLE IF NOT EXISTS productdescription (
    productdescriptionid INT,
    description VARCHAR(400),
    rowguid text,
    modifieddate DATE
    );
    
CREATE TABLE IF NOT EXISTS productmodelproductdescriptionculture (
    productmodelid INT,
    productdescriptionid INT,
    cultureid CHAR(6),
    modifieddate DATE
    );
    
CREATE TABLE IF NOT EXISTS productreview (
    productreviewid INT,
    productid INT,
    reviewername VARCHAR(50),
    reviewdate DATE,
    emailaddress VARCHAR(50),
    rating INT,
    comments VARCHAR(3850),
    modifieddate DATE
    );
    
CREATE TABLE IF NOT EXISTS productsubcategory (
    productsubcategoryid INT,
    productcategoryid INT,
    name VARCHAR(50),
    rowguid TEXT,
    modifieddate DATE
    );
```

```sql 
%%sql 
end;

CREATE TABLE IF NOT EXISTS currencyrate (
    currencyrateid INT,
    currencyratedate DATE,
    fromcurrencycode CHAR(3),
    tocurrencycode CHAR(3),
    averagerate FLOAT,
    endofdayrate FLOAT,
    modifieddate DATE
    );

CREATE TABLE IF NOT EXISTS countryregioncurrency (
    countryregioncode VARCHAR(3),
    currencycode CHAR(3),
    modifieddate DATE
    );
```

## 2.

```sql

\copy salesperson FROM 'data/csvs/salesperson.csv' with (format CSV, header true, delimiter ',');
\copy salesterritory FROM 'data/csvs/salesterritory.csv' with (format CSV, header true, delimiter ',');
\copy salesorderdetail FROM 'data/csvs/salesorderdetail.csv' with (format CSV, header true, delimiter ',');
\copy salesorderheader FROM 'data/csvs/salesorderheader.csv' with (format CSV, header true, delimiter ',');
\copy product FROM 'data/csvs/product.csv' with (format CSV, header true, delimiter ',');
\copy productcategory FROM 'data/csvs/productcategory.csv' with (format CSV, header true, delimiter ',');
\copy productdescription FROM 'data/csvs/productdescription.csv' with (format CSV, header true, delimiter ',');
\copy productmodelproductdescriptionculture FROM 'data/csvs/productmodelproductdescriptionculture.csv' with (format CSV, header true, delimiter ',');
\copy productreview FROM 'data/csvs/productreview.csv' with (format CSV, header true, delimiter ',');
\copy productsubcategory FROM 'data/csvs/productsubcategory.csv' with (format CSV, header true, delimiter ',');
\copy currencyrate FROM 'data/csvs/currencyrate.csv' with (format CSV, header true, delimiter ',');
\copy countryregioncurrency FROM 'data/csvs/countryregioncurrency.csv' with (format CSV, header true, delimiter ',');
```

-------

In [2]:
%%sql

-- CHECK TO MAKE SURE THE DATA IS LOADED AS EXPECTED
select * from product limit 10 ;

 * postgresql://postgres:***@database-1.cz1gewn5ss3s.us-east-2.rds.amazonaws.com/adventuretime
10 rows affected.


productid,name,productnumber,makeflag,finishedgoodsflag,color,safetystocklevel,reorderpoint,standardcost,listprice,size,sizeunitmeasurecode,weightunitmeasurecode,weight,daystomanufacture,productline,class,style,productsubcategoryid,productmodelid,sellstartdate,sellenddate,discontinueddate,rowguid,modifieddate
1,Adjustable Race,AR-5381,f,f,,1000,750,$0.00,$0.00,,,,,0,,,,,,2008-04-30,,,694215b7-08f7-4c0d-acb1-d734ba44c0c8,2014-02-08
2,Bearing Ball,BA-8327,f,f,,1000,750,$0.00,$0.00,,,,,0,,,,,,2008-04-30,,,58ae3c20-4f3a-4749-a7d4-d568806cc537,2014-02-08
3,BB Ball Bearing,BE-2349,t,f,,800,600,$0.00,$0.00,,,,,1,,,,,,2008-04-30,,,9c21aed2-5bfa-4f18-bcb8-f11638dc2e4e,2014-02-08
4,Headset Ball Bearings,BE-2908,f,f,,800,600,$0.00,$0.00,,,,,0,,,,,,2008-04-30,,,ecfed6cb-51ff-49b5-b06c-7d8ac834db8b,2014-02-08
316,Blade,BL-2036,t,f,,800,600,$0.00,$0.00,,,,,1,,,,,,2008-04-30,,,e73e9750-603b-4131-89f5-3dd15ed5ff80,2014-02-08
317,LL Crankarm,CA-5965,f,f,Black,500,375,$0.00,$0.00,,,,,0,,L,,,,2008-04-30,,,3c9d10b7-a6b2-4774-9963-c19dcee72fea,2014-02-08
318,ML Crankarm,CA-6738,f,f,Black,500,375,$0.00,$0.00,,,,,0,,M,,,,2008-04-30,,,eabb9a92-fa07-4eab-8955-f0517b4a4ca7,2014-02-08
319,HL Crankarm,CA-7457,f,f,Black,500,375,$0.00,$0.00,,,,,0,,,,,,2008-04-30,,,7d3fd384-4f29-484b-86fa-4206e276fe58,2014-02-08
320,Chainring Bolts,CB-2903,f,f,Silver,1000,750,$0.00,$0.00,,,,,0,,,,,,2008-04-30,,,7be38e48-b7d6-4486-888e-f53c26735101,2014-02-08
321,Chainring Nut,CN-6137,f,f,Silver,1000,750,$0.00,$0.00,,,,,0,,,,,,2008-04-30,,,3314b1d7-ef69-4431-b6dd-dc75268bd5df,2014-02-08


<h2>Finding our most popular products</h2>

<p>As discussed, the company would like to know which of their products is the most popular among customers. You figure that the average rating given in reviews is correlated with the number of sales of a particular product (that products with higher reviews have more sales).</p>

<h3>Exercise 2:</h3>
<p>Using the <code>product</code> and <code>productreview</code> tables, <code>JOIN</code> them and rank the products according to their average review rating. What are the names and IDs of the top 5 products?</p>

In [3]:
%%sql
-- Average rating per product, highest first; keep only the top 5

SELECT p.productid, p.name, 
    ROUND(AVG(pr.rating), 2) AS Avg_Rating
FROM product p
JOIN productreview pr ON p.productid = pr.productid
GROUP BY p.productid, p.name
ORDER BY avg_rating DESC
LIMIT 5;

 * postgresql://postgres:***@database-1.cz1gewn5ss3s.us-east-2.rds.amazonaws.com/adventuretime
3 rows affected.


productid,name,avg_rating
798,"Road-550-W Yellow, 40",5.0
709,"Mountain Bike Socks, M",5.0
937,HL Mountain Pedal,3.0


The review table does not contain enough information, so another approach is necessary.

-------

<h3>Exercise 3:</h3>
<p>Much to your disappointment, there are only three products with ratings and only four reviews in total! This is nowhere near enough to perform an analysis of the correlation between reviews and total sales.</p>
<p>Nevertheless, your manager wants the <strong>English description</strong> of these products for an upcoming sale. Use the documentation provided above if you need help navigating the structure to extract this!</p>
<p><strong>Hint:</strong> You'll notice that the value for <code>cultureid</code> in the <code>productmodelproductdescriptionculture</code> table often has extra trailing spaces which makes it difficult to reliably get descriptions of a specific language. You should first modify this table before writing the <code>SELECT</code> statement to get the descriptions that your manager wants. To do this, you can use an <code>UPDATE</code> statement with Postgres's <a href="https://w3resource.com/PostgreSQL/trim-function.php"><code>TRIM</code></a> function.</p>
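<p>The trim-then-filter idea in the hint can be sketched as follows. This is a minimal illustration that uses Python's built-in <code>sqlite3</code> module as a stand-in for PostgreSQL so that it runs without a server. The table name <code>pmpdc</code> and its three rows are made up, SQLite's <code>TRIM(x)</code> is used in place of the <code>TRIM(BOTH FROM x)</code> form, and comparison rules for padded values may differ in PostgreSQL depending on the column type.</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pmpdc (productmodelid INTEGER, cultureid TEXT)")
# Hypothetical rows whose culture codes carry trailing spaces.
conn.executemany(
    "INSERT INTO pmpdc VALUES (?, ?)",
    [(1, "en    "), (2, "fr    "), (3, "en    ")],
)

count_sql = "SELECT COUNT(*) FROM pmpdc WHERE cultureid = 'en'"

# Before trimming, the padded values do not equal 'en' in SQLite.
before = conn.execute(count_sql).fetchone()[0]

# Strip leading and trailing spaces in place, then count again.
conn.execute("UPDATE pmpdc SET cultureid = TRIM(cultureid)")
after = conn.execute(count_sql).fetchone()[0]

print(before, after)
```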

```sql 
%%sql
end;

UPDATE productmodelproductdescriptionculture SET cultureid = TRIM (BOTH FROM cultureid);
```

In [4]:
%%sql

SELECT pmpdc.productmodelid AS "Model ID", 
    pd.description AS "Description"
FROM productdescription pd
JOIN  productmodelproductdescriptionculture pmpdc 
ON pd.productdescriptionid = pmpdc.productdescriptionid
WHERE pmpdc.cultureid = 'en'
LIMIT 10;

 * postgresql://postgres:***@database-1.cz1gewn5ss3s.us-east-2.rds.amazonaws.com/adventuretime
10 rows affected.


Model ID,Description
95,Chromoly steel.
96,Aluminum alloy cups; large diameter spindle.
97,Aluminum alloy cups and a hollow axle.
23,"Suitable for any type of riding, on or off-road. Fits any budget. Smooth-shifting with a comfortable ride."
22,"This bike delivers a high-level of performance on a budget. It is responsive and maneuverable, and offers peace-of-mind when you decide to go off-road."
21,For true trail addicts. An extremely durable bike that will go anywhere and keep you in control on challenging terrain - without breaking your budget.
20,Serious back-country riding. Perfect for all levels of competition. Uses the same HL Frame as the Mountain-100.
19,"Top-of-the-line competition mountain bike. Performance-enhancing options include the innovative HL Frame, super-smooth front suspension, and traction for all terrain."
39,Suitable for any type of off-road trip. Fits any budget.
31,Entry level adult bike; offers a comfortable ride cross-country or down the block. Quick-release hubs and rims.


-------

<h3>Exercise 4:</h3>
<p>Since we cannot infer the most popular products from the reviews, we will go with an alternative strategy.</p>
<p>Get the model ID, name, description, and total number of sales for each product and display the top-10 selling products. You can infer how often products have been sold by looking at the <code>salesorderdetail</code> table (each row might indicate more than one sale, so take note of <code>OrderQty</code>).</p>
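<p>The difference between counting rows and summing <code>OrderQty</code> can be seen on a toy example. The sketch below again uses Python's built-in <code>sqlite3</code> module as a stand-in for PostgreSQL, and the five detail rows are hypothetical.</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE salesorderdetail "
    "(salesorderid INTEGER, productid INTEGER, orderqty INTEGER)"
)
# Hypothetical detail lines: one row per product per order.
conn.executemany(
    "INSERT INTO salesorderdetail VALUES (?, ?, ?)",
    [(1, 712, 3), (1, 870, 1), (2, 712, 5), (3, 870, 2), (3, 870, 1)],
)

rows = conn.execute(
    """
    SELECT productid,
           COUNT(*)      AS order_lines,  -- how many rows mention the product
           SUM(orderqty) AS units_sold    -- how many units were actually sold
    FROM salesorderdetail
    GROUP BY productid
    ORDER BY units_sold DESC
    """
).fetchall()
print(rows)
```

<p>Here product 870 appears on more order lines, but product 712 sold more units, so ranking by row count and ranking by <code>SUM(orderqty)</code> give different answers.</p>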

In [5]:
%%sql

WITH description_en (model_id, description) 
AS (
    SELECT pmpdc.productmodelid,
    pd.description
    FROM productdescription pd
    JOIN  productmodelproductdescriptionculture pmpdc 
    ON pd.productdescriptionid = pmpdc.productdescriptionid
    WHERE pmpdc.cultureid = 'en'
    ),
get_name (model_id, product_id, description, name)
AS (
    SELECT de.model_id,
        p.productid,
        de.description,
        p.name
    FROM product p
    INNER JOIN description_en de
    ON p.productmodelid = de.model_id
    WHERE p.productid IS NOT null
    ),
order_detail (model_id, name, description, sales)
AS (
    SELECT gn.model_id,
        gn.name,
        gn.description,
        sum(sod.orderqty)
    FROM salesorderdetail sod
    INNER JOIN get_name gn
    ON gn.product_id = sod.productid
    GROUP BY gn.model_id,
        gn.name,
        gn.description
    )
SELECT  od.model_id AS "Model ID",
    od.name AS "Name",
    od.description AS "Description",
    sales AS "Total Sales"
FROM order_detail od 
ORDER BY sales DESC
LIMIT 10;

 * postgresql://postgres:***@database-1.cz1gewn5ss3s.us-east-2.rds.amazonaws.com/adventuretime
10 rows affected.


Model ID,Name,Description,Total Sales
2,AWC Logo Cap,Traditional style with a flip-up brim; one-size fits all.,8311
111,Water Bottle - 30 oz.,AWC logo water bottle - holds 30 oz; leak-proof.,6815
33,"Sport-100 Helmet, Blue","Universal fit, well-vented, lightweight , snap-on visor.",6743
11,"Long-Sleeve Logo Jersey, L",Unisex long-sleeve AWC logo microfiber cycling jersey,6592
33,"Sport-100 Helmet, Black","Universal fit, well-vented, lightweight , snap-on visor.",6532
33,"Sport-100 Helmet, Red","Universal fit, well-vented, lightweight , snap-on visor.",6266
1,"Classic Vest, S","Light-weight, wind-resistant, packs to fit into a pocket.",4247
114,Patch Kit/8 Patches,"Includes 8 different size patches, glue and sandpaper.",3865
32,"Short-Sleeve Classic Jersey, XL","Short sleeve classic breathable jersey with superior moisture control, front zipper, and 3 back pockets.",3864
11,"Long-Sleeve Logo Jersey, M",Unisex long-sleeve AWC logo microfiber cycling jersey,3636


-------

<h3>Exercise 5:</h3>
<p>Let's look at the correlation between quantity sold and price for each item in each subcategory. Some subcategories don't have enough sales to make the correlation meaningful, so only look at the top 10 subcategories by total quantity of sales.</p>
<p>Once you've looked at the data, make a hypothesis about what causes any positive or negative correlations between price and quantity, and explain this in 2-3 sentences.</p>
<p><strong>Hint:</strong> You'll need to calculate the total quantities from <code>salesorderdetail</code> again and group the products by subcategory. It'll probably be easier if you use at least two <a href="https://www.postgresql.org/docs/9.1/queries-with.html">CTEs</a>. You can calculate the correlation in PostgreSQL by using the built-in <a href="https://www.postgresql.org/docs/9.4/functions-aggregate.html"><code>corr()</code></a> function.</p>
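<p>To make the grouping concrete, here is a small sketch of a per-subcategory Pearson correlation in plain Python. The helper <code>pearson</code> and the four product rows are hypothetical. The point to notice is that each group needs several products: with a single point (or no variation in one variable) the correlation is undefined, which is the situation in which an aggregate such as <code>corr()</code> has nothing meaningful to return.</p>

```python
from collections import defaultdict
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length lists, or None if undefined."""
    n = len(xs)
    if n < 2:
        return None
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return None  # no variation in one variable
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sqrt(sxx * syy)


# Hypothetical (subcategory, list price, units sold) rows, one per product.
products = [
    ("Helmets", 30.0, 500),
    ("Helmets", 35.0, 400),
    ("Helmets", 40.0, 300),
    ("Caps", 9.0, 800),
]

# Collect the (price, units) points of every subcategory.
by_subcategory = defaultdict(list)
for subcategory, price, units in products:
    by_subcategory[subcategory].append((price, units))

correlations = {
    name: pearson([p for p, _ in points], [u for _, u in points])
    for name, points in by_subcategory.items()
}
print(correlations)
```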

In [29]:
%%sql

WITH description_en (model_id, description) 
AS (
    SELECT pmpdc.productmodelid,
    pd.description
    FROM productdescription pd
    JOIN  productmodelproductdescriptionculture pmpdc 
    ON pd.productdescriptionid = pmpdc.productdescriptionid
    WHERE pmpdc.cultureid = 'en'
    ),
get_name (model_id, product_id, subcategoryid, description, name)
AS (
    SELECT de.model_id,
        p.productid,
        p.productsubcategoryid,
        de.description,
        p.name
    FROM product p
    INNER JOIN description_en de
    ON p.productmodelid = de.model_id
    WHERE p.productid IS NOT null
    ),
order_detail (model_id, name, description, units_sales, sales)
AS (
    SELECT gn.model_id,
        gn.name,
        gn.description,
        sum(sod.orderqty),
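        -- Revenue per product. Note: this expression subtracts (1 - discount) from
        -- the unit price; if unitpricediscount is a fractional discount, the usual
        -- line total is orderqty * unitprice * (1 - discount).
        sum(sod.orderqty*(sod.unitprice-(1::money-sod.unitpricediscount)))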
    FROM salesorderdetail sod
    INNER JOIN get_name gn
    ON gn.product_id = sod.productid
    GROUP BY gn.model_id,
        gn.name,
        gn.description
    )
SELECT  od.model_id AS "Model ID",
    od.name AS "Name",
    od.description AS "Description",
    sum(units_sales) AS "Units Sales",
    sum(sales) AS "Total Sales",
    CORR(units_sales, sales::numeric::float) AS "Units Sales and Total Sales Corr"
FROM order_detail od
WHERE sales IS NOT NULL
AND units_sales IS NOT NULL
GROUP BY od.model_id,
    od.name,
    od.description
ORDER BY SUM(units_sales) DESC
LIMIT 10;

 * postgresql://postgres:***@database-1.cz1gewn5ss3s.us-east-2.rds.amazonaws.com/adventuretime
10 rows affected.


Model ID,Name,Description,Units Sales,Total Sales,Units Sales and Total Sales Corr
2,AWC Logo Cap,Traditional style with a flip-up brim; one-size fits all.,8311,"$43,254.57",
111,Water Bottle - 30 oz.,AWC logo water bottle - holds 30 oz; leak-proof.,6815,"$21,936.15",
33,"Sport-100 Helmet, Blue","Universal fit, well-vented, lightweight , snap-on visor.",6743,"$160,052.80",
11,"Long-Sleeve Logo Jersey, L",Unisex long-sleeve AWC logo microfiber cycling jersey,6592,"$193,691.33",
33,"Sport-100 Helmet, Black","Universal fit, well-vented, lightweight , snap-on visor.",6532,"$155,613.11",
33,"Sport-100 Helmet, Red","Universal fit, well-vented, lightweight , snap-on visor.",6266,"$152,663.32",
1,"Classic Vest, S","Light-weight, wind-resistant, packs to fit into a pocket.",4247,"$155,654.41",
114,Patch Kit/8 Patches,"Includes 8 different size patches, glue and sandpaper.",3865,"$4,365.53",
32,"Short-Sleeve Classic Jersey, XL","Short sleeve classic breathable jersey with superior moisture control, front zipper, and 3 back pockets.",3864,"$127,057.97",
11,"Long-Sleeve Logo Jersey, M",Unisex long-sleeve AWC logo microfiber cycling jersey,3636,"$111,875.95",


In [32]:
%%sql

WITH description_en (model_id, description) 
AS (
    SELECT pmpdc.productmodelid,
    pd.description
    FROM productdescription pd
    JOIN  productmodelproductdescriptionculture pmpdc 
    ON pd.productdescriptionid = pmpdc.productdescriptionid
    WHERE pmpdc.cultureid = 'en'
    ),
get_name (model_id, product_id, subcategoryid, description, name)
AS (
    SELECT de.model_id,
        p.productid,
        p.productsubcategoryid,
        de.description,
        p.name
    FROM product p
    INNER JOIN description_en de
    ON p.productmodelid = de.model_id
    WHERE p.productid IS NOT null
    ),
order_detail (model_id, name, description, units_sales, sales)
AS (
    SELECT gn.model_id,
        gn.name,
        gn.description,
        sum(sod.orderqty),
        sum(sod.orderqty*(sod.unitprice-(1::money-sod.unitpricediscount)))
    FROM salesorderdetail sod
    INNER JOIN get_name gn
    ON gn.product_id = sod.productid
    GROUP BY gn.model_id,
        gn.name,
        gn.description
    )
SELECT  od.model_id AS "Model ID",
    od.name AS "Name",
    od.description AS "Description",
    sum(units_sales) AS "Units Sales",
    sum(sales) AS "Total Sales",
    CORR(units_sales, sales::numeric::float) AS "Units Sales and Total Sales Corr"
FROM order_detail od
WHERE sales IS NOT NULL
AND units_sales IS NOT NULL
GROUP BY od.model_id,
    od.name,
    od.description
ORDER BY SUM(sales) DESC
LIMIT 10;

 * postgresql://postgres:***@database-1.cz1gewn5ss3s.us-east-2.rds.amazonaws.com/adventuretime
10 rows affected.


Model ID,Name,Description,Units Sales,Total Sales,Units Sales and Total Sales Corr
20,"Mountain-200 Black, 38",Serious back-country riding. Perfect for all levels of competition. Uses the same HL Frame as the Mountain-100.,2977,"$4,403,175.47",
20,"Mountain-200 Black, 42",Serious back-country riding. Perfect for all levels of competition. Uses the same HL Frame as the Mountain-100.,2664,"$4,011,405.22",
20,"Mountain-200 Silver, 38",Serious back-country riding. Perfect for all levels of competition. Uses the same HL Frame as the Mountain-100.,2394,"$3,694,089.52",
20,"Mountain-200 Silver, 42",Serious back-country riding. Perfect for all levels of competition. Uses the same HL Frame as the Mountain-100.,2234,"$3,439,056.14",
20,"Mountain-200 Silver, 46",Serious back-country riding. Perfect for all levels of competition. Uses the same HL Frame as the Mountain-100.,2216,"$3,433,871.75",
20,"Mountain-200 Black, 46",Serious back-country riding. Perfect for all levels of competition. Uses the same HL Frame as the Mountain-100.,2111,"$3,308,986.90",
26,"Road-250 Black, 44","Alluminum-alloy frame provides a light, stiff ride, whether you are racing in the velodrome or on a demanding club ride on country roads.",1642,"$2,516,660.52",
26,"Road-250 Black, 48","Alluminum-alloy frame provides a light, stiff ride, whether you are racing in the velodrome or on a demanding club ride on country roads.",1498,"$2,346,749.77",
26,"Road-250 Black, 52","Alluminum-alloy frame provides a light, stiff ride, whether you are racing in the velodrome or on a demanding club ride on country roads.",1245,"$2,011,203.61",
25,"Road-150 Red, 56","This bike is ridden by race winners. Developed with the Adventure Works Cycles professional race team, it has a extremely light heat-treated aluminum frame, and steering that allows precision control.",664,"$1,847,153.89",


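There seems to be no clear relationship between units sold and total sales. For example, **AWC Logo Cap** sold 8,311 units for only \$43,254.57 in total sales, whereas **Mountain-200 Black, 38** sold just 2,977 units for \$4,403,175.47. The correlation column is empty because each group in the query contains a single product, and a correlation cannot be computed from a single point.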

-------

<h2>Finding our top salespeople</h2>
<p>As mentioned earlier, we want to find our best salespeople and see whether or not we can incentivize them in an appropriate manner. Namely, we want to determine if the commission percentage we give them motivates them to make more and bigger sales.</p>

<h3>Exercise 5:</h3>
<p>Find the top five performing salespeople by using the <code>salesytd</code> (Sales, year-to-date) column. (We only need to know the <code>businessentityid</code> for each salesperson as this uniquely identifies each.) Why might you be skeptical of these numbers right now?</p>

In [7]:
%%sql

SELECT businessentityid,
    salesytd
FROM salesperson
ORDER BY salesytd DESC
LIMIT 5;

 * postgresql://postgres:***@database-1.cz1gewn5ss3s.us-east-2.rds.amazonaws.com/adventuretime
5 rows affected.


businessentityid,salesytd
276,"$4,251,368.55"
289,"$4,116,871.23"
275,"$3,763,178.18"
277,"$3,189,418.37"
290,"$3,121,616.32"


We don't know how or when these values were calculated, so it is safer to recompute them from the order data in the database.

-------

<h3>Exercise 6:</h3>
<p>Using <code>salesorderheader</code>, find the top 5 salespeople who made the most sales <strong>in the most recent year</strong> (2014). (There is a column called <code>subtotal</code> - use that.) Sales that do not have an associated salesperson should be excluded from your calculations and final output. All orders that were made within the 2014 calendar year should be included.</p>
<p><strong>Hint:</strong> You can use the syntax <code>'1970-01-01'::date</code> to generate an arbitrary date in PostgreSQL and compare this to specific dates in the tables.</p>
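<p>When filtering on a calendar year, it is easy to drop the first day by using a strict comparison. The sketch below shows the half-open range pattern (start inclusive, end exclusive) in plain Python with hypothetical order dates; the equivalent SQL predicate would be <code>orderdate &gt;= '2014-01-01'::date AND orderdate &lt; '2015-01-01'::date</code>.</p>

```python
from datetime import date

YEAR_START = date(2014, 1, 1)       # inclusive lower bound
NEXT_YEAR_START = date(2015, 1, 1)  # exclusive upper bound


def in_2014(order_date):
    """True for any date inside the 2014 calendar year."""
    return YEAR_START <= order_date < NEXT_YEAR_START


# Hypothetical order dates around both boundaries.
orders = [
    date(2013, 12, 31),
    date(2014, 1, 1),
    date(2014, 6, 30),
    date(2015, 1, 1),
]
kept = [d for d in orders if in_2014(d)]
print(kept)
```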

In [8]:
%%sql

SELECT salespersonid,
    sum(subtotal) AS total_sales
FROM salesorderheader
WHERE orderdate >= '2014-01-01'::date
AND orderdate < '2015-01-01'::date
AND salespersonid is not null
GROUP BY salespersonid
ORDER by sum(subtotal) DESC
LIMIT 5;

 * postgresql://postgres:***@database-1.cz1gewn5ss3s.us-east-2.rds.amazonaws.com/adventuretime
5 rows affected.


salespersonid,total_sales
289,"$1,382,996.58"
276,"$1,271,088.54"
275,"$1,057,247.43"
282,"$1,044,810.84"
277,"$1,040,093.41"


The totals calculated from the salesorderheader table differ from the salesytd values in the salesperson table. The calculated values are more trustworthy because we know exactly which orders and dates they cover.

-------

<h3>Exercise 7:</h3>
<p>Looking at the documentation, you will see that <code>subtotal</code> in the <code>salesorderheader</code> table is calculated from other tables in the database. To validate this figure (instead of trusting it blindly), let's calculate <code>subtotal</code> manually. Using the <code>salesorderdetail</code> and <code>salesorderheader</code> tables, calculate the sales for each salesperson for <strong>this past year</strong> (2014) and display results for the top 5 salespeople.</p>
<p><strong>Hint:</strong> You will have to <code>JOIN</code> <code>salesorderdetail</code> on <code>salesorderheader</code> to get the salesperson, calculate line totals for each sale using appropriate discounts, then sum all the line totals to get the total sale. You will want to use <code>WITH</code> clauses again to keep things sane.</p>
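<p>The line-total arithmetic in the hint can be sketched in plain Python. This assumes that <code>unitpricediscount</code> is stored as a fraction of the unit price (so 0.10 means 10% off), which should be confirmed against the documentation. The helper <code>line_total</code> and the three detail lines are hypothetical.</p>

```python
from decimal import Decimal


def line_total(order_qty, unit_price, unit_price_discount):
    """Line total, assuming the discount is a fraction of the unit price."""
    return order_qty * unit_price * (Decimal("1") - unit_price_discount)


# Hypothetical lines: (salesorderid, orderqty, unitprice, unitpricediscount)
detail_lines = [
    (1, 2, Decimal("100.00"), Decimal("0.00")),
    (1, 4, Decimal("50.00"), Decimal("0.10")),
    (2, 1, Decimal("20.00"), Decimal("0.00")),
]

# Sum the line totals of each order to get its subtotal.
order_totals = {}
for order_id, qty, price, discount in detail_lines:
    previous = order_totals.get(order_id, Decimal("0"))
    order_totals[order_id] = previous + line_total(qty, price, discount)

print(order_totals)
```

<p><code>Decimal</code> is used instead of <code>float</code> so that currency amounts are not affected by binary rounding.</p>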

In [9]:
%%sql 
WITH order_details (orderid, order_value)
AS (
    -- value of each order line, using the discount expression as originally written
    SELECT salesorderid,
        orderqty*(unitprice-(1::money-unitpricediscount))
    FROM salesorderdetail
    )
SELECT soh.salespersonid,
    SUM(od.order_value) AS total_calculated
FROM salesorderheader soh
INNER JOIN order_details od
    ON od.orderid = soh.salesorderid
WHERE soh.orderdate > '2014-01-01'::date
    AND soh.salespersonid IS NOT NULL
GROUP BY soh.salespersonid
ORDER BY SUM(od.order_value) DESC
LIMIT 5;


 * postgresql://postgres:***@database-1.cz1gewn5ss3s.us-east-2.rds.amazonaws.com/adventuretime
5 rows affected.


salespersonid,total_calculated
289,"$1,383,339.19"
276,"$1,277,674.47"
275,"$1,057,451.29"
282,"$1,044,755.98"
277,"$1,038,832.80"


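The values calculated from <code>salesorderdetail</code> are slightly different from the <code>subtotal</code> values in <code>salesorderheader</code> (for example, $1,383,339.19 versus $1,382,996.58 for salesperson 289). One likely contributor is the line-total expression in the query: it subtracts <code>(1 - unitpricediscount)</code> from the unit price, whereas a discounted line total is normally <code>orderqty * unitprice * (1 - unitpricediscount)</code>.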
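
To make the difference between the two discount expressions concrete, here is a minimal Python sketch. It assumes <code>unitpricediscount</code> is a fraction between 0 and 1; the order line (3 units at 100.00 with a 10% discount) and both function names are hypothetical.

```python
from decimal import Decimal


def line_total(qty: int, unit_price: Decimal, discount: Decimal) -> Decimal:
    """Discounted line total: quantity * unit price * (1 - discount fraction)."""
    return qty * unit_price * (Decimal("1") - discount)


def line_total_as_written(qty: int, unit_price: Decimal, discount: Decimal) -> Decimal:
    """Expression from the query: quantity * (unit price - (1 - discount))."""
    return qty * (unit_price - (Decimal("1") - discount))


# Hypothetical order line, not taken from the database
qty, price, disc = 3, Decimal("100.00"), Decimal("0.10")

expected = line_total(qty, price, disc)               # 3 * 100.00 * 0.90
as_written = line_total_as_written(qty, price, disc)  # 3 * (100.00 - 0.90)
```

For this line the multiplicative form gives 270.00 and the expression as written gives 297.30. With no discount at all, the expression as written still takes 1 off every unit price, so it would not reproduce <code>subtotal</code> exactly even on undiscounted orders.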

<h3>Exercise 8:</h3>
<p>Using <code>corr()</code>, see if there is a positive relationship between total sales and commission percentage.</p>

In [27]:
%%sql 
WITH order_details (orderid, order_value)
AS (
    -- value of each order line, same expression as in Exercise 7
    SELECT salesorderid,
        orderqty*(unitprice-(1::money-unitpricediscount))
    FROM salesorderdetail
    ),
sales (id, sales)
AS (
    -- total 2014 sales per salesperson
    SELECT soh.salespersonid,
        SUM(od.order_value) AS total_calculated
    FROM salesorderheader soh
    INNER JOIN order_details od
        ON od.orderid = soh.salesorderid
    WHERE soh.orderdate > '2014-01-01'::date
        AND soh.salespersonid IS NOT NULL
    GROUP BY soh.salespersonid
    ORDER BY SUM(od.order_value) DESC
    )
-- grouping by id leaves a single row per group for corr() to work on
SELECT id,
    SUM(sp.commissionpct::numeric::float8) AS "% Commission",
    SUM(s.sales) AS "Total Sales",
    CORR(sp.commissionpct::numeric::float8, s.sales::numeric::float8)
FROM sales s
INNER JOIN salesperson sp
    ON sp.businessentityid = s.id
WHERE sp.commissionpct IS NOT NULL
    AND s.sales IS NOT NULL
GROUP BY id
ORDER BY "% Commission" DESC
LIMIT 20;

 * postgresql://postgres:***@database-1.cz1gewn5ss3s.us-east-2.rds.amazonaws.com/adventuretime
17 rows affected.


id,% Commission,Total Sales,corr
289,0.02,"$1,383,339.19",
284,0.019,"$600,330.57",
286,0.018,"$584,444.62",
288,0.018,"$580,931.60",
290,0.016,"$868,541.64",
276,0.015,"$1,277,674.47",
277,0.015,"$1,038,832.80",
282,0.015,"$1,044,755.98",
275,0.012,"$1,057,451.29",
283,0.012,"$490,832.03",


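<code>corr()</code> returns NULL (displayed as None) in every row because the outer query groups by <code>id</code>. Each group then holds a single (commission, sales) pair, and a correlation coefficient is undefined for a single point, so casting both columns to the same data type does not help. To get one coefficient across all salespeople, <code>corr()</code> has to run over the whole joined result without <code>GROUP BY id</code>.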
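
As a rough illustration of why a per-salesperson correlation comes back empty, the Python sketch below computes a Pearson coefficient by hand and returns <code>None</code> whenever either column has zero variance, which is the situation inside each <code>GROUP BY id</code> group. The <code>pearson</code> helper and the three-person sample are hypothetical; only the first pair (0.02, 1383339.19) is copied from the output above.

```python
from math import sqrt
from typing import Optional, Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> Optional[float]:
    """Pearson correlation; None when it is undefined (comparable to SQL NULL)."""
    n = len(xs)
    if n != len(ys) or n == 0:
        return None
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        return None  # zero variance: a single point, or a constant column
    return sxy / sqrt(sxx * syy)


# One (commission, sales) pair per group, as produced by GROUP BY id
single = pearson([0.02], [1383339.19])

# Commission is constant within a group even if there are several rows
constant_x = pearson([0.015, 0.015], [1.0, 2.0])

# Hypothetical values for several salespeople taken together
many = pearson([0.010, 0.015, 0.020], [100.0, 150.0, 260.0])
```

Only the last call produces a number; the first two return <code>None</code>, which matches the empty <code>corr</code> column in the query output.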

-------

<h3>Exercise 9:</h3>
<p>Remember how we mentioned that products were sold in many regions? This is why you had to work with the <code>culture</code> value before to get the English language descriptions. To make matters worse, you are told the sales are recorded in <strong>local</strong> currency, so your previous analysis is flawed, and you must convert all amounts to USD if you wish to compare the different salespeople fairly!</p>
<p>Use the <code>countryregioncurrency</code> table in combination with the <code>salesperson</code> and <code>salesterritory</code> tables to figure out the relevant currency code for each of the top salespeople.</p>

In [11]:
%%sql
WITH vendor (id,regioncode)
AS (
    SELECT businessentityid,
        countryregioncode
    FROM salesperson s
    INNER JOIN salesterritory st
    ON s.territoryid=st.territoryid
    ),
currency_symbol (id, currencycode)
AS (
    SELECT v.id,
        crc.currencycode
    FROM countryregioncurrency crc
    INNER JOIN vendor v
    ON crc.countryregioncode=v.regioncode
)
SELECT cs.id,
    cs.currencycode
FROM currency_symbol cs;

 * postgresql://postgres:***@database-1.cz1gewn5ss3s.us-east-2.rds.amazonaws.com/adventuretime
16 rows affected.


id,currencycode
275,USD
276,USD
277,USD
278,CAD
279,USD
280,USD
281,USD
282,CAD
283,USD
284,USD


-------

<h3>Exercise 10:</h3>
<p>Now that we have the currency codes associated with each salesperson, redo Exercise 7 to take the currency exchange into account. If there are salespeople in the top 5 that weren't there before, explain why.</p>
<p><strong>Hint:</strong> The rates in the <code>currencyrate</code> table always go from <code>FromCurrencyCode=USD</code> to <code>ToCurrencyCode=&lt;Desired Currency Code&gt;</code>, and they are listed every day. When calculating line totals, use the <code>AverageRate</code> for that day. You should be able to reuse a lot of Exercise 7.</p>

In [12]:
%%sql
WITH vendor (id,regioncode)
AS (
    SELECT businessentityid,
        countryregioncode
    FROM salesperson s
    INNER JOIN salesterritory st
    ON s.territoryid=st.territoryid
    ),
currency_symbol (id, currencycode)
AS (
    SELECT v.id,
        crc.currencycode
    FROM countryregioncurrency crc
    INNER JOIN vendor v
    ON crc.countryregioncode=v.regioncode
    ), 
order_details (orderid, id, date, currency)
AS (
    SELECT salesorderid,
        salespersonid,
        orderdate,
        cs.currencycode
    FROM salesorderheader soh
    INNER JOIN currency_symbol cs
    ON soh.salespersonid = cs.id
    ),
sales (date, id, sales, currency)
AS (
    SELECT od.date,
        od.id,
        sod.orderqty*(sod.unitprice-(1::money-sod.unitpricediscount)),
        od.currency
    FROM salesorderdetail sod
    INNER JOIN order_details od
    ON sod.salesorderid=od.orderid
    ),
currency_change (id, sales)
AS (
    SELECT s.id,
        s.sales*averagerate
    FROM currencyrate cr
    INNER JOIN sales s
    ON cr.currencyratedate=s.date
    )
SELECT id,
    sum(sales) AS Total_sales
FROM currency_change
GROUP BY id
ORDER BY sum(sales) DESC
LIMIT 5;

 * postgresql://postgres:***@database-1.cz1gewn5ss3s.us-east-2.rds.amazonaws.com/adventuretime
5 rows affected.


id,total_sales
276,"$10,338,846,967.19"
290,"$10,014,587,601.88"
277,"$9,598,716,559.95"
275,"$9,140,132,652.82"
289,"$8,708,869,946.65"
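
The totals above run into the billions, far above the figures from Exercise 7, which suggests the conversion step is inflating them. Three things in the query stand out: the join to <code>currencyrate</code> matches on date only, so each sale is paired with the rate of every currency quoted that day; the amount is multiplied by <code>averagerate</code>, although the hint says rates run from USD to the local currency; and the 2014 filter from Exercise 7 is missing. The Python sketch below shows the intended arithmetic under stated assumptions: the rates are made-up numbers, <code>to_usd</code> and <code>fan_out_on_date_only</code> are hypothetical helpers, and it is assumed that the table holds no USD-to-USD row.

```python
from datetime import date
from decimal import Decimal

# Hypothetical rates keyed by (date, to_currency), all quoted from USD
rates = {
    (date(2014, 3, 1), "CAD"): Decimal("1.10"),
    (date(2014, 3, 1), "EUR"): Decimal("0.72"),
}


def to_usd(amount: Decimal, currency: str, on: date) -> Decimal:
    """Convert a local-currency amount to USD using that day's USD-to-local rate."""
    if currency == "USD":
        return amount  # assumed: no USD-to-USD rate row, so pass the amount through
    rate = rates[(on, currency)]  # match on date AND currency code
    return amount / rate  # rate is local units per 1 USD, so divide


def fan_out_on_date_only(amount: Decimal, on: date) -> Decimal:
    """What a date-only join followed by SUM does: one product per quoted currency."""
    return sum((amount * r for (d, _), r in rates.items() if d == on), Decimal("0"))


cad_sale = to_usd(Decimal("110.00"), "CAD", date(2014, 3, 1))
usd_sale = to_usd(Decimal("110.00"), "USD", date(2014, 3, 1))
inflated = fan_out_on_date_only(Decimal("110.00"), date(2014, 3, 1))
```

In SQL terms this corresponds to adding the currency code to the join condition, keeping USD sales with a rate of 1 (for instance through a <code>LEFT JOIN</code> and <code>COALESCE</code>), and dividing by <code>averagerate</code> rather than multiplying.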


-------

<h3>Exercise 11:</h3>
<p>How does the correlation from Exercise 8 change once you've adjusted for the currency?</p>

In [28]:
%%sql
WITH vendor (id, regioncode, commission)
AS (
    SELECT businessentityid,
        countryregioncode,
        commissionpct
    FROM salesperson s
    INNER JOIN salesterritory st
    ON s.territoryid=st.territoryid
    ),
currency_symbol (id, currencycode)
AS (
    SELECT v.id,
        crc.currencycode
    FROM countryregioncurrency crc
    INNER JOIN vendor v
    ON crc.countryregioncode=v.regioncode
    ), 
order_details (orderid, id, date, currency)
AS (
    SELECT salesorderid,
        salespersonid,
        orderdate,
        cs.currencycode
    FROM salesorderheader soh
    INNER JOIN currency_symbol cs
    ON soh.salespersonid = cs.id
    ),
sales (date, id, sales, currency)
AS (
    SELECT od.date,
        od.id,
        sod.orderqty*(sod.unitprice-(1::money-sod.unitpricediscount)),
        od.currency
    FROM salesorderdetail sod
    INNER JOIN order_details od
    ON sod.salesorderid=od.orderid
    ),
currency_change (id, sales)
AS (
    SELECT s.id,
        s.sales*averagerate
    FROM currencyrate cr
    INNER JOIN sales s
    ON cr.currencyratedate=s.date
    )
SELECT cc.id,
    v.commission AS "% Commision",
    sum(sales) AS "Total Sales",
    CORR(sales::numeric::float8, v.commission)
FROM currency_change cc
INNER JOIN vendor v
ON cc.id = v.id
WHERE v.commission IS NOT NULL
AND sales IS NOT NULL
GROUP BY cc.id,
    v.commission
ORDER BY sum(sales) DESC
LIMIT 5;

 * postgresql://postgres:***@database-1.cz1gewn5ss3s.us-east-2.rds.amazonaws.com/adventuretime
5 rows affected.


id,% Commision,Total Sales,corr
276,0.015,"$10,338,846,967.19",
290,0.016,"$10,014,587,601.88",-1.97626643597302e-07
277,0.015,"$9,598,716,559.95",
275,0.012,"$9,140,132,652.82",1.87932065667969e-07
289,0.02,"$8,708,869,946.65",1.52131404894094e-07


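The <code>corr</code> column above does not yet answer the question. The query groups by salesperson and commission, so the commission is constant inside every group and the coefficient is undefined there; the NULLs and the values around 1e-07 are most likely floating-point noise rather than a real signal. A single coefficient would require summing sales per salesperson first and then applying <code>corr()</code> across all salespeople, with no grouping in the final <code>SELECT</code>.

-------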