<a href="1. FIPS Code and Population Data.ipynb">&lt;- Back to previous notebook</a>

# Step 2: Sourcing sales data.

In this section we'll generate some fake sales data. Normally you would get this from an enterprise sales system, from partners if you sell through resellers, and so on. Let's say we want each record to have the amount of the sale, the date/time the purchase was made, the location (county/FIPS), and a unique transaction ID.

<img src="images/sample-sales.png">

### Data quality concern: data in context

Immediately, though, we have some questions:
- Amount: is that in US dollars, or in local currency if sold outside the US? If we have to convert it, which conversion rate do we use: today's, or the one at the time of purchase? When rounding, do we round up or truncate? Accuracy is critical, especially when there's money involved.
- Date/time: is that the date/time of the purchase in local time? If so, which timezone was the purchase made in, and was Daylight Saving Time in effect?
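
To make these questions concrete, here is a minimal sketch using only the Python standard library. The amount, the UTC-5 offset, and the purchase time are made-up illustration values, not part of the sales data set. It shows that the rounding rule changes the stored amount, and that a late-evening local purchase can land on a different calendar date once expressed in UTC.

```python
from datetime import datetime, timedelta, timezone
from decimal import Decimal, ROUND_DOWN, ROUND_HALF_UP

CENT = Decimal("0.01")

# Hypothetical converted amount with a fraction of a cent left over
raw_amount = Decimal("10.125")
rounded = raw_amount.quantize(CENT, rounding=ROUND_HALF_UP)   # nearest cent, ties go up
truncated = raw_amount.quantize(CENT, rounding=ROUND_DOWN)    # extra digits dropped

# Hypothetical purchase at 21:30 local time in a zone five hours behind UTC
local_zone = timezone(timedelta(hours=-5))
local_purchase = datetime(2020, 3, 1, 21, 30, tzinfo=local_zone)
utc_purchase = local_purchase.astimezone(timezone.utc)

print(rounded, truncated)
print(local_purchase.isoformat(), "->", utc_purchase.isoformat())
```

The two rounding rules disagree by a cent on a single sale, and the same purchase falls on March 1 locally but March 2 in UTC, which is why the definitions need to be written down.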

### The Data Catalog

In the examples above, it's really important for a Subject Matter Expert (SME) who is familiar with the sales data to clearly define each of these fields and what they represent. This can be recorded in a data catalog entry for the data source. For example, the catalog entry for this table might look like this:

| Column | Type | Description |
| :-- | :-- | :---- |
| amount | decimal(8, 2) | Amount in USD, rounded to the nearest cent.  Conversion from non-USD is done at the time of transaction with the conversion rate at midnight UTC of the date of purchase. |
| trans_time | <a href="https://www.postgresql.org/docs/current/datatype-datetime.html">timestamp</a> | Date/time of purchase in UTC |
| id | <a href="https://www.postgresql.org/docs/current/datatype-uuid.html">uuid</a> | A globally unique identifier (GUID) for the transaction. |
| fips | varchar(5) | The FIPS code (county/state) where the purchase was made |

A data catalog entry might also contain information about data stewards or subject matter experts, the lineage of the table (e.g. joins with other tables), sample data, and more.
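
One way to put a catalog entry to work is to check incoming rows against it. The sketch below is an assumption about how that could look, not part of this notebook's pipeline: `catalog_violations` is a hypothetical helper, and it only covers the `amount`, `id`, and `fips` columns described in the table above.

```python
import uuid
from decimal import Decimal

def catalog_violations(row):
    """Return a list describing how `row` (a dict) breaks the catalog entry."""
    problems = []

    # decimal(8, 2): at most 2 digits after the point and 6 before it
    amount = row["amount"]
    if (not isinstance(amount, Decimal)
            or not amount.is_finite()
            or amount.as_tuple().exponent < -2
            or abs(amount) >= Decimal("1000000")):
        problems.append("amount does not fit decimal(8, 2)")

    # uuid: must parse as a UUID
    try:
        uuid.UUID(str(row["id"]))
    except ValueError:
        problems.append("id is not a valid UUID")

    # varchar(5): FIPS codes are five digits, stored as text to keep leading zeros
    fips = row["fips"]
    if not (isinstance(fips, str) and len(fips) == 5 and fips.isdigit()):
        problems.append("fips is not a 5-digit code")

    return problems

good_row = {"amount": Decimal("19.99"), "id": uuid.uuid4(), "fips": "06037"}
bad_row = {"amount": Decimal("10.125"), "id": "not-a-uuid", "fips": "6037"}
print(catalog_violations(good_row))
print(catalog_violations(bad_row))
```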


## 2.1 Generating the sales data

We'll use <a href="https://faker.readthedocs.io/en/master/">faker</a>, an excellent Python library, to generate a bunch of fake sales information. To make the example more interesting, though, we'll weight our fake "purchases" toward our top 100 sales regions (the 100 most populous FIPS codes), similar to what would likely happen in real life. So, 50% of the time, we'll pick at random one of the top 100 FIPS codes from the data we inserted in the last notebook. The other 50% of the time we'll pick a random record from the whole fips table.
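
The weighting scheme can be sketched on its own before any database is involved. The lists below are small made-up samples standing in for the real `top_fips` and `all_fips` lists, and `pick_fips` is a hypothetical helper name.

```python
import random

def pick_fips(top_fips, all_fips, rng, top_share=0.5):
    """Pick from the top list with probability `top_share`, else from the full list."""
    if rng.random() < top_share:
        return rng.choice(top_fips)
    return rng.choice(all_fips)

rng = random.Random(42)  # seeded only so the sketch is repeatable
top_fips = ["06037", "17031", "48201"]                 # sample stand-ins
all_fips = top_fips + ["01001", "01003", "56045"]      # the full list includes the top codes

picks = [pick_fips(top_fips, all_fips, rng) for _ in range(10_000)]
top_fraction = sum(p in top_fips for p in picks) / len(picks)
print(f"{top_fraction:.2%} of picks landed in the top list")
```

Because the top codes also appear in the full list, their effective share is somewhat higher than 50%. With these tiny sample lists the effect is exaggerated; with 100 top codes out of a few thousand counties it is small.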


## 2.2 Read FIPS codes into lists for easy/fast retrieval

### Data quality concern: the FIPS table has state and county data intermingled

We don't want the state-level data for this next section, so we'll filter out anything that ends in '000' except for Washington DC.

Another way we could have handled this would have been to delete that data when we imported it, but it's better this way. If anyone else in our organization ever wants to reuse this data set, the entire set of data is there for them to use. We can note in the data catalog entry for this data set, if we have one, that it contains both state and county records.
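
As a rough illustration of the filter, here is the same rule applied to a tiny hand-built pandas DataFrame. The rows are sample values chosen for the sketch; only the column names `fipstxt` and `state` come from the fips table.

```python
import pandas as pd

fips_sample = pd.DataFrame({
    "fipstxt": ["06000", "06037", "48000", "48201", "11000", "11001"],
    "state":   ["CA",    "CA",    "TX",    "TX",    "DC",    "DC"],
})

# State-level rows end in '000'; Washington DC is exempt from the filter
is_state_level = fips_sample["fipstxt"].str.endswith("000") & (fips_sample["state"] != "DC")
counties = fips_sample[~is_state_level]
print(counties)
```

The CA and TX state-level rows are dropped, while both DC rows survive, matching the `NOT(fipstxt LIKE '%000' AND state <> 'DC')` condition used in the query.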

In [None]:
# First, let's read the top 100 FIPS codes into a list in memory.  Pandas makes this extremely easy:
from my_connect import my_connect
import pandas

connection = my_connect()

# In this case, we only want counties; the WHERE clause filters out state-level data appropriately
q = """
SELECT fipstxt FROM fips 
WHERE NOT(fipstxt LIKE '%000' AND state <> 'DC')
ORDER BY pop_estimate_2019 DESC LIMIT 100
"""
df = pandas.read_sql_query(q, connection)
top_fips = df['fipstxt'].tolist()

# Now we'll get all county-level FIPS codes (same filter, no limit).
q = "SELECT fipstxt FROM fips WHERE NOT(fipstxt LIKE '%000' AND state <> 'DC')"
df = pandas.read_sql_query(q, connection)
all_fips = df['fipstxt'].tolist()

In [None]:
!pip install faker

## 2.3 Create the 'sales' table

In [None]:
connection = my_connect()
cursor = connection.cursor()

q = """
CREATE TABLE IF NOT EXISTS sales (
    id UUID PRIMARY KEY,
    trans_time TIMESTAMP,
    amount DECIMAL(8, 2),
    fips VARCHAR(5)
    )
"""

cursor.execute(q)
connection.commit()

## 2.4 Quick helper function for inserting each sales row

In [None]:
def insert_sale(connection, id, trans_time, amount, fips):
    # Assumes my_connect() returns a psycopg2 connection.
    # The values are passed as query parameters so the driver handles quoting.
    cursor = connection.cursor()
    q = "INSERT INTO sales (id, trans_time, amount, fips) VALUES (%s, %s, %s, %s);"
    # The UUID is sent as a string; PostgreSQL casts it to the uuid column type
    cursor.execute(q, (str(id), trans_time, amount, fips))
    connection.commit()


## 2.5 Generate fake sales data and insert it into the database

In [None]:
import random
from faker import Faker
import uuid
import psycopg2.sql as sql

connection = my_connect()
cursor = connection.cursor()
random.seed()
fake = Faker()

# Zero out the table before starting
cursor.execute("DELETE FROM sales;")
connection.commit()

TOTAL_RECORDS = 50000

for i in range(TOTAL_RECORDS):
    id = uuid.uuid4()
    trans_time = fake.date_time_between(start_date='-1y', end_date='-1d')
    amount = fake.pyfloat(left_digits=4, right_digits=2, positive=True, min_value=10, max_value=1500)

    # 50% chance of picking a FIPS from the top FIPS to help skew our fake sales into heavily populated regions
    # (compare strings with ==, not "is", which tests object identity)
    if random.choice(["Top", "Random"]) == "Top":
        fips = random.choice(top_fips)
    else:
        fips = random.choice(all_fips)
        
    insert_sale(connection, id, trans_time, amount, fips)
    
    # Print a status message every 5000 rows
    if (i + 1) % 5000 == 0:
        print("%s records inserted" % (i + 1))
        
print("Done")
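
Committing after every row is simple but slow for tens of thousands of records. A common alternative is to collect the rows and insert them in one transaction. The sketch below shows the idea with Python's built-in `sqlite3` module and an in-memory database as a stand-in, so it runs without a PostgreSQL server. The table layout mirrors `sales`, but the column types and the `?` placeholders are SQLite's, not PostgreSQL's, and the rows are made-up values.

```python
import sqlite3
import uuid
from datetime import datetime

conn = sqlite3.connect(":memory:")  # stand-in for the real database connection
conn.execute(
    "CREATE TABLE sales (id TEXT PRIMARY KEY, trans_time TEXT, amount TEXT, fips TEXT)"
)

# Build the rows first...
rows = [
    (str(uuid.uuid4()), datetime(2020, 1, 1, 12, 0, n).isoformat(), "19.99", "06037")
    for n in range(5)
]

# ...then insert them all inside a single transaction.
# Using the connection as a context manager commits on success and rolls back on error.
with conn:
    conn.executemany(
        "INSERT INTO sales (id, trans_time, amount, fips) VALUES (?, ?, ?, ?)", rows
    )

row_count = conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0]
print(row_count, "rows inserted")
```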

## 2.6 Aggregation: show total sales for the top 10 states

Now that we have some sales data in there, we can start to get a little value out of it.  Let's say we want to aggregate the data by state and see which states have the highest sales.  Here's an example query:

```
SELECT SUM(sales.amount) AS total, fips.state AS state FROM sales
INNER JOIN fips ON sales.fips = fips.fipstxt
GROUP BY (fips.state)
ORDER BY total DESC LIMIT 10;
```

Here's an example of what the result will look like:
```
        total state
0  3312283.72    TX
1  3173485.84    CA
2  2113674.22    NY
3  2017619.26    FL
4  1627246.43    GA
5  1380399.77    IL
6  1106723.17    OH
7  1036475.05    MA
8  1023290.36    MO
9  1003630.76    MI
```

Here's the query run against our database:

In [None]:
import pandas
connection = my_connect()
q = """
SELECT SUM(sales.amount) AS total, fips.state AS state FROM sales
INNER JOIN fips ON sales.fips = fips.fipstxt
GROUP BY (fips.state)
ORDER BY total DESC LIMIT 10;
"""
df = pandas.read_sql_query(q, connection)
print(df.head(10))
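
For comparison, the same join-and-aggregate logic can be sketched in pandas alone. The two DataFrames below are tiny made-up samples standing in for the `sales` and `fips` tables, and the `99999` code is deliberately invalid to show what the inner join does with unmatched rows.

```python
import pandas as pd

sales_sample = pd.DataFrame({
    "fips":   ["06037", "06037", "48201", "36061", "99999"],
    "amount": [100.0,   50.0,    200.0,   25.0,    10.0],
})
fips_sample = pd.DataFrame({
    "fipstxt": ["06037", "48201", "36061"],
    "state":   ["CA",    "TX",    "NY"],
})

totals = (
    sales_sample
    .merge(fips_sample, left_on="fips", right_on="fipstxt", how="inner")  # INNER JOIN
    .groupby("state", as_index=False)["amount"].sum()                     # GROUP BY + SUM
    .rename(columns={"amount": "total"})
    .sort_values("total", ascending=False)                                # ORDER BY total DESC
    .head(10)                                                             # LIMIT 10
)
print(totals)
```

The sale with the unknown FIPS code has no match in the lookup table, so the inner join drops it from the totals. The SQL query behaves the same way, which is worth keeping in mind if state totals ever fail to add up to overall sales.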

# Next notebook: adding salespeople

Next we will add in salesperson information so we can see who the top salespeople are.

<a href="3. Generate Salespeople.ipynb">Go to the next notebook -&gt;</a>


*Contents © Copyright 2020 HP Development Company, L.P. SPDX-License-Identifier: MIT*