# Final Project

Today we'll be working with the transactions, product, and hh_demographic tables in the project_data folder.

* First, read in the transactions data.

* Read in only the columns `household_key`, `BASKET_ID`, `DAY`, `PRODUCT_ID`, `QUANTITY`, and `SALES_VALUE`.

* Convert `DAY`, `QUANTITY`, and `PRODUCT_ID` to the smallest appropriate integer types.
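
One way to decide what "smallest appropriate" means is to compare each column's value range against the limits of each integer type. The sketch below does this on a toy DataFrame; the values and the helper name `smallest_int_dtype` are made up for illustration, and the real columns may have different ranges.

```python
import numpy as np
import pandas as pd

# Toy stand-in for three transactions columns (values are invented).
toy = pd.DataFrame({
    "DAY": [1, 365, 711],
    "QUANTITY": [1, 3, 40000],
    "PRODUCT_ID": [25671, 1004906, 18000000],
})

def smallest_int_dtype(series):
    """Return the narrowest signed integer dtype that holds every value (hypothetical helper)."""
    for dtype in (np.int8, np.int16, np.int32, np.int64):
        info = np.iinfo(dtype)
        if info.min <= series.min() and series.max() <= info.max:
            return np.dtype(dtype)
    raise ValueError("values do not fit in a 64-bit signed integer")

suggested = {col: smallest_int_dtype(toy[col]) for col in toy.columns}
print(suggested)
```

Running the same check on the full file (after an initial read with default dtypes) would show whether the dtypes passed to `read_csv` are wide enough.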


In [0]:
import pandas as pd
import numpy as np

In [0]:
transactions = pd.read_csv(
    '/Volumes/dbx_catalog/default/sample_files/project_transactions.csv',
    usecols=['household_key', 'BASKET_ID', 'DAY', 'PRODUCT_ID', 'QUANTITY', 'SALES_VALUE'],
    dtype={
        'DAY': 'int16',
        'QUANTITY': 'int32',
        'PRODUCT_ID': 'int32',
    },
)

In [0]:
transactions

In [0]:
# Use the following snippet to create a Date Column.

transactions = (
    transactions
    .assign(date = (pd.to_datetime("2016", format='%Y') 
                    + pd.to_timedelta(transactions["DAY"].sub(1).astype(str) + " days"))
           )
    .drop(["DAY"], axis=1)
)

In [0]:
transactions.head()
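
To see what the date conversion does, here is a small sketch on made-up day numbers. It assumes, as the snippet above does, that day 1 corresponds to 2016-01-01, and it passes the integer offsets to `pd.to_timedelta` with `unit="D"` instead of building strings, which should give the same result.

```python
import pandas as pd

# Hypothetical day numbers; day 1 is assumed to be 2016-01-01.
day = pd.Series([1, 2, 366, 367], name="DAY")

# Subtract 1 so that day 1 maps to an offset of zero days.
date = pd.Timestamp("2016-01-01") + pd.to_timedelta(day - 1, unit="D")
print(date)
```

Because 2016 is a leap year, day 366 is still in 2016 and day 367 is the first day of 2017.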

## TIME BASED ANALYSIS

* Plot the sum of sales by month. Are sales growing over time?
* Next, plot the same series after filtering down to dates between April 2016 and October 2017.
* Then, plot the sum of monthly sales in 2016 vs the monthly sales in 2017.
* Finally, plot total sales by day of week.

In [0]:
(transactions.set_index('date')
.loc[:, 'SALES_VALUE']
.resample('M')
.sum()
.plot())

In [0]:
transactions.set_index('date') \
    .loc['2016-04':'2017-10', 'SALES_VALUE'] \
    .resample('M') \
    .sum() \
    .plot()

In [0]:
(transactions
 .set_index('date')
 .loc[:, ['SALES_VALUE']]
 .resample('M')
 .sum()
 .assign(year_prior=lambda x: x['SALES_VALUE'].shift(12))
 .loc['2017']
 .plot()
)
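
The year-over-year comparison above relies on `shift(12)` lining up each month with the same month one year earlier. A minimal sketch with invented monthly totals (not the project data) makes that alignment visible:

```python
import pandas as pd

# Made-up monthly totals: 24 months, January 2016 to December 2017.
months = pd.period_range("2016-01", periods=24, freq="M")
monthly = pd.DataFrame({"SALES_VALUE": range(100, 124)}, index=months)

# shift(12) moves every value down 12 rows, so each 2017 row also
# carries the total for the same month of 2016.
compare = (
    monthly
    .assign(year_prior=lambda x: x["SALES_VALUE"].shift(12))
    .loc[lambda x: x.index.year == 2017]
)
print(compare.head(3))
```

The first 12 rows have no prior-year value, which is why only the 2017 rows are kept before plotting.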

In [0]:
transactions.groupby(transactions['date'].dt.dayofweek) \
    .agg({'SALES_VALUE': 'sum'}) \
    .plot(kind='bar')
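
In the chart above, the x-axis shows the integers returned by `dt.dayofweek`, where 0 is Monday and 6 is Sunday. If readable labels are preferred, one option is to relabel the index after aggregating. The sketch below uses a toy DataFrame with invented values; the `names` list is an assumption for illustration.

```python
import pandas as pd

# Toy data: 2016-01-04 and 2016-01-11 are Mondays, 2016-01-05 is a Tuesday,
# and 2016-01-10 is a Sunday.
toy = pd.DataFrame({
    "date": pd.to_datetime(["2016-01-04", "2016-01-05", "2016-01-11", "2016-01-10"]),
    "SALES_VALUE": [1.0, 2.0, 3.0, 4.0],
})

names = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

# Group on the integer day of week so the rows stay in calendar order,
# then swap the integers for labels.
by_day = toy.groupby(toy["date"].dt.dayofweek)["SALES_VALUE"].sum()
by_day.index = [names[i] for i in by_day.index]
print(by_day)
```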


# DEMOGRAPHICS

* Read in the `hh_demographic.csv` file, but only the columns `AGE_DESC`, `INCOME_DESC`, `household_key`, and `HH_COMP_DESC`. Convert the appropriate columns to the category dtype.


* Then group the transactions table by `household_key`, and calculate the sum of `SALES_VALUE` by household.


* Once you've done that, join the demographics DataFrame to the aggregated transactions table. Since we're interested in analyzing the demographic data we have, make sure not to include rows from transactions that don't match.


* Plot the sum of sales by `AGE_DESC` and `INCOME_DESC` (in separate charts).


* Then, create a pivot table of the mean household sales by `AGE_DESC` and `HH_COMP_DESC`. Which of our demographics have the highest average sales?
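
The join step asks for matching rows only, which is what an inner join gives. As a sketch of how one might check what an inner join drops, the toy example below (invented keys and values, hypothetical names `toy_sales` and `toy_demo`) compares it with an outer join that uses `indicator=True`:

```python
import pandas as pd

# Invented household totals and demographics; keys 2 and 4 have no match.
toy_sales = pd.DataFrame({"household_key": [1, 2, 3], "SALES_VALUE": [10.0, 20.0, 30.0]})
toy_demo = pd.DataFrame({"household_key": [1, 3, 4], "AGE_DESC": ["25-34", "45-54", "65+"]})

# Inner join: only households present in both tables survive.
inner = toy_sales.merge(toy_demo, how="inner", on="household_key")

# Outer join with an indicator column, used here only to audit the matches.
audit = toy_sales.merge(toy_demo, how="outer", on="household_key", indicator=True)
print(audit["_merge"].value_counts())
```

The `_merge` column reports `left_only`, `right_only`, or `both` for each row, so the counts show how many households fall out of the inner join on each side.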


In [0]:
dem_cols = ['AGE_DESC', 'INCOME_DESC', 'household_key', 'HH_COMP_DESC']

dem_dtypes = {'AGE_DESC': 'category', 'INCOME_DESC': 'category', 'HH_COMP_DESC': 'category'}

demographics = pd.read_csv('/Volumes/dbx_catalog/default/sample_files/hh_demographic.csv', usecols=dem_cols, dtype=dem_dtypes)
demographics

In [0]:
household_sales = (transactions.groupby('household_key').agg({'SALES_VALUE':'sum'}))
household_sales

In [0]:
household_sales_demo = household_sales.merge(demographics,
                                             how='inner',
                                             on='household_key')
household_sales_demo

In [0]:
household_sales_demo.info(memory_usage='deep')

In [0]:
household_sales_demo.groupby('AGE_DESC').agg({'SALES_VALUE':'sum'}).plot.bar()

In [0]:
household_sales_demo.groupby('INCOME_DESC').agg({'SALES_VALUE':'sum'}).sort_values('SALES_VALUE', ascending=False).plot.bar()

In [0]:
(household_sales_demo.pivot_table(index='AGE_DESC',
                                 columns='HH_COMP_DESC',
                                 values='SALES_VALUE',
                                 aggfunc='mean',
                                 margins=True)
.style.background_gradient(cmap='RdYlGn', axis=None)                                 
)
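
To make the `margins=True` output easier to read, here is a small sketch on invented data. The group labels are placeholders, not necessarily the labels in the real file. The `All` row and column hold the mean over every household in that slice, not the mean of the cell means.

```python
import pandas as pd

# Invented household totals for two age groups and two household types.
toy = pd.DataFrame({
    "AGE_DESC": ["19-24", "19-24", "25-34", "25-34"],
    "HH_COMP_DESC": ["Single Female", "2 Adults Kids", "Single Female", "2 Adults Kids"],
    "SALES_VALUE": [100.0, 300.0, 200.0, 600.0],
})

pivot = toy.pivot_table(index="AGE_DESC",
                        columns="HH_COMP_DESC",
                        values="SALES_VALUE",
                        aggfunc="mean",
                        margins=True)

# Drop the margins, then find the (HH_COMP_DESC, AGE_DESC) pair with the highest mean.
cells = pivot.drop(index="All", columns="All")
best = cells.unstack().idxmax()
print(pivot)
print(best)
```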

In [0]:
del household_sales_demo, household_sales

# PRODUCT DEMOGRAPHICS

* Read in the product csv file.

* Only read in `PRODUCT_ID` and `DEPARTMENT` from product (consider converting columns).

* Join the product DataFrame to the transactions and demographics tables, using an inner join for both joins.

* Finally, pivot the fully joined DataFrame by `AGE_DESC` and `DEPARTMENT`, calculating the sum of sales. Which department does our youngest demographic perform well in?
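
The steps above can be sketched end to end on toy tables. All values, and the `toy_` names, are invented for illustration. The point is only to show how two chained inner joins feed a sum pivot.

```python
import pandas as pd

# Invented tables; household 3 has no demographics row.
toy_trans = pd.DataFrame({
    "household_key": [1, 1, 2, 3],
    "PRODUCT_ID": [10, 11, 10, 12],
    "SALES_VALUE": [5.0, 2.0, 4.0, 9.0],
})
toy_demo = pd.DataFrame({"household_key": [1, 2], "AGE_DESC": ["19-24", "25-34"]})
toy_product = pd.DataFrame({
    "PRODUCT_ID": [10, 11, 12],
    "DEPARTMENT": ["GROCERY", "DRUG GM", "PRODUCE"],
})

# Two inner joins: rows without a matching household or product are dropped.
joined = (toy_trans
          .merge(toy_demo, how="inner", on="household_key")
          .merge(toy_product, how="inner", on="PRODUCT_ID"))

pivot = joined.pivot_table(index="DEPARTMENT",
                           columns="AGE_DESC",
                           values="SALES_VALUE",
                           aggfunc="sum")

# Department with the highest total for the youngest group in the toy data.
top = pivot["19-24"].idxmax()
print(pivot)
print(top)
```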



In [0]:
product_cols = ['PRODUCT_ID', 'DEPARTMENT']
product_dtypes = {'PRODUCT_ID': 'int32', 'DEPARTMENT':'category'}
product = pd.read_csv('/Volumes/dbx_catalog/default/sample_files/product.csv',
                      usecols=product_cols, 
                      dtype=product_dtypes)

In [0]:
product.dtypes

In [0]:
trans_demo_dept = (transactions.merge(demographics, 
                                      how='inner',
                                      left_on='household_key',
                                      right_on='household_key',)
                                .merge(product,
                                       how='inner',
                                       left_on='PRODUCT_ID',
                                       right_on='PRODUCT_ID',))

In [0]:
trans_demo_dept

In [0]:
trans_demo_dept.info(memory_usage='deep')

In [0]:
# Build the pivot once and apply a row-wise color gradient for display.
(trans_demo_dept
 .pivot_table(index='DEPARTMENT',
              columns='AGE_DESC',
              values='SALES_VALUE',
              aggfunc='sum')
 .style.background_gradient(cmap='RdYlGn', axis=1)
)

# EXPORT

Finally, export your pivot table to an Excel file. Make sure to provide a sheet name.

In [0]:
# Style the pivot table, then write it to an Excel file with a named sheet.
(trans_demo_dept
 .pivot_table(index='DEPARTMENT',
              columns='AGE_DESC',
              values='SALES_VALUE',
              aggfunc='sum')
 .style.background_gradient(cmap='RdYlGn', axis=1)
 .to_excel('demographic_category_sales.xlsx', sheet_name='sales_pivot')
)

In [0]:
trans_demo_dept