# Mid-Course Project

Hi there, and thanks for your help. If you're reading this, you've been selected to help with a secret initiative.

You will be helping us analyze a portion of data from a company we want to acquire, which could greatly improve the fortunes of Maven Mega Mart.

We'll be working with `project_transactions.csv` and briefly take a look at `product.csv`.

First, read in the transactions data and explore it.

* Take a look at the raw data and the datatypes, then cast the `DAY`, `QUANTITY`, `STORE_ID`, and `WEEK_NO` columns to the smallest appropriate datatype. Check how much memory this saves.
* Is there any missing data?
* How many unique households and products are there in the data? The `household_key` and `PRODUCT_ID` fields will help here.
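
The downcasting step can be sketched on a small stand-in table. The values below are made up for illustration and are not taken from `project_transactions.csv`; the sketch also assumes the four columns hold whole numbers, so `pd.to_numeric` with `downcast="integer"` can pick the smallest signed integer type that fits.

```python
import pandas as pd

# Toy stand-in for project_transactions.csv; all values are invented.
toy = pd.DataFrame({
    "household_key": [1, 1, 2, 3],
    "PRODUCT_ID": [1004906, 1033142, 1004906, 8160430],
    "DAY": [1, 1, 2, 711],
    "QUANTITY": [1, 2, 1, 3],
    "STORE_ID": [364, 364, 31742, 412],
    "WEEK_NO": [1, 1, 1, 102],
})

before = toy.memory_usage(deep=True).sum()

# Let pandas choose the smallest signed integer type that holds each column.
for col in ["DAY", "QUANTITY", "STORE_ID", "WEEK_NO"]:
    toy[col] = pd.to_numeric(toy[col], downcast="integer")

after = toy.memory_usage(deep=True).sum()

print(toy.dtypes)
print(f"memory: {before} -> {after} bytes")
```

On the real file, the resulting types depend on the actual value ranges, so it is worth checking `transactions.describe()` before casting.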

In [0]:
import pandas as pd
import numpy as np

In [0]:
transactions = pd.read_csv("/Volumes/dbx_catalog/default/sample_files/project_transactions.csv")

In [0]:
transactions.head()

In [0]:
transactions.info(memory_usage="deep")

In [0]:
transactions.describe().round()

In [0]:
transactions.isna().sum()

In [0]:
transactions['household_key'].nunique()

In [0]:
transactions['PRODUCT_ID'].nunique()

In [0]:
transactions

## Column Creation

Create two columns, then tidy up:

* A column that captures the `total_discount` by row (sum of `RETAIL_DISC` and `COUPON_DISC`).
* The percentage discount (`total_discount` / `SALES_VALUE`). Make sure this is positive (try `.abs()`).
* If the percentage discount is greater than 1, set it equal to 1. If it is less than 0, set it to 0.
* Drop the individual discount columns (`RETAIL_DISC`, `COUPON_DISC`, `COUPON_MATCH_DISC`).

Feel free to overwrite the existing `transactions` DataFrame after making the modifications above.
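
One possible way to chain these steps is sketched below on toy rows. It uses `.clip()` to bound the percentage, which is an alternative to other approaches rather than the required method. The numbers are invented, and storing discounts as negative values is an assumption about the real file.

```python
import pandas as pd

# Toy rows (made-up values). Negative discounts are an assumption.
toy = pd.DataFrame({
    "SALES_VALUE": [2.00, 1.00, 0.00, 4.00],
    "RETAIL_DISC": [-0.50, -1.50, -1.00, 0.00],
    "COUPON_DISC": [0.00, -0.50, 0.00, 0.00],
    "COUPON_MATCH_DISC": [0.00, 0.00, 0.00, 0.00],
})

toy = (
    toy
    .assign(
        # Row-level sum of the two discount columns.
        total_discount=lambda df: df["RETAIL_DISC"] + df["COUPON_DISC"],
        # Positive share of sales value, bounded to [0, 1].
        # A discount on a zero-value sale gives inf, which clip caps at 1.
        percentage_discount=lambda df: (
            (df["total_discount"] / df["SALES_VALUE"])
            .abs()
            .clip(lower=0, upper=1)
        ),
    )
    .drop(columns=["RETAIL_DISC", "COUPON_DISC", "COUPON_MATCH_DISC"])
)

print(toy)
```

Note that a row with zero discount and zero sales value would produce `NaN` (0 / 0), and `.clip()` leaves `NaN` untouched, so such rows may need a separate `.fillna(0)` if they occur.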

In [0]:
transactions = pd.read_csv("/Volumes/dbx_catalog/default/sample_files/project_transactions.csv")
# Total discount per row, then its share of the sales value as a positive number
transactions['total_discount'] = transactions['RETAIL_DISC'] + transactions['COUPON_DISC']
transactions['percentage_discount'] = (transactions['total_discount'] / transactions['SALES_VALUE']).abs()
# The individual discount columns are no longer needed
transactions = transactions.drop(['RETAIL_DISC', 'COUPON_DISC', 'COUPON_MATCH_DISC'], axis=1)
transactions

In [0]:
transactions['percentage_discount'] = (transactions['percentage_discount']
                                       .where(transactions['percentage_discount'] < 1, 1.0)
                                       .where(transactions['percentage_discount'] > 0, 0))

In [0]:
transactions[transactions['PRODUCT_ID']==981760]

## Overall Statistics

Calculate:

* The total sales (sum of `SALES_VALUE`).
* Total discount (sum of `total_discount`).
* Overall percentage discount (sum of `total_discount` / sum of `SALES_VALUE`).
* Total quantity sold (sum of `QUANTITY`).
* Max quantity sold in a single row. Inspect the row as well. Does this have a high discount percentage?
* Total sales value per basket (sum of `SALES_VALUE` / number of unique `BASKET_ID` values).
* Total sales value per household (sum of `SALES_VALUE` / number of unique `household_key` values).
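
When inspecting the row with the maximum quantity, it helps to keep label-based and position-based lookups apart. The sketch below uses invented rows with a non-default index (as you might have after filtering) to show that `idxmax()` pairs with `.loc` and `argmax()` pairs with `.iloc`. It also shows the per-basket ratio on the same toy data.

```python
import pandas as pd

# Toy rows with a non-default index; all values are invented.
toy = pd.DataFrame(
    {
        "QUANTITY": [1, 40, 3],
        "SALES_VALUE": [2.5, 60.0, 4.5],
        "BASKET_ID": [10, 11, 11],
    },
    index=[100, 200, 300],
)

label = toy["QUANTITY"].idxmax()     # index label of the max row
position = toy["QUANTITY"].argmax()  # integer position of the max row

max_row = toy.loc[label]             # label-based lookup
same_row = toy.iloc[position]        # position-based lookup

# Total sales divided by the number of distinct baskets.
sales_per_basket = toy["SALES_VALUE"].sum() / toy["BASKET_ID"].nunique()

print(max_row)
print(sales_per_basket)
```

With a default `RangeIndex` the label and the position coincide, which is why mixing the two can go unnoticed until the index changes.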

In [0]:
transactions['SALES_VALUE'].sum()

In [0]:
transactions['total_discount'].sum()

In [0]:
transactions['total_discount'].sum() / transactions['SALES_VALUE'].sum()

In [0]:
transactions['percentage_discount'].mean()

In [0]:
transactions['QUANTITY'].sum()

In [0]:
transactions['QUANTITY'].max()

In [0]:
# idxmax() returns the index label, which is what .loc expects
transactions.loc[transactions['QUANTITY'].idxmax()]

In [0]:
transactions['SALES_VALUE'].sum() / transactions['BASKET_ID'].nunique()

In [0]:
transactions['SALES_VALUE'].sum() / transactions['household_key'].nunique()

## Household Analysis

* Plot the distribution of total sales value purchased at the household level. 
* What were the top 10 households by quantity purchased?
* What were the top 10 households by sales value?
* Plot the total sales value for our top 10 households by value, ordered from highest to lowest.
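
The household rankings can be sketched with a single `groupby` followed by `nlargest`, shown here on invented rows with a top 2 instead of a top 10. This is one possible shape for the calculation, not the only one.

```python
import pandas as pd

# Toy transactions; all values are invented.
toy = pd.DataFrame({
    "household_key": [1, 1, 2, 3, 3, 3],
    "SALES_VALUE": [5.0, 7.0, 20.0, 1.0, 2.0, 3.0],
    "QUANTITY": [1, 2, 1, 4, 4, 4],
})

# One aggregation pass gives both totals per household.
household_totals = toy.groupby("household_key")[["SALES_VALUE", "QUANTITY"]].sum()

# nlargest returns the rows already sorted from highest to lowest.
top_by_value = household_totals["SALES_VALUE"].nlargest(2)
top_by_quantity = household_totals["QUANTITY"].nlargest(2)

print(top_by_value)
print(top_by_quantity)
```

Because `nlargest` already orders the result from highest to lowest, calling `.plot.bar()` on `top_by_value` would draw the bars in the order the exercise asks for.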


In [0]:
transactions

In [0]:
(transactions
 .groupby('household_key')
 .agg({'SALES_VALUE' : 'sum'})
 .plot.hist())

In [0]:
top10_value = (transactions
               .groupby('household_key')
               .agg({'SALES_VALUE' : 'sum'})
               .sort_values('SALES_VALUE', ascending=False)
               .iloc[:10])

In [0]:
top10_quant = (transactions
               .groupby('household_key')
               .agg({'QUANTITY' : 'sum'})
               .sort_values('QUANTITY', ascending=False)
               .iloc[:10])

In [0]:
top10_value

In [0]:
top10_quant

In [0]:
(transactions
 .groupby('household_key')
 .agg({'SALES_VALUE' : 'sum', 'QUANTITY' : 'sum'})
 .sort_values('SALES_VALUE', ascending=False)
 .loc[:, 'SALES_VALUE']
 .describe())

In [0]:
top10_value['SALES_VALUE'].plot.bar()

## Product Analysis

* Which products had the most sales by `SALES_VALUE`? Plot a horizontal bar chart.
* Did the top 10 selling items have a higher than average discount rate?
* What was the most common `PRODUCT_ID` among rows belonging to our top 10 households by sales value?
* Look up the names of the top 10 products by sales in the `product.csv` dataset.
* Look up the product name of the item that had the highest quantity sold in a single row.
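
The product lookups can also be done with a merge, sketched below on invented tables. The `DESCRIPTION` column is a hypothetical name chosen for illustration; the real `product.csv` may label its name fields differently, so check `products.head()` first.

```python
import pandas as pd

# Toy transactions; all values are invented.
toy_transactions = pd.DataFrame({
    "PRODUCT_ID": [11, 11, 22, 33, 33, 33],
    "SALES_VALUE": [3.0, 4.0, 10.0, 1.0, 1.0, 1.0],
})

# Hypothetical product table; DESCRIPTION is a made-up column name.
toy_products = pd.DataFrame({
    "PRODUCT_ID": [11, 22, 33],
    "DESCRIPTION": ["MILK", "FUEL", "BANANAS"],
})

top_products = (
    toy_transactions
    .groupby("PRODUCT_ID", as_index=False)["SALES_VALUE"].sum()
    .nlargest(2, "SALES_VALUE")                         # top 2 here, top 10 in the project
    .merge(toy_products, on="PRODUCT_ID", how="left")   # attach product names
)

print(top_products)
```

A left merge keeps the rows in sales order and places the totals next to the names, whereas filtering the product table with `query` returns rows in the product table's own order.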

In [0]:
top10_products = (transactions
                  .groupby(['PRODUCT_ID'])
                  .agg({'SALES_VALUE' : 'sum'})
                  .sort_values('SALES_VALUE', ascending=False)
                  .iloc[:10])

In [0]:
top10_products['SALES_VALUE'].sort_values().plot.barh()

In [0]:
# Rows for the top 10 products, filtered once and reused
top10_rows = transactions.query('PRODUCT_ID in @top10_products.index')
discount_ratio = top10_rows['total_discount'].sum() / top10_rows['SALES_VALUE'].sum()

In [0]:
discount_ratio

In [0]:
products = pd.read_csv('/Workspace/Users/ranjeeth.rikkala@ascentt.com/Programming/pandas_course_resources/project_data/product.csv')

In [0]:
products.head()

In [0]:
top_hh_products = (transactions
                   .query('household_key in @top10_value.index')
                   .loc[:, 'PRODUCT_ID']
                   .value_counts()
                   .iloc[:10]
                   .index)

In [0]:
products.query('PRODUCT_ID in @top_hh_products')

In [0]:
products.query('PRODUCT_ID == 6534178')

In [0]:
products.query('PRODUCT_ID in @top10_products.index')