# Mid-Course Project Notes

Hi there, and thanks for your help. If you're reading this, you've been selected to help with a secret initiative.

You will be helping us analyze a portion of data from a company we want to acquire, which could greatly improve the fortunes of Maven Mega Mart.

We'll be working with `project_transactions.csv` and briefly take a look at `product.csv`.

First, read in the transactions data and explore it.

* Take a look at the raw data, the datatypes, and cast `DAY`, `QUANTITY`, `STORE_ID`, and `WEEK_NO` columns to the smallest appropriate datatype. Check the memory reduction by doing so.
* Is there any missing data?
* How many unique households and products are there in the data? The fields `household_key` and `PRODUCT_ID` will help here.

In [1]:
import pandas as pd
import numpy as np

In [2]:
transactions = pd.read_csv("../project_data/project_transactions.csv",
                          dtype={"DAY":"Int16",
                                  "QUANTITY":"Int32",
                                  "STORE_ID":"Int32",
                                  "WEEK_NO":"Int8"})
# Requesting any narrower integer dtype at read time raised
# "TypeError: cannot safely cast non-equivalent int64 to int8",
# which suggests at least one value does not fit in the narrower type.

In [3]:
transactions.head()

Unnamed: 0,household_key,BASKET_ID,DAY,PRODUCT_ID,QUANTITY,SALES_VALUE,STORE_ID,RETAIL_DISC,WEEK_NO,COUPON_DISC,COUPON_MATCH_DISC
0,1364,26984896261,1,842930,1,2.19,31742,0.0,1,0.0,0.0
1,1364,26984896261,1,897044,1,2.99,31742,-0.4,1,0.0,0.0
2,1364,26984896261,1,920955,1,3.09,31742,0.0,1,0.0,0.0
3,1364,26984896261,1,937406,1,2.5,31742,-0.99,1,0.0,0.0
4,1364,26984896261,1,981760,1,0.6,31742,-0.79,1,0.0,0.0


In [4]:
transactions.info(memory_usage="deep")

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2146311 entries, 0 to 2146310
Data columns (total 11 columns):
 #   Column             Dtype  
---  ------             -----  
 0   household_key      int64  
 1   BASKET_ID          int64  
 2   DAY                Int16  
 3   PRODUCT_ID         int64  
 4   QUANTITY           Int32  
 5   SALES_VALUE        float64
 6   STORE_ID           Int32  
 7   RETAIL_DISC        float64
 8   WEEK_NO            Int8   
 9   COUPON_DISC        float64
 10  COUPON_MATCH_DISC  float64
dtypes: Int16(1), Int32(2), Int8(1), float64(4), int64(3)
memory usage: 145.3 MB


In [5]:
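transactions = transactions.astype(
    {"DAY": "Int8",
     "QUANTITY": "Int16",
     "STORE_ID": "Int16",
     "WEEK_NO": "Int8"}
)
# Casting after the read raised no error and cut memory from 145.3 MB to 135.1 MB (about 10.2 MB).
# Caution: no error does not mean the cast is safe. Int8 tops out at 127 and DAY wraps around:
# the sample further down shows negative DAY values, so DAY should stay Int16.
# Check each column's min/max against the target dtype before downcasting.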


In [6]:
transactions.info(memory_usage="deep")

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2146311 entries, 0 to 2146310
Data columns (total 11 columns):
 #   Column             Dtype  
---  ------             -----  
 0   household_key      int64  
 1   BASKET_ID          int64  
 2   DAY                Int8   
 3   PRODUCT_ID         int64  
 4   QUANTITY           Int16  
 5   SALES_VALUE        float64
 6   STORE_ID           Int16  
 7   RETAIL_DISC        float64
 8   WEEK_NO            Int8   
 9   COUPON_DISC        float64
 10  COUPON_MATCH_DISC  float64
dtypes: Int16(2), Int8(2), float64(4), int64(3)
memory usage: 135.1 MB
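
One way to guard against a silent overflow when downcasting is to compare each column's minimum and maximum with the range of the candidate dtype before casting. The sketch below is a minimal illustration, not part of the original notebook: `smallest_int_dtype` is a hypothetical helper, and the toy values are invented rather than read from `project_transactions.csv`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the transactions table; values are invented for illustration.
toy = pd.DataFrame({
    "DAY": [1, 200, 711],
    "QUANTITY": [1, 2, 3],
    "WEEK_NO": [1, 29, 102],
})

def smallest_int_dtype(series):
    """Hypothetical helper: narrowest signed nullable integer dtype that holds the series."""
    lo, hi = series.min(), series.max()
    for name in ("Int8", "Int16", "Int32", "Int64"):
        info = np.iinfo(name.lower())  # range of the matching NumPy integer type
        if info.min <= lo and hi <= info.max:
            return name
    raise ValueError("values do not fit in a 64-bit signed integer")

# Pick a dtype per column from the observed range, then cast once.
chosen = {col: smallest_int_dtype(toy[col]) for col in toy.columns}
toy = toy.astype(chosen)
```

On the real data, the same dictionary comprehension could be run over `DAY`, `QUANTITY`, `STORE_ID`, and `WEEK_NO` only, and the resulting mapping passed to `astype`.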


In [7]:
transactions.isna().sum()

household_key        0
BASKET_ID            0
DAY                  0
PRODUCT_ID           0
QUANTITY             0
SALES_VALUE          0
STORE_ID             0
RETAIL_DISC          0
WEEK_NO              0
COUPON_DISC          0
COUPON_MATCH_DISC    0
dtype: int64

In [8]:
transactions["household_key"].nunique()

2099

In [9]:
transactions["PRODUCT_ID"].nunique()

84138

## Column Creation

Create two columns:

* A column that captures the `total_discount` by row (sum of `RETAIL_DISC`, `COUPON_DISC`)
* The percentage discount (`total_discount` / `SALES_VALUE`). Make sure this is positive (try `.abs()`).
* If the percentage discount is greater than 1, set it equal to 1. If it is less than 0, set it to 0. 
* Drop the individual discount columns (`RETAIL_DISC`, `COUPON_DISC`, `COUPON_MATCH_DISC`).

Feel free to overwrite the existing transaction DataFrame after making the modifications above.

In [12]:
# First attempt:
# transactions["total_discount"] = transactions["RETAIL_DISC"] + transactions["COUPON_DISC"]
# transactions["percentage_discount"] = transactions["total_discount"].abs() / transactions["SALES_VALUE"]
# transactions.sample(10)

# After getting hung up on the third item, I looked it up and found a better way
# to do the first two as well (I had forgotten about `.assign`).
transactions = (
    transactions
    .assign(
        total_discount=transactions["RETAIL_DISC"] + transactions["COUPON_DISC"],
        # A lambda is needed here: total_discount does not exist on `transactions` yet, so
        # transactions["total_discount"] / transactions["SALES_VALUE"] raises
        # KeyError: 'total_discount'. (It worked at first, presumably because the column
        # was still there from the first attempt.)
        percentage_discount=lambda x: (x["total_discount"] / x["SALES_VALUE"]).abs(),
    )
    .drop(["RETAIL_DISC", "COUPON_DISC", "COUPON_MATCH_DISC"], axis=1)
)

# I had also forgotten about the `where` method for capping at 1 and flooring at 0.
# Both conditions are evaluated on the uncapped column.
transactions["percentage_discount"] = (
    transactions["percentage_discount"]
    .where(transactions["percentage_discount"] < 1, 1.0)
    .where(transactions["percentage_discount"] > 0, 0)
)

transactions.sample(10)

Unnamed: 0,household_key,BASKET_ID,DAY,PRODUCT_ID,QUANTITY,SALES_VALUE,STORE_ID,WEEK_NO,total_discount,percentage_discount
1424280,1761,34577361269,-13,8203710,1,3.52,427,72,0.0,0.0
1474063,1142,35109131191,1,915284,1,5.49,335,74,0.0,0.0
120937,1585,28141091284,96,985808,1,2.0,406,14,-0.99,0.495
1738965,304,40618820556,81,855544,1,0.89,322,85,0.0,0.0
1049670,72,32715833206,-127,1110695,1,1.67,361,56,-1.02,0.610778
448191,1776,29808136668,-56,972931,1,1.99,324,29,0.0,0.0
1107679,956,32957388478,-109,995965,1,2.0,370,58,-0.99,0.495
980073,1822,32407800985,108,965701,1,1.29,364,53,0.0,0.0
1471308,1074,35080946634,0,9396699,2,5.98,292,74,-2.0,0.334448
2100798,1922,42115045658,-70,889863,1,2.99,334,100,0.0,0.0
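
As an alternative to chaining two `where` calls, the cap at 1 and floor at 0 can be written with `Series.clip`. This is a sketch of a swapped-in technique on invented toy values, not the approach used in the cell above.

```python
import pandas as pd

# Toy values, invented for illustration; not taken from project_transactions.csv.
toy = pd.DataFrame({
    "total_discount": [-0.99, 0.0, -3.0],
    "SALES_VALUE": [2.0, 1.5, 2.0],
})

# Positive discount ratio per row, then bounded to the interval [0, 1].
ratio = (toy["total_discount"] / toy["SALES_VALUE"]).abs()
toy["percentage_discount"] = ratio.clip(lower=0, upper=1)
```

One difference to keep in mind: `clip` leaves missing values as they are, whereas the `where` chain replaces them with 0 because a missing value fails both comparisons.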


## Overall Statistics

Calculate:

* The total sales (sum of `SALES_VALUE`).
* Total discount (sum of `total_discount`).
* Overall percentage discount (sum of `total_discount` / sum of `SALES_VALUE`).
* Total quantity sold (sum of `QUANTITY`).
* Max quantity sold in a single row. Inspect the row as well. Does this have a high discount percentage?
* Total sales value per basket (sum of `SALES_VALUE` / nunique `BASKET_ID`).
* Total sales value per household (sum of `SALES_VALUE` / nunique `household_key`).
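
The calculations listed above could be sketched as follows. The toy DataFrame is invented for illustration and only mirrors the column names of `transactions`; the variable names are assumptions, not part of the original notebook.

```python
import pandas as pd

# Toy rows, invented for illustration; column names mirror the transactions table.
toy = pd.DataFrame({
    "household_key": [1, 1, 2, 2],
    "BASKET_ID": [10, 10, 20, 30],
    "QUANTITY": [1, 2, 1, 6],
    "SALES_VALUE": [2.0, 4.0, 1.0, 3.0],
    "total_discount": [-0.5, 0.0, -0.25, -1.25],
})

total_sales = toy["SALES_VALUE"].sum()
total_discount = toy["total_discount"].sum()

# Discounts are stored as negative numbers, so this ratio is negative as well.
overall_pct_discount = total_discount / total_sales

total_quantity = toy["QUANTITY"].sum()

# Row holding the largest single-row quantity, for inspection.
max_quantity_row = toy.loc[toy["QUANTITY"].idxmax()]

sales_per_basket = total_sales / toy["BASKET_ID"].nunique()
sales_per_household = total_sales / toy["household_key"].nunique()
```

Swapping `toy` for `transactions` should give the real figures, provided the `total_discount` column from the previous section is present.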

## Household Analysis

* Plot the distribution of total sales value purchased at the household level. 
* What were the top 10 households by quantity purchased?
* What were the top 10 households by sales value?
* Plot the total sales value for our top 10 households by value, ordered from highest to lowest.
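
A minimal sketch of the household-level aggregation, assuming a single `groupby` is acceptable for both rankings. The toy data is invented, and `top_n` is set to 2 only so the small example stays readable; the task asks for 10.

```python
import pandas as pd

# Toy rows, invented for illustration; column names mirror the transactions table.
toy = pd.DataFrame({
    "household_key": [1, 1, 2, 3, 3, 3],
    "QUANTITY": [1, 2, 5, 1, 1, 2],
    "SALES_VALUE": [2.0, 4.0, 1.0, 3.0, 3.0, 3.0],
})

# One row per household with total sales value and total quantity.
per_household = toy.groupby("household_key").agg(
    total_sales=("SALES_VALUE", "sum"),
    total_quantity=("QUANTITY", "sum"),
)

top_n = 2  # use 10 on the real data
top_by_quantity = per_household["total_quantity"].nlargest(top_n)
top_by_sales = per_household["total_sales"].nlargest(top_n)  # already sorted high to low
```

With matplotlib installed, `per_household["total_sales"].plot.hist()` would draw the distribution and `top_by_sales.plot.bar()` the ordered top-household chart.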


## Product Analysis

* Which products had the most sales by `SALES_VALUE`? Plot a horizontal bar chart.
* Did the top 10 selling items have a higher than average discount rate?
* What was the most common `PRODUCT_ID` among rows with the households in our top 10 households by sales value?
* Look up the names of the top 10 products by sales in the `products.csv` dataset.
* Look up the product name of the item that had the highest quantity sold in a single row.
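
The product questions above could be approached along these lines. Everything here is a sketch on invented data: the `product_name` column is hypothetical (the real column names in the products file are not shown in these notes), the "discount rate" is taken to mean the mean of the row-level `percentage_discount`, and `top_n` is 2 instead of 10 to suit the toy table.

```python
import pandas as pd

# Toy rows, invented for illustration; column names mirror the transactions table.
toy = pd.DataFrame({
    "household_key": [1, 1, 2, 2, 3],
    "PRODUCT_ID": [100, 200, 100, 300, 100],
    "SALES_VALUE": [5.0, 1.0, 5.0, 2.0, 5.0],
    "percentage_discount": [0.1, 0.0, 0.2, 0.5, 0.3],
})
# Hypothetical product lookup table; `product_name` is an assumed column name.
products = pd.DataFrame({
    "PRODUCT_ID": [100, 200, 300],
    "product_name": ["item A", "item B", "item C"],
})

top_n = 2  # use 10 on the real data

# Products ranked by total sales value.
top_products = toy.groupby("PRODUCT_ID")["SALES_VALUE"].sum().nlargest(top_n)

# Mean row-level discount for the top products versus all rows.
is_top = toy["PRODUCT_ID"].isin(top_products.index)
top_discount = toy.loc[is_top, "percentage_discount"].mean()
overall_discount = toy["percentage_discount"].mean()

# Most common product among rows from the top households by sales value.
top_households = toy.groupby("household_key")["SALES_VALUE"].sum().nlargest(top_n).index
in_top_households = toy["household_key"].isin(top_households)
most_common_product = toy.loc[in_top_households, "PRODUCT_ID"].value_counts().idxmax()

# Look up the names of the top products.
top_names = products.loc[products["PRODUCT_ID"].isin(top_products.index)]
```

With matplotlib installed, `top_products.sort_values().plot.barh()` would give the horizontal bar chart with the best seller at the top.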