# ADA Project : Dunnhumby dataset, Tell me what you buy and I will tell you who you are



## Abstract
We would like to analyse the Dunnhumby dataset. Living in an age where every piece of our data is stored and analysed, and being active consumers ourselves, we would like to see what information retail chains can gather and infer about us knowing only our shopping habits. Since two years of transactions for several households are provided together with their basic demographic profiles, we want to see whether there are links and correlations between specific demographics (e.g. marital status, income, number of children) and purchase patterns. Furthermore, if time permits, we want to see whether we can build a model that predicts a consumer's demographic profile from their shopping. In this way, we would like to see how "easy" and how precise it actually is for retailers to infer who their customers are from what they buy and to target them with specific marketing. Basically, we want to know how much of a target we actually are.

**Research questions:** 
- What are the main shopping trends that we can identify in this data ?
- Can we relate shopping trends to specific demographic parameters ?
- Can we predict some of these demographic parameters (age, marital status, etc.) from the household's shopping habits ?
- Conversely, can we predict a household's consumption behaviour from its characteristics ?
- What accuracy in consumption prediction can the retailer obtain from simple profile information ?

## Task 1: Clean up the data and prepare the sets we want to keep

In [None]:
%matplotlib inline
import pandas as pd

import matplotlib.pyplot as plt

import os

In [None]:
os.getcwd()

In [None]:
'''As we said in the description of our project, we are going to concentrate on 3 of the 8 tables :
- hh_demographic.csv
- transaction_data.csv
- product.csv
In this first step, we want to load the data, and prepare it for the analysis'''

#load the data
hh_demographic = pd.read_csv('../data/dunnhumby_complete_csv/hh_demographic.csv', sep = ',')

transaction_data = pd.read_csv('../data/dunnhumby_complete_csv/transaction_data.csv', sep = ',')

product = pd.read_csv('../data/dunnhumby_complete_csv/product.csv', sep = ',')
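
Before looking at the individual tables, a quick per-column overview can help with the "prepare" part of this step. The sketch below is only an illustration: the helper name `overview` and the toy frame are assumptions, not part of the original notebook, and the same call could be applied to `hh_demographic`, `transaction_data` or `product` once they are loaded.

```python
import pandas as pd

def overview(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column: dtype, number of missing values, number of distinct values."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_missing": df.isna().sum(),
        "n_unique": df.nunique(),  # NaN is not counted as a distinct value
    })

# Toy frame (made-up values) standing in for one of the loaded tables.
toy = pd.DataFrame({
    "household_key": [1, 2, 2],
    "SALES_VALUE": [1.5, None, 3.0],
})

summary = overview(toy)
print(summary)
```

A column with many missing values or a single distinct value is usually a candidate for cleaning or dropping.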

### Task 1.A: What's actually in the dataset ? 
This dataset contains household-level transactions over two years from a group of 2,500 households who are frequent shoppers at a retailer. It contains all of each household's purchases, not just those from a limited number of categories. For certain households, demographic information as well as direct marketing contact history are included. We have a look at a few samples from each table: 

#### A. Transaction data: 
Dataset of all products purchased by households during the study. Each line in the table corresponds to what would essentially be one line on a store receipt. The attributes of the dataset are the following: 

- HOUSEHOLD_KEY: identifies each household, **unique**
- BASKET_ID: identifies a purchase occasion, **unique**
- DAY: day when the transaction occurred
- PRODUCT_ID: identifies each product, **unique**
- QUANTITY: Number of products purchased during trip
- SALES_VALUE: Amount of dollars the retailer receives from the sale
- STORE_ID: identifies store, **unique**
- COUPON_MATCH_DISC: discount applied due to retailer's match of manufacturer coupon
- COUPON_DISC: discount applied due to manufacturer coupon
- RETAIL_DISC: discount applied due to retailer's loyalty card program
- TRANS_TIME: time of day when the transaction occurred
- WEEK_NO: week of the transaction. Ranges from 1-102. 

In [None]:
transaction_data.head(4)

**Q1: How many transactions occurred during the two years ?**

In [None]:
print("In total there were %d transactions during the two years." % transaction_data.count()['household_key'])

**Q2: How many purchase occasions occurred during the two years ?** <br>
Note: a transaction here is not what we usually think of. Each row is like a single line on a receipt, so the number of purchase occasions is not the number of rows but the number of unique basket_id values. 

In [None]:
print("In total there were %d purchase occasions during the two years." %len(transaction_data['BASKET_ID'].unique()))
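
To make the difference between receipt lines and purchase occasions concrete, here is a minimal sketch on made-up data. The toy values are assumptions chosen for illustration; only the column names come from the table description above.

```python
import pandas as pd

# Toy data (made up): 5 receipt lines spread over 3 baskets.
toy_transactions = pd.DataFrame({
    "household_key": [1, 1, 1, 2, 2],
    "BASKET_ID": [100, 100, 101, 200, 200],
    "PRODUCT_ID": [11, 12, 11, 13, 14],
})

n_lines = len(toy_transactions)                         # one row per product line
n_baskets = toy_transactions["BASKET_ID"].nunique()     # one per purchase occasion
lines_per_basket = toy_transactions.groupby("BASKET_ID").size()

print("receipt lines:", n_lines)
print("purchase occasions:", n_baskets)
print(lines_per_basket)
```

`nunique()` gives the same count as `len(...unique())` used above; it is just a little shorter.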

#### B. Demographic data: 
Demographic information for a portion of the households. It covers only 801 of the 2,500 households; the rest could not be acquired. The attributes of the dataset are the following: 
 
- HOUSEHOLD_KEY : identifies each household, **unique**
- AGE_DESC: estimated age range
- MARITAL_STATUS_CODE: A (Married), B (Single), C (Unknown)
- INCOME_DESC : Household income
- HOMEOWNER_DESC: Homeowner, renter, etc
- HH_COMP_DESC: Household composition
- HOUSEHOLD_SIZE_DESC: Size of household up to 5+ 
- KID_CATEGORY_DESC: Number of children present up to 3+ 

In [None]:
hh_demographic.head(4)
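
Since only part of the households have a demographic profile, it can be useful to check the coverage directly rather than rely on the documented figure. The following is a sketch on toy data: the frames and values are made up, and it assumes both tables share a `household_key` column, as the code in this notebook suggests.

```python
import pandas as pd

# Toy stand-ins (made-up values) for transaction_data and hh_demographic.
toy_transactions = pd.DataFrame({"household_key": [1, 1, 2, 3, 3, 4]})
toy_demographic = pd.DataFrame({
    "household_key": [1, 3],
    "AGE_DESC": ["group_a", "group_b"],  # placeholder labels, not real categories
})

all_households = set(toy_transactions["household_key"])
profiled_households = set(toy_demographic["household_key"])

covered = all_households & profiled_households
coverage = len(covered) / len(all_households)

print("%d of %d households have a demographic profile (%.0f%%)"
      % (len(covered), len(all_households), 100 * coverage))
```

On the real tables, the same three lines would show whether the shopping households and the profiled households actually overlap as expected.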

**Q3: How many age categories are there ? And what are they ?**


In [None]:
print("In total there are %d age categories" %len(hh_demographic['AGE_DESC'].unique()))
print("The different categories are:", hh_demographic['AGE_DESC'].unique())

**Q4: How many income categories are there ? And what are they ?**

In [None]:
print("In total there are %d income categories" %len(hh_demographic['INCOME_DESC'].unique()))
print("The different categories are:", hh_demographic['INCOME_DESC'].unique())

**Q5: How many homeowner categories are there ? And what are they ?**

In [None]:
print("In total there are %d homeowner categories" %len(hh_demographic['HOMEOWNER_DESC'].unique()))
print("The different categories are:", hh_demographic['HOMEOWNER_DESC'].unique())

**Q7: How many household composition categories are there ? And what are they ?**

In [None]:
print("In total there are %d household composition categories" %len(hh_demographic['HH_COMP_DESC'].unique()))
print("The different categories are:", hh_demographic['HH_COMP_DESC'].unique())

**Q8: How many household size categories are there ? And what are they ?**

In [None]:
print("In total there are %d household size categories" %len(hh_demographic['HOUSEHOLD_SIZE_DESC'].unique()))
print("The different categories are:", hh_demographic['HOUSEHOLD_SIZE_DESC'].unique())

**Q9: How many kid categories are there ? And what are they ?**

In [None]:
print("In total there are %d kid categories" %len(hh_demographic['KID_CATEGORY_DESC'].unique()))
print("The different categories are:", hh_demographic['KID_CATEGORY_DESC'].unique())
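
The questions on the demographic categories all follow the same pattern, so they could also be answered in a single loop. This is only a sketch: the helper name `describe_categories` and the toy labels are hypothetical, and on the real table one would pass `hh_demographic` with its `*_DESC` columns.

```python
import pandas as pd

def describe_categories(df, columns):
    """Map each column name to the sorted list of its distinct non-null values."""
    return {col: sorted(df[col].dropna().unique()) for col in columns}

# Toy frame with placeholder labels (not the real category names).
toy_demographic = pd.DataFrame({
    "AGE_DESC": ["a1", "a2", "a1"],
    "INCOME_DESC": ["i1", "i1", "i2"],
})

categories = describe_categories(toy_demographic, ["AGE_DESC", "INCOME_DESC"])
for col, values in categories.items():
    print("%s: %d categories -> %s" % (col, len(values), values))
```

Besides being shorter, a loop avoids copy-paste slips in the printed labels.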

#### C. Product data: 
Information on each product sold such as type of product, national or private label and a brand identifier. The attributes of the dataset are the following: 
- PRODUCT_ID: **unique**, identifies product
- DEPARTMENT: groups similar products together
- COMMODITY_DESC: groups similar products together at a lower level
- SUB_COMMODITY_DESC: groups similar products together at the lowest level. 
- MANUFACTURER: code that links products with the same manufacturer together 
- BRAND: indicates private or national label brand
- CURR_SIZE_OF_PRODUCT: indicates package size (not available for all) 

Let's have a look: 

In [None]:
product.head(4)

**Q10: How many products are there ?**

In [None]:
print("In total there are %d products" %product.count()['PRODUCT_ID'])

**Q11: How many department categories are there ? And what are they ?**

In [None]:
print("In total there are %d department categories" %len(product['DEPARTMENT'].unique()))
print("The different categories are:", product['DEPARTMENT'].unique())

**Q12: Are all products in the product dataset represented in transactions ?** <br>
There are 92,353 products. As for the households, we can investigate whether all the products are represented in the *transaction_data* table.

In [None]:
print("There are %d products in the transactions table" %len(transaction_data['PRODUCT_ID'].unique()))

There are 92,339 products represented in the *transaction_data* table, meaning that only **14** are not represented. An inner join between the two tables therefore seems reasonable: it would simply drop those 14 products. 
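
To see which products never appear in a transaction, and what an inner join would do with them, one option is a left merge with `indicator=True`. The sketch below uses toy data with made-up IDs and placeholder department names; it is one possible way to do this, not necessarily the one we will keep.

```python
import pandas as pd

# Toy stand-ins (made-up values) for the product and transaction_data tables.
toy_product = pd.DataFrame({
    "PRODUCT_ID": [1, 2, 3, 4],
    "DEPARTMENT": ["dept_a", "dept_a", "dept_b", "dept_c"],  # placeholder names
})
toy_transactions = pd.DataFrame({
    "PRODUCT_ID": [1, 1, 3, 3, 3],
    "QUANTITY": [1, 2, 1, 1, 4],
})

# Anti-join: flag each product as present in the transactions or not.
sold_ids = toy_transactions[["PRODUCT_ID"]].drop_duplicates()
flagged = toy_product.merge(sold_ids, on="PRODUCT_ID", how="left", indicator=True)
never_sold = flagged.loc[flagged["_merge"] == "left_only", "PRODUCT_ID"].tolist()
print("products without any transaction:", never_sold)

# Inner join: every transaction row is kept, unsold products simply disappear.
joined = toy_transactions.merge(toy_product, on="PRODUCT_ID", how="inner")
print(joined)
```

Because every transaction row has a matching product here, the inner join keeps all transaction rows and only the unsold products are lost.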

### Task 1.B: Simple plots

In [None]:
hh_demographic.groupby('AGE_DESC').count()

In [None]:
hh_demographic['AGE_DESC'].value_counts().plot(kind='bar')

In [None]:
hh_demographic['MARITAL_STATUS_CODE'].value_counts().plot(kind='bar')

In [None]:
hh_demographic['INCOME_DESC'].value_counts().plot(kind='bar')

PS:
- we should continue to make some plots
- we should order the categories when it makes sense, so that the plots are more meaningful
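
One way to order the categories is to turn the column into an ordered categorical before counting. The sketch below uses hypothetical income labels, not the real ones from the dataset; the actual order list would have to be written from the output of the category questions above.

```python
import pandas as pd

# Hypothetical labels for illustration only, NOT the dataset's real categories.
toy = pd.DataFrame({
    "INCOME_DESC": ["50-74K", "Under 15K", "100K+", "50-74K",
                    "15-49K", "Under 15K", "50-74K"],
})
income_order = ["Under 15K", "15-49K", "50-74K", "100K+"]

# An ordered categorical makes sort_index() follow the business order
# instead of the alphabetical or frequency order.
toy["INCOME_DESC"] = pd.Categorical(
    toy["INCOME_DESC"], categories=income_order, ordered=True
)
counts = toy["INCOME_DESC"].value_counts().sort_index()
print(counts)
```

Calling `counts.plot(kind='bar')` afterwards would then draw the bars from the lowest to the highest bracket.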