# ADA Project : Dunnhumby dataset, Tell me what you buy and I will tell you who you are



## Abstract
We would like to analyse the Dunnhumby dataset. Living in an age where every piece of our data is stored and analysed, and being active consumers ourselves, we would like to see what information retail chains can gather and infer about us knowing only our shopping habits. As two years of transactions from several households and their basic demographic profiles are provided, we want to see whether there are any links and correlations between specific demographics (e.g. marital status, income, number of children) and purchase patterns. Furthermore, if time permits, we want to see whether we can build a model that predicts a consumer's demographic profile from their shopping. Thus, we would like to see how "easy" and how precise it actually is for retailers to infer who their customers are from what they buy and to target them with specific marketing. Basically, we want to know how much of a target we actually are.

**Research questions:** 
- What are the main shopping trends that we can identify in this data?
- Can we relate shopping trends to specific demographic parameters?
- Can we predict some of these demographic parameters (age, marital status, etc.) from the household's shopping habits?
- Conversely, can we predict a household's consumption behaviour from its characteristics?
- What accuracy in consumption prediction can the retailer obtain from simple profile information?

## Task 1: Clean up the data and prepare the sets we want to keep

In [None]:
%matplotlib inline
import pandas as pd

import matplotlib.pyplot as plt
from pylab import *

import os

In [None]:
os.getcwd()

In [None]:
'''As we said in the description of our project, we are going to concentrate on 3 of the 8 tables :
- hh_demographic.csv
- transaction_data.csv
- product.csv
In this first step, we want to load the data, and prepare it for the analysis'''

#load the data
hh_demographic = pd.read_csv('../data/dunnhumby_complete_csv/hh_demographic.csv', sep = ',')

transaction_data = pd.read_csv('../data/dunnhumby_complete_csv/transaction_data.csv', sep = ',')

product = pd.read_csv('../data/dunnhumby_complete_csv/product.csv', sep = ',')

### Task 1.A: What's actually in the dataset ? 
This dataset contains household level transactions over two years from a group of 2,500 households who are frequent shoppers at a retailer. It contains all of each household’s purchases, not just those from a limited number of categories. For certain households, demographic information as well as direct marketing contact history are included. We have a look at a few samples from each table: 

#### A. Transaction data: 
Dataset of all products purchased by households during the study. Each line in the table is what could essentially be found on a store receipt. The attributes of the dataset are the following: 

- HOUSEHOLD_KEY: identifies each household (stored as `household_key` in the CSV), 
- BASKET_ID: identifies a purchase occasion, 
- DAY: day when the transaction occurred
- PRODUCT_ID: identifies each product, 
- QUANTITY: number of products purchased during the trip
- SALES_VALUE: amount of dollars the retailer receives from the sale
- STORE_ID: identifies store, 
- COUPON_MATCH_DISC: discount applied due to retailer's match of manufacturer coupon
- COUPON_DISC: discount applied due to manufacturer coupon
- RETAIL_DISC: discount applied due to retailer's loyalty card program
- TRANS_TIME: time of day when the transaction occurred
- WEEK_NO: week of the transaction. Ranges from 1-102. 

In [None]:
transaction_data['STORE_ID'].is_unique

In [None]:
transaction_data.head(4)

**Q1: How many transactions occurred during the two years?**

In [None]:
print("In total there were "+ f"{transaction_data.count()['household_key']:,d}" +" transactions during the two years.")

**Q2: How many purchase occasions occurred during the two years?** <br>
Attention: here a transaction is not what we usually think of. Each row is like one line on a receipt, so the number of purchase occasions is not the number of transaction rows but the number of unique `BASKET_ID` values. 

In [None]:
print("In total there were "+ f"{len(transaction_data['BASKET_ID'].unique()):,d}" +" purchase occasions during the two years." )

#### Q3: How many households are represented in the transactions?

In [None]:
print("In total there were "+ f"{len(transaction_data['household_key'].unique()):,d}" +" households represented during the two years." )

#### B. Demographic data: 
Demographic information for a portion of the households. It covers only 801 of the 2,500 households; the rest could not be acquired. The attributes of the dataset are the following: 
 
- HOUSEHOLD_KEY : identifies each household, **unique**
- AGE_DESC: estimated age range
- MARITAL_STATUS_CODE: A (Married), B (Single), U (Unknown)
- INCOME_DESC : Household income
- HOMEOWNER_DESC: Homeowner, renter, etc
- HH_COMP_DESC: Household composition
- HOUSEHOLD_SIZE_DESC: Size of household up to 5+ 
- KID_CATEGORY_DESC: Number of children present up to 3+ 

In [None]:
hh_demographic['household_key'].is_unique

In [None]:
hh_demographic.head(4)

**Q3: How many age categories are there ? And what are they ?**


In [None]:
print("In total there are %d age categories" %len(hh_demographic['AGE_DESC'].unique()))
print("The different categories are:", hh_demographic['AGE_DESC'].unique())

**Q4: How many income categories are there ? And what are they ?**

In [None]:
print("In total there are %d income categories" %len(hh_demographic['INCOME_DESC'].unique()))
print("The different categories are:", hh_demographic['INCOME_DESC'].unique())

**Q5: How many homeowner categories are there ? And what are they ?**

In [None]:
print("In total there are %d homeowner categories" %len(hh_demographic['HOMEOWNER_DESC'].unique()))
print("The different categories are:", hh_demographic['HOMEOWNER_DESC'].unique())

**Q7: How many household composition categories are there ? And what are they ?**

In [None]:
print("In total there are %d household composition categories" %len(hh_demographic['HH_COMP_DESC'].unique()))
print("The different categories are:", hh_demographic['HH_COMP_DESC'].unique())

**Q8: How many household size categories are there ? And what are they ?**

In [None]:
print("In total there are %d household size categories" %len(hh_demographic['HOUSEHOLD_SIZE_DESC'].unique()))
print("The different categories are:", hh_demographic['HOUSEHOLD_SIZE_DESC'].unique())

**Q9: How many kid number categories are there ? And what are they ?**

In [None]:
print("In total there are %d kid number categories" %len(hh_demographic['KID_CATEGORY_DESC'].unique()))
print("The different categories are:", hh_demographic['KID_CATEGORY_DESC'].unique())

#### Q10: How many marital status categories are there? And what are they?

In [None]:
print("In total there are %d marital status categories" %len(hh_demographic['MARITAL_STATUS_CODE'].unique()))
print("The different categories are:", hh_demographic['MARITAL_STATUS_CODE'].unique())

For the marital status, the categories are not obvious:
- 'A' = 'married'
- 'B' = 'Single'
- 'U' = 'Unknown'

#### Q11: How many households are there ?

In [None]:
print("In total there are %d households for which we have the demographic data." %hh_demographic.count()['household_key'])

**Note for the bubble group :**

**Should we keep in the transaction data only the households for which we have demographic data? This could be interesting, since we want insights into shopping behaviour according to the demographic data.**
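
One way to explore the option raised in this note is to filter the transactions on `household_key` with `isin`. The sketch below is only an illustration: it uses small made-up frames (`demo_hh`, `demo_trans`) in place of the real tables, and the name `trans_with_demo` is hypothetical.

```python
import pandas as pd

# Toy stand-ins for hh_demographic and transaction_data (made-up values)
demo_hh = pd.DataFrame({'household_key': [1, 3],
                        'AGE_DESC': ['25-34', '65+']})
demo_trans = pd.DataFrame({'household_key': [1, 1, 2, 3, 4],
                           'BASKET_ID': [10, 10, 11, 12, 13],
                           'SALES_VALUE': [1.5, 2.0, 3.0, 0.5, 4.0]})

# Keep only the transactions of households that have demographic data
has_demo = demo_trans['household_key'].isin(demo_hh['household_key'])
trans_with_demo = demo_trans[has_demo]

print(f"{trans_with_demo['household_key'].nunique()} households, "
      f"{len(trans_with_demo)} transaction rows kept")
```

On the real tables the same filter would presumably be applied to `transaction_data` and `hh_demographic`; whether to discard the other households remains the open question of the note.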

#### C. Product data: 
Information on each product sold such as type of product, national or private label and a brand identifier. The attributes of the dataset are the following: 
- PRODUCT_ID: **unique**, identifies product
- DEPARTMENT: groups similar products together
- COMMODITY_DESC: groups similar products together at a lower level
- SUB_COMMODITY_DESC: groups similar products together at the lowest level. 
- MANUFACTURER: code that links products with the same manufacturer together 
- BRAND: indicates private or national label brand
- CURR_SIZE_OF_PRODUCT: indicates package size (not available for all) 

Let's have a look: 

In [None]:
product.head(4)

**Q10: How many products are there ?**

In [None]:
# Are the products IDs unique ?
product['PRODUCT_ID'].is_unique

In [None]:
print("In total there are "+ f"{product.count()['PRODUCT_ID']:,d}" +" products")

**Q11: How many department categories are there ? And what are they ?**

In [None]:
print("In total there are "+ f"{len(product['DEPARTMENT'].unique()) :,d}"+ " department categories" )
print("The different categories are:", product['DEPARTMENT'].unique())

**Q12: Are all products in the product dataset represented in transactions?**
There are 92,353 products in the *product* table. As we did for the households, we can check whether all of them are represented in the *transaction_data* table.

In [None]:
print("There are "+ f"{len(transaction_data['PRODUCT_ID'].unique()):,d}" +" products in the transactions table" )

There are 92,339 products represented in the *transaction_data* table, meaning that only **14** are not represented. We can therefore do an inner join between the two tables and simply drop those 14 products. 

**Q13: Which are these 14 products that are never sold ?**
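
A possible way to answer this question is an anti-join on `PRODUCT_ID`: keep the products whose ID never appears in the transactions. The sketch below runs on small made-up frames (`demo_product`, `demo_trans`); these names and values are assumptions for illustration, not the real data.

```python
import pandas as pd

# Toy stand-ins for product and transaction_data (made-up values)
demo_product = pd.DataFrame({'PRODUCT_ID': [100, 101, 102, 103],
                             'DEPARTMENT': ['GROCERY', 'DRUG GM', 'PRODUCE', 'GROCERY']})
demo_trans = pd.DataFrame({'household_key': [1, 1, 2],
                           'PRODUCT_ID': [100, 102, 100]})

# Products whose ID never appears in the transactions
sold = demo_product['PRODUCT_ID'].isin(demo_trans['PRODUCT_ID'])
never_sold = demo_product[~sold]
print(never_sold)

# An inner join on PRODUCT_ID keeps only the products that were sold
merged = demo_trans.merge(demo_product, on='PRODUCT_ID', how='inner')
print(merged)
```

Applied to the real `product` and `transaction_data` frames, `never_sold` would be expected to contain the 14 products mentioned above.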

### TASK 1.B: Simple plots

#### A. HH-demographic

In [None]:
hh_demographic.head(4)

For now, the categories in this data frame are not arranged in a meaningful way: if we made some plots now, the age categories, for example, would not appear in ascending or descending order. 
Thus, we first want to order them before making some exploratory plots.

In [None]:
ordered_age= ['19-24','25-34','35-44','45-54','55-64', '65+' ]

hh_demographic['AGE_DESC'] = pd.Categorical(hh_demographic['AGE_DESC'],
                      ordered = True,
                      categories = ordered_age)

print ('The order of the age categories is :', ordered_age)

In [None]:
ordered_income= ['Under 15K','15-24K','25-34K','35-49K','50-74K','75-99K','100-124K',
                 '125-149K','150-174K','175-199K','200-249K','250K+']

hh_demographic['INCOME_DESC'] = pd.Categorical(hh_demographic['INCOME_DESC'],
                      ordered = True,
                      categories = ordered_income)

print ('The order of the income categories is :', ordered_income)

In [None]:
ordered_homeowner= ['Unknown','Probable Renter','Renter','Probable Owner','Homeowner']

hh_demographic['HOMEOWNER_DESC'] = pd.Categorical(hh_demographic['HOMEOWNER_DESC'],
                      ordered = True,
                      categories = ordered_homeowner)

print ('The order of the homeowner categories is :', ordered_homeowner)

In [None]:
ordered_hh_comp= ['Unknown','Single Female','Single Male','1 Adult Kids','2 Adults No Kids','2 Adults Kids']

hh_demographic['HH_COMP_DESC'] = pd.Categorical(hh_demographic['HH_COMP_DESC'],
                      ordered = True,
                      categories = ordered_hh_comp)

print ('The order of the household composition categories is :', ordered_hh_comp)

In [None]:
ordered_hh_size= ['1','2','3','4','5+']

hh_demographic['HOUSEHOLD_SIZE_DESC'] = pd.Categorical(hh_demographic['HOUSEHOLD_SIZE_DESC'],
                      ordered = True,
                      categories = ordered_hh_size)

print ('The order of the household size categories is :', ordered_hh_size)

In [None]:
ordered_kid_number= ['None/Unknown','1','2','3+']

hh_demographic['KID_CATEGORY_DESC'] = pd.Categorical(hh_demographic['KID_CATEGORY_DESC'],
                      ordered = True,
                      categories = ordered_kid_number)

print ('The order of the kid number categories is :', ordered_kid_number)

In [None]:
ordered_marital_status= ['A','B','U']

hh_demographic['MARITAL_STATUS_CODE'] = pd.Categorical(hh_demographic['MARITAL_STATUS_CODE'],
                      ordered = True,
                      categories = ordered_marital_status)

print ('The order of the marital status categories is :', ordered_marital_status)

Now that all the categories in this data frame are ordered in a meaningful way, let's make some simple plots to get an idea of the characteristics of the population we study.

In [None]:
fig1 = plt.figure(figsize=(20,20))

plt.subplot(2, 2, 1)
hh_demographic['AGE_DESC'].value_counts(sort = False).plot(kind = 'bar', title = 'Age histogram')

plt.subplot(2, 2, 2)
hh_demographic['MARITAL_STATUS_CODE'].value_counts(sort = False).plot(kind='bar', title = 'Marital status histogram')

plt.subplot(2,2,3)
hh_demographic['INCOME_DESC'].value_counts(sort = False).plot(kind='bar', title = 'Income histogram')

plt.subplot(2,2,4)
hh_demographic['HOMEOWNER_DESC'].value_counts(sort = False).plot(kind='bar', title = 'Homeowner histogram')

plt.show()

In [None]:
fig2 = plt.figure(figsize=(20,20))

plt.subplot(2,2,1)
hh_demographic['HH_COMP_DESC'].value_counts(sort = False).plot(kind='bar', title = 'Household composition histogram')

plt.subplot(2,2,2)
hh_demographic['HOUSEHOLD_SIZE_DESC'].value_counts(sort = False).plot(kind='bar', title = 'Household size histogram')

plt.subplot(2,2,3)
hh_demographic['KID_CATEGORY_DESC'].value_counts(sort = False).plot(kind='bar', title = 'Kid categories')

plt.show()

#### C. Transaction data

In [None]:
import seaborn as sns
from scipy import stats

In [None]:
transaction_data.head(4)

Drop the coupon and discount columns, as we are not interested in marketing. 

In [None]:
trans_clean = transaction_data.drop(['COUPON_DISC','COUPON_MATCH_DISC', 'RETAIL_DISC'], axis = 1)
trans_clean.head(4)

Group by households: 

In [None]:
grouped_trans = trans_clean.groupby(['household_key', 'BASKET_ID']).size()
# Number of rows (products) in each basket of household 1
grouped_trans.loc[1]

In [None]:
# Number of total purchases (distinct baskets) per household:
grouped_trans = trans_clean.groupby(['household_key', 'BASKET_ID']).size()

# One row per (household, basket) pair, so counting rows per household counts its baskets
purchases_per_household = (grouped_trans.groupby(level='household_key').size()
                           .to_frame('total of purchases in two years'))

# Plot distribution (sns.distplot is deprecated; histplot with kde=True is the closest replacement):
fig = plt.figure()

sns.histplot(purchases_per_household['total of purchases in two years'], kde=True, stat='density')
plt.xlabel('Number of total purchases in two years')
plt.title('Distribution of total purchases over two years per household')

Take a look at the transactions per year. We realise that some households were not present in the first year and that others dropped out of the study in the second year. This has to be taken into account. 

In [None]:
# Split on the week number: weeks 1-51 are year 1, the remaining weeks are year 2
trans_clean_year_1 = trans_clean[trans_clean['WEEK_NO'] <= 51]

trans_clean_year_2 = trans_clean[trans_clean['WEEK_NO'] > 51]

# See that not all households are present in each year:
print(trans_clean_year_1['household_key'].nunique())
print(trans_clean_year_2['household_key'].nunique())

# Find households that are not represented:
all_households = set(range(1, 2501))
missing_households_year1 = all_households.difference(trans_clean_year_1['household_key'].unique())
missing_households_year2 = all_households.difference(trans_clean_year_2['household_key'].unique())

print("The following households are not represented in the first year transaction data: ", missing_households_year1)
print('\n')
print("The following households are not represented in the second year transaction data: ", missing_households_year2)

In [None]:
# How many purchase occasions per household in the first year? Assume one year is 51 weeks.
grouped_trans_year1 = trans_clean_year_1.groupby(['household_key','BASKET_ID']).size()

# Count the baskets of each household present in year 1
purchases_per_household_year1 = (grouped_trans_year1.groupby(level='household_key').size()
                                 .to_frame('total purchase in year 1 per household'))

# Plot distribution (sns.distplot is deprecated; histplot with kde=True is the closest replacement):
fig = plt.figure()

sns.histplot(purchases_per_household_year1['total purchase in year 1 per household'], kde=True, stat='density')
plt.xlabel('Number of total purchases in the first year')
plt.title('Distribution of total purchases over the first year per household')

- Q: What's the average budget per family? Budget per family distribution?
- Q: What's the average budget per week per family? Distribution of budget?
- Q: What's the amount of money spent per year per family?
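
A minimal sketch of how these budget questions could be approached with `groupby`, assuming that `SALES_VALUE` is a reasonable proxy for what a household spends. The frame `demo_trans` and its values are made up for illustration, and the weekly average here only covers the weeks in which a household actually shopped.

```python
import pandas as pd

# Toy stand-in for the cleaned transaction table (made-up values)
demo_trans = pd.DataFrame({
    'household_key': [1, 1, 1, 2, 2],
    'WEEK_NO':       [1, 1, 2, 1, 3],
    'SALES_VALUE':   [2.0, 3.0, 5.0, 4.0, 6.0],
})

# Total spending per household over the whole period
total_per_household = demo_trans.groupby('household_key')['SALES_VALUE'].sum()

# Spending per household and week, then the average over the household's active weeks
weekly = demo_trans.groupby(['household_key', 'WEEK_NO'])['SALES_VALUE'].sum()
avg_weekly_per_household = weekly.groupby(level='household_key').mean()

print("Average total budget per household:", total_per_household.mean())
print(avg_weekly_per_household)
```

The per-year figure could be obtained the same way by first splitting the table on `WEEK_NO`, as done above for the purchase counts.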