# Do customers buy the SAME products again? 

To anyone who has already read this notebook (https://www.kaggle.com/hengzheng/time-is-our-best-friend-v2):

You've probably wondered if customers would really buy the exact same thing again (!?)  

At least I, as a consumer myself, **couldn't imagine buying the same product again within 1 month**!

So I changed the notebook above and adopted the strategy of "not recommending the same product". Then my score **DROPPED** from 0.02 to 0.006 :(

Okay... I have to accept that fact. Some customers do purchase exactly the same products again within 1 week. But to make sure, I will check this in a data-driven way.


This notebook is my survey report about **'Do customers buy the SAME products again?'**  
Conclusion first: yes, they do. And the **customers** and **products** involved in repeat purchases show clear characteristics.


# **Please upvote! :)**

### If you haven't seen this notebook yet, take a look at this one too.
https://www.kaggle.com/lichtlab/h-m-data-deep-dive-chap-1-understand-article

# Agenda
1. Do customers buy the same product multiple times?
2. Do customers buy different colors and sizes of the same product?
3. Do customers buy the same product type?
4. What kind of customer feature or product feature leads to 'one more time purchase'? (will be updated soon)

# 1. Do customers buy the same product multiple times?

In [1]:
import pandas as pd
import plotly.express as px

### Data prep

In [2]:
df_trans = pd.read_csv('/kaggle/input/h-and-m-personalized-fashion-recommendations/transactions_train.csv',dtype={'article_id': str})
df_trans['t_dat'] = pd.to_datetime(df_trans['t_dat'])
df_trans = df_trans[df_trans['t_dat'] >= pd.to_datetime('2020-07-01')]
df_article = pd.read_csv('/kaggle/input/h-and-m-personalized-fashion-recommendations/articles.csv',dtype={'article_id': str})

df_article['idxgrp_idx_prdtyp'] = df_article['index_group_name'] + '_' + df_article['index_name'] + '_' + df_article['product_type_name']

df = pd.merge(
    df_trans,
    df_article,
    on='article_id',
    how='left'
)
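# Cast product_code to str so its values can be joined into comma-separated sets later.
df['product_code'] = df['product_code'].astype(str)
# ISO week number, used as a week index (assumes the filtered data stays within one calendar year).
df['num_week'] = df['t_dat'].dt.isocalendar().week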

### Compare a set of products purchased in any given week with a set of products purchased 1, 2, or 3 weeks ago to see if they contain the SAME products.
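
As a minimal sketch of this comparison on toy data (the `toy` frame and the `bought_same_again` column are made up for illustration and are not part of the dataset), the previous week's set can be lined up with the current week by shifting the week number by one and self-merging. The function further down applies the same idea for 1, 2 and 3 weeks at once.

```python
import pandas as pd

# Toy transactions (hypothetical data): one row per purchased article.
toy = pd.DataFrame({
    "customer_id": ["a", "a", "a", "b", "b"],
    "num_week":    [30, 31, 31, 30, 32],
    "article_id":  ["001", "001", "002", "003", "003"],
})

# One set of purchased articles per (customer, week).
weekly = (
    toy.groupby(["customer_id", "num_week"])["article_id"]
       .apply(set)
       .reset_index(name="purchased_set")
)

# Shift week N-1 forward by one so it lines up with week N in the merge.
prev = weekly.rename(columns={"purchased_set": "prev_set"})
prev["num_week"] = prev["num_week"] + 1
pairs = weekly.merge(prev, on=["customer_id", "num_week"], how="left")

# Explicit per-row intersection; a missing previous week counts as no overlap.
pairs["bought_same_again"] = [
    int(isinstance(p, set) and bool(cur & p))
    for cur, p in zip(pairs["purchased_set"], pairs["prev_set"])
]

# Share of customers (in %) with at least one repeat purchase.
repeat_customers = pairs.loc[pairs["bought_same_again"] == 1, "customer_id"].nunique()
ratio = repeat_customers / pairs["customer_id"].nunique() * 100
print(ratio)
```

Customer `a` buys article `001` in weeks 30 and 31 and is flagged; customer `b` buys `003` two weeks apart, so the one-week comparison misses it.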

In [3]:
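def do_customers_purchase_same_AGGKEY(df, agg_key):
    """Print and plot the share of customers (in %) who bought an item with the same
    agg_key value in a given week and in one of the 1, 2 or 3 preceding weeks."""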
    dfagg = df.groupby(['num_week','customer_id'])[[agg_key]].agg({
            agg_key: lambda x: ','.join(x)
    }).reset_index().rename(columns={agg_key: 'purchased_set'})
    dfagg['num_2wk_before'] = dfagg['num_week'] + 2
    dfagg = pd.merge(
        dfagg[['num_week','customer_id','purchased_set']],
        dfagg.rename(columns={'purchased_set': '2wk_before_purchased_set'})[['num_2wk_before','customer_id','2wk_before_purchased_set']],
        left_on=['num_week', 'customer_id'],
        right_on=['num_2wk_before', 'customer_id'],
        how='left'
    )
    dfagg['num_1wk_before'] = dfagg['num_week'] + 1
    dfagg = pd.merge(
        dfagg,
        dfagg.rename(columns={'purchased_set': '1wk_before_purchased_set'})[['num_1wk_before','customer_id','1wk_before_purchased_set']],
        left_on=['num_week', 'customer_id'],
        right_on=['num_1wk_before', 'customer_id'],
        how='left'
    )
    dfagg['num_3wk_before'] = dfagg['num_week'] + 3
    dfagg = pd.merge(
        dfagg,
        dfagg.rename(columns={'purchased_set': '3wk_before_purchased_set'})[['num_3wk_before','customer_id','3wk_before_purchased_set']],
        left_on=['num_week', 'customer_id'],
        right_on=['num_3wk_before', 'customer_id'],
        how='left'

    )
    dfagg = dfagg[['num_week','customer_id','purchased_set','1wk_before_purchased_set','2wk_before_purchased_set','3wk_before_purchased_set']]
    for col in ['purchased_set','1wk_before_purchased_set', '2wk_before_purchased_set', '3wk_before_purchased_set']:
        dfagg[col] = dfagg[col].fillna('')
        dfagg[col] = dfagg[col].str.split(',')
    dfagg['2wk_before_purchased_set'] = dfagg['2wk_before_purchased_set'] + dfagg['1wk_before_purchased_set']
    dfagg['3wk_before_purchased_set'] = dfagg['3wk_before_purchased_set'] + dfagg['2wk_before_purchased_set']
    for col in ['purchased_set','1wk_before_purchased_set', '2wk_before_purchased_set', '3wk_before_purchased_set']:
        dfagg[col] = dfagg[col].map(set)

    # Flag rows where the current week's set overlaps the set of the preceding weeks.
    # The intersection is computed explicitly per row instead of relying on how pandas applies `&` to object columns.
    for n in (1, 2, 3):
        dfagg[f'is_purchased_same_within_{n}wk'] = [
            int(bool(cur & prev))
            for cur, prev in zip(dfagg['purchased_set'], dfagg[f'{n}wk_before_purchased_set'])
        ]
    # Share of customers (in %) with at least one week that contains a repeat purchase.
    n_customers = dfagg['customer_id'].nunique()
    ratios = {
        n: dfagg.loc[dfagg[f'is_purchased_same_within_{n}wk'] == 1, 'customer_id'].nunique() / n_customers * 100
        for n in (1, 2, 3)
    }
    # Printed in the order: within 3 weeks, within 2 weeks, within 1 week.
    print(ratios[3], ratios[2], ratios[1])
    df_vis = pd.DataFrame({
        'Period': ['Within_1wk', 'Within_2wk', 'Within_3wk'],
        'Ratio': [ratios[1], ratios[2], ratios[3]]
    })
    fig = px.bar(df_vis, x='Period', y='Ratio')
    fig.show()
    return dfagg

### Result

In [4]:
dfagg_article = do_customers_purchase_same_AGGKEY(df, 'article_id')

6.643407353238566 6.154185552943195 5.093013422448641


# 1. Conclusion - Do customers buy the same product multiple times?
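- **5.1%** of customers buy the same product again within one week
- **6.2%** buy the same product again within two weeks
- **6.6%** buy the same product again within three weeks

Here "within N weeks" means the product was also bought in one of the N preceding calendar weeks; repeat purchases inside the same calendar week are not counted.

In other words, most customers who buy a product again within three weeks do so within **two weeks** (6.2 of the 6.6 percentage points).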
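
Because the two-week and three-week windows include the shorter ones, the three percentages can be compared directly. A small sketch of that arithmetic, using only the numbers printed above:

```python
# Percentages printed by the cell above (share of customers, in %).
ratio = {
    "1wk": 5.093013422448641,
    "2wk": 6.154185552943195,
    "3wk": 6.643407353238566,
}

# Fraction of the three-week repeat buyers already captured by each window.
share_of_3wk = {window: value / ratio["3wk"] for window, value in ratio.items()}

for window, share in share_of_3wk.items():
    print(f"{window}: {share:.1%} of the three-week repeat buyers")
```

Roughly 77% of the three-week repeat buyers already show up in the one-week window, and roughly 93% in the two-week window.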

# 2. Do customers buy different colors and sizes of the same product?
In this analysis I aggregate data at 'product_code' granularity instead of 'article_id' granularity.
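
To illustrate why the coarser key can only find more matches, here is a small sketch. It assumes that a product code is the article ID without its last three digits; the helper `to_product_code` is hypothetical, and the notebook itself takes product_code from articles.csv instead. The two article IDs are illustrative values.

```python
def to_product_code(article_id: str) -> str:
    """Hypothetical helper: assumes the last three digits of article_id encode the variant."""
    return article_id[:-3]

# Two variants of what is assumed to be the same product, bought one week apart.
last_week = {"0108775015"}
this_week = {"0108775044"}

# At article_id granularity the two purchases do not match.
same_article = bool(this_week & last_week)

# At product_code granularity they do.
same_product_code = bool(
    {to_product_code(a) for a in this_week} & {to_product_code(a) for a in last_week}
)

print(same_article, same_product_code)
```

Every article-level match is also a product-code match, so the ratios at this granularity cannot be lower than those in section 1.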

### Result

In [5]:
dfagg_prdcd = do_customers_purchase_same_AGGKEY(df, 'product_code')

9.381619559067149 8.480797789003438 6.767806550056889


# 2. Conclusion - Do customers buy different colors and sizes of the same product?
- **6.8%** of customers buy an item with the same product code again within one week
- **8.5%** buy an item with the same product code again within two weeks
- **9.4%** buy an item with the same product code again within three weeks

Customers seem to need a little more time when buying products with the same product code but different colors and sizes: the ratio grows by about 39% from the one-week to the three-week window, versus about 30% at the article level.

# 3. Do customers buy the same product type?
In this analysis I aggregate data at 'idxgrp_idx_prdtyp' granularity instead of 'product_code' granularity.

*idxgrp_idx_prdtyp is the concatenation of index_group_name, index_name and product_type_name. For details, see:  
https://www.kaggle.com/lichtlab/h-m-data-deep-dive-chap-1-understand-article

### Result

In [6]:
dfagg_idxgrp_idx_prdtyp = do_customers_purchase_same_AGGKEY(df, 'idxgrp_idx_prdtyp')

16.5346755101082 14.338590510118415 10.800260645936191


# 3. Conclusion - Do customers buy the same product type?

- **10.8%** of customers buy an item of the same idxgrp_idx_prdtyp again within one week
- **14.3%** buy an item of the same idxgrp_idx_prdtyp again within two weeks
- **16.5%** buy an item of the same idxgrp_idx_prdtyp again within three weeks

**16.5%** is huge! Over a short period of time, customers seem to buy multiple similar products.

# 4. What kind of customer feature or product feature leads to 'one more time purchase'?

## 4-1 Product feature
## Data prep
For each purchase, set a binary flag indicating whether the same customer bought a product with the same product_code at least one other time in the 21 days up to and including that purchase (several units bought on the same day also count).
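
A minimal sketch of this rolling count on toy data (the `toy` frame and the column names `num_purchased` and `repeat_within_21d` are made up for illustration):

```python
import pandas as pd

# Toy purchases (hypothetical data).
toy = pd.DataFrame({
    "customer_id":  ["a", "a", "a", "b", "b"],
    "product_code": ["p", "p", "q", "p", "p"],
    "t_dat": pd.to_datetime(
        ["2020-09-01", "2020-09-10", "2020-09-05", "2020-08-01", "2020-09-15"]
    ),
    "price": [0.02, 0.02, 0.03, 0.02, 0.02],
})

# For each purchase, count the purchases of the same (customer, product_code)
# in the 21 days up to and including that purchase.
counts = (
    toy.sort_values("t_dat")
       .set_index("t_dat")
       .groupby(["customer_id", "product_code"])["price"]
       .rolling("21D")
       .count()
       .rename("num_purchased")
       .reset_index()
)

# A count above 1 means the purchase repeats an earlier one within the window.
counts["repeat_within_21d"] = (counts["num_purchased"] > 1).astype(int)
print(counts)
```

Only customer `a`'s second purchase of product `p` is flagged; customer `b`'s two purchases of `p` are more than 21 days apart.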

In [7]:
dfagg = df.sort_values("t_dat")\
        .set_index('t_dat')\
        .groupby(['customer_id','product_code'])\
        .rolling("21d")[["price"]]\
        .count()\
        .reset_index()\
        .rename(columns={'price': 'num_purchased_same_article'})
dfagg['is_purchased_same_prdcd_within_3wk'] = (dfagg['num_purchased_same_article'] > 1).astype(int)
dfagg = dfagg[dfagg['t_dat'] > '2020-09-01']
# Attach the product type label (idxgrp_idx_prdtyp) to each product_code.
dfagg = pd.merge(
    dfagg,
    df[['idxgrp_idx_prdtyp', 'product_code']].dropna().drop_duplicates(),
    on='product_code'
)

## Model design
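Fit a logistic regression with idxgrp_idx_prdtyp (one-hot encoded) as the explanatory variable to see how well it explains the binary flag created above. To keep fitting fast, the model is trained on a 1% random sample, and an L1 penalty shrinks weak coefficients toward zero.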
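
As a rough guide to reading the coefficients: with a single one-hot encoded categorical variable and no regularization, each coefficient equals the difference in log-odds between that category and the dropped baseline category. The sketch below computes those reference values directly on made-up counts (the `toy` frame and its categories A, B, C are hypothetical). The penalized model in this notebook will generally give smaller values.

```python
import numpy as np
import pandas as pd

# Toy flags per product type (hypothetical counts): A 2/10, B 5/10, C 1/10 repeats.
toy = pd.DataFrame({
    "idxgrp_idx_prdtyp": ["A"] * 10 + ["B"] * 10 + ["C"] * 10,
    "flag": [1] * 2 + [0] * 8 + [1] * 5 + [0] * 5 + [1] * 1 + [0] * 9,
})

# Repeat-purchase rate and log-odds per product type.
rate = toy.groupby("idxgrp_idx_prdtyp")["flag"].mean()
log_odds = np.log(rate / (1 - rate))

# get_dummies(drop_first=True) drops the first category in sorted order,
# which therefore acts as the baseline.
baseline = rate.index[0]
relative_log_odds = (log_odds - log_odds[baseline]).drop(baseline)
print(relative_log_odds)
```

A positive value means the product type is repurchased more often than the baseline type, a negative value less often.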

In [8]:
from sklearn.linear_model import LogisticRegression
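# One-hot encode the product type; drop_first=True makes the first category the baseline.
dftrain = pd.get_dummies(dfagg[['idxgrp_idx_prdtyp']], drop_first=True)
dftrain['is_purchased_same_prdcd_within_3wk'] = dfagg['is_purchased_same_prdcd_within_3wk']
# Fit on a 1% random sample to keep the training time short.
dftrain_partial = dftrain.sample(frac=0.01, random_state=0)
dftrain_partial = dftrain_partial.dropna()
feature_cols = [col for col in dftrain.columns if col not in ['is_purchased_same_prdcd_within_3wk']]
# The L1 penalty shrinks the coefficients of uninformative product types toward zero.
model = LogisticRegression(C=5.0, penalty="l1", tol=0.01, solver="saga")
model.fit(dftrain_partial[feature_cols], dftrain_partial['is_purchased_same_prdcd_within_3wk'])
# One coefficient per product type, relative to the baseline category.
df_coef = pd.DataFrame(
        model.coef_,
        index=['coefficient'],
        columns=feature_cols).T.reset_index()
# Strip the dummy-column prefix so only the product type label remains.
df_coef['index'] = df_coef['index'].str.replace('idxgrp_idx_prdtyp_', '', regex=False)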

# Types of products most likely to be purchased multiple times

In [9]:
px.bar(df_coef.sort_values(by='coefficient', ascending=False)[:15], x='index', y='coefficient')

# Types of products least likely to be purchased multiple times

In [10]:
px.bar(df_coef.sort_values(by='coefficient', ascending=False)[-15:], x='index', y='coefficient')