# H&M EDA and Baseline (WIP)

This notebook is a quick EDA and baseline for the new H&M Personalized Fashion Recommendations competition. If you find the notebook helpful, please give it an upvote :)

#### Forked: added a time filter, changed settings, and produces output; now works for anyone without the private data source. - Dan



### Contents:
[Load in the data ⏳](#first-bullet)
    
[Articles EDA 📚](#second-bullet)   
  
[Customers EDA 🛍](#third-bullet)
    
[Transaction EDA 💸](#fourth-bullet)
    
[Imagery 📸](#fith-bullet)
   
[Baseline 📈](#sixth-bullet)

### Imports

In [None]:
import pandas as pd
from pathlib import Path
import matplotlib.pyplot as plt
import seaborn as sns
import datetime as dt
from termcolor import colored
from PIL import Image
import os
import random

import warnings
warnings.filterwarnings('ignore')

# Load in the data ⏳

In [None]:
DATA_PATH = Path('../input/h-and-m-personalized-fashion-recommendations')
!ls $DATA_PATH

### Data Overview
- `images/` - a folder of images corresponding to each `article_id`; images are placed in subfolders starting with the first three digits of the `article_id`; note, not all `article_id` values have a corresponding image.
- `articles.csv` - detailed metadata for each article_id available for purchase
- `customers.csv` - metadata for each `customer_id` in dataset
- `sample_submission.csv` - a sample submission file in the correct format
- `transactions_train.csv` - the training data, consisting of the purchases of each customer on each date, as well as additional information. Duplicate rows correspond to multiple purchases of the same item. Your task is to predict the `article_id`s each customer will purchase during the 7-day period immediately after the training data period.
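
Because duplicate rows encode quantity, a per-purchase count can be recovered by grouping on the identifying columns. A minimal sketch on a toy frame (the rows below are invented for illustration and are not taken from the dataset):

```python
import pandas as pd

# Toy transactions; the values are invented for illustration only.
toy = pd.DataFrame({
    't_dat': ['2020-09-01', '2020-09-01', '2020-09-01', '2020-09-02'],
    'customer_id': ['c1', 'c1', 'c2', 'c1'],
    'article_id': ['0111', '0111', '0222', '0111'],
})

# Each duplicate row is one more unit of the same article bought that day.
quantity = (
    toy.groupby(['t_dat', 'customer_id', 'article_id'])
       .size()
       .reset_index(name='quantity')
)
print(quantity)
```

The same grouping should work on the real `transactions_train` frame, though it may be memory-hungry at full size.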

In [None]:
articles = pd.read_csv(DATA_PATH/'articles.csv')
customers = pd.read_csv(DATA_PATH/'customers.csv')
transactions_train = pd.read_csv(DATA_PATH/'transactions_train.csv')
samp_sub = pd.read_csv(DATA_PATH/'sample_submission.csv')

From: https://www.kaggle.com/hengzheng/time-is-our-best-friend

In [None]:
### From: https://www.kaggle.com/hengzheng/time-is-our-best-friend
# only use data after 2020-06-01

# transactions['t_dat'] = pd.to_datetime(transactions['t_dat'])
# transactions = transactions[transactions['t_dat'] > pd.to_datetime('2020-06-01')]


transactions_train['t_dat'] = pd.to_datetime(transactions_train['t_dat'])
t_cut = pd.to_datetime('2020-06-01')
transactions_train = transactions_train.loc[transactions_train['t_dat'] > t_cut]

# Articles EDA 📚

In [None]:
articles.head()

In [None]:
num_articles = len(articles)
num_unique_id = len(articles['article_id'].unique())
print(f'We have {num_articles} rows in the df and {num_unique_id} unique article IDs') 

In [None]:
num_prod_codes = len(articles['product_code'].unique())
print(f'Each article has a product_code, some articles have the same product_code with a total of {num_prod_codes} unique values')

In [None]:
num_prod_name = len(articles['prod_name'].unique())
print(f'Each article also has a prod_name, with a total of {num_prod_name} unique values')

Interestingly, the number of unique `product_code` values differs from the number of unique `prod_name` values, meaning there isn't a one-to-one mapping between them.
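
One way to see where the mapping breaks down is to count distinct names per code and distinct codes per name. The sketch below uses a small made-up frame (the column names match `articles.csv`, the values are invented):

```python
import pandas as pd

# Toy stand-in for the articles table; values are invented for illustration.
toy_articles = pd.DataFrame({
    'product_code': [1, 1, 2, 3],
    'prod_name': ['Strap top', 'Strap top (1)', 'Jersey dress', 'Jersey dress'],
})

# Distinct names attached to each code, and distinct codes attached to each name.
names_per_code = toy_articles.groupby('product_code')['prod_name'].nunique()
codes_per_name = toy_articles.groupby('prod_name')['product_code'].nunique()

print((names_per_code > 1).sum(), 'codes map to more than one name')
print((codes_per_name > 1).sum(), 'names map to more than one code')
```

Running the same two `groupby` calls on `articles` would show which direction of the mapping is not unique.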

In [None]:
num_prod_type_no = len(articles['product_type_no'].unique())
num_prod_type = len(articles['product_type_name'].unique())
print(f'We have {num_prod_type_no} unique product_type_no values and {num_prod_type} unique product_type_name values, with each number mapping to a name')

In [None]:
def plot_bar_chart(df, feature, x_lim):
    feature_count = df[feature].value_counts()
    feature_count = feature_count[:x_lim]
    plt.figure(figsize=(30,10))
    # Pass x and y as keywords; recent seaborn versions no longer accept them positionally
    sns.barplot(x=feature_count.index, y=feature_count.values, alpha=0.7)
    sns.set(font_scale = 2)
    plt.title(f'Frequency of top {x_lim} {feature}', fontsize=30)
    plt.ylabel('Count', fontsize=30)
    plt.xlabel(feature.replace('_', ' '), fontsize=30)
    sns.set(font_scale=1.2)
    plt.show()

In [None]:
plot_bar_chart(articles, 'product_type_name', 15)

In [None]:
num_prod_group = len(articles['product_group_name'].unique())
print(f'We have {num_prod_group} unique product_group_names values')

In [None]:
plot_bar_chart(articles, 'product_group_name', 10)

We have six columns related to colour:

- `colour_group_code` 
- `colour_group_name`
- `perceived_colour_value_id`
- `perceived_colour_value_name`
- `perceived_colour_master_id`
- `perceived_colour_master_name`

For the sake of brevity we only plot `perceived_colour_master_name`

In [None]:
plot_bar_chart(articles, 'perceived_colour_master_name', 18)

In [None]:
articles['garment_group_name'].value_counts()

We then have the following pieces of metadata, along with their codes:
- `department_name`, e.g. 'Kids Girl Swimwear'
- `index_name`, e.g. 'Children Accessories, Swimwear'
- `index_group_name`, e.g. 'Baby/Children'
- `section_name`, e.g. 'Baby Essentials & Complements'
- `garment_group_name`, e.g. 'Swimwear'

In [None]:
display(articles['department_name'].value_counts().head(10))

In [None]:
display(articles['index_name'].value_counts().head(10))

In [None]:
display(articles['index_group_name'].value_counts().head(10))

In [None]:
display(articles['section_name'].value_counts().head(10))

In [None]:
display(articles['garment_group_name'].value_counts().head(10))

In [None]:
print('Each article has a detailed description:\n')
for i, (index, row) in enumerate(articles.sample(5).iterrows()):
    description = row['detail_desc']
    print(f'{i+1}. {description} \n')

# Customers EDA 🛍

In [None]:
customers.head()

In [None]:
num_customers = len(customers)
num_customer_id = len(customers['customer_id'].unique())
print(f'We have {num_customers} rows and {num_customer_id} unique customer_ids')

In [None]:
customers['club_member_status'].value_counts()

In [None]:
customers['fashion_news_frequency'].value_counts()

In [None]:
colors = sns.color_palette('pastel')[0:5]
fig, ax = plt.subplots(1,2, figsize=(15, 6))
for i, feature in enumerate(['club_member_status', 'fashion_news_frequency']):
    fashion_news = customers[feature].value_counts()
    data = fashion_news.to_list()
    labels = fashion_news.index.to_list()
    ax[i].pie(data, labels = labels, colors = colors, autopct='%.0f%%')
    ax[i].set_title(feature)
fig.show()

In [None]:
plt.figure(figsize=(10,6))
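# distplot is deprecated in seaborn; histplot with a KDE overlay gives a similar density plot
p = sns.histplot(customers['age'], color="y", kde=True, stat="density")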
p.set_xlabel("Age", fontsize = 20)
p.set_ylabel("Density", fontsize = 20)
p.set_title("Age of customers")
p.axvline(customers['age'].mean(), color='r', linestyle='--', label="Mean")
plt.show()

# Transaction EDA 💸

In [None]:
transactions_train.head(5)

In [None]:
# Convert t_dat back to strings so the strptime parsing in the next cell still works
transactions_train['t_dat'] = transactions_train['t_dat'].astype(str)

In [None]:
%%time
dates_list = [dt.datetime.strptime(date, '%Y-%m-%d').date() for date in transactions_train['t_dat'].to_list()]
transactions_train['new_date'] = dates_list
transactions_train['new_date'] = transactions_train['new_date'] - transactions_train['new_date'].min()
transactions_train["new_date"] = transactions_train["new_date"].apply(lambda x: x.days)

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(15, 5))
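# distplot is deprecated in seaborn; histplot with a KDE overlay gives a similar density plot
sns.histplot(transactions_train['new_date'], ax=axes[0], kde=True, stat='density')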
axes[0].set_xlabel('Day')
axes[0].set_title('Date of sale')

sns.histplot(transactions_train['price'], ax=axes[1], color='g', kde=True, stat='density')
axes[1].set_title('Price')
axes[1].set_xlim(0, 0.2)
plt.show()

# Imagery 📸

There is a folder of images corresponding to each `article_id`; images are placed in subfolders named after the first three digits of the full `article_id`. Note that not all `article_id` values have a corresponding image.

- `article_id` was read as an integer above, so its leading zero is dropped; the directory name is therefore `0` + the first two characters of the integer `article_id`
- The image file name is `0` + the whole `article_id`, with a `.jpg` extension
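
The path rule can be wrapped in a small helper that accepts the ID either as an integer or as a string. This is only a sketch: `image_path` and `IMAGE_ROOT` are hypothetical names, and it assumes every full article ID is 10 characters long once the leading zero is restored (as in IDs such as `0706016001` used later in this notebook).

```python
from pathlib import Path

# Assumed location of the image folder inside the Kaggle input directory.
IMAGE_ROOT = Path('../input/h-and-m-personalized-fashion-recommendations/images')

def image_path(article_id, root=IMAGE_ROOT):
    """Return the expected image path for an article ID given as int or str."""
    # Restore the leading zero that is lost when the ID is parsed as an integer.
    full_id = str(article_id).zfill(10)
    # The subfolder is the first three characters of the full ID.
    return root / full_id[:3] / f'{full_id}.jpg'

print(image_path(706016001))
```

Since not every article has an image, callers would still need to check `image_path(...).exists()` before opening the file.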

In [None]:
IMAGE_PATH = Path('../input/h-and-m-personalized-fashion-recommendations/images')

In [None]:
def plot_imgs(ids, rows, cols):
    figure, ax = plt.subplots(nrows=rows,ncols=cols,figsize=(16,8))
    for ind, id_ in enumerate(ids):
        fn = f'{IMAGE_PATH}/0{str(id_)[:2]}/0{id_}.jpg'
        axis = ax.ravel()[ind]
        axis.set_axis_off()
        try:
            img = Image.open(fn)
        except FileNotFoundError:
            # Not every article has an image; leave this panel blank
            continue
        axis.imshow(img)
    plt.tight_layout()
    plt.show()

In [None]:
# Random sample of 5 images
plot_imgs(articles.sample(5)['article_id'], 1, 5)

In [None]:
# 5 Sweaters
plot_imgs(articles[articles['product_type_name']=='Sweater'].sample(5)['article_id'], 1, 5)

# Baseline 📈

Submissions are evaluated according to the Mean Average Precision @ 12 (MAP@12):

## $\frac{1}{U} \sum_{u=1}^{U} \frac{1}{\min(m,12)} \sum_{k=1}^{\min(n,12)} P(k) \times rel(k)$

where 𝑈 is the number of customers, 𝑃(𝑘) is the precision at cutoff 𝑘, 𝑛 is the number of predictions per customer, 𝑚 is the number of ground truth values per customer, and 𝑟𝑒𝑙(𝑘) is an indicator function equaling 1 if the item at rank 𝑘 is a relevant (correct) label, zero otherwise.

Notes:

- You will be making purchase predictions for all `customer_id` values provided, regardless of whether these customers made purchases in the training data.
- Customers who did not make any purchase during the test period are excluded from the scoring.
- There is never a penalty for using the full 12 predictions for a customer that ordered fewer than 12 items; thus, it's advantageous to make 12 predictions for each customer.
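
For local validation, the metric can be sketched in a few lines of Python. This is not the official scorer: the function names are hypothetical, each customer's sum is normalised by min(m, 12) where m is the number of articles the customer actually bought (an assumption based on our reading of the competition metric, worth checking against the evaluation page), and repeated predictions are counted only once.

```python
def average_precision_at_k(actual, predicted, k=12):
    """AP@k for one customer; `actual` holds the purchased article IDs."""
    if not actual:
        return None  # customers without purchases are excluded from scoring
    actual = set(actual)
    hits = 0
    score = 0.0
    seen = set()
    for rank, item in enumerate(predicted[:k], start=1):
        # Count a hit only the first time a relevant item appears.
        if item in actual and item not in seen:
            hits += 1
            score += hits / rank  # precision at this cutoff
        seen.add(item)
    return score / min(len(actual), k)

def map_at_k(actuals, predictions, k=12):
    """Mean of AP@k over customers that have at least one purchase."""
    scores = [average_precision_at_k(a, p, k) for a, p in zip(actuals, predictions)]
    scores = [s for s in scores if s is not None]
    return sum(scores) / len(scores) if scores else 0.0

# Toy check: the second customer has no purchases and is skipped.
print(map_at_k([['a', 'b'], [], ['c']], [['a', 'x', 'b'], ['z'], ['x', 'c']]))
```

Because each hit contributes the precision at its rank, a correct article placed earlier in the list scores higher than the same article placed later.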

In [None]:
# Due to low memory
%reset -f

In [None]:
import pandas as pd
from pathlib import Path
from collections import Counter
from itertools import chain, combinations
import random
import pprint
from tqdm import tqdm

import warnings
warnings.filterwarnings('ignore')

In [None]:
DATA_PATH = Path('../input/h-and-m-personalized-fashion-recommendations')

def add_value(dict_obj, key, value):
    if key not in dict_obj:
        dict_obj[key] = value
    elif isinstance(dict_obj[key], list):
        dict_obj[key].append(value)
    else:
        dict_obj[key] = [dict_obj[key], value]
        
# Messy code to get the prediction into the correct format
def format_prediction(pred, old_pred):
    pred = [str(i) for i in pred[:12]]
    old_list = old_pred.split()
    del old_list[-len(pred):]
    old_list = old_list + pred
    pred =' '.join(old_list)
    return pred

The following baseline suggests articles that were frequently bought by the same customers who bought the customer's most recent article. The remaining slots are filled from the most-common-items benchmark.
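
The fill-in step can be illustrated on toy data. The helper below, `fill_prediction`, is a hypothetical name; it is meant to mirror the tail replacement done by `format_prediction` above, keeping the head of the fallback list and putting the paired articles in the last slots.

```python
def fill_prediction(paired, fallback, k=12):
    """Keep the head of `fallback` and place `paired` items in the last slots."""
    paired = [str(a) for a in paired[:k]]
    # Keep just enough fallback items so the total stays at k.
    head = fallback[:k - len(paired)]
    return ' '.join(head + paired)

# Stand-in for the 12 most common articles (placeholder IDs, not real ones).
popular = [f'p{i:02d}' for i in range(12)]
print(fill_prediction(['x1', 'x2'], popular))
```

Note that the metric weights earlier ranks more heavily, so whether the paired articles go first or last in the list is a design choice that affects the score.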

In [None]:
# Load in the data
transactions_train = pd.read_csv(DATA_PATH/'transactions_train.csv',
                                 dtype={'article_id': str},parse_dates=['t_dat'],infer_datetime_format=True
                                ,usecols=['t_dat', 'customer_id', 'article_id'])
### Adapted from: https://www.kaggle.com/hengzheng/time-is-our-best-friend
# The original kernel keeps only data after 2020-06-01:
# transactions['t_dat'] = pd.to_datetime(transactions['t_dat'])
# transactions = transactions[transactions['t_dat'] > pd.to_datetime('2020-06-01')]

# Here we keep transactions after 2020-02-01, plus all September transactions from earlier years
# (t_dat is already parsed as a datetime by read_csv above)
t_cut = pd.to_datetime('2020-02-01') ## originally '2020-06-01'
transactions_train = transactions_train.loc[(transactions_train['t_dat'] > t_cut) | (transactions_train['t_dat'].dt.month==9)]

In [None]:
%%time
# Select required columns
transactions_train = transactions_train[['customer_id', 'article_id']]

# Groupby customer ID and collect all the articles purchased into a list
transactions_train = transactions_train.groupby('customer_id')['article_id'].apply(list)

# Due to low memory, keep only the first 180,000 customers (originally 150,000) -- to be improved
transactions_train = transactions_train[:180000]

# Find all pairs of articles which are purchased by the same customer
transactions_train = Counter(chain.from_iterable(combinations(customer, 2) for customer in transactions_train.to_list()))

# Remove pairs of articles which do not occur frequently together
frequent_pairs = {k: v for k, v in transactions_train.items() if v > 5}

# Sort by frequency
sorted_pairs = {k: v for k, v in sorted(frequent_pairs.items(), key=lambda item: item[1], reverse=True)}

In [None]:
dict(list(sorted_pairs.items())[0: 5])

We can see that `sorted_pairs` is a dictionary mapping each article pair to its co-occurrence frequency, that is, the number of times the two articles appear together within a single customer's purchase history, summed over all customers.
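
The counting step is easier to follow on a toy example. The purchase histories below are invented; the `Counter` and `combinations` calls are the same ones used in the cell above.

```python
from collections import Counter
from itertools import chain, combinations

# Toy purchase histories, one list per customer (invented for illustration).
histories = [
    ['A', 'B', 'C'],
    ['A', 'B'],
    ['B', 'A'],
]

# combinations() keeps the order in which items appear in each list,
# so ('A', 'B') and ('B', 'A') are counted as different pairs.
pair_counts = Counter(chain.from_iterable(combinations(h, 2) for h in histories))
print(pair_counts)
```

Since pairs are ordered, the later lookup by the first element of each pair only finds articles that appeared after it in some customer's list.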

In [None]:
# Generate final pairing dictionary
final_pairs = {}
for k, v in sorted_pairs.items():
    add_value(final_pairs, k[0], k[1])

### generate 12 top myself
* based on:

In [None]:
# top_12_items = df.groupby('article_id')['customer_id'].nunique().sort_values(ascending=False).head(12).index.tolist()
top_12_items =  ['0706016001',
 '0372860001',
 '0706016002',
 '0610776002',
 '0759871002',
 '0372860002',
 '0464297007',
 '0720125001',
 '0673396002',
 '0610776001',
 '0673677002',
 '0706016003']

In [None]:
samp_sub = pd.read_csv('../input/h-and-m-personalized-fashion-recommendations/sample_submission.csv')
samp_sub['prediction'] =  ' '.join(top_12_items)

In [None]:
# # Load in the most-common-benchmark baseline submission ### private datasource
# samp_sub = pd.read_csv('../input/most-common-bench/most_common_b.csv')

# Convert to dictionary for speed
sub_dict = samp_sub.to_dict('records')

transactions_train = pd.read_csv(DATA_PATH/'transactions_train.csv', dtype={'article_id': str},usecols=['t_dat', 'customer_id', 'article_id'])
transactions_train = transactions_train[['t_dat', 'customer_id', 'article_id']]

# Sort by date
transactions_train['date'] =  pd.to_datetime(transactions_train["t_dat"])
transactions_train = transactions_train.sort_values(by="date")

customer_dict = dict(zip(transactions_train['customer_id'], transactions_train['article_id']))

In [None]:
dict(list(customer_dict.items())[0: 5])

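We can see that `customer_dict` maps each `customer_id` to the `article_id` of their most recent purchase. This works because the rows are sorted by date and `dict` keeps the last value seen for a repeated key.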
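
The sort-then-`dict(zip(...))` trick can be checked on a toy frame (the rows are invented for illustration):

```python
import pandas as pd

# Toy transactions, deliberately out of date order.
toy = pd.DataFrame({
    't_dat': ['2020-09-03', '2020-09-01', '2020-09-02'],
    'customer_id': ['c1', 'c1', 'c2'],
    'article_id': ['0333', '0111', '0222'],
})
toy['date'] = pd.to_datetime(toy['t_dat'])
toy = toy.sort_values(by='date')

# dict() keeps the last value seen for a repeated key, so after sorting
# by date each customer maps to their latest article.
latest = dict(zip(toy['customer_id'], toy['article_id']))
print(latest)  # {'c1': '0333', 'c2': '0222'}
```

When a customer buys several articles on the same final day, which of them ends up in the dictionary depends on the row order after sorting.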

In [None]:
preds = []
for i, row in tqdm(enumerate(sub_dict)):
    old_pred = row['prediction']
    customer = row['customer_id']
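    try:
        most_recent_article = customer_dict[customer]
        pred = final_pairs[most_recent_article]
        if isinstance(pred, str):
            # A single paired article: overwrite the last 10-character article ID
            preds.append(old_pred[:-10] + pred)
            continue
        preds.append(format_prediction(pred, old_pred))
    except KeyError:
        # No purchase history or no frequent pair: keep the fallback prediction
        preds.append(old_pred)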

In [None]:
samp_sub['prediction'] = preds
samp_sub.to_csv('submission.csv', index=False)