# <span style='color:#8A0808'>🥼H&M fast EDA and Memory reduction🎈</span>

# <span style='color:#8A0808'>📚Introduction</span>

## <span style='color:#4A0404'>🎯Goal</span>

In this competition, H&M Group invites you to develop product recommendations based on data from previous transactions, as well as from customer and product metadata.

## <span style='color:#4A0404'>💾Data</span>

For this challenge you are given the purchase history of customers across time, along with supporting metadata. Your challenge is to predict what articles each customer will purchase in the 7-day period immediately after the training data ends. Customers who did not make any purchase during that time are excluded from the scoring.

**Files**
* images/ - a folder of images corresponding to each article_id; images are placed in subfolders starting with the first three digits of the article_id; note, not all article_id values have a corresponding image.
* articles.csv - detailed metadata for each article_id available for purchase
* customers.csv - metadata for each customer_id in the dataset
* sample_submission.csv - a sample submission file in the correct format
* transactions_train.csv - the training data, consisting of the purchases of each customer on each date, as well as additional information. Duplicate rows correspond to multiple purchases of the same item. Your task is to predict the article_ids each customer will purchase during the 7-day period immediately after the training data period.

**NOTE**: You must make predictions for all customer_id values found in the sample submission. All customers who made purchases during the test period are scored, regardless of whether they had purchase history in the training data.

## <span style='color:#4A0404'>🔑Metric</span>

Submissions are evaluated according to the Mean Average Precision @ 12 (MAP@12):

## <span style='color:blue'>$MAP@12=\frac{1}{U}\sum_{u=1}^U \frac{1}{min(m,12)} \sum_{k=1}^{min(n,12)} P(k) \times rel(k)$</span>

where $U$ is the number of customers, $P(k)$ is the precision at cutoff $k$, $n$ is the number of predictions per customer, $m$ is the number of ground truth values per customer, and $rel(k)$ is an indicator function equaling 1 if the item at rank $k$ is a relevant (correct) label, zero otherwise.

**Notes**:

* You will be making purchase predictions for all customer_id values provided, regardless of whether these customers made purchases in the training data.
* Customers who did not make any purchase during the test period are excluded from the scoring.
* There is never a penalty for using the full 12 predictions for a customer that ordered fewer than 12 items; thus, it's advantageous to make 12 predictions for each customer.
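The metric can be sketched in a few lines of plain Python. This is a minimal illustration, not the official scorer: the function names `average_precision_at_k` and `map_at_k` are hypothetical, each customer's score is divided by min(m, 12) where m is the number of items actually bought (the usual AP@k normalisation), and a repeated prediction is assumed to count only once.

```python
def average_precision_at_k(actual, predicted, k=12):
    """AP@k for one customer; returns None if the customer bought nothing."""
    if not actual:
        return None  # such customers are excluded from scoring
    actual = set(actual)
    seen = set()
    hits = 0
    score = 0.0
    for rank, item in enumerate(predicted[:k], start=1):
        # Assumption: a duplicate prediction is only credited the first time.
        if item in actual and item not in seen:
            hits += 1
            score += hits / rank  # P(rank) * rel(rank)
        seen.add(item)
    return score / min(len(actual), k)


def map_at_k(actuals, predictions, k=12):
    """Mean of AP@k over customers with at least one purchase."""
    scores = [average_precision_at_k(a, p, k) for a, p in zip(actuals, predictions)]
    scores = [s for s in scores if s is not None]
    return sum(scores) / len(scores) if scores else 0.0


# Toy example: the third customer bought nothing and is ignored.
actuals = [['a', 'b'], ['a'], []]
predictions = [['a', 'x', 'b'], ['x', 'a'], ['a']]
print(map_at_k(actuals, predictions))
```

Because wrong guesses at later ranks never lower the sum, padding every customer's list to 12 predictions cannot hurt the score.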

# <span style='color:#8A0808'>⚡Fast EDA</span>

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.image as mpimg

import os

## Articles

There are 105542 articles

In [None]:
articles = pd.read_csv('../input/h-and-m-personalized-fashion-recommendations/articles.csv')
print(f'articles shape {articles.shape}:\n{articles.loc[0,:]}')

In [None]:
articles.info()

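Taking the first three digits of article_id gives 762 groups, while the images folder has only 86 subfolders. The two counts are probably not directly comparable: article_id is read as an integer here, so its leading zero is dropped, whereas the image subfolders (e.g. `010`) appear to be named after the zero-padded ID.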

In [None]:
print('Number of article groups:', articles.article_id.apply(lambda x: str(x)[:3]).nunique())
print('Number of image groups:', len(os.listdir('../input/h-and-m-personalized-fashion-recommendations/images')))
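To compare article groups with image folders on the same footing, the leading zero can be restored before taking the prefix. A minimal sketch on toy IDs, assuming article IDs are 10 characters wide once zero-padded (the width is an assumption, inferred from folder names such as `010`):

```python
import pandas as pd

# Toy IDs standing in for articles.article_id after a default pd.read_csv (parsed as integers).
article_id = pd.Series([108775015, 108775044, 110065001, 953763001])

# Prefix of the unpadded integer, as computed in the cell above.
raw_prefix = article_id.astype(str).str[:3]

# Prefix after restoring the leading zero (assumed width: 10 characters).
padded_prefix = article_id.astype(str).str.zfill(10).str[:3]

print(raw_prefix.tolist())
print(padded_prefix.tolist())
```

Reading the file with `dtype={'article_id': str}`, as done for the transactions later on, avoids the problem at the source.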

Show some articles from the same group

In [None]:
article_group = '010'
impath = f'/kaggle/input/h-and-m-personalized-fashion-recommendations/images/{article_group}'

plt.figure(figsize=(10,10))
for idx, file in enumerate(os.listdir(impath)[:4]):  # the 2x2 grid holds at most 4 images
    plt.subplot(2,2,idx+1)
    plt.imshow(mpimg.imread(f'{impath}/{file}'))

Why are the articles below in the same group?

In [None]:
article_group = '039'
impath = f'/kaggle/input/h-and-m-personalized-fashion-recommendations/images/{article_group}'

plt.figure(figsize=(10,10))
for idx, file in enumerate(os.listdir(impath)):
    plt.subplot(2,2,idx+1)
    plt.imshow(mpimg.imread(f'{impath}/{file}'))
    if idx>2: break

## Customers

There are 1371980 customers

In [None]:
customers = pd.read_csv('../input/h-and-m-personalized-fashion-recommendations/customers.csv')
print(f'customers shape {customers.shape}:\n{customers.loc[0,:]}')

Customer age has a bi-modal distribution (young/old)

In [None]:
plt.figure(figsize=(10,5))
customers.age.hist(bins=100);

## Transactions

There are 31788324 transactions in the train set

In [None]:
%%time
transactions = pd.read_csv('../input/h-and-m-personalized-fashion-recommendations/transactions_train.csv',
                           dtype={'article_id': str},
                           low_memory=True)
print(f'transactions shape {transactions.shape}:\n{transactions.loc[0,:]}')

The train set contains 104547 of the 105542 articles (99.06%)

In [None]:
transactions.article_id.nunique()/105542*100

The train set contains 1362281 of the 1371980 customers (99.29%)

In [None]:
transactions.customer_id.nunique()/1371980*100

There are two sales channels

In [None]:
transactions.sales_channel_id.nunique()

69.48% and 93.16% of all articles were sold through the first and second sales channels, respectively.

In [None]:
# Share of all articles (in %) sold at least once through each channel
print('Percentage of articles in the first sales channel:', transactions.article_id[transactions.sales_channel_id==1].nunique()/105542*100)
print('Percentage of articles in the second sales channel:', transactions.article_id[transactions.sales_channel_id==2].nunique()/105542*100)

53.73% and 80.79% of all customers bought through the first and second sales channels, respectively. Since 99.29% of customers appear in the train set, roughly 53.73 + 80.79 - 99.29 ≈ 35.2% of all customers must have bought via both channels.

In [None]:
# Share of all customers (in %) who bought at least once through each channel
print('Percentage of customers in the first sales channel:', transactions.customer_id[transactions.sales_channel_id==1].nunique()/1371980*100)
print('Percentage of customers in the second sales channel:', transactions.customer_id[transactions.sales_channel_id==2].nunique()/1371980*100)
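The overlap between the two channels can also be counted directly instead of being inferred from the two percentages. A minimal sketch on a toy frame (the data are made up; only the column names `customer_id` and `sales_channel_id` come from the transactions table):

```python
import pandas as pd

# Toy transactions: customers 'a' and 'c' use both channels, 'b' and 'd' use one.
toy = pd.DataFrame({
    'customer_id':      ['a', 'a', 'b', 'c', 'c', 'c', 'd'],
    'sales_channel_id': [1,   2,   1,   2,   2,   1,   2],
})

# Number of distinct channels each customer has bought through.
channels_per_customer = toy.groupby('customer_id')['sales_channel_id'].nunique()

# Share (in %) of transacting customers who used both channels.
share_both = (channels_per_customer == 2).mean() * 100
print(share_both)
```

Note that the denominator here is the number of customers with at least one transaction, not the full customer table used in the cell above.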

Price distribution (prices are scaled)

In [None]:
plt.figure(figsize=(10,5))
transactions.price.hist(bins=100);
plt.xlim(0,0.2)

## Submission

In [None]:
submission = pd.read_csv('../input/h-and-m-personalized-fashion-recommendations/sample_submission.csv')
print(f'submission shape {submission.shape}:\n{submission.loc[0,:]}')

In [None]:
submission.to_csv('submission.csv', index=False)

# <span style='color:#8A0808'>🎈Memory reduction</span>

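Convert the CSV files to pickle, parquet, and feather. These binary formats are typically faster to load than CSV and preserve column dtypes, so the data does not have to be re-parsed on every run.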

In [None]:
articles.to_pickle('articles.pkl')
customers.to_pickle('customers.pkl')
transactions.to_pickle('transactions_train.pkl')
submission.to_pickle('sample_submission.pkl')

In [None]:
articles.to_parquet('articles.parquet')
customers.to_parquet('customers.parquet')
transactions.to_parquet('transactions_train.parquet')
submission.to_parquet('sample_submission.parquet')

In [None]:
articles.to_feather('articles.feather')
customers.to_feather('customers.feather')
transactions.to_feather('transactions_train.feather')
submission.to_feather('sample_submission.feather')
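The format conversions above change how the data is stored on disk; the in-memory footprint depends on column dtypes. As a complementary technique (not part of the cells above), the sketch below downcasts a toy frame and compares `memory_usage`. The toy values and the chosen target dtypes are assumptions for illustration; whether `int32` or `category` is appropriate should be checked against the real value ranges.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 10_000

# Toy stand-in for the transactions table (values are made up).
toy = pd.DataFrame({
    'customer_id': rng.choice(['c%060d' % i for i in range(100)], size=n),
    'article_id': rng.choice(['0108775015', '0110065001', '0953763001'], size=n),
    'price': rng.random(n) * 0.2,
    'sales_channel_id': rng.integers(1, 3, size=n),
})
before = toy.memory_usage(deep=True).sum()

slim = toy.copy()
# Long repeated strings are stored once when the column is categorical.
slim['customer_id'] = slim['customer_id'].astype('category')
# Numeric IDs are far smaller than strings; the leading zero is lost here
# and has to be restored with zfill when writing IDs back out.
slim['article_id'] = pd.to_numeric(slim['article_id']).astype('int32')
slim['price'] = slim['price'].astype('float32')
slim['sales_channel_id'] = slim['sales_channel_id'].astype('int8')
after = slim.memory_usage(deep=True).sum()

print(f'before: {before} bytes, after: {after} bytes')

# The original zero-padded IDs can be recovered from the integers.
restored = slim['article_id'].astype(str).str.zfill(10)
```

Saving the downcast frame with `to_pickle`, `to_parquet`, or `to_feather` keeps these dtypes, which a round trip through CSV would not.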