This is a very simple EDA for the H&M Personalized Fashion Recommendations competition. I created it as a quick way to jump into the competition, and I hope you find it useful as a starting point for your own work. Enjoy and have fun in the competition!

<div align="center"><img src="https://i0.wp.com/hitechwiki.com/wp-content/uploads/2021/09/hm-home-collabore-avec-le-duo-de-creatrices-francaises-sacree-frangine-.jpg?fit=1200%2C675&ssl=1"></div>

# COMPETITION GOAL

🛍️ Competition Goal: In this competition, H&M Group invites you to develop product recommendations based on data from previous transactions, as well as from customer and product meta data. The available meta data spans from simple data, such as garment type and customer age, to text data from product descriptions, to image data from garment images. 

## Files
- `images/` - a folder of images corresponding to each article_id; images are placed in subfolders starting with the first three digits of the article_id; note, not all article_id values have a corresponding image.
- `articles.csv` - detailed metadata for each article_id available for purchase
- `customers.csv` - metadata for each customer_id in the dataset
- `sample_submission.csv` - a sample submission file in the correct format
- `transactions_train.csv` - the training data, consisting of the purchases of each customer on each date, as well as additional information. Duplicate rows correspond to multiple purchases of the same item. Your task is to predict the article_ids each customer will purchase during the 7-day period immediately after the training data period.
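
The image folder layout described above (subfolders named after the first three digits of the `article_id`) can be sketched as a small lookup helper. This is only an illustration: the function name `article_image_path` and the default root `images` are hypothetical, and the 10-character id length is an assumption based on the example path used later in this notebook.

```python
from pathlib import Path


def article_image_path(article_id, images_root="images"):
    """Return the expected image path for an article_id (hypothetical helper)."""
    # Assumption: ids are 10 characters long, so zfill restores a lost leading zero
    article_id = str(article_id).zfill(10)
    # The subfolder is named after the first three digits of the id
    return Path(images_root) / article_id[:3] / f"{article_id}.jpg"


print(article_image_path("0108775015").as_posix())
```

Because not every article has an image, the returned path still has to be checked for existence before it is opened.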



#### Column Descriptions [(source)](https://www.kaggle.com/c/h-and-m-personalized-fashion-recommendations/overview)


<br> 

| Column Name - customers.csv | Description |
|:--|:--|
| customer_id | A unique identifier of every customer | 
| FN | Whether the customer receives the Fashion News newsletter |
| Active | Whether the customer is active for communication |
| club_member_status | Status in club |
| fashion_news_frequency | How often H&M may send news to customer |
| age | The current age |
| postal_code | Postal code of customer|

<br>

| Column Name - transactions_train.csv | Description |
|:--|:--|
| t_dat | transaction date | 
| customer_id |  A unique identifier of every customer (in customers table) | 
| article_id | A unique identifier of every article (in articles table) |
| price | Price of purchase |
| sales_channel_id | Sales channel: 1 = store, 2 = online |

<br>
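
Since duplicate rows in `transactions_train.csv` stand for repeated purchases of the same item, a purchase quantity can be recovered by counting identical rows. The snippet below is a minimal sketch on made-up toy rows (the customer ids and the reduced column set are assumptions for illustration, not real data).

```python
import pandas as pd

# Toy transactions: the first two rows are identical, i.e. the item was bought twice
toy = pd.DataFrame({
    "t_dat": ["2020-09-01", "2020-09-01", "2020-09-01", "2020-09-02"],
    "customer_id": ["c1", "c1", "c1", "c2"],
    "article_id": ["0108775015", "0108775015", "0108775044", "0108775015"],
})

# Count how many times each (date, customer, article) combination occurs
quantity = (
    toy.groupby(["t_dat", "customer_id", "article_id"])
       .size()
       .rename("quantity")
       .reset_index()
)
print(quantity)
```

Dropping duplicates instead of counting them would silently discard these repeated purchases.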

| Column Name - articles.csv | Description |
|:--|:--|
| article_id | A unique identifier of every article | 
| product_code, prod_name | A unique identifier of every product and its name (not the same) |
| product_type_no, product_type_name | The product type (a group of product codes) and its name |
| product_group_name | The product group that the product type belongs to |
| graphical_appearance_no, graphical_appearance_name | The group of graphics and its name |
| colour_group_code, colour_group_name | The group of color and its name |
| perceived_colour_value_id, perceived_colour_value_name, perceived_colour_master_id, perceived_colour_master_name | The added color info |
| department_no, department_name | A unique identifier of every dep and its name |
| index_code, index_name | A unique identifier of every index and its name |
| index_group_no, index_group_name | A group of indices and its name | 
| section_no, section_name | A unique identifier of every section and its name | 
| garment_group_no, garment_group_name | A unique identifier of every garment and its name |
| detail_desc | Details | 

<br>


### Import Libraries

In [None]:
# Standard library
import calendar
import glob
import sys
import time
from collections import Counter
from pathlib import Path

# Third-party
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.image as mpimg
import matplotlib.patches as patches
import matplotlib.pyplot as plt
import plotly.express as px
import seaborn as sns
from IPython.display import HTML, display_html
from PIL import Image
from termcolor import colored
from tqdm import tqdm

In [None]:
# Environment check
import warnings
import os

warnings.filterwarnings("ignore")
os.environ["WANDB_SILENT"] = "true"
CONFIG = {'competition': 'HandM', '_wandb_kernel': 'aot'}

# Custom colors
class clr:
    S = '\033[1m' + '\033[95m'
    E = '\033[0m'
    
my_colors = ["#003f5c", "#2f4b7c", "#665191", "#a05195", "#d45087", "#f95d6a", "#ff7c43", "#ffa600"]
print(clr.S+"Notebook Color Scheme:"+clr.E)
sns.palplot(sns.color_palette(my_colors))
plt.show()


# 1. Dataset

🛍️ **There are 3 metadata .csv files and 1 image folder:**
* `images` - folder containing the photos of *almost* all `article_ids`
* `articles.csv` - description features of all `article_ids` **(105,542 datapoints)**
* `customers.csv` - description features of the customer profiles **(1,371,980 datapoints)**
* `transactions_train.csv` - file containing the `customer_id`, the article that was bought and at what price **(31,788,324 datapoints)**
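
The loading cell in this section reads `article_id` with `dtype=str`. A minimal sketch of why that matters, using an in-memory toy CSV (the product names are placeholders, not real data): when the id is parsed as an integer, the leading zero is lost and no longer matches the image file names.

```python
import io

import pandas as pd

# Toy CSV with the same id format as articles.csv (placeholder product names)
csv_text = "article_id,prod_name\n0108775015,toy item A\n0110065001,toy item B\n"

# Default parsing turns the ids into integers and drops the leading zero
as_int = pd.read_csv(io.StringIO(csv_text))
# Forcing str keeps the ids exactly as written in the file
as_str = pd.read_csv(io.StringIO(csv_text), dtype={"article_id": str})

print(as_int["article_id"].tolist())
print(as_str["article_id"].tolist())
```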

In [None]:
%%time
path = Path("/kaggle/input/h-and-m-personalized-fashion-recommendations/")

articles_df = pd.read_csv(path / "articles.csv", dtype = {'article_id': str})
cust_df = pd.read_csv(path / "customers.csv", dtype = {'customer_id': str})
trans_df = pd.read_csv(path / "transactions_train.csv", dtype = {'article_id': str,'customer_id': str})


### Getting an overview of the datasets
#### ---- Start the data cleaning----

#### **<span id="Articles" style="color:#023e8a;">Article dataset</span>**

In [None]:
# number_of_rows = len(articles_df)
# number_of_col = len(articles_df.columns)
# print(f'Number of rows in articles.csv: {number_of_rows}')
# print(f'Number of columns in articles.csv: {number_of_col}')


print(clr.S+"ARTICLES:"+clr.E, articles_df.shape)
display_html(articles_df.head(3).T)
print("\n", clr.S+"CUSTOMERS:"+clr.E, cust_df.shape)
display_html(cust_df.head(3).T)
print("\n", clr.S+"TRANSACTIONS:"+clr.E, trans_df.shape)
display_html(trans_df.head(3).T)
# print("\n", clr.S+"SAMPLE_SUBMISSION:"+clr.E, ss.shape)
# display_html(ss.head(3))

In [None]:
# --- Duplicate records ---
# How many fully duplicated rows does each table contain?
dup_rows = articles_df.duplicated().sum()
print(f'Number of duplicate records in articles.csv: {dup_rows}')
dup_rows = cust_df.duplicated().sum()
print(f'Number of duplicate records in customers.csv: {dup_rows}')
dup_rows = trans_df.duplicated().sum()
print(f'Number of duplicate records in transactions_train.csv: {dup_rows}')

# --- Drop duplicate records (disabled) ---
# Duplicate rows in transactions_train.csv are multiple purchases of the same item,
# so trans_df is deliberately left untouched.
# articles_df = articles_df.drop_duplicates()
# cust_df = cust_df.drop_duplicates()

# --- Missing values ---
# How many missing values are there?
# articles_df.isnull().sum()

### Functions

In [None]:
def adjust_id(x):
    '''Adjusts article ID code.'''
    x = str(x)
    if len(x) == 9:
        x = "0"+x
    
    return x


def show_values_on_bars(axs, h_v="v", space=0.4):
    '''Plots the value at the end of each bar of a seaborn barplot.
    axs: the ax (or array of axes) of the plot
    h_v: whether the barplot is vertical ("v") or horizontal ("h")
    space: horizontal offset of the label for horizontal bars'''
    
    def _show_on_single_plot(ax):
        if h_v == "v":
            for p in ax.patches:
                _x = p.get_x() + p.get_width() / 2
                _y = p.get_y() + p.get_height()
                value = int(p.get_height())
                ax.text(_x, _y, format(value, ','), ha="center") 
        elif h_v == "h":
            for p in ax.patches:
                _x = p.get_x() + p.get_width() + float(space)
                _y = p.get_y() + p.get_height()
                value = int(p.get_width())
                ax.text(_x, _y, format(value, ','), ha="left")

    if isinstance(axs, np.ndarray):
        for idx, ax in np.ndenumerate(axs):
            _show_on_single_plot(ax)
    else:
        _show_on_single_plot(axs)

# 2. Articles

## I. Preprocessing

🛍️ **Important Notes**:
* There are *more* `article_ids` than actual images:
    * unique article ids: 105,542
    * unique images: 105,100
* Checking every image `path` one by one was too slow. The fastest approach (about 1 second) is to collect all article ids found in the `images` folder into a `set()` (membership tests on a `set` are much faster than on a `list`) and then invalidate every path in `articles.csv` whose article id is not in that set.
* There are only 416 missing values in the `detail_desc` column (the product description).
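
The gap between article ids and images noted above can be made explicit with a set difference. The following is a rough sketch on toy ids and toy paths (both made up for illustration); on the real data the same idea should list the articles that have no picture.

```python
# Toy article ids and toy image paths (illustrative values only)
article_ids = ["0108775015", "0108775044", "0110065001", "0111565001"]
image_paths = [
    "images/010/0108775015.jpg",
    "images/010/0108775044.jpg",
    "images/011/0111565001.jpg",
]

# The file name without extension is the article id
image_ids = {p.split("/")[-1].split(".")[0] for p in image_paths}

# Articles that appear in the metadata but have no image
missing = sorted(set(article_ids) - image_ids)
print(len(article_ids), len(image_ids), missing)
```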

In [None]:
# Get all paths from the image folder

all_image_paths = glob.glob(f"/kaggle/input/h-and-m-personalized-fashion-recommendations/images/*/*")

print(clr.S+"Number of unique article_ids within articles.csv:"+clr.E, len(articles_df), "\n"+
      clr.S+"Number of unique images within the image folder:"+clr.E, len(all_image_paths), "\n"+
      clr.S+"=> not all article_ids have a corresponding image!!!"+clr.E, "\n")


# Get all valid article ids
# Use a set(): membership tests are much faster than on a list
all_image_ids = set()

# Loop variable is named image_path so it does not shadow the dataset `path`
for image_path in tqdm(all_image_paths):
    article_id = image_path.split('/')[-1].split('.')[0]
    all_image_ids.add(article_id)

In [None]:
print(clr.S+"There are no missing values in any columns but 'Detail Description':"+clr.E,
      articles_df["detail_desc"].isna().sum(), "total missing values")

# Replace missing values
articles_df.fillna(value="No Description", inplace=True)

# Make sure article_id is a 10-character string (adds the leading "0" if it was lost)
articles_df["article_id"] = articles_df["article_id"].apply(adjust_id)
# NOTE: product_code is overwritten with the first three digits of article_id,
# which is the name of the image subfolder
articles_df["product_code"] = articles_df["article_id"].apply(lambda x: x[:3])

In [None]:
# An image path example: ../input/h-and-m-personalized-fashion-recommendations/images/010/0108775015.jpg

# Create full path to the article image
images_path = "../input/h-and-m-personalized-fashion-recommendations/images/"
articles_df ["path"] = images_path + articles_df["product_code"] + "/" + articles_df["article_id"] + ".jpg"

# Set the path to None for every article that has no image in the folder
articles_df.loc[~articles_df["article_id"].isin(all_image_ids), "path"] = None

## II. Explore

In [None]:
import matplotlib as mpl
mpl.rcParams.update(mpl.rcParamsDefault)

data = articles_df
art_dtypes = articles_df.dtypes.value_counts()

fig = plt.figure(figsize=(8,2),facecolor='white')

ax0 = fig.add_subplot(1,1,1)
font = 'monospace'
ax0.text(1, 0.8, "Key figures",color='black',fontsize=28, fontweight='bold', fontfamily=font, ha='center')

ax0.text(0, 0.4, "{:,d}".format(len(articles_df)), color='#fcba03', fontsize=24, fontweight='bold', fontfamily=font, ha='center')
ax0.text(0, 0.001, "# of unique article_id \nin articles.csv",color='dimgrey',fontsize=15, fontweight='light', fontfamily=font,ha='center')

ax0.text(0.6, 0.4, "{:,d}".format(len(cust_df)), color='#fcba03', fontsize=24, fontweight='bold', fontfamily=font, ha='center')
ax0.text(0.6, 0.001, "# of unique customer_id \nin customers.csv",color='dimgrey',fontsize=15, fontweight='light', fontfamily=font,ha='center')

ax0.text(1.2, 0.4, "{:,d}".format(len(trans_df)), color='#fcba03', fontsize=24, fontweight='bold', fontfamily=font, ha='center')
ax0.text(1.2, 0.001, "# of transactions \nin transactions_train.csv",color='dimgrey',fontsize=15, fontweight='light', fontfamily=font, ha='center')


ax0.text(1.9, 0.4,"{:,d}".format(trans_df['customer_id'].nunique()), color='#fcba03', fontsize=24, fontweight='bold', fontfamily=font, ha='center')
ax0.text(1.9, 0.001,"# of unique customer_id \nin transactions_train.csv",color='dimgrey',fontsize=15, fontweight='light', fontfamily=font,ha='center')

ax0.set_yticklabels('')
ax0.tick_params(axis='y',length=0)
ax0.tick_params(axis='x',length=0)
ax0.set_xticklabels('')

for direction in ['top','right','left','bottom']:
    ax0.spines[direction].set_visible(False)

fig.subplots_adjust(top=0.9, bottom=0.2, left=0, hspace=1)

fig.patch.set_linewidth(5)
fig.patch.set_edgecolor('#8c8c8c')
fig.patch.set_facecolor('#f6f6f6')
ax0.set_facecolor('#f6f6f6')
    
plt.show()

#### **<span style="color:#023e8a;">Articles dataset</span>**
#### Descriptive statistics of the articles dataset

In [None]:
print(clr.S+"Number of unique article_ids within articles.csv:"+clr.E, len(articles_df), "\n"+
      clr.S+"Number of unique images within the image folder:"+clr.E, len(all_image_paths), "\n")

#### **<span style="color:#023e8a;">Customer dataset</span>**
#### Descriptive statistics of the customer dataset

In [None]:
n = cust_df.customer_id.nunique()
print(f'Number of customers: {n}')
n = cust_df.age.mean()
print(f'Average age of customer: {n}')

In [None]:
cust_df.iloc[:, :-1].describe().T.sort_values(by='std' , ascending = False)\
                     .style.background_gradient(cmap='GnBu')\
                     .bar(subset=["max"], color='#F8766D')\
                     .bar(subset=["mean",], color='#00BFC4')

#### **<span style="color:#023e8a;">Transaction dataset</span>**

In [None]:
number_of_rows = len(trans_df)
number_of_col = len(trans_df.columns)
print(f"Number of rows in transactions_train.csv: {colored(number_of_rows, 'yellow')}")
print(f'Number of columns in transactions_train.csv: {number_of_col}')

#### Join tables

In [None]:
#Extract only the column needed
articles_df = articles_df[['article_id', 'product_type_name','product_group_name','colour_group_name','prod_name','path','graphical_appearance_name']]
cust_df = cust_df[['customer_id', 'club_member_status','fashion_news_frequency','age']]


In [None]:
df = pd.merge(trans_df, cust_df, on='customer_id', how='left')
del trans_df,cust_df

In [None]:
df = pd.merge(df, articles_df, on='article_id', how='left')
# del articles_df

In [None]:
df.info()

In [None]:
time.sleep(30)

#### Consistent formatting - Date
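
One detail worth keeping in mind when formatting dates as text: a year-month key built from the plain month number is not zero-padded, so it does not sort chronologically as a string. The sketch below, on two toy dates, compares that style with a zero-padded key from `strftime`. It is only an illustration of the trade-off, not a change to the notebook's own `YYYY_MM` column.

```python
import pandas as pd

dates = pd.to_datetime(pd.Series(["2019-09-01", "2019-10-01"]))

# Unpadded key: "2019_9", "2019_10"
unpadded = (dates.dt.year.astype(str) + "_" + dates.dt.month.astype(str)).tolist()
# Zero-padded key: "2019_09", "2019_10"
padded = dates.dt.strftime("%Y_%m").tolist()

# String sorting puts "2019_10" before "2019_9", but keeps the padded keys in order
print(sorted(unpadded))
print(sorted(padded))
```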

In [None]:
import calendar
df['t_dat'] = pd.to_datetime(df['t_dat'])
df['YYYY_MM'] = df['t_dat'].dt.year.astype(str) + '_' + df['t_dat'].dt.month.astype(str)
df['year'] = df['t_dat'].dt.year
# df['month'] = df['t_dat'].dt.month
# df['month_'] = df['month'].apply(lambda x: calendar.month_abbr[x])

In [None]:
count_df = df[['t_dat', 'customer_id','article_id']]
count_df = count_df.groupby(['t_dat', 'customer_id']).size().rename('quantity').reset_index()

In [None]:
# Printing minimum and the maximum date from dataset.
print(f'The time range of the transaction csv: From {df.t_dat.min():%Y-%m-%d} To {df.t_dat.max():%Y-%m-%d}')
# print(f'Number of unique customers: {df.customer_id.nunique()}')
# print(f'Number of unique items: {df.article_id.nunique()}')

print(f'Average purchase quantity per interaction: {int(count_df.quantity.mean())}')
print(f'Minimum purchase quantity per interaction: {count_df.quantity.min()}')
print(f'Maximum purchase quantity per interaction: {count_df.quantity.max()}')

## Sales of H&M per month (Sep 2018 - Sep 2020)
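
As a lightweight alternative to the plotting cells in this section, monthly sales can be aggregated with a period-based `groupby`. This is a minimal sketch on toy rows (the prices and dates are made up); the column names follow `transactions_train.csv`.

```python
import pandas as pd

# Toy transactions (illustrative values only)
toy = pd.DataFrame({
    "t_dat": pd.to_datetime(["2018-09-20", "2018-09-25", "2018-10-02", "2018-10-30"]),
    "price": [0.05, 0.03, 0.02, 0.04],
    "sales_channel_id": [2, 1, 2, 2],
})

# Calendar month of every transaction
month = toy["t_dat"].dt.to_period("M")

# Total sales per month
monthly = toy.groupby(month)["price"].sum()

# Total sales per month and sales channel, one column per channel
monthly_by_channel = (
    toy.groupby([month, "sales_channel_id"])["price"].sum().unstack(fill_value=0)
)
print(monthly)
print(monthly_by_channel)
```

Aggregating first and plotting the small result afterwards keeps memory use low compared with plotting the full transaction table.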

In [None]:
# import plotly.express as px
# dfg = df[['t_dat','price']]
# dfg = df.groupby(df['t_dat']).agg({'price':sum}).reset_index()
# dfg
# fig = px.line(dfg, x="t_dat", y="price"
#               ,hover_data={"t_dat": "|%B %d, %Y"}
#              ,template = "plotly_white"
#              )
# fig.update_layout(
#     title="Sale per month (Sep2018-Sep2020)"
#     ,xaxis_title="year_month"
#     ,yaxis_title="Sale"

# )

# fig.update_xaxes(
#     dtick="M1",
#     tickformat="%b\n%Y",
#     ticklabelmode="period")

# fig.show()

In [None]:
# import plotly.express as px
# dfg = df.groupby(['t_dat','sales_channel_id']).agg({'price':sum}).reset_index()
# dfg
# fig = px.line(dfg, x="t_dat", y="price",
#               color='sales_channel_id'
#               ,hover_data={"t_dat": "|%B %d, %Y"}
#              ,template = "plotly_white"
#              )
# fig.update_layout(
#     title="Sale per month by sale channel(Sep2018-Sep2020)"
#     ,xaxis_title="year_month"
#     ,yaxis_title="Sale"

# )

# fig.update_xaxes(
#     dtick="M1",
#     tickformat="%b\n%Y",
#     ticklabelmode="period")

# fig.show()

In [None]:
# dfg = df[['t_dat','price']]
# dfg = df.groupby(df['t_dat'].dt.strftime('%Y_%b')).agg({'price':sum}).reset_index()
# dfg
# fig = px.line(dfg, x="t_dat", y="price"
#               ,hover_data={"t_dat": "|%B, %Y"}
#               ,markers=True
# #               ,color_discrete_sequence=px.colors.diverging.PRGn
#              ,template = "plotly_white"
#              )
# fig.update_layout(
#     title="Sale per month (Sep2018-Sep2020)"
#     ,xaxis_title="year_month"
#     ,yaxis_title="Sale"

# )

# fig.update_xaxes(
#     dtick="M1",
#     tickformat="%b\n%Y",
#     ticklabelmode="period")

# fig.show()

In [None]:
# Top Prod Category

In [None]:
# #Top 10 Prod Category
# taba = pd.crosstab(df.product_group_name, df.YYYY_MM, values=df.price, aggfunc='sum').round(0)
# taba = taba.sort_values(by='2020_9',ascending=False)
# tabb = pd.crosstab(df.product_group_name, df.YYYY_MM, values=df.price, aggfunc='sum',normalize='columns').round(4)*100

# tab = (
#    pd.concat([taba,tabb],axis = 1, keys = ['sum','%'])
#    .swaplevel(axis = 1)
#    .sort_index(axis = 1, ascending=[True, False])
#    .rename_axis(['YYYY_MM', 'product_group_name'], axis = 1)
# )

# tab


### Compare the % of total sales by year.
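
The percentage share of each product group within a period can be computed by dividing each group total by the period total. Below is a rough sketch on toy numbers, using the year as the period as in the commented cell in this section; the values are made up for illustration.

```python
import pandas as pd

# Toy sales (illustrative values only)
toy = pd.DataFrame({
    "year": [2019, 2019, 2019, 2020, 2020],
    "product_group_name": ["Shoes", "Bags", "Shoes", "Shoes", "Bags"],
    "price": [30.0, 10.0, 10.0, 20.0, 20.0],
})

# Sales per year and product group
sales = toy.groupby(["year", "product_group_name"])["price"].sum()

# Divide by the yearly total to get the share in percent
share = (
    (100 * sales / sales.groupby(level="year").transform("sum"))
    .rename("percentage")
    .reset_index()
)
print(share)
```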

In [None]:
# cross_tab_prop = df.loc[df['year'] >= 2019]
# cross_tab_prop = cross_tab_prop[['product_group_name', 'year','price']]

# cross_tab_prop = cross_tab_prop.groupby(['year','product_group_name'])['price'].sum()
# cross_tab_prop = cross_tab_prop.groupby(level=0).apply(lambda x:100 * x / float(x.sum())).rename('percentage')

# cross_tab_prop = cross_tab_prop.to_frame().reset_index()

In [None]:
# fig=px.bar(cross_tab_prop
#             ,x='percentage'
#             ,y='year'
#             ,color = 'product_group_name'
#             , orientation='h'
#             , barmode = 'stack'
#             ,color_discrete_sequence=px.colors.diverging.PRGn
#             ,text=cross_tab_prop['percentage'].map('{:,.2f}%'.format)
#            )

# fig.update_layout(title = "Percentage share of product category", 
#      template = 'simple_white', xaxis_title = '%', 
#      yaxis_title = 'year',
#     legend_title_text='product_group_name')

# fig


In [None]:
# cross_tab_prop = df.groupby([df['t_dat'].dt.strftime('%Y_%b'),'product_group_name']).agg({'price':sum}).reset_index()
# fig = px.line(cross_tab_prop, x="t_dat", y="price"
#               ,hover_data={"t_dat": "|%Y_%b"}
# #               ,markers=True
#               ,color='product_group_name'
#               ,color_discrete_sequence=px.colors.diverging.PRGn
#              ,template = "plotly_white"
#              )
# fig.update_layout(
#     title="Sale per month (Sep2018-Sep2020)"
#     ,xaxis_title="year_month"
#     ,yaxis_title="Sale"
# )

# fig.update_xaxes(
#     dtick="M1",
#     tickformat="%b\n%Y",
#     ticklabelmode="period")

# fig.show()

In [None]:

# fig = px.bar(cross_tab_prop, x="t_dat", y="price",color='product_group_name'
#              ,hover_data={"t_dat": "|%Y_%b"}
# #             ,markers=True
#               ,color_discrete_sequence=px.colors.diverging.PRGn
#              ,template = "plotly_white"
#             )
# fig.update_layout(
#     title="Sale per month by product category(Sep2018-Sep2020)"
#     ,xaxis_title="Year"
#     ,yaxis_title="Sale"
#     ,legend_title_text='product_group_name'
# )


# fig.show()

### Most Frequent Product Names

In [None]:
# print(clr.S+"Total Number of unique Product Names:"+clr.E, df["prod_name"].nunique())

# # Data
# prod_name = df["prod_name"].value_counts().reset_index().head(15)
# total_prod_names = df["prod_name"].nunique()
# clrs = ["#CB2170" if x==max(prod_name["prod_name"]) else '#954E93' for x in prod_name["prod_name"]]

# # Get images
# prod_name_images = articles_df[articles_df["prod_name"].isin(prod_name["index"].tolist())].groupby("prod_name")["path"].first().reset_index()
# image_paths = prod_name_images["path"].tolist()
# image_names = prod_name_images["prod_name"].tolist()

# # Plot
# fig, ax = plt.subplots(figsize=(25, 13))
# plt.title('- Most Frequent Product Names -', size=22, weight="bold")

# sns.barplot(data=prod_name, x="prod_name", y="index", ax=ax,
#             palette=clrs)
# x0,x1 = ax.get_xlim()
# y0,y1 = ax.get_ylim()
# plt.imshow(bk_image, zorder=0, extent=[x0, x1, y0, y1], alpha=0.35, aspect='auto')

# show_values_on_bars(axs=ax, h_v="h", space=0.4)
# plt.ylabel("Product Name", size = 16, weight="bold")
# plt.xlabel("")
# plt.xticks([])
# plt.yticks(size=16)
# plt.tick_params(size=16)

# insert_image(path='../input/hm-fashion-recommender-dataset/pics/dragonfly.jpg', zoom=0.45, xybox=(92, 11), ax=ax)

# sns.despine(left=True, bottom=True)
# plt.show();

# print("\n")

# # Plot
# fig, axs = plt.subplots(3, 5, figsize=(23, 8))
# fig.suptitle('- Example Images -', size=22, weight="bold")
# axs = axs.flatten()

# for k, (path, name) in enumerate(zip(image_paths, image_names)):
#     axs[k].set_title(f"{name}", size = 16)
#     img = plt.imread(path)
#     axs[k].imshow(img)
#     axs[k].axis("off")

# plt.tight_layout()
# plt.show()

#### Find the Monthly Top 12 Articles
Customers without any transaction history give a model nothing to learn from, so I would recommend them the latest monthly Top 12 items. The idea comes from https://www.kaggle.com/negoto/best-selling-items-catalog-like-eda-of-articles
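
The popularity fallback described above can be sketched as follows. The helper name `top_recent_articles`, the cutoff date and the toy article ids are all hypothetical; this is one simple way to build such a list, not necessarily the approach of the linked notebook.

```python
import pandas as pd


def top_recent_articles(transactions, since, n=12):
    """Return the n most purchased article_ids on or after `since` (hypothetical helper)."""
    recent = transactions[transactions["t_dat"] >= pd.Timestamp(since)]
    # value_counts sorts by frequency, most frequent first
    return recent["article_id"].value_counts().head(n).index.tolist()


# Toy transactions: "A" sold well, but only before the cutoff
toy = pd.DataFrame({
    "t_dat": pd.to_datetime([
        "2020-08-30", "2020-08-31", "2020-08-31",
        "2020-09-02", "2020-09-10", "2020-09-10",
        "2020-09-20", "2020-09-21", "2020-09-22",
    ]),
    "article_id": ["A", "A", "A", "B", "C", "B", "C", "B", "D"],
})

print(top_recent_articles(toy, since="2020-09-01", n=2))
```

The resulting list can be recommended to every customer who has no purchases in the training data.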

In [None]:
monthly_df = df.query("'2020-9-1' <= t_dat")
weekly_df = df.query("'2020-9-16' <= t_dat")

In [None]:
time.sleep(30)

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from collections import Counter
from PIL import Image
from pathlib import Path


def show_images(article_ids, cols=1, rows=-1):
    '''Show the images of the given article_ids in a grid with `cols` columns.'''
    if isinstance(article_ids, (int, str)):
        article_ids = [article_ids]
    article_count = len(article_ids)
    if rows < 0:
        rows = -(-article_count // cols)  # ceiling division: just enough rows
    plt.figure(figsize=(3 + 3.5 * cols, 3 + 5 * rows))
    for i in range(article_count):
        # Restore the leading zero if the id was parsed as an integer
        article_id = str(article_ids[i]).zfill(10)
        plt.subplot(rows, cols, i + 1)
        plt.axis('off')
        plt.title(article_id)
        try:
            image = Image.open(f"/kaggle/input/h-and-m-personalized-fashion-recommendations/images/{article_id[:3]}/{article_id}.jpg")
            plt.imshow(image)
        except OSError:
            # Not every article_id has a (readable) image; leave the panel empty
            pass


# Count the sales of every article and attach them to articles_df (0 if never sold)
sales_counts = df["article_id"].value_counts()
articles_df["sales_count"] = articles_df["article_id"].map(sales_counts).fillna(0).astype(int)

# monthly_sales_counts = Counter(monthly_df.article_id)
# for i in range(len(df)):
#     df.at[i, "monthly_sales_count"] = monthly_sales_counts[df.at[i, "article_id"]]
    
# weekly_sales_counts = Counter(weekly_df.article_id)
# for i in range(len(df)):
#     df.at[i, "weekly_sales_count"] = weekly_sales_counts[df.at[i, "article_id"]]

In [None]:
articles_df = articles_df.sort_values(by="sales_count", ascending=False)
temp = articles_df.article_id[:12]
show_images(list(temp), 6)

In [None]:
print(clr.S+"Total Number of unique Product Types:"+clr.E, articles_df["product_type_name"].nunique())

# Data
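# Name the columns explicitly ("index" = category, "product_type_name" = count) so the result is the same across pandas versions
prod_type = articles_df["product_type_name"].value_counts().rename_axis("index").reset_index(name="product_type_name").head(15)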
total_prod_types = articles_df["product_type_name"].nunique()
clrs = ["#00BDE3" if x==max(prod_type["product_type_name"]) else '#398BBB' for x in prod_type["product_type_name"]]

# Get images
prod_type_images = articles_df[articles_df["product_type_name"].isin(prod_type["index"].tolist())].groupby("product_type_name")["path"].first().reset_index()
image_paths = prod_type_images["path"].tolist()
image_names = prod_type_images["product_type_name"].tolist()

# Plot
fig, ax = plt.subplots(figsize=(25, 13))
plt.title('- Most Frequent Product Types -', size=22, weight="bold")

sns.barplot(data=prod_type, x="product_type_name", y="index", ax=ax,
            palette=clrs)
x0,x1 = ax.get_xlim()
y0,y1 = ax.get_ylim()
# plt.imshow(bk_image, zorder=0, extent=[x0, x1, y0, y1], alpha=0.35, aspect='auto')

show_values_on_bars(axs=ax, h_v="h", space=0.4)
plt.ylabel("Product Type", size = 16, weight="bold")
plt.xlabel("")
plt.xticks([])
plt.yticks(size=16)
plt.tick_params(size=16)

# insert_image(path='../input/hm-fashion-recommender-dataset/pics/blue.jpg', zoom=0.45, xybox=(11000, 11), ax=ax)

sns.despine(left=True, bottom=True)
plt.show();

print("\n")

# Plot
fig, axs = plt.subplots(3, 5, figsize=(23, 8))
fig.suptitle('- Example Images -', size=22, weight="bold")
axs = axs.flatten()

for k, (path, name) in enumerate(zip(image_paths, image_names)):
    axs[k].set_title(f"{name}", size = 16)
    img = plt.imread(path)
    axs[k].imshow(img)
    axs[k].axis("off")

plt.tight_layout()
plt.show()

In [None]:
print(clr.S+"Total Number of unique Product Group:"+clr.E, articles_df["product_group_name"].nunique())

# Data
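# Name the columns explicitly ("index" = category, "product_group_name" = count) so the result is the same across pandas versions
prod_group = articles_df["product_group_name"].value_counts().rename_axis("index").reset_index(name="product_group_name")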
total_prod_groups = articles_df["product_group_name"].nunique()
clrs = ["#E90B60" if x==max(prod_group["product_group_name"]) else '#AF0848' for x in prod_group["product_group_name"]]

# Get images
prod_group_images = articles_df[articles_df["product_group_name"].isin(prod_group["index"].tolist())].groupby("product_group_name")["path"].first().reset_index()
image_paths = prod_group_images["path"].tolist()
image_names = prod_group_images["product_group_name"].tolist()

# Plot
fig, ax = plt.subplots(figsize=(25, 13))
plt.title('- Most Frequent Product Groups -', size=22, weight="bold")

sns.barplot(data=prod_group, x="product_group_name", y="index", ax=ax,
            palette=clrs)
x0,x1 = ax.get_xlim()
y0,y1 = ax.get_ylim()
# plt.imshow(bk_image, zorder=0, extent=[x0, x1, y0, y1], alpha=0.35, aspect='auto')

show_values_on_bars(axs=ax, h_v="h", space=0.4)
plt.ylabel("Product Group", size = 16, weight="bold")
plt.xlabel("")
plt.xticks([])
plt.yticks(size=16)
plt.tick_params(size=16)

# insert_image(path='../input/hm-fashion-recommender-dataset/pics/chloe.jpg', zoom=0.45, xybox=(40000, 14), ax=ax)

sns.despine(left=True, bottom=True)
plt.show();

print("\n")

# Plot
fig, axs = plt.subplots(4, 6, figsize=(23, 10))
fig.suptitle('- Example Images -', size=22, weight="bold")
axs = axs.flatten()

for k, (path, name) in enumerate(zip(image_paths, image_names)):
    axs[k].set_title(f"{name}", size = 16)
    img = plt.imread(path)
    axs[k].imshow(img)
    axs[k].axis("off")

for a in [-1, -2, -3, -4, -5]: axs[a].set_visible(False)
plt.tight_layout()
plt.show()

In [None]:
def change_color(x):
    '''Change color name.'''
    if ("light" in x.lower().strip()) or \
        ("dark" in x.lower().strip()) or \
        ("greyish" in x.lower().strip()) or \
        ("yellowish" in x.lower().strip()) or \
        ("greenish" in x.lower().strip()) or \
        ("off" in x.lower().strip()) or \
        ("other" in x.lower().strip()):
        x = x.split(" ")[-1]
        
    return x

articles_df["colour_group_name"] = articles_df["colour_group_name"].apply(lambda x: change_color(x))

In [None]:
# Appearance and color
print(clr.S+"Total Number of unique Product Appearances:"+clr.E, articles_df["graphical_appearance_name"].nunique())
print(clr.S+"Total Number of unique Product Colors (after preprocess):"+clr.E, articles_df["colour_group_name"].nunique())

# --- Data 1 ---
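# Name the columns explicitly ("index" = category, "graphical_appearance_name" = count) so the result is the same across pandas versions
prod_appearance = articles_df["graphical_appearance_name"].value_counts().rename_axis("index").reset_index(name="graphical_appearance_name").head(15)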
total_prod_appearances = articles_df["graphical_appearance_name"].nunique()
clrs1 = ["#AF0848" if x==max(prod_appearance["graphical_appearance_name"]) else '#E90B60' for x in prod_appearance["graphical_appearance_name"]]


# Get images
prod_appearance_images = articles_df[articles_df["graphical_appearance_name"].isin(prod_appearance["index"].tolist())].groupby("graphical_appearance_name")["path"].first().reset_index()
image_paths1 = prod_appearance_images["path"].tolist()
image_names1 = prod_appearance_images["graphical_appearance_name"].tolist()

# --- Data 2 ---
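# Name the columns explicitly ("index" = category, "colour_group_name" = count) so the result is the same across pandas versions
prod_color = articles_df["colour_group_name"].value_counts().rename_axis("index").reset_index(name="colour_group_name").head(15)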
total_prod_color = articles_df["colour_group_name"].nunique()
clrs2 = ["#CB2170" if x==max(prod_color["colour_group_name"]) else '#954E93' for x in prod_color["colour_group_name"]]

# Get images
prod_color_images = articles_df[articles_df["colour_group_name"].isin(prod_color["index"].tolist())].groupby("colour_group_name")["path"].first().reset_index()
image_paths2 = prod_color_images["path"].tolist()
image_names2 = prod_color_images["colour_group_name"].tolist()

# Plot
fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, figsize=(25, 13))

ax1.set_title('- Most Frequent Product Appearances -', size=22, weight="bold")
sns.barplot(data=prod_appearance, x="graphical_appearance_name", y="index", ax=ax1,
            palette=clrs1)
x0,x1 = ax1.get_xlim()
y0,y1 = ax1.get_ylim()
# ax1.imshow(bk_image, zorder=0, extent=[x0, x1, y0, y1], alpha=0.35, aspect='auto')

show_values_on_bars(axs=ax1, h_v="h", space=0.4)
ax1.set_ylabel("Product Appearance", size = 16, weight="bold")
ax1.set_xlabel("")
ax1.set_xticks([])
# ax1.set_yticks(size=16)
# ax1.set_tick_params(size=16)

# insert_image(path='../input/hm-fashion-recommender-dataset/pics/blue.jpg', zoom=0.45, xybox=(11000, 11), ax=ax1)


ax2.set_title('- Most Frequent Product Colors -', size=22, weight="bold")
sns.barplot(data=prod_color, x="colour_group_name", y="index", ax=ax2,
            palette=clrs2)
x0,x1 = ax2.get_xlim()
y0,y1 = ax2.get_ylim()
# ax2.imshow(bk_image, zorder=0, extent=[x0, x1, y0, y1], alpha=0.35, aspect='auto')

show_values_on_bars(axs=ax2, h_v="h", space=0.4)
ax2.set_ylabel("Product Colors", size = 16, weight="bold")
ax2.set_xlabel("")
ax2.set_xticks([])
# ax1.set_yticks(size=16)
# ax1.set_tick_params(size=16)

# insert_image(path='../input/hm-fashion-recommender-dataset/pics/blue.jpg', zoom=0.45, xybox=(11000, 11), ax=ax1)

sns.despine(left=True, bottom=True)
plt.show();

print("\n")

# Plot
fig, axs = plt.subplots(3, 5, figsize=(23, 8))
fig.suptitle('- Example Images [Appearance] -', size=22, weight="bold")
axs = axs.flatten()

for k, (path, name) in enumerate(zip(image_paths1, image_names1)):
    axs[k].set_title(f"{name}", size = 16)
    img = plt.imread(path)
    axs[k].imshow(img)
    axs[k].axis("off")

plt.tight_layout()
plt.show()

# Plot
fig, axs = plt.subplots(3, 5, figsize=(23, 8))
fig.suptitle('- Example Images [Color] -', size=22, weight="bold")
axs = axs.flatten()

for k, (path, name) in enumerate(zip(image_paths2, image_names2)):
    axs[k].set_title(f"{name}", size = 16)
    img = plt.imread(path)
    axs[k].imshow(img)
    axs[k].axis("off")

plt.tight_layout()
plt.show()

### Mean price by category
Now check how the mean price changes over time for the top 5 product groups by mean price:

- Shoes
- Garment Full body
- Bags
- Garment Lower body
- Underwear/nightwear
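
The list above is hard-coded in the next cells. One way such a ranking could be derived is to average the price per product group and keep the largest values, sketched here on toy prices (illustrative numbers only, and `nlargest(2)` instead of 5 to keep the example small).

```python
import pandas as pd

# Toy data (illustrative values only)
toy = pd.DataFrame({
    "product_group_name": ["Shoes", "Shoes", "Bags", "Socks & Tights", "Accessories", "Bags"],
    "price": [0.06, 0.04, 0.03, 0.01, 0.02, 0.05],
})

# Mean price per product group, highest first
mean_price = toy.groupby("product_group_name")["price"].mean()
top_groups = mean_price.nlargest(2).index.tolist()
print(top_groups)
```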

In [None]:
time.sleep(30)

In [None]:
product_list = ['Shoes', 'Garment Full body', 'Bags', 'Garment Lower body', 'Underwear/nightwear']
colors = ['cadetblue', 'orange', 'mediumspringgreen', 'tomato', 'lightseagreen']
k = 0
f, ax = plt.subplots(3, 2, figsize=(20, 15))
for i in range(3):
    for j in range(2):
        try:
            product = product_list[k]
            articles_for_merge_product = df[df.product_group_name == product_list[k]]
            series_mean = articles_for_merge_product[['t_dat', 'price']].groupby(pd.Grouper(key="t_dat", freq='M')).mean().fillna(0)
            series_std = articles_for_merge_product[['t_dat', 'price']].groupby(pd.Grouper(key="t_dat", freq='M')).std().fillna(0)
            ax[i, j].plot(series_mean, linewidth=4, color=colors[k])
            ax[i, j].fill_between(series_mean.index, (series_mean.values-2*series_std.values).ravel(), 
                             (series_mean.values+2*series_std.values).ravel(), color=colors[k], alpha=.1)
            ax[i, j].set_title(f'Mean {product_list[k]} price in time')
            ax[i, j].set_xlabel('month')
            ax[i, j].set_ylabel(f'{product_list[k]} price')
            k += 1
        except IndexError:
            ax[i, j].set_visible(False)
plt.show()

### SHOW MOST COMMON WORDS IN DESCRIPTION
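
Besides a word cloud, the most common words can also be inspected as plain counts. The sketch below uses a few made-up descriptions and a tiny hand-written stopword set (both are assumptions for illustration); on the real data the `detail_desc` column would be used instead.

```python
import re
from collections import Counter

# Toy descriptions (illustrative only)
descriptions = [
    "Jersey top with narrow shoulder straps.",
    "Top in soft cotton jersey.",
    "Trousers in stretch cotton twill.",
]
# Minimal stopword set for the example
stopwords = {"with", "in", "a", "the", "and"}

# Lowercase, keep alphabetic tokens only, drop stopwords
tokens = [
    word
    for text in descriptions
    for word in re.findall(r"[a-z]+", text.lower())
    if word not in stopwords
]
word_counts = Counter(tokens)
print(word_counts.most_common(3))
```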

In [None]:
time.sleep(30)
del articles_df
del df

In [None]:
%%time
path = Path("/kaggle/input/h-and-m-personalized-fashion-recommendations/")

articles_df = pd.read_csv(path / "articles.csv", dtype = {'article_id': str})
trans_df = pd.read_csv(path / "transactions_train.csv", dtype = {'article_id': str,'customer_id': str})

#Extract only the column needed
articles_df = articles_df[['article_id','detail_desc']]

df = pd.merge(trans_df, articles_df, on='article_id', how='left')
# del articles_df

In [None]:
prod_desc = articles_df[articles_df.detail_desc.notnull()].detail_desc.sample(5000).values

In [None]:
from wordcloud import WordCloud, STOPWORDS

stopwords = set(STOPWORDS) 
wordcloud = WordCloud(width = 800, 
                      height = 800,
                      background_color ='white',
                      min_font_size = 10,
                      stopwords = stopwords,).generate(' '.join(prod_desc)) 

# plot the WordCloud image                        
plt.figure(figsize = (8, 8), facecolor = None) 
plt.imshow(wordcloud) 
plt.axis("off") 
plt.tight_layout(pad = 0) 

plt.show() 
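
As a numeric complement to the word cloud, the most common words can also be counted directly. The sketch below is a minimal illustration, assuming a tiny hypothetical stop-word set and made-up descriptions; `top_words` is not part of the notebook.

```python
import re
from collections import Counter

# Made-up descriptions standing in for the detail_desc column
descriptions = [
    "Top in soft cotton jersey.",
    "Jersey dress with a round neck.",
    "Trousers in cotton twill.",
]

# Hypothetical, deliberately tiny stop-word set for the example
stop_words = {"a", "in", "with"}

def top_words(texts, n=10):
    """Return the n most frequent lowercase words, ignoring stop words."""
    counts = Counter()
    for text in texts:
        words = re.findall(r"[a-z]+", text.lower())
        counts.update(w for w in words if w not in stop_words)
    return counts.most_common(n)

print(top_words(descriptions, n=2))
```

On the real data, `prod_desc` could be passed in place of `descriptions`, and `STOPWORDS` from the `wordcloud` package could replace the small set used here.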

In [None]:
time.sleep(30)
del articles_df
del trans_df
del df

### Customer Portfolio Analysis

In [None]:
%%time
path = Path("/kaggle/input/h-and-m-personalized-fashion-recommendations/")

articles_df = pd.read_csv(path / "articles.csv", dtype = {'article_id': str})
cust_df = pd.read_csv(path / "customers.csv", dtype = {'customer_id': str})
trans_df = pd.read_csv(path / "transactions_train.csv", dtype = {'article_id': str,'customer_id': str})

In [None]:
time.sleep(30)

In [None]:
#Extract only the column needed
articles_df = articles_df[['article_id', 'product_type_name','product_group_name','colour_group_name','prod_name','graphical_appearance_name']]
cust_df = cust_df[['customer_id', 'club_member_status','fashion_news_frequency','age']]


df = pd.merge(trans_df, cust_df, on='customer_id', how='left')
del trans_df,cust_df

# df = pd.merge(df, articles_df, on='article_id', how='left')
# del articles_df

time.sleep(30)

import calendar
df['t_dat'] = pd.to_datetime(df['t_dat'])
df['YYYY_MM'] = df['t_dat'].dt.year.astype(str) + '_' + df['t_dat'].dt.month.astype(str)
df['year'] = df['t_dat'].dt.year
# df['month'] = df['t_dat'].dt.month
# df['month_'] = df['month'].apply(lambda x: calendar.month_abbr[x])

In [None]:
dfg = df[['age','fashion_news_frequency','customer_id']]
# df holds one row per transaction, so count distinct customers rather than rows
dfg = dfg.groupby(['age','fashion_news_frequency']).nunique().reset_index()
dfg.rename(columns = {"customer_id": "count"}, inplace=True)
dfg
fig = px.bar(dfg, x="age", y="count",color='fashion_news_frequency'
#               ,markers=True
              ,color_discrete_sequence=px.colors.diverging.PRGn
             ,template = "plotly_white"
             ) 
fig.update_layout(
    title="Number of customers by age"
    ,xaxis_title="Age"
    ,yaxis_title="Count"
    ,legend_title_text='fashion_news_frequency'
)

fig.show()

Reference: https://www.kaggle.com/code/melodyyiphoiching/h-m-deep-sales-and-customers-analysis/

### Which Age Group purchases more products?

In [None]:
# include_lowest=True keeps customers aged exactly 16 in the first bin
df['age_groups'] = pd.cut(df['age'], bins=[16, 20, 30, 40,50, 60, 70, float('Inf')], labels=['16-20', '20-30','30-40','40-50','50-60','60-70' , '70+'], include_lowest=True)

In [None]:
plt.figure(figsize=(8,5))
plt.title("Purchased quantity by age group\n", fontweight="bold", size=28)
# article_id is a string column, so count the transaction rows instead of summing the ids
g = sns.barplot(x="age_groups", y="Purchased Quantity(%)", data=df.groupby("age_groups")["article_id"].count() \
            .transform(lambda x: (x / x.sum() * 100)).rename('Purchased Quantity(%)').reset_index(), palette="icefire", edgecolor="black")
plt.xlabel("Age Group",fontweight="bold", size=22)
plt.ylabel("Purchased Quantity (%)",fontweight="bold", size=19)
for container in g.containers:
    g.bar_label(container, padding = 5, fmt='%.1f', fontsize=18, color="black")
plt.grid(axis="y",color = 'grey', linestyle = '--', linewidth = 1.5)
plt.show()

Insights:

- Customers in the range 20-30 are responsible for more than 42% of the total purchased products.
- Customers in the ranges 16-20, 60-70 and 70+ are responsible for 8% of the total purchased products.
- Customers in the range 30-40, 40-50 and 50-60 are responsible for 16% of purchased quantity each.
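
The age bucketing and the percentage shares can be sketched on a handful of made-up ages. The example assumes the same bin edges and labels as the notebook; it also shows that, with the default right-closed bins, an age of exactly 16 falls outside the first bin unless `include_lowest=True` is passed.

```python
import pandas as pd

bins = [16, 20, 30, 40, 50, 60, 70, float("inf")]
labels = ["16-20", "20-30", "30-40", "40-50", "50-60", "60-70", "70+"]

# Toy ages, one per purchase row (made-up values)
ages = pd.Series([16, 18, 20, 25, 25, 29, 35, 45, 72])

# Default bins are (16, 20], (20, 30], ... so the age 16 is left unassigned
default_groups = pd.cut(ages, bins=bins, labels=labels)

# include_lowest=True closes the first bin on the left as well
inclusive_groups = pd.cut(ages, bins=bins, labels=labels, include_lowest=True)

# Percentage of purchase rows per age group
share = inclusive_groups.value_counts(normalize=True, sort=False).mul(100).round(1)
print(share)
```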

After analyzing the purchased quantity, it could be interesting to analyze the earnings each age group brings to the company.

### Which Age Group generates more earnings for the company?

In [None]:
plt.figure(figsize=(8,5))
plt.title("Company Earnings by age group\n", fontweight="bold", size=28)
g = sns.barplot(x="age_groups", y="earning(%)", data=df.groupby("age_groups")["price"].sum() \
            .transform(lambda x: (x / x.sum() * 100)).rename('earning(%)').reset_index(), palette="icefire",edgecolor="black")
plt.xlabel("Age Group",fontweight="bold", size=22)
plt.ylabel("Earnings (%)",fontweight="bold", size=25)
for container in g.containers:
    g.bar_label(container, padding = 5, fmt='%.1f', fontsize=18, color="black")
plt.grid(axis="y",color = 'grey', linestyle = '--', linewidth = 1.5)
plt.show()

Indeed, the earnings analysis shows a situation very similar to the purchased quantity, since customers who buy more lead, on average, to higher earnings for the company.
The age group 20-30 is by far responsible for the highest earnings for the company (41.9% of total earnings).

### Do active customers on the fashion news purchase more products?

In [None]:
plt.figure(figsize=(9,5))
plt.title("Purchased quantity by Fashion News Frequency\n", fontweight="bold", size=20)
# article_id is a string column, so count the transaction rows instead of summing the ids
g = sns.barplot(x="fashion_news_frequency", y="Purchased Quantity(%)", data=df.groupby("fashion_news_frequency")["article_id"].count() \
            .transform(lambda x: (x / x.sum() * 100)).rename('Purchased Quantity(%)').reset_index(), palette="Spectral", edgecolor="black")
plt.xlabel("Fashion News Frequency",fontweight="bold", size=22)
plt.ylabel("Purchased Quantity (%)",fontweight="bold", size=25)
for container in g.containers:
    g.bar_label(container, padding = 5, fmt='%.3f', fontsize=18, color="black")
plt.grid(axis="y",color = 'grey', linestyle = '--', linewidth = 1.5)
plt.show()

Customers active on the fashion news are responsible for 43% of the total purchases, while the remaining 57% of the purchased quantity comes from customers not registered for the fashion news.
The other 2 categories, "Monthly" and "None", can be ignored and won't be considered in the further analysis.

So it could be interesting to check the fashion news frequency by age group to find more useful insights:

In [None]:
x, y = 'age_groups', 'fashion_news_frequency'
df_age_news = df.groupby(x)[y].value_counts(normalize=True)
df_age_news = df_age_news.mul(100)
df_age_news = df_age_news.rename('percent(%)').reset_index()
df_age_news = df_age_news[df_age_news["fashion_news_frequency"].isin(["Regularly","NONE"])]

In [None]:
palette1 = {"Regularly":'#46C646', "NONE":'#FF0000'}

plt.figure(figsize=(13,6))
plt.title("Fashion News Frequency by age group\n",fontweight="bold", size=33)
g=sns.barplot(x="age_groups", y="percent(%)",data=df_age_news, hue="fashion_news_frequency", palette=palette1)
plt.xlabel("Age group",fontweight="bold", size=22)
plt.ylabel("Percentage (%)",fontweight="bold", size=25)
for container in g.containers:
    g.bar_label(container, padding = 5, fmt='%.1f', fontsize=16, color="black")
plt.grid(axis="y",color = 'grey', linestyle = '--', linewidth = 1.5)
plt.legend(title='News\nFrequency',bbox_to_anchor=(1.0, 1.0), ncol=1, fancybox=True, shadow=True, fontsize=17,title_fontsize=22)
plt.show()

We can see that customers in the ranges 20-30 and 30-40 have the lowest percentage of regular fashion news readers, while being the groups that buy the most.
Moreover, the share of customers who regularly check the fashion news starts increasing from the range 40-50, with a peak value of 43.7% of regular/active users among customers aged 70+. This suggests that the fashion news is more effective for older customers, who still represent a small percentage of total sold products, while younger customers do not need to check the news to buy new products.
It could be effective for the company to invite younger customers (range 20-40) to check the news more frequently in order to increase the number of items sold.
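
The within-group percentages used for this comparison can also be obtained with `pd.crosstab`. The sketch below uses made-up rows and is only meant to show the row-wise normalisation; it is an alternative to the `groupby(...).value_counts(normalize=True)` call in the cell above, not the notebook's own code.

```python
import pandas as pd

# Made-up rows: one per transaction, with the buyer's age group and news frequency
toy = pd.DataFrame({
    "age_groups": ["20-30", "20-30", "20-30", "20-30", "70+", "70+"],
    "fashion_news_frequency": ["NONE", "NONE", "NONE", "Regularly", "Regularly", "NONE"],
})

# normalize="index" makes every age group (row) sum to 1; multiply to get percentages
pct = pd.crosstab(
    toy["age_groups"], toy["fashion_news_frequency"], normalize="index"
).mul(100)
print(pct.round(1))
```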

### Does the club member status influence the purchased quantity?

In [None]:
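# Keep one row per customer so the shares refer to customers, not transactions
df.drop_duplicates("customer_id")["club_member_status"].value_counts(normalize=True)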

We can see that:

- More than 93% of the customers belong to the ACTIVE category
- 6.8% of the customers belong to the PRE-CREATE category
- 0.3% of the customers belong to the LEFT CLUB category

This shows a very high imbalance among the classes: if we consider the sum of purchased products per category, it will likely show that most of the purchased products belong to the ACTIVE members.

In [None]:
# article_id is a string column, so count the transaction rows per status
df.groupby("club_member_status")["article_id"].count()

Indeed, more customers in a group leads to more purchases. For this reason, it is wiser to consider the mean purchased quantity per customer instead of the sum:
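
The difference between the two views can be sketched on made-up transactions. The table `per_customer` and the column `n_purchases` are hypothetical names introduced for this example: each customer's rows are first collapsed into a purchase count, and only then are totals and means taken per club status.

```python
import pandas as pd

# Made-up transactions: one row per purchased article
toy = pd.DataFrame({
    "customer_id": ["a", "a", "a", "b", "c", "c", "d"],
    "article_id": ["0101", "0102", "0101", "0103", "0101", "0104", "0102"],
    "club_member_status": ["ACTIVE"] * 6 + ["PRE-CREATE"],
})

# One row per customer: club status and number of purchases
per_customer = (
    toy.groupby("customer_id")
       .agg(club_member_status=("club_member_status", "first"),
            n_purchases=("article_id", "count"))
       .reset_index()
)

# The total grows with the number of customers in a status; the mean does not
total_by_status = per_customer.groupby("club_member_status")["n_purchases"].sum()
mean_by_status = per_customer.groupby("club_member_status")["n_purchases"].mean()
print(total_by_status)
print(mean_by_status)
```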

In [None]:
print("The average quantity of purchased products by the customers is {:.0f} products ".format(df.groupby("customer_id")["article_id"].count().mean()))

In [None]:
# Average number of purchases per customer within each club status
mean_by_status = df.groupby(["club_member_status", "customer_id"])["article_id"].count().groupby(level="club_member_status").mean()
print("The average quantity of purchased products by the ACTIVE customers is {:.0f} products ".format(mean_by_status["ACTIVE"]))
print("The average quantity of purchased products by the LEFT-CLUB customers is {:.0f} products ".format(mean_by_status["LEFT CLUB"]))
print("The average quantity of purchased products by the PRE-CREATE customers is {:.0f} products ".format(mean_by_status["PRE-CREATE"]))

By considering the mean, we can see a very different situation, which is shown in the following plot:

In [None]:
plt.figure(figsize=(9,5))
plt.title("Average Purchased Quantity by Club Member Status\n", fontweight="bold", size=22)
# Purchases per customer (df holds one row per transaction), indexed by club status and customer
per_customer = df.groupby(["club_member_status", "customer_id"])["article_id"].count()
g = sns.barplot(x="club_member_status", y="article_id", data=per_customer.groupby(level="club_member_status").mean().astype(int).reset_index(), palette="viridis", edgecolor="black")
plt.axhline(y = per_customer.mean(), color = 'r', linestyle = '--')
plt.text(0.76, 23.7, 'Mean Purchased Quantity: {:.0f}'.format(per_customer.mean()), size=16, color="red",fontweight="bold")
plt.xlabel("Club Member Status",fontweight="bold", size=20)
plt.ylabel("Average Purchased Quantity",fontweight="bold", size=16)
for container in g.containers:
    g.bar_label(container, padding = 5, fmt='%.0f', fontsize=23, color="black")
plt.grid(axis="y",color = 'grey', linestyle = '--', linewidth = 1.5)
plt.show()

This plot shows that the average purchased quantity differs a lot among the categories.
In particular, customers in the ACTIVE category purchase more products than the other categories, while those in the PRE-CREATE category purchase on average less than a third of what active customers purchase.

Finally, since the distribution of the purchased quantity is heavily right-skewed, it could be interesting to also check the median purchased quantity.
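
Why the median is worth a look can be seen on a small made-up sample of per-customer purchase counts, where a single heavy buyer pulls the mean far above the typical customer:

```python
import pandas as pd

# Made-up purchase counts per customer; one heavy buyer creates a long right tail
n_purchases = pd.Series([1, 1, 2, 2, 3, 3, 4, 5, 9, 120])

mean_qty = n_purchases.mean()      # pulled upwards by the heavy buyer
median_qty = n_purchases.median()  # robust to the long tail
skewness = n_purchases.skew()      # positive for right-skewed data

print(mean_qty, median_qty, skewness)
```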

In [None]:
plt.figure(figsize=(9,5))
plt.title("Median Purchased Quantity by Club Member Status\n", fontweight="bold", size=22)
# Purchases per customer (df holds one row per transaction), indexed by club status and customer
per_customer = df.groupby(["club_member_status", "customer_id"])["article_id"].count()
g = sns.barplot(x="club_member_status", y="article_id", data=per_customer.groupby(level="club_member_status").median().reset_index(), palette="viridis", edgecolor="black")
plt.axhline(y = per_customer.median(), color = 'r', linestyle = '--')
plt.text(0.76, 9.3, 'Median Purchased Quantity: {:.2f}'.format(per_customer.median()), size=16, color="red",fontweight="bold")
plt.xlabel("Club Member Status",fontweight="bold", size=20)
plt.ylabel("Median Purchased Quantity",fontweight="bold", size=16)
for container in g.containers:
    g.bar_label(container, padding = 5, fmt='%.0f', fontsize=23, color="black")
plt.grid(axis="y",color = 'grey', linestyle = '--', linewidth = 1.5)
plt.show()

Indeed, even if the median is quite different from the mean due to the high skewness of the data, a situation very similar to the mean purchased quantity can be observed, where ACTIVE customers buy more products than the other categories.

Work in progress.
If you find my work impressive 💖 and useful 👍🏻, please upvote it.