# Introduction

The dataset contains 4 csv files and one folder with several subfolders, each with a different number of images.

In this exploratory data analysis notebook we look at the data: we analyze the content of each csv file, check for missing data, examine the data distributions, and see how the data in the various files are related.

We also explore the image data: how images are indexed in the csv files and whether some articles in the dataset have no image. We also look at additional image information, such as image width and height.

We also investigate a very simple baseline model and create an initial submission.



![](http://images.unsplash.com/photo-1578983662508-41895226ebfb?ixlib=rb-1.2.1&ixid=MnwxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8&auto=format&fit=crop&w=1211&q=80)

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

**Intro**

**The competition is dedicated to product recommendations (H&M).**

**Here we have different kinds of data that help us make good recommendations:**

**📸 images - images of every article_id**

**🙋 articles - detailed metadata of every article_id**

**👔 customers - detailed metadata of every customer_id**

**🧾 transactions_train - purchases with details**

# Analysis preparation

We will include here the required packages for reading, parsing, filtering, processing, visualizing the data, both tabular and image.

![](https://lp2.hm.com/hmgoepprod?set=quality%5B79%5D%2Csource%5B%2F55%2F6e%2F556e6853e5cf8986017f40e5fcf356016a599e41.jpg%5D%2Corigin%5Bdam%5D%2Ccategory%5Bmen_tshirtstanks_shortsleeve%5D%2Ctype%5BDESCRIPTIVEDETAIL%5D%2Cres%5Bm%5D%2Chmver%5B2%5D&call=url[file:/product/main])

# Exploratory Data Analysis

In [None]:
import numpy as np
import pandas as pd
import os
from tqdm import tqdm
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud, STOPWORDS
from datetime import datetime
from PIL import Image

In [None]:
articles = pd.read_csv("../input/h-and-m-personalized-fashion-recommendations/articles.csv")
customers = pd.read_csv("../input/h-and-m-personalized-fashion-recommendations/customers.csv")
transactions = pd.read_csv("../input/h-and-m-personalized-fashion-recommendations/transactions_train.csv")

**Let's look at the tables and see what we can learn about the data inside.**

# Articles

This table contains all H&M articles with details such as product type, colour, product group and other features.
Article data description:

article_id : A unique identifier of every article.

product_code, prod_name : A unique identifier of every product and its name (not the same).

product_type_no, product_type_name : The group of product_code and its name

graphical_appearance_no, graphical_appearance_name : The group of graphics and its name

colour_group_code, colour_group_name : The group of color and its name

perceived_colour_value_id, perceived_colour_value_name, perceived_colour_master_id, perceived_colour_master_name : The added color info

department_no, department_name : A unique identifier of every department and its name

index_code, index_name : A unique identifier of every index and its name

index_group_no, index_group_name : A group of indices and its name

section_no, section_name : A unique identifier of every section and its name

garment_group_no, garment_group_name : A unique identifier of every garment group and its name

detail_desc : Details
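The description above pairs each code column with a name column. A minimal sketch of how one might check whether such a pair is really one-to-one is shown below. The toy frame and the helper names `names_per_code` and `codes_per_name` are illustrative assumptions, not part of the dataset or of this notebook.

```python
import pandas as pd

# Toy stand-in for articles.csv; the values are made up for illustration.
toy_articles = pd.DataFrame({
    "article_id": [1, 2, 3, 4],
    "product_type_no": [253, 253, 306, 306],
    "product_type_name": ["Vest top", "Vest top", "Bra", "Bra"],
    "department_no": [1676, 1676, 1339, 1340],
    "department_name": ["Jersey Basic", "Jersey Basic", "Lingerie", "Lingerie"],
})

def names_per_code(df, code_col, name_col):
    """Largest number of distinct names attached to a single code."""
    return int(df.groupby(code_col)[name_col].nunique().max())

def codes_per_name(df, code_col, name_col):
    """Largest number of distinct codes attached to a single name."""
    return int(df.groupby(name_col)[code_col].nunique().max())

# A value of 1 in both directions means the pair is one-to-one.
type_names_per_code = names_per_code(toy_articles, "product_type_no", "product_type_name")
dept_codes_per_name = codes_per_name(toy_articles, "department_no", "department_name")
print(type_names_per_code, dept_codes_per_name)
```

On the real table the same helpers could be called with `articles` in place of `toy_articles`; a result above 1 would mean that a name is shared by several codes (or the reverse).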

In [None]:
articles.head()

**Ladieswear accounts for a significant share of all articles. Sportswear has the smallest share.**

In [None]:
f, ax = plt.subplots(figsize=(15, 7))
ax = sns.histplot(data=articles, y='index_name', color='orange')
ax.set_xlabel('count by index name')
ax.set_ylabel('index name')
plt.show()

**The garments grouped by index: Jersey fancy is the most frequent garment group, especially for women and children. Next by count is accessories: many varied accessories with low prices.**

In [None]:
f, ax = plt.subplots(figsize=(15, 7))
ax = sns.histplot(data=articles, y='garment_group_name', color='orange', hue='index_group_name', multiple="stack")
ax.set_xlabel('count by garment group')
ax.set_ylabel('garment group')
plt.show()

**Now, the structure of index groups and indices. Ladieswear and Children/Baby have subgroups.**

In [None]:
articles.groupby(['index_group_name', 'index_name']).count()['article_id']

**Accessories are really varied; the most numerous are bags, earrings and hats. However, trousers prevail.**

In [None]:
pd.options.display.max_rows = None
articles.groupby(['product_group_name', 'product_type_name']).count()['article_id']

**Table with number of unique values in columns:**

In [None]:
for col in articles.columns:
    if 'no' not in col and 'code' not in col and 'id' not in col:
        un_n = articles[col].nunique()
        print(f'n of unique {col}: {un_n}')

# Customers

Customers data description:

customer_id : A unique identifier of every customer
    
FN : 1 or missing

Active : 1 or missing
    
club_member_status : Status in club
    
fashion_news_frequency : How often H&M may send news to customer
    
age : The current age
    
postal_code : Postal code of customer

In [None]:
pd.options.display.max_rows = 50
customers.head()

In [None]:
customers.shape[0] - customers['customer_id'].nunique()

**One postal code has an abnormal number of customers: 120303. It might be an encoded missing address, or something like a huge distribution center or pickup point.**

**Ages and club_member_status differ among these customers, as do the customer_ids.**
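The cell that produced this count is not shown, so here is a minimal sketch of one way to find the most common postal code, run on a toy frame. The toy values and the variable names `postal_counts`, `top_postal_code` and `top_count` are assumptions for illustration.

```python
import pandas as pd

# Toy stand-in for customers.csv; postal codes are hashed strings in the real data.
toy_customers = pd.DataFrame({
    "customer_id": ["c1", "c2", "c3", "c4", "c5"],
    "postal_code": ["aaa", "aaa", "aaa", "bbb", "ccc"],
    "age": [20, 35, 51, 28, 44],
})

# Number of customers per postal code, largest first.
postal_counts = (
    toy_customers.groupby("postal_code")["customer_id"]
    .count()
    .sort_values(ascending=False)
)
top_postal_code = postal_counts.index[0]
top_count = int(postal_counts.iloc[0])

# How varied the customers behind that postal code are.
ages_in_top = toy_customers.loc[toy_customers["postal_code"] == top_postal_code, "age"].nunique()
print(top_postal_code, top_count, ages_in_top)
```

Applied to `customers`, the first row of `postal_counts` would be the postal code inspected in the next cell.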

In [None]:
customers[customers['postal_code']=='2c29ae653a9282cce4151bd87643c907644e09541abc28ae87dea0d1f6603b1c'].head(5)

In [None]:
import seaborn as sns
from matplotlib import pyplot as plt
sns.set_style("darkgrid")
f, ax = plt.subplots(figsize=(10,5))
ax = sns.histplot(data=customers, x='age', bins=50, color='orange')
ax.set_xlabel('Distribution of the customers age')
plt.show()

**Status in H&M club.**

In [None]:
sns.set_style("darkgrid")
f, ax = plt.subplots(figsize=(10,5))
ax = sns.histplot(data=customers, x='club_member_status', color='orange')
ax.set_xlabel('Distribution of club member status')
plt.show()

In [None]:
customers['fashion_news_frequency'].unique()

In [None]:
customers.loc[~customers['fashion_news_frequency'].isin(['Regularly', 'Monthly']), 'fashion_news_frequency'] = 'None'
customers['fashion_news_frequency'].unique()

In [None]:
pie_data = customers[['customer_id', 'fashion_news_frequency']].groupby('fashion_news_frequency').count()

In [None]:
sns.set_style("darkgrid")
f, ax = plt.subplots(figsize=(10,5))
# seaborn has no pie chart, so we draw it with matplotlib and a seaborn palette
colors = sns.color_palette('pastel')
ax.pie(pie_data.customer_id, labels=pie_data.index, colors = colors)
ax.set_facecolor('lightgrey')
ax.set_xlabel('Distribution of fashion news frequency')
plt.show()

# Transactions

Transactions data description:

t_dat : Date of the transaction
    
customer_id : A unique identifier of every customer (in customers table)
    
article_id : A unique identifier of every article (in articles table)
    
price : Price of purchase
    
sales_channel_id : 1 or 2

In [None]:
transactions.head()

**We see outliers for price.**

In [None]:
pd.set_option('display.float_format', '{:.4f}'.format)
transactions.describe()['price']

In [None]:
transactions.head()

In [None]:
sns.set_style("darkgrid")
f, ax = plt.subplots(figsize=(10,5))
ax = sns.boxplot(data=transactions, x='price', color='orange')
ax.set_xlabel('Price outliers')
plt.show()


In [None]:
transactions_byid = transactions.groupby('customer_id').count()

In [None]:
transactions_byid.sort_values(by='price', ascending=False)['price'][:10]

**However, comparing prices within groups is more accurate, because the prices of accessories and trousers may differ greatly.**
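One way to compare prices within groups is to flag outliers relative to each group's own quartiles. The sketch below applies the common 1.5 x IQR rule per group on made-up prices; the helper `flag_group_outliers` and the choice of rule are assumptions, not something the notebook itself uses.

```python
import pandas as pd

# Toy prices: 0.033 is ordinary for trousers but far above the other accessories.
toy = pd.DataFrame({
    "product_group_name": ["Accessories"] * 5 + ["Garment Lower body"] * 5,
    "price": [0.010, 0.011, 0.012, 0.013, 0.033,
              0.030, 0.031, 0.032, 0.033, 0.034],
})

def flag_group_outliers(df, group_col, value_col, k=1.5):
    """Mark rows lying more than k * IQR outside the quartiles of their own group."""
    grouped = df.groupby(group_col)[value_col]
    # Per-group quartiles, mapped back onto every row of the group.
    q1 = df[group_col].map(grouped.quantile(0.25))
    q3 = df[group_col].map(grouped.quantile(0.75))
    iqr = q3 - q1
    return (df[value_col] < q1 - k * iqr) | (df[value_col] > q3 + k * iqr)

toy["is_outlier"] = flag_group_outliers(toy, "product_group_name", "price")
print(toy)
```

Only the 0.033 accessory is flagged here, while the same price in the trousers group is not, which is the point of comparing within groups.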



**Get subset from articles and merge it to transactions.**

In [None]:
articles_for_merge = articles[['article_id', 'prod_name', 'product_type_name', 'product_group_name', 'index_name']]

In [None]:
articles_for_merge = transactions[['customer_id', 'article_id', 'price', 't_dat']].merge(articles_for_merge, on='article_id', how='left')

In [None]:
sns.set_style("darkgrid")
f, ax = plt.subplots(figsize=(25,18))
ax = sns.boxplot(data=articles_for_merge, x='price', y='product_group_name')
ax.set_xlabel('Price outliers', fontsize=22)
ax.set_ylabel('Product group', fontsize=22)
ax.xaxis.set_tick_params(labelsize=22)
ax.yaxis.set_tick_params(labelsize=22)

plt.show()

**Then look at boxplot prices according to accessories product group and find the reasons of high prices inside group.**

**The largest outliers can be found among bags, which is logical enough. In addition, scarves and other accessories have articles with prices highly contrasting to the rest of garments.**

In [None]:
sns.set_style("darkgrid")
f, ax = plt.subplots(figsize=(25,18))
_ = articles_for_merge[articles_for_merge['product_group_name'] == 'Accessories']
ax = sns.boxplot(data=_, x='price', y='product_type_name')
ax.set_xlabel('Price outliers', fontsize=22)
ax.set_ylabel('Product type', fontsize=22)
ax.xaxis.set_tick_params(labelsize=22)
ax.yaxis.set_tick_params(labelsize=22)
del _

plt.show()

In [None]:
articles_index = articles_for_merge[['index_name', 'price']].groupby('index_name').mean()
sns.set_style("darkgrid")
f, ax = plt.subplots(figsize=(10,5))
ax = sns.barplot(x=articles_index.price, y=articles_index.index, color='orange', alpha=0.8)
ax.set_xlabel('Price by index')
ax.set_ylabel('Index')
plt.show()

In [None]:
articles_index = articles_for_merge[['product_group_name', 'price']].groupby('product_group_name').mean()
sns.set_style("darkgrid")
f, ax = plt.subplots(figsize=(10,5))
ax = sns.barplot(x=articles_index.price, y=articles_index.index, color='orange', alpha=0.8)
ax.set_xlabel('Price by product group')
ax.set_ylabel('Product group')
plt.show()

**Check the mean price change in time for top 5 product groups by mean price:**

* Shoes
* Garment Full body
* Bags
* Garment Lower body
* Underwear/nightwear

In [None]:
articles_for_merge['t_dat'] = pd.to_datetime(articles_for_merge['t_dat'])

In [None]:
product_list = ['Shoes', 'Garment Full body', 'Bags', 'Garment Lower body', 'Underwear/nightwear']
colors = ['cadetblue', 'orange', 'mediumspringgreen', 'tomato', 'lightseagreen']
k = 0
f, ax = plt.subplots(3, 2, figsize=(20, 15))
for i in range(3):
    for j in range(2):
        try:
            product = product_list[k]
            articles_for_merge_product = articles_for_merge[articles_for_merge.product_group_name == product_list[k]]
            series_mean = articles_for_merge_product[['t_dat', 'price']].groupby(pd.Grouper(key="t_dat", freq='M')).mean().fillna(0)
            series_std = articles_for_merge_product[['t_dat', 'price']].groupby(pd.Grouper(key="t_dat", freq='M')).std().fillna(0)
            ax[i, j].plot(series_mean, linewidth=4, color=colors[k])
            ax[i, j].fill_between(series_mean.index, (series_mean.values-2*series_std.values).ravel(), 
                             (series_mean.values+2*series_std.values).ravel(), color=colors[k], alpha=.1)
            ax[i, j].set_title(f'Mean {product_list[k]} price in time')
            ax[i, j].set_xlabel('month')
            ax[i, j].set_ylabel(f'{product_list[k]}')
            k += 1
        except IndexError:
            ax[i, j].set_visible(False)
plt.show()

# Images with description and price

**Check the last purchases by max price and by min price**

In [None]:
import matplotlib.pyplot as plt
import matplotlib.image as mpimg

In [None]:
max_price_ids = transactions[transactions.t_dat==transactions.t_dat.max()].sort_values('price', ascending=False).iloc[:5][['article_id', 'price']]
min_price_ids = transactions[transactions.t_dat==transactions.t_dat.max()].sort_values('price', ascending=True).iloc[:5][['article_id', 'price']]

**Photos with description and price (top 5 max)**

In [None]:
f, ax = plt.subplots(1, 5, figsize=(20,10))
i = 0
for _, data in max_price_ids.iterrows():
    desc = articles[articles['article_id'] == data['article_id']]['detail_desc'].iloc[0]
    desc_list = desc.split(' ')
    for j, elem in enumerate(desc_list):
        if j > 0 and j % 5 == 0:
            desc_list[j] = desc_list[j] + '\n'
    desc = ' '.join(desc_list)
    img = mpimg.imread(f'../input/h-and-m-personalized-fashion-recommendations/images/0{str(data.article_id)[:2]}/0{int(data.article_id)}.jpg')
    ax[i].imshow(img)
    ax[i].set_title(f'price: {data.price:.2f}')
    ax[i].set_xticks([], [])
    ax[i].set_yticks([], [])
    ax[i].grid(False)
    ax[i].set_xlabel(desc, fontsize=10)
    i += 1
plt.show()

**Photos with description and price (top 5 min)**

In [None]:
f, ax = plt.subplots(1, 5, figsize=(20,10))
i = 0
for _, data in min_price_ids.iterrows():
    desc = articles[articles['article_id'] == data['article_id']]['detail_desc'].iloc[0]
    desc_list = desc.split(' ')
    for j, elem in enumerate(desc_list):
        if j > 0 and j % 4 == 0:
            desc_list[j] = desc_list[j] + '\n'
    desc = ' '.join(desc_list)
    img = mpimg.imread(f'../input/h-and-m-personalized-fashion-recommendations/images/0{str(data.article_id)[:2]}/0{int(data.article_id)}.jpg')
    ax[i].imshow(img)
    ax[i].set_title(f'price: {data.price:.4f}')
    ax[i].set_xlabel(desc, fontsize=10)
    ax[i].set_xticks([], [])
    ax[i].set_yticks([], [])
    ax[i].grid(False)
    i += 1
plt.axis('off')
plt.show()

# Read the data

In [None]:
print(f"files and folders: {os.listdir('/kaggle/input/h-and-m-personalized-fashion-recommendations/')}")
print("Subfolders in images folder: ", len(list(os.listdir("/kaggle/input/h-and-m-personalized-fashion-recommendations/images"))))

In [None]:
total_folders = total_files = 0
folder_info = []
images_names = []
for base, dirs, files in tqdm(os.walk('/kaggle/input/h-and-m-personalized-fashion-recommendations/')):
    for directories in dirs:
        folder_info.append((directories, len(os.listdir(os.path.join(base, directories)))))
        total_folders += 1
    for _files in files:
        total_files += 1
        if len(_files.split(".jpg"))==2:
            images_names.append(_files.split(".jpg")[0])

In [None]:
print(f"Total number of folders: {total_folders}\nTotal number of files: {total_files}")
folder_info_df = pd.DataFrame(folder_info, columns=["folder", "files count"])
folder_info_df.sort_values(["files count"], ascending=False).head()

In [None]:
print("folder names: ", list(folder_info_df.folder.unique()))

In [None]:
articles_df = pd.read_csv("/kaggle/input/h-and-m-personalized-fashion-recommendations/articles.csv")
customers_df = pd.read_csv("/kaggle/input/h-and-m-personalized-fashion-recommendations/customers.csv")
sample_submission_df = pd.read_csv("/kaggle/input/h-and-m-personalized-fashion-recommendations/sample_submission.csv")

In [None]:
transactions_train_df = pd.read_csv("/kaggle/input/h-and-m-personalized-fashion-recommendations/transactions_train.csv")

In [None]:
articles_df.head()

In [None]:
customers_df.head()

In [None]:
sample_submission_df.head()

Prediction in sample submission is a sequence of article ids (max 12 article ids).
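A minimal sketch of building such a prediction string is given below. It assumes that ids are written as 10-character strings with a leading zero, which matches how this notebook prefixes "0" to 9-digit ids; `format_prediction` is a hypothetical helper.

```python
def format_prediction(article_ids, max_items=12):
    """Join up to max_items article ids into one space-separated string,
    left-padding each id with zeros to 10 characters (assumed format)."""
    return " ".join(str(int(a)).zfill(10) for a in list(article_ids)[:max_items])

example = format_prediction([706016001, 108775015, 372860001])
print(example)
```

Using `zfill(10)` rather than concatenating "0" keeps the width fixed even if an id has fewer than 9 digits.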

In [None]:
transactions_train_df.head()

There are 3 main tables:

* articles - contains information about each article (product code, name, product group, ...)
* customers - contains information about each customer (club membership, age, postal code)
* transactions (train)

Transactions contain customer_id and article_id, which are foreign keys to the customers and articles tables. Besides these, each transaction also contains sales_channel_id.

Transaction train data has entries for the date of the transaction, the customer id, the article id, a price (per transaction) and a sales channel id.
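The foreign-key relationship described above can be checked directly instead of being read off unique counts. The sketch below uses toy frames; the helper `coverage` is an assumption introduced for illustration.

```python
import pandas as pd

# Toy stand-ins for the three tables.
toy_articles = pd.DataFrame({"article_id": [101, 102, 103]})
toy_customers = pd.DataFrame({"customer_id": ["a", "b", "c", "d"]})
toy_transactions = pd.DataFrame({
    "customer_id": ["a", "a", "b"],
    "article_id": [101, 102, 101],
})

def coverage(parent, child, key):
    """Return (share of parent keys that appear in child,
    number of child rows whose key is missing from parent)."""
    parent_keys = parent[key].drop_duplicates()
    share_used = parent_keys.isin(child[key]).mean()
    orphans = (~child[key].isin(parent_keys)).sum()
    return float(share_used), int(orphans)

article_share, article_orphans = coverage(toy_articles, toy_transactions, "article_id")
customer_share, customer_orphans = coverage(toy_customers, toy_transactions, "customer_id")
print(f"articles used: {article_share:.1%}, orphan rows: {article_orphans}")
print(f"customers used: {customer_share:.1%}, orphan rows: {customer_orphans}")
```

A non-zero orphan count on the real tables would mean that some transactions reference an id absent from the parent table.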

In [None]:
def missing_data(data):
    total = data.isnull().sum().sort_values(ascending = False)
    percent = (data.isnull().sum()/data.isnull().count()*100).sort_values(ascending = False)
    return pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])


In [None]:
def unique_values(data):
    total = data.count()
    tt = pd.DataFrame(total)
    tt.columns = ['Total']
    uniques = []
    for col in data.columns:
        unique = data[col].nunique()
        uniques.append(unique)
    tt['Uniques'] = uniques
    return tt

In [None]:
articles_df.info()

In [None]:
missing_data(articles_df)

In the article data, the only missing data is for the detailed description of the article (0.4% missing data).

In [None]:
customers_df.info()

In [None]:
missing_data(customers_df)

Only customer id and postal code are completely filled. Age and fashion news frequency have around 1% missing data, FN has 65% missing and Active has 66% missing data.

In [None]:
sample_submission_df.info()

In [None]:

transactions_train_df.info()

In [None]:
missing_data(transactions_train_df)

In [None]:
unique_values(articles_df)

In [None]:

unique_values(customers_df)

In [None]:

unique_values(transactions_train_df)

In [None]:
print(f"Percent of articles present in transactions: {round(104547/105542,3)*100}%")
print(f"Percent of customers present in transactions: {round(1362281/1371980,3)*100}%")

In [None]:
stopwords = set(STOPWORDS)

def show_wordcloud(data, title = None):
    wordcloud = WordCloud(
        background_color='white',
        stopwords=stopwords,
        max_words=200,
        max_font_size=40, 
        scale=5,
        random_state=1
    ).generate(str(data))

    fig = plt.figure(1, figsize=(10,10))
    plt.axis('off')
    if title: 
        fig.suptitle(title, fontsize=14)
        fig.subplots_adjust(top=2.3)

    plt.imshow(wordcloud)
    plt.show()

In [None]:
show_wordcloud(articles_df["prod_name"], "Wordcloud from product name")

In [None]:
show_wordcloud(articles_df["detail_desc"], "Wordcloud from detailed description of articles")

# Some Analysis of Image data

![](http://images.unsplash.com/photo-1575729312527-1bdecaae271e?ixlib=rb-1.2.1&ixid=MnwxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8&auto=format&fit=crop&w=687&q=80)

There are 105542 articles and 105100 different images. Let's first check which articles do not have corresponding images.

The article_id corresponds to the characters of the image name from the 2nd to the last. Characters 2 through 7 of the image name correspond to the product code (product_code).
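The naming rule above can be sketched as a small parser. `parse_image_name` is a hypothetical helper, and the folder rule (first three characters of the name) is taken from how images are opened later in this notebook.

```python
def parse_image_name(image_name):
    """Split an image name (without the .jpg extension) into its parts."""
    article_id = int(image_name[1:])     # characters 2..last
    product_code = int(image_name[1:7])  # characters 2..7
    folder = image_name[:3]              # subfolder that holds the image
    return article_id, product_code, folder

article_id, product_code, folder = parse_image_name("0108775015")
print(article_id, product_code, folder)
```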

In [None]:
image_name_df = pd.DataFrame(images_names, columns = ["image_name"])
image_name_df["article_id"] = image_name_df["image_name"].apply(lambda x: int(x[1:]))

In [None]:
image_name_df.head()

In [None]:
image_article_df = articles_df[["article_id", "product_code", "product_group_name", "product_type_name"]].merge(image_name_df, on=["article_id"], how="left")
print(image_article_df.shape)
image_article_df.head()

Products without images.

In [None]:
article_no_image_df = image_article_df.loc[image_article_df.image_name.isna()]
print(article_no_image_df.shape)
article_no_image_df.head()

In [None]:
print("Product codes with some missing images: ", article_no_image_df.product_code.nunique())
print("Product groups with some missing images: ", list(article_no_image_df.product_group_name.unique()))

Visualize a few images.

In [None]:
def plot_image_samples(image_article_df, product_group_name, cols=1, rows=1):
    image_path = "/kaggle/input/h-and-m-personalized-fashion-recommendations/images/"
    _df = image_article_df.loc[image_article_df.product_group_name==product_group_name]
    article_ids = _df.article_id.values[0:cols*rows]
    plt.figure(figsize=(2 + 3 * cols, 2 + 4 * rows))
    for i in range(cols * rows):
        article_id = ("0" + str(article_ids[i]))[-10:]
        plt.subplot(rows, cols, i + 1)
        plt.axis('off')
        plt.title(f"{product_group_name} {article_id[:3]}\n{article_id}.jpg")
        image = Image.open(f"{image_path}{article_id[:3]}/{article_id}.jpg")
        plt.imshow(image)

In [None]:
print(image_article_df.product_group_name.unique())

**We will show images grouped by product group name.**

In [None]:
plot_image_samples(image_article_df, "Garment Lower body", 4, 2)

In [None]:
plot_image_samples(image_article_df, "Stationery", 4, 1)

In [None]:
plot_image_samples(image_article_df, "Fun", 2, 1)

In [None]:
plot_image_samples(image_article_df, "Accessories", 4, 1)

In [None]:
plot_image_samples(image_article_df, "Swimwear", 4, 2)

In [None]:
plot_image_samples(image_article_df, "Furniture", 4, 2)

In [None]:
plot_image_samples(image_article_df, "Cosmetic", 4, 1)

In [None]:
plot_image_samples(image_article_df, "Bags", 4, 3)

# Submission

For this submission, we apply the following simplified logic:

* if a client has bought articles, pick the most recent purchases;
* if a client has no purchases, just pick the most frequently bought articles.
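The two rules above can be sketched on toy data as follows. This is an illustration under stated assumptions rather than the exact cells that follow: the toy transactions, the list length `N = 3` (the notebook uses 12) and the helpers `pad` and `predict` are all made up for the example.

```python
import pandas as pd

toy_tx = pd.DataFrame({
    "t_dat": ["2020-09-20", "2020-09-21", "2020-09-21",
              "2020-09-22", "2020-09-22", "2020-09-22"],
    "customer_id": ["a", "a", "b", "b", "c", "d"],
    "article_id": [111, 222, 333, 444, 444, 555],
})
all_customers = ["a", "b", "c", "d", "e"]  # "e" has no purchase history
N = 3  # length of each prediction list in this toy example

def pad(article_id):
    """Zero-pad an article id to 10 characters (assumed submission format)."""
    return str(article_id).zfill(10)

# Fallback list: the most frequently bought articles on the last day.
last_day = toy_tx["t_dat"].max()
last_day_counts = toy_tx.loc[toy_tx["t_dat"] == last_day, "article_id"].value_counts()
popular = [pad(a) for a in last_day_counts.index[:N]]

# Most recent purchases per customer, newest first.
recent = (
    toy_tx.sort_values("t_dat", ascending=False)
    .groupby("customer_id")["article_id"]
    .apply(lambda s: [pad(a) for a in s.head(N)])
)

def predict(customer_id):
    """Own recent purchases first, then popular articles as padding.
    Duplicates are not removed in this sketch."""
    own = list(recent.get(customer_id, []))
    return " ".join((own + popular)[:N])

predictions = {c: predict(c) for c in all_customers}
for customer, prediction in predictions.items():
    print(customer, prediction)
```

Customer "e" receives only the popular articles, while customer "a" receives their own two purchases followed by one popular article.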

In [None]:
transactions_train_df = transactions_train_df.sort_values(["customer_id", "t_dat"], ascending=False)

In [None]:
transactions_train_df.head()

First, find the most frequently bought articles on the most recent date.

In [None]:
last_date = transactions_train_df.t_dat.max()
print(last_date)
print(transactions_train_df.loc[transactions_train_df.t_dat==last_date].shape)

In [None]:
most_frequent_articles = list(transactions_train_df.loc[transactions_train_df.t_dat==last_date].article_id.value_counts()[0:12].index)
art_list = []
for art in most_frequent_articles:
    art = "0"+str(art)
    art_list.append(art)
art_str = " ".join(art_list)
print("Frequent articles bought recently: ", art_str)

In [None]:
agg_df = transactions_train_df.groupby(["customer_id"])["article_id"].agg(lambda x: str(x.values[0:12])[1:-1]).reset_index()

In [None]:
def padding_articles(x):
    if x:
        xl = x.split()
        x = []
        for xi in xl:
            x.append("0"+xi)
        dimm_x = len(x)
        if dimm_x < 12:
            x.extend(art_list[:12-dimm_x])
        return(" ".join(x))

In [None]:
agg_df["article_id"] = agg_df["article_id"].apply(lambda x: padding_articles(x))


In [None]:
print("Aggregated transaction history: ", agg_df.customer_id.nunique())
print("Submission sample: ", sample_submission_df.customer_id.nunique())

In [None]:
print(sample_submission_df.shape)
sample_submission_df.head()

In [None]:
submission_df = agg_df.merge(sample_submission_df[["customer_id"]], how="right")
submission_df.columns = ["customer_id", "prediction"]
print(submission_df.shape)
submission_df.head()

In [None]:
print("Rows with missing data in submission: ", submission_df.loc[submission_df.prediction.isna()].shape[0])

In [None]:
submission_df.loc[submission_df.prediction.isna(), ["prediction"]] = art_str

In [None]:
print("Rows with missing data in submission: ", submission_df.loc[submission_df.prediction.isna()].shape[0])

In [None]:
submission_df.to_csv("submission.csv", index=False)