# H&M Fashion | EDA | H&M Personalized Fashion Recommendations
Hello everyone! In this notebook we are going to explore the [H&M Personalized Fashion Recommendations](https://www.kaggle.com/c/h-and-m-personalized-fashion-recommendations) prediction competition.

#### Acknowledgements 😍
Many thanks to:
* [GABRIEL PREDA. H&M EDA and Prediction](https://www.kaggle.com/gpreda/h-m-eda-and-prediction)

#### About 👚
H&M Group is a family of brands and businesses with 53 online markets and approximately 4,850 stores. In this competition, H&M Group invites you to develop product recommendations based on data from previous transactions, as well as from customer and product meta data.

# 1. Import libraries 📚
Here we import libraries that will be used.

In [None]:
import os
import numpy as np
import pandas as pd
import seaborn as sns
from PIL import Image
from tqdm import tqdm
from datetime import datetime
import matplotlib.pyplot as plt
from wordcloud import WordCloud, STOPWORDS

# 2. Read data 📖

In [None]:
total_folders = 0
total_files = 0

folder_info = []
images_names = []

path = "../input/h-and-m-personalized-fashion-recommendations"

for base, dirs, files in tqdm(os.walk(path)):
    for directories in dirs:
        folder_info.append((directories, 
                            len(os.listdir(os.path.join(base, directories)))))
        total_folders = total_folders + 1
    
    for _files in files:
        total_files = total_files + 1
        # keep the file name without its extension for every .jpg image
        if _files.endswith(".jpg"):
            images_names.append(_files[:-len(".jpg")])
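
The same scan can be sketched with `pathlib`. This is only a minimal illustration: `count_files_and_images` is a hypothetical helper that is not used elsewhere in the notebook, and the demo runs on a throwaway temporary directory that imitates the `images/<first 3 digits>/<article id>.jpg` layout instead of the real competition data.

```python
import tempfile
from pathlib import Path


def count_files_and_images(root):
    """Return (folder count, file count, sorted image stems) below `root`."""
    n_folders = 0
    n_files = 0
    image_stems = []
    for entry in Path(root).rglob("*"):
        if entry.is_dir():
            n_folders += 1
        elif entry.is_file():
            n_files += 1
            # Only files whose extension is exactly ".jpg" count as images.
            if entry.suffix == ".jpg":
                image_stems.append(entry.stem)
    return n_folders, n_files, sorted(image_stems)


# Toy directory tree with two images and one csv file (made-up names).
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp) / "images" / "010"
    folder.mkdir(parents=True)
    (folder / "0100000001.jpg").touch()
    (folder / "0100000002.jpg").touch()
    (Path(tmp) / "articles.csv").touch()
    demo_counts = count_files_and_images(tmp)

print(demo_counts)
```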

Now we can check the result of the scan:

In [None]:
print(f"• Total number of folders: {total_folders}")
print(f"• Total number of files: {total_files}")

In [None]:
folder_info_df = pd.DataFrame(folder_info, 
                              columns=["folder", 
                                       "files count"])

folder_info_df.sort_values(["files count"], ascending=False).head().style.set_properties(**{'background-color': 'rgba(184,230,194,.5)'})

In [None]:
articles_df = pd.read_csv("../input/h-and-m-personalized-fashion-recommendations/articles.csv")
customers_df = pd.read_csv("../input/h-and-m-personalized-fashion-recommendations/customers.csv")
sample_submission_df = pd.read_csv("../input/h-and-m-personalized-fashion-recommendations/sample_submission.csv")

transactions_train_df = pd.read_csv("../input/h-and-m-personalized-fashion-recommendations/transactions_train.csv")

Next, we can check what we've read:

In [None]:
# articles_df
articles_df.head(3).style.set_properties(**{'background-color': 'rgba(184,230,194,.5)'})

In [None]:
# customers_df
customers_df.head(3).style.set_properties(**{'background-color': 'rgba(184,230,194,.5)'})

In [None]:
# sample_submission_df
# prediction in sample submission is a sequence of article ids (max 12 article ids)
sample_submission_df.head(3).style.set_properties(**{'background-color': 'rgba(184,230,194,.5)'})

In [None]:
# transactions_train_df
transactions_train_df.head(3).style.set_properties(**{'background-color': 'rgba(184,230,194,.5)'})

# 3. Feature engineering 💻
***Transactions table*** is the training data. It contains customer_id and article_id, which are foreign keys into the customers and articles tables. It also contains sales_channel_id.

Here we can **check missing data, count unique values** etc.

In [None]:
def missing_data(data):
    total = data.isnull().sum().sort_values(ascending = False)
    percent = (data.isnull().sum()/data.isnull().count()*100).sort_values(ascending = False)
    return pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])

def unique_values(data):
    total = data.count()
    tt = pd.DataFrame(total)
    tt.columns = ['Total']
    uniques = []
    for col in data.columns:
        unique = data[col].nunique()
        uniques.append(unique)
    tt['Uniques'] = uniques
    return tt
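
To see what the missing-data summary looks like, here is a small sketch on a made-up table. The `toy_customers` frame and its values are invented for illustration. It uses `isna().mean()`, which gives the same percentage as dividing the null count by the row count.

```python
import numpy as np
import pandas as pd

# Toy stand-in for customers_df (values are invented).
toy_customers = pd.DataFrame({
    "customer_id": ["a", "b", "c", "d"],
    "age": [25.0, np.nan, 41.0, 33.0],
    "FN": [1.0, np.nan, np.nan, np.nan],
})

# isna().sum() counts missing cells per column,
# isna().mean() is the fraction of missing cells per column.
missing_summary = pd.DataFrame({
    "Total": toy_customers.isna().sum(),
    "Percent": toy_customers.isna().mean() * 100,
}).sort_values("Total", ascending=False)

print(missing_summary)
```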

Here we can look through missing data in different files.

In the **articles_df**, the only missing data is for the detailed description of the article (0.4% missing data).

In the **customers_df**, only *customer_id* and *postal_code* are completely filled. *Age* and *fashion_news_frequency* have around 1% missing data, *FN* has 65% missing and *Active* has 66% missing data.

In the **transactions_train_df**, there is no missing data.

In [None]:
missing_data(articles_df).head(7).style.set_properties(**{'background-color': 'rgba(245, 181, 152,.5)'})

In [None]:
missing_data(customers_df).style.set_properties(**{'background-color': 'rgba(245, 245, 152,.5)'})

In [None]:
missing_data(transactions_train_df).style.set_properties(**{'background-color': 'rgba(152, 243, 245,.5)'})

After that we can check **unique values**:

In [None]:
unique_values(articles_df).head(5).style.set_properties(**{'background-color': 'rgba(145, 178, 227,.5)'})

In [None]:
unique_values(customers_df).head(5).style.set_properties(**{'background-color': 'rgba(130, 126, 230,.5)'})

In [None]:
unique_values(transactions_train_df).head(5).style.set_properties(**{'background-color': 'rgba(188, 126, 230,.5)'})

We can see that **not all** customers in the customer data have transactions in the training transactions. 
**Moreover,** not all articles appear in the transactions. 

It is **interesting** that the number of distinct prices is quite small, given 31.7M transactions made by 1.3M customers buying 104K different articles. The same holds for dates: there are only 734 distinct dates. 
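
A minimal sketch of how one could list the customers that never show up in the transactions, using `isin`. The two toy frames below are invented. On the real data the same pattern would be applied to `customers_df` and `transactions_train_df`.

```python
import pandas as pd

# Toy stand-ins for customers_df and transactions_train_df.
toy_customers = pd.DataFrame({"customer_id": ["c1", "c2", "c3"]})
toy_transactions = pd.DataFrame({
    "customer_id": ["c1", "c1", "c3"],
    "article_id": [111, 222, 111],
})

# True for customers that appear at least once in the transactions.
has_purchase = toy_customers["customer_id"].isin(toy_transactions["customer_id"])
customers_without_purchases = toy_customers.loc[~has_purchase, "customer_id"].tolist()

print(customers_without_purchases)
```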

# 4. Data visualisations. Articles Data 📊

In [None]:
def pie_chart(df, col_values, labels, ax, color, title):
    n_classes = len(df)
    explode = (0.1,) * n_classes # explode for 0.1 each slice
    ax.pie(df[col_values],
           colors=color, 
           explode=explode,
           labels=df[labels],
           shadow=True)
    ax.set_title(title, fontsize=16)
    
def bar_plot(df, col_x, col_y, ax, color, title):
    ax.bar(x=df[col_x],
           height=df[col_y],
           color=color)
    ax.set_title(title, fontsize=16) 
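    # rotate the tick labels on the axes we were given, not on the global pyplot state
    ax.tick_params(axis="x", labelrotation=90)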

In [None]:
temp = articles_df.groupby(["product_group_name"])["product_type_name"].nunique()
df = pd.DataFrame({'Product Group': temp.index,'Product Types': temp.values})
df = df.sort_values(['Product Types'], ascending=False)

fig, axes = plt.subplots(nrows=1, ncols=1, figsize=(20,10))
color = plt.cm.autumn(np.linspace(0, 1, len(df)))

pie_chart(df,
          'Product Types', 
          'Product Group',
          axes, 
          color,  
          "Product Types per each Product Group")  

In [None]:
temp = articles_df.groupby(["product_group_name"])["article_id"].nunique()
df = pd.DataFrame({'Product Group': temp.index,'Articles': temp.values})
df = df.sort_values(['Articles'], ascending=False)

fig, axes = plt.subplots(nrows=1, ncols=1, figsize=(20,10))
color = plt.cm.summer(np.linspace(0, 1, len(df)))

pie_chart(df,
          'Articles', 
          'Product Group',
          axes, 
          color,  
          "Number of Articles per each Product Group")  

In [None]:
temp = articles_df.groupby(["index_group_name"])["article_id"].nunique()
df = pd.DataFrame({'Index Group Name': temp.index,'Articles': temp.values})
df = df.sort_values(['Articles'], ascending=False)

fig, axes = plt.subplots(nrows=1, ncols=1, figsize=(12,6))
color = plt.cm.spring(np.linspace(0, 1, len(df)))

bar_plot(df,
         'Index Group Name',
         'Articles',
         axes, 
         color, 
         "Number of Articles per each Index Group Name")


In [None]:
temp = articles_df.groupby(["product_type_name"])["article_id"].nunique()
df = pd.DataFrame({'Product Type': temp.index,'Articles': temp.values})
total_types = len(df['Product Type'].unique())
df = df.sort_values(['Articles'], ascending=False)[0:30]

fig, axes = plt.subplots(nrows=1, ncols=1, figsize=(10,6))
color = plt.cm.cool(np.linspace(0, 1, len(df)))

bar_plot(df,
         'Product Type',
         'Articles',
         axes, 
         color, 
         "Number of Articles per each Product Type (top 30)")


In [None]:
temp = articles_df.groupby(["department_name"])["article_id"].nunique()
df = pd.DataFrame({'Department Name': temp.index,'Articles': temp.values})
total_depts = len(df['Department Name'].unique())
df = df.sort_values(['Articles'], ascending=False).head(20)


fig, axes = plt.subplots(nrows=1, ncols=1, figsize=(22,10))
color = plt.cm.pink(np.linspace(0, 1, len(df)))

pie_chart(df,
          'Articles', 
          'Department Name',
          axes, 
          color,  
          "Number of Articles per each Department (top 20)")  

In [None]:
temp = articles_df.groupby(["section_name"])["article_id"].nunique()
df = pd.DataFrame({'Section Name': temp.index,'Articles': temp.values})
# keep only the 15 largest sections so that the colors match the plotted slices
df = df.sort_values(['Articles'], ascending=False).head(15)


fig, axes = plt.subplots(nrows=1, ncols=1, figsize=(22,10))
color = plt.cm.PRGn(np.linspace(0, 1, len(df)))

pie_chart(df.head(15),
          'Articles', 
          'Section Name',
          axes, 
          color,  
          "Number of Articles per each Section Name (top 15)")  


In [None]:
temp = articles_df.groupby(["graphical_appearance_name"])["article_id"].nunique()
df = pd.DataFrame({'Graphical Appearance Name': temp.index,'Articles': temp.values})
# keep only the 15 largest groups so that the colors match the plotted slices
df = df.sort_values(['Articles'], ascending=False).head(15)

fig, axes = plt.subplots(nrows=1, ncols=1, figsize=(22,10))
color = plt.cm.PiYG(np.linspace(0, 1, len(df)))

pie_chart(df.head(15),
          'Articles', 
          'Graphical Appearance Name',
          axes, 
          color,  
          "Number of Articles per each Graphical Appearance Name (top 15)")  

In [None]:
stopwords = set(STOPWORDS)

def show_wordcloud(data, title = None):
    wordcloud = WordCloud(background_color='#edf7ee',
                          stopwords=stopwords,
                          max_words=400,
                          max_font_size=40, 
                          scale=5,
                          colormap="spring",
                          random_state=1).generate(str(data))

    
    fig = plt.figure(1, figsize=(10,10))
    plt.axis('off')
    if title: 
        fig.suptitle(title, fontsize=14)
        fig.subplots_adjust(top=2.3)

    plt.imshow(wordcloud)
    plt.show()
    
    
    
show_wordcloud(articles_df["prod_name"], "Wordcloud from product name")

# 5. Data visualisations. Customers Data 📊

In [None]:
temp = customers_df.groupby(["age"])["customer_id"].count()
df = pd.DataFrame({'Age': temp.index,'Customers': temp.values})
df = df.sort_values(['Age'], ascending=False)


fig, axes = plt.subplots(nrows=1, ncols=1, figsize=(10,6))
color = plt.cm.cool(np.linspace(0, 1, len(df)))

bar_plot(df,
         'Age',
         'Customers',
         axes, 
         color, 
         "Number of Customers per each Age")

In [None]:
temp = customers_df.groupby(["fashion_news_frequency"])["customer_id"].count()
df = pd.DataFrame({'Fashion News Frequency': temp.index,'Customers': temp.values})
df = df.sort_values(['Customers'], ascending=False)

fig, axes = plt.subplots(nrows=1, ncols=1, figsize=(4,6))
color = plt.cm.seismic(np.linspace(0, 1, len(df)))

bar_plot(df,
         'Fashion News Frequency',
         'Customers',
         axes, 
         color, 
         "Number of Customers per each Fashion News Frequency")

In [None]:
temp = customers_df.groupby(["club_member_status"])["customer_id"].count()
df = pd.DataFrame({'Club Member Status': temp.index,'Customers': temp.values})
df = df.sort_values(['Customers'], ascending=False)

fig, axes = plt.subplots(nrows=1, ncols=1, figsize=(22,10))
color = plt.cm.summer(np.linspace(0, 1, len(df)))

pie_chart(df.head(15),
          'Customers', 
          'Club Member Status',
          axes, 
          color,  
          "Number of Customers per each Club Member Status") 

# 6. Data visualisations. Transactions Data 📊

In [None]:
df = transactions_train_df.sample(100_000)
fig, ax = plt.subplots(1, 1, figsize=(14, 7))
sns.kdeplot(np.log(df.loc[df["sales_channel_id"]==1].price.value_counts()),
           color="red")
sns.kdeplot(np.log(df.loc[df["sales_channel_id"]==2].price.value_counts()),
           color="blue")

ax.legend(labels=['Sales channel 1', 
                  'Sales channel 2'])

plt.title("Logarithmic distribution of price frequency \
in transactions, grouped per sales channel (100k sample)")

plt.show()
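
Note that the KDE in the cell above is estimated over the log of how often each distinct price occurs, not over the prices themselves. A small standard-library sketch of that transformation, on invented prices:

```python
import math
from collections import Counter

# Toy prices (invented values); the real column is transactions_train_df.price.
toy_prices = [0.0169, 0.0169, 0.0169, 0.0254, 0.0254, 0.0508]

# Count how often each distinct price occurs, then take the natural log of
# every count. These log-counts are the values a KDE would be fitted to.
price_counts = Counter(toy_prices)
log_counts = {price: math.log(count) for price, count in price_counts.items()}

print(log_counts)
```

A price that occurs only once therefore maps to 0, and frequent prices map to larger values.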

In [None]:
df = transactions_train_df.sample(100_000).groupby(["t_dat"])["article_id"].count().reset_index()
df["t_dat"] = df["t_dat"].apply(lambda x: datetime.strptime(x, '%Y-%m-%d'))
df.columns = ["Date", "Transactions"]

fig, ax = plt.subplots(1, 1, figsize=(16,6))
plt.plot(df["Date"], df["Transactions"], color="red")
plt.xlabel("Date")
plt.ylabel("Transactions")
plt.title(f"Transactions per day (100k sample)")
plt.show()
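
As an alternative to calling `strptime` row by row, the date column can be parsed in one go with `pd.to_datetime`. A minimal sketch on a toy frame (the rows are invented) that produces the same kind of per-day count table:

```python
import pandas as pd

# Toy stand-in for a sample of transactions_train_df.
toy_transactions = pd.DataFrame({
    "t_dat": ["2020-09-21", "2020-09-22", "2020-09-21", "2020-09-22", "2020-09-22"],
    "article_id": [1, 2, 3, 4, 5],
})

# Parse the whole column at once.
toy_transactions["t_dat"] = pd.to_datetime(toy_transactions["t_dat"], format="%Y-%m-%d")

# Number of transactions per day.
daily = toy_transactions.groupby("t_dat")["article_id"].count().reset_index()
daily.columns = ["Date", "Transactions"]

print(daily)
```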

In [None]:
df = transactions_train_df.sample(100_000).groupby(["t_dat", "sales_channel_id"])["article_id"].count().reset_index()
df["t_dat"] = df["t_dat"].apply(lambda x: datetime.strptime(x, '%Y-%m-%d'))

df.columns = ["Date", "Sales Channel Id", "Transactions"]

fig, ax = plt.subplots(1, 1, figsize=(16,6))
g1 = ax.plot(df.loc[df["Sales Channel Id"]==1, "Date"], 
             df.loc[df["Sales Channel Id"]==1, 
                    "Transactions"], 
             label="Sales Channel 1", 
             color="Blue")

g2 = ax.plot(df.loc[df["Sales Channel Id"]==2, "Date"], 
             df.loc[df["Sales Channel Id"]==2, 
                    "Transactions"], 
             label="Sales Channel 2", 
             color="Red")

plt.xlabel("Date")
plt.ylabel("Transactions")
ax.legend()
plt.title(f"Transactions per day, grouped by Sales Channel (100k sample)")
plt.show()

In [None]:
df = transactions_train_df.groupby(["t_dat", "sales_channel_id"])["article_id"].nunique().reset_index()
df["t_dat"] = df["t_dat"].apply(lambda x: datetime.strptime(x, '%Y-%m-%d'))
df.columns = ["Date", "Sales Channel Id", "Unique Articles"]

fig, ax = plt.subplots(1, 1, figsize=(16,6))
g1 = ax.plot(df.loc[df["Sales Channel Id"]==1, 
                    "Date"], 
             df.loc[df["Sales Channel Id"]==1, 
                    "Unique Articles"], 
             label="Sales Channel 1", 
             color="Blue")

g2 = ax.plot(df.loc[df["Sales Channel Id"]==2, 
                    "Date"], 
             df.loc[df["Sales Channel Id"]==2, 
                    "Unique Articles"], 
             label="Sales Channel 2", 
             color="Orange")

plt.xlabel("Date")
plt.ylabel("Unique Articles / Day")
ax.legend()
plt.title(f"Unique articles per day, grouped by Sales Channel")
plt.show()

# 7. Data visualisations. Image Data 📊

In [None]:
image_name_df = pd.DataFrame(images_names, columns = ["image_name"])
image_name_df["article_id"] = image_name_df["image_name"].apply(lambda x: int(x[1:]))
image_name_df.head().style.set_properties(**{'background-color': 'rgba(184,230,194,.5)'})

In [None]:
image_article_df = articles_df[["article_id", 
                                "product_code", 
                                "product_group_name", 
                                "product_type_name"]].merge(image_name_df, 
                                                            on=["article_id"], 
                                                            how="left")
image_article_df.head().style.set_properties(**{'background-color': 'rgba(184,230,194,.5)'})

In [None]:
# products without images
article_no_image_df = image_article_df.loc[image_article_df.image_name.isna()]
article_no_image_df.head().style.set_properties(**{'background-color': 'rgba(184,230,194,.5)'})
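
Another way to find articles without images is the `indicator` option of `merge`, which labels every row with the side it came from. A minimal sketch with invented ids:

```python
import pandas as pd

# Toy stand-ins for articles_df and image_name_df (ids are invented).
toy_articles = pd.DataFrame({"article_id": [100000001, 100000002, 100000003]})
image_names = ["0100000001", "0100000003"]
toy_images = pd.DataFrame({
    "image_name": image_names,
    "article_id": [int(name) for name in image_names],
})

# indicator=True adds a "_merge" column; "left_only" marks articles
# that found no matching image.
merged = toy_articles.merge(toy_images, on="article_id", how="left", indicator=True)
articles_without_image = merged.loc[merged["_merge"] == "left_only", "article_id"].tolist()

print(articles_without_image)
```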

Let's plot some image data.

In [None]:
def plot_image_samples(image_article_df, product_group_name, cols=1, rows=1):
    image_path = "../input/h-and-m-personalized-fashion-recommendations/images/"
    # keep only articles of this group that actually have an image file
    _df = image_article_df.loc[(image_article_df.product_group_name==product_group_name)
                               & image_article_df.image_name.notna()]
    article_ids = _df.article_id.values[0:cols*rows]
    plt.figure(figsize=(2 + 3 * cols, 2 + 4 * rows))
    # the group may contain fewer articles than cols * rows
    for i in range(len(article_ids)):
        # file names are the article id left-padded with zeros to 10 characters
        article_id = str(article_ids[i]).zfill(10)
        plt.subplot(rows, cols, i + 1)
        plt.axis('off')
        plt.title(f"{product_group_name} {article_id[:3]}\n{article_id}.jpg")
        image = Image.open(f"{image_path}{article_id[:3]}/{article_id}.jpg")
        plt.imshow(image)
        
plot_image_samples(image_article_df, "Garment Lower body", 5, 1)
plot_image_samples(image_article_df, "Accessories", 5, 1)
plot_image_samples(image_article_df, "Swimwear", 5, 1)
plot_image_samples(image_article_df, "Bags", 5, 1)
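
The path logic used when opening an image can be isolated in a tiny helper. This is a sketch only: `image_path_for` and its default `image_root` are hypothetical names, and the id below is invented.

```python
def image_path_for(article_id, image_root="images"):
    """Build the relative image path for an integer article id."""
    # Article ids lose their leading zero when read as integers,
    # so pad them back to 10 characters.
    padded = str(article_id).zfill(10)
    # Images are grouped in folders named after the first 3 characters.
    return f"{image_root}/{padded[:3]}/{padded}.jpg"


example_path = image_path_for(100000001)
print(example_path)
```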

# 8. Predictions ☂️
For this initial submission, I decided to follow the logic described in the notebook listed in the acknowledgements:

* if a customer has previous purchases, pick the most recent ones;
* if a customer has no purchases, pick the most frequently bought articles.
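
The two rules above can be sketched in plain Python before touching the real data. This is a toy illustration under stated assumptions: `recommend` is a hypothetical function, transactions are `(date, customer_id, article_id)` tuples with ISO dates, and "most frequently bought" is taken to mean the most frequent articles on the last available date. As in the rest of this section, short histories are topped up with popular articles without removing duplicates.

```python
from collections import Counter, defaultdict


def recommend(transactions, customer_ids, k=12):
    """Return {customer_id: list of up to k article ids}."""
    # Fallback: the most frequently bought articles on the last date.
    # ISO date strings compare chronologically.
    last_date = max(date for date, _, _ in transactions)
    popular = Counter(a for d, _, a in transactions if d == last_date)
    fallback = [article for article, _ in popular.most_common(k)]

    # Purchase history per customer, most recent purchases first.
    history = defaultdict(list)
    for _, customer, article in sorted(transactions, reverse=True):
        history[customer].append(article)

    result = {}
    for customer in customer_ids:
        picks = history.get(customer, [])[:k]
        # Top up short (or empty) histories with the popular articles.
        result[customer] = picks + fallback[: k - len(picks)]
    return result


toy = [
    ("2020-09-20", "c1", "A"),
    ("2020-09-22", "c1", "B"),
    ("2020-09-22", "c2", "C"),
    ("2020-09-22", "c3", "C"),
]
recs = recommend(toy, ["c1", "c4"], k=3)
print(recs)
```

Here `c1` gets its own purchases first and one popular article as a filler, while the unknown customer `c4` only gets the popular articles.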

In [None]:
transactions_train_df = transactions_train_df.sort_values(["customer_id", 
                                                           "t_dat"], 
                                                          ascending=False)
transactions_train_df.head().style.set_properties(**{'background-color': 'rgba(184,230,194,.5)'})

Let's first capture the most frequently bought articles on the most recent date.

In [None]:
last_date = transactions_train_df.t_dat.max()
print(last_date)
print(transactions_train_df.loc[transactions_train_df.t_dat==last_date].shape)
print()

most_frequent_articles = list(transactions_train_df.loc[transactions_train_df.t_dat==last_date].article_id.value_counts()[0:12].index)
art_list = []
for art in most_frequent_articles:
    art = "0"+str(art)
    art_list.append(art)
art_str = " ".join(art_list)
print("Frequent articles bought recently:", art_str, end="\n")

In [None]:
agg_df = transactions_train_df.groupby(["customer_id"])["article_id"].agg(lambda x: str(x.values[0:12])[1:-1]).reset_index()

In [None]:
def padding_articles(x):
    if x:
        xl = x.split()
        x = []
        for xi in xl:
            x.append("0"+xi)
        dimm_x = len(x)
        if dimm_x < 12:
            x.extend(art_list[:12-dimm_x])
        return(" ".join(x))
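
A possible variation of the padding step, shown only as a sketch: it works on a list of integer ids instead of a string, and it skips fallback articles the customer already has. Note that this de-duplication is a change compared with `padding_articles` above. `format_prediction` is a hypothetical name and the ids are invented.

```python
def format_prediction(article_ids, fallback_ids, k=12):
    """Build a submission string with up to k zero-padded article ids."""
    # Restore the leading zero that is lost when ids are stored as integers.
    picks = [str(article).zfill(10) for article in article_ids][:k]
    for candidate in fallback_ids:
        if len(picks) >= k:
            break
        # Skip fallback articles that are already in the list.
        if candidate not in picks:
            picks.append(candidate)
    return " ".join(picks)


toy_fallback = ["0100000003", "0100000001", "0100000004"]
prediction = format_prediction([100000001, 100000002], toy_fallback, k=4)
print(prediction)
```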

In [None]:
agg_df["article_id"] = agg_df["article_id"].apply(lambda x: padding_articles(x))
print("Aggregated transaction history: ", agg_df.customer_id.nunique())
print("Submission sample: ", sample_submission_df.customer_id.nunique())

We'll replace the values in the sample submission with those from the aggregated transactions data where they exist, and fall back to a default otherwise.

In [None]:
sample_submission_df.head().style.set_properties(**{'background-color': 'rgba(184,230,194,.5)'})

For customers with no articles, we simply use the most frequently bought articles from the most recent days.

In [None]:
submission_df = agg_df.merge(sample_submission_df[["customer_id"]], how="right")
submission_df.columns = ["customer_id", "prediction"]
print(submission_df.shape)
submission_df.head().style.set_properties(**{'background-color': 'rgba(184,230,194,.5)'})

In [None]:
print("Rows with missing data in submission: ", submission_df.loc[submission_df.prediction.isna()].shape[0])

We replace the missing predictions with the most frequently bought recent articles, which we calculated earlier.

In [None]:
submission_df.loc[submission_df.prediction.isna(), ["prediction"]] = art_str
print("Rows with missing data in submission: ", submission_df.loc[submission_df.prediction.isna()].shape[0])
submission_df.to_csv("submission.csv", index=False)
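
Before uploading, it can be useful to sanity-check the prediction strings. The sketch below assumes, based on the zero-padding used in this notebook and the note on the sample submission, that a valid prediction holds between 1 and 12 ids of exactly 10 digits each. `is_valid_prediction` is a hypothetical helper and the ids are invented.

```python
def is_valid_prediction(prediction, k=12):
    """Check one prediction cell: 1..k ids, each exactly 10 digits."""
    ids = prediction.split()
    if not 0 < len(ids) <= k:
        return False
    return all(len(article) == 10 and article.isdigit() for article in ids)


good = is_valid_prediction("0100000001 0100000002")
missing_zero = is_valid_prediction("100000001")
too_many = is_valid_prediction(" ".join(["0100000001"] * 13))
print(good, missing_zero, too_many)
```

On the real frame this could be applied with something like `submission_df.prediction.apply(is_valid_prediction).all()`.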

# 9. Conclusion 💖
Thank you for reading my new article! **If you liked it, please upvote 💖**

My other articles:
* [House Prices Regression sklearn](https://www.kaggle.com/maricinnamon/house-prices-regression-sklearn)
* [Harry Potter Movies Dataset | Starter Notebook](https://www.kaggle.com/maricinnamon/harry-potter-movies-dataset-starter-notebook)
* [Automobile Customer Clustering (K-means & PCA)](https://www.kaggle.com/maricinnamon/automobile-customer-clustering-k-means-pca)