# Introduction

**If you like my notebook, please support me by upvoting!**

# What is the Problem?
H&M Group is a family of brands and businesses with 53 online markets and approximately 4,850 stores. Their online store offers shoppers an extensive selection of products to browse through. But with too many choices, customers might not quickly find what interests them or what they are looking for, and ultimately, they might not make a purchase. To enhance the shopping experience, product recommendations are key. More importantly, helping customers make the right choices also has positive implications for sustainability, as it reduces returns and thereby minimizes emissions from transportation.

In this competition, H&M Group invites us to develop product recommendations based on data from previous transactions, as well as from customer and product metadata. The available metadata ranges from simple data, such as garment type and customer age, to text data from product descriptions and image data from garment images.

There are no preconceptions about what information may be useful; that is for us to find out. Whether we investigate a categorical data type algorithm or dive into NLP and image-processing deep learning is up to us.

# Approach
1. Understanding the Data
2. EDA
3. Data Cleaning
4. Modelling
5. Prediction & Submission

# Checking Working Directory

In [None]:
#Checking current working directory!
import os
cwd = os.getcwd()
print("Your current working directory is:", cwd)

# Importing Python Libraries

In [None]:
import numpy as np
import pandas as pd
import os
from os import listdir
from tqdm import tqdm
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud, STOPWORDS
from PIL import Image

# Loading the Dataset

In [None]:
articlesData = pd.read_csv("../input/h-and-m-personalized-fashion-recommendations/articles.csv")
customersData = pd.read_csv("../input/h-and-m-personalized-fashion-recommendations/customers.csv")
transactionsData = pd.read_csv("../input/h-and-m-personalized-fashion-recommendations/transactions_train.csv")
submissionData = pd.read_csv("/kaggle/input/h-and-m-personalized-fashion-recommendations/sample_submission.csv")

In [None]:
images_dir = '../input/h-and-m-personalized-fashion-recommendations/images'
cat_images = [f for f in listdir(images_dir)]

# Statistics of data

In [None]:
articlesData.info()

In [None]:
articlesData.head()

In [None]:
customersData.info()

In [None]:
customersData.head()

In [None]:
transactionsData.info()

In [None]:
transactionsData.head()

# Data Visualization

In [None]:
temp = articlesData.groupby(["product_group_name"])["product_type_name"].nunique()
df = pd.DataFrame({'Product Group': temp.index,
                   'Product Types': temp.values
                  })
df = df.sort_values(['Product Types'], ascending=False)
plt.figure(figsize = (8,6))
plt.title('Number of Product Types per Product Group')
sns.set_color_codes("pastel")
s = sns.barplot(x='Product Group', y='Product Types', data=df)
# Rotate the x tick labels so long group names stay readable
plt.xticks(rotation=90)
plt.show()

In [None]:
temp = articlesData.groupby(["product_group_name"])["article_id"].nunique()
df = pd.DataFrame({'Product Group': temp.index,
                   'Articles': temp.values
                  })
df = df.sort_values(['Articles'], ascending=False)
plt.figure(figsize = (8,6))
plt.title('Number of Articles per Product Group')
sns.set_color_codes("pastel")
s = sns.barplot(x='Product Group', y='Articles', data=df)
# Rotate the x tick labels so long group names stay readable
plt.xticks(rotation=90)
plt.show()

In [None]:
temp = articlesData.groupby(["product_type_name"])["article_id"].nunique()
df = pd.DataFrame({'Product Type': temp.index,
                   'Articles': temp.values
                  })
total_types = len(df['Product Type'].unique())
df = df.sort_values(['Articles'], ascending=False)[0:50]
plt.figure(figsize = (16,6))
plt.title(f'Number of Articles per Product Type (top 50 of {total_types})')
sns.set_color_codes("pastel")
s = sns.barplot(x='Product Type', y='Articles', data=df)
# Rotate the x tick labels so long type names stay readable
plt.xticks(rotation=90)
plt.show()



In [None]:
temp = articlesData.groupby(["department_name"])["article_id"].nunique()
df = pd.DataFrame({'Department Name': temp.index,
                   'Articles': temp.values
                  })
total_depts = len(df['Department Name'].unique())
df = df.sort_values(['Articles'], ascending=False).head(50)
plt.figure(figsize = (16,6))
plt.title(f'Number of Articles per Department (top 50 of {total_depts})')
sns.set_color_codes("pastel")
s = sns.barplot(x='Department Name', y='Articles', data=df)
# Rotate the x tick labels so long department names stay readable
plt.xticks(rotation=90)
plt.show()
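
The four cells above repeat the same group, count and sort pattern. A minimal sketch of how that step could be factored into one helper is shown here. The function name `top_unique_counts` and the toy frame are assumptions made for illustration and are not part of the original notebook.

```python
import pandas as pd


def top_unique_counts(frame, group_col, value_col, top_n=None):
    """Count distinct `value_col` values per `group_col`, largest first."""
    counts = (
        frame.groupby(group_col)[value_col]
        .nunique()
        .sort_values(ascending=False)
        .rename("count")
        .reset_index()
    )
    return counts if top_n is None else counts.head(top_n)


# Toy stand-in for articlesData (values are illustrative only)
toy_articles = pd.DataFrame({
    "product_group_name": ["Shoes", "Shoes", "Bags", "Shoes"],
    "article_id": [1, 2, 3, 4],
})
toy_counts = top_unique_counts(toy_articles, "product_group_name", "article_id")
print(toy_counts)
```

The returned frame could then be handed to `sns.barplot` with `x` set to the grouping column and `y="count"`, in the same way as the cells above.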

# Image Data & Visualization 

In [None]:
total_folders = total_files = 0
folder_info = []
images_names = []
for base, dirs, files in tqdm(os.walk('/kaggle/input/h-and-m-personalized-fashion-recommendations/')):
    for directories in dirs:
        folder_info.append((directories, len(os.listdir(os.path.join(base, directories)))))
        total_folders += 1
    for _files in files:
        total_files += 1
        # Keep the file name without its extension for every JPEG image
        if _files.endswith(".jpg"):
            images_names.append(_files[:-len(".jpg")])
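
As an alternative to the `os.walk` loop, the image names could probably also be gathered with `pathlib`. This is only a sketch: `collect_image_names` is a hypothetical helper, and the demo builds a throwaway directory with illustrative file names instead of reading the Kaggle input folder.

```python
import tempfile
from pathlib import Path


def collect_image_names(root):
    """Return the extension-less names of all .jpg files below `root`, sorted."""
    return sorted(p.stem for p in Path(root).rglob("*.jpg"))


# Self-contained demo on a temporary directory that mimics the images/ layout
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp) / "images" / "010"
    folder.mkdir(parents=True)
    (folder / "0108775015.jpg").touch()
    (folder / "0108775044.jpg").touch()
    (Path(tmp) / "articles.csv").touch()  # non-image files are ignored
    demo_names = collect_image_names(tmp)

print(demo_names)
```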

In [None]:
imageData = pd.DataFrame(images_names, columns = ["image_name"])
imageData["article_id"] = imageData["image_name"].apply(lambda x: int(x[1:]))
imageData.head(10)


In [None]:
imageArticle = articlesData[["article_id", "product_code", "product_group_name", "product_type_name"]].merge(imageData, on=["article_id"], how="left")
print(imageArticle.shape)
imageArticle.head()

In [None]:
article_no_image_df = imageArticle.loc[imageArticle.image_name.isna()]
print(article_no_image_df.shape)
article_no_image_df.head()

In [None]:
print("Product codes without images: ", article_no_image_df.product_code.nunique())
print("Product group names without images: ", list(article_no_image_df.product_group_name.unique()))

In [None]:
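def plot_image_samples(imageArticle, product_group_name, cols=1, rows=1):
    """Plot up to cols * rows sample images for one product group."""
    image_path = "/kaggle/input/h-and-m-personalized-fashion-recommendations/images/"
    # Keep only articles of this group that actually have an image file
    _df = imageArticle.loc[
        (imageArticle.product_group_name == product_group_name)
        & imageArticle.image_name.notna()
    ]
    article_ids = _df.article_id.values[:cols * rows]
    if len(article_ids) == 0:
        print(f"No images found for product group: {product_group_name}")
        return
    plt.figure(figsize=(2 + 3 * cols, 2 + 4 * rows))
    # Iterate over the ids found, which may be fewer than cols * rows
    for i, raw_id in enumerate(article_ids):
        # Restore the leading zero dropped when article_id was read as int
        article_id = ("0" + str(raw_id))[-10:]
        plt.subplot(rows, cols, i + 1)
        plt.axis('off')
        plt.title(f"{product_group_name} {article_id[:3]}\n{article_id}.jpg")
        image = Image.open(f"{image_path}{article_id[:3]}/{article_id}.jpg")
        plt.imshow(image)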
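
The path logic inside `plot_image_samples` (zero-pad the id to 10 characters, then use its first three characters as the sub-folder) can be isolated in a small helper. The sketch below assumes a hypothetical function name `article_image_path` and uses `str.zfill` in place of the slicing trick, which should give the same result for 9- or 10-digit ids.

```python
from pathlib import PurePosixPath

IMAGES_DIR = PurePosixPath(
    "/kaggle/input/h-and-m-personalized-fashion-recommendations/images"
)


def article_image_path(article_id, images_dir=IMAGES_DIR):
    """Map an integer (or string) article id to its expected image path."""
    padded = str(article_id).zfill(10)  # restore leading zeros, 10 characters total
    return images_dir / padded[:3] / f"{padded}.jpg"


example_path = article_image_path(108775015)
print(example_path)
```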

In [None]:
print(imageArticle.product_group_name.unique())

In [None]:
plot_image_samples(imageArticle, "Garment Upper body", 5, 1)

In [None]:
plot_image_samples(imageArticle, "Underwear", 5, 1)

In [None]:
plot_image_samples(imageArticle, "Socks & Tights", 5, 1)

In [None]:
plot_image_samples(imageArticle, "Garment Lower body", 5, 1)

In [None]:
plot_image_samples(imageArticle, "Accessories", 5, 1)

In [None]:
plot_image_samples(imageArticle, "Items", 5, 1)

In [None]:
plot_image_samples(imageArticle, "Nightwear", 5, 1)

In [None]:
plot_image_samples(imageArticle, "Unknown", 5, 1)

In [None]:
plot_image_samples(imageArticle, "Underwear/nightwear", 5, 1)

In [None]:
plot_image_samples(imageArticle, "Shoes", 2, 1)

In [None]:
plot_image_samples(imageArticle, "Swimwear", 5, 1)

In [None]:
plot_image_samples(imageArticle, "Garment Full body", 5, 1)

In [None]:
plot_image_samples(imageArticle, "Cosmetic", 5, 1)

In [None]:
plot_image_samples(imageArticle, "Interior textile", 3, 1)

In [None]:
plot_image_samples(imageArticle, "Bags", 5, 1)

In [None]:
plot_image_samples(imageArticle, "Furniture", 5, 1)

In [None]:
plot_image_samples(imageArticle, "Garment and Shoe care", 5, 1)

In [None]:
plot_image_samples(imageArticle, "Fun", 2, 1)

In [None]:
plot_image_samples(imageArticle, "Stationery", 5, 1)

# Modelling & Submitting

In [None]:
from pathlib import Path

data_path = Path('/kaggle/input/h-and-m-personalized-fashion-recommendations/')
df = pd.read_csv(
    data_path / 'transactions_train.csv',
    # set dtype or pandas will drop the leading '0' and convert to int
    dtype={'article_id': str} 
)
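
The comment about `dtype` can be checked on a tiny in-memory CSV. This is a minimal sketch with one illustrative row standing in for `transactions_train.csv`; it only shows how the leading zero is lost when pandas infers the column type.

```python
import io

import pandas as pd

# Tiny in-memory CSV standing in for transactions_train.csv (illustrative row)
csv_text = "customer_id,article_id\nc1,0108775015\n"

inferred = pd.read_csv(io.StringIO(csv_text))
as_text = pd.read_csv(io.StringIO(csv_text), dtype={"article_id": str})

print(inferred["article_id"].iloc[0])  # parsed as an integer, leading zero lost
print(as_text["article_id"].iloc[0])   # kept as the original 10-character string
```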

In [None]:
print(df.shape)
df.head()

In [None]:
df['t_dat'] = pd.to_datetime(df['t_dat'])

In [None]:
# Keep only recent transactions: roughly the last 3, 2 and 1 weeks of the training data
df_3_week = df[df['t_dat'] >= pd.to_datetime('2020-08-31')].copy()
df_2_week = df[df['t_dat'] >= pd.to_datetime('2020-09-07')].copy()
df_1_week = df[df['t_dat'] >= pd.to_datetime('2020-09-15')].copy()
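
Instead of hard-coding the dates, the cutoffs could be derived from the last transaction date. The sketch below is only an illustration: `week_cutoffs` is a hypothetical helper, and the last date of 2020-09-22 is an assumption that should be verified with `df['t_dat'].max()`.

```python
from datetime import date, timedelta


def week_cutoffs(last_day, n_weeks=3):
    """Return {k: cutoff date}, where each cutoff lies k weeks before last_day."""
    return {k: last_day - timedelta(weeks=k) for k in range(1, n_weeks + 1)}


# Assumption: the last transaction date is 2020-09-22 (check df['t_dat'].max())
cutoffs = week_cutoffs(date(2020, 9, 22))
for k, cutoff in cutoffs.items():
    print(k, cutoff.isoformat())
```

Under that assumption the 1-week cutoff matches the hard-coded `'2020-09-15'`, while the 2-week and 3-week values come out one day later than the dates used above. Whether that difference matters for the score is not something this sketch can tell.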

In [None]:
# Count how many times each customer bought each article in the last 3 weeks
purchase_dict_3_week = {}

for cust_id, art_id in zip(df_3_week['customer_id'], df_3_week['article_id']):
    if cust_id not in purchase_dict_3_week:
        purchase_dict_3_week[cust_id] = {}

    if art_id not in purchase_dict_3_week[cust_id]:
        purchase_dict_3_week[cust_id][art_id] = 0

    purchase_dict_3_week[cust_id][art_id] += 1

print(len(purchase_dict_3_week))

# The 12 best-selling articles of this window, used to pad short predictions
dummy_list_3_week = list((df_3_week['article_id'].value_counts()).index)[:12]

In [None]:
# Count how many times each customer bought each article in the last 2 weeks
purchase_dict_2_week = {}

for cust_id, art_id in zip(df_2_week['customer_id'], df_2_week['article_id']):
    if cust_id not in purchase_dict_2_week:
        purchase_dict_2_week[cust_id] = {}

    if art_id not in purchase_dict_2_week[cust_id]:
        purchase_dict_2_week[cust_id][art_id] = 0

    purchase_dict_2_week[cust_id][art_id] += 1

print(len(purchase_dict_2_week))

# The 12 best-selling articles of this window, used to pad short predictions
dummy_list_2_week = list((df_2_week['article_id'].value_counts()).index)[:12]

In [None]:
# Count how many times each customer bought each article in the last week
purchase_dict_1_week = {}

for cust_id, art_id in zip(df_1_week['customer_id'], df_1_week['article_id']):
    if cust_id not in purchase_dict_1_week:
        purchase_dict_1_week[cust_id] = {}

    if art_id not in purchase_dict_1_week[cust_id]:
        purchase_dict_1_week[cust_id][art_id] = 0

    purchase_dict_1_week[cust_id][art_id] += 1

print(len(purchase_dict_1_week))

# The 12 best-selling articles of this window, used to pad short predictions
dummy_list_1_week = list((df_1_week['article_id'].value_counts()).index)[:12]
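
The three cells above differ only in the window they read from. A possible way to share the counting logic is sketched below with `collections.Counter`. The helper name `build_purchase_counts` and the toy pairs are assumptions for illustration; real input would be something like `zip(df_1_week['customer_id'], df_1_week['article_id'])`.

```python
from collections import Counter, defaultdict


def build_purchase_counts(pairs):
    """Return ({customer: Counter(article -> purchases)}, overall article Counter)."""
    per_customer = defaultdict(Counter)
    overall = Counter()
    for cust_id, art_id in pairs:
        per_customer[cust_id][art_id] += 1
        overall[art_id] += 1
    return dict(per_customer), overall


# Illustrative (customer_id, article_id) pairs, not real data
toy_pairs = [
    ("c1", "0001"), ("c1", "0001"), ("c1", "0002"),
    ("c2", "0002"), ("c3", "0002"),
]
toy_purchases, toy_popularity = build_purchase_counts(toy_pairs)

print(toy_purchases["c1"])
# Most popular articles of the window, best seller first (at most 12)
top_articles = [art for art, _ in toy_popularity.most_common(12)]
print(top_articles)
```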

In [None]:
print(submissionData.shape)
submissionData.head()

In [None]:
# Copy so that adding the prediction column does not write into a slice of submissionData
need_improvemnet_model = submissionData[['customer_id']].copy()
prediction_list = []

# Fallback for customers with no purchase in any window: top 12 articles of the last 2 weeks
dummy_list = list((df_2_week['article_id'].value_counts()).index)[:12]
dummy_pred = ' '.join(dummy_list)

for cust_id in submissionData['customer_id']:
    if cust_id in purchase_dict_1_week:
        l = sorted((purchase_dict_1_week[cust_id]).items(), key=lambda x: x[1], reverse=True)
        l = [y[0] for y in l]
        if len(l)>12:
            s = ' '.join(l[:12])
        else:
            s = ' '.join(l+dummy_list_1_week[:(12-len(l))])
    elif cust_id in purchase_dict_2_week:
        l = sorted((purchase_dict_2_week[cust_id]).items(), key=lambda x: x[1], reverse=True)
        l = [y[0] for y in l]
        if len(l)>12:
            s = ' '.join(l[:12])
        else:
            s = ' '.join(l+dummy_list_2_week[:(12-len(l))])
    elif cust_id in purchase_dict_3_week:
        l = sorted((purchase_dict_3_week[cust_id]).items(), key=lambda x: x[1], reverse=True)
        l = [y[0] for y in l]
        if len(l)>12:
            s = ' '.join(l[:12])
        else:
            s = ' '.join(l+dummy_list_3_week[:(12-len(l))])
    else:
        s = dummy_pred
    prediction_list.append(s)

need_improvemnet_model['prediction'] = prediction_list
print(need_improvemnet_model.shape)
need_improvemnet_model.head()
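
The cascade above (most recent window first, then older windows, then a global fallback) can be written as one function. The sketch below uses a hypothetical name `recommend` and toy data. It also skips popular articles the customer already has in the list when padding, which is a variation and not what the cell above does.

```python
def recommend(cust_id, windows, default, k=12):
    """Pick up to k article ids for one customer.

    windows: list of (purchase_dict, popular_list) pairs, most recent first.
    default: list used when the customer appears in no window.
    """
    for purchase_dict, popular in windows:
        if cust_id in purchase_dict:
            bought = purchase_dict[cust_id]
            # Most frequently bought articles first
            ranked = sorted(bought, key=bought.get, reverse=True)
            # Pad with popular articles, skipping ones already in the list
            for art_id in popular:
                if len(ranked) >= k:
                    break
                if art_id not in ranked:
                    ranked.append(art_id)
            return ranked[:k]
    return list(default[:k])


# Toy data (illustrative ids only), with k=3 to keep the output short
recent = ({"c1": {"A": 2, "B": 1}}, ["B", "C", "D"])
older = ({"c2": {"E": 1}}, ["C", "B", "A"])
fallback = ["X", "Y", "Z"]
for customer in ["c1", "c2", "c9"]:
    print(customer, " ".join(recommend(customer, [recent, older], fallback, k=3)))
```

With the real data, `windows` would hold the 1-, 2- and 3-week dictionaries together with their `dummy_list_*` counterparts, and `default` would be `dummy_list`.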

In [None]:
need_improvemnet_model.to_csv('submission.csv', index=False)
need_improvemnet_model.head()

# References
- https://www.kaggle.com/chiranjeevbit/h-m-personalized-recommendation-eda-wordcloud
- https://www.kaggle.com/remekkinas/h-m-eda-first-look-into-data#DATASET-INFORMATION
- https://www.kaggle.com/jillanisofttech/h-m-personalized-fashion-recommendation

**This notebook will be updated. If you like it, please support me by upvoting.**