# Introduction

H&M Group is a family of brands and businesses with 53 online markets and approximately 4,850 stores. Our online store offers shoppers an extensive selection of products to browse through. But with too many choices, customers might not quickly find what interests them or what they are looking for, and ultimately, they might not make a purchase. To enhance the shopping experience, product recommendations are key. More importantly, helping customers make the right choices also has positive implications for sustainability, as it reduces returns and thereby minimizes emissions from transportation.

In this competition, H&M Group invites you to develop product recommendations based on data from previous transactions, as well as from customer and product meta data. The available meta data spans from simple data, such as garment type and customer age, to text data from product descriptions, to image data from garment images.

There are no preconceptions about what information may be useful; that is for you to find out. If you want to investigate a categorical data type algorithm, or dive into NLP and image processing deep learning, that is up to you.
# Files

* images/ - a folder of images corresponding to each article_id; images are placed in subfolders starting with the first three digits of the article_id; note, not all article_id values have a corresponding image.
* articles.csv - detailed metadata for each article_id available for purchase
* customers.csv - metadata for each customer_id in the dataset
* sample_submission.csv - a sample submission file in the correct format
* transactions_train.csv - the training data, consisting of the purchases of each customer for each date, as well as additional information. Duplicate rows correspond to multiple purchases of the same item. Your task is to predict the article_ids each customer will purchase during the 7-day period immediately after the training data period.
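
The folder layout described for `images/` can be turned into a small path helper. This is only a sketch under assumptions that the file list does not state: that article IDs are 10 digits long with leading zeros, and that each image is named `<article_id>.jpg`. The function name `image_path`, the `root` argument and the ID used below are illustrative.

```python
from pathlib import Path


def image_path(article_id, root="images"):
    """Build the expected image path for an article (hypothetical helper)."""
    # Assumption: IDs are 10 digits with leading zeros, which are lost
    # when the column is parsed as an integer.
    padded = str(article_id).zfill(10)
    # The subfolder is named after the first three digits of the padded ID.
    return Path(root) / padded[:3] / f"{padded}.jpg"


example = image_path(108775015)
print(example.as_posix())
```

Since not every article_id has an image, a caller would still want to check `example.exists()` before opening the file.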

NOTE: You must make predictions for all customer_id values found in the sample submission. All customers who made purchases during the test period are scored, regardless of whether they had purchase history in the training data.
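
Because customers without any purchase history are scored too, it can be useful to know who they are. The sketch below uses toy data frames with made-up values in place of `customers.csv` and `transactions_train.csv`; only the `customer_id` and `article_id` column names come from the description above.

```python
import pandas as pd

# Toy stand-ins for customers.csv and transactions_train.csv (made-up values).
customers_toy = pd.DataFrame({"customer_id": ["a", "b", "c", "d"]})
transactions_toy = pd.DataFrame({
    "customer_id": ["a", "a", "c"],
    "article_id": [1, 1, 2],
})

# Customers that never appear in the transaction log have no history to learn from.
has_history = customers_toy["customer_id"].isin(transactions_toy["customer_id"])
cold_start = customers_toy.loc[~has_history, "customer_id"].tolist()
print(cold_start)
```

For these customers a recommender has to fall back on something other than their own purchases, for example customer metadata or overall popularity.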

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import plotly.express as px
import matplotlib.pyplot as plt
import plotly.graph_objects as go
from collections import Counter

In [None]:
article = pd.read_csv("../input/h-and-m-personalized-fashion-recommendations/articles.csv")
customer = pd.read_csv("../input/h-and-m-personalized-fashion-recommendations/customers.csv")
transaction = pd.read_csv("../input/h-and-m-personalized-fashion-recommendations/transactions_train.csv")

In [None]:
article.head()

In [None]:
customer

In [None]:
transaction

# Data Cleaning

Since the customer dataset has missing values, we first have to see how many there are in each column.

Notes from the competition host:

- customer dataset: FN indicates whether a customer receives the Fashion News newsletter; Active indicates whether the customer is active for communication.
- transactions_train: sales channel id is 2 for online and 1 for store.
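
The host's note on sales channels can be applied as a simple lookup. This is a minimal sketch on toy data; the column name `sales_channel_id` is an assumption about the transactions file, and the values are made up.

```python
import pandas as pd

# Toy transactions (made-up values); `sales_channel_id` is an assumed column name.
toy = pd.DataFrame({"sales_channel_id": [1, 2, 2, 1, 2]})

# Mapping taken from the host's note: 1 = store, 2 = online.
channel_labels = {1: "store", 2: "online"}
toy["sales_channel"] = toy["sales_channel_id"].map(channel_labels)

channel_counts = toy["sales_channel"].value_counts()
print(channel_counts)
```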

In [None]:
miss_value_count = customer.isnull().sum()
miss_value_count
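
Raw counts are easier to judge next to the share of rows they represent. The sketch below shows one way to get that share, using a toy frame with made-up values; the `age` column name is an assumption.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the customer table (made-up values).
toy_customer = pd.DataFrame({
    "FN": [1.0, np.nan, np.nan, 1.0],
    "Active": [1.0, np.nan, np.nan, np.nan],
    "age": [25.0, 31.0, np.nan, 40.0],
})

# The mean of the boolean mask is the fraction of missing rows per column.
missing_share = toy_customer.isnull().mean().mul(100).round(1)
print(missing_share)
```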

Fill the empty fields with zero

In [None]:
customer.fillna(0, inplace=True)
customer
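
Calling `fillna(0)` on the whole frame also writes 0 into columns where zero is not a meaningful value, such as an age. One possible alternative, which is not what this notebook does, is to fill only the flag columns. The sketch uses a toy frame with made-up values, and the `age` column name is an assumption.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the customer table before renaming (made-up values).
toy_customer = pd.DataFrame({
    "FN": [1.0, np.nan, 1.0],
    "Active": [1.0, np.nan, np.nan],
    "age": [25.0, np.nan, 40.0],
})

# Fill only the 0/1 flag columns; other columns keep their NaN values.
flag_columns = ["FN", "Active"]
toy_customer[flag_columns] = toy_customer[flag_columns].fillna(0)

remaining_missing = toy_customer.isnull().sum()
print(remaining_missing)
```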

Renaming columns for better understanding

In [None]:
customer = customer.rename(columns={"FN":"Fashion_News_newsletter", "Active": "Active_communication"})
customer

In [None]:
customer["Fashion_News_newsletter"].equals(customer.Active_communication)

Since the two columns do not hold identical values, they carry different information, so we shouldn't combine them.

In [None]:
# Note: despite its name, Diff is "1.0" where the two flags match and "0" where they differ
customer["Diff"] = np.where(customer.Fashion_News_newsletter == customer.Active_communication, "1.0", "0")
# Rows where the two flags match
print(customer.loc[customer.Diff=="1.0", ["Fashion_News_newsletter", "Active_communication", "Diff"]],end='')
# Rows where the two flags differ
customer.loc[customer.Active_communication != customer.Fashion_News_newsletter, ["Fashion_News_newsletter", "Active_communication", "Diff"]]
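
A cross-tabulation gives a compact summary of how often the two flags agree. This is a sketch on toy data with made-up values, using the renamed column names from above.

```python
import pandas as pd

# Toy stand-in for the two renamed flag columns (made-up values).
toy = pd.DataFrame({
    "Fashion_News_newsletter": [1.0, 1.0, 0.0, 0.0, 1.0],
    "Active_communication": [1.0, 0.0, 0.0, 0.0, 1.0],
})

# Rows are newsletter values, columns are communication values, cells are row counts.
agreement = pd.crosstab(toy["Fashion_News_newsletter"], toy["Active_communication"])
print(agreement)

# Off-diagonal cells are the rows where the two flags disagree.
mismatches = int((toy["Fashion_News_newsletter"] != toy["Active_communication"]).sum())
print(mismatches)
```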

# Exploratory data analysis

Let's start with some pie charts for the article dataset.

The charts use Plotly Express, imported above as `px`.

In [None]:
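# Take labels and counts from the same value_counts() result so they stay aligned;
# unique() returns first-appearance order, which does not match the count order
count = article.product_group_name.value_counts()
Types = count.index
fig = px.pie(values=count.values, names=Types, title="Product Group by population", color_discrete_sequence=px.colors.qualitative.Alphabet)
fig.show()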
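
When labels and counts for a chart are computed separately, their order matters: `unique()` returns values in order of first appearance, while `value_counts()` sorts by frequency. The toy series below (made-up values) shows the two orders diverging, which is why taking both labels and counts from the same `value_counts()` result is the safer pairing.

```python
import pandas as pd

# Toy product groups (made-up values).
groups = pd.Series([
    "Accessories", "Garment Upper body", "Garment Upper body",
    "Shoes", "Garment Upper body", "Shoes",
])

first_seen = list(groups.unique())   # order of first appearance
counts = groups.value_counts()       # sorted by frequency, descending
by_frequency = list(counts.index)

print(first_seen)
print(by_frequency)
```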

In [None]:
# Labels and counts both come from value_counts() so each slice gets the right name
count1 = article.index_group_name.value_counts()
label = count1.index
fig = px.pie(values=count1.values, title="Pie Chart on Type on Products", names=label, color_discrete_sequence=px.colors.sequential.Purpor)
fig.show()

Sunburst Chart

The sunburst chart is ideal for displaying hierarchical data. Each level of the hierarchy is represented by one ring or circle with the innermost circle as the top of the hierarchy.

In [None]:
fig = px.sunburst(article, path=['index_name', 'product_group_name', 'garment_group_name'],width=800,
    height=800,color_discrete_sequence=px.colors.cyclical.Edge)
fig.show()
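
The ring sizes in such a chart follow from row counts at each level of the path. The sketch below reproduces that counting with `groupby` on a toy frame; the rows are made up, and only the three column names come from the cell above.

```python
import pandas as pd

# Toy stand-in for the article table (made-up rows).
toy_article = pd.DataFrame({
    "index_name": ["Ladieswear", "Ladieswear", "Ladieswear", "Menswear"],
    "product_group_name": ["Garment Upper body", "Garment Upper body", "Shoes", "Shoes"],
    "garment_group_name": ["Jersey Basic", "Blouses", "Shoes", "Shoes"],
})

path = ["index_name", "product_group_name", "garment_group_name"]

# Outermost ring: one sector per full path, sized by its number of rows.
leaf_sizes = toy_article.groupby(path).size()
# Innermost ring: each top-level value covers all rows beneath it.
root_sizes = toy_article.groupby(path[0]).size()

print(leaf_sizes)
print(root_sizes)
```

Each inner sector is the sum of the sectors nested under it, so the totals of every ring are equal.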

In [None]:
#mask = np.array(Image.open('../input/maskiiiii/10356-light-blue-dress-design.png'))
#mask

In [None]:
#transaction