# <b>1 <span style='color:#c9184a'>|</span> Introduction</b>

<center><img src='https://media-cldnry.s-nbcnews.com/image/upload/t_social_share_1200x630_center,f_auto,q_auto:best/newscms/2017_24/1222336/hm-today-170616-tease.jpg' width =650></center>

In this notebook, I develop product recommendations based on data from previous H&M transactions, as well as on customer and product metadata. The available metadata ranges from simple fields, such as garment type and customer age, to text from product descriptions and images of the garments.

<div class="alert alert-block alert-info">Throughout this notebook, the plots are made with <b>plotly</b>, whose interactive charts make the data easier to explore. <br> Sit back, toggle the sidebar, and enjoy!</div>

In [None]:
import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
import seaborn as sns
from tabulate import tabulate
from plotly.offline import plot, iplot, init_notebook_mode
import plotly.express as px
import plotly.graph_objects as go
init_notebook_mode(connected=True)
from datetime import datetime

<div style="color:white;display:fill;border-radius:8px;
            background-color:#323232;font-size:150%;
            font-family:Nexa;letter-spacing:0.5px">
    <p style="padding: 8px;color:white;"><b>1.1 | Looking at the Data</b></p>
</div>


In [None]:
l = os.listdir('/kaggle/input/h-and-m-personalized-fashion-recommendations/')
print(f"Folders: {l}")

First, let us focus on the three CSV files available: `articles.csv`, `transactions_train.csv` and `customers.csv`.

In [None]:
articles = pd.read_csv('/kaggle/input/h-and-m-personalized-fashion-recommendations/articles.csv')
articles.info()

In [None]:
articles.head()

In [None]:
customers = pd.read_csv('/kaggle/input/h-and-m-personalized-fashion-recommendations/customers.csv')
customers.info()

In [None]:
customers.head()

In [None]:
transactions = pd.read_csv('/kaggle/input/h-and-m-personalized-fashion-recommendations/transactions_train.csv')
transactions.info()

In [None]:
transactions.head()

<div style="color:white;display:fill;border-radius:8px;
            background-color:#323232;font-size:150%;
            font-family:Nexa;letter-spacing:0.5px">
    <p style="padding: 8px;color:white;"><b>1.2 | Missing Values</b></p>
</div>

In [None]:
def missing_values(data):
    total = data.isnull().sum().sort_values(ascending = False)
    percent = (data.isnull().sum()/data.isnull().count()*100).sort_values(ascending = False)
    return pd.concat([total, percent], axis=1, keys=['Number of Missing Values', 'Percentage'])

#### Articles

In [None]:
articles_missing=missing_values(articles)
articles_missing.loc[articles_missing['Percentage']>0]

#### Customers

In [None]:
customers_missing=missing_values(customers)
customers_missing.loc[customers_missing['Percentage']>0]

#### Transactions

In [None]:
transactions_missing=missing_values(transactions)
transactions_missing.loc[transactions_missing['Percentage']>0]

Based on the above, we can see that:
* In the articles dataframe, the only missing data is for *detail_desc*, the detailed description of the article.
* In the customers dataframe, a significant amount of data is missing in *FN* and *Active*.
* There is no missing data in the transactions dataframe.
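The same kind of summary can be checked on a small synthetic frame. This is only a sketch: the toy values and the helper name `missing_summary` are assumptions made for illustration and are not taken from the H&M data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for customers.csv; the values are made up for illustration.
toy_customers = pd.DataFrame({
    "customer_id": ["a", "b", "c", "d"],
    "FN": [1.0, np.nan, np.nan, 1.0],
    "Active": [1.0, np.nan, np.nan, np.nan],
    "age": [25.0, 31.0, np.nan, 47.0],
})

def missing_summary(data: pd.DataFrame) -> pd.DataFrame:
    """Return the count and percentage of missing values per column."""
    total = data.isnull().sum()
    # The mean of a boolean mask is the fraction of missing entries.
    percent = data.isnull().mean() * 100
    summary = pd.concat([total, percent], axis=1,
                        keys=["Number of Missing Values", "Percentage"])
    return summary.sort_values("Percentage", ascending=False)

summary = missing_summary(toy_customers)
# Show only the columns that actually have missing values.
print(summary.loc[summary["Percentage"] > 0])
```

Using `isnull().mean()` gives the same percentage as dividing the missing count by the row count, in a single step.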

# <b>2 <span style='color:#c9184a'>|</span> Data Visualization</b>

# **2.1 Articles**

In [None]:
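# Count the articles in each index group; summing raw article_id values
# would weight each sector by the size of the IDs rather than by count.
index_group_counts = (articles.groupby('index_group_name')['article_id']
                      .nunique()
                      .reset_index(name='article_count'))
fig = px.pie(index_group_counts, values='article_count',
             title='Distribution by Index Group Name',
             names='index_group_name',
             color_discrete_sequence=px.colors.sequential.RdBu,
             hover_data=['index_group_name'],
             labels={'index_group_name':'Index Group Name', 'article_count':'Articles'},
             height=450)
fig.show()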

* Ladieswear has the most articles, closely followed by Baby/Children.

In [None]:
fig = px.histogram(articles, x='garment_group_name',color="index_group_name", 
                   title="Index Group Name per Garment Group Name",
                  color_discrete_sequence=px.colors.sequential.Agsunset,
                  height=600)
fig.show()


*P.S. For some reason I had a hard time plotting this stacked plot. After numerous attempts, I owe the success to [this blog by Vaclav Dekanovsky](https://towardsdatascience.com/histograms-with-plotly-express-complete-guide-d483656c5ad7).*

* The garment group with the most articles is Jersey Fancy, followed by Accessories.
* Baby/Children accounts for around 40% of Jersey Fancy.
* Around 70% of the articles in the Shirts group belong to Menswear.
* The Divided index group accounts for almost 50% of Dresses Ladies.
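Shares like these can be read off the chart, but they can also be computed directly. The sketch below uses a row-normalised `pd.crosstab` on a tiny synthetic frame; the toy rows are made up and only borrow the column names from `articles.csv`.

```python
import pandas as pd

# Toy stand-in for articles.csv; the rows are made up for illustration.
toy_articles = pd.DataFrame({
    "garment_group_name": ["Jersey Fancy"] * 5 + ["Shirts"] * 3,
    "index_group_name": ["Baby/Children", "Baby/Children", "Ladieswear",
                         "Ladieswear", "Divided",
                         "Menswear", "Menswear", "Ladieswear"],
})

# Row-normalised crosstab: each row sums to 100, so every cell is the
# percentage of that garment group contributed by one index group.
share = pd.crosstab(toy_articles["garment_group_name"],
                    toy_articles["index_group_name"],
                    normalize="index") * 100
print(share.round(1))
```

On the real data, replacing `toy_articles` with `articles` would give the percentage behind each stacked segment, assuming the same two column names.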


In [None]:
# Count the articles in each section once and plot that column directly.
df1 = articles.groupby("section_name").size().reset_index(name="count")

fig = px.bar(df1,
             x="count",
             y="section_name",
             color='section_name',
             title='Distribution by Section Name',
             hover_data=['section_name'],
             text_auto='.2s',
             labels={'section_name':'Section Name', 'count':'Count'},
             orientation='h',
             color_discrete_sequence=px.colors.diverging.Temps,
             height=1000)
fig.update_traces(textfont_size=11, textangle=0, textposition="outside", cliponaxis=False)
fig.update_layout(xaxis_title = 'Count')
fig.show()

* The highest number of articles belongs to the section Womens Everyday Collection, followed by Divided Collection, and then Baby Essentials and Complements.
* The fewest belong to the 'Men Other' section.
* There are almost 3 times as many Womens shoes as Mens.
* Within Menswear, the highest number of articles belongs to 'Men Underwear', followed by 'Mens Suits and Tailoring'.

In [None]:
# Number of unique articles per index name, as a two-column dataframe.
df4 = (articles.groupby("index_name")["article_id"].nunique()
       .rename_axis("IndexName")
       .reset_index(name="Articles"))
fig = px.pie(df4, values='Articles', hole=0.35,
             names='IndexName',
             title='Distribution by Index Name',
             color_discrete_sequence=px.colors.cyclical.mygbm
             )
fig.show()

In [None]:
# Count the articles for each perceived master colour.
df5 = articles.groupby("perceived_colour_master_name").size().reset_index(name="count")
colors = ['#F5F5DC','#000000','#023e8a','#168aad','#7f5539','#90be6d','#b7b7a4','#606c38','#9d4edd','#b7b7a4','#9e2a2b','#f77f00','#ffafcc','#d00000','#34a0a4','#3e1f47','#ffffff','#fcbf49','#dddf00','#9e0059'] 
fig = px.bar(df5,
             y="count",
             x="perceived_colour_master_name",
             color='perceived_colour_master_name',
             hover_data=['perceived_colour_master_name'],
             text_auto='.2s',
             color_discrete_sequence =colors,
             title='Distribution by Perceived Color Master Name',
             labels={'perceived_colour_master_name':'Perceived Color Master Name', 'count':'Count'})
fig.update_traces(textfont_size=11, textangle=0, textposition="outside", cliponaxis=False)
fig.update_layout(yaxis_title = 'Count')
fig.show()

# **2.2 Customers**

In [None]:
fig = px.histogram(customers, 
                   x="age", 
                   range_x=[0, 100],
                   title="Age Distribution",
                   height=450,
                   color_discrete_sequence =px.colors.cyclical.IceFire
                  )
fig.update_layout(bargap=0.2)
fig.show()

In [None]:
# Bin ages into 20-year groups; customers with a missing age show up as 'nan'.
age_groups = pd.cut(customers["age"], bins=[0, 20, 40, 60, 80, 100])
fig = px.histogram(age_groups.astype('str').to_frame(name="age"), 
                   x="age", 
                   title="Age Distribution in Groups",
                   height=350,
                   text_auto=True,
                   color_discrete_sequence =px.colors.sequential.Jet
                  )

fig.show()

* The largest share of customers falls in the 20-40 age group, about 52% of all customers.
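A share like this can be computed rather than read off the chart. The following is a minimal sketch on made-up ages, assuming the same bins as above. Note that `value_counts(normalize=True)` ignores missing ages, so the percentages here are relative to customers with a known age.

```python
import pandas as pd

# Toy ages, made up for illustration; None stands for a missing age.
toy_ages = pd.Series([18, 22, 25, 33, 39, 45, 52, 67, None, 71],
                     name="age", dtype="float64")

# Same 20-year bins as in the histogram above.
age_groups = pd.cut(toy_ages, bins=[0, 20, 40, 60, 80, 100])

# Percentage of customers (with a known age) in each bin.
age_share = age_groups.value_counts(normalize=True).sort_index() * 100
largest_group = age_share.idxmax()

print(age_share.round(1))
print(f"Largest group: {largest_group}")
```

To express the share relative to all customers, including those without an age, `value_counts(normalize=True, dropna=False)` could be used instead.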

# **2.3 Transactions**

In [None]:
cus_transactions = transactions.groupby('customer_id').count()
print("Top 10 Customers by Number of Items Purchased: ")
cus_transactions.sort_values(by='price', ascending=False)['price'][:10]

In [None]:
merged_at = articles[['article_id', 'prod_name', 'product_type_name', 'product_group_name', 'index_name']]
merged_at = transactions.merge(merged_at, on='article_id', how='left')

In [None]:
df7 = merged_at[['index_name', 'price']].groupby('index_name').sum()
fig = px.bar(df7,
             x=df7.price,
             y=df7.index,
             title='Sales per Index Name',
             text_auto='.2s',
             orientation='h',
             color_discrete_sequence=px.colors.diverging.Temps,         
             height=500,
            labels={'t_dat':'Transaction date', 'price':'Sales'})
fig.update_traces(textfont_size=11, textangle=0, textposition="outside", cliponaxis=False)
fig.update_layout(xaxis_title = 'Total Sales',yaxis_title ='Index Name')
fig.show()

In [None]:
df8 = transactions.groupby(["t_dat"])["price"].sum().reset_index()
df8["t_dat"] = pd.to_datetime(df8["t_dat"], format='%Y-%m-%d')  # vectorised date parsing

fig = px.line(df8,
             x=df8["t_dat"],
             y=df8["price"],
             title='Sales Amount per Transaction date',
             color_discrete_sequence=px.colors.diverging.Portland,         
             height=400,
             labels={'t_dat':'Transaction date', 'price':'Sales'})
fig.update_layout(xaxis_title = 'Transaction Date',yaxis_title ='Total Sales Amount')
fig.show()

* The highest daily sales total was recorded on September 28, 2019.
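The peak day can also be located programmatically instead of by hovering over the line chart. This is a small sketch on synthetic rows; the prices are made up and only the column names `t_dat` and `price` follow `transactions_train.csv`.

```python
import pandas as pd

# Toy stand-in for transactions_train.csv; the rows are made up for illustration.
toy_transactions = pd.DataFrame({
    "t_dat": ["2019-09-27", "2019-09-28", "2019-09-28", "2019-09-29"],
    "price": [0.05, 0.03, 0.04, 0.02],
})
toy_transactions["t_dat"] = pd.to_datetime(toy_transactions["t_dat"],
                                           format="%Y-%m-%d")

# Total sales per day, then the date on which that total is largest.
daily_sales = toy_transactions.groupby("t_dat")["price"].sum()
peak_day = daily_sales.idxmax()

print(f"Peak day: {peak_day.date()}, total sales: {daily_sales.max():.2f}")
```

Applied to the `df8` frame built above, the equivalent would be `df8.loc[df8["price"].idxmax()]`.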

In [None]:
cust_age= pd.DataFrame(customers, columns = ['customer_id','age'])
total_tran = pd.merge(cust_age,transactions, how='right', on='customer_id')
datanew= pd.DataFrame(total_tran, columns = ['price','t_dat','sales_channel_id','article_id'])
dfp = datanew.groupby(["t_dat", "sales_channel_id"])["price"].sum().reset_index()

fig_bar = px.line(dfp, x="t_dat", y="price", color="sales_channel_id",
                 animation_frame="sales_channel_id", animation_group="t_dat", 
                  title='Sales Amount per Sales Channel',
                 color_discrete_sequence=px.colors.diverging.Portland,
                 labels={'sales_channel_id': 'Sales channel ID','t_dat':'Transaction date', 'price':'Sales'})
fig_bar.update_yaxes(showgrid=False)
fig_bar.update_layout(
                        hovermode="x unified",
                        xaxis_tickangle=360,
                        # The x axis shows t_dat, so label it as the transaction date.
                        xaxis_title='Transaction Date', yaxis_title="Sales Amount",
                        legend=dict(orientation="h", yanchor="bottom", y=1.02, xanchor="right", x=1)
                          )
fig_bar.show()

In [None]:
fig = px.histogram(dfp, 
    y="price", 
    x="sales_channel_id",
    color_discrete_sequence =['#fe6d73'],
    barmode="group", 
    animation_frame="t_dat", 
    title='Day Wise Sales per Sales Channel',
    range_x=[0,3]
)
fig.update_layout(xaxis_title='Sales Channel', yaxis_title="Sales Amount",bargap=0.5
                         )

fig.show()

**I will soon be updating the notebook with a prediction section as well. Do let me know your reviews and suggestions.**

**Until next time!**

>"**And, when you can’t go back, you have to worry only about the best way of moving forward.**"
    *- Paulo Coelho, The Alchemist*