
# **Introduction**

The dataset contains 4 CSV files and one folder with several subfolders.

In this exploratory data analysis (EDA) notebook we look at the data, analyze the content of each CSV file, examine how the data is distributed, and explore how the data in the different files is related.

**The aim is to show insightful plots and a simple prediction.**

# **Table of Contents**

1. Data Exploration and Visualization

2. Articles

3. Customers

4. Transactions

5. Simple Prediction

## **Data Exploration**

### **Import Libraries**

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import plotly.express as px
from matplotlib import pyplot as plt
from tqdm.notebook import tqdm

### **Reading the files**

In [None]:
df_articles = pd.read_csv("../input/h-and-m-personalized-fashion-recommendations/articles.csv")
df_customers = pd.read_csv("../input/h-and-m-personalized-fashion-recommendations/customers.csv")
df_transactions = pd.read_csv("../input/h-and-m-personalized-fashion-recommendations/transactions_train.csv")
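
One detail worth knowing before going further: `article_id` values in the raw files start with a zero, and a default `read_csv` call parses them as integers, which drops that zero (this is why the prediction section pads it back with `"0" + ...`). The following is a minimal sketch, using a tiny in-memory CSV with illustrative values as a stand-in for the real file, of how passing `dtype` could keep the identifier as a string instead:

```python
import io
import pandas as pd

# Toy stand-in for articles.csv (illustrative values only).
csv_text = "article_id,prod_name\n0108775015,Strap top\n0110065001,OP T-shirt\n"

# Default parsing: article_id becomes an integer and loses its leading zero.
as_int = pd.read_csv(io.StringIO(csv_text))

# Forcing a string dtype keeps the identifier exactly as written in the file.
as_str = pd.read_csv(io.StringIO(csv_text), dtype={"article_id": str})

print(as_int["article_id"].tolist())
print(as_str["article_id"].tolist())
```

The rest of this notebook keeps the default integer parsing, so the zero is restored only when the submission strings are built.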

## **Article data description:**

* **article_id :** A unique identifier of every article.

* **product_code, prod_name :** A unique identifier of every product and its name (not the same).

* **product_type_no, product_type_name :** The group of product_code and its name

* **graphical_appearance_no, graphical_appearance_name :** The group of graphics and its name

* **colour_group_code, colour_group_name :** The group of color and its name

* **perceived_colour_value_id, perceived_colour_value_name, perceived_colour_master_id, perceived_colour_master_name :** The added color info

* **department_no, department_name :** A unique identifier of every department and its name

* **index_code, index_name :** A unique identifier of every index and its name

* **index_group_no, index_group_name :** A group of indices and its name

* **section_no, section_name :** A unique identifier of every section and its name

* **garment_group_no, garment_group_name :** A unique identifier of every garment group and its name

* **detail_desc :** A text description of the article

## **Customers data description:**

* **customer_id :** A unique identifier of every customer

* **FN :** 1 or missing

* **Active :** 1 or missing

* **club_member_status :** Status in club

* **fashion_news_frequency :** How often H&M may send news to customer

* **age :** The current age

* **postal_code :** Postal code of customer

## **Transactions data description:**

* **t_dat :** The date of the transaction

* **customer_id :** A unique identifier of every customer (in customers table)

* **article_id :** A unique identifier of every article (in articles table)

* **price :** Price of purchase

* **sales_channel_id :** 1 or 2
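
Since `customer_id` and `article_id` in the transactions table point to the customers and articles tables, it can be useful to check that every key actually has a match. The sketch below uses three small made-up tables (not the real data) and `isin` to count transactions whose keys are missing from the lookup tables:

```python
import pandas as pd

# Toy tables (hypothetical values) mimicking the key columns of the three files.
customers = pd.DataFrame({"customer_id": ["a", "b", "c"]})
articles = pd.DataFrame({"article_id": [1, 2]})
transactions = pd.DataFrame({
    "t_dat": ["2020-09-01", "2020-09-02", "2020-09-02"],
    "customer_id": ["a", "a", "d"],
    "article_id": [1, 2, 3],
})

# Boolean masks marking rows whose key has no match in the lookup table.
unknown_customers = ~transactions["customer_id"].isin(customers["customer_id"])
unknown_articles = ~transactions["article_id"].isin(articles["article_id"])

print("transactions with unknown customer:", int(unknown_customers.sum()))
print("transactions with unknown article:", int(unknown_articles.sum()))
```

On the real data the same two masks could be built from `df_transactions`, `df_customers` and `df_articles`.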

In [None]:
def plot_distribution(x, data, title):
    fig = px.histogram(
        data,
        x=x,
        width=800,
        height=500,
        title=title,
    )
    fig.show()

### **Checking tables of articles, customers and transactions**

In [None]:
df_articles.shape

In [None]:
df_articles.head()

In [None]:
df_customers.shape

In [None]:
df_customers.head()

In [None]:
df_transactions.shape

In [None]:
df_transactions.head()

In [None]:
df_articles.iloc[:, :-1].describe().T.sort_values(by='std' , ascending = False)\
                     .style.background_gradient(cmap='GnBu')\
                     .bar(subset=["max"], color='#F8766D')\
                     .bar(subset=["mean",], color='#00BFC4')

## **Insightful Plots : Articles**

In [None]:
f, ax = plt.subplots(figsize=(15, 7))
ax = sns.histplot(data=df_articles, y='index_name', color='cyan')
ax.set_xlabel('count by index name')
ax.set_ylabel('index name')
plt.show()

**The plot above shows that Ladieswear accounts for a significant share of all articles, while Sportswear has the smallest share.**

In [None]:
f, ax = plt.subplots(figsize=(15, 7))
ax = sns.histplot(data=df_articles, y='garment_group_name', color='tomato', hue='index_group_name', multiple="stack")
ax.set_xlabel('count by garment group')
ax.set_ylabel('garment group')
plt.show()

**The plot above shows garment groups split by index group: Jersey Fancy is the most frequent garment group, especially for women and children. Accessories come next, with many different low-priced items.**

#### **Number of unique values in each descriptive column:**

In [None]:
for col in df_articles.columns:
    if 'no' not in col and 'code' not in col and 'id' not in col:
        un_n = df_articles[col].nunique()
        print(f'n of unique {col}: {un_n}')

## **Insightful Plots : Customers**

In [None]:
df_customers.iloc[:, :-1].describe().T.sort_values(by='std' , ascending = False)\
                     .style.background_gradient(cmap='GnBu')\
                     .bar(subset=["max"], color='#F8766D')\
                     .bar(subset=["mean",], color='#00BFC4')

In [None]:
plot_distribution('age', df_customers, 'Age distribution')

**The plot above shows that the most common age is about 21-24.**
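
To back a reading like this with numbers rather than the histogram alone, one option is to compute the mode and the share of customers per age band. This is only a sketch on made-up ages; the band edges are an arbitrary choice for illustration:

```python
import pandas as pd

# Toy ages (made up) standing in for df_customers["age"]; None marks a missing age.
ages = pd.Series([21, 22, 22, 23, 24, 35, 48, 51, None], name="age")

# Most frequent single age (missing values are ignored by mode()).
most_common_age = ages.mode().iloc[0]

# Share of customers per age band; pd.cut leaves missing ages unassigned,
# and value_counts drops them before normalizing.
bands = pd.cut(ages, bins=[15, 20, 25, 35, 50, 100])
band_share = bands.value_counts(normalize=True).sort_index()

print(most_common_age)
print(band_share)
```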

In [None]:
df_customers.fashion_news_frequency.value_counts()

In [None]:
plot_distribution('fashion_news_frequency', df_customers, 'Fashion News Frequency')

**It seems that most customers prefer not to receive any fashion news messages.**

In [None]:
df_customers.club_member_status.value_counts()

In [None]:
plot_distribution('club_member_status', df_customers, 'Club Member Status')

**The plot above shows the status in the H&M club. Almost every customer has an active club status, and some are in the process of activating it (pre-create). A tiny share of customers have left the club.**
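
Phrases like "almost every" and "a tiny share" can be quantified with normalized counts. A minimal sketch on a made-up column (the proportions here are invented, not taken from the data):

```python
import pandas as pd

# Toy column (made up) standing in for df_customers["club_member_status"].
status = pd.Series(["ACTIVE"] * 8 + ["PRE-CREATE", None], name="club_member_status")

# Share of each status; dropna=False keeps missing values visible as their own row.
status_share = status.value_counts(normalize=True, dropna=False)
print(status_share)
```

The same call on `df_customers.club_member_status` would give the real proportions.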

## **Insightful Plots : Transactions**

In [None]:
pd.set_option('display.float_format', '{:.4f}'.format)
df_transactions.describe()['price']

In [None]:
sns.set_style("darkgrid")
f, ax = plt.subplots(figsize=(10,5))
ax = sns.boxplot(data=df_transactions, x='price', color='darkorange')
ax.set_xlabel('Price outliers')
plt.show()

**The box plot above shows the outliers in price.**
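
To count the outliers instead of only looking at them, one common approach is the 1.5 × IQR rule, which is also the default whisker length of a seaborn box plot. The sketch below applies it to a handful of made-up prices:

```python
import pandas as pd

# Toy prices (made up) standing in for df_transactions["price"].
prices = pd.Series([0.010, 0.015, 0.017, 0.020, 0.025, 0.030, 0.034, 0.500])

# Tukey's rule: points further than 1.5 * IQR from the quartiles are flagged.
q1, q3 = prices.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = prices[(prices < lower) | (prices > upper)]

print(f"bounds: [{lower:.4f}, {upper:.4f}]")
print(f"outliers: {len(outliers)} of {len(prices)}")
```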

### **Top 10 customers by number of transactions**

In [None]:
df_transactions_byid = df_transactions.groupby('customer_id').count()

In [None]:
df_transactions_byid.sort_values(by='price', ascending=False)['price'][:10]
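
The same ranking can also be obtained in one step with `value_counts`, which counts rows per customer and already sorts in descending order. A small sketch on made-up data:

```python
import pandas as pd

# Toy transactions (made up); each row is one purchased item.
toy = pd.DataFrame({"customer_id": ["a", "b", "a", "c", "a", "b"]})

# value_counts sorts by count in descending order, so head(n) gives the top n customers.
top_customers = toy["customer_id"].value_counts().head(2)
print(top_customers)
```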

**Take a subset of the article columns and merge it into the transactions.**

In [None]:
articles_for_merge = df_articles[['article_id', 'prod_name', 'product_type_name', 'product_group_name', 'index_name']]

In [None]:
articles_for_merge = df_transactions[['customer_id', 'article_id', 'price', 't_dat']].merge(articles_for_merge, on='article_id', how='left')
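
A left merge keeps every transaction even when its article is not found, so it can be worth checking how many rows stayed unmatched. The sketch below, on made-up tables, uses the `validate` and `indicator` parameters of `DataFrame.merge` for that purpose:

```python
import pandas as pd

# Toy tables (made up): article 9 does not exist in the articles table.
toy_tx = pd.DataFrame({"article_id": [1, 1, 2, 9], "price": [0.02, 0.02, 0.05, 0.01]})
toy_articles = pd.DataFrame({"article_id": [1, 2], "index_name": ["Ladieswear", "Menswear"]})

# validate raises an error if article_id is duplicated on the right side;
# indicator adds a "_merge" column telling where each row was found.
merged = toy_tx.merge(
    toy_articles, on="article_id", how="left",
    validate="many_to_one", indicator=True,
)
unmatched = int((merged["_merge"] == "left_only").sum())

print(merged)
print("transactions without article info:", unmatched)
```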

**The index with the highest mean price is Ladieswear; the one with the lowest is Children.**

In [None]:
articles_index = articles_for_merge[['index_name', 'price']].groupby('index_name').mean()
sns.set_style("darkgrid")
f, ax = plt.subplots(figsize=(10,5))
ax = sns.barplot(x=articles_index.price, y=articles_index.index, color='gold', alpha=0.8)
ax.set_xlabel('Price by index')
ax.set_ylabel('Index')
plt.show()

**Stationery has the lowest mean price; Shoes has the highest.**

In [None]:
articles_index = articles_for_merge[['product_group_name', 'price']].groupby('product_group_name').mean()
sns.set_style("darkgrid")
f, ax = plt.subplots(figsize=(10,5))
ax = sns.barplot(x=articles_index.price, y=articles_index.index, color='orangered', alpha=0.8)
ax.set_xlabel('Price by product group')
ax.set_ylabel('Product group')
plt.show()

**Now check how the mean price changes over time for the top 5 product groups by mean price:**

In [None]:
articles_for_merge['t_dat'] = pd.to_datetime(articles_for_merge['t_dat'])

In [None]:
product_list = ['Shoes', 'Garment Full body', 'Bags', 'Garment Lower body', 'Underwear/nightwear']
colors = ['cadetblue', 'orange', 'mediumspringgreen', 'tomato', 'lightseagreen']
k = 0
f, ax = plt.subplots(3, 2, figsize=(20, 15))
for i in range(3):
    for j in range(2):
        try:
            product = product_list[k]
            articles_for_merge_product = articles_for_merge[articles_for_merge.product_group_name == product_list[k]]
            series_mean = articles_for_merge_product[['t_dat', 'price']].groupby(pd.Grouper(key="t_dat", freq='M')).mean().fillna(0)
            series_std = articles_for_merge_product[['t_dat', 'price']].groupby(pd.Grouper(key="t_dat", freq='M')).std().fillna(0)
            ax[i, j].plot(series_mean, linewidth=4, color=colors[k])
            ax[i, j].fill_between(series_mean.index, (series_mean.values-2*series_std.values).ravel(), 
                             (series_mean.values+2*series_std.values).ravel(), color=colors[k], alpha=.1)
            ax[i, j].set_title(f'Mean {product_list[k]} price in time')
            ax[i, j].set_xlabel('month')
            ax[i, j].set_ylabel(f'{product_list[k]}')
            k += 1
        except IndexError:
            ax[i, j].set_visible(False)
            
plt.show()
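
The core of each panel is a monthly mean with a band of two standard deviations around it. The sketch below shows that computation alone on made-up data. It groups by calendar month with `dt.to_period("M")` rather than `pd.Grouper`, which is a substitution made here for brevity, not the method used in the cell above:

```python
import pandas as pd

# Toy transactions (made up) for a single product group.
toy = pd.DataFrame({
    "t_dat": pd.to_datetime(["2020-01-05", "2020-01-20", "2020-02-03", "2020-02-25"]),
    "price": [0.02, 0.04, 0.05, 0.05],
})

# Group by calendar month and compute both statistics in one pass.
monthly = toy.groupby(toy["t_dat"].dt.to_period("M"))["price"].agg(["mean", "std"])

# A month with a single transaction has an undefined std; treat it as 0 here.
monthly["std"] = monthly["std"].fillna(0)

# Band of two standard deviations around the monthly mean.
monthly["lower"] = monthly["mean"] - 2 * monthly["std"]
monthly["upper"] = monthly["mean"] + 2 * monthly["std"]
print(monthly)
```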

## **Simple Prediction** 

**For this initial submission, we apply the following simplified logic:**
* If a customer has past purchases, pick the most recent ones;

* If a customer has no past purchases, pick the most frequently bought articles.
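
Before running this on the full data, the two rules can be sketched on a toy example. Everything here is hypothetical: the data is made up, `recommend` is an illustrative helper that does not appear in this notebook, and `N` is set to 3 instead of 12 to keep the output readable. Like the code in this notebook, the sketch does not remove duplicate articles.

```python
import pandas as pd

N = 3  # the notebook uses 12; a small value keeps the toy example readable

# Toy transactions (made up); article ids are kept as strings here.
toy = pd.DataFrame({
    "t_dat": pd.to_datetime(
        ["2020-09-20", "2020-09-21", "2020-09-22", "2020-09-22", "2020-09-22"]
    ),
    "customer_id": ["a", "a", "a", "b", "c"],
    "article_id": ["01", "02", "03", "04", "04"],
})

# Fallback list: the most frequent articles on the last available day.
last_day = toy["t_dat"].max()
popular = (
    toy.loc[toy["t_dat"] == last_day, "article_id"].value_counts().index[:N].tolist()
)

def recommend(history):
    """Most recent purchases first, then popular articles until N items (if available)."""
    recent = history.sort_values("t_dat", ascending=False)["article_id"].tolist()[:N]
    return recent + popular[: N - len(recent)]

# One recommendation list per customer seen in the transactions.
recs = {cid: recommend(hist) for cid, hist in toy.groupby("customer_id")}

print(recs)
# A customer with no history ("z") falls back to the popular list.
print(recs.get("z", popular))
```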

In [None]:
df_transactions = df_transactions.sort_values(["customer_id", "t_dat"], ascending=False)

In [None]:
df_transactions.head()

First, let's find the articles bought most frequently on the most recent day.

In [None]:
last_date = df_transactions.t_dat.max()
print(last_date)
print(df_transactions.loc[df_transactions.t_dat==last_date].shape)

In [None]:
# Top 12 articles by number of purchases on the last available day
most_frequent_articles = list(df_transactions.loc[df_transactions.t_dat==last_date].article_id.value_counts()[0:12].index)
# article_id was parsed as an integer, so restore the leading zero
art_list = ["0" + str(art) for art in most_frequent_articles]
art_str = " ".join(art_list)
print("Frequent articles bought recently: ", art_str)

In [None]:
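# Rows are sorted by date in descending order, so the first 12 values are each customer's most recent purchases
agg_df = df_transactions.groupby(["customer_id"])["article_id"].agg(lambda x: str(x.values[0:12])[1:-1]).reset_index()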

In [None]:
def padding_articles(x):
    """Restore leading zeros and pad the list to 12 articles with the most frequent ones."""
    if x:
        articles = ["0" + xi for xi in x.split()]
        if len(articles) < 12:
            articles.extend(art_list[:12 - len(articles)])
        return " ".join(articles)

In [None]:
agg_df["article_id"] = agg_df["article_id"].apply(padding_articles)

In [None]:
df_sample_submission = pd.read_csv('../input/h-and-m-personalized-fashion-recommendations/sample_submission.csv')

In [None]:
print("Aggregated transaction history: ", agg_df.customer_id.nunique())
print("Submission sample: ", df_sample_submission.customer_id.nunique())

We will replace the values in the sample submission with those from the aggregated transaction data where they exist, and keep the default otherwise.

In [None]:
print(df_sample_submission.shape)
df_sample_submission.head()

For customers with no articles, we simply fill in the articles bought most frequently on the most recent day.

In [None]:
My_Final_Submission = agg_df.merge(df_sample_submission[["customer_id"]], how="right")
My_Final_Submission.columns = ["customer_id", "prediction"]
print(My_Final_Submission.shape)
My_Final_Submission.head()

**Rows with missing data in submission:**

In [None]:
My_Final_Submission.loc[My_Final_Submission.prediction.isna()].shape[0]

We replace the missing data with the most frequently bought articles from the most recent day, which we computed earlier.

In [None]:
My_Final_Submission.loc[My_Final_Submission.prediction.isna(), ["prediction"]] = art_str

In [None]:
print("Rows with missing data in submission: ", My_Final_Submission.loc[My_Final_Submission.prediction.isna()].shape[0])

In [None]:
My_Final_Submission.to_csv("submission.csv", index=False)