# <center> Item categories analysis

## Summary
- About **70%** of category names follow the pattern `{Family name} - {Subfamily name}`.
- Many category names contain the word `Цифра`, which translates roughly as a digital (electronic) item.
- `Доставка товара`, `Билеты (Цифра)`, `Чистые носители (шпиль)`, `Служебные`, `Элементы питания`, and `Чистые носители (штучные)` (**IDs 8, 9, 79, 81, 82, 83**) have no subfamily: their names consist of a single level.
- Families are well separated, so the family level can be used as a feature.
- The number of unique subfamily names is very high, so there is little reason to use the subfamily as a feature.
- Item counts per category vary widely (high deviation), so categories are separable by item count.

## Import libraries and load datasets

In [67]:
%%capture
%store -r item_cat
%store -r item
%store -r sub
%store -r shops
%store -r sales_test
%store -r sales_train
%store -r __ipy
%store -r __da

In [68]:
__ipy

Helper ipython script loaded


In [69]:
__da

Basic Data Analysis tools was loaded


In [70]:
%%capture
import pandas as pd  # used below as `pd`; may already be provided by the stored helpers
import plotly.express as px
import seaborn as sns
from basic_text_preprocessing import BasicPreprocessText

## Glimpse

- About **70%** of category names follow the pattern `{Family name} - {Subfamily name}`.
- Many category names contain the word `Цифра`, which translates roughly as a digital (electronic) item.
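- `Доставка товара`, `Билеты (Цифра)`, `Чистые носители (шпиль)`, `Служебные`, `Элементы питания`, and `Чистые носители (штучные)` (**IDs 8, 9, 79, 81, 82, 83**) have no subfamily: their names consist of a single level.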
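
The share of two-level names can be estimated by testing each name for the ` - ` separator. The following is a minimal sketch that uses only the ten category names visible in this section's table output, not all 84 categories, so the share it prints differs from the ~70% figure. The helpers `is_two_level`, `share_two_level` and `count_containing` are hypothetical names and are not part of the notebook.

```python
# Ten category names copied from the table output in this section (IDs 0-9).
sample_names = [
    "PC - Гарнитуры/Наушники",
    "Аксессуары - PS2",
    "Аксессуары - PS3",
    "Аксессуары - PS4",
    "Аксессуары - PSP",
    "Аксессуары - PSVita",
    "Аксессуары - XBOX 360",
    "Аксессуары - XBOX ONE",
    "Билеты (Цифра)",
    "Доставка товара",
]

SEPARATOR = " - "  # assumption: family and subfamily are joined by a spaced dash


def is_two_level(name: str) -> bool:
    """Return True if the name looks like '{Family name} - {Subfamily name}'."""
    family, sep, subfamily = name.partition(SEPARATOR)
    return bool(sep) and bool(family.strip()) and bool(subfamily.strip())


def share_two_level(names) -> float:
    """Fraction of names that match the two-level pattern."""
    names = list(names)
    return sum(is_two_level(n) for n in names) / len(names)


def count_containing(names, word: str) -> int:
    """Number of names that contain the given word (case-sensitive)."""
    return sum(word in n for n in names)


print(share_two_level(sample_names))            # 0.8 for these ten names
print(count_containing(sample_names, "Цифра"))  # 1 for these ten names
```

On the full table, the same check could be applied with `item_cat['item_category_name'].map(is_two_level).mean()`.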

In [6]:
with pd.option_context('display.max_rows', 100):
    display(item_cat)

| | item_category_name | item_category_id |
|---|---|---|
| 0 | PC - Гарнитуры/Наушники | 0 |
| 1 | Аксессуары - PS2 | 1 |
| 2 | Аксессуары - PS3 | 2 |
| 3 | Аксессуары - PS4 | 3 |
| 4 | Аксессуары - PSP | 4 |
| 5 | Аксессуары - PSVita | 5 |
| 6 | Аксессуары - XBOX 360 | 6 |
| 7 | Аксессуары - XBOX ONE | 7 |
| 8 | Билеты (Цифра) | 8 |
| 9 | Доставка товара | 9 |


All `item_category_id` values are unique.

In [74]:
item_cat.describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| item_category_id | 84.0 | 41.5 | 24.39 | 0.0 | 20.75 | 41.5 | 62.25 | 83.0 |
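
The summary above is what one would expect if the IDs are exactly the integers 0 through 83, each used once. The following standard-library sketch checks that arithmetic. The list `ids` is an assumed stand-in for `item_cat['item_category_id']`, not data loaded from the notebook.

```python
import statistics

# Assumption: the IDs are the consecutive integers 0..83 (84 categories).
ids = list(range(84))

# Uniqueness: a set drops duplicates, so equal lengths mean no repeats.
all_unique = len(set(ids)) == len(ids)

mean_id = statistics.mean(ids)   # arithmetic mean
std_id = statistics.stdev(ids)   # sample standard deviation (divides by n - 1)

print(all_unique)        # True
print(mean_id)           # 41.5
print(round(std_id, 2))  # 24.39
```

In pandas, the uniqueness test itself is available as `item_cat['item_category_id'].is_unique`.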


## Items count per category

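- Category **ID 40** contains the most items. It is the only category removed by the `item_cnt < 5000` filter in the code below, so it does not appear in the chart or the statistics.
- Each of the remaining 83 categories contains fewer than **2,400** items (the maximum is 2,365).
- **50%** of the remaining categories contain **41** items or fewer.
- **25%** of the remaining categories contain **8** items or fewer.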

In [143]:
def set_plotly_layout(fig, title=""):
    fig.update_layout(
        font=dict(size=12, family='Helvetica'),
        title = title,
        yaxis = dict(
          scaleanchor = "x",
          scaleratio = 1,
        )
    )
    return fig


# Count the items in each category, largest first.
item_cat_grouped = (
    item.groupby('item_category_id', as_index=False)['item_id']
    .count()
    .sort_values('item_id', ascending=False)
    .rename(columns={'item_id': 'item_cnt'})
)

# Drop the outlier category (5000+ items) so the remaining bars stay readable.
item_cat_grouped = item_cat_grouped[item_cat_grouped['item_cnt'] < 5000]
fig = (
    px.bar(item_cat_grouped, y='item_cnt', x='item_category_id', width=1000, height=300)
    .update_xaxes(type='category', title_text="Category ID")
    .update_yaxes(nticks=20, title_text="Item count per category")
)

set_plotly_layout(fig, "Items count per category").show()

In [140]:
pd.concat(
    [
        item_cat_grouped[['item_cnt']].describe().T,
        item_cat_grouped[['item_cnt']].median().rename('median'),
        item_cat_grouped[['item_cnt']].skew().rename('skew'),
        item_cat_grouped[['item_cnt']].kurt().rename('kurt'),
    ],
    axis=1,
)

| | count | mean | std | min | 25% | 50% | 75% | max | median | skew | kurt |
|---|---|---|---|---|---|---|---|---|---|---|---|
| item_cnt | 83.0 | 206.45 | 370.58 | 1.0 | 8.0 | 41.0 | 284.0 | 2365.0 | 41.0 | 3.7 | 17.02 |


## Family category 

- The family is the first part of the category name (before the dash).
- The largest family, `'игры'` (games), contains 14 categories.
- Families whose category names have only one level contain 1 or 2 categories.
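
Taking the part before the dash can be sketched as below. The notebook cell that follows appears to take the first whitespace-separated token of the preprocessed name instead, which gives the same result for single-word families but shortens multi-word ones. The function `family_of` is a hypothetical helper, and the sample list holds only the ten names shown in the table output earlier in this notebook.

```python
from collections import Counter

# Ten category names copied from the table output earlier in the notebook (IDs 0-9).
sample_names = [
    "PC - Гарнитуры/Наушники",
    "Аксессуары - PS2",
    "Аксессуары - PS3",
    "Аксессуары - PS4",
    "Аксессуары - PSP",
    "Аксессуары - PSVita",
    "Аксессуары - XBOX 360",
    "Аксессуары - XBOX ONE",
    "Билеты (Цифра)",
    "Доставка товара",
]


def family_of(name: str) -> str:
    """Family = text before the first ' - '; the whole name if there is no dash."""
    return name.partition(" - ")[0].strip()


# Number of categories that belong to each family.
family_counts = Counter(family_of(n) for n in sample_names)

for family, n_categories in family_counts.most_common():
    print(family, n_categories)
```

To turn the family into a model feature on the full table, one option is `item_cat['item_category_name'].map(family_of).astype('category').cat.codes`, which assigns one integer code per family.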

In [147]:
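# Preprocess the category names with the project helper BasicPreprocessText.
item_category_names = pd.Series(BasicPreprocessText().vectorize_process_text(item_cat['item_category_name']))
# Family = first token of the preprocessed name; count the categories per family.
item_category_names_category_1 = item_category_names.apply(lambda x: x.split()[0])
item_category_names_category_1 = (
    item_category_names_category_1.value_counts().to_frame().reset_index()
)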

item_category_names_category_1.columns = ['name', 'item_cat_count']
fig = (px.bar(item_category_names_category_1, x='name', y='item_cat_count')
.update_xaxes(type='category', title_text = "Category name")
.update_yaxes(nticks=20, title_text = "Category count"))

set_plotly_layout(fig, "Category family").show()

## Subfamily category 

- The subfamily is the second part of the category name (after the dash).
- Subfamily names comprise 1 or 2 levels.
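
A matching sketch for the subfamily takes the part after the first dash and counts distinct values, which is one way to see how quickly the number of unique subfamilies grows. `subfamily_of` is a hypothetical helper, and the four names are a small sample taken from the table output earlier in this notebook.

```python
# Four category names taken from the table output earlier in the notebook.
sample_names = [
    "PC - Гарнитуры/Наушники",
    "Аксессуары - PS2",
    "Аксессуары - PS3",
    "Билеты (Цифра)",
]


def subfamily_of(name: str):
    """Subfamily = text after the first ' - '; None for single-level names."""
    _, sep, rest = name.partition(" - ")
    return rest.strip() if sep else None


subfamilies = [subfamily_of(n) for n in sample_names]
named = [s for s in subfamilies if s is not None]

print(len(named))       # 3 two-level names in this sample
print(len(set(named)))  # 3 distinct subfamilies among them
```

In this sample every two-level name has its own subfamily, which illustrates why the subfamily carries little grouping information.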

In [150]:

# Subfamily = everything after the first token of the preprocessed name.
item_category_names_category_2 = item_category_names.apply(lambda x: " ".join(x.split()[1:]))
item_category_names_category_2 = (
    item_category_names_category_2.value_counts().to_frame().reset_index()
)
item_category_names_category_2.columns = ['name', 'item_cat_count']

fig = (px.bar(item_category_names_category_2, x='name', y='item_cat_count')
.update_xaxes(type='category', title_text = "Category name")
.update_yaxes(nticks=20, title_text = "Category count"))

set_plotly_layout(fig, "Subcategory family").show()