# Product Tag Cleaning
In this section, we measure the similarity among subcategories. This similarity will be used to build the recommender engine. We reduce the feature space by applying TF-IDF, as we did in ExploratoryAnalysis-Designer. However, we compute the similarity directly by inner product instead of applying K-means, because we already know the exact number of groups.
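
As a rough illustration of the inner-product idea, the sketch below stores each subcategory as a sparse dict of tag weights and compares two subcategories through their inner product. The subcategory names, tag names, and weights are hypothetical, and dividing by the vector lengths (cosine similarity) is an assumption of this sketch; the notebook does not say whether the raw or the normalized inner product is used.

```python
import math

# Hypothetical tag weights per subcategory (all names and numbers are made up).
weights = {
    "sub_1": {"tag_a": 0.8, "tag_b": 0.6},
    "sub_2": {"tag_b": 0.6, "tag_c": 0.8},
    "sub_3": {"tag_a": 1.0},
}

def inner_product(u, v):
    """Sparse inner product over the tags that both dicts share."""
    return sum(w * v[tag] for tag, w in u.items() if tag in v)

def cosine_similarity(u, v):
    """Inner product divided by both vector lengths (assumed normalization)."""
    denom = math.sqrt(inner_product(u, u)) * math.sqrt(inner_product(v, v))
    return inner_product(u, v) / denom if denom else 0.0

names = sorted(weights)
similarity = {
    (a, b): cosine_similarity(weights[a], weights[b])
    for a in names
    for b in names
}
```

Subcategories that share no tags get a similarity of zero, so only overlapping tags contribute to the comparison.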

## Description of data set

The data was collected from the Pinkoi website between 12/17 08:00 and 12/19 08:00.

## Fields
* subcategory: the subcategory the products belong to, string
* Product Tag: tags of the product, string
* tid: the unique id of the product, string

The following steps are similar to those in ExploratoryAnalysis-Designer. We calculate the TF-IDF score to eliminate rare tags.
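
To make the scoring concrete, here is a minimal sketch on toy data. It uses the same expression as the `TF_IDF_tag` cell further down, (tf / max_tf) / log10(df / N), where tf is a tag's total count, df is the number of subcategories containing it, and N is the number of subcategories. The tag names and counts are hypothetical, and the guard for a tag present in every subcategory is an assumption added here, not part of the notebook.

```python
import math

# Hypothetical per-subcategory tag counts (toy data, not from the Pinkoi set).
docs = [
    {"tag_a": 3.0, "tag_b": 1.0},
    {"tag_a": 2.0, "tag_c": 1.0},
    {"tag_a": 1.0, "tag_b": 2.0},
    {"tag_d": 1.0},
]

def tag_scores(docs):
    n = float(len(docs))
    tags = sorted(set(t for d in docs for t in d))
    tf = {t: sum(d.get(t, 0.0) for d in docs) for t in tags}   # total count
    df = {t: sum(1.0 for d in docs if t in d) for t in tags}   # subcategory count
    max_tf = max(tf.values())
    scores = {}
    for t in tags:
        if df[t] == n:
            # log10(1) = 0 would divide by zero; ranking such a tag as
            # "most common" is an assumption of this sketch.
            scores[t] = float("-inf")
        else:
            scores[t] = (tf[t] / max_tf) / math.log10(df[t] / n)
    return scores

scores = tag_scores(docs)
```

All scores are negative because log10(df / N) is negative whenever df < N. Frequent tags land far below zero, while a tag seen once in a single subcategory gets the score closest to zero.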

In [1]:
import pandas as pd
import numpy as np

In [2]:
data=pd.read_csv("Intermediate/data_final.csv",sep=",",encoding='utf8',low_memory=False)

In [3]:
Product=data[['tid','subcategory','Product Tag']]

In [4]:
Product.head()

Unnamed: 0,tid,subcategory,Product Tag
0,1fAYZRbb,紙膠帶,紙膠帶
1,1Wpbjtdt,糖果/軟糖,"乾果醬,交換禮物,繽紛,聖誕節"
2,1cf23LXr,手提包,手提帶
3,1XkKaVA0,髮飾,"好煩小姐,髮帶,點點"
4,1cDqf1tY,髮飾,"好煩小姐,復古,手作,混織,髮帶"


In [5]:
#Group by subcategory and join each group's Product Tag strings into one comma-separated string
Product_group=Product.groupby('subcategory')['Product Tag'].apply(lambda x:",".join(x))

In [30]:
def ReduceByKeys(x):
    #Count how often each tag occurs in a comma-joined tag string
    L=x.split(",")
    D=dict.fromkeys(L,0.0)

    for e in L:
        D[e]+=1.0
    return D
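
The counting step above can also be sketched with the standard library's `collections.Counter`. The helper name `count_tags` is hypothetical, and the input string simply joins the tags of the two 髮飾 rows shown in the `Product.head()` output. Note that `Counter` yields integer counts, whereas the notebook's function stores floats.

```python
from collections import Counter

def count_tags(joined_tags):
    """Count each tag in a comma-joined tag string (hypothetical helper)."""
    return Counter(joined_tags.split(","))

# Tags of the two 髮飾 rows from the head() output, joined with a comma.
counts = count_tags("好煩小姐,髮帶,點點,好煩小姐,復古,手作,混織,髮帶")
```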

In [31]:
Product_dic=[ReduceByKeys(e) for e in Product_group]

In [32]:
def parse_pdtag(pdtag):
    Pd=[]
    
    for e in pdtag:
        Pd.extend(e.keys())
    return list(set(Pd))

In [35]:
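def TF_IDF_tag(d_list):
    #Relies on the global Product_tag list built by parse_pdtag (cell In [36])
    tf_list=[]
    idf_list=[]
    for t in Product_tag:
        idf=0.0 #number of subcategories that contain the tag
        tf=0.0 #total count of the tag over all subcategories
        for d in d_list:
            if t in d:
                tf+=d[t]
                idf+=1.0
        tf_list.append(tf)
        idf_list.append(idf)
    max_tf=max(tf_list)
    #Score: normalized count divided by log10 of the subcategory fraction; -1000 marks tags that never occur
    TF_IDF=[(e[0]/max_tf)/np.log10(e[1]/len(d_list)) if e[0]!=0 else -1000 for e in zip(tf_list,idf_list)]
    #Keep tags scoring strictly below the 71st percentile and drop the -1000 sentinel
    threshold=np.percentile(TF_IDF,71)
    Dominant_f=[(e[0],e[1]) for e in zip(Product_tag,TF_IDF) if (e[1]<threshold) and (e[1]!=-1000)]
    return TF_IDF,max_tf,Dominant_f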

In [36]:
Product_tag=parse_pdtag(Product_dic)

In [37]:
len(Product_dic)

183

In [39]:
Tf_Idf_score,max_tf,Dominant_feature=TF_IDF_tag(Product_dic)

In [43]:
np.percentile(Tf_Idf_score,71)

-8.2653621308803749e-06

Based on observation, the 71st-percentile score is -8.2653e-06. A tag receives this score when it appears only once, in a single subcategory, which is too rare to be useful. We keep only tags scoring strictly below this value, which eliminates all tags with this score.
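
The claim that this score belongs to the rarest tags can be checked with a little arithmetic. The sketch below takes the 183 subcategories and the observed percentile value from the outputs above, and assumes the score comes from a tag with tf = 1 and df = 1. The notebook never prints `max_tf`, so the value here is backed out from that assumption rather than read from the data, and the `score` helper is a hypothetical restatement of the notebook's expression.

```python
import math

N_SUBCATEGORIES = 183                      # len(Product_dic) in the notebook
OBSERVED_SCORE = -8.2653621308803749e-06   # 71st-percentile score in the notebook

def score(tf, df, max_tf, n=N_SUBCATEGORIES):
    """Notebook-style score: (tf / max_tf) / log10(df / n), negative for df < n."""
    return (tf / max_tf) / math.log10(df / n)

# max_tf implied by the observed score, ASSUMING it belongs to a tag that
# appears once (tf = 1) in a single subcategory (df = 1).
implied_max_tf = 1.0 / (OBSERVED_SCORE * math.log10(1.0 / N_SUBCATEGORIES))

rarest = score(1.0, 1.0, implied_max_tf)
seen_twice = score(2.0, 1.0, implied_max_tf)     # same subcategory, counted twice
two_subcats = score(2.0, 2.0, implied_max_tf)    # once in each of two subcategories
```

Under this assumption, any tag that appears more often or in more subcategories scores lower than `rarest`, which is why a strict "less than" comparison against the percentile removes exactly the once-only tags.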

In [44]:
len(Product_tag)

31349

In [45]:
len(Dominant_feature)

22024

In [62]:
#Save the selected tags and their scores
df_p=pd.DataFrame(Dominant_feature,columns=['Product Tag','Values'])
df_p.to_csv("Intermediate/feature_p.csv",sep=",",encoding='utf8',index=False)
