# User Similarity Calculation
This section finds the 2 most similar users for a given user. The features used to calculate similarity are as follows:

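* Purchase record of subcategory
* The style (product tags) in the purchase record
* The designers in the purchase record
* The content weight and style weight of the user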

### Purchase record of subcategory
We do not use the raw purchase record directly. Instead, we normalize each user's counts by that user's maximum count. For example, if a user's purchase record is {'sub1':3,'sub2':1,'sub3':5,'sub4':2}, the converted record is {'sub1':0.6,'sub2':0.2,'sub3':1,'sub4':0.4}.
<br>
We convert it for the following two reasons:
<br>
#### Rule out the temporal effect so that users are compared on the same basis
Consider user A with the record {'sub1':3,'sub2':1,'sub3':4,'sub4':2} and user B with {'sub1':6,'sub2':2,'sub3':15,'sub4':3}. Looking at the raw counts, one might conclude that user B likes sub3 more than user A does. However, the difference might simply be because user B registered on Pinkoi earlier than user A. After normalization, we can compare them on the same ground.
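
The comparison between users A and B can be sketched as follows. This is a minimal standalone illustration in Python 3 syntax; the helper name `normalize_by_max` is an assumption made for this example and is not part of the notebook.

```python
def normalize_by_max(record):
    """Divide every count by the largest count in the record."""
    top = max(record.values())
    return {sub: count / float(top) for sub, count in record.items()}

user_a = {'sub1': 3, 'sub2': 1, 'sub3': 4, 'sub4': 2}
user_b = {'sub1': 6, 'sub2': 2, 'sub3': 15, 'sub4': 3}

norm_a = normalize_by_max(user_a)
norm_b = normalize_by_max(user_b)

# The raw counts for sub3 differ by 11 purchases ...
print(user_b['sub3'] - user_a['sub3'])   # 11
# ... yet sub3 is the top subcategory for both users after normalization.
print(norm_a['sub3'], norm_b['sub3'])    # 1.0 1.0
```

After normalization, the remaining values describe how strongly each user buys the other subcategories relative to their own favorite, regardless of how long they have been registered.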

#### Easy to trace changes in preference
This could be done with the original records, but it is easier with the normalized ones, since each subcategory is expressed relative to the user's most purchased subcategory. This is useful when we want to test the novelty/serendipity performance of the recommendation system. If the normalized value of a new subcategory surges for a user after our recommendations, we can take it as a sign that the system performs well at recommending novel/serendipitous products to that user.
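
One way such a check might look is sketched below. The function name `new_subcategories`, the `threshold` value, and both example records are hypothetical and chosen only for illustration; the notebook does not define how a surge is measured.

```python
def new_subcategories(before, after, threshold=0.2):
    """Return subcategories absent from `before` whose normalized
    value in `after` is at least `threshold` (an assumed cut-off)."""
    return {sub: value for sub, value in after.items()
            if sub not in before and value >= threshold}

# Hypothetical normalized records of one user before and after recommendations.
before = {'sub1': 0.6, 'sub2': 0.2, 'sub3': 1.0, 'sub4': 0.4}
after = {'sub1': 0.5, 'sub2': 0.1, 'sub3': 1.0, 'sub4': 0.3,
         'sub5': 0.4, 'sub6': 0.1}

print(new_subcategories(before, after))  # {'sub5': 0.4}
```

Because both records are on the same 0 to 1 scale, the same threshold can be applied to every user.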

### Implementation
First, we find the top 3 subcategories each user prefers (MostComm and UserFeatureTrans). Second, we group users whose T1 (most purchased) subcategory is the same and calculate similarity within each group. If no other user shares the same T1, the group is formed on T2 instead.
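
The two steps can be sketched with the standard library alone. This is a simplified illustration, not the notebook's code: the user ids, purchase strings, and the helper `top_subcategories` are made up for the example, and the padding and T2 fallback are left out.

```python
from collections import Counter, defaultdict

# Hypothetical comma-separated purchase records, one per user.
purchases = {
    'u0': 'bag,bag,case,ring',
    'u1': 'ring,ring,bag',
    'u2': 'bag,case,bag,bag',
}

def top_subcategories(record, n=3):
    """Return up to n subcategories ordered by purchase count."""
    counts = Counter(record.split(','))
    return [sub for sub, _ in counts.most_common(n)]

# Step 2: group users by their most purchased subcategory (T1).
groups = defaultdict(list)
for uid, record in purchases.items():
    t1 = top_subcategories(record)[0]
    groups[t1].append(uid)

print(sorted(groups['bag']))   # ['u0', 'u2']
print(groups['ring'])          # ['u1']
```

Restricting the comparison to users in the same group keeps the number of pairwise similarity computations much smaller than comparing every user with every other user.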

In [2]:
import pandas as pd
import numpy as np
from random import shuffle
from collections import Counter

In [14]:
# Load the Cython extension to speed up the similarity functions a little
%load_ext Cython

In [9]:
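# Find the `mosk` most common subcategories of a user, padded with -1 up to u-1 entries
def MostComm(x,mosk,u):
    subcat=x.split(",")
    # Shuffle the purchases (presumably to randomize how equally frequent subcategories are ordered)
    shuffle(subcat)
    most_com=[e[0] for e in Counter(subcat).most_common(mosk)]
    
    # Pad with -1 when the user bought fewer distinct subcategories than required
    if len(most_com)<u-1:
        embed=[-1]*(u-1-len(most_com))
        most_com.extend(embed)
    return most_com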

In [7]:
# Find the top 3 subcategories of every user in the DataFrame u
def UserFeatureTrans(u,k):
    cols=['T1','T2','T3']
    
    # MostComm pads each result to len(cols) entries (it expects the column count including UID)
    Most_common=[MostComm(e,k,len(cols)+1) for e in u['subcategory']]
    
    User_feature=pd.DataFrame(Most_common,index=range(len(u)),columns=cols)
    User_feature.insert(0,'UID',[u'{s}'.format(s=i) for i in range(len(u))])
    return User_feature

In [15]:
%%cython
from libc.math cimport sqrt,pow

# Cosine similarity between the designer purchase counts of two users
def SimDesigner(x,y):
    Dx=x.split(",")
    Dy=y.split(",")

    Cdx={}.fromkeys(Dx,0)
    for e in Dx:
        Cdx[e]+=1
    Cdy={}.fromkeys(Dy,0)
    for e in Dy:
        Cdy[e]+=1
    
    Norm_cdx=sqrt(sum([e*e for e in Cdx.values()]))
    Norm_cdy=sqrt(sum([e*e for e in Cdy.values()]))
    
    common=set(Cdx.keys()).intersection(Cdy.keys())
    score=0.0
    
    if len(common)==0:
        return score
    else:
        for k in common:
            score+=Cdx[k]*Cdy[k]
        return score/(Norm_cdx*Norm_cdy)
    
# Cosine similarity between the max-normalized subcategory counts of two users
def SimSubcategory(x,y):
    Dx=x.split(",")
    Dy=y.split(",")
    
    Countx={}.fromkeys(Dx,0)
    for e in Dx:
        Countx[e]+=1
    County={}.fromkeys(Dy,0)
    for e in Dy:
        County[e]+=1
    
    max_x=max(Countx.values())
    max_y=max(County.values())
    
    for kx in Countx.keys():
        Countx[kx]=float(Countx[kx])/float(max_x)
    for ky in County.keys():
        County[ky]=float(County[ky])/float(max_y)
    
    Norm_cdx=sqrt(sum([e*e for e in Countx.values()]))
    Norm_cdy=sqrt(sum([e*e for e in County.values()]))

    common=set(Countx.keys()).intersection(County.keys())

    score=0.0
    if len(common)==0:
        return score
    else:
        for k in common:
            score+=Countx[k]*County[k]
        return score/(Norm_cdx*Norm_cdy)

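# Cosine similarity between the product-tag counts of two users
def SimStyle(x,y):

    Dx=x.split(",")
    Dy=y.split(",")
    # Drop every occurrence of the first tag, then strip the first character of each remaining tag
    Dx=[e[1:] for e in Dx if Dx.index(e)!=0]
    Dy=[e[1:] for e in Dy if Dy.index(e)!=0]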
 
    Cdx={}.fromkeys(Dx,0)
    for e in Dx:
        Cdx[e]+=1
    Cdy={}.fromkeys(Dy,0)
    for e in Dy:
        Cdy[e]+=1
    
    Norm_cdx=sqrt(sum([e*e for e in Cdx.values()]))
    Norm_cdy=sqrt(sum([e*e for e in Cdy.values()]))
    common=set(Cdx.keys()).intersection(Cdy.keys())

    score=0.0
    
    if len(common)==0:
        return score
    else:
        for k in common:
            score+=Cdx[k]*Cdy[k]
        return score/(Norm_cdx*Norm_cdy)

# Return the ids of the n highest-scoring (id, score) pairs, best first
def TakeTopN(l,n):
    tmp_sort=sorted(l,key=lambda e:e[1],reverse=True)
    return [e[0] for e in tmp_sort[0:n]]
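
SimDesigner, SimSubcategory and SimStyle all follow the same pattern: count the items in two comma-separated strings and take the cosine of the two count vectors. The sketch below restates that pattern in plain Python 3 for reference; the name `cosine_from_lists` is an assumption for this example, and it does not reproduce the tag preprocessing done in SimStyle.

```python
from collections import Counter
from math import sqrt

def cosine_from_lists(x, y):
    """Cosine similarity between the item counts of two comma-separated strings."""
    cx, cy = Counter(x.split(',')), Counter(y.split(','))
    dot = sum(cx[key] * cy[key] for key in cx if key in cy)
    norm = (sqrt(sum(v * v for v in cx.values())) *
            sqrt(sum(v * v for v in cy.values())))
    return dot / norm if norm else 0.0

print(cosine_from_lists('a,b', 'c,d'))                  # 0.0, nothing in common
print(round(cosine_from_lists('a,a,b', 'a,b,b'), 6))    # 0.8
# Doubling every count leaves the score unchanged (up to rounding).
print(round(cosine_from_lists('a,a,b', 'a,a,a,a,b,b'), 6))  # 1.0
```

The last line illustrates that cosine similarity does not change when a count vector is rescaled by a positive constant. Dividing by the maximum in SimSubcategory should therefore not alter the score; the normalized records remain useful for inspecting and comparing users directly.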

In [16]:
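# Cosine similarity between the (content weight, style weight) vectors of users x and y
def SimWeight(x,y):
    ci=UserData.loc[x,'Content weight']
    si=UserData.loc[x,'Style weight']
    cj=UserData.loc[y,'Content weight']
    sj=UserData.loc[y,'Style weight']
    
    # Euclidean lengths of the two weight vectors
    li=np.sqrt(pow(ci,2)+pow(si,2))
    lj=np.sqrt(pow(cj,2)+pow(sj,2))
    
    return (ci*cj+si*sj)/(li*lj)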
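
To get a feel for what SimWeight returns, the same formula can be applied to the weights of users 0, 1 and 2 shown in the UserData.head() output. This standalone sketch takes the weights as plain tuples instead of reading them from UserData; the name `weight_similarity` is an assumption for this example.

```python
from math import sqrt

def weight_similarity(a, b):
    """Cosine similarity between two (content weight, style weight) pairs."""
    (ci, si), (cj, sj) = a, b
    return (ci * cj + si * sj) / (sqrt(ci ** 2 + si ** 2) * sqrt(cj ** 2 + sj ** 2))

user0 = (0.050896, 0.949104)  # mostly style-driven
user1 = (0.806415, 0.193585)  # mostly content-driven
user2 = (0.036503, 0.963497)  # mostly style-driven

sim_01 = weight_similarity(user0, user1)
sim_02 = weight_similarity(user0, user2)
print(sim_02 > sim_01)  # True
```

Users 0 and 2 both lean heavily on style, so their score is close to 1, while users 0 and 1 lean in opposite directions and score much lower.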

In [21]:
def SimularityCal(df,k):
    UserSim_df=pd.DataFrame(index=range(len(df)),columns=range(k))
    for i in range(len(df)):
        print("Customer "+str(i))
        # Candidates: users with the same T1 as user i, excluding user i
        mask_idx=[e for e in User_feature[User_feature['T1']==User_feature.loc[i,'T1']].index if e!=i]
        if len(mask_idx)==0:
            # Fall back to users with the same T2
            mask_idx=[e for e in User_feature[User_feature['T2']==User_feature.loc[i,'T2']].index if e!=i]
        
        data_filtered=df.loc[mask_idx,:]

        designer_score=np.array([SimDesigner(df.loc[i,'designer'],e) for e in data_filtered['designer']])
        category_score=np.array([SimSubcategory(df.loc[i,'subcategory'],e) for e in data_filtered['subcategory']])
        style_score=np.array([SimStyle(df.loc[i,'Product Tag'],e) for e in data_filtered['Product Tag']])
        weight_score=np.array([SimWeight(i,e) for e in mask_idx])
        # Overall similarity is the unweighted sum of the four component scores
        sim_score=list(designer_score+category_score+style_score+weight_score)
        # Pair each candidate with its own score (stays correct when scores tie)
        ui=list(zip(mask_idx,sim_score))

        # Keep at most k candidates and pad with the user's own index if fewer exist
        TopN=TakeTopN(ui,k)[:k]
        UserSim_df.loc[i,:]=TopN+[i]*(k-len(TopN))
    return UserSim_df
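
The ranking step inside SimularityCal adds the four component scores for every candidate and keeps the k best. A minimal standalone sketch of that step is given below; the function name `rank_candidates`, the candidate ids, and all score values are hypothetical.

```python
def rank_candidates(candidate_ids, component_scores, k):
    """Sum the component scores per candidate and return the k best ids."""
    totals = [sum(parts) for parts in zip(*component_scores)]
    ranked = sorted(zip(candidate_ids, totals),
                    key=lambda pair: pair[1], reverse=True)
    return [uid for uid, _ in ranked[:k]]

candidate_ids = [3, 7, 9, 12]
designer = [0.10, 0.50, 0.50, 0.00]
category = [0.90, 0.80, 0.80, 0.20]
style    = [0.30, 0.60, 0.60, 0.10]
weight   = [0.99, 0.40, 0.40, 0.95]

top2 = rank_candidates(candidate_ids, [designer, category, style, weight], k=2)
print(top2)  # [7, 9]
```

Candidates 7 and 9 have identical totals here. Pairing ids and scores with `zip` keeps both of them, whereas looking an id up by the position of its score would return candidate 7 twice.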

In [3]:
UserData=pd.read_csv("Intermediate/SimUserData.csv",sep=",",encoding='utf8')

In [8]:
UserData.head()

| | UID | designer | tid | subcategory | Product Tag | Content weight | Style weight | Browse |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | panicjunkie,beanbeancase,lovespringtime,innere... | 1LNcMozq,1fKcWUPZ,1dWKrV5w,1NrPloT8,1xo784ZW,1... | iPhone 週邊,Android 週邊,肩背包/斜背包,平板/電腦保護套,電腦包,束口後背... | 三星,google,htc,押花,nexus,note5,自然,特別,iPhone 週邊,禮... | 0.050896 | 0.949104 | 1tLIr0zx |
| 1 | 1 | edie,googoods,wakakuwa,kimmidoll,wunghuh,daffo... | 1R9IzDnu,149GTys_,18H5ZFLX,159QNvoZ,1z4JPC9Z,1... | 胸針,項鍊/墜子,髮飾,鑰匙圈/鑰匙包,耳環,胸針,項鍊/墜子,髮飾,鑰匙圈/鑰匙包,耳環,... | fashion,羊毛氈,日本花簪,項鍊,胸章,兒童,迷你小相本鑰匙圈◆幾何系列◆,desig... | 0.806415 | 0.193585 | 0 |
| 2 | 2 | deliatai,oone-n-only,tsukiniyorosiku,twine,ses... | 1EB9FX2e,1MZaVziY,1YY6E82N,1TaTLdiX,14yhgn_L,1... | 胸針,項鍊/墜子,髮飾,鑰匙圈/鑰匙包,耳環,胸針,項鍊/墜子,髮飾,鑰匙圈/鑰匙包,耳環,... | ses,胸章,鑰匙圈,創意,design,純銀項鍊,韓風髮箍,kkk.首飾りnecklace... | 0.036503 | 0.963497 | 1r3ZCWTo,1r_VB2EE |
| 3 | 3 | iamidesign,littleprince,everythinginbetween,si... | 14Rq3sEa,1VTZwwBQ,1xoV23DQ,1f1NhrOS,1XBJ0jUP | iPhone 週邊,Android 週邊,肩背包/斜背包,平板/電腦保護套,其他7 | mini,love,cute,萌,悠遊卡手機殼,ipadmini,貓,可愛,悠遊卡保護殼,i... | 0.883987 | 0.116013 | 0 |
| 4 | 4 | bang-on-shop,pollinosis,lunablue,nadia,ishan13... | 1xfaWs4D,1lwO72OY,1unlj5pu,1ukhY4Bu,1xHzNBML,1... | 胸針,項鍊/墜子,髮飾,鑰匙圈/鑰匙包,耳環,胸針,項鍊/墜子,髮飾,鑰匙圈/鑰匙包,耳環,... | 雙鍊,小物,項鍊,仿珍珠,蕾絲,施華洛世奇水晶,懷錶,創意,design,婚禮,復古手工耳環... | 0.591378 | 0.408622 | 1MLRlJzi,1w9Pu9rP,1UJkCb2H,1SFUDbQV |


In [4]:
Product=pd.read_csv("Intermediate/feature_p.csv",sep=",",encoding='utf8')
Style_feature=list(Product.iloc[:,0])

In [5]:
Designer=pd.read_csv("Intermediate/designer_data_list.csv",encoding='utf8',header=None)
Design_feature=list(Designer.iloc[:,0])

In [10]:
User_feature=UserFeatureTrans(UserData,3)

In [11]:
User_feature.head()

| | UID | T1 | T2 | T3 |
|---|---|---|---|---|
| 0 | 0 | iPhone 週邊 | 電腦包 | 束口後背包 |
| 1 | 1 | 項鍊/墜子 | 耳環 | 髮飾 |
| 2 | 2 | 項鍊/墜子 | 鑰匙圈/鑰匙包 | 髮飾 |
| 3 | 3 | 其他7 | 平板/電腦保護套 | Android 週邊 |
| 4 | 4 | 胸針 | 項鍊/墜子 | 髮飾 |


In [19]:
UserData['Product Tag']=UserData['Product Tag'].fillna(u'0')

In [22]:
UserDf=SimularityCal(UserData,2)

In [None]:
UserDf.to_csv("Intermediate/Similarity_user.csv",sep=",",encoding='utf8',index=False)