<h1>NBO/A Recommender Code Walkthrough</h1>

Author: Sabine Joseph (Accenture GmbH)
sabine.a.joseph@accenture.com

In [7]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

*Generating a toy dataset of 5 orders, each described by at most 5 features (sample SA codes)*
- we assume that this dataset represents a bucket (a subset of the full dataset for a selected market)
- i.e. the dataset only includes orders with the same type class, engine specs and interior/exterior packages

In [12]:
labels = ['OrderID', 'Features']
orders = [(1001, 'SA1,SA2,SA4'),
         (1002, 'SA1,SA2,SA4,SA5'),
         (1003, 'SA1,SA2,SA3,SA4,SA5'),
         (1004, 'SA1,SA2,SA3,SA5'),
         (1005, 'SA2,SA5')]
df = pd.DataFrame.from_records(orders, columns=labels)
df

Unnamed: 0,OrderID,Features
0,1001,"SA1,SA2,SA4"
1,1002,"SA1,SA2,SA4,SA5"
2,1003,"SA1,SA2,SA3,SA4,SA5"
3,1004,"SA1,SA2,SA3,SA5"
4,1005,"SA2,SA5"


*Vectorization of features*
- generates a sparse order-feature matrix from the Features column (one row per order, one column per feature)

In [16]:
vectorizer = CountVectorizer(tokenizer=lambda features: features.split(","), lowercase=False)
orderFeatureMatrix = vectorizer.fit_transform(df['Features'])
# scikit-learn >= 1.0; older versions use vectorizer.get_feature_names() instead
featureList = vectorizer.get_feature_names_out()
orderFeatureMatrixDF = pd.DataFrame(orderFeatureMatrix.todense(), index=None, columns=featureList)
orderFeatureMatrixDF

Unnamed: 0,SA1,SA2,SA3,SA4,SA5
0,1,1,0,1,0
1,1,1,0,1,1
2,1,1,1,1,1
3,1,1,1,0,1
4,0,1,0,0,1
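To make the vectorization step less of a black box, here is a minimal plain-Python sketch of what it amounts to for this toy dataset: collect the distinct SA codes in sorted order, then count each code per order. The names `vocabulary` and `matrix` are illustrative only and not part of the notebook; this is not how `CountVectorizer` is implemented internally.

```python
# Toy orders from the dataset above (OrderID, comma-separated SA codes)
orders = [(1001, 'SA1,SA2,SA4'),
          (1002, 'SA1,SA2,SA4,SA5'),
          (1003, 'SA1,SA2,SA3,SA4,SA5'),
          (1004, 'SA1,SA2,SA3,SA5'),
          (1005, 'SA2,SA5')]

# Column order: all distinct SA codes, sorted alphabetically
vocabulary = sorted({code for _, feats in orders for code in feats.split(',')})

# One row per order, one count per SA code (0 or 1 here, as no code repeats)
matrix = [[feats.split(',').count(code) for code in vocabulary]
          for _, feats in orders]

for (order_id, _), row in zip(orders, matrix):
    print(order_id, row)
```

The rows printed here should match the rows of orderFeatureMatrixDF shown above.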


*Matrix transposition*
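- the similarity function compares the rows of its input, which it expects in the format (n_samples, n_features)
- in the matrix above the rows are orders, so passing it as-is would yield order-to-order similarity
- transposing it makes each feature a row, so we get feature-to-feature similarity instead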

In [24]:
orderFeatureMatrixDF.T

Unnamed: 0,0,1,2,3,4
SA1,1,1,1,1,0
SA2,1,1,1,1,1
SA3,0,0,1,1,0
SA4,1,1,1,0,0
SA5,0,1,1,1,1
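The effect of the transposition is easier to see on a non-square matrix. The sketch below uses a hypothetical NumPy helper, `pairwise_cosine`, that compares rows in the same way the similarity function does, and a made-up matrix of 3 orders by 2 features (not taken from the dataset).

```python
import numpy as np

def pairwise_cosine(M):
    # Hypothetical helper: cosine similarity between every pair of ROWS of M
    M = np.asarray(M, dtype=float)
    norms = np.linalg.norm(M, axis=1)
    return M.dot(M.T) / np.outer(norms, norms)

# Made-up example: 3 orders (rows) x 2 features (columns)
X = np.array([[1, 1],
              [1, 0],
              [1, 1]])

order_sim = pairwise_cosine(X)      # rows are orders   -> 3 x 3 matrix
feature_sim = pairwise_cosine(X.T)  # rows are features -> 2 x 2 matrix

print(order_sim.shape, feature_sim.shape)
```

In the toy dataset both results would be 5 x 5 (5 orders, 5 features), which makes it easy to pass the wrong orientation without noticing.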


*Calculating cosine similarity*
- Why cosine?
    - very efficient and commonly used for comparing sparse vectors/matrices
    - the results lie between -1 and 1 (between 0 and 1 for non-negative count vectors like ours) and are easy to interpret
    - other similarity metrics: Jaccard, Pearson, Spearman, Euclidean, Manhattan

In [25]:
similarityMatrix = cosine_similarity(orderFeatureMatrix.T)
featureList = ['SA1', 'SA2', 'SA3', 'SA4', 'SA5']
similarityMatrixDF = pd.DataFrame(similarityMatrix, index=featureList, columns=featureList)
similarityMatrixDF

Unnamed: 0,SA1,SA2,SA3,SA4,SA5
SA1,1.0,0.894427,0.707107,0.866025,0.75
SA2,0.894427,1.0,0.632456,0.774597,0.894427
SA3,0.707107,0.632456,1.0,0.408248,0.707107
SA4,0.866025,0.774597,0.408248,1.0,0.57735
SA5,0.75,0.894427,0.707107,0.57735,1.0
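As a sanity check, a single entry of the matrix above can be reproduced by hand: the cosine similarity of two vectors is their dot product divided by the product of their lengths. The sketch below does this for SA1 and SA2, using the two rows of the transposed matrix; the function name `cosine` is illustrative only.

```python
import numpy as np

# Rows of the transposed order-feature matrix: which orders contain the feature
sa1 = np.array([1, 1, 1, 1, 0], dtype=float)
sa2 = np.array([1, 1, 1, 1, 1], dtype=float)

def cosine(a, b):
    # dot product divided by the product of the Euclidean norms
    return a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b))

sim_sa1_sa2 = cosine(sa1, sa2)
print(round(sim_sa1_sa2, 6))  # 4 / (2 * sqrt(5))
```

Here the dot product is 4 (the number of orders containing both features), and the norms are 2 and sqrt(5), which gives the 0.894427 shown in the table.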


*Calculating feature take rates*
- based on orderFeatureMatrixDF table
- taking mean of each column

In [28]:
featureTakeratesDF = pd.DataFrame(featureList, columns=['Features'])
featureTakeratesDF['Takerate'] = orderFeatureMatrixDF.mean().values
featureTakeratesDF

Unnamed: 0,Features,Takerate
0,SA1,0.8
1,SA2,1.0
2,SA3,0.4
3,SA4,0.6
4,SA5,0.8
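Because every entry of the order-feature matrix is 0 or 1 here, the column mean is simply the share of orders that contain the feature. A minimal sketch of the same calculation without pandas, with illustrative names (`counts`, `take_rates`) that are not part of the notebook:

```python
from collections import Counter

# Feature strings of the 5 toy orders
order_features = ['SA1,SA2,SA4',
                  'SA1,SA2,SA4,SA5',
                  'SA1,SA2,SA3,SA4,SA5',
                  'SA1,SA2,SA3,SA5',
                  'SA2,SA5']

# Number of orders that contain each SA code
counts = Counter(code for feats in order_features for code in set(feats.split(',')))

# Take rate = orders containing the feature / total number of orders
take_rates = {code: counts[code] / float(len(order_features)) for code in sorted(counts)}
print(take_rates)
```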


*Calculating scores for a specific order*
- here: the given order contains 2 Features: SA1 and SA3

In [79]:
#################################################################
currentOrderFeatures = 'SA1,SA3'
currentOrderFeatureCount = len(currentOrderFeatures.split(','))
#################################################################

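def FeatureInFeatureList(Feature, currentOrderFeatures):
    # wrap both strings in commas so that only whole SA codes match (e.g. 'SA1' does not match 'SA10')
    return (",{},".format(Feature)) in ",{},".format(currentOrderFeatures)

def CalculateScore(row, featureTakeratesDF, currentOrderFeatureCount):
    # similarity * CurrentOrder flag * take rate, summed and averaged over the features in the current order
    score = sum(row.values * featureTakeratesDF['CurrentOrder'].values * (featureTakeratesDF['Takerate'].values))
    score = score / currentOrderFeatureCount
    return score

# flag the features that are part of the current order
featureTakeratesDF['CurrentOrder'] = featureTakeratesDF.apply(lambda feature: 1 if FeatureInFeatureList(feature['Features'], currentOrderFeatures) else 0, axis = 1)
# apply() iterates over the columns here; as the similarity matrix is symmetric, each column equals the corresponding row
similarityMatrixDF['Score'] = similarityMatrixDF.apply(lambda row: -1 if FeatureInFeatureList(row.name, currentOrderFeatures) else CalculateScore(row, featureTakeratesDF, currentOrderFeatureCount))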

As a result, featureTakeratesDF now contains a new column (CurrentOrder) that flags which features are part of the current order

In [73]:
featureTakeratesDF

Unnamed: 0,Features,Takerate,CurrentOrder
0,SA1,0.8,1
1,SA2,1.0,0
2,SA3,0.4,1
3,SA4,0.6,0
4,SA5,0.8,0


We also get a new Score column in the similarity matrix table, showing the individual score of each feature

In [81]:
similarityMatrixDF

Unnamed: 0,SA1,SA2,SA3,SA4,SA5,Score
SA1,1.0,0.894427,0.707107,0.866025,0.75,-1.0
SA2,0.894427,1.0,0.632456,0.774597,0.894427,0.484262
SA3,0.707107,0.632456,1.0,0.408248,0.707107,-1.0
SA4,0.866025,0.774597,0.408248,1.0,0.57735,0.42806
SA5,0.75,0.894427,0.707107,0.57735,1.0,0.441421


Let's break down exactly how the scoring works!
- if the feature is already part of the current order, its Score is set to -1
     - those features will not be recommended
- otherwise a score is calculated for each feature as follows (e.g. how do we get the score for SA2?):
    - element-wise multiplication of 3 vectors containing 5 values each:
        the entire SA2 column, the featureTakeratesDF CurrentOrder column and the Takerate column
    - the sum of the resulting vector is taken
    - the sum is divided by the total number of features in the current order (here: currentOrderFeatureCount = 2)

In [76]:
print('SA2 values ' + str(similarityMatrixDF.SA2.values))
print('feature takerates from current order ' + str(featureTakeratesDF['CurrentOrder'].values))
print('feature takerates from dataset ' + str(featureTakeratesDF['Takerate'].values))
res = sum(similarityMatrixDF.SA2.values * featureTakeratesDF['CurrentOrder'].values * (featureTakeratesDF['Takerate'].values))

print('sum of values in resulting vector ' + str(res))
print('division by total number of features in current order ' + str(res/currentOrderFeatureCount))
print('the final score for SA2 is ' + str(res/currentOrderFeatureCount))

SA2 values [ 0.89442719  1.          0.63245553  0.77459667  0.89442719]
feature takerates from current order [1 0 1 0 0]
feature takerates from dataset [ 0.8  1.   0.4  0.6  0.8]
sum of values in resulting vector 0.968523965613
division by total number of features in current order 0.484261982807
the final score for SA2 is 0.484261982807
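The SA2 breakdown above generalizes to all features at once: multiplying the similarity matrix by the vector of take rates masked with the current-order flags yields every score in one step. The following is a self-contained NumPy sketch under that reading; the function `score_all` and the variable names are hypothetical and not part of the notebook.

```python
import numpy as np

features = ['SA1', 'SA2', 'SA3', 'SA4', 'SA5']

# Order-feature matrix of the toy dataset (rows: orders, columns: features)
X = np.array([[1, 1, 0, 1, 0],
              [1, 1, 0, 1, 1],
              [1, 1, 1, 1, 1],
              [1, 1, 1, 0, 1],
              [0, 1, 0, 0, 1]], dtype=float)

# Feature-to-feature cosine similarity and take rates
norms = np.linalg.norm(X, axis=0)
sim = X.T.dot(X) / np.outer(norms, norms)
takerate = X.mean(axis=0)

def score_all(current_order):
    # Hypothetical helper: score every feature against the current order
    in_order = np.array([f in current_order for f in features])
    weights = in_order * takerate              # take rate, zeroed outside the order
    scores = sim.dot(weights) / in_order.sum() # average over features in the order
    scores[in_order] = -1.0                    # features already ordered are excluded
    return dict(zip(features, scores))

scores = score_all({'SA1', 'SA3'})
for code in features:
    print(code, round(scores[code], 6))
```

For the order containing SA1 and SA3 this should reproduce the Score column shown earlier (SA2 about 0.484262, SA4 about 0.42806, SA5 about 0.441421).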


Why do we get slightly different scores for different orders of the same bucket?
- the similarity values remain the same within a bucket
- but the orders contain different features and a different number of them, which changes both the sum and the divisor
- see sample order 2

In [63]:
#################################################################
currentOrderFeatures_2 = 'SA1,SA3,SA5'
currentOrderFeatureCount_2 = len(currentOrderFeatures_2.split(','))
#################################################################

In [66]:
featureTakeratesDF['CurrentOrder'] = featureTakeratesDF.apply(lambda feature: 1 if FeatureInFeatureList(feature['Features'], currentOrderFeatures_2) else 0, axis = 1)
# the membership test must use the second order as well, otherwise SA5 would still receive a score
similarityMatrixDF['Score'] = similarityMatrixDF.apply(lambda row: -1 if FeatureInFeatureList(row.name, currentOrderFeatures_2) else CalculateScore(row, featureTakeratesDF, currentOrderFeatureCount_2))

In [67]:
similarityMatrixDF

Unnamed: 0,SA1,SA2,SA3,SA4,SA5,Score
SA1,1.0,0.894427,0.707107,0.866025,0.75,-1.0
SA2,0.894427,1.0,0.632456,0.774597,0.894427,0.561355
SA3,0.707107,0.632456,1.0,0.408248,0.707107,-1.0
SA4,0.866025,0.774597,0.408248,1.0,0.57735,0.439333
SA5,0.75,0.894427,0.707107,0.57735,1.0,-1.0


*Ranking / sorting of scores*
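- only features with a score greater than 0 are kept (this drops the features already in the order, which have a score of -1)
- the remaining features are sorted by score in descending order
- only the features with the highest scores are recommended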

In [84]:
rankingDF = pd.DataFrame(index=similarityMatrixDF.index)
rankingDF['Ranking'] = 0
rankingDF['SA_Code'] = rankingDF.index
rankingDF['Score'] = similarityMatrixDF['Score'].values
rankingDF = rankingDF[rankingDF['Score'] > 0].sort_values(by = 'Score', ascending = False)
rankingDF['Ranking'] = range(1,rankingDF.index.size + 1)
rankingDF

Unnamed: 0,Ranking,SA_Code,Score
SA2,1,SA2,0.484262
SA5,2,SA5,0.441421
SA4,3,SA4,0.42806
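The ranking step can also be sketched without pandas, which makes the final "recommend only the top features" cut explicit. The scores below are copied from the table for the first sample order; the cut-off `top_n` is a hypothetical parameter, since the notebook does not state how many features are recommended.

```python
# Scores for the first sample order (SA1,SA3), taken from the table above
scores = {'SA1': -1.0, 'SA2': 0.484262, 'SA3': -1.0, 'SA4': 0.42806, 'SA5': 0.441421}

# Keep only features with a positive score, i.e. those not yet in the order
candidates = [(code, score) for code, score in scores.items() if score > 0]

# Sort by score, highest first
ranked = sorted(candidates, key=lambda pair: pair[1], reverse=True)

top_n = 2  # hypothetical cut-off, not specified in the notebook
recommendations = [code for code, _ in ranked[:top_n]]
print(recommendations)
```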


In [1]:
','.join(list('U89,P23')[::-1])

'3,2,P,,,9,8,U'

In [6]:
','.join('U89,P23'.split(',')[::-1])

'P23,U89'

In [10]:
type_lookup =  {'A205': 'C-Class', 'C205': 'C-Class', 'W205': 'C-Class', 'X253': 'GLC', 'C253': 'GLC'}
file_lookup = {'C-Class': '/Data/Baumusterreferenzliste_filtered_C.xlsx', 
               'GLC': '/Data/Baumusterreferenzliste_filtered_GLC.csv'}

file_lookup[type_lookup['A205']]

'/Data/Baumusterreferenzliste_filtered_C.xlsx'