# Faire - The Online Wholesale Marketplace & Store

Welcome! Let's build a new search ranking model.

### Description

Knowing in advance whether a product shown in search will be bought could provide huge business value to Faire. This task matters both for purchase prediction and for short-term user engagement prediction.

- In this dataset we have sampled ~20k rows from Faire search logs. The dataset is anonymized. Before describing the dataset, here is some background on how Faire wholesale marketplace search works. Faire is a two-sided marketplace where retailers come to shop wholesale products from brands. When a retailer makes a search on the site, we call that a search request. The response is a page with many products. We assign a `request_id` to this search request response, and different pages (page number 1, 2, 3...) from the same search have different `request_id`s (i.e. `request_id` is more of a "page id" than a "search session id"). Each row in this dataset represents one single product that was impressed for that `request_id`. Each `request_id` can have many results (because this is a random sample, some of them might be missing in some cases).

- Each row contains the following fields: `request_id`, `retailer_token` (anonymized user token), `query_text` (the actual search string), `page_number`, `page_size`, `position`, `filter_string` (filters applied on top of the search), `has_product_click` (was it clicked or not).
- We have some features from our feature store (computed using data from before the search timestamp): anything starting with `product.` is a product-level feature. We also have a few personalization features in this dataset: anything starting with `retailerbrand.` is a personalization feature and relates to a particular retailer:brand pair.
- Note that we have personalization features only at the level of the retailer (user) and the brand the product belongs to. Brands usually have many products in our marketplace, so the Product <-> Brand mapping is Many:1.
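
To make the row-per-impression layout concrete, here is a minimal sketch on a toy frame. The values are invented for illustration, and the column names follow the field names in the description above rather than the exact column names in the CSV.

```python
import pandas as pd

# Toy impressions (invented values): one row per product shown for a request.
toy = pd.DataFrame({
    "request_id": [1, 1, 1, 2, 2],
    "page_number": [0, 0, 0, 1, 1],
    "position": [0, 1, 2, 48, 49],
    "has_product_click": [0, 1, 0, 0, 0],
})

# Each request_id identifies one page of results, so grouping by it
# recovers the (possibly incomplete) list of products shown on that page.
per_request = toy.groupby("request_id").agg(
    n_impressions=("position", "size"),
    n_clicks=("has_product_click", "sum"),
)
print(per_request)
```

A ranking model is judged on how it orders the rows inside each of these groups, which is why the group key matters for both splitting and evaluation.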

### Tasks
In approximately 1 hour, please do the following:
- First, please build a ranking model using the provided dataset and evaluate it.
- If there is time, please implement 1 or 2 additional features, and list up to 10 more features (without implementing them).
- If there is time, please give us an idea of the next steps: how would you improve this model if you had 1 more week, 1 more month, or 3 more months?

Please write as many comments as you can, or talk through your thought process out loud. Once your time allocation is up, please send back the completed notebook in .ipynb and .pdf format.

In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
# plt.style.use('seaborn-whitegrid')
import numpy as np
import pandas as pd
import csv

In [2]:
data = pd.read_csv("datasets/faire-ml-rank-small.csv")
data.head(3)

Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,product.product_brand_page_click_to_cart_rate,product.product_brand_page_impression_to_click_rate,product.product_click_to_cart_rate_4w,product.product_impression_to_click_rate_4w,product.product_is_high_sell_through,product.product_num_cart_adds_4w,product.product_num_clicks_4w,product.product_num_impressions_4w,...,description,has_product_click,created_at_a,query_text,filter_string,page_number,page_size,position,retailer_token_anon,request_id_anon
0,26152,26152,1.0708,0.0133,,,0.0,,,303.0,...,Hat-leopard visor sun hat.,0,2020-05-31 00:22:33.18,sun hat,,0.0,48,23.0,188,1433
1,26890,26890,1.9197,0.0134,,,0.0,,,77.0,...,"Offbeat, cheeky and distinctive greeting cards...",0,2020-05-31 01:21:00.418,beach,,23.0,24,559.0,132,703
2,41983,41983,0.608,0.0423,,,0.0,,,50.0,...,"Selene, considered the human personification o...",0,2020-05-31 23:34:57.391,gold love necklace,,1.0,48,58.0,81,516


In [3]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000 entries, 0 to 19999
Data columns (total 50 columns):
 #   Column                                                                      Non-Null Count  Dtype  
---  ------                                                                      --------------  -----  
 0   Unnamed: 0.1                                                                20000 non-null  int64  
 1   Unnamed: 0                                                                  20000 non-null  int64  
 2   product.product_brand_page_click_to_cart_rate                               19813 non-null  float64
 3   product.product_brand_page_impression_to_click_rate                         19815 non-null  float64
 4   product.product_click_to_cart_rate_4w                                       301 non-null    float64
 5   product.product_impression_to_click_rate_4w                                 4555 non-null   float64
 6   product.product_is_high_sell_through          

In [4]:
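def match_title(row):
    # 1.0 if the raw query string appears verbatim (case-sensitive) in the title
    return 1.0 if row['query_text'] in row['title'] else 0.0

def match_desc(row):
    # 1.0 if the raw query string appears verbatim (case-sensitive) in the description
    return 1.0 if row['query_text'] in row['description'] else 0.0

# Missing text would make the `in` check raise a TypeError, so fill it first
data['title'] = data['title'].fillna("")
data['description'] = data['description'].fillna("")
data['title_match'] = data.apply(match_title, axis=1)
data['description_match'] = data.apply(match_desc, axis=1)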
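
The two match features above only fire when the whole query appears verbatim, with the same casing, in the text. One possible softer variant is a case-insensitive token overlap. The sketch below is an illustration on invented rows, not part of the original notebook. `token_overlap` is a hypothetical helper, and it splits on whitespace only, so punctuation stays attached to tokens.

```python
import pandas as pd

def token_overlap(query, text):
    """Fraction of query tokens that also appear in text (case-insensitive)."""
    if not isinstance(query, str) or not isinstance(text, str):
        return 0.0
    q_tokens = set(query.lower().split())
    if not q_tokens:
        return 0.0
    t_tokens = set(text.lower().split())
    return len(q_tokens & t_tokens) / len(q_tokens)

# Invented example rows; the real notebook would use data['query_text'] and data['title'].
toy = pd.DataFrame({
    "query_text": ["sun hat", "gold love necklace"],
    "title": ["Leopard Visor Sun Hat", "Silver Moon Necklace"],
})
toy["title_token_overlap"] = [
    token_overlap(q, t) for q, t in zip(toy["query_text"], toy["title"])
]
print(toy)
```

A graded score like this gives the model some signal for partial matches, at the cost of ignoring word order.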

In [5]:
# Drop columns that are not needed
features = data.drop(
    ['Unnamed: 0.1', 'Unnamed: 0', 'rand', 'title', 'description', 'created_at_a', 'query_text','filter_string', 'retailer_token_anon', 'product.product_is_high_sell_through'],
    axis=1
)
features.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000 entries, 0 to 19999
Data columns (total 42 columns):
 #   Column                                                                      Non-Null Count  Dtype  
---  ------                                                                      --------------  -----  
 0   product.product_brand_page_click_to_cart_rate                               19813 non-null  float64
 1   product.product_brand_page_impression_to_click_rate                         19815 non-null  float64
 2   product.product_click_to_cart_rate_4w                                       301 non-null    float64
 3   product.product_impression_to_click_rate_4w                                 4555 non-null   float64
 4   product.product_num_cart_adds_4w                                            301 non-null    float64
 5   product.product_num_clicks_4w                                               21 non-null     float64
 6   product.product_num_impressions_4w            

In [6]:
features.describe()

Unnamed: 0,product.product_brand_page_click_to_cart_rate,product.product_brand_page_impression_to_click_rate,product.product_click_to_cart_rate_4w,product.product_impression_to_click_rate_4w,product.product_num_cart_adds_4w,product.product_num_clicks_4w,product.product_num_impressions_4w,product.product_num_search_excess_cart_adds_4w,product.product_num_search_excess_clicks_4w,product.product_num_search_impressions_4w,...,retailerbrand.retailer_last_added_brands_brand_lightfm_cosine_similarity,retailerbrand.retailer_last_ordered_brands_brand_lightfm_cosine_similarity,retailerbrand.retailer_last_visited_brands_brand_lightfm_cosine_similarity,has_product_click,page_number,page_size,position,request_id_anon,title_match,description_match
count,19813.0,19815.0,301.0,4555.0,301.0,21.0,19909.0,19804.0,19803.0,19804.0,...,12021.0,11324.0,12214.0,20000.0,20000.0,20000.0,20000.0,20000.0,20000.0,20000.0
mean,0.412997,0.020205,0.396585,0.034586,81.475083,578.619048,320.249586,0.11403,1.006001,196.564027,...,0.947523,0.946361,0.946014,0.04195,2.0748,43.0522,97.56155,732.61825,0.00625,0.16055
std,0.30217,0.022409,0.324875,0.030168,97.222306,198.494452,432.717359,2.207659,10.33838,590.221938,...,0.220458,0.222759,0.22535,0.20048,6.625179,10.839181,225.353999,427.707775,0.078811,0.367125
min,0.0,0.0,0.0177,0.0,4.0,413.0,1.0,-81.2543,-209.2991,1.0,...,-0.673838,-0.67032,-0.671783,0.0,0.0,1.0,0.0,0.0,0.0,0.0
25%,0.2075,0.0085,0.15,0.0148,28.0,460.0,94.0,-0.2064,-0.62365,31.0,...,0.999331,0.999297,0.999342,0.0,0.0,48.0,14.0,358.0,0.0,0.0
50%,0.3371,0.0142,0.3621,0.0262,52.0,510.0,195.0,-0.0824,-0.1835,70.0,...,0.999664,0.99965,0.999647,0.0,0.0,48.0,37.0,735.0,0.0,0.0
75%,0.5286,0.0239,0.5464,0.0455,88.0,554.0,374.0,-0.0176,0.6863,160.0,...,0.999816,0.999823,0.999805,0.0,2.0,50.0,89.0,1104.25,0.0,0.0
max,2.5419,0.4945,2.4107,0.3735,569.0,1029.0,5030.0,42.1637,356.3961,9644.0,...,1.0,1.0,1.0,1.0,172.0,50.0,5022.0,1462.0,1.0,1.0


In [7]:
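from sklearn.preprocessing import StandardScaler

# Keep the group id and the label unscaled
ignore_columns = {'request_id_anon', 'has_product_click'}
# Missing values become 0.0; note this conflates "no data" with a true zero
features = features.fillna(0.0)
# Standardize each feature column independently.
# Caveat: the scalers are fit on all rows, before the train/test split,
# so test-set statistics leak into the scaling.
for column in features.columns:
    if column in ignore_columns:
        continue
    scaler = StandardScaler()
    features[column] = scaler.fit_transform(features[[column]]).ravel()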

features['has_product_click'].describe()

count    20000.00000
mean         0.04195
std          0.20048
min          0.00000
25%          0.00000
50%          0.00000
75%          0.00000
max          1.00000
Name: has_product_click, dtype: float64
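
The scaling loop above fits on every row, including rows that later land in the test set. A minimal sketch of the usual alternative, on synthetic data, is to fit the scaler on the training rows and reuse those statistics for the test rows. The 80/20 positional split here is only for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Synthetic single-feature frame (invented values)
rng = np.random.default_rng(0)
toy = pd.DataFrame({"feature": rng.normal(5.0, 2.0, size=100)})
train, test = toy.iloc[:80], toy.iloc[80:]

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train[["feature"]])  # learn mean/std on train only
test_scaled = scaler.transform(test[["feature"]])        # reuse the train statistics

print(train_scaled.mean(), train_scaled.std())
```

In the notebook this would mean moving the scaling step after the request-level split in the next cell.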

In [8]:
from sklearn.model_selection import train_test_split

request_ids = features['request_id_anon'].unique()
RANDOM_STATE = 42
TEST_SIZE = 0.2

r_train, r_test = train_test_split(request_ids, test_size=TEST_SIZE, random_state=RANDOM_STATE)
# r_train.shape, r_test.shape
X_train = features[features['request_id_anon'].isin(r_train)].drop(['request_id_anon'], axis=1)
X_test = features[features['request_id_anon'].isin(r_test)].drop(['request_id_anon'], axis=1)

Y_train = X_train['has_product_click'].astype(int)
X_train = X_train.drop('has_product_click', axis=1)

Y_test = X_test['has_product_click'].astype(int)
X_test = X_test.drop('has_product_click', axis=1)
X_train.shape, Y_test.shape

((15998, 40), (4002,))

In [9]:
from sklearn.metrics import precision_score, recall_score, confusion_matrix, f1_score

def metrics(model):
    # Hard 0/1 predictions at the model's default decision threshold
    Y_pred = model.predict(X_test)
    matrix = confusion_matrix(Y_test, Y_pred)
    precision = precision_score(Y_test, Y_pred)
    recall = recall_score(Y_test, Y_pred)
    f1 = f1_score(Y_test, Y_pred)

    print("")
    print(f"F1: {f1}")
    print(f"{matrix}")
    print(f"Precision: {precision}")
    print(f"Recall: {recall}")

    # Tree-based models expose feature_importances_, linear models expose coef_
    if hasattr(model, 'feature_importances_'):
        importances = model.feature_importances_
    elif hasattr(model, 'coef_'):
        importances = model.coef_[0]
    else:
        return  # no importance scores to report for this model type

    # Print the 10 features with the largest scores (signed for linear models)
    ranked = sorted(zip(X_train.columns, importances), key=lambda x: x[1], reverse=True)
    for name, score in ranked[:10]:
        print(f"{name}:\t{round(score, 4)}")

    




In [10]:
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

lr = LogisticRegression(
    max_iter=200,
    random_state = RANDOM_STATE
)
lr.fit(X_train, Y_train)

metrics(lr)

dtc = DecisionTreeClassifier(
    random_state=RANDOM_STATE
)
dtc.fit(X_train, Y_train)
metrics(dtc)

# rfc = RandomForestClassifier(
#     random_state=RANDOM_STATE
# )
# rfc.fit(X_train, Y_train)
# metrics(rfc)


  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))



F1: 0.0
[[3817    0]
 [ 185    0]]
Precision: 0.0
Recall: 0.0
retailerbrand.retailer_brand_num_brand_visits_12w:	0.2367
product.product_num_impressions_4w:	0.2093
product.product_brand_page_impression_to_click_rate:	0.1827
retailerbrand.retailer_brand_num_brand_visits_1w:	0.1581
retailerbrand.retailer_brand_num_brand_orders_12w:	0.1529
retailerbrand.retailer_brand_num_products_added_to_cart_4w:	0.1496
retailerbrand.retailer_last_added_brands_brand_lightfm_cosine_similarity:	0.1362
retailerbrand.retailer_brand_num_brand_impressions_4w:	0.1254
product.product_brand_page_click_to_cart_rate:	0.1163
product.product_num_search_excess_clicks_4w:	0.0926

F1: 0.1483375959079284
[[3640  177]
 [ 156   29]]
Precision: 0.1407766990291262
Recall: 0.15675675675675677
product.product_brand_page_click_to_cart_rate:	0.1252
product.product_num_impressions_4w:	0.1142
product.product_num_search_excess_cart_adds_4w:	0.1051
product.product_num_search_excess_clicks_4w:	0.0978
product.product_brand_page_impre
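
The numbers above are classification metrics on hard 0/1 predictions. With clicks on only about 4% of rows, logistic regression never predicts a click, which says little about how well it would order products within a request. A minimal sketch of a per-request ranking metric (NDCG, computed by hand with NumPy) follows. The `score` column is an invented stand-in for something like `lr.predict_proba(X_test)[:, 1]`, and `mean_ndcg` is a hypothetical helper, not part of the original notebook.

```python
import numpy as np
import pandas as pd

def dcg(relevances):
    """Discounted cumulative gain of relevances listed in ranked order."""
    relevances = np.asarray(relevances, dtype=float)
    discounts = np.log2(np.arange(2, relevances.size + 2))
    return float(np.sum(relevances / discounts))

def mean_ndcg(df, group_col, label_col, score_col):
    """Average NDCG over groups that contain at least one positive label."""
    values = []
    for _, group in df.groupby(group_col):
        ideal = dcg(np.sort(group[label_col].to_numpy())[::-1])
        if ideal == 0.0:
            continue  # NDCG is undefined when a request has no clicks
        ranked = group.sort_values(score_col, ascending=False)[label_col]
        values.append(dcg(ranked.to_numpy()) / ideal)
    return float(np.mean(values)) if values else float("nan")

# Invented scores for two requests
toy = pd.DataFrame({
    "request_id_anon": [1, 1, 1, 2, 2],
    "has_product_click": [0, 1, 0, 1, 0],
    "score": [0.2, 0.9, 0.1, 0.3, 0.6],
})
result = mean_ndcg(toy, "request_id_anon", "has_product_click", "score")
print(result)
```

Requests without any click are skipped here, which is one common convention. Because the sample drops some impressions per request, any such metric on this dataset is only an approximation of the true page-level ranking quality.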

In [11]:

data['filter_string'].dropna(axis=0)


9                            ["category:jewelry|earrings"]
20       ["maker_value:made_in_usa","maker_value:not_so...
44               ["category:home_living|kitchen_tabletop"]
78                                      ["category:women"]
103                   ["wholesale_price:under_10_dollars"]
                               ...                        
19936                 ["wholesale_price:under_10_dollars"]
19958       ["maker_minimum:less_or_equal_to_200_dollars"]
19962                       ["category:jewelry|necklaces"]
19979    ["maker_value:made_in_usa","maker_value:eco_fr...
19999               ["category:paper_novelty|books_games"]
Name: filter_string, Length: 1719, dtype: object
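
The non-null values above look like JSON arrays of `key:value` strings, so one candidate for an additional feature is to parse them. The sketch below assumes every non-null `filter_string` is valid JSON, which the truncated output suggests but does not prove. `parse_filters` and the derived feature names are hypothetical.

```python
import json
import pandas as pd

def parse_filters(raw):
    """Return the list of 'key:value' filter strings, or [] when no filter is set."""
    if not isinstance(raw, str) or not raw:
        return []
    return json.loads(raw)

# Invented examples in the same shape as the output above
toy = pd.Series([
    '["category:jewelry|earrings"]',
    None,
    '["maker_value:made_in_usa","wholesale_price:under_10_dollars"]',
])
filters = toy.apply(parse_filters)

# Two simple derived features: how many filters, and whether a category filter is set
num_filters = filters.apply(len)
has_category_filter = filters.apply(
    lambda fs: float(any(f.startswith("category:") for f in fs))
)
print(num_filters.tolist(), has_category_filter.tolist())
```

Since only 1,719 of the 20,000 rows carry a filter, features like these would be zero for most rows.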