# Faire - The Online Wholesale Marketplace & Store

Welcome! Let's build a new search ranking model.

### Description

Knowing in advance whether a product shown in search will be bought could provide huge business value to Faire. This task matters both for purchase prediction and for short-term user engagement prediction.

- In this dataset we have sampled ~20k rows from Faire search logs. The dataset is anonymized. Before describing the dataset, let's give some preliminary knowledge of how Faire Wholesale Marketplace search works. Faire is a two-sided marketplace where retailers come to shop wholesale products from brands. When a retailer makes a search on the site, we call that a search request. The response is a page with many products. We assign a `request_id` to this search request response, and different pages (page number 1, 2, 3...) from the same search have different `request_id`s (i.e. `request_id` is more of a "page id" than a "search session id"). Each row in this dataset represents one single product that was impressed for that `request_id`. For each `request_id` you can have many results (because this is a random sample, some of them may be missing in some cases).

- Each row contains the following fields: `request_id`, `retailer_token` (anonymized user token), `query_text` (the actual search string), `page_number`, `page_size`, `position`, `filter_string` (filters applied on top of the search), `has_product_click` (whether the product was clicked).
- We also have some features from our feature store (computed using data from before the search timestamp): anything starting with `product.` is a product-level feature. There are also a few personalization features in this dataset: anything starting with `retailerbrand.` is a personalization feature and relates to a particular retailer:brand pair.
- Note that we have personalization features only at the level of the retailer (user) and the brand the product belongs to. Brands usually have many products in our marketplace, so the Product <-> Brand mapping is Many:1.
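
To make the row layout concrete, here is a minimal sketch with made-up values. The `product_id` and `brand_id` columns and the `retailerbrand.order_count` feature are hypothetical names chosen for illustration, not columns confirmed to exist in the dataset. It shows that each row is one impressed product on one results page, and that a retailer:brand feature is shared by every product of that brand.

```python
import pandas as pd

# Toy impressions: one row per product shown for a request (page) id.
# All values, plus the `product_id` and `brand_id` columns, are invented.
impressions = pd.DataFrame({
    "request_id": ["r1", "r1", "r1", "r2", "r2"],
    "retailer_token": ["u1", "u1", "u1", "u1", "u1"],
    "page_number": [1, 1, 1, 2, 2],
    "position": [0, 1, 2, 0, 1],
    "product_id": ["p1", "p2", "p3", "p4", "p5"],
    "brand_id": ["b1", "b1", "b2", "b2", "b3"],
    "has_product_click": [0, 1, 0, 0, 0],
})

# Hypothetical personalization feature keyed by the retailer:brand pair.
retailerbrand = pd.DataFrame({
    "retailer_token": ["u1", "u1"],
    "brand_id": ["b1", "b2"],
    "retailerbrand.order_count": [3, 0],
})

# Many products map to one brand, so this is a many-to-one join;
# `validate` raises if the right-hand side has duplicate keys.
rows = impressions.merge(
    retailerbrand,
    on=["retailer_token", "brand_id"],
    how="left",
    validate="many_to_one",
)

# Number of impressed products per request id (i.e. per results page).
rows_per_request = rows.groupby("request_id").size()
print(rows_per_request)
print(rows[["product_id", "brand_id", "retailerbrand.order_count"]])
```

Products `p1` and `p2` share brand `b1` and therefore receive the same personalization value, while `p5` has no retailer:brand history and ends up with a missing value.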

### Tasks
In approximately 1 hour, please do the following:
- First, build a ranking model using the provided dataset and evaluate it.
- If there is time, please implement 1 or 2 additional features, and list up to 10 more features (without implementing them).
- If there is time, please give us an idea of the next steps: how would you improve this model if you had 1 more week, 1 more month, or 3 more months?

Please write as many comments as you can, or talk through your thought process out loud. Once your time allocation is up, please send back the completed notebook in .ipynb and .pdf format.

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
# plt.style.use('seaborn-whitegrid')
import numpy as np
import pandas as pd
import csv

In [None]:
data = pd.read_csv("datasets/faire-ml-rank-small.csv")
data.head(3)

In [None]:
data.info()

In [None]:
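def substring_match(row, column):
    """Return 1.0 if the query appears verbatim (case-sensitive) in row[column], else 0.0."""
    query = row['query_text']
    return 1.0 if query and query in row[column] else 0.0

# Missing text would make the `in` check raise a TypeError, so fill it first
for column in ['query_text', 'title', 'description']:
    data[column] = data[column].fillna("")

data['title_match'] = data.apply(substring_match, axis=1, column='title')
data['description_match'] = data.apply(substring_match, axis=1, column='description')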
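
The exact-substring features above are all-or-nothing, so a query like "tote bag" gets no credit for a title that only contains "tote". One possible additional feature is the fraction of query tokens found in the text. The sketch below is only an illustration on invented rows; the helper `token_overlap` and the column `title_token_overlap` are hypothetical names, and whitespace tokenization is an assumption rather than a tuned choice.

```python
import pandas as pd

def token_overlap(query, text):
    """Fraction of query tokens that also appear in text (case-insensitive)."""
    if pd.isna(query) or pd.isna(text):
        return 0.0
    query_tokens = set(str(query).lower().split())
    if not query_tokens:
        return 0.0
    text_tokens = set(str(text).lower().split())
    return len(query_tokens & text_tokens) / len(query_tokens)

# Invented example rows using the same column names as the notebook.
toy = pd.DataFrame({
    "query_text": ["soy candle", "Tote Bag", "mug"],
    "title": ["Lavender Soy Candle", "canvas tote", "Ceramic Planter"],
})
toy["title_token_overlap"] = [
    token_overlap(q, t) for q, t in zip(toy["query_text"], toy["title"])
]
print(toy)
```

Applied to the real data, the same list comprehension over `data['query_text']` and `data['title']` would produce a graded match score alongside `title_match`.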

In [None]:
# Drop columns that are not needed
features = data.drop(
    ['Unnamed: 0.1', 'Unnamed: 0', 'rand', 'title', 'description', 'created_at_a', 'query_text','filter_string', 'retailer_token_anon', 'product.product_is_high_sell_through'],
    axis=1
)
features.info()

In [None]:
features.describe()

In [None]:
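from sklearn.preprocessing import StandardScaler

# Columns that must not be scaled: the grouping key and the label
ignore_columns = {'request_id_anon', 'has_product_click'}
features = features.fillna(0.0)

# Standardize each remaining column to zero mean and unit variance.
# NOTE: the scaler is fit on all rows, before the train/test split, so
# test-set statistics leak into training. Fitting on the training split only avoids this.
for column in features.columns:
    if column in ignore_columns:
        continue
    scaler = StandardScaler()
    features[column] = scaler.fit_transform(features[[column]])

features['has_product_click'].describe()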

In [None]:
from sklearn.model_selection import train_test_split

request_ids = features['request_id_anon'].unique()
RANDOM_STATE = 42
TEST_SIZE = 0.2

r_train, r_test = train_test_split(request_ids, test_size=TEST_SIZE, random_state=RANDOM_STATE)
# r_train.shape, r_test.shape
X_train = features[features['request_id_anon'].isin(r_train)].drop(['request_id_anon'], axis=1)
X_test = features[features['request_id_anon'].isin(r_test)].drop(['request_id_anon'], axis=1)

Y_train = X_train['has_product_click'].astype(int)
X_train = X_train.drop('has_product_click', axis=1)

Y_test = X_test['has_product_click'].astype(int)
X_test = X_test.drop('has_product_click', axis=1)
X_train.shape, Y_test.shape
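
The scaling cell earlier in this notebook fits `StandardScaler` on every row before the split, which lets test-set statistics influence training. A minimal sketch of the alternative, on synthetic data, is to split by request id first and then fit the scaler on the training rows only. The `price` column and all values below are invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic data: 10 requests with 4 impressed products each.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "request_id_anon": np.repeat(np.arange(10), 4),
    "position": np.tile(np.arange(4), 10).astype(float),
    "price": rng.uniform(5, 50, size=40),
})

# Split on request ids so all rows of one results page land on the same side.
ids = toy["request_id_anon"].unique()
train_ids, test_ids = train_test_split(ids, test_size=0.2, random_state=42)
train = toy[toy["request_id_anon"].isin(train_ids)]
test = toy[toy["request_id_anon"].isin(test_ids)]

# Fit the scaler on training rows only, then reuse it for the test rows.
feature_cols = ["position", "price"]
scaler = StandardScaler().fit(train[feature_cols])
train_scaled = scaler.transform(train[feature_cols])
test_scaled = scaler.transform(test[feature_cols])

print(train_scaled.shape, test_scaled.shape)
```

The training features end up with zero mean and unit variance, while the test features are transformed with the training statistics and so will generally be close to, but not exactly, standardized.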

In [None]:
from sklearn.metrics import precision_score, recall_score, precision_recall_curve, auc, confusion_matrix, f1_score

def metrics(model, top_n=10):
    """Print classification metrics on the test split and the model's top features."""
    Y_pred = model.predict(X_test)
    matrix = confusion_matrix(Y_test, Y_pred)
    precision = precision_score(Y_test, Y_pred)
    recall = recall_score(Y_test, Y_pred)
    f1 = f1_score(Y_test, Y_pred)

    print("")
    print(f"F1: {f1}")
    print(f"{matrix}")
    print(f"Precision: {precision}")
    print(f"Recall: {recall}")

    # Tree-based models expose feature_importances_; linear models expose coef_
    if hasattr(model, 'feature_importances_'):
        importances = model.feature_importances_
    elif hasattr(model, 'coef_'):
        importances = model.coef_[0]
    else:
        return  # nothing to rank for other model types

    # Sorted by signed value, so strongly negative coefficients come last
    ranked = sorted(zip(X_train.columns, importances), key=lambda x: x[1], reverse=True)
    for name, value in ranked[:top_n]:
        print(f"{name}:\t{round(value, 4)}")

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

lr = LogisticRegression(
    max_iter=200,
    random_state=RANDOM_STATE
)
lr.fit(X_train, Y_train)

metrics(lr)

dtc = DecisionTreeClassifier(
    random_state=RANDOM_STATE
)
dtc.fit(X_train, Y_train)
metrics(dtc)

# rfc = RandomForestClassifier(
#     random_state=RANDOM_STATE
# )
# rfc.fit(X_train, Y_train)
# metrics(rfc)
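
Precision, recall, and F1 treat each row independently, whereas the task is to rank products within a results page. One common ranking metric is NDCG computed per `request_id_anon` and averaged. The sketch below is a minimal NumPy version on invented scores; the helpers `dcg` and `mean_ndcg` and the `score` column are hypothetical names, and skipping pages without any click is an assumption about how to handle the undefined case.

```python
import numpy as np
import pandas as pd

def dcg(relevance):
    """Discounted cumulative gain of relevance labels listed in ranked order."""
    relevance = np.asarray(relevance, dtype=float)
    discounts = np.log2(np.arange(2, relevance.size + 2))
    return float(np.sum(relevance / discounts))

def mean_ndcg(frame, group_col, label_col, score_col):
    """Average NDCG over groups that contain at least one positive label."""
    values = []
    for _, group in frame.groupby(group_col):
        ideal = dcg(np.sort(group[label_col].to_numpy())[::-1])
        if ideal == 0:
            continue  # no clicks on this page: NDCG is undefined, so skip it
        ranked = group.sort_values(score_col, ascending=False)
        values.append(dcg(ranked[label_col].to_numpy()) / ideal)
    return float(np.mean(values)) if values else float("nan")

# Invented scores: r1 ranks its click first, r2 ranks it second, r3 has no click.
toy = pd.DataFrame({
    "request_id_anon": ["r1", "r1", "r1", "r2", "r2", "r3", "r3"],
    "has_product_click": [0, 1, 0, 1, 0, 0, 0],
    "score": [0.2, 0.9, 0.1, 0.3, 0.6, 0.5, 0.4],
})
print(mean_ndcg(toy, "request_id_anon", "has_product_click", "score"))
```

With the notebook's objects, the scores could come from something like `lr.predict_proba(X_test)[:, 1]`. Because `request_id_anon` is dropped from `X_test`, the request ids would need to be recovered first, for example via `features.loc[X_test.index, 'request_id_anon']`.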


In [None]:

data['filter_string'].dropna(axis=0)
