# Faire - The Online Wholesale Marketplace & Store

Welcome! Let's build a new search ranking model.

### Description

Knowing in advance whether a product shown in search will be bought could provide huge business value to Faire. This task matters for purchase prediction as well as for short-term user engagement prediction.

- In this dataset we have sampled ~20k rows from Faire search logs. The dataset is anonymized. Before describing the dataset, let's cover some background on how Faire Wholesale Marketplace Search works. Faire is a two-sided marketplace where retailers come to shop wholesale products from brands. When a retailer makes a search on the site, we call that a search request. The response is a page with many products. We assign a `request_id` to this search request response, and different pages (page number 1, 2, 3...) from the same search have different `request_id`s (i.e. `request_id` is more of a "page id" than a "search session id"). Each row in this dataset represents one single product that was impressed for that `request_id`. Each `request_id` can have many results (because this is a random sample, some of them may be missing).

- Each row contains the following fields: `request_id`, `retailer_token` (anonymized user token), `query_text` (the actual search string), `page_number`, `page_size`, `position`, `filter_string` (filters applied on top of the search), `has_product_click` (was it clicked or not).
- We have some features from our feature store (computed using data from before the search timestamp): anything starting with `product.` is a product-level feature. We also have a few personalization features in this dataset: anything starting with `retailerbrand.` is a personalization feature and relates to a particular retailer:brand pair.
- Note that we have personalization features only at the level of the retailer (user) and the brand the product belongs to. Brands usually have many products in our marketplace, so the Product <-> Brand mapping is Many:1.
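
To make the row layout concrete, here is a minimal sketch with a toy frame. All values are invented, and the column names follow the field list above; the CSV itself may use different names (the code further down refers to `request_id_anon`, for example). It only illustrates that one row is one impressed product, so grouping by `request_id` recovers each result page.

```python
import pandas as pd

# Toy rows in the shape described above; every value here is made up for illustration.
toy = pd.DataFrame(
    {
        "request_id": ["r1", "r1", "r1", "r2", "r2"],
        "query_text": ["candles", "candles", "candles", "tote bag", "tote bag"],
        "page_number": [1, 1, 1, 2, 2],
        "position": [0, 1, 2, 0, 1],
        "has_product_click": [0, 1, 0, 0, 0],
    }
)

# One row per impressed product, so grouping by request_id recovers each result page.
per_request = toy.groupby("request_id").agg(
    impressions=("position", "size"),
    clicks=("has_product_click", "sum"),
)
print(per_request)
```

Because a request is the unit being ranked, any train/test split and any ranking metric should treat all rows of one `request_id` together.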

### Tasks
In approximately 1 hour, please do the following:
- First, build a ranking model using the provided dataset and evaluate it.
- If there is time, please implement 1 or 2 additional features, and list up to 10 more features (without implementing them).
- If there is time, please give us an idea of the next steps: how would you improve this model if you had 1 more week, 1 more month, or 3 more months?

Please write as many comments as you can, or talk through your thought process out loud. Once your time is up, please send back the completed notebook in .ipynb and .pdf format.
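
Since the first task asks for a ranking model to be evaluated, one option is a per-request ranking metric such as NDCG. The sketch below is one possible way to compute it, not a prescribed metric. It assumes the click label is used as binary relevance and that requests without any click are skipped; `dcg` and `mean_ndcg` are hypothetical helper names.

```python
import math
from collections import defaultdict


def dcg(relevances):
    # Discounted cumulative gain: the item at 0-based rank i is discounted by log2(i + 2).
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))


def mean_ndcg(request_ids, clicks, scores):
    # Group (score, click) pairs by request so each result page is ranked on its own.
    groups = defaultdict(list)
    for rid, click, score in zip(request_ids, clicks, scores):
        groups[rid].append((score, click))

    values = []
    for pairs in groups.values():
        ideal = dcg(sorted((c for _, c in pairs), reverse=True))
        if ideal == 0:
            continue  # assumption: requests with no click carry no ranking signal
        ranked = [c for _, c in sorted(pairs, key=lambda p: p[0], reverse=True)]
        values.append(dcg(ranked) / ideal)
    return sum(values) / len(values) if values else float("nan")


# Toy example: r1 ranks its clicked item first, r2 ranks it second.
example = mean_ndcg(
    ["r1", "r1", "r1", "r2", "r2"],
    [0, 1, 0, 1, 0],
    [0.2, 0.9, 0.1, 0.3, 0.6],
)
print(example)
```

With a fitted classifier, the `scores` argument could be the positive-class probabilities on the test rows, provided the request ids of those rows are kept alongside the features.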

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
# plt.style.use('seaborn-whitegrid')
import numpy as np
import pandas as pd
import csv

In [None]:
data = pd.read_csv("datasets/faire-ml-rank-small.csv")
data

In [None]:
# data.head(10)

In [None]:
data.info()

In [None]:
import seaborn as sns

# Pairwise correlation between the numeric columns
corr_features = data.select_dtypes(include=[np.number])
corr = corr_features.corr()

plt.figure(figsize=[10,8])
sns.heatmap(corr)
plt.title('Correlation Matrix')
plt.show()

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
RANDOM_STATE = 7

def prepare_data(features, scale=False):
    """Split by request and return X_train, Y_train, X_test, Y_test."""
    # Keep numeric columns only and drop index artifacts, the user token, and the random column.
    numeric_features = features.select_dtypes(include=[np.number])
    numeric_features = numeric_features.drop(["Unnamed: 0.1", "Unnamed: 0", "retailer_token_anon", "rand"], axis=1)

    numeric_features = numeric_features.fillna(0)

    # Split on unique request ids so that all rows of a request land on the same side.
    request_ids = numeric_features['request_id_anon'].unique()
    train_rid, test_rid = train_test_split(request_ids, test_size=0.2, random_state=RANDOM_STATE)

    train = numeric_features[numeric_features['request_id_anon'].isin(train_rid)]
    test = numeric_features[numeric_features['request_id_anon'].isin(test_rid)]

    # Separate the target before scaling; standardizing the label would corrupt it.
    Y_train = train['has_product_click'].astype(int)
    Y_test = test['has_product_click'].astype(int)
    X_train = train.drop(['request_id_anon', 'has_product_click'], axis=1)
    X_test = test.drop(['request_id_anon', 'has_product_click'], axis=1)

    if scale:
        # Fit the scaler on the training features only, then apply it to the test features.
        scaler = StandardScaler()
        X_train = pd.DataFrame(scaler.fit_transform(X_train), columns=X_train.columns, index=X_train.index)
        X_test = pd.DataFrame(scaler.transform(X_test), columns=X_test.columns, index=X_test.index)

    return X_train, Y_train, X_test, Y_test

In [None]:
from sklearn.metrics import (accuracy_score, auc, confusion_matrix, f1_score,
                             precision_recall_curve, precision_score, recall_score)

def metrics(Y_test, Y_pred, Y_proba):
    # Column 1 of predict_proba holds the probability of the positive (clicked) class.
    class_probabilities = Y_proba[:, 1]
    print(f"Y_pred: {Y_pred[:3]}")
    print(f"Y_proba (positive class): {class_probabilities[:3]}")

    print(f"Confusion Matrix: \n{confusion_matrix(Y_test, Y_pred)}")
    print(f"Accuracy: {accuracy_score(Y_test, Y_pred)}")
    print(f"Precision: {precision_score(Y_test, Y_pred)}")
    print(f"Recall: {recall_score(Y_test, Y_pred)}")
    print(f"F1 Score: {f1_score(Y_test, Y_pred)}")

    # Area under the precision-recall curve, computed from the positive-class probabilities
    precision, recall, _ = precision_recall_curve(Y_test, class_probabilities)
    print(f"PR AUC: {auc(recall, precision)}")




In [None]:
from sklearn.linear_model import LogisticRegression, SGDClassifier

def lr_model(current_data, scale=False):
    X_train, Y_train, X_test, Y_test = prepare_data(current_data, scale)
    lr = LogisticRegression(random_state=RANDOM_STATE)
    lr.fit(X_train, Y_train)
    Y_pred = lr.predict(X_test)
    print(f"Y_pred shape: {Y_pred.shape}")
    Y_proba = lr.predict_proba(X_test)
    metrics(Y_test, Y_pred, Y_proba)

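    # Print the 10 largest positive coefficients; their magnitudes are only comparable
    # across features when the inputs were standardized (scale=True).
    sorted_indices = lr.coef_[0].argsort()
    for i in range(1, 11):
        print(f"Feature: {X_train.columns[sorted_indices[-i]]}, Coefficient: {lr.coef_[0][sorted_indices[-i]]}")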


lr_model(data)
lr_model(data, True)

In [None]:
from sklearn.tree import DecisionTreeClassifier

def dt_model(current_data, scale=False):
    X_train, Y_train, X_test, Y_test = prepare_data(current_data, scale)
    dtc = DecisionTreeClassifier(random_state=RANDOM_STATE, max_depth=10, min_samples_leaf=10, criterion='entropy')
    dtc.fit(X_train, Y_train)
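    Y_pred = dtc.predict(X_test)
    # metrics() also needs the class probabilities for the PR AUC
    Y_proba = dtc.predict_proba(X_test)

    metrics(Y_test, Y_pred, Y_proba)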

    sorted_indices = dtc.feature_importances_.argsort()

    # print the top 10 features and their importance score
    for i in range(1, 11):
        print(f"Feature: {X_train.columns[sorted_indices[-i]]}, Importance: {dtc.feature_importances_[sorted_indices[-i]]}")


# prepare_data() selects the numeric columns itself, so pass the full frame
dt_model(data)
dt_model(data, scale=True)

In [None]:

# Fill missing text first so that the substring checks below never see NaN.
for col in ['query_text', 'title', 'description']:
    data[col] = data[col].fillna("").str.lower()
# print(data[['query_text', 'title', 'description']].head(10))

# check if the query_text is present in the title column (an empty query never matches)
data['query_in_title'] = data.apply(
    lambda x: int(bool(x['query_text']) and x['query_text'] in x['title']), axis=1)

# check if the query_text is present in the description column
data['query_text_in_description'] = data.apply(
    lambda x: int(bool(x['query_text']) and x['query_text'] in x['description']), axis=1)

# data['query_text_in_description'].value_counts()
selected_data = data.select_dtypes(include=[np.number, bool])

dt_model(selected_data)
dt_model(selected_data, scale=True)
