# O'Reilly (Sr.) Machine Learning Engineer Takehome

Welcome to the evaluation project for the (Sr.) Machine Learning Engineer position at O'Reilly Media. In this project you will evaluate a search academic dataset built using common learn-to-rank features, build a ranking model using the dataset, and discuss how additional features could be used and how they would impact the performance of the model.

Overview:
- Make a copy of this notebook
- Fill in the contact info
- Download the dataset to the notebook (link in Step 3.2 comments)
- Preprocess and evaluate the dataset
- Build a **ranking** model
- Evaluate your ranking model using a metric of your choice
- Answer discussion questions
- Submit your notebook


## Notes



*   Throughout the notebook you should include notes explaining your choices and what you are doing. Your thought process is more important than the actual performance of your model.
*   Create as many cells as you want. The exisiting cells are just provided to provide some initial organization.
* You may use any choice of libraries or frameworks.

# Step 1. Make a copy of this Colab notebook and work on your personal copy

# Step 2. Fill in your contact information

Candidate Full Name: Sharhad Bashar

Candidate Email: sharhad.bashar@uwaterloo.ca

# Step 3. Create the Model

### 1) Imports

In [28]:
# Import dependencies here
import pandas as pd

import xgboost as xgb
import lightgbm as lgb

from sklearn.model_selection import GridSearchCV

### 2) Download Dataset

In [None]:
# Download the dataset located at https://storage.googleapis.com/personalization-takehome/MSLR-WEB10K.zip
# You can read about the features included in the dataset here: https://www.microsoft.com/en-us/research/project/mslr/


### 3) Preprocess and evaluate the dataset

In [15]:
def convert_to_numeric(df):
    for col in range(2, df.shape[1]):  # Adjust the range based on your actual data
        df[col] = pd.to_numeric(df[col], errors = 'coerce')
    return df

In [16]:
# Preprocess and evaluate the dataset

# Load the training, validation, and test data
fold = 1
train_data = pd.read_csv(f'MSLR-WEB10K/Fold{fold}/train.txt', sep = ' ', header = None)
validate_data = pd.read_csv(f'MSLR-WEB10K/Fold{fold}/vali.txt', sep = ' ', header = None)
test_data = pd.read_csv(f'MSLR-WEB10K/Fold{fold}/test.txt', sep = ' ', header = None)

# Convert the datasets
train_data = convert_to_numeric(train_data)
validate_data = convert_to_numeric(validate_data)
test_data = convert_to_numeric(test_data)

# Extract features and labels from each dataset
y_train = train_data[0]  # relevance labels for training
qid_train = train_data[1]  # query IDs for training
X_train = train_data.drop([0, 1], axis = 1)  # feature vectors for training

y_validate = validate_data[0]  # relevance labels for validation
qid_validate = validate_data[1]  # query IDs for validation
X_validate = validate_data.drop([0, 1], axis = 1)  # feature vectors for validation

y_test = test_data[0]  # relevance labels for testing
qid_test = test_data[1]  # query IDs for testing
X_test = test_data.drop([0, 1], axis = 1)  # feature vectors for testing


### 4) Build ranking model

In [36]:
# Build ranking model
# Prepare DMatrix for XGBoost
train_matrix = xgb.DMatrix(X_train, label=y_train, group=qid_train.value_counts().sort_index().values)
validate_matrix = xgb.DMatrix(X_validate, label=y_validate, group=qid_validate.value_counts().sort_index().values)

# Set parameters for XGBoost
params = {
    'objective': 'rank:pairwise',
    'eval_metric': 'ndcg',
    'eta': 0.05,
    'max_depth': 10,
    'min_child_weight': 300,
    'subsample': 0.8,
    'colsample_bytree': 0.8,
    'verbosity': 1
}



In [37]:
# Train the model
model = xgb.train(params, train_matrix, num_boost_round=100, evals=[(validate_matrix, 'eval')], early_stopping_rounds=10)

[0]	eval-ndcg:0.31999
[1]	eval-ndcg:0.31999
[2]	eval-ndcg:0.31999
[3]	eval-ndcg:0.31999
[4]	eval-ndcg:0.31999
[5]	eval-ndcg:0.31999
[6]	eval-ndcg:0.31999
[7]	eval-ndcg:0.31999
[8]	eval-ndcg:0.31999
[9]	eval-ndcg:0.31999


### 5) Evaluate model performance

In [38]:
# Evaluate model performance
# Prepare DMatrix for test set
test_matrix = xgb.DMatrix(X_test, label=y_test)

# Predict using the test data
y_pred = model.predict(test_matrix)

# Evaluate using NDCG
from sklearn.metrics import ndcg_score

ndcg = ndcg_score([y_test], [y_pred], k=10)
print(f"NDCG@10: {ndcg:.4f}")

NDCG@10: 0.1696


In [None]:
param_grid = {
    'num_leaves': [31, 63],
    'learning_rate': [0.01, 0.05, 0.1],
    'num_iterations': [100, 200]
}

grid_search = GridSearchCV(estimator=lgb.LGBMRanker(), param_grid=param_grid, cv=gkf, scoring='neg_mean_squared_error')
grid_search.fit(X, y, group=qid)
print("Best parameters found: ", grid_search.best_params_)

# Step 4. Discussion

### 1) Please answer the following questions about your choices:
- Discuss your model and why you chose the model you chose (eg architecture, design, loss functions, etc)
- Why did you choose your metric to evaluate the model?
- How well would you say your model performed?
- If you had more time what else would you want to try?

### 2) Please answer the following questions about how you would use additional features:

- If you had an additional feature for each row of the dataset that was unique identifier for the user performing the query e.g. `user_id`, how could you use it to improve the performance of the model?
- If you had the additional features of: `query_text` or the actual textual query itself, as well as document text features like `title_text`, `body_text`, `anchor_text`, `url` for the document, how would you include them in your model (or any model) to improve its performance?




# Step 5. Please share your colab notebook with: qma@oreilly.com and jtorres@oreilly.com  
The share icon is located in the upper right corner of your Colab notebook.
