In [None]:
!pip install lightgbm


In [1]:
from google.colab import drive
drive.mount('/content/drive')


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## Question 14

In [34]:
import lightgbm as lgb
from sklearn.datasets import load_svmlight_file
from sklearn.metrics import ndcg_score
import numpy as np
import pandas as pd

from lightgbm import LGBMRanker

# Load the dataset for one fold
def load_one_fole(data_path):
    X_train, y_train, qid_train = load_svmlight_file(str(data_path + 'train.txt'), query_id=True)
    X_val, y_val, qid_val = load_svmlight_file(str(data_path + 'vali.txt'), query_id=True)

    y_train = y_train.astype(int)
    y_val = y_val.astype(int)
    _, group_train = np.unique(qid_train, return_counts=True)
    _, group_val = np.unique(qid_val, return_counts=True)
    return X_train, y_train, qid_train, group_train, X_val, y_val, qid_val, group_val

def ndcg_single_query(y_score, y_true, k):
    order = np.argsort(y_score)[::-1]
    y_true = np.take(y_true, order[:k])

    gain = 2 ** y_true - 1

    discounts = np.log2(np.arange(len(y_true)) + 2)
    return np.sum(gain / discounts)

# calculate NDCG score given a trained model
def compute_ndcg_all(model, X_test, y_test, qids_test, k=10):
    unique_qids = np.unique(qids_test)
    ndcg_ = list()
    for i, qid in enumerate(unique_qids):
        y = y_test[qids_test == qid]

        if np.sum(y) == 0:
            continue

        p = model.predict(X_test[qids_test == qid])

        idcg = ndcg_single_query(y, y, k=k)
        ndcg_.append(ndcg_single_query(p, y, k=k) / idcg)
    return np.mean(ndcg_)

# get importance of features
def get_feature_importance(model, importance_type='gain'):
  importance_df = (
    pd.DataFrame({
        'feature_name': model.booster_.feature_name(),
        'importance_gain': model.booster_.feature_importance(importance_type='gain'),
    })
    .sort_values('importance_gain', ascending=False)
    .reset_index(drop=True)
)

  return importance_df

ndcg3 = []
ndcg5 = []
ndcg10 = []
top5s = []
feature_importance = []
# for i in range(1,2):
for i in range(1,6):
  path = 'drive/MyDrive/MSLR-WEB10K/Fold' + str(i) + '/'
  X_train, y_train, qid_train, group_train, X_val, y_val, qid_val, group_val = load_one_fole(path)

  gbm = LGBMRanker(objective= 'lambdarank')
  model = gbm.fit(X_train, y_train, group=group_train)


  ndcg3.append(compute_ndcg_all(model, X_val, y_val, qid_val, k=3))
  ndcg5.append(compute_ndcg_all(model, X_val, y_val, qid_val, k=5))
  ndcg10.append(compute_ndcg_all(model, X_val, y_val, qid_val, k=10))
  top5s.append(get_feature_importance(model))
  feature_importance.append(model.booster_.feature_importance(importance_type='gain'))


[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.680866 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 25637
[LightGBM] [Info] Number of data points in the train set: 723412, number of used features: 136
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.757615 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 25623
[LightGBM] [Info] Number of data points in the train set: 716683, number of used features: 136
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.433278 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 25659
[LightGBM] [Info] Number of

In [11]:
print ('model\'s performance (nDCG3) on test set for different folds:\n', ndcg3)
print ('model\'s performance (nDCG5) on test set for different folds:\n', ndcg5)
print ('model\'s performance (nDCG10) on test set for different folds:\n', ndcg10)


model's performance (nDCG3) on test set for different folds:
 [0.46533315500113603, 0.4513238963933018, 0.4611828580288599, 0.44747518797567715, 0.46197214361690514]
model's performance (nDCG5) on test set for different folds:
 [0.47013914781237626, 0.4601948976846903, 0.4643478855231082, 0.4562056520782107, 0.4704871221202469]
model's performance (nDCG10) on test set for different folds:
 [0.49023661809449154, 0.4819255308954761, 0.4814498486560052, 0.4762678196678019, 0.48896202334511024]


### QUESTION 15:
- Result Analysis and Interpretation:

In [26]:
for i, top in enumerate(top5s):
  print('top 5 most important features when trained on fold {}:\n'.format(i), top[:5], '\n')

top 5 most important features when trained on fold 0:
   feature_name  importance_gain
0   Column_133     23856.702951
1     Column_7      4248.546391
2   Column_107      4135.244450
3    Column_54      4078.463216
4   Column_129      3635.037024 

top 5 most important features when trained on fold 1:
   feature_name  importance_gain
0   Column_133     23578.908250
1     Column_7      5157.964912
2    Column_54      4386.669757
3   Column_107      4094.012172
4   Column_129      4035.070673 

top 5 most important features when trained on fold 2:
   feature_name  importance_gain
0   Column_133     23218.075441
1    Column_54      4991.303372
2   Column_107      4226.807395
3   Column_129      4059.752514
4     Column_7      3691.792320 

top 5 most important features when trained on fold 3:
   feature_name  importance_gain
0   Column_133     23796.899673
1     Column_7      4622.622978
2    Column_54      3883.481706
3   Column_129      3356.846980
4   Column_128      3207.575537 

top 

Column 133, 7, 54, and 129 are among the top 5 for all folds. Column 107 is in all folds except the fourth one. Instead there exist column 128 among the top 5 for the fourth fold. There might be some correlation between column 107 and 128.

### QUESTION 16:
- Experiments with Subset of Features:

• Remove the top 20 most important features

In [36]:
ndcg3_top20_removed = []
ndcg5_top20_removed = []
ndcg10_top20_removed = []
# for i in range(1,2):
for i in range(1,6):
  path = 'drive/MyDrive/MSLR-WEB10K/Fold' + str(i) + '/'
  X_train, y_train, qid_train, group_train, X_val, y_val, qid_val, group_val = load_one_fole(path)


  aa = feature_importance[i-1]
  # indices of top important features
  top_imp_feat_ind = np.argsort(aa)[::-1]
  # remove top 20 most important features
  X_train = X_train[:, b[20:]]
  X_val = X_val[:, b[20:]]

  gbm = LGBMRanker(objective= 'lambdarank')
  model = gbm.fit(X_train, y_train, group=group_train)

  ndcg3_top20_removed.append(compute_ndcg_all(model, X_val, y_val, qid_val, k=3))
  ndcg5_top20_removed.append(compute_ndcg_all(model, X_val, y_val, qid_val, k=5))
  ndcg10_top20_removed.append(compute_ndcg_all(model, X_val, y_val, qid_val, k=10))


[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.419967 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 21482
[LightGBM] [Info] Number of data points in the train set: 723412, number of used features: 116
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.711728 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 21457
[LightGBM] [Info] Number of data points in the train set: 716683, number of used features: 116
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.385986 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 21500
[LightGBM] [Info] Number of

In [37]:
print ('model\'s performance (nDCG3) on test set for different folds:\n', ndcg3_top20_removed)
print ('model\'s performance (nDCG5) on test set for different folds:\n', ndcg5_top20_removed)
print ('model\'s performance (nDCG10) on test set for different folds:\n', ndcg10_top20_removed)


model's performance (nDCG3) on test set for different folds:
 [0.38497943398827433, 0.3816705515016462, 0.37646908990938893, 0.37770426538754925, 0.3851757652588439]
model's performance (nDCG5) on test set for different folds:
 [0.3932316982095952, 0.38638787321379076, 0.3821637904664613, 0.38377045754959466, 0.39195510138328515]
model's performance (nDCG10) on test set for different folds:
 [0.41678784963415666, 0.40744958230498046, 0.4042523025092867, 0.4072240303443009, 0.4122777004349055]


As shown above, the average nDCG has decreased by 18%. The results align with the expectation that the performance of the model decay when trained on the features from which top 20 most important ones are removed. However, it is expected that the model performance to deteriorate much more than this amount. The reason it has not decreased significantly is that there might be some retained features which are in correlation with the removed features. Therefore, those important features are still playing role in practice.

- Remove the 60 least important features

In [38]:
ndcg3_bot60_removed = []
ndcg5_bot60_removed = []
ndcg10_bot60_removed = []
# for i in range(1,2):
for i in range(1,6):
  path = 'drive/MyDrive/MSLR-WEB10K/Fold' + str(i) + '/'
  X_train, y_train, qid_train, group_train, X_val, y_val, qid_val, group_val = load_one_fole(path)


  aa = feature_importance[i-1]
  # indices of least important features
  top_imp_feat_ind = np.argsort(aa)
  # remove least 60 most important features
  X_train = X_train[:, b[60:]]
  X_val = X_val[:, b[60:]]

  gbm = LGBMRanker(objective= 'lambdarank')
  model = gbm.fit(X_train, y_train, group=group_train)

  ndcg3_bot60_removed.append(compute_ndcg_all(model, X_val, y_val, qid_val, k=3))
  ndcg5_bot60_removed.append(compute_ndcg_all(model, X_val, y_val, qid_val, k=5))
  ndcg10_bot60_removed.append(compute_ndcg_all(model, X_val, y_val, qid_val, k=10))


[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.318756 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 13083
[LightGBM] [Info] Number of data points in the train set: 723412, number of used features: 76
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.341413 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 13054
[LightGBM] [Info] Number of data points in the train set: 716683, number of used features: 76
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.256982 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 13092
[LightGBM] [Info] Number of d

In [39]:
print ('model\'s performance (nDCG3) on test set for different folds:\n', ndcg3_bot60_removed)
print ('model\'s performance (nDCG5) on test set for different folds:\n', ndcg5_bot60_removed)
print ('model\'s performance (nDCG10) on test set for different folds:\n', ndcg10_bot60_removed)


model's performance (nDCG3) on test set for different folds:
 [0.3391632995764517, 0.33524111860888645, 0.33003747324869515, 0.3316743968622034, 0.3299411772323068]
model's performance (nDCG5) on test set for different folds:
 [0.35073224178104045, 0.34652388764153436, 0.34002518996590064, 0.3436793472736115, 0.3419468081292484]
model's performance (nDCG10) on test set for different folds:
 [0.3837353393543075, 0.37203082030646895, 0.36646089619945754, 0.37086903541899086, 0.3707322538416216]


The expectation is that removing irrelevant or redundant features can help the model focus on more informative features, potentially leading to improved performance. By retaining only the most important features, the model might achieve better generalization on unseen data. However, this has not happened here as shown in the above results.

The explanation might lie in the fact that removing features, even those considered less important, can result in a loss of information. There's a possibility that some of the supposedly less important features might still carry valuable information for the model, especially in complex datasets.