# LightGBM: A Highly Efficient Gradient Boosting Decision Tree
This notebook shows how to train a LightGBM model to predict ratings on the MovieLens dataset.

*NOTE: This notebook is based on code from [Recommenders library](https://github.com/recommenders-team/recommenders), under MIT license.*

[LightGBM](https://github.com/Microsoft/LightGBM) is a gradient boosting framework that uses tree-based learning algorithms. It is designed to be distributed and efficient with the following advantages:
* Fast training speed and high efficiency.
* Low memory usage.
* Great accuracy.
* Support of parallel and GPU learning.
* Capable of handling large-scale data.

## 0 Global Settings and Imports

In [1]:
import pandas as pd
import lightgbm as lgb
from sklearn.preprocessing import MultiLabelBinarizer

from recommenders.utils.timer import Timer
from recommenders.datasets import movielens
from recommenders.datasets.python_splitters import python_stratified_split
from recommenders.evaluation.python_evaluation import (
    rmse,
    mae,
    rsquared,
    exp_var
)


### 0.1 Parameter Setting
Let's set the main parameters for LightGBM now. The task is a regression, and we use the mean absolute error (`MAE`) as the metric to evaluate the model.

Generally, the basic parameters to adjust are the number of leaves (`MAX_LEAF`), maximum number of trees (`NUM_OF_TREES`), and the learning rate (`LEARNING_RATE`).

Other parameters can also be adjusted to improve the results. The [LightGBM parameter reference](https://github.com/Microsoft/LightGBM/blob/master/docs/Parameters.rst) lists all of them, and the [parameter tuning guide](https://github.com/Microsoft/LightGBM/blob/master/docs/Parameters-Tuning.rst) gives advice on how to tune them.
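
As a rough aid for choosing the number of leaves: a binary tree of depth `d` has at most `2**d` leaves, so a given leaf budget implies a minimum depth. The sketch below only does that arithmetic for the `MAX_LEAF` value used in this notebook. LightGBM grows trees leaf-wise, so real trees can be deeper and less balanced than this bound suggests.

```python
import math

MAX_LEAF = 64  # same value as in the settings cell of this notebook

# A binary tree of depth d has at most 2**d leaves, so this is the
# shallowest depth at which a single tree could hold MAX_LEAF leaves.
min_depth = math.ceil(math.log2(MAX_LEAF))
print(min_depth)  # 6
```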

In [2]:
# Top k items to recommend
TOP_K = 10

# Select MovieLens data size: 100k, 1m, 10m, or 20m
MOVIELENS_DATA_SIZE = "100k"

# Other data settings
USER_COL = "userID"
ITEM_COL = "itemID"
RATING_COL = "rating"
PREDICTION_COL = "prediction"
ITEM_FEAT_COL = "genre"

# Train test split ratio
SPLIT_RATIO = 0.75

# Model settings
MAX_LEAF = 64
NUM_OF_TREES = 100
LEARNING_RATE = 0.05
METRIC = "mae"

SEED = 42

In [3]:
params = {
    "objective": "regression",
    "boosting_type": "gbdt",
    "metric": METRIC,
    "num_leaves": MAX_LEAF,
    "n_estimators": NUM_OF_TREES,
    "boost_from_average": True,
    "n_jobs": -1,
    "learning_rate": LEARNING_RATE,
}

## 1 Data Preparation


In [4]:
# The genres of each movie are returned as '|' separated string, e.g. "Animation|Children's|Comedy".
data = movielens.load_pandas_df(
    size=MOVIELENS_DATA_SIZE,
    header=[USER_COL, ITEM_COL, RATING_COL],
    genres_col=ITEM_FEAT_COL
)
data.head()

100%|██████████| 4.81k/4.81k [00:16<00:00, 284KB/s]


| | userID | itemID | rating | genre |
|---|---|---|---|---|
| 0 | 196 | 242 | 3.0 | Comedy |
| 1 | 63 | 242 | 3.0 | Comedy |
| 2 | 226 | 242 | 5.0 | Comedy |
| 3 | 154 | 242 | 3.0 | Comedy |
| 4 | 306 | 242 | 5.0 | Comedy |


### 1.1 Encode Item Features (Genres)
To use genres in our model, we multi-hot-encode them with scikit-learn's [MultiLabelBinarizer](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MultiLabelBinarizer.html).

For example, *Movie id=2355* has three genres, *Animation|Children's|Comedy*, which are converted into an integer array holding one indicator value per genre, like `[0, 0, 1, 1, 1, 0, 0, 0, ...]`. In a later step, we expand this array into one column per genre and feed those columns into the model.
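
To make the encoding concrete, here is a minimal pure-Python sketch of the same idea. It is not how `MultiLabelBinarizer` is implemented; the `multi_hot` helper is a hypothetical name, and the genre list is copied from the encoder output shown in this notebook.

```python
# Genre vocabulary, in the sorted order reported by the encoder in this notebook.
GENRES = [
    "Action", "Adventure", "Animation", "Children's", "Comedy", "Crime",
    "Documentary", "Drama", "Fantasy", "Film-Noir", "Horror", "Musical",
    "Mystery", "Romance", "Sci-Fi", "Thriller", "War", "Western", "unknown",
]


def multi_hot(genre_string, classes=GENRES):
    """Return a 0/1 indicator list with one slot per known genre."""
    present = set(genre_string.split("|"))
    return [1 if genre in present else 0 for genre in classes]


encoded = multi_hot("Animation|Children's|Comedy")
print(encoded[:8])  # [0, 0, 1, 1, 1, 0, 0, 0]
```

Each position in the list corresponds to one genre, so a movie with three genres has exactly three ones.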

> For faster feature encoding, you may load ratings and items separately (using `movielens.load_item_df`), encode the item features once per item, and then join the rating and item dataframes.
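
The encode-once-then-join idea from the note above can be sketched without any library. The item table and the last rating row are hypothetical toy data, and a dictionary lookup stands in for the dataframe join; with pandas, the same step would be a merge on the item ID column.

```python
# Hypothetical toy tables standing in for the item and rating dataframes.
item_genres = {242: "Comedy", 2355: "Animation|Children's|Comedy"}
ratings = [(196, 242, 3.0), (63, 242, 3.0), (226, 242, 5.0), (7, 2355, 4.0)]

# Build the genre vocabulary and encode each item exactly once.
classes = sorted({g for s in item_genres.values() for g in s.split("|")})
encoded_items = {
    item_id: [int(g in s.split("|")) for g in classes]
    for item_id, s in item_genres.items()
}

# "Join": attach the precomputed vector to every rating row by item ID.
joined = [
    (user, item, rating, encoded_items[item])
    for user, item, rating in ratings
]
print(joined[0])
```

The saving comes from encoding two items instead of four rating rows; on the real dataset the gap between the number of items and the number of ratings is much larger.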

In [5]:
genres_encoder = MultiLabelBinarizer()
data[ITEM_FEAT_COL] = genres_encoder.fit_transform(
    data[ITEM_FEAT_COL].apply(lambda s: s.split("|"))
).tolist()
print("Genres:", genres_encoder.classes_)
data.head()

Genres: ['Action' 'Adventure' 'Animation' "Children's" 'Comedy' 'Crime'
 'Documentary' 'Drama' 'Fantasy' 'Film-Noir' 'Horror' 'Musical' 'Mystery'
 'Romance' 'Sci-Fi' 'Thriller' 'War' 'Western' 'unknown']


| | userID | itemID | rating | genre |
|---|---|---|---|---|
| 0 | 196 | 242 | 3.0 | [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... |
| 1 | 63 | 242 | 3.0 | [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... |
| 2 | 226 | 242 | 5.0 | [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... |
| 3 | 154 | 242 | 3.0 | [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... |
| 4 | 306 | 242 | 5.0 | [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... |


In [6]:
# Expand the 'genre' list into separate columns
number_of_genres = len(genres_encoder.classes_)
expanded_genre = pd.DataFrame(
    data[ITEM_FEAT_COL].tolist(),
    columns=[f"{ITEM_FEAT_COL}_{i+1}" for i in range(number_of_genres)],
)

# Concatenate the expanded genre columns with the original DataFrame
data = pd.concat([data, expanded_genre], axis=1)

# Drop the original 'genre' column
data.drop(ITEM_FEAT_COL, axis=1, inplace=True)
data.head()

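| | userID | itemID | rating | genre_1 | genre_2 | genre_3 | genre_4 | genre_5 | genre_6 | genre_7 | ... | genre_10 | genre_11 | genre_12 | genre_13 | genre_14 | genre_15 | genre_16 | genre_17 | genre_18 | genre_19 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 196 | 242 | 3.0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 63 | 242 | 3.0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 226 | 242 | 5.0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 154 | 242 | 3.0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 306 | 242 | 5.0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |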


### 1.2 Split the Data Using the Stratified Splitter

We split the full dataset into `train` and `test` sets to evaluate the model against held-out data not seen during training. Because the user ID is one of the model's input features, every user in the test set should also appear in the training set. For this, we use the provided `python_stratified_split` function, which holds out a percentage (in this case 25%) of the items from each user while ensuring that all users are in both the `train` and `test` sets. Other splitters in the `recommenders.datasets.python_splitters` module provide more control over how the split occurs.
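
The per-user hold-out described above can be sketched in plain Python. This is an illustration under simplifying assumptions, not the library's implementation: `stratified_split_by_user` is a hypothetical helper, the toy ratings are made up, and users with a single rating are kept entirely in the training set.

```python
import random
from collections import defaultdict


def stratified_split_by_user(rows, ratio=0.75, seed=42):
    """Split (user, item, rating) rows so each user keeps about `ratio` of them in train."""
    by_user = defaultdict(list)
    for row in rows:
        by_user[row[0]].append(row)

    rng = random.Random(seed)
    train, test = [], []
    for user_rows in by_user.values():
        shuffled = user_rows[:]
        rng.shuffle(shuffled)
        cut = round(len(shuffled) * ratio)
        if len(shuffled) > 1:
            # Keep at least one row on each side for users with 2+ ratings.
            cut = max(1, min(cut, len(shuffled) - 1))
        else:
            cut = len(shuffled)
        train.extend(shuffled[:cut])
        test.extend(shuffled[cut:])
    return train, test


# Hypothetical toy data: 3 users with 8 ratings each.
rows = [(u, i, float(1 + (u + i) % 5)) for u in range(3) for i in range(8)]
train, test = stratified_split_by_user(rows)
print(len(train), len(test))  # 18 6
```

Splitting inside each user's group, rather than over the whole table, is what guarantees that every user with at least two ratings shows up on both sides.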

In [7]:
train, test = python_stratified_split(data, ratio=SPLIT_RATIO, col_user=USER_COL, col_item=ITEM_COL, seed=SEED)

In [8]:
print("""
Train:
Total Ratings: {train_total}
Unique Users: {train_users}
Unique Items: {train_items}

Test:
Total Ratings: {test_total}
Unique Users: {test_users}
Unique Items: {test_items}
""".format(
    train_total=len(train),
    train_users=len(train[USER_COL].unique()),
    train_items=len(train[ITEM_COL].unique()),
    test_total=len(test),
    test_users=len(test[USER_COL].unique()),
    test_items=len(test[ITEM_COL].unique()),
))


Train:
Total Ratings: 74992
Unique Users: 943
Unique Items: 1646

Test:
Total Ratings: 25008
Unique Users: 943
Unique Items: 1451



## 2 Train the LightGBM Model


In [9]:
lgb_regressor = lgb.LGBMRegressor(**params)


In [10]:
with Timer() as train_time:
    lgb_regressor.fit(
        X=train[train.columns.difference([RATING_COL])].values, 
        y=train[RATING_COL].values,
    )

print(f"Took {train_time.interval} seconds for training.")

Took 113.83255609998014 seconds for training.


## 3 Evaluate the Model

In [11]:
# Predict ratings for the test set, passing a NumPy array as in training
with Timer() as test_time:
    y_pred = lgb_regressor.predict(test[test.columns.difference([RATING_COL])].values)

print(f"Took {test_time.interval} seconds for prediction.")

Took 0.3662549000000581 seconds for prediction.


In [12]:
pred = test[[USER_COL, ITEM_COL, RATING_COL]].copy()
pred[PREDICTION_COL] = y_pred
pred.head()

| | userID | itemID | rating | prediction |
|---|---|---|---|---|
| 26975 | 1 | 48 | 5.0 | 3.905406 |
| 87870 | 1 | 149 | 2.0 | 3.861776 |
| 83701 | 1 | 103 | 1.0 | 2.700646 |
| 60240 | 1 | 49 | 3.0 | 3.546311 |
| 5678 | 1 | 194 | 4.0 | 3.947219 |


In [13]:
# Rating metrics
eval_rmse = rmse(test, pred, col_user=USER_COL, col_item=ITEM_COL, col_rating=RATING_COL, col_prediction=PREDICTION_COL)
eval_mae = mae(test, pred, col_user=USER_COL, col_item=ITEM_COL, col_rating=RATING_COL, col_prediction=PREDICTION_COL)
eval_rsquared = rsquared(test, pred, col_user=USER_COL, col_item=ITEM_COL, col_rating=RATING_COL, col_prediction=PREDICTION_COL)
eval_exp_var = exp_var(test, pred, col_user=USER_COL, col_item=ITEM_COL, col_rating=RATING_COL, col_prediction=PREDICTION_COL)


In [14]:
print("Model:\t\tLightGBM",
      "RMSE:\t\t%f" % eval_rmse,
      "MAE:\t\t%f" % eval_mae,
      "R2:\t\t%f" % eval_rsquared,
      "Exp var:\t%f" % eval_exp_var,
      sep='\n')

Model:		LightGBM
RMSE:		1.001431
MAE:		0.804058
R2:		0.206897
Exp var:	0.206919
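
For readers who want to see what these four numbers measure, the sketch below computes them from their standard textbook definitions, using only the five rating and prediction pairs shown in the `pred.head()` output above. It is an illustration, not the library's code, and the values it prints describe those five rows only, not the full test set.

```python
import math

# First five (rating, prediction) pairs from the pred.head() output.
y_true = [5.0, 2.0, 1.0, 3.0, 4.0]
y_pred = [3.905406, 3.861776, 2.700646, 3.546311, 3.947219]


def mean(values):
    return sum(values) / len(values)


def variance(values):
    """Population variance (divides by n)."""
    m = mean(values)
    return mean([(v - m) ** 2 for v in values])


residuals = [t - p for t, p in zip(y_true, y_pred)]

# Root mean squared error and mean absolute error.
rmse_value = math.sqrt(mean([r ** 2 for r in residuals]))
mae_value = mean([abs(r) for r in residuals])

# R2 compares squared error with the spread of the true ratings.
r2_value = 1 - mean([r ** 2 for r in residuals]) / variance(y_true)

# Explained variance ignores any constant bias in the residuals.
exp_var_value = 1 - variance(residuals) / variance(y_true)

print(f"RMSE: {rmse_value:.4f}  MAE: {mae_value:.4f}  "
      f"R2: {r2_value:.4f}  Exp var: {exp_var_value:.4f}")
```

With these definitions, explained variance is never lower than R2, and the two coincide when the residuals average to zero. That is consistent with the nearly identical R2 and explained variance values reported above.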


## Additional Reading

\[1\] Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, and Tie-Yan Liu. 2017. LightGBM: A highly efficient gradient boosting decision tree. In Advances in Neural Information Processing Systems. 3146–3154.<br>
\[2\] The parameters of LightGBM: https://github.com/Microsoft/LightGBM/blob/master/docs/Parameters.rst <br>
