# Information Retrieval Project

Bagging Model for Product Title Quality with Noise

CIKM AnalytiCup 2017

Aarib Ahmed Vahidy

Partham Kumar Chawla

### Loading Libraries

In [1]:
%pip install pandas numpy scikit-learn xgboost lightgbm nltk gensim beautifulsoup4 tqdm


In [2]:
import re
import string
import warnings

import lightgbm as lgb
import nltk
import numpy as np
import pandas as pd
import xgboost as xgb
from bs4 import BeautifulSoup
from sklearn.ensemble import ExtraTreesRegressor, RandomForestRegressor
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import OneHotEncoder
from tqdm import tqdm

warnings.filterwarnings('ignore')

### Data Loading and Label Integration

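The training and validation product data are loaded from CSV files.

The clarity and conciseness values are stored separately: the training set has binary labels in .labels files, while the validation set has real-valued scores in .predict files.

Each of these files holds a single unnamed column, which is given a meaningful name (clarity or conciseness).

The label columns are then concatenated column-wise with the corresponding product data, so rows are matched purely by position.

Finally, the columns of both the train and validation DataFrames are given the same names for consistency and easier downstream processing.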
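Because the label values are attached to the product rows purely by position, the data file and the label file must contain exactly the same number of rows. The snippet below is a minimal standard-library sketch of that check on made-up in-memory data, not the competition files; `join_by_position` is a hypothetical helper that does not appear in the notebook.

```python
import csv
import io

def join_by_position(data_text, label_text):
    """Pair each header-less CSV row with the label on the same line number."""
    rows = list(csv.reader(io.StringIO(data_text)))
    labels = [int(line) for line in label_text.split()]
    if len(rows) != len(labels):
        raise ValueError(f"{len(rows)} data rows but {len(labels)} labels")
    return [row + [label] for row, label in zip(rows, labels)]

# Made-up stand-ins for a header-less data file and a one-column label file.
data_text = "my,ID1,First title\nmy,ID2,Second title\nmy,ID3,Third title\n"
label_text = "1\n0\n1\n"

joined = join_by_position(data_text, label_text)
print(joined[1])  # ['my', 'ID2', 'Second title', 0]

# If the first data row is consumed as a header, one row goes missing
# and the positional join can no longer be trusted.
try:
    join_by_position(data_text.split("\n", 1)[1], label_text)
except ValueError as err:
    print(err)  # 2 data rows but 3 labels
```

A row-count comparison like this is cheap to run before any positional concatenation, since a one-row offset would otherwise pair every product with its neighbour's label.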

### Loading training data

In [3]:
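#Loading the main data
#The file has no header row: header=None keeps the first product as data and keeps rows aligned with the label files
data_train = pd.read_csv(r'C:\Users\aarib\6thSemester\IR\IRProject\Product Title Classification\CIKMAnalytiCup2017_Lazada\training\data_train.csv', header=None)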

#Loading labels
clarity_labels = pd.read_csv(r'C:\Users\aarib\6thSemester\IR\IRProject\Product Title Classification\CIKMAnalytiCup2017_Lazada\training\clarity_train.labels', header=None)
conciseness_labels = pd.read_csv(r'C:\Users\aarib\6thSemester\IR\IRProject\Product Title Classification\CIKMAnalytiCup2017_Lazada\training\conciseness_train.labels', header=None)

#Assigning meaningful column names
clarity_labels.columns = ['clarity']
conciseness_labels.columns = ['conciseness']

#Combining everything into one DataFrame
train = pd.concat([data_train, clarity_labels, conciseness_labels], axis=1)

print("Shape of training data:", train.shape)
train.head()

Shape of training data: (36283, 11)


Unnamed: 0,my,AD674FAASTLXANMY,Adana Gallery Suri Square Hijab – Light Pink,Fashion,Women,Muslim Wear,<ul><li>Material : Non sheer shimmer chiffon</li><li>Sizes : 52 x 52 inches OR 56 x 56 inches</li><li>Cut with curved ends</li></ul>,49.0,local,clarity,conciseness
0,my,AE068HBAA3RPRDANMY,Cuba Heartbreaker Eau De Parfum Spray 100ml/3.3oz,Health & Beauty,Bath & Body,Hand & Foot Care,Formulated with oil-free hydrating botanicals/...,128.0,international,1,1
1,my,AN680ELAA9VN57ANMY,Andoer 150cm Cellphone Smartphone Mini Dual-He...,"TV, Audio / Video, Gaming & Wearables",Audio,Live Sound & Stage,<ul> <li>150cm mini microphone compatible for ...,25.07,international,1,1
2,my,AN957HBAAAHDF4ANMY,ANMYNA Complaint Silky Set 柔顺洗发配套 (Shampoo 520...,Health & Beauty,Hair Care,Shampoos & Conditioners,<ul> <li>ANMYNA Complaint Silky Set (Shampoo 5...,118.0,local,1,0
3,my,AR511HBAXNWAANMY,Argital Argiltubo Green Clay For Face and Body...,Health & Beauty,Men's Care,Body and Skin Care,<ul> <li>100% Authentic</li> <li>Rrefresh and ...,114.8,international,1,1
4,my,AS575ELCMZ4WANMY,Asus TP300LJ-DW004H Transformer Book Flip 4GB ...,Computers & Laptops,Laptops,Traditional Laptops,"<div class=""prod_content""> <div class=""prod_de...",2599.0,local,1,1


### Adding column names to training data

In [4]:
train.columns = [
    'country', 'product_id', 'title', 
    'category_lvl_1', 'category_lvl_2', 'category_lvl_3',
    'description', 'price', 'delivery_type', 
    'clarity', 'conciseness'
]

In [5]:
train.head()

Unnamed: 0,country,product_id,title,category_lvl_1,category_lvl_2,category_lvl_3,description,price,delivery_type,clarity,conciseness
0,my,AE068HBAA3RPRDANMY,Cuba Heartbreaker Eau De Parfum Spray 100ml/3.3oz,Health & Beauty,Bath & Body,Hand & Foot Care,Formulated with oil-free hydrating botanicals/...,128.0,international,1,1
1,my,AN680ELAA9VN57ANMY,Andoer 150cm Cellphone Smartphone Mini Dual-He...,"TV, Audio / Video, Gaming & Wearables",Audio,Live Sound & Stage,<ul> <li>150cm mini microphone compatible for ...,25.07,international,1,1
2,my,AN957HBAAAHDF4ANMY,ANMYNA Complaint Silky Set 柔顺洗发配套 (Shampoo 520...,Health & Beauty,Hair Care,Shampoos & Conditioners,<ul> <li>ANMYNA Complaint Silky Set (Shampoo 5...,118.0,local,1,0
3,my,AR511HBAXNWAANMY,Argital Argiltubo Green Clay For Face and Body...,Health & Beauty,Men's Care,Body and Skin Care,<ul> <li>100% Authentic</li> <li>Rrefresh and ...,114.8,international,1,1
4,my,AS575ELCMZ4WANMY,Asus TP300LJ-DW004H Transformer Book Flip 4GB ...,Computers & Laptops,Laptops,Traditional Laptops,"<div class=""prod_content""> <div class=""prod_de...",2599.0,local,1,1


### Loading validation data

In [6]:
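#Loading the main data
#The file has no header row: header=None keeps the first product as data and keeps rows aligned with the .predict files
data_val = pd.read_csv(r'C:\Users\aarib\6thSemester\IR\IRProject\Product Title Classification\CIKMAnalytiCup2017_Lazada\validation\data_valid.csv', header=None)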

#Loading labels
clarity_labels = pd.read_csv(r'C:\Users\aarib\6thSemester\IR\IRProject\Product Title Classification\CIKMAnalytiCup2017_Lazada\validation\clarity_valid.predict', header=None)
conciseness_labels = pd.read_csv(r'C:\Users\aarib\6thSemester\IR\IRProject\Product Title Classification\CIKMAnalytiCup2017_Lazada\validation\conciseness_valid.predict', header=None)

#Assigning meaningful column names
clarity_labels.columns = ['clarity']
conciseness_labels.columns = ['conciseness']

#Combining everything into one DataFrame
validation = pd.concat([data_val, clarity_labels, conciseness_labels], axis=1)

print("Shape of validation data:", validation.shape)
validation.head()

Shape of validation data: (11838, 11)


Unnamed: 0,my,AP564ELASSTWANMY,Apple MacBook Pro MGXC2ZP/A 16GB i7 15.4-inch Retina Display Laptop,Computers & Laptops,Laptops,Macbooks,OS X Lion<br> Intel Core i7<br> 15-inch Retina Display<br> 16GB RAM / 512GB Flash<br> GeForce GT 750M<br> WiFi + BT4 + RJ45<br>,12550.0,local,clarity,conciseness
0,my,BR924HBAA5B3TLANMY,BRAND'S® American Ginseng Triple Pack (3x 6's)...,Health & Beauty,Food Supplements,Well Being,<ul> <li>Traditionally used to calm the mind a...,105.0,local,0.34139,0.98747
1,my,CA673ELAA5UG3XANMY,Canon EOS M10 Mirrorless Digital Camera 18MP w...,Cameras,Mirrorless,,<div> <ul> <li>18.0MP APS-C CMOS Sensor</li> <...,1588.0,local,0.22758,0.2977
2,my,DE759ELAA7QM1XANMY,"Dell LED Monitor 23"" (E2316H)",Computers & Laptops,Computer Accessories,Monitors,"<div class=""prod_content""> <div class=""prod_de...",565.0,local,0.36162,0.59796
3,my,ES802OTAABHAY8ANMY,Esprit Tallac Brave Nubuck Sand ES107601001 Be...,Watches Sunglasses Jewellery,Watches,Men,<ul> <li>stainless steel case</li> <li>mineral...,279.0,local,0.83418,0.56371
4,my,HP961ELAABF7N7ANMY,"(Refurbished) HP Compaq 3330 Pro MT + 19"" LCD",Computers & Laptops,Desktops Computers,All-purpose,"<ul> <li>Model : HP Compaq 3330 Pro MT + 19"" L...",1259.0,local,0.12018,0.5022


### Adding column names to validation data

In [7]:
validation.columns = [
    'country', 'product_id', 'title', 
    'category_lvl_1', 'category_lvl_2', 'category_lvl_3',
    'description', 'price', 'delivery_type', 
    'clarity', 'conciseness'
]

In [8]:
validation.head()

Unnamed: 0,country,product_id,title,category_lvl_1,category_lvl_2,category_lvl_3,description,price,delivery_type,clarity,conciseness
0,my,BR924HBAA5B3TLANMY,BRAND'S® American Ginseng Triple Pack (3x 6's)...,Health & Beauty,Food Supplements,Well Being,<ul> <li>Traditionally used to calm the mind a...,105.0,local,0.34139,0.98747
1,my,CA673ELAA5UG3XANMY,Canon EOS M10 Mirrorless Digital Camera 18MP w...,Cameras,Mirrorless,,<div> <ul> <li>18.0MP APS-C CMOS Sensor</li> <...,1588.0,local,0.22758,0.2977
2,my,DE759ELAA7QM1XANMY,"Dell LED Monitor 23"" (E2316H)",Computers & Laptops,Computer Accessories,Monitors,"<div class=""prod_content""> <div class=""prod_de...",565.0,local,0.36162,0.59796
3,my,ES802OTAABHAY8ANMY,Esprit Tallac Brave Nubuck Sand ES107601001 Be...,Watches Sunglasses Jewellery,Watches,Men,<ul> <li>stainless steel case</li> <li>mineral...,279.0,local,0.83418,0.56371
4,my,HP961ELAABF7N7ANMY,"(Refurbished) HP Compaq 3330 Pro MT + 19"" LCD",Computers & Laptops,Desktops Computers,All-purpose,"<ul> <li>Model : HP Compaq 3330 Pro MT + 19"" L...",1259.0,local,0.12018,0.5022


### Text Cleaning & Vectorization

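A custom function, clean_text(), does the following:

Returns an empty string for missing values

Removes HTML tags using BeautifulSoup

Deletes every character that is not an ASCII letter or whitespace, which drops digits, punctuation, and non-Latin text such as Chinese characters

Converts text to lowercase

Collapses runs of whitespace into a single space and trims both ends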

This function is applied to both the product title and description for the training and validation sets.

The cleaned title and description are concatenated to form a new feature called text.

Then:

A character-level n-gram vectorizer is defined using CountVectorizer with analyzer='char', ngram_range=(2, 6), and max_features=5000, as outlined in the original paper; max_features keeps only the 5000 most frequent n-grams in the training text.

The vectorizer is fitted on the training text, and both training and validation texts are transformed into numerical feature vectors.

Finally, the target variables y_clarity and y_conciseness are extracted for model training.

In [9]:
def clean_text(text):
    if pd.isna(text):
        return ""
    #Remove HTML tags
    text = BeautifulSoup(text, "html.parser").get_text()
    #Remove special characters and digits
    text = re.sub(r"[^a-zA-Z\s]", "", text)
    #Convert to lowercase
    text = text.lower()
    #Remove extra spaces
    text = re.sub(r"\s+", " ", text).strip()
    return text

#Applying the cleaning to both train and validation sets
for df in [train, validation]:
    df['clean_title'] = df['title'].apply(clean_text)
    df['clean_desc'] = df['description'].apply(clean_text)
    df['text'] = df['clean_title'] + " " + df['clean_desc']
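To make the effect of the two regular expressions concrete, the sketch below applies only the regex, lowercasing, and whitespace steps of `clean_text`. The BeautifulSoup step is left out so that the snippet needs nothing beyond the standard library. The sample title is invented, and `strip_non_letters` is a hypothetical name.

```python
import re

def strip_non_letters(text):
    """Regex, lowercase, and whitespace steps of clean_text (no HTML parsing)."""
    text = re.sub(r"[^a-zA-Z\s]", "", text)   # drop digits, punctuation, non-ASCII
    text = text.lower()
    return re.sub(r"\s+", " ", text).strip()  # collapse runs of whitespace

sample = "Silky Set 柔顺洗发 (Shampoo 500ml/16.9oz) - NEW!"
print(strip_non_letters(sample))  # silky set shampoo mloz new
```

Note that the first pattern deletes characters rather than replacing them with a space, so tokens separated only by punctuation or digits are merged (`500ml/16.9oz` becomes `mloz`), and the Chinese part of the title disappears entirely.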

In [10]:
#Using character n-grams as described in the paper (ngram_range = (2,6))
char_vectorizer = CountVectorizer(analyzer='char', ngram_range=(2, 6), max_features=5000)

#Fitting on train text, transform train and validation
X_train = char_vectorizer.fit_transform(train['text'])
X_val = char_vectorizer.transform(validation['text'])

#Targets
y_clarity = train['clarity']
y_conciseness = train['conciseness']
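As a rough illustration of what the character-level features look like, the sketch below counts every substring of length 2 to 6 in a short string. This is the kind of feature that a `CountVectorizer` with `analyzer='char'` and `ngram_range=(2, 6)` builds its vocabulary from. It is a plain-Python approximation for intuition only, not scikit-learn's implementation, and `char_ngrams` is a hypothetical helper.

```python
from collections import Counter

def char_ngrams(text, n_min=2, n_max=6):
    """Count all character substrings with length between n_min and n_max."""
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for start in range(len(text) - n + 1):
            counts[text[start:start + n]] += 1
    return counts

counts = char_ngrams("led tv")
print(len(counts))    # 15
print(counts["d t"])  # 1
```

Because the n-grams run across spaces (`"d t"` above), they capture word-boundary patterns as well as fragments inside words. Even a six-character string yields 15 distinct n-grams, which is why the vocabulary is capped with `max_features`.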

### Bagged Ensemble Model Training with LightGBM

This section trains an ensemble of LightGBM models and averages their predictions, which is intended to reduce the variance of the final predictions:

Model Choice:

LightGBM's regressor (LGBMRegressor) is used for its speed and performance on high-dimensional, sparse data such as character n-gram counts.

Ensemble Strategy – Bagging + K-Fold CV:

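Bagging (n_bags=4): The full 10-fold split is repeated four times, each time with a different shuffle seed, so that each repetition acts as a separate “bag” of models. No bootstrap resampling is involved.

K-Fold Cross-Validation (n_folds=10): Within each bag, KFold splits the training data into 10 folds, and each model is fit on 9 of them (about 90% of the rows). The held-out fold is not used for scoring in this cell.

In total, 40 models are trained (4 bags × 10 folds).

Each model predicts on the validation set, and the 40 predictions are averaged with equal weight to give the final validation predictions.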

Model Configuration:

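n_estimators=300: Each model trains 300 boosting iterations; no early stopping is used.

learning_rate=0.05: A small step size, chosen to favor generalization over fast convergence.

num_leaves=31: The maximum number of leaves per tree, which bounds model complexity (max_depth=-1 leaves tree depth unlimited).

random_state: The KFold shuffle seed changes per bag (42 + bag), so each bag sees different splits; the LGBMRegressor seed itself is fixed at 42.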

Data Compatibility Fix:

X_train and X_val are cast to np.float32 to resolve type compatibility errors with LightGBM (which expects float32 or float64 input, not int64).

Target-wise Training:

The process is run independently for both target variables:

clarity

conciseness

This setup helps create more stable and generalizable predictions by aggregating results from a diverse set of models trained on different subsets of data.
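The averaging scheme described above can be sketched without LightGBM. In the hedged example below, each "model" is only a stand-in that predicts the mean label of its training rows, and `kfold_train_indices` is a hypothetical plain-Python imitation of a shuffled K-fold split. The point is to show how 4 bags × 10 folds yield 40 equally weighted contributions, not to reproduce the notebook's results.

```python
import random

def kfold_train_indices(n_rows, n_folds, seed):
    """Yield the training indices of each shuffled fold (held-out fold removed)."""
    order = list(range(n_rows))
    random.Random(seed).shuffle(order)
    base, extra = divmod(n_rows, n_folds)
    start = 0
    for fold in range(n_folds):
        size = base + (1 if fold < extra else 0)  # first `extra` folds hold out one more row
        held_out = set(order[start:start + size])
        yield [i for i in range(n_rows) if i not in held_out]
        start += size

def bagged_prediction(y, n_bags=4, n_folds=10):
    """Average the predictions of n_bags * n_folds stand-in models."""
    total, n_models = 0.0, 0
    for bag in range(n_bags):
        for train_idx in kfold_train_indices(len(y), n_folds, seed=42 + bag):
            # Stand-in "model": predict the mean target of its training rows.
            fold_pred = sum(y[i] for i in train_idx) / len(train_idx)
            total += fold_pred / (n_bags * n_folds)
            n_models += 1
    return total, n_models

y = [1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1]  # toy binary labels
pred, n_models = bagged_prediction(y)
print(n_models)  # 40

# Training-set sizes for a 36283-row dataset split into 10 folds.
sizes = [len(idx) for idx in kfold_train_indices(36283, 10, seed=42)]
print(sizes[:4])  # [32654, 32654, 32654, 32655]
```

With 36283 rows and 10 folds, three folds hold out 3629 rows and seven hold out 3628, so each model trains on 32654 or 32655 rows.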

In [12]:
def run_lgbm_bagged(X, y, X_val, n_bags=4, n_folds=10):
    """Average validation predictions over n_bags repetitions of n_folds-fold training."""
    val_preds = np.zeros(X_val.shape[0])
    
    for bag in range(n_bags):
        # A different shuffle seed per bag gives each bag different fold splits
        kf = KFold(n_splits=n_folds, shuffle=True, random_state=42 + bag)
        for train_idx, valid_idx in tqdm(kf.split(X), desc=f"Bag {bag+1}/{n_bags}"):
            # Only the training part of each split is used; the held-out fold is never scored
            X_train_kf = X[train_idx]
            y_train_kf = y.iloc[train_idx]
            
            model = lgb.LGBMRegressor(
                n_estimators=300,
                learning_rate=0.05,
                max_depth=-1,
                num_leaves=31,
                random_state=42
            )
            model.fit(X_train_kf, y_train_kf)
            # Each of the n_bags * n_folds models contributes equally to the average
            val_preds += model.predict(X_val) / (n_folds * n_bags)
    
    return val_preds


# Fix dtype for LightGBM
X_train = X_train.astype(np.float32)
X_val = X_val.astype(np.float32)

# Run for both targets
print("Training ensemble for CLARITY...")
val_preds_clarity = run_lgbm_bagged(X_train, y_clarity, X_val)

print("\nTraining ensemble for CONCISENESS...")
val_preds_conciseness = run_lgbm_bagged(X_train, y_conciseness, X_val)

Training ensemble for CLARITY...


Bag 1/4: 0it [00:00, ?it/s]

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 1.407419 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 36642
[LightGBM] [Info] Number of data points in the train set: 32654, number of used features: 5000
[LightGBM] [Info] Start training from score 0.943315


Bag 1/4: 1it [00:29, 29.80s/it]

Bag 1/4: 10it [04:53, 29.36s/it]
Bag 2/4: 0it [00:00, ?it/s]

Bag 2/4: 10it [04:50, 29.08s/it]
Bag 3/4: 0it [00:00, ?it/s]

Bag 3/4: 10it [04:54, 29.41s/it]
Bag 4/4: 0it [00:00, ?it/s]

Bag 4/4: 10it [04:54, 29.50s/it]



Training ensemble for CONCISENESS...


Bag 1/4: 0it [00:00, ?it/s]

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 1.345949 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 36642
[LightGBM] [Info] Number of data points in the train set: 32654, number of used features: 5000
[LightGBM] [Info] Start training from score 0.685245


Bag 1/4: 1it [00:29, 29.08s/it]

Bag 1/4: 8it [03:53, 29.32s/it]

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 1.343261 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 36336
[LightGBM] [Info] Number of data points in the train set: 32655, number of used features: 5000
[LightGBM] [Info] Start training from score 0.685684


Bag 1/4: 9it [04:22, 29.44s/it]

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 1.310768 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 36654
[LightGBM] [Info] Number of data points in the train set: 32655, number of used features: 5000
[LightGBM] [Info] Start training from score 0.685990


Bag 1/4: 10it [04:52, 29.22s/it]
Bag 2/4: 0it [00:00, ?it/s]

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 1.394535 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 36480
[LightGBM] [Info] Number of data points in the train set: 32654, number of used features: 5000
[LightGBM] [Info] Start training from score 0.684694


Bag 2/4: 1it [00:28, 28.06s/it]

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.496306 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 36593
[LightGBM] [Info] Number of data points in the train set: 32654, number of used features: 5000
[LightGBM] [Info] Start training from score 0.685950


Bag 2/4: 2it [00:38, 17.86s/it]

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.493557 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 36604
[LightGBM] [Info] Number of data points in the train set: 32654, number of used features: 5000
[LightGBM] [Info] Start training from score 0.685276


Bag 2/4: 3it [00:50, 14.93s/it]

[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.533357 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 36595
[LightGBM] [Info] Number of data points in the train set: 32655, number of used features: 5000
[LightGBM] [Info] Start training from score 0.685439


Bag 2/4: 4it [01:03, 14.26s/it]

[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.594559 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 36642
[LightGBM] [Info] Number of data points in the train set: 32655, number of used features: 5000
[LightGBM] [Info] Start training from score 0.686051


Bag 2/4: 5it [01:15, 13.62s/it]

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.495168 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 36328
[LightGBM] [Info] Number of data points in the train set: 32655, number of used features: 5000
[LightGBM] [Info] Start training from score 0.686449


Bag 2/4: 6it [01:26, 12.70s/it]

[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.655068 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 36369
[LightGBM] [Info] Number of data points in the train set: 32655, number of used features: 5000
[LightGBM] [Info] Start training from score 0.684612


Bag 2/4: 7it [01:39, 12.52s/it]

[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.534390 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 36655
[LightGBM] [Info] Number of data points in the train set: 32655, number of used features: 5000
[LightGBM] [Info] Start training from score 0.685622


Bag 2/4: 8it [01:51, 12.47s/it]

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.436141 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 36602
[LightGBM] [Info] Number of data points in the train set: 32655, number of used features: 5000
[LightGBM] [Info] Start training from score 0.686480


Bag 2/4: 9it [02:02, 12.18s/it]

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.534146 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 36564
[LightGBM] [Info] Number of data points in the train set: 32655, number of used features: 5000
[LightGBM] [Info] Start training from score 0.682774


Bag 2/4: 10it [02:14, 13.48s/it]
Bag 3/4: 0it [00:00, ?it/s]

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.599762 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 36540
[LightGBM] [Info] Number of data points in the train set: 32654, number of used features: 5000
[LightGBM] [Info] Start training from score 0.683990


Bag 3/4: 1it [00:11, 11.55s/it]

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.534254 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 36543
[LightGBM] [Info] Number of data points in the train set: 32654, number of used features: 5000
[LightGBM] [Info] Start training from score 0.686225


Bag 3/4: 2it [00:23, 11.74s/it]

[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.593168 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 36609
[LightGBM] [Info] Number of data points in the train set: 32654, number of used features: 5000
[LightGBM] [Info] Start training from score 0.685705


Bag 3/4: 3it [00:35, 12.12s/it]

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.544267 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 36589
[LightGBM] [Info] Number of data points in the train set: 32655, number of used features: 5000
[LightGBM] [Info] Start training from score 0.685102


Bag 3/4: 4it [00:47, 12.01s/it]

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.546027 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 36596
[LightGBM] [Info] Number of data points in the train set: 32655, number of used features: 5000
[LightGBM] [Info] Start training from score 0.685010


Bag 3/4: 5it [00:59, 12.02s/it]

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.514510 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 36607
[LightGBM] [Info] Number of data points in the train set: 32655, number of used features: 5000
[LightGBM] [Info] Start training from score 0.684887


Bag 3/4: 6it [01:11, 11.97s/it]

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.487307 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 36507
[LightGBM] [Info] Number of data points in the train set: 32655, number of used features: 5000
[LightGBM] [Info] Start training from score 0.685224


Bag 3/4: 7it [01:23, 11.82s/it]

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.450996 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 36442
[LightGBM] [Info] Number of data points in the train set: 32655, number of used features: 5000
[LightGBM] [Info] Start training from score 0.685347


Bag 3/4: 8it [01:34, 11.70s/it]

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.553108 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 36489
[LightGBM] [Info] Number of data points in the train set: 32655, number of used features: 5000
[LightGBM] [Info] Start training from score 0.686694


Bag 3/4: 9it [01:46, 11.77s/it]

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.535348 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 36563
[LightGBM] [Info] Number of data points in the train set: 32655, number of used features: 5000
[LightGBM] [Info] Start training from score 0.685163


Bag 3/4: 10it [01:58, 11.87s/it]
Bag 4/4: 0it [00:00, ?it/s]

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.611548 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 36583
[LightGBM] [Info] Number of data points in the train set: 32654, number of used features: 5000
[LightGBM] [Info] Start training from score 0.685582


Bag 4/4: 1it [00:11, 11.83s/it]

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.514088 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 36601
[LightGBM] [Info] Number of data points in the train set: 32654, number of used features: 5000
[LightGBM] [Info] Start training from score 0.684970


Bag 4/4: 2it [00:24, 12.18s/it]

[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.557074 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 36569
[LightGBM] [Info] Number of data points in the train set: 32654, number of used features: 5000
[LightGBM] [Info] Start training from score 0.685429


Bag 4/4: 3it [00:36, 12.36s/it]

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.480498 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 36542
[LightGBM] [Info] Number of data points in the train set: 32655, number of used features: 5000
[LightGBM] [Info] Start training from score 0.684520


Bag 4/4: 4it [00:48, 12.11s/it]

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.566565 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 36329
[LightGBM] [Info] Number of data points in the train set: 32655, number of used features: 5000
[LightGBM] [Info] Start training from score 0.684244


Bag 4/4: 5it [01:00, 12.08s/it]

[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.535702 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 36535
[LightGBM] [Info] Number of data points in the train set: 32655, number of used features: 5000
[LightGBM] [Info] Start training from score 0.686266


Bag 4/4: 6it [01:12, 12.18s/it]

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.526817 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 36483
[LightGBM] [Info] Number of data points in the train set: 32655, number of used features: 5000
[LightGBM] [Info] Start training from score 0.685408


Bag 4/4: 7it [01:25, 12.24s/it]

[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.699698 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 36547
[LightGBM] [Info] Number of data points in the train set: 32655, number of used features: 5000
[LightGBM] [Info] Start training from score 0.685132


Bag 4/4: 8it [01:39, 12.99s/it]

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.491581 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 36611
[LightGBM] [Info] Number of data points in the train set: 32655, number of used features: 5000
[LightGBM] [Info] Start training from score 0.686357


Bag 4/4: 9it [01:51, 12.54s/it]

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.586812 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 36587
[LightGBM] [Info] Number of data points in the train set: 32655, number of used features: 5000
[LightGBM] [Info] Start training from score 0.685439


Bag 4/4: 10it [02:03, 12.31s/it]


In [14]:
# Extract the actual labels from the validation set
y_val = validation[['clarity', 'conciseness']]

# Calculate RMSE as the square root of the MSE. The `squared=False` argument
# of mean_squared_error is deprecated and removed in recent scikit-learn releases.
rmse_clarity = np.sqrt(mean_squared_error(y_val['clarity'], val_preds_clarity))
rmse_conciseness = np.sqrt(mean_squared_error(y_val['conciseness'], val_preds_conciseness))

print(f"Validation RMSE - Clarity: {rmse_clarity:.5f}")
print(f"Validation RMSE - Conciseness: {rmse_conciseness:.5f}")

Validation RMSE - Clarity: 0.52931
Validation RMSE - Conciseness: 0.35273
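
A constant predictor is a useful reference point for RMSE values like these: a model that scores worse than "always predict the training mean" usually points to a problem in the labels or the pipeline rather than to a weak model. The sketch below is not part of the original notebook; the helper names `rmse` and `constant_baseline_rmse` and the toy arrays are hypothetical and only illustrate the check.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def constant_baseline_rmse(y_train, y_eval):
    """RMSE on y_eval when every prediction is the mean of y_train."""
    train_mean = float(np.mean(y_train))
    return rmse(y_eval, np.full(len(y_eval), train_mean))

# Toy labels (hypothetical, not the competition data)
y_train_toy = np.array([1, 1, 1, 0])
y_eval_toy = np.array([1, 0, 1, 1])

baseline = constant_baseline_rmse(y_train_toy, y_eval_toy)
print(f"Constant-baseline RMSE: {baseline:.5f}")
```

In the notebook, the same comparison could be run with the training labels and `y_val` in place of the toy arrays, assuming both use the same label encoding.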


### Results Comparison: Our Implementation vs. Research Paper

This section compares our model’s validation performance with the results reported in the CIKM AnalytiCup 2017 (Lazada Product Title Quality Challenge) paper "Bagging Model for Product Title Quality with Noise".

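| Target      | Paper algorithm | Paper RMSE | Our RMSE (LightGBM) |
|-------------|-----------------|------------|---------------------|
| Conciseness | XGBoost         | 0.31553    | 0.35273             |
| Clarity     | XGBoost         | 0.20745    | 0.52931             |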

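Observations:
Our RMSE values are higher than the paper’s for both targets: about 12% higher for conciseness (0.35273 vs. 0.31553) and roughly 2.6 times higher for clarity (0.52931 vs. 0.20745).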

The paper’s results were achieved using additional techniques such as:

Ensemble of different algorithms (e.g., XGBoost, Ridge, SVR)

More aggressive hyperparameter tuning

Possibly more preprocessing steps or feature engineering

Our pipeline is simpler (LightGBM on character n-gram features only), yet it stays within about 0.04 RMSE of the paper for conciseness. The clarity gap is much larger.
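
The first item in the list above, combining different algorithms, can be as simple as a weighted average of each model's predictions. The following is a minimal sketch under stated assumptions, not the paper's blending scheme: `blend_predictions` is a hypothetical helper, the prediction arrays are toy values, and clipping to [0, 1] assumes the targets are 0/1 labels.

```python
import numpy as np

def blend_predictions(preds, weights=None):
    """Weighted average of per-model prediction arrays.

    preds   : list of equal-length arrays, one per model
    weights : optional list of non-negative weights (defaults to equal)
    """
    stacked = np.vstack([np.asarray(p, dtype=float) for p in preds])
    if weights is None:
        weights = np.ones(stacked.shape[0])
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()      # normalise so weights sum to 1
    blended = weights @ stacked            # shape: (n_samples,)
    # Assumption: targets lie in [0, 1], so out-of-range values are clipped
    return np.clip(blended, 0.0, 1.0)

# Toy predictions from two hypothetical models
preds_model_a = [0.9, 0.2, 0.6]
preds_model_b = [0.7, 0.4, 1.2]

equal_blend = blend_predictions([preds_model_a, preds_model_b])
weighted_blend = blend_predictions([preds_model_a, preds_model_b], weights=[3, 1])
print(equal_blend, weighted_blend)
```

Blend weights would normally be chosen on held-out data rather than fixed by hand; the values here are purely illustrative.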