This notebook is our modified version of a starter notebook by Chris Deotte:
https://www.kaggle.com/code/cdeotte/catboost-starter-lb-0-60

# CatBoost 
We use features from both the Kaggle spectrograms and the spectrograms made from the EEG signals.  
To help the model learn, we doubled the frequency of the high-quality data (rows with at least 10 voters in total).  
We also use a different loss function, `MultiCrossEntropy`, because it accepts the probabilities of all classes as labels.  
In this notebook, we also report two CV scores: one on all validation rows and one on the high-quality rows only. Kaggle's sample submission uses equal predictions of 1/6 for all targets and achieves CV 0.69, LB 0.51.
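
The doubling of high-quality rows mentioned above can be pictured with a small sketch. The toy table and its values are made up for illustration; only the `total_evaluators >= 10` threshold comes from this notebook.

```python
import pandas as pd

# Toy stand-in for the training table; the values are invented for illustration.
toy = pd.DataFrame({
    "eeg_id": [101, 102, 103, 104],
    "total_evaluators": [3, 12, 10, 5],
})

# Rows rated by at least 10 voters are treated as high quality.
high_quality = toy[toy["total_evaluators"] >= 10]

# Appending the high-quality rows once more makes each of them appear twice.
upsampled = pd.concat([toy, high_quality], ignore_index=True)

# Count how often each eeg_id now appears.
counts = upsampled["eeg_id"].value_counts().to_dict()
print(counts)
```

The notebook itself builds the same effect with `np.concatenate` on the feature and label arrays inside the training loop.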


# Load Libraries

In [None]:
import os, gc
os.environ["CUDA_VISIBLE_DEVICES"]="0,1"
import pandas as pd, numpy as np
import matplotlib.pyplot as plt

VER = 3

# Load Train Data

In [None]:
df = pd.read_csv('/kaggle/input/hms-harmful-brain-activity-classification/train.csv')
TARGETS = df.columns[-6:]
print('Train shape:', df.shape )
print('Targets', list(TARGETS))
df.head()

# Create Non-Overlapping Eeg Id Train Data
The competition data description says that test data does not have multiple crops from the same `eeg_id`. Therefore we will train and validate using only 1 crop per `eeg_id`. There is a discussion about this [here][1].

[1]: https://www.kaggle.com/competitions/hms-harmful-brain-activity-classification/discussion/467021
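
As a rough illustration of the aggregation done in the next cells, the sketch below collapses a toy table to one row per `eeg_id`, sums the votes, and rescales them into probabilities. The toy values and the use of only two vote columns are assumptions made for brevity; the real data has six vote columns.

```python
import pandas as pd

# Only two of the six vote columns are used here, to keep the example short.
VOTE_COLS = ["seizure_vote", "other_vote"]

# Toy data: eeg_id 1 has two crops, eeg_id 2 has one.
toy = pd.DataFrame({
    "eeg_id": [1, 1, 2],
    "spectrogram_label_offset_seconds": [0.0, 8.0, 0.0],
    "seizure_vote": [3, 1, 0],
    "other_vote": [1, 3, 5],
})

# One row per eeg_id: keep the offset range and sum the votes over all crops.
per_eeg = toy.groupby("eeg_id").agg(
    min_offset=("spectrogram_label_offset_seconds", "min"),
    max_offset=("spectrogram_label_offset_seconds", "max"),
    seizure_vote=("seizure_vote", "sum"),
    other_vote=("other_vote", "sum"),
)

# Divide each row by its total so the votes become class probabilities.
votes = per_eeg[VOTE_COLS]
per_eeg[VOTE_COLS] = votes.div(votes.sum(axis=1), axis=0)
per_eeg = per_eeg.reset_index()
print(per_eeg)
```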

In [None]:
df['total_evaluators'] = df[['seizure_vote', 'lpd_vote', 'gpd_vote', 'lrda_vote', 'grda_vote', 'other_vote']].sum(axis=1)

In [None]:
train = df.groupby('eeg_id')[['spectrogram_id','spectrogram_label_offset_seconds']].agg(
    {'spectrogram_id':'first','spectrogram_label_offset_seconds':'min'})
train.columns = ['spec_id','min']

tmp = df.groupby('eeg_id')[['spectrogram_id','spectrogram_label_offset_seconds']].agg(
    {'spectrogram_label_offset_seconds':'max'})
train['max'] = tmp

tmp = df.groupby('eeg_id')[['patient_id']].agg('first')
train['patient_id'] = tmp

tmp = df.groupby('eeg_id')[['total_evaluators']].agg('mean')
train['total_evaluators'] = tmp

tmp = df.groupby('eeg_id')[TARGETS].agg('sum')
for t in TARGETS:
    train[t] = tmp[t].values
    
y_data = train[TARGETS].values
y_data = y_data / y_data.sum(axis=1,keepdims=True)
train[TARGETS] = y_data

tmp = df.groupby('eeg_id')[['expert_consensus']].agg('first')
train['target'] = tmp

train = train.reset_index()
print('Train non-overlap eeg_id shape:', train.shape )
train.head()

# Feature Engineer
In this section, we create features for our CatBoost model. 

First we need to read in all 11k train spectrogram files. Reading thousands of files takes 11 minutes with Pandas. Instead, we can read a single file from Chris's `brain-spectrograms` Kaggle dataset, which contains all 11k spectrograms, in less than 1 minute. To use that dataset, set `READ_SPEC_FILES = False`.

We also load the EEG spectrograms from Chris's dataset (https://www.kaggle.com/datasets/cdeotte/brain-eeg-spectrograms). To read them from its single combined file, set `READ_EEG_SPEC_FILES = False`.

Next we engineer features for our CatBoost model. For the Kaggle spectrograms we take the mean and min over time in a 10-minute window and in a 20-second window. For the EEG spectrograms we take the mean, min, max, and standard deviation over time in a 10-second window.
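
The window features can be sketched on a toy array as follows. The random spectrogram, its 4 columns, and the offset values are hypothetical; the sketch also assumes that spectrogram rows are 2 seconds apart, which is what the 300-row "10 minute" window in the code below implies.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy spectrogram: 400 time rows x 4 frequency columns (invented data).
spec = rng.random((400, 4))
spec[10, 2] = np.nan  # a missing value, which is why nan-aware reductions are used

# Hypothetical label offsets in seconds for one eeg_id.
min_offset, max_offset = 0.0, 40.0

# Midpoint offset in seconds is (min + max) / 2; with 2 s per row this is (min + max) // 4 rows.
r = int((min_offset + max_offset) // 4)

long_window = spec[r:r + 300]         # 300 rows, about 10 minutes
short_window = spec[r + 145:r + 155]  # 10 rows, about 20 seconds around the centre

# One feature per column and per (window, statistic) pair.
features = np.concatenate([
    np.nanmean(long_window, axis=0),
    np.nanmin(long_window, axis=0),
    np.nanmean(short_window, axis=0),
    np.nanmin(short_window, axis=0),
])
print(features.shape)
```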
 



In [None]:
READ_SPEC_FILES = False
READ_EEG_SPEC_FILES = False

In [None]:
%%time
# READ ALL SPECTROGRAMS
PATH = '/kaggle/input/hms-harmful-brain-activity-classification/train_spectrograms/'
files = os.listdir(PATH)
print(f'There are {len(files)} spectrogram parquets')

if READ_SPEC_FILES:    
    spectrograms = {}
    for i,f in enumerate(files):
        if i%100==0: print(i,', ',end='')
        tmp = pd.read_parquet(f'{PATH}{f}')
        name = int(f.split('.')[0])
        spectrograms[name] = tmp.iloc[:,1:].values
else:
    spectrograms = np.load('/kaggle/input/brain-spectrograms/specs.npy',allow_pickle=True).item()

In [None]:
%%time
# READ ALL EEG SPECTROGRAMS
if READ_EEG_SPEC_FILES:
    all_eegs = {}
    for i,e in enumerate(train.eeg_id.values):
        if i%100==0: print(i,', ',end='')
        x = np.load(f'/kaggle/input/brain-eeg-spectrograms/EEG_Spectrograms/{e}.npy')
        all_eegs[e] = x
else:
    all_eegs = np.load('/kaggle/input/brain-eeg-spectrograms/eeg_specs.npy',allow_pickle=True).item()

In [None]:
%%time
# ENGINEER FEATURES
import warnings
warnings.filterwarnings('ignore')

# FEATURE NAMES
SPEC_COLS = pd.read_parquet(f'{PATH}1000086677.parquet').columns[1:]
FEATURES = [f'{c}_mean_10m' for c in SPEC_COLS]
FEATURES += [f'{c}_min_10m' for c in SPEC_COLS]
FEATURES += [f'{c}_mean_20s' for c in SPEC_COLS]
FEATURES += [f'{c}_min_20s' for c in SPEC_COLS]
FEATURES += [f'eeg_mean_f{x}_10s' for x in range(512)]
FEATURES += [f'eeg_min_f{x}_10s' for x in range(512)]
FEATURES += [f'eeg_max_f{x}_10s' for x in range(512)]
FEATURES += [f'eeg_std_f{x}_10s' for x in range(512)]
print(f'We are creating {len(FEATURES)} features for {len(train)} rows... ',end='')

data = np.zeros((len(train),len(FEATURES)))
for k in range(len(train)):
    if k%100==0: print(k,', ',end='')
    row = train.iloc[k]
    r = int( (row['min'] + row['max'])//4 ) 

    # 10 MINUTE WINDOW FEATURES (MEANS and MINS)
    x = np.nanmean(spectrograms[row.spec_id][r:r+300,:],axis=0)
    data[k,:400] = x
    x = np.nanmin(spectrograms[row.spec_id][r:r+300,:],axis=0)
    data[k,400:800] = x

    # 20 SECOND WINDOW FEATURES (MEANS and MINS)
    x = np.nanmean(spectrograms[row.spec_id][r+145:r+155,:],axis=0)
    data[k,800:1200] = x
    x = np.nanmin(spectrograms[row.spec_id][r+145:r+155,:],axis=0)
    data[k,1200:1600] = x

    # RESHAPE EEG SPECTROGRAMS 128x256x4 => 512x256
    eeg_spec = np.zeros((512,256),dtype='float32')
    xx = all_eegs[row.eeg_id]
    for j in range(4): eeg_spec[128*j:128*(j+1),] = xx[:,:,j]

    # 10 SECOND WINDOW FROM EEG SPECTROGRAMS 
    x = np.nanmean(eeg_spec.T[100:-100,:],axis=0)
    data[k,1600:2112] = x
    x = np.nanmin(eeg_spec.T[100:-100,:],axis=0)
    data[k,2112:2624] = x
    x = np.nanmax(eeg_spec.T[100:-100,:],axis=0)
    data[k,2624:3136] = x
    x = np.nanstd(eeg_spec.T[100:-100,:],axis=0)
    data[k,3136:3648] = x

train[FEATURES] = data
print(); print('New train shape:',train.shape)

In [None]:
# FREE MEMORY
del all_eegs, spectrograms, data
gc.collect()

# Train CatBoost
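We use CatBoost's default settings, which are pretty good, apart from the loss function (`MultiCrossEntropy`) and `task_type='GPU'`. We can tune CatBoost manually to improve the CV and LB scores. Note that with `task_type='GPU'`, CatBoost automatically uses both Kaggle T4 GPUs for very fast training. Each fold is fitted on the upsampled pool, in which the high-quality rows appear twice.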
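
To give an intuition for why this loss fits probability labels, here is a small NumPy sketch. It is not CatBoost's implementation: it assumes, based on our reading of the CatBoost documentation, that the loss applies a sigmoid to each class score independently and averages a binary cross-entropy over rows and classes. The function names and toy values are hypothetical.

```python
import numpy as np

def sigmoid(z):
    """Elementwise logistic function."""
    return 1.0 / (1.0 + np.exp(-z))

def multi_cross_entropy(raw_scores, soft_labels, eps=1e-15):
    """Mean binary cross-entropy over rows and classes (assumed form of the loss)."""
    p = np.clip(sigmoid(raw_scores), eps, 1.0 - eps)
    per_cell = soft_labels * np.log(p) + (1.0 - soft_labels) * np.log(1.0 - p)
    return -per_cell.mean()

# Two rows, three classes; the labels are vote fractions rather than hard classes.
soft_labels = np.array([[0.5, 0.5, 0.0],
                        [0.0, 0.2, 0.8]])
raw_scores = np.zeros_like(soft_labels)  # an untrained model scoring 0 everywhere

loss = multi_cross_entropy(raw_scores, soft_labels)

# Independent sigmoids need not sum to 1 per row, so we rescale before scoring.
probs = sigmoid(raw_scores)
row_sums = probs.sum(axis=1)
normalised = probs / row_sums[:, None]
print(loss, row_sums)
```

Under this assumption the per-class outputs do not form a distribution on their own, which is consistent with the row-wise rescaling this notebook applies before computing the CV score.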

In [None]:
import catboost as cat
from catboost import CatBoostClassifier, Pool
print('CatBoost version',cat.__version__)

In [None]:
from sklearn.model_selection import KFold, GroupKFold

all_oof = []
all_true = []
all_oof2 = []
all_true2 = []
TARS = {'Seizure':0, 'LPD':1, 'GPD':2, 'LRDA':3, 'GRDA':4, 'Other':5}

gkf = GroupKFold(n_splits=5)
for i, (train_index, valid_index) in enumerate(gkf.split(train, train.target, train.patient_id)):   
    
    print('#'*25)
    print(f'### Fold {i+1}')
    n_hq_train = int((train['total_evaluators'].values[train_index] >= 10).sum())
    n_hq_valid = int((train['total_evaluators'].values[valid_index] >= 10).sum())
    print(f'### train size {len(train_index) + n_hq_train}, valid size {len(valid_index) + n_hq_valid}')
    print('#'*25)
    
    model = CatBoostClassifier(task_type='GPU',
                               loss_function='MultiCrossEntropy')
    # train_pool: all training rows of this fold (fitting below uses train_pool3)
    train_pool = Pool(
        data = train.loc[train_index,FEATURES],
        label = np.array(train[train.columns[6:12]])[train_index],
    )
    
    valid_pool = Pool(
        data = train.loc[valid_index,FEATURES],
        label =np.array(train[train.columns[6:12]])[valid_index],
    )
    # train_pool3: all rows, with the high-quality rows (total voters >= 10) included twice
    train_pool3 = Pool(
        data = np.concatenate((train.loc[train_index,FEATURES],train.loc[train_index,FEATURES][train.iloc[train_index]["total_evaluators"]>=10]),axis=0),
        label = np.concatenate((np.array(train[train.columns[6:12]])[train_index],np.array(train[train.columns[6:12]])[train_index][train.iloc[train_index]["total_evaluators"]>=10]),axis=0),
    )
    
    valid_pool3 = Pool(
        data = np.concatenate((train.loc[valid_index,FEATURES],train.loc[valid_index,FEATURES][train.iloc[valid_index]["total_evaluators"]>=10]),axis=0),
        label = np.concatenate((np.array(train[train.columns[6:12]])[valid_index],np.array(train[train.columns[6:12]])[valid_index][train.iloc[valid_index]["total_evaluators"]>=10]),axis=0),
    )
    
    
    model.fit(train_pool3,
             verbose=100,
             eval_set=valid_pool3,
             )
    model.save_model(f'CAT_v{VER}_f{i}.cat')
    
    # train_pool2: only the high-quality rows (total voters >= 10)
    train_pool2 = Pool(
        data = train.loc[train_index,FEATURES][train.iloc[train_index]["total_evaluators"]>=10],
        label = np.array(train[train.columns[6:12]])[train_index][train.iloc[train_index]["total_evaluators"]>=10],
    )
    
    valid_pool2 = Pool(
        data = train.loc[valid_index,FEATURES][train.iloc[valid_index]["total_evaluators"]>=10],
        label =np.array(train[train.columns[6:12]])[valid_index][train.iloc[valid_index]["total_evaluators"]>=10],
    )
    
    oof = model.predict_proba(valid_pool)
    all_oof.append(oof)
    all_true.append(train.loc[valid_index, TARGETS].values)
    oof2 = model.predict_proba(valid_pool2)
    all_oof2.append(oof2)
    all_true2.append(np.array(train[train.columns[6:12]])[valid_index][train.iloc[valid_index]["total_evaluators"]>=10])
    
    del train_pool, valid_pool,oof2, oof,train_pool2,valid_pool2,train_pool3,valid_pool3 #model
    gc.collect()
    
    #break
    
all_oof = np.concatenate(all_oof)
all_true = np.concatenate(all_true)
all_oof2 = np.concatenate(all_oof2)
all_true2 = np.concatenate(all_true2)

# Feature Importance
Below we display the top 25 CatBoost feature importances for the last fold we trained.
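
The selection of the top features relies on `np.argsort`, which sorts in ascending order, so the most important features are the last entries. A minimal sketch with invented importances and feature names:

```python
import numpy as np

# Invented importances and names, for illustration only.
importance = np.array([0.1, 2.5, 0.0, 1.2, 0.7])
names = np.array(["f0", "f1", "f2", "f3", "f4"])
TOP = 3

# argsort returns indices from least to most important.
order = np.argsort(importance)

# The last TOP indices are the most important features, still in ascending order,
# which places the largest bar at the top of a horizontal bar chart.
top_idx = order[-TOP:]
top_names = names[top_idx]
print(list(top_names))
```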

In [None]:
TOP = 25

feature_importance = model.feature_importances_
sorted_idx = np.argsort(feature_importance)
fig = plt.figure(figsize=(10, 8))
plt.barh(np.arange(len(sorted_idx))[-TOP:], feature_importance[sorted_idx][-TOP:], align='center')
plt.yticks(np.arange(len(sorted_idx))[-TOP:], np.array(FEATURES)[sorted_idx][-TOP:])
plt.title(f'Feature Importance - Top {TOP}')
plt.show()

# CV Score for CatBoost
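This is the CV score for our CatBoost model. We report the KL divergence on all out-of-fold predictions and, separately, on the high-quality rows only (total voters >= 10). The predictions are first rescaled so that each row sums to 1.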
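
For readers without the `kaggle_kl_div` module, the metric can be approximated with the sketch below. It is not that module's code: the function name, the clipping constant, and the order of clipping and rescaling are our assumptions.

```python
import numpy as np

def mean_kl_divergence(true, pred, eps=1e-15):
    """Row-wise KL(true || pred), averaged over rows (hypothetical helper)."""
    # Keep predictions away from zero, then rescale each row to sum to 1.
    pred = np.clip(pred, eps, 1.0)
    pred = pred / pred.sum(axis=1, keepdims=True)
    # Classes with a true probability of 0 contribute nothing to the sum.
    ratio = np.where(true > 0, true / pred, 1.0)
    return float(np.mean(np.sum(true * np.log(ratio), axis=1)))

# One row whose true label is entirely the first of six classes.
true = np.array([[1.0, 0.0, 0.0, 0.0, 0.0, 0.0]])

# Unnormalised equal predictions; rescaling turns them into 1/6 each.
equal_pred = np.full((1, 6), 0.5)

kl_equal = mean_kl_divergence(true, equal_pred)
kl_perfect = mean_kl_divergence(true, true)
print(kl_equal, kl_perfect)
```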

In [None]:
import sys
sys.path.append('/kaggle/input/kaggle-kl-div')
from kaggle_kl_div import score


oof = pd.DataFrame(all_oof/np.sum(all_oof,axis=1).reshape(-1,1).copy())
oof['id'] = np.arange(len(oof))

true = pd.DataFrame(all_true.copy())
true['id'] = np.arange(len(true))

cv = score(solution=true, submission=oof, row_id_column_name='id')
print('CV Score KL-Div for CatBoost =',cv)

oof2 = pd.DataFrame(all_oof2/np.sum(all_oof2,axis=1).reshape(-1,1).copy())
oof2['id'] = np.arange(len(oof2))

true2 = pd.DataFrame(all_true2.copy())
true2['id'] = np.arange(len(true2))

cv2 = score(solution=true2, submission=oof2, row_id_column_name='id')
print('CV Score KL-Div for CatBoost for high quality data =',cv2)