# Solution

## Data loading

We've already determined which pixels are "interesting" in the [previous notebook](Interesting.ipynb).

In [172]:
import pandas as pd

interesting = pd.read_pickle('data/interesting.pkl')
interesting.head()

part,sequence,frame,r,c,area,eccentricity,solidity,is_satellite
test,1,1,7.216981,339.245283,106,0.986314,0.848,
test,1,1,10.751724,264.551724,145,0.980358,0.843023,
test,1,1,18.708333,40.666667,144,0.980083,0.862275,
test,1,1,20.5,462.827869,122,0.985141,0.877698,
test,1,1,26.525424,65.909605,177,0.993053,0.811927,


## Feature extraction

In [180]:
import numpy as np
from scipy import stats

def region(img: np.ndarray, r: int, c: int, w: int):
    """Return the square window of side 2 * w + 1 centered on (r, c), clipped to the image bounds."""
    return img[
        max(r - w, 0) : min(r + w + 1, img.shape[0]),
        max(c - w, 0) : min(c + w + 1, img.shape[1])
    ]

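def extract_features(img, r, c):
    # `w` is a half-width, so these windows are 7x7, 11x11 and 15x15 pixels;
    # the '3x3', '5x5' and '7x7' prefixes are kept only as feature labels.
    r3x3 = region(img, r, c, 3).ravel()
    r5x5 = region(img, r, c, 5).ravel()
    r7x7 = region(img, r, c, 7).ravel()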
    val = img[r, c]
    return {
        'pixel_value': val,
        '3x3_std': r3x3.std(),
        '3x3_min': val - r3x3.min(),
        '3x3_max': val - r3x3.max(),
        '5x5_std': r5x5.std(),
        '5x5_entropy': stats.entropy(r5x5),
        '5x5_min': val - r5x5.min(),
        '5x5_max': val - r5x5.max(),
        '7x7_std': r7x7.std(),
        '7x7_entropy': stats.entropy(r7x7),
        '7x7_kurtosis': stats.kurtosis(r7x7),
        '7x7_skew': stats.skew(r7x7)
    }
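
As a quick check on what `region` returns, the sketch below re-declares it and inspects window shapes on a synthetic image (the 20x30 array is an assumption made purely for illustration). Because `w` is a half-width, `w=3` gives a 7x7 window away from the borders, and the slices are clipped near the edges. It also shows that `stats.entropy` normalizes its input to sum to 1, so a perfectly flat window of n pixels scores ln(n).

```python
import math

import numpy as np
from scipy import stats


def region(img: np.ndarray, r: int, c: int, w: int) -> np.ndarray:
    """Square window of side 2 * w + 1 centered on (r, c), clipped to the image."""
    return img[
        max(r - w, 0) : min(r + w + 1, img.shape[0]),
        max(c - w, 0) : min(c + w + 1, img.shape[1]),
    ]


# Synthetic image, purely for illustration.
img = np.arange(20 * 30, dtype=np.float32).reshape(20, 30)

interior = region(img, r=10, c=15, w=3)  # full window, far from any border
corner = region(img, r=0, c=0, w=3)      # clipped at the top-left corner
print(interior.shape, corner.shape)

# A flat 11x11 window: entropy of the normalized values equals ln(121).
flat = np.ones(11 * 11)
print(stats.entropy(flat), math.log(flat.size))
```

This appears consistent with the `5x5_entropy` and `7x7_entropy` columns in the output further down, which mostly sit just under ln(121) ≈ 4.80 and ln(225) ≈ 5.42, and drop for regions close to the image border.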

Extract features for each interesting region.

In [181]:
import tqdm
from PIL import Image

samples = {}

# To also extract features for the test part, iterate over every part instead:
#for (part, sequence, frame), locations in tqdm.tqdm(interesting.groupby(['part', 'sequence', 'frame']), position=0):

for (sequence, frame), locations in tqdm.tqdm(interesting.loc['train'].groupby(['sequence', 'frame']), position=0):
    part = 'train'

    img = np.asarray(Image.open(f'data/spotGEO/{part}/{sequence}/{frame}.png')).astype(np.float32)
    
    for _, location in locations.iterrows():
    
        r = int(location['r'])
        c = int(location['c'])

        samples[part, sequence, frame, r, c] = {
            'is_satellite': location['is_satellite'],
            'area': location['area'],
            'eccentricity': location['eccentricity'],
            'solidity':  location['solidity'],
            **extract_features(img, r=r, c=c)
        }
        
samples = pd.DataFrame.from_dict(samples, orient='index')
samples.index.names = ['part', 'sequence', 'frame', 'r', 'c']
samples.head()

100%|██████████| 6400/6400 [23:26<00:00,  4.55it/s]  


part,sequence,frame,r,c,is_satellite,area,eccentricity,solidity,pixel_value,3x3_std,3x3_min,3x3_max,5x5_std,5x5_entropy,5x5_min,5x5_max,7x7_std,7x7_entropy,7x7_kurtosis,7x7_skew
train,1,1,1,227,False,32,0.887012,0.969697,50.0,3.431665,9.0,-5.0,4.20707,4.339688,12.0,-5.0,4.304743,4.900646,-0.715237,0.476524
train,1,1,7,233,False,5,0.0,1.0,51.0,3.382095,14.0,0.0,4.085722,4.791327,17.0,-4.0,4.129508,5.411494,0.165889,0.777811
train,1,1,13,158,False,53,0.98519,0.670886,55.0,4.588253,16.0,-2.0,4.419131,4.791466,16.0,-2.0,4.349029,5.41181,-0.129127,0.799852
train,1,1,10,254,False,5,0.0,1.0,51.0,3.561912,16.0,0.0,3.106009,4.792973,16.0,0.0,2.924519,5.413541,0.321749,0.694747
train,1,1,22,168,False,8,0.57735,1.0,55.0,3.580688,14.0,-1.0,4.182978,4.791868,16.0,-3.0,4.101444,5.412225,-0.130888,0.729648
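
The `samples` table above relies on pandas turning tuple keys into a `MultiIndex`. A minimal sketch of that pattern, using made-up feature values (the three rows and their numbers are hypothetical):

```python
import pandas as pd

# Hypothetical feature values keyed by (part, sequence, frame, r, c).
rows = {
    ('train', 1, 1, 1, 227): {'area': 32, 'pixel_value': 50.0},
    ('train', 1, 1, 7, 233): {'area': 5, 'pixel_value': 51.0},
    ('train', 1, 2, 4, 100): {'area': 12, 'pixel_value': 48.0},
}

# Tuple keys become the levels of a MultiIndex when orient='index'.
df = pd.DataFrame.from_dict(rows, orient='index')
df.index.names = ['part', 'sequence', 'frame', 'r', 'c']

# Selecting on the first level drops it, as with samples.loc['train'].
train = df.loc['train']
print(train.index.names)

# A partial key selects every region of one frame.
frame_regions = df.loc[('train', 1, 1)]
print(frame_regions)
```

Keying by the full location keeps each region traceable back to its image, which is what allows predictions to be written out per sequence and frame later on.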


## Learning phase

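Split the samples into train and test parts. The test part only exists if features were extracted for it, hence the `KeyError` guard.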

In [182]:
from sklearn import utils

X_train = samples.loc['train'].copy()
y_train = X_train.pop('is_satellite').astype(bool)
X_train, y_train = utils.shuffle(X_train, y_train, random_state=42)

try:
    X_test = samples.loc['test'].drop(columns='is_satellite')
except KeyError:
    X_test = None

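Do the LGBM CV dance: 5-fold cross-validation grouped by sequence, so that regions from the same sequence never end up on both sides of a fit/validation split.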

In [183]:
import lightgbm
from sklearn import metrics
from sklearn import model_selection

model = lightgbm.LGBMClassifier(
    num_leaves=2 ** 5,
    metric='binary',
    random_state=42,
    n_estimators=10_000
)

cv = model_selection.GroupKFold(n_splits=5)
groups = X_train.index.get_level_values('sequence')

oof = pd.Series(dtype=bool, index=X_train.index)
if X_test is not None:
    y_test = pd.DataFrame(index=X_test.index)

for i, (fit_idx, val_idx) in enumerate(cv.split(X_train, y_train, groups=groups)):
    
    X_fit = X_train.iloc[fit_idx]
    y_fit = y_train.iloc[fit_idx]
    X_val = X_train.iloc[val_idx]
    y_val = y_train.iloc[val_idx]
    
    model.fit(
        X_fit, y_fit,
        eval_set=[(X_fit, y_fit), (X_val, y_val)],
        eval_names=['fit', 'val'],
        early_stopping_rounds=20,
        verbose=10
    )
    oof.iloc[val_idx] = model.predict(X_val)
    
    if X_test is not None:
        y_test[i] = model.predict_proba(X_test)[:, 1]
    
    print()

print(metrics.classification_report(y_train, oof, digits=4))

Training until validation scores don't improve for 20 rounds
[10]	fit's binary_logloss: 0.0190191	val's binary_logloss: 0.0225442
[20]	fit's binary_logloss: 0.015414	val's binary_logloss: 0.0198047
[30]	fit's binary_logloss: 0.0137132	val's binary_logloss: 0.0188865
[40]	fit's binary_logloss: 0.0136892	val's binary_logloss: 0.019299
[50]	fit's binary_logloss: 0.0127296	val's binary_logloss: 0.0199022
Early stopping, best iteration is:
[34]	fit's binary_logloss: 0.0131033	val's binary_logloss: 0.0186464

Training until validation scores don't improve for 20 rounds
[10]	fit's binary_logloss: 0.0194098	val's binary_logloss: 0.0215464
[20]	fit's binary_logloss: 0.0155941	val's binary_logloss: 0.0184621
[30]	fit's binary_logloss: 0.0141205	val's binary_logloss: 0.0178726
[40]	fit's binary_logloss: 0.0128171	val's binary_logloss: 0.0176393
Early stopping, best iteration is:
[26]	fit's binary_logloss: 0.0143031	val's binary_logloss: 0.0175811

Training until validation scores don't improve fo

In [None]:
              precision    recall  f1-score   support

       False     0.9967    0.9988    0.9978    657058
        True     0.8789    0.7166    0.7895      7691

    accuracy                         0.9956    664749
   macro avg     0.9378    0.8577    0.8936    664749
weighted avg     0.9953    0.9956    0.9954    664749
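
The grouped split used above can be checked in isolation. The sketch below uses hypothetical data (12 samples from 6 made-up sequences) and verifies that `GroupKFold` never places the same group in both the fit and validation indices of a fold:

```python
import numpy as np
from sklearn import model_selection

# Hypothetical data: 12 samples drawn from 6 sequences, 2 samples each.
groups = np.repeat(np.arange(6), 2)
X = np.zeros((len(groups), 3))
y = np.zeros(len(groups), dtype=bool)

cv = model_selection.GroupKFold(n_splits=3)
overlaps = []
for fit_idx, val_idx in cv.split(X, y, groups=groups):
    # Sequences that appear on both sides of the split (should be none).
    shared = set(groups[fit_idx]) & set(groups[val_idx])
    overlaps.append(shared)
    print(sorted(set(groups[val_idx])), shared)
```

Without grouping, near-identical regions from consecutive frames of one sequence could leak between fit and validation sets and make the out-of-fold scores look better than they are.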

Feature importances of the model fitted on the last fold (split counts, LightGBM's default importance type).

In [197]:
pd.Series(model.feature_importances_, index=X_train.columns).sort_values(ascending=False)

7x7_kurtosis    164
7x7_std         102
7x7_skew         84
area             84
7x7_entropy      81
3x3_std          81
eccentricity     79
pixel_value      71
5x5_std          53
3x3_min          49
5x5_entropy      39
5x5_min          38
solidity         29
3x3_max          20
5x5_max          18
dtype: int32

## Out-of-fold predictions

In [None]:
%run toolbox.py

In [211]:
oof.head()

sequence  frame  r    c  
785       2      82   179    False
466       1      246  519    False
692       1      334  211    False
412       5      467  167    False
676       4      60   105    False
dtype: bool

In [222]:
save_predictions(oof, 'oof.json')

100%|██████████| 6400/6400 [00:03<00:00, 1853.95it/s]


In [213]:
!python validation.py oof.json data/spotGEO/train_anno.json

libc++abi.dylib: terminating with uncaught exception of type std::runtime_error: Couldn't close file


## Test predictions

In [215]:
y_test.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Unnamed: 3_level_0,0,1,2,3,4
sequence,frame,r,c,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
1,1,7,339,0.000505,0.000618,0.000417,0.00136,0.000273
1,1,10,264,0.003549,0.002715,0.003404,0.00272,0.00134
1,1,18,40,0.003549,0.002715,0.004508,0.003282,0.001752
1,1,20,462,0.000723,0.001682,0.000763,0.001753,0.000796
1,1,26,65,0.002998,0.001684,0.004473,0.002475,0.001141


In [216]:
import zipfile

save_predictions(y_test.mean(axis='columns') > .5, 'submission.json')

with zipfile.ZipFile('submission.zip', mode='w') as f:
    f.write('submission.json')

100%|██████████| 25600/25600 [00:14<00:00, 1727.08it/s]
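
The submission averages the per-fold probabilities and thresholds the mean at 0.5. A small sketch of that step with made-up probabilities (the three rows are hypothetical, not taken from `y_test`):

```python
import pandas as pd

# Hypothetical per-fold probabilities for three candidate regions.
fold_probs = pd.DataFrame({
    0: [0.10, 0.60, 0.90],
    1: [0.20, 0.40, 0.80],
    2: [0.05, 0.70, 0.95],
})

# Average across folds (one column per fold), then threshold.
mean_prob = fold_probs.mean(axis='columns')
is_satellite = mean_prob > 0.5
print(pd.DataFrame({'mean_prob': mean_prob, 'is_satellite': is_satellite}))
```

Averaging probabilities rather than hard votes keeps the threshold adjustable, which may matter here given that recall on the positive class is noticeably lower than precision in the out-of-fold report.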


Next is [PostProcessing.ipynb](PostProcessing.ipynb).