# Can we predict neutrino movement by connecting two dots?

## Idea

Neutrinos are extremely difficult to detect because they carry no electric charge and interact only very weakly with matter. So what does IceCube actually detect?
It detects the charged particles, such as electrons and muons, that are produced when a neutrino interacts with the ice. When these particles move through the ice faster than light does in that medium, they emit Cherenkov radiation, and that light is what the IceCube sensors record.

Of course, the actual path of the leptons travelling through IceCube is convoluted, which is what makes reconstructing the direction a hard problem.

The idea in this notebook: how well can two dots represent a neutrino's path?

This is a naive approach, but it is worth trying.

## What will we do?

1. Install Polars and use it to speed things up
2. Calculate the MAE (mean angular error) for each pair of sensors in an event
3. Analyze the results
4. Try to predict the neutrino's direction

In [26]:
#!pip install - q polars
!pip install -q ../input/polars01516/typing_extensions-4.4.0-py3-none-any.whl
!pip install -q ../input/polars01516/polars-0.15.16-cp37-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl


In [27]:
import numpy as np
import pandas as pd
import polars as pl
import seaborn as sns
import math

from pathlib import Path
from matplotlib import pyplot as plt
from tqdm import tqdm
from sklearn.model_selection import KFold, GridSearchCV
from catboost import CatBoostRegressor, Pool, metrics, cv

In [28]:
# Configuration
TEST_MODE = False
PATH_INPUT = Path("/kaggle/input/icecube-neutrinos-in-deep-ice")

# Constructing training dataset

* We construct the training dataset by scoring pairs of sensors in each event with their MAE (mean angular error); the best pair in an event is the one with the minimum MAE

1. Take a random batch to train on
2. Generate all possible pairs of points (sensor 1, sensor 2) through which a line can be fitted
3. Join the pairs with the sensor geometry and the event metadata
4. Calculate azimuth and zenith for each pair of dots
5. Calculate the MAE
6. Keep a smaller training sample, only 10k events, to make it time efficient

Everything is done in Polars, chained through the `pipe` function.

In [29]:
np.random.seed(0)
train_batch_id = np.random.choice(range(660))
print('Training batch', train_batch_id)
batch_path = "train/batch_" + str(train_batch_id)+ ".parquet" 
train_batch = pl.scan_parquet(PATH_INPUT / batch_path).lazy()
df_train_meta = pl.scan_parquet(PATH_INPUT / "train_meta.parquet").lazy()
df_sensor_geometry = pl.scan_csv(PATH_INPUT / 'sensor_geometry.csv').with_columns(pl.col('sensor_id').cast(pl.Int16)).lazy()

Training batch 559


In [30]:
def generate_pairs(dataf):
    return dataf.groupby(['event_id']).agg(pl.col('sensor_id').unique().alias('sensor_id_1')).explode(pl.col('sensor_id_1')
        ).join(dataf.groupby(['event_id']).agg(pl.col('sensor_id').unique().alias('sensor_id_2')), on='event_id').explode('sensor_id_2'
        ).unique().filter(pl.col('sensor_id_1') != pl.col('sensor_id_2'))
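
To make the pair generation above concrete, here is a minimal plain-Python sketch of the same idea for a single event. The helper name `ordered_sensor_pairs` and the sensor ids are made up for illustration; the notebook itself does this with a Polars self-join, not with `itertools`.

```python
from itertools import permutations


def ordered_sensor_pairs(sensor_ids):
    """Return every ordered pair (a, b) of distinct sensors hit in one event."""
    unique_ids = sorted(set(sensor_ids))  # a sensor that fired twice counts once
    return list(permutations(unique_ids, 2))


# Hypothetical event in which sensor 7 fired twice.
pairs = ordered_sensor_pairs([3, 7, 7, 12])
print(len(pairs))  # 3 unique sensors -> 3 * 2 = 6 ordered pairs
```

Both (a, b) and (b, a) are kept because the two orderings point in opposite directions. With N unique sensors this gives N * (N - 1) pairs, so an event with 100 sensors already produces 9900 rows.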

In [31]:
def split_by_events(dataf, number_of_events = 0, event_ids = None):
    # Keep either `number_of_events` randomly chosen events or the given `event_ids`.
    event_ids = [] if event_ids is None else event_ids
    assert number_of_events > 0 or len(event_ids) > 0
    if len(event_ids) == 0:
        np.random.seed(0)
        all_event_ids = np.array(dataf.select(pl.col('event_id')).unique().collect()).reshape(-1)
        event_ids = np.random.choice(all_event_ids, size = number_of_events, replace=False)
    # Filter to the selected events and shift time so that every event starts at 0.
    return dataf.filter(
        pl.col('event_id').is_in(pl.Series(event_ids))
    ).with_columns([
        (pl.col('time') - pl.col('time').min()).over('event_id')
    ])

In [32]:
def join_data(dataf, metaf, TEST_MODE=False):
    # Attach the position of both sensors of each pair, then the event metadata.
    joined = dataf.join(df_sensor_geometry,left_on = 'sensor_id_1', right_on ='sensor_id'
         ).join(df_sensor_geometry,left_on = 'sensor_id_2', right_on ='sensor_id'
         ).with_columns([
        pl.col('x').alias('x1'),
        pl.col('y').alias('y1'),
        pl.col('z').alias('z1'),
        pl.col('x_right').alias('x2'),
        pl.col('y_right').alias('y2'),
        pl.col('z_right').alias('z2')
        ]).join(metaf, on='event_id')

    if TEST_MODE:
        # The metadata contains the true direction: expose it under explicit names.
        joined = joined.with_columns([
            pl.col('azimuth').alias('az_true'),
            pl.col('zenith').alias('zen_true')
        ])
    return joined

In [33]:
def normalize(dataf):
    return dataf.with_columns([
    (pl.col('x1') - pl.col('x2')).alias('x'),
    (pl.col('y1') - pl.col('y2')).alias('y'),
    (pl.col('z1') - pl.col('z2')).alias('z'),
    ]).with_columns([
    pl.col('x') / (pl.col('x') ** 2 + pl.col('y') ** 2 + pl.col('z') ** 2) ** 0.5,
    pl.col('y') / (pl.col('x') ** 2 + pl.col('y') ** 2 + pl.col('z') ** 2) ** 0.5,
    pl.col('z') / (pl.col('x') ** 2 + pl.col('y') ** 2 + pl.col('z') ** 2) ** 0.5
    ])


def add_azimuth_zenith(dataf):
    return dataf.with_columns([
    pl.col('z').arccos().alias('zenith')
]).with_columns([
    pl.when(pl.col("z").round(2).abs() == 1).then(0).otherwise((pl.col('x') / (pl.col('zenith').sin())).arccos()).alias('azimuth')
]).with_columns([
        pl.col('azimuth').fill_nan(0).alias('az_pred'),
        pl.col('zenith').fill_nan(0).alias('zen_pred')
    ])
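
As a cross-check of the two functions above, here is a minimal plain-Python sketch that turns two points into a direction and then into angles. Note that it is not the notebook's formula: `arccos(x / sin(zenith))` only returns values in [0, π], so the sign of `y` is lost, whereas this sketch swaps in `atan2` to cover the full [0, 2π) range. The function name `direction_to_angles` is hypothetical.

```python
import math


def direction_to_angles(p1, p2):
    """Azimuth and zenith of the unit vector pointing from p2 to p1."""
    dx, dy, dz = (a - b for a, b in zip(p1, p2))
    norm = math.sqrt(dx * dx + dy * dy + dz * dz)
    if norm == 0:
        raise ValueError("the two points must differ")
    # Clamp to [-1, 1] so rounding noise cannot push acos out of its domain.
    zenith = math.acos(max(-1.0, min(1.0, dz / norm)))
    # atan2 keeps the sign of dy; the modulo maps (-pi, pi] onto [0, 2*pi).
    azimuth = math.atan2(dy, dx) % (2 * math.pi)
    return azimuth, zenith


# A direction along the negative y axis: azimuth 3*pi/2, zenith pi/2.
az, zen = direction_to_angles((0.0, -1.0, 0.0), (0.0, 0.0, 0.0))
print(az, zen)
```

For a purely vertical direction `atan2(0, 0)` returns 0, which matches the notebook's choice of setting the azimuth to 0 when `|z|` is 1.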

In [34]:
def calculate_mae(dataf):
    return dataf.with_columns([
    pl.col('az_true').sin().alias('sa1'),
    pl.col('az_true').cos().alias('ca1'),
    pl.col('zen_true').sin().alias('sz1'),
    pl.col('zen_true').cos().alias('cz1'),
    
    pl.col('az_pred').sin().alias('sa2'),
    pl.col('az_pred').cos().alias('ca2'),
    pl.col('zen_pred').sin().alias('sz2'),
    pl.col('zen_pred').cos().alias('cz2')
]).with_columns([
        (pl.col('sz1')*pl.col('sz2')*(pl.col('ca1')*pl.col('ca2') + pl.col('sa1')*pl.col('sa2')) + (pl.col('cz1')*pl.col('cz2'))).arccos().abs().alias('mae')
])
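
The expression in `calculate_mae` is the angle between two unit vectors given in spherical coordinates. A minimal scalar sketch of the same formula is shown below; the function name `angular_error` is hypothetical, and the clamping step is an addition that the Polars version above does not have.

```python
import math


def angular_error(az_true, zen_true, az_pred, zen_pred):
    """Angle in radians between the true and the predicted direction."""
    sa1, ca1 = math.sin(az_true), math.cos(az_true)
    sz1, cz1 = math.sin(zen_true), math.cos(zen_true)
    sa2, ca2 = math.sin(az_pred), math.cos(az_pred)
    sz2, cz2 = math.sin(zen_pred), math.cos(zen_pred)
    # Dot product of the two unit vectors.
    cos_angle = sz1 * sz2 * (ca1 * ca2 + sa1 * sa2) + cz1 * cz2
    # Rounding can give values slightly outside [-1, 1], where acos is undefined.
    cos_angle = max(-1.0, min(1.0, cos_angle))
    return abs(math.acos(cos_angle))


# Opposite directions in the horizontal plane are pi radians apart.
print(angular_error(0.0, math.pi / 2, math.pi, math.pi / 2))
```

Without the clamp, a pair that matches the true direction almost exactly can produce a value marginally above 1 and therefore a NaN error.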

In [35]:
def generate_features(dataf,number_of_events=0, event_ids=[]):
    return dataf.pipe(split_by_events, number_of_events, event_ids).with_columns([
    pl.col('time').min().over(['event_id','sensor_id']).alias('sensor_min_time'),
    pl.col('time').mean().over(['event_id','sensor_id']).alias('sensor_mean_time'),
    pl.col('time').max().over(['event_id','sensor_id']).alias('sensor_max_time'),
    pl.col('charge').min().over(['event_id','sensor_id']).alias('sensor_min_charge'),
    pl.col('charge').mean().over(['event_id','sensor_id']).alias('sensor_mean_charge'),
    pl.col('charge').max().over(['event_id','sensor_id']).alias('sensor_max_charge'),
    pl.col('time').count().over(['event_id','sensor_id']).alias('sensor_count'),
    pl.col('auxiliary').sum().over(['event_id','sensor_id']).alias('sensor_aux_sum'),
    pl.col('auxiliary').sum().over(['event_id']).alias('overall_aux_sum'),
    pl.col('time').count().over(['event_id']).alias('overall_count'),
    pl.col('time').max().over(['event_id']).alias('time_overall'),
    pl.col('charge').max().over(['event_id']).alias('max_charge_overall'),
    pl.col('charge').mean().over(['event_id']).alias('mean_charge_overall'),
    pl.col('charge').min().over(['event_id']).alias('min_charge_overall')
]).with_columns([
    pl.when(pl.col('time_overall') != 0).then(pl.col('sensor_min_time') / pl.col('time_overall')).otherwise(1).alias('sensor_min_time_ratio'),
    pl.when(pl.col('time_overall') != 0).then(pl.col('sensor_mean_time') / pl.col('time_overall')).otherwise(1).alias('sensor_mean_time_ratio'),
    pl.when(pl.col('time_overall') != 0).then(pl.col('sensor_max_time') / pl.col('time_overall')).otherwise(1).alias('sensor_max_time_ratio'),
    pl.when(pl.col('overall_count') != 0).then(pl.col('sensor_count') / pl.col('overall_count')).otherwise(1).alias('sensor_count_ratio'),
    pl.when(pl.col('overall_count') != 0).then(pl.col('sensor_aux_sum') / pl.col('overall_count')).otherwise(1).alias('sensor_aux_count_ratio'),
    pl.when(pl.col('mean_charge_overall') != 0).then(pl.col('sensor_min_charge') / pl.col('mean_charge_overall')).otherwise(1).alias('sensor_min_charge_ratio'),
    pl.when(pl.col('mean_charge_overall') != 0).then(pl.col('sensor_mean_charge') / pl.col('mean_charge_overall')).otherwise(1).alias('sensor_mean_charge_ratio'),
    pl.when(pl.col('mean_charge_overall') != 0).then(pl.col('sensor_max_charge') / pl.col('mean_charge_overall')).otherwise(1).alias('sensor_max_charge_ratio')
]).groupby(['event_id', 'sensor_id']).agg([
    pl.col('sensor_min_time').first(),
    pl.col('sensor_mean_time').first(),
    pl.col('sensor_max_time').first(),
    pl.col('sensor_min_charge').first(),
    pl.col('sensor_mean_charge').first(),
    pl.col('sensor_max_charge').first(),
    pl.col('sensor_count').first(),
    pl.col('sensor_aux_sum').first(),
    pl.col('overall_aux_sum').first(),
    pl.col('overall_count').first(),
    pl.col('time_overall').first(),
    pl.col('max_charge_overall').first(),
    pl.col('mean_charge_overall').first(),
    pl.col('min_charge_overall').first(),
    pl.col('sensor_min_time_ratio').first(),
    pl.col('sensor_mean_time_ratio').first(),
    pl.col('sensor_max_time_ratio').first(),
    pl.col('sensor_aux_count_ratio').first(),
    pl.col('sensor_count_ratio').first(),
    (pl.col('sensor_aux_sum').first() / (pl.col('overall_aux_sum').first() + 1)).alias('sensor_aux_ratio'),
    pl.col('sensor_min_charge_ratio').first(),
    pl.col('sensor_mean_charge_ratio').first(),
    pl.col('sensor_max_charge_ratio').first()
])


In [36]:
def join_features(dataf,initial_df,number_of_events=0, event_ids = []):
    columns_to_return = [
             'event_id',
             'batch_id',
             'x',
             'y',
             'z',
             'x1',
             'y1',
             'z1',
             'x2',
             'y2',
             'z2',
             'az_pred',
             'zen_pred',
             'sensor_min_time',
             'sensor_mean_time',
             'sensor_max_time',
             'sensor_min_charge',
             'sensor_mean_charge',
             'sensor_max_charge',
             'sensor_count',
             'sensor_aux_sum',
             'overall_aux_sum',
             'overall_count',
             'time_overall',
             'max_charge_overall',
             'mean_charge_overall',
             'min_charge_overall',
             'sensor_min_time_ratio',
             'sensor_mean_time_ratio',
             'sensor_max_time_ratio',
             'sensor_count_ratio',
             'sensor_aux_ratio',
             'sensor_aux_count_ratio',
             'sensor_min_charge_ratio',
             'sensor_mean_charge_ratio',
             'sensor_max_charge_ratio',
             'sensor_min_time_right',
             'sensor_mean_time_right',
             'sensor_max_time_right',
             'sensor_min_charge_right',
             'sensor_mean_charge_right',
             'sensor_max_charge_right',
             'sensor_count_right',
             'sensor_aux_sum_right',
             'sensor_min_time_ratio_right',
             'sensor_mean_time_ratio_right',
             'sensor_max_time_ratio_right',
             'sensor_count_ratio_right',
             'sensor_aux_count_ratio_right',
             'sensor_aux_ratio_right',
             'sensor_min_charge_ratio_right',
             'sensor_mean_charge_ratio_right',
             'sensor_max_charge_ratio_right',
        ]
    if len(event_ids) == 0:
        columns_to_return = columns_to_return + ['mae']

    # Build the per-sensor features once and join them for both sensors of the pair.
    features = initial_df.pipe(generate_features, number_of_events, event_ids)
    return dataf.join(
        features, left_on=['event_id', 'sensor_id_1'], right_on=['event_id', 'sensor_id']).join(
        features, left_on=['event_id', 'sensor_id_2'], right_on=['event_id', 'sensor_id']
    ).select(columns_to_return)

In [37]:
def make_smaller_sample(dataf, size = 10):
    # Per event, keep up to `size` random pairs plus the best and the worst pair.
    return dataf.filter(
    (pl.arange(0, pl.count()).shuffle(seed=0).over("event_id") < size) |
    (pl.col('mae') == pl.col('mae').min().over(pl.col('event_id'))) |
    (pl.col('mae') == pl.col('mae').max().over(pl.col('event_id')))
).unique()
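
The sampling rule above can be illustrated for a single event with a short plain-Python sketch. The function name `sample_pairs` and the error values are made up; unlike the Polars filter, which keeps every pair tied for the minimum or maximum, this sketch keeps only one of each.

```python
import random


def sample_pairs(errors, k=10, seed=0):
    """Indices of up to k random pairs plus the best and the worst pair."""
    rng = random.Random(seed)
    indices = list(range(len(errors)))
    keep = set(rng.sample(indices, min(k, len(indices))))
    keep.add(min(indices, key=errors.__getitem__))  # pair with the lowest error
    keep.add(max(indices, key=errors.__getitem__))  # pair with the highest error
    return sorted(keep)


# Hypothetical angular errors (radians) of six pairs from one event.
errors = [0.5, 0.1, 2.0, 1.2, 0.9, 0.3]
kept = sample_pairs(errors, k=2)
print(kept)
```

Always keeping the extremes means the regressor sees both a very good and a very bad pair for every event, even when the random draw misses them.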

In [38]:
number_of_events=10000
train_data = train_batch.pipe(split_by_events, number_of_events
                       ).pipe(generate_pairs
                       ).pipe(join_data, df_train_meta, True
                       ).pipe(normalize
                       ).pipe(add_azimuth_zenith
                       ).pipe(calculate_mae
                       ).pipe(make_smaller_sample
                       ).pipe(join_features,train_batch, number_of_events)

# Hyperparameter tuning

* We have to convert the dataset to NumPy in order to train our CatBoost model
* Hyperparameter tuning (number of iterations and learning rate) was done beforehand; the values used below are the result

In [39]:
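# The slicing below treats the first axis of the array as columns.
train_sample_np = np.array(train_data.collect())
# Features: everything except event_id, batch_id (first two) and the target (last).
features_train = train_sample_np[2:-1].T
# Target: the last column, 'mae'.
target_train = train_sample_np[-1]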

del train_sample_np
del train_data

In [40]:
model_catboost = CatBoostRegressor(
    iterations=3300,
    loss_function ='RMSE',
    learning_rate = 0.3,
    random_seed = 1,
    od_type = "Iter",
    od_wait = 200,
    depth = 6,
    #task_type = "GPU",
    #devices = '0:1',
    save_snapshot = False,
)

model_catboost.fit(
    features_train, target_train,
    verbose=1000,
);

0:	learn: 0.8957311	total: 65.3ms	remaining: 3m 35s
1000:	learn: 0.5471131	total: 49.5s	remaining: 1m 53s
2000:	learn: 0.4718790	total: 1m 37s	remaining: 1m 3s
3000:	learn: 0.4157862	total: 2m 26s	remaining: 14.6s
3299:	learn: 0.4016118	total: 2m 41s	remaining: 0us


# Fit test data

* Predicting on the test data. The events of each test batch are split into chunks (20 for the real test set, 50 in `TEST_MODE`) in order to fit within the memory limit
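
The chunking is done with `np.array_split` in the loop further down. As a minimal sketch of what that split produces, here is an equivalent in plain Python; the helper name `split_into_chunks` is hypothetical.

```python
def split_into_chunks(items, n_chunks):
    """Split items into n_chunks nearly equal parts, larger parts first."""
    base, extra = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        size = base + (1 if i < extra else 0)  # first `extra` chunks get one more
        chunks.append(items[start:start + size])
        start += size
    return chunks


print(split_into_chunks(list(range(7)), 3))  # [[0, 1, 2], [3, 4], [5, 6]]
print(split_into_chunks([1, 2, 3], 5))       # [[1], [2], [3], [], []]
```

When there are fewer events than chunks, the trailing chunks are empty, which is why the loop can stop at the first empty chunk.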

In [41]:
TEST_MODE == False

True

In [42]:
if TEST_MODE == True:
    df_test_meta = pl.scan_parquet(PATH_INPUT /  'train_meta.parquet')
    test_batch_ids = [120, 121, 122]
else:
    df_test_meta = pl.scan_parquet(PATH_INPUT /  'test_meta.parquet')
    test_batch_ids = [test_event_id for test_event_id in np.array(df_test_meta.select('batch_id').unique().collect()).reshape(-1)]

In [43]:
def load_test_batch(test_batch_id):
    if TEST_MODE:
        test_batch = pl.scan_parquet(PATH_INPUT / ('train/batch_' + str(test_batch_id) +'.parquet'))
    else:
        test_batch = pl.scan_parquet(PATH_INPUT / ('test/batch_' + str(test_batch_id) +'.parquet'))
    return test_batch

In [44]:
def predict_azimuth_zenith(dataf):
    features_test = np.array(dataf)[:][2:].T
    return dataf.with_columns([
    pl.Series(model_catboost.predict(features_test)).alias('mae_predict')
    ]).select([
    pl.col(['event_id', 'az_pred', 'zen_pred', 'mae_predict']).sort_by(by=[pl.col('mae_predict')]).head(1).list().over(pl.col('event_id')).flatten()
]).join(df_train_meta, on = 'event_id').with_columns([
    pl.col(['event_id', 'az_pred', 'zen_pred', 'mae_predict']),
    pl.col('azimuth').alias('az_true'),
    pl.col('zenith').alias('zen_true')])

In [45]:
def predict_azimuth_zenith_test(dataf):
    features_test = np.array(dataf)[:][2:].T
    return dataf.with_columns([
    pl.Series(model_catboost.predict(features_test)).alias('mae_predict')
    ]).select([
    pl.col(['event_id', 'az_pred', 'zen_pred', 'mae_predict']).sort_by(by=[pl.col('mae_predict')]).head(1).list().over(pl.col('event_id')).flatten()])
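
Both prediction functions above follow the same rule: for every event, keep the pair whose predicted error is the smallest and use its angles as the answer. A minimal plain-Python sketch of that selection is shown below; the function name `best_pair_per_event` and the rows are made up for illustration.

```python
def best_pair_per_event(rows):
    """Map each event_id to its row with the lowest predicted error."""
    best = {}
    for row in rows:
        current = best.get(row["event_id"])
        if current is None or row["mae_predict"] < current["mae_predict"]:
            best[row["event_id"]] = row
    return best


# Hypothetical candidate pairs for two events.
rows = [
    {"event_id": 1, "az_pred": 0.4, "zen_pred": 1.0, "mae_predict": 1.3},
    {"event_id": 1, "az_pred": 2.1, "zen_pred": 0.7, "mae_predict": 0.6},
    {"event_id": 2, "az_pred": 3.0, "zen_pred": 1.9, "mae_predict": 0.9},
]
best = best_pair_per_event(rows)
print(best[1]["az_pred"], best[1]["zen_pred"])  # 2.1 0.7
```

The final direction is therefore never better than the best available pair, so the quality of the result depends on whether a good pair survives the sampling to 100 pairs per event.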

In [46]:
def make_smaller_sample_test(dataf):
    return dataf.filter(
    (pl.arange(0, pl.count()).shuffle(seed=0).over("event_id") < 100))

In [49]:
result_df = pl.DataFrame()
number_of_events = 0

if TEST_MODE == True:
    for batch_sample_id in test_batch_ids:
        all_event_ids = np.array((df_test_meta.filter(pl.col('batch_id') == batch_sample_id)).select(pl.col('event_id')).unique().collect()).reshape(-1)
        all_event_ids = np.array_split(all_event_ids, 50)
        for i, batch_event_ids in tqdm(enumerate(all_event_ids)):
            test_batch = load_test_batch(batch_sample_id)
            test_batch = test_batch.pipe(split_by_events, number_of_events, batch_event_ids
                                        ).pipe(generate_pairs
                                   ).pipe(join_data, df_test_meta
                                   ).pipe(normalize
                                   ).pipe(add_azimuth_zenith
                                   ).pipe(join_features,load_test_batch(batch_sample_id), number_of_events, batch_event_ids
                                   ).pipe(make_smaller_sample_test).collect()
            test_batch = test_batch.pipe(predict_azimuth_zenith
                                   ).pipe(calculate_mae).select(['event_id', 'az_pred', 'zen_pred', 'mae'])

            if len(result_df) == 0:
                result_df = test_batch
            else:
                result_df = pl.concat([result_df, test_batch])
            del test_batch
            
if TEST_MODE == False:
        for batch_sample_id in test_batch_ids:
            all_event_ids = np.array((df_test_meta.filter(pl.col('batch_id') == batch_sample_id)).select(pl.col('event_id')).unique().collect()).reshape(-1)
            all_event_ids = np.array_split(all_event_ids, 20)
            for i, batch_event_ids in tqdm(enumerate(all_event_ids)):
                if len(batch_event_ids) == 0:
                    break
                test_batch = load_test_batch(batch_sample_id)
                test_batch = test_batch.pipe(split_by_events, number_of_events, batch_event_ids
                                            ).pipe(generate_pairs
                                       ).pipe(join_data, df_test_meta
                                       ).pipe(normalize
                                       ).pipe(add_azimuth_zenith
                                       ).pipe(join_features,load_test_batch(batch_sample_id), number_of_events, batch_event_ids
                                       ).pipe(make_smaller_sample_test).collect()
                test_batch = test_batch.pipe(predict_azimuth_zenith_test
                                       ).select([
                                        pl.col('event_id'), 
                                        pl.col('az_pred').alias('azimuth'), 
                                        pl.col('zen_pred').alias('zenith')])

                if len(result_df) == 0:
                    result_df = test_batch
                else:
                    result_df = pl.concat([result_df, test_batch])
                del test_batch
    

3it [00:00, 19.74it/s]


In [50]:
result_df.sort('event_id').write_csv('submission.csv')

In [51]:
pd.read_csv('submission.csv')

|   | event_id | azimuth | zenith |
|---|----------|----------|----------|
| 0 | 2092 | 1.447789 | 0.763679 |
| 1 | 7344 | 1.731352 | 0.994442 |
| 2 | 9482 | 0.337664 | 1.885815 |


# Conclusion

* In this exercise we tried a naive two-dots method. It has many limitations, but the main one is computational complexity: the number of sensor pairs grows as N^2 with the number of sensors hit in an event, which makes the dataset too heavy to manage within the Kaggle resource limits.
* The idea was to experiment and see how well it could do. It obviously doesn't do well; however, it was good practice with the Polars library and it provides some kind of baseline.