In [None]:
import random

import numpy as np  # needed for np.log in the histogram cell below
import pandas as pd

from jsontodf import *
from scipy import stats
import seaborn as sns

## Data exploration and analysis
This notebook aims to test some basic aspects of the data, as well as the pipeline that parses the JSON files into a ``DataFrame``. First, a few sanity checks:

In [None]:
test_emissions_df, test_no_emissions_df = jsontodf("../json_data/json_misc/compressed_test_json.json", two_returns=True)

In [None]:
test_emissions_df

In [None]:
test_no_emissions_df

In [None]:
test_events_df = jsontodf("../json_data/json_misc/compressed_test_json.json", two_returns=False)

In [None]:
test_events_df

### Time for the real deal
Now let's study a few aspects of one of the given datasets, which describes interactions in the lung!

In [None]:
%%time
lung_df = pd.read_pickle("../pickled_data/lung_dataset.pkl")

In [None]:
lung_df.head(50)

We first turn our attention to the distribution of emissions.

In [None]:
lung_total = len(lung_df)

In [None]:
emissions_total = len(lung_df[lung_df["emission"] == 1])

In [None]:
p_emission = emissions_total/lung_total
print(f"Probability of an emission is {p_emission}.")
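Since ``emission`` is a 0/1 indicator, the same probability can also be read off as the column mean. A minimal sketch on a toy frame (``toy_df`` and its values are made up for illustration and are not taken from the lung dataset):

```python
import pandas as pd

# Toy stand-in for lung_df; the values are illustrative only.
toy_df = pd.DataFrame({"emission": [1, 0, 0, 1, 0, 0, 0, 1]})

# Counting rows, as in the cells above.
p_by_count = len(toy_df[toy_df["emission"] == 1]) / len(toy_df)

# The mean of a 0/1 indicator is the same proportion.
p_by_mean = toy_df["emission"].mean()

print(f"By count: {p_by_count}, by mean: {p_by_mean}")
```

Both forms give the same number; the mean is just shorter and avoids building an intermediate filtered frame.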

Is there a correlation between incoming particle parameters and the presence (or not) of an emission?

In [None]:
stats.spearmanr(lung_df["en_p"], lung_df["emission"])

In [None]:
stats.spearmanr(lung_df["dist_p"], lung_df["emission"])

Notice the extremely small p-values (reported as 0, i.e. below floating-point precision), a consequence of the size of our dataset. We can therefore confidently reject the hypothesis that incoming energy and travelled distance are uncorrelated with emission, which makes sense.
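A word of caution on reading these p-values: with this many rows, even a very weak monotonic relationship yields a p-value close to 0, so the correlation coefficient itself is the more informative number. A small sketch on synthetic data (the sample size, the 0.02 slope and the seed are arbitrary assumptions, unrelated to the lung dataset):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 200_000

# Synthetic data with a deliberately weak dependence between x and y.
x = rng.normal(size=n)
y = 0.02 * x + rng.normal(size=n)

# spearmanr returns a tuple-like (correlation, p-value).
rho, p_value = stats.spearmanr(x, y)

print(f"rho = {rho:.4f}, p-value = {p_value:.2e}")
```

Here the coefficient stays close to zero while the p-value is tiny, which is why both numbers should be looked at together.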

Now for the correlations with the outputs:

In [None]:
stats.spearmanr(lung_df["de_p"], lung_df["emission"])

In [None]:
stats.spearmanr(lung_df["cos_p"], lung_df["emission"])

Here again, it is no surprise that an emission has a major impact on the energy delta and the rotation of the incoming particle.

Let's have a look at a few other basic distributions.

In [None]:
sns.histplot(lung_df["en_p"].apply(np.log)).set(title="Distribution of parent particle energy in lung");

In [None]:
sns.histplot(lung_df["dist_p"]).set(title="Distribution of distance covered by particle before event in lung");

The spike at $1000$ corresponds to the step limit, i.e. the maximum distance a particle can cover in a single step.

In [None]:
step_limit_df = lung_df[lung_df['dist_p'] == lung_df['dist_p'].max()]

In [None]:
step_limit_count = len(step_limit_df)
p_step_limit = step_limit_count/lung_total
print(f"Probability of reaching the step limit: {p_step_limit}")

In [None]:
emissions_when_step_limit = len(step_limit_df[step_limit_df["emission"] == 1])
p_emission_cond_step_limit = emissions_when_step_limit/step_limit_count
print(f"Probability of emission when reaching step limit (i.e. P(emission | step_limit reached)): {p_emission_cond_step_limit}")
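The conditional probability above can also be obtained for both cases at once (step limit reached or not) with a row-normalised contingency table. A minimal sketch on toy data, assuming as in the cells above that the step limit equals the maximum of ``dist_p``; the frame and the ``regime`` labels are hypothetical:

```python
import pandas as pd

# Toy stand-in for lung_df; the values are illustrative only.
toy_df = pd.DataFrame({
    "dist_p":   [1000.0, 1000.0, 3.2, 0.5, 1000.0, 12.0, 1000.0, 7.1],
    "emission": [0,      1,      1,   0,   0,      1,    0,      1],
})

# Flag rows that hit the step limit, then give the flag readable labels.
at_limit = toy_df["dist_p"] == toy_df["dist_p"].max()
regime = at_limit.map({True: "at_limit", False: "below_limit"}).rename("regime")

# Rows: regime; columns: emission (0/1); each row sums to 1.
cond_table = pd.crosstab(regime, toy_df["emission"], normalize="index")

p_emission_given_limit = cond_table.loc["at_limit", 1]
p_emission_given_below = cond_table.loc["below_limit", 1]
print(cond_table)
```

Seeing the two conditional probabilities side by side makes it easier to judge whether reaching the step limit changes the emission rate.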

Let's understand whether the distance covered follows a power law or an exponential distribution, by looking at the plot in log-log:

In [None]:
sns.ecdfplot(lung_df["dist_p"], log_scale=(True,True)).set(title="ECDF of distance covered by particle before event in lung - log-log");

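The data looks quite linear on a log-log plot, which points towards a power law rather than an exponential distribution (an exponential would look linear on a semi-log plot instead). Let's test the exponential hypothesis anyway: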

In [None]:
stats.kstest(lung_df['dist_p'], "expon")

The test rejects the hypothesis that ``dist_p`` is exponentially distributed. Could that be because of the lack of accuracy for small distances (smaller than $10^{-2}$)?

In [None]:
stats.kstest(lung_df[lung_df['dist_p'] > 0.01]['dist_p'], "expon")

The hypothesis of an exponential distribution is rejected in both cases. This is consistent with the distance covered following a power law, although rejecting the exponential does not by itself establish it.
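One caveat on the two tests above: ``stats.kstest(data, "expon")`` compares against the standard exponential (``loc=0``, ``scale=1``), so data that is exponential with a different scale is rejected as well. A sketch of the difference on synthetic data, where the scale is fitted first and passed through ``args`` (the scale of 5, the sample size and the seed are arbitrary assumptions):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic exponential sample with scale 5 (illustrative choice).
sample = rng.exponential(scale=5.0, size=5_000)

# Default call: tests against expon(loc=0, scale=1).
stat_default, p_default = stats.kstest(sample, "expon")

# Fit the scale with loc pinned to 0, then test against the fitted distribution.
loc_hat, scale_hat = stats.expon.fit(sample, floc=0)
stat_fitted, p_fitted = stats.kstest(sample, "expon", args=(loc_hat, scale_hat))

print(f"default: D = {stat_default:.3f}, p = {p_default:.2e}")
print(f"fitted:  D = {stat_fitted:.3f}, p = {p_fitted:.2e} (scale = {scale_hat:.2f})")
```

Because the parameters are estimated from the same sample, the fitted p-value is only approximate, but the KS statistic still gives a fairer picture of how far the data is from the exponential family.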

Checking whether initial energy and distance travelled are correlated:

In [None]:
stats.spearmanr(lung_df['dist_p'], lung_df['en_p'])

They are correlated, but only weakly. What about the emission indicator?

In [None]:
stats.spearmanr(lung_df['dist_p'], lung_df['emission'])

In [None]:
for i in [0.1, 0.2, 0.3, 0.5, 1.0, 2.0, 3.0, 5.0, 10.0, 20.0]:
    p = len(lung_df[(lung_df['en_p'] == i) & (lung_df['emission'] == 1)])/len(lung_df[(lung_df['en_p'] == i)])
    print(f"For energy {i}, proba of emission: {p}")
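The per-energy loop can also be written as a single ``groupby``, which avoids hard-coding the energy values and comparing floats with ``==``. A minimal sketch on a toy frame (the values are made up; only the column names ``en_p`` and ``emission`` come from the notebook):

```python
import pandas as pd

# Toy stand-in for lung_df; the values are illustrative only.
toy_df = pd.DataFrame({
    "en_p":     [0.1, 0.1, 0.1, 0.1, 1.0, 1.0, 20.0, 20.0],
    "emission": [0,   0,   0,   1,   0,   1,   1,    1],
})

# Mean of the 0/1 indicator within each energy level = P(emission | energy).
p_by_energy = toy_df.groupby("en_p")["emission"].mean()

for energy, p in p_by_energy.items():
    print(f"For energy {energy}, proba of emission: {p}")
```

This picks up whatever energy levels are present in the data, so it keeps working if the list of simulated energies changes.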

In [None]:
stats.spearmanr(lung_df['en_p'], lung_df['emission'])

Now, let's look at the particle rotation:

In [None]:
sns.displot(lung_df['cos_p']).set(title="Distribution of particle rotation in lung");

Rotation seems roughly centered at $0$ degrees, with an additional delta spike at $0$, maybe due to particles that do not emit any others? Let's check that:

In [None]:
sns.displot(lung_df[lung_df['emission'] == 1]['cos_p']).set(title="Distribution of particle rotation in lung, conditioned on emission");

Don't forget to look at the y-axis. The spike at $0$ degrees of rotation has indeed diminished, although not by much. Let's quickly check that particles that do not emit a new one are not rotated much:

In [None]:
sns.histplot(lung_df[lung_df['emission'] == 0]['cos_p'], bins=400, log_scale=(False, True)).set(title="Distribution of particle rotation, conditioned on no emission");

In [None]:
lung_df_no_emissions = lung_df[lung_df["emission"] == 0]

In [None]:
len(lung_df_no_emissions[lung_df_no_emissions["cos_p"] > 0.98])/len(lung_df_no_emissions)

In [None]:
sns.histplot(lung_df_no_emissions["de_p"])

In [None]:
stats.spearmanr(lung_df_no_emissions["cos_p"], lung_df_no_emissions["dist_p"])

In [None]:
sns.histplot(lung_df_no_emissions["dist_p"], log_scale=(False, True));

In [None]:
stats.spearmanr(lung_df_no_emissions["cos_p"], lung_df_no_emissions["en_p"])

In [None]:
lung_df_emissions = lung_df[lung_df["emission"] == 1]

In [None]:
stats.spearmanr(lung_df_emissions["cos_p"], lung_df_emissions["dist_p"])

In [None]:
sns.histplot(lung_df_emissions["dist_p"]);

Now, we do similar tests for the water dataset just to check that things aren't crazily different:

In [None]:
water_df = pd.read_pickle("../pickled_data/water_dataset.pkl")
water_df.head(40)

In [None]:
len(water_df)

In [None]:
sns.histplot(water_df["en_p"]).set(title="Distribution of parent particle energy in water");

Nothing changed much there.

In [None]:
sns.histplot(water_df["dist_p"]).set(title="Distribution of distance covered by particle in water");

Interestingly, far fewer particles reach the step limit.

In [None]:
step_limit_df = water_df[water_df['dist_p'] == water_df['dist_p'].max()]

In [None]:
step_limit_count = len(step_limit_df)
p_step_limit = step_limit_count/len(water_df)
print(f"Probability of reaching the step limit: {p_step_limit}")

Can this have anything to do with the probability of emission?

In [None]:
emission_count = len(water_df[water_df["emission"] == 1])
p_emission = emission_count/len(water_df)
print(f"Probability of emission: {p_emission}")

It is indeed much higher than in the lung. Lastly, the correlation between energy and distance, which was noticeably low for the lung:

In [None]:
stats.spearmanr(water_df['dist_p'], water_df['en_p'])

It's a bit higher here. We expect quite different models, although a very similar ML system architecture can still be used for both datasets.
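To compare the two materials at a glance, the quantities computed above can be gathered by a small helper and laid out side by side. This is only a sketch: ``summarize`` is a hypothetical function, the toy frames stand in for ``lung_df`` and ``water_df``, and the step limit is assumed to equal the maximum of ``dist_p`` as in the cells above.

```python
import pandas as pd
from scipy import stats


def summarize(df):
    """Collect the headline statistics used in this notebook for one dataset."""
    at_limit = df["dist_p"] == df["dist_p"].max()
    rho, _ = stats.spearmanr(df["dist_p"], df["en_p"])
    return {
        "n_events": len(df),
        "p_emission": df["emission"].mean(),
        "p_step_limit": at_limit.mean(),
        "spearman_dist_en": rho,
    }


# Toy stand-ins for the real frames; the values are illustrative only.
toy_lung = pd.DataFrame({
    "dist_p": [1000.0, 1000.0, 5.0, 2.0],
    "en_p": [10.0, 20.0, 1.0, 0.5],
    "emission": [0, 0, 1, 0],
})
toy_water = pd.DataFrame({
    "dist_p": [1000.0, 3.0, 1.0, 0.2],
    "en_p": [20.0, 5.0, 2.0, 1.0],
    "emission": [1, 1, 0, 1],
})

# One column per material, one row per statistic.
summary = pd.DataFrame({"lung": summarize(toy_lung), "water": summarize(toy_water)})
print(summary)
```

With the real data, the call would be ``summarize(lung_df)`` and ``summarize(water_df)``.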

About augmented data:

In [None]:
%%time
lung_aug_df = pd.read_pickle("../pickled_data/lung_augmented_dataset.pkl")

In [None]:
sns.histplot(lung_aug_df["en_p"]).set(title="Energy distribution in augmented data");