# Chi-square calculation

This notebook uses the triple data obtained in the previous step.

1. Load the prepared triple results, markup and graph data.
2. Get the frequency tables.
3. Calculate chi-square.

In [1]:
import numpy as np
import pandas as pd
from triple_patterns import get_label_intervals
from triple_patterns import get_freq_tables_triples
from triple_patterns import get_chi_square_results_df

In [2]:
path_triple_data = r"..\data\triple_data.npz"
markup_path = r"..\data\markup.csv"
graph_path = r"..\data\graph.npz"

### Load prepared triple results and markup

`triple_data` is an archive of three arrays:
1. An array of triples with size $M \times 3$, where $M$ is the number of found triples.
2. An array of triple activity over time with size $M \times T$, where $T$ is the number of frames $\div 8$. The array contains only $0$ and $1$, where $1$ marks the moments when the triple was active.
3. An array of triple activity during the different labels with size $M \times L$, where $L$ is the number of labels in the markup.
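The layout described above can be checked on a small synthetic archive. This is only a sketch: the sizes `M`, `T` and `L` are made up, the arrays are filled with zeros, and the assignment of the key `triple_info` to the third array is assumed from the order in which the keys are loaded in this notebook.

```python
import io

import numpy as np

# Hypothetical sizes: 4 triples, 5 time steps, 3 labels.
M, T, L = 4, 5, 3

# Write a synthetic archive to memory instead of disk.
buffer = io.BytesIO()
np.savez(
    buffer,
    triples=np.zeros((M, 3), dtype=np.int64),          # M x 3
    triple_activity=np.zeros((M, T), dtype=np.uint8),  # M x T
    triple_info=np.zeros((M, L), dtype=np.int64),      # M x L (assumed key)
)
buffer.seek(0)

# Load it back the same way the notebook loads the real file.
archive = np.load(buffer)
shapes = {name: archive[name].shape for name in archive.files}
print(shapes)
```

Running the same dictionary comprehension on the real `triple_data` object is a quick way to confirm that the first dimension of all three arrays is the same $M$.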

In [3]:
triple_data = np.load(path_triple_data)
triples = triple_data["triples"]
triple_activity = triple_data["triple_activity"]
triple_info = triple_data["triple_info"]

In [4]:
markup_df = pd.read_csv(markup_path, index_col=0)
markup = markup_df.to_dict(orient='list')
markup = {key: np.array(value) for key, value in markup.items()}

In [5]:
graph_data = np.load(graph_path)
nt = graph_data["nt"]

### Markup preprocessing

In this step we count, for every label, the number of intervals in which the label is active over the whole video.

In [6]:
label_intervals = get_label_intervals(markup, nt)
packed_label_intensity = {key:np.packbits(value) for key, value in markup.items()}
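The `np.packbits` call stores eight binary frames in one byte. The sketch below shows this on made-up data and counts the frames in which two packed signals are active at the same time. The arrays `label_frames` and `triple_frames` are hypothetical, and the overlap count is an illustration, not necessarily what `triple_patterns` does internally.

```python
import numpy as np

# Hypothetical binary signals over 16 frames (1 = active).
label_frames = np.array([1, 1, 0, 0, 0, 0, 0, 0,
                         0, 0, 0, 0, 1, 1, 1, 1], dtype=np.uint8)
triple_frames = np.array([0, 1, 0, 0, 0, 0, 0, 0,
                          0, 0, 0, 0, 0, 1, 1, 0], dtype=np.uint8)

# Pack 8 frames per byte: 16 frames become 2 bytes.
packed_label = np.packbits(label_frames)
packed_triple = np.packbits(triple_frames)
print(packed_label)  # [192  15]

# Unpacking restores the original signal.
restored = np.unpackbits(packed_label)[:label_frames.size]

# Frames where both signals are active: bitwise AND, then count set bits.
overlap = int(np.unpackbits(packed_label & packed_triple).sum())
print(overlap)  # 3
```

Packing reduces memory by a factor of eight and lets co-activity be computed with a single bitwise AND per byte.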

### Triple info preprocessing

In this step the frequency tables are calculated, and the results are placed in a dataframe for convenient further analysis.

In [7]:
dict_freq_tables = get_freq_tables_triples(
    triple_activity, triples, packed_label_intensity, label_intervals
)
# One row per triple: the triple itself and its frequency table.
df_edges = pd.DataFrame(
    list(dict_freq_tables.items()),
    index=list(dict_freq_tables.keys()),
    columns=['edge', 'freq_table'],
)

0: [2 3 6]
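The exact layout of the tables returned by `get_freq_tables_triples` is not shown in this notebook. A minimal sketch, assuming a $2 \times L$ table whose first row counts the intervals of each label in which the triple was active and whose second row counts the remaining intervals, could look like this. The helper `build_freq_table` and all numbers are hypothetical.

```python
import numpy as np


def build_freq_table(active_per_label, intervals_per_label):
    """Return a 2 x L table: row 0 = active counts, row 1 = inactive counts.

    Hypothetical helper, not part of triple_patterns.
    """
    active = np.asarray(active_per_label)
    total = np.asarray(intervals_per_label)
    if np.any(active > total):
        raise ValueError("active counts cannot exceed the interval counts")
    return np.vstack([active, total - active])


# Made-up counts for three labels.
table = build_freq_table([2, 5, 3], [10, 20, 30])
print(table)
```

Under this assumption every column sums to the number of intervals of the corresponding label.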


### Chi-square calculation

In [8]:
df_chi_res = get_chi_square_results_df(df_edges, bool_gtest=False)
df_chi_res.head()

|  | chi2_pval | chi2_expected_freq | chi2_pval_adj |
|---|---|---|---|
| (2, 3, 6) | 0.324901 | [[0.24778761061946902, 0.4424778761061947, 0.3... | 0.859771 |
| (2, 3, 9) | 0.901601 | [[1.238938053097345, 2.212389380530974, 1.5486... | 0.901601 |
| (2, 6, 3) | 0.529604 | [[0.24778761061946902, 0.4424778761061947, 0.3... | 0.859771 |
| (2, 9, 3) | 0.324901 | [[0.24778761061946902, 0.4424778761061947, 0.3... | 0.859771 |
| (3, 2, 6) | 0.216231 | [[0.24778761061946902, 0.4424778761061947, 0.3... | 0.768191 |
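The expected frequencies in the `chi2_expected_freq` column follow the usual rule for a contingency table: row total times column total divided by the grand total. The sketch below computes them, together with the Pearson statistic and the degrees of freedom, for a made-up $2 \times 3$ table using only NumPy. It illustrates the standard test and is not the implementation of `get_chi_square_results_df`.

```python
import numpy as np

# Made-up 2 x 3 frequency table (rows: active / inactive, columns: labels).
observed = np.array([[2, 5, 3],
                     [8, 15, 27]], dtype=float)

row_totals = observed.sum(axis=1, keepdims=True)  # shape (2, 1)
col_totals = observed.sum(axis=0, keepdims=True)  # shape (1, 3)
grand_total = observed.sum()

# Expected counts under independence of rows and columns.
expected = row_totals @ col_totals / grand_total

# Pearson chi-square statistic and degrees of freedom.
chi2_stat = float(((observed - expected) ** 2 / expected).sum())
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)

print(expected)
print(round(chi2_stat, 4), dof)  # 2.04 2
```

`scipy.stats.chi2_contingency(observed)` returns the same statistic, degrees of freedom and expected frequencies plus the p-value, and passing `lambda_="log-likelihood"` switches it to the G-test. That this is what `bool_gtest` toggles is an assumption.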
