# Label Correlation and Overlap Analysis

This notebook analyzes the relationships and overlaps between six prediction columns in the DataFrame, alongside three 5-class label columns. Three prediction columns take values from 0 to 4 and three take values from 0 to 9; the latter are integer-divided by 2 in Section 1 so that all six share the 0-4 scale. The goal is to understand how these columns are correlated and whether a specific value (e.g., 4) in one column coincides with the same value in another.

## 1. Load and Inspect Label Columns

Load the DataFrame and select the six label columns. Print their names and the first few rows to verify.

In [9]:
import joblib
import pandas as pd

# Load the DataFrame (adjust path if needed)
PATH_DATA = "G:\\Python\\Data"
PATH_DATA_DTS=PATH_DATA+"\\DTS_FULL\\"
df_full = joblib.load(PATH_DATA_DTS+"PARIS_TREND_1D_V5_5B_lab_20_adj_20_50_class_5_10_PREDICT_FULL_V2.pkl")

print(df_full.shape)
df_pred=df_full[df_full['PART'].isin(['VAL','CONF'])].copy()
print(df_pred.shape)

# List the six label columns (update names if needed)
label_cols_5 = ['lab_perf_50d_class5','lab_perf_20d_class5',  'lab_perf_20d_class_5_adj']
pred_cols_5=['predict_LGBM_CLASS_5_50D','predict_LGBM_CLASS_5_20D','predict_5_adj']
pred_cols_10=['predict_LGBM_CLASS_10_50D','predict_LGBM_CLASS_10_20D','predict_10_adj']

pred_cols = pred_cols_5.copy()

for col in pred_cols_10:
    new_col=col+'_d2'
    df_pred.loc[:, new_col] = df_pred[col]//2
    pred_cols.append(new_col)
    # replace pred_cols_10 element by the new_col
    pred_cols_10[pred_cols_10.index(col)] = new_col

# Convert NaN values to -1 for the label and prediction columns
df_pred[label_cols_5+pred_cols] = df_pred[label_cols_5+pred_cols].fillna(-1)

print("Label columns:", label_cols_5)
print("prediction columns 5 :", pred_cols_5)
print("prediction columns 10:", pred_cols_10)
print("prediction columns:", pred_cols)
print(df_pred[label_cols_5+pred_cols].head())

# 3/ consistency of predictions between models
# for each value, output a matrix of all the other values
# so 1 matrix with 25 on a side, and there are 30 matrices
# put this in table form in an Excel file
# 4/ analysis of predictions 3-4 / 7-8-9 against actuals and by PART (large degradation?)
# 5/ analysis of probabilities 4/9 vs 0-1 and 3/7-8 vs 0-1
# is there a way to improve the predictions?
# would a good 3 not be better than a 4? (7 vs 9)
# 6/ list of scenarios to study


(1297807, 41)
(453808, 41)
Label columns: ['lab_perf_50d_class5', 'lab_perf_20d_class5', 'lab_perf_20d_class_5_adj']
prediction columns 5 : ['predict_LGBM_CLASS_5_50D', 'predict_LGBM_CLASS_5_20D', 'predict_5_adj']
prediction columns 10: ['predict_LGBM_CLASS_10_50D_d2', 'predict_LGBM_CLASS_10_20D_d2', 'predict_10_adj_d2']
prediction columns: ['predict_LGBM_CLASS_5_50D', 'predict_LGBM_CLASS_5_20D', 'predict_5_adj', 'predict_LGBM_CLASS_10_50D_d2', 'predict_LGBM_CLASS_10_20D_d2', 'predict_10_adj_d2']
                     lab_perf_50d_class5  lab_perf_20d_class5  \
OPEN_DATETIME CODE                                              
2017-08-01    AB.PA                 -1.0                 -1.0   
2017-08-02    AB.PA                 -1.0                 -1.0   
2017-08-03    AB.PA                 -1.0                 -1.0   
2017-08-04    AB.PA                 -1.0                 -1.0   
2017-08-07    AB.PA                 -1.0                 -1.0   

                     lab_perf_20d_class_5_

## 2. Check Value Ranges and Distributions

Print value counts for each prediction column to confirm their ranges and distributions.

In [None]:
for col in pred_cols:
    print(f"\nValue counts for {col}:")
    print(df_pred[col].value_counts().sort_index())
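
To compare the distributions across columns at a glance, the per-column counts can be gathered into a single table of proportions. The sketch below runs on a small made-up DataFrame; the column names `pred_a` and `pred_b` and the helper `value_share_table` are hypothetical stand-ins for the entries of `pred_cols`, not names from the dataset.

```python
import pandas as pd

def value_share_table(df, cols):
    """Return one row per value and one column per input column,
    holding the share of rows taking that value (0.0 if absent)."""
    shares = {col: df[col].value_counts(normalize=True) for col in cols}
    # Building a DataFrame from a dict of Series aligns them on the union of values
    return pd.DataFrame(shares).fillna(0.0).sort_index()

# Toy data (hypothetical); -1 plays the role of the NaN placeholder
toy = pd.DataFrame({
    "pred_a": [0, 1, 1, 4, 4, 4, -1, 2],
    "pred_b": [0, 0, 1, 4, 3, 4, -1, -1],
})
share_table = value_share_table(toy, ["pred_a", "pred_b"])
print(share_table)
```

On the real data, the same helper could be called as `value_share_table(df_pred, pred_cols)`, which also makes the weight of the `-1` placeholder visible in each column.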

## 3. Create Boolean Masks for Label Values

For each prediction column, create a boolean mask for a specific value. Because the 0-9 columns were integer-divided by 2 in Section 1, value == 4 in those columns covers the original classes 8 and 9.

In [None]:
# Masks for value == 4 (all prediction columns are on the 0-4 scale after // 2)
masks_4 = {col: (df_pred[col] == 4) for col in pred_cols}

# Number of rows where each prediction column equals 4
for col in pred_cols:
    print(f"Rows with {col} == 4: {masks_4[col].sum()}")


## 4. Cross-Tabulate Label Value Occurrences

Count how often specific values (e.g., 4) in one label column coincide with the same value in another label column.

In [None]:
# Example: Cross-tabulate value == 4 between two 0-4 label columns
ct_4 = pd.crosstab(df_pred[pred_cols[0]] == 4, df_pred[pred_cols[1]] == 4)
print(f"Cross-tabulation for value == 4 between {pred_cols[0]} and {pred_cols[1]}:")
print(ct_4)

# Generalized: For all pairs
from itertools import combinations

print("\nOverlap counts for value == 4 (0-4 columns):")
for col1, col2 in combinations(pred_cols, 2):
    overlap = (masks_4[col1] & masks_4[col2]).sum()
    print(f"{col1} == 4 & {col2} == 4: {overlap}")
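
Raw overlap counts are hard to compare when the columns flag different numbers of rows. One possible normalization, shown here only as a sketch, is the Jaccard index: rows flagged by both masks divided by rows flagged by at least one. The toy DataFrame, its column names, and the helper `jaccard` are hypothetical and not part of the original analysis.

```python
from itertools import combinations

import pandas as pd

def jaccard(mask_a, mask_b):
    """Share of rows flagged by both masks among rows flagged by at least one."""
    union = (mask_a | mask_b).sum()
    return float((mask_a & mask_b).sum() / union) if union else float("nan")

# Toy predictions on the 0-4 scale (hypothetical data)
toy = pd.DataFrame({
    "pred_a": [4, 4, 4, 0, 1, 4],
    "pred_b": [4, 4, 0, 0, 4, 3],
    "pred_c": [0, 4, 4, 4, 1, 2],
})
toy_masks = {col: toy[col] == 4 for col in toy.columns}

# Jaccard index of the value == 4 masks for every pair of columns
jaccard_4 = {
    (c1, c2): jaccard(toy_masks[c1], toy_masks[c2])
    for c1, c2 in combinations(toy.columns, 2)
}
for pair, score in jaccard_4.items():
    print(pair, round(score, 3))
```

With the notebook's own objects, `jaccard(masks_4[col1], masks_4[col2])` would give the same measure for any pair in `pred_cols`.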


## 5. Visualize Overlap Between Labels

Plot heatmaps to visualize the overlap of specific values across label columns.

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Create overlap matrix for value == 4 (0-4 columns)
overlap_matrix_4 = pd.DataFrame(index=pred_cols, columns=pred_cols)
for col1 in pred_cols:
    for col2 in pred_cols:
        overlap_matrix_4.loc[col1, col2] = (masks_4[col1] & masks_4[col2]).sum()

overlap_matrix_4 = overlap_matrix_4.astype(int)

plt.figure(figsize=(6,5))
sns.heatmap(overlap_matrix_4, annot=True, fmt="d", cmap="Blues")
plt.title("Overlap (value == 4) between 0-4 label columns")
plt.show()

## 6. Calculate Correlation Matrix for Label Columns

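Compute and display the Pearson correlation matrix (the pandas default) for the six prediction columns to quantify their relationships. Any -1 placeholders introduced by `fillna` in Section 1 are treated as ordinary values here.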

In [None]:
# Compute correlation matrix
corr_matrix = df_pred[pred_cols].corr()

print("Correlation matrix for label columns:")
print(corr_matrix)

plt.figure(figsize=(8,6))
sns.heatmap(corr_matrix, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
plt.title("Correlation Matrix of Label Columns")
plt.show()
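
Since the classes are ordered categories rather than continuous measurements, a rank-based coefficient may be worth checking next to the default. The following is a minimal sketch on made-up data, assuming rows that carry the -1 placeholder should be excluded first; the column names are hypothetical.

```python
import pandas as pd

# Toy predictions (hypothetical); the last row carries the -1 placeholder
toy = pd.DataFrame({
    "pred_a": [0, 1, 2, 3, 4, -1],
    "pred_b": [0, 2, 1, 3, 4, -1],
})

# Keep only rows where no column holds the placeholder,
# so that -1 does not act as a real class in the ranking
valid = toy[(toy != -1).all(axis=1)]

# Spearman correlation is computed on ranks, which suits ordinal classes
spearman_corr = valid.corr(method="spearman")
print(spearman_corr)
```

The equivalent call on the notebook's data would be `df_pred[pred_cols].corr(method="spearman")`, after whatever filtering of -1 rows is judged appropriate.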

## 7. Export Cross-Counts to a CSV File

In [10]:
PATH_DATA = "G:\\Python\\Data"
PATH_DATA_DTS=PATH_DATA+"\\DTS_FULL\\"

results = []
for col_index, col in enumerate(label_cols_5):
    for val in range(5):
        mask = df_pred[col] == val
        for other_col in label_cols_5 + pred_cols:
            # Find the position of other_col in its own list, so that each label
            # is only compared with the predictions at the same position
            if other_col in pred_cols_5:
                col_other_index = pred_cols_5.index(other_col)
            elif other_col in pred_cols_10:
                col_other_index = pred_cols_10.index(other_col)
            elif other_col in label_cols_5:
                # Label columns are always compared with one another
                col_other_index = col_index
            else:
                col_other_index = -1
                print(f"Column {other_col} not found in pred_cols nor label_cols_5")
            if col_index == col_other_index:
                # Distribution of other_col over the rows where col == val
                counts = df_pred.loc[mask, other_col].value_counts().sort_index()
                print(f"For {col} Column {other_col} found with {counts.sum()} occurrences")
                for other_val, count in counts.items():
                    if count > 0:
                        results.append({
                            'label_col': col,
                            'label_val': val,
                            'other_col': other_col,
                            'other_val': other_val,
                            'count': count,
                        })

df_result = pd.DataFrame(results)
df_result.to_csv(PATH_DATA_DTS+'label_value_cross_counts_v3.csv', index=False)

For lab_perf_50d_class5 Column lab_perf_50d_class5 found with 56151 occurrences
For lab_perf_50d_class5 Column lab_perf_20d_class5 found with 56151 occurrences
For lab_perf_50d_class5 Column lab_perf_20d_class_5_adj found with 56151 occurrences
For lab_perf_50d_class5 Column predict_LGBM_CLASS_5_50D found with 56151 occurrences
For lab_perf_50d_class5 Column predict_LGBM_CLASS_10_50D_d2 found with 56151 occurrences
For lab_perf_50d_class5 Column lab_perf_50d_class5 found with 42878 occurrences
For lab_perf_50d_class5 Column lab_perf_20d_class5 found with 42878 occurrences
For lab_perf_50d_class5 Column lab_perf_20d_class_5_adj found with 42878 occurrences
For lab_perf_50d_class5 Column predict_LGBM_CLASS_5_50D found with 42878 occurrences
For lab_perf_50d_class5 Column predict_LGBM_CLASS_10_50D_d2 found with 42878 occurrences
For lab_perf_50d_class5 Column lab_perf_50d_class5 found with 38845 occurrences
For lab_perf_50d_class5 Column lab_perf_20d_class5 found with 38845 occurrences
Fo
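
The exported file is in long format, with one row per (label value, other value) pair. To read it as a matrix, one option is to pivot a single (label column, other column) pair, as sketched below on made-up counts. The names `lab_x` and `pred_x` and all numbers are hypothetical; on the real output, `df_result` would first be filtered to one `label_col` and one `other_col`.

```python
import pandas as pd

# Toy long-format counts for a single column pair (hypothetical values)
toy_result = pd.DataFrame({
    "label_col": ["lab_x"] * 5,
    "label_val": [0, 0, 4, 4, 4],
    "other_col": ["pred_x"] * 5,
    "other_val": [0, 1, 0, 3, 4],
    "count": [30, 10, 5, 15, 40],
})

# Rows: label value, columns: value of the other column, cells: row counts
matrix = toy_result.pivot_table(
    index="label_val", columns="other_val",
    values="count", aggfunc="sum", fill_value=0,
)

# Row-normalize so each row shows how a given label value is distributed
row_share = matrix.div(matrix.sum(axis=1), axis=0)
print(matrix)
print(row_share.round(3))
```

The row-normalized view answers questions of the form "when the label is 4, how often is the prediction also 4?" without being distorted by class sizes.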

In [None]:
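df_pred_c = df_pred.copy()
# Remove rows of df_pred_c where any column of pred_cols equals -1
df_pred_c = df_pred_c[~df_pred_c[pred_cols].isin([-1]).any(axis=1)]
# Concatenate the three label values into one string key per row;
# agg needs a function, and the "_" separator is an arbitrary choice
df_pred_c['lab_concat'] = df_pred_c[label_cols_5].astype(int).astype(str).agg('_'.join, axis=1)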
