<a href="https://colab.research.google.com/github/Tal144155/DTS_Project/blob/main/TDS_Project_p1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Tabular Data Science - Research Project
### Group Members: 
* Tal Ariel Ziv
* Arnon Lutsky

#### Introduction
Our final project aims to enhance and automate the data visualization process within the data science pipeline. Visualization is a critical step in understanding the data: it lets users explore distributions, analyze relationships between features and target variables, and gain insights from different perspectives. By improving and automating this process, we seek to make data exploration more efficient, intuitive, and accessible. Our solution is an algorithm that automatically analyzes the data for statistical relations and interesting observations, then recommends visualizations based on that analysis and a recommendation system.<br>

Before we begin, let's install all packages that are needed to run the notebook.
#### Installation Guide:
1. Download Python **version 3.12** or later. You can use the following [link](https://www.python.org/downloads/).
2. Install all required packages by running the following command in your terminal: `pip install -r requirements.txt`<br><br>
<font size=4px>**Now, we are able to begin.**</font>

### Relation Detection Algorithm:

For a deeper understanding of the relation detection algorithm, please refer to the PDF with the full explanation of the project, under Relation Detection Algorithm, section 2.1.



Now, let's start analyzing the data.

### 0. Imports

In [None]:
import os
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from scipy.stats import chi2_contingency, f_oneway
from sklearn.metrics import mutual_info_score
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score
from scipy import stats


import warnings
warnings.filterwarnings('ignore')


### 1. Understanding The Data
Now, let's define the functions that will find the relations.

We'll set the top 10 relations as the default number of relations returned.

In [None]:
TOP_N_RELATIONS = 10

We'll make a function that finds which features are stored as date strings and converts them to datetime.

In [None]:
def column_to_date(df):
    """Recognize string columns that hold dates and convert them to datetime in place."""
    date_pattern = r'^(\d{4}-\d{2}-\d{2})|^(\d{2}/\d{2}/\d{4})|^(\d{4}/\d{2}/\d{2})'
    for column in df.columns:
        if df[column].dtype == 'object':
            if df[column].str.match(date_pattern).any():
                try:
                    df[column] = pd.to_datetime(df[column], errors='coerce')
                    print(f"Converted column '{column}' to datetime.")
                except Exception as e:
                    print(f"Warning: Could not parse column {column} as datetime. {str(e)}")
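To see which strings the pattern treats as dates, here is a small standalone check that uses only the standard library `re` module. The sample strings are made up for illustration. Because the check anchors at the start of the string, a value with trailing text after the date (such as a time of day) still counts as a match.

```python
import re

# Same pattern as in column_to_date.
DATE_PATTERN = r'^(\d{4}-\d{2}-\d{2})|^(\d{2}/\d{2}/\d{4})|^(\d{4}/\d{2}/\d{2})'

# Hypothetical sample values, not taken from any dataset.
samples = ["2019-10-19", "19/10/2019", "2019/10/19",
           "2019-10-19 08:30", "Oct 19, 2019", "20191019"]

# re.match only looks for a match at the beginning of each string.
matches = {s: bool(re.match(DATE_PATTERN, s)) for s in samples}
for s, ok in matches.items():
    print(f"{s!r:22} -> {ok}")
```

Formats outside these three layouts, such as month names or compact digits, are left as plain strings.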

This function checks that the dataset exists at the given path and reads the data using pandas.
If the file does not exist, it prints a message to the user and returns `None`. Rows with missing values are dropped after loading.

In [None]:
def read_data(dataset_path, index_col = None):

    print("- Loading the dataset.")
    if not os.path.exists(dataset_path):
        print(f"Error: The file '{dataset_path}' does not exist. Please check the path and try again.")
        return None
    if index_col:
        df = pd.read_csv(dataset_path, index_col = index_col)
    else:
        df = pd.read_csv(dataset_path)
    column_to_date(df)
    df.dropna(inplace=True)
    return df

This function determines whether a column should be treated as categorical.
If a column has very few unique values relative to its length (and fewer than 20 in total), it is probably categorical.

In [None]:
def is_potentially_categorical(column, threshold=0.01):

    unique_values = column.nunique()
    total_values = len(column)
    # check if the fraction of unique values in the column is smaller than the threshold.
    if unique_values / total_values < threshold and unique_values < 20:
        return True
    return False
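As a rough illustration of how the two conditions interact, the sketch below applies the same rule to plain Python lists. The helper name `looks_categorical` and the sample data are made up for this example.

```python
def looks_categorical(values, threshold=0.01, max_unique=20):
    """Apply the same rule to a plain list: few unique values, both relatively and absolutely."""
    unique_values = len(set(values))
    return unique_values / len(values) < threshold and unique_values < max_unique

# 1,000 rows cycling through 5 distinct codes: ratio 0.005, below the threshold.
large = [i % 5 for i in range(1000)]
# The same 5 codes in only 100 rows: ratio 0.05, above the threshold.
small = [i % 5 for i in range(100)]

print(looks_categorical(large))  # True
print(looks_categorical(small))  # False
```

With the default threshold of 0.01, a column needs more than 100 rows per unique value before it is flagged, so small datasets will rarely produce categorical columns under this rule.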

This function determines the type of each column in our dataset, so that we can choose suitable visualizations later.
Types we recognize: integer, categorical_int, float, boolean, string, categorical_string, datetime, timedelta, object, other.

In [None]:
def get_column_types(df):
    print("- Finding features types in the dataset.")
    column_types = {}
    for column in df.columns:
        if pd.api.types.is_integer_dtype(df[column]):
            if is_potentially_categorical(df[column]):
                column_types[column] = 'categorical_int'
            else:
                column_types[column] = 'integer'
        elif pd.api.types.is_float_dtype(df[column]):
            column_types[column] = 'float'
        elif pd.api.types.is_bool_dtype(df[column]):
            column_types[column] = 'boolean'
        elif pd.api.types.is_string_dtype(df[column]):
            if is_potentially_categorical(df[column]):
                column_types[column] = 'categorical_string'
                df[column] = df[column].astype('category')
            else:
                column_types[column] = 'string'
        elif pd.api.types.is_datetime64_any_dtype(df[column]):
            column_types[column] = 'datetime'
        elif pd.api.types.is_timedelta64_dtype(df[column]):
            column_types[column] = 'timedelta'
        elif pd.api.types.is_object_dtype(df[column]):
            column_types[column] = 'object'
        else:
            column_types[column] = 'other'
    return column_types

This function finds every numerical feature whose correlation with the target variable is higher, in absolute value, than the threshold.

In [None]:
def correlation_target_value(df, numerical_columns, target_variable, relations, correlation_threshold=0.5):
    # Relations with the Target Variable
    print("- Checking for correlation with the target variable.")
    correlations = df[numerical_columns].corr()
    if target_variable in numerical_columns:
        for feature in numerical_columns:
            if feature != target_variable:
                corr_value = correlations.loc[feature, target_variable]
                if abs(corr_value) > correlation_threshold:
                    relations.append({
                        'attributes': [feature, target_variable],
                        'relation_type': 'target_correlation',
                        'details': {'correlation_value': corr_value}
                    })
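The threshold is applied to the absolute value of the Pearson coefficient, so strong negative relations are kept as well as strong positive ones. The sketch below computes the coefficient by hand on made-up numbers (the feature names `rooms` and `distance` are hypothetical).

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical toy data: one feature rises with the target, one falls.
price = [100, 150, 200, 250, 300]
rooms = [1, 2, 2, 3, 4]
distance = [9, 7, 6, 4, 1]

r_rooms = pearson(rooms, price)
r_distance = pearson(distance, price)

# Both features pass the 0.5 threshold once the sign is ignored.
strong = [name for name, r in [("rooms", r_rooms), ("distance", r_distance)]
          if abs(r) > 0.5]
print(strong)
```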


This function finds all pairs of numerical features (excluding the target variable) whose correlation is higher, in absolute value, than the threshold.

In [None]:

def correlation_relations(df, numerical_columns, target_variable, relations, correlation_threshold=0.5):
    # High Correlation Relations (Excluding Target Variable)
    print("- Checking for correlation.")
    correlations = df[numerical_columns].drop(columns=[target_variable], errors='ignore').corr()
    for i, feature1 in enumerate(correlations.columns):
        for feature2 in correlations.columns[i + 1:]:
            corr_value = correlations.loc[feature1, feature2]
            if abs(corr_value) > correlation_threshold:
                relations.append(
                    {'attributes': [feature1, feature2],
                     'relation_type': 'high_correlation',
                     'details': {'correlation_value': corr_value}})





This function assesses the impact of categorical variables on a numerical target variable using Analysis of Variance (ANOVA).

In [None]:
def categorical_effects(df, categorical_columns, numerical_columns, target_variable, relations, p_value_threshold=0.05):
    print("- Checking for categorical effect.")
    temp_relations = []
    if target_variable in numerical_columns:
        for cat_feature in categorical_columns:
            groups = [df[df[cat_feature] == cat][target_variable].dropna() for cat in df[cat_feature].unique()]
            if len(groups) > 1:
                f_stat, p_value = f_oneway(*groups)
                if p_value < p_value_threshold:
                    temp_relations.append(
                        {'attributes': [cat_feature, target_variable],
                         'relation_type': 'categorical_effect',
                         'details': {'p_value': p_value}})
    temp_relations.sort(key=lambda x: x['details']['p_value'])
    relations.extend(temp_relations[:TOP_N_RELATIONS])
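For intuition about what the test measures, the following sketch computes the one-way ANOVA F statistic by hand: the variance between group means divided by the variance within groups. It is a simplified illustration on made-up groups and does not compute the p-value.

```python
def one_way_f(groups):
    """F statistic of a one-way ANOVA for a list of numeric groups."""
    all_values = [v for g in groups for v in g]
    grand_mean = sum(all_values) / len(all_values)
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    # Variation of each value around its own group mean.
    ss_within = sum((v - sum(g) / len(g)) ** 2 for g in groups for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

# Hypothetical target values for three categories.
groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f_stat = one_way_f(groups)
print(f_stat)  # about 13: between-group variation dominates
```

A large F value means the category explains much of the variation in the target, which is what leads to a small p-value.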

This function performs the chi-squared test of independence between each pair of categorical features.

In [None]:
def chi_squared_relationship(df, categorical_columns, relations, p_value_threshold=0.05):
    print("- Checking for chi square relation.")
    temp_relations = []
    for i, feature1 in enumerate(categorical_columns):
        for feature2 in categorical_columns[i + 1:]:
            contingency_table = pd.crosstab(df[feature1], df[feature2])
            chi2, p, _, _ = chi2_contingency(contingency_table)
            if p < p_value_threshold:
                temp_relations.append(
                    {'attributes': [feature1, feature2],
                     'relation_type': 'chi_squared',
                     'details': {'p_value': p}})
    temp_relations.sort(key=lambda x: x['details']['p_value'])
    relations.extend(temp_relations[:TOP_N_RELATIONS])
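The statistic behind this test compares the observed counts in the contingency table with the counts expected if the two features were independent. Here is a minimal hand computation on a made-up 2x2 table, without any continuity correction.

```python
def chi2_statistic(table):
    """Uncorrected Pearson chi-squared statistic and degrees of freedom."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, dof

# Hypothetical counts for two binary features.
table = [[10, 20], [20, 10]]
stat, dof = chi2_statistic(table)
print(stat, dof)
```

Note that for 2x2 tables `chi2_contingency` applies Yates' continuity correction by default, so its statistic would be somewhat smaller than this uncorrected value.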


Function to check for numerical feature trends over time

In [None]:

def date_numerical_relationship(df, date_columns, numerical_columns, relations, correlation_threshold=0.5):
    print("- Checking for date with numerical variables.")
    temp_relations = []
    for date_col in date_columns:
        # Safely convert to datetime and drop NaT values
        valid_dates = pd.to_datetime(df[date_col], errors='coerce').dropna()
        if valid_dates.empty:
            continue
        # Keep the ordinals in a local Series so the input DataFrame is not modified
        time_ordinal = valid_dates.map(pd.Timestamp.toordinal)
        for num_feature in numerical_columns:
            # Only use rows where the date is valid
            valid_data = df.loc[valid_dates.index, num_feature].dropna()
            if not valid_data.empty:
                corr_value = time_ordinal.loc[valid_data.index].corr(valid_data)
                if abs(corr_value) > correlation_threshold:
                    temp_relations.append(
                        {'attributes': [date_col, num_feature],
                         'relation_type': 'date_numerical_trend',
                         'details': {'correlation_value': corr_value}}
                    )
    temp_relations.sort(key=lambda x: abs(x['details']['correlation_value']), reverse=True)
    relations.extend(temp_relations[:TOP_N_RELATIONS])
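The trend check works by turning each date into an ordinal day count, which makes it an ordinary number that can be correlated with a numerical feature. The sketch below uses the standard library `datetime` module (rather than `pd.Timestamp`) to show what that conversion preserves and what it discards.

```python
from datetime import date, datetime

days = [date(2019, 1, 1), date(2019, 1, 2), date(2019, 1, 31), date(2019, 2, 1)]
ordinals = [d.toordinal() for d in days]

# Differences between ordinals are differences in whole days.
gaps = [b - a for a, b in zip(ordinals, ordinals[1:])]
print(gaps)  # [1, 29, 1]

# The time of day is ignored: both timestamps map to the same ordinal.
morning = datetime(2019, 1, 1, 8, 0).toordinal()
evening = datetime(2019, 1, 1, 20, 0).toordinal()
print(morning == evening)  # True
```

As a consequence, the correlation captures day-level trends only; patterns within a single day are invisible to this check.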

Function to check for categorical feature distribution over date features

In [None]:
def date_categorical_relationship(df, date_columns, categorical_columns, relations, p_value_threshold=0.05):
    print("- Checking for date with categorical variable.")
    temp_relations = []
    for date_col in date_columns:
        # Bucket dates by month in a local Series so the input DataFrame is not modified
        date_period = pd.to_datetime(df[date_col]).dt.to_period('M')
        for cat_feature in categorical_columns:
            contingency_table = pd.crosstab(date_period, df[cat_feature])
            chi2, p, _, _ = chi2_contingency(contingency_table)
            if p < p_value_threshold:
                temp_relations.append(
                    {'attributes': [date_col, cat_feature],
                     'relation_type': 'date_categorical_distribution',
                     'details': {'p_value': p}}
                )
    temp_relations.sort(key=lambda x: x['details']['p_value'])
    relations.extend(temp_relations[:TOP_N_RELATIONS])

This function computes mutual information scores between numerical features, after binning each feature into up to 10 quantile bins, to detect strong relationships that
may not be identified by linear correlation methods.

In [None]:
def non_linear_relationships(df, numerical_columns, relations, threshold=0.5):    
    print("- Checking for non linear relation.")
    # Visit each unordered pair once; mutual information is symmetric.
    for i, col1 in enumerate(numerical_columns):
        for col2 in numerical_columns[i + 1:]:
            if col1 != col2:
                mi = mutual_info_score(
                    pd.qcut(df[col1], 10, duplicates='drop', labels=False), 
                    pd.qcut(df[col2], 10, duplicates='drop', labels=False)
                )
                if mi > threshold:
                    relations.append({
                        'attributes': [col1, col2],
                        'relation_type': 'non_linear',
                        'details': {'mutual_information': mi}
                    })
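To make the 0.5 threshold easier to interpret, the sketch below computes mutual information for discrete labels by hand, using the natural logarithm (the same unit that `mutual_info_score` reports). The label lists are made up for illustration.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Mutual information (in nats) between two sequences of discrete labels."""
    n = len(xs)
    px = Counter(xs)
    py = Counter(ys)
    pxy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), count in pxy.items():
        p_joint = count / n
        # Compare the joint probability with the product of the marginals.
        mi += p_joint * math.log(p_joint / ((px[x] / n) * (py[y] / n)))
    return mi

a = [0, 0, 1, 1]
mi_same = mutual_information(a, a)              # fully dependent: ln(2)
mi_indep = mutual_information(a, [0, 1, 0, 1])  # independent: 0
print(mi_same, mi_indep)
```

The score is not bounded by 1: with 10 equally filled bins per feature it can reach ln(10), roughly 2.3, so the threshold sits on an unnormalized scale.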

This function employs a Random Forest Regressor to assess the importance of numerical features in predicting
the target variable. 

In [None]:
def feature_importance_relations(df, numerical_columns, target_variable, relations, top_n=5):
    print("- Checking for feature importance.")
    
    if target_variable in numerical_columns:
        X = df[numerical_columns].drop(columns=[target_variable])
        y = df[target_variable]
        model = RandomForestRegressor(random_state=42)
        model.fit(X, y)
        importances = model.feature_importances_
        
        feature_importances = sorted(
            zip(X.columns, importances), 
            key=lambda x: x[1], 
            reverse=True
        )[:top_n]
        
        importance_details = {
            feature: {
                'importance_value': importance,
                'relative_rank': rank + 1
            }
            for rank, (feature, importance) in enumerate(feature_importances)
        }
        
        relations.append({
            'attributes': [f[0] for f in feature_importances],
            'relation_type': 'feature_importance',
            'details': {
                'importances': importance_details,
                'target_variable': target_variable
            }
        })


This function identifies outliers using the Z-score method and analyzes how these outliers influence feature correlations.

In [None]:
def outlier_relationships(df, numerical_columns, relations, z_score_threshold=3.0, min_outlier_ratio=0.01, max_outlier_ratio=0.05, correlation_diff_threshold=0.3):
    print("- Checking for outliers relation.")
    for col in numerical_columns:
        z_scores = np.abs((df[col] - df[col].mean()) / df[col].std())
        outliers = df[z_scores > z_score_threshold]
        
        outlier_ratio = len(outliers) / len(df)
        
        if min_outlier_ratio < outlier_ratio < max_outlier_ratio:
            for other_col in numerical_columns:
                if col != other_col:
                    outlier_correlation = outliers[col].corr(outliers[other_col])
                    normal_correlation = df[col].corr(df[other_col])
                    
                    # corr() returns NaN (not None) when a correlation cannot be computed
                    if pd.notna(outlier_correlation) and pd.notna(normal_correlation):
                        if abs(outlier_correlation - normal_correlation) > correlation_diff_threshold:
                            relations.append({
                                'attributes': [col, other_col],
                                'relation_type': 'outlier_pattern',
                                'details': {
                                    'outlier_correlation': outlier_correlation,
                                    'normal_correlation': normal_correlation,
                                    'outlier_count': len(outliers)
                                }
                            })
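The Z-score rule can be illustrated on a small made-up column with the standard library. `statistics.stdev` is the sample standard deviation, which matches the default of the pandas `std()` call used in the function.

```python
import statistics

# Hypothetical column: 19 ordinary values and one extreme value.
values = [10, 11, 9, 10, 12, 8, 10, 11, 9, 10,
          10, 11, 9, 10, 12, 8, 10, 11, 9, 100]

mean = statistics.mean(values)
std = statistics.stdev(values)  # sample standard deviation (ddof=1)

# A value is an outlier when it lies more than 3 standard deviations from the mean.
outliers = [v for v in values if abs(v - mean) / std > 3.0]
outlier_ratio = len(outliers) / len(values)
print(outliers, outlier_ratio)  # [100] 0.05
```

Because the ratio bounds are strict inequalities, a column like this one, with a ratio of exactly 0.05, would fall outside the default range and be skipped.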

This function analyzes the target variable by detecting outliers and testing its distribution against several known probability distributions.

In [None]:
def target_variable_analysis(df, target_variable, relations, z_score_threshold=3.0):
    print("- Checking for target variable.")
    target_data = df[target_variable]
    z_scores = np.abs((target_data - target_data.mean()) / target_data.std())
    outliers = target_data[z_scores > z_score_threshold]
    
    outlier_ratio = len(outliers) / len(target_data)
    
    distribution_types = ['norm', 'lognorm', 'expon', 'gamma', 'beta']
    best_fit = None
    best_p_value = 0
    
    for dist_name in distribution_types:
        dist = getattr(stats, dist_name)
        params = dist.fit(target_data)
        ks_stat, p_value = stats.kstest(target_data, dist_name, args=params)
        
        if p_value > best_p_value:
            best_fit = dist_name
            best_p_value = p_value
    
    relations.append({
        'attributes': [target_variable],
        'relation_type': 'target_analysis',
        'details': {
            'outlier_ratio': outlier_ratio,
            'outlier_count': len(outliers),
            'distribution_type': best_fit,
            'distribution_p_value': best_p_value
        }
    })


This function finds all interesting relations in the dataset and returns them as a list.

In [None]:
def find_relations(df, target_variable, dataset_types):
    relations = []
    numerical_columns = [col for col, col_type in dataset_types.items() if col_type in ['integer', 'float']]
    categorical_columns = [col for col, col_type in dataset_types.items() if col_type in ['categorical_int', 'categorical_string']]
    datetime_columns = [col for col, col_type in dataset_types.items() if col_type == 'datetime']
    categorical_int_columns = [col for col, col_type in dataset_types.items() if col_type == 'categorical_int']

    # Get the relations with high correlation
    correlation_relations(df, numerical_columns, target_variable, relations)

    # Get the relations with the target value
    correlation_target_value(df, numerical_columns, target_variable, relations)

    # Get the relations with categorical features
    categorical_effects(df, categorical_columns, numerical_columns, target_variable, relations)

    # Get categorical relations using chi-square test
    chi_squared_relationship(df, categorical_columns, relations)

    # Get relation between date attribute and numerical attributes
    date_numerical_relationship(df, datetime_columns, numerical_columns, relations)

    # Get relations between date attribute and categorical attributes
    date_categorical_relationship(df, datetime_columns, categorical_columns, relations)

    # Get non-linear relations between attributes
    non_linear_relationships(df, numerical_columns, relations)

    # Get attributes importance using random forest
    feature_importance_relations(df, numerical_columns + categorical_int_columns, target_variable, relations)

    # Get outliers relations
    outlier_relationships(df, numerical_columns, relations)
    
    # Get the distribution of the target variable
    target_variable_analysis(df, target_variable, relations)

    return relations

Now we'll run the relation detection algorithm.

In [None]:
dataset_path = "Final Project/Datasets_Testing/AB_NYC_2019.csv"
# input("Please enter the path to your Dataset: ")
index_col = "id"
# input("Please enter the index column: ")
target_value = "price"
# input("Please enter the name of your target value: ")
df = read_data(dataset_path, index_col)
if df is not None:
    # Understanding the types of columns in the data in order to create better visualizations.
    dataset_types = get_column_types(df)
    # Calling method to get the relations in the data
    relations = find_relations(df, target_value, dataset_types)
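Every detection function appends records with the same three keys (`attributes`, `relation_type`, `details`), so the resulting list is easy to summarize. The sketch below works on made-up records in that shape; the attribute names and values are hypothetical and are not results from the dataset.

```python
from collections import Counter

# Made-up records in the same shape that the detection functions append.
sample_relations = [
    {'attributes': ['a', 'b'], 'relation_type': 'high_correlation',
     'details': {'correlation_value': 0.82}},
    {'attributes': ['c', 'target'], 'relation_type': 'target_correlation',
     'details': {'correlation_value': -0.61}},
    {'attributes': ['d', 'e'], 'relation_type': 'chi_squared',
     'details': {'p_value': 0.003}},
    {'attributes': ['a', 'f'], 'relation_type': 'high_correlation',
     'details': {'correlation_value': -0.74}},
]

# How many relations of each type were found.
counts = Counter(rel['relation_type'] for rel in sample_relations)

# The strongest correlation-based relation, ignoring the sign.
strongest = max(
    (rel for rel in sample_relations if 'correlation_value' in rel['details']),
    key=lambda rel: abs(rel['details']['correlation_value']),
)
print(counts)
print(strongest['attributes'])
```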

In [None]:
import pandas as pd
import numpy as np
import pickle
import os.path
from sklearn.metrics.pairwise import pairwise_distances

In [None]:
RELATION_TYPES = {
    "high_correlation": {
        "description": "Identifies pairs of numerical features that have a strong linear relationship, indicating potential multicollinearity or redundancy in the dataset.",
        "use_cases": [
            "Feature selection",
            "Dimensionality reduction",
            "Understanding feature interactions"
        ],
        "data_types": ["numerical"],
        "dimensions": [2],
    },
    'target_correlation': {
        "description": "Measures the linear relationship between individual features and the target variable, helping to identify the most influential predictors.",
        "use_cases": [
            "Feature importance ranking",
            "Predictive modeling",
            "Feature selection"
        ],
        "data_types": ["numerical"],
        "dimensions": [2],
    },
    'categorical_effect': {
        "description": "Evaluates the statistical significance of categorical variables' impact on a numerical target variable using one-way ANOVA test.",
        "use_cases": [
            "Feature significance testing",
            "Group comparison",
            "Categorical feature importance"
        ],
        "data_types": ["categorical", "numerical"],
        "dimensions": [2],
    },
    'chi_squared': {
        "description": "Identifies statistically significant relationships between categorical variables using the chi-squared independence test.",
        "use_cases": [
            "Feature dependency analysis",
            "Categorical variable interaction detection",
            "Feature selection"
        ],
        "data_types": ["categorical"],
        "dimensions": [2],
    },
    'date_numerical_trend': {
        "description": "Detects temporal trends in numerical features by measuring their correlation with time-based attributes.",
        "use_cases": [
            "Time series analysis",
            "Trend identification",
            "Temporal pattern recognition"
        ],
        "data_types": ["numerical", "time series"],
        "dimensions": [2],
    },
    'date_categorical_distribution': {
        "description": "Analyzes how categorical variable distributions change or are distributed across different time periods.",
        "use_cases": [
            "Temporal categorical pattern detection",
            "Seasonal variation analysis",
            "Time-based segmentation"
        ],
        "data_types": ["categorical", "time series"],
        "dimensions": [2],
    },
    'non_linear': {
        "description": "Identifies complex, non-linear relationships between numerical features using mutual information score.",
        "use_cases": [
            "Advanced feature interaction detection",
            "Non-linear dependency analysis",
            "Complex relationship mapping"
        ],
        "data_types": ["numerical"],
        "dimensions": [2],
    },
    'feature_importance': {
        "description": "Ranks features based on their predictive power using a Random Forest Regressor's feature importance metric.",
        "use_cases": [
            "Predictive modeling",
            "Feature selection",
            "Model interpretability"
        ],
        "data_types": ["numerical"],
        "dimensions": [2],
    },
    'outlier_pattern': {
        "description": "Detects unique correlation patterns among outliers that differ from the overall dataset's correlations.",
        "use_cases": [
            "Anomaly detection",
            "Robust correlation analysis",
            "Outlier impact assessment"
        ],
        "data_types": ["numerical"],
        "dimensions": [2],
    },
    'cluster_group': {
        "description": "Identifies groups of features that exhibit similar clustering characteristics based on their importance within specific clusters.",
        "use_cases": [
            "Feature grouping",
            "Dimensionality reduction",
            "Structural data understanding"
        ],
        "data_types": ["numerical"],
        "dimensions": [1],
    },
    'target_analysis': {
        "description": "Provides a comprehensive analysis of the target variable, including outlier characteristics and distribution properties.",
        "use_cases": [
            "Target variable understanding",
            "Distribution fitting",
            "Outlier detection"
        ],
        "data_types": ["numerical"],
        "dimensions": [1],
    }
}



# Save the user ratings to a pickle file for keeping the progress and assessing our model
def save_ratings(ratings, file_name):
    with open(file_name+'.pkl', 'wb') as f:
        pickle.dump(ratings, f)

# Load user ratings for collaborative filtering 
def load_ratings(file_name, rec_types):
    file = file_name + '.pkl'
    if os.path.isfile(file):
        with open(file, 'rb') as f:
            ratings = pickle.load(f)
    else:
        # No saved ratings yet: start with an empty frame that has one column per relation type
        ratings = pd.DataFrame({})
        for rel_type in rec_types:
            if rel_type not in ratings.columns:
                ratings[rel_type] = np.nan

    return ratings







This function is for user-based collaborative filtering.

In [None]:
def CFUB(ratings_pd):
    # Get the mean rating for each user
    ratings = ratings_pd.to_numpy()
    mean_user_rating = ratings_pd.mean(axis=1).to_numpy().reshape(-1, 1)
    # calculate the similarity between users
    ratings_diff = (ratings - mean_user_rating)
    ratings_diff[np.isnan(ratings_diff)]=4
    user_similarity = 1-pairwise_distances(ratings_diff, metric='cosine')
    pred = mean_user_rating + user_similarity.dot(ratings_diff) / np.array([np.abs(user_similarity).sum(axis=1)]).T
    return pred

This function is for item-based collaborative filtering.

In [None]:
def CFIB(ratings_pd):
    # Get the mean rating for each user
    ratings = ratings_pd.to_numpy()
    mean_user_rating = ratings_pd.mean(axis=1).to_numpy().reshape(-1, 1)
    # calculate the similarity between visualizations (the columns of the ratings matrix)
    ratings_diff = (ratings - mean_user_rating)
    ratings_diff[np.isnan(ratings_diff)]=4
    vis_similarity = 1-pairwise_distances(ratings_diff.T, metric='cosine')
    pred = mean_user_rating + ratings_diff.dot(vis_similarity) / np.array([np.abs(vis_similarity).sum(axis=1)])
    return pred
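The difference between the user-based and item-based variants comes down to which axis of the ratings matrix is compared: rows (users) or columns (visualization types). The sketch below shows this with a small made-up ratings matrix and a hand-written cosine similarity in NumPy, as an illustration rather than a replacement for the functions above.

```python
import numpy as np

def cosine_similarity(matrix):
    """Cosine similarity between the rows of a 2-D array."""
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid division by zero for all-zero rows
    unit = matrix / norms
    return unit @ unit.T

# Hypothetical ratings: 4 users (rows) x 3 visualization types (columns).
ratings = np.array([
    [5.0, 4.0, 1.0],
    [4.0, 5.0, 2.0],
    [1.0, 2.0, 5.0],
    [2.0, 1.0, 4.0],
])

# Center each user's ratings around that user's own mean.
centered = ratings - ratings.mean(axis=1, keepdims=True)

user_sim = cosine_similarity(centered)    # users vs. users: shape (4, 4)
item_sim = cosine_similarity(centered.T)  # items vs. items: shape (3, 3)
print(user_sim.shape, item_sim.shape)
```

Users 0 and 1 have similar tastes and get a positive similarity, while users 0 and 2 rate in opposite directions and get a negative one.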

Weighted sum of two predictions.

In [None]:
def combine_pred(pred1, pred2, w1 = 0.5, w2 = 0.5):
    # Replace NaN values with 4.0
    pred1 = np.nan_to_num(pred1, nan=4.0)
    pred2 = np.nan_to_num(pred2, nan=4.0)

    return w1 * pred1 + w2 * pred2

This function calculates the score for a relation of any type.

In [None]:
def normalize_score(value, metric_type):
    """
    Normalize different types of statistical measures to a 1-5 scale.
    """
    # Normalization strategies for different metric types
    normalization_strategies = {
        'high_correlation': {
            'abs_range': (0.5, 1.0),  # Correlation values are between -1 and 1
            'percentile_thresholds': [0.5, 0.7, 0.8, 0.9]
        },
        'target_correlation': {
            'abs_range': (0.5, 1.0),  # Correlation with target variable
            'percentile_thresholds': [0.5, 0.6, 0.7, 0.9]
        },
        'categorical_effect': {
            'abs_range': (0, 0.05),  # P-values, lower is stronger
            'percentile_thresholds': [0.05, 0.02, 0.01, 0.009]
        },
        'chi_squared': {
            'abs_range': (0, 0.05),  # P-values, lower is stronger
            'percentile_thresholds': [0.05, 0.02, 0.01, 0.009]
        },
        'date_numerical_trend': {
            'abs_range': (0.5, 1.0),  # Correlation values
            'percentile_thresholds': [0.5, 0.7, 0.8, 0.9]
        },
        'date_categorical_distribution':{
            'abs_range': (0, 0.05),  # P-values, lower is stronger
            'percentile_thresholds': [0.05, 0.02, 0.01, 0.009]
        },
        'non_linear': {
            'abs_range': (0.5, 1.0),  # Mutual information score
            'percentile_thresholds': [0.5, 0.7, 0.8, 0.9]
        },
        'feature_importance': {
            'abs_range': (0, 1.0),  # Feature importance values
            'percentile_thresholds': [0.2, 0.4, 0.6, 0.8]
        },
        'outlier_pattern': {
            'abs_range': (0.5, 1.0),  # Correlation differences
            'percentile_thresholds': [0.5, 0.7, 0.8, 0.9]
        }
    }
    

    if metric_type == 'cluster_group':
        # Normalize based on number of features in cluster or importance
        return min(max(1, int(value * 5)), 5)
    elif metric_type == 'target_analysis':
        # Normalize outlier ratio or distribution significance
        return min(max(1, int(value * 5)), 5)
    
    # We'll set the middle score as the default
    if metric_type not in normalization_strategies:
        return 3  
    
    strategy = normalization_strategies[metric_type]
    
    # Absolute value for signed metrics
    abs_value = abs(value)
    
    
    # Value-based normalization
    min_val, max_val = strategy['abs_range']
    
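    # Normalize to 1-5 range
    if abs_value <= min_val:
        score = 1
    elif abs_value >= max_val:
        score = 5
    else:
        normalized = 1 + 4 * (abs_value - min_val) / (max_val - min_val)
        score = int(min(max(normalized, 1), 5))

    # For p-value metrics a lower value is stronger, so flip the scale
    if metric_type in {'categorical_effect', 'chi_squared', 'date_categorical_distribution'}:
        score = 6 - score
    return score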
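For metrics where a larger value means a stronger relation, the mapping is a linear interpolation from the metric's range onto 1 to 5, followed by truncation to an integer. The standalone sketch below reproduces only that step; the helper name `linear_score` is made up for this example.

```python
def linear_score(value, min_val, max_val):
    """Map |value| from [min_val, max_val] onto an integer score from 1 to 5."""
    abs_value = abs(value)
    if abs_value <= min_val:
        return 1
    if abs_value >= max_val:
        return 5
    # int() truncates, so 4.2 becomes 4.
    return int(1 + 4 * (abs_value - min_val) / (max_val - min_val))

# Correlation-style range (0.5, 1.0); the sign of the value is ignored.
scores = {v: linear_score(v, 0.5, 1.0) for v in (0.4, 0.625, -0.75, 0.9, 1.0)}
print(scores)  # {0.4: 1, 0.625: 2, -0.75: 3, 0.9: 4, 1.0: 5}
```

Because of the truncation, a score of 5 is reached only when the value hits the upper bound of the range.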


This function sets the score for each relation.

In [None]:
def get_relation_scores(relations):
    """
    Apply strength normalization to all relations.
    """
    for relation in relations:
        if relation['relation_type'] in {'high_correlation', 'target_correlation', 'date_numerical_trend'}:
            value = relation['details']['correlation_value']
        elif relation['relation_type'] in {'categorical_effect', 'date_categorical_distribution', 'chi_squared'}:
            value = relation['details']['p_value']
        elif relation['relation_type'] == 'non_linear':
            value = relation['details']['mutual_information']
        elif relation['relation_type'] == 'outlier_pattern':
            value = relation['details']['outlier_correlation']
        else:
            value = 1       
        
        relation['score'] = normalize_score(
            value, 
            relation['relation_type']
        )
    
    return relations

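This function generates a pandas DataFrame with one column per relation type, holding the score (row 0) and the list index (row 1) of the top relation of that type. Types with no detected relation get a score of 0 and an index of -1.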

In [None]:
def get_top_relations(relations):
    algo_rec_df = pd.DataFrame({})
    top_relations = {}

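    for i, rel in enumerate(relations):
        rel_type = rel['relation_type']
        # Keep the highest-scoring relation of each type (the first one wins ties)
        if rel_type not in top_relations or rel['score'] > top_relations[rel_type]['score']:
            top_relations[rel_type] = {'score': rel['score'], 'index': i}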

    for type in RELATION_TYPES:
        score = 0
        indx = -1
        if type in top_relations:
            score = top_relations[type]['score']
            indx = top_relations[type]['index']
        algo_rec_df.loc[0, type] = score
        algo_rec_df.loc[1, type] = indx

    return algo_rec_df