<a href="https://colab.research.google.com/github/Tal144155/DTS_Project/blob/main/TDS_Project_p1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Tabular Data Science - Research Project
### Group Members: 
* Tal Ariel Ziv
* Arnon Lutsky

#### Introduction
Our final project aims to enhance and automate the data visualization step of the data science pipeline. Visualization is critical to understanding data: it lets users explore distributions, analyze relationships between features and target variables, and view the data from different perspectives. By improving and automating this step, we seek to make data exploration more efficient, intuitive, and accessible. Our solution is an algorithm that automatically analyzes the data for statistical relations and interesting observations, then recommends visualizations by combining that analysis with a recommendation system.<br>

Before we begin, let's install the packages needed to run the notebook.
#### Installation Guide:
1. Install Python **version 3.12** or later. You can use the following [link](https://www.python.org/downloads/).
2. Install all required packages by running the following command in your terminal: `pip install -r requirements.txt`<br><br>
<font size=4px>**Now, we are able to begin.**</font>

### Relation Detection Algorithm:

For a deeper understanding of the relation detection algorithm, please refer to the PDF with the full explanation of the project, under Relation Detection Algorithm, section 2.1.



Now, let's start analyzing the data.

### 0. Imports

In [None]:
import pandas as pd
import numpy as np
import pickle
import os.path
from sklearn.metrics.pairwise import pairwise_distances
from plot_generator import *
from recommendation_tool import *
from relation_detection_algorithm import *
import warnings
warnings.filterwarnings('ignore')


### 1. Understanding The Data
Now, let's set up the configuration used to find the relations.

By default, the algorithm returns the top 10 relations.

In [6]:
TOP_N_RELATIONS = 10
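
As a minimal sketch of what a top-N cutoff like this might do, the snippet below keeps the highest-scoring entries from a list of relation records. The record layout (a dict with a `score` key) and the helper name `top_n_by_score` are assumptions made for illustration; they are not taken from `relation_detection_algorithm`.

```python
import heapq

TOP_N_RELATIONS = 10

def top_n_by_score(relations, n=TOP_N_RELATIONS):
    """Return the n highest-scoring relation records, best first (hypothetical helper)."""
    return heapq.nlargest(n, relations, key=lambda rel: rel["score"])

# Illustrative records; the real algorithm's output format may differ.
example_relations = [
    {"relation_type": "high_correlation", "score": 0.91},
    {"relation_type": "chi_squared", "score": 0.40},
    {"relation_type": "non_linear", "score": 0.75},
]
top_two = top_n_by_score(example_relations, n=2)
print([rel["relation_type"] for rel in top_two])
```

When fewer than `n` records exist, `heapq.nlargest` simply returns all of them in descending score order.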

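Next, we'll define the relation types the algorithm can detect. Each entry records a description, typical use cases, the data types it applies to, and the number of dimensions involved.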

In [8]:
RELATION_TYPES = {
    "high_correlation": {
        "description": "Identifies pairs of numerical features that have a strong linear relationship, indicating potential multicollinearity or redundancy in the dataset.",
        "use_cases": [
            "Feature selection",
            "Dimensionality reduction",
            "Understanding feature interactions"
        ],
        "data_types": ["numerical"],
        "dimensions": [2],
    },
    'target_correlation': {
        "description": "Measures the linear relationship between individual features and the target variable, helping to identify the most influential predictors.",
        "use_cases": [
            "Feature importance ranking",
            "Predictive modeling",
            "Feature selection"
        ],
        "data_types": ["numerical"],
        "dimensions": [2],
    },
    'categorical_effect': {
        "description": "Evaluates the statistical significance of categorical variables' impact on a numerical target variable using one-way ANOVA test.",
        "use_cases": [
            "Feature significance testing",
            "Group comparison",
            "Categorical feature importance"
        ],
        "data_types": ["categorical", "numerical"],
        "dimensions": [2],
    },
    'chi_squared': {
        "description": "Identifies statistically significant relationships between categorical variables using the chi-squared independence test.",
        "use_cases": [
            "Feature dependency analysis",
            "Categorical variable interaction detection",
            "Feature selection"
        ],
        "data_types": ["categorical"],
        "dimensions": [2],
    },
    'date_numerical_trend': {
        "description": "Detects temporal trends in numerical features by measuring their correlation with time-based attributes.",
        "use_cases": [
            "Time series analysis",
            "Trend identification",
            "Temporal pattern recognition"
        ],
        "data_types": ["numerical", "time series"],
        "dimensions": [2],
    },
    'date_categorical_distribution': {
        "description": "Analyzes how categorical variable distributions change or are distributed across different time periods.",
        "use_cases": [
            "Temporal categorical pattern detection",
            "Seasonal variation analysis",
            "Time-based segmentation"
        ],
        "data_types": ["categorical", "time series"],
        "dimensions": [2],
    },
    'non_linear': {
        "description": "Identifies complex, non-linear relationships between numerical features using mutual information score.",
        "use_cases": [
            "Advanced feature interaction detection",
            "Non-linear dependency analysis",
            "Complex relationship mapping"
        ],
        "data_types": ["numerical"],
        "dimensions": [2],
    },
    'feature_importance': {
        "description": "Ranks features based on their predictive power using a Random Forest Regressor's feature importance metric.",
        "use_cases": [
            "Predictive modeling",
            "Feature selection",
            "Model interpretability"
        ],
        "data_types": ["numerical"],
        "dimensions": [2],
    },
    'outlier_pattern': {
        "description": "Detects unique correlation patterns among outliers that differ from the overall dataset's correlations.",
        "use_cases": [
            "Anomaly detection",
            "Robust correlation analysis",
            "Outlier impact assessment"
        ],
        "data_types": ["numerical"],
        "dimensions": [2],
    },
    'cluster_group': {
        "description": "Identifies groups of features that exhibit similar clustering characteristics based on their importance within specific clusters.",
        "use_cases": [
            "Feature grouping",
            "Dimensionality reduction",
            "Structural data understanding"
        ],
        "data_types": ["numerical"],
        "dimensions": [1],
    },
    'target_analysis': {
        "description": "Provides a comprehensive analysis of the target variable, including outlier characteristics and distribution properties.",
        "use_cases": [
            "Target variable understanding",
            "Distribution fitting",
            "Outlier detection"
        ],
        "data_types": ["numerical"],
        "dimensions": [1],
    }
}
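
To illustrate how the `data_types` metadata could be used, the sketch below filters relation types by the column types available in a dataset. The helper `relations_for` is hypothetical, and the dictionary is a trimmed stand-in so the snippet runs on its own.

```python
# Trimmed stand-in for RELATION_TYPES so the snippet is self-contained;
# in the notebook, the full dictionary defined above can be passed instead.
relation_types_sample = {
    "high_correlation": {"data_types": ["numerical"], "dimensions": [2]},
    "categorical_effect": {"data_types": ["categorical", "numerical"], "dimensions": [2]},
    "chi_squared": {"data_types": ["categorical"], "dimensions": [2]},
    "target_analysis": {"data_types": ["numerical"], "dimensions": [1]},
}

def relations_for(available_types, relation_types):
    """Names of relation types whose required data types are all available (hypothetical helper)."""
    available = set(available_types)
    return [
        name
        for name, meta in relation_types.items()
        if set(meta["data_types"]) <= available
    ]

numeric_only = relations_for(["numerical"], relation_types_sample)
print(numeric_only)
```

A dataset with only numerical columns would match the purely numerical relation types, while adding a categorical column would also unlock the ANOVA and chi-squared entries.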

We'll load a dataset. For this walkthrough, we build a small sample dataset with a date column, a categorical column, and two numerical columns.

In [None]:
data = pd.DataFrame({
    'date': pd.date_range(start='2023-01-01', periods=12, freq='M'),
    'category': ['A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C'],
    'value': [8, 15, 7, 12, 18, 9, 14, 20, 11, 16, 22, 13],
    'count': [100, 150, 70, 120, 180, 90, 140, 200, 110, 160, 220, 130]
})

We'll load the existing user ratings.

In [None]:
ratings = load_ratings('user_ratings_rel2', RELATION_TYPES)

In [None]:
user_id = input("Please enter a user id:\n")

# If this is a new user, add the user to the dataframe.
if user_id not in ratings.index:
    ratings.loc[user_id] = np.nan
    save_ratings(ratings, 'user_ratings_rel')
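
The notebook imports `pickle` and `os.path`, which suggests the ratings table is persisted as a pickle file. The sketch below shows one way such load and save helpers might work. The names `load_ratings_sketch` and `save_ratings_sketch`, the file name, and the user id are all hypothetical; the project's own `load_ratings` and `save_ratings` may behave differently.

```python
import os
import pickle
import tempfile

import numpy as np
import pandas as pd

def load_ratings_sketch(path, relation_types):
    """Load a ratings table from a pickle file, or start an empty one (hypothetical)."""
    if os.path.isfile(path):
        with open(path, "rb") as fh:
            return pickle.load(fh)
    # One column per relation type; rows (users) are added later.
    return pd.DataFrame(columns=list(relation_types), dtype=float)

def save_ratings_sketch(ratings, path):
    """Persist the ratings table with pickle (hypothetical)."""
    with open(path, "wb") as fh:
        pickle.dump(ratings, fh)

relation_names = ["high_correlation", "chi_squared"]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "ratings_demo.pkl")
    ratings_demo = load_ratings_sketch(path, relation_names)
    if "alice" not in ratings_demo.index:
        ratings_demo.loc["alice"] = np.nan  # unrated until the user gives feedback
        save_ratings_sketch(ratings_demo, path)
    reloaded = load_ratings_sketch(path, relation_names)

print(reloaded)
```

Storing unrated entries as `NaN` rather than `0` keeps "no feedback yet" distinguishable from a low rating.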

In [None]:
# Name of the target column in the dataset
target_value = 'value'

In [None]:
# Get automatic recommendations 
dataset_types = get_column_types(data)
algo_rec = find_relations(data, target_value, dataset_types)
algo_rec = get_relation_scores(algo_rec)
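
To give a feel for one of the checks involved, here is a minimal sketch of a `high_correlation`-style detection using Pearson correlation on the numerical columns of the sample data. The helper `high_correlation_pairs` and the 0.8 threshold are assumptions for illustration; the actual `find_relations` implementation is described in the project PDF and may differ.

```python
import pandas as pd

# Numerical columns copied from the sample dataset above.
sample = pd.DataFrame({
    "value": [8, 15, 7, 12, 18, 9, 14, 20, 11, 16, 22, 13],
    "count": [100, 150, 70, 120, 180, 90, 140, 200, 110, 160, 220, 130],
})

def high_correlation_pairs(df, threshold=0.8):
    """Return (col_a, col_b, r) for numerical column pairs with |Pearson r| >= threshold."""
    corr = df.select_dtypes("number").corr(method="pearson")
    cols = list(corr.columns)
    pairs = []
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:  # visit each unordered pair once
            r = corr.loc[a, b]
            if abs(r) >= threshold:
                pairs.append((a, b, float(r)))
    return pairs

pairs = high_correlation_pairs(sample)
print(pairs)
```

In this sample, `count` is close to ten times `value`, so the pair is flagged as strongly correlated.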

In [None]:




# Keep recommending until every detected relation has been shown.
while algo_rec:
    # Blend item-based and user-based collaborative filtering predictions equally.
    combined_user_vis_pred = combine_pred(CFIB(ratings), CFUB(ratings), 0.5, 0.5)

    # Build a DataFrame of the top relations for the recommendation system.
    algo_rec_df = get_top_relations(algo_rec)

    # Weight the user's predicted preferences (0.7) against the algorithm's scores (0.3).
    user_index = ratings.index.get_loc(user_id)
    recommendations = combine_pred(combined_user_vis_pred[user_index], algo_rec_df.to_numpy()[0], 0.7, 0.3)

    # Pick the highest-scoring relation type and remove its relation from the queue.
    best = recommendations.argmax()
    index = int(algo_rec_df.iloc[1, best])
    rec = algo_rec.pop(index)
    relation_type = rec['relation_type']

    print("Recommended visualization:")
    print(f"\n    {relation_type.replace('_', ' ').title()}")
    print(f"   Description: {RELATION_TYPES[relation_type]['description']}")
    # Report the score at the winning position, not at the relation's list index.
    print(f"   Score: {recommendations[best]}")
    print("   Rationale:")
    for exp in rec['details']:
        print(f"   - {exp}")

    # Smooth the new rating into any existing one (80% old, 20% new).
    user_rating = ratings.loc[user_id, relation_type]
    new_rating = int(input(RATING_STRING))
    if pd.notna(user_rating) and user_rating:
        ratings.loc[user_id, relation_type] = user_rating * 0.8 + new_rating * 0.2
    else:
        ratings.loc[user_id, relation_type] = new_rating

    save_ratings(ratings, 'user_ratings_rel')
    print(f'\n {ratings} \n')

print("Those are all the meaningful relations we've found.\n We hope you found this helpful! (:)")
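
The loop above relies on two small pieces of arithmetic: a weighted blend of score arrays and a smoothed rating update. The sketch below works through both with made-up numbers. The `blend` function reflects only an assumption about what `combine_pred` does (a plain weighted sum); the real function in `recommendation_tool` may, for example, treat missing values differently.

```python
import numpy as np

def blend(first, second, w_first, w_second):
    """Weighted sum of two score arrays (assumed behaviour of combine_pred)."""
    return w_first * np.asarray(first, dtype=float) + w_second * np.asarray(second, dtype=float)

def smooth_rating(old_rating, new_rating, keep=0.8):
    """Blend a previous rating with new feedback, as in the loop above."""
    return old_rating * keep + new_rating * (1 - keep)

# Made-up scores for three relation types.
cf_scores = np.array([4.0, 2.0, 3.0])    # collaborative-filtering predictions
algo_scores = np.array([1.0, 5.0, 3.0])  # scores from the detection algorithm
blended = blend(cf_scores, algo_scores, 0.7, 0.3)
best = int(blended.argmax())
print(blended, best)
print(smooth_rating(4, 2))
```

With a 0.7/0.3 split, the user's predicted preferences dominate, so the first relation type wins even though the algorithm alone would rank the second one highest. The 0.8/0.2 smoothing lets a single new rating shift an existing preference only gradually.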