
# Exploratory data analysis: functions


## Table of contents

0. [Introduction](#Introduction)
1. [Import of packages](#Packages)
2. [Function definitions](#Functions)
    1. [Initial information](#InitialFunctions)
    2. [Overview](#OverviewFunctions)
    3. [Variable distributions](#DistributionsFunctions)
    4. [Target proportion](#ProportionFunctions)



<section id="Introduction">
    <h2> 0. Introduction </h2>
</section>


This notebook collects helper functions for exploratory data analysis. The functions output the plots, metrics, and basic tables commonly used during analysis. They are meant to receive the following inputs:

- **df_origin**: single pandas dataframe;

<br>

- **X_train**: pandas train dataframe;
- **X_test**: pandas test dataframe;
- **X_validation**: pandas validation dataframe;

<br>

- **y_train**: train target;
- **y_test**: test target;
- **y_validation**: validation target;
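
As a minimal sketch of how these inputs might be prepared, the block below builds a toy dataframe and splits it with plain pandas. The column names (`feature_a`, `feature_b`, `target`), the random data, and the 70/15/15 ratios are assumptions made for illustration only; they are not taken from this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two features and a binary target column named "target"
rng = np.random.default_rng(0)
df_origin = pd.DataFrame({
    "feature_a": rng.normal(size=100),
    "feature_b": rng.integers(0, 10, size=100),
    "target": rng.integers(0, 2, size=100),
})

# Shuffle once, then cut into 70% train, 15% test, 15% validation (assumed ratios)
shuffled = df_origin.sample(frac=1.0, random_state=0)
n_train = int(0.70 * len(shuffled))
n_test = int(0.15 * len(shuffled))

train = shuffled.iloc[:n_train]
test = shuffled.iloc[n_train:n_train + n_test]
validation = shuffled.iloc[n_train + n_test:]

# Separate the features (X_*) from the target (y_*)
X_train, y_train = train.drop(columns="target"), train["target"]
X_test, y_test = test.drop(columns="target"), test["target"]
X_validation, y_validation = validation.drop(columns="target"), validation["target"]
```

Any other splitting utility would work equally well, as long as it yields one dataframe and one target series per split.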


---



<section id="Packages">
    <h2> 1. Import of packages </h2>
</section>


In [9]:
#!pip install

In [10]:
from sklearn.datasets import load_breast_cancer

from fast_ml import eda
from fast_ml.model_development import train_valid_test_split
from fast_ml.utilities import reduce_memory_usage, display_all

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pickle
import pyarrow

# interactive interface
import ipywidgets as widgets
from IPython.display import display

---


<section id="Functions">
    <h2> 2. Function definitions </h2>
</section>



<section id="InitialFunctions">
    <h3> 2.1. Initial information </h3>
</section>


In [15]:
def eda_initial_informations(df):
    """Print the number of rows, the number of columns, and the index of df."""
    num_rows, num_columns = df.shape
    index = df.index

    print('Number of rows:', num_rows)
    print('Number of columns:', num_columns)
    print('Index:', index)


<section id="OverviewFunctions">
    <h3> 2.2. Overview </h3>
</section>


In [17]:
def eda_overview(df):
    """Display the fast_ml summary of df without truncating rows or columns."""
    summary_df = eda.df_info(df)
    display_all(summary_df)

In [81]:
def eda_summary_table(df):
    """Return a per-column summary: counts, missing values, unique values, dtype, and broad type."""
    num_no_missing_occurrences = df.notnull().sum()
    num_missing = df.isnull().sum()
    num_total_occurrences = num_no_missing_occurrences + num_missing
    # Assumes df has at least one row; otherwise the division yields NaN and astype(int) fails
    percent_missing = ((num_missing / num_total_occurrences) * 100).round().astype(int)
    num_unique_occurrences = df.nunique()  # missing values are not counted as unique
    formats = df.dtypes
    # Map each dtype to a broad type; any dtype not listed here falls back to 'Other'
    types = formats.apply(lambda x:
        'Numeric' if x in ['float64', 'int64'] else
        'Categorical' if x == 'object' else
        'Datetime' if x == 'datetime64[ns]' else 'Other'
    )

    summary_table = pd.DataFrame({
        'Number of total occurrences': num_total_occurrences,
        'Number of non-missing occurrences': num_no_missing_occurrences,
        'Number of missing': num_missing,
        'Percent of missing (%)': percent_missing,
        'Number of unique non-missing occurrences': num_unique_occurrences,
        'Format': formats,
        'Type': types
    })

    return summary_table
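
The type mapping above only recognises `float64`, `int64`, `object`, and `datetime64[ns]`, so columns stored as, for example, `int32` or `category` fall into the fallback label. A possible alternative, sketched here with the dtype predicates from `pandas.api.types`, is shown below. The helper name `classify_dtype` and the example dataframe are hypothetical and not part of this notebook.

```python
import pandas as pd
from pandas.api import types as ptypes

def classify_dtype(dtype):
    """Hypothetical helper: map a pandas dtype to a broad type label."""
    # Note: pandas treats bool as numeric, so bool columns are labelled 'Numeric' here
    if ptypes.is_numeric_dtype(dtype):
        return 'Numeric'
    if ptypes.is_datetime64_any_dtype(dtype):
        return 'Datetime'
    if (isinstance(dtype, pd.CategoricalDtype)
            or ptypes.is_object_dtype(dtype)
            or ptypes.is_string_dtype(dtype)):
        return 'Categorical'
    return 'Other'

# Toy dataframe with dtypes that the string-based mapping would not recognise
example_df = pd.DataFrame({
    'small_int': pd.Series([1, 2, 3], dtype='int32'),
    'ratio': pd.Series([0.1, 0.2, 0.3], dtype='float32'),
    'city': ['a', 'b', 'c'],
    'when': pd.to_datetime(['2024-01-01', '2024-01-02', '2024-01-03']),
    'grade': pd.Categorical(['x', 'y', 'x']),
})

example_types = example_df.dtypes.apply(classify_dtype)
print(example_types)
```

Whether boolean columns should count as numeric is a design choice; the sketch simply follows the pandas convention.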




<section id="DistributionsFunctions">
    <h3> 2.3. Variable distributions </h3>
</section>


In [20]:
def eda_variables_distributions(dataframe1, dataframe2, n_start, n_end):
    """Plot overlaid log-scale histograms of columns n_start:n_end for two dataframes."""
    if n_end <= n_start:
        print('Error: n_end must be greater than n_start.')
        return

    # Columns are selected from dataframe1; dataframe2 must contain the same column names
    variables = dataframe1.columns[n_start:n_end]
    num_variables = len(variables)

    if num_variables <= 0:
        print('Error: Invalid range of columns.')
        return

    num_cols = 2
    num_rows = (num_variables + 1) // num_cols  # ceiling division for a two-column grid
    # squeeze=False always returns a 2D array of axes, so flatten() works for any grid size
    fig, axes = plt.subplots(num_rows, num_cols, figsize=(12, 4*num_rows), squeeze=False)
    axes = axes.flatten()

    for i, var in enumerate(variables):
        axes[i].hist(dataframe1[var], bins=20, color='skyblue', alpha=0.5, label='Dataframe 1', log=True)
        axes[i].hist(dataframe2[var], bins=20, color='orange', alpha=0.5, label='Dataframe 2', log=True)
        #axes[i].set_title(var)
        axes[i].set_xlabel(var)
        axes[i].set_ylabel('Frequency')
        axes[i].legend()

    # Remove the unused subplot when the number of variables is odd
    for j in range(num_variables, len(axes)):
        fig.delaxes(axes[j])

    plt.tight_layout()
    plt.show()

# Usage example:
# eda_variables_distributions(dataframe1, dataframe2, n_start, n_end)
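
Because each `hist` call receives `bins=20`, matplotlib computes the bin edges separately for each dataframe, so the two overlaid histograms can end up with different bin widths. One way to make them directly comparable, sketched below under the assumption that the column is numeric and free of missing values, is to compute a single set of edges from the combined data with `np.histogram_bin_edges`. The arrays `train_values` and `test_values` are made-up stand-ins for one column of each dataframe.

```python
import numpy as np

# Hypothetical stand-ins for dataframe1[var] and dataframe2[var]
rng = np.random.default_rng(0)
train_values = rng.normal(loc=0.0, scale=1.0, size=500)
test_values = rng.normal(loc=0.5, scale=2.0, size=200)

# Compute 20 bins spanning the range of both samples together
combined = np.concatenate([train_values, test_values])
shared_edges = np.histogram_bin_edges(combined, bins=20)

# Both samples are now counted against the same edges
train_counts, _ = np.histogram(train_values, bins=shared_edges)
test_counts, _ = np.histogram(test_values, bins=shared_edges)

print(shared_edges[:3])
print(train_counts.sum(), test_counts.sum())
```

The same `shared_edges` array can be passed as the `bins` argument of `axes[i].hist(...)` in place of the integer `20`.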



<section id="ProportionFunctions">
    <h3> 2.4. Target proportion </h3>
</section>


In [22]:
def eda_target_distribution(original_series, train_series, test_series, validation_series):
    """Plot pie charts comparing the binary target distribution across the four series."""
    labels = ['Class 0', 'Class 1']  # Set the appropriate labels for your classes

    # Create subplots for the four pie charts
    fig, axes = plt.subplots(2, 2, figsize=(10, 10))
    fig.suptitle('Target Distribution Comparison', fontsize=16)
    axes = axes.flatten()

    series_data = [original_series, train_series, test_series, validation_series]
    titles = ['Original Series', 'Train Series', 'Test Series', 'Validation Series']

    for ax, data, title in zip(axes, series_data, titles):
        class_counts = [len(data[data == 0]), len(data[data == 1])]  # Count the occurrences of classes 0 and 1

        wedges, texts, autotexts = ax.pie(class_counts, labels=labels, startangle=90, colors=['palegreen', 'lightcoral'],
                                          wedgeprops={'edgecolor': 'black'}, autopct='%1.1f%%')
        ax.axis('equal')  # Equal aspect ratio so that the pie chart is drawn as a circle
        ax.set_title(title)

        # Add the absolute class counts to the percentage labels
        for autotext, count in zip(autotexts, class_counts):
            autotext.set(size=12, fontweight='bold')
            autotext.set_text(f'{count}\n({autotext.get_text()})')

    plt.tight_layout(rect=[0, 0.03, 1, 0.95])
    plt.show()
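
The pie charts assume a binary target coded as 0 and 1. As a complement, the same comparison can be produced as a table with `value_counts`, which also handles targets with more than two classes. The helper name `target_proportion_table` and the toy series below are hypothetical and only illustrate the idea.

```python
import pandas as pd

def target_proportion_table(series_by_split):
    """Hypothetical helper: class counts and percentages per split.

    series_by_split maps a split name to its target series.
    Returns two dataframes with classes as rows and splits as columns.
    """
    counts = pd.DataFrame(
        {name: series.value_counts() for name, series in series_by_split.items()}
    )
    # A class that is absent from a split appears as NaN; treat it as zero occurrences
    counts = counts.fillna(0).astype(int).sort_index()
    percents = (counts / counts.sum() * 100).round(1)
    return counts, percents

# Toy targets, made up for illustration
toy_train = pd.Series([0, 0, 0, 1])
toy_test = pd.Series([0, 1])

counts, percents = target_proportion_table({'Train': toy_train, 'Test': toy_test})
print(counts)
print(percents)
```

A table like this makes it easier to check that the class balance is similar across the original, train, test, and validation targets.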