## CEIA - Data Analysis

### Class 8: Automating data analysis. Automated EDA.

### 1. The ydata-profiling library

In [3]:
import pandas as pd
import seaborn as sns
from ydata_profiling import ProfileReport
import webbrowser
import os
from sklearn.model_selection import train_test_split

In [4]:
# Load the Titanic dataset
df = sns.load_dataset("titanic")
df.head()

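| | survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | S | Third | man | True | NaN | Southampton | no | False |
| 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | First | woman | False | C | Cherbourg | yes | False |
| 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.925 | S | Third | woman | False | NaN | Southampton | yes | True |
| 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1 | S | First | woman | False | C | Southampton | yes | False |
| 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.05 | S | Third | man | True | NaN | Southampton | no | True |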
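Before handing the data to an automated profiler, it can help to see what a very small manual overview looks like. The sketch below is only an illustration: `sample` is a hypothetical mini-frame that copies a few columns from the five rows shown above, and `quick_overview` is a helper name introduced here, not part of any library.

```python
import pandas as pd

# Hypothetical mini-frame: a few columns copied from the five rows shown above
sample = pd.DataFrame(
    {
        "survived": [0, 1, 1, 1, 0],
        "pclass": [3, 1, 3, 1, 3],
        "sex": ["male", "female", "female", "female", "male"],
        "age": [22.0, 38.0, 26.0, 35.0, 35.0],
        "deck": [None, "C", None, "C", None],
    }
)


def quick_overview(frame: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column: dtype, missing count/share, distinct values."""
    return pd.DataFrame(
        {
            "dtype": frame.dtypes.astype(str),
            "n_missing": frame.isna().sum(),
            "pct_missing": frame.isna().mean().mul(100).round(1),
            "n_unique": frame.nunique(dropna=True),
        }
    )


overview = quick_overview(sample)
print(overview)
```

An automated EDA report covers this kind of per-column summary and adds much more (distributions, correlations, alerts), which is the motivation for the tools used in the rest of the class.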


### 1. Automated EDA with the ydata-profiling library

In [5]:
# Create the report
profile_1 = ProfileReport(
    df,
    title="Pandas Profiling Report",
    explorative=False,
    correlations={
        # default "auto": {"calculate": True}
        "pearson": {"calculate": True},
        "spearman": {"calculate": True},
        "kendall": {"calculate": True},
        "cramers": {"calculate": True},
    },
)
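To make the four correlation options requested above more concrete, here is a minimal sketch that computes the same kinds of measures by hand on toy data. It is not how ydata-profiling computes them internally: `toy`, `pair_corr` and `cramers_v` are hypothetical names, and `cramers_v` is the plain, uncorrected formula, so a library may report slightly different values if it applies a bias correction.

```python
import numpy as np
import pandas as pd


def cramers_v(x: pd.Series, y: pd.Series) -> float:
    """Plain (uncorrected) Cramér's V computed from a contingency table."""
    observed = pd.crosstab(x, y).to_numpy(dtype=float)
    n = observed.sum()
    # Expected counts under independence: outer product of the marginals / n
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / n
    chi2 = ((observed - expected) ** 2 / expected).sum()
    min_dim = min(observed.shape) - 1
    return float(np.sqrt(chi2 / (n * min_dim)))


def pair_corr(frame: pd.DataFrame, method: str) -> float:
    """Correlation between the numeric columns x and y for a given method."""
    return float(frame[["x", "y"]].corr(method=method).loc["x", "y"])


# Hypothetical toy data, not the Titanic dataset
toy = pd.DataFrame(
    {
        "x": [1, 2, 3, 4, 5, 6],
        "y": [1, 4, 9, 16, 25, 36],  # monotonic in x, but not linear
        "group": ["a", "a", "a", "b", "b", "b"],
        "label": ["no", "no", "no", "yes", "yes", "yes"],  # determined by group
    }
)

pearson = pair_corr(toy, "pearson")    # linear association
spearman = pair_corr(toy, "spearman")  # rank-based, monotonic association
kendall = pair_corr(toy, "kendall")    # rank-based, pairwise concordance
v = cramers_v(toy["group"], toy["label"])  # association between categoricals
print(pearson, spearman, kendall, v)
```

Pearson, Spearman and Kendall apply to numeric columns, while Cramér's V applies to pairs of categorical columns, which is why a report on a mixed dataset such as Titanic can use all four.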


In [6]:
# Export the report to an HTML or JSON file
profile_1.to_file("../recursos/titanic_report_1.html")

profile_1.to_file("../recursos/titanic_report_1.json")  # The JSON export is used to customize the report


(using `df.profile_report(correlations={"auto": {"calculate": False}})`
If this is problematic for your use case, please report this as an issue:
https://github.com/ydataai/ydata-profiling/issues
(include the error message: 'putmask: first argument must be an array')
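The warning above suggests disabling the "auto" correlation. One way to follow that hint while keeping the four explicitly requested measures is sketched below. `correlations_without_auto` and `build_report` are hypothetical names introduced for illustration, and whether this silences the warning may depend on the installed ydata-profiling version.

```python
# Correlation settings following the hint in the warning: skip "auto"
# and keep the four measures that were requested explicitly.
correlations_without_auto = {
    "auto": {"calculate": False},
    "pearson": {"calculate": True},
    "spearman": {"calculate": True},
    "kendall": {"calculate": True},
    "cramers": {"calculate": True},
}


def build_report(df, title="Pandas Profiling Report"):
    """Hypothetical helper: build a report with the settings above."""
    # Imported lazily so the settings can be inspected without the library
    from ydata_profiling import ProfileReport

    return ProfileReport(df, title=title, correlations=correlations_without_auto)
```

Calling `build_report(df)` would then stand in for the direct `ProfileReport(...)` call.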

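Since the JSON export is meant for customizing the report, a reasonable first step is to look at what the file contains. The sketch below only lists the top-level keys and makes no assumption about the schema, which may vary between library versions. `top_level_sections` is a hypothetical helper name.

```python
import json
from pathlib import Path


def top_level_sections(json_path):
    """Return the sorted top-level keys of a JSON file ([] if not an object)."""
    with Path(json_path).open(encoding="utf-8") as fh:
        data = json.load(fh)
    return sorted(data) if isinstance(data, dict) else []


# Only inspect the report if it was actually generated
report_json = Path("../recursos/titanic_report_1.json")
if report_json.exists():
    print(top_level_sections(report_json))
```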

In [7]:
# Small helper to open the report in the browser
from pathlib import Path

def open_html_report(report_path):
    # Build a well-formed file:// URI from the absolute path (works on Windows and Unix)
    file_uri = Path(report_path).resolve().as_uri()

    # Open the report in the default browser
    webbrowser.open(file_uri)

In [8]:
# To display the report directly in the notebook, run the following line:
# profile_1.to_notebook_iframe()

In [9]:
# Open report 1 in the browser
open_html_report("../recursos/titanic_report_1.html")

In [10]:
# More exhaustive exploration with the "explorative" flag
profile_2 = ProfileReport(df, title="Pandas Profiling Report EDA", explorative=True)
profile_2.to_file("../recursos/titanic_report_2.html")


(using `df.profile_report(correlations={"auto": {"calculate": False}})`
If this is problematic for your use case, please report this as an issue:
https://github.com/ydataai/ydata-profiling/issues
(include the error message: 'putmask: first argument must be an array')

In [11]:
# Open report 2 in the browser
open_html_report("../recursos/titanic_report_2.html")

### 2. Automated EDA with the sweetviz library

In [14]:
import sweetviz as sv

In [15]:
sv.analyze(df).show_html("../recursos/sweetviz_report.html")


Report ../recursos/sweetviz_report.html was generated! NOTEBOOK/COLAB USERS: the web browser MAY not pop up, regardless, the report IS saved in your notebook/colab files.


In [16]:
# Compare 2 datasets (for example, Train and Test)

train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)

# Generate the comparison report
comparacion = sv.compare([train_df, "Train"], [test_df, "Test"])

# Save the HTML report (show_html also tries to open it in the browser)
comparacion.show_html("../recursos/sweetviz_comparacion.html")



Report ../recursos/sweetviz_comparacion.html was generated! NOTEBOOK/COLAB USERS: the web browser MAY not pop up, regardless, the report IS saved in your notebook/colab files.
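The point of comparing Train and Test is to check that both splits follow similar distributions. As a rough, hand-rolled illustration of that idea (not what sweetviz does internally), the sketch below puts the category shares of one column side by side for two splits. `compare_splits` and the toy frame are hypothetical, and the split uses `DataFrame.sample` instead of `train_test_split` only to keep the example free of extra dependencies.

```python
import pandas as pd


def compare_splits(train: pd.DataFrame, test: pd.DataFrame, column: str) -> pd.DataFrame:
    """Side-by-side category shares (in %) of one column in two splits."""
    shares = pd.DataFrame(
        {
            "Train": train[column].value_counts(normalize=True),
            "Test": test[column].value_counts(normalize=True),
        }
    ).fillna(0.0)  # a category missing from one split has a share of 0
    shares["abs_diff"] = (shares["Train"] - shares["Test"]).abs()
    return shares.mul(100).round(1)


# Hypothetical toy data standing in for the Titanic frame
toy = pd.DataFrame({"sex": ["male"] * 6 + ["female"] * 4})
toy_test = toy.sample(frac=0.2, random_state=42)
toy_train = toy.drop(toy_test.index)

result = compare_splits(toy_train, toy_test, "sex")
print(result)
```

Large values in `abs_diff` would suggest that the two splits are not comparable for that column, which is the kind of discrepancy the comparison report is meant to surface visually.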
