# Pandas Profiling Auto EDA
This tool is especially useful in the early stages of a machine learning project, as it provides a quick and thorough overview of the data, highlighting potential issues such as missing values, duplicate rows, or irrelevant variables that you might want to preprocess or remove before building models.
Running this notebook creates an interactive HTML report, saved in the `results/figures/` folder, that you can use to explore your data in more detail.

## If the notebook does not run
If the notebook cannot run because of import or version conflicts, create a virtual environment and use it as the kernel for this notebook:
```bash
python -m venv venv_pandas_profiling
```
Then activate the environment, install `ydata-profiling` and `ipykernel` in it, and select it as the kernel for this notebook.
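
After switching kernels, it can help to confirm that the notebook really runs inside the virtual environment. The following is a minimal sketch using only the standard library; the helper name `running_in_virtualenv` is made up for illustration.

```python
import sys


def running_in_virtualenv() -> bool:
    """Return True when the interpreter belongs to a virtual environment."""
    # Inside a venv, sys.prefix points at the environment, while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


print("Interpreter:", sys.executable)
print("Inside a virtual environment:", running_in_virtualenv())
```

If the second line prints `False`, the notebook is most likely still attached to the system interpreter rather than to `venv_pandas_profiling`.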


In [1]:
%pip install ydata-profiling

Defaulting to user installation because normal site-packages is not writeable
Collecting ydata-profiling
  Downloading ydata_profiling-4.8.3-py2.py3-none-any.whl.metadata (20 kB)
Collecting matplotlib<3.9,>=3.2 (from ydata-profiling)
  Downloading matplotlib-3.8.4-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (5.8 kB)
Collecting pydantic>=2 (from ydata-profiling)
  Downloading pydantic-2.7.1-py3-none-any.whl.metadata (107 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 107.3/107.3 kB 2.8 MB/s eta 0:00:00
Collecting visions<0.7.7,>=0.7.5 (from visions[type_image_path]<0.7.7,>=0.7.5->ydata-profiling)
  Downloading visions-0.7.6-py3-none-any.whl.metadata (11 kB)
Collecting htmlmin==0.1.12 (from ydata-profiling)
  Downloading htmlmin-0.1.12.tar.gz (19 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata

In [2]:
import os
import sys

import pandas as pd
# The package installed above is ydata-profiling; `pandas_profiling` is its deprecated former name
from ydata_profiling import ProfileReport

# Add the parent of the current working directory to sys.path
# so that the `src` package can be imported
parent_directory = os.path.abspath(os.path.join(os.getcwd(), '..'))
if parent_directory not in sys.path:
    sys.path.append(parent_directory)

from src.config import REPORT_PATH
from src.data.data_loader import create_data_loader


## Load data

In [3]:
x, y = create_data_loader().load_raw_data()
training_data = pd.concat([x, y], axis=1)
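
The `pd.concat(..., axis=1)` call above joins the features and the target column-wise, matching rows by index label. The sketch below illustrates this on toy data; the column names and values are invented and do not come from the project's dataset.

```python
import pandas as pd

# Toy stand-ins for the features (x) and the target (y); names and values are made up
x = pd.DataFrame({"age": [22, 38, 26], "fare": [7.25, 71.28, 7.92]})
y = pd.Series([0, 1, 1], name="target")

# axis=1 places the objects side by side, matching rows by index label
training_data = pd.concat([x, y], axis=1)
print(training_data.shape)

# If the indexes differ, pandas aligns on the union of labels and fills the gaps with NaN
y_shifted = pd.Series([0, 1, 1], index=[1, 2, 3], name="target")
misaligned = pd.concat([x, y_shifted], axis=1)
print(misaligned.shape)
print(misaligned["target"].isna().sum())
```

If `x` and `y` come back from the loader with different indexes, the combined frame can therefore contain extra rows and missing values, which would then show up in the profiling report.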

## Generate report

In [4]:
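profile = ProfileReport(training_data, title='Pandas Profiling Report')
# Join the path explicitly so the call does not depend on REPORT_PATH ending with a separator
profile.to_file(os.path.join(REPORT_PATH, "pandas_profiling.html"))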

(using `df.profile_report(correlations={"auto": {"calculate": False}})`
If this is problematic for your use case, please report this as an issue:
https://github.com/ydataai/ydata-profiling/issues
(include the error message: 'could not convert string to float: 'S'')
Summarize dataset: 100%|██████████| 47/47 [00:03<00:00, 14.82it/s, Completed]                       
Generate report structure: 100%|██████████| 1/1 [00:04<00:00,  4.30s/it]
Render HTML: 100%|██████████| 1/1 [00:00<00:00,  1.04it/s]
Export report to file: 100%|██████████| 1/1 [00:00<00:00, 359.56it/s]
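
The warning in the output above suggests disabling the "auto" correlation when it cannot be computed (here, apparently because a string value such as `'S'` could not be converted to a float). A minimal sketch of that option follows; the helper `build_report` and the constant name are hypothetical, and the settings dictionary is copied from the warning text.

```python
# Settings taken from the warning text: skip the "auto" correlation
NO_AUTO_CORRELATION = {"auto": {"calculate": False}}


def build_report(df, title="Pandas Profiling Report"):
    """Build a profile with the auto correlation disabled (hypothetical helper)."""
    # Imported lazily so the settings above can be inspected without the package installed
    from ydata_profiling import ProfileReport

    return ProfileReport(df, title=title, correlations=NO_AUTO_CORRELATION)
```

In this notebook it would be called as `build_report(training_data)`, followed by the same `to_file(...)` call as before. Disabling the calculation only hides the warning; the report will then not include that correlation.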
