# Pandas Profiling
Python has an excellent tool for Exploratory Data Analysis (EDA): `ydata-profiling`, formerly known as pandas-profiling. It is not compatible with polars, so we will use pandas for this part.

EDA is helpful because it gives an overview of our data: histograms, duplicates, null counts, and basic statistics such as the mean and median.

It is a basic tool, but a solid choice for a first pass.
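
To make the overview concrete, here is a minimal sketch of a few of these summaries computed by hand with plain pandas. The `toy` dataframe and its values are made up for illustration and are not the real comment data; the profiler computes similar figures for every column automatically.

```python
import pandas as pd

# Toy data standing in for the comment table (values are invented)
toy = pd.DataFrame({
    "likes": [0, 3, 3, None, 12],
    "is_reply": [False, True, True, False, False],
})

null_counts = toy.isna().sum()                 # missing values per column
duplicate_rows = int(toy.duplicated().sum())   # rows identical to an earlier row
stats = toy["likes"].describe()                # count, mean, std, min, quartiles, max
median_likes = toy["likes"].median()           # NaN values are skipped by default

print(null_counts, duplicate_rows, stats, median_likes, sep="\n\n")
```

Doing this column by column gets tedious quickly, which is the main reason to reach for a profiler on a first pass.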

In [None]:
from ydata_profiling import ProfileReport
import pandas as pd
import numpy as np
from datetime import date
import os
import sys

sys.path.append(os.path.abspath(os.path.join(os.getcwd(), "..")))
import config
from paths import Paths


# Select columns
We do not want to pass every column to the profiler, since some of them are hard to process. The comment text is one example: pandas profiling does not handle free text very well, so we leave it out here and perform a dedicated analysis of the text later.

In [9]:
columns = ["reply_count", "author", "likes", "published_at",
           "parent_id", "is_reply", "comment_length", "word_count", "emoji_count", "script"]

# Load all files
The `Paths` class (from `src`) collects all the clean comment files into a single list. It takes the channel handle and iterates over all dates, producing a list of file paths that we can read and concatenate into a single pandas dataframe.
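
The idea of gathering one file per date into a flat list can be sketched with the standard library alone. This is not the project's `Paths` implementation: the helper name `list_parquet_files` and the `<root>/<date>/comments.parquet` layout are assumptions made purely for illustration.

```python
from pathlib import Path
import tempfile


def list_parquet_files(processed_dir: Path) -> list[Path]:
    """Hypothetical helper: return every parquet file under processed_dir, sorted by path."""
    return sorted(processed_dir.rglob("*.parquet"))


# Build a throwaway directory tree to demonstrate the helper.
# Assumed layout: <root>/<date>/comments.parquet
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for day in ("2024-01-02", "2024-01-01"):
        (root / day).mkdir()
        (root / day / "comments.parquet").touch()
    found = [p.relative_to(root).as_posix() for p in list_parquet_files(root)]

print(found)
```

Sorting the paths keeps the files in date order when the folder names are ISO dates, so the concatenated dataframe is chronological by file.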

You may wonder whether this is a memory-intensive process, since we are joining the data from every processed day. **Not really**: remember that we are not loading the `comment` column, and we also skip columns that carry redundant information (for example, `author_id` and `author` are related, so we keep only `author` here).
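
If you want to check the memory claim rather than take it on faith, pandas can report the footprint of each column. The sketch below uses an invented `toy` dataframe, not the real data, to show how much a long text column weighs next to a numeric one.

```python
import pandas as pd

# Invented sample: two light columns and one long text column
toy = pd.DataFrame({
    "author": ["@ana", "@bo", "@ana"],
    "likes": [1, 0, 7],
    "comment": ["a long comment " * 50, "another one " * 50, "third " * 50],
})

# deep=True also counts the bytes held by the strings themselves
per_column = toy.memory_usage(deep=True, index=False)
slim_bytes = int(per_column.drop("comment").sum())
full_bytes = int(per_column.sum())

print(per_column)
print(f"without comment: {slim_bytes} B, with comment: {full_bytes} B")
```

The same `memory_usage(deep=True)` call can be run on the concatenated dataframe once it is built, to see the actual footprint of the selected columns.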

In [None]:
channel_paths = Paths(channel_handle=config.channel_handle)
dfs = [pd.read_parquet(file, columns=columns) for file in channel_paths.list_processed_files()]

In [None]:
# Join into a single dataframe, then drop the list of per-file dataframes to free memory
df = pd.concat(dfs, ignore_index=True)
del dfs

# Profile report generation
The profiler is simple to use: we only need to pass the dataframe to `ProfileReport`.

In [13]:
profile = ProfileReport(df, title="YouTube Comments EDA", explorative=True)

The profiler offers two ways to view the report: save it to a file or embed it directly in the notebook.
1) Saving to a file generates an HTML file.
2) Displaying in the notebook renders the report right below the executed cell, but there may be issues reloading the report at a later time.

In [None]:

# Save to file
profile.to_file(os.path.join(channel_paths.results_dir, f"{config.channel_handle}_profiling.html"))

# Display right below the cell
# profile.to_notebook_iframe()
