# Automatic EDA using `YData Profiling`

**Author Name:** Danish Azeem\
**Email:** danishazeem365@gmail.com


# Assignment: Names of 5 to 8 libraries that, like YData Profiling, can be used for automated data analysis

 There are several libraries available in Python that can help automate the process of Exploratory Data Analysis (EDA) and provide comprehensive reports with visualizations and insights. Some popular libraries for automatic EDA include:

1. **Pandas Profiling**: Pandas Profiling is a Python library that generates interactive HTML reports containing descriptive statistics, visualizations, and correlation matrices for a given dataset. It provides a quick overview of the data distribution, missing values, correlations, and more.

2. **AutoViz**: AutoViz is an automatic visualization library that generates a wide variety of plots for each feature in the dataset. It automatically selects the appropriate chart type based on the data type and distribution of each feature.

3. **Sweetviz**: Sweetviz is a Python library for visualizing and comparing datasets. It generates detailed comparative reports with statistics, visualizations, and insights for both individual features and feature interactions.

4. **DataPrep**: DataPrep is a Python library that simplifies the data preparation and EDA process. It provides functions for data cleaning, visualization, profiling, and feature engineering tasks.

5. **D-Tale**: D-Tale is a Flask-based tool that generates interactive web-based reports for EDA. It allows users to explore datasets, visualize data distributions, and analyze data relationships using an intuitive interface.

These libraries can help streamline the process of Exploratory Data Analysis by automating routine tasks and providing actionable insights to analysts and data scientists. You can choose the library that best fits your requirements and preferences for conducting EDA on your datasets.


**1. YData Profiling (formerly Pandas Profiling):**

Pandas Profiling has been renamed to YData Profiling, and the old `pandas-profiling` package is no longer actively maintained. YData Profiling offers a user-friendly and comprehensive approach to automated EDA: it generates an interactive HTML report summarizing data types, missing values, distributions, correlations, and more.

**2. D-Tale:**

D-Tale is another excellent library for interactive EDA. It provides visualizations within a web application interface, allowing you to explore data distributions and correlations and to filter data points intuitively.

**3. Sweetviz:**

Sweetviz is another user-friendly library that generates a visual and interactive HTML report for your data. It offers various charts and summaries to explore data distributions, missing values, relationships between variables, and more.

**4. AutoViz:**

AutoViz is a lightweight library that automatically generates various visualizations for your data, including histograms, scatter plots, box plots, and heatmaps. It's a good option for quickly getting a visual overview of your data.

**5. ExploriPy:**

ExploriPy focuses on providing statistical summaries and tests alongside visualizations. It can be helpful for identifying potential outliers, normality checks, and exploring relationships between variables through tests like ANOVA and Chi-Square.

**Choosing the Right Library:**

The best library for you depends on your specific needs and preferences. Consider these factors:

- **Desired level of automation:** Do you want a comprehensive report like YData Profiling or a focus on specific visualizations like AutoViz?
- **Interactivity:** Do you prefer live exploration within a web app (D-Tale) or a standalone HTML report that you can save and share (YData Profiling, Sweetviz)?
- **Statistical tests:** If you need statistical summaries and hypothesis tests alongside visualizations, ExploriPy might be a good choice.
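
To make the comparison more concrete, here is a minimal sketch of how two of the libraries above are typically invoked. The helper names `sweetviz_report` and `dtale_session` and the default file name are illustrative assumptions, not part of either library. The imports are placed inside the functions so that the block still loads when a library is not installed.

```python
def sweetviz_report(df, filepath="sweetviz_report.html"):
    """Write a Sweetviz HTML report for df and return the file path."""
    import sweetviz as sv  # imported lazily; requires `pip install sweetviz`

    report = sv.analyze(df)
    # open_browser=False only writes the file instead of opening a browser tab.
    report.show_html(filepath=filepath, open_browser=False)
    return filepath


def dtale_session(df):
    """Start a D-Tale web session for df and return the session object."""
    import dtale  # imported lazily; requires `pip install dtale`

    return dtale.show(df)
```

For example, `sweetviz_report(df)` writes a report you can open later, while `dtale_session(df).open_browser()` opens the interactive app.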

**Here are some additional tips for using automated EDA libraries:**

- **Don't rely solely on automation:** While these libraries are helpful, don't overlook the importance of critically evaluating the generated reports and visualizations. Use them as a starting point for further exploration.
- **Understand the data:** Regardless of the library, it's crucial to have a basic understanding of the data and its context to interpret the automated insights effectively.
- **Combine with manual exploration:** Use these libraries alongside manual techniques like data cleaning, correlation analysis, and domain knowledge for a more comprehensive EDA approach.

By effectively using these libraries, you can streamline the initial stages of EDA, saving time and allowing you to focus on more in-depth analysis and model building.
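
As a small illustration of the manual checks mentioned in the tips above, the sketch below runs three basic steps with plain pandas. The `df_demo` frame is a tiny made-up dataset used only for illustration; it is not the Titanic data.

```python
import pandas as pd

# Tiny hypothetical dataset, used only for illustration.
df_demo = pd.DataFrame({
    "age": [22.0, 38.0, None, 35.0],
    "fare": [7.25, 71.28, 7.92, 53.10],
    "alive": ["no", "yes", "yes", "yes"],
})

# Count missing values per column.
missing_counts = df_demo.isna().sum()

# Summary statistics (describe() covers the numeric columns by default here).
numeric_summary = df_demo.describe()

# Pairwise correlations, restricted to the numeric columns.
numeric_corr = df_demo.select_dtypes(include="number").corr()

print(missing_counts)
print(numeric_summary)
print(numeric_corr)
```

Comparing these numbers with the automated report is a quick way to confirm that you are reading the report correctly.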

In [1]:
# import libraries
import pandas as pd
import numpy as np
import seaborn as sns
import ydata_profiling as ydp



In [2]:
# import dataset
df = sns.load_dataset('titanic')

In [3]:
# generate the YData Profiling report and save it as HTML
profile = ydp.ProfileReport(df)
profile.to_file(output_file="./outputs/ydata_titanic.html")

(using `df.profile_report(correlations={"auto": {"calculate": False}})`
If this is problematic for your use case, please report this as an issue:
https://github.com/ydataai/ydata-profiling/issues
(include the error message: 'could not convert string to float: 'no'')
  annotation = ("{:" + self.fmt + "}").format(val)
(using `df.profile_report(missing_diagrams={"Heatmap": False}`)
If this is problematic for your use case, please report this as an issue:
https://github.com/ydataai/ydata-profiling/issues
(include the error message: 'could not convert string to float: '--'')
Summarize dataset: 100%|██████████| 41/41 [00:07<00:00,  5.41it/s, Completed]               
Generate report structure: 100%|██████████| 1/1 [00:06<00:00,  6.45s/it]
Render HTML: 100%|██████████| 1/1 [00:02<00:00,  2.18s/it]
Export report to file: 100%|██████████| 1/1 [00:00<00:00, 125.07it/s]
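
The messages above quote `could not convert string to float: 'no'`, which is the error Python raises when text is passed to a numeric conversion. One plausible reading (an assumption, not something the output confirms) is that a text column with values such as `yes`/`no` reached a correlation calculation. The sketch below reproduces the error in plain Python and then wraps the workaround quoted in the message itself, `correlations={"auto": {"calculate": False}}`, in a hypothetical helper named `profile_without_auto_corr`. Passing that option to `ProfileReport` instead of `df.profile_report` is assumed to be equivalent.

```python
def try_float(value):
    """Return (number, None) on success or (None, error message) on failure."""
    try:
        return float(value), None
    except ValueError as exc:
        return None, str(exc)


# A numeric-looking string converts; a label such as "no" does not.
ok_value, ok_error = try_float("7.25")
bad_value, bad_error = try_float("no")
print(bad_error)  # could not convert string to float: 'no'


def profile_without_auto_corr(df, output_file):
    """Hypothetical helper: build a report with the 'auto' correlation disabled."""
    from ydata_profiling import ProfileReport  # imported lazily

    profile = ProfileReport(df, correlations={"auto": {"calculate": False}})
    profile.to_file(output_file=output_file)
    return profile
```

The report was still generated in the run above, so these messages indicate skipped calculations rather than a failed run.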


# Apply it to our Pakistan Population Dataset


In [5]:
# load the population dataset (forward slashes avoid the invalid "\s" escape)
df_pop = pd.read_csv('../day_8/sub-division_population_of_pakistan.csv')


In [6]:
profile = ydp.ProfileReport(df_pop)
profile.to_file(output_file="./outputs/ydata_pak_population.html")

(using `df.profile_report(correlations={"auto": {"calculate": False}})`
If this is problematic for your use case, please report this as an issue:
https://github.com/ydataai/ydata-profiling/issues
(include the error message: 'could not convert string to float: 'BAHAWALPUR DIVISION'')
Summarize dataset: 100%|██████████| 319/319 [01:03<00:00,  5.03it/s, Completed]                                                     
Generate report structure: 100%|██████████| 1/1 [00:14<00:00, 14.43s/it]
Render HTML: 100%|██████████| 1/1 [00:15<00:00, 15.36s/it]
Export report to file: 100%|██████████| 1/1 [00:00<00:00,  7.63it/s]
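
The population dataset took noticeably longer to summarize (319 steps in about a minute, versus 41 steps for Titanic). If that becomes a bottleneck, two common options are profiling a random sample of rows and using the library's `minimal=True` mode, which turns off the more expensive computations. The sketch below shows both. The helper names `sample_rows` and `profile_minimal`, the row limit, and the seed are illustrative assumptions.

```python
from pathlib import Path

import pandas as pd


def sample_rows(df, max_rows=10_000, seed=0):
    """Return df unchanged if it is small enough, else a reproducible sample."""
    if len(df) <= max_rows:
        return df
    return df.sample(n=max_rows, random_state=seed)


def profile_minimal(df, output_file):
    """Hypothetical helper: write a lighter report using minimal mode."""
    from ydata_profiling import ProfileReport  # imported lazily

    # Create the output folder first, in case it does not exist yet.
    Path(output_file).parent.mkdir(parents=True, exist_ok=True)
    profile = ProfileReport(df, minimal=True)
    profile.to_file(output_file=output_file)
    return profile


# Demonstrate the sampling step on a synthetic frame.
demo = pd.DataFrame({"value": range(100)})
small = sample_rows(demo, max_rows=10)
print(len(small))  # 10
```

A call such as `profile_minimal(sample_rows(df_pop), "./outputs/ydata_pak_population_minimal.html")` would combine the two; the output file name here is only an example.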


# Assignment: Download two datasets from Google Dataset Search, run ydata_profiling on each, and explain what you learn about the dataset and why the patterns you see occur