# 1. Load a Dataset

See [pandas IO tools][1] for ways to read data from various sources.

[1]: https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html
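
If your data lives in a file rather than a bundled sample, a pandas reader is the usual way to load it. The following is a minimal sketch that parses CSV text held in memory; the `csv_text` string and the `sample_data` name are illustrative stand-ins for your own file path or URL, not part of this notebook.

```python
import io

import pandas as pd

# Hypothetical in-memory CSV standing in for a file on disk or a URL.
csv_text = """sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,setosa
4.9,3.0,1.4,0.2,setosa
4.7,3.2,1.3,0.2,setosa
"""

# pd.read_csv accepts a path, a URL, or any file-like object.
sample_data = pd.read_csv(io.StringIO(csv_text))
print(sample_data.shape)
```

The resulting DataFrame can be passed to the functions used in the rest of this notebook in place of `iris_data`.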

In [1]:
# Using the iris sample dataset
import seaborn as sns
iris_data = sns.load_dataset('iris')
iris_data.head()

   sepal_length  sepal_width  petal_length  petal_width species
0           5.1          3.5           1.4          0.2  setosa
1           4.9          3.0           1.4          0.2  setosa
2           4.7          3.2           1.3          0.2  setosa
3           4.6          3.1           1.5          0.2  setosa
4           5.0          3.6           1.4          0.2  setosa


# 2. Get Summary Statistics

## 2.1 Univariate Features

In [2]:
from eda_report.univariate import Variable

sepal_width = Variable(iris_data['sepal_width'])
sepal_width

            Overview
Name: sepal_width,
Type: numeric,
Unique Values: 23 -> {2.9, 3.0, 3.5, 3.2, 3.6, 3.1, 3.9, 3.4, 3.7, [...],
Missing Values: None

        Summary Statistics
                        sepal_width
Number of observations   150.000000
Average                    3.057333
Standard Deviation         0.435866
Minimum                    2.000000
Lower Quartile             2.800000
Median                     3.000000
Upper Quartile             3.300000
Maximum                    4.400000
Skewness                   0.318966
Kurtosis                   0.228249
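
For reference, each row of the summary above has a direct pandas counterpart. The sketch below computes the same set of statistics on the five `sepal_width` values from the data preview, so the numbers differ from the full-column output. Whether `eda_report` computes them with exactly these calls is an assumption.

```python
import pandas as pd

# First five sepal_width values from the data preview (not the full column).
sepal_width_sample = pd.Series([3.5, 3.0, 3.2, 3.1, 3.6], name="sepal_width")

# Map each label in the summary table to a pandas Series method.
summary = pd.Series({
    "Number of observations": sepal_width_sample.count(),
    "Average": sepal_width_sample.mean(),
    "Standard Deviation": sepal_width_sample.std(),  # sample std (ddof=1)
    "Minimum": sepal_width_sample.min(),
    "Lower Quartile": sepal_width_sample.quantile(0.25),
    "Median": sepal_width_sample.median(),
    "Upper Quartile": sepal_width_sample.quantile(0.75),
    "Maximum": sepal_width_sample.max(),
    "Skewness": sepal_width_sample.skew(),
    "Kurtosis": sepal_width_sample.kurt(),
})
print(summary)
```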

## 2.2 Multivariate Datasets

In [3]:
from eda_report.multivariate import MultiVariable

iris_summary = MultiVariable(data=iris_data)
iris_summary

Bivariate analysis: 100%|████████████████████████████████████████████| 6/6 [00:01<00:00,  3.79it/s]


        Overview
Numeric features: sepal_length, sepal_width, petal_length, petal_width
Categorical features: species

        Summary Statistics (Numeric features)
       sepal_length  sepal_width  petal_length  petal_width
count    150.000000   150.000000    150.000000   150.000000
mean       5.843333     3.057333      3.758000     1.199333
std        0.828066     0.435866      1.765298     0.762238
min        4.300000     2.000000      1.000000     0.100000
25%        5.100000     2.800000      1.600000     0.300000
50%        5.800000     3.000000      4.350000     1.300000
75%        6.400000     3.300000      5.100000     1.800000
max        7.900000     4.400000      6.900000     2.500000

        Summary Statistics (Categorical features)
          species
count         150
unique          3
top     virginica
freq           50

        Bivariate Analysis (Correlation)
sepal_length & sepal_width --> very weak negative correlation (-0.12)
sepal_length & petal_width --> strong posi
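
The six steps in the progress bar are consistent with four numeric features forming six unordered pairs. As a rough sketch of that kind of pairwise analysis, the code below enumerates the pairs with `itertools.combinations` and reads Pearson coefficients from `DataFrame.corr()`. The `sample` values are a hypothetical toy subset, and the wording `eda_report` attaches to each coefficient (for example "very weak negative") is not reproduced because its thresholds are not given here.

```python
from itertools import combinations

import pandas as pd

# Hypothetical toy values using the iris column names (not the full dataset).
sample = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 7.0, 6.4, 6.3, 5.8],
    "sepal_width": [3.5, 3.0, 3.2, 3.2, 3.3, 2.7],
    "petal_length": [1.4, 1.4, 4.7, 4.5, 6.0, 5.1],
    "petal_width": [0.2, 0.2, 1.4, 1.5, 2.5, 1.9],
})

# Every unordered pair of numeric columns: C(4, 2) = 6 pairs.
pairs = list(combinations(sample.columns, 2))

# DataFrame.corr() uses the Pearson coefficient by default.
corr = sample.corr()

for first, second in pairs:
    print(f"{first} & {second} --> {corr.loc[first, second]:.2f}")
```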

# 3. Generate an EDA Report Document

You can customize the following arguments of `get_word_report`:

- `title`: default = 'Exploratory Data Analysis Report'
- `graph_color`: default = 'orangered'
- `output_filename`: default = 'eda-report.docx'
- `target_variable`

In [4]:
from eda_report import get_word_report

# Automatically analyse the data, plot graphs and save results as a .docx file
result = 'eda-report.docx'
get_word_report(iris_data, output_filename=result, target_variable='species')

[INFO 22:03:39.289] Assessing correlation in numeric variables...
Bivariate analysis: 100%|████████████████████████████████████████████| 6/6 [00:01<00:00,  4.02it/s]
[INFO 22:03:44.488] Done. Summarising each variable...
Univariate analysis: 100%|███████████████████████████████████████████| 5/5 [00:02<00:00,  2.09it/s]
[INFO 22:03:47.081] Done. Results saved as 'eda-report.docx'


## 3.1 Download the Report

Click on the link in the output of the cell below to download the generated report document.

In [5]:
from IPython.display import FileLink

FileLink(result)
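
The link only works if the report file exists in the notebook's working directory. A small check before displaying the link can catch a wrong filename early. This is a sketch, and `report_is_ready` is a hypothetical helper rather than part of `eda_report` or IPython.

```python
from pathlib import Path


def report_is_ready(path):
    """Return True if `path` points to an existing, non-empty file."""
    report_path = Path(path)
    return report_path.is_file() and report_path.stat().st_size > 0


# The filename matches the default output_filename listed earlier.
print(report_is_ready("eda-report.docx"))
```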