# 1. Load a Dataset

See [pandas IO tools][1] for ways to read data from various sources.

[1]: https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html
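
As a minimal sketch of one of those IO tools, the snippet below parses CSV text with `pandas.read_csv`. The in-memory buffer and the name `sample` are illustrative assumptions; in practice you would usually pass a file path or URL instead.

```python
import io

import pandas as pd

# Illustrative CSV content held in memory; a path or URL works the same way.
csv_text = """mpg,cylinders,horsepower
18.0,8,130.0
15.0,8,165.0
18.0,8,150.0
"""

# read_csv accepts any file-like object, so wrap the text in StringIO.
sample = pd.read_csv(io.StringIO(csv_text))
print(sample.dtypes)
```

Other readers such as `pandas.read_excel` and `pandas.read_json` follow the same pattern of returning a `DataFrame`.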

In [1]:
# Using the auto-mpg sample dataset
import seaborn as sns
data = sns.load_dataset('mpg')
data.head()

| | mpg | cylinders | displacement | horsepower | weight | acceleration | model_year | origin | name |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 18.0 | 8 | 307.0 | 130.0 | 3504 | 12.0 | 70 | usa | chevrolet chevelle malibu |
| 1 | 15.0 | 8 | 350.0 | 165.0 | 3693 | 11.5 | 70 | usa | buick skylark 320 |
| 2 | 18.0 | 8 | 318.0 | 150.0 | 3436 | 11.0 | 70 | usa | plymouth satellite |
| 3 | 16.0 | 8 | 304.0 | 150.0 | 3433 | 12.0 | 70 | usa | amc rebel sst |
| 4 | 17.0 | 8 | 302.0 | 140.0 | 3449 | 10.5 | 70 | usa | ford torino |


# 2. Get Summary Statistics

## 2.1 Univariate Features

In [2]:
from eda_report.univariate import Variable

horse_power = Variable(data['horsepower'])
horse_power

            Overview
Name: horsepower,
Type: numeric,
Unique Values: 93 -> {nan, 46.0, 48.0, 49.0, 52.0, 53.0, 54.0, 58.0, [...],
Missing Values: 6 (1.51%)

        Summary Statistics
                        horsepower
Number of observations  392.000000
Average                 104.469388
Standard Deviation       38.491160
Minimum                  46.000000
Lower Quartile           75.000000
Median                   93.500000
Upper Quartile          126.000000
Maximum                 230.000000
Skewness                  1.087326
Kurtosis                  0.696947
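
For comparison, similar figures can be computed directly with pandas. This is only a sketch on a small illustrative series (the values and the name `values` are made up for the example), and it is an assumption that `eda_report` uses the same estimators internally.

```python
import pandas as pd

# Small illustrative series with one missing value (not the real dataset).
values = pd.Series([46.0, 75.0, 93.5, 126.0, 230.0, None], name="horsepower")

# Missing values are counted separately and skipped by the statistics below.
missing = int(values.isna().sum())
missing_pct = 100 * missing / len(values)

summary = pd.Series(
    {
        "Number of observations": values.count(),  # non-missing values only
        "Average": values.mean(),
        "Standard Deviation": values.std(),  # sample standard deviation (ddof=1)
        "Minimum": values.min(),
        "Lower Quartile": values.quantile(0.25),
        "Median": values.median(),
        "Upper Quartile": values.quantile(0.75),
        "Maximum": values.max(),
        "Skewness": values.skew(),
        "Kurtosis": values.kurt(),  # excess kurtosis
    },
    name=values.name,
)

print(f"Missing Values: {missing} ({missing_pct:.2f}%)")
print(summary)
```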

## 2.2 Multivariate Datasets

In [3]:
from eda_report.multivariate import MultiVariable

mpg = MultiVariable(data)
mpg

Bivariate analysis: 100%|██████████████████████████████████████████| 21/21 [00:05<00:00,  3.78it/s]


        Overview
Numeric features: mpg, cylinders, displacement, horsepower, weight, acceleration, model_year
Categorical features: origin, name

        Summary Statistics (Numeric features)
              mpg   cylinders  displacement  horsepower       weight  \
count  398.000000  398.000000    398.000000  392.000000   398.000000   
mean    23.514573    5.454774    193.425879  104.469388  2970.424623   
std      7.815984    1.701004    104.269838   38.491160   846.841774   
min      9.000000    3.000000     68.000000   46.000000  1613.000000   
25%     17.500000    4.000000    104.250000   75.000000  2223.750000   
50%     23.000000    4.000000    148.500000   93.500000  2803.500000   
75%     29.000000    8.000000    262.000000  126.000000  3608.000000   
max     46.600000    8.000000    455.000000  230.000000  5140.000000   

       acceleration  model_year  
count    398.000000  398.000000  
mean      15.568090   76.010050  
std        2.757689    3.697627  
min        8.000000   7
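
The progress bar total of 21 matches the number of unordered pairs that can be formed from the seven numeric features listed in the overview. That each step corresponds to one such pair is an inference from the output, not something the source states; the sketch below only checks the arithmetic.

```python
from itertools import combinations

# Numeric features as listed in the overview output.
numeric_features = [
    "mpg",
    "cylinders",
    "displacement",
    "horsepower",
    "weight",
    "acceleration",
    "model_year",
]

# Every unordered pair of distinct features: C(7, 2) = 21.
feature_pairs = list(combinations(numeric_features, 2))
print(len(feature_pairs))
```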

# 3. Generate an EDA Report Document

You can customize the following arguments of `get_word_report`:

- `title`: default = 'Exploratory Data Analysis Report'
- `graph_color`: default = 'orangered'
- `output_filename`: default = 'eda-report.docx'
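
As a sketch of how these options might be overridden, the hypothetical helper `write_custom_report` below passes all three as keyword arguments. The title, the color `'teal'`, and the file name are illustrative, and it is an assumption that any named matplotlib color is accepted for `graph_color`.

```python
def write_custom_report(data, output_filename="mpg-report.docx"):
    """Generate a report with a custom title, graph color and file name."""
    # Imported here so that defining this helper does not itself require
    # eda_report to be installed.
    from eda_report import get_word_report

    get_word_report(
        data,
        title="Auto MPG Exploratory Analysis",  # illustrative title
        graph_color="teal",  # assumed to accept named matplotlib colors
        output_filename=output_filename,
    )
```

Calling `write_custom_report(data)` with the DataFrame loaded in section 1 should then save the results as `mpg-report.docx` rather than the default `eda-report.docx`.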

In [4]:
from eda_report import get_word_report

# Automatically analyse the data, plot graphs and save results as a .docx file
get_word_report(data)

[INFO 17:19:42.572] Assessing correlation in numeric variables...
Bivariate analysis: 100%|██████████████████████████████████████████| 21/21 [00:05<00:00,  3.79it/s]
[INFO 17:19:54.631] Done. Summarising each variable...
Univariate analysis: 100%|███████████████████████████████████████████| 9/9 [00:04<00:00,  2.25it/s]
[INFO 17:19:59.285] Done. Results saved as 'eda-report.docx'
