Note: This exercise is adapted from the original here. As of September 2020, installing pandas_profiling from conda may give you an old version (1.4.1), because some conda channels lag behind the latest release on PyPI (2.9.0 as of September 2020). The exact environment and library versions used to run this exercise are listed in the Pipfile of this example here (see pipenv for more details).

Pandas Profiling: NASA Meteorites example
Source of data: https://data.nasa.gov/Space-Science/Meteorite-Landings/gh4g-9sfh

The autoreload instruction reloads modules automatically before code execution, which is helpful for the update below.

In [2]:

%load_ext autoreload
%autoreload 2

Make sure that we have the latest version of pandas-profiling.

In [4]:
# Run the lines below if you need to install or upgrade the pandas-profiling library
import sys
!{sys.executable} -m pip install -U pandas-profiling[notebook]
!jupyter nbextension enable --py widgetsnbextension

Collecting pandas-profiling[notebook]
  Using cached pandas_profiling-2.9.0-py2.py3-none-any.whl (258 kB)
Collecting matplotlib>=3.2.0
  Downloading matplotlib-3.3.3-cp37-cp37m-win_amd64.whl (8.5 MB)
Collecting seaborn>=0.10.1
  Using cached seaborn-0.11.0-py3-none-any.whl (283 kB)
Collecting confuse>=1.0.0
  Using cached confuse-1.4.0-py2.py3-none-any.whl (21 kB)
Collecting tqdm>=4.43.0
  Using cached tqdm-4.54.0-py2.py3-none-any.whl (69 kB)
Processing c:\users\binhkn\appdata\local\pip\cache\wheels\70\e1\52\5b14d250ba868768823940c3229e9950d201a26d0bd3ee8655\htmlmin-0.1.12-py3-none-any.whl
Collecting phik>=0.9.10
  Using cached phik-0.10.0-py3-none-any.whl (599 kB)
Collecting pandas!=1.0.0,!=1.0.1,!=1.0.2,!=1.1.0,>=0.25.3
  Downloading pandas-1.1.4-cp37-cp37m-win_amd64.whl (8.7 MB)
Collecting tangled-up-in-unicode>=0.0.6
  Using cached tangled_up_in_unicode-0.0.6-py3-none-any.whl (3.1 MB)
Collecting visions[type_image_path]==0.5.0
  Using cached visions-0.5.0-py3-none-any.whl (64 kB)
C

You might want to restart the kernel now.
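
After restarting, it can help to confirm which version the kernel actually sees, since conda and pip may have installed different releases. The following is a minimal sketch using only the standard library (`importlib.metadata`, available from Python 3.8); the helper name `installed_version` is made up for illustration and is not part of pandas-profiling.

```python
from importlib import metadata  # standard library, Python 3.8+


def installed_version(dist_name):
    """Return the installed version string of a distribution, or None if it is missing."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# "pandas-profiling" is the distribution name used in the pip command above.
print("pandas-profiling:", installed_version("pandas-profiling"))
```

If this prints `None`, the package is not installed in the environment the kernel is running in.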

Import libraries

In [3]:
from pathlib import Path

import requests
import numpy as np
import pandas as pd

import pandas_profiling
from pandas_profiling.utils.cache import cache_file

ModuleNotFoundError: No module named 'pandas_profiling'
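
The `cache_file` helper imported above is only available once `pandas_profiling` imports successfully. Judging from how it is called later in this notebook, it takes a file name and a URL and returns a local path. The download-once pattern that such a helper presumably follows can be sketched with the standard library alone. This is an illustration under that assumption, not pandas-profiling's implementation; the function name `cached_download` and the `data_cache` directory are hypothetical.

```python
from pathlib import Path
from urllib.request import urlopen


def cached_download(file_name, url, cache_dir="data_cache"):
    """Download url to cache_dir/file_name once and return the local path.

    Hypothetical helper for illustration; not pandas-profiling's cache_file.
    """
    target = Path(cache_dir) / file_name
    if not target.exists():
        # First call: create the cache directory and fetch the file.
        target.parent.mkdir(parents=True, exist_ok=True)
        with urlopen(url) as response:
            target.write_bytes(response.read())
    # Later calls: reuse the local copy without touching the network.
    return target


# Example call (commented out because it needs network access):
# path = cached_download(
#     "meteorites.csv",
#     "https://data.nasa.gov/api/views/gh4g-9sfh/rows.csv?accessType=DOWNLOAD",
# )
```

The returned path can be passed straight to `pd.read_csv`.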

Load and prepare example dataset
We add some fake variables for illustrating pandas-profiling capabilities

In [None]:
file_name = cache_file(
    "meteorites.csv",
    "https://data.nasa.gov/api/views/gh4g-9sfh/rows.csv?accessType=DOWNLOAD",
)

df = pd.read_csv(file_name)

# Note: pandas' default nanosecond timestamps cannot represent dates before 1677-09-21;
# errors='coerce' turns such values into NaT, so they are ignored in this analysis
df['year'] = pd.to_datetime(df['year'], errors='coerce')

# Example: Constant variable
df['source'] = "NASA"

# Example: Boolean variable
df['boolean'] = np.random.choice([True, False], df.shape[0])

# Example: Mixed with base types
df['mixed'] = np.random.choice([1, "A"], df.shape[0])

# Example: Highly correlated variables
df['reclat_city'] = df['reclat'] + np.random.normal(scale=5,size=(len(df)))

# Example: Duplicate observations
duplicates_to_add = df.iloc[0:10].copy()
duplicates_to_add['name'] = duplicates_to_add['name'] + " copy"

# DataFrame.append was removed in pandas 2.0; pd.concat works across versions
df = pd.concat([df, duplicates_to_add], ignore_index=True)
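
The `errors='coerce'` argument used for the `year` column above silently replaces values that cannot be converted with `NaT`. The small sketch below shows that behaviour on made-up values (not taken from the meteorite data) and counts how many entries were lost. In pandas versions that store timestamps with nanosecond resolution, dates outside the range `pd.Timestamp.min` to `pd.Timestamp.max` are coerced in the same way.

```python
import pandas as pd

# Toy values (made up for illustration, not taken from the meteorite data).
raw = pd.Series(["1990-01-01", "not a date"])

# errors="coerce" replaces values that cannot be parsed with NaT instead of raising.
parsed = pd.to_datetime(raw, errors="coerce")
print(parsed)

# Bounds of the nanosecond-resolution Timestamp type.
print(pd.Timestamp.min, pd.Timestamp.max)

# Count how many values were lost to coercion.
n_lost = int(parsed.isna().sum() - raw.isna().sum())
print("values coerced to NaT:", n_lost)
```

Running the same count on `df['year']` before and after conversion is a quick way to see how many rows the profile report will treat as missing.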

Inline report without saving object

In [None]:
report = df.profile_report(sort='None', html={'style':{'full_width': True}}, progress_bar=False)
report

Save report to file

In [None]:
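# Create the output directory first; to_file may fail if it is missing
Path("tmp").mkdir(exist_ok=True)
profile_report = df.profile_report(html={'style': {'full_width': True}})
profile_report.to_file("tmp/example.html")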

More analysis (Unicode) and printing an existing ProfileReport object inline

In [None]:
profile_report = df.profile_report(explorative=True, html={'style': {'full_width': True}})
profile_report

Notebook Widgets

In [None]:
profile_report.to_widgets()