**Note**: This exercise is adapted from the original [here](https://github.com/pandas-profiling/pandas-profiling/blob/master/examples/meteorites/meteorites.ipynb). As of September 2020, installing [pandas_profiling from conda](https://anaconda.org/conda-forge/pandas-profiling) might give you an old version (1.41), because some conda channels for this package lag behind the latest release on [PyPI](https://pypi.org/project/pandas-profiling/) (2.9.0 as of September 2020). The exact environment and library versions used to run this exercise are listed in the Pipfile (see [pipenv](https://pipenv-fork.readthedocs.io/en/latest/) for more details) of this example [here](https://github.com/andrewm4894/pandas-profiling/blob/master/Pipfile).

## Pandas Profiling: NASA Meteorites example
Source of data: https://data.nasa.gov/Space-Science/Meteorite-Landings/gh4g-9sfh

The `autoreload` extension reloads modules automatically before code execution, which is helpful after the package update below.

In [1]:
%load_ext autoreload
%autoreload 2

Make sure that we have the intended version of pandas-profiling (2.9.0 is pinned below).

In [2]:
# Run this cell if you need to pip install the pandas-profiling library
import sys
!{sys.executable} -m pip install -U pandas-profiling==2.9.0
!jupyter nbextension enable --py widgetsnbextension

Collecting pandas-profiling==2.9.0
  Downloading pandas_profiling-2.9.0-py2.py3-none-any.whl (258 kB)
Collecting htmlmin>=0.1.12
  Downloading htmlmin-0.1.12.tar.gz (19 kB)
  Preparing metadata (setup.py) ... done
Collecting phik>=0.9.10
  Downloading phik-0.12.3-cp39-cp39-macosx_10_13_x86_64.whl (652 kB)
Collecting tangled-up-in-unicode>=0.0.6
  Downloading tangled_up_in_unicode-0.2.0-py3-none-any.whl (4.7 MB)
Collecting confuse>=1.0.0
  Downloading confuse-2.0.1-py3-none-any.whl (24 kB)
Collecting visions[type_image_path]==0.5.0
  Downloading visions-0.5.0-py3-none-any.whl (64 kB)


Building wheels for collected packages: htmlmin
  Building wheel for htmlmin (setup.py) ... done
  Created wheel for htmlmin: filename=htmlmin-0.1.12-py3-none-any.whl size=27082 sha256=9ce4f98ce3290514bea5cd73e83e46e9467c92ab1abf0386e934daaa6f959b97
  Stored in directory: /Users/andreaternera/Library/Caches/pip/wheels/1d/05/04/c6d7d3b66539d9e659ac6dfe81e2d0fd4c1a8316cc5a403300
Successfully built htmlmin


Installing collected packages: htmlmin, tangled-up-in-unicode, confuse, imagehash, visions, phik, missingno, pandas-profiling
Successfully installed confuse-2.0.1 htmlmin-0.1.12 imagehash-4.3.1 missingno-0.5.2 pandas-profiling-2.9.0 phik-0.12.3 tangled-up-in-unicode-0.2.0 visions-0.5.0
Enabling notebook extension jupyter-js-widgets/extension...
      - Validating: OK


You might want to restart the kernel now.
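
After restarting, it can be worth confirming which versions the kernel actually sees. The following is a minimal sketch using only the standard library; the helper name `installed_version` is our own and not part of pandas-profiling.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


# The distribution name matches the one used in the pip command above.
print("pandas-profiling:", installed_version("pandas-profiling"))
print("pandas:", installed_version("pandas"))
```

If the first line prints `None` or a version other than 2.9.0, the kernel is probably using a different environment than the one pip installed into.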

### Import libraries

In [4]:
from pathlib import Path

import requests
import numpy as np
import pandas as pd

import pandas_profiling
from pandas_profiling.utils.cache import cache_file

ImportError: cannot import name 'ABCIndexClass' from 'pandas.core.dtypes.generic' (/Users/andreaternera/opt/anaconda3/lib/python3.9/site-packages/pandas/core/dtypes/generic.py)
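
The ImportError above says that something in the import chain asks pandas for `ABCIndexClass`, and the installed pandas does not provide it. That usually points to a version mismatch between pandas and the profiling stack. As a rough diagnostic, you could check whether the running pandas still exposes that name. This is a sketch only, and `has_symbol` is a hypothetical helper, not a library function.

```python
import importlib


def has_symbol(module_name, attr_name):
    """Return True if the module can be imported and exposes the attribute."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return False
    return hasattr(module, attr_name)


# False here means the installed pandas does not expose the name
# (or pandas itself is not importable).
print(has_symbol("pandas.core.dtypes.generic", "ABCIndexClass"))
```

If this prints `False`, one way forward is to align pandas with the library versions recorded in the Pipfile mentioned at the top.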

### Load and prepare example dataset
We add some fake variables for illustrating pandas-profiling capabilities

In [5]:
file_name = cache_file(
    "meteorites.csv",
    "https://data.nasa.gov/api/views/gh4g-9sfh/rows.csv?accessType=DOWNLOAD",
)
    
df = pd.read_csv(file_name)
    
# Note: pandas nanosecond timestamps cannot represent dates before 1677-09-21;
# errors='coerce' turns such values into NaT, so they are ignored in this analysis
df['year'] = pd.to_datetime(df['year'], errors='coerce')

# Example: Constant variable
df['source'] = "NASA"

# Example: Boolean variable
df['boolean'] = np.random.choice([True, False], df.shape[0])

# Example: Mixed with base types
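# dtype=object keeps 1 as an int; a plain list [1, "A"] would be coerced to strings
df['mixed'] = np.random.choice(np.array([1, "A"], dtype=object), df.shape[0])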

# Example: Highly correlated variables
df['reclat_city'] = df['reclat'] + np.random.normal(scale=5,size=(len(df)))

# Example: Duplicate observations
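# Copy the first ten rows so the rename below does not touch df itself
duplicates_to_add = df.iloc[0:10].copy()
duplicates_to_add['name'] = duplicates_to_add['name'] + " copy"

# DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent
df = pd.concat([df, duplicates_to_add], ignore_index=True)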

NameError: name 'cache_file' is not defined
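
The NameError is a knock-on effect of the failed import in the previous cell: `cache_file` was never defined. To illustrate the general idea behind such a helper, here is a rough standard-library sketch of a download-once cache. It is not pandas-profiling's implementation. The name `download_once`, its `cache_dir` default, and the injectable `fetch` parameter are assumptions made for illustration.

```python
from pathlib import Path
from urllib.request import urlopen


def _fetch_bytes(url):
    """Download the raw bytes behind a URL (default fetcher)."""
    with urlopen(url) as response:
        return response.read()


def download_once(file_name, url, cache_dir="data", fetch=_fetch_bytes):
    """Return the path of a cached file, downloading it only if it is missing."""
    cache_path = Path(cache_dir)
    cache_path.mkdir(parents=True, exist_ok=True)
    file_path = cache_path / file_name
    if not file_path.exists():
        # Only reached on the first call for this file name.
        file_path.write_bytes(fetch(url))
    return file_path
```

Calling `download_once("meteorites.csv", url)` a second time with the same arguments would return the same path without touching the network.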

### Inline report without saving object

In [None]:
report = df.profile_report(sort='None', html={'style':{'full_width': True}}, progress_bar=False)
report

### Save report to file

In [None]:
profile_report = df.profile_report(html={'style': {'full_width': True}})
profile_report.to_file("tmp/example.html")
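
Writing to `tmp/example.html` assumes that the `tmp` directory already exists; if it does not, the write may fail. A small sketch that creates the parent directory first, using only the standard library (`ensure_parent_dir` is a hypothetical helper name):

```python
from pathlib import Path


def ensure_parent_dir(file_path):
    """Create the parent directory of file_path if needed and return it as a Path."""
    path = Path(file_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    return path
```

In the notebook this could be used as `profile_report.to_file(ensure_parent_dir("tmp/example.html"))`.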

### More analysis (Unicode) and Print existing ProfileReport object inline

In [None]:
profile_report = df.profile_report(explorative=True, html={'style': {'full_width': True}})
profile_report

### Notebook Widgets

In [None]:
profile_report.to_widgets()