Yellowbrick hosts several datasets wrangled from the UCI Machine Learning Repository to present the examples in this section. If you haven't downloaded the data, you can do so by running:
$ python -m yellowbrick.download
This should create a folder called data in your current working directory with all of the datasets. You can load a specified dataset with pandas.read_csv as follows:
import pandas as pd
data = pd.read_csv('data/concrete/concrete.csv')
The following code snippet can be found at the top of the examples/examples.ipynb notebook in Yellowbrick. Refer to this code when loading a specific dataset:
import os

import pandas as pd

from yellowbrick.download import download_all

## The path to the test data sets
FIXTURES = os.path.join(os.getcwd(), "data")

## Dataset loading mechanisms
datasets = {
    "bikeshare": os.path.join(FIXTURES, "bikeshare", "bikeshare.csv"),
    "concrete": os.path.join(FIXTURES, "concrete", "concrete.csv"),
    "credit": os.path.join(FIXTURES, "credit", "credit.csv"),
    "energy": os.path.join(FIXTURES, "energy", "energy.csv"),
    "game": os.path.join(FIXTURES, "game", "game.csv"),
    "mushroom": os.path.join(FIXTURES, "mushroom", "mushroom.csv"),
    "occupancy": os.path.join(FIXTURES, "occupancy", "occupancy.csv"),
}


def load_data(name, download=True):
    """
    Loads and wrangles the passed in dataset by name.
    If download is specified, this method will download any missing files.
    """
    # Get the path from the datasets
    path = datasets[name]

    # Check if the data exists, otherwise download or raise
    if not os.path.exists(path):
        if download:
            download_all()
        else:
            raise ValueError((
                "'{}' dataset has not been downloaded, "
                "use the download.py module to fetch datasets"
            ).format(name))

    # Return the data frame
    return pd.read_csv(path)
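Before loading, it can be useful to see which dataset files are already on disk. The following is a minimal sketch, not part of Yellowbrick: the helper name missing_datasets is hypothetical, and it assumes the data/<name>/<name>.csv layout used by the paths above. It runs against a temporary directory with a placeholder file, so no real download is required.

```python
import os
import tempfile


def missing_datasets(names, root="data"):
    """Return the names whose <root>/<name>/<name>.csv file is absent.

    Hypothetical helper; assumes the <root>/<name>/<name>.csv layout.
    """
    return [
        name for name in names
        if not os.path.exists(os.path.join(root, name, name + ".csv"))
    ]


# Demonstrate against a throwaway directory rather than a real download.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "concrete"))
    with open(os.path.join(root, "concrete", "concrete.csv"), "w") as f:
        f.write("placeholder\n")  # stand-in file, not real data

    result = missing_datasets(["concrete", "energy"], root=root)
    print(result)  # ['energy']
```

An empty result would indicate that every requested file is present, in which case load_data should not need to trigger a download.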
Unless specifically shown otherwise, most of the examples currently use one or more of the datasets listed below. Each dataset has a README.md with detailed information about the data source, attributes, and target. Here is a complete listing of all datasets in Yellowbrick and their associated analytical tasks:
- bikeshare: suitable for regression
- concrete: suitable for regression
- credit: suitable for classification/clustering
- energy: suitable for regression
- game: suitable for classification
- hobbies: suitable for text analysis
- mushroom: suitable for classification/clustering
- occupancy: suitable for classification
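The listing above can also be expressed as a small lookup table for picking a dataset by task. This is an illustrative sketch only: DATASET_TASKS and datasets_for are hypothetical names, not part of the Yellowbrick API, and the task labels are copied directly from the list.

```python
# Task labels copied from the listing above; "classification/clustering"
# entries are recorded under both tasks.
DATASET_TASKS = {
    "bikeshare": {"regression"},
    "concrete": {"regression"},
    "credit": {"classification", "clustering"},
    "energy": {"regression"},
    "game": {"classification"},
    "hobbies": {"text analysis"},
    "mushroom": {"classification", "clustering"},
    "occupancy": {"classification"},
}


def datasets_for(task):
    """Return the sorted dataset names listed as suitable for a task."""
    return sorted(
        name for name, tasks in DATASET_TASKS.items() if task in tasks
    )


print(datasets_for("regression"))  # ['bikeshare', 'concrete', 'energy']
print(datasets_for("clustering"))  # ['credit', 'mushroom']
```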