Understanding scikit-learn datasets

Machine Learning Tutorial

Using scikit-learn library

Import the dataset. For example, let's use the Iris Species Dataset.

from sklearn.datasets import load_iris

iris_dataset = load_iris()

Now, to understand how the dataset is organized, check the keys.

print("\nKeys of iris_dataset: \n{}\n".format(iris_dataset.keys()))

# Keys of iris_dataset: 
# dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename'])
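The object returned by load_iris() is a scikit-learn Bunch, a dictionary subclass whose keys can also be read as attributes. The following is a minimal sketch of that behaviour; the variable names same_object and is_dict are only illustrative.

```python
from sklearn.datasets import load_iris

iris_dataset = load_iris()

# Dictionary-style and attribute-style access return the same object.
same_object = iris_dataset['data'] is iris_dataset.data
print(same_object)

# Because Bunch subclasses dict, the usual dict checks work too.
is_dict = isinstance(iris_dataset, dict)
print(is_dict)
print('target' in iris_dataset)
```

Either access style can be used; this page sticks to the dictionary style.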

DESCR
Stores a description of the dataset.

print(iris_dataset['DESCR'][:193] + "\n...\n")

# Iris plants dataset
# --------------------
# 
# **Data Set Characteristics:**
# 
#     :Number of Instances: 150 (50 in each of three classes)
#     :Number of Attributes: 4 numeric, pre
# ...

target_names
Array of strings containing the species of flower we want to predict.

print("Target names: \n{}\n".format(iris_dataset['target_names']))

# Target names: 
# ['setosa' 'versicolor' 'virginica']

feature_names
List of strings giving the description of each feature.

print("Feature names: \n{}\n".format(iris_dataset['feature_names']))

# Feature names: 
# ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']

data
The data itself is stored in the data and target fields.
data -> numeric measurements of sepal length, sepal width, petal length and petal width, one row per flower.
target -> species of each flower measured in the data array.
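When only these two arrays are needed, load_iris() can return them directly. The sketch below assumes the return_X_y=True option of the loader; the names X, y, first_measurements and first_label are illustrative choices, not part of the original walkthrough.

```python
from sklearn.datasets import load_iris

# return_X_y=True skips the Bunch and returns the two arrays directly.
X, y = load_iris(return_X_y=True)

# Row i of X and entry i of y describe the same flower,
# so both arrays have the same length along the first axis.
print(X.shape[0] == y.shape[0])

first_measurements, first_label = X[0], y[0]
print(first_measurements, first_label)
```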

print("Type of data: {}\n".format(type(iris_dataset['data'])))

# Type of data: <class 'numpy.ndarray'>

print("Shape of data: {}\n".format(iris_dataset['data'].shape))

# Shape of data: (150, 4)

print("First five rows of data: \n{}\n".format(iris_dataset['data'][:5]))

# First five rows of data: 
# [[5.1 3.5 1.4 0.2]
#  [4.9 3.  1.4 0.2]
#  [4.7 3.2 1.3 0.2]
#  [4.6 3.1 1.5 0.2]
#  [5.  3.6 1.4 0.2]]
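Each column of the data array corresponds to the entry at the same position in feature_names. As a rough sketch of how the two fit together, the snippet below pairs every feature name with the mean of its column; column_means and mean_by_feature are hypothetical helper names.

```python
from sklearn.datasets import load_iris

iris_dataset = load_iris()

# Average each column of the data array (axis=0 runs down the rows).
column_means = iris_dataset['data'].mean(axis=0)

# feature_names[i] describes column i, so zip pairs each name with its mean.
mean_by_feature = dict(zip(iris_dataset['feature_names'], column_means))

for name, mean in mean_by_feature.items():
    print("{}: {:.2f}".format(name, mean))
```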

target
The species are encoded as integers from 0 to 2. The meaning of each number is given by the iris_dataset['target_names'] array: 0 is setosa, 1 is versicolor and 2 is virginica.

print("Type of target: {}\n".format(type(iris_dataset['target'])))

# Type of target: <class 'numpy.ndarray'>

print("Shape of target: {}\n".format(iris_dataset['target'].shape))

# Shape of target: (150,)

print("Target:\n{}\n".format(iris_dataset['target']))

# Target:
# [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
#  0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
#  1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
#  2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
#  2 2]
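The integer codes can be turned back into species names by using them as indices into target_names. This is a small sketch relying on NumPy integer-array indexing and np.bincount; the names species and counts are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris

iris_dataset = load_iris()

# Indexing target_names with the integer codes yields one name per flower.
species = iris_dataset['target_names'][iris_dataset['target']]
print(species[:3])

# Count how many flowers fall into each class.
counts = np.bincount(iris_dataset['target'])
for name, count in zip(iris_dataset['target_names'], counts):
    print("{}: {}".format(name, count))
```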

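frame
A pandas DataFrame holding the data and target columns together. It is only populated when the dataset is loaded with load_iris(as_frame=True); with the default call used on this page it is None.

filename
Location of the CSV file the data was read from. Depending on the scikit-learn version this is a full path or just the file name.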
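The frame and filename fields can be inspected in the same way as the others. The sketch below assumes a scikit-learn version that exposes both keys (as in the key listing above) and treats pandas as optional, since the as_frame=True option needs it; default_bunch and frame are illustrative names.

```python
from sklearn.datasets import load_iris

default_bunch = load_iris()

# With the default call, no DataFrame is built.
print(default_bunch['frame'])

# The CSV file the data was read from (a path or a bare file name,
# depending on the scikit-learn version).
print(default_bunch['filename'])

try:
    import pandas  # noqa: F401  (only needed for as_frame=True)
except ImportError:
    frame = None
    print("pandas is not installed; skipping the as_frame=True example")
else:
    # as_frame=True fills 'frame' with the features plus a 'target' column.
    frame = load_iris(as_frame=True)['frame']
    print(frame.shape)
    print(list(frame.columns))
```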
