
Understanding scikit-learn datasets

Nori12 edited this page Jul 20, 2020 · 1 revision

Using the scikit-learn library

Import the dataset. For example, let's use the Iris Species Dataset.

from sklearn.datasets import load_iris

iris_dataset = load_iris()

Now, to understand how the dataset is organized, check the keys.

print("\nKeys of iris_dataset: \n{}\n".format(iris_dataset.keys()))

# Keys of iris_dataset: 
# dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename'])
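
As a small aside on how these keys can be used: the object returned by load_iris() is a scikit-learn Bunch, a dictionary subclass whose keys are also exposed as attributes. The sketch below assumes a default scikit-learn installation; the variable names same_array, X and y are only illustrative.

```python
from sklearn.datasets import load_iris

iris_dataset = load_iris()

# A Bunch is a dict subclass, so every key can also be read as an attribute.
same_array = iris_dataset.data is iris_dataset['data']
print("Attribute and key access return the same object: {}".format(same_array))

# return_X_y=True skips the Bunch and returns only the (data, target) pair.
X, y = load_iris(return_X_y=True)
print("X shape: {}, y shape: {}".format(X.shape, y.shape))
```

Key access is used throughout this page, but attribute access is equivalent for the keys listed above.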
  • DESCR
    Stores a description of the dataset.
print(iris_dataset['DESCR'][:193] + "\n...\n")

# Iris plants dataset
# --------------------
# 
# **Data Set Characteristics:**
# 
#     :Number of Instances: 150 (50 in each of three classes)
#     :Number of Attributes: 4 numeric, pre
# ...
  • target_names
    Array of strings containing the species of flower we want to predict.
print("Target names: \n{}\n".format(iris_dataset['target_names']))

# Target names: 
# ['setosa' 'versicolor' 'virginica']
  • feature_names
    List of strings giving the description of each feature.
print("Feature names: \n{}\n".format(iris_dataset['feature_names']))

# Feature names: 
# ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
  • data
    The samples themselves are stored in two fields, data and target.
    data -> numeric measurements of sepal length, sepal width, petal length and petal width, one row per flower.
    target -> species of each flower measured in the data array.
print("Type of data: {}\n".format(type(iris_dataset['data'])))

# Type of data: <class 'numpy.ndarray'>

print("Shape of data: {}\n".format(iris_dataset['data'].shape))

# Shape of data: (150, 4)

print("First five rows of data: \n{}\n".format(iris_dataset['data'][:5]))

# First five rows of data: 
# [[5.1 3.5 1.4 0.2]
#  [4.9 3.  1.4 0.2]
#  [4.7 3.2 1.3 0.2]
#  [4.6 3.1 1.5 0.2]
#  [5.  3.6 1.4 0.2]]
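
To make the layout of the data array more concrete, here is a minimal sketch that pairs the first row with the feature names and then slices out a single column. It assumes the columns follow the order of feature_names, as the shapes above suggest; first_flower and petal_length are illustrative names.

```python
from sklearn.datasets import load_iris

iris_dataset = load_iris()
data = iris_dataset['data']

# Each row is one flower; each column is one feature, in feature_names order.
first_flower = dict(zip(iris_dataset['feature_names'], data[0]))
for name, value in first_flower.items():
    print("{}: {}".format(name, value))

# Slicing a column gives every flower's value for a single feature.
petal_length = data[:, 2]
print("Number of petal length values: {}".format(petal_length.shape[0]))
```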
  • target
    The species are encoded as integers from 0 to 2. Each integer is an index into the iris_dataset['target_names'] array: 0 is setosa, 1 is versicolor and 2 is virginica.
print("Type of target: {}\n".format(type(iris_dataset['target'])))

# Type of target: <class 'numpy.ndarray'>

print("Shape of target: {}\n".format(iris_dataset['target'].shape))

# Shape of target: (150,)

print("Target:\n{}\n".format(iris_dataset['target']))

# Target:
# [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
#  0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
#  1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
#  2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
#  2 2]
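
One way to turn the integer codes back into species names is to index target_names with the target array. The sketch below is illustrative rather than part of the original walkthrough; species and counts are assumed names.

```python
import numpy as np
from sklearn.datasets import load_iris

iris_dataset = load_iris()
target = iris_dataset['target']
target_names = iris_dataset['target_names']

# Indexing target_names with the integer codes gives one species name per flower.
species = target_names[target]
print("First flower: {} -> {}".format(target[0], species[0]))
print("Last flower: {} -> {}".format(target[-1], species[-1]))

# Count how many flowers carry each code.
counts = np.bincount(target)
for name, count in zip(target_names, counts):
    print("{}: {}".format(name, count))
```

The counts should agree with the "50 in each of three classes" line in DESCR.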
  • frame
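    None by default. When the dataset is loaded with load_iris(as_frame=True), it holds a pandas DataFrame combining the data columns with a target column.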
  • filename
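    Location of the CSV file the data was loaded from. Whether this is a full path or only the file name depends on the scikit-learn version.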
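
The frame and filename entries can be inspected the same way as the others. This is a minimal sketch that assumes pandas is installed (as_frame=True needs it) and a scikit-learn version that lists 'frame' among the keys, as in the output above; iris_as_frame is an illustrative name.

```python
from sklearn.datasets import load_iris

# Default call: no DataFrame is built, so 'frame' is None.
iris_dataset = load_iris()
print("Default frame: {}".format(iris_dataset['frame']))

# 'filename' points at the bundled CSV file; its exact form varies by version.
print("Filename: {}".format(iris_dataset['filename']))

# as_frame=True requires pandas; 'frame' then holds the features plus a 'target' column.
iris_as_frame = load_iris(as_frame=True)
frame = iris_as_frame['frame']
print("Frame shape: {}".format(frame.shape))
print("Frame columns: {}".format(list(frame.columns)))
```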