Understanding scikit-learn datasets
Nori12 edited this page Jul 20, 2020 · 1 revision
Import the dataset. For example, let's use the Iris Species Dataset.
from sklearn.datasets import load_iris
iris_dataset = load_iris()
Now, to understand how the dataset is organized, check the keys.
print("\nKeys of iris_dataset: \n{}\n".format(iris_dataset.keys()))
# Keys of iris_dataset:
# dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename'])
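The object returned by load_iris() is a scikit-learn Bunch, a dictionary subclass whose keys can also be read as attributes. The short sketch below only illustrates that equivalence; the variable names are chosen for this example.

```python
from sklearn.datasets import load_iris

iris_dataset = load_iris()

# Key access and attribute access return the very same object.
data_by_key = iris_dataset['data']
data_by_attr = iris_dataset.data
same_object = data_by_key is data_by_attr
print("Same object: {}".format(same_object))

# Ordinary dictionary operations work as well.
has_target = 'target' in iris_dataset
print("Has 'target' key: {}".format(has_target))
```

Either style can be used with the keys described in the following sections.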
-
DESCR
Stores a description of the dataset.
print(iris_dataset['DESCR'][:193] + "\n...\n")
# Iris plants dataset
# --------------------
#
# **Data Set Characteristics:**
#
# :Number of Instances: 150 (50 in each of three classes)
# :Number of Attributes: 4 numeric, pre
# ...
-
target_names
Array of strings containing the species of flower we want to predict.
print("Target names: \n{}\n".format(iris_dataset['target_names']))
# Target names:
# ['setosa' 'versicolor' 'virginica']
-
feature_names
List of strings giving the description of each feature.
print("Feature names: \n{}\n".format(iris_dataset['feature_names']))
# Feature names:
# ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
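The order of feature_names matches the column order of the data array, so the two can be zipped together to label a single sample. This is a small illustrative sketch; first_flower is a name introduced here, not part of the dataset.

```python
from sklearn.datasets import load_iris

iris_dataset = load_iris()

# Column i of 'data' corresponds to feature_names[i].
first_flower = dict(zip(iris_dataset['feature_names'],
                        iris_dataset['data'][0]))

for name, value in first_flower.items():
    print("{}: {}".format(name, value))
```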
-
data
The data itself is contained in the target and data fields.
data -> numeric measurements of sepal length, sepal width, petal length and petal width.
target -> species of each flower measured in the data array, one label per row.
print("Type of data: {}\n".format(type(iris_dataset['data'])))
# Type of data: <class 'numpy.ndarray'>
print("Shape of data: {}\n".format(iris_dataset['data'].shape))
# Shape of data: (150, 4)
print("First five rows of data: \n{}\n".format(iris_dataset['data'][:5]))
# First five rows of data:
# [[5.1 3.5 1.4 0.2]
#  [4.9 3.  1.4 0.2]
#  [4.7 3.2 1.3 0.2]
#  [4.6 3.1 1.5 0.2]
#  [5.  3.6 1.4 0.2]]
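Since the array has shape (150, 4), each row is one flower and each column is one feature. The sketch below shows how row slicing, column slicing and a per-feature reduction differ; the column index 2 for petal length is taken from the feature_names order shown earlier.

```python
from sklearn.datasets import load_iris

iris_dataset = load_iris()
data = iris_dataset['data']

first_five_rows = data[:5]          # five flowers, all four features
petal_length = data[:, 2]           # one feature, all flowers
feature_means = data.mean(axis=0)   # one mean per feature (column)

print("Rows slice shape: {}".format(first_five_rows.shape))
print("Column slice shape: {}".format(petal_length.shape))
print("Per-feature means shape: {}".format(feature_means.shape))
```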
-
target
The species are encoded as integers from 0 to 2. The meaning of each integer is its index in the iris_dataset['target_names'] array: 0 is setosa, 1 is versicolor and 2 is virginica.
print("Type of target: {}\n".format(type(iris_dataset['target'])))
# Type of target: <class 'numpy.ndarray'>
print("Shape of target: {}\n".format(iris_dataset['target'].shape))
# Shape of target: (150,)
print("Target:\n{}\n".format(iris_dataset['target']))
# Target:
# [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
# 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
# 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
# 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
# 2 2]
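Because target_names is a NumPy array, indexing it with the integer target array decodes every label in one step. The following sketch also counts the samples per class with np.bincount; species and counts are names introduced for this example.

```python
import numpy as np
from sklearn.datasets import load_iris

iris_dataset = load_iris()
target = iris_dataset['target']
target_names = iris_dataset['target_names']

# Integer array indexing maps each code to its species name.
species = target_names[target]
print("First and last species: {} / {}".format(species[0], species[-1]))

# Count how many samples belong to each class.
counts = np.bincount(target)
for name, count in zip(target_names, counts):
    print("{}: {}".format(name, count))
```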
-
frame
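Holds a pandas DataFrame combining the data and target columns, but only when the dataset is loaded with load_iris(as_frame=True). With the default arguments it is None.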
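A minimal sketch of how the frame field gets populated, assuming scikit-learn 0.23 or later (where the as_frame parameter exists) and, for the second call, that pandas is installed. The ImportError guard is only there so the snippet still runs without pandas.

```python
from sklearn.datasets import load_iris

# With the default arguments the key exists but holds no DataFrame.
default_frame = load_iris()['frame']
print("Default frame: {}".format(default_frame))

try:
    # as_frame=True asks scikit-learn to build a pandas DataFrame.
    iris_frame = load_iris(as_frame=True)['frame']
except ImportError:
    iris_frame = None  # pandas is not available in this environment

if iris_frame is not None:
    print("Shape of frame: {}".format(iris_frame.shape))
    print(iris_frame.head())
```

When pandas is available, the frame has one column per feature plus a target column.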
-
filename
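Location of the CSV file, bundled with scikit-learn, that the data was read from. Depending on the scikit-learn version this is either a full path or just the file name.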