# Toy datasets

From this guide: https://scikit-learn.org/stable/datasets/toy_dataset.html

In [1]:
%%html
<style>
table.mytable td > p {
    margin-bottom: 0em !important;
    line-height: 0.5 !important;
}
</style>

## Overview

`scikit-learn` comes with a few small standard datasets that do not require downloading any file from an external website.

They can be loaded using the following functions:

<table class="longtable docutils align-default mytable">
<colgroup>
<col style="width: 10%">
<col style="width: 90%">
</colgroup>
<tbody>
<tr class="row-odd"><td><p><a class="reference internal" href="../modules/generated/sklearn.datasets.load_boston.html#sklearn.datasets.load_boston" title="sklearn.datasets.load_boston"><code class="xref py py-obj docutils literal notranslate"><span class="pre">load_boston</span></code></a>(*[,&nbsp;return_X_y])</p></td>
<td><p>Load and return the boston house-prices dataset (regression).</p></td>
</tr>
<tr class="row-even"><td><p><a class="reference internal" href="../modules/generated/sklearn.datasets.load_iris.html#sklearn.datasets.load_iris" title="sklearn.datasets.load_iris"><code class="xref py py-obj docutils literal notranslate"><span class="pre">load_iris</span></code></a>(*[,&nbsp;return_X_y,&nbsp;as_frame])</p></td>
<td><p>Load and return the iris dataset (classification).</p></td>
</tr>
<tr class="row-odd"><td><p><a class="reference internal" href="../modules/generated/sklearn.datasets.load_diabetes.html#sklearn.datasets.load_diabetes" title="sklearn.datasets.load_diabetes"><code class="xref py py-obj docutils literal notranslate"><span class="pre">load_diabetes</span></code></a>(*[,&nbsp;return_X_y,&nbsp;as_frame])</p></td>
<td><p>Load and return the diabetes dataset (regression).</p></td>
</tr>
<tr class="row-even"><td><p><a class="reference internal" href="../modules/generated/sklearn.datasets.load_digits.html#sklearn.datasets.load_digits" title="sklearn.datasets.load_digits"><code class="xref py py-obj docutils literal notranslate"><span class="pre">load_digits</span></code></a>(*[,&nbsp;n_class,&nbsp;return_X_y,&nbsp;as_frame])</p></td>
<td><p>Load and return the digits dataset (classification).</p></td>
</tr>
<tr class="row-odd"><td><p><a class="reference internal" href="../modules/generated/sklearn.datasets.load_linnerud.html#sklearn.datasets.load_linnerud" title="sklearn.datasets.load_linnerud"><code class="xref py py-obj docutils literal notranslate"><span class="pre">load_linnerud</span></code></a>(*[,&nbsp;return_X_y,&nbsp;as_frame])</p></td>
<td><p>Load and return the physical exercise Linnerud dataset.</p></td>
</tr>
<tr class="row-even"><td><p><a class="reference internal" href="../modules/generated/sklearn.datasets.load_wine.html#sklearn.datasets.load_wine" title="sklearn.datasets.load_wine"><code class="xref py py-obj docutils literal notranslate"><span class="pre">load_wine</span></code></a>(*[,&nbsp;return_X_y,&nbsp;as_frame])</p></td>
<td><p>Load and return the wine dataset (classification).</p></td>
</tr>
<tr class="row-odd"><td><p><a class="reference internal" href="../modules/generated/sklearn.datasets.load_breast_cancer.html#sklearn.datasets.load_breast_cancer" title="sklearn.datasets.load_breast_cancer"><code class="xref py py-obj docutils literal notranslate"><span class="pre">load_breast_cancer</span></code></a>(*[,&nbsp;return_X_y,&nbsp;as_frame])</p></td>
<td><p>Load and return the breast cancer wisconsin dataset (classification).</p></td>
</tr>
</tbody>
</table>

ℹ️ These datasets are useful to quickly illustrate the behavior of the various algorithms implemented in scikit-learn. They are however often too small to be representative of real world machine learning tasks.

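**Note on parameters of `load_...`:** Most loaders share two parameters. `return_X_y=True` returns a `(data, target)` tuple instead of the default `Bunch` object, and `as_frame=True` returns pandas objects instead of NumPy arrays. As the table shows, `load_boston` accepts only `return_X_y`.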
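
As a minimal sketch of what these two parameters change, the snippet below loads the iris data three ways. It assumes a scikit-learn version in which `as_frame` is available (0.23 or later, to the best of my knowledge) and an installed pandas; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris

# Default call: a Bunch, a dict-like container with attribute access.
bunch = load_iris()
print(type(bunch).__name__)
print(bunch.data.shape, bunch.target.shape)

# return_X_y=True: only the (data, target) tuple, as NumPy arrays.
X, y = load_iris(return_X_y=True)

# Adding as_frame=True: the same tuple, as a DataFrame and a Series.
X_df, y_ser = load_iris(return_X_y=True, as_frame=True)

# The values are identical; only the container type differs.
print(np.array_equal(X, X_df.to_numpy()))
print(np.array_equal(y, y_ser.to_numpy()))
```

The `Bunch` form is handy when the metadata (`DESCR`, `feature_names`) is needed; the tuple form is shorter when only the arrays matter.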

## [🏡 Boston house prices dataset](https://scikit-learn.org/stable/datasets/toy_dataset.html#boston-house-prices-dataset)

#### Data Set Characteristics:

<blockquote>
<div><dl class="field-list simple">
<dt class="field-odd">Number of Instances</dt>
<dd class="field-odd"><p>506</p>
</dd>
<dt class="field-even">Number of Attributes</dt>
<dd class="field-even"><p>13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.</p>
</dd>
<dt class="field-odd">Attribute Information (in order)</dt>
<dd class="field-odd"><ul class="simple">
<li><p>CRIM     per capita crime rate by town</p></li>
<li><p>ZN       proportion of residential land zoned for lots over 25,000 sq.ft.</p></li>
<li><p>INDUS    proportion of non-retail business acres per town</p></li>
<li><p>CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)</p></li>
<li><p>NOX      nitric oxides concentration (parts per 10 million)</p></li>
<li><p>RM       average number of rooms per dwelling</p></li>
<li><p>AGE      proportion of owner-occupied units built prior to 1940</p></li>
<li><p>DIS      weighted distances to five Boston employment centres</p></li>
<li><p>RAD      index of accessibility to radial highways</p></li>
<li><p>TAX      full-value property-tax rate per \$10,000</p></li>
<li><p>PTRATIO  pupil-teacher ratio by town</p></li>
<li><p>B        1000(Bk - 0.63)^2 where Bk is the proportion of black people by town</p></li>
<li><p>LSTAT    % lower status of the population</p></li>
<li><p>MEDV     Median value of owner-occupied homes in \$1000s</p></li>
</ul>
</dd>
<dt class="field-even">Missing Attribute Values</dt>
<dd class="field-even"><p>None</p>
</dd>
<dt class="field-odd">Creator</dt>
<dd class="field-odd"><p>Harrison, D. and Rubinfeld, D.L.</p>
</dd>
</dl>
</div></blockquote>
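
The `B` attribute above is a transformed quantity rather than a raw proportion. The sketch below only evaluates the stated formula `1000(Bk - 0.63)^2`; the helper name `b_feature` is hypothetical and not part of scikit-learn.

```python
def b_feature(bk: float) -> float:
    """Hypothetical helper: B = 1000 * (Bk - 0.63)^2 for a proportion Bk in [0, 1]."""
    if not 0.0 <= bk <= 1.0:
        raise ValueError("bk must be a proportion between 0 and 1")
    return 1000.0 * (bk - 0.63) ** 2


# Evaluate the formula for a few proportions.
for bk in (0.0, 0.33, 0.63, 0.93):
    print(f"Bk = {bk:.2f} -> B = {b_feature(bk):.2f}")
```

Two things follow from the formula: `Bk = 0` gives roughly 396.9, the value that appears repeatedly in column 12 of the example output, and the mapping is not invertible, since proportions equally far from 0.63 (such as 0.33 and 0.93) produce the same `B`.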

#### URL 
* This is a copy of the UCI ML housing dataset: https://archive.ics.uci.edu/ml/machine-learning-databases/housing/

#### Example:

In [2]:
from IPython.display import display
import numpy as np
from sklearn.datasets import load_boston  # Deprecated in scikit-learn 1.0 and removed in 1.2.

In [3]:
# No parameters:
dataset = load_boston()

In [4]:
print("dataset = load_boston():\n")

print("dir(dataset):\n", dir(dataset))
print("dataset.feature_names:\n", dataset.feature_names)

X = dataset.data
print("type(X):\n", type(X))
print("X.shape:\n", X.shape)
with np.printoptions(linewidth=200):
    print("X[:5]:\n", X[:5])

y = dataset.target
print("type(y):\n", type(y))
print("y.shape:\n", y.shape)
with np.printoptions(linewidth=200):
    print("y[:5]:\n", y[:5])

dataset = load_boston():

dir(dataset):
 ['DESCR', 'data', 'feature_names', 'filename', 'target']
dataset.feature_names:
 ['CRIM' 'ZN' 'INDUS' 'CHAS' 'NOX' 'RM' 'AGE' 'DIS' 'RAD' 'TAX' 'PTRATIO'
 'B' 'LSTAT']
type(X):
 <class 'numpy.ndarray'>
X.shape:
 (506, 13)
X[:5]:
 [[6.3200e-03 1.8000e+01 2.3100e+00 0.0000e+00 5.3800e-01 6.5750e+00 6.5200e+01 4.0900e+00 1.0000e+00 2.9600e+02 1.5300e+01 3.9690e+02 4.9800e+00]
 [2.7310e-02 0.0000e+00 7.0700e+00 0.0000e+00 4.6900e-01 6.4210e+00 7.8900e+01 4.9671e+00 2.0000e+00 2.4200e+02 1.7800e+01 3.9690e+02 9.1400e+00]
 [2.7290e-02 0.0000e+00 7.0700e+00 0.0000e+00 4.6900e-01 7.1850e+00 6.1100e+01 4.9671e+00 2.0000e+00 2.4200e+02 1.7800e+01 3.9283e+02 4.0300e+00]
 [3.2370e-02 0.0000e+00 2.1800e+00 0.0000e+00 4.5800e-01 6.9980e+00 4.5800e+01 6.0622e+00 3.0000e+00 2.2200e+02 1.8700e+01 3.9463e+02 2.9400e+00]
 [6.9050e-02 0.0000e+00 2.1800e+00 0.0000e+00 4.5800e-01 7.1470e+00 5.4200e+01 6.0622e+00 3.0000e+00 2.2200e+02 1.8700e+01 3.9690e+02 5.3300e+00]]
type(y):
 <class 'numpy.ndarray'>
y.shape:
 (506,)
y[:5]:
 [24.  21.6 34.7 33.4 36.2]

In [5]:
# Use: return_X_y
X, y = load_boston(return_X_y=True)

print(type(X))
print(type(y))
print(X.shape)
print(y.shape)

<class 'numpy.ndarray'>
<class 'numpy.ndarray'>
(506, 13)
(506,)


## [🥀 Iris plants dataset](https://scikit-learn.org/stable/datasets/toy_dataset.html#iris-plants-dataset)

#### Data Set Characteristics:

<blockquote>
<div><dl class="field-list simple">
<dt class="field-odd">Number of Instances</dt>
<dd class="field-odd"><p>150 (50 in each of three classes)</p>
</dd>
<dt class="field-even">Number of Attributes</dt>
<dd class="field-even"><p>4 numeric, predictive attributes and the class</p>
</dd>
<dt class="field-odd">Attribute Information</dt>
<dd class="field-odd"><ul class="simple">
<li><p>sepal length in cm</p></li>
<li><p>sepal width in cm</p></li>
<li><p>petal length in cm</p></li>
<li><p>petal width in cm</p></li>
<li><dl class="simple">
<dt>class:</dt><dd><ul>
<li><p>Iris-Setosa</p></li>
<li><p>Iris-Versicolour</p></li>
<li><p>Iris-Virginica</p></li>
</ul>
</dd>
</dl>
</li>
</ul>
</dd>
<dt class="field-even">Summary Statistics</dt>
<dd class="field-even"><p></p></dd>
</dl>
<table class="docutils align-default mytable">
<colgroup>
<col style="width: 26%">
<col style="width: 7%">
<col style="width: 7%">
<col style="width: 13%">
<col style="width: 9%">
<col style="width: 37%">
</colgroup>
<thead>
<tr class="row-odd"><th class="head"></th>
<th class="head"><p>Min</p></th>
<th class="head"><p>Max</p></th>
<th class="head"><p>Mean</p></th>
<th class="head"><p>SD</p></th>
<th class="head"><p>Class Correlation</p></th>
</tr>
</thead>
<tbody>
<tr class="row-even"><td><p>sepal length:</p></td>
<td><p>4.3</p></td>
<td><p>7.9</p></td>
<td><p>5.84</p></td>
<td><p>0.83</p></td>
<td><p>0.7826</p></td>
</tr>
<tr class="row-odd"><td><p>sepal width:</p></td>
<td><p>2.0</p></td>
<td><p>4.4</p></td>
<td><p>3.05</p></td>
<td><p>0.43</p></td>
<td><p>-0.4194</p></td>
</tr>
<tr class="row-even"><td><p>petal length:</p></td>
<td><p>1.0</p></td>
<td><p>6.9</p></td>
<td><p>3.76</p></td>
<td><p>1.76</p></td>
<td><p>0.9490  (high!)</p></td>
</tr>
<tr class="row-odd"><td><p>petal width:</p></td>
<td><p>0.1</p></td>
<td><p>2.5</p></td>
<td><p>1.20</p></td>
<td><p>0.76</p></td>
<td><p>0.9565  (high!)</p></td>
</tr>
</tbody>
</table>
<dl class="field-list simple">
<dt class="field-odd">Missing Attribute Values</dt>
<dd class="field-odd"><p>None</p>
</dd>
<dt class="field-even">Class Distribution</dt>
<dd class="field-even"><p>33.3% for each of 3 classes.</p>
</dd>
<dt class="field-odd">Creator</dt>
<dd class="field-odd"><p>R.A. Fisher</p>
</dd>
<dt class="field-even">Donor</dt>
<dd class="field-even"><p>Michael Marshall (<a class="reference external" href="mailto:MARSHALL%PLU%40io.arc.nasa.gov">MARSHALL%PLU<span>@</span>io<span>.</span>arc<span>.</span>nasa<span>.</span>gov</a>)</p>
</dd>
<dt class="field-odd">Date</dt>
<dd class="field-odd"><p>July, 1988</p>
</dd>
</dl>
</div></blockquote>
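
The minimum, maximum, mean and standard deviation in the summary table can be recomputed from the loaded array. This is a rough check rather than a reproduction of the table: using the sample standard deviation (`ddof=1`) is an assumption, and the copy shipped with scikit-learn may differ from other copies of the data in a few values, so the last printed digit may not match.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data

# Column-wise statistics, one value per feature.
col_min = X.min(axis=0)
col_max = X.max(axis=0)
col_mean = X.mean(axis=0)
col_sd = X.std(axis=0, ddof=1)  # assumption: sample standard deviation

for name, lo, hi, mean, sd in zip(iris.feature_names, col_min, col_max, col_mean, col_sd):
    print(f"{name:<20} min={lo:.1f} max={hi:.1f} mean={mean:.2f} sd={sd:.2f}")
```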

#### URL
* N/A

#### Example:

In [6]:
from sklearn.datasets import load_iris

In [9]:
X, y = load_iris(return_X_y=True)

print("type(X):\n", type(X))
print("X.shape:\n", X.shape)
with np.printoptions(linewidth=200):
    print("X[:5]:\n", X[:5])

print("type(y):\n", type(y))
print("y.shape:\n", y.shape)
with np.printoptions(linewidth=200):
    print("y[:5]:\n", y[:5])
print("np.unique(y):\n", np.unique(y))

type(X):
 <class 'numpy.ndarray'>
X.shape:
 (150, 4)
X[:5]:
 [[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]]
type(y):
 <class 'numpy.ndarray'>
y.shape:
 (150,)
y[:5]:
 [0 0 0 0 0]
np.unique(y):
 [0 1 2]


In [14]:
# Try `as_frame`:

X, y = load_iris(return_X_y=True, as_frame=True)

print("type(X):\n", type(X))
print("X.shape:\n", X.shape)
print("X.head()")
display(X.head())

print("type(y):\n", type(y))
print("y.shape:\n", y.shape)
print("list(y)[:5]:\n", list(y)[:5])
print("np.unique(y):\n", np.unique(y))

type(X):
 <class 'pandas.core.frame.DataFrame'>
X.shape:
 (150, 4)
X.head()


Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm)
0,5.1,3.5,1.4,0.2
1,4.9,3.0,1.4,0.2
2,4.7,3.2,1.3,0.2
3,4.6,3.1,1.5,0.2
4,5.0,3.6,1.4,0.2


type(y):
 <class 'pandas.core.series.Series'>
y.shape:
 (150,)
list(y)[:5]:
 [0, 0, 0, 0, 0]
np.unique(y):
 [0 1 2]
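
The integer labels `0`, `1`, `2` printed above stand for the three species. A short sketch of decoding them through the `target_names` attribute of the returned `Bunch` (variable names are illustrative):

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()

# target_names[i] is the species name for integer label i.
species = [str(name) for name in iris.target_names]
print(species)

# Count samples per class and decode the first few labels.
counts = np.bincount(iris.target).tolist()
decoded = [species[label] for label in iris.target[:5]]
print(dict(zip(species, counts)))
print(decoded)
```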


## [💊 Diabetes dataset](https://scikit-learn.org/stable/datasets/toy_dataset.html#diabetes-dataset)

#### Data Set Characteristics:

<blockquote>
<div><dl class="field-list simple">
<dt class="field-odd">Number of Instances</dt>
<dd class="field-odd"><p>442</p>
</dd>
<dt class="field-even">Number of Attributes</dt>
<dd class="field-even"><p>First 10 columns are numeric predictive values</p>
</dd>
<dt class="field-odd">Target</dt>
<dd class="field-odd"><p>Column 11 is a quantitative measure of disease progression one year after baseline</p>
</dd>
<dt class="field-even">Attribute Information</dt>
<dd class="field-even"><ul class="simple">
<li><p>age     age in years</p></li>
<li><p>sex</p></li>
<li><p>bmi     body mass index</p></li>
<li><p>bp      average blood pressure</p></li>
<li><p>s1      tc, total serum cholesterol</p></li>
<li><p>s2      ldl, low-density lipoproteins</p></li>
<li><p>s3      hdl, high-density lipoproteins</p></li>
<li><p>s4      tch, total cholesterol / HDL</p></li>
<li><p>s5      ltg, possibly log of serum triglycerides level</p></li>
<li><p>s6      glu, blood sugar level</p></li>
</ul>
</dd>
</dl>
</div></blockquote>

#### URL
* https://www4.stat.ncsu.edu/~boos/var.select/diabetes.html

#### Example:

In [15]:
from sklearn.datasets import load_diabetes

In [17]:
X, y = load_diabetes(return_X_y=True, as_frame=True)

print("type(X):\n", type(X))
print("X.shape:\n", X.shape)
print("X.head()")
display(X.head())

print("type(y):\n", type(y))
print("y.shape:\n", y.shape)
print("list(y)[:5]:\n", list(y)[:5])
print("np.unique(y):\n", np.unique(y)[:10], "...")

type(X):
 <class 'pandas.core.frame.DataFrame'>
X.shape:
 (442, 10)
X.head()


Unnamed: 0,age,sex,bmi,bp,s1,s2,s3,s4,s5,s6
0,0.038076,0.05068,0.061696,0.021872,-0.044223,-0.034821,-0.043401,-0.002592,0.019908,-0.017646
1,-0.001882,-0.044642,-0.051474,-0.026328,-0.008449,-0.019163,0.074412,-0.039493,-0.06833,-0.092204
2,0.085299,0.05068,0.044451,-0.005671,-0.045599,-0.034194,-0.032356,-0.002592,0.002864,-0.02593
3,-0.089063,-0.044642,-0.011595,-0.036656,0.012191,0.024991,-0.036038,0.034309,0.022692,-0.009362
4,0.005383,-0.044642,-0.036385,0.021872,0.003935,0.015596,0.008142,-0.002592,-0.031991,-0.046641


type(y):
 <class 'pandas.core.series.Series'>
y.shape:
 (442,)
list(y)[:5]:
 [151.0, 75.0, 141.0, 206.0, 135.0]
np.unique(y):
 [25. 31. 37. 39. 40. 42. 43. 44. 45. 47.] ...
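
The feature values shown in `X.head()` are small and centered around zero rather than raw ages or BMI values. According to the scikit-learn documentation, each of the 10 features has been mean-centered and scaled so that the sum of squares of each column is 1. The sketch below checks this numerically, assuming the loader's default arguments:

```python
import numpy as np
from sklearn.datasets import load_diabetes

X, y = load_diabetes(return_X_y=True)

# Each feature column should have a mean of (numerically) zero ...
col_means = X.mean(axis=0)
print(np.abs(col_means).max())

# ... and a sum of squares of 1, i.e. unit Euclidean norm.
col_sum_sq = (X ** 2).sum(axis=0)
print(col_sum_sq)
```

The target `y` is left on its original scale, which is why it shows values such as 151.0 and 75.0.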


## [🔢 Optical recognition of handwritten digits dataset](https://scikit-learn.org/stable/datasets/toy_dataset.html#optical-recognition-of-handwritten-digits-dataset)

#### Data Set Characteristics:

<blockquote>
<div><dl class="field-list simple">
<dt class="field-odd">Number of Instances</dt>
<dd class="field-odd"><p>1797</p>
</dd>
<dt class="field-even">Number of Attributes</dt>
<dd class="field-even"><p>64</p>
</dd>
<dt class="field-odd">Attribute Information</dt>
<dd class="field-odd"><p>8x8 image of integer pixels in the range 0..16.</p>
</dd>
<dt class="field-even">Missing Attribute Values</dt>
<dd class="field-even"><p>None</p>
</dd>
<dt class="field-odd">Creator</dt>
<dd class="field-odd"><p>E. Alpaydin (alpaydin ‘@’ boun.edu.tr)</p>
</dd>
<dt class="field-even">Date</dt>
<dd class="field-even"><p>July, 1998</p>
</dd>
</dl>
</div></blockquote>
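
The 64 attributes are the pixels of an 8x8 image flattened row by row. A minimal sketch of going back from a feature row to the image, using the `images` attribute of the returned `Bunch`:

```python
import numpy as np
from sklearn.datasets import load_digits

digits = load_digits()
X = digits.data          # shape (1797, 64): one flattened image per row
images = digits.images   # shape (1797, 8, 8): the same pixels as 2-D images

# Reshaping a row recovers the corresponding 8x8 image (row-major order).
first = X[0].reshape(8, 8)
print(np.array_equal(first, images[0]))

# Pixel intensities stay within the documented 0..16 range.
print(X.min(), X.max())
```

This row-major layout matches the `pixel_<row>_<col>` column names that appear when the data is loaded with `as_frame=True`.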

#### URL
* This is a copy of the test set of the UCI ML hand-written digits datasets https://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits

#### Example:

In [19]:
from sklearn.datasets import load_digits

In [22]:
# ❗ Has an additional parameter: n_class
X, y = load_digits(return_X_y=True, as_frame=True)  # n_class defaults to 10.

print("type(X):\n", type(X))
print("X.shape:\n", X.shape)
print("X.head()")
display(X.head())

print("type(y):\n", type(y))
print("y.shape:\n", y.shape)
print("list(y)[:5]:\n", list(y)[:5])
print("np.unique(y):\n", np.unique(y))

type(X):
 <class 'pandas.core.frame.DataFrame'>
X.shape:
 (1797, 64)
X.head()


Unnamed: 0,pixel_0_0,pixel_0_1,pixel_0_2,pixel_0_3,pixel_0_4,pixel_0_5,pixel_0_6,pixel_0_7,pixel_1_0,pixel_1_1,...,pixel_6_6,pixel_6_7,pixel_7_0,pixel_7_1,pixel_7_2,pixel_7_3,pixel_7_4,pixel_7_5,pixel_7_6,pixel_7_7
0,0.0,0.0,5.0,13.0,9.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,6.0,13.0,10.0,0.0,0.0,0.0
1,0.0,0.0,0.0,12.0,13.0,5.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,11.0,16.0,10.0,0.0,0.0
2,0.0,0.0,0.0,4.0,15.0,12.0,0.0,0.0,0.0,0.0,...,5.0,0.0,0.0,0.0,0.0,3.0,11.0,16.0,9.0,0.0
3,0.0,0.0,7.0,15.0,13.0,1.0,0.0,0.0,0.0,8.0,...,9.0,0.0,0.0,0.0,7.0,13.0,13.0,9.0,0.0,0.0
4,0.0,0.0,0.0,1.0,11.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,2.0,16.0,4.0,0.0,0.0


type(y):
 <class 'pandas.core.series.Series'>
y.shape:
 (1797,)
list(y)[:5]:
 [0, 1, 2, 3, 4]
np.unique(y):
 [0 1 2 3 4 5 6 7 8 9]


In [23]:
# Try different n_class.
X, y = load_digits(
    n_class=3,  # This.
    return_X_y=True, 
    as_frame=True
)

print("type(X):\n", type(X))
print("X.shape:\n", X.shape)
print("X.head()")
display(X.head())

print("type(y):\n", type(y))
print("y.shape:\n", y.shape)
print("list(y)[:5]:\n", list(y)[:5])
print("np.unique(y):\n", np.unique(y))

type(X):
 <class 'pandas.core.frame.DataFrame'>
X.shape:
 (537, 64)
X.head()


Unnamed: 0,pixel_0_0,pixel_0_1,pixel_0_2,pixel_0_3,pixel_0_4,pixel_0_5,pixel_0_6,pixel_0_7,pixel_1_0,pixel_1_1,...,pixel_6_6,pixel_6_7,pixel_7_0,pixel_7_1,pixel_7_2,pixel_7_3,pixel_7_4,pixel_7_5,pixel_7_6,pixel_7_7
0,0.0,0.0,5.0,13.0,9.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,6.0,13.0,10.0,0.0,0.0,0.0
1,0.0,0.0,0.0,12.0,13.0,5.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,11.0,16.0,10.0,0.0,0.0
2,0.0,0.0,0.0,4.0,15.0,12.0,0.0,0.0,0.0,0.0,...,5.0,0.0,0.0,0.0,0.0,3.0,11.0,16.0,9.0,0.0
3,0.0,0.0,1.0,9.0,15.0,11.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,10.0,13.0,3.0,0.0,0.0
4,0.0,0.0,0.0,0.0,14.0,13.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,13.0,16.0,1.0,0.0


type(y):
 <class 'pandas.core.series.Series'>
y.shape:
 (537,)
list(y)[:5]:
 [0, 1, 2, 0, 1]
np.unique(y):
 [0 1 2]


## [🏃‍♂️ Linnerud dataset](https://scikit-learn.org/stable/datasets/toy_dataset.html#linnerrud-dataset)

#### Data Set Characteristics:

<blockquote>
<div><dl class="field-list simple">
<dt class="field-odd">Number of Instances</dt>
<dd class="field-odd"><p>20</p>
</dd>
<dt class="field-even">Number of Attributes</dt>
<dd class="field-even"><p>3</p>
</dd>
<dt class="field-odd">Missing Attribute Values</dt>
<dd class="field-odd"><p>None</p>
</dd>
</dl>
</div></blockquote>

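‼️ The Linnerud dataset is a **multi-output** regression dataset: three exercise variables (`Chins`, `Situps`, `Jumps`) are the features, and three physiological variables (`Weight`, `Waist`, `Pulse`) are the targets.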
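
To illustrate what multi-output means in practice, the sketch below fits a plain `LinearRegression` on the full target matrix. The choice of estimator is only an example, and with 20 samples the fit itself is not meaningful.

```python
from sklearn.datasets import load_linnerud
from sklearn.linear_model import LinearRegression

# X: (20, 3) exercise features, Y: (20, 3) physiological targets.
X, Y = load_linnerud(return_X_y=True)

# LinearRegression accepts a 2-D target and fits one linear model per column.
model = LinearRegression().fit(X, Y)

pred = model.predict(X[:2])
print(pred.shape)         # (2, 3): one prediction per target column
print(model.coef_.shape)  # (3, 3): (n_targets, n_features)
```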

#### URL
* N/A

#### Example:

In [24]:
from sklearn.datasets import load_linnerud

In [32]:
X, y = load_linnerud(return_X_y=True, as_frame=True)

print("type(X):\n", type(X))
print("X.shape:\n", X.shape)
print("X.head()")
display(X.head())

print("type(y):\n", type(y))
print("y.shape:\n", y.shape)
print("y[:5]:\n", y[:5])
print("np.unique(y):\n", np.unique(y))

type(X):
 <class 'pandas.core.frame.DataFrame'>
X.shape:
 (20, 3)
X.head()


Unnamed: 0,Chins,Situps,Jumps
0,5.0,162.0,60.0
1,2.0,110.0,60.0
2,12.0,101.0,101.0
3,12.0,105.0,37.0
4,13.0,155.0,58.0


type(y):
 <class 'pandas.core.frame.DataFrame'>
y.shape:
 (20, 3)
y[:5]:
    Weight  Waist  Pulse
0   191.0   36.0   50.0
1   189.0   37.0   52.0
2   193.0   38.0   58.0
3   162.0   35.0   62.0
4   189.0   35.0   46.0
np.unique(y):
 [ 31.  32.  33.  34.  35.  36.  37.  38.  46.  50.  52.  54.  56.  58.
  60.  62.  64.  68.  74. 138. 154. 156. 157. 162. 166. 167. 169. 176.
 182. 189. 191. 193. 202. 211. 247.]


## [🍷 Wine recognition dataset](https://scikit-learn.org/stable/datasets/toy_dataset.html#wine-recognition-dataset)

#### Data Set Characteristics:

<blockquote>
<div><dl class="field-list simple">
<dt class="field-odd">Number of Instances</dt>
<dd class="field-odd"><p>178 (59, 71, and 48 in the three classes)</p>
</dd>
<dt class="field-even">Number of Attributes</dt>
<dd class="field-even"><p>13 numeric, predictive attributes and the class</p>
</dd>
<dt class="field-odd">Attribute Information</dt>
<dd class="field-odd"><ul class="simple">
<li><p>Alcohol</p></li>
<li><p>Malic acid</p></li>
<li><p>Ash</p></li>
<li><p>Alcalinity of ash</p></li>
<li><p>Magnesium</p></li>
<li><p>Total phenols</p></li>
<li><p>Flavanoids</p></li>
<li><p>Nonflavanoid phenols</p></li>
<li><p>Proanthocyanins</p></li>
<li><p>Color intensity</p></li>
<li><p>Hue</p></li>
<li><p>OD280/OD315 of diluted wines</p></li>
<li><p>Proline</p></li>
</ul>
</dd>
</dl>
<ul class="simple">
<li><dl class="simple">
<dt>class:</dt><dd><ul>
<li><p>class_0</p></li>
<li><p>class_1</p></li>
<li><p>class_2</p></li>
</ul>
</dd>
</dl>
</li>
</ul>
<dl class="field-list simple">
<dt class="field-odd">Summary Statistics</dt>
<dd class="field-odd"><p></p></dd>
</dl>
<table class="docutils align-default mytable">
<colgroup>
<col style="width: 58%">
<col style="width: 8%">
<col style="width: 10%">
<col style="width: 14%">
<col style="width: 10%">
</colgroup>
<thead>
<tr class="row-odd"><th class="head"></th>
<th class="head"><p>Min</p></th>
<th class="head"><p>Max</p></th>
<th class="head"><p>Mean</p></th>
<th class="head"><p>SD</p></th>
</tr>
</thead>
<tbody>
<tr class="row-even"><td><p>Alcohol:</p></td>
<td><p>11.0</p></td>
<td><p>14.8</p></td>
<td><p>13.0</p></td>
<td><p>0.8</p></td>
</tr>
<tr class="row-odd"><td><p>Malic Acid:</p></td>
<td><p>0.74</p></td>
<td><p>5.80</p></td>
<td><p>2.34</p></td>
<td><p>1.12</p></td>
</tr>
<tr class="row-even"><td><p>Ash:</p></td>
<td><p>1.36</p></td>
<td><p>3.23</p></td>
<td><p>2.36</p></td>
<td><p>0.27</p></td>
</tr>
<tr class="row-odd"><td><p>Alcalinity of Ash:</p></td>
<td><p>10.6</p></td>
<td><p>30.0</p></td>
<td><p>19.5</p></td>
<td><p>3.3</p></td>
</tr>
<tr class="row-even"><td><p>Magnesium:</p></td>
<td><p>70.0</p></td>
<td><p>162.0</p></td>
<td><p>99.7</p></td>
<td><p>14.3</p></td>
</tr>
<tr class="row-odd"><td><p>Total Phenols:</p></td>
<td><p>0.98</p></td>
<td><p>3.88</p></td>
<td><p>2.29</p></td>
<td><p>0.63</p></td>
</tr>
<tr class="row-even"><td><p>Flavanoids:</p></td>
<td><p>0.34</p></td>
<td><p>5.08</p></td>
<td><p>2.03</p></td>
<td><p>1.00</p></td>
</tr>
<tr class="row-odd"><td><p>Nonflavanoid Phenols:</p></td>
<td><p>0.13</p></td>
<td><p>0.66</p></td>
<td><p>0.36</p></td>
<td><p>0.12</p></td>
</tr>
<tr class="row-even"><td><p>Proanthocyanins:</p></td>
<td><p>0.41</p></td>
<td><p>3.58</p></td>
<td><p>1.59</p></td>
<td><p>0.57</p></td>
</tr>
<tr class="row-odd"><td><p>Colour Intensity:</p></td>
<td><p>1.3</p></td>
<td><p>13.0</p></td>
<td><p>5.1</p></td>
<td><p>2.3</p></td>
</tr>
<tr class="row-even"><td><p>Hue:</p></td>
<td><p>0.48</p></td>
<td><p>1.71</p></td>
<td><p>0.96</p></td>
<td><p>0.23</p></td>
</tr>
<tr class="row-odd"><td><p>OD280/OD315 of diluted wines:</p></td>
<td><p>1.27</p></td>
<td><p>4.00</p></td>
<td><p>2.61</p></td>
<td><p>0.71</p></td>
</tr>
<tr class="row-even"><td><p>Proline:</p></td>
<td><p>278</p></td>
<td><p>1680</p></td>
<td><p>746</p></td>
<td><p>315</p></td>
</tr>
</tbody>
</table>
<dl class="field-list simple">
<dt class="field-odd">Missing Attribute Values</dt>
<dd class="field-odd"><p>None</p>
</dd>
<dt class="field-even">Class Distribution</dt>
<dd class="field-even"><p>class_0 (59), class_1 (71), class_2 (48)</p>
</dd>
<dt class="field-odd">Creator</dt>
<dd class="field-odd"><p>R.A. Fisher</p>
</dd>
<dt class="field-even">Donor</dt>
<dd class="field-even"><p>Michael Marshall (<a class="reference external" href="mailto:MARSHALL%PLU%40io.arc.nasa.gov">MARSHALL%PLU<span>@</span>io<span>.</span>arc<span>.</span>nasa<span>.</span>gov</a>)</p>
</dd>
<dt class="field-odd">Date</dt>
<dd class="field-odd"><p>July, 1988</p>
</dd>
</dl>
</div></blockquote>
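
Unlike iris, the wine classes are not balanced. The class distribution listed above can be verified directly from the target array; this is a small sketch with illustrative variable names:

```python
import numpy as np
from sklearn.datasets import load_wine

wine = load_wine()

# Number of samples per integer class label.
counts = np.bincount(wine.target)
total = int(counts.sum())

for name, n in zip(wine.target_names, counts):
    print(f"{name}: {n} samples ({n / total:.1%})")
```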

#### URL
* This is a copy of UCI ML Wine recognition datasets. https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data

#### Example:

In [29]:
from sklearn.datasets import load_wine

In [31]:
X, y = load_wine(return_X_y=True, as_frame=True)

print("type(X):\n", type(X))
print("X.shape:\n", X.shape)
print("X.head()")
display(X.head())

print("type(y):\n", type(y))
print("y.shape:\n", y.shape)
print("y[:5]:\n", y[:5])
print("np.unique(y):\n", np.unique(y))

type(X):
 <class 'pandas.core.frame.DataFrame'>
X.shape:
 (178, 13)
X.head()


Unnamed: 0,alcohol,malic_acid,ash,alcalinity_of_ash,magnesium,total_phenols,flavanoids,nonflavanoid_phenols,proanthocyanins,color_intensity,hue,od280/od315_of_diluted_wines,proline
0,14.23,1.71,2.43,15.6,127.0,2.8,3.06,0.28,2.29,5.64,1.04,3.92,1065.0
1,13.2,1.78,2.14,11.2,100.0,2.65,2.76,0.26,1.28,4.38,1.05,3.4,1050.0
2,13.16,2.36,2.67,18.6,101.0,2.8,3.24,0.3,2.81,5.68,1.03,3.17,1185.0
3,14.37,1.95,2.5,16.8,113.0,3.85,3.49,0.24,2.18,7.8,0.86,3.45,1480.0
4,13.24,2.59,2.87,21.0,118.0,2.8,2.69,0.39,1.82,4.32,1.04,2.93,735.0


type(y):
 <class 'pandas.core.series.Series'>
y.shape:
 (178,)
y[:5]:
 0    0
1    0
2    0
3    0
4    0
Name: target, dtype: int64
np.unique(y):
 [0 1 2]


## [🧫 Breast cancer wisconsin (diagnostic) dataset](https://scikit-learn.org/stable/datasets/toy_dataset.html#breast-cancer-wisconsin-diagnostic-dataset)

#### Data Set Characteristics:

<blockquote>
<div><dl class="field-list">
<dt class="field-odd">Number of Instances</dt>
<dd class="field-odd"><p>569</p>
</dd>
<dt class="field-even">Number of Attributes</dt>
<dd class="field-even"><p>30 numeric, predictive attributes and the class</p>
</dd>
<dt class="field-odd">Attribute Information</dt>
<dd class="field-odd"><ul class="simple">
<li><p>radius (mean of distances from center to points on the perimeter)</p></li>
<li><p>texture (standard deviation of gray-scale values)</p></li>
<li><p>perimeter</p></li>
<li><p>area</p></li>
<li><p>smoothness (local variation in radius lengths)</p></li>
<li><p>compactness (perimeter^2 / area - 1.0)</p></li>
<li><p>concavity (severity of concave portions of the contour)</p></li>
<li><p>concave points (number of concave portions of the contour)</p></li>
<li><p>symmetry</p></li>
<li><p>fractal dimension (“coastline approximation” - 1)</p></li>
</ul>
<p>The mean, standard error, and “worst” or largest (mean of the three
worst/largest values) of these features were computed for each image,
resulting in 30 features.  For instance, field 0 is Mean Radius, field
10 is Radius SE, field 20 is Worst Radius.</p>
<ul class="simple">
<li><dl class="simple">
<dt>class:</dt><dd><ul>
<li><p>WDBC-Malignant</p></li>
<li><p>WDBC-Benign</p></li>
</ul>
</dd>
</dl>
</li>
</ul>
</dd>
<dt class="field-even">Summary Statistics</dt>
<dd class="field-even"><p></p></dd>
</dl>
<table class="docutils align-default mytable">
<colgroup>
<col style="width: 76%">
<col style="width: 12%">
<col style="width: 12%">
</colgroup>
<thead>
<tr class="row-odd"><th class="head"></th>
<th class="head"><p>Min</p></th>
<th class="head"><p>Max</p></th>
</tr>
</thead>
<tbody>
<tr class="row-even"><td><p>radius (mean):</p></td>
<td><p>6.981</p></td>
<td><p>28.11</p></td>
</tr>
<tr class="row-odd"><td><p>texture (mean):</p></td>
<td><p>9.71</p></td>
<td><p>39.28</p></td>
</tr>
<tr class="row-even"><td><p>perimeter (mean):</p></td>
<td><p>43.79</p></td>
<td><p>188.5</p></td>
</tr>
<tr class="row-odd"><td><p>area (mean):</p></td>
<td><p>143.5</p></td>
<td><p>2501.0</p></td>
</tr>
<tr class="row-even"><td><p>smoothness (mean):</p></td>
<td><p>0.053</p></td>
<td><p>0.163</p></td>
</tr>
<tr class="row-odd"><td><p>compactness (mean):</p></td>
<td><p>0.019</p></td>
<td><p>0.345</p></td>
</tr>
<tr class="row-even"><td><p>concavity (mean):</p></td>
<td><p>0.0</p></td>
<td><p>0.427</p></td>
</tr>
<tr class="row-odd"><td><p>concave points (mean):</p></td>
<td><p>0.0</p></td>
<td><p>0.201</p></td>
</tr>
<tr class="row-even"><td><p>symmetry (mean):</p></td>
<td><p>0.106</p></td>
<td><p>0.304</p></td>
</tr>
<tr class="row-odd"><td><p>fractal dimension (mean):</p></td>
<td><p>0.05</p></td>
<td><p>0.097</p></td>
</tr>
<tr class="row-even"><td><p>radius (standard error):</p></td>
<td><p>0.112</p></td>
<td><p>2.873</p></td>
</tr>
<tr class="row-odd"><td><p>texture (standard error):</p></td>
<td><p>0.36</p></td>
<td><p>4.885</p></td>
</tr>
<tr class="row-even"><td><p>perimeter (standard error):</p></td>
<td><p>0.757</p></td>
<td><p>21.98</p></td>
</tr>
<tr class="row-odd"><td><p>area (standard error):</p></td>
<td><p>6.802</p></td>
<td><p>542.2</p></td>
</tr>
<tr class="row-even"><td><p>smoothness (standard error):</p></td>
<td><p>0.002</p></td>
<td><p>0.031</p></td>
</tr>
<tr class="row-odd"><td><p>compactness (standard error):</p></td>
<td><p>0.002</p></td>
<td><p>0.135</p></td>
</tr>
<tr class="row-even"><td><p>concavity (standard error):</p></td>
<td><p>0.0</p></td>
<td><p>0.396</p></td>
</tr>
<tr class="row-odd"><td><p>concave points (standard error):</p></td>
<td><p>0.0</p></td>
<td><p>0.053</p></td>
</tr>
<tr class="row-even"><td><p>symmetry (standard error):</p></td>
<td><p>0.008</p></td>
<td><p>0.079</p></td>
</tr>
<tr class="row-odd"><td><p>fractal dimension (standard error):</p></td>
<td><p>0.001</p></td>
<td><p>0.03</p></td>
</tr>
<tr class="row-even"><td><p>radius (worst):</p></td>
<td><p>7.93</p></td>
<td><p>36.04</p></td>
</tr>
<tr class="row-odd"><td><p>texture (worst):</p></td>
<td><p>12.02</p></td>
<td><p>49.54</p></td>
</tr>
<tr class="row-even"><td><p>perimeter (worst):</p></td>
<td><p>50.41</p></td>
<td><p>251.2</p></td>
</tr>
<tr class="row-odd"><td><p>area (worst):</p></td>
<td><p>185.2</p></td>
<td><p>4254.0</p></td>
</tr>
<tr class="row-even"><td><p>smoothness (worst):</p></td>
<td><p>0.071</p></td>
<td><p>0.223</p></td>
</tr>
<tr class="row-odd"><td><p>compactness (worst):</p></td>
<td><p>0.027</p></td>
<td><p>1.058</p></td>
</tr>
<tr class="row-even"><td><p>concavity (worst):</p></td>
<td><p>0.0</p></td>
<td><p>1.252</p></td>
</tr>
<tr class="row-odd"><td><p>concave points (worst):</p></td>
<td><p>0.0</p></td>
<td><p>0.291</p></td>
</tr>
<tr class="row-even"><td><p>symmetry (worst):</p></td>
<td><p>0.156</p></td>
<td><p>0.664</p></td>
</tr>
<tr class="row-odd"><td><p>fractal dimension (worst):</p></td>
<td><p>0.055</p></td>
<td><p>0.208</p></td>
</tr>
</tbody>
</table>
<dl class="field-list simple">
<dt class="field-odd">Missing Attribute Values</dt>
<dd class="field-odd"><p>None</p>
</dd>
<dt class="field-even">Class Distribution</dt>
<dd class="field-even"><p>212 - Malignant, 357 - Benign</p>
</dd>
<dt class="field-odd">Creator</dt>
<dd class="field-odd"><p>Dr. William H. Wolberg, W. Nick Street, Olvi L. Mangasarian</p>
</dd>
<dt class="field-even">Donor</dt>
<dd class="field-even"><p>Nick Street</p>
</dd>
<dt class="field-odd">Date</dt>
<dd class="field-odd"><p>November, 1995</p>
</dd>
</dl>
</div></blockquote>
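
The layout described above (10 measurements, each with a mean, a standard error and a "worst" value) can be seen in the feature names. In the sketch below, the grouping dictionary is only an illustration; note that the loader labels the standard-error columns as `... error`.

```python
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()
names = [str(n) for n in cancer.feature_names]  # 30 names in field order

# Fields 0, 10 and 20 are the three statistics of the same measurement.
for start in (0, 10, 20):
    print(start, names[start])

# Group the column indices by base measurement, e.g. radius -> [0, 10, 20].
groups = {names[i].replace("mean ", ""): [i, i + 10, i + 20] for i in range(10)}
print(groups["radius"])
```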

#### URL
* This is a copy of UCI ML Breast Cancer Wisconsin (Diagnostic) datasets. https://goo.gl/U2Uwz2

#### Example:

In [33]:
from sklearn.datasets import load_breast_cancer

In [34]:
X, y = load_breast_cancer(return_X_y=True, as_frame=True)

print("type(X):\n", type(X))
print("X.shape:\n", X.shape)
print("X.head()")
display(X.head())

print("type(y):\n", type(y))
print("y.shape:\n", y.shape)
print("y[:5]:\n", y[:5])
print("np.unique(y):\n", np.unique(y))

type(X):
 <class 'pandas.core.frame.DataFrame'>
X.shape:
 (569, 30)
X.head()


Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst radius,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension
0,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,...,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
1,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667,...,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
2,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,0.05999,...,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
3,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,0.09744,...,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173
4,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,0.05883,...,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678


type(y):
 <class 'pandas.core.series.Series'>
y.shape:
 (569,)
y[:5]:
 0    0
1    0
2    0
3    0
4    0
Name: target, dtype: int64
np.unique(y):
 [0 1]
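
The labels `0` and `1` are easy to misread, so it helps to decode them with `target_names`. The sketch below shows that label `0` is malignant and label `1` is benign, which matches the class distribution listed earlier:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()

# Index i of target_names is the meaning of integer label i.
label_names = [str(n) for n in cancer.target_names]
counts = np.bincount(cancer.target).tolist()
print(dict(zip(label_names, counts)))
```

The first five rows printed above therefore all belong to the malignant class.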
