# Introduction

The objective of this colab is to demonstrate the `sklearn` dataset API.

Recall that it provides three kinds of functions:
1. Loaders (`load_*`) load small standard datasets bundled with `sklearn`.
2. Fetchers (`fetch_*`) fetch large datasets from the internet and load them into memory.
3. Generators (`make_*`) generate controlled synthetic datasets.

Loaders and fetchers return a `Bunch` object, and generators return a tuple of feature matrix and label vector (or matrix).
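The difference in return types can be checked directly. The following is a minimal sketch using `load_iris` and `make_blobs`, neither of which needs a download; the variable names are illustrative only.

```python
from sklearn.datasets import load_iris, make_blobs
from sklearn.utils import Bunch

# A loader returns a dictionary-like Bunch.
iris_bunch = load_iris()
print(isinstance(iris_bunch, Bunch))

# A generator returns a plain tuple: (feature matrix, labels).
blobs = make_blobs(n_samples=6, centers=2, random_state=0)
print(type(blobs), len(blobs))
```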

# Loaders

## Loading iris dataset

In [1]:
from sklearn.datasets import load_iris
data = load_iris()

This returns a `Bunch` object `data`, which is a dictionary-like object with the following attributes:
* `data` holds the feature matrix.
* `target` holds the label vector.
* `feature_names` contains the names of the features.
* `target_names` contains the names of the classes.
* `DESCR` holds the full description of the dataset.
* `filename` identifies the file the data was loaded from.
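Because a `Bunch` is dictionary-like, each attribute can also be read with key syntax. A small sketch of the two access styles (the name `bunch` is illustrative):

```python
from sklearn.datasets import load_iris

bunch = load_iris()

# Dictionary-style and attribute-style access refer to the same object.
same_object = bunch["feature_names"] is bunch.feature_names
print(same_object)

# The available keys can be listed like those of a regular dict.
print(sorted(bunch.keys()))
```

The exact set of keys can vary between `sklearn` versions, so it is safer to list them than to assume them.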

In [2]:
type(data)

sklearn.utils._bunch.Bunch

We can access them one by one and examine their contents.  For example, we can access `feature_names` as follows:

In [3]:
data.feature_names

['sepal length (cm)',
 'sepal width (cm)',
 'petal length (cm)',
 'petal width (cm)']

We can see the names of the features in this dataset.

Let's examine the names of the labels.

In [4]:
data.target_names

array(['setosa', 'versicolor', 'virginica'], dtype='<U10')

There are three classes: `setosa`, `versicolor`, `virginica`.

The feature matrix can be accessed as `data.data`. Let's look at the first five examples in the feature matrix.

In [5]:
data.data[:5]

array([[5.1, 3.5, 1.4, 0.2],
       [4.9, 3. , 1.4, 0.2],
       [4.7, 3.2, 1.3, 0.2],
       [4.6, 3.1, 1.5, 0.2],
       [5. , 3.6, 1.4, 0.2]])

We can observe 4 features per example.

Let's examine the shape of the feature matrix.

In [6]:
data.data.shape

(150, 4)

There are 150 examples and each example has 4 features.

Finally, we will examine the label vector and its shape.

In [7]:
data.target

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

There are 50 examples from each of the three classes: 0, 1 and 2.
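The integer labels index into `target_names`, so they can be translated back into class names. The sketch below does this and also counts the examples per class; the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()

# Index the array of class names with the integer labels.
label_names = iris.target_names[iris.target]
print(label_names[:3], label_names[-3:])

# Count how many examples carry each integer label.
classes, counts = np.unique(iris.target, return_counts=True)
print(dict(zip(classes.tolist(), counts.tolist())))
```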

We can read additional documentation about `load_iris` in the following manner:

In [8]:
?load_iris

In this way, we can load and examine different datasets.

We can obtain the feature matrix and the label vector directly from `load_iris`, and from loaders in general, by setting the `return_X_y` argument to `True`.

In [9]:
feature_matrix, label_vector = load_iris(return_X_y=True)
print('Shape of feature matrix:', feature_matrix.shape)
print('Shape of label vector:', label_vector.shape)

Shape of feature matrix: (150, 4)
Shape of label vector: (150,)
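As a quick sanity check, the tuple returned with `return_X_y=True` should hold the same values as the `data` and `target` attributes of the `Bunch`. A minimal sketch (the names `X_iris` and `y_iris` are illustrative):

```python
import numpy as np
from sklearn.datasets import load_iris

bunch = load_iris()
X_iris, y_iris = load_iris(return_X_y=True)

# The tuple holds the same values as the Bunch attributes.
same_features = np.array_equal(X_iris, bunch.data)
same_labels = np.array_equal(y_iris, bunch.target)
print(same_features, same_labels)
```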


## Loading diabetes dataset

In [10]:
from sklearn.datasets import load_diabetes
diabetes_data = load_diabetes()

Additional details about this loader can be accessed from the documentation.

In [11]:
?load_diabetes

### `load_diabetes`

**Step 2.** Load the dataset and obtain a `Bunch` object.

In [12]:
type(diabetes_data)

sklearn.utils._bunch.Bunch

**Step 3.** Examine the bunch object.

Look at the description of the dataset.

In [13]:
diabetes_data.DESCR

'.. _diabetes_dataset:\n\nDiabetes dataset\n----------------\n\nTen baseline variables, age, sex, body mass index, average blood\npressure, and six blood serum measurements were obtained for each of n =\n442 diabetes patients, as well as the response of interest, a\nquantitative measure of disease progression one year after baseline.\n\n**Data Set Characteristics:**\n\n  :Number of Instances: 442\n\n  :Number of Attributes: First 10 columns are numeric predictive values\n\n  :Target: Column 11 is a quantitative measure of disease progression one year after baseline\n\n  :Attribute Information:\n      - age     age in years\n      - sex\n      - bmi     body mass index\n      - bp      average blood pressure\n      - s1      tc, total serum cholesterol\n      - s2      ldl, low-density lipoproteins\n      - s3      hdl, high-density lipoproteins\n      - s4      tch, total cholesterol / HDL\n      - s5      ltg, possibly log of serum triglycerides level\n      - s6      glu, blood sugar

Find out the shape of the feature matrix.

In [14]:
feature_matrix = diabetes_data.data
feature_matrix.shape

(442, 10)

Look at the first five examples from the feature matrix.

In [15]:
feature_matrix[:5]

array([[ 0.03807591,  0.05068012,  0.06169621,  0.02187239, -0.0442235 ,
        -0.03482076, -0.04340085, -0.00259226,  0.01990749, -0.01764613],
       [-0.00188202, -0.04464164, -0.05147406, -0.02632753, -0.00844872,
        -0.01916334,  0.07441156, -0.03949338, -0.06833155, -0.09220405],
       [ 0.08529891,  0.05068012,  0.04445121, -0.00567042, -0.04559945,
        -0.03419447, -0.03235593, -0.00259226,  0.00286131, -0.02593034],
       [-0.08906294, -0.04464164, -0.01159501, -0.03665608,  0.01219057,
         0.02499059, -0.03603757,  0.03430886,  0.02268774, -0.00936191],
       [ 0.00538306, -0.04464164, -0.03638469,  0.02187239,  0.00393485,
         0.01559614,  0.00814208, -0.00259226, -0.03198764, -0.04664087]])
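The values above are small because the features are preprocessed: according to the full dataset description, each column is mean-centered and scaled so that its sum of squares is 1. The following sketch checks this numerically with the default loader settings; the tolerance used when reading the printed values is a judgment call.

```python
import numpy as np
from sklearn.datasets import load_diabetes

X_diabetes, _ = load_diabetes(return_X_y=True)

# Column means should be approximately 0 ...
column_means = X_diabetes.mean(axis=0)
# ... and column sums of squares approximately 1.
column_sq_sums = (X_diabetes ** 2).sum(axis=0)

print(np.round(column_means, 3))
print(np.round(column_sq_sums, 3))
```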

Find out the shape of the label matrix.

In [16]:
label_matrix = diabetes_data.target
label_matrix.shape

(442,)

Look at the labels of the first five examples.

In [17]:
label_matrix[:5]

array([151.,  75., 141., 206., 135.])

Find out the names of the features.

In [18]:
diabetes_data.feature_names

['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']

## Loading digits dataset

In [19]:
from sklearn.datasets import load_digits
?load_digits

In [20]:
digits_data = load_digits()

### `load_digits`

**Step 2.** Load the dataset and obtain a `Bunch` object.

In [21]:
type(digits_data)

sklearn.utils._bunch.Bunch

**Step 3.** Examine the bunch object.

Look at the description of the dataset.

In [22]:
digits_data.DESCR

".. _digits_dataset:\n\nOptical recognition of handwritten digits dataset\n--------------------------------------------------\n\n**Data Set Characteristics:**\n\n    :Number of Instances: 1797\n    :Number of Attributes: 64\n    :Attribute Information: 8x8 image of integer pixels in the range 0..16.\n    :Missing Attribute Values: None\n    :Creator: E. Alpaydin (alpaydin '@' boun.edu.tr)\n    :Date: July; 1998\n\nThis is a copy of the test set of the UCI ML hand-written digits datasets\nhttps://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits\n\nThe data set contains images of hand-written digits: 10 classes where\neach class refers to a digit.\n\nPreprocessing programs made available by NIST were used to extract\nnormalized bitmaps of handwritten digits from a preprinted form. From a\ntotal of 43 people, 30 contributed to the training set and different 13\nto the test set. 32x32 bitmaps are divided into nonoverlapping blocks of\n4x4 and the number of on pixel

Find out the shape of the feature matrix.

In [23]:
feature_matrix = digits_data.data
feature_matrix.shape

(1797, 64)
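Each row of 64 features is a flattened 8x8 image. A short sketch of recovering the image from a row, and of the `images` attribute that stores the unflattened version (variable names are illustrative):

```python
import numpy as np
from sklearn.datasets import load_digits

digits = load_digits()

# Reshape the first 64-element row back into an 8x8 image.
first_image = digits.data[0].reshape(8, 8)
print(first_image)

# The Bunch also exposes the unflattened images directly.
matches_images = np.array_equal(first_image, digits.images[0])
print(digits.images.shape, matches_images)
```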

Look at the first five examples from the feature matrix.

In [24]:
feature_matrix[:5]

array([[ 0.,  0.,  5., 13.,  9.,  1.,  0.,  0.,  0.,  0., 13., 15., 10.,
        15.,  5.,  0.,  0.,  3., 15.,  2.,  0., 11.,  8.,  0.,  0.,  4.,
        12.,  0.,  0.,  8.,  8.,  0.,  0.,  5.,  8.,  0.,  0.,  9.,  8.,
         0.,  0.,  4., 11.,  0.,  1., 12.,  7.,  0.,  0.,  2., 14.,  5.,
        10., 12.,  0.,  0.,  0.,  0.,  6., 13., 10.,  0.,  0.,  0.],
       [ 0.,  0.,  0., 12., 13.,  5.,  0.,  0.,  0.,  0.,  0., 11., 16.,
         9.,  0.,  0.,  0.,  0.,  3., 15., 16.,  6.,  0.,  0.,  0.,  7.,
        15., 16., 16.,  2.,  0.,  0.,  0.,  0.,  1., 16., 16.,  3.,  0.,
         0.,  0.,  0.,  1., 16., 16.,  6.,  0.,  0.,  0.,  0.,  1., 16.,
        16.,  6.,  0.,  0.,  0.,  0.,  0., 11., 16., 10.,  0.,  0.],
       [ 0.,  0.,  0.,  4., 15., 12.,  0.,  0.,  0.,  0.,  3., 16., 15.,
        14.,  0.,  0.,  0.,  0.,  8., 13.,  8., 16.,  0.,  0.,  0.,  0.,
         1.,  6., 15., 11.,  0.,  0.,  0.,  1.,  8., 13., 15.,  1.,  0.,
         0.,  0.,  9., 16., 16.,  5.,  0.,  0.,  0.,  0.,  

Find out the shape of the label matrix.

In [25]:
label_matrix = digits_data.target
label_matrix.shape

(1797,)

Look at the labels of the first five examples.

In [26]:
label_matrix[:5]

array([0, 1, 2, 3, 4])

Find out the names of the features.

In [27]:
digits_data.feature_names

['pixel_0_0',
 'pixel_0_1',
 'pixel_0_2',
 'pixel_0_3',
 'pixel_0_4',
 'pixel_0_5',
 'pixel_0_6',
 'pixel_0_7',
 'pixel_1_0',
 'pixel_1_1',
 'pixel_1_2',
 'pixel_1_3',
 'pixel_1_4',
 'pixel_1_5',
 'pixel_1_6',
 'pixel_1_7',
 'pixel_2_0',
 'pixel_2_1',
 'pixel_2_2',
 'pixel_2_3',
 'pixel_2_4',
 'pixel_2_5',
 'pixel_2_6',
 'pixel_2_7',
 'pixel_3_0',
 'pixel_3_1',
 'pixel_3_2',
 'pixel_3_3',
 'pixel_3_4',
 'pixel_3_5',
 'pixel_3_6',
 'pixel_3_7',
 'pixel_4_0',
 'pixel_4_1',
 'pixel_4_2',
 'pixel_4_3',
 'pixel_4_4',
 'pixel_4_5',
 'pixel_4_6',
 'pixel_4_7',
 'pixel_5_0',
 'pixel_5_1',
 'pixel_5_2',
 'pixel_5_3',
 'pixel_5_4',
 'pixel_5_5',
 'pixel_5_6',
 'pixel_5_7',
 'pixel_6_0',
 'pixel_6_1',
 'pixel_6_2',
 'pixel_6_3',
 'pixel_6_4',
 'pixel_6_5',
 'pixel_6_6',
 'pixel_6_7',
 'pixel_7_0',
 'pixel_7_1',
 'pixel_7_2',
 'pixel_7_3',
 'pixel_7_4',
 'pixel_7_5',
 'pixel_7_6',
 'pixel_7_7']

Find names of class labels.

In [28]:
digits_data.target_names

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

## Exercise

Experiment with other dataset loaders, e.g., `load_wine`, `load_breast_cancer` and `load_linnerud`.

### `load_wine`

**Step 1.** Import the loader.

In [29]:
from sklearn.datasets import load_wine

**Step 1a.** If you want to know more about the loader, access its documentation using the `?<loader_name>` command.

In [30]:
?load_wine

**Step 2.** Load the dataset and obtain a `Bunch` object.

In [31]:
data = load_wine()

**Step 3.** Examine the bunch object.

Look at the description of the dataset.

In [32]:
data.DESCR



Find out the shape of the feature matrix.

In [33]:
data.data.shape

(178, 13)

Look at the first five examples from the feature matrix.

In [34]:
data.data[:5]

array([[1.423e+01, 1.710e+00, 2.430e+00, 1.560e+01, 1.270e+02, 2.800e+00,
        3.060e+00, 2.800e-01, 2.290e+00, 5.640e+00, 1.040e+00, 3.920e+00,
        1.065e+03],
       [1.320e+01, 1.780e+00, 2.140e+00, 1.120e+01, 1.000e+02, 2.650e+00,
        2.760e+00, 2.600e-01, 1.280e+00, 4.380e+00, 1.050e+00, 3.400e+00,
        1.050e+03],
       [1.316e+01, 2.360e+00, 2.670e+00, 1.860e+01, 1.010e+02, 2.800e+00,
        3.240e+00, 3.000e-01, 2.810e+00, 5.680e+00, 1.030e+00, 3.170e+00,
        1.185e+03],
       [1.437e+01, 1.950e+00, 2.500e+00, 1.680e+01, 1.130e+02, 3.850e+00,
        3.490e+00, 2.400e-01, 2.180e+00, 7.800e+00, 8.600e-01, 3.450e+00,
        1.480e+03],
       [1.324e+01, 2.590e+00, 2.870e+00, 2.100e+01, 1.180e+02, 2.800e+00,
        2.690e+00, 3.900e-01, 1.820e+00, 4.320e+00, 1.040e+00, 2.930e+00,
        7.350e+02]])

Find out the shape of the label matrix.

In [35]:
data.target.shape

(178,)

Look at the labels of the first five examples.

In [36]:
data.target[:5]

array([0, 0, 0, 0, 0])

Find out the names of the features.

In [37]:
data.feature_names

['alcohol',
 'malic_acid',
 'ash',
 'alcalinity_of_ash',
 'magnesium',
 'total_phenols',
 'flavanoids',
 'nonflavanoid_phenols',
 'proanthocyanins',
 'color_intensity',
 'hue',
 'od280/od315_of_diluted_wines',
 'proline']

Find names of class labels.

In [38]:
data.target_names

array(['class_0', 'class_1', 'class_2'], dtype='<U7')

### `load_breast_cancer`

**Step 1.** Import the loader.

In [39]:
from sklearn.datasets import load_breast_cancer
breast_cancer_dataset = load_breast_cancer()

**Step 1a.** If you want to know more about the loader, access its documentation using the `?<loader_name>` command.

In [40]:
?load_breast_cancer

**Step 2.** Load the dataset and obtain a `Bunch` object.

In [41]:
type(breast_cancer_dataset)

sklearn.utils._bunch.Bunch

**Step 3.** Examine the bunch object.

Look at the description of the dataset.

In [42]:
breast_cancer_dataset.DESCR



Find out the shape of the feature matrix.

In [43]:
feature_matrix = breast_cancer_dataset.data
feature_matrix.shape

(569, 30)

Look at the first five examples from the feature matrix.

In [44]:
feature_matrix[:5]

array([[1.799e+01, 1.038e+01, 1.228e+02, 1.001e+03, 1.184e-01, 2.776e-01,
        3.001e-01, 1.471e-01, 2.419e-01, 7.871e-02, 1.095e+00, 9.053e-01,
        8.589e+00, 1.534e+02, 6.399e-03, 4.904e-02, 5.373e-02, 1.587e-02,
        3.003e-02, 6.193e-03, 2.538e+01, 1.733e+01, 1.846e+02, 2.019e+03,
        1.622e-01, 6.656e-01, 7.119e-01, 2.654e-01, 4.601e-01, 1.189e-01],
       [2.057e+01, 1.777e+01, 1.329e+02, 1.326e+03, 8.474e-02, 7.864e-02,
        8.690e-02, 7.017e-02, 1.812e-01, 5.667e-02, 5.435e-01, 7.339e-01,
        3.398e+00, 7.408e+01, 5.225e-03, 1.308e-02, 1.860e-02, 1.340e-02,
        1.389e-02, 3.532e-03, 2.499e+01, 2.341e+01, 1.588e+02, 1.956e+03,
        1.238e-01, 1.866e-01, 2.416e-01, 1.860e-01, 2.750e-01, 8.902e-02],
       [1.969e+01, 2.125e+01, 1.300e+02, 1.203e+03, 1.096e-01, 1.599e-01,
        1.974e-01, 1.279e-01, 2.069e-01, 5.999e-02, 7.456e-01, 7.869e-01,
        4.585e+00, 9.403e+01, 6.150e-03, 4.006e-02, 3.832e-02, 2.058e-02,
        2.250e-02, 4.571e-03, 2.357e

Find out the shape of the label matrix.

In [45]:
label_matrix = breast_cancer_dataset.target
label_matrix.shape

(569,)

Look at the labels of the first five examples.

In [46]:
label_matrix[:5]

array([0, 0, 0, 0, 0])

Find out the names of the features.

In [47]:
breast_cancer_dataset.feature_names

array(['mean radius', 'mean texture', 'mean perimeter', 'mean area',
       'mean smoothness', 'mean compactness', 'mean concavity',
       'mean concave points', 'mean symmetry', 'mean fractal dimension',
       'radius error', 'texture error', 'perimeter error', 'area error',
       'smoothness error', 'compactness error', 'concavity error',
       'concave points error', 'symmetry error',
       'fractal dimension error', 'worst radius', 'worst texture',
       'worst perimeter', 'worst area', 'worst smoothness',
       'worst compactness', 'worst concavity', 'worst concave points',
       'worst symmetry', 'worst fractal dimension'], dtype='<U23')

Find names of class labels.

In [48]:
breast_cancer_dataset.target_names

array(['malignant', 'benign'], dtype='<U9')
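Note the order of `target_names`: label `0` means malignant and label `1` means benign, which is easy to get backwards. A sketch that pairs each class name with its number of examples (the name `cancer` is illustrative):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()

# Label i corresponds to target_names[i], so 0 is 'malignant' here.
counts = np.bincount(cancer.target)
for name, count in zip(cancer.target_names, counts):
    print(name, count)
```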

### `load_linnerud`

**Step 1.** Import the loader.

In [49]:
from sklearn.datasets import load_linnerud
linnerud_dataset = load_linnerud()

**Step 1a.** If you want to know more about the loader, access its documentation using the `?<loader_name>` command.

In [50]:
?load_linnerud

**Step 2.** Load the dataset and obtain a `Bunch` object.

In [51]:
type(linnerud_dataset)

sklearn.utils._bunch.Bunch

**Step 3.** Examine the bunch object.

Look at the description of the dataset.

In [52]:
linnerud_dataset.DESCR

'.. _linnerrud_dataset:\n\nLinnerrud dataset\n-----------------\n\n**Data Set Characteristics:**\n\n    :Number of Instances: 20\n    :Number of Attributes: 3\n    :Missing Attribute Values: None\n\nThe Linnerud dataset is a multi-output regression dataset. It consists of three\nexercise (data) and three physiological (target) variables collected from\ntwenty middle-aged men in a fitness club:\n\n- *physiological* - CSV containing 20 observations on 3 physiological variables:\n   Weight, Waist and Pulse.\n- *exercise* - CSV containing 20 observations on 3 exercise variables:\n   Chins, Situps and Jumps.\n\n.. topic:: References\n\n  * Tenenhaus, M. (1998). La regression PLS: theorie et pratique. Paris:\n    Editions Technic.\n'

Extract the feature matrix.

In [53]:
feature_matrix = linnerud_dataset.data

Look at the first five examples from the feature matrix.

In [54]:
feature_matrix[:5]

array([[  5., 162.,  60.],
       [  2., 110.,  60.],
       [ 12., 101., 101.],
       [ 12., 105.,  37.],
       [ 13., 155.,  58.]])

Find out the shape of the feature matrix.

In [55]:
feature_matrix.shape

(20, 3)

Extract the label matrix. Since Linnerud is a multi-output regression dataset, the target has one column per physiological variable.

In [56]:
label_matrix = linnerud_dataset.target

Find out the names of the features.

In [57]:
linnerud_dataset.feature_names

['Chins', 'Situps', 'Jumps']

Find the names of the target variables. These are regression targets, not class labels.

In [58]:
linnerud_dataset.target_names

['Weight', 'Waist', 'Pulse']

# Fetchers

## `fetch_california_housing`

**Step 1**: Import the library and access the documentation.

In [59]:
from sklearn.datasets import fetch_california_housing
?fetch_california_housing

Note that `fetch_*` functions also return a `Bunch` object, just like loaders.

We can examine the attributes of this dataset in the same way as we did for the datasets returned by loaders.

**Step 2.** Load the dataset and obtain a `Bunch` object.

In [60]:
housing_data = fetch_california_housing()
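Unlike loaders, fetchers download the data on first use and cache it on disk, so later calls do not need the network. The sketch below shows how the cache directory is resolved with `get_data_home`. The temporary directory and the `sk_data` folder name are illustrative assumptions, chosen so that the example leaves nothing behind.

```python
import os
import tempfile

from sklearn.datasets import get_data_home

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical cache location used only for this illustration.
    custom_home = os.path.join(tmp_dir, "sk_data")

    # get_data_home creates the directory if needed and returns its path.
    resolved = get_data_home(data_home=custom_home)
    created = os.path.isdir(resolved)
    print(resolved, created)
```

Fetchers accept the same location through their `data_home` argument, which is useful when the default cache directory is not writable.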

**Step 3.** Examine the bunch object.

Look at the description of the dataset.

In [61]:
housing_data.DESCR

'.. _california_housing_dataset:\n\nCalifornia Housing dataset\n--------------------------\n\n**Data Set Characteristics:**\n\n    :Number of Instances: 20640\n\n    :Number of Attributes: 8 numeric, predictive attributes and the target\n\n    :Attribute Information:\n        - MedInc        median income in block group\n        - HouseAge      median house age in block group\n        - AveRooms      average number of rooms per household\n        - AveBedrms     average number of bedrooms per household\n        - Population    block group population\n        - AveOccup      average number of household members\n        - Latitude      block group latitude\n        - Longitude     block group longitude\n\n    :Missing Attribute Values: None\n\nThis dataset was obtained from the StatLib repository.\nhttps://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html\n\nThe target variable is the median house value for California districts,\nexpressed in hundreds of thousands of dollars ($100,000

Find out the shape of the feature matrix.

In [62]:
housing_data.data.shape

(20640, 8)

Look at the first five examples from the feature matrix.

In [63]:
housing_data.data[:5]

array([[ 8.32520000e+00,  4.10000000e+01,  6.98412698e+00,
         1.02380952e+00,  3.22000000e+02,  2.55555556e+00,
         3.78800000e+01, -1.22230000e+02],
       [ 8.30140000e+00,  2.10000000e+01,  6.23813708e+00,
         9.71880492e-01,  2.40100000e+03,  2.10984183e+00,
         3.78600000e+01, -1.22220000e+02],
       [ 7.25740000e+00,  5.20000000e+01,  8.28813559e+00,
         1.07344633e+00,  4.96000000e+02,  2.80225989e+00,
         3.78500000e+01, -1.22240000e+02],
       [ 5.64310000e+00,  5.20000000e+01,  5.81735160e+00,
         1.07305936e+00,  5.58000000e+02,  2.54794521e+00,
         3.78500000e+01, -1.22250000e+02],
       [ 3.84620000e+00,  5.20000000e+01,  6.28185328e+00,
         1.08108108e+00,  5.65000000e+02,  2.18146718e+00,
         3.78500000e+01, -1.22250000e+02]])

Find out the shape of the label matrix.

In [64]:
housing_data.target.shape

(20640,)

Look at the labels of the first five examples.

In [65]:
housing_data.target[:5]

array([4.526, 3.585, 3.521, 3.413, 3.422])

Note that the labels are real numbers, so this is a regression dataset.

Find out the names of the features.

In [66]:
housing_data.feature_names

['MedInc',
 'HouseAge',
 'AveRooms',
 'AveBedrms',
 'Population',
 'AveOccup',
 'Latitude',
 'Longitude']

Find the name of the target variable.

In [67]:
housing_data.target_names

['MedHouseVal']

## `fetch_openml`

[openml.org](https://www.openml.org) is a public repository for machine learning data and experiments that allows everybody to upload open datasets.

Import the library and access the documentation.

In [68]:
from sklearn.datasets import fetch_openml
?fetch_openml

Note that this is an experimental API and is likely to change in future releases.

> We use this API to load the MNIST dataset.

In [69]:
X, y = fetch_openml('mnist_784', version=1, return_X_y=True)
print("Feature matrix shape:", X.shape)
print("Label shape:", y.shape)


Feature matrix shape: (70000, 784)
Label shape: (70000,)


## Exercise

### `fetch_20newsgroups`

**Step 1.** Import the fetcher.

In [70]:
from sklearn.datasets import fetch_20newsgroups
newsgroups_dataset = fetch_20newsgroups()

**Step 1a.** If you want to know more about the fetcher, access its documentation using the `?<fetcher_name>` command.

In [71]:
?fetch_20newsgroups

**Step 2.** Load the dataset and obtain a `Bunch` object.

In [72]:
type(newsgroups_dataset)

sklearn.utils._bunch.Bunch

**Step 3.** Examine the bunch object.

Look at the description of the dataset.

In [73]:
newsgroups_dataset.DESCR



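Find out the number of examples. Here `data` is a list of raw text documents rather than a numeric matrix, so we use `len` instead of `shape`.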

In [74]:
feature_matrix = newsgroups_dataset.data
len(feature_matrix)

11314

Look at the first five examples from the feature matrix.

In [75]:
feature_matrix[:5]

["From: lerxst@wam.umd.edu (where's my thing)\nSubject: WHAT car is this!?\nNntp-Posting-Host: rac3.wam.umd.edu\nOrganization: University of Maryland, College Park\nLines: 15\n\n I was wondering if anyone out there could enlighten me on this car I saw\nthe other day. It was a 2-door sports car, looked to be from the late 60s/\nearly 70s. It was called a Bricklin. The doors were really small. In addition,\nthe front bumper was separate from the rest of the body. This is \nall I know. If anyone can tellme a model name, engine specs, years\nof production, where this car is made, history, or whatever info you\nhave on this funky looking car, please e-mail.\n\nThanks,\n- IL\n   ---- brought to you by your neighborhood Lerxst ----\n\n\n\n\n",
 "From: guykuo@carson.u.washington.edu (Guy Kuo)\nSubject: SI Clock Poll - Final Call\nSummary: Final call for SI clock reports\nKeywords: SI,acceleration,clock,upgrade\nArticle-I.D.: shelley.1qvfo9INNc3s\nOrganization: University of Washington\nLines: 

Find out the shape of the label matrix.

In [76]:
label_matrix = newsgroups_dataset.target
label_matrix.shape

(11314,)

Look at the labels of the first five examples.

In [77]:
label_matrix[:5]

array([ 7,  4,  4,  1, 14])

Find names of class labels.

In [78]:
newsgroups_dataset.target_names

['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']

### `fetch_kddcup99`

**Step 1.** Import the fetcher.

In [79]:
from sklearn.datasets import fetch_kddcup99
kddcup99_dataset = fetch_kddcup99()

**Step 1a.** If you want to know more about the fetcher, access its documentation using the `?<fetcher_name>` command.

In [80]:
?fetch_kddcup99


**Step 2.** Load the dataset and obtain a `Bunch` object.

In [81]:
type(kddcup99_dataset)

sklearn.utils._bunch.Bunch

**Step 3.** Examine the bunch object.

Look at the description of the dataset.

In [82]:
kddcup99_dataset.DESCR



Find out the shape of the feature matrix.

In [83]:
feature_matrix = kddcup99_dataset.data
feature_matrix.shape

(494021, 41)

Look at the first five examples from the feature matrix.

In [84]:
feature_matrix[:5]

array([[0, b'tcp', b'http', b'SF', 181, 5450, 0, 0, 0, 0, 0, 1, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 8, 8, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 9,
        9, 1.0, 0.0, 0.11, 0.0, 0.0, 0.0, 0.0, 0.0],
       [0, b'tcp', b'http', b'SF', 239, 486, 0, 0, 0, 0, 0, 1, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 8, 8, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 19,
        19, 1.0, 0.0, 0.05, 0.0, 0.0, 0.0, 0.0, 0.0],
       [0, b'tcp', b'http', b'SF', 235, 1337, 0, 0, 0, 0, 0, 1, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 8, 8, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 29,
        29, 1.0, 0.0, 0.03, 0.0, 0.0, 0.0, 0.0, 0.0],
       [0, b'tcp', b'http', b'SF', 219, 1337, 0, 0, 0, 0, 0, 1, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 6, 6, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 39,
        39, 1.0, 0.0, 0.03, 0.0, 0.0, 0.0, 0.0, 0.0],
       [0, b'tcp', b'http', b'SF', 217, 2032, 0, 0, 0, 0, 0, 1, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 6, 6, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 49,
        49, 1.0, 0.0, 0.02, 0.0, 0.0, 0.0, 0.0, 0.0]

Find out the shape of the label matrix.

In [85]:
label_matrix = kddcup99_dataset.target
label_matrix.shape

(494021,)

Look at the labels of the first five examples.

In [86]:
label_matrix[:5]

array([b'normal.', b'normal.', b'normal.', b'normal.', b'normal.'],
      dtype=object)
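The labels, like the categorical feature columns, are stored as Python `bytes`. If plain strings are more convenient, they can be decoded. The sketch below uses a small hand-made array in the same format as the output above, so it runs without downloading the dataset. Stripping the trailing dot is an optional choice, not something the dataset requires.

```python
import numpy as np

# Hand-made stand-in mimicking kddcup99_dataset.target[:3].
sample_labels = np.array([b'normal.', b'normal.', b'normal.'], dtype=object)

# Decode each bytes value and drop the trailing dot.
decoded = np.array([label.decode("utf-8").rstrip(".") for label in sample_labels])
print(decoded)
```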

Find out the names of the features.

In [87]:
kddcup99_dataset.feature_names

['duration',
 'protocol_type',
 'service',
 'flag',
 'src_bytes',
 'dst_bytes',
 'land',
 'wrong_fragment',
 'urgent',
 'hot',
 'num_failed_logins',
 'logged_in',
 'num_compromised',
 'root_shell',
 'su_attempted',
 'num_root',
 'num_file_creations',
 'num_shells',
 'num_access_files',
 'num_outbound_cmds',
 'is_host_login',
 'is_guest_login',
 'count',
 'srv_count',
 'serror_rate',
 'srv_serror_rate',
 'rerror_rate',
 'srv_rerror_rate',
 'same_srv_rate',
 'diff_srv_rate',
 'srv_diff_host_rate',
 'dst_host_count',
 'dst_host_srv_count',
 'dst_host_same_srv_rate',
 'dst_host_diff_srv_rate',
 'dst_host_same_src_port_rate',
 'dst_host_srv_diff_host_rate',
 'dst_host_serror_rate',
 'dst_host_srv_serror_rate',
 'dst_host_rerror_rate',
 'dst_host_srv_rerror_rate']

Find names of class labels.

In [88]:
kddcup99_dataset.target_names

['labels']

# Generators

## `make_regression`

In [89]:
from sklearn.datasets import make_regression
?make_regression

#### Example 1

Let's generate 100 samples with 5 features for a single-target regression problem.

In [90]:
X, y = make_regression(n_samples=100, n_features=5, n_targets=1, shuffle=True, random_state=42)

It is good practice to set the seed (`random_state`) so that the experiment is repeatable.
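The effect of the seed can be verified directly. In this sketch, two calls with the same `random_state` return identical arrays, while a different seed does not; the seed value `0` and the variable names are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression

# Two calls with the same seed produce identical data.
X_a, y_a = make_regression(n_samples=100, n_features=5, random_state=42)
X_b, y_b = make_regression(n_samples=100, n_features=5, random_state=42)
same_seed_equal = np.array_equal(X_a, X_b) and np.array_equal(y_a, y_b)

# A different seed produces different data.
X_c, _ = make_regression(n_samples=100, n_features=5, random_state=0)
different_seed_equal = np.array_equal(X_a, X_c)

print(same_seed_equal, different_seed_equal)
```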

Let's look at the shapes of feature matrix and label vector.

In [91]:
X.shape

(100, 5)

In [92]:
y.shape

(100,)
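`make_regression` builds the target as a linear combination of the features. Passing `coef=True` also returns the coefficients that were used, which is handy for comparing a fitted model against the ground truth. A sketch, relying on the default `noise=0.0` and `bias=0.0` (the names `X_lin`, `y_lin` and `true_coef` are illustrative):

```python
import numpy as np
from sklearn.datasets import make_regression

X_lin, y_lin, true_coef = make_regression(
    n_samples=100, n_features=5, n_targets=1, coef=True, random_state=42
)

# With no noise and no bias, y is exactly the linear model X @ true_coef.
reconstructs = np.allclose(X_lin @ true_coef, y_lin)
print(true_coef.shape, reconstructs)
```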

#### Example 2

Let's generate 100 samples with 5 features for a multi-output regression problem with 5 outputs.

In [93]:
X, y = make_regression(n_samples=100, n_features=5, n_targets=5, shuffle=True, random_state=42)

Let's look at the shapes of the feature matrix and the label matrix.

In [94]:
X.shape

(100, 5)

In [95]:
y.shape

(100, 5)

Since we generated multi-output target with 5 outputs, the output has shape `(100, 5)`.

## `make_classification`

`make_classification` generates a random $n$-class classification problem.

In [96]:
from sklearn.datasets import make_classification
?make_classification

Let's generate a binary classification problem with 10 features and 100 samples.

In [97]:
X, y = make_classification(n_samples=100, n_features=10, n_classes=2, n_clusters_per_class=1, random_state=42)

Let's examine the shapes of feature matrix and label vector.

In [98]:
X.shape

(100, 10)

In [99]:
y.shape

(100,)

Look at a few examples and their labels.

In [100]:
X[:5]

array([[ 0.11422765, -1.71016839, -0.06822216, -0.14928517,  0.30780177,
         0.15030176, -0.05694562, -0.22595246, -0.36361221, -0.13818757],
       [ 0.70775194, -1.57022472, -0.23503183, -0.63604713,  0.62180996,
        -0.56246678,  0.97255445, -0.77719676,  0.63240774, -0.47809669],
       [ 0.63859246,  0.04739867,  0.33273433,  1.1046981 , -0.65183611,
        -1.66152006, -1.2110162 ,  1.09821151, -0.0660798 ,  0.68024225],
       [-0.23894805, -0.97755524,  0.0379061 ,  0.19896733,  0.50091719,
        -0.90756366,  0.75539123,  0.12437227, -0.57677133,  0.07871283],
       [-0.59239392, -0.05023811,  0.17573204, -1.43949185,  0.27045683,
        -0.86399077, -0.83095012,  0.60046915,  0.04852163,  0.32557953]])

In [101]:
y[:5]

array([1, 1, 1, 1, 0])

Let's generate a three class classification problem with 100 samples and 10 features.

In [102]:
X, y = make_classification(n_samples=100, n_features=10, n_classes=3, n_clusters_per_class=1, random_state=42)

Let's examine shapes of feature matrix and labels.

In [103]:
X.shape

(100, 10)

In [104]:
y.shape

(100,)

Let's look at a few examples - features and labels.

In [105]:
X[:5]

array([[-0.58351628, -1.73833907, -1.37298251, -1.77311485,  0.45918008,
         0.83392215, -1.66096093,  0.20768769, -0.07016571,  0.42961822],
       [-1.0044394 , -1.43862044,  0.47335819, -0.21188291,  0.0125924 ,
         0.22409248, -0.77300978,  0.49799829,  0.0976761 ,  0.02451017],
       [ 0.07740833,  0.19896733,  0.12437227,  0.17738132, -0.97755524,
         0.50091719,  0.75138712,  0.54336019,  0.09933231, -1.66940528],
       [-0.91759569, -0.9609536 ,  1.07746664,  0.4522739 , -0.32138584,
        -0.8254972 , -0.56372455,  0.24368721,  0.41293145, -0.8222204 ],
       [-0.96222828, -0.96090774,  1.21530116,  0.55980482, -1.24778318,
        -0.25256815, -1.43014138,  0.13074058,  1.6324113 , -0.44004449]])

In [106]:
y[:5]

array([2, 0, 1, 0, 0])
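It can be useful to check how the samples are spread over the classes. With the default `weights=None` the classes are roughly balanced, and the default `flip_y=0.01` randomly reassigns a small fraction of labels, so the counts need not be exactly equal. A sketch with the same arguments as the call above (the names `y_three` and `class_counts` are illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification

_, y_three = make_classification(
    n_samples=100, n_features=10, n_classes=3,
    n_clusters_per_class=1, random_state=42,
)

# Number of samples assigned to each of the three classes.
class_counts = np.bincount(y_three, minlength=3)
print(class_counts)
```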

## `make_multilabel_classification`

This function generates a random multi-label classification problem.

In [107]:
from sklearn.datasets import make_multilabel_classification
?make_multilabel_classification

Let's generate a multilabel classification problem with 100 samples, 20 features, 5 labels and, on average, 2 labels per example.

In [108]:
X, y = make_multilabel_classification(n_samples=100, n_features=20, n_classes=5, n_labels=2)

First of all, let's examine the shapes of the feature matrix and the label matrix.

In [109]:
X.shape

(100, 20)

In [110]:
y.shape

(100, 5)

Let's examine a few rows of feature matrix and label matrix.

In [111]:
X[:5]

array([[2., 1., 4., 1., 0., 1., 1., 0., 1., 1., 2., 3., 1., 2., 2., 2.,
        1., 7., 3., 3.],
       [4., 1., 0., 1., 0., 1., 4., 5., 2., 2., 2., 3., 1., 0., 4., 4.,
        0., 3., 2., 3.],
       [4., 3., 1., 2., 4., 2., 1., 2., 3., 0., 5., 2., 2., 1., 0., 1.,
        1., 1., 1., 3.],
       [1., 0., 3., 4., 4., 1., 0., 1., 2., 4., 0., 1., 3., 3., 1., 3.,
        0., 3., 1., 3.],
       [3., 4., 2., 2., 2., 0., 1., 0., 3., 1., 2., 3., 6., 2., 5., 5.,
        0., 7., 3., 5.]])

In [112]:
y[:5]

array([[0, 1, 1, 0, 0],
       [0, 1, 1, 0, 1],
       [0, 0, 0, 0, 0],
       [1, 0, 1, 0, 1],
       [0, 0, 1, 1, 1]])
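Each row of the label matrix is a 0/1 indicator vector, so summing along a row gives the number of labels assigned to that example. Note that `n_labels=2` sets the average of the sampling distribution, so individual examples can have more or fewer labels, including none. The sketch below sets `random_state=42` for repeatability, which the call above does not do, so its numbers will generally differ from the outputs shown above.

```python
import numpy as np
from sklearn.datasets import make_multilabel_classification

X_ml, y_ml = make_multilabel_classification(
    n_samples=100, n_features=20, n_classes=5, n_labels=2, random_state=42
)

# Each row of y_ml is a 0/1 indicator vector over the 5 classes.
labels_per_example = y_ml.sum(axis=1)
print(labels_per_example[:10])
print(labels_per_example.mean())
```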

## `make_blobs`

`make_blobs` enables us to generate random data for clustering.

In [113]:
from sklearn.datasets import make_blobs
?make_blobs

Let's generate a random dataset of 10 samples with 2 features each, drawn around 3 cluster centers.

In [114]:
X, y = make_blobs(n_samples=10, centers=3, n_features=2, random_state=42)
print("Feature matrix shape:", X.shape)
print("Label shape:", y.shape)

Feature matrix shape: (10, 2)
Label shape: (10,)


We can find the cluster membership of each point in `y`.

In [115]:
y

array([2, 2, 1, 2, 0, 0, 0, 1, 1, 0])
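The cluster centers themselves can be retrieved by passing `return_centers=True`, and the cluster sizes follow from counting the labels. A sketch with the same arguments as above (the names `X_blobs`, `y_blobs`, `centers` and `cluster_sizes` are illustrative):

```python
import numpy as np
from sklearn.datasets import make_blobs

X_blobs, y_blobs, centers = make_blobs(
    n_samples=10, centers=3, n_features=2, random_state=42, return_centers=True
)

# One row per cluster center, one column per feature.
print(centers.shape)

# Samples are split as evenly as possible across the centers.
cluster_sizes = np.bincount(y_blobs)
print(cluster_sizes)
```

Since 10 samples cannot be divided equally among 3 centers, one cluster receives an extra point, which matches the membership vector shown above.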