# Machine Learning with Python

## 1.2 Data

To get started, let's get hold of some data.

### [Toy datasets](https://scikit-learn.org/stable/datasets/toy_dataset.html)

We have some small demonstration datasets immediately available in `sklearn.datasets`:

In [None]:
from sklearn.datasets import load_iris

iris = load_iris()

type(iris)

The data is returned as a `Bunch` object. This is similar to a dictionary, but also allows values to be referenced as attributes.

In [None]:
iris.keys()

In [None]:
iris["filename"]

In [None]:
iris.filename

Notice that the `data` and the `target` are provided separately as numpy arrays. Each row is an observation (i.e. a data point) and each column is a variable.

In [None]:
iris.data

In [None]:
iris.target

In [None]:
iris.target_names
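
The layout described above can be checked directly from the array shapes. This is a minimal sketch using only the `iris` Bunch already loaded; the variable names `n_samples` and `n_features` are just illustrative.

```python
from sklearn.datasets import load_iris

iris = load_iris()

# data has one row per observation and one column per feature
n_samples, n_features = iris.data.shape
print(n_samples, n_features)

# target has exactly one entry per observation
print(iris.target.shape)

# the integer labels in target index into target_names
print(iris.target_names[iris.target[:5]])

# feature_names gives a label for each column of data
print(iris.feature_names)
```
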

Use the `DESCR` attribute to find out more about the dataset.

In [None]:
print(iris.DESCR)

### Splitting testing from training data

Before we even look at the data, it is good practice to split off a test dataset that will remain unseen until we are ready for final evaluation of our models.

sklearn has a convenient function `train_test_split` that will create a randomised 75%:25% split of training:testing data. The `X` values are the features and the `y` values are the target.

Notice that we can "seed" the random number generator to create deterministic output - this can be helpful during code development as it means our results will not change between runs.

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=0)
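
As a quick check of the behaviour described above, the sketch below confirms the size of the default split, shows that reusing the same seed reproduces the same split, and uses the `test_size` parameter to ask for a different proportion. The `demo_` variable names are illustrative only, chosen so that the split made above is not overwritten.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()

# default split: roughly 75% training, 25% testing
demo_X_train, demo_X_test, demo_y_train, demo_y_test = train_test_split(
    iris.data, iris.target, random_state=0)
print(demo_X_train.shape, demo_X_test.shape)

# the same seed gives exactly the same split
repeat_X_train, repeat_X_test, _, _ = train_test_split(
    iris.data, iris.target, random_state=0)
print(np.array_equal(demo_X_train, repeat_X_train))

# test_size changes the proportion held back for testing
small_X_train, small_X_test, _, _ = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=0)
print(small_X_train.shape, small_X_test.shape)
```
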

### Visualising data

Before we start, it is good practice to take a look at the general form of the data to identify any inconsistencies or errors.

We can use the `scatter_matrix` from pandas to create a pairwise matrix of scatter plots together with histograms for the individual features.



In [None]:
import pandas as pd
import matplotlib.pyplot as plt

# create dataframe from data in X_train
# label the columns using the strings in iris.feature_names
iris_dataframe = pd.DataFrame(X_train, columns=iris.feature_names)

# create a scatter matrix from the dataframe, color by y_train
pd.plotting.scatter_matrix(iris_dataframe, c=y_train, figsize=(15, 15),
                           marker='o', hist_kwds={'bins': 20}, s=60,
                           alpha=.8)

plt.show()

### [Real-world datasets](https://scikit-learn.org/stable/datasets/real_world.html)

In addition to the toy data, scikit-learn has loader functions for some commonly used larger sets. For example, the [Olivetti faces](https://scikit-learn.org/stable/datasets/real_world.html#the-olivetti-faces-dataset) dataset.


In [1]:
from sklearn.datasets import fetch_olivetti_faces

olivetti = fetch_olivetti_faces()


In [None]:
olivetti.keys()

In [2]:
olivetti.target

array([ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  1,  1,  1,  1,  1,  1,  1,
        1,  1,  1,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  3,  3,  3,  3,
        3,  3,  3,  3,  3,  3,  4,  4,  4,  4,  4,  4,  4,  4,  4,  4,  5,
        5,  5,  5,  5,  5,  5,  5,  5,  5,  6,  6,  6,  6,  6,  6,  6,  6,
        6,  6,  7,  7,  7,  7,  7,  7,  7,  7,  7,  7,  8,  8,  8,  8,  8,
        8,  8,  8,  8,  8,  9,  9,  9,  9,  9,  9,  9,  9,  9,  9, 10, 10,
       10, 10, 10, 10, 10, 10, 10, 10, 11, 11, 11, 11, 11, 11, 11, 11, 11,
       11, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 13, 13, 13, 13, 13, 13,
       13, 13, 13, 13, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 15, 15, 15,
       15, 15, 15, 15, 15, 15, 15, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16,
       17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 18, 18, 18, 18, 18, 18, 18,
       18, 18, 18, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 20, 20, 20, 20,
       20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 22,
       22, 22, 22, 22, 22

In [None]:
olivetti.data[0]

In [None]:
olivetti.images[0]

In [None]:
plt.imshow(olivetti.images[0])
plt.show()

### [Data generators](https://scikit-learn.org/stable/datasets/sample_generators.html)

Sometimes we want to generate synthetic data to test clustering and regression methods. scikit-learn provides a number of helpful functions to do this for us.

For example, we can create a classification dataset consisting of a number of blobs:


In [None]:
from sklearn.datasets import make_blobs

# 100 data points, 2 features, 3 blobs
blobs_X, blobs_y = make_blobs(n_samples=100, n_features=2, centers=3)

# split off a test dataset
X_train, X_test, y_train, y_test = train_test_split(blobs_X, blobs_y, random_state=0)

In [None]:
plt.scatter(X_train[:,0],X_train[:,1], c=y_train)
plt.show()
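
The generators accept a `random_state` seed in the same way as `train_test_split`, which makes the synthetic data reproducible. The sketch below shows this for `make_blobs`, and also uses `make_regression` to create a simple noisy regression problem. The parameter values and variable names are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import make_blobs, make_regression

# two calls with the same seed produce identical blobs
blobs_a, labels_a = make_blobs(n_samples=100, n_features=2, centers=3,
                               random_state=0)
blobs_b, labels_b = make_blobs(n_samples=100, n_features=2, centers=3,
                               random_state=0)
print(np.array_equal(blobs_a, blobs_b))

# the labels record which blob each point was drawn from
print(np.unique(labels_a))

# a regression dataset: one feature, with Gaussian noise added to the target
reg_X, reg_y = make_regression(n_samples=100, n_features=1, noise=10.0,
                               random_state=0)
print(reg_X.shape, reg_y.shape)
```
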

### [External datasets](https://scikit-learn.org/stable/datasets/loading_other_datasets.html#loading-from-external-datasets)

There are of course many other sources of data that we might want to use.

scikit-learn provides a direct interface to the [OpenML](https://www.openml.org/home) repository, so it is very easy to make use of these datasets in your work. See [here](https://scikit-learn.org/stable/datasets/loading_other_datasets.html#downloading-datasets-from-the-openml-org-repository) for more details.



In [None]:
from sklearn.datasets import fetch_openml

mice = fetch_openml(name='miceprotein', version=4)

mice.keys()

To load data from a CSV file, we would use the pandas functions:

In [None]:
codon = pd.read_csv("codon_usage.csv")

In [None]:
codon.head()

In [None]:
# Extract a DataFrame for the features
codon_X = codon.iloc[:,5:]

codon_X

In [None]:
# Extract a Series for the target
codon_y = codon.iloc[:,0]

codon_y

In [None]:
# We can now work in scikit-learn with these pandas objects 
# in exactly the same way as for the numpy arrays
X_train, X_test, y_train, y_test = train_test_split(codon_X, codon_y, random_state=0)
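
To try the same pattern without the CSV file, here is a minimal sketch using a small made-up table (the column names and values are invented purely for illustration). It shows that `train_test_split` returns pandas objects when given pandas objects, and that the features and target keep matching row labels.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# a small made-up table standing in for data loaded with pd.read_csv
demo = pd.DataFrame({
    "label": ["a", "b", "a", "b", "a", "b", "a", "b"],
    "feature_1": [0.1, 0.4, 0.2, 0.5, 0.3, 0.6, 0.2, 0.7],
    "feature_2": [1.0, 2.0, 1.5, 2.5, 1.2, 2.2, 1.1, 2.4],
})

# features are every column after the first, the target is the first column
demo_X = demo.iloc[:, 1:]
demo_y = demo.iloc[:, 0]

demo_X_train, demo_X_test, demo_y_train, demo_y_test = train_test_split(
    demo_X, demo_y, random_state=0)

# the outputs are still a DataFrame and a Series
print(type(demo_X_train).__name__, type(demo_y_train).__name__)

# the row labels of features and target stay aligned after shuffling
print(demo_X_train.index.equals(demo_y_train.index))
```
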

### Other data resources

Some other repositories you may find useful:

https://paperswithcode.com/datasets

https://archive.ics.uci.edu/ml/index.php


### Preprocessing

There may be several preprocessing steps that we need to complete before data is ready to use in our chosen method. Below are a few processes that are commonly applied.

Importantly, note that any transformation of the features (e.g. the imputation or standardisation discussed below) should be calculated on the *training* data only, after which the same transformation is applied to both the training and testing data.

If we were to fit the transformation to the *entire* data set, information from the testing data would leak into the transformed training data. The testing data would then no longer be truly unseen, which is what we want to avoid.
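
The fit-on-training, transform-both pattern can be sketched as follows. The example uses randomly generated numbers rather than a real dataset, and a `StandardScaler` as the transformation; the `demo_` names are illustrative only.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# made-up numeric data: 100 observations of 3 features
rng = np.random.default_rng(0)
demo_X = rng.normal(loc=5.0, scale=2.0, size=(100, 3))

demo_train, demo_test = train_test_split(demo_X, random_state=0)

# the scaler learns its means and variances from the training data only
scaler = StandardScaler()
scaler.fit(demo_train)

# the same fitted transformation is then applied to both sets
demo_train_scaled = scaler.transform(demo_train)
demo_test_scaled = scaler.transform(demo_test)

# the training means are zero (up to rounding error)
print(demo_train_scaled.mean(axis=0))

# the testing means are generally close to zero, but not exactly zero
print(demo_test_scaled.mean(axis=0))
```

The testing data is scaled using statistics it did not contribute to, so it plays the same role as genuinely new data would.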

Consider the following table of features describing some students:

* Age
* Subject of current degree course
* Level of current degree course
* Height


In [None]:
import numpy as np

X_train = pd.DataFrame(
            [
              [25,"Chemistry","PhD",1.50],
              [32,"Physics","MSc",1.67],
              [20,"Mathematics","BSc",1.69],
              [22,"Mathematics","PhD",1.58],
              [np.nan,"Physics","PhD",1.70],
              [25,"Physics","PhD",1.82],
              [np.nan,"Mathematics","BSc",1.49],
              [19,"Chemistry","BSc",1.80]
             ], 
             columns=["age","subject","level","height"]
          )

X_train

#### [Encoding categorical features](https://scikit-learn.org/stable/modules/preprocessing.html#encoding-categorical-features)

Categorical datatypes need to be transformed to numbers before scikit-learn can use them as features.

For binary features and features whose categories have a natural ordering, this can be done by assigning an integer value to each category. This is called *ordinal encoding*.

The feature `level` has a natural ordering, so we can use the `OrdinalEncoder` to encode it:

In [None]:
# extract the level column as a pandas DataFrame
level = X_train[["level"]]

level

In [None]:
from sklearn.preprocessing import OrdinalEncoder

# we provide the ordered categories to the encoder
enc = OrdinalEncoder(categories=[["BSc","MSc","PhD"]])

# fit the encoder to the data
enc.fit(level)

# encode the data
level_enc = enc.transform(level)

print(level_enc)

In [None]:
# make a copy of the original DataFrame
X_train_enc = X_train.copy()

# replace the "level" column with the encoded version
X_train_enc["level"] = level_enc

X_train_enc

Although ordinal encoding is suitable for many categorical features, it may produce poor performance when the data type is nominal, i.e. where no meaningful ordering exists for the allowed values.

A better solution in these cases is to use *one-hot encoding* to replace a single integer feature with multiple binary features.

We will use a `OneHotEncoder` to encode the `subject` feature:

In [None]:
# extract the subject column as a pandas DataFrame
subject = X_train[["subject"]]  

subject

In [None]:
from sklearn.preprocessing import OneHotEncoder

# setting sparse_output=False means that enc.transform() will return a dense array
# (this parameter was called `sparse` before scikit-learn 1.2)
enc = OneHotEncoder(sparse_output=False)

# fit the encoder to the data
enc.fit(subject)

# encode the data
subject_enc = enc.transform(subject)

print(subject_enc)

In [None]:
new_columns = pd.DataFrame( subject_enc, columns= "subject_" + enc.categories_[0] )

new_columns

In [None]:
# remove the original "subject" feature and add the new features
X_train_enc = X_train_enc.drop("subject",axis=1).join(new_columns)

X_train_enc
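
Because the encoder is fitted on the training data only, the testing data may contain a category it has never seen. One way to handle this, sketched below, is the `handle_unknown="ignore"` option of `OneHotEncoder`, which encodes an unseen category as a row of zeros. The two small tables and the `subject_encoder` name are made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# made-up training and testing columns; "Biology" appears only in testing
train_subject = pd.DataFrame(
    {"subject": ["Chemistry", "Physics", "Mathematics", "Physics"]})
test_subject = pd.DataFrame({"subject": ["Physics", "Biology"]})

# unseen categories are encoded as all zeros instead of raising an error
subject_encoder = OneHotEncoder(handle_unknown="ignore")
subject_encoder.fit(train_subject)

# transform returns a sparse matrix by default; convert it to a dense array
test_encoded = subject_encoder.transform(test_subject).toarray()

print(subject_encoder.categories_[0])
print(test_encoded)
```
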

#### [Imputation of missing values](https://scikit-learn.org/stable/modules/impute.html)

Real-world datasets often contain missing values, represented in data as `"?"` or `np.nan`.

We might choose to ignore (drop) rows that contain missing values. However, this wastes the rest of the information in that row.

It may be more desirable to insert a guess in place of the missing values. scikit-learn provides a couple of methods to do this.

The simplest approach is to use the mean of the column values in place of any unknown value:

In [None]:
from sklearn.impute import SimpleImputer

imp = SimpleImputer()
imp.fit(X_train_enc)
X_imp = imp.transform(X_train_enc)

X_imp
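
To see exactly what is being inserted, the sketch below applies a `SimpleImputer` to the `age` values on their own. The fitted `statistics_` attribute holds the value used for each column. The `strategy` parameter selects the statistic, so the median can be used instead of the mean. The variable names are illustrative only.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# the age column from the table above, as a single-column 2D array
age = np.array([[25], [32], [20], [22], [np.nan], [25], [np.nan], [19]])

# mean of the six known ages
mean_imputer = SimpleImputer(strategy="mean")
age_mean = mean_imputer.fit_transform(age)
print(mean_imputer.statistics_)

# median of the six known ages
median_imputer = SimpleImputer(strategy="median")
age_median = median_imputer.fit_transform(age)
print(median_imputer.statistics_)

# the two missing entries (rows 4 and 6) now hold the fitted statistic
print(age_mean[[4, 6], 0], age_median[[4, 6], 0])
```
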

A more sophisticated approach uses iterative regression modelling to try to guess the unknown values, based on the other features seen in that row:

In [None]:
# explicitly require this experimental feature
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

imp = IterativeImputer()
imp.fit(X_train_enc)
X_imp = imp.transform(X_train_enc)

X_imp

#### [Standardisation](https://scikit-learn.org/stable/modules/preprocessing.html#standardization-or-mean-removal-and-variance-scaling)

Once we have encoded categorical features and imputed any missing values, it may be necessary to centre the data and rescale it so that all features have unit variance. Some machine learning methods require this in order to operate correctly.
 

In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(X_imp)

X_scaled = scaler.transform(X_imp)

X_scaled

Each feature in the scaled data has zero mean and unit variance:

In [None]:
X_scaled.mean( axis=0 )

In [None]:
X_scaled.var( axis=0 )

Note that this basic type of standardisation will cause problems for sparse data sets and data containing outlier values - there are other functions that implement alternative scaling procedures recommended for these cases.
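
As one example of such an alternative, `RobustScaler` centres each feature on its median and scales by its interquartile range, so a single extreme value has much less influence on how the other values are scaled. (For sparse data, `MaxAbsScaler` is another option.) The sketch below compares the two scalers on a made-up column containing one outlier.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# a made-up feature with one extreme value
values = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

# mean and variance are dominated by the outlier,
# so the four ordinary values end up squashed together
standard = StandardScaler().fit_transform(values)
print(standard.ravel())

# median and interquartile range are barely affected by the outlier,
# so the ordinary values keep a sensible spread
robust = RobustScaler().fit_transform(values)
print(robust.ravel())
```
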

### Exercise 

Download the `autoMpg` dataset from OpenML:

In [None]:
from sklearn.datasets import fetch_openml
mpg = fetch_openml(name='autoMpg',version=1)

See https://www.openml.org/d/196 for a description of the data.

Separate the data into training and testing sets.

The `origin` feature is a nominal value coded as an integer.
Use one-hot encoding to turn this column into multiple binary features.

The `horsepower` feature has 6 missing values. Can you impute them?

Finally, standardise the dataset using `StandardScaler`.