# Machine Learning with Python

## 1.2 Data

To get started, let's get hold of some data.

### [Toy datasets](https://scikit-learn.org/stable/datasets/toy_dataset.html)

We have some small demonstration datasets immediately available in `sklearn.datasets`:

In [None]:
from sklearn.datasets import load_iris

iris = load_iris()

type(iris)

The data is returned as a `Bunch` object. This is similar to a dictionary, but also allows values to be referenced as attributes.

In [None]:
iris.keys()

In [None]:
iris["filename"]

In [None]:
iris.filename

Notice that the `data` and the `target` are provided separately as numpy arrays. Each row is an observation (i.e. a data point) and each column is a variable.

In [None]:
iris.data

In [None]:
iris.target

In [None]:
iris.target_names
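As a quick check of the layout described above, the following sketch prints the array shapes and shows how an integer in `target` maps to a class name. It only uses attributes of the `Bunch` returned by `load_iris`.

```python
from sklearn.datasets import load_iris

iris = load_iris()

# data: one row per observation, one column per feature
print(iris.data.shape)    # (150, 4)

# target: one integer class label per observation
print(iris.target.shape)  # (150,)

# feature_names labels the columns of data
print(iris.feature_names)

# target_names is indexed by the integer label in target
print(iris.target_names[iris.target[0]])  # setosa
```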

Use the `DESCR` attribute to find out more about the dataset.

In [None]:
print(iris.DESCR)

### Splitting testing from training data

Before we even look at the data, it is good practice to split off a test dataset that will remain unseen until we are ready for final evaluation of our models.

scikit-learn has a convenient function, `train_test_split`, that by default creates a randomised 75%:25% split of training:testing data. The `X` values are the features and the `y` values are the target.

Notice that we can "seed" the random number generator to create deterministic output - this can be helpful during code development as it means our results will not change between runs.

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=0)
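If we want to make the split proportions explicit, a minimal sketch might look like the following. Here `test_size=0.25` simply restates the default, and `stratify` is an optional extra (not used in the cell above) that keeps the class proportions similar in both subsets.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()

# test_size=0.25 matches the default behaviour;
# stratify=iris.target is an optional addition for illustration
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25,
    stratify=iris.target, random_state=0)

# 150 observations are divided into 112 for training and 38 for testing
print(X_train.shape, X_test.shape)
```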

### Visualising data

Before we start, it is good practice to take a look at the general form of the data to identify any inconsistencies or errors.

We can use the `scatter_matrix` from pandas to create a pairwise matrix of scatter plots together with histograms for the individual features.



In [None]:
import pandas as pd
import matplotlib.pyplot as plt

# create dataframe from data in X_train
# label the columns using the strings in iris.feature_names
iris_dataframe = pd.DataFrame(X_train, columns=iris.feature_names)

# create a scatter matrix from the dataframe, color by y_train
pd.plotting.scatter_matrix(iris_dataframe, c=y_train, figsize=(15, 15),
                           marker='o', hist_kwds={'bins': 20}, s=60,
                           alpha=.8)

plt.show()

### [Real-world datasets](https://scikit-learn.org/stable/datasets/real_world.html)

In addition to the toy data, scikit-learn has loader functions for some commonly used larger sets. For example, the [Olivetti faces](https://scikit-learn.org/stable/datasets/real_world.html#the-olivetti-faces-dataset) dataset.


In [None]:
from sklearn.datasets import fetch_olivetti_faces

olivetti = fetch_olivetti_faces()


In [None]:
olivetti.keys()

In [None]:
olivetti.target

In [None]:
olivetti.data[0]

In [None]:
olivetti.images[0]

In [None]:
plt.imshow(olivetti.images[0])
plt.show()

### [Data generators](https://scikit-learn.org/stable/datasets/sample_generators.html)

Sometimes we want to generate synthetic data to test clustering and regression methods. scikit-learn provides a number of helpful functions to do this for us.

For example, we can create a classification dataset consisting of a number of blobs:


In [None]:
from sklearn.datasets import make_blobs

# 100 data points, 2 features, 3 blobs
blobs_X, blobs_y = make_blobs(n_samples=100, n_features=2, centers=3)

# split off a test dataset
X_train, X_test, y_train, y_test = train_test_split(blobs_X, blobs_y, random_state=0)

In [None]:
plt.scatter(X_train[:, 0], X_train[:, 1], c=y_train)
plt.show()

### [External datasets](https://scikit-learn.org/stable/datasets/loading_other_datasets.html#loading-from-external-datasets)

There are of course many other sources of data that we might want to use.

scikit-learn provides a direct interface to the [OpenML](https://www.openml.org/home) repository, so it is very easy to make use of these datasets in your work. See [here](https://scikit-learn.org/stable/datasets/loading_other_datasets.html#downloading-datasets-from-the-openml-org-repository) for more details.



In [None]:
from sklearn.datasets import fetch_openml

mice = fetch_openml(name='miceprotein', version=4)

mice.keys()

To load data from a CSV file, we can use the pandas `read_csv` function:

In [None]:
codon = pd.read_csv("codon_usage.csv")


In [None]:
codon.head()

In [None]:
# Extract a DataFrame for the features
codon_X = codon.iloc[:,5:]

# Extract a Series for the target
codon_y = codon.iloc[:,0]

# We can now work in scikit-learn with these pandas objects 
# in exactly the same way as for the numpy arrays
X_train, X_test, y_train, y_test = train_test_split(codon_X, codon_y, random_state=0)

### Other data resources

Some other repositories you may find useful:

https://paperswithcode.com/datasets

https://archive.ics.uci.edu/ml/index.php


### Preprocessing

There may be several preprocessing steps that we need to complete before data is ready to use in our chosen method. Below are a few processes that are commonly applied.

Importantly, note that the parameters of any manipulation of the data (imputation, standardisation etc.) should be calculated on the *training* data only, although the same transformation then needs to be applied to both the training and testing data. If we preprocess before the train/test split, then information from the testing data leaks into the training process, which is what we want to avoid.

#### [Encoding categorical features](https://scikit-learn.org/stable/modules/preprocessing.html#encoding-categorical-features)

Categorical datatypes need to be transformed to numbers before scikit-learn can use them as features.

In some datasets, this has already been done by assigning an integer value to each label. This is called *ordinal encoding*.

Although this is fine for binary attributes and (possibly) features with a natural ordering, it may produce poor performance when the datatype is nominal, i.e. with no meaningful ordering for the labels.

A better solution is to use *one-hot encoding* to replace a single integer feature with multiple binary features.
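A minimal sketch of one-hot encoding with scikit-learn's `OneHotEncoder` is shown below. The small arrays are hypothetical values made up for illustration; the encoder is fitted on the training values only and then applied to both sets.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical nominal feature already coded as integers (1, 2, 3)
X_train_cat = np.array([[1], [3], [2], [1]])
X_test_cat = np.array([[2], [3]])

# Learn the set of categories from the training data only;
# handle_unknown="ignore" encodes unseen test categories as all zeros
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(X_train_cat)

# transform returns a sparse matrix by default; toarray() makes it dense
train_encoded = encoder.transform(X_train_cat).toarray()
test_encoded = encoder.transform(X_test_cat).toarray()

print(encoder.categories_)  # the categories found in the training data
print(test_encoded)         # one binary column per category
```

Each original integer is replaced by three binary columns, exactly one of which is set to 1 for each row.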


#### [Imputation of missing values](https://scikit-learn.org/stable/modules/impute.html)

Real-world datasets often contain missing values, represented in data as `"?"` or `nan`.

We might choose to ignore (drop) rows that contain missing values. However, this wastes the rest of the information in that row.

It may be more desirable to insert a guess in place of the missing values. scikit-learn provides both simple methods (such as `SimpleImputer`) and more complex ones (such as `KNNImputer` and `IterativeImputer`) to do this.
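As a simple illustration, the sketch below uses `SimpleImputer` to replace each missing value with the mean of its column. The arrays are hypothetical, and the mean is just one possible strategy; the column means are learned from the training data and reused for the testing data.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical numeric data with missing entries marked as nan
X_train_num = np.array([[1.0, 10.0],
                        [np.nan, 20.0],
                        [3.0, np.nan]])
X_test_num = np.array([[np.nan, np.nan]])

# Learn the column means from the training data only
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
imputer.fit(X_train_num)

# Apply the same fill values to both training and testing data
train_imputed = imputer.transform(X_train_num)
test_imputed = imputer.transform(X_test_num)

# The missing test values are replaced by the training means, 2.0 and 15.0
print(test_imputed)
```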


#### [Standardisation](https://scikit-learn.org/stable/modules/preprocessing.html#standardization-or-mean-removal-and-variance-scaling)

Once we have encoded categorical features and imputed missing values, it may be necessary to centre the data and rescale it so that all features are comparable in some way (for example, have zero mean and equal variance). This is a requirement for some machine learning methods to operate correctly.
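One common choice is `StandardScaler`, which subtracts the mean and divides by the standard deviation of each feature. The sketch below applies it to the iris data, assuming the same train/test split used earlier; other scalers may suit other data better.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=0)

# Learn the per-feature mean and standard deviation from the training data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

# Reuse the training statistics to transform the testing data
X_test_scaled = scaler.transform(X_test)

# Training features now have (approximately) zero mean and unit variance
print(X_train_scaled.mean(axis=0).round(2))
print(X_train_scaled.std(axis=0).round(2))
```

The scaled testing data will generally not have exactly zero mean and unit variance, because it is transformed using statistics from the training data.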
 

### Exercise 

Download the `autoMpg` dataset from OpenML:

In [None]:
from sklearn.datasets import fetch_openml
mpg = fetch_openml(name='autoMpg', version=1)

See https://www.openml.org/d/196 for a description of the data.

The `origin` feature is a nominal value coded as an integer.
Use one-hot encoding to turn this column into multiple binary features.

The `horsepower` feature has 6 missing values. Can you impute them?

Finally, standardise the dataset using a method of your choosing.