# Creating metafeatures

## Goals

* Explore visualizations of the UCI wine dataset through combinations of features
* Identify a combination of features which separates cultivars

In [None]:
!pip install --user scprep

## 1. Loading [the UCI wine dataset](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_wine.html)

You've done this before.

In [None]:
import scprep
import numpy as np
import pandas as pd
%matplotlib inline
from sklearn import datasets, preprocessing

#### Load data

We'll load the data into a `pandas` DataFrame as we did last time.

In [None]:
wine = datasets.load_wine()

# Access the numerical data from the wine Bunch
data = wine['data']

# Load data about the rows and columns
feature_names = wine['feature_names']

# Load cultivar information about each wine
cultivars = np.array(['Cultivar{}'.format(cl) for cl in wine['target']])

# Create nice names for each row
wine_names = np.array(['Wine{}'.format(i) for i in range(data.shape[0])])

# Gather all of this information into a DataFrame
data = pd.DataFrame(data, columns=feature_names, index=wine_names)

# Show the first 5 rows of the data, equivalent to data[:5]
data.head()

## 2. Visualizing data in 2D

Next, you'll pick a cultivar and try to find features that separate that cultivar from the others. First, just try plotting two variables.

#### Select a cultivar

In [None]:
my_cultivar = "Cultivar0" # alternative: "Cultivar1", "Cultivar2"

#### Plot two variables

In [None]:
scprep.plot.scatter(x=data['color_intensity'], y=data['hue'],
                    c=cultivars == my_cultivar, legend_title=my_cultivar)
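
The `c=cultivars == my_cultivar` argument relies on NumPy's elementwise comparison: it produces a boolean array with one entry per wine, so the plot is colored by "my cultivar" versus "everything else". Here is a minimal sketch of that behavior, using a short hypothetical `targets` array in place of `wine['target']`:

```python
import numpy as np

# Hypothetical class labels standing in for wine['target']
targets = np.array([0, 0, 1, 2, 1])
cultivars = np.array(['Cultivar{}'.format(cl) for cl in targets])

my_cultivar = "Cultivar0"

# Comparing an array to a string gives one True/False value per wine
is_mine = cultivars == my_cultivar

# True counts as 1, so the sum is the number of wines in the chosen cultivar
n_mine = int(is_mine.sum())
print(is_mine, n_mine)
```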

Play around with the choices of variables and see what works best to separate your cultivar from the others. You can choose from any of `data.columns`.

In [None]:
print(data.columns)

### Exercise 1 - find two variables that best separate your chosen cultivar from the others

In [None]:
# ================
# Fill in the x and y arguments to scprep.plot.scatter
scprep.plot.scatter(x=data[???],
                    y=data[???],
                    c=cultivars == my_cultivar, legend_title=my_cultivar)
# ================

## 3. Creating metafeatures

You'll notice that it was very difficult to get good separation of cultivars using just two variables. We can do better by combining multiple variables, as shown below.

In [None]:
scprep.plot.scatter(x=data['color_intensity'] + 2 * data['malic_acid'], 
                    y=3 * data['hue'] + data['flavanoids'],
                    c=cultivars == my_cultivar, legend_title=my_cultivar)
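
A metafeature like the ones above is a weighted sum of columns, so it can also be written as a dot product between the selected columns and a vector of weights. The sketch below assumes a small made-up DataFrame (the values are illustrative, not taken from the wine data) and shows that both forms give the same result:

```python
import numpy as np
import pandas as pd

# Made-up values for illustration only; not real wine measurements
toy = pd.DataFrame({
    "color_intensity": [5.6, 4.4, 5.7, 7.8],
    "malic_acid": [1.7, 1.8, 2.4, 2.0],
})

# Weights for each column, indexed by column name
weights = pd.Series({"color_intensity": 1.0, "malic_acid": 2.0})

# Explicit weighted sum, as in the plotting cell
meta_explicit = toy["color_intensity"] + 2 * toy["malic_acid"]

# Same metafeature as a dot product; pandas aligns the weights
# with the DataFrame columns by name
meta_dot = toy[weights.index].dot(weights)

print(np.allclose(meta_explicit, meta_dot))
```

Keeping the weights in a single `Series` can make it easier to try many combinations without rewriting the arithmetic each time.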

#### Comparing feature magnitudes

When summing variables, we need to think about their relative ranges. The features in the wine dataset vary drastically in scale, as their means and standard deviations show.

In [None]:
# compute the mean and standard deviation of each feature
# (string names avoid the deprecation warning newer pandas raises for np.mean / np.std)
data.aggregate(['mean', 'std']).round(2)

#### Scaling data

For simplicity, we will first scale (or z-score) each feature to have mean 0 and variance 1. This lets you sum features without worrying about which has larger absolute values. For example, if we summed `'ash'` (on the order of 1-2) and `'proline'` (on the order of 500-1000), the ash values would have little to no effect on the sum.
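
To make the ash/proline example concrete, here is a rough sketch using randomly generated stand-ins for a small-scale and a large-scale feature (the distributions are assumptions chosen to mimic the magnitudes mentioned above, not the real data). It measures how strongly each feature correlates with the sum, before and after z-scoring:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins: one feature near 2, one in the hundreds
small = rng.normal(loc=2.0, scale=0.3, size=200)
large = rng.normal(loc=750.0, scale=300.0, size=200)


def zscore(v):
    """Center to mean 0 and scale to (population) standard deviation 1."""
    return (v - v.mean()) / v.std()


# Raw sum: almost entirely determined by the large-scale feature
raw_sum = small + large
raw_corr_small = np.corrcoef(raw_sum, small)[0, 1]
raw_corr_large = np.corrcoef(raw_sum, large)[0, 1]

# Scaled sum: both features contribute equally
scaled_sum = zscore(small) + zscore(large)
scaled_corr_small = np.corrcoef(scaled_sum, small)[0, 1]
scaled_corr_large = np.corrcoef(scaled_sum, large)[0, 1]

print("raw:   ", round(raw_corr_small, 3), round(raw_corr_large, 3))
print("scaled:", round(scaled_corr_small, 3), round(scaled_corr_large, 3))
```

In the raw sum, the correlation with the large-scale feature is essentially 1 and the small-scale feature barely registers; after scaling, the two correlations are equal.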

In [None]:
# use the sklearn StandardScaler to scale to mean 0, variance 1
data_scaled = preprocessing.StandardScaler().fit_transform(data)

# turn the result back into a pandas DataFrame
data_scaled = pd.DataFrame(data_scaled, index=data.index, columns=data.columns)

# compute the mean and standard deviation of each scaled feature
data_scaled.aggregate(['mean', 'std']).round(2)

Much better. Now we can sum features together without worrying about magnitude.
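
For reference, z-scoring is just "subtract the mean, divide by the standard deviation" applied column by column. The sketch below does this by hand on a tiny made-up DataFrame (values are illustrative only). One detail worth knowing: `StandardScaler` divides by the population standard deviation (`ddof=0`), whereas pandas' `.std()` defaults to the sample standard deviation (`ddof=1`), so a pandas-computed standard deviation of scaled data is slightly above 1 before rounding.

```python
import numpy as np
import pandas as pd

# Made-up values for illustration only
toy = pd.DataFrame({
    "ash": [2.1, 2.4, 2.7, 2.3, 2.5],
    "proline": [520.0, 1050.0, 680.0, 1280.0, 735.0],
})

# Manual z-score with the population standard deviation (ddof=0),
# matching the convention StandardScaler uses
toy_scaled = (toy - toy.mean()) / toy.std(ddof=0)

pop_std = toy_scaled.std(ddof=0)   # exactly 1 for every column
sample_std = toy_scaled.std()      # pandas default ddof=1: sqrt(n / (n - 1))

print(toy_scaled.mean().round(2))
print(pop_std.round(2))
print(sample_std.round(3))
```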

In [None]:
scprep.plot.scatter(x=data_scaled['color_intensity'] + data_scaled['malic_acid'], 
                    y=data_scaled['hue'] + data_scaled['flavanoids'],
                    c=cultivars == my_cultivar, legend_title=my_cultivar)

Now it's your turn. Can you find a combination of features that cleanly separates your chosen cultivar from the others?

### Exercise 2 - find two combinations of variables that best separate your chosen cultivar from the others

In [None]:
# ================
# Fill in the x and y arguments to scprep.plot.scatter
x = data_scaled[???] + data_scaled[???] + ...
y = data_scaled[???] + data_scaled[???] + ...
# ================
scprep.plot.scatter(x=x, y=y,
                    c=cultivars == my_cultivar, legend_title=my_cultivar)