# Feature Selection with the D-Wave System
In the [feature selection demo](<<< leap_url >>>/demos/socialnetwork/) you saw ...   

This notebook examines how you can solve optimization problems on a D-Wave quantum processing unit (QPU) with the example of a feature-selection problem.
    
1. [What is Feature Selection?](#What-is-Feature-Selection?) defines and explains the feature-selection problem.
2. [Formulating the Problem](#Formulating-the-Problem) shows how such optimization problems can be formulated for solution on a quantum computer.
3. 
4. [Feature Selection and Mutual Information](#Feature-Selection-and-Mutual-Information) provides more details on the mathematics used by this example.

This notebook should help you understand both the techniques and [Ocean software](https://github.com/dwavesystems) tools used for solving optimization problems on D-Wave quantum computers.

**New to Jupyter Notebooks?** Notebooks are divided into text and code cells. Pressing the **Run** button in the menu bar runs the current cell and moves to the next cell. Code cells are marked by an "In: \[\]" to the left; when run, an asterisk displays until code completion: "In: \[\*\]".

# What is Feature Selection?
Statistical and machine-learning models use a set of input variables (features)
to predict output variables of interest. Feature selection, which can be
part of the model design process, simplifies the model and reduces dimensionality by selecting,
from a given set of potential features, a subset of highly informative ones. 

For example, if Farmer Jones were creating a model for predicting the ripening of her hothouse tomatoes, she might start recording daily the following list of potential features: date, air temperature, degree of cloudiness, hours of daylight, daily water, fertilizer, air humidity, hours of electric light, ambient music style. After a growth season or two, she analyzes correlations between these features and her tomato crops. Her analysis reveals:

* date, cloudiness, daylight have little predictive power
* water and humidity are highly predictive of crop rot; they also follow a very similar trend (the hothouse has a roof sprinkler) 
* fertilizer is highly predictive of fruit size

Farmer Jones understands that her hothouse's electric light makes her crop less dependent on seasons (date) and sunshine (cloudiness). She can simplify her model by disregarding those features. She can also reduce the number of inputs by recording either her water or humidity measurement but not both.

For systems with large amounts of potential input information, such as weather forecasting or facial recognition, the model complexity and required compute resources can be daunting. Feature selection can help make such models tractable. 

However, optimal feature selection itself can be a hard problem. This example introduces a powerful method of optimizing feature selection based on a hard probability calculation. To overcome the difficulties of this calculation, it formulates the problem for solution on a quantum computer.

## Feature Selection by Mutual Information
There are many methods of feature selection. For example, if you are building a deep learning network and have six potential features, you might try training first on each of the features by itself, then on all 15 combinations of subsets of two features, then 20 combinations of subsets of three features, and so on. This naive method quickly becomes impractical for real-world cases because the number of subsets grows exponentially with the number of features. For those cases, statistical methods are much more efficient.
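
As a quick check on the arithmetic above, the following sketch counts the candidate subsets for the six-feature example using only the Python standard library. The variable names are illustrative and not part of the demo.

```python
from math import comb

n_features = 6  # the six potential features from the example above

# Number of candidate subsets of each size k = 1..6.
subsets_per_size = {k: comb(n_features, k) for k in range(1, n_features + 1)}
total_subsets = sum(subsets_per_size.values())

print(subsets_per_size)  # {1: 6, 2: 15, 3: 20, 4: 15, 5: 6, 6: 1}
print(total_subsets)     # 63, that is 2**6 - 1

# The count roughly doubles with every added feature.
print(2**30 - 1)         # 1073741823 non-empty subsets for 30 features
```

Each of those subsets would need its own training run, which is why exhaustive search stops being an option well before the feature count reaches real-world sizes.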

One statistical criterion that can guide this selection is mutual information (MI). Section [Feature Selection and Mutual Information](#Feature-Selection-and-Mutual-Information) describes the concepts and mathematics of this statistical tool in more detail. At a high level, MI quantifies how much one knows about one random variable from observations of another variable.  

Ideally, to select the $k$ most relevant features, you might maximize $I({X_s}; Y)$,
the MI between a set of $k$ features, $X_s$, and the variable of interest, $Y$.
This is a hard calculation because the number of states grows exponentially with $k$.


# Formulating the Problem

This section focuses on formulating a QUBO.



The Mutual Information QUBO (MIQUBO) method of feature selection formulates a quadratic
unconstrained binary optimization (QUBO) based on an approximation for $I({X_s}; Y)$,
which is submitted to the D-Wave quantum computer for solution.

The demo illustrates the MIQUBO method by finding an optimal feature set for predicting
survival of Titanic passengers. It uses records provided in file
formatted_titanic.csv, which is a feature-engineered version of a public database of
passenger information recorded by the ship's crew (in addition to a column showing
survival for each passenger, it contains information on gender, title, class, port
of embarkation, etc.). Its output is a ranking of subsets of features that have
high MI with the variable of interest (survival) and low redundancy.

# A Real-World Problem: Predicting Survival of Titanic Passengers

# Summary

# Feature Selection and Mutual Information
This section explains the math of mutual information.

As described above, to select the :math:`k` most relevant features, you might maximize
:math:`I({X_s}; Y)`, the MI between a set of :math:`k` features, :math:`X_s`, and the
variable of interest, :math:`Y`. Given :math:`N` features out of which you select
:math:`k`, maximize mutual information, I, as

.. math::

    \{X_1, X_2, \ldots, X_k\} = \arg \max I(X_k; Y)

by expanding,

.. math::

    I(X_k;Y) = N^{-1} \sum_i \left\{ I(X_i;Y) + I(X_{k \setminus i};Y|X_i) \right\}

where :math:`X_{k \setminus i}` denotes the selected set with :math:`X_i` removed.
Approximate the second term by assuming conditional independence:

.. math::

    I(X_{k \setminus i};Y|X_i) \approx \sum_{X_j \in X_{k \setminus i}} I(X_j;Y|X_i)

Using the following equations for Shannon entropy,

.. math::

    H(X) = -\sum_x P(x) \log P(x)

    H(X|Y) = H(X,Y)-H(Y)

you can then calculate all these terms as follows:

.. math::

    I(X;Y) = H(X)-H(X|Y)

    I(X;Y|Z) = H(X|Z)-H(X|Y,Z)

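
The entropy identities above translate almost line for line into code. The following is a minimal sketch, assuming discrete features and plug-in (empirical frequency) probability estimates with base-2 logarithms. The function names and the small data set are hypothetical, invented for illustration after the Farmer Jones example; they are not the demo's code or data.

```python
from collections import Counter
from math import log2

def entropy(*columns):
    """Joint Shannon entropy H(columns), in bits, from empirical frequencies."""
    rows = list(zip(*columns))
    n = len(rows)
    return -sum((c / n) * log2(c / n) for c in Counter(rows).values())

def conditional_entropy(x, *given):
    """H(X|given) = H(X, given) - H(given)."""
    return entropy(x, *given) - entropy(*given)

def mutual_information(x, y):
    """I(X;Y) = H(X) - H(X|Y)."""
    return entropy(x) - conditional_entropy(x, y)

def conditional_mutual_information(x, y, z):
    """I(X;Y|Z) = H(X|Z) - H(X|Y,Z)."""
    return conditional_entropy(x, z) - conditional_entropy(x, y, z)

# Hypothetical records: water and humidity move together, music is unrelated.
water    = [0, 0, 1, 1, 0, 0, 1, 1]
humidity = [0, 0, 1, 1, 0, 0, 1, 1]
music    = [0, 1, 0, 1, 0, 1, 0, 1]
rot      = [0, 0, 1, 1, 0, 0, 1, 1]

print(mutual_information(water, rot))                        # 1.0
print(mutual_information(music, rot))                        # 0.0
print(conditional_mutual_information(humidity, rot, water))  # 0.0
```

In this toy data, humidity adds nothing about crop rot once water is known, so its conditional MI is zero. That is the redundancy the conditional terms are meant to capture.
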
The approximated equation for MI can now be formed as a QUBO:

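.. math::

    \{X_1, X_2, \ldots, X_k\} = \arg \max \left\{ \text{MI} - \text{Penalty} \right\}

where the penalty is some multiple of :math:`\left(\sum_{i} x_i - k\right)^2`, which enforces
the constraint of :math:`k` features; each binary variable :math:`x_i` is 1 if feature
:math:`X_i` is selected and 0 otherwise.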
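
To make the penalty concrete, the sketch below builds a small QUBO as a plain Python dictionary and finds its minimum by brute force instead of submitting it to a QPU. Because each selection variable is binary, its square equals itself, so a squared penalty on "number of selected features minus :math:`k`" expands into a linear bias of :math:`1 - 2k` per variable and a coupling of 2 per pair, plus a constant that does not affect the minimizer. All MI values, the symmetric treatment of the conditional terms, the penalty strength, and the names used here are assumptions made for illustration; they do not come from the Titanic data or the demo's code.

```python
from itertools import combinations, product

# Hypothetical MI values, invented for illustration only.
features = ["water", "humidity", "fertilizer", "music"]
relevance = {"water": 0.60, "humidity": 0.58, "fertilizer": 0.50, "music": 0.01}
# Hypothetical symmetric stand-ins for the conditional MI terms I(X_j;Y|X_i);
# the redundant pair water/humidity gets a small value.
conditional = {
    ("water", "humidity"): 0.02,
    ("water", "fertilizer"): 0.45,
    ("water", "music"): 0.01,
    ("humidity", "fertilizer"): 0.44,
    ("humidity", "music"): 0.01,
    ("fertilizer", "music"): 0.01,
}
k = 2           # number of features to select
strength = 2.0  # hypothetical penalty multiplier, larger than any MI gain

# QUBO for minimization: negated MI terms plus the expanded penalty.
# The constant strength * k**2 is dropped; it does not change the minimizer.
Q = {}
for f in features:
    Q[(f, f)] = -relevance[f] + strength * (1 - 2 * k)
for u, v in combinations(features, 2):
    Q[(u, v)] = -conditional[(u, v)] + 2 * strength

def energy(sample):
    """QUBO energy of one assignment of 0/1 values to the features."""
    return sum(bias * sample[u] * sample[v] for (u, v), bias in Q.items())

# Brute force over all 2**4 assignments stands in for the quantum sampler.
samples = [dict(zip(features, bits)) for bits in product((0, 1), repeat=len(features))]
best = min(samples, key=energy)
selected = [f for f in features if best[f]]
print(selected)  # ['water', 'fertilizer']
```

With these made-up numbers the lowest-energy assignment selects exactly two features and skips the redundant water/humidity pairing. Brute force is only viable at this toy size; the point of the QUBO form is that the same dictionary of biases is what a sampler would minimize.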