# Boolean operations, Boolean masks, and Boolean combinations

In this exercise set, we load the forest covertype data. (Note: this takes a little while because it downloads a large dataset from the web.) This is a multiclass dataset covering 7 different forest covertypes (stored in the `target` attribute). There are 581,012 forest plots with 54 attributes each (stored in a 581012 x 54 array). The first ten attributes are numerical; the last 44 are Boolean (true/false) attributes. Each Boolean attribute represents a qualitative attribute of the plot which is either present or absent (per the UCI description, 4 wilderness-area indicators followed by 40 soil-type indicators). We will refer to all the attributes by their column index. Thus the first attribute is attribute 0 and the last (a Boolean attribute) is attribute 53.

In [2]:
from sklearn.datasets import fetch_covtype

# Download (on first use) and load the forest covertype dataset
data = fetch_covtype()
print(data.data.shape)    # data.data is the 581012 x 54 array
print(data.target.shape)  # data.target contains the class for each instance

Downloading https://ndownloader.figshare.com/files/5976039


(581012, 54)
(581012,)


[The Boolean arrays and masks notebook](https://github.com/gawron/python-for-social-science/blob/master/numpy/02_06_Boolean_Arrays_and_Masks.ipynb) discusses combining Boolean arrays with the Boolean operators `&` (conceptually 'and'), `|` (conceptually 'or'), and `~` (conceptually 'not'). Study the examples there, especially the examples that use the Seattle rainfall data. Pay special attention to the use of parentheses, because using them correctly matters in solving the following problems.
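
As a reminder of why the parentheses matter, here is a minimal sketch on two made-up arrays (`x` and `y` are illustrative names, not part of the covertype data). In Python, `&` and `|` bind more tightly than comparisons such as `>` and `==`, so leaving out the parentheses changes how the expression is parsed.

```python
import numpy as np

# Toy arrays (assumed for illustration only; not the covertype data)
x = np.array([1, 4, 7, 10])
y = np.array([0, 5, 5, 0])

# Each comparison is wrapped in parentheses before it is combined
both = (x > 3) & (y == 5)          # elementwise 'and'
either = (x > 3) | (y == 5)        # elementwise 'or'
neither = ~((x > 3) | (y == 5))    # elementwise 'not' of the 'or'

# Without parentheses, Python parses this as the chained comparison
# x > (3 & y) == 5, which needs the truth value of a whole array
# and therefore raises a ValueError.
try:
    bad = x > 3 & y == 5
    raised = False
except ValueError:
    raised = True

print(both, either, neither, raised)
```

The same pattern applies to any number of conditions: parenthesize each comparison, then combine the parenthesized pieces.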

Each of the following problems concerns finding a particular set of rows in the covertype dataset. For each set, do two things:
 
a. Construct a single Python expression which counts the number of rows in the set.
b. Construct a single Python expression which returns all the rows of the covertype dataset that are in the set.  (Note: not just the Boolean array, but the rows you get when you use the Boolean array as a mask.)
 
 1. The rows in the covertype dataset which have a value greater than 300 for attribute 1.
 2. The rows which have attribute 12 (Note: this is a qualitative Boolean attribute, so a row has it when its value is 1).
 3. The entire covertype dataset excluding rows which belong either to covertype 3 or covertype 5. (Note: you will have to use two arrays, `data.data` and `data.target`.)
 4. The rows which have attribute 12 but do not have either covertype 3 or covertype 5.
 5. The rows which are not in covertype 3 or 5, have a value greater than 300 for attribute 1, and have attribute 12.

The `sklearn` description of the covertype dataset is printed below. For a fuller understanding of the attributes in the dataset you can look at [the original UCI data set description](https://archive.ics.uci.edu/ml/datasets/Covertype), but that is not necessary to solve the problems.

The `data` object loaded above is a wrapper which contains three attributes.

In [3]:
dir(data)

['DESCR', 'data', 'target']

In [4]:
print(data.DESCR)

.. _covtype_dataset:

Forest covertypes
-----------------

The samples in this dataset correspond to 30×30m patches of forest in the US,
collected for the task of predicting each patch's cover type,
i.e. the dominant species of tree.
There are seven covertypes, making this a multiclass classification problem.
Each sample has 54 features, described on the
`dataset's homepage <https://archive.ics.uci.edu/ml/datasets/Covertype>`__.
Some of the features are boolean indicators,
while others are discrete or continuous measurements.

**Data Set Characteristics:**

    Classes                        7
    Samples total             581012
    Dimensionality                54
    Features                     int

:func:`sklearn.datasets.fetch_covtype` will load the covertype dataset;
it returns a dictionary-like object
with the feature matrix in the ``data`` member
and the target values in ``target``.
The dataset will be downloaded from the web if necessary.



Example:  `data.data`  is a `numpy` array containing the attributes for the 581,012 samples.  


In [14]:
print(type(data.data))
print(data.data.shape)

<class 'numpy.ndarray'>
(581012, 54)



To compute the Boolean array that identifies the rows in which attribute 0 is greater than 3000, you do:

In [13]:
WW = data.data[:,0] > 3000
print(WW)
print(WW.shape, data.data.shape)

[False False False ... False False False]
(581012,) (581012, 54)


Note that the Boolean array has the same number of rows as the entire dataset. To find the rows that satisfy this condition, you use this Boolean array as a **mask**. That is, you do:

In [12]:
print(data.data[WW,:])
data.data[WW,:].shape

[[3008.   45.   14. ...    0.    0.    0.]
 [3073.  173.   12. ...    0.    0.    0.]
 [3067.  164.   11. ...    0.    0.    0.]
 ...
 [3125.  127.    5. ...    0.    0.    0.]
 [3126.  120.    4. ...    0.    0.    0.]
 [3124.  115.    5. ...    0.    0.    0.]]


(286266, 54)

Note that the new array contains just the rows that satisfy condition `WW`, so it is smaller.
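
The two tasks in each problem (count the rows, return the rows) can be sketched on a small made-up array. The array `toy` and its values are assumptions for illustration only. Because `True` counts as 1, counting the `True` entries of a mask gives the same number as the row count of the masked array.

```python
import numpy as np

# Toy stand-in for data.data: 5 rows, 3 columns (values are made up)
toy = np.array([[10, 1, 0],
                [25, 0, 1],
                [31, 1, 1],
                [ 8, 0, 0],
                [40, 1, 0]])

mask = toy[:, 0] > 20              # one Boolean per row

n_true = np.count_nonzero(mask)    # number of True entries in the mask
n_true_alt = mask.sum()            # same count, since True counts as 1

selected = toy[mask]               # equivalent to toy[mask, :]

print(n_true, n_true_alt, selected.shape)
```

Either counting expression works; `np.count_nonzero` and the array's own `.sum()` method both operate on the whole array at once.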

The covertypes (the dominant tree species) for each forest plot are in `data.target`.  The `target` attribute is the attribute commonly used to store the classes in an `sklearn` classification dataset. There are seven classes.

In [17]:
print(data.target.shape)
print(set(data.target))

(581012,)
{1, 2, 3, 4, 5, 6, 7}
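
Because the classes and the attributes live in separate arrays with the same number of rows, a Boolean array computed from one can be used as a mask on the other. The following is a minimal sketch with made-up arrays; `features` and `labels` are illustrative stand-ins for `data.data` and `data.target`, and the class numbers are chosen arbitrarily. It also shows `np.isin` as one possible alternative to chaining `==` tests with `|`.

```python
import numpy as np

# Toy stand-ins (assumed for illustration): 5 rows of features, 5 labels
features = np.array([[1.0, 0.0],
                     [2.0, 1.0],
                     [3.0, 1.0],
                     [4.0, 0.0],
                     [5.0, 1.0]])
labels = np.array([1, 2, 3, 2, 1])

# Mask computed from labels, applied to the rows of features
not_class_2 = labels != 2
kept = features[not_class_2]

# Excluding several classes at once: np.isin marks rows whose label
# is in the given list, and ~ inverts that mask
drop = np.isin(labels, [2, 3])
kept_multi = features[~drop]

print(kept.shape, kept_multi.shape)
```

This works only because row `i` of `features` and entry `i` of `labels` describe the same sample.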


Place your answers to questions 1a and 1b in the next cell.

Place your answers to questions 2a and 2b in the next cell.

Place your answers to questions 3a and 3b in the next cell.

Place your answers to questions 4a and 4b in the next cell.

Place your answers to questions 5a and 5b in the next cell.

In [16]:
# Part a. Count the members of the set (each True counts as 1)
print(sum((data.data[:,1] > 300) & ~((data.target == 3) | (data.target == 5)) & (data.data[:,12] == 1)))
# Part b. Return the rows in the set
# Adding an explicit column slice also works:
#subdata = data.data[(data.data[:,1] > 300) & ~((data.target == 3) | (data.target == 5)) & (data.data[:,12] == 1),:]
subdata = data.data[(data.data[:,1] > 300) & ~((data.target == 3) | (data.target == 5)) & (data.data[:,12] == 1)]
subdata.shape

43815


(43815, 54)
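
The exercise asks for a single expression, but when checking an answer it can help to build the same mask from named pieces. The sketch below does this on made-up arrays; all names and values are assumptions for illustration, not the covertype data. It combines three conditions in the same shape as the answer above: two that must hold and one that must not.

```python
import numpy as np

# Toy stand-ins (assumed for illustration only)
features = np.array([[5, 1],
                     [9, 1],
                     [9, 0],
                     [9, 1]])
labels = np.array([1, 1, 1, 2])

big = features[:, 0] > 6         # numerical condition on column 0
flagged = features[:, 1] == 1    # Boolean attribute in column 1 is present
excluded = labels == 2           # rows belonging to the unwanted class

# ~ binds more tightly than &, so ~excluded needs no extra parentheses
combined = big & flagged & ~excluded

count = int(combined.sum())      # part a: how many rows are in the set
rows = features[combined]        # part b: the rows themselves

print(count, rows)
```

Each named mask can be inspected on its own, which makes it easier to see which condition removes which rows.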