In [1]:
import numpy as np
import sklearn.datasets
import sklearn.linear_model
import sklearn.metrics

In [2]:
np.random.seed(42)  # do not change for reproducibility

In this test we use the [Covertype Data Set](http://archive.ics.uci.edu/ml/datasets/Covertype), which describes cartographic
features of areas of land in the USA together with their forest cover type according to the US Forest Service.
There are seven classes (1-7), 581012 samples and 54 features.
For this test, we're only interested in cover type 3.




In [3]:
dataset = sklearn.datasets.fetch_covtype()

Downloading https://ndownloader.figshare.com/files/5976039


In [4]:
# only use a random subset for speed - pretend the rest of the data doesn't exist
# (np.random.choice samples with replacement by default, so some rows may repeat)
random_sample = np.random.choice(len(dataset.data), len(dataset.data) // 10)

COVER_TYPE = 3
features = dataset.data[random_sample, :]
target = dataset.target[random_sample] == COVER_TYPE

A junior colleague tells you that they're getting 96% accuracy using logistic regression. You review their work and see the following:

In [15]:
classifier = sklearn.linear_model.LogisticRegression(solver='liblinear')
classifier.fit(features, target)
training_predictions = classifier.predict(features)
# scikit-learn metrics take (y_true, y_pred); accuracy is symmetric, so the value is unchanged
accuracy = sklearn.metrics.accuracy_score(target, training_predictions)

In [17]:
print(f'Accuracy: {accuracy:.3f}')

Accuracy: 0.959


**Question 1**

Evaluate the accuracy more thoroughly. Do not modify the parameters of the model. Use the classifier object.

In [29]:
# scikit-learn metrics take (y_true, y_pred): rows are true classes, columns are predictions
sklearn.metrics.confusion_matrix(target, training_predictions)

array([[53713,   847],
       [ 1540,  2001]])

In [30]:
sklearn.metrics.precision_score(target, training_predictions)

0.7025983146067416

In [34]:
sklearn.metrics.recall_score(target, training_predictions)

0.5650946060434905

In [36]:
sklearn.metrics.f1_score(target, training_predictions)

0.6263891062764125
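
All of the numbers above are computed on the same rows the classifier was fitted on, so they say little about unseen data. The following is a minimal sketch of a held-out evaluation with the same model and parameters. It assumes a synthetic, imbalanced stand-in dataset from `make_classification` so that it runs without downloading Covertype; the sample size, class weight and 25% test split are illustrative choices, not values from this notebook.

```python
import sklearn.datasets
import sklearn.linear_model
import sklearn.metrics
import sklearn.model_selection

# Synthetic stand-in for the Covertype subset (assumption): roughly 6% positives.
X, y = sklearn.datasets.make_classification(
    n_samples=5000, n_features=20, weights=[0.94], random_state=42
)

# Hold out 25% of the rows; stratify so both splits keep the class ratio.
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Same estimator and parameters as above, fitted on the training split only.
classifier = sklearn.linear_model.LogisticRegression(solver='liblinear')
classifier.fit(X_train, y_train)
test_predictions = classifier.predict(X_test)

# Metrics on rows the model has never seen; arguments are (y_true, y_pred).
test_confusion = sklearn.metrics.confusion_matrix(y_test, test_predictions)
print(test_confusion)
print(sklearn.metrics.classification_report(y_test, test_predictions, digits=3))
```

With the real data, `features` and `target` would take the place of `X` and `y`.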

**Question 2**

Is accuracy the most suitable metric for this problem?

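No. The classes are imbalanced, with many more negative cases than positive ones: in the confusion matrix above only 3541 of the 58101 samples (about 6%) are cover type 3, so a model that always predicts the negative class would already reach about 94% accuracy. The precision-recall curve (or a summary of it, such as average precision) would be a more appropriate metric: it accounts both for returning accurate results (precision) and for returning the majority of all positive results (recall), and it is more sensitive to the positive class.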
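
As a rough illustration of the precision-recall view, the sketch below computes the curve and its average-precision summary from the classifier's continuous scores rather than its hard predictions. The data is a synthetic imbalanced stand-in (an assumption, chosen so the example runs offline), and the split sizes are illustrative.

```python
import sklearn.datasets
import sklearn.linear_model
import sklearn.metrics
import sklearn.model_selection

# Synthetic imbalanced data standing in for the Covertype subset (assumption).
X, y = sklearn.datasets.make_classification(
    n_samples=5000, n_features=20, weights=[0.94], random_state=42
)
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

classifier = sklearn.linear_model.LogisticRegression(solver='liblinear')
classifier.fit(X_train, y_train)

# The curve needs a continuous score per sample, not a True/False prediction.
scores = classifier.decision_function(X_test)

precision, recall, thresholds = sklearn.metrics.precision_recall_curve(y_test, scores)
average_precision = sklearn.metrics.average_precision_score(y_test, scores)

# A classifier that scores at random has average precision near the positive rate.
positive_rate = y_test.mean()
print(f'Average precision: {average_precision:.3f} (positive rate: {positive_rate:.3f})')
```

Each entry of `thresholds` corresponds to one precision/recall pair, so the arrays can also be used to pick an operating point instead of the default decision threshold.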

**Question 3**

Should you get more training data?

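The classes are imbalanced, so the most useful additional data would be more samples of the under-represented class. If none can be collected, we can re-sample what we have: repeat observations from the under-represented class (over-sampling) or, alternatively, remove some observations from the over-represented class (under-sampling). Under-sampling is probably not the best option, since our dataset is relatively small overall.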
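
Random over-sampling can be sketched with `sklearn.utils.resample`, as below. The arrays are small random placeholders (an assumption, not the Covertype data), and balancing the classes exactly 1:1 is just one possible target ratio.

```python
import numpy as np
import sklearn.utils

# Placeholder data (assumption): 1000 rows, 5 features, roughly 6% positives.
rng = np.random.RandomState(42)
features = rng.normal(size=(1000, 5))
target = rng.rand(1000) < 0.06

minority = features[target]
majority = features[~target]

# Over-sampling: draw minority rows with replacement until they match the majority count.
minority_upsampled = sklearn.utils.resample(
    minority, replace=True, n_samples=len(majority), random_state=42
)

# Stack both classes back together with matching boolean labels.
balanced_features = np.vstack([majority, minority_upsampled])
balanced_target = np.concatenate([
    np.zeros(len(majority), dtype=bool),
    np.ones(len(minority_upsampled), dtype=bool),
])
print(balanced_features.shape, balanced_target.mean())
```

Re-sampling should be applied to the training split only; evaluating on re-sampled rows would put copies of training observations into the test set.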

**Question 4**

How would you decide which features to include in the deployed model?

One option is univariate feature selection: score each feature individually against the target variable and keep those with the strongest relationship. The scikit-learn library provides `SelectKBest()`, which could be used in this case to select the $k$ best features.
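
A minimal sketch of univariate selection with `SelectKBest` follows. The synthetic data, the ANOVA F-test scorer `f_classif` and `k=5` are all illustrative assumptions; in practice $k$ would be tuned, and the selector fitted on training data only.

```python
import sklearn.datasets
import sklearn.feature_selection

# Synthetic data (assumption): 20 features, of which 5 are informative.
X, y = sklearn.datasets.make_classification(
    n_samples=2000, n_features=20, n_informative=5, random_state=42
)

# Score every feature against the target with an ANOVA F-test and keep the top k.
selector = sklearn.feature_selection.SelectKBest(
    score_func=sklearn.feature_selection.f_classif, k=5
)
X_selected = selector.fit_transform(X, y)

# Column indices of the retained features, and the score assigned to each feature.
selected_indices = selector.get_support(indices=True)
print(selected_indices)
print(selector.scores_.round(1))
```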