# Classification

Up to this point you've been using data to make real-valued predictions such as home prices. This type of modeling is called **regression**, hence the "Regressor" part of `RandomForestRegressor`. In general, regression means you are trying to predict real numbers like prices, failure rates, energy consumption, etc.

Another common problem you'll see is making a choice between mutually exclusive outcomes. For example, spam detection is predicting if an email is "spam" or "not spam" based on the body text. Another common problem is identifying handwritten digits from images, choosing one of ten digits for any given image. This type of modeling is called **classification** because we are trying to choose one option from some set of classes.

Classification is typically broken down further into binary and multiclass classification. Binary classification means there are two classes or conditions ("spam" or "not spam") while multiclass has more than two classes (ten digits). In general there are different approaches to the two types of classification, but most multiclass models will also work for binary problems.

It's straightforward to build classification models using what you already know about scikit-learn. Instead of `RandomForestRegressor` for regression, we can use `RandomForestClassifier` ([documentation here](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn-ensemble-randomforestclassifier)). 

As an example of classification with `RandomForestClassifier`, I'll use [this dataset](https://www.kaggle.com/iabhishekofficial/mobile-price-classification#train.csv) for predicting price ranges for phones. The targets in the data have values:

 * 0 (low cost)
 * 1 (medium cost)
 * 2 (high cost)
 * 3 (very high cost)
 
For features, we have things like

* battery_power: total energy the battery can store at one time, measured in mAh
* blue: has Bluetooth or not
* clock_speed: speed at which the microprocessor executes instructions
* dual_sim: has dual SIM support or not
* fc: front camera megapixels
* four_g: has 4G or not
* ....

In [2]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
import sklearn.metrics as metrics

In [3]:
#TODO: Change this to the appropriate path for the kernel
data = pd.read_csv('/Users/mat/Projects/Kaggle/classification/mobile-price-classification/train.csv')

In [4]:
data.head()

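| | battery_power | blue | clock_speed | dual_sim | fc | four_g | int_memory | m_dep | mobile_wt | n_cores | ... | px_height | px_width | ram | sc_h | sc_w | talk_time | three_g | touch_screen | wifi | price_range |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 842 | 0 | 2.2 | 0 | 1 | 0 | 7 | 0.6 | 188 | 2 | ... | 20 | 756 | 2549 | 9 | 7 | 19 | 0 | 0 | 1 | 1 |
| 1 | 1021 | 1 | 0.5 | 1 | 0 | 1 | 53 | 0.7 | 136 | 3 | ... | 905 | 1988 | 2631 | 17 | 3 | 7 | 1 | 1 | 0 | 2 |
| 2 | 563 | 1 | 0.5 | 1 | 2 | 1 | 41 | 0.9 | 145 | 5 | ... | 1263 | 1716 | 2603 | 11 | 2 | 9 | 1 | 1 | 0 | 2 |
| 3 | 615 | 1 | 2.5 | 0 | 0 | 0 | 10 | 0.8 | 131 | 6 | ... | 1216 | 1786 | 2769 | 16 | 8 | 11 | 1 | 0 | 0 | 2 |
| 4 | 1821 | 1 | 1.2 | 0 | 13 | 1 | 44 | 0.6 | 141 | 2 | ... | 1208 | 1212 | 1411 | 8 | 2 | 15 | 1 | 1 | 0 | 1 |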


In [5]:
data.columns

Index(['battery_power', 'blue', 'clock_speed', 'dual_sim', 'fc', 'four_g',
       'int_memory', 'm_dep', 'mobile_wt', 'n_cores', 'pc', 'px_height',
       'px_width', 'ram', 'sc_h', 'sc_w', 'talk_time', 'three_g',
       'touch_screen', 'wifi', 'price_range'],
      dtype='object')

We set up our features and targets and split them into training and validation sets with `train_test_split`, the same as before. Creating and fitting the model is the same as well, except we're using `RandomForestClassifier` instead of `RandomForestRegressor`.

In [6]:
# Set variables for the targets and features
y = data['price_range']
X = data.drop('price_range', axis=1)

# Split the data into training and validation sets
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=7)

In [7]:
# Create the classifier and fit it to our training data
model = RandomForestClassifier(random_state=7, n_estimators=100)
model.fit(train_X, train_y)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=None,
            oob_score=False, random_state=7, verbose=0, warm_start=False)

The simplest metric for classification models is the **accuracy**, the fraction of all predictions that are correct. Scikit-learn provides `metrics.accuracy_score` to calculate this.

In [8]:
# Predict classes given the validation features
pred_y = model.predict(val_X)

# Calculate the accuracy as our performance metric
accuracy = metrics.accuracy_score(val_y, pred_y)
print(f"Accuracy: {accuracy:.3f}")

Accuracy: 0.864
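
To make the definition concrete, here is a minimal sketch that computes accuracy by hand. The `true_labels` and `predicted_labels` lists are made up for illustration and are not taken from the phone data. On the same inputs, `metrics.accuracy_score` should return the same value.

```python
# Hypothetical true and predicted price ranges for eight phones
true_labels = [0, 1, 2, 3, 1, 2, 0, 3]
predicted_labels = [0, 1, 1, 3, 1, 2, 0, 2]

# Count the predictions that match the true label
n_correct = sum(t == p for t, p in zip(true_labels, predicted_labels))

# Accuracy is the fraction of predictions that are correct
accuracy = n_correct / len(true_labels)
print(f"Accuracy: {accuracy:.3f}")
```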


## Confusion Matrix

Our model did pretty well, correctly predicting around 86% of the price ranges in the validation data. It's often useful to look at where the model is failing with a **confusion matrix**, which counts how the model classified the examples of each true class.

In [9]:
# Calculate the confusion matrix itself
confusion = metrics.confusion_matrix(val_y, pred_y)
print(f"Confusion matrix:\n{confusion}")

# Normalizing by the true label counts to get rates
norm_confusion = confusion.astype('float') / confusion.sum(axis=1)[:, None]
print(f"\nNormalized confusion matrix:\n{norm_confusion}")

Confusion matrix:
[[130   6   0   0]
 [  4  91  15   0]
 [  0  21  98  12]
 [  0   0  10 113]]

Normalized confusion matrix:
[[0.95588235 0.04411765 0.         0.        ]
 [0.03636364 0.82727273 0.13636364 0.        ]
 [0.         0.16030534 0.7480916  0.09160305]
 [0.         0.         0.08130081 0.91869919]]



The rows of the confusion matrix are the true classes and the columns are the predicted classes. The diagonal tells us how many examples of each class the model predicted correctly. The off-diagonal entries show where the model is making wrong predictions, where it is "confused." For example, looking at the second row and first column, we classified four phones that were actually medium cost as low cost. For classes 0 and 3, the low cost and very high cost phones, our model works really well, classifying more than 90% of them correctly. However, our model is weaker for medium and high cost phones, at about 83% and 75%. Note that incorrect predictions occur only between adjacent classes. The model never confuses low cost and very high cost phones, for example.
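
As a rough sketch of how to read these numbers programmatically, the snippet below copies the confusion matrix printed above into a plain list of lists and indexes it as `[true class][predicted class]`. The variable names are illustrative only.

```python
# Confusion matrix copied from the output above:
# rows are true classes, columns are predicted classes
confusion = [
    [130, 6, 0, 0],
    [4, 91, 15, 0],
    [0, 21, 98, 12],
    [0, 0, 10, 113],
]
n_classes = len(confusion)

# Entry [true][predicted]: medium cost (1) phones predicted as low cost (0)
medium_as_low = confusion[1][0]

# Accuracy is the diagonal total divided by the number of examples
n_correct = sum(confusion[i][i] for i in range(n_classes))
n_total = sum(sum(row) for row in confusion)
accuracy = n_correct / n_total

# Largest distance between true and predicted class among the errors
max_gap = max(
    abs(i - j)
    for i in range(n_classes)
    for j in range(n_classes)
    if i != j and confusion[i][j] > 0
)

print(f"Medium cost phones predicted as low cost: {medium_as_low}")
print(f"Accuracy from the matrix: {accuracy:.3f}")
print(f"Largest class gap among errors: {max_gap}")
```

The accuracy recovered from the diagonal matches the value from `metrics.accuracy_score`, and a largest gap of 1 confirms that every error lands in a neighboring price range.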

The diagonal of the normalized confusion matrix gives us the **true positive rate** for each of the classes. For a given class, this is the fraction of validation examples truly in that class that the model also assigned to it. This value is important because the accuracy metric can sometimes hide information you care about. Occasionally you'll have imbalanced classes, where you have many more examples of one class than another. For example, when you're trying to diagnose a disease with a machine learning model, most examples in your data typically won't have that disease. Your model could be very good at identifying when people don't have the disease (negative cases), but poor at identifying positive cases.

To illustrate this, assume negative cases make up 90% of your data and the classifier identifies them with 95% accuracy. Positive cases make up the other 10% of the data, but your classifier is only 25% accurate on them. The overall accuracy would then be $90\% \times 95\% + 10\% \times 25\% = 88\%$. The accuracy looks high, and if that were the only metric you looked at, you might think your classifier is doing pretty well. However, the goal is to identify when people have this disease, so 25% accuracy there is clearly not sufficient.
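
The arithmetic can be checked with a few lines of Python. This is only a sketch of the weighted average described above, and the variable names are illustrative.

```python
# Share of the data in each group, from the example above
negative_share = 0.90
positive_share = 0.10

# Fraction of each group the classifier gets right
negative_accuracy = 0.95
positive_accuracy = 0.25

# Overall accuracy is the share-weighted average of the per-group accuracies
overall_accuracy = (negative_share * negative_accuracy
                    + positive_share * positive_accuracy)
print(f"Overall accuracy: {overall_accuracy:.1%}")
```

For comparison, a model that always predicts "negative" would score 90% accuracy on this data while finding none of the positive cases, which is why accuracy alone can be misleading here.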

You'll often see the true positive rate called the **recall**, which you can calculate using `metrics.recall_score`. For binary classification you'll get one value by default. For multiclass classification, pass `average=None` to get a value for each of the classes. For more information about recall and associated metrics, check out [the Wikipedia page](https://en.wikipedia.org/wiki/Precision_and_recall) and [the scikit-learn tutorial](https://scikit-learn.org/stable/auto_examples/model_selection/plot_precision_recall.html).
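
Here is a small sketch of per-class recall, assuming scikit-learn is installed. The labels are made up for a three-class problem rather than taken from the phone data.

```python
import sklearn.metrics as metrics

# Hypothetical labels for a three-class problem
true_labels = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2]
predicted_labels = [0, 0, 0, 1, 1, 1, 0, 2, 2, 2]

# average=None returns one recall value per class instead of a single number
per_class_recall = metrics.recall_score(true_labels, predicted_labels, average=None)
print(per_class_recall)
```

Applied to the phone model as `metrics.recall_score(val_y, pred_y, average=None)`, this should reproduce the diagonal of the normalized confusion matrix.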

## Class probabilities 

The output from `model.predict` is a single class per input. However, many classification models, including random forests, actually calculate a *probability distribution* over the classes. Using `model.predict` simply returns the class with the highest probability. This might not be ideal, depending on how the decision affects your metrics or downstream measures. To get the probabilities themselves, use the `.predict_proba` method.

In [45]:
probs = model.predict_proba(val_X)
print(probs)

[[0.02 0.09 0.44 0.45]
 [0.02 0.06 0.22 0.7 ]
 [0.   0.17 0.61 0.22]
 ...
 [0.05 0.17 0.42 0.36]
 [0.45 0.34 0.13 0.08]
 [0.25 0.53 0.18 0.04]]


This shows the probability the model assigns to each class. Often in business problems, decisions you make lead to different monetary returns. The expected return for a decision based on your classifier is the probability times the monetary return of that decision.

Consider probabilities `[0.05 0.17 0.42 0.36]`. Assume the third option would result in \$100 of profit while the fourth option would return \$150 in profit. Then the expected monetary values are $0.42 \times \$100 = \$42$ and $0.36 \times \$150 = \$54$. Even though the third option has the highest probability, on average it would be better from a business perspective to choose the fourth option.
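
The comparison above can be sketched in plain Python. The `profit` dictionary only covers the two options whose returns are given in the example, keyed by their position in the probability list. The names are illustrative.

```python
# Class probabilities for one input, taken from the example above
probs = [0.05, 0.17, 0.42, 0.36]

# Profit for the two options discussed above, keyed by position in probs
profit = {2: 100, 3: 150}

# The more probable of the two options (what a highest-probability rule picks)
most_probable = max(profit, key=lambda option: probs[option])

# Expected return of each option: probability times profit
expected_return = {option: probs[option] * value for option, value in profit.items()}
best_expected = max(expected_return, key=expected_return.get)

for option, value in expected_return.items():
    print(f"Option {option}: expected return ${value:.2f}")
print(f"Most probable option: {most_probable}")
print(f"Best expected return: option {best_expected}")
```

The two rules disagree here: the highest-probability rule picks the third option (index 2), while the expected-return rule picks the fourth (index 3).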

Next up, you'll get a chance to build your own classification model.