# Baseline Classifiers (Dummy Classifiers)

Baseline classifiers make predictions with simple rules, often without using any features. If our ML models can't beat the baseline, we need to rethink them.

In [1]:
import os, sys
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.dummy import DummyClassifier

## Load Dataset

We will load the dataset from a file into a pandas DataFrame and inspect its structure.


In [2]:
# Dataset location
DATASET = '/dsa/data/all_datasets/wine-quality/winequality-red.csv'
assert os.path.exists(DATASET)

# Load and shuffle
dataset = pd.read_csv(DATASET, sep=';').sample(frac = 1).reset_index(drop=True)

# View some metadata of the dataset and see if that makes sense
print('dataset.shape', dataset.shape)

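# Keep five feature columns: volatile acidity, citric acid,
# total sulfur dioxide, sulphates, alcohol
X = np.array(dataset.iloc[:,:-1])[:, [1,2,6,9,10]]
# The target is the quality score
y = np.array(dataset.quality)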

print('X', X.shape, 'y', y.shape)
print('Label distribution:', {i: np.sum(y==i) for i in np.unique(dataset.quality)})

dataset.shape (1599, 12)
X (1599, 5) y (1599,)
Label distribution: {3: 10, 4: 53, 5: 681, 6: 638, 7: 199, 8: 18}


### Describe the Dataset

In [3]:
dataset.describe()

| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 |
| mean | 8.319637 | 0.527821 | 0.270976 | 2.538806 | 0.087467 | 15.874922 | 46.467792 | 0.996747 | 3.311113 | 0.658149 | 10.422983 | 5.636023 |
| std | 1.741096 | 0.17906 | 0.194801 | 1.409928 | 0.047065 | 10.460157 | 32.895324 | 0.001887 | 0.154386 | 0.169507 | 1.065668 | 0.807569 |
| min | 4.6 | 0.12 | 0.0 | 0.9 | 0.012 | 1.0 | 6.0 | 0.99007 | 2.74 | 0.33 | 8.4 | 3.0 |
| 25% | 7.1 | 0.39 | 0.09 | 1.9 | 0.07 | 7.0 | 22.0 | 0.9956 | 3.21 | 0.55 | 9.5 | 5.0 |
| 50% | 7.9 | 0.52 | 0.26 | 2.2 | 0.079 | 14.0 | 38.0 | 0.99675 | 3.31 | 0.62 | 10.2 | 6.0 |
| 75% | 9.2 | 0.64 | 0.42 | 2.6 | 0.09 | 21.0 | 62.0 | 0.997835 | 3.4 | 0.73 | 11.1 | 6.0 |
| max | 15.9 | 1.58 | 1.0 | 15.5 | 0.611 | 72.0 | 289.0 | 1.00369 | 4.01 | 2.0 | 14.9 | 8.0 |


### Building a dummy classifier using sklearn DummyClassifier

Let's create baselines with sklearn `DummyClassifier`. It supports the following strategies, and the cells below run the first four:

* Most Frequent: The classifier always predicts the most frequent class label in the training data.
* Prior: The classifier also always predicts the most frequent class label, but its predicted probabilities are the class proportions of the training data instead of a one-hot vector.
* Stratified: Each prediction is drawn at random from the class distribution of the training data. Unlike the “most frequent” strategy, it can predict any class, with each class chosen in proportion to its frequency.
* Uniform: Each prediction is drawn uniformly at random from the class labels.
* Constant: The classifier always predicts a constant label supplied by the user. This is mainly useful for metrics that evaluate a non-majority class.
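
The constant strategy is the only one in the list that the cells below do not run. A minimal sketch of how it might be used follows. The toy labels `y_toy`, the placeholder features `X_toy`, and the chosen constant `7` are assumptions for illustration, not values from this notebook's dataset.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical toy labels: 5 is the majority class, 7 is a minority class
y_toy = np.array([5, 5, 5, 5, 6, 6, 6, 7, 7, 8])
# DummyClassifier ignores the features, so a column of zeros is enough
X_toy = np.zeros((len(y_toy), 1))

# Always predict the label passed in `constant`; it must occur in the training labels
constant_model = DummyClassifier(strategy='constant', constant=7)
constant_model.fit(X_toy, y_toy)

predictions = constant_model.predict(X_toy)
constant_accuracy = constant_model.score(X_toy, y_toy)
print(predictions)
print(f"Accuracy: {constant_accuracy}")
```

Accuracy here is simply the share of rows whose true label equals the constant, which is why this strategy is more informative when paired with a per-class metric such as recall for that label.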

#### Most Frequent Strategy

In [4]:
dummy_model = DummyClassifier(strategy='most_frequent')
dummy_model.fit(X,y)
print(f"Accuracy: {dummy_model.score(X,y)}")

Accuracy: 0.425891181988743
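
The cell above fits and scores on the same rows. As a minimal sketch of the comparison this notebook is motivated by, the example below scores a baseline and a real model on a held-out split. The synthetic data from `make_classification` and the `LogisticRegression` model are assumptions chosen for illustration; neither appears in the original notebook.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): two well-separated classes
X_demo, y_demo = make_classification(
    n_samples=500, n_features=5, n_informative=3, n_redundant=0,
    n_clusters_per_class=1, class_sep=2.0, random_state=0,
)

# Hold out 30% of the rows; stratify keeps the class balance in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=0
)

# Fit both models on the training split only
baseline = DummyClassifier(strategy='most_frequent').fit(X_train, y_train)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Score both on the same held-out rows so the numbers are comparable
baseline_accuracy = baseline.score(X_test, y_test)
model_accuracy = model.score(X_test, y_test)
print(f"Baseline accuracy: {baseline_accuracy:.3f}")
print(f"Model accuracy:    {model_accuracy:.3f}")
```

Scoring both models on the same test split is what makes the baseline a fair reference point.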


#### Prior Strategy

In [5]:
dummy_model = DummyClassifier(strategy='prior')
dummy_model.fit(X,y)
print(f"Accuracy: {dummy_model.score(X,y)}")

Accuracy: 0.425891181988743
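
The accuracy matches the most frequent strategy because both strategies return the majority class from `predict`. They differ only in `predict_proba`. The sketch below shows this on hypothetical toy labels (`y_toy` and `X_toy` are assumptions for illustration).

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical toy labels with class proportions 0.5, 0.3 and 0.2
y_toy = np.array([5, 5, 5, 5, 5, 6, 6, 6, 7, 7])
X_toy = np.zeros((len(y_toy), 1))  # features are ignored

most_frequent_model = DummyClassifier(strategy='most_frequent').fit(X_toy, y_toy)
prior_model = DummyClassifier(strategy='prior').fit(X_toy, y_toy)

# Hard predictions are identical: always the majority class
same_predictions = np.array_equal(
    most_frequent_model.predict(X_toy), prior_model.predict(X_toy)
)

# Probabilities differ: one-hot vector versus the class proportions
most_frequent_proba = most_frequent_model.predict_proba(X_toy[:1])[0]
prior_proba = prior_model.predict_proba(X_toy[:1])[0]
print(same_predictions)
print(most_frequent_proba)
print(prior_proba)
```

The difference matters for metrics computed from probabilities, such as log loss, where the prior strategy is usually the more sensible baseline.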


#### Stratified Strategy

In [6]:
dummy_model = DummyClassifier(strategy='stratified')
dummy_model.fit(X,y)
print(f"Accuracy: {dummy_model.score(X,y)}")

Accuracy: 0.3445903689806129


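Note: The score above changes from run to run because each prediction is sampled at random from the class distribution. Passing `random_state` to `DummyClassifier` makes it reproducible.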
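
As a small sketch of making a sampled baseline repeatable, the example below fixes the seed with the `random_state` parameter. The toy labels and the seed value `0` are assumptions for illustration.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical toy labels
y_toy = np.array([5] * 5 + [6] * 3 + [7] * 2)
X_toy = np.zeros((len(y_toy), 1))  # features are ignored

# Two independently created models with the same integer seed
first_run = (
    DummyClassifier(strategy='stratified', random_state=0)
    .fit(X_toy, y_toy)
    .predict(X_toy)
)
second_run = (
    DummyClassifier(strategy='stratified', random_state=0)
    .fit(X_toy, y_toy)
    .predict(X_toy)
)

# With a fixed integer seed, both runs draw the same predictions
print(np.array_equal(first_run, second_run))
```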

#### Uniform Strategy

In [7]:
dummy_model = DummyClassifier(strategy='uniform')
dummy_model.fit(X,y)
print(f"Accuracy: {dummy_model.score(X,y)}")

Accuracy: 0.16635397123202


Note: The score above also changes from run to run because each prediction is drawn uniformly at random from the class labels.
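
The scores above can be sanity-checked against their expected values. The sketch below assumes predictions are drawn independently of the true labels and uses the label counts printed earlier in this notebook. With class proportions p_k, the most frequent strategy scores max p_k, the stratified strategy scores the sum of p_k squared in expectation, and the uniform strategy scores 1/K for K classes.

```python
# Label counts as printed in the "Label distribution" output above
label_counts = {3: 10, 4: 53, 5: 681, 6: 638, 7: 199, 8: 18}

total = sum(label_counts.values())
priors = {label: count / total for label, count in label_counts.items()}

# Most frequent / prior: always right on the majority class, wrong elsewhere
expected_most_frequent = max(priors.values())

# Stratified: prediction and true label are both drawn from the priors,
# so they match with probability sum(p_k ** 2)
expected_stratified = sum(p ** 2 for p in priors.values())

# Uniform: each class is predicted with probability 1/K regardless of the truth
expected_uniform = 1 / len(label_counts)

print(f"most_frequent: {expected_most_frequent:.4f}")
print(f"stratified:    {expected_stratified:.4f}")
print(f"uniform:       {expected_uniform:.4f}")
```

The observed stratified and uniform scores fluctuate around these expected values from run to run, while the most frequent score matches exactly.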