# Dummy Classifier - Useful
> How to get a first idea of the quality of your model

- toc: true
- badges: false
- comments: true
- author: Cécile Gallioz
- categories: [sklearn]

The [Dummy classifier](https://scikit-learn.org/stable/modules/generated/sklearn.dummy.DummyClassifier.html#sklearn.dummy.DummyClassifier.predict) will give us an idea of the "minimum" quality we can achieve.

It ignores the input features. Depending on the chosen strategy, it returns the most frequent class of the training set, a class drawn at random, or a fixed value.

Its score serves as a floor for later models. The objective is to do better, ideally much better, than this naive baseline.

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.dummy import DummyClassifier

In [2]:
myDataFrame = pd.read_csv("../../scikit-learn-mooc/datasets/penguins_classification.csv")

In [3]:
target_column = 'Species'
target = myDataFrame[target_column]
target.value_counts()

Adelie       151
Gentoo       123
Chinstrap     68
Name: Species, dtype: int64

In [4]:
target.value_counts(normalize=True)

Adelie       0.441520
Gentoo       0.359649
Chinstrap    0.198830
Name: Species, dtype: float64
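
These proportions already tell us roughly what an always-predict-the-majority baseline should score. A minimal sketch, with the class counts copied by hand from the output above (the variable names are illustrative, not part of the notebook):

```python
# Class counts copied from the value_counts() output above.
counts = {"Adelie": 151, "Gentoo": 123, "Chinstrap": 68}

total = sum(counts.values())
majority_class = max(counts, key=counts.get)

# Share of the majority class: the accuracy we expect when always predicting it.
majority_share = counts[majority_class] / total

print(f"{majority_class}: {majority_share:.3f}")
```

The score measured on a test set will differ slightly from this value, because the split changes the class proportions a little.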

In [5]:
data = myDataFrame.drop(columns=target_column)
data.columns

Index(['Culmen Length (mm)', 'Culmen Depth (mm)'], dtype='object')

In [6]:
numerical_columns = ['Culmen Length (mm)', 'Culmen Depth (mm)']
data_numeric = data[numerical_columns]

In [7]:
data_train, data_test, target_train, target_test = train_test_split(
    data_numeric, 
    target, 
    #random_state=42, 
    test_size=0.25)
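
The split above leaves `random_state` commented out, so every run produces a different test set and therefore different scores. A minimal sketch of a reproducible split on toy data (the arrays and the `split` helper are hypothetical, and `stratify` is an optional addition that keeps the class proportions similar in both sets):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 20 rows, 3 classes.
X = np.arange(40).reshape(20, 2)
y = np.array(["Adelie"] * 9 + ["Gentoo"] * 7 + ["Chinstrap"] * 4)


def split(seed):
    # A fixed seed gives the same split on every call.
    return train_test_split(
        X, y, test_size=0.25, random_state=seed, stratify=y
    )


_, _, _, y_test_a = split(42)
_, _, _, y_test_b = split(42)

same_split = bool((y_test_a == y_test_b).all())
print(same_split)
```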

## Prior (the default): same predictions as most frequent
The value returned is the most frequent class in the training set.

    model = DummyClassifier(strategy='prior')

In [8]:
model = DummyClassifier()
model.fit(data_train, target_train);

In [9]:
a = model.predict(data_test)
n = a.size
unique, counts = np.unique(a, return_counts=True)
dict(zip(unique, counts/n))

{'Adelie': 1.0}

In [10]:
accuracy = model.score(data_test, target_test)
print(f"Accuracy of the dummy classifier (prior): {accuracy:.3f}")

Accuracy of the dummy classifier (prior): 0.453
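
The `prior` and `most_frequent` strategies predict the same label. They differ only in `predict_proba`: `prior` returns the class frequencies of the training set, while `most_frequent` puts all the probability on the majority class. A small sketch on toy data (the arrays are hypothetical, and the features are ignored by the dummy classifier anyway):

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Toy data (hypothetical): 5 Adelie, 3 Gentoo, 2 Chinstrap.
X = np.zeros((10, 2))
y = np.array(["Adelie"] * 5 + ["Gentoo"] * 3 + ["Chinstrap"] * 2)

prior = DummyClassifier(strategy="prior").fit(X, y)
most_frequent = DummyClassifier(strategy="most_frequent").fit(X, y)

# Same predicted labels for both strategies.
same_labels = bool((prior.predict(X) == most_frequent.predict(X)).all())

# Probabilities follow the order of prior.classes_ (sorted labels).
proba_prior = prior.predict_proba(X[:1])[0]
proba_most_frequent = most_frequent.predict_proba(X[:1])[0]

print(prior.classes_)
print(proba_prior)
print(proba_most_frequent)
```

This matters only for metrics computed from probabilities, such as log loss. For accuracy, the two strategies are interchangeable.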


## Stratified
The predicted value is drawn at random, following the class distribution of the training set.

    model = DummyClassifier(strategy='stratified', random_state=33)

In [11]:
model = DummyClassifier(strategy='stratified')
model.fit(data_train, target_train);

In [12]:
a = model.predict(data_test)
n = a.size
unique, counts = np.unique(a, return_counts=True)
dict(zip(unique, counts/n))

{'Adelie': 0.4186046511627907,
 'Chinstrap': 0.22093023255813954,
 'Gentoo': 0.36046511627906974}

In [13]:
accuracy = model.score(data_test, target_test)
print(f"Accuracy of the dummy classifier (stratified): {accuracy:.3f}")

Accuracy of the dummy classifier (stratified): 0.372
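
This score can be checked against a quick calculation. If predictions are drawn independently of the true labels, and both follow the same class proportions, a prediction is correct with probability equal to the sum of the squared proportions. A sketch under that assumption, using the counts of the full dataset as a stand-in for the training proportions:

```python
# Class counts copied from the value_counts() output earlier in the post.
counts = {"Adelie": 151, "Gentoo": 123, "Chinstrap": 68}
total = sum(counts.values())

# P(correct) = sum over classes of P(true = c) * P(predicted = c) = sum of p_c ** 2
expected_accuracy = sum((count / total) ** 2 for count in counts.values())

print(f"{expected_accuracy:.3f}")
```

The measured score fluctuates around this value from one run to the next, since both the split and the predictions are random.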


## Uniform
The predicted value is drawn uniformly at random among the classes.

    model = DummyClassifier(strategy='uniform', random_state=21)

In [14]:
model = DummyClassifier(strategy='uniform')
model.fit(data_train, target_train);

In [15]:
a = model.predict(data_test)
n = a.size
unique, counts = np.unique(a, return_counts=True)
dict(zip(unique, counts/n))

{'Adelie': 0.23255813953488372,
 'Chinstrap': 0.3953488372093023,
 'Gentoo': 0.37209302325581395}

In [16]:
accuracy = model.score(data_test, target_test)
print(f"Accuracy of the dummy classifier (uniform): {accuracy:.3f}")

Accuracy of the dummy classifier (uniform): 0.465
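
With three classes, a uniform random guess is right one time in three on average, whatever the class distribution. The score above is higher only because the test set is small (a quarter of 342 rows), so a single run is noisy. A sketch on a much larger synthetic sample (the labels and their proportions are made up for illustration):

```python
import numpy as np
from sklearn.dummy import DummyClassifier

rng = np.random.default_rng(0)
classes = np.array(["Adelie", "Chinstrap", "Gentoo"])

# Hypothetical labels, drawn with roughly the penguin class proportions.
y = rng.choice(classes, size=30_000, p=[0.44, 0.20, 0.36])
X = np.zeros((y.size, 1))  # features are ignored by the dummy classifier

model = DummyClassifier(strategy="uniform", random_state=0).fit(X, y)
accuracy = model.score(X, y)

print(f"{accuracy:.3f}")
```

With this many samples the accuracy should land very close to 1/3.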


## Constant
Always predicts the constant label provided by the user. This is useful for metrics that evaluate a non-majority class.

    model = DummyClassifier(strategy='constant', constant="oneConstant")

In [17]:
model = DummyClassifier(strategy='constant', constant="Chinstrap")
model.fit(data_train, target_train);

In [18]:
a = model.predict(data_test)
n = a.size
unique, counts = np.unique(a, return_counts=True)
dict(zip(unique, counts/n))

{'Chinstrap': 1.0}

In [19]:
accuracy = model.score(data_test, target_test)
print(f"Accuracy of the dummy classifier (constant): {accuracy:.3f}")

Accuracy of the dummy classifier (constant): 0.151
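
Accuracy is not where this strategy is informative. When the metric focuses on one minority class, always predicting that class gives a perfect recall and a precision equal to the share of the class, which is the floor to beat for that metric. A sketch on toy data (the arrays are hypothetical, and the labels are recoded as 1 for Chinstrap and 0 otherwise to keep the metric calls simple):

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import precision_score, recall_score

# Toy data (hypothetical): 5 Adelie, 3 Gentoo, 2 Chinstrap.
X = np.zeros((10, 2))
y = np.array(["Adelie"] * 5 + ["Gentoo"] * 3 + ["Chinstrap"] * 2)

model = DummyClassifier(strategy="constant", constant="Chinstrap").fit(X, y)
predictions = model.predict(X)

# Binary view of the problem: Chinstrap (1) versus the rest (0).
y_true = (y == "Chinstrap").astype(int)
y_pred = (predictions == "Chinstrap").astype(int)

recall = recall_score(y_true, y_pred)
precision = precision_score(y_true, y_pred)

print(f"recall={recall:.2f} precision={precision:.2f}")
```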


## Conclusion

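On average, the best dummy score comes from the `prior` strategy, because it always predicts the majority class. The 0.465 obtained with `uniform` above is a lucky draw on a small test set: with three classes, its expected accuracy is 1/3. A model that does not clearly beat the `prior` score has learned nothing useful.

We thus have a floor value: the further a model scores above it, the better.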
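
To put this into practice, the dummy strategies and a real model can be scored side by side on the same split. A minimal sketch, assuming synthetic three-class data from `make_classification` instead of the penguins file, and a `LogisticRegression` as an example of a real model (both are illustrative choices, not taken from the notebook):

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced 3-class data (an assumption for illustration).
X, y = make_classification(
    n_samples=600, n_features=4, n_informative=3, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, weights=[0.45, 0.35, 0.20],
    class_sep=1.5, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

models = {
    "prior": DummyClassifier(strategy="prior"),
    "stratified": DummyClassifier(strategy="stratified", random_state=0),
    "uniform": DummyClassifier(strategy="uniform", random_state=0),
    "logistic_regression": LogisticRegression(max_iter=1000),
}

# Fit every model on the same training set and score it on the same test set.
scores = {
    name: model.fit(X_train, y_train).score(X_test, y_test)
    for name, model in models.items()
}

for name, score in scores.items():
    print(f"{name:>20}: {score:.3f}")
```

The gap between the real model and the `prior` line is what the model has actually learned from the features.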