# 📝 Exercise M1.03

The goal of this exercise is to compare the performance of our classifier in
the previous notebook (roughly 81% accuracy with `LogisticRegression`) to
some simple baseline classifiers. The simplest baseline classifier is one
that always predicts the same class, irrespective of the input data.

- What would be the score of a model that always predicts `' >50K'`?
- What would be the score of a model that always predicts `' <=50K'`?
- Is 81% or 82% accuracy a good score for this problem?

Use a `DummyClassifier` and a train-test split to evaluate its accuracy on the
test set. The scikit-learn documentation on
[dummy estimators](https://scikit-learn.org/stable/modules/model_evaluation.html#dummy-estimators)
shows a few examples of how to evaluate the generalization performance of these
baseline models.

In [1]:
import pandas as pd

adult_census = pd.read_csv("../datasets/adult-census.csv")

In [16]:
adult_census.head()

Unnamed: 0,age,workclass,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,class
0,25,Private,11th,7,Never-married,Machine-op-inspct,Own-child,Black,Male,0,0,40,United-States,<=50K
1,38,Private,HS-grad,9,Married-civ-spouse,Farming-fishing,Husband,White,Male,0,0,50,United-States,<=50K
2,28,Local-gov,Assoc-acdm,12,Married-civ-spouse,Protective-serv,Husband,White,Male,0,0,40,United-States,>50K
3,44,Private,Some-college,10,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,7688,0,40,United-States,>50K
4,18,?,Some-college,10,Never-married,?,Own-child,White,Female,0,0,30,United-States,<=50K


We first separate the target column from the data used to train our
predictive model.

In [2]:
target_name = "class"
target = adult_census[target_name]
data = adult_census.drop(columns=target_name)

We start by selecting only the numerical columns as seen in the previous
notebook.

In [3]:
numerical_columns = [
    "age", "capital-gain", "capital-loss", "hours-per-week"]

data_numeric = data[numerical_columns]

Split the data and target into a train and test set.

In [12]:
from sklearn.model_selection import train_test_split

data_train, data_test, target_train, target_test = train_test_split(
    data_numeric, target, random_state=42)


Use a `DummyClassifier` such that the resulting classifier will always
predict the class `' >50K'`. What is the accuracy score on the test set?
Repeat the experiment by always predicting the class `' <=50K'`.

Hint: you can set the `strategy` parameter of the `DummyClassifier` to
achieve the desired behavior.
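
To make the hint concrete, here is a minimal sketch of two `strategy` values on a tiny made-up dataset. The arrays `X_toy` and `y_toy` and the labels `"a"` and `"b"` are invented for illustration and are not part of the census data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Toy data, made up for illustration: 6 samples of class "a", 2 of class "b".
X_toy = np.zeros((8, 1))  # DummyClassifier ignores the feature values
y_toy = np.array(["a"] * 6 + ["b"] * 2)

# strategy="constant" always predicts the label passed as `constant`.
constant_clf = DummyClassifier(strategy="constant", constant="b")
constant_clf.fit(X_toy, y_toy)
constant_score = constant_clf.score(X_toy, y_toy)

# strategy="most_frequent" always predicts the majority class seen in fit.
majority_clf = DummyClassifier(strategy="most_frequent")
majority_clf.fit(X_toy, y_toy)
majority_score = majority_clf.score(X_toy, y_toy)

print(f"Always 'b': {constant_score:.2f}")
print(f"Most frequent: {majority_score:.2f}")
```

Note that the label given to `constant` must match one of the labels seen during `fit` exactly, including any leading whitespace.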

In [14]:
from sklearn.dummy import DummyClassifier

# Write your code here.
class_to_predict = " >50K"
class_clf = DummyClassifier(strategy="constant",
                            constant=class_to_predict)
class_clf.fit(data_train, target_train)
score = class_clf.score(data_test, target_test)
print(f"Accuracy of a model predicting only high revenue: {score:.3f}")

Accuracy of a model predicting only high revenue: 0.234


In [15]:
class_low_to_predict = " <=50K"
# Pass the low-revenue label here; reusing `class_to_predict` would
# rebuild the high-revenue classifier from the previous cell.
class_clf = DummyClassifier(strategy="constant",
                            constant=class_low_to_predict)
class_clf.fit(data_train, target_train)
score = class_clf.score(data_test, target_test)
print(f"Accuracy of a model predicting only low revenue: {score:.3f}")

Accuracy of a model predicting only low revenue: 0.766


In [17]:
adult_census["class"].value_counts()

 <=50K    37155
 >50K     11687
Name: class, dtype: int64

In [18]:
(target == " <=50K").mean()

0.7607182343065395
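
The proportion above is also what a model that always predicts `' <=50K'` would score on the full dataset, since such a model is right exactly when the true label is `' <=50K'`. A minimal sketch of that arithmetic, using the counts printed by `value_counts()` above (the variable names are illustrative):

```python
# Class counts copied from the value_counts() output above.
n_low = 37155   # rows labelled " <=50K"
n_high = 11687  # rows labelled " >50K"
n_total = n_low + n_high

# A constant classifier is correct exactly on the rows of the class it
# predicts, so its accuracy equals that class's share of the data.
acc_always_low = n_low / n_total
acc_always_high = n_high / n_total

print(f"Always ' <=50K': {acc_always_low:.3f}")
print(f"Always ' >50K':  {acc_always_high:.3f}")
```

These shares are computed on the full dataset, so they can differ slightly from scores measured on the test split alone.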

In [19]:
most_freq_revenue_clf = DummyClassifier(strategy="most_frequent")
most_freq_revenue_clf.fit(data_train, target_train)
score = most_freq_revenue_clf.score(data_test, target_test)
print(f"Accuracy of a model predicting the most frequent class: {score:.3f}")

Accuracy of a model predicting the most frequent class: 0.766
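
One way to approach the question of whether roughly 81% accuracy is a good score is to set it against the majority-class baseline. The sketch below does that arithmetic with two figures quoted in this notebook. The 0.81 value is the approximate `LogisticRegression` accuracy mentioned in the introduction, not a number recomputed here.

```python
baseline_accuracy = 0.766  # most-frequent-class score printed above
model_accuracy = 0.81      # approximate LogisticRegression accuracy (introduction)

# Absolute gain in accuracy over the baseline.
absolute_gain = model_accuracy - baseline_accuracy

# Share of the baseline's errors that the model removes.
baseline_error = 1 - baseline_accuracy
relative_error_reduction = absolute_gain / baseline_error

print(f"Absolute gain: {absolute_gain:.3f}")
print(f"Relative error reduction: {relative_error_reduction:.1%}")
```

Under these figures, the model gains only a few points of accuracy over a predictor that ignores the input entirely.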
