# 📝 Exercise M1.03

The goal of this exercise is to compare the performance of our classifier from
the previous notebook (roughly 81% accuracy with `LogisticRegression`) to some
simple baseline classifiers. The simplest baseline classifier is one that
always predicts the same class, irrespective of the input data.

- What would be the score of a model that always predicts `' >50K'`?
- What would be the score of a model that always predicts `' <=50K'`?
- Is 81% or 82% accuracy a good score for this problem?

Use a `DummyClassifier` and do a train-test split to evaluate its accuracy on
the test set. This
[link](https://scikit-learn.org/stable/modules/model_evaluation.html#dummy-estimators)
shows a few examples of how to evaluate the generalization performance of
these baseline models.

In [1]:
import pandas as pd
url = "https://raw.githubusercontent.com/INRIA/scikit-learn-mooc/master/datasets/adult-census.csv"
adult_census = pd.read_csv(url)

We first split our dataset to have the target separated from the data used to
train our predictive model.

In [2]:
target_name = "class"
target = adult_census[target_name]
data = adult_census.drop(columns=target_name)

We start by selecting only the numerical columns as seen in the previous
notebook.

In [3]:
numerical_columns = ["age", "capital-gain", "capital-loss", "hours-per-week"]

data_numeric = data[numerical_columns]

Split the data and target into a train and test set.

In [39]:
from sklearn.model_selection import train_test_split

# Write your code here.
data_train, data_test, target_train, target_test = train_test_split(
    data_numeric, target, random_state=42
)
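
As a side note on the split above: when `test_size` is not given, `train_test_split` keeps 25% of the samples for the test set. The sketch below checks this on a small synthetic dataset; `X_toy` and `y_toy` are hypothetical stand-ins, not the census data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 samples, 2 features, imbalanced binary target.
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([0] * 75 + [1] * 25)

# With no test_size given, scikit-learn holds out 25% of the samples.
X_train, X_test, y_train, y_test = train_test_split(
    X_toy, y_toy, random_state=42
)

split_shapes = (X_train.shape, X_test.shape)
print(split_shapes)
```

Fixing `random_state` makes the shuffle reproducible, so the same rows end up in the test set on every run.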

Use a `DummyClassifier` such that the resulting classifier always predicts the
class `' >50K'`. What is the accuracy score on the test set? Repeat the
experiment by always predicting the class `' <=50K'`.

Hint: you can set the `strategy` parameter of the `DummyClassifier` to achieve
the desired behavior.

In [40]:
from sklearn.dummy import DummyClassifier
DummyClassifier?

Init signature: DummyClassifier(*, strategy='prior', random_state=None, constant=None)
Docstring:
DummyClassifier makes predictions that ignore the input features.

This classifier serves as a simple baseline to compare against other more
complex classifiers.

The specific behavior of the baseline is selected with the `strategy`
parameter.

All strategies make predictions that ignore the input feature values passed
as the `X` argument to `fit` and `predict`. The predictions, however,
typically depend on values observed in the `y` parameter passed to `fit`.

Note that the "stratified" and "uniform" strategies lead to
non-deterministic predictions that can be rendered deterministic by setting
the `random_state` parameter if needed. The other strategies are naturally
deterministic [...]
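
The docstring's remark about `random_state` can be checked with a minimal sketch. It assumes a small made-up target (`y_toy`) and a hypothetical helper `stratified_predictions`; neither is part of the exercise.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical toy data; the feature values are ignored by DummyClassifier.
X_toy = np.zeros((20, 1))
y_toy = np.array(["a"] * 15 + ["b"] * 5)


def stratified_predictions(seed):
    """Fit a "stratified" dummy classifier with the given seed and predict."""
    clf = DummyClassifier(strategy="stratified", random_state=seed)
    clf.fit(X_toy, y_toy)
    return clf.predict(X_toy)


first = stratified_predictions(0)
second = stratified_predictions(0)

# Same seed, same random draws, hence identical predictions.
same_seed_matches = bool(np.array_equal(first, second))
print(same_seed_matches)
```

Without a fixed seed, two runs of the "stratified" strategy may disagree, which makes baseline scores harder to compare.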

In [41]:
print(target_train.unique())
print(type(target_train))
print([f"'{x}'" for x in target_train.unique()])


[' >50K' ' <=50K']
<class 'pandas.core.series.Series'>
["' >50K'", "' <=50K'"]


In [45]:
# Write your code here.
class_to_predict = ' >50K'
high_revenue_clf = DummyClassifier(strategy="constant", constant=class_to_predict)
high_revenue_clf.fit(data_train, target_train)

high_revenue_score = high_revenue_clf.score(data_test, target_test)

dummy_name = high_revenue_clf.__class__.__name__
dummy_strategy = high_revenue_clf.strategy
strategy_constant = high_revenue_clf.constant

print(f"The test accuracy using a {dummy_name} with {dummy_strategy} = {strategy_constant} is {high_revenue_score:.3f}")

The test accuracy using a DummyClassifier with constant =  >50K is 0.234


In [49]:
# Write your code here.
class_to_predict = ' <=50K'
low_revenue_clf = DummyClassifier(strategy="constant", constant=class_to_predict)
low_revenue_clf.fit(data_train, target_train)

low_revenue_score = low_revenue_clf.score(data_test, target_test)

dummy_name = low_revenue_clf.__class__.__name__
dummy_strategy = low_revenue_clf.strategy
strategy_constant = low_revenue_clf.constant

print(f"The test accuracy using a {dummy_name} with {dummy_strategy} = {strategy_constant} is {low_revenue_score:.3f}")

The test accuracy using a DummyClassifier with constant =  <=50K is 0.766
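
An alternative to naming the class by hand is the `"most_frequent"` strategy, which learns the majority class from the target passed to `fit`. The sketch below uses a tiny made-up target (`y_toy`) rather than the census data, so the numbers are illustrative only.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical toy target: 6 low-revenue samples, 2 high-revenue samples.
X_toy = np.zeros((8, 1))
y_toy = np.array([" <=50K"] * 6 + [" >50K"] * 2)

# "most_frequent" always predicts the most common class seen during fit.
majority_clf = DummyClassifier(strategy="most_frequent")
majority_clf.fit(X_toy, y_toy)

majority_predictions = majority_clf.predict(X_toy)
majority_score = majority_clf.score(X_toy, y_toy)
print(majority_predictions[0], majority_score)
```

This avoids hard-coding a label string, including its leading space, which is easy to get wrong with this dataset.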


We observe that this model has an accuracy higher than 0.5. This is because about three quarters of the samples belong to the low-revenue class, so always predicting `' <=50K'` is correct roughly three times out of four.

In [50]:
adult_census[target_name].value_counts()

class
<=50K    37155
>50K     11687
Name: count, dtype: int64

In [51]:
(target == ' <=50K').mean()

np.float64(0.7607182343065395)
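
The link between class proportion and constant-prediction accuracy can be verified on a toy example. This is a minimal sketch with a made-up four-sample target (`toy_target`), not the census data.

```python
import pandas as pd
from sklearn.metrics import accuracy_score

# Hypothetical toy target: three low-revenue samples and one high-revenue sample.
toy_target = pd.Series([" <=50K", " <=50K", " <=50K", " >50K"])

# A constant predictor outputs the same label for every sample.
constant_predictions = [" <=50K"] * len(toy_target)

constant_accuracy = accuracy_score(toy_target, constant_predictions)
class_fraction = (toy_target == " <=50K").mean()
print(constant_accuracy, class_fraction)
```

Both values coincide: the accuracy of a constant classifier is the fraction of evaluated samples that carry the predicted label.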

The `LogisticRegression` accuracy (roughly 81%) is better than the best `DummyClassifier` accuracy (roughly 77% on this test set), but only by about 4 percentage points.

In a way, this is reassuring: using a machine learning model gives better performance than always predicting the majority class.

In [56]:
from sklearn.linear_model import LogisticRegression

# Write your code here.
model = LogisticRegression()
model.fit(data_train, target_train)
# Predict on the held-out test set; `score` below computes accuracy on the same data.
predictions = model.predict(data_test)
accuracy = model.score(data_test, target_test)
accuracy


0.8070592089099992
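
To put the two scores side by side, the sketch below works from the rounded test accuracies reported in this notebook (0.807 and 0.766). The "relative error reduction" is one possible way to express the gain, chosen here for illustration and not part of the exercise.

```python
# Rounded test accuracies taken from the outputs above.
logistic_accuracy = 0.807
baseline_accuracy = 0.766

# Absolute gain over the majority-class baseline.
gain = logistic_accuracy - baseline_accuracy

# Share of the baseline's errors that the model removes.
relative_error_reduction = gain / (1 - baseline_accuracy)

print(f"Absolute gain: {gain:.3f}")
print(f"Relative error reduction: {relative_error_reduction:.3f}")
```

Read this way, the model is better than the baseline, but the headline 81% overstates how much it has learned from the numerical features.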