# 📝 Exercise M1.02

The goal of this exercise is to fit a model similar to the one in the previous
notebook, to get familiar with manipulating scikit-learn objects and in
particular the `.fit/.predict/.score` API.

Let's load the adult census dataset with only numerical variables:

In [14]:
import pandas as pd

adult_census = pd.read_csv("../datasets/adult-census-numeric.csv")
data = adult_census.drop(columns="class")
target = adult_census["class"]

In the previous notebook we used `model = KNeighborsClassifier()`. All
scikit-learn models can be created without arguments. This is convenient
because it means that you don't need to understand the full details of a model
before starting to use it.

One of the `KNeighborsClassifier` parameters is `n_neighbors`. It controls the
number of neighbors we are going to use to make a prediction for a new data
point.
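
To make the role of `n_neighbors` concrete, here is a minimal sketch on a toy one-dimensional dataset (hypothetical values, not the census data). It assumes the default uniform weighting, in which case the predicted class is the most frequent label among the nearest neighbors returned by `kneighbors`.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy data made up for illustration: two well-separated groups.
X = np.array([[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]])
y = np.array(["low", "low", "low", "high", "high", "high"])

model = KNeighborsClassifier(n_neighbors=3).fit(X, y)

# Find the 3 training points closest to a new sample.
new_point = np.array([[1.5]])
_, neighbor_indices = model.kneighbors(new_point)
neighbor_labels = y[neighbor_indices[0]]

# With uniform weights, the prediction is the majority label among them.
labels, counts = np.unique(neighbor_labels, return_counts=True)
majority_label = labels[np.argmax(counts)]
prediction = model.predict(new_point)[0]
print(majority_label, prediction)
```

A larger `n_neighbors` means more training points take part in each vote.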

What is the default value of the `n_neighbors` parameter?

**Hint**: Look at the documentation on the [scikit-learn
website](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html)
or directly access the description inside your notebook by running the
following cell. This opens a pager pointing to the documentation.

In [23]:
from sklearn.neighbors import KNeighborsClassifier

KNeighborsClassifier.score?

Signature: KNeighborsClassifier.score(self, X, y, sample_weight=None)
Docstring:
Return the mean accuracy on the given test data and labels.

In multi-label classification, this is the subset accuracy
which is a harsh metric since you require for each sample that
each label set be correctly predicted.

Parameters
----------
X : array-like of shape (n_samples, n_features), or None
    Test samples. If `None`, predictions for all indexed points are
    used; in this case, points are not considered their own
    neighbors. This means that `knn.fit(X, y).score(None, y)`
    implicitly performs a leave-one-out cross-validation procedure
    and is equivalent to `cross_val_score(knn, X, y, cv=LeaveOneOut())`
    but typically much faster.

y : array-like of shape (n_samples,) or (n_samples, n_outputs)
    True labels for `X`.

sample_weight : array-like of shape (n_samples,), default=None
    Sample weights.

Returns
-------
score : float
    Mean acc
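
The pager output above documents the `score` method rather than the constructor. As an alternative sketch, the constructor defaults can also be read programmatically: `get_params` returns the parameters of an estimator, so calling it on a model created without arguments shows the default value of `n_neighbors`.

```python
from sklearn.neighbors import KNeighborsClassifier

# A model created without arguments carries the default parameter values.
default_model = KNeighborsClassifier()
default_n_neighbors = default_model.get_params()["n_neighbors"]
print(default_n_neighbors)
```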

Create a `KNeighborsClassifier` model with `n_neighbors=50`.

In [16]:
# Create a k-nearest neighbors classifier that uses 50 neighbors.
model = KNeighborsClassifier(n_neighbors=50)

Fit this model on the data and target loaded above.

In [17]:
_ = model.fit(data, target)

Use your model to make predictions on the first 10 data points. Do they match
the actual target values?

In [32]:
# Predict the class of the first 10 rows of the training data.
target_predicted = model.predict(data[:10])
target_predicted

array([' <=50K', ' <=50K', ' <=50K', ' <=50K', ' <=50K', ' <=50K',
       ' <=50K', ' >50K', ' <=50K', ' <=50K'], dtype=object)
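
One way to check whether predictions match the target is an element-wise comparison. The sketch below uses a synthetic dataset from `make_classification` as a stand-in for the census data (an assumption made so the example is self-contained), and the names `first_predictions` and `matches` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the census data (assumption for illustration).
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
model = KNeighborsClassifier(n_neighbors=50).fit(X, y)

# Compare each prediction with the true label of the same sample.
first_predictions = model.predict(X[:10])
matches = first_predictions == y[:10]  # boolean array, True where correct
n_correct = int(matches.sum())
print(f"{n_correct} of {len(matches)} predictions match the target")
```

With the notebook's own variables, the same idea would be `target[:10] == target_predicted`.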

Compute the accuracy on the training data.

In [25]:
# `score` returns the mean accuracy of the model's predictions on the given data.
accuracy = model.score(data, target)
print(accuracy)

0.8290379545978042
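
For a classifier, `score` reports the mean accuracy, that is, the fraction of samples whose predicted label equals the true label. The following sketch, again on synthetic data rather than the census dataset, computes that fraction by hand and compares it with `score`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the census data (assumption for illustration).
X, y = make_classification(n_samples=300, n_features=4, random_state=0)
model = KNeighborsClassifier(n_neighbors=50).fit(X, y)

# Accuracy as returned by the estimator.
accuracy_from_score = model.score(X, y)

# The same quantity computed by hand: fraction of correct predictions.
accuracy_by_hand = np.mean(model.predict(X) == y)
print(accuracy_from_score, accuracy_by_hand)
```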


Now load the test data from `"../datasets/adult-census-numeric-test.csv"` and
compute the accuracy on the test data.

In [29]:
# Load the held-out test set and separate the features from the target.
test = pd.read_csv("../datasets/adult-census-numeric-test.csv")
target_test = test["class"]
data_test = test.drop(columns="class")

test_data_accuracy = model.score(data_test, target_test)
print(test_data_accuracy)

0.8177909714402702
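
The same train-versus-test comparison can be sketched without the CSV files. This example assumes a synthetic dataset and uses `train_test_split` to hold out part of it, which differs from the notebook, where the test set comes from a separate file.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the census data (assumption for illustration).
X, y = make_classification(n_samples=1000, n_features=4, random_state=0)

# Hold out 25% of the samples (the default) to play the role of test data.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fit on the training part only, then score on both parts.
model = KNeighborsClassifier(n_neighbors=50).fit(X_train, y_train)
train_accuracy = model.score(X_train, y_train)
test_accuracy = model.score(X_test, y_test)
print(f"train: {train_accuracy:.3f}, test: {test_accuracy:.3f}")
```

The test accuracy is the more informative number, since it is measured on samples the model did not see during `fit`.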
