# ---
# jupyter:
#   jupytext:
#     text_representation:
#       extension: .py
#       format_name: percent
#       format_version: '1.3'
#       jupytext_version: 1.6.0
#   kernelspec:
#     display_name: Python 3
#     language: python
#     name: python3
# ---
# %% [markdown]
# # 📃 Solution for Exercise M1.02
#
# The goal of this exercise is to fit a model similar to the one in the
# previous notebook, to get familiar with manipulating scikit-learn objects
# and in particular the `.fit/.predict/.score` API.
# %% [markdown]
# Let's load the adult census dataset with only numerical variables.
# %%
import pandas as pd
adult_census = pd.read_csv("../datasets/adult-census-numeric.csv")
data = adult_census.drop(columns="class")
target = adult_census["class"]
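# %% [markdown]
# As a minimal sketch of the feature/target split above, the next cell applies
# the same two operations to a tiny hand-made dataframe. The names `toy_frame`,
# `toy_features` and `toy_labels` and all the values are made up for
# illustration; only the `"class"` column name comes from the exercise.

```python
# %%
import pandas as pd

# Hypothetical miniature table: two numerical features and a "class" column.
toy_frame = pd.DataFrame({
    "age": [25, 38, 52],
    "hours-per-week": [40, 50, 45],
    "class": ["<=50K", "<=50K", ">50K"],
})
toy_features = toy_frame.drop(columns="class")  # dataframe without the target
toy_labels = toy_frame["class"]                 # series holding only the target
print(list(toy_features.columns))
print(toy_labels.shape)
```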
# %% [markdown]
# In the previous notebook we used `model = KNeighborsClassifier()`. All
# scikit-learn models can be created without arguments, which means that you
# don't need to understand the details of the model to use it in scikit-learn.
#
# One of the `KNeighborsClassifier` parameters is `n_neighbors`. It controls
# the number of neighbors we are going to use to make a prediction for a new
# data point.
#
# What is the default value of the `n_neighbors` parameter? Hint: look at the
# help inside your notebook (`KNeighborsClassifier?`) or on the [scikit-learn
# website](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html).
# %% [markdown] tags=["solution"]
# The default value for `n_neighbors` is 5.
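# %% [markdown]
# The default can also be read from code rather than from the documentation.
# The sketch below inspects an estimator created without arguments, then fits
# it on a few made-up one-dimensional points to show that `kneighbors` returns
# `n_neighbors` neighbors per query. The names `default_model`, `line_points`
# and `line_labels` are hypothetical and not part of the exercise.

```python
# %%
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# An estimator created without arguments carries the default parameters.
default_model = KNeighborsClassifier()
print(default_model.get_params()["n_neighbors"])

# Made-up data: 20 points on a line, first half labelled 0, second half 1.
line_points = np.arange(20, dtype=float).reshape(-1, 1)
line_labels = np.array([0] * 10 + [1] * 10)
default_model.fit(line_points, line_labels)

# One query point: the result has one row with n_neighbors columns.
neighbor_distances, neighbor_indices = default_model.kneighbors([[4.2]])
print(neighbor_indices.shape)
```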
# %% [markdown]
# Create a `KNeighborsClassifier` model with `n_neighbors=50`.
# %%
# solution
from sklearn.neighbors import KNeighborsClassifier
model = KNeighborsClassifier(n_neighbors=50)
# %% [markdown]
# Fit this model on the data and target loaded above.
# %%
# solution
model.fit(data, target)
# %% [markdown]
# Use your model to make predictions on the first 10 data points inside the
# data. Do they match the actual target values?
# %%
# solution
first_data_values = data.iloc[:10]
first_predictions = model.predict(first_data_values)
first_predictions
# %% tags=["solution"]
first_target_values = target.iloc[:10]
first_target_values
# %% tags=["solution"]
number_of_correct_predictions = (
    first_predictions == first_target_values).sum()
number_of_predictions = len(first_predictions)
print(
    f"{number_of_correct_predictions}/{number_of_predictions} "
    "of predictions are correct")
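# %% [markdown]
# The count above relies on an element-wise comparison between a NumPy array
# (the predictions) and a pandas series (the targets). A small sketch with
# made-up labels shows what that comparison produces; `example_predictions`,
# `example_targets` and `matches` are hypothetical names.

```python
# %%
import numpy as np
import pandas as pd

example_predictions = np.array(["a", "b", "a", "a"])
example_targets = pd.Series(["a", "b", "b", "a"])

# The comparison is positional and yields one boolean per data point.
matches = example_predictions == example_targets
print(matches.tolist())

# True counts as 1, so the sum is the number of correct predictions.
print(matches.sum())
```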
# %% [markdown]
# Compute the accuracy on the training data.
# %%
# solution
model.score(data, target)
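# %% [markdown]
# For a classifier, `score` reports the accuracy, that is the fraction of
# correct predictions. The sketch below checks this on synthetic data
# generated with `make_classification` (an assumption made so that the cell
# runs without the census files); all `sketch_*` names are hypothetical.

```python
# %%
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the census data.
sketch_data, sketch_target = make_classification(
    n_samples=200, n_features=4, random_state=0)
sketch_model = KNeighborsClassifier(n_neighbors=50)
sketch_model.fit(sketch_data, sketch_target)

sketch_predictions = sketch_model.predict(sketch_data)
# Three ways to obtain the same training accuracy.
manual_accuracy = (sketch_predictions == sketch_target).mean()
metric_accuracy = accuracy_score(sketch_target, sketch_predictions)
score_accuracy = sketch_model.score(sketch_data, sketch_target)
print(manual_accuracy, metric_accuracy, score_accuracy)
```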
# %% [markdown]
# Now load the test data from `"../datasets/adult-census-numeric-test.csv"` and
# compute the accuracy on the test data.
# %%
# solution
adult_census_test = pd.read_csv("../datasets/adult-census-numeric-test.csv")
data_test = adult_census_test.drop(columns="class")
target_test = adult_census_test["class"]
model.score(data_test, target_test)
# %% [markdown] tags=["solution"]
# Compared with the result in the previous notebook, the accuracy seems
# slightly higher with `n_neighbors=50` than with `n_neighbors=5` (the default
# value).
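# %% [markdown]
# One way to make such a comparison directly is to fit both settings on the
# same training set and score them on the same held-out set. The sketch below
# does this on synthetic data (an assumption, so that it runs without the
# census files); the scores it prints say nothing about the census dataset,
# and all `cmp_*` names are hypothetical.

```python
# %%
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data, split once so both models see the same samples.
cmp_data, cmp_target = make_classification(
    n_samples=1000, n_features=5, random_state=0)
cmp_data_train, cmp_data_test, cmp_target_train, cmp_target_test = (
    train_test_split(cmp_data, cmp_target, random_state=0))

cmp_scores = {}
for cmp_k in (5, 50):
    cmp_model = KNeighborsClassifier(n_neighbors=cmp_k)
    cmp_model.fit(cmp_data_train, cmp_target_train)
    # Accuracy on data the model has not seen during fit.
    cmp_scores[cmp_k] = cmp_model.score(cmp_data_test, cmp_target_test)
    print(f"n_neighbors={cmp_k}: test accuracy {cmp_scores[cmp_k]:.3f}")
```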