# First model with scikit-learn

## Basic preprocessing and model fitting

In this notebook, we present how to build predictive models on tabular
datasets, with only numerical features.

In particular we will highlight:
* the scikit-learn API: `.fit`/`.predict`/`.score`
* how to evaluate the performance of a model with a train-test split

## Loading the dataset

We will use the same dataset "adult_census" described in the previous notebook.
For more details about the dataset see <http://www.openml.org/d/1590>.

Numerical data is the most natural type of data used in machine
learning and can (almost) directly be fed into predictive models. We
will load a subset of the original data with only the numerical
columns.

In [None]:
import pandas as pd

df = pd.read_csv("../datasets/adult-census-numeric.csv")

Let's have a look at the first records of this data frame:

In [None]:
df.head()

In [None]:
target_name = "class"
target = df[target_name]
target

We now separate out the data that we will use to predict from the
prediction target:

In [None]:
data = df.drop(columns=[target_name])
data.head()

We will use this data to fit a classification model to predict
the income class.

In [None]:
data.columns

In [None]:
print(
    f"The dataset contains {data.shape[0]} samples and "
    f"{data.shape[1]} features")

We will build a classification model using the "K-nearest neighbors"
strategy. The `fit` method is called to train the model from the input
(features) and target data.

In [None]:
from sklearn.neighbors import KNeighborsClassifier

model = KNeighborsClassifier()
model.fit(data, target)

Let's use our model to make some predictions on the same data that was
used for training, and look at the first five of them:

In [None]:
target_predicted = model.predict(data)
target_predicted[:5]

We can compare these predictions to the actual targets:

In [None]:
target[:5]

To get a better assessment, we can compute the average success rate, that
is, the fraction of samples for which the prediction matches the target:

In [None]:
(target == target_predicted).mean()

But can this evaluation be trusted, or is it too good to be true?

When building a machine learning model, it is important to evaluate the
trained model on data that was not used to fit it, because
generalization is more than memorization. Making correct predictions on
instances never seen before is harder than on those already seen during
training.
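
The gap between memorization and generalization can be illustrated with a
minimal sketch. It assumes purely random, hypothetical data (not the census
dataset) and a nearest neighbor model with `n_neighbors=1`, a setting chosen
here only because it memorizes the training samples exactly:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

# Hypothetical data: random features and random labels, so there is
# nothing to learn beyond memorizing the training samples.
X_seen = rng.normal(size=(500, 3))
y_seen = rng.integers(0, 2, size=500)
X_unseen = rng.normal(size=(500, 3))
y_unseen = rng.integers(0, 2, size=500)

# With a single neighbor, each training sample is its own nearest neighbor.
memorizer = KNeighborsClassifier(n_neighbors=1).fit(X_seen, y_seen)

seen_accuracy = memorizer.score(X_seen, y_seen)
unseen_accuracy = memorizer.score(X_unseen, y_unseen)
print(f"Accuracy on seen data:   {seen_accuracy:.3f}")
print(f"Accuracy on unseen data: {unseen_accuracy:.3f}")
```

The accuracy on the seen data is perfect by construction, while the accuracy
on the unseen data should stay close to chance level because the labels are
random.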

Correct evaluation is easily done by leaving out a subset of the data
when training the model and using it after for model evaluation. The
data used to fit a model is called training data while the one used to
assess a model is called testing data.

We can load more data, which was actually left out of the original
dataset:

In [None]:
df_test = pd.read_csv("../datasets/adult-census-numeric-test.csv")

From this new data, we separate out the input features and the target to
predict:

In [None]:
target_test = df_test[target_name]
data_test = df_test.drop(columns=[target_name])

In [None]:
print(
    f"The testing dataset contains {data_test.shape[0]} samples and "
    f"{data_test.shape[1]} features")

Note that scikit-learn provides a helper function `train_test_split`
which can be used to split the dataset into a training and a testing
set. It will also ensure that the data are shuffled randomly before
splitting the data.
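
As a minimal sketch of how this helper can be called, the example below uses
a small synthetic dataset generated with `make_classification` as a stand-in
for the census data. The sample count, the `test_size` value and the
`random_state` values are assumptions made for illustration only:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the census data (an assumption for illustration).
X, y = make_classification(n_samples=1000, n_features=4, random_state=0)

# Shuffle the samples, then hold out 25% of them for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

print(f"Training set: {X_train.shape[0]} samples")
print(f"Testing set:  {X_test.shape[0]} samples")
```

Fixing `random_state` makes the shuffle reproducible, so the same samples end
up in the testing set on every run.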

To quantitatively evaluate our model, we can use the method `score`. It will
compute the classification accuracy when dealing with a classification
problem.

In [None]:
print(f"The test accuracy using a {model.__class__.__name__} is "
      f"{model.score(data_test, target_test):.3f}")

We can now compute the model predictions on the test set:

In [None]:
target_test_predicted = model.predict(data_test)

And compute the average accuracy on the test set:

In [None]:
(target_test == target_test_predicted).mean()
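
For a classifier, the `score` method and the manual computation above are
expected to give the same number. The sketch below checks this on a synthetic
dataset (an assumption, used so that the snippet runs on its own) and also
compares both with `sklearn.metrics.accuracy_score`:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the census data (an assumption for illustration).
X, y = make_classification(n_samples=500, n_features=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

knn = KNeighborsClassifier().fit(X_tr, y_tr)
y_pred = knn.predict(X_te)

# Three ways of computing the same classification accuracy.
score_value = knn.score(X_te, y_te)
manual_value = np.mean(y_pred == y_te)
metric_value = accuracy_score(y_te, y_pred)

print(score_value, manual_value, metric_value)
```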

If we compare this test accuracy with the accuracy obtained by wrongly
evaluating the model on the training set, we find that the earlier
evaluation was indeed optimistic.

In this notebook we have:
* fitted a **nearest neighbor** model on a training dataset
* evaluated its performance on the testing data
* presented the scikit-learn API `.fit` (to train a model), `.predict` (to
  make predictions) and `.score` (to evaluate a model)