# Module 1: First model with scikit-learn

In [1]:
import pandas as pd

adult_census = pd.read_csv("./csv_result-numeric-train set.csv")

In [2]:
adult_census.head()

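| Unnamed: 0 | id | age | capital-gain | capital-loss | hours-per-week | class |
|---|---|---|---|---|---|---|
| 0 | 1 | 25 | 0 | 0 | 40 | <=50K |
| 1 | 2 | 38 | 0 | 0 | 50 | <=50K |
| 2 | 3 | 28 | 0 | 0 | 40 | >50K |
| 3 | 4 | 44 | 7688 | 0 | 40 | >50K |
| 4 | 5 | 18 | 0 | 0 | 30 | <=50K |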


## Trim dataset

In [3]:
target_name = "class"
target = adult_census[target_name]
target

0        <=50K
1        <=50K
2         >50K
3         >50K
4        <=50K
         ...  
48837    <=50K
48838     >50K
48839    <=50K
48840    <=50K
48841     >50K
Name: class, Length: 48842, dtype: object

In [4]:
data = adult_census.drop(columns=[target_name])
data.head()

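| Unnamed: 0 | id | age | capital-gain | capital-loss | hours-per-week |
|---|---|---|---|---|---|
| 0 | 1 | 25 | 0 | 0 | 40 |
| 1 | 2 | 38 | 0 | 0 | 50 |
| 2 | 3 | 28 | 0 | 0 | 40 |
| 3 | 4 | 44 | 7688 | 0 | 40 |
| 4 | 5 | 18 | 0 | 0 | 30 |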


# Model

## Learning: .fit()

In scikit-learn, an object that has a `fit` method is called an **estimator**. The `fit` method combines two elements: (i) a learning algorithm and (ii) some model states. The learning algorithm takes the training data and training target as input and sets the model states. These model states are later used either to predict (for classifiers and regressors) or to transform data (for transformers).

![Predictor fit diagram](./api_diagram-predictor.fit.svg)

Generally:
- data   = `X`
- target = `y`

In [6]:
from sklearn.neighbors import KNeighborsClassifier

model = KNeighborsClassifier()
_ = model.fit(data, target)
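
To make the idea of model states more concrete, here is a minimal sketch on a tiny made-up dataset. The names `X_toy`, `y_toy`, and `toy_model` and the values are illustrative assumptions, not part of the census data. It shows that `fit` returns the estimator itself and that the learned states are stored in attributes ending with an underscore.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical toy dataset: 6 samples, 2 numerical features.
X_toy = np.array([[25, 40], [38, 50], [28, 40], [44, 40], [18, 30], [52, 45]])
y_toy = np.array(["<=50K", "<=50K", ">50K", ">50K", "<=50K", ">50K"])

toy_model = KNeighborsClassifier(n_neighbors=3)
fitted = toy_model.fit(X_toy, y_toy)  # fit returns the estimator itself

# Model states set by fit (trailing underscore by scikit-learn convention).
print(fitted is toy_model)       # same object
print(toy_model.classes_)        # class labels seen during fit
print(toy_model.n_features_in_)  # number of features seen during fit
```

Because `fit` returns the estimator, the notebook assigns the result to `_` only to keep the cell from displaying it.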

## Predict: .predict()

An estimator (an object with a `fit` method) with a `predict` method is called
a **predictor**.

![Predictor predict diagram](./api_diagram-predictor.predict.svg)

To predict, a model uses a **prediction function** that uses the input data
together with the model states. As for the learning algorithm and the model
states, the prediction function is specific for each type of model.
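
For a k-nearest neighbors classifier, the prediction function can be sketched by hand: find the `k` training points closest to the query and take a majority vote over their labels. The helper `knn_predict` and the toy data below are hypothetical, written only for illustration. This is a simplified version of the idea, not scikit-learn's actual implementation, and it ignores tie-breaking.

```python
from collections import Counter

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical 1-D toy data with two well-separated groups.
X_train = np.array([[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]])
y_train = np.array(["a", "a", "a", "b", "b", "b"])
X_new = np.array([[1.4], [10.6]])
k = 3


def knn_predict(X_train, y_train, X_new, k):
    """Majority vote among the k training points closest to each query."""
    predictions = []
    for x in X_new:
        distances = np.linalg.norm(X_train - x, axis=1)  # Euclidean distance
        nearest = np.argsort(distances)[:k]              # indices of k closest
        votes = Counter(y_train[nearest].tolist())
        predictions.append(votes.most_common(1)[0][0])
    return np.array(predictions)


manual = knn_predict(X_train, y_train, X_new, k)
reference = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train).predict(X_new)
print(manual, reference)
```

Here the model states are simply the stored training data and target, and the prediction function is the distance computation followed by the vote.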

The `predict` method returns one prediction per row of the input data. Here are the first 5 predictions:

In [7]:
target_predicted = model.predict(data)
target_predicted[:5]

array(['<=50K', '<=50K', '<=50K', '>50K', '<=50K'], dtype=object)

The actual targets for the same rows:

In [8]:
target[:5]

0    <=50K
1    <=50K
2     >50K
3     >50K
4    <=50K
Name: class, dtype: object

Compare the predictions with the actual targets:

In [10]:
target[:5] == target_predicted[:5]

0     True
1     True
2    False
3     True
4     True
Name: class, dtype: bool

Average success rate:

In [11]:
(target == target_predicted).mean()

0.8433725072683347

The success rate is high, but `fit` and `predict` used the same data, so this number does not tell us how the model performs on data it has not seen.
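
The following sketch illustrates why a score computed on the training data can be misleading. It uses synthetic data with random labels, so there is nothing real to learn, and a single neighbor (`n_neighbors=1`) as an extreme case. Both choices are assumptions made for illustration and differ from the default settings used in this notebook.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

# Synthetic data: the labels are random, so there is nothing real to learn.
X_train = rng.normal(size=(200, 2))
y_train = rng.integers(0, 2, size=200)
X_test = rng.normal(size=(200, 2))
y_test = rng.integers(0, 2, size=200)

# With one neighbor, each training point is its own nearest neighbor.
memorizer = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)

train_accuracy = (memorizer.predict(X_train) == y_train).mean()
test_accuracy = (memorizer.predict(X_test) == y_test).mean()
print(f"train: {train_accuracy:.3f}, test: {test_accuracy:.3f}")
```

The model reproduces every training label because it has memorized the training points, while its accuracy on the fresh points is lower.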

## Train-test data split

Correct evaluation is easily done by leaving out a subset of the data when training the model and using it afterwards for model evaluation.

The data used to fit a model is called **training data** while the data used to assess a model is called **testing data**.
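
This notebook loads the testing data from a separate file. When only a single dataset is available, one common alternative (not used in this notebook) is scikit-learn's `train_test_split`. A minimal sketch on a made-up dataset, where the array names and sizes are illustrative assumptions:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical dataset: 100 samples, 3 features, binary target.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = rng.integers(0, 2, size=100)

# Hold out 25% of the rows for testing; random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
print(X_train.shape, X_test.shape)
```

The model would then be fitted on `X_train`, `y_train` only and evaluated on `X_test`, `y_test`.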

In [13]:
adult_census_test = pd.read_csv("./csv_result-numeric - test set.csv")

In [15]:
target_test = adult_census_test[target_name]
data_test = adult_census_test.drop(columns=[target_name])

In [17]:
accuracy = model.score(data_test, target_test)
model_name = model.__class__.__name__

print(f"The test accuracy using a {model_name} is {accuracy:.3f}")

The test accuracy using a KNeighborsClassifier is 0.839


We use the generic term **model** for objects whose goodness of fit can be
measured using the `score` method:

![Predictor score diagram](./api_diagram-predictor.score.svg)

To compute the score, the predictor first computes the predictions (using the
`predict` method) and then uses a scoring function to compare the true target
`y` and the predictions. Finally, the score is returned.
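
For a classifier such as `KNeighborsClassifier`, the scoring function used by `score` is the accuracy. The sketch below checks this on synthetic data (the arrays and the name `clf` are illustrative assumptions) by comparing `score` with an explicit `predict` followed by `accuracy_score`, and with the mean of the element-wise comparison used earlier.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data: the class depends on the sign of the first feature.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 2))
y_train = (X_train[:, 0] > 0).astype(int)
X_test = rng.normal(size=(50, 2))
y_test = (X_test[:, 0] > 0).astype(int)

clf = KNeighborsClassifier().fit(X_train, y_train)

score = clf.score(X_test, y_test)         # predict + scoring in one call
y_pred = clf.predict(X_test)              # step 1: compute the predictions
by_hand = accuracy_score(y_test, y_pred)  # step 2: compare with the true target
as_mean = (y_pred == y_test).mean()       # same quantity as a plain mean
print(score, by_hand, as_mean)
```

All three values agree, which is why `(target == target_predicted).mean()` and `model.score(...)` can be read the same way.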