# First model with scikit-learn
This notebook shows how to build a predictive model on a tabular dataset using only its numerical features.

In particular, we will highlight:

• the scikit-learn API: .fit(X, y), .predict(X) and .score(X, y);
• how to evaluate the generalization performance of a model with a train-test split.
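
As a preview of the three methods listed above, here is a minimal sketch on a hypothetical toy data set (the values and the "low"/"high" labels are invented for illustration and are not taken from the census data):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical toy data: 6 samples, 2 numerical features, 2 well-separated classes.
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
y = np.array(["low", "low", "low", "high", "high", "high"])

model = KNeighborsClassifier(n_neighbors=3)
model.fit(X, y)                   # train on features X and target y
predictions = model.predict(X)    # one predicted label per row of X
accuracy = model.score(X, y)      # fraction of rows predicted correctly

print(predictions, accuracy)
```

Every scikit-learn classifier follows this same fit/predict/score pattern, which is why the rest of the notebook can focus on a single model.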

## Loading the adult census data set with pandas (same data as the previous notebook)

In [4]:
import pandas as pd
import numpy as np

adult_census = pd.read_csv("/Users/russconte/Adult_Census.csv")

adult_census_numbers = adult_census.select_dtypes(include=np.number)

Let's have a look at the first five rows of data:

In [5]:
adult_census_numbers.head(5)

Unnamed: 0,Age,fnlwgt,Education-num,Capital-gain,Capital-loss,Hours-per-week
0,25,226802,7,0,0,40
1,38,89814,9,0,0,50
2,28,336951,12,0,0,40
3,44,160323,10,7688,0,40
4,18,103497,10,0,0,30
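
To see what select_dtypes(include=np.number) does in the loading cell above, here is a small sketch on a hypothetical two-row frame (the "Workclass" column and all values are illustrative assumptions): columns holding text are dropped and only the numerical ones are kept.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame mixing numerical and text columns.
toy = pd.DataFrame({
    "Age": [25, 38],
    "Workclass": ["Private", "Private"],
    "Hours-per-week": [40, 50],
    "Class": ["<=50K", "<=50K"],
})

# Keep only the columns whose dtype is numerical.
numeric_only = toy.select_dtypes(include=np.number)
numeric_columns = list(numeric_only.columns)
print(numeric_columns)
```

Note that the text-valued target column is dropped as well, which is why the notebook reads the target from the full frame rather than from the numerical one.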


## Separate the data and the target

In [8]:
# The column is named "Class" (capital C); select it through target_name
target_name = "Class"
target = adult_census[target_name]
target

0         <=50K
1         <=50K
2          >50K
3          >50K
4         <=50K
          ...  
48837     <=50K
48838      >50K
48839     <=50K
48840     <=50K
48841      >50K
Name: Class, Length: 48842, dtype: object

In [9]:
data = adult_census_numbers

## Fit our first model (K-nearest neighbors) and make predictions

The fit method trains the model on the input data (the features) and the target.

In [18]:
from sklearn.neighbors import KNeighborsClassifier

model = KNeighborsClassifier()
_ = model.fit(data, target)
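
For intuition about what this model does at prediction time, here is a minimal hand-written sketch, assuming Euclidean distance and an unweighted majority vote among the 5 nearest training samples (these match the KNeighborsClassifier defaults). The helper knn_predict_one and the toy data are hypothetical and only serve as an illustration; this is not how scikit-learn implements the search internally.

```python
from collections import Counter

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical one-feature training set with two classes.
X_train = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0], [13.0]])
y_train = np.array(["a", "a", "a", "b", "b", "b", "b"])


def knn_predict_one(x, X, y, k=5):
    """Predict the label of x by majority vote among its k nearest rows of X."""
    distances = np.sqrt(((X - x) ** 2).sum(axis=1))  # Euclidean distances
    nearest = np.argsort(distances)[:k]              # indices of the k closest
    return Counter(y[nearest]).most_common(1)[0][0]  # most frequent label


query = np.array([4.0])
manual = knn_predict_one(query, X_train, y_train)
sk = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train).predict([query])[0]
print(manual, sk)
```

For a k-nearest neighbors model, fit mostly stores the training data; the real work happens in predict, when distances to the stored samples are computed.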

Let's use our model to make predictions on the same data it was trained on:

In [19]:
target_predicted = model.predict(data)

In [24]:
target_predicted[0:5]

array([' <=50K', ' <=50K', ' <=50K', ' >50K', ' <=50K'], dtype=object)

We can check if these agree with the actual values:

In [25]:
target[:5] == target_predicted[:5]

0    True
1    True
2    True
3    True
4    True
Name: Class, dtype: bool

In [26]:
# Use len(target) rather than a hard-coded row count
print(f"The number of correct predictions is: "
      f"{(target == target_predicted).sum()} / {len(target)}")

The number of correct predictions is: 41003 / 48842


Let's compute the average success rate:

In [17]:
(target == target_predicted).mean()

0.83950288685967
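
The mean of a boolean comparison is the fraction of True values, which is exactly the accuracy. A short sketch with hypothetical labels, checked against scikit-learn's accuracy_score:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical true and predicted labels: 4 of the 5 predictions are correct.
y_true = np.array(["<=50K", "<=50K", ">50K", ">50K", "<=50K"])
y_pred = np.array(["<=50K", ">50K", ">50K", ">50K", "<=50K"])

manual_accuracy = (y_true == y_pred).mean()         # True counts as 1, False as 0
library_accuracy = accuracy_score(y_true, y_pred)   # same quantity via scikit-learn

print(manual_accuracy, library_accuracy)
```

For classifiers, model.score(X, y) reports this same mean accuracy.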

The problem here is that we evaluated the model on the very data it was trained on, so this score says little about how well the model generalizes to new samples. Let's look at a better way to do this!
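
To illustrate why scoring on the training data is misleading, here is a sketch under deliberately extreme assumptions: synthetic features with labels that are pure noise, and a 1-nearest-neighbor model. Each training sample is its own nearest neighbor, so the model reproduces the training labels perfectly even though there is nothing to learn.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data (an assumption for illustration): the labels are random noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = rng.integers(0, 2, size=200)

X_train, y_train = X[:150], y[:150]
X_test, y_test = X[150:], y[150:]

# With k=1, every training point is its own nearest neighbor.
memorizer = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)
train_accuracy = memorizer.score(X_train, y_train)  # memorized, not learned
test_accuracy = memorizer.score(X_test, y_test)     # unseen rows

print(f"train: {train_accuracy:.3f}, test: {test_accuracy:.3f}")
```

Only the score on rows the model has never seen reflects its ability to generalize.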

## The train-test data split

In [28]:
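# Positional slices exclude their end bound: the training rows are 0..39071
# and the test rows are 39073..48840, so rows 39072 and 48841 are in neither.
adult_census_numbers_train = adult_census_numbers.iloc[0:39072]
adult_census_numbers_test = adult_census_numbers.iloc[39073:48841]

# Only the last expression in a cell is displayed, here the training rows.
adult_census_numbers_test.head()
adult_census_numbers_train.head()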

Unnamed: 0,Age,fnlwgt,Education-num,Capital-gain,Capital-loss,Hours-per-week
0,25,226802,7,0,0,40
1,38,89814,9,0,0,50
2,28,336951,12,0,0,40
3,44,160323,10,7688,0,40
4,18,103497,10,0,0,30
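
Because iloc slices exclude their end bound, reusing a single split position for both slices guarantees that every row lands in exactly one subset. A minimal sketch on a hypothetical 10-row frame (split_at is an illustrative name, not something from the notebook):

```python
import pandas as pd

# Hypothetical frame with rows 0..9.
frame = pd.DataFrame({"value": range(10)})

split_at = 8                          # first row of the test part
train_part = frame.iloc[:split_at]    # rows 0..7 (end bound excluded)
test_part = frame.iloc[split_at:]     # rows 8..9

# The two parts cover the frame with no gap and no overlap.
print(len(train_part), len(test_part), len(frame))
```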


In [41]:
target_test = target[39073:48841]

In [42]:
target_test.info()

<class 'pandas.core.series.Series'>
RangeIndex: 9768 entries, 39073 to 48840
Series name: Class
Non-Null Count  Dtype 
--------------  ----- 
9768 non-null   object
dtypes: object(1)
memory usage: 76.4+ KB


In [43]:
data_test = adult_census_numbers_test

In [44]:
data_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9768 entries, 39073 to 48840
Data columns (total 6 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   Age             9768 non-null   int64
 1   fnlwgt          9768 non-null   int64
 2   Education-num   9768 non-null   int64
 3   Capital-gain    9768 non-null   int64
 4   Capital-loss    9768 non-null   int64
 5   Hours-per-week  9768 non-null   int64
dtypes: int64(6)
memory usage: 458.0 KB


In [45]:
accuracy = model.score(data_test, target_test)
model_name = model.__class__.__name__

print(f"The test accuracy using a {model_name} is "
      f"{accuracy:.3f}")

The test accuracy using a KNeighborsClassifier is 0.840
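
For this score to measure generalization, the model has to be fitted on the training rows only, whereas the model scored above was fitted in In [18] on the complete data set, test rows included. The sketch below shows the full workflow with scikit-learn's train_test_split, which also shuffles the rows before splitting. It runs on synthetic data from make_classification, so the sample count, test_size and random_state are illustrative assumptions, and its accuracy says nothing about the census data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for a numerical table: 500 rows, 6 features.
X, y = make_classification(n_samples=500, n_features=6, random_state=42)

# Hold out 20% of the rows; the model never sees them during fit.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = KNeighborsClassifier()
model.fit(X_train, y_train)                  # train on the training rows only
test_accuracy = model.score(X_test, y_test)  # evaluate on unseen rows

print(f"The test accuracy using a {model.__class__.__name__} is "
      f"{test_accuracy:.3f}")
```

With the census data, the same pattern would apply by passing data and target to train_test_split in place of X and y.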


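## Notebook recap

In this notebook, we:

• Fitted a k-nearest neighbors model on the numerical features of the adult census data set
• Evaluated its accuracy on a subset of rows set aside as test data; because the model had been fitted on the complete data set, a true estimate of generalization performance requires refitting it on the training rows only
• Introduced the scikit-learn API: .fit(X, y) (to train a model), .predict(X) (to make predictions) and .score(X, y) (to evaluate a model).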