## Working with numerical data
In the previous notebook, we trained a k-nearest neighbors model on some data.

However, we oversimplified the procedure by loading a dataset that contained exclusively numerical data. In addition, the datasets we used were already split into train and test sets.

In this notebook, we aim to:

* identify numerical data in a heterogeneous dataset;
* select the subset of columns corresponding to numerical data;
* use a scikit-learn helper to separate data into train and test sets;
* train and evaluate a more complex scikit-learn model.

We start by loading the entire adult census dataset used during the data exploration:

In [2]:
import pandas as pd
import numpy as np

adult_census = pd.read_csv("/Users/russconte/Adult_Census.csv")

Next, we separate the data from the target column, Class, in a single statement:

In [5]:
data, target = adult_census.drop(columns="Class"), adult_census["Class"]

In [6]:
data.head()

| | Age | Workclass | fnlwgt | Education | Education-num | Marital-Status | Occupation | Relationship | Race | Sex | Capital-gain | Capital-loss | Hours-per-week | Native-Country |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 25 | Private | 226802 | 11th | 7 | Never-married | Machine-op-inspct | Own-child | Black | Male | 0 | 0 | 40 | United-States |
| 1 | 38 | Private | 89814 | HS-grad | 9 | Married-civ-spouse | Farming-fishing | Husband | White | Male | 0 | 0 | 50 | United-States |
| 2 | 28 | Local-gov | 336951 | Assoc-acdm | 12 | Married-civ-spouse | Protective-serv | Husband | White | Male | 0 | 0 | 40 | United-States |
| 3 | 44 | Private | 160323 | Some-college | 10 | Married-civ-spouse | Machine-op-inspct | Husband | Black | Male | 7688 | 0 | 40 | United-States |
| 4 | 18 | ? | 103497 | Some-college | 10 | Never-married | ? | Own-child | White | Female | 0 | 0 | 30 | United-States |


In [7]:
target.head()

0     <=50K
1     <=50K
2      >50K
3      >50K
4     <=50K
Name: Class, dtype: object
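
The separation of data and target can be tried on a small table first. The sketch below is a minimal illustration only: the `toy` frame and the names `toy_data` and `toy_target` are hypothetical and merely mimic three of the census columns.

```python
import pandas as pd

# Hypothetical toy table standing in for the census data.
toy = pd.DataFrame({
    "Age": [25, 38, 28],
    "Workclass": ["Private", "Private", "Local-gov"],
    "Class": ["<=50K", "<=50K", ">50K"],
})

# drop returns a new DataFrame without the target column;
# the original frame `toy` is left unchanged.
toy_data = toy.drop(columns="Class")
toy_target = toy["Class"]

print(list(toy_data.columns))
print(toy_target.tolist())
print("Class" in toy.columns)
```

Because `drop` does not modify the original frame, the target column is still available in `toy` after the call.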

Next, we select the numerical columns. The target column Class holds strings, so it is excluded automatically:

In [8]:
adult_census_numbers = adult_census.select_dtypes(include=np.number)
adult_census_numbers

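| | Age | fnlwgt | Education-num | Capital-gain | Capital-loss | Hours-per-week |
|---|---|---|---|---|---|---|
| 0 | 25 | 226802 | 7 | 0 | 0 | 40 |
| 1 | 38 | 89814 | 9 | 0 | 0 | 50 |
| 2 | 28 | 336951 | 12 | 0 | 0 | 40 |
| 3 | 44 | 160323 | 10 | 7688 | 0 | 40 |
| 4 | 18 | 103497 | 10 | 0 | 0 | 30 |
| ... | ... | ... | ... | ... | ... | ... |
| 48837 | 27 | 257302 | 12 | 0 | 0 | 38 |
| 48838 | 40 | 154374 | 9 | 0 | 0 | 40 |
| 48839 | 58 | 151910 | 9 | 0 | 0 | 40 |
| 48840 | 22 | 201490 | 9 | 0 | 0 | 20 |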
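
To see how the dtype-based selection behaves, here is a minimal sketch on a hypothetical three-column frame (the `toy` data and the variable names are assumptions made for illustration, not part of the census file).

```python
import numpy as np
import pandas as pd

# Hypothetical heterogeneous table: two numerical columns, one text column.
toy = pd.DataFrame({
    "Age": [25, 38, 28],
    "Workclass": ["Private", "Private", "Local-gov"],
    "Hours-per-week": [40.0, 50.0, 40.0],
})

# Inspect the dtype pandas inferred for each column.
print(toy.dtypes)

# Keep only the columns whose dtype is a NumPy number (integers and floats).
toy_numbers = toy.select_dtypes(include=np.number)
numerical_columns = toy_numbers.columns.tolist()
print(numerical_columns)  # ['Age', 'Hours-per-week']
```

Note that the dtype only reflects how values are stored: a category encoded as integers would also be selected, so it is worth checking that each selected column is truly numerical.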


In [9]:
adult_census_numbers.describe()

| | Age | fnlwgt | Education-num | Capital-gain | Capital-loss | Hours-per-week |
|---|---|---|---|---|---|---|
| count | 48842.0 | 48842.0 | 48842.0 | 48842.0 | 48842.0 | 48842.0 |
| mean | 38.643585 | 189664.1 | 10.078089 | 1079.067626 | 87.502314 | 40.422382 |
| std | 13.71051 | 105604.0 | 2.570973 | 7452.019058 | 403.004552 | 12.391444 |
| min | 17.0 | 12285.0 | 1.0 | 0.0 | 0.0 | 1.0 |
| 25% | 28.0 | 117550.5 | 9.0 | 0.0 | 0.0 | 40.0 |
| 50% | 37.0 | 178144.5 | 10.0 | 0.0 | 0.0 | 40.0 |
| 75% | 48.0 | 237642.0 | 12.0 | 0.0 | 0.0 | 45.0 |
| max | 90.0 | 1490400.0 | 16.0 | 99999.0 | 4356.0 | 99.0 |

We then use the scikit-learn train_test_split helper to hold out 25% of the samples as a test set. Setting random_state makes the split reproducible:


In [18]:
from sklearn.model_selection import train_test_split

data_train, data_test, target_train, target_test = train_test_split(
    adult_census_numbers, target, random_state=42, test_size=0.25)
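
The effect of test_size and random_state can be checked on a small array. This is a sketch on hypothetical data (`X`, `y`, and the other names are made up for the example), using the same parameters as the cell above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 20 samples with 2 features each, and a binary target.
X = np.arange(40).reshape(20, 2)
y = np.array([0, 1] * 10)

# Hold out 25% of the samples for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=42, test_size=0.25)
print(X_train.shape, X_test.shape)  # (15, 2) (5, 2)

# Splitting again with the same random_state gives the same partition.
X_train_again, X_test_again, _, _ = train_test_split(
    X, y, random_state=42, test_size=0.25)
print(np.array_equal(X_test, X_test_again))  # True
```

By default the samples are shuffled before splitting, which is why fixing random_state matters when results need to be compared across runs.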

We will fit a logistic regression model to this dataset:

In [19]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()

In [20]:
model.fit(data_train, target_train)

In [21]:
accuracy = model.score(data_test, target_test)
print(f"Accuracy of logistic regression: {accuracy:.3f}")

Accuracy of logistic regression: 0.804


### Notebook recap

In scikit-learn, the score method of a classification model returns the accuracy, i.e. the fraction of correctly classified samples. Here, the logistic regression predicts the correct income class for about 8 out of 10 people in the test set. The real question is whether this generalization performance indicates a good predictive model. Find out by solving the next exercise!
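
To make the meaning of score concrete, the sketch below recomputes the accuracy by hand on synthetic data and compares it with the value returned by score. The data, the split, and all variable names are assumptions made for illustration; they are unrelated to the census results above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Synthetic two-feature problem (hypothetical, for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Fit on the first 150 samples, evaluate on the remaining 50.
X_fit, y_fit = X[:150], y[:150]
X_eval, y_eval = X[150:], y[150:]
toy_model = LogisticRegression().fit(X_fit, y_fit)

# Accuracy computed by hand: fraction of predictions equal to the true label.
predictions = toy_model.predict(X_eval)
manual_accuracy = np.mean(predictions == y_eval)

# The same quantity via the score method and via accuracy_score.
score_accuracy = toy_model.score(X_eval, y_eval)
metric_accuracy = accuracy_score(y_eval, predictions)
print(np.isclose(manual_accuracy, score_accuracy))
print(np.isclose(manual_accuracy, metric_accuracy))
```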

In this notebook, we learned to:

* identify numerical data in a heterogeneous dataset;
* select the subset of columns corresponding to numerical data;
* use the scikit-learn train_test_split function to separate data into a train and a test set;
* train and evaluate a logistic regression model.