# Working with numerical data

## Load whole data and trim

In [1]:
import pandas as pd

adult_census = pd.read_csv("./csv_result-phpMawTba.csv")
# drop the duplicated column `"education-num"` as stated in the first notebook
adult_census = adult_census.drop(columns="education-num")
adult_census.head()

| | id | age | workclass | fnlwgt | education | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 25 | Private | 226802 | 11th | Never-married | Machine-op-inspct | Own-child | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 1 | 2 | 38 | Private | 89814 | HS-grad | Married-civ-spouse | Farming-fishing | Husband | White | Male | 0 | 0 | 50 | United-States | <=50K |
| 2 | 3 | 28 | Local-gov | 336951 | Assoc-acdm | Married-civ-spouse | Protective-serv | Husband | White | Male | 0 | 0 | 40 | United-States | >50K |
| 3 | 4 | 44 | Private | 160323 | Some-college | Married-civ-spouse | Machine-op-inspct | Husband | Black | Male | 7688 | 0 | 40 | United-States | >50K |
| 4 | 5 | 18 | ? | 103497 | Some-college | Never-married | ? | Own-child | White | Female | 0 | 0 | 30 | United-States | <=50K |


In [3]:
data, target = adult_census.drop(columns="class"), adult_census["class"]
data.head()
target

0        <=50K
1        <=50K
2         >50K
3         >50K
4        <=50K
         ...  
48837    <=50K
48838     >50K
48839    <=50K
48840    <=50K
48841     >50K
Name: class, Length: 48842, dtype: object

## Identify numerical data

See individual types:

In [4]:
data.dtypes

id                 int64
age                int64
workclass         object
fnlwgt             int64
education         object
marital-status    object
occupation        object
relationship      object
race              object
sex               object
capital-gain       int64
capital-loss       int64
hours-per-week     int64
native-country    object
dtype: object

See unique types:

In [5]:
data.dtypes.unique()

array([dtype('int64'), dtype('O')], dtype=object)
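
The dtype inspection above can also drive the column selection. The following is a minimal sketch on a hypothetical toy frame (`toy` and its values are made up for illustration, not taken from the census file), using pandas' `select_dtypes` to list the columns with a numeric dtype:

```python
import pandas as pd

# Hypothetical toy frame mimicking a few columns of the census data.
toy = pd.DataFrame(
    {
        "id": [1, 2, 3],
        "age": [25, 38, 28],
        "workclass": ["Private", "Private", "Local-gov"],
        "hours-per-week": [40, 50, 40],
    }
)

# Keep every column whose dtype is numeric (int64 here); object columns are dropped.
numeric_candidates = toy.select_dtypes(include="number").columns.tolist()
print(numeric_candidates)  # ['id', 'age', 'hours-per-week']
```

Note that a dtype-based selection also returns `id`, which is stored as an integer but is only an identifier. The result is therefore a list of candidates that still needs a manual review before it is used as features.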

## Select only numerical columns

In [6]:
numerical_columns = ["age", "capital-gain", "capital-loss", "hours-per-week"]
data[numerical_columns].head()

| | age | capital-gain | capital-loss | hours-per-week |
|---|---|---|---|---|
| 0 | 25 | 0 | 0 | 40 |
| 1 | 38 | 0 | 0 | 50 |
| 2 | 28 | 0 | 0 | 40 |
| 3 | 44 | 7688 | 0 | 40 |
| 4 | 18 | 0 | 0 | 30 |


## Column details

In [7]:
data["age"].describe()

count    48842.000000
mean        38.643585
std         13.710510
min         17.000000
25%         28.000000
50%         37.000000
75%         48.000000
max         90.000000
Name: age, dtype: float64

Store the selected numerical columns in a new DataFrame:

In [8]:
data_numeric = data[numerical_columns]

## Split Data into Train/Test

In [9]:
from sklearn.model_selection import train_test_split

data_train, data_test, target_train, target_test = train_test_split(
    data_numeric, target, random_state=42, test_size=0.25
)

In scikit-learn, setting the `random_state` parameter makes the results **deterministic** whenever a random number generator is involved. In the case of `train_test_split`, the randomness comes from shuffling the data, which decides how the dataset is split into a train set and a test set.
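
The effect of `random_state` can be checked on a small example. This sketch uses a made-up array `X` (an assumption for illustration, not the census data) and calls `train_test_split` twice with the same seed to confirm that both calls return the same split:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 10 samples with 2 features each (illustrative only).
X = np.arange(20).reshape(10, 2)

# Two independent calls with the same seed.
train_a, test_a = train_test_split(X, random_state=42, test_size=0.25)
train_b, test_b = train_test_split(X, random_state=42, test_size=0.25)

# The shuffling is driven by the seed, so both splits are identical.
same_split = np.array_equal(train_a, train_b) and np.array_equal(test_a, test_b)
print(same_split)  # True
```

Without a fixed `random_state`, repeated calls would usually shuffle the rows differently, so scores computed on the test set could vary from run to run.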

In [10]:
print(
    f"Number of samples in testing: {data_test.shape[0]} => "
    f"{data_test.shape[0] / data_numeric.shape[0] * 100:.1f}% of the"
    " original set"
)

Number of samples in testing: 12211 => 25.0% of the original set


In [11]:
print(
    f"Number of samples in training: {data_train.shape[0]} => "
    f"{data_train.shape[0] / data_numeric.shape[0] * 100:.1f}% of the"
    " original set"
)

Number of samples in training: 36631 => 75.0% of the original set


## Logistic Regression

In [14]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()

In [15]:
model.fit(data_train, target_train)

In [16]:
accuracy = model.score(data_test, target_test)
print(f"Accuracy of logistic regression: {accuracy:.3f}")

Accuracy of logistic regression: 0.807


## Dummy estimators

When doing supervised learning, a simple sanity check consists of comparing one’s estimator against simple rules of thumb. `DummyClassifier` implements several such simple strategies for classification:

- `stratified` generates random predictions by respecting the training set class distribution.

- `most_frequent` always predicts the most frequent label in the training set.

- `prior` always predicts the class that maximizes the class prior (like `most_frequent`), and `predict_proba` returns the class prior.

- `uniform` generates predictions uniformly at random.

- `constant` always predicts a constant label that is provided by the user. A major motivation of this method is F1-scoring, when the positive class is in the minority.

Note that with all these strategies, the predict method completely ignores the input data!
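
To make the strategies above concrete, here is a minimal sketch on made-up imbalanced labels (`X`, `y`, and the 8-to-2 class ratio are assumptions chosen for illustration). It compares `most_frequent` with `constant` and shows that the features play no role in the predictions:

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Toy imbalanced labels: 8 samples of class 0, 2 samples of class 1.
y = np.array([0] * 8 + [1] * 2)
# Placeholder features: DummyClassifier ignores them when predicting.
X = np.zeros((10, 1))

# Always predict the majority class (0).
most_frequent = DummyClassifier(strategy="most_frequent").fit(X, y)
# Always predict the user-supplied label (1, the minority class).
constant = DummyClassifier(strategy="constant", constant=1).fit(X, y)

acc_most_frequent = most_frequent.score(X, y)
acc_constant = constant.score(X, y)
print(acc_most_frequent, acc_constant)  # 0.8 0.2
```

On imbalanced data, the `most_frequent` baseline already reaches an accuracy equal to the share of the majority class, so a real model is only useful if it clearly beats that number.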

In [None]:
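from sklearn.dummy import DummyClassifier
from sklearn.svm import SVC

# Reference model: a linear SVM. SVC fit time grows at least quadratically
# with the number of samples, so this can be slow on the full training set.
clf = SVC(kernel='linear', C=1).fit(data_train, target_train)
print(f"Accuracy of linear SVC: {clf.score(data_test, target_test):.3f}")

# Baseline: always predict the most frequent class seen during training.
# Fit on the training set (not the test set) so the comparison is fair.
dummy_clf = DummyClassifier(strategy='most_frequent', random_state=0)
dummy_clf.fit(data_train, target_train)
print(f"Accuracy of dummy classifier: {dummy_clf.score(data_test, target_test):.3f}")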

[sklearn dummy classifier](https://scikit-learn.org/stable/modules/generated/sklearn.dummy.DummyClassifier.html#sklearn.dummy.DummyClassifier)