# Numerical Data
### Aim:
- Identify numerical data in a heterogeneous dataset;
- select the subset of columns corresponding to numerical data;
- use a scikit-learn helper to separate data into train and test sets;
- train and evaluate a more complex scikit-learn model.

In [4]:
import pandas as pd
adult_census = pd.read_csv("adult_census.csv")
# Drop the duplicated column education-num
adult_census = adult_census.drop(columns="education-num")
adult_census.head()

Unnamed: 0,age,workclass,fnlwgt,education,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,class
0,25,Private,226802,11th,Never-married,Machine-op-inspct,Own-child,Black,Male,0,0,40,United-States,<=50K
1,38,Private,89814,HS-grad,Married-civ-spouse,Farming-fishing,Husband,White,Male,0,0,50,United-States,<=50K
2,28,Local-gov,336951,Assoc-acdm,Married-civ-spouse,Protective-serv,Husband,White,Male,0,0,40,United-States,>50K
3,44,Private,160323,Some-college,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,7688,0,40,United-States,>50K
4,18,?,103497,Some-college,Never-married,?,Own-child,White,Female,0,0,30,United-States,<=50K


In [5]:
# separate the target from the data
data, target = adult_census.drop(columns="class"), adult_census["class"]

In [7]:
data.head()

Unnamed: 0,age,workclass,fnlwgt,education,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country
0,25,Private,226802,11th,Never-married,Machine-op-inspct,Own-child,Black,Male,0,0,40,United-States
1,38,Private,89814,HS-grad,Married-civ-spouse,Farming-fishing,Husband,White,Male,0,0,50,United-States
2,28,Local-gov,336951,Assoc-acdm,Married-civ-spouse,Protective-serv,Husband,White,Male,0,0,40,United-States
3,44,Private,160323,Some-college,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,7688,0,40,United-States
4,18,?,103497,Some-college,Never-married,?,Own-child,White,Female,0,0,30,United-States


In [8]:
target.head()

0     <=50K
1     <=50K
2      >50K
3      >50K
4     <=50K
Name: class, dtype: object

# Identify numerical data
Predictive models were initially designed to work with numerical data. Numerical data are quantitative measurements expressed as numbers. They usually require little work before being fed into our models.

Note: Numerical data are represented with numbers, but numbers do not always represent numerical data. Categories may already be encoded as numbers, and you will need to identify such features.
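One rough way to spot categories hiding behind integers is to count the distinct values in each column. The sketch below uses a small hypothetical table (the `zip_code` and `education_level` columns are invented for illustration and are not part of the census data). A low count is only a hint, and domain knowledge is still needed to decide.

```python
import pandas as pd

# Hypothetical toy data: "zip_code" and "education_level" are stored as
# integers, but they are category labels rather than quantities.
toy = pd.DataFrame({
    "age": [25, 38, 28, 44, 18, 52],
    "zip_code": [75001, 75002, 75001, 75003, 75002, 75001],
    "education_level": [1, 2, 2, 3, 1, 3],
})

# Number of distinct values per column; few distinct values relative to
# the number of rows suggests the column may be categorical.
unique_counts = toy.nunique()
print(unique_counts)
```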

In [12]:
data.dtypes

age                int64
workclass         object
fnlwgt             int64
education         object
marital-status    object
occupation        object
relationship      object
race              object
sex               object
capital-gain       int64
capital-loss       int64
hours-per-week     int64
native-country    object
dtype: object

In [13]:
data.dtypes.unique()

array([dtype('int64'), dtype('O')], dtype=object)

In [14]:
data.head()

Unnamed: 0,age,workclass,fnlwgt,education,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country
0,25,Private,226802,11th,Never-married,Machine-op-inspct,Own-child,Black,Male,0,0,40,United-States
1,38,Private,89814,HS-grad,Married-civ-spouse,Farming-fishing,Husband,White,Male,0,0,50,United-States
2,28,Local-gov,336951,Assoc-acdm,Married-civ-spouse,Protective-serv,Husband,White,Male,0,0,40,United-States
3,44,Private,160323,Some-college,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,7688,0,40,United-States
4,18,?,103497,Some-college,Never-married,?,Own-child,White,Female,0,0,30,United-States


This shows that our data contain two data types: objects and integers. The object columns hold strings.
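The dtype check can also be used to list candidate numerical columns automatically. The sketch below applies pandas' `select_dtypes` to a small hypothetical table that mimics a few census columns. It only filters on storage type, so every integer column is returned, and the result still has to be reviewed by hand.

```python
import pandas as pd

# Hypothetical miniature version of the census table.
toy = pd.DataFrame({
    "age": [25, 38, 28],
    "workclass": ["Private", "Private", "Local-gov"],
    "hours-per-week": [40, 50, 40],
})

# Keep only the columns stored with a numeric dtype.
numeric_candidates = toy.select_dtypes(include="number").columns.tolist()
print(numeric_candidates)
```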

In [15]:
numerical_columns = ["age", "capital-gain", "capital-loss", "hours-per-week"]
data[numerical_columns].head()

Unnamed: 0,age,capital-gain,capital-loss,hours-per-week
0,25,0,0,40
1,38,0,0,50
2,28,0,0,40
3,44,7688,0,40
4,18,0,0,30


In [16]:
data["age"].describe()

count    48842.000000
mean        38.643585
std         13.710510
min         17.000000
25%         28.000000
50%         37.000000
75%         48.000000
max         90.000000
Name: age, dtype: float64

In [17]:
data["hours-per-week"].describe()

count    48842.000000
mean        40.422382
std         12.391444
min          1.000000
25%         40.000000
50%         40.000000
75%         45.000000
max         99.000000
Name: hours-per-week, dtype: float64

In [18]:
data_numeric = data[numerical_columns]

## Train-test split the dataset
Scikit-learn provides the helper function `sklearn.model_selection.train_test_split`, which automatically splits the dataset into two subsets.

In [19]:
from sklearn.model_selection import train_test_split

data_train, data_test, target_train, target_test = train_test_split(
    data_numeric, target, random_state=42, test_size=0.25)

In scikit-learn, setting the `random_state` parameter gives deterministic results whenever a random number generator is involved. In the case of `train_test_split`, the randomness comes from shuffling the data, which decides how the dataset is split into a train set and a test set.
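As a quick illustration of this determinism, the sketch below splits a small synthetic array twice with the same `random_state` and checks that both calls return the same test set. The arrays `X` and `y` are placeholders, not the census data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 8 samples with 2 features each.
X = np.arange(16).reshape(8, 2)
y = np.arange(8)

# Two independent calls with the same seed.
X_train_a, X_test_a, y_train_a, y_test_a = train_test_split(
    X, y, random_state=42, test_size=0.25)
X_train_b, X_test_b, y_train_b, y_test_b = train_test_split(
    X, y, random_state=42, test_size=0.25)

# The shuffling is seeded, so both splits are identical.
same_split = (np.array_equal(X_test_a, X_test_b)
              and np.array_equal(y_test_a, y_test_b))
print(same_split)
```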

In [20]:
print(f"Number of samples in testing: {data_test.shape[0]} => "
      f"{data_test.shape[0] / data_numeric.shape[0] * 100:.1f}% of the"
      f" original set")

Number of samples in testing: 12211 => 25.0% of the original set


In [21]:
print(f"Number of samples in training: {data_train.shape[0]} => "
      f"{data_train.shape[0] / data_numeric.shape[0] * 100:.1f}% of the"
      f" original set")

Number of samples in training: 36631 => 75.0% of the original set


We will now train a logistic regression, which is a linear model. In short, a linear model finds a set of weights to combine the features linearly and predict the target. For instance, the model can come up with a rule such as:

if 0.1 * age + 3.3 * hours-per-week - 15.1 > 0, predict high-income
otherwise predict low-income
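The rule above can be written as a small function. This is only a sketch: the weights are the illustrative numbers from the rule, not coefficients learned from the census data, and `predict_income` is a hypothetical helper name.

```python
def predict_income(age, hours_per_week):
    """Apply the illustrative linear rule to one person."""
    # Weighted sum of the features plus a constant offset.
    score = 0.1 * age + 3.3 * hours_per_week - 15.1
    return "high-income" if score > 0 else "low-income"


print(predict_income(25, 40))
print(predict_income(18, 1))
```

A fitted model works the same way, except that the weights and the offset are estimated from the training data.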

In [22]:
# to display nice model diagram
from sklearn import set_config
set_config(display='diagram')

In [23]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()

In [24]:
model.fit(data_train, target_train)

In [27]:
accuracy = model.score(data_test, target_test)
print(f"Accuracy of logistic regression: {accuracy:.2f}")

Accuracy of logistic regression: 0.81


Having trained the model and evaluated its performance, the natural question is: "Is this statistical performance good enough for a useful predictive model?"

To answer it, we compare the statistical performance of our classifier with baseline classifiers that ignore the input data and make constant predictions instead. In practice, one such baseline is a dummy classifier that always predicts the majority class.

We will use a `DummyClassifier` with a train-test split and evaluate its accuracy on the test set.

Take a look at the baseline models provided by scikit-learn here:
https://scikit-learn.org/stable/modules/model_evaluation.html#dummy-estimators

Pay special attention to the `strategy` section of the documentation.

In [28]:
from sklearn.model_selection import train_test_split

data_numeric_train, data_numeric_test, target_train, target_test = \
    train_test_split(data_numeric, target, random_state=0)

In [32]:
adult_census["class"].value_counts()

 <=50K    37155
 >50K     11687
Name: class, dtype: int64

In [33]:
(target == " <=50K").mean()

0.7607182343065395
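This fraction follows directly from the class counts shown above, and it is also the accuracy that a classifier always predicting the majority class would reach on the full dataset. A short check, using only the two counts printed by `value_counts()`:

```python
# Class counts taken from the value_counts() output.
n_low, n_high = 37155, 11687
n_total = n_low + n_high

# Share of samples in the majority (low-revenue) class.
majority_fraction = n_low / n_total
print(f"{n_total} samples, majority fraction = {majority_fraction:.4f}")
```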

In [39]:
from sklearn.dummy import DummyClassifier

strategies = ['constant', 'most_frequent']
constants = [' <=50K', ' >50K']

for s in strategies:
    if s == 'constant':
        # Always predict one fixed class, trying each class in turn.
        for c in constants:
            revenue_model = DummyClassifier(strategy=s, random_state=0, constant=c)
            revenue_model.fit(data_numeric_train, target_train)
            score = revenue_model.score(data_numeric_test, target_test)
            if c == ' >50K':
                print(f"Accuracy of a model predicting only high revenue: {score:.2f}")
            else:
                print(f"Accuracy of a model predicting only low revenue: {score:.2f}")
    else:
        # Always predict the most frequent class seen during training.
        revenue_model = DummyClassifier(strategy=s, random_state=0)
        revenue_model.fit(data_numeric_train, target_train)
        score = revenue_model.score(data_numeric_test, target_test)
        print(f"Accuracy of a model predicting the most frequent class: {score:.2f}")

Accuracy of a model predicting only low revenue: 0.76
Accuracy of a model predicting only high revenue: 0.24
Accuracy of a model predicting the most frequent class: 0.76


The dummy models above include two that always predict a single class, `<=50K` or `>50K`. The model that always predicts low revenue reaches an accuracy of 0.76, well above 0.5, simply because about 76% of the samples belong to the low-revenue class. The most-frequent strategy gives the same score because low revenue is the majority class. Hence, any model that scores at or below this baseline is not helpful. The logistic regression trained earlier reached 0.81, which is only about five points above it.
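The same comparison can be set up side by side on any dataset. The sketch below uses synthetic data from `make_classification` with roughly a 75/25 class imbalance as a stand-in for the census table, so the scores it prints are not the ones reported in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data (assumption: about 75% in class 0).
X, y = make_classification(
    n_samples=1000, n_features=4, weights=[0.75, 0.25], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Baseline: always predict the most frequent training class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
baseline_score = baseline.score(X_test, y_test)

# Candidate model evaluated on the same test set.
model = LogisticRegression().fit(X_train, y_train)
model_score = model.score(X_test, y_test)

print(f"Baseline accuracy: {baseline_score:.2f}")
print(f"Model accuracy:    {model_score:.2f}")
```

Evaluating both estimators on the same test set keeps the comparison fair, since the baseline accuracy depends on the class balance of that particular split.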