
# Creating a Binary Classifier

Here we use `Logistic Regression`. The model is given by <br>
$p(Y=1|X) = \frac{1}{1 + e^{-(\theta^T X + b)}}$

where $\theta$ is the weight vector and $b$ is the bias.
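
To make the formula concrete, here is a minimal sketch of the prediction step for a single example. The function name `predict_proba` and the weight, bias and input values are made up for illustration; they are not taken from the trained model.

```python
import math

def predict_proba(x, theta, b):
    """Return p(Y=1 | x) for one example under a logistic regression model."""
    # Linear score: dot product of weights and inputs, plus the bias
    z = sum(t * xi for t, xi in zip(theta, x)) + b
    # The sigmoid squashes the score into the open interval (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical weights and inputs, chosen only for illustration
theta = [0.5, -0.25]
b = 0.1

p_zero = predict_proba([0.0, 0.0], [0.0, 0.0], 0.0)  # score 0 gives probability 0.5
p_example = predict_proba([2.0, 4.0], theta, b)      # score 0.5*2 - 0.25*4 + 0.1 = 0.1
```

A score of zero maps to a probability of exactly 0.5, which is why 0.5 is the usual decision threshold between the two classes.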

# Checking the Accuracy

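$A_{cc} = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}(\hat{y}_i = y_i)$

where $\hat{y}_i$ is the predicted class of example $i$, $y_i$ is its actual class, and $\mathbb{1}(\cdot)$ is 1 when the two match and 0 otherwise.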
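
The accuracy formula amounts to counting matches and dividing by the number of examples. A small sketch, with made-up labels and a hypothetical helper name `accuracy`:

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    # Count the positions where prediction and label agree
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)

# Made-up labels for illustration
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 1]
acc = accuracy(y_true, y_pred)  # 3 of the 5 predictions match
```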

# Scenarios

The training data may contain very few examples of a certain class, e.g. 5% examples of class 0 and 95% examples of class 1.<br>
In this case, a model that always predicts class 1 is correct in 95% of the cases, yet it has not learned to differentiate between the two classes.

In such cases the confusion matrix helps.
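
The scenario can be reproduced with a toy example. The labels below are made up to mirror the 5% / 95% split, and the "model" is simply a constant prediction.

```python
# Toy labels mirroring the 5% / 95% split described above (made-up data)
y_true = [0] * 5 + [1] * 95

# A "model" that ignores its input and always answers class 1
y_pred = [1] * len(y_true)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# How many of the class-0 examples were recognised as class 0?
class0_found = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)

print(accuracy)      # 0.95
print(class0_found)  # 0
```

The accuracy looks excellent even though not a single class-0 example is detected.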

The confusion matrix visualizes the accuracy of a classifier by comparing the actual and predicted classes. The binary confusion matrix is composed of four cells:

- TP: True Positive: positive values correctly predicted as positive
- FP: False Positive: negative values incorrectly predicted as positive
- FN: False Negative: positive values incorrectly predicted as negative
- TN: True Negative: negative values correctly predicted as negative

<img src='artifacts/confusion_matrix.png'/>
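
As a rough sketch, the four cells can be counted directly from the actual and predicted labels. The helper name `confusion_counts` and the labels are illustrative only, with class 1 treated as the positive class.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FP, FN, TN) for binary labels where 1 is the positive class."""
    tp = fp = fn = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == 1 and actual == 1:
            tp += 1   # positive predicted as positive
        elif predicted == 1 and actual == 0:
            fp += 1   # negative predicted as positive
        elif predicted == 0 and actual == 1:
            fn += 1   # positive predicted as negative
        else:
            tn += 1   # negative predicted as negative
    return tp, fp, fn, tn

# Made-up labels for illustration
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
tp, fp, fn, tn = confusion_counts(y_true, y_pred)
```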

# Precision and Sensitivity

## Precision
The precision metric measures the accuracy of the positive predictions: of all the examples predicted as positive, how many are actually positive.

$Precision = \frac {TP} {TP+FP}$

The maximum score is 1, reached when every example predicted as positive is actually positive. Precision alone is not very helpful because it ignores the positive examples the model missed (the false negatives). The metric is therefore usually paired with recall, which is also called sensitivity or true positive rate.


## Sensitivity
Sensitivity is the ratio of actual positive examples that are correctly detected. It measures how well the model recognizes the positive class.

$Sensitivity = \frac {TP} {TP+FN}$
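
Both metrics follow directly from the confusion matrix counts. The sketch below uses hypothetical counts; returning 0.0 when a denominator is zero is a convention chosen here, not something the notebook specifies.

```python
def precision(tp, fp):
    """Share of positive predictions that are actually positive."""
    return tp / (tp + fp) if (tp + fp) else 0.0

def sensitivity(tp, fn):
    """Share of actual positives that the model detected (recall)."""
    return tp / (tp + fn) if (tp + fn) else 0.0

# Hypothetical counts for illustration
tp, fp, fn = 30, 10, 60

p = precision(tp, fp)    # 30 / 40
r = sensitivity(tp, fn)  # 30 / 90
```

With these counts the model is right three times out of four when it predicts the positive class, yet it finds only a third of the actual positives, which is why the two metrics are read together.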

# Dataset

We are going to use the census data. Income is encoded as a binary label:
- 1 -> income > 50K
- 0 -> income <= 50K

This dataset includes eight categorical variables:
- workclass
- education
- marital
- occupation
- relationship
- race
- sex
- native_country

and six continuous variables:
- age
- fnlwgt
- education_num
- capital_gain
- capital_loss
- hours_week

# Working on a Tensorflow Estimator

## Step1: Import the data

In [17]:
import tensorflow.compat.v1 as tf
import pandas as pd
import seaborn as sns
tf.disable_eager_execution()
print(tf)

<module 'tensorflow_core.compat.v1' from '/home/sbjr/my_bin/anaconda3/envs/tf-gpu/lib/python3.7/site-packages/tensorflow_core/_api/v2/compat/v1/__init__.py'>


In [2]:
COLUMNS = ['age','workclass', 'fnlwgt', 'education', 'education_num', 'marital','occupation', 'relationship', 'race', 'sex', 'capital_gain', 'capital_loss','hours_week', 'native_country', 'label']
data_path = 'data/sensus_data/adult.data'
test_path = 'data/sensus_data/adult.test'
df_train = pd.read_csv(data_path, skipinitialspace=True, names = COLUMNS, index_col=False)
df_test = pd.read_csv(test_path,skiprows=1, skipinitialspace=True, names = COLUMNS, index_col=False)
display(df_train.head(5))
display(df_test.head(5))
print('Train->\n')
df_train.info()
print('\n\nTest->\n')
df_test.info()


Unnamed: 0,age,workclass,fnlwgt,education,education_num,marital,occupation,relationship,race,sex,capital_gain,capital_loss,hours_week,native_country,label
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


Unnamed: 0,age,workclass,fnlwgt,education,education_num,marital,occupation,relationship,race,sex,capital_gain,capital_loss,hours_week,native_country,label
0,25,Private,226802,11th,7,Never-married,Machine-op-inspct,Own-child,Black,Male,0,0,40,United-States,<=50K.
1,38,Private,89814,HS-grad,9,Married-civ-spouse,Farming-fishing,Husband,White,Male,0,0,50,United-States,<=50K.
2,28,Local-gov,336951,Assoc-acdm,12,Married-civ-spouse,Protective-serv,Husband,White,Male,0,0,40,United-States,>50K.
3,44,Private,160323,Some-college,10,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,7688,0,40,United-States,>50K.
4,18,?,103497,Some-college,10,Never-married,?,Own-child,White,Female,0,0,30,United-States,<=50K.


Train->

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             32561 non-null  int64 
 1   workclass       32561 non-null  object
 2   fnlwgt          32561 non-null  int64 
 3   education       32561 non-null  object
 4   education_num   32561 non-null  int64 
 5   marital         32561 non-null  object
 6   occupation      32561 non-null  object
 7   relationship    32561 non-null  object
 8   race            32561 non-null  object
 9   sex             32561 non-null  object
 10  capital_gain    32561 non-null  int64 
 11  capital_loss    32561 non-null  int64 
 12  hours_week      32561 non-null  int64 
 13  native_country  32561 non-null  object
 14  label           32561 non-null  object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB


Test->

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16281 entries, 0 to 1628

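We need to convert the label column from `<=50K` and `>50K` to 0 and 1. The test file writes the same labels with a trailing period (`<=50K.` and `>50K.`), so it needs its own mapping.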

In [3]:
df_train['label'] = df_train['label'].map({'<=50K':0, '>50K':1})
df_test['label'] = df_test['label'].map({'<=50K.':0, '>50K.':1})
display(df_train.head(5))
display(df_test.head(5))

Unnamed: 0,age,workclass,fnlwgt,education,education_num,marital,occupation,relationship,race,sex,capital_gain,capital_loss,hours_week,native_country,label
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,0
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,0
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,0
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,0
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,0


Unnamed: 0,age,workclass,fnlwgt,education,education_num,marital,occupation,relationship,race,sex,capital_gain,capital_loss,hours_week,native_country,label
0,25,Private,226802,11th,7,Never-married,Machine-op-inspct,Own-child,Black,Male,0,0,40,United-States,0
1,38,Private,89814,HS-grad,9,Married-civ-spouse,Farming-fishing,Husband,White,Male,0,0,50,United-States,0
2,28,Local-gov,336951,Assoc-acdm,12,Married-civ-spouse,Protective-serv,Husband,White,Male,0,0,40,United-States,1
3,44,Private,160323,Some-college,10,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,7688,0,40,United-States,1
4,18,?,103497,Some-college,10,Never-married,?,Own-child,White,Female,0,0,30,United-States,0


## Step2: Data conversion

A few steps are required before you train a linear classifier with TensorFlow. You need to prepare the features to include in the model. For this baseline model, you will use the original data without applying any transformation.

The estimator needs a list of features to train the model, so each column's data has to be converted into a tensor through a feature column.

A good practice is to define two lists of features based on their type and then pass them in the `feature_columns` argument of the estimator.

You will begin by converting the continuous features, then map the categorical features into hash buckets.

The features of the dataset have two formats:

- Integer
- Object
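
One way to derive the two lists from the data rather than typing them by hand is to split the columns by dtype. This is only a sketch on a tiny frame built from three rows of the training preview; the variable names are illustrative.

```python
import pandas as pd

# Three rows copied from the training preview, restricted to four columns
df = pd.DataFrame({
    "age": [39, 50, 38],
    "workclass": ["State-gov", "Self-emp-not-inc", "Private"],
    "hours_week": [40, 13, 40],
    "label": [0, 0, 0],
})

# The label is the target, not a feature
features = df.drop(columns=["label"])

# Numeric columns become continuous features, everything else categorical
continuous = features.select_dtypes(include="number").columns.tolist()
categorical = features.select_dtypes(exclude="number").columns.tolist()
```

On the full dataset this split should reproduce the six continuous and eight categorical variables listed earlier, provided the label column has already been mapped to 0 and 1.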

In [4]:
CONTI_FEATURES  = ['age', 'fnlwgt','capital_gain', 'education_num', 'capital_loss', 'hours_week']
CATE_FEATURES = ['workclass', 'education', 'marital', 'occupation', 'relationship', 'race', 'sex', 'native_country']

In [6]:
# Create feature columns
continuous_features = [tf.feature_column.numeric_column(k) for k in CONTI_FEATURES]
categorical_features = [tf.feature_column.categorical_column_with_hash_bucket(k, hash_bucket_size=1000) for k in CATE_FEATURES]
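
Conceptually, a hash bucket column maps each string to an integer id in `[0, hash_bucket_size)` without needing the vocabulary in advance. The sketch below illustrates the idea only: `zlib.crc32` is a stand-in chosen because it is stable across runs, and it is not the hash function TensorFlow uses.

```python
import zlib

def hash_bucket(value, hash_bucket_size=1000):
    """Map a string to a bucket id in [0, hash_bucket_size)."""
    # crc32 is an assumption for this sketch, not TensorFlow's hash
    return zlib.crc32(value.encode("utf-8")) % hash_bucket_size

# Occupation values taken from the training preview
occupations = ["Adm-clerical", "Exec-managerial", "Handlers-cleaners", "Prof-specialty"]
bucket_ids = {name: hash_bucket(name) for name in occupations}
```

Two different strings can land in the same bucket. A larger `hash_bucket_size` makes such collisions less likely at the cost of more weights to learn.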

## Step3: Train the classifier

### Create the model

In [9]:
model = tf.estimator.LinearClassifier(
        n_classes = 2,
        model_dir = 'logs/9_Linear_classifier',
        feature_columns = categorical_features + continuous_features)

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_model_dir': 'logs/9_Linear_classifier', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': allow_soft_placement: true
graph_options {
  rewrite_options {
    meta_optimizer_iterations: ONE
  }
}
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f484213bcd0>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}


### Create input_fn

In [14]:
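FEATURES = CONTI_FEATURES + CATE_FEATURES
LABEL = 'label'
def get_input_fn(df, features, num_epoches = None, batch_size = 128, shuffle = True):
    # Build an input function that feeds the selected columns of df to the estimator.
    # num_epoches=None repeats the data indefinitely, so the caller must bound the run with steps.
    return tf.estimator.inputs.pandas_input_fn(
        x = pd.DataFrame({k: df[k] for k in features}),  # feature columns only
        y = pd.Series(df[LABEL].values),                 # 0/1 labels
        batch_size = batch_size,
        num_epochs = num_epoches,
        shuffle = shuffle)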
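
To show what `batch_size` and the number of epochs control, here is a simplified sketch of the batching behaviour in plain pandas. The generator `iterate_batches` is hypothetical and is not how `pandas_input_fn` is implemented (shuffling is left out, for instance); it only mimics the observable behaviour.

```python
import itertools
import pandas as pd

def iterate_batches(df, features, label, batch_size=128, num_epochs=None):
    """Yield (feature_dict, labels) batches; num_epochs=None repeats forever."""
    epochs = itertools.count() if num_epochs is None else range(num_epochs)
    for _ in epochs:
        for start in range(0, len(df), batch_size):
            chunk = df.iloc[start:start + batch_size]
            yield {k: chunk[k].tolist() for k in features}, chunk[label].tolist()

# Ages and labels from the first five rows of the training preview
toy = pd.DataFrame({"age": [39, 50, 38, 53, 28], "label": [0, 0, 0, 0, 0]})

# One epoch: the data is consumed once, the last batch may be smaller
one_pass = list(iterate_batches(toy, ["age"], "label", batch_size=2, num_epochs=1))

# No epoch limit: the stream never ends on its own, so it is capped from outside,
# which is the role the steps argument plays during training
capped = list(itertools.islice(iterate_batches(toy, ["age"], "label", batch_size=2), 7))
```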

### Train the classifier

In [15]:
model.train(input_fn=get_input_fn(
            df_train,
            FEATURES,
            num_epoches = None,
            batch_size = 128,
            shuffle = False),
           steps=1000)

Instructions for updating:
If using Keras pass *_constraint arguments to layers.
Instructions for updating:
Use Variable.read_value. Variables in 2.X are initialized automatically both in eager and graph (inside tf.defun) contexts.
Instructions for updating:
To construct input pipelines, use the `tf.data` module.
Instructions for updating:
To construct input pipelines, use the `tf.data` module.
INFO:tensorflow:Calling model_fn.
Instructions for updating:
Please use `layer.add_weight` method instead.
Instructions for updating:
Use `tf.cast` instead.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
Instructions for updating:
To construct input pipelines, use the `tf.data` module.
INFO:tensorflow:Saving checkpoints for 0 into logs/9_Linear_classifier/model.ckpt.
INFO:tensorflow:loss = 88.72288, step = 0
INFO:tensorflow:global_step/sec: 1

<tensorflow_estimator.python.estimator.canned.linear.LinearClassifier at 0x7f4841aaf1d0>

### Evaluating the model

In [16]:
model.evaluate(input_fn=get_input_fn(
        df_test,
        FEATURES,
        num_epoches = 1,
        batch_size = 128,
        shuffle = False),
                steps=1000)

INFO:tensorflow:Calling model_fn.
Instructions for updating:
Deprecated in favor of operator or tf.math.divide.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Starting evaluation at 2020-02-16T15:56:28Z
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from logs/9_Linear_classifier/model.ckpt-1000
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Evaluation [100/1000]
INFO:tensorflow:Finished evaluation at 2020-02-16-15:56:30
INFO:tensorflow:Saving dict for global step 1000: accuracy = 0.7960199, accuracy_baseline = 0.76377374, auc = 0.6147839, auc_precision_recall = 0.55554575, average_loss = 119.473885, global_step = 1000, label/mean = 0.23622628, loss = 15196.519, precision = 0.6690277, prediction/mean = 0.095460445, recall = 0.2701508
INFO:tensorflow:Saving 'checkpoint_path' summary for global step 1000: logs/9_Linear_classifier/model.ckpt-1000


{'accuracy': 0.7960199,
 'accuracy_baseline': 0.76377374,
 'auc': 0.6147839,
 'auc_precision_recall': 0.55554575,
 'average_loss': 119.473885,
 'label/mean': 0.23622628,
 'loss': 15196.519,
 'precision': 0.6690277,
 'prediction/mean': 0.095460445,
 'recall': 0.2701508,
 'global_step': 1000}

<b>Even though the accuracy (0.796) looks high, it is barely above the majority-class baseline (0.764), so we can't say the model has learned well. The precision (0.669) and especially the recall (0.270) make this clear.</b>
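
The reported numbers can be put in perspective with a little arithmetic. The values below are copied from the evaluation output; reading `accuracy_baseline` as "always predict the majority class" is an interpretation that matches the reported value.

```python
# Values copied from the evaluation output
label_mean = 0.23622628   # share of test examples in class 1
accuracy = 0.7960199
recall = 0.2701508

# Always predicting the majority class (class 0) is right this often
majority_baseline = max(label_mean, 1.0 - label_mean)

# How much the model gains over that trivial strategy
gain_over_baseline = accuracy - majority_baseline

# Share of the actual positives that the model fails to detect
missed_positives = 1.0 - recall

print(round(majority_baseline, 4))   # 0.7638
print(round(gain_over_baseline, 4))  # 0.0322
print(round(missed_positives, 4))    # 0.7298
```

The model beats the trivial strategy by only about three percentage points while missing roughly 73% of the people who earn more than 50K.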

## Step4: Improving the Precision

We can try to improve the model by capturing non-linearity in the data. A polynomial term is useful when the relationship between a feature and the label is non-linear. There are two ways to capture non-linearity in the data:
- Add a polynomial term
- Bucketize the continuous variable into a categorical variable

### Using Polynomial Term

In [27]:
# creating polynomial feature
def square_var(df_train, df_test, feat):
    df_train['sq_'+feat] = df_train[feat].pow(2)
    df_test['sq_'+feat] = df_test[feat].pow(2)
    return df_train, df_test

In [28]:
df_train_new, df_test_new = square_var(df_train, df_test, 'age')
display(df_train_new.head(5))
display(df_test_new.head(5))

Unnamed: 0,age,workclass,fnlwgt,education,education_num,marital,occupation,relationship,race,sex,capital_gain,capital_loss,hours_week,native_country,label,sq_age
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,0,1521
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,0,2500
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,0,1444
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,0,2809
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,0,784


Unnamed: 0,age,workclass,fnlwgt,education,education_num,marital,occupation,relationship,race,sex,capital_gain,capital_loss,hours_week,native_country,label,sq_age
0,25,Private,226802,11th,7,Never-married,Machine-op-inspct,Own-child,Black,Male,0,0,40,United-States,0,625
1,38,Private,89814,HS-grad,9,Married-civ-spouse,Farming-fishing,Husband,White,Male,0,0,50,United-States,0,1444
2,28,Local-gov,336951,Assoc-acdm,12,Married-civ-spouse,Protective-serv,Husband,White,Male,0,0,40,United-States,1,784
3,44,Private,160323,Some-college,10,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,7688,0,40,United-States,1,1936
4,18,?,103497,Some-college,10,Never-married,?,Own-child,White,Female,0,0,30,United-States,0,324


In [40]:
CONT_FEAT_NEW = CONTI_FEATURES + ['sq_age']
continuous_features_new = [tf.feature_column.numeric_column(k) for k in CONT_FEAT_NEW]

In [41]:
#create a new model
model_new = tf.estimator.LinearClassifier(model_dir='logs/9_linear_classifier_new', feature_columns=continuous_features_new + categorical_features)

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_model_dir': 'logs/9_linear_classifier_new', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': allow_soft_placement: true
graph_options {
  rewrite_options {
    meta_optimizer_iterations: ONE
  }
}
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f48399b48d0>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}


In [42]:
model_new.train(input_fn= get_input_fn(df_train_new, 
                                       CONT_FEAT_NEW + CATE_FEATURES,
                                       num_epoches = None,
                                       batch_size = 128,
                                       shuffle = False),
               steps=1000)

INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 0 into logs/9_linear_classifier_new/model.ckpt.
INFO:tensorflow:loss = 88.72288, step = 0
INFO:tensorflow:global_step/sec: 144.675
INFO:tensorflow:loss = 78813.016, step = 100 (0.692 sec)
INFO:tensorflow:global_step/sec: 205.938
INFO:tensorflow:loss = 13321.129, step = 200 (0.486 sec)
INFO:tensorflow:global_step/sec: 208.668
INFO:tensorflow:loss = 106516.95, step = 300 (0.479 sec)
INFO:tensorflow:global_step/sec: 213.221
INFO:tensorflow:loss = 11190.488, step = 400 (0.470 sec)
INFO:tensorflow:global_step/sec: 220.13
INFO:tensorflow:loss = 13817.372, step = 500 (0.454 sec)
INFO:tensorflow:global_step/sec: 216.29
INFO:tensorflow:loss = 17206.01, step = 600 (0.463 sec)
INFO:tensorflow:global_step/sec: 212.424
INFO:ten

<tensorflow_estimator.python.estimator.canned.linear.LinearClassifier at 0x7f481c302a50>

In [43]:
# evaluate this new model
model_new.evaluate(input_fn=get_input_fn(df_test_new,
                                        CONT_FEAT_NEW + CATE_FEATURES,
                                        num_epoches = 1,
                                        batch_size = 128,
                                        shuffle = False),
                  steps = 1000)

INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Starting evaluation at 2020-02-16T20:11:19Z
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from logs/9_linear_classifier_new/model.ckpt-1000
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Evaluation [100/1000]
INFO:tensorflow:Finished evaluation at 2020-02-16-20:11:20
INFO:tensorflow:Saving dict for global step 1000: accuracy = 0.79239607, accuracy_baseline = 0.76377374, auc = 0.6059219, auc_precision_recall = 0.5428478, average_loss = 126.17265, global_step = 1000, label/mean = 0.23622628, loss = 16048.57, precision = 0.66002744, prediction/mean = 0.089552544, recall = 0.24986999
INFO:tensorflow:Saving 'checkpoint_path' summary for global step 1000: logs/9_linear_classifier_new/model.ckpt-1000


{'accuracy': 0.79239607,
 'accuracy_baseline': 0.76377374,
 'auc': 0.6059219,
 'auc_precision_recall': 0.5428478,
 'average_loss': 126.17265,
 'label/mean': 0.23622628,
 'loss': 16048.57,
 'precision': 0.66002744,
 'prediction/mean': 0.089552544,
 'recall': 0.24986999,
 'global_step': 1000}

### Bucketizing continuous features into categorical features

In [44]:
age = tf.feature_column.numeric_column('age')
age_buckets = tf.feature_column.bucketized_column(
    age, boundaries=[18, 25, 30, 35, 40, 45, 50, 55, 60, 65])
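
The ten boundaries split age into eleven buckets, each including its left boundary and excluding its right one. A plain Python sketch of the same assignment, using `bisect_right` as a stand-in for what the bucketized column computes:

```python
from bisect import bisect_right

BOUNDARIES = [18, 25, 30, 35, 40, 45, 50, 55, 60, 65]

def age_bucket(age, boundaries=BOUNDARIES):
    """Index of the bucket containing age; a bucket includes its left boundary."""
    # Counts how many boundaries are <= age
    return bisect_right(boundaries, age)

buckets = {age: age_bucket(age) for age in (17, 18, 39, 64, 65, 90)}
print(buckets)  # {17: 0, 18: 1, 39: 4, 64: 9, 65: 10, 90: 10}
```

Because each bucket gets its own weight, the model can learn a different effect for each age range instead of a single slope for the whole age axis.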

We will also use feature crossing here, crossing age, education and occupation.<br>
<b>Note: a crossed column only accepts categorical inputs, so the numeric age enters the cross through its bucketized form; occupation and education are already categorical.</b>

In [45]:
education_x_occupation = [tf.feature_column.crossed_column(['education', 'occupation'], hash_bucket_size=1000)]
age_buckets_x_education_x_occupation = [tf.feature_column.crossed_column([age_buckets, 'education', 'occupation'], hash_bucket_size=1000)]
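
A feature cross turns each combination of values into a single categorical value, so the model can learn one weight per combination. The sketch below builds such combined keys by hand from the first five training rows; the `_X_` separator is made up for readability and does not reflect how TensorFlow encodes the cross internally. The crossed column then hashes each combination into one of `hash_bucket_size` buckets.

```python
from bisect import bisect_right
from collections import Counter

BOUNDARIES = [18, 25, 30, 35, 40, 45, 50, 55, 60, 65]

# (age, education, occupation) from the first five rows of the training preview
rows = [
    (39, "Bachelors", "Adm-clerical"),
    (50, "Bachelors", "Exec-managerial"),
    (38, "HS-grad", "Handlers-cleaners"),
    (53, "11th", "Handlers-cleaners"),
    (28, "Bachelors", "Prof-specialty"),
]

# Two-way cross: one key per (education, occupation) pair
edu_x_occ = [f"{edu}_X_{occ}" for _, edu, occ in rows]

# Three-way cross: age enters through its bucket index, not its raw value
age_x_edu_x_occ = [
    f"{bisect_right(BOUNDARIES, age)}_X_{edu}_X_{occ}" for age, edu, occ in rows
]

# How often each education/occupation pair occurs in this sample
pair_counts = Counter(edu_x_occ)
```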

In [52]:
# creating a new model
!rm -rf logs/9_linear_classifier_improved
model_improved = tf.estimator.LinearClassifier(model_dir= 'logs/9_linear_classifier_improved',
                                              feature_columns= [age_buckets] 
                                               + education_x_occupation 
                                               + age_buckets_x_education_x_occupation 
                                               # + continuous_features_new not using the old features
                                               + categorical_features)

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_model_dir': 'logs/9_linear_classifier_improved', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': allow_soft_placement: true
graph_options {
  rewrite_options {
    meta_optimizer_iterations: ONE
  }
}
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f4783a75790>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}


In [53]:
#training the improved model
FEAT_IMPROVED = ['age','workclass', 'education', 'education_num', 'marital','occupation', 'relationship', 'race', 'sex', 'native_country', 'sq_age']
model_improved.train(input_fn=get_input_fn(df_train_new,
                                 FEAT_IMPROVED,
                                 num_epoches = None,
                                 batch_size = 128,
                                 shuffle = False),
           steps = 1000)

INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 0 into logs/9_linear_classifier_improved/model.ckpt.
INFO:tensorflow:loss = 88.72288, step = 0
INFO:tensorflow:global_step/sec: 148.853
INFO:tensorflow:loss = 50.334488, step = 100 (0.673 sec)
INFO:tensorflow:global_step/sec: 235.938
INFO:tensorflow:loss = 56.153225, step = 200 (0.426 sec)
INFO:tensorflow:global_step/sec: 237.257
INFO:tensorflow:loss = 45.792007, step = 300 (0.421 sec)
INFO:tensorflow:global_step/sec: 197.804
INFO:tensorflow:loss = 37.485672, step = 400 (0.506 sec)
INFO:tensorflow:global_step/sec: 177.983
INFO:tensorflow:loss = 56.484497, step = 500 (0.562 sec)
INFO:tensorflow:global_step/sec: 181.79
INFO:tensorflow:loss = 32.528934, step = 600 (0.553 sec)
INFO:tensorflow:global_step/sec: 215.453
I

<tensorflow_estimator.python.estimator.canned.linear.LinearClassifier at 0x7f483946bfd0>

In [54]:
# evaluate this improved model
model_improved.evaluate(input_fn= get_input_fn(df_test_new,
                                     FEAT_IMPROVED,
                                     num_epoches = 1,
                                     batch_size = 128,
                                     shuffle = False),
              steps=1000)

INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Starting evaluation at 2020-02-16T20:26:16Z
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from logs/9_linear_classifier_improved/model.ckpt-1000
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Evaluation [100/1000]
INFO:tensorflow:Finished evaluation at 2020-02-16-20:26:18
INFO:tensorflow:Saving dict for global step 1000: accuracy = 0.8358209, accuracy_baseline = 0.76377374, auc = 0.8840164, auc_precision_recall = 0.69599575, average_loss = 0.35122654, global_step = 1000, label/mean = 0.23622628, loss = 44.67437, precision = 0.68986726, prediction/mean = 0.2332066, recall = 0.55408216
INFO:tensorflow:Saving 'checkpoint_path' summary for global step 1000: logs/9_linear_classifier_improved/model.ckpt-1000


{'accuracy': 0.8358209,
 'accuracy_baseline': 0.76377374,
 'auc': 0.8840164,
 'auc_precision_recall': 0.69599575,
 'average_loss': 0.35122654,
 'label/mean': 0.23622628,
 'loss': 44.67437,
 'precision': 0.68986726,
 'prediction/mean': 0.2332066,
 'recall': 0.55408216,
 'global_step': 1000}

## Step5: Adding a regularization term to prevent overfitting

A model with many parameters and relatively little data is prone to overfitting.<br>
Regularization lets you control that complexity so the model generalizes better. There are two common regularization techniques:
- L1: Lasso
- L2: Ridge

In TensorFlow, you set both as hyperparameters of the optimizer. The larger the L2 strength, the more the weights are pulled toward zero and the flatter the fitted line becomes; an L2 strength close to zero leaves the weights close to those of the unregularized model.
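
The shrinking effect of L2 is easiest to see on a one-feature linear regression, where the ridge solution has a closed form. This is a simplified illustration with made-up points, not the FTRL optimizer or the logistic loss used by the classifier; `ridge_slope` is a hypothetical helper.

```python
def ridge_slope(xs, ys, l2):
    """Closed-form slope of a one-feature, no-intercept ridge regression."""
    # Minimises sum((y - w*x)^2) + l2 * w^2, which gives
    # w = sum(x*y) / (sum(x^2) + l2)
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + l2)

# Made-up points lying roughly on the line y = 2x
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

# The slope shrinks toward zero as the L2 strength grows
slopes = {l2: ridge_slope(xs, ys, l2) for l2 in (0.0, 5.0, 50.0, 5000.0)}
```

With `l2 = 0.0` the slope is the ordinary least-squares fit (close to 2), and with a very large `l2` the line is almost flat.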

In [55]:
# creating a new model
model_reg = tf.estimator.LinearClassifier(model_dir='logs/9_linear_classifier_regularized',
                                         feature_columns= categorical_features + [age_buckets] + age_buckets_x_education_x_occupation,
                                         optimizer= tf.train.FtrlOptimizer(
                                         learning_rate = 0.1,
                                         l1_regularization_strength = 0.9,
                                         l2_regularization_strength = 5))

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_model_dir': 'logs/9_linear_classifier_regularized', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': allow_soft_placement: true
graph_options {
  rewrite_options {
    meta_optimizer_iterations: ONE
  }
}
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f47f041cf10>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}


In [56]:
model_reg.train(input_fn= get_input_fn(df_train_new,
                                      FEAT_IMPROVED,
                                 num_epoches = None,
                                 batch_size = 128,
                                 shuffle = False),
               steps=1000)

INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 0 into logs/9_linear_classifier_regularized/model.ckpt.
INFO:tensorflow:loss = 88.72288, step = 0
INFO:tensorflow:global_step/sec: 113.933
INFO:tensorflow:loss = 50.421577, step = 100 (0.882 sec)
INFO:tensorflow:global_step/sec: 219.517
INFO:tensorflow:loss = 55.906914, step = 200 (0.454 sec)
INFO:tensorflow:global_step/sec: 197.573
INFO:tensorflow:loss = 46.703293, step = 300 (0.507 sec)
INFO:tensorflow:global_step/sec: 185.543
INFO:tensorflow:loss = 39.030033, step = 400 (0.535 sec)
INFO:tensorflow:global_step/sec: 198.351
INFO:tensorflow:loss = 57.349205, step = 500 (0.511 sec)
INFO:tensorflow:global_step/sec: 179.831
INFO:tensorflow:loss = 33.150005, step = 600 (0.552 sec)
INFO:tensorflow:global_step/sec: 240.0

<tensorflow_estimator.python.estimator.canned.linear.LinearClassifier at 0x7f47f0556690>

In [57]:
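# evaluating the regularized model
# note: num_epoches=None repeats the test set, so all 1000 steps run here;
# use num_epoches=1 for a single pass comparable to the earlier evaluations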
model_reg.evaluate(input_fn= get_input_fn(df_test_new,
                                         FEAT_IMPROVED,
                                 num_epoches = None,
                                 batch_size = 128,
                                 shuffle = False),
                  steps=1000)

INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Starting evaluation at 2020-02-16T20:32:45Z
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from logs/9_linear_classifier_regularized/model.ckpt-1000
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Evaluation [100/1000]
INFO:tensorflow:Evaluation [200/1000]
INFO:tensorflow:Evaluation [300/1000]
INFO:tensorflow:Evaluation [400/1000]
INFO:tensorflow:Evaluation [500/1000]
INFO:tensorflow:Evaluation [600/1000]
INFO:tensorflow:Evaluation [700/1000]
INFO:tensorflow:Evaluation [800/1000]
INFO:tensorflow:Evaluation [900/1000]
INFO:tensorflow:Evaluation [1000/1000]
INFO:tensorflow:Finished evaluation at 2020-02-16-20:32:50
INFO:tensorflow:Saving dict for global step 1000: accuracy = 0.8378047, accuracy_baseline = 0.76373434, auc = 0.88715637, auc_precision_recall = 0.7012542, average_loss = 0.3467789, global_step = 1000, label/mean

{'accuracy': 0.8378047,
 'accuracy_baseline': 0.76373434,
 'auc': 0.88715637,
 'auc_precision_recall': 0.7012542,
 'average_loss': 0.3467789,
 'label/mean': 0.23626563,
 'loss': 44.3877,
 'precision': 0.6975455,
 'prediction/mean': 0.23594265,
 'recall': 0.5535017,
 'global_step': 1000}