# Tensor Flow - Use case implementation

Here, census.csv file is used aas the data set. It contains varoius features of an individual like age, gender, marital status, relationship etc.,(Listed below)

Given all these features, we will use tensor flow to find what class of income an individual belong to (>50k or <=50k).

Advantage of using **TensorFlow** is that it offers high level APIs, in this case 'estimator' API will be used to create a classifier(LinearClassifier).

The data has to be prepared to be in a structured format to an API to be fed in a right way.

Data Set source : https://github.com/agupta98/Tensorflow/blob/master/census_data.csv

**Import necessary libararies**

In [1]:
# Just disables the warning, doesn't enable AVX/FMA
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'

In [2]:
import tensorflow as tf
import pandas as pd
import numpy as np

**Read in the census_data.csv file**

In [3]:
census = pd.read_csv("./data/census_data.csv") 

In [4]:
list(census)

['age',
 'workclass',
 'education',
 'education_num',
 'marital_status',
 'occupation',
 'relationship',
 'race',
 'gender',
 'capital_gain',
 'capital_loss',
 'hours_per_week',
 'native_country',
 'income_bracket']

In [5]:
census.head() 

Unnamed: 0,age,workclass,education,education_num,marital_status,occupation,relationship,race,gender,capital_gain,capital_loss,hours_per_week,native_country,income_bracket
0,39,State-gov,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


TensorFlow will not understand strings as labels in the column 'income_bracket'. apply() method from pandas has to be used to apply a custom function that converts them to 0s and 1s.

In [6]:
census['income_bracket'].unique()

array([' <=50K', ' >50K'], dtype=object)

In [7]:

def label(label):
    if label == '<=50k':
        return 0
    else:
        return 1

In [8]:
census['income_bracket'] = census['income_bracket'].apply(label)

**Perforn Train-Test split on Census data**

In [9]:
from sklearn.model_selection import train_test_split

In [10]:
x_data = census.drop('income_bracket', axis=1)
y_labels = census['income_bracket']

X_train, X_test, y_train, y_test = train_test_split(x_data,y_labels,test_size=0.3,random_state=101)

# 30% of data to be test data

### TensorFlow API - estimator

In order to call estimator, feature columns has to be prepared and has to be in a right format. Then, input function has to be passed.

**Prepare feature columns:**
    
    model = tf.estimator.LinearClassifier(feature_columns=feat_cols)

**Prepare input function:**
    
    model.train(input_fn=input_func, steps=5000)

In [11]:
census.columns

Index([u'age', u'workclass', u'education', u'education_num', u'marital_status',
       u'occupation', u'relationship', u'race', u'gender', u'capital_gain',
       u'capital_loss', u'hours_per_week', u'native_country',
       u'income_bracket'],
      dtype='object')

**Preparing the feature columns** to create a model.

tf.feature_column() method is used to create feature columns. The way of creating feature columns for 'categorical' and 'continuous' values are different.

**Feature columns for 'categorical' values**

In [12]:
# use 'with_vocabulary_list' when there are known, few and finite number of values to the column
gender = tf.feature_column.categorical_column_with_vocabulary_list("gender", ["Male","Female"])

# use 'with_hash_bucket' when the number of values to the column are not known and infinte (max 1000)
occupation = tf.feature_column.categorical_column_with_hash_bucket("occupation", hash_bucket_size=1000)
marital_status = tf.feature_column.categorical_column_with_hash_bucket("marital_status", hash_bucket_size=1000)
relationship = tf.feature_column.categorical_column_with_hash_bucket("relationship", hash_bucket_size=1000)
education = tf.feature_column.categorical_column_with_hash_bucket("education", hash_bucket_size=1000)
workclass = tf.feature_column.categorical_column_with_hash_bucket("workclass", hash_bucket_size=1000)
native_country = tf.feature_column.categorical_column_with_hash_bucket("native_country", hash_bucket_size=1000)

**Feature columns for 'continuous' values**

In [13]:
age = tf.feature_column.numeric_column("age")
education_num = tf.feature_column.numeric_column("education_num")
capital_gain = tf.feature_column.numeric_column("capital_gain")
capital_loss = tf.feature_column.numeric_column("capital_loss")
hours_per_week = tf.feature_column.numeric_column("hours_per_week")

Create a vector(list) to hold in all these newly created variables for feature columns. 

In [14]:
feat_cols = [gender, occupation, marital_status, relationship, education, workclass, native_country, 
             age, education_num, capital_gain, capital_loss, hours_per_week]

**Preparing the input function** to train the created model.

batch-size parameter indicates how many records to be read each time during the training process.

In [15]:
input_func = tf.estimator.inputs.pandas_input_fn(x=X_train, y=y_train, batch_size=100,num_epochs=None, shuffle=True)

**Create a LinearClassifier** using estimator API.

In [16]:
model = tf.estimator.LinearClassifier(feature_columns=feat_cols)

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_save_checkpoints_secs': 600, '_session_config': None, '_keep_checkpoint_max': 5, '_task_type': 'worker', '_is_chief': True, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x1a28bdf490>, '_save_checkpoints_steps': None, '_keep_checkpoint_every_n_hours': 10000, '_service': None, '_num_ps_replicas': 0, '_tf_random_seed': None, '_master': '', '_num_worker_replicas': 1, '_task_id': 0, '_log_step_count_steps': 100, '_model_dir': '/var/folders/54/s4nndgq53ml0qj_q2pqp8hhm0000gp/T/tmpFS3fQP', '_save_summary_steps': 100}


In [17]:
# To train the model on the data with 1000 iterations

model.train(input_fn=input_func, steps=1000)

INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Saving checkpoints for 1 into /var/folders/54/s4nndgq53ml0qj_q2pqp8hhm0000gp/T/tmpFS3fQP/model.ckpt.
INFO:tensorflow:loss = 69.31472, step = 1
INFO:tensorflow:global_step/sec: 135.333
INFO:tensorflow:loss = 0.00022910332, step = 101 (0.744 sec)
INFO:tensorflow:global_step/sec: 176.365
INFO:tensorflow:loss = 0.000235435, step = 201 (0.567 sec)
INFO:tensorflow:global_step/sec: 165.641
INFO:tensorflow:loss = 0.0003066226, step = 301 (0.609 sec)
INFO:tensorflow:global_step/sec: 164.528
INFO:tensorflow:loss = 0.00010653919, step = 401 (0.608 sec)
INFO:tensorflow:global_step/sec: 167.646
INFO:tensorflow:loss = 0.00024590586, step = 501 (0.588 sec)
INFO:tensorflow:global_step/sec: 151.963
INFO:tensorflow:loss = 0.00075582624, step = 601 (0.662 sec)
INFO:tensorflow:global_step/sec: 160.297
INFO:tensorflow:loss = 0.0006577833, step = 701 (0.623 sec)
INFO:tensorflow:global_step/sec: 173.721
INFO:tensorflow:loss = 0.0004566936, step = 80

<tensorflow.python.estimator.canned.linear.LinearClassifier at 0x1a28bdf410>

### Evaluation

Create a prediction input function, providing shuffle=False

In [18]:
pred_fn = tf.estimator.inputs.pandas_input_fn(x=X_train, y=y_train, batch_size=100,num_epochs=None, shuffle=False)

To produce a generator of predictions, use model.predict() method to pass input function and then transform it to a list using list().

**Note:** I was not able to get the Output for the below as my CPU did not support. Below is the warning which usually happens for most of the average level CPUs. GPUs will not pose any such problems.


[I 16:13:36.477 NotebookApp] Adapting to protocol v5.1 for kernel 0ddc340e-de56-4293-8b32-074eec23f15d
2018-12-16 16:14:02.325553: I tensorflow/core/platform/cpu_feature_guard.cc:137] **Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA**

Installed the below as suggested in Stackoverflow. [https://github.com/lakshayg/tensorflow-build]

pip install tensorflow-1.4.1-cp27-cp27m-macosx_10_12_intel.whl

Yet, the same error along with 'notebook.ipynb is not trusted'! Did the below to get system to trust the notebook.

jupyter trust notebook.ipynb.

After this, no warning nor error. But the below code does not give any output even after 30 minutes.

**References to fix:**

- https://github.com/lakshayg/tensorflow-build
- https://stackoverflow.com/questions/47068709/your-cpu-supports-instructions-that-this-tensorflow-binary-was-not-compiled-to-u
- https://stackoverflow.com/questions/27577463/installing-python-library-from-whl-file
- https://ipython.org/ipython-doc/3/notebook/security.html
- https://github.com/tensorflow/tensorflow/issues/10436


In [None]:
predictions = list(model.predict(input_fn=pred_fn))

INFO:tensorflow:Restoring parameters from /var/folders/54/s4nndgq53ml0qj_q2pqp8hhm0000gp/T/tmpFS3fQP/model.ckpt-1000


In [None]:
predictions[0]

To compare against the real y_test values, create a list of 'class_ids' key values from the prediction list of dictionaries which are the predictions.

In [None]:
final_preds = []
for pred in predictions:
    final_preds.append(pred['class_ids'][0])    

In [None]:
final_preds[:10]

To evaluate the model's performance import classification report from sklearn.metrics

In [None]:
from sklearn.metrics import classification_report

print(classification_report(y_test, final_preds))

Model accuracy = %