# California Census Data

The California Census Data contains several features describing an individual. I'll use them to predict which income class that individual belongs to (>50K or <=50K).

Here is some information about the data:

<table>
<thead>
<tr>
<th>Column Name</th>
<th>Type</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>age</td>
<td>Continuous</td>
<td>The age of the individual</td>
</tr>
<tr>
<td>workclass</td>
<td>Categorical</td>
<td>The type of employer the individual has (government, military, private, etc.).</td>
</tr>
<tr>
<td>fnlwgt</td>
<td>Continuous</td>
<td>The number of people the census takers believe that observation represents (sample weight). This variable will not be used.</td>
</tr>
<tr>
<td>education</td>
<td>Categorical</td>
<td>The highest level of education achieved for that individual.</td>
</tr>
<tr>
<td>education_num</td>
<td>Continuous</td>
<td>The highest level of education in numerical form.</td>
</tr>
<tr>
<td>marital_status</td>
<td>Categorical</td>
<td>Marital status of the individual.</td>
</tr>
<tr>
<td>occupation</td>
<td>Categorical</td>
<td>The occupation of the individual.</td>
</tr>
<tr>
<td>relationship</td>
<td>Categorical</td>
<td>Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.</td>
</tr>
<tr>
<td>race</td>
<td>Categorical</td>
<td>White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.</td>
</tr>
<tr>
<td>gender</td>
<td>Categorical</td>
<td>Female, Male.</td>
</tr>
<tr>
<td>capital_gain</td>
<td>Continuous</td>
<td>Capital gains recorded.</td>
</tr>
<tr>
<td>capital_loss</td>
<td>Continuous</td>
<td>Capital losses recorded.</td>
</tr>
<tr>
<td>hours_per_week</td>
<td>Continuous</td>
<td>Hours worked per week.</td>
</tr>
<tr>
<td>native_country</td>
<td>Categorical</td>
<td>Country of origin of the individual.</td>
</tr>
<tr>
<td>income_bracket</td>
<td>Categorical</td>
<td>"&gt;50K" or "&lt;=50K", meaning whether the person makes more than \$50,000 annually.</td>
</tr>
</tbody>
</table>

## Importing Required Packages

In [75]:
import pandas as pd

## Reading CSV File
**Read in the census_data.csv data with pandas.**

In [76]:
cal_census_data = pd.read_csv('census_data.csv')

In [77]:
cal_census_data.head()

Unnamed: 0,age,workclass,education,education_num,marital_status,occupation,relationship,race,gender,capital_gain,capital_loss,hours_per_week,native_country,income_bracket
0,39,State-gov,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


## Data Pre-processing
**Convert the label column to 0s and 1s instead of strings.**

In [78]:
cal_census_data['income_bracket'].unique()

array([' <=50K', ' >50K'], dtype=object)

In [91]:
transformed_l_cal_census_data = cal_census_data.copy()
# The raw labels carry a leading space (' <=50K', ' >50K'), so compare against that exact string.
transformed_l_cal_census_data['income_bracket'] = transformed_l_cal_census_data['income_bracket'].apply(lambda x: 0 if x == ' <=50K' else 1)
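
The lambda above maps anything that is not exactly `' <=50K'` to 1, so an unexpected label (different spacing, a trailing period) would silently become the positive class. A minimal sketch of a stricter encoder follows. The names `encode_income` and `LABELS` are illustrative and not part of the original notebook.

```python
# Hypothetical helper: strict label encoding for the income_bracket column.
LABELS = {"<=50K": 0, ">50K": 1}

def encode_income(raw):
    """Map a raw income label to 0/1, tolerating surrounding whitespace."""
    key = raw.strip()
    if key not in LABELS:
        # Fail loudly instead of silently assigning class 1.
        raise ValueError(f"unexpected income label: {raw!r}")
    return LABELS[key]

print(encode_income(" <=50K"))  # 0
print(encode_income(" >50K"))   # 1
```

A function like this could be passed to `.apply` in place of the lambda.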

In [42]:
transformed_l_cal_census_data

Unnamed: 0,age,workclass,education,education_num,marital_status,occupation,relationship,race,gender,capital_gain,capital_loss,hours_per_week,native_country,income_bracket
0,39,State-gov,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,0
1,50,Self-emp-not-inc,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,0
2,38,Private,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,0
3,53,Private,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,0
4,28,Private,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32556,27,Private,Assoc-acdm,12,Married-civ-spouse,Tech-support,Wife,White,Female,0,0,38,United-States,0
32557,40,Private,HS-grad,9,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,0,0,40,United-States,1
32558,58,Private,HS-grad,9,Widowed,Adm-clerical,Unmarried,White,Female,0,0,40,United-States,0
32559,22,Private,HS-grad,9,Never-married,Adm-clerical,Own-child,White,Male,0,0,20,United-States,0


In [92]:
y_label = transformed_l_cal_census_data['income_bracket']

In [81]:
X_data = transformed_l_cal_census_data.drop(columns=['income_bracket'])

### Perform a Train Test Split on the Data

In [43]:
from sklearn.model_selection import train_test_split

In [93]:
X_train, X_test, y_train, y_test = train_test_split(X_data, y_label, test_size=0.30, random_state=101)
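
As a rough illustration of what the split does, the sketch below shuffles row indices with a fixed seed and sets aside 30% of them as the test set. It uses only the standard library. `split_indices` is a hypothetical helper, it will not reproduce the exact rows that `train_test_split` selects with `random_state=101`, and rounding the test size up is an assumption.

```python
import math
import random

def split_indices(n_rows, test_size=0.30, seed=101):
    """Return (train_idx, test_idx) for a shuffled hold-out split."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)      # fixed seed -> repeatable split
    n_test = math.ceil(test_size * n_rows)    # assumption: test size rounds up
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = split_indices(10)
print(len(train_idx), len(test_idx))  # 7 3
```

Fixing the seed matters because the evaluation numbers later in the notebook are only comparable across runs if the same rows end up in the test set.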

### Create the Feature Columns for tf.estimator

**Take note of categorical vs. continuous values!**

In [83]:
transformed_l_cal_census_data.columns

Index(['age', 'workclass', 'education', 'education_num', 'marital_status',
       'occupation', 'relationship', 'race', 'gender', 'capital_gain',
       'capital_loss', 'hours_per_week', 'native_country', 'income_bracket'],
      dtype='object')

**Import TensorFlow**

In [84]:
import tensorflow.compat.v1 as tf

In [85]:
feat_cols = {}

**Create the tf.feature_column objects for the categorical values. Use vocabulary lists or just use hash buckets.**

In [86]:
# Object-dtype (string) columns are categorical; bucket count = number of distinct values.
for col_name, col_dtype in X_data.dtypes.items():
    if col_dtype == object:
        feat_cols[col_name] = tf.feature_column.categorical_column_with_hash_bucket(col_name, len(X_data[col_name].unique()))
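
Sizing the hash bucket to the number of distinct values does not guarantee one bucket per value: two categories can hash to the same bucket and then share a weight. The sketch below illustrates the effect with the standard library. `hashlib.md5` is a stand-in for TensorFlow's own string hash, and the category list is made up for illustration, so the specific collisions will differ from what the feature column produces.

```python
import hashlib

def bucket_of(value, n_buckets):
    """Assign a string to one of n_buckets using a stable hash (md5 stand-in)."""
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets

def count_collisions(values, n_buckets):
    """Number of distinct values that land in an already-occupied bucket."""
    distinct = set(values)
    used = {bucket_of(v, n_buckets) for v in distinct}
    return len(distinct) - len(used)

# Illustrative categories, not the real workclass vocabulary.
categories = ["Private", "State-gov", "Self-emp-not-inc", "Federal-gov", "Local-gov"]
for n_buckets in (len(categories), 10 * len(categories)):
    print(n_buckets, count_collisions(categories, n_buckets))
```

A larger `hash_bucket_size` lowers the chance of collisions, and a vocabulary list avoids them entirely.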

**Create the continuous feature_columns for the continuous values using numeric_column.**

In [87]:
# Every remaining (non-object) column is numeric.
for col_name, col_dtype in X_data.dtypes.items():
    if col_dtype != object:
        feat_cols[col_name] = tf.feature_column.numeric_column(col_name)

**Put all these columns into a single list. Here the list is named feat_colns, because feat_cols already holds the dictionary.**

In [88]:
feat_colns = list(feat_cols.values())

In [104]:
feat_cols

{'workclass': HashedCategoricalColumn(key='workclass', hash_bucket_size=9, dtype=tf.string),
 'education': HashedCategoricalColumn(key='education', hash_bucket_size=16, dtype=tf.string),
 'marital_status': HashedCategoricalColumn(key='marital_status', hash_bucket_size=7, dtype=tf.string),
 'occupation': HashedCategoricalColumn(key='occupation', hash_bucket_size=15, dtype=tf.string),
 'relationship': HashedCategoricalColumn(key='relationship', hash_bucket_size=6, dtype=tf.string),
 'race': HashedCategoricalColumn(key='race', hash_bucket_size=5, dtype=tf.string),
 'gender': HashedCategoricalColumn(key='gender', hash_bucket_size=2, dtype=tf.string),
 'native_country': HashedCategoricalColumn(key='native_country', hash_bucket_size=42, dtype=tf.string),
 'age': NumericColumn(key='age', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None),
 'education_num': NumericColumn(key='education_num', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None),
 'capital_gai

### Create Input Function

**Batch size is up to you, but do make sure to shuffle!**

In [105]:
train_input_fn = tf.estimator.inputs.pandas_input_fn(x=X_train, y=y_train, batch_size=10, num_epochs=1000, shuffle=True)
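
To make the `batch_size`, `num_epochs`, and `shuffle` arguments concrete, here is a plain-Python sketch of what a shuffled, batched, multi-epoch input pipeline yields. It is not how `pandas_input_fn` is implemented, and details such as the handling of the last partial batch may differ. `iter_batches` is a hypothetical name.

```python
import random

def iter_batches(rows, batch_size, num_epochs, shuffle=True, seed=0):
    """Yield lists of rows, batch by batch, for num_epochs passes over the data."""
    rng = random.Random(seed)
    for _ in range(num_epochs):
        order = list(rows)
        if shuffle:
            rng.shuffle(order)  # new order every epoch
        for start in range(0, len(order), batch_size):
            yield order[start:start + batch_size]

batches = list(iter_batches(range(25), batch_size=10, num_epochs=2))
print([len(b) for b in batches])  # [10, 10, 5, 10, 10, 5]
```

With `num_epochs=1000` the input function can supply far more batches than training consumes, so the `steps` argument passed to `model.train` is what ends training.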

### Create Your Model with tf.estimator

**Create a LinearClassifier. (If you want to use a DNNClassifier, keep in mind you'll need to create embedded columns out of the categorical features that use strings; check out the previous lecture on this for more info.)**

In [106]:
y_label.unique()

array([0, 1], dtype=int64)

In [107]:
model = tf.estimator.LinearClassifier(feature_columns=feat_colns, n_classes=len(y_label.unique()))

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_model_dir': 'C:\\Users\\niks8\\AppData\\Local\\Temp\\tmpwt7b9que', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': allow_soft_placement: true
graph_options {
  rewrite_options {
    meta_optimizer_iterations: ONE
  }
}
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_checkpoint_save_graph_def': True, '_service': None, '_cluster_spec': ClusterSpec({}), '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}


**Train your model on the data for at least 5000 steps.**

In [108]:
model.train(train_input_fn, steps=5000)

INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Calling checkpoint listeners before saving checkpoint 0...
INFO:tensorflow:Saving checkpoints for 0 into C:\Users\niks8\AppData\Local\Temp\tmpwt7b9que\model.ckpt.
INFO:tensorflow:Calling checkpoint listeners after saving checkpoint 0...
INFO:tensorflow:loss = 6.931472, step = 0
INFO:tensorflow:global_step/sec: 360.91
INFO:tensorflow:loss = 2.9193149, step = 100 (0.279 sec)
INFO:tensorflow:global_step/sec: 731.018
INFO:tensorflow:loss = 4.586574, step = 200 (0.136 sec)
INFO:tensorflow:global_step/sec: 727.062
INFO:tensorflow:loss = 0.40996698, step = 300 (0.137 sec)
INFO:tensorflow:global_step/sec: 732.328
INFO:tensorflow:loss = 7.5699177, step = 400 (0.137 sec)
INFO:tensorflow:global_step/sec: 760.844
INFO:tensorflow:loss = 1.6634842, step = 500 (0.132 sec)
INFO:tensorfl

<tensorflow_estimator.python.estimator.canned.linear.LinearClassifier at 0x179b9125520>
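
The first logged loss, 6.931472, is what one would expect from an untrained binary classifier if the loss is summed over the batch: with all weights at zero the model predicts probability 0.5 for every example, each example contributes `ln 2`, and a batch of 10 gives `10 * ln 2`. The check below is a sketch under those assumptions (zero initial weights, summed rather than averaged loss); neither is stated in the notebook.

```python
import math

batch_size = 10                      # from the training input function
p = 0.5                              # assumed prediction of an untrained model
per_example_loss = -math.log(p)      # log loss for either class at p = 0.5
batch_loss = batch_size * per_example_loss
print(round(batch_loss, 6))          # 6.931472
```

It also suggests why later values swing between roughly 0.41 and 7.57: each logged loss covers only 10 examples, so it is a noisy estimate of how well the model fits.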

### Evaluation

**Create a prediction input function. Remember to only supply X_test data and keep shuffle=False.**

In [109]:
eval_input_fn = tf.estimator.inputs.pandas_input_fn(x=X_test, batch_size=10, num_epochs=1, shuffle=False)

**Use model.predict() and pass in your input function. This will produce a generator of predictions, which you can then transform into a list with list().**

In [115]:
predictions = model.predict(eval_input_fn)

**Create a list of only the class_ids key values from the prediction list of dictionaries. These are the predictions you will use to compare against the real y_test values.**

In [116]:
# class_ids is a one-element array per example; take element 0 to get a plain label.
predictions = [pred['class_ids'][0] for pred in predictions]

INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from C:\Users\niks8\AppData\Local\Temp\tmpwt7b9que\model.ckpt-5000
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
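
Each item yielded by `model.predict()` is a dictionary describing one example. The sketch below mocks two such items with plain lists to show the extraction step. The values are invented, and only the `class_ids` and `probabilities` keys are shown, on the assumption that a binary canned classifier returns them in this shape.

```python
# Mocked prediction dictionaries (invented values, lists instead of NumPy arrays).
mock_predictions = [
    {"class_ids": [0], "probabilities": [0.91, 0.09]},
    {"class_ids": [1], "probabilities": [0.32, 0.68]},
]

# class_ids holds one element per example, so take index 0 to get a scalar label.
labels = [pred["class_ids"][0] for pred in mock_predictions]
print(labels)  # [0, 1]

# The predicted class is the position of the larger probability.
argmax = [max(range(2), key=lambda i: pred["probabilities"][i]) for pred in mock_predictions]
print(argmax == labels)  # True
```

Flattening to scalar labels gives `classification_report` a one-dimensional list to compare against `y_test`.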


**Import classification_report from sklearn.metrics, then see if you can figure out how to use it to get a full report of your model's performance on the test data.**

In [101]:
from sklearn.metrics import classification_report

In [113]:
print(classification_report(y_test, predictions))

              precision    recall  f1-score   support

           0       0.87      0.90      0.89      7436
           1       0.64      0.59      0.61      2333

    accuracy                           0.82      9769
   macro avg       0.76      0.74      0.75      9769
weighted avg       0.82      0.82      0.82      9769
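
The summary rows can be cross-checked from the per-class rows. The sketch below recomputes the class-1 F1 score, the macro and weighted F1 averages, and the accuracy of always predicting class 0. It uses only the rounded numbers printed in the report, so results agree to two decimals at best.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

support = {0: 7436, 1: 2333}          # rows per class in the test set
f1_by_class = {0: 0.89, 1: 0.61}      # f1-score column of the report
total = sum(support.values())

class1_f1 = f1(0.64, 0.59)
macro_f1 = sum(f1_by_class.values()) / len(f1_by_class)
weighted_f1 = sum(f1_by_class[c] * support[c] for c in support) / total
majority_baseline = support[0] / total  # accuracy of always predicting class 0

print(round(class1_f1, 2))          # 0.61
print(round(macro_f1, 2))           # 0.75
print(round(weighted_f1, 2))        # 0.82
print(round(majority_baseline, 2))  # 0.76
```

The model's 0.82 accuracy is therefore about six points above the 0.76 obtained by always predicting `<=50K`. Class 1 (`>50K`) is the harder class: a recall of 0.59 means roughly four in ten high earners are predicted as `<=50K`.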

