# Week 2 COMP188
## Dissecting Filipino family income and expenditure dataset

This week we are dissecting the [Filipino family income and expenditure dataset from Kaggle](https://www.kaggle.com/grosvenpaul/family-income-and-expenditure), using the methods from [Google's tutorial on TensorFlow](https://developers.google.com/machine-learning/crash-course/first-steps-with-tensorflow/video-lecture).

We'll be using TensorFlow's `LinearRegressor` to predict total family expenditure from features of the family head as well as other family metrics.

In [73]:
import tensorflow as tf
from tensorflow.python.data import Dataset
from sklearn import metrics

import numpy as np
import pandas as pd

import math

from IPython import display
from matplotlib import cm
from matplotlib import gridspec
from matplotlib import pyplot as plt

import os
tf.__version__ 
# I ran into problems with TensorFlow being on an ancient version in my desktop's Anaconda install; it seems to work fine on my laptop.

'1.6.0'

We'll load and examine our data.

In [44]:
df = pd.read_csv('Family Income and Expenditure.csv')

In [None]:
df.head()

I also want to get used to using pandas so I'll do some basic dataframe tasks. We'll get rid of some unnecessary data in order to run it faster and make it cleaner to look at.

We'll then sum all the individual expense columns into a single total, and afterwards remove the individual columns.

In [45]:
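import re

# Drop the pre-computed food total before summing the individual expense columns.
if "Total Food Expenditure" in df.columns:
    df = df.drop("Total Food Expenditure", axis=1)

# Collect every column whose name ends in "expenditure" or "expenses".
reg = re.compile(r'.*(([Ee]xpenditure)|([Ee]xpenses))$')
expenditure_types = [var for var in df.columns if reg.match(var)]
# Row-wise sum of all the individual expense columns.
df["Total Expenditures"] = df[expenditure_types].sum(axis=1)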
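To see what the regular expression in the cell above selects, here is a minimal sketch on a tiny made-up frame. The column names and numbers are invented for illustration and are not taken from the dataset; only the pattern is the same.

```python
import re
import pandas as pd

# Toy frame; names and values are invented for illustration only.
toy = pd.DataFrame({
    "Bread Expenditure": [100, 200],
    "Medical Care Expenditure": [10, 20],
    "Snack Expenses": [5, 0],
    "Household Head Age": [40, 50],
})

# Same pattern as above: names ending in "expenditure" or "expenses".
pattern = re.compile(r".*(([Ee]xpenditure)|([Ee]xpenses))$")
expense_cols = [col for col in toy.columns if pattern.match(col)]

# Row-wise sum over the selected columns only.
toy["Total Expenditures"] = toy[expense_cols].sum(axis=1)

print(expense_cols)
print(toy["Total Expenditures"].tolist())
```

The age column is left out of the sum because its name does not end in either suffix.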

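We'll also get rid of the `Number of *`, `House *` and `Type *` columns, along with the individual expense columns we just summed. Using `Total Household Income` as a predictor would be cheating, since it very likely correlates highly with total expenditure; note that the cell below does not actually drop it, so it still shows up in the output.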

In [46]:
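# re.match anchors at the start of the name, so these only catch columns that begin with the prefix.
reg_number = re.compile('Number.*')
reg_house = re.compile('House .*')
reg_type = re.compile('Type .*')
remove = [var for var in df.columns if reg_number.match(var)
                                    or reg_house.match(var)
                                    or reg_type.match(var)] + expenditure_types

# Drop only the columns that are still present, so the cell can be re-run safely.
df = df.drop([var for var in remove if var in df.columns], axis=1)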
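One detail of the filter above is that `re.match` only matches at the start of a string. The sketch below uses the same three patterns on a short list of column names; the first two names are hypothetical examples, the last two appear in this notebook.

```python
import re

columns = [
    "Number of Bedrooms",              # hypothetical name, starts with "Number"
    "House Floor Area",                # hypothetical name, starts with "House "
    "Household Head Age",              # starts with "House" but not with "House "
    "Total Number of Family members",  # contains "Number" but does not start with it
]
patterns = [re.compile(p) for p in ("Number.*", "House .*", "Type .*")]

# A column is dropped if any pattern matches from the first character on.
to_drop = [c for c in columns if any(p.match(c) for p in patterns)]
kept = [c for c in columns if c not in to_drop]

print(to_drop)
print(kept)
```

This is why `Total Number of Family members` survives the filter even though its name contains "Number". Swapping `match` for `search` would catch it as well, if that were the intent.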

In [66]:
df.head()

| | Total Household Income | Region | Main Source of Income | Agricultural Household indicator | Imputed House Rental Value | Total Income from Entrepreneurial Acitivites | Household Head Sex | Household Head Age | Household Head Marital Status | Household Head Highest Grade Completed | ... | Household Head Class of Worker | Total Number of Family members | Members with age less than 5 year old | Members with age 5 - 17 years old | Total number of family members employed | Tenure Status | Toilet Facilities | Electricity | Main Source of Water Supply | Total Expenditures |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 34016 | 84779 | Caraga | Wage/Salaries | 1 | 12000 | 16380 | Male | 49 | Married | Grade 1 | ... | Worked for private establishment | 2 | 0 | 0 | 1 | Own house, rent-free lot with consent of owner | Water-sealed, sewer septic tank, used exclusiv... | 1 | Shared, faucet, community water system | 90928 |
| 12477 | 153858 | XI - Davao Region | Wage/Salaries | 0 | 10200 | 21280 | Male | 36 | Married | Second Year College | ... | Worked for private establishment | 4 | 0 | 2 | 1 | Rent-free house and lot with consent of owner | Water-sealed, sewer septic tank, shared with o... | 1 | Own use, tubed/piped deep well | 128649 |
| 24153 | 38106 | VII - Central Visayas | Other sources of Income | 0 | 15000 | 1820 | Female | 58 | Divorced/Separated | Elementary Graduate | ... | Self-employed wihout any employee | 1 | 0 | 0 | 0 | Own or owner-like possession of house and lot | Water-sealed, sewer septic tank, used exclusiv... | 1 | Own use, faucet, community water system | 38998 |
| 8710 | 195036 | IVA - CALABARZON | Wage/Salaries | 0 | 0 | 0 | Male | 46 | Married | First Year High School | ... | Worked for private establishment | 6 | 0 | 1 | 3 | Rent house/room including lot | Water-sealed, sewer septic tank, used exclusiv... | 1 | Own use, faucet, community water system | 178844 |
| 2475 | 440560 | III - Central Luzon | Wage/Salaries | 0 | 24000 | 54050 | Male | 61 | Married | First Year Post Secondary | ... | Employer in own family-operated farm or business | 4 | 0 | 0 | 2 | Own or owner-like possession of house and lot | Water-sealed, sewer septic tank, used exclusiv... | 1 | Own use, faucet, community water system | 280357 |


## Making the Model!
So now we want to predict `Total Expenditures` from household parameters. Let's first shuffle the data.

In [47]:
df = df.reindex(np.random.permutation(df.index))
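As a quick sanity check on what this shuffle does, here is a small sketch on a toy frame. The seeded `RandomState` is an addition for repeatability and is not part of the cell above; `sample(frac=1)` is shown as an alternative way to get the same kind of shuffle.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data.
toy = pd.DataFrame({"age": [30, 40, 50, 60]}, index=[0, 1, 2, 3])

rng = np.random.RandomState(0)  # fixed seed (an assumption) so runs are repeatable
shuffled = toy.reindex(rng.permutation(toy.index))

# Alternative with the same effect: sample every row without replacement.
alt = toy.sample(frac=1, random_state=0)

# The rows are reordered, but each index label still points at its original row.
print(shuffled)
```

Because the original index labels travel with the rows, anything that later aligns on the index (rather than on position) will silently undo the shuffle.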

### Defining our feature
If we only use a single predictor, we could pick a variable like `Imputed House Rental Value` or `Total Income from Entrepreneurial Acitivites` (spelled that way in the dataset), but their effect would be a little too obvious and statistically significant.

It would seem interesting to use `Household Head Age` as a predictor, so we'll use it as a numerical feature.

In [71]:
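# Double brackets keep x as a one-column DataFrame, so dict(x) in the input function
# maps the column name to its values (a plain Series would map row labels to scalars).
x = df[["Household Head Age"]]
x_attr = [tf.feature_column.numeric_column("Household Head Age")]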

### Defining our target

In [69]:
Y = df["Total Expenditures"]

### Starting our Linear Regression Engine
We'll use the `GradientDescentOptimizer` to train our model.

The TensorFlow tutorial also uses gradient clipping to cap the magnitude of the gradients; without it, very large gradients could make gradient descent drastically overshoot.

In [61]:
optim = tf.train.GradientDescentOptimizer(learning_rate=0.0000001)
optim = tf.contrib.estimator.clip_gradients_by_norm(optim, 5.0)

lin_regress = tf.estimator.LinearRegressor(
    feature_columns = x_attr,
    optimizer = optim
)

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_model_dir': 'C:\\Users\\Ollie\\AppData\\Local\\Temp\\tmpvgp9l2fm', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x0000005CB4CFAA90>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}
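To make the idea of clipping by norm concrete, here is a plain-Python sketch. It illustrates the general technique only and is not TensorFlow's implementation; the function name `clip_by_norm` and the example gradients are made up for this illustration.

```python
import math

def clip_by_norm(gradient, clip_norm):
    """Rescale `gradient` so that its L2 norm is at most `clip_norm`."""
    norm = math.sqrt(sum(g * g for g in gradient))
    if norm <= clip_norm:
        # Already small enough: leave the gradient untouched.
        return list(gradient)
    # Too large: shrink every component by the same factor, keeping the direction.
    scale = clip_norm / norm
    return [g * scale for g in gradient]

small = clip_by_norm([3.0, 4.0], 5.0)    # norm is 5.0, so it is left alone
large = clip_by_norm([30.0, 40.0], 5.0)  # norm is 50.0, so it is scaled down

print(small)
print(large)
```

The direction of the update is preserved; only its length is capped, which is what keeps a single huge gradient from throwing the weights far past the minimum.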


## Defining our input function
Our input function converts our pandas data into NumPy arrays and wraps them in a TensorFlow `Dataset`, which the estimator can easily process.

In [63]:
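def inputer(features, targets, batch_size=1, shuffle=True, num_epochs=None):
    # Convert the pandas columns into a dict of NumPy arrays keyed by column name.
    features = {key:np.array(value) for key,value in dict(features).items()}
    
    # Build a Dataset, then batch it and repeat it (num_epochs=None repeats indefinitely).
    ds = Dataset.from_tensor_slices((features, targets))
    ds = ds.batch(batch_size).repeat(num_epochs)
    
    # Optionally shuffle using a buffer of 10,000 elements.
    if shuffle:
        ds = ds.shuffle(buffer_size=10000)
    
    # Return the next batch of features and labels.
    features, labels = ds.make_one_shot_iterator().get_next()
    
    return features, labels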
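The first line of the input function relies on `dict(features)` producing a mapping from column name to values. That holds for a DataFrame but not for a Series, which is easy to check on a toy frame (the three ages below are just sample values):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"Household Head Age": [49, 36, 58]})

as_frame = toy[["Household Head Age"]]  # double brackets: one-column DataFrame
as_series = toy["Household Head Age"]   # single brackets: Series

# Same dict comprehension as in the input function.
frame_dict = {key: np.array(value) for key, value in dict(as_frame).items()}
series_dict = {key: np.array(value) for key, value in dict(as_series).items()}

print(list(frame_dict))   # keys are column names
print(list(series_dict))  # keys are row labels
```

With a Series the keys are row labels and each value is a single scalar, which is probably not what a feature column keyed on `"Household Head Age"` expects, so passing a DataFrame is the safer choice.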

## Training the Model

In [72]:
lin_regress.train(
    input_fn = lambda:inputer(x, Y),
    steps = 100
)

IndexError: list index out of range

## Let's evaluate it.
We'll use MSE and RMSE to evaluate the effectiveness of the model. 

In [75]:
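pred = lin_regress.predict(input_fn=lambda: inputer(x, Y, num_epochs=1, shuffle=False))
# Each prediction is a dict; the estimator stores the value under the "predictions" key.
pred = np.array([pred_row["predictions"][0] for pred_row in pred])

mse = metrics.mean_squared_error(Y, pred)
sum_stats = {
    "mse": mse,
    "rmse": math.sqrt(mse)
}
# The % operator needs a tuple when formatting more than one value.
print("MSE: %0.3f\tRMSE: %0.3f" % (sum_stats["mse"], sum_stats["rmse"]))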

ValueError: Could not find trained model in model_dir: C:\Users\Ollie\AppData\Local\Temp\tmpvgp9l2fm.
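For reference, this is how MSE and RMSE are computed by hand. The numbers below are made up for the example and are not model output.

```python
import math

# Made-up targets and predictions, purely for illustration.
actual = [100.0, 200.0, 300.0]
predicted = [110.0, 190.0, 330.0]

# MSE: the mean of the squared differences.
squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
mse = sum(squared_errors) / len(squared_errors)

# RMSE: the square root of the MSE, which puts the error back in the target's units.
rmse = math.sqrt(mse)

print("MSE: %0.3f\tRMSE: %0.3f" % (mse, rmse))
```

RMSE is usually the easier of the two to interpret here, since it is expressed in the same units as `Total Expenditures`.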

In [76]:
# Dicts don't support +=; update() merges the new entries in place.
sum_stats.update({
    "min": df["Total Expenditures"].min(),
    "max": df["Total Expenditures"].max()
})

print("MIN: %0.3f\tMAX: %0.3f" % (sum_stats["min"], sum_stats["max"]))

NameError: name 'sum_stats' is not defined
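The point of printing the minimum and maximum is to put the RMSE in context: an error is large or small only relative to the spread of the target. A rough sketch of that comparison, using the five `Total Expenditures` values from the `df.head()` output and a hypothetical RMSE (not a result from this model):

```python
# Total Expenditures of the five rows shown in the df.head() output.
targets = [90928.0, 128649.0, 38998.0, 178844.0, 280357.0]

# Hypothetical RMSE, chosen only to illustrate the comparison.
rmse = 50000.0

target_range = max(targets) - min(targets)
ratio = rmse / target_range

print("Range: %0.1f\tRMSE as a fraction of the range: %0.3f" % (target_range, ratio))
```

A ratio close to 1 would mean the model is barely better than guessing anywhere in the range, while a small ratio suggests the predictions track the target reasonably well.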

In [None]:
calib_data = pd.DataFrame()
calib_data["predictions"] = pd.Series(pred)
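# Use .values so the targets line up with the predictions by position, not by the shuffled index labels.
calib_data["actual"] = Y.values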
calib_data.describe()
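One thing to watch when pairing predictions with targets: pandas aligns a Series on its index labels when it is assigned as a column. Since the data was shuffled with `reindex`, the labels are no longer in positional order. A toy sketch of the difference (the values are invented):

```python
import pandas as pd

# Targets carrying a shuffled index, as they would after reindexing with a permutation.
targets = pd.Series([10.0, 20.0, 30.0], index=[2, 0, 1])
# Predictions in the same row order as `targets`, but with no index of their own.
predictions = [11.0, 19.0, 31.0]

calib = pd.DataFrame({"predictions": predictions})  # default index 0, 1, 2
calib["by_label"] = targets            # aligned on index labels, so rows get reordered
calib["by_position"] = targets.values  # plain array, so row order is kept

print(calib)
```

`describe()` gives the same summary either way, because the set of values is identical, but any row-by-row comparison of predictions and targets needs the positional version.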

In [None]:
sample = df.sample(400)

In [None]:
x_0 = sample["Household Head Age"].min()
x_1 = sample["Household Head Age"].max()

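# The weight variable should be named after our feature column, not `total_rooms` from the Google tutorial.
weight = lin_regress.get_variable_value("linear/linear_model/Household Head Age/weights")[0]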
bias = lin_regress.get_variable_value("linear/linear_model/bias_weights")

y_0 = weight * x_0 + bias
y_1 = weight * x_1 + bias

plt.plot([x_0, x_1], [y_0, y_1], c='r')

plt.ylabel("Total Expenditure")
plt.xlabel("Household Head Age")

plt.scatter(sample["Household Head Age"], sample["Total Expenditures"])

plt.show()
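The red line in the plot only needs two points, because a single-feature linear model is just `y = weight * x + bias`. A small sketch of that calculation, with a hypothetical helper `line_endpoints` and a made-up weight and bias (not values learned from the dataset):

```python
def line_endpoints(weight, bias, x_min, x_max):
    """Return the two end points of the line y = weight * x + bias on [x_min, x_max]."""
    return (x_min, weight * x_min + bias), (x_max, weight * x_max + bias)

# Hypothetical weight and bias, used only to show the arithmetic.
start, end = line_endpoints(weight=2.0, bias=10.0, x_min=20, x_max=80)

print(start, end)
```

Passing the x and y coordinates of these two points to `plt.plot` draws the fitted line across the sampled age range.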