# Week 2 COMP188
## Dissecting the Filipino family income and expenditure dataset

This week we are dissecting the [Filipino family income and expenditure dataset from Kaggle](https://www.kaggle.com/grosvenpaul/family-income-and-expenditure), using the methods from [Google's tutorial on TensorFlow](https://developers.google.com/machine-learning/crash-course/first-steps-with-tensorflow/video-lecture).

We'll be using TensorFlow's Linear Regressor to predict total family expenditure from features of the family head as well as other family metrics.

In [1]:
# import tensorflow as tf
# from tensorflow.python.data import Dataset
from sklearn import linear_model
from sklearn.metrics import mean_squared_error, r2_score

import numpy as np
import pandas as pd

import math

# from IPython import display
from matplotlib import cm
from matplotlib import gridspec
from matplotlib import pyplot as plt

# import os
# tf.__version__ 
# I ran into problems with TensorFlow being on an ancient version in my desktop's Anaconda install; it seems to work fine on my laptop.

We'll load and examine our data.

In [2]:
df = pd.read_csv('Family Income and Expenditure.csv')

In [None]:
df.head()

I also want to get used to using pandas, so I'll do some basic dataframe tasks. We'll get rid of some unnecessary columns so that things run faster and the dataframe is cleaner to look at.

We'll then sum all the individual expense columns into a single total, and afterwards drop the individual expense columns.

In [3]:
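import re

# Drop the food total before summing the individual expense columns.
if "Total Food Expenditure" in df.columns:
    df = df.drop(columns="Total Food Expenditure")

# Match column names ending in "Expenditure"/"expenditure" or "Expenses"/"expenses".
reg = re.compile(r'.*(([Ee]xpenditure)|([Ee]xpenses))$')
expenditure_types = [var for var in df.columns if reg.match(var)]
# Sum the matched columns row-wise into a single total.
df["Total Expenditures"] = df[expenditure_types].sum(axis=1)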
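To see what the regular expression picks up, here is a minimal sketch on a made-up three-row frame. The column names and values in `toy` are illustrative assumptions, not rows taken from the Kaggle file.

```python
import re

import pandas as pd

# Toy frame: column names and values are made up for illustration.
toy = pd.DataFrame({
    "Housing and water Expenditure": [100, 200, 300],
    "Medical Care Expenditure": [10, 20, 30],
    "Crop Farming and Gardening expenses": [1, 2, 3],
    "Household Head Age": [40, 50, 60],
})

# Same pattern as in the cell above: names ending in Expenditure/expenditure
# or Expenses/expenses.
pattern = re.compile(r".*(([Ee]xpenditure)|([Ee]xpenses))$")
expense_cols = [col for col in toy.columns if pattern.match(col)]

# Row-wise sum across the matched columns only.
toy["Total Expenditures"] = toy[expense_cols].sum(axis=1)

print(expense_cols)
print(toy["Total Expenditures"].tolist())  # [111, 222, 333]
```

Note that the new `Total Expenditures` column ends in an "s", so the pattern does not match it if the cell is run a second time.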

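We'll also get rid of the `Number of *` columns (along with the `House *` and `Type *` columns), because using them would be cheating: they would very likely correlate highly with total expenditure. The same concern applies to `Total Household Income`, which the cell below leaves in the dataframe, so we should avoid using it as a predictor.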

In [4]:
reg_number = re.compile('Number.*')
reg_house = re.compile('House .*')
reg_type = re.compile('Type .*')
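# Columns to drop: anything starting with "Number", "House " or "Type ",
# plus the individual expense columns that were summed earlier.
remove = [var for var in df.columns if reg_number.match(var)
                                    or reg_house.match(var)
                                    or reg_type.match(var)] + expenditure_types

# Drop only the columns that are still present, so the cell can be re-run safely.
df = df.drop(columns=[var for var in remove if var in df.columns])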
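As an alternative sketch of the same prefix-based clean-up, pandas can select columns by regular expression directly with `DataFrame.filter`. The toy column names below are assumptions chosen to resemble the dataset, not a claim about its exact contents.

```python
import pandas as pd

# Toy frame with made-up columns for illustration.
toy = pd.DataFrame({
    "Number of Television": [1, 2],
    "House Floor Area": [80, 42],
    "Type of Roof": ["Strong", "Light"],
    "Household Head Age": [49, 40],
})

# One anchored pattern covering the three prefixes used above.
# "House " includes the trailing space, so "Household ..." columns are kept.
to_drop = toy.filter(regex=r"^(Number|House |Type )").columns
trimmed = toy.drop(columns=to_drop)

print(list(trimmed.columns))  # ['Household Head Age']
```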

In [5]:
df.head()

| Unnamed: 0 | Total Household Income | Region | Main Source of Income | Agricultural Household indicator | Imputed House Rental Value | Total Income from Entrepreneurial Acitivites | Household Head Sex | Household Head Age | Household Head Marital Status | Household Head Highest Grade Completed | ... | Household Head Class of Worker | Total Number of Family members | Members with age less than 5 year old | Members with age 5 - 17 years old | Total number of family members employed | Tenure Status | Toilet Facilities | Electricity | Main Source of Water Supply | Total Expenditures |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 480332 | CAR | Wage/Salaries | 0 | 30000 | 44370 | Female | 49 | Single | Teacher Training and Education Sciences Programs | ... | Worked for government/government corporation | 4 | 0 | 1 | 1 | Own or owner-like possession of house and lot | Water-sealed, sewer septic tank, used exclusiv... | 1 | Own use, faucet, community water system | 317889 |
| 1 | 198235 | CAR | Wage/Salaries | 0 | 27000 | 0 | Male | 40 | Married | Transport Services Programs | ... | Worked for private establishment | 3 | 0 | 1 | 2 | Own or owner-like possession of house and lot | Water-sealed, sewer septic tank, used exclusiv... | 1 | Own use, faucet, community water system | 185834 |
| 2 | 82785 | CAR | Wage/Salaries | 1 | 7200 | 0 | Male | 39 | Married | Grade 3 | ... | Worked for private establishment | 6 | 0 | 4 | 3 | Own or owner-like possession of house and lot | Water-sealed, sewer septic tank, shared with o... | 0 | Shared, faucet, community water system | 116685 |
| 3 | 107589 | CAR | Wage/Salaries | 0 | 6600 | 15580 | Male | 52 | Married | Elementary Graduate | ... | Employer in own family-operated farm or business | 3 | 0 | 3 | 2 | Own or owner-like possession of house and lot | Closed pit | 1 | Own use, faucet, community water system | 145482 |
| 4 | 189322 | CAR | Wage/Salaries | 0 | 16800 | 75687 | Male | 65 | Married | Elementary Graduate | ... | Self-employed wihout any employee | 4 | 0 | 0 | 2 | Own or owner-like possession of house and lot | Water-sealed, sewer septic tank, used exclusiv... | 1 | Own use, faucet, community water system | 188119 |


## Making the Model!
So now we wanna predict `Total Expenditures` from household parameters. Let's first shuffle the data.

In [6]:
df = df.reindex(np.random.permutation(df.index))
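The shuffle above gives a different order on every run. If a reproducible order is wanted, one option is to seed the random generator; the sketch below does this on a made-up five-row frame, and the seed value `0` is an arbitrary assumption.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up ages for illustration.
toy = pd.DataFrame({"age": [49, 40, 39, 52, 65]})

# Seeded generator so the shuffle is the same on every run.
rng = np.random.default_rng(0)
shuffled = toy.reindex(rng.permutation(toy.index.to_numpy()))

# A similar one-liner using pandas' own sampler (the order may differ from the one above).
shuffled_alt = toy.sample(frac=1, random_state=0)

# Both keep every row exactly once; only the order changes.
print(sorted(shuffled.index) == list(toy.index))
print(sorted(shuffled_alt.index) == list(toy.index))
```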

### Defining our feature
For a single predictor, we could use variables like `Imputed House Rental Value` or `Total Income from Entrepreneurial Acitivites`, but their effect would be a little too obvious and statistically significant.

It would seem interesting to use `Household Head Age` as a predictor, so we'll use it as a numerical feature.

In [13]:
X = df["Household Head Age"]
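One thing to watch if this feature is later handed to a scikit-learn estimator: those estimators expect a two-dimensional feature matrix, whereas selecting a single column with one pair of brackets gives a one-dimensional Series. A minimal sketch of the difference on made-up values:

```python
import pandas as pd

# Toy frame with made-up ages for illustration.
toy = pd.DataFrame({"Household Head Age": [49, 40, 39]})

as_series = toy["Household Head Age"]    # 1-D Series
as_frame = toy[["Household Head Age"]]   # 2-D DataFrame with one column

print(as_series.shape, as_frame.shape)  # (3,) (3, 1)
```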

### Defining our target

In [14]:
y = df["Total Expenditures"]

### Splitting data into train and test

In [15]:
# len(df) * 0.8 is a float, and pandas raises
# "TypeError: cannot do slice indexing ... of <class 'float'>" on float slice bounds,
# so convert the 80% cut-off to an integer position first.
split = int(len(df) * 0.8)

X_train = X.iloc[:split]
X_test = X.iloc[split:]

y_train = y.iloc[:split]
y_test = y.iloc[split:]
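An 80/20 split can also be sketched with scikit-learn's `train_test_split`, followed by a linear fit using the `sklearn` imports from the top of the notebook. The data below is synthetic (expenditure is an exact linear function of age), the column names `age` and `expenditure` are placeholders, and the seed is arbitrary, so this only illustrates the mechanics and says nothing about the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 50 rows, exactly linear by construction.
ages = np.arange(20, 70)
toy = pd.DataFrame({"age": ages, "expenditure": 1000.0 * ages + 5000.0})

X = toy[["age"]]        # 2-D feature matrix
y = toy["expenditure"]  # 1-D target

# 80/20 split; train_test_split shuffles by default and works in whole rows,
# so no float slice bounds are involved.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)

print(len(X_train), len(X_test))  # 40 10
print(mean_squared_error(y_test, pred), r2_score(y_test, pred))
```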