# Costa Rican Household Poverty Level Prediction


"The Inter-American Development Bank is asking the Kaggle community for help with income qualification for some of the world's poorest families. [...]

Here's the backstory: Many social programs have a hard time making sure the right people are given enough aid. It’s especially tricky when a program focuses on the poorest segment of the population. The world’s poorest typically can’t provide the necessary income and expense records to prove that they qualify. [...]"

For more information on the competition and to download the training and test data see here: https://www.kaggle.com/c/costa-rican-household-poverty-prediction#description

This notebook shows basic data cleaning, feature engineering and the usage of a LightGBM model.

In [3]:
# load necessary libraries

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from xgboost import XGBClassifier
import lightgbm as lgb
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import KFold, StratifiedKFold
from sklearn.model_selection import train_test_split
from sklearn.metrics import get_scorer
import sklearn

# Input data files:

import os
print(os.listdir("data"))

['test.csv', 'train.csv']


## Data cleaning and feature engineering

First of all training and test data are loaded.

In [16]:
df = pd.read_csv('data/train.csv')
df_test = pd.read_csv('data/test.csv')

Let's have a look at the data:

In [5]:
df.head()

Unnamed: 0,Id,v2a1,hacdor,rooms,hacapo,v14a,refrig,v18q,v18q1,r4h1,...,SQBescolari,SQBage,SQBhogar_total,SQBedjefe,SQBhogar_nin,SQBovercrowding,SQBdependency,SQBmeaned,agesq,Target
0,ID_279628684,190000.0,0,3,0,1,1,0,,0,...,100,1849,1,100,0,1.0,0.0,100.0,1849,4
1,ID_f29eb3ddd,135000.0,0,4,0,1,1,1,1.0,0,...,144,4489,1,144,0,1.0,64.0,144.0,4489,4
2,ID_68de51c94,,0,8,0,1,1,0,,0,...,121,8464,1,0,0,0.25,64.0,121.0,8464,4
3,ID_d671db89c,180000.0,0,5,0,1,1,1,1.0,0,...,81,289,16,121,4,1.777778,1.0,121.0,289,4
4,ID_d56d6f5f5,180000.0,0,5,0,1,1,1,1.0,0,...,121,1369,16,121,4,1.777778,1.0,121.0,1369,4


LGBM can only be used with integer or float values, so we need to find any columns with type 'object' first and convert them to numbers.

In [6]:
categorical_columns = [col for col in df.columns if df[col].dtype == 'object']
categorical_columns

['Id', 'idhogar', 'dependency', 'edjefe', 'edjefa']

Replace 'yes' and 'no' -values in columns 'dependency', 'edjefe' and 'edjefa' with 1 and 0 respectivly. Id and idhogar can be ignored as they won't be used as features in the model.

In [17]:
def replace_yes_no(df):
    df[['dependency', 'edjefe', 'edjefa']] = df[['dependency','edjefe', 'edjefa']].replace({'yes':1, 'no':1}).astype(float)
    return df

df = replace_yes_no(df)
df_test = replace_yes_no(df_test)

The variable v18q indicates whether the household owns a tablet, while v18q1 contains the number of tablets the household owns. 
v18q1 is null when v18q = 0, so null values in v18q1 are replaced with 0 and v18q is discarded.

In [18]:
def tablets(df):
    df['v18q1'][df['v18q1'].isnull()] = 0
    df = df.drop(['v18q'], axis = 1)
    return df

df = tablets(df)
df_test = tablets(df_test)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  


Aggregate escolari per household (on idhogar) using mean and max

In [19]:
def escolari(df):
    escolari_mean = df.groupby(['idhogar'], as_index = False)['escolari'].mean().rename(columns = {'mean': 'escolari_mean'})
    escolari_mean.columns = ['idhogar', 'escolari_mean']

    escolari_max = df.groupby(['idhogar'], as_index = False)['escolari'].max().rename(columns = {'max': 'escolari_max'})
    escolari_max.columns = ['idhogar', 'escolari_max']

    df = df.merge(escolari_mean, how = 'left', on = 'idhogar')
    df = df.merge(escolari_max, how = 'left', on = 'idhogar')
    
    return df

df = escolari(df)
df_test = escolari(df_test)

There are several binary variables that can be combined to one ordinal variable:

1. water provision (columns abastaguadentro, abastaguafuera, abastaguano)
2. walls (columns epared1, epared2, epared3)
3. roof (columns etecho1, etecho2, etecho3)
4. floor (columns eviv1 eviv2, eviv3)
5. education level (columns instlevel1 to instlevel9)
6. tipovivi (columns tipovivi1 to tipovivi5)
7. rubbish (columns elimbasu1 to elimbasu6)
8. energy (colulmns energcocinar1 to energcocinar4)
9. toilet (columns sanitario1 to sanitario6)

Water provision:

In [20]:
def water_provision(df):
    df['water_prov'] = 0
    df.loc[df['abastaguadentro'] == 1, 'water_prov'] = 2
    df.loc[df['abastaguafuera'] == 1, 'water_prov'] = 1
    df.loc[df['abastaguano'] == 1, 'water_prov'] = 0
    df = df.drop(['abastaguadentro','abastaguafuera', 'abastaguano'], axis = 1)
    return df

df = water_provision(df)
df_test = water_provision(df_test)

Walls, roof and floor:

In [21]:
def walls_roof_floor(df):
    df['walls'] = 0
    df.loc[df['epared1'] == 1, 'walls'] = 1
    df.loc[df['epared2'] == 1, 'walls'] = 2
    df.loc[df['epared3'] == 1, 'walls'] = 3
    
    df['roof'] = 0
    df.loc[df['etecho1'] == 1, 'roof'] = 1
    df.loc[df['etecho2'] == 1, 'roof'] = 2
    df.loc[df['etecho3'] == 1, 'roof'] = 3
        
    df['floor'] = 0
    df.loc[df['eviv1'] == 1, 'floor'] = 1
    df.loc[df['eviv2'] == 1, 'floor'] = 2
    df.loc[df['eviv3'] == 1, 'floor'] = 3

    df = df.drop(['epared1','epared2', 'epared3', 'etecho1', 'etecho2', 'etecho3', 'eviv1', 'eviv2', 'eviv3'], axis = 1)
    
    return df

df = walls_roof_floor(df)
df_test = walls_roof_floor(df_test)

Education level:

In [22]:
def education_level(df):
    df['education'] = 0
    df.loc[df['instlevel1'] == 1, 'education'] = 1
    df.loc[df['instlevel2'] == 1, 'education'] = 2
    df.loc[df['instlevel3'] == 1, 'education'] = 3
    df.loc[df['instlevel4'] == 1, 'education'] = 4
    df.loc[df['instlevel5'] == 1, 'education'] = 5
    df.loc[df['instlevel6'] == 1, 'education'] = 6
    df.loc[df['instlevel7'] == 1, 'education'] = 7
    df.loc[df['instlevel8'] == 1, 'education'] = 8
    df.loc[df['instlevel9'] == 1, 'education'] = 9

    df = df.drop(['instlevel1','instlevel2', 'instlevel3', 'instlevel4', 'instlevel5', 'instlevel6', 'instlevel7', 'instlevel8', 'instlevel9'], axis = 1)
    
    return df

df = education_level(df)
df_test = education_level(df_test)

Tipovivi:

In [23]:
def tipovivi(df):
    df['tipovivi'] = 0
    df.loc[df['tipovivi1'] == 1, 'tipovivi'] = 1
    df.loc[df['tipovivi2'] == 1, 'tipovivi'] = 2
    df.loc[df['tipovivi3'] == 1, 'tipovivi'] = 3
    df.loc[df['tipovivi4'] == 1, 'tipovivi'] = 4
    df.loc[df['tipovivi5'] == 1, 'tipovivi'] = 5
    
    df = df.drop(['tipovivi1','tipovivi2', 'tipovivi3', 'tipovivi4', 'tipovivi5'], axis = 1)
    
    return df

df = tipovivi(df)
df_test = tipovivi(df_test)

Rubbish:

In [24]:
def rubbish(df):
    df['rubbish'] = 0
    df.loc[df['elimbasu1'] == 1, 'rubbish'] = 1
    df.loc[df['elimbasu2'] == 1, 'rubbish'] = 2
    df.loc[df['elimbasu3'] == 1, 'rubbish'] = 3
    df.loc[df['elimbasu4'] == 1, 'rubbish'] = 4
    df.loc[df['elimbasu5'] == 1, 'rubbish'] = 5
    df.loc[df['elimbasu6'] == 1, 'rubbish'] = 0
    
    df = df.drop(['elimbasu1','elimbasu2', 'elimbasu3', 'elimbasu4', 'elimbasu5', 'elimbasu6'], axis = 1)
    
    return df

df = rubbish(df)
df_test = rubbish(df_test)

Energy used for cooking:

In [25]:
def energy(df):
    df['energy'] = 0
    df.loc[df['energcocinar1'] == 1, 'energy'] = 1
    df.loc[df['energcocinar2'] == 1, 'energy'] = 2
    df.loc[df['energcocinar3'] == 1, 'energy'] = 3
    df.loc[df['energcocinar4'] == 1, 'energy'] = 4
    
    df = df.drop(['energcocinar1','energcocinar2', 'energcocinar3', 'energcocinar4'], axis = 1)
    
    return df

df = energy(df)
df_test = energy(df_test)

Toilet:

In [27]:
def toilet(df):
    df['toilet'] = 0
    df.loc[df['sanitario1'] == 1, 'toilet'] = 1
    df.loc[df['sanitario5'] == 1, 'toilet'] = 2
    df.loc[df['sanitario6'] == 1, 'toilet'] = 3
    df.loc[df['sanitario3'] == 1, 'toilet'] = 4
    df.loc[df['sanitario2'] == 1, 'toilet'] = 5
       
    df = df.drop(['sanitario1','sanitario2', 'sanitario3', 'sanitario5', 'sanitario6'], axis = 1)
    
    return df

df = toilet(df)
df_test = toilet(df_test)

Create new variables that might prove useful

In [28]:
def new_variables(df):
    df['rent_by_hhsize'] = df['v2a1'] / df['hhsize'] # rent by household size
    df['rent_by_people'] = df['v2a1'] / df['r4t3'] # rent by people in household
    df['rent_by_rooms'] = df['v2a1'] / df['rooms'] # rent by number of rooms
    df['rent_by_living'] = df['v2a1'] / df['tamviv'] # rent by number of persons living in the household
    df['rent_by_minor'] = df['v2a1'] / df['hogar_nin']
    df['rent_by_adult'] = df['v2a1'] / df['hogar_adul']
    df['children_by_adults'] = df['hogar_nin'] / df['hogar_adul']
    df['house_quali'] = df['walls'] + df['roof'] + df['floor']
    df['tablets_by_adults'] = df['v18q1'] / df['hogar_adul'] # number of tablets per adults
    df['ratio_nin'] = df['hogar_nin'] / df['hogar_adul'] # ratio children to adults
    return df

df = new_variables(df)
df_test = new_variables(df_test)

Now let's take a look at the data:

In [29]:
df.head(15)

Unnamed: 0,Id,v2a1,hacdor,rooms,hacapo,v14a,refrig,v18q1,r4h1,r4h2,...,rent_by_hhsize,rent_by_people,rent_by_rooms,rent_by_living,rent_by_minor,rent_by_adult,children_by_adults,house_quali,tablets_by_adults,ratio_nin
0,ID_279628684,190000.0,0,3,0,1,1,0.0,0,1,...,190000.0,190000.0,63333.333333,190000.0,inf,190000.0,0.0,4,0.0,0.0
1,ID_f29eb3ddd,135000.0,0,4,0,1,1,1.0,0,1,...,135000.0,135000.0,33750.0,135000.0,inf,135000.0,0.0,6,1.0,0.0
2,ID_68de51c94,,0,8,0,1,1,0.0,0,0,...,,,,,,,0.0,8,0.0,0.0
3,ID_d671db89c,180000.0,0,5,0,1,1,1.0,0,2,...,45000.0,45000.0,36000.0,45000.0,90000.0,90000.0,1.0,9,0.5,1.0
4,ID_d56d6f5f5,180000.0,0,5,0,1,1,1.0,0,2,...,45000.0,45000.0,36000.0,45000.0,90000.0,90000.0,1.0,9,0.5,1.0
5,ID_ec05b1a7b,180000.0,0,5,0,1,1,1.0,0,2,...,45000.0,45000.0,36000.0,45000.0,90000.0,90000.0,1.0,9,0.5,1.0
6,ID_e9e0c1100,180000.0,0,5,0,1,1,1.0,0,2,...,45000.0,45000.0,36000.0,45000.0,90000.0,90000.0,1.0,9,0.5,1.0
7,ID_3e04e571e,130000.0,1,2,0,1,1,0.0,0,1,...,32500.0,32500.0,65000.0,32500.0,65000.0,65000.0,1.0,4,0.0,1.0
8,ID_1284f8aad,130000.0,1,2,0,1,1,0.0,0,1,...,32500.0,32500.0,65000.0,32500.0,65000.0,65000.0,1.0,4,0.0,1.0
9,ID_51f52fdd2,130000.0,1,2,0,1,1,0.0,0,1,...,32500.0,32500.0,65000.0,32500.0,65000.0,65000.0,1.0,4,0.0,1.0


## Modeling

For this classification we use LightGBM with stratified k-fold cross validation as the classes are quite imbalanced.

In [30]:
# Use all columns as features except Ids and Target
feats = [f for f in df.columns if f not in ['Id','Target','idhogar']]

# 10 folds
folds = StratifiedKFold(n_splits= 10, shuffle=True, random_state=1054)

# matrix for predictions
preds = np.zeros((df_test.shape[0], 4))

# iterate through folds
for n_fold, (train_idx, valid_idx) in enumerate(folds.split(df[feats], df['Target'])):
    print('Fold ', n_fold)
    train_x, train_y = df.iloc[train_idx], df['Target'].iloc[train_idx]
    valid_x, valid_y = df.iloc[valid_idx], df['Target'].iloc[valid_idx]
    
    # eliminate unnecessary features
    train_x = train_x[feats]
    valid_x = valid_x[feats]
    
    # create and fit model
    gbm = lgb.LGBMClassifier(n_jobs=4, random_state=0, class_weight='balanced', num_leaves = 100, learning_rate = 0.1, early_stopping_rounds = 200)
    gbm.fit(train_x, train_y, eval_set=[(train_x, train_y), (valid_x, valid_y)], 
                    verbose= 100, eval_metric = 'multi_error')
    
    # mean of predictions for argmax later on 
    preds += gbm.predict_proba(df_test[feats]) / folds.n_splits

Fold  0




Training until validation scores don't improve for 200 rounds.
[100]	training's multi_error: 0.000555967	valid_1's multi_error: 0.0438871
Did not meet early stopping. Best iteration is:
[99]	training's multi_error: 0.000509637	valid_1's multi_error: 0.0449321
Fold  1




Training until validation scores don't improve for 200 rounds.
[100]	training's multi_error: 0.000648629	valid_1's multi_error: 0.0564263
Did not meet early stopping. Best iteration is:
[100]	training's multi_error: 0.000648629	valid_1's multi_error: 0.0564263
Fold  2




Training until validation scores don't improve for 200 rounds.
[100]	training's multi_error: 0.000648629	valid_1's multi_error: 0.0512017
Did not meet early stopping. Best iteration is:
[100]	training's multi_error: 0.000648629	valid_1's multi_error: 0.0512017
Fold  3




Training until validation scores don't improve for 200 rounds.
[100]	training's multi_error: 0.000602298	valid_1's multi_error: 0.0585162
Did not meet early stopping. Best iteration is:
[100]	training's multi_error: 0.000602298	valid_1's multi_error: 0.0585162
Fold  4




Training until validation scores don't improve for 200 rounds.
[100]	training's multi_error: 0.000833951	valid_1's multi_error: 0.0334378
Did not meet early stopping. Best iteration is:
[100]	training's multi_error: 0.000833951	valid_1's multi_error: 0.0334378
Fold  5




Training until validation scores don't improve for 200 rounds.
[100]	training's multi_error: 0.000694959	valid_1's multi_error: 0.042887
Did not meet early stopping. Best iteration is:
[100]	training's multi_error: 0.000694959	valid_1's multi_error: 0.042887
Fold  6




Training until validation scores don't improve for 200 rounds.
[100]	training's multi_error: 0.000926441	valid_1's multi_error: 0.0481675
Did not meet early stopping. Best iteration is:
[97]	training's multi_error: 0.000926441	valid_1's multi_error: 0.0481675
Fold  7




Training until validation scores don't improve for 200 rounds.
[100]	training's multi_error: 0.00069483	valid_1's multi_error: 0.0345912
Did not meet early stopping. Best iteration is:
[98]	training's multi_error: 0.00069483	valid_1's multi_error: 0.0345912
Fold  8




Training until validation scores don't improve for 200 rounds.
[100]	training's multi_error: 0.00046322	valid_1's multi_error: 0.0555556
Did not meet early stopping. Best iteration is:
[100]	training's multi_error: 0.00046322	valid_1's multi_error: 0.0555556
Fold  9




Training until validation scores don't improve for 200 rounds.
[100]	training's multi_error: 0.000833797	valid_1's multi_error: 0.0608604
Did not meet early stopping. Best iteration is:
[100]	training's multi_error: 0.000833797	valid_1's multi_error: 0.0608604


## Create Submission

In [31]:
# predicted class is the one with the highest prediction value
pred_maj = np.argmax(preds, axis = 1) + 1

In [32]:
df_test['Target'] = pred_maj.astype(int)
df_test[['Id', 'Target']].to_csv('submission_180831_lgbm.csv', index= False)

In [33]:
df_test['Target'].value_counts()

4    16288
2     4263
3     1994
1     1311
Name: Target, dtype: int64