#### DESCRIPTION

Identify the level of income qualification needed for the families in Latin America.

#### Problem Statement Scenario:
Many social programs have a hard time ensuring that the right people are given enough aid. It’s tricky when a program focuses on the poorest segment of the population. This segment of the population can’t provide the necessary income and expense records to prove that they qualify.

In Latin America, a popular method called Proxy Means Test (PMT) uses an algorithm to verify income qualification. With PMT, agencies use a model that considers a family’s observable household attributes like the material of their walls and ceiling or the assets found in their homes to
classify them and predict their level of need.

While this is an improvement, accuracy remains a problem as the region’s population grows and poverty declines.

The Inter-American Development Bank (IDB)believes that new methods beyond traditional econometrics, based on a dataset of Costa Rican household characteristics, might help improve PMT’s performance.

#### Following actions should be performed:

* Identify the output variable.
* Understand the type of data.
* Check if there are any biases in your dataset.
* Check whether all members of the house have the same poverty level.
* Check if there is a house without a family head.
* Set poverty level of the members and the head of the house within a family.
* Count how many null values are existing in columns.
* Remove null value rows of the target variable.
* Predict the accuracy using random forest classifier.
* Check the accuracy using random forest with cross validation.

In [1]:
import numpy as np
import pandas as pd

from datetime import datetime



%matplotlib inline

import seaborn as sns
sns.set()

import warnings
warnings.filterwarnings('ignore')

In [2]:
train = pd.read_csv('IncomeQualificationDataset/train.csv')
test = pd.read_csv('IncomeQualificationDataset/test.csv')

In [3]:
train.head()

Unnamed: 0,Id,v2a1,hacdor,rooms,hacapo,v14a,refrig,v18q,v18q1,r4h1,...,SQBescolari,SQBage,SQBhogar_total,SQBedjefe,SQBhogar_nin,SQBovercrowding,SQBdependency,SQBmeaned,agesq,Target
0,ID_279628684,190000.0,0,3,0,1,1,0,,0,...,100,1849,1,100,0,1.0,0.0,100.0,1849,4
1,ID_f29eb3ddd,135000.0,0,4,0,1,1,1,1.0,0,...,144,4489,1,144,0,1.0,64.0,144.0,4489,4
2,ID_68de51c94,,0,8,0,1,1,0,,0,...,121,8464,1,0,0,0.25,64.0,121.0,8464,4
3,ID_d671db89c,180000.0,0,5,0,1,1,1,1.0,0,...,81,289,16,121,4,1.777778,1.0,121.0,289,4
4,ID_d56d6f5f5,180000.0,0,5,0,1,1,1,1.0,0,...,121,1369,16,121,4,1.777778,1.0,121.0,1369,4


In [4]:
test.head()

Unnamed: 0,Id,v2a1,hacdor,rooms,hacapo,v14a,refrig,v18q,v18q1,r4h1,...,age,SQBescolari,SQBage,SQBhogar_total,SQBedjefe,SQBhogar_nin,SQBovercrowding,SQBdependency,SQBmeaned,agesq
0,ID_2f6873615,,0,5,0,1,1,0,,1,...,4,0,16,9,0,1,2.25,0.25,272.25,16
1,ID_1c78846d2,,0,5,0,1,1,0,,1,...,41,256,1681,9,0,1,2.25,0.25,272.25,1681
2,ID_e5442cf6a,,0,5,0,1,1,0,,1,...,41,289,1681,9,0,1,2.25,0.25,272.25,1681
3,ID_a8db26a79,,0,14,0,1,1,1,1.0,0,...,59,256,3481,1,256,0,1.0,0.0,256.0,3481
4,ID_a62966799,175000.0,0,4,0,1,1,1,1.0,0,...,18,121,324,1,0,1,0.25,64.0,,324


In [5]:
train["Target"]

In [8]:
train['is_test']=0
test['is_test']=1

df = pd.concat([train, test], axis=0)

In [9]:
print(df.columns)

Index(['Id', 'v2a1', 'hacdor', 'rooms', 'hacapo', 'v14a', 'refrig', 'v18q',
       'v18q1', 'r4h1',
       ...
       'SQBage', 'SQBhogar_total', 'SQBedjefe', 'SQBhogar_nin',
       'SQBovercrowding', 'SQBdependency', 'SQBmeaned', 'agesq', 'Target',
       'is_test'],
      dtype='object', length=144)


In [10]:
#def dprint(*args,**kwargs):
#    print('[{}]'.format(datetime.datetime.now().strftime('%Y-%m-%d %H:%M')) + " ",join(map(str,args)),**kwargs)

In [11]:
#dprint('Clean Features')

In [12]:
print(f'{datetime.now()} Cleaning Features...')

from tqdm import tqdm
cols = ['dependency']

for c in tqdm(cols):
    x = df[c].values
    strs = []
    for i,v in enumerate(c):
        try:
            val=float(v)
        except:
            strs.append(v)
            val=np.nan
        x[i]=val
    strs = np.unique(strs)
    
    for s in strs:
        df[c + '_' + s] = df[c].apply(lambda x: 1 if x==s else 0)
    
    df[c]=x
    
print(f'{datetime.now()} Done!!')

2021-01-13 23:12:23.612357 Cleaning Features...


100%|██████████| 1/1 [00:00<00:00,  8.32it/s]

2021-01-13 23:12:24.031987 Done!!





In [13]:
df.head()

Unnamed: 0,Id,v2a1,hacdor,rooms,hacapo,v14a,refrig,v18q,v18q1,r4h1,...,SQBmeaned,agesq,Target,is_test,dependency_c,dependency_d,dependency_e,dependency_n,dependency_p,dependency_y
0,ID_279628684,190000.0,0,3,0,1,1,0,,0,...,100.0,1849,4.0,0,0,0,0,0,0,0
1,ID_f29eb3ddd,135000.0,0,4,0,1,1,1,1.0,0,...,144.0,4489,4.0,0,0,0,0,0,0,0
2,ID_68de51c94,,0,8,0,1,1,0,,0,...,121.0,8464,4.0,0,0,0,0,0,0,0
3,ID_d671db89c,180000.0,0,5,0,1,1,1,1.0,0,...,121.0,289,4.0,0,0,0,0,0,0,0
4,ID_d56d6f5f5,180000.0,0,5,0,1,1,1,1.0,0,...,121.0,1369,4.0,0,0,0,0,0,0,0
