# Income Qualification

## Project Objective: 
**Identify the level of income qualification needed for the families in Latin America**

## Problem Statement Scenario: 
**Many social programs have a hard time making sure the right people are given enough aid.
This is especially tricky when a program focuses on the poorest segment of the population, because this segment can’t provide the income and expense records needed to prove that they qualify.
In Latin America, a popular method called the Proxy Means Test (PMT) uses an algorithm to verify income qualification. With PMT, agencies use a model that considers a family’s observable household attributes, like the material of their walls and ceiling or the assets found in their homes, to classify them and predict their level of need.
While this is an improvement, accuracy remains a problem as the region’s population grows and poverty declines.
The Inter-American Development Bank (IDB) believes that new methods beyond traditional econometrics, based on a dataset of Costa Rican household characteristics, might help improve PMT’s performance.**



## Importing the modules

In [1]:
import numpy as np 
import pandas as pd
from sklearn.impute import SimpleImputer
import collections
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score,confusion_matrix,classification_report

In [2]:
#import the file using pandas
df=pd.read_csv('Income_qualification_train.csv')
print('Shape of the data',df.shape)
print()
print(df.head())

Shape of the data (9557, 143)

             Id      v2a1  hacdor  rooms  hacapo  v14a  refrig  v18q  v18q1  \
0  ID_279628684  190000.0       0      3       0     1       1     0    NaN   
1  ID_f29eb3ddd  135000.0       0      4       0     1       1     1    1.0   
2  ID_68de51c94       NaN       0      8       0     1       1     0    NaN   
3  ID_d671db89c  180000.0       0      5       0     1       1     1    1.0   
4  ID_d56d6f5f5  180000.0       0      5       0     1       1     1    1.0   

   r4h1  ...  SQBescolari  SQBage  SQBhogar_total  SQBedjefe  SQBhogar_nin  \
0     0  ...          100    1849               1        100             0   
1     0  ...          144    4489               1        144             0   
2     0  ...          121    8464               1          0             0   
3     0  ...           81     289              16        121             4   
4     0  ...          121    1369              16        121             4   

   SQBovercrowding  SQBde

## Check and remove the null values

In [3]:
df.isnull().sum()

Id                    0
v2a1               6860
hacdor                0
rooms                 0
hacapo                0
v14a                  0
refrig                0
v18q                  0
v18q1              7342
r4h1                  0
r4h2                  0
r4h3                  0
r4m1                  0
r4m2                  0
r4m3                  0
r4t1                  0
r4t2                  0
r4t3                  0
tamhog                0
tamviv                0
escolari              0
rez_esc            7928
hhsize                0
paredblolad           0
paredzocalo           0
paredpreb             0
pareddes              0
paredmad              0
paredzinc             0
paredfibras           0
                   ... 
bedrooms              0
overcrowding          0
tipovivi1             0
tipovivi2             0
tipovivi3             0
tipovivi4             0
tipovivi5             0
computer              0
television            0
mobilephone           0
qmobilephone    

In [4]:
null_columns=df.columns[df.isnull().any()]
df[null_columns].isnull().sum()

v2a1         6860
v18q1        7342
rez_esc      7928
meaneduc        5
SQBmeaned       5
dtype: int64

In [5]:
# Percentage of missing values in each column that contains nulls
for col in null_columns:
    print('Percentage of null values in', col, ': ', df[col].isnull().sum()/df.shape[0]*100)

Percentage of null values in v2a1 :  71.7798472323951
Percentage of null values in v18q1 :  76.82327090091033
Percentage of null values in rez_esc :  82.95490216595167
Percentage of null values in meaneduc :  0.05231767290990897
Percentage of null values in SQBmeaned :  0.05231767290990897


In [6]:
# More than 50% of the values in v2a1, v18q1 and rez_esc are null, so these columns are dropped
df= df.drop(['v2a1','v18q1','rez_esc'],axis=1) 
print(df.shape)

(9557, 140)
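
The per-column percentages and the 50% cut-off used above can also be written generically. The following is a minimal sketch on a made-up DataFrame (`toy`, its column names and the `threshold` variable are illustrative assumptions, not part of the dataset). It relies on the fact that the mean of a boolean mask equals the fraction of `True` values.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the real training set
toy = pd.DataFrame({
    "mostly_missing": [np.nan, np.nan, np.nan, 1.0],
    "rarely_missing": [1.0, 2.0, np.nan, 4.0],
    "complete": [1, 2, 3, 4],
})

# Percentage of nulls per column: mean of the boolean mask times 100
null_pct = toy.isnull().mean() * 100

# Drop every column whose null percentage exceeds the chosen threshold
threshold = 50.0
to_drop = null_pct[null_pct > threshold].index.tolist()
reduced = toy.drop(columns=to_drop)

print(null_pct)
print("Dropped:", to_drop)
```

Computing the list of columns from the threshold avoids hard-coding column names, which helps if the same preprocessing is later applied to another file.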


In [7]:
# Impute the meaneduc and SQBmeaned columns with their medians
imp = SimpleImputer(missing_values=np.nan, strategy='median')
imp.fit(df[['meaneduc','SQBmeaned']])
df[['meaneduc','SQBmeaned']]=imp.transform(df[['meaneduc','SQBmeaned']])
df[['meaneduc','SQBmeaned']].isnull().sum()

meaneduc     0
SQBmeaned    0
dtype: int64

## In the train and test datasets, the output variable is the Target column

In [8]:
df= df.drop(['Id'],axis=1)
df.describe(include='O')

| | idhogar | dependency | edjefe | edjefa |
|---|---|---|---|---|
| count | 9557 | 9557 | 9557 | 9557 |
| unique | 2988 | 31 | 22 | 22 |
| top | fd8a6d014 | yes | no | no |
| freq | 13 | 2192 | 3762 | 6230 |


In [9]:
# Encode 'yes' as 0.5 and 'no' as 0, then cast the column to float
df.dependency = df.dependency.replace(to_replace=['yes','no'],value=[0.5,0]).astype('float')

In [10]:
# Replace 'yes' with the median of the numeric entries of edjefe and 'no' with 0
med_1=np.median(df.edjefe[~df.edjefe.isin(['yes','no'])].astype('float'))
df.edjefe= df.edjefe.replace(to_replace=['yes','no'],value=[med_1,0]).astype('float')

In [11]:
# Same treatment for edjefa
med_2=np.median(df.edjefa[~df.edjefa.isin(['yes','no'])].astype('float'))
df.edjefa= df.edjefa.replace(to_replace=['yes','no'],value=[med_2,0]).astype('float')

In [12]:
df.describe(include='O')

| | idhogar |
|---|---|
| count | 9557 |
| unique | 2988 |
| top | fd8a6d014 |
| freq | 13 |
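
The conversion applied to `edjefe` and `edjefa` (numeric strings kept, 'yes' replaced by the median of the numeric entries, 'no' replaced by 0) can be wrapped in one helper. This is only a sketch: `yes_no_to_numeric` is a hypothetical function name and `toy` is made-up data, and it uses `pd.to_numeric` instead of `replace`, which is a swapped-in technique rather than what the notebook does.

```python
import pandas as pd

def yes_no_to_numeric(col, yes_value=None, no_value=0.0):
    """Convert a column mixing numeric strings with 'yes'/'no' into floats.

    Hypothetical helper: 'yes' defaults to the median of the numeric entries.
    """
    # Non-numeric strings such as 'yes' and 'no' become NaN
    numeric = pd.to_numeric(col, errors="coerce")
    if yes_value is None:
        # Series.median skips NaN, so only the numeric entries count
        yes_value = numeric.median()
    out = numeric.copy()
    out[col == "yes"] = yes_value
    out[col == "no"] = no_value
    return out

# Made-up example values
toy = pd.Series(["10", "yes", "no", "6", "8"])
converted = yes_no_to_numeric(toy)
print(converted.tolist())
```

Passing `yes_value=0.5` would mirror the fixed encoding used for `dependency`.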


In [13]:
print(df.idhogar.nunique())

2988


## Checking for class imbalance in the dataset

In [14]:
# Count the rows in each Target class (collections is already imported in cell 1)
print(df.shape)
collections.Counter(df['Target'])

(9557, 139)


Counter({4: 5996, 2: 1597, 3: 1209, 1: 755})

The classes are imbalanced: class 4 accounts for 5996 of the 9557 rows, while class 1 has only 755.
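
To put the imbalance in proportion, the counts can be turned into class shares. The sketch below reuses the counts printed above; the variable names (`shares`, `majority_baseline`) are illustrative.

```python
from collections import Counter

# Class counts copied from the Counter output above
counts = Counter({4: 5996, 2: 1597, 3: 1209, 1: 755})
total = sum(counts.values())

# Share of each class, ordered by label
shares = {label: n / total for label, n in sorted(counts.items())}
for label, share in shares.items():
    print(f"Target {label}: {share:.1%}")

# Accuracy obtained by always predicting the most frequent class
majority_baseline = max(shares.values())
print(f"Majority-class baseline accuracy: {majority_baseline:.1%}")
```

The majority-class baseline (roughly 63% here) is a useful reference when reading any accuracy score on this data. One option to explore, not used in this notebook, is the `class_weight='balanced'` parameter of `RandomForestClassifier`.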

## Checking whether all members of the house have the same poverty level.

In [15]:
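# Boolean Series per household: True when its members have more than one Target value.
# NOTE: .index returns the index of that whole Series, i.e. every household ID,
# not only the households where the condition is True (hence length=2988 below).
poverty_level=(df.groupby('idhogar')['Target'].nunique()>1).index
print(poverty_level)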

Index(['001ff74ca', '003123ec2', '006031de3', '006555fe2', '00693f597',
       '006b64543', '00941f1f4', '009ae1cec', '00e3e05c5', '00e443b00',
       ...
       'ff250fd6c', 'ff31b984b', 'ff38ddef1', 'ff6d16fd0', 'ff703eed4',
       'ff9343a35', 'ff9d5ab17', 'ffae4a097', 'ffe90d46f', 'fff7d6be1'],
      dtype='object', name='idhogar', length=2988)


## Checking if there is a house without a family head.

In [16]:
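# Boolean Series per household: True when no member is flagged as head (parentesco1 == 1).
# NOTE: .index returns every household ID, not only those where the condition is True
# (hence length=2988 below).
no_head=(df.groupby('idhogar')['parentesco1'].sum()==0).index
display(no_head)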

Index(['001ff74ca', '003123ec2', '006031de3', '006555fe2', '00693f597',
       '006b64543', '00941f1f4', '009ae1cec', '00e3e05c5', '00e443b00',
       ...
       'ff250fd6c', 'ff31b984b', 'ff38ddef1', 'ff6d16fd0', 'ff703eed4',
       'ff9343a35', 'ff9d5ab17', 'ffae4a097', 'ffe90d46f', 'fff7d6be1'],
      dtype='object', name='idhogar', length=2988)
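
Both checks above list all 2988 households because `.index` is taken from the boolean Series itself. A minimal sketch of selecting only the households that satisfy each condition follows; `toy` is made-up data that reuses the column names `idhogar`, `Target` and `parentesco1` from the dataset.

```python
import pandas as pd

# Made-up households: 'a' is consistent, 'b' has mixed targets and no head
toy = pd.DataFrame({
    "idhogar": ["a", "a", "b", "b", "c"],
    "Target": [4, 4, 2, 3, 1],
    "parentesco1": [1, 0, 0, 0, 1],
})

# Households whose members do not all share one Target value
n_targets = toy.groupby("idhogar")["Target"].nunique()
mixed = n_targets[n_targets > 1].index

# Households in which no member is flagged as the head
heads = toy.groupby("idhogar")["parentesco1"].sum()
no_head = heads[heads == 0].index

print("Mixed targets:", list(mixed))
print("No head:", list(no_head))
```

Indexing the Series with its own boolean condition keeps only the rows where the condition holds, so `len(mixed)` and `len(no_head)` give the number of affected households.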

## Set the same poverty level for all members of a household.

In [17]:
# Household-level Target: the mean over the members, truncated to an integer by astype('int64')
target_mean=df.groupby('idhogar')['Target'].mean().astype('int64').reset_index().rename(columns={'Target':'Target_mean'})
df=df.merge(target_mean,how='left',on='idhogar')
df.Target=df.Target_mean
df.drop('Target_mean',axis=1,inplace=True)
df.head()

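| | hacdor | rooms | hacapo | v14a | refrig | v18q | r4h1 | r4h2 | r4h3 | r4m1 | ... | SQBescolari | SQBage | SQBhogar_total | SQBedjefe | SQBhogar_nin | SQBovercrowding | SQBdependency | SQBmeaned | agesq | Target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | 0 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | ... | 100 | 1849 | 1 | 100 | 0 | 1.0 | 0.0 | 100.0 | 1849 | 4 |
| 1 | 0 | 4 | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | ... | 144 | 4489 | 1 | 144 | 0 | 1.0 | 64.0 | 144.0 | 4489 | 4 |
| 2 | 0 | 8 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 121 | 8464 | 1 | 0 | 0 | 0.25 | 64.0 | 121.0 | 8464 | 4 |
| 3 | 0 | 5 | 0 | 1 | 1 | 1 | 0 | 2 | 2 | 1 | ... | 81 | 289 | 16 | 121 | 4 | 1.777778 | 1.0 | 121.0 | 289 | 4 |
| 4 | 0 | 5 | 0 | 1 | 1 | 1 | 0 | 2 | 2 | 1 | ... | 121 | 1369 | 16 | 121 | 4 | 1.777778 | 1.0 | 121.0 | 1369 | 4 |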
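
The cell above assigns each household the truncated mean of its members' targets. An alternative that matches the section heading more literally is to copy the head's target to every member. The sketch below shows that variant on made-up data; it assumes at most one head per household, and `Target_head` and `Target_fixed` are hypothetical column names.

```python
import pandas as pd

# Made-up households: 'a' and 'b' have a head, 'c' has none
toy = pd.DataFrame({
    "idhogar": ["a", "a", "b", "b", "b", "c"],
    "parentesco1": [1, 0, 0, 1, 0, 0],
    "Target": [4, 3, 2, 1, 2, 2],
})

# Target of the head of each household, indexed by household ID
head_target = toy.loc[toy["parentesco1"] == 1].set_index("idhogar")["Target"]

# Broadcast the head's target to every member; households without a head get NaN
toy["Target_head"] = toy["idhogar"].map(head_target)

# Fall back to the member's own target when the household has no head
toy["Target_fixed"] = toy["Target_head"].fillna(toy["Target"]).astype("int64")

print(toy[["idhogar", "Target", "Target_fixed"]])
```

The two strategies can disagree: for household `a` the truncated mean of 4 and 3 is 3, whereas the head's target is 4.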


In [18]:
df.shape

(9557, 139)

In [19]:
df= df.drop(['idhogar'],axis=1)
df.shape

(9557, 138)

## Defining the features x and the target y

In [20]:
x=df.drop(['Target'],axis=1)
print('shape of the x',x.shape)
y=df.Target
print('shape of the y',y.shape)

shape of the x (9557, 137)
shape of the y (9557,)


## Training a Random Forest classifier.

In [21]:
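# Hold out 20% of the rows as a test set (fixed seed for a reproducible split)
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.2,random_state=10)
# Random forest with 100 trees and the Gini criterion; no random_state is set on the
# classifier, so scores can vary slightly between runs
rfc = RandomForestClassifier(criterion= 'gini',n_estimators=100)
rfc.fit(x_train,y_train)
pred=rfc.predict(x_test)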

## Check the accuracy of the random forest on the held-out test set.

In [22]:
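# NOTE: scikit-learn's metric functions expect (y_true, y_pred). The calls below pass
# (pred, y_test), so in the output the confusion-matrix rows are predicted labels and the
# precision and recall columns are swapped. Accuracy is unaffected by the argument order.
print('Accuracy score: ', accuracy_score(pred,y_test))
print()
print('Confusion matrix: ', confusion_matrix(pred,y_test))
print()
print('Classification report: ', classification_report(pred,y_test))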

Accuracy score:  0.9341004184100419

Confusion matrix:  [[ 135    1    1    2]
 [   3  272    3    3]
 [   1    1  176    2]
 [  30   43   36 1203]]

Classification report:                precision    recall  f1-score   support

           1       0.80      0.97      0.88       139
           2       0.86      0.97      0.91       281
           3       0.81      0.98      0.89       180
           4       0.99      0.92      0.95      1312

   micro avg       0.93      0.93      0.93      1912
   macro avg       0.87      0.96      0.91      1912
weighted avg       0.94      0.93      0.94      1912
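
Because the metrics were called as `(pred, y_test)`, the matrix printed above has predicted labels on its rows. The sketch below transposes it into scikit-learn's usual orientation (rows are true labels) and recomputes accuracy, precision and recall with NumPy only. The matrix values are copied from the output above; the variable names are illustrative.

```python
import numpy as np

# Confusion matrix as printed above: rows = predicted labels, columns = true labels
cm_pred_rows = np.array([
    [135,   1,   1,    2],
    [  3, 272,   3,    3],
    [  1,   1, 176,    2],
    [ 30,  43,  36, 1203],
])

# Conventional orientation: rows = true labels, columns = predicted labels
cm = cm_pred_rows.T

correct = np.diag(cm)
accuracy = correct.sum() / cm.sum()
precision = correct / cm.sum(axis=0)  # correct / number predicted as each class
recall = correct / cm.sum(axis=1)     # correct / number truly in each class

for label, p, r in zip([1, 2, 3, 4], precision, recall):
    print(f"class {label}: precision={p:.2f} recall={r:.2f}")
print(f"accuracy={accuracy:.4f}")
```

Read this way, the minority classes 1 to 3 have high precision but lower recall, meaning some of their members are predicted as class 4. Calling `confusion_matrix(y_test, pred)` and `classification_report(y_test, pred)` would yield this orientation directly.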

