## Introduction

Costa Rica, officially the Republic of Costa Rica, is a country in the Central American region of North America. It is bordered by Nicaragua to the north, the Caribbean Sea to the northeast, Panama to the southeast, and the Pacific Ocean to the southwest, and it shares a maritime border with Ecuador to the south of Cocos Island. It has a population of around five million in a land area of nearly 51,180 km² (19,760 sq mi). An estimated 352,381 people live in the capital and largest city, San José, with around two million people in the surrounding metropolitan area.

Costa Rica's poverty rate is lower than that of most Latin American countries, but it has remained around 20% for almost two decades. The country defines poverty as earning less than $155 per month, and the majority of people in this classification live in rural areas or the inner city of San José. Poverty rates are especially high among vulnerable groups, such as Afro-descendants, Indigenous populations, and migrants.

Evaluation: 
Submissions will be evaluated based on their macro F1 score.
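
Macro F1 is the unweighted mean of the per-class F1 scores, so each of the four income groups counts equally regardless of how many rows it has. The sketch below uses made-up labels on the same 1-4 scale as Target (the values are illustrative, not taken from the data) and scikit-learn's `f1_score`.

```python
from sklearn.metrics import f1_score

# Toy labels on the 1-4 Target scale (illustrative values only).
y_true = [1, 2, 3, 4, 4, 4, 4, 2]
y_pred = [1, 2, 4, 4, 4, 4, 3, 2]

# F1 for each class separately, in label order 1..4.
per_class = f1_score(y_true, y_pred, average=None, labels=[1, 2, 3, 4])

# Macro F1: the plain average of the per-class scores.
macro_f1 = f1_score(y_true, y_pred, average="macro")

print(per_class)
print(macro_f1)
```

Here class 3 is never predicted correctly, so its F1 of zero pulls the macro average down even though most rows are classified correctly.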

## Core Data fields
Id - a unique identifier for each row.

Target - the target is an ordinal variable indicating groups of income levels.

1 = extreme poverty

2 = moderate poverty

3 = vulnerable households

4 = non vulnerable households

idhogar - this is a unique identifier for each household. This can be used to create household-wide features, etc. All rows in a given household will have a matching value for this identifier.
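
As a rough illustration of household-wide features, the sketch below groups a small hypothetical frame by `idhogar` and broadcasts aggregates back to every member. Only the column names `idhogar`, `escolari`, and `age` mirror the data; the rows and the `hh_*` feature names are assumptions. Any such features would need to be built before the identifier column is dropped.

```python
import pandas as pd

# Hypothetical miniature frame: two households, five people.
people = pd.DataFrame({
    "idhogar": ["a", "a", "a", "b", "b"],
    "escolari": [6, 11, 0, 9, 15],
    "age": [40, 38, 5, 60, 25],
})

grouped = people.groupby("idhogar")

# transform() returns one value per original row, aligned to the index.
people["hh_size"] = grouped["age"].transform("count")
people["hh_max_escolari"] = grouped["escolari"].transform("max")
people["hh_mean_age"] = grouped["age"].transform("mean")

print(people)
```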

parentesco1 - indicates if this person is the head of the household.

This data contains 142 total columns.

In [33]:
# Required Libraries 

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, PowerTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB, BernoulliNB, MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
from xgboost import XGBClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.pipeline import Pipeline
import warnings
warnings.filterwarnings('ignore')

In [2]:
df = pd.read_csv('train.csv')
df

Unnamed: 0,Id,v2a1,hacdor,rooms,hacapo,v14a,refrig,v18q,v18q1,r4h1,...,SQBescolari,SQBage,SQBhogar_total,SQBedjefe,SQBhogar_nin,SQBovercrowding,SQBdependency,SQBmeaned,agesq,Target
0,ID_279628684,190000.0,0,3,0,1,1,0,,0,...,100,1849,1,100,0,1.000000,0.0000,100.0000,1849,4
1,ID_f29eb3ddd,135000.0,0,4,0,1,1,1,1.0,0,...,144,4489,1,144,0,1.000000,64.0000,144.0000,4489,4
2,ID_68de51c94,,0,8,0,1,1,0,,0,...,121,8464,1,0,0,0.250000,64.0000,121.0000,8464,4
3,ID_d671db89c,180000.0,0,5,0,1,1,1,1.0,0,...,81,289,16,121,4,1.777778,1.0000,121.0000,289,4
4,ID_d56d6f5f5,180000.0,0,5,0,1,1,1,1.0,0,...,121,1369,16,121,4,1.777778,1.0000,121.0000,1369,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9552,ID_d45ae367d,80000.0,0,6,0,1,1,0,,0,...,81,2116,25,81,1,1.562500,0.0625,68.0625,2116,2
9553,ID_c94744e07,80000.0,0,6,0,1,1,0,,0,...,0,4,25,81,1,1.562500,0.0625,68.0625,4,2
9554,ID_85fc658f8,80000.0,0,6,0,1,1,0,,0,...,25,2500,25,81,1,1.562500,0.0625,68.0625,2500,2
9555,ID_ced540c61,80000.0,0,6,0,1,1,0,,0,...,121,676,25,81,1,1.562500,0.0625,68.0625,676,2


In [7]:
df.select_dtypes(include='object')

Unnamed: 0,Id,idhogar,dependency,edjefe,edjefa
0,ID_279628684,21eb7fcc1,no,10,no
1,ID_f29eb3ddd,0e5d7a658,8,12,no
2,ID_68de51c94,2c7317ea8,8,no,11
3,ID_d671db89c,2b58d945f,yes,11,no
4,ID_d56d6f5f5,2b58d945f,yes,11,no
...,...,...,...,...,...
9552,ID_d45ae367d,d6c086aa3,.25,9,no
9553,ID_c94744e07,d6c086aa3,.25,9,no
9554,ID_85fc658f8,d6c086aa3,.25,9,no
9555,ID_ced540c61,d6c086aa3,.25,9,no


In [10]:
# Null Values

pd.DataFrame(df.isnull().sum()).sort_values(by = 0, ascending= False).head(6)

Unnamed: 0,0
rez_esc,7928
v18q1,7342
v2a1,6860
SQBmeaned,5
meaneduc,5
Id,0


In [11]:
pd.DataFrame(df.isnull().sum()/len(df)*100).sort_values(by = 0, ascending= False).head(6)

Unnamed: 0,0
rez_esc,82.954902
v18q1,76.823271
v2a1,71.779847
SQBmeaned,0.052318
meaneduc,0.052318
Id,0.0


In [12]:
# The columns 'rez_esc', 'v18q1' and 'v2a1' each have more than 70% null values, so they are dropped.

df.drop(columns = ['rez_esc', 'v18q1', 'v2a1'], inplace= True)

In [13]:
df.shape

(9557, 140)
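
The same rule can be written as a threshold instead of a hand-typed column list. This is a minimal sketch on a toy frame; the helper name `drop_sparse_columns` and the 0.7 cutoff are assumptions chosen to match the 70% figure in the comment above.

```python
import numpy as np
import pandas as pd

def drop_sparse_columns(frame, max_null_fraction=0.7):
    """Return (frame without sparse columns, list of dropped column names)."""
    null_fraction = frame.isnull().mean()  # share of nulls per column
    to_drop = null_fraction[null_fraction > max_null_fraction].index.tolist()
    return frame.drop(columns=to_drop), to_drop

# Toy data: one column is 75% null, the other is complete.
toy = pd.DataFrame({
    "mostly_missing": [np.nan, np.nan, np.nan, 1.0],
    "complete": [1, 2, 3, 4],
})

cleaned, dropped = drop_sparse_columns(toy)
print(dropped)
print(cleaned.shape)
```

Returning the dropped names makes it easy to apply the identical drop to the test set later.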

In [14]:
# Null value imputation:
# meaneduc - average years of education for adults (18+)
# SQBmeaned - square of the mean years of education of adults (>=18) in the household

for i in ['SQBmeaned', 'meaneduc']:
    df[i] = df[i].fillna(df[i].mean())

In [15]:
pd.DataFrame(df.isnull().sum()).sort_values(by = 0, ascending= False).head()

Unnamed: 0,0
Id,0
hogar_total,0
parentesco11,0
parentesco12,0
idhogar,0


In [17]:
# Duplicates:

df.duplicated().sum()

0

In [18]:
# Drop the ID columns, which are not used as model features

df = df.drop(columns = ['Id', 'idhogar'])

In [21]:
df.shape

(9557, 138)

Categorical Columns:

dependency, Dependency rate, calculated = (number of household members younger than 19 or older than 64) / (number of household members between 19 and 64)

escolari, years of schooling

edjefe, years of education of male head of household, based on the interaction of escolari (years of education), head of household and gender, yes=1 and no=0

edjefa, years of education of female head of household, based on the interaction of escolari (years of education), head of household and gender, yes=1 and no=0
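
The dependency formula can be checked by hand on a small household. The function below is a sketch of the stated definition only; returning `None` when no member is aged 19 to 64 is an assumption, since the source does not say how the dataset encodes that case.

```python
def dependency_rate(ages):
    """Dependents (under 19 or over 64) divided by members aged 19 to 64."""
    dependents = sum(1 for a in ages if a < 19 or a > 64)
    working_age = sum(1 for a in ages if 19 <= a <= 64)
    if working_age == 0:
        # Assumption: the ratio is undefined with no working-age members.
        return None
    return dependents / working_age

# Hypothetical household: one child and two adults.
household_ages = [4, 41, 41]
rate = dependency_rate(household_ages)  # 1 dependent / 2 working-age members
print(rate)
```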

In [20]:
df['escolari'].value_counts()

escolari
6     1985
0     1307
11    1123
9      612
8      498
7      488
3      401
15     382
5      346
2      318
14     316
4      306
10     297
12     296
1      247
13     199
17     199
16     195
21      21
19       9
18       8
20       4
Name: count, dtype: int64

In [19]:
for i in df.select_dtypes(include='object').columns:
    print(df[i].value_counts())
    print('---------------------------------')

dependency
yes          2192
no           1747
.5           1497
2             730
1.5           713
.33333334     598
.66666669     487
8             378
.25           260
3             236
4             100
.75            98
.2             90
.40000001      84
1.3333334      84
2.5            77
5              24
1.25           18
3.5            18
.80000001      18
2.25           13
.71428573      12
1.75           11
1.2            11
.83333331      11
.22222222      11
.2857143        9
1.6666666       8
.60000002       8
6               7
.16666667       7
Name: count, dtype: int64
---------------------------------
edjefe
no     3762
6      1845
11      751
9       486
3       307
15      285
8       257
7       234
5       222
14      208
17      202
2       194
4       137
16      134
yes     123
12      113
10      111
13      103
21       43
18       19
19       14
20        7
Name: count, dtype: int64
---------------------------------
edjefa
no     6230
6       947
11      3

In [22]:
# The columns edjefe and edjefa are derived from an interaction of other columns and mix numeric with yes/no values, so they are dropped

df = df.drop(columns = ['edjefa', 'edjefe'])
df.shape

(9557, 136)

In [23]:
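# Convert 'yes'/'no' to 1/0 in the dependency column.
# Assign the result back instead of using inplace=True on a column selection,
# which may not modify df in recent pandas versions.

for i in df.select_dtypes(include='object').columns:
    df[i] = df[i].replace(['yes','no'],[1,0])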

In [24]:
for i in df.select_dtypes(include='object').columns:
    print(df[i].value_counts())

dependency
1            2192
0            1747
.5           1497
2             730
1.5           713
.33333334     598
.66666669     487
8             378
.25           260
3             236
4             100
.75            98
.2             90
.40000001      84
1.3333334      84
2.5            77
5              24
1.25           18
3.5            18
.80000001      18
2.25           13
.71428573      12
1.75           11
1.2            11
.83333331      11
.22222222      11
.2857143        9
1.6666666       8
.60000002       8
6               7
.16666667       7
Name: count, dtype: int64
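
After the replacement, dependency still appears among the object columns, so it holds a mix of integers and numeric strings such as `.5`. One possible follow-up, shown here on a toy Series rather than the real column, is an explicit conversion with `pd.to_numeric` so the dtype no longer depends on implicit casting downstream.

```python
import pandas as pd

# Toy stand-in for the dependency column after the yes/no replacement.
dep = pd.Series([1, 0, ".5", "2", ".33333334"], dtype="object")

# Parse every entry as a number; the result is a float64 Series.
dep_numeric = pd.to_numeric(dep)

print(dep_numeric.dtype)
print(dep_numeric.tolist())
```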


In [26]:
# Train Test Split

x = df.drop(columns = 'Target')
y = df['Target']
xtrain, xtest, ytrain, ytest = train_test_split(x, y, train_size= 0.8, stratify= y, random_state= 7)
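
The split above works at the row level, so members of one household can land on both sides. Since rows in a household share an `idhogar` value, a group-aware split is one alternative worth comparing. The sketch below uses scikit-learn's `GroupShuffleSplit` on hypothetical data; it assumes the household identifier has been kept in a separate array (`groups`) before the ID columns are dropped.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical data: 10 rows belonging to 5 households.
groups = np.array(["h1", "h1", "h2", "h2", "h2", "h3", "h4", "h4", "h5", "h5"])
X_grp = np.arange(20).reshape(10, 2)
y_grp = np.array([4, 4, 2, 2, 2, 4, 1, 1, 3, 3])

# train_size here is the share of households, not of rows.
splitter = GroupShuffleSplit(n_splits=1, train_size=0.8, random_state=7)
train_idx, test_idx = next(splitter.split(X_grp, y_grp, groups=groups))

train_households = set(groups[train_idx])
test_households = set(groups[test_idx])
print(train_households & test_households)  # empty: no household is on both sides
```

Note that `GroupShuffleSplit` does not stratify by class, so class proportions may differ between the two parts.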

In [29]:
y.value_counts()

Target
4    5996
2    1597
3    1209
1     755
Name: count, dtype: int64
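
The class counts are heavily skewed toward class 4, which is why several models in this notebook use `class_weight='balanced'`. That option weights each class by `n_samples / (n_classes * class_count)`. The sketch below reproduces the calculation from the counts printed above; in the notebook itself the weights would be derived from `ytrain`, so the exact values differ slightly.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Class counts copied from the value_counts output above.
counts = {4: 5996, 2: 1597, 3: 1209, 1: 755}

# Rebuild a label vector with those counts.
y_full = np.concatenate([np.full(n, label) for label, n in counts.items()])

classes = np.array([1, 2, 3, 4])
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_full)

for label, w in zip(classes, weights):
    print(label, round(float(w), 3))
```

The rarest class (1) receives the largest weight and the majority class (4) a weight below one.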

In [28]:
pipe = Pipeline((
("sc",StandardScaler()),   
("lr", LogisticRegression(class_weight='balanced')),
))
pipe.fit(xtrain, ytrain)
ypred = pipe.predict(xtest)

print("classification_report")
print(classification_report(ytest, ypred))

classification_report
              precision    recall  f1-score   support

           1       0.29      0.58      0.39       151
           2       0.39      0.39      0.39       319
           3       0.22      0.36      0.28       242
           4       0.88      0.67      0.76      1200

    accuracy                           0.58      1912
   macro avg       0.45      0.50      0.45      1912
weighted avg       0.67      0.58      0.61      1912



In [30]:
pipe = Pipeline((
("sc",StandardScaler()),
("pt",PowerTransformer()),   
("lr", LogisticRegression(class_weight='balanced')),
))
pipe.fit(xtrain, ytrain)
ypred = pipe.predict(xtest)

print("classification_report")
print(classification_report(ytest, ypred))

classification_report
              precision    recall  f1-score   support

           1       0.28      0.54      0.37       151
           2       0.42      0.41      0.42       319
           3       0.26      0.43      0.32       242
           4       0.89      0.68      0.77      1200

    accuracy                           0.59      1912
   macro avg       0.46      0.51      0.47      1912
weighted avg       0.69      0.59      0.62      1912



In [31]:
pipe = Pipeline((
("sc",StandardScaler()),   
("KNN", KNeighborsClassifier()),
))
pipe.fit(xtrain, ytrain)
ypred = pipe.predict(xtest)

print("classification_report")
print(classification_report(ytest, ypred))

classification_report
              precision    recall  f1-score   support

           1       0.74      0.52      0.61       151
           2       0.67      0.59      0.63       319
           3       0.65      0.48      0.55       242
           4       0.82      0.92      0.86      1200

    accuracy                           0.78      1912
   macro avg       0.72      0.63      0.66      1912
weighted avg       0.76      0.78      0.76      1912



In [32]:
pipe = Pipeline((
("sc",StandardScaler()),
("pt",PowerTransformer()),   
("KNN", KNeighborsClassifier()),
))
pipe.fit(xtrain, ytrain)
ypred = pipe.predict(xtest)

print("classification_report")
print(classification_report(ytest, ypred))

classification_report
              precision    recall  f1-score   support

           1       0.75      0.53      0.62       151
           2       0.67      0.61      0.63       319
           3       0.62      0.48      0.54       242
           4       0.82      0.91      0.87      1200

    accuracy                           0.78      1912
   macro avg       0.71      0.63      0.66      1912
weighted avg       0.77      0.78      0.77      1912



In [34]:
pipe = Pipeline((
("sc",StandardScaler()),   
("gnb", GaussianNB()),
))
pipe.fit(xtrain, ytrain)
ypred = pipe.predict(xtest)

print("classification_report")
print(classification_report(ytest, ypred))

classification_report
              precision    recall  f1-score   support

           1       0.27      0.26      0.26       151
           2       0.67      0.03      0.05       319
           3       0.13      0.91      0.23       242
           4       0.98      0.05      0.09      1200

    accuracy                           0.17      1912
   macro avg       0.51      0.31      0.16      1912
weighted avg       0.77      0.17      0.12      1912



In [37]:
pipe = Pipeline((
("sc",StandardScaler()), 
("pt",PowerTransformer()), 
("gnb", GaussianNB()),
))
pipe.fit(xtrain, ytrain)
ypred = pipe.predict(xtest)

print("classification_report")
print(classification_report(ytest, ypred))

classification_report
              precision    recall  f1-score   support

           1       0.23      0.21      0.22       151
           2       0.67      0.03      0.05       319
           3       0.13      0.91      0.23       242
           4       0.98      0.04      0.08      1200

    accuracy                           0.16      1912
   macro avg       0.50      0.30      0.14      1912
weighted avg       0.76      0.16      0.10      1912



In [36]:
pipe = Pipeline((
("sc",StandardScaler()),   
("bnb", BernoulliNB()),
))
pipe.fit(xtrain, ytrain)
ypred = pipe.predict(xtest)

print("classification_report")
print(classification_report(ytest, ypred))

classification_report
              precision    recall  f1-score   support

           1       0.29      0.40      0.34       151
           2       0.33      0.39      0.35       319
           3       0.20      0.18      0.19       242
           4       0.81      0.75      0.78      1200

    accuracy                           0.59      1912
   macro avg       0.41      0.43      0.41      1912
weighted avg       0.61      0.59      0.60      1912



In [38]:
pipe = Pipeline((
("sc",StandardScaler()), 
("pt",PowerTransformer()),
("bnb", BernoulliNB()),
))
pipe.fit(xtrain, ytrain)
ypred = pipe.predict(xtest)

print("classification_report")
print(classification_report(ytest, ypred))

classification_report
              precision    recall  f1-score   support

           1       0.29      0.38      0.33       151
           2       0.32      0.39      0.35       319
           3       0.24      0.22      0.23       242
           4       0.82      0.75      0.78      1200

    accuracy                           0.59      1912
   macro avg       0.42      0.44      0.42      1912
weighted avg       0.62      0.59      0.61      1912



In [39]:
pipe = Pipeline((
("sc",StandardScaler()),  
("dt", DecisionTreeClassifier(class_weight='balanced')),
))
pipe.fit(xtrain, ytrain)
ypred = pipe.predict(xtest)

print("classification_report")
print(classification_report(ytest, ypred))

classification_report
              precision    recall  f1-score   support

           1       0.76      0.90      0.83       151
           2       0.85      0.85      0.85       319
           3       0.91      0.86      0.89       242
           4       0.96      0.95      0.96      1200

    accuracy                           0.92      1912
   macro avg       0.87      0.89      0.88      1912
weighted avg       0.92      0.92      0.92      1912



In [40]:
pipe = Pipeline((
("sc",StandardScaler()),
("pt",PowerTransformer()),   
("dt", DecisionTreeClassifier(class_weight='balanced')),
))
pipe.fit(xtrain, ytrain)
ypred = pipe.predict(xtest)

print("classification_report")
print(classification_report(ytest, ypred))

classification_report
              precision    recall  f1-score   support

           1       0.75      0.89      0.81       151
           2       0.84      0.85      0.84       319
           3       0.90      0.83      0.86       242
           4       0.96      0.95      0.96      1200

    accuracy                           0.91      1912
   macro avg       0.86      0.88      0.87      1912
weighted avg       0.92      0.91      0.91      1912



In [41]:
pipe = Pipeline((
("sc",StandardScaler()),  
("rf", RandomForestClassifier(class_weight='balanced')),
))
pipe.fit(xtrain, ytrain)
ypred = pipe.predict(xtest)

print("classification_report")
print(classification_report(ytest, ypred))

classification_report
              precision    recall  f1-score   support

           1       0.96      0.84      0.90       151
           2       0.96      0.82      0.88       319
           3       0.96      0.74      0.83       242
           4       0.90      0.99      0.95      1200

    accuracy                           0.92      1912
   macro avg       0.95      0.85      0.89      1912
weighted avg       0.92      0.92      0.92      1912



In [42]:
pipe = Pipeline((
("sc",StandardScaler()),
("pt",PowerTransformer()),   
("rf", RandomForestClassifier(class_weight='balanced')),
))
pipe.fit(xtrain, ytrain)
ypred = pipe.predict(xtest)

print("classification_report")
print(classification_report(ytest, ypred))

classification_report
              precision    recall  f1-score   support

           1       0.93      0.85      0.89       151
           2       0.95      0.82      0.88       319
           3       0.97      0.74      0.84       242
           4       0.91      1.00      0.95      1200

    accuracy                           0.92      1912
   macro avg       0.94      0.85      0.89      1912
weighted avg       0.93      0.92      0.92      1912



In [44]:
pipe = Pipeline((
("sc",StandardScaler()), 
("rf", RandomForestClassifier(class_weight='balanced', max_features=0.5))
))
pipe.fit(xtrain, ytrain)
ypred = pipe.predict(xtest)

print("classification_report")
print(classification_report(ytest, ypred))

classification_report
              precision    recall  f1-score   support

           1       0.94      0.89      0.91       151
           2       0.96      0.89      0.92       319
           3       0.95      0.85      0.90       242
           4       0.95      0.99      0.97      1200

    accuracy                           0.95      1912
   macro avg       0.95      0.91      0.93      1912
weighted avg       0.95      0.95      0.95      1912



In [43]:
pipe = Pipeline((
("sc",StandardScaler()),
("pt",PowerTransformer()),   
("rf", RandomForestClassifier(class_weight='balanced', max_features=0.5))
))
pipe.fit(xtrain, ytrain)
ypred = pipe.predict(xtest)

print("classification_report")
print(classification_report(ytest, ypred))

classification_report
              precision    recall  f1-score   support

           1       0.93      0.91      0.92       151
           2       0.95      0.89      0.92       319
           3       0.95      0.84      0.89       242
           4       0.95      0.99      0.97      1200

    accuracy                           0.95      1912
   macro avg       0.95      0.91      0.93      1912
weighted avg       0.95      0.95      0.95      1912



In [46]:
pipe = Pipeline((
("sc",StandardScaler()),
("rf", RandomForestClassifier(class_weight='balanced', max_features=0.7))
))
pipe.fit(xtrain, ytrain)
ypred = pipe.predict(xtest)

print("classification_report")
print(classification_report(ytest, ypred))

classification_report
              precision    recall  f1-score   support

           1       0.91      0.91      0.91       151
           2       0.94      0.89      0.91       319
           3       0.95      0.85      0.90       242
           4       0.96      0.99      0.98      1200

    accuracy                           0.95      1912
   macro avg       0.94      0.91      0.92      1912
weighted avg       0.95      0.95      0.95      1912



In [45]:
pipe = Pipeline((
("sc",StandardScaler()),
("pt",PowerTransformer()),   
("rf", RandomForestClassifier(class_weight='balanced', max_features=0.7))
))
pipe.fit(xtrain, ytrain)
ypred = pipe.predict(xtest)

print("classification_report")
print(classification_report(ytest, ypred))

classification_report
              precision    recall  f1-score   support

           1       0.91      0.91      0.91       151
           2       0.94      0.90      0.92       319
           3       0.95      0.84      0.89       242
           4       0.96      0.99      0.97      1200

    accuracy                           0.95      1912
   macro avg       0.94      0.91      0.92      1912
weighted avg       0.95      0.95      0.95      1912



In [47]:
pipe = Pipeline((
("sc",StandardScaler()),  
("ada", AdaBoostClassifier()),
))
pipe.fit(xtrain, ytrain)
ypred = pipe.predict(xtest)

print("classification_report")
print(classification_report(ytest, ypred))

classification_report
              precision    recall  f1-score   support

           1       0.41      0.28      0.34       151
           2       0.34      0.34      0.34       319
           3       0.46      0.13      0.21       242
           4       0.75      0.88      0.81      1200

    accuracy                           0.65      1912
   macro avg       0.49      0.41      0.42      1912
weighted avg       0.62      0.65      0.62      1912



In [48]:
pipe = Pipeline((
("sc",StandardScaler()),
("pt",PowerTransformer()),   
("ada", AdaBoostClassifier()),
))
pipe.fit(xtrain, ytrain)
ypred = pipe.predict(xtest)

print("classification_report")
print(classification_report(ytest, ypred))

classification_report
              precision    recall  f1-score   support

           1       0.41      0.28      0.34       151
           2       0.34      0.34      0.34       319
           3       0.46      0.13      0.21       242
           4       0.75      0.88      0.81      1200

    accuracy                           0.65      1912
   macro avg       0.49      0.41      0.42      1912
weighted avg       0.62      0.65      0.62      1912



In [49]:
pipe = Pipeline((
("sc",StandardScaler()), 
("gb", GradientBoostingClassifier()),
))
pipe.fit(xtrain, ytrain)
ypred = pipe.predict(xtest)

print("classification_report")
print(classification_report(ytest, ypred))

classification_report
              precision    recall  f1-score   support

           1       0.82      0.46      0.59       151
           2       0.56      0.48      0.52       319
           3       0.83      0.24      0.38       242
           4       0.77      0.95      0.85      1200

    accuracy                           0.74      1912
   macro avg       0.75      0.54      0.58      1912
weighted avg       0.75      0.74      0.71      1912



In [50]:
pipe = Pipeline((
("sc",StandardScaler()),
("pt",PowerTransformer()),   
("gb", GradientBoostingClassifier()),
))
pipe.fit(xtrain, ytrain)
ypred = pipe.predict(xtest)

print("classification_report")
print(classification_report(ytest, ypred))

classification_report
              precision    recall  f1-score   support

           1       0.82      0.46      0.59       151
           2       0.56      0.48      0.52       319
           3       0.82      0.24      0.38       242
           4       0.77      0.95      0.85      1200

    accuracy                           0.74      1912
   macro avg       0.74      0.54      0.58      1912
weighted avg       0.75      0.74      0.71      1912



In [None]:
# Best performing model is Random Forest
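
The cells above repeat the same fit, predict, and report steps for each model. A loop that records the macro F1 score (the competition metric) makes the comparison explicit. This is a sketch on synthetic data from `make_classification`, not on the Costa Rica data; the candidate list, `n_estimators=50`, and `max_iter=1000` are assumptions chosen to keep it small.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic 4-class stand-in data (assumption, for illustration only).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=12, n_informative=6, n_classes=4, random_state=7
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.8, stratify=y_demo, random_state=7
)

candidates = {
    "lr": LogisticRegression(class_weight="balanced", max_iter=1000),
    "rf": RandomForestClassifier(class_weight="balanced", n_estimators=50, random_state=7),
}

# Fit each candidate in the same scaling pipeline and record macro F1.
scores = {}
for name, model in candidates.items():
    pipe_demo = Pipeline([("sc", StandardScaler()), (name, model)])
    pipe_demo.fit(X_tr, y_tr)
    scores[name] = f1_score(y_te, pipe_demo.predict(X_te), average="macro")

best_name = max(scores, key=scores.get)
print(scores)
print(best_name)
```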

In [52]:
testdf = pd.read_csv('test.csv')
testdf

Unnamed: 0,Id,v2a1,hacdor,rooms,hacapo,v14a,refrig,v18q,v18q1,r4h1,...,age,SQBescolari,SQBage,SQBhogar_total,SQBedjefe,SQBhogar_nin,SQBovercrowding,SQBdependency,SQBmeaned,agesq
0,ID_2f6873615,,0,5,0,1,1,0,,1,...,4,0,16,9,0,1,2.25,0.25,272.2500,16
1,ID_1c78846d2,,0,5,0,1,1,0,,1,...,41,256,1681,9,0,1,2.25,0.25,272.2500,1681
2,ID_e5442cf6a,,0,5,0,1,1,0,,1,...,41,289,1681,9,0,1,2.25,0.25,272.2500,1681
3,ID_a8db26a79,,0,14,0,1,1,1,1.0,0,...,59,256,3481,1,256,0,1.00,0.00,256.0000,3481
4,ID_a62966799,175000.0,0,4,0,1,1,1,1.0,0,...,18,121,324,1,0,1,0.25,64.00,,324
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
23851,ID_a065a7cad,,1,2,1,1,1,0,,0,...,10,9,100,36,25,4,36.00,0.25,33.0625,100
23852,ID_1a7c6953b,,0,3,0,1,1,0,,0,...,54,36,2916,16,36,4,4.00,1.00,36.0000,2916
23853,ID_07dbb4be2,,0,3,0,1,1,0,,0,...,12,16,144,16,36,4,4.00,1.00,36.0000,144
23854,ID_34d2ed046,,0,3,0,1,1,0,,0,...,12,25,144,16,36,4,4.00,1.00,36.0000,144


In [53]:
test = testdf.drop(columns = ['Id','idhogar','rez_esc', 'v18q1', 'v2a1','edjefa', 'edjefe'])

In [54]:
test.isnull().sum().sort_values(ascending = False).head(10)

SQBmeaned       31
meaneduc        31
hacdor           0
hogar_adul       0
parentesco9      0
parentesco10     0
parentesco11     0
parentesco12     0
hogar_nin        0
hogar_total      0
dtype: int64

In [55]:
for i in ['SQBmeaned', 'meaneduc']:
    test[i] = test[i].fillna(df[i].mean())
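
The cell above fills test-set gaps with the training-set mean, which keeps test statistics out of the preprocessing. The same idea can be expressed with scikit-learn's `SimpleImputer`, fitted on training data and applied to test data. The frames below are toy stand-ins, not the real columns.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy stand-ins for the train and test versions of one column.
train_demo = pd.DataFrame({"meaneduc": [6.0, 10.0, np.nan, 8.0]})
test_demo = pd.DataFrame({"meaneduc": [np.nan, 12.0]})

# Learn the mean from the training data only.
imputer = SimpleImputer(strategy="mean")
imputer.fit(train_demo)

# Apply the learned training mean to the test data.
test_filled = imputer.transform(test_demo)
print(test_filled)
```

An imputer like this can also be placed as the first step of a `Pipeline`, so the fill value is always learned from whatever data the pipeline is fitted on.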

In [56]:
for i in test.select_dtypes(include='object').columns:
    print(test[i].value_counts())

dependency
yes          5388
no           4289
.5           3678
2            1769
1.5          1758
.33333334    1533
.66666669    1130
8            1037
.25           684
3             596
1.3333334     278
2.5           224
.2            216
.75           203
4             195
.40000001     175
.60000002     128
1.6666666     120
5              96
.16666667      56
1.25           54
.80000001      45
.14285715      32
2.3333333      30
.83333331      22
3.5            18
7              16
3.3333333      13
.85714287      13
2.25           13
.375           11
1.2            11
.2857143        9
.125            9
6               7
Name: count, dtype: int64


In [57]:
# Same yes/no conversion as for the training data, assigned back to the column
for i in test.select_dtypes(include='object').columns:
    test[i] = test[i].replace(['yes','no'],[1,0])

In [58]:
test

Unnamed: 0,hacdor,rooms,hacapo,v14a,refrig,v18q,r4h1,r4h2,r4h3,r4m1,...,age,SQBescolari,SQBage,SQBhogar_total,SQBedjefe,SQBhogar_nin,SQBovercrowding,SQBdependency,SQBmeaned,agesq
0,0,5,0,1,1,0,1,1,2,0,...,4,0,16,9,0,1,2.25,0.25,272.250000,16
1,0,5,0,1,1,0,1,1,2,0,...,41,256,1681,9,0,1,2.25,0.25,272.250000,1681
2,0,5,0,1,1,0,1,1,2,0,...,41,289,1681,9,0,1,2.25,0.25,272.250000,1681
3,0,14,0,1,1,1,0,1,1,0,...,59,256,3481,1,256,0,1.00,0.00,256.000000,3481
4,0,4,0,1,1,1,0,0,0,0,...,18,121,324,1,0,1,0.25,64.00,102.588867,324
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
23851,1,2,1,1,1,0,0,2,2,1,...,10,9,100,36,25,4,36.00,0.25,33.062500,100
23852,0,3,0,1,1,0,0,1,1,0,...,54,36,2916,16,36,4,4.00,1.00,36.000000,2916
23853,0,3,0,1,1,0,0,1,1,0,...,12,16,144,16,36,4,4.00,1.00,36.000000,144
23854,0,3,0,1,1,0,0,1,1,0,...,12,25,144,16,36,4,4.00,1.00,36.000000,144


In [59]:
pipe = Pipeline((
("sc",StandardScaler()),
("pt",PowerTransformer()),   
("rf", RandomForestClassifier(class_weight='balanced', max_features=0.5))
))

pipe.fit(x, y)
ypred = pipe.predict(test)

test_pred = pd.DataFrame(ypred, index = testdf["Id"])
test_pred.rename(columns={0:"Target"}, inplace = True)
test_pred.to_csv("sub1.csv")

In [62]:
rf = RandomForestClassifier(class_weight='balanced', max_features=0.5)

rf.fit(x, y)
ypred = rf.predict(test)

In [63]:
# Feature Importance

pd.DataFrame({'Variable': x.columns, 'Imp':rf.feature_importances_}).sort_values(by = 'Imp', ascending= False)

Unnamed: 0,Variable,Imp
133,SQBmeaned,0.065520
96,meaneduc,0.063548
129,SQBedjefe,0.038346
116,qmobilephone,0.035044
107,overcrowding,0.031633
...,...,...
29,pisoother,0.000030
59,elimbasu6,0.000026
36,techootro,0.000002
105,instlevel9,0.000001
