**Costa Rican Household Poverty Level Prediction**

**Problem and Data Explanation**

The data for this competition is provided in two files: train.csv and test.csv. The training set has 9557 rows and 143 columns, while the testing set has 23856 rows and 142 columns. Each row represents one individual, and each column is a feature, either unique to the individual or shared by the individual's household. The training set has one additional column, Target, which represents the poverty level on a 1-4 scale and is the label for the competition. A value of 1 is the most extreme poverty.

The Target values represent poverty levels as follows:

1 = extreme poverty 
2 = moderate poverty 
3 = vulnerable households 
4 = non vulnerable households

The explanations for all 143 columns can be found in the [competition documentation](https://www.kaggle.com/c/costa-rican-household-poverty-prediction/data), but a few to note are below:

**Id:** a unique identifier for each individual, this should not be a feature that we use!
**idhogar:** a unique identifier for each household. This variable is not a feature, but will be used to group individuals by household as all individuals in a household will have the same identifier. 
**parentesco1:** indicates if this person is the head of the household. 
**Target:** the label, which should be equal for all members in a household
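
Since `Target` should be identical within a household, it can be worth checking that assumption before modeling. The following is a minimal sketch on a made-up toy frame (the values are illustrative, not taken from train.csv); `find_inconsistent_households` is a hypothetical helper name.

```python
import pandas as pd

# Toy data standing in for train.csv (values are made up for illustration)
toy = pd.DataFrame({
    'idhogar':     ['a', 'a', 'b', 'b', 'c'],
    'parentesco1': [1,   0,   1,   0,   1],
    'Target':      [4,   4,   2,   3,   1],
})

def find_inconsistent_households(df):
    """Return the ids of households whose members do not all share one Target."""
    n_labels = df.groupby('idhogar')['Target'].nunique()
    return n_labels[n_labels > 1].index.tolist()

inconsistent = find_inconsistent_households(toy)
print(inconsistent)
```

On the real data the same helper could be called on `train` before `idhogar` is dropped.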

This is a supervised multi-class classification machine learning problem:

**Supervised:** provided with the labels for the training data
**Multi-class classification:** labels are discrete values with 4 classes

![Housing poverty in Costa Rica](https://www.habitatforhumanity.org.uk/wp-content/uploads/2017/10/Housing-poverty-Costa-Rica--1200x600-c-default.jpg)

**Objectives:**

The objective of this kernel is to fit the following estimators with default parameters and record their accuracy.

**Modeling Estimators**

1. GradientBoostingClassifier
2. RandomForestClassifier
3. KNeighborsClassifier
4. ExtraTreesClassifier
5. XGBoost
6. LightGBM

**Tuning:**
Perform tuning using Bayesian Optimization & compare the accuracy of the estimators.
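
The default-parameter comparison described above can be sketched as a loop over estimators. This is only an illustration under stated assumptions: it runs on synthetic 4-class data from `make_classification` rather than on the competition files, and it covers only the four scikit-learn estimators (XGBoost and LightGBM are left out to keep the sketch dependency-free).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (ExtraTreesClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic 4-class data standing in for the competition data (assumption)
X_toy, y_toy = make_classification(n_samples=400, n_features=20, n_informative=8,
                                   n_classes=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2,
                                          random_state=0, stratify=y_toy)

# Estimators with default parameters (random_state only fixes the seed)
estimators = {
    'GradientBoostingClassifier': GradientBoostingClassifier(random_state=0),
    'RandomForestClassifier': RandomForestClassifier(random_state=0),
    'KNeighborsClassifier': KNeighborsClassifier(),
    'ExtraTreesClassifier': ExtraTreesClassifier(random_state=0),
}

# Fit each estimator and record its accuracy on the held-out split
scores = {}
for name, est in estimators.items():
    est.fit(X_tr, y_tr)
    scores[name] = accuracy_score(y_te, est.predict(X_te))

for name, acc in scores.items():
    print(f'{name}: {acc:.3f}')
```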

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# for data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# for modeling estimators
from sklearn.ensemble import RandomForestClassifier as rf
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import GradientBoostingClassifier as gbm
from xgboost.sklearn import XGBClassifier
import lightgbm as lgb

#for data processing
from sklearn.model_selection import train_test_split

#for tuning parameters
from bayes_opt import BayesianOptimization
from skopt import BayesSearchCV
from eli5.sklearn import PermutationImportance


# Misc.
import os
import time
import gc



**Summarize Data**
We will start out by understanding the data that we have by looking at its structure.

**Load Data**
Start by loading the CSV data from file into memory as data frames. Both files include a header row, so the column names are read from it. We also keep the test set's Id column, which is needed later for the submission file.

In [None]:
# Read in data
train = pd.read_csv('../input/train.csv')
test = pd.read_csv('../input/test.csv')

ids=test['Id']

**Explore data**

In [None]:
train.head()

In [None]:
train.shape, test.shape

In [None]:
train.info() 

**Perform data visualization**

A graph is a lot more telling about the distribution and relationships of attributes.

Nevertheless, it is important to take your time and review the statistics first. Each time you review the data a different way, you open yourself up to noticing different aspects and potentially achieving different insights into the problem.

For more options, see [data visualization in Pandas](http://pandas.pydata.org/pandas-docs/stable/user_guide/visualization.html).

**Feature Distributions**
The first and easiest property to review is the distribution of each attribute.



In [None]:
train.plot(figsize = (12,10))

**Feature-Target Relationships**

The next important relationship to explore is that of each attribute to the "Target" attribute.

In [None]:
# Pass the column as a keyword argument; recent seaborn versions require x=...
sns.countplot(x="Target", data=train)
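
A count plot shows the class balance visually; a small table can make the same information explicit. The sketch below uses a made-up `Target` series (not the real label distribution) to show how counts and shares per class might be tabulated.

```python
import pandas as pd

# Made-up labels for illustration only; replace with train['Target'] on the real data
toy_target = pd.Series([4, 4, 4, 4, 4, 4, 2, 2, 3, 1], name='Target')

counts = toy_target.value_counts().sort_index()   # rows per class, ordered 1..4
shares = (counts / counts.sum()).round(2)         # fraction of rows per class

balance = pd.DataFrame({'count': counts, 'share': shares})
print(balance)
```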

In [None]:
sns.countplot(x="r4t3",hue="Target",data=train)

In [None]:
sns.countplot(x="hhsize",hue="Target",data=train)

**Feature-Feature Relationships**

The final relationship to explore is the one between the attributes themselves.

We can review the relationships between attributes by looking at the distribution of the interactions of each pair of attributes.

This uses a built-in function to create a matrix of scatter plots of all float attributes against each other. On the diagonal, where each attribute would be plotted against itself, the kernel density estimate of the attribute is shown instead.

In [None]:
from pandas.plotting import scatter_matrix
scatter_matrix(train.select_dtypes('float'), alpha=0.2, figsize=(26, 20), diagonal='kde')
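
As a numeric complement to the scatter matrix, pairwise correlations can be ranked to find strongly related attributes. This is a sketch on synthetic columns (`a`, `a_scaled` and `noise` are hypothetical names, not columns of the dataset); on the real data the same steps could be applied to `train.select_dtypes('float')`.

```python
import numpy as np
import pandas as pd

# Synthetic columns: 'a_scaled' is almost a multiple of 'a', 'noise' is unrelated
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    'a': a,
    'a_scaled': 2 * a + rng.normal(scale=0.1, size=200),
    'noise': rng.normal(size=200),
})

# Absolute correlations; keep only the upper triangle so each pair appears once
corr = toy.corr().abs()
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper).stack().dropna().sort_values(ascending=False)

print(pairs.head())
```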

Below are distribution plots of the float columns, drawn with seaborn.

In [None]:
from collections import OrderedDict

plt.figure(figsize = (20, 16))
plt.style.use('fivethirtyeight')

# Color mapping
colors = OrderedDict({1: 'red', 2: 'orange', 3: 'blue', 4: 'green'})
poverty_mapping = OrderedDict({1: 'extreme', 2: 'moderate', 3: 'vulnerable', 4: 'non vulnerable'})

# Iterate through the float columns
for i, col in enumerate(train.select_dtypes('float')):
    ax = plt.subplot(4, 2, i + 1)
    # Iterate through the poverty levels
    for poverty_level, color in colors.items():
        # Plot each poverty level as a separate line
        sns.kdeplot(train.loc[train['Target'] == poverty_level, col].dropna(), 
                    ax = ax, color = color, label = poverty_mapping[poverty_level])
        
    plt.title(f'{col.capitalize()} Distribution'); plt.xlabel(f'{col}'); plt.ylabel('Density')

plt.subplots_adjust(top = 2)

**Object Columns**
The last column type is object which we can view as follows.

In [None]:
train.select_dtypes('object').head()

In [None]:
yes_no_map = {'no':0,'yes':1}
train['dependency'] = train['dependency'].replace(yes_no_map).astype(np.float32)
train['edjefe'] = train['edjefe'].replace(yes_no_map).astype(np.float32)
train['edjefa'] = train['edjefa'].replace(yes_no_map).astype(np.float32)

In [None]:
yes_no_map = {'no':0,'yes':1}
test['dependency'] = test['dependency'].replace(yes_no_map).astype(np.float32)
test['edjefe'] = test['edjefe'].replace(yes_no_map).astype(np.float32)
test['edjefa'] = test['edjefa'].replace(yes_no_map).astype(np.float32)
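
The same conversion is applied to both data frames, so it could be wrapped in one function. The sketch below is an assumption-laden variant rather than the notebook's exact code: `encode_yes_no` is a hypothetical helper, and it maps 'yes'/'no' to the strings '1'/'0' before parsing with `pd.to_numeric`. The toy values are made up.

```python
import numpy as np
import pandas as pd

YES_NO_AS_TEXT = {'yes': '1', 'no': '0'}          # same meaning as yes_no_map
MIXED_COLS = ['dependency', 'edjefe', 'edjefa']   # columns mixing numbers with yes/no

def encode_yes_no(df, cols=MIXED_COLS):
    """Return a copy of df with 'yes'/'no' mapped to 1/0 and the columns cast to float32."""
    out = df.copy()
    for col in cols:
        as_text = out[col].replace(YES_NO_AS_TEXT)
        out[col] = pd.to_numeric(as_text).astype(np.float32)
    return out

# Toy frame with made-up values in the same mixed format
toy = pd.DataFrame({
    'dependency': ['yes', '.5', 'no'],
    'edjefe':     ['10', 'no', '6'],
    'edjefa':     ['no', '11', 'yes'],
})
encoded = encode_yes_no(toy)
print(encoded)
```

With such a helper, `train = encode_yes_no(train)` and `test = encode_yes_no(test)` would keep the two frames in sync.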

**Reviewing the columns converted from categorical objects to numeric values**

In [None]:
train[["dependency","edjefe","edjefa"]].describe()

In [None]:
train[["dependency","edjefe","edjefa"]].hist()

In [None]:
plt.figure(figsize = (16, 12))

# Iterate through the three converted columns
for i, col in enumerate(['dependency', 'edjefa', 'edjefe']):
    ax = plt.subplot(3, 1, i + 1)
    # Iterate through the poverty levels
    for poverty_level, color in colors.items():
        # Plot each poverty level as a separate line
        sns.kdeplot(train.loc[train['Target'] == poverty_level, col].dropna(), 
                    ax = ax, color = color, label = poverty_mapping[poverty_level])
        
    plt.title(f'{col.capitalize()} Distribution'); plt.xlabel(f'{col}'); plt.ylabel('Density')

plt.subplots_adjust(top = 2)

**Fill in missing values (NULL values) with 0**

In [None]:
# Number of missing in each column
missing = pd.DataFrame(train.isnull().sum()).rename(columns = {0: 'total'})

# Create a percentage missing
missing['percent'] = missing['total'] / len(train)

missing.sort_values('percent', ascending = False).head(10)
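
Because the same missing-value summary is computed several times, a small function can reduce the risk of dividing by the wrong row count. This is a minimal sketch; `missing_table` is a hypothetical helper and the toy values are made up.

```python
import numpy as np
import pandas as pd

def missing_table(df):
    """Count missing values per column and express them as a fraction of df's own rows."""
    total = df.isnull().sum()
    table = pd.DataFrame({'total': total, 'percent': total / len(df)})
    return table.sort_values('percent', ascending=False)

# Toy frame with made-up values
toy = pd.DataFrame({
    'v2a1':  [np.nan, 100.0, np.nan, np.nan],
    'rooms': [3, 4, 5, 2],
})
table = missing_table(toy)
print(table.head(10))
```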

In [None]:
train['v18q1'] = train['v18q1'].fillna(0)
test['v18q1'] = test['v18q1'].fillna(0)
train['v2a1'] = train['v2a1'].fillna(0)
test['v2a1'] = test['v2a1'].fillna(0)

train['rez_esc'] = train['rez_esc'].fillna(0)
test['rez_esc'] = test['rez_esc'].fillna(0)
train['SQBmeaned'] = train['SQBmeaned'].fillna(0)
test['SQBmeaned'] = test['SQBmeaned'].fillna(0)
train['meaneduc'] = train['meaneduc'].fillna(0)
test['meaneduc'] = test['meaneduc'].fillna(0)
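
Filling with 0 makes a missing value indistinguishable from a real zero. One possible variation, shown here only as a sketch, is to record a 0/1 indicator before filling. `fill_with_flag` and the `_missing` suffix are hypothetical names, and the toy values are made up. Note that this adds columns, so any positional column slicing later in the notebook would need adjusting.

```python
import numpy as np
import pandas as pd

FILL_COLS = ['v18q1', 'v2a1', 'rez_esc', 'SQBmeaned', 'meaneduc']

def fill_with_flag(df, cols=FILL_COLS, value=0):
    """Return a copy of df where each listed column gets a 0/1 '<col>_missing' flag, then is filled."""
    out = df.copy()
    for col in cols:
        if col in out.columns:                       # skip columns the frame does not have
            out[col + '_missing'] = out[col].isnull().astype(int)
            out[col] = out[col].fillna(value)
    return out

# Toy frame with made-up values
toy = pd.DataFrame({'v2a1': [np.nan, 120000.0], 'v18q1': [1.0, np.nan]})
filled = fill_with_flag(toy)
print(filled)
```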

In [None]:
#Checking for missing values again to confirm that no missing values present
# Number of missing in each column
missing = pd.DataFrame(train.isnull().sum()).rename(columns = {0: 'total'})

# Create a percentage missing
missing['percent'] = missing['total'] / len(train)

missing.sort_values('percent', ascending = False).head(10)

In [None]:
# Checking the test set for missing values as well
# Number of missing in each column
missing = pd.DataFrame(test.isnull().sum()).rename(columns = {0: 'total'})

# Create a percentage missing (relative to the number of test rows)
missing['percent'] = missing['total'] / len(test)

missing.sort_values('percent', ascending = False).head(10)

**Dropping unnecessary columns**

In [None]:
train.drop(['Id','idhogar'], inplace = True, axis =1)

test.drop(['Id','idhogar'], inplace = True, axis =1)


In [None]:
train.shape, test.shape


**Dividing the data into predictors & target**

In [None]:
# Select the label by name rather than by position
y = train['Target']
y.unique()

In [None]:
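# Use every column except the label as a predictor.
# (The slice 1:141 skipped the first feature and included Target itself.)
X = train.drop(columns=['Target'])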
X.shape
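
Selecting predictors and label by name makes it easy to verify that the label has not leaked into the predictors. A minimal sketch on a made-up toy frame:

```python
import pandas as pd

# Toy frame with made-up values; 'Target' is the last column, as in train
toy = pd.DataFrame({'v2a1': [0.0, 1.0], 'rooms': [3, 4], 'Target': [4, 2]})

y_toy = toy['Target']                    # label selected by name
X_toy = toy.drop(columns=['Target'])     # all remaining columns are predictors

# Sanity check: the label must not be among the predictors
print('Target' in X_toy.columns, list(X_toy.columns))
```

On the real data, comparing `list(X.columns)` with `list(test.columns)` is a similar check that the model will see the same features at prediction time.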

**Splitting the data into train & test**

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
                                                    X,
                                                    y,
                                                    test_size = 0.2)
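
With imbalanced classes, a plain random split can leave very few rows of a rare class in the held-out part. One option, sketched here on synthetic data, is to pass `stratify` so both parts keep the class proportions, and `random_state` so the split is reproducible. The class sizes below are made up.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic features and an imbalanced 4-class label (made-up proportions)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 3))
y_toy = np.array([4] * 60 + [2] * 20 + [3] * 10 + [1] * 10)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy,
    test_size=0.2,
    stratify=y_toy,      # keep class proportions in both parts
    random_state=42)     # reproducible split

# Class counts in the held-out part
values, counts = np.unique(y_te, return_counts=True)
test_counts = dict(zip(values.tolist(), counts.tolist()))
print(test_counts)
```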

**Modeling**

**Modelling with GradientBoostingClassifier**

In [None]:
modelgbm=gbm()

In [None]:
start = time.time()
modelgbm = modelgbm.fit(X_train, y_train)
end = time.time()
(end-start)/60

In [None]:
classes = modelgbm.predict(X_test)

classes

In [None]:
(classes == y_test).sum()/y_test.size 
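
Accuracy can look good on imbalanced data even when the minority classes are never predicted. Macro-averaged F1, which weights every class equally, is one commonly used complement. The sketch below uses made-up labels and a deliberately trivial prediction to show the difference.

```python
from sklearn.metrics import accuracy_score, f1_score

# Made-up labels: 8 of 10 rows belong to class 4
y_true = [4] * 8 + [1, 2]
# A trivial "model" that always predicts the majority class
y_pred = [4] * 10

acc = accuracy_score(y_true, y_pred)
# zero_division=0 scores classes that are never predicted as 0 without a warning
macro_f1 = f1_score(y_true, y_pred, average='macro', zero_division=0)

print(f'accuracy = {acc:.3f}, macro F1 = {macro_f1:.3f}')
```

Here accuracy is 0.8, while macro F1 is roughly 0.30 because classes 1 and 2 each contribute an F1 of 0.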

**Performing tuning using Bayesian Optimization.**

In [None]:
bayes_cv_tuner = BayesSearchCV(
    # Estimator, constructed with the parameter values that are NOT tuned
    # (none are fixed here, so the defaults apply)
    gbm(),

    # Search space: estimator parameters to tune
    {
        'n_estimators': (100, 500),           # Specify integer-values parameters like this
        
        'max_depth': (4, 100),                # integer valued parameter
        'max_features' : (10,64),             # integer-valued parameter
        'min_weight_fraction_leaf' : (0,0.5, 'uniform')   # Float-valued parameter
    },

    # Search budget
    n_iter=32,            # How many points to sample
    cv = 2                # Number of cross-validation folds
)

In [None]:
# Start optimization
bayes_cv_tuner.fit(X_train, y_train)

In [None]:
#  Get list of best-parameters
bayes_cv_tuner.best_params_

In [None]:
modelgbmTuned=gbm(
               max_depth=31,
               max_features=29,
               min_weight_fraction_leaf=0.02067,
               n_estimators=489)
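
Retyping the best parameters by hand can drift from what the search actually found. scikit-learn's search objects expose `best_params_` and, with the default `refit=True`, a `best_estimator_` that is already refit on the training data; to my knowledge `BayesSearchCV` follows the same interface. The sketch below uses `GridSearchCV` on synthetic data as a stand-in, so it is an illustration of the interface rather than of Bayesian optimization itself.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic 4-class data (assumption; not the competition data)
X_toy, y_toy = make_classification(n_samples=200, n_features=10, n_informative=5,
                                   n_classes=4, random_state=0)

# GridSearchCV used as a stand-in for BayesSearchCV
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    {'n_estimators': [10, 20], 'max_depth': [3, 5]},
    cv=2)
search.fit(X_toy, y_toy)

# Option 1: reuse the estimator the search already refit with the best parameters
tuned = search.best_estimator_

# Option 2: build a fresh estimator from the best parameters without retyping them
rebuilt = RandomForestClassifier(random_state=0, **search.best_params_)

print(search.best_params_)
```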

In [None]:
start = time.time()
modelgbmTuned = modelgbmTuned.fit(X_train, y_train)
end = time.time()
(end-start)/60

In [None]:
ygbm=modelgbmTuned.predict(X_test)
ygbmtest=modelgbmTuned.predict(test)

In [None]:
# Best mean cross-validated accuracy achieved during the search
bayes_cv_tuner.best_score_

In [None]:
# Accuracy of the tuner's best estimator on the held-out split
bayes_cv_tuner.score(X_test, y_test)

In [None]:
# All parameter sets that were tried
bayes_cv_tuner.cv_results_['params']



**Modeling with Random Forest**

In [None]:
modelrf = rf()

In [None]:
start = time.time()
modelrf = modelrf.fit(X_train, y_train)
end = time.time()
(end-start)/60

In [None]:
classes = modelrf.predict(X_test)

In [None]:
(classes == y_test).sum()/y_test.size 

**Performing tuning using Bayesian Optimization.**

In [None]:
bayes_cv_tuner = BayesSearchCV(
    #  Place your estimator here with those parameter values
    #      that you DO NOT WANT TO TUNE
    rf(
       n_jobs = 2         # No need to tune this parameter value
      ),

    # Search space: estimator parameters to tune
    {
        'n_estimators': (100, 500),           # Specify integer-values parameters like this
        'criterion': ['gini', 'entropy'],     # Specify categorical parameters as here
        'max_depth': (4, 100),                # integer valued parameter
        'max_features' : (10,64),             # integer-valued parameter
        'min_weight_fraction_leaf' : (0,0.5, 'uniform')   # Float-valued parameter
    },

    # Search budget
    n_iter=32,            # How many points to sample
    cv = 3                # Number of cross-validation folds
)

In [None]:
# Start optimization
bayes_cv_tuner.fit(X_train, y_train)

In [None]:
#  Get list of best-parameters
bayes_cv_tuner.best_params_

In [None]:
modelrfTuned=rf(criterion="gini",
               max_depth=88,
               max_features=41,
               min_weight_fraction_leaf=0.1,
               n_estimators=285)

In [None]:
start = time.time()
modelrfTuned = modelrfTuned.fit(X_train, y_train)
end = time.time()
(end-start)/60

In [None]:
yrf=modelrfTuned.predict(X_test)
yrf

In [None]:
# Best mean cross-validated accuracy achieved during the search
bayes_cv_tuner.best_score_

In [None]:
# Accuracy of the tuner's best estimator on the held-out split
bayes_cv_tuner.score(X_test, y_test)

In [None]:
# All parameter sets that were tried
bayes_cv_tuner.cv_results_['params']

**Modelling with KNeighborsClassifier**

In [None]:
modelneigh = KNeighborsClassifier(n_neighbors=7)

In [None]:
start = time.time()
modelneigh = modelneigh.fit(X_train, y_train)
end = time.time()
(end-start)/60

In [None]:
classes = modelneigh.predict(X_test)

classes

In [None]:
(classes == y_test).sum()/y_test.size 
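
KNeighborsClassifier relies on distances, so a feature on a large scale (for example a monetary amount next to 0/1 flags) can dominate the result. A common remedy, sketched here on synthetic data as an assumption rather than as part of the original workflow, is to standardize features inside a pipeline. No particular accuracy gain is implied; the outcome depends on the data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic 4-class data; one feature is blown up to mimic a large-scale column
X_toy, y_toy = make_classification(n_samples=400, n_features=10, n_informative=6,
                                   n_classes=4, random_state=0)
X_toy[:, 0] *= 1000

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2,
                                          random_state=0, stratify=y_toy)

# Same classifier, without and with standardization
knn_raw = KNeighborsClassifier(n_neighbors=7).fit(X_tr, y_tr)
knn_scaled = make_pipeline(StandardScaler(),
                           KNeighborsClassifier(n_neighbors=7)).fit(X_tr, y_tr)

acc_raw = knn_raw.score(X_te, y_te)
acc_scaled = knn_scaled.score(X_te, y_te)
print(f'raw: {acc_raw:.3f}, scaled: {acc_scaled:.3f}')
```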

**Performing tuning using Bayesian Optimization.**

In [None]:
bayes_cv_tuner = BayesSearchCV(
    #  Place your estimator here with those parameter values
    #      that you DO NOT WANT TO TUNE
    KNeighborsClassifier(
       n_neighbors=7         # No need to tune this parameter value
      ),
    {"metric": ["euclidean", "cityblock"]},
    n_iter=32,            # How many points to sample
    cv = 2            # Number of cross-validation folds
   )

In [None]:
# Start optimization
bayes_cv_tuner.fit(X_train, y_train)

In [None]:
#  Get list of best-parameters
bayes_cv_tuner.best_params_

In [None]:
modelneighTuned = KNeighborsClassifier(n_neighbors=7,
               metric="cityblock")

In [None]:
start = time.time()
modelneighTuned = modelneighTuned.fit(X_train, y_train)
end = time.time()
(end-start)/60

In [None]:
yneigh=modelneighTuned.predict(X_test)

In [None]:
yneightest=modelneighTuned.predict(test)

In [None]:
# Best mean cross-validated accuracy achieved during the search
bayes_cv_tuner.best_score_

In [None]:
# Accuracy of the tuner's best estimator on the held-out split
bayes_cv_tuner.score(X_test, y_test)

In [None]:
# All parameter sets that were tried
bayes_cv_tuner.cv_results_['params']

**Modeling with ExtraTreesClassifier**

In [None]:
modeletf = ExtraTreesClassifier()

In [None]:
start = time.time()
modeletf = modeletf.fit(X_train, y_train)
end = time.time()
(end-start)/60

In [None]:
classes = modeletf.predict(X_test)

classes

In [None]:
(classes == y_test).sum()/y_test.size

**Performing tuning using Bayesian Optimization.**

In [None]:
bayes_cv_tuner = BayesSearchCV(
    #  Place your estimator here with those parameter values
    #      that you DO NOT WANT TO TUNE
    ExtraTreesClassifier( ),

    # Search space: estimator parameters to tune
    {   'n_estimators': (100, 500),           # Specify integer-values parameters like this
        'criterion': ['gini', 'entropy'],     # Specify categorical parameters as here
        'max_depth': (4, 100),                # integer valued parameter
        'max_features' : (10,64),             # integer-valued parameter
        'min_weight_fraction_leaf' : (0,0.5, 'uniform')   # Float-valued parameter
    },

    n_iter=32,            # How many points to sample
    cv = 2            # Number of cross-validation folds
)

In [None]:
# Start optimization
bayes_cv_tuner.fit(X_train, y_train)

In [None]:
#  Get list of best-parameters
bayes_cv_tuner.best_params_

In [None]:
modeletfTuned=ExtraTreesClassifier(criterion="entropy",
               max_depth=100,
               max_features=64,
               min_weight_fraction_leaf=0.0,
               n_estimators=100)

In [None]:
start = time.time()
modeletfTuned = modeletfTuned.fit(X_train, y_train)
end = time.time()
(end-start)/60

In [None]:
yetf=modeletfTuned.predict(X_test)
yetftest=modeletfTuned.predict(test)

In [None]:
# Best mean cross-validated accuracy achieved during the search
bayes_cv_tuner.best_score_

In [None]:
# Accuracy of the tuner's best estimator on the held-out split
bayes_cv_tuner.score(X_test, y_test)

In [None]:
# All parameter sets that were tried
bayes_cv_tuner.cv_results_['params']

**Modeling with XGBClassifier**

In [None]:
modelxgb=XGBClassifier()

In [None]:
start = time.time()
modelxgb = modelxgb.fit(X_train, y_train)
end = time.time()
(end-start)/60

In [None]:
classes = modelxgb.predict(X_test)

classes

In [None]:
(classes == y_test).sum()/y_test.size 

**Performing tuning using Bayesian Optimization.**

In [None]:
bayes_cv_tuner = BayesSearchCV(
    # Estimator, constructed with the parameter values that are NOT tuned
    XGBClassifier(
       n_jobs = 2
      ),

    # Search space. XGBoost does not use 'criterion', 'max_features' or
    # 'min_weight_fraction_leaf' (those are scikit-learn tree parameters), so they
    # are replaced by XGBoost parameters. The last three ranges are illustrative
    # assumptions, not tuned recommendations.
    {
        'n_estimators': (100, 300),                   # integer-valued parameter
        'max_depth': (4, 100),                        # integer-valued parameter
        'learning_rate': (0.01, 0.3, 'log-uniform'),  # float-valued parameter
        'subsample': (0.5, 1.0, 'uniform'),           # fraction of rows per tree
        'colsample_bytree': (0.5, 1.0, 'uniform')     # fraction of columns per tree
    },

    # Search budget
    n_iter=32,            # How many points to sample
    cv = 3                # Number of cross-validation folds
)

In [None]:
# Start optimization
bayes_cv_tuner.fit(X_train, y_train)

In [None]:
#  Get list of best-parameters
bayes_cv_tuner.best_params_

In [None]:
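# max_depth and n_estimators are the values recorded from an earlier search run.
# The scikit-learn-only arguments (criterion, max_features,
# min_weight_fraction_leaf) are dropped because XGBoost does not use them.
modelxgbTuned=XGBClassifier(
               max_depth=85,
               n_estimators=178)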

In [None]:
start = time.time()
modelxgbTuned = modelxgbTuned.fit(X_train, y_train)
end = time.time()
(end-start)/60

In [None]:
#yxgb=modelxgbTuned.predict(X_test)
#yxgbtest=modelxgbTuned.predict(test)

In [None]:
# Best mean cross-validated accuracy achieved during the search
bayes_cv_tuner.best_score_

In [None]:
# Accuracy of the tuner's best estimator on the held-out split
bayes_cv_tuner.score(X_test, y_test)

In [None]:
# All parameter sets that were tried
bayes_cv_tuner.cv_results_['params']

**Modeling with LightGBM**

In [None]:
modellgb = lgb.LGBMClassifier(max_depth=-1, learning_rate=0.1, objective='multiclass',
                             random_state=None, silent=True, metric='None', 
                             n_jobs=4, n_estimators=5000, class_weight='balanced',
                             colsample_bytree =  0.93, min_child_samples = 95, num_leaves = 14, subsample = 0.96)

In [None]:
start = time.time()
modellgb = modellgb.fit(X_train, y_train)
end = time.time()
(end-start)/60

In [None]:
classes = modellgb.predict(X_test)

classes

In [None]:
(classes == y_test).sum()/y_test.size 

**Performing tuning using Bayesian Optimization.**

In [None]:
bayes_cv_tuner = BayesSearchCV(
    # Estimator, constructed with the parameter values that are NOT tuned
    lgb.LGBMClassifier(
       n_jobs = 2
      ),

    # Search space. LightGBM does not use 'criterion', 'max_features' or
    # 'min_weight_fraction_leaf' (those are scikit-learn tree parameters), so they
    # are replaced by LightGBM parameters. The last three ranges are illustrative
    # assumptions, not tuned recommendations.
    {
        'n_estimators': (100, 500),                   # integer-valued parameter
        'max_depth': (4, 100),                        # integer-valued parameter
        'num_leaves': (8, 64),                        # integer-valued parameter
        'learning_rate': (0.01, 0.3, 'log-uniform'),  # float-valued parameter
        'colsample_bytree': (0.5, 1.0, 'uniform')     # fraction of columns per tree
    },

    # Search budget
    n_iter=32,            # How many points to sample
    cv = 3                # Number of cross-validation folds
)

In [None]:
# Start optimization
bayes_cv_tuner.fit(X_train, y_train)

In [None]:
#  Get list of best-parameters
bayes_cv_tuner.best_params_

In [None]:
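# max_depth and n_estimators are the values recorded from an earlier search run.
# The scikit-learn-only arguments (criterion, max_features,
# min_weight_fraction_leaf) are dropped because LightGBM does not use them.
modellgbTuned = lgb.LGBMClassifier(
               max_depth=35,
               n_estimators=148)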

In [None]:
start = time.time()
modellgbTuned = modellgbTuned.fit(X_train, y_train)
end = time.time()
(end-start)/60

In [None]:
ylgb=modellgbTuned.predict(X_test)
ylgbtest=modellgbTuned.predict(test)

In [None]:
# Best mean cross-validated accuracy achieved during the search
bayes_cv_tuner.best_score_

In [None]:
# Accuracy of the tuner's best estimator on the held-out split
bayes_cv_tuner.score(X_test, y_test)

In [None]:
# All parameter sets that were tried
bayes_cv_tuner.cv_results_['params']

**Building a new dataset from the predictions of these models**

In [None]:
NewTrain = pd.DataFrame()
#NewTrain['yrf'] = yrf.tolist()
NewTrain['yetf'] = yetf.tolist()
NewTrain['yneigh'] = yneigh.tolist()
NewTrain['ygbm'] = ygbm.tolist()
#NewTrain['yxgb'] = yxgb.tolist()
NewTrain['ylgb'] = ylgb.tolist()

NewTrain.head(5), NewTrain.shape

In [None]:
NewTest = pd.DataFrame()
#NewTest['yrf'] = yrftest.tolist()
NewTest['yetf'] = yetftest.tolist()
NewTest['yneigh'] = yneightest.tolist()
NewTest['ygbm'] = ygbmtest.tolist()
#NewTest['yxgb'] = yxgbtest.tolist()
NewTest['ylgb'] = ylgbtest.tolist()
NewTest.head(5), NewTest.shape

In [None]:
NewModel=rf(criterion="entropy",
               max_depth=87,
               max_features=4,
               min_weight_fraction_leaf=0.0,
               n_estimators=600)

In [None]:
start = time.time()
NewModel = NewModel.fit(NewTrain, y_test)
end = time.time()
(end-start)/60
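
The second-level model above is trained on predictions for the held-out 20% only. An alternative that uses every training row is to build the second-level features from out-of-fold predictions. The sketch below illustrates that idea with `cross_val_predict` on synthetic data; it is an assumption about how the stacking step could be arranged, not the notebook's method, and it uses only two base models with made-up settings.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic 4-class data (assumption; not the competition data)
X_toy, y_toy = make_classification(n_samples=400, n_features=20, n_informative=8,
                                   n_classes=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2,
                                          random_state=0, stratify=y_toy)

base_models = {
    'yetf': ExtraTreesClassifier(n_estimators=50, random_state=0),
    'yneigh': KNeighborsClassifier(n_neighbors=7),
}

# Out-of-fold predictions: each training row is predicted by a model that never saw it
meta_train = np.column_stack(
    [cross_val_predict(m, X_tr, y_tr, cv=3) for m in base_models.values()])

# Refit each base model on all training rows, then predict the unseen rows
meta_test = np.column_stack(
    [m.fit(X_tr, y_tr).predict(X_te) for m in base_models.values()])

# Second-level model trained on the out-of-fold predictions
meta_model = RandomForestClassifier(n_estimators=100, random_state=0)
meta_model.fit(meta_train, y_tr)
stacked_pred = meta_model.predict(meta_test)
print(stacked_pred[:10])
```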

In [None]:
ypredict=NewModel.predict(NewTest)
ypredict

In [None]:

#submit=pd.DataFrame({'Id': ids, 'Target': ylgbtest})
submit=pd.DataFrame({'Id': ids, 'Target': ypredict})
submit.head(5)

In [None]:
submit.to_csv('submit.csv', index=False)

In [None]:
sub = pd.read_csv('../input/sample_submission.csv')
# Overwrite the placeholder labels; the column name must be 'Target', matching submit above
sub['Target'] = ypredict
sub.to_csv('submission.csv',index=False)
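
Before uploading, a few cheap checks on the submission frame can catch format mistakes. The sketch below assumes the `Id`/`Target` layout used for `submit` above; `check_submission` is a hypothetical helper and the ids are made up.

```python
import pandas as pd

def check_submission(df, expected_rows):
    """Raise AssertionError if the frame does not look like a valid submission."""
    assert list(df.columns) == ['Id', 'Target'], 'unexpected columns'
    assert len(df) == expected_rows, 'unexpected number of rows'
    assert df['Id'].is_unique, 'duplicate ids'
    assert df['Target'].isin([1, 2, 3, 4]).all(), 'labels outside 1-4'
    return True

# Toy submission with made-up ids and labels
submit_toy = pd.DataFrame({'Id': ['ID_1', 'ID_2', 'ID_3'], 'Target': [4, 2, 4]})
ok = check_submission(submit_toy, expected_rows=3)
print(ok)
```

On the real data the call would be `check_submission(submit, expected_rows=len(ids))`.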

... Continue tuning and analysing ...