In [1]:
import pandas as pd
import sklearn as sfs
import matplotlib.pyplot as plt
import numpy as np
import sys
sys.path.append('..')
from model_handler import ModelHandler
from feature_selection import FeatureSelectionAndGeneration
handler = ModelHandler()
dataset = handler.dataset
train_set = dataset[handler.train_mask]
dt = train_set.copy()

In [2]:
dt.head()

Unnamed: 0,city,latitude,longitude,population,country_code,c40,risk0,risk1,risk2,risk3,...,SDG 6.4.1. Water Use Efficiency,SDG 6.4.2. Water Stress,Seasonal variability (WRI),Total internal renewable water resources per capita,Total population with access to safe drinking-water (JMP),Total renewable water resources per capita,Total water withdrawal per capita,Urban population with access to safe drinking-water (JMP),country,population_1k_density
2,Copenhagen,55.6786,12.5635,1085000.0,DNK,False,,2.0,,2.0,...,368.612902,20.040562,1.3,1046.705025,100.0,1046.705025,129.285516,100.0,Denmark,6691.52832
4,Frederikshavn,57.4337,10.5333,24103.0,DNK,False,,2.0,,,...,368.612902,20.040562,1.3,1046.705025,100.0,1046.705025,129.285516,100.0,Denmark,551.431763
7,Abidjan,5.3364,-4.0267,4980000.0,CIV,False,,,,,...,23.167322,5.087566,2.8,3144.351686,81.9,3443.07328,47.54993,93.1,Côte d'Ivoire,14128.076172
11,Bouaké,7.4137,-5.0149,715435.0,CIV,False,,,,3.0,...,23.167322,5.087566,2.8,3144.351686,81.9,3443.07328,47.54993,93.1,Côte d'Ivoire,81.718063
12,Abington,40.1108,-75.1146,55573.0,USA,False,,1.0,1.0,2.0,...,42.378501,28.161984,1.7,8668.50859,99.2,9440.614927,1366.689738,99.4,United States of America,957.105957


The dataset contains several risks to predict. Each risk is treated as a separate target, that is, a separate response variable.

The aim is to build a model that predicts each risk as accurately as possible. The learning process differs from risk to risk, because the smallest set of variables that explains most of the variance in the dataset is unique to each one. The following pipeline is therefore run once per risk, so that each prediction is as precise as possible.

# Dataset splitting

The first step is to split the dataset into training and test sets. The training set is used for feature selection, which is implemented with a boosted regression model. This is a supervised learning approach, so labels are needed to fit the regression. In this dataset a given risk is assigned to only some of the cities, so the training set consists of all entries that have a value for that risk. The remaining entries form the test set used for the classification task: these are the cities that need a prediction.

In [3]:
import random

def data_splitting(dt, risk):
    # Training rows are the ones where the given risk is labelled
    train = dt[dt[risk].notnull()]
    y_train = train[risk]  # response variable
    # Remove all risk columns (the labels) from the training features
    train = train[dt.columns.difference(dt.filter(like='risk').columns, sort=False)]
    # Remove non-numeric columns since they are only descriptive
    num_cols = train._get_numeric_data().columns
    to_drop = list(set(train.columns) - set(num_cols))
    to_drop.append("c40")  # comment if aug
    train = train[train.columns.drop(to_drop)]
    # Test rows are the ones without a label for the given risk
    test = dt[~dt.index.isin(train.index)]
    test = test[test.columns.drop(to_drop)]
    test = test[test.columns.difference(test.filter(like='risk').columns, sort=False)]
    # The test rows have no true labels: y_test is a random placeholder in {0, 1, 2, 3}
    y_test = [random.randrange(4) for _ in range(len(test))]
    return train, test, y_train, y_test

# Feature selection

Decision trees are preferable when the relationship between the predictors and the labels is highly non-linear and complex. The dataset has many different predictors and we do not know whether this relationship is linear, so a tree-based method is a safe choice.

Among the ensemble methods we chose `Boosting`. Like `Random Forest` and `Bagging`, it aggregates many decision trees, but it grows them sequentially, each tree using information from the previously grown ones, instead of fitting them independently on bootstrap samples.

The procedure fits small trees to the residuals in order to slowly reduce the prediction error. Generally, models that learn slowly tend to perform better. A pitfall of Boosting, however, is that it depends heavily on its tuning parameters. Hence, it is important to use `Cross Validation` to select, for every target, the parameter combination that gives the best score.
For this purpose we use 10-fold cross validation with a small parameter grid, because the tuning process is already slow given the number of parameters that need to be optimized.
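The residual-fitting procedure described above can be sketched in a few lines. This is a minimal illustration, not the XGBoost algorithm used in the notebook: it assumes a single feature, depth-one trees (stumps), squared-error loss, and hypothetical helper names `fit_stump` and `boost` on synthetic data.

```python
import numpy as np

def fit_stump(x, residual):
    """Return (threshold, left_value, right_value) minimising squared error."""
    best = None
    for t in np.unique(x)[:-1]:  # every candidate leaves both sides non-empty
        left, right = residual[x <= t], residual[x > t]
        err = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or err < best[0]:
            best = (err, t, left.mean(), right.mean())
    return best[1:]

def boost(x, y, n_rounds=200, learning_rate=0.1):
    """Fit stumps sequentially to the residuals of the running prediction."""
    pred = np.full_like(y, y.mean(), dtype=float)
    stumps = []
    for _ in range(n_rounds):
        residual = y - pred  # what the ensemble still gets wrong
        t, left_value, right_value = fit_stump(x, residual)
        # A small learning rate makes the model "learn slowly"
        pred += learning_rate * np.where(x <= t, left_value, right_value)
        stumps.append((t, left_value, right_value))
    return pred, stumps

# Synthetic, non-linear toy data (not the notebook's dataset)
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=200)
y = np.sin(x) + rng.normal(scale=0.1, size=200)

baseline_mse = np.mean((y - y.mean()) ** 2)
pred, stumps = boost(x, y)
boosted_mse = np.mean((y - pred) ** 2)
print(baseline_mse, boosted_mse)
```

The number of rounds, the learning rate and the tree size are exactly the kind of tuning parameters that the cross-validation step has to choose.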

In [4]:
import xgboost as xgb
from sklearn.metrics import accuracy_score, make_scorer
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import Pipeline
import shutil
import os
memory_dir = '.pipeline_cache.tmp'
if os.path.isdir(memory_dir):
    shutil.rmtree(memory_dir)
model = Pipeline(
    [('generation_and_selection', FeatureSelectionAndGeneration(feats_num=200)),
     ('regressor', xgb.XGBRegressor())],
    memory=memory_dir)

By default, `xgb.XGBRegressor` uses the objective function `reg:squarederror`, which corresponds to regression with squared error as the loss function.
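To connect this objective with the residual-fitting view of boosting, here is a short derivation. It assumes the per-sample loss is written with a factor $\tfrac{1}{2}$, a common convention that only rescales the gradient. The symbols are introduced here for illustration: $y_i$ is the label, $\hat{y}_i$ the current prediction, $f_t$ the tree added at round $t$, and $\eta$ the learning rate.

$$
\ell(y_i, \hat{y}_i) = \tfrac{1}{2}\,(y_i - \hat{y}_i)^2,
\qquad
-\frac{\partial \ell}{\partial \hat{y}_i} = y_i - \hat{y}_i
$$

Under this loss the negative gradient is the residual, so each new tree is, roughly speaking, fitted to what the current ensemble still gets wrong, and the prediction is updated as $\hat{y}_i \leftarrow \hat{y}_i + \eta\, f_t(x_i)$.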

In [5]:
def boosting_reg(train, y_train, risk, best_parameters):

    # Cross validation: grid search over the parameters of the pipeline's regressor step
    kfold = KFold(n_splits=10)
    param_grid = {
        "regressor__colsample_bytree": [0.1, 0.5, 1.0],
        "regressor__min_child_weight": [1.0, 1.2],
        "regressor__max_depth": [7, 9],
        "regressor__n_estimators": [500],
        "regressor__alpha": [10, 12, 15],
        "regressor__subsample": [0.5],
        "regressor__objective": ["reg:squarederror"],
    }
    # also try "objective": ["multi:softmax", "multi:softprob", "rank:map"], "n_classes": 4
    reg_cv = GridSearchCV(model, cv=kfold, param_grid=param_grid)
    reg_cv.fit(train, y_train)
    best_parameters[risk] = reg_cv.best_params_

    # Training: strip the "regressor__" pipeline prefix so that XGBRegressor
    # receives its own parameter names
    xgb_params = {name.split("__", 1)[1]: value
                  for name, value in best_parameters[risk].items()}
    gbm = xgb.XGBRegressor(**xgb_params)
    gbm.fit(train, y_train)

    # Feature selection plot (disabled)
    # im = pd.DataFrame({'importance': gbm.feature_importances_, 'var': train.columns})
    # im = im.sort_values(by='importance', ascending=False)
    # fig, ax = plt.subplots(figsize=(8, 8))
    # xgb.plot_importance(gbm, max_num_features=15, ax=ax, importance_type='gain')
    # plt.show()

    # accuracy_scores[risk]=[gbm.score(train,y_train),0]
    # Keep the 15 most important features among those with non-zero importance
    importances = gbm.feature_importances_
    sorted_idx = np.argsort(importances)[::-1]
    best_features = [train.columns[index] for index in sorted_idx if importances[index] > 0]
    return gbm, best_features[:15], best_parameters  # , accuracy_scores
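The selection rule at the end of `boosting_reg` (sort by importance, drop zero-importance features, keep the top ones) can be checked in isolation. The sketch below uses a hypothetical helper `top_features` and made-up feature names and importances; it does not involve a fitted model.

```python
import numpy as np

def top_features(names, importances, k=15):
    """Return up to k feature names with positive importance, most important first."""
    importances = np.asarray(importances)
    order = np.argsort(importances)[::-1]  # indices from highest to lowest importance
    return [names[i] for i in order if importances[i] > 0][:k]

# Made-up example values for illustration only
names = ["a", "b", "c", "d"]
importances = [0.1, 0.0, 0.6, 0.3]

selected_all = top_features(names, importances)       # ['c', 'd', 'a']
selected_two = top_features(names, importances, k=2)  # ['c', 'd']
print(selected_all, selected_two)
```

Feature `b` is dropped because its importance is zero, which is why fewer than `k` names can be returned.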

# Classification

In [6]:
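def boosting_clas(gbm, test, y_test, risk, accuracy_scores):

    # The regressor returns one continuous value per row: round it to the nearest integer risk level
    predictions = np.rint(gbm.predict(test)).astype(int)
    # y_test holds random placeholder labels (see data_splitting), so these scores
    # (R^2 from gbm.score, then accuracy) are not meaningful for the unlabelled rows
    accuracy_scores[risk] = [gbm.score(test, y_test), accuracy_score(y_test, predictions)]

    return predictions, accuracy_scores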
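Because the model is a regressor, its continuous outputs have to be mapped to discrete risk levels before an accuracy can be computed. The sketch below shows one simple way to do this, rounding to the nearest integer, on made-up numbers; the values and the choice of rounding are assumptions for illustration.

```python
import numpy as np

# Made-up continuous regression outputs and true integer labels
raw_predictions = np.array([0.2, 1.6, 2.4, 2.9, 1.1])
true_labels = np.array([0, 2, 3, 3, 1])

# Round to the nearest integer level (np.rint rounds exact .5 values to the nearest even integer)
predicted_labels = np.rint(raw_predictions).astype(int)

# Accuracy is the fraction of rows where the rounded prediction matches the label
accuracy = float(np.mean(predicted_labels == true_labels))
print(predicted_labels, accuracy)
```

Here four of the five rounded predictions match, so the accuracy is 0.8. Such a score is only informative when the labels are real, not random placeholders.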

# Generation of a reduced dataset filled with predictions

In [7]:
#WIP

risks = ["risk3"]#list(dt.filter(like='risk').columns)
best_parameters = dict()
accuracy_scores = dict()
filled_dataset = dt.copy()

for risk in risks:
    train, test, y_train, y_test = data_splitting(dt,risk)
    gbm, features, best_parameters = boosting_reg(train, y_train, risk, best_parameters)
    fname = risk+"_"+"boost_model.json"
    gbm.save_model(fname)
    # gbm.load_model(fname)

    predictions, accuracy_scores = boosting_clas(gbm, test, y_test, risk, accuracy_scores)
    test_index = dt.index.isin(test.index)
    filled_dataset.loc[test_index, risk] = predictions.round()
    
    get_back = ["city", "country", risk]
    to_drop = set(dt.columns) - set(features) - set(get_back)
    filled_red_dataset = filled_dataset[dt.columns.drop(to_drop)]
    
    filled_red_dataset.to_csv(risk+"_"+'filled_red_dataset.csv',index=False)

Explained variation percentage per principal component: [51.527267687473255, 31.276683461689032, 5.944197322986371, 2.298170288998239, 1.9794909965853398, 1.3773852095055767, 0.9789147728212787, 0.6936766873395442, 0.6084755821833407, 0.5016407144677693]
Total percentage of the explained data by 10 components is: 97.19
Percentage of the information that is lost for using 10 components is: 2.81
Features select 
 |      | Specs                                                                                                                                                                                   |    Score |
|-----:|:----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------:|
| 1962 | perc_poly__Feat{Overall loss in HDI due to inequality (%)} * Feat{Gross enrolment ratio, pre-primary, female (%)}                                                   

Explained variation percentage per principal component: [64.71604980096023, 16.173153216081076, 4.9134557649436585, 3.2423565414748152, 2.4538440083654676, 1.611359554735935, 1.2101988388678915, 0.8267926378151152, 0.7451364753614902, 0.5960329106010072]
Total percentage of the explained data by 10 components is: 96.49
Percentage of the information that is lost for using 10 components is: 3.51
Features select 
 |      | Specs                                                                                                                                                                                   |    Score |
|-----:|:----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------:|
|  746 | perc_poly__Feat{Labour force participation rate (% ages 15 and older), male} * Feat{Inequality in income (%)}                                                       

Explained variation percentage per principal component: [63.65735348341913, 15.037335597405638, 5.847738528689826, 4.442466260273361, 2.440273569528914, 1.6640176332928827, 1.2351458217885518, 0.8815571642516987, 0.7113326679152352, 0.6398889605536092]
Total percentage of the explained data by 10 components is: 96.56
Percentage of the information that is lost for using 10 components is: 3.44
Features select 
 |      | Specs                                                                                                                                                                                   |    Score |
|-----:|:----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------:|
|  137 | scaled__SDG 6.4.2. Water Stress                                                                                                                                       

Explained variation percentage per principal component: [51.251380230940946, 33.07911802854864, 6.225427611944596, 2.1852889434098177, 2.060126485422884, 1.1870000035569066, 0.7309523767593628, 0.6037541290169969, 0.411492577387385, 0.31819088028331005]
Total percentage of the explained data by 10 components is: 98.05
Percentage of the information that is lost for using 10 components is: 1.95
Features select 
 |      | Specs                                                                                                                                                                           |    Score |
|-----:|:--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------:|
|  746 | perc_poly__Feat{Labour force participation rate (% ages 15 and older), male} * Feat{Inequality in income (%)}                                                                   | 20.

Explained variation percentage per principal component: [69.9467491247594, 10.376012018750046, 7.162054662491743, 3.2497719523455744, 2.5909092814749686, 1.302560926449103, 0.8783157220659437, 0.7739778642973619, 0.48480948955984815, 0.4535272846688262]
Total percentage of the explained data by 10 components is: 97.22
Percentage of the information that is lost for using 10 components is: 2.78
Features select 
 |      | Specs                                                                                                                                                                                   |    Score |
|-----:|:----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------:|
| 1568 | perc_poly__Feat{Youth not in school or employment (% ages 15-24)} * Feat{MDG 7.5. Freshwater withdrawal as % of total renewable water resources}                     

Explained variation percentage per principal component: [53.60882456931228, 31.41378472564097, 4.9185344523875525, 2.1178740001840914, 1.805622413128305, 1.203277213356875, 0.7840929289085885, 0.64784538321942, 0.4736012797703188, 0.3752233565273033]
Total percentage of the explained data by 10 components is: 97.35
Percentage of the information that is lost for using 10 components is: 2.65
Features select 
 |      | Specs                                                                                                                                                                                     |    Score |
|-----:|:------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------:|
| 1568 | perc_poly__Feat{Youth not in school or employment (% ages 15-24)} * Feat{MDG 7.5. Freshwater withdrawal as % of total renewable water resources}                    

Explained variation percentage per principal component: [51.20465635130829, 29.72395547473072, 6.840240520306525, 4.27633796676246, 1.8954329899633102, 1.1806660298543528, 0.9856318287711557, 0.6407233690401521, 0.5395536218905556, 0.37588443864541143]
Total percentage of the explained data by 10 components is: 97.66
Percentage of the information that is lost for using 10 components is: 2.34
Features select 
 |      | Specs                                                                                                                                                                                   |    Score |
|-----:|:----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------:|
|  746 | perc_poly__Feat{Labour force participation rate (% ages 15 and older), male} * Feat{Inequality in income (%)}                                                         

Explained variation percentage per principal component: [75.24563413436083, 14.422011630240439, 5.353987515055839, 1.6405914905760142, 0.7467830810901709, 0.4225794472003744, 0.4125292402415426, 0.25252675469829833, 0.22884847050094978, 0.18545467185370323]
Total percentage of the explained data by 10 components is: 98.91
Percentage of the information that is lost for using 10 components is: 1.09
Features select 
 |      | Specs                                                                                                                                                                                   |    Score |
|-----:|:----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------:|
|  746 | perc_poly__Feat{Labour force participation rate (% ages 15 and older), male} * Feat{Inequality in income (%)}                                                    

Explained variation percentage per principal component: [61.66378476352277, 25.478668333804354, 4.282122565298741, 3.106697674905659, 1.6996923997762763, 0.663213458664834, 0.5394872965933543, 0.388606555797301, 0.3444134778993283, 0.240186123623511]
Total percentage of the explained data by 10 components is: 98.41
Percentage of the information that is lost for using 10 components is: 1.59
Features select 
 |      | Specs                                                                                                                                                                                       |    Score |
|-----:|:--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------:|
|  425 | perc_poly__Feat{Population with at least some secondary education, male (% ages 25 and older)} * Feat{Unemployment, total (% of total labor force) (modeled ILO 

Explained variation percentage per principal component: [59.6721874111433, 24.743085902353513, 7.546285123373761, 2.534894120536559, 1.79660405682616, 0.6690106402391656, 0.46646554574725096, 0.4139412108099819, 0.3836163404984308, 0.2620553225329359]
Total percentage of the explained data by 10 components is: 98.49
Percentage of the information that is lost for using 10 components is: 1.51
Features select 
 |      | Specs                                                                                                                                                                                         |    Score |
|-----:|:----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------:|
|  746 | perc_poly__Feat{Labour force participation rate (% ages 15 and older), male} * Feat{Inequality in income (%)}                                              

KeyboardInterrupt: 