# **XGBoost Exercise**

---

<a href="https://midoritoyota.netlify.app/" target="_blank"><img align="left" src="./images/portfolio.png" title="See my portfolio!"/></a><img align="left" src="./images/espaco.png"/>

<a href="mailto:midori.toyota@gmail.com" target="_blank"><img align="left" src="./images/gmail.png" title="Contact me!"/></a><img align="left" src="./images/espaco.png"/>

<a href="https://www.linkedin.com/in/midoritoyota/" target="_blank"> <img align="left" src="./images/linkedin.png" title="Add me on linkedin!" /></a><img align="left" src="./images/espaco.png"/>

<a href="https://github.com/MidoriToyota" target="_blank"> <img align="left" src="./images/github.png" title="Follow me on github!"/></a>

<br/><br/>


This notebook is a study of the XGBoost algorithm. It reproduces the exercise solved by Tong He, co-author of the XGBoost R package, in the video below:

https://www.youtube.com/watch?time_continue=717&v=ufHo8vbk6g4&feature=emb_title

The data for the exercise come from the Kaggle competition "Higgs Boson Machine Learning Challenge":

https://www.kaggle.com/c/higgs-boson


## **Example with the package's built-in dataset**

### **Package and data**

The package ships with two datasets, one for training and one for testing.

In [1]:
# Load the package
library(xgboost)
library(methods)

In [2]:
# Load the datasets
data(agaricus.train, package="xgboost")
data(agaricus.test, package="xgboost")
train = agaricus.train
test = agaricus.test

# Structure of the data
str(train)

List of 2
 $ data :Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
  .. ..@ i       : int [1:143286] 2 6 8 11 18 20 21 24 28 32 ...
  .. ..@ p       : int [1:127] 0 369 372 3306 5845 6489 6513 8380 8384 10991 ...
  .. ..@ Dim     : int [1:2] 6513 126
  .. ..@ Dimnames:List of 2
  .. .. ..$ : NULL
  .. .. ..$ : chr [1:126] "cap-shape=bell" "cap-shape=conical" "cap-shape=convex" "cap-shape=flat" ...
  .. ..@ x       : num [1:143286] 1 1 1 1 1 1 1 1 1 1 ...
  .. ..@ factors : list()
 $ label: num [1:6513] 1 0 0 1 0 0 0 1 0 0 ...


### **The algorithm**

**Minimum required inputs**

- `Input data` : the features, as a matrix
- `Target variable` : a numeric vector; for classification, class labels must start at 0.
- `Objective` : "reg:linear" (called "reg:squarederror" in newer xgboost releases) or "binary:logistic"
- `Number of iterations` : the number of boosting rounds, i.e., trees added to the model
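The input requirements above can be sketched in base R. The `species` factor and the two feature columns are hypothetical stand-ins, not part of the original exercise; the point is only that class labels are recoded to start at 0 and that features are passed as a numeric matrix.

```r
# Hypothetical target column stored as a factor (illustrative only)
species <- factor(c("edible", "poisonous", "edible", "poisonous", "poisonous"))

# as.integer() on a factor returns codes starting at 1,
# so subtract 1 to obtain labels starting at 0
label_example <- as.integer(species) - 1L

# Features go in as a numeric matrix rather than a data frame
features_example <- as.matrix(data.frame(
  cap_size = c(3.1, 5.4, 2.8, 6.0, 5.9),
  has_ring = c(1, 0, 1, 0, 0)
))

print(label_example)
print(dim(features_example))
```

Factor levels are sorted alphabetically by default, so here "edible" becomes 0 and "poisonous" becomes 1.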

**Command to create the model**

In [3]:
# Evaluate the model by classification error (the default metric for this objective)
bst = xgboost(data = train$data, label = train$label, nrounds = 2, objective = "binary:logistic")

[1]	train-error:0.000614 
[2]	train-error:0.001228 


In [4]:
# Evaluate the model by AUC (area under the ROC curve)
bst = xgboost(data = train$data, label = train$label, nrounds = 2, objective = "binary:logistic", eval_metric = "auc")

[1]	train-auc:0.999238 
[2]	train-auc:0.999238 


**Predicting results**

In [5]:
# Inspect the predictions
pred = predict(bst, test$data)
head(pred)
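With the "binary:logistic" objective, the predictions are probabilities, so they still need a cutoff to become class labels. A minimal sketch with hypothetical values (the vectors below are illustrative and are not the notebook's actual output):

```r
# Hypothetical predicted probabilities and true labels (illustrative values only)
pred_example  <- c(0.28, 0.97, 0.28, 0.01, 0.97, 0.13)
label_example <- c(0, 1, 0, 0, 1, 1)

# Turn probabilities into 0/1 classes using a 0.5 cutoff
class_example <- as.numeric(pred_example > 0.5)

# Classification error: share of observations predicted incorrectly
err <- mean(class_example != label_example)
print(err)
```

Applied to the objects in this notebook, the same idea would read `mean(as.numeric(pred > 0.5) != test$label)`.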

**Cross-validation**

In [6]:
# Model with 5-fold cross-validation (to check that the model is not overfitting)
cv.res = xgb.cv(data = train$data, nfold = 5, label = train$label, nrounds = 2, objective = "binary:logistic", eval_metric = "auc")

[1]	train-auc:0.998856+0.000462	test-auc:0.998425+0.001363 
[2]	train-auc:0.999257+0.000461	test-auc:0.998659+0.001512 


In [7]:
# Inspect the results
cv.res

##### xgb.cv 5-folds
 iter train_auc_mean train_auc_std test_auc_mean test_auc_std
    1      0.9988564  0.0004623317     0.9984248  0.001363004
    2      0.9992570  0.0004608045     0.9986592  0.001511688
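One way to read this table is to compare the train and test means per round and pick the round with the best test score. The sketch below rebuilds the printed table by hand as a plain data frame (`cv_log` is a hypothetical name), so it runs without the fitted object.

```r
# Values copied from the xgb.cv output printed above
cv_log <- data.frame(
  iter           = c(1, 2),
  train_auc_mean = c(0.9988564, 0.9992570),
  test_auc_mean  = c(0.9984248, 0.9986592)
)

# Gap between train and test AUC; a large, growing gap would suggest overfitting
cv_log$gap <- cv_log$train_auc_mean - cv_log$test_auc_mean

# Round with the highest mean test AUC
best_iter <- cv_log$iter[which.max(cv_log$test_auc_mean)]

print(cv_log)
print(best_iter)
```

Depending on the installed xgboost version, the same table should be available as `cv.res$evaluation_log`; that field name is an assumption about the version in use.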

## **Exercise - Higgs Boson Competition**

The data for the exercise come from the Kaggle competition "Higgs Boson Machine Learning Challenge":

https://www.kaggle.com/c/higgs-boson


In [9]:
# Load the training data
dtrain = read.csv("data/training.csv", header = TRUE)
head(dtrain)

| | EventId | DER_mass_MMC | DER_mass_transverse_met_lep | DER_mass_vis | DER_pt_h | DER_deltaeta_jet_jet | DER_mass_jet_jet | DER_prodeta_jet_jet | DER_deltar_tau_lep | DER_pt_tot | ... | PRI_jet_num | PRI_jet_leading_pt | PRI_jet_leading_eta | PRI_jet_leading_phi | PRI_jet_subleading_pt | PRI_jet_subleading_eta | PRI_jet_subleading_phi | PRI_jet_all_pt | Weight | Label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | `<int>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | ... | `<int>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<fct>` |
| 1 | 100000 | 138.47 | 51.655 | 97.827 | 27.98 | 0.91 | 124.711 | 2.666 | 3.064 | 41.928 | ... | 2 | 67.435 | 2.15 | 0.444 | 46.062 | 1.24 | -2.475 | 113.497 | 0.002653311 | s |
| 2 | 100001 | 160.937 | 68.768 | 103.235 | 48.146 | -999.0 | -999.0 | -999.0 | 3.473 | 2.078 | ... | 1 | 46.226 | 0.725 | 1.158 | -999.0 | -999.0 | -999.0 | 46.226 | 2.233584487 | b |
| 3 | 100002 | -999.0 | 162.172 | 125.953 | 35.635 | -999.0 | -999.0 | -999.0 | 3.148 | 9.336 | ... | 1 | 44.251 | 2.053 | -2.028 | -999.0 | -999.0 | -999.0 | 44.251 | 2.347388944 | b |
| 4 | 100003 | 143.905 | 81.417 | 80.943 | 0.414 | -999.0 | -999.0 | -999.0 | 3.31 | 0.414 | ... | 0 | -999.0 | -999.0 | -999.0 | -999.0 | -999.0 | -999.0 | 0.0 | 5.446378212 | b |
| 5 | 100004 | 175.864 | 16.915 | 134.805 | 16.405 | -999.0 | -999.0 | -999.0 | 3.891 | 16.405 | ... | 0 | -999.0 | -999.0 | -999.0 | -999.0 | -999.0 | -999.0 | 0.0 | 6.245332687 | b |
| 6 | 100005 | 89.744 | 13.55 | 59.149 | 116.344 | 2.636 | 284.584 | -0.54 | 1.362 | 61.619 | ... | 3 | 90.547 | -2.412 | -0.653 | 56.165 | 0.224 | 3.106 | 193.66 | 0.083414031 | b |
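The value -999 in the preview is the dataset's marker for a missing measurement. As a quick check, the share of missing entries per column can be counted with a comparison against that sentinel. The sketch below uses a hypothetical data frame `toy` holding two columns from the first four rows shown above, so it runs without the CSV file.

```r
# Two columns from the first four rows of the preview above
toy <- data.frame(
  DER_mass_MMC         = c(138.47, 160.937, -999, 143.905),
  DER_deltaeta_jet_jet = c(0.91, -999, -999, -999)
)

# Comparing a data frame with a scalar gives a logical matrix;
# colMeans() then returns the share of -999 entries in each column
missing_share <- colMeans(toy == -999)
print(missing_share)
```

On the full data, `colMeans(dtrain[2:31] == -999)` would give the same summary for every feature column.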


### **Building the model**

In [10]:
# Convert the target from "s"/"b" to TRUE/FALSE (signal = TRUE)
dtrain[33] = dtrain[33] == "s"

# Extract the labels as a numeric vector (1 = signal, 0 = background)
label = as.numeric(dtrain[[33]])

# Put the 30 feature columns into a numeric matrix
data = as.matrix(dtrain[2:31])

# Rescale the event weights to the size of the test set (a competition-specific adjustment)
testsize = 550000
weight = as.numeric(dtrain[[32]]) * testsize / length(label)
sumwpos = sum(weight * (label == 1.0))
sumwneg = sum(weight * (label == 0.0))

# Build the xgb.DMatrix (missing values are coded as -999 in the data)
xgmat = xgb.DMatrix(data, label = label, weight = weight, missing = -999.0)

# Inspect the object
str(xgmat)

Class 'xgb.DMatrix' <externalptr> 
 - attr(*, ".Dimnames")=List of 2
  ..$ : NULL
  ..$ : chr [1:30] "DER_mass_MMC" "DER_mass_transverse_met_lep" "DER_mass_vis" "DER_pt_h" ...


In [11]:
# Set the parameters
param = list("objective" = "binary:logitraw",
             "scale_pos_weight" = sumwneg/sumwpos,
             "eta" = 0.1,
             "max_depth" = 6,
             # eval_metric appears twice so that both metrics are reported
             "eval_metric" = "auc",
             "eval_metric" = "ams@0.15",
             "silent" = 1,
             "nthread" = 16)

# Train the model
bst = xgboost(params = param, data = xgmat, nrounds = 120)

[1]	train-auc:0.910911	train-ams@0.15:3.708224 
[2]	train-auc:0.917686	train-ams@0.15:3.965037 
[3]	train-auc:0.920749	train-ams@0.15:4.275046 
[4]	train-auc:0.922946	train-ams@0.15:4.344920 
[5]	train-auc:0.924997	train-ams@0.15:4.408474 
[6]	train-auc:0.927125	train-ams@0.15:4.493840 
[7]	train-auc:0.928785	train-ams@0.15:4.646137 
[8]	train-auc:0.929895	train-ams@0.15:4.642409 
[9]	train-auc:0.931058	train-ams@0.15:4.730632 
[10]	train-auc:0.932234	train-ams@0.15:4.794474 
[11]	train-auc:0.933023	train-ams@0.15:4.858200 
[12]	train-auc:0.933709	train-ams@0.15:4.949893 
[13]	train-auc:0.934582	train-ams@0.15:4.981353 
[14]	train-auc:0.935454	train-ams@0.15:5.032665 
[15]	train-auc:0.935974	train-ams@0.15:5.077259 
[16]	train-auc:0.936736	train-ams@0.15:5.120542 
[17]	train-auc:0.937236	train-ams@0.15:5.180023 
[18]	train-auc:0.937664	train-ams@0.15:5.201569 
[19]	train-auc:0.938031	train-ams@0.15:5.231637 
[20]	train-auc:0.938618	train-ams@0.15:5.291092 
[21]	train-auc:0.939102	train
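The `ams@0.15` column in the log is the competition's approximate median significance (AMS), evaluated with the top 15% of events by score treated as signal. Below is a minimal sketch, assuming the challenge's published definition with regularization term `b_reg = 10`. The helper `ams()` and the toy vectors are hypothetical, and the toy example uses a 25% cutoff so that the eight events yield a whole number of selected events.

```r
# AMS = sqrt(2 * ((s + b + b_reg) * log(1 + s / (b + b_reg)) - s))
# s: total weight of true signal events selected as signal
# b: total weight of background events selected as signal
ams <- function(s, b, b_reg = 10) {
  sqrt(2 * ((s + b + b_reg) * log(1 + s / (b + b_reg)) - s))
}

# Toy scores, labels (1 = signal) and weights (hypothetical values)
score_toy  <- c(0.9, 0.8, 0.7, 0.4, 0.3, 0.2, 0.1, 0.05)
label_toy  <- c(1, 1, 0, 1, 0, 0, 0, 0)
weight_toy <- c(1, 1, 5, 1, 5, 5, 5, 5)

# Select the top 25% of events by score as predicted signal
n_top    <- as.integer(0.25 * length(score_toy))
selected <- rank(-score_toy, ties.method = "first") <= n_top

s_toy <- sum(weight_toy[selected & label_toy == 1])
b_toy <- sum(weight_toy[selected & label_toy == 0])

print(ams(s_toy, b_toy))
```

A higher AMS means the selected region contains more signal weight relative to the background weight it lets through.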

### **Applying the model**

In [12]:
# Load the test data
dtest = read.csv("data/test.csv", header = TRUE)
data = as.matrix(dtest[2:31])

# Build the xgb.DMatrix the algorithm can read
xgmat = xgb.DMatrix(data, missing = -999.0)

# Apply the model to the test data
ypred = predict(bst, xgmat)

# Format the output table: rank events by score and label the top 15% as signal ("s")
idx = dtest[[1]]
rorder = rank(ypred, ties.method = "first")
threshold = 0.15
ntop = length(rorder) - as.integer(threshold * length(rorder))
plabel = ifelse(rorder > ntop, "s", "b")
outdata = list("EventId" = idx,
               "RankOrder" = rorder,
               "Class" = plabel)

# Save the results as CSV
write.csv(outdata, file = "./data/submission.csv", quote = FALSE, row.names = FALSE)

# Inspect the output
dados = as.data.frame(outdata)
head(dados)

| | EventId | RankOrder | Class |
|---|---|---|---|
| | `<int>` | `<int>` | `<fct>` |
| 1 | 350000 | 16705 | b |
| 2 | 350001 | 216737 | b |
| 3 | 350002 | 348612 | b |
| 4 | 350003 | 486482 | s |
| 5 | 350004 | 134241 | b |
| 6 | 350005 | 184578 | b |
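The rank-based labelling used for the submission can be checked on a small example. The ten scores below are hypothetical raw scores (the "binary:logitraw" objective returns untransformed scores, so negative values are possible), and the sketch uses a 20% cutoff instead of 15% to keep the arithmetic simple.

```r
# Hypothetical raw scores for ten events
score_toy     <- c(0.2, 1.5, -0.3, 0.9, -1.1, 0.4, 2.2, -0.7, 0.1, 0.6)
threshold_toy <- 0.2

# Rank 1 = lowest score, rank 10 = highest score
rorder_toy <- rank(score_toy, ties.method = "first")

# Events ranked above ntop_toy form the top 20% and are labelled as signal
ntop_toy   <- length(rorder_toy) - as.integer(threshold_toy * length(rorder_toy))
plabel_toy <- ifelse(rorder_toy > ntop_toy, "s", "b")

# The logistic transform is monotonic, so ranking probabilities
# gives the same order as ranking the raw scores
same_order <- identical(rank(plogis(score_toy), ties.method = "first"), rorder_toy)

print(data.frame(score_toy, rorder_toy, plabel_toy))
print(same_order)
```

Because only the ordering matters here, the labels would be the same whether the model returned raw scores or probabilities.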


