This kernel provides additional, synthetically generated observations. You may not want to use all of them.

## Introduction

This approach is called the “Synthetic Minority Oversampling Technique” (SMOTE). Synthetic observations are generated on the line from one minority class observation $\boldsymbol{x}_0$ to another $\boldsymbol{x}_1$ [cf. Chawla et al. 2002].

$$ \boldsymbol{x}_{synthetic} = \boldsymbol{x}_0 + (\boldsymbol{x}_1 - \boldsymbol{x}_0)\,\alpha $$

where α is a uniformly distributed random number between 0 and 1. Hence, the synthetic observations are solely related to the minority class. Borderline SMOTE (BLSMOTE) generates observations only in the border region near the majority class, because the profile of the minority class blurs there [cf. Han/Wang/Mao 2005]. Density-based SMOTE (DBSMOTE) first detects agglomerations of the minority class by placing spheres around each minority class observation. Where spheres overlap, a cluster is detected. Synthetic observations are then generated on the lines from the centre of this cluster to the original minority class observations within the cluster [cf. Ester et al. 1996].

Extract from [Classifying Changes in the Consolidation Perimeter in the German Group Accounts Statistics Using Statistical Learning](https://poseidon01.ssrn.com/delivery.php?ID=681105073027021002020087096118101029122047004088035085027090111031068114102116114077126045101012021097047092068101112115113100114008094039021093069020012100126092031060078024021125004103113000073114104065031086126079065093100029071096069064108081099115&EXT=pdf)
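
The interpolation formula above can be illustrated with a minimal R sketch. The helper name `smote_point` and the toy vectors are hypothetical and not part of `smotefamily`; α is assumed to be drawn uniformly from [0, 1].

```r
# Hypothetical helper: one synthetic point on the segment from x0 to x1.
smote_point <- function(x0, x1, alpha = runif(1)) {
  stopifnot(length(x0) == length(x1), alpha >= 0, alpha <= 1)
  x0 + (x1 - x0) * alpha
}

set.seed(42)                      # only for a reproducible illustration
x0 <- c(1, 2)                     # toy minority class observation
x1 <- c(3, 6)                     # a second toy minority class observation

smote_point(x0, x1, alpha = 0)    # equals x0
smote_point(x0, x1, alpha = 0.5)  # midpoint between x0 and x1
smote_point(x0, x1)               # random point on the segment
```

Following the DBSMOTE description above, the same interpolation would start from the cluster centre instead of a second observation; this sketch does not reproduce the cluster detection itself.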

### *Note: Add these observations only to the training data set.*
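
As a minimal sketch of this note, the synthetic rows can be appended to the training fold only, while the validation fold stays purely original. The toy data frames, the 8/2 split, and the 50% subsample of synthetic rows are assumptions for illustration; in practice the synthetic rows would be read from the files this kernel writes.

```r
set.seed(1)

# Toy stand-ins for the original and the synthetic observations (assumed shapes).
original  <- data.frame(sig_id = paste0("id_", 1:10), g1 = rnorm(10), synth = FALSE)
synthetic <- data.frame(sig_id = paste("smote", 1:4), g1 = rnorm(4), synth = TRUE)

# Split the ORIGINAL data first, so no synthetic row can reach the validation fold.
train_idx  <- sample(nrow(original), size = 8)
train      <- original[train_idx, ]
validation <- original[-train_idx, ]

# Optionally keep only a subsample of the synthetic rows (here 2 of 4, an arbitrary choice).
keep      <- sample(nrow(synthetic), size = 2)
train_aug <- rbind(train, synthetic[keep, ])

nrow(train_aug)          # 8 original + 2 synthetic rows
any(validation$synth)    # FALSE: validation contains original rows only
```

Synthetic points interpolated from rows that later land in a validation fold can still leak information, so generating them separately per training fold is the more cautious variant.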

In [1]:
DBSMOTEsize = 1

In [2]:
library(data.table)
library(dplyr)   # loaded explicitly: %>% and select() are used below
library(plotly)
library(htmlwidgets)
library(IRdisplay)
library(DT)
install.packages("smotefamily")
library(smotefamily)
library(Rtsne)

install.packages("randomcoloR")
library(randomcoloR)

dir.create(file.path("charts/"), showWarnings = FALSE)

invisible(gc())

options(warn=-1)

In [3]:
X <- fread("../input/lish-moa/train_features.csv")
Y <- fread("../input/lish-moa/train_targets_scored.csv", data.table = F)

In [4]:
factorVars = c("cp_time", "cp_dose", "cp_type")

X[, (factorVars) :=
    lapply(.SD, function(i){factor(i)}),
    .SDcols = factorVars]

In [5]:
#Y <- Y[X$cp_type == "trt_cp",] 
for(i in 2:207){
    Y[,i] <- factor(Y[,i])
}

In [6]:
#X <- X %>% 
#    filter(cp_type == "trt_cp") %>% 
#        select(-cp_type)

In [7]:
Spool <- X %>% 
        select(-sig_id) %>% 
            data.frame()

time <- model.matrix(~cp_time-1,data = Spool)
Spool <- cbind(Spool, time)
Spool <- Spool %>% 
    select(-cp_time, -cp_time72)


Spool$cp_dose <- as.numeric(Spool$cp_dose)
Spool$cp_type <- as.numeric(Spool$cp_type)

In [8]:
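# Preallocate the target matrix (206 targets; 90000 rows is assumed to be large enough)
DBSMOTE_Y <- matrix(0, ncol = 206, nrow = 90000)
# Template row so that rbind() keeps the column layout; removed again after the loop
DBSMOTE_X <- Spool[1,] 

k = 1  # next free row in DBSMOTE_Y

for(i in 2:207){
    
    Yi <- Y[,i]
    
    # Number of positive observations for target i (factor codes are 1/2, hence the -1)
    controlSum <- sum(as.numeric(Y[,i])-1)
    
    # Skip targets with too few positive observations
    if(controlSum > 3){
    
        Minority <- smotefamily::DBSMOTE(X = Spool, target = Yi, dupSize = DBSMOTEsize)
        Minority <- Minority$syn_data

        # Convert the class label to numeric so that the target matrix stays numeric
        DBSMOTE_Y[k:(nrow(Minority)+(k-1)),(i-1)] <- as.numeric(as.character(Minority$class))

        DBSMOTE_X <- rbind(DBSMOTE_X, Minority %>% select(-class))
    
        k = k + nrow(Minority)
        
    }  
        
    print(i)
}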

DBSMOTE_X <- DBSMOTE_X[-1,]

DBSMOTE_X$cp_time24 <- round(DBSMOTE_X$cp_time24)

DBSMOTE_X$cp_time48 <- round(DBSMOTE_X$cp_time48)


# Decode the rounded dummy columns back to cp_time (72 is the dropped reference level)
DBSMOTE_X$cp_time <- ifelse(DBSMOTE_X$cp_time24 == 1, 24,
                            ifelse(DBSMOTE_X$cp_time48 == 1, 48, 72))

DBSMOTE_X <- DBSMOTE_X %>% 
    select(-cp_time48, -cp_time24)

DBSMOTE_X$cp_dose <- round(DBSMOTE_X$cp_dose)
DBSMOTE_X$cp_dose <- ifelse(DBSMOTE_X$cp_dose == 1, "D1", "D2")

DBSMOTE_X$cp_type <- round(DBSMOTE_X$cp_type)
DBSMOTE_X$cp_type <- ifelse(DBSMOTE_X$cp_type == 1, "ctl_vehicle", "trt_cp")

colnames(DBSMOTE_X) <- gsub("\\.", "-", colnames(DBSMOTE_X))

DBSMOTE_X$sig_id <- paste("smote", seq(1,nrow(DBSMOTE_X), by = 1))

DBSMOTE_X <- DBSMOTE_X[,c(ncol(DBSMOTE_X), 1, (ncol(DBSMOTE_X)-1), 2:(ncol(DBSMOTE_X)-2))]

# Keep only the rows that were actually filled (k - 1 synthetic observations in total)
DBSMOTE_Y <- DBSMOTE_Y[seq_len(k - 1), , drop = FALSE]

colnames(DBSMOTE_Y) <- colnames(Y[,-1])

DBSMOTE_Y <- as.data.frame(DBSMOTE_Y)

DBSMOTE_Y$sig_id <- paste("smote", seq(1,nrow(DBSMOTE_Y), by = 1))

DBSMOTE_Y <- DBSMOTE_Y[,c(ncol(DBSMOTE_Y),1:(ncol(DBSMOTE_Y)-1))]

fwrite(DBSMOTE_Y, "targetSynth.csv")
fwrite(DBSMOTE_X, "trainSynth.csv")
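
The cells above turn `cp_time` into dummy columns before oversampling and round them back afterwards. A minimal sketch of that round trip on toy data follows; the toy factor and the fractional row are hypothetical and only illustrate why rounding is needed after interpolation.

```r
# Toy factor standing in for cp_time (levels 24, 48, 72).
toy <- data.frame(cp_time = factor(c(24, 48, 72, 48)))

# One-hot encode without intercept, then drop the 72 column as reference level.
dummies <- model.matrix(~ cp_time - 1, data = toy)
dummies <- dummies[, c("cp_time24", "cp_time48"), drop = FALSE]

# An interpolated (synthetic) row may carry fractional dummy values.
synthetic <- rbind(dummies, c(0.8, 0.2))   # hypothetical interpolated row

# Round back to 0/1 and decode; rows with neither flag set fall back to 72.
r24 <- round(synthetic[, "cp_time24"])
r48 <- round(synthetic[, "cp_time48"])
decoded <- ifelse(r24 == 1, 24, ifelse(r48 == 1, 48, 72))
decoded
```

The same idea applies to `cp_dose` and `cp_type`, which are stored as numeric level codes and rounded back to the nearest level.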


## t-SNE

Next, we check whether the synthetic observations resemble the original ones by embedding both in a three-dimensional t-SNE map.

In [9]:
factorVars = c("cp_type", "cp_time", "cp_dose")
DBSMOTE_X <- as.data.table(DBSMOTE_X)
DBSMOTE_X[, (factorVars) :=
          lapply(.SD, function(i){factor(i)}),
          .SDcols = factorVars]

DBSMOTE_X$synth = TRUE
X$synth = FALSE

In [10]:
Xx <- rbind(X, DBSMOTE_X)

In [11]:
tSNE <- Rtsne(Xx %>% select(-sig_id, -cp_type, -cp_time, -cp_dose, -synth), 
              dims = 3, perplexity = 20, 
              theta = 0.1,
              verbose = FALSE, 
              max_iter = 1200)
tSNE <- as.data.frame(tSNE$Y)
tSNE$cp_type <- Xx$cp_type
tSNE$cp_time <- Xx$cp_time
tSNE$cp_dose <- Xx$cp_dose
tSNE$synth <- Xx$synth

In [12]:
axis <- list(
  title = "",
  zeroline = FALSE,
  showline = FALSE,
  showticklabels = FALSE,
  showgrid = TRUE
)

In [13]:
#colMoA <- randomcoloR::randomColor(length(unique(tSNE$synth)))
colMoA <- c("#e74c3c", "#52be80")

Plot3DMoA1 <- plot_ly(tSNE, x=~V1, 
                  y=~V2, 
                  z=~V3, 
                  type="scatter3d", 
                  mode="markers", 
                  color=~synth, 
                  opacity = 1, 
                  colors = colMoA,
                  marker = list(size=8, symbol = "circle", line = list(color = 'grey', width = .01)),
                  text = ~paste('<br>cp_type:', cp_type, '<br>cp_dose:', cp_dose, '<br>cp_time:', cp_time)
                  ) %>% 

layout(title = "t-SNE colored by Synthetic or Not", 
       legend=list(title=list(text='<b>Synthetic?</b>')),
       scene = list(camera = list(eye = list(x = -.3, y = 1.2, z = 1.2)),
                    xaxis = axis,
                    yaxis = axis,
                    zaxis = axis)
      )

path <-"charts/Plot3DMoA1.html"
saveWidget(Plot3DMoA1, file.path(normalizePath(dirname(path)),basename(path)))
display_html('<iframe src="charts/Plot3DMoA1.html" align="center" width="100%" height="800" frameBorder="0"></iframe>')

As the plot shows, the synthetic observations are not visibly separated from the original ones in the t-SNE embedding.

Sources:

Chawla, N.V. et al. (2002): SMOTE: Synthetic Minority Over-Sampling Technique. In: Journal of Artificial Intelligence Research, Vol. 16, AAAI Press, New York, pp. 321-357.

Ester, M./Kriegel, H./Sander, J./Xu, X. (1996): A Density based Algorithm for discovering Clusters in Large Spatial Databases with Noise. In: Conference on Knowledge Discovery and Data Mining 1996, AAAI, pp. 226-231.

Han, H./Wang, W./Mao, B. (2005): Borderline-SMOTE. A New Over Sampling Method in Unbalanced Data Set Learning. In: International Conference on Intelligent Computing 2005, Part 1, Springer Verlag Berlin Heidelberg, pp. 878-887.
