## Sine - Abrupt

### Note

#### Original definition of SINE dataset 
- SINE1: Abrupt concept drift, noise-free examples. The dataset has two relevant attributes. Each attribute has values uniformly distributed in [0, 1]. In the first context all points below the curve y = sin(x) are classified as positive. After the context change the classification is reversed. [1, 2]

To validate our approach, we need drift in the data itself (virtual drift), that is, a change in the distribution of the features. We therefore modify the original dataset from the literature, but only as far as absolutely necessary: the base data distribution between the drifts, the selection of relevant features, and the classification function are adopted from the original.
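
The difference between a label-only change and a virtual drift can be illustrated with a small sketch. The seed, sample size, and variable names are assumptions made for illustration and are not part of the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)  # assumed seed, illustration only
n = 10_000

# Before the drift: features as in the original SINE1 definition.
x1_before = rng.uniform(0, 1, n)
x2_before = rng.uniform(0, 1, n)
y_before = (x2_before < np.sin(x1_before)).astype(np.int8)

# Label-only change: same feature distribution, reversed rule.
x1_real = rng.uniform(0, 1, n)
x2_real = rng.uniform(0, 1, n)
y_real = (x2_real > np.sin(x1_real)).astype(np.int8)

# Virtual drift added: features now come from U(-1, 1), rule reversed as well.
x1_virtual = rng.uniform(-1, 1, n)
x2_virtual = rng.uniform(-1, 1, n)
y_virtual = (x2_virtual > np.sin(x1_virtual)).astype(np.int8)

# The feature mean only moves when the input distribution itself changes.
print(f"mean x1 before drift:      {x1_before.mean():.2f}")
print(f"mean x1 label-only change: {x1_real.mean():.2f}")
print(f"mean x1 virtual drift:     {x1_virtual.mean():.2f}")
```

With a label-only change, a detector that looks only at the features sees nothing unusual. Resampling the features from a different range is what makes the drift visible in the data itself.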

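#### Manually induced drifts
- Drift1: x1 and x2 are drawn from np.random.uniform(-1,1), and the classification function is reversed (x2 > sin(x1) is positive)
- Drift2: x1 and x2 are drawn from np.random.uniform(-0.5,0.5), and the classification function is reversed relative to Drift1, which restores the original rule (x2 < sin(x1) is positive)
- Drift3: x1 and x2 are drawn from np.random.uniform(-1,0), and the classification function is reversed again (x2 > sin(x1) is positive)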
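
All three drifts follow the same pattern: resample x1 and x2 from a new uniform range and relabel them against the sine curve. A minimal sketch of that pattern is given here, assuming a hypothetical helper `make_segment` and NumPy's `default_rng` generator instead of the notebook's `np.random.seed`, so its numbers will not match the notebook's.

```python
import numpy as np

def make_segment(rng, low, high, n, reverse):
    """Hypothetical helper: sample x1, x2 ~ U(low, high) and label them against sin."""
    x1 = rng.uniform(low, high, n)
    x2 = rng.uniform(low, high, n)
    if reverse:
        positive = x2 > np.sin(x1)  # reversed rule: above the curve is positive
    else:
        positive = x2 < np.sin(x1)  # original rule: below the curve is positive
    return x1, x2, positive.astype(np.int8)

rng = np.random.default_rng(10)  # assumption: Generator API, not np.random.seed
# Parameters of Drift1: range (-1, 1), reversed rule, 2500 instances.
x1, x2, y = make_segment(rng, -1, 1, 2500, reverse=True)
print(x1.shape, y.dtype, y.mean())
```

Drift2 and Drift3 would then correspond to `make_segment(rng, -0.5, 0.5, 2500, reverse=False)` and `make_segment(rng, -1, 0, 2500, reverse=True)`.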

#### Validation set
We do not want to determine the optimal parameters for our approach on the same data that we test on, since this would risk overfitting. We therefore create a separate validation set whose data distribution follows the same rules as the test set. One randomly selected drift from the test set (Drift3 in the code below) is reproduced in the validation set and used to determine the parameters for our detector.
- size of data set: 20000
- train on 5%: 1000
- validate on 10%: 2000
- test on 85%: 17000

The validation set and the test set contain the same 1000 instances for the initial training step.
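
The split sizes listed above follow directly from the stated fractions of the 20000 instances. A short sketch of the arithmetic, with illustrative variable names:

```python
n_total = 20_000
fractions = {"train": 0.05, "validate": 0.10, "test": 0.85}

# Convert each fraction into an absolute instance count.
sizes = {name: round(n_total * frac) for name, frac in fractions.items()}
print(sizes)  # {'train': 1000, 'validate': 2000, 'test': 17000}
```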


[1]: Gama, J., Medas, P., Castillo, G., & Rodrigues, P. (2004). Learning with drift detection. Lecture Notes in Computer Science, 3171, 286–295. https://doi.org/10.1007/978-3-540-28645-5_29

[2]: Kubat, M., & Widmer, G. (1995). Adapting to drift in continuous domains (Extended abstract). Lecture Notes in Computer Science, 912, 307–310. https://doi.org/10.1007/3-540-59286-5_74

In [1]:
import numpy as np
import pandas as pd
from sklearn import preprocessing
np.random.seed(10)

In [2]:
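# Base stream: four features uniform in [0, 1]; only x1 and x2 determine the label
x1 = np.random.uniform(0, 1, 20000)
x2 = np.random.uniform(0, 1, 20000)
x3 = np.random.uniform(0, 1, 20000)
x4 = np.random.uniform(0, 1, 20000)
# Original SINE1 rule: points below the curve x2 = sin(x1) are positive (1)
y = np.where(x2 < np.sin(x1), 1, 0).astype(np.int8)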

In [3]:
# Drift 1 (rows 5000-7499): features from U(-1, 1), reversed rule (above the curve is positive)
x1[5000:7500] = np.random.uniform(-1, 1, 2500)
x2[5000:7500] = np.random.uniform(-1, 1, 2500)
y[5000:7500] = np.where(x2[5000:7500] > np.sin(x1[5000:7500]), 1, 0)

# Drift 2 (rows 10000-12499): features from U(-0.5, 0.5), original rule (below the curve is positive)
x1[10000:12500] = np.random.uniform(-0.5, 0.5, 2500)
x2[10000:12500] = np.random.uniform(-0.5, 0.5, 2500)
y[10000:12500] = np.where(x2[10000:12500] < np.sin(x1[10000:12500]), 1, 0)

# Drift 3 (rows 15000-17499): features from U(-1, 0), reversed rule (above the curve is positive)
x1[15000:17500] = np.random.uniform(-1, 0, 2500)
x2[15000:17500] = np.random.uniform(-1, 0, 2500)
y[15000:17500] = np.where(x2[15000:17500] > np.sin(x1[15000:17500]), 1, 0)

In [4]:
data = pd.DataFrame([x1,x2,x3,x4,y]).transpose()
data.columns = ['x1','x2','x3','x4', 'label']
data['label'] = data['label'].astype('int32')

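# The first 1000 rows are used for the initial training step
df_train = data.iloc[:1000, :]
# The test file holds the full stream, including those 1000 training rows
df_test = data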

In [5]:
# Create the validation set: 2000 instances generated by the same rules as the test stream
x1 = np.random.uniform(0, 1, 2000)
x2 = np.random.uniform(0, 1, 2000)
x3 = np.random.uniform(0, 1, 2000)
x4 = np.random.uniform(0, 1, 2000)
y = np.where(x2 < np.sin(x1), 1, 0).astype(np.int8)

# Drift (previously Drift 3), rows 500-1499: features from U(-1, 0), reversed rule
x1[500:1500] = np.random.uniform(-1, 0, 1000)
x2[500:1500] = np.random.uniform(-1, 0, 1000)
y[500:1500] = np.where(x2[500:1500] > np.sin(x1[500:1500]), 1, 0)

df_val = pd.DataFrame([x1,x2,x3,x4,y]).transpose()
df_val.columns = ['x1','x2','x3','x4', 'label']
df_val['label'] = df_val['label'].astype('int32')

In [6]:
# DataFrame.append was removed in pandas 2.0; pd.concat keeps the same row order and indices
df_train_and_validate = pd.concat([df_train, df_val])

In [7]:
df_train_and_validate.to_csv('../../Experiment/Data_prep/own_synthetic/sine_train_val.csv', index=False)
df_test.to_csv('../../Experiment/Data_prep/own_synthetic/sine_train_test.csv', index=False)
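
A consumer of `sine_train_val.csv` has to separate the training block from the validation block again. The following sketch shows one way to do so, assuming the row order produced above (1000 training rows followed by 2000 validation rows). It uses an in-memory stand-in frame with random content instead of reading the file, and the constant names are hypothetical.

```python
import numpy as np
import pandas as pd

N_TRAIN = 1000  # initial training instances
N_VAL = 2000    # validation instances

# Stand-in for the frame read back from sine_train_val.csv:
# same columns and row count, random content.
rng = np.random.default_rng(0)
df_train_val = pd.DataFrame(
    rng.uniform(0, 1, (N_TRAIN + N_VAL, 4)),
    columns=['x1', 'x2', 'x3', 'x4'],
)
df_train_val['label'] = (df_train_val['x2'] < np.sin(df_train_val['x1'])).astype('int32')

# Rows are stored in order: training block first, validation block after it.
df_train_part = df_train_val.iloc[:N_TRAIN]
df_val_part = df_train_val.iloc[N_TRAIN:].reset_index(drop=True)
print(len(df_train_part), len(df_val_part))  # 1000 2000
```

Splitting by position rather than by index is deliberate: the index is not written to the CSV (`index=False`), so only the row order carries the train/validation boundary.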