## Sine - Incremental

### Note

#### Original definition of SINE dataset
- SINE1: Abrupt concept drift, noise-free examples. The dataset has two relevant attributes. Each attribute has values uniformly distributed in [0, 1]. In the first context all points below the curve y = sin(x) are classified as positive. After the context change the classification is reversed. [1, 2]
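
As a point of reference, the original SINE1 definition can be sketched as follows. This is a minimal illustration, not the code from [1, 2]: the function name `make_sine1`, the drift position `drift_at`, and the seed are assumptions made for this example.

```python
import numpy as np

def make_sine1(n, drift_at, seed=0):
    """Sketch of SINE1: two uniform attributes, labels reversed after drift_at.

    Hypothetical helper; names and defaults are not taken from the literature.
    """
    rng = np.random.default_rng(seed)
    x1 = rng.uniform(0, 1, n)
    x2 = rng.uniform(0, 1, n)
    # First context: points below the curve x2 = sin(x1) are positive.
    y = (x2 < np.sin(x1)).astype(np.int8)
    # After the context change the classification is reversed (abrupt drift).
    y[drift_at:] = 1 - y[drift_at:]
    return x1, x2, y

x1_demo, x2_demo, y_demo = make_sine1(n=10_000, drift_at=5_000)
```

Only the labels change at `drift_at`; the attribute distribution stays the same throughout.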

To validate our approach, we need drift in the data distribution itself (virtual drift). We therefore modify the original data from the literature, but only as far as strictly necessary: the base data distribution between the drifts, the selection of relevant features, and the classification function are adopted unchanged.

#### Manually induced drifts
- Drift1[5000:15000]: the value range of the relevant attributes widens from (0, 1) to (0, 11); the upper bound grows by 1 every 1000 instances, 10 times
- Drift2[20000:25000]: the value range of the relevant attributes widens from (0, 1) to (-20, 1); the lower bound drops by 2 every 500 instances, 10 times

Unlike in the abrupt-drift version of this data set, the classification function is retained and not reversed.
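
The two drift windows listed above can be written out as an explicit schedule. The helper `drift_schedule` and its parameters are hypothetical names introduced for illustration; the sketch only restates the window boundaries and value ranges and draws no data.

```python
def drift_schedule(start, step_len, n_steps, low_step=0.0, high_step=0.0):
    """Return (begin, end, low, high) for each step of an incremental drift.

    Hypothetical helper: the base range is (0, 1); at step i the bounds
    are shifted by i * low_step and i * high_step respectively.
    """
    schedule = []
    for i in range(1, n_steps + 1):
        begin = start + (i - 1) * step_len
        schedule.append((begin, begin + step_len,
                         0.0 + i * low_step, 1.0 + i * high_step))
    return schedule

# Drift 1: upper bound grows by 1 every 1000 instances, 10 times.
drift1 = drift_schedule(start=5000, step_len=1000, n_steps=10, high_step=1.0)
# Drift 2: lower bound drops by 2 every 500 instances, 10 times.
drift2 = drift_schedule(start=20000, step_len=500, n_steps=10, low_step=-2.0)
```

Each tuple gives the half-open index window and the uniform range used inside it, so the first and last entries mark where each drift begins and ends.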

#### Validation set
We do not want to tune the parameters of our approach on the same data set we test on, since that would risk overfitting. For this reason we create a separate validation set whose data distribution follows the same rules as the test set. To determine the parameters of our detector, the validation set contains a modified version of the second drift from the test set (opposite direction of shift, shorter drift width).
- size of data set: 30000
- train on 5%: 1500
- validate on 10%: 3000
- test on 85%: 25000

The validation set and the test set contain the same 1500 instances for the initial training step.
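
One way to picture the shared training prefix is the sketch below. It uses placeholder uniform data and hypothetical variable names (`stream`, `val`, `train_val`), so it illustrates the layout of the two sets rather than the exact notebook pipeline.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Placeholder data standing in for the full stream and the validation part.
stream = pd.DataFrame(rng.uniform(0, 1, (30000, 2)), columns=["x1", "x2"])
val = pd.DataFrame(rng.uniform(0, 1, (3000, 2)), columns=["x1", "x2"])

n_train = 1500
train = stream.iloc[:n_train]

# Training prefix followed by the validation instances.
train_val = pd.concat([train, val], ignore_index=True)
# The test stream starts with the same training prefix.
test = stream

shared_prefix = train_val.iloc[:n_train].equals(test.iloc[:n_train])
```

With this layout a detector is initialised on identical instances in both runs, so differences between validation and test results cannot come from the initial training data.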


[1]: Gama, J., Medas, P., Castillo, G., & Rodrigues, P. (2004). Learning with drift detection. Lecture Notes in Computer Science, 3171, 286–295. https://doi.org/10.1007/978-3-540-28645-5_29

[2]: Kubat, M., & Widmer, G. (1995). Adapting to drift in continuous domains (Extended abstract). Lecture Notes in Computer Science, 912, 307–310. https://doi.org/10.1007/3-540-59286-5_74

In [34]:
import numpy as np
import pandas as pd
from sklearn import preprocessing
np.random.seed(10)

In [35]:
np.random.seed(42)
x1 = np.random.uniform(0,1,30000)
x2 = np.random.uniform(0,1,30000)
x3 = np.random.uniform(0,1,30000)
x4 = np.random.uniform(0,1,30000)

In [36]:
# Drift 1: long but less strong
j = 0
for i in range(1,11):
    #print(0,1+i, ':   ', 5000+j,':',6000+j)
    x1[5000+j:6000+j]=np.random.uniform(0,1+i,1000)
    x2[5000+j:6000+j]=np.random.uniform(0,1+i,1000)
    j += 1000

print()
# Drift 2: stronger but shorter
j=0
for i in range(1,11):
    print(0-(i*2), ':   ', 20000+j,':',20500+j)
    x1[20000+j:20500+j]=np.random.uniform(0-(i*2),1,500)
    x2[20000+j:20500+j]=np.random.uniform(0-(i*2),1,500)
    j += 500
    
y= np.where(x2 > np.sin(x1), np.ones(30000, dtype=np.int8), np.zeros(30000, dtype=np.int8))


-2 :    20000 : 20500
-4 :    20500 : 21000
-6 :    21000 : 21500
-8 :    21500 : 22000
-10 :    22000 : 22500
-12 :    22500 : 23000
-14 :    23000 : 23500
-16 :    23500 : 24000
-18 :    24000 : 24500
-20 :    24500 : 25000
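
Because the labelling rule stays fixed while the input range changes, the share of positive labels shifts with the drift. The sketch below estimates that share for the base range and for the final range of each drift, using the same rule as the cell above (`x2 > np.sin(x1)`). The helper `positive_rate` is hypothetical, and the sample size and seed are arbitrary choices.

```python
import numpy as np

def positive_rate(low, high, n=100_000, seed=0):
    """Estimate the share of label 1 when both attributes are U(low, high).

    Hypothetical helper for illustration; label 1 means x2 > sin(x1).
    """
    rng = np.random.default_rng(seed)
    x1 = rng.uniform(low, high, n)
    x2 = rng.uniform(low, high, n)
    return float(np.mean(x2 > np.sin(x1)))

# Base range, final range of drift 1, final range of drift 2.
ranges = [(0, 1), (0, 11), (-20, 1)]
rates = {r: positive_rate(*r) for r in ranges}
```

In the base range the classes are roughly balanced, while the widened ranges push most points to one side of the curve, so a detector may see the drift in both the inputs and the label frequencies.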


In [37]:
data = pd.DataFrame([x1,x2,x3,x4,y]).transpose()
data.columns = ['x1','x2','x3','x4', 'label']
data['label'] = data['label'].astype('int32')

df_train = data.iloc[:1500,:]
df_test = data

In [38]:
#np.random.seed(50)


# create validation set
x1 = np.random.uniform(0,1,3000)
x2 = np.random.uniform(0,1,3000)
x3 = np.random.uniform(0,1,3000)
x4 = np.random.uniform(0,1,3000)

# stronger but shorter drift (opposite direction to drift 2 of the test set)
j=0
for i in range(1,4):
    #print(0,1+i*2, ':   ', 500+j,':',1000+j)
    x1[500+j:1000+j]=np.random.uniform(0,1+i*2,500)
    x2[500+j:1000+j]=np.random.uniform(0,1+i*2,500)
    j += 500
    
y= np.where(x2 > np.sin(x1), np.ones(3000, dtype=np.int8), np.zeros(3000, dtype=np.int8))

df_val = pd.DataFrame([x1,x2,x3,x4,y]).transpose()
df_val.columns = ['x1','x2','x3','x4', 'label']
df_val['label'] = df_val['label'].astype('int32')

In [39]:
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
df_train_and_validate = pd.concat([df_train, df_val])

In [40]:
df_train_and_validate.to_csv('../../Experiment/Data_prep/own_synthetic/sine_incr_train_val.csv', index=False)
df_test.to_csv('../../Experiment/Data_prep/own_synthetic/sine_incr_train_test.csv', index=False)