## Mixed - Abrupt

### Note

#### Original definition of the MIXED dataset
- MIXED. Abrupt concept drift, boolean noise-free examples. Four relevant attributes, two boolean attributes v, w and two numeric attributes from [0, 1]. The examples are classified positive if two of three conditions are satisfied: v, w, y < 0.5 + 0.3 · sin(3πx). After each context change the classification is reversed. [1,2]

To validate our approach we need drift in the data itself (virtual drift), that is, a change in the feature distribution. We therefore modify the original data from the literature, but only as far as absolutely necessary. In general, we adopt the original data distribution between the drifts, the selection of relevant features, and the classification function.
In this case, however, we make one exception: contrary to the description in the literature, we do not reverse the classification function at a concept drift, so that we can also evaluate an example in which only virtual drift exists.
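
The labeling rule from the definition above can be illustrated on a few hand-picked points. This is only a minimal sketch: the function name `mixed_label` is hypothetical, and "two of three conditions" is read here as "at least two", which matches the boolean expression used in the code cells of this notebook.

```python
import numpy as np

def mixed_label(v, w, x, y):
    """Return 1 where at least two of the three MIXED conditions hold, else 0.

    Hypothetical helper for illustration; v and w are boolean arrays,
    x and y are numeric arrays with values in [0, 1].
    """
    c3 = y < 0.5 + 0.3 * np.sin(3 * np.pi * x)
    votes = v.astype(int) + w.astype(int) + c3.astype(int)
    return (votes >= 2).astype(np.int8)

# Four hand-picked examples
v = np.array([True, True, False, False])
w = np.array([True, False, False, True])
x = np.array([0.1, 0.5, 0.1, 0.5])
y = np.array([0.9, 0.1, 0.1, 0.9])

labels = mixed_label(v, w, x, y)
print(labels)  # [1 1 0 0]
```

Because the classification is not reversed here, the same function applies before, during, and after every drift interval.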

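#### Manually induced drifts
In the code below, x and z are the two numeric attributes (z corresponds to y in the original definition, while y holds the label). Both take values in {0, 0.1, ..., 1}; during a drift they are drawn with the following probabilities p:
- Drift 1 [5000:7500]: x and z with p = [0.0125,0.0125,0.0125,0.1,0.7,0.1,0.0125,0.0125,0.0125,0.0125,0.0125]
- Drift 2 [10000:12500]: x and z with p = [0.0125,0.0125,0.0125,0.0125,0.0125,0.0125,0.1,0.7,0.1,0.0125,0.0125]
- Drift 3 [15000:17500]: x and z with p = [0.1,0.0125,0.0125,0.0125,0.0125,0.0125,0.0125,0.0125,0.0125,0.1,0.7]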

Unlike in the paper, the classification function is retained and not reversed.
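
All probability vectors used here share one shape: a peak of 0.7, two neighbours of 0.1, and 0.0125 everywhere else. The following sketch rebuilds them from the peak position alone. The helper name `peaked_probs` and the wrap-around at the ends (needed to reproduce Drift 3) are assumptions made for illustration; the notebook itself writes the vectors out by hand.

```python
import numpy as np

def peaked_probs(peak_index, n=11, peak=0.7, side=0.1, rest=0.0125):
    """Build a probability vector with one peak and two wrapped-around neighbours.

    Hypothetical helper, not part of the original notebook.
    """
    p = np.full(n, rest)
    p[peak_index] = peak
    p[(peak_index - 1) % n] = side
    p[(peak_index + 1) % n] = side
    return p

# Peak positions read off the vectors listed above
drift_peaks = {"initial": 1, "Drift 1": 4, "Drift 2": 7, "Drift 3": 10}
probs = {name: peaked_probs(i) for name, i in drift_peaks.items()}

for name, p in probs.items():
    # Each vector must sum to 1 to be accepted by np.random.choice
    print(name, np.isclose(p.sum(), 1.0))
```

Each drift therefore moves the most likely value of x and z from 0.1 to 0.4, 0.7, and 1.0 respectively.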


#### Validation set
We do not want to tune the parameters of our approach on the same data set that we test on, in order to avoid overfitting. For this reason we create a separate validation set, whose data distribution follows the same rules as the test set. One randomly selected drift from the test set is then carried over into the validation set to determine the parameters of our detector.
- size of data set: 20000
- train on 5%: 1000
- validate on 10%: 2000
- test on 85%: 17000

The validation set and the test set contain the same 1000 instances for the initial training step.
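
Since both files start with the same training block, a consumer of either file has to cut it off before streaming the remaining rows. A minimal sketch of that step, assuming the block occupies the first 1000 rows; the helper `split_initial_training` and the toy frame are hypothetical and only stand in for the real data:

```python
import pandas as pd

N_TRAIN = 1000  # size of the shared initial training block

def split_initial_training(df, n_train=N_TRAIN):
    """Return (training prefix, remaining rows) of a combined frame."""
    return df.iloc[:n_train], df.iloc[n_train:]

# Toy stand-in for a combined file: 1000 training rows followed by 2000 further rows
toy = pd.DataFrame({"x3": range(3000), "label": [0, 1] * 1500})

train_part, stream_part = split_initial_training(toy)
print(len(train_part), len(stream_part))  # 1000 2000
```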


[1]: Gama, J., Medas, P., Castillo, G., & Rodrigues, P. (2004). Learning with drift detection. Lecture Notes in Computer Science, 3171, 286–295. https://doi.org/10.1007/978-3-540-28645-5_29

[2]: Kubat, M., & Widmer, G. (1995). Adapting to drift in continuous domains (Extended abstract). Lecture Notes in Computer Science, 912, 307–310. https://doi.org/10.1007/3-540-59286-5_74

In [28]:
import numpy as np
import pandas as pd
from sklearn import preprocessing
np.random.seed(10)

In [29]:
v = np.random.choice([False,True], size=20000)
w = np.random.choice([False,True], size=20000)
x = np.random.choice([0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1], p=[0.1,0.7,0.1,0.0125,0.0125,0.0125,0.0125,0.0125,0.0125,0.0125,0.0125], size=20000)
z = np.random.choice([0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1], p=[0.1,0.7,0.1,0.0125,0.0125,0.0125,0.0125,0.0125,0.0125,0.0125,0.0125], size=20000)

In [30]:
# Drift 1
x[5000:7500] = np.random.choice([0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1], p=[0.0125,0.0125,0.0125,0.1,0.7,0.1,0.0125,0.0125,0.0125,0.0125,0.0125], size=2500)
z[5000:7500] = np.random.choice([0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1], p=[0.0125,0.0125,0.0125,0.1,0.7,0.1,0.0125,0.0125,0.0125,0.0125,0.0125], size=2500)

# Drift 2
x[10000:12500] = np.random.choice([0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1], p=[0.0125,0.0125,0.0125,0.0125,0.0125,0.0125,0.1,0.7,0.1,0.0125,0.0125], size=2500)
z[10000:12500] = np.random.choice([0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1], p=[0.0125,0.0125,0.0125,0.0125,0.0125,0.0125,0.1,0.7,0.1,0.0125,0.0125], size=2500)

# Drift 3
x[15000:17500] = np.random.choice([0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1], p=[0.1,0.0125,0.0125,0.0125,0.0125,0.0125,0.0125,0.0125,0.0125,0.1,0.7], size=2500)
z[15000:17500] = np.random.choice([0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1], p=[0.1,0.0125,0.0125,0.0125,0.0125,0.0125,0.0125,0.0125,0.0125,0.1,0.7], size=2500)

In [31]:
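# "Two of three conditions" (v, w, z < 0.5 + 0.3*sin(3*pi*x)) holds if both booleans
# are true, or if one of them is true together with the numeric condition.
condition1 = v & w
condition2 = v | w
condition3 = z < 0.5 + 0.3 * np.sin(3*np.pi*x)

# Label 1 if at least two conditions hold, else 0 (never reversed, see note above)
y = np.where(condition1 | (condition2 & condition3), np.ones(20000, dtype=np.int8), np.zeros(20000, dtype=np.int8))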

In [32]:
data = pd.DataFrame([v,w,x,z,y]).transpose()
data.columns = ['x1','x2','x3','x4', 'label']
data['label'] = data['label'].astype('int32')

# Encode x1 and x2
le = preprocessing.LabelEncoder()
data['x1'] = le.fit_transform(data['x1'])
data['x2'] = le.fit_transform(data['x2'])

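# The first 1000 rows form the initial training block;
# the test frame keeps all 20000 rows, training block included.
df_train = data.iloc[:1000,:]
df_test = data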

In [33]:
# create validation set 
v = np.random.choice([False,True], size=2000)
w = np.random.choice([False,True], size=2000)
x = np.random.choice([0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1], p=[0.1,0.7,0.1,0.0125,0.0125,0.0125,0.0125,0.0125,0.0125,0.0125,0.0125], size=2000)
z = np.random.choice([0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1], p=[0.1,0.7,0.1,0.0125,0.0125,0.0125,0.0125,0.0125,0.0125,0.0125,0.0125], size=2000)

# Drift 1 (previously Drift 2)
x[500:1500] = np.random.choice([0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1], p=[0.0125,0.0125,0.0125,0.0125,0.0125,0.0125,0.1,0.7,0.1,0.0125,0.0125], size=1000)
z[500:1500] = np.random.choice([0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1], p=[0.0125,0.0125,0.0125,0.0125,0.0125,0.0125,0.1,0.7,0.1,0.0125,0.0125], size=1000)
condition1 = v & w
condition2 = v | w
condition3 = z < 0.5 + 0.3 * np.sin(3*np.pi*x)
y = np.where(condition1 | (condition2 & condition3),np.ones(2000, dtype=np.int8), np.zeros(2000, dtype=np.int8))

df_val = pd.DataFrame([v,w,x,z,y]).transpose() 
df_val.columns = ['x1','x2','x3','x4', 'label']
df_val['label'] = df_val['label'].astype('int32')

# Encode x1 and x2
le = preprocessing.LabelEncoder()
df_val['x1'] = le.fit_transform(df_val['x1'])
df_val['x2'] = le.fit_transform(df_val['x2'])

In [34]:
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
df_train_and_validate = pd.concat([df_train, df_val])

In [35]:
df_train_and_validate.to_csv('../../Experiment/Data_prep/own_synthetic/mixed_train_val.csv', index=False)
df_test.to_csv('../../Experiment/Data_prep/own_synthetic/mixed_train_test.csv', index=False)
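
As a quick plausibility check that an induced drift is visible in the feature distribution alone, one could compare the most frequent value of a numeric attribute before and inside a drift window. The sketch below does this on a small synthetic stand-in rather than on the saved files; the helper `window_mode`, the window sizes, and the use of `np.random.default_rng` (instead of the seeded global generator above) are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
values = np.arange(11) / 10  # 0.0, 0.1, ..., 1.0

# Base distribution (peak at 0.1) and the Drift 1 distribution (peak at 0.4)
p_before = np.full(11, 0.0125)
p_before[[0, 2]] = 0.1
p_before[1] = 0.7
p_drift = np.full(11, 0.0125)
p_drift[[3, 5]] = 0.1
p_drift[4] = 0.7

# 1000 rows from the base distribution followed by 1000 rows under drift
x3 = rng.choice(values, p=p_before, size=2000)
x3[1000:] = rng.choice(values, p=p_drift, size=1000)

def window_mode(a):
    """Return the most frequent value in a 1-D array."""
    vals, counts = np.unique(a, return_counts=True)
    return vals[np.argmax(counts)]

mode_before = window_mode(x3[:1000])
mode_after = window_mode(x3[1000:])
print(mode_before, mode_after)
```

A shift of the mode between the two windows, with an unchanged labeling rule, is what the term virtual drift refers to in this notebook.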