## Mixed - Incremental

### Note

#### Original definition of MIXED dataset: 
- MIXED. Abrupt concept drift, boolean noise-free examples. Four relevant attributes, two boolean attributes v, w and two numeric attributes from [0, 1]. The examples are classified positive if two of three conditions are satisfied: v, w, y < 0.5 + 0.3 ∗ sin(3πx). After each context change the classification is reversed. [1,2]

To validate our approach we need drift in the data itself (virtual drift). We therefore modify the original data from the literature, but only as far as absolutely necessary. In general we adopt the original data distribution between the drifts, the selection of relevant features, and the classification function.
In this case, however, we make an exception: unlike the literature, we do not change the classification function at a concept drift, so that we can also evaluate an example in which only virtual drift exists. In addition, we induce incremental drift, which the original dataset does not contain.

#### Incremental drifts induced manually:
Here p is the probability vector over the 11 possible values (0, 0.1, ..., 1) of the two numeric attributes.
- Drift 1 [5000:15000]: p is shifted by 1 place every 1000 instances, 10 times
- Drift 2 [20000:25000]: p is shifted by 2 places every 500 instances, 10 times
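
The shifting of p listed above amounts to rotating the probability vector to the right. The following is a minimal sketch of that schedule, not the notebook's own code: the helper name `shift_schedule` is hypothetical, and `np.roll` stands in for the list operations used in the cells further down.

```python
import numpy as np

# Probabilities for the values 0, 0.1, ..., 1 before any drift
BASE_P = [0.1, 0.7, 0.1] + [0.0125] * 8

def shift_schedule(p, places, steps):
    """Return the probability vector used in each drift step (hypothetical helper)."""
    p = np.asarray(p, dtype=float)
    # Step k (1-based) uses p rotated to the right by k * places positions.
    return [np.roll(p, k * places) for k in range(1, steps + 1)]

drift1 = shift_schedule(BASE_P, places=1, steps=10)  # 1000 instances per step
drift2 = shift_schedule(BASE_P, places=2, steps=10)  # 500 instances per step

# Index of the dominant probability (0.7) in each step
print([int(np.argmax(q)) for q in drift1])
print([int(np.argmax(q)) for q in drift2])
```

Because the vector has 11 entries, the rotation wraps around: in the last step of Drift 1 the dominant probability sits on the value 0, one place to the left of where it started.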

Unlike in the paper, the classification function was retained and not reversed.


#### Validation set
We do not want to tune the parameters of our approach on the same data we test on, in order to avoid overfitting. For this reason we create a validation set whose data distribution follows the same rules as the test set. It contains a modified version of the second drift from the test set (different initial p, shorter drift width), which we use to determine the parameters of our detector.
- size of data set: 30000
- train on 5%: 1500
- validate on 10%: 3000
- test on 85%: 25500

The validation set and the test set contain the same 1500 instances for the initial training step.


[1]: Gama, J., Medas, P., Castillo, G., & Rodrigues, P. (2004). Learning with drift detection. Lecture Notes in Computer Science, 3171, 286–295. https://doi.org/10.1007/978-3-540-28645-5_29

[2]: Kubat, M., & Widmer, G. (1995). Adapting to drift in continuous domains (Extended abstract). Lecture Notes in Computer Science, 912, 307–310. https://doi.org/10.1007/3-540-59286-5_74

In [18]:
import numpy as np
import pandas as pd
from sklearn import preprocessing
np.random.seed(10)

In [19]:
v = np.random.choice([False,True], size=30000)
w = np.random.choice([False,True], size=30000)
x = np.random.choice([0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1], p=[0.1,0.7,0.1,0.0125,0.0125,0.0125,0.0125,0.0125,0.0125,0.0125,0.0125], size=30000)
z = np.random.choice([0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1], p=[0.1,0.7,0.1,0.0125,0.0125,0.0125,0.0125,0.0125,0.0125,0.0125,0.0125], size=30000)

In [20]:
# Drift 1: long but less pronounced
j=0
p=[0.1,0.7,0.1,0.0125,0.0125,0.0125,0.0125,0.0125,0.0125,0.0125,0.0125]
for i in range(10):
    element = p.pop()
    p.insert(0, element)
    x[5000+j:6000+j]=np.random.choice([0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1], p=p, size=1000)
    z[5000+j:6000+j]=np.random.choice([0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1], p=p, size=1000)
    j += 1000

# Drift 2: stronger but shorter
j=0
p=[0.1,0.7,0.1,0.0125,0.0125,0.0125,0.0125,0.0125,0.0125,0.0125,0.0125]
for i in range(10):
    element = p[-2:]
    del p[-2:]
    p.insert(0, element[1])
    p.insert(0, element[0])
    
    #print(20000+j,20500+j, p)
    x[20000+j:20500+j]=np.random.choice([0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1], p=p, size=500)
    z[20000+j:20500+j]=np.random.choice([0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1], p=p, size=500)
    
    j += 500
    


condition1 = v & w
condition2 = v | w
condition3 = z < 0.5 + 0.3 * np.sin(3*np.pi*x)

y = np.where(condition1 | (condition2 & condition3),np.ones(30000, dtype=np.int8), np.zeros(30000, dtype=np.int8))
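
The labelling expression above can be checked against the "two of three conditions" rule from the original definition, reading that rule as "at least two of three". This small sketch enumerates all eight combinations of the three conditions; the function names are illustrative and not part of the notebook.

```python
from itertools import product

def label_notebook(c_v, c_w, c_num):
    # Expression used in the notebook: (v AND w) OR ((v OR w) AND c_num)
    return (c_v and c_w) or ((c_v or c_w) and c_num)

def label_majority(c_v, c_w, c_num):
    # Positive when at least two of the three conditions hold
    return (c_v + c_w + c_num) >= 2

# Collect every combination on which the two formulations disagree
mismatches = [c for c in product([False, True], repeat=3)
              if label_notebook(*c) != label_majority(*c)]
print(mismatches)  # [] : both formulations agree on all eight cases
```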




In [21]:
data = pd.DataFrame([v,w,x,z,y]).transpose()
data.columns = ['x1','x2','x3','x4', 'label']
data['label'] = data['label'].astype('int32')

# Encode x1 and x2
le = preprocessing.LabelEncoder()
data['x1'] = le.fit_transform(data['x1'])
data['x2'] = le.fit_transform(data['x2'])

df_train = data.iloc[:1500,:]
df_test = data
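
Building the frame from a list of arrays and transposing it mixes boolean and float values in each intermediate column, so x3 and x4 may end up with dtype object. The `describe()` outputs further down, which list only x1, x2 and label, are consistent with that. A possible alternative is to build the frame from a dict, so that each column keeps its own dtype. The sketch below uses small stand-in arrays rather than the notebook's data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # stand-in data, not the notebook's random stream
n = 8
values = np.linspace(0, 1, 11)  # 0, 0.1, ..., 1

v = rng.choice([False, True], size=n)
w = rng.choice([False, True], size=n)
x = rng.choice(values, size=n)
z = rng.choice(values, size=n)

# Same labelling rule as in the notebook
cond3 = z < 0.5 + 0.3 * np.sin(3 * np.pi * x)
y = ((v & w) | ((v | w) & cond3)).astype("int32")

frame = pd.DataFrame({
    "x1": v.astype("int64"),  # False/True encoded as 0/1
    "x2": w.astype("int64"),
    "x3": x,                  # stays float64
    "x4": z,                  # stays float64
    "label": y,
})
print(frame.dtypes)
```

With numeric dtypes preserved, `describe()` would also report statistics for x3 and x4.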

In [22]:
# create validation set
v = np.random.choice([False,True], size=3000)
w = np.random.choice([False,True], size=3000)
x = np.random.choice([0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1], p=[0.1,0.7,0.1,0.0125,0.0125,0.0125,0.0125,0.0125,0.0125,0.0125,0.0125], size=3000)
z = np.random.choice([0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1], p=[0.1,0.7,0.1,0.0125,0.0125,0.0125,0.0125,0.0125,0.0125,0.0125,0.0125], size=3000)

# Drift: stronger but shorter
j=0
p=[0.0125,0.0125,0.1,0.7,0.1,0.0125,0.0125,0.0125,0.0125,0.0125,0.0125]  # starts from a different p than the test set; the drift is also shorter
for i in range(3):
    element = p[-2:]
    del p[-2:]
    p.insert(0, element[1])
    p.insert(0, element[0])
    #print(p)
    #print(500+j, 1000+j)
    x[500+j:1000+j]=np.random.choice([0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1], p=p, size=500)
    z[500+j:1000+j]=np.random.choice([0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1], p=p, size=500)
    j += 500
    
condition1 = v & w
condition2 = v | w
condition3 = z < 0.5 + 0.3 * np.sin(3*np.pi*x)

y = np.where(condition1 | (condition2 & condition3),np.ones(3000, dtype=np.int8), np.zeros(3000, dtype=np.int8))

In [23]:
df_val = pd.DataFrame([v,w,x,z,y]).transpose()
df_val.columns = ['x1','x2','x3','x4', 'label']
df_val['label'] = df_val['label'].astype('int32')

# Encode x1 and x2
le = preprocessing.LabelEncoder()
df_val['x1'] = le.fit_transform(df_val['x1'])
df_val['x2'] = le.fit_transform(df_val['x2'])

In [24]:
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
df_train_and_validate = pd.concat([df_train, df_val])

In [25]:
df_train_and_validate.describe()

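|       | x1       | x2       | label    |
|-------|----------|----------|----------|
| count | 4500.0   | 4500.0   | 4500.0   |
| mean  | 0.491778 | 0.505333 | 0.591333 |
| std   | 0.499988 | 0.500027 | 0.491642 |
| min   | 0.0      | 0.0      | 0.0      |
| 25%   | 0.0      | 0.0      | 0.0      |
| 50%   | 0.0      | 1.0      | 1.0      |
| 75%   | 1.0      | 1.0      | 1.0      |
| max   | 1.0      | 1.0      | 1.0      |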


In [26]:
df_test.describe()

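|       | x1       | x2       | label    |
|-------|----------|----------|----------|
| count | 30000.0  | 30000.0  | 30000.0  |
| mean  | 0.5023   | 0.498933 | 0.585533 |
| std   | 0.500003 | 0.500007 | 0.492638 |
| min   | 0.0      | 0.0      | 0.0      |
| 25%   | 0.0      | 0.0      | 0.0      |
| 50%   | 1.0      | 0.0      | 1.0      |
| 75%   | 1.0      | 1.0      | 1.0      |
| max   | 1.0      | 1.0      | 1.0      |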


In [27]:
df_train_and_validate.to_csv('../../Experiment/Data_prep/own_synthetic/mixed_incr_train_val.csv', index=False)
df_test.to_csv('../../Experiment/Data_prep/own_synthetic/mixed_incr_train_test.csv', index=False)
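
Since both written files start with the same 1500 training instances, a consumer of the combined training and validation file would presumably split it by position. The sketch below shows that split on a tiny stand-in frame written to a temporary directory; the constant `N_TRAIN` and the stand-in data are assumptions for illustration, not part of the notebook.

```python
import os
import tempfile

import pandas as pd

N_TRAIN = 1500  # number of initial training rows stated in the note above

# Stand-in for the written file: N_TRAIN training rows followed by validation rows
stand_in = pd.DataFrame({"x1": [0, 1] * 800, "label": [1, 0] * 800})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "mixed_incr_train_val.csv")
    stand_in.to_csv(path, index=False)
    loaded = pd.read_csv(path)

# Positional split: first block for the initial training, the rest for validation
train_part = loaded.iloc[:N_TRAIN]
val_part = loaded.iloc[N_TRAIN:]
print(len(train_part), len(val_part))
```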