## CategoricalClassification Usage

_CategoricalClassification_ is a library for quickly generating binary categorical datasets. It supports both linearly and non-linearly separable dataset generation, as well as several noise-simulating functions.


### Importing
Once copied to your working directory, _CategoricalClassification_ can be imported like any other Python library.

In [23]:

from CategoricalClassification import CategoricalClassification
cc = CategoricalClassification()


### Generating a linearly separable dataset
The following call generates a linearly separable dataset with 100 relevant features, 400 irrelevant features, and 10000 samples, using a seed of 42.

In [24]:
X, y = cc.generate_linear_binary_data(100, 400, samples=10000, seed=42)
print(X)
print(y)

[[0 1 0 ... 0 1 0]
 [0 0 1 ... 1 1 0]
 [0 0 0 ... 1 1 0]
 ...
 [0 0 0 ... 0 1 1]
 [1 1 0 ... 1 1 1]
 [1 1 1 ... 0 1 0]]
[1 0 0 ... 1 0 0]
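
To make "linearly separable" concrete, here is a minimal standard-library sketch of one way such data could be produced. It is not the library's implementation: the function name `make_linear_binary_data`, the integer weights, and the median threshold are all assumptions chosen for illustration. Labels follow a single linear threshold rule over the relevant features, so one hyperplane separates the two classes, while the irrelevant features are pure random bits.

```python
import random


def make_linear_binary_data(n_relevant, n_irrelevant, samples, seed=None):
    """Hypothetical sketch: binary features labeled by a linear threshold rule."""
    rng = random.Random(seed)
    # Assumption: nonzero integer weights on the first n_relevant columns only.
    weights = [rng.choice([-2, -1, 1, 2]) for _ in range(n_relevant)]
    X, scores = [], []
    for _ in range(samples):
        row = [rng.randint(0, 1) for _ in range(n_relevant + n_irrelevant)]
        X.append(row)
        # zip stops after n_relevant items, so irrelevant columns never contribute.
        scores.append(sum(w * v for w, v in zip(weights, row)))
    # Assumption: threshold at the median score to keep the classes roughly balanced.
    threshold = sorted(scores)[samples // 2]
    y = [1 if s >= threshold else 0 for s in scores]
    return X, y, weights, threshold


X_demo, y_demo, w_demo, t_demo = make_linear_binary_data(5, 3, samples=200, seed=42)
print(len(X_demo), len(X_demo[0]))  # rows, columns
```

Because every label is determined by whether the weighted sum reaches the threshold, a linear classifier can in principle fit this data perfectly, which is the property the linear generator above is meant to provide.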


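The following calls generate a linearly separable dataset with 100 relevant features and 400 irrelevant features from a pre-generated label array. The final check confirms that the returned `y` matches the supplied labels.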

In [25]:
labels = cc.generate_binary_labels(10000, 0.5, seed=42)
X, y = cc.generate_linear_binary_data(100, 400, labels=labels, seed=42)
print(X)
print(y)
print(all(y == labels))

[[0 0 0 ... 0 0 0]
 [1 0 1 ... 0 1 0]
 [1 0 0 ... 1 0 1]
 ...
 [1 0 1 ... 1 1 0]
 [0 1 1 ... 0 0 1]
 [0 1 1 ... 0 1 1]]
[0 1 1 ... 1 0 0]
True
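
The label array used above can be pictured with a short standard-library sketch. This is only an illustration: `make_binary_labels` is a hypothetical name, and it assumes that the second argument (0.5 in the call above) is the probability of drawing label 1, which the source does not state.

```python
import random


def make_binary_labels(n, p, seed=None):
    """Hypothetical sketch: n labels, each equal to 1 with probability p."""
    rng = random.Random(seed)
    # rng.random() is uniform on [0, 1), so the comparison is true with probability p.
    return [1 if rng.random() < p else 0 for _ in range(n)]


demo_labels = make_binary_labels(10000, 0.5, seed=42)
# With p=0.5 the share of ones should sit close to one half.
print(sum(demo_labels) / len(demo_labels))
```

Reusing the same seed reproduces the same label array, which is what makes a check like `all(y == labels)` meaningful across runs.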


### Generating a non-linearly separable dataset
The following call generates a non-linearly separable dataset with 100 relevant features, 400 irrelevant features, and 10000 samples, using a seed of 42.

In [26]:
X, y = cc.generate_nonlinear_data(100, 10000, p=0.5, n_irrelevant=400, seed=42)
print(X)
print(y)

[[1 0 1 ... 1 1 0]
 [1 1 0 ... 0 0 0]
 [1 0 0 ... 0 1 1]
 ...
 [0 0 0 ... 0 0 1]
 [1 0 1 ... 0 1 0]
 [0 1 0 ... 0 0 0]]
[0 0 1 ... 1 1 1]
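
For contrast with the linear case, the sketch below builds a dataset whose labels no single hyperplane can reproduce. It swaps in a parity (XOR-style) rule purely for illustration and does not claim to match how `generate_nonlinear_data` works internally; `make_parity_data` is a hypothetical name.

```python
import random


def make_parity_data(n_relevant, n_irrelevant, samples, seed=None):
    """Hypothetical sketch: label is the parity of the relevant binary features."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(samples):
        row = [rng.randint(0, 1) for _ in range(n_relevant + n_irrelevant)]
        X.append(row)
        # Parity of the first n_relevant columns; irrelevant columns are ignored.
        y.append(sum(row[:n_relevant]) % 2)
    return X, y


X_demo, y_demo = make_parity_data(2, 3, samples=200, seed=42)
print(X_demo[0], y_demo[0])
```

With two relevant features this is exactly XOR, the standard example of a rule that is not linearly separable; adding more relevant features keeps that property.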


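The following calls are meant to generate a non-linearly separable dataset with 100 relevant features and 400 irrelevant features from a label array. In the recorded run, this cell raised the `IndexError` shown below instead of returning a dataset.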

In [27]:
labels = cc.generate_binary_labels(10000, 0.5, seed=42)
X, y = cc.generate_nonlinear_data(100, 10000, n_irrelevant=400, labels=labels, seed=42)
print(X)
print(y)
print(all(y == labels))

IndexError: index 0 is out of bounds for axis 0 with size 0


### Applying noise to datasets
Applies cardinal noise to any binary or categorical dataset X, here with a cardinality of 10 for class label 1.

In [None]:
X = cc.replace_with_cardinality(X, [10, 1], seed=42)
print(X)

[[1 6 1 ... 1 1 0]
 [1 1 4 ... 8 0 0]
 [1 3 8 ... 0 1 1]
 ...
 [0 9 0 ... 7 5 1]
 [1 1 1 ... 8 1 0]
 [2 1 7 ... 4 9 6]]


Applies categorical noise to 20% of any binary dataset X.

In [None]:
X = cc.noisy_data_cat(X, p=0.2, seed=42)
print(X)

[[0 0 0 ... 0 0 1]
 [1 1 4 ... 8 0 0]
 [1 3 8 ... 0 1 1]
 ...
 [0 9 0 ... 7 5 1]
 [0 0 0 ... 0 0 1]
 [2 1 7 ... 4 9 6]]


Replaces 35% of the values in any dataset X with missing values, which appear as -1 in the output.

In [19]:
X = cc.replace_with_none(X, 0.35, seed=42)
print(X)

[[-1 6 -1 ... 1 1 -1]
 [1 -1 4 ... 8 -1 0]
 [-1 -1 8 ... 0 1 -1]
 ...
 [-1 9 -1 ... 7 5 1]
 [1 -1 1 ... -1 1 0]
 [2 1 -1 ... 4 9 6]]
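
The masking step can be sketched with the standard library alone. The version below is an assumption-laden illustration, not the library's code: `mask_with_missing` is a hypothetical name, cells are assumed to be chosen uniformly at random across the whole matrix, and -1 is used as the missing marker only because that is what the output above shows.

```python
import random


def mask_with_missing(X, fraction, seed=None, missing=-1):
    """Hypothetical sketch: overwrite a fixed fraction of cells with a marker."""
    rng = random.Random(seed)
    n_rows, n_cols = len(X), len(X[0])
    n_cells = n_rows * n_cols
    n_missing = round(fraction * n_cells)
    # Sample distinct flat cell indices so exactly n_missing cells are masked.
    chosen = rng.sample(range(n_cells), n_missing)
    masked = [row[:] for row in X]  # copy so the input stays untouched
    for cell in chosen:
        masked[cell // n_cols][cell % n_cols] = missing
    return masked


demo = [[1, 2, 3, 4, 5] for _ in range(4)]
print(mask_with_missing(demo, 0.35, seed=42))
```

Using a sentinel such as -1 only works when -1 is not a legitimate category value, so downstream code should treat it as "missing" rather than as another category.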
