## CategoricalClassification Usage

_CategoricalClassification_ is a library for quickly generating binary categorical datasets. It can generate both linearly and non-linearly separable datasets, and it provides several noise-simulating functions.


### Importing
Once copied to your working directory, _CategoricalClassification_ can be imported like any other Python library.

In [1]:
from CategoricalClassification import CategoricalClassification
cc = CategoricalClassification()

### Generating a linearly separable dataset
The following call generates a linearly separable dataset with 100 relevant features, 400 irrelevant features, and 10000 samples, using a seed of 42.

In [2]:
X, y = cc.generate_linear_binary_data(100, 400, samples=10000, seed=42)
print(X)
print(y)

[[0 1 0 ... 0 1 0]
 [0 0 1 ... 1 1 0]
 [0 0 0 ... 1 1 0]
 ...
 [0 0 0 ... 0 1 1]
 [1 1 0 ... 1 1 1]
 [1 1 1 ... 0 1 0]]
[1 0 0 ... 1 0 0]


A linearly separable dataset with 100 relevant features and 400 irrelevant features can also be generated from an existing label array. The final check confirms that the returned `y` matches the supplied labels.

In [3]:
labels = cc.generate_binary_labels(10000, 0.5, seed=42)
X, y = cc.generate_linear_binary_data(100, 400, labels=labels, seed=42)
print(X)
print(y)
print(all(y == labels))

[[0 0 0 ... 0 0 0]
 [1 0 1 ... 0 1 0]
 [1 0 0 ... 1 0 1]
 ...
 [1 0 1 ... 1 1 0]
 [0 1 1 ... 0 0 1]
 [0 1 1 ... 0 1 1]]
[0 1 1 ... 1 0 0]
True
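
To illustrate what "linearly separable" means for a binary dataset with relevant and irrelevant features, here is a minimal NumPy sketch. It is not the library's implementation: the function name `sketch_linear_binary_data` and the majority-vote labelling rule are assumptions chosen only for illustration.

```python
import numpy as np

def sketch_linear_binary_data(n_relevant, n_irrelevant, samples, seed=None):
    """Hypothetical stand-in, NOT the CategoricalClassification implementation."""
    rng = np.random.default_rng(seed)
    # Relevant and irrelevant features are both random bits.
    X_rel = rng.integers(0, 2, size=(samples, n_relevant))
    X_irr = rng.integers(0, 2, size=(samples, n_irrelevant))
    # Assumed rule: label is 1 when more than half of the relevant bits are set.
    # This is a linear threshold on the relevant features only.
    y = (X_rel.sum(axis=1) > n_relevant / 2).astype(int)
    return np.hstack([X_rel, X_irr]), y

X_demo, y_demo = sketch_linear_binary_data(5, 3, samples=200, seed=0)

# A weight vector with 1 on relevant and 0 on irrelevant columns,
# thresholded at 2.5, reproduces every label: the classes are linearly separable.
w = np.concatenate([np.ones(5), np.zeros(3)])
assert np.array_equal((X_demo @ w > 2.5).astype(int), y_demo)
print(X_demo.shape, y_demo.shape)  # (200, 8) (200,)
```

The irrelevant columns receive zero weight in the separating rule, which is why they carry no information about `y` in this sketch.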


### Generating a non-linearly separable dataset
The following call generates a non-linearly separable dataset with 100 relevant features, 400 irrelevant features, and 10000 samples, using a seed of 42.

In [4]:
X, y = cc.generate_nonlinear_data(100, 10000, p=0.5, n_irrelevant=400, seed=42)
print(X)
print(y)

[[1 0 1 ... 1 1 0]
 [1 1 0 ... 0 0 0]
 [1 0 0 ... 0 1 1]
 ...
 [0 0 0 ... 0 0 1]
 [1 0 1 ... 0 1 0]
 [0 1 0 ... 0 0 0]]
[0 0 1 ... 1 1 1]


A non-linearly separable dataset with 100 relevant features and 400 irrelevant features can also be generated from an existing label array. The final check confirms that the returned `y` matches the supplied labels.

In [5]:
labels = cc.generate_binary_labels(10000, 0.5, seed=42)
X, y = cc.generate_nonlinear_data(100, 10000, n_irrelevant=400, labels=labels, seed=42)
print(X)
print(y)
print(all(y == labels))

[[0 1 0 ... 1 0 0]
 [0 1 1 ... 1 0 1]
 [0 1 0 ... 0 1 1]
 ...
 [1 0 1 ... 1 0 1]
 [1 1 0 ... 0 1 0]
 [0 1 1 ... 1 0 1]]
[0 1 1 ... 1 0 0]
True
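
For contrast with the linear case, the sketch below builds a binary dataset whose labels are the parity (XOR) of the relevant features. Parity is a textbook example of a target that, in general, no single linear threshold can reproduce. This is an assumption for illustration only: the name `sketch_nonlinear_binary_data` is hypothetical, and the library may construct its non-linear labels in a completely different way.

```python
import numpy as np

def sketch_nonlinear_binary_data(n_relevant, samples, n_irrelevant=0, seed=None):
    """Hypothetical stand-in, NOT the CategoricalClassification implementation."""
    rng = np.random.default_rng(seed)
    # Relevant and irrelevant features are both random bits.
    X_rel = rng.integers(0, 2, size=(samples, n_relevant))
    X_irr = rng.integers(0, 2, size=(samples, n_irrelevant))
    # Assumed rule: label is the parity (XOR) of the relevant bits.
    y = X_rel.sum(axis=1) % 2
    return np.hstack([X_rel, X_irr]), y

X_par, y_par = sketch_nonlinear_binary_data(4, 200, n_irrelevant=6, seed=0)
print(X_par.shape, y_par.shape)  # (200, 10) (200,)
```

Flipping any single relevant bit flips the label here, so the decision boundary cannot be described by one weighted sum and threshold.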



### Applying noise to datasets
The following call applies cardinality noise to a binary or categorical dataset `X`, assigning a cardinality of 10 to class label 1.

In [6]:
X = cc.replace_with_cardinality(X, [10, 1], seed=42)
print(X)

[[6 1 3 ... 1 6 9]
 [8 1 1 ... 1 8 1]
 [3 1 9 ... 4 1 1]
 ...
 [1 2 1 ... 1 1 1]
 [1 1 7 ... 2 1 5]
 [8 1 1 ... 1 9 1]]


The following call applies categorical noise to 20% of a binary dataset `X`.

In [7]:
X = cc.noisy_data_cat(X, p=0.2, seed=42)
print(X)

[[0 0 0 ... 0 0 0]
 [8 1 1 ... 1 8 1]
 [3 1 9 ... 4 1 1]
 ...
 [1 2 1 ... 1 1 1]
 [0 0 0 ... 0 0 0]
 [8 1 1 ... 1 9 1]]


The following call replaces 35% of the values in a dataset `X` with missing values, which appear as -1 in the output.

In [8]:
X = cc.replace_with_none(X, 0.35, seed=42)
print(X)

[[-1 0 -1 ... 0 0 -1]
 [8 -1 1 ... 1 -1 1]
 [-1 -1 9 ... 4 1 -1]
 ...
 [-1 2 -1 ... 1 1 1]
 [0 -1 0 ... -1 0 0]
 [8 1 -1 ... 1 9 1]]
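
As a rough picture of what injecting missing values can look like, the NumPy sketch below masks each entry independently with probability `p` and writes -1 in its place. The function name `sketch_replace_with_missing`, the `missing` parameter, and the per-entry sampling are all assumptions; the library's `replace_with_none` may select entries differently, for example by replacing an exact fraction.

```python
import numpy as np

def sketch_replace_with_missing(X, p, seed=None, missing=-1):
    """Hypothetical stand-in, NOT the CategoricalClassification implementation."""
    rng = np.random.default_rng(seed)
    # Work on a copy so the caller's array is left untouched.
    X_out = np.array(X, copy=True)
    # Each entry is masked independently with probability p (assumed behaviour).
    mask = rng.random(X_out.shape) < p
    X_out[mask] = missing
    return X_out

X_small = np.arange(1, 21).reshape(4, 5)
X_missing = sketch_replace_with_missing(X_small, 0.35, seed=0)
print(X_missing)
```

Using a sentinel such as -1 keeps the array integer-typed, at the cost of requiring that -1 is never a legitimate category value.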
