### Dataset with categorical features

Here you can see how datapot works with the [Mushroom Data Set](https://archive.ics.uci.edu/ml/datasets/Mushroom).
The important detail about this dataset is that all of its features are categorical.

In [1]:
import datapot as dp
import pandas as pd



Creating the DataPot object.

In [2]:
datapot = dp.DataPot()

In [4]:
import bz2
# Open the bz2-compressed jsonlines file as a file-like object
ftr = bz2.BZ2File('../data/mushrooms.jsonlines.bz2')
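
For readers unfamiliar with the format: a jsonlines file stores one JSON object per line. The sketch below shows how such a compressed file could be read by hand. It uses made-up records in an in-memory buffer (the field names `cap` and `odor` are illustrative, not taken from the Mushroom file), so it runs without the dataset on disk.

```python
import bz2
import io
import json

# Made-up records standing in for lines of a jsonlines file (not real Mushroom data).
records = [{"cap": "x", "odor": "p"}, {"cap": "b", "odor": "a"}]
raw = "\n".join(json.dumps(r) for r in records).encode("utf-8")

# Compress into an in-memory buffer so the sketch needs no file on disk.
buf = io.BytesIO(bz2.compress(raw))

# BZ2File yields one bytes line per iteration; each line is one JSON object.
with bz2.BZ2File(buf) as f:
    parsed = [json.loads(line) for line in f]

first_keys = sorted(parsed[0])
print(first_keys)
```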

Let's call the fit method. It automatically selects appropriate transformers for the fields of the jsonlines file. The `limit` parameter sets how many objects are used to detect the right transformers.

In [5]:
datapot.fit(ftr, limit=100)
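
To give an intuition for what detection on a limited sample can look like, here is a minimal sketch. It is not datapot's actual implementation: the function `guess_categorical_fields`, the `max_unique` threshold, and the rule itself (string-only values with few distinct levels) are assumptions made for illustration.

```python
import json
from itertools import islice

def guess_categorical_fields(lines, limit=100, max_unique=20):
    """Hypothetical heuristic: a field counts as categorical if, within the
    first `limit` records, it holds only strings and has at most
    `max_unique` distinct values."""
    seen = {}
    for line in islice(lines, limit):
        for field, value in json.loads(line).items():
            seen.setdefault(field, []).append(value)
    return sorted(
        field for field, values in seen.items()
        if all(isinstance(v, str) for v in values)
        and len(set(values)) <= max_unique
    )

# Made-up sample lines; only the first `limit` of them are inspected.
sample = [
    '{"cap": "x", "weight": 1.5}',
    '{"cap": "b", "weight": 2.0}',
    '{"cap": "x", "weight": 0.7}',
]
categorical = guess_categorical_fields(sample, limit=2)
print(categorical)
```

A small `limit` keeps detection fast, at the cost of possibly missing rare values that appear only later in the file.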

In [6]:
datapot

DataPot class instance
 - number of features without transformation: 23
 - number of new features: 0
features to transform: 
	('k', [SVDOneHotTransformer])
	('s.1', [SVDOneHotTransformer])
	('t', [SVDOneHotTransformer])
	('n', [SVDOneHotTransformer])
	('s', [SVDOneHotTransformer])
	('s.3', [SVDOneHotTransformer])
	('e', [SVDOneHotTransformer])
	('x', [SVDOneHotTransformer])
	('p.2', [SVDOneHotTransformer])
	('u', [SVDOneHotTransformer])
	('w.2', [SVDOneHotTransformer])
	('p', [SVDOneHotTransformer])
	('c', [SVDOneHotTransformer])
	('p.1', [SVDOneHotTransformer])
	('w.1', [SVDOneHotTransformer])
	('s.2', [SVDOneHotTransformer])
	('n.1', [SVDOneHotTransformer])
	('o', [SVDOneHotTransformer])
	('f', [SVDOneHotTransformer])
	('w', [SVDOneHotTransformer])
	('k.1', [SVDOneHotTransformer])
	('e.1', [SVDOneHotTransformer])
	('p.3', [SVDOneHotTransformer])

As a result, only categorical transformers were chosen.
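
As a rough picture of what a one-hot encoding of a categorical field produces, consider the plain-Python sketch below. It only illustrates the idea and the `<field>_one_hot<i>` column naming seen in the output; the helper `one_hot` is hypothetical, and the real SVDOneHotTransformer may order categories differently or apply further steps.

```python
def one_hot(values, field):
    """Encode a list of category labels as rows of 0.0/1.0 indicators."""
    categories = sorted(set(values))
    columns = [f"{field}_one_hot{i}" for i in range(len(categories))]
    index = {cat: i for i, cat in enumerate(categories)}
    rows = []
    for v in values:
        row = [0.0] * len(categories)
        row[index[v]] = 1.0  # mark the column that matches this label
        rows.append(row)
    return columns, rows

# Made-up labels for an illustrative field named "cap".
columns, rows = one_hot(["x", "b", "x"], field="cap")
print(columns)
print(rows)
```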

In [7]:
data = datapot.transform(ftr)

In [8]:
data.head()

Unnamed: 0,k_one_hot0,k_one_hot1,k_one_hot2,k_one_hot3,k_one_hot4,k_one_hot5,k_one_hot6,k_one_hot7,k_one_hot8,k_one_hot9,...,w_one_hot2,w_one_hot3,w_one_hot4,w_one_hot5,w_one_hot6,w_one_hot7,w_one_hot8,k.1_one_hot0,e.1_one_hot0,p.3_one_hot0
0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0
1,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0
2,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0
3,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0
4,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0


In [9]:
data.columns

Index(['k_one_hot0', 'k_one_hot1', 'k_one_hot2', 'k_one_hot3', 'k_one_hot4',
       'k_one_hot5', 'k_one_hot6', 'k_one_hot7', 'k_one_hot8', 'k_one_hot9',
       'k_one_hot10', 'k_one_hot11', 's.1_one_hot0', 't_one_hot0',
       't_one_hot1', 'n_one_hot0', 'n_one_hot1', 'n_one_hot2', 'n_one_hot3',
       'n_one_hot4', 'n_one_hot5', 'n_one_hot6', 'n_one_hot7', 'n_one_hot8',
       'n_one_hot9', 's_one_hot0', 's_one_hot1', 's_one_hot2', 's_one_hot3',
       's.3_one_hot0', 'e_one_hot0', 'e_one_hot1', 'x_one_hot0', 'x_one_hot1',
       'x_one_hot2', 'x_one_hot3', 'x_one_hot4', 'x_one_hot5', 'p.2_one_hot0',
       'u_one_hot0', 'u_one_hot1', 'u_one_hot2', 'u_one_hot3', 'u_one_hot4',
       'u_one_hot5', 'u_one_hot6', 'w.2_one_hot0', 'p_one_hot0', 'p_one_hot1',
       'c_one_hot0', 'c_one_hot1', 'p.1_one_hot0', 'w.1_one_hot0',
       's.2_one_hot0', 'n.1_one_hot0', 'o_one_hot0', 'o_one_hot1',
       'o_one_hot2', 'f_one_hot0', 'f_one_hot1', 'w_one_hot0', 'w_one_hot1',
       'w_one_hot2', 'w

Let's test the new features. The 'e' field is chosen as the prediction target.

In [10]:
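# Remove both indicator columns of the target field 'e' from the features
X = data.drop(['e_one_hot0', 'e_one_hot1'], axis=1)
# Use one of the two indicators as the binary label
y = data['e_one_hot0']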

In [11]:
from sklearn.model_selection import cross_val_score

In [13]:
from xgboost import XGBClassifier
clf = XGBClassifier(n_estimators=100)
cross_val_score(clf, X, y, cv=5)



array([ 0.976     ,  0.92430769,  0.98523077,  0.9612069 ,  0.92118227])
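
The five fold scores above are easier to compare once reduced to a mean and a spread. The sketch below copies the scores from the output and summarizes them with the standard library; it assumes the default scorer was used, which for scikit-learn style classifiers is accuracy.

```python
from statistics import mean, stdev

# Fold scores copied from the cross_val_score output above.
scores = [0.976, 0.92430769, 0.98523077, 0.9612069, 0.92118227]

mean_score = mean(scores)   # average score across the 5 folds
spread = stdev(scores)      # sample standard deviation across folds

print(f"accuracy: {mean_score:.3f} +/- {spread:.3f}")
```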