First, install OpenML, a library that provides access to the datasets on https://www.openml.org. You will also use scikit-learn's `scale` function to standardize the features, so that every measurement is on a comparable scale and carries equal weight.

In [0]:
!pip install openml

Next, import `openml`. Also import `scale` for standardizing the features, `train_test_split` for splitting the data into training and testing sets, and `ExtraTreesClassifier` from scikit-learn's `sklearn.ensemble` module.

In [0]:
import openml as oml
from sklearn.preprocessing import scale
from sklearn.model_selection import train_test_split
from sklearn.ensemble import ExtraTreesClassifier
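
To make the role of `scale` concrete, here is a minimal plain-Python sketch of the standardization it applies with its default settings: each column is shifted to zero mean and divided by its population standard deviation. The helper name `standardize_columns` and the sample rows are hypothetical and only serve as illustration; this is not scikit-learn's actual implementation.

```python
from statistics import fmean, pstdev


def standardize_columns(rows):
    """Shift each column to zero mean and divide by its population std."""
    columns = list(zip(*rows))
    means = [fmean(col) for col in columns]
    # pstdev divides by n (not n - 1); a constant column gets a divisor
    # of 1.0 so that it does not trigger a division by zero.
    stds = [pstdev(col) or 1.0 for col in columns]
    return [
        [(value - m) / s for value, m, s in zip(row, means, stds)]
        for row in rows
    ]


# Two hypothetical features on very different scales.
rows = [[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]]
scaled = standardize_columns(rows)

for row in scaled:
    print(row)
```

After standardization both columns hold the same values, even though the raw second column was a hundred times larger than the first.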

Next, load the dataset from OpenML, scale the features, and split the data into training and testing sets.

In [3]:
dataset = oml.datasets.get_dataset(1471)

x, y, attribute_names = dataset.get_data(
    target=dataset.default_target_attribute,
    return_attribute_names=True,
)

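# Standardize every feature to zero mean and unit variance.
x = scale(x)

# Hold out 5% of the samples for testing.
xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.05, random_state=42)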

print(x[0])

[ 0.00293428 -0.01170463  0.5673981  -0.00320852  0.24522924 -0.01978755
 -0.00292999  0.85256034  0.00150875  0.18774854  0.23350398  0.03073905
  0.01712658 -0.00383388]
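
The cell above scales the full dataset before splitting it, so the test rows contribute to the mean and variance used for scaling. A common alternative, sketched here on synthetic data rather than the OpenML dataset, is to split first and fit a `StandardScaler` on the training rows only. The `*_demo` names and the random data are assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real features and labels (not the OpenML data).
rng = np.random.default_rng(42)
x_demo = rng.normal(loc=5.0, scale=3.0, size=(200, 4))
y_demo = rng.integers(0, 2, size=200)

# Split first, so the test rows stay unseen while the scaler is fitted.
xtrain_demo, xtest_demo, ytrain_demo, ytest_demo = train_test_split(
    x_demo, y_demo, test_size=0.05, random_state=42
)

# Learn mean and variance from the training rows only, then apply them to both parts.
scaler = StandardScaler().fit(xtrain_demo)
xtrain_scaled = scaler.transform(xtrain_demo)
xtest_scaled = scaler.transform(xtest_demo)

print(xtrain_scaled.shape, xtest_scaled.shape)
```

With this ordering the training columns have zero mean and unit variance, while the test columns are only approximately standardized, which mirrors how unseen data would be treated.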




Now initialize the classifier and fit it to the data.

In [4]:
clf = ExtraTreesClassifier(min_samples_split=4, random_state=42, n_estimators=100)
clf.fit(xtrain, ytrain)

ExtraTreesClassifier(bootstrap=False, class_weight=None, criterion='gini',
           max_depth=None, max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=4,
           min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=1,
           oob_score=False, random_state=42, verbose=0, warm_start=False)

Finally, evaluate the classifier on the held-out test data. It reaches an accuracy of about 95.5% on samples it did not see during training.

In [5]:
print(clf.score(xtest, ytest))

0.9546061415220294
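
A test score on its own does not show how much a model overfits. One simple check, sketched below on synthetic data from `make_classification` rather than the OpenML dataset, is to compare accuracy on the training rows with accuracy on the test rows. The `*_demo` names and the generated data are assumptions for illustration, so the printed numbers will differ from the result above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (stands in for the OpenML dataset).
x_demo, y_demo = make_classification(n_samples=1000, n_features=14, random_state=42)
xtrain_demo, xtest_demo, ytrain_demo, ytest_demo = train_test_split(
    x_demo, y_demo, test_size=0.05, random_state=42
)

clf_demo = ExtraTreesClassifier(min_samples_split=4, random_state=42, n_estimators=100)
clf_demo.fit(xtrain_demo, ytrain_demo)

# score() returns the mean accuracy; a large gap between the two values
# suggests the model has memorized the training rows.
train_acc = clf_demo.score(xtrain_demo, ytrain_demo)
test_acc = clf_demo.score(xtest_demo, ytest_demo)
print(f"train accuracy: {train_acc:.3f}")
print(f"test accuracy:  {test_acc:.3f}")
print(f"gap:            {train_acc - test_acc:.3f}")
```

With only 5% of the rows held out, the test estimate is noisy, so a small gap should be read with some caution.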
