# Automatic Jupyter Notebook for OpenML dataset 9: autos

Several elements of this notebook only render correctly after the cells have been executed, so please run the entire notebook again.

These are the top 10 flows for this dataset available on OpenML.

In [None]:
%matplotlib inline
from scripts.preamble import *
did = 9
data = oml.datasets.get_dataset(did)
X, y, features = getData(data)
task, topList, strats, scores = getOpenMLData(did, data.default_target_attribute) 
topList[:10] 

Here we determine the problem type and verify that the automatically selected main task is the correct one.

In [None]:
from scripts.problemType import *
problemType = findProblemType(data) 
checkTask(task, problemType, data.default_target_attribute) 
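
The internals of `findProblemType` are not shown in this notebook. As a rough illustration of what such a check might do, the sketch below guesses the problem type from the target values alone. The function name `guess_problem_type`, the `max_classes` cutoff, and the returned labels are assumptions made for this example and are not taken from `scripts.problemType`.

```python
def guess_problem_type(target_values, max_classes=20):
    """Hypothetical heuristic; not the notebook's findProblemType."""
    values = [v for v in target_values if v is not None]
    # Non-numeric targets (for example string class labels) imply classification.
    if any(not isinstance(v, (int, float)) or isinstance(v, bool) for v in values):
        return "Supervised Classification"
    # Numeric targets with only a few distinct whole-number values
    # are treated as class labels.
    distinct = set(values)
    if len(distinct) <= max_classes and all(float(v).is_integer() for v in distinct):
        return "Supervised Classification"
    # Everything else is treated as a continuous target.
    return "Supervised Regression"


# Made-up target columns, for illustration only.
print(guess_problem_type(["low", "high", "low"]))
print(guess_problem_type([13.5, 21.0, 17.25]))
```

A heuristic like this can misjudge integer-valued regression targets, which is one reason a separate check against the selected task is useful.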

These plots show histograms for each of the features of this dataset.

In [None]:
%matplotlib inline
from scripts.dataVisualization import *
show1DHist(data) 

The first plot shows normalized boxplots for each of the features. The second is a styled dataframe sorted by anomaly score: yellow cells denote outlier values, and a heatmap is included for the anomaly scores.

In [None]:
%matplotlib inline
from scripts.outlierDetection import *
outlierDetection(X, features, 10) 
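
The method behind `outlierDetection` is not described here. To make the idea of "sorted by anomaly score" concrete, the sketch below uses a simple stand-in technique: each row is scored by its largest absolute z-score across features, and individual values above a threshold are flagged. The function `anomaly_ranking` and its `z_threshold` parameter are assumptions for this example, and the notebook may use a different detector.

```python
from statistics import mean, pstdev


def anomaly_ranking(rows, z_threshold=3.0):
    """Rank rows by max absolute z-score (a stand-in, not the notebook's method)."""
    columns = list(zip(*rows))
    stats = [(mean(col), pstdev(col)) for col in columns]
    scored = []
    for idx, row in enumerate(rows):
        # Absolute z-score per feature; constant columns contribute 0.
        z = [abs(v - m) / s if s > 0 else 0.0 for v, (m, s) in zip(row, stats)]
        # Per-value outlier flags (the cells a styled table could highlight).
        flags = [score > z_threshold for score in z]
        scored.append((max(z), idx, flags))
    # Highest anomaly score first.
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored


# Made-up data: the last row has an extreme first feature.
rows = [[1.0, 10.0], [1.1, 11.0], [0.9, 9.0], [1.0, 10.0], [50.0, 10.0]]
for score, idx, flags in anomaly_ranking(rows, z_threshold=1.9):
    print(idx, round(score, 2), flags)
```

With very small samples the z-score of a single extreme value is bounded, so the threshold in the usage example is set lower than the default.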

These are the 10 most similar datasets, ranked by the cosine similarity between their standardized property vectors.

In [None]:
from scripts.localDataOpenMLInterface import *
showTopNSimilarDatasets("datasetSimilarityMatrixNormalized", did, 10) 
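
The cell above appears to read a precomputed similarity matrix. As a minimal sketch of the underlying idea, the code below standardizes each property across datasets (z-scores) and ranks datasets by cosine similarity to a query. The function names and the example property values are hypothetical and are not taken from `scripts.localDataOpenMLInterface`.

```python
import math
from statistics import mean, pstdev


def standardize(vectors):
    """Z-score each property (column) across all datasets."""
    columns = list(zip(*vectors))
    stats = [(mean(col), pstdev(col)) for col in columns]
    return [
        [(v - m) / s if s > 0 else 0.0 for v, (m, s) in zip(vec, stats)]
        for vec in vectors
    ]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def top_n_similar(properties, query, n=10):
    """Return up to n (name, similarity) pairs, most similar first."""
    names = list(properties)
    standardized = dict(zip(names, standardize([properties[k] for k in names])))
    sims = [
        (cosine_similarity(standardized[query], standardized[k]), k)
        for k in names
        if k != query
    ]
    sims.sort(reverse=True)
    return [(name, sim) for sim, name in sims[:n]]


# Made-up property vectors (instances, features, classes), for illustration only.
properties = {
    "dataset_a": [200.0, 25.0, 6.0],
    "dataset_b": [220.0, 24.0, 5.0],
    "dataset_c": [50000.0, 10.0, 2.0],
    "dataset_d": [180.0, 30.0, 7.0],
}
print(top_n_similar(properties, "dataset_a", n=2))
```

Standardizing first keeps properties with large raw magnitudes, such as the number of instances, from dominating the similarity.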

This cell sets the complexity threshold that determines whether an algorithm will be run.


In [None]:
comp = 50000000000000
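
How `comp` is compared against an algorithm's cost is decided inside `scripts.machineLearningAlgorithms` and is not shown here. Purely as an illustration of threshold gating, the sketch below assumes a hypothetical cost model of `n_samples ** 2 * n_features`. The functions `estimated_cost` and `should_run` are invented for this example, and the real estimate may be entirely different.

```python
COMP = 50000000000000  # same value as the notebook's comp


def estimated_cost(n_samples, n_features, sample_exponent=2):
    """Hypothetical cost model: n_samples ** sample_exponent * n_features."""
    return (n_samples ** sample_exponent) * n_features


def should_run(n_samples, n_features, threshold=COMP, sample_exponent=2):
    """Run the algorithm only if its estimated cost stays within the threshold."""
    return estimated_cost(n_samples, n_features, sample_exponent) <= threshold


# Made-up dataset sizes, for illustration only.
print(should_run(1_000, 50))
print(should_run(10_000_000, 100))
```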

This plot shows the feature importance of each of the features.

In [None]:
%matplotlib inline
from scripts.featureImportance import *
featureImportance(data) 

This plot shows the baseline performance on this dataset. From this point onwards, the red dotted line denotes the maximum baseline score, which serves as the threshold that algorithms should beat.

In [None]:
%matplotlib inline
from scripts.baselines import *
maxBaseline = generateBaselines(data, problemType) 
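
Which baselines `generateBaselines` computes is not listed in this notebook. One common baseline for classification is to always predict the most frequent class. The sketch below computes that accuracy and takes the maximum over several baseline scores. The function names and the example numbers are assumptions for illustration.

```python
from collections import Counter


def majority_class_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(labels)
    _, frequency = counts.most_common(1)[0]
    return frequency / len(labels)


def max_baseline(baseline_scores):
    """The best baseline score; algorithms are compared against this value."""
    return max(baseline_scores.values())


# Made-up labels and a made-up second baseline score, for illustration only.
labels = ["a", "a", "b", "c"]
baselines = {
    "majority_class": majority_class_accuracy(labels),
    "other_baseline": 0.3,
}
print(max_baseline(baselines))
```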

Please run the full notebook first. Then toggle removeOutliers and re-run this cell and the next one to compare the algorithms with and without outliers.

In [None]:
removeOutliers = False 

This plot shows the performance of every algorithm that is run in this notebook. A red bar denotes that the algorithm performs worse than the baseline, and a green bar denotes that it performs better. A purple bar is shown only when outliers are removed and denotes whether the original algorithm or the one run with outliers removed performs better.

In [None]:
%matplotlib inline
from scripts.machineLearningAlgorithms import *
settings = runMachineLearningAlgorithms(data, comp, strats, problemType, task, showRuntimePrediction=False, runTPOT=False, removeOutliers=removeOutliers)
plot_alg(data, settings.strats, maxBaseline, problemType) 
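
The red and green colouring rule described above can be written down in a few lines. This is a sketch only: `classify_bars` is a hypothetical helper rather than part of `plot_alg`, the scores are made up, and treating a tie with the baseline as red is an assumption. The purple case is left out because its exact rule is not specified here.

```python
def classify_bars(scores, max_baseline):
    """Map each algorithm name to a bar colour relative to the baseline."""
    return {
        # Green only when strictly better than the baseline (tie handling is assumed).
        name: "green" if score > max_baseline else "red"
        for name, score in scores.items()
    }


# Made-up scores, for illustration only.
print(classify_bars({"algorithm_a": 0.82, "algorithm_b": 0.41}, max_baseline=0.5))
```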

Use this cell to add new classifiers. An example is already given.

In [None]:
%matplotlib inline
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

# Create a custom algorithm by setting the classifier and its name
# ---------------------------------------------------
clf = QuadraticDiscriminantAnalysis()
name = "CustomAlgorithmQuadraticDiscriminantAnalysis"
# ---------------------------------------------------
runMLAlgorithm(estimator=clf, name=name, settings=settings)
plot_alg(data, settings.strats, maxBaseline, problemType) 

This interactive plot shows all algorithm performance datapoints from OpenML, with a boxplot overlaid. The red dots are the algorithms run in this notebook. Hover over a datapoint to see its name and accuracy.

In [None]:
%matplotlib nbagg
from scripts.relativePerformance import * 
showRelativePerformanceBoxplot(scores, topList, settings.strats, maxBaseline) 