# This notebook shows how SuloClassifier narrowly beat single models as well as a Voting classifier on test-set balanced accuracy

### We are going to test it on a 10,000-row machine-failure dataset using the example code provided by:
https://machinelearningmastery.com/weighted-average-ensemble-with-python/
Thanks to Jason Brownlee for his Machine Learning Mastery blogs. He is absolutely great!

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from lazytransform import SuloClassifier, LazyTransformer

Imported lazytransform v1.13.



In [2]:
transform_target = True
lazy = LazyTransformer(transform_target=transform_target)

In [3]:
datapath = '../data_sets/'
# filename = 'breast-cancer.csv'  # alternative dataset, not used in this run
filename = 'machinefailuretype.csv'
sep = ','

In [4]:
df = pd.read_csv(datapath+filename, sep=sep, header=0)
print(df.shape)
df.head()

(10000, 10)


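|   | udi | product | machinetype | airtemp | processtemperature | rotationalspeed | torque | toolwear | fail | failtype |
|---|-----|---------|-------------|---------|--------------------|-----------------|--------|----------|------|------------|
| 0 | 1 | M14860 | M | 298.1 | 308.6 | 1551 | 42.8 | 0 | 0 | No Failure |
| 1 | 2 | L47181 | L | 298.2 | 308.7 | 1408 | 46.3 | 3 | 0 | No Failure |
| 2 | 3 | L47182 | L | 298.1 | 308.5 | 1498 | 49.4 | 5 | 0 | No Failure |
| 3 | 4 | L47183 | L | 298.2 | 308.6 | 1433 | 39.5 | 7 | 0 | No Failure |
| 4 | 5 | L47184 | L | 298.2 | 308.7 | 1408 | 40.0 | 9 | 0 | No Failure |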


In [5]:
target = 'failtype'
preds = [x for x in list(df) if x not in [target]]
X = df[preds]
y = df[target]
X.shape, y.shape

((10000, 9), (10000,))

In [None]:
# evaluate a weighted average ensemble for classification compared to base model
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, balanced_accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import VotingClassifier
from lightgbm import LGBMClassifier
from sklearn.ensemble import RandomForestClassifier

# get a list of base models as (name, estimator) pairs
def get_models():
	models = list()
	models.append(('lr', LogisticRegression()))
	models.append(('LGBM', LGBMClassifier(random_state=0, n_estimators=100)))
	models.append(('bayes', GaussianNB()))
	return models
 
# evaluate each base model
def evaluate_models(models, X_train, X_val, y_train, y_val):
	# fit and evaluate the models
	scores = list()
	for name, model in models:
		# fit the model
		if name == 'SuloClassifier':
			model.fit(pd.DataFrame(X_train), pd.Series(y_train))
		else:
			model.fit(X_train, y_train)
		# evaluate the model with balanced accuracy (mean of per-class recall)
		yhat = model.predict(X_val)
		acc = balanced_accuracy_score(y_val, yhat)
		# store the performance
		scores.append(acc)
	# return one score per model, in the same order as the input list
	return scores
 
# define dataset
#X, y = make_classification(n_samples=100000, n_features=50, n_informative=40, n_redundant=5, random_state=7)

# split dataset into train and test sets
X_train_full, X_test, y_train_full, y_test = train_test_split(X, y, test_size=0.50, random_state=1)
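
The evaluation above scores every model with `balanced_accuracy_score`, which averages the recall of each class so that rare failure types count as much as the dominant class. The following is a minimal pure-Python sketch of that idea, not scikit-learn's implementation; the toy labels (including the "Other Failure" class name) are made up for illustration.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall: every class counts equally, however rare."""
    hits = defaultdict(int)    # correct predictions per true class
    totals = defaultdict(int)  # number of samples per true class
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        if t == p:
            hits[t] += 1
    recalls = [hits[c] / totals[c] for c in totals]
    return sum(recalls) / len(recalls)


def plain_accuracy(y_true, y_pred):
    """Fraction of samples predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy, made-up labels: 8 majority-class samples and 2 minority-class samples
y_true = ["No Failure"] * 8 + ["Other Failure"] * 2
# A model that always predicts the majority class
y_pred = ["No Failure"] * 10

print(plain_accuracy(y_true, y_pred))     # 0.8
print(balanced_accuracy(y_true, y_pred))  # 0.5
```

A model that ignores the minority class still looks good on plain accuracy but is penalised on balanced accuracy, which is one plausible reason a weak model can score very low on this metric when the target is imbalanced.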

In [6]:
# split the full train set into train and validation sets
X_train, X_val, y_train, y_val = train_test_split(X_train_full, y_train_full, test_size=0.33, random_state=1)
X_train, y_train = lazy.fit_transform(X_train, y_train)
X_val = lazy.transform(X_val)
if transform_target:
    y_val = lazy.yformer.transform(y_val)
# create the base models
models = get_models()
rfc = RandomForestClassifier(random_state=0, n_estimators=100)
lgbm = LGBMClassifier(random_state=0, n_estimators=100)
sulo = SuloClassifier(base_estimator=lgbm, n_estimators=5, pipeline=False, weights=False, imbalanced=False, verbose=0)
models.append(('SuloClassifier',sulo))
# fit and evaluate each model
scores = evaluate_models(models, X_train, X_val, y_train, y_val)
for i in range(len(models)):
	print('EvalScore for %s: %.3f' % (models[i][0], scores[i]*100))
# refit the transformer on the full training set and transform the test set
X_train_full, y_train_full = lazy.fit_transform(X_train_full, y_train_full)
X_test = lazy.transform(X_test)
if transform_target:
    y_test = lazy.yformer.transform(y_test)
# evaluate each standalone model
# create the base models
models = get_models()
rfc = RandomForestClassifier(random_state=0, n_estimators=100)
lgbm = LGBMClassifier(random_state=0, n_estimators=100)
sulo = SuloClassifier(base_estimator=lgbm, n_estimators=5, pipeline=False, weights=False, imbalanced=False, verbose=0)
models.append(('SuloClassifier',sulo))
scores = evaluate_models(models, X_train_full, X_test, y_train_full, y_test)
for i in range(len(models)):
	print('>>%s: %.3f' % (models[i][0], scores[i]*100))
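# evaluate an equal-weight soft-voting ensemble of the three base models (SuloClassifier is excluded)
ensemble = VotingClassifier(estimators=models[:3], voting='soft')
ensemble.fit(X_train_full, y_train_full)
yhat = ensemble.predict(X_test)
# note: the printed label says "Accuracy", but the metric is balanced accuracy
score = balanced_accuracy_score(y_test, yhat)
print('>>Voting Accuracy: %.3f' % (score*100))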

    Single_Label Multi_Classification problem 
Shape of dataset: (3350, 9). Now we classify variables into different types...
Time taken to define data pipeline = 1 second(s)
No model input given...
Lazy Transformer Pipeline created...
    transformed target from object type to numeric
    Time taken to fit dataset = 1 second(s)
    Time taken to transform dataset = 1 second(s)
    Shape of transformed dataset: (3350, 11)
    Time taken to transform dataset = 1 second(s)
    Shape of transformed dataset: (1650, 11)
Input data shapes: X = (3350, 11)
    y shape = (3350,)
No GPU available on this device. Using CPU for lightgbm and others.
    Number of estimators used in SuloClassifier = 5


k-fold training:   0%|                                                          | 0/5 [00:00<?, ?it/s]

No HPT tuning performed since base estimator is given by input...


k-fold training: 100%|██████████████████████████████████████████████████| 5/5 [00:02<00:00,  1.86it/s]


Final balanced Accuracy of 5-estimator SuloClassifier: 83.7%
Time Taken: 3 (seconds)
EvalScore for lr: 21.191
EvalScore for LGBM: 71.058
EvalScore for bayes: 74.012
EvalScore for SuloClassifier: 69.931
    Single_Label Multi_Classification problem 
Shape of dataset: (5000, 9). Now we classify variables into different types...
Time taken to define data pipeline = 1 second(s)
No model input given...
Lazy Transformer Pipeline created...
    transformed target from object type to numeric
    Time taken to fit dataset = 1 second(s)
    Time taken to transform dataset = 1 second(s)
    Shape of transformed dataset: (5000, 11)
    Time taken to transform dataset = 1 second(s)
    Shape of transformed dataset: (5000, 11)
Input data shapes: X = (5000, 11)
    y shape = (5000,)
No GPU available on this device. Using CPU for lightgbm and others.
    Number of estimators used in SuloClassifier = 5


k-fold training:   0%|                                                          | 0/5 [00:00<?, ?it/s]

No HPT tuning performed since base estimator is given by input...


k-fold training: 100%|██████████████████████████████████████████████████| 5/5 [00:03<00:00,  1.59it/s]


Final balanced Accuracy of 5-estimator SuloClassifier: 76.6%
Time Taken: 3 (seconds)
>>lr: 21.082
>>LGBM: 75.696
>>bayes: 67.636
>>SuloClassifier: 76.293
>>Voting Accuracy: 75.954
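
The Voting score above comes from `VotingClassifier(voting='soft')`, which averages the class probabilities predicted by each base model and picks the class with the highest average. Here is a minimal sketch of that rule for a single sample, assuming made-up probabilities from three hypothetical models; the optional `weights` argument illustrates the weighted-average variant discussed in the referenced blog post.

```python
def soft_vote(prob_rows, weights=None):
    """Average per-model class probabilities for one sample.

    prob_rows: one list of class probabilities per model.
    Returns (index of the winning class, averaged probabilities).
    """
    if weights is None:
        weights = [1.0] * len(prob_rows)  # equal weighting
    total = sum(weights)
    n_classes = len(prob_rows[0])
    avg = [
        sum(w * row[c] for w, row in zip(weights, prob_rows)) / total
        for c in range(n_classes)
    ]
    best = max(range(n_classes), key=lambda c: avg[c])
    return best, avg


# Made-up probabilities from three hypothetical models, three classes
probs = [
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.1, 0.6, 0.3],
]

label, avg = soft_vote(probs)
print(label)  # 1

# Trusting the first model three times as much flips the decision
weighted_label, _ = soft_vote(probs, weights=[3, 1, 1])
print(weighted_label)  # 0
```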


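# On this multi-class dataset, SuloClassifier beat Logistic Regression, LGBM, Naive Bayes and the Voting ensemble on test-set balanced accuracy

The margins are small against the strongest baselines: 76.3% for SuloClassifier versus 75.7% for LGBM and 76.0% for Voting, with Naive Bayes at 67.6% and Logistic Regression at 21.1%. On the smaller validation split, LGBM (71.1%) and Naive Bayes (74.0%) scored higher than SuloClassifier (69.9%).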

In [7]:
sulo

## Tips for using Sulo for High Performance:
1. First try it with base_estimator=None and all other params set to either None or False.
2. Then set weights=True, then imbalanced=True, and compare the results.
3. If one of the other models comes close to beating Sulo, pass that model as base_estimator while leaving all the other params above untouched.
4. Next, change n_estimators from its default of None to 5 and compare.
5. Then increase n_estimators to 7 and 10.
6. Increasing the number of estimators generally boosts performance up to a point, after which it drops off. Keep increasing until you reach that point.
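
The tips above can be sketched as an ordered list of parameter settings to try. This is only an illustration: `sulo_param_schedule`, `best_config` and `dummy_score` are hypothetical helpers, the sketch assumes each tip builds on the previous one, and the dummy scorer stands in for fitting the classifier and measuring balanced accuracy on a validation split.

```python
def sulo_param_schedule(candidate_estimator=None, estimator_counts=(5, 7, 10)):
    """Yield keyword-argument dicts in the order the tips suggest (assumed cumulative)."""
    params = {"base_estimator": None, "n_estimators": None,
              "weights": False, "imbalanced": False}
    yield dict(params)                      # tip 1: everything None / False
    for flag in ("weights", "imbalanced"):  # tip 2: switch the flags on one at a time
        params[flag] = True
        yield dict(params)
    if candidate_estimator is not None:     # tip 3: plug in the closest competitor
        params["base_estimator"] = candidate_estimator
        yield dict(params)
    for n in estimator_counts:              # tips 4 and 5: grow n_estimators
        params["n_estimators"] = n
        yield dict(params)


def best_config(schedule, score_fn):
    """Return the config with the highest score under the supplied scorer."""
    return max(schedule, key=score_fn)


def dummy_score(params):
    """Stand-in scorer for demonstration only; it simply peaks at n_estimators=7."""
    n = params["n_estimators"] or 0
    return -abs(n - 7)


schedule = list(sulo_param_schedule(candidate_estimator="my_model"))
for params in schedule:
    print(params)
print(best_config(schedule, dummy_score)["n_estimators"])  # 7
```

In a real run, `score_fn` would build the classifier from each dict, fit it on the training split, and return balanced accuracy on the validation split, as in the cells above. To follow tip 6, extend `estimator_counts` until the score stops improving.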
