## Automated ML Pipeline Generator using TPOT in Python
+ Genetic Algorithms: search methods based on natural selection (survival of the fittest); TPOT uses this idea to evolve scikit-learn pipelines

#### Steps of Genetic Algorithms
+ Selection: score every candidate and keep the fittest
+ Crossover: combine pairs of the fittest candidates to breed a new generation
+ Mutation: apply small random changes to the offspring, then repeat the cycle until the fittest candidate stops improving or the generation limit is reached
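
The three steps above can be sketched with a toy genetic algorithm. This is only an illustration, not TPOT's implementation: the bit-string candidates, the `fitness` objective (count the 1s) and every parameter value are assumptions made for the example.

```python
import random


def fitness(bits):
    # Toy objective (an assumption for illustration): count the 1s.
    return sum(bits)


def evolve(n_bits=20, pop_size=30, generations=40, mutation_rate=0.05, seed=0):
    rng = random.Random(seed)
    # Start from a random population of bit strings.
    population = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    for _ in range(generations):
        # Selection: keep the fitter half of the population as parents.
        population.sort(key=fitness, reverse=True)
        parents = population[: pop_size // 2]
        children = []
        while len(parents) + len(children) < pop_size:
            mum, dad = rng.sample(parents, 2)
            # Crossover: splice two parents at a random cut point.
            cut = rng.randint(1, n_bits - 1)
            child = mum[:cut] + dad[cut:]
            # Mutation: flip each bit with a small probability.
            child = [1 - b if rng.random() < mutation_rate else b for b in child]
            children.append(child)
        population = parents + children
    return max(population, key=fitness)


best = evolve()
print(fitness(best), best)
```

Because the parents survive into the next generation unchanged, the best fitness in this sketch never decreases from one generation to the next.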

### Pkgs
+ pip install tpot

#### Dependencies
+ scikit-learn and NumPy

#### NB
+ Remove missing values before fitting
+ Features and labels must be numeric: encode categorical values as numbers
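
A minimal sketch of those two preparation steps with pandas. The toy frame and its column names (`age`, `city`, `bought`) are hypothetical and are not part of the datasets used in this notebook.

```python
import pandas as pd

# Hypothetical toy frame with one missing value and one text column.
raw = pd.DataFrame({
    "age": [25, 32, None, 41],
    "city": ["Accra", "Lagos", "Accra", "Nairobi"],
    "bought": [0, 1, 0, 1],
})

# Remove rows that contain missing values.
clean = raw.dropna().reset_index(drop=True)

# Encode the text column as integer codes (in order of first appearance).
codes, categories = pd.factorize(clean["city"])
clean["city"] = codes

print(clean)
print(list(categories))
```

Integer codes imply an ordering that may not exist in the data; one-hot encoding with `pd.get_dummies` is a common alternative when that matters.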





In [0]:
data_url = "https://raw.githubusercontent.com/Jcharis/Machine-Learning-Web-Apps/master/Iris-Species-Predictor-ML-Flask-App-With-Materialize.css/data/iris.csv"
data_url2 = "https://raw.githubusercontent.com/Jcharis/Python-Machine-Learning/master/Predicting_Contraceptives_MethodChoice_n_Usage_with_ML_and_TPOT/data_cmc/cmc.data"

In [0]:
# Load ML Pkgs
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB

In [0]:
# Load EDA pkg
import pandas as pd 
import numpy as np

In [0]:
# Load Our dataset
df = pd.read_csv(data_url)

In [23]:
df.head()

   sepal_length  sepal_width  petal_length  petal_width species
0           5.1          3.5           1.4          0.2  setosa
1           4.9          3.0           1.4          0.2  setosa
2           4.7          3.2           1.3          0.2  setosa
3           4.6          3.1           1.5          0.2  setosa
4           5.0          3.6           1.4          0.2  setosa


In [24]:
# Checking for missing
df.isnull().sum()

sepal_length    0
sepal_width     0
petal_length    0
petal_width     0
species         0
dtype: int64

In [25]:
df.columns

Index(['sepal_length', 'sepal_width', 'petal_length', 'petal_width',
       'species'],
      dtype='object')

In [0]:
# Species to Numerical
d = {value:index for index,value in enumerate(df['species'].unique())}

In [28]:
d

{'setosa': 0, 'versicolor': 1, 'virginica': 2}

In [0]:
df['new_label'] = df['species'].map(d)

In [30]:
df.head()

   sepal_length  sepal_width  petal_length  petal_width species  new_label
0           5.1          3.5           1.4          0.2  setosa          0
1           4.9          3.0           1.4          0.2  setosa          0
2           4.7          3.2           1.3          0.2  setosa          0
3           4.6          3.1           1.5          0.2  setosa          0
4           5.0          3.6           1.4          0.2  setosa          0


In [0]:
xfeatures = df[['sepal_length', 'sepal_width', 'petal_length', 'petal_width']]
ylabels = df['new_label']

In [0]:

from sklearn.model_selection import cross_val_score

In [33]:
# Individual algorithm: logistic regression, 10-fold cross-validation
cv_scores = cross_val_score(LogisticRegression(), xfeatures, ylabels, cv=10)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression

In [34]:
cv_scores

array([1.        , 0.93333333, 1.        , 1.        , 0.93333333,
       0.93333333, 0.93333333, 1.        , 1.        , 1.        ])

In [35]:
print(np.mean(cv_scores))

0.9733333333333334
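
The convergence warning printed by the logistic regression run suggests raising `max_iter` or scaling the data. One way to do both is sketched below. As assumptions, scikit-learn's bundled `load_iris` stands in for `xfeatures`/`ylabels` so the block runs on its own, and `max_iter=1000` is an arbitrary generous limit.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: the bundled iris data stands in for xfeatures / ylabels.
X, y = load_iris(return_X_y=True)

# Scaling inside the pipeline means each CV fold is scaled using only
# its own training portion, so no information leaks from the held-out part.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_cv_scores = cross_val_score(model, X, y, cv=10)
print(scaled_cv_scores.mean())
```

The score may differ slightly from the unscaled run above; the point of the change is a solver that converges, not a higher accuracy.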


In [0]:
# Individual algorithm: random forest with default settings, 10-fold cross-validation
rf_cv_scores = cross_val_score(RandomForestClassifier(), xfeatures, ylabels, cv=10)

In [40]:
rf_cv_scores

array([1.        , 0.93333333, 1.        , 0.93333333, 0.93333333,
       0.93333333, 0.93333333, 1.        , 1.        , 1.        ])

In [37]:
print(np.mean(rf_cv_scores))

0.9666666666666666


In [0]:
# Individual algorithm: random forest with 100 shallow trees (max_depth=2), 10-fold cross-validation
rf_cv_scores2 = cross_val_score(RandomForestClassifier(n_estimators=100, max_depth=2), xfeatures, ylabels, cv=10)

In [39]:
rf_cv_scores2

array([0.93333333, 0.93333333, 1.        , 0.93333333, 0.93333333,
       0.93333333, 0.86666667, 1.        , 1.        , 1.        ])

In [41]:
print(np.mean(rf_cv_scores2))

0.9533333333333334


### AutoML with TPOT

In [42]:
!pip install tpot

Collecting tpot
  Downloading https://files.pythonhosted.org/packages/37/d8/719024ea20497eb6566ed5cc070e66e8c1e125e0e5d9966837cd00a3a83d/TPOT-0.11.2-py3-none-any.whl (76kB)
Collecting deap>=1.2
  Downloading https://files.pythonhosted.org/packages/0a/eb/2bd0a32e3ce757fb26264765abbaedd6d4d3640d90219a513aeabd08ee2b/deap-1.3.1-cp36-cp36m-manylinux2010_x86_64.whl (157kB)

In [0]:
import tpot

In [44]:
# List the names (classes and submodules) exposed by the tpot package
dir(tpot)

['TPOTClassifier',
 'TPOTRegressor',
 '__builtins__',
 '__cached__',
 '__doc__',
 '__file__',
 '__loader__',
 '__name__',
 '__package__',
 '__path__',
 '__spec__',
 '__version__',
 '_version',
 'base',
 'builtins',
 'config',
 'decorators',
 'driver',
 'export_utils',
 'gp_deap',
 'gp_types',
 'main',
 'metrics',
 'operator_utils',
 'tpot']

In [0]:
# Split in train and test
x_train,x_test,y_train,y_test = train_test_split(xfeatures,ylabels,test_size=0.3,random_state=42)

In [0]:
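# Init: import the classifier, then set up a 5-generation search with progress output
# Note: the variable name `tpot` shadows the imported `tpot` module from here on
from tpot import TPOTClassifier
tpot = TPOTClassifier(generations=5,verbosity=2)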

In [51]:
# Fit data
tpot.fit(x_train,y_train)

Generation 1 - Current best internal CV score: 0.9714285714285713
Generation 2 - Current best internal CV score: 0.9714285714285715
Generation 3 - Current best internal CV score: 0.9714285714285715
Generation 4 - Current best internal CV score: 0.9714285714285715
Generation 5 - Current best internal CV score: 0.9714285714285715

Best pipeline: KNeighborsClassifier(Nystroem(input_matrix, gamma=0.25, kernel=cosine, n_components=4), n_neighbors=20, p=1, weights=distance)


TPOTClassifier(config_dict=None, crossover_rate=0.1, cv=5,
               disable_update_check=False, early_stop=None, generations=5,
               log_file=<ipykernel.iostream.OutStream object at 0x7f05559644a8>,
               max_eval_time_mins=5, max_time_mins=None, memory=None,
               mutation_rate=0.9, n_jobs=1, offspring_size=None,
               periodic_checkpoint_folder=None, population_size=100,
               random_state=None, scoring=None, subsample=1.0, template=None,
               use_dask=False, verbosity=2, warm_start=False)

In [52]:
tpot.score(x_test,y_test)

1.0

In [0]:
# Export the best pipeline found as a standalone Python script
tpot.export('tpot_ml_pipeline.py')
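
The contents of the exported file are not shown in this notebook, so the block below is only a hand-written sketch of the "Best pipeline" line that TPOT printed, rebuilt with scikit-learn's `Nystroem` and `KNeighborsClassifier`. As assumptions, `load_iris` stands in for the CSV so the block runs on its own, and `random_state=42` on `Nystroem` is added purely for repeatability.

```python
from sklearn.datasets import load_iris
from sklearn.kernel_approximation import Nystroem
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# Assumption: the bundled iris data stands in for xfeatures / ylabels.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Hyperparameters copied from the "Best pipeline" line printed by TPOT.
# random_state on Nystroem is an added assumption, for repeatability only.
rebuilt = make_pipeline(
    Nystroem(gamma=0.25, kernel="cosine", n_components=4, random_state=42),
    KNeighborsClassifier(n_neighbors=20, p=1, weights="distance"),
)
rebuilt.fit(X_train, y_train)
accuracy = rebuilt.score(X_test, y_test)
print(accuracy)
```

The accuracy printed here need not match the TPOT score exactly, since the random components chosen by `Nystroem` can differ from those in the original run.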

In [0]:
# Predictions: one sample reshaped to a 2-D array (1 row, 4 features)
ex = np.array([6.2,3.4,5.4,2.3]).reshape(1,-1)

In [57]:
tpot.predict(ex)

array([2])

In [58]:
d

{'setosa': 0, 'versicolor': 1, 'virginica': 2}
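
To read the prediction as a species name, the label mapping can be inverted. A small sketch: `inverse_d` is a name introduced here for illustration, and the list `[2]` stands in for the array returned by `tpot.predict(ex)`.

```python
# Label mapping as built earlier in the notebook.
d = {'setosa': 0, 'versicolor': 1, 'virginica': 2}

# Invert it so integer codes map back to species names.
inverse_d = {code: name for name, code in d.items()}

predicted_codes = [2]  # stand-in for the output of tpot.predict(ex)
predicted_names = [inverse_d[int(code)] for code in predicted_codes]
print(predicted_names)  # ['virginica']
```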