# Automated Machine Learning 

### What?

Process of automating traditional Machine Learning tasks. \
Covers the complete ML pipeline from the raw dataset to model deployment.

### Why?

Low code \
Automates repetitive tasks \
Time saving \
Better results \
Organized way of execution \
Speeds up the development activity

### How?

Automated Data Preparation and Feature Engineering \
Automated Model Selection and Hyperparameter Optimization \
Neural Architecture Search (NAS: using neural nets to design neural nets), which selects appropriate layers and learning rates \
Critical analysis of evaluation metrics.
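
To make "automated model selection and hyperparameter optimization" concrete, here is a minimal plain-Python sketch. It is not taken from any AutoML library: the toy data, the two candidate models (a majority-class baseline and a threshold rule) and all function names are assumptions made for illustration. The search tries every candidate and hyperparameter value, scores each on a validation set, and keeps the best.

```python
from collections import Counter

# Toy 1-D data (assumed for illustration): (feature value, binary label).
train = [(0.5, 0), (1.0, 0), (1.5, 0), (2.5, 1), (3.0, 1), (3.5, 1)]
valid = [(0.8, 0), (1.2, 0), (2.8, 1), (3.2, 1)]


def fit_majority(rows):
    """Baseline model: always predict the most common training label."""
    label = Counter(y for _, y in rows).most_common(1)[0][0]
    return lambda x: label


def make_threshold(cutoff):
    """Threshold rule: predict 1 when the feature exceeds `cutoff`."""
    return lambda x: int(x > cutoff)


def accuracy(model, rows):
    """Fraction of rows whose label the model predicts correctly."""
    return sum(model(x) == y for x, y in rows) / len(rows)


def search(train_rows, valid_rows, cutoffs=(1.0, 2.0, 3.0)):
    """Score every candidate on the validation set and return the best."""
    candidates = [("majority", None, fit_majority(train_rows))]
    for cutoff in cutoffs:  # the hyperparameter grid
        candidates.append(("threshold", cutoff, make_threshold(cutoff)))
    scored = [(accuracy(m, valid_rows), name, hp) for name, hp, m in candidates]
    return max(scored, key=lambda s: s[0])


best_score, best_name, best_hp = search(train, valid)
print(best_name, best_hp, best_score)
```

Real AutoML tools replace the exhaustive grid with smarter search strategies and use cross-validation instead of a single validation set, but the select-by-score loop is the same idea.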


### Implementations: (https://aimultiple.com/automl-software) 

#### Commercial Platforms:
Google AutoML \
Microsoft Azure AutoML \
DataRobot \
IBM Watson (AutoAI) \
AWS SageMaker \
Alteryx \
Oracle Accelerated Data Science (ADS) SDK

#### Open Source:
auto-sklearn \
auto-keras \
MLBox \
TPOT (Tree-based Pipeline Optimization Tool) \
H2O AutoML \
Auto-PyTorch \
PyCaret





## Use case: PyCaret (https://pycaret.org/)

Installing pycaret: https://insaid.medium.com/a-complete-guide-to-pycaret-c07b1e51f698

In [1]:
pip install pycaret

Note: you may need to restart the kernel to use updated packages.


'C:\Users\Sayandeep' is not recognized as an internal or external command,
operable program or batch file.


In [2]:
import pycaret

Importing data from the PyCaret repository.

In [3]:
from pycaret.datasets import get_data
nba = get_data('nba')

Unnamed: 0,Name,GP,MIN,PTS,FGM,FGA,FG%,3P Made,3PA,3P%,...,FTA,FT%,OREB,DREB,REB,AST,STL,BLK,TOV,TARGET_5Yrs
0,Brandon Ingram,36,27.4,7.4,2.6,7.6,34.7,0.5,2.1,25.0,...,2.3,69.9,0.7,3.4,4.1,1.9,0.4,0.4,1.3,0
1,Andrew Harrison,35,26.9,7.2,2.0,6.7,29.6,0.7,2.8,23.5,...,3.4,76.5,0.5,2.0,2.4,3.7,1.1,0.5,1.6,0
2,JaKarr Sampson,74,15.3,5.2,2.0,4.7,42.2,0.4,1.7,24.4,...,1.3,67.0,0.5,1.7,2.2,1.0,0.5,0.3,1.0,0
3,Malik Sealy,58,11.6,5.7,2.3,5.5,42.6,0.1,0.5,22.6,...,1.3,68.9,1.0,0.9,1.9,0.8,0.6,0.1,1.0,1
4,Matt Geiger,48,11.5,4.5,1.6,3.0,52.4,0.0,0.1,0.0,...,1.9,67.4,1.0,1.5,2.5,0.3,0.3,0.4,0.8,1


Modules in PyCaret \
Classification: from pycaret.classification import * \
Regression: from pycaret.regression import * \
Clustering: from pycaret.clustering import * \
Anomaly Detection: from pycaret.anomaly import * \
Natural Language Processing: from pycaret.nlp import * \
Association Rule Mining: from pycaret.arules import *

Setting up the environment in PyCaret

The setup() function initializes the environment in PyCaret and creates the transformation pipeline that prepares the data for modeling and deployment. setup() must be called before executing any other function in PyCaret.

When setup() is executed, PyCaret automatically infers the data type of every feature from its properties. The inference is usually correct but not always, so PyCaret displays a table of the features and their inferred data types after setup() runs. If all data types are correct, press Enter to continue; otherwise type quit to end the experiment. Getting the data types right is fundamental because PyCaret automatically performs several pre-processing tasks that are essential to any machine learning experiment, and these tasks are performed differently for each data type.
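
The idea of inferring a type per feature and then letting the user correct it can be sketched in plain Python. This is a hypothetical heuristic for illustration only, not PyCaret's actual inference algorithm; `infer_type`, `infer_schema` and the `max_categories` cut-off are assumed names. The sample values are the first four rows of the nba table shown earlier.

```python
def infer_type(values, max_categories=3):
    """Guess 'numeric' or 'categorical' for one column (hypothetical heuristic)."""
    present = [v for v in values if v is not None]
    try:
        numbers = [float(v) for v in present]
    except (TypeError, ValueError):
        return "categorical"  # at least one value is not a number
    # Very few distinct numeric values often indicate an encoded category.
    return "categorical" if len(set(numbers)) <= max_categories else "numeric"


def infer_schema(columns, overrides=None):
    """Infer a type for every column, then apply explicit user overrides."""
    schema = {name: infer_type(vals) for name, vals in columns.items()}
    schema.update(overrides or {})
    return schema


columns = {
    "Name": ["Brandon Ingram", "Andrew Harrison", "JaKarr Sampson", "Malik Sealy"],
    "GP": [36, 35, 74, 58],
    "PTS": [7.4, 7.2, 5.2, 5.7],
    "TARGET_5Yrs": [0, 0, 0, 1],
}

print(infer_schema(columns))
# An explicit override wins over the guess, much like passing a feature list to setup().
print(infer_schema(columns, overrides={"GP": "categorical"}))
```

Any heuristic of this kind can guess wrong, which is why reviewing the inferred types before continuing matters.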

The setup() function automatically performs data pre-processing and data sampling in the background. It runs with default parameters, but these can be changed as required.

We use the “nba” dataset, whose target variable is “TARGET_5Yrs”. This is a binary classification problem, so we import the classification module.

(pycaret datasets: https://pycaret.org/get-data/)

from pycaret.classification import *

pycar_test = setup(nba, target = 'TARGET_5Yrs')

![](Processing.jpg)

The preprocessing features that PyCaret handles:

— Data Preparation \

PyCaret automatically detects the data type of each feature in the dataset. The detected types can sometimes be wrong; they can be overridden by passing the following parameters to setup(). \
Parameters: \
numeric_features = ['column_name'] \
categorical_features = ['column_name'] or date_features = 'date_column_name' \
ignore_features = ['column_name']

Example: if “GP” is categorical but PyCaret interprets it as numerical, the inferred type can be overridden.

In [4]:
from pycaret.classification import *
pycar = setup(nba, target = 'TARGET_5Yrs', categorical_features = ['GP'])

Unnamed: 0,Description,Value
0,session_id,7065
1,Target,TARGET_5Yrs
2,Target Type,Binary
3,Label Encoded,"0: 0, 1: 1"
4,Original Data,"(1340, 21)"
5,Missing Values,True
6,Numeric Features,18
7,Categorical Features,2
8,Ordinal Features,False
9,High Cardinality Features,False


The dataset is now set up for classification, and the output shows the message “setup successfully completed!”




— Sampling and Split \

(i) Train/test split: by default, 70% of the data goes to the training set and 30% to the test set. These proportions can be changed by passing the 'train_size' parameter to setup().

In [5]:
from pycaret.classification import *
reg1 = setup(data = nba, target = 'TARGET_5Yrs', train_size = 0.6)

Unnamed: 0,Description,Value
0,session_id,556
1,Target,TARGET_5Yrs
2,Target Type,Binary
3,Label Encoded,"0: 0, 1: 1"
4,Original Data,"(1340, 21)"
5,Missing Values,True
6,Numeric Features,19
7,Categorical Features,1
8,Ordinal Features,False
9,High Cardinality Features,False
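
What a `train_size` fraction does to the row counts can be sketched in plain Python. This is an illustrative stand-in, not PyCaret's implementation: the helper name `train_test_split` and the use of `round` are assumptions, and libraries differ in how they round the cut point.

```python
import random


def train_test_split(rows, train_size=0.7, seed=0):
    """Shuffle a copy of `rows` and cut it at the `train_size` fraction."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    # round() avoids floating-point surprises such as 1340 * 0.7 landing just below 938.
    cut = round(len(shuffled) * train_size)
    return shuffled[:cut], shuffled[cut:]


rows = list(range(1340))  # stand-in for the 1340 rows of the nba dataset

train, test = train_test_split(rows)  # default 70/30 split
print(len(train), len(test))

train, test = train_test_split(rows, train_size=0.6)  # as in the cell above
print(len(train), len(test))
```

Fixing the seed makes the split reproducible, which is the role `session_id` plays in the setup() output.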


(ii) Sampling: If the dataset is large, i.e. it exceeds 25,000 samples, PyCaret samples it automatically. A base estimator is built on several sample sizes, and a plot shows the performance metrics for each sample size. The desired sample size can then be entered in the text box. The 'sampling' parameter is a Boolean and its default value is True. \
This functionality is only available in the pycaret.classification and pycaret.regression modules.
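
The "base estimator on various sample sizes" idea can be sketched without PyCaret. Everything below is assumed for illustration: the synthetic data, the nearest-centroid base estimator, the fractions, and the function names. It prints the numbers that a sampling plot would show, one line per sample size.

```python
import random


def make_data(n, seed=0):
    """Synthetic 1-D binary data (assumed): class 1 is shifted to the right."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        label = rng.randint(0, 1)
        rows.append((rng.gauss(2.0 * label, 1.0), label))
    return rows


def fit_centroids(rows):
    """Base estimator: remember the mean feature value of each class."""
    means = {}
    for label in (0, 1):
        values = [x for x, y in rows if y == label]
        means[label] = sum(values) / len(values)
    return means


def accuracy(means, rows):
    """Predict the class whose centroid is closest, then score the predictions."""
    hits = sum(min(means, key=lambda c: abs(x - means[c])) == y for x, y in rows)
    return hits / len(rows)


def sample_curve(train, holdout, fractions=(0.1, 0.25, 0.5, 1.0)):
    """Score the base estimator on growing samples of the training data."""
    curve = []
    for frac in fractions:
        subset = train[: round(len(train) * frac)]
        curve.append((frac, len(subset), accuracy(fit_centroids(subset), holdout)))
    return curve


data = make_data(30000)  # above the 25,000-sample threshold mentioned above
train, holdout = data[:21000], data[21000:]
for frac, n, acc in sample_curve(train, holdout):
    print(f"fraction={frac:<4} rows={n:<6} accuracy={acc:.3f}")
```

If the score barely changes between a small and a large fraction, training on the smaller sample saves time at little cost in accuracy.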

   

In [13]:
# Importing dataset
from pycaret.datasets import get_data
bank = get_data('bank')
# Importing module and initializing setup
from pycaret.regression import *
reg1 = setup(data = bank, target = 'deposit')

Unnamed: 0,Description,Value
0,session_id,6423
1,Target,deposit
2,Original Data,"(45211, 17)"
3,Missing Values,False
4,Numeric Features,7
5,Categorical Features,9
6,Ordinal Features,False
7,High Cardinality Features,False
8,High Cardinality Method,
9,Transformed Train Set,"(31647, 48)"


In [7]:
from pycaret.datasets import get_data
income = get_data('income')

Unnamed: 0,age,workclass,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income >50K
0,39,State-gov,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,0
1,50,Self-emp-not-inc,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,0
2,38,Private,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,0
3,53,Private,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,0
4,28,Private,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,0


In [8]:
income.shape

(32561, 14)

In [9]:
from pycaret.classification import *
model = setup(data = income, target = 'income >50K')

Unnamed: 0,Description,Value
0,session_id,7784
1,Target,income >50K
2,Target Type,Binary
3,Label Encoded,"0: 0, 1: 1"
4,Original Data,"(32561, 14)"
5,Missing Values,True
6,Numeric Features,4
7,Categorical Features,9
8,Ordinal Features,False
9,High Cardinality Features,False




(i) Missing value imputation (Doesn't work): PyCaret imputes missing values automatically. \
    Parameters: \
    numeric_imputation: string, default = 'mean' \
    categorical_imputation: string, default = 'constant'
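
The two default strategies named above, mean for numeric features and a constant for categorical features, can be sketched in plain Python. This is a conceptual illustration, not PyCaret's code: the helper names and the fill value "not_available" are assumptions. The numeric values are the ALK PHOSPHATE entries from the first five rows of the hepatitis table below.

```python
from statistics import mean


def impute_mean(values):
    """Replace missing (None) numeric entries with the mean of the observed ones."""
    fill = mean(v for v in values if v is not None)
    return [fill if v is None else v for v in values]


def impute_constant(values, fill="not_available"):
    """Replace missing (None) categorical entries with one constant label."""
    return [fill if v is None else v for v in values]


alk_phosphate = [85.0, 135.0, 96.0, 46.0, None]  # last value is missing
print(impute_mean(alk_phosphate))

answers = ["yes", None, "no"]  # hypothetical categorical column
print(impute_constant(answers))
```

In a real pipeline the fill value should be computed on the training split only and then reused on the test split, so that no information leaks from test data.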
    

In [10]:
from pycaret.datasets import get_data
hepatitis = get_data('hepatitis')

Unnamed: 0,Class,AGE,SEX,STEROID,ANTIVIRALS,FATIGUE,MALAISE,ANOREXIA,LIVER BIG,LIVER FIRM,SPLEEN PALPABLE,SPIDERS,ASCITES,VARICES,BILIRUBIN,ALK PHOSPHATE,SGOT,ALBUMIN,PROTIME,HISTOLOGY
0,0,30,2,1.0,2,2,2,2,1.0,2.0,2.0,2.0,2.0,2.0,1.0,85.0,18.0,4.0,,1
1,0,50,1,1.0,2,1,2,2,1.0,2.0,2.0,2.0,2.0,2.0,0.9,135.0,42.0,3.5,,1
2,0,78,1,2.0,2,1,2,2,2.0,2.0,2.0,2.0,2.0,2.0,0.7,96.0,32.0,4.0,,1
3,0,31,1,,1,2,2,2,2.0,2.0,2.0,2.0,2.0,2.0,0.7,46.0,52.0,4.0,80.0,1
4,0,34,1,2.0,2,2,2,2,2.0,2.0,2.0,2.0,2.0,2.0,1.0,,200.0,4.0,,1


In [11]:
from pycaret.classification import *
clf1 = setup(data = hepatitis, target = 'Class')

clf1[0]

Unnamed: 0,Description,Value
0,session_id,940
1,Target,Class
2,Target Type,Binary
3,Label Encoded,"0: 0, 1: 1"
4,Original Data,"(154, 20)"
5,Missing Values,True
6,Numeric Features,6
7,Categorical Features,13
8,Ordinal Features,False
9,High Cardinality Features,False


False

(ii) Changing data types: PyCaret auto-detects the data types of features; these can be changed as required.