<a href="https://colab.research.google.com/github/sidharth178/AutoML/blob/master/Auto_Sklearn.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


<h1 align="center"><b>Auto-Sklearn</b></h1>


# 1. **Overview**
- **Auto-Sklearn** is an open-source Python library for AutoML. It uses the popular Scikit-Learn library for data transforms and machine learning algorithms, and a Bayesian Optimization search procedure to efficiently discover a top-performing model pipeline for a given dataset.
- **auto-sklearn is based on defining AutoML as a CASH problem.**

- **CASH** = Combined Algorithm Selection and Hyperparameter optimization. Put simply, the goal is to find the best ML model and its hyperparameters for a dataset within a vast search space that spans many classifiers and many hyperparameters.
- The figure below shows a representation of auto-sklearn provided by its authors.

 

<p align="center">
  <img src="https://machinelearningmastery.com/wp-content/uploads/2020/03/Overview-of-the-Auto-Sklearn-System.png"  alt="Auto-Sklearn diagram"/>
</p>
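
To make the CASH idea concrete, here is a minimal sketch of a joint search over (algorithm, hyperparameters) pairs. It is only an illustration: the search space and the scoring function `toy_validation_score` are made up, and plain random search stands in for the Bayesian optimization that auto-sklearn actually uses.

```python
import random

# Hypothetical toy search space: each algorithm brings its own hyperparameters.
SEARCH_SPACE = {
    "random_forest": {"n_estimators": [50, 100, 200], "max_depth": [3, 5, 10]},
    "sgd": {"alpha": [1e-4, 1e-3, 1e-2]},
}


def toy_validation_score(algorithm, params):
    """Made-up stand-in for 'train this pipeline and score it on validation data'."""
    if algorithm == "random_forest":
        return 0.70 + 0.0005 * params["n_estimators"] - 0.005 * abs(params["max_depth"] - 5)
    return 0.75 - 10 * params["alpha"]


def sample_configuration(rng):
    """Draw one (algorithm, hyperparameters) pair from the joint space."""
    algorithm = rng.choice(sorted(SEARCH_SPACE))
    params = {name: rng.choice(values) for name, values in SEARCH_SPACE[algorithm].items()}
    return algorithm, params


def random_search(n_trials, seed=0):
    """Keep the best-scoring configuration seen over n_trials random draws."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_trials):
        algorithm, params = sample_configuration(rng)
        score = toy_validation_score(algorithm, params)
        if best is None or score > best[0]:
            best = (score, algorithm, params)
    return best


best_score, best_algorithm, best_params = random_search(n_trials=50)
print(best_algorithm, best_params, round(best_score, 3))
```

The point is that the algorithm choice and its hyperparameters are sampled and scored together, as one configuration, rather than tuned in two separate steps.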

### **Important Parameters**
- **load_models :** [default value: True] -> Whether to load the models after fitting.
- **time_left_for_this_task :** [3600 sec] -> Total time budget, in seconds, for the search. A larger budget increases the chance of finding a better model.
- **n_jobs :** [1] -> Number of parallel jobs. Set it to the number of cores in your system to use all of them.
- **ensemble_size, initial_configurations_via_metalearning :** [50, 25] -> By default, the search builds an ensemble of the top-performing models it discovers and warm-starts from 25 configurations suggested by meta-learning. This can sometimes lead to **overfitting**; to disable both, set “ensemble_size” to **1** and “initial_configurations_via_metalearning” to **0**. The *initial_configurations_via_metalearning* parameter is not available in auto-sklearn V2.
- **ensemble_nbest :** [50] -> Number of best models considered when building the ensemble. Only has an effect when ensemble_size is greater than one.
- **include_estimators :** -> Restricts the search to the listed estimators. When it is None, all estimators are used. Not available in auto-sklearn V2.
- **exclude_estimators :** -> Removes the listed estimators from the search space. Not available in auto-sklearn V2.
- **metric :** -> If you don’t define a metric, one is selected based on the task. By default, the regressor optimizes the R^2 metric.
- **resampling_strategy :** [holdout] -> How candidate models are validated during the search. In auto-sklearn V1, I could not get good results without setting resampling_strategy explicitly (this notebook uses cv); auto-sklearn V2 chooses it automatically.
- **sprint_statistics() :** -> A method of the fitted model (not a constructor parameter) that summarizes the search and the performance of the final model.


**NOTE:** In this notebook Auto-Sklearn is given numerical values only, so string-valued columns are encoded before training (see section 4.2).
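
To give a feel for what ensemble_size and ensemble_nbest control, here is a simplified sketch of greedy ensemble selection: keep the best `ensemble_nbest` models as candidates, then add one model at a time (with replacement) for `ensemble_size` rounds, each time picking the one that most improves the averaged validation prediction. This is an illustration under assumptions, not auto-sklearn's actual implementation; the model names and probabilities are made up, and accuracy is used as the score for brevity.

```python
def accuracy(y_true, probs):
    """Fraction of correct labels when thresholding probabilities at 0.5."""
    hits = sum((p >= 0.5) == bool(t) for t, p in zip(y_true, probs))
    return hits / len(y_true)


def average(prob_lists):
    """Element-wise mean of several probability vectors."""
    return [sum(column) / len(column) for column in zip(*prob_lists)]


def greedy_ensemble(y_true, model_probs, ensemble_size, ensemble_nbest):
    # Keep only the ensemble_nbest individually strongest models as candidates.
    ranked = sorted(model_probs, key=lambda m: accuracy(y_true, model_probs[m]), reverse=True)
    candidates = ranked[:ensemble_nbest]
    chosen = []
    for _ in range(ensemble_size):
        # Add (with replacement) the candidate that gives the best averaged prediction.
        best = max(
            candidates,
            key=lambda m: accuracy(y_true, average([model_probs[c] for c in chosen + [m]])),
        )
        chosen.append(best)
    return chosen


# Hypothetical validation labels and per-model predicted probabilities.
y_val = [1, 0, 1, 1, 0, 0]
val_probs = {
    "model_a": [0.9, 0.2, 0.4, 0.8, 0.3, 0.6],
    "model_b": [0.6, 0.4, 0.9, 0.3, 0.1, 0.2],
    "model_c": [0.2, 0.7, 0.6, 0.6, 0.8, 0.4],
}

single = greedy_ensemble(y_val, val_probs, ensemble_size=1, ensemble_nbest=3)
trio = greedy_ensemble(y_val, val_probs, ensemble_size=3, ensemble_nbest=3)
print(single, accuracy(y_val, average([val_probs[m] for m in single])))
print(trio, accuracy(y_val, average([val_probs[m] for m in trio])))
```

With ensemble_size set to 1, the "ensemble" is just the single best model, which is why that setting effectively disables ensembling.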









### **2. Install Auto-Sklearn**

In [None]:
!pip install auto-sklearn

### **3. Import Auto-Sklearn**

In [None]:
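import autosklearn
import autosklearn.metrics  # used below as autosklearn.metrics.roc_auc / mean_absolute_error
print(autosklearn.__version__)
# Note: if importing autosklearn raises an error, restart the runtime and run this cell again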

# import necessary libraries for project
import pandas as pd
import numpy as np
from sklearn.preprocessing import OrdinalEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.metrics import mean_absolute_error

from autosklearn.regression import AutoSklearnRegressor
from autosklearn.classification import AutoSklearnClassifier

## **4.1.  Auto-sklearn for classification**

In [36]:
# import dataset
churn_df = pd.read_csv("/content/churn_data_st.csv")
churn_df.head()

| Unnamed: 0 | customerID | gender | SeniorCitizen | tenure | ServiceCount | Contract | PaperlessBilling | MonthlyCharges | TotalCharges | Churn |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7590-VHVEG | Female | 0 | 1 | 2 | Month-to-month | Yes | 29.85 | 29.85 | No |
| 1 | 5575-GNVDE | Male | 0 | 34 | 4 | One year | No | 56.95 | 1889.5 | No |
| 2 | 3668-QPYBK | Male | 0 | 2 | 4 | Month-to-month | Yes | 53.85 | 108.15 | Yes |
| 3 | 7795-CFOCW | Male | 0 | 45 | 4 | One year | No | 42.3 | 1840.75 | No |
| 4 | 9237-HQITU | Female | 0 | 2 | 2 | Month-to-month | Yes | 70.7 | 151.65 | Yes |


In [None]:
# shape of dataset
churn_df.shape

### **4.2. Data Pre-Processing**

In [None]:
# fill missing values with 0 and drop the identifier column
churn_df.fillna(0, inplace=True)
churn_df.drop(columns=["customerID"], inplace=True)
col_name = churn_df.columns

# convert every column to integer category codes
# (OrdinalEncoder is applied to all columns here, so numeric columns are turned into ranks too)
df = OrdinalEncoder().fit_transform(churn_df)
churn_df_trans = pd.DataFrame(df, columns=col_name).astype(int)

# separate the features from the target
X = churn_df_trans.drop(columns=["Churn"])
y = churn_df_trans["Churn"]

# split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)

In [None]:
X_train.shape
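
To see what the encoding step above does to a column, here is a small pure-Python sketch of ordinal encoding: each distinct value is replaced by its position in the sorted list of categories. The helper `ordinal_encode` is hypothetical and only mimics the idea behind `OrdinalEncoder`; the sample values are taken from the first rows of the dataset.

```python
def ordinal_encode(values):
    """Map each distinct value to its index in the sorted list of categories."""
    categories = sorted(set(values))
    lookup = {category: index for index, category in enumerate(categories)}
    return [lookup[value] for value in values], categories


contract = ["Month-to-month", "One year", "Month-to-month", "One year", "Month-to-month"]
codes, categories = ordinal_encode(contract)
print(codes)       # [0, 1, 0, 1, 0]
print(categories)  # ['Month-to-month', 'One year']

# Numeric columns are affected too: values become ranks, not magnitudes.
monthly_charges = [29.85, 56.95, 53.85, 42.3, 70.7]
charge_codes, _ = ordinal_encode(monthly_charges)
print(charge_codes)  # [0, 3, 2, 1, 4]
```

Because the numeric columns lose their original scale this way, it may be worth encoding only the string-valued columns in a real project.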

### **4.3. Model Building**

In [None]:
# define the model
automl = AutoSklearnClassifier(
    time_left_for_this_task=2*60,     # total search budget in seconds
    per_run_time_limit=30,            # time limit in seconds for a single model fit
    metric=autosklearn.metrics.roc_auc,
    resampling_strategy='cv',
    resampling_strategy_arguments={'folds': 5},
    ensemble_size=1,                  # use the single best model, no ensemble
    initial_configurations_via_metalearning=0,
    # include_estimators=["random_forest", "sgd"],
    # exclude_estimators=None,
    # n_jobs=4,
)

# automl = AutoSklearnClassifier()

In [37]:
# train the model (fit returns the estimator itself, so model and automl are the same object)
model = automl.fit(X_train, y_train)

### **4.4. Show Statistics**

In [38]:
# summarize
print(model.sprint_statistics())

auto-sklearn results:
  Dataset name: cfece43a-3a8e-11ec-8097-0242ac1c0002
  Metric: roc_auc
  Best validation score: 0.824683
  Number of target algorithm runs: 8
  Number of successful target algorithm runs: 6
  Number of crashed target algorithm runs: 0
  Number of target algorithms that exceeded the time limit: 2
  Number of target algorithms that exceeded the memory limit: 0
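
The report above is plain text, so it can be handy to turn it into a dictionary when comparing runs. The sketch below parses the exact output shown above; `parse_statistics` is a hypothetical helper, and it assumes the report keeps this "key: value" layout.

```python
def parse_statistics(report):
    """Turn the 'key: value' lines of a sprint_statistics() report into a dict."""
    stats = {}
    for line in report.splitlines():
        key, separator, value = line.strip().partition(": ")
        if separator:  # skip the title line, which has no 'key: value' pair
            stats[key] = value
    return stats


report = """auto-sklearn results:
  Dataset name: cfece43a-3a8e-11ec-8097-0242ac1c0002
  Metric: roc_auc
  Best validation score: 0.824683
  Number of target algorithm runs: 8
  Number of successful target algorithm runs: 6
  Number of crashed target algorithm runs: 0
  Number of target algorithms that exceeded the time limit: 2
  Number of target algorithms that exceeded the memory limit: 0
"""

stats = parse_statistics(report)
runs = int(stats["Number of target algorithm runs"])
timed_out = int(stats["Number of target algorithms that exceeded the time limit"])
print(f"{timed_out}/{runs} runs exceeded the time limit")
```

Here 2 of the 8 runs ran out of time, which suggests that a larger per_run_time_limit (or a larger overall budget) might let more candidates finish.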



In [39]:
# evaluate best model
y_hat = model.predict(X_test)
acc = accuracy_score(y_test, y_hat)
print("Accuracy: %.3f" % acc)

Accuracy: 0.788
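
Note that the search above optimized roc_auc, while this cell reports accuracy on hard labels, so the two numbers are not directly comparable. The toy sketch below (made-up labels and scores, pure Python) shows how the two metrics can disagree: ROC AUC only looks at how well the scores rank positives above negatives, while accuracy depends on the 0.5 threshold.

```python
def roc_auc(y_true, scores):
    """Probability that a random positive is scored above a random negative (ties count 0.5)."""
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


def accuracy(y_true, scores, threshold=0.5):
    """Fraction of correct labels after thresholding the scores."""
    return sum((s >= threshold) == (t == 1) for t, s in zip(y_true, scores)) / len(y_true)


y_true = [0, 0, 0, 1, 1]
scores = [0.1, 0.2, 0.3, 0.35, 0.45]  # perfectly ranked, but every score is below 0.5

print(roc_auc(y_true, scores))   # 1.0
print(accuracy(y_true, scores))  # 0.6
```

For the model in this notebook, a test-set ROC AUC could presumably be obtained by passing `model.predict_proba(X_test)[:, 1]` to `sklearn.metrics.roc_auc_score`, assuming column 1 corresponds to the encoded "Yes" class.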


- After fitting the auto-sklearn model, you can inspect the best-performing pipelines with PipelineProfiler. To do that, run the following code:

In [None]:
!pip install PipelineProfiler

In [None]:
import PipelineProfiler
# model is the fitted AutoSklearnClassifier returned by automl.fit() above
profiler_data = PipelineProfiler.import_autosklearn(model)
PipelineProfiler.plot_pipeline_matrix(profiler_data)

## **5.1.  Auto-sklearn for Regression**

In [None]:
# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/auto-insurance.csv'
dataframe = pd.read_csv(url, header=None)
# split into input and output elements
data = dataframe.values
X1, y1 = data[:, :-1], data[:, -1]

# split the dataset
X_train1, X_test1, y_train1, y_test1 = train_test_split(X1, y1, test_size=0.30, random_state=42)

### **5.2. Model Building**

In [None]:
# define the model
automl2 = AutoSklearnRegressor(
    time_left_for_this_task=60,       # total search budget in seconds
    per_run_time_limit=30,            # time limit in seconds for a single model fit
    metric=autosklearn.metrics.mean_absolute_error,
    resampling_strategy='cv',
    resampling_strategy_arguments={'folds': 5},
    ensemble_size=1,                  # use the single best model, no ensemble
    initial_configurations_via_metalearning=0,
    # include_estimators=["random_forest", "sgd"],
    # exclude_estimators=None,
    # n_jobs=4,
)

In [40]:
# perform the search
automl2.fit(X_train1, y_train1)



AutoSklearnRegressor(ensemble_size=1, initial_configurations_via_metalearning=0,
                     metric=mean_absolute_error, per_run_time_limit=30,
                     resampling_strategy='cv',
                     resampling_strategy_arguments={'folds': 5},
                     time_left_for_this_task=60)

### **5.3. Show Statistics**

In [41]:
# summarize
print(automl2.sprint_statistics())

auto-sklearn results:
  Dataset name: 14632be4-3a8f-11ec-8097-0242ac1c0002
  Metric: mean_absolute_error
  Best validation score: 33.934434
  Number of target algorithm runs: 8
  Number of successful target algorithm runs: 7
  Number of crashed target algorithm runs: 1
  Number of target algorithms that exceeded the time limit: 0
  Number of target algorithms that exceeded the memory limit: 0
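
The validation score above comes from the search itself; the classification section also checked the model on the held-out test split, and the same can be done here. Below is a minimal sketch of such a check. The helpers `mae` and `evaluate_mae` are hypothetical, and `ConstantModel` is only a stand-in so the snippet runs on its own; in this notebook you would call `evaluate_mae(automl2, X_test1, y_test1)` instead.

```python
def mae(y_true, y_pred):
    """Mean absolute error between two equally long sequences."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def evaluate_mae(model, X, y):
    """Score any object with a predict(X) method on held-out data."""
    return mae(y, model.predict(X))


class ConstantModel:
    """Stand-in for a fitted regressor: always predicts the same value."""

    def __init__(self, value):
        self.value = value

    def predict(self, X):
        return [self.value for _ in X]


# Toy held-out data, made up for illustration.
X_holdout = [[1], [2], [3]]
y_holdout = [8.0, 10.0, 15.0]

score = evaluate_mae(ConstantModel(10.0), X_holdout, y_holdout)
print("MAE: %.3f" % score)
```

The `mean_absolute_error` function imported from sklearn.metrics at the top of the notebook computes the same quantity and can be used directly on `y_test1` and `automl2.predict(X_test1)`.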

