<h1 style="font-size:300%; font-family:monospace; background:#3CB371; color:white; text-align:center; border:10px solid ; padding:25px;"> Introduction to PyCaret</h1>

<img src="https://pycaret.org/wp-content/uploads/2021/02/pycaret2.3.png" width="450px"> <img src="https://i1.wp.com/pycaret.org/wp-content/uploads/2020/04/thumbnail.png?fit=1166%2C656&ssl=1" width="500px">



<img align=center src="https://cdn.dribbble.com/users/18013/screenshots/12600021/media/3cb1d96666688e41589a638d48cd4674.png" width="500px">

<cite>Image from www.dribbble.com by Vitaliy Sokovikov</cite>

<li style="font-size:120%; font-family:monospace"> PyCaret is a low-code machine learning library in Python that allows you to go from preparing your data to deploying your model in less time, so you can spend time doing something else while waiting for your model training 😎
</li>


<li style="font-size:120%; font-family:monospace"> You can read the documentation of PyCaret at https://pycaret.org or https://pycaret.readthedocs.io/en/latest/
</li>

<li style="font-size:120%; font-family:monospace"> A similar low-code ML library, LazyPredict, is documented at https://lazypredict.readthedocs.io/en/latest/
</li>

# Intro to PyCaret

<h1 style="font-size:210%; font-family:monospace; background:#3CB371; color:white; text-align:center; border:10px solid ; padding:25px;"> Let's turn on the accelerator in the Kaggle kernel and begin with PyCaret! 🎢 </h1>

<div class="alert alert-block alert-info">  📌 First, we will be using PyCaret on the diabetes dataset for classification</div>

<h1 style="font-size:210%; font-family:monospace; background:#3CB371; color:white; text-align:center; border:10px solid ; padding:25px;"> PyCaret Installation ✨</h1>

In [None]:
# Install PyCaret
!pip install pycaret -q

<div class="alert alert-block alert-info">  📌 We have two choices: import the dataset from Kaggle, or load the copy bundled with PyCaret (the two are the same except for the column names)</div>

In [None]:
# importing dataset
import pandas as pd
df = pd.read_csv('../input/pima-indians-diabetes-database/diabetes.csv')
df

In [None]:
# Importing dataset from PyCaret
from pycaret.datasets import get_data
diabetes = get_data('diabetes')

# Classification

<h1 style="font-size:210%; font-family:monospace; background:#3CB371; color:white; text-align:center; border:10px solid ; padding:25px;"> importing classification module and initializing setup 🐱‍👤</h1>

In [None]:
# import a classification module of PyCaret
from pycaret.classification import *

<div class="alert alert-block alert-info">  📌 Run the command below and the module will automatically preprocess the data and then create a dataframe</div>


<li style="font-size:100%; font-family:monospace"><b></b>You can pass the whole dataframe to the "data" parameter and name the target variable in "target"
</li>
<li style="font-size:100%; font-family:monospace"><b></b>Setting silent = True skips the pop-up prompt that asks you to confirm the inferred data types (this parameter exists in PyCaret 2.x and was removed in PyCaret 3.0)</li>

In [None]:
clf = setup(data = df, target = 'Outcome', silent = True, session_id = 123)

<div class="alert alert-block alert-info">  You can see that the setup process preprocesses the data and creates a train/test split automatically for us</div>
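
To make the train/test split and the role of `session_id` more concrete, here is a minimal plain-Python sketch. It is not PyCaret's implementation: the helper `train_test_split_rows` and the toy rows are hypothetical, and the 70/30 ratio is assumed to match PyCaret's default `train_size`. The point is that a fixed seed makes the shuffle, and therefore the split, reproducible.

```python
import random

def train_test_split_rows(rows, train_size=0.7, seed=123):
    """Shuffle rows with a fixed seed and split them into train and test parts."""
    rng = random.Random(seed)          # the fixed seed plays the role of session_id
    shuffled = list(rows)              # copy so the caller's data is left untouched
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_size)
    return shuffled[:cut], shuffled[cut:]

rows = list(range(10))                 # hypothetical stand-in for 10 data rows
train, test = train_test_split_rows(rows)
print(len(train), len(test))
```

Calling the helper again with the same seed returns exactly the same split, which is why the notebook passes a fixed `session_id` to `setup`.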

In [None]:
# run compare_models and save top 5 models based on 'Accuracy'
top_five = compare_models(n_select = 5)
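
Conceptually, `compare_models` scores every candidate with cross-validation, sorts the results by the chosen metric, and `n_select` keeps the first few. The sketch below shows only that ranking step in plain Python. The per-fold accuracies are made-up numbers and `top_n` is a hypothetical helper, not a PyCaret function.

```python
from statistics import mean

# Hypothetical per-fold accuracies for three model IDs (made-up numbers).
fold_scores = {
    "lr": [0.76, 0.78, 0.74],
    "rf": [0.75, 0.77, 0.79],
    "dt": [0.70, 0.69, 0.72],
}

def top_n(fold_scores, n):
    """Rank models by mean fold score, highest first, and keep the first n."""
    ranked = sorted(fold_scores, key=lambda name: mean(fold_scores[name]), reverse=True)
    return ranked[:n]

best_two = top_n(fold_scores, n=2)
print(best_two)
```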

<h1 style="font-size:210%; font-family:monospace; background:#3CB371; color:white; text-align:center; border:10px solid ; padding:25px;">Create the Best-Performing Model (results can differ between runs)</h1>

In [None]:
catboost = create_model('catboost')   

# Tuning the model

In [None]:
# tune the best performance model
tuned_catboost = tune_model(catboost)
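
As far as the PyCaret documentation describes it, `tune_model` defaults to a random search over a predefined grid with a small number of iterations. The sketch below illustrates random search in general, not PyCaret's code: the search space, the `toy_objective` function (a stand-in for a cross-validated score), and the helper `random_search` are all hypothetical.

```python
import random

def random_search(objective, space, n_iter=10, seed=123):
    """Try n_iter random parameter combinations and return the best one found."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_iter):
        # draw one value per hyperparameter, independently and at random
        params = {name: rng.choice(values) for name, values in space.items()}
        score = objective(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Hypothetical search space for two hyperparameters.
space = {"depth": [2, 4, 6, 8], "learning_rate": [0.01, 0.05, 0.1, 0.3]}

def toy_objective(params):
    # Illustrative score that peaks at depth=6, learning_rate=0.1.
    return -abs(params["depth"] - 6) - abs(params["learning_rate"] - 0.1)

best_params, best_score = random_search(toy_objective, space)
print(best_params, best_score)
```

Because only a handful of combinations are sampled, the tuned model is not guaranteed to beat the untuned one, so it is worth comparing the two score tables.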

In [None]:
# tune multiple models dynamically (tuning top 5 model)
tuned_top5 = [tune_model(i) for i in top_five]

# Hyperparameters of each model in tuned_top5

In [None]:
tuned_top5

# Visualize model performance
<div class="alert alert-block alert-info">  PyCaret uses high-level library Yellowbrick for creating these visualizations.</div>

In [None]:
tuned_model_1 = tuned_top5[0] 
tuned_model_2 = tuned_top5[1]
tuned_model_3 = tuned_top5[2] 
tuned_model_4 = tuned_top5[3]
tuned_model_5 = tuned_top5[4]

In [None]:
evaluate_model(tuned_catboost)

In [None]:
evaluate_model(tuned_model_2)

In [None]:
evaluate_model(tuned_model_3)

In [None]:
evaluate_model(tuned_model_4)

In [None]:
evaluate_model(tuned_model_5)

#### Isn't that amazing? It's just one line of code!

# Interpret the model 
The interpret_model function supports tree-based models for binary classification: lightgbm, catboost, et, xgboost, rf, dt.

In [None]:
interpret_model(tuned_catboost)

# AutoML
The following function returns the best model out of all models created in the current active environment, based on the metric defined in the optimize parameter. Run this code at the end of your script.

In [None]:
# select the best model by F1 score
automl_model = automl(optimize = 'f1')
automl_model
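
Since the selection above optimizes F1, it helps to recall how that metric is computed: it is the harmonic mean of precision and recall. Here is a small plain-Python sketch for the binary case. The labels are made-up and `f1_score_binary` is a hypothetical helper, not part of PyCaret.

```python
def f1_score_binary(y_true, y_pred):
    """F1 for binary labels, where 1 is the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0                     # no true positives means precision/recall are 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Made-up labels: 3 true positives, 1 false positive, 1 false negative.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
f1 = f1_score_binary(y_true, y_pred)
print(f1)
```

F1 is a reasonable choice for this dataset because the two outcome classes are not equally frequent, so accuracy alone can look better than the model really is.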

# Prediction

The predict_model function allows us to predict data from the experiment or new unseen data. 

In [None]:
prediction = predict_model(automl_model)

In [None]:
prediction.head()

# Blending Model
This function automatically creates a voting classifier from the models we pass into the "estimator_list" parameter

In [None]:
# specify the model in "estimator_list" parameter
blended_top5 = blend_models(estimator_list = tuned_top5) 
blended_top5
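
To show what a voting classifier does, here is a minimal sketch of hard (majority) voting in plain Python. This is an illustration under assumptions: PyCaret may use soft voting on predicted probabilities instead, depending on the `method` setting and the models involved, and the predictions below are made-up.

```python
from collections import Counter

def hard_vote(predictions_per_model):
    """Return the majority label for each sample across all models."""
    voted = []
    for sample_preds in zip(*predictions_per_model):   # one tuple of votes per sample
        label, _ = Counter(sample_preds).most_common(1)[0]
        voted.append(label)
    return voted

# Made-up predictions from three models on four samples.
model_predictions = [
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 0],
]
blended = hard_vote(model_predictions)
print(blended)
```

Using an odd number of models avoids ties in the binary case.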

# Ensemble Model

In [None]:
# ensemble top 5 tuned models
bagged_top5 = [ensemble_model(i) for i in tuned_top5]
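
By default `ensemble_model` wraps the estimator in a bagging ensemble. The sketch below illustrates the idea of bagging (bootstrap aggregating) with a deliberately tiny base learner, a one-dimensional threshold "stump". Everything here is hypothetical and simplified: the data, the helpers, and the choice of 11 estimators are assumptions for illustration, not PyCaret's implementation.

```python
import random
from collections import Counter

def fit_stump(samples):
    """Pick the threshold on x that misclassifies the fewest (x, label) samples."""
    best_threshold, best_errors = None, None
    for threshold, _ in samples:
        errors = sum(1 for x, y in samples if (x > threshold) != bool(y))
        if best_errors is None or errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold

def bagging_fit(samples, n_estimators=11, seed=123):
    """Train one stump per bootstrap resample of the data."""
    rng = random.Random(seed)
    stumps = []
    for _ in range(n_estimators):
        bootstrap = [rng.choice(samples) for _ in samples]   # sample with replacement
        stumps.append(fit_stump(bootstrap))
    return stumps

def bagging_predict(stumps, x):
    """Aggregate the stumps' predictions by majority vote."""
    votes = Counter(int(x > threshold) for threshold in stumps)
    return votes.most_common(1)[0][0]

# Made-up 1-D data: small x values belong to class 0, large ones to class 1.
data = [(1.0, 0), (2.0, 0), (3.0, 0), (7.0, 1), (8.0, 1), (9.0, 1)]
stumps = bagging_fit(data)
print(bagging_predict(stumps, 0.5), bagging_predict(stumps, 9.5))
```

Each stump sees a slightly different resample, so averaging their votes tends to reduce the variance of a single unstable learner.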

# Saving and loading the Model
The following function saves any model you want. After running it, you can find the file in the output folder on the right.

In [None]:
# specify which model you want to save in the first parameter, name in the second
save_model(automl_model, model_name='automl-model')

In [None]:
# load model
loaded_model = load_model('automl-model')
print(loaded_model)
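
The save/load round trip can be sketched with the standard library alone. This is an analogy, not what PyCaret runs internally: PyCaret serializes the whole preprocessing pipeline together with the model, while the sketch below pickles a hypothetical dictionary that stands in for a trained model.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for a trained model object.
model = {"name": "toy-model", "threshold": 0.5}

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "toy-model.pkl")
    with open(path, "wb") as f:
        pickle.dump(model, f)          # write the object to disk
    with open(path, "rb") as f:
        restored = pickle.load(f)      # read it back into memory

print(restored == model)
```

As with any pickle-based format, only load model files that come from a source you trust.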

# Bonus section on Classification (applying feature engineering) 
You can see how this feature engineering is done here: https://www.kaggle.com/vincentlugat/pima-indians-diabetes-eda-prediction-0-906



In [None]:
# Data preprocessing
import numpy as np

def replace_missing(data):
    # a zero in these columns is treated as a missing measurement
    cols = ['Glucose','BloodPressure','SkinThickness','Insulin','BMI']
    data[cols] = data[cols].replace(0, np.nan)
    return data

def median_target(data,var):   
    temp = data[data[var].notnull()]
    temp = temp[[var, 'Outcome']].groupby(['Outcome'])[[var]].median().reset_index()
    return temp

def replace_median(data):
    null_columns = ['BloodPressure', 'BMI', 'SkinThickness', 'Glucose', 'Insulin']
    for i in null_columns:
        f = median_target(data, i)
        data.loc[(data['Outcome'] == 0 ) & (data[i].isnull()), i] = f[[i]].values[0][0]
        data.loc[(data['Outcome'] == 1 ) & (data[i].isnull()), i] = f[[i]].values[1][0]
    return data

def feature_engineering(data):
    data.loc[:,'N1']=0
    data.loc[(data['Age']<=30) & (data['Glucose']<=120),'N1']=1

    data.loc[:,'N2']=0
    data.loc[(data['BMI']<=30),'N2']=1

    data.loc[:,'N3']=0
    data.loc[(data['Age']<=30) & (data['Pregnancies']<=6),'N3']=1

    data.loc[:,'N3_1']=0
    data.loc[(data['Glucose']<=110) & (data['Pregnancies']<=5),'N3_1']=1

    data.loc[:,'N4']=0
    data.loc[(data['Glucose']<=105) & (data['BloodPressure']<=80),'N4']=1

    data.loc[:,'N4_1']=0
    data.loc[(data['Age']<=30) & (data['Pregnancies']<=6),'N4_1']=1

    data.loc[:,'N5']=0
    data.loc[(data['SkinThickness']<=20) ,'N5']=1

    data.loc[:,'N6']=0
    data.loc[(data['BMI']<30) & (data['SkinThickness']<=20),'N6']=1

    data.loc[:,'N7']=0
    data.loc[(data['Glucose']<=105) & (data['BMI']<=30),'N7']=1

    data.loc[:,'N7_1']=0
    data.loc[(data['BMI']<30) & (data['SkinThickness']<=20),'N7_1']=1

    data.loc[:,'N9']=0
    data.loc[(data['Insulin']<200),'N9']=1

    data.loc[:,'N10']=0
    data.loc[(data['BloodPressure']<80),'N10']=1

    data.loc[:,'N11']=0
    data.loc[(data['Pregnancies']<4) & (data['Pregnancies']!=0) ,'N11']=1

    # highly correlate data

    data['N0'] = data['BMI'] * data['SkinThickness']

    data['N8'] =  data['Pregnancies'] / data['Age']

    data['N13'] = data['Glucose'] / data['DiabetesPedigreeFunction']

    data['N12'] = data['Age'] * data['DiabetesPedigreeFunction']

    data['N14'] = data['Age'] / data['Insulin']

    data['N15'] = data['BMI'] / data['Insulin']
    return data

def prepare_data(data):
    cat_cols   = data.nunique()[data.nunique() < 12].keys().tolist()
    cat_cols   = [x for x in cat_cols]
    #numerical columns
    num_cols   = [x for x in data.columns if x not in cat_cols]
    #Binary columns with 2 values
    bin_cols   = data.nunique()[data.nunique() == 2].keys().tolist()
    #Columns more than 2 values
    multi_cols = [i for i in cat_cols if i not in bin_cols]

    #Label encoding Binary columns
    le = LabelEncoder()
    for i in bin_cols :
        data[i] = le.fit_transform(data[i])
        
    #Duplicating columns for multi value columns
    data = pd.get_dummies(data = data,columns = multi_cols )

    #Scaling Numerical columns
    std = StandardScaler()
    scaled = std.fit_transform(data[num_cols])
    scaled = pd.DataFrame(scaled,columns=num_cols)

    #dropping original values merging scaled values for numerical columns
    df_data_og = data.copy()
    data = data.drop(columns = num_cols,axis = 1)
    data = data.merge(scaled,left_index=True,right_index=True,how = "left")
    return data
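
The `replace_median` helper above fills each missing value with the median of that column among rows sharing the same `Outcome`. The same idea can be sketched without pandas on a few made-up rows. The helper `impute_by_group_median` and the toy data are hypothetical and only meant to show the mechanics.

```python
from statistics import median

def impute_by_group_median(rows, column, group_key):
    """Fill None values in `column` with the median of that column within each group."""
    medians = {}
    for group in {row[group_key] for row in rows}:
        known = [row[column] for row in rows
                 if row[group_key] == group and row[column] is not None]
        medians[group] = median(known)
    # build new row dicts so the input rows are not modified
    return [
        {**row, column: medians[row[group_key]]} if row[column] is None else dict(row)
        for row in rows
    ]

# Made-up rows: one missing Glucose value per Outcome group.
rows = [
    {"Glucose": 90, "Outcome": 0},
    {"Glucose": 100, "Outcome": 0},
    {"Glucose": None, "Outcome": 0},
    {"Glucose": 150, "Outcome": 1},
    {"Glucose": 170, "Outcome": 1},
    {"Glucose": None, "Outcome": 1},
]
filled = impute_by_group_median(rows, "Glucose", "Outcome")
print(filled[2]["Glucose"], filled[5]["Glucose"])
```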

In [None]:
from sklearn.preprocessing import StandardScaler, LabelEncoder
df_rms = replace_missing(df)
df_rm = replace_median(df_rms)
df_fe = feature_engineering(df_rm)
prepared_data = prepare_data(df_fe)

In [None]:
prepared_data

In [None]:
clf_new = setup(data = prepared_data, target = 'Outcome', silent = True, session_id = 125)

In [None]:
# run compare_models and save top 5 models
top_five = compare_models(n_select = 5)

After applying feature engineering, we can see a big improvement in each model's scores.

# Regression

In [None]:
car_data = pd.read_csv('../input/car-price-prediction/CarPrice_Assignment.csv')
car_data

In [None]:
from pycaret.regression import *

In [None]:
rgs = setup(data = car_data,  target = 'price', silent = True, session_id = 124)

In [None]:
# run compare_models and save top 5 models 
top_five_r = compare_models(n_select = 5)

In [None]:
catboost = create_model('catboost')   

In [None]:
# tune the best performance model
tuned_catboost = tune_model(catboost)

In [None]:
evaluate_model(tuned_catboost)

In [None]:
interpret_model(tuned_catboost)

In [None]:
# select the best model by MAE (mean absolute error)
automl_model = automl(optimize = 'MAE')
automl_model
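
For reference, MAE (mean absolute error) is the average of the absolute differences between actual and predicted values, so lower is better. Here is a minimal sketch with made-up prices. The helper `mean_absolute_error` is written out by hand for illustration and is not a PyCaret call.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Made-up car prices and predictions.
actual = [10000.0, 15000.0, 20000.0]
predicted = [11000.0, 14500.0, 20500.0]
mae = mean_absolute_error(actual, predicted)
print(mae)
```

MAE is expressed in the same unit as the target, so here it reads directly as the typical pricing error.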

The predict_model function below produces predictions for the hold-out set that setup reserved for validating the model. It also displays performance statistics for those predictions and returns them as a dataframe.

In [None]:
pred_holdouts = predict_model(automl_model)
pred_holdouts.head()

We can also produce predictions on the entire dataset

In [None]:
new_data = car_data.copy()
new_data.drop(['price'], axis=1, inplace=True)
predictions = predict_model(automl_model, data=new_data)
predictions.head()

<h1 style="font-size:210%; font-family:monospace; background:#3CB371; color:white; text-align:center; border:10px solid ; padding:25px;"> Thank you for reading, and there is more to come! If you found this helpful, please upvote 😊 </h1>
