In [None]:
import numpy as np

# Loading the Data

In [None]:
from pycaret.datasets import get_data
data_df = get_data('diamonds')
#check the shape of data
data_df.shape

To demonstrate the predict_model function on unseen data, a sample of 600 records (10% of the original dataset) is withheld and used only for predictions. This should not be confused with a train/test split: this particular split simulates a real-life scenario. Another way to think about it is that these 600 records were not available when the machine learning experiment was performed.

In [None]:
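# keep the full dataset under its own name before sampling
dataset = data_df.copy()
# 90% of the rows for modeling, the remaining 10% held back as unseen data
data_df = dataset.sample(frac=0.9, random_state=786)
data_unseen = dataset.drop(data_df.index)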

data_df.reset_index(drop=True, inplace=True)
data_unseen.reset_index(drop=True, inplace=True)

print('Data for Modeling: ' + str(data_df.shape))
print('Unseen Data For Predictions: ' + str(data_unseen.shape))
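
The logic of this split can be shown without pandas. The sketch below is a dependency-free stand-in, assuming a dataset of 6000 rows (implied by the 600-record hold-out); the names `all_rows`, `modeling_rows`, and `unseen_rows` are illustrative and not part of the notebook. It shows that a 90% sample and its complement are disjoint and together cover every row.

```python
import random

# Hypothetical stand-in for the row index of a 6000-row dataset.
all_rows = list(range(6000))

# Draw 90% of the rows without replacement, with a fixed seed for repeatability.
rng = random.Random(786)
modeling_rows = set(rng.sample(all_rows, k=len(all_rows) * 9 // 10))

# Everything not sampled becomes the unseen set (the analogue of DataFrame.drop).
unseen_rows = [i for i in all_rows if i not in modeling_rows]

print(len(modeling_rows), len(unseen_rows))
```

Because the two sets never overlap, any score computed on the unseen rows is not influenced by rows the model was trained on.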

# Exploratory Data Analysis
Let's do some quick visualization to assess the relationship of the independent features (weight, cut, color, clarity, etc.) with the target variable, Price.

In [None]:
# plot scatter carat_weight and Price
import plotly.express as px
fig = px.scatter(x=data_df['Carat Weight'], y=data_df['Price'], facet_col = data_df['Cut'], opacity = 0.25, template = 'plotly_dark', trendline='ols', trendline_color_override = 'red', title = 'SARAH GETS A DIAMOND - A CASE STUDY')
fig.show()

Next, let's check the distribution of the target variable.

In [None]:
# plot histogram
fig = px.histogram(data_df, x=["Price"], template = 'plotly_dark', title = 'Histogram of Price')
fig.show()

Notice that the distribution of Price is right-skewed. We can quickly check whether a log transformation makes Price approximately normal, which gives a fighting chance to algorithms that assume normality.

In [None]:
# create a copy of data
data_copy = data_df.copy()

# create a new feature Log_Price
data_copy['Log_Price'] = np.log(data_df['Price'])

# plot histogram
fig = px.histogram(data_copy, x=["Log_Price"], title = 'Histogram of Log Price', template = 'plotly_dark')
fig.show()
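
The effect of the log transformation can also be checked numerically instead of by eye. The sketch below is an illustration on synthetic data, not on the diamonds dataset: it assumes log-normally distributed "prices" and a hand-written `skewness` helper (both hypothetical), and compares the sample skewness before and after taking the log.

```python
import math
import random

def skewness(values):
    """Sample skewness: third central moment divided by the cubed standard deviation."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Synthetic right-skewed "prices" (an assumption, not the diamonds data).
rng = random.Random(0)
prices = [rng.lognormvariate(8.0, 0.8) for _ in range(5000)]
log_prices = [math.log(p) for p in prices]

raw_skew = skewness(prices)
log_skew = skewness(log_prices)
print(f"skewness before log: {raw_skew:.2f}, after log: {log_skew:.2f}")
```

A skewness close to zero after the transformation is what the roughly symmetric histogram of Log_Price suggests visually.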

# Data Preparation

# Setting up the Environment

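First, we need to set up the environment using the setup() function. This function initializes the experiment in PyCaret, handles various preprocessing tasks, and lets users customize the behavior of the machine learning pipeline through the parameters passed to it. The call below uses these parameters: target names the column to predict, transform_target applies a transformation to the target before fitting, log_experiment turns on experiment logging, and experiment_name labels the logged run.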

In [None]:
from pycaret.regression import *
# pass the modeling DataFrame created above (data_df), not an undefined name
s = setup(data_df, target = 'Price', transform_target = True, log_experiment = True, experiment_name = 'diamonds')
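
To make the idea behind transform_target = True concrete: the model is fitted on a transformed target, and predictions are mapped back to the original price scale. The sketch below is a minimal illustration under stated assumptions. It uses a plain log transform as a stand-in (the transformation PyCaret applies internally may differ), a hand-written one-feature least-squares fit, and made-up carat and price values; `fit_line` is a hypothetical helper, not a PyCaret function.

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares for a single feature; returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Hypothetical carat weights and prices that grow exponentially with weight.
carats = [0.5, 1.0, 1.5, 2.0, 2.5]
prices = [math.exp(7.0 + 1.2 * c) for c in carats]

# 1) transform the target, 2) fit on the transformed scale,
# 3) invert predictions back to the original price scale.
intercept, slope = fit_line(carats, [math.log(p) for p in prices])
predicted = [math.exp(intercept + slope * c) for c in carats]

max_rel_error = max(abs(p - q) / q for p, q in zip(predicted, prices))
print(f"largest relative error: {max_rel_error:.2e}")
```

A straight line cannot fit these prices on the raw scale, but it fits them almost exactly once the target is log-transformed, which is the motivation for transforming a skewed target.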

Now our environment is fully functional.

# Comparing all Regression models

With just a single line of code, you can run your training set on all of the available models in PyCaret. You can view the models available by typing:

In [None]:
# list all the available models
models()


In [None]:
#select the best one 
best = compare_models(exclude = ['ransac'])

The compare_models function returns the best-performing model based on the default sort order. It can also return a list of the top N models if you set the n_select parameter.
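
The difference between the default behavior and n_select can be sketched in plain Python. The example below is only an illustration: the scores are made up, R2 is assumed as the sort metric, and `top_models` is a hypothetical helper that mimics the selection step, not PyCaret's implementation.

```python
# Hypothetical cross-validated R2 scores; the numbers are made up for illustration.
cv_r2 = {"lightgbm": 0.97, "ada": 0.89, "dt": 0.94, "lr": 0.86}

def top_models(scores, n_select=1):
    """Return the best model name, or a list of the top n_select names."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[0] if n_select == 1 else ranked[:n_select]

print(top_models(cv_r2))              # single best model
print(top_models(cv_r2, n_select=3))  # list of the top three
```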

# Creating the model

create_model is the most granular function in PyCaret and is often the foundation behind most of the PyCaret functionalities. In this section we create the following three models:

    AdaBoost Regressor (‘ada’)
    Light Gradient Boosting Machine (‘lightgbm’)
    Decision Tree (‘dt’)
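
The scores that create_model prints come from k-fold cross-validation (10 folds by default in the PyCaret versions I am aware of). As a rough sketch of how such folds partition the training rows, the hypothetical helper `k_fold_indices` below splits row indices into contiguous, unshuffled folds; it is an illustration of the idea and not PyCaret's internal splitter.

```python
def k_fold_indices(n_rows, n_folds=10):
    """Yield (train_idx, valid_idx) pairs for contiguous, unshuffled folds."""
    base, extra = divmod(n_rows, n_folds)
    start = 0
    for fold in range(n_folds):
        # The first `extra` folds take one additional row each.
        size = base + (1 if fold < extra else 0)
        valid = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_rows))
        yield train, valid
        start += size

folds = list(k_fold_indices(n_rows=25, n_folds=10))
print([len(valid) for _, valid in folds])
```

Each row is used for validation exactly once, so the averaged score reflects performance on data the model did not see in that fold.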

# AdaBoost Regressor

In [None]:
ada = create_model('ada')

In [None]:
print(ada)

# Light Gradient Boosting Machine

In [None]:
lightgbm = create_model('lightgbm')

# Decision Tree

In [None]:
dt = create_model('dt')

# Model Evaluation

In [None]:
evaluate_model(best)

Here are a few more charts on the performance of our model:

In [None]:
plot_model(best, plot = 'learning')

In [None]:
plot_model(best, plot = 'error')

Here is an overview of possible graphics in PyCaret: Examples by module - Regression

# Saving image files
If you want to save the output graphics in PyCaret, set the save parameter to True. The syntax looks like this:

In [None]:
plot_model(best, plot = 'error', save = True)

# Model Optimization
I will now try to improve the performance of the models created above using different methods. At the end of the chapter I will create an overview of the performance values and, on that basis, select the final model.

# Tune the Model

AdaBoost Regressor

In [None]:
tuned_ada = tune_model(ada)

In [None]:
print(tuned_ada)

Light Gradient Boosting Machine

In [None]:
lgbm_params = {'num_leaves': np.arange(10,200,10),
               'max_depth': [int(x) for x in np.linspace(10, 110, num = 11)],
               'learning_rate': np.arange(0.1,1,0.1)
               }
tuned_lightgbm = tune_model(lightgbm, custom_grid = lgbm_params)
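
It helps to know how large this custom grid is. The sketch below rebuilds the same search space with plain Python ranges, counts every combination, and then draws a small random subset. The budget of 10 candidates is an assumption for illustration (tune_model's own iteration count and search strategy depend on its settings), and the names `grid`, `all_combinations`, and `candidates` are hypothetical.

```python
import random
from itertools import product

# The same search space as above, written with plain Python ranges.
grid = {
    "num_leaves": list(range(10, 200, 10)),                      # 10, 20, ..., 190
    "max_depth": list(range(10, 111, 10)),                       # 10, 20, ..., 110
    "learning_rate": [round(0.1 * k, 1) for k in range(1, 10)],  # 0.1, ..., 0.9
}

# Every combination an exhaustive grid search would have to fit.
all_combinations = list(product(*grid.values()))
print(len(all_combinations))  # 19 * 11 * 9 = 1881

# A random search evaluates only a handful of them; 10 is an assumed budget.
rng = random.Random(0)
candidates = [dict(zip(grid, combo)) for combo in rng.sample(all_combinations, k=10)]
print(candidates[0])
```

With 1881 combinations, each cross-validated over several folds, an exhaustive search would be slow, which is why sampling a subset of the grid is the practical choice here.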

In [None]:
print(tuned_lightgbm)

Decision Tree

In [None]:
tuned_dt = tune_model(dt)

# Plot a Model

Before finalizing the model, the plot_model() function can be used to evaluate the performance of the model across different aspects such as Residual Plots, Prediction Error, Feature Importance, etc.

There are over 10 plots available under plot_model(), which you can view by typing:

In [None]:
plot_model?

# Residual Plot

In [None]:
plot_model(tuned_lightgbm)

# Prediction Error Plot

In [None]:
plot_model(tuned_lightgbm, plot = 'error')

# Feature Importance Plot

In [None]:
plot_model(tuned_lightgbm, plot='feature')

Another nice way of analyzing the model is the evaluate_model() function, which creates an interactive dashboard with all the available plots to choose from. The user can select an option and view the plot of their choice; the plot_model() function is used internally. For example, here I’ve chosen the ‘Cooks Distance’ plot:

In [None]:
evaluate_model(tuned_lightgbm)

# Prediction on Test Data

Before finalizing the model, we perform one final check by predicting on the test/hold-out set and reviewing the evaluation metrics.

In [None]:
predict_model(tuned_lightgbm)

# Finalizing model for deployment

The finalize_model function refits the model on the complete dataset, including the test/hold-out sample (30% in this case). The purpose of this function is to train the model on all available data before it is deployed in production.

In [None]:
final_lightgbm = finalize_model(tuned_lightgbm)
print(final_lightgbm)

In [None]:
predict_model(final_lightgbm)

# Predicting on the Unseen Data

The code below adds the predicted value in a new column called “Label” at the end of the DataFrame.

In [None]:
unseen_predictions = predict_model(final_lightgbm, data=data_unseen)
unseen_predictions.head()

In [None]:
#check the metrics on this since you have an actual target column Price available.
from pycaret.utils import check_metric
check_metric(unseen_predictions.Price, unseen_predictions.Label, 'R2')
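
For reference, the R2 metric requested above is the coefficient of determination, 1 - SS_res / SS_tot. The sketch below computes it by hand on made-up prices and predictions; `r2_score` here is a small hand-written helper for illustration, not the function PyCaret calls.

```python
def r2_score(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot

# Made-up prices and predictions, purely for illustration.
actual = [1000.0, 2000.0, 3000.0, 4000.0]
predicted = [1100.0, 1900.0, 3200.0, 3900.0]

r2 = r2_score(actual, predicted)
print(round(r2, 3))  # 1 - 70000 / 5000000 = 0.986
```

A value of 1.0 means the predictions match the actual prices exactly, while a value near 0 means the model does no better than always predicting the mean price.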

# Saving the model
We have now finished the experiment and have used the stored model, final_lightgbm, to predict the unseen data. But what happens when we have new data to predict? Do we have to start from scratch and create a model again? The answer is no. PyCaret’s built-in save_model() function lets us save this trained model for future use.

In [None]:
save_model(final_lightgbm,'Final LightGBM Model 25Nov2020')

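To use the saved model at a later date, in the same or a different environment, first load it with load_model(). You can then use it to predict any new data with the same predict_model function.

In [None]:
# load the model saved above, then predict with it
saved_final_lightgbm = load_model('Final LightGBM Model 25Nov2020')
new_prediction = predict_model(saved_final_lightgbm, data=data_unseen)
new_prediction.head()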

Here we have applied the loaded model to the same data_unseen that we used above.

In [None]:
from pycaret.utils import check_metric
check_metric(new_prediction.Price, new_prediction.Label, 'R2')

As expected, the results of unseen_predictions and new_prediction are identical.
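
The reason the results match is that saving a model and reading it back is a lossless round trip: the restored object holds the same fitted parameters. The sketch below illustrates this with Python's standard pickle module and a hypothetical stand-in "model" (a dictionary of two coefficients with a `predict` helper). PyCaret's own serialization of the full pipeline may differ in detail, so treat this as an illustration of the idea only.

```python
import os
import pickle
import tempfile

def predict(params, carats):
    """Hypothetical linear pricing rule: price = intercept + slope * carat."""
    return [params["intercept"] + params["slope"] * c for c in carats]

# Stand-in for a trained model: just its fitted parameters (made-up values).
params = {"intercept": 500.0, "slope": 4000.0}
unseen_carats = [0.5, 1.0, 1.5]

# Write the fitted parameters to disk, then read them back as a new object.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.pkl")
    with open(path, "wb") as handle:
        pickle.dump(params, handle)
    with open(path, "rb") as handle:
        restored = pickle.load(handle)

same = predict(params, unseen_carats) == predict(restored, unseen_carats)
print(same)
```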

# Conclusion
This notebook covered the entire machine learning pipeline: loading the data, pre-processing it, training models, tuning hyperparameters, and finally predicting on unseen data and saving the trained model. All of this takes fewer than 10 commands, which are intuitive and easy to remember. Recreating the whole process without PyCaret would have taken hundreds of lines of code using the standard libraries, and these are only the basics of the pycaret.regression module.