# We are going to use AutoViML and PyCaret to build better models for Housing Prices
## Please turn on the GPU for this kernel in the panel on the right: Accelerator => GPU
### This is a modified version of a fantastic original notebook here:
https://www.kaggle.com/pavansanagapati/6-useful-automated-ml-tools-for-data-scientists


In [None]:
import pandas as pd
import numpy as np

## 4.1 Load Dataset<a id="41"></a> <br>

To demonstrate PyCaret's capabilities, the original notebook used a UCI dataset called **Default of Credit Card Clients Dataset**. This dataset contains information on default payments, demographic factors, credit data, payment history, and billing statements of credit card clients in Taiwan from April 2005 to September 2005. There are 24,000 samples and 25 features. Note that this modified version instead loads the House Prices training data in the code cell below and predicts `SalePrice`; the column descriptions that follow refer to the original credit card dataset:

- **ID:** ID of each client
- **LIMIT_BAL:** Amount of given credit in NT dollars (includes individual and family/supplementary credit)
- **SEX:** Gender (1=male, 2=female)
- **EDUCATION:** (1=graduate school, 2=university, 3=high school, 4=others, 5=unknown, 6=unknown)
- **MARRIAGE:** Marital status (1=married, 2=single, 3=others)
- **AGE:** Age in years
- **PAY_0 to PAY_6:** Repayment status by n months ago (PAY_0 = last month ... PAY_6 = 6 months ago) (Labels: -1=pay duly, 1=payment delay for one month, 2=payment delay for two months, ... 8=payment delay for eight months, 9=payment delay for nine months and above)
- **BILL_AMT1 to BILL_AMT6:** Amount of bill statement by n months ago (BILL_AMT1 = last month ... BILL_AMT6 = 6 months ago)
- **PAY_AMT1 to PAY_AMT6:** Amount of payment by n months ago (PAY_AMT1 = last month ... PAY_AMT6 = 6 months ago)
- **default.payment.next.month:** Default payment (1=yes, 0=no) `Target Column`

In [None]:
import pandas as pd
#data=pd.read_csv('../input/default-of-credit-card-clients-dataset/UCI_Credit_Card.csv')
data = pd.read_csv('../input/house-prices-advanced-regression-techniques/train.csv')
print(data.shape)
data.head()

In order to demonstrate the predict_model() function on unseen data, 5% of the records (the rows left over after `sample(frac=0.95)`) are withheld from the original dataset to be used for predictions. This should not be confused with a train/test split, as this particular split is performed to simulate a real-life scenario. Another way to think about it is that these withheld records were not available when the machine learning experiment was performed.
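
The withholding step can be sketched on a toy frame. This is a minimal illustration, not the notebook's data: the frame `toy` and its values are hypothetical. The point it shows is that the withheld rows must be dropped by their original index labels before either frame is renumbered, otherwise the two sets can overlap.

```python
import pandas as pd

# Toy frame standing in for the real training data (hypothetical values).
toy = pd.DataFrame({"Id": range(1, 21), "SalePrice": range(100, 120)})

# Sample 95% of the rows for modeling; keep the original index for now.
modeling = toy.sample(frac=0.95, random_state=786)

# Drop the sampled rows by their ORIGINAL index labels to get the holdout.
unseen = toy.drop(modeling.index)

# Only after the drop is it safe to renumber both frames.
modeling = modeling.reset_index(drop=True)
unseen = unseen.reset_index(drop=True)

print("Data for modeling:", modeling.shape)
print("Unseen data for predictions:", unseen.shape)
```

Resetting the index before the drop would make `modeling.index` a plain 0..n-1 range, so the drop would remove the first n rows instead of the sampled ones.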

In [None]:
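# Sample 95% of the rows for modeling; keep the original index so the
# remaining rows can be identified before any renumbering.
dataset = data.sample(frac=0.95, random_state=786)
data_unseen = data.drop(dataset.index).reset_index(drop=True)
dataset = dataset.reset_index(drop=True)

print('Data for Modeling: ' + str(dataset.shape))
print('Unseen Data For Predictions: ' + str(data_unseen.shape))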

In [None]:
#target = 'default.payment.next.month'
target = 'SalePrice'
dataset.head(1)

## Let's build a model using Auto_ViML first

In [None]:
!pip install autoviml --upgrade

In [None]:
from autoviml.Auto_ViML import Auto_ViML

## You have to just give the dataset, data_unseen and target variable. That's all!

In [None]:
m, feats, trainm, testm = Auto_ViML(dataset, target, data_unseen,
                            sample_submission='',
                            scoring_parameter='', KMeans_Featurizer=False,
                            hyper_param='RS', feature_reduction=True,
                            Boosting_Flag=True, Binning_Flag=False,
                            Add_Poly=0, Stacking_Flag=False, Imbalanced_Flag=True,
                            verbose=2)

# Let's Compare it to PyCaret

In [None]:
!pip install pycaret

In [None]:
#import regression and classification modules from pycaret
#from pycaret.classification import *
from pycaret.regression import *

In [None]:
help(setup)

In [None]:
# Use the modeling portion only, so that data_unseen stays truly unseen
reg = setup(data = dataset, target = target, train_size=0.8,
            ignore_features=['Id'], session_id=21, imputation_type='iterative',
            normalize=True, pca=True, pca_method='kernel',
            transform_target=False, ignore_low_variance = True,
            combine_rare_levels = True, remove_outliers=True)

In [None]:
compare_models()

There you go: compare_models() trained a whole library of regression models using 10-fold cross validation and evaluated the most commonly used regression metrics (MAE, MSE, RMSE, R2, RMSLE, MAPE). The score grid printed above highlights the best value of each metric for comparison purposes only. By default the grid is sorted by 'R2' (highest to lowest), which can be changed by passing the sort parameter. For example, **compare_models(sort = 'RMSE')** sorts the grid by RMSE instead of R2. To change the number of folds from the default value of 10, use the fold parameter. For example, **compare_models(fold = 5)** compares all models on 5-fold cross validation. Reducing the number of folds shortens the training time.
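
Since the target here (`SalePrice`) is continuous, a regression score grid is built from error metrics rather than Accuracy or AUC. The following is a minimal NumPy sketch of six such metrics (MAE, MSE, RMSE, R2, RMSLE, MAPE), assuming the standard textbook definitions rather than PyCaret's internal code. The helper `regression_metrics` and the sample prices are hypothetical.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return six common regression metrics as a dict (textbook definitions)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred
    mse = np.mean(err ** 2)
    ss_res = np.sum(err ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return {
        "MAE": np.mean(np.abs(err)),
        "MSE": mse,
        "RMSE": np.sqrt(mse),
        "R2": 1.0 - ss_res / ss_tot,
        # RMSLE assumes non-negative values, which holds for prices.
        "RMSLE": np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)),
        # MAPE assumes no true value is zero.
        "MAPE": np.mean(np.abs(err / y_true)),
    }

# Hypothetical prices and predictions, for illustration only.
actual = [200000, 150000, 320000, 180000]
predicted = [210000, 140000, 300000, 185000]
scores = regression_metrics(actual, predicted)
for name, value in scores.items():
    print(f"{name}: {value:.4f}")
```

For R2, higher is better; for the other five, lower is better, which is worth keeping in mind when choosing a sort column.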
## 4.3 Create model<a id="43"></a> <br>
While compare_models() is a powerful function and often the starting point of an experiment, it returns only the best-scoring model (or the top `n_select` models), each trained with default hyperparameters. PyCaret's recommended experiment workflow is to use compare_models() right after setup to evaluate the top performing models and shortlist a few candidates for continued experimentation. The function that lets you train a specific model by its ID is unimaginatively called **create_model()**.

The model library of PyCaret's regression module contains a range of regressors; calling `models()` lists their IDs.

For illustration purposes only, we will consider the following regressors:

* Linear Regression ('lr')
* Decision Tree Regressor ('dt')
* K Neighbors Regressor ('knn')
* Random Forest Regressor ('rf')

In [None]:
# Exclude a few models and keep the two best ones.
# Note: with n_select=2, `lr` is a list of two trained models.
lr = compare_models(exclude = ['en','dt','omp', 'gbr','ada', 'par'], n_select=2)

In [None]:
lr

In [None]:
lr[1]

In [None]:
rf = create_model('rf')

Notice that the mean score reported by create_model() matches the score printed in compare_models() for the same model. This is because the metrics printed in the compare_models() score grid are the average scores across all CV folds. As with compare_models(), the fold parameter changes the number of folds from the default value of 10. For example, create_model('dt', fold = 5) creates a Decision Tree Regressor using 5-fold CV.
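
To make "average across all CV folds" concrete, here is a small NumPy-only sketch. It assumes synthetic data and ordinary least squares as the model; the function `kfold_r2` is a hypothetical helper, not a PyCaret call. It computes one R2 per fold and then the mean, which is the kind of figure a score grid reports.

```python
import numpy as np

rng = np.random.default_rng(21)

# Synthetic regression data (hypothetical): y = 3*x0 - 2*x1 + noise.
X = rng.normal(size=(200, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

def kfold_r2(X, y, n_folds=10, seed=21):
    """Fit ordinary least squares on each training fold; return per-fold R2."""
    idx = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(idx, n_folds)
    scores = []
    for k in range(n_folds):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        # Add an intercept column, then solve the least-squares problem.
        A_train = np.column_stack([np.ones(len(train)), X[train]])
        A_test = np.column_stack([np.ones(len(test)), X[test]])
        coef, *_ = np.linalg.lstsq(A_train, y[train], rcond=None)
        resid = y[test] - A_test @ coef
        ss_tot = np.sum((y[test] - y[test].mean()) ** 2)
        scores.append(1.0 - np.sum(resid ** 2) / ss_tot)
    return np.array(scores)

fold_scores = kfold_r2(X, y, n_folds=10)
print("per-fold R2:", np.round(fold_scores, 3))
print("mean R2:", round(float(fold_scores.mean()), 3))
```

Calling `kfold_r2(X, y, n_folds=5)` trains five models instead of ten, which is why fewer folds run faster.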
## 4.4 Tune model<a id="44"></a> <br>
When a model is created using the create_model() function, it uses the default hyperparameters. To tune them, use the tune_model() function. It automatically tunes the hyperparameters of a model over a pre-defined search space and scores each candidate using k-fold cross validation. The output prints a score grid that shows MAE, MSE, RMSE, R2, RMSLE and MAPE by fold.
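
The idea of searching a pre-defined space and scoring each candidate by cross validation can be sketched without PyCaret. The example below assumes a random search (one common strategy) over the `alpha` of a closed-form ridge regression on synthetic data. The helpers `ridge_fit` and `cv_rmse` are hypothetical and do not reflect how tune_model() is implemented.

```python
import numpy as np

rng = np.random.default_rng(21)

# Synthetic data (hypothetical): five features, two of them irrelevant.
X = rng.normal(size=(120, 5))
y = X @ np.array([2.0, -1.0, 0.0, 0.5, 0.0]) + rng.normal(scale=1.0, size=120)

def ridge_fit(X, y, alpha):
    """Closed-form ridge regression (no intercept; data is roughly centered)."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

def cv_rmse(X, y, alpha, n_folds=5):
    """Mean RMSE of ridge regression across k folds."""
    folds = np.array_split(np.arange(len(y)), n_folds)
    rmses = []
    for k in range(n_folds):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        coef = ridge_fit(X[train], y[train], alpha)
        rmses.append(np.sqrt(np.mean((y[test] - X[test] @ coef) ** 2)))
    return float(np.mean(rmses))

# Pre-defined search space: alpha drawn log-uniformly from [1e-3, 1e3].
candidates = 10 ** rng.uniform(-3, 3, size=20)

default_alpha = 1.0
default_score = cv_rmse(X, y, default_alpha)
# Include the default so the tuned result is never worse than the starting point.
scored = [(cv_rmse(X, y, a), a) for a in np.append(candidates, default_alpha)]
best_score, best_alpha = min(scored)

print(f"default alpha={default_alpha}: CV RMSE={default_score:.4f}")
print(f"best alpha={best_alpha:.4f}: CV RMSE={best_score:.4f}")
```

Because the default setting is one of the scored candidates, the selected score can match the default but never be worse, which mirrors the before/after comparison one would make after tuning.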

Now let us tune the following models:
* Linear Regression ('lr')
* Decision Tree Regressor ('dt')
* K Neighbors Regressor ('knn')
* Random Forest Regressor ('rf')

In [None]:
# Tune the Linear Regression model (tune_model expects a trained model object, not an ID string)
tuned_lr = tune_model(create_model('lr'))

In [None]:
# Tune the Decision Tree Regressor model (tune_model expects a trained model object)
tuned_dt = tune_model(create_model('dt'))

In [None]:
# Tune the K Neighbors Regressor model (tune_model expects a trained model object)
tuned_knn = tune_model(create_model('knn'))

In [None]:
# Tune the Random Forest Regressor model created earlier with create_model('rf')
tuned_rf = tune_model(rf)

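**Note:**

The scores below are carried over from the original classification notebook; rerun the cells above to obtain the corresponding regression scores for the House Prices data. In that run, tuning improved every model except Logistic Regression, which was unchanged:

* Logistic Regression (Before: 0.7786, After: 0.7786)
* Decision Tree Classifier (Before: 0.7216, After: 0.7413)
* K Neighbors Classifier (Before: 0.7355, After: 0.7772)
* Random Forest Classifier (Before: 0.8015, After: 0.8103)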

## 4.5 Plot Model<a id="45"></a> <br>

Before model finalization, the `plot_model()` function can be used to analyze performance from different angles, such as residuals, prediction error and feature importance. This function takes a trained model object and returns a plot based on the test / hold-out set.

The full list of available plots is shown by `help(plot_model)`.

In [None]:
# Plot the best model returned by compare_models(): residuals plot (the regression default)
plot_model(lr[0])

In [None]:
# Plot the tuned Linear Regression model: residuals plot
plot_model(tuned_lr)

In [None]:
# Plot the tuned Decision Tree model: residuals plot
plot_model(tuned_dt)

In [None]:
# Plot the tuned KNN model: residuals plot
plot_model(tuned_knn)

Another way to analyze the performance of a model is the **evaluate_model()** function, which displays a user interface for all of the available plots for a given model. It internally uses the plot_model() function.

In [None]:
# `lr` is a list of the top two models, so pass the best one
evaluate_model(lr[0])

In [None]:
# Create a tree-based model so that feature importance can be inspected
dt = create_model('dt')
# Interpret the model
interpret_model(dt)

In [None]:
# optimize_threshold() applies only to classification models, so it is skipped for this regression task
# optimize_threshold(lr)

# 6. AutoViz<a id="6"></a> <br>
![](https://github.com/AutoViML/AutoViz/raw/master/logo.png)
Automatically Visualize any dataset, any size with a single line of code.

AutoViz performs automatic visualization of any dataset with one line. Give any input file (CSV, txt or json) and AutoViz will visualize it.

In [None]:
!pip install autoviz

In [None]:
import pandas as pd
from autoviz.AutoViz_Class import AutoViz_Class
AV = AutoViz_Class()

In [None]:
sep = ','
target = 'medv'
datapath = ''
filename = 'https://raw.githubusercontent.com/vincentarelbundock/Rdatasets/master/csv/MASS/Boston.csv'
dft = AV.AutoViz(datapath+filename, sep=sep, depVar=target, dfte='', header=0, verbose=2,
                            lowess=False,chart_format='svg',max_rows_analyzed=1500,max_cols_analyzed=30)

# Conclusion <a id="8"></a> <br>

Automated ML tools enable data scientists to improve their productivity, reach insights more quickly and shorten time to market. I hope you find this kernel useful and put the above tools to good use in your day-to-day data science work.

# If you like this kernel, an <font color='red'>UPVOTE</font> would be greatly appreciated