In [None]:
# FutureWarning: is_categorical is deprecated and will be removed in a future version.
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

# Preparing Data
Load training data from a CSV file.  
Note that here we load the data from local CSV files (downloaded beforehand, e.g., using wget), but you can instead specify the URL of a CSV file stored in the cloud (e.g., an AWS S3 bucket). Each row in the table `df_train` corresponds to a single training example.

### Pandas
pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language.

In [None]:
import pandas as pd 

df_train = pd.read_csv('./titanic/train.csv')
df_test = pd.read_csv('./titanic/test.csv')

target_col = 'Survived'
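
As a small illustration of the loading step, the sketch below reads CSV content with `pd.read_csv`, which accepts a local path, a URL, or a file-like object. The inline CSV text and the name `df_example` are hypothetical stand-ins so that the snippet runs without any files.

```python
import io

import pandas as pd

# Hypothetical stand-in for a CSV file; with real data you would pass
# a local path or a URL string to pd.read_csv instead.
csv_text = "PassengerId,Survived,Age\n1,0,22\n2,1,38\n3,1,\n"
df_example = pd.read_csv(io.StringIO(csv_text))

print(df_example.shape)   # (number of rows, number of columns)
print(df_example.dtypes)  # inferred type of each column
```

The empty `Age` field in the last row is read as a missing value (NaN), which is why missing-value analysis becomes relevant later.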

### DataFrame.head
Returns the first n rows.  
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.head.html

In [None]:
df_train.head()
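
A short sketch of how the `n` argument of `head` behaves, using a hypothetical ten-row frame rather than the Titanic data:

```python
import pandas as pd

# Hypothetical toy frame with ten rows.
df_example = pd.DataFrame({"x": range(10), "y": list("abcdefghij")})

first_five = df_example.head()          # default n=5
first_three = df_example.head(3)        # explicit n
all_but_last_two = df_example.head(-2)  # negative n: all rows except the last |n|

print(len(first_five), len(first_three), len(all_but_last_two))
```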

## Automated Dataset Overview
Automated dataset overview allows you to easily get a high-level understanding of datasets, including the number of rows and columns, the data types of each column, and basic statistical information such as min/max values, mean, quartiles, and standard deviation. This functionality can be a valuable tool for quickly identifying potential issues or areas of interest in your dataset before diving deeper into your analysis.

The last chart is a feature distance. It measures the similarity between features in a dataset. For example, if two variables are almost identical, their feature distance will be small. Understanding feature distance is useful in feature selection, where it can be used to identify which variables are redundant and should be considered for removal.

In [None]:
import autogluon.eda.auto as auto

auto.dataset_overview(train_data=df_train, test_data=df_test, label=target_col)
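
To make the idea of feature distance more concrete, here is a minimal sketch that assumes the distance is defined as `1 - |Spearman correlation|`. This definition, the toy data, and the column names are assumptions for illustration; the computation behind the AutoGluon chart may differ.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)

# Hypothetical features: "a_copy" is a monotone transform of "a" (redundant),
# while "noise" is generated independently of "a".
df_example = pd.DataFrame({
    "a": a,
    "a_copy": a * 2.0 + 1.0,
    "noise": rng.normal(size=200),
})

# Spearman correlation equals the Pearson correlation of the ranks.
rank_corr = df_example.rank().corr()

# Assumed distance: 0 for perfectly rank-correlated features, near 1 for unrelated ones.
distance = 1.0 - rank_corr.abs()
print(distance.round(3))
```

Under this definition, the pair (`a`, `a_copy`) has a distance of about zero, which flags one of the two as a candidate for removal.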

## Covariate Shift Analysis
Covariate shift is a phenomenon in machine learning where the distribution of the independent variables differs between the training and testing data. This can occur when the training and testing data come from different sources or regions, or when the data changes over time. It can result in biased model performance, because the model does not generalize well to the test data.

To address covariate shift, various techniques can be used, such as re-sampling the data, adjusting the model to account for the shift, transforming the data into a form not exposed to the shift (e.g., car make year -> car age), or obtaining additional data to balance the distribution of the independent variables. The goal is to ensure that the model is trained and tested on similar data distributions, so that it generalizes well when deployed into production.
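
The car example can be sketched as follows. The frames, column names, and years are hypothetical; the point is only that a raw year drifts between training and test data, while the derived age does not.

```python
import pandas as pd

# Hypothetical data: training cars were sold in 2016, test cars in 2021.
train = pd.DataFrame({"sale_year": [2016] * 4, "make_year": [2010, 2012, 2014, 2015]})
test = pd.DataFrame({"sale_year": [2021] * 4, "make_year": [2015, 2017, 2019, 2020]})

# The raw make year shifts over time; the age at sale is a shift-free form.
for frame in (train, test):
    frame["car_age"] = frame["sale_year"] - frame["make_year"]

print(train["make_year"].mean(), test["make_year"].mean())  # different
print(train["car_age"].tolist(), test["car_age"].tolist())  # identical
```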

In [None]:
auto.covariate_shift_detection(train_data=df_train, test_data=df_test, label=target_col)

In [None]:
df_train = df_train.drop(columns='PassengerId')
df_test = df_test.drop(columns='PassengerId')

## Feature Interaction Charting
This tool provides quick visualization of interactions between variables in a dataset. You can specify the variables to plot via the `x`, `y` and `hue` (color) parameters. The tool automatically picks the chart type based on the detected variable types and renders 1-, 2- or 3-way interactions.

This feature can be useful for exploring patterns, trends, and outliers, and for identifying potentially good predictors for the task.

### Missing value analysis

Analyze the dataset's missing value counts and frequencies.

In [None]:
auto.missing_values_analysis(train_data=df_train)
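
For comparison, the same counts and frequencies can be computed directly with pandas. This is a minimal sketch on a hypothetical toy frame; it is not how `auto.missing_values_analysis` is implemented.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with a few missing entries.
df_example = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Embarked": ["S", "C", None, "S"],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})

missing_count = df_example.isna().sum()   # number of missing values per column
missing_ratio = df_example.isna().mean()  # fraction of missing values per column

summary = pd.DataFrame({"missing_count": missing_count, "missing_ratio": missing_ratio})
summary = summary[summary["missing_count"] > 0]  # keep only columns with gaps
print(summary)
```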

In [None]:
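def mean_median(df, variable):
    """Add two imputed copies of `variable` to `df` in place (NaNs filled with the mean / the median)."""
    df[variable + '_mean'] = df[variable].fillna(df[variable].mean())
    df[variable + '_median'] = df[variable].fillna(df[variable].median())

# Example usage (disabled): mean_median(df_train, 'Age')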

In [None]:
df_train[df_train.Embarked.isna()]

It looks like there are only two null values in the Embarked feature.  
We may be able to fill these by looking at the other independent variables. Both passengers paid a Fare of $80, which matches the C Embarked values where Pclass is 1.

In [None]:
auto.analyze_interaction(train_data=df_train, x='Embarked', y='Fare', hue='Pclass')
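
One possible way to turn this kind of reasoning into code is sketched below: for each row with a missing `Embarked`, pick the port whose median `Fare` within the same `Pclass` is closest to that row's fare. The rule, the helper `closest_port`, and the toy values are assumptions for illustration, not the Titanic data or a method prescribed by this notebook.

```python
import pandas as pd

# Hypothetical toy data; the last row has a missing Embarked value.
df_example = pd.DataFrame({
    "Pclass":   [1, 1, 1, 1, 1, 1, 3, 1],
    "Fare":     [78.0, 83.0, 60.0, 55.0, 90.0, 52.0, 8.0, 80.0],
    "Embarked": ["C", "C", "S", "S", "Q", "S", "S", None],
})

# Median fare per (Pclass, Embarked), computed from rows where Embarked is known.
known = df_example.dropna(subset=["Embarked"])
median_fare = known.groupby(["Pclass", "Embarked"])["Fare"].median()

def closest_port(row):
    """Return the port whose median fare (same Pclass) is closest to the row's fare."""
    candidates = median_fare.loc[row["Pclass"]]
    return (candidates - row["Fare"]).abs().idxmin()

missing = df_example["Embarked"].isna()
df_example.loc[missing, "Embarked"] = df_example[missing].apply(closest_port, axis=1)
print(df_example)
```

With only two missing values, filling them by hand after inspecting the chart is equally reasonable; the sketch only shows how the same logic could be automated.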

In [None]:
auto.analyze_interaction(x='Age', hue='Survived', train_data=df_train, test_data=df_test)     

## Predicting Columns in a Table

Via a simple fit() call, AutoGluon can produce highly accurate models to predict the values in one column of a data table based on the values in the other columns. AutoGluon can be used with tabular data for both classification and regression problems.  

https://auto.gluon.ai/stable/api/autogluon.tabular.TabularPredictor.html
## Description of fit():
Here we discuss what happens during fit().

Since there are only two possible values of the class variable, this is a binary classification problem, for which an appropriate performance metric is accuracy. AutoGluon automatically infers this as well as the type of each feature (i.e., which columns contain continuous numbers vs. discrete categories). AutoGluon can also automatically handle common issues like missing data and rescaling feature values.

We did not specify separate validation data, so AutoGluon automatically chooses a random training/validation split of the data. The data used for validation is separated from the training data and is used to determine the models and hyperparameter values that produce the best results. Rather than just a single model, AutoGluon trains multiple models and ensembles them together to improve predictive performance.

By default, AutoGluon tries to fit various types of models including neural networks and tree ensembles. Each type of model has various hyperparameters, which traditionally, the user would have to specify. AutoGluon automates this process.

AutoGluon automatically and iteratively tests values for hyperparameters to produce the best performance on the validation data. This involves repeatedly training models under different hyperparameter settings and evaluating their performance. This process can be computationally intensive, so fit() can parallelize it across multiple threads (and machines if distributed resources are available). To control runtimes, you can specify various arguments in fit().
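
The general idea of selecting hyperparameters on a held-out validation split can be sketched with NumPy alone. This is a deliberately simplified illustration, not AutoGluon's actual search: the "model" is a polynomial fit, the only hyperparameter is its degree, and the synthetic data and all names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical synthetic regression data.
x = rng.uniform(-1, 1, size=200)
y = np.sin(3 * x) + rng.normal(scale=0.1, size=200)

# Random training/validation split (80% / 20%).
idx = rng.permutation(len(x))
train_idx, val_idx = idx[:160], idx[160:]

def validation_mse(degree):
    """Fit a polynomial of the given degree on the training split and score it on the validation split."""
    coeffs = np.polyfit(x[train_idx], y[train_idx], deg=degree)
    pred = np.polyval(coeffs, x[val_idx])
    return float(np.mean((pred - y[val_idx]) ** 2))

# Train one model per hyperparameter setting and keep the best on validation data.
candidate_degrees = [1, 3, 5, 7]
scores = {d: validation_mse(d) for d in candidate_degrees}
best_degree = min(scores, key=scores.get)
print(scores, best_degree)
```

Each extra candidate means another full training run, which is why this kind of search becomes expensive and benefits from parallelization.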


In [None]:
from autogluon.tabular import TabularPredictor

predictor = TabularPredictor(label=target_col).fit(df_train, num_cpus=12, num_gpus=1)

We can also evaluate the performance of each individual trained model on our (labeled) test data.

In [None]:
predictor.leaderboard(df_test, silent=True)

In [None]:
results = predictor.fit_summary(show_plot=True)

In [None]:
predictor.predict(df_test)
predictor.predict_proba(df_test)
predictor.evaluate(df_test)

The scores of predictive performance above were based on a default evaluation metric (accuracy for binary classification). Performance in certain applications may be measured by different metrics than the ones AutoGluon optimizes for by default. If you know the metric that counts in your application, you should specify it as demonstrated in the next section.
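
To see why the choice of metric matters, the sketch below computes accuracy and ROC AUC by hand for the same set of predictions. The labels and probabilities are hypothetical, and the AUC is computed from its pairwise-ranking definition rather than through any AutoGluon call.

```python
import numpy as np

# Hypothetical true labels and predicted probabilities of class 1.
y_true = np.array([0, 0, 0, 1, 1, 1])
proba = np.array([0.10, 0.40, 0.45, 0.30, 0.48, 0.90])

# Accuracy depends on a decision threshold (0.5 here).
y_pred = (proba >= 0.5).astype(int)
accuracy = float(np.mean(y_pred == y_true))

# ROC AUC: fraction of (positive, negative) pairs where the positive example
# receives the higher score; ties count as half.
pos = proba[y_true == 1]
neg = proba[y_true == 0]
diff = pos[:, None] - neg[None, :]
roc_auc = float(np.mean((diff > 0) + 0.5 * (diff == 0)))

print(accuracy, roc_auc)
```

The two numbers differ because accuracy only looks at thresholded labels, whereas ROC AUC evaluates how well the probabilities rank positives above negatives.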

## Presets

AutoGluon comes with a variety of presets that can be specified in the call to `.fit` via the `presets` argument. `medium_quality` is used by default to encourage initial prototyping, but for serious usage, the other presets should be used instead.

| Preset                            | Model Quality                                          | Use Cases                                                                                                                                               | Fit Time (Ideal) | Inference Time (Relative to medium_quality) | Disk Usage |
|:----------------------------------|:-------------------------------------------------------|:--------------------------------------------------------------------------------------------------------------------------------------------------------|:-----------------|:--------------------------------------------|:-----------|
| best_quality                      | State-of-the-art (SOTA), much better than high_quality | When accuracy is what matters                                                                                                                           | 16x+             | 32x+                                        | 16x+       |
| high_quality                      | Better than good_quality                               | When a very powerful, portable solution with fast inference is required: Large-scale batch inference                                                    | 16x              | 4x                                          | 2x         |
| good_quality                      | Significantly better than medium_quality               | When a powerful, highly portable solution with very fast inference is required: Billion-scale batch inference, sub-100ms online-inference, edge-devices | 16x              | 2x                                          | 0.1x       |
| medium_quality                    | Competitive with other top AutoML Frameworks           | Initial prototyping, establishing a performance baseline                                                                                                | 1x               | 1x                                          | 1x         |

We recommend that users start with `medium_quality` to get a sense of the problem and identify any data-related issues. If `medium_quality` is taking too long to train, consider subsampling the training data during this prototyping phase.  
Once you are comfortable, try `best_quality` next. Make sure to specify at least 16x the `time_limit` value used for `medium_quality`. Once finished, you should have a very powerful solution that is often stronger than `medium_quality`.  
Make sure to consider holding out test data that AutoGluon never sees during training, so that you can verify that the models perform as expected.  
Once you have evaluated both `best_quality` and `medium_quality`, check if either satisfies your needs. If neither does, consider trying `high_quality` and/or `good_quality`.  
If none of the presets satisfy your requirements, refer to [tutorials/tabular_prediction/tabular-indepth.ipynb](https://github.com/gidler/autogluon-tutorials/blob/main/tutorials/tabular_prediction/tabular-indepth.ipynb) for more advanced options.

To get the best predictive accuracy, you should generally use AutoGluon like this:

In [None]:
time_limit = 60  
metric = 'roc_auc'  
predictor = TabularPredictor(target_col, eval_metric=metric).fit(df_train, time_limit=time_limit, presets='best_quality')
predictor.leaderboard(df_test, silent=True)

This command implements the following strategy to maximize accuracy:

- Specify the argument `presets='best_quality'`, which allows AutoGluon to automatically construct powerful model ensembles based on [stacking/bagging](https://arxiv.org/abs/2003.06505), and will greatly improve the resulting predictions if granted sufficient training time. The default value of `presets` is `'medium_quality'`, which produces *less* accurate models but facilitates faster prototyping. With `presets`, you can flexibly prioritize predictive accuracy vs. training/inference speed. For example, if you care less about predictive performance and want to quickly deploy a basic model, consider using: `presets=['good_quality', 'optimize_for_deployment']`.

- Provide the parameter `eval_metric` to `TabularPredictor()` if you know what metric will be used to evaluate predictions in your application. Some other non-default metrics you might use include things like: `'f1'` (for binary classification), `'roc_auc'` (for binary classification), `'log_loss'` (for classification), `'mean_absolute_error'` (for regression), `'median_absolute_error'` (for regression).  You can also define your own custom metric function.  For more information refer to [tutorials/tabular_prediction/tabular-custom-metric.ipynb](https://github.com/gidler/autogluon-tutorials/blob/main/tutorials/tabular_prediction/tabular-custom-metric.ipynb)

- Include all your data in `train_data` and do not provide `tuning_data` (AutoGluon will split the data more intelligently to fit its needs).

- Do not specify the `hyperparameter_tune_kwargs` argument (counterintuitively, hyperparameter tuning is not the best way to spend a limited training time budget, as model ensembling is often superior). We recommend you only use `hyperparameter_tune_kwargs` if your goal is to deploy a single model rather than an ensemble.

- Do not specify the `hyperparameters` argument (allow AutoGluon to adaptively select which models/hyperparameters to use).

- Set `time_limit` to the longest amount of time (in seconds) that you are willing to wait. AutoGluon's predictive performance improves the longer `fit()` is allowed to run.
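
The prototyping advice from the Presets section (subsample the training data, and hold out test data that is never seen during training) can be sketched in pandas as follows. The toy frame, the split fraction, and the sample size are hypothetical choices.

```python
import pandas as pd

# Hypothetical toy dataset with 1000 rows.
df_example = pd.DataFrame({
    "feature": range(1000),
    "label": [i % 2 for i in range(1000)],
})

# Hold out 20% as test data that is never passed to training.
holdout = df_example.sample(frac=0.2, random_state=0)
train_full = df_example.drop(index=holdout.index)

# Subsample the remaining training data for a quick prototype run.
train_small = train_full.sample(n=200, random_state=0)

print(len(holdout), len(train_full), len(train_small))
```

In this workflow, `train_small` would be used for fast prototyping, `train_full` for the final long run, and `holdout` only for evaluation.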