# <center>Tabular Playground Series - Sep 2021</center>

This notebook is a work in progress.
<hr>

## 1. Problem Definition

Although the dataset for this competition is synthetic, it is based on a real dataset and it has been generated using [CTGAN](https://github.com/sdv-dev/CTGAN). This dataset involves predicting whether a claim will be made on an insurance policy. Features have been anonymized and they have properties relating to real-world features.

- It is a <span style="color:skyblue;">binary (2-class) classification</span> problem. 
- The number of observations for each class is <span style="color:skyblue;">balanced</span>. 
- There are 957,919 observations in the training data with 119 input variables (including 'id') and 1 output variable ('claim'). 
- There are 493,474 observations in the test data with 119 input variables (including 'id').
- There are 493,474 rows in the sample solution with 2 columns ('id', 'claim').
- The dataset has a <span style="color:skyblue;">lot of missing values</span> which have been encoded with NaN values. 
- The variable names are as follows: ('id', 'f1', 'f2', 'f3', 'f4', 'f5', 'f6', 'f7', 'f8', 'f9', 'f10', 'f11', 'f12', 'f13', 'f14', 'f15', 'f16', 'f17', 'f18', 'f19', 'f20', 'f21', 'f22', 'f23', 'f24', 'f25', 'f26', 'f27', 'f28', 'f29', 'f30', 'f31', 'f32', 'f33', 'f34', 'f35', 'f36', 'f37', 'f38', 'f39', 'f40', 'f41', 'f42', 'f43', 'f44', 'f45', 'f46', 'f47', 'f48', 'f49', 'f50', 'f51', 'f52', 'f53', 'f54', 'f55', 'f56', 'f57', 'f58', 'f59', 'f60', 'f61', 'f62', 'f63', 'f64', 'f65', 'f66', 'f67', 'f68', 'f69', 'f70', 'f71', 'f72', 'f73', 'f74', 'f75', 'f76', 'f77', 'f78', 'f79', 'f80', 'f81', 'f82', 'f83', 'f84', 'f85', 'f86', 'f87', 'f88', 'f89', 'f90', 'f91', 'f92', 'f93', 'f94', 'f95', 'f96', 'f97', 'f98', 'f99', 'f100', 'f101', 'f102', 'f103', 'f104', 'f105', 'f106', 'f107', 'f108', 'f109', 'f110', 'f111', 'f112', 'f113', 'f114', 'f115', 'f116', 'f117', 'f118', 'claim')

<u>Goal</u>: Predict whether a claim will be made on an insurance policy.

<hr>

## 2. Load data

Let's start off by loading the libraries required for this project.

#### Install libraries

Let’s begin by installing the latest stable version of datatable.

In [None]:
!pip install datatable

#### Load libraries

In [None]:
# Load libraries
import datatable as dt
from datatable.models import Ftrl
print(dt.__version__)

import time
from pathlib import Path
import numpy as np
import pandas as pd

# from bokeh.plotting import *
# output_notebook()

import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
# set_context() only honours context (scaling) keys, so set the figure size via rcParams
plt.rcParams['figure.figsize'] = (12, 6)

# to print all outputs of a cell
from IPython.core.interactiveshell import InteractiveShell  
InteractiveShell.ast_node_interactivity = "all"

import warnings
warnings.filterwarnings('ignore')

#### Load data

Initially I tried to read data using pandas, but it was very slow, hence I have used datatable instead.

In [None]:
## Data Table Reading
start = time.time()
data_dir = Path('../input/tabular-playground-series-sep-2021/')
dt_train = dt.fread(data_dir / "train.csv")
dt_test = dt.fread(data_dir / "test.csv")
dt_submission = dt.fread(data_dir / "sample_solution.csv")
end = time.time()
print(end - start)

## 3. Exploratory Data Analysis

- Automatic EDA using dataprep can be found in [this notebook](https://www.kaggle.com/sugamkhetrapal/tabular-playground-sep-21-eda-dataprep).

- Automatic EDA using sweetviz can be found in [this notebook](https://www.kaggle.com/sugamkhetrapal/tabular-playground-sep-21-eda-sweetviz/notebook). Click on the output tab to download the report.

We are also going to cover the following steps:
1. Take a peek at our raw data.
2. Review the dimensions of our dataset.
3. Review the data types of attributes in our data.
4. Summarize the distribution of instances across classes in our dataset.
5. Summarize our data using descriptive statistics.
6. Understand the relationships in our data using correlations.
7. Review the skew of the distributions of each attribute.

##### Peek at our data

Let's review the first five rows of the data.

In [None]:
dt_train.head(5)

##### Dimensions of our data

In [None]:
# number of rows and columns in training dataset
dt_train.shape

- Training dataset has 957,919 rows and 120 columns

In [None]:
# number of rows and columns in test dataset
dt_test.shape

- Test dataset has 493,474 rows and 119 columns

##### Column names

In [None]:
# To get the column names
dt_train.names

- We have the 'id' variable
- We have variables from f1, f2, ..., f118.
- We have target variable titled 'claim'.

Now, let's look at the data types of each of these variables

In [None]:
for i in range(len(dt_train.names)):
    print(dt_train.names[i], ":", dt_train.stypes[i])

- The 'id' variable is of type 'Integer'
- Variables named f1, f2, ..., f118 are of type 'float64'.
- The output variable 'claim' is of type 'boolean'.

Let's check what the submission file looks like.

In [None]:
dt_submission.head(5)

- The submission file has probabilities (0.5) instead of 0 and 1.
- The evaluation section of the competition states that, for each id in the test set, we must predict a probability for the claim variable.

In [None]:
dt_submission.shape

- The submission file has 493,474 rows and 2 columns (id, claim).

##### Summary Statistics

Let us compute the mean, maximum, minimum and standard deviation of each column using datatable, followed by the number of missing values per column.

In [None]:
# mean
dt_train.mean()

In [None]:
# max
dt_train.max()

In [None]:
# min
dt_train.min()

In [None]:
# standard deviation
dt_train.sd()

In [None]:
# missing values
# https://www.machinelearningplus.com/data-manipulation/101-python-datatable-exercises-pydatatable/
# How to count NA values in every column of a datatable Frame?
dt_train.countna()

- This dataset has a lot of missing values, which need to be dropped or imputed.
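
To choose between dropping and imputing, it can help to look at the share of missing values per column rather than the raw counts. The sketch below does this with pandas on a small made-up frame; the toy data and the names `toy`, `missing_share` and `rows_with_na` are illustrative assumptions, not part of the competition data.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for dt_train.to_pandas(); the values are made up.
toy = pd.DataFrame({
    "f1": [0.1, np.nan, 0.3, 0.4],
    "f2": [np.nan, np.nan, 1.0, 2.0],
    "claim": [0, 1, 0, 1],
})

# Fraction of missing entries per column, largest first.
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)

# Number of rows that contain at least one missing value.
rows_with_na = int(toy.isna().any(axis=1).sum())
print(rows_with_na)
```

If most rows contain at least one missing value, dropping rows would discard most of the data, which points towards imputation instead.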

##### Class Distribution (Classification Only)

On classification problems we need to know how balanced the class values are. Highly imbalanced problems (a lot more observations for one class than another) are common and may need special handling in the data preparation stage of our project.

In [None]:
# Class Distribution
# start = time.time()
# for i in range(10000):
#     dt_train[:, dt.sum(dt.f.claim), dt.by(dt.f.claim)]
# end = time.time()
# print(end - start)

# Class Distribution in pandas
class_counts = dt_train.to_pandas().groupby('claim').size()
print(class_counts)
# find out how to do this in datatable

- The training dataset appears to be balanced as we have approx. 480K cases in which claims were not made and approx. 477K cases in which claims were made.
- Since the classes are balanced, no special handling for class imbalance is needed in this competition.
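
The balance check can also be expressed as proportions and a single ratio. This is a minimal NumPy sketch on made-up labels; the array `claim` and the name `balance_ratio` are assumptions for illustration only.

```python
import numpy as np

# Made-up labels standing in for the 'claim' column.
claim = np.array([0, 1, 1, 0, 1, 0, 0, 1, 1, 0])

# Count each class and convert the counts to proportions.
classes, counts = np.unique(claim, return_counts=True)
proportions = counts / counts.sum()

for c, n, p in zip(classes, counts, proportions):
    print(f"claim={c}: {n} rows ({p:.1%})")

# Ratio of the rarest to the most common class; 1.0 means perfectly balanced.
balance_ratio = counts.min() / counts.max()
print(balance_ratio)
```

In datatable itself, a grouped count along the lines of `dt_train[:, dt.count(), dt.by(dt.f.claim)]` should give the same table without the pandas conversion; that call has not been run in this notebook.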

##### Calculate the mean, minimum and maximum of each column in which claims were made (i.e. claim = 1)

In [None]:
dt_train[dt.f.claim == 1, :].mean()

In [None]:
dt_train[dt.f.claim == 1, :].min()

In [None]:
dt_train[dt.f.claim == 1, :].max()

##### Calculate the mean, minimum and maximum of each column in which claims were not made (i.e. claim = 0)

In [None]:
dt_train[dt.f.claim == 0, :].mean()

In [None]:
dt_train[dt.f.claim == 0, :].min()

In [None]:
dt_train[dt.f.claim == 0, :].max()



#### Correlations Between Attributes
Correlation describes how two variables change together. The most common measure is Pearson's correlation coefficient, which captures linear association and is easiest to interpret when the attributes are roughly normally distributed. A value of -1 or 1 indicates a perfect negative or positive linear correlation respectively, whereas a value of 0 indicates no linear correlation. Some machine learning algorithms, such as linear and logistic regression, can perform poorly when the dataset contains highly correlated attributes. As such, it is a good idea to review all of the pairwise correlations of the attributes in our dataset.

In [None]:
start = time.time()

# Pairwise Pearson correlations
correlations = dt_train.to_pandas().corr(method='pearson')
print(correlations)

end = time.time()
print(end - start)

In [None]:
numeric_data = dt_train[:, [int, float]]
numeric_ncols = numeric_data.ncols
numeric_names = list(numeric_data.names)
corr_matrix = dt.Frame([[None] * numeric_ncols] * (numeric_ncols + 1), names=['Columns'] + numeric_names)
corr_matrix[:, 0] = dt.Frame(numeric_names)

for i in range(numeric_data.ncols):
    for j in range(i, numeric_data.ncols):
        corr_matrix[i, j+1] = numeric_data[:, dt.corr(dt.f[i], dt.f[j])]
        corr_matrix[j, i+1] = corr_matrix[i, j+1]

corr_matrix

- To do: list the attribute combinations which are positively correlated (i.e. > 0.5).
- To do: list the attribute combinations which are negatively correlated.
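
One possible way to produce these two lists is to flatten the upper triangle of the correlation matrix and filter it by a threshold. The sketch below uses pandas on synthetic columns; the data, the column names and the symmetric cut-off of -0.5 for negative pairs are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=500)

# Synthetic columns with known relationships (not competition data).
toy = pd.DataFrame({
    "f1": a,
    "f2": a + 0.1 * rng.normal(size=500),   # strongly tied to f1
    "f3": -a + 0.1 * rng.normal(size=500),  # strongly inverse to f1
    "f4": rng.normal(size=500),             # unrelated noise
})

corr = toy.corr(method="pearson")

# Keep only the upper triangle so each pair appears once, then flatten.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack()

threshold = 0.5  # cut-off from the note above; -threshold is an assumed mirror
positive_pairs = pairs[pairs > threshold]
negative_pairs = pairs[pairs < -threshold]
print(positive_pairs)
print(negative_pairs)
```

The same filtering could be applied to `correlations`, the pandas matrix computed earlier, once the 'id' and 'claim' columns have been handled as desired.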

### Skew of Univariate Distributions

Skew describes asymmetry in a distribution: compared with a symmetric Gaussian (normal or bell curve) shape, a skewed distribution is stretched or squashed in one direction. Many machine learning algorithms work best when their inputs are roughly Gaussian. Knowing that an attribute is skewed may allow us to perform data preparation to correct the skew and later improve the accuracy of our models.
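
To make the idea concrete, here is a minimal NumPy sketch of the moment-based skewness, g1 = m3 / m2^1.5, and of how a log transform can reduce right skew. The helper `sample_skewness` and the synthetic log-normal data are assumptions for illustration; the estimator datatable uses may differ in its bias correction.

```python
import numpy as np

def sample_skewness(x):
    """Moment-based skewness g1 = m3 / m2**1.5, ignoring NaN values."""
    x = np.asarray(x, dtype=float)
    x = x[~np.isnan(x)]
    centered = x - x.mean()
    m2 = np.mean(centered ** 2)  # second central moment (variance)
    m3 = np.mean(centered ** 3)  # third central moment
    return m3 / m2 ** 1.5

# Synthetic right-skewed data (log-normal), not competition data.
rng = np.random.default_rng(42)
right_skewed = rng.lognormal(mean=0.0, sigma=1.0, size=10_000)

skew_before = sample_skewness(right_skewed)
skew_after = sample_skewness(np.log1p(right_skewed))
print(skew_before, skew_after)
```

A value near 0 suggests a roughly symmetric column; large positive values indicate a long right tail, which a transform such as `log1p` can shorten for non-negative data.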

In [None]:
dt_train.skew()

In [None]:
dt_train.kurt()

In [None]:
dt_train.nunique()

In [None]:
numeric_data = dt_train[:, [int, float]]
numeric_ncols = numeric_data.ncols
numeric_names = list(numeric_data.names)
cov_matrix = dt.Frame([[None] * numeric_ncols] * (numeric_ncols + 1), names=['Columns'] + numeric_names)
cov_matrix[:, 0] = dt.Frame(numeric_names)

for i in range(numeric_data.ncols):
    for j in range(i, numeric_data.ncols):
        # use dt.cov (not dt.corr) so this matrix holds covariances
        cov_matrix[i, j+1] = numeric_data[:, dt.cov(dt.f[i], dt.f[j])]
        cov_matrix[j, i+1] = cov_matrix[i, j+1]

cov_matrix

### Understand our data with visualization

In [None]:
# Univariate Histograms
# dt_train.to_pandas().hist()
# plt.show()

In [None]:
# Univariate Density Plots
# dt_train.to_pandas().plot(kind='density', subplots=True, layout=(3,3), sharex=False)
# plt.show()

In [None]:
# Box and Whisker Plots
# dt_train.to_pandas().plot(kind='box', subplots=True, layout=(3,3), sharex=False, sharey=False)
# plt.show()

In [None]:
# split into train and test sets
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
# https://stackoverflow.com/questions/63022043/how-to-split-datatable-dataframe-into-train-and-test-dataset-in-python
from sklearn.model_selection import train_test_split

X = dt_train[:, [col for col in dt_train.names if col != 'claim']]
y = dt_train[:, -1]

X = X.to_numpy()
y = y.to_numpy()

# dt_df = dt_train.to_numpy()
# fix random_state so the split (and the scores below) can be reproduced
X_train, X_validation, y_train, y_validation = train_test_split(X, y, test_size=0.3, random_state=1)

X_train = dt.Frame(X_train)
X_validation = dt.Frame(X_validation)
y_train = dt.Frame(y_train)
y_validation = dt.Frame(y_validation)

In [None]:
# X_train.shape
# X_validation.shape
# y_train.shape
# y_validation.shape

In [None]:
# Train an FTRL model, model_ftrl_1, on X_train and y_train; predictions for the validation and test sets are made in the next cells
from datatable.models import Ftrl

model_ftrl_1 = Ftrl()
model_ftrl_1.fit(X_train, y_train)
model_ftrl_1

In [None]:
prediction_validation_1 = model_ftrl_1.predict(X_validation)
prediction_validation_1.head()

In [None]:
X_test = dt_test[:,:]
X_test = X_test.to_numpy()
X_test = dt.Frame(X_test)

prediction_test_1 = model_ftrl_1.predict(X_test)
prediction_test_1.head()

In [None]:
# Display the feature importances of model_ftrl_1 in descending order (the logloss on the validation set is calculated further below)
model_ftrl_1.feature_importances[:, :, dt.sort(-dt.f.feature_importance)]

In [None]:
# Print all the column names and column types of data in column-name : column-type format
for i in range(dt_train.ncols):
    print(f'{dt_train.names[i]} : {dt_train.types[i].name}')

In [None]:
preds = dt.cbind(y_validation, prediction_validation_1)
# print(preds)  # worth printing: it shows that 'claim' was renamed to C0 by the numpy round trip
preds[:, -dt.mean(dt.f.C0 * dt.math.log(dt.f['True']) + (1-dt.f.C0) * dt.math.log(dt.f['False']))][0, 0]
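
The expression above is the binary log loss, -mean(y * log(p) + (1 - y) * log(1 - p)), where p is the predicted probability of a claim (the 'True' column) and 1 - p is the 'False' column. A minimal NumPy sketch of the same formula follows; the helper name `binary_log_loss`, the clipping constant `eps` and the example values are assumptions for illustration.

```python
import numpy as np

def binary_log_loss(y_true, p_pred, eps=1e-15):
    """Mean negative log-likelihood of binary labels under predicted probabilities."""
    y = np.asarray(y_true, dtype=float)
    # Clip so that log(0) never occurs for over-confident predictions.
    p = np.clip(np.asarray(p_pred, dtype=float), eps, 1 - eps)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

# Made-up labels and predictions.
y_example = [1, 0, 1, 0]
p_example = [0.9, 0.1, 0.8, 0.3]

loss = binary_log_loss(y_example, p_example)
baseline = binary_log_loss(y_example, [0.5] * 4)  # constant 0.5 prediction
print(loss, baseline)
```

A constant prediction of 0.5 always scores log(2), so a useful model should come in below that baseline.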

In [None]:
submission_ids = dt_submission['id']
print(submission_ids)

In [None]:
# Create submission_1 in the submission format of the competition, write it as submission_1.csv and submit it on Kaggle
submission_1 = dt.Frame(id=submission_ids, claim=prediction_test_1['True'])
submission_1.to_csv('submission_1.csv')
submission_1.head()
# submission scored 0.79455

In [None]:
# Train another FTRL model, model_ftrl_2, with nepochs=3 and nbins=10**8; display its feature importances, evaluate its logloss on the validation data, and submit its test predictions as submission_2
model_ftrl_2 = Ftrl(nepochs=3, nbins=10**8)
model_ftrl_2.fit(X_train, y_train)
model_ftrl_2

In [None]:
model_ftrl_2.feature_importances[:, :, dt.sort(-dt.f.feature_importance)]

In [None]:
prediction_validation_2 = model_ftrl_2.predict(X_validation)
prediction_validation_2.head()

In [None]:
prediction_test_2 = model_ftrl_2.predict(X_test)
prediction_test_2.head()

In [None]:
preds = dt.cbind(y_validation, prediction_validation_2)
preds[:, -dt.mean(dt.f.C0 * dt.math.log(dt.f['True']) + (1-dt.f.C0) * dt.math.log(dt.f['False']))][0, 0]

In [None]:
submission_2 = dt.Frame(id=submission_ids, claim=prediction_test_2['True'])
submission_2.to_csv('submission_2.csv')
submission_2.head()

In [None]:
# Submit an ensemble of model_ftrl_1 and model_ftrl_2 by averaging the predictions as submission_ensemble
submission_ensemble = dt.cbind(submission_1, submission_2)
# cbind renames the duplicated columns to 'id.0' and 'claim.0', so average 'claim' with 'claim.0'
submission_ensemble[:, dt.update(claim = 0.5 * dt.f['claim'] + 0.5 * dt.f['claim.0'])]
del submission_ensemble[:, ['id.0', 'claim.0']]
submission_ensemble.to_csv('submission_ensemble.csv')
submission_ensemble.head()
# submission scored 0.79485
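
Averaging an ensemble means combining the prediction columns of different models, element by element. The NumPy sketch below shows an equal-weight and a weighted blend on made-up predictions; the helper `blend` and the weights are illustrative assumptions, not tuned values.

```python
import numpy as np

def blend(predictions, weights=None):
    """Weighted average of several prediction vectors of equal length."""
    stacked = np.vstack(predictions)  # shape: (n_models, n_rows)
    return np.average(stacked, axis=0, weights=weights)

# Made-up predicted probabilities from two models for three rows.
p1 = np.array([0.20, 0.60, 0.90])
p2 = np.array([0.40, 0.50, 0.70])

equal_blend = blend([p1, p2])                          # 0.5 / 0.5 weights
weighted_blend = blend([p1, p2], weights=[0.75, 0.25])
print(equal_blend)
print(weighted_blend)
```

Unequal weights are usually chosen from validation scores rather than from the leaderboard, to avoid overfitting to the public test split.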

In [None]:
dt_train.countna()

In [None]:
# Impute missing values with a constant

from sklearn.impute import SimpleImputer
train_constant = dt_train.copy()
# strategy='constant' fills every missing value with fill_value (0 by default for numeric data)
constant_imputer = SimpleImputer(strategy='constant')
train_constant[:,:] = constant_imputer.fit_transform(train_constant)
train_constant.countna().sum()

In [None]:
from sklearn.impute import SimpleImputer
train_most_frequent = dt_train.copy()
# strategy='most_frequent' fills with each column's most common value; 'mean' and 'median' are also available
most_frequent_imputer = SimpleImputer(strategy='most_frequent')
train_most_frequent[:,:] = most_frequent_imputer.fit_transform(train_most_frequent)
train_most_frequent.countna().sum()
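
Whichever strategy is used, the fill values are normally learned from the training rows only and then reused on validation and test rows. The NumPy sketch below illustrates that fit-then-apply pattern for mean and median imputation; the helper names and the tiny arrays are assumptions for illustration and do not replace `SimpleImputer`.

```python
import numpy as np

def fit_column_fill_values(X, strategy="mean"):
    """Learn one fill value per column, ignoring NaN entries."""
    X = np.asarray(X, dtype=float)
    if strategy == "mean":
        return np.nanmean(X, axis=0)
    if strategy == "median":
        return np.nanmedian(X, axis=0)
    raise ValueError(f"unknown strategy: {strategy}")

def apply_fill_values(X, fill_values):
    """Return a copy of X with NaN entries replaced by the column fill value."""
    X = np.array(X, dtype=float)  # np.array copies, so the input is untouched
    rows, cols = np.where(np.isnan(X))
    X[rows, cols] = fill_values[cols]
    return X

# Made-up training and unseen rows.
X_fit = np.array([[1.0, np.nan], [3.0, 4.0], [np.nan, 8.0]])
X_new = np.array([[np.nan, np.nan], [5.0, 1.0]])

fill = fit_column_fill_values(X_fit, strategy="mean")
X_fit_imputed = apply_fill_values(X_fit, fill)
X_new_imputed = apply_fill_values(X_new, fill)
print(fill)
print(X_new_imputed)
```

Reusing the training fill values keeps information from the validation and test sets out of the preprocessing step.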

In [None]:
# https://www.kaggle.com/general/76911
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
MiceImputed = dt_train.copy(deep=True)
mice_imputer = IterativeImputer()
MiceImputed[:, :] = mice_imputer.fit_transform(dt_train)

### Notes and Learning Opportunities

- Replace missing values with the mean and evaluate its impact on model evaluation and performance. (pending)
- Drop missing values and evaluate the impact on model evaluation and performance. (pending)
- Can we use MICE to impute missing values? What will be its impact on model evaluation and performance? (pending)
- Could we use R to impute missing values using data.table and then use the imputed dataset in Python? (pending) https://www.datacamp.com/community/tutorials/using-both-python-r Can we embed R in Python using rpy2, do the missing value imputation in R and pass the imputed dataset on to Python? (find out)
- Could we use number of missing values in each column as a feature? (pending)
- Did we try Hyperparameter optimization using Optuna? (pending)
- How to do visualization on large datasets? (find out)
- Do we have any missing values in test data? Do we need to check it? If not, why not? (find out)
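
One reading of the note about missing-value counts is a per-row feature: how many of a row's inputs are missing. The pandas sketch below shows that idea on a made-up frame; the column names `n_missing` and `any_missing` are hypothetical, and whether such features help the model here is untested.

```python
import numpy as np
import pandas as pd

# Made-up feature frame standing in for the f1..f118 columns.
toy = pd.DataFrame({
    "f1": [0.1, np.nan, 0.3],
    "f2": [np.nan, np.nan, 1.0],
    "f3": [5.0, 6.0, 7.0],
})
feature_cols = ["f1", "f2", "f3"]

# Count the missing inputs in each row before any imputation is applied.
toy["n_missing"] = toy[feature_cols].isna().sum(axis=1)

# A simpler binary flag: does the row have any missing input at all?
toy["any_missing"] = toy["n_missing"] > 0
print(toy)
```

The counts have to be computed before imputation, and in the same way for the training and test data, since the imputed frame no longer records where the gaps were.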

# References and Credits

- https://www.kaggle.com/bextuychiev/7-coolest-packages-top-kagglers-are-using#2.-Datatable
- https://www.kaggle.com/sudalairajkumar/getting-started-with-python-datatable
- Note: datatable cannot be used for imputation because it does not have functionality to handle missing values yet.
- https://github.com/vopani/datatableton#set-04--frame-operations--beginner--exercises-31-40
- http://webcache.googleusercontent.com/search?q=cache:okPGVK9Fxd0J:https://towardsdatascience.com/introducing-datatableton-python-datatable-tutorials-exercises-a0887f4323b0&hl=en&gl=in&strip=1&vwsrc=0
- https://datatable.readthedocs.io/en/latest/manual/comparison_with_pandas.html#missing-functionality
- https://www.kaggle.com/chayan8/missing-value-imputation-using-mice-knn-ckd-data
- https://www.kaggle.com/general/187601
- https://www.kaggle.com/melanie7744/tps9-how-to-transform-your-data try each and test the impact on the model
- http://webcache.googleusercontent.com/search?q=cache:NtZTpcPxjRUJ:https://towardsdatascience.com/speed-up-your-data-analysis-with-pythons-datatable-package-56e071a909e9&hl=en&gl=in&strip=1&vwsrc=0
- https://github.com/vopani/datatableton#set-05--column-aggregations--beginner--exercises-41-50
- http://webcache.googleusercontent.com/search?q=cache:uPQQGanfFDUJ:https://towardsdatascience.com/beyond-pandas-spark-dask-vaex-and-other-big-data-technologies-battling-head-to-head-a453a1f8cc13&hl=en&gl=in&strip=1&vwsrc=0
