# **Tabular Boilerplate Notebook**

Tabular modeling takes data in the form of a table (like a spreadsheet or CSV), where the objective is to predict the value in one column based on the values in the other columns. This notebook will serve as a boilerplate handler for tabular data modeling, with sections laid out for each major step of the modeling process.

**REMEMBER**: This boilerplate is just that: boilerplate! It's a good idea to perform your own exploration in a manner that's specific to your given dataset. The default boilerplate uses the Titanic dataset from Kaggle as an example.

## **Setup**

This is the section for setting up your work environment, with boilerplate setups for a number of mainstream options. Don't see one you like? Add your own!

In [1]:
%load_ext autoreload
%autoreload 2

%matplotlib inline

### **Imports and Installs**

Declare all your project's required imports and installs here:

In [2]:
# Installs
!pip install -Uqq fastai waterfallcharts treeinterpreter dtreeviz

In [89]:
# Imports
from pandas.api.types import is_string_dtype, is_numeric_dtype, is_categorical_dtype
from fastai.tabular.all import *
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from dtreeviz.trees import *
from IPython.display import Image, display_svg, SVG
import numpy as np
import math
from scipy import stats

In [None]:
# Statsmodels install (only necessary if module is needed and Colab gives trouble)
! pip install --upgrade Cython
! pip install --upgrade git+https://github.com/statsmodels/statsmodels

In [None]:
# Set up some constraints here, if desired
pd.options.display.max_rows = 20
pd.options.display.max_columns = 8

### **Colab Setup**

You can set up the Google Colab environment for data by mounting your Google Drive:

In [None]:
from google.colab import drive
drive.mount('/content/gdrive')

%cd gdrive/My Drive/



---



## **Data Collection**

Data specific to your current task can be collected here. There'll be different setups depending on where you're running this notebook, but the output here will be used for further data exploration.

### **Kaggle Datasets**

Get yourself started with some Kaggle datasets. First, set up your API credentials:

In [9]:
creds = {"username":"INSERT_USERNAME_HERE","key":"INSERT_API_KEY_HERE"}

Then create the `.kaggle` directory and an empty `kaggle.json`, which is where Kaggle looks for credentials when making API calls:

In [10]:
!mkdir -p /root/.kaggle
!touch /root/.kaggle/kaggle.json

!ls /root/.kaggle

kaggle.json


Then write the credentials to `kaggle.json` with the correct permissions setup to enable access to Kaggle datasets:

In [11]:
import json
import zipfile
import os

with open('/root/.kaggle/kaggle.json', 'w') as file:
    json.dump(creds, file)

!chmod 600 /root/.kaggle/kaggle.json
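
The directory, file, and permission steps above can also be done in pure Python, which avoids shell differences between environments. This is a minimal sketch: `write_kaggle_credentials` is a hypothetical helper (not part of the Kaggle API), and the demo writes to a throwaway temp directory rather than `/root/.kaggle`.

```python
import json
import os
import stat
import tempfile
from pathlib import Path

def write_kaggle_credentials(creds, config_dir):
    """Write creds to <config_dir>/kaggle.json, readable by the owner only."""
    config_dir = Path(config_dir)
    config_dir.mkdir(parents=True, exist_ok=True)  # no error if it already exists
    cred_file = config_dir / "kaggle.json"
    with open(cred_file, "w") as f:
        json.dump(creds, f)
    os.chmod(cred_file, stat.S_IRUSR | stat.S_IWUSR)  # equivalent to chmod 600
    return cred_file

# Demo against a temporary directory so real credentials are untouched
demo_dir = tempfile.mkdtemp()
demo_file = write_kaggle_credentials(
    {"username": "INSERT_USERNAME_HERE", "key": "INSERT_API_KEY_HERE"}, demo_dir)
print(demo_file.name, oct(demo_file.stat().st_mode & 0o777))
```

In the notebook itself you would pass `creds` and `'/root/.kaggle'` instead of the demo values.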

In [6]:
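# `!export` only affects a throwaway subshell; set the variable for this process instead.
os.environ['KAGGLE_CONFIG_DIR'] = '/root/.kaggle'  # a directory, not the JSON file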

Finally, install the Kaggle package itself and import its API client:

In [7]:
!pip install kaggle

from kaggle import api



Now let's fetch our Kaggle data:

In [8]:
# Create the data folder if it doesn't already exist
os.makedirs('data', exist_ok=True)

# Let's get the Titanic dataset
api.competition_download_files('titanic', path='data')

# zipfile was already imported alongside json and os
with zipfile.ZipFile("data/titanic.zip", "r") as zip_ref:
    zip_ref.extractall("data")

os.remove('data/titanic.zip')
os.remove('data/gender_submission.csv')

In [9]:
!ls data

test.csv  train.csv


Finally, let's set up some variables for taking the data exploration further:

In [10]:
path = "data"
train_data = "train.csv"



---



## **Exploratory Data Analysis**

EDA can be performed here, where you'll find cells for showing batches, as well as utility functions for displaying certain analytics. It also contains headings to prompt some thinking about possible exploratory approaches. For tabular data in particular, this section will include tree classifiers for model/data fit inspection.

**REMEMBER**: This section is not prescriptive! Add and remove from it as you want and need to.

### **Look at the Data**

In [11]:
# Set up the dataset path (kept separate from `path` so this cell can be re-run safely)
train_path = os.path.join(path, train_data)

# And read it into a DataFrame
df = pd.read_csv(train_path, low_memory=False)

In [82]:
print(df.columns)
len(df)

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')


891

We can then set our dependent variable, the one we want to predict, and assign every other feature to the set of independent variables.

In [73]:
dep_var = "Survived"
ind_var = [c for c in df.columns if c != dep_var]

It's always important to take a look at some of the entries themselves so that we can develop a better intuition for what we're working with.

In [80]:
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


We can see `NaN` values, which we'll need to decide how we want to handle, as well as pick out any potentially interesting features based on their values. It may also be useful to pick out "magic features", those which have a strong correlation to the target given some relatively simple transformation.
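
Before deciding how to handle missing values, it helps to know how many there are per column. The following is a minimal sketch on a tiny stand-in frame; the column names are borrowed from the Titanic data, but the values (including which entries are missing) are illustrative only.

```python
import numpy as np
import pandas as pd

# Tiny stand-in frame; values are illustrative, not the real dataset
sample = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [None, "C85", None, "C123"],
    "Fare": [7.25, 71.2833, 7.925, 53.1],
})

# Count and share of missing values per column, worst first
missing = sample.isna().sum().sort_values(ascending=False)
missing_share = missing / len(sample)
print(pd.DataFrame({"missing": missing, "share": missing_share}))
```

Running the same two lines against `df` gives a quick view of which columns need imputation, a "missing" indicator, or removal.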

### **Handling Dates**

We need to define how we intend to handle dates, assuming the dataset contains such a feature. We can apply both automatic date feature engineering through the relevant utility functions (here fastai's `add_datepart`) and manual engineering for specific cases.

In [91]:
# A simple example (uses its own frame so the Titanic `df` is not overwritten)
date_df = pd.DataFrame({'date': ['2019-12-04', None, '2019-11-15', '2019-10-24']})
date_df = add_datepart(date_df, 'date')
date_df.head()

Unnamed: 0,Year,Month,Week,Day,Dayofweek,Dayofyear,Is_month_end,Is_month_start,Is_quarter_end,Is_quarter_start,Is_year_end,Is_year_start,Elapsed
0,2019.0,12.0,49.0,4.0,2.0,338.0,False,False,False,False,False,False,1575418000.0
1,,,,,,,False,False,False,False,False,False,
2,2019.0,11.0,46.0,15.0,4.0,319.0,False,False,False,False,False,False,1573776000.0
3,2019.0,10.0,43.0,24.0,3.0,297.0,False,False,False,False,False,False,1571875000.0


In [94]:
# Manually adding relevant date information
date_df['Is_seasonal'] = np.where(date_df['Month'] == 12, 1, 0)
date_df.head()

Unnamed: 0,Year,Month,Week,Day,Dayofweek,Dayofyear,Is_month_end,Is_month_start,Is_quarter_end,Is_quarter_start,Is_year_end,Is_year_start,Elapsed,Is_seasonal
0,2019.0,12.0,49.0,4.0,2.0,338.0,False,False,False,False,False,False,1575418000.0,1
1,,,,,,,False,False,False,False,False,False,,0
2,2019.0,11.0,46.0,15.0,4.0,319.0,False,False,False,False,False,False,1573776000.0,0
3,2019.0,10.0,43.0,24.0,3.0,297.0,False,False,False,False,False,False,1571875000.0,0
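
When only a few date parts are needed, similar features can be derived with pandas alone through the `.dt` accessor. This is a sketch under that assumption, not a replacement for `add_datepart`; the frame name `dates` is hypothetical, and the column names are chosen to mirror the output above.

```python
import pandas as pd

dates = pd.DataFrame({"date": ["2019-12-04", None, "2019-11-15", "2019-10-24"]})
dates["date"] = pd.to_datetime(dates["date"])  # None becomes NaT

# The .dt accessor exposes calendar parts; NaT rows come out as NaN
dates["Year"] = dates["date"].dt.year
dates["Month"] = dates["date"].dt.month
dates["Dayofweek"] = dates["date"].dt.dayofweek  # Monday is 0
# Same December flag as the manual example; a missing month is flagged 0
dates["Is_seasonal"] = (dates["Month"] == 12).astype(int)
print(dates)
```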


### **Handling Rank in Ordinal Columns**

Some categorical data will be ranked (*ordinal columns*), and for these features it may be useful to tell pandas how the categories are ordered.

In [70]:
# As an example, let's take the passenger class for the Titanic dataset
df['Pclass'].unique()

array([3, 1, 2])

In [78]:
# export 
def define_ordinal_rank(df, col, ranks):
  """
  Defines ordinal ranking of a column in a dataset

  Args:
    df: DataFrame to define for
    col: Column to define for
    ranks: An array of the ordinal ranking for col
  """
  df[col] = df[col].astype('category')
  # Reassign rather than using `inplace`, which recent pandas versions no longer accept
  df[col] = df[col].cat.set_categories(ranks, ordered=True)

In [75]:
define_ordinal_rank(df, 'Pclass', [1, 2, 3])
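
To see what the ordering buys us, here is a small sketch on a stand-alone series (the values are illustrative, not read from the dataset). Once the categories are ordered, comparisons and min/max respect the declared ranking, and the integer codes reflect each value's position in it.

```python
import pandas as pd

pclass = pd.Series([3, 1, 2, 3]).astype("category")
pclass = pclass.cat.set_categories([1, 2, 3], ordered=True)

print(pclass.cat.ordered)          # the dtype now carries the ranking
print((pclass < 3).tolist())       # element-wise comparison respects the order
print(pclass.cat.codes.tolist())   # position of each value within the ranking
print(pclass.min(), pclass.max())  # min/max are defined for ordered categories
```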

### **Automated Data Checks**

Some simple checks can be made on the dataset automatically, depending on what it is you're looking for. Some common checks are performed in this section.

#### **Roulette Target**

A "roulette" occurs when there are duplicate rows with different target values. This makes training much more difficult, as target values may be close to random.

To start, we need to define an acceptable proportion of such duplicates. Some are to be expected, since external features that aren't captured in the dataset may still affect the target variable. So first we define the proportion of the dataset we'll accept as duplicates:

In [67]:
ACCEPTANCE_THRES = .02

We then have two options. The first is a simple check of the proportion of duplicates in the dataset against our acceptance threshold. We don't count full duplicates (rows that repeat both the feature values and the target), so let's first set up a function to find the relevant duplicates:

In [60]:
# export
def get_relevant_duplicates(df, dep_var):
  """
  Gets number of duplicates with differing targets
  """
  ind_var = [c for c in df.columns if c != dep_var]

  # We need to sift out full duplicate rows
  poss_dups = df.duplicated(ind_var).sum()
  full_dups = df.duplicated().sum()
  return poss_dups - full_dups
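
To inspect which rows are actually in conflict, rather than just counting them, one option is to group by the feature columns and look for groups with more than one distinct target. This is a sketch on a toy frame with hypothetical column names; `dropna=False` is used so that rows with missing feature values are still grouped.

```python
import pandas as pd

toy = pd.DataFrame({
    "a": [1, 1, 2, 2, 3],
    "b": ["x", "x", "y", "y", "z"],
    "target": [0, 1, 1, 1, 0],
})
features = [c for c in toy.columns if c != "target"]

# For every row, the number of distinct targets seen for its feature combination
n_targets = toy.groupby(features, dropna=False)["target"].transform("nunique")

# Rows whose features also appear elsewhere with a different target
conflicting = toy[n_targets > 1]
print(conflicting)
```

Here only the first two rows conflict: they share the same features but disagree on the target, while the `(2, "y")` pair is a full duplicate and is left out.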

And then we can implement our simple solution:

In [72]:
# Naive, workable solution
relevant_dups = get_relevant_duplicates(df, dep_var)
is_roulette = (relevant_dups / len(df)) > ACCEPTANCE_THRES  # both sides are proportions

print("Dataset is a roulette:", is_roulette)

Dataset is a roulette: False


This approach can work. A more rigorous approach is to apply a statistical hypothesis test against the accepted threshold and see if that holds up!

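For this test, the null hypothesis ($H_0$) is that the proportion of relevant duplicates is at least our accepted threshold, i.e. that the dataset is a roulette. We perform a [1-proportion test](https://www.tutorialspoint.com/statistics/one_proportion_z_test.htm) to find the associated p-value. The alternative hypothesis is that the dataset contains proportionally fewer relevant duplicates than the threshold, so if the p-value is above our significance level we cannot reject $H_0$ and treat the dataset as a roulette: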

In [64]:
# export
import statsmodels.api as sm

ALPHA = .05

def roulette_test(df, dep_var, threshold):
  """
  Performs a 1-proportion z-test on df to check for roulette.
  Null hypothesis is that the proportion of relevant duplicates is
  at least the acceptance threshold (the dataset is a roulette); the
  alternative is that the proportion is smaller than the threshold
  """
  relevant_dups = get_relevant_duplicates(df, dep_var)
  if relevant_dups == 0: return False

  # `value` is the proportion under the null hypothesis, not a count
  _, p_val = sm.stats.proportions_ztest(
      relevant_dups, len(df), value=threshold, alternative='smaller')

  return p_val > ALPHA

In [68]:
is_roulette = roulette_test(df, dep_var, ACCEPTANCE_THRES)
print("Dataset is a roulette:", is_roulette)

Dataset is a roulette: False
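
To make the test less of a black box, the same kind of one-sided, one-proportion z-test can be written out by hand. This is a sketch using only the standard library; `one_proportion_ztest_smaller` is a hypothetical helper, the counts are illustrative, and it uses the sample proportion for the standard error (other variants use the null proportion instead).

```python
import math

def one_proportion_ztest_smaller(count, nobs, p0):
    """z statistic and p-value for the alternative: true proportion < p0."""
    p_hat = count / nobs
    # Standard error from the sample proportion (assumes 0 < count < nobs)
    se = math.sqrt(p_hat * (1 - p_hat) / nobs)
    z = (p_hat - p0) / se
    # Standard normal CDF evaluated at z gives the lower-tail p-value
    p_value = 0.5 * (1 + math.erf(z / math.sqrt(2)))
    return z, p_value

# Illustrative numbers: 5 conflicting duplicates in 891 rows, 2% threshold
z, p_value = one_proportion_ztest_smaller(5, 891, 0.02)
print(round(z, 2), p_value)
```

A strongly negative `z` and a p-value below the significance level mean the observed proportion sits well under the threshold, so the roulette hypothesis is rejected.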


This check may or may not be relevant depending on the context of the dataset. Consider an actual roulette wheel's results or the results of a series of horse races, which may well contain more duplicates with differing targets than our threshold allows. In such a scenario, it is worth knowing this before pursuing the data science project further, as the dataset's entropy may be too high to warrant further work.


### **Basic Data Cleaning**

In [84]:
print("DF before:", len(df))

# Remove duplicates
df.drop_duplicates(subset=ind_var, inplace=True)
print("DF after drop duplicates:", len(df))

DF before: 891
DF after drop duplicates: 891




---



## **Model**

Model work can be performed here, with utilities to help with cross-validation and architecture construction.
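
As a starting point for cross-validation, here is a minimal sketch using scikit-learn's `cross_val_score`. It makes several assumptions: the data is synthetic rather than the Titanic set, a `RandomForestClassifier` is used because a binary target like `Survived` is a classification problem, and the hyperparameters are arbitrary.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data: 200 rows, 4 numeric features, binary target
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

# 5-fold cross-validation: each fold is held out once for scoring
model = RandomForestClassifier(n_estimators=50, random_state=42)
scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
print(scores.mean(), scores.std())
```

Reporting the spread alongside the mean gives a rough sense of how stable the model is across folds before any tuning is attempted.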



---



## **Inference and Deployment**

Here the model can finally be put to use, as well as exported for deployment in an external application.



---



## **Exports and Clean Up**

Here you can export any cells marked with the `#export` comment using `notebook2script.py`, and clean up any environmental changes, such as data downloads to a cloud drive.

### **Exports**

In [None]:
!python notebook2script.py Tabular.ipynb

### **Clean Up**

In [None]:
# Tear down the data folder
!rm -rf data
!ls