## Introduction

This December's TPS competition is a multiclass classification task with a very imbalanced target. The data was synthetically generated from the dataset of the [Forest Cover Type Prediction](https://www.kaggle.com/c/forest-cover-type-prediction) competition.

## Original dataset description

I modified the formatting of the text a little. **Please note that some of this info does not apply to the data in this competition (TPS Dec)**, such as the number of observations in the train and test sets, the range of the hillshades, etc.

> The study area includes four wilderness areas located in the Roosevelt National Forest of northern Colorado. Each observation is a 30m x 30m patch. You are asked to predict an integer classification for the forest cover type. The seven types are:
> 
>  - 1 - Spruce/Fir
>  - 2 - Lodgepole Pine
>  - 3 - Ponderosa Pine
>  - 4 - Cottonwood/Willow
>  - 5 - Aspen
>  - 6 - Douglas-fir
>  - 7 - Krummholz
> 
> The training set (15120 observations) contains both features and the Cover_Type. The test set contains only the features. You must predict the Cover_Type for every row in the test set (565892 observations).
> 
> Data Fields:
>  - **Elevation** - Elevation in meters
>  - **Aspect** - Aspect in degrees azimuth
>  - **Slope** - Slope in degrees
>  - **Horizontal_Distance_To_Hydrology** - Horz Dist to nearest surface water features
>  - **Vertical_Distance_To_Hydrology** - Vert Dist to nearest surface water features
>  - **Horizontal_Distance_To_Roadways** - Horz Dist to nearest roadway
>  - **Hillshade_9am** - Hillshade index at 9am, summer solstice (0 to 255 index)
>  - **Hillshade_Noon** - Hillshade index at noon, summer solstice (0 to 255 index)
>  - **Hillshade_3pm** - Hillshade index at 3pm, summer solstice (0 to 255 index)
>  - **Horizontal_Distance_To_Fire_Points** - Horz Dist to nearest wildfire ignition points
>  - **Wilderness_Area** - Wilderness area designation (4 binary columns, 0 = absence or 1 = presence)
>  - **Soil_Type** - Soil Type designation (40 binary columns, 0 = absence or 1 = presence)
>  - **Cover_Type** - Forest Cover Type designation (7 types, integers 1 to 7)
> 
> The wilderness areas are:
> 
>  - 1 - Rawah Wilderness Area
>  - 2 - Neota Wilderness Area
>  - 3 - Comanche Peak Wilderness Area
>  - 4 - Cache la Poudre Wilderness Area
> 
> The soil types are:
> 
>  - 1 Cathedral family - Rock outcrop complex, extremely stony.
>  - 2 Vanet - Ratake families complex, very stony.
>  - 3 Haploborolis - Rock outcrop complex, rubbly.
>  - 4 Ratake family - Rock outcrop complex, rubbly.
>  - 5 Vanet family - Rock outcrop complex complex, rubbly.
>  - 6 Vanet - Wetmore families - Rock outcrop complex, stony.
>  - 7 Gothic family.
>  - 8 Supervisor - Limber families complex.
>  - 9 Troutville family, very stony.
>  - 10 Bullwark - Catamount families - Rock outcrop complex, rubbly.
>  - 11 Bullwark - Catamount families - Rock land complex, rubbly.
>  - 12 Legault family - Rock land complex, stony.
>  - 13 Catamount family - Rock land - Bullwark family complex, rubbly.
>  - 14 Pachic Argiborolis - Aquolis complex.
>  - 15 unspecified in the USFS Soil and ELU Survey.
>  - 16 Cryaquolis - Cryoborolis complex.
>  - 17 Gateview family - Cryaquolis complex.
>  - 18 Rogert family, very stony.
>  - 19 Typic Cryaquolis - Borohemists complex.
>  - 20 Typic Cryaquepts - Typic Cryaquolls complex.
>  - 21 Typic Cryaquolls - Leighcan family, till substratum complex.
>  - 22 Leighcan family, till substratum, extremely bouldery.
>  - 23 Leighcan family, till substratum - Typic Cryaquolls complex.
>  - 24 Leighcan family, extremely stony.
>  - 25 Leighcan family, warm, extremely stony.
>  - 26 Granile - Catamount families complex, very stony.
>  - 27 Leighcan family, warm - Rock outcrop complex, extremely stony.
>  - 28 Leighcan family - Rock outcrop complex, extremely stony.
>  - 29 Como - Legault families complex, extremely stony.
>  - 30 Como family - Rock land - Legault family complex, extremely stony.
>  - 31 Leighcan - Catamount families complex, extremely stony.
>  - 32 Catamount family - Rock outcrop - Leighcan family complex, extremely stony.
>  - 33 Leighcan - Catamount families - Rock outcrop complex, extremely stony.
>  - 34 Cryorthents - Rock land complex, extremely stony.
>  - 35 Cryumbrepts - Rock outcrop - Cryaquepts complex.
>  - 36 Bross family - Rock land - Cryumbrepts complex, extremely stony.
>  - 37 Rock outcrop - Cryumbrepts - Cryorthents complex, extremely stony.
>  - 38 Leighcan - Moran families - Cryaquolls complex, extremely stony.
>  - 39 Moran family - Cryorthents - Leighcan family complex, extremely stony.
>  - 40 Moran family - Cryorthents - Rock land complex, extremely stony.

The soil type descriptions could be worth some feature engineering: for example, how stony the soil is, or which family it belongs to. I intend to explore this in the future.
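As a rough idea of what that feature engineering could look like, here is a minimal sketch that turns the stoniness wording into an ordinal feature. The `stoniness` helper, the 0 to 3 scale, and the toy dataframe are my own assumptions for illustration; only a handful of the 40 descriptions are copied in, and terms like "rubbly" or "bouldery" are simply mapped to 0 here.

```python
import pandas as pd

# A few descriptions copied from the list above (not all 40).
soil_descriptions = {
    1: 'Cathedral family - Rock outcrop complex, extremely stony.',
    2: 'Vanet - Ratake families complex, very stony.',
    3: 'Haploborolis - Rock outcrop complex, rubbly.',
    6: 'Vanet - Wetmore families - Rock outcrop complex, stony.',
    7: 'Gothic family.',
}

# Hypothetical ordinal scale; longer phrases are checked first so that
# 'extremely stony' is not matched as plain 'stony'.
STONINESS_LEVELS = [('extremely stony', 3), ('very stony', 2), ('stony', 1)]

def stoniness(description):
    """Return an ordinal stoniness level (0 if no stoniness term is found)."""
    text = description.lower()
    for phrase, level in STONINESS_LEVELS:
        if phrase in text:
            return level
    return 0

soil_stoniness = {i: stoniness(d) for i, d in soil_descriptions.items()}

# Toy stand-in for the one-hot soil columns of the real data.
toy = pd.DataFrame({
    'Soil_Type1': [1, 0, 0],
    'Soil_Type2': [0, 1, 0],
    'Soil_Type7': [0, 0, 1],
})

# Weighted sum of the binary columns gives one stoniness value per row.
toy['stoniness'] = 0
for i, level in soil_stoniness.items():
    col = f'Soil_Type{i}'
    if col in toy.columns:
        toy['stoniness'] += toy[col] * level
```

Because some rows in this competition have more than one soil type set, the weighted sum can exceed 3 on the real data; taking the maximum instead would be another option.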

## Preparation

In [None]:
import os
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
sns.set()

In [None]:
plt.rcParams['figure.figsize'] = (16, 4)

In [None]:
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

## Loading the data

In [None]:
soil_type_vars = [f'Soil_Type{i}' for i in range(1, 41)]
wilderness_area_vars = [f'Wilderness_Area{i}' for i in range(1, 5)]
binary_vars = soil_type_vars + wilderness_area_vars
numerical_vars = ['Elevation', 'Aspect', 'Slope', 'Horizontal_Distance_To_Hydrology', 'Vertical_Distance_To_Hydrology', 'Horizontal_Distance_To_Roadways', 'Hillshade_9am', 'Hillshade_Noon', 'Hillshade_3pm', 'Horizontal_Distance_To_Fire_Points']
features = numerical_vars + binary_vars
target = 'Cover_Type'

In [None]:
dtypes = {
    'Id': np.int32,
    'Elevation': np.int16,
    'Aspect': np.int16,
    'Slope': np.int8,
    'Horizontal_Distance_To_Hydrology': np.int16,
    'Vertical_Distance_To_Hydrology': np.int16,
    'Horizontal_Distance_To_Roadways': np.int16,
    'Hillshade_9am': np.int16,
    'Hillshade_Noon': np.int16,
    'Hillshade_3pm': np.int16,
    'Horizontal_Distance_To_Fire_Points': np.int16,
    'Cover_Type': np.int8,
}
binary_vars_dtypes = {c: np.int8 for c in binary_vars}
dtypes.update(binary_vars_dtypes)

In [None]:
train = pd.read_csv('/kaggle/input/tabular-playground-series-dec-2021/train.csv', dtype=dtypes)
test = pd.read_csv('/kaggle/input/tabular-playground-series-dec-2021/test.csv', dtype=dtypes)

# I created some sampled dataframes to use with some visualizations that would take too long to process
# if I were to use all the data.
train_50k = train.sample(n=50_000, random_state=42)
test_50k = test.sample(n=50_000, random_state=42)

In [None]:
train.info()

In [None]:
test.info()

## Missing Values

Number of missing values for each column in the train and test sets.

In [None]:
train_na_count = train.isna().sum().to_frame('na_count')
test_na_count = test.isna().sum().to_frame('na_count')
pd.concat([train_na_count, test_na_count], axis=1, keys=['train', 'test'], names=['set'])

No missing values. `Cover_Type` is missing from the test set, obviously, as it's the target.

## Target distribution

From the count plot below, we can see that the target (`Cover_Type`) has 7 classes and is very imbalanced.

In [None]:
fig, axs = plt.subplots(1, 2)
sns.countplot(x=train[target], ax=axs[0])
sns.countplot(x=train[target], ax=axs[1])
axs[0].set(title='Target Distribution', yscale='linear');
axs[1].set(title='Target Distribution (Log scale on y axis)', yscale='log');

In [None]:
target_count = train[target].value_counts().to_frame('count')
target_count['ratio'] = target_count['count'] / target_count['count'].sum()
target_count.style.format('{:.4%}', subset='ratio').background_gradient('YlGn')

Classes 1 and 2 account for 93.25% of all data. Class 5 only has a single observation. 😱
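A class with a single observation cannot appear in both a training fold and a validation fold, so it may be worth setting such classes aside before cross-validation. The sketch below shows one way to find and filter them; the `min_count` threshold and the toy target are assumptions for illustration, not something I have validated on this competition.

```python
import pandas as pd

# Toy target standing in for train['Cover_Type'].
toy_target = pd.Series([1, 1, 1, 2, 2, 2, 2, 3, 3, 5], name='Cover_Type')

min_count = 2  # hypothetical threshold: at least one row per side of a split

counts = toy_target.value_counts()
rare_classes = counts[counts < min_count].index.tolist()

# Boolean mask that keeps only rows from sufficiently populated classes.
keep_mask = ~toy_target.isin(rare_classes)
filtered = toy_target[keep_mask]
```

On the real data, the same mask could be applied to `train` before building the folds.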

## Binary variables

By counting the occurrences of each binary variable in both train and test data, we can see that most of them are predominantly **0**, except for `Wilderness_Area1`, which is more balanced, and `Wilderness_Area3`, which is mostly **1**. Also, `Soil_Type7` and `Soil_Type15` are always **0**, so they carry no information and can be safely dropped.

In [None]:
train_bin_counts = [train[c].value_counts().to_frame(c).sort_index() for c in binary_vars]
test_bin_counts = [test[c].value_counts().to_frame(c).sort_index() for c in binary_vars]
train_bin_counts_df = pd.concat(train_bin_counts, axis=1)
test_bin_counts_df = pd.concat(test_bin_counts, axis=1)
all_bin_counts_df = pd.concat([train_bin_counts_df.T, test_bin_counts_df.T], keys=['train', 'test'], names=['set'], axis=1)
all_bin_counts_df.style.background_gradient('YlGn').format(precision=0)
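Columns that never change, like the always-zero soil types mentioned above, can be detected and dropped programmatically. This is a minimal sketch on a toy dataframe (the toy values are made up); on the real data the same check should be run on train and test together before dropping anything.

```python
import pandas as pd

# Toy stand-in for a few binary columns.
toy = pd.DataFrame({
    'Soil_Type1': [0, 1, 0],
    'Soil_Type7': [0, 0, 0],
    'Soil_Type15': [0, 0, 0],
})

# A column with a single distinct value carries no information.
constant_cols = [c for c in toy.columns if toy[c].nunique() == 1]
toy_reduced = toy.drop(columns=constant_cols)
```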

**Distribution of binary features per class.** The table below shows, for each pair of feature and target class, what share of the values is **1**. So, for example, in the first cell, we can see that only 1.52% of the `Soil_Type1` values are **1** for Cover Type 1.

In [None]:
train.groupby(target)[binary_vars].mean().T.style.background_gradient('YlGn', axis=1).format('{:.4%}'.format)

## Numerical variables

**Description of numerical variables.** We can see that the ranges of some variables don't match the description of the original dataset (at the beginning of this notebook). E.g., `Hillshade_3pm` goes beyond the [0, 255] range.

In [None]:
train[numerical_vars].describe(percentiles=[.01, .25, .50, .75, .99]).T
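To quantify how often the hillshade columns leave the documented [0, 255] range, and to see what clipping would do, something like the sketch below could be used. The toy values are made up, and whether clipping actually helps a model is an open question I haven't tested.

```python
import pandas as pd

hillshade_cols = ['Hillshade_9am', 'Hillshade_Noon', 'Hillshade_3pm']

# Toy values, some of them outside the documented 0 to 255 index range.
toy = pd.DataFrame({
    'Hillshade_9am': [-4, 120, 254],
    'Hillshade_Noon': [200, 255, 260],
    'Hillshade_3pm': [-10, 0, 272],
})

# Number of out-of-range values per column.
out_of_range = ((toy[hillshade_cols] < 0) | (toy[hillshade_cols] > 255)).sum()

# Clip into the documented range.
clipped = toy[hillshade_cols].clip(lower=0, upper=255)
```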

**Range comparison of each numeric variable.** This box plot helps visualize the ranges of each variable, and how they compare to each other. Normalization will be very important for some algorithms.

In [None]:
plt.figure(figsize=(16, 6), dpi=100)
sns.boxplot(data=train_50k[numerical_vars], orient='h', showfliers=False);
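Since the ranges differ this much, here is a minimal sketch of one common normalization, z-score standardization, written with plain pandas. The toy frames are made up; the point is that the mean and standard deviation are computed on the training data only and then reused for the test data.

```python
import pandas as pd

# Toy stand-ins for train[numerical_vars] and test[numerical_vars].
toy_train = pd.DataFrame({'Elevation': [2000.0, 3000.0, 4000.0],
                          'Slope': [5.0, 15.0, 25.0]})
toy_test = pd.DataFrame({'Elevation': [3000.0], 'Slope': [15.0]})

# Statistics come from the training data only.
mean = toy_train.mean()
std = toy_train.std(ddof=0)  # population standard deviation

train_scaled = (toy_train - mean) / std
test_scaled = (toy_test - mean) / std
```

After this step every training column has mean 0 and standard deviation 1, so distance-based and gradient-based models see the features on a comparable scale.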

**Univariate analysis using KDE.** Both train and test have very similar distribution, except for `Elevation`, I would say... but I don't know what this could mean.

In [None]:
fig, axs = plt.subplots(3, 4, figsize=(16, 10), dpi=100)
axs = axs.ravel()
for i, col_name in enumerate(numerical_vars):
    sns.kdeplot(data=train_50k, x=col_name, color='tab:blue', ax=axs[i])
    sns.kdeplot(data=test_50k, x=col_name, color='tab:orange', ax=axs[i])
for i in [10, 11]:
    axs[i].remove()
fig.tight_layout()
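To put a number on how different two distributions are, instead of only eyeballing the KDEs, one option is the two-sample Kolmogorov-Smirnov statistic: the largest gap between the two empirical CDFs. The `ks_statistic` helper below is my own NumPy sketch, not a library function, and the toy arrays stand in for columns such as `train_50k['Elevation']` and `test_50k['Elevation']`.

```python
import numpy as np

def ks_statistic(a, b):
    """Largest absolute difference between the empirical CDFs of a and b."""
    a = np.sort(np.asarray(a))
    b = np.sort(np.asarray(b))
    # The gap can only change at observed values, so evaluate both CDFs there.
    grid = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, grid, side='right') / a.size
    cdf_b = np.searchsorted(b, grid, side='right') / b.size
    return np.abs(cdf_a - cdf_b).max()

# Toy examples: identical samples versus completely separated samples.
same = ks_statistic([1, 2, 3], [1, 2, 3])
disjoint = ks_statistic([1, 2, 3], [4, 5, 6])
```

A value near 0 means the two samples look alike; a value near 1 means they barely overlap. Computing it per numerical column would show whether `Elevation` really stands out.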

**Univariate analysis using Histograms.** Basically, the same thing. Just a different view.

In [None]:
fig, axs = plt.subplots(3, 4, figsize=(16, 10), dpi=200)
axs = axs.ravel()
common_kwargs = {'stat': 'percent', 'element': 'step'}
extra_kwargs = {
    'Slope': {'binwidth': 1},
    'Hillshade_Noon': {'binwidth': 1},
    'Horizontal_Distance_To_Hydrology': {'binwidth': 10},
    'Horizontal_Distance_To_Roadways': {'binwidth': 25},
    'Hillshade_9am': {'binwidth': 5},
    'Horizontal_Distance_To_Fire_Points': {'binwidth': 25},
}
for i, col_name in enumerate(numerical_vars):
    # Build a fresh dict each iteration so that per-column options (e.g. binwidth)
    # don't leak into the following columns.
    kwargs = {**common_kwargs, **extra_kwargs.get(col_name, {})}
    sns.histplot(data=train_50k, x=col_name, color='tab:blue', ax=axs[i], **kwargs)
    sns.histplot(data=test_50k, x=col_name, color='tab:orange', ax=axs[i], **kwargs)
for i in [10, 11]:
    axs[i].remove()
fig.tight_layout()

**Pairplot.** The upper triangle is made of scatter plots with the target as hue. It's very messy, but we can see that Cover Types 1, 2, and 3 appear largely separable by `Elevation` (more on this next). The diagonal shows KDEs with the target as hue; `Elevation` is the most interesting one. For the lower triangle, I made 2D histograms to help identify relationships between predictor variables. I don't see anything interesting in them, though. I believe that if there were an interesting relationship (linear, for example), it would pop out in the darker colors.

In [None]:
g = sns.PairGrid(train_50k, vars=numerical_vars, hue=target, palette='tab10', diag_sharey=False)
g.map_upper(sns.scatterplot, alpha=0.2)
g.map_diag(sns.kdeplot)
g.map_lower(sns.histplot, hue=None)
g.fig.set_dpi(200)

TODO: I'd like to add some contour plots with the target as hue, but I had problems generating them. Maybe later.

Below, I made some box plots to compare the ranges that each feature takes for each class. Looking at the first subplot, we can see that Cover Type 1 likes high elevations (> 3000). Cover Type 2 likes intermediate elevations (between 2500 and 3000). Cover Type 3 likes lower elevations (< 2500). Cover Types 6 and 7 also have their own preferences. As a reminder, there are very few observations for Cover Types 4 and 5.

In [None]:
fig, axs = plt.subplots(5, 2, figsize=(16, 16))
axs = axs.ravel()
for i, col_name in enumerate(numerical_vars):
    sns.boxplot(data=train, x=col_name, y=target, palette='tab10', orient='h', ax=axs[i])
fig.tight_layout()
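The elevation bands read off the box plots (below 2500, 2500 to 3000, above 3000) could be turned into a categorical feature with `pd.cut`. This is only a sketch: the cut points are my rough visual reading, the band labels are hypothetical, and the toy values are made up.

```python
import pandas as pd

# Toy stand-in for train['Elevation'].
toy = pd.DataFrame({'Elevation': [2300, 2750, 3200]})

# Intervals are right-closed: (-inf, 2500], (2500, 3000], (3000, inf).
toy['elevation_band'] = pd.cut(
    toy['Elevation'],
    bins=[-float('inf'), 2500, 3000, float('inf')],
    labels=['low', 'mid', 'high'],
)
```

A cross-tabulation of `elevation_band` against `Cover_Type` on the real data would show how much of the target these three bands explain on their own.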

## Correlation

**Heatmap of correlations.** My thoughts:
 - `Elevation` and `Cover_Type` have a strong negative correlation, because of the preferred elevation ranges of Cover Types 1, 2, and 3.
 - `Wilderness_Area4` and `Elevation` are somewhat correlated. I imagine this area exists mostly at a certain elevation.
 - `Wilderness_Area3` and `Wilderness_Area1` are strongly correlated.

In [None]:
corr = train.drop(columns='Id').corr()
mask = np.zeros_like(corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True
plt.figure(figsize=(18, 12))
_ = sns.heatmap(corr, mask=mask, square=True, center=0, cmap='bwr')

**Table with the stronger correlations.** Seeing it as a table can help pick up important correlations you didn't notice on the heatmap.

In [None]:
corr[mask] = np.nan
important_correlations = corr.stack().to_frame().reset_index()
important_correlations.columns = ['var_1', 'var_2', 'corr']
important_correlations.dropna(inplace=True)
important_correlations['corr_abs'] = np.abs(important_correlations['corr'])
important_correlations.sort_values('corr_abs', ascending=False, inplace=True)
important_correlations.query('corr_abs > 0.1').style.background_gradient('YlGn', subset=['corr_abs'])

**Correlation between numeric variables only.**

In [None]:
numeric_corr = train[numerical_vars].corr()
mask = np.zeros_like(numeric_corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True
plt.figure(figsize=(16, 10))
_ = sns.heatmap(numeric_corr, mask=mask, square=True, annot=True, fmt='.2%', center=0, cmap='bwr')

## Sum of wilderness areas or soil types

I was curious if there were observations with multiple wilderness areas. And indeed there are (table below). But I wonder if this exists in the original dataset and can be meaningful, or (my guess) if this was something introduced by the GAN.

In [None]:
train[wilderness_area_vars].sum(axis=1).value_counts().to_frame('wilderness_count')

In [None]:
train[soil_type_vars].sum(axis=1).value_counts().to_frame('soil_count')

I think it's just noise, or an artifact introduced by the GAN, and that it doesn't contribute to predictions, as can be seen below.

In [None]:
wilderness_target = train[[target]].copy()
wilderness_target['wilderness_count'] = train[wilderness_area_vars].sum(axis=1)
_ = sns.boxplot(data=wilderness_target, y='wilderness_count', x=target)

In [None]:
soil_target = train[[target]].copy()
soil_target['soil_count'] = train[soil_type_vars].sum(axis=1)
_ = sns.boxplot(data=soil_target, y='soil_count', x=target)
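If the row sums ever turn out to be useful, they are cheap to keep as features, together with a flag for rows that are not properly one-hot. A minimal sketch on a toy dataframe follows; the toy values and the `wilderness_valid` name are my own.

```python
import pandas as pd

cols = [f'Wilderness_Area{i}' for i in range(1, 5)]

# Toy rows: two valid one-hot rows, one with two areas set, one with none.
toy = pd.DataFrame({
    'Wilderness_Area1': [1, 0, 1, 0],
    'Wilderness_Area2': [0, 0, 1, 0],
    'Wilderness_Area3': [0, 1, 0, 0],
    'Wilderness_Area4': [0, 0, 0, 0],
})

# Number of areas set per row, and whether exactly one is set.
toy['wilderness_count'] = toy[cols].sum(axis=1)
toy['wilderness_valid'] = toy['wilderness_count'].eq(1)
```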