# The December competition with Fastai v2

This notebook is a quick demonstration of how to use the Fastai v2 library for a Kaggle tabular competition. Fastai v2 is based on PyTorch and allows you to build a decent machine learning application.
For more information please visit the Fastai documentation: https://docs.fast.ai/

In [None]:
from fastai.tabular.all import * 
from fastai.test_utils import show_install
from sklearn.ensemble import RandomForestRegressor
show_install()

In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)

In [None]:
np.random.seed(91)
torch.manual_seed(91)

The data set is located in the following directory:

In [None]:
path = Path('../input/tabular-playground-series-dec-2021')
Path.BASE_PATH = path
path.ls()

I use pandas to import the files and to check whether they contain null or missing values. The result shows that the data set is complete, so no additional data preparation is needed.

In [None]:
train_df = pd.read_csv(os.path.join(path, 'train.csv'))
test_df = pd.read_csv(os.path.join(path, 'test.csv'))
sample_submission = pd.read_csv(os.path.join(path, 'sample_submission.csv'))

train_df.isna().sum().sum(), test_df.isna().sum().sum(), train_df.isnull().sum().sum(), test_df.isnull().sum().sum()
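As a minimal sketch of the same check on a toy data frame (the frame and its values are made up for illustration): in pandas, `isnull()` is an alias of `isna()`, so one of the two is enough.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value; the values are illustrative only.
toy_df = pd.DataFrame({"Elevation": [2500, 2600, np.nan],
                       "Aspect": [10, 350, 90]})

# Number of missing values per column and in total.
missing_per_column = toy_df.isna().sum()
missing_total = int(missing_per_column.sum())

# isnull() is an alias of isna(), so both return the same mask.
same_result = toy_df.isna().equals(toy_df.isnull())
print(missing_per_column.to_dict(), missing_total, same_result)
```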

We can specify whether pseudo labels are added and whether we want to duplicate the rows with specific Cover_Type values.

In [None]:
use_pseudo_lables = True
duplicate_cover_types = True

In [None]:
if use_pseudo_lables:
    labels_df = pd.read_csv('../input/tps12-pseudolabels/tps12-pseudolabels_v2.csv')
    # ignore_index=True gives every row a unique index label after stacking
    train_df = pd.concat([train_df, labels_df], axis=0, ignore_index=True)
    print('Missing values in the pseudo labels:', labels_df.isna().sum().sum())
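A small illustration of why the index matters when data frames are stacked (toy frames, not the competition data): `pd.concat` keeps the original labels unless `ignore_index=True` is passed, and `reset_index` returns a new frame that has to be assigned. With duplicated labels, a later `drop` by label removes more rows than intended.

```python
import pandas as pd

a = pd.DataFrame({"Cover_Type": [1, 2]})
b = pd.DataFrame({"Cover_Type": [5, 3]})

# Without ignore_index the labels 0 and 1 appear twice.
stacked = pd.concat([a, b], axis=0)
duplicated_labels = stacked.index.tolist()

# Dropping the row with Cover_Type 5 by label also removes the other row labelled 0.
idx = stacked[stacked["Cover_Type"] == 5].index
after_drop = stacked.drop(idx)

# With ignore_index=True every row has a unique label and only one row is dropped.
clean = pd.concat([a, b], axis=0, ignore_index=True)
clean_after_drop = clean.drop(clean[clean["Cover_Type"] == 5].index)

print(duplicated_labels, len(after_drop), clean_after_drop["Cover_Type"].tolist())
```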

In [None]:
train_df.describe().T

Cover_Type is the dependent variable and should be predicted for the test data. I change Cover_Type from int to the category type. With this modification I was able to improve my public score from 0.93057 to 0.95542; the tabular model is unchanged for both runs. Let's see how many different values exist. I will delete the one row with Cover_Type=5. Later on I will combine the predictions of my neural network with predictions from other notebooks, and these notebooks delete this row as well.

In [None]:
dep_var = 'Cover_Type'
idx = train_df[train_df[dep_var] == 5].index
train_df.drop(idx, axis = 0, inplace = True)

train_df[dep_var] = train_df[dep_var].astype('category')
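To make the effect of the category type concrete, here is a toy sketch (the label values are illustrative): after a class is removed, the remaining categories are numbered consecutively by their codes, so a code is not the same as the original label.

```python
import pandas as pd

labels = pd.Series([1, 2, 2, 3, 7, 5], name="Cover_Type")

# Drop the rare class first, then convert to the category type.
labels = labels[labels != 5].astype("category")

categories = labels.cat.categories.tolist()  # the remaining original labels
codes = labels.cat.codes.tolist()            # consecutive integer codes

print(categories, codes)
```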

In [None]:
from sklearn.utils import shuffle
if duplicate_cover_types:
    print('Duplicate rows with Cover_Type 7')
    seven_df  = shuffle( train_df.loc[train_df[dep_var] == 7], random_state=2520)

    # ignore_index=True rebuilds the index so that the duplicated rows get unique labels
    train_df = pd.concat([seven_df, train_df], axis=0, ignore_index=True)
    del seven_df
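The duplication step can be sketched on a toy frame like this (column values are made up): the rows of the rare class are selected and stacked onto the original frame, which doubles their count.

```python
import pandas as pd

toy = pd.DataFrame({"Cover_Type": [1, 1, 2, 7],
                    "Elevation": [10, 20, 30, 40]})

# Select the rows of the class to oversample and append them once more.
rare = toy.loc[toy["Cover_Type"] == 7]
oversampled = pd.concat([rare, toy], axis=0, ignore_index=True)

counts = oversampled["Cover_Type"].value_counts().to_dict()
print(counts)
```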

In [None]:
nunOfCoverTypes = len(train_df[dep_var].unique())
nunOfCoverTypes, np.unique(train_df[dep_var], return_counts=True)

I will drop the column 'Id' from the data frames, because its values are unique and don't add any useful information to our model. The columns 'Soil_Type7' and 'Soil_Type15' contain only the value 0. Therefore they don't provide any information to the model either, and I can drop them too.

In [None]:
train_df.drop(columns=['Id', 'Soil_Type7', 'Soil_Type15'], inplace=True)
test_df.drop(columns=['Id', 'Soil_Type7', 'Soil_Type15'], inplace=True)

It seems that the values of some columns should be preprocessed, because at first glance they contain 'strange' values. These columns are 'Aspect', 'Hillshade_9am', 'Hillshade_Noon' and 'Hillshade_3pm'. There are some other discussion items and notebooks in this competition where more details are described. I will show you my implementation here. To control whether the preprocessing is done, I set the following flag doPreprocessing to 'True'.


In [None]:
doPreprocessing=True    

The column 'Aspect' stores an angle in degrees. These values are periodic with a period of 360 degrees, so I can map them back into the interval [0, 359].

In [None]:
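def clipAspectValues(df):
    # Shift out-of-range angles by one full turn; .loc with a mask avoids chained assignment
    df.loc[df["Aspect"] < 0, "Aspect"] += 360
    df.loc[df["Aspect"] > 359, "Aspect"] -= 360
    # The modulo also covers values that are more than one turn out of range
    df["Aspect_mod_360"] = df["Aspect"] % 360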
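As a quick worked example of the wrap-around (the angle values are chosen for illustration), the modulo operator alone already maps any angle into one period, because in Python and NumPy the result takes the sign of the divisor:

```python
import numpy as np

# Angles below 0 and above 359 degrees, plus the two boundary cases.
aspect = np.array([-33, 0, 359, 360, 407])

# Modulo 360 maps every angle into the interval [0, 359].
wrapped = aspect % 360
print(wrapped.tolist())
```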

The values of the Hillshade_ columns shouldn't lie outside the interval [0, 255]. Therefore I will clip the values to this interval.

In [None]:
def clipHillshadeValues(df):
    
    hill_features = [x for x in df.columns if x.startswith("Hillshade")]
    for col in hill_features:
        df[col] = np.clip(df[col], a_min=0, a_max=255)
        
    df['Hillshade_Noon_is_Bright'] = (df['Hillshade_Noon'] == 255).astype(int)
    df['Hillshade_9am_is_Zero'] = (df['Hillshade_9am'] == 0).astype(int)
    df['hillshade_3pm_is_Zero'] = (df['Hillshade_3pm'] == 0).astype(int)
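A short sketch of the clipping on made-up values. Note that an indicator computed after the clipping also flags the values that were originally above 255:

```python
import numpy as np

# Illustrative hillshade values, two of them outside [0, 255].
hillshade = np.array([-4, 0, 120, 255, 272])

# Values below 0 become 0, values above 255 become 255.
clipped = np.clip(hillshade, 0, 255)

# Indicator for the maximum value, computed on the clipped column.
is_bright = (clipped == 255).astype(int)
print(clipped.tolist(), is_bright.tolist())
```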

Let's calculate the Euclidean and the Manhattan distance to the hydrology from the column values of 'Horizontal_Distance_To_Hydrology' and 'Vertical_Distance_To_Hydrology'.

In [None]:
def calculateDistance(df):
    df["Hydro_Dist_Eucl"] = (df["Horizontal_Distance_To_Hydrology"]**2 + 
                                df["Vertical_Distance_To_Hydrology"]**2)**0.5
    df["Hydro_Dist_Manh"] = np.abs(df["Horizontal_Distance_To_Hydrology"]) + np.abs(df["Vertical_Distance_To_Hydrology"])
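A worked example with made-up numbers: for a horizontal distance of 3 and a vertical distance of -4, the Euclidean distance is 5 and the Manhattan distance is 7.

```python
import numpy as np

horizontal = np.array([3.0, 0.0, 30.0])
vertical = np.array([-4.0, 5.0, 40.0])

# Euclidean distance: square root of the sum of squares.
euclid = np.sqrt(horizontal**2 + vertical**2)

# Manhattan distance: sum of the absolute values.
manhattan = np.abs(horizontal) + np.abs(vertical)
print(euclid.tolist(), manhattan.tolist())
```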

In [None]:
def addCountValues(df):
    soil_features = [x for x in df.columns if x.startswith("Soil_Type")]
    df["Soil_Type_Count"] = df[soil_features].sum(axis=1)
    df[soil_features] = df[soil_features].astype('category')

    
    wilderness_features = [x for x in df.columns if x.startswith("Wilderness_Area")]
    df["Wilderness_Area_Count"] = df[wilderness_features].sum(axis = 1)
    
    hillshade_features = [x for x in df.columns if x.startswith("Hillshade")]
    df["Hillshade_Count"] = df[hillshade_features].sum(axis = 1)
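The count features are row-wise sums over a group of columns. A toy sketch with three hypothetical one-hot soil columns shows the idea; a count other than 1 reveals rows where no or several soil types are set.

```python
import pandas as pd

# Toy frame; the values are made up for illustration.
toy = pd.DataFrame({"Soil_Type1": [1, 0, 0],
                    "Soil_Type2": [0, 0, 1],
                    "Soil_Type3": [1, 0, 0],
                    "Elevation": [2500, 2600, 2700]})

# Select the group by prefix and sum across the columns of each row.
soil_cols = [c for c in toy.columns if c.startswith("Soil_Type")]
toy["Soil_Type_Count"] = toy[soil_cols].sum(axis=1)
print(toy["Soil_Type_Count"].tolist())
```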

Now the helper functions are applied to both the training and the test data frame. The oversampling by duplicating the rows of a specific cover type was already done above for Cover_Type 7.

In [None]:
if doPreprocessing:
    print("Let's start the preprocessing ..")
    clipAspectValues(train_df)
    clipAspectValues(test_df)
    clipHillshadeValues(train_df)
    clipHillshadeValues(test_df)
    calculateDistance(train_df)
    calculateDistance(test_df)
    addCountValues(train_df)
    addCountValues(test_df)
    print("Done ..")
else:
    print("No preprocessing ..")

Let's see how the different cover types are distributed now.

In [None]:
np.unique(train_df[dep_var], return_counts=True)

In [None]:
memory_usage_before = train_df.memory_usage().sum() / 1024**2
train_df = df_shrink(train_df)
test_df = df_shrink(test_df)
memory_usage_after = train_df.memory_usage().sum() / 1024**2

print('Memory usage (MByte) before the shrinking:', memory_usage_before, ' , after shrinking: ', memory_usage_after)
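The idea behind shrinking a data frame can be sketched with plain pandas. This is only a rough illustration of downcasting integer columns, not fastai's `df_shrink` implementation, and the toy columns are made up.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"flag": np.zeros(1000, dtype="int64"),
                    "elevation": np.arange(1000, dtype="int64")})
bytes_before = int(toy.memory_usage(index=False).sum())

# Downcast every integer column to the smallest integer type that holds its values.
small = toy.copy()
for col in small.columns:
    small[col] = pd.to_numeric(small[col], downcast="integer")
bytes_after = int(small.memory_usage(index=False).sum())

print(small.dtypes.astype(str).to_dict(), bytes_before, bytes_after)
```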

I need a list of the column names that are candidates for categorical variables and a list of those that are not, the so-called continuous variables. The Fastai library offers the function 'cont_cat_split' to do this for us. You can use the optional parameter 'max_card' to specify the maximum number of unique values an integer column may have to be treated as a categorical variable. I will use the value 10, which is sufficient for this data set. Both lists are used later to create a corresponding model. The categorical variables are mapped into embeddings, while the continuous variables are fed into the linear layers directly. The value of max_card influences the ratio between categorical and continuous variables: lower max_card values reduce the number of categorical variables and increase the number of continuous variables. With max_card=1, every integer column with more than one unique value is handled as a continuous variable.
The parameter dep_var specifies our dependent variable 'Cover_Type'. Its column is skipped when the categorical and continuous variables are determined.

In [None]:
cont_vars, cat_vars = cont_cat_split(train_df, dep_var= dep_var,  max_card=10)
len(cat_vars), len(cont_vars), cat_vars, cont_vars, 
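The splitting rule can be sketched in plain pandas as follows. `split_cont_cat_sketch` is a hypothetical, simplified re-implementation written for illustration, not the library function itself: float columns and integer columns with more than `max_card` unique values count as continuous, everything else as categorical.

```python
import pandas as pd

def split_cont_cat_sketch(df, dep_var, max_card):
    """Simplified illustration of a continuous/categorical column split."""
    cont_names, cat_names = [], []
    for name in df.columns:
        if name == dep_var:
            continue  # the dependent variable belongs to neither list
        col = df[name]
        is_float = pd.api.types.is_float_dtype(col.dtype)
        is_wide_int = (pd.api.types.is_integer_dtype(col.dtype)
                       and col.nunique() > max_card)
        if is_float or is_wide_int:
            cont_names.append(name)
        else:
            cat_names.append(name)
    return cont_names, cat_names

toy = pd.DataFrame({"Elevation": list(range(20)),
                    "Wilderness_Area1": [0, 1] * 10,
                    "Slope": [0.5] * 20,
                    "Cover_Type": [1, 2] * 10})

cont10, cat10 = split_cont_cat_sketch(toy, "Cover_Type", max_card=10)
cont1, cat1 = split_cont_cat_sketch(toy, "Cover_Type", max_card=1)
print(cont10, cat10, cont1, cat1)
```

With `max_card=10` the binary column ends up in the categorical list; with `max_card=1` it moves to the continuous list.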

In [None]:
for c in cat_vars:
    print(c, train_df[c].nunique())

The next step is to create a data loader. The Fastai library offers a powerful helper called 'TabularPandas'. It needs the data frame, the lists of the categorical and continuous variables, the dependent variable and a splitter. The splitter divides the data set into two parts: one for training and one for validation after each epoch. Let's use a ratio of 4 to 1 (valid_pct=0.2). I also need a dataloader, which is created from this TabularPandas instance. The helper function getData does this job and also lets you get a small dataloader if you want to do quick prototyping of your model.

In [None]:
def getData(df, batchSize=1024, randomSplit=True, genSmallDataset=True):

  if genSmallDataset:
    # Sample a subset without replacement, so that no row appears twice
    n_samples = min(250000, len(df))
    example_idx = np.random.choice(len(df), n_samples, replace=False)
    df = df.iloc[example_idx]

  if randomSplit:
    splits = RandomSplitter(valid_pct=0.2, seed=718)(range_of(df))
  else:
    # Sequential split: the first 80 % for training, the last 20 % for validation
    cut = int(0.8 * len(df))
    splits = (L(list(range(cut))), L(list(range(cut, len(df)))))
  to_train = TabularPandas(df,
                           [Categorify,  Normalize],
                           cat_vars,
                           cont_vars,
                           splits=splits,
                           device = device,
                           y_block=CategoryBlock(),
                           y_names=dep_var)

  return to_train.dataloaders(bs=batchSize)
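What a splitter produces is simply two lists of row positions. The following NumPy sketch shows a random and a sequential variant; `split_indices_sketch` is a hypothetical helper for illustration and does not reproduce the exact shuffling of the library splitter.

```python
import numpy as np

def split_indices_sketch(n_rows, valid_pct=0.2, random_split=True, seed=718):
    """Return (train_idx, valid_idx) as integer position arrays."""
    n_valid = int(valid_pct * n_rows)
    if random_split:
        shuffled = np.random.default_rng(seed).permutation(n_rows)
        return shuffled[n_valid:], shuffled[:n_valid]
    # Sequential split: the last n_valid rows form the validation set.
    cut = n_rows - n_valid
    return np.arange(cut), np.arange(cut, n_rows)

rand_train, rand_valid = split_indices_sketch(10, random_split=True)
seq_train, seq_valid = split_indices_sketch(10, random_split=False)
print(len(rand_train), len(rand_valid), seq_train.tolist(), seq_valid.tolist())
```

Both parts are disjoint and together cover every row exactly once.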

In [None]:
dls = getData(train_df, batchSize=4096, randomSplit=True, genSmallDataset=False)
len(dls.train), len(dls.valid), type(dls.train), dls.train.device

Finally, I create a learner by passing the dataloader into it. By default, the model has two hidden layers with 200 and 100 units. You can change the structure of the hidden layers with the parameter layers, like this: 'layers=[128,64,64,16]'. Here I pass a deeper structure, as you can see in the reported summary. The hidden layers use batch normalization and the ReLU activation function.

In [None]:
my_config = tabular_config(ps=0.25, embed_p=0.25, use_bn=True, bn_cont=True, y_range=(1, 8))
learn = tabular_learner(dls,
                        n_out = nunOfCoverTypes,
                        layers=[512,512,128,128,128,64,64],
                        # layers=[128, 64, 64, 16], for the best score!
                        config=my_config,
                        metrics=[accuracy])
learn.summary()

In [None]:
learn.lr_find()

I will use a maximum learning rate of 2e-3.
Starting the learning process is quite easy: I will run for 150 epochs and save the model with the lowest validation loss. The Fastai library offers the SaveModelCallback callback for this. You only have to specify the file name. The option with_opt=True also stores the state of the optimizer.
You will find the new file under models/kaggle_tps_dec2021.pth

In [None]:
learn.fit_one_cycle(150, 2e-3, wd=0.01, cbs=SaveModelCallback(fname='kaggle_tps_dec2021', with_opt=True)) 
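To give an idea of what "maximum learning rate" means in a one-cycle schedule, here is a pure-Python sketch: the rate rises from a low start value to the maximum and then decays to a very small value, both phases with cosine annealing. The function names are hypothetical, and the values `div=25`, `div_final=1e5` and `pct_start=0.25` are assumptions about the defaults, so treat this as an illustration rather than the library's exact schedule.

```python
import math

def cos_anneal(start, end, pos):
    """Cosine interpolation from start (pos=0) to end (pos=1)."""
    return start + (1 + math.cos(math.pi * (1 - pos))) * (end - start) / 2

def one_cycle_lr_sketch(pct, lr_max=2e-3, div=25.0, div_final=1e5, pct_start=0.25):
    """Learning rate at training progress pct in [0, 1] (assumed defaults)."""
    if pct < pct_start:
        # Warm-up phase: from lr_max / div up to lr_max.
        return cos_anneal(lr_max / div, lr_max, pct / pct_start)
    # Annealing phase: from lr_max down to lr_max / div_final.
    return cos_anneal(lr_max, lr_max / div_final, (pct - pct_start) / (1 - pct_start))

lr_start = one_cycle_lr_sketch(0.0)
lr_peak = one_cycle_lr_sketch(0.25)
lr_end = one_cycle_lr_sketch(1.0)
print(lr_start, lr_peak, lr_end)
```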

To calculate the predictions for this competition, I will load the best model from the training process. The best model is the one with the lowest validation loss.

In [None]:
learn.load('kaggle_tps_dec2021')

Let's look at the confusion matrix.

In [None]:
interp = ClassificationInterpretation.from_learner(learn)
interp.plot_confusion_matrix(normalize=True)
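For readers who want to see what a normalized confusion matrix contains, here is a small NumPy sketch with made-up labels. It assumes row-wise normalization, so each row shows how the samples of one true class are distributed over the predicted classes; `normalized_confusion_sketch` is a hypothetical helper.

```python
import numpy as np

def normalized_confusion_sketch(y_true, y_pred, n_classes):
    """Count (true, predicted) pairs and divide each row by its sum."""
    cm = np.zeros((n_classes, n_classes), dtype=float)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    row_sums = cm.sum(axis=1, keepdims=True)
    # Rows without any sample stay at zero instead of dividing by zero.
    return np.divide(cm, row_sums, out=np.zeros_like(cm), where=row_sums > 0)

y_true = [0, 0, 1, 1, 1, 2]
y_pred = [0, 1, 1, 1, 0, 2]
cm = normalized_confusion_sketch(y_true, y_pred, n_classes=3)
print(cm.round(2).tolist())
```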

I will use the test data frame to get the predictions for the submission. I could use a smaller batch size, since there are fewer entries in the test data frame.

In [None]:
dlt = learn.dls.test_dl(test_df, bs=4096) 
nn_preds, _ = learn.get_preds(dl=dlt) 
nn_preds.min(), nn_preds.max(), nn_preds.shape

Let's load the predictions from an XGBoost model and combine them with our own predictions.

In [None]:
xgb_preds = pd.read_parquet('../input/reasonable-xgboost-model/reasonable_xgb_test.pq')

In [None]:
all_pred = (nn_preds.numpy() + xgb_preds.to_numpy())

sample_submission[dep_var] = np.argmax(all_pred, axis=1) +1 
sample_submission.to_csv("submission.csv", index=False)
sample_submission.head(10)
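One detail is worth checking when a class is missing from the training data: the column index returned by `argmax` then no longer equals the label minus one. The sketch below uses made-up probabilities and assumes a class list without label 5. Whether this applies here depends on the column order of both prediction arrays, which is not shown in this notebook.

```python
import numpy as np

# Assumption for illustration: six classes remain, label 5 is absent.
class_labels = np.array([1, 2, 3, 4, 6, 7])

nn_probs = np.array([[0.1, 0.1, 0.1, 0.1, 0.5, 0.1],
                     [0.6, 0.1, 0.1, 0.1, 0.05, 0.05],
                     [0.1, 0.1, 0.1, 0.1, 0.1, 0.5]])
xgb_probs = np.array([[0.0, 0.1, 0.1, 0.1, 0.6, 0.1],
                      [0.7, 0.1, 0.1, 0.1, 0.0, 0.0],
                      [0.1, 0.1, 0.1, 0.0, 0.2, 0.5]])

# Blend by adding the probabilities, then pick the best column per row.
best_index = (nn_probs + xgb_probs).argmax(axis=1)

naive_labels = best_index + 1            # off by one after the missing label
mapped_labels = class_labels[best_index]  # lookup in the class list
print(naive_labels.tolist(), mapped_labels.tolist())
```

If the classes are numbered without gaps, both variants give the same result.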

In [None]:
!ls -la 

The End. You can use this notebook and feel free to modify and expand the model to get a better result. Show me your recommendations and results!😀