# Preprocess
- Preprocess the data and save the result so that models can be built on it later.

## Removing values
- Removing categorical values that exist only in the train data.
  - A Machine Learning model gains nothing from learning a categorical value that appears only in the train data, because that value never occurs at prediction time.
  - Therefore I remove such values and treat them as missing values.
- Removing categorical values that exist only in the test data.
  - During training, a model cannot learn a categorical value that appears only in the test data, so it cannot use that value when predicting.
  - Therefore I remove such values and treat them as missing values as well.
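
The idea above can be sketched on toy data. This is a minimal illustration, not the notebook's own code: the helper `mask_unshared_categories` and the toy columns are hypothetical names introduced here. Both toy columns have three distinct values, so comparing the number of unique values alone would not reveal the mismatch; the value sets themselves have to be compared.

```python
import pandas as pd


def mask_unshared_categories(train_col, test_col):
    """Replace values that do not occur in both columns with NaN.

    Hypothetical helper for illustration only.
    """
    shared = set(train_col.dropna().unique()) & set(test_col.dropna().unique())
    # Series.where keeps values where the condition holds and puts NaN elsewhere.
    return (train_col.where(train_col.isin(shared)),
            test_col.where(test_col.isin(shared)))


# Toy data: 'C' occurs only in train, 'D' occurs only in test.
toy_train = pd.Series(['A', 'B', 'C', 'A'])
toy_test = pd.Series(['A', 'B', 'D'])

masked_train, masked_test = mask_unshared_categories(toy_train, toy_test)
print(masked_train.tolist())
print(masked_test.tolist())
```

After masking, 'C' and 'D' are missing values, and both columns contain only 'A' and 'B'.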

## Factorization
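- Factorizing categorical features in lexicographical order, so that each category is replaced by the integer rank of its label. Missing values get the code -1.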
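
As a small sketch of what sorted factorization does, the toy series below (a made-up example, not taken from the dataset) is passed to `pd.factorize` with `sort=True`. Note that lexicographical order places 'AA' between 'A' and 'B', and that the missing value is encoded as -1.

```python
import numpy as np
import pandas as pd

# Toy column with a missing value; the labels are illustrative only.
toy = pd.Series(['B', 'A', 'AA', np.nan, 'A'], dtype=object)

# sort=True orders the unique labels before assigning integer codes.
codes, uniques = pd.factorize(toy, sort=True)

print(list(uniques))   # ['A', 'AA', 'B']
print(codes.tolist())  # [2, 0, 1, -1, 0]
```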

In [1]:
from IPython.display import display
import numpy as np
import pandas as pd

In [11]:
train = pd.read_csv('./data/train.csv')
test = pd.read_csv('./data/test.csv')
test['loss'] = np.nan
# Stack train and test; ignore_index avoids duplicate row labels
train_test = pd.concat([train, test], ignore_index=True)

In [12]:
print(train.shape)
print(test.shape)

(188318, 132)
(125546, 132)


In [13]:
for column in train.select_dtypes(include=['object']).columns:

    # Remove categorical values existing only in train data or only in test data.
    # Compare the value sets themselves: equal unique counts do not imply equal sets.
    set_train = set(train[column].unique())
    set_test = set(test[column].unique())
    remove = set_train ^ set_test  # symmetric difference

    if remove:
        # Treat the removed values as missing values
        train_test[column] = train_test[column].where(~train_test[column].isin(remove))

    # Factorize the categorical features in lexicographical order
    # (missing values get the code -1)
    train_test[column] = pd.factorize(train_test[column].values, sort=True)[0]

In [22]:
# Export as CSV
train_test[train_test['loss'].notnull()].to_csv('data/train_preprocessed.csv', index=False)
train_test[train_test['loss'].isnull()].to_csv('data/test_preprocessed.csv', index=False)