# Experimenting with sparse and dense features

In this notebook I experiment with different strategies for combining sparse and dense features.
This is needed because features produced by bag-of-words (BOW) preprocessing tools such as `tf_idf` are extremely sparse by design, whereas hand-crafted features are typically dense. We have identified the following general strategies to tackle this issue:

1. Use models that are robust to many features of varied density and simply feed them the concatenation of all features

2. Train different classifiers on the sparse and dense datasets and then ensemble them (stacking/boosting)

3. Use dimensionality reduction tools like PCA or autoencoders to combine sparse and dense features into better ones
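As a rough illustration of strategies 1 and 3, here is a minimal sketch on synthetic data. It assumes SciPy and scikit-learn are available. The toy matrices, the `StandardScaler` step and the use of `TruncatedSVD` are illustrative assumptions, not part of this project's pipeline. `TruncatedSVD` stands in for PCA here because it accepts sparse input directly.

```python
import numpy as np
from scipy import sparse
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Toy stand-ins (hypothetical data): 20 documents, 50 TF-IDF-like columns
# that are mostly zero, and 3 dense hand-crafted columns.
raw = rng.random((20, 50))
raw[raw < 0.9] = 0.0
sparse_part = sparse.csr_matrix(raw)
dense_part = rng.normal(size=(20, 3))

# Optional assumption: put the dense columns on a comparable scale first.
dense_scaled = StandardScaler().fit_transform(dense_part)

# Strategy 1: concatenate column-wise while keeping the result sparse.
combined = sparse.hstack(
    [sparse_part, sparse.csr_matrix(dense_scaled)], format="csr"
)

# Strategy 3: compress the sparse block into a few dense components,
# then place them next to the hand-crafted features.
svd = TruncatedSVD(n_components=5, random_state=0)
reduced = svd.fit_transform(sparse_part)
combined_dense = np.hstack([reduced, dense_scaled])

print(combined.shape, combined_dense.shape)
```

Keeping the concatenation sparse avoids materialising the full TF-IDF matrix in memory, which matters once the vocabulary grows to tens of thousands of columns.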

In [None]:
import pandas as pd
import os

import sys
sys.path.append("../..")
from toxicity.linear_predictor import LogisticPredictor, SVMPredictor
from toxicity.tuning import tune
from toxicity.utils import TAGS
from common.nlp.preprocessing import tf_idf
from common.nlp.feature_adder import FeatureAdder

data_dir = "../data/"
train = pd.read_csv(data_dir + "train.csv")
test = pd.read_csv(data_dir + "test.csv")

In [None]:
train_ys = {tag: train[tag].values for tag in TAGS}

# Get the sparse dataset
sparse_train, sparse_test = tf_idf(train, test)

In [None]:
predictor = LogisticPredictor(C=4)
predictor.evaluate(sparse_train, train_ys, method='CV')

In [None]:
# Get the dense features
fa_params = {
    "data_dir": data_dir,
    "upper_case": True,
    "word_count": True,
    "unique_words_count": True,
    "letter_count": True,
    "punctuation_count": True,
    "little_case": True,
    "stopwords": True,
    "question_or_exclamation": True,
    "number_bad_words": True
}
fa = FeatureAdder(**fa_params)

dense_train, dense_test = fa.add_features(train, test)

## Testing Predictors

We now have both the sparse and the dense feature sets. Let's explore their predictive power using our model arsenal.
We will use default parameters for now, but each of these predictors must be tuned to reach its full potential.
Let's test all predictors on both the dense and the sparse datasets, since different predictors are expected to perform better on different kinds of input.
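The `evaluate` method belongs to the project's own predictor classes, and its internals are not shown in this notebook. As a point of reference only, a hold-out evaluation with one model per tag might look like the sketch below. It uses plain scikit-learn estimators as stand-ins. The helper `holdout_auc`, the toy data and the choice of ROC AUC as the metric are all assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC


def holdout_auc(model_factory, X, ys, val_size=0.1, seed=0):
    """Fit one fresh model per tag and score it on a held-out split.

    Hypothetical helper: it only mimics what a hold-out evaluation could do.
    """
    scores = {}
    for tag, y in ys.items():
        X_tr, X_val, y_tr, y_val = train_test_split(
            X, y, test_size=val_size, random_state=seed, stratify=y
        )
        model = model_factory().fit(X_tr, y_tr)
        # Use probabilities when available, otherwise the decision margin.
        if hasattr(model, "predict_proba"):
            val_scores = model.predict_proba(X_val)[:, 1]
        else:
            val_scores = model.decision_function(X_val)
        scores[tag] = roc_auc_score(y_val, val_scores)
    return scores


# Toy data (hypothetical): 300 rows, 5 dense features, 2 binary tags.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
ys = {
    "tag_a": (X[:, 0] + 0.5 * rng.normal(size=300) > 0).astype(int),
    "tag_b": (X[:, 1] - X[:, 2] > 0).astype(int),
}

candidates = {
    "svm": lambda: LinearSVC(),
    "random_forest": lambda: RandomForestClassifier(
        n_estimators=50, random_state=0
    ),
}
results = {name: holdout_auc(make, X, ys) for name, make in candidates.items()}
print(results)
```

Passing a factory rather than a fitted model gives every tag its own freshly initialised estimator, so no state leaks between tags.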


In [None]:
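# RandomForestPredictor and LightGBMPredictor are not imported in the cells above;
# import them from the module that defines them before running this cell.
classes = [SVMPredictor, RandomForestPredictor, LightGBMPredictor]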

In [None]:
# Test dense features
for cls in classes:
    predictor = cls()  # Create an object using default parameters - probably suboptimal
    predictor.evaluate(dense_train, train_ys, val_size=0.1)


In [None]:
# Test sparse features
for cls in classes:
    predictor = cls()  # Create an object using default parameters - probably suboptimal
    predictor.evaluate(sparse_train, train_ys, val_size=0.1)
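Once each feature set has been evaluated on its own, strategy 2 from the introduction (separate classifiers that are then stacked) could be sketched as follows. This is a minimal example on synthetic data with scikit-learn estimators standing in for the project's predictors. The model choices, the toy data and the variable names are assumptions.

```python
import numpy as np
from scipy import sparse
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
n = 200
y = rng.integers(0, 2, size=n)

# Toy sparse block (hypothetical): mostly zeros, weak signal in column 0.
raw = rng.random((n, 30))
raw[raw < 0.9] = 0.0
raw[:, 0] += 0.5 * y
X_sparse = sparse.csr_matrix(raw)

# Toy dense block (hypothetical): Gaussian noise, signal in column 0.
X_dense = rng.normal(size=(n, 4))
X_dense[:, 0] += y

# Level 0: out-of-fold probabilities, so the meta-model never sees
# predictions made on rows the base model was trained on.
sparse_oof = cross_val_predict(
    LogisticRegression(C=4, max_iter=1000),
    X_sparse, y, cv=5, method="predict_proba",
)[:, 1]
dense_oof = cross_val_predict(
    RandomForestClassifier(n_estimators=50, random_state=0),
    X_dense, y, cv=5, method="predict_proba",
)[:, 1]

# Level 1: a small meta-model learns how to weigh the two base models.
meta_X = np.column_stack([sparse_oof, dense_oof])
meta = LogisticRegression().fit(meta_X, y)
stacked_proba = meta.predict_proba(meta_X)[:, 1]

print(meta.coef_)
```

The final probabilities here are computed on the same rows the meta-model was fitted on, so an honest score would still require a separate held-out set or a second round of cross-validation.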