# Dimensionality Reduction

In this notebook we will explore how [gensim](https://github.com/RaRe-Technologies/gensim) can be used within a supervised machine learning context. Concretely, we can use the topic modeling tools included in the package to reduce the dimensionality
of our (extremely sparse and wide) input.

In [None]:
import pandas as pd
import numpy as np

import sys
sys.path.append("..") # Append source directory to our Python path

from predictor import Predictor
from linear_predictor import LogisticPredictor
from preprocessing import *
from utils import *

DATA_ROOT = "../data/"

train = pd.read_csv(DATA_ROOT + "train.csv")
test = pd.read_csv(DATA_ROOT + "test.csv")

In [None]:
# Extract the true labels needed for training
train_ys = {tag: train[tag].values for tag in TAGS}

# Extract the test set ids needed for submitting
ids = test['id']

## Run the dimensionality reduction algorithms

This takes a long time to run (around 20 minutes in total on my machine) because the pipeline consists of several computationally expensive steps:

1. Tokenize text.
2. Create the train and test corpora.
3. Get the TF-IDF sparse representations.
4. Apply dimensionality reduction using LSA (called LSI in gensim) or LDA, both of which are optimized but still demanding.

In [None]:
train_x, test_x = gensim_preprocess(train, test, model_type='lsi', num_topics=500, report_progress=True, data_dir=DATA_ROOT)

## Feeding the reduced input to sklearn

Let's see how our reduced input performs with an (untuned) classifier from `sklearn`.

In [None]:
# Create a logistic regression classifier.
lr_params = {"C": 4, "dual": True}
predictor = LogisticPredictor(**lr_params)

# We currently support three evaluation methods: stratified CV, random CV and a single split. Let's try them all.
stratified_cv_loss = predictor.evaluate(train_x, train_ys, method='stratified_CV')
cv_loss = predictor.evaluate(train_x, train_ys, method='CV')
split_loss = predictor.evaluate(train_x, train_ys, method='split')

print("Stratified CV log loss: {}\nCV log loss: {}\nSplit log loss: {}".format(stratified_cv_loss, cv_loss, split_loss))
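
For readers without the project's `LogisticPredictor`, the three evaluation strategies can be approximated with plain `sklearn`. This is a rough sketch, not the code behind `predictor.evaluate`: it assumes a single binary label (the notebook has one label vector per tag), uses synthetic data from `make_classification`, and the helper `cv_log_loss` is a hypothetical name.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import (
    KFold,
    StratifiedKFold,
    cross_val_score,
    train_test_split,
)

# Synthetic stand-in for the reduced features and a single tag's labels.
X, y = make_classification(n_samples=200, n_features=20, random_state=0)
clf = LogisticRegression(C=4, max_iter=1000)


def cv_log_loss(splitter):
    """Hypothetical helper: mean log loss over the folds of `splitter`."""
    scores = cross_val_score(clf, X, y, cv=splitter, scoring="neg_log_loss")
    return -scores.mean()  # sklearn reports the negated loss


# Stratified CV keeps the class ratio roughly equal in every fold.
demo_stratified_loss = cv_log_loss(StratifiedKFold(n_splits=5, shuffle=True, random_state=0))

# Random CV shuffles the rows without looking at the labels.
demo_cv_loss = cv_log_loss(KFold(n_splits=5, shuffle=True, random_state=0))

# A single split trains once and validates on a held-out 20%.
X_fit, X_val, y_fit, y_val = train_test_split(X, y, test_size=0.2, random_state=0)
val_probs = clf.fit(X_fit, y_fit).predict_proba(X_val)
demo_split_loss = log_loss(y_val, val_probs, labels=[0, 1])

print(demo_stratified_loss, demo_cv_loss, demo_split_loss)
```

Stratification matters most for rare labels, where a random fold can end up with very few positive examples.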

## Create a submission 

Let's use our classifier to create a sample submission and submit it to [Kaggle](https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge/submit).

In [None]:
create_submission(predictor, train_x, train_ys, test_x, ids, '../submissions/using_lsi.csv')
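
The submission file for this competition is a CSV with an `id` column followed by one probability column per tag. A minimal sketch of assembling such a table with pandas is shown below; `build_submission` is a hypothetical helper, not the project's `create_submission`, and the ids and probabilities are made up for illustration.

```python
import numpy as np
import pandas as pd


def build_submission(ids, predictions):
    """Hypothetical helper: combine ids with a dict of tag -> probabilities."""
    frame = pd.DataFrame({"id": list(ids)})
    for tag, probs in predictions.items():
        # One column per tag, in the order the tags were provided.
        frame[tag] = np.asarray(probs, dtype=float)
    return frame


# Made-up ids and probabilities for two of the tags.
demo_submission = build_submission(
    ["a1", "b2"],
    {"toxic": [0.9, 0.1], "insult": [0.3, 0.05]},
)

# With no path argument, to_csv returns the CSV text instead of writing a file.
demo_csv = demo_submission.to_csv(index=False)
print(demo_csv)
```

Passing `index=False` keeps the pandas row index out of the file, so the first column is the `id`.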

## Next steps

We could improve this pipeline by carefully tuning the dimensionality reduction step (for example, trying another model from `gensim.models`) and by using a stronger classifier (perhaps `XGBoost`?).

