# Dimensionality Reduction

In this notebook we use gensim and sklearn to reduce the dimensionality
of our (extremely sparse and wide) input.

In [1]:
import pandas as pd
import numpy as np

import sys
sys.path.append("../..") # Append source directory to our Python path

from toxicity.predictor import Predictor
from toxicity.linear_predictor import LogisticPredictor, SVMPredictor
from common.nlp.preprocessing import *
from toxicity.utils import *

import nltk

DATA_ROOT = "../data/"

train = pd.read_csv("../data/train.csv")
test = pd.read_csv("../data/test.csv")


In [None]:
# Extract the true labels needed for training
train_ys = {tag: train[tag].values for tag in TAGS}

# Extract the test set ids needed for submitting
ids = test['id']

## Run the dimensionality reduction algorithms

This takes a long time to run (around 20 minutes in total on my machine) because the pipeline consists of several computationally expensive steps:

1. Tokenize text using NLTK's tokenizer.
2. Create the train and test corpora.
3. Get the TF-IDF sparse representations.
4. Apply dimensionality reduction using Latent Semantic Analysis (LSA).
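
The four steps above can be sketched with scikit-learn alone. This is a minimal illustration, not the implementation of `truncatedsvd_preprocess`: it assumes scikit-learn's built-in tokenizer instead of NLTK's, uses made-up toy sentences instead of the competition data, and sets `n_components=2` where the notebook uses `num_topics=500`.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-ins for the comment text (assumption: the real text comes from train.csv / test.csv).
train_texts = [
    "you are a wonderful person",
    "this is a terrible and rude comment",
    "what a wonderful and helpful edit",
    "rude people write terrible comments",
]
test_texts = ["a helpful comment", "a rude person"]

# Steps 1-3: tokenize and build sparse TF-IDF matrices.
# The vocabulary and IDF weights are fitted on the training text only.
vectorizer = TfidfVectorizer()
train_tfidf = vectorizer.fit_transform(train_texts)
test_tfidf = vectorizer.transform(test_texts)

# Step 4: LSA is a truncated SVD of the TF-IDF matrix.
# n_components plays the role of num_topics; it is tiny here because the toy vocabulary is tiny.
svd = TruncatedSVD(n_components=2, random_state=0)
train_reduced = svd.fit_transform(train_tfidf)
test_reduced = svd.transform(test_tfidf)

print(train_reduced.shape, test_reduced.shape)
```

Fitting both the vectorizer and the SVD on the training text and only transforming the test text keeps the two reduced matrices in the same feature space.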

In [None]:
train_x, test_x = truncatedsvd_preprocess(train, test, num_topics=500, use_own_tfidf=True, report_progress=True, data_dir=DATA_ROOT, save=True)

## Feeding the reduced input to sklearn

Let's see how our reduced input does using an (untuned) classifier from `sklearn`.

In [None]:
# Create an SVM classifier.
svm_params = {"C": 1, "dual": True}
predictor = SVMPredictor(**svm_params)

split_loss = predictor.evaluate(train_x, train_ys, method='split')
print("Split CV log loss: {}".format(split_loss))
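
For readers without the `toxicity` package, a hold-out evaluation of this kind might look roughly like the sketch below. It is not the code behind `predictor.evaluate`: the helper `split_log_loss`, the synthetic features, and the two tag names are hypothetical, and it swaps in `LogisticRegression` because that model exposes probabilities directly, which log loss needs.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X = rng.normal(size=(200, 5))  # stand-in for the reduced features
ys = {                         # stand-in for train_ys: one binary vector per tag
    "tag_a": (X[:, 0] > 0).astype(int),
    "tag_b": (X[:, 1] + X[:, 2] > 0).astype(int),
}


def split_log_loss(X, ys, test_size=0.2, seed=0):
    """Mean hold-out log loss over all tags (hypothetical helper)."""
    losses = []
    for tag, y in ys.items():
        # Hold out a stratified validation split for this tag.
        X_tr, X_val, y_tr, y_val = train_test_split(
            X, y, test_size=test_size, random_state=seed, stratify=y
        )
        clf = LogisticRegression().fit(X_tr, y_tr)
        probs = clf.predict_proba(X_val)[:, 1]  # probability of the positive class
        losses.append(log_loss(y_val, probs, labels=[0, 1]))
    return float(np.mean(losses))


print("Hold-out log loss: {:.4f}".format(split_log_loss(X, ys)))
```

One independent binary classifier is trained per tag, and the per-tag losses are averaged into a single number.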

## Create a submission 

Let's use our classifier to create a sample submission and submit it to [Kaggle](https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge/submit).

In [None]:
create_submission(predictor, train_x, train_ys, test_x, ids, '../submissions/using_lsi.csv')
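
As a rough sketch of the file such a call is expected to produce, the competition wants one row per test id and one probability column per label. The ids and the constant 0.5 probabilities below are placeholders, and the `TAGS` list is assumed to match the one imported from `toxicity.utils`; this is not the body of `create_submission`.

```python
import io

import numpy as np
import pandas as pd

# The six labels of the Jigsaw toxic comment competition
# (assumption: identical to TAGS from toxicity.utils).
TAGS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

ids = pd.Series(["id_a", "id_b", "id_c"], name="id")   # made-up test ids
probs = {tag: np.full(len(ids), 0.5) for tag in TAGS}  # placeholder predictions

# One row per id, one probability column per tag.
submission = pd.DataFrame({"id": ids})
for tag in TAGS:
    submission[tag] = probs[tag]

# Write to an in-memory buffer here; a real run would pass a file path instead.
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
print(buffer.getvalue().splitlines()[0])
```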

## Next steps

We could improve this pipeline by carefully tuning the dimensionality reduction steps (for example, trying another model from `gensim.models`) and by using a stronger classifier (perhaps `XGBoost`?).

