# Using our developed libraries

To quickly use or inspect the libraries we have developed, and to visualize their results interactively, we can use
these notebooks.

**NOTE**: Please remember to clear all output before committing any .ipynb file by doing: `Cell -> All Outputs -> Clear`

In [None]:
import pandas as pd
import numpy as np

import sys
sys.path.append("../..")

from toxicity.predictor import Predictor
from toxicity.linear_predictor import LogisticPredictor

from common.nlp.preprocessing import *
from toxicity.utils import *

train = pd.read_csv("../data/train.csv")
test = pd.read_csv("../data/test.csv")

In [None]:
# Extract the true labels needed for training
train_ys = {tag: train[tag].values for tag in TAGS}

# Extract the test set ids needed for submitting
ids = test['id']

## Let's test our preprocessing functions

The `tf_idf` transformation produces a sparse matrix representation.

In [None]:
# First run: build the sparse matrices from the raw data and save them.
# train_x, test_x = get_sparse_matrix(train, test, save=True, load=False, data_dir="../data")
# Later runs: load the saved matrices instead.
train_x, test_x = get_sparse_matrix(load=True, data_dir="../data")
print("The tf_idf algorithm created {} features per sample".format(train_x.shape[1]))
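To make "sparse matrix representation" concrete, here is a minimal pure-Python sketch of a tf-idf transformation that stores only the non-zero weights of each document. It is an illustration under stated assumptions, not the implementation behind `get_sparse_matrix`: the helper name `tf_idf_sparse`, the whitespace tokenizer and the smoothed idf formula are all choices made for this example.

```python
import math
from collections import Counter


def tf_idf_sparse(documents):
    """Return one {term: weight} dict per document, keeping only non-zero entries.

    Hypothetical helper for illustration; not part of the toxicity library.
    """
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    rows = []
    for tokens in tokenized:
        row = {}
        for term, count in Counter(tokens).items():
            tf = count / len(tokens)
            # Smoothed idf (an assumption): terms found in every document keep a positive weight.
            idf = math.log((1 + n_docs) / (1 + doc_freq[term])) + 1
            row[term] = tf * idf
        rows.append(row)
    return rows, sorted(doc_freq)


docs = ["you are great", "you are awful awful", "great work"]
rows, vocabulary = tf_idf_sparse(docs)
stored = sum(len(row) for row in rows)
dense_cells = len(rows) * len(vocabulary)
print("{} stored values instead of {} dense cells".format(stored, dense_cells))
```

With a realistic vocabulary each comment contains only a tiny fraction of all terms, so skipping the zeros is what keeps a matrix with that many features per sample manageable in memory.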

## Using the processed dataset to train a linear predictor

Let's check whether our predictor implementations can fit the processed datasets.

In [None]:
lr_params = {"C": 4, "dual": True}
predictor = LogisticPredictor(**lr_params)

# We currently support three evaluation methods: stratified CV, random CV and split. Let's check them.
stratified_cv_loss = predictor.evaluate(train_x, train_ys, method='stratified_CV')
cv_loss = predictor.evaluate(train_x, train_ys, method='CV')
split_loss = predictor.evaluate(train_x, train_ys, method='split')

print("Stratified CV log loss: {}\nCV log loss: {}\nSplit log loss: {}".format(stratified_cv_loss, cv_loss, split_loss))
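The difference between random and stratified cross-validation, and the log loss metric itself, can be sketched in plain Python. This is only an illustration of the general ideas, not how `predictor.evaluate` is implemented: the helper names `log_loss`, `random_folds` and `stratified_folds` are hypothetical, and the labels are made up to mimic a rare tag.

```python
import math
import random
from collections import defaultdict


def log_loss(y_true, y_prob, eps=1e-15):
    """Mean binary cross-entropy; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(y_true)


def random_folds(labels, n_folds, seed=0):
    """Shuffle the indices and deal them into folds, ignoring the labels."""
    indices = list(range(len(labels)))
    random.Random(seed).shuffle(indices)
    return [indices[i::n_folds] for i in range(n_folds)]


def stratified_folds(labels, n_folds, seed=0):
    """Deal indices into folds class by class, so each fold keeps the class ratio."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    rng = random.Random(seed)
    folds = [[] for _ in range(n_folds)]
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)
        for position, index in enumerate(members):
            folds[position % n_folds].append(index)
    return folds


labels = [1] * 4 + [0] * 16  # 20% positives, mimicking a rare tag
for fold in stratified_folds(labels, n_folds=4):
    print(sum(labels[i] for i in fold), "positives out of", len(fold))

# An uninformative prediction of 0.5 scores ln(2) regardless of the label.
baseline = log_loss([1, 0], [0.5, 0.5])
print(baseline)
```

With random folds, a rare tag can end up with very few positives in some folds, which makes the per-fold loss noisy; stratification avoids that by construction.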

In [None]:
create_submission(predictor, train_x, train_ys, test_x, ids, '../submissions/first_attempt.csv')