# Preprocessing and first model

A first logistic regression with words and char-level n-grams.

Heavily inspired by [this Kaggle kernel](https://www.kaggle.com/tunguz/logistic-regression-with-words-and-char-n-grams).

In [1]:
import os
os.getcwd()

'/home/jovyan/work'

In [2]:
import numpy as np
import pandas as pd
from preprocess import preprocess  # local file; restart the kernel after editing it, otherwise it won't be re-imported

## Read only the first few rows during early development:
#train = pd.read_csv('data/train.csv', nrows=1000).fillna(' ')
#test = pd.read_csv('data/test.csv', nrows=1000).fillna(' ')

## These lines load all data:
train = pd.read_csv('data/train.csv').fillna(' ')
test = pd.read_csv('data/test.csv').fillna(' ')

[train, test, train_text, test_text, all_text, class_names] = preprocess(train, test)

### Inspect preprocessed data

In [3]:
train.head()

| | id | comment_text | toxic | severe_toxic | obscene | threat | insult | identity_hate |
|---|---|---|---|---|---|---|---|---|
| 0 | 0000997932d777bf | Explanation\nWhy the edits made under my usern... | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 000103f0d9cfb60f | D'aww! He matches this background colour I'm s... | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 000113f07ec002fd | Hey man, I'm really not trying to edit war. It... | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0001b41b1c6bb37e | "\nMore\nI can't make any real suggestions on ... | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0001d958c54c6e35 | You, sir, are my hero. Any chance you remember... | 0 | 0 | 0 | 0 | 0 | 0 |


In [4]:
train_text[0]

"Explanation\nWhy the edits made under my username Hardcore Metallica Fan were reverted? They weren't vandalisms, just closure on some GAs after I voted at New York Dolls FAC. And please don't remove the template from the talk page since I'm retired now.89.205.38.27"

## Build n-grams

In [4]:
import pickle

#from get_ngrams import get_ngrams
#[train_features, test_features, vectorizers] = get_ngrams(train_text, test_text, Tfidf = True, chars = True)
## Store these results on the hard disk, because it takes a lot of time to compute:
## Dump data (uff, 3.5GB o_O)
#pickle.dump( train_features, open( "intermediate/train_features.pkl", "wb" ) )
#pickle.dump( test_features, open( "intermediate/test_features.pkl", "wb" ) )
#pickle.dump( vectorizers, open( "intermediate/vectorizers.pkl", "wb" ) )

## Retrieve data
with open("intermediate/train_features.pkl", "rb") as f:
    train_features = pickle.load(f)
with open("intermediate/test_features.pkl", "rb") as f:
    test_features = pickle.load(f)
with open("intermediate/vectorizers.pkl", "rb") as f:
    vectorizers = pickle.load(f)

On measuring the memory size of these objects, see [this SO answer](https://stackoverflow.com/questions/563840/how-can-i-check-the-memory-usage-of-objects-in-ipython/565382#565382).
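For a SciPy sparse matrix, which `train_features` appears to be given the `.getrow(...)` call further down, the size can also be read directly off the three arrays that back it. A minimal sketch on random toy data; `sparse_nbytes` is a hypothetical helper, not something from this project:

```python
import pickle

from scipy import sparse


def sparse_nbytes(matrix):
    """Bytes held by the three arrays backing a CSR/CSC matrix (hypothetical helper)."""
    return matrix.data.nbytes + matrix.indices.nbytes + matrix.indptr.nbytes


# Toy stand-in for train_features: 1000 x 30000 with about 0.1% non-zero entries.
toy_features = sparse.random(1000, 30000, density=0.001, format="csr", random_state=0)

in_memory = sparse_nbytes(toy_features)
pickled = len(pickle.dumps(toy_features, protocol=pickle.HIGHEST_PROTOCOL))
dense = toy_features.shape[0] * toy_features.shape[1] * toy_features.dtype.itemsize

print("sparse arrays:", in_memory, "bytes")
print("pickled:", pickled, "bytes")
print("same matrix stored densely:", dense, "bytes")
```

If the pickles get too large, `scipy.sparse.save_npz` / `load_npz` write the matrix compressed, though the vectorizers would still need to be pickled separately.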


### Inspect n-grams

If you ran `get_ngrams` with `chars=True`, you have two vectorizers: `vectorizers[0]` for word n-grams and `vectorizers[1]` for char n-grams. Pick one by index in the cells below.
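To make the column layout concrete, here is a rough sketch of how word and char TF-IDF features could be stacked side by side. This is an assumption about what `get_ngrams` does, not its actual code; the n-gram ranges, the `max_features` values and the toy `docs` are all made up for illustration.

```python
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "link to the page",
    "he stood as an election candidate",
    "something about his views",
]

# Assumed settings; the real get_ngrams may use different ones.
word_vec = TfidfVectorizer(analyzer="word", ngram_range=(1, 2), max_features=10000)
char_vec = TfidfVectorizer(analyzer="char", ngram_range=(1, 4), max_features=20000)

word_features = word_vec.fit_transform(docs)
char_features = char_vec.fit_transform(docs)

# Word columns come first, char columns follow.
features = hstack([word_features, char_features]).tocsr()
n_word = word_features.shape[1]

# Feature names ordered by column index, built from the fitted vocabulary.
word_names = sorted(word_vec.vocabulary_, key=word_vec.vocabulary_.get)

# Non-zero columns of document 0, restricted to the word n-gram block.
cols = features.getrow(0).nonzero()[1]
word_cols = cols[cols < n_word]
words_in_doc0 = sorted(word_names[i] for i in word_cols)
print(words_in_doc0)
```

Taking the boundary from `word_features.shape[1]` avoids hard-coding the number of word features when splitting the columns.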

In [6]:
train_text[123]

'Should say something about his views as an educationalist and socialist political commentator.\n\nLink to http://www.langandlit.ualberta.ca/Fall2004/SteigelBainbridge.html mentions this a bit - he stood as an election candidate for Respect.'

In [7]:
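# Get all word n-grams (unigrams and bigrams, not chars) in comment no. 123:

ngram_idxs_in_document_123 = train_features.getrow(123).nonzero()[1]
# Columns below 10000 are assumed to be the word n-grams; the char n-grams come after them
word_idxs = ngram_idxs_in_document_123[ngram_idxs_in_document_123 < 10000]

# Note: scikit-learn >= 1.0 deprecates get_feature_names() in favor of get_feature_names_out()
words = vectorizers[0].get_feature_names()
[words[i] for i in word_idxs]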

['www',
 'views',
 'to',
 'this',
 'something about',
 'something',
 'socialist',
 'should',
 'say',
 'respect',
 'political',
 'mentions',
 'link to',
 'link',
 'http www',
 'http',
 'html',
 'his',
 'he',
 'for',
 'election',
 'candidate',
 'ca',
 'bit',
 'as an',
 'as',
 'and',
 'an',
 'about his',
 'about']

In [21]:
vectorizers[0].get_feature_names()[0:10]

['00', '000', '01', '02', '03', '04', '05', '06', '07', '08']

In [8]:
len(vectorizers[1].get_feature_names())  # [1]: char n-grams

20000

In [9]:
vectorizers[1].get_feature_names()[0]

'\na'

In [16]:
vectorizers[1].get_feature_names()[19000]

'ut '

# Train models

In [None]:
## Reloading the module after editing its file would help here, but
##  the name `train` now refers to the data frame, not the module :(
# import importlib
# importlib.reload(train)
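
One way around the name clash might be to import the module under an alias and reload through that alias. The sketch below uses the standard-library `json` module as a stand-in, since `train.py` is a local file; in this notebook the equivalent would presumably be `import train as train_module` followed by `importlib.reload(train_module)`, where `train_module` is a hypothetical alias.

```python
import importlib
import json as json_module  # stand-in for: import train as train_module

# A variable that takes over the module's usual name, like the `train` data frame does.
json = {"rows": 3}

# Reload through the alias; the shadowing variable is left untouched.
reloaded = importlib.reload(json_module)
print(reloaded is json_module, json)
```

Functions imported with `from train import train_model` keep pointing at the old code after a reload, so that import line has to be run again as well.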

In [4]:
from train import train_model

[predictions, mean_auc] = train_model(train, test, train_features, test_features, class_names,
                                      method = 'logreg', cv = False)
# Ideally, this function would be split into separate training and prediction steps

now training toxic
now training severe_toxic
now training obscene
now training threat
now training insult
now training identity_hate
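
A split into separate training and prediction steps could look roughly like this. It is only a sketch, assuming one independent binary logistic regression per label; `fit_models` and `predict_all` are hypothetical names, the data is random, and the real `train_model` in `train.py` may work differently.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix
from sklearn.linear_model import LogisticRegression


def fit_models(features, labels, class_names):
    """Fit one binary classifier per label column (hypothetical helper)."""
    models = {}
    for name in class_names:
        model = LogisticRegression(max_iter=1000)
        model.fit(features, labels[name])
        models[name] = model
    return models


def predict_all(models, features):
    """Return one column of positive-class probabilities per label."""
    return pd.DataFrame(
        {name: model.predict_proba(features)[:, 1] for name, model in models.items()}
    )


# Random toy data standing in for the TF-IDF features and the label columns.
rng = np.random.default_rng(0)
toy_class_names = ["toxic", "insult"]
toy_train = csr_matrix(rng.random((40, 5)))
toy_test = csr_matrix(rng.random((10, 5)))
toy_labels = pd.DataFrame({"toxic": [0, 1] * 20, "insult": [0, 0, 1, 1] * 10})

toy_models = fit_models(toy_train, toy_labels, toy_class_names)
toy_predictions = predict_all(toy_models, toy_test)
print(toy_predictions.shape)
```

With this shape the fitted models can be kept around (or pickled) and reused on new data without retraining.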


Logistic regression seems to perform *much* better than SGD
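
A comparison like this can be checked with cross-validated ROC AUC. The sketch below runs on synthetic data from `make_classification`, so its numbers say nothing about the toxic-comment data; the `SGDClassifier` settings are scikit-learn defaults and an assumption about what "SGD" meant here.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.model_selection import cross_val_score

# Synthetic binary problem, only to show the mechanics of the comparison.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

candidates = {
    "logreg": LogisticRegression(max_iter=1000),
    "sgd": SGDClassifier(random_state=0),  # default hinge loss; AUC uses decision_function
}

# Mean ROC AUC over 3 folds for each candidate.
mean_aucs = {
    name: cross_val_score(model, X, y, cv=3, scoring="roc_auc").mean()
    for name, model in candidates.items()
}

for name, auc in mean_aucs.items():
    print(f"{name}: mean AUC = {auc:.3f}")
```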

# Store submission csv

In [5]:
predictions.to_csv('submission.csv', index=False)