In [255]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline



![img](https://github.com/yandexdataschool/nlp_course/raw/master/resources/banhammer.jpg)

__In this notebook__ you will build an algorithm that classifies social media comments into normal or toxic.
Like in many real-world cases, you only have a small (10^3) dataset of hand-labeled examples to work with. We'll tackle this problem using both classical nlp methods and embedding-based approach.

### Read the comments.tsv with tab separator. 
#### Choose the "should_ban" feature as the target.

In [256]:
import pandas as pd

# <Your code Goes Here>
data = pd.read_csv('comments.tsv', sep='\t')
y = data['should_ban']
data.head()

Unnamed: 0,should_ban,comment_text
0,0,The picture on the article is not of the actor...
1,1,"Its madness. Shes of Chinese heritage, but JAP..."
2,1,Fuck You. Why don't you suck a turd out of my ...
3,1,God is dead\nI don't mean to startle anyone bu...
4,1,THIS USER IS A PLANT FROM BRUCE PERENS AND GRO...


### Split the dataset into training and test sets: 50/50
#### Do not forget to stratify the split, cause we are solving a classification problem.

In [258]:
# <Your code Goes Here>
from sklearn.model_selection import train_test_split

X = data['comment_text']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, shuffle=False)

In [259]:
print(X_train)
y_train

0      The picture on the article is not of the actor...
1      Its madness. Shes of Chinese heritage, but JAP...
2      Fuck You. Why don't you suck a turd out of my ...
3      God is dead\nI don't mean to startle anyone bu...
4      THIS USER IS A PLANT FROM BRUCE PERENS AND GRO...
                             ...                        
495                       . Crzrussian, slit your wrists
496    "\n\n Image caption - Homosexual? \n\nThe seco...
497    Hello \n\nHi, Ursasapien. I was looking throug...
498    I assume that you have never met foreigners. W...
499    "\nUnsurprisingly, trimming it down to only ga...
Name: comment_text, Length: 500, dtype: object


0      0
1      1
2      1
3      1
4      1
      ..
495    1
496    1
497    0
498    1
499    0
Name: should_ban, Length: 500, dtype: int64

__Note:__ it is generally a good idea to split data into train/test before anything is done to them.

It guards you against possible data leakage in the preprocessing stage. For example, should you decide to select words present in obscene tweets as features, you should only count those words over the training set. Otherwise your algoritm can cheat evaluation.

### Preprocessing and tokenizationv

Comments contain raw text with punctuation, upper/lowercase letters and even newline symbols.

To simplify all further steps, we'll split text into space-separated tokens using one of nltk tokenizers.

Do not forget to lowercase the words before tokenization, cause we want to generate case insensitive tokens.

In [260]:
from nltk import word_tokenize
# X_train = X_train.str.lower()
# X_train=X_train.apply(lambda X: word_tokenize(X))
X_train.head()

0    The picture on the article is not of the actor...
1    Its madness. Shes of Chinese heritage, but JAP...
2    Fuck You. Why don't you suck a turd out of my ...
3    God is dead\nI don't mean to startle anyone bu...
4    THIS USER IS A PLANT FROM BRUCE PERENS AND GRO...
Name: comment_text, dtype: object

In [261]:
from nltk.tokenize import TweetTokenizer

# <Your code Goes Here>
tknzr = TweetTokenizer(strip_handles=True, reduce_len=True)


text = 'How to be a grown-up at work: replace "fuck you" with "Ok, great!".'

# kk = tknzr.tokenize(text)

print("before:", text)
print("after:", preprocess(text))

before: How to be a grown-up at work: replace "fuck you" with "Ok, great!".
after: grownup work replac fuck woke great


### Preprocess each comment in train and test

In [262]:
# <Your code Goes Here>
import nltk
from nltk import word_tokenize
from spellchecker import SpellChecker
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer
from nltk import FreqDist
from nltk.stem import WordNetLemmatizer
from nltk import word_tokenize,pos_tag
from nltk.stem import PorterStemmer
import re

# nltk.download('all')
en_stopwords = stopwords.words('english')

def remove_whitespace(text):
    return  " ".join(text.split())

def spell_check(text):
    
    result = []
    spell = SpellChecker()
    for word in text:
        correct_word = spell.correction(word)
        result.append(correct_word)
    
    return result

def remove_stopwords(text):
    result = []
    for token in text:
        if token not in en_stopwords:
            result.append(token)
            
    return result


def remove_punct(text):
    
    tokenizer = RegexpTokenizer(r"\w+")
    lst=tokenizer.tokenize(' '.join(text))
    return lst


def frequent_words(df):
    
    lst=[]
    for text in df.values:
        lst+=text[0]
    fdist=FreqDist(lst)
    return fdist.most_common(10)


def lemmatization(text):
    
    result=[]
    wordnet = WordNetLemmatizer()
    for token,tag in pos_tag(text):
        pos=tag[0].lower()
        
        if pos not in ['a', 'r', 'n', 'v']:
            pos='n'
            
        result.append(wordnet.lemmatize(token,pos))
    
    return result


def stemming(text):
    porter = PorterStemmer()
    
    result=[]
    for word in text:
        result.append(porter.stem(word))
    return result


def remove_tag(text):
    
    text=' '.join(text)
    html_pattern = re.compile('<.*?>')
    return html_pattern.sub(r'', text)


def remove_urls(text):
    url_pattern = re.compile(r'https?://\S+|www\.\S+')
    return url_pattern.sub(r'', text)



def preprocess(text):
    text = text.lower()
    text = remove_whitespace(text)
    text = spell_check(text.split())
    text = remove_stopwords(text)
    text = remove_punct(text)
#     text = lemmatization(text)
    text = stemming(text)
    text = remove_tag(text)
    text = remove_urls(text)
    return text

lst = []

def remove_freq_words(text):
    
    result=[]
    for item in text:
        if item not in lst:
            result.append(item)
    
    return result

def preprocessData(data):
    data = data.str.lower()
    data = data.apply(remove_whitespace)
    data = data.apply(lambda X: word_tokenize(X))
    data = data.apply(spell_check)
    data = data.apply(remove_stopwords)
    data = data.apply(remove_punct)
    freq_words = frequent_words(data)
    
    for a,b in freq_words:
        lst.append(b)
        
    data = data.apply(remove_freq_words)
#     data = data.apply(lemmatization)
    data = data.apply(stemming)
    data = data.apply(remove_tag)
    data = data.apply(remove_urls)
    data = data.apply(remove_whitespace)
    return data

X_train = X_train[:90]
X_test = X_test[:10]

print(X_train)
print(X_test)
texts_train = preprocessData(X_train)
texts_test = preprocessData(X_test)
print(texts_train)
print(texts_test)

0     The picture on the article is not of the actor...
1     Its madness. Shes of Chinese heritage, but JAP...
2     Fuck You. Why don't you suck a turd out of my ...
3     God is dead\nI don't mean to startle anyone bu...
4     THIS USER IS A PLANT FROM BRUCE PERENS AND GRO...
                            ...                        
85                           fuck god youn shit asshole
86    Fuck you Jasper you know he's a cunt and if yo...
87    "\n{| style=""border-spacing:8px;margin:0px -8...
88    Hilleri groves\nCategory:WelcomeBotResearch\nH...
89    , and to suggest that is flabbergastingly arro...
Name: comment_text, Length: 90, dtype: object
500    VfD\nOn May 4, 2005, this article was nominate...
501    See page 48 of the Freeh report for Freeh's ex...
502    "\n\n This is of no consequence, I have listed...
503    nigger \n\n{{unblock|reason=Liar 98.162.163.72  }
504    This is what i mena SOME TOTAL BULLYING LITTLE...
505    uhhhh YOU DON'T OWN THIS SITE. SO DON'T TOUCH 

In [266]:
texts_train[89]

'suggest flabbergastingli arrog'

In [267]:
texts_train[5]

'astor hi one first 100 peopl sign free astor account via request page readi start hand account i d still like one astor provid access via email invit get account pleas email swallingwikimedia org subject line astor english wikipedia usernam prefer email address astor account inform given astor provid account otherwis remain privat pleas novemb drop messag say want need account longer meet deadlin assum lost interest provid account next person rather long waitlist thank talk'

In [268]:
assert texts_train[5] ==  'who cares anymore . they attack with impunity .'
assert texts_test[89] == 'hey todds ! quick q ? why are you so gay'
assert len(texts_test) == len(y_test)

AssertionError: 

### Solving it: bag of words

![img](http://www.novuslight.com/uploads/n/BagofWords.jpg)

One traditional approach to such problem is to use bag of words features:
1. build a vocabulary of frequent words (use train data only)
2. for each training sample, count the number of times a word occurs in it (for each word in vocabulary).
3. consider this count a feature for some classifier

__Note:__ in practice, you can compute such features using sklearn. Please don't do that in the current assignment, though.
* `from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer`

#### task: find up to k most frequent tokens in texts_train,
#### sort them by number of occurences (highest first)

In [272]:
from keras.preprocessing.text import Tokenizer

sentence = ["John likes to watch movies. Mary likes movies too."]

def print_bow(sentence) -> None:
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(sentence)
    sequences = tokenizer.texts_to_sequences(sentence)
    word_index = tokenizer.word_index 
    bow = {}
    for key in word_index:
        bow[key] = sequences[0].count(word_index[key])

    print(f"Bag of word sentence 1:\n{bow}")
    print(f'We found {len(word_index)} unique tokens.')

print_bow(sentence)

Bag of word sentence 1:
{'likes': 2, 'movies': 2, 'john': 1, 'to': 1, 'watch': 1, 'mary': 1, 'too': 1}
We found 7 unique tokens.


In [294]:
from collections import Counter
counter = [Counter(re.findall(r'\w+', txt)) for txt in texts_train]


counter[5]

Counter({'astor': 6,
         'hi': 1,
         'one': 2,
         'first': 1,
         '100': 1,
         'peopl': 1,
         'sign': 1,
         'free': 1,
         'account': 7,
         'via': 2,
         'request': 1,
         'page': 1,
         'readi': 1,
         'start': 1,
         'hand': 1,
         'i': 1,
         'd': 1,
         'still': 1,
         'like': 1,
         'provid': 3,
         'access': 1,
         'email': 3,
         'invit': 1,
         'get': 1,
         'pleas': 2,
         'swallingwikimedia': 1,
         'org': 1,
         'subject': 1,
         'line': 1,
         'english': 1,
         'wikipedia': 1,
         'usernam': 1,
         'prefer': 1,
         'address': 1,
         'inform': 1,
         'given': 1,
         'otherwis': 1,
         'remain': 1,
         'privat': 1,
         'novemb': 1,
         'drop': 1,
         'messag': 1,
         'say': 1,
         'want': 1,
         'need': 1,
         'longer': 1,
         'meet': 1,
      

In [298]:
from collections import Counter # <- use me 

k = 10000

bow_vocabulary = [Counter(re.findall(r'\w+', txt)) for txt in texts_train]

print('example features:', sorted(bow_vocabulary)[::100])

example features: [Counter({'articl': 8, 'subject': 4, 'tar': 3, 'knight': 3, 'photo': 3, 'imag': 3, 'pictur': 2, 'actor': 2, 'ad': 2, 'edit': 2, 'use': 2, 'even': 1, 'basic': 1, 'googl': 1, 'search': 1, 'turn': 1, 'written': 1, 'continu': 1, 'relev': 1, 'remov': 1, 'other': 1, 'made': 1, 'revert': 1, 'pleas': 1, 'provid': 1, 'legitim': 1, 'reason': 1, 'inclus': 1, 'fair': 1, 'found': 1, 'mean': 1, 'random': 1, 'placehold': 1})]


In [380]:
def text_to_bow(text):
    """ convert text string to an array of token counts. Use bow_vocabulary. """
#     <YOUR CODE>
    token_counts = sum(bow_vocabulary, Counter())
#     print(token_counts)
#     token_counts = token_counts.values()
#     for k, v in token_counts.items():
#         token_counts[k] = float(v)
    print(token_counts)
    values = [float(x) for x in list(token_counts.values())]
    return np.array(values, 'float32')

In [381]:
X_train_bow = np.stack(list(map(text_to_bow, texts_train)))
X_test_bow = np.stack(list(map(text_to_bow, texts_test)))

Counter({'articl': 42, 'page': 40, 'talk': 33, 'one': 31, 'wikipedia': 31, 'block': 28, 'use': 22, 'pleas': 20, 'edit': 19, 'like': 18, 'would': 17, 'think': 17, 'get': 16, 'see': 15, 'make': 14, 'fuck': 14, 'sourc': 14, 'i': 13, 'thank': 13, 'discuss': 13, 'ad': 12, 'know': 12, 'read': 12, 'help': 12, 'user': 11, 'request': 11, 'thing': 11, 'time': 11, 've': 11, 'take': 11, 'may': 11, 'work': 11, 'member': 11, 'remov': 10, 'reason': 10, 'person': 10, 'peopl': 10, 'refer': 10, 'write': 10, 'made': 9, 'provid': 9, 'good': 9, 'go': 9, 'top': 9, 'honorari': 9, 'delet': 9, 'civil': 9, 'style': 9, 'subject': 8, 'right': 8, 'account': 8, 'want': 8, 'question': 8, 'point': 8, 'review': 8, 'believ': 8, 'also': 8, 'list': 8, 'admin': 8, 'ask': 8, 'ignor': 8, 'even': 7, 'imag': 7, 'american': 7, 'well': 7, 'rather': 7, 'tri': 7, 'life': 7, 'place': 7, 'two': 7, 'contribut': 7, 'design': 7, 'serious': 7, 'mean': 6, 'astor': 6, 'still': 6, 'say': 6, 'need': 6, 'interest': 6, 'histori': 6, 'anoth':

Counter({'articl': 42, 'page': 40, 'talk': 33, 'one': 31, 'wikipedia': 31, 'block': 28, 'use': 22, 'pleas': 20, 'edit': 19, 'like': 18, 'would': 17, 'think': 17, 'get': 16, 'see': 15, 'make': 14, 'fuck': 14, 'sourc': 14, 'i': 13, 'thank': 13, 'discuss': 13, 'ad': 12, 'know': 12, 'read': 12, 'help': 12, 'user': 11, 'request': 11, 'thing': 11, 'time': 11, 've': 11, 'take': 11, 'may': 11, 'work': 11, 'member': 11, 'remov': 10, 'reason': 10, 'person': 10, 'peopl': 10, 'refer': 10, 'write': 10, 'made': 9, 'provid': 9, 'good': 9, 'go': 9, 'top': 9, 'honorari': 9, 'delet': 9, 'civil': 9, 'style': 9, 'subject': 8, 'right': 8, 'account': 8, 'want': 8, 'question': 8, 'point': 8, 'review': 8, 'believ': 8, 'also': 8, 'list': 8, 'admin': 8, 'ask': 8, 'ignor': 8, 'even': 7, 'imag': 7, 'american': 7, 'well': 7, 'rather': 7, 'tri': 7, 'life': 7, 'place': 7, 'two': 7, 'contribut': 7, 'design': 7, 'serious': 7, 'mean': 6, 'astor': 6, 'still': 6, 'say': 6, 'need': 6, 'interest': 6, 'histori': 6, 'anoth':

Counter({'articl': 42, 'page': 40, 'talk': 33, 'one': 31, 'wikipedia': 31, 'block': 28, 'use': 22, 'pleas': 20, 'edit': 19, 'like': 18, 'would': 17, 'think': 17, 'get': 16, 'see': 15, 'make': 14, 'fuck': 14, 'sourc': 14, 'i': 13, 'thank': 13, 'discuss': 13, 'ad': 12, 'know': 12, 'read': 12, 'help': 12, 'user': 11, 'request': 11, 'thing': 11, 'time': 11, 've': 11, 'take': 11, 'may': 11, 'work': 11, 'member': 11, 'remov': 10, 'reason': 10, 'person': 10, 'peopl': 10, 'refer': 10, 'write': 10, 'made': 9, 'provid': 9, 'good': 9, 'go': 9, 'top': 9, 'honorari': 9, 'delet': 9, 'civil': 9, 'style': 9, 'subject': 8, 'right': 8, 'account': 8, 'want': 8, 'question': 8, 'point': 8, 'review': 8, 'believ': 8, 'also': 8, 'list': 8, 'admin': 8, 'ask': 8, 'ignor': 8, 'even': 7, 'imag': 7, 'american': 7, 'well': 7, 'rather': 7, 'tri': 7, 'life': 7, 'place': 7, 'two': 7, 'contribut': 7, 'design': 7, 'serious': 7, 'mean': 6, 'astor': 6, 'still': 6, 'say': 6, 'need': 6, 'interest': 6, 'histori': 6, 'anoth':

Counter({'articl': 42, 'page': 40, 'talk': 33, 'one': 31, 'wikipedia': 31, 'block': 28, 'use': 22, 'pleas': 20, 'edit': 19, 'like': 18, 'would': 17, 'think': 17, 'get': 16, 'see': 15, 'make': 14, 'fuck': 14, 'sourc': 14, 'i': 13, 'thank': 13, 'discuss': 13, 'ad': 12, 'know': 12, 'read': 12, 'help': 12, 'user': 11, 'request': 11, 'thing': 11, 'time': 11, 've': 11, 'take': 11, 'may': 11, 'work': 11, 'member': 11, 'remov': 10, 'reason': 10, 'person': 10, 'peopl': 10, 'refer': 10, 'write': 10, 'made': 9, 'provid': 9, 'good': 9, 'go': 9, 'top': 9, 'honorari': 9, 'delet': 9, 'civil': 9, 'style': 9, 'subject': 8, 'right': 8, 'account': 8, 'want': 8, 'question': 8, 'point': 8, 'review': 8, 'believ': 8, 'also': 8, 'list': 8, 'admin': 8, 'ask': 8, 'ignor': 8, 'even': 7, 'imag': 7, 'american': 7, 'well': 7, 'rather': 7, 'tri': 7, 'life': 7, 'place': 7, 'two': 7, 'contribut': 7, 'design': 7, 'serious': 7, 'mean': 6, 'astor': 6, 'still': 6, 'say': 6, 'need': 6, 'interest': 6, 'histori': 6, 'anoth':

Counter({'articl': 42, 'page': 40, 'talk': 33, 'one': 31, 'wikipedia': 31, 'block': 28, 'use': 22, 'pleas': 20, 'edit': 19, 'like': 18, 'would': 17, 'think': 17, 'get': 16, 'see': 15, 'make': 14, 'fuck': 14, 'sourc': 14, 'i': 13, 'thank': 13, 'discuss': 13, 'ad': 12, 'know': 12, 'read': 12, 'help': 12, 'user': 11, 'request': 11, 'thing': 11, 'time': 11, 've': 11, 'take': 11, 'may': 11, 'work': 11, 'member': 11, 'remov': 10, 'reason': 10, 'person': 10, 'peopl': 10, 'refer': 10, 'write': 10, 'made': 9, 'provid': 9, 'good': 9, 'go': 9, 'top': 9, 'honorari': 9, 'delet': 9, 'civil': 9, 'style': 9, 'subject': 8, 'right': 8, 'account': 8, 'want': 8, 'question': 8, 'point': 8, 'review': 8, 'believ': 8, 'also': 8, 'list': 8, 'admin': 8, 'ask': 8, 'ignor': 8, 'even': 7, 'imag': 7, 'american': 7, 'well': 7, 'rather': 7, 'tri': 7, 'life': 7, 'place': 7, 'two': 7, 'contribut': 7, 'design': 7, 'serious': 7, 'mean': 6, 'astor': 6, 'still': 6, 'say': 6, 'need': 6, 'interest': 6, 'histori': 6, 'anoth':

Counter({'articl': 42, 'page': 40, 'talk': 33, 'one': 31, 'wikipedia': 31, 'block': 28, 'use': 22, 'pleas': 20, 'edit': 19, 'like': 18, 'would': 17, 'think': 17, 'get': 16, 'see': 15, 'make': 14, 'fuck': 14, 'sourc': 14, 'i': 13, 'thank': 13, 'discuss': 13, 'ad': 12, 'know': 12, 'read': 12, 'help': 12, 'user': 11, 'request': 11, 'thing': 11, 'time': 11, 've': 11, 'take': 11, 'may': 11, 'work': 11, 'member': 11, 'remov': 10, 'reason': 10, 'person': 10, 'peopl': 10, 'refer': 10, 'write': 10, 'made': 9, 'provid': 9, 'good': 9, 'go': 9, 'top': 9, 'honorari': 9, 'delet': 9, 'civil': 9, 'style': 9, 'subject': 8, 'right': 8, 'account': 8, 'want': 8, 'question': 8, 'point': 8, 'review': 8, 'believ': 8, 'also': 8, 'list': 8, 'admin': 8, 'ask': 8, 'ignor': 8, 'even': 7, 'imag': 7, 'american': 7, 'well': 7, 'rather': 7, 'tri': 7, 'life': 7, 'place': 7, 'two': 7, 'contribut': 7, 'design': 7, 'serious': 7, 'mean': 6, 'astor': 6, 'still': 6, 'say': 6, 'need': 6, 'interest': 6, 'histori': 6, 'anoth':

Counter({'articl': 42, 'page': 40, 'talk': 33, 'one': 31, 'wikipedia': 31, 'block': 28, 'use': 22, 'pleas': 20, 'edit': 19, 'like': 18, 'would': 17, 'think': 17, 'get': 16, 'see': 15, 'make': 14, 'fuck': 14, 'sourc': 14, 'i': 13, 'thank': 13, 'discuss': 13, 'ad': 12, 'know': 12, 'read': 12, 'help': 12, 'user': 11, 'request': 11, 'thing': 11, 'time': 11, 've': 11, 'take': 11, 'may': 11, 'work': 11, 'member': 11, 'remov': 10, 'reason': 10, 'person': 10, 'peopl': 10, 'refer': 10, 'write': 10, 'made': 9, 'provid': 9, 'good': 9, 'go': 9, 'top': 9, 'honorari': 9, 'delet': 9, 'civil': 9, 'style': 9, 'subject': 8, 'right': 8, 'account': 8, 'want': 8, 'question': 8, 'point': 8, 'review': 8, 'believ': 8, 'also': 8, 'list': 8, 'admin': 8, 'ask': 8, 'ignor': 8, 'even': 7, 'imag': 7, 'american': 7, 'well': 7, 'rather': 7, 'tri': 7, 'life': 7, 'place': 7, 'two': 7, 'contribut': 7, 'design': 7, 'serious': 7, 'mean': 6, 'astor': 6, 'still': 6, 'say': 6, 'need': 6, 'interest': 6, 'histori': 6, 'anoth':

Counter({'articl': 42, 'page': 40, 'talk': 33, 'one': 31, 'wikipedia': 31, 'block': 28, 'use': 22, 'pleas': 20, 'edit': 19, 'like': 18, 'would': 17, 'think': 17, 'get': 16, 'see': 15, 'make': 14, 'fuck': 14, 'sourc': 14, 'i': 13, 'thank': 13, 'discuss': 13, 'ad': 12, 'know': 12, 'read': 12, 'help': 12, 'user': 11, 'request': 11, 'thing': 11, 'time': 11, 've': 11, 'take': 11, 'may': 11, 'work': 11, 'member': 11, 'remov': 10, 'reason': 10, 'person': 10, 'peopl': 10, 'refer': 10, 'write': 10, 'made': 9, 'provid': 9, 'good': 9, 'go': 9, 'top': 9, 'honorari': 9, 'delet': 9, 'civil': 9, 'style': 9, 'subject': 8, 'right': 8, 'account': 8, 'want': 8, 'question': 8, 'point': 8, 'review': 8, 'believ': 8, 'also': 8, 'list': 8, 'admin': 8, 'ask': 8, 'ignor': 8, 'even': 7, 'imag': 7, 'american': 7, 'well': 7, 'rather': 7, 'tri': 7, 'life': 7, 'place': 7, 'two': 7, 'contribut': 7, 'design': 7, 'serious': 7, 'mean': 6, 'astor': 6, 'still': 6, 'say': 6, 'need': 6, 'interest': 6, 'histori': 6, 'anoth':

Counter({'articl': 42, 'page': 40, 'talk': 33, 'one': 31, 'wikipedia': 31, 'block': 28, 'use': 22, 'pleas': 20, 'edit': 19, 'like': 18, 'would': 17, 'think': 17, 'get': 16, 'see': 15, 'make': 14, 'fuck': 14, 'sourc': 14, 'i': 13, 'thank': 13, 'discuss': 13, 'ad': 12, 'know': 12, 'read': 12, 'help': 12, 'user': 11, 'request': 11, 'thing': 11, 'time': 11, 've': 11, 'take': 11, 'may': 11, 'work': 11, 'member': 11, 'remov': 10, 'reason': 10, 'person': 10, 'peopl': 10, 'refer': 10, 'write': 10, 'made': 9, 'provid': 9, 'good': 9, 'go': 9, 'top': 9, 'honorari': 9, 'delet': 9, 'civil': 9, 'style': 9, 'subject': 8, 'right': 8, 'account': 8, 'want': 8, 'question': 8, 'point': 8, 'review': 8, 'believ': 8, 'also': 8, 'list': 8, 'admin': 8, 'ask': 8, 'ignor': 8, 'even': 7, 'imag': 7, 'american': 7, 'well': 7, 'rather': 7, 'tri': 7, 'life': 7, 'place': 7, 'two': 7, 'contribut': 7, 'design': 7, 'serious': 7, 'mean': 6, 'astor': 6, 'still': 6, 'say': 6, 'need': 6, 'interest': 6, 'histori': 6, 'anoth':

Counter({'articl': 42, 'page': 40, 'talk': 33, 'one': 31, 'wikipedia': 31, 'block': 28, 'use': 22, 'pleas': 20, 'edit': 19, 'like': 18, 'would': 17, 'think': 17, 'get': 16, 'see': 15, 'make': 14, 'fuck': 14, 'sourc': 14, 'i': 13, 'thank': 13, 'discuss': 13, 'ad': 12, 'know': 12, 'read': 12, 'help': 12, 'user': 11, 'request': 11, 'thing': 11, 'time': 11, 've': 11, 'take': 11, 'may': 11, 'work': 11, 'member': 11, 'remov': 10, 'reason': 10, 'person': 10, 'peopl': 10, 'refer': 10, 'write': 10, 'made': 9, 'provid': 9, 'good': 9, 'go': 9, 'top': 9, 'honorari': 9, 'delet': 9, 'civil': 9, 'style': 9, 'subject': 8, 'right': 8, 'account': 8, 'want': 8, 'question': 8, 'point': 8, 'review': 8, 'believ': 8, 'also': 8, 'list': 8, 'admin': 8, 'ask': 8, 'ignor': 8, 'even': 7, 'imag': 7, 'american': 7, 'well': 7, 'rather': 7, 'tri': 7, 'life': 7, 'place': 7, 'two': 7, 'contribut': 7, 'design': 7, 'serious': 7, 'mean': 6, 'astor': 6, 'still': 6, 'say': 6, 'need': 6, 'interest': 6, 'histori': 6, 'anoth':

In [387]:
np.array(X_train_bow[0])

array([ 2., 42.,  2., ...,  1.,  1.,  1.], dtype=float32)

In [366]:
X_test_bow

array([[ 2., 42.,  2., ...,  1.,  1.,  1.],
       [ 2., 42.,  2., ...,  1.,  1.,  1.],
       [ 2., 42.,  2., ...,  1.,  1.,  1.],
       ...,
       [ 2., 42.,  2., ...,  1.,  1.,  1.],
       [ 2., 42.,  2., ...,  1.,  1.,  1.],
       [ 2., 42.,  2., ...,  1.,  1.,  1.]], dtype=float32)

In [367]:
k_max = len(set(' '.join(texts_train).split()))
assert X_train_bow.shape == (len(texts_train), min(k, k_max))
assert X_test_bow.shape == (len(texts_test), min(k, k_max))
assert np.all(X_train_bow[5:10].sum(-1) == np.array([len(s.split()) for s in  texts_train[5:10]]))
assert len(bow_vocabulary) <= min(k, k_max)
assert X_train_bow[6, bow_vocabulary.index('.')] == texts_train[6].split().count('.')

AssertionError: 

Machine learning stuff: fit, predict, evaluate. You know the drill.

In [394]:
from sklearn.linear_model import LogisticRegression

X_tr = X_train_bow[0].reshape(-1, 1)

print(X_tr[:90])

log_reg1 = LogisticRegression(C=1, penalty='l1', solver='liblinear', random_state=2).fit(X_tr[:90], y_train[:90])


print("Accuracy on Train dataset: ", log_reg1.predict(X_tr[:90]))
# print("Accuracy on Test dataset: ", accuracy_score(Y_test, log_reg1.predict(X_test_bow)))

[[ 2.]
 [42.]
 [ 2.]
 [ 3.]
 [ 3.]
 [ 8.]
 [ 7.]
 [ 3.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 3.]
 [ 3.]
 [ 5.]
 [12.]
 [ 5.]
 [10.]
 [19.]
 [ 5.]
 [ 9.]
 [ 3.]
 [20.]
 [ 9.]
 [ 1.]
 [10.]
 [ 7.]
 [ 2.]
 [ 3.]
 [22.]
 [ 1.]
 [ 6.]
 [ 1.]
 [ 1.]
 [ 2.]
 [ 1.]
 [ 4.]
 [ 2.]
 [ 1.]
 [ 1.]
 [ 1.]
 [18.]
 [31.]
 [ 1.]
 [ 2.]
 [ 3.]
 [17.]
 [14.]
 [ 7.]
 [ 1.]
 [10.]
 [ 2.]
 [ 1.]
 [ 4.]
 [ 3.]
 [14.]
 [ 3.]
 [ 1.]
 [ 3.]
 [17.]
 [ 1.]
 [ 1.]
 [ 2.]
 [ 3.]
 [ 3.]
 [ 1.]
 [ 4.]
 [ 1.]
 [ 1.]
 [ 2.]
 [ 1.]
 [ 2.]
 [12.]
 [ 7.]
 [ 1.]
 [ 9.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 2.]
 [ 3.]
 [ 1.]
 [ 1.]
 [ 1.]
 [11.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 8.]]
Accuracy on Train dataset:  [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]


In [384]:
log_reg1.predict_proba(X_train_bow)[:,1]

array([0.52194444, 0.52194444, 0.52194444, 0.52194444, 0.52194444,
       0.52194444, 0.52194444, 0.52194444, 0.52194444, 0.52194444,
       0.52194444, 0.52194444, 0.52194444, 0.52194444, 0.52194444,
       0.52194444, 0.52194444, 0.52194444, 0.52194444, 0.52194444,
       0.52194444, 0.52194444, 0.52194444, 0.52194444, 0.52194444,
       0.52194444, 0.52194444, 0.52194444, 0.52194444, 0.52194444,
       0.52194444, 0.52194444, 0.52194444, 0.52194444, 0.52194444,
       0.52194444, 0.52194444, 0.52194444, 0.52194444, 0.52194444,
       0.52194444, 0.52194444, 0.52194444, 0.52194444, 0.52194444,
       0.52194444, 0.52194444, 0.52194444, 0.52194444, 0.52194444,
       0.52194444, 0.52194444, 0.52194444, 0.52194444, 0.52194444,
       0.52194444, 0.52194444, 0.52194444, 0.52194444, 0.52194444,
       0.52194444, 0.52194444, 0.52194444, 0.52194444, 0.52194444,
       0.52194444, 0.52194444, 0.52194444, 0.52194444, 0.52194444,
       0.52194444, 0.52194444, 0.52194444, 0.52194444, 0.52194

In [386]:
from sklearn.metrics import roc_auc_score, roc_curve

for name, X, y, model in [
    ('train', X_train_bow, y_train, bow_model),
    ('test ', X_test_bow, y_test, bow_model)
]:
    proba = model.predict_proba(X)[:, 1]
    auc = roc_auc_score(y, proba)
    plt.plot(*roc_curve(y, proba)[:2], label='%s AUC=%.4f' % (name, auc))

plt.plot([0, 1], [0, 1], '--', color='black',)
plt.legend(fontsize='large')
plt.grid()

ValueError: Found input variables with inconsistent numbers of samples: [500, 90]

### Solving it better: word vectors

Let's try another approach: instead of counting per-word frequencies, we shall map all words to pre-trained word vectors and average over them to get text features.

This should give us two key advantages: (1) we now have 10^2 features instead of 10^4 and (2) our model can generalize to word that are not in training dataset.

We begin with a standard approach with pre-trained word vectors. However, you may also try
* training embeddings from scratch on relevant (unlabeled) data
* multiplying word vectors by inverse word frequency in dataset (like tf-idf).
* concatenating several embeddings
    * call `gensim.downloader.info()['models'].keys()` to get a list of available models
* clusterizing words by their word-vectors and try bag of cluster_ids

__Note:__ loading pre-trained model may take a while. It's a perfect opportunity to refill your cup of tea/coffee and grab some extra cookies. Or binge-watch some tv series if you're slow on internet connection

In [16]:
# !pip install python-Levenshtein

In [276]:
import gensim.downloader 
embeddings = gensim.downloader.load("fasttext-wiki-news-subwords-300")

# If you're low on RAM or download speed, use "glove-wiki-gigaword-100" instead. Ignore all further asserts.

KeyboardInterrupt: 

In [277]:
def vectorize_sum(comment, embedding_dim):
    """
    implement a function that converts preprocessed comment to a sum of token vectors
    """
#     embedding_dim = embeddings.wv.vectors.shape[1] # AttributeError: 'KeyedVectors' object has no attribute 'wv'
    keys = embeddings.index_to_key
    does_not_exist = np.zeros([embedding_dim], dtype='float32')
    features = sum([embeddings.get_vector(word) if word in keys else does_not_exist for word in comment.split()])
    
    return features

assert np.allclose(
    vectorize_sum(comment="who cares anymore . they attack with impunity .", embedding_dim=300)[::70],
    np.array([ 0.0108616 ,  0.0261663 ,  0.13855131, -0.18510573, -0.46380025])
)

In [278]:
extra = "'"
preprocess = lambda text: ' '.join([word.split(extra)[0] if extra in word else word for word in text.split()])
        
texts_test = np.array([preprocess(text) for text in texts_test])
texts_train = np.array([preprocess(text) for text in texts_train])

In [279]:
X_train_wv = np.stack([vectorize_sum(comment=text, embedding_dim=300) for text in texts_train])
X_test_wv = np.stack([vectorize_sum(comment=text, embedding_dim=300) for text in texts_test])

In [282]:
wv_model = LogisticRegression(max_iter=1000).fit(X_train_wv, y_train)

for name, X, y, model in [
    ('bow train', X_train_bow, y_train, bow_model),
    ('bow test ', X_test_bow, y_test, bow_model),
    ('vec train', X_train_wv, y_train, wv_model),
    ('vec test ', X_test_wv, y_test, wv_model)
]:
    proba = model.predict_proba(X)[:, 1]
    auc = roc_auc_score(y, proba)
    plt.plot(*roc_curve(y, proba)[:2], label='%s AUC=%.4f' % (name, auc))

plt.plot([0, 1], [0, 1], '--', color='black',)
plt.legend(fontsize='large')
plt.grid()

assert roc_auc_score(y_test, wv_model.predict_proba(X_test_wv)[:, 1]) > 0.92, "something's wrong with your features"

ValueError: Found input variables with inconsistent numbers of samples: [90, 500]

If everything went right, you've just managed to reduce misclassification rate by a factor of two.
This trick is very useful when you're dealing with small datasets. However, if you have hundreds of thousands of samples, there's a whole different range of methods for that. We'll get there in the second part.