<a href="https://colab.research.google.com/github/RaminParker/Text-Classification-with-German-dataset/blob/master/Text_MultiClassification_German.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Sentiment Classification with Naive Bayes

This notebook uses the techniques presented in the following videos:
- [video 1](https://www.youtube.com/watch?v=hp2ipC5pW4I&list=PLtmWHNX-gukKocXQOkQjuVxglSDYWsSh9&index=4)
- [video 2](https://www.youtube.com/watch?v=dt7sArnLo1g&list=PLtmWHNX-gukKocXQOkQjuVxglSDYWsSh9&index=5)

I am following the same basic steps.
However, many steps had to be adapted to fit this particular use case. Further resources that I used are linked in the text below.

## Links

- [Using FastAI’s ULMFiT to make a state-of-the-art multi-class text classifier](https://medium.com/technonerds/using-fastais-ulmfit-to-make-a-state-of-the-art-multi-label-text-classifier-bf54e2943e83)
- [Using the fastai Data Block API](https://medium.com/@tmckenzie.nz/using-the-fastai-data-block-api-b4818e72155b)

In [0]:
from fastai import *
from fastai.text import *
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split

In [0]:
import sklearn.feature_extraction.text as sklearn_text

In [0]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline

In [571]:
from google.colab import drive
drive.mount('/content/gdrive')

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


## Import data

In [0]:
path = 'gdrive/My Drive/Arbeit/TeamBank_Work/'

In [0]:
#nrows = None     # read all rows of dataset
nrows = 4000

df = pd.read_csv(path + 'GermanArticles.csv', sep='\t', names=['text'], header=None, nrows=nrows)

In [574]:
df.head()

Unnamed: 0,text
0,Etat;Die ARD-Tochter Degeto hat sich verpflich...
1,Etat;App sei nicht so angenommen worden wie ge...
2,Etat;'Zum Welttag der Suizidprävention ist es ...
3,Etat;Mitarbeiter überreichten Eigentümervertre...
4,Etat;Service: Jobwechsel in der Kommunikations...


## Bring dataset into the correct form

### Change columns

In [0]:
# drop rows with missing values to avoid errors
df.dropna(inplace=True)

In [0]:
# split each row at the first ";" into two columns: label and article text
new = df["text"].str.split(";", n=1, expand=True)  # expand=True returns the parts as separate columns

In [577]:
new.head(2)

Unnamed: 0,0,1
0,Etat,"Die ARD-Tochter Degeto hat sich verpflichtet, ..."
1,Etat,App sei nicht so angenommen worden wie geplant...
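
To make the effect of `n=1` concrete, here is a minimal sketch on two made-up rows (the sample strings and the name `toy` are invented for illustration). Only the first semicolon acts as the separator, so semicolons inside the article text stay intact.

```python
import pandas as pd

# Two invented rows in the same "label;content" layout as the dataset
toy = pd.DataFrame({"text": ["Etat;Erster Satz; zweiter Satz", "Kultur;Nur ein Satz"]})

# n=1 splits at the first ";" only; expand=True returns a DataFrame
parts = toy["text"].str.split(";", n=1, expand=True)
labels = parts[0].tolist()
contents = parts[1].tolist()
print(labels)
print(contents)
```

Without `n=1`, the first row would be split into three columns and part of the article text would be lost when only columns 0 and 1 are kept.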


In [0]:
# the first part becomes the label column
df["label"] = new[0]

# the second part becomes the content column
df["content"] = new[1]

# drop the original combined column
df.drop(columns=["text"], inplace=True)

In [579]:
df.head(2)

Unnamed: 0,label,content
0,Etat,"Die ARD-Tochter Degeto hat sich verpflichtet, ..."
1,Etat,App sei nicht so angenommen worden wie geplant...


In [0]:
df = df.rename(columns={'content': 'text'})


Let's look at the dataframe to get a sense of what our data looks like:

In [581]:
df.head()

Unnamed: 0,label,text
0,Etat,"Die ARD-Tochter Degeto hat sich verpflichtet, ..."
1,Etat,App sei nicht so angenommen worden wie geplant...
2,Etat,"'Zum Welttag der Suizidprävention ist es Zeit,..."
3,Etat,Mitarbeiter überreichten Eigentümervertretern ...
4,Etat,Service: Jobwechsel in der Kommunikationsbranc...


In [582]:
df['label'].nunique()

5

In [583]:
df.label.unique()

array(['Etat', 'Inland', 'International', 'Kultur', 'Panorama'], dtype=object)

## Add new column

We add a new column that marks each row as belonging to the train or the test part. To do this, we first split the dataframe into two parts. Afterwards, the first part gets the value "True", whereas the second part gets the value "False".

In [0]:
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42) # split dataframe into two parts

In [585]:
train_df.head()

Unnamed: 0,label,text
3994,Panorama,Frauen kamen mit milden Strafen davon – Urteil...
423,Etat,"""Österreicher wählen eben so, wie sie es vom S..."
2991,International,Zwei Palästinenser erschossen und vier Israeli...
1221,Inland,88 Prozent der Einnahmen kommen von den 300.00...
506,Etat,Das Duo war ursprünglich als einmaliges Feiert...


Check that each part contains all labels!

In [586]:
train_df['label'].value_counts() # check if df contains all labels 

International    1226
Inland            803
Etat              521
Kultur            436
Panorama          214
Name: label, dtype: int64

In [587]:
len(train_df)

3200

In [588]:
len(train_df['label'].unique()) # number of unique labels

5

In [589]:
test_df['label'].value_counts() # check if df contains all labels 

International    285
Inland           212
Etat             147
Kultur           103
Panorama          53
Name: label, dtype: int64

In [590]:
len(test_df)

800

In [591]:
len(test_df['label'].unique()) # number of unique labels

5

We want all labels in each of the two dataframes, so the following check should return `True`:

In [592]:
len(train_df['label'].unique())  == len(test_df['label'].unique()) 

True
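
The check above passes here, but a purely random split does not guarantee it for rare labels. A possible safeguard, sketched below on invented toy data, is the `stratify` parameter of `train_test_split`, which keeps the label proportions roughly equal in both parts. The dataframe `toy` and its labels are assumptions made for this example only.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Invented toy data: 3 labels with unequal frequencies
toy = pd.DataFrame({
    "label": ["A"] * 10 + ["B"] * 5 + ["C"] * 5,
    "text": [f"doc {i}" for i in range(20)],
})

# stratify keeps the label proportions (approximately) equal in both parts
toy_train, toy_test = train_test_split(
    toy, test_size=0.2, random_state=42, stratify=toy["label"]
)
same_labels = set(toy_train["label"]) == set(toy_test["label"])
print(same_labels)
```

Stratification needs at least two rows per label, so extremely rare labels would still have to be handled separately.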

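Add a new column `is_valid` to each dataframe to mark which part it belongs to. Note that fastai's `split_from_df` (used further below) puts the rows where this column is true into the validation set, which is why the lengths reported later are 800 training and 3200 validation documents.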

In [593]:
train_df['is_valid']='True'   # add new column

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.


In [594]:
test_df['is_valid']='False'   # add new column

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.


In [595]:
test_df.head()

Unnamed: 0,label,text,is_valid
555,Etat,"'Linda Hamiltons Muskelspiel in ""Terminator 2""...",False
3491,Kultur,Medienkunst-Preis ging an UBERMORGEN – lizvlx ...,False
527,Etat,Heute konkret | Wien-Wahl 2015. Diskussion der...,False
3925,Panorama,Trotz schlechter Verhältnisse weniger tödliche...,False
2989,International,Erneut mutmaßlicher palästinensischer Angreife...,False
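
The warnings printed above come from assigning a column to a dataframe that pandas regards as a slice of another one. A common remedy, sketched here on invented data, is to take an explicit `.copy()` of each part before adding the column. The names `part_a` and `part_b` and the choice of which part is the validation set are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({"label": ["A", "B", "A", "B"], "text": ["w", "x", "y", "z"]})

# Taking an explicit copy makes each part an independent DataFrame,
# so adding a column no longer touches a slice of `toy`.
part_a = toy.iloc[:3].copy()
part_b = toy.iloc[3:].copy()

part_a["is_valid"] = False   # hypothetical choice: first part is training data
part_b["is_valid"] = True    # hypothetical choice: second part is validation data

combined = pd.concat([part_a, part_b])
print(combined)
```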


In [0]:
frames=[train_df, test_df]

In [0]:
df=pd.concat(frames)

In [598]:
df.head()

Unnamed: 0,label,text,is_valid
3994,Panorama,Frauen kamen mit milden Strafen davon – Urteil...,True
423,Etat,"""Österreicher wählen eben so, wie sie es vom S...",True
2991,International,Zwei Palästinenser erschossen und vier Israeli...,True
1221,Inland,88 Prozent der Einnahmen kommen von den 300.00...,True
506,Etat,Das Duo war ursprünglich als einmaliges Feiert...,True


### Add column

In case we need a column that specifies the split ratio. Not sure if we really need it yet!

In [0]:
# df['is_valid']='False'

In [0]:
# len(df)

In [0]:
# ratio = 0.3 # split ratio within the new column is_valid

# ratio_1=round(len(df) * ratio)
# ratio_2=len(df) - ratio_1

# print('Total length: ', len(df))
# print('First ratio: ', ratio_1 )
# print('Second ratio: ', ratio_2 )

In [0]:
# ratio_1 + ratio_2 == len(df)

In [0]:
# df.loc[0:ratio_1, 'is_valid'] = 'True'  # replace value in column

## Save data

In [0]:
df.to_csv(path  + 'texts.csv', sep=',', encoding='utf-8', index=False) # save data to csv

In [0]:
fields = ['label', 'text'] # only read these columns in the next step

In [606]:
df = pd.read_csv(path + 'texts.csv', sep=',', usecols=fields ) # quick check: read csv
df.head()

Unnamed: 0,label,text
0,Panorama,Frauen kamen mit milden Strafen davon – Urteil...
1,Etat,"""Österreicher wählen eben so, wie sie es vom S..."
2,International,Zwei Palästinenser erschossen und vier Israeli...
3,Inland,88 Prozent der Einnahmen kommen von den 300.00...
4,Etat,Das Duo war ursprünglich als einmaliges Feiert...
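
One detail of this round trip is easy to miss: the `is_valid` column was filled with the strings 'True' and 'False', and pandas should parse such a column back as booleans when the CSV is read in full. A minimal sketch with invented rows, using an in-memory buffer instead of a file on Drive:

```python
import io
import pandas as pd

toy = pd.DataFrame({
    "label": ["A", "B"],
    "text": ["erster text", "zweiter text"],
    "is_valid": ["True", "False"],   # strings, as in the cells above
})

buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

restored = pd.read_csv(buffer)
print(restored.dtypes)
is_bool = restored["is_valid"].dtype == bool
```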


## Split data

Try to avoid this error: “your validation data contains a label that isn’t present in your training set”.

This is a problem because, during validation, the model cannot predict a label it has never seen in training.

- [Link to forum](https://forums.fast.ai/t/tabulardatabunch-error-your-validation-data-contains-a-label-that-isnt-present-in-the-training-set-please-fix-your-data/33410)

In [0]:
classes = df['label'].unique()
classes.sort()

In [608]:
classes # array of all labels

array(['Etat', 'Inland', 'International', 'Kultur', 'Panorama'], dtype=object)

We will be using TextList from the fastai library:

In [609]:
article_reviews = (TextList.from_csv(path, 'texts.csv', cols='text')
                         .split_from_df(col=2)
                         .label_from_df(cols=0, classes=classes))

In [610]:
len(article_reviews.classes)

5

## Exploring what our data looks like

In [611]:
article_reviews.valid.x[0], article_reviews.valid.y[0]

(Text xxbos xxmaj frauen kamen mit xxunk xxmaj strafen davon – xxmaj urteil nicht rechtskräftig . xxmaj graz – xxmaj mit teilweise xxunk xxunk xxmaj xxunk ist am xxmaj montag der xxmaj prozess gegen sechs xxmaj xxunk zu xxmaj ende gegangen . xxmaj drei xxmaj männer und eine xxmaj frau wurden der terroristischen xxmaj vereinigung für schuldig befunden , zwei xxmaj frauen wurden wegen xxmaj xxunk verurteilt . xxmaj die xxmaj männer wurden zu fünf und sechs xxmaj jahren xxmaj haft verurteilt , die xxmaj frauen kamen mit drei , fünf und 15 xxmaj monaten , großteils bedingt , davon . xxmaj die höchste xxmaj strafe , sechs xxmaj jahre unbedingt , wurde über einen xxunk verhängt , der als xxmaj xxunk in einer xxmaj grazer xxmaj moschee tätig war . xxmaj er habe durch seine xxmaj predigten xxmaj männer xxunk , nach xxmaj syrien zu gehen , wenn es ihm auch nicht in allen angeklagten xxmaj fällen nachgewiesen werden habe können , hieß es in der xxmaj xxunk . xxmaj bei den beiden xxunk xxmaj männ

Here, the tokens mostly correspond to words or punctuation, plus several special tokens for unknown words, capitalization, etc.

All tokens starting with "xx" are fastai special tokens. The full list and the tokenization rules are in [the fastai docs](https://docs.fast.ai/text.transform.html). Here is the meaning of the special tokens:

- UNK (xxunk) is for an unknown word (one that isn't present in the current vocabulary)
- PAD (xxpad) is the token used for padding, if we need to regroup several texts of different lengths in a batch
- BOS (xxbos) represents the beginning of a text in your dataset
- FLD (xxfld) is used if you set mark_fields=True in your TokenizeProcessor to separate the different fields of texts (if your texts are loaded from several columns in a dataframe)
- TK_MAJ (xxmaj) is used to indicate the next word begins with a capital in the original text
- TK_UP (xxup) is used to indicate the next word is written in all caps in the original text
- TK_REP (xxrep) is used to indicate the next character is repeated n times in the original text (usage xxrep n {char})
- TK_WREP (xxwrep) is used to indicate the next word is repeated n times in the original text (usage xxwrep n {word})

In [612]:
len(article_reviews.train.x), len(article_reviews.valid.x)

(800, 3200)

Notice that int-to-string (`itos`) and string-to-int (`stoi`) have different lengths.

Reason:
Several words can map to the same index. For example, all unknown words (words too rare to be included in the vocabulary) map to the index of xxunk. Conversely, a word that occurs many times in the corpus still has only one integer.

In [613]:
len(article_reviews.vocab.itos), len(article_reviews.vocab.stoi)


(11064, 109814)
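
The size difference can be reproduced with a toy vocabulary. The sketch below assumes that `stoi` behaves like a `defaultdict(int)`, so looking up a word that is not in the vocabulary returns 0 and also stores that word as a new key. The corpus, the `min_freq` threshold and all names are invented for illustration and are not taken from fastai.

```python
from collections import Counter, defaultdict

# Invented mini-corpus; tokens seen fewer than `min_freq` times become unknown
tokens = "der hund und die katze und der vogel".split()
min_freq = 2  # assumed threshold for this sketch

counts = Counter(tokens)
itos = ["xxunk"] + sorted(t for t, c in counts.items() if c >= min_freq)
stoi = defaultdict(int, {tok: i for i, tok in enumerate(itos)})

size_before = len(stoi)
ids = [stoi[t] for t in tokens]   # looking up a missing key inserts it with value 0
size_after = len(stoi)

print(itos)
print(ids)
print(size_before, size_after, len(itos))
```

After numericalizing the corpus, `stoi` contains every word it was ever asked about, while `itos` still holds only the actual vocabulary.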

In [614]:
pos = article_reviews.vocab.stoi['gleichstellung']  # string to int
pos

6699

In [615]:
article_reviews.vocab.itos[pos]  # int to string

'gleichstellung'

int to str is a list:

In [616]:
article_reviews.vocab.itos[0:20]  # special tokens, followed by the most frequently used strings

['xxunk',
 'xxpad',
 'xxbos',
 'xxeos',
 'xxfld',
 'xxmaj',
 'xxup',
 'xxrep',
 'xxwrep',
 '.',
 ',',
 'der',
 'die',
 'und',
 'in',
 '-',
 'das',
 'den',
 'von',
 'mit']

string to int is a dictionary:

In [0]:
# article_reviews.vocab.stoi  

Let's test that a non-word maps to xxunk:

In [618]:
article_reviews.vocab.itos[article_reviews.vocab.stoi['Poooolitik']]  


'xxunk'

## Creating our term-document matrix

As covered in the second and third lesson, a term-document matrix represents a document as a "bag of words". That means it doesn't keep track of the order the words are in, just which words occur (and how often).

You can use sklearn's CountVectorizer (as in previous lessons) or create your own (similar) version. We create our own for two reasons:

- to understand what sklearn is doing underneath the hood
- to create something that will work with a fastai TextList

To create our term-document matrix, we first need to learn about counters and sparse matrices.


![alt text](https://res.cloudinary.com/dyd911kmh/image/upload/f_auto,q_auto:best/v1535125878/NLTK5_obakq5.png)

### Counters
Here is how they work:

In [0]:
c = Counter([4,2,8,8,4,8])


In [620]:
c

Counter({2: 1, 4: 2, 8: 3})

In [621]:
c.values()


dict_values([2, 1, 3])

In [622]:
c.keys()


dict_keys([4, 2, 8])
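
The same class works on string tokens, which is closer to how it is used for a bag of words. A small sketch with invented tokens that shows two properties used later: counts can be read per key, and a key that was never seen has count 0.

```python
from collections import Counter

tokens = ["die", "wahl", "in", "wien", "die", "wahl", "die"]  # invented tokens
c = Counter(tokens)

top_two = c.most_common(2)     # pairs sorted by descending count
missing = c["graz"]            # a key that was never counted yields 0, not a KeyError
total = sum(c.values())        # total number of counted tokens
print(top_two, missing, total)
```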

### Sparse Matrices

Even though we've reduced the number of words, we still have a lot! Most tokens don't appear in most reviews. We want to take advantage of this by storing our data as a sparse matrix.

A matrix with lots of zeros is called sparse (the opposite of sparse is dense). For sparse matrices, you can save a lot of memory by only storing the non-zero values.

These are the most common sparse storage formats:

- coordinate-wise (scipy calls it COO)
- compressed sparse row (CSR)
- compressed sparse column (CSC)

- [Here are examples](http://www.mathcs.emory.edu/~cheung/Courses/561/Syllabus/3-C/sparse.html)
- [Good explanation](https://youtu.be/hp2ipC5pW4I?t=1575)


A class of matrices (e.g, diagonal) is generally called sparse if the number of non-zero elements is proportional to the number of rows (or columns) instead of being proportional to the product rows x columns.

**Scipy Implementation**

From the Scipy Sparse Matrix Documentation:

- To construct a matrix efficiently, use either dok_matrix or lil_matrix. The lil_matrix class supports basic slicing and fancy indexing with a similar syntax to NumPy arrays. As illustrated below, the COO format may also be used to efficiently construct matrices.
- To perform manipulations such as multiplication or inversion, first convert the matrix to either CSC or CSR format.
- All conversions among the CSR, CSC, and COO formats are efficient, linear-time operations.
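
To see what the formats actually store, here is a minimal sketch on an invented 3x3 matrix. It builds a COO matrix and converts it to CSR, then prints the underlying arrays.

```python
import numpy as np
import scipy.sparse

dense = np.array([[0, 0, 3],
                  [4, 0, 0],
                  [0, 0, 5]])

coo = scipy.sparse.coo_matrix(dense)   # stores (row, col, value) triples
csr = coo.tocsr()                      # stores values, column indices, row pointers

print(coo.row, coo.col, coo.data)
print(csr.data, csr.indices, csr.indptr)
```

In CSR, `indptr[i]` and `indptr[i+1]` delimit the slice of `data` and `indices` that belongs to row `i`, so the row indices do not have to be stored for every value.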

## Our version of CountVectorizer

In [624]:
# Look at first article: we get a list of integers which are representing the article
(article_reviews.valid.x)[0].data

array([   2,    5,  269,  668, ...,  122,   24, 9174,    9])

In [625]:
# Apply counter on the article
Counter((article_reviews.valid.x)[0].data)

Counter({0: 52,
         2: 1,
         5: 161,
         6: 4,
         9: 30,
         10: 46,
         11: 21,
         12: 19,
         13: 11,
         14: 8,
         16: 5,
         17: 5,
         18: 2,
         19: 8,
         20: 8,
         21: 2,
         22: 1,
         23: 1,
         24: 4,
         25: 9,
         26: 5,
         27: 4,
         28: 4,
         29: 2,
         30: 4,
         31: 2,
         32: 4,
         33: 2,
         34: 3,
         35: 4,
         36: 2,
         37: 8,
         38: 4,
         39: 2,
         41: 1,
         42: 4,
         43: 1,
         44: 2,
         45: 9,
         46: 1,
         48: 1,
         49: 3,
         50: 1,
         51: 3,
         53: 4,
         54: 1,
         56: 5,
         57: 3,
         59: 3,
         62: 2,
         63: 1,
         64: 1,
         65: 2,
         66: 1,
         67: 2,
         69: 3,
         73: 2,
         74: 1,
         75: 2,
         78: 3,
         79: 2,
         86: 1,
     

In [626]:
# check which token the number represents:
article_reviews.vocab.itos[11]

'der'

In [627]:
# Look at first article:
(article_reviews.valid.x)[0]

Text xxbos xxmaj frauen kamen mit xxunk xxmaj strafen davon – xxmaj urteil nicht rechtskräftig . xxmaj graz – xxmaj mit teilweise xxunk xxunk xxmaj xxunk ist am xxmaj montag der xxmaj prozess gegen sechs xxmaj xxunk zu xxmaj ende gegangen . xxmaj drei xxmaj männer und eine xxmaj frau wurden der terroristischen xxmaj vereinigung für schuldig befunden , zwei xxmaj frauen wurden wegen xxmaj xxunk verurteilt . xxmaj die xxmaj männer wurden zu fünf und sechs xxmaj jahren xxmaj haft verurteilt , die xxmaj frauen kamen mit drei , fünf und 15 xxmaj monaten , großteils bedingt , davon . xxmaj die höchste xxmaj strafe , sechs xxmaj jahre unbedingt , wurde über einen xxunk verhängt , der als xxmaj xxunk in einer xxmaj grazer xxmaj moschee tätig war . xxmaj er habe durch seine xxmaj predigten xxmaj männer xxunk , nach xxmaj syrien zu gehen , wenn es ihm auch nicht in allen angeklagten xxmaj fällen nachgewiesen werden habe können , hieß es in der xxmaj xxunk . xxmaj bei den beiden xxunk xxmaj männe

Construct the term-document matrix:

[Detailed explanation](https://youtu.be/hp2ipC5pW4I?t=2688)

In [0]:
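import scipy.sparse  # make sure the sparse submodule is loaded

def get_term_doc_matrix(label_list, vocab_len):
    j_indices = []  # column (token id) of every stored count
    indptr = []     # indptr[i] is where row i starts in j_indices/values
    values = []     # the counts themselves
    indptr.append(0)

    for i, doc in enumerate(label_list):
        feature_counter = Counter(doc.data)  # token id -> count for this document
        j_indices.extend(feature_counter.keys())
        values.extend(feature_counter.values())
        indptr.append(len(j_indices))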
        
#     return (values, j_indices, indptr)

    return scipy.sparse.csr_matrix((values, j_indices, indptr),
                                   shape=(len(indptr) - 1, vocab_len),
                                   dtype=int)
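
The function above can be tried without fastai. The following sketch repeats the same construction for plain lists of token ids; the name `toy_term_doc_matrix` and the two documents are invented for this example.

```python
from collections import Counter
import scipy.sparse

def toy_term_doc_matrix(docs, vocab_len):
    """Same construction as above, but for plain lists of token ids."""
    values, col_indices, indptr = [], [], [0]
    for doc in docs:
        counts = Counter(doc)
        col_indices.extend(counts.keys())
        values.extend(counts.values())
        indptr.append(len(col_indices))   # where the next row starts
    return scipy.sparse.csr_matrix((values, col_indices, indptr),
                                   shape=(len(docs), vocab_len), dtype=int)

docs = [[2, 0, 2, 1], [3, 3, 3]]          # invented token ids, vocabulary of size 4
m = toy_term_doc_matrix(docs, vocab_len=4)
print(m.toarray())
```

Each row holds the token counts of one document, and only the non-zero counts are stored.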

In [629]:
%%time
val_term_doc = get_term_doc_matrix(article_reviews.valid.x, len(article_reviews.vocab.itos))

CPU times: user 997 ms, sys: 38.2 ms, total: 1.04 s
Wall time: 987 ms


In [630]:
%%time
trn_term_doc = get_term_doc_matrix(article_reviews.train.x, len(article_reviews.vocab.itos))

CPU times: user 255 ms, sys: 7.67 ms, total: 262 ms
Wall time: 250 ms


In [631]:
trn_term_doc.shape

(800, 11064)

In [632]:
# show m x n matrix
trn_term_doc[:10,:10]

<10x10 sparse matrix of type '<class 'numpy.int64'>'
	with 48 stored elements in Compressed Sparse Row format>

In [633]:
val_term_doc.shape

(3200, 11064)

## More data exploration

We could convert our sparse matrix to a dense matrix:

In [634]:
article_reviews.vocab.itos[10:20] # first words

[',', 'der', 'die', 'und', 'in', '-', 'das', 'den', 'von', 'mit']

In [635]:
article_reviews.vocab.itos[-1:] # last word

['xxfake']

In [636]:
val_term_doc.todense()[:10,:10]

matrix([[ 52,   0,   1,   0, ...,   4,   0,   0,  30],
        [ 17,   0,   1,   0, ...,   4,   0,   0,   8],
        [ 64,   0,   1,   0, ...,   1,   0,   0,  30],
        [ 91,   0,   1,   0, ...,   0,   0,   0,  18],
        ...,
        [ 29,   0,   1,   0, ...,   0,   0,   0,   8],
        [ 75,   0,   1,   0, ...,   2,   0,   0,  26],
        [ 25,   0,   1,   0, ...,  11,   0,   0,  16],
        [184,   0,   1,   0, ...,   2,   0,   0,  58]])

### Read term-doc-matrix (example)

In [637]:
# Look at a particular review
review = article_reviews.valid.x[1]; review

Text xxbos " xxmaj österreicher wählen eben so , wie sie es vom xxmaj xxunk kennen : möglichst xxunk und schön braun " : xxmaj tiroler xxmaj xxunk zeigte xxup zdf - xxmaj sendung an . xxmaj wien – xxmaj die xxmaj ergebnisse der xxmaj bundespräsidentenwahl und die xxunk xxmaj prozent , die xxup fpö - xxmaj kandidat xxmaj norbert xxmaj hofer an dem xxmaj wochenende erreichte , xxunk in xxmaj deutschland und xxmaj österreich xxmaj diskussionen aus . xxmaj auf der xxmaj facebook - xxmaj seite der xxup zdf - xxmaj heute - xxmaj show wurde dazu am xxmaj montag nach der xxmaj wahl ein xxmaj bild veröffentlicht , das ein xxmaj xxunk in xxmaj form eines xxmaj xxunk zeigt . xxmaj dazu der xxmaj text : xxmaj österreicher wählen eben so , wie sie es vom xxmaj xxunk kennen : möglichst xxunk und schön braun . xxmaj es seien bereits zwei xxmaj xxunk bezüglich des xxmaj xxunk bei der xxmaj staatsanwaltschaft xxmaj mainz eingetroffen – eine aus xxmaj österreich , die andere aus xxmaj deutschland , beri

Since the word "deutschland" shows up 2 times in this review, I want to confirm that a value of 2 is stored in the term-document matrix, for the row corresponding to this review and the column corresponding to the word "deutschland".

In [638]:
print(article_reviews.vocab.stoi['deutschland'])

val_term_doc[1, article_reviews.vocab.stoi['deutschland'] ]

249


2

In [639]:
val_term_doc[1].sum() # number of tokens in this review

260

In [640]:
val_term_doc[1] # 108 distinct tokens in this review

<1x11064 sparse matrix of type '<class 'numpy.int64'>'
	with 108 stored elements in Compressed Sparse Row format>

In [641]:
len(set(review.data)) # number of distinct tokens in this review

108

The review has 108 distinct tokens in it, and 260 tokens total.

In [642]:
review.data # review in integers

array([   2,   91,    5,  698, ...,  186, 1339, 8444,    9])

In [643]:
review.text # review in text

'xxbos " xxmaj österreicher wählen eben so , wie sie es vom xxmaj xxunk kennen : möglichst xxunk und schön braun " : xxmaj tiroler xxmaj xxunk zeigte xxup zdf - xxmaj sendung an . xxmaj wien – xxmaj die xxmaj ergebnisse der xxmaj bundespräsidentenwahl und die xxunk xxmaj prozent , die xxup fpö - xxmaj kandidat xxmaj norbert xxmaj hofer an dem xxmaj wochenende erreichte , xxunk in xxmaj deutschland und xxmaj österreich xxmaj diskussionen aus . xxmaj auf der xxmaj facebook - xxmaj seite der xxup zdf - xxmaj heute - xxmaj show wurde dazu am xxmaj montag nach der xxmaj wahl ein xxmaj bild veröffentlicht , das ein xxmaj xxunk in xxmaj form eines xxmaj xxunk zeigt . xxmaj dazu der xxmaj text : xxmaj österreicher wählen eben so , wie sie es vom xxmaj xxunk kennen : möglichst xxunk und schön braun . xxmaj es seien bereits zwei xxmaj xxunk bezüglich des xxmaj xxunk bei der xxmaj staatsanwaltschaft xxmaj mainz eingetroffen – eine aus xxmaj österreich , die andere aus xxmaj deutschland , berichte

In [0]:
# [article_reviews.vocab.itos[a] for a in review.data]  # review in text

stoi (string-to-int) is larger than itos (int-to-string):

In [645]:
len(article_reviews.vocab.stoi) - len(article_reviews.vocab.itos)

98751


This is because many words map to unknown (index 0). We can confirm this here:

In [0]:
unk = []
for word, num in article_reviews.vocab.stoi.items():
    if num==0:
        unk.append(word)

In [647]:
len(unk)

98752

In [648]:
unk[:30] # first 30 words that map to unknown (the list also contains 'xxunk' itself)

['xxunk',
 'hamiltons',
 'muskelspiel',
 'terminator',
 'dokus',
 'alcatraz',
 '17.30',
 'magazinbürgeranwalt',
 'resetarits',
 'gläserne',
 'hausnummernchaos',
 'gasse',
 'banksafe',
 '18.20',
 'dokumentationbaumeister',
 'porträts',
 'renner',
 'körner',
 '21.55',
 'publizisten',
 'schulmeister',
 'erschlagt',
 'verrate',
 'widerstandskämpferin',
 'käthe',
 'sasso',
 'rationalisthomo',
 'faber',
 'brd',
 'schlöndorff']

## Naive Bayes
We define the log-count ratio $r$ for each word $f$:

$r = \log \frac{\text{ratio of feature $f$ in positive documents}}{\text{ratio of feature $f$ in negative documents}}$

where ratio of feature $f$ in positive documents is the number of times a positive document has a feature divided by the number of positive documents.

In [649]:
article_reviews.y.classes

array(['Etat', 'Inland', 'International', 'Kultur', 'Panorama'], dtype=object)

In [0]:
x = trn_term_doc
y = article_reviews.train.y
val_y = article_reviews.valid.y

In [0]:
Etat = y.c2i['Etat']
Inland = y.c2i['Inland']
International = y.c2i['International']
Kultur = y.c2i['Kultur']
Panorama = y.c2i['Panorama']

In [652]:
# how often each word shows up in reviews with this particular label
np.squeeze(np.asarray(x[y.items==Etat].sum(0))) # np.squeeze removes the singleton dimension left by the matrix sum

array([6206,    0,  147,    0, ...,    0,    0,    0,    0], dtype=int64)

For each word in our vocabulary, we are summing up how many times it shows up in a review with a specific label:

In [0]:
p4 = np.squeeze(np.asarray(x[y.items==Panorama].sum(0))) # how often each word shows up in reviews with this particular label
p3 = np.squeeze(np.asarray(x[y.items==Kultur].sum(0)))
p2 = np.squeeze(np.asarray(x[y.items==International].sum(0)))
p1 = np.squeeze(np.asarray(x[y.items==Inland].sum(0)))
p0 = np.squeeze(np.asarray(x[y.items==Etat].sum(0)))

In [654]:
p1[:10]

array([10413,     0,   212,     0,     0, 33690,  1697,     2,     0,  5779], dtype=int64)

In [0]:
v = article_reviews.vocab

### Using our ratios for even more data exploration

Compare how often "kritik" appears in reviews with certain labels:

In [656]:
print('number of times the word shows up in reviews with label Etat: ', p0[v.stoi['kritik']] )
print('number of times the word shows up in reviews with label Inland: ', p1[v.stoi['kritik']] )
print('number of times the word shows up in reviews with label International: ', p2[v.stoi['kritik']] )
print('number of times the word shows up in reviews with label Kultur: ', p3[v.stoi['kritik']] )
print('number of times the word shows up in reviews with label Panorama: ', p4[v.stoi['kritik']] )

number of times the word shows up in reviews with label Etat:  15
number of times the word shows up in reviews with label Inland:  36
number of times the word shows up in reviews with label International:  30
number of times the word shows up in reviews with label Kultur:  7
number of times the word shows up in reviews with label Panorama:  1


### Reviews with label "Kultur" which have the word "kritik"

In [657]:
v.stoi['kritik'] # integer for that specific token

336

In [658]:
a = np.argwhere((x[:, v.stoi['kritik'] ] > 0))[:,0]; a  # which reviews have a non-zero value for that

array([ 40,  48,  52,  56, ..., 786, 795, 798, 799], dtype=int32)

In [659]:
b = np.argwhere(y.items==Kultur)[:,0]; b  # get all reviews with label Kultur

array([  1,  10,  15,  22, ..., 781, 783, 784, 787])

In [660]:
set(a).intersection(set(b)) # take the intersection to get all reviews with label Kultur which have the word kritik

{115, 448, 613, 638, 653, 778}

In [661]:
review = article_reviews.train.x[115]
review.text

"xxbos ' xxmaj der amerikanische xxmaj xxunk über den xxmaj unterschied zwischen uns heute und den xxmaj nazis damals . xxup standard : xxmaj es gibt umfassende xxmaj bücher über den xxmaj holocaust , manche gelten als xxmaj xxunk . xxmaj was hat xxmaj sie xxunk , ein weiteres , xxmaj black xxmaj xxunk , zu schreiben ? xxmaj snyder : xxmaj die meisten xxmaj autoren berufen sich auf deutsche xxmaj quellen , manchmal auch auf französische . xxmaj das xxmaj problem dabei , dass xxunk % der xxmaj juden , die xxunk sind , nicht xxmaj deutsch konnten . xxmaj um ihre xxmaj erfahrungen und die xxmaj gesellschaften , in denen sie lebten , zu verstehen , muss man ihre xxmaj sprachen können . xxmaj erst dadurch kann man ihre xxmaj sicht der xxmaj dinge kennenlernen , und das habe ich versucht . xxmaj ich wollte zudem mein xxmaj xxunk auf alle betroffenen xxmaj länder richten , auch auf die xxmaj staaten , die schon vor dem xxmaj zweiten xxmaj weltkrieg bzw . zu dessen xxmaj beginn zerstört wurden

## Applying Naive Bayes

We define the log-count ratio $r$ for each word $f$:

$r = \log \frac{f_p}{f_n}$


Ratio of feature $f$ in positive documents:
 $f_p=\frac{\text{number of times a positive document has a feature}}{\text{the number of positive documents}}$

Ratio of feature $f$ in negative documents:
  $f_n=\frac{\text{number of times a negative document has a feature}}{\text{the number of negative documents}}$
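
As a worked example of these formulas, here is a minimal sketch on an invented count matrix with two positive and two negative documents. It assumes add-one smoothing in both numerator and denominator so that a feature with zero count does not lead to `log(0)`. The arrays `x` and `is_pos` are made up for illustration.

```python
import numpy as np

# Rows are documents, columns are vocabulary entries (invented counts)
x = np.array([[2, 0, 1],
              [1, 0, 0],
              [0, 3, 1],
              [0, 1, 0]])
is_pos = np.array([True, True, False, False])

# Add-one smoothing in numerator and denominator avoids log(0) and division by zero
f_p = (x[is_pos].sum(axis=0) + 1) / (is_pos.sum() + 1)
f_n = (x[~is_pos].sum(axis=0) + 1) / ((~is_pos).sum() + 1)
r = np.log(f_p / f_n)
print(r)
```

A positive entry of `r` marks a feature that is more typical of positive documents, a negative entry one that is more typical of negative documents, and 0 means the feature does not discriminate.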

In [0]:
# for details, see the steps above
p4 = np.squeeze(np.asarray(x[y.items==Panorama].sum(0))) # number of times each word shows up in reviews with this particular label
p3 = np.squeeze(np.asarray(x[y.items==Kultur].sum(0)))
p2 = np.squeeze(np.asarray(x[y.items==International].sum(0)))
p1 = np.squeeze(np.asarray(x[y.items==Inland].sum(0)))
p0 = np.squeeze(np.asarray(x[y.items==Etat].sum(0)))

For the ratio, check out this paper: [log-ratio for multiclass problem](http://www.lrec-conf.org/proceedings/lrec2016/pdf/284_Paper.pdf)

In [663]:
# add 1 to numerator and denominator (smoothing) to avoid division by zero and log(0)
pr0 = (p0+1) / ((y.items==Etat).sum() + 1) # (how often each word shows up in reviews with label "Etat") / (number of reviews with label "Etat")
#pr0_not = (p0+1) / ((y.items!=Etat).sum() + 1) 
pr0_not = ((p1+p2+p3+p4)+1) / ((y.items!=Etat).sum() + 1) # the same ratio over all reviews that do not have this label

r_Etat = np.log(pr0/pr0_not); r_Etat

array([-0.074985,  1.485895,  0.      ,  1.485895, ...,  0.099601,  0.099601,  0.099601,  1.485895])

In [664]:
pr1 = (p1+1) / ((y.items==Inland).sum() + 1)
#pr1_not = (p1+1) / ((y.items!=Inland).sum() + 1)
pr1_not = ((p0+p2+p3+p4)+1) / ((y.items!=Inland).sum() + 1)

r_Inland = np.log(pr1/pr1_not); r_Inland

array([ 0.12723 ,  1.017134,  0.      ,  1.017134, ..., -0.36916 ,  2.403428,  2.403428,  1.017134])

In [665]:
pr2 = (p2+1) / ((y.items==International).sum() + 1)
#pr2_not = (p2+1) / ((y.items!=International).sum() + 1)
pr2_not = ((p0+p1+p3+p4)+1) / ((y.items!=International).sum() + 1)

r_International = np.log(pr2/pr2_not); r_International

array([-0.487842,  0.590115,  0.      ,  0.590115, ..., -0.796179, -0.796179, -0.796179,  0.590115])

In [666]:
pr3 = (p3+1) / ((y.items==Kultur).sum() + 1)
#pr3_not = (p3+1) / ((y.items!=Kultur).sum() + 1)
pr3_not = ((p0+p1+p2+p4)+1) / ((y.items!=Kultur).sum() + 1)

r_Kultur = np.log(pr3/pr3_not); r_Kultur

array([0.691514, 1.903828, 0.      , 1.903828, ..., 0.517534, 0.517534, 0.517534, 1.903828])

In [667]:
pr4 = (p4+1) / ((y.items==Panorama).sum() + 1)
#pr4_not = (p4+1) / ((y.items!=Panorama).sum() + 1)
pr4_not = ((p0+p1+p2+p3)+1) / ((y.items!=Panorama).sum() + 1)

r_Panorama = np.log(pr4/pr4_not); r_Panorama

array([-0.269652,  2.628419,  0.      ,  2.628419, ...,  4.014713,  1.242125,  1.242125,  2.628419])
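
The five cells above repeat the same one-vs-rest computation once per label. A possible way to express it as a loop is sketched below for dense NumPy arrays. The function name `log_count_ratios` and the toy data are invented, and for a scipy sparse matrix the row sums would first need to be converted with `np.asarray` and `np.squeeze` as in the cells above.

```python
import numpy as np

def log_count_ratios(x, y, n_classes):
    """Return an (n_classes, n_features) array of one-vs-rest log-count ratios."""
    ratios = []
    for c in range(n_classes):
        in_c = (y == c)
        p = x[in_c].sum(axis=0)        # feature counts inside class c
        q = x[~in_c].sum(axis=0)       # feature counts in all other classes
        pr = (p + 1) / (in_c.sum() + 1)
        pr_not = (q + 1) / ((~in_c).sum() + 1)
        ratios.append(np.log(pr / pr_not))
    return np.vstack(ratios)

x = np.array([[2, 0, 0], [0, 2, 0], [0, 0, 2], [1, 0, 0]])   # invented counts
y = np.array([0, 1, 2, 0])                                   # invented class ids
r_all = log_count_ratios(x, y, n_classes=3)
print(r_all.shape)
```

In this toy data every class has one feature that occurs only in its own documents, so the largest ratio in row `c` belongs to feature `c`.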

### Vocab most likely associated with reviews which have label "International"

In [0]:
biggest = np.argpartition(r_International, -10)[-10:]
smallest = np.argpartition(r_International, 10)[:10]

**Top 10 words** that indicate that the review most likely has the label "International":

In [669]:
[v.itos[k] for k in biggest]

['rousseff',
 'clinton',
 'palästinenser',
 'taliban',
 'kurden',
 'pkk',
 'angreifer',
 'rebellen',
 'damaskus',
 'podemos']
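
Note that `np.argpartition` only guarantees which indices end up among the k largest (or smallest), not their order within that group. If an ordered ranking is wanted, `np.argsort` can be used instead. A small sketch on invented ratios:

```python
import numpy as np

r = np.array([0.3, -1.2, 2.5, 0.0, 1.7, -0.4])   # invented log-count ratios
k = 2

top_unordered = np.argpartition(r, -k)[-k:]   # indices of the k largest, arbitrary order
top_sorted = np.argsort(r)[-k:][::-1]         # indices of the k largest, largest first
bottom_sorted = np.argsort(r)[:k]             # indices of the k smallest, smallest first
print(top_unordered, top_sorted, bottom_sorted)
```

For a vocabulary of about 11,000 entries the cost difference between the two is negligible, so the choice mainly depends on whether the order matters.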

In [670]:
np.argmax(trn_term_doc[:,v.stoi['konferenz']])

205

In [671]:
article_reviews.train.x[ np.argmax(trn_term_doc[:,v.stoi['konferenz']]) ]

Text xxbos xxmaj hoffnung auf xxmaj bildung einer xxmaj einheitsregierung bei xxmaj xxunk in xxmaj marokko . xxmaj tripolis – xxmaj unter starkem xxunk xxmaj druck steigt offenbar die xxmaj chance auf die xxmaj bildung einer xxmaj einheitsregierung in xxmaj libyen : xxmaj die xxmaj präsidenten der beiden rivalisierenden xxmaj xxunk haben sich am xxmaj dienstag einem xxmaj xxunk zufolge zum ersten xxmaj mal getroffen . xxmaj die xxmaj begegnung fand bei einer xxmaj konferenz in xxmaj malta statt , wo die xxmaj unterzeichnung eines xxup un - xxmaj xxunk für das xxunk xxmaj xxunk vorbereitet wurde . xxmaj der xxmaj plan soll am xxmaj donnerstag in xxmaj marokko von den xxmaj konfliktparteien unterschrieben werden . xxmaj in xxmaj libyen gibt es seit dem xxmaj sturz des langjährigen xxmaj machthabers xxmaj muammar al - xxmaj gaddafi vor vier xxmaj jahren keine xxunk xxmaj regierung mehr . xxmaj die xxmaj jihadistenmiliz xxmaj islamischer xxmaj staat ( xxup is ) nutzte das xxmaj xxunk , um 

**Bottom 10 words**, which indicate that the review most likely does not have the label "International":

In [672]:
[v.itos[k] for k in smallest]

['bellen',
 'orf',
 'fpö',
 'hofer',
 'van',
 'häupl',
 'graz',
 'griss',
 'neos',
 'strache']

### Continuing with Naive Bayes

In [673]:
print((y.items==Panorama).mean()) # fraction of reviews with this specific label
print((y.items==Kultur).mean())
print((y.items==International).mean())
print((y.items==Inland).mean())
print((y.items==Etat).mean())

0.06625
0.12875
0.35625
0.265
0.18375


In [674]:
(y.items==Panorama).mean() + (y.items==Kultur).mean() + (y.items==International).mean() + (y.items==Inland).mean() +  (y.items==Etat).mean()

1.0

In [683]:
# b = log( (fraction of reviews with label x) / (fraction of reviews without label x) )
b_International = np.log((y.items==International).mean() / (y.items!=International).mean()) ; print(b_International)
preds_International = (val_term_doc @ r_International + b_International) > 0 ; print(preds_International) # Here is the formula for Naive Bayes
# a one-vs-rest accuracy would compare against the boolean mask (val_y.items==International):
#(preds_International == (val_y.items==International)).mean() # accuracy 
# NOTE: the line below compares boolean predictions with integer class indices (0-4), so its result is not a one-vs-rest accuracy
(preds_International == val_y.items).mean()

-0.5916777203950857
[False False  True False ... False False False False]


0.16125

In [684]:
b_Inland = np.log((y.items==Inland).mean() / (y.items!=Inland).mean()); print(b_Inland)
preds_Inland = (val_term_doc @ r_Inland + b_Inland) > 0; print(preds_Inland)
#(preds_Inland == (val_y.items==Inland)).mean() # accuracy 
(preds_Inland == val_y.items).mean()

-1.0201406732266145
[ True  True False  True ...  True  True  True  True]


0.2525

In [689]:
b_Kultur = np.log((y.items==Kultur).mean() / (y.items!=Kultur).mean()); print(b_Kultur)
preds_Kultur = (val_term_doc @ r_Kultur + b_Kultur) > 0; print(preds_Kultur)
#(preds_Kultur == (val_y.items==Kultur)).mean() # accuracy
(preds_Kultur == val_y.items).mean()

-1.912056422530888
[ True  True  True  True ...  True  True  True  True]


0.2478125

In [686]:
b_Etat = np.log((y.items==Etat).mean() / (y.items!=Etat).mean()); print(b_Etat)
preds_Etat = (val_term_doc @ r_Etat + b_Etat) > 0; print(preds_Etat)
#(preds_Etat == (val_y.items==Etat)).mean() # accuracy 
(preds_Etat == val_y.items).mean()

-1.4911445424976948
[False False False False ... False False False False]


0.1196875

In [687]:
b_Panorama = np.log((y.items==Panorama).mean() / (y.items!=Panorama).mean()); print(b_Panorama)
preds_Panorama = (val_term_doc @ r_Panorama + b_Panorama) > 0; print(preds_Panorama)
#(preds_Panorama == (val_y.items==Panorama)).mean() # accuracy
(preds_Panorama == val_y.items).mean()

-2.6457732715806954
[False False False False ... False False False False]


0.1634375
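
The cells above score each label separately. One common way to turn such one-vs-rest scores into a single multi-class prediction is to pick, for every document, the label with the highest score. This step is not part of the notebook above. The sketch below uses invented counts, ratios and bias terms, and all names in it are hypothetical.

```python
import numpy as np

# Invented example: 3 documents, 4 vocabulary entries, 2 classes
val_x = np.array([[3, 0, 1, 0],
                  [0, 2, 0, 1],
                  [1, 0, 2, 0]])
val_labels = np.array([0, 1, 0])

r_all = np.array([[ 1.0, -1.0,  0.5, -0.5],    # hypothetical ratios for class 0
                  [-1.0,  1.0, -0.5,  0.5]])   # hypothetical ratios for class 1
b_all = np.array([0.2, -0.2])                  # hypothetical bias terms

scores = val_x @ r_all.T + b_all     # shape (n_docs, n_classes)
pred = scores.argmax(axis=1)         # class with the highest one-vs-rest score
accuracy = (pred == val_labels).mean()
print(pred, accuracy)
```

Here `pred` contains class indices, so comparing it with the integer labels gives a meaningful multi-class accuracy.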