<a href="https://colab.research.google.com/github/RaminParker/Text-Classification-with-German-dataset/blob/master/TextClassificationGerman.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Sentiment Classification with Naive Bayes

This notebook uses the techniques presented in the following videos:
- [video 1](https://www.youtube.com/watch?v=hp2ipC5pW4I&list=PLtmWHNX-gukKocXQOkQjuVxglSDYWsSh9&index=4)
- [video 2](https://www.youtube.com/watch?v=dt7sArnLo1g&list=PLtmWHNX-gukKocXQOkQjuVxglSDYWsSh9&index=5)

However, many steps had to be adapted to fit this particular use case.

## Links

- [Using FastAI’s ULMFiT to make a state-of-the-art multi-class text classifier](https://medium.com/technonerds/using-fastais-ulmfit-to-make-a-state-of-the-art-multi-label-text-classifier-bf54e2943e83)
- [Using the fastai Data Block API](https://medium.com/@tmckenzie.nz/using-the-fastai-data-block-api-b4818e72155b)

In [0]:
from fastai import *
from fastai.text import *
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split

In [0]:
import sklearn.feature_extraction.text as sklearn_text

In [0]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline

In [58]:
from google.colab import drive
drive.mount('/content/gdrive')

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


## Import data

In [0]:
path = 'gdrive/My Drive/Arbeit/TeamBank_Work/'

In [0]:
#nrows = None     # read all rows of dataset
nrows = 4000

df = pd.read_csv(path + 'GermanArticles.csv', sep='\t', names=['text'], header=None, nrows=nrows)

In [61]:
df.head()

Unnamed: 0,text
0,Etat;Die ARD-Tochter Degeto hat sich verpflich...
1,Etat;App sei nicht so angenommen worden wie ge...
2,Etat;'Zum Welttag der Suizidprävention ist es ...
3,Etat;Mitarbeiter überreichten Eigentümervertre...
4,Etat;Service: Jobwechsel in der Kommunikations...


## Bring dataset into the correct form

### Change columns

In [0]:
# drop rows with missing values to avoid errors
df.dropna(inplace=True)

In [0]:
# new data frame: split each line at the first ';' only (n=1)
new = df["text"].str.split(";", n=1, expand=True)  # expand=True returns the parts as separate columns

In [64]:
new.head(2)

Unnamed: 0,0,1
0,Etat,"Die ARD-Tochter Degeto hat sich verpflichtet, ..."
1,Etat,App sei nicht so angenommen worden wie geplant...
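
The `n=1` argument matters because an article body can contain semicolons itself; only the first one separates the label from the text. A minimal sketch on two made-up rows (the sample strings are illustrative and not taken from the dataset):

```python
import pandas as pd

# Two made-up rows in the "label;content" layout used above.
sample = pd.DataFrame({"text": ["Etat;Erster Satz; zweiter Satz", "Kultur;Nur ein Satz"]})

# n=1 splits at the first ';' only, expand=True returns one column per part.
parts = sample["text"].str.split(";", n=1, expand=True)
parts.columns = ["label", "content"]
print(parts)
```

The second semicolon in the first row stays inside `content` instead of producing a third column.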


In [0]:
# store the part before the first ';' as the label
df["label"] = new[0]

# store the rest of the line as the article content
df["content"] = new[1]

# drop the original combined column
df.drop(columns=["text"], inplace=True)

In [66]:
df.head(2)

Unnamed: 0,label,content
0,Etat,"Die ARD-Tochter Degeto hat sich verpflichtet, ..."
1,Etat,App sei nicht so angenommen worden wie geplant...


In [0]:
df = df.rename(columns={'content': 'text'})


A quick look at the reshaped dataframe to get a sense of what our data looks like:

In [68]:
df.head()

Unnamed: 0,label,text
0,Etat,"Die ARD-Tochter Degeto hat sich verpflichtet, ..."
1,Etat,App sei nicht so angenommen worden wie geplant...
2,Etat,"'Zum Welttag der Suizidprävention ist es Zeit,..."
3,Etat,Mitarbeiter überreichten Eigentümervertretern ...
4,Etat,Service: Jobwechsel in der Kommunikationsbranc...


In [69]:
df['label'].nunique()

5

In [70]:
df.label.unique()

array(['Etat', 'Inland', 'International', 'Kultur', 'Panorama'], dtype=object)

### Add new column

We add a new column that marks the train and test rows. To do this, we first split the dataframe into two parts. After this step, the first part gets the value "True" whereas the second part gets the value "False".

In [0]:
train_df, test_df = train_test_split(df, test_size=0.2) # split dataframe into two parts

In [72]:
train_df.head()

Unnamed: 0,label,text
1765,International,Bisher bekannte sich niemand zu Attentat. Kair...
1046,Inland,Ein antisemitisches Posting kostet die umstrit...
3582,Kultur,Bisherige Eigentümer erhalten im Tausch 35 Pro...
1952,International,"""Krieg der Republikaner"" über ihren Kandidaten..."
247,Etat,"War mehr als 40 Jahre Sportredakteur der ""Kron..."


Check that each part contains all labels.

In [73]:
train_df['label'].value_counts() # check if df contains all labels 

International    1218
Inland            794
Etat              542
Kultur            430
Panorama          216
Name: label, dtype: int64

In [106]:
len(train_df)

3200

In [74]:
len(train_df['label'].unique()) # number of unique labels

5

In [75]:
test_df['label'].value_counts() # check if df contains all labels 

International    293
Inland           221
Etat             126
Kultur           109
Panorama          51
Name: label, dtype: int64

In [107]:
len(test_df)

800

In [76]:
len(test_df['label'].unique()) # number of unique labels

5

We want all labels in each of the two dataframes, so the following check is expected to return True:

In [77]:
set(train_df['label']) == set(test_df['label'])  # both parts contain the same set of labels

True
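
With a plain random split, a rare label can end up in only one of the two parts. One hedged alternative is to pass `stratify` to `train_test_split`, which keeps the label proportions roughly equal in both parts. The toy dataframe and the names `toy`, `toy_train` and `toy_test` below are made up for illustration:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: five labels with ten made-up articles each.
labels = ["Etat", "Inland", "International", "Kultur", "Panorama"]
toy = pd.DataFrame({
    "label": [lab for lab in labels for _ in range(10)],
    "text": [f"Artikel {i}" for i in range(50)],
})

# stratify keeps the share of each label the same in both parts.
toy_train, toy_test = train_test_split(
    toy, test_size=0.2, stratify=toy["label"], random_state=0
)

print(set(toy_train["label"]) == set(toy_test["label"]))
```

Stratification needs at least two rows per label, so extremely rare labels would have to be merged or dropped first.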

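Add a new column to each dataframe that records which part it belongs to. Note that fastai's `split_from_df`, used further down, puts the rows whose flag is True into the validation set, so with the values assigned here the larger part ends up as validation data and the smaller part as training data.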

In [78]:
train_df['is_valid']='True'   # add new column

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.


In [79]:
test_df['is_valid']='False'   # add new column

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.


In [80]:
test_df.head()

Unnamed: 0,label,text,is_valid
2022,International,Opfer der Hazara-Minderheit aufgereiht und get...,False
3339,Kultur,"'Im Herbst 1990 kam ""Goodfellas"" in die Kinos....",False
1910,International,Präsidentschaftsbewerber gibt Vermögen mit zeh...,False
2499,International,Bericht über Vorbereitungen für Wahlkampagne 2...,False
2964,International,Partei könnte politischen Konflikt im Land wei...,False


In [0]:
frames=[train_df, test_df]

In [0]:
df=pd.concat(frames)

In [83]:
df.head()

Unnamed: 0,label,text,is_valid
1765,International,Bisher bekannte sich niemand zu Attentat. Kair...,True
1046,Inland,Ein antisemitisches Posting kostet die umstrit...,True
3582,Kultur,Bisherige Eigentümer erhalten im Tausch 35 Pro...,True
1952,International,"""Krieg der Republikaner"" über ihren Kandidaten...",True
247,Etat,"War mehr als 40 Jahre Sportredakteur der ""Kron...",True
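
The `SettingWithCopyWarning` shown above appears because the column is written into a dataframe that pandas treats as a slice of another one. A possible way around it, sketched on made-up stand-ins for `train_df` and `test_df`, is `assign`, which returns a new dataframe. The sketch also uses real booleans and assumes that True is meant to mark the held-out rows:

```python
import pandas as pd

# Stand-ins for train_df and test_df (made-up rows).
toy_train = pd.DataFrame({"label": ["Etat", "Kultur"], "text": ["a", "b"]})
toy_test = pd.DataFrame({"label": ["Inland"], "text": ["c"]})

# assign returns a new dataframe, so no slice of another frame is modified.
# Assumption: True marks the validation rows, False the training rows.
combined = pd.concat([
    toy_train.assign(is_valid=False),
    toy_test.assign(is_valid=True),
])

print(combined)
```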


### Add column

In case we need a column that specifies the split ratio. Not sure yet whether we really need it.

In [0]:
# df['is_valid']='False'

In [0]:
# len(df)

In [0]:
# ratio = 0.3 # split ratio within the new column is_valid

# ratio_1=round(len(df) * ratio)
# ratio_2=len(df) - ratio_1

# print('Total length: ', len(df))
# print('First ratio: ', ratio_1 )
# print('Second ratio: ', ratio_2 )

In [0]:
# ratio_1 + ratio_2 == len(df)

In [0]:
# df.loc[0:ratio_1, 'is_valid'] = 'True'  # replace value in column

## Save data

In [0]:
df.to_csv(path  + 'texts.csv', sep=',', encoding='utf-8', index=False) # save data to csv

In [0]:
fields = ['label', 'text'] # only read these columns in the next step

In [91]:
df = pd.read_csv(path + 'texts.csv', sep=',', usecols=fields ) # quick check: read csv
df.head()

Unnamed: 0,label,text
0,International,Bisher bekannte sich niemand zu Attentat. Kair...
1,Inland,Ein antisemitisches Posting kostet die umstrit...
2,Kultur,Bisherige Eigentümer erhalten im Tausch 35 Pro...
3,International,"""Krieg der Republikaner"" über ihren Kandidaten..."
4,Etat,"War mehr als 40 Jahre Sportredakteur der ""Kron..."


## Split data

Try to avoid this error: “your validation data contains a label that isn’t present in your training set”.

It is a problem because the model cannot be expected to predict a label during validation that it never saw during training.

- [Link to forum](https://forums.fast.ai/t/tabulardatabunch-error-your-validation-data-contains-a-label-that-isnt-present-in-the-training-set-please-fix-your-data/33410)

In [0]:
classes = df['label'].unique()
classes.sort()

In [93]:
classes

array(['Etat', 'Inland', 'International', 'Kultur', 'Panorama'], dtype=object)

We will be using TextList from the fastai library:

In [94]:
article_reviews = (TextList.from_csv(path, 'texts.csv', cols='text')
                         .split_from_df(col=2)
                         .label_from_df(cols=0, classes=classes))

In [95]:
len(article_reviews.classes)

5

## Exploring what our data looks like

In [96]:
article_reviews.valid.x[0], article_reviews.valid.y[0]

(Text xxbos xxmaj bisher bekannte sich niemand zu xxmaj attentat . xxmaj kairo – xxmaj bei einem xxmaj xxunk auf einen xxmaj bus der ägyptischen xxmaj sicherheitskräfte sind mindestens drei xxmaj polizisten getötet und weitere 33 verletzt worden . xxmaj der xxmaj sprengsatz sei am xxmaj montag in der xxmaj früh i m xxmaj norden des xxmaj landes xxunk , teilte das xxmaj innenministerium mit . xxmaj zu dem xxmaj anschlag hat sich bisher niemand bekannt . xxmaj seit dem xxmaj sturz des islamistischen xxmaj präsidenten xxmaj mohammed xxmaj xxunk 2013 sowie der xxmaj verurteilung xxmaj xxunk seiner xxmaj anhänger zum xxmaj xxunk kommt es in xxmaj ägypten immer wieder zu xxmaj anschlägen . xxmaj seit xxmaj ende xxmaj juni hat sich die xxmaj situation verschärft . xxmaj bei mehreren xxmaj angriffen – unter anderem i m xxunk xxmaj norden der xxmaj sinai - xxmaj halbinsel – und xxmaj xxunk in xxmaj kairo starben viele xxmaj soldaten und xxmaj polizisten . xxmaj erst am xxmaj donnerstag waren 29

Here, the tokens mostly correspond to words or punctuation, as well as several special tokens, corresponding to unknown words, capitalization, etc.

All tokens starting with "xx" are fastai special tokens. The full list of rules and their meanings is in [the fastai docs](https://docs.fast.ai/text.transform.html).

Here is the meaning of the special tokens:

- UNK (xxunk) is for an unknown word (one that isn't present in the current vocabulary)
- PAD (xxpad) is the token used for padding, if we need to regroup several texts of different lengths in a batch
- BOS (xxbos) represents the beginning of a text in your dataset
- FLD (xxfld) is used if you set mark_fields=True in your TokenizeProcessor to separate the different fields of texts (if your texts are loaded from several columns in a dataframe)
- TK_MAJ (xxmaj) is used to indicate the next word begins with a capital in the original text
- TK_UP (xxup) is used to indicate the next word is written in all caps in the original text
- TK_REP (xxrep) is used to indicate the next character is repeated n times in the original text (usage xxrep n {char})
- TK_WREP (xxwrep) is used to indicate the next word is repeated n times in the original text (usage xxwrep n {word})
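
To make the capitalization tokens more concrete, here is a simplified sketch of how `xxmaj` and `xxup` could be produced. The function `mark_caps` is hypothetical and only mimics the behaviour described in the list; it is not fastai's actual code:

```python
def mark_caps(tokens):
    """Toy version of the capitalization rules; not fastai's actual code."""
    out = []
    for tok in tokens:
        if tok.isupper() and len(tok) > 1:
            out += ["xxup", tok.lower()]   # all caps, e.g. "ARD"
        elif tok[:1].isupper():
            out += ["xxmaj", tok.lower()]  # leading capital, e.g. "Tochter"
        else:
            out.append(tok)                # lowercase words and punctuation
    return out

print(mark_caps(["Die", "ARD", "-", "Tochter", "hat", "sich", "verpflichtet"]))
```

The information about the original casing is kept in the marker token, so the vocabulary only needs the lowercase form of each word.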

In [97]:
len(article_reviews.train.x), len(article_reviews.valid.x)

(800, 3200)

Notice that `itos` (int to string) and `stoi` (string to int) have different lengths.

Reason:
`itos` is a list with exactly one entry per vocabulary token. `stoi` is a dictionary in which several strings can share the same index: every word that is not in the vocabulary (for example, a word that is too rare) maps to the index of xxunk. A word that occurs many times in the texts still has only one integer.

In [98]:
len(article_reviews.vocab.itos), len(article_reviews.vocab.stoi)


(11080, 109814)
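
The effect can be reproduced with a toy vocabulary. The sketch below assumes that `stoi` behaves like a `collections.defaultdict(int)`, so that looking up an unseen string returns 0 and stores that string as a new key; the four-token vocabulary is made up:

```python
from collections import defaultdict

itos = ["xxunk", "xxpad", "die", "der"]  # toy vocabulary, index 0 is xxunk

# Assumption: stoi is a defaultdict(int), so unseen strings fall back to 0.
stoi = defaultdict(int, {tok: idx for idx, tok in enumerate(itos)})
print(len(itos), len(stoi))

# Looking up out-of-vocabulary words adds them as keys that map to 0.
for word in ["politik", "poolitik", "die"]:
    stoi[word]
print(len(itos), len(stoi))
```

After the lookups, `stoi` has two more keys than `itos` has entries, while `itos[stoi["poolitik"]]` still returns `'xxunk'`.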

In [99]:
pos = article_reviews.vocab.stoi['gleichstellung']  # string to int
pos

6448

In [100]:
article_reviews.vocab.itos[pos]  # int to string

'gleichstellung'

int to str is a list:

In [101]:
article_reviews.vocab.itos[0:20]  # most frequently used strings

['xxunk',
 'xxpad',
 'xxbos',
 'xxeos',
 'xxfld',
 'xxmaj',
 'xxup',
 'xxrep',
 'xxwrep',
 '.',
 ',',
 'die',
 'der',
 'und',
 'in',
 '-',
 'das',
 'den',
 'von',
 'zu']

string to int is a dictionary:

In [0]:
# article_reviews.vocab.stoi  

Let's test that a non-word maps to xxunk:

In [103]:
article_reviews.vocab.itos[article_reviews.vocab.stoi['Poolitik']]


'xxunk'

## Creating our term-document matrix

As covered in the second and third lesson, a term-document matrix represents a document as a "bag of words". That means it doesn't keep track of the order the words are in, just which words occur (and how often).

You can use sklearn's CountVectorizer (as in previous lessons), but here we create our own (similar) version. This is for two reasons:

- to understand what sklearn is doing underneath the hood
- to create something that will work with a fastai TextList

To create our term-document matrix, we first need to learn about counters and sparse matrices.
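
For comparison, this is roughly what the sklearn route looks like on two made-up sentences. The sentences are illustrative only, and `CountVectorizer` is used with its default settings:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["der Hund jagt die Katze", "die Katze schläft"]  # made-up documents

vectorizer = CountVectorizer()             # lowercases and splits into word tokens
term_doc = vectorizer.fit_transform(docs)  # sparse matrix: rows = documents, columns = terms

print(sorted(vectorizer.vocabulary_))      # the terms, in column order
print(term_doc.toarray())
```

Each row is one document and each column one term; the word order inside a sentence is lost, which is exactly the "bag of words" idea.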

### Counters
Here is how they work:

In [0]:
c = Counter([4,2,8,8,4,8])


In [0]:
c

Counter({2: 1, 4: 2, 8: 3})

In [0]:
c.values()


dict_values([2, 1, 3])

In [0]:
c.keys()


dict_keys([4, 2, 8])
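
The same works for word tokens, which is how a counter will be used for the articles. A small sketch on a made-up sentence:

```python
from collections import Counter

tokens = "die katze jagt die maus und die maus flieht".split()
counts = Counter(tokens)

print(counts.most_common(2))  # the two most frequent tokens
print(counts["hund"])         # a missing key counts as 0 instead of raising
```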

### Sparse Matrices

Even though we've reduced the number of words, we still have a lot! Most tokens don't appear in most articles. We want to take advantage of this by storing our data as a sparse matrix.

A matrix with lots of zeros is called sparse (the opposite of sparse is dense). For sparse matrices, you can save a lot of memory by only storing the non-zero values.

These are the most common sparse storage formats:

- coordinate-wise (scipy calls it COO)
- compressed sparse row (CSR)
- compressed sparse column (CSC)

- [Here are examples](http://www.mathcs.emory.edu/~cheung/Courses/561/Syllabus/3-C/sparse.html)
- [Good explanation](https://youtu.be/hp2ipC5pW4I?t=1575)


A class of matrices (e.g, diagonal) is generally called sparse if the number of non-zero elements is proportional to the number of rows (or columns) instead of being proportional to the product rows x columns.

**Scipy Implementation**

From the Scipy Sparse Matrix Documentation:

- To construct a matrix efficiently, use either dok_matrix or lil_matrix. The lil_matrix class supports basic slicing and fancy indexing with a similar syntax to NumPy arrays. The COO format may also be used to efficiently construct matrices.
- To perform manipulations such as multiplication or inversion, first convert the matrix to either CSC or CSR format.
- All conversions among the CSR, CSC, and COO formats are efficient, linear-time operations.
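
A short sketch of that workflow: build a small matrix in COO format from its non-zero entries, then convert it to CSR before doing arithmetic. The 3x4 matrix and its values are made up for illustration:

```python
import numpy as np
from scipy import sparse

# Non-zero entries of a 3x4 matrix as (row, column, value) triples.
rows = np.array([0, 0, 2])
cols = np.array([1, 3, 2])
vals = np.array([5, 7, 9])

coo = sparse.coo_matrix((vals, (rows, cols)), shape=(3, 4))  # easy to build
csr = coo.tocsr()                                            # efficient for arithmetic

print(csr.toarray())
print(csr.dot(np.ones(4)))  # row sums via a matrix-vector product
```

Only the three non-zero values are stored, no matter how large the declared shape is.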

### Our version of CountVectorizer

In [0]:
# Look at first article: we get a list of integers which are representing the article
(article_reviews.valid.x)[0].data

array([   2,    5,   11,    6, ...,   22,    5, 4139,    9])

In [0]:
# Apply counter on the article
Counter((article_reviews.valid.x)[0].data)

Counter({0: 17,
         2: 1,
         5: 57,
         6: 2,
         9: 7,
         10: 9,
         11: 9,
         12: 6,
         13: 2,
         14: 2,
         15: 4,
         16: 1,
         18: 4,
         19: 2,
         20: 1,
         21: 2,
         22: 2,
         27: 1,
         28: 1,
         29: 2,
         32: 1,
         35: 2,
         39: 1,
         40: 1,
         43: 2,
         44: 1,
         47: 1,
         48: 1,
         49: 1,
         50: 1,
         53: 1,
         56: 1,
         67: 1,
         69: 1,
         73: 1,
         75: 1,
         77: 2,
         83: 1,
         94: 2,
         96: 3,
         114: 1,
         120: 2,
         125: 1,
         143: 2,
         151: 1,
         154: 1,
         181: 1,
         194: 1,
         195: 1,
         234: 1,
         267: 1,
         291: 1,
         333: 1,
         405: 1,
         440: 1,
         580: 1,
         586: 1,
         617: 1,
         622: 1,
         644: 2,
         652: 1,
      

In [0]:
# check which token the number represents:
article_reviews.vocab.itos[11]

'die'

In [0]:
# Look at first article:
(article_reviews.valid.x)[0]

Text xxbos xxmaj die xxup ard - xxmaj tochter xxmaj xxunk hat sich verpflichtet , ab xxmaj august einer xxmaj xxunk zu folgen , die für die xxmaj gleichstellung von xxmaj xxunk sorgen soll . xxmaj in mindestens 20 xxmaj prozent der xxmaj filme , die die xxup ard - xxmaj tochter xxmaj xxunk produziert oder xxunk , sollen ab xxmaj mitte xxmaj august xxmaj frauen xxmaj regie führen . xxmaj xxunk - xxmaj chefin xxmaj christine xxmaj strobl folgt mit dieser xxmaj xxunk der xxmaj forderung von xxmaj pro xxmaj quote xxmaj regie . xxmaj die xxmaj vereinigung von xxmaj xxunk hatte i m vergangenen xxmaj jahr eine xxmaj xxunk gefordert , um den weiblichen xxmaj xxunk mehr xxmaj gehör und ökonomische xxmaj gleichstellung zu verschaffen . xxmaj pro xxmaj quote xxmaj regie kritisiert , dass , während rund 50 xxmaj prozent der xxmaj regie - xxmaj studierenden xxunk seien , der xxmaj anteil der xxmaj xxunk bei xxmaj xxunk nur bei 13 bis 15 xxmaj prozent liege . xxmaj in xxmaj österreich sieht die xxma

Construct the term-document matrix:

[Detailed explanation](https://youtu.be/hp2ipC5pW4I?t=2688)

In [0]:
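import scipy.sparse
from collections import Counter

def get_term_doc_matrix(text_list, vocab_len):
    """Build a CSR term-document matrix from a list of numericalized texts."""
    j_indices = []   # column index (token id) of every stored value
    indptr = [0]     # indptr[i]:indptr[i+1] is the slice that belongs to row i
    values = []      # count of each token id within its document

    for doc in text_list:
        feature_counter = Counter(doc.data)   # token id -> count in this document
        j_indices.extend(feature_counter.keys())
        values.extend(feature_counter.values())
        indptr.append(len(j_indices))

    return scipy.sparse.csr_matrix((values, j_indices, indptr),
                                   shape=(len(indptr) - 1, vocab_len),
                                   dtype=int)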
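
To see what the three lists contain, the loop can be traced by hand on two tiny made-up documents. This sketch repeats only the list-building part, uses plain Python lists of token ids instead of fastai `Text` objects, and leaves out scipy:

```python
from collections import Counter

toy_docs = [[2, 5, 5, 9], [5, 7]]  # two made-up documents as token ids
values, j_indices, indptr = [], [], [0]

for doc in toy_docs:
    counts = Counter(doc)
    j_indices.extend(counts.keys())   # column indices (token ids) in this row
    values.extend(counts.values())    # how often each of those ids occurs
    indptr.append(len(j_indices))     # where the next row starts

print(values, j_indices, indptr)

# Row i is described by the slice indptr[i]:indptr[i + 1].
row1 = dict(zip(j_indices[indptr[1]:indptr[2]], values[indptr[1]:indptr[2]]))
print(row1)
```

The second document occupies positions 3 and 4 of `values` and `j_indices`, which is exactly what `indptr[1]` and `indptr[2]` encode.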

In [0]:
%%time
val_term_doc = get_term_doc_matrix(article_reviews.valid.x, len(article_reviews.vocab.itos))

CPU times: user 370 ms, sys: 13 ms, total: 383 ms
Wall time: 365 ms


In [0]:
%%time
trn_term_doc = get_term_doc_matrix(article_reviews.train.x, len(article_reviews.vocab.itos))

CPU times: user 809 ms, sys: 41.3 ms, total: 850 ms
Wall time: 805 ms


In [0]:
trn_term_doc.shape

(2799, 27032)

In [0]:
# show m x n matrix
trn_term_doc[:10,:10]

<10x10 sparse matrix of type '<class 'numpy.int64'>'
	with 50 stored elements in Compressed Sparse Row format>

In [0]:
val_term_doc.shape

(1201, 27032)