
In [0]:
from fastai import *
from fastai.text import *
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer

In [89]:
path = untar_data(URLs.IMDB)
path

PosixPath('/root/.fastai/data/imdb')

In [90]:
path = untar_data(URLs.IMDB_SAMPLE)
path

PosixPath('/root/.fastai/data/imdb_sample')

In [91]:
df = pd.read_csv(path/'texts.csv')
df.head()

Unnamed: 0,label,text,is_valid
0,negative,Un-bleeping-believable! Meg Ryan doesn't even ...,False
1,positive,This is a extremely well-made film. The acting...,False
2,negative,Every once in a long while a movie will come a...,False
3,positive,Name just says it all. I watched this movie wi...,False
4,negative,This movie succeeds at being one of the most u...,False


# Preprocessing German language

[Common pitfalls with the preprocessing of German text for NLP 🇩🇪](https://medium.com/idealo-tech-blog/common-pitfalls-with-the-preprocessing-of-german-text-for-nlp-3cfb8dc19ebe)

[Handling German Text with torchtext](https://www.innoq.com/en/blog/handling-german-text-with-torchtext/)

[German NLP Github](https://github.com/adbar/German-NLP#Tokenization)

See also topic clustering by me --> Emails

# Get data into correct shape

In [0]:
x=df[['text']]
y=df[['label']]

In [0]:
# x.head(2)

In [0]:
# y.head(2)

In [0]:
# split data
trn, val, trn_y, val_y = train_test_split(x, y, test_size=0.2, random_state=0)

In [0]:
trn.reset_index(drop=True, inplace=True)
val.reset_index(drop=True, inplace=True)
trn_y.reset_index(drop=True, inplace=True)
val_y.reset_index(drop=True, inplace=True)

In [97]:
trn.head(2) 

Unnamed: 0,text
0,I ordered this extremely rare and highly overr...
1,If you liked the Grinch movie... go watch that...


In [98]:
trn['text'][0]

"I ordered this extremely rare and highly overrated movie on ebay with very high expectations. I think I paid about 50$ for this movie. As an eternal fan of horror, from cheesy 80s American slashers to European zombie films, I told myself this was going to be great! I can't tell you how wrong I was. First of all, I thought it was gonna be pretty much gorier than it actually is. After all I've had heard about this film, I was almost scared to watch it. The murders are boring. The acting... forget it, there's no acting! The story, even if we don't care, is incredibly bad. It seems they tried to get your attention with some weird sexual scenes and naked girls, but unfortunately in this case it doesn't help the movie. Why? There's no atmosphere, and this is the worst thing about this flick. It's just bad film-making from point A to B. Though it's extremely funny and amusing to watch with your friends and a lot of beers, don't make any effort to get your hands on it. There are so many movie

In [99]:
trn_y['label'][0]

'negative'

# Get data into correct shape II (alternative to I)

https://gist.github.com/manisnesan/cf010a8f078c18a128c2e357ec535753#file-copy-of-lesson3-imdb-ipynb

# Create matrix

CountVectorizer converts a collection of text documents to a matrix of token counts (part of sklearn.feature_extraction.text). 

Link: [CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)

Link: [very simple term-doc-matrix](https://youtu.be/37sFIak42Sc?t=4011)

We use it here to turn our reviews into a bag-of-words representation.

In [0]:
veczr = CountVectorizer(tokenizer=None, stop_words=None, max_features=None, vocabulary=None) 


fit_transform(trn['text']) learns the vocabulary from the training set and, in the same step, transforms the training set into a term-document matrix. The validation set has to go through the same transformation with the same vocabulary, so the second cell calls only transform(val['text']). trn_term_doc and val_term_doc are sparse matrices. Row trn_term_doc[i] represents training document i: for each word in the vocabulary it holds the number of times that word occurs in the document.
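To make the difference between the two calls concrete, here is a minimal sketch on a toy corpus. The sentences and the `toy_` names are made up for illustration and are not part of the IMDB sample.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpora, not taken from the IMDB sample.
toy_trn = ["good movie", "bad movie", "good good acting"]
toy_val = ["bad acting", "great movie"]

toy_veczr = CountVectorizer()

# fit_transform learns the vocabulary from the training texts and counts tokens.
toy_trn_term_doc = toy_veczr.fit_transform(toy_trn)

# transform reuses that vocabulary; unseen words such as "great" are dropped.
toy_val_term_doc = toy_veczr.transform(toy_val)

print(sorted(toy_veczr.vocabulary_))  # ['acting', 'bad', 'good', 'movie']
print(toy_trn_term_doc.toarray())     # one row per training document
print(toy_val_term_doc.toarray())     # same columns as the training matrix
```

Both matrices have the same four columns, so anything computed per word on the training set can be applied to the validation set directly.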

In [0]:
# create term-doc-matrix based on training set
trn_term_doc = veczr.fit_transform(trn['text']) 

In [0]:
# reuse the vocabulary fitted on the training set to build the bag of words for the validation set
val_term_doc = veczr.transform(val['text']) 

In [103]:
trn_term_doc  # a x b matrix: "a" rows (number of docs) and "b" columns (number of words)

<800x16597 sparse matrix of type '<class 'numpy.int64'>'
	with 116168 stored elements in Compressed Sparse Row format>

In [104]:
trn_term_doc[0] # first document. You can see how many words are actually used and how many columns we have for this. Most are zero.

<1x16597 sparse matrix of type '<class 'numpy.int64'>'
	with 125 stored elements in Compressed Sparse Row format>

In [105]:
vocab = veczr.get_feature_names(); vocab[5000:5005]

['enlightenment', 'enlists', 'enlivened', 'ennio', 'enormous']

In [106]:
w0 = set([o.lower() for o in trn['text'][0].split(' ')]); w0 # get all lower case words and split on space

{'50$',
 '80s',
 'a',
 'about',
 'acting!',
 'acting...',
 'actually',
 'after',
 'all',
 'all,',
 'almost',
 'american',
 'amusing',
 'an',
 'and',
 'any',
 'are',
 'as',
 'atmosphere,',
 'attention',
 'b.',
 'bad',
 'bad.',
 'be',
 'beers,',
 'boring.',
 'but',
 "can't",
 'care,',
 'case',
 'cheesy',
 "doesn't",
 "don't",
 'ebay',
 'effort',
 'eternal',
 'european',
 'even',
 'expectations.',
 'extremely',
 'fan',
 'files!',
 'film,',
 'film-making',
 'films,',
 'first',
 'flick.',
 'for',
 'forget',
 'friends',
 'from',
 'funny',
 'get',
 'girls,',
 'going',
 'gonna',
 'gorier',
 'great!',
 'had',
 'hands',
 'heard',
 'help',
 'high',
 'highly',
 'horror,',
 'how',
 'i',
 "i've",
 'if',
 'in',
 'incredibly',
 'is',
 'is.',
 'it',
 "it's",
 'it,',
 'it.',
 'just',
 'lot',
 'make',
 'many',
 'movie',
 'movie.',
 'movies',
 'much',
 'murders',
 'myself',
 'naked',
 'necro',
 'no',
 'of',
 'on',
 'ordered',
 'overrated',
 'paid',
 'point',
 'pretty',
 'rare',
 'scared',
 'scenes',
 'see

In [107]:
len(w0)

138

In [0]:
w2i=veczr.vocabulary_['unfortunately'] # word to integer

In [109]:
trn_term_doc[0,w2i] # number of times this word appears

1

# Naive Bayes
We define the log-count ratio $r$ for each feature (word) $f$:

$r = \log \frac{\text{ratio of feature $f$ in positive documents}}{\text{ratio of feature $f$ in negative documents}}$

where the ratio of feature $f$ in positive documents is the number of times $f$ occurs in positive documents divided by the number of positive documents (and likewise for negative documents).
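As a small worked example of this definition, the sketch below computes $r$ on a made-up count matrix. The data, the `toy_` names and the helper `feature_ratio` are hypothetical, and the +1 smoothing in numerator and denominator is an assumption added to avoid taking the log of zero. Note that the notebook's own cell normalizes by the total word count per class rather than by the number of documents.

```python
import numpy as np

# Hypothetical term-document counts: 4 documents (rows), 3 words (columns).
toy_x = np.array([[2, 0, 1],
                  [1, 0, 0],
                  [0, 3, 1],
                  [0, 1, 0]])
toy_y = np.array(['positive', 'positive', 'negative', 'negative'])

def feature_ratio(x, y, label):
    """Smoothed word counts in `label` documents divided by the smoothed number of such documents."""
    mask = (y == label)
    return (x[mask].sum(axis=0) + 1) / (mask.sum() + 1)

# Log-count ratio per word: > 0 leans positive, < 0 leans negative, 0 is neutral.
toy_r = np.log(feature_ratio(toy_x, toy_y, 'positive') /
               feature_ratio(toy_x, toy_y, 'negative'))
print(toy_r)
```

The first word occurs only in positive documents and gets a positive ratio, the second occurs only in negative documents and gets a negative one, and the third occurs equally often in both and gets zero.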

[Details click here](https://youtu.be/37sFIak42Sc?t=4648)

[Excel spreadsheet](https://github.com/fastai/fastai/blob/master/courses/ml1/excel/naivebayes.xlsx)

In [110]:
x

Unnamed: 0,text
0,Un-bleeping-believable! Meg Ryan doesn't even ...
1,This is a extremely well-made film. The acting...
2,Every once in a long while a movie will come a...
3,Name just says it all. I watched this movie wi...
4,This movie succeeds at being one of the most u...
...,...
995,There are many different versions of this one ...
996,Once upon a time Hollywood produced live-actio...
997,Wenders was great with Million $ Hotel.I don't...
998,Although a film with Bruce Willis is always wo...


In [0]:
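x = trn_term_doc
y = trn_y['label'].values # all training labels as an array, not just the first one

p = x[y=='positive'].sum(0)+1 # per-word counts summed over all 'positive' rows, +1 for smoothing
q = x[y=='negative'].sum(0)+1 # per-word counts summed over all 'negative' rows, +1 for smoothing
r = np.log((p/p.sum()) / (q/q.sum())) # log of the ratios
b = np.log((y=='positive').mean() / (y=='negative').mean()) # log of the class ratio (positive docs / negative docs)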

Here is the formula for Naive Bayes
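
Written out, and assuming $p$ and $q$ are the smoothed per-word count vectors of the positive and negative training documents, the decision rule for a document with count vector $x$ can be sketched as follows. The symbols $N_{\text{pos}}$ and $N_{\text{neg}}$ (number of positive and negative training documents) are introduced here for illustration.

$$x \cdot r + b > 0 \;\Rightarrow\; \text{positive}, \qquad r = \log \frac{p / \lVert p \rVert_1}{q / \lVert q \rVert_1}, \qquad b = \log \frac{N_{\text{pos}}}{N_{\text{neg}}}$$

The threshold is 0 because the score is the log of a ratio of probabilities: the ratio exceeds 1 exactly when its log exceeds 0.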

In [121]:
val_y['label'][0]

'positive'

In [126]:
pre_preds = val_term_doc @ r.T + b # one log-odds score per validation document
preds = np.asarray(pre_preds).ravel() > 0 # True = predicted 'positive'; the threshold is 0 (not 1) because we are in log-space
(preds == (val_y['label'].values == 'positive')).mean() # accuracy: share of validation documents classified correctly
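
The scoring and accuracy step can be checked by hand on a tiny example. This is a sketch with made-up numbers: the ratios, the bias and the validation counts below are hypothetical and not derived from the IMDB sample. The point it illustrates is that predictions and labels must share the same boolean encoding before they are compared.

```python
import numpy as np

# Hypothetical log-count ratios for 3 words and a zero bias (balanced classes).
toy_r = np.log(np.array([4.0, 0.2, 1.0]))
toy_b = 0.0

# Hypothetical validation counts (rows are documents) and their true labels.
toy_val_x = np.array([[1, 0, 2],
                      [0, 2, 1],
                      [1, 1, 0]])
toy_val_labels = np.array(['positive', 'negative', 'positive'])

scores = toy_val_x @ toy_r + toy_b          # one log-odds score per document
toy_preds = scores > 0                      # True means "predicted positive"
toy_truth = (toy_val_labels == 'positive')  # same boolean encoding as toy_preds
accuracy = (toy_preds == toy_truth).mean()  # fraction of matching entries
print(toy_preds, accuracy)
```

Here the third document is misclassified, because its one negative word outweighs its one positive word, so two of the three predictions match.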