<a href="https://colab.research.google.com/github/CALDISS-AAU/sdsphd19_coursematerials/blob/master/notebooks/SDS_PhD_2019_SML_NLP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Intro To NLP vs. Supervised ML

Roman Jurowetzki, Aalborg University. In part based on the intro from the DeepNLP course by Dan Anastasyev: https://github.com/DanAnastasyev/DeepNLP-Course

![alt text](https://media.giphy.com/media/7jNeb9CVSgyUE/giphy.gif)

In [1]:
# Some initial downloads and installs

!wget -O imdb.zip -qq "http://sds-datacrunch.aau.dk/public/imdb.zip"
!unzip imdb.zip


!pip -q install eli5

Archive:  imdb.zip
  inflating: test.tsv                
  inflating: train.tsv

In this tutorial we will use the well-known IMDB movie review dataset for simple classification with different vectorization techniques:


*   Simple bag-of-words
*   TF-IDF
*   LSI / SVD
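
To make the first item concrete, here is a minimal pure-Python sketch of a bag-of-words representation. The two toy sentences and the helper name `bag_of_words` are made up for illustration; the notebook itself relies on gensim's `Dictionary.doc2bow` for this step.

```python
from collections import Counter

# Toy corpus (hypothetical, not taken from the IMDB data)
docs = [
    "great movie great cast",
    "terrible movie",
]

# Build a vocabulary: every distinct token gets an integer id
vocab = {}
for doc in docs:
    for tok in doc.split():
        vocab.setdefault(tok, len(vocab))

def bag_of_words(text):
    """Return sorted (token_id, count) pairs, ignoring unknown tokens."""
    counts = Counter(vocab[tok] for tok in text.split() if tok in vocab)
    return sorted(counts.items())

for doc in docs:
    print(bag_of_words(doc))
```

Note that word order is discarded: only which tokens occur, and how often, survives in the vector.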


We will also look at some more recent approaches to model explainability, i.e. "Why did the model decide this or that?"


Finally, we will look at a simple approach to building a **semantic search** based on vector-similarity.


In [2]:
!head train.tsv

is_positive	review
0	"Dreamgirls, despite its fistful of Tony wins in an incredibly weak year on Broadway, has never been what one would call a jewel in the crown of stage musicals. However, that is not to say that in the right cinematic hands it could not be fleshed out and polished into something worthwhile on-screen. Unfortunately, what transfers to the screen is basically a slavishly faithful version of the stage hit with all of its inherent weaknesses intact. First, the score has never been one of the strong points of this production and the film does not change that factor. There are lots of songs (perhaps too many?), but few of them are especially memorable. The closest any come to catchy tunes are the title song and One Night Only - the much acclaimed And I Am Telling You That I Am Not Going is less a great song than it is a dramatic set piece for the character of Effie (Jennifer Hudson). The film is slick and technically well-produced, but the story and characters are surpris

In [3]:
# Read in the files and quickly print the size of the training and test set.

import pandas as pd
import numpy as np

train_df = pd.read_csv("train.tsv", sep="\t")
test_df = pd.read_csv("test.tsv", sep="\t")

print(f'Train size = {len(train_df)}')
print(f'Test size = {len(test_df)}')

Train size = 25000
Test size = 25000


In [5]:
# some basic text cleaning, removing HTML fragments (only a problem here)

import re

pattern = re.compile('<br /><br />')

print(train_df['review'].iloc[3])
print(pattern.subn(' ', train_df['review'].iloc[3])[0])

Spoilers ahead if you want to call them that...<br /><br />I would almost recommend this film just so people can truly see a 1/10. Where to begin, we'll start from the top...<br /><br />THE STORY: Don't believe the premise - the movie has nothing to do with abandoned cars, and people finially understanding what the mysterious happenings are. It's a draub, basic, go to cabin movie with no intensity or "effort".<br /><br />THE SCREENPLAY: I usually give credit to indie screenwriters, it's hard work when you are starting out...but this is crap. The story is flat - it leaves you emotionless the entire movie. The dialogue is extremely weak and predictable boasting lines of "Woah, you totally freaked me out" and "I was wondering if you'd uh...if you'd like to..uh, would you come to the cabin with me?". It makes me want to rip out all my hair, one strand at a time and feed it to myself.<br /><br />THE CHARACTERS: HOLY CRAP!!!! Some have described the characters as flat, I want to take it one 

Regular expressions are a thing...

Check out this [Cheatsheet](https://www.dataquest.io/wp-content/uploads/2019/03/python-regular-expressions-cheat-sheet.pdf)
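
As a small illustration of what a slightly more general pattern could look like, the sketch below matches any run of `<br>`-style tags rather than only the exact doubled `<br /><br />` used above. The pattern and the helper name `strip_breaks` are assumptions for illustration, not what this notebook applies.

```python
import re

# One or more <br>, <br/> or <br /> tags in a row, case-insensitive
br_pattern = re.compile(r'(?:<br\s*/?>\s*)+', flags=re.IGNORECASE)

def strip_breaks(text):
    """Replace each run of line-break tags with a single space."""
    return br_pattern.sub(' ', text)

print(strip_breaks("Spoilers ahead...<br /><br />I would almost recommend this film."))
```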

In [0]:
# application of the cleaning pattern to everything

train_df['review'] = train_df['review'].apply(lambda text: pattern.subn(' ', text)[0])
test_df['review'] = test_df['review'].apply(lambda text: pattern.subn(' ', text)[0])

## Let's Vectorize using NLTK & Gensim

Keeping things simple and fast... (spaCy is not too fast)


In [7]:
import nltk
nltk.download('punkt')
nltk.download('stopwords')
from nltk.tokenize import word_tokenize

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [9]:
!pip install -qq -U gensim

In [0]:
# Import the dictionary builder
from gensim.corpora.dictionary import Dictionary

In [0]:
# Import the TfidfModel from Gensim
from gensim.models.tfidfmodel import TfidfModel

In [0]:
# Just like before, we import the model
from gensim.models.lsimodel import LsiModel

In [0]:
# Import stopwords

from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))

In [0]:
# Tokenize our texts and remove stopwords, also kick out numbers, lower everything
# (stop words are checked before lowercasing, so capitalized ones like "I" or "This" are kept)

train_tokens = train_df['review'].map(lambda t: [tok.lower() for tok in word_tokenize(t) if tok not in stop_words and tok.isalpha()])

In [16]:
# This is an example of the above in super verbose

for review in train_df['review'][:5]:
  toks = word_tokenize(review)                                       # split into tokens
  toks = [token for token in toks if token not in stop_words]       # drop stop words
  toks_final = [token.lower() for token in toks if token.isalpha()]  # keep alphabetic tokens, lowercased


0    [dreamgirls, despite, fistful, tony, wins, inc...
1    [this, show, comes, interesting, locations, fa...
2    [i, simply, love, movie, i, also, love, ramone...
3    [spoilers, ahead, want, call, i, would, almost...
4    [my, favorite, movie, i, seen, many, movies, o...
Name: review, dtype: object

In [0]:
test_tokens = test_df['review'].map(lambda t: [tok.lower() for tok in word_tokenize(t) if tok not in stop_words and tok.isalpha()])

In [0]:
# Generate a dictionary
dictionary = Dictionary(train_tokens)

In [0]:
# Drop tokens that appear in fewer than 10 documents or in more than 40% of all documents
dictionary.filter_extremes(no_below = 10, no_above=0.4)
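
The effect of the two thresholds can be sketched in pure Python on a toy corpus. This is an illustrative re-implementation under the assumption that `no_below` is an absolute document count and `no_above` a fraction of the corpus; the helper name `filter_by_doc_freq` and the toy documents are hypothetical, and the sketch ignores gensim's additional cap on vocabulary size.

```python
from collections import Counter

def filter_by_doc_freq(tokenized_docs, no_below, no_above):
    """Keep tokens found in at least `no_below` documents and in at most
    a fraction `no_above` of all documents."""
    num_docs = len(tokenized_docs)
    # Document frequency: count each token once per document
    doc_freq = Counter(tok for doc in tokenized_docs for tok in set(doc))
    return {
        tok for tok, df in doc_freq.items()
        if df >= no_below and df <= no_above * num_docs
    }

# Toy corpus (hypothetical): "movie" is in every document, "plot" in only one
toy_docs = [
    ["movie", "great", "cast"],
    ["movie", "great", "plot"],
    ["movie", "awful", "cast"],
    ["movie", "awful", "great"],
]
kept = filter_by_doc_freq(toy_docs, no_below=2, no_above=0.75)
print(sorted(kept))
```

Very rare tokens carry too little evidence to learn from, and tokens present almost everywhere do not help to tell documents apart, which is why both ends are trimmed.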

In [0]:
# construct corpus using this dictionary
train_corpus = [dictionary.doc2bow(doc) for doc in train_tokens]
test_corpus = [dictionary.doc2bow(doc) for doc in test_tokens]

In [0]:
# Tooling to map between corpus (gensim) and matrix - more general
from gensim.matutils import corpus2csc, corpus2dense

In [0]:

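# Pass num_terms so that both matrices get one row per dictionary entry,
# even if the highest token ids never occur in one of the corpora
X_train = corpus2csc(train_corpus, num_terms=len(dictionary))
X_test = corpus2csc(test_corpus, num_terms=len(dictionary))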

In [22]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()

# gensim matrices are terms x documents; scikit-learn expects documents x features, hence .T
model.fit(X_train.T, train_df.is_positive)

model.score(X_test.T, test_df.is_positive)



0.86032

In [0]:
# Create and fit a new TfidfModel using the corpus: tfidf
tfidf = TfidfModel(train_corpus)

In [0]:
train_corpus_tfidf = tfidf[train_corpus]
test_corpus_tfidf = tfidf[test_corpus]
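
What the TF-IDF transformation does to the raw counts can be sketched in pure Python. The sketch assumes gensim's default settings (raw term frequency, base-2 logarithmic inverse document frequency, vectors scaled to unit length); the helper name `tfidf_vectors` and the toy documents are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(tokenized_docs):
    """Weight each token by tf * log2(N / df), then L2-normalize per document."""
    num_docs = len(tokenized_docs)
    doc_freq = Counter(tok for doc in tokenized_docs for tok in set(doc))
    vectors = []
    for doc in tokenized_docs:
        weights = {
            tok: tf * math.log2(num_docs / doc_freq[tok])
            for tok, tf in Counter(doc).items()
        }
        norm = math.sqrt(sum(w * w for w in weights.values()))
        # Tokens occurring in every document get weight 0 and are dropped
        vectors.append({t: w / norm for t, w in weights.items() if w > 0} if norm else {})
    return vectors

toy_docs = [
    ["movie", "great", "great"],
    ["movie", "awful"],
]
for vec in tfidf_vectors(toy_docs):
    print(vec)
```

In the toy corpus "movie" appears in both documents, so it ends up with zero weight, while the words that distinguish the two documents keep all of it.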

In [0]:
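# Again fix the number of rows to the dictionary size so train and test shapes match
X_train = corpus2csc(train_corpus_tfidf, num_terms=len(dictionary))
X_test = corpus2csc(test_corpus_tfidf, num_terms=len(dictionary))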

In [26]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()

model.fit(X_train.T, train_df.is_positive)

model.score(X_test.T, test_df.is_positive)



0.88616

In [0]:
# Training the LSI model
model_lsi = LsiModel(train_corpus_tfidf, num_topics = 300, id2word=dictionary)

In [0]:
# Generating the corpus train & test

train_corpus_lsi = model_lsi[train_corpus_tfidf]
test_corpus_lsi = model_lsi[test_corpus_tfidf]

In [0]:
# turn into matrix
train_lsi_corpus = corpus2dense(train_corpus_lsi, 300 )

test_lsi_corpus = corpus2dense(test_corpus_lsi, 300)

In [32]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()

model.fit(train_lsi_corpus.T, train_df.is_positive)

model.score(test_lsi_corpus.T, test_df.is_positive)



0.87432

In [0]:
# Load the MatrixSimilarity
from gensim.similarities import MatrixSimilarity

# Build a cosine-similarity index over the LSI vectors of the training documents
document_topic_matrix_train = MatrixSimilarity(train_corpus_lsi)

# .index holds the underlying document-topic matrix (one row per document);
# querying the index with document vectors yields document-document similarities
# (which you could import as a network...)
document_topic_matrix_train_ix = document_topic_matrix_train.index

# Same for test-set
document_topic_matrix_test = MatrixSimilarity(test_corpus_lsi)
document_topic_matrix_test_ix = document_topic_matrix_test.index

In [36]:
# Prepare the query

doc = "absolutely horrible romantic comedy"


vec_bow = dictionary.doc2bow(doc.lower().split()) # convert to bag of words
vec_tfidf = tfidf[vec_bow] # convert to tfidf
vec_lsi = model_lsi[vec_tfidf]  # convert the query to LSI space

print(len(vec_lsi))
print(vec_lsi[:10])

300
[(0, 0.05806680646261308), (1, -0.05000759324915752), (2, -0.018319390513693176), (3, 0.04734079916905681), (4, -0.04012512815492944), (5, -0.09784627403109454), (6, 0.07885579774097201), (7, -0.010419194212512648), (8, 0.04117402645273893), (9, -0.03291463655085148)]


In [37]:
sims = document_topic_matrix_train[vec_lsi]

sims = sorted(enumerate(sims), key=lambda item: -item[1])
for i, s in sims[:10]: #ten most similar texts
    print(s, train_df['review'][i])

0.5732981 Does this film suck!! Horrible acting, horrible script, horrible effects, horrible horrible horrible!! Nothing redeeming here for even the most die-hard of horror fans! A crazy killer stalks students at a college. People are showing up dead in the hallways, but still, class carries on as normal??? After about the 4th body, I would think that they could allow the students a few days break! LOL. This about as bad as it gets folks. This film should be shown as a means of torture to criminals. You have been warned!
0.49651295 A patient escapes from a mental hospital, killing one of his keepers and then a University professor after he makes his way to the local college. Next semester, the late prof's replacement and a new group of students have to deal with a new batch of killings. The dialogue is so clichéd it is hard to believe that I was able to predict lines in quotes. This is one of those cheap movies that was thrown together in the middle of the slasher era of the '80's. Des
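
The ranking above rests on cosine similarity between the query vector and every document vector in LSI space. A minimal pure-Python sketch of that idea follows; the helper names `cosine` and `rank_by_similarity` and the two-dimensional toy vectors are hypothetical, and the real index works on the 300-dimensional LSI vectors.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def rank_by_similarity(query_vec, doc_vecs, top_n=3):
    """Return (doc_index, similarity) pairs, most similar first."""
    scored = [(i, cosine(query_vec, vec)) for i, vec in enumerate(doc_vecs)]
    return sorted(scored, key=lambda item: -item[1])[:top_n]

# Toy 2-dimensional "topic" vectors (hypothetical values)
doc_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query_vec = [2.0, 0.0]
print(rank_by_similarity(query_vec, doc_vecs))
```

Because cosine similarity only compares directions, the length of the query (here twice as long as the first document vector) does not affect the ranking.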

## Your Turn

![alt text](https://media.giphy.com/media/eJF3Yaqc70eAUaYtnZ/giphy.gif)


The site https://faketrump.ai/ is an interesting example of AI-powered fake-text generation. They write:

We built an artificial intelligence model by fine-tuning GPT-2 to generate tweets in the style of Donald Trump’s Twitter account. After seeing the results, we also built a discriminator that can accurately detect fake tweets 77% of the time — think you can beat our classifier? Try it yourself!

GPT-2 is a transformer-based neural language model that OpenAI announced in February 2019. It sparked considerable discussion because OpenAI decided, in contrast to its earlier policy, not to release the model to the public. Their central argument was that the model could too easily be used to produce fake news, spam and the like. The footnote of the faketrump page reads: “Generating realistic fake text has become much more accessible. We hope to highlight the current state of text generation to demonstrate how difficult it is to discern fiction from reality.”

Since then several organizations and researchers have shown that it is possible to develop systems to detect “fake text”. We believe that you too can implement a competitive system.

This assignment is not about Natural Language Processing (NLP) but about being able to deal with sequential data using deep learning. Some basic knowledge from M2 can be useful to squeeze out the last 1% of performance, but you should be able to get great results with pure Keras. The data can be found at https://raw.githubusercontent.com/DeepLearnI/trump_tweet_classifier/master/code/tweet_labels.csv and has the following format:

| tweet | labels |
|--------|---------|
| string | boolean |

There are 8000 real Trump tweets and 7348 fake ones.
