# Exercise: NLP pipeline
You are already familiar with building predictive models on tabular data. With tabular data, you have a feature matrix X and a target vector Y. Given these data structures, you can apply learning algorithms such as neural networks or random forests to learn the relationship between X and Y. In this exercise, you are given a data set of movie reviews, and your goal is to build a classifier that predicts whether a review is positive or negative (this task is called sentiment classification). Hence, you have a prediction problem with a binary target Y, which is nothing new for you. What is new in this exercise is that you need to deal with text data instead of tabular data. Text data must first be processed to obtain the required feature matrix X. This processing is what we call the "NLP pipeline".

In this exercise, you will set up an NLP pipeline. Each sample in the data set contains a review (a single string). To obtain a feature matrix, each sample's string needs to be transformed into a feature vector $x$. This process is called vectorization. There are multiple possible vectorization procedures. Today, you will implement a bag-of-words model for feature extraction. This feature extraction process involves two steps:
1. Vocabulary building
 * Tokenization: Transforming a review, which is a single string at the beginning, into a vector of strings (tokens).
 * Cleaning and compressing techniques: Reducing the number of distinct tokens. E.g. correcting misspellings or converting all letters to lower case prevents the same word from appearing in multiple spellings. Additionally, similar words (e.g. different forms of a verb) can be merged into a single token.
 * Building a bag-of-words: a vector whose length equals the number of distinct tokens. Each token is assigned a position within the vector.
 
2. Feature creation based on term frequency: Each review is transformed into a feature vector $x$. The length of the feature vector equals the length of the bag-of-words vector created in step 1. An element $x_{j}$ of the feature vector is computed with a frequency measure that captures how often token $j$ from the bag-of-words vector occurs in the review.
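
The two steps above can be illustrated with a minimal pure-Python sketch. It uses a naive whitespace split as a stand-in for a real tokenizer and raw counts as the frequency measure. The toy reviews and all variable names are made up for illustration and are not part of the exercise data.

```python
from collections import Counter

# Toy reviews (made up for illustration)
reviews = ["a good movie", "a bad movie", "good good acting"]

# Step 1: tokenization (naive whitespace split as a stand-in for a real tokenizer)
tokenized = [review.split() for review in reviews]

# Step 1: build the bag-of-words by assigning each distinct token a position
distinct_tokens = sorted({token for doc in tokenized for token in doc})
vocabulary = {token: pos for pos, token in enumerate(distinct_tokens)}

# Step 2: one feature vector per review, here with raw counts as frequency measure
def to_feature_vector(tokens, vocabulary):
    counts = Counter(tokens)
    return [counts.get(token, 0) for token in vocabulary]

X = [to_feature_vector(doc, vocabulary) for doc in tokenized]
print(vocabulary)
print(X)
```

Each row of X has one entry per vocabulary token, so the number of columns grows with the number of distinct tokens. This is why the cleaning and compressing techniques matter.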

The first code cells import the required packages and load the review data set that you will use for the exercise. Then the exercise begins. In the exercise, you will build the simplest possible NLP pipeline: you go through steps 1 and 2 of the NLP pipeline, but you skip the "cleaning and compressing" part of step 1. This simple NLP pipeline gives you a feature matrix X (which is possibly not ideal). You will use this feature matrix to build and evaluate a predictive model.

In the tutorial, we will extend your NLP pipeline by including the cleaning and compressing techniques (these techniques are also covered in detail in the demo notebook "nlp_foundations.ipynb"). This will lead to another feature matrix X. We will build another predictive model on this new feature matrix X and compare its performance to the model built with the simplified NLP pipeline.

In [7]:
# required packages
import pandas as pd
import numpy as np
import nltk
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import RidgeClassifier
from sklearn.model_selection import train_test_split
from sklearn import metrics
import random

In [8]:
# Remember to adjust the path so that it matches your environment
df = pd.read_csv("IMDB-50K-Movie-Review.zip", sep=",", encoding="ISO-8859-1")
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   review     50000 non-null  object
 1   sentiment  50000 non-null  object
dtypes: object(2)
memory usage: 781.4+ KB


In [9]:
## get to know the data
print(df)
df.head()

                                                  review sentiment
0      One of the other reviewers has mentioned that ...  positive
1      A wonderful little production. <br /><br />The...  positive
2      I thought this was a wonderful way to spend ti...  positive
3      Basically there's a family where a little boy ...  negative
4      Petter Mattei's "Love in the Time of Money" is...  positive
...                                                  ...       ...
49995  I thought this movie did a down right good job...  positive
49996  Bad plot, bad dialogue, bad acting, idiotic di...  negative
49997  I am a Catholic taught in parochial elementary...  negative
49998  I'm going to have to disagree with the previou...  negative
49999  No one expects the Star Trek movies to be high...  negative

[50000 rows x 2 columns]


                                              review sentiment
0  One of the other reviewers has mentioned that ...  positive
1  A wonderful little production. <br /><br />The...  positive
2  I thought this was a wonderful way to spend ti...  positive
3  Basically there's a family where a little boy ...  negative
4  Petter Mattei's "Love in the Time of Money" is...  positive


In [10]:
# Only use the first observations to reduce run time.
# Note: .loc slicing includes the end label, so this keeps rows 0 to 10000 (10001 observations).
df = df.loc[0:10000,:]

df.reset_index(inplace=True, drop=True)  # drop=True discards the old index, so rows can no longer be traced back to the original data frame
df.sentiment.value_counts()

positive    5028
negative    4973
Name: sentiment, dtype: int64
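
The class counts above sum to 10001 rather than 10000 because label-based slicing with .loc includes the end label. A small sketch on a made-up data frame (the name toy is hypothetical) shows the difference to position-based slicing with .iloc:

```python
import pandas as pd

# Made-up data frame with the default integer index 0..4
toy = pd.DataFrame({"review": ["a", "b", "c", "d", "e"]})

by_label = toy.loc[0:3, :]      # label-based: rows 0, 1, 2 and 3
by_position = toy.iloc[0:3, :]  # position-based: rows 0, 1 and 2

print(len(by_label), len(by_position))
```

If exactly 10000 observations are wanted, position-based slicing would be one way to get them. The remaining cells work the same either way.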

In [11]:
# Map label
df['sentiment'] = df['sentiment'].map({'positive' : 1, 'negative': 0})
df.head()

                                              review  sentiment
0  One of the other reviewers has mentioned that ...          1
1  A wonderful little production. <br /><br />The...          1
2  I thought this was a wonderful way to spend ti...          1
3  Basically there's a family where a little boy ...          0
4  Petter Mattei's "Love in the Time of Money" is...          1


## Exercise (simple NLP pipeline):
You need to transform the text data contained in the column df["review"] so that it can serve as the feature matrix X needed for building a predictive model. In detail: <br>
a) Create a list "reviews_tokenized", where each element is a list of strings (tokens) representing one review. Use NLTK's word_tokenize() function.

b) Split the review data (reviews_tokenized) as well as the target df['sentiment'] into training and test sets. Use 80% of the data for training. Use sklearn's train_test_split() function.
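
As a hedged illustration of how train_test_split() handles a list of tokenized reviews together with a target, here is a minimal sketch on made-up data. The names toy_tokens and toy_labels are hypothetical, and the random_state and stratify arguments are optional additions that the exercise does not require.

```python
from sklearn.model_selection import train_test_split

# Made-up tokenized reviews and binary labels (10 samples)
toy_tokens = [["good", "movie"], ["bad", "movie"]] * 5
toy_labels = [1, 0] * 5

# test_size=0.2 keeps 80% of the samples for training.
# random_state makes the split reproducible (optional).
# stratify keeps the class balance similar in both parts (optional).
tok_train, tok_test, lab_train, lab_test = train_test_split(
    toy_tokens, toy_labels, test_size=0.2, random_state=42, stratify=toy_labels)

print(len(tok_train), len(tok_test))
```

Passing features and target in the same call keeps them aligned, so the i-th label still belongs to the i-th review after shuffling.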

c) Now, we need to set up a vocabulary for all tokens and apply this vocabulary to obtain feature vectors $x$. We do this using sklearn's TfidfVectorizer. We provide the code to set up the vectorizer below. You need to apply the vectorizer to the data.

In [7]:
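# Identity function: the reviews are already tokenized, so the vectorizer
# should neither preprocess nor tokenize them again.
def dummy_fun(doc):
    return doc

vectorizer = TfidfVectorizer(
    analyzer = 'word',
    tokenizer = dummy_fun,      # keep the given tokens as they are
    preprocessor = dummy_fun,   # skip lowercasing and other string preprocessing
    token_pattern = None)       # unused with a custom tokenizer; None avoids a warning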
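
One possible way to apply such a vectorizer is sketched below on made-up, already tokenized reviews. The names identity, toy_vectorizer, toy_train and toy_test are hypothetical. The key point is that vocabulary and weights are learned from the training data only and then reused for the test data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

def identity(doc):  # hypothetical name, same role as dummy_fun
    return doc

toy_vectorizer = TfidfVectorizer(analyzer="word", tokenizer=identity,
                                 preprocessor=identity, token_pattern=None)

# Made-up, already tokenized reviews
toy_train = [["good", "movie"], ["bad", "movie"], ["good", "acting"]]
toy_test = [["good", "unseen"]]

# fit_transform learns the vocabulary and idf weights from the training data
X_toy_train = toy_vectorizer.fit_transform(toy_train)
# transform reuses them; tokens missing from the vocabulary are ignored
X_toy_test = toy_vectorizer.transform(toy_test)

print(X_toy_train.shape, X_toy_test.shape)
```

Both matrices have the same number of columns because the test data is mapped onto the vocabulary learned from the training data. Fitting on the test data as well would leak information from the test set into the features.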

The TfidfVectorizer performs several steps at once. To better understand how it works, you should examine the results step by step. <br>
d) Examine the vocabulary it created: How many tokens does it include? Which tokens are included? Might it be better to leave some of these tokens out to reduce the dimension of the vocabulary and of the derived feature matrix?
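
A fitted TfidfVectorizer exposes its vocabulary through the vocabulary_ attribute, a dictionary that maps each token to its column index. The following sketch inspects it on two made-up token lists. All names are hypothetical, and the check for punctuation is only one of many possible ways to look for removal candidates.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

def identity(doc):  # hypothetical name, same role as dummy_fun
    return doc

toy_vectorizer = TfidfVectorizer(analyzer="word", tokenizer=identity,
                                 preprocessor=identity, token_pattern=None)
toy_vectorizer.fit([["Good", "movie", "!"],
                    ["good", "movie", ",", "the", "end", "."]])

# vocabulary_ maps each token to its column index in the feature matrix
vocab = toy_vectorizer.vocabulary_
print(len(vocab))
print(sorted(vocab))

# Tokens without any letter or digit (here: punctuation) are removal candidates
non_alnum = sorted(t for t in vocab if not any(ch.isalnum() for ch in t))
print(non_alnum)
```

In this toy vocabulary, "Good" and "good" occupy separate columns and punctuation marks count as tokens. Both inflate the dimension of the feature matrix.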

e) Let's recap how feature vectors are generated from this vocabulary. The basic idea of bag-of-words feature extraction is to create one column in the feature matrix X for each token in the vocabulary. For an observation $i$ (a single review), the entry $X_{i,j}$ is 1 if the review contains the token of column $j$ and 0 otherwise. There are several variations of this approach. The tf-idf approach, which we apply in this exercise, does not simply encode $X_{i,j}$ as 1 if review $i$ contains token $j$. Instead, it uses the frequency of token $j$ in review $i$ (term frequency), weighted by the inverse document frequency of token $j$, which shrinks as the number of reviews containing token $j$ grows. Tokens that appear in almost every review therefore get a lower weight than tokens that are specific to a few reviews. With its default settings, the TfidfVectorizer additionally scales each row of X to unit Euclidean length. Have a look at the matrix that the TfidfVectorizer created.
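
To make the weighting concrete, here is a minimal pure-Python sketch of a tf-idf computation on made-up token lists. It assumes the smoothed variant $\mathrm{idf}(j) = \ln\frac{1+n}{1+\mathrm{df}(j)} + 1$, where $n$ is the number of reviews and $\mathrm{df}(j)$ the number of reviews containing token $j$, followed by scaling each row to unit length. As far as documented, this corresponds to scikit-learn's default settings, but treat the sketch as an illustration rather than a reimplementation.

```python
import math
from collections import Counter

# Made-up, already tokenized reviews
docs = [["good", "movie"], ["bad", "movie"], ["good", "good", "acting"]]
vocab = sorted({token for doc in docs for token in doc})
n_docs = len(docs)

# Document frequency: number of reviews that contain the token at least once
doc_freq = {t: sum(t in doc for doc in docs) for t in vocab}
# Smoothed inverse document frequency (assumed variant, see text above)
idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}

def tfidf_row(doc):
    counts = Counter(doc)
    raw = [counts[t] * idf[t] for t in vocab]   # term frequency times idf
    norm = math.sqrt(sum(w * w for w in raw))   # Euclidean length of the row
    return [w / norm for w in raw]

X = [tfidf_row(doc) for doc in docs]
print(vocab)
for row in X:
    print([round(w, 3) for w in row])
```

In the second toy review, "bad" occurs in only one review while "movie" occurs in two, so "bad" receives the larger weight even though both tokens appear once in that review.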

f) Fit a ridge regression classifier and evaluate the accuracy of the predictions on the training and test set. We provide the code below.

In [None]:
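# reviews_tr / reviews_ts are assumed to be the tf-idf feature matrices from c),
# y_train / y_test the targets from b)
classifier = RidgeClassifier(random_state=42, alpha=0.8)
classifier.fit(reviews_tr, y_train)
pred_test = classifier.predict(reviews_ts)
pred_train = classifier.predict(reviews_tr)
print("train accuracy:", metrics.accuracy_score(y_train, pred_train))
print("test accuracy: ", metrics.accuracy_score(y_test, pred_test))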