# Data Preparation
In this exercise we will work with the IMDB sentiment dataset. It contains movie reviews, each labeled with a positive or negative sentiment (encoded as 1 for positive and 0 for negative). The labeled training and test data are provided on Moodle.

## Reading and preprocessing the data
To import the TSV file, we recommend the pandas package. The provided file can be imported as follows:

In [1]:
import numpy as np
import pandas as pd

# load data as pandas dataframe
train = pd.read_csv('labeledTrainData.tsv', 
                    header=0,
                    delimiter="\t", 
                    quoting=3 )
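
The argument `quoting=3` corresponds to `csv.QUOTE_NONE` in Python's standard library, so quote characters are treated as ordinary text. The following is a minimal sketch of that behaviour using the standard `csv` module instead of pandas; the two-line TSV snippet is made up for illustration and is not taken from the real file.

```python
import csv
import io

# Hypothetical TSV snippet (illustrative only, not from labeledTrainData.tsv)
raw = 'id\tsentiment\treview\n"r1"\t1\t"A "great" film"\n'

# csv.QUOTE_NONE has the integer value 3, which is what quoting=3 selects
quote_none_value = csv.QUOTE_NONE

# With QUOTE_NONE the reader does not interpret quote characters at all
rows = list(csv.reader(io.StringIO(raw), delimiter='\t', quoting=csv.QUOTE_NONE))
header, first = rows[0], rows[1]
print(header)
print(first)
```

Because the quotes are kept literally, nested quotation marks inside a review cannot confuse the parser.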


What data type is the variable `train`? Which values does it contain? Print some examples.

In [None]:
print('Data type:', type(train))
print()
print('Value 0,0:', train.values[0][0])
print('Value 0,1:', train.values[0][1])
print('Value 0,2:', train.values[0][2])


In [None]:
pd.options.display.max_colwidth = 100
train.head(10)

The review strings contain HTML tags, which have to be removed. To do this, use the bs4 package:

In [None]:
from bs4 import BeautifulSoup

example1 = BeautifulSoup(train['review'][0],'lxml').get_text()
print(example1)

The imported text still contains punctuation, numbers, and very common words. For now, we assume that these do not help with sentiment classification, so we want to remove them. Punctuation and numbers can be removed using the regular expressions (re) package:

In [None]:
import re
# Use regular expressions to do a find-and-replace
letters_only = re.sub(r'[^a-zA-Z]',  # The pattern to search for (any non-letter)
                      ' ',           # The replacement string (a single space)
                      example1)      # The text to search
print(letters_only)
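
To see what this substitution does to contractions and numbers, here is a small self-contained sketch on a made-up sentence (the sentence is an assumption for illustration, not part of the dataset).

```python
import re

# Hypothetical example sentence
text = "It's a 10/10 movie, really!"

# Every character that is not a letter is replaced by a single space
letters_only = re.sub(r'[^a-zA-Z]', ' ', text)

# split() collapses runs of spaces, leaving only the letter sequences
tokens = letters_only.split()
print(tokens)
```

Note that the apostrophe splits "It's" into "It" and "s", and the rating "10/10" disappears entirely, which is the price of this simple rule.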

It is also beneficial to convert all letters to lower case and to split the strings into individual words.

In [None]:
lower_case = letters_only.lower()        # Convert to lower case
print('Lower case version:')
print(lower_case)
words = lower_case.split()   # Split into words
print()
print('First Word:', words[0])

For now, we also want to remove common words that do not carry much meaning, such as 'a', 'is', or 'the'. These are often referred to as stop words. A list of stop words can be obtained with the NLTK package:

In [None]:
import nltk
nltk.download('stopwords')  # Download the stop word corpus (only needed once)
from nltk.corpus import stopwords  # Import the stop word list
stops = stopwords.words('english')
print(stops)
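
Stop word removal itself is a simple filter. The sketch below uses a tiny hand-picked stop word set (an assumption for illustration; it is not the NLTK list) so that it runs without any download.

```python
# Tiny hand-picked stop word set, for illustration only (not the NLTK list)
STOPS = {'a', 'is', 'the', 'this', 'of'}

def remove_stops(words, stops=STOPS):
    """Return the words that are not in the stop word set, keeping their order."""
    return [w for w in words if w not in stops]

words = 'this is the best movie of the year'.split()
filtered = remove_stops(words)
print(filtered)
```

Using a `set` rather than a list keeps each membership test fast, which matters when the filter runs over thousands of reviews.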

Write a function called `review_prepro` that takes a raw review string as input and returns a preprocessed review, i.e. a string with HTML tags and non-letter characters removed, all letters in lower case, and no stop words. Then apply this function to the entire training set and collect the results in the list `clean_train_reviews`, which contains all the cleaned reviews.

In [None]:
# function for preprocessing the data
stop_set = set(stops)  # set lookups are much faster than list lookups

def review_prepro(data, remove_stopwords=False):
    # remove HTML tags
    review_text = BeautifulSoup(data, 'lxml').get_text()
    # replace every non-letter character (digits, punctuation) with a space
    letters_only = re.sub(r'[^a-zA-Z]', ' ', review_text)
    # make all characters lower case and split the document into single words
    words = letters_only.lower().split()

    if remove_stopwords:
        # drop stop words
        words = [w for w in words if w not in stop_set]
    # join the remaining words into a single string
    return ' '.join(words)

# preprocess train data
num_reviews = train['review'].size

clean_train_reviews = []
for i in range(num_reviews):
    if (i+1)%1000 == 0:
        print('Review {} of {}\n'.format(i+1, num_reviews))
    clean_train_reviews.append( review_prepro(train['review'][i], remove_stopwords=True) )
clean_train_reviews

## Creating Features from a Bag of Words
To generate a bag-of-words model, we will use the scikit-learn package. Use the following code:

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

# define the vectorizer (keeps only the 50 most frequent words)
vectorizer = CountVectorizer(analyzer='word',
                             tokenizer=None,
                             preprocessor=None,
                             stop_words=None,
                             max_features=50)

# fit the vectorizer to the data
train_data_features = vectorizer.fit_transform(clean_train_reviews)
# convert to numpy array
train_data_features = train_data_features.toarray()


In [None]:
train_data_features

In [None]:
print('Number of entries:', train_data_features.size)
print('Shape (reviews, features):', train_data_features.shape)
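
To make the structure of the feature matrix concrete, the following pure-Python sketch builds a bag-of-words representation for a toy corpus. It only mimics the idea behind `CountVectorizer` with `max_features`; the helper names and the tie-breaking rule are assumptions of this sketch, not the library's implementation.

```python
from collections import Counter

# Toy corpus of already-cleaned reviews (illustrative only)
docs = ['good movie good plot', 'bad movie', 'good acting bad plot']

def fit_vocabulary(docs, max_features):
    """Keep the max_features most frequent words, returned in alphabetical order."""
    totals = Counter(w for d in docs for w in d.split())
    # Sort by descending count; ties are broken alphabetically (sketch assumption)
    top = sorted(totals, key=lambda w: (-totals[w], w))[:max_features]
    return sorted(top)

def count_vectors(docs, vocab):
    """Return one row per document with the count of each vocabulary word."""
    index = {w: j for j, w in enumerate(vocab)}
    rows = []
    for d in docs:
        row = [0] * len(vocab)
        for w in d.split():
            if w in index:
                row[index[w]] += 1
        rows.append(row)
    return rows

vocab = fit_vocabulary(docs, max_features=3)
X = count_vectors(docs, vocab)
print(vocab)
print(X)
```

Each row corresponds to one review and each column to one vocabulary word, so the matrix has shape (number of reviews, number of features).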

## Black box classifier
To do something meaningful with the generated features, we will use a prebuilt classifier: train it on the training data and then evaluate the learned classifier on the test data `labeledTestData.tsv`.

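First, preprocess the test data the same way as the training data and store the result in the variable `test_data_features`. (Hint: use `vectorizer.transform` rather than `fit_transform`, so that the test data is encoded with the vocabulary learned from the training data.)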

In [None]:
test = pd.read_csv('labeledTestData.tsv', 
                   header=0,
                   delimiter="\t",
                   quoting=3)

num_test_reviews = test['review'].size
clean_test_reviews = []
for i in range(num_test_reviews):
    # progress output is disabled here; uncomment to print every 1000 reviews
    # if (i+1) % 1000 == 0:
    #     print('Review {} of {}'.format(i+1, num_test_reviews))
    clean_test_reviews.append( review_prepro(test['review'][i], remove_stopwords=True) )

test_data_features = (vectorizer.transform(clean_test_reviews)).toarray()

In [None]:
test_data_features
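
The reason for transforming the test data with the already fitted vectorizer can be sketched in a few lines: the columns are fixed by the training vocabulary, and words that were never seen during training are simply ignored. The vocabulary and the test reviews below are hypothetical.

```python
# Hypothetical vocabulary, assumed to have been learned from the training data
train_vocab = ['bad', 'good', 'movie']

def transform(docs, vocab):
    """Count only the words that appear in the fixed vocabulary."""
    index = {w: j for j, w in enumerate(vocab)}
    rows = []
    for d in docs:
        row = [0] * len(vocab)
        for w in d.split():
            j = index.get(w)
            if j is not None:   # out-of-vocabulary words are skipped
                row[j] += 1
        rows.append(row)
    return rows

test_docs = ['good film', 'terrible movie movie']
X_test = transform(test_docs, train_vocab)
print(X_test)
```

The test matrix therefore has exactly as many columns as the training matrix, which is what the classifier expects.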

To train a logistic regression classifier, use the following code:

In [None]:
from sklearn.linear_model import LogisticRegression as LR

model = LR()
# fit() learns the logistic regression weights that relate the bag-of-words
# features to the sentiment labels (0 or 1)
model.fit(train_data_features, train['sentiment'])
# predict_proba() applies the fitted model to the test features;
# column 1 holds the predicted probability of the positive class
p = model.predict_proba(test_data_features)[:, 1]
output = pd.DataFrame(data={'id': test['id'], 'sentiment': p})
output
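
The probability returned for the positive class comes from passing a weighted sum of the features through the logistic (sigmoid) function. The sketch below illustrates this with made-up weights; the numbers are assumptions for illustration, not weights learned from the IMDB data.

```python
import math

def predict_positive_proba(x, weights, bias):
    """Logistic regression: sigmoid of the weighted feature sum."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical weights for the features ['bad', 'good', 'movie']
weights = [-1.2, 0.8, 0.1]
bias = 0.0

p_good = predict_positive_proba([0, 2, 1], weights, bias)     # "good movie good"
p_bad = predict_positive_proba([1, 0, 1], weights, bias)      # "bad movie"
p_neutral = predict_positive_proba([0, 0, 0], weights, bias)  # no known words
print(p_good, p_bad, p_neutral)
```

A review without any known words receives a probability of 0.5 when the bias is zero, while words with positive or negative weights push the probability towards 1 or 0.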

## Evaluate result
We will use the area under the ROC curve (AUC), i.e. the curve of true positive rate (TPR) against false positive rate (FPR), to measure performance. An AUC score of 0.5 corresponds to a random classifier; the closer the score is to 1, the better.

In [None]:
from sklearn.metrics import roc_auc_score as AUC

auc = AUC( test['sentiment'].values, p )
print('AUC score:', auc)
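
The AUC can also be read as the probability that a randomly chosen positive review receives a higher score than a randomly chosen negative one. The sketch below computes it this way by comparing all positive/negative pairs; the function name and the toy scores are assumptions for illustration, and the quadratic loop is only meant for small examples.

```python
def auc_pairwise(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for sp in pos:
        for sn in neg:
            if sp > sn:
                wins += 1.0   # positive ranked above negative
            elif sp == sn:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))

# Toy labels and predicted probabilities (illustrative only)
labels = [1, 0, 1, 0]
scores = [0.9, 0.4, 0.35, 0.1]
auc_value = auc_pairwise(labels, scores)

# A classifier that gives every review the same score is no better than chance
auc_constant = auc_pairwise(labels, [0.5, 0.5, 0.5, 0.5])
print(auc_value, auc_constant)
```

Three of the four positive/negative pairs are ordered correctly here, so the score is 0.75, whereas constant scores yield 0.5.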

## More sophisticated methods
Use a prebuilt TF-IDF vectorizer and experiment with its settings, such as stop words and n-grams, to see how they affect the performance of an LR classifier.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer_tf = TfidfVectorizer( max_features = 5000, 
                             ngram_range = ( 1, 1 ), 
                             sublinear_tf = True )

# fit the vectorizer to the data
train_data_features_tf = vectorizer_tf.fit_transform(clean_train_reviews)
# convert to numpy array
train_data_features_tf = train_data_features_tf.toarray()
test_data_features_tf = (vectorizer_tf.transform(clean_test_reviews)).toarray()


model_tf = LR()
model_tf.fit( train_data_features_tf, train['sentiment'] )
p_tf = model_tf.predict_proba( test_data_features_tf )[:,1] 
output_tf = pd.DataFrame( data={'id':test['id'], 'sentiment':p_tf} )

auc_tf = AUC( test['sentiment'].values, p_tf )
print('AUC score (TF-IDF):', auc_tf)
output_tf
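
To see what TF-IDF weighting changes compared with raw counts, the following pure-Python sketch computes the weights for a toy corpus. It assumes the smoothed IDF, `idf = ln((1 + n) / (1 + df)) + 1`, the sublinear term frequency `1 + ln(tf)`, and L2 row normalisation, as described in the scikit-learn documentation for the settings used above. It is an illustration of the formula, not the library's implementation, and the function name is hypothetical.

```python
import math
from collections import Counter

def tfidf(docs, sublinear_tf=True):
    """Return (vocabulary, rows) with L2-normalised TF-IDF weights."""
    tokenized = [d.split() for d in docs]
    vocab = sorted({w for doc in tokenized for w in doc})
    n = len(docs)
    # Document frequency: in how many documents does each word occur?
    df = Counter(w for doc in tokenized for w in set(doc))
    idf = {w: math.log((1 + n) / (1 + df[w])) + 1 for w in vocab}
    rows = []
    for doc in tokenized:
        counts = Counter(doc)
        row = []
        for w in vocab:
            tf = counts[w]
            if tf > 0 and sublinear_tf:
                tf = 1 + math.log(tf)   # dampen repeated words
            row.append(tf * idf[w])
        norm = math.sqrt(sum(v * v for v in row)) or 1.0
        rows.append([v / norm for v in row])
    return vocab, rows

# Toy corpus (illustrative only)
docs = ['good movie', 'bad movie', 'good good plot']
vocab_tf, X_tf = tfidf(docs)
print(vocab_tf)
print(X_tf[1])
```

In the second toy review, the rarer word "bad" receives a larger weight than the more widespread word "movie", which is the effect TF-IDF is designed to produce.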