In this activity, we will create another text classifier. Instead of training a machine learning model to discriminate between clickbait headlines and normal headlines, we will train a similar classifier to discriminate between positive and negative movie reviews.

The objectives for our activity are:
* Vectorize the text of IMDB movie reviews and label these as Positive or Negative
* Train an SVM classifier to predict whether a movie review is Positive or Negative
* Check how accurate our classifier is on a held-out test set
* Evaluate our classifier on out-of-context data

NOTE: Because we use some randomizers in this activity, it is helpful to set the global random seeds so that the results you see match the examples. Sklearn uses the NumPy random seed, and we will also use the `shuffle` function from the built-in `random` library. To see the same results, add the following code above your main code.


```
import random
import numpy as np
random.seed(1337)
np.random.seed(1337)
```

We'll use the aclImdb dataset of 50k labeled movie reviews from IMDB, 25k each for training and testing. Each split has 12.5k positive reviews and 12.5k negative ones, so this is a larger dataset than our headlines one. The dataset is available in the `aclImdb` folder inside the Datasets folder.

In exercise 1, we had one file, with each line representing a different data item. Now each data item is a bit longer and in its own file, so keep in mind that we'll need to restructure some of our training code accordingly.



In [1]:
import random
import numpy as np
random.seed(1337)
np.random.seed(1337)

1. Import the `os` library and the `random` library, and define where our training and test data is stored using four variables: one for the positive training data, one for the negative training data, one for the positive test data and one for the negative test data, each pointing at the respective dataset subdirectory.

In [2]:
import os
import random

dataset_train_pos_path = "../../Datasets/aclImdb/train/pos/"
dataset_train_neg_path = "../../Datasets/aclImdb/train/neg/"

dataset_test_pos_path = "../../Datasets/aclImdb/test/pos/"
dataset_test_neg_path = "../../Datasets/aclImdb/test/neg/"

2. Define a `read_dataset` function that takes a path to a dataset and a label (either "pos" or "neg"), reads the contents of each file in the given directory, and adds these contents to a list of tuples, where each tuple contains the text of the file and the label 'pos' or 'neg'. An example is below. The actual data should be read off disk instead of defined in code.

```
contents_labels = [('this is the text from one of the files', 'pos'), ('this is another text', 'pos')]
```

In [3]:
def read_dataset(dataset_path, label):
    contents_labels = []
    files = os.listdir(dataset_path)
    for fn in files:
        path = os.path.join(dataset_path, fn)
        # Read as UTF-8 explicitly so the result does not depend on the platform default
        with open(path, encoding="utf-8") as f:
            s = f.read()
            contents_labels.append((s, label))
    return contents_labels
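
To check a function like this without touching the real dataset, one option is to point it at a temporary directory holding a couple of small files. The sketch below is illustrative only: the file names and review texts are made up, and this variant sorts the directory listing (an assumption, not part of the exercise) so the output order is predictable.

```python
import os
import tempfile


def read_dataset(dataset_path, label):
    """Return a list of (text, label) tuples, one per file in dataset_path."""
    contents_labels = []
    # sorted() is an addition for this sketch: os.listdir gives no order guarantee
    for fn in sorted(os.listdir(dataset_path)):
        with open(os.path.join(dataset_path, fn), encoding="utf-8") as f:
            contents_labels.append((f.read(), label))
    return contents_labels


# Build a tiny stand-in for one dataset directory (hypothetical file names and texts)
with tempfile.TemporaryDirectory() as tmp_dir:
    for name, text in [("0_9.txt", "A wonderful film."), ("1_8.txt", "Great acting.")]:
        with open(os.path.join(tmp_dir, name), "w", encoding="utf-8") as f:
            f.write(text)
    sample = read_dataset(tmp_dir, "pos")

print(sample)  # [('A wonderful film.', 'pos'), ('Great acting.', 'pos')]
```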

3. Use the function you defined above to read each dataset into its own variable. You should have four variables in total: `train_pos`, `train_neg`, `test_pos`, and `test_neg`, each of which is a list of tuples containing the respective text and labels.

In [4]:
train_pos = read_dataset(dataset_train_pos_path, "pos")
train_neg = read_dataset(dataset_train_neg_path, "neg")

test_pos = read_dataset(dataset_test_pos_path, "pos")
test_neg = read_dataset(dataset_test_neg_path, "neg")

4. Combine the train_pos and train_neg datasets. Do the same for the test_pos and test_neg ones.

In [5]:
train = train_pos + train_neg
test = test_pos + test_neg

5. Use the `random.shuffle` function to shuffle the train and test datasets separately. This gives us datasets where the training data is mixed up, instead of feeding all the positive and then all the negative examples to the classifier in order.

In [6]:
random.shuffle(train)
random.shuffle(test)
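
As a small illustration of why the seed matters here, the sketch below shuffles a made-up list of labelled texts twice with the same seed. Note that `random.shuffle` reorders the list in place and returns `None`, which is why the result is not assigned to a new variable.

```python
import random

# Made-up examples standing in for (text, label) tuples
pairs = [("great film", "pos"), ("loved it", "pos"), ("dull plot", "neg"), ("awful", "neg")]

random.seed(1337)
first = pairs.copy()
random.shuffle(first)  # shuffles in place, returns None

random.seed(1337)
second = pairs.copy()
random.shuffle(second)

print(first == second)                 # True: same seed, same order
print(sorted(first) == sorted(pairs))  # True: nothing added or lost
```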

6. Split each of the train and test datasets back into data and labels respectively. You should have four variables again called `train_data`, `y_train`, `test_data`, and `y_test`, where the `y` prefix indicates that the respective variable contains labels.

In [7]:
train_data, y_train = zip(*train)
test_data, y_test = zip(*test)
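
The `zip(*...)` idiom can look cryptic at first. The `*` unpacks the list so that each tuple becomes a separate argument to `zip`, which then groups the first elements together and the second elements together. A minimal sketch on made-up data:

```python
# Made-up (text, label) tuples
pairs = [("great film", "pos"), ("dull plot", "neg"), ("loved it", "pos")]

# Equivalent to zip(pairs[0], pairs[1], pairs[2])
texts, labels = zip(*pairs)

print(texts)   # ('great film', 'dull plot', 'loved it')
print(labels)  # ('pos', 'neg', 'pos')
```

Both results are tuples rather than lists, which sklearn accepts as input.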

7. Import `TfidfVectorizer` from sklearn, initialise an instance of it, fit the vectorizer on the training data, and vectorize both the training and testing data into `X_train` and `X_test` variables respectively. Time how long this takes and print out the shape of the training vectors at the end.

In [8]:
%%time
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_data)
X_test = vectorizer.transform(test_data)
print("The dimensions of our vectors:")
print(X_train.shape)
print("- - -")


The dimensions of our vectors:
(25000, 74849)
- - -
CPU times: user 12.5 s, sys: 441 ms, total: 12.9 s
Wall time: 13.3 s
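
To see what the two calls do on a smaller scale, here is a sketch with a made-up toy corpus. `fit_transform` learns the vocabulary from the training texts and vectorizes them, while `transform` reuses that vocabulary, so both matrices end up with the same number of columns. With the default settings, single-character tokens such as "a" are dropped.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up toy data, not the IMDB reviews
toy_train = ["a great movie", "a terrible movie", "great acting"]
toy_test = ["great movie"]

toy_vectorizer = TfidfVectorizer()
toy_X_train = toy_vectorizer.fit_transform(toy_train)  # learn vocabulary, then vectorize
toy_X_test = toy_vectorizer.transform(toy_test)        # reuse the learned vocabulary

print(sorted(toy_vectorizer.vocabulary_))  # ['acting', 'great', 'movie', 'terrible']
print(toy_X_train.shape)                   # (3, 4)
print(toy_X_test.shape)                    # (1, 4)
```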


8. Again timing your execution, import `LinearSVC` from `sklearn` and initialise an instance of it. Fit the SVM on the training data and training labels, and then generate predictions on the test data (`X_test`).

In [9]:
%%time

from sklearn.svm import LinearSVC

svm_classifier = LinearSVC()
svm_classifier.fit(X_train, y_train)

predictions = svm_classifier.predict(X_test)

CPU times: user 750 ms, sys: 75.1 ms, total: 825 ms
Wall time: 923 ms
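
As an aside, the vectorizing and fitting steps can also be chained with sklearn's `make_pipeline`, which applies the vectorizer and then the classifier in a single `fit` or `predict` call. This is an alternative arrangement, not what the exercise asks for, and the toy texts below are made up; `random_state=0` is set only to keep the sketch repeatable.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Made-up toy data, not the IMDB reviews
toy_texts = ["great wonderful film", "wonderful acting great",
             "awful boring film", "boring awful plot"]
toy_labels = ["pos", "pos", "neg", "neg"]

# The pipeline vectorizes raw text and passes the vectors to the SVM
toy_model = make_pipeline(TfidfVectorizer(), LinearSVC(random_state=0))
toy_model.fit(toy_texts, toy_labels)

toy_predictions = toy_model.predict(["a great and wonderful story", "boring and awful"])
print(toy_predictions)
```

Because the pipeline takes raw strings, there is no separate `X_train` or `X_test` to keep track of.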


9. Import `accuracy_score` and `classification_report` from sklearn and calculate the results of your predictions using each. How accurate was the classifier?

In [10]:
from sklearn.metrics import accuracy_score, classification_report

print("Accuracy: {}\n".format(accuracy_score(y_test, predictions)))
print(classification_report(y_test, predictions))


Accuracy: 0.8772

              precision    recall  f1-score   support

         neg       0.87      0.89      0.88     12500
         pos       0.89      0.87      0.88     12500

    accuracy                           0.88     25000
   macro avg       0.88      0.88      0.88     25000
weighted avg       0.88      0.88      0.88     25000
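
Accuracy is simply the fraction of predictions that match the true labels. The sketch below computes it by hand on made-up labels and compares it with `accuracy_score`. It also prints a confusion matrix, which is not part of the exercise but shows where the errors fall (rows are true labels, columns are predicted labels).

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up labels, not the classifier's real output
toy_true = ["pos", "pos", "neg", "neg", "neg"]
toy_pred = ["pos", "neg", "neg", "neg", "pos"]

# Fraction of positions where prediction and truth agree: 3 out of 5
manual_accuracy = sum(t == p for t, p in zip(toy_true, toy_pred)) / len(toy_true)
sklearn_accuracy = accuracy_score(toy_true, toy_pred)
toy_matrix = confusion_matrix(toy_true, toy_pred, labels=["neg", "pos"])

print(manual_accuracy)   # 0.6
print(sklearn_accuracy)  # 0.6
print(toy_matrix)        # [[2 1]
                         #  [1 1]]
```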



See how your classifier performs on data about a different topic. Create two restaurant reviews. For example:
```
good_review = "The restaurant was really great! I ate wonderful food and had a very good time"
bad_review = "The restaurant was awful. The staff were rude and the food was horrible. I hated it"
```

Now vectorize each using the same vectorizer and generate predictions for whether each one is negative or positive. Did your classifier guess correctly?

In [11]:
good_review = "The restaurant was really great! I ate wonderful food and had a very good time"
bad_review = "The restaurant was awful. The staff were rude and the food was horrible. I hated it"

restaurant_reviews = [good_review, bad_review]
vectors = vectorizer.transform(restaurant_reviews)
print(svm_classifier.predict(vectors))

['pos' 'neg']
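
One thing to keep in mind with out-of-context text is that the vectorizer only knows words it saw during fitting. Anything else is silently ignored, so the classifier decides based on the remaining words alone. The sketch below, using a made-up toy vocabulary, shows one way to check which tokens of a new text are known. It relies on the vectorizer's `build_analyzer()` method and `vocabulary_` attribute.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up training texts standing in for movie reviews
toy_vectorizer = TfidfVectorizer()
toy_vectorizer.fit(["the food in this film was awful",
                    "a wonderful film with great acting"])

review = "The sushi was awful"
analyzer = toy_vectorizer.build_analyzer()  # same preprocessing and tokenisation as the vectorizer
tokens = analyzer(review)

known = [t for t in tokens if t in toy_vectorizer.vocabulary_]
unknown = [t for t in tokens if t not in toy_vectorizer.vocabulary_]

print(known)    # ['the', 'was', 'awful']
print(unknown)  # ['sushi']
```

Here general sentiment words such as "awful" carry over between topics, which may help explain why a movie-review classifier can still do reasonably well on restaurant reviews.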
