In this activity, we will create another text classifier. Instead of training a machine learning model to discriminate between clickbait headlines and normal headlines, we will train a similar classifier to discriminate between positive and negative movie reviews.

The objectives for our activity are:
* Vectorize the text of IMDB movie reviews and label these as Positive or Negative
* Train an SVM classifier to predict whether a movie review is Positive or Negative
* Check how accurate our classifier is on a held-out test set
* Evaluate our classifier on out-of-context data

NOTE: Because this activity uses randomized functions, it helps to set the global random seeds so that the results you see match the examples. Scikit-learn uses the NumPy random seed, and we will also use the `shuffle` function from the built-in `random` library. To reproduce the results shown here, add the following code above your main code.

```
import random
import numpy as np
random.seed(1337)
np.random.seed(1337)
```
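
To see what seeding buys us, here is a minimal sketch. The helper `shuffled_copy` and the list `items` are illustrative names, not part of the activity. It shows that reseeding the `random` module with the same value reproduces the same shuffle order.

```python
import random

def shuffled_copy(items, seed):
    """Seed the global random generator, then return a shuffled copy of items."""
    random.seed(seed)
    copy = list(items)
    random.shuffle(copy)
    return copy

items = list(range(10))
first = shuffled_copy(items, 1337)
second = shuffled_copy(items, 1337)

# Same seed, same order: the two shuffles are identical
print(first == second)
```

The same idea applies to `np.random.seed`: functions that draw from NumPy's global generator repeat their sequence once the seed is fixed.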

1. Import the `os` and `random` libraries from the Python standard library, then read each review file, attach its label, and shuffle the training and test sets

In [52]:

import os
import random

dataset_train_pos_path = "train/pos"
dataset_train_neg_path = "train/neg"

dataset_test_pos_path = "test/pos"
dataset_test_neg_path = "test/neg"


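def read_dataset(dataset_path, label):
    """Read every file in dataset_path and pair its text with the given label."""
    contents_labels = []
    files = os.listdir(dataset_path)
    for fn in files:
        path = os.path.join(dataset_path, fn)
        # Assumes the review files are UTF-8 encoded; being explicit avoids
        # depending on the platform's default encoding
        with open(path, encoding="utf-8") as f:
            s = f.read()
            contents_labels.append((s, label))
    return contents_labels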
        
    
train_pos = read_dataset(dataset_train_pos_path, "pos")
train_neg = read_dataset(dataset_train_neg_path, "neg")

test_pos = read_dataset(dataset_test_pos_path, "pos")
test_neg = read_dataset(dataset_test_neg_path, "neg")

train = train_pos + train_neg
test = test_pos + test_neg

random.shuffle(train)
random.shuffle(test)

train_data, y_train = zip(*train)
test_data, y_test = zip(*test)
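
The `zip(*train)` idiom at the end of the cell above can look cryptic. As a small sketch with made-up reviews (the list `pairs` is illustrative only), it transposes a list of `(text, label)` pairs into one tuple of texts and one tuple of labels:

```python
pairs = [("loved it", "pos"), ("hated it", "neg"), ("a fine film", "pos")]

# zip(*pairs) transposes the list: first all the texts, then all the labels
texts, labels = zip(*pairs)

print(texts)   # ('loved it', 'hated it', 'a fine film')
print(labels)  # ('pos', 'neg', 'pos')
```

Because the pairs are shuffled before they are split, each text stays aligned with its own label.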
    

In [53]:
%%time
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_data)
X_test = vectorizer.transform(test_data)
print("The dimensions of our vectors:")
print(X_train.shape)
print("- - -")


The dimensions of our vectors:
(25000, 74849)
- - -
CPU times: user 11.3 s, sys: 196 ms, total: 11.5 s
Wall time: 11.6 s
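
To make the difference between `fit_transform` and `transform` concrete, here is a minimal sketch on a toy corpus (the `toy_` names and sentences are assumptions for illustration). Fitting learns the vocabulary, and transforming new text reuses that vocabulary, so both matrices have the same number of columns.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

toy_train = ["a great movie", "a terrible movie", "great acting"]
toy_test = ["great unseen words"]

toy_vectorizer = TfidfVectorizer()
toy_X_train = toy_vectorizer.fit_transform(toy_train)  # learns the vocabulary
toy_X_test = toy_vectorizer.transform(toy_test)        # reuses that vocabulary

# The default tokenizer lowercases text and drops one-letter tokens such as "a"
print(sorted(toy_vectorizer.vocabulary_))  # ['acting', 'great', 'movie', 'terrible']
print(toy_X_train.shape)                   # (3, 4)
print(toy_X_test.shape)                    # (1, 4)
print(toy_X_test.nnz)                      # 1, because only "great" is a known word
```

In the activity, the same mechanism yields one row per review and one column per distinct token found in the training reviews.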


In [54]:
%%time

from sklearn.svm import LinearSVC

svm_classifier = LinearSVC()
svm_classifier.fit(X_train, y_train)

predictions = svm_classifier.predict(X_test)

CPU times: user 636 ms, sys: 13.6 ms, total: 650 ms
Wall time: 664 ms
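
As an optional alternative to keeping the vectorizer and classifier in separate variables, scikit-learn's `make_pipeline` can bundle them into a single object. The sketch below uses a tiny invented training set (`toy_texts`, `toy_labels`), so its predictions are not meaningful. It only illustrates the mechanics and is not how the activity itself is written.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

toy_texts = [
    "great wonderful film",
    "loved this brilliant movie",
    "awful horrible film",
    "hated this boring movie",
]
toy_labels = ["pos", "pos", "neg", "neg"]

# The pipeline vectorizes raw text and then passes the vectors to the SVM
toy_model = make_pipeline(TfidfVectorizer(), LinearSVC())
toy_model.fit(toy_texts, toy_labels)

# Raw strings go straight in; no separate transform step is needed
toy_predictions = toy_model.predict(["a wonderful movie", "a boring film"])
print(toy_predictions)
print(toy_model.named_steps["linearsvc"].classes_)
```

A pipeline like this makes it harder to accidentally call `fit_transform` on the test data, because the fitted vectorizer travels with the classifier.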


In [55]:
from sklearn.metrics import accuracy_score, classification_report

print("Accuracy: {}\n".format(accuracy_score(y_test, predictions)))
print(classification_report(y_test, predictions))


Accuracy: 0.8772

              precision    recall  f1-score   support

         neg       0.87      0.89      0.88     12500
         pos       0.89      0.87      0.88     12500

    accuracy                           0.88     25000
   macro avg       0.88      0.88      0.88     25000
weighted avg       0.88      0.88      0.88     25000
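
To clarify what the columns of the report mean, here is a minimal pure-Python sketch that computes accuracy, precision, recall, and F1 for one class. The function `binary_report` and the toy labels are illustrative assumptions; in practice `classification_report` does this for every class.

```python
def binary_report(y_true, y_pred, positive):
    """Return (accuracy, precision, recall, f1), treating `positive` as the class of interest."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)  # true positives
    fp = sum(1 for t, p in pairs if t != positive and p == positive)  # false positives
    fn = sum(1 for t, p in pairs if t == positive and p != positive)  # false negatives
    correct = sum(1 for t, p in pairs if t == p)

    accuracy = correct / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1

toy_true = ["pos", "pos", "pos", "neg", "neg", "neg"]
toy_pred = ["pos", "pos", "neg", "neg", "neg", "pos"]

accuracy, precision, recall, f1 = binary_report(toy_true, toy_pred, positive="pos")
print(accuracy, precision, recall, f1)
```

Read this way, the `pos` row of the report above says that 89% of the reviews predicted as positive really were positive (precision), and that 87% of the truly positive reviews were found (recall).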



In [57]:
good_review = "The restaurant was really great! I ate wonderful food and had a very good time"
bad_review = "The restaurant was awful. The staff were rude and the food was horrible. I hated it"

restaurant_reviews = [good_review, bad_review]
vectors = vectorizer.transform(restaurant_reviews)
print(svm_classifier.predict(vectors))

['pos' 'neg']
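
One plausible reason the classifier copes with restaurant reviews is that sentiment words such as "great" and "awful" appear in both domains, while `transform` silently ignores any word that was not seen during fitting. The sketch below uses a toy vectorizer and a hypothetical helper `split_known_unknown` to show which tokens of a new text would actually reach the classifier.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-in for the vectorizer fitted on movie reviews
toy_vectorizer = TfidfVectorizer()
toy_vectorizer.fit(["The movie was really great", "The film was awful and horrible"])

def split_known_unknown(text, fitted_vectorizer):
    """Split the tokens of text into those inside and outside the fitted vocabulary."""
    analyzer = fitted_vectorizer.build_analyzer()  # same preprocessing and tokenizing as transform
    tokens = analyzer(text)
    known = [t for t in tokens if t in fitted_vectorizer.vocabulary_]
    unknown = [t for t in tokens if t not in fitted_vectorizer.vocabulary_]
    return known, unknown

known, unknown = split_known_unknown("The restaurant was awful", toy_vectorizer)
print(known)    # ['the', 'was', 'awful']
print(unknown)  # ['restaurant']
```

Running the same helper with the activity's real `vectorizer` would show how much of an out-of-context text overlaps with the movie-review vocabulary.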
