# Natural Language Processing

This project uses the IMDB movie review dataset. The goal is to classify the sentiment (positive or negative) of a given review text.

### Part 1

#### Take a look at the movie review dataset

In [5]:
import pandas as pd
from nlp_proj_utils import get_imdb_dataset

pd.set_option('max_colwidth', 500)

In [6]:
# Load dataset, download if necessary
train, test = get_imdb_dataset()

data already available, skip downloading.
imdb loaded successfully.


In [7]:
train.head()

Unnamed: 0,text,sentiment
0,"For a movie that gets no respect there sure are a lot of memorable quotes listed for this gem. Imagine a movie where Joe Piscopo is actually funny! Maureen Stapleton is a scene stealer. The Moroni character is an absolute scream. Watch for Alan ""The Skipper"" Hale jr. as a police Sgt.",pos
1,"Bizarre horror movie filled with famous faces but stolen by Cristina Raines (later of TV's ""Flamingo Road"") as a pretty but somewhat unstable model with a gummy smile who is slated to pay for her attempted suicides by guarding the Gateway to Hell! The scenes with Raines modeling are very well captured, the mood music is perfect, Deborah Raffin is charming as Cristina's pal, but when Raines moves into a creepy Brooklyn Heights brownstone (inhabited by a blind priest on the top floor), things ...",pos
2,"A solid, if unremarkable film. Matthau, as Einstein, was wonderful. My favorite part, and the only thing that would make me go out of my way to see this again, was the wonderful scene with the physicists playing badmitton, I loved the sweaters and the conversation while they waited for Robbins to retrieve the birdie.",pos
3,"It's a strange feeling to sit alone in a theater occupied by parents and their rollicking kids. I felt like instead of a movie ticket, I should have been given a NAMBLA membership.<br /><br />Based upon Thomas Rockwell's respected Book, How To Eat Fried Worms starts like any children's story: moving to a new town. The new kid, fifth grader Billy Forrester was once popular, but has to start anew. Making friends is never easy, especially when the only prospect is Poindexter Adam. Or Erica, who...",pos
4,"You probably all already know this by now, but 5 additional episodes never aired can be viewed on ABC.com I've watched a lot of television over the years and this is possibly my favorite show, ever. It's a crime that this beautifully written and acted show was canceled. The actors that played Laura, Whit, Carlos, Mae, Damian, Anya and omg, Steven Caseman - are all incredible and so natural in those roles. Even the kids are great. Wonderful show. So sad that it's gone. Of course I wonder abou...",pos


In [8]:
print('train shape:', train.shape)
print('test  shape:', test.shape)

train shape: (25000, 2)
test  shape: (25000, 2)


In [9]:
# Statistics on the sentiment labels
train['sentiment'].value_counts()

pos    12500
neg    12500
Name: sentiment, dtype: int64

## Preprocessing

### Tokenization and Normalization

In [1]:
import nltk
import string

In [10]:
transtbl = str.maketrans(string.punctuation, ' ' * len(string.punctuation))
stopwords = nltk.corpus.stopwords.words('english')
lemmatizer = nltk.WordNetLemmatizer()

In [11]:
def preprocessing(line: str) -> str:
    """
    Take a text input and return the preprocessed string.
    i.e.: preprocessed tokens concatenated by whitespace
    """
    line = line.replace('<br />', '').translate(transtbl)
    
    # list
    tokens = [lemmatizer.lemmatize(t.lower(), 'v')  # 'v' tells the lemmatizer to treat each token as a verb
              for t in nltk.word_tokenize(line)  # line.split() would also work here
              if t.lower() not in stopwords]
    
    return ' '.join(tokens)
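
To make the cleaning steps concrete, here is a minimal standard-library sketch of the same idea. It is not the function above: `simple_preprocess` and the tiny `STOPWORDS` set are hypothetical stand-ins, lemmatization is left out, and plain `split()` replaces `nltk.word_tokenize`.

```python
import string

# Map every punctuation character to a space, as in the cell above
transtbl = str.maketrans(string.punctuation, ' ' * len(string.punctuation))

# Hypothetical mini stopword list standing in for nltk's English list
STOPWORDS = {'the', 'a', 'is', 'it', 'was', 'this'}

def simple_preprocess(line):
    """Strip <br /> tags and punctuation, lowercase, and drop stopwords."""
    line = line.replace('<br />', '').translate(transtbl)
    return ' '.join(t.lower() for t in line.split() if t.lower() not in STOPWORDS)

cleaned = simple_preprocess("The movie was great.<br /><br />Loved it!")
print(cleaned)  # movie great loved
```

Because punctuation becomes whitespace before splitting, the two sentences stay separated even after the `<br />` tags are removed.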

In [19]:
nltk.download('words')

[nltk_data] Downloading package words to /Users/liuxin/nltk_data...
[nltk_data]   Package words is already up-to-date!


True

In [12]:
from tqdm.notebook import tqdm_notebook  # public module; tqdm._tqdm_notebook is private and deprecated
tqdm_notebook.pandas()

In [13]:
for df in train, test:
    df['text_prep'] = df['text'].progress_apply(preprocessing)

In [20]:
train.sample(2)

Unnamed: 0,text,sentiment,text_prep
1473,"Sure, 65 years have passed since Thalberg's last production was filmed. But fellow IMDB members, come on, this movie is surely one of the masterpieces of the 30's! It is a 10.<br /><br />This was the first movie I saw at New York's Museum of Modern Art, around 1970 (I was a teenager). Expensive looking yet with scenes of such poverty, masterfully photographed, often thrilling, and always engaging, to me it was MGM movie-making at its best. What did audiences feel when they glimpsed a locust ...",pos,sure 65 years pass since thalberg last production film fellow imdb members come movie surely one masterpieces 30 10 first movie saw new york museum modern art around 1970 teenager expensive look yet scenes poverty masterfully photograph often thrill always engage mgm movie make best audiences feel glimpse locust attack person person destruction mansion horrific poverty splendor wealth last week watch academy award glimpse senior oscar winner attendance luise rainer grand see actress arguably...
15782,"Sure, it had some of the makings of a good film. The storyline is good, if a bit bland and the acting was good enough though I didn't understand why Olivia d'Abo had such a pronounced Australian accent if her character was supposed to have been raised in the US. My biggest problem, however, was with the wardrobe. I know as rule, the average American is considered a frumpy dresser by any self-respecting European but this was beyond that. Anna's colour combinations were positively ghastly!! An...",neg,sure make good film storyline good bite bland act good enough though understand olivia abo pronounce australian accent character suppose raise us biggest problem however wardrobe know rule average american consider frumpy dresser self respect european beyond anna colour combinations positively ghastly potato sack like sad excuse coat wear throughout film make break hive suppose idea realistic possible many school teachers walk around prada simple mean absolute lack taste word wise


### Build Vocabulary

In [22]:
all_words = [w for text in tqdm_notebook(train['text_prep']) 
             for w in text.split()]


In [26]:
voca = nltk.FreqDist(all_words)
print(voca['sure'])

2677


In [27]:
voca.most_common(10)

[('film', 48170),
 ('movie', 43912),
 ('one', 26747),
 ('make', 23538),
 ('like', 22335),
 ('see', 20773),
 ('get', 18108),
 ('time', 16143),
 ('good', 15124),
 ('character', 14153)]

In [30]:
topwords = [word for word, _ in voca.most_common(10000)]
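
`nltk.FreqDist` is a subclass of `collections.Counter`, so the vocabulary step can be illustrated without nltk. The sketch below uses three made-up preprocessed reviews (`docs` is hypothetical data, not taken from IMDB).

```python
from collections import Counter

# Hypothetical preprocessed reviews
docs = ['good film good act', 'bad film', 'good movie']

# Count every token across all documents
counts = Counter(w for text in docs for w in text.split())
print(counts['good'])  # 3

# Keep only the most frequent words as the vocabulary
top2 = [word for word, _ in counts.most_common(2)]
print(top2)  # ['good', 'film']
```

Limiting the vocabulary to the most frequent words keeps the feature matrix small and drops rare tokens such as typos and names.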


### Vectorization / Featurization

In [34]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [31]:
train_x, train_y = train['text_prep'], train['sentiment']
test_x, test_y = test['text_prep'], test['sentiment']

In [35]:
# Use topwords as vocabulary
tf_vec = TfidfVectorizer(vocabulary=topwords)

In [36]:
train_features = tf_vec.fit_transform(train_x)
test_features = tf_vec.transform(test_x)
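
The split between `fit_transform` on the training set and `transform` on the test set can be sketched by hand. The code below is a simplified illustration, assuming the smoothed idf formula `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization (scikit-learn's documented defaults); `fit_idf`, `transform`, and the toy documents are hypothetical names and data.

```python
import math

def fit_idf(docs, vocabulary):
    """Learn idf weights from the training documents only."""
    n = len(docs)
    idf = {}
    for term in vocabulary:
        df = sum(1 for d in docs if term in d.split())
        idf[term] = math.log((1 + n) / (1 + df)) + 1
    return idf

def transform(doc, vocabulary, idf):
    """Turn one document into an L2-normalized tf-idf vector."""
    tokens = doc.split()
    weights = [tokens.count(t) * idf[t] for t in vocabulary]
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm if norm else 0.0 for w in weights]

vocabulary = ['film', 'good', 'bad']
train_docs = ['good film', 'bad film', 'good good film']
idf = fit_idf(train_docs, vocabulary)         # "fit": uses training data only
vec = transform('bad bad film', vocabulary, idf)  # "transform": reuses learned idf
print(vec)
```

The test document never influences the idf weights, which is why only `transform` is called on the test set.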

In [37]:
train_features[0][:50].toarray()

array([[0.        , 0.12884974, 0.        , ..., 0.        , 0.        ,
        0.        ]])

## Training

### [Multinomial NB](http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html)

The multinomial Naive Bayes classifier is suitable for **classification with discrete features** (e.g., word counts for text classification). The multinomial distribution normally requires integer feature counts. However, in practice, fractional counts such as tf-idf may also work.
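
To show why fractional values still work, here is a tiny pure-Python sketch of multinomial Naive Bayes with additive smoothing. It is an illustration only, not scikit-learn's implementation; `fit_mnb`, `predict_mnb`, and the toy feature values are hypothetical.

```python
import math

def fit_mnb(X, y, alpha=1.0):
    """Fit on non-negative (possibly fractional) feature values."""
    classes = sorted(set(y))
    n_features = len(X[0])
    log_prior, log_prob = {}, {}
    for c in classes:
        rows = [x for x, label in zip(X, y) if label == c]
        log_prior[c] = math.log(len(rows) / len(X))
        # Smoothed per-feature totals for this class
        totals = [sum(r[j] for r in rows) + alpha for j in range(n_features)]
        denom = sum(totals)
        log_prob[c] = [math.log(t / denom) for t in totals]
    return classes, log_prior, log_prob

def predict_mnb(model, x):
    """Pick the class with the highest log posterior score."""
    classes, log_prior, log_prob = model
    scores = {c: log_prior[c] + sum(v * lp for v, lp in zip(x, log_prob[c]))
              for c in classes}
    return max(scores, key=scores.get)

# Columns: tf-idf-like weights for the words 'good' and 'bad' (made-up values)
X = [[0.9, 0.1], [0.8, 0.0], [0.1, 0.7], [0.0, 0.9]]
y = ['pos', 'pos', 'neg', 'neg']
model = fit_mnb(X, y)
print(predict_mnb(model, [0.6, 0.1]))  # pos
print(predict_mnb(model, [0.0, 0.5]))  # neg
```

The feature values only appear as multipliers of the log probabilities and as summed totals, so nothing in the arithmetic requires them to be integers.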

In [38]:
from sklearn.naive_bayes import MultinomialNB

In [39]:
mnb_model = MultinomialNB()

In [40]:
# Train Model
mnb_model.fit(train_features, train_y)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

## Evaluation

In [41]:
from sklearn import metrics

In [42]:
# Predict on test set
pred = mnb_model.predict(test_features)
print(pred)

['neg' 'pos' 'pos' ... 'neg' 'neg' 'neg']


In [43]:
print('Accuracy: %f' % metrics.accuracy_score(test_y, pred))  # signature is (y_true, y_pred)

Accuracy: 0.833120


In [44]:
# Pass in as keyword arguments to make sure the order is correct
print(
    metrics.classification_report(y_true=test_y, y_pred=pred))

              precision    recall  f1-score   support

         neg       0.81      0.87      0.84     12500
         pos       0.86      0.80      0.83     12500

    accuracy                           0.83     25000
   macro avg       0.83      0.83      0.83     25000
weighted avg       0.83      0.83      0.83     25000



In [45]:
# Example from sklearn documentation

y_true = [0, 1, 2, 2, 2]
y_pred = [0, 0, 2, 2, 1]
target_names = ['class 0', 'class 1', 'class 2']
print(metrics.classification_report(y_true, y_pred, target_names=target_names))

              precision    recall  f1-score   support

     class 0       0.50      1.00      0.67         1
     class 1       0.00      0.00      0.00         1
     class 2       1.00      0.67      0.80         3

    accuracy                           0.60         5
   macro avg       0.50      0.56      0.49         5
weighted avg       0.70      0.60      0.61         5
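
The per-class rows of this report can be reproduced by hand, which helps when reading the columns. The sketch below recomputes precision, recall, and F1 for the same `y_true` and `y_pred`; `per_class_report` is a hypothetical helper, not part of scikit-learn.

```python
def per_class_report(y_true, y_pred):
    """Return {class: (precision, recall, f1, support)}."""
    report = {}
    for c in sorted(set(y_true) | set(y_pred)):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        predicted = sum(1 for p in y_pred if p == c)  # times c was predicted
        actual = sum(1 for t in y_true if t == c)     # support of c
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        report[c] = (precision, recall, f1, actual)
    return report

y_true = [0, 1, 2, 2, 2]
y_pred = [0, 0, 2, 2, 1]
report = per_class_report(y_true, y_pred)
for c, (p, r, f, s) in report.items():
    print(f'class {c}: precision={p:.2f} recall={r:.2f} f1={f:.2f} support={s}')
```

Precision divides by how often a class was predicted, recall by how often it actually occurs, which is why class 0 has perfect recall but only 0.50 precision.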



## Save model

In [None]:
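# NOTE: train_with_n_topwords is not defined in this notebook; it presumably
# returns a (score, model, vectorizer) tuple for the given vocabulary size
_, model, vec = train_with_n_topwords(3000, tfidf=True)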

In [None]:
import pickle

with open('tf_vec.pkl', 'wb') as fp:
    pickle.dump(vec, fp)
    
with open('mnb_model.pkl', 'wb') as fp:
    pickle.dump(model, fp)

### Part 2

In this part, we switch to a neural network architecture.

In [None]:
import nlp_proj_utils as utils

In [81]:
def load_imdb():
    train, test = utils.get_imdb_dataset()
    TEXT_COL, LABEL_COL = 'text', 'sentiment'
    return (
        train[TEXT_COL], train[LABEL_COL],
        test[TEXT_COL], test[LABEL_COL])

In [82]:
train_text, train_label, test_text, test_label = load_imdb()

data already available, skip downloading.
imdb loaded successfully.


## Prepare Data 

In [52]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer(
    min_df=2, # ignore terms that appear in only one document
    ngram_range=(1, 2), # consider both uni-gram and bi-gram
)

In [54]:
train_text, train_label, test_text, test_label = load_imdb()

data already available, skip downloading.
imdb loaded successfully.


In [55]:
# Learn (fit) and transform text into vector
train_x = tfidf_vectorizer.fit_transform(train_text)

# Convert label to 0 and 1 (optional)
train_y = train_label.apply(lambda x: 1 if x == 'pos' else 0)

In [56]:
# Expect 12500 for 1 and 0, instead of pos and neg
train_y.value_counts()

1    12500
0    12500
Name: sentiment, dtype: int64

In [57]:
# Apply the same transformer to validation set as well
# Simply call `transform` this time, don't do `fit` again
test_x = tfidf_vectorizer.transform(test_text)
test_y = test_label.apply(lambda x: 1 if x == 'pos' else 0)

In [58]:
# Sanity check
assert test_x.shape == train_x.shape
assert test_y.shape == train_y.shape

### Dimensionality Reduction

We will use `SelectKBest` from `sklearn` with the `f_classif` scoring function to pick the k best features (unigrams and bigrams).
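
As a rough picture of what this selection does, the sketch below scores each feature with a one-way ANOVA F statistic (the quantity `f_classif` is based on) and keeps the k highest-scoring columns. It is a simplified pure-Python illustration; `f_score`, `select_k_best`, and the toy matrix are hypothetical.

```python
def f_score(column, labels):
    """One-way ANOVA F statistic of one feature across the label groups."""
    groups = {}
    for value, label in zip(column, labels):
        groups.setdefault(label, []).append(value)
    n, k = len(column), len(groups)
    grand_mean = sum(column) / n
    between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2
                  for g in groups.values()) / (k - 1)
    within = sum((v - sum(g) / len(g)) ** 2
                 for g in groups.values() for v in g) / (n - k)
    if within == 0:
        return float('inf') if between > 0 else 0.0
    return between / within

def select_k_best(rows, labels, k):
    """Return the sorted indices of the k highest-scoring features."""
    n_features = len(rows[0])
    scores = [f_score([r[j] for r in rows], labels) for j in range(n_features)]
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k]), scores

# Made-up data: features 0 and 2 separate the classes, feature 1 does not
rows = [[0.9, 0.5, 0.1], [0.8, 0.4, 0.0], [0.1, 0.5, 0.9], [0.2, 0.4, 0.8]]
labels = [1, 1, 0, 0]
keep, scores = select_k_best(rows, labels, k=2)
print(keep)  # [0, 2]
```

A feature whose mean differs strongly between classes, relative to its spread within each class, gets a high score and is kept.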

In [59]:
from sklearn.feature_selection import SelectKBest

In [62]:
DIM = 20000  # Number of dimensions to keep, a hyperparameter

# Create a feature selector
# By default, the f_classif scoring function is used
# Other available options include mutual_info_classif, chi2, f_regression, etc.

selector = SelectKBest(k=DIM)

In [63]:
selector.fit(train_x, train_y)

SelectKBest(k=20000, score_func=<function f_classif at 0x1a3360f730>)

In [64]:
# Apply to both training data and testing data
train_x = selector.transform(train_x)
test_x = selector.transform(test_x)

In [65]:
# Convert the sparse matrices to dense arrays before feeding them to Keras
train_x = train_x.toarray()
test_x = test_x.toarray()

### Build a Multiple-Layer Perceptron Model

In [61]:
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Dense, Dropout  # use the public API, not tensorflow.python

In [66]:

def build_mlp_model(input_dim, layers, output_dim, dropout_rate=0.2):
    # Input layer
    X = Input(shape=(input_dim,))
    
    # Hidden layer(s)
    H = X
    for layer in layers:
        H = Dense(layer, activation='relu')(H)
        H = Dropout(rate=dropout_rate)(H)
    
    # Output layer
    activation_func = 'softmax' if output_dim > 1 else 'sigmoid'
    
    Y = Dense(output_dim, activation=activation_func)(H)
    return Model(inputs=X, outputs=Y)

In [67]:
hyper_params = {
    'learning_rate': 1e-3,  # default for Adam
    'epochs': 10,  # upper bound; early stopping may end training sooner
    'batch_size': 64,
    'layers': [64, 32, 16],
    'dim': DIM,
    'dropout_rate': 0.5,
}

In [68]:
mlp_model = build_mlp_model(
    input_dim=hyper_params['dim'],
    layers=hyper_params['layers'],
    output_dim=1,
    dropout_rate=hyper_params['dropout_rate'],
)

mlp_model.summary()

Model: "model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         [(None, 20000)]           0         
_________________________________________________________________
dense (Dense)                (None, 64)                1280064   
_________________________________________________________________
dropout (Dropout)            (None, 64)                0         
_________________________________________________________________
dense_1 (Dense)              (None, 32)                2080      
_________________________________________________________________
dropout_1 (Dropout)          (None, 32)                0         
_________________________________________________________________
dense_2 (Dense)              (None, 16)                528       
_________________________________________________________________
dropout_2 (Dropout)          (None, 16)                0     
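
The `Param #` column for the dense layers can be checked by hand: a fully connected layer has one weight per input-output pair plus one bias per output unit. The helper below is a hypothetical sketch that applies this formula to the layer sizes used here.

```python
def dense_param_counts(input_dim, layers, output_dim):
    """Parameter count of each Dense layer: fan_in * units + units."""
    counts, fan_in = [], input_dim
    for units in list(layers) + [output_dim]:
        counts.append(fan_in * units + units)  # weights + biases
        fan_in = units
    return counts

counts = dense_param_counts(20000, [64, 32, 16], 1)
print(counts[:3])  # [1280064, 2080, 528]
```

Almost all parameters sit in the first layer, because it connects the 20000 selected features to 64 units. Dropout layers add no parameters.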

### Compile the Model

In [69]:
from tensorflow.keras.optimizers import Adam

In [70]:
mlp_model.compile(
    optimizer=Adam(learning_rate=hyper_params['learning_rate']),  # `lr` is a deprecated alias
    loss='binary_crossentropy',
    metrics=['acc'],
)
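
For reference, the `binary_crossentropy` loss used here is the mean negative log-likelihood of the true labels under the predicted probabilities. The sketch below computes it in plain Python; the function and the `eps` clipping value are illustrative assumptions, not Keras internals.

```python
import math

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all samples."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

loss = binary_crossentropy([1, 0], [0.9, 0.1])
print(round(loss, 4))  # 0.1054
```

Confident correct predictions give a loss near zero, while confident wrong predictions are penalized heavily.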

### Callbacks

We will use two common callbacks here: `EarlyStopping` and `ModelCheckpoint`. The first stops training once the monitored metric stops improving, which helps limit overfitting; the second saves the best model seen so far.
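
The interplay of patience and checkpointing can be sketched without Keras. In the simplified loop below, `run_with_early_stopping` is a hypothetical function and the validation losses are made-up numbers: training stops once the loss has failed to improve for `patience` consecutive epochs, and the best epoch is the one a checkpoint would have kept.

```python
def run_with_early_stopping(val_losses, patience=2):
    """Return (last epoch run, epoch with the best validation loss)."""
    best_loss, best_epoch, wait = float('inf'), None, 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            # Improvement: a checkpoint would be saved here
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return len(val_losses), best_epoch

# Hypothetical validation losses that get worse after the first epoch
stopped_at, best_epoch = run_with_early_stopping([0.25, 0.27, 0.31, 0.40, 0.50])
print(stopped_at, best_epoch)  # 3 1
```

With `patience=2`, a model whose best validation loss comes in the first epoch stops after the third, and the weights worth keeping are those from epoch 1 rather than the final ones.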

In [71]:
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.callbacks import ModelCheckpoint

In [72]:
early_stoppping_hook = EarlyStopping(
    monitor='val_loss',  # which metric to track
    patience=2,  # maximum number of epochs allowed without improvement on the monitored metric
)

CPK_PATH = 'model_cpk.hdf5'    # path to store checkpoint

model_cpk_hook = ModelCheckpoint(
    CPK_PATH,
    monitor='val_loss', 
    save_best_only=True,  # Only keep the best model
)

### Train the Model

In [73]:
his = mlp_model.fit(
    train_x, 
    train_y, 
    epochs=10,
    validation_data=(test_x, test_y),  # Keras expects a tuple here
    batch_size=hyper_params['batch_size'],
    callbacks=[early_stoppping_hook, model_cpk_hook],
)

print('Training finished')

Train on 25000 samples, validate on 25000 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Training finished


In [None]:
# Predicted probability of the positive class for each test review
predicted_y = mlp_model.predict(test_x)
predicted_y

### Evaluation

Load the best model and do evaluation:

In [75]:
mlp_model.load_weights(CPK_PATH)

Save the model and weights

In [78]:
import h5py
import os

model_root = 'resources/MLP_model'
os.makedirs(model_root, exist_ok=True)

In [79]:
# Save model structure as json
with open(os.path.join(model_root, "network.json"), "w") as fp:
    fp.write(mlp_model.to_json())

# Save model weights
mlp_model.save_weights(os.path.join(model_root, "weights.h5"))

In [80]:
# Loss and accuracy on the test set (also used as validation data above)
mlp_model.evaluate(test_x, test_y)



[0.24774323907375337, 0.89936]