1. Define the `tokenizer` and `tokenizer_porter` functions to be used in the model training pipeline

In [2]:
# Tokenizer function
def tokenizer(text):
    return text.split()
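
As a quick illustration of what a whitespace tokenizer does, the sketch below repeats the `tokenizer` definition so that it runs on its own and applies it to two made-up sentences (the sample strings are assumptions, not part of the movie dataset). Note that punctuation stays attached to the neighbouring word, because `str.split()` only splits on whitespace.

```python
# Same whitespace tokenizer as in the cell above, repeated so the sketch is self-contained
def tokenizer(text):
    return text.split()

# Hypothetical sample sentences, for illustration only
plain_tokens = tokenizer('runners like running and thus they run')
punct_tokens = tokenizer('Great movie, loved it!')

print(plain_tokens)  # ['runners', 'like', 'running', 'and', 'thus', 'they', 'run']
print(punct_tokens)  # ['Great', 'movie,', 'loved', 'it!']
```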

In [3]:
# The Porter tokenizer relies on NLTK's implementation of the Porter stemming algorithm
# Install NLTK with pip
!pip install nltk



In [4]:
from nltk.stem.porter import PorterStemmer
porter = PorterStemmer()
def tokenizer_porter(text):
    return [porter.stem(word) for word in text.split()]

2. Define the `stop_word` list (NLTK's English stop words) to be used in the model training pipeline

In [5]:
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
stop_word = stopwords.words('english')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Admin\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


3. Read the preprocessed movie data and split it into training and test sets

In [42]:
# Read the data
import pandas as pd
prepared_data = pd.read_csv('data/movie_data.csv', encoding='utf-8')
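# Data splits. `.loc` slices by label and includes the end label, so the
# training slice stops at 24999 to keep row 25000 out of both sets
X_train = prepared_data.loc[:24999, 'review'].values
y_train = prepared_data.loc[:24999, 'sentiment'].values
X_test = prepared_data.loc[25000:, 'review'].values
y_test = prepared_data.loc[25000:, 'sentiment'].values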
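
One detail worth checking in any `.loc`-based split is that label slices include their end label, unlike Python list slices or `.iloc`. The sketch below shows the difference on a small made-up frame (the `toy` DataFrame and its values are assumptions for illustration, with the default integer index that `pd.read_csv` also produces).

```python
import pandas as pd

# Hypothetical six-row frame with a default 0..5 integer index
toy = pd.DataFrame({
    'review': list('abcdef'),
    'sentiment': [1, 0, 1, 0, 1, 0],
})

# .loc slices by label and INCLUDES the end label
loc_head = toy.loc[:3, 'review'].values   # labels 0, 1, 2, 3
loc_tail = toy.loc[3:, 'review'].values   # labels 3, 4, 5

# .iloc slices by position and EXCLUDES the end position
iloc_head = toy['review'].iloc[:3].values  # positions 0, 1, 2
iloc_tail = toy['review'].iloc[3:].values  # positions 3, 4, 5

# Rows that end up in both halves of each split
loc_overlap = set(loc_head) & set(loc_tail)
iloc_overlap = set(iloc_head) & set(iloc_tail)

print(len(loc_head), len(loc_tail), loc_overlap)
print(len(iloc_head), len(iloc_tail), iloc_overlap)
```

With `.loc`, a split written as `[:k]` and `[k:]` shares row `k` between the two halves, so one review would appear in both the training and the test set.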

4. Logistic Regression model training pipeline, with `GridSearchCV` as the hyperparameter search strategy, TF-IDF-weighted Bag of Words features (`TfidfVectorizer`) for text vectorization, and a `LogisticRegression` classifier using the `liblinear` solver.

The `tokenizer` and `tokenizer_porter` functions defined earlier are used as the candidate tokenizers in the search grid.
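
To make the role of `TfidfVectorizer` more concrete, here is a minimal pure-Python sketch of unnormalized TF-IDF weighting. It assumes the smoothed formula scikit-learn documents for its defaults, idf(t) = ln((1 + n) / (1 + df(t))) + 1, and skips the L2 normalization step (similar in spirit to `norm=None`). The function name `tfidf_sketch` and the toy documents are hypothetical; this is not the library's implementation.

```python
import math
from collections import Counter

def tfidf_sketch(docs):
    """Return one {term: weight} dict per document (raw count * smoothed idf)."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed inverse document frequency
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1
           for term, count in df.items()}
    # Term frequency (raw count) times idf, without normalization
    return [{term: tf * idf[term] for term, tf in Counter(tokens).items()}
            for tokens in tokenized]

# Hypothetical toy corpus
docs = ['good movie', 'bad movie', 'good good plot']
weights = tfidf_sketch(docs)

for doc, doc_weights in zip(docs, weights):
    print(doc, {term: round(w, 3) for term, w in doc_weights.items()})
```

Terms that occur in fewer documents ('bad', 'plot') receive a larger weight per occurrence than terms shared across documents ('good', 'movie').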

In [43]:
# Import the necessary libraries
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize the TF-IDF Bag of Words vectorizer
tfidf = TfidfVectorizer(strip_accents = None, lowercase = False, preprocessor = None)

# Set the parameter grid for the GridSearchCV
small_param_grid = [
    {
        'vect__ngram_range': [(1, 1)],
        'vect__stop_words': [None],
        'vect__tokenizer': [tokenizer, tokenizer_porter],
        'clf__penalty': ['l2'],
        'clf__C': [1.0, 10.0]
    },
    {
        'vect__ngram_range': [(1, 1)],
        'vect__stop_words': [stop_word, None],
        'vect__tokenizer': [tokenizer],
        'vect__use_idf': [False],
        'vect__norm': [None],
        'clf__penalty': ['l2'],
        'clf__C': [1.0, 10.0]
    },
    {
        'vect__ngram_range': [(1, 1)],
        'vect__stop_words': [stop_word, None],
        'vect__tokenizer': [tokenizer],
        'vect__use_idf': [True],
        'vect__norm': [None],
        'clf__penalty': ['l2', 'l1'],
        'clf__C': [1.0, 10.0]
    },
]

# Initialize the Logistic Regression-Bag of Words model training pipeline
lr_tfidf = Pipeline([
    ('vect', tfidf),
    ('clf', LogisticRegression(solver='liblinear'))
])

# Attach the Logistic Regression-Bag of Words model training pipeline to the Hyperparameter search grid
gs_lr_tfidf = GridSearchCV(lr_tfidf, small_param_grid, scoring = 'accuracy', cv = 10, verbose = 2, n_jobs = -1)

# Fit the Grid search Logistic Regression-Bag of Words model training pipeline with the training set
gs_lr_tfidf.fit(X_train, y_train)

Fitting 10 folds for each of 16 candidates, totalling 160 fits
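
The figure of 16 candidates in the log can be checked by hand: each dictionary in the parameter grid contributes the product of its option counts, and the dictionaries are summed. The sketch below only does that arithmetic; the `grid_sizes` structure is a hypothetical stand-in that records how many values each key lists in `small_param_grid`.

```python
from math import prod

# Number of options listed for each key in the three grid dictionaries
grid_sizes = [
    {'vect__ngram_range': 1, 'vect__stop_words': 1, 'vect__tokenizer': 2,
     'clf__penalty': 1, 'clf__C': 2},
    {'vect__ngram_range': 1, 'vect__stop_words': 2, 'vect__tokenizer': 1,
     'vect__use_idf': 1, 'vect__norm': 1, 'clf__penalty': 1, 'clf__C': 2},
    {'vect__ngram_range': 1, 'vect__stop_words': 2, 'vect__tokenizer': 1,
     'vect__use_idf': 1, 'vect__norm': 1, 'clf__penalty': 2, 'clf__C': 2},
]

# Candidates per dictionary, total candidates, and total fits with cv = 10
per_grid = [prod(sizes.values()) for sizes in grid_sizes]
n_candidates = sum(per_grid)
n_fits = n_candidates * 10

print(per_grid, n_candidates, n_fits)  # [4, 4, 8] 16 160
```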


5. The best parameters from the model training pipeline

In [44]:
print(f'Best parameter set: {gs_lr_tfidf.best_params_}')

Best parameter set: {'clf__C': 10.0, 'clf__penalty': 'l2', 'vect__ngram_range': (1, 1), 'vect__stop_words': None, 'vect__tokenizer': <function tokenizer at 0x00000194376379A0>}


6. The average 10-fold cross-validation accuracy of the best parameter set on the training set

In [45]:
print(f'CV Accuracy: {gs_lr_tfidf.best_score_:.3f}')

CV Accuracy: 0.890


7. Classification accuracy of the best estimator on the test set

In [46]:
clf = gs_lr_tfidf.best_estimator_
print(f'test Accuracy: {clf.score(X_test, y_test):.3f}')

test Accuracy: 0.893
