In [1]:
!pip install nltk
!pip install numpy
!pip install pandas
!pip install keras

Collecting keras
  Downloading Keras-2.4.3-py2.py3-none-any.whl (36 kB)
Installing collected packages: keras
Successfully installed keras-2.4.3


In [1]:
import nltk
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to C:\Users\Haoran
[nltk_data]     Li\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to C:\Users\Haoran
[nltk_data]     Li\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

### Data Loader

Python provides a lot of packages to load files in different formats. We provide a simple data loader to help you load .csv files.

In [2]:
import pandas as pd

def load_data(file_name):
    """
    :param file_name: a file name, type: str
    return a list of ids, a list of reviews, a list of labels
    https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html
    """
    df = pd.read_csv(file_name)

    return df['id'], df["text"], df['label']

def load_labels(file_name):
    """
    :param file_name: a file name, type: str
    return a list of labels
    """
    return pd.read_csv(file_name)['label']

def write_predictions(file_name, pred):
    df = pd.DataFrame(zip(range(len(pred)), pred))
    df.columns = ["id", "label"]
    df.to_csv(file_name, index=False)
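
The loader above assumes that each CSV file has `id`, `text`, and `label` columns. As a minimal sketch of that layout, the snippet below parses an in-memory file with the standard `csv` module instead of pandas. The two sample rows are invented for illustration and are not taken from the tutorial's data.

```python
import csv
import io

# Hypothetical sample in the layout load_data expects: id, text, label.
# A text field that contains a comma has to be quoted.
sample_csv = io.StringIO(
    "id,text,label\n"
    "0,\"Great food, friendly staff.\",5\n"
    "1,Slow service.,2\n"
)

rows = list(csv.DictReader(sample_csv))
ids = [row["id"] for row in rows]
texts = [row["text"] for row in rows]
labels = [int(row["label"]) for row in rows]

print(ids)     # ['0', '1']
print(texts)   # ['Great food, friendly staff.', 'Slow service.']
print(labels)  # [5, 2]
```

Unlike `pd.read_csv`, `csv.DictReader` returns every field as a string, which is why the labels are converted with `int` here.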

### Feature Extractor


The **feature extractor** is one of the most important parts of a pipeline.
In this tutorial, we introduce four different functions for extracting features.


In [3]:
def tokenize(text):
    """
    :param text: a doc with multiple sentences, type: str
    return a word list, type: list
    e.g.
    Input: 'Text mining is to identify useful information.'
    Output: ['Text', 'mining', 'is', 'to', 'identify', 'useful', 'information', '.']
    """
    return nltk.word_tokenize(text)

Given a sentence such as *'Text mining is to identify useful information.'*, this function splits it into a list of tokens.

In [4]:
tokens = tokenize("Text mining is to identify useful information.")
print(tokens)

['Text', 'mining', 'is', 'to', 'identify', 'useful', 'information', '.']



The next step is stemming. Stemming reduces a word to its stem by stripping affixes (suffixes and prefixes). A stem is not necessarily a dictionary word (for example, 'identify' becomes 'identifi'), which distinguishes it from a lemma.


In [5]:
from nltk.stem import PorterStemmer
ps = PorterStemmer()

def stem(tokens):
    """
    :param tokens: a list of tokens, type: list
    return a list of stemmed words, type: list
    e.g.
    Input: ['Text', 'mining', 'is', 'to', 'identify', 'useful', 'information', '.']
    Output: ['text', 'mine', 'is', 'to', 'identifi', 'use', 'inform', '.']
    """
    ### equivalent code
    # results = list()
    # for token in tokens:
    #     results.append(ps.stem(token))
    # return results

    return [ps.stem(token) for token in tokens]

In [6]:
tokens = stem(tokens)
print(tokens)

['text', 'mine', 'is', 'to', 'identifi', 'use', 'inform', '.']


A single word often carries too little context, so n-grams (sequences of n consecutive tokens) are commonly used to build a richer representation.

In [7]:
def n_gram(tokens, n=1):
    """
    :param tokens: a list of tokens, type: list
    :param n: the corresponding n-gram, type: int
    return a list of n-gram tokens, type: list
    e.g.
    Input: ['text', 'mine', 'is', 'to', 'identifi', 'use', 'inform', '.'], 2
    Output: ['text mine', 'mine is', 'is to', 'to identifi', 'identifi use', 'use inform', 'inform .']
    """
    if n == 1:
        return tokens
    else:
        results = list()
        for i in range(len(tokens)-n+1):
            # tokens[i:i+n] is the sublist from index i up to (but not including) index i+n
            results.append(" ".join(tokens[i:i+n]))
        return results

In [8]:
bi_gram = n_gram(tokens, 2)
print(bi_gram)

['text mine', 'mine is', 'is to', 'to identifi', 'identifi use', 'use inform', 'inform .']
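
The same sliding window works for any `n`. The following is a minimal sketch, using a hypothetical helper named `n_gram_sketch` rather than the `n_gram` function above, that shows how many n-grams a token list produces and what happens when the list is shorter than `n`.

```python
def n_gram_sketch(tokens, n=1):
    # Slide a window of width n over the tokens; a list of length L
    # yields max(L - n + 1, 0) n-grams.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ['text', 'mine', 'is', 'to', 'identifi', 'use', 'inform', '.']
tri_gram = n_gram_sketch(tokens, 3)

print(len(tri_gram))               # 6
print(tri_gram[0])                 # text mine is
print(n_gram_sketch(['text'], 2))  # []
```

A document with fewer than `n` tokens contributes no n-grams at all, so very short reviews are represented by their unigrams only.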


In natural language, some words occur very frequently but carry little meaning on their own. We usually filter out these stopwords to reduce the size of the feature space. We can further filter out features with low frequencies.

In [9]:
from nltk.corpus import stopwords
# use a separate name so that the imported `stopwords` module is not shadowed
# (shadowing it would make this cell fail when it is run a second time)
stop_words = set(stopwords.words('english'))

def filter_stopwords(tokens):
    """
    :param tokens: a list of tokens, type: list
    return a list of filtered tokens (stopwords and numeric tokens removed), type: list
    e.g.
    Input: ['text', 'mine', 'is', 'to', 'identifi', 'use', 'inform', '.']
    Output: ['text', 'mine', 'identifi', 'use', 'inform', '.']
    """
    ### equivalent code
    # results = list()
    # for token in tokens:
    #     if token not in stop_words and not token.isnumeric():
    #         results.append(token)
    # return results

    return [token for token in tokens if token not in stop_words and not token.isnumeric()]

In [14]:
print(filter_stopwords(tokens))

['text', 'mine', 'identifi', 'use', 'inform', '.']
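
The tutorial mentions filtering low-frequency features but does not implement it. One possible sketch is shown below. The helper `filter_rare_features`, its `min_docs` threshold, and the choice to count document frequency (rather than total occurrences) are all assumptions made for illustration.

```python
from collections import Counter

def filter_rare_features(docs, min_docs=2):
    """Keep only features that occur in at least min_docs documents."""
    doc_freq = Counter()
    for feats in docs:
        # count each feature at most once per document
        doc_freq.update(set(feats))
    return [[f for f in feats if doc_freq[f] >= min_docs] for feats in docs]

# Hypothetical feature lists for three documents.
docs = [
    ['text', 'mine', 'identifi', 'use', 'inform'],
    ['text', 'mine', 'tool'],
    ['use', 'text'],
]
filtered = filter_rare_features(docs)
print(filtered)
# [['text', 'mine', 'use'], ['text', 'mine'], ['use', 'text']]
```

In a real pipeline the frequencies would be computed on the training set only, so that the test set does not influence which features are kept.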


Once we have all the features, the next step is to convert them into a form that the following models can consume.
One simple and common way is to use a one-hot vector, in which each element records whether the corresponding feature is present (it may not be the best solution).

In [11]:
import numpy as np

def get_onehot_vector(feats, feats_dict):
    """
    :param feats: a list of features, type: list
    :param feats_dict: a dict from features to indices, type: dict
    return a feature vector, type: numpy.ndarray
    """
    # initialize the vector as all zeros
    # (np.float was removed in NumPy 1.24; np.float64 is the equivalent dtype)
    vector = np.zeros(len(feats_dict), dtype=np.float64)
    for f in feats:
        # get the feature index, or -1 if the feature does not exist
        f_idx = feats_dict.get(f, -1)
        if f_idx != -1:
            # set the corresponding element as 1
            vector[f_idx] = 1
    return vector

In [15]:
print(get_onehot_vector(tokens, {"text": 0, "mine": 1, "COMP": 2, "HKUST": 3}))
print(tokens)

[1. 1. 0. 0.]
['text', 'mine', 'is', 'to', 'identifi', 'use', 'inform', '.']


In this example, "text" and "mine" appear in the tokens but "COMP" and "HKUST" do not. So the 0th and 1st elements are ones and the other elements are zeros.
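
Two properties of this encoding are easy to miss: a feature that occurs several times still yields a single 1, and features missing from the dictionary are silently ignored. The sketch below illustrates both with plain Python lists instead of NumPy. The helper `onehot_sketch` and the sorted vocabulary are assumptions for illustration, not part of the tutorial's code.

```python
def onehot_sketch(feats, feats_dict):
    # Start from all zeros and mark every known feature as present.
    vector = [0.0] * len(feats_dict)
    for f in feats:
        idx = feats_dict.get(f)
        if idx is not None:
            vector[idx] = 1.0
    return vector

# Sorting the vocabulary makes the feature-to-index mapping reproducible
# across runs (iteration order of a set of strings is not guaranteed).
vocab = sorted({"text", "mine", "COMP", "HKUST"})
feats_dict = {feat: idx for idx, feat in enumerate(vocab)}
vector = onehot_sketch(['text', 'text', 'mine', 'unknown'], feats_dict)

print(feats_dict)  # {'COMP': 0, 'HKUST': 1, 'mine': 2, 'text': 3}
print(vector)      # [0.0, 0.0, 1.0, 1.0]
```

Because only presence is recorded, word counts are lost. A count-based vector would be a natural variation if frequency turns out to matter.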

### Classifier

In this tutorial, we introduce a 1-layer perceptron to classify reviews. This perceptron consists of a single dense layer with softmax activation.
Keras is one of the easiest deep learning frameworks to get started with, so we use it to build this network.

In [17]:
import keras as K
from keras.models import Sequential
from keras.layers import Dense, Activation, Dropout
from keras.optimizers import SGD
from keras import metrics

def build_classifier(input_size, output_size, learning_rate=0.1):
    """
    :param input_size: the dimension of the input, type: int
    :param output_size: the dimension of the prediction, type: int
    :param learning_rate: the learning rate for SGD
    return a 1-layer perceptron,
    """
    model = Sequential()
    
    # add 1 layer with softmax activation
    # you should specify the output size, the input_size, and the activation function
    model.add(Dense(output_size, activation="softmax", input_dim=input_size))
    
    # set the loss as categorical_crossentropy, the metric as accuracy, and the optimizer as SGD
    model.compile(loss='categorical_crossentropy', optimizer=SGD(learning_rate=learning_rate), metrics=['accuracy'])
    
    return model

Using TensorFlow backend.
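
To make the model concrete, here is a rough pure-Python sketch of what a single dense layer with softmax computes for one example, together with the categorical cross-entropy loss. The weights, bias, and input below are arbitrary illustrative values, and the weights are stored one row per class for readability, which is not how Keras lays out its kernel internally.

```python
import math

def softmax(logits):
    # Subtract the maximum logit for numerical stability.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def dense_softmax(x, weights, bias):
    # weights[k] holds the weights of class k; logit_k = weights[k] . x + bias[k]
    logits = [sum(w * xi for w, xi in zip(row, x)) + b
              for row, b in zip(weights, bias)]
    return softmax(logits)

def cross_entropy(probs, onehot_label):
    # Only the probability assigned to the true class contributes.
    return -sum(y * math.log(p) for y, p in zip(onehot_label, probs) if y > 0)

x = [1.0, 0.0, 1.0]                             # a 3-feature one-hot input
weights = [[0.5, 0.1, -0.2], [-0.3, 0.2, 0.4]]  # 2 classes x 3 features
bias = [0.0, 0.0]

probs = dense_softmax(x, weights, bias)
loss = cross_entropy(probs, [1.0, 0.0])
print(probs, loss)
```

Training with SGD adjusts `weights` and `bias` to reduce this loss averaged over each mini-batch.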


### Connect All Parts

Now we have the data loader, feature extractor, and the classifier. We can connect them to finish this pipeline of classification.

In [21]:
train_file = "data/train.csv"
test_file = "data/test.csv"
ans_file = "data/ans.csv"
pred_file = "data/pred.csv"

# load data
train_ids, train_texts, train_labels = load_data(train_file)
test_ids, test_texts, _ = load_data(test_file)
test_labels = load_labels(ans_file)

# extract features

# tokenization
# the input is the text and the output is the word list
train_tokens = [tokenize(text) for text in train_texts]
test_tokens = [tokenize(text) for text in test_texts]

# stemming
# the input is the word list and the output is the stemmed word list
train_stemmed = [stem(tokens) for tokens in train_tokens]
test_stemmed = [stem(tokens) for tokens in test_tokens]

# 2-gram
# the input can be either the stemmed tokens or the original tokens
# and the output is the 2_gram representation list
train_2_gram = [n_gram(tokens, 2) for tokens in train_stemmed]
test_2_gram = [n_gram(tokens, 2) for tokens in test_stemmed]

# remove stopwords
# the input should be the stemmed tokens and the output is a cleaner token list
train_stemmed = [filter_stopwords(tokens) for tokens in train_stemmed]
test_stemmed = [filter_stopwords(tokens) for tokens in test_stemmed]

# build the feature list
train_feats = list()
for i in range(len(train_ids)):
    # concatenate the stemmed token list and the 2_gram list together
    train_feats.append(train_stemmed[i] + train_2_gram[i])
test_feats = list()
for i in range(len(test_ids)):
    # concatenate the stemmed token list and the 2_gram list together
    test_feats.append(test_stemmed[i] + test_2_gram[i])

# build the feature dict
feats = set()
# collect all features
for f in train_feats:
    feats.update(f)
print("Size of features:", len(feats))
# build a mapping from features to indices
feats_dict = dict(zip(feats, range(len(feats))))

# build the feats_matrix
# convert each example to a one-hot vector, and then stack vectors as a matrix
train_feats_matrix = np.vstack([get_onehot_vector(f, feats_dict) for f in train_feats])
test_feats_matrix = np.vstack([get_onehot_vector(f, feats_dict) for f in test_feats])

# convert labels to label_matrix
# labels are assumed to be integers from 1 to num_classes
num_classes = max(train_labels)
# convert each label to a one-hot vector, and then stack vectors as a matrix
# (to_categorical expects 0-based class indices, hence the -1)
train_label_matrix = K.utils.to_categorical(train_labels-1, num_classes=num_classes)
test_label_matrix = K.utils.to_categorical(test_labels-1, num_classes=num_classes)

# build the classifier
model = build_classifier(len(feats_dict), num_classes, learning_rate=0.1)

Size of features: 100870
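
The label conversion in the cell above can be pictured with the following sketch, which mimics the effect of `to_categorical` on shifted labels using plain lists. The helper `to_onehot` is hypothetical, and it assumes that labels are integers from 1 to `num_classes`, as the `-1` shift in the pipeline suggests.

```python
def to_onehot(labels, num_classes):
    # Labels are assumed to run from 1 to num_classes, so label k
    # is mapped to column k - 1.
    matrix = []
    for label in labels:
        row = [0.0] * num_classes
        row[label - 1] = 1.0
        matrix.append(row)
    return matrix

labels = [1, 3, 5]
label_matrix = to_onehot(labels, num_classes=max(labels))
for row in label_matrix:
    print(row)
# [1.0, 0.0, 0.0, 0.0, 0.0]
# [0.0, 0.0, 1.0, 0.0, 0.0]
# [0.0, 0.0, 0.0, 0.0, 1.0]
```

Note that deriving `num_classes` from the largest training label only works if that label actually appears in the training set.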


We need to specify the training hyperparameters to train the classifier. We set the number of epochs to 10 and the batch size to 100.
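
As a rough illustration of what these two hyperparameters mean, the sketch below splits a dataset into mini-batches and counts the resulting weight updates. The dataset size of 250 is a made-up number, and the sketch ignores the per-epoch shuffling that Keras applies by default.

```python
import math

def iter_batches(num_examples, batch_size):
    # Yield (start, end) index pairs; the last batch may be smaller.
    for start in range(0, num_examples, batch_size):
        yield start, min(start + batch_size, num_examples)

num_examples = 250  # hypothetical dataset size
batch_size = 100
epochs = 10

batches = list(iter_batches(num_examples, batch_size))
updates = epochs * math.ceil(num_examples / batch_size)

print(batches)  # [(0, 100), (100, 200), (200, 250)]
print(updates)  # 30
```

Each epoch is one full pass over the training data, and each batch triggers one SGD update.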

In [23]:
# training

model.fit(train_feats_matrix, train_label_matrix, epochs=10, batch_size=100)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.callbacks.History at 0x21a3a418508>

Let's try this pipeline for test data!

In [24]:
# evaluation
train_score = model.evaluate(train_feats_matrix, train_label_matrix, batch_size=100)
test_score = model.evaluate(test_feats_matrix, test_label_matrix, batch_size=100)
print("training loss:", train_score[0], "training accuracy", train_score[1])
print("test loss:", test_score[0], "test accuracy", test_score[1])

training loss: 0.5424470275640487 training accuracy 0.8790000081062317
test loss: 1.0928072929382324 test accuracy 0.5649999976158142


You can try different hyperparameters yourself.

However, our classifier here is overfitted: the training loss (about 0.54) is much smaller than the test loss (about 1.09), and the training accuracy (about 0.88) is far above the test accuracy (about 0.56).
Later we will introduce more strategies to make classifiers generalize better.

In [23]:
# save predictions
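# predict_classes is deprecated (and removed in newer TensorFlow/Keras releases);
# taking the argmax over the predicted probabilities is equivalent
# note: the results are 0-based class indices, i.e. the original label minus 1
test_pred = np.argmax(model.predict(test_feats_matrix), axis=-1)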
write_predictions(pred_file, test_pred)