# Multiclass Document Classification using Spark

---

S.Yu. Papulin (papulin_bmstu@mail.ru)

### Contents

- [Multiclass Document Classification using Naive Bayes Multinomial Model](#Multiclass-Document-Classification-using-Naive-Bayes-Multinomial-Model)
    - [Loading Dataset](#Loading-Dataset)
    - [Creating Spark DataFrame](#Creating-Spark-DataFrame)
    - [Splitting Dataset](#Splitting-Dataset)
    - [Vectorizing Documents](#Vectorizing-Documents)
    - [Training Model](#Training-Model)
    - [Testing Model](#Testing-Model)
- [Pipelines for Classification](#Pipelines-for-Classification)
    - [Training and Transforming with Pipeline](#Training-and-Transforming-with-Pipeline)
    - [Adding New Transformation to Pipeline](#Adding-New-Transformation-to-Pipeline)
    - [Classifying New Documents](#Classifying-New-Documents)
- [Model Selection](#Model-Selection)
    - [Parameter Grid](#Parameter-Grid)
    - [Train-Validation Split](#Train-Validation-Split)
    - [Cross-Validation](#Cross-Validation)

### Starting Spark Session

[OPTIONAL] Environment Setup

In [None]:
import os
import sys

os.environ["SPARK_HOME"]="/home/ubuntu/BigData/spark"
os.environ["PYSPARK_PYTHON"]="/home/ubuntu/ML/anaconda3/bin/python"
os.environ["PYSPARK_DRIVER_PYTHON"]="/home/ubuntu/ML/anaconda3/bin/python"

spark_home = os.environ.get("SPARK_HOME")
sys.path.insert(0, os.path.join(spark_home, "python"))
sys.path.insert(0, os.path.join(spark_home, "python/lib/py4j-0.10.7-src.zip"))

Import `PySpark`:

In [None]:
import pyspark
from pyspark.sql import SparkSession

Run Spark Context

*On YARN:*

In [None]:
# conf = pyspark.SparkConf() \
#         .setAppName("docClassificationApp") \
#         .setMaster("yarn") \
#         .set("spark.submit.deployMode", "client")

# sc = pyspark.SparkContext(conf=conf)

*Locally:*

In [None]:
# Note: spark.executor.* options have no effect in local mode,
#  since all computation happens in the driver.
conf = pyspark.SparkConf()\
        .set("spark.executor.memory", "1g")\
        .set("spark.executor.cores", "2")\
        .set("spark.driver.memory", "2g")\
        .setMaster("local[*]")

In [None]:
spark = SparkSession\
    .builder\
    .config(conf=conf)\
    .getOrCreate()

## Multiclass Document Classification using Naive Bayes Multinomial Model

### Loading Dataset

To avoid additional preprocessing steps, let's load the 20 Newsgroups dataset through the `scikit-learn` library. The raw dataset can be downloaded [here](http://qwone.com/~jason/20Newsgroups/).

In [None]:
from sklearn.datasets import fetch_20newsgroups

In [None]:
# Strip headers, signature footers, and quoted replies from every post
data_20newsgroups = fetch_20newsgroups(
    subset="all", remove=("headers", "footers", "quotes"))

In [None]:
print(data_20newsgroups.DESCR)

In [None]:
data_20newsgroups.data[:2]

In [None]:
data_20newsgroups.target[:2]

In [None]:
data_20newsgroups.target_names

In [None]:
list(data_20newsgroups.target_names[i] for i in data_20newsgroups.target[:2])

### Creating Spark `DataFrame`

In [None]:
# Materialize the pairs so the cell can be re-run (zip returns a one-shot iterator)
pairs_doc_target = list(zip(data_20newsgroups.data, data_20newsgroups.target))

Create a Spark `DataFrame` for the document collection:

In [None]:
from pyspark.sql import Row

In [None]:
df_data = spark.sparkContext.parallelize(pairs_doc_target, 4)\
    .map(lambda x: Row(document=x[0], target=int(x[1])))\
    .toDF()

In [None]:
df_data.show(2, truncate=True)

### Splitting Dataset

In [None]:
df_train, df_test = df_data.randomSplit([0.8, 0.2], seed=1234)
df_train.persist().count(), df_test.persist().count()

### Vectorizing Documents

In [None]:
from pyspark.ml.feature import (
    RegexTokenizer, 
    StopWordsRemover,
    HashingTF, 
    IDF
)

#### Tokenizing

In [None]:
regexTokenizer = RegexTokenizer(inputCol="document", 
                                outputCol="tokens", 
                                gaps=False,
                                pattern="(?!_)[A-Za-z']+")

# regexTokenizer = Tokenizer(inputCol="document", 
#                                 outputCol="tokens")

In [None]:
df_train__tokens = regexTokenizer.transform(df_train)
df_train__tokens\
    .select("tokens")\
    .show(5, truncate=True)
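
As a rough illustration of what the tokenizer does with `gaps=False`, the sketch below reproduces the same matching with Python's `re` module: the text is lowercased (the default `toLowercase=True` behaviour of `RegexTokenizer`) and every substring matching the pattern becomes a token. The function name `tokenize` and the sample sentence are assumptions made for this example only.

```python
import re

# Same pattern as in the RegexTokenizer above. With gaps=False the pattern
# describes the tokens themselves rather than the separators between them.
TOKEN_PATTERN = re.compile(r"(?!_)[A-Za-z']+")


def tokenize(text, min_token_length=1):
    """Lowercase the text, then return every substring matching the pattern."""
    tokens = TOKEN_PATTERN.findall(text.lower())
    return [t for t in tokens if len(t) >= min_token_length]


# Hypothetical sample sentence: digits, punctuation and underscores are dropped
sample_tokens = tokenize("Hello, world! It's a 2020_test.")
print(sample_tokens)
```

Digits, punctuation, and underscores never match the character class, so they act as implicit separators.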

#### Dropping Stop Words

In [None]:
remover = StopWordsRemover(inputCol="tokens", outputCol="filtered")

In [None]:
df_train__filtered = remover.transform(df_train__tokens)
df_train__filtered\
    .select("tokens", "filtered")\
    .show(5)

#### Hashing TF

In [None]:
hashingTF = HashingTF(inputCol="filtered",
                      outputCol="tf", 
                      numFeatures=200000,
                      binary=False)

In [None]:
df_train__tf = hashingTF.transform(df_train__filtered)
df_train__tf\
    .select("tokens", "filtered", "tf")\
    .show(truncate=True)
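
The hashing trick behind `HashingTF` can be sketched in a few lines of plain Python: each token is hashed to an index in a fixed-size vector, and the value at that index counts how many tokens landed there. This is only a sketch under stated assumptions: it uses CRC32 as a stand-in hash (Spark uses MurmurHash3, so the indices will differ), and `hashing_tf` is a hypothetical helper, not part of PySpark.

```python
import zlib


def hashing_tf(tokens, num_features=16, binary=False):
    """Map tokens to a sparse {index: count} vector via the hashing trick."""
    vector = {}
    for token in tokens:
        # ASSUMPTION: CRC32 stands in for the hash function; Spark uses MurmurHash3.
        index = zlib.crc32(token.encode("utf-8")) % num_features
        # binary=True records presence only; otherwise occurrences are counted
        vector[index] = 1.0 if binary else vector.get(index, 0.0) + 1.0
    return vector


tokens = ["spark", "hashing", "spark", "tf"]
tf_vector = hashing_tf(tokens)
print(tf_vector)
```

Different tokens can collide on the same index, which is why a large `numFeatures` value is used above; the vectors stay cheap because they are sparse.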

#### IDF

In [None]:
idf = IDF(inputCol="tf", outputCol="features")
idf_model = idf.fit(df_train__tf)

In [None]:
df_train__tf_idf = idf_model.transform(df_train__tf)
df_train__tf_idf\
    .select("tf", "features")\
    .show(2, True)
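
To make the IDF step concrete, here is a minimal pure-Python sketch of the weighting documented for Spark's `IDF`: for `m` documents and a term with document frequency `df`, the weight is `log((m + 1) / (df + 1))`, and each term frequency is multiplied by it. The helpers `fit_idf` and `transform_tf_idf` and the toy vectors are assumptions introduced for illustration.

```python
import math


def fit_idf(tf_vectors, num_features):
    """Compute idf[i] = log((m + 1) / (df[i] + 1)) over m sparse TF vectors."""
    m = len(tf_vectors)
    doc_freq = [0] * num_features
    for vec in tf_vectors:
        for index, count in vec.items():
            if count > 0:
                doc_freq[index] += 1
    return [math.log((m + 1) / (df + 1)) for df in doc_freq]


def transform_tf_idf(vec, idf):
    """Scale each term frequency by the IDF weight of its index."""
    return {i: c * idf[i] for i, c in vec.items()}


# Toy corpus: index 0 occurs in every document, indices 1 and 2 in one each
toy_tf = [{0: 2.0, 1: 1.0}, {0: 1.0}, {0: 1.0, 2: 3.0}]
toy_idf = fit_idf(toy_tf, num_features=3)
toy_tf_idf = [transform_tf_idf(v, toy_idf) for v in toy_tf]
print(toy_idf)
```

A term that appears in every document gets a weight of zero, so it no longer contributes to the feature vector.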

### Training Model

In [None]:
from pyspark.ml.classification import NaiveBayes
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

In [None]:
nb = NaiveBayes(labelCol="target", 
                featuresCol="features", 
                smoothing=1.0, 
                modelType="multinomial")

In [None]:
model = nb.fit(df_train__tf_idf)
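
For intuition about what the multinomial model estimates, the following is a textbook sketch with additive (Laplace) smoothing, written in plain Python. It is not Spark's implementation: the function names and the toy data are hypothetical, and the class priors are left unsmoothed here for simplicity.

```python
import math
from collections import defaultdict


def train_multinomial_nb(samples, num_features, smoothing=1.0):
    """samples: list of (sparse_vector, label). Returns (log_prior, log_theta)."""
    doc_count = defaultdict(int)
    feature_sum = defaultdict(lambda: [0.0] * num_features)
    for vec, label in samples:
        doc_count[label] += 1
        for i, v in vec.items():
            feature_sum[label][i] += v
    n = len(samples)
    log_prior, log_theta = {}, {}
    for label in doc_count:
        sums = feature_sum[label]
        # Smoothed per-class feature distribution; it sums to one over features
        total = sum(sums) + smoothing * num_features
        log_prior[label] = math.log(doc_count[label] / n)
        log_theta[label] = [math.log((s + smoothing) / total) for s in sums]
    return log_prior, log_theta


def predict(vec, log_prior, log_theta):
    """Return the label with the highest log-posterior score."""
    scores = {
        label: log_prior[label] + sum(v * log_theta[label][i] for i, v in vec.items())
        for label in log_prior
    }
    return max(scores, key=scores.get)


toy_samples = [
    ({0: 3.0, 1: 1.0}, 0),
    ({0: 2.0}, 0),
    ({2: 4.0, 1: 1.0}, 1),
    ({2: 1.0}, 1),
]
log_prior, log_theta = train_multinomial_nb(toy_samples, num_features=3)
toy_prediction = predict({0: 1.0, 1: 1.0}, log_prior, log_theta)
print(toy_prediction)
```

The `smoothing` argument plays the same role as the `smoothing` parameter passed to `NaiveBayes` above: it keeps features unseen in a class from driving the score to negative infinity.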

In [None]:
df_train__predictions = model.transform(df_train__tf_idf)
df_train__predictions\
    .select("probability", "target", "prediction")\
    .show(1, True)

In [None]:
evaluator = MulticlassClassificationEvaluator(labelCol="target", 
                                              predictionCol="prediction",
                                              metricName="accuracy")

In [None]:
train_accuracy = evaluator.evaluate(df_train__predictions)
print("Train accuracy = " + str(train_accuracy))

### Testing Model

#### Accuracy

In [None]:
df_test__tokens = regexTokenizer.transform(df_test)
df_test__filtered = remover.transform(df_test__tokens)
df_test__tf = hashingTF.transform(df_test__filtered)
df_test__tf_idf = idf_model.transform(df_test__tf)
df_test__predictions = model.transform(df_test__tf_idf)
test_accuracy = evaluator.evaluate(df_test__predictions)
print("Test accuracy = " + str(test_accuracy))

#### Plotting Confusion Matrix

In [None]:
true_pred_count = df_test__predictions.select("target", "prediction")\
    .groupBy("target", "prediction")\
    .count()\
    .toPandas()

true_pred_count.head(5)
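
The grouped counts above already contain everything needed for a confusion matrix and for accuracy. As a small sketch, assuming rows of the form `(target, prediction, count)` and hypothetical helper names, the matrix can be filled cell by cell and accuracy read off its diagonal:

```python
def confusion_matrix_from_counts(rows, num_classes):
    """rows: iterable of (target, prediction, count) triples."""
    cm = [[0] * num_classes for _ in range(num_classes)]
    for target, prediction, count in rows:
        # Predictions arrive as floats (e.g. 1.0), so cast before indexing
        cm[int(target)][int(prediction)] += count
    return cm


def accuracy_from_matrix(cm):
    """Accuracy is the share of counts that lie on the diagonal."""
    correct = sum(cm[i][i] for i in range(len(cm)))
    total = sum(sum(row) for row in cm)
    return correct / total


# Toy counts for a two-class problem (made up for illustration)
toy_rows = [(0, 0.0, 8), (0, 1.0, 2), (1, 1.0, 7), (1, 0.0, 3)]
toy_cm = confusion_matrix_from_counts(toy_rows, num_classes=2)
toy_accuracy = accuracy_from_matrix(toy_cm)
print(toy_cm, toy_accuracy)
```

Rows correspond to true labels and columns to predicted labels, matching the axes used in the plot.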

In [None]:
import numpy as np
import matplotlib.pyplot as plt


def plot_confusion_matrix(true_pred_count, 
                          classes=None,
                          normalize=False,
                          title=None,
                          cmap=plt.cm.Blues, 
                          figsize=(8,8)):
    """
    Plotting Confusion Matrix
    
    Note: Adapted from the scikit-learn example [1] to use a Pandas DataFrame with the following structure:
         (index, target, prediction, count)
    
    Refs:
    [1] https://scikit-learn.org/0.21/auto_examples/model_selection/plot_confusion_matrix.html
    """

    from sklearn.utils.multiclass import unique_labels
    
    classes = classes if classes else unique_labels(true_pred_count["target"], true_pred_count["prediction"])
    
    # Compute confusion matrix
    cm = np.zeros((len(classes), len(classes)), dtype="int")

    for indx, row in true_pred_count.iterrows():
        cm[int(row["target"]), int(row["prediction"])] = row["count"]
    
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')

    fig, ax = plt.subplots(figsize=figsize)
    im = ax.imshow(cm, interpolation='nearest', cmap=cmap)
    ax.figure.colorbar(im, ax=ax)
    # We want to show all ticks...
    ax.set(xticks=np.arange(cm.shape[1]),
           yticks=np.arange(cm.shape[0]),
           # ... and label them with the respective list entries
           xticklabels=classes, yticklabels=classes,
           title=title,
           ylabel='True label',
           xlabel='Predicted label')

    plt.ylim(len(classes)-0.5, -0.5)
    
    # Rotate the tick labels and set their alignment.
    plt.setp(ax.get_xticklabels(), rotation=45, ha="right",
             rotation_mode="anchor")

    # Loop over data dimensions and create text annotations.
    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i in range(cm.shape[0]):
        for j in range(cm.shape[1]):
            ax.text(j, i, format(cm[i, j], fmt),
                    ha="center", va="center",
                    color="white" if cm[i, j] > thresh else "black")
    fig.tight_layout()
    return ax

In [None]:
plot_confusion_matrix(true_pred_count, data_20newsgroups.target_names)
plt.show()

## Pipelines for Classification

### Training and Transforming with Pipeline

In [None]:
from pyspark.ml import Pipeline

In [None]:
pipeline = Pipeline(
    stages=[
        regexTokenizer, 
        remover, 
        hashingTF,
        idf,  # unfitted estimator: fitted on the training data by pipeline.fit
        nb
    ]
)

In [None]:
model__pipelined = pipeline.fit(df_train)
model__pipelined

In [None]:
df_test__predictions = model__pipelined.transform(df_test)

In [None]:
test_accuracy = evaluator.evaluate(df_test__predictions)
print("Test accuracy = " + str(test_accuracy))
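
The way a pipeline chains its stages can be sketched with toy classes: during fitting, each estimator is fitted on the output of the previous stages and replaced by the model it returns, while plain transformers are applied as they are. The classes and functions below are hypothetical stand-ins written for this illustration, not PySpark code.

```python
class Lowercase:
    """Toy transformer: it only has transform()."""

    def transform(self, docs):
        return [d.lower() for d in docs]


class VocabularyIndexer:
    """Toy fitted model: maps known words to integer ids."""

    def __init__(self, vocab):
        self.index = {w: i for i, w in enumerate(vocab)}

    def transform(self, docs):
        return [[self.index[w] for w in d.split() if w in self.index] for d in docs]


class VocabularyEstimator:
    """Toy estimator: fit() learns a vocabulary and returns a transformer."""

    def fit(self, docs):
        vocab = sorted({w for d in docs for w in d.split()})
        return VocabularyIndexer(vocab)


def fit_pipeline(stages, data):
    """Fit estimators in order, feeding each stage the previous stage's output."""
    fitted = []
    for stage in stages:
        model = stage.fit(data) if hasattr(stage, "fit") else stage
        data = model.transform(data)
        fitted.append(model)
    return fitted


def transform_pipeline(fitted, data):
    """Apply the fitted stages in order to new data."""
    for model in fitted:
        data = model.transform(data)
    return data


fitted_stages = fit_pipeline([Lowercase(), VocabularyEstimator()],
                             ["Spark ML", "ML pipelines"])
encoded = transform_pipeline(fitted_stages, ["Spark pipelines", "unseen ML"])
print(encoded)
```

Because everything learned from data is learned inside the fitting step, the same fitted stages can be applied to the test set without leaking information from it.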

### Adding New Transformation to Pipeline

In [None]:
from pyspark.ml.feature import NGram

In [None]:
ngram = NGram(n=2, inputCol="filtered", outputCol="ngrams")

In [None]:
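# Use a separate HashingTF so the stages of the earlier pipeline are not mutated,
# and the unfitted IDF estimator so the weights are learned on bigram counts.
hashingTF_ngram = HashingTF(inputCol="ngrams",
                            outputCol="tf",
                            numFeatures=200000,
                            binary=False)

pipeline_ngram = Pipeline(stages=[
    regexTokenizer,
    remover,
    ngram,
    hashingTF_ngram,
    idf,
    nb
])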

In [None]:
model__pipelined_ngram = pipeline_ngram.fit(df_train)
df_test__predictions = model__pipelined_ngram.transform(df_test)
test_accuracy = evaluator.evaluate(df_test__predictions)
print("Test accuracy = " + str(test_accuracy))

### Classifying New Documents

In [None]:
new_document = [
    "Victory means Jurgen Klopp's side is now unbeaten in its last 64 league games at home "
    "-- a run that stretches back to May 2017. The previous record of 63 was set by Bob Paisley's "
    "team between 1978 and 1981 and was ended by Leicester City. However, history did not repeat "
    "itself at the weekend and Liverpool was a deserved winner against the Foxes, producing "
    "a comprehensive display against one of the most dangerous sides in the English Premier League."]

df_new_document = spark.sparkContext.parallelize(new_document, 1) \
    .map(lambda x: Row(document=x)) \
    .toDF()

df_new_document.show(1, False)

In [None]:
df_new_document__predictions = model__pipelined.transform(df_new_document)
new_document__prediction = int(df_new_document__predictions.select("prediction").collect()[0]["prediction"])
data_20newsgroups.target_names[new_document__prediction]

## Model Selection

### Parameter Grid

In [None]:
from pyspark.ml.tuning import (
    ParamGridBuilder, 
    TrainValidationSplit, 
    CrossValidator
)

In [None]:
hashingTF = HashingTF(inputCol="filtered",
                      outputCol="tf",
                      binary=False)

In [None]:
# The number of features
num_features_list = [20000, 200000]

# The smoothing parameter of Naive Bayes
nb_smoothing_list = [0.1, 1.0]

paramGrid = ParamGridBuilder() \
    .addGrid(hashingTF.numFeatures, num_features_list) \
    .addGrid(nb.smoothing, nb_smoothing_list) \
    .build()
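
The grid is the Cartesian product of the value lists, so two values for each of two parameters give four candidate settings. A quick way to see this without Spark is `itertools.product`; the dictionary keys below are labels chosen for this illustration.

```python
from itertools import product

num_features_list = [20000, 200000]
nb_smoothing_list = [0.1, 1.0]

# One dictionary per candidate setting, analogous to one parameter map in the grid
param_combinations = [
    {"numFeatures": n, "smoothing": s}
    for n, s in product(num_features_list, nb_smoothing_list)
]

for params in param_combinations:
    print(params)
```

Every additional parameter multiplies the number of models to fit, so the grid is kept small here.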

In [None]:
pipeline = Pipeline(
    stages=[
        regexTokenizer, 
        remover, 
        hashingTF, 
        idf,  # unfitted estimator: the IDF weights must be refit for each numFeatures
        nb
    ]
)

### Train-Validation Split

In [None]:
tvs = TrainValidationSplit(estimator=pipeline,
                           estimatorParamMaps=paramGrid,
                           evaluator=evaluator,
                           trainRatio=0.8)

In [None]:
split_model = tvs.fit(df_train)

In [None]:
split_model.validationMetrics

In [None]:
model__best = split_model.bestModel
model__best.stages

In [None]:
model__best.transform(df_test) \
    .select("features", "target", "prediction") \
    .show(5)

In [None]:
df_test__predictions = model__best.transform(df_test)
test_accuracy = evaluator.evaluate(df_test__predictions)
print("Test accuracy = {}".format(str(test_accuracy)))
print("Best model parameters:")
print("\tNB Smoothing = {}".format(model__best.stages[-1].getOrDefault("smoothing")))
print("\tNumber of features = {}".format(model__best.stages[2].getNumFeatures()))

### Cross-Validation

In [None]:
crossval = CrossValidator(estimator=pipeline,
                          estimatorParamMaps=paramGrid,
                          evaluator=evaluator,
                          numFolds=5)

In [None]:
cv_model = crossval.fit(df_train)

In [None]:
cv_model.avgMetrics
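
Each entry of `avgMetrics` is the metric of one parameter setting averaged over the folds. The selection loop can be sketched in plain Python as follows; this is a simplified illustration, not Spark's implementation, and `toy_score` is a made-up stand-in for "fit the pipeline on the training folds and evaluate accuracy on the held-out fold".

```python
import random


def k_fold_indices(n_rows, num_folds, seed=0):
    """Shuffle row indices and deal them into num_folds disjoint folds."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    return [indices[i::num_folds] for i in range(num_folds)]


def cross_validate(n_rows, num_folds, param_list, fit_and_score):
    """Return the per-setting average metric and the best parameter setting."""
    folds = k_fold_indices(n_rows, num_folds)
    avg_metrics = []
    for params in param_list:
        scores = []
        for k in range(num_folds):
            valid_idx = folds[k]
            train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
            scores.append(fit_and_score(params, train_idx, valid_idx))
        avg_metrics.append(sum(scores) / num_folds)
    best = max(range(len(param_list)), key=avg_metrics.__getitem__)
    return avg_metrics, param_list[best]


# ASSUMPTION: a hypothetical scoring function replaces real model fitting
def toy_score(params, train_idx, valid_idx):
    return 1.0 - abs(params["smoothing"] - 1.0)


toy_params = [{"smoothing": 0.1}, {"smoothing": 1.0}]
toy_avg_metrics, toy_best = cross_validate(20, 5, toy_params, toy_score)
print(toy_avg_metrics, toy_best)
```

With `numFolds=5` every parameter setting is fitted five times, which makes cross-validation roughly five times as expensive as a single train-validation split. Once the best setting is found, Spark refits the estimator on the whole input dataset to produce `bestModel`.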

In [None]:
model__best = cv_model.bestModel
model__best.stages

In [None]:
df_test__predictions = model__best.transform(df_test)
test_accuracy = evaluator.evaluate(df_test__predictions)
print("Test accuracy = {}".format(str(test_accuracy)))
print("Best model parameters:")
print("\tNB Smoothing = {}".format(model__best.stages[-1].getOrDefault("smoothing")))
print("\tNumber of features = {}".format(model__best.stages[2].getNumFeatures()))

### Stopping Spark Session

In [None]:
spark.stop()

## References

[Machine Learning Library (MLlib) Guide](http://spark.apache.org/docs/latest/ml-guide.html)