#The Witcher on Azure: a classification problem using Natural Language Processing in Azure Databricks

For people who know me this will not come as a surprise, but two of my favorite topics to talk about are data and the Witcher. I have always been a huge fan of fantasy movies, games and books and the Witcher is easily one of my favorites. For readers who do not know what or who the Witcher is, the Witcher is a fantasy series by Polish writer Andrzej Sapkowski basically describing the adventures of monster hunter Geralt. Recently, a Netflix series based on these books has also been released, which I would definitely recommend. Describing this work of fiction will only be of secondary interest to this article, however. The main focus will reside with **Azure Databricks**.

Databricks is a **“Unified Data Analytics Platform”** which works together with cloud provider Azure to provide an online environment for data science using Apache Spark. In this platform, data scientists, engineers and analysts can come together to work on big data challenges. While the world of big data is multifaceted, and the possibilities Databricks offers are numerous, I want to narrow down the scope of the article to Machine Learning in Databricks. Specifically **Natural Language Processing (NLP)**.

John Snow Labs, named after the English physician and not the Game of Thrones character, has developed an award winning open-source NLP library for Apache Spark. This framework can be easily integrated with Databricks, as they are both founded upon Apache Spark. This library offers a lot of out-of-the-box tools that are essential for NLP. For example, there is an in-built Entity Extractor, Tokenizer, Part of Speech Tagger, Named Entity Recognition and many more great features. They also offer pre-trained pipelines in multiple languages which allow you to identify words and sentences without having to spend (too much) effort in training a model yourself. This means you can instantly move on to the more interesting aspects of NLP. 

In order to demonstrate some of these capabilities, I will perform a quick demonstration on some data I scraped from the internet. Using **R**, specifically the **rvest** library, I scraped the website witcher.fandom.com and extracted the character list containing all the characters that appear in the videogames. The wiki was particularly suited for a scraper, as all the pages follow the same layout. I performed some structuring and cleaning on the data in R, but I also left a block of raw text data that we will be using in this example.

I exported the scraped data in CSV format to my local computer, but I could have moved it to storage in the cloud as well. For this you could use **Azure Data Lake Storage**, which is an easy to use, scalable data lake which is ideal for storing data such as a CSV file. On top of that, being part of the Azure environment it allows for an easy integration with Databricks. I would definitely recommend using this tool if you plan on working with large amounts of unstructured or semi-structured data.  

The problem that I want to tackle on Databricks is a classification problem. Specifically, based on the textual description (which is in free form) I want to classify a character as either a Dwarf or an Elf. This information could always come in handy when you have to decide whether or not the character could be tossed (yes, I’m also a huge fan of the Lord of the Rings). All silliness aside, the classification problem posed here can easily be transferred to other, more real-life scenarios. For example, you could measure and **predict the likelihood of a purchase or the attitude** towards a product a person has based on his or hers recent LinkedIn post or an email sent to your customer support. Similar techniques can readily be deployed in the development of **chatbots**. In fact, the model that I will show here is a simplified version of the model that I am actively developing for a job. What I want to show is that with this piece of technology, the **possibilities are endless**. 

Now, without further ado, I will demonstrate the code. In Databricks I will be using **Pyspark**, although Databricks also offers support for **R, Scala and SQL**. You can follow along with this code, even when you do not have a paid subscription to Databricks as there is also a free Community Edition available. For more information see: https://databricks.com/product/faq/community-edition

First, we need to set up the environment by loading in the required libraries and read in the scraped data that I uploaded to the Databricks filestore in CSV format. I also perform a quick inspection of the data using the code below. This data is typical of what you can expect from scraped data, as there are lots of ill formatted field and missing data.

In [2]:
# All functions needed to run this example
import sparknlp

from sparknlp.annotator import *
from sparknlp.common import *
from sparknlp.base import *

from pyspark.ml import Pipeline
from pyspark.ml.feature import CountVectorizer, IDF, StringIndexer
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.classification import NaiveBayes

import pyspark.sql.functions as F

In [3]:
 df = spark.read.csv("FileStore/tables/witcher_data-1.csv",header=True)

In [4]:
display(df.limit(3))

_c0,name,race,gender,proffesion,affiliation,text
1,,,,,,
2,Abigail,Human,Female,,,"Abigail was a witch who lived in her house in the outskirts of Vizima. She was not well-liked by the villagers and got blamed for many bad or strange occurrences. Her main skill was alchemy, so though the villagers approached her with suspicion, they also relied on her for potions and poisons - which she provided, rarely bothering with questions or moral objections.In the course of Chapter I, Abigail took in the orphaned Alvin after Geralt saved him from the barghests as Shani could not care for him from an inn. She also facilitated a trance where the boy revealed more about the beast and its minions who were plaguing the village.Geralt can buy a blade coating formula from her that makes it easier to battle specters and ghosts. In fact, she buys and sells quite a few things:If her fate was left to the villagers, she cursed Geralt in the name of the Lionhead Spider, which she called the Black Legba. Whether this meant she was also responsible for the misdeeds of the other villagers is unclear."
3,Adalbert,Human,Male,Soldier,Order of the Flaming Rose Vizima City Guard,"Adalbert was a crossbowman of the City Guard who fought in the ranks of the Order of the Flaming Rose.He participated in fierce battles to take control of the Trade Quarter, resulting in the Order winning a Pyrrhic victory. A seriously wounded Adalbert went with his remaining strength to the Cloister of the Flaming Rose in the Temple Quarter. Being in a state of agony, he informed Jacques de Aldersberg what happened, and then told the Master about the death of Roderick de Wett caused by Geralt. The Grand Master then thanked the soldier for his faithful service, just before Adalbert died from his wounds."


The data that I will be using for the classification is the text column and the race column. I take a subset of the data so that only Dwarf and Elf (Aen Seidhe) characters are present. The fact that these are the most frequenly occurring classes, apart from humans, had nothing to do with my choice for these classes *ahum*. In all fairness, the number of observations is quite low, and in a real-life situation you definitely want more. For illustration purposes however, this is fine.

In [6]:
subset = df.where(df.race.isin(["Elf (Aen Seidhe)", "Dwarf"]))

In [7]:
display(subset.limit(3))

_c0,name,race,gender,proffesion,affiliation,text
24,Chireadan,Elf (Aen Seidhe),Male,Tavern owner (canon)Guerrilla fighter (games),Scoia'tael (games),"Chireadan was an elf from the Redanian city of Rinde, a tavern owner, and Errdil's cousin.Despite elves typically not finding humans to be attractive, he was secretly in love with Yennefer, though he never revealed his feelings to the sorceress. However, he didn't let his feelings get in the way when he saw that Geralt was also infatuated with Yennefer and even pulled the others away when he saw the two having sex in his cousin's destroyed inn after fighting off a djinn.In Chapter IV, at the Lakeside, Chireadan, an elf among Toruviel's ragtag group recognizes Geralt. It seems he knows the witcher from somewhere before, but of course Geralt has no memory of him. This previous encounter, which is not described in any further detail, is a reference to the The Last Wish.He is also a sharper, and thus one of the available dice players. He can be found any time, day or night to play. During the day, he is typically sitting at one of the two campfires nearest the Elven Cave. At night, he will be sleeping in the cave, but does not complain at all about being woken up just for a game."
38,Elven craftsman,Elf (Aen Seidhe),Male,Master craftsman,Toruviel,"The elven craftsman was part of Toruviel's band of starving elves who camped in the cave by the Lakeside in Murky Waters. He was looking for four pieces of centipede armor for his work and was willing to pay. He was also a master craftsman, though the term slightly offended his artistic nature, and was capable of doing many things: mirror reassembly being one such skill.He was well versed in the history of Raven's armor."
53,Golan Vivaldi,Dwarf,Male,Banker,Vivaldi Bank (Vizima branch),"""Golan Vivaldi is a dwarf, and part of the """"Vivaldi family"""""


I previously mentioned that there are pre-trained pipelines available. Here, I chose to manually set the stages instead of using the pre-trained pipeline, so you can get a better understanding of what is going on. The data that we have has to undergo a number of changes before we can actually use it. There are some steps shown here that are not strictly necessary for this problem, but could serve to illustrate some additional capabilities. 

The stages that the character text have to go through are **document assembler, sentence detector, tokenizer, stop words cleaner, normalizer, lemmatizer, finisher, countvectorizer, idf, and an indexer**. The first few stages break up the text in individual parts, remove unnecessary words and normalize the remaining words. An example of this is transforming all the same words with the same stem but written in different tenses to one and the same tense. 

In order to use the text to classify the characters, it has to be in a specific format, namely a **vector**. Furthermore, I want to make a count of how many times certain words appear in the text. This can be done with the count vectorizer. For additional information, I also determine the **term frequency-inverse document frequency (TF-IDF)**. This basically lowers the importance of words that appear in every entry. For example, given that all the characters are part of the Witcher universe, the word witcher will likely appear quite frequently. This does not mean, however, that this is very useful word to predict with as it can be applied to all entries. The TF-IDF score account for this fact. Finally, the indexer step translates the values for race to numbers. This is because predictions can only be performed on numbers. All these steps together transform the data in something that we can use to make the prediction. 

Note that in the code below I manually set a number of common stop words to be removed.

In [9]:
words_to_remove_list = ["i", "me", "my", "myself", "we", "our", "ours", "ourselves", "you", "your", "yours", "yourself", "yourselves", "he", "him", "his", "himself", "she", "her", "hers", "herself", "it", "its", "itself", "they", "them", "their", "theirs", "themselves", "what", "which", "who", "whom", "this", "that", "these", "those", "am", "is", "are", "was", "were", "be", "been", "being", "have", "has", "had", "having", "do", "does", "did", "doing", "a", "an", "the", "and", "but", "if", "or", "because", "as", "until", "while", "of", "at", "by", "for", "with", "about", "against", "between", "into", "through", "during", "before", "after", "above", "below", "to", "from", "up", "down", "in", "out", "on", "off", "over", "under", "again", "further", "then", "once", "here", "there", "when", "where", "why", "how", "all", "any", "both", "each", "few", "more", "most", "other", "some", "such", "no", "nor", "not", "only", "own", "same", "so", "than", "too", "very", "s", "t", "can", "will", "just", "don", "should", "now"]

In [10]:
document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")
    
sentence_detector = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence") \
    .setUseAbbreviations(True)
    
tokenizer = Tokenizer() \
  .setInputCols(["sentence"]) \
  .setOutputCol("token")

stop_words_cleaner = StopWordsCleaner() \
        .setInputCols(["token"]) \
        .setOutputCol("cleanTokens") \
        .setCaseSensitive(False) \
        .setStopWords(words_to_remove_list)
    
normalizer = Normalizer() \
    .setInputCols(["cleanTokens"]) \
    .setOutputCol("normalized")

lemmatizer = LemmatizerModel.pretrained(name='lemma', lang='nl') \
     .setInputCols(['normalized']) \
     .setOutputCol('lemma')

finisher = Finisher() \
    .setInputCols(["lemma"]) \
    .setOutputCols(["ntokens"]) \
    .setOutputAsArray(True) \
    .setCleanAnnotations(False) 
countvectorizer = CountVectorizer(inputCol="ntokens", outputCol="features", minDF = 3.0)

idf = IDF(inputCol="features", outputCol="features_updated")

indexer = StringIndexer(inputCol="race", outputCol="raceIndex")

nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, stop_words_cleaner, normalizer, lemmatizer, finisher, countvectorizer, idf, indexer])

Before we move to predict anything, I first divide the data in a training and test subset. This gives a better assessment of the performance of the model that we are going to use. Databricks offers many models that can be readily applied. In this case, since the outcome variable is binary, there are a lot of options we can choose from. For example, **Decision trees, logistic regression or Naïve Bayes** are all models which can be applied to this situation. In this case, I chose to use Naïve Bayes as it has been successful for me in similar cases before.

In [12]:
processed_subset = nlp_pipeline.fit(subset).transform(subset)

(trainingData, testData) = processed_subset.randomSplit([0.8, 0.2],seed = 11)

nb = NaiveBayes(modelType="multinomial",labelCol="raceIndex", featuresCol="features_updated")
nbModel = nb.fit(trainingData)
nb_predictions = nbModel.transform(testData)

Now that we have applied the model, we can evaluate the performance. In order to assess this, I look at the **f1 score** which is the harmonic mean of the precision and recall. In this case, we obtain a score of .90 which is pretty good!

In [14]:
evaluator = MulticlassClassificationEvaluator(labelCol="raceIndex", predictionCol="prediction", metricName="f1")
nb_accuracy = evaluator.evaluate(nb_predictions)
print("F1 score of NaiveBayes is = %g"% (nb_accuracy))

Using this model, we were able to make a pretty good distinction between a Dwarf and an Elf. Of course, this model can be further improved and more data should be added in order to obtain better predictions. However, what I have shown here are some of the basic steps and capabilities Databricks offers in terms of NLP and Machine Learning.

To summarize, Azure Databricks offers an easy to use data analytics platform in the cloud. It is able to ingest data from multiple sources, such as a data lake, and apply machine learning on this data. The possibilies in this regard are endless, and in this example I gave a quick demonstration of how to use unstructured text data and use it to determine a fanatasy characters' race using NLP. 

If you are curious about other possibilies Databricks could offer you or you are intrigued by this article please let me know. I am always eager to discuss these topics with interested readers. Also, if you have opportunities or resources for me to expand my knowledge regarding this topic, do not hestitate to contact me! I am also always up for a round of Gwent ;)