# The Witcher on Azure: a classification problem using Natural Language Processing in Azure Databricks

For people who know me this will not come as a surprise, but two of my favorite topics to talk about are data and the Witcher. I have always been a huge fan of fantasy movies, games and books and the Witcher is easily one of my favorites. For readers who do not know what or who the Witcher is, the Witcher is a fantasy series by Polish writer Andrzej Sapkowski basically describing the adventures of monster hunter Geralt. Recently, a Netflix series based on these books has also been released, which I would definitely recommend. Describing this work of fiction will only be of secondary interest to this article, however. The main focus will reside with **Azure Databricks**.

Databricks is a **“Unified Data Analytics Platform”** which works together with cloud provider Azure to provide an online environment for data science using Apache Spark. In this platform, data scientists, engineers and analysts can come together to work on big data challenges. While the world of big data is multifaceted, and the possibilities Databricks offers are numerous, I want to narrow down the scope of the article to Machine Learning in Databricks, specifically **Natural Language Processing (NLP)**.

John Snow Labs, named after the English physician and not the Game of Thrones character, has developed an award-winning open-source NLP library for Apache Spark. This library can be easily integrated with Databricks, as both are built on Apache Spark. It offers a lot of out-of-the-box tools that are essential for NLP. For example, there is a built-in Entity Extractor, Tokenizer, Part of Speech Tagger, Named Entity Recognition and many more great features. They also offer pre-trained pipelines in multiple languages which allow you to identify words and sentences without having to spend (too much) effort in training a model yourself. This means you can instantly move on to the more interesting aspects of NLP.

To demonstrate some of these capabilities, I will walk through a quick example using some data I scraped from the internet. Using **R**, specifically the **rvest** library, I scraped the website witcher.fandom.com and extracted the character list containing all the characters that appear in the video games. The wiki was particularly suited for a scraper, as all the pages follow the same layout. I performed some structuring and cleaning on the data in R, but I also left a block of raw text data that we will be using in this example.

I exported the scraped data in CSV format to my local computer, but I could have moved it to storage in the cloud as well. For this you could use **Azure Data Lake Storage**, an easy-to-use, scalable data lake that is ideal for storing data such as a CSV file. On top of that, being part of the Azure environment, it allows for an easy integration with Databricks. I would definitely recommend using this tool if you plan on working with large amounts of unstructured or semi-structured data.

The problem that I want to tackle on Databricks is a classification problem. Specifically, based on the textual description (which is in free form) I want to classify a character as either a Dwarf or an Elf. This information could always come in handy when you have to decide whether or not the character could be tossed (yes, I’m also a huge fan of the Lord of the Rings). All silliness aside, the classification problem posed here can easily be transferred to other, more real-life scenarios. For example, you could measure and **predict the likelihood of a purchase or the attitude** a person has towards a product based on their recent LinkedIn post or an email sent to your customer support. Similar techniques can readily be deployed in the development of **chatbots**. In fact, the model that I will show here is a simplified version of the model that I am actively developing for a job. What I want to show is that with this piece of technology, the **possibilities are endless**.

Now, without further ado, I will demonstrate the code. In Databricks I will be using **PySpark**, although Databricks also offers support for **R, Scala and SQL**. You can follow along with this code even if you do not have a paid subscription to Databricks, as there is also a free Community Edition available. For more information see: https://databricks.com/product/faq/community-edition

First, we need to set up the environment by loading the required libraries and reading in the scraped data that I uploaded to the Databricks filestore in CSV format. I also perform a quick inspection of the data using the code below. This data is typical of what you can expect from scraped data, as there are lots of ill-formatted fields and missing values.

In [2]:
# All functions needed to run this example
import sparknlp

from sparknlp.annotator import *
from sparknlp.common import *
from sparknlp.base import *

from pyspark.ml import Pipeline
from pyspark.ml.feature import CountVectorizer, IDF, StringIndexer
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.classification import NaiveBayes

import pyspark.sql.functions as F

In [3]:
df = spark.read.csv("FileStore/tables/witcher_data-1.csv", header=True)

In [4]:
display(df)

_c0,name,race,gender,proffesion,affiliation,text
1,,,,,,
2,Abigail,Human,Female,,,"Abigail was a witch who lived in her house in the outskirts of Vizima. She was not well-liked by the villagers and got blamed for many bad or strange occurrences. Her main skill was alchemy, so though the villagers approached her with suspicion, they also relied on her for potions and poisons - which she provided, rarely bothering with questions or moral objections.In the course of Chapter I, Abigail took in the orphaned Alvin after Geralt saved him from the barghests as Shani could not care for him from an inn. She also facilitated a trance where the boy revealed more about the beast and its minions who were plaguing the village.Geralt can buy a blade coating formula from her that makes it easier to battle specters and ghosts. In fact, she buys and sells quite a few things:If her fate was left to the villagers, she cursed Geralt in the name of the Lionhead Spider, which she called the Black Legba. Whether this meant she was also responsible for the misdeeds of the other villagers is unclear."
3,Adalbert,Human,Male,Soldier,Order of the Flaming Rose Vizima City Guard,"Adalbert was a crossbowman of the City Guard who fought in the ranks of the Order of the Flaming Rose.He participated in fierce battles to take control of the Trade Quarter, resulting in the Order winning a Pyrrhic victory. A seriously wounded Adalbert went with his remaining strength to the Cloister of the Flaming Rose in the Temple Quarter. Being in a state of agony, he informed Jacques de Aldersberg what happened, and then told the Master about the death of Roderick de Wett caused by Geralt. The Grand Master then thanked the soldier for his faithful service, just before Adalbert died from his wounds."
4,Adam,Human,Male,,,"Adam was the local poet in Murky Waters, his skills in that area were suspect at best. He was also not so secretly in love with Alina, was considered a fool by Alina's father Tobias Hoffman, and the rest of the village did not seem to have too high an opinion of him either. His occupation, other than aspiring poet, was unknown.Although his true love was ripped from this life by her jealous sister, Adam still clung to the dream of being eternally with Alina. It was during the first lonely days in a rat infested cell that Adam finally found his muse, and penned a poem for Alina.The example above is his finest work, ever, really. In a few conversations, we really get to plumb the depths of his oeuvre. It is bad enough that, at one point, Geralt asks him to simply stop speaking. Here is the poem which prompts the outburst:It might go on, but the witcher just stops him there."
5,Adda the White,"Human (turned into striga as a result of a curse, disenchanted by Geralt of Rivia)",Female,,,"""Dyed red (games)Adda the White was the daughter of Foltest, king of Temeria. She was born a striga as a result of a curse cast by either Ostrit (a local magnate who loved her mother) or Sancia (Foltest's mother). The magnate was in love with the king's sister, the mother of the princess, whose name was also Adda. When he learned of the incestuous relationship between the king and his sister, Ostrit tried to put a curse on the king and this is blamed for Adda's transformation into a striga. However, Foltest's mother was also furious at the incestuous relationship between her children and may also have cursed their child. So it is possible that either Ostrit or Sancia, or both, were the cause of the curse.Adda was named for her mother who died during the birth. Her nickname """"the White"""""
6,Alina,Human,,,,"Alina was a young girl from Murky Waters and the daughter of the village chief, Tobias Hoffman. She was to be married to Julian, a rich merchant from Kovir, much to the disgust of her jealous sister Celina and the poet Adam, who was secretly in love with her.She can initially be found in her house in the village of Murky Waters. Later, once she becomes a noonwraith, she usually hangs out near the ruin mill in the Fields.She seemed strangely resigned to marrying Julian and appeared fond of him, at least in a way, but she also showed no intention whatsoever of ending her affair with Adam which she carried on every day at noon, in the raspberry patch in the fields. She did seem jealous of Celina's attempts to gain Julian's attention, even though this last endeavor was proving fruitless.She had also developed quite an attachment to Alvin. She would have dearly loved to have adopted him and raise him with Julian, but Julian was not so taken with the prospect of raising a child who was, from all reports, a source.The jealous rivalry with her sister reached a peak one day when Celina started to argue with her, near the Raspberry patch, and ended up pushing Alina down and causing Alina to hit her head on a rock, killing her instantly. Because she was murdered in this manner, Alina arised as a noonwraith, doomed to forever roam the fields."
7,Alvin,Human,Male,,,"""Alvin was a boy who managed to escape the barghest attack which cost his foster mother her life. As a result of the shock, he started to divine the future and uttered the Prophecy of Ithlinne. It is supposed that Alvin was a Source, having magical powers he couldn't control.Alvin lived with Abigail until the Reverend snatched him while collecting water for Abigail. The preacher gave the orphan to a group of Salamandra lackeys, who demanded that the dwellers of the Outskirts surrender their children.In Chapter III, Geralt finds Alvin a second time. This time the boy is being held at St. Lebioda's Hospital, until he is kidnapped by Salamandra once again. In Chapter IV, Geralt finds the boy a third time, in the village of Murky Waters.Many fans of the game believe that Alvin is actually the Grand Master of the Order, Jacques de Aldersberg. When the fight between the elves under Toruviel and the Order of the Flaming Rose erupts in full force in the village of Murky Waters, Alvin teleports himself away because he is frightened by an elven warrior. Alvin's ability to teleport himself through time and space stems from the fact that he is a source. It is widely theorized that he flees to the past, perhaps back to where he was raised, and ultimately incorporates what he had learned with Geralt, accepting his perceived fate and becoming the Grand Master.Clues that speak to this hypothesis include the dimeritium pendant Alvin wears. Alvin, the boy has one as does the Grand Master; only difference is that Aldersberg's amulet shows years of wear. But the previous hint, that he went back to time with everything, clothes, knowledge etc. explains it. Another hint lies in the mention that Alvin's favorite game is """"kill the elf"""" where he always plays the Grand Master and wins. Also"
8,Angus,Human,Male,,,"Angus was a local drug dealer who was known to the City Guard, well, at least to Jethro. He was a shady character who can usually be found standing on the corner near The Hairy Bear, the kitty-corner from the Eager Thighs brothel, or in the slums of the Temple Quarter.Geralt must find Angus and get from him information about the Salamandra drug trade in Vizima. It is from him that Geralt gets the information to locate the Salamandra hideout where the manufacture of fisstech seems to be based. He can get this information either by paying the dealer for a letter of recommendation in order to enter the drug lab with minimal fuss or taking that same letter from him by force. Alternatively, our hero can simply enter the lab using force and dispense with the letter altogether."
9,Antoinette,Human,,,,"Corbin (cousin)Ramerot (cousin)Jean-Pierre (fiancé)At the beginning of Chapter V. Her cousins, Buse, Corbin and Ramerot are in hiding in the refugee caves in the swamp cemetery after incurring the disfavour of the king and they are completely out of orens.When Geralt speaks to Antoinette, she mentions that she recognizes Geralt because in her home many years ago, she saw him dancing all night with a sorceress that he was very fond of.Geralt doubts this due to his inability to dance. Antoinette questions if all Witcher's have white hair; Geralt says he is the only one, and she says she must be confused because she remembers him dancing with a sorceress."
10,Azar Javed,Human,Male,Alchemist Mage,Salamandra,"""Azar Javed was a Zerrikanian sorcerer and alchemist who took part in Salamandra's attack on Kaer Morhen. Initially known only as the """"mysterious mage"""""


The data that I will be using for the classification is the text column and the race column. I take a subset of the data so that only Dwarf and Elf (Aen Seidhe) characters are present. The fact that these are the most frequently occurring classes, apart from humans, had nothing to do with my choice for these classes *ahum*. In all fairness, the number of observations is quite low, and in a real-life situation you definitely want more. For illustration purposes, however, this is fine.

In [6]:
subset = df.where(df.race.isin(["Elf (Aen Seidhe)", "Dwarf"]))

In [7]:
display(subset)

_c0,name,race,gender,proffesion,affiliation,text
24,Chireadan,Elf (Aen Seidhe),Male,Tavern owner (canon)Guerrilla fighter (games),Scoia'tael (games),"Chireadan was an elf from the Redanian city of Rinde, a tavern owner, and Errdil's cousin.Despite elves typically not finding humans to be attractive, he was secretly in love with Yennefer, though he never revealed his feelings to the sorceress. However, he didn't let his feelings get in the way when he saw that Geralt was also infatuated with Yennefer and even pulled the others away when he saw the two having sex in his cousin's destroyed inn after fighting off a djinn.In Chapter IV, at the Lakeside, Chireadan, an elf among Toruviel's ragtag group recognizes Geralt. It seems he knows the witcher from somewhere before, but of course Geralt has no memory of him. This previous encounter, which is not described in any further detail, is a reference to the The Last Wish.He is also a sharper, and thus one of the available dice players. He can be found any time, day or night to play. During the day, he is typically sitting at one of the two campfires nearest the Elven Cave. At night, he will be sleeping in the cave, but does not complain at all about being woken up just for a game."
38,Elven craftsman,Elf (Aen Seidhe),Male,Master craftsman,Toruviel,"The elven craftsman was part of Toruviel's band of starving elves who camped in the cave by the Lakeside in Murky Waters. He was looking for four pieces of centipede armor for his work and was willing to pay. He was also a master craftsman, though the term slightly offended his artistic nature, and was capable of doing many things: mirror reassembly being one such skill.He was well versed in the history of Raven's armor."
53,Golan Vivaldi,Dwarf,Male,Banker,Vivaldi Bank (Vizima branch),"""Golan Vivaldi is a dwarf, and part of the """"Vivaldi family"""""
84,Malcolm Stein,Dwarf,Male,Blacksmith Merchant,Scoia'tael,"His main competition in town is the Order armorer, who has recently acquired some Mahakaman anvils remarkably similar to those confiscated by the City Guard from Malcolm's own smithy. But if things seem bleak in Chapter II, they are even bleaker in Chapter III where his workplace is pretty much reduced to a firepit.In Chapter V, it seems Malcolm has relocated to the makeshift forge in Old Vizima to help the Scoia'tael cause. It is he who makes the Scoia'tael (elven) variant of the Raven's armor for Geralt. Although he is not specifically identified, the conversation he has with Geralt would only make sense if it was Malcolm."
91,Munro Bruys,Dwarf,Male,Bouncer[1]Soldier (formerly),Mahakam Volunteer Army (formerly)Zoltan Chivay's company (formerly),"Munro Bruys was an adventurer from Mahakam in the company of Zoltan Chivay during the 1260s. After the Battle of Brenna, in which he avenged one of his mates Caleb Stratton, Bruys planned to do business in Novigrad. Eventually, he ended up a bouncer in Vizima.A dwarf born in Mahakam, Bruys joined Chivay's band.[2] For several days, bard Jezkier traveled and sung by their side. During one such day, their camp had been approached by Brendan. Bruys grew warry but Chivay calmed him, introducing Brendan as an old friend.[3]Once, the group waylaid and robbed a rich hawker who tried to flee Dillingen after townfolk exposed his trade with Scoia'tael. The merchant defended his property like a lion, shouting for aid a few times until the dwarves got him with batons. Intending to fund future enterprises of each individual with the treasure, Bruys helped load the trunks on a wagon.In late summer 1267, Bruys and the rest had been escorting Kernow refugees, mothers and children, to safety. They met Geralt of Rivia and the witcher's own group for the first time on a way toward the Yaruga river from Brokilon. Chivay advised Geralt to join up with dwarves on a way eastward. When the large group became short on provisions, Bruys and Yazon Varda vanished into the dark, only to return at dawn with two full sacks, one with horse grub and the second with jerkies, cheese, huge haggis and more refreshment. Another time, he and Figgis Merluzzo went mushroom picking. Later on, Percival Schuttenbach sensed porridge with his infallible gnomish nose and stated people had to live nearby. Chivay decided he, Schuttenbach and Geralt's companions would investigate, while Bruys and Merluzzo awaited for a signal in a form of a sparrowhawk call. Bruys anxiously asked Chivay when he learned bird sounds. 
Chivay answered the point was if Bruys heard an unrecognizable sound, he would know it's them.After crossing Brugge, the group stopped at O. Convinced not to carry the treasure any longer, Zoltan ordered Bruys, Varda, Merluzzo and Caleb Stratton to hide it. The rest continued to Fen Carn. Much later, during an unexpected battle between the refugees-killing Nilfgaardian Army and the Temerian Army, Chivay's company reunited. They evaded death by running to the woods, but Stratton was hit by the 7th Daerlanian Cavalry Brigade. After burying Stratton and mourning his passing, the dwarves continued. They met Geralt and his company once more, in Angren, and Bruys gave each member a strong handshake. At long last, the company decided to return home to Mount Carbon.[2]Bruys fighting at Brenna.As the end of the Second Northern War had drawn near, Bruys and Merluzzo were part of officer Zoltan Chivay's unit of the Mahakam Volunteer Army. In spring 1268, most regiments sent by Elder of Mahakam Brouver Hoog suffered great losses and were withdrawn to Vizima. With time to recover, most dwarves enjoyed beer and fist-fighting at The Shaggy Bear tavern.To stop the advancing enemy, Temeria mustered its forces and prepared for the Battle of Brenna. Thousands of dogged dwarven volunteers commanded by Colonel Barclay Els stood the right-wing. Alongside the eight light cavalry companies and infamous Adieu's Free Company, unwavering in the face of Ard Feainn Division riders.When Field Marshal Menno Coehoorn tried to run from the battlefield, half-buried in mud pleading for mercy, Chivay showed him to Bruys. Mistaking the Field Marshal for a Daerlanian cavalryman due to a silver scorpion sigil, Bruys swung his weapon up. Thus, they avenged Stratton by unknowingly killing the Nilfgaardian commanding officer.Bruys always dreamed of building steam and water-powered hammerworks. 
Following the Peace of Cintra, he agreed to start one in Novigrad with Merluzzo and Chivay.[4] However, the mutual venture flopped.[5] Bruys returned to Vizima, moving there for good as he landed a bouncer position at The Shaggy Bear. A self-titled professional dice player, Bruys earned bonus orens gambling.In 1270, Bruys kept in touch with Zoltan Chivay who told him witcher Geralt, thought dead since pogrom in Rivia, lived. Initially refusing to believe, the dwarf welcomed Geralt with open arms when the witcher entered the tavern in Temple Quarter. When asked, Bruys praised his job as peaceful and reminded Geralt of past adventures.[1]"
120,Ren Grouver,Dwarf,Male,Guerrilla fighter,Scoia'tael Yaevinn,"He can be found in the Vizima sewers of the Trade Quarter, if you get the Echoes of Yesterday quest and you will meet him again at the bank during the Gold Rush quest.If Geralt attacks the Scoia'tael during Gold Rush: He and the Scoia'tael robbers were killed by Geralt."
127,Scoia'tael quartermaster,Dwarf,Male,Merchant Quartermaster,Scoia'tael,"If Geralt has chosen the Scoia'tael path, then the quartermaster can be found on the western outskirts of Old Vizima, not far from the Gate to the Dike. On the Order path, the merchant is the Order quartermaster."
135,Toruviel,Elf (Aen Seidhe),Female,,Scoia'tael Vrihedd Brigade Filavandrel aén Fidháil,"""Toruviel aep Sihiel was a free elf from Dol Blathanna or the Blue Mountains. Originally a subject of Filavandrel, she became a Scoia'tael member and later joined the Vrihedd Brigade.Toruviel was part of the squad responsible for contacting Torque, a sylvan spying in Lower Posada for the elves to help them get food and learn agriculture. When their agent was endangered by the witcher Geralt of Rivia, who'd been hired to deal with the sylvan, she was among those who knocked out the witcher and the bard Dandelion and proceeded to tie them up.When the two woke up, Toruviel began to berate the bard and beat him before breaking his lute. Enraged at this, Geralt provoked Toruviel to come closer and, despite being tied up, managed to headbutt the elf, breaking her nose. With that, Toruviel made to kill the witcher but lost control and broke down crying while the other elves proceeded to bandage her nose.Later, after Dana Méadbh intervened and prevented Geralt and Dandelion's execution, the still-bandaged Toruviel gave Dandelion her own lute to replace the one she had destroyed.[1]When Scoia'tael units were formed to fight humans, Toruviel joined a commando together with Yaevinn and Cairbre aep Diared. In June 1267, while the three were on reconnaissance, she kicked Yaevinn to wake him up after Cairbre noticed a lone rider on the road from Tretogor to Aedirn. Though Toruviel tried to discourage her comrade, stating one lone human wasn't worth the effort, Yaevinn shot the commander, unaware that sparing his life would have prevented the provocation at Glevitzingen and the subsequent war.[2]Later, in early August, Toruviel and Ciaran aep Dearbh were wounded and guided by Milva to Col Serrai for healing. 
Not long after though, the two received news that Coinneach Dá Reo's commando was recruiting new elves and went to join him despite Milva's objections.[3]Toruviel and Yaevinn took part in the Battle of Brenna as members of the Vrihedd Brigade under Isengrim Faoiltiarna. At one point in the battle Menno Coehoorn ordered Vrihedd to strike at the point where Redanians under Kobus de Ruyter stood near the Temerians.[4] Toruviel took part in the charge and was wounded in her arm by a 15 year old Redanian pikeman just before she cut his skull. She did not feel pain until the brigade breached the frontline and reached the tents and wagons.[5] There, Yaevinn broke into Rusty's tent, murdering the wounded until he saw a Nilfgaardian. Toruviel called to him to retreat after Blenheim Blenckert's reinforcements arrived.After the defeat at Brenna, a still wounded Toruviel and 7 other members of her commando fled. They spent some time in the forest, starving and pursued by White Rayla. During this time the """"commando"""" met Lucienne's wagon with invalid Temerian veterans inside. Despite her comrades remarking to """"keep dignity"""" and not look the humans in the eye"
149,Yaevinn Light armor Heavy armor,Elf (Aen Seidhe),Male,,Scoia'tael Vrihedd Brigade,"Yaevinn was an elf and a member of Toruviel's Scoia'tael commando. He also had his own commando in the vicinity of Vizima. He had a greater penchant for flowery words and metaphors than most elves, perhaps rivaling Dandelion alone in his prose.Yaevinn participated in the Second Nilfgaard War with Nilfgaard. During the Battle of Brenna, where he was part of the Vrihedd Brigade, he and his commando entered in the medic tent of Rusty and started to kill all the patients but they immediately stopped when they saw that the medic and his assistants were healing wounded people from both the factions, making no difference between Nilfgaardians and Nordlings.He was later seriously injured in that battle but in the end he survived to the wounds.Yaevinn appears in Chapter II of the game. He treats Geralt with hesitant respect and appeals to him to aid his faction; he reminds him that as a mutant, he would never be accepted in human society, only tolerated and could better relate to the cause of the Scoia'tael. He is eloquent and often speaks in verse, employing metaphors and literary allusions to convey his meaning.If the witcher chooses to ally himself with the Scoia'tael, Yaevinn plays an even more central role in the story. It is also Toruviel who is a member of Yaevinn's group rather than the other way around, although he does not seem to outrank her by much.On Iorveth's path during Chapter II, the Scoia'tael leader will discuss Yaevinn's legacy with Geralt in Vergen if the latter sided with the Scoia'tael or remained neutral in The Witcher.Yaevinn, same as Iorveth, does not appear in game and apart from the second one, he is not even mentioned by other characters, but he has his own gwent card as part of Scoia'tael Gwent deck."
151,Yaren Bolt,Dwarf,Male,,,He has no time for the notions of the brickmakers and views them more as trainable simpletons for his enterprise. His opinion of the vodyanoi is equally glowing.


I previously mentioned that there are pre-trained pipelines available. Here, I chose to manually set the stages instead of using the pre-trained pipeline, so you can get a better understanding of what is going on. The data that we have has to undergo a number of changes before we can actually use it. There are some steps shown here that are not strictly necessary for this problem, but could serve to illustrate some additional capabilities. 

The stages that the character text has to go through are **document assembler, sentence detector, tokenizer, stop words cleaner, normalizer, lemmatizer, finisher, countvectorizer, idf, and an indexer**. The first few stages break the text up into individual parts, remove unnecessary words and normalize the remaining words. An example of this is lemmatization, which reduces different inflections of the same word, such as different tenses of a verb, to one base form.
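To give a feeling for what these early stages do, here is a minimal sketch in plain Python rather than Spark NLP. The stop word set and the lemma dictionary are tiny, made-up stand-ins (the real annotators use much larger resources and smarter rules), so treat it as an illustration of the idea only.

```python
import re

# Hypothetical, tiny stand-ins for the real stop word list and lemma dictionary
STOP_WORDS = {"was", "a", "an", "the", "and", "of", "he", "in"}
LEMMAS = {"elves": "elf", "dwarves": "dwarf", "fought": "fight", "fighting": "fight"}

def preprocess(text):
    # Tokenizer: split into words and separate punctuation marks
    tokens = re.findall(r"\w+|[^\w\s]", text)
    # Stop words cleaner: drop very common words, ignoring case
    tokens = [t for t in tokens if t.lower() not in STOP_WORDS]
    # Normalizer: strip non-letter characters and drop tokens that end up empty
    tokens = [re.sub(r"[^A-Za-z]", "", t) for t in tokens]
    tokens = [t for t in tokens if t]
    # Lemmatizer: map each word to its base form when the dictionary knows it
    return [LEMMAS.get(t.lower(), t) for t in tokens]

print(preprocess("Yaevinn was an elf. He fought the dwarves, fighting in Vizima!"))
```

Both "fought" and "fighting" end up as the same token, which is exactly why lemmatization helps when counting words later on.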

In order to use the text to classify the characters, it has to be in a specific format, namely a **vector**. Furthermore, I want to count how many times certain words appear in the text. This can be done with the count vectorizer. For additional information, I also determine the **term frequency-inverse document frequency (TF-IDF)**. This basically lowers the importance of words that appear in every entry. For example, given that all the characters are part of the Witcher universe, the word witcher will likely appear quite frequently. This does not mean, however, that it is a very useful word to predict with, as it applies to all entries. The TF-IDF score accounts for this fact. Finally, the indexer step translates the values for race to numbers, because the model expects numeric labels. All these steps together transform the data into something that we can use to make the prediction.
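The effect of TF-IDF is easiest to see on a toy example. The sketch below uses plain Python and three made-up token lists instead of the real character texts. For the inverse document frequency it uses the smoothed formula log((N + 1) / (df + 1)), which is the one documented for Spark ML's `IDF`; everything else (names, data) is hypothetical.

```python
import math
from collections import Counter

# Hypothetical, already-cleaned token lists standing in for three character descriptions
docs = [
    ["witcher", "elf", "bow", "elf"],
    ["witcher", "dwarf", "axe"],
    ["witcher", "dwarf", "bank"],
]

# Fixed vocabulary order so every document maps to a vector of the same length
vocab = sorted({token for doc in docs for token in doc})

def count_vector(doc):
    # Term frequency: how often each vocabulary word occurs in this document
    counts = Counter(doc)
    return [counts[term] for term in vocab]

def idf(term):
    # Smoothed inverse document frequency: log((N + 1) / (df + 1))
    df = sum(1 for doc in docs if term in doc)
    return math.log((len(docs) + 1) / (df + 1))

def tfidf_vector(doc):
    # Scale each count by how rare the word is across all documents
    return [tf * idf(term) for tf, term in zip(count_vector(doc), vocab)]

print(vocab)
print(count_vector(docs[0]))
print([round(x, 3) for x in tfidf_vector(docs[0])])
```

The word "witcher" occurs in every document, so its IDF is log(4 / 4) = 0 and it drops out of the weighted vector, while "elf" keeps a positive weight.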

Note that in the code below I manually set a number of common stop words to be removed.

In [9]:
words_to_remove_list = ["i", "me", "my", "myself", "we", "our", "ours", "ourselves", "you", "your", "yours", "yourself", "yourselves", "he", "him", "his", "himself", "she", "her", "hers", "herself", "it", "its", "itself", "they", "them", "their", "theirs", "themselves", "what", "which", "who", "whom", "this", "that", "these", "those", "am", "is", "are", "was", "were", "be", "been", "being", "have", "has", "had", "having", "do", "does", "did", "doing", "a", "an", "the", "and", "but", "if", "or", "because", "as", "until", "while", "of", "at", "by", "for", "with", "about", "against", "between", "into", "through", "during", "before", "after", "above", "below", "to", "from", "up", "down", "in", "out", "on", "off", "over", "under", "again", "further", "then", "once", "here", "there", "when", "where", "why", "how", "all", "any", "both", "each", "few", "more", "most", "other", "some", "such", "no", "nor", "not", "only", "own", "same", "so", "than", "too", "very", "s", "t", "can", "will", "just", "don", "should", "now"]

In [10]:
document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")
    
sentence_detector = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence") \
    .setUseAbbreviations(True)
    
tokenizer = Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

stop_words_cleaner = StopWordsCleaner() \
    .setInputCols(["token"]) \
    .setOutputCol("cleanTokens") \
    .setCaseSensitive(False) \
    .setStopWords(words_to_remove_list)

normalizer = Normalizer() \
    .setInputCols(["cleanTokens"]) \
    .setOutputCol("normalized")

# Note: lang='nl' loads the Dutch lemma model. As the descriptions are English,
# an English model (for example 'lemma_antbnc' with lang='en') may be a better fit.
lemmatizer = LemmatizerModel.pretrained(name='lemma', lang='nl') \
    .setInputCols(['normalized']) \
    .setOutputCol('lemma')

# Convert the Spark NLP annotations into plain arrays of strings for Spark ML
finisher = Finisher() \
    .setInputCols(["lemma"]) \
    .setOutputCols(["ntokens"]) \
    .setOutputAsArray(True) \
    .setCleanAnnotations(False)

# Keep only terms that appear in at least 3 documents
countvectorizer = CountVectorizer(inputCol="ntokens", outputCol="features", minDF=3.0)

idf = IDF(inputCol="features", outputCol="features_updated")

indexer = StringIndexer(inputCol="race", outputCol="raceIndex")

nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, stop_words_cleaner, normalizer, lemmatizer, finisher, countvectorizer, idf, indexer])
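
The last stage of the pipeline above, the indexer, is worth a small illustration. As far as I know, `StringIndexer` by default orders labels by descending frequency, so the most common class gets index 0. The plain Python sketch below mimics that behavior on a made-up label column; the helper name and the counts are hypothetical.

```python
from collections import Counter

# Hypothetical label column; the counts are made up for illustration
races = ["Dwarf", "Elf (Aen Seidhe)", "Dwarf", "Dwarf", "Elf (Aen Seidhe)"]

def fit_label_index(labels):
    # Order labels by descending frequency, breaking ties alphabetically,
    # to mirror the default "frequencyDesc" ordering
    counts = Counter(labels)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}

label_index = fit_label_index(races)
race_index = [label_index[r] for r in races]
print(label_index)
print(race_index)
```

Knowing which class maps to which number matters later on, when you want to read the model's numeric predictions as Dwarf or Elf again.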

Before predicting anything, I divide the data into a training and a test subset, which gives a fairer assessment of the performance of the model. Note that in the code below the preprocessing pipeline is fitted on the full subset before the split, so the vocabulary and the IDF weights also see the test rows; for a stricter evaluation you would split first and fit the pipeline on the training data only. Databricks offers many models that can be readily applied. Since the outcome variable is binary, there are a lot of options to choose from. For example, **decision trees, logistic regression or Naïve Bayes** can all be applied to this situation. I chose Naïve Bayes, as it has been successful for me in similar cases before.

In [12]:
processed_subset = nlp_pipeline.fit(subset).transform(subset)

(trainingData, testData) = processed_subset.randomSplit([0.8, 0.2],seed = 11)

nb = NaiveBayes(modelType="multinomial",labelCol="raceIndex", featuresCol="features_updated")
nbModel = nb.fit(trainingData)
nb_predictions = nbModel.transform(testData)
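
For readers who wonder what the multinomial Naïve Bayes model does under the hood, here is a minimal sketch in plain Python. It works on raw word counts from made-up training data, whereas the Spark model above is fed TF-IDF weighted features, so this illustrates the principle and is not a reimplementation. The `smoothing` parameter plays the role of Laplace smoothing, for which Spark's `NaiveBayes` also defaults to 1.0.

```python
import math
from collections import Counter

# Hypothetical training data: (tokens, label) pairs
train = [
    (["axe", "beer", "mahakam"], "Dwarf"),
    (["axe", "bank", "beer"], "Dwarf"),
    (["bow", "forest", "poetry"], "Elf"),
]
vocab = sorted({t for tokens, _ in train for t in tokens})

def fit(train, smoothing=1.0):
    # Count documents per class and word occurrences per class
    class_counts = Counter(label for _, label in train)
    word_counts = {label: Counter() for label in class_counts}
    for tokens, label in train:
        word_counts[label].update(tokens)
    model = {}
    for label, n_docs in class_counts.items():
        total = sum(word_counts[label].values())
        log_prior = math.log(n_docs / len(train))
        # Laplace smoothing keeps unseen words from zeroing out a class
        log_likelihood = {
            w: math.log((word_counts[label][w] + smoothing)
                        / (total + smoothing * len(vocab)))
            for w in vocab
        }
        model[label] = (log_prior, log_likelihood)
    return model

def predict(model, tokens):
    # Score = log prior + sum of log likelihoods; words outside the vocabulary are ignored
    scores = {
        label: log_prior + sum(ll[t] for t in tokens if t in ll)
        for label, (log_prior, ll) in model.items()
    }
    return max(scores, key=scores.get)

model = fit(train)
print(predict(model, ["beer", "axe"]))
print(predict(model, ["forest", "bow"]))
```

Working in log space avoids multiplying many small probabilities, which would quickly underflow for longer texts.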

Now that we have applied the model, we can evaluate its performance. To assess this, I look at the **F1 score**, which is the harmonic mean of precision and recall. Spark's `MulticlassClassificationEvaluator` computes it per class and reports the average weighted by class size. In this case, we obtain a score of 0.90, which is pretty good!

In [14]:
evaluator = MulticlassClassificationEvaluator(labelCol="raceIndex", predictionCol="prediction", metricName="f1")
# The evaluator returns the (weighted) F1 score, so name the variable accordingly
nb_f1 = evaluator.evaluate(nb_predictions)
print("F1 score of NaiveBayes is = %g" % nb_f1)
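
To make the metric more tangible, the sketch below computes a class-weighted F1 score by hand in plain Python. The label lists are made up and are not the predictions from the model above; the function names are hypothetical as well.

```python
from collections import Counter

def f1_for_class(y_true, y_pred, label):
    # Count true positives, false positives and false negatives for one class
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    # Harmonic mean of precision and recall
    return 2 * precision * recall / (precision + recall)

def weighted_f1(y_true, y_pred):
    # Average the per-class F1 scores, weighted by how often each class truly occurs
    support = Counter(y_true)
    return sum(
        f1_for_class(y_true, y_pred, label) * n / len(y_true)
        for label, n in support.items()
    )

# Hypothetical labels (0.0 = Dwarf, 1.0 = Elf)
y_true = [0.0, 0.0, 0.0, 1.0, 1.0]
y_pred = [0.0, 0.0, 1.0, 1.0, 1.0]
print(round(weighted_f1(y_true, y_pred), 3))
```

With one Dwarf misclassified as an Elf, both classes end up with an F1 of 0.8, so the weighted score is 0.8 as well. On a test set as small as the one in this article, a single misclassification moves the score a lot.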

Using this model, we were able to make a pretty good distinction between a Dwarf and an Elf. Of course, this model can be further improved and more data should be added in order to obtain better predictions. However, what I have shown here are some of the basic steps and capabilities Databricks offers in terms of NLP and Machine Learning.

To summarize, Azure Databricks offers an easy-to-use data analytics platform in the cloud. It is able to ingest data from multiple sources, such as a data lake, and apply machine learning to this data. The possibilities in this regard are endless, and in this example I gave a quick demonstration of how to take unstructured text data and use it to determine a fantasy character's race using NLP.

If you are curious about other possibilities Databricks could offer you, or you are intrigued by this article, please let me know. I am always eager to discuss these topics with interested readers. Also, if you have opportunities or resources for me to expand my knowledge regarding this topic, do not hesitate to contact me! I am also always up for a round of Gwent ;)