In this notebook, I show whether cards from Blizzard's popular game Hearthstone can be classified by type using only the text on the cards themselves. All card data comes from https://hearthstonejson.com/.

As the site above shows, Hearthstone cards have a number of attributes. For our purposes, we're only looking at two: the text of the card and its type (Hero, Spell, Minion, or Weapon).

In [1]:
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import string

First, we read in the file containing all of our card data.

In [2]:
cards = pd.read_json('hscards.json')

Taking a look at the head of the dataframe, there are a number of columns. As mentioned earlier, the ones we want to pay attention to are cards['text'] and cards['type'].

In [3]:
cards.head()

Unnamed: 0,artist,attack,cardClass,classes,collectible,collectionText,cost,dbfId,durability,elite,...,playRequirements,playerClass,race,rarity,referencedTags,set,spellDamage,targetingArrowText,text,type
0,Justin Sweet,4.0,NEUTRAL,,True,,5.0,61,,,...,,NEUTRAL,,COMMON,,EXPERT1,,,<b>Enrage:</b> Your weapon has +2 Attack.,MINION
1,Steve Hui,,ROGUE,,True,,1.0,990,,,...,,ROGUE,,COMMON,[STEALTH],HOF,,,Give your minions <b>Stealth</b> until your ne...,SPELL
2,Raymond Swanland,,PRIEST,,True,,5.0,2999,,,...,,PRIEST,,RARE,,LOE,,,Deal $3 damage to all minions.\nShuffle this c...,SPELL
3,Alex Horley Orlandelli,3.0,WARLOCK,,True,,9.0,777,,1.0,...,,WARLOCK,DEMON,LEGENDARY,,EXPERT1,,,<b>Battlecry:</b> Destroy your hero and replac...,MINION
4,Wayne Reynolds,6.0,NEUTRAL,,True,,6.0,2573,,,...,,NEUTRAL,,EPIC,,TGT,,,<b>Battlecry:</b> Copy your opponent's Hero Po...,MINION


Now, looking at the text of a number of the cards, there's a lot of extra information in there that we don't need. Line breaks, HTML, and punctuation obscure the semantic information we're trying to pull out. Here I define a function that does a few things.

1. Checks the text for null values, notably for "vanilla" cards that have no additional rules text.
2. Uses the BeautifulSoup library to strip out all of the HTML tags.
3. Splits the text into characters and removes all punctuation.
4. Rejoins the text, splits it into words, and removes any "stopwords": common words in English that lack semantic significance but are necessary for language to function.


In [8]:
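# Build the stopword set once instead of reloading the list for every word.
STOPWORDS = set(stopwords.words('english'))

def process_text(text):
    """Return the significant words of a card's text as a list of tokens."""
    if pd.isnull(text):
        return []  # "vanilla" card: no rules text, so no tokens
    # Name the parser explicitly; otherwise BeautifulSoup emits a
    # "no parser was explicitly specified" warning.
    text = BeautifulSoup(text, 'html.parser').get_text()

    # Drop punctuation character by character, then rejoin into one string.
    nopunc = ''.join(c for c in text if c not in string.punctuation)
    # Split at word level and keep only the words that are not stopwords.
    return [word for word in nopunc.split() if word.lower() not in STOPWORDS]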
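
To see what these steps do to a single card, here is a minimal standalone sketch using only the standard library. It is not the function above: the regex tag stripper stands in for BeautifulSoup, `TOY_STOPWORDS` is a tiny hypothetical stand-in for NLTK's English stopword list, and `clean_card_text` and the sample text are made up for illustration.

```python
import re
import string

# Hypothetical, deliberately tiny stopword list (NLTK's real list is much longer).
TOY_STOPWORDS = {"a", "all", "the", "to", "your"}

def clean_card_text(text):
    """Strip tags and punctuation, then drop stopwords (illustrative only)."""
    if text is None:
        return []  # a card with no rules text yields no tokens
    no_tags = re.sub(r"<[^>]+>", "", text)  # crude HTML tag removal
    no_punc = "".join(c for c in no_tags if c not in string.punctuation)
    return [w for w in no_punc.split() if w.lower() not in TOY_STOPWORDS]

sample = "<b>Battlecry:</b> Deal 3 damage to the enemy hero."
tokens = clean_card_text(sample)
print(tokens)  # ['Battlecry', 'Deal', '3', 'damage', 'enemy', 'hero']
```

Note that the keyword formatting is gone but the keyword itself ("Battlecry") survives, which is exactly the kind of token the classifier can learn from.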

Next I'm going to import a number of features from Scikit-Learn for our actual analysis. 

<b>CountVectorizer</b> and <b>TfidfTransformer</b> to turn the strings into vectors of word counts and then weight those counts by their TF-IDF scores (Term Frequency, Inverse Document Frequency, explained further at https://en.wikipedia.org/wiki/Tf%E2%80%93idf)

<b>train_test_split</b> to separate the entire corpus of card text into two sets, training data and testing data.

<b>MultinomialNB</b>, an implementation of the Multinomial Naive Bayes categorical classification model

and <b>Pipeline</b>, a class that chains the vectorizing, weighting, and classifying steps into a single object, which saves us a lot of work in performing them one by one.

In [6]:
from sklearn.feature_extraction.text import TfidfTransformer, CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
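
To make the TF-IDF weighting concrete, here is a small hand-rolled sketch on three toy documents. The documents and the helper names (`smoothed_idf`, `tfidf`) are hypothetical. The IDF formula used, ln((1 + n) / (1 + df)) + 1, is the smoothed variant that, as far as I know, matches TfidfTransformer's defaults; the final length normalization that TfidfTransformer also applies is left out to keep the arithmetic readable.

```python
import math
from collections import Counter

# Toy documents, already tokenized (hypothetical, not from the card data).
docs = [
    ["deal", "damage", "minion"],
    ["battlecry", "deal", "damage"],
    ["taunt", "minion"],
]

n_docs = len(docs)
# Document frequency: in how many documents does each term appear?
doc_freq = Counter(term for doc in docs for term in set(doc))

def smoothed_idf(term):
    """Smoothed inverse document frequency: rarer terms score higher."""
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1

def tfidf(doc):
    """Raw term count times smoothed IDF (no length normalization)."""
    counts = Counter(doc)
    return {term: count * smoothed_idf(term) for term, count in counts.items()}

for doc in docs:
    print({term: round(score, 3) for term, score in tfidf(doc).items()})
```

"taunt" appears in only one document, so it gets a larger weight than "deal", which appears in two.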

In [9]:
pl = Pipeline([('bow',CountVectorizer(analyzer=process_text)),
                ('tfidf',TfidfTransformer()),
                ('classifier',MultinomialNB())])
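
As a rough illustration of the work the pipeline saves, the sketch below runs the same three steps by hand and through a Pipeline on a toy corpus and compares the predictions. The texts and labels are hypothetical, and the default CountVectorizer analyzer is used instead of process_text so the example stands alone.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy stand-in corpus (hypothetical, not taken from the card data).
texts = ["Deal 3 damage", "Battlecry: Draw a card",
         "Deal 1 damage to all minions", "Taunt"]
labels = ["SPELL", "MINION", "SPELL", "MINION"]
new_texts = ["Deal 2 damage", "Taunt. Battlecry: Gain 2 Attack"]

# By hand: fit each step on the training text, then reuse it on new text.
bow = CountVectorizer()
tfidf = TfidfTransformer()
clf = MultinomialNB()
train_weights = tfidf.fit_transform(bow.fit_transform(texts))
clf.fit(train_weights, labels)
manual = clf.predict(tfidf.transform(bow.transform(new_texts)))

# With a Pipeline: one fit call and one predict call do the same thing.
toy_pl = Pipeline([('bow', CountVectorizer()),
                   ('tfidf', TfidfTransformer()),
                   ('classifier', MultinomialNB())])
toy_pl.fit(texts, labels)
piped = toy_pl.predict(new_texts)

print(list(manual) == list(piped))
```

The pipeline also guarantees that the test text goes through transform rather than fit_transform, which is an easy mistake to make when chaining the steps by hand.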

Here we split our corpus into the training and test data. About 40% of all cards will be used for the testing set. The remaining 60% will be what the model uses to learn the features of each card type.

In [10]:
txt_train,txt_test,label_train,label_test = train_test_split(cards['text'],cards['type'],test_size=.4,random_state=41)
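
One variation worth knowing about, which is not what the cell above does: train_test_split accepts a `stratify` argument that keeps the class proportions the same in both halves. This may matter when some card types are rare. A minimal sketch on hypothetical toy data (`toy_text` and `toy_type` are made up):

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 minions and 5 spells.
toy_text = [f"card {i}" for i in range(15)]
toy_type = ["MINION"] * 10 + ["SPELL"] * 5

# stratify=toy_type asks for the same MINION:SPELL ratio in both splits.
_, _, type_train, type_test = train_test_split(
    toy_text, toy_type, test_size=0.4, random_state=41, stratify=toy_type
)

print(Counter(type_train))
print(Counter(type_test))
```

With the real data this would mean passing `stratify=cards['type']`, assuming every type has at least two cards.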

Next, I use the pipeline created above to fit the training data (both the text and the label outcome), essentially "teaching" the model what the association is between the input text and its resulting label.

In [11]:
pl.fit(txt_train,label_train)





Pipeline(steps=[('bow', CountVectorizer(analyzer=<function process_text at 0x000002728D1CDBF8>,
        binary=False, decode_error='strict', dtype=<class 'numpy.int64'>,
        encoding='utf-8', input='content', lowercase=True, max_df=1.0,
        max_features=None, min_df=1, ngram_range=(1, 1), preprocesso...f=False, use_idf=True)), ('classifier', MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))])

We then use that fitted model to make predictions on our testing data.

In [15]:
predictions = pl.predict(txt_test)





Finally, using Scikit-Learn's classification_report function, we're going to check what kind of result those predictions gave us.

In [16]:
from sklearn.metrics import classification_report
print(classification_report(label_test,predictions))

             precision    recall  f1-score   support

       HERO       0.00      0.00      0.00         5
     MINION       0.69      0.99      0.82       321
      SPELL       0.91      0.15      0.25       144
     WEAPON       0.00      0.00      0.00        13

avg / total       0.73      0.70      0.62       483





<b> Analysis </b>

This classification report tells us a few different things on its own.

First off, there are 9 hero classes in Hearthstone. Each of these heroes is represented as a "Hero" card, but within the context of the game none are accessible by players. They also have no text. As such, we can ignore the "HERO" row entirely.

Hearthstone also has very few cards in the "Weapon" class, around 50 or so in total. However, unlike the HERO cards, they are not so easily ignored. It is very possible that there was simply not enough data for the model to learn what a weapon card looks like. Weapons also share characteristics with the MINION class, namely an attack value and a health value (represented as "durability" for weapons in-game), so it is likely that some, if not all, of the weapons were inadvertently classified as minions. Keep in mind, though, that the model sees only the card text and not these stats, so any such confusion would have to come from the wording of the cards. As the game progresses and more cards are created, I would be interested to see how the addition of more weapons affects their classification.
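
Whether weapons really ended up labeled as minions can be checked directly with a confusion matrix, which counts how often each true type received each predicted type. The sketch below uses scikit-learn's confusion_matrix on hypothetical labels (`true_types` and `predicted_types` are made up, not the notebook's results).

```python
from sklearn.metrics import confusion_matrix

card_types = ["HERO", "MINION", "SPELL", "WEAPON"]

# Hypothetical labels for illustration only.
true_types = ["MINION", "MINION", "SPELL", "SPELL", "WEAPON", "WEAPON"]
predicted_types = ["MINION", "MINION", "SPELL", "MINION", "MINION", "MINION"]

# Rows are the true types, columns the predicted types, both in card_types order.
matrix = confusion_matrix(true_types, predicted_types, labels=card_types)

# Read off where the true weapons went.
weapon_row = matrix[card_types.index("WEAPON")]
for predicted, count in zip(card_types, weapon_row):
    print(f"true WEAPON predicted as {predicted}: {count}")
```

On the real data the call would be `confusion_matrix(label_test, predictions, labels=card_types)`, and the WEAPON row would show how the 13 test weapons were distributed.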

A precision score of .91 for spells is high: of the cards our model labeled as spells, 91% really were spells. The recall of .15 tells the other half of the story, though: the model found only 15% of the actual spells in the test set. The precision of .69 for minions, combined with a recall of .99, is interesting, and lends credence to two possible conclusions. One, as mentioned above, weapons were incorrectly classified as minions, or two, only text that is very strongly associated with the "spell" classification receives that label, while everything else falls through to the far more common minion class.
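
To make the precision and recall figures concrete, approximate counts can be recovered from the report itself, since recall = correct / support and precision = correct / predicted. The sketch below does that arithmetic; `approximate_counts` is a hypothetical helper, and because the report rounds to two decimals the results are only estimates.

```python
# (precision, recall, support) copied from the classification report above.
report = {
    "MINION": (0.69, 0.99, 321),
    "SPELL": (0.91, 0.15, 144),
}

def approximate_counts(precision, recall, support):
    """Estimate (correctly labeled, total labeled) for one class."""
    true_positives = round(recall * support)       # recall = TP / support
    predicted = round(true_positives / precision)  # precision = TP / predicted
    return true_positives, predicted

for card_type, (precision, recall, support) in report.items():
    correct, predicted = approximate_counts(precision, recall, support)
    print(f"{card_type}: about {predicted} cards labeled, about {correct} correct")
```

By this estimate the model labeled roughly 461 of the 483 test cards as minions and only about 24 as spells, so the high spell precision rests on very few predictions.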
