data source: https://github.com/foundryvtt/pf2e/tree/master/packs/equipment

In this exercise I will try to predict whether a weapon is advanced or not. Advanced weapons are generally more powerful, with higher damage and/or a better mix of traits.
It will be a bit interesting as some of my predictors will be 0/1 values themselves, but as far as I have seen logistic regression is completely fine with that.

I know I have quite a small amount of data (only about 266 rows), but I honestly couldn't come up with anything else for this exercise; I'm not good at "come up with a problem and solve it".

First I'll have to turn the bunch of .json files into a single dataframe and make sure only weapons are in there

In [27]:
import pandas as pd
import json
import os

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.metrics import roc_auc_score

In [None]:
# testing how a single json would be read in
with open("equipment\\adze.json") as f:
    temp = json.load(f)

temp

{'_id': '8V4mgecGASsQ7fjl',
 'img': 'systems/pf2e/icons/equipment/weapons/adze.webp',
 'name': 'Adze',
 'system': {'baseItem': 'adze',
  'bonus': {'value': 0},
  'bonusDamage': {'value': 0},
  'bulk': {'value': 2},
  'category': 'martial',
  'containerId': None,
  'damage': {'damageType': 'slashing', 'dice': 1, 'die': 'd10'},
  'description': {'value': "<p>A common cutting tool, an adze resembles an axe—but the cutting edge is horizontal, rather than vertical. The adze's shape makes it popular among woodworkers, and tripkee builders often use them to construct their treetop homes. The tool also serves as an effective weapon, due in part to the immense strength required to swing it.</p>"},
  'group': 'axe',
  'hardness': 0,
  'hp': {'max': 0, 'value': 0},
  'level': {'value': 0},
  'material': {'grade': None, 'type': None},
  'price': {'value': {'gp': 1}},
  'publication': {'license': 'ORC',
   'remaster': True,
   'title': 'Pathfinder Player Core 2'},
  'quantity': 1,
  'range': None

In [12]:
# try to load all files
# json.load originally failed on some unicode characters, so I wanted to see what share of the files was problematic
directory = os.fsencode("equipment")
weapons = 0
errors = 0
for file in os.listdir(directory):
    filename = os.fsdecode(file)
    with open(f"equipment\\{filename}", 'r') as f:
        try:
            data = json.load(f)
        except UnicodeDecodeError:
            errors += 1
            continue
    if data['type'] == 'weapon':
        weapons += 1

print(f"errors: {errors}\nweapons: {weapons}")

errors: 1
weapons: 908
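The one decode error above is probably down to the files being read with the platform's default text encoding. A minimal sketch of the same loading loop with an explicit `encoding="utf-8"` follows. The helper name `load_items` is made up for illustration, and I am assuming the pack files are UTF-8, which I have not verified for every file.

```python
import json
from pathlib import Path


def load_items(folder):
    """Load every .json file in `folder`, decoding explicitly as UTF-8.

    Returns (items, failed): the parsed dicts and the names of the files
    that could not be decoded or parsed.
    """
    items, failed = [], []
    for path in sorted(Path(folder).glob("*.json")):
        try:
            with path.open(encoding="utf-8") as f:
                items.append(json.load(f))
        except (UnicodeDecodeError, json.JSONDecodeError):
            failed.append(path.name)
    return items, failed


# Only runs if the "equipment" folder from this notebook is present.
if Path("equipment").is_dir():
    items, failed = load_items("equipment")
    weapons = [item for item in items if item.get("type") == "weapon"]
    print(f"errors: {len(failed)}\nweapons: {len(weapons)}")
```

Using `pathlib` also avoids the hard-coded `\\` separator, so the same cell would work outside Windows.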


In [13]:
# creating empty dataframe with the columns I want
columns = [
    "is_advanced",
    "is_melee",
    "average_damage",
    "hands_to_use",
    "bulk",
    "cost",
    "number_of_traits",
]
df = pd.DataFrame(columns = columns)

In [25]:
# now let's sort out what we need
# we'll end up with about 300ish weapons, as magical weapons and "specific weapons" have a very different power distribution
# we need to check two things to filter those out:
# - whether the "name" matches "system.baseItem" (if yes then it is a base weapon, which is what we want; if not it is a specific weapon)
# - whether it has the "magical" trait; we don't want those

directory = os.fsencode("equipment")

i = 0

for file in os.listdir(directory):
    filename = os.fsdecode(file)
    with open(f"equipment\\{filename}", 'r') as f:
        try:
            data = json.load(f)
        except UnicodeDecodeError:
            continue
    
    try:
        name = data['name']
        if data['type'] != 'weapon':
            continue

        if name.lower().replace(' ', '-') != data['system']['baseItem']:
            continue

        if 'magical' in data['system']['traits']['value']:
            continue
        # okay we weeded out what we don't need, everything else is the base items

        data = data['system']  # as everything is in here honestly
        # now let's put it in the dataframe
        line = []  # "is_advanced", "is_melee", "average_damage", "hands_to_use", "bulk", "cost", "number_of_traits"

        # is_advanced
        if data['category'] == 'advanced':
            line.append(1)
        else:
            line.append(0)

        # is_melee
        if data['group'] in ['bomb', 'bow', 'crossbow', 'dart', 'firearm', 'sling']:
            line.append(0)
        else:
            line.append(1)
        
        # average_damage
        die_amount = data['damage']['dice']
        if data['damage']['die']:
            die_size = int(data['damage']['die'][1:])
        else:
            die_size = 0
        average_damage = die_amount * (die_size / 2 + 0.5)
        line.append(average_damage)

        # hands_to_use
        if 'two-hand' in data['usage']['value']:
            line.append(2)
        else:
            line.append(1)

        # bulk
        if data['bulk']['value'] == 0.1:
            line.append(0)
        else:
            line.append(data['bulk']['value'])

        # cost
        price = data['price']['value']
        cost = price.get('pp', 0) * 10 + price.get('gp', 0) + price.get('sp', 0) * 0.1 + price.get('cp', 0) * 0.01
        line.append(cost)

        # number of traits
        line.append(len(data['traits']['value']))

        df.loc[i] = line
        i += 1
    except Exception:
        # show which file broke, then re-raise with the original traceback
        print(filename)
        raise

df

Unnamed: 0,is_advanced,is_melee,average_damage,hands_to_use,bulk,cost,number_of_traits
0,0.0,1.0,5.5,2.0,2.0,1.0,3.0
1,0.0,0.0,2.5,1.0,0.0,4.0,2.0
2,1.0,1.0,3.5,1.0,1.0,5.0,4.0
3,0.0,0.0,4.5,2.0,1.0,25.0,1.0
4,1.0,1.0,4.5,1.0,1.0,2.0,2.0
...,...,...,...,...,...,...,...
261,0.0,1.0,2.5,1.0,0.0,2.0,6.0
262,1.0,0.0,3.5,2.0,2.0,8.0,4.0
263,0.0,0.0,2.5,1.0,1.0,3.0,3.0
264,0.0,1.0,3.5,1.0,0.0,1.0,4.0


In [None]:
# there are some extreme outliers in cost that we need to get rid of, as logistic regression doesn't cope well with them
# (this drops the rows whose cost is exactly 90 gp)
df = df[df['cost'] != 90]
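Filtering on one exact value works here, but a rule-based cutoff would keep working if more expensive items showed up later. Below is a sketch using the common 1.5 x IQR rule on made-up numbers. The function name and the `k` parameter are illustrative, and on a heavily skewed column like cost this rule may well drop more rows than intended, so treat it as one option rather than the right answer.

```python
import pandas as pd


def drop_high_outliers(frame, column, k=1.5):
    """Keep rows whose `column` is at most Q3 + k * IQR."""
    q1 = frame[column].quantile(0.25)
    q3 = frame[column].quantile(0.75)
    upper = q3 + k * (q3 - q1)
    return frame[frame[column] <= upper]


# Made-up costs with one extreme value.
demo = pd.DataFrame({"cost": [1, 2, 3, 4, 90]})
trimmed = drop_high_outliers(demo, "cost")
print(trimmed["cost"].tolist())  # [1, 2, 3, 4]
```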

In [28]:
# now that the dataframe is ready (and quite clean) it's time to get down to business
X = df[[
    "is_melee",
    "average_damage",
    "hands_to_use",
    "bulk",
    "cost",
    "number_of_traits",
]]
y = df['is_advanced']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 101)
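Since advanced weapons are a small share of the rows, a plain random split can leave the test set with very few of them. `train_test_split` accepts a `stratify` argument that keeps the class ratio the same on both sides. A small sketch on synthetic labels (20% positives, chosen only for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 100 rows, 20 of them in the positive class.
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([1] * 20 + [0] * 80)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=101, stratify=y_demo
)

# Both sides keep the 20% positive rate.
print(y_tr.mean(), y_te.mean())
```

In this notebook that would mean adding `stratify = y` to the existing call.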

In [29]:
# scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

In [30]:
logmodel = LogisticRegression()
logmodel.fit(X_train, y_train)
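Because the features were standardized before fitting, the size of each coefficient gives a rough idea of how much that column matters. The sketch below shows the idea on synthetic data where, by construction, only the second column drives the label. The column names are borrowed from this notebook, but the numbers are random, so nothing here says anything about the real weapons.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

feature_names = ["is_melee", "average_damage", "hands_to_use",
                 "bulk", "cost", "number_of_traits"]

# Synthetic data: the label depends on column 1 plus some noise.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, len(feature_names)))
y_demo = (2.0 * X_demo[:, 1] + rng.normal(size=200) > 0.5).astype(int)

model = LogisticRegression()
model.fit(StandardScaler().fit_transform(X_demo), y_demo)

# Pair each coefficient with its column and rank by absolute size.
coef = pd.Series(model.coef_[0], index=feature_names)
ranked = coef.reindex(coef.abs().sort_values(ascending=False).index)
print(ranked)
```

With the real model the equivalent would be `pd.Series(logmodel.coef_[0], index=X.columns)`.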

In [31]:
predictions = logmodel.predict(X_test)

print(classification_report(y_test, predictions))

acc = accuracy_score(y_test, predictions)
print("\nModel acc: {:.2f}%".format(acc * 100))

              precision    recall  f1-score   support

         0.0       0.85      0.97      0.91        66
         1.0       0.60      0.21      0.32        14

    accuracy                           0.84        80
   macro avg       0.73      0.59      0.61        80
weighted avg       0.81      0.84      0.80        80


Model acc: 83.75%
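The report above can be unpacked into raw counts, which makes the accuracy easier to judge. The counts below are reconstructed by hand from the printed precision, recall and support values (I did not re-run `confusion_matrix`), so treat them as a worked arithmetic check rather than notebook output.

```python
# Counts that reproduce the printed report (reconstructed, not re-run):
tn, fp = 64, 2   # 66 non-advanced weapons in the test set
fn, tp = 11, 3   # 14 advanced weapons in the test set
total = tn + fp + fn + tp

accuracy = (tn + tp) / total            # 0.8375
baseline = (tn + fp) / total            # always predicting "not advanced"
recall_advanced = tp / (tp + fn)        # share of advanced weapons found
precision_advanced = tp / (tp + fp)     # share of "advanced" calls that are right

print(f"accuracy:  {accuracy:.4f}")
print(f"baseline:  {baseline:.4f}")
print(f"recall:    {recall_advanced:.2f}")
print(f"precision: {precision_advanced:.2f}")
```

So the model is only slightly ahead of always answering "no", and it finds 3 of the 14 advanced weapons.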


In [32]:
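# as usual I'm actually quite happy with 83.75% for my weirdo dataset
# to be fair, 66 of the 80 test rows are non-advanced, so always answering "no" would already score 82.5%,
# and recall on the advanced class is only 0.21
# I'm sure it would be better with a higher row count
# but as it is a quite chaotic real-world example, it is still quite impressive
# I actually wonder whether I would've scored better myself, even though I know the actual rules to classify these
# let's see that roc_auc score
# note: this scores the full, unscaled X (training rows included) with a model fit on scaled features,
# so it is only a rough number; a held-out version would use y_test and the scaled X_test
roc_auc_score(y, logmodel.predict_proba(X.values)[:, 1])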

np.float64(0.7876216001588248)
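For a ROC AUC that matches the accuracy above, the score should come from the held-out rows, passed through the same scaler the model was trained with. One way to make that hard to get wrong is to put the scaler and the model into a single pipeline. The sketch below uses synthetic data from `make_classification` (sizes picked to resemble this dataset), so its score says nothing about the weapons.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 265 rows, 6 features, roughly 20% positives.
X_demo, y_demo = make_classification(
    n_samples=265, n_features=6, n_informative=4, n_redundant=0,
    weights=[0.8, 0.2], random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=101, stratify=y_demo
)

# The pipeline scales inside fit/predict, so raw features can be passed in.
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_tr, y_tr)

# Score only the held-out rows.
test_auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
print(f"held-out ROC AUC: {test_auc:.3f}")
```

With the variables already in this notebook, the equivalent call would be `roc_auc_score(y_test, logmodel.predict_proba(X_test)[:, 1])`, since `X_test` has already been scaled.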

In [41]:
# yep still pretty nice, over 0.75
# let's try some custom/new values
tester_row = [
    1,    # is_melee
    4.5,  # average_damage
    1,    # hands_to_use
    1,    # bulk
    10,   # cost
    4     # number_of_traits
]
# this is definitely more over in the advanced category imo, but let's see

# reuse the training column names so the scaler gets the features it was fit on
tester_row = pd.DataFrame([tester_row], columns=X.columns)
tester_row = sc.transform(tester_row)
print(f"Probs by category:\n{logmodel.predict_proba(tester_row)}\n")

labels = ["No", "Yes"]
print(f"is it advanced?\n{labels[int(logmodel.predict(tester_row)[0])]}")

Probs by category:
[[0.13830344 0.86169656]]

is it advanced?
Yes




Final thoughts
Logistic regression seems useful where a yes/no outcome has to be decided very frequently, the rules behind it are not clearly known, AND perfect accuracy is not required.
An example I could think of is quality assurance for non-life-critical applications on organic products that don't have clearly defined parameters.

It was quite easy to use; the hard part is figuring out *where* to use it.

This exercise could possibly be improved by testing which columns have little to no impact, and by adding every trait as a separate 0/1 column instead of just a trait count.
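The per-trait 0/1 columns could be built with scikit-learn's `MultiLabelBinarizer`, which turns a list of label lists into one indicator column per label. A sketch with three made-up trait lists follows. In the real notebook the input would be each weapon's `data['traits']['value']`, and the `trait_` prefix is just my naming choice.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Illustrative trait lists, one per weapon.
trait_lists = [
    ["sweep", "versatile-p"],
    ["agile", "finesse"],
    ["agile", "sweep"],
]

mlb = MultiLabelBinarizer()
trait_matrix = mlb.fit_transform(trait_lists)

# One 0/1 column per trait, in alphabetical order of the trait names.
trait_df = pd.DataFrame(
    trait_matrix, columns=[f"trait_{name}" for name in mlb.classes_]
)
print(trait_df)
```

With only a few hundred rows, adding dozens of sparse trait columns could easily overfit, so it may be worth keeping only traits that appear on more than a handful of weapons.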
