# Data Preparation
Before training the model, we need to bring the raw data into a form it can use.

## Feature Selection
We drop source, type, alignment and languages because they have no influence on the CR of a monster.
We also drop speed, senses, traits, actions, bonus actions, reactions, legendary actions, mythic actions, lair actions, regional effects and environment because they do not matter in a "black-box" CR calculation like this one, which focuses on the raw stats of the monster.

In [1]:
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

#Loads existing dataset taken from https://5e.tools/bestiary.html that contains all monsters up to the latest manuals
old_ds=pd.read_csv("./Bestiary.csv",dtype={'Skills':str,'Saving Throws':str,"Damage Vulnerabilities":str,"AC":str})
num_rows=old_ds.shape[0]
training_ds=old_ds.copy()
training_ds=training_ds.drop(columns=["Source","Type","Speed","Alignment","Senses","Languages","Traits","Actions","Bonus Actions","Reactions","Legendary Actions","Mythic Actions","Lair Actions","Regional Effects","Environment"])
training_ds.head(4)

Unnamed: 0,Name,Size,AC,HP,Strength,Dexterity,Constitution,Intelligence,Wisdom,Charisma,Saving Throws,Skills,Damage Vulnerabilities,Damage Resistances,Damage Immunities,Condition Immunities,CR
0,Aarakocra,Medium,12,13 (3d8),10,14,10,11,12,11,,Perception +5,,,,,1/4 (50 XP)
1,Aarakocra Simulacrum,Medium,12,6 (3d4),10,14,10,11,12,11,,Perception +5,,,,,1/8 (25 XP)
2,Aarakocra Spelljammer,Medium,12 (15 with mage armor),40 (9d8),9,14,11,17,12,11,"Int +6, Wis +4","Arcana +6, History +6",,,,,"6 (2,300 XP)"
3,Aartuk Elder,Large,16 (natural armor),75 (10d10 + 20),18,10,15,12,14,12,,,,,,,2 (450 XP)


## Feature extraction
We turn the categorical Size column into numeric columns using CountVectorizer, which creates one count column for each distinct word found in the Size values.

In [2]:
from sklearn.feature_extraction.text import CountVectorizer

count_vect=CountVectorizer()
count_matrix = count_vect.fit_transform(training_ds["Size"])
training_ds=pd.concat([training_ds,pd.DataFrame(data=count_matrix.toarray(),columns=count_vect.get_feature_names_out())],axis=1)
training_ds=training_ds.drop(columns=["Size"])
training_ds.head(4)

Unnamed: 0,Name,AC,HP,Strength,Dexterity,Constitution,Intelligence,Wisdom,Charisma,Saving Throws,...,Damage Immunities,Condition Immunities,CR,gargantuan,huge,large,medium,or,small,tiny
0,Aarakocra,12,13 (3d8),10,14,10,11,12,11,,...,,,1/4 (50 XP),0,0,0,1,0,0,0
1,Aarakocra Simulacrum,12,6 (3d4),10,14,10,11,12,11,,...,,,1/8 (25 XP),0,0,0,1,0,0,0
2,Aarakocra Spelljammer,12 (15 with mage armor),40 (9d8),9,14,11,17,12,11,"Int +6, Wis +4",...,,,"6 (2,300 XP)",0,0,0,1,0,0,0
3,Aartuk Elder,16 (natural armor),75 (10d10 + 20),18,10,15,12,14,12,,...,,,2 (450 XP),0,0,1,0,0,0,0
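
The `or` column in the output above appears because CountVectorizer splits every Size value into words, so a value that joins two sizes with "or" contributes that word as well. The sketch below imitates the tokenization with the standard library only, reusing CountVectorizer's default token pattern. The sample size strings are hypothetical and not taken from the dataset.

```python
import re

# CountVectorizer's default token pattern: words of two or more characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

# Hypothetical Size values; the real dataset may word them differently.
sizes = ["Medium", "Large", "Small or Medium"]

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return TOKEN_PATTERN.findall(text.lower())

# The vocabulary is the sorted set of all tokens, one output column each.
vocabulary = sorted({tok for s in sizes for tok in tokenize(s)})

# One row of counts per Size value.
rows = [[tokenize(s).count(word) for word in vocabulary] for s in sizes]

print(vocabulary)
for size, row in zip(sizes, rows):
    print(size, row)
```

If the `or` column is unwanted, CountVectorizer accepts a `stop_words` list (for example `stop_words=["or"]`) that keeps such tokens out of the vocabulary.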


## Data Cleaning
The first step is to clean the columns we are going to use. For HP we keep only the average value and discard the dice formula used to compute it; if a monster's HP is expressed in any other way, we default it to 0. The same reasoning applies to AC and CR: we keep only the leading number and drop notes such as the armor type or the XP award. Fractional CRs (1/8, 1/4, 1/2) are converted to decimals, and a CR of "Unknown" is set to 1/2.
Finally, we ignore which specific saving throws, skills, vulnerabilities, resistances and immunities a monster has and keep only how many of each it has. We also add an Overall column holding the sum of the six ability scores.

In [3]:
overall_list=[]
for i in range(num_rows):
    #Keep the first standalone number in the HP string (the average); default to 0 if there is none
    hp_list = [int(s) for s in old_ds["HP"][i].split() if s.isdigit()]
    if hp_list:
        training_ds.loc[i,'HP'] = hp_list[0]
    else:
        training_ds.loc[i,'HP'] = 0
    #Sum of the six ability scores (named ability_total so it does not shadow the built-in sum)
    ability_total=training_ds["Strength"][i]+training_ds["Dexterity"][i]+training_ds["Constitution"][i]+training_ds["Intelligence"][i]+training_ds["Wisdom"][i]+training_ds["Charisma"][i]
    overall_list.append(ability_total)
    #Replace each list-valued column with the number of comma-separated entries (0 when missing)
    for col in ["Saving Throws","Skills","Damage Vulnerabilities","Damage Resistances","Damage Immunities","Condition Immunities"]:
        is_nan=pd.isna(training_ds.loc[i,col])
        if(is_nan):
            training_ds.loc[i,col]=0
        else:
            training_ds.loc[i,col]=len(old_ds[col][i].split(","))
    #CR: keep only the rating in front of the "(... XP)" part; a missing CR becomes 0
    is_nan=pd.isna(training_ds.loc[i,'CR'])
    if(is_nan):
        training_ds.loc[i,'CR']=0
    else:
        value=old_ds["CR"][i].split("(")[0].strip()
        if(value=="1/4"):
            training_ds.loc[i,'CR']=1/4
        elif(value=="1/8"):
            training_ds.loc[i,'CR']=1/8
        elif(value=="1/2"):
            training_ds.loc[i,'CR']=1/2
        elif(value=="Unknown"):
            #Unknown ratings are treated as CR 1/2
            training_ds.loc[i,'CR']=1/2
        else:
            training_ds.loc[i,'CR']=int(value)
    #AC: keep the leading number and drop notes such as "(natural armor)"; a missing AC becomes 0
    is_nan=pd.isna(training_ds.loc[i,'AC'])
    if(is_nan):
        training_ds.loc[i,'AC']=0
    else:
        value=old_ds["AC"][i].split(" ")[0].strip()
        training_ds.loc[i,'AC']=int(value)
training_ds["Overall"]=overall_list
training_ds.head(4)

Unnamed: 0,Name,AC,HP,Strength,Dexterity,Constitution,Intelligence,Wisdom,Charisma,Saving Throws,...,Condition Immunities,CR,gargantuan,huge,large,medium,or,small,tiny,Overall
0,Aarakocra,12,13,10,14,10,11,12,11,0,...,0,0.25,0,0,0,1,0,0,0,68
1,Aarakocra Simulacrum,12,6,10,14,10,11,12,11,0,...,0,0.125,0,0,0,1,0,0,0,68
2,Aarakocra Spelljammer,12,40,9,14,11,17,12,11,2,...,0,6.0,0,0,0,1,0,0,0,74
3,Aartuk Elder,16,75,18,10,15,12,14,12,0,...,0,2.0,0,0,1,0,0,0,0,81
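
The parsing rules used in the cell above can also be written as small standalone functions, which makes them easy to test on single strings. This is a minimal sketch rather than the notebook's own code: the helper names `leading_int`, `parse_cr` and `count_entries` are hypothetical, `leading_int` only looks at the start of the string (the cell above takes the first standalone number anywhere in the HP text), and missing values (NaN) are assumed to be handled by the caller before these functions are reached.

```python
import re
from fractions import Fraction

def leading_int(text, default=0):
    """Return the integer at the start of text, or default if there is none."""
    match = re.match(r"\s*(\d+)", text)
    return int(match.group(1)) if match else default

def parse_cr(text, unknown=0.5):
    """Convert a CR string such as '1/4 (50 XP)' to a float."""
    rating = text.split("(")[0].strip()
    if rating == "Unknown":
        return unknown  # same fallback value as the cell above
    return float(Fraction(rating))  # handles both '1/4' and '6'

def count_entries(text):
    """Count comma-separated entries; a blank string counts as zero."""
    return len(text.split(",")) if text.strip() else 0

# Strings taken from the sample rows shown earlier.
print(leading_int("75 (10d10 + 20)"))
print(leading_int("12 (15 with mage armor)"))
print(parse_cr("1/8 (25 XP)"))
print(parse_cr("6 (2,300 XP)"))
print(count_entries("Int +6, Wis +4"))
```

Using `Fraction` avoids one branch per fractional rating, at the cost of raising a `ValueError` on any rating that is neither a number, a fraction, nor "Unknown".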


## Feature scaling
HP can reach values far larger than any other feature, so to help the model we rescale it to the 0-30 range, which is roughly the range that AC and CR already fall in.

In [4]:
columns_to_scale=["HP"]
scaler=MinMaxScaler(feature_range=(0,30))
scaler.fit(training_ds[columns_to_scale])
training_ds[columns_to_scale] = scaler.transform(training_ds[columns_to_scale])

training_ds.head(4)

Unnamed: 0,Name,AC,HP,Strength,Dexterity,Constitution,Intelligence,Wisdom,Charisma,Saving Throws,...,Condition Immunities,CR,gargantuan,huge,large,medium,or,small,tiny,Overall
0,Aarakocra,12,0.576923,10,14,10,11,12,11,0,...,0,0.25,0,0,0,1,0,0,0,68
1,Aarakocra Simulacrum,12,0.266272,10,14,10,11,12,11,0,...,0,0.125,0,0,0,1,0,0,0,68
2,Aarakocra Spelljammer,12,1.775148,9,14,11,17,12,11,2,...,0,6.0,0,0,0,1,0,0,0,74
3,Aartuk Elder,16,3.328402,18,10,15,12,14,12,0,...,0,2.0,0,0,1,0,0,0,0,81
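
With `feature_range=(0,30)`, MinMaxScaler maps each value x to (x - min) / (max - min) * 30. The sketch below applies that formula by hand to the four sample rows. The HP bounds of 0 and 676 are an assumption: they are not stated in the notebook and were inferred from the scaled values in the output above.

```python
def min_max_scale(x, data_min, data_max, new_min=0.0, new_max=30.0):
    """Linearly map x from [data_min, data_max] to [new_min, new_max]."""
    fraction = (x - data_min) / (data_max - data_min)
    return new_min + fraction * (new_max - new_min)

# Assumed HP bounds, inferred from the notebook output rather than given by it.
HP_MIN, HP_MAX = 0, 676

# Cleaned HP values of the four sample rows.
for hp in (13, 6, 40, 75):
    print(hp, round(min_max_scale(hp, HP_MIN, HP_MAX), 6))
```

Because the scaler is fitted on the whole dataset, a single extreme HP value stretches the range and pushes most monsters toward the low end of the 0-30 scale, as the sample rows show.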


In [5]:
training_ds.to_csv("prepared_dataset.csv",index=False, header=True)