# Dungeons and Dragons Dataset Analysis

## Project Outline
Questions to answer:
- What classes are most popular?
- What subclasses are most popular?
- What races are most popular?

1. Get the data (Python)
    1. Use ```requests``` package to get data from GitHub
    2. Use ```json``` package to parse data
    3. Use ```pandas``` package to create a data frame and save as .csv file
2. Clean / prepare the data (Python)
    1. Look at descriptive statistics
    2. Investigate unexpected values
3. Analyse the data (Python)
4. Visualize the data in a dashboard (Tableau)


## Get the data
We are using a dataset provided as JSON from https://github.com/oganm/dnddata

Since the dataset is quite large (more than 7000 unique characters as of February 2022), we will first download it and save it locally. This way, we don't have to download the data again every time we run an analysis, unless we want to update it.

### Setting up the environment
Here we import all the packages we want to work with in this part.

In [102]:
# Import packages for analysis.
import requests
import json
import pandas as pd
import numpy as np
import scipy.stats as stats

After importing all packages, we can now use ```requests``` to send a GET request to the URL and save the JSON response locally using the ```json``` package.

In [2]:
# Download the json from https://github.com/oganm/dnddata

url = "https://raw.githubusercontent.com/oganm/dnddata/master/data-raw/dnd_chars_unique.json"

response = requests.get(url)
response.raise_for_status()  # stop early if the download failed

data = json.loads(response.text)

# Save the json file locally

with open("json_data.json", "w") as f:
    json.dump(data, f, indent=2)

Next, we **load the .json** file we just downloaded. 

This code chunk is separate so it can be run individually if the JSON has already been downloaded.

In [2]:
# open local json file

with open("json_data.json", "r") as f:
    data = json.load(f)
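
The download and load cells can also be combined into one helper that only touches the network when the local file is missing. This is a minimal sketch, not part of the original notebook: the function names `fetch_json` and `load_characters` and the `refresh` flag are hypothetical, and the `fetch` parameter exists only so the download step can be swapped out.

```python
import json
from pathlib import Path

URL = "https://raw.githubusercontent.com/oganm/dnddata/master/data-raw/dnd_chars_unique.json"


def fetch_json(url):
    """Download and parse JSON from a URL, raising on HTTP errors."""
    import requests  # imported lazily: only needed when we actually download

    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return response.json()


def load_characters(path="json_data.json", url=URL, fetch=fetch_json, refresh=False):
    """Return the dataset, downloading only if the file is missing or refresh=True."""
    path = Path(path)
    if refresh or not path.exists():
        data = fetch(url)
        path.write_text(json.dumps(data, indent=2), encoding="utf-8")
        return data
    return json.loads(path.read_text(encoding="utf-8"))
```

Calling `load_characters()` would then reuse `json_data.json` when it exists, and `load_characters(refresh=True)` would update it.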

Now we go through the JSON data and build a table from it. Here we select the attributes that are interesting for the analysis and leave out the rest. Finally, we create a data frame from the result. At this point, we could continue the analysis in Python if we wanted to.

In [4]:
# make a Data Frame
data_dict = {
    "Name": [],
    "Race": [],
    "Class_1": [],
    "Subclass_1": [],
    "Class_level_1": [],
    "Class_2": [],
    "Subclass_2": [],
    "Class_level_2": [],
    "Num_of_classes": [],
    "Total_lvl": [],
    "Alignment": [],
    "Skills": [],
    "Feats": [],
    "HP": [],
    "AC": [],
    "Str": [], "Dex": [], "Con": [], "Int": [], "Wis": [], "Cha": [],
    "Spellcaster": [],
    "Num_of_spells": []
}

# loop through the characters in the json and pick out the attributes for analysis
for char in data:
    # name
    char_name = data[char]["name"]["alias"][0]
    data_dict["Name"].append(char_name)

    # race
    char_race = data[char]["race"]["processedRace"][0]
    data_dict["Race"].append(char_race)

    # classes
    main_class = list(data[char]["class"])[0]
    class1 = data[char]["class"][main_class]["class"][0]
    subclass1 = data[char]["class"][main_class]["subclass"][0]
    class_lvl1 = data[char]["class"][main_class]["level"][0]
    data_dict["Class_1"].append(class1)
    data_dict["Subclass_1"].append(subclass1)
    data_dict["Class_level_1"].append(class_lvl1)

    if len(data[char]["class"]) > 1:
        second_class = list(data[char]["class"])[1]
        class2 = data[char]["class"][second_class]["class"][0]
        subclass2 = data[char]["class"][second_class]["subclass"][0]
        class_lvl2 = data[char]["class"][second_class]["level"][0]
    else:
        class2 = None
        subclass2 = None
        class_lvl2 = None

    data_dict["Class_2"].append(class2)
    data_dict["Subclass_2"].append(subclass2)
    data_dict["Class_level_2"].append(class_lvl2)
        
    # number of classes
    data_dict["Num_of_classes"].append(len(data[char]["class"]))

    # character level
    total_lvl = data[char]["level"][0]
    data_dict["Total_lvl"].append(total_lvl)
    
    # alignment
    if data[char]["alignment"]["processedAlignment"][0]:
        alignment = data[char]["alignment"]["processedAlignment"][0]
    else:
        alignment = None
    data_dict["Alignment"].append(alignment)

    # skills / proficiencies, joined into one "; "-separated string
    data_dict["Skills"].append("; ".join(data[char]["skills"]))

    # feats, joined the same way
    data_dict["Feats"].append("; ".join(data[char]["feats"]))

    # HP 
    hp = data[char]["HP"][0]
    data_dict["HP"].append(hp)

    # AC
    ac = data[char]["AC"][0]
    data_dict["AC"].append(ac)

    # attributes
    for atr in ["Str", "Dex", "Con", "Int", "Wis", "Cha"]:
        val = data[char]["attributes"][atr][0]
        data_dict[atr].append(val)

    # spellcaster: True if the character has any spells listed
    data_dict["Spellcaster"].append(bool(data[char]["spells"]))

    # number of spells
    data_dict["Num_of_spells"].append(len(data[char]["spells"]))


df = pd.DataFrame(data=data_dict)
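
The loop above indexes straight into nested values with `[...][0]`, which raises an error if a key is absent or a list is empty. Whether the real dataset contains such records is not something this notebook establishes, so the following is only a defensive sketch. The helpers `first_or_none` and `dig` and the `sample` record are hypothetical; the record merely mimics the shape of the fields accessed above.

```python
def first_or_none(value):
    """Return the first element of a non-empty list, otherwise None."""
    if isinstance(value, list) and value:
        return value[0]
    return None


def dig(record, *keys):
    """Follow nested dict keys, returning None if any key is missing."""
    current = record
    for key in keys:
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


# Hypothetical record, shaped like the fields accessed in the loop above.
sample = {
    "name": {"alias": ["brave_turing"]},
    "race": {"processedRace": ["Elf"]},
    "alignment": {"processedAlignment": []},  # empty list: no alignment given
    "level": [3],
}

row = {
    "Name": first_or_none(dig(sample, "name", "alias")),
    "Race": first_or_none(dig(sample, "race", "processedRace")),
    "Alignment": first_or_none(dig(sample, "alignment", "processedAlignment")),
    "Total_lvl": first_or_none(dig(sample, "level")),
    "HP": first_or_none(dig(sample, "HP")),  # key absent in the sample
}
print(row)
```

Missing or empty fields end up as `None`, which pandas turns into missing values when the data frame is built.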

Now we can export the data frame we created as a .csv to use for further analysis using other tools.

In [5]:
df.to_csv("dnd_data.csv")
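
A side effect worth knowing: by default `to_csv` also writes the row index as an unnamed first column, which is why an `Unnamed: 0` column shows up in the tables further down. The sketch below uses a made-up two-row frame to show the difference that `index=False` makes; the outputs in this notebook were produced without it.

```python
import io

import pandas as pd

# Made-up frame, only to demonstrate the round trip.
toy = pd.DataFrame({"Name": ["a", "b"], "Total_lvl": [1, 5]})

# Default: the row index is written as an extra, unnamed first column.
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()

# index=False: only the real columns are written.
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()

print(cols_with_index)     # includes "Unnamed: 0"
print(cols_without_index)  # only "Name" and "Total_lvl"
```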

## Clean the data
Let's open the .csv file and have a look at some descriptive statistics!

In [75]:
# open .csv file saved locally and create a dataframe

df = pd.read_csv("dnd_data.csv")
df.describe()

Unnamed: 0.1,Unnamed: 0,Class_level_1,Class_level_2,Num_of_classes,Total_lvl,HP,AC,Str,Dex,Con,Int,Wis,Cha,Num_of_spells
count,7110.0,7110.0,790.0,7110.0,7110.0,7110.0,7110.0,7110.0,7110.0,7110.0,7110.0,7110.0,7110.0,7110.0
mean,3554.5,4.538397,3.259494,1.132489,4.95007,45.299297,15.342897,12.804641,14.64993,14.331786,12.001547,13.13052,13.166948,3.925176
std,2052.624539,3.884045,2.819856,0.495186,4.220756,63.486935,3.6028,4.008986,3.178642,2.611107,3.213572,3.194472,3.69301,7.27204
min,0.0,1.0,1.0,1.0,1.0,-6.0,7.0,1.0,3.0,4.0,0.0,1.0,0.0,0.0
25%,1777.25,1.0,1.0,1.0,2.0,14.0,13.0,10.0,13.0,13.0,10.0,11.0,10.0,0.0
50%,3554.5,4.0,3.0,1.0,4.0,31.5,15.0,12.0,15.0,14.0,12.0,13.0,13.0,0.0
75%,5331.75,6.0,4.0,1.0,6.0,56.0,17.0,16.0,17.0,16.0,14.0,15.0,16.0,6.0
max,7109.0,20.0,19.0,14.0,20.0,3764.0,222.0,103.0,101.0,103.0,99.0,100.0,99.0,98.0


Using the ```describe()``` function we can see a lot of interesting information about our data! For anyone familiar with the game, some of these numbers are odd if not impossible.

Odd things I noticed (there may be more):
1. lowest HP below 0
2. highest HP at over 3000!
3. highest AC at over 200!!!
4. Some stats have very low minimums (especially "Int" and "Cha" with 0)
5. All stats have maximums around 100

In [24]:
# check for characters with HP lower than their primary class level (this should not be possible)
df[df["HP"] < df["Class_level_1"]]

Unnamed: 0.1,Unnamed: 0,Name,Race,Class_1,Subclass_1,Class_level_1,Class_2,Subclass_2,Class_level_2,Num_of_classes,...,HP,AC,Str,Dex,Con,Int,Wis,Cha,Spellcaster,Num_of_spells
578,578,wizardly_mendel,Orc,Wizard,School of Abjuration,10,,,,1,...,6,10,6,4,5,20,4,4,False,0
858,858,dazzling_blackburn,Kenku,Monk,Way of the Open Hand,8,,,,1,...,-1,20,13,20,14,11,20,11,False,0
1618,1618,trusting_northcutt,Elf,Druid,Circle of the Moon,6,,,,1,...,-5,17,8,16,15,8,18,10,True,12
1649,1649,amazing_dewdney,Human,Cleric,Life Domain,6,,,,1,...,0,21,16,8,14,10,16,10,True,10
1869,1869,gifted_engelbart,Tiefling,Sorcerer,Wild Magic,5,,,,1,...,0,11,8,13,15,13,8,17,True,18
2056,2056,charming_antonelli,Human,Barbarian,Path of the Totem Warrior,5,,,,1,...,1,17,18,13,18,13,12,9,False,0
2170,2170,wizardly_fermat,Aasimar,Sorcerer,Wild Magic,5,,,,1,...,4,11,8,13,16,12,10,18,True,9
2753,2753,laughing_payne,Changeling,Rogue,Assassin,4,,,,1,...,-6,16,10,20,12,12,15,16,False,0
4817,4817,sad_proskuriakova,Human,Bard,,1,,,,1,...,0,11,11,11,11,11,11,11,True,6
5362,5362,naughty_hofstadter,Tabaxi,Bard,,1,,,,1,...,0,13,11,16,11,12,11,15,True,6


It seems we have found a **problem with our dataset**. There is only one HP value per character, and different users may have used it differently: it could hold **either maximum HP or current HP**. Analysing this value would probably require additional information, or guessing how each user interpreted the field. **We will leave it out of the analysis**.
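
One detail of the check above: it compares HP with `Class_level_1`, which for multiclass characters is lower than their total level. A stricter variant would compare against `Total_lvl`. The sketch below illustrates the difference on two made-up characters; it is an alternative to consider, not what the notebook ran.

```python
import pandas as pd

# Made-up characters: one single-class, one multiclass (2 + 3 levels).
toy = pd.DataFrame({
    "Name": ["single", "multi"],
    "Class_level_1": [5, 2],
    "Class_level_2": [None, 3],
    "Total_lvl": [5, 5],
    "HP": [3, 4],
})

# Check used in the notebook: HP below the first class level.
by_first_class = toy[toy["HP"] < toy["Class_level_1"]]

# Stricter variant: HP below the total character level.
by_total_level = toy[toy["HP"] < toy["Total_lvl"]]

print(by_first_class["Name"].tolist())
print(by_total_level["Name"].tolist())
```

The multiclass character is only caught by the second filter, because its HP of 4 is above its first class level of 2 but below its total level of 5.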

In [36]:
# check the odd stats

df[(df["Str"] > 35) | (df["AC"] > 50) | (df["HP"] > 500) | (df["Int"] == 0) | (df["Cha"] == 0)]

Unnamed: 0.1,Unnamed: 0,Name,Race,Class_1,Subclass_1,Class_level_1,Class_2,Subclass_2,Class_level_2,Num_of_classes,...,HP,AC,Str,Dex,Con,Int,Wis,Cha,Spellcaster,Num_of_spells
66,66,sharp_almeida,Aarakocra,Barbarian,Path of the Battlerager,20,,,,1,...,1105,66,103,101,103,99,100,99,False,0
86,86,objective_blackwell,,Wizard,School of Necromancy,20,,,,1,...,92,18,9,14,15,21,21,0,True,1
472,472,affectionate_jennings,Human,Fighter,Champion,10,,,,1,...,3764,222,30,22,26,17,34,26,True,6
2967,2967,tender_shaw,Half-Orc,Fighter,Champion,4,,,,1,...,23,16,16,14,16,0,9,16,False,0


We find two characters whose AC and HP are especially odd (rows 66 and 472). They would distort any calculations, so we will remove those two rows.

The characters with especially low stats are weird, but so are D&D games, so we will keep them in.
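
The thresholds in the filter above can also be kept in one place, so they are easy to adjust. This is a sketch under assumptions: the helper name `flag_suspicious` and the constant names are hypothetical, the limits are simply the ones used in the cell above (judgment calls, not game rules), and the last row of the toy frame is made up.

```python
import pandas as pd

# Thresholds copied from the filter above; they are judgment calls, not game rules.
UPPER_LIMITS = {"Str": 35, "AC": 50, "HP": 500}
ZERO_CHECKS = ["Int", "Cha"]


def flag_suspicious(df):
    """Return a boolean Series marking rows that break any threshold."""
    mask = pd.Series(False, index=df.index)
    for column, limit in UPPER_LIMITS.items():
        mask |= df[column] > limit
    for column in ZERO_CHECKS:
        mask |= df[column] == 0
    return mask


# First three rows use values from the table above; "ordinary" is made up.
toy = pd.DataFrame({
    "Name": ["sharp_almeida", "affectionate_jennings", "tender_shaw", "ordinary"],
    "HP": [1105, 3764, 23, 31],
    "AC": [66, 222, 16, 15],
    "Str": [103, 30, 16, 12],
    "Int": [99, 17, 0, 12],
    "Cha": [99, 26, 16, 13],
})

flagged = toy.loc[flag_suspicious(toy), "Name"].tolist()
print(flagged)
```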

We do however happen to see something else: One of the characters does not have a "Race" entry. Let's investigate that further.

In [93]:
# check for NA in Race

print(f'Entries missing "Race": {len(df[df["Race"].isna()])} ({round((len(df[df["Race"].isna()]) / len(df)) * 100, 2)}%)')

Entries missing "Race": 156 (2.19%)


Since one of the main questions we want to answer is which race is most popular, and the missing entries make up only about 2% of the data, we will remove them.
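
The count and percentage above (156 of 7110 rows, and 156 / 7110 ≈ 2.19%) can also be read directly off the boolean mask, since the sum of a boolean Series is the number of `True` values and its mean is their share. A small sketch on a made-up column:

```python
import pandas as pd

# Made-up column with one missing entry out of four.
toy = pd.DataFrame({"Race": ["Elf", None, "Human", "Orc"]})

missing = toy["Race"].isna()
n_missing = int(missing.sum())   # number of missing entries
share = missing.mean() * 100     # percentage of missing entries

print(f'Entries missing "Race": {n_missing} ({share:.2f}%)')
```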

In [97]:
# drop the values we decided to exclude

df_clean = df[(df["AC"] < 50) & (df["Race"].notna())]

df_clean.describe()

Unnamed: 0.1,Unnamed: 0,Class_level_1,Class_level_2,Num_of_classes,Total_lvl,HP,AC,Str,Dex,Con,Int,Wis,Cha,Num_of_spells
count,6952.0,6952.0,762.0,6952.0,6952.0,6952.0,6952.0,6952.0,6952.0,6952.0,6952.0,6952.0,6952.0,6952.0
mean,3554.689298,4.529776,3.238845,1.131185,4.934264,44.447066,15.291715,12.791283,14.630898,14.311421,11.994822,13.114212,13.152474,3.917578
std,2049.869031,3.875004,2.785954,0.496623,4.206252,43.584878,2.556523,3.839953,3.013395,2.383127,3.030935,3.009631,3.539578,7.219793
min,0.0,1.0,1.0,1.0,1.0,-6.0,7.0,1.0,3.0,4.0,0.0,1.0,2.0,0.0
25%,1780.75,1.0,1.0,1.0,2.0,14.0,13.0,10.0,13.0,13.0,10.0,11.0,10.0,0.0
50%,3558.5,4.0,2.0,1.0,4.0,31.0,15.0,12.0,15.0,14.0,12.0,13.0,13.0,0.0
75%,5334.25,6.0,4.0,1.0,6.0,56.0,17.0,16.0,17.0,16.0,14.0,15.0,16.0,6.0
max,7109.0,20.0,19.0,14.0,20.0,444.0,37.0,30.0,30.0,38.0,30.0,46.0,32.0,98.0


These values look a lot better! The row count drops from 7110 to 6952: the 156 entries without a race and the two outliers are gone. There are still some very high and low numbers, but they seem realistic enough to happen in a game of D&D. We also no longer have the missing values that might have interfered with visualization.

## Analyse the data
So let's see if we can get some more insights.
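
A natural starting point for the popularity questions from the outline is `value_counts()`. The following is a minimal sketch on made-up data, not a result from the dataset. It assumes popularity is measured by the primary class in `Class_1` only, which ignores second classes.

```python
import pandas as pd

# Made-up sample; the real analysis would use the cleaned data frame instead.
toy = pd.DataFrame({
    "Class_1": ["Fighter", "Wizard", "Fighter", "Rogue", "Fighter", "Wizard"],
})

# Absolute counts, most frequent first.
class_counts = toy["Class_1"].value_counts()

# Shares in percent, rounded to one decimal place.
class_share = toy["Class_1"].value_counts(normalize=True).mul(100).round(1)

print(class_counts)
print(class_share)
```

The same pattern would apply to `Subclass_1` and `Race`.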