# Data Analysis and D&D

I'm starting a blog based on advice from two books I've been reading, *Designing Your Life* by Bill Burnett and Dave Evans and *Big Magic* by Elizabeth Gilbert. In *Designing Your Life*, the authors encourage readers to prototype their passions and take exploratory steps toward finding work they enjoy. In *Big Magic*, Gilbert encourages readers not to be afraid of their creativity and asks, "do you have the courage to seek the treasures inside of you?"  This is my first step.  

For years, I've thought being a data scientist would be an amazing job.  I would explore unique data sets and discover amazing insights that only I would be able to find.  I was initially inspired by some of the more "pop culture" centered posts on Nate Silver's 538.  While the political and sports analysis was okay, what I found interesting were the more relaxed blog posts on the [worst board games invented](https://fivethirtyeight.com/features/the-worst-board-games-ever-invented/) or the 5 different types of [Nicolas Cage Movies](https://fivethirtyeight.com/features/the-five-types-of-nicolas-cage-movies/).

While this shifted my dream from data scientist to data driven story teller, I still didn't have a plan. I'd take some free online courses, take notes, and then stop.  I told myself once I knew a little more, or once I finished another class, I'd be ready to make my first real step.  But really, I was scared.  I felt like I could do a little bit of data analysis, but I would never be able to do anything "real."

But that stops today with this blog.  I'm starting this blog to hold myself accountable in becoming a data driven story teller.  The best way to learn something is to be able to explain and teach it.  I might not be right all the time in my explanations, but I have to start somewhere. 

One of the reasons I didn't do well in the past (besides fear) is because I stayed within predefined tutorials on topics I didn't find interesting.  Not only did the tutorials limit my growth by providing scaffolds for getting code up and running, but I'd lose focus on data sets I didn't care about.  So now, I'm going to take Gilbert's advice and explore topics I'm curious about.  

## Curiosity = D&D

Right now, I'm interested in Dungeons and Dragons.  I think it is fascinating how the narrative develops between the players and the Dungeon Master and how anything is possible.  In particular, I've been watching Matt Colville's "[Running the Game](https://www.youtube.com/channel/UCkVdb9Yr8fc05_VbAVfskCA)" and have been impressed by how he develops his stories and how he gives his players little pushes if they need them.


For example, in one of his first episodes, Matt recommends that if players are hesitant to save the blacksmith's kidnapped son from goblins, an elderly grandmother with a walking stick should volunteer to go on the rescue mission.  Chances are the players will feel enough shame that they will go instead.  This is brilliant.  The players still have free choice, but Matt provides a helpful nudge to get the adventure started.  
 
So if I'm interested in D&D and data analysis, by combining the two I can make sure I stay engaged.

To start off, I created a Python script that generates random ability scores (strength, dexterity, etc.) for 5th edition D&D characters using three different methods:
* Summing 3d6 (Read as: rolling a 6 sided die 3 times)
* Summing 4d6 dropping the lowest value
* Summing 4d6 dropping the lowest value, and requiring that at least two scores are greater than or equal to 15 (Colville Method)

Matt says he likes his method because it helps players discover their characters.  Maybe the player wasn't sure what type of character to play, but seeing a high ability score in strength and a low score in intelligence pushes them towards a Barbarian.  And because his method requires at least two stats of 15 or higher, it ensures that there is something about the character that makes them unique.
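The three rolling methods can be sketched in a few lines of plain Python.  This is a minimal illustration, not the script that produced the data set used below: the function names and the seed are made up, and it assumes the Colville rule is enforced by rerolling the whole set of six scores until at least two are 15 or higher.

```python
import random


def roll_3d6(rng):
    """Sum of three six-sided dice."""
    return sum(rng.randint(1, 6) for _ in range(3))


def roll_4d6_drop_low(rng):
    """Roll four six-sided dice and sum the highest three."""
    rolls = [rng.randint(1, 6) for _ in range(4)]
    return sum(rolls) - min(rolls)


def roll_colville(rng, n_stats=6):
    """Assumed rule: reroll the full set until two or more scores are >= 15."""
    while True:
        stats = [roll_4d6_drop_low(rng) for _ in range(n_stats)]
        if sum(score >= 15 for score in stats) >= 2:
            return stats


rng = random.Random(20)  # seed picked only so the sketch is repeatable
three_d6 = [roll_3d6(rng) for _ in range(6)]
four_d6 = [roll_4d6_drop_low(rng) for _ in range(6)]
colville = roll_colville(rng)
print(three_d6, four_d6, colville)
```

Passing the random number generator into each function keeps the rolls repeatable, which makes it easier to compare the methods on equal footing.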

So now that I have generated a bunch of data reflecting different character creation methods, the question is, how much better are Matt's generated characters than the other methods?  This is just a starting point, as I can explore other questions like: 
* What is the correlation between stats (holding roll_type constant)? It should be near zero, but can I confirm?
* How about correlation between ability and ability modifier? 
* Who was the strongest character generated? The weakest?
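
For the second question, it helps to remember how a modifier comes from a score in 5th edition: subtract 10, halve, and round down.  The sketch below uses a hypothetical helper name, and the values it prints agree with the score and modifier pairs in the data shown later in this post.  Because the modifier is a step function of the score, I'd expect the correlation to be very high but not exactly 1.

```python
def ability_modifier(score):
    """5th edition rule: (score - 10) / 2, rounded down."""
    return (score - 10) // 2  # floor division also rounds negatives down


for score in (4, 9, 10, 14, 18):
    print(score, ability_modifier(score))
# prints 4 -3, 9 -1, 10 0, 14 2, 18 4 (one pair per line)
```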

But before I get into questions describing the data, I need to crawl before I can walk.  This blog post will document my beginning steps including loading and indexing the data. 

## Reading & Indexing Data

In [1]:
# import functions
import pandas as pd

In [2]:
#read in data frame
df = pd.read_csv('1000CharSimulated20seed.csv', 
                 index_col=0) #Setting index_col at 0 cleans up the data frame

In [3]:
#confirm data frame loaded as expected using the head method
df.head()

Unnamed: 0,roll_type,strength,dexterity,constitution,intellegence,wisdom,charisma,str mod,dex mod,con mod,int mod,wis mod,char mod
0,3D6,14,15,9,7,6,9,2,2,-1,-2,-2,-1
1,3D6,13,7,15,8,15,7,1,-2,2,-1,2,-2
2,3D6,8,4,11,12,10,5,-1,-3,0,1,0,-3
3,3D6,10,8,12,8,15,15,0,-1,1,-1,2,2
4,3D6,11,11,12,9,12,8,0,0,1,-1,1,-1
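
What index_col=0 "cleans up" is easier to see on a tiny example.  The sketch below assumes the CSV was saved with the row numbers as an unnamed first column (which the "Unnamed: 0" label hints at); the two-row CSV text is a stand-in, not the real file.

```python
import io

import pandas as pd

# Stand-in CSV: the first column has no header, like a saved row index.
csv_text = ",roll_type,strength\n0,3D6,14\n1,3D6,13\n"

without_index = pd.read_csv(io.StringIO(csv_text))
with_index = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(without_index.columns))  # ['Unnamed: 0', 'roll_type', 'strength']
print(list(with_index.columns))     # ['roll_type', 'strength']
```

Without index_col=0, pandas treats the saved row numbers as an ordinary data column and adds a fresh index on top of it.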


In [4]:
#confirm data frame loaded as expected using the tail method
df.tail(6)

Unnamed: 0,roll_type,strength,dexterity,constitution,intellegence,wisdom,charisma,str mod,dex mod,con mod,int mod,wis mod,char mod
2994,Colville,13,9,12,14,18,17,1,-1,1,2,4,3
2995,Colville,13,17,16,15,11,12,1,3,3,2,0,1
2996,Colville,15,15,8,14,9,10,2,2,-1,2,-1,0
2997,Colville,9,15,11,14,12,15,-1,2,0,2,1,2
2998,Colville,15,8,16,10,10,13,2,-1,3,0,0,1
2999,Colville,10,16,18,14,12,14,0,3,4,2,1,2


In [5]:
#use sample method to spot check that the data looks okay
df.sample(10)

Unnamed: 0,roll_type,strength,dexterity,constitution,intellegence,wisdom,charisma,str mod,dex mod,con mod,int mod,wis mod,char mod
2187,Colville,12,16,11,12,15,13,1,3,0,1,2,1
2927,Colville,15,13,16,10,15,12,2,1,3,0,2,1
639,3D6,13,10,14,10,7,10,1,0,2,0,-2,0
1921,4D6DropLow,12,12,10,9,15,15,1,1,0,-1,2,2
70,3D6,13,14,12,12,9,12,1,2,1,1,-1,1
615,3D6,10,7,7,10,12,9,0,-2,-2,0,1,-1
2728,Colville,10,17,16,11,16,12,0,3,3,0,3,1
1117,4D6DropLow,13,13,10,17,14,15,1,1,0,3,2,2
932,3D6,10,8,11,8,12,15,0,-1,0,-1,1,2
136,3D6,12,14,11,13,6,6,1,2,0,1,-2,-2


# Indexing

One of the reasons I've struggled to retain Python knowledge is my confusion over the syntax for indexing (the selection of different values).  I never remember the syntax, which leads to confusion and to my code not working how I want it to.  So I'm going to explore three different ways of selecting from a pandas data frame:  
* bracket based indexing
* .iloc - integer based indexing
* .loc - label based indexing

Let's focus on the bracket method first. 

In [6]:
#selecting the 'strength' column 
df['strength']

0       14
1       13
2        8
3       10
4       11
5        8
6        9
7       10
8        8
9        7
10       8
11      13
12      11
13      12
14       7
15      10
16      11
17      17
18      10
19       6
20      11
21      10
22       8
23      13
24      12
25       7
26      10
27      12
28       8
29      11
        ..
2970    14
2971    15
2972    10
2973    15
2974    13
2975    16
2976    15
2977    11
2978    12
2979    17
2980    13
2981    15
2982    15
2983    10
2984    15
2985    16
2986    15
2987    14
2988    15
2989     9
2990    16
2991    17
2992    16
2993    15
2994    13
2995    13
2996    15
2997     9
2998    15
2999    10
Name: strength, Length: 3000, dtype: int64

In [8]:
#selecting the strength, constitution, and wisdom column
df[ ['strength','constitution','wisdom'] ]

Unnamed: 0,strength,constitution,wisdom
0,14,9,6
1,13,15,15
2,8,11,10
3,10,12,15
4,11,12,12
5,8,5,16
6,9,5,7
7,10,16,17
8,8,6,11
9,7,8,11


The above code is one of the reasons why I have gotten tripped up with indexing in the past.  I find the double brackets confusing as it is not clear what is going where.   

For example, the below code throws an error:
df['strength','constitution','wisdom'] 

To the best of my knowledge, in order to index more than one column, we need to pass a list of the column names, and then subset with that list.  

The list:
['strength','constitution','wisdom']

How to subset:
df[]

Putting the two together:
df[ ['strength','constitution','wisdom'] ]

While I was learning this, I found it helpful to have extra spaces between the list and subsetting.  This reminds me that we are using a single list inside brackets.   
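
Here is a small runnable sketch of the same idea.  The toy frame is a stand-in for the real data (its values are copied from the first three rows shown earlier), and the variable names are made up for illustration.

```python
import pandas as pd

# Tiny stand-in frame so the example runs without the CSV file.
toy = pd.DataFrame({
    'strength': [14, 13, 8],
    'constitution': [9, 15, 11],
    'wisdom': [6, 15, 10],
    'charisma': [9, 7, 5],
})

cols = ['strength', 'constitution', 'wisdom']  # the list
subset = toy[cols]                              # the subsetting
print(subset.shape)  # (3, 3)

# Without the inner list, pandas looks for ONE column whose name is the
# tuple ('strength', 'constitution', 'wisdom'), and no such column exists.
try:
    toy['strength', 'constitution', 'wisdom']
    raised = False
except KeyError:
    raised = True
print(raised)  # True
```

Naming the list first, as with cols here, is another way to keep the two sets of brackets apart.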

In [9]:
#selecting rows 0 through 4
df[0:5]

Unnamed: 0,roll_type,strength,dexterity,constitution,intellegence,wisdom,charisma,str mod,dex mod,con mod,int mod,wis mod,char mod
0,3D6,14,15,9,7,6,9,2,2,-1,-2,-2,-1
1,3D6,13,7,15,8,15,7,1,-2,2,-1,2,-2
2,3D6,8,4,11,12,10,5,-1,-3,0,1,0,-3
3,3D6,10,8,12,8,15,15,0,-1,1,-1,2,2
4,3D6,11,11,12,9,12,8,0,0,1,-1,1,-1


The above code selects rows 0 through 4 and, because no columns are named, every column.  The ":" separates where the slice starts from where it stops.  The start can be left off, so df[:5] will do the exact same thing.  Notice that the slice is not inclusive: Python goes up to row 5 but does not include it.  
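
A short sketch of the slicing rules on a toy frame (the strength values are the first seven from the data; the variable names are illustrative):

```python
import pandas as pd

toy = pd.DataFrame({'strength': [14, 13, 8, 10, 11, 8, 9]})

first_five = toy[0:5]  # positions 0, 1, 2, 3, 4 (5 is excluded)
same_thing = toy[:5]   # a missing start defaults to 0
the_rest = toy[5:]     # a missing end runs through the last row

print(list(first_five.index))         # [0, 1, 2, 3, 4]
print(first_five.equals(same_thing))  # True
print(list(the_rest.index))           # [5, 6]
```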

In [10]:
#selecting rows 112 through 115 (non inclusive)
df[112:115]

Unnamed: 0,roll_type,strength,dexterity,constitution,intellegence,wisdom,charisma,str mod,dex mod,con mod,int mod,wis mod,char mod
112,3D6,15,10,12,9,13,13,2,0,1,-1,1,1
113,3D6,12,13,12,6,12,15,1,1,1,-2,1,2
114,3D6,8,10,10,11,8,9,-1,0,0,0,-1,-1


In [11]:
#selecting rows 2995 until the end
df[2995:]

Unnamed: 0,roll_type,strength,dexterity,constitution,intellegence,wisdom,charisma,str mod,dex mod,con mod,int mod,wis mod,char mod
2995,Colville,13,17,16,15,11,12,1,3,3,2,0,1
2996,Colville,15,15,8,14,9,10,2,2,-1,2,-1,0
2997,Colville,9,15,11,14,12,15,-1,2,0,2,1,2
2998,Colville,15,8,16,10,10,13,2,-1,3,0,0,1
2999,Colville,10,16,18,14,12,14,0,3,4,2,1,2


In [12]:
#so if we can select the columns and the rows, we should be able to combine them to do them at the same time. 

#select dexterity column and row 250
df['dexterity'][250]

9

In [13]:
#select intellegence and wisdom from rows 2890 through 2895
df[ ['intellegence','wisdom'] ][2890:2895]

Unnamed: 0,intellegence,wisdom
2890,8,17
2891,18,11
2892,9,16
2893,11,16
2894,17,9


In [14]:
#But something I get confused on with bracket indexing is that the order doesn't matter.  For example both of the below return
#the same thing

df[ ['intellegence','wisdom'] ][2890:2895]
df[2890:2895][ ['intellegence','wisdom'] ]

Unnamed: 0,intellegence,wisdom
2890,8,17
2891,18,11
2892,9,16
2893,11,16
2894,17,9


# .iloc and .loc indexing

So now that we are done with bracket indexing, let's look at integer based and label based indexing as well.

## .iloc

In [15]:
#return the first 5 rows (or where the index is 0-5 non inclusive)
df.iloc[0:5]

Unnamed: 0,roll_type,strength,dexterity,constitution,intellegence,wisdom,charisma,str mod,dex mod,con mod,int mod,wis mod,char mod
0,3D6,14,15,9,7,6,9,2,2,-1,-2,-2,-1
1,3D6,13,7,15,8,15,7,1,-2,2,-1,2,-2
2,3D6,8,4,11,12,10,5,-1,-3,0,1,0,-3
3,3D6,10,8,12,8,15,15,0,-1,1,-1,2,2
4,3D6,11,11,12,9,12,8,0,0,1,-1,1,-1


In [16]:
#return the first 5 columns
df.iloc[:,0:5]

Unnamed: 0,roll_type,strength,dexterity,constitution,intellegence
0,3D6,14,15,9,7
1,3D6,13,7,15,8
2,3D6,8,4,11,12
3,3D6,10,8,12,8
4,3D6,11,11,12,9
5,3D6,8,8,5,10
6,3D6,9,10,5,12
7,3D6,10,5,16,4
8,3D6,8,7,6,6
9,3D6,7,11,8,10


In [18]:
#return rows 5 through 10 AND the columns 5 through 8
df.iloc[5:10, 5:8]

#in other words df.iloc 
#[row_start:row_end(noninclusive),column_start:column_end(noninclusive)]

Unnamed: 0,wisdom,charisma,str mod
5,16,15,-1
6,7,13,-1
7,17,7,0
8,11,11,-1
9,11,8,-2


## .loc

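Unlike the integer based .iloc, the label based .loc works off the *names* of the row and column index.  Unfortunately, this makes it a little confusing in this example because our rows are named with integers, so .loc selections will look a lot like our .iloc selections.  One difference to watch for: a .loc slice includes its end label, while an .iloc slice stops just before its end position.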

In [19]:
#select rows labeled 0 through 5 (loc includes the end label, so this returns 6 rows)
df.loc[0:5]

Unnamed: 0,roll_type,strength,dexterity,constitution,intellegence,wisdom,charisma,str mod,dex mod,con mod,int mod,wis mod,char mod
0,3D6,14,15,9,7,6,9,2,2,-1,-2,-2,-1
1,3D6,13,7,15,8,15,7,1,-2,2,-1,2,-2
2,3D6,8,4,11,12,10,5,-1,-3,0,1,0,-3
3,3D6,10,8,12,8,15,15,0,-1,1,-1,2,2
4,3D6,11,11,12,9,12,8,0,0,1,-1,1,-1
5,3D6,8,8,5,10,16,15,-1,-1,-3,0,3,2


In [21]:
#select all rows, first 5 columns by name
df.loc[:,"roll_type":"intellegence"]

#so loc follows the same format as iloc, but based on name, not integer, and with the end included
#df.loc [row_start:row_end,column_start:column_end]

Unnamed: 0,roll_type,strength,dexterity,constitution,intellegence
0,3D6,14,15,9,7
1,3D6,13,7,15,8
2,3D6,8,4,11,12
3,3D6,10,8,12,8
4,3D6,11,11,12,9
5,3D6,8,8,5,10
6,3D6,9,10,5,12
7,3D6,10,5,16,4
8,3D6,8,7,6,6
9,3D6,7,11,8,10
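
The difference between .iloc and .loc is easier to see when the row labels are not integers.  In this sketch the character names used as row labels are hypothetical, and the frame is a toy stand-in for the real data.

```python
import pandas as pd

toy = pd.DataFrame(
    {'strength': [14, 13, 8, 10], 'wisdom': [6, 15, 10, 15]},
    index=['ann', 'bob', 'cat', 'dan'],  # hypothetical character names
)

by_position = toy.iloc[0:2]      # positions 0 and 1; the end is excluded
by_label = toy.loc['ann':'cat']  # labels ann through cat; the end is included

print(list(by_position.index))  # ['ann', 'bob']
print(list(by_label.index))     # ['ann', 'bob', 'cat']
```

This is also why df.loc[0:5] on the real data returns six rows while df.iloc[0:5] returns five.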


So now we have different ways to index our data using brackets, .iloc, and .loc. But what if we want to look for data where specific conditions are true?  Then we have to use boolean indexing.

Say I really want to play a strong monk character with high wisdom AND high dexterity.  Using conditional selection, I can search for rows where both of those statements are true.  (To do boolean indexing with a true/false series, you need to use either the brackets or .loc.  Because .iloc is purely integer based, it won't accept the series.)


In [22]:
#Below returns a series saying where wisdom >= 17 is true and where it is false.  
df['wisdom'] >= 17

0       False
1       False
2       False
3       False
4       False
5       False
6       False
7        True
8       False
9       False
10      False
11      False
12      False
13      False
14      False
15      False
16      False
17      False
18      False
19       True
20      False
21      False
22      False
23      False
24      False
25      False
26      False
27      False
28      False
29      False
        ...  
2970    False
2971    False
2972    False
2973    False
2974    False
2975     True
2976    False
2977    False
2978     True
2979    False
2980    False
2981    False
2982    False
2983     True
2984     True
2985    False
2986     True
2987     True
2988    False
2989    False
2990    False
2991     True
2992    False
2993    False
2994     True
2995    False
2996    False
2997    False
2998    False
2999    False
Name: wisdom, Length: 3000, dtype: bool

In [23]:
#We can then use the above to select our data frame for only where wisdom is greater than or equal to 17
df[df['wisdom'] >= 17]

Unnamed: 0,roll_type,strength,dexterity,constitution,intellegence,wisdom,charisma,str mod,dex mod,con mod,int mod,wis mod,char mod
7,3D6,10,5,16,4,17,7,0,-3,3,-3,3,-2
19,3D6,6,8,7,12,17,13,-2,-1,-2,1,3,1
37,3D6,12,12,7,6,17,15,1,1,-2,-2,3,2
141,3D6,8,10,4,6,17,6,-1,0,-3,-2,3,-2
213,3D6,13,12,9,8,17,9,1,1,-1,-1,3,-1
255,3D6,11,9,7,12,17,10,0,-1,-2,1,3,0
299,3D6,10,11,4,10,17,9,0,0,-3,0,3,-1
373,3D6,9,6,11,12,17,8,-1,-2,0,1,3,-1
377,3D6,14,14,6,7,17,9,2,2,-2,-2,3,-1
378,3D6,9,13,15,8,18,7,-1,1,2,-1,4,-2


Nice.  But now let's add the dexterity as well.

In [24]:
#returns a true/false series where wisdom and dexterity are both greater than or equal to 17
(df['wisdom'] >= 17) & (df['dexterity'] >= 17)

0       False
1       False
2       False
3       False
4       False
5       False
6       False
7       False
8       False
9       False
10      False
11      False
12      False
13      False
14      False
15      False
16      False
17      False
18      False
19      False
20      False
21      False
22      False
23      False
24      False
25      False
26      False
27      False
28      False
29      False
        ...  
2970    False
2971    False
2972    False
2973    False
2974    False
2975    False
2976    False
2977    False
2978    False
2979    False
2980    False
2981    False
2982    False
2983    False
2984    False
2985    False
2986    False
2987    False
2988    False
2989    False
2990    False
2991    False
2992    False
2993    False
2994    False
2995    False
2996    False
2997    False
2998    False
2999    False
Length: 3000, dtype: bool

In [25]:
#Great, now let's index our data frame by the above true/false series
df[(df['wisdom'] >= 17) & (df['dexterity'] >= 17)]

Unnamed: 0,roll_type,strength,dexterity,constitution,intellegence,wisdom,charisma,str mod,dex mod,con mod,int mod,wis mod,char mod
1001,4D6DropLow,13,17,11,12,18,12,1,3,0,1,4,1
1415,4D6DropLow,9,18,11,16,18,15,-1,4,0,3,4,2
1598,4D6DropLow,16,18,8,10,17,13,3,4,-1,0,3,1
1631,4D6DropLow,15,18,18,13,17,12,2,4,4,1,3,1
1920,4D6DropLow,10,17,15,11,18,15,0,3,2,0,4,2
2098,Colville,11,17,11,14,17,15,0,3,0,2,3,2
2199,Colville,12,17,7,14,17,9,1,3,-2,2,3,-1
2201,Colville,15,17,14,16,17,7,2,3,2,3,3,-2
2613,Colville,15,17,9,13,17,17,2,3,-1,1,3,3
2677,Colville,8,17,9,13,17,13,-1,3,-1,1,3,1
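
One detail worth spelling out is why each condition sits inside its own parentheses.  In Python, & binds more tightly than >=, so without the parentheses Python would try to evaluate 17 & df['dexterity'] first.  A sketch on a toy frame (the scores and variable names are made up) that sidesteps the issue by naming each condition:

```python
import pandas as pd

toy = pd.DataFrame({
    'dexterity': [17, 18, 10, 17],
    'wisdom': [18, 12, 17, 17],
})

wise = toy['wisdom'] >= 17       # true/false series, one value per row
nimble = toy['dexterity'] >= 17

monks = toy[wise & nimble]       # keep rows where both are True
print(list(monks.index))         # [0, 3]
print(int((wise & nimble).sum()))  # 2, since True counts as 1
```

Summing the combined series is a quick way to count matches before looking at the rows themselves.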


So that's conditional selection using brackets. Now let's try conditional selection using loc. This time, let's say you want to play an intelligent, charismatic bard, but you only want to use Matt Colville's method.

In [27]:
#first let's create the true/false series 
(df['intellegence']>=17) & (df['charisma']>=17) & (df['roll_type'] == "Colville")

0       False
1       False
2       False
3       False
4       False
5       False
6       False
7       False
8       False
9       False
10      False
11      False
12      False
13      False
14      False
15      False
16      False
17      False
18      False
19      False
20      False
21      False
22      False
23      False
24      False
25      False
26      False
27      False
28      False
29      False
        ...  
2970    False
2971    False
2972    False
2973    False
2974    False
2975    False
2976    False
2977    False
2978    False
2979    False
2980    False
2981    False
2982    False
2983    False
2984    False
2985    False
2986    False
2987    False
2988    False
2989    False
2990    False
2991    False
2992    False
2993    False
2994    False
2995    False
2996    False
2997    False
2998    False
2999    False
Length: 3000, dtype: bool

In [28]:
#then use loc to select what you are looking for
df.loc[(df['intellegence']>=17) & (df['charisma']>=17) & (df['roll_type'] == "Colville")]

Unnamed: 0,roll_type,strength,dexterity,constitution,intellegence,wisdom,charisma,str mod,dex mod,con mod,int mod,wis mod,char mod
2147,Colville,13,11,14,17,14,17,1,0,2,3,2,3
2370,Colville,15,17,12,17,15,17,2,3,1,3,2,3
2441,Colville,14,7,10,18,15,17,2,-2,0,4,2,3
2480,Colville,12,8,12,17,10,17,1,-1,1,3,0,3
2535,Colville,6,13,16,18,12,17,-2,1,3,4,1,3
2546,Colville,15,10,14,18,11,17,2,0,2,4,0,3
2738,Colville,14,14,11,18,9,17,2,2,0,4,-1,3
2926,Colville,14,14,13,18,16,17,2,2,1,4,3,3
2948,Colville,13,15,12,18,16,17,1,2,1,4,3,3


Now you have your choice of intelligent, charismatic bards using Matt Colville's method. 

While I'm a little confused about why you would use loc over normal bracket indexing, some internet searching suggests that using loc is more explicit and clear.  
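
One practical difference, as far as I can tell, is that .loc takes a row selector and a column selector in a single call, and it is the form the pandas documentation points to when assigning values to a filtered subset.  The sketch below uses a toy frame with made-up scores and a hypothetical char_class column; the 'intellegence' spelling matches the column name in the data.

```python
import pandas as pd

toy = pd.DataFrame({
    'roll_type': ['3D6', 'Colville', 'Colville'],
    'intellegence': [17, 18, 12],
    'charisma': [17, 17, 17],
})

is_bard = (
    (toy['intellegence'] >= 17)
    & (toy['charisma'] >= 17)
    & (toy['roll_type'] == 'Colville')
)

# Rows and columns chosen in one step.
bards = toy.loc[is_bard, ['intellegence', 'charisma']]
print(bards.shape)  # (1, 2)

# Assignment through .loc writes to the original frame.
toy['char_class'] = 'Unassigned'  # hypothetical column for illustration
toy.loc[is_bard, 'char_class'] = 'Bard'
print(list(toy['char_class']))  # ['Unassigned', 'Bard', 'Unassigned']
```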

So that's it for my maiden voyage blog post.  In it I:
* loaded my data into a dataframe using read_csv()
* performed quick checks using .head(), .tail(), and .sample()
* explored different types of indexing using brackets, .iloc, and .loc
* used boolean indexing to identify where the data met certain conditions 

Now that I have a foundation in exploring and indexing data, next time I'll look at what the numbers say through some basic math.  But I'll also do some exploratory data analysis using a visualization package like matplotlib or seaborn.    

Let me know if you have any feedback on any aspect of the blog post as I'm trying to improve both as a writer and as a data analyst (I wouldn't be surprised if I'm wrong on some of the technical details).  