Last time I looked at how to index a data set I built to compare three different methods of creating stats for 5E D&D characters.  In this blog post, I'm going to run some descriptive statistics and do some exploratory data analysis.

In [1]:
#import libraries
import pandas as pd
import matplotlib.pyplot as plt

#for in notebook graphic exploration
%matplotlib inline

In [2]:
#read in the data
df = pd.read_csv('1000CharSimulated20seed.csv',index_col=0)

In [3]:
#use the shape attribute to look at the size of the data
df.shape

#shape gives us the number of rows and the number of columns
#unlike head() or tail(), shape is an attribute of the data frame, not a method 

(3000, 13)

In [4]:
#use describe to get some high level statistics
df.describe()

statistic,strength,dexterity,constitution,intellegence,wisdom,charisma,str mod,dex mod,con mod,int mod,wis mod,char mod
count,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0
mean,11.915333,11.990333,12.007333,12.015,11.947,11.866333,0.701667,0.737667,0.751667,0.754333,0.73,0.676
std,3.074191,3.187417,3.112851,3.126665,3.150794,3.142784,1.547728,1.616494,1.575275,1.582872,1.593303,1.589927
min,3.0,3.0,3.0,3.0,3.0,3.0,-4.0,-4.0,-4.0,-4.0,-4.0,-4.0
25%,10.0,10.0,10.0,10.0,10.0,10.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,12.0,12.0,12.0,12.0,12.0,12.0,1.0,1.0,1.0,1.0,1.0,1.0
75%,14.0,15.0,14.0,15.0,14.0,14.0,2.0,2.0,2.0,2.0,2.0,2.0
max,18.0,18.0,18.0,18.0,18.0,18.0,4.0,4.0,4.0,4.0,4.0,4.0


The describe method gives us a lot of great information on each of the numeric variables, including:
* Count
* Mean
* Standard deviation
* Box-and-whisker plot inputs (min, 25th percentile, 50th percentile, 75th percentile, and max)

Notice how describe doesn't provide any summary statistics for the qualitative roll_type variable.
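
To still get a quick summary of a text column like roll_type, one option is to call describe() or value_counts() on that column directly. Here is a minimal sketch on a made-up toy data frame (the `toy` name and its values are only for illustration, not the real data set):

```python
import pandas as pd

# Toy stand-in for the real data set (values are made up for illustration)
toy = pd.DataFrame({
    "roll_type": ["3D6", "3D6", "4D6DropLow", "Colville"],
    "strength": [10, 8, 13, 15],
})

# On a text column, describe() reports count, unique, top, and freq
text_summary = toy["roll_type"].describe()

# value_counts() shows how many rows each roll_type contributes
counts = toy["roll_type"].value_counts()

print(text_summary)
print(counts)
```

On the real data I would expect value_counts() to show each of the three methods the same number of times, but that is worth checking rather than assuming.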

While this is useful, it is not quite fair to summarize all the data together, because it lumps together three different methodologies for creating characters.  What I really need is a way to run describe() three times, once for each methodology. Luckily, I can do that with groupby.

In [5]:
df.groupby('roll_type').describe()
#looks way better on Mac, but not sure why

roll_type,statistic,char mod,charisma,con mod,constitution,dex mod,dexterity,int mod,intellegence,str mod,strength,wis mod,wisdom
3D6,count,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0
3D6,mean,-0.088,10.361,-0.033,10.444,-0.032,10.433,0.021,10.554,-0.069,10.374,0.062,10.608
3D6,std,1.540334,3.028506,1.494036,2.9376,1.545109,3.044138,1.507591,2.956658,1.448564,2.875481,1.550665,3.079581
3D6,min,-4.0,3.0,-4.0,3.0,-4.0,3.0,-4.0,3.0,-4.0,3.0,-4.0,3.0
3D6,25%,-1.0,8.0,-1.0,8.0,-1.0,8.0,-1.0,9.0,-1.0,8.0,-1.0,8.0
3D6,50%,0.0,10.0,0.0,11.0,0.0,10.0,0.0,11.0,0.0,10.0,0.0,11.0
3D6,75%,1.0,13.0,1.0,13.0,1.0,13.0,1.0,13.0,1.0,12.0,1.0,13.0
3D6,max,4.0,18.0,4.0,18.0,4.0,18.0,4.0,18.0,4.0,18.0,4.0,18.0
4D6DropLow,count,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0
4D6DropLow,mean,0.825,12.139,0.99,12.455,0.887,12.292,1.012,12.528,0.909,12.319,0.807,12.089
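
If the full grouped table is too wide to read comfortably, one option is to pick a single column before calling describe(). This is just a sketch on a made-up toy data frame (the `toy` values are hypothetical), not the real output:

```python
import pandas as pd

# Toy stand-in for the real data set (values are made up for illustration)
toy = pd.DataFrame({
    "roll_type": ["3D6", "3D6", "4D6DropLow", "4D6DropLow"],
    "strength": [10, 8, 13, 15],
    "dexterity": [11, 9, 12, 16],
})

# Selecting one column first keeps the result small:
# one row per roll_type, one column per statistic
strength_summary = toy.groupby("roll_type")["strength"].describe()

print(strength_summary)
```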


That's way better.  We can also use groupby to look at specific statistics.  For example, let's look at the mean, and then the median, of each ability score and modifier by roll_type.

In [7]:
df.groupby('roll_type').mean()

roll_type,strength,dexterity,constitution,intellegence,wisdom,charisma,str mod,dex mod,con mod,int mod,wis mod,char mod
3D6,10.374,10.433,10.444,10.554,10.608,10.361,-0.069,-0.032,-0.033,0.021,0.062,-0.088
4D6DropLow,12.319,12.292,12.455,12.528,12.089,12.139,0.909,0.887,0.99,1.012,0.807,0.825
Colville,13.053,13.246,13.123,12.963,13.144,13.099,1.265,1.358,1.298,1.23,1.321,1.291


In [8]:
df.groupby('roll_type').median()

roll_type,strength,dexterity,constitution,intellegence,wisdom,charisma,str mod,dex mod,con mod,int mod,wis mod,char mod
3D6,10,10,11,11,11,10,0,0,0,0,0,0
4D6DropLow,12,12,13,13,12,12,1,1,1,1,1,1
Colville,13,14,14,13,14,13,1,2,2,1,2,1


Looking at the two tables above, the Colville method has the highest mean for every ability score and every modifier, and its medians are greater than or equal to those of the other two methods.  Within each method, the six abilities also look about the same, which makes sense: they were all generated the same way, so they should be roughly equal.
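
The mean and the median can also be pulled in a single call with agg(), which puts them side by side. A minimal sketch, again on made-up toy values rather than the real data:

```python
import pandas as pd

# Toy stand-in for the real data set (values are made up for illustration)
toy = pd.DataFrame({
    "roll_type": ["3D6", "3D6", "3D6", "Colville", "Colville", "Colville"],
    "strength": [8, 10, 15, 13, 14, 18],
})

# agg() takes a list of statistic names and returns one column per statistic
both = toy.groupby("roll_type")["strength"].agg(["mean", "median"])

print(both)
```

Seeing the two side by side makes it easier to spot a skewed group, where the mean and the median drift apart.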

So let's take it a step further and add two columns that hold the average ability score and the average modifier for each row.

In [14]:
df['ability_mean'] = df.loc[:,"strength":"charisma"].mean(axis=1)
df['mod_mean'] = df.loc[:,"str mod":"char mod"].mean(axis=1)
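
A quick note on how those two lines work: a label slice in .loc includes both endpoints, and mean(axis=1) averages across the columns of each row. The sketch below shows this on a made-up toy frame with only three abilities (the `toy` name and values are hypothetical). It also shows an explicit column list, which is one way to avoid depending on column order:

```python
import pandas as pd

# Toy stand-in with three abilities instead of six (values are made up)
toy = pd.DataFrame({
    "strength": [10, 16],
    "dexterity": [12, 14],
    "charisma": [8, 12],
    "str mod": [0, 3],
    "dex mod": [1, 2],
    "char mod": [-1, 1],
})

# Label slices with .loc include both the start and the end column
toy["ability_mean"] = toy.loc[:, "strength":"charisma"].mean(axis=1)
toy["mod_mean"] = toy.loc[:, "str mod":"char mod"].mean(axis=1)

# An explicit list gives the same answer and does not rely on column order
ability_cols = ["strength", "dexterity", "charisma"]
check = toy[ability_cols].mean(axis=1)

print(toy[["ability_mean", "mod_mean"]])
```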

In [25]:
df.groupby('roll_type)'[["ability_mean","mod_mean"]]

SyntaxError: unexpected EOF while parsing (<ipython-input-25-12ccee667ec1>, line 1)
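
The SyntaxError comes from a misplaced quote: the closing quote sits after the parenthesis, so Python reads 'roll_type)' as the string and never finds the end of the groupby( call. Below is a sketch of the corrected line on a made-up toy frame (the values are hypothetical). The mean() at the end is only one possible aggregation, added so there is something to display:

```python
import pandas as pd

# Toy stand-in for the real data set (values are made up for illustration)
toy = pd.DataFrame({
    "roll_type": ["3D6", "3D6", "Colville", "Colville"],
    "ability_mean": [10.0, 11.0, 13.0, 14.0],
    "mod_mean": [0.0, 0.5, 1.0, 2.0],
})

# The closing quote goes right after roll_type, before the parenthesis
grouped = toy.groupby("roll_type")[["ability_mean", "mod_mean"]]

# A grouped object only shows numbers once an aggregation is applied;
# mean() is an assumption here, any other aggregation would work too
group_means = grouped.mean()

print(group_means)
```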