# Graphing SET data with Matplotlib and Pandas

## Data load and prep

In [None]:
import matplotlib.pyplot as plt
import pandas            as pd

In [None]:
url = 'https://raw.githubusercontent.com/RubeRad/tcscs/master/12SETcards.csv'
df = pd.read_csv(url)

In [None]:
df.head(12) # show up to 12 rows, which is enough to display the whole dataset

## Exercise

**Q1:** How many occurrences of the value 84240 can you find, and what does each signify?

**A:** *(type your answers)*

**Q2:** What do all the NaN values signify?

**A:**


In [None]:
df.columns

**NOTE:** The column (Series) names are all shown in quotes, which means they are strings, not numbers. It will be more convenient for the SET counts to be numbers. The next cell does that by replacing df.columns, which holds only strings, with a list that starts with the same two strings ('N' and '81cN') and continues with the integers 0 through 14.

In [None]:
newcolumns = ['N', '81cN']
numsets = range(0,15) 
newcolumns.extend(numsets)  # same as ['N','81cN',0,1,2,3,4,5,6,7,8,9,10,11,12,13,14]
df.columns = newcolumns

In [None]:
df.columns

**NOTE:** We can see that the column headers after 'N' and '81cN' are now numbers
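
Why the label type matters can be seen on a tiny stand-in frame. This is only a sketch: `toy` and its values are hypothetical and are not the SET data. It shows that a string label `'0'` and an integer label `0` are different keys, and that assigning a new list to `.columns` switches from one to the other.

```python
import pandas as pd

# Hypothetical toy frame (not the SET data): header labels read from a CSV arrive as strings
toy = pd.DataFrame({'N': [3, 4], '0': [10, 20], '1': [1, 2]})

print('0' in toy.columns)   # True: the string label exists
print(0 in toy.columns)     # False: the integer label does not

# Same idea as the cell above: keep the leading string label, make the rest integers
toy.columns = ['N'] + list(range(0, 2))

print(0 in toy.columns)     # True: toy[0] now works
```

With integer labels, a loop variable such as `nset` can be used directly as a column key, which is what the later cells rely on.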

**NOTE:** The values in the DataFrame are all total counts. We need to convert them to probabilities, so that rows stay comparable even though the counts grow enormously as more cards are dealt. That means dividing each count column by the column named '81cN'.

In [None]:
# Do this only once!
# If you accidentally do it again and make the probabilities too tiny,
# back up to the beginning of the notebook and re-run all the cells
for nset in numsets:         # for nset=0,1,2,...14
    df[nset] /= df['81cN']   # divide the Series of counts by 81-choose-N to convert it to a probability

In [None]:
df.head()
# Now they should look like probabilities
# The first, biggest probability in there should be about 98%, right below it, about 95%
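
The conversion can be checked by hand for the first two rows. The sketch below is an assumption-laden stand-in, not the notebook's data: `toy` is built from counting arguments rather than read from the CSV (the deck has 81·80/6 = 1080 SETs, and a 4-card deal contains exactly one SET when it is a SET plus any of the 78 other cards). It also uses `DataFrame.div` on a copy instead of the in-place loop, so re-running it cannot shrink the values a second time.

```python
from math import comb
import pandas as pd

n_sets_in_deck = 81 * 80 // 6   # 1080 distinct SETs in the 81-card deck

# Toy frame: rows for 3 and 4 cards, columns for 0 and 1 SETs only
toy = pd.DataFrame({
    'N':    [3, 4],
    '81cN': [comb(81, 3), comb(81, 4)],
    0: [comb(81, 3) - n_sets_in_deck, comb(81, 4) - n_sets_in_deck * 78],
    1: [n_sets_in_deck,               n_sets_in_deck * 78],
})

set_cols = [0, 1]
probs = toy.copy()                                         # leave the counts untouched
probs[set_cols] = toy[set_cols].div(toy['81cN'], axis=0)   # divide each row by its own 81cN

print(probs[set_cols].round(4))
print(probs[set_cols].sum(axis=1))   # each row of probabilities should sum to 1
```

Keeping the counts in `toy` and the probabilities in `probs` is one way to make the cell safe to run repeatedly; the notebook's in-place `/=` is shorter but must run only once.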

## Graphing the data

In [None]:
# Create a new figure, with a size that is nice and big
fig=plt.figure(figsize=(15,12))

# Create two new subplots. The "21" part in each says 2 rows X 1 column configuration
linax=fig.add_subplot(211) # Final 1 means first of the 2x1
logax=fig.add_subplot(212) # Final 2 means second of the 2x1
logax.set_yscale('log')    # The reason for the second is to have it be log scale

# Use a loop to plot each curve onto each subplot
# It's the same curve, will be rendered linear scale and log scale in the subplots:
#   X values from column 'N' ([3,4,5,...12] cards)
#   Y values from column nset ([0,1,2,...,14] SETs)
for nset in numsets:
    linax.plot(df['N'], df[nset], marker='o')  
    logax.plot(df['N'], df[nset], marker='o')  
    
# Add some explanatory labeling    
linax.set_title("What's the probability of various numbers of SETs as more cards are dealt?")
logax.set_xlabel('Number of cards dealt') # this applies for both
linax.set_ylabel('Probability')
logax.set_ylabel('Probability (log)')
linax.legend(numsets, title='Number of SETs', ncol=3) # same legend also applies to both
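
An alternative way to build the same two-panel layout is `plt.subplots`, which creates the figure and both axes in one call. The sketch below assumes a cut-down stand-in frame `toy` (two rows, two SET columns, probabilities computed from the counting arguments 1080 SETs per deck and 78 spare cards) and selects the `Agg` backend only so that it runs outside a notebook; inside a notebook that line should be dropped.

```python
from math import comb
import matplotlib
matplotlib.use('Agg')   # assumption: headless run; omit this line inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Stand-in for df after conversion to probabilities (3 and 4 cards, 0 and 1 SETs)
toy = pd.DataFrame({'N': [3, 4]})
toy[0] = [1 - 1080 / comb(81, 3), 1 - 1080 * 78 / comb(81, 4)]
toy[1] = [1080 / comb(81, 3), 1080 * 78 / comb(81, 4)]

# 2 rows x 1 column, sharing the X axis so the card counts line up
fig, (linax, logax) = plt.subplots(2, 1, figsize=(15, 12), sharex=True)
logax.set_yscale('log')

for nset in [0, 1]:
    # label= attaches the legend text to the curve itself
    linax.plot(toy['N'], toy[nset], marker='o', label=str(nset))
    logax.plot(toy['N'], toy[nset], marker='o')

linax.legend(title='Number of SETs')
logax.set_xlabel('Number of cards dealt')
```

Passing `label=` at plot time keeps each legend entry tied to its curve, so the legend stays correct even if the plotting order changes.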

## Exercise

**Q3:** Why does the '0 SETs' curve start near probability 1 and only decline, but all the other curves start near 0 and increase, then decrease?

**A:**

**Q4:** For a 12-card deal, what is the most likely (highest probability) number of SETs?

**A:**

## Graphing the data the other way

In [None]:
oopx = df.transpose()
oopx.head()

**NOTE:** If we just transpose df, columns 'N' and '81cN' become rows, which means they'd be included in the plots of every Series. One way to deal with that is by filtering those rows out.

Since we already have numsets=[0,1,2,...14], the following cell instead selects only those columns before transposing

In [None]:
fd = df[numsets].transpose()
fd.head(15)
# Should see that 98%, 95% now transposed from the first column to the first row
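
The two approaches mentioned above (filter the unwanted rows out after transposing, or select the wanted columns before transposing) can be compared on a small stand-in. In this sketch `toy` is a hypothetical two-row, two-SET-column frame with rounded probabilities; it is not the full dataset.

```python
import pandas as pd

# Hypothetical cut-down frame with the same layout as df
toy = pd.DataFrame({
    'N':    [3, 4],
    '81cN': [85320, 1663740],
    0:      [0.987, 0.949],
    1:      [0.013, 0.051],
})
set_cols = [0, 1]

# Option 1: transpose everything, then drop the two bookkeeping rows
filtered = toy.transpose().drop(index=['N', '81cN'])

# Option 2: select the SET columns first, then transpose
selected = toy[set_cols].transpose()

print(filtered)
print(selected)
```

Both options leave one row per number of SETs and one column per original row of the frame.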

In [None]:
fd.columns

**NOTE:** These columns are already numbers, but they're the wrong numbers! They are the old row labels of df (0, 1, 2, ...), not the numbers of cards dealt.

Just like we used numsets=range(0,15) above to spell out all possible numbers of SETs in our data, we use ncards=range(3,13) here with the transposed data

In [None]:
ncards=range(3,13) # easier than [3,4,5,6,7,8,9,10,11,12]
fd.columns=ncards

In [None]:
fd.head()
# Should show column headers 3...12
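
Another way to get card counts as column headers, sketched here as an alternative rather than as the notebook's method, is to make column 'N' the index before transposing, so the labels travel with the data. The frame `toy` is a hypothetical two-row stand-in with rounded probabilities.

```python
import pandas as pd

# Hypothetical cut-down frame with the same layout as df
toy = pd.DataFrame({
    'N':    [3, 4],
    '81cN': [85320, 1663740],
    0:      [0.987, 0.949],
    1:      [0.013, 0.051],
})

# Index by number of cards, keep only the SET columns, then transpose
fd_toy = toy.set_index('N')[[0, 1]].transpose()

print(list(fd_toy.columns))   # card counts taken from column 'N'
print(fd_toy)
```

This avoids typing the card range by hand, at the cost of one extra method call; assigning `range(3,13)` to `.columns` is more explicit about which labels are expected.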

## Exercise

Fill in the code cell below to make plots to view the data in the opposite sense

In [None]:
# Since these are the other way, use variables that are named backwards...
gif=plt.figure(figsize=(15,12))
axlin=gif.add_subplot(211)
axlog=gif.add_subplot(212)

# fill in to plot DataFrame fd, following the example above
# the X series for every plot will be numsets (which is [0,1,2,...14])



## Exercise

**Q5:** (Same as Q4) For a 12-card deal, what is the most likely (highest probability) number of SETs? How is the same answer evident from these graphs?

**A:**

**Q6:** What's up with the combination 7 cards, 4 SETs? How is the same answer evident in the previous graphs?

**A:**

**Q7:** Is 'start by dealing 12 cards' a better rule than 'start by dealing 9 cards'?

**A:**