### Prepping Data Challenge: 7 letter Scrabble Words (Week 6)

For this challenge, we're going to look at 7-letter words that could be high scoring in Scrabble and work out the likelihood of drawing the tiles needed to create each word. Are we going to make our lives easier by assuming that each tile drawn is an independent event and that the order in which tiles are drawn is irrelevant? Yes. Equally, if you have the statistical brain to calculate the probabilities as dependent events, considering all the possible orderings, then we'd love to see that solution!
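
For anyone curious about the dependent-events version, here is a minimal sketch, not part of the challenge solution. It assumes the standard 100-tile English set, a rack of exactly 7 tiles drawn from a full bag, and no use of blanks; `exact_draw_probability` and `TILE_FREQUENCY` are hypothetical names. Counting unordered racks with binomial coefficients (a multivariate hypergeometric draw) covers every possible ordering at once.

```python
from collections import Counter
from math import comb

# Standard English Scrabble distribution (assumed here; the challenge
# input file should be treated as the source of truth).
TILE_FREQUENCY = {
    'A': 9, 'B': 2, 'C': 2, 'D': 4, 'E': 12, 'F': 2, 'G': 3, 'H': 2, 'I': 9,
    'J': 1, 'K': 1, 'L': 4, 'M': 2, 'N': 6, 'O': 8, 'P': 2, 'Q': 1, 'R': 6,
    'S': 4, 'T': 6, 'U': 4, 'V': 2, 'W': 2, 'X': 1, 'Y': 2, 'Z': 1,
}
TOTAL_TILES = 100  # 98 letter tiles plus 2 blanks


def exact_draw_probability(word, rack_size=7):
    """Chance that a random rack of `rack_size` tiles is exactly the letters of `word`."""
    if len(word) != rack_size:
        raise ValueError("word length must equal the rack size")
    counts = Counter(word.upper())
    ways = 1
    for letter, needed in counts.items():
        # comb(n, k) is 0 when k > n, so impossible words come out as 0.
        ways *= comb(TILE_FREQUENCY.get(letter, 0), needed)
    return ways / comb(TOTAL_TILES, rack_size)


p_reading = exact_draw_probability('Reading')
p_pizzazz = exact_draw_probability('pizzazz')  # needs four Z tiles, only one exists
```

Under these assumptions `p_reading` is roughly 2.6e-05 and `p_pizzazz` is 0. The value is much larger than the independent-events figure because it does not fix the order in which the tiles are drawn.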

### Requirements
 - Input the data
 - Parse out the information in the Scrabble Scores Input so that there are 3 fields:
   - Tile
   - Frequency
   - Points
 - Calculate the % Chance of drawing a particular tile and round to 2 decimal places
   - Frequency / Total number of tiles
 - Split each of the 7 letter words into individual letters and count the number of occurrences of each letter
 - Join each letter to its Scrabble tile
 - Update the % chance of drawing a tile based on the number of occurrences in that word
   - If the word contains more occurrences of that letter than the frequency of the tile, set the probability to 0, since it is impossible to make this word in Scrabble
   - Remember that for independent events you multiply the probabilities together, i.e. if a letter appears more than once in a word, you will need to multiply the % chance by itself that many times
 - Calculate the total points each word would score
 - Calculate the total % chance of drawing all the tiles necessary to create each word
 - Filter out words with a 0% chance
 - Rank the words by their % chance (dense rank)
 - Rank the words by their total points (dense rank)
 - Output the data
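
As a quick sanity check of the probability steps above, here is a small plain-Python sketch for a single word. The tile counts are taken from the standard 100-tile set and only cover the letters used in the example (an assumption; the real table comes from the input file), and `independent_chance` is a hypothetical helper name.

```python
from collections import Counter
from math import prod

# Tile counts for the letters used below only (assumed standard set).
TILE_FREQUENCY = {'A': 9, 'D': 4, 'E': 12, 'G': 3, 'I': 9, 'N': 6, 'R': 6, 'Z': 1}
TOTAL_TILES = 100


def independent_chance(word):
    """Product of (frequency / total) ** occurrences, or 0 for impossible words."""
    counts = Counter(word.upper())
    # A word needing more copies of a letter than exist cannot be made.
    if any(n > TILE_FREQUENCY[letter] for letter, n in counts.items()):
        return 0.0
    return prod(
        round(TILE_FREQUENCY[letter] / TOTAL_TILES, 2) ** n
        for letter, n in counts.items()
    )


chance = independent_chance('Reading')
impossible = independent_chance('razzing')  # two Z tiles needed, one available
```

For "Reading" this gives about 4.2e-09 (0.06 × 0.12 × 0.09 × 0.04 × 0.09 × 0.06 × 0.03), and "razzing" comes out as 0.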

In [1]:
import pandas as pd
import numpy as np

In [2]:
# Input the data.
with pd.ExcelFile('WK6-7 letter words.xlsx') as xlsx:
    words = pd.read_excel(xlsx, '7 letter words')
    scrabble = pd.read_excel(xlsx, 'Scrabble Scores')

In [3]:
words.head()

Unnamed: 0,7 letter word
0,ability
1,absence
2,academy
3,account
4,accused


In [4]:
scrabble.head()

Unnamed: 0,Scrabble
0,0 points: Blank ×2
1,"1 point: E ×12, A ×9, I ×9, O ×8, N ×6, R ×6, ..."
2,"2 points: D ×4, G ×3"
3,"3 points: B ×2, C ×2, M ×2, P ×2"
4,"4 points: F ×2, H ×2, V ×2, W ×2, Y ×2"


In [5]:
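#Parse out the information in the Scrabble Scores Input so that there are 3 fields:
#Tile, Frequency, Points
#Capture the points from the "N points:" prefix before stripping it. The row index
#only equals the points value while the point values are consecutive (0, 1, 2, ...).
points = scrabble['Scrabble'].str.extract(r'^(\d+)\s', expand=False).astype('int64')
#regex=True is required: newer pandas versions treat the pattern as a literal string by default
scrabble['Scrabble'] = scrabble['Scrabble'].str.replace(r'(\d+\s[a-z]+:)', ' ', regex=True)
scrabble = scrabble['Scrabble'].str.split(',').explode()
scrabble2 = pd.DataFrame()
scrabble2['Scrabble'] = scrabble

In [6]:
scrabble2['Tile'] = scrabble2['Scrabble'].str.extract(r'([a-zA-Z]+)', expand=False)
scrabble2['Frequency'] = scrabble2['Scrabble'].str.extract(r'(\d+)', expand=False).astype('int64')
#The exploded rows keep the index of their source row, so map it to that row's points
scrabble2['Points'] = scrabble2.index.map(points)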
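
The parsing step can also be tried on a few inline rows without the Excel file. This is a minimal sketch that reads the points from the text before the colon and the tiles from the text after it. The first two rows mirror the preview above; the "8 points" row is an assumption (it is not visible in the preview) included to show a case where the row position and the points value differ. The names `raw` and `tiles` are illustrative only.

```python
import pandas as pd

raw = pd.DataFrame({'Scrabble': [
    '0 points: Blank ×2',
    '2 points: D ×4, G ×3',
    '8 points: J ×1, X ×1',  # assumed row, not shown in the preview
]})

# Split once on the colon: left side holds the points, right side the tiles.
parts = raw['Scrabble'].str.split(':', n=1, expand=True)

tiles = (
    raw.assign(
        Points=parts[0].str.extract(r'(\d+)', expand=False).astype(int),
        Scrabble=parts[1].str.split(','),
    )
    .explode('Scrabble')        # one row per tile
    .reset_index(drop=True)
)

# Pull the tile name and its frequency out of strings such as " D ×4".
tiles[['Tile', 'Frequency']] = tiles['Scrabble'].str.extract(r'([A-Za-z]+)\s*×(\d+)')
tiles['Frequency'] = tiles['Frequency'].astype(int)
tiles = tiles[['Tile', 'Frequency', 'Points']]
```

Here J and X sit in the third row (index 2) but are worth 8 points, which is why the points are read from the text.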

In [7]:
scrabble2.head()

Unnamed: 0,Scrabble,Tile,Frequency,Points
0,Blank ×2,Blank,2,0
1,E ×12,E,12,1
1,A ×9,A,9,1
1,I ×9,I,9,1
1,O ×8,O,8,1


In [8]:
#Calculate the % Chance of drawing a particular tile and round to 2 decimal places
#Frequency / Total number of tiles
scrabble2['total number of tiles'] = scrabble2['Frequency'].sum()
scrabble2["% chance of tile"] = (scrabble2['Frequency'] / scrabble2['total number of tiles']).round(2)

In [9]:
#Split each of the 7 letter words into individual letters and count the number of occurrences of each letter
letters = words.assign(Tile = lambda x: x['7 letter word'].str.upper().str.findall('(.)'))\
                .explode('Tile')\
                .groupby(['7 letter word','Tile'], as_index =False).agg(Count=('Tile','size'))

In [10]:
letters.head()

Unnamed: 0,7 letter word,Tile,Count
0,Reading,A,1
1,Reading,D,1
2,Reading,E,1
3,Reading,G,1
4,Reading,I,1


In [11]:
#Join each letter to its scrabble tile 
df = scrabble2[['Tile','Frequency','Points','% chance of tile']].merge(letters, on ='Tile')

In [12]:
#Update the % chance of drawing a tile based on the number of occurrences in that word
#Calculate the total points each word would score
#Calculate the total % chance of drawing all the tiles necessary to create each word

df = df.assign(pChance=lambda x: np.where((x['Count'] > x['Frequency']), 0,
                                           x["% chance of tile"] ** x['Count']),
               tPoints=lambda x: np.where(x['Count'] > x['Frequency'], 0,
                                           x['Points'] * x['Count']))                                          

In [13]:
#Filter out words with a 0% chance
df = df.groupby('7 letter word')\
       .filter(lambda x: x['pChance'].min() > 0.0)\
       .groupby('7 letter word', as_index=False)\
       .agg(pChance = ('pChance','prod'), tPoints=('tPoints','sum'))

In [14]:
df = df.rename(columns={'pChance':'% Chance','tPoints':'Total Points'})

In [15]:
#Rank the words by their total points (dense rank)
df['Points Rank'] = df['Total Points'].rank(method='dense',ascending=False).astype(int)

In [16]:
#Rank the words by their % chance (dense rank)
df['Likelihood Rank'] = df['% Chance'].rank(method='dense', ascending=False).astype(int)
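
To see what `method='dense'` does compared with a competition-style rank, here is a tiny sketch on made-up scores (the values and labels are illustrative, not taken from the challenge data).

```python
import pandas as pd

# Hypothetical point totals with one tie.
scores = pd.Series([15, 12, 12, 11], index=['w1', 'w2', 'w3', 'w4'])

# Dense rank: tied values share a rank and the next rank is not skipped.
dense = scores.rank(method='dense', ascending=False).astype(int)

# 'min' rank for comparison: ties share a rank, but the next rank is skipped.
competition = scores.rank(method='min', ascending=False).astype(int)
```

With these values `dense` is 1, 2, 2, 3 while `competition` is 1, 2, 2, 4.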

In [17]:
df = df[['Points Rank', 'Likelihood Rank','7 letter word',"% Chance",'Total Points']]

In [18]:
df.head()

Unnamed: 0,Points Rank,Likelihood Rank,7 letter word,% Chance,Total Points
0,16,16,Reading,4.19904e-09,9
1,13,61,ability,6.9984e-10,12
2,14,45,absence,1.24416e-09,11
3,10,79,academy,3.1104e-10,15
4,14,73,account,4.1472e-10,11


In [19]:
df.to_csv('wk6-output.csv', index=False)