# CSC8001: Assignment 1

## Part A - Literary Scrabble [60 marks]

Let's play a round of Literary Scrabble: a game of Scrabble where you can only play words that appear in selected literary classics.  The functions you will write for Part A will let you answer questions like:

- How many unique words does Mark Twain's <em>Adventures of Huckleberry Finn</em> have?

- What is the highest scoring word from Lewis Carroll's <em>Alice's Adventures in Wonderland</em> that you can play with the letters 'qazrvedl'?

- Which books have the most words which use the letters `j` and `x`?


For Part A, complete each of the Word Analysis and Word Question functions below.  The Word Analysis functions provide the primary text analysis that helps you answer the questions defined in this notebook. The code you write for each Word Question function will need to call the appropriate Word Analysis function(s) and then complete any additional processing necessary to answer the specific question.  

- Answers to the specific questions above have been provided for you so that you can test your code.  
- Text files for creating word lists are available in the `books` folder.

In [1]:
from __future__ import print_function

import string


### Word Analysis

#### Create Book Word List [12 Marks]

Complete the function below, which should read a book's text file and return a sorted (ascending) list of unique words (i.e. no duplicates) extracted from the book's `text_file` that also exist in the official Sowpods list of approved Scrabble words.  

To create your book's word list: 
- convert all characters to lowercase;  
- replace hyphens with a single space, `' '`, to split hyphenated words into separate words; 
- strip off all contractions and possessives from words: 's, 're, etc.;
- remove all punctuation, whitespace characters and numbers;
- only keep words which also occur in the official Sowpods list.

HINT: The Python Standard Library provides various string constants, such as `whitespace` and `punctuation`.  You may want to review the Python Standard Library's sections on string methods and constants.
- [String constants](https://docs.python.org/3/library/string.html#string-constants)
- [String methods](https://docs.python.org/3/library/stdtypes.html#text-sequence-type-str)

**NOTE: Creating a book's word list may take several seconds of processing time.**  We recommend you use the smaller `"-ch1.txt"` extract files while you are testing your code.
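
The cleaning steps listed above can be tried on a short in-memory string before reading any files. The following is a minimal sketch, assuming Python 3; `clean_text` and the sample sentence are hypothetical and not part of the assignment. It uses the `string.punctuation` and `string.digits` constants mentioned in the hint, and it only handles ASCII apostrophes and punctuation.

```python
import re
import string

def clean_text(text):
    """Hypothetical helper: apply the cleaning steps to a string and return its words."""
    text = text.lower()                    # convert to lowercase
    text = text.replace('-', ' ')          # split hyphenated words
    text = re.sub(r"'[a-z]+", ' ', text)   # strip suffixes such as 's and 're
    # map every punctuation character and digit to a space
    table = str.maketrans({ch: ' ' for ch in string.punctuation + string.digits})
    text = text.translate(table)
    return text.split()                    # split on any run of whitespace

sample = "Alice's well-known rabbit-hole; they're 2 late!"
print(clean_text(sample))
# ['alice', 'well', 'known', 'rabbit', 'hole', 'they', 'late']
```

The result still has to be de-duplicated, filtered against the Sowpods list and sorted.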

In [2]:
import re # required for regular expressions in create_wordlist

def create_wordlist(text_file):
    """ This function takes the text file as an argument, 
    then returns a sorted list of the unique words that also appear in the official 
    Sowpods list of Scrabble approved words
    text_file: string representing the filename
    """
    with open(text_file,'r') as book_file:
        book_string = book_file.read()
    book_string = book_string.lower() #convert to lowercase
    book_string = book_string.replace('-',' ') #replace all hyphens with a single space
    book_string = re.sub(r"'[a-z]+", ' ', book_string) #strip off all contractions and possessives
    book_string = re.sub('[^a-z]',' ', book_string) #replace punctuation, numbers and whitespace with spaces
    #split on runs of whitespace (so no empty strings are produced) and keep the unique words
    unique_words = set(book_string.split())

    #create set of words in sowpods list
    with open('books/sowpods.txt','r') as sowpods:
        sowpods_set = set(sowpods.read().lower().split())

    #keep only the words in both sets, returned as a sorted list
    return sorted(unique_words & sowpods_set)


#### Word Match [8 Marks]

The `word_match` function below should return `True` or `False` depending on whether the `word` can be created from the provided string of `letters`. The `word` does not have to use all of the letters.  NOTE: Each letter in `letters` can only be used once.

For example: 
- `word_match('toe', 'potatoe')`, returns `True`
- `word_match('ball', 'abcledg')`, returns `False`   
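
The "each letter can only be used once" rule is a comparison of letter counts: the word matches when it needs no more copies of any letter than `letters` provides. Here is a minimal sketch of that idea using `collections.Counter`; the name `word_match_counts` is hypothetical, and this is only one possible approach.

```python
from collections import Counter

def word_match_counts(word, letters):
    """Hypothetical alternative: True if `letters` holds enough copies of every letter in `word`."""
    needed = Counter(word)        # how many of each letter the word requires
    available = Counter(letters)  # how many of each letter are on the rack
    return all(available[ch] >= count for ch, count in needed.items())

print(word_match_counts('toe', 'potatoe'))   # True
print(word_match_counts('ball', 'abcledg'))  # False: only one 'l' is available
```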

In [116]:
def word_match(word, letters):
    """ This function takes a word and a string of letters as arguments, and returns true if the word can be made from the letters.
    If the word contains multiple instances of the letter, it will only return true if there are sufficient instances of the letter in the 'letters' string
    word: the word to try and create
    letters: the letters available to create the word
    """
    for letter in word:
        if letter not in letters:
            return False
        #use up one copy of the letter so that it cannot be matched twice
        letters = letters.replace(letter,'',1)
    return True

#### Word Score [8 Marks]

The `word_score` function below should return a word's scrabble score (integer). Use the [English Scrabble letter distribution](https://en.wikipedia.org/wiki/Scrabble_letter_distribution) values to calculate the word's scrabble score.

Example: `word_score('affixes')` should return `20`.


Points | Letters
:--: | :--
1 | e, a, i, o, n, r, t, l, s, u
2 | d, g
3 | b, c, m, p
4 | f, h, v, w, y
5 | k
8 | j, x
10 | q, z
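
One way to avoid typing mistakes when turning this table into code is to build the letter lookup from the table rows and check that all 26 letters are present. This is a minimal sketch; `POINT_GROUPS`, `LETTER_POINTS` and `score` are hypothetical names, not part of the assignment.

```python
# Each table row: points -> the letters worth that many points.
POINT_GROUPS = {
    1: 'eaionrtlsu',
    2: 'dg',
    3: 'bcmp',
    4: 'fhvwy',
    5: 'k',
    8: 'jx',
    10: 'qz',
}

# Invert the table into a letter -> points lookup.
LETTER_POINTS = {letter: points
                 for points, letters in POINT_GROUPS.items()
                 for letter in letters}

# Sanity check: every letter of the alphabet appears exactly once.
assert len(LETTER_POINTS) == 26

def score(word):
    """Hypothetical helper: sum the points of each letter in a lowercase word."""
    return sum(LETTER_POINTS[letter] for letter in word)

# a(1) + f(4) + f(4) + i(1) + x(8) + e(1) + s(1) = 20
print(score('affixes'))  # 20
```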

In [118]:
def word_score(word):
    """ This function takes a word as an argument and returns its scrabble score
    word: the word to score
    """
    #set character scores ('j' and 'x' are both worth 8 points)
    character_scores={'e':1, 'a':1, 'i':1, 'o':1, 'n':1, 'r':1, 't':1, 'l':1, 's':1, 'u':1,
                      'd':2, 'g':2,
                      'b':3, 'c':3, 'm':3, 'p':3,
                      'f':4, 'h':4, 'v':4, 'w':4, 'y':4,
                      'k':5,
                      'j':8, 'x':8,
                      'q':10, 'z':10}
    this_word_score=0
    for letter in word:
        this_word_score+=character_scores[letter]
    return this_word_score

#### Find Words [8 Marks]

You're playing Literary Scrabble where you can only play words extracted from famous books. You have pulled your letters and there are lots of possibilities but you obviously want to play a hand which will get you the highest score. 

Complete the `find_words` function below which should return a dictionary of valid words (selected from the provided `words_list`) that can be created from the letters provided. The keys for the returned dictionary are the words, the values are each word's scrabble score. 

- Use the `word_match` function above to find word/letters matches.  
- Use the `word_score` function above to calculate each word's scrabble score.  
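
Filtering and scoring can be combined in a single dictionary comprehension. This is a minimal sketch, assuming compact stand-ins for `word_match` and `word_score` so that it runs on its own; `can_play`, `POINTS`, `find_words_sketch` and the sample word list are hypothetical and not taken from any of the books.

```python
# Stand-in scoring table (letter -> points), built from the Scrabble point groups.
POINTS = {letter: pts
          for pts, group in {1: 'eaionrtlsu', 2: 'dg', 3: 'bcmp', 4: 'fhvwy',
                             5: 'k', 8: 'jx', 10: 'qz'}.items()
          for letter in group}

def can_play(word, letters):
    """Stand-in for word_match: each letter in `letters` may be used once."""
    for ch in word:
        if ch not in letters:
            return False
        letters = letters.replace(ch, '', 1)  # consume the letter
    return True

def find_words_sketch(words_list, letters):
    """Map each playable word to its score."""
    return {word: sum(POINTS[ch] for ch in word)
            for word in words_list
            if can_play(word, letters)}

sample_words = ['dear', 'lard', 'quiz', 'read', 'zeal']  # made-up list
print(find_words_sketch(sample_words, 'qazrvedl'))
# {'dear': 5, 'lard': 5, 'read': 5, 'zeal': 13}
```

'quiz' is dropped because the rack has no `u` or `i`.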


In [119]:
def find_words(words_list, letters):
    """ This function takes a list of acceptable words, and a string of letters, 
    then returns a dictionary mapping each word in the word list that can be made from the letters to its score.
    The word_match & word_score functions are used in this function, so must be defined before it is called
    words_list: a list of all playable words
    letters: the letters available to play
    """
    #Find all matching words in words_list
    matching_words=[]
    for word in words_list:
        if word_match(word,letters):
            matching_words.append(word)

    #create dictionary of words and their scores
    word_scores=dict()            
    
    for word in matching_words:
        word_scores[word]=word_score(word)
    
    return word_scores


### Word Questions

Each Word Question function below will need to call the appropriate Word Analysis function(s) and then include any additional code required to answer the specific question.

#### a1: Unique words [6 Marks]

Return the number (integer) of unique words contained in `text_file`.

>How many unique words does Chapter 1 of Mark Twain's <em>Adventures of Huckleberry Finn</em> have?  (available in the A1 `books` folder)  
>Answer: 422

In [120]:
def a1(text_file):
    """ This function takes a text file's location/name as a string, and returns the number of unique words it contains.
    It uses the create_wordlist function, so this must be defined before it is called.
    text_file: string for the location/name of the text file
    """ 
    
    return len(create_wordlist(text_file))

In [121]:
a = a1('books/adventures_of_huckleberry_finn-ch1.txt')
a

422

#### a2: Highest score [8 Marks]

You're playing Literary Scrabble and it's your turn.  What is the highest scoring word you can play with the letters you have, based on the words from Chapter 1 of Lewis Carroll's <em>Alice's Adventures in Wonderland</em> (available in the A1 `books` folder)?

>Your available Scrabble letters are 'qazrvedl'.  What's the highest scoring word you can play?  
>Answer: ('read', 5)
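
Several playable words can share the top score, so the choice needs a tie-break rule. This is a minimal sketch, assuming ties go to the word that sorts last alphabetically; `best_word` and the sample scores are hypothetical. It picks the winner with a single `max` call over `(score, word)` pairs.

```python
def best_word(scores):
    """Hypothetical helper: return (word, score) for the highest score.

    Ties are broken by taking the word that sorts last alphabetically.
    An empty dictionary gives ('', 0).
    """
    if not scores:
        return ('', 0)
    # Compare by score first, then by the word itself to break ties.
    return max(scores.items(), key=lambda item: (item[1], item[0]))

sample_scores = {'a': 1, 'dear': 5, 'lard': 5, 'read': 5}  # made-up example
print(best_word(sample_scores))  # ('read', 5)
```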

In [122]:
def a2(text_file, letters):
    """ This function calculates the highest scoring playable word from the text file, using the given letters.
    It returns a tuple of the word, and its score.
    text_file: string representing the location/name of the text file
    letters:string of all letters available
    """ 
    playable_words=find_words(create_wordlist(text_file),letters)
    max_score=0
    max_word=''
    for word in playable_words:
        if playable_words[word]>max_score:
            max_score=playable_words[word]
            max_word=word
        elif playable_words[word]==max_score and word>max_word: #break ties between equal scores by choosing the word that sorts last alphabetically
            max_word=word
    return (max_word,max_score)

In [123]:
a = a2('books/alices_adventures_in_wonderland-ch1.txt','qazrvedl')
a

('read', 5)

#### a3: Books with the most `j` and `x` words [10 Marks]

You've just played all of your letters. According to the rules of Literary Scrabble, if you're out of letters you can choose to switch to a new literary novel but you have to do so before you choose your new letters.  

You've noticed that no one has played a `j` or an `x` for a while, which may mean there are still some left.  That is good, since these are high-value letters, but not so good if your next book doesn't have many words that contain them.  Write a function that counts how many words in each book contain at least one of the given letters. Your function accepts and returns a list of tuples, where each tuple contains information for one book.  

Your function is passed a list of tuples:  
`[(book_ID1, text_file1), (book_ID2, text_file2), (book_ID3, text_file3)]`

Your function should return a list of tuples:  
`[(book_ID1, word_count1), (book_ID2, word_count2), (book_ID3, word_count3)]`

> How many words in *Alice's Adventures in Wonderland* and *War of the Worlds* contain the letter `j` or `x`?  
>Answer: [('Alice', 9), ('War', 24)]

(The books are available in the A1 books folder)
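
Each word should count once, however many of the target letters it contains. This is a minimal sketch of that counting step using set intersection; `count_words_with_letters` and the sample word list are hypothetical.

```python
def count_words_with_letters(words, letters):
    """Hypothetical helper: count the words containing at least one of `letters`."""
    wanted = set(letters)
    # A word qualifies when its set of letters overlaps the wanted set.
    return sum(1 for word in words if wanted & set(word))

sample_words = ['box', 'cat', 'jab', 'jinx']  # made-up list
print(count_words_with_letters(sample_words, 'jx'))  # 3 ('jinx' counts once)
```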

In [124]:
def a3(book_list, letters):
    """ This function calculates how many words contain a set of letters in each book listed
    book_list: list of tuples containing the book name, and its location/filename
    letters:string containing the set of letters
    """ 
    return_list=[]
    
    for book,file_name in book_list:
        matching_words=create_wordlist(file_name)
        
        words_containing_letters = 0
        for word in matching_words:
            #count each word once, even if it contains several of the letters
            if any(letter in letters for letter in word):
                words_containing_letters+=1
        
        return_list.append((book,words_containing_letters))
        
    return return_list

In [125]:
a = a3([('War','books/war_of_the_worlds-ch1.txt'), 
         ('Alice', 'books/alices_adventures_in_wonderland-ch1.txt')], 'jx')
a

[('War', 24), ('Alice', 9)]