# CSC8001: Assignment 1

## Part A - Literary Scrabble [60 marks]

Let's play a round of Literary Scrabble: a game of Scrabble where you can only play words that appear in selected literary classics.  The functions you will write for Part A will let you answer questions like:

- How many unique words does Mark Twain's book the <em>Adventures of Huckleberry Finn</em> have?

- What is the highest scoring word from Lewis Carroll's <em>Alice's Adventures in Wonderland</em> that you can play with the letters 'qazrvedl'?

- Which books have the most words which use the letters `j` and `x`?


For Part A complete each of the Word Analysis and Word Questions functions below.  The Word Analysis functions will provide the primary text analysis to help you answer the questions defined in this notebook. The code you write for each Word Question's function will need to call the appropriate Word Analysis function(s) and then complete any additional processing necessary to answer the specific question.  

- Answers to the specific questions above have been provided for you so that you can test your code.  
- Text files for creating word lists are available in the `books` folder.

In [1]:
from __future__ import print_function

import string

### Word Analysis

#### Create Book Word List [12 Marks]

Complete the function below, which should read a book's text file and return a sorted list (ascending) of unique words (i.e. no duplicates) extracted from the book's `text_file` that also exist in the official Sowpods list of approved Scrabble words.  

To create your book's word list: 
- convert all characters to lowercase;  
- replace hyphens with a single space, `' '`, to split hyphenated words into separate words; 
- strip off all contractions and possessives from words: 's, 're, etc.;
- remove all punctuation, whitespace characters and numbers;
- only keep words which also occur in the official Sowpods list.

HINT: The Python Standard Library provides various string constants, such as `whitespace` and `punctuation`.  You may want to review the Python Standard Library's sections on string methods and constants.
- [String constants](https://docs.python.org/3/library/string.html#string-constants)
- [String methods](https://docs.python.org/3/library/stdtypes.html#text-sequence-type-str)

**NOTE: Creating a book's word list may take several seconds of processing time.** We recommend you use the smaller `-ch1.txt` extract files while you are testing your code.
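The cleaning steps listed above can be tried on a single line of text before reading a whole book. The following is only a minimal sketch: `clean_line` is a hypothetical helper name, not part of the assignment, and it deliberately skips the Sowpods lookup so that it runs without any data files.

```python
import string

# Characters to discard from every token (punctuation and digits).
DROP_CHARS = set(string.punctuation + string.digits)

def clean_line(line):
    """Hypothetical helper: apply the cleaning steps to one line of text."""
    words = []
    # Lowercase, then turn hyphens into spaces so hyphenated words split apart.
    for token in line.lower().replace('-', ' ').split():
        # Keep only the text before the first apostrophe ('s, 're, n't, ...).
        token = token.split("'")[0]
        # Remove punctuation and digits character by character.
        token = ''.join(c for c in token if c not in DROP_CHARS)
        if token:  # skip tokens that were cleaned down to nothing
            words.append(token)
    return words

print(clean_line("Tom's well-known aunt said, \"Don't!\" 3 times."))
# ['tom', 'well', 'known', 'aunt', 'said', 'don', 'times']
```

Note that a fragment such as `don` survives the cleaning; it is the final Sowpods check that decides whether a cleaned token counts as a word.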

In [2]:
def create_wordlist(text_file):
    """ Take single argument 'text_file' and process as follows:
            convert all characters to lowercase;
            replace hyphens with a single space, ' ', to split hyphenated words into separate words;
            split line into a list of words
            for each word in this list
                strip off all contractions and possessives from words: 's, 're, etc
                remove all punctuation, whitespace characters and numbers.
            only keep words which also occur in the official Sowpods list
        Return a sorted list (ascending) of the unique words found
        
    """
    #read the Sowpods reference once into a set, instead of re-reading the file for every word
    with open('books/sowpods.txt','r') as sow_pods:
        sowpods_words = set(check_line.strip().lower() for check_line in sow_pods)

    found_words = set()  #a set keeps each word only once
    with open(text_file,'r') as test_book:
        for line in test_book:
            #lowercase the line and split hyphenated words into separate words
            lower_line = line.lower().replace('-',' ')

            for current_word in lower_line.split():  #split() also discards whitespace
                #strip contractions and possessives: keep the text before the first apostrophe
                found_word = current_word.split("'")[0]
                #remove punctuation characters
                no_punc_found_word = ''.join(c for c in found_word if c not in string.punctuation)
                #tokens that still contain digits cannot match a Sowpods entry, so this lookup drops them
                if no_punc_found_word in sowpods_words:
                    found_words.add(no_punc_found_word)
    
    return sorted(found_words)


#### Word Match [8 Marks]

The `word_match` function below should return `True` or `False` depending on whether the `word` can be created from the provided string of `letters`. The `word` does not have to use all of the letters.  NOTE: Each letter in `letters` can only be used once.

For example: 
- `word_match('toe', 'potatoe')`, returns `True`
- `word_match('ball', 'abcledg')`, returns `False`   
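One way to reason about the "each letter can only be used once" rule is to compare letter counts. The sketch below uses `collections.Counter` from the standard library; `can_build` is a hypothetical name chosen for illustration, and this is one possible approach rather than the required solution.

```python
from collections import Counter

def can_build(word, letters):
    """Hypothetical helper: True if every letter of 'word' is available
    in 'letters' at least as many times as the word needs it."""
    needed = Counter(word)
    available = Counter(letters)
    # A Counter returns 0 for letters it has never seen.
    return all(available[ch] >= count for ch, count in needed.items())

print(can_build('toe', 'potatoe'))   # True
print(can_build('ball', 'abcledg'))  # False: only one 'l' is available
```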

In [3]:
def word_match(word, letters):
    """ Return True if argument 'word' can be made from argument 'letters', otherwise False
    
    """
    letters_list = list(letters)  #copy of the available letters that we can consume
    for search_letter in word:
        if search_letter in letters_list:
            letters_list.remove(search_letter)  #each letter can only be used once
        else:
            return False  #a needed letter is missing or already used up
    
    return True


#### Word Score [8 Marks]

The `word_score` function below should return a word's scrabble score (integer). Use the [English Scrabble letter distribution](https://en.wikipedia.org/wiki/Scrabble_letter_distribution) values to calculate the word's scrabble score.

Example: `word_score('affixes')` should return `20`.


Points | Letters
:--: | :--
1 | e, a, i, o, n, r, t, l, s, u
2 | d, g
3 | b, c, m, p
4 | f, h, v, w, y
5 | k
8 | j, x
10 | q, z
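The table above can also be expressed as a lookup from letter to points. This is a small sketch under that assumption; the names `POINT_GROUPS`, `POINTS_BY_LETTER` and `score` are illustrative only, and characters outside the table are assumed to score nothing.

```python
# Points table copied from the assignment: points -> letters with that value.
POINT_GROUPS = {
    1: 'eaionrtlsu',
    2: 'dg',
    3: 'bcmp',
    4: 'fhvwy',
    5: 'k',
    8: 'jx',
    10: 'qz',
}

# Invert the table so each letter maps directly to its points.
POINTS_BY_LETTER = {
    letter: points
    for points, group in POINT_GROUPS.items()
    for letter in group
}

def score(word):
    """Hypothetical helper: sum the points of each letter in 'word'."""
    return sum(POINTS_BY_LETTER.get(ch, 0) for ch in word.lower())

print(score('affixes'))  # a1 + f4 + f4 + i1 + x8 + e1 + s1 = 20
```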

In [4]:
def word_score(word):
    """ Return an integer score value, based on defined letter values, for argument 'word'
    
    """
    value1 = 'eaionrtlsu'
    value2 = 'dg'
    value3 = 'bcmp'
    value4 = 'fhvwy'
    value5 = 'k'
    value8 = 'jx'
    value10 = 'qz'
    
    score = 0
    
    for letter in word:
        if letter in value1:
            score += 1
        elif letter in value2:
            score += 2
        elif letter in value3:
            score += 3
        elif letter in value4:
            score += 4
        elif letter in value5:
            score += 5
        elif letter in value8:
            score += 8
        elif letter in value10:
            score += 10
        #any other character (digit, punctuation, uppercase) adds nothing to the score
            
    return score


#### Find Words [8 Marks]

You're playing Literary Scrabble where you can only play words extracted from famous books. You have pulled your letters and there are lots of possibilities but you obviously want to play a hand which will get you the highest score. 

Complete the `find_words` function below which should return a dictionary of valid words (selected from the provided `words_list`) that can be created from the letters provided. The keys for the returned dictionary are the words, the values are each word's scrabble score. 

- Use the `word_match` function above to find word/letters matches.  
- Use the `word_score` function above to calculate each word's scrabble score.  


In [5]:
def find_words(words_list, letters):
    """ From the argument 'words_list', find which words can be created from the argument 'letters', and calculate
        a score for each found word.
        Return a dictionary of found words and their scores
        
    """
    #create an empty dictionary for results
    results = dict()
    
    #check each word in words_list against the available letters
    for word in words_list:
        #if the word can be made and is not already there, add it and its score to the results dictionary
        if word_match(word,letters) and word not in results:
            results[word] = word_score(word)
    
    #return the results dictionary
    return results


### Word Questions

Each Word Question function below will need to call the appropriate Word Analysis function(s) and then include any additional code required to answer the specific question.

#### a1: Unique words [6 Marks]

Return the number (integer) of unique words contained in `text_file`.

>How many unique words does Mark Twain's <em> Adventures of Huckleberry Finn</em> have?  (available in the A1 `books` folder)  
>Answer: 422

In [6]:
def a1(text_file):
    """ Return the number of unique words in the supplied 'text_file' that also exist in the Sowpods reference list
    
    """ 
    #find all the words in text_file that are in sowpods
    myList = create_wordlist(text_file)
    
    #the length of that list is the answer
    return len(myList)

In [7]:
a = a1('books/adventures_of_huckleberry_finn-ch1.txt')
a

422

#### a2: Highest score [8 Marks]

You're playing Literary Scrabble and it's your turn.  What is the highest scoring word you can play with the letters you have, based on the words from Chapter 1 of Lewis Carroll's <em>Alice's Adventures in Wonderland</em> (available in the A1 `books` folder)?

>Your available scrabble letters are 'qazrvedl'.  What's the highest scoring word you can play?  
>Answer: ('read', 5)
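Picking the best play from a dictionary of word scores can be sketched with the built-in `max` and a tuple key. The sample dictionary below is made up for illustration and is not actual output from the book. The tie-break rule (the word that sorts last wins) is an assumption that happens to agree with the expected answer.

```python
def best_play(scores):
    """Hypothetical helper: return (word, score) for the highest score.
    Ties are broken by taking the word that sorts last."""
    if not scores:
        return ('', 0)  # nothing playable
    # Compare by score first, then by the word itself.
    return max(scores.items(), key=lambda item: (item[1], item[0]))

# Illustrative scores only; all four words can be made from 'qazrvedl'.
sample_scores = {'are': 3, 'dear': 5, 'led': 4, 'read': 5}
print(best_play(sample_scores))  # ('read', 5)
```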

In [8]:
def a2(text_file, letters):
    """ Find the unique words in 'text_file' that can be created from 'letters'
        and the score for each found word, then pick the highest score.
        If two words have the same score, pick the word that sorts last
        in string comparison order.
        Return a tuple of the best word and its score
    
    """
    
    best_word = ''
    best_score = 0
   
    myList = create_wordlist(text_file)
    found_words_dictionary = find_words(myList,letters)
    
    for each_word in found_words_dictionary:
        
        temp_score = found_words_dictionary[each_word]
        
        if temp_score > best_score:
            best_score = temp_score
            best_word = each_word
        elif temp_score == best_score and each_word > best_word:
            #same score: keep the word that sorts last
            best_word = each_word
    
    
    return (best_word,best_score)

In [9]:
a = a2('books/alices_adventures_in_wonderland-ch1.txt','qazrvedl')
a

('read', 5)

#### a3: Books with the most `j` and `x` words [10 Marks]

You've just played all of your letters. According to the rules of Literary Scrabble, if you're out of letters you can choose to switch to a new literary novel but you have to do so before you choose your new letters.  

You've noticed that no one has played a `j` or an `x` for a while, which may mean there are still some left.  That is good, since these are high-value letters, but not if your next literary book doesn't have many words that contain those letters.  You write a function which counts how many words in a book contain certain letters. Your function accepts and returns a list of tuples; each tuple contains information for one book.  

Your function is passed a list of tuples:  
`[(book_ID1, text_file1), (book_ID2, text_file2), (book_ID3, text_file3)]`

Your function should return a list of tuples:  
`[('book_ID1', word_count1), ('book_ID2', word_count2), ('book_ID3', word_count3)]`

> How many words in *Alice's Adventures in Wonderland* and *War of the Worlds* have the letters `j` or `x`?  
>Answer: [('Alice', 9), ('War', 24)]

(The books are available in the A1 books folder)
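The counting step for a single book can be sketched on a small hand-made word list. `count_words_with_letters` is a hypothetical helper name and the sample words are illustrative rather than taken from either book. Each word is counted once, even if it contains several of the letters.

```python
def count_words_with_letters(words, letters):
    """Hypothetical helper: count the words that contain at least one
    of the characters in 'letters'."""
    return sum(1 for word in words
               if any(letter in word for letter in letters))

sample_words = ['box', 'cat', 'jinx', 'jump', 'tree']
print(count_words_with_letters(sample_words, 'jx'))  # 3 (box, jinx, jump)
```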

In [10]:
def a3(book_list, letters):
    """ For each book in the list
        get the list of valid words
        count how many words contain any of the passed letters, e.g. j or x
        Return a list of tuples that have the book_id and the number of found words
        
    """
    
    my_options = []
    
    for book_id, current_book in book_list:
        word_list = create_wordlist(current_book)
        
        #count each word once if it contains at least one of the passed letters
        letter_match_count = 0
        for word in word_list:
            if any(letter in word for letter in letters):
                letter_match_count += 1
        
        my_options.append((book_id,letter_match_count))
    
    return my_options


In [11]:
a = a3([('War','books/war_of_the_worlds-ch1.txt'), 
         ('Alice', 'books/alices_adventures_in_wonderland-ch1.txt')], 'jx')
a

[('War', 24), ('Alice', 9)]