# Object Oriented Programming Exercise

In this exercise, we'll implement a word counter class that analyses some text and counts how often each word appears in it. Complete the `WordCounter` class below so that it:

1. Initialises an attribute at instance creation time (in `__init__`) for storing the word counts.
2. Has a method `count_words` that takes some text and counts the words in it. This method should take a single argument, a string.
3. Has a method `frequency` that returns the number of occurrences of a word. This method should take a single argument, a string of the word being queried, and should return an integer.
4. Has a method `top_words` that returns a list of the most common words. This method should take a single optional argument (defaulting to 1), an integer indicating the number of words to return, and should return a list of strings of the most frequent words, in descending order of frequency.
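
To make the expected inputs and outputs concrete, here is a minimal sketch using `collections.Counter` from the standard library. It is only an illustration of the behaviour described in the list above, not the class you are asked to write, and the sample sentence and variable names are made up for the example.

```python
from collections import Counter

# Split on whitespace and count each resulting string.
counts = Counter("the cat and the hat and the bat".split())

# What `frequency("the")` is expected to report: an integer count.
frequency_of_the = counts["the"]

# What `top_words(2)` is expected to report: a list of strings,
# most frequent first.
top_two = [word for word, _ in counts.most_common(2)]

print(frequency_of_the)
print(top_two)
```

Your `WordCounter` should expose the same information through its own methods while keeping the counts in an attribute on the instance.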

## Stretch Goals

To start off with, you should just split the input text by whitespace and count the occurrences of each string. Once this is done, you can also try making the counter case-insensitive and making it remove punctuation (specifically, all characters other than a-z and digits) from words.
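
As a rough sketch of the normalisation step, the helper below lower-cases a word and then drops every character other than a-z and digits. The function name `normalise` is hypothetical and chosen only for this illustration; any approach that produces the same result is fine.

```python
import re


def normalise(word):
    """Lower-case a word and remove characters other than a-z and 0-9."""
    return re.sub(r"[^a-z0-9]", "", word.lower())


print(normalise("Python."))
print(normalise("Foo"))
```

Note that a token made up entirely of punctuation, such as `"--"`, normalises to an empty string, so you may want to skip empty results when counting.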

## Testing

At the bottom of this notebook you'll find some tests that use your `WordCounter` class to analyse the complete works of William Shakespeare and compare your computed results against the expected results.

## Resources

You may find these resources from the Python standard library documentation useful:

* https://docs.python.org/3/library/stdtypes.html#string-methods
* https://docs.python.org/3/library/re.html
* https://docs.python.org/3/library/string.html

In [None]:
# Fetch data
from pathlib import Path
if not Path('shakespeare.txt').exists():
    !wget https://s3-eu-west-1.amazonaws.com/faculty-client-teaching-materials/object-oriented-programming/shakespeare.txt

In [None]:
import re

CLEAN_REGEX = re.compile(r"[^a-zA-Z0-9 ]")

In [None]:
class WordCounter:
    """Count the frequency of words in some text."""

    def __init__(self):
        self.counts = {}

    def count_words(self, text):
        """Extract words from the provided text and count them.
        
        Parameters
        ----------
        text : str
            The text to analyse
        """
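        for word in text.split():
            # Normalise: strip punctuation, then lower-case.
            clean_word = CLEAN_REGEX.sub("", word).lower()
            # Tokens made up entirely of punctuation (e.g. "--") clean to
            # an empty string, which should not be counted as a word.
            if not clean_word:
                continue
            self.counts[clean_word] = self.counts.get(clean_word, 0) + 1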

    def frequency(self, word):
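        """Get the number of occurrences of a word.
        
        Parameters
        ----------
        word : str
            The word to look up (matched case-insensitively)

        Returns
        -------
        int
            The number of times the word has been counted (0 if never seen)
        """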
        return self.counts.get(word.lower(), 0)

    def top_words(self, number=1):
        """Return the most frequently occurring words.
        
        Parameters
        ----------
        number : int, optional
            The number of words to return (default: 1)
        """
        sorted_words = sorted(
            self.counts.items(), key=lambda t: t[1], reverse=True
        )
        return [word for word, _ in sorted_words[:number]]



In [None]:
# Example usage
counter = WordCounter()
counter.count_words("Foo bar foo foo. I love Python.")

print('Frequency of "foo":')
print(counter.frequency("foo"))

print("Top words:")
print(counter.top_words(3))

## Tests

Run the cell below to check that your code works as expected. These tests assume that you have implemented both stretch goals: case-insensitive counting and punctuation removal.

In [None]:
counter = WordCounter()

with open("shakespeare.txt") as fp:
    counter.count_words(fp.read())

if counter.frequency("now") == 2778:
    print("PASS")
else:
    print('FAIL: The frequency of "now" did not match that expected')

if counter.frequency("Thee") == 3178:
    print("PASS")
else:
    print('FAIL: The frequency of "thee" did not match that expected')

if counter.top_words() == ["the"]:
    print("PASS")
else:
    print("FAIL: The calculated top words did not match those expected")

if counter.top_words(5) == ["the", "and", "i", "to", "of"]:
    print("PASS")
else:
    print("FAIL: The calculated top 5 words did not match those expected")