# Homework 2: Function Practice

For this assignment:
* Please use base Python only with no imported packages
* Remember to include comments and display your outputs
* Make sure all cells are run before submitting to Canvas

## Create a Text Analysis module

Your ultimate goal is to write a Python script whose functions take in a string, a list of strings, or a file name and return the top n most common words, using base Python only with no imported packages.

This assignment will walk you through the following steps:
1. Create a `count_words` function
2. Create a `clean_text` function
3. Create a `most_common` function
4. Create a `count_words_from_documents` function
5. Create a `count_words_from_file` function
6. Create a `top_n_words` function
7. Update `top_n_words` to account for different inputs
8. Update `top_n_words` to account for ties
9. Save all of your functions within a `text_analysis.py` file
10. Run the `text_analysis.py` file from your Terminal or Command Prompt and view the results
11. Import the `text_analysis.py` module into your notebook and test it out
12. Turn a string output into word-count pairs

### 1. Create a `count_words` function

Write a function called `count_words` that takes in a phrase and returns a dictionary of word counts. Apply the function to the test phrase and display the output.

**Expected Output:**

`{'i': 1,
 'bought': 1,
 'a': 2,
 'sandwich': 1,
 'with': 1,
 'side': 1,
 'of': 1,
 'chips': 1}`

In [384]:
test_phrase = 'i bought a sandwich with a side of chips'

In [385]:
def count_words(phrase):
    '''Takes in a phrase and returns a dictionary of word counts'''
    words = phrase.split()
    result = {}
    for word in words:
        if word not in result:
            result[word] = 1  # Initialize with 1 for the first occurrence
        else:
            result[word] += 1  # Increase the count by 1 if word has already been stored
    return result

In [386]:
count_words(test_phrase)

{'i': 1,
 'bought': 1,
 'a': 2,
 'sandwich': 1,
 'with': 1,
 'side': 1,
 'of': 1,
 'chips': 1}
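
The same tally can also be written with `dict.get`, which supplies a default of 0 for words that have not been seen yet. This is only a minimal sketch in base Python; the name `count_words_with_get` is hypothetical and not part of the assignment.

```python
def count_words_with_get(phrase):
    '''Hypothetical variant of count_words that uses dict.get for the default count'''
    result = {}
    for word in phrase.split():
        # get returns 0 when the word has not been stored yet
        result[word] = result.get(word, 0) + 1
    return result


counts = count_words_with_get('i bought a sandwich with a side of chips')
print(counts['a'])  # 2
```

Both versions should produce the same dictionary; the `if`/`else` form simply makes the two cases explicit.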

### 2. Create a `clean_text` function and update the `count_words` function

Write a function called `clean_text` that takes in a phrase, removes the following punctuation marks `.,?!'"` and makes all the text lowercase. Update the `count_words` function to include the `clean_text` function. Apply the `count_words` function to the test phrase and display the output.

In [387]:
test_phrase_2 = 'I bought a sandwich with a side of chips.'

In [388]:
def clean_text(phrase):
    '''Takes in a phrase, removes punctuations, and makes all the text lowercase'''
    phrase = phrase.lower()
    punctuations = '.,?!\'"'  # Punctuation marks to remove
    for punctuation in punctuations:  # Remove each punctuation mark one at a time
        phrase = phrase.replace(punctuation, '')  # Store the result back to phrase
    return phrase

In [389]:
def count_words(phrase):
    '''Takes in a phrase and returns a dictionary of word counts'''
    phrase = clean_text(phrase)
    words = phrase.split()
    result = {}
    for word in words:
        if word not in result:
            result[word] = 1  # Initialize with 1 for the first occurrence
        else:
            result[word] += 1  # Increase the count by 1 if word has already been stored
    return result

In [390]:
count_words(test_phrase_2)

{'i': 1,
 'bought': 1,
 'a': 2,
 'sandwich': 1,
 'with': 1,
 'side': 1,
 'of': 1,
 'chips': 1}
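
As an alternative to calling `replace` once per punctuation mark, the built-in `str.maketrans` and `str.translate` can delete all of the marks in a single pass. The sketch below assumes the same six punctuation marks as the prompt; the name `clean_text_translate` is hypothetical.

```python
def clean_text_translate(phrase):
    '''Hypothetical variant of clean_text built on str.translate'''
    # The third argument lists characters that translate should delete
    removal_table = str.maketrans('', '', '.,?!\'"')
    return phrase.lower().translate(removal_table)


print(clean_text_translate('I bought a sandwich with a side of chips.'))
# i bought a sandwich with a side of chips
```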

### 3. Create a `most_common` function

Write a function called `most_common` that takes in a phrase and returns the most common word from the string. It should include the `count_words` function. For the time being, don't worry about dealing with tie breakers -- just return a single most common word. Apply the `most_common` function to the test phrase and display the output.

In [391]:
test_phrase_2 = 'I bought a sandwich with a side of chips.'

In [392]:
def most_common(phrase):
    '''Takes in a phrase and returns the most common word from the string'''
    counter = count_words(phrase)
    # Return the key with the highest count
    most_common_word = max(counter, key = counter.get)
    return most_common_word

In [393]:
most_common(test_phrase_2)

'a'
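
Since ties are ignored for now, it may help to see what `max` does when two words share the highest count. The small example below uses made-up counts: `max` keeps the first key it encounters with the largest value, and dictionaries iterate in insertion order.

```python
counts = {'pizza': 3, 'pasta': 3, 'salad': 1}

# max scans the keys in insertion order and keeps the first one with the highest count
print(max(counts, key=counts.get))  # pizza

# Reordering the keys changes which tied word is returned
reordered = {'pasta': 3, 'pizza': 3, 'salad': 1}
print(max(reordered, key=reordered.get))  # pasta
```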

### 4. Create a `count_words_from_documents` function

Write a function called `count_words_from_documents` that takes in a list of strings (aka documents), cleans the text, counts the words across all the documents and returns the most common word using the functions you've already written. Apply the `count_words_from_documents` function to the test documents list and display the output.

In [394]:
documents = [
    'I enjoyed a delicious homemade lasagna for dinner last night.',
    'She ordered a classic Caesar salad for lunch at the restaurant.',
    'Breakfast is my favorite meal of the day! I always start with a hearty omelette.',
    "We're having grilled chicken with roasted vegetables for tonight's meal.",
    'On a hot summer day, nothing beats a refreshing fruit salad.',
    'They decided to order takeout sushi for a convenient and tasty meal.',
    'Thanksgiving dinner is traditionally a feast of turkey, stuffing, and cranberry sauce.',
    "My grandmother's homemade apple pie is the perfect dessert to end any meal.",
    'For lunch today, I had a delicious avocado toast topped with poached eggs.',
    'We celebrated his birthday with a mouthwatering chocolate cake and vanilla ice cream.',
    'Dinner was a cozy bowl of tomato soup paired with a crunchy grilled cheese sandwich.',
    'I love starting my day with a bowl of oatmeal, topped with fresh berries and honey.',
    'At the family reunion, we enjoyed a huge barbecue feast of ribs, burgers, and corn on the cob.',
    'She baked a loaf of sourdough bread that filled the kitchen with an amazing aroma.',
    'We indulged in a savory beef stew slow-cooked for hours with carrots and potatoes.'
]

In [395]:
def count_words_from_documents(documents):
    '''Cleans the text, counts the words across all the documents and returns the most common word'''
    full_document = ' '.join(documents)  # Join all documents together into one string
    return most_common(full_document)

In [396]:
count_words_from_documents(documents)

'a'

### 5. Create a `count_words_from_file` function

Write a function called `count_words_from_file` that takes in a file name, reads the file, cleans the text, counts the words and returns the most common word using the functions you've already written. Apply the `count_words_from_file` function to the file name and display the output.

**Reminder:** For this assignment, we're only using base Python, so no imported packages like Pandas, please.

In [397]:
file_name = "peter_pan_chapter_1.txt"

In [398]:
def count_words_from_file(file_name):
    '''Reads the file, cleans the text, counts the words and returns the most common word'''
    # Open the file with a context manager so it is closed automatically
    with open(file_name, 'r', encoding='utf-8') as file:
        doc = file.read()  # Read the whole file into one string
    # doc is a single string, so most_common can be applied to it directly
    return most_common(doc)

In [399]:
count_words_from_file(file_name)

'the'

### 6. Create a `top_n_words` function

Write a function called `top_n_words` that takes in a string and returns the top n words specified by the user. If the user doesn't specify n, the default value of n should be 3. Apply the `top_n_words` function using the default value of `n` to the test phrase and display the output.
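
One way to rank words is to sort the `(word, count)` pairs of the dictionary. `sorted` is stable, so words with equal counts normally keep their first-appearance order; if an alphabetical order among ties is preferred, a tuple key is one option. This is a sketch under that assumption, and `rank_words` is a hypothetical helper name.

```python
def rank_words(counter):
    '''Hypothetical helper: sort (word, count) pairs by count descending, then word ascending'''
    # Negating the count sorts larger counts first; the word breaks ties alphabetically
    return sorted(counter.items(), key=lambda pair: (-pair[1], pair[0]))


sample_counts = {'pizza': 5, 'pasta': 3, 'is': 3, 'i': 3, 'love': 1}
print(rank_words(sample_counts)[:4])
# [('pizza', 5), ('i', 3), ('is', 3), ('pasta', 3)]
```

Slicing the sorted list with `[:n]` returns at most n items and does not raise an error when fewer than n words exist.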

In [400]:
test_phrase_3 = (
    "I love pizza. Pizza is my favorite food. Everyone loves pizza, "
    "but I also enjoy pasta. Pasta with tomato sauce and cheese is delicious. "
    "On weekends, I like to make pizza or pasta for dinner. Pizza is always a hit!"
)

**Example Input:**

`top_n_words(test_phrase_3)`

**Example Output:**

"The word 'pizza' appears 5 times. The word 'i' appears 3 times. The word 'is' appears 3 times."

**Example Input:**

`top_n_words(test_phrase_3, n=4)`

**Example Output:**

"The word 'pizza' appears 5 times. The word 'i' appears 3 times. The word 'is' appears 3 times. The word 'pasta' appears 3 times."

In [401]:
def top_n_words(phrase, n = 3):
    '''Takes in a string and returns the top n words specified by the user'''
    counter = count_words(phrase)
    # Sort the words by their counts, highest first
    sorted_counter = sorted(counter, key = lambda x: counter[x], reverse = True)
    sentences = []  # Collect one sentence per word
    for word in sorted_counter[:n]:  # Slicing avoids an IndexError when n exceeds the number of words
        sentences.append(f"The word '{word}' appears {counter[word]} times.")
    return ' '.join(sentences)  # Separate the sentences with a single space

In [402]:
top_n_words(test_phrase_3)

"The word 'pizza' appears 5 times. The word 'i' appears 3 times. The word 'is' appears 3 times."

### 7. Update the `top_n_words` function

Update the `top_n_words` function to be able to take in a string, list of strings or file name, and return the top n words specified by the user. Apply the `top_n_words` function using the default value of `n` to `test_phrase_3`, `documents` and `file_name`.

In [403]:
def top_n_words(input_data, n = 3):
    '''Takes in a string, list of strings, or file name and returns the top n words specified by the user'''
    if isinstance(input_data, list):  # First check if input_data is a list
        phrase = ' '.join(input_data)
    else:
        try:
            # If the input can be opened, treat it as a file name
            with open(input_data, 'r', encoding='utf-8') as file:
                phrase = file.read()
        except OSError:
            # If it can't be opened (missing file, name too long, etc.), treat it as a regular string
            phrase = input_data
    counter = count_words(phrase)
    sorted_counter = sorted(counter, key = lambda x: counter[x], reverse = True)
    sentences = []
    for word in sorted_counter[:n]:
        sentences.append(f"The word '{word}' appears {counter[word]} times.")
    return ' '.join(sentences)

In [404]:
top_n_words(test_phrase_3)

"The word 'pizza' appears 5 times. The word 'i' appears 3 times. The word 'is' appears 3 times."

In [405]:
top_n_words(documents)

"The word 'a' appears 15 times. The word 'with' appears 9 times. The word 'for' appears 6 times."

In [406]:
top_n_words(file_name)

"The word 'the' appears 135 times. The word 'and' appears 116 times. The word 'a' appears 75 times."

### 8. Update the `top_n_words` function again

Update the `top_n_words` function to be able to handle ties. Describe your approach (there are multiple correct answers to this -- you will get full credit as long as your explanation makes sense). Apply the `top_n_words` function with `n=5` to `test_phrase_3`, `documents` and `file_name`.

#### Answer: In this approach, I check for ties by comparing the count of each word in the dictionary inside a while loop controlled by n. When a word ties with the previous count, n is left unchanged; when the count drops to a new, lower value, n is decreased by 1. The loop ends when the nth distinct count value appears, regardless of how many ties occurred before it.
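
For comparison, one possible variation on this idea keeps every word whose count is among the n highest distinct counts, including all words tied at the last of those counts. The sketch below is not the graded solution; it assumes a dictionary of counts like the one `count_words` returns, and `top_n_with_ties` is a hypothetical name.

```python
def top_n_with_ties(counter, n=3):
    '''Hypothetical helper: return (word, count) pairs for the n highest distinct counts'''
    ranked = sorted(counter.items(), key=lambda pair: pair[1], reverse=True)
    result = []
    distinct_seen = 0  # Number of distinct count values included so far
    previous_count = None
    for word, count in ranked:
        if count != previous_count:
            if distinct_seen == n:
                break  # A new, lower count would be the (n+1)th distinct value
            distinct_seen += 1
            previous_count = count
        result.append((word, count))
    return result


sample_counts = {'pizza': 5, 'i': 3, 'is': 3, 'pasta': 3, 'love': 1, 'my': 1}
print(top_n_with_ties(sample_counts, n=2))
# [('pizza', 5), ('i', 3), ('is', 3), ('pasta', 3)]
```

When the data has fewer than n distinct counts, this version simply returns every word rather than printing a warning.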

In [407]:
def top_n_words(input_data, n = 3):
    '''Takes in a string, list of strings, or file name and returns the top n words specified by the user'''
    if isinstance(input_data, list):  # First check if input_data is a list
        phrase = ' '.join(input_data)
    else:
        try:
            # If the input can be opened, treat it as a file name
            with open(input_data, 'r', encoding='utf-8') as file:
                phrase = file.read()
        except OSError:
            # If it can't be opened (missing file, name too long, etc.), treat it as a regular string
            phrase = input_data
    counter = count_words(phrase)
    sorted_counter = sorted(counter, key = lambda x: counter[x], reverse = True)
    i = 0
    prev_max = counter[sorted_counter[i]]  # Store the previous highest occurrence
    result = ''
    # Note: the loop stops at the first word of the nth distinct count,
    # so other words tied at that last count are not included
    while n > 1:
        if i >= len(sorted_counter):  # Stop if we run out of words before reaching the nth distinct count
            print('\nToo many ties, out of bound')
            break
        value = counter[sorted_counter[i]]
        # A value below prev_max means the tie has ended, so move on to the next highest occurrence
        if value < prev_max:
            n -= 1
            prev_max = value
        result += (f'The word \'{sorted_counter[i]}\' appears {value} times. ')
        i += 1  # Point to the next word in the sorted_counter list
    return result

In [408]:
top_n_words(test_phrase_3, 5)


Too many ties, out of bound


"The word 'pizza' appears 5 times. The word 'i' appears 3 times. The word 'is' appears 3 times. The word 'pasta' appears 3 times. The word 'love' appears 1 times. The word 'my' appears 1 times. The word 'favorite' appears 1 times. The word 'food' appears 1 times. The word 'everyone' appears 1 times. The word 'loves' appears 1 times. The word 'but' appears 1 times. The word 'also' appears 1 times. The word 'enjoy' appears 1 times. The word 'with' appears 1 times. The word 'tomato' appears 1 times. The word 'sauce' appears 1 times. The word 'and' appears 1 times. The word 'cheese' appears 1 times. The word 'delicious' appears 1 times. The word 'on' appears 1 times. The word 'weekends' appears 1 times. The word 'like' appears 1 times. The word 'to' appears 1 times. The word 'make' appears 1 times. The word 'or' appears 1 times. The word 'for' appears 1 times. The word 'dinner' appears 1 times. The word 'always' appears 1 times. The word 'a' appears 1 times. The word 'hit' appears 1 times.

In [409]:
top_n_words(documents, 5)

"The word 'a' appears 15 times. The word 'with' appears 9 times. The word 'for' appears 6 times. The word 'the' appears 6 times. The word 'of' appears 6 times. The word 'and' appears 6 times. The word 'i' appears 4 times. The word 'meal' appears 4 times. The word 'dinner' appears 3 times. "

In [410]:
top_n_words(file_name, 5)

"The word 'the' appears 135 times. The word 'and' appears 116 times. The word 'a' appears 75 times. The word 'she' appears 74 times. The word 'of' appears 69 times. "

### 9. Create a `text_analysis.py` file

Save all of your final functions in a `text_analysis.py` file. Include the following code at the bottom of your file.

```
# Example usage
if __name__ == "__main__":
    test_phrase = (
        "I love pizza. Pizza is my favorite food. Everyone loves pizza, "
        "but I also enjoy pasta. Pasta with tomato sauce and cheese is delicious. "
        "On weekends, I like to make pizza or pasta for dinner. Pizza is always a hit!"
    )
    print(top_n_words(test_phrase, n=5))
```

Include this `.py` file in your final submission to Canvas.

### 10. Run the `text_analysis.py` file

Run the `text_analysis.py` script in your Terminal or Command Prompt by entering the command `python text_analysis.py`.

After running the script, describe the output you see and explain why you see that output. Be sure to mention the role of the `if __name__ == "__main__":` portion in your explanation.

In [418]:
!py text_analysis.py


Too many ties, out of bound
The word 'pizza' appears 5 times. The word 'i' appears 3 times. The word 'is' appears 3 times. The word 'pasta' appears 3 times. The word 'love' appears 1 times. The word 'my' appears 1 times. The word 'favorite' appears 1 times. The word 'food' appears 1 times. The word 'everyone' appears 1 times. The word 'loves' appears 1 times. The word 'but' appears 1 times. The word 'also' appears 1 times. The word 'enjoy' appears 1 times. The word 'with' appears 1 times. The word 'tomato' appears 1 times. The word 'sauce' appears 1 times. The word 'and' appears 1 times. The word 'cheese' appears 1 times. The word 'delicious' appears 1 times. The word 'on' appears 1 times. The word 'weekends' appears 1 times. The word 'like' appears 1 times. The word 'to' appears 1 times. The word 'make' appears 1 times. The word 'or' appears 1 times. The word 'for' appears 1 times. The word 'dinner' appears 1 times. The word 'always' appears 1 times. The word 'a' appears 1 times. The

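#### Answer: When the script is executed, the `top_n_words` function is called on the test phrase with `n=5`, and the most common words and their counts are printed. Because the phrase has only three distinct count values (5, 3 and 1), fewer than the 5 requested, the function reaches the end of the word list, prints "Too many ties, out of bound", and returns every word. The `if __name__ == "__main__":` block ensures that this example only runs when the script is executed directly from the terminal, and not when it is imported as a module in another script. This makes the script reusable.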
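
The behavior of `__name__` can be checked with a few lines. This is only an illustrative sketch; `describe_run_mode` is a hypothetical helper and is not part of `text_analysis.py`.

```python
def describe_run_mode(module_name):
    '''Hypothetical helper: report how a file with the given __name__ is being used'''
    if module_name == '__main__':
        return 'run directly as a script'
    return f'imported as the module {module_name!r}'


# Python sets __name__ to '__main__' only in the file that was launched directly
print(describe_run_mode(__name__))

if __name__ == '__main__':
    # This line is skipped when another file imports this one
    print('Example usage runs here')
```

Saved to a file and run from the terminal, this would report that it was run directly; imported from another file, it would report the module name and skip the guarded block.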

### 11. Import your newly created `text_analysis` module

Within this notebook, import the `text_analysis` module you just created. Test it out by using it to:
1. View the docstring for the `clean_text` function
2. Apply the `top_n_words` function to the second chapter of Peter Pan using the default arguments
3. Apply the `top_n_words` function to find the top 10 words in the second chapter of Peter Pan

In [419]:
import text_analysis as ta

In [420]:
help(ta.clean_text) # View the doc string

Help on function clean_text in module text_analysis:

clean_text(phrase)
    Takes in a phrase, removes punctuations, and makes all the text lowercase



In [421]:
ta.top_n_words("peter_pan_chapter_2.txt")

The word 'the' appears 124 times.The word 'to' appears 81 times.The word 'and' appears 80 times.

In [422]:
ta.top_n_words("peter_pan_chapter_2.txt", 10)

The word 'the' appears 124 times.The word 'to' appears 81 times.The word 'and' appears 80 times.The word 'it' appears 71 times.The word 'he' appears 67 times.The word 'a' appears 59 times.The word 'was' appears 56 times.The word 'i' appears 54 times.The word 'had' appears 47 times.The word 'her' appears 46 times.

### 12. Turn the string output into word-count pairs

Take the string output of the top 10 words in the second chapter of Peter Pan, and turn it into a list of tuples representing the word-count pairs.

**Input**:

```"The word 'the' appears 124 times. The word 'to' appears 81 times. ..."```

**Output**:

```
[('the', 124),
 ('to', 81),
 ...
```

In [423]:
input_string = top_n_words("peter_pan_chapter_2.txt", 10)

In [424]:
# Initialize an empty list to hold the tuples
word_count_pairs = []

# Split the input string into individual statements at each period followed by a space
statements = input_string.split('. ')

# Process each statement to extract the word and its count
for statement in statements:
    parts = statement.split(' ')  # Split the statement into words
    if len(parts) == 6:  # A valid statement has exactly six parts
        word = parts[2].strip("'")  # Remove the surrounding single quotes
        count = int(parts[4])  # Convert the count from a string to an integer
        word_count_pairs.append((word, count))

In [425]:
word_count_pairs

[('the', 124),
 ('to', 81),
 ('and', 80),
 ('it', 71),
 ('he', 67),
 ('a', 59),
 ('was', 56),
 ('i', 54),
 ('had', 47),
 ('her', 46)]
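
Another way to pull out the pairs is to split each statement on the single quotes around the word, which works whether or not the sentences are separated by a space. The sketch below assumes the cleaned words contain no apostrophes or periods (as `clean_text` guarantees); `parse_word_counts` is a hypothetical name and the sample string is shortened for illustration.

```python
def parse_word_counts(summary):
    '''Hypothetical helper: turn the sentence summary into (word, count) tuples'''
    pairs = []
    for statement in summary.split('.'):
        pieces = statement.split("'")  # e.g. ['The word ', 'the', ' appears 124 times']
        if len(pieces) == 3:
            word = pieces[1]
            count = int(pieces[2].split()[1])  # Second token after the quoted word
            pairs.append((word, count))
    return pairs


sample = "The word 'the' appears 124 times. The word 'to' appears 81 times."
print(parse_word_counts(sample))
# [('the', 124), ('to', 81)]
```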