# Before your start:
- Read the README.md file
- Comment as much as you can and use the resources in the README.md file
- Happy learning!

In [36]:
# Import reduce from functools, numpy and pandas
from functools import reduce
import numpy
import pandas

# Challenge 1 - Mapping

#### We will use the map function to clean up words in a book.

In the following cell, we will read a text file containing the book The Prophet by Khalil Gibran.

In [37]:
# Run this code:

location = '../data/58585-0.txt'
with open(location, 'r', encoding="utf8") as f:
    prophet = f.read().split(' ')

In [38]:
len(prophet)

13637

#### Let's remove the first 568 words since they contain information about the book but are not part of the book itself. 

Do this by removing from `prophet` elements 0 through 567 of the list (you can also do this by keeping elements 568 through the last element).

In [39]:
shortened_prophet = prophet

for i in range(0, 568):
    shortened_prophet.pop(i)

print(len(shortened_prophet))

13069


If you look through the words, you will find that many words have a reference attached to them. For example, let's look at words 1 through 10.

In [40]:
for i in range(1, 11):
    print(shortened_prophet[i])

EBook
The
by
Gibran

This
is
the
of
anywhere
the
States


#### The next step is to create a function that will remove references. 

We will do this by splitting the string on the `{` character and keeping only the part before this character. Write your function below.

In [43]:
def reference(x: str):
    '''
    Input: A string
    Output: The string with references removed
    
    Example:
    Input: 'the{7}'
    Output: 'the'
    '''
    if x.__contains__("{"):
        return x.split("{")[0] if '{' in x else x
    

Now that we have our function, use the `map()` function to apply this function to our book, The Prophet. Return the resulting list to a new list called `prophet_reference`.

In [77]:
prophet_no_filter = list(map(reference, shortened_prophet))
filtered = [x for x in prophet_no_filter if x is not None]
prophet_reference = [x for x in filtered if x != ""]
print(prophet_reference)

['the', 'and\n', 'spirit,\n', 'I\n', 'and\n', '*****\n\n', '*****\n\n', '*****\n\n', 'the\n', 'for\n', 'rend\n', '*****\n\n', 'consumed.\n', 'wine,\n', 'winepress.\n\n*****\n*****\n\n', 'dream,\n', 'bind\n', '*****\n\n', 'in\n', '*****\n\n', 'you\n', 'that\n', 'that\n', '*****\n\n', '*****\n\n', '*****\n\n', 'serpent.\n', 'is\n', 'the\n', '*****\n\n', 'what\n', '*****\n\n', 'your\n', 'would\n', '*****\n\n', 'the\n', '*****\n\n', '*****\n\n', 'your\n', '*****\n\n', 'it.\n', '*****\n\n', '*****\n\n', 'and\n', '*****\n\n', 'space.\n', '*****\n\n', 'to\n', '*****\n\n', 'very\n', 'mountains.\n', '*****\n\n', '*****\n\n', 'veil.\n', '*****\n\n', '*****\n\n', 'dance.\n', 'emptiness;\n', 'the\n', 'offended.\n', 'days.\n', '0125]\n\n', 'doubter;\n', 'crystal.\n', 'after\n']


Another thing you may have noticed is that some words contain a line break. Let's write a function to split those words. Our function will return the string split on the character `\n`. Write your function in the cell below.

In [79]:
def line_break(x: str):
    '''
    Input: A string
    Output: A list of strings split on the line break (\n) character
        
    Example:
    Input: 'the\nbeloved'
    Output: ['the', 'beloved']
    '''
    if x.__contains__("\n"):
        return x.split("\n") if "\n" in x else [x]

Apply the `line_break` function to the `prophet_reference` list. Name the new list `prophet_line`.

In [None]:
prophet_line = list(map(line_break, prophet_reference))
print(prophet_line)

[None, ['and', ''], ['spirit,', ''], ['I', ''], ['and', ''], ['*****', '', ''], ['*****', '', ''], ['*****', '', ''], ['the', ''], ['for', ''], ['rend', ''], ['*****', '', ''], ['consumed.', ''], ['wine,', ''], ['winepress.', '', '*****', '*****', '', ''], ['dream,', ''], ['bind', ''], ['*****', '', ''], ['in', ''], ['*****', '', ''], ['you', ''], ['that', ''], ['that', ''], ['*****', '', ''], ['*****', '', ''], ['*****', '', ''], ['serpent.', ''], ['is', ''], ['the', ''], ['*****', '', ''], ['what', ''], ['*****', '', ''], ['your', ''], ['would', ''], ['*****', '', ''], ['the', ''], ['*****', '', ''], ['*****', '', ''], ['your', ''], ['*****', '', ''], ['it.', ''], ['*****', '', ''], ['*****', '', ''], ['and', ''], ['*****', '', ''], ['space.', ''], ['*****', '', ''], ['to', ''], ['*****', '', ''], ['very', ''], ['mountains.', ''], ['*****', '', ''], ['*****', '', ''], ['veil.', ''], ['*****', '', ''], ['*****', '', ''], ['dance.', ''], ['emptiness;', ''], ['the', ''], ['offended.', '

If you look at the elements of `prophet_line`, you will see that the function returned lists and not strings. Our list is now a list of lists. Flatten the list using list comprehension. Assign this new list to `prophet_flat`.

In [None]:
def filter_none(list: list):
    filtered = []
    for i in list:
        if i != None:
            filtered.append(i)
    return filtered

filtered_prophet_line = filter_none(prophet_line)
prophet_flat = [word for sublist in filtered_prophet_line for word in sublist]
print(prophet_flat)

['and', '', 'spirit,', '', 'I', '', 'and', '', '*****', '', '', '*****', '', '', '*****', '', '', 'the', '', 'for', '', 'rend', '', '*****', '', '', 'consumed.', '', 'wine,', '', 'winepress.', '', '*****', '*****', '', '', 'dream,', '', 'bind', '', '*****', '', '', 'in', '', '*****', '', '', 'you', '', 'that', '', 'that', '', '*****', '', '', '*****', '', '', '*****', '', '', 'serpent.', '', 'is', '', 'the', '', '*****', '', '', 'what', '', '*****', '', '', 'your', '', 'would', '', '*****', '', '', 'the', '', '*****', '', '', '*****', '', '', 'your', '', '*****', '', '', 'it.', '', '*****', '', '', '*****', '', '', 'and', '', '*****', '', '', 'space.', '', '*****', '', '', 'to', '', '*****', '', '', 'very', '', 'mountains.', '', '*****', '', '', '*****', '', '', 'veil.', '', '*****', '', '', '*****', '', '', 'dance.', '', 'emptiness;', '', 'the', '', 'offended.', '', 'days.', '', '0125]', '', '', 'doubter;', '', 'crystal.', '', 'after', '']


# Challenge 2 - Filtering

When printing out a few words from the book, we see that there are words that we may not want to keep if we choose to analyze the corpus of text. Below is a list of words that we would like to get rid of. Create a function that will return false if it contains a word from the list of words specified and true otherwise.

In [None]:
def word_filter(x: str):
    '''
    Input: A string
    Output: True if the word is not in the specified list 
    and False if the word is in the list.
        
    Example:
    word list = ['and', 'the']
    Input: 'and'
    Output: False
    
    Input: 'John'
    Output: True
    '''
    
    word_list = ['and', 'the', 'a', 'an']

    if x in word_list:
        return False
    else:
        return True

print(word_filter("John"))
print(word_filter("a"))

True
False


Use the `filter()` function to filter out the words speficied in the `word_filter()` function. Store the filtered list in the variable `prophet_filter`.

In [115]:
prophet_filter = list(filter(word_filter, prophet_flat))

def filter_empty(list: list):
    filtered = []
    for word in list:
        if len(word) > 0:
            filtered.append(word)
    return filtered

print(filter_empty(prophet_filter))

['spirit,', 'I', '*****', '*****', '*****', 'for', 'rend', '*****', 'consumed.', 'wine,', 'winepress.', '*****', '*****', 'dream,', 'bind', '*****', 'in', '*****', 'you', 'that', 'that', '*****', '*****', '*****', 'serpent.', 'is', '*****', 'what', '*****', 'your', 'would', '*****', '*****', '*****', 'your', '*****', 'it.', '*****', '*****', '*****', 'space.', '*****', 'to', '*****', 'very', 'mountains.', '*****', '*****', 'veil.', '*****', '*****', 'dance.', 'emptiness;', 'offended.', 'days.', '0125]', 'doubter;', 'crystal.', 'after']


# Bonus Challenge

Rewrite the `word_filter` function above to not be case sensitive.

In [130]:
def word_filter_case(x: list):
    word_list = ['and', 'the', 'a', 'an']
    
    if x.lower() in word_list:
        return False
    else:
        return True
    
word_filter_case("a")

False

# Challenge 3 - Reducing

#### Now that we have significantly cleaned up our text corpus, let's use the `reduce()` function to put the words back together into one long string separated by spaces. 

We will start by writing a function that takes two strings and concatenates them together with a space between the two strings.

In [137]:
def concat_space(a, b):
    '''
    Input:Two strings
    Output: A single string separated by a space
        
    Example:
    Input: 'John', 'Smith'
    Output: 'John Smith'
    '''
    return f"{a} {b}"
    
print(concat_space("hello", "world"))

hello world


In [None]:
# your code here

Use the function above to reduce the text corpus in the list `prophet_filter` into a single string. Assign this new string to the variable `prophet_string`.

In [139]:
prophet_string = reduce(concat_space, prophet_filter)
print(prophet_string)

 spirit,  I   *****   *****   *****    for  rend  *****   consumed.  wine,  winepress.  ***** *****   dream,  bind  *****   in  *****   you  that  that  *****   *****   *****   serpent.  is   *****   what  *****   your  would  *****    *****   *****   your  *****   it.  *****   *****    *****   space.  *****   to  *****   very  mountains.  *****   *****   veil.  *****   *****   dance.  emptiness;   offended.  days.  0125]   doubter;  crystal.  after 
