# Before you start:
- Read the README.md file
- Comment as much as you can and use the resources in the README.md file
- Happy learning!

In [2]:
# Import reduce from functools, numpy and pandas
from functools import reduce
import numpy
import pandas

# Challenge 1 - Mapping

#### We will use the map function to clean up words in a book.

In the following cell, we will read a text file containing the book The Prophet by Khalil Gibran.

In [3]:
# Run this code:

location = '../data/58585-0.txt'
with open(location, 'r', encoding="utf8") as f:
    prophet = f.read().split(' ')

In [4]:
len(prophet)

13637

#### Let's remove the first 568 words since they contain information about the book but are not part of the book itself. 

Do this by removing from `prophet` elements 0 through 567 of the list (you can also do this by keeping elements 568 through the last element).

In [5]:
# your code here
prophet = prophet[568:]
print(len(prophet))

13069


If you look through the words, you will find that many words have a reference attached to them. For example, let's look at words 1 through 10.

In [6]:
# your code here
print(prophet[0:10])

['PROPHET\n\n|Almustafa,', 'the{7}', 'chosen', 'and', 'the\nbeloved,', 'who', 'was', 'a', 'dawn', 'unto']


#### The next step is to create a function that will remove references. 

We will do this by splitting the string on the `{` character and keeping only the part before this character. Write your function below.

In [7]:
def reference(x):
    '''
    Input: A string
    Output: The string with references removed
    
    Example:
    Input: 'the{7}'
    Output: 'the'
    '''
    # your code here
    # Keep only the text before the first '{'
    return x.split('{')[0]
 

In [8]:
# your code here
res=reference(prophet[1])
print(res)


the


Now that we have our function, use the `map()` function to apply this function to our book, The Prophet. Return the resulting list to a new list called `prophet_reference`.
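As a quick illustration of how `map()` behaves, here is a minimal sketch on a toy list. The names `words` and `strip_reference` and the sample values are made up for this example and are not part of the lab data. Note that `map()` returns a lazy iterator, so it has to be wrapped in `list()` to get a list back.

```python
# Toy data standing in for the book's word list (hypothetical values).
words = ['the{7}', 'chosen', 'dawn{12}']

def strip_reference(word):
    # Keep only the text before the first '{'.
    return word.split('{')[0]

lazy = map(strip_reference, words)  # a lazy iterator; nothing is computed yet
cleaned = list(lazy)                # consuming the iterator builds the list
print(cleaned)  # ['the', 'chosen', 'dawn']
```

Once the iterator has been consumed, calling `list(lazy)` a second time returns an empty list.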

In [9]:
# your code here
# Apply the reference() function defined above to every word
prophet_reference = list(map(reference, prophet))
print(prophet_reference[0:10])


['PROPHET\n\n|Almustafa,', 'the', 'chosen', 'and', 'the\nbeloved,', 'who', 'was', 'a', 'dawn', 'unto']

Another thing you may have noticed is that some words contain a line break. Let's write a function to split those words. Our function will return the string split on the character `\n`. Write your function in the cell below.

In [12]:
def line_break(x):
    return x.split('\n')
    # your code here
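To see what splitting on `\n` produces, here is a small sketch using three strings taken from the output above. The helper name `split_lines` is hypothetical. It shows that every input yields a list, and that two consecutive line breaks yield an empty string in the result.

```python
def split_lines(text):
    # str.split always returns a list, even when no '\n' is present.
    return text.split('\n')

samples = ['the\nbeloved,', 'PROPHET\n\n|Almustafa,', 'dawn']
pieces = [split_lines(s) for s in samples]
print(pieces)
# [['the', 'beloved,'], ['PROPHET', '', '|Almustafa,'], ['dawn']]
```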

Apply the `line_break` function to the `prophet_reference` list. Name the new list `prophet_line`.

In [15]:
# your code here
# line_break returns a list for each word, so prophet_line is a list of lists
prophet_line = list(map(line_break, prophet_reference))


If you look at the elements of `prophet_line`, you will see that the function returned lists and not strings. Our list is now a list of lists. Flatten the list using list comprehension. Assign this new list to `prophet_flat`.
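The flattening step can be sketched on a toy nested list as follows. The variable names are hypothetical, and `itertools.chain.from_iterable` is shown only as an equivalent alternative, not as what the lab asks for. The last lines show why the input must really be a list of lists: the same comprehension applied to a list of strings splits each string into characters.

```python
from itertools import chain

nested = [['the', 'beloved,'], ['PROPHET', '', '|Almustafa,'], ['dawn']]

# Outer loop walks the sublists, inner loop walks the items of each sublist.
flat_comprehension = [item for sub in nested for item in sub]

# Equivalent result using the standard library.
flat_chain = list(chain.from_iterable(nested))

print(flat_comprehension)
# ['the', 'beloved,', 'PROPHET', '', '|Almustafa,', 'dawn']

# Pitfall: on a list of strings, the inner loop iterates over characters.
chars = [c for s in ['ab', 'c'] for c in s]
print(chars)  # ['a', 'b', 'c']
```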

In [18]:
# your code here
prophet_flat = [i for sub in prophet_line for i in sub]

# Challenge 2 - Filtering

When printing out a few words from the book, we see that there are words we may not want to keep if we choose to analyze the corpus of text. Below is a list of words that we would like to get rid of. Create a function that returns `False` if its argument is one of the words in the list and `True` otherwise.

In [22]:
word_list = ['and', 'the', 'a', 'an']

def word_filter(x):
    # your code here
    # Return False for words in word_list and True for everything else
    return x not in word_list

print(word_filter('and'))
print(word_filter('John'))

False
True


Use the `filter()` function to remove the words specified in the `word_filter()` function from the book. Store the filtered list in the variable `prophet_filter`.
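As a rough sketch of how `filter()` works, the example below keeps only the items for which the predicate returns `True`. The names `STOP_WORDS`, `keep_word`, and `sample` are hypothetical and the sample words are made up. Like `map()`, `filter()` returns a lazy iterator, so it is wrapped in `list()`.

```python
STOP_WORDS = ['and', 'the', 'a', 'an']

def keep_word(word):
    # True means "keep this item"; False means "drop it".
    return word not in STOP_WORDS

sample = ['the', 'chosen', 'and', 'the', 'beloved,', 'And']
kept = list(filter(keep_word, sample))
print(kept)  # ['chosen', 'beloved,', 'And']
```

The capitalized `'And'` survives because the membership test is case sensitive.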

In [26]:
# your code here
def word_filter(x): 
    word_list = ['and', 'the', 'a', 'an']  
    return x not in word_list  

# Filter the flattened book, not word_list itself
prophet_filter = list(filter(word_filter, prophet_flat))


# Bonus Challenge

Rewrite the `word_filter` function above to not be case sensitive.

In [27]:
def word_filter_case(x):
   
    word_list = ['and', 'the', 'a', 'an']
    
    # your code here
    return x.lower() not in (word.lower() for word in word_list)
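One possible variation, sketched under the assumption that the stop words are already stored in lowercase: keep them in a `frozenset` built once, so that only the incoming word needs to be lowercased on each call. The names `STOP_WORDS` and `keep_word_any_case` are hypothetical.

```python
# Assumed to be lowercase already, so no per-call normalization of the set is needed.
STOP_WORDS = frozenset({'and', 'the', 'a', 'an'})

def keep_word_any_case(word):
    # Lowercase the candidate only; set membership is a constant-time lookup on average.
    return word.lower() not in STOP_WORDS

sample = ['The', 'chosen', 'AND', 'beloved,', 'A']
kept = [w for w in sample if keep_word_any_case(w)]
print(kept)  # ['chosen', 'beloved,']
```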

# Challenge 3 - Reducing

#### Now that we have significantly cleaned up our text corpus, let's use the `reduce()` function to put the words back together into one long string separated by spaces. 

We will start by writing a function that takes two strings and concatenates them together with a space between the two strings.

In [28]:
def concat_space(a, b):
    '''
    Input: Two strings
    Output: A single string separated by a space
        
    Example:
    Input: 'John', 'Smith'
    Output: 'John Smith'
    '''
    
    # your code here
    return a + ' ' + b

In [29]:
# your code here
from functools import reduce

word_list = ['and', 'the', 'a', 'an']
result = reduce(concat_space, word_list)
print(result) 

and the a an
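To make the folding visible, the sketch below uses `itertools.accumulate` to list the intermediate values that `reduce()` passes along. The helper `join_with_space` is a stand-in with the same behavior as `concat_space`, redefined here so the example runs on its own.

```python
from functools import reduce
from itertools import accumulate

def join_with_space(a, b):
    return a + ' ' + b

words = ['and', 'the', 'a', 'an']

# accumulate yields every intermediate result; reduce returns only the last one.
steps = list(accumulate(words, join_with_space))
print(steps)  # ['and', 'and the', 'and the a', 'and the a an']

folded = reduce(join_with_space, words)
print(folded == steps[-1] == ' '.join(words))  # True

# With an empty list, reduce needs an initial value; without one it raises TypeError.
print(repr(reduce(join_with_space, [], '')))  # ''
```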


Use the function above to reduce the text corpus in the list `prophet_filter` into a single string. Assign this new string to the variable `prophet_string`.

In [32]:
# your code here
# Sanity check on a short sample, kept in its own variable so prophet_filter is not overwritten
sample = [
    "Your", "children", "are", "not", "your", "children.",
    "They", "are", "sons", "daughters", "of", "Life's", "longing", "for", "itself."]
print(reduce(concat_space, sample))

# Reduce the filtered book itself into one string
prophet_string = reduce(concat_space, prophet_filter)

Your children are not your children. They are sons daughters of Life's longing for itself.
