# Before you start:
- Read the README.md file
- Comment as much as you can and use the resources in the README.md file
- Happy learning!

In [29]:
# Import reduce from functools, numpy and pandas
from functools import reduce
import numpy
import pandas

# Challenge 1 - Mapping

#### We will use the map function to clean up words in a book.

In the following cell, we will read a text file containing the book The Prophet by Kahlil Gibran.

In [11]:
# Run this code:

location = '../data/58585-0.txt'
with open(location, 'r', encoding="utf8") as f:
    prophet = f.read().split(' ')

In [12]:
len(prophet)

13637
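Because the file is split on a single space, line breaks stay inside some tokens and double spaces produce empty strings. A minimal sketch of that behavior, using a made-up `sample` string rather than the book itself:

```python
# Hypothetical sample that mimics the book's layout: a line break and a double space.
sample = "own\nday, had waited.  You may"

# Splitting on a single space keeps '\n' inside tokens and yields '' for the double space.
by_space = sample.split(' ')

# Splitting with no argument breaks on any run of whitespace and never yields ''.
by_whitespace = sample.split()

print(by_space)
print(by_whitespace)
```

The lab keeps `split(' ')`, which is why later steps have to deal with `\n` inside words.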

#### Let's remove the first 568 words since they contain information about the book but are not part of the book itself. 

Do this by removing from `prophet` elements 0 through 567 of the list (you can also do this by keeping elements 568 through the last element).

In [23]:
# Drop the first 568 words (front matter that is not part of the book) and keep the rest

elements_over_568 = prophet[568:]

print(elements_over_568)

['PROPHET\n\n|Almustafa,', 'the{7}', 'chosen', 'and', 'the\nbeloved,', 'who', 'was', 'a', 'dawn', 'unto', 'his', 'own\nday,', 'had', 'waited', 'twelve', 'years', 'in', 'the', 'city\nof', 'Orphalese', 'for', 'his', 'ship', 'that', 'was', 'to\nreturn', 'and', 'bear', 'him', 'back', 'to', 'the', 'isle', 'of\nhis', 'birth.\n\nAnd', 'in', 'the', 'twelfth', 'year,', 'on', 'the', 'seventh\nday', 'of', 'Ielool,', 'the', 'month', 'of', 'reaping,', 'he\nclimbed', 'the', 'hill', 'without', 'the', 'city', 'walls\nand', 'looked', 'seaward;', 'and', 'he', 'beheld', 'his\nship', 'coming', 'with', 'the', 'mist.\n\nThen', 'the', 'gates', 'of', 'his', 'heart', 'were', 'flung\nopen,', 'and', 'his', 'joy', 'flew', 'far', 'over', 'the', 'sea.\nAnd', 'he', 'closed', 'his', 'eyes', 'and', 'prayed', 'in', 'the\nsilences', 'of', 'his', 'soul.\n\n*****\n\nBut', 'as', 'he', 'descended', 'the', 'hill,', 'a', 'sadness\ncame', 'upon', 'him,', 'and', 'he', 'thought', 'in', 'his\nheart:\n\nHow', 'shall', 'I', 'go', '

If you look through the words, you will find that many words have a reference attached to them. For example, let's look at words 1 through 10.

In [24]:
# Rather than slicing words 1 through 10, collect every word that carries a {n} reference

references = list(filter(lambda x: "{" in x, elements_over_568))
print(references)

['the{7}', '{8}Long\nwere', 'and\n{9}the', '{10}And', '{11}Shall', '{12}he\nhimself', 'spirit,\n{13}and', '{14}And\nshe', 'I\n{15}speak', 'and\n{16}caresses', '{17}your', '{18}To\nknow', '*****\n\n{19}Then', '{20}Sing', '*****\n\n{21}And', '{22}For\nlife', '*****\n\n{23}Then', 'the\n{24}much', 'for\n{25}one', 'rend\n{26}their', '*****\n\n{27}Then', 'consumed.\n{28}For', 'wine,\n{29}let', 'winepress.\n\n*****\n*****\n\n{30}', 'dream,\n{31}assigned', 'bind\n{32}yourself', '{33}And', '{34}And', '*****\n\n{35}Then', 'in\n{36}your', '*****\n\n{37}Then', 'you\n{38}might', 'that\n{39}enters', 'that\n{40}covers', '*****\n\n{41}And', '{42}Forget', '*****\n\n{43}And', '{44}And\nsuffer', '*****\n\n{45}Then', 'serpent.\n{46}But', '{47}So', 'is\n{48}the', 'the\n{49}fruitless,', '{50}And', '*****\n\n{51}Then', '{52}What', 'what\n{53}images', '*****\n\n{54}And', 'your\n{55}nights', 'would\n{56}dethrone,', '*****\n\n{57}And', '{58}For', 'the\n{59}mighty', '*****\n\n{60}And', '{61}within', '*****\n\n{6

#### The next step is to create a function that will remove references. 

We will do this by splitting the string on the `{` character and keeping only the part before this character. Write your function below.

In [32]:
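import re

def remove_references(word):
    # Delete every reference such as {7}; the non-greedy .*? stops at the first closing brace
    return re.sub(r'\{.*?\}', '', word)

# Apply the function to every word after the front matter
cleaned_references = list(map(remove_references, elements_over_568))

print(cleaned_references)
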
['PROPHET\n\n|Almustafa,', 'the', 'chosen', 'and', 'the\nbeloved,', 'who', 'was', 'a', 'dawn', 'unto', 'his', 'own\nday,', 'had', 'waited', 'twelve', 'years', 'in', 'the', 'city\nof', 'Orphalese', 'for', 'his', 'ship', 'that', 'was', 'to\nreturn', 'and', 'bear', 'him', 'back', 'to', 'the', 'isle', 'of\nhis', 'birth.\n\nAnd', 'in', 'the', 'twelfth', 'year,', 'on', 'the', 'seventh\nday', 'of', 'Ielool,', 'the', 'month', 'of', 'reaping,', 'he\nclimbed', 'the', 'hill', 'without', 'the', 'city', 'walls\nand', 'looked', 'seaward;', 'and', 'he', 'beheld', 'his\nship', 'coming', 'with', 'the', 'mist.\n\nThen', 'the', 'gates', 'of', 'his', 'heart', 'were', 'flung\nopen,', 'and', 'his', 'joy', 'flew', 'far', 'over', 'the', 'sea.\nAnd', 'he', 'closed', 'his', 'eyes', 'and', 'prayed', 'in', 'the\nsilences', 'of', 'his', 'soul.\n\n*****\n\nBut', 'as', 'he', 'descended', 'the', 'hill,', 'a', 'sadness\ncame', 'upon', 'him,', 'and', 'he', 'thought', 'in', 'his\nheart:\n\nHow', 'shall', 'I', 'go', 'in'
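The prompt describes splitting on `{` and keeping the part before it, while the cell above deletes the `{n}` marker with a regular expression. The two are not equivalent when text follows the reference. A small sketch comparing them on a few words taken from the output above (the function names here are illustrative, not part of the lab):

```python
import re

def remove_references_split(word):
    # Keep only the text before the first '{'; a word without '{' is returned unchanged.
    return word.split('{')[0]

def remove_references_regex(word):
    # Delete each {n} marker and keep everything around it.
    return re.sub(r'\{.*?\}', '', word)

sample = ['the{7}', 'chosen', '{8}Long\nwere']

split_result = list(map(remove_references_split, sample))
regex_result = list(map(remove_references_regex, sample))

print(split_result)
print(regex_result)
```

With the split version, `'{8}Long\nwere'` collapses to an empty string, so the words after the reference are lost; the regular expression keeps them.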

Now that we have our function, use the `map()` function to apply it to our book, The Prophet. Assign the resulting list to a new list called `prophet_reference`.

In [35]:
# Strip the {n} references from every word in the book

whole_cleaned_list = list(map(remove_references, prophet))

print(whole_cleaned_list)

['\ufeffThe', 'Project', 'Gutenberg', 'EBook', 'of', 'The', 'Prophet,', 'by', 'Kahlil', 'Gibran\n\nThis', 'eBook', 'is', 'for', 'the', 'use', 'of', 'anyone', 'anywhere', 'in', 'the', 'United', 'States', 'and\nmost', 'other', 'parts', 'of', 'the', 'world', 'at', 'no', 'cost', 'and', 'with', 'almost', 'no', 'restrictions\nwhatsoever.', '', 'You', 'may', 'copy', 'it,', 'give', 'it', 'away', 'or', 're-use', 'it', 'under', 'the', 'terms\nof', 'the', 'Project', 'Gutenberg', 'License', 'included', 'with', 'this', 'eBook', 'or', 'online', 'at\nwww.gutenberg.org.', '', 'If', 'you', 'are', 'not', 'located', 'in', 'the', 'United', 'States,', "you'll\nhave", 'to', 'check', 'the', 'laws', 'of', 'the', 'country', 'where', 'you', 'are', 'located', 'before', 'using\nthis', 'ebook.\n\n\n\nTitle:', 'The', 'Prophet\n\nAuthor:', 'Kahlil', 'Gibran\n\nRelease', 'Date:', 'January', '1,', '2019', '[EBook', '#58585]\nLast', 'Updated:', 'January', '3,', '2018\n\n\nLanguage:', 'English\n\nCharacter', 'set', 'enc
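The cells above wrap `map()` in `list()` because, in Python 3, `map()` returns a lazy iterator that can be consumed only once. A short sketch on a made-up two-word list:

```python
words = ['the{7}', 'chosen']

# map() does no work until something iterates over it.
mapped = map(str.upper, words)

first_pass = list(mapped)   # consumes the iterator
second_pass = list(mapped)  # nothing is left to consume

print(first_pass)
print(second_pass)
```

Materializing the result with `list()` once, as the lab does, avoids accidentally reusing an exhausted iterator.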

Another thing you may have noticed is that some words contain a line break. Let's write a function to split those words. Our function will return the string split on the character `\n`. Write your function in the cell below.

In [42]:
def line_break(x):
    # Split the word on line breaks; the result is a list of one or more strings
    return x.split('\n')

# Apply line_break to the reference-free list from the previous step
removed_line_breaks = list(map(line_break, whole_cleaned_list))

print(removed_line_breaks)

[['\ufeffThe'], ['Project'], ['Gutenberg'], ['EBook'], ['of'], ['The'], ['Prophet,'], ['by'], ['Kahlil'], ['Gibran', '', 'This'], ['eBook'], ['is'], ['for'], ['the'], ['use'], ['of'], ['anyone'], ['anywhere'], ['in'], ['the'], ['United'], ['States'], ['and', 'most'], ['other'], ['parts'], ['of'], ['the'], ['world'], ['at'], ['no'], ['cost'], ['and'], ['with'], ['almost'], ['no'], ['restrictions', 'whatsoever.'], [''], ['You'], ['may'], ['copy'], ['it,'], 

Apply the `line_break` function to the `prophet_reference` list. Name the new list `prophet_line`.

In [46]:
# whole_cleaned_list holds the reference-free words (the prompt's prophet_reference)

prophet_line = list(map(line_break, whole_cleaned_list))

print(prophet_line)

[['\ufeffThe'], ['Project'], ['Gutenberg'], ['EBook'], ['of'], ['The'], ['Prophet,'], ['by'], ['Kahlil'], ['Gibran', '', 'This'], ['eBook'], ['is'], ['for'], ['the'], ['use'], ['of'], ['anyone'], ['anywhere'], ['in'], ['the'], ['United'], ['States'], ['and', 'most'], ['other'], ['parts'], ['of'], ['the'], ['world'], ['at'], ['no'], ['cost'], ['and'], ['with'], ['almost'], ['no'], ['restrictions', 'whatsoever.'], [''], ['You'], ['may'], ['copy'], ['it,'], 

If you look at the elements of `prophet_line`, you will see that the function returned lists and not strings. Our list is now a list of lists. Flatten the list using list comprehension. Assign this new list to `prophet_flat`.

In [45]:
# Flatten the list of lists into a single list of words
prophet_flat = [i for sub in prophet_line for i in sub]
prophet_flat

['\ufeffThe',
 'Project',
 'Gutenberg',
 'EBook',
 'of',
 'The',
 'Prophet,',
 'by',
 'Kahlil',
 'Gibran',
 '',
 'This',
 'eBook',
 'is',
 'for',
 'the',
 'use',
 'of',
 'anyone',
 'anywhere',
 'in',
 'the',
 'United',
 'States',
 'and',
 'most',
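The flattening comprehension only yields words when each element of the outer list is itself a list. If the elements are plain strings, the inner loop walks over their characters instead. A minimal sketch of both cases on made-up data, with `itertools.chain.from_iterable` shown as an equivalent alternative:

```python
from itertools import chain

# A list of lists, as produced by splitting each word on '\n'.
nested = [['Kahlil'], ['Gibran', '', 'This'], ['and', 'most']]

# Nested comprehension: outer loop over sublists, inner loop over their items.
flat_comprehension = [word for sub in nested for word in sub]

# Standard-library equivalent of the comprehension above.
flat_chain = list(chain.from_iterable(nested))

# The same comprehension over plain strings iterates over characters.
flat_strings = [ch for s in ['and', 'the'] for ch in s]

print(flat_comprehension)
print(flat_strings)
```

A result made of single characters is therefore a sign that the previous step returned strings rather than lists.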


# Challenge 2 - Filtering

When printing out a few words from the book, we see that there are words that we may not want to keep if we choose to analyze the corpus of text. Below is a list of words that we would like to get rid of. Create a function that returns `False` if its argument is one of the words in the list and `True` otherwise.

In [49]:
def word_filter(x):
    # Words to drop from the corpus
    word_list = ['and', 'the', 'a', 'an']

    # True keeps the word, False drops it
    return x not in word_list

Use the `filter()` function to filter out the words specified in the `word_filter()` function. Store the filtered list in the variable `prophet_filter`.

In [50]:
# Keep only the words for which word_filter returns True
prophet_filter = list(filter(word_filter, prophet))
print(prophet_filter)

['\ufeffThe', 'Project', 'Gutenberg', 'EBook', 'of', 'The', 'Prophet,', 'by', 'Kahlil', 'Gibran\n\nThis', 'eBook', 'is', 'for', 'use', 'of', 'anyone', 'anywhere', 'in', 'United', 'States', 'and\nmost', 'other', 'parts', 'of', 'world', 'at', 'no', 'cost', 'with', 'almost', 'no', 'restrictions\nwhatsoever.', '', 'You', 'may', 'copy', 'it,', 'give', 'it', 'away', 'or', 're-use', 'it', 'under', 'terms\nof', 'Project', 'Gutenberg', 'License', 'included', 'with', 'this', 'eBook', 'or', 'online', 'at\nwww.gutenberg.org.', '', 'If', 'you', 'are', 'not', 'located', 'in', 'United', 'States,', "you'll\nhave", 'to', 'check', 'laws', 'of', 'country', 'where', 'you', 'are', 'located', 'before', 'using\nthis', 'ebook.\n\n\n\nTitle:', 'The', 'Prophet\n\nAuthor:', 'Kahlil', 'Gibran\n\nRelease', 'Date:', 'January', '1,', '2019', '[EBook', '#58585]\nLast', 'Updated:', 'January', '3,', '2018\n\n\nLanguage:', 'English\n\nCharacter', 'set', 'encoding:', 'UTF-8\n\n***', 'START', 'OF', 'THIS', 'PROJECT', 'GUT
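The filter compares whole list elements, so a stop word is dropped only when the element matches it exactly. Elements such as `'and\nmost'` in the output above survive because the line break is still attached. A small sketch on made-up data (`word_filter_sketch` is an illustrative stand-in for `word_filter`):

```python
def word_filter_sketch(word):
    # Same rule as word_filter: drop exact matches only.
    return word not in ['and', 'the', 'a', 'an']

sample = ['the', 'United', 'States', 'and\nmost', 'The', 'a']

kept = list(filter(word_filter_sketch, sample))
print(kept)
```

Here `'the'` and `'a'` are removed, while `'and\nmost'` and the capitalized `'The'` are kept.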

# Bonus Challenge

Rewrite the `word_filter` function above to not be case sensitive.

In [53]:
def word_filter_case(x):
    # Words to drop from the corpus
    word_list = ['and', 'the', 'a', 'an']

    # Lowercase the word before comparing so 'The' and 'THE' are dropped too
    return x.lower() not in word_list

In [55]:
case_sens = list(filter(word_filter_case, prophet_filter))
print(case_sens)

['\ufeffThe', 'Project', 'Gutenberg', 'EBook', 'of', 'Prophet,', 'by', 'Kahlil', 'Gibran\n\nThis', 'eBook', 'is', 'for', 'use', 'of', 'anyone', 'anywhere', 'in', 'United', 'States', 'and\nmost', 'other', 'parts', 'of', 'world', 'at', 'no', 'cost', 'with', 'almost', 'no', 'restrictions\nwhatsoever.', '', 'You', 'may', 'copy', 'it,', 'give', 'it', 'away', 'or', 're-use', 'it', 'under', 'terms\nof', 'Project', 'Gutenberg', 'License', 'included', 'with', 'this', 'eBook', 'or', 'online', 'at\nwww.gutenberg.org.', '', 'If', 'you', 'are', 'not', 'located', 'in', 'United', 'States,', "you'll\nhave", 'to', 'check', 'laws', 'of', 'country', 'where', 'you', 'are', 'located', 'before', 'using\nthis', 'ebook.\n\n\n\nTitle:', 'Prophet\n\nAuthor:', 'Kahlil', 'Gibran\n\nRelease', 'Date:', 'January', '1,', '2019', '[EBook', '#58585]\nLast', 'Updated:', 'January', '3,', '2018\n\n\nLanguage:', 'English\n\nCharacter', 'set', 'encoding:', 'UTF-8\n\n***', 'START', 'OF', 'THIS', 'PROJECT', 'GUTENBERG', 'EBOO
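The first element, `'\ufeffThe'`, is still present in the output above even though the filter is case-insensitive: the leading `\ufeff` (a byte-order mark) makes the string differ from `'the'` after lowercasing. One possible way to handle it, sketched here with a hypothetical `word_filter_normalized` helper and made-up sample data:

```python
STOP_WORDS = {'and', 'the', 'a', 'an'}

def word_filter_normalized(word):
    # Remove a leading byte-order mark, then compare case-insensitively.
    normalized = word.lstrip('\ufeff').lower()
    return normalized not in STOP_WORDS

sample = ['\ufeffThe', 'Project', 'THE', 'Prophet,']

kept = list(filter(word_filter_normalized, sample))
print(kept)
```

An alternative, assuming the file really starts with a UTF-8 byte-order mark, would be to open it with `encoding="utf-8-sig"` so the mark is dropped when the file is read.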

# Challenge 3 - Reducing

#### Now that we have significantly cleaned up our text corpus, let's use the `reduce()` function to put the words back together into one long string separated by spaces. 

We will start by writing a function that takes two strings and concatenates them together with a space between the two strings.

In [None]:
def concat_space(a, b):
    '''
    Input: Two strings
    Output: A single string separated by a space
        
    Example:
    Input: 'John', 'Smith'
    Output: 'John Smith'
    '''
    
    # Join the two strings with a single space between them
    return a + ' ' + b
    

In [58]:
# Fold the case-insensitively filtered words into one string, adding a space between neighbours

long_string = reduce(lambda x, y: x + " " + y, case_sens)

print(long_string)

The Project Gutenberg EBook of Prophet, by Kahlil Gibran

This eBook is for use of anyone anywhere in United States and
most other parts of world at no cost with almost no restrictions
whatsoever.  You may copy it, give it away or re-use it under terms
of Project Gutenberg License included with this eBook or online at
www.gutenberg.org.  If you are not located in United States, you'll
have to check laws of country where you are located before using
this ebook.



Title: Prophet

Author: Kahlil Gibran

Release Date: January 1, 2019 [EBook #58585]
Last Updated: January 3, 2018


Language: English

Character set encoding: UTF-8

*** START OF THIS PROJECT GUTENBERG EBOOK PROPHET ***




Produced by David Widger from page images generously
provided by Internet Archive


Transcriber's Note: Page numbers, ie: {20}, are included in this
utf-8 text file. For those wishing to use text file unencumbered
with page numbers open or download Latin-1 file 58585-8.txt.







THE PROPHET

By Kahlil Gib
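`reduce()` folds the list from left to right: it combines the first two words, then combines that result with the third word, and so on. A minimal sketch on a three-word sample, with `' '.join()` shown as the equivalent built-in approach:

```python
from functools import reduce

def concat_space(a, b):
    # Join two strings with a single space between them.
    return a + ' ' + b

words = ['The', 'Project', 'Gutenberg']

# Step 1: 'The' + ' ' + 'Project'; step 2: 'The Project' + ' ' + 'Gutenberg'.
reduced = reduce(concat_space, words)

# str.join produces the same string in a single call.
joined = ' '.join(words)

print(reduced)
print(joined)
```

For long lists `' '.join()` is usually preferred, since repeated string concatenation creates a new string at every step; `reduce()` is used here because it is the subject of the challenge.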

Use the function above to reduce the text corpus in the list `prophet_filter` into a single string. Assign this new string to the variable `prophet_string`.

In [59]:
# Fold prophet_filter into one string and store it under the name the prompt asks for

prophet_string = reduce(lambda x, y: x + " " + y, prophet_filter)

print(prophet_string)

The Project Gutenberg EBook of The Prophet, by Kahlil Gibran

This eBook is for use of anyone anywhere in United States and
most other parts of world at no cost with almost no restrictions
whatsoever.  You may copy it, give it away or re-use it under terms
of Project Gutenberg License included with this eBook or online at
www.gutenberg.org.  If you are not located in United States, you'll
have to check laws of country where you are located before using
this ebook.



Title: The Prophet

Author: Kahlil Gibran

Release Date: January 1, 2019 [EBook #58585]
Last Updated: January 3, 2018


Language: English

Character set encoding: UTF-8

*** START OF THIS PROJECT GUTENBERG EBOOK THE PROPHET ***




Produced by David Widger from page images generously
provided by Internet Archive


Transcriber's Note: Page numbers, ie: {20}, are included in this
utf-8 text file. For those wishing to use text file unencumbered
with page numbers open or download Latin-1 file 58585-8.txt.







THE PROPHET

B