# Before you start:
- Read the README.md file
- Comment as much as you can and use the resources in the README.md file
- Happy learning!

In [1]:
# Import reduce from functools, numpy and pandas
from functools import reduce
import numpy
import pandas

# Challenge 1 - Mapping

#### We will use the map function to clean up words in a book.

In the following cell, we will read a text file containing the book The Prophet by Khalil Gibran.

In [2]:
# Run this code:

location = '../data/58585-0.txt'
with open(location, 'r', encoding="utf8") as f:
    prophet = f.read().split(' ')

In [9]:
len(prophet)

12457

#### Let's remove the first 568 words since they contain information about the book but are not part of the book itself. 

Do this by removing from `prophet` elements 0 through 567 of the list (you can also do this by keeping elements 568 through the last element).
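Both options described above can be sketched on a toy list. The names `words` and `n_skip` are illustrative stand-ins and do not appear elsewhere in this notebook.

```python
# Toy stand-in for the book; `words` and `n_skip` are hypothetical names.
words = ['header', 'licence', 'title', 'Almustafa', 'the', 'chosen']
n_skip = 3

# Option 1: keep elements n_skip through the end (builds a new list).
kept = words[n_skip:]

# Option 2: delete elements 0 through n_skip - 1 in place.
trimmed = list(words)   # copy first so `words` itself is untouched
del trimmed[:n_skip]

print(kept)
print(kept == trimmed)
```

Assigning the slice to a new name leaves the original list intact, so re-running the cell gives the same result each time.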

In [10]:
from copy import copy
# Accidentally ran the code twice and deleted another 568 words, so work on a new copy
copy_prophet1 = copy(prophet)

# Keep elements 568 through the last one. Slicing, unlike filter(None, ...),
# cannot drop empty strings by accident.
prophet = copy_prophet1[568:]

# Print the result to verify
print(f"Number of words after removal: {len(prophet)}")

Number of words after removal: 11889


If you look through the words, you will find that many words have a reference attached to them. For example, let's look at words 1 through 10.

In [11]:
# your code here
print(prophet[0:10])

['sheaves', 'of', 'corn', 'he', 'gathers', 'you', 'unto\nhimself.\n\nHe', 'threshes', 'you', 'to']


#### The next step is to create a function that will remove references. 

We will do this by splitting the string on the `{` character and keeping only the part before this character. Write your function below.

In [13]:
def reference(x):
    '''
    Input: A string
    Output: The string with references removed
    
    Example:
    Input: 'the{7}'
    Output: 'the'
    '''
    parts = x.split('{')
    
    # your code here
    return parts[0] 

# Testing function
print(reference('the{7}'))

the
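As a quick check of how the split behaves on other inputs, the sketch below runs it on a few made-up words, including ones with no `{` at all. `strip_reference` and `samples` are hypothetical names. `str.partition` is shown only as an alternative that splits at the first `{`.

```python
def strip_reference(word):
    # str.partition splits at the first '{' only and always returns a 3-tuple
    head, _sep, _tail = word.partition('{')
    return head

samples = ['the{7}', 'beloved', 'love{12}.', '']

via_partition = [strip_reference(w) for w in samples]
via_split = [w.split('{')[0] for w in samples]

print(via_partition)
print(via_split)
```

Both approaches leave words without a reference unchanged. Note that anything after the `{`, such as the trailing period in `'love{12}.'`, is discarded along with the reference.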



Now that we have our function, use the `map()` function to apply it to every word in our book, The Prophet. Assign the resulting list to a new variable called `reference_prophet`.

In [14]:
reference_prophet = list(map(reference, prophet))
# Print the result to verify
print(prophet[0:10])
print(reference_prophet[0:10])

['sheaves', 'of', 'corn', 'he', 'gathers', 'you', 'unto\nhimself.\n\nHe', 'threshes', 'you', 'to']
['sheaves', 'of', 'corn', 'he', 'gathers', 'you', 'unto\nhimself.\n\nHe', 'threshes', 'you', 'to']
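The `list(...)` call around `map()` matters because in Python 3 `map()` returns a lazy iterator that can be consumed only once. A minimal sketch, using a hypothetical helper `shout` and a toy list:

```python
def shout(word):
    # Hypothetical helper used only for this illustration
    return word.upper()

words = ['sheaves', 'of', 'corn']
mapped = map(shout, words)      # nothing has been computed yet

first_pass = list(mapped)       # consumes the iterator
second_pass = list(mapped)      # the iterator is now exhausted

print(first_pass)
print(second_pass)
```

Storing the result as a list, as done above, lets later cells index and reuse it freely.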


Another thing you may have noticed is that some words contain a line break. Let's write a function to split those words. Our function will return the string split on the character `\n`. Write your function in the cell below.

In [15]:
def line_break(x):
    '''
    Input: A string
    Output: A list of strings split on the line break (\n) character
        
    Example:
    Input: 'the\nbeloved'
    Output: ['the', 'beloved']
    '''
    # your code here
    return x.split('\n')
# Test the function
print(line_break('the\nbeloved'))

['the', 'beloved']
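For comparison, the sketch below runs `split('\n')` and the related built-in `str.splitlines()` on a few made-up strings. The two agree on the kind of word seen in the book but differ on trailing line breaks and on Windows-style `\r\n` endings. The names `samples`, `by_split` and `by_splitlines` are illustrative.

```python
samples = ['unto\nhimself.\n\nHe', 'the\n', 'a\r\nb']

by_split = [s.split('\n') for s in samples]
by_splitlines = [s.splitlines() for s in samples]

for s, a, b in zip(samples, by_split, by_splitlines):
    print(repr(s), a, b)
```

Either way, consecutive line breaks produce empty strings in the result, which is why `''` entries show up in the lists that follow.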


Apply the `line_break` function to the `reference_prophet` list. Name the new list `prophet_line`.

In [17]:
prophet_line = list(map(line_break, reference_prophet))
print(prophet_line[0:10])

[['sheaves'], ['of'], ['corn'], ['he'], ['gathers'], ['you'], ['unto', 'himself.', '', 'He'], ['threshes'], ['you'], ['to']]


If you look at the elements of `prophet_line`, you will see that the function returned lists and not strings. Our list is now a list of lists. Flatten the list using list comprehension. Assign this new list to `prophet_flat`.

In [19]:
prophet_flat = [i for sub in prophet_line for i in sub]
print(prophet_flat[0:20])

['sheaves', 'of', 'corn', 'he', 'gathers', 'you', 'unto', 'himself.', '', 'He', 'threshes', 'you', 'to', 'make', 'you', 'naked.', '', 'He', 'sifts', 'you']
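An alternative to the nested list comprehension is `itertools.chain.from_iterable` from the standard library. The sketch below uses a small made-up `nested` list and also shows one possible way to drop the empty strings left behind by consecutive line breaks. That last step is an optional extra, not something the exercise asks for.

```python
from itertools import chain

nested = [['sheaves'], ['unto', 'himself.', '', 'He'], ['threshes']]

# Flatten one level of nesting
flat = list(chain.from_iterable(nested))

# Optional: remove the empty strings produced by blank lines
non_empty = [w for w in flat if w != '']

print(flat)
print(non_empty)
```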


# Challenge 2 - Filtering

When printing out a few words from the book, we see that there are words we may not want to keep if we choose to analyze the corpus of text. Below is a list of words that we would like to get rid of. Create a function that returns `False` if its input is one of the words in that list and `True` otherwise.

In [None]:
def word_filter(x):
    '''
    Input: A string
    Output: True if the word is not in the specified list 
    and False if the word is in the list.
        
    Example:
    word list = ['and', 'the']
    Input: 'and'
    Output: False
    
    Input: 'John'
    Output: True
    '''
    
    word_list = ['and', 'the', 'a', 'an']
    if x in word_list:
        return False
    else:
        return True


Use the `filter()` function to filter out the words specified in the `word_filter()` function. Store the filtered list in the variable `prophet_filter`.

In [22]:
def word_filter(x):
    '''
    Input: A string
    Output: True if the word is not in the specified list 
    and False if the word is in the list.
        
    Example:
    word list = ['and', 'the']
    Input: 'and'
    Output: False
    
    Input: 'John'
    Output: True
    '''
    
    word_list = ['and', 'the', 'a', 'an']
    if x in word_list:
        return False
    else:
        return True

# Apply the built-in filter() with word_filter as the predicate
prophet_filter = list(filter(word_filter, prophet_flat))
print(prophet_filter[0:20])

['sheaves', 'of', 'corn', 'he', 'gathers', 'you', 'unto', 'himself.', '', 'He', 'threshes', 'you', 'to', 'make', 'you', 'naked.', '', 'He', 'sifts', 'you']
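The first 20 words contain none of the unwanted words, so the effect of filtering is easier to see on a small made-up sample. In this sketch `STOP_WORDS`, `keep_word` and `sample` are hypothetical names, and the set holds the same four words as `word_list`. It also contrasts a predicate with `filter(None, ...)`, which keeps only truthy items.

```python
STOP_WORDS = {'and', 'the', 'a', 'an'}   # hypothetical name; same words as word_list

def keep_word(word):
    # True means "keep this word"
    return word not in STOP_WORDS

sample = ['And', 'then', 'he', 'assigns', 'you', 'to', 'the', 'fire', '', 'a']

kept = list(filter(keep_word, sample))   # drops 'the' and 'a'
truthy = list(filter(None, sample))      # drops only the empty string

print(kept)
print(truthy)
```

The capitalised `'And'` survives because the comparison is case sensitive, which is what the bonus challenge addresses.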


# Bonus Challenge

Rewrite the `word_filter` function above to not be case sensitive.

In [24]:
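import builtins

def word_filter_case(x):
    # Compare the lowercased word so that 'And' and 'THE' are filtered out too
    word_list = ['and', 'the', 'a', 'an']
    return x.lower() not in word_list

# builtins.filter is the built-in even if the name `filter` was rebound earlier in the session
prophet_filter_case = list(builtins.filter(word_filter_case, prophet_flat))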
print(prophet_filter_case[0:20])

['sheaves', 'of', 'corn', 'he', 'gathers', 'you', 'unto', 'himself.', '', 'He', 'threshes', 'you', 'to', 'make', 'you', 'naked.', '', 'He', 'sifts', 'you']
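To see the difference that case-insensitive matching makes, the sketch below compares both variants on a made-up `sample` list (not taken from the book):

```python
word_list = ['and', 'the', 'a', 'an']
sample = ['And', 'then', 'The', 'sacred', 'the', 'AN', 'fire']

# Case sensitive: only the exact lowercase 'the' is removed
case_sensitive = [w for w in sample if w not in word_list]

# Case insensitive: 'And', 'The', 'the' and 'AN' are all removed
case_insensitive = [w for w in sample if w.lower() not in word_list]

print(case_sensitive)
print(case_insensitive)
```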


# Challenge 3 - Reducing

#### Now that we have significantly cleaned up our text corpus, let's use the `reduce()` function to put the words back together into one long string separated by spaces. 

We will start by writing a function that takes two strings and concatenates them together with a space between the two strings.

In [25]:
def concat_space(a, b):
    '''
    Input: Two strings
    Output: A single string separated by a space
        
    Example:
    Input: 'John', 'Smith'
    Output: 'John Smith'
    '''
    return a + " " + b

Use the function above to reduce the text corpus in the list `prophet_filter` into a single string. Assign this new string to the variable `prophet_string`.

In [28]:
prophet_string = reduce(concat_space, prophet_filter_case)
print(prophet_string[0:40])

sheaves of corn he gathers you unto hims
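For reference, reducing with a space-concatenating function gives the same string as the built-in `' '.join(...)`. The sketch below checks this on a toy list. `words`, `reduced` and `joined` are illustrative names, and `concat_space` is redefined so the block runs on its own.

```python
from functools import reduce

def concat_space(a, b):
    return a + ' ' + b

words = ['sheaves', 'of', 'corn', '', 'He']

reduced = reduce(concat_space, words)
joined = ' '.join(words)

print(repr(reduced))
print(reduced == joined)
```

The empty string in the list shows up as a double space in both results. One difference worth knowing: `reduce()` without an initial value raises a `TypeError` on an empty list, whereas `' '.join([])` returns an empty string.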
