In [None]:
!git clone https://github.com/Ingtec/lab-map-filter-reduce-en.git

In [9]:
%cd /content/lab-map-filter-reduce-en


/content/lab-map-filter-reduce-en


In [10]:
import os
print(os.getcwd())

/content/lab-map-filter-reduce-en


In [8]:
import sys
sys.path.append('/content/lab-map-filter-reduce-en')

# Before you start:
- Read the README.md file
- Comment as much as you can and use the resources in the README.md file
- Happy learning!

In [62]:
# Import reduce from functools, numpy and pandas
from functools import reduce
import numpy
import pandas

# Challenge 1 - Mapping

#### We will use the map function to clean up words in a book.

In the following cell, we will read a text file containing the book The Prophet by Khalil Gibran.

In [34]:
location = 'data/58585-0.txt'
with open(location, 'r', encoding="utf8") as f:
    prophet = f.read().split(' ')


In [35]:
len(prophet)

13637

#### Let's remove the first 568 words since they contain information about the book but are not part of the book itself.

Do this by removing from `prophet` elements 0 through 567 of the list (you can also do this by keeping elements 568 through the last element).

In [36]:
# Removing the first 568 words from the list
# These words contain book metadata, not part of the actual book
prophet = prophet[568:]

In [37]:
print(len(prophet))

13069
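
The slice above keeps everything from index 568 onward, which is why the length drops from 13637 to 13069 (13637 - 568). A minimal sketch of the same idea on a toy list; `words` and `n_skip` are hypothetical stand-ins, not values from the book file:

```python
# Hypothetical stand-in for the word list; the real one is read from the book file
words = ['header', 'metadata', 'PROPHET', 'the', 'chosen']
n_skip = 2  # assumed number of leading metadata tokens in this toy list

# Keep elements n_skip through the last element
body = words[n_skip:]

# The new length is always the old length minus the number of skipped items
print(len(words), len(body))  # 5 3
print(body)                   # ['PROPHET', 'the', 'chosen']
```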


If you look through the words, you will find that many of them have a reference attached, such as `{7}`. For example, let's look at the first 10 words.

In [39]:
# Checking the first 10 words of the book content
# This helps identify patterns like references or unwanted characters
print(prophet[:10])

['PROPHET\n\n|Almustafa,', 'the{7}', 'chosen', 'and', 'the\nbeloved,', 'who', 'was', 'a', 'dawn', 'unto']


#### The next step is to create a function that will remove references.

We will do this by splitting the string on the `{` character and keeping only the part before this character. Write your function below.

In [40]:
def reference(x):
    '''
    Input: A string
    Output: The string with references removed

    Example:
    Input: 'the{7}'
    Output: 'the'
    '''

    # Split the string at the '{' character and keep the first part
    return x.split('{')[0]
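
Before mapping the function over the whole book, it can help to check how it behaves on a few edge cases. The sketch below repeats the function so it runs on its own; the `samples` list is made up for illustration:

```python
def reference(x):
    # Split at '{' and keep only the text before it
    return x.split('{')[0]

# Hypothetical sample tokens: with a reference, without one,
# a bare reference, and an empty string
samples = ['the{7}', 'chosen', '{12}', '']
cleaned = [reference(s) for s in samples]

# Words without '{' pass through unchanged; a bare reference becomes ''
print(cleaned)  # ['the', 'chosen', '', '']
```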

Now that we have our function, use the `map()` function to apply this function to our book, The Prophet. Return the resulting list to a new list called `prophet_reference`.

In [41]:
# Applying the reference function using map
prophet_reference = list(map(reference, prophet))
print(prophet_reference[:10])  # Checking the first 10 cleaned words

['PROPHET\n\n|Almustafa,', 'the', 'chosen', 'and', 'the\nbeloved,', 'who', 'was', 'a', 'dawn', 'unto']
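
The call to `list()` above matters because `map()` in Python 3 returns a lazy iterator rather than a list. A small sketch on a made-up `words` list shows that the iterator can only be consumed once:

```python
def reference(x):
    # Split at '{' and keep only the text before it
    return x.split('{')[0]

# Hypothetical sample tokens, not taken from the book file
words = ['the{7}', 'chosen', 'and{8}']

mapped = map(reference, words)
print(type(mapped).__name__)  # map

first_pass = list(mapped)   # consumes the iterator
second_pass = list(mapped)  # nothing is left to consume

print(first_pass)   # ['the', 'chosen', 'and']
print(second_pass)  # []
```

Wrapping the result in `list()` right away, as the cell above does, avoids this surprise when the data is needed more than once.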


Another thing you may have noticed is that some words contain a line break. Let's write a function to split those words. Our function will return the string split on the character `\n`. Write your function in the cell below.

In [42]:
def line_break(x):
    '''
    Input: A string
    Output: A list of strings split on the line break (\n) character

    Example:
    Input: 'the\nbeloved'
    Output: ['the', 'beloved']
    '''
    # Split the string on the line break character (\n)
    return x.split('\n')

Apply the `line_break` function to the `prophet_reference` list. Name the new list `prophet_line`.

In [43]:
# Applying the line_break function to the list
prophet_line = list(map(line_break, prophet_reference))
print(prophet_line[:5])  # Checking the first 5 elements (lists of strings)

[['PROPHET', '', '|Almustafa,'], ['the'], ['chosen'], ['and'], ['the', 'beloved,']]
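
Note the empty string in the first sublist: two consecutive line breaks leave nothing between them, so `split('\n')` returns `''` for that gap. The sketch below contrasts this with `str.split()` called without arguments, which splits on any run of whitespace; this is shown only as a comparison, not as the approach the lab asks for:

```python
# First token from the output above
token = 'PROPHET\n\n|Almustafa,'

# Splitting on '\n' keeps an empty string between the two line breaks
by_newline = token.split('\n')
print(by_newline)     # ['PROPHET', '', '|Almustafa,']

# Splitting with no argument treats the whole '\n\n' run as one separator
by_whitespace = token.split()
print(by_whitespace)  # ['PROPHET', '|Almustafa,']
```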


If you look at the elements of `prophet_line`, you will see that the function returned lists and not strings. Our list is now a list of lists. Flatten the list using list comprehension. Assign this new list to `prophet_flat`.

In [None]:
prophet_flat = [i for sub in prophet_line for i in sub]
prophet_flat

In [49]:
# Flattening the list of lists into a single list using list comprehension
prophet_flat = [word for sublist in prophet_line for word in sublist]
print(prophet_flat[:10])  # Checking the first 10 words

['PROPHET', '', '|Almustafa,', 'the', 'chosen', 'and', 'the', 'beloved,', 'who', 'was']
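
For comparison, the standard library offers `itertools.chain.from_iterable`, which flattens one level of nesting just like the list comprehension. A minimal sketch using the five sublists printed earlier as sample data:

```python
from itertools import chain

# First five elements of prophet_line, copied from the output above
nested = [['PROPHET', '', '|Almustafa,'], ['the'], ['chosen'], ['and'], ['the', 'beloved,']]

# List comprehension: outer loop over sublists, inner loop over words
flat_comprehension = [word for sublist in nested for word in sublist]

# chain.from_iterable yields the items of each sublist in order
flat_chain = list(chain.from_iterable(nested))

print(flat_chain)
# ['PROPHET', '', '|Almustafa,', 'the', 'chosen', 'and', 'the', 'beloved,']
print(flat_comprehension == flat_chain)  # True
```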


# Challenge 2 - Filtering

When printing out a few words from the book, we see that there are words we may not want to keep if we choose to analyze the corpus of text. Below is a list of words that we would like to get rid of. Create a function that returns `False` if its input is one of the words in the specified list and `True` otherwise.

In [50]:
def word_filter(x):
    '''
    Input: A string
    Output: True if the word is not in the specified list
    and False if the word is in the list.

    Example:
    word list = ['and', 'the']
    Input: 'and'
    Output: False

    Input: 'John'
    Output: True
    '''

    word_list = ['and', 'the', 'a', 'an']

    return x not in word_list  # Returns True for words NOT in the list

Use the `filter()` function to filter out the words specified in the `word_filter()` function. Store the filtered list in the variable `prophet_filter`.

In [51]:
# Applying the word_filter function to remove stop words
prophet_filter = list(filter(word_filter, prophet_flat))
print(prophet_filter[:10])  # Checking the first 10 filtered words

['PROPHET', '', '|Almustafa,', 'chosen', 'beloved,', 'who', 'was', 'dawn', 'unto', 'his']
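
`filter()` keeps exactly the items for which the function returns `True`, so it is equivalent to a list comprehension with an `if` clause. The sketch below shows this on a short sample list. It also shows one possible way to drop the empty strings that remain in the output, using `filter(None, ...)`, which removes falsy items; that extra step is an optional suggestion, not part of the lab:

```python
def word_filter(x):
    word_list = ['and', 'the', 'a', 'an']
    return x not in word_list

# Sample words copied from the flattened output above
words = ['PROPHET', '', '|Almustafa,', 'the', 'chosen', 'and', 'the', 'beloved,']

kept = list(filter(word_filter, words))
kept_comprehension = [w for w in words if word_filter(w)]
print(kept)  # ['PROPHET', '', '|Almustafa,', 'chosen', 'beloved,']
print(kept == kept_comprehension)  # True

# Optional: passing None as the function drops falsy items such as ''
non_empty = list(filter(None, kept))
print(non_empty)  # ['PROPHET', '|Almustafa,', 'chosen', 'beloved,']
```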


# Bonus Challenge

Rewrite the `word_filter` function above to not be case sensitive.

In [52]:
def word_filter_case(x):

    word_list = ['and', 'the', 'a', 'an']

    return x.lower() not in word_list  # Lowercase the word before comparing; word_list is already lowercase


In [53]:
# Applying the case-insensitive word_filter function
prophet_filter_case = list(filter(word_filter_case, prophet_flat))
print(prophet_filter_case[:10])  # Checking the first 10 filtered words (case insensitive)

['PROPHET', '', '|Almustafa,', 'chosen', 'beloved,', 'who', 'was', 'dawn', 'unto', 'his']
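
The first 10 words look the same as before because none of them is a capitalized stop word. A small sketch on a made-up list makes the difference visible. The `STOP_WORDS` constant is a hypothetical variation that stores the words in a `frozenset` so the list is not rebuilt on every call:

```python
# Hypothetical module-level constant; same four words as word_list above
STOP_WORDS = frozenset({'and', 'the', 'a', 'an'})

def word_filter_case(x):
    # Lowercase the word so 'And', 'THE', and 'A' are also caught
    return x.lower() not in STOP_WORDS

# Made-up sample with mixed capitalization
words = ['And', 'in', 'THE', 'twelfth', 'year', 'an', 'A']

kept = list(filter(word_filter_case, words))
print(kept)  # ['in', 'twelfth', 'year']
```

Words with attached punctuation, such as `'the,'`, would still pass through, since the comparison is on the whole token.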


# Challenge 3 - Reducing

#### Now that we have significantly cleaned up our text corpus, let's use the `reduce()` function to put the words back together into one long string separated by spaces.

We will start by writing a function that takes two strings and concatenates them together with a space between the two strings.

In [65]:
def concat_space(a, b):
    '''
    Input: Two strings
    Output: A single string separated by a space

    Example:
    Input: 'John', 'Smith'
    Output: 'John Smith'
    '''
    return f"{a} {b}"  # Join the two strings with a space
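
To see how `reduce()` uses this function, the sketch below traces it on four sample words: the result of each call becomes the first argument of the next call. For this particular task, `' '.join(...)` produces the same string and is the more common idiom; it is shown here only for comparison:

```python
from functools import reduce

def concat_space(a, b):
    return f"{a} {b}"

# Sample words copied from the filtered output earlier in the notebook
words = ['PROPHET', '', '|Almustafa,', 'chosen']

# reduce evaluates:
# concat_space(concat_space(concat_space('PROPHET', ''), '|Almustafa,'), 'chosen')
reduced = reduce(concat_space, words)
joined = ' '.join(words)

# The empty string in the list produces a double space in the result
print(repr(reduced))      # 'PROPHET  |Almustafa, chosen'
print(reduced == joined)  # True
```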


Use the function above to reduce the text corpus in the list `prophet_filter` into a single string. Assign this new string to the variable `prophet_string`.

In [66]:
# Reduce the word list into one string; this uses prophet_filter_case,
# the case-insensitive result from the bonus challenge
prophet_string = reduce(concat_space, prophet_filter_case)

In [67]:
# Display the first 500 characters of the final concatenated string
print(prophet_string[:500])

PROPHET  |Almustafa, chosen beloved, who was dawn unto his own day, had waited twelve years in city of Orphalese for his ship that was to return bear him back to isle of his birth.  in twelfth year, on seventh day of Ielool, month of reaping, he climbed hill without city walls looked seaward; he beheld his ship coming with mist.  Then gates of his heart were flung open, his joy flew far over sea. he closed his eyes prayed in silences of his soul.  *****  But as he descended hill, sadness came up
