# Before you start:
- Read the README.md file
- Comment as much as you can and use the resources in the README.md file
- Happy learning!

In [1]:
# Import reduce from functools, numpy and pandas
from functools import reduce
import numpy as np
import pandas as pd

# Challenge 1 - Mapping

#### We will use the map function to clean up words in a book.

In the following cell, we will read a text file containing the book The Prophet by Khalil Gibran.

In [122]:
# Run this code:
location = '../data/58585-0.txt'
with open(location, 'r', encoding="utf8") as f:
    prophet = f.read().split(' ')

#### Let's remove the first 568 words since they contain information about the book but are not part of the book itself. 

Do this by removing from `prophet` elements 0 through 567 of the list (you can also do this by keeping elements 568 through the last element).

In [123]:
prophet_cleansing=list(map(str,prophet[568:]))

print(f"\n\033[1m{len(prophet)-len(prophet_cleansing)}\033[0m words erased from prophet")


568 words erased from prophet



If you look through the words, you will find that many of them have a reference attached. For example, let's look at the first nine words.

In [124]:
print(prophet_cleansing[0:9])

['PROPHET\n\n|Almustafa,', 'the{7}', 'chosen', 'and', 'the\nbeloved,', 'who', 'was', 'a', 'dawn']


#### The next step is to create a function that will remove references. 

We will do this by splitting the string on the `{` character and keeping only the part before this character. Write your function below.

In [125]:
def reference(x):
    '''
    Input: A string
    Output: The string with references removed
    
    Example:
    Input: 'the{7}'
    Output: 'the'
    '''
    # Split on '{' and keep only the text before it; a word without a
    # reference is returned unchanged.
    return x.split("{")[0]

Now that we have our function, use the `map()` function to apply this function to our book, The Prophet. Return the resulting list to a new list called `prophet_reference`.

In [6]:
prophet_reference=list(map(reference,prophet_cleansing))

print(f"\n\033[1mReferences removed\033[0m: for example ➜ '{prophet_cleansing[1]}' is now '{prophet_reference[1]}'")


References removed: for example ➜ 'the{7}' is now 'the'
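
As a rough illustration of what `map()` does here, the sketch below applies a reference-stripping helper to a short, made-up word list. The names `sample_words` and `strip_reference` are hypothetical and not part of the exercise. It also shows that `map()` returns a lazy iterator, which is why the cell above wraps it in `list()`.

```python
# Hypothetical sample data standing in for a few words of the book.
sample_words = ["the{7}", "chosen", "day{12}", "dawn"]


def strip_reference(word):
    """Keep only the text before the first '{' (illustrative helper)."""
    return word.split("{")[0]


# map() builds an iterator; no word is processed until it is consumed.
lazy_result = map(strip_reference, sample_words)
cleaned = list(lazy_result)

# An equivalent list comprehension gives the same result.
via_comprehension = [strip_reference(word) for word in sample_words]

print(cleaned)            # ['the', 'chosen', 'day', 'dawn']
print(list(lazy_result))  # [] because the iterator is already exhausted
```

Whether to use `map()` or a comprehension is mostly a matter of taste; the exercise asks for `map()`, so that is what the notebook uses.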


Another thing you may have noticed is that some words contain a line break. Let's write a function to split those words. Our function will return the string split on the character `\n`. Write your function in the cell below.

In [7]:
def line_break(x):
    '''
    Input: A string
    Output: A list of strings split on the line break (\n) character
        
    Example:
    Input: 'the\nbeloved'
    Output: ['the', 'beloved']
    '''
    x_lbreak = x.split("\n")
    
    return x_lbreak

Apply the `line_break` function to the `prophet_reference` list. Name the new list `prophet_line`.

In [8]:
prophet_line=list(map(line_break,prophet_reference))

print(f"\n\033[1mLine breaks split\033[0m: for example ➜ '{prophet_reference[4]}' is now '{prophet_line[4]}'")


Line breaks split: for example ➜ 'the
beloved,' is now '['the', 'beloved,']'


If you look at the elements of `prophet_line`, you will see that the function returned lists rather than strings, so `prophet_line` is now a list of lists. Flatten it using a list comprehension and assign the result to `prophet_flat`.

In [21]:
prophet_flat = [word for sublista in prophet_line for word in sublista]

print(f"\n\033[1mSublists removed\033[0m: for example ➜ '{prophet_line[0]}' is now '{prophet_flat[0]}','{prophet_flat[1]}','{prophet_flat[2]}'")


Sublists removed: for example ➜ '['PROPHET', '', '|Almustafa,']' is now 'PROPHET','','|Almustafa,'
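
For comparison, here is a minimal sketch of the same flattening done with `itertools.chain.from_iterable`, using a small hypothetical nested list in place of `prophet_line`. The last step, dropping empty strings, is not part of the exercise; it is shown only because the output above contains an empty `''` element.

```python
from itertools import chain

# Hypothetical stand-in for the first few elements of prophet_line.
nested = [["PROPHET", "", "|Almustafa,"], ["the"], ["the", "beloved,"]]

# Nested list comprehension, as used in the notebook.
flat_comprehension = [word for sublist in nested for word in sublist]

# chain.from_iterable walks each sublist in order and yields its items.
flat_chain = list(chain.from_iterable(nested))

# Optional extra step (an assumption, not required by the exercise):
# discard the empty strings produced by consecutive line breaks.
non_empty = [word for word in flat_chain if word]

print(flat_chain == flat_comprehension)  # True
print(non_empty)  # ['PROPHET', '|Almustafa,', 'the', 'the', 'beloved,']
```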


# Challenge 2 - Filtering

When printing out a few words from the book, we see that there are words we may not want to keep if we choose to analyze the corpus of text. Below is a list of words that we would like to get rid of. Create a function that returns `False` if its input is one of the words in that list and `True` otherwise.

In [10]:
def word_filter(x):
    '''
    Input: A string
    Output: True if the word is not in the specified list 
    and False if the word is in the list.
        
    Example:
    word list = ['and', 'the']
    Input: 'and'
    Output: False
    
    Input: 'John'
    Output: True
    '''
    
    word_list = ['and', 'the', 'a', 'an']

    # The membership test already yields the boolean we need
    return x not in word_list

Use the `filter()` function to filter out the words specified in the `word_filter()` function. Store the filtered list in the variable `prophet_filter`.

In [20]:
prophet_filter = list(filter(word_filter, prophet_flat))

print(f"\n\033[1mWords removed\033[0m: for example ➜ 4th word is now '{prophet_filter[3]}' instead of '{prophet_flat[3]}'")


Words removed: for example ➜ 4th word is now 'chosen' instead of 'the'
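
The sketch below shows, on a small made-up list, which items `filter()` keeps: only those for which the predicate returns `True`. The names `STOP_WORDS`, `keep_word`, and `sample` are hypothetical. The example also makes two limits of an exact-match filter visible: a word with attached punctuation (`'the,'`) and a capitalized word (`'And'`) both pass through.

```python
# Same four words as in word_filter; a frozenset is used here only
# because membership tests on sets do not scan every element.
STOP_WORDS = frozenset({"and", "the", "a", "an"})


def keep_word(word):
    """Return True for words that should stay in the corpus."""
    return word not in STOP_WORDS


# Hypothetical sample, including punctuation and capitalization cases.
sample = ["the", "chosen", "and", "the,", "beloved", "And"]

kept = list(filter(keep_word, sample))
print(kept)  # ['chosen', 'the,', 'beloved', 'And']
```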


# Bonus Challenge

Rewrite the `word_filter` function above so that it is case-insensitive.

In [11]:
def word_filter_case(x):
    '''
    Input: A string
    Output: True if the word is not in the specified list 
    and False if the word is in the list, ignoring case.
        
    Example:
    word list = ['and', 'the']
    Input: 'and'
    Output: False
    
    Input: 'John'
    Output: True
    '''
    
    word_list = ['and', 'the', 'a', 'an']

    # Lowercase the input first so that 'And' and 'and' are treated alike
    return x.lower() not in word_list

In [215]:

print(f"\n\033[1mword_filter\033[0m function's result for 'And' is \033[92m{word_filter('And')}\033[0m, while \033[1mword_filter_case\033[0m is \033[91m{word_filter_case('And')}\033[0m")


word_filter function's result for 'And' is True, while word_filter_case is False



# Challenge 3 - Reducing

#### Now that we have significantly cleaned up our text corpus, let's use the `reduce()` function to put the words back together into one long string separated by spaces. 

We will start by writing a function that takes two strings and concatenates them together with a space between the two strings.

In [39]:
def concat_space(a, b):
    '''
    Input: Two strings
    Output: The two strings joined by a single space
        
    Example:
    Input: 'John', 'Smith'
    Output: 'John Smith'
    '''
    concat_elements= [a,b]
    result = " ".join(concat_elements)
    
    return result

x=concat_space("John", "Smith")

print(f"\n\033[1mconcat_space\033[0m: turns a='John' and b='Smith' into '{x}' type ➜ {type(x)}")


concat_space: turns a='John' and b='Smith' into 'John Smith' type ➜ <class 'str'>


Use the function above to reduce the text corpus in the list `prophet_filter` into a single string. Assign this new string to the variable `prophet_string`.

In [49]:
prophet_string = reduce(concat_space, prophet_filter)

print(f"\n\033[1mString converted\033[0m: for example ➜ 1st element is now the letter '{prophet_string[0]}' instead of the word '{prophet_filter[0]}'")


String converted: for example ➜ 1st element is now the letter 'P' instead of the word 'PROPHET'
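
To make the reduction step concrete, the sketch below folds a short hypothetical word list with a two-argument function and compares the result with `" ".join(...)`, which produces the same string in a single call. The helper `join_with_space` is an illustrative stand-in for `concat_space`. It also shows that `reduce()` raises a `TypeError` on an empty list when no initial value is supplied.

```python
from functools import reduce


def join_with_space(a, b):
    """Concatenate two strings with one space between them."""
    return a + " " + b


# Hypothetical stand-in for the first few elements of prophet_filter.
words = ["PROPHET", "chosen", "beloved"]

# reduce evaluates join_with_space(join_with_space("PROPHET", "chosen"), "beloved").
via_reduce = reduce(join_with_space, words)
via_join = " ".join(words)

print(via_reduce)              # PROPHET chosen beloved
print(via_reduce == via_join)  # True

# With no initial value, reduce cannot handle an empty sequence.
try:
    reduce(join_with_space, [])
    empty_raises = False
except TypeError:
    empty_raises = True

print(empty_raises)  # True
```

Outside of an exercise on `reduce()`, `str.join` is usually the more idiomatic choice for building one string from many.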


# Challenge 4 - Applying Functions to DataFrames

#### Our next step is to apply a function to a DataFrame and transform all of its cells.

To do this, we will load the Beijing PM2.5 pollution data set as a CSV file from the UCI Machine Learning Repository.

In [78]:
df = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/00381/PRSA_data_2010.1.1-2014.12.31.csv")

In [79]:

print(f"Database contains \033[1m{len(df)}\033[0m rows and data type is {type(df)}")

Database contains 43824 rows and data type is <class 'pandas.core.frame.DataFrame'>


Let's look at the data using the `head()` function.

In [121]:
df.head(5)

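|   | No | year | month | day | hour | pm2.5 | DEWP | TEMP | PRES | cbwd | Iws | Is | Ir |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2010 | 1 | 1 | 0 | NaN | -21 | -11.0 | 1021.0 | NW | 1.79 | 0 | 0 |
| 1 | 2 | 2010 | 1 | 1 | 1 | NaN | -21 | -12.0 | 1020.0 | NW | 4.92 | 0 | 0 |
| 2 | 3 | 2010 | 1 | 1 | 2 | NaN | -21 | -11.0 | 1019.0 | NW | 6.71 | 0 | 0 |
| 3 | 4 | 2010 | 1 | 1 | 3 | NaN | -21 | -14.0 | 1019.0 | NW | 9.84 | 0 | 0 |
| 4 | 5 | 2010 | 1 | 1 | 4 | NaN | -20 | -12.0 | 1018.0 | NW | 12.97 | 0 | 0 |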


The next step is to create a function that divides a cell by 24 to produce an hourly figure. Write the function below.

In [82]:
def hourly(x):
    '''
    Input: A numerical value
    Output: The value divided by 24
        
    Example:
    Input: 48
    Output: 2.0
    '''
    
    resultado= x/24
    return resultado

hourly(48)

2.0

Apply this function to the columns `Iws`, `Is`, and `Ir`. Store this new dataframe in the variable `pm25_hourly`.

In [100]:
pm25_hourly=df[['Iws','Is','Ir']].apply(hourly)
pm25_hourly.head(5)

|   | Iws | Is | Ir |
|---|---|---|---|
| 0 | 0.074583 | 0.0 | 0.0 |
| 1 | 0.205 | 0.0 | 0.0 |
| 2 | 0.279583 | 0.0 | 0.0 |
| 3 | 0.41 | 0.0 | 0.0 |
| 4 | 0.540417 | 0.0 | 0.0 |
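
The following sketch, built on a tiny hypothetical DataFrame rather than the real data, illustrates what `DataFrame.apply` passes to the function: with the default `axis=0`, each column arrives as a whole `Series`, and the division is then carried out element by element. The names `sample`, `hourly_logged`, and `seen_types` are made up for this illustration.

```python
import pandas as pd

# Hypothetical miniature of the three columns used above.
sample = pd.DataFrame({"Iws": [1.79, 4.92], "Is": [0, 0], "Ir": [0, 24]})

seen_types = []


def hourly_logged(column):
    """Divide by 24 and record the type of object that apply() passed in."""
    seen_types.append(type(column).__name__)
    return column / 24


via_apply = sample.apply(hourly_logged)

# Dividing the whole DataFrame gives the same values without apply().
via_division = sample / 24

print(set(seen_types))                 # {'Series'}
print(via_apply.equals(via_division))  # True
```

For simple arithmetic such as this, operating on the DataFrame directly is an equally valid option; `apply` becomes more useful when the function needs the column as a whole.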


#### Our last challenge will be to create an aggregate function and apply it to a select group of columns in our dataframe.

Write a function that returns the standard deviation of a column divided by the length of a column minus 1. Since we are using pandas, do not use the `len()` function. One alternative is to use `count()`. Also, use the numpy version of standard deviation.

In [119]:
def sample_sd(x):
    '''
    Input: A Pandas series of values
    Output: the standard deviation divided by the number of elements in the series minus 1
        
    Example:
    Input: pd.Series([1,2,3,4])
    Output: 0.3726779962
    '''
    result=np.std(x)/((x.count())-1)
    
    return result 

sample_sd(pd.Series([1,2,3,4]))

0.37267799624996495
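
The value above can be cross-checked with the standard library. `np.std` defaults to the population standard deviation (`ddof=0`), whereas `pandas.Series.std` defaults to the sample standard deviation (`ddof=1`), so the two would give different answers here. The sketch below reproduces both variants with the `statistics` module; the variable names are illustrative only.

```python
import math
import statistics

values = [1, 2, 3, 4]

# Population standard deviation: divides the squared deviations by n,
# matching np.std with its default ddof=0.
population_sd = statistics.pstdev(values)

# Sample standard deviation: divides by n - 1, matching Series.std().
sample_sd_value = statistics.stdev(values)

# Same formula as in the function above: population SD over (n - 1).
result = population_sd / (len(values) - 1)

print(math.isclose(result, 0.37267799624996495))  # True
print(sample_sd_value > population_sd)            # True
```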