# Before your start:
- Read the README.md file
- Comment as much as you can and use the resources in the README.md file
- Happy learning!

In [128]:
# Import reduce from functools, numpy and pandas
from functools import reduce
import numpy as np
import pandas as pd
import re

# Challenge 1 - Mapping

#### We will use the map function to clean up words in a book.

In the following cell, we will read a text file containing the book The Prophet by Khalil Gibran.

In [129]:
# Run this code:

location = './data/58585-0.txt'
with open(location, 'r', encoding="utf8") as f:
    prophet = f.read().split(' ')
    
print(len(prophet))

13637


#### Let's remove the first 568 words since they contain information about the book but are not part of the book itself. 

Do this by removing from `prophet` elements 0 through 567 of the list (you can also do this by keeping elements 568 through the last element).

In [130]:
# your code here

for i in range(567):
    prophet.pop(0)

print(len(prophet))

13070


If you look through the words, you will find that many words have a reference attached to them. For example, let's look at words 1 through 10.

In [131]:
# your code here
print(prophet[0:10])

['Farewell................92\n\n\n\n\nTHE', 'PROPHET\n\n|Almustafa,', 'the{7}', 'chosen', 'and', 'the\nbeloved,', 'who', 'was', 'a', 'dawn']


#### The next step is to create a function that will remove references. 

We will do this by splitting the string on the `{` character and keeping only the part before this character. Write your function below.

In [132]:
def reference(x):
    '''
    Input: A string
    Output: The string with references removed
    
    Example:
    Input: 'the{7}'
    Output: 'the'
    '''
    
    return x.split('{')[0]
    

    # as the entry for the function is a string, we can split for the {. The function will return a list of 2 lists, with element 0 the part of the string before the split, and the element 2 the part of the string after the split. As this second one is not usefull for us, we just select element 0.

Now that we have our function, use the `map()` function to apply this function to our book, The Prophet. Return the resulting list to a new list called `prophet_reference`.

In [133]:
# We pass through the map function and our list of strings (prophet) to apply the funtion to all elements. As this function returns an object but we want a list, we convert it using list()
prophet_reference = list(map(reference, prophet))
print(prophet_reference[0:10])

['Farewell................92\n\n\n\n\nTHE', 'PROPHET\n\n|Almustafa,', 'the', 'chosen', 'and', 'the\nbeloved,', 'who', 'was', 'a', 'dawn']


Another thing you may have noticed is that some words contain a line break. Let's write a function to split those words. Our function will return the string split on the character `\n`. Write your function in the cell below.

In [134]:
def line_break(x):
    '''
    Input: A string
    Output: A list of strings split on the line break (\n) character
        
    Example:
    Input: 'the\nbeloved'
    Output: ['the', 'beloved']
    '''
    return x.split('\n')
    
    # This is the same case as above, but now we want both sides of splitted elements

Apply the `line_break` function to the `prophet_reference` list. Name the new list `prophet_line`.

In [135]:
# We do again the same operation as before, getting a list.  
prophet_line = list(map(line_break, prophet_reference))
print(prophet_line[0:10])

[['Farewell................92', '', '', '', '', 'THE'], ['PROPHET', '', '|Almustafa,'], ['the'], ['chosen'], ['and'], ['the', 'beloved,'], ['who'], ['was'], ['a'], ['dawn']]


If you look at the elements of `prophet_line`, you will see that the function returned lists and not strings. Our list is now a list of lists. Flatten the list using list comprehension. Assign this new list to `prophet_flat`.

In [136]:
# your code here
prophet_flat = [i[0] for i in prophet_line]
print(prophet_flat[0:10])

['Farewell................92', 'PROPHET', 'the', 'chosen', 'and', 'the', 'who', 'was', 'a', 'dawn']


# Challenge 2 - Filtering

When printing out a few words from the book, we see that there are words that we may not want to keep if we choose to analyze the corpus of text. Below is a list of words that we would like to get rid of. Create a function that will return false if it contains a word from the list of words specified and true otherwise.

In [137]:
def word_filter(x):
    '''
    Input: A string
    Output: True if the word is not in the specified list 
    and False if the word is in the list.
        
    Example:
    word list = ['and', 'the']
    Input: 'and'
    Output: False
    
    Input: 'John'
    Output: True
    '''
    
    word_list = ['and', 'the', 'a', 'an']
    
    # your code here
    return (lambda x: True if x in word_list else False)

Use the `filter()` function to filter out the words speficied in the `word_filter()` function. Store the filtered list in the variable `prophet_filter`.

In [138]:
prophet_filter = list(filter(word_filter, prophet_flat))
prophet_filter

['Farewell................92',
 'PROPHET',
 'the',
 'chosen',
 'and',
 'the',
 'who',
 'was',
 'a',
 'dawn',
 'unto',
 'his',
 'own',
 'had',
 'waited',
 'twelve',
 'years',
 'in',
 'the',
 'city',
 'Orphalese',
 'for',
 'his',
 'ship',
 'that',
 'was',
 'to',
 'and',
 'bear',
 'him',
 'back',
 'to',
 'the',
 'isle',
 'of',
 'birth.',
 'in',
 'the',
 'twelfth',
 'year,',
 'on',
 'the',
 'seventh',
 'of',
 'Ielool,',
 'the',
 'month',
 'of',
 'reaping,',
 'he',
 'the',
 'hill',
 'without',
 'the',
 'city',
 'walls',
 'looked',
 'seaward;',
 'and',
 'he',
 'beheld',
 'his',
 'coming',
 'with',
 'the',
 'mist.',
 'the',
 'gates',
 'of',
 'his',
 'heart',
 'were',
 'flung',
 'and',
 'his',
 'joy',
 'flew',
 'far',
 'over',
 'the',
 'sea.',
 'he',
 'closed',
 'his',
 'eyes',
 'and',
 'prayed',
 'in',
 'the',
 'of',
 'his',
 'soul.',
 'as',
 'he',
 'descended',
 'the',
 'hill,',
 'a',
 'sadness',
 'upon',
 'him,',
 'and',
 'he',
 'thought',
 'in',
 'his',
 'shall',
 'I',
 'go',
 'in',
 'peac

# Bonus Challenge

Rewrite the `word_filter` function above to not be case sensitive.

In [139]:
def word_filter_case(x):
   
    word_list = ['and', 'the', 'a', 'an']
    return (lambda x: True if lower(x) in word_list else False)
    # as the words in the word_list are lower case, instead of comparing x to be in word_list, I compare lower(x).

# Challenge 3 - Reducing

#### Now that we have significantly cleaned up our text corpus, let's use the `reduce()` function to put the words back together into one long string separated by spaces. 

We will start by writing a function that takes two strings and concatenates them together with a space between the two strings.

In [140]:
def concat_space(a, b):
    '''
    Input:Two strings
    Output: A single string separated by a space
        
    Example:
    Input: 'John', 'Smith'
    Output: 'John Smith'
    '''
    return a + ' ' + b
    # your code here

Use the function above to reduce the text corpus in the list `prophet_filter` into a single string. Assign this new string to the variable `prophet_string`.

In [141]:
prophet_string = reduce(concat_space, prophet_filter)
prophet_string

ver was also the doubter; often have I put my finger my own wound that I might have the belief in you and the greater of you. it is with this belief and this that I say, are not enclosed within your bodies, confined to houses or fields. which is you dwells above the and roves with the wind. is not a thing that crawls into sun for warmth or digs holes into for safety, a thing free, a spirit that envelops earth and moves in the ether. these be vague words, then seek not clear them. and nebulous is the beginning of things, but not their end, I fain would have you remember me as beginning. and all that lives, is conceived the mist and not in the crystal. who knows but a crystal is mist decay? would I have you remember in me: which seems most feeble and in you is the strongest and determined. it not your breath that has erected hardened the structure of your is it not a dream which none of you having dreamt, that builded city and fashioned all there is in you but see the tides of that you w

# Challenge 4 - Applying Functions to DataFrames

#### Our next step is to use the apply function to a dataframe and transform all cells.

To do this, we will connect to Ironhack's database and retrieve the data from the *pollution* database. Select the *beijing_pollution* table and retrieve its data.

In [142]:
# your code here
import sqlalchemy
import pymysql
from sqlalchemy import create_engine

In [143]:
driver = 'mysql+pymysql'
ip = '34.65.10.136'
username = 'data-students'
password = 'iR0nH@cK-D4T4B4S3'
db = 'pollution'
connection_string  = f'{driver}://{username}:{password}@{ip}/{db}'

In [144]:
engine = create_engine(connection_string)
query = 'SELECT * FROM beijing_pollution'
beijing_pollution = pd.read_sql(query,engine)
beijing_pollution

Unnamed: 0,No,year,month,day,hour,pm2.5,DEWP,TEMP,PRES,cbwd,Iws,Is,Ir
0,1,2010,1,1,0,,-21,-11.0,1021.0,NW,1.79,0,0
1,2,2010,1,1,1,,-21,-12.0,1020.0,NW,4.92,0,0
2,3,2010,1,1,2,,-21,-11.0,1019.0,NW,6.71,0,0
3,4,2010,1,1,3,,-21,-14.0,1019.0,NW,9.84,0,0
4,5,2010,1,1,4,,-20,-12.0,1018.0,NW,12.97,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
43819,43820,2014,12,31,19,8.0,-23,-2.0,1034.0,NW,231.97,0,0
43820,43821,2014,12,31,20,10.0,-22,-3.0,1034.0,NW,237.78,0,0
43821,43822,2014,12,31,21,10.0,-22,-3.0,1034.0,NW,242.70,0,0
43822,43823,2014,12,31,22,8.0,-22,-4.0,1034.0,NW,246.72,0,0


Let's look at the data using the `head()` function.

In [145]:
# your code here
beijing_pollution.head()

Unnamed: 0,No,year,month,day,hour,pm2.5,DEWP,TEMP,PRES,cbwd,Iws,Is,Ir
0,1,2010,1,1,0,,-21,-11.0,1021.0,NW,1.79,0,0
1,2,2010,1,1,1,,-21,-12.0,1020.0,NW,4.92,0,0
2,3,2010,1,1,2,,-21,-11.0,1019.0,NW,6.71,0,0
3,4,2010,1,1,3,,-21,-14.0,1019.0,NW,9.84,0,0
4,5,2010,1,1,4,,-20,-12.0,1018.0,NW,12.97,0,0


The next step is to create a function that divides a cell by 24 to produce an hourly figure. Write the function below.

In [146]:
def hourly(x):
    '''
    Input: A numerical value
    Output: The value divided by 24
        
    Example:
    Input: 48
    Output: 2.0
    '''
    return x/24
    # your code here

Apply this function to the columns `Iws`, `Is`, and `Ir`. Store this new dataframe in the variable `pm25_hourly`.

In [147]:
# your code here
pm25_hourly = beijing_pollution[['Iws', 'Is', 'Ir']].apply(hourly)

In [148]:
pm25_hourly.head()

Unnamed: 0,Iws,Is,Ir
0,0.074583,0.0,0.0
1,0.205,0.0,0.0
2,0.279583,0.0,0.0
3,0.41,0.0,0.0
4,0.540417,0.0,0.0


#### Our last challenge will be to create an aggregate function and apply it to a select group of columns in our dataframe.

Write a function that returns the standard deviation of a column divided by the length of a column minus 1. Since we are using pandas, do not use the `len()` function. One alternative is to use `count()`. Also, use the numpy version of standard deviation.

In [153]:
def sample_sd(x):
    '''
    Input: A Pandas series of values
    Output: the standard deviation divided by the number of elements in the series
        
    Example:
    Input: pd.Series([1,2,3,4])
    Output: 0.3726779962
    '''
    return np.std(x) / (x.count()-1)
    
    # your code here

In [155]:
sample_sd(pm25_hourly)


Iws    4.754929e-05
Is     7.229519e-07
Ir     1.346183e-06
dtype: float64