# Processing

In this homework, you'll implement, document, and test a series of functions that apply control structures and data structures to problems mimicking different types of data cleanup and processing. Later, we'll learn how to use more real-world library functions to complete these tasks more effectively.

For each question, you'll be asked to **implement a function**, **document it** with a docstring, and **test it** with doctests. For specific guidance, search for the "style guide" on the course website. Generally:

- To fulfill the documentation requirements, use your own words to provide a brief description of only the details that a client needs to know to call the function.
- To fulfill testing requirements, convert each provided valid function call example into a doctest and additionally write 2 more test cases of your own. You may need to change the given examples slightly to meet doctest requirements.

The `run_docstring_examples` function call at the end of each task will only print a message if test cases fail.
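As a minimal sketch of what a passing doctest looks like, the function `shout` below is hypothetical and not part of the assignment. The expected output in a doctest has to match what the interpreter would echo, and Python echoes strings with single quotes, so an expected value written as `"HI!"` would be reported as a failure even though the function is correct.

```python
import doctest


def shout(text):
    """
    Hypothetical example: returns the text in uppercase followed by "!".

    >>> shout("hi")
    'HI!'
    >>> shout("")
    '!'
    """
    return text.upper() + "!"


# Prints nothing because both examples pass.
doctest.run_docstring_examples(shout, globals())
```

This is one reason the given examples may need small changes before they work as doctests.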

In [None]:
import doctest

## Outside Sources

Update the following Markdown cell to include your name and list your outside sources. Submitted work should be consistent with the curriculum and your sources.

**Name**: Elijah Gallup

1. https://www.geeksforgeeks.org/python/python-string-isalpha-method/
2. https://realpython.com/python-min-and-max/ 

## Task: `text_normalize`

*Text normalization* is the process of removing unwanted characters from a piece of text, such as whitespace or special characters. Write and test a function `text_normalize` that takes a string and returns a new string that keeps only alphabetical characters (ignore whitespace, numbers, non-alphabet characters, etc.) and turns all alphabetical characters to lowercase.

- `text_normalize("Hello")` should return `"hello"`
- `text_normalize("Hello!")` should return `"hello"`
- `text_normalize("heLLo tHEr3!!!")` should return `"hellother"`
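To see which characters of the last example survive, the short sketch below splits it into kept and dropped characters using `str.isalpha` (the method described in the first outside source). The variable names are only for illustration.

```python
sample = "heLLo tHEr3!!!"

# Letters are kept; the space, the digit, and the punctuation are dropped.
kept = [char for char in sample if char.isalpha()]
dropped = [char for char in sample if not char.isalpha()]

print(kept)
print(dropped)
print("".join(kept).lower())
```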

In [1]:
import doctest
def text_normalize(text):
    """
    Takes a string of text and returns a lowercase version with only alphabetical characters.

    >>> text_normalize("Hello")
    'hello'
    >>> text_normalize("Hello!")
    'hello'
    >>> text_normalize("heLLo tHEr3!!!")
    'hellother'
    >>> text_normalize("67")
    ''
    >>> text_normalize("like67")
    'like'
    """
    newText = ""
    for char in text:
        # Keep letters only, converting each one to lowercase.
        if char.isalpha():
            newText += char.lower()
    return newText


doctest.run_docstring_examples(text_normalize, globals())


## Task: `average_tokens_per_line`

Write and test a function `average_tokens_per_line` that takes the name of a `.txt` file and returns the average number of tokens per line in the file. For example, if the file `song.txt` contains the text:

```
Row, row, row your boat
Gently down the stream
Merrily, merrily, merrily, merrily,
Life is but a dream!
```

The first line has 5 tokens; the second has 4; the third has 4; and the fourth has 5. This gives an average tokens per line of `4.5`.
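The arithmetic above can be checked without a file. This sketch stores the song in a string (an assumption made for illustration; the actual task reads from a `.txt` file) and treats a token as any run of characters separated by whitespace.

```python
song = """Row, row, row your boat
Gently down the stream
Merrily, merrily, merrily, merrily,
Life is but a dream!"""

# str.split() with no arguments splits on any whitespace.
tokens_per_line = [len(line.split()) for line in song.splitlines()]
print(tokens_per_line)  # [5, 4, 4, 5]

average = sum(tokens_per_line) / len(tokens_per_line)
print(average)  # 4.5
```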

To write additional test cases, create new text files. From the JupyterLab **File** menu, choose **New** and then **Text File**.

In [10]:
import doctest
def average_tokens_per_line(file):
    """
    Takes the name of a .txt file and returns the average number of tokens
    (whitespace-separated words) per line in the file.

    >>> average_tokens_per_line("song.txt")
    4.5
    >>> average_tokens_per_line("speech.txt")
    39.03125
    >>> average_tokens_per_line("simple.txt")
    4.0
    """
    with open(file) as f:
        lines = f.readlines()
    numTokens = 0
    for line in lines:
        # split() with no arguments separates tokens on any whitespace.
        numTokens += len(line.split())
    return numTokens / len(lines)

doctest.run_docstring_examples(average_tokens_per_line, globals())

## Task: `pair_up`

When creating and processing datasets, sometimes it's useful to pair up identifiers with each data element. For this task, you are given some buggy code that is intended to take a set of identifiers and a set of elements and return a set of every identifier paired with every element. Since sets are unordered, there is no inherent ordering to the tuples in the result set.

Your task is to identify and correct the bug, and then explain the bugs you encountered and what drew you to your specific fixes. For this task, you do not need to write additional test cases.
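For comparison only, the same "every identifier with every element" pairing can be sketched with `itertools.product` from the standard library. The name `pair_up_sketch` is hypothetical, and this is not the provided buggy code or its intended fix.

```python
from itertools import product


def pair_up_sketch(identifiers, elements):
    # product yields one (identifier, element) tuple for every combination.
    return set(product(identifiers, elements))


print(sorted(pair_up_sketch({10, 20}, {5, 6, 7})))
```

Sorting the result gives a predictable order for comparison, since the set itself has none.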

**Explanation**: When I first ran the code, it raised an error because the `add` method did not work for the object it was called on. I went back to the previous lecture, Strings and Lists, to review the useful list methods and found that the correct method was `append()`, so I replaced `add()` with `append()`. The code then threw an error saying that 

In [32]:
import doctest
def pair_up(identifiers, elements):
    """
    Given two sets, returns a set of tuples where each item in the first set is paired with each
    item in the second set.

    For the doctests, we use the sorted function to ensure a predictable ordering for the tuples
    because sets do not generally guarantee a specific ordering.

    >>> sorted(pair_up({10, 20}, {5, 6, 7}))
    [(10, 5), (10, 6), (10, 7), (20, 5), (20, 6), (20, 7)]
    >>> sorted(pair_up({10, 20}, {"I", "am", "Groot"}))
    [(10, 'Groot'), (10, 'I'), (10, 'am'), (20, 'Groot'), (20, 'I'), (20, 'am')]
    >>> sorted(pair_up({6, 7}, {4, 1}))
    [(6, 1), (6, 4), (7, 1), (7, 4)]
    >>> sorted(pair_up({"like"}, {6, 7}))
    [('like', 6), ('like', 7)]
    """
    result = []
    for identifier in identifiers:
        for element in elements:
            newTuple = (identifier, element)
            result.append(newTuple)
    # Convert to a set so the return type matches the docstring.
    return set(result)


doctest.run_docstring_examples(pair_up, globals())


## Task: `five_number_summary`

Write and test a function `five_number_summary` that takes a sorted list of at least 5 numbers and returns a tuple containing the five-number summary of the input list: `(minimum, first-quartile, median, third-quartile, maximum)`. The first quartile is the median of the lower half of the data (including the minimum), and the third quartile is the median of the upper half of the data (including the maximum). The median should be excluded from the calculations of the first and third quartiles.

- `five_number_summary([1, 2, 3, 4, 5])` should return `(1, 1.5, 3, 4.5, 5)`
- `five_number_summary([1, 1, 1, 1, 1])` should return `(1, 1, 1, 1, 1)`
- `five_number_summary([30, 31, 31, 34, 36, 38, 39, 51, 53])` should return `(30, 31, 36, 45, 53)`
- `five_number_summary([5, 13, 14, 15, 16, 17, 25])` should return `(5, 13, 15, 17, 25)`
- `five_number_summary([5, 12, 12, 13, 13, 15, 16, 26, 26, 29, 29, 30])` should return `(5, 12.5, 15.5, 27.5, 30)`
- `five_number_summary([12, 12, 13, 13, 15, 16, 26, 26, 29, 29])` should return `(12, 13, 15.5, 26, 29)`

The following examples of invalid function calls should not be tested:

- `five_number_summary([1])` since the input list does not have at least five numbers
- `five_number_summary([5, 4, 3, 2, 1])` since the input list is not sorted from least to greatest

We recommend defining a helper function to find the median of a given list.
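One way to sanity-check the examples is with `statistics.median` from the standard library. The sketch below is an assumption about how the halves can be split, not the required solution, and `five_number_sketch` is a hypothetical name. Note that `statistics.median` returns a float when it averages two middle values, so the printed tuple can show `31.0` where the task shows `31`.

```python
from statistics import median


def five_number_sketch(numbers):
    # Each half has len(numbers) // 2 values, so an odd-length list
    # leaves its middle value (the median) out of both halves.
    half = len(numbers) // 2
    lower_half = numbers[:half]
    upper_half = numbers[len(numbers) - half:]
    return (numbers[0], median(lower_half), median(numbers),
            median(upper_half), numbers[-1])


print(five_number_sketch([30, 31, 31, 34, 36, 38, 39, 51, 53]))
```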

In [36]:
import doctest
def median(numbers):
    """
    Takes a non-empty sorted list of numbers and returns its median. When the median is the
    mean of two middle values and is a whole number, it is returned as an int.
    """
    middle = len(numbers) // 2
    if len(numbers) % 2 == 1:
        return numbers[middle]
    value = (numbers[middle - 1] + numbers[middle]) / 2
    if value.is_integer():
        return int(value)
    return value


def five_number_summary(numbers):
    """
    Takes a sorted list of 5 or more numbers and returns a tuple containing the input list's
    (minimum, first-quartile, median, third-quartile, maximum).
    The first quartile is the median of the lower half of the data (excluding the median),
    and the third quartile is the median of the upper half of the data (excluding the median).

    >>> five_number_summary([1, 2, 3, 4, 5])
    (1, 1.5, 3, 4.5, 5)
    >>> five_number_summary([1, 1, 1, 1, 1])
    (1, 1, 1, 1, 1)
    >>> five_number_summary([30, 31, 31, 34, 36, 38, 39, 51, 53])
    (30, 31, 36, 45, 53)
    >>> five_number_summary([5, 13, 14, 15, 16, 17, 25])
    (5, 13, 15, 17, 25)
    >>> five_number_summary([5, 12, 12, 13, 13, 15, 16, 26, 26, 29, 29, 30])
    (5, 12.5, 15.5, 27.5, 30)
    >>> five_number_summary([12, 12, 13, 13, 15, 16, 26, 26, 29, 29])
    (12, 13, 15.5, 26, 29)
    >>> five_number_summary([10, 20, 30, 40, 50, 60, 70])
    (10, 20, 40, 60, 70)
    >>> five_number_summary([1, 2, 3, 4, 5, 6, 7, 8, 9])
    (1, 2.5, 5, 7.5, 9)
    """
    # Each half holds len(numbers) // 2 values, so an odd-length list leaves out its median.
    half = len(numbers) // 2
    lower_half = numbers[:half]
    upper_half = numbers[len(numbers) - half:]
    return (numbers[0], median(lower_half), median(numbers), median(upper_half), numbers[-1])

doctest.run_docstring_examples(five_number_summary, globals())

## Task: `num_outliers`

An *outlier* is an extreme data point that can influence the shape and distribution of numeric data. $x$ is considered an outlier if either:

- $x$ is less than the first quartile minus 1.5 times the interquartile range
- $x$ is greater than the third quartile plus 1.5 times the interquartile range

The *interquartile range* is defined as the third quartile minus the first quartile.
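As a worked example of these definitions, take the list `[5, 13, 14, 15, 16, 17, 25]`, whose first and third quartiles are 13 and 17 according to the `five_number_summary` examples. The names `lower_fence` and `upper_fence` are only illustrative labels for the two cutoffs.

```python
data = [5, 13, 14, 15, 16, 17, 25]
first_quartile = 13
third_quartile = 17

iqr = third_quartile - first_quartile      # 17 - 13 = 4
lower_fence = first_quartile - 1.5 * iqr   # 13 - 6 = 7.0
upper_fence = third_quartile + 1.5 * iqr   # 17 + 6 = 23.0

outliers = [x for x in data if x < lower_fence or x > upper_fence]
print(outliers)  # [5, 25]
```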

Write and test a function `num_outliers` that takes a sorted list of at least five numbers and returns the number of data points that would be considered outliers using your `five_number_summary` to calculate the first and third quartiles.

- `num_outliers([1, 2, 3, 4, 5])` should return `0`
- `num_outliers([1, 99, 200, 500, 506, 507])` should return `0`
- `num_outliers([5, 13, 14, 15, 16, 17, 25])` should return `2` (the outliers are 5 and 25)
- `num_outliers([33, 34, 35, 36, 36, 36, 37, 37, 100, 101])` should return `2` (the outliers are 100 and 101)
- `num_outliers([8, 10, 10, 11, 11, 12])` should return `1` (the outlier is 8)

The following examples of invalid function calls should not be tested:

- `num_outliers([3, 3, 3])` input list should contain at least five numbers
- `num_outliers([3, 2, 1, 0, 5])` input list should be sorted from least to greatest

In [42]:
import doctest
def num_outliers(numbers):
    """
    Takes a sorted list of at least five numbers and returns the number of outliers:
    extreme data points that can skew results and averages.

    >>> num_outliers([1, 2, 3, 4, 5])
    0
    >>> num_outliers([1, 99, 200, 500, 506, 507])
    0
    >>> num_outliers([5, 13, 14, 15, 16, 17, 25])
    2
    >>> num_outliers([33, 34, 35, 36, 36, 36, 37, 37, 100, 101])
    2
    >>> num_outliers([8, 10, 10, 11, 11, 12])
    1
    >>> num_outliers([0, 0, 0, 1, 1, 1, 6, 7])
    0
    >>> num_outliers([1, 1, 1, 1, 1, 1, 4, 4, 4, 4, 41, 67, 69])
    2
    """
    # Reuse five_number_summary instead of copying its quartile logic.
    summary = five_number_summary(numbers)
    firstQuartile = summary[1]
    thirdQuartile = summary[3]
    # The interquartile range is the third quartile minus the first quartile.
    interQuartile = thirdQuartile - firstQuartile
    outliers = 0
    for number in numbers:
        # An outlier lies more than 1.5 interquartile ranges outside the quartiles.
        if number < firstQuartile - 1.5 * interQuartile:
            outliers += 1
        elif number > thirdQuartile + 1.5 * interQuartile:
            outliers += 1
    return outliers

doctest.run_docstring_examples(num_outliers, globals())

## Task: `reformat_date`

Write and test a function `reformat_date` that takes three strings: a date string, an input date format, and an output date format. This function should return a new date string formatted according to the output date format.

A **date string** is a non-empty string of numbers separated by `/`, such as `"2/20/1991"` or `"1991/02/20"`. The order of date fields (month, day, year) will depend on the date format, and the number of digits for each field can vary but there must be at least one digit for each field.

A **date format** is a non-empty string of the date symbols `"D"`, `"M"`, `"Y"` separated by `/`. Assume the date string will match the date formats (share the same number of `/`s), that any date symbol in the output date format will also appear in the input date format, and that date formats do not duplicate date symbols.

- `reformat_date("12/31/1998", "M/D/Y", "D/M/Y")` returns `"31/12/1998"`
- `reformat_date("1/2/3", "M/D/Y", "Y/M/D")` returns `"3/1/2"`
- `reformat_date("0/200/4", "Y/D/M", "M/Y")` returns `"4/0"`
- `reformat_date("3/2", "M/D", "D")` returns `"2"`
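Because each date symbol lines up with exactly one field, one possible approach is to pair them in a dictionary. The sketch below assumes valid input as described above; `reformat_date_sketch` and its parameter names are hypothetical, and it is only one of several ways to solve the task.

```python
def reformat_date_sketch(date, input_format, output_format):
    # Map each symbol to its field, e.g. {"M": "12", "D": "31", "Y": "1998"}.
    fields = dict(zip(input_format.split("/"), date.split("/")))
    # Look up each output symbol and rejoin the fields with "/".
    return "/".join(fields[symbol] for symbol in output_format.split("/"))


print(reformat_date_sketch("12/31/1998", "M/D/Y", "D/M/Y"))  # 31/12/1998
```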

The following examples of invalid function calls should not be tested:

- `reformat_date("3/2", "M/D/Y", "Y/M/D")` date string and input date format do not match
- `reformat_date("3/2", "M/D", "Y/M/D")` input date format missing a field present in the output date format
- `reformat_date("1/2/3/4", "M/D/Y/S", "M/D")` input date format contains a field that is not "D", "M", "Y"
- `reformat_date("1/2/3", "M/M/Y", "M/Y")` input date format contains a duplicate date symbol
- `reformat_date("", "", "")` date strings and date formats must be non-empty strings

In [49]:
import doctest
def reformat_date(date, initFormat, finalFormat):
    """
    Takes a date string, the date format it is written in, and an output date format, and
    returns the date rewritten in the output format. Fields are separated by "/" and the
    formats use the symbols "D", "M", and "Y".

    >>> reformat_date("12/31/1998", "M/D/Y", "D/M/Y")
    '31/12/1998'
    >>> reformat_date("1/2/3", "M/D/Y", "Y/M/D")
    '3/1/2'
    >>> reformat_date("0/200/4", "Y/D/M", "M/Y")
    '4/0'
    >>> reformat_date("3/2", "M/D", "D")
    '2'
    >>> reformat_date("2/20/1991", "M/D/Y", "Y/M/D")
    '1991/2/20'
    >>> reformat_date("1991/02/20", "Y/M/D", "M/D")
    '02/20'
    """
    date = date.split("/")
    initFormat = initFormat.split("/")
    finalFormat = finalFormat.split("/")
    year = ""
    month = ""
    day = ""
    # Record which field of the input holds the year, month, and day.
    for i in range(len(initFormat)):
        if initFormat[i] == "Y":
            year = date[i]
        if initFormat[i] == "M":
            month = date[i]
        if initFormat[i] == "D":
            day = date[i]
    # Build the output one field at a time, adding "/" between fields.
    newDate = ""
    for i in range(len(finalFormat)):
        if finalFormat[i] == "Y":
            newDate += year
        if finalFormat[i] == "M":
            newDate += month
        if finalFormat[i] == "D":
            newDate += day
        if i < len(finalFormat) - 1:
            newDate += "/"
    return newDate

doctest.run_docstring_examples(reformat_date, globals())


## Testing

Double check that **each task has 2 of your own additional test cases**.

In [None]:
test_results = doctest.testmod()
print(test_results)
assert test_results.failed == 0, "There are failed doctests"
assert test_results.attempted >= 31, "There should be at least 31 total doctests"