# Functions, Arrays, and Files

This notebook goes over material covered in [Chapter 3 of Python for Data Analysis](https://wesmckinney.com/book/python-builtin). Use **Code** cells to write and run any code you need to answer the questions and **Markdown** cells to write out answers in words. After you are finished with the assignment, remember to download it as an **HTML file** and submit it in **ELMS**.

In [None]:
import numpy as np
import pandas as pd

## Defining functions

Let's write a very simple function that converts a proportion to a percentage by multiplying it by 100.  For example, the value of `to_percentage(.5)` should be the number 50 (no percent sign).

A function definition has a few parts.

### `def`
It always starts with `def` (short for **def**ine):

    def

### Name
Next comes the name of the function.  Like other names we've defined, it can't start with a number or contain spaces. Let's call our function `to_percentage`:
    
    def to_percentage

### Signature
Next comes something called the *signature* of the function.  This tells Python how many arguments your function should have, and what names you'll use to refer to those arguments in the function's code.  A function can have any number of arguments (including 0!). 

`to_percentage` should take one argument, and we'll call that argument `proportion` since it should be a proportion.

    def to_percentage(proportion)
    
If we want our function to take more than one argument, we add a comma between each argument name.

We put a colon after the signature to tell Python it's over. If you're getting a syntax error after defining a function, check to make sure you remembered the colon!

    def to_percentage(proportion):
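As a brief illustration of a signature with more than one argument, the sketch below uses a hypothetical function named `scale` (it is not part of this assignment). Its body uses `return`, which is explained later in this section.

```python
# Hypothetical example: a signature with two arguments, separated by a comma.
def scale(value, factor):
    return value * factor

# Arguments are matched to names by position: value=3, factor=10.
scaled = scale(3, 10)
print(scaled)  # 30
```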
    

### Documentation
Functions can do complicated things, so you should write an explanation of what your function does.  For small functions, this is less important, but it's a good habit to learn from the start.  Conventionally, Python functions are documented by writing an **indented** triple-quoted string:

    def to_percentage(proportion):
        """Converts a proportion to a percentage."""
    
    

### Body
Now we start writing code that runs when the function is called.  This is called the *body* of the function, and every line **must be indented** (by convention four spaces, which Jupyter inserts when you press Tab).  Any line that is *not* indented, and is instead left-aligned with the `def` statement, is considered outside the function. 

Some notes about the body of the function:
- We can write any code that we would write anywhere else.  
- We use the arguments defined in the function signature. We can do this because we assume that when we call the function, values are already assigned to those arguments.
- We generally avoid referencing variables defined *outside* the function.


Now, let's give a name to the number we multiply a proportion by to get a percentage:

    def to_percentage(proportion):
        """Converts a proportion to a percentage."""
        factor = 100

### `return`
The special instruction `return` is part of the function's body and tells Python to make the value of the function call equal to whatever comes right after `return`.  We want the value of `to_percentage(.5)` to be the proportion .5 times the factor 100, so we write:

    def to_percentage(proportion):
        """Converts a proportion to a percentage."""
        factor = 100
        return proportion * factor
        
`return` only makes sense in the context of a function, and **can never be used outside of a function**. Python stops executing the body of a function as soon as it reaches a `return` statement, so any code placed after that `return` does not run. For that reason, `return` is usually the last line of a simple function like this one.
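A small sketch of how Python leaves a function when it reaches `return`. The function `describe_sign` is hypothetical and chosen only for illustration.

```python
# Hypothetical example: Python leaves the function at the first `return` it reaches.
def describe_sign(number):
    if number < 0:
        return "negative"   # the function ends here for negative inputs
    return "non-negative"   # only reached when the `if` branch did not return

print(describe_sign(-4))  # negative
print(describe_sign(7))   # non-negative
```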

*Note:*  `return` inside a function tells Python what value the function evaluates to. However, some functions, like `print`, do not return a useful value (a call to `print` evaluates to the special value `None`). Instead, `print` simply displays a value in the output. 

`return` and `print` are **very** different. 
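One way to see the difference is to assign the result of each kind of function to a name. The two functions below are hypothetical and exist only for this comparison.

```python
def returns_double(x):
    return x * 2      # hands the value back to the caller

def prints_double(x):
    print(x * 2)      # displays the value, but does not return it

a = returns_double(5)   # a is 10
b = prints_double(5)    # prints 10, but b is None
print(a, b)             # 10 None
```

Because `prints_double` has no `return`, its result cannot be used in later calculations.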

<font color = 'red'>**Question 1. Define `to_percentage` in the cell below.  Call your function to convert the proportion .2 to a percentage.  Name that percentage `twenty_percent`.**</font>

Here's something important about functions: the names assigned *within* a function body are only accessible within the function body. Once the function has returned, those names are gone.  So even though you defined `factor = 100` inside  the body of the `to_percentage` function up above and then called `to_percentage`, you cannot refer to `factor` anywhere except inside the body of `to_percentage`:

In [None]:
# You should see an error when you run this.  (If you don't, you might
# have defined factor somewhere above.)
factor

<font color = 'red'>**Question 2. Define a function called `disemvowel`.  It should take a single string as its argument.  (You can call that argument whatever you want.)  It should return a copy of that string, but with all the characters that are vowels removed.  (In English, the vowels are the characters "a", "e", "i", "o", and "u".)**</font>

*Hint:* To remove all the "a"s from a string, you can use `<that_string>.replace("a", "")`.  The `.replace` method for strings returns a new string, so you can call `replace` multiple times, one after the other. 
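As a sketch of chaining `.replace` calls, here is the same idea applied to a different task: removing punctuation from a made-up phone number. The string and names are assumptions for illustration only.

```python
# Hypothetical example: remove several characters by chaining `.replace`.
phone = "(301) 555-0100"
digits_only = phone.replace("(", "").replace(")", "").replace(" ", "").replace("-", "")

print(digits_only)  # 3015550100
print(phone)        # (301) 555-0100  (the original string is unchanged)
```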

In [None]:





# Run this after you define your function above.
disemvowel("Can you read this without vowels?")

### Building functions
Just as you write a series of lines to build up a complex computation, it's useful to define a series of small functions that build on each other.  Since you can write any code inside a function's body, you can call other functions you've written.

If a function is like a recipe, defining a function in terms of other functions is like having a recipe for cake telling you to follow another recipe to make the frosting, and another to make the jam filling.  This makes the cake recipe shorter and clearer, and it avoids having a bunch of duplicated frosting recipes.  It's a foundation of productive programming.
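A minimal sketch of one function building on another, reusing `to_percentage` as defined earlier in this notebook. The helper `percentage_of_total` is hypothetical.

```python
def to_percentage(proportion):
    """Converts a proportion to a percentage."""
    factor = 100
    return proportion * factor

# Hypothetical helper that builds on to_percentage instead of repeating its logic.
def percentage_of_total(part, total):
    """Returns what percentage `part` is of `total`."""
    return to_percentage(part / total)

print(percentage_of_total(1, 4))  # 25.0
```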

For example, suppose you want to count the number of characters *that aren't vowels* in a piece of text.  One way to do that is to remove all the vowels and count the size of the remaining string.

<font color = 'red'>**Question 3. Write a function called `num_non_vowels`.  It should take a string as its argument and return a number.  That number should be the number of characters in the argument string that aren't vowels.**</font>

*Hint:* The function `len` can take a string as its argument and returns the number of characters in it.

<font color = 'red'>**Question 4. Write a function called `num_vowels_nonvowels` that returns two values: the number of vowel characters and the number of non-vowel characters. Try running the function with the example text below, and assign the number of vowels to `total_vowels` and the number of non-vowels to `total_non_vowels`.**</font>
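A function can hand back two values by separating them with a comma after `return`, and the caller can unpack them into two names. The sketch below uses a hypothetical function, `smallest_and_largest`, that is unrelated to the vowel-counting task.

```python
# Hypothetical example: return two values at once, separated by a comma.
def smallest_and_largest(numbers):
    return min(numbers), max(numbers)   # Python packs these into a tuple

# Unpack the two returned values into two names in one assignment.
low, high = smallest_and_largest([4, 9, 1, 7])
print(low, high)  # 1 9
```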

In [None]:
example_text = "Can you read this without vowels?"






## Arrays

We've already gone over lots of ways of storing data such as **lists**, **dictionaries**, and **tuples**. Two more types of object that you can use are the **Numpy array** and the **Pandas Series**. Numpy arrays are quite flexible and have lots of nice properties that we'll see mirrored in Pandas objects. We won't go over all of the different things you might be able to do with Numpy arrays (mostly because we'll move to using Pandas Series and DataFrames), but it is useful to know what types of operations you can do with arrays.

One of the nice things about arrays is that you can do arithmetic operations with large amounts of data much more quickly than you might be able to with lists. Here, we show an example of timing how long it takes to multiply each of 1,000,000 numbers by 2. The first example uses array operations, while the second uses a list comprehension. 

In [None]:
my_array = np.arange(1000000)
my_list = list(my_array)

In [None]:
%timeit my_array*2

In [None]:
%timeit [x * 2 for x in my_list]

Notice that doing arithmetic with arrays is also much simpler. Rather than needing to use something like a list comprehension, we can simply use arithmetic with arrays as if they were scalars, and Python will apply the arithmetic operation to each number in the array. 

In [None]:
np.array([1,2,3,4,5]) / 3

If we were to try this using lists (or dictionaries or tuples), we would get an error.

In [None]:
[1,2,3,4,5] / 3
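For comparison, here is a sketch of what does work with a plain list: the division has to be applied element by element, for example with a list comprehension. The `try`/`except` is included only to confirm the error type without stopping the cell.

```python
numbers = [1, 2, 3, 4, 5]

# Dividing the list itself raises a TypeError...
try:
    numbers / 3
    raised = False
except TypeError:
    raised = True
print(raised)  # True

# ...so each element is divided individually instead.
thirds = [x / 3 for x in numbers]
print(thirds[2])  # 1.0
```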

Arithmetic also works with two arrays of the same dimensions.

In [None]:
array1 = np.array([1,2,3,4,5]) 
array2 = np.array([5,5,5,5,5])

array1 + array2

In [None]:
array1/array2

### Some array functions

Some array creation functions are shown in the table below.

|Function | Description|
|---|---|
|array() | Convert input data (list, tuple, array, or other sequence type) to an array|
|arange() | Like the built-in `range` function but returns an array instead of a `range` object|
|ones() | Produce an array of all 1s with the given shape and data type|
|zeros()| Like ones but producing arrays of 0s instead|

For more information on arrays, see the numpy array documentation at: https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html
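A brief sketch of the creation functions in the table above, with arbitrarily chosen inputs:

```python
import numpy as np

from_list = np.array([2, 4, 6])   # convert an existing list to an array
counting = np.arange(5)           # the integers 0 through 4, like range(5)
all_ones = np.ones(3)             # three 1s (floats by default)
all_zeros = np.zeros((2, 3))      # a 2-by-3 array of 0s

print(counting)
print(all_zeros.shape)  # (2, 3)
```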

<font color = 'red'>**Question 5: Create an array that contains all the powers of 3 starting with 3^0, all the way up to 3^10. Call this array `powers_of_three`.**</font>

<font color = 'red'>**Question 6: Calculate the mean of `powers_of_three` from the previous question manually (using arithmetic). Then, use the `.mean` method to calculate the mean of the array.**</font>

Note that these are `ndarray`s, meaning n-dimensional arrays. So, you can use them to store data in something like a matrix format too. 

In [None]:
my2darray = np.array([[1,2,3],[4,5,6]])

In [None]:
my2darray.shape
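As a sketch of what the shape means, the block below rebuilds the same two-dimensional array and shows that element-wise arithmetic works just as it does for one-dimensional arrays.

```python
import numpy as np

my2darray = np.array([[1, 2, 3], [4, 5, 6]])

print(my2darray.shape)   # (2, 3): 2 rows, 3 columns
print(my2darray.ndim)    # 2
print(my2darray * 10)    # arithmetic still applies to every element
```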

### Pandas Series

Pandas Series have many of the same properties as arrays. They can be used to do arithmetic, and allow for faster computation. 

In [None]:
my_series = pd.Series(my_array)

In [None]:
%timeit my_series*2

Pandas Series also have a lot of the same functionality as numpy arrays. For example, you can find a mean with a Series in the same way as you can with an array.

In [None]:
my_series.mean()

In [None]:
my_series.std()

Pandas Series do have a different set of methods associated with them though. For the full list, see the documentation for Pandas Series here: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html.

One useful Pandas Series method is the `apply` method. This can be used to apply a function over the values of the series.

In [None]:
example_series = pd.Series([1,2,3,4,6,9,10])

example_series.apply(lambda x: x**2)

The above example could actually be done more succinctly using arithmetic, but there are lots of times when `apply` is useful.
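As a quick check of that equivalence, this sketch squares the same Series both ways and compares the results as plain lists:

```python
import pandas as pd

example_series = pd.Series([1, 2, 3, 4, 6, 9, 10])

with_apply = example_series.apply(lambda x: x**2)
with_arithmetic = example_series ** 2   # same values, written as plain arithmetic

print(with_apply.to_list())       # [1, 4, 9, 16, 36, 81, 100]
print(with_arithmetic.to_list())  # [1, 4, 9, 16, 36, 81, 100]
```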

In [None]:
string_series = pd.Series(['     this ', ' is    ', 'an\n', 'example\n'])
string_series

In [None]:
cleaned_strings = string_series.apply(lambda x: x.strip())
cleaned_strings

In [None]:
# Converting to a list to see the difference better
string_series.to_list()

In [None]:
cleaned_strings.to_list()

<font color = 'red'>**Question 7: Remove all the vowels in `cleaned_strings`.**</font>

## Files

Most of the time, we won't directly open and interact with files except to load datasets from files such as CSV files. However, it can still be useful to know how to interact with text files from Python. Here, we'll go over some basics of how Python can interact with text files.

A basic way of interacting with a file is by using the `open` function. This opens a file and returns a stream so that we can interact with the file. After you are done, you can use the `.close` method to close the file. In order to make this process simpler and avoid having to remember to close the file every time, we can instead use a `with` statement.
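To make the comparison concrete, here is a minimal sketch of both styles. It writes to a throwaway temporary file (the path and file name are assumptions for illustration) so that it does not touch any of the assignment files.

```python
import os
import tempfile

# Use a throwaway path so this sketch does not touch any assignment files.
path = os.path.join(tempfile.mkdtemp(), "scratch.txt")

# Manual version: we are responsible for calling .close() ourselves.
f = open(path, "w", encoding="utf-8")
f.write("hello\n")
f.close()

# `with` version: the file is closed automatically when the block ends.
with open(path, "r", encoding="utf-8") as g:
    contents = g.read()

print(contents)   # hello
print(g.closed)   # True
```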

In [None]:
with open('example.csv','w') as f:
    f.writelines(['1\n', '2\n', '3\n'])

This code opens the file `example.csv` in write mode (`'w'`), then writes three lines containing 1, 2, and 3. Note that we have to use `\n` in order to go to the next line.

We can also read files using the same general format.

In [None]:
with open('example.csv','r', encoding = 'utf-8') as f:
    list_from_file = f.readlines()
list_from_file

Note that although the file contains numbers, each line is read in as a string (including the trailing `\n`). We can address this by converting each line to an integer as we read it in.

In [None]:
with open('example.csv','r', encoding = 'utf-8') as f:
    list_from_file = [int(x) for x in f]
list_from_file

<font color ='red'>**Question 8: The file `Austen_PrideAndPrejudice.txt` has the full text of Pride and Prejudice as a text file. Read this file into Python as a list called `pride_and_prejudice`. Look at the first 10 lines of the list. (Don't display all of it, or you'll have a really long Jupyter notebook file when you download it as HTML.)**</font>

<font color = 'red'>**Question 9: Clean up `pride_and_prejudice` using `.strip` to remove any spaces at the beginning or end of each line, then remove any elements of the list that are empty.**</font>

<font color = 'red'>**Question 10: Create a list that has the number of non-vowel characters in each line of `pride_and_prejudice` as well as a list that has the number of vowels in each line of `pride_and_prejudice`.**</font>

<font color = 'red'>**Question 11: What is the average proportion of vowels per character by line of Pride and Prejudice? (Don't worry about removing things like the title and the line that just says, "A NOVEL")**</font>

### Interacting with Files using Pandas

Generally, when we want to open or write a CSV file, we'll actually use Pandas to do this. Pandas has a `read_csv` function that can read files, as well as a `to_csv` method to write its DataFrames as CSV files. We'll go over more of these functionalities later, but an example using pandas is shown below.

In [None]:
pd.read_csv('example.csv', header = None)
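A minimal sketch of a round trip with `to_csv` and `read_csv`. The DataFrame contents and the temporary file path are made up for illustration and are not part of the assignment data.

```python
import os
import tempfile

import pandas as pd

# Hypothetical data and a throwaway file path, chosen only for illustration.
scores = pd.DataFrame({"name": ["Ann", "Bo"], "score": [90, 85]})
path = os.path.join(tempfile.mkdtemp(), "scores.csv")

scores.to_csv(path, index=False)   # index=False leaves out the row labels
loaded = pd.read_csv(path)         # the first line is used as the column names

print(loaded)
print(loaded["score"].mean())  # 87.5
```

Unlike `example.csv` above, this file has a header row, so `header = None` is not needed when reading it back.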