# Data Analysis and Visualization

* * * 

### Icons used in this notebook
🔔 **Question**: A quick question to help you understand what's going on.<br>
🥊 **Challenge**: Interactive exercise. We'll work through these in the workshop!<br>
💭 **Reflection**: Helping you think about programming.<br>
⚠️ **Warning**: Heads-up about tricky stuff or common mistakes.<br>
💡 **Tip**: How to do something a bit more efficiently or effectively.<br>
🎬 **Demo**: Showing off something more advanced – so you know what Python can be used for!<br>

### Learning Objectives
1. [Write your own Functions](#write) 
2. [Coding Strategies](#strat)
3. [Exploring Data with Pandas](#pandas)
4. [Visualizing DataFrames](#vis)

<a id='write'></a>

# Write your own Functions

Remember, functions are pieces of code that we expect to use over and over again.

Writing our own functions, with functionality specific to our goals, is one of the most useful things we can do in Python.

## Basic Function Syntax

Writing a function in Python is pretty easy! You need to know a few things:

*   Functions begin with the keyword `def`.
*   This keyword is followed by the function *name*.
    *   The name must obey the same rules as variable names.
*   The **arguments** or **parameters** are defined in parentheses as variable names.
    *   Use empty parentheses if the function doesn't take any inputs.
*   A colon indicates the end of the function *signature* (the first line).
*   An indented block of code forms the *body* of the function.
*   The final line should be a `return` statement with the value(s) to be returned from the function.

Let's take a look at a simple function:

In [None]:
def feet_to_meters(feet):
    # Rough conversion factor (1 foot is exactly 0.3048 meters)
    meters = feet * .304
    return(meters)

Notice how there is **no output** from running the block of code above. This is because defining a function does not run it. The function needs to be **called**, or run, with appropriate arguments to execute the code it contains. 

Let's run this function. We can save the output to a variable and print the result.

In [None]:
meters = feet_to_meters(100)
print(meters)

## 💭 Reflection: Variables and "Scope"

Note how we've used the name `meters` twice above: both within the function definition, and for the variable that takes the output of the function. What's going on here?

Arguments and variables created within the function **only exist within the scope of the function!** So `meters` within the function definition is a *different variable* than `meters` which now holds `30.4`.

In fancy words, the variable `meters` in the function definition only exists **within the scope of** that function definition. This is very important to remember!
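Here is a minimal sketch of scope in action. The function `double_value` and the variable names `result` and `doubled` are hypothetical, chosen only for illustration. The sketch shows that a variable assigned inside a function does not overwrite a variable with the same name outside of it.

```python
def double_value(number):
    # This `result` is local: it only exists while the function runs
    result = number * 2
    return result

# A separate variable that happens to share the name `result`
result = 'unchanged'
doubled = double_value(5)

print(doubled)  # 10
print(result)   # unchanged
```

The outer `result` still holds its original string, even though the function assigned a number to its own `result`.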

## 🥊 Challenge: My First Function

Write a function that converts Celsius temperatures to Fahrenheit. The formula for this conversion is:

$$F = 1.8 * C + 32$$

You can name this function whatever you want. But it makes sense to name it something sensible!

In [None]:
def ___:
    # write your code here
    return(____)

## Principles of Writing Your Own Functions

Function writing is one of the most important skills you can develop as a programmer. However, there is also a lot that can go wrong in the function writing process, leading to time-consuming corrections. Here are some guidelines that can help minimize errors and make the process less painful:

1. **Plan**
    1. What is the overall goal of the function? Is there a function that exists already that does the same thing? 
    2. What is going to be the output of the function? (what data type, how many items)?
    3. What arguments will you need? What pieces of the function do you need to control?
    4. What are the general steps of the program? This can be written in bullet points or "pseudocode".
2. **Write**
    1. Start by writing the code without the function wrapper.
    2. Start small. Write small self-contained blocks of code and put the pieces together. You can also consider sub-functions if it is a particularly complex issue.
    3. Test each part of the function as it is added. Track the input of the function and how it changes at each step. 
    4. Wrap the code in the function syntax.
3. **Test**
    1. Take the function and test *several* cases.
    2. Before running test cases, form an expectation of the result. 
    3. Test the function. Pay attention to both errors and strange results. Make adjustments to account for new cases.
    4. Integrate the function with the rest of the code. Are the input arguments the right type? Does the output flow into the rest of the code?

Let's go through an example of the function development process.

Let's say we have a state name (e.g. California) and we want to generate the postal abbreviation for that state (California --> CA).

1. **Plan**
    1. Generate two-letter abbreviation for a state
    2. Input: string of state name
    3. Output: first two letters of string 
    4. The pseudocode might look like this: 
        ``` 
        function
            select first two characters in the string
            make upper case
            return
        ```

2. **Write**

Let's start with our example string `California` and select the first two characters in the string using string indexing:

In [None]:
ex_state = 'California'
#select first two
first_two = ex_state[:2]

Now we need to make the letters uppercase (and check the output). 

In [None]:
first_two.upper()

Now that we've done the individual steps, we can put it together in the function syntax.

In [None]:
def get_state_abbreviation(state):
    first_two = state[:2]
    abbr = first_two.upper()

Now let's test this out:

In [None]:
print(get_state_abbreviation('California'))

🔔 **Question:** Why is the function returning `None`? What do we need to add?

## Function Arguments

Function **arguments** or **parameters** are specified when defining a function in the parentheses, separated by commas. 

These arguments become variables when the function is executed. The variables are assigned the values passed to the function. We do operations based on the arguments, and return the result.

Let's look at an example function in which we're performing division.

🔔 **Question:** What is being divided by what in the following lines of code?

In [None]:
def divide(x, y):
    return(x / y)

print(divide(4, 6))
print(divide(6, 4)) 

The order of the arguments matters; we got different results because each argument had a different role (numerator and denominator).

You can also pass in **keyword arguments**, where each argument is assigned using a name.

In [None]:
print(divide(x=4, y=6))
print(divide(y=6, x=4))

Are the arguments named appropriately? What do `x` and `y` stand for? What could be clearer?

Generally, it's good practice to both use well-named arguments and use them in the same order. This is easier to read.

## Default Arguments

We can also specify **default arguments** in functions. When we provide a default argument, the function will use that value when the user does not pass in a value. Default arguments are specified in the function signature.

An expanded version of the `divide()` function is provided below. What is the additional parameter doing? What will be the output of `divide(24,5)`?

In [None]:
# z has a default value of True
def divide(x, y, z = True):
    if z:
        return(round(x / y))
    else:
        return(x/y)

We can use default arguments for inputs that we only want to change some of the time. It's good practice to set the default to the value that you will use most often.
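As a separate sketch of the same idea, the hypothetical function `to_percent` below (not part of the workshop materials) takes a `decimals` argument that defaults to `1`. Callers who are happy with one decimal place can leave it out, and override it only when they need something else.

```python
def to_percent(proportion, decimals=1):
    # `decimals` falls back to 1 when the caller does not supply it
    return round(proportion * 100, decimals)

# Uses the default of one decimal place
default_result = to_percent(0.96821)

# Overrides the default with a keyword argument
custom_result = to_percent(0.96821, decimals=3)

print(default_result)
print(custom_result)
```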

🔔 **Question:** What do you think the best default for the `z` argument above would be? What might be a better name for that argument?

<a id='strat'></a>

# Coding Strategies

Before we move on, let's discuss five best practices for coding. This is especially useful when you are building your own functions.

### 💡 1: Use comments

You can include **comments** to explain what your code is doing. Use comments to explain *why* you are using a block of code and *how* it works. 

🔔 **Question**: Some code is commented below. What is helpful about the comments? What could be improved?

In [None]:
import numpy as np

#list of proportions
proportion_list = [.96821,.78998,.86898,.98981298]

for prop in proportion_list:
    
    percent = prop * 100 #multiply by 100
    percent = round(percent,1)
    print(percent,'%')



### 💡 2: Keep lines short
Another principle of good Python coding is to keep lines short enough that you don't need to scroll. This is true of both code and comments!

You can use **implied line continuation** inside parentheses, most commonly by including line breaks at commas.

In [None]:
a_long_list = ['banana','apple','pineapple',
               'mango','strawberry','guava',
               'lychee','peach']
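Implied line continuation is not limited to lists. The sketch below uses made-up names (`fruit_counts`, `total_fruit`) to show the same idea inside the braces of a dictionary and the parentheses of a function call.

```python
# Line breaks inside braces: one key-value pair per line
fruit_counts = {
    'banana': 3,
    'apple': 5,
    'mango': 2,
}

# Line breaks inside the parentheses of a function call
total_fruit = sum(
    count
    for count in fruit_counts.values()
)

print(total_fruit)  # 10
```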

### 💡 3: Pythonic Coding

**Pythonic coding** means taking advantage of the design of Python to make more readable code. 

🔔 **Question**: The following two approaches are equivalent. What is happening? Which do you find easier to read?

In [None]:
s = 'sample.thing'

# First approach
slist = s.split('.')
first_word = slist[0]
first_letter = first_word[0]
first_upper = first_letter.upper()

# Second approach
second_upper = s.split('.')[0][0].upper()

# Check that they are identical 
#print(first_upper, second_upper)

Generally, the second approach is preferred, when possible. 

- It uses fewer lines of code.
- It avoids needing to track and remember many variable names.
- It is highly readable if you are familiar with Python.

However, disadvantages of the second approach are:

- It can be harder to follow the steps.
- It's harder to debug.

One approach is to write out chunks of code the first way when debugging, then condense them into fewer lines once the debugging is complete.

### 💡 4: Use Naming Conventions

"The best programmer is the one who can come up with the best names"

* Good names replace comments and make code self-documenting.
* Variables, functions, files, etc. should consist of complete words. Try to avoid abbreviations.
* Use this principle in your coding: frequent -> short, infrequent -> long.

In [None]:
# less ideal
a = 1
a = 'a string'
def a():
    pass  # Do something

# more ideal
count = 1
msg = 'a string'
def func():
    pass  # Do something

Even without comments, good names give a good idea of what is going on in your code!

Here are some more style guidelines:

* joined_lower for functions, methods, attributes.
* joined_lower or ALL_CAPS for constants.
* StudlyCaps for classes.
* camelCase only to conform to pre-existing conventions.

Most often you will use the `joined_lower` format for your variables.
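The sketch below puts these conventions side by side. All of the names (`METERS_PER_FOOT`, `LengthConverter`, `room_length_meters`) are hypothetical, and the example uses a class only to show the StudlyCaps style; you don't need to know how classes work to follow the naming.

```python
# ALL_CAPS for a constant
METERS_PER_FOOT = 0.3048

# StudlyCaps for a class
class LengthConverter:
    # joined_lower for methods and their arguments
    def feet_to_meters(self, length_in_feet):
        return length_in_feet * METERS_PER_FOOT

# joined_lower for variables
length_converter = LengthConverter()
room_length_meters = length_converter.feet_to_meters(10)

print(room_length_meters)
```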

### 💡 5: Plan ahead
Planning ahead can help mitigate time spent dealing with bugs and errors in code. General steps for defensive coding are:

1. State the goals of your code as clearly as possible.
2. Plan out the general logic of steps needed to achieve the goal.
3. Translate the steps into code:
    1. Build up steps piece by piece.
    2. Test frequently to make sure code is working as expected and handle bugs as quickly as possible.
4. Check the output.

## 🥊 Challenge: Make a Function

Say you need to write a function that takes in a list of textual data. We've defined a list for you here.

In [None]:
sent_list = ['When I wake up, the other side of the bed is cold.', 
             'My fingers stretch out, seeking Prim’s warmth but finding only the rough canvas cover of the mattress.',
             'She must have had bad dreams and climbed in with our mother.']

Your function should loop through this list and, for each item, split the sentence into words on whitespace and lowercase every word. The function must return all of the lowercased words in a single new list. 
**Use the discussed strategies** to write your code:

1. Use comments to indicate what your code is doing.
2. Use short lines.
3. Make use of Pythonic coding when splitting and lowercasing.
4. Use the common naming conventions we define above.
5. Plan ahead: write down your goals in comments before you write the code!

💡 **Tip**: This function will require two (nested) for-loops!

In [None]:
# YOUR CODE HERE


<a id="pandas"></a>
# Exploring Data with Pandas

We introduced `pandas` in Fundamentals I. It is the most common package used in data analysis, with a focus on data manipulation and processing. We will work some more with `pandas` here, and work towards visualizing our data.

In [None]:
# recall that pandas is frequently imported with the alias pd
import pandas as pd
import numpy as np

We'll use the Gapminder dataset, which records population, life expectancy, and GDP per capita for countries around the world at five-year intervals.

The columns used in this notebook are:

- country: Name of the country
- continent: Continent the country belongs to
- year: Year of the observation
- pop: Population
- lifeExp: Life expectancy at birth, in years
- gdpPercap: GDP per capita

🔔 **Question**: How many rows are in the data set?

In [None]:
df = pd.read_csv('../../data/gapminder-FiveYearData.csv')
df.head()

## DataFrame Methods

Just like other objects, `DataFrames` have a series of methods associated with them. There are many methods for summarizing a `pd.DataFrame`. For example, [`df.describe()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html) gives summary statistics for each numeric column. Let's look at how `.describe()` works on our DataFrame.


🔔 **Question**: Why are only some of the columns in the DataFrame visible in the output below?

In [None]:
df.describe()

This method is good for summarizing numerical data in a dataset. However, sometimes this might not be enough. For example, what if we wanted only the median life expectancy rather than the whole summary table? 

First, let's select just one column to operate on. We can select an individual column with bracket notation. This is analogous to indexing a list.

🔔 **Question**: What is the type of the output?

In [None]:
df['lifeExp']

A single column of a DataFrame is a `Series` object. Like a list, it can be iterated over, and it supports calculations across all of its values. 

We can then look at the [documentation](https://pandas.pydata.org/docs/reference/api/pandas.Series.html) to see the methods and attributes that are available for `Series` objects. If we want the median, we can use the `.median()` method.

In [None]:
df['lifeExp'].median()
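The sketch below illustrates column selection on a tiny DataFrame. The values and the names `toy`, `height`, and `name` are made up for illustration and are not taken from the workshop data. Selecting with a single column name gives a `Series`, while selecting with a list of column names gives a `DataFrame`.

```python
import pandas as pd

# Made-up values for illustration only
toy = pd.DataFrame({'name': ['a', 'b', 'c'],
                    'height': [1.5, 1.8, 1.6]})

# A single column name returns a Series
one_column = toy['height']
print(type(one_column))

# A list of column names returns a DataFrame
two_columns = toy[['name', 'height']]
print(type(two_columns))

# Series methods work on the single column
height_median = one_column.median()
print(height_median)
```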

We can also do operations on a column. 

🔔 **Question**: What will happen in the code below? What is the type/shape of the output?

In [None]:
# What are we doing here?
df['gdpPercap'] * df['pop']

This is called a **vectorized operation**: the operation is applied element by element, so each value of `gdpPercap` is multiplied by the value of `pop` in the same row. This lets you apply an operation to every item of a Series efficiently, without writing a loop.
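To make the comparison concrete, the sketch below computes the same product twice on a tiny DataFrame: once with an explicit loop and once as a vectorized operation. The DataFrame `toy` and its columns `price` and `quantity` are made up for illustration.

```python
import pandas as pd

# Made-up values for illustration only
toy = pd.DataFrame({'price': [2.0, 3.5, 4.0],
                    'quantity': [10, 4, 5]})

# Loop version: multiply one row at a time
loop_totals = []
for price, quantity in zip(toy['price'], toy['quantity']):
    loop_totals.append(price * quantity)

# Vectorized version: one expression over the whole columns
vector_totals = toy['price'] * toy['quantity']

print(loop_totals)
print(vector_totals.tolist())
```

Both versions give the same numbers, but the vectorized version is shorter and returns a `Series` that keeps the DataFrame's index.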

## 🥊 Challenge: Methods

For each of the following methods, what type is the output?

In [None]:
# Counting values
df['lifeExp'].value_counts(ascending=True)

In [None]:
# Detect missing values
df.isnull()

In [None]:
# Remove missing values
df.dropna()

In [None]:
# Operate on string values
df['country'].str[:4]

There are easily several hundred methods associated with `DataFrames` and `Series`. It is impractical to try to memorize all of them. Often, it's more productive to develop (1) an understanding of what is possible with Python and (2) the ability to learn how to implement new functions by reading documentation and examples!

## 🥊 Challenge: Categorical -> Numeric data

Recall that in our dataset, we have a 'continent' column that includes the values 'Asia', 'Europe', 'Africa', 'Americas', and 'Oceania'. Let's say that for a model, we want to replace these string values with numbers that will serve as input to the model. There are several ways to do this. Look at the [documentation](https://pandas.pydata.org/docs/reference/api/pandas.Series.html) and see if you can `replace` the strings with a corresponding number.

In [None]:
#YOUR CODE HERE



## Selecting Columns and Rows

We can use `.loc[row, column]` to index columns and rows in the DataFrame. 

We can use a **Boolean mask** (discussed in the previous notebook) to represent which rows to select. A Boolean mask is the result of applying a condition to a Series: a new Series that contains `True` where the condition is met and `False` elsewhere.

For example, let's say that we want to select the rows where life expectancy is under 40.


In [None]:
df['lifeExp'] < 40

Then, to get the matching subset of the entire DataFrame, we can pass this Boolean mask to `.loc[]`:

In [None]:
df.loc[df['lifeExp'] < 40]

Now, if you wish to subset this DataFrame for columns as well as rows, you can include a columns argument in `.loc[]` that includes a list of columns to subset. 

💡 **Tip**: Note the comma that's separating between the rows and columns we're subsetting!

In [None]:
# Subsetting rows and columns
df.loc[df['lifeExp'] < 40, ['country','year','lifeExp']]
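Boolean masks can also be combined. The sketch below uses a small made-up DataFrame (`toy`, with hypothetical columns `name`, `group`, and `score`) to show `&` for "and" and `|` for "or". Storing each mask in a named variable first keeps the `.loc[]` expression readable.

```python
import pandas as pd

# Made-up values for illustration only
toy = pd.DataFrame({
    'name': ['a', 'b', 'c', 'd'],
    'group': ['x', 'y', 'x', 'y'],
    'score': [35, 52, 48, 29],
})

# Each condition produces its own Boolean mask
low_score = toy['score'] < 40
in_group_y = toy['group'] == 'y'

# Rows where both conditions hold, keeping two columns
both = toy.loc[low_score & in_group_y, ['name', 'score']]

# Rows where at least one condition holds, keeping one column
either = toy.loc[low_score | in_group_y, 'name']

print(both)
print(either.tolist())
```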

## 🥊 Challenge: Subsetting a DataFrame

1. Modify the `.loc[]` expression above to subset for GDP per capita under 800. Save it to the variable `low_gdp`.
2. Calculate the mean life expectancy for this group (**Hint**: use `.mean()`).

In [None]:
# YOUR CODE HERE




<a id="vis"></a>
# Visualizing DataFrames

We often want to look at our data visually. Fortunately, `pandas` also offers some basic plotting functions that can be useful in exploring a data set. In this section, we will cover two basic types of plots: histograms and scatter plots. See the [documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.html) for further information on plotting and plot customization.

### Histograms

A histogram shows the distribution of a variable using binned values. We can call this using the syntax: `df[column].plot(kind='hist')`.

💡 **Tip**: Use a histogram if you want to show distributions of continuous variables.

In [None]:
print('Plot A: 5 Bins')
fig = df['lifeExp'].plot(kind='hist', title='Histogram of life expectancy', bins=5)

Note the `bins` keyword argument when calling the histogram. It changes the number of "bins" or "buckets" in the histogram. Each bin is plotted as a bar whose height corresponds to how many data points are in that bin.
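To see what binning does to the numbers themselves, the sketch below counts a small list of made-up values with `np.histogram`. It splits the range of the data into equal-width bins, which should match what the histogram plot does when `bins` is an integer. The names `values`, `counts`, and `bin_edges` are chosen for illustration.

```python
import numpy as np

# Made-up values for illustration only
values = [1, 2, 2, 3, 3, 3, 4, 4, 5, 9]

# Split the range of the data (1 to 9) into 4 equal-width bins
counts, bin_edges = np.histogram(values, bins=4)

print(counts)     # number of values falling in each bin
print(bin_edges)  # 5 edges define 4 bins
```

Each count corresponds to the height of one bar. Changing `bins` moves the edges, which changes how the same values are grouped.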

🔔 **Question**: Let's plot two more histograms with different numbers of bins. Which of the three plots would you pick, and why?

In [None]:
print('Plot B: 10 Bins')
fig = df['lifeExp'].plot(kind='hist', title='Histogram of life expectancy', bins=10)

In [None]:
print('Plot C: 20 bins')
df['lifeExp'].plot(kind='hist', title='Histogram of life expectancy', bins=20)

### Bar Plots

Bar plots show the relationship between a numeric and a categorical variable. Here, we use the "country" (categorical) and "lifeExp" (numeric) columns.

💡 **Tip**: Use a bar plot when you want to illustrate differences in frequencies of some category.

Let's retrieve the 10 data points with the lowest life expectancy in our data using `.sort_values()`, and then plot those data points in a bar plot.


In [None]:
# Sort by life expectancy (lowest first) and keep the first 10 rows
low_lifeExp = df.sort_values('lifeExp',ascending=True)[:10]

# Plot
low_lifeExp.plot.bar(x = 'country', y = 'lifeExp', figsize = (6,4));


### Scatter Plots

Scatter plots visualize the relationship between different variables. We can create a scatter plot by specifying the columns to use for the `x` and `y` axes. Notice that instead of calling it on a single column of data, we are using `df.plot(kind='scatter')`.

💡 **Tip**: Use scatter plots when you want to examine the relationship between two numeric variables.

In [None]:
fig = df.plot(kind='scatter',
              x='lifeExp',
              y='gdpPercap',
              title='Relationship between GDP per capita and life expectancy')

🔔 **Question**: Do you notice any pattern in the data? What might be causing that pattern?

## `matplotlib` 

So far, we've built plots directly from Pandas DataFrames. On the back end, Pandas uses Matplotlib, a very popular visualization library in Python, in order to do this. But we can also use Matplotlib directly.
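As a minimal sketch of using Matplotlib directly, the example below draws a histogram from a plain Python list without going through a DataFrame. The values are made up for illustration, and the names `fig`, `ax`, and `bars` are just common choices rather than requirements.

```python
import matplotlib.pyplot as plt

# Made-up values for illustration only
values = [1, 2, 2, 3, 3, 3, 4, 4, 5, 9]

# Create a figure with a single set of axes, then draw on the axes
fig, ax = plt.subplots(figsize=(5, 4))
counts, bin_edges, bars = ax.hist(values, bins=4)

# Titles and labels are set through methods on the axes
ax.set_title('Histogram drawn with Matplotlib')
ax.set_xlabel('value')
ax.set_ylabel('count')
```

Working with the `ax` object gives finer control over titles, labels, and layout than the one-line Pandas plotting calls.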

### Creating a boxplot

A boxplot (or "box and whiskers" plot) summarizes the distribution and skewness of numerical data using its quartiles. The box spans from the first quartile to the third quartile, and a line goes through the box at the median. By default in Matplotlib, the whiskers extend from the box to the most extreme data points that lie within 1.5 times the interquartile range; points beyond the whiskers are drawn individually.

💡 **Tip**: Use a boxplot when you want to illustrate variation in a single float or integer, and to identify outliers.

In [None]:
import matplotlib.pyplot as plt

df.boxplot(column=['lifeExp']);

The line inside the box marks the median value in the data; the bottom and top of the box are the first and third quartiles.

Let's make boxplots of life expectancy **_by_** continent in the gapminder dataset.

In [None]:
# For each continent
df.boxplot(column=['lifeExp'], 
            by = 'continent', 
            figsize = (5, 4)
           )

plt.title("");

Note the circles, which mark points beyond the whiskers. These are treated as outliers in the data.
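
The numbers behind a boxplot can also be computed by hand. The sketch below uses a made-up `Series` and assumes the common rule of flagging points more than 1.5 times the interquartile range beyond the box, which is Matplotlib's default whisker setting. The variable names are chosen for illustration.

```python
import pandas as pd

# Made-up values for illustration only; 30 is deliberately extreme
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 30])

# The box: first quartile, median, third quartile
q1 = values.quantile(0.25)
median = values.median()
q3 = values.quantile(0.75)

# Interquartile range and the 1.5 * IQR limits for the whiskers
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# Points outside the limits would be drawn as circles
is_outlier = (values < lower_bound) | (values > upper_bound)
outliers = values[is_outlier]

print(q1, median, q3)
print(outliers.tolist())
```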