# Data Analysis and Visualization

* * * 

### Icons Used In This Notebook
🔔 **Question**: A quick question to help you understand what's going on.<br>
🥊 **Challenge**: Interactive exercise. We'll work through these in the workshop!<br>
💭 **Reflection**: Helping you think about programming.<br>
⚠️ **Warning**: Heads-up about tricky stuff or common mistakes.<br>
💡 **Tip**: How to do something a bit more efficiently or effectively.<br>
🎬 **Demo**: Showing off something more advanced – so you know what Python can be used for!<br>

### Learning Objectives
1. [Write Your Own Functions](#write) 
2. [Exploring Data with Pandas](#expl)
3. [Visualizing DataFrames](#vis)

<a id='write'></a>

# Write Your Own Functions

Remember, functions are pieces of code that we expect to use over and over again.

One of the most useful things we can do in Python is write our own functions, with functionality tailored to our goals.

## Basic Function Syntax

Writing a function in Python is pretty easy! You need to know a few things:

*   Functions begin with the keyword `def`.
*   This keyword is followed by the function *name*.
    *   The name must obey the same rules as variable names.
*   The **arguments** or **parameters** are defined in parentheses as variable names.
    *   Use empty parentheses if the function doesn't take any inputs.
*   A colon indicates the end of the function *signature* (the first line).
*   An indented block of code forms the function *body*.
*   The body usually ends with a `return` statement giving the value(s) to be returned from the function. A function without a `return` statement returns `None`.
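
To make these rules concrete, here is a minimal sketch using two hypothetical functions (the names `get_greeting` and `rectangle_stats` are invented for illustration). One takes no inputs; the other takes two parameters and returns two values.

```python
# Hypothetical examples; the names are for illustration only.

def get_greeting():
    # Empty parentheses: this function takes no inputs.
    greeting = 'Hello, world!'
    return greeting


def rectangle_stats(width, height):
    # Two parameters, defined as variable names in the parentheses.
    area = width * height
    perimeter = 2 * (width + height)
    # Returning several values packs them into a tuple.
    return area, perimeter


area, perimeter = rectangle_stats(2, 3)
print(get_greeting())
print(area, perimeter)
```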

Let's take a look at a simple function:

In [None]:
def feet_to_meters(feet):
    meters = feet * 0.304  # approximate factor; the exact value is 0.3048
    return meters

Notice how there is **no output** from running the block of code above. This is because defining a function does not run it. The function needs to be **called**, or run, with appropriate arguments to execute the code it contains. 

Let's run this function. We can save the output to a variable and print the result.

In [None]:
meters = feet_to_meters(100)
print(meters)

## 💭 Reflection: Variables and "Scope"

Note how we've used the name `meters` twice above: both within the function definition, and for the variable that takes the output of the function. What's going on here?

Arguments and variables created within a function **only exist within the scope of that function!** So `meters` within the function definition is a *different variable* from the `meters` defined outside it, which now holds `30.4`. The inner `meters` disappears once the function finishes running. This is very important to remember!
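
A small sketch of scope in action, assuming a fresh session in which the name `cm` has not been defined elsewhere. The function `inches_to_cm` is a hypothetical example, not part of the workshop materials.

```python
def inches_to_cm(inches):
    # `cm` is created inside the function, so it is local to it.
    cm = inches * 2.54
    return cm


result = inches_to_cm(10)
print(result)

# Outside the function, the local name `cm` was never defined,
# so asking for it raises a NameError.
try:
    print(cm)
except NameError as error:
    print('NameError:', error)
```

The returned value survives only because it was assigned to `result` outside the function.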

## 🥊 Challenge: My First Function

Write a function that converts Celsius temperatures to Fahrenheit. The formula for this conversion is:

$$F = 1.8 \times C + 32$$

You can name this function whatever you want. But it makes sense to name it something sensible!

In [None]:
def ___:
    # YOUR CODE HERE
    return ____

## 🥊 Challenge: Make a Function

Say you need to write a function that takes in a list of text data. We've defined a list for you here.

In [None]:
sent_list = ['When I wake up, the other side of the bed is cold.', 
             'My fingers stretch out, seeking Prim’s warmth but finding only the rough canvas cover of the mattress.',
             'She must have had bad dreams and climbed in with our mother.']

Your function should loop through this list and, for each item, split the sentence into words on whitespace and lowercase each word. The function must return all lowercased words in a single new list. 
**Use the discussed strategies** to write your code:

1. Use comments to indicate what your code is doing.
2. Use short lines.
3. Make use of Pythonic coding when splitting and lowercasing.
4. Use the common naming conventions we define above.
5. Plan ahead: write down your goals in comments before you write the code!

💡 **Tip**: This function will require two (nested) for-loops!

In [None]:
# YOUR CODE HERE


<a id="expl"></a>

# Exploring Data with Pandas

We introduced `pandas` in Fundamentals I. It is the most common package used in data analysis, with a focus on data manipulation and processing. We will work some more with `pandas` here, and work towards visualizing our data.

In [None]:
# Recall that pandas is frequently imported with the alias pd
import pandas as pd
import numpy as np

We'll use the Gapminder five-year data set (`gapminder-FiveYearData.csv`), which is the file loaded in the cell below.

The columns used in this notebook are:

- `country`: Country name
- `continent`: Continent
- `lifeExp`: Life expectancy
- `gdpPercap`: GDP per capita
- `pop`: Population

🔔 **Question**: How many rows are in the data set?

In [None]:
df = pd.read_csv('../../data/gapminder-FiveYearData.csv')
df.head()

## DataFrame Methods

Just like other objects, data frames have a series of methods – functions that work on them specifically.

There are many methods for summarizing data frames (which are often assigned to a variable named `df`). For example, [`df.describe()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html) gives summary statistics for the columns of a data frame. Let's look at how `.describe()` works on our `DataFrame`.

🔔 **Question**: Why are only some of the columns in the `DataFrame` visible in the output below?

In [None]:
df.describe()

This method is good for summarizing the numerical columns of a dataset. However, sometimes we want a single statistic on its own. For example, what if we wanted just the median of life expectancy?

First, let's select just one column to operate on. We can select an individual column with bracket notation. This is analogous to indexing a list.

🔔 **Question**: What is the type of the output?

In [None]:
df['lifeExp']

A single column of a pandas `DataFrame` is a `Series` object. It can be treated like a list or other iterable, and you can do calculations over it. 
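
As a quick illustration, the sketch below uses a small made-up data frame (`toy`, not the workshop data) to show what bracket selection returns. Passing a list of column names instead of a single name gives back a `DataFrame` rather than a `Series`.

```python
import pandas as pd

# Toy data invented for illustration; not the workshop dataset.
toy = pd.DataFrame({'city': ['A', 'B', 'C'],
                    'temp': [20.5, 18.0, 25.1]})

one_column = toy['temp']      # single column name -> Series
sub_frame = toy[['temp']]     # list of column names -> DataFrame

print(type(one_column))
print(type(sub_frame))

# A Series can be iterated over like a list...
for value in one_column:
    print(value)

# ...and it has its own summary methods.
print(one_column.mean())
```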

We can then look at the [documentation](https://pandas.pydata.org/docs/reference/api/pandas.Series.html) to see the methods and attributes that are available for `Series` objects. If we want the median, we can use the `.median()` method.

In [None]:
df['lifeExp'].median()

We can also do operations on a column. 

🔔 **Question**: What will happen in the code below? What is the type/shape of the output?

In [None]:
# What are we doing here?
df['gdpPercap'] * df['pop']

This is called a **vectorized operation**: the operation is applied to each element of the column, here multiplying the two columns row by row. This lets you efficiently apply an operation to every item of a `Series` without writing a loop.
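
To see what "applied to each element" means, this sketch compares a vectorized multiplication with an explicit loop on a small made-up data frame. The column names `price` and `quantity` are assumptions for illustration only.

```python
import pandas as pd

# Toy data invented for illustration.
toy = pd.DataFrame({'price': [2.0, 3.5, 4.0],
                    'quantity': [10, 4, 5]})

# Vectorized: one expression multiplies the columns row by row.
vectorized = toy['price'] * toy['quantity']

# Equivalent explicit loop, shown only for comparison.
looped = []
for price, quantity in zip(toy['price'], toy['quantity']):
    looped.append(price * quantity)

print(vectorized.tolist())
print(looped)
```

Both approaches give the same numbers, but the vectorized version is shorter and returns a `Series` that keeps the original row index.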

## 🥊 Challenge: Methods

For each of the following methods, what type is the output?

In [None]:
# Counting values
df['lifeExp'].value_counts(ascending=True)

In [None]:
# Detect missing values
df.isnull()

In [None]:
# Remove missing values
df.dropna()

In [None]:
# Operate on string values
df['country'].str[:4]

💡 **Tip**: There are hundreds of methods associated with `DataFrame` and `Series` objects. Don't memorize all of them. Instead, get used to new methods by reading documentation and examples!

## 🥊 Challenge: Categorical to Numeric data

Recall that in our dataset, we have a 'continent' column that includes the values 'Asia', 'Europe', 'Africa', 'Americas', and 'Oceania'. Let's say that for a model, we want to replace these string values with numbers that will serve as input to the model. There are several ways to do this. Look at the [documentation](https://pandas.pydata.org/docs/reference/api/pandas.Series.html) and see if you can `replace` the strings with a corresponding number.

In [None]:
# YOUR CODE HERE


<a id="vis"></a>
# Visualizing DataFrames

We often want to look at our data visually. Fortunately, `pandas` also offers some basic plotting functions that can be useful in exploring a data set. In this section, we will cover three basic types of plots: histograms, bar plots, and scatter plots. See the [documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.html) for further information on plotting and plot customization.

### Histograms

A histogram shows the distribution of a variable using binned values. We can call this using the syntax: `df[column].plot(kind='hist')`.

💡 **Tip**: Use a histogram if you want to show distributions of continuous variables.

In [None]:
print('Plot A: 5 Bins')
fig = df['lifeExp'].plot(kind='hist', title='Histogram of life expectancy', bins=5)

Note the `bins` keyword argument when calling the histogram. It changes the number of "bins" or "buckets" in the histogram. Each bin is plotted as a bar whose height corresponds to how many data points are in that bin.
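
To see the counts behind the bars, the sketch below bins a few made-up values with NumPy's `np.histogram`. This is a stand-in chosen for illustration, not the plotting call itself, and the values are invented.

```python
import numpy as np

# Toy values invented for illustration.
values = np.array([1, 2, 2, 3, 3, 3, 7, 8, 9, 10])

# np.histogram returns the count in each bin and the bin edges.
counts_5, edges_5 = np.histogram(values, bins=5)
counts_2, edges_2 = np.histogram(values, bins=2)

print(counts_5, edges_5)
print(counts_2, edges_2)
```

With fewer bins, each bar covers a wider range of values, so detail such as the empty stretch in the middle of this data can be hidden.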

🔔 **Question**: Let's plot two more histograms with different numbers of bins. Which of the three plots would you pick, and why?

In [None]:
print('Plot B: 10 Bins')
fig = df['lifeExp'].plot(kind='hist', title='Histogram of life expectancy', bins=10)

In [None]:
print('Plot C: 20 Bins')
fig = df['lifeExp'].plot(kind='hist', title='Histogram of life expectancy', bins=20)

### Bar Plots

Bar plots show the relationship between a numeric and a categorical variable. Here, we use the "country" (categorical) and "lifeExp" (numeric) columns.

💡 **Tip**: Use a bar plot when you want to compare a numeric value, such as a count or a measurement, across categories.

Let's retrieve the 10 data points with the lowest life expectancy in our data using `.sort_values()`, and then plot those data points in a bar plot.


In [None]:
# Sort by life expectancy (ascending) and keep the 10 lowest rows
low_lifeExp = df.sort_values('lifeExp', ascending=True)[:10]

# Visualize with bar plot
low_lifeExp.plot.bar(x='country', y='lifeExp', figsize=(6,4));
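
As an aside, pandas also provides `DataFrame.nsmallest`, which selects the rows with the lowest values in one step. The sketch below compares the two approaches on a small made-up data frame. The values are invented, and only the column names mirror the ones used here.

```python
import pandas as pd

# Toy data invented for illustration.
toy = pd.DataFrame({'country': ['A', 'B', 'C', 'D'],
                    'lifeExp': [61.2, 45.3, 70.8, 52.0]})

# Sort ascending, then slice the first two rows.
by_sorting = toy.sort_values('lifeExp', ascending=True)[:2]

# nsmallest returns the two rows with the lowest values directly.
by_nsmallest = toy.nsmallest(2, 'lifeExp')

print(by_sorting)
print(by_nsmallest)
```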

### Scatter Plots

Scatter plots visualize the relationship between different variables. We can create a scatter plot by specifying the columns to use for the `x` and `y` axes. Notice that instead of calling it on a single column of data, we are using `df.plot(kind='scatter')`.

💡 **Tip**: Use scatter plots when you have two numeric variables and want to see how they relate to each other.

In [None]:
fig = df.plot(kind='scatter',
              x='lifeExp',
              y='gdpPercap',
              title='Relationship between GDP per capita and life expectancy')

🔔 **Question**: Do you notice any pattern in the data? What might be causing that pattern?