# Python Intermediate: Iteration and Visualization

* * * 

<div class="alert alert-success">  
    
### Learning Objectives 
    
* Apply `for` loops for repeated computations.
* Apply several useful Pandas methods such as `.describe()` and `.value_counts()`.
* Use the `.plot()` method in Pandas to create simple visualizations. 
    
</div>


### Icons Used in This Notebook
🔔 **Question**: A quick question to help you understand what's going on.<br>
🥊 **Challenge**: Interactive exercise. We'll work through these in the workshop!<br>
⚠️ **Warning**: Heads-up about tricky stuff or common mistakes.<br>
💡 **Tip**: How to do something a bit more efficiently or effectively.<br>
🎬 **Demo**: Showing off something more advanced – so you know what Python can be used for!<br>

### Sections
1. [Loops](#loops)
2. [DataFrame Methods](#meth)
3. [Visualizing DataFrames](#vis)

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [None]:
df = pd.read_csv('../data/gapminder_gni.csv')
df.head()

<a id='loops'></a>

# Loops

The strength of using computers is their speed. We can leverage this by facilitating repeated computation with **loops**. In programming, there are generally two kinds of loops: for loops and while loops. 

A **[for loop](https://www.w3schools.com/python/python_for_loops.asp)** executes some statements once *for* each value in an iterable (like a list or a string). It says: "*for* each thing in this group, *do* these operations".

A **[while loop](https://www.w3schools.com/python/python_while_loops.asp)** says: "*while* Condition A is true, *do* these operations".  We don't use these loops frequently in this type of programming so we won't cover them here.

Let's take a look at the syntax of a for loop using an example:

In [None]:
# We use a variable containing a list with the values to be iterated through
lifeExp_list = [28.801, 30.332, 31.997]

# Initialize the loop
for each in lifeExp_list:
    print(round(each))

# This will only be printed when the loop has ended!
print('The loop has ended.')

Note that the above example is pretty easy to read:

"**for** each value **in** our list, print out the rounded number".

## Loop Syntax

Let's break down the syntax of the `for` loop more closely.

*   The colon at the end of the first line signals the start of a *block* of statements.
*   The indented line(s) following the colon indicate the lines to run as a part of the loop (also known as the body).
*   Unindented lines following the loop will execute **after** all iterations of the for loop are complete.
*   `for loop_variable in collection:` The loop variable is what gets plugged into the calculations in the body of the loop, and the collection is the group of values being looped through.
*   Loop variables:
    *   Are created on demand.
    *   Can have any name (though your code is more readable if these names are meaningful!).
    *   Act as placeholders for the loop.
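
To make these points concrete, here is a minimal sketch. The list of years and the names `year` and `body_runs` are made up for illustration; they are not part of the workshop dataset.

```python
# Count how many times the loop body runs
body_runs = 0

for year in [1952, 1957, 1962]:
    # Indented: this line is the body and runs once per value
    body_runs = body_runs + 1

# Unindented: these lines run once, after the loop has finished
print(body_runs)
# The loop variable was created by the loop and still holds the last value
print(year)
```

Note that `year` was never defined before the loop: Python created it when the loop started, which is what "created on demand" means.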

## 🥊 Challenge 1: Fixing Loop Syntax

The following block of code contains **three errors** that are preventing it from running properly. What are the errors? How would you fix them?

In [None]:
for number in [2.12, 3.432, 5.23]
print(n)

## Loops With Strings, Series, `range`

Loops can loop over any iterable data type. An **iterable** is any data type that can be iterated over, like a sequence. Generally, anything that can be indexed (e.g. accessed with `values[i]`) is an iterable.

For example, a string is iterable, so it is possible to loop through a string!

Let's take a look at an example:

In [None]:
example_string = 'afghanistan'

for char in example_string:
    # Use the upper() method on char
    print(char.upper())
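
The same syntax works for `range` and for a Pandas `Series`. The sketch below assumes a small, hypothetical `Series` of three values rather than a column from the workshop dataset, and it collects results in a list, a pattern covered in the next section.

```python
import pandas as pd

# range(3) produces the integers 0, 1, 2 one at a time
squares = []
for i in range(3):
    squares.append(i ** 2)
print(squares)

# Looping over a Series visits its values in order
# (hypothetical values, not read from the dataset)
life_exp = pd.Series([28.801, 30.332, 31.997])
rounded = []
for value in life_exp:
    rounded.append(round(value))
print(rounded)
```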

## Aggregating Values With Loops

A common strategy in programs is to:
1.  Initialize an *accumulator* variable appropriate to the datatype of the output:
    * `int` : `0`
    * `str` : `''`
    * `list` : `[]`
2.  Update the variable with values from a collection through a for loop. Typical update operations are:
    * `int` : `+`
    * `str` : `+`
    * `list` : `.append()`
    
The result of this is a single list, number, or string with a summary value for the entire collection being looped over.

Returning to the life expectancy example, we can make a new list with all of the values rounded:

In [None]:
rounded_numbers = []

for number in lifeExp_list:
    rounded = round(number)
    rounded_numbers.append(rounded)

print('Rounded numbers:', rounded_numbers)
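
The accumulator does not have to be a list. As a rough sketch, the same pattern with a numeric accumulator can compute a mean by hand; the values and the names `values`, `total`, and `mean` below are assumptions for illustration.

```python
# Hypothetical values; in practice these could come from a dataset
values = [28.801, 30.332, 31.997]

# Numeric accumulator starts at 0
total = 0

for value in values:
    # Add the current value to the running total
    total = total + value

# Divide the final total by the number of values
mean = total / len(values)
print(round(mean, 3))
```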

💡 **Tip**: Remember: indenting matters in Python! Jupyter automatically indents for you – but if you want to move multiple lines of code at once, you can select them and then hit `Control + ]` to indent them (move to the right), or `Control + [` to dedent them (move to the left). If you are on a Mac, use `Command` instead of `Control`.

## 🥊 Challenge 2: Aggregation Practice

Below are a few examples showing the different types of quantities you might aggregate using a for loop. These loops are partially filled out. Finish them and test that they work!

1. Find the total length of the strings in the given list. Store this quantity in a variable called `total`.

In [None]:
total = 0
words = ['red', 'green', 'blue']

for w in words:
    ... = ... + len(w)

print(total)

2. Find the length of each word in the list, and store these lengths in another list called `lengths`.

In [None]:
lengths = ...
words = ['red', 'green', 'blue']

for w in words:
    lengths....(...)

print(lengths)

3. Concatenate all words into a single string called `result`.

In [None]:
words = ['red', 'green', 'blue']
result = ...

for ... in ...:
    ...

print(result)

4. Create an acronym, as a single string, representing the list of words. Each part of the acronym should consist of the first letter of each word, capitalized. For example, your loop should output `"RGB"` for the input `["red", "green", "blue"]`. For this one, write the entire loop yourself!

In [None]:
words = ['red', 'green', 'blue']

# YOUR CODE HERE


💡 **Tip**: Python runs loops without showing you all the steps it takes. If you want to visualize all steps, check out [pythontutor.com](https://pythontutor.com/python-debugger.html#mode=edit). Try copy-pasting one of your answers in the last challenge!

## How I Learned to Stop Using `for` Loops and Love Vectorization

Let's say we want to multiply GDP per capita (`gdpPercap`) by population (`pop`) in order to get the total GDP of a country. We could do so using a `for` loop:

In [None]:
%%timeit
gdpTotal = []
df_length = len(df)

for each in range(df_length):
    gdp = df['gdpPercap'].iloc[each]
    pop = df['pop'].iloc[each]
    gdpTotal.append(gdp * pop)

But this operation is slow, and not preferred. In Pandas, we will want to use [**vectorized**](https://www.geeksforgeeks.org/vectorized-operations-in-numpy) operations, which apply an operation to a whole column at once. We can just multiply two columns, and Pandas will multiply them element by element, one row at a time!

In [None]:
%%timeit
gdpTotal = df['gdpPercap'] * df['pop']
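
To check that both approaches give the same answer, here is a small sketch on a made-up three-row table. The name `small` and its values are hypothetical; only the column names mirror the workshop dataset.

```python
import pandas as pd

# Hypothetical three-row table with the same column names as the dataset
small = pd.DataFrame({
    'gdpPercap': [100.0, 250.0, 400.0],
    'pop': [10, 20, 30],
})

# Loop version: one multiplication per row
looped = []
for i in range(len(small)):
    looped.append(small['gdpPercap'].iloc[i] * small['pop'].iloc[i])

# Vectorized version: one expression for the whole column
vectorized = small['gdpPercap'] * small['pop']

print(vectorized.tolist())
# Compare the two results value by value
print(looped == vectorized.tolist())
```

The results match; the difference is that the vectorized version is shorter to write and, on real data, much faster.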

## 🥊 Challenge 3: Get Vectorized

Say our `year` column contains wrong information and we need to add one year to each value. Use a vectorized operation to get this done.

In [None]:
# YOUR CODE HERE


<a id='meth'></a>

# DataFrame Methods

Just like other objects, data frames have a series of methods – functions that work on them specifically.

### `.describe()`

[`df.describe()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html) will give some summary statistics for the columns of a `DataFrame`. Let's look at how `.describe()` works on our `DataFrame`.

🔔 **Question**: Why are only some of the columns in the `DataFrame` visible in the output below?

In [None]:
df.describe()

This method is good for summarizing numerical data in a dataset. However, sometimes this might not be enough. For example, what if we wanted the median of life expectancy?

First, let's select just one column to operate on. We can select an individual column with bracket notation. This is analogous to indexing a list.

🔔 **Question**: What is the type of the output?

In [None]:
df['lifeExp']

A single column of a Pandas `DataFrame` is a `Series` object. This can be treated like a list or other iterable, and allows you to do calculations over it.

We can then look at the [documentation](https://pandas.pydata.org/docs/reference/api/pandas.Series.html) to see the methods and attributes that are available for `Series` objects. If we want the median, we can use the `.median()` method.

In [None]:
df['lifeExp'].median()
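
As a side note, the `50%` row that `.describe()` reports is the median as well. The sketch below illustrates this on a hypothetical `Series` of five values, not on the workshop data.

```python
import pandas as pd

# Hypothetical values standing in for a life expectancy column
life_exp = pd.Series([28.8, 30.3, 32.0, 43.8, 70.1], name='lifeExp')

# The median is the middle value once the data are sorted
median_value = life_exp.median()

# .describe() on a Series returns another Series, indexed by statistic name
summary = life_exp.describe()

print(median_value)
print(summary['50%'])
```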

### `.value_counts()`

There are many methods for summarizing data frames (which are often assigned to a variable named `df`) and their columns. For instance, calling `.value_counts()` on a column returns a `Series` containing the count of each unique value.

In [None]:
df['continent'].value_counts()
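
If proportions are more useful than raw counts, `.value_counts()` also accepts `normalize=True`. A minimal sketch, assuming a short hypothetical `Series` of continent names:

```python
import pandas as pd

# Hypothetical values; the real column has many more rows
continents = pd.Series(['Asia', 'Africa', 'Asia', 'Europe', 'Asia', 'Africa'])

# Raw counts, sorted from most to least frequent
counts = continents.value_counts()

# Shares of the total instead of counts
shares = continents.value_counts(normalize=True)

print(counts)
print(shares)
```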

## 🥊 Challenge 4: Dealing With Missing Values

Dealing with missing values is important, even if some methods in Pandas automatically exclude them.
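
To see what "automatically exclude" means in practice, here is a small sketch with a hypothetical `Series` that contains one missing value. By default `.mean()` skips it; passing `skipna=False` makes the missing value propagate into the result.

```python
import numpy as np
import pandas as pd

# Hypothetical Series with one missing value
values = pd.Series([1.0, np.nan, 3.0])

# Default behavior: the missing value is ignored
mean_default = values.mean()

# With skipna=False the result is itself missing (NaN)
mean_strict = values.mean(skipna=False)

print(mean_default)
print(mean_strict)
```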

1. Find the missing values of `df['gniPercap']` using the `.isna()` method. Check the [documentation](https://pandas.pydata.org/docs/reference/api/pandas.isnull.html) to see how it works.
2. Then, call `.sum()` on that output to see how many missing values we have in total.
3. Remove all missing values in the column using the `.dropna()` method. Check the [documentation](https://pandas.pydata.org/docs/reference/api/pandas.Series.dropna.html) to see how it works.

In [None]:
# YOUR CODE HERE


💡 **Tip**: There are hundreds of methods associated with `DataFrames` and `Series`. Don't memorize all of them. Instead, get used to new functions by reading documentation and examples!

## 🥊 Challenge 5: Categorical to Numeric Data

Recall that in our dataset, we have a 'continent' column that includes the values 'Asia', 'Europe', 'Africa', 'Americas', and 'Oceania'. Let's say that for some Machine Learning model we're building, we want to replace these string values with numbers that will serve as input. 

There are several ways to do this. Look at the [documentation](https://pandas.pydata.org/docs/reference/api/pandas.Series.html) and see if you can `replace` the strings with a corresponding number.

In [None]:
# YOUR CODE HERE


<a id="vis"></a>
# Visualizing DataFrames

We often want to look at our data visually. Fortunately, `pandas` also offers some basic plotting functions that can be useful in exploring a data set. In this section, we will cover two basic types of plots: histograms and scatter plots. See the [documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.html) for further information on plotting and plot customization.

### Histograms

A histogram shows the distribution of a variable using binned values. We can call this using the syntax: `df[column].plot(kind='hist')`.

💡 **Tip**: Use a histogram if you want to show distributions of continuous variables.

In [None]:
print('Plot A: 5 Bins')
# .plot() returns a Matplotlib Axes object, so we name the variable `ax`
ax = df['lifeExp'].plot(kind='hist', title='Histogram of life expectancy', bins=5)

Note the `bins` keyword argument when calling the histogram. It changes the number of "bins" or "buckets" in the histogram. Each bin is plotted as a bar whose height corresponds to how many data points are in that bin.
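
To get a feel for how binning works without drawing anything, NumPy's `np.histogram` does the same kind of equal-width binning and returns the counts directly. The values below are hypothetical and chosen so the counts are easy to verify by hand.

```python
import numpy as np

# Hypothetical data: ten values between 1 and 9
values = [1, 2, 2, 3, 3, 3, 4, 4, 5, 9]

# Two bins: the range 1 to 9 is split at 5
counts_2, edges_2 = np.histogram(values, bins=2)

# Four bins: the same range is split at 3, 5, and 7
counts_4, edges_4 = np.histogram(values, bins=4)

print(edges_2, counts_2)
print(edges_4, counts_4)
```

Every value lands in exactly one bin, so the counts always add up to the number of data points. More bins show more detail but leave fewer points in each bar.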

🔔 **Question**: Let's plot two more histograms with different numbers of bins. Which of the three plots would you pick, and why?

In [None]:
print('Plot B: 10 Bins')
ax = df['lifeExp'].plot(kind='hist', title='Histogram of life expectancy', bins=10)

In [None]:
print('Plot C: 20 Bins')
df['lifeExp'].plot(kind='hist', title='Histogram of life expectancy', bins=20)

###  Scatter Plots

Scatter plots visualize the relationship between different variables. We can create a scatter plot by specifying the columns to use for the `x` and `y` axes. 

In [None]:
# Scatter plot with life expectancy on the x-axis and GDP per capita on the y-axis
df.plot(kind='scatter', x='lifeExp', y='gdpPercap')

### Bar Plots

Bar plots show the relationship between a numeric and a categorical variable. Here, we use the "country" (categorical) and "lifeExp" (numeric) columns.

💡 **Tip**: Use a bar plot when you want to illustrate differences in frequencies of some category.

Let's retrieve the 10 data points with the lowest life expectancy in our data using `.sort_values()`, and then plot those data points in a bar plot.


In [None]:
# Sort by life expectancy (lowest first) and keep the first 10 rows
low_lifeExp = df.sort_values('lifeExp', ascending=True)[:10]

# Visualize with bar plot
low_lifeExp.plot.bar(x='country', y='lifeExp', figsize=(6,4));

🔔 **Question**: Do you notice any pattern in the data? What might be causing that pattern?

## 🥊 Challenge 6: Loops and Plots

Let's say you have a list of countries whose life expectancy you want to compare using a single line plot. We will create a function for this.

We have set up the list and function for you. Your goal is to:
1. Add three country names in the DataFrame to `country_list`.
2. Add two parameters to the function; one for a DataFrame, and one for the list of countries.
3. Within the function block, loop over the list of countries. 
4. Within the for-loop, create a subset of the DataFrame using a comparison operator that sets `country_data` to the subset of the country you are looping over in the list.
5. In the `label=` parameter of `plt.plot()`, fill in the loop variable name.


Run the cell when you're done: if you've succeeded, you should see a single line plot with life expectancy for all of the countries in `country_list`.

💡 **Tip**: If you have time left, try to add labels and title to the plot using  `plt.xlabel()`, `plt.ylabel()`, and `plt.title()`. See [this resource](https://www.w3schools.com/python/matplotlib_labels.asp) for more information!


In [None]:
# YOUR CODE HERE

country_list = [..., ..., ...]

def plot_life_expectancy(..., ...):
    for ... in ...:
        country_data = ...
        plt.plot(country_data['year'], country_data['lifeExp'], label=...)
    plt.legend()
    plt.show()

plot_life_expectancy(df, country_list)

<div class="alert alert-success">

## ❗ Key Points

* A `for` loop executes some statements once for each value in an iterable.
* `for` loops work on lists and other list-like structures, but also on other iterables such as strings!
* We typically use an accumulator variable to store some information we retrieve using a `for` loop.
* The `.describe()` method in Pandas summarizes numerical data in a dataset.
* We typically do not want to use for-loops in Pandas - instead, we use "vectorized" operations.
* The `.plot()` method in Pandas takes a `kind=` argument that determines what kind of plot it is - such as `scatter` or `hist`.
* A histogram shows the distribution of a variable using binned values.
* A scatterplot visualizes the relationship between different variables.

</div>