<sup>This notebook is adapted from https://github.com/data-8/data8assets and licensed for reuse under [Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0)](http://creativecommons.org/licenses/by-nc/4.0/).</sup>

# Exercise 7: Tables

Welcome to Exercise 7!  

In this exercise, we'll learn about *tables* -- Pandas `DataFrame` objects -- which let us work with multiple arrays of data about the same things.  

First, set up the imports by running the cell below.

In [None]:
import numpy as np
import pandas as pd

## 1. Introduction

For a collection of things in the world, an array is useful for describing a single attribute of each thing. For example, among the collection of US States, an array could describe the land area of each. Tables extend this idea by describing multiple attributes for each element of a collection.

In most data science applications, we have data about many entities, but we also have several kinds of data about each entity.

For example, in the cell below we have two arrays. The first one contains the world population in each year (as [estimated](http://www.census.gov/population/international/data/worldpop/table_population.php) by the US Census Bureau), and the second contains the years themselves (in order, so the first elements in the population and the years arrays correspond).

In [None]:
population_amounts = np.array(pd.read_csv("world_population.csv")["Population"])
years = np.arange(1950, 2015+1)
print("Population column:", population_amounts)
print("Years column:", years)

Suppose we want to answer this question:

> When did world population cross 6 billion?

You could technically answer this question just from staring at the arrays, but it's a bit convoluted, since you would have to count the position where the population first crossed 6 billion, then find the corresponding element in the years array. In cases like these, it might be easier to put the data into a *`DataFrame`*, a 2-dimensional type of dataset. 
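As a rough sketch of that array-only approach, the cell below uses two tiny made-up arrays (`toy_amounts` and `toy_years` are illustrative names, not part of the dataset) and finds the year in which the amount first goes above 100.

```python
import numpy as np

# Made-up values for illustration; the two arrays are in matching order.
toy_amounts = np.array([90, 95, 102, 110])
toy_years = np.array([2001, 2002, 2003, 2004])

# Comparing an array to a number gives an array of True/False values.
# np.argmax returns the position of the first True (if there is one).
first_position = np.argmax(toy_amounts > 100)

# Use that position to look up the corresponding element of the other array.
first_year = toy_years[first_position]
print(first_position, first_year)
```

It works, but you have to keep two separate arrays lined up yourself.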

The expression below:

- creates a table using the expression `pd.DataFrame()`,
- adds two columns by providing a dictionary 
    - with the keys as the strings `Population` and `Year`,
    - with the values as the variables `population_amounts` and `years`
- evaluates `population` so that we can see the table.

The strings `"Year"` and `"Population"` are column labels that we have chosen. The names `population_amounts` and `years` were assigned above to two arrays of the same length. You can find the documentation on how to create `DataFrame` objects [here](https://pandas.pydata.org/pandas-docs/stable/dsintro.html#dataframe).

In [None]:
population = pd.DataFrame({
    "Population": population_amounts,
    "Year": years
    }
)
population

Now the data are all together in a single table! It's much easier to parse this data--if you need to know what the population was in 1959, for example, you can tell from a single glance. We'll revisit this table later.

## 2. Creating Tables

**Question 2.1.** In the cell below, we've created 2 arrays. Using the steps above, assign `top_10_movies` to a table that has two columns called "Rating" and "Name", which hold `top_10_movie_ratings` and `top_10_movie_names` respectively.

In [None]:
top_10_movie_ratings = np.array([9.2, 9.2, 9., 8.9, 8.9, 8.9, 8.9, 8.9, 8.9, 8.8])
top_10_movie_names = np.array([
        'The Shawshank Redemption (1994)',
        'The Godfather (1972)',
        'The Godfather: Part II (1974)',
        'Pulp Fiction (1994)',
        "Schindler's List (1993)",
        'The Lord of the Rings: The Return of the King (2003)',
        '12 Angry Men (1957)',
        'The Dark Knight (2008)',
        'Il buono, il brutto, il cattivo (1966)',
        'The Lord of the Rings: The Fellowship of the Ring (2001)'])

top_10_movies = ...
# We've put this next line here so your table will get printed out when you
# run this cell.
top_10_movies

#### Loading a table from a file
In most cases, we aren't going to go through the trouble of typing in all the data manually. Instead, we can load it from a file using functions from `pandas` (imported above as `pd`).

`pd.read_csv` takes one required argument, a path to a data file (a string), and returns a `DataFrame`.  There are many formats for data files, but CSV ("comma-separated values") is one of the most common.
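Here is a minimal sketch of what `pd.read_csv` does with CSV text. To keep it self-contained, it reads from an in-memory string (via `io.StringIO`) instead of a file on disk; the data and the name `toy` are made up for illustration.

```python
import io
import pandas as pd

# A tiny made-up CSV: the first line holds the column labels,
# and each later line is one row with comma-separated values.
csv_text = "Name,Petals\nlotus,8\nrose,5\n"

# read_csv accepts a file path or a file-like object.
# StringIO stands in for a file here.
toy = pd.read_csv(io.StringIO(csv_text))
print(toy)
print(toy.shape)  # (number of rows, number of columns)
```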

**Question 2.2.** The file `imdb.csv` contains a table of information about the 250 highest-rated movies on IMDb.  Load it as a table called `imdb`.

In [None]:
imdb = pd.read_csv('imdb.csv')
imdb

Notice that the display is truncated.  This table is big enough that only a few of its rows are displayed (the first and last few, separated by a row of `...`), but the others are still there.  The summary line beneath the table reports the full size: 250 rows, one per movie.

Where did `imdb.csv` come from? Take a look at [this exercise's folder](./). You should see a file called `imdb.csv`.

Open up the `imdb.csv` file in that folder and look at the format. What do you notice? The `.csv` filename ending says that this file is in the [CSV (comma-separated value) format](http://edoceo.com/utilitas/csv-file-format).

## 3. Using lists

A *list* is another Python sequence type, similar to an array. It's different from an array because the values it contains can all have different types. A single list can contain `int` values, `float` values, and strings. Elements in a list can even be other lists! A list is created by giving a name to the list of values enclosed in square brackets and separated by commas. For example, `values_with_different_types = ['p', 4, ['ds', 'ex', 8]]`.

Lists can be useful when working with tables because they can describe the contents of one row in a table, which often corresponds to a sequence of values with different types. A list of lists can be used to describe multiple rows.

Each column in a table is a collection of values with the same type (an array). If you create a table column from a list, it will automatically be converted to an array. A row, on the other hand, mixes types.
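As a small sketch of these ideas, the cell below builds a table from a list of lists, one inner list per row. The name `toy_flowers` is made up for illustration, and the `columns` argument supplies the column labels.

```python
import pandas as pd

# One row as a list: mixing an int and a string is fine.
row = [8, 'lotus']

# A list of lists describes several rows.
rows = [[8, 'lotus'], [34, 'sunflower'], [5, 'rose']]
toy_flowers = pd.DataFrame(rows, columns=['Number of petals', 'Name'])

# Each column holds values of a single type, even though each row mixes types.
print(toy_flowers.dtypes)
```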

Here's a simple table about flowers. (Run the cell below.)

In [None]:
# Run this cell to recreate the table
import pandas as pd
import numpy as np
flowers = pd.DataFrame({
    'Number of petals': np.array([8, 34, 5]),
    'Name': np.array(['lotus', 'sunflower', 'rose'])
})
flowers

**Question 3.1.** Create a list that describes a new fourth row of this table. The details can be whatever you want, but the list must contain two values: the number of petals (an `int` value) and the name of the flower (a string). How about the "pondweed"? Its flowers have zero petals.

In [None]:
# Values are listed in the same order as the table's columns:
# number of petals first, then name.
my_flower = [0, 'pondweed']
my_flower

**Question 3.2.** `my_flower` fits right into the `flowers` table. Complete the cell below to add your flower as the fourth row of the table. You can use `.loc[ ... ]` to add one extra row: index with a row label that is not in the table yet, and assign a list of values (in this case `my_flower`), one per column, in column order.

In [None]:
# Use the indexer method .loc[<index>] to add a row to the flowers table

flowers.loc[ ... ] = ...
flowers
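For reference, here is the same `.loc` pattern on a separate made-up table (`toy` and its columns are illustrative only). Assigning to a row label that does not exist yet appends a new row.

```python
import pandas as pd

toy = pd.DataFrame({'Count': [3, 7], 'Label': ['a', 'b']})

# The existing row labels are 0 and 1, so label 2 is new.
# Assigning a list to it adds a row; the values follow the column order.
toy.loc[2] = [11, 'c']
print(toy)
```

Note that assigning to a label that already exists would overwrite that row instead of adding one.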

## 4. Analyzing datasets
With just a few table methods, we can answer some interesting questions about the IMDb dataset.

If we want just the ratings of the movies, we can select that column by its label:

In [None]:
imdb["Rating"]

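The value of that expression is a pandas `Series`: a labeled one-dimensional column that wraps an array like the one you'd get if you typed in `np.array([8.4, 8.3, 8.3, ... ])`. The array functions you've learned also work on a `Series`.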
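A small sketch of what column selection returns, using a made-up table (`toy` is illustrative only). Selecting a column gives a pandas `Series`, and `to_numpy()` gives the plain NumPy array inside it.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'Rating': [8.4, 8.3, 9.0]})

ratings = toy['Rating']             # a pandas Series
ratings_array = ratings.to_numpy()  # the NumPy array it wraps

print(type(ratings).__name__)
print(type(ratings_array).__name__)

# NumPy functions accept a Series as well as an array.
print(np.mean(ratings))
```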

**Question 4.1.** Find the rating of the highest-rated movie in the dataset.

*Hint:* Think back to the functions you've learned about for working with arrays of numbers.  Ask for help if you can't remember one that's useful for this.

In [None]:
highest_rating = ...
highest_rating

That's not very useful, though.  You'd probably want to know the *name* of the movie whose rating you found!  To do that, we can sort the entire table by rating, which ensures that the ratings and titles will stay together.

In [None]:
imdb.sort_values(by="Rating")

Well, that actually doesn't help much, either -- we sorted the movies from lowest -> highest ratings.  To look at the highest-rated movies, sort in reverse order:

In [None]:
imdb.sort_values(by='Rating', ascending=False)

(The `ascending=False` bit is called an *optional argument*. It has a default value of `True`, so when you explicitly tell the function `ascending=False`, then the function will sort in descending order.)

So there are actually 2 highest-rated movies in the dataset: *The Shawshank Redemption* and *The Godfather*.

Some details about sort:

1. The first argument to `sort_values` is the name of a column to sort by.
2. If the column has strings in it, `sort_values` will sort alphabetically; if the column has numbers, it will sort numerically.
3. The value of `imdb.sort_values(by="Rating")` is a *copy of `imdb`*; the `imdb` table doesn't get modified. For example, if we called `imdb.sort_values(by="Rating")`, then running `imdb` by itself would still return the unsorted table.
4. Rows always stick together when a table is sorted.  It wouldn't make sense to sort just one column and leave the other columns alone.  For example, in this case, if we sorted just the "Rating" column, the movies would all end up with the wrong ratings.
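The points above can be checked on a small made-up table (`toy`, `by_rating`, and `by_title` are illustrative names).

```python
import pandas as pd

toy = pd.DataFrame({
    'Title': ['B movie', 'A movie', 'C movie'],
    'Rating': [8.1, 9.0, 8.5],
})

# Numeric column, highest first.
by_rating = toy.sort_values(by='Rating', ascending=False)

# String column: sorted alphabetically.
by_title = toy.sort_values(by='Title')

print(by_rating)  # each title keeps its own rating
print(by_title)
print(toy)        # the original table is unchanged
```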

**Question 4.2.** Create a version of `imdb` that's sorted chronologically, with the earliest movies first.  Call it `imdb_by_year`.

In [None]:
imdb_by_year = ...
imdb_by_year

**Question 4.3.** What's the title of the earliest movie in the dataset?  You could just look this up from the output of the previous cell.  Instead, write Python code to find out.

*Hint:* Starting with `imdb_by_year`, extract the Title column, then use `.iloc[0]` to get its first item. (`.iloc` selects by position, so it still works after sorting has reordered the row labels.)

In [None]:
earliest_movie_title = ...
earliest_movie_title

## 5. Finding pieces of a dataset
Suppose you're interested in movies from the 1940s.  Sorting the table by year doesn't help you, because the 1940s are in the middle of the dataset.

Instead, we use the data frame method `query`.

In [None]:
forties = imdb.query("Decade==1940")
forties

Ignore the syntax for the moment.  Instead, try to read that line like this:

> Assign the name **`forties`** to a table whose rows are the rows in the **`imdb`** table where the **`'Decade'`**s **`are` `equal` `to` `1940`**.

**Question 5.1.** Compute the average rating of movies from the 1940s.

*Hint:* The function `np.average` computes the average of an array of numbers.

In [None]:
average_rating_in_forties = ...
average_rating_in_forties

Now let's dive into the details a bit more.  `query` takes 1 argument, a string with 2 parts:

1. The name of a column.  `query` finds rows where that column's values meet some criterion.
2. Something that describes the criterion that the column needs to meet, called a predicate.

To create our predicate, we used the comparison operator `==` with the value we wanted, 1940.  We'll see other predicates soon.

`query` returns a table that's a copy of the original table, but with only the rows that meet the given predicate.
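Here is a minimal sketch of `query` on a made-up table (`toy` and `subset` are illustrative names). It shows that the result holds only the matching rows and that the original table keeps all of its rows.

```python
import pandas as pd

toy = pd.DataFrame({
    'Title': ['W', 'X', 'Y', 'Z'],
    'Decade': [1940, 1950, 1940, 1990],
})

# Keep only the rows whose Decade value equals 1940.
subset = toy.query("Decade == 1940")
print(subset)

# The original table still has every row.
print(len(toy))
```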

**Question 5.2.** Create a table called `ninety_nine` containing the movies that came out in the year 1999.  Use `query`.

In [None]:
ninety_nine = ...
ninety_nine

So far we've only been finding where a column is *exactly* equal to a certain value. However, there are many other predicates.  Here are a few:

|Predicate|Example|Result|
|-|-|-|
|`<column> == n`|`Column_Name == 50`|Find rows with values equal to 50|
|`<column> != n`|`Column_Name != 50`|Find rows with values not equal to 50|
|`<column> > n`|`Column_Name > 50`|Find rows with values above (and not equal to) 50|
|`<column> >= n`|`Column_Name >= 50`|Find rows with values above 50 or equal to 50|
|`<column> < n`|`Column_Name < 50`|Find rows with values below 50|
|`m <= <column> < n`|`2 <= Column_Name < 10`|Find rows with values above or equal to 2 and below 10|

The `pandas` documentation section on [Indexing and Selecting Data](https://pandas.pydata.org/pandas-docs/stable/indexing.html) has more examples.
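To try a few of these predicates, here is a short sketch on a made-up single-column table (`toy` and the result names are illustrative only).

```python
import pandas as pd

toy = pd.DataFrame({'Value': [1, 2, 5, 10, 50, 75]})

not_fifty = toy.query("Value != 50")      # everything except 50
above_fifty = toy.query("Value > 50")     # strictly above 50
at_least_fifty = toy.query("Value >= 50") # 50 or above
in_range = toy.query("2 <= Value < 10")   # at least 2 and below 10

print(in_range)
```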


**Question 5.3.** Using `query` and one of the predicates from the table above, find all the movies with a rating higher than 8.5.  Put their data in a table called `really_highly_rated`.

In [None]:
really_highly_rated = ...
really_highly_rated

**Question 5.4.** Find the average rating for movies released in the 20th century and the average rating for movies released in the 21st century for the movies in `imdb`.

*Hint*: Think of the steps you need to do (take the average, find the ratings, find movies released in 20th/21st centuries), and try to put them in an order that makes sense.

In [None]:
average_20th_century_rating = ...
average_21st_century_rating = ...
print("Average 20th century rating:", average_20th_century_rating)
print("Average 21st century rating:", average_21st_century_rating)

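The property `shape` gives the dimensions of a table as a pair: the number of rows, then the number of columns.  So `shape[0]` tells you how many rows are in a table.  (A "property" is just a method that doesn't need to be called by adding parentheses.  Be careful with the similar-looking property `size`, which counts every cell: rows times columns.)

In [None]:
num_movies_in_dataset = imdb.shape[0]
num_movies_in_dataset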
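To see how the row count relates to the other size-related properties, here is a quick check on a made-up table with 3 rows and 2 columns (`toy` is an illustrative name).

```python
import pandas as pd

toy = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

print(toy.shape)     # (rows, columns)
print(toy.shape[0])  # number of rows
print(len(toy))      # also the number of rows
print(toy.size)      # total number of cells: rows * columns
```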

**Question 5.5.** Use `shape[0]`, the number of rows in a table, (and arithmetic) to find the *proportion* of movies in the dataset that were released in the 20th century, and the proportion from the 21st century.

*Hint:* The *proportion* of movies released in the 20th century is the *number* of movies released in the 20th century, divided by the *total number* of movies.

In [None]:
proportion_in_20th_century = ...
proportion_in_21st_century = ...
print("Proportion in 20th century:", proportion_in_20th_century)
print("Proportion in 21st century:", proportion_in_21st_century)

**Question 5.6.** Here's a challenge: Find the number of movies that came out in *even* years.

*Hint:* The operator `%` computes the remainder when dividing by a number.  So `5 % 2` is 1 and `6 % 2` is 0.  A number is even if the remainder is 0 when you divide by 2.

*Hint 2:* `%` can be used on arrays, operating elementwise like `+` or `*`.  So `np.array([5, 6, 7]) % 2` is `array([1, 0, 1])`.

*Hint 3:* Create a column called "Year_Remainder" that's the remainder when each movie's release year is divided by 2.  (The underscore avoids a space in the label, which `query` would not accept in a plain column name.)  Make a copy of `imdb` that includes that column.  Then use `query` to find rows where that new column is equal to 0.  Then use `shape[0]` to count the number of such rows.
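One possible way to make a copy with an extra computed column is `assign`. The sketch below uses a made-up table and a different calculation (rounding each year down to its decade), so `toy`, `with_decade`, and the new column are illustrative only.

```python
import pandas as pd

toy = pd.DataFrame({'Year': [1999, 2000, 2003]})

# Arithmetic on a column works elementwise, like on an array.
# assign returns a copy with the new column; `toy` itself is not changed.
with_decade = toy.assign(Decade=toy['Year'] // 10 * 10)
print(with_decade)

# The new column can be used in a query, and shape[0] counts the matching rows.
num_in_2000s = with_decade.query("Decade == 2000").shape[0]
print(num_in_2000s)
```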

In [None]:
num_even_year_movies = ...
num_even_year_movies

**Question 5.7.** Check out the `population` table from the introduction to this exercise.  Compute the year when the world population first went above 6 billion.

In [None]:
year_population_crossed_6_billion = ...
year_population_crossed_6_billion

## 6. Other useful methods

There are a lot more `pandas` methods that may be useful, and you can read about them in the `pandas` documentation online.

In particular, these sections may be of use:
* [Working with missing data](https://pandas.pydata.org/pandas-docs/stable/missing_data.html)
* [Working with text data](https://pandas.pydata.org/pandas-docs/stable/text.html) in Pandas
* [Grouping data](https://pandas.pydata.org/pandas-docs/stable/groupby.html)
* [Merge, join, and concatenate](https://pandas.pydata.org/pandas-docs/stable/merging.html) on `DataFrames`
* **Automatically** generate summary statistics on a `DataFrame` with [describe()](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.describe.html) and [info()](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.info.html)
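As a brief sketch of the last item, here are `describe()` and `info()` on a small made-up table (`toy` and `summary` are illustrative names).

```python
import pandas as pd

toy = pd.DataFrame({'Rating': [8.0, 9.0, 10.0], 'Name': ['x', 'y', 'z']})

# describe() returns a table of summary statistics.
# By default it covers only the numeric columns of a mixed table.
summary = toy.describe()
print(summary)

# info() prints the column labels, non-null counts, and column types.
toy.info()
```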

Alright! You're finished with Exercise 7!

If you are running this notebook using Binder, choose **Save and Checkpoint** from the **File** menu, **rename** your notebook to add a hyphen and your initials to the notebook name e.g. `Ex07_Tables-DJ`, then choose **Download as Notebook** and save it to your computer or USB stick.

If you are running this notebook on your own machine, choose **Save and Checkpoint** from the **File** menu, choose **Make a copy** from the **File** menu, then **rename** your notebook to add a hyphen and your initials to the notebook name e.g. rename from `Ex07_Tables-Copy1` to `Ex07_Tables-DJ`.