# Data 80A/180A Data Science for Everyone


# Lab 3: Data Types, Arrays, Tables, and Visualization
#### Today's lab

Welcome to Lab 3!  In today's lab, you'll learn about:

1. data types 
2. arrays 
3. tables 
4. visualization

Reading:
 * [Chapter 4: Data Types](https://www.inferentialthinking.com/chapters/04/Data_Types.html)
 * [Chapter 5: Sequences](https://www.inferentialthinking.com/chapters/05/Sequences.html) 
 * [Chapter 6: Tables](https://www.inferentialthinking.com/chapters/06/Tables.html)
 * [Chapter 7: Visualization](https://www.inferentialthinking.com/chapters/07/Visualization.html)

First, set up the tests and imports by running the cell below.

In [None]:
# Just run this cell
import numpy as np
import math
from datascience import *

import otter
grader = otter.Notebook()

## 1. Strings
Text is one of the most common data types used in computer programs. A piece of text, that is, a sequence of characters, is called a **string** in Python. A string might contain a single character, a word, a sentence, or a whole book.

To distinguish text data from actual code, we demarcate strings by putting quotation marks around them. Single quotes (`'`) and double quotes (`"`) are both valid, but the types of opening and closing quotation marks must match. The contents can be any sequence of characters, including numbers and symbols. 
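As a small sketch of the quoting rules above (the names and strings here are made up for illustration), either kind of quotation mark produces the same kind of value, and a string delimited by one kind of quote can contain the other kind:

```python
# Hypothetical example strings, not part of the lab's data.
single = 'data science'
double = "data science"

# The two kinds of quotation marks produce the same string value.
print(single == double)

# A string delimited by one kind of quote can contain the other kind.
apostrophe = "It's a string"
quoted = 'She said "hello"'
print(apostrophe)
print(quoted)
```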

We've seen strings before in `print` statements.  Below, three different strings are passed as arguments to the `print` function.

In [None]:
print("I love", 'Data Science', 'and Data 80.')

Any name can be assigned to any string.

In [None]:
one = 'two'
plus = '+'
print(one, plus, one)

**Question 1.1.** Yuri Gagarin was the first person to travel through outer space.  When he emerged from his capsule upon landing on Earth, he [reportedly](https://en.wikiquote.org/wiki/Yuri_Gagarin) had the following conversation with a woman and girl who saw the landing:

    The woman asked: "Can it be that you have come from outer space?"
    Gagarin replied: "As a matter of fact, I have!"

The cell below contains unfinished code.  Fill in the `...`s so that it prints out this conversation *exactly* as it appears above.

In [None]:
woman_asking = ...
woman_quote = '"Can it be that you have come from outer space?"'
gagarin_reply = 'Gagarin replied:'
gagarin_quote =  ... 

print(woman_asking, woman_quote)
print(gagarin_reply, gagarin_quote)

In [None]:
grader.check("q11")  

## 1.1. String Methods

We can perform operations on strings using **methods**. Recall that methods and functions are not technically the same thing, but we'll use the two terms interchangeably in this course.

Here's a sketch of how to call methods on a string:

    <expression that evaluates to a string>.<method name>(<argument>, <argument>, ...)
    
One example of a string method is `replace`, which replaces all instances of some part of the original string (or a *substring*) with a new string. 

    <original string>.replace(<old substring>, <new substring>)
    
`replace` returns (evaluates to) a new string, leaving the original string unchanged.
    
Try to predict the output of this example, then run the cell!

In [None]:
# Replace one letter
hello = 'Hello'
print(hello.replace('o', 'a'), hello)

You can also call a method on the result of another method call:

In [None]:
# Calling replace on the output of another call to replace
'train'.replace('t', 'ing').replace('in', 'de')

Here's a picture of how Python evaluates a "chained" method call like that:

<img src="chaining_method_calls.png"/>
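The same chained call can also be written one step at a time, which may make the evaluation order easier to follow. This is only a sketch; the names `step_one`, `step_two`, and `chained` are made up for illustration.

```python
# Each call to replace returns a new string; the next call operates on that result.
step_one = 'train'.replace('t', 'ing')   # replaces the single 't'
step_two = step_one.replace('in', 'de')  # replaces every 'in' in the intermediate result

# The chained version from the cell above does both steps in one expression.
chained = 'train'.replace('t', 'ing').replace('in', 'de')

print(step_one)
print(step_two)
print(step_two == chained)
```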

**Question 1.1.1.** Use `replace` to transform the string `'hitchhiker'` into `'matchmaker'`. Assign your result to `new_word`.

In [None]:
new_word = ...
new_word

In [None]:
grader.check("q111")

There are many more string methods in Python, but most programmers don't memorize their names or how to use them.  In the "real world," people usually just search the internet for documentation and examples. A complete [list of string methods](https://docs.python.org/3/library/stdtypes.html#string-methods) appears in the Python language documentation. [Stack Overflow](http://stackoverflow.com) has a huge database of answered questions that often demonstrate how to use these methods to achieve various ends.

## 1.2. Converting Numbers to Strings

Strings and numbers are different *types* of values, even when a string contains the digits of a number. For example, evaluating the following cell causes an error because an integer cannot be added to a string.

In [None]:
# In this expression 8 is a number and "8" is a string.
# Running this code will produce an error.
8 + "8"

However, there are built-in functions to convert numbers to strings and strings to numbers. Some of these built-in functions have restrictions on the type of argument they take:

|Function |Description|
|-|-|
|`int`|Converts a string of digits or a float to an integer ("int") value|
|`float`|Converts a string of digits (perhaps with a decimal point) or an int to a decimal ("float") value|
|`str`|Converts any value to a string|
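Here is a short sketch of the three conversion functions in the table, using arbitrary values and made-up names. Note in particular that `int` drops the fractional part of a float rather than rounding it.

```python
# Converting between numbers and strings (illustrative values only).
as_int = int("12") + 1        # the string "12" becomes the integer 12
as_float = float("2.5") * 2   # the string "2.5" becomes the float 2.5
truncated = int(3.9)          # int drops the fractional part; it does not round
joined = str(8) + "8"         # str(8) is the string "8", so + joins two strings
added = 8 + int("8")          # int("8") is the number 8, so + adds two numbers

print(as_int, as_float, truncated, joined, added)
```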

Try to predict what data type and value `example` evaluates to, then run the cell.

In [None]:
example = 8 + int("10") + float("8")

print(example)
print("This example returned a " + str(type(example)) + "!")

## 1.3. Passing Strings to Functions

String values can be arguments to functions and can be returned by functions. 

The function `len` (derived from the word "length") takes a single string as its argument and returns the number of characters (including spaces) in the string.

Note that it doesn't count *words*. `len("one small step for man")` evaluates to 22, not 5.
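The difference between counting characters and counting words can be sketched as follows. The string method `split` is not covered in this lab; it is used here only as one possible way to break a string into words.

```python
phrase = "one small step for man"

# len counts every character, including the four spaces.
num_characters = len(phrase)

# To count words instead, split the string on whitespace and count the pieces.
num_words = len(phrase.split())

print(num_characters, num_words)
```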

**Question 1.3.1.**  Use `len` to find the number of characters in the long string in the next cell.  Characters include things like spaces and punctuation. Assign `sentence_length` to that number.

(The string is the first sentence of the English translation of the French [Declaration of the Rights of Man](http://avalon.law.yale.edu/18th_century/rightsof.asp).)  

In [None]:
a_very_long_sentence = "The representatives of the French people, organized as a National Assembly, believing that the ignorance, neglect, or contempt of the rights of man are the sole cause of public calamities and of the corruption of governments, have determined to set forth in a solemn declaration the natural, unalienable, and sacred rights of man, in order that this declaration, being constantly before all the members of the Social body, shall remind them continually of their rights and duties; in order that the acts of the legislative power, as well as those of the executive power, may be compared at any moment with the objects and purposes of all political institutions and may thus be more respected, and, lastly, in order that the grievances of the citizens, based hereafter upon simple and incontestable principles, shall tend to the maintenance of the constitution and redound to the happiness of all."
sentence_length = ...
sentence_length

In [None]:
grader.check("q131")

## 2. Arrays

Arrays allow us to put many values in one place so that we can operate on them as a group. An array is a sequence of values of the same type.

## 2.1. Making Arrays

###  `make_array`
One way to create an array is to call the function `make_array`.  Each argument you pass to `make_array` will be in the array it returns.

In [None]:
make_array(0.125, 4.75, -1.3)

Each value in an array (in the above case, the numbers 0.125, 4.75, and -1.3) is called an *element* of that array.

Arrays themselves are also values, just like numbers and strings.  That means you can assign them to names or use them as arguments to functions. For example, `len(<some_array>)` returns the number of elements in `some_array`.

In [None]:
my_array = make_array(1, 2, 3, 4)
len(my_array)

**Question 2.1.1.** Make an array containing the numbers 0, 1, -1, $\pi$, and $e$, in that order.  Name it `interesting_numbers`.  

*Hint:* How did you get the values $\pi$ and $e$ in Lab 2?  You can refer to them in exactly the same way here.

In [None]:
interesting_numbers = ...
interesting_numbers

In [None]:
grader.check("q211") 

**Question 2.1.2.** Make an array containing the five strings `"Hello"`, `","`, `" "`, `"world"`, and `"!"`.  (The third one is a single space inside quotes.)  Name it `hello_world_components`.

*Note:* If you evaluate `hello_world_components`, you'll notice some extra information in addition to its contents: `dtype='<U5'`.  That's just NumPy's extremely cryptic way of saying that the elements of the array are strings of at most 5 characters.

In [None]:
hello_world_components = ...
hello_world_components

In [None]:
grader.check("q212") 
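To see where a label like `<U5` comes from, here is a small sketch on made-up strings. It uses NumPy's own `np.array` function rather than `make_array`, on the assumption that both produce the same kind of array. The `U` stands for Unicode text, and the number is the maximum string length the array holds, which here is the length of the longest string.

```python
import numpy as np

# Made-up strings; the longest one ("abcde") has 5 characters.
short_words = np.array(["ab", "abcde"])
print(short_words.dtype)

# Including a longer string changes the number in the dtype.
longer_words = np.array(["ab", "abcdefgh"])
print(longer_words.dtype)
```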

###  `np.arange`
Arrays are provided by a package called [NumPy](http://www.numpy.org/) (pronounced "NUM-pie"). The package is called `numpy`, but it's standard to rename it `np` for brevity.  We did that in the first code cell with:

    import numpy as np

Very often in data science, we want to work with many numbers that are evenly spaced within some range.  NumPy provides a special function for this called `arange`.  The line of code `np.arange(start, stop, step)` evaluates to an array with all the numbers starting at `start` and counting up by `step`, stopping **before** `stop` is reached.

Run the following cells to see some examples!

In [None]:
# This array starts at 4, counts up by 1, and doesn't contain 9
# because np.arange stops *before* the stop value is reached
np.arange(4, 9, 1)

In [None]:
# This array starts at 1 and counts up by 2
# and then stops before 6
np.arange(1, 6, 2)
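A few more variations, sketched with arbitrary numbers: the step does not have to be a whole number, and it can be negative to count down. In each case the `stop` value itself is left out.

```python
import numpy as np

# Counting up by a fractional step: 1 is the stop value, so it is excluded.
quarters = np.arange(0, 1, 0.25)
print(quarters)

# Counting down with a negative step: the array stops before reaching 0.
countdown = np.arange(10, 0, -3)
print(countdown)
```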

**Question 2.1.3.** Use `np.arange` to create an array with 5 numbers starting at 100.  (So its elements are 100, 101, 102, 103, and 104.)

In [None]:
import numpy as np
my_array = ...
my_array

In [None]:
grader.check("q213")

## 2.2. Working with a Single Element of an Array ("Indexing")
Let's work with a more interesting dataset.  Rather than typing in the data manually, we'll load it from a file on your computer called `world_population.csv`.  (You'll learn how to create a table manually later in this lab!) Run the first cell below to load and display the table.

The second cell creates an array called `population_amounts` that includes estimated world populations in every year from **1950** to **2015**.  (The estimates come from the [US Census Bureau](https://www.census.gov/data-tools/demo/idb).)

In [None]:
world_population = Table.read_table("world_population.csv")
world_population

In [None]:
population_amounts = Table.read_table("world_population.csv").column("Population")
population_amounts

Notice from above that a column of a table is an array (i.e., a sequence of elements). 

Here's how we get the first element of `population_amounts`, which is the world population in the first year in the dataset, 1950.

In [None]:
# Read the first element of the array population_amounts
population_amounts.item(0)

The value of that expression is the number 2557628654 (around 2.5 billion), because that's the first thing in the array `population_amounts`.

Notice that we wrote `.item(0)`, not `.item(1)`, to get the first element.  This is a weird convention in computer science.  0 is called the *index* of the first item.  It's the number of elements that appear *before* that item.  So 3 is the index of the 4th item.
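The indexing convention can be sketched on a tiny made-up array. The array is built with `np.array`, which is assumed here to behave like the arrays used elsewhere in this lab.

```python
import numpy as np

# A made-up array of four values, used only to illustrate indexing.
values = np.array([10, 20, 30, 40])

first = values.item(0)               # index 0: no elements come before it
fourth = values.item(3)              # index 3: three elements come before it
last = values.item(len(values) - 1)  # the last index is the length minus 1

print(first, fourth, last)
```

Because the array has 4 elements, its valid indices are 0 through 3.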

Here are some more examples.  In the examples, we've given names to the things we get out of `population_amounts`.  Read and run each cell.

In [None]:
# The 13th element in the array is the population
# in 1962 (which is 1950 + 12).
population_1962 = population_amounts.item(12)
population_1962

In [None]:
# The 66th element is the population in 2015.
population_2015 = population_amounts.item(65)
population_2015

In [None]:
# The array has only 66 elements, so this doesn't work.
# (There's no element with 66 other elements before it.)
population_2016 = population_amounts.item(66)
population_2016

Since `make_array` returns an array, we can call `.item(3)` on its output to get its 4th element, just like we "chained" together calls to the method `replace` earlier.

In [None]:
make_array(-1, -3, 4, -2).item(3)

**Question 2.2.1.** Set `population_1973` to the world population in 1973, by getting the appropriate element from `population_amounts` using `item`.

In [None]:
population_1973 = ...
population_1973

In [None]:
grader.check("q221")

## 2.3. Doing Something to Every Element of an Array
Arrays are primarily useful for doing the same operation many times, so we don't often have to use `.item` and work with single elements. 

For example, you can calculate tips on several restaurant bills at once (in this case 3 bills with tip being 20%):


In [None]:
restaurant_bills = make_array(20.12, 39.90, 31.01)
print("Restaurant bills:\t", restaurant_bills)

# Array multiplication
tips = .2 * restaurant_bills
print("Tips:\t\t\t", tips)

<img src="array_multiplication.jpg">
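Other arithmetic works the same way. As a sketch with made-up temperatures (not data from this lab), each operation below is applied to every element, and arithmetic between two arrays of the same length pairs up elements by index.

```python
import numpy as np

# Made-up temperatures in degrees Celsius.
celsius = np.array([0.0, 25.0, 100.0])

# Multiplication, division, and addition are each applied to every element.
fahrenheit = celsius * 9 / 5 + 32
print(fahrenheit)

# Subtracting two arrays of the same length pairs up elements by index.
difference = fahrenheit - celsius
print(difference)
```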

**Question 2.3.1.** Suppose the total charge at a restaurant is the original bill plus the tip. If the tip is 20%, that means we can multiply the original bill by 1.2 to get the total charge.  Compute the total charge for each bill in `restaurant_bills`, and assign the resulting array to `total_charges`.

In [None]:
total_charges =  ...
total_charges

In [None]:
grader.check("q231")

The array `more_restaurant_bills` contains 100,000 bills!  Let's compute the total charge for each one.  Notice how the total charge computation code stays exactly the same. 


In [None]:
# Just run this cell
more_restaurant_bills = Table.read_table("more_restaurant_bills.csv").column("Bill")
more_total_charges = 1.2 * more_restaurant_bills
more_total_charges

The function `sum` takes a single array of numbers as its argument.  It returns the sum of all the numbers in that array (so it returns a single number, not an array).  For example, the following cell computes the sum of all the bills in `more_restaurant_bills`, *including tips*.


In [None]:
more_sum_of_bills = sum(more_total_charges)
more_sum_of_bills

**Question 2.3.2.**  What is the sum of all the bills in `restaurant_bills`, *including tips*?

In [None]:
sum_of_bills = ...
sum_of_bills

In [None]:
grader.check("q232") 

## 3. Creating Tables

In the cell below we have two arrays. The first one, `population_amounts`, contains the world population in each year (estimated by the US Census Bureau). The second array, `years`, contains the years themselves. These elements are in order, so the year and the world population for that year have the same index in their corresponding arrays.

In [None]:
years = np.arange(1950, 2015+1, 1)
print("Population column:", population_amounts)
print()
print("Years column:", years)

Suppose we want to answer this question:

> In which year did the world's population cross 6 billion?

You could technically answer this question just from staring at the arrays, but it's a bit convoluted, since you would have to count the position where the population first crossed 6 billion, then find the corresponding element in the years array. In cases like these, it might be easier to put the data into a *`Table`*, a 2-dimensional type of dataset. 
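The manual lookup described above can be sketched in code using only arrays. The numbers below are made up (they are not the Census Bureau estimates), and `np.argmax` is a NumPy function not otherwise used in this lab. Applied to an array of True/False values, it returns the index of the first True.

```python
import numpy as np

# Made-up example data: two arrays of the same length, lined up by index.
example_years = np.arange(2000, 2005, 1)
example_counts = np.array([40, 55, 90, 120, 150])

# Compare every element to the threshold, giving an array of True/False values.
crossed = example_counts >= 100

# Position of the first True value, i.e. the first count at or above 100.
first_index = int(np.argmax(crossed))

# The same index picks out the matching year in the other array.
first_year = example_years.item(first_index)
print(first_index, first_year)
```

If no value reached the threshold, `np.argmax` would return 0, so this sketch assumes that at least one value does.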

The expression below:

- creates an empty table using the expression `Table()`,
- adds two columns by calling `with_columns` with four arguments: `"Population"`, `population_amounts`, `"Year"`, and `years`,
- assigns the result to the name `population`, and finally
- evaluates `population` so that we can see the table.

The strings `"Year"` and `"Population"` are column labels that we have chosen. The names `population_amounts` and `years` were assigned above to two arrays of the **same length**. The function `with_columns` (you can find the documentation [here](http://data8.org/datascience/tables.html)) takes in alternating strings (to represent column labels) and arrays (representing the data in those columns). The strings and arrays are separated by commas.

In [None]:
population = Table().with_columns(
    "Population", population_amounts,
    "Year", years
)
population

Now the data is combined into a single table! It's much easier to parse this data. If you need to know what the population was in 1959, for example, you can tell from a single glance.

**Question 3.1.** In the cell below, we've created 2 arrays. Using the steps above, assign `top_10_movies` to a table that has two columns called "Rating" and "Name", which hold `top_10_movie_ratings` and `top_10_movie_names` respectively.

In [None]:
top_10_movie_ratings = make_array(9.2, 9.2, 9., 8.9, 8.9, 8.9, 8.9, 8.9, 8.9, 8.8)
top_10_movie_names = make_array(
        'The Shawshank Redemption (1994)',
        'The Godfather (1972)',
        'The Godfather: Part II (1974)',
        'Pulp Fiction (1994)',
        "Schindler's List (1993)",
        'The Lord of the Rings: The Return of the King (2003)',
        '12 Angry Men (1957)',
        'The Dark Knight (2008)',
        'Il buono, il brutto, il cattivo (1966)',
        'The Lord of the Rings: The Fellowship of the Ring (2001)')

# Create a Table with two columns: the first column Rating should contain the array top_10_movie_ratings
# while the second column Name should contain the array top_10_movie_names
top_10_movies =  ...

# print table to view
top_10_movies

In [None]:
grader.check("q31")

#### Loading a table from a file

In most cases, we aren't going to go through the trouble of typing in all the data manually. Instead, we load the data from an external source, like a data file. There are many formats for data files, but CSV ("comma-separated values") is one of the most common.

`Table.read_table(...)` takes one argument (a path to a data file in string format) and returns a table.  

We've already loaded several CSV files in previous examples.  The file `imdb.csv` contains a table of information about the 250 highest-rated movies on IMDb.  The cell below loads it as a table called `imdb`.

In [None]:
imdb = Table.read_table('imdb.csv')
imdb


Where did `imdb.csv` come from? Take a look at [this lab's folder](./). You should see a file called `imdb.csv`.

Open up the `imdb.csv` file in that folder and look at the format. What do you notice? The `.csv` filename ending says that this file is in the [CSV (comma-separated value) format](http://edoceo.com/utilitas/csv-file-format).
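To make the format concrete, here is a minimal sketch that parses a few lines of CSV text with Python's standard `csv` module instead of `Table.read_table`. The column labels and rows are invented for illustration and are not the contents of `imdb.csv`.

```python
import csv
import io

# Made-up CSV text: the first line holds column labels, and each later line is one row.
csv_text = "Title,Year,Rating\nExample Movie,1999,8.5\nAnother Movie,2005,8.1\n"

# csv.reader splits each line on commas; io.StringIO lets it read from a string.
rows = list(csv.reader(io.StringIO(csv_text)))

header = rows[0]
data_rows = rows[1:]
print(header)
print(data_rows)
```

Every value comes back as a string in this sketch, so numeric columns would still need converting with `int` or `float`.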

## 4. More Table Operations!

Now that you've worked with arrays, let's add a few more methods to the list of table operations that you saw in Lab 2.

### `column`

`column` takes the column name of a table (in string format) as its argument and returns the values in that column as an **array**. 

In [None]:
# Returns an array of movie names
top_10_movies.column('Name')

### `take`
The table method `take` takes as its argument an array of numbers.  Each number should be the index of a row in the table.  It returns a **new table** with only those rows. 

You'll usually want to use `take` in conjunction with `np.arange` to take the first few rows of a table.

In [None]:
# Take first 5 movies of top_10_movies
top_10_movies.take(np.arange(0, 5, 1))

Let's practice combining the operations we've learned in this lab and those from the previous one to answer questions about the `population` and `imdb` tables. First, let's look at the `population` table from Section 3.

In [None]:
# Run this cell to display the population table.
population

Next, compute the year when the world population first went above 6 billion. Assign the year to `year_population_crossed_6_billion`.


In [None]:
year_population_crossed_6_billion = population.where('Population', are.above_or_equal_to(6*10**9)).column('Year').item(0)
year_population_crossed_6_billion

Make sure you understand the above code.

Finally, let's find the average rating for movies released before the year 2000 and the average rating for movies released in the year 2000 or after for the movies in `imdb`.

It helps to think through the steps you need: find the movies released before 2000 (or in 2000 and after), get their ratings, and then take the average.  Try to put them in an order that makes sense.  Again, make sure you understand the code.

In [None]:
before_2000 = np.mean(imdb.where('Year', are.below(2000)).column('Rating')) 
after_or_in_2000 = np.mean(imdb.where('Year', are.above_or_equal_to(2000)).column('Rating'))

print("Average before 2000 rating:", before_2000)
print("Average after or in 2000 rating:", after_or_in_2000)
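The same filter-then-average steps can be sketched with plain NumPy arrays. The data below is made up, and indexing an array with an array of True/False values (which keeps only the elements marked True) is a NumPy feature not otherwise covered in this lab.

```python
import numpy as np

# Made-up release years and ratings, lined up by index.
example_years = np.array([1990, 1995, 2001, 2010])
example_ratings = np.array([8.0, 9.0, 8.5, 7.5])

# Step 1: mark which movies were released before 2000.
is_before_2000 = example_years < 2000

# Step 2: keep only the ratings of those movies.
ratings_before_2000 = example_ratings[is_before_2000]

# Step 3: average them.
mean_before_2000 = np.mean(ratings_before_2000)
print(mean_before_2000)
```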

## 5. Visualization

Let's plot some of the data from `imdb.csv`. 

In [None]:
# These lines set up graphing capabilities.
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')

In [None]:
# Let's display the table
imdb

In [None]:
# Let's sort the table by Rating, showing the max ratings at the top
sorted_imdb  = imdb.sort('Rating', descending=True) 
sorted_imdb

In [None]:
# Let's look at the relationship between the number of Votes and Rating with a scatter plot
imdb.scatter('Rating', 'Votes')

Note: the (implicit) unit for the y-axis *Votes* is *millions*. Notice that the two dots at a Rating of 9.2 correspond to the top two rows of the `sorted_imdb` table!

In [None]:
# Top rated movies
# Let's visualize the rating of each movie with a bar plot, for only the movies with a rating greater than 8.8
high_ratings = imdb.where('Rating', are.above(8.8))
high_ratings.barh('Title', 'Rating')

**Question 5.1.**  Make a histogram to visualize the distribution of ratings in the `imdb` table.
Figure 5.1 below shows what the plot should look like.
[See Section 7.2.](https://www.inferentialthinking.com/chapters/07/2/Visualizing_Numerical_Distributions.html)

The y-axis, Count, shows the number of movies whose Rating falls in the corresponding bin on the x-axis.
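To illustrate what the counts in a histogram mean, here is a sketch that tallies made-up ratings into bins with `np.histogram`, a NumPy function not otherwise used in this lab. It does not draw a plot and is not the solution to Question 5.1.

```python
import numpy as np

# Made-up ratings, used only to show how a histogram counts values.
example_ratings = np.array([8.1, 8.1, 8.3, 8.6, 8.9, 9.2])

# Bin edges describing three bins: 8.0 to 8.5, 8.5 to 9.0, and 9.0 to 9.5.
bin_edges = np.array([8.0, 8.5, 9.0, 9.5])

# np.histogram returns the count in each bin along with the bin edges.
counts, edges = np.histogram(example_ratings, bins=bin_edges)
print(counts)
```

Each count would be the height of one bar in the corresponding histogram.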

In [None]:
# Write your solution here
# Your plot should match Figure 5.1 below (or very similar)
# In your histogram plot, set the parameter "normed=False"
...

Figure 5.1
<img src="q5.1.png">

Congratulations, you're done with Lab 3!  

Be sure to:
- **run** all the tests,
- **save** your notebook and **download** a PDF version of it,
- **submit** your work to Canvas,
- and ask one of the lab instructors to **check you off**.