# Lists and arrays

Elements of Data Science

by [Allen Downey](https://allendowney.com)

[MIT License](https://opensource.org/licenses/MIT)

## Tuples

In the previous notebook, we used a tuple to represent a latitude and longitude, and I said that a tuple is a sequence of elements.  In the case of latitude and longitude, the sequence only contained two elements, and they were both floating-point numbers.

But in general a tuple can contain any number of elements, and the elements can be values of any type.

The following is a tuple of three integers:

In [24]:
1, 2, 3

Notice that when Python displays a tuple, it puts the elements in parentheses.

When you type a tuple, you can put it in parentheses if you think it is easier to read that way, but you don't have to.

In [25]:
(1, 2, 3)

The elements can be any type.  Here's a tuple of strings:

In [26]:
'Data', 'Science'

And the elements don't have to be the same type.  Here's a tuple with a string, an integer, and a floating-point number.

In [27]:
'one', 2, 3.14159 

If you have a string, you can convert it to a tuple using the `tuple` function:

In [28]:
tuple('DataScience')

The result is a tuple of single-character strings.

**Exercise:** When you create a tuple, the parentheses are optional, but the commas are required.  So how do you think you create a tuple with a single element?  You might be tempted to write:

In [9]:
x = (5)
x

But you will find that the result is just a number, not a tuple.

In [10]:
type(x)

To make a tuple with a single element, you need a comma:

In [11]:
t = 5,
t

In [12]:
type(t)

## Lists

Python provides another way to store a sequence of elements, a list.

To create a list, you put a sequence of elements in square brackets.

In [31]:
[1, 2, 3]

Lists and tuples are very similar.  They can contain any number of elements, the elements can be any type, and the elements don't have to be the same type.

The main difference is that you can modify a list, but tuples are immutable; that is, they cannot be changed after they are created.  This difference will matter later, but for now we can ignore it.
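If you are curious what "immutable" means in practice, here is a minimal sketch.  It uses bracket indexing, `append`, and `try`/`except`, none of which we have covered yet, and the values are made up for illustration.

```python
# A list can be changed in place; a tuple cannot.
prices = [9.99, 7.99, 7.49]       # hypothetical example list
prices[0] = 10.99                 # replace the first element
prices.append(4.99)               # add an element at the end

point = (42.36, -71.06)           # hypothetical latitude/longitude tuple
try:
    point[0] = 51.51              # tuples do not support item assignment
except TypeError as err:
    error_message = str(err)

print(prices)
print(error_message)
```

The list ends up with the new values; the attempt to change the tuple raises a `TypeError` and leaves the tuple as it was.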

When you make a list, the brackets are required, but if there is a single element, you don't need a comma.  So you can make a list like this:

In [32]:
single = [5]

In [33]:
type(single)

It is also possible to make a list with no elements, like this:

In [34]:
empty = []

In [35]:
type(empty)

The `len` function computes the length (number of elements) of a list or tuple.

In [36]:
len([1, 2, 3])

In [37]:
len(single)

In [38]:
len(empty)

**Exercise:** Create a list with 4 elements; then use `type` to confirm that it's a list, and `len` to confirm that it has 4 elements.

In [16]:
# Solution goes here

In [17]:
# Solution goes here

In [18]:
# Solution goes here

There's a lot more we could do with lists, but that's enough to get started.  In the next section, we'll use lists to store data about sandwich prices.

## Sandwiches

In September 2019, *The Economist* published an article comparing sandwich prices in Boston and London: "[Why Americans pay more for lunch than Britons do](https://www.economist.com/finance-and-economics/2019/09/07/why-americans-pay-more-for-lunch-than-britons-do)"

It includes this graph showing prices of several sandwiches in the two cities:

<img src="https://github.com/AllenDowney/ElementsOfDataScience/raw/master/figs/20190907_FNC941.png" width="400"/>

Here are the sandwich names from the graph, as a list of strings.

In [39]:
name_list = ['Lobster roll',
    'Chicken caesar',
    'Bang bang chicken',
    'Ham and cheese',
    'Tuna and cucumber',
    'Egg'
]

I contacted *The Economist* to ask for the data they used to create that graph, and they were kind enough to share it with me.

Here are the corresponding sandwich prices in Boston:

In [19]:
boston_price_list = [9.99, 7.99, 7.49, 7, 6.29, 4.99]

So the lobster roll is \$9.99 in Boston.

The egg sandwich is \$4.99.

Here are the prices in London, converted to dollars at \$1.25 / £1.

In [21]:
london_price_list = [7.5, 5, 4.4, 5, 3.75, 2.25]

Lists provide some arithmetic operators, but they might not do what you want.  For example, you can "add" two lists:

In [22]:
boston_price_list + london_price_list

But it concatenates the two lists, which is not very useful in this example.

To compute differences between prices, you might try subtracting lists, but you would get an error.

In [23]:
boston_price_list - london_price_list
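If you would rather see the error message without stopping the notebook, one option is to catch the error.  The following sketch repeats the two price lists so it stands on its own; `try` and `except` are features we have not discussed yet.

```python
boston_price_list = [9.99, 7.99, 7.49, 7, 6.29, 4.99]
london_price_list = [7.5, 5, 4.4, 5, 3.75, 2.25]

# "Adding" lists concatenates them, so the result has 12 elements.
combined = boston_price_list + london_price_list
print(len(combined))

# Subtracting lists is not defined, so Python raises a TypeError.
try:
    boston_price_list - london_price_list
except TypeError as err:
    message = str(err)
    print(message)
```

The message says that the `-` operator is not supported for two lists.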

We can solve this problem with a NumPy array.

## NumPy arrays

We've already seen that the NumPy library provides math functions.  It also provides a type of sequence called an array.

You can create a new array with the `np.array` function, starting with a list or tuple.

In [44]:
import numpy as np

boston_price_array = np.array(boston_price_list)
london_price_array = np.array(london_price_list)

The type of the result is `numpy.ndarray`.

In [45]:
type(boston_price_array)

The "nd" stands for "n-dimensional"; NumPy arrays can have any number of dimensions.  But for now we will work with one-dimensional sequences.

If you display an array, Python displays the elements:

In [46]:
boston_price_array

You can also display the "data type" of the array, which is the type of the elements:

In [47]:
boston_price_array.dtype

The elements of a NumPy array can be any type, but they all have to be the same type.

Most often the elements are numbers, but you can also make an array of strings.

In [48]:
name_array = np.array(name_list)
name_array

In this example, the `dtype` is `<U17`.  You don't have to understand this code, but if you are curious, the `U` indicates that the elements are Unicode strings; Unicode is the standard Python uses to represent strings.
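To see what "all the same type" means in practice, here is a small sketch with made-up values.  When the elements you provide have different types, NumPy converts them to a common type.

```python
import numpy as np

# Integers mixed with a float: every element is stored as a float.
ints_and_floats = np.array([1, 2.5, 3])
print(ints_and_floats.dtype)

# Numbers mixed with a string: every element is stored as a string.
numbers_and_text = np.array([1, 'two', 3.0])
print(numbers_and_text.dtype.kind)   # 'U' means Unicode string
```

So a NumPy array never holds a mixture of numbers and strings the way a list or tuple can.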

Now, here's why NumPy arrays are useful: they can do arithmetic.  For example, to compute the differences between Boston and London prices, we can write: 

In [79]:
differences = boston_price_array - london_price_array
differences

Subtraction is done "elementwise"; that is, NumPy lines up the two arrays and subtracts corresponding elements.  The result is a new array.
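As a rough illustration of what "elementwise" means, the following sketch compares array subtraction with a loop that subtracts one pair at a time.  It uses only the first three prices, and the `for` loop and `zip` are features we have not covered yet.

```python
import numpy as np

boston = np.array([9.99, 7.99, 7.49])
london = np.array([7.5, 5, 4.4])

# Array subtraction handles all pairs at once.
elementwise = boston - london

# The same computation, one pair of elements at a time.
by_hand = []
for b, l in zip(boston, london):
    by_hand.append(b - l)

print(elementwise)
print(np.allclose(elementwise, by_hand))
```

The two approaches produce the same numbers; the array version is shorter and usually faster.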

## Mean and standard deviation

NumPy provides functions that compute statistical summaries like the mean: 

In [80]:
np.mean(differences)

So we could describe the difference in prices like this: "Sandwiches in Boston are more expensive by \$2.64, on average".

We could also compute the means first, and then compute their difference:

In [81]:
np.mean(boston_price_array) - np.mean(london_price_array)

And that turns out to be the same thing: the difference in means is the same as the mean of the differences.
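This is not a coincidence.  The sum of the differences equals the difference of the sums, and dividing by the same number of elements preserves that.  Here is a sketch that spells out the steps, using `np.sum`, which we have not introduced yet.

```python
import numpy as np

boston = np.array([9.99, 7.99, 7.49, 7, 6.29, 4.99])
london = np.array([7.5, 5, 4.4, 5, 3.75, 2.25])
n = len(boston)

# Mean of the differences: subtract first, then add up and divide.
mean_of_differences = np.sum(boston - london) / n

# Difference of the means: add up and divide first, then subtract.
difference_of_means = np.sum(boston) / n - np.sum(london) / n

# Compare with a tolerance, because floating-point results can differ slightly.
print(np.isclose(mean_of_differences, difference_of_means))
```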

As an aside, many of the NumPy functions also work with lists, so we could also do this:

In [82]:
np.mean(boston_price_list) - np.mean(london_price_list)

**Exercise:** Standard deviation is a way to quantify the variability in a set of numbers.  The NumPy function that computes standard deviation is `np.std`.

Compute the standard deviation of sandwich prices in Boston and London.  By this measure, which set of prices is more variable?

In [83]:
# Solution goes here

**Exercise:** The definition of the mean, in math notation, is

$\mu = \frac{1}{N} \sum_i x_i$

where $x$ is a sequence of elements, $N$ is the number of elements, and $\mu$ is their mean.

The definition of standard deviation is

$\sigma = \sqrt{\frac{1}{N} \sum_i (x_i - \mu)^2}$

Compute the standard deviation of `boston_price_list` using NumPy functions `np.mean` and `np.sqrt` and see if you get the same result as `np.std`.

Note: You should do this exercise using only features we have discussed so far.

In [84]:
x = boston_price_list

In [85]:
# Solution goes here

Note: This definition of standard deviation is sometimes called the "population standard deviation".  You might have seen another definition with $N-1$ in the denominator; that's the "sample standard deviation".  We'll use the population standard deviation for now and come back to this issue later.
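If you want to see the two definitions side by side, `np.std` takes an optional `ddof` argument that is subtracted from $N$ in the denominator.  Here is a small sketch with made-up numbers (not the sandwich prices, so it does not give away the exercise).

```python
import numpy as np

values = [2, 4, 4, 4, 5, 5, 7, 9]        # made-up numbers for illustration

population_std = np.std(values)          # divides by N (the default, ddof=0)
sample_std = np.std(values, ddof=1)      # divides by N - 1

print(population_std)
print(sample_std)
```

The sample standard deviation is always a little larger, and the gap shrinks as the number of elements grows.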

## Absolute and relative differences

In the previous section we computed "absolute" differences in price; that's what you get when you subtract two quantities.

But often when we make this kind of comparison, we are interested in "relative" differences, which are differences expressed as a fraction or percentage of a quantity.

Taking the lobster roll as an example, the absolute difference in price is:

In [86]:
9.99 - 7.5

We can express that difference as a fraction of the London price, like this:

In [87]:
(9.99 - 7.5) / 7.5

Or as a percentage of the London price, like this:

In [88]:
(9.99 - 7.5) / 7.5 * 100

So we might say that the lobster roll is 33% more expensive in Boston.

But putting London in the denominator was an arbitrary choice.  We could also compute the difference as a percentage of the Boston price:

In [89]:
(9.99 - 7.5) / 9.99 * 100

If we do that calculation, we might say the lobster roll is 25% cheaper in London.

When you read this kind of comparison, you should make sure you understand which quantity is in the denominator, and you might want to think about why that choice was made.

In this example, if you want to make the difference seem bigger, you might put London prices in the denominator.
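One way to keep the choice of denominator explicit is to make it an argument.  The function below is a hypothetical helper, not something provided by Python or NumPy, and defining functions is a feature we have not covered yet.

```python
def percent_difference(a, b, base):
    """Compute a - b as a percentage of base (hypothetical helper)."""
    return (a - b) / base * 100

boston, london = 9.99, 7.5   # lobster roll prices

# Same absolute difference, two different denominators.
relative_to_london = percent_difference(boston, london, base=london)
relative_to_boston = percent_difference(boston, london, base=boston)

print(round(relative_to_london, 1))
print(round(relative_to_boston, 1))
```

Because the London price is the smaller of the two, dividing by it gives the larger percentage.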

Now we can use the arrays to compute all of the relative differences:

In [90]:
absolute_differences = boston_price_array - london_price_array
absolute_differences

In [91]:
relative_differences = absolute_differences / london_price_array
relative_differences

In [92]:
percent_differences = relative_differences * 100
percent_differences

In this example, relative differences are more variable than absolute differences.

We can use `np.min` and `np.max` to compute the range of absolute differences:

In [93]:
np.min(absolute_differences), np.max(absolute_differences)

The differences are between \$2 and \$3.09.

Here is the range of percent differences:

In [94]:
np.min(percent_differences), np.max(percent_differences)

The range is quite wide: the lobster roll is only 33% more expensive in Boston; the egg sandwich is 122% more (that is, more than twice the price).
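`np.min` and `np.max` report the extreme values, but not which sandwiches they belong to.  One way to find out, sketched below, is to use `np.argmin` and `np.argmax`, which return the position of the smallest and largest element, and then look up the name at that position with the bracket operator.  We have not discussed either feature yet.

```python
import numpy as np

name_array = np.array(['Lobster roll', 'Chicken caesar', 'Bang bang chicken',
                       'Ham and cheese', 'Tuna and cucumber', 'Egg'])
boston = np.array([9.99, 7.99, 7.49, 7, 6.29, 4.99])
london = np.array([7.5, 5, 4.4, 5, 3.75, 2.25])

percent_differences = (boston - london) / london * 100

smallest = np.argmin(percent_differences)   # position of the minimum
largest = np.argmax(percent_differences)    # position of the maximum

print(name_array[smallest], percent_differences[smallest])
print(name_array[largest], percent_differences[largest])
```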

**Exercise:** What are the percent differences if we put the Boston prices in the denominator?  What is the range of those differences?  Write a sentence that summarizes the results.

In [95]:
# Solution goes here

In [96]:
# Solution goes here

In [97]:
# Solution goes here

## Summarizing relative differences

Because the range of relative differences is wide, it is not clear how we should best summarize it.  One option is the mean of the percent differences:

In [98]:
np.mean(percent_differences)

So we might say, on average, sandwiches are 65% more expensive in Boston.

But another way to summarize the data is to compute the mean price in each city, and then compute the percentage difference of the means:

In [99]:
boston_mean = np.mean(boston_price_array)
london_mean = np.mean(london_price_array)

(boston_mean - london_mean) / london_mean * 100

So we might say that the average sandwich price is about 57% higher in Boston.

As this example demonstrates:

* With relative and percentage differences, the mean of the differences is not the same as the difference of the means.

* When you report data like this, you should think about different ways to summarize the data.

* When you read a summary of data like this, make sure you understand what summary was chosen and what it means.

In this example, I think the second option (the relative difference in the means) is more meaningful, because it reflects the difference in price between "baskets of goods" that include one of each sandwich.
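To check the "basket of goods" interpretation, here is a sketch that adds up the cost of one of each sandwich in both cities and compares the totals.  It uses `np.sum`, which we have not introduced yet.

```python
import numpy as np

boston = np.array([9.99, 7.99, 7.49, 7, 6.29, 4.99])
london = np.array([7.5, 5, 4.4, 5, 3.75, 2.25])

# Cost of a basket containing one of each sandwich.
basket_boston = np.sum(boston)
basket_london = np.sum(london)
basket_percent = (basket_boston - basket_london) / basket_london * 100

# Relative difference in the means, for comparison.
boston_mean = np.mean(boston)
london_mean = np.mean(london)
means_percent = (boston_mean - london_mean) / london_mean * 100

print(basket_percent)
print(np.isclose(basket_percent, means_percent))
```

The two percentages agree because each mean is just the basket total divided by the same number of sandwiches, so that factor cancels.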