# NumPy
In the first 2 courses, we used nested lists in Python to represent datasets. Python lists offer a few advantages when representing data:

* lists can contain mixed types
* lists can shrink and grow dynamically

Using Python lists to represent and work with data also has a few key disadvantages:

* to support their flexibility, lists tend to consume lots of memory
* they struggle to work with medium and larger sized datasets

NumPy is a library that combines the **flexibility and ease-of-use of Python with the speed of C**. In this mission, we'll start by getting familiar with the core NumPy data structure and then build up to using NumPy to work with the dataset world_alcohol.csv, which contains data on how much alcohol is consumed per capita in each country.

## Arrays
The core data structure in NumPy is the ndarray object, which stands for N-dimensional array. An array is a collection of values, similar to a list. N-dimensional refers to the number of indices needed to select individual values from the object.

* 1-dimensional: Vector
* 2-dimensional: Matrix

To use NumPy, we first need to import it into our environment. NumPy is commonly imported using the alias np:



In [2]:
import numpy as np

We can directly construct arrays from lists using the numpy.array() function. To construct a vector, we need to pass in a single list (with no nesting). To construct a matrix, we pass in a list of lists, where each inner list is one row:

In [4]:
vector = np.array([5, 10, 15, 20])
matrix = np.array([[5, 10, 15], [20, 25, 30], [35, 40, 45]])

print(vector)
print(matrix)

[ 5 10 15 20]
[[ 5 10 15]
 [20 25 30]
 [35 40 45]]
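
The "number of indices" idea from the start of this section can be checked directly. The sketch below uses the `ndim` attribute of an array; the variable names are only for illustration.

```python
import numpy as np

# A vector needs one index to reach a single value.
example_vector = np.array([5, 10, 15, 20])
# A matrix needs two: a row index and a column index.
example_matrix = np.array([[5, 10, 15], [20, 25, 30], [35, 40, 45]])

print(example_vector.ndim)   # 1
print(example_matrix.ndim)   # 2
print(example_vector[1])     # 10
print(example_matrix[1, 2])  # 30
```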


## ndarray.shape
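It's often useful to know how many elements an array contains. We can use the ndarray.shape property to find out. It returns a tuple with the length of each dimension: a single value for a vector, and the number of rows followed by the number of columns for a matrix.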

In [5]:
vector = np.array([10, 20, 30])
matrix = np.array([[5, 10, 15], [20, 25, 30], [35, 40, 45]])

vector_shape = vector.shape
matrix_shape = matrix.shape

print(vector_shape)
print(matrix_shape)

(3,)
(3, 3)
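
As a small aside, the sketch below contrasts `shape` with two related ways of asking "how big is this array". It uses a non-square matrix, chosen here only so that rows and columns are easy to tell apart.

```python
import numpy as np

matrix = np.array([[5, 10, 15], [20, 25, 30]])

# shape is a tuple with one entry per dimension: (rows, columns).
print(matrix.shape)  # (2, 3)
# size is the total number of elements (2 * 3).
print(matrix.size)   # 6
# len() only reports the length of the first dimension (the rows).
print(len(matrix))   # 2
```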


# Using NumPy
We can read in datasets using the numpy.genfromtxt() function. Our dataset, world_alcohol.csv, is a comma-separated values (CSV) file. We can specify the delimiter using the delimiter parameter:

In [9]:
world_alcohol = np.genfromtxt("world_alcohol.csv", delimiter=",")

The above code reads the file world_alcohol.csv into a NumPy array. NumPy arrays are represented using the `numpy.ndarray` class. We'll refer to ndarray objects as NumPy arrays in our material.

## Data types in Numpy

Each value in a NumPy array **has to have the same data type**. NumPy data types are similar to Python data types, but have slight differences. You can find a full list of NumPy data types [here](https://docs.scipy.org/doc/numpy-1.10.1/user/basics.types.html). Some of the common ones are bool, int (for example int32 and int64), float (for example float32 and float64), and string (byte strings or unicode).

NumPy will automatically figure out an appropriate data type when reading in data or converting lists to arrays. You can check the data type of a NumPy array using the dtype property.

In [7]:
numbers = np.array([1, 2, 3, 4])
numbers.dtype

dtype('int64')
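
To illustrate the "same data type" rule, here is a minimal sketch of what happens when a list with mixed types is converted to an array. The example lists are made up for illustration.

```python
import numpy as np

# Integers mixed with a float are all stored as floats.
mixed_numbers = np.array([1, 2, 3.5])
print(mixed_numbers.dtype)     # float64

# Numbers mixed with a string are all stored as strings.
mixed_text = np.array([1, 2, "three"])
print(mixed_text.dtype.kind)   # U (unicode string)
print(mixed_text)              # ['1' '2' 'three']
```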

The data type of world_alcohol is float. Because all of the values in a NumPy array have to have the same data type, NumPy attempted to convert all of the columns to floats when they were read in. The numpy.genfromtxt() function will attempt to guess the correct data type of the array it creates.

In this case, the WHO Region, Country, and Beverage Types columns are actually strings, and couldn't be converted to floats. When NumPy can't convert a value to a numeric data type like float or integer, it uses a special nan value that stands for "not a number". With the default float data type, numpy.genfromtxt() also fills in nan when a value is missing from the file. Both cases are forms of missing data. We'll dive more into how to deal with missing data in later missions.

The whole first row of world_alcohol.csv is a header row that contains the names of each column. This is not actually part of the data, and consists entirely of strings. Since the strings couldn't be converted to floats properly, NumPy uses nan values to represent them.

To specify the data type for the entire NumPy array, we use the keyword argument **dtype** and set it to "U75". This specifies that we want to read in each value as a unicode string of up to 75 characters. We also pass skip_header=1 so that the header row is not read in as data.


In [3]:
world_alcohol = np.genfromtxt("world_alcohol.csv", dtype="U75", skip_header=1, delimiter=",")
print(world_alcohol)


[['1986' 'Western Pacific' 'Viet Nam' 'Wine' '0']
 ['1986' 'Americas' 'Uruguay' 'Other' '0.5']
 ['1985' 'Africa' "Cte d'Ivoire" 'Wine' '1.62']
 ..., 
 ['1986' 'Europe' 'Switzerland' 'Spirits' '2.54']
 ['1987' 'Western Pacific' 'Papua New Guinea' 'Other' '0']
 ['1986' 'Africa' 'Swaziland' 'Other' '5.15']]
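
Both behaviours can be compared without the real file. The sketch below feeds `numpy.genfromtxt()` an in-memory text buffer instead of a filename. The two data rows are taken from the output above, while the header text is an assumption made for this example.

```python
import io
import numpy as np

# Header row (assumed column names) followed by two data rows.
csv_text = (
    "Year,WHO Region,Country,Beverage Types,Value\n"
    "1986,Western Pacific,Viet Nam,Wine,0\n"
    "1986,Americas,Uruguay,Other,0.5\n"
)

# Default dtype is float: text fields, and the whole header row, become nan.
as_float = np.genfromtxt(io.StringIO(csv_text), delimiter=",")
print(as_float)

# Reading every value as a string and skipping the header keeps the text intact.
as_text = np.genfromtxt(io.StringIO(csv_text), delimiter=",",
                        dtype="U75", skip_header=1)
print(as_text)
```

The price of the second approach is that the numeric columns are strings too, so they need converting before any arithmetic.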


## Indexing arrays

In [11]:
#Matrix indices are matrix[row_index][column_index], e.g.:

third_country = world_alcohol[2][2]
print(third_country)

Cte d'Ivoire
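
NumPy also accepts both indices inside a single pair of brackets, separated by a comma. A short sketch on a small matrix (not the real dataset) shows that both spellings select the same element:

```python
import numpy as np

matrix = np.array([[5, 10, 15], [20, 25, 30], [35, 40, 45]])

# Chained indexing: first pick row 1, then pick element 2 of that row.
print(matrix[1][2])  # 30
# Comma indexing selects the same element in a single step.
print(matrix[1, 2])  # 30
# A single index on a matrix returns a whole row.
print(matrix[1])     # [20 25 30]
```

The comma form is the one that extends to slices, which is why the slicing examples use it.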


## Slicing arrays

We can use slices to select subsets of arrays, just like we can with lists. As with lists, a vector slice runs from the first index up to, but not including, the second index. Matrix slicing is a bit more complex, and has four forms:

* When we want to select **one entire dimension**, and a **single element from the other**.
* When we want to select **one entire dimension**, and a **slice of the other**.
* When we want to select **a slice of one dimension**, and a **single element from the other**.
* When we want to slice **both dimensions**.

In [28]:
vector = np.array([5,15,20])
print("vector[1:]")
print(vector[1:])


matrix = np.array([[1,3,6,9],[2,4,7,10],[3,6,9,11]])

## Slicing whole dimensions
print("matrix[:,2]")
print(matrix[:,2])

## Slicing whole dimension and a slice of the other 
print("matrix[1,2:]")
print(matrix[1,2:]) # second row, third and fourth column.

vector[1:]
[15 20]
matrix[:,2]
[6 7 9]
matrix[1,2:]
[ 7 10]
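
For reference, here is a sketch that runs through all four forms of matrix slicing on the same small matrix. The particular rows and columns picked are arbitrary.

```python
import numpy as np

matrix = np.array([[1, 3, 6, 9], [2, 4, 7, 10], [3, 6, 9, 11]])

# 1. One entire dimension, single element of the other: the whole third column.
whole_column = matrix[:, 2]
# 2. One entire dimension, slice of the other: all rows, first two columns.
all_rows_two_columns = matrix[:, :2]
# 3. Slice of one dimension, single element of the other: first two rows of the last column.
two_rows_one_column = matrix[:2, 3]
# 4. Slices of both dimensions: rows 1-2 and columns 1-2.
block = matrix[1:3, 1:3]

print(whole_column)          # [6 7 9]
print(all_rows_two_columns)
print(two_rows_one_column)   # [ 9 10]
print(block)
```

Selecting a single element along a dimension drops that dimension, so forms 1 and 3 return vectors, while forms 2 and 4 return matrices.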


**Reminder:**
The colon by itself `:` specifies that the entirety of a single dimension should be selected. Think of the colon as selecting from the first element in a dimension up to and including the last element.

### Example:
* Assign the whole third column from world_alcohol to the variable countries.
* Assign the whole fifth column from world_alcohol to the variable alcohol_consumption.


In [23]:
countries = world_alcohol[:,2]
alcohol_consumption = world_alcohol[:,4]

print(countries)
print(type(countries))
print(alcohol_consumption)

['Viet Nam' 'Uruguay' "Cte d'Ivoire" ..., 'Switzerland' 'Papua New Guinea'
 'Swaziland']
<class 'numpy.ndarray'>
['0' '0.5' '1.62' ..., '2.54' '0' '5.15']


### Slicing one dimension and selecting part of the other

* Assign all the rows and the first 2 columns of world_alcohol to first_two_columns.
* Assign the first 10 rows and the first column of world_alcohol to first_ten_years.
* Assign the first 10 rows and all of the columns of world_alcohol to first_ten_rows.

In [29]:
first_two_columns = world_alcohol[:,:2]
first_ten_years = world_alcohol[:10,0]
first_ten_rows = world_alcohol[:10,:]

### Slicing along both dimensions simultaneously

* Assign the first 20 rows of the columns at index 1 and 2 of world_alcohol to first_twenty_regions.


In [30]:
first_twenty_regions = world_alcohol[:20,1:3]

## Array Comparisons

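Comparing an array to a single value compares every element of the array to that value. For example:

```
vector = np.array([5, 10, 15, 20])
vector == 10
```

will generate the Boolean _vector_ [False, True, False, False], with one result per element of `vector`.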

In [6]:
vector = np.array([5, 10, 15, 20])
vector == 10

array([False,  True, False, False], dtype=bool)

In [17]:
# Extract the third column in world_alcohol, and compare it to the string Canada. Assign the result to countries_canada.
countries_canada = ('Canada' == world_alcohol[:,2])
print(countries_canada)

[False False False ..., False False False]


In [15]:
# Extract the first column in world_alcohol, and compare it to the string 1984. Assign the result to years_1984.
years_1984 = ('1984' == world_alcohol[:,0])
print(years_1984)

[False False False ..., False False False]


## Selecting Elements


Comparisons give us the power to **select elements in arrays using Boolean vectors**.

In [5]:
vector = np.array([5, 10, 15, 20])
equal_to_ten = (vector == 10)

print(vector[equal_to_ten])

[10]


**The code above:**

* Creates vector.
* Compares vector to the value 10, which generates a Boolean vector [False, True, False, False]. It assigns the result to equal_to_ten.
* Uses equal_to_ten to only select elements in vector where equal_to_ten is True. This results in the vector [10].

We can use the same principle to **select rows in matrices**:



In [7]:
matrix = np.array([
    [5,10,15],
    [20,25,30],
    [35,40,45]
])
second_column_25 = (matrix[:,1] == 25)
print(second_column_25)

[False  True False]


In [8]:
print(matrix[second_column_25,:])

[[20 25 30]]


In [27]:
## exercise:
# select the columns of matrix where the value in the last row is 45

last_row_45 = (matrix[-1,:] == 45)
print(last_row_45)

print(matrix[:,last_row_45])


[False False  True]
[[15]
 [30]
 [45]]


In [39]:
#select all rows of world_alcohol where the country is "Algeria"

country_is_algeria = (world_alcohol[:,2] == "Algeria")
print(country_is_algeria)

consumption_algeria = world_alcohol[country_is_algeria,:]
print(consumption_algeria)

[False False False ..., False False False]
[['1984' 'Africa' 'Algeria' 'Spirits' '0.01']
 ['1987' 'Africa' 'Algeria' 'Beer' '0.17']
 ['1987' 'Africa' 'Algeria' 'Spirits' '0.01']
 ['1986' 'Africa' 'Algeria' 'Wine' '0.1']
 ['1984' 'Africa' 'Algeria' 'Other' '0']
 ['1989' 'Africa' 'Algeria' 'Beer' '0.16']
 ['1989' 'Africa' 'Algeria' 'Spirits' '0.01']
 ['1989' 'Africa' 'Algeria' 'Wine' '0.23']
 ['1986' 'Africa' 'Algeria' 'Spirits' '0.01']
 ['1984' 'Africa' 'Algeria' 'Wine' '0.12']
 ['1985' 'Africa' 'Algeria' 'Beer' '0.19']
 ['1985' 'Africa' 'Algeria' 'Other' '0']
 ['1986' 'Africa' 'Algeria' 'Beer' '0.18']
 ['1985' 'Africa' 'Algeria' 'Wine' '0.11']
 ['1986' 'Africa' 'Algeria' 'Other' '0']
 ['1989' 'Africa' 'Algeria' 'Other' '0']
 ['1987' 'Africa' 'Algeria' 'Other' '0']
 ['1984' 'Africa' 'Algeria' 'Beer' '0.2']
 ['1985' 'Africa' 'Algeria' 'Spirits' '0.01']
 ['1987' 'Africa' 'Algeria' 'Wine' '0.1']]


## Comparisons With Multiple Conditions

In the previous section, we made comparisons based on a single condition. We can also perform comparisons with multiple conditions by specifying each one separately, then joining them with an ampersand (`&`) for "and" or a pipe (`|`) for "or". When constructing a comparison with multiple conditions, it's critical to **put each one in parentheses**.

In [49]:
# AND
vector = np.array([5, 10, 15, 20])
comparison_10_20 = (vector >= 10) & (vector <= 20)
print(comparison_10_20)

[False  True  True  True]


In [50]:
# OR 
equal_to_ten_or_five = (vector == 10) | (vector == 5)
print(equal_to_ten_or_five)

[ True  True False False]
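
The sketch below shows why the parentheses matter. In Python, `&` binds more tightly than comparison operators, so leaving the parentheses out changes what is being compared. The `raised_error` flag is only there to make the outcome visible.

```python
import numpy as np

vector = np.array([5, 10, 15, 20])

# With parentheses, each comparison is evaluated first, then combined.
in_range = (vector >= 10) & (vector < 20)
print(in_range)  # [False  True  True False]

# Without them, Python reads this as vector >= (10 & vector) < 20,
# a chained comparison that NumPy cannot reduce to a single True/False.
raised_error = False
try:
    vector >= 10 & vector < 20
except ValueError as error:
    raised_error = True
    print("ValueError:", error)
```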


In [52]:
# Print rows with Algeria and 1986

algeria_and_1986 = (world_alcohol[:,0] == "1986") & (world_alcohol[:,2] == "Algeria")
rows_algeria_and_1986 = world_alcohol[algeria_and_1986,:]
print(rows_algeria_and_1986)

[['1986' 'Africa' 'Algeria' 'Wine' '0.1']
 ['1986' 'Africa' 'Algeria' 'Spirits' '0.01']
 ['1986' 'Africa' 'Algeria' 'Beer' '0.18']
 ['1986' 'Africa' 'Algeria' 'Other' '0']]


In [57]:
#Get countries
unique_countries = []
unique_continents = []


for row in world_alcohol:
    if row[2] not in unique_countries:
        unique_countries.append(row[2])

    if row[1] not in unique_continents:
        unique_continents.append(row[1])

print(unique_countries)
print(unique_continents)


['Viet Nam', 'Uruguay', "Cte d'Ivoire", 'Colombia', 'Saint Kitts and Nevis', 'Guatemala', 'Mauritius', 'Angola', 'Antigua and Barbuda', 'Nigeria', 'Botswana', "Lao People's Democratic Republic", 'Afghanistan', 'Guinea-Bissau', 'Costa Rica', 'Seychelles', 'Norway', 'Kenya', 'Myanmar', 'Romania', 'Turkey', 'Comoros', 'Tunisia', 'United Kingdom of Great Britain and Northern Ireland', 'Bahrain', 'Italy', 'Sierra Leone', 'Micronesia (Federated States of)', 'Mauritania', 'Russian Federation', 'Egypt', 'Sweden', 'Qatar', 'Burkina Faso', 'Austria', 'Czech Republic', 'Ukraine', 'China', 'Lithuania', 'Zimbabwe', 'Trinidad and Tobago', 'Mexico', 'Nicaragua', 'Malta', 'Switzerland', 'Finland', 'Saudi Arabia', 'Kuwait', 'El Salvador', 'Suriname', 'Croatia', 'Somalia', 'Syrian Arab Republic', 'Iran (Islamic Republic of)', 'Papua New Guinea', 'Libya', 'Bolivia (Plurinational State of)', 'Iraq', 'Namibia', 'Uganda', 'Togo', 'Madagascar', 'Mali', 'Pakistan', 'Cameroon', 'Jamaica', 'Malawi', 'Netherland
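
As an alternative to the loop above, NumPy provides `numpy.unique()`. The sketch below applies it to a handful of country names from the output above. Note that it returns the values sorted, whereas the loop keeps them in order of first appearance.

```python
import numpy as np

# A few country values, with repeats, standing in for world_alcohol[:,2].
countries = np.array(["Viet Nam", "Uruguay", "Colombia", "Uruguay", "Viet Nam"])

# np.unique returns each distinct value once, in sorted order.
unique_countries = np.unique(countries)
print(unique_countries)  # ['Colombia' 'Uruguay' 'Viet Nam']
```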

## Replacing values

We can also use comparisons to replace values in an array, based on certain conditions. Here's an example of how we would do this for a vector:

In [4]:
vector = np.array([5, 10, 15, 20])
equal_to_ten_or_five = (vector == 10) | (vector == 5)
vector[equal_to_ten_or_five] = 50
print(vector)


[50 50 15 20]


We can perform the same replacement on a matrix. To do this, we'll need to use indexing to select a column or row first:

In [5]:
matrix = np.array([
    [5,10,15],[20,25,30],[35,40,45]
])
second_column_25 = matrix[:,1] == 25
print(second_column_25)
matrix[second_column_25,1] = 1000

print(matrix)

[False  True False]
[[   5   10   15]
 [  20 1000   30]
 [  35   40   45]]


### Replacing Empty Strings

Because world_alcohol currently has a **unicode datatype**, all of the values in the last column are strings. To add these values together or perform any other mathematical operations on them, we'll have to convert the data in the column to floats.

If we try to convert empty data (`""`) in the column to floats without removing these values first, we'll get a `ValueError`.

To avoid this, we first replace the empty values `""` with 0, so that the column can be converted to float:

In [6]:
empty_value = world_alcohol[:,4] == ''
world_alcohol[empty_value,4] = 0

print(world_alcohol[empty_value,:])

[['1985' 'Africa' 'Comoros' 'Other' '0']
 ['1986' 'Europe' 'Italy' 'Other' '0']
 ['1985' 'Europe' 'Lithuania' 'Other' '0']
 ..., 
 ['1984' 'Europe' 'Czech Republic' 'Other' '0']
 ['1987' 'Americas' 'Canada' 'Other' '0']
 ['1986' 'Europe' 'Poland' 'Other' '0']]
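
To see the error that this replacement avoids, here is a minimal sketch on a three-element array with one empty entry. The array and the `conversion_failed` flag are made up for illustration.

```python
import numpy as np

# Consumption values stored as strings, with one empty entry.
values = np.array(["0.5", "", "1.62"])

# Converting directly fails because "" is not a valid float.
conversion_failed = False
try:
    values.astype(float)
except ValueError as error:
    conversion_failed = True
    print("ValueError:", error)

# After replacing the empty string, the conversion succeeds.
values[values == ""] = "0"
converted = values.astype(float)
print(converted)
```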


We can now convert the last column to float values with the **astype()** method:

In [7]:
alcohol_consumption = world_alcohol[:,4]
alcohol_consumption = alcohol_consumption.astype(float)

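# Writing the floats back into world_alcohol does not change its dtype:
# the array holds unicode strings, so the values are stored as strings again.
world_alcohol[:,4] = world_alcohol[:,4].astype(float)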

print(alcohol_consumption)
print(world_alcohol[:,4])

[ 0.    0.5   1.62 ...,  2.54  0.    5.15]
['0.0' '0.5' '1.62' ..., '2.54' '0.0' '5.15']
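
The second line of output is still made of strings because an array has one data type for all of its values. The following sketch, on a small stand-in column, shows the difference between the new array returned by astype() and the original array:

```python
import numpy as np

column = np.array(["0", "0.5", "1.62"])

# astype() returns a new float array; the original is left unchanged.
as_float = column.astype(float)
print(as_float.dtype)     # float64

# Writing floats back into a string array converts them to strings again.
column[:] = as_float
print(column.dtype.kind)  # U
print(column)
```

To do arithmetic, keep working with the separate float array rather than the original string array.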


## Computing with NumPy

Now that alcohol_consumption consists of numeric values, we can perform computations on it. NumPy has a few **built-in methods** that operate on arrays. You can view all of them in the [documentation](https://docs.scipy.org/doc/numpy-1.10.1/index.html). For now, here are a few important ones:

* sum() -- Computes the sum of all the elements in a vector, or the sum along a dimension in a matrix
* mean() -- Computes the average of all the elements in a vector, or the average along a dimension in a matrix
* max() -- Identifies the maximum value among all the elements in a vector, or the maximum along a dimension in a matrix

In [8]:
print(alcohol_consumption.sum())

3908.96


In [12]:
print(alcohol_consumption.mean())

1.20017193737


In [14]:
matrix = np.array([
    [1,2,3],
    [4,5,6],
    [7,8,9]
])
print(matrix.sum())

45
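
Called without arguments, these methods use every element of the matrix. To compute along a dimension instead, they accept an `axis` argument, as in this short sketch:

```python
import numpy as np

matrix = np.array([
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9]
])

# axis=1 works across the columns of each row: one result per row.
row_sums = matrix.sum(axis=1)
print(row_sums)        # [ 6 15 24]
# axis=0 works down the rows of each column: one result per column.
column_sums = matrix.sum(axis=0)
print(column_sums)     # [12 15 18]

row_means = matrix.mean(axis=1)
column_maxima = matrix.max(axis=0)
print(row_means)
print(column_maxima)   # [7 8 9]
```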


In [22]:
# Extract Canada's alcohol consumption in 1986, convert empty values to 0, convert all to float and sum it up


canada_1986 = world_alcohol[((world_alcohol[:,2] == "Canada") & (world_alcohol[:,0] == "1986" )),:]
print(canada_1986)


canada_1986[(canada_1986[:,4] == ''),4] = 0

canada_alcohol = canada_1986[:,4].astype(float)
print(canada_alcohol)

total_canadian_drinking = canada_alcohol.sum()
print(total_canadian_drinking)

[['1986' 'Americas' 'Canada' 'Other' '0.0']
 ['1986' 'Americas' 'Canada' 'Spirits' '3.11']
 ['1986' 'Americas' 'Canada' 'Beer' '4.87']
 ['1986' 'Americas' 'Canada' 'Wine' '1.33']]
[ 0.    3.11  4.87  1.33]
9.31


Now that we know how to calculate the total consumption of all types of alcohol for a single country and year, we can scale up the process and make the same calculation for all countries in a given year. Here's a rough process:

* Create an empty dictionary called totals.
* Select only the rows in world_alcohol that match a given year. Assign the result to year.
* Loop through a list of countries. For each country:
  * Select only the rows from year that match the given country.
  * Assign the result to country_consumption.
  * Extract the fifth column from country_consumption.
  * Replace any empty string values in the column with the string 0.
  * Convert the column to the float data type.
  * Find the sum of the column.
  * Add the sum to the totals dictionary, with the country name as the key.
* After the code executes, you'll have a dictionary containing all of the country names as keys, with the associated alcohol consumption totals as the values.

In [33]:
countries = ["Canada", "Germany"]
totals = {}
year = world_alcohol[(world_alcohol[:,0] == "1989"),:]

for country in countries:
    country_consumption = year[(year[:,2] == country),:]
    print(country_consumption)
    country_consumption[(country_consumption[:,4] == ''),4] = 0
    # Convert once, then reuse the result for printing and storing.
    country_total = country_consumption[:,4].astype(float).sum()
    print(country_total)
    totals[country] = country_total
    

[['1989' 'Americas' 'Canada' 'Wine' '1.27']
 ['1989' 'Americas' 'Canada' 'Spirits' '2.91']
 ['1989' 'Americas' 'Canada' 'Beer' '4.82']
 ['1989' 'Americas' 'Canada' 'Other' '0.0']]
9.0
[['1989' 'Europe' 'Germany' 'Other' '0.0']
 ['1989' 'Europe' 'Germany' 'Spirits' '2.43']
 ['1989' 'Europe' 'Germany' 'Beer' '8.47']
 ['1989' 'Europe' 'Germany' 'Wine' '3.74']]
14.64


In [34]:
print(totals)

{'Canada': 9.0, 'Germany': 14.640000000000001}
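
The same steps can be wrapped in a function so that they are easy to reuse. This is only a sketch: the helper name `total_consumption` is hypothetical, and it runs on the eight 1989 rows printed above rather than on the full file.

```python
import numpy as np

# The 1989 rows for Canada and Germany, copied from the output above.
rows = np.array([
    ["1989", "Americas", "Canada", "Wine", "1.27"],
    ["1989", "Americas", "Canada", "Spirits", "2.91"],
    ["1989", "Americas", "Canada", "Beer", "4.82"],
    ["1989", "Americas", "Canada", "Other", "0.0"],
    ["1989", "Europe", "Germany", "Other", "0.0"],
    ["1989", "Europe", "Germany", "Spirits", "2.43"],
    ["1989", "Europe", "Germany", "Beer", "8.47"],
    ["1989", "Europe", "Germany", "Wine", "3.74"],
])

def total_consumption(data, year, country):
    """Sum the fifth column over the rows matching year and country."""
    matches = (data[:, 0] == year) & (data[:, 2] == country)
    # Boolean indexing returns a copy, so data itself is not modified.
    values = data[matches, 4]
    values[values == ""] = "0"
    return values.astype(float).sum()

totals = {}
for country in ["Canada", "Germany"]:
    totals[country] = total_consumption(rows, "1989", country)

print(totals)
```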


In [37]:
for country in totals:
    print(country) # key
    print(totals[country]) # value


Canada
9.0
Germany
14.64


In [40]:
#find max consumption in totals

highest_value = 0
highest_key = None

for country in totals:
    if totals[country] > highest_value:
        highest_value = totals[country]
        highest_key = country

print(highest_key + ' ' + str(highest_value))

Germany 14.64
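
The loop above can also be written with Python's built-in `max()` function, using the dictionary's `get` method as the key. The totals below are the values printed earlier, rounded for readability.

```python
totals = {"Canada": 9.0, "Germany": 14.64}

# max() iterates over the keys and compares them by their associated values.
highest_key = max(totals, key=totals.get)
highest_value = totals[highest_key]

print(highest_key + ' ' + str(highest_value))  # Germany 14.64
```

Unlike the loop, which leaves `highest_key` as None when no total exceeds 0, `max()` raises a ValueError on an empty dictionary.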
