In [1]:
import numpy
world_alcohol=numpy.genfromtxt("world_alcohol.csv",delimiter=",")
print(type(world_alcohol))
print(world_alcohol)

<type 'numpy.ndarray'>
[[             nan              nan              nan              nan
               nan]
 [  1.98600000e+03              nan              nan              nan
    0.00000000e+00]
 [  1.98600000e+03              nan              nan              nan
    5.00000000e-01]
 ..., 
 [  1.98600000e+03              nan              nan              nan
    2.54000000e+00]
 [  1.98700000e+03              nan              nan              nan
    0.00000000e+00]
 [  1.98600000e+03              nan              nan              nan
    5.15000000e+00]]


In [3]:
vector=numpy.array([10,20,30] )
matrix=numpy.array([[5,10,15],[20,25,30],[35,40,45]])
vector_shape=vector.shape
matrix_shape=matrix.shape
print(vector_shape)
print(matrix_shape)

(3L,)
(3L, 3L)




The output above raises a few points we haven't covered yet, which we'll take one at a time:

    Many items in world_alcohol are nan.
    The entire first row is nan.
    Some of the numbers are written like 1.98600000e+03.

The data type of world_alcohol is float. Because all of the values in a NumPy array must have the same data type, numpy.genfromtxt() tried to convert every column to floats as it read the file. By default the function treats every value as a float; it only tries to infer the data types itself when we pass dtype=None.

In this case, the WHO Region, Country, and Beverage Types columns are actually strings, and couldn't be converted to floats. When NumPy can't convert a value to a float, it uses a special nan value that stands for Not a Number. When a value is missing from the file altogether, numpy.genfromtxt() also fills it in with nan in a float array. Either way, nan marks missing data. We'll dive more into how to deal with missing data in later missions.

The whole first row of world_alcohol.csv is a header row that contains the names of each column. This is not actually part of the data, and consists entirely of strings. Since the strings couldn't be converted to floats properly, NumPy uses nan values to represent them.
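
To see the nan behavior in isolation, here is a minimal sketch that reads a tiny in-memory CSV instead of world_alcohol.csv. The rows and column names are made up for illustration; only numpy.genfromtxt() and numpy.isnan() come from NumPy itself.

```python
import io
import numpy

# A tiny, made-up CSV standing in for world_alcohol.csv (hypothetical data).
csv_text = "Year,Country,Litres\n1986,Uruguay,0.5\n1985,Angola,\n"

# With the default dtype (float), text fields and empty fields both become nan.
data = numpy.genfromtxt(io.StringIO(csv_text), delimiter=",")
print(data)

# nan is never equal to itself, so use numpy.isnan() to locate it.
nan_mask = numpy.isnan(data)
print(nan_mask)
```

The first row of the mask is entirely True because the header row contains only strings.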

If you haven't seen scientific notation before, you might not recognize numbers like 1.98600000e+03. Scientific notation is a compact way to display very large or very precise numbers. We can represent 100 in scientific notation as 1e+02. The e+02 indicates that we should multiply what comes before it by 10 ^ 2 (10 to the power 2, or 10 squared). This results in 1 * 100, or 100. Thus, 1.98600000e+03 is actually 1.986 * 10 ^ 3, or 1986. Similarly, 1000000000000000 can be written as 1e+15.

In this case, 1.98600000e+03 is actually longer than 1986. NumPy prints every value in an array in one consistent format, though, and it falls back to scientific notation when the values differ widely in size.
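
The arithmetic can be checked in plain Python, which reads and writes the same notation. This is a small sketch for illustration and is not part of the original notebook.

```python
# Python parses scientific notation directly: e+03 means "times 10 ** 3".
year = float("1.98600000e+03")
print(year)  # 1986.0

# The two examples from the text.
print(1e+02 == 100)
print(1e+15 == 1000000000000000)

# Formatting a float back into scientific notation with 8 decimal places.
formatted = "{:.8e}".format(1986.0)
print(formatted)
```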

Reading in the data properly

Our data wasn't read in properly: NumPy tried to convert strings to floats and produced nan values instead. We can fix this by telling numpy.genfromtxt() to read in all the data as strings. While we're at it, we can also tell it to skip the header row of world_alcohol.csv.

We can do this by:

    Specifying the keyword argument dtype when reading in world_alcohol.csv, and setting it to "U75". This specifies that we want to read in each value as a unicode string of up to 75 characters. We'll dive more into unicode and bytes later on, but for now, it's enough to know that this will read in our data properly.
    Specifying the keyword argument skip_header, and setting it to 1. This will skip the first row of world_alcohol.csv when reading in the data.
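
The effect of the two arguments can be tried without the real file. The sketch below uses a short in-memory CSV; its header names and rows are assumptions made for illustration and may not match world_alcohol.csv exactly.

```python
import io
import numpy

# Made-up rows laid out like world_alcohol.csv (hypothetical header and data).
csv_text = (
    "Year,WHO region,Country,Beverage Types,Display Value\n"
    "1986,Western Pacific,Viet Nam,Wine,0\n"
    "1986,Americas,Uruguay,Other,0.5\n"
)

sample = numpy.genfromtxt(
    io.StringIO(csv_text),
    delimiter=",",
    dtype="U75",     # read every field as a string of up to 75 characters
    skip_header=1,   # drop the row of column names
)
print(sample.shape)
print(sample.dtype)
print(sample[1, 2])
```

Only the two data rows remain, and every field is kept as text rather than replaced by nan.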


In [4]:
world_alcohol=numpy.genfromtxt("world_alcohol.csv",skip_header=1,dtype="U75",delimiter=",")
print(world_alcohol)

[[u'1986' u'Western Pacific' u'Viet Nam' u'Wine' u'0']
 [u'1986' u'Americas' u'Uruguay' u'Other' u'0.5']
 [u'1985' u'Africa' u"Cte d'Ivoire" u'Wine' u'1.62']
 ..., 
 [u'1986' u'Europe' u'Switzerland' u'Spirits' u'2.54']
 [u'1987' u'Western Pacific' u'Papua New Guinea' u'Other' u'0']
 [u'1986' u'Africa' u'Swaziland' u'Other' u'5.15']]


In [5]:
uruguay_other_1986=world_alcohol[1,4]
third_country=world_alcohol[:,2]

In [6]:
first_two_columns=world_alcohol[:,0:2]
first_ten_years=world_alcohol[0:10,0]
first_ten_rows=world_alcohol[0:10,:]
print('first_two_columns',first_two_columns)
print('first_ten_years',first_ten_years)
print('first_ten_rows',first_ten_rows)

('first_two_columns', array([[u'1986', u'Western Pacific'],
       [u'1986', u'Americas'],
       [u'1985', u'Africa'],
       ..., 
       [u'1986', u'Europe'],
       [u'1987', u'Western Pacific'],
       [u'1986', u'Africa']], 
      dtype='<U75'))
('first_ten_years', array([u'1986', u'1986', u'1985', u'1986', u'1987', u'1987', u'1987',
       u'1985', u'1986', u'1984'], 
      dtype='<U75'))
('first_ten_rows', array([[u'1986', u'Western Pacific', u'Viet Nam', u'Wine', u'0'],
       [u'1986', u'Americas', u'Uruguay', u'Other', u'0.5'],
       [u'1985', u'Africa', u"Cte d'Ivoire", u'Wine', u'1.62'],
       [u'1986', u'Americas', u'Colombia', u'Beer', u'4.27'],
       [u'1987', u'Americas', u'Saint Kitts and Nevis', u'Beer', u'1.98'],
       [u'1987', u'Americas', u'Guatemala', u'Other', u'0'],
       [u'1987', u'Africa', u'Mauritius', u'Wine', u'0.13'],
       [u'1985', u'Africa', u'Angola', u'Spirits', u'0.39'],
       [u'1986', u'Americas', u'Antigua and Barbuda', u'Spirits', u'1.5

In [9]:
year_1986=world_alcohol[:,0]=='1986'
Algeria=world_alcohol[:,2]=='Algeria'
is_algeria_and_1986=year_1986 & Algeria
print(world_alcohol[is_algeria_and_1986,:])

[[u'1986' u'Africa' u'Algeria' u'Wine' u'0.1']
 [u'1986' u'Africa' u'Algeria' u'Spirits' u'0.01']
 [u'1986' u'Africa' u'Algeria' u'Beer' u'0.18']
 [u'1986' u'Africa' u'Algeria' u'Other' u'0']]
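
The cell above combines two Boolean masks with &. As a smaller sketch of the same idea, using made-up rows rather than the real dataset:

```python
import numpy

# Hypothetical rows: year, country, value.
data = numpy.array([
    ["1986", "Algeria", "0.1"],
    ["1987", "Algeria", "0.2"],
    ["1986", "Angola", "0.39"],
])

# Each comparison yields one True/False entry per row.
is_1986 = data[:, 0] == "1986"
is_algeria = data[:, 1] == "Algeria"

# & combines the masks element by element; only rows where both are True remain.
both = is_1986 & is_algeria
print(both)
print(data[both, :])
```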


In [None]:
# Replace empty strings ('') with '0' so the column can be converted to float

In [10]:
is_value_empty=world_alcohol[:,4]==''
world_alcohol[is_value_empty,4]='0'

In [14]:
alcohol_consumption=world_alcohol[:,4]
print(alcohol_consumption.dtype)
alcohol_consumption=alcohol_consumption.astype(float)
print(alcohol_consumption.dtype)
print(alcohol_consumption)

<U75
float64
[ 0.    0.5   1.62 ...,  2.54  0.    5.15]


In [15]:
total_alcohol=alcohol_consumption.sum(axis=0)
average_alcohol=alcohol_consumption.mean(axis=0)
print(total_alcohol)
print(average_alcohol)

3908.96
1.20017193737


In [19]:
countries=numpy.unique(world_alcohol[:,2])
print(len(countries))
print(countries)



164
[u'Afghanistan' u'Albania' u'Algeria' u'Angola' u'Antigua and Barbuda'
 u'Argentina' u'Australia' u'Austria' u'Bahamas' u'Bahrain' u'Bangladesh'
 u'Belarus' u'Belgium' u'Belize' u'Benin' u'Bhutan'
 u'Bolivia (Plurinational State of)' u'Botswana' u'Brazil'
 u'Brunei Darussalam' u'Bulgaria' u'Burkina Faso' u'Burundi' u'Cabo Verde'
 u'Cambodia' u'Cameroon' u'Canada' u'Central African Republic' u'Chad'
 u'Chile' u'China' u'Colombia' u'Comoros' u'Congo' u'Costa Rica' u'Croatia'
 u"Cte d'Ivoire" u'Cuba' u'Cyprus' u'Czech Republic'
 u"Democratic People's Republic of Korea"
 u'Democratic Republic of the Congo' u'Denmark' u'Djibouti'
 u'Dominican Republic' u'Ecuador' u'Egypt' u'El Salvador'
 u'Equatorial Guinea' u'Eritrea' u'Ethiopia' u'Fiji' u'Finland' u'France'
 u'Gabon' u'Gambia' u'Germany' u'Ghana' u'Greece' u'Guatemala' u'Guinea'
 u'Guinea-Bissau' u'Guyana' u'Haiti' u'Honduras' u'Hungary' u'Iceland'
 u'India' u'Indonesia' u'Iran (Islamic Republic of)' u'Iraq' u'Ireland'
 u'Israel' u'It

In [21]:
#calculate the total consumption across all types of alcohol for each country in the year 1989
totals = {}
print(len(countries))
is_1989=world_alcohol[:,0]=='1989'
consumption_year=world_alcohol[is_1989,:]
for country in countries:
    select_country=consumption_year[:,2]==country
    country_consumption=consumption_year[select_country,:]
    alcohol_consumption=country_consumption[:,4]
    empty_string=alcohol_consumption[:]==''
    alcohol_consumption[empty_string]='0'
    alcohol_consumption=alcohol_consumption.astype(float)
    totals[country]=alcohol_consumption.sum()
print(totals)    

164
{u'Canada': 9.0, u'Sao Tome and Principe': 2.5699999999999998, u'United Republic of Tanzania': 5.9000000000000004, u'Lithuania': 0.0, u'Cambodia': 0.33000000000000002, u'Ethiopia': 0.8600000000000001, u'Swaziland': 6.6799999999999997, u'Argentina': 10.82, u'Cameroon': 6.3599999999999994, u'Burkina Faso': 3.9900000000000002, u'Ghana': 1.8599999999999999, u'Saudi Arabia': 0.14999999999999999, u'Slovenia': 12.969999999999999, u'Guatemala': 2.4700000000000002, u'Kuwait': 0.0, u'Russian Federation': 5.3499999999999996, u'Jordan': 0.19, u'Spain': 13.280000000000001, u'Liberia': 5.6100000000000003, u'Netherlands': 10.030000000000001, u'Pakistan': 0.02, u'Oman': 1.03, u'Cabo Verde': 2.79, u'Seychelles': 3.3000000000000003, u'Gabon': 9.3399999999999999, u'New Zealand': 11.52, u'Yemen': 0.20000000000000001, u'Jamaica': 3.0299999999999998, u'Albania': 1.73, u'Samoa': 2.6299999999999999, u'United Arab Emirates': 4.4299999999999997, u'India': 1.6599999999999999, u'Lesotho': 2.02, u'Kenya': 2.81

In [22]:
#Finding the Country that Drinks the Most
highest_value = 0
highest_key = None
for k,v in totals.items():
    if v>highest_value:
        highest_value=v
        highest_key=k
print(highest_value)    
print(highest_key)

16.29
Hungary
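
The loop above can also be written with the built-in max(). This is an alternative sketch, not the notebook's own method, and the dictionary below is a three-entry stand-in for the full totals.

```python
# A small stand-in for the totals dictionary built above.
totals = {"Hungary": 16.29, "Spain": 13.28, "Kuwait": 0.0}

# max() with key=totals.get returns the key whose value is largest.
highest_key = max(totals, key=totals.get)
highest_value = totals[highest_key]
print(highest_value)
print(highest_key)
```

Unlike the loop, which starts from highest_value = 0, this version still returns a key when every total is zero.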


NumPy Strengths and Weaknesses

You should now have a good foundation in NumPy, and in handling issues with your data. NumPy is much easier to work with than lists of lists, because:

    It's easy to perform computations on data.
    Data indexing and slicing are faster and easier.
    We can convert data types quickly.
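
As a rough illustration of these points, the sketch below totals one column twice, first with a list of lists and then with NumPy. The values are made up for the comparison.

```python
import numpy

# Hypothetical rows of [year, value], stored as strings the way a CSV reader returns them.
rows = [["1986", "0.5"], ["1985", "1.62"], ["1986", "2.54"]]

# Lists of lists: loop over the rows and convert each value by hand.
total_from_lists = 0.0
for row in rows:
    total_from_lists += float(row[1])

# NumPy: slice the column, convert it once, then sum it.
array = numpy.array(rows)
total_from_numpy = array[:, 1].astype(float).sum()

print(total_from_lists)
print(total_from_numpy)
```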

Overall, NumPy makes working with data in Python much more efficient. It's widely used for this reason, especially for machine learning.

You may have noticed some limitations with NumPy as you worked through the past two missions, though. For example:

    All of the items in an array must have the same data type. For many datasets, this can make arrays cumbersome to work with.
    Columns and rows must be referred to by number, which gets confusing when you go back and forth from column name to column number.
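
The first limitation is easy to reproduce. In this sketch, with values chosen only for illustration, one array holds a year, a beverage type, and a measurement:

```python
import numpy

# Hypothetical record: a year, a beverage type, and a measurement.
mixed = numpy.array([1986, "Wine", 0.5])

# NumPy picks one data type that can hold everything, which here is a string type.
print(mixed.dtype.kind)  # "U" stands for unicode string
print(mixed)

# The year is now text, so it has to be converted back before doing arithmetic.
year = int(mixed[0])
print(year + 1)
```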

In the next few missions, we'll learn about the Pandas library, one of the most popular data analysis libraries. Pandas builds on NumPy, but does a better job addressing the limitations of NumPy.