# Python: Data analysis with numpy

**Goal**: perform statistical computations and interprete the results!

## Goal

The **goal** of this part is to analyze the data of the dataset world_alcohol with numpy. As a reminder, this dataset lists the alcohol consumption by country. We will look at which countries consume the most alcohol per population.

In [1]:
import numpy as np

In [2]:
world_alcohol = np.genfromtxt("world_alcohol.csv", delimiter=",", dtype="U75")
world_alcohol

array([['Year', 'WHO region', 'Country', 'Beverage Types',
        'Display Value'],
       ['1986', 'Western Pacific', 'Viet Nam', 'Wine', '0'],
       ['1986', 'Americas', 'Uruguay', 'Other', '0.5'],
       ...,
       ['1986', 'Europe', 'Switzerland', 'Spirits', '2.54'],
       ['1987', 'Western Pacific', 'Papua New Guinea', 'Other', '0'],
       ['1986', 'Africa', 'Swaziland', 'Other', '5.15']], dtype='<U75')

## Perform comparisons

In this section, we will learn how to perform comparisons with numpy. These last will return booleans (True or False).

In [3]:
# example for vector
vector = np.array([5, 10, 15, 20])
vector

array([ 5, 10, 15, 20])

In [4]:
vector == 5

array([ True, False, False, False])

In [5]:
# example for matrix
matrix = np.array([[5,10,15], [20,25,30], [35,40,45]])
matrix

array([[ 5, 10, 15],
       [20, 25, 30],
       [35, 40, 45]])

In [6]:
matrix == 25

array([[False, False, False],
       [False,  True, False],
       [False, False, False]])

### Training

In this practice, we will try to answer the following questions:

* extract the 3rd column of world_alcohol and compare it to the country "Canada", assign the result to the variable countries_canada

* extract the first column of world_alcohol and compare it to the string "1984", assign the result to the variable years_1984

In [7]:
world_alcohol = np.genfromtxt("world_alcohol.csv", delimiter=",", dtype="U75", skip_header=1)
world_alcohol

array([['1986', 'Western Pacific', 'Viet Nam', 'Wine', '0'],
       ['1986', 'Americas', 'Uruguay', 'Other', '0.5'],
       ['1985', 'Africa', "Cte d'Ivoire", 'Wine', '1.62'],
       ...,
       ['1986', 'Europe', 'Switzerland', 'Spirits', '2.54'],
       ['1987', 'Western Pacific', 'Papua New Guinea', 'Other', '0'],
       ['1986', 'Africa', 'Swaziland', 'Other', '5.15']], dtype='<U75')

In [8]:
countries_canada = (world_alcohol[:,2] == "Canada")
countries_canada

array([False, False, False, ..., False, False, False])

In [9]:
years_1984 = (world_alcohol[:,0] == "1984")
years_1984

array([False, False, False, ..., False, False, False])

## Selecting items

In this section we will have how to select specific elements from a numpy array. To do this, we will use the comparisons made above, these return vectors or matrix of booleans that we will use to condition the selection of elements from a numpy array.

In [10]:
# example with vector
print(vector)

[ 5 10 15 20]


In [11]:
is_equal_to_20 = (vector == 20)
vector_20 = vector[is_equal_to_20]

In [12]:
print(vector_20)

[20]


In [13]:
# example with matrix
print(matrix)

[[ 5 10 15]
 [20 25 30]
 [35 40 45]]


In [14]:
col_contain_25 = (matrix[:,1] == 25)
matrix_row_contain_25 = matrix[col_contain_25,:]

In [15]:
print(matrix_row_contain_25)

[[20 25 30]]


### Training

In this practice, we will try to answer the following questions:

* compare the 3rd column of world_alcohol to the string "Senegal"

* assign the result to the variable country_is_senegal

* select only the lines of world_alcohol for which country_is_senegal is true

* assign the result to the variable country_senegal

* display the results

* do the same work to retrieve all rows corresponding to the year "1984" and assign the result to the years_1984 variable

In [16]:
country_is_senegal = (world_alcohol[:,2] == "Senegal")
country_senegal = world_alcohol[country_is_senegal]

In [17]:
print(country_senegal)

[['1989' 'Africa' 'Senegal' 'Beer' '0.28']
 ['1986' 'Africa' 'Senegal' 'Spirits' '0.02']
 ['1984' 'Africa' 'Senegal' 'Spirits' '0.05']
 ['1989' 'Africa' 'Senegal' 'Wine' '0.32']
 ['1986' 'Africa' 'Senegal' 'Wine' '0.22']
 ['1989' 'Africa' 'Senegal' 'Spirits' '0.01']
 ['1984' 'Africa' 'Senegal' 'Other' '0']
 ['1985' 'Africa' 'Senegal' 'Beer' '0.22']
 ['1985' 'Africa' 'Senegal' 'Spirits' '0.04']
 ['1987' 'Africa' 'Senegal' 'Other' '0']
 ['1987' 'Africa' 'Senegal' 'Beer' '0.26']
 ['1984' 'Africa' 'Senegal' 'Beer' '0.2']
 ['1985' 'Africa' 'Senegal' 'Wine' '0.19']
 ['1986' 'Africa' 'Senegal' 'Other' '0']
 ['1985' 'Africa' 'Senegal' 'Other' '0']
 ['1987' 'Africa' 'Senegal' 'Spirits' '0.04']
 ['1989' 'Africa' 'Senegal' 'Other' '0']
 ['1986' 'Africa' 'Senegal' 'Beer' '0.2']
 ['1984' 'Africa' 'Senegal' 'Wine' '0.33']
 ['1987' 'Africa' 'Senegal' 'Wine' '0.16']]


In [18]:
years_is_1984 = (world_alcohol[:,0] == "1984")
years_1984 = world_alcohol[years_is_1984]

In [19]:
print(years_1984)

[['1984' 'Africa' 'Nigeria' 'Other' '6.1']
 ['1984' 'Eastern Mediterranean' 'Afghanistan' 'Other' '0']
 ['1984' 'Americas' 'Costa Rica' 'Wine' '0.06']
 ...
 ['1984' 'Europe' 'Latvia' 'Spirits' '7.5']
 ['1984' 'Africa' 'Angola' 'Wine' '0.57']
 ['1984' 'Africa' 'Central African Republic' 'Wine' '0.46']]


## Perform comparisons with multiple conditions

As a reminder, comparisons are the key concepts in numpy tables for selecting the desired elements. To realize multiple conditions, we use the ```&``` and ```|``` operators. Do not hesitate to use brackets to improve the reading of the code and to avoid mistakes.

In [20]:
# example with vector
vector

array([ 5, 10, 15, 20])

In [21]:
equal_to_5_and_10 = ((vector == 5) & (vector == 10))
equal_to_5_and_10

array([False, False, False, False])

In [22]:
equal_to_5_or_10 = ((vector == 5) | (vector == 10))
equal_to_5_or_10

array([ True,  True, False, False])

### Training

In this practice, we will try to answer the following questions:

* select the rows where the country is "Senegal" and the year is "1986"
* create this double comparison and assign the result to the variable is_senegal_and_1986
* use the variable is_senegal_and_1986 to select the corresponding rows in the table world_alcohol
* assign the result to the variable rows_with_senegal_and_1986
* display the result

In [23]:
is_senegal_and_1986 = ((world_alcohol[:,2] == "Senegal") & (world_alcohol[:,0] == "1984"))
rows_with_senegal_and_1986 = world_alcohol[is_senegal_and_1986]

In [24]:
print(rows_with_senegal_and_1986)

[['1984' 'Africa' 'Senegal' 'Spirits' '0.05']
 ['1984' 'Africa' 'Senegal' 'Other' '0']
 ['1984' 'Africa' 'Senegal' 'Beer' '0.2']
 ['1984' 'Africa' 'Senegal' 'Wine' '0.33']]


## Replace values in a numpy array

In [25]:
# example with vector
vector

array([ 5, 10, 15, 20])

In [26]:
equal_to_5_or_10

array([ True,  True, False, False])

In [27]:
vector[equal_to_5_or_10] = 10

In [28]:
vector

array([10, 10, 15, 20])

In [29]:
# example for matrix
matrix

array([[ 5, 10, 15],
       [20, 25, 30],
       [35, 40, 45]])

In [30]:
second_column_25 = (matrix[:,1] == 25)
second_column_25

array([False,  True, False])

In [31]:
matrix[second_column_25, 1] = 50

In [32]:
matrix

array([[ 5, 10, 15],
       [20, 50, 30],
       [35, 40, 45]])

### Training

In this practice, we will try to answer the following questions:

* create a numpy array world_alcohol_2 equal world_alcohol to duplicate it under another name
* replace all the years "1986" in the first column of world_alcohol_2 with "2018"
* replace all "Wine" alcohols in the 4th column of world_alcohol_2 with "Beer"

In [33]:
world_alcohol_2 = world_alcohol.copy()
world_alcohol_2

array([['1986', 'Western Pacific', 'Viet Nam', 'Wine', '0'],
       ['1986', 'Americas', 'Uruguay', 'Other', '0.5'],
       ['1985', 'Africa', "Cte d'Ivoire", 'Wine', '1.62'],
       ...,
       ['1986', 'Europe', 'Switzerland', 'Spirits', '2.54'],
       ['1987', 'Western Pacific', 'Papua New Guinea', 'Other', '0'],
       ['1986', 'Africa', 'Swaziland', 'Other', '5.15']], dtype='<U75')

In [34]:
world_alcohol_2[:,0][world_alcohol_2[:,0] == "1986"] = "2018"
world_alcohol_2

array([['2018', 'Western Pacific', 'Viet Nam', 'Wine', '0'],
       ['2018', 'Americas', 'Uruguay', 'Other', '0.5'],
       ['1985', 'Africa', "Cte d'Ivoire", 'Wine', '1.62'],
       ...,
       ['2018', 'Europe', 'Switzerland', 'Spirits', '2.54'],
       ['1987', 'Western Pacific', 'Papua New Guinea', 'Other', '0'],
       ['2018', 'Africa', 'Swaziland', 'Other', '5.15']], dtype='<U75')

In [35]:
world_alcohol_2[:,3][world_alcohol_2[:,3] == "Wine"] = "Beer"
world_alcohol_2

array([['2018', 'Western Pacific', 'Viet Nam', 'Beer', '0'],
       ['2018', 'Americas', 'Uruguay', 'Other', '0.5'],
       ['1985', 'Africa', "Cte d'Ivoire", 'Beer', '1.62'],
       ...,
       ['2018', 'Europe', 'Switzerland', 'Spirits', '2.54'],
       ['1987', 'Western Pacific', 'Papua New Guinea', 'Other', '0'],
       ['2018', 'Africa', 'Swaziland', 'Other', '5.15']], dtype='<U75')

## Replace empty strings

In this practice, we will try to answer the following questions:

* compare all the elements of the 5th column of world_alcohol with the empty string i.e. "
* assign the result to the variable is_value_empty
* select all the values of the 5th column of world_alcohol for which is_value_empty is equal to True and finally replace them by the string '0'

In [36]:
is_value_empty = (world_alcohol[:,4] == '')
world_alcohol[is_value_empty] = '0'
world_alcohol

array([['1986', 'Western Pacific', 'Viet Nam', 'Wine', '0'],
       ['1986', 'Americas', 'Uruguay', 'Other', '0.5'],
       ['1985', 'Africa', "Cte d'Ivoire", 'Wine', '1.62'],
       ...,
       ['1986', 'Europe', 'Switzerland', 'Spirits', '2.54'],
       ['1987', 'Western Pacific', 'Papua New Guinea', 'Other', '0'],
       ['1986', 'Africa', 'Swaziland', 'Other', '5.15']], dtype='<U75')

## Converting data types

In [37]:
# example
string_vector = np.array(["1", "2", "3", "4", "5"])
string_vector

array(['1', '2', '3', '4', '5'], dtype='<U1')

In [38]:
float_vector = string_vector.astype(float)
float_vector

array([1., 2., 3., 4., 5.])

In [39]:
int_vector = string_vector.astype(int)
int_vector

array([1, 2, 3, 4, 5])

### Training

In this practice, we will try to answer the following questions:

* extract the 5th column of world alcohol and assign the result to the variable alcohol_consumption
* use the astype() method to convert alcohol_consumption to decimal (float)

In [40]:
alcohol_consumption = world_alcohol[:,4]
alcohol_consumption = alcohol_consumption.astype(float)
alcohol_consumption

array([0.  , 0.5 , 1.62, ..., 2.54, 0.  , 5.15])

## Performing mathematical computations with numpy

In [41]:
# example with vector
vector

array([10, 10, 15, 20])

In [42]:
# sum()
vector.sum()

55

In [43]:
# mean()
vector.mean()

13.75

In [44]:
# max()
vector.max()

20

In [45]:
# example with matrix
matrix

array([[ 5, 10, 15],
       [20, 50, 30],
       [35, 40, 45]])

In [46]:
# sum on the rows
matrix.sum(axis=1)

array([ 30, 100, 120])

In [47]:
# sum on the columns
matrix.sum(axis=0)

array([ 60, 100,  90])

### Training

In this practice, we will try to answer the following questions:

* use the sum() method to calculate the sum of the alcohol_consumption values and assign the result to the total_alcohol variable
* use the method mean() to calculate the average of the values of alcohol_consumption and assign the result to the variable average_alcohol 
* display the results

In [48]:
total_alcohol = alcohol_consumption.sum()
total_alcohol

3908.96

In [49]:
average_alcohol = alcohol_consumption.mean()
average_alcohol

1.2001719373656738

## Calculate the total annual consumption

In this practice, we will try to answer the following questions:

* create a matrix named france_1986 which contains all the rows of world_alcohol corresponding to the year "1986" and the country "France"
* extract the 5th column of france_1986, replace any empty string (") with '0' and convert the column to decimal (float) and assign the result to the variable france_alcohol
* calculate the sum of france_alcohol and assign the result to the variable total_france_drinking
* display the result

In [50]:
france_1986 = world_alcohol[(world_alcohol[:,0] == "1986") & (world_alcohol[:,2] == "France"),:]
france_1986

array([['1986', 'Europe', 'France', 'Other', '0.25'],
       ['1986', 'Europe', 'France', 'Spirits', '2.71'],
       ['1986', 'Europe', 'France', 'Beer', '2.55'],
       ['1986', 'Europe', 'France', 'Wine', '10.62']], dtype='<U75')

In [51]:
france_alcohol = france_1986[:,4]
france_alcohol

array(['0.25', '2.71', '2.55', '10.62'], dtype='<U75')

In [52]:
france_alcohol[france_alcohol == ''] = "0"
france_alcohol = france_alcohol.astype(float)
france_alcohol

array([ 0.25,  2.71,  2.55, 10.62])

In [53]:
total_france_drinking = france_alcohol.sum()
total_france_drinking

16.13

## Calculate the consumption for each country

In this practice, we will try to answer the following questions:

* first of all, we create an empty dictionary which will contain all the countries and their associated alcohol consumption, we will named it totals
* then select the rows of world_alcohol corresponding to the given year, say 1989 and assign the result to the year variable
* select all the countries in a list called countries
* go through all the countries in the list using a loop and for each country:
    * select the lines of year corresponding to this country
    * assign the result to the variable country_consumption
    * extract the 5th column of country_consumption
    * replace any empty string in this column with 0
    * convert the column into a decimal (float)
    * calculate the sum of the column
    * add the sum to the totals dictionary, with the country name as key and the sum as value
* display the dictionary totals

In [54]:
totals = {}

year = world_alcohol[world_alcohol[:,0] == "1989",:]
countries = world_alcohol[:,2]

for country in countries:
    country_consumption = year[year[:,2] == country,:]
    alcohol_consumption = country_consumption[:,4]
    alcohol_consumption[alcohol_consumption == ''] = "0"
    alcohol_consumption = alcohol_consumption.astype(float)
    totals[country] = alcohol_consumption.sum()

In [55]:
print(totals)

{'Viet Nam': 0.16, 'Uruguay': 7.4399999999999995, "Cte d'Ivoire": 2.2, 'Colombia': 6.960000000000001, 'Saint Kitts and Nevis': 4.65, 'Guatemala': 2.47, 'Mauritius': 3.54, 'Angola': 2.28, 'Antigua and Barbuda': 4.69, 'Nigeria': 6.74, 'Botswana': 4.63, "Lao People's Democratic Republic": 5.95, 'Afghanistan': 0.0, 'Guinea-Bissau': 2.67, 'Costa Rica': 5.3999999999999995, 'Seychelles': 3.3000000000000003, 'Norway': 5.08, 'Kenya': 2.82, 'Myanmar': 0.16, 'Romania': 8.41, 'Turkey': 0.72, '0': 0.0, 'Tunisia': 0.95, 'United Kingdom of Great Britain and Northern Ireland': 9.99, 'Bahrain': 4.89, 'Sierra Leone': 4.380000000000001, 'Micronesia (Federated States of)': 0.0, 'Mauritania': 0.02, 'Russian Federation': 5.35, 'Egypt': 0.42000000000000004, 'Sweden': 7.47, 'Qatar': 1.4500000000000002, 'Burkina Faso': 3.99, 'Austria': 13.9, 'Czech Republic': 13.009999999999998, 'Ukraine': 5.32, 'China': 3.33, 'Zimbabwe': 4.92, 'Trinidad and Tobago': 4.68, 'Mexico': 5.1, 'Nicaragua': 2.5, 'Malta': 7.13, 'Switz

## Find the country that consumes the most alcohol

In this practice, we will try to answer the following questions:

* create a variable highest_value which will keep in memory the highest value of the dictionary totals. We set it to 0 to start with 
* create a similar variable named highest_key which will keep in memory the name of the country associated with the highest value and set it to None
* browse each country in the totals dictionary and if the value associated with the country is greater than highest_value, assign the value in question to the highest_value variable and assign the corresponding key (country name) to the highest_key variable
* display the country that consumes the most alcohol (variable highest_key)

In [56]:
highest_value = 0
highest_key = None

for country in totals:
    consumption = totals[country]
    
    if highest_value < consumption:
        highest_value = consumption
        highest_key = country

In [57]:
print(highest_key, ':', highest_value)

Hungary : 16.29
