### Recap of last week

### Functions that return a single value for the array
- `np.sum`
- `np.min`
- `np.max`
- `np.mean`
- `np.std` (standard deviation)
- `len` (the array length; note that `len` is built into Python, not part of NumPy)
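
As a quick refresher, the functions in this list can be tried on a small array. This is a minimal sketch; the `heights` array is made-up example data, not part of last week's material.

```python
import numpy as np

# Hypothetical example data (an assumption for illustration only)
heights = np.array([150, 160, 170, 180])

print(np.sum(heights))   # 660
print(np.min(heights))   # 150
print(np.max(heights))   # 180
print(np.mean(heights))  # 165.0
print(np.std(heights))   # standard deviation of the four values
print(len(heights))      # 4
```

Each call reduces the whole array to a single number.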
  

### Some new functions

- `np.sort` sorts an array
- `np.argsort` returns the positions (indexes) of the values, ordered from the lowest value to the highest
- `np.diff` returns the differences between consecutive values in an array
  - note that the result is one element shorter than the original array
  - for [1,5,4,6]
  - diff will be [4,-1,2]
- `np.abs` returns absolute values (i.e. sign removed)
- `np.sqrt` returns the square root
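
The `np.diff`, `np.abs` and `np.sqrt` entries above can be checked with a short sketch. The variable names (`steps`, `sizes`, `roots`) and the array passed to `np.sqrt` are chosen here for illustration only.

```python
import numpy as np

values = np.array([1, 5, 4, 6])

# Differences between consecutive values: 5-1, 4-5, 6-4
steps = np.diff(values)
print(steps)                      # [ 4 -1  2]
print(len(values), len(steps))    # 4 3

# Absolute values: the sign is removed
sizes = np.abs(steps)
print(sizes)                      # [4 1 2]

# Square root of each element (the result is a float array)
roots = np.sqrt(np.array([4, 9, 16]))
print(roots)                      # [2. 3. 4.]
```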

**You don't need to memorize all of these function names at once. They will stick as you practice with the exercises.**

**You do need to remember where to find the function names when you need them.**

#### Examples of functions that work on arrays

In [123]:
import numpy as np

# An example array
salary_array = np.array([20000,30000,10000,40000,500000,35000, 60000, 20000])
print(salary_array)

[ 20000  30000  10000  40000 500000  35000  60000  20000]


In [124]:
# np.sort

sorted_salaries = np.sort(salary_array)
print(sorted_salaries, "  after sort")

[ 10000  20000  20000  30000  35000  40000  60000 500000]   after sort


In [125]:
# np.argsort
arg_sort = np.argsort(salary_array)
print(arg_sort)
# meaning of the output:
# the smallest value is at location 2; the second smallest value is at location 0; etc.

[2 0 7 1 5 3 6 4]
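
One way to see what the `np.argsort` result means is to use it as a list of positions (multiple-position indexing). The sketch below assumes the same `salary_array`; the names `order` and `reordered` are introduced here for illustration.

```python
import numpy as np

salary_array = np.array([20000, 30000, 10000, 40000, 500000, 35000, 60000, 20000])
order = np.argsort(salary_array)

# Indexing with the argsort result puts the values in sorted order,
# so this prints the same values as np.sort(salary_array)
reordered = salary_array[order]
print(reordered)

# The first entry is the position of the lowest salary,
# the last entry is the position of the highest salary
print(order[0], order[-1])
```

This is useful when you need to know where a value sits in the original array, which `np.sort` alone does not tell you.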


### Ways to access arrays
- a single position 
- multiple positions (last week)
- boolean array (this week)

#### What is a boolean value?
A boolean value is a data type with one of two possible values: _True_ or _False_. Mind the capital letter.

In [126]:
# let's print the data type of a boolean value
b = True
print(type(b))

<class 'bool'>


A boolean array can be used to select values from an array.

In [127]:
salary_array = np.array([20000,30000,10000,40000,500000,35000, 60000, 20000])

bool_array =   np.array([True, True, False, False, True, False, False, False])

values = salary_array[bool_array] # filter an array with a bool array

print(values)

[ 20000  30000 500000]


A comparison (a logical test such as `>` or `==`) on an array results in a boolean array.

In [128]:
# which salaries are above 25000
idx = salary_array > 25000
print(idx)

[False  True False  True  True  True  True False]


In [129]:
# the boolean array can be used to select values
salaries_above_25000 = salary_array[idx]
print(salaries_above_25000)

[ 30000  40000 500000  35000  60000]


In [130]:
# how many people have a salary above 25000?

print('how many people have a salary above 25000?')
print(np.sum(idx))
print (len(salaries_above_25000))

# The above two commands give the same answer
# Note: in arithmetic, True is automatically converted to 1 and False to 0.

how many people have a salary above 25000?
5
5
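
The conversion of True to 1 and False to 0 can be seen directly in a small sketch. The use of `np.mean` on a boolean array is an addition here: since each True counts as 1, the mean is the fraction of True values.

```python
import numpy as np

salary_array = np.array([20000, 30000, 10000, 40000, 500000, 35000, 60000, 20000])
idx = salary_array > 25000

print(True + True)    # 2
print(np.sum(idx))    # 5, the number of True values
print(np.mean(idx))   # 0.625, i.e. 5 out of 8 salaries are above 25000
```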


In [131]:
# What is the sum of the values above 25000?
print('sum = ' , np.sum(salaries_above_25000))

sum =  665000


#### A more practical example: What is the mean salary for females?

In [134]:
salary_array = np.array([20000,30000,10000,40000,500000,35000, 60000, 20000])
# The genders for the 8 people are
gender_array = np.array(['female', 'male','other', 'female', 'other', 'female', 'male', 'male'])


**3-step method**

In [135]:
# 1: get the indexes (boolean values)
idx_f = gender_array == 'female'
# 2: get the values (use idx_f, the boolean array from step 1)
salaries_f = salary_array[idx_f]
# 3: do the calculation
mean_f = np.mean(salaries_f)

print('1: get the indexes: ' ,idx_f)
print('2: get the values: ', salaries_f)
print('3: do the calculation: ', mean_f)

1: get the indexes:  [ True False False  True False  True False False]
2: get the values:  [20000 40000 35000]
3: do the calculation:  31666.666666666668


**2-step method**

In [136]:
# 1: get the values using boolean indexing
salaries_f = salary_array[gender_array == 'female']
# 2: do the calculation
mean_female = np.mean(salaries_f)

print('1: get the values: ', salaries_f)
print('2: calculate the mean:',  mean_female)

1: get the values:  [20000 40000 35000]
2: calculate the mean: 31666.666666666668


**1-step method**

In [137]:
mean_female = np.mean( salary_array[gender_array == 'female'] )
print (mean_female)

31666.666666666668
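
The same boolean-indexing pattern can be repeated for each group. The sketch below is one possible way to do it, using a plain `for` loop over the three gender labels that appear in the data; the variable names `gender` and `mean_salary` are introduced here for illustration.

```python
import numpy as np

salary_array = np.array([20000, 30000, 10000, 40000, 500000, 35000, 60000, 20000])
gender_array = np.array(['female', 'male', 'other', 'female', 'other', 'female', 'male', 'male'])

# Apply the 1-step method once per label
for gender in ['female', 'male', 'other']:
    mean_salary = np.mean(salary_array[gender_array == gender])
    print(gender, mean_salary)
```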


#### What are the top 2 values and bottom 2 values of females' salaries?

In [138]:
# first, sort the female salaries
sorted_f = np.sort(salaries_f)
print(sorted_f)

[20000 35000 40000]


In [139]:
# get the first 2 values of the sorted array (that are the lowest)
low_2_f = sorted_f[ :2]
print(low_2_f)

[20000 35000]


In [140]:
# get the last 2 values of the sorted array (that are the highest)
high_2_f = sorted_f[-2: ]
print(high_2_f)

[35000 40000]


**Array indexing illustration:**

<img src=array_position.png>

# Exercise

Use the same example data.

In [143]:
import numpy as np

# The salaries for the 8 people are
salary_array = np.array([20000,30000,10000,40000,500000,35000, 60000, 20000])
# The genders for the 8 people are
gender_array = np.array(['female', 'male','other', 'female', 'other', 'female', 'male', 'male'])

#### Q1. How many males are there in the data?

In [144]:
# put your code here and use print() to display your answer



#### Q2. What are the mean, std, maximum and minimum salary for males?

In [145]:
# put your code here and use print() to display your answer



#### Q3. What are the top 3 and bottom 3 salaries in the whole data?

In [146]:
# put your code here and use print() to display your answer



#### Q4. Divide the sum of the top 3 people's salaries by the sum of all 8 people's salaries to obtain the percentage.

In [None]:
# put your code here and use print() to display your answer
