# Exercises 04 - NumPy for array manipulation

This week we will work with the `numpy` library for numerical manipulations.
The reference guide for NumPy can be found here: https://docs.scipy.org/doc/numpy-1.16.1/reference/. It is
quite large, but the sections of main interest for us are in the "routines" (a synonym for "functions") section,
in particular:
- Array creation routines
- Array manipulation routines
- Input and output
- Statistics

Some exercises have been modified from https://www.machinelearningplus.com/python/101-numpy-exercises-python/. You can try more exercises there, if you like!

In [21]:
# Import the NumPy library, using `np` as an alias
import numpy as np
import os

# let's set the precision to three decimal digits

np.set_printoptions(precision=3)

### 1. Create a 1D array of numbers from 99 to 110 

Desired output: ```#> array([99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110])```

In [22]:
# Write your solution here

arr = np.arange(99, 111)
arr

array([ 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110])

### 2. Create a 3x3 numpy array of all False values 

In [23]:
# Write your solution here

np.full((3, 3), False, dtype=bool)


array([[False, False, False],
       [False, False, False],
       [False, False, False]])

### 3. Replace all odd numbers in ```arr``` with -1

Sample input: ```np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])```

Desired output: ```#>  array([ 0, -1,  2, -1,  4, -1,  6, -1,  8, -1])```

In [24]:
arr = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
arr[arr % 2 == 1] = -1
arr

array([ 0, -1,  2, -1,  4, -1,  6, -1,  8, -1])
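
The solution above modifies `arr` in place. If the original array needs to be preserved, `np.where` can build a new array instead. A minimal sketch, where the names `original` and `replaced` are illustrative only:

```python
import numpy as np

original = np.arange(10)

# np.where(condition, x, y) takes x where the condition holds and y elsewhere,
# returning a new array and leaving `original` untouched
replaced = np.where(original % 2 == 1, -1, original)

print(original)
print(replaced)
```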

### 4. Stack two arrays vertically



Consider these two numpy arrays as input:

```
a = np.arange(10).reshape(2,-1)
b = np.repeat(1, 10).reshape(2,-1)
```

```
a = [[0 1 2 3 4]
 [5 6 7 8 9]]
 
b = [[1 1 1 1 1]
 [1 1 1 1 1]]
```

Desired output:
```
c = [[0, 1, 2, 3, 4],
     [5, 6, 7, 8, 9],
     [1, 1, 1, 1, 1],
     [1, 1, 1, 1, 1]]
```

In [25]:
a = np.arange(10).reshape(2,-1)
b = np.repeat(1, 10).reshape(2,-1)

# Answers
# Method 1:
np.concatenate([a, b], axis=0)

# Method 2:
np.vstack([a, b])

# Method 3:
np.r_[a, b]
#> array([[0, 1, 2, 3, 4],
#>        [5, 6, 7, 8, 9],
#>        [1, 1, 1, 1, 1],
#>        [1, 1, 1, 1, 1]])

array([[0, 1, 2, 3, 4],
       [5, 6, 7, 8, 9],
       [1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1]])

## Exercises of data manipulation on the "500_Person_Gender_Height_Weight_Index" dataset.

We will use the "500_Person_Gender_Height_Weight_Index" dataset, which contains the height (in cm) and weight (in kg) of 500 subjects, classified by gender. You can have a look at the dataset on Kaggle: https://www.kaggle.com/yersever/500-person-gender-height-weight-bodymassindex
You do not have to download it, though. The CSV (comma-separated values) file 
can be found in the "datasets" directory.

We are not going to use the Body Mass Index (BMI) information provided in the dataset. 
We will compute the BMI later on using NumPy.

Let's import the "500_Person_Gender_Height_Weight_Index" dataset. You can keep the text column intact by passing `dtype='object'` as an argument. In this case, though, all the values, including the numeric ones, will be stored as bytes. We also have to set the `names` argument to `True`, because the first line of the file (the "header") contains the column names.

Read more details about the textual data importing functions of NumPy here: 
- `np.loadtxt()`: https://docs.scipy.org/doc/numpy-1.16.1/reference/generated/numpy.loadtxt.html#numpy.loadtxt
- `np.genfromtxt()`: https://docs.scipy.org/doc/numpy-1.16.1/reference/generated/numpy.genfromtxt.html#numpy.genfromtxt

You can see all the arguments you can supply to the functions. All the arguments that have a default value are
optional.
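
As a minimal sketch of how `names=True` and a per-column `dtype` work together, the snippet below parses a small in-memory sample rather than the real file. The sample rows, the `'U15'` string type, and the explicit `encoding` argument are assumptions made for illustration only.

```python
import io
import numpy as np

# Hypothetical sample that mimics the layout of the CSV file
sample = io.StringIO(
    "Gender,Height,Weight,Index\n"
    "Male,174,96,4\n"
    "Female,185,110,4\n"
)

# names=True takes the field names from the header line;
# dtype lists one type per column, in order
records = np.genfromtxt(
    sample,
    delimiter=',',
    names=True,
    dtype=['U15', float, float, int],
    encoding='utf-8',
)

print(records.dtype.names)
print(records['Height'])
```

The result is a 1D structured array: each column is accessed by its field name rather than by a second index.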

In [26]:
# Import `height-weight` keeping the text column intact.
url = os.path.abspath(os.path.join('..', 'datasets', '500_Person_Gender_Height_Weight_Index.csv'))
hw_dataset = np.genfromtxt(url, delimiter=',', names=True, dtype='object')
hw_dataset

array([(b'Male', b'174', b'96', b'4'), (b'Male', b'189', b'87', b'2'),
       (b'Female', b'185', b'110', b'4'),
       (b'Female', b'195', b'104', b'3'), (b'Male', b'149', b'61', b'3'),
       (b'Male', b'189', b'104', b'3'), (b'Male', b'147', b'92', b'5'),
       (b'Male', b'154', b'111', b'5'), (b'Male', b'174', b'90', b'3'),
       (b'Female', b'169', b'103', b'4'), (b'Male', b'195', b'81', b'2'),
       (b'Female', b'159', b'80', b'4'),
       (b'Female', b'192', b'101', b'3'), (b'Male', b'155', b'51', b'2'),
       (b'Male', b'191', b'79', b'2'), (b'Female', b'153', b'107', b'5'),
       (b'Female', b'157', b'110', b'5'), (b'Male', b'140', b'129', b'5'),
       (b'Male', b'144', b'145', b'5'), (b'Male', b'172', b'139', b'5'),
       (b'Male', b'157', b'110', b'5'), (b'Female', b'153', b'149', b'5'),
       (b'Female', b'169', b'97', b'4'), (b'Male', b'185', b'139', b'5'),
       (b'Female', b'172', b'67', b'2'), (b'Female', b'151', b'64', b'3'),
       (b'Male', b'190', b'95', b'

Otherwise you can import the "500_Person_Gender_Height_Weight_Index" with the correct datatypes for the numeric values, specifying `dtype` for each column.

In [27]:
url = os.path.abspath(os.path.join('..', 'datasets', '500_Person_Gender_Height_Weight_Index.csv'))
print(url)
hw_dataset = np.genfromtxt(
    url, delimiter=',', names=True, dtype=['|S15', float, float, int]
)
hw_dataset[:4]

/Users/massi/Projects/ox-conted/intoduction-to-python-programming-data-science-master/datasets/500_Person_Gender_Height_Weight_Index.csv


array([(b'Male', 174.,  96., 4), (b'Male', 189.,  87., 2),
       (b'Female', 185., 110., 4), (b'Female', 195., 104., 3)],
      dtype=[('Gender', 'S15'), ('Height', '<f8'), ('Weight', '<f8'), ('Index', '<i8')])

### 5. How to convert a 1d array of tuples to a 2d numpy array

Convert the `hw_dataset` to a numeric-only 2D array `hw_data` by omitting the "Gender" text column and the "Index" numeric field. Create a `hw_labels` 1D array containing only the "Gender" text field. Keep the same indexing/order as in the original array.

In [28]:
# Solution:
# Method 1: Convert each row to a list and keep items 1 and 2 (Height and Weight)
hw_data = np.array([list(row)[1:3] for row in hw_dataset])
hw_labels = np.array([list(row)[0] for row in hw_dataset])
print(hw_data[:4])
print(hw_labels[:4])

# Alt Method 2: Import only the needed columns from the source file, skipping the header line
hw_data = np.genfromtxt(url, delimiter=',', dtype='float', usecols=[1,2], skip_header=1)
hw_labels = np.genfromtxt(url, delimiter=',', dtype='|S15', usecols=[0], skip_header=1)
print(hw_data[:4])
print(hw_labels[:4])

[[174.  96.]
 [189.  87.]
 [185. 110.]
 [195. 104.]]
[b'Male' b'Male' b'Female' b'Female']
[[174.  96.]
 [189.  87.]
 [185. 110.]
 [195. 104.]]
[b'Male' b'Male' b'Female' b'Female']


### 6. Split the dataset into two groups

Split the dataset in `hw_data` into two groups according to the labels in `hw_labels`. Hint: you can create a dictionary with the two labels ("Male" and "Female") as keys and the two sub-arrays as values.

In [29]:
# Write your solution here

# Solution 1
hw_by_label = {}
for label in set(hw_labels):
    hw_by_label[label] = hw_data[hw_labels == label]
print(hw_by_label)

# Solution 2: dict comprehension
hw_by_label = {
    label: hw_data[hw_labels == label] for label in set(hw_labels)
}
print(hw_by_label)

{b'Male': array([[174.,  96.],
       [189.,  87.],
       [149.,  61.],
       [189., 104.],
       [147.,  92.],
       [154., 111.],
       [174.,  90.],
       [195.,  81.],
       [155.,  51.],
       [191.,  79.],
       [140., 129.],
       [144., 145.],
       [172., 139.],
       [157., 110.],
       [185., 139.],
       [190.,  95.],
       [187.,  62.],
       [179., 152.],
       [153., 121.],
       [178.,  52.],
       [144.,  80.],
       [157.,  56.],
       [161., 118.],
       [185.,  76.],
       [181., 111.],
       [161.,  72.],
       [140., 152.],
       [163., 110.],
       [172., 105.],
       [196., 116.],
       [172.,  92.],
       [178., 127.],
       [143.,  88.],
       [193.,  54.],
       [190.,  83.],
       [175., 135.],
       [178., 117.],
       [141.,  80.],
       [180.,  75.],
       [165., 104.],
       [181.,  51.],
       [164.,  75.],
       [186., 118.],
       [168., 123.],
       [198.,  50.],
       [145., 117.],
       [177.,  61.],
   

### 7. Compute the univariate statistics for each group

For each label, compute the key statistics for both weight and height:
- mean
- median
- standard deviation
- variance
- interquartile range

Read the NumPy documentation for the statistics routines carefully.
Answer the following questions:

1) Which is the gender with the highest mean value for `height`? Female 

2) Which is the gender with the smallest median value for `weight`? Male

3) Which is the gender with the shortest interquartile range for `weight`? Female

In [30]:
statistics = {}
for gender, dataset in hw_by_label.items():
    statistics[gender] = {
        'mean': np.mean(dataset, axis=0),
        'median': np.median(dataset, axis=0),
        'quantiles': np.quantile(dataset, [0.25, 0.5, 0.75], axis=0),
        'std': np.std(dataset, axis=0),
        'var': np.var(dataset, axis=0)
    }
statistics

# Interquartile range (Q3 - Q1) for weight, computed for each gender
male_iqr_w = statistics[b'Male']['quantiles'][2][1] - statistics[b'Male']['quantiles'][0][1]
female_iqr_w = statistics[b'Female']['quantiles'][2][1] - statistics[b'Female']['quantiles'][0][1]
print("Male IQR for weight: {}".format(male_iqr_w))
print("Female IQR for weight: {}".format(female_iqr_w))

Male IQR for weight: 57.0
Female IQR for weight: 56.0
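
The interquartile range (IQR) is the difference between the third and the first quartile. As a minimal sketch on a small made-up array (the `weights` values are hypothetical), `np.percentile` returns both quartiles in one call:

```python
import numpy as np

# Hypothetical weights in kg, used only to illustrate the computation
weights = np.array([50.0, 60.0, 70.0, 80.0, 90.0])

# 25th and 75th percentiles (first and third quartiles)
q1, q3 = np.percentile(weights, [25, 75])
iqr = q3 - q1

print(q1, q3, iqr)
```

For a 2D array such as the per-gender datasets, passing `axis=0` yields one pair of quartiles per column.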


### 8. Compute the covariance and correlation matrix for each group

1) Compute the `height`-`weight` covariance matrix for each gender. Do the values on the diagonal match the variances computed in the previous step? If not, can you work out why, and how to obtain consistent values?

2) Compute the `height`-`weight` correlation matrix for each gender.

In [31]:
# Compute the covariance matrix here

cov_mat = {
    label: dict(cov=np.cov(dataset.T)) for label, dataset in hw_by_label.items()
}
print(cov_mat)

# np.var divides by N by default (ddof=0), while np.cov divides by N - 1 by default;
# passing ddof=0 to np.cov makes the diagonal match np.var

cov_mat = {
    label: dict(cov=np.cov(dataset.T, ddof=0)) for label, dataset in hw_by_label.items()
}
cov_mat

{b'Male': {'cov': array([[ 291.237,  -14.197],
       [ -14.197, 1013.323]])}, b'Female': {'cov': array([[ 246.861,   14.278],
       [  14.278, 1086.495]])}}


{b'Male': {'cov': array([[ 290.048,  -14.139],
         [ -14.139, 1009.187]])},
 b'Female': {'cov': array([[ 245.893,   14.222],
         [  14.222, 1082.234]])}}
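
The gap between the two results comes only from the normalisation factor. A minimal sketch on a made-up 1D sample (the values of `x` are hypothetical) shows the same effect with `np.var` and `np.cov`:

```python
import numpy as np

# Hypothetical sample: mean 5, sum of squared deviations 32
x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

population_var = np.var(x)       # ddof=0: divides by N
sample_var = np.var(x, ddof=1)   # ddof=1: divides by N - 1
cov_default = np.cov(x)          # divides by N - 1 unless told otherwise
cov_ddof0 = np.cov(x, ddof=0)    # divides by N, like np.var

print(population_var, sample_var)
print(float(cov_default), float(cov_ddof0))
```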

In [32]:
# Compute the correlation matrix here
cor_mat = {
    label: dict(corr=np.corrcoef(dataset.T)) for label, dataset in hw_by_label.items()
}
print(cor_mat)

{b'Male': {'corr': array([[ 1.   , -0.026],
       [-0.026,  1.   ]])}, b'Female': {'corr': array([[1.   , 0.028],
       [0.028, 1.   ]])}}


### 9. How to create a new column from existing columns of a numpy array

Create a new column for the Body Mass Index (BMI) in `hw_data`, where BMI is:

$$\mathrm{BMI} = \frac{\text{weight [kg]}}{(\text{height [m]})^2}$$



In [33]:
# Compute BMI
# (remember to convert the heights from cm to m!)
bmi = hw_data[:, 1] / (hw_data[:, 0]/100)**2

# Add a new axis so that bmi becomes a column with the same number of dimensions as hw_data
bmi = bmi[:, np.newaxis]

hw_data = np.hstack((hw_data, bmi))
hw_data

array([[174.   ,  96.   ,  31.708],
       [189.   ,  87.   ,  24.355],
       [185.   , 110.   ,  32.14 ],
       ...,
       [141.   , 136.   ,  68.407],
       [150.   ,  95.   ,  42.222],
       [173.   , 131.   ,  43.77 ]])

### 10. Convert a quantitative variable to a categorical one

Now create an array `hw_bmi_labels` where you assign each record in `hw_data` to one of these categories:
- 'UNDERWEIGHT': BMI < 18.5
- 'NORMAL': 18.5 <= BMI < 25
- 'OVERWEIGHT': 25 <= BMI < 30
- 'OBESE': BMI >= 30
    
Then, count the number of occurrences of each category per gender.

In [34]:
# Label the record
# Hint: write a function to assign labels to the record then use the numpy appropriate function 
# to apply the function to each row
def assign_label(row):
    bmi = row[-1]
    if bmi < 18.5:
        return 'UNDERWEIGHT'
    if bmi >= 18.5 and bmi < 25.0:
        return 'NORMAL'
    if bmi >= 25.0 and bmi < 30.0:
        return 'OVERWEIGHT'
    if bmi >= 30.0:
        return 'OBESE'
    
hw_bmi_labels = np.apply_along_axis(assign_label, 1, hw_data)
# Better option: `np.apply_along_axis` sizes its output from the first label returned,
# so longer labels get truncated
hw_bmi_labels = np.array([assign_label(row) for row in hw_data])
hw_bmi_labels

mbi_labels_by_gender = {
    gender: hw_bmi_labels[hw_labels == gender] for gender in set(hw_labels)
}
print(mbi_labels_by_gender)

cts = {}
for gender in set(hw_labels):
    unique, counts = np.unique(mbi_labels_by_gender[gender], return_counts=True)
    cts[gender] = dict(zip(unique, counts))
cts

{b'Male': array(['OBESE', 'NORMAL', 'OVERWEIGHT', 'OVERWEIGHT', 'OBESE', 'OBESE',
       'OVERWEIGHT', 'NORMAL', 'NORMAL', 'NORMAL', 'OBESE', 'OBESE',
       'OBESE', 'OBESE', 'OBESE', 'OVERWEIGHT', 'UNDERWEIGHT', 'OBESE',
       'OBESE', 'UNDERWEIGHT', 'OBESE', 'NORMAL', 'OBESE', 'NORMAL',
       'OBESE', 'OVERWEIGHT', 'OBESE', 'OBESE', 'OBESE', 'OBESE', 'OBESE',
       'OBESE', 'OBESE', 'UNDERWEIGHT', 'NORMAL', 'OBESE', 'OBESE',
       'OBESE', 'NORMAL', 'OBESE', 'UNDERWEIGHT', 'OVERWEIGHT', 'OBESE',
       'OBESE', 'UNDERWEIGHT', 'OBESE', 'NORMAL', 'OBESE', 'OBESE',
       'OBESE', 'UNDERWEIGHT', 'OBESE', 'OBESE', 'OBESE', 'OBESE',
       'OBESE', 'OBESE', 'OVERWEIGHT', 'OBESE', 'OVERWEIGHT', 'OBESE',
       'OBESE', 'OBESE', 'OBESE', 'OBESE', 'OBESE', 'NORMAL',
       'OVERWEIGHT', 'OBESE', 'OBESE', 'OBESE', 'OBESE', 'OBESE', 'OBESE',
       'OVERWEIGHT', 'OVERWEIGHT', 'OVERWEIGHT', 'OBESE', 'OBESE',
       'NORMAL', 'OVERWEIGHT', 'NORMAL', 'OBESE', 'OBESE', 'OBESE',
       'OBESE'

{b'Male': {'NORMAL': 28, 'OBESE': 165, 'OVERWEIGHT': 31, 'UNDERWEIGHT': 21},
 b'Female': {'NORMAL': 38, 'OBESE': 167, 'OVERWEIGHT': 37, 'UNDERWEIGHT': 13}}
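
As an alternative to calling a Python function on every row, the thresholds can be applied to the whole column at once with `np.digitize`. This is a sketch under stated assumptions: the `bmi_values` array is made up, and the names `categories` and `thresholds` are illustrative.

```python
import numpy as np

# Hypothetical BMI values chosen to hit every category and every boundary
bmi_values = np.array([17.0, 18.5, 24.9, 25.0, 29.9, 30.0, 42.0])

categories = np.array(['UNDERWEIGHT', 'NORMAL', 'OVERWEIGHT', 'OBESE'])
thresholds = [18.5, 25.0, 30.0]

# With the default right=False, np.digitize returns index i such that
# thresholds[i-1] <= value < thresholds[i]; that index selects the category
labels = categories[np.digitize(bmi_values, thresholds)]

print(labels)
```

Because `categories` is created with a string dtype wide enough for the longest label, no label is truncated.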

### 11. Compute the frequencies of each category within genders

Compute the percentages of the four categories ('UNDERWEIGHT', 'NORMAL', 'OVERWEIGHT', 'OBESE'), and answer these questions:

- Which gender has the highest percentage of 'OBESE' subjects?
- Which gender has the highest percentage of subjects that are neither 'OVERWEIGHT' nor 'OBESE'?

In [35]:
# Compute the percentages here:


prc_cts = {}
for gender, counts in cts.items():
    total_count = sum(counts.values())
    print('Total count for Gender {} is {}'.format(gender, total_count))
    prc_cts[gender] = {
        label: (count/total_count)*100 for label, count in counts.items()
    }
prc_cts

Total count for Gender b'Male' is 245
Total count for Gender b'Female' is 255


{b'Male': {'NORMAL': 11.428571428571429,
  'OBESE': 67.3469387755102,
  'OVERWEIGHT': 12.653061224489795,
  'UNDERWEIGHT': 8.571428571428571},
 b'Female': {'NORMAL': 14.901960784313726,
  'OBESE': 65.49019607843137,
  'OVERWEIGHT': 14.50980392156863,
  'UNDERWEIGHT': 5.098039215686274}}

### 12. Normalize an array so the values range exactly between 0 and 1

Normalization is an important pre-processing step before feeding a dataset to a data science (e.g. machine learning) algorithm.
Create a normalized form of the "height" and "weight" columns of `hw_data`, rescaled so that the minimum maps to exactly 0 and the maximum to exactly 1.

In [36]:
# Write your solution here
h_min, h_max = hw_data[:, 0].min(), hw_data[:, 0].max()
w_min, w_max = hw_data[:, 1].min(), hw_data[:, 1].max()

print('Max weight: {} - Min weight: {} - Max height: {} - Min height: {}'.format(
    w_max, w_min, h_max, h_min
))

normalized_h = (hw_data[:, 0] - h_min)/(h_max - h_min)
normalized_w = (hw_data[:, 1] - w_min)/(w_max - w_min)
print(normalized_h)
# or 
normalized_h = (hw_data[:, 0] - h_min)/np.ptp(hw_data[:, 0])  # Thanks, David Ojeda!
normalized_w = (hw_data[:, 1] - w_min)/np.ptp(hw_data[:, 1])
print(normalized_h)

Max weight: 160.0 - Min weight: 50.0 - Max height: 199.0 - Min height: 140.0
[0.576 0.831 0.763 0.932 0.153 0.831 0.119 0.237 0.576 0.492 0.932 0.322
 0.881 0.254 0.864 0.22  0.288 0.    0.068 0.542 0.288 0.22  0.492 0.763
 0.542 0.186 0.847 0.797 0.39  0.661 0.22  0.644 0.932 0.339 0.288 0.831
 0.966 0.068 0.525 0.763 0.593 0.153 0.288 0.356 0.712 0.763 0.814 0.695
 0.356 0.    0.475 0.61  0.39  0.542 0.949 0.797 0.542 0.644 0.407 0.051
 0.864 0.017 0.898 0.847 0.593 0.661 0.542 0.475 0.407 0.915 0.22  0.644
 0.017 0.678 0.763 0.966 0.424 0.475 0.61  0.695 0.407 0.441 0.847 0.78
 0.475 0.983 0.593 0.085 0.322 0.763 0.644 0.729 0.915 0.627 0.966 0.508
 0.034 0.339 0.932 0.847 1.    0.237 0.356 0.983 0.881 0.932 0.441 0.322
 0.695 0.153 0.169 0.102 0.847 0.881 0.627 0.136 0.424 0.102 0.068 0.61
 0.475 0.797 0.797 0.746 0.305 0.305 0.915 0.085 0.712 0.237 0.475 0.797
 0.305 0.458 0.525 0.729 0.847 0.915 0.525 0.322 0.492 0.458 0.678 0.39
 0.    0.966 0.915 0.    0.932 0.475 0.949 0.    0
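
Both columns can also be rescaled in a single expression by reducing along `axis=0` and letting broadcasting do the rest. A minimal sketch on a made-up array (the `data` values are hypothetical and only mimic the height/weight layout):

```python
import numpy as np

# Hypothetical (height, weight) rows, used only for illustration
data = np.array([[140.0, 50.0],
                 [170.0, 105.0],
                 [199.0, 160.0]])

col_min = data.min(axis=0)               # one minimum per column
col_range = data.max(axis=0) - col_min   # one range per column

# Broadcasting applies the per-column minimum and range to every row
normalized = (data - col_min) / col_range

print(normalized)
```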

You will be able to find more exercises on numpy here: https://www.machinelearningplus.com/python/101-numpy-exercises-python/