# Exercises 04 - NumPy for array manipulation

This week we will work with the `numpy` library for numerical manipulations.
The reference guide for NumPy can be found here: https://docs.scipy.org/doc/numpy-1.16.1/reference/. It is
very large, but the sections of most interest to us are in the "routines" (a synonym for "functions") section,
in particular:
- Array creation routines
- Array manipulation routines
- Input and output
- Statistics

Some exercises have been modified from https://www.machinelearningplus.com/python/101-numpy-exercises-python/. You can try more exercises there, if you like!

In [1]:
# Import the NumPy library, using `np` as an alias
import numpy as np

### 1. Create a 1D array of numbers from 99 to 110 

Desired output: ```#> array([99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110])```

In [2]:
# Write your solution here



### 2. Create a 3x3 numpy array of all False values 

In [3]:
# Write your solution here



### 3. Replace all odd numbers in ```arr``` with -1

Sample input: ```np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])```

Desired output: ```#>  array([ 0, -1,  2, -1,  4, -1,  6, -1,  8, -1])```

In [4]:
# Sample input
arr = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

# Write your solution here:


### 4. Stack two arrays vertically



Consider these two numpy arrays as input:

```
a = np.arange(10).reshape(2,-1)
b = np.repeat(1, 10).reshape(2,-1)
```

```
a = [[0 1 2 3 4]
 [5 6 7 8 9]]
 
b = [[1 1 1 1 1]
 [1 1 1 1 1]]
```

Desired output:
```
c = [[0, 1, 2, 3, 4],
     [5, 6, 7, 8, 9],
     [1, 1, 1, 1, 1],
     [1, 1, 1, 1, 1]]
```

In [5]:
a = np.arange(10).reshape(2,-1)
b = np.repeat(1, 10).reshape(2,-1)

# Write your solution here

## Data manipulation exercises on the "500_Person_Gender_Height_Weight_Index" dataset

We will use the "500_Person_Gender_Height_Weight_Index" dataset, which contains the height (in cm) and weight (in kg) of 500 subjects, classified by gender. You can have a look at the dataset on Kaggle: https://www.kaggle.com/yersever/500-person-gender-height-weight-bodymassindex
You do not have to download it, though. The CSV (comma-separated values) file 
can be found in the "datasets" directory.

We are not going to use the Body Mass Index (BMI) information provided in the dataset. 
We will compute the BMI later on using NumPy.

Let's import the "500_Person_Gender_Height_Weight_Index" dataset. You can import it keeping the text column intact by passing `dtype='object'` as an argument. In this case, though, every value, including the numeric ones, is stored as a bytes string. We also have to set the `names` argument to `True`, because the first line of the file (the "header") contains the column names. 

Read more details about NumPy's text-import functions here: 
- `np.loadtxt()`: https://docs.scipy.org/doc/numpy-1.16.1/reference/generated/numpy.loadtxt.html#numpy.loadtxt
- `np.genfromtxt()`: https://docs.scipy.org/doc/numpy-1.16.1/reference/generated/numpy.genfromtxt.html#numpy.genfromtxt

You can see all the arguments you can supply to the functions. All the arguments that have a default value are
optional.
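
As a rough illustration of the `delimiter`, `names` and `dtype` arguments, here is a minimal sketch that reads a tiny CSV from memory instead of from a file. The text in `csv_text` and its column names are made up for this example and are not part of the course dataset.

```python
import io
import numpy as np

# Hypothetical three-line CSV held in memory, so the sketch runs without any file.
csv_text = "Name,Score\nann,1.5\nbob,2.0\n"

# names=True reads the field names from the header line;
# dtype gives one type per column (a 10-byte string and a float).
toy = np.genfromtxt(io.StringIO(csv_text), delimiter=",", names=True,
                    dtype=["S10", float])

print(toy.dtype.names)  # field names taken from the header
print(toy["Score"])     # a single column, selected by field name
```

The result is a 1D structured array: each element is one row of the file, and each column is reachable by its field name.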

In [6]:
import os
# Import `height-weight` keeping the text column intact.
url = os.path.join('..', 'datasets', '500_Person_Gender_Height_Weight_Index.csv')
hw_dataset = np.genfromtxt(url, delimiter=',', names=True, dtype='object')
hw_dataset

array([(b'Male', b'174', b'96', b'4'), (b'Male', b'189', b'87', b'2'),
       (b'Female', b'185', b'110', b'4'),
       (b'Female', b'195', b'104', b'3'), (b'Male', b'149', b'61', b'3'),
       (b'Male', b'189', b'104', b'3'), (b'Male', b'147', b'92', b'5'),
       (b'Male', b'154', b'111', b'5'), (b'Male', b'174', b'90', b'3'),
       (b'Female', b'169', b'103', b'4'), (b'Male', b'195', b'81', b'2'),
       (b'Female', b'159', b'80', b'4'),
       (b'Female', b'192', b'101', b'3'), (b'Male', b'155', b'51', b'2'),
       (b'Male', b'191', b'79', b'2'), (b'Female', b'153', b'107', b'5'),
       (b'Female', b'157', b'110', b'5'), (b'Male', b'140', b'129', b'5'),
       (b'Male', b'144', b'145', b'5'), (b'Male', b'172', b'139', b'5'),
       (b'Male', b'157', b'110', b'5'), (b'Female', b'153', b'149', b'5'),
       (b'Female', b'169', b'97', b'4'), (b'Male', b'185', b'139', b'5'),
       (b'Female', b'172', b'67', b'2'), (b'Female', b'151', b'64', b'3'),
       (b'Male', b'190', b'95', b'

Alternatively, you can import the "500_Person_Gender_Height_Weight_Index" dataset with the correct data types for the numeric values by specifying a `dtype` for each column.

In [7]:
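url = os.path.join('..', 'datasets', '500_Person_Gender_Height_Weight_Index.csv')
# One dtype per column: a 15-byte string, two floats and an integer.
# The built-in `float` and `int` replace the `np.float` and `np.int` aliases,
# which were removed in NumPy 1.24.
hw_dataset = np.genfromtxt(
    url, delimiter=',', names=True, dtype=['|S15', float, float, int]
)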
hw_dataset[:4]

array([(b'Male', 174.,  96., 4), (b'Male', 189.,  87., 2),
       (b'Female', 185., 110., 4), (b'Female', 195., 104., 3)],
      dtype=[('Gender', 'S15'), ('Height', '<f8'), ('Weight', '<f8'), ('Index', '<i8')])

### 5. How to convert a 1D array of tuples to a 2D NumPy array

Convert `hw_dataset` to a numeric-only 2D array `hw_data` by omitting the "Gender" text column and the "Index" numeric field. Create a 1D array `hw_labels` containing only the "Gender" text field. Keep the same order as in the original array.
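
As a hint, the sketch below shows one possible way to go from a structured array to a plain 2D array, using a made-up array. The names `records`, `tag`, `x` and `y` are hypothetical; only the idea of selecting fields by name and stacking them carries over to the exercise.

```python
import numpy as np

# Hypothetical structured array with one text field and two numeric fields.
records = np.array(
    [(b"a", 1.0, 10.0), (b"b", 2.0, 20.0), (b"a", 3.0, 30.0)],
    dtype=[("tag", "S5"), ("x", float), ("y", float)],
)

# Each field is a 1D array; stacking them as columns gives a plain 2D array.
numeric = np.column_stack([records["x"], records["y"]])
tags = records["tag"]

print(numeric.shape)  # (3, 2)
print(tags)
```

Row `i` of `numeric` and element `i` of `tags` still refer to the same record, because neither operation reorders the data.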

In [8]:
# Write your solution here

### 6. Split the datasets in two groups

Split the dataset in `hw_data` according to the labels in `hw_labels`. Hint: you can create a dictionary with the two labels ("Male" and "Female") as keys and the two split datasets as values.
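
The dictionary idea can be sketched with boolean masks on made-up data. The arrays `values` and `tags` below are hypothetical stand-ins, and this is only one possible approach.

```python
import numpy as np

# Hypothetical toy data: one label per row of `values`.
values = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])
tags = np.array([b"a", b"b", b"a", b"b"])

# `tags == label` is a boolean mask that selects the rows carrying that label.
# .tolist() turns the NumPy scalars into plain Python bytes for the dictionary keys.
groups = {label: values[tags == label] for label in np.unique(tags).tolist()}

for label, rows in groups.items():
    print(label, rows.shape)
```

Note that text read with an `S` dtype is stored as bytes, so the keys here are `b"a"` and `b"b"` rather than ordinary strings.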

In [9]:
# Write your solution here



### 7. Compute the univariate statistics for each group

For each label, compute the following key statistics for both weight and height:
- mean
- median
- standard deviation
- variance
- interquartile range

Read the NumPy documentation for the statistics routines carefully, then
answer the following questions:

1) Which gender has the highest mean `height`?

2) Which gender has the smallest median `weight`?

3) Which gender has the narrowest interquartile range for `weight`?
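
NumPy has no dedicated function for the interquartile range, so it has to be built from percentiles. A minimal sketch on a made-up sample (the array `sample` is hypothetical and unrelated to the dataset):

```python
import numpy as np

# Hypothetical sample, unrelated to the dataset.
sample = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0])

# The interquartile range is the distance between the 75th and 25th percentiles.
q25, q75 = np.percentile(sample, [25, 75])
iqr = q75 - q25

print(np.mean(sample), np.median(sample))  # 5.0 5.0
print(iqr)                                 # 4.0
```

For a 2D array, the statistics routines accept an `axis` argument, so each column can be summarized in a single call.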

In [None]:
# Write your solution here



### 8. Compute the covariance and correlation matrices for each group

1) Compute the `height`-`weight` covariance matrix for each gender. Do the values on the diagonal match the variances computed in the previous step? If not, can you explain why, and how you could obtain consistent values?

2) Compute the `height`-`weight` correlation matrix for each gender.
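
One detail that is easy to miss is the orientation that `np.cov` and `np.corrcoef` expect. The sketch below uses a made-up array `obs` (hypothetical, not the dataset) with one record per row.

```python
import numpy as np

# Hypothetical observations: one row per record, one column per variable.
obs = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]])

# By default np.cov and np.corrcoef treat each ROW as a variable;
# rowvar=False tells them that the variables are in the columns instead.
cov = np.cov(obs, rowvar=False)
corr = np.corrcoef(obs, rowvar=False)

print(cov.shape)   # (2, 2)
print(corr[0, 1])  # close to 1: the second column is exactly twice the first
```

For the question about the diagonal, it may help to look up the `ddof` parameter, which both `np.var` and `np.cov` accept.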

In [None]:
# Compute the covariance matrix here:



In [None]:
# Compute the correlation matrix here



### 9. How to create a new column from existing columns of a numpy array

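Create a new column for the Body Mass Index (called "MBI" in the rest of this notebook) in `hw_data`, where MBI is:

$$\mathrm{MBI} = \frac{\mathrm{weight\ [kg]}}{(\mathrm{height\ [m]})^2}$$

Heights in the dataset are given in cm, so convert them to metres first.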
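
As a worked instance of the formula, the sketch below applies it to two made-up records and appends the result as a new column. The array `toy` and its column order (height in cm first, weight in kg second) are assumptions for illustration only.

```python
import numpy as np

# Hypothetical records: height in cm, weight in kg.
toy = np.array([[200.0, 80.0], [160.0, 64.0]])

height_m = toy[:, 0] / 100.0        # convert cm to m
ratio = toy[:, 1] / height_m ** 2   # weight / height^2, computed element-wise

# Append the result as a third column.
toy_with_ratio = np.column_stack([toy, ratio])
print(toy_with_ratio)
```

For the first record this gives 80 / 2.0² = 20, and for the second 64 / 1.6² = 25.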



In [None]:
# Compute MBI here:



### 10. Convert a quantitative variable to a categorical one

Now create an array `hw_mbi_labels` that assigns each record in `hw_data` to one of these categories:
- 'UNDERWEIGHT': MBI < 18.5
- 'NORMAL': 18.5 <= MBI < 25
- 'OVERWEIGHT': 25 <= MBI < 30
- 'OBESE': MBI >= 30
    
Then, count the number of occurrences of each category per gender.
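
Binning a numeric variable and counting the resulting labels can be sketched on unrelated, made-up data. This example uses `np.select`, which is a different technique from writing a function and applying it to each row; the temperatures, thresholds and category names are all hypothetical.

```python
import numpy as np

# Hypothetical temperatures in degrees Celsius, with made-up thresholds.
temps = np.array([5.0, 18.0, 31.0, 12.0, 25.0])

# np.select picks, for each element, the choice of the first condition that holds.
conditions = [temps < 10, temps < 25]
choices = ["COLD", "MILD"]
temp_labels = np.select(conditions, choices, default="HOT")

# np.unique with return_counts=True counts the occurrences of each label.
names, counts = np.unique(temp_labels, return_counts=True)
print(dict(zip(names.tolist(), counts.tolist())))
```

Because the conditions are checked in order, the second one only needs an upper bound: any value that reaches it has already failed `temps < 10`.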

In [None]:
# Label the records here
# Hint: write a function that assigns a label to a record, then use the appropriate NumPy function
# to apply it to each row



### 11. Compute the frequencies of each category within genders

Compute the percentages of the four categories ('UNDERWEIGHT', 'NORMAL', 'OVERWEIGHT', 'OBESE') within each gender, and answer these questions:

- Which gender has the highest percentage of 'OBESE' subjects?
- Which gender has the highest percentage of subjects that are neither 'OVERWEIGHT' nor 'OBESE'?
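
Turning counts into within-group percentages can be sketched as follows. The dictionary `group_labels`, its keys and its labels are made up for illustration and do not come from the dataset.

```python
import numpy as np

# Hypothetical labels for two groups of different sizes.
group_labels = {
    "g1": np.array(["HOT", "COLD", "HOT", "MILD"]),
    "g2": np.array(["COLD", "COLD", "MILD", "MILD", "HOT"]),
}

percentages = {}
for group, labels in group_labels.items():
    names, counts = np.unique(labels, return_counts=True)
    # Dividing by the group size makes groups of different sizes comparable.
    shares = (100 * counts / labels.size).tolist()
    percentages[group] = dict(zip(names.tolist(), shares))

print(percentages)
```

Comparing raw counts would favour the larger group, which is why each count is divided by the size of its own group rather than by the total number of records.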

In [None]:
# Compute the percentages here:



### 12. Normalize an array so the values range exactly between 0 and 1

Normalization is an important pre-processing step before feeding a dataset to a data science algorithm (e.g. a machine learning model).
Create a normalized form of the "height" and "weight" columns of `hw_data`, so that the minimum of each column maps to 0 and its maximum to 1.
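
This kind of rescaling is usually called min-max normalization: each value `x` becomes `(x - min) / (max - min)`. A minimal sketch on a made-up array (the name `toy` and its values are hypothetical):

```python
import numpy as np

# Hypothetical 2D array: one row per record, one column per variable.
toy = np.array([[1.0, 10.0], [3.0, 30.0], [5.0, 20.0]])

# axis=0 takes the minimum and maximum of each column separately.
col_min = toy.min(axis=0)
col_max = toy.max(axis=0)

# Broadcasting applies the per-column values to every row.
scaled = (toy - col_min) / (col_max - col_min)
print(scaled)
```

Computing the minimum and maximum per column matters here: using a single global minimum and maximum would leave at most one column spanning the full range from 0 to 1.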

In [None]:
# Write your solution here 



You can find more NumPy exercises here: https://www.machinelearningplus.com/python/101-numpy-exercises-python/