# Exercises 04 - NumPy for array manipulation

This week we will work with the `numpy` library for numerical manipulations.
The reference guide for numpy can be found here: https://docs.scipy.org/doc/numpy-1.16.1/reference/. It is
quite large, but the sections of main interest for us are in the "routines" (a synonym for "functions") section,
in particular:

- Array creation routines
- Array manipulation routines
- Input and output
- Statistics

Some exercises have been modified from https://www.machinelearningplus.com/python/101-numpy-exercises-python/. You can try more exercises there, if you like!

In [1]:
# Import the NumPy library, using `np` as an alias
import numpy as np

### 1. Create a 1D array of numbers from 99 to 110 

Desired output: ```#> array([99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110])```

In [4]:
# Write your solution here

new_array = np.arange(99,111)
new_array

array([ 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110])
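
Note that `np.arange` excludes the stop value, which is why the stop has to be 111 to get an array ending at 110. The sketch below illustrates this, with `np.linspace` as a possible alternative that includes both endpoints (the variable names are only illustrative):

```python
import numpy as np

# np.arange uses a half-open interval: start is included, stop is excluded
with_arange = np.arange(99, 111)

# np.linspace includes both endpoints; 12 points from 99 to 110 give a step of 1
with_linspace = np.linspace(99, 110, num=12, dtype=int)

print(with_arange[-1])                              # last value of the range
print(np.array_equal(with_arange, with_linspace))   # both approaches agree
```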

### 2. Create a 3x3 numpy array of all False values 

In [12]:
# Write your solution here
false_array= np.ones((3,3))>2
false_array

array([[False, False, False],
       [False, False, False],
       [False, False, False]])
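
A boolean array can be built in several equivalent ways. The following sketch compares three of them; which one reads best is a matter of taste, and the variable names are illustrative:

```python
import numpy as np

# Zeros interpreted as booleans are all False
via_zeros = np.zeros((3, 3), dtype=bool)

# np.full fills an array of the given shape with a constant value
via_full = np.full((3, 3), False)

# Any comparison that is never satisfied also yields an all-False array
via_comparison = np.ones((3, 3)) > 2

print(via_zeros.dtype)
print(np.array_equal(via_zeros, via_full) and np.array_equal(via_full, via_comparison))
```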

### 3. Replace all odd numbers in ```arr``` with -1

Sample input: ```np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])```

Desired output: ```#>  array([ 0, -1,  2, -1,  4, -1,  6, -1,  8, -1])```

In [13]:
# Sample input
arr = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

# Write your solution here:
arr[arr%2==1]=-1
arr

array([ 0, -1,  2, -1,  4, -1,  6, -1,  8, -1])
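
Assigning through a boolean mask modifies `arr` in place. If the original array should be kept, `np.where` can build a new array instead. A minimal sketch (the name `replaced` is illustrative):

```python
import numpy as np

arr = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

# Where the condition holds take -1, elsewhere take the original value
replaced = np.where(arr % 2 == 1, -1, arr)

print(replaced)
print(arr)  # the input array is left untouched
```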

### 4. Stack two arrays vertically



Consider these two numpy arrays as input:

```
a = np.arange(10).reshape(2,-1)
b = np.repeat(1, 10).reshape(2,-1)
```

```
a = [[0 1 2 3 4]
 [5 6 7 8 9]]
 
b = [[1 1 1 1 1]
 [1 1 1 1 1]]
```

Desired output:
```
c = [[0, 1, 2, 3, 4],
     [5, 6, 7, 8, 9],
     [1, 1, 1, 1, 1],
     [1, 1, 1, 1, 1]]
```

In [36]:
a = np.arange(10).reshape(2,-1)
b = np.repeat(1, 10).reshape(2,-1)

# Write your solution here
c= np.vstack((a,b))
print("c=",c)

c= [[0 1 2 3 4]
 [5 6 7 8 9]
 [1 1 1 1 1]
 [1 1 1 1 1]]
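
For 2D inputs, vertical stacking is the same as concatenating along the first axis. The sketch below shows that equivalence, and horizontal stacking for contrast (the result names are illustrative):

```python
import numpy as np

a = np.arange(10).reshape(2, -1)
b = np.repeat(1, 10).reshape(2, -1)

stacked = np.vstack((a, b))                     # rows of b appended below a
concatenated = np.concatenate((a, b), axis=0)   # same result for 2D arrays
side_by_side = np.hstack((a, b))                # columns of b appended to the right

print(stacked.shape, side_by_side.shape)
```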


## Exercises of data manipulation on the "500_Person_Gender_Height_Weight_Index" dataset.

We will use the "500_Person_Gender_Height_Weight_Index" dataset, which contains the height (in cm) and weight (in kg) of 500 subjects, classified by gender. You can have a look at the dataset on Kaggle: https://www.kaggle.com/yersever/500-person-gender-height-weight-bodymassindex
You do not have to download it, though: the CSV (comma-separated values) file
can be found in the "datasets" directory.

We are not going to use the Body Mass Index (BMI) information provided in the dataset. 
We will compute the BMI later on using Numpy.

Let's import the "500_Person_Gender_Height_Weight_Index" dataset. You can import the dataset keeping the text column intact by passing `dtype='object'` as an argument. In this case, though, all the values, including the numeric ones, will be stored as bytes. We also have to set the `names` argument to `True`, because the first line of the file (the "header") contains the column names.

Read more details about the textual data importing functions of Numpy here: 
- `np.loadtxt()`: https://docs.scipy.org/doc/numpy-1.16.1/reference/generated/numpy.loadtxt.html#numpy.loadtxt
- `np.genfromtxt()`: https://docs.scipy.org/doc/numpy-1.16.1/reference/generated/numpy.genfromtxt.html#numpy.genfromtxt

You can see all the arguments you can supply to the functions. All the arguments that have a default value are
optional.
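
To experiment with these arguments without touching the file on disk, `np.genfromtxt` can also read from a file-like object. The sketch below uses a small in-memory CSV with the header and two rows of the dataset. Passing `dtype=None` (to let NumPy infer a type per column) and `encoding='utf-8'` (to get `str` instead of bytes) are choices made for this illustration, not the approach used in the exercise:

```python
import io
import numpy as np

# A tiny in-memory stand-in for the CSV file: the header plus two data rows
csv_text = "Gender,Height,Weight,Index\nMale,174,96,4\nFemale,185,110,4\n"

sample = np.genfromtxt(
    io.StringIO(csv_text),
    delimiter=',',
    names=True,        # take the field names from the header line
    dtype=None,        # infer one dtype per column
    encoding='utf-8',  # decode text columns as str rather than bytes
)

print(sample.dtype.names)
print(sample['Height'])
```

The result is a 1D structured array: each column is accessed by its field name rather than by a column index.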

In [16]:
import os
# Import `height-weight` keeping the text column intact.
url = os.path.join('..', 'datasets', '500_Person_Gender_Height_Weight_Index.csv')
hw_dataset = np.genfromtxt(url, delimiter=',', names=True, dtype='object')
hw_dataset

array([(b'Male', b'174', b'96', b'4'), (b'Male', b'189', b'87', b'2'),
       (b'Female', b'185', b'110', b'4'),
       (b'Female', b'195', b'104', b'3'), (b'Male', b'149', b'61', b'3'),
       (b'Male', b'189', b'104', b'3'), (b'Male', b'147', b'92', b'5'),
       (b'Male', b'154', b'111', b'5'), (b'Male', b'174', b'90', b'3'),
       (b'Female', b'169', b'103', b'4'), (b'Male', b'195', b'81', b'2'),
       (b'Female', b'159', b'80', b'4'),
       (b'Female', b'192', b'101', b'3'), (b'Male', b'155', b'51', b'2'),
       (b'Male', b'191', b'79', b'2'), (b'Female', b'153', b'107', b'5'),
       (b'Female', b'157', b'110', b'5'), (b'Male', b'140', b'129', b'5'),
       (b'Male', b'144', b'145', b'5'), (b'Male', b'172', b'139', b'5'),
       (b'Male', b'157', b'110', b'5'), (b'Female', b'153', b'149', b'5'),
       (b'Female', b'169', b'97', b'4'), (b'Male', b'185', b'139', b'5'),
       (b'Female', b'172', b'67', b'2'), (b'Female', b'151', b'64', b'3'),
       (b'Male', b'190', b'95', b'

Alternatively, you can import the "500_Person_Gender_Height_Weight_Index" dataset with the correct data types for the numeric values by specifying a `dtype` for each column.

In [15]:
url = os.path.join('..', 'datasets', '500_Person_Gender_Height_Weight_Index.csv')
hw_dataset = np.genfromtxt(
    # The builtin float and int replace np.float and np.int, which were
    # deprecated in NumPy 1.20 and later removed
    url, delimiter=',', names=True, dtype=['|S15', float, float, int]
)
hw_dataset[:4]

array([(b'Male', 174.,  96., 4), (b'Male', 189.,  87., 2),
       (b'Female', 185., 110., 4), (b'Female', 195., 104., 3)],
      dtype=[('Gender', 'S15'), ('Height', '<f8'), ('Weight', '<f8'), ('Index', '<i4')])

### 5. How to convert a 1d array of tuples to a 2d numpy array

Convert `hw_dataset` to a numeric-only 2D array `hw_data` by omitting the "Gender" text column and the "Index" numeric field. Create a 1D array `hw_labels` containing only the "Gender" text field. Keep the same indexing/order as in the original array.

In [3]:
# Write your solution here
# Re-read the file selecting columns by position; skip_header is the number of lines to skip
hw_data = np.genfromtxt(url, delimiter=',', dtype='float', usecols=[1,2], skip_header=1)
hw_labels = np.genfromtxt(url, delimiter=',', dtype='|S15', usecols=[0], skip_header=1)
print(hw_data[:4])
print(hw_labels[:4])

[[174.  96.]
 [189.  87.]
 [185. 110.]
 [195. 104.]]
[b'Male' b'Male' b'Female' b'Female']
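
The same result can probably be obtained without reading the file a second time, by selecting fields of the structured array by name. The sketch below assumes a structured array with the same field names and dtypes as `hw_dataset`. It is built by hand here from the first three records, under the hypothetical name `records`, so that the example is self-contained:

```python
import numpy as np

# Hand-built stand-in for hw_dataset (first three records, same field layout)
records = np.array(
    [(b'Male', 174., 96., 4), (b'Male', 189., 87., 2), (b'Female', 185., 110., 4)],
    dtype=[('Gender', 'S15'), ('Height', 'f8'), ('Weight', 'f8'), ('Index', 'i4')],
)

# Each field is a 1D array; column_stack places them side by side as columns
data = np.column_stack((records['Height'], records['Weight']))
labels = records['Gender']

print(data)
print(labels)
```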


### 6. Split the dataset into two groups

Split the dataset in `hw_data` according to the labels in `hw_labels`. Hint: you can create a dictionary with the two different labels ("Male" and "Female") as keys, and the two split datasets as values.

In [5]:
# Write your solution here

hw_by_label = {
    label: hw_data[hw_labels == label] for label in set(hw_labels)
}

males = hw_by_label[b'Male']
females = hw_by_label[b'Female']

print(males)
print(females)

[[174.  96.]
 [189.  87.]
 [149.  61.]
 [189. 104.]
 [147.  92.]
 [154. 111.]
 [174.  90.]
 [195.  81.]
 [155.  51.]
 [191.  79.]
 [140. 129.]
 [144. 145.]
 [172. 139.]
 [157. 110.]
 [185. 139.]
 [190.  95.]
 [187.  62.]
 [179. 152.]
 [153. 121.]
 [178.  52.]
 [244.  80.]
 [157.  56.]
 [161. 118.]
 [185.  76.]
 [181. 111.]
 [161.  72.]
 [140. 152.]
 [163. 110.]
 [172. 105.]
 [196. 116.]
 [172.  92.]
 [178. 127.]
 [143.  88.]
 [193.  54.]
 [190.  83.]
 [175. 135.]
 [178. 117.]
 [141.  80.]
 [180.  75.]
 [165. 104.]
 [181.  51.]
 [164.  75.]
 [186. 118.]
 [168. 123.]
 [198.  50.]
 [145. 117.]
 [177.  61.]
 [197. 119.]
 [142.  69.]
 [160. 139.]
 [195.  69.]
 [199. 156.]
 [154. 105.]
 [161. 155.]
 [195. 126.]
 [166. 160.]
 [159. 154.]
 [149.  66.]
 [190. 135.]
 [148.  60.]
 [144. 108.]
 [187. 122.]
 [187. 138.]
 [158.  96.]
 [194. 115.]
 [182. 151.]
 [154.  54.]
 [194. 108.]
 [171. 147.]
 [159. 124.]
 [180. 149.]
 [163. 123.]
 [140.  79.]
 [197. 125.]
 [194. 106.]
 [195.  98.]
 [140.  52.]

### 7. Compute the univariate statistics for each group

For each label, compute the following statistics for both weight and height:
- mean
- median
- standard deviation
- variance
- interquartile range

Read the numpy documentation for the statistics routines carefully.
Answer the following questions:

1) Which is the gender with the highest mean value for `height`?

2) Which is the gender with the smallest median value for `weight`?

3) Which is the gender with the shortest interquartile range for `weight`?
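
The interquartile range (IQR) is the difference between the 75th and the 25th percentile. A minimal sketch on the first five male records of the dataset, computing the IQR of both columns at once (the names `sample`, `q1`, `q3` and `iqr` are illustrative):

```python
import numpy as np

# (height [cm], weight [kg]) for the first five male records
sample = np.array([[174., 96.], [189., 87.], [149., 61.], [189., 104.], [147., 92.]])

# With axis=0 the percentiles are computed per column;
# the result has one row per requested percentile
q1, q3 = np.percentile(sample, [25, 75], axis=0)
iqr = q3 - q1

print(iqr)  # one value for height, one for weight
```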

In [7]:
# Write your solution here
hw_by_label = {
    label: hw_data[hw_labels == label] for label in set(hw_labels)
}
print(hw_by_label)

statistics = {}
for gender, dataset in hw_by_label.items():
    statistics[gender] = {
        'mean': np.mean(dataset, axis=0),
        'median': np.median(dataset, axis=0),
        'quantiles': np.quantile(dataset, [0.25, 0.5, 0.75], axis=0),
        'std': np.std(dataset, axis=0),
        'var': np.var(dataset, axis=0)
    }
statistics

male_iqr_w = statistics[b'Male']['quantiles'][2][1] - statistics[b'Male']['quantiles'][0][1]
female_iqr_w = statistics[b'Female']['quantiles'][2][1] - statistics[b'Female']['quantiles'][0][1]
print("Male IQR for weight: {}".format(male_iqr_w))
print("Female IQR for weight: {}".format(female_iqr_w))


{b'Male': array([[174.,  96.],
       [189.,  87.],
       [149.,  61.],
       [189., 104.],
       [147.,  92.],
       [154., 111.],
       [174.,  90.],
       [195.,  81.],
       [155.,  51.],
       [191.,  79.],
       [140., 129.],
       [144., 145.],
       [172., 139.],
       [157., 110.],
       [185., 139.],
       [190.,  95.],
       [187.,  62.],
       [179., 152.],
       [153., 121.],
       [178.,  52.],
       [244.,  80.],
       [157.,  56.],
       [161., 118.],
       [185.,  76.],
       [181., 111.],
       [161.,  72.],
       [140., 152.],
       [163., 110.],
       [172., 105.],
       [196., 116.],
       [172.,  92.],
       [178., 127.],
       [143.,  88.],
       [193.,  54.],
       [190.,  83.],
       [175., 135.],
       [178., 117.],
       [141.,  80.],
       [180.,  75.],
       [165., 104.],
       [181.,  51.],
       [164.,  75.],
       [186., 118.],
       [168., 123.],
       [198.,  50.],
       [145., 117.],
       [177.,  61.],
   

### 8. Compute the covariance and correlation matrix for each group

1) Compute the `height`-`weight` covariance matrix for each gender. Do the values on the diagonal match the values computed with the variance function in the previous step? If not, can you understand why, and how you can obtain consistent values?

2) Compute the `height`-`weight` correlation matrix for each gender.

In [8]:
# Compute the covariance matrix here:

cov_mat = {
    label: dict(cov=np.cov(dataset.T)) for label, dataset in hw_by_label.items()
}
print(cov_mat)

{b'Male': {'cov': array([[ 311.0295082 ,  -24.98114754],
       [ -24.98114754, 1013.32295082]])}, b'Female': {'cov': array([[ 246.8614482 ,   14.27761309],
       [  14.27761309, 1086.49507488]])}}


In [9]:
# Compute the correlation matrix here

cor_mat = {
    label: dict(corr=np.corrcoef(dataset.T)) for label, dataset in hw_by_label.items()
}
print(cor_mat)

{b'Male': {'corr': array([[ 1.        , -0.04449771],
       [-0.04449771,  1.        ]])}, b'Female': {'corr': array([[1.        , 0.02756862],
       [0.02756862, 1.        ]])}}
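
On the question about the diagonal: `np.cov` normalizes by N - 1 by default (sample covariance), whereas `np.var` normalizes by N by default (`ddof=0`), so the two only agree when the same `ddof` is used. The sketch below checks this on the first five male records, and also rebuilds the correlation matrix from the covariance matrix (the variable names are illustrative):

```python
import numpy as np

# (height [cm], weight [kg]) for the first five male records
sample = np.array([[174., 96.], [189., 87.], [149., 61.], [189., 104.], [147., 92.]])

cov = np.cov(sample.T)                        # normalizes by N - 1 by default
var_default = np.var(sample, axis=0)          # normalizes by N (ddof=0)
var_sample = np.var(sample, axis=0, ddof=1)   # normalizes by N - 1

print(np.allclose(np.diag(cov), var_default))  # False: different normalization
print(np.allclose(np.diag(cov), var_sample))   # True: same normalization

# Correlation is covariance divided by the product of the standard deviations
std = np.sqrt(np.diag(cov))
corr_from_cov = cov / np.outer(std, std)
print(np.allclose(corr_from_cov, np.corrcoef(sample.T)))  # True
```

Passing `ddof=0` to `np.cov` instead would make its diagonal match the default `np.var`.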


### 9. How to create a new column from existing columns of a numpy array

Create a new column for the "Body Mass Index" ("BMI") in `hw_data`, where BMI is:

$$\mathrm{BMI} = \frac{\text{weight [kg]}}{(\text{height [m]})^2}$$



In [10]:
# Compute BMI here:
bmi = hw_data[:, 1] / (hw_data[:, 0] / 100)**2

# Add a new axis so that bmi becomes a column matching the 2 dimensions of hw_data
bmi = bmi[:, np.newaxis]

hw_data = np.hstack((hw_data, bmi))
hw_data


array([[174.        ,  96.        ,  31.70828379],
       [189.        ,  87.        ,  24.35542118],
       [185.        , 110.        ,  32.14024836],
       ...,
       [141.        , 136.        ,  68.40702178],
       [150.        ,  95.        ,  42.22222222],
       [173.        , 131.        ,  43.77025627]])
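
As a quick check of the formula, the sketch below recomputes the index for the first three records and appends it with `np.column_stack`, which accepts a 1D array as a column, so no extra axis is needed. This is only an alternative illustration, with hypothetical variable names:

```python
import numpy as np

# (height [cm], weight [kg]) for the first three records of the dataset
sample = np.array([[174., 96.], [189., 87.], [185., 110.]])

height_m = sample[:, 0] / 100        # convert cm to m
index = sample[:, 1] / height_m**2   # weight [kg] / (height [m])^2

# column_stack treats the 1D array as a new column
with_index = np.column_stack((sample, index))
print(with_index)
```

The third column should agree with the first three rows of the output above.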

### 10. Convert a quantitative variable to a categorical one

Now create an array `hw_bmi_labels` where you assign each record in `hw_data` to one of these categories:

- 'UNDERWEIGHT': BMI < 18.5
- 'NORMAL': 18.5 <= BMI < 25
- 'OVERWEIGHT': 25 <= BMI < 30
- 'OBESE': BMI >= 30
    
Then, count the number of occurrences of each category per gender.

In [13]:
# Label the records here
# Hint: write a function that assigns a label to a record, then use the appropriate
# numpy function to apply it to each row

def assign_label(row):
    bmi = row[-1]
    if bmi < 18.5:
        return 'UNDERWEIGHT'
    if bmi >= 18.5 and bmi < 25.0:
        return 'NORMAL'
    if bmi >= 25.0 and bmi < 30.0:
        return 'OVERWEIGHT'
    if bmi >= 30.0:
        return 'OBESE'
hw_bmi_labels = np.array([assign_label(row) for row in hw_data])
hw_bmi_labels

mbi_labels_by_gender = {
    gender: hw_bmi_labels[hw_labels == gender] for gender in set(hw_labels)
}

cts = {}
for gender in set(hw_labels):
    unique, counts = np.unique(mbi_labels_by_gender[gender], return_counts=True)
    cts[gender] = dict(zip(unique, counts))
cts

{b'Male': {'NORMAL': 28, 'OBESE': 164, 'OVERWEIGHT': 31, 'UNDERWEIGHT': 22},
 b'Female': {'NORMAL': 38, 'OBESE': 167, 'OVERWEIGHT': 37, 'UNDERWEIGHT': 13}}
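
The labelling can also be vectorized. One possible sketch uses `np.digitize` to map each value to the index of its bin and then looks the label up in an array of category names. The input values below are made up to sit on and around the thresholds:

```python
import numpy as np

# Illustrative values chosen on and around the category thresholds
values = np.array([17.0, 18.5, 24.9, 25.0, 29.9, 30.0])

categories = np.array(['UNDERWEIGHT', 'NORMAL', 'OVERWEIGHT', 'OBESE'])

# With the default right=False, bin i collects values with bins[i-1] <= x < bins[i]
bin_index = np.digitize(values, bins=[18.5, 25.0, 30.0])
labels = categories[bin_index]

print(labels)
```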

### 11. Compute the frequencies of each category within genders

Compute the percentages of the four categories ('UNDERWEIGHT', 'NORMAL', 'OVERWEIGHT', 'OBESE'), and answer these questions:

- Which gender has the highest percentage of 'OBESE' subjects?
- Which gender has the highest percentage of subjects that are neither 'OVERWEIGHT' nor 'OBESE'?

In [14]:
# Compute the percentages here:
prc_cts = {}
for gender, counts in cts.items():
    total_count = sum(counts.values())
    print('Total count for Gender {} is {}'.format(gender, total_count))
    prc_cts[gender] = {
        label: (count / total_count) * 100 for label, count in counts.items()
    }
prc_cts


Total count for Gender b'Male' is 245
Total count for Gender b'Female' is 255


{b'Male': {'NORMAL': 11.428571428571429,
  'OBESE': 66.93877551020408,
  'OVERWEIGHT': 12.653061224489795,
  'UNDERWEIGHT': 8.979591836734693},
 b'Female': {'NORMAL': 14.901960784313726,
  'OBESE': 65.49019607843137,
  'OVERWEIGHT': 14.50980392156863,
  'UNDERWEIGHT': 5.098039215686274}}
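
The percentages can also be computed directly on the arrays returned by `np.unique`, without looping over the individual counts. A small sketch on a made-up array of four labels:

```python
import numpy as np

# Made-up labels, only to illustrate the computation
labels = np.array(['OBESE', 'NORMAL', 'OBESE', 'OVERWEIGHT'])

unique, counts = np.unique(labels, return_counts=True)

# Dividing the whole counts array by its sum gives all the fractions at once
percentages = dict(zip(unique.tolist(), (100 * counts / counts.sum()).tolist()))
print(percentages)
```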

### 12. Normalize an array so the values range exactly between 0 and 1

Normalization is an important pre-processing step before feeding a dataset to a data science (e.g. machine learning) algorithm.
Create a normalized form of the "height" and "weight" columns of `hw_data`, whose values range exactly between 0 and 1, so that the minimum of each column has value 0 and the maximum has value 1.
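
This kind of rescaling is often called min-max normalization: subtract the column minimum, then divide by the column range. Here is a minimal sketch on three records of the dataset, meant as one possible approach rather than the reference solution (the variable names are illustrative):

```python
import numpy as np

# (height [cm], weight [kg]) for three records of the dataset
sample = np.array([[174., 96.], [189., 87.], [149., 61.]])

col_min = sample.min(axis=0)   # per-column minimum
col_max = sample.max(axis=0)   # per-column maximum

# Broadcasting applies the per-column values to every row
normalized = (sample - col_min) / (col_max - col_min)
print(normalized)
```

This assumes that no column is constant, otherwise the range would be zero and the division undefined.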

In [None]:
# Write your solution here 



You will be able to find more exercises on numpy here: https://www.machinelearningplus.com/python/101-numpy-exercises-python/