# Exercises 04 - NumPy for array manipulation

This week we will work with the `numpy` library for numerical manipulations.
The reference guide for NumPy can be found here: https://docs.scipy.org/doc/numpy-1.16.1/reference/. It is
large, but the sections of main interest for us are in the "routines" (a synonym for "functions") section,
in particular:
- Array creation routines
- Array manipulation routines
- Input and output
- Statistics

Some exercises have been modified from https://www.machinelearningplus.com/python/101-numpy-exercises-python/. You can try more exercises there, if you like!

In [1]:
# Import the NumPy library, using `np` as an alias
import numpy as np

np.__version__

'1.17.2'

### 1. Create a 1D array of numbers from 99 to 110 

Desired output: ```#> array([99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110])```

In [2]:
# Write your solution here

oneDArray = np.arange(99, 111)
oneDArray

array([ 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110])

### 2. Create a 3x3 numpy array of all False values 

In [3]:
# Write your solution here

squareMat3by3False = np.full((3,3), False)
squareMat3by3False

array([[False, False, False],
       [False, False, False],
       [False, False, False]])

### 3. Replace all odd numbers in ```arr``` with -1

Sample input: ```np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])```

Desired output: ```#>  array([ 0, -1,  2, -1,  4, -1,  6, -1,  8, -1])```

In [4]:
# Sample input
arr = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

# Write your solution here:
arr[arr%2 != 0] = -1
arr

array([ 0, -1,  2, -1,  4, -1,  6, -1,  8, -1])
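
The boolean-mask assignment above modifies `arr` in place. As a minimal sketch of an alternative, assuming you want to keep the original array untouched, `np.where` builds a new array instead (the name `replaced` is only illustrative):

```python
import numpy as np

arr = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

# np.where(condition, x, y) picks x where the condition holds and y elsewhere,
# returning a new array and leaving `arr` unchanged
replaced = np.where(arr % 2 != 0, -1, arr)

print(replaced)
print(arr)
```

Which form to prefer depends on whether the original values are still needed afterwards.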

### 4. Stack two arrays vertically



Consider these two numpy arrays as input:

```
a = np.arange(10).reshape(2,-1)
b = np.repeat(1, 10).reshape(2,-1)
```

```
a = [[0 1 2 3 4]
 [5 6 7 8 9]]
 
b = [[1 1 1 1 1]
 [1 1 1 1 1]]
```

Desired output:
```
c = [[0, 1, 2, 3, 4],
     [5, 6, 7, 8, 9],
     [1, 1, 1, 1, 1],
     [1, 1, 1, 1, 1]]
```

In [5]:
a = np.arange(10).reshape(2,-1)
b = np.repeat(1, 10).reshape(2,-1)

# Write your solution here

c = np.vstack([a, b])
print(c)

[[0 1 2 3 4]
 [5 6 7 8 9]
 [1 1 1 1 1]
 [1 1 1 1 1]]


## Exercises of data manipulation on the "500_Person_Gender_Height_Weight_Index" dataset.

We will use the "500_Person_Gender_Height_Weight_Index" dataset, which contains the height (in cm) and weight (in kg) of 500 subjects, classified by gender. You can have a look at the dataset on Kaggle: https://www.kaggle.com/yersever/500-person-gender-height-weight-bodymassindex
You do not have to download it, though. The CSV (comma-separated values) file
can be found in the "datasets" directory.

We are not going to use the Body Mass Index (BMI) information provided in the dataset. 
We will compute the BMI later on using Numpy.

Let's import the "500_Person_Gender_Height_Weight_Index" dataset. You can import the dataset keeping the text column intact by passing `dtype='object'` as an argument. In this case, though, all the numeric values will be stored as bytes. We also have to set the `names` argument to `True`, as the first line of the file (the "header") contains the column names.

Read more details about the textual data importing functions of Numpy here: 
- `np.loadtxt()`: https://docs.scipy.org/doc/numpy-1.16.1/reference/generated/numpy.loadtxt.html#numpy.loadtxt
- `np.genfromtxt()`: https://docs.scipy.org/doc/numpy-1.16.1/reference/generated/numpy.genfromtxt.html#numpy.genfromtxt

You can see all the arguments you can supply to the functions. All the arguments that have a default value are
optional.
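
As a small self-contained sketch of how `names=True` behaves, the snippet below reads a three-row excerpt of the file from an in-memory string rather than from disk. Using `dtype=None` (type inference per column) together with `encoding='utf-8'` is an assumption made for this illustration; it is a different choice from the cells below, and the names `csv_text` and `sample` are hypothetical.

```python
import io
import numpy as np

# A three-row excerpt in the same layout as the CSV file (header + records)
csv_text = """Gender,Height,Weight,Index
Male,174,96,4
Male,189,87,2
Female,185,110,4
"""

# names=True takes the field names from the header line;
# dtype=None asks genfromtxt to infer a type for each column
sample = np.genfromtxt(io.StringIO(csv_text), delimiter=',', names=True,
                       dtype=None, encoding='utf-8')

print(sample.dtype.names)   # field names read from the header
print(sample['Height'])     # a single column, selected by field name
```

The result is a 1D structured array: each element is one record, and columns are accessed by field name rather than by position.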

In [6]:
# Import `height-weight` keeping the text column intact.
url = '../datasets/500_Person_Gender_Height_Weight_Index.csv'
hw_raw = np.genfromtxt(url, delimiter=',', names=True, dtype='object')
hw_raw

array([(b'Male', b'174', b'96', b'4'), (b'Male', b'189', b'87', b'2'),
       (b'Female', b'185', b'110', b'4'),
       (b'Female', b'195', b'104', b'3'), (b'Male', b'149', b'61', b'3'),
       (b'Male', b'189', b'104', b'3'), (b'Male', b'147', b'92', b'5'),
       (b'Male', b'154', b'111', b'5'), (b'Male', b'174', b'90', b'3'),
       (b'Female', b'169', b'103', b'4'), (b'Male', b'195', b'81', b'2'),
       (b'Female', b'159', b'80', b'4'),
       (b'Female', b'192', b'101', b'3'), (b'Male', b'155', b'51', b'2'),
       (b'Male', b'191', b'79', b'2'), (b'Female', b'153', b'107', b'5'),
       (b'Female', b'157', b'110', b'5'), (b'Male', b'140', b'129', b'5'),
       (b'Male', b'144', b'145', b'5'), (b'Male', b'172', b'139', b'5'),
       (b'Male', b'157', b'110', b'5'), (b'Female', b'153', b'149', b'5'),
       (b'Female', b'169', b'97', b'4'), (b'Male', b'185', b'139', b'5'),
       (b'Female', b'172', b'67', b'2'), (b'Female', b'151', b'64', b'3'),
       (b'Male', b'190', b'95', b'

Alternatively, you can import the "500_Person_Gender_Height_Weight_Index" dataset with the correct data types for the numeric values by specifying a `dtype` for each column.

In [7]:
url = '../datasets/500_Person_Gender_Height_Weight_Index.csv'
# The built-in `float` and `int` replace the deprecated `np.float` and `np.int` aliases
hw_dataset = np.genfromtxt(url, delimiter=',', names=True, dtype=['|S15', float, float, int])
hw_dataset[:4]

array([(b'Male', 174.,  96., 4), (b'Male', 189.,  87., 2),
       (b'Female', 185., 110., 4), (b'Female', 195., 104., 3)],
      dtype=[('Gender', 'S15'), ('Height', '<f8'), ('Weight', '<f8'), ('Index', '<i8')])

### 5. How to convert a 1d array of tuples to a 2d numpy array

Convert the `hw_dataset` to a numeric-only 2D array `hw_data` by omitting the "Gender" text column and the "Index" numeric field. Create a `hw_label` 1D array containing only the "Gender" text field. Keep the same indexing/order as in the original array.
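
A minimal sketch of two ways to turn selected fields of a structured array into a plain 2D array, shown on a hypothetical three-record array `records` that mimics `hw_dataset`. Both `np.column_stack` and `numpy.lib.recfunctions.structured_to_unstructured` are existing NumPy functions; the variable names are illustrative only.

```python
import numpy as np
from numpy.lib import recfunctions as rfn

# Hypothetical three-record structured array mimicking `hw_dataset`
records = np.array(
    [(b'Male', 174., 96., 4), (b'Male', 189., 87., 2), (b'Female', 185., 110., 4)],
    dtype=[('Gender', 'S15'), ('Height', 'f8'), ('Weight', 'f8'), ('Index', 'i8')],
)

# Option 1: stack the selected fields side by side as columns
data_a = np.column_stack((records['Height'], records['Weight']))

# Option 2: select several fields at once and drop the structured dtype
data_b = rfn.structured_to_unstructured(records[['Height', 'Weight']])

# Selecting a single field already yields a 1D array
labels = records['Gender']

print(data_a.shape, labels.shape)
```

Both options preserve the record order, so row `i` of the 2D array still corresponds to label `i`.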

In [8]:
# Write your solution here
# Stack the two numeric fields side by side into a (500, 2) array
hw_data = np.column_stack((hw_dataset['Height'], hw_dataset['Weight']))

# Keep the labels as a (500, 1) column so that each row holds one gender label
hw_label = hw_dataset['Gender'].reshape(-1, 1)

In [9]:
# 2 dimensional array omitting the Gender and index fields 
print(hw_data.shape)
print(hw_data)

# Column array of shape (500, 1) containing only the gender field
print(hw_label)

(500, 2)
[[174.  96.]
 [189.  87.]
 [185. 110.]
 [195. 104.]
 [149.  61.]
 [189. 104.]
 [147.  92.]
 [154. 111.]
 [174.  90.]
 [169. 103.]
 [195.  81.]
 [159.  80.]
 [192. 101.]
 [155.  51.]
 [191.  79.]
 [153. 107.]
 [157. 110.]
 [140. 129.]
 [144. 145.]
 [172. 139.]
 [157. 110.]
 [153. 149.]
 [169.  97.]
 [185. 139.]
 [172.  67.]
 [151.  64.]
 [190.  95.]
 [187.  62.]
 [163. 159.]
 [179. 152.]
 [153. 121.]
 [178.  52.]
 [195.  65.]
 [160. 131.]
 [157. 153.]
 [189. 132.]
 [197. 114.]
 [144.  80.]
 [171. 152.]
 [185.  81.]
 [175. 120.]
 [149. 108.]
 [157.  56.]
 [161. 118.]
 [182. 126.]
 [185.  76.]
 [188. 122.]
 [181. 111.]
 [161.  72.]
 [140. 152.]
 [168. 135.]
 [176.  54.]
 [163. 110.]
 [172. 105.]
 [196. 116.]
 [187.  89.]
 [172.  92.]
 [178. 127.]
 [164.  70.]
 [143.  88.]
 [191.  54.]
 [141. 143.]
 [193.  54.]
 [190.  83.]
 [175. 135.]
 [179. 158.]
 [172.  96.]
 [168.  59.]
 [164.  82.]
 [194. 136.]
 [153.  51.]
 [178. 117.]
 [141.  80.]
 [180.  75.]
 [185. 100.]
 [197. 154.]
 [1

### 6. Split the dataset into two groups

Split the dataset in `hw_data` according to the labels in `hw_label`. Hint: you can create a dictionary with the two labels ("Male" and "Female") as keys and the two split datasets as values.
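
One way to follow the hint without an explicit loop over the records is boolean masking. The sketch below uses toy arrays `data` and `labels` as stand-ins; it assumes a 1D label array, so a column-shaped label array would need `.ravel()` first.

```python
import numpy as np

# Toy stand-ins for `hw_data` (height, weight) and a 1D label array
data = np.array([[174., 96.], [189., 87.], [185., 110.], [195., 104.]])
labels = np.array([b'Male', b'Male', b'Female', b'Female'])

# One boolean mask per distinct label selects the matching rows
groups = {label.decode(): data[labels == label] for label in np.unique(labels)}

print(groups['Male'])
print(groups['Female'])
```

Each dictionary value is itself a 2D array, so column slicing such as `groups['Male'][:, 0]` works directly on the result.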

In [10]:
# Write your solution here

# Collect the (height, weight) pairs of each gender in a dictionary of lists
dictionary = {'Male': [], 'Female': []}

for (height, weight), gender in zip(hw_data, hw_label.ravel()):
    # Labels are stored as bytes (e.g. b'Male'), so decode them to get the string key
    dictionary[gender.decode()].append((height, weight))


In [11]:
dictionary

{'Male': [(174.0, 96.0),
  (189.0, 87.0),
  (149.0, 61.0),
  (189.0, 104.0),
  (147.0, 92.0),
  (154.0, 111.0),
  (174.0, 90.0),
  (195.0, 81.0),
  (155.0, 51.0),
  (191.0, 79.0),
  (140.0, 129.0),
  (144.0, 145.0),
  (172.0, 139.0),
  (157.0, 110.0),
  (185.0, 139.0),
  (190.0, 95.0),
  (187.0, 62.0),
  (179.0, 152.0),
  (153.0, 121.0),
  (178.0, 52.0),
  (144.0, 80.0),
  (157.0, 56.0),
  (161.0, 118.0),
  (185.0, 76.0),
  (181.0, 111.0),
  (161.0, 72.0),
  (140.0, 152.0),
  (163.0, 110.0),
  (172.0, 105.0),
  (196.0, 116.0),
  (172.0, 92.0),
  (178.0, 127.0),
  (143.0, 88.0),
  (193.0, 54.0),
  (190.0, 83.0),
  (175.0, 135.0),
  (178.0, 117.0),
  (141.0, 80.0),
  (180.0, 75.0),
  (165.0, 104.0),
  (181.0, 51.0),
  (164.0, 75.0),
  (186.0, 118.0),
  (168.0, 123.0),
  (198.0, 50.0),
  (145.0, 117.0),
  (177.0, 61.0),
  (197.0, 119.0),
  (142.0, 69.0),
  (160.0, 139.0),
  (195.0, 69.0),
  (199.0, 156.0),
  (154.0, 105.0),
  (161.0, 155.0),
  (195.0, 126.0),
  (166.0, 160.0),
  (159.0, 1

### 7. Compute the univariate statistics for each group

For each label, compute the key statistics for both weight and height:
- mean
- median
- standard deviation
- variance
- interquartile range

Read the NumPy documentation for the statistics routines carefully.
Answer the following questions:

1) Which gender has the highest mean `height`?

2) Which gender has the smallest median `weight`?

3) Which gender has the smallest interquartile range for `weight`?
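
The five statistics listed above can be gathered in one small helper, sketched here under the assumption that sample statistics (`ddof=1`) are wanted for both the standard deviation and the variance. The function name `summarize` and the toy input are illustrative only.

```python
import numpy as np

def summarize(values):
    """Return mean, median, standard deviation, variance and IQR of a 1D array."""
    q1, q3 = np.percentile(values, [25, 75])
    return {
        'mean': np.mean(values),
        'median': np.median(values),
        'std': np.std(values, ddof=1),   # sample standard deviation
        'var': np.var(values, ddof=1),   # sample variance
        'iqr': q3 - q1,                  # interquartile range
    }

stats = summarize(np.array([1., 2., 3., 4., 5.]))
print(stats)
```

Calling the helper once per group and per column avoids repeating the same lines for every combination of gender and variable.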

In [12]:
# Write your solution here

males   = np.array(dictionary.get('Male'))
females = np.array(dictionary.get('Female'))

# Male and Females Height mean
malesHeightMean   =   np.mean(males[:, 0])
print("males height mean",malesHeightMean)

femalesHeightMean = np.mean(females[:, 0])
print("Females height mean",femalesHeightMean)

print("The gender with the highest mean height are Females\n")

# Male and Females Weight mean
malesWeightMean   =   np.mean(males[:, 1])
print("males Weight mean |",malesWeightMean)
femalesWeightMean = np.mean(females[:, 1])
print("females Weight mean |",femalesWeightMean,"\n")

# Male and Females Height median
malesHeightMedian   =   np.median(males[:, 0])
print("males height median |",malesHeightMedian)

femalesHeightMedian = np.median(females[:, 0])
print("females height median |",femalesHeightMedian,"\n")

# Male and Females Weight median
malesWeightMedian   =   np.median(males[:, 1])
print("males weight median |",malesWeightMedian)

femalesWeightMedian = np.median(females[:, 1])
print("Females weight median |",femalesWeightMedian)

print("The gender with the lowest median weight are Males\n")

# Male and Females Height Standard Deviation (np.std defaults to ddof=0, the population value)
malesHeightStdD   =   np.std(males[:, 0]) 
print("males height standard deviation |",malesHeightStdD)

femalesHeightStdD = np.std(females[:, 0])
print("females height standard deviation |",femalesHeightStdD, "\n")

# Male and Females Weight Standard Deviation 
malesWeightStdD  =    np.std(males[:, 1]) 
print("males weight standard deviation |",malesWeightStdD)

femalesWeightStdD = np.std(females[:, 1])
print("females weight standard deviation |",femalesWeightStdD, "\n")

# Male and Females Height Variance (ddof=1 gives the sample variance)
malesHeightVar  =    np.var(males[:, 0], ddof=1) 
print("males height Variance |",malesHeightVar)

femalesHeightVar = np.var(females[:, 0], ddof=1)
print("females height Variance |",femalesHeightVar,"\n")

# Male and Females Weight Variance
malesWeightVar  =    np.var(males[:, 1], ddof=1) 
print("males weight Variance |",malesWeightVar)

femalesWeightVar = np.var(females[:, 1], ddof=1)
print("females weight Variance |",femalesWeightVar,"\n")

# Male and Females Height interquartile range
maleHeightQ3, maleHeightQ1 = np.percentile(males[:, 0], [75 ,25])
maleHeightIQR = maleHeightQ3 - maleHeightQ1
print("male Height interquartile range (IQR) |",maleHeightIQR)

femaleHeightQ3, femaleHeightQ1 = np.percentile(females[:, 0], [75 ,25])
femaleHeightIQR = femaleHeightQ3 - femaleHeightQ1
print("female Height interquartile range (IQR) |",femaleHeightIQR, "\n")

# Male and Females Weight interquartile range
maleWeightQ3, maleWeightQ1 = np.percentile(males[:, 1], [75 ,25])
maleWeightIQR = maleWeightQ3 - maleWeightQ1
print("male Weight interquartile range (IQR) |",maleWeightIQR)

femaleWeightQ3, femaleWeightQ1 = np.percentile(females[:, 1], [75 ,25])
femaleWeightIQR = femaleWeightQ3 - femaleWeightQ1
print("female Weight interquartile range (IQR) |",femaleWeightIQR)
print("the gender with the shortest interquartile range (IQR) are females")

males height mean 169.64897959183673
Females height mean 170.22745098039215
The gender with the highest mean height are Females

males Weight mean | 106.31428571428572
females Weight mean | 105.69803921568628 

males height median | 171.0
females height median | 170.0 

males weight median | 105.0
Females weight median | 106.0
The gender with the lowest median weight are Males

males height standard deviation | 17.030801896695337
females height standard deviation | 15.680987344256557 

males weight standard deviation | 31.767702762011453
females weight standard deviation | 32.89732982904258 

males height Variance | 291.23693542990964
females height Variance | 246.8614482013278 

males weight Variance | 1013.3229508196721
females weight Variance | 1086.4950748803456 

male Height interquartile range (IQR) | 29.0
female Height interquartile range (IQR) | 27.0 

male Weight interquartile range (IQR) | 57.0
female Weight interquartile range (IQR) | 56.0
the gender with the shortest interquartile range (IQR) are females

### 8. Compute the covariance and correlation matrix for each group

1) Compute the `height`-`weight` covariance matrix for each gender. Do the values on the diagonal match the values computed with the variance function in the previous step? If not, can you explain why, and how can you obtain consistent values?

2) Compute the `height`-`weight` correlation matrix for each gender.
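
A short sketch of why the diagonal of a covariance matrix may or may not match a separately computed variance: `np.cov` normalizes by N - 1 by default, whereas `np.var` normalizes by N unless `ddof=1` is passed. The toy values below are not taken from the dataset.

```python
import numpy as np

# Toy height/weight values, not taken from the dataset
height = np.array([150., 160., 170., 180., 190.])
weight = np.array([55., 72., 65., 90., 80.])

cov = np.cov(height, weight)       # normalizes by N - 1 by default
print(np.var(height))              # normalizes by N (ddof=0)
print(np.var(height, ddof=1))      # normalizes by N - 1, matches cov[0, 0]

# The correlation matrix is the covariance matrix rescaled by the standard deviations
scale = np.sqrt(np.diag(cov))
corr = cov / np.outer(scale, scale)
print(corr)
```

The rescaled matrix agrees with `np.corrcoef(height, weight)`, which is why the diagonal of a correlation matrix is always 1.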

In [13]:
# Compute the covariance matrix here:
malesHeight = males[:, 0]
malesWeight = males[:, 1]

femaleHeight = females[:, 0]
femaleWeight = females[:, 1]


malesHeightWeightCOV = np.cov(malesHeight, malesWeight)
print("Males Height Weight Covariance Matrix")
print(malesHeightWeightCOV,"\n")

print("Females Height Weight Covariance Matrix")
femaleHeightWeightCOV = np.cov(femaleHeight, femaleWeight)
print(femaleHeightWeightCOV)


Males Height Weight Covariance Matrix
[[ 291.23693543  -14.19660422]
 [ -14.19660422 1013.32295082]] 

Females Height Weight Covariance Matrix
[[ 246.8614482    14.27761309]
 [  14.27761309 1086.49507488]]


In [14]:
# Compute the correlation matrix here

maleHeightWeightCOR = np.corrcoef(malesHeight, malesWeight)
print("Males Height Weight Correlation Matrix")
print(maleHeightWeightCOR,"\n")

print("Females Height Weight Correlation Matrix")
femaleHeightWeightCOR = np.corrcoef(femaleHeight, femaleWeight)
print(femaleHeightWeightCOR)


Males Height Weight Correlation Matrix
[[ 1.         -0.02613288]
 [-0.02613288  1.        ]] 

Females Height Weight Correlation Matrix
[[1.         0.02756862]
 [0.02756862 1.        ]]


### 9. How to create a new column from existing columns of a numpy array

Create a new column for the Body Mass Index (BMI, abbreviated "MBI" throughout this notebook) in `hw_data`, where MBI is:

$$\mathrm{MBI} = \frac{\text{weight [kg]}}{(\text{height [m]})^2}$$
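
As a worked instance of the formula above, the sketch below applies it to the first three (height, weight) records of the dataset. The variable names are illustrative; note that heights are given in centimetres and must be converted to metres first.

```python
import numpy as np

# First three (height [cm], weight [kg]) records of the dataset
sample = np.array([[174., 96.], [189., 87.], [185., 110.]])

height_m = sample[:, 0] / 100          # centimetres to metres
bmi = sample[:, 1] / height_m ** 2     # element-wise on whole columns

# Append the new column to the right of the existing ones
with_bmi = np.column_stack((sample, bmi))
print(np.round(with_bmi, 3))
```

For the first record this gives 96 / 1.74² ≈ 31.708, which agrees with the first row of the output shown for the full dataset.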



In [15]:
# Compute MBI here:
heightCM = hw_data[:,0]
weightKG = hw_data[:,1]
heightM  = heightCM/100

hw_MBI = np.round(np.divide(weightKG,np.power(heightM, 2)), decimals=3)
hw_MBI = hw_MBI.reshape(500,1)
hwi_data = np.hstack((hw_data, hw_MBI))
hwi_data

array([[174.   ,  96.   ,  31.708],
       [189.   ,  87.   ,  24.355],
       [185.   , 110.   ,  32.14 ],
       ...,
       [141.   , 136.   ,  68.407],
       [150.   ,  95.   ,  42.222],
       [173.   , 131.   ,  43.77 ]])

### 10. Convert a quantitative variable to a categorical one

Now create an array `hw_mbi_labels` where you assign each record in `hw_data` to one of these categories:
- 'UNDERWEIGHT': MBI < 18.5
- 'NORMAL': 18.5 <= MBI < 25
- 'OVERWEIGHT': 25 <= MBI < 30
- 'OBESE': MBI >= 30
    
Then, count the number of occurrences of each category per gender.
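
A vectorized sketch of the binning step, assuming the thresholds listed above: `np.digitize` maps each value to the index of its interval, which can then be used to look up the category name. The input values are illustrative and chosen to hit the category boundaries.

```python
import numpy as np

categories = np.array(['UNDERWEIGHT', 'NORMAL', 'OVERWEIGHT', 'OBESE'])
thresholds = [18.5, 25, 30]

# Illustrative index values chosen to hit the category boundaries
mbi = np.array([17.0, 18.5, 24.9, 25.0, 30.0, 31.708])

# np.digitize returns, for each value, the number of thresholds it has reached,
# which is exactly the position of its category in `categories`
labels = categories[np.digitize(mbi, thresholds)]
print(labels)

# np.unique with return_counts=True tallies the occurrences of each label
names, counts = np.unique(labels, return_counts=True)
print(dict(zip(names.tolist(), counts.tolist())))
```

With the default `right=False`, a value equal to a threshold falls into the upper interval, which matches the `<=` on the lower bound of each category.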

In [29]:
# Label the records here
# Hint: write a function that assigns a label to a record, then use the appropriate NumPy function
# to apply it to each row

def MBI_CATEGORIES(MBI):
    
    if MBI < 18.5:
        return 'UNDERWEIGHT'
    if 18.5 <= MBI < 25:
        return 'NORMAL'
    if 25 <= MBI < 30:
        return 'OVERWEIGHT'
    if MBI >= 30:
        return 'OBESE'

hw_mbi_labels = np.array( [MBI_CATEGORIES(MBI) for MBI in hwi_data[:,2]] )

hw_mbi_labels_gender = { gender: hw_mbi_labels[hw_label.reshape(500) == gender] for gender in np.unique(hw_label) }

uniqueM, countM = np.unique(hw_mbi_labels_gender.get(b'Male'), return_counts=True)
uniqueF, countF = np.unique(hw_mbi_labels_gender.get(b'Female'), return_counts=True)

male_mbi_category_count   = dict(zip(uniqueM, countM))
print('male MBI category count')
print(male_mbi_category_count)
Female_mbi_category_count = dict(zip(uniqueF, countF))
print('Female MBI category count')
print(Female_mbi_category_count)

male MBI category count
{'NORMAL': 28, 'OBESE': 165, 'OVERWEIGHT': 31, 'UNDERWEIGHT': 21}
Female MBI category count
{'NORMAL': 38, 'OBESE': 167, 'OVERWEIGHT': 37, 'UNDERWEIGHT': 13}


### 11. Compute the frequencies of each category within genders

Compute the percentages of the four categories ('UNDERWEIGHT', 'NORMAL', 'OVERWEIGHT', 'OBESE'), and answer these questions:

- Which gender has the highest percentage of 'OBESE' subjects?
- Which gender has the highest percentage of subjects that are neither 'OVERWEIGHT' nor 'OBESE'?
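
A minimal sketch of turning category counts into percentages for a single group, using a toy label array rather than the real data. The names `labels`, `percentages` and `not_heavy` are illustrative; the same steps would be repeated for each gender.

```python
import numpy as np

# Toy labels for one group, not the real dataset
labels = np.array(['OBESE', 'NORMAL', 'OBESE', 'UNDERWEIGHT', 'OVERWEIGHT',
                   'OBESE', 'NORMAL', 'OBESE'])

# Count each category, then divide by the group size
names, counts = np.unique(labels, return_counts=True)
percentages = dict(zip(names.tolist(), (100 * counts / counts.sum()).tolist()))
print(percentages)

# Share of records that are neither 'OVERWEIGHT' nor 'OBESE':
# the mean of a boolean mask is the fraction of True entries
not_heavy = 100 * np.mean(~np.isin(labels, ['OVERWEIGHT', 'OBESE']))
print(not_heavy)
```

Dividing by the size of each group, rather than by the total number of records, is what makes the percentages comparable between genders of different group sizes.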

In [17]:
# Compute the percentages here:



### 12. Normalize an array so the values range exactly between 0 and 1

Normalization is an important pre-processing step before feeding a dataset to a data science (e.g. a machine learning) algorithm.
Create a normalized form of the "height" and "weight" columns of `hw_data` whose values range exactly between 0 and 1, so that the minimum of each column has value 0 and the maximum has value 1.
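
This kind of rescaling is commonly called min-max normalization: subtract the column minimum and divide by the column range. The sketch below applies it column-wise to the first four records of the dataset; the variable names are illustrative, and it assumes no column is constant (otherwise the range is zero and the division is undefined).

```python
import numpy as np

# First four (height, weight) records of the dataset
sample = np.array([[174., 96.], [189., 87.], [185., 110.], [195., 104.]])

# Column-wise minimum and range; broadcasting applies them to every row
col_min = sample.min(axis=0)
col_range = sample.max(axis=0) - col_min
normalized = (sample - col_min) / col_range

print(normalized)
print(normalized.min(axis=0), normalized.max(axis=0))
```

Passing `axis=0` matters here: without it, the minimum and maximum would be taken over heights and weights together, and neither column would span the full range from 0 to 1.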

In [18]:
# Write your solution here 



You will be able to find more exercises on numpy here: https://www.machinelearningplus.com/python/101-numpy-exercises-python/