## Activity 03: Filtering, Sorting, and Reshaping

Following up on the previous activity, we have been asked to perform some more complex operations.   
We will therefore continue to work with the same dataset, `normal_distribution.csv`.

#### Loading the dataset

In [1]:
# importing the necessary dependencies
import numpy as np

In [3]:
# loading the dataset
dataset = np.genfromtxt('./data/normal_distribution.csv', delimiter=',')
dataset

array([[ 99.14931546, 104.03852715, 107.43534677,  97.85230675,
         98.74986914,  98.80833412,  96.81964892,  98.56783189,
        101.34745901],
       [ 92.02628776,  97.10439252,  99.32066924,  97.24584816,
         92.9267508 ,  92.65657752, 105.7197853 , 101.23162942,
         93.87155456],
       [ 95.66253664,  95.17750125,  90.93318132, 110.18889465,
         98.80084371, 105.95297652,  98.37481387, 106.54654286,
        107.22482426],
       [ 91.37294597, 100.96781394, 100.40118279, 113.42090475,
        105.48508838,  91.6604946 , 106.1472841 ,  95.08715803,
        103.40412146],
       [101.20862522, 103.5730309 , 100.28690912, 105.85269352,
         93.37126331, 108.57980357, 100.79478953,  94.20019732,
         96.10020311],
       [102.80387079,  98.29687616,  93.24376389,  97.24130034,
         89.03452725,  96.2832753 , 104.60344836, 101.13442416,
         97.62787811],
       [106.71751618, 102.97585605,  98.45723272, 100.72418901,
        106.39798503,  95.4649

---

#### Filtering

To get better insights into our dataset, we want to look only at the values that fulfill certain conditions.   
Our client reaches out and asks us to provide lists of values that fulfill these conditions:
- all values greater than 105 (>105)
- all values that are between 90 and 95 (>90 and <95)
- the indices of all values that differ from 100 by less than 1 (|x-100| < 1)
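
The third condition concerns the absolute distance to 100, which is why the absolute value matters. The following is a minimal sketch on a made-up one-dimensional array (`sample` is hypothetical and not part of `normal_distribution.csv`) that contrasts the signed and the absolute comparison:

```python
import numpy as np

# hypothetical toy values, not taken from the activity's dataset
sample = np.array([85.0, 99.5, 100.2, 100.9, 101.5])

# signed difference: every value below 101 passes, including 85.0
signed_mask = (sample - 100) < 1

# absolute difference: only values strictly within 1 of 100 pass
absolute_mask = np.abs(sample - 100) < 1

print(sample[signed_mask].tolist())    # [85.0, 99.5, 100.2, 100.9]
print(sample[absolute_mask].tolist())  # [99.5, 100.2, 100.9]
```

Without `np.abs`, values far below 100 would slip through the filter.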

In [7]:
# values that are greater than 105
dataset[dataset>105]

array([107.43534677, 105.7197853 , 110.18889465, 105.95297652,
       106.54654286, 107.22482426, 113.42090475, 105.48508838,
       106.1472841 , 105.85269352, 108.57980357, 106.71751618,
       106.39798503, 106.83273763, 106.47551845, 105.30350449,
       106.03868807, 110.44484313, 106.6471081 , 105.0320535 ,
       107.02874163, 105.07475277, 106.57364584, 107.22482426,
       107.19119932, 108.09423367, 109.40523174, 106.11454989,
       106.57052697, 105.13668343, 105.37011896, 110.44484313,
       105.86078488, 106.89005002, 106.57364584, 107.40064604,
       106.38276709, 106.46476468, 110.43976681, 105.02389857,
       106.05042487, 106.89005002])

In [25]:
# values that are between 90 and 95
dataset[np.where((dataset > 90) & (dataset < 95))]

array([92.02628776, 92.9267508 , 92.65657752, 93.87155456, 90.93318132,
       91.37294597, 91.6604946 , 93.37126331, 94.20019732, 93.24376389,
       94.35373179, 92.5748759 , 91.37294597, 92.87730812, 93.87155456,
       92.75048583, 93.97853495, 91.32093303, 92.0108226 , 93.18884302,
       93.83969256, 94.5081787 , 94.59300658, 93.04610867, 91.6779221 ,
       91.37294597, 94.76253572, 94.57421727, 94.11176915, 93.97853495])

In [93]:
# the same filter (values between 90 and 95) using np.extract
np.extract((dataset > 90) & (dataset < 95), dataset)

array([92.02628776, 92.9267508 , 92.65657752, 93.87155456, 90.93318132,
       91.37294597, 91.6604946 , 93.37126331, 94.20019732, 93.24376389,
       94.35373179, 92.5748759 , 91.37294597, 92.87730812, 93.87155456,
       92.75048583, 93.97853495, 91.32093303, 92.0108226 , 93.18884302,
       93.83969256, 94.5081787 , 94.59300658, 93.04610867, 91.6779221 ,
       91.37294597, 94.76253572, 94.57421727, 94.11176915, 93.97853495])

> **Note:**    
Conditional filtering can be done with either the bracket syntax or NumPy's `np.extract` function.
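
To see that these spellings agree, here is a small sketch on a made-up 2×3 array (`sample` is an assumption for illustration, not the activity's dataset). It builds one boolean mask and applies it in three ways:

```python
import numpy as np

# hypothetical toy data, not taken from normal_distribution.csv
sample = np.array([[88.0, 91.5, 97.0],
                   [94.9, 95.0, 102.3]])

# True where the value lies strictly between 90 and 95
mask = (sample > 90) & (sample < 95)

by_brackets = sample[mask]             # boolean indexing
by_where = sample[np.where(mask)]      # index arrays returned by np.where
by_extract = np.extract(mask, sample)  # np.extract with the same mask

print(by_brackets.tolist())  # [91.5, 94.9]
print(np.array_equal(by_brackets, by_where))    # True
print(np.array_equal(by_brackets, by_extract))  # True
```

Note that 95.0 is excluded because both comparisons are strict.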

In [77]:
# indices of values that have a delta of less than 1 to 100

indices = np.where(np.abs(dataset - 100) < 1)

print(indices)

(array([ 0,  1,  3,  3,  4,  4,  6,  6,  8,  9, 10, 10, 10, 12, 13, 13, 13,
       14, 14, 15, 16, 16, 17, 17, 18, 18, 20, 21, 21, 21, 22, 23, 23],
      dtype=int64), array([0, 2, 1, 2, 2, 6, 3, 8, 5, 8, 1, 3, 5, 8, 0, 4, 7, 3, 5, 8, 1, 6,
       2, 3, 7, 8, 4, 0, 4, 5, 2, 1, 7], dtype=int64))


In [95]:
# row and column indices of the matching values
rows, cols = np.where(np.abs(dataset - 100) < 1)
# pair them up as [row, column] coordinates
[[rows[index], cols[index]] for (index, _) in np.ndenumerate(rows)]

[[0, 0],
 [1, 2],
 [3, 1],
 [3, 2],
 [4, 2],
 [4, 6],
 [6, 3],
 [6, 8],
 [8, 5],
 [9, 8],
 [10, 1],
 [10, 3],
 [10, 5],
 [12, 8],
 [13, 0],
 [13, 4],
 [13, 7],
 [14, 3],
 [14, 5],
 [15, 8],
 [16, 1],
 [16, 6],
 [17, 2],
 [17, 3],
 [18, 7],
 [18, 8],
 [20, 4],
 [21, 0],
 [21, 4],
 [21, 5],
 [22, 2],
 [23, 1],
 [23, 7]]
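
As a possible alternative to pairing the two index arrays by hand, `np.argwhere` returns the coordinates directly as one row per match. The sketch below uses a made-up array (`sample` is hypothetical) and checks that both approaches give the same pairs:

```python
import numpy as np

# hypothetical toy data, not taken from the activity's dataset
sample = np.array([[99.2, 104.0, 100.4],
                   [97.1, 100.9, 92.7]])

close_to_100 = np.abs(sample - 100) < 1

# approach used above: separate row and column index arrays
rows, cols = np.where(close_to_100)
pairs_from_where = np.column_stack((rows, cols))

# alternative: one (n_matches, 2) array of [row, column] coordinates
pairs_from_argwhere = np.argwhere(close_to_100)

print(pairs_from_argwhere.tolist())  # [[0, 0], [0, 2], [1, 1]]
```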

---

#### Sorting

The client also wants to experiment with some more plotting techniques, so they ask us to deliver these datasets as well:
- values sorted in ascending order for each row
- values sorted in ascending order for each column
- the matrix of indices that would sort each row, that is, for each sorted position the index of the value that belongs there   
```
[3, 1, 2, 5, 4]  =>  [1, 2, 0, 4, 3]
```
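
The mapping in this example is what `np.argsort` computes. A minimal sketch that reproduces it and also shows how it differs from the rank of each value (the variable names are chosen for illustration):

```python
import numpy as np

values = np.array([3, 1, 2, 5, 4])

# indices that would sort the array: position i holds the index
# of the i-th smallest value
order = np.argsort(values)
print(order.tolist())          # [1, 2, 0, 4, 3]

# indexing with the result yields the sorted values
print(values[order].tolist())  # [1, 2, 3, 4, 5]

# the rank of each value is a different quantity; applying argsort
# a second time produces it
ranks = np.argsort(order)
print(ranks.tolist())          # [2, 0, 1, 4, 3]
```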

In [31]:
# values sorted for each row
import pandas as pd
df= pd.DataFrame(np.sort(dataset))
df

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 96.819649 | 97.852307 | 98.567832 | 98.749869 | 98.808334 | 99.149315 | 101.347459 | 104.038527 | 107.435347 |
| 1 | 92.026288 | 92.656578 | 92.926751 | 93.871555 | 97.104393 | 97.245848 | 99.320669 | 101.231629 | 105.719785 |
| 2 | 90.933181 | 95.177501 | 95.662537 | 98.374814 | 98.800844 | 105.952977 | 106.546543 | 107.224824 | 110.188895 |
| 3 | 91.372946 | 91.660495 | 95.087158 | 100.401183 | 100.967814 | 103.404121 | 105.485088 | 106.147284 | 113.420905 |
| 4 | 93.371263 | 94.200197 | 96.100203 | 100.286909 | 100.79479 | 101.208625 | 103.573031 | 105.852694 | 108.579804 |
| 5 | 89.034527 | 93.243764 | 96.283275 | 97.2413 | 97.627878 | 98.296876 | 101.134424 | 102.803871 | 104.603448 |
| 6 | 94.353732 | 95.464934 | 98.457233 | 100.077215 | 100.724189 | 102.975856 | 106.397985 | 106.717516 | 106.832738 |
| 7 | 91.372946 | 92.574876 | 96.025483 | 97.575443 | 98.747675 | 101.347459 | 102.456518 | 102.823609 | 106.475518 |
| 8 | 92.877308 | 97.852307 | 100.854471 | 101.222604 | 101.293268 | 103.192583 | 104.405183 | 105.303504 | 106.038688 |
| 9 | 92.750486 | 93.871555 | 96.968512 | 97.653935 | 99.149315 | 101.536365 | 101.720746 | 103.291471 | 110.444843 |


> **Note:**   
By default, `np.sort` sorts along the last axis (`axis=-1`). For our two-dimensional dataset this is axis 1, so each row is sorted.
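
The effect of the `axis` argument is easiest to see on a tiny array. The following sketch uses a made-up 2×3 array (`sample` is hypothetical) and sorts it along each axis:

```python
import numpy as np

# hypothetical toy data
sample = np.array([[3, 1, 2],
                   [0, 5, 4]])

# default (axis=-1): each row is sorted independently
rows_sorted = np.sort(sample)
print(rows_sorted.tolist())   # [[1, 2, 3], [0, 4, 5]]

# axis=0: each column is sorted independently
cols_sorted = np.sort(sample, axis=0)
print(cols_sorted.tolist())   # [[0, 1, 2], [3, 5, 4]]

# axis=None: the array is flattened first, then sorted
all_sorted = np.sort(sample, axis=None)
print(all_sorted.tolist())    # [0, 1, 2, 3, 4, 5]
```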

In [33]:
# values sorted for each column
df= pd.DataFrame(np.sort(dataset, axis=0))
df

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 91.372946 | 88.802211 | 90.933181 | 93.188843 | 85.988396 | 91.660495 | 91.320933 | 92.574876 | 91.372946 |
| 1 | 92.026288 | 91.677922 | 93.243764 | 94.593007 | 89.034527 | 92.656578 | 93.046109 | 94.200197 | 91.372946 |
| 2 | 94.111769 | 92.010823 | 93.839693 | 96.746303 | 92.750486 | 95.191843 | 94.353732 | 94.762536 | 93.871555 |
| 3 | 95.65982 | 92.877308 | 94.508179 | 97.2413 | 92.926751 | 95.464934 | 96.503429 | 95.087158 | 93.978535 |
| 4 | 95.662537 | 93.871555 | 97.758876 | 97.245848 | 93.371263 | 95.623593 | 96.819649 | 95.852842 | 95.191843 |
| 5 | 96.025483 | 94.574217 | 98.457233 | 97.627878 | 93.978535 | 96.283275 | 96.892443 | 97.595722 | 96.100203 |
| 6 | 96.100203 | 95.177501 | 99.320669 | 97.653935 | 95.937992 | 96.346228 | 96.968512 | 98.00253 | 97.104393 |
| 7 | 96.768148 | 96.593855 | 99.578599 | 97.852307 | 98.29244 | 96.593778 | 97.575443 | 98.071227 | 97.2413 |
| 8 | 96.782662 | 97.104393 | 100.286909 | 99.488954 | 98.613252 | 98.659127 | 97.940469 | 98.567832 | 97.627878 |
| 9 | 97.213157 | 98.296876 | 100.401183 | 99.958279 | 98.749869 | 98.747675 | 97.997624 | 99.586647 | 97.852307 |


In [81]:
row_indices = np.indices(dataset.shape)[0]

print(row_indices)

[[ 0  0  0  0  0  0  0  0  0]
 [ 1  1  1  1  1  1  1  1  1]
 [ 2  2  2  2  2  2  2  2  2]
 [ 3  3  3  3  3  3  3  3  3]
 [ 4  4  4  4  4  4  4  4  4]
 [ 5  5  5  5  5  5  5  5  5]
 [ 6  6  6  6  6  6  6  6  6]
 [ 7  7  7  7  7  7  7  7  7]
 [ 8  8  8  8  8  8  8  8  8]
 [ 9  9  9  9  9  9  9  9  9]
 [10 10 10 10 10 10 10 10 10]
 [11 11 11 11 11 11 11 11 11]
 [12 12 12 12 12 12 12 12 12]
 [13 13 13 13 13 13 13 13 13]
 [14 14 14 14 14 14 14 14 14]
 [15 15 15 15 15 15 15 15 15]
 [16 16 16 16 16 16 16 16 16]
 [17 17 17 17 17 17 17 17 17]
 [18 18 18 18 18 18 18 18 18]
 [19 19 19 19 19 19 19 19 19]
 [20 20 20 20 20 20 20 20 20]
 [21 21 21 21 21 21 21 21 21]
 [22 22 22 22 22 22 22 22 22]
 [23 23 23 23 23 23 23 23 23]]


In [97]:
# indices that would sort each row (argsort along the last axis)

np.argsort(dataset)

array([[6, 3, 7, 4, 5, 0, 8, 1, 2],
       [0, 5, 4, 8, 1, 3, 2, 7, 6],
       [2, 1, 0, 6, 4, 5, 7, 8, 3],
       [0, 5, 7, 2, 1, 8, 4, 6, 3],
       [4, 7, 8, 2, 6, 0, 1, 3, 5],
       [4, 2, 5, 3, 8, 1, 7, 0, 6],
       [6, 5, 2, 8, 3, 1, 4, 0, 7],
       [8, 7, 0, 6, 5, 3, 4, 1, 2],
       [1, 8, 5, 6, 4, 2, 3, 0, 7],
       [4, 1, 6, 3, 8, 2, 5, 7, 0],
       [8, 7, 6, 5, 1, 3, 0, 4, 2],
       [4, 3, 0, 7, 2, 5, 6, 8, 1],
       [4, 6, 5, 0, 8, 7, 3, 2, 1],
       [1, 3, 8, 6, 2, 7, 0, 4, 5],
       [2, 5, 3, 4, 6, 0, 7, 8, 1],
       [1, 2, 3, 5, 6, 7, 8, 4, 0],
       [0, 8, 6, 1, 2, 7, 4, 5, 3],
       [5, 6, 3, 2, 0, 4, 1, 7, 8],
       [6, 5, 1, 0, 7, 8, 3, 2, 4],
       [8, 1, 0, 6, 4, 3, 2, 5, 7],
       [8, 7, 4, 2, 5, 6, 1, 0, 3],
       [7, 8, 0, 5, 4, 6, 1, 2, 3],
       [1, 5, 0, 6, 4, 2, 8, 7, 3],
       [4, 0, 3, 5, 1, 7, 2, 6, 8]], dtype=int64)

---

#### Combining

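After finishing their visualization, they ask us to deliver a way to incrementally add the split parts of the dataset back together, so they can make sure it works with every subset, too.   
Here, each "column" is one of the three column blocks (thirds) into which the dataset was split.   
They want us to send them examples for: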
- adding the second half of the first column
- adding the second column
- adding the third and last separate column
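
The split-and-recombine round trip can be sketched on a small stand-in array before applying it to the real data. Here `sample` is a hypothetical 4×6 array, and the variable names are chosen for illustration only:

```python
import numpy as np

# hypothetical stand-in for the dataset: 4 rows, 6 columns
sample = np.arange(24).reshape(4, 6)

# split into three column blocks, then halve the first block by rows
thirds = np.hsplit(sample, 3)            # three arrays of shape (4, 2)
upper, lower = np.vsplit(thirds[0], 2)   # two arrays of shape (2, 2)

# step 1: put the two halves of the first block back together
first_block = np.vstack([upper, lower])

# steps 2 and 3: append the second and third column blocks
rebuilt = np.hstack([first_block, thirds[1], thirds[2]])

# the reassembled array should match the original exactly
print(np.array_equal(rebuilt, sample))   # True
```

Comparing against the original with `np.array_equal` is a cheap way to confirm that no block ended up in the wrong place.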


In [99]:
# split up the dataset as in the previous activity
thirds = np.hsplit(dataset, 3)
halfed_first = np.vsplit(thirds[0], 2)

# this is the part we sent the client in the previous activity
halfed_first[0]

array([[ 99.14931546, 104.03852715, 107.43534677],
       [ 92.02628776,  97.10439252,  99.32066924],
       [ 95.66253664,  95.17750125,  90.93318132],
       [ 91.37294597, 100.96781394, 100.40118279],
       [101.20862522, 103.5730309 , 100.28690912],
       [102.80387079,  98.29687616,  93.24376389],
       [106.71751618, 102.97585605,  98.45723272],
       [ 96.02548256, 102.82360856, 106.47551845],
       [105.30350449,  92.87730812, 103.19258339],
       [110.44484313,  93.87155456, 101.5363647 ],
       [101.3514185 , 100.37372248, 106.6471081 ],
       [ 97.21315663, 107.02874163, 102.17642112]])

In [102]:
# adding the second half of the first column to the data
first_col = np.vstack([halfed_first[0], halfed_first[1]])
first_col

#halfed_first[1][:,0]

array([[ 99.14931546, 104.03852715, 107.43534677],
       [ 92.02628776,  97.10439252,  99.32066924],
       [ 95.66253664,  95.17750125,  90.93318132],
       [ 91.37294597, 100.96781394, 100.40118279],
       [101.20862522, 103.5730309 , 100.28690912],
       [102.80387079,  98.29687616,  93.24376389],
       [106.71751618, 102.97585605,  98.45723272],
       [ 96.02548256, 102.82360856, 106.47551845],
       [105.30350449,  92.87730812, 103.19258339],
       [110.44484313,  93.87155456, 101.5363647 ],
       [101.3514185 , 100.37372248, 106.6471081 ],
       [ 97.21315663, 107.02874163, 102.17642112],
       [ 95.65982034, 107.22482426, 107.19119932],
       [100.39303522,  92.0108226 ,  97.75887636],
       [103.1521596 , 109.40523174,  93.83969256],
       [106.11454989,  88.80221141,  94.5081787 ],
       [ 96.78266211,  99.84251605, 104.03478031],
       [101.86186193, 103.61720152,  99.57859892],
       [ 97.49594839,  96.59385486, 104.63817694],
       [ 96.76814836,  91.67792

In [103]:
# adding the second column to our combined dataset
#halfed_first[1][:,1]

first_second_col = np.hstack([first_col, thirds[1]])
first_second_col

array([[ 99.14931546, 104.03852715, 107.43534677,  97.85230675,
         98.74986914,  98.80833412],
       [ 92.02628776,  97.10439252,  99.32066924,  97.24584816,
         92.9267508 ,  92.65657752],
       [ 95.66253664,  95.17750125,  90.93318132, 110.18889465,
         98.80084371, 105.95297652],
       [ 91.37294597, 100.96781394, 100.40118279, 113.42090475,
        105.48508838,  91.6604946 ],
       [101.20862522, 103.5730309 , 100.28690912, 105.85269352,
         93.37126331, 108.57980357],
       [102.80387079,  98.29687616,  93.24376389,  97.24130034,
         89.03452725,  96.2832753 ],
       [106.71751618, 102.97585605,  98.45723272, 100.72418901,
        106.39798503,  95.46493436],
       [ 96.02548256, 102.82360856, 106.47551845, 101.34745901,
        102.45651798,  98.74767493],
       [105.30350449,  92.87730812, 103.19258339, 104.40518318,
        101.29326772, 100.85447132],
       [110.44484313,  93.87155456, 101.5363647 ,  97.65393524,
         92.75048583, 101.7

In [104]:
# adding the third column to our combined dataset
#halfed_first[1][:,2]

full_data = np.hstack([first_second_col, thirds[2]])
full_data

array([[ 99.14931546, 104.03852715, 107.43534677,  97.85230675,
         98.74986914,  98.80833412,  96.81964892,  98.56783189,
        101.34745901],
       [ 92.02628776,  97.10439252,  99.32066924,  97.24584816,
         92.9267508 ,  92.65657752, 105.7197853 , 101.23162942,
         93.87155456],
       [ 95.66253664,  95.17750125,  90.93318132, 110.18889465,
         98.80084371, 105.95297652,  98.37481387, 106.54654286,
        107.22482426],
       [ 91.37294597, 100.96781394, 100.40118279, 113.42090475,
        105.48508838,  91.6604946 , 106.1472841 ,  95.08715803,
        103.40412146],
       [101.20862522, 103.5730309 , 100.28690912, 105.85269352,
         93.37126331, 108.57980357, 100.79478953,  94.20019732,
         96.10020311],
       [102.80387079,  98.29687616,  93.24376389,  97.24130034,
         89.03452725,  96.2832753 , 104.60344836, 101.13442416,
         97.62787811],
       [106.71751618, 102.97585605,  98.45723272, 100.72418901,
        106.39798503,  95.4649

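> **Note:**    
The same results can be achieved with `np.concatenate`, where you provide the axis along which the arrays are joined (`axis=0` corresponds to `vstack` and `axis=1` to `hstack` for 2D arrays).   
`np.stack` also takes an axis, but it joins the arrays along a new axis and therefore requires all inputs to have the same shape.   
Depending on your preferences, you might want to use those.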
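
A short sketch on two made-up 2×2 arrays (`top` and `bottom` are hypothetical names) illustrates how `np.concatenate` lines up with `vstack` and `hstack`, and how `np.stack` behaves differently by adding a dimension:

```python
import numpy as np

# hypothetical toy arrays
top = np.array([[1, 2], [3, 4]])
bottom = np.array([[5, 6], [7, 8]])

# concatenate along axis 0 matches vstack for 2D inputs
same_as_vstack = np.array_equal(
    np.vstack([top, bottom]), np.concatenate([top, bottom], axis=0))

# concatenate along axis 1 matches hstack for 2D inputs
same_as_hstack = np.array_equal(
    np.hstack([top, bottom]), np.concatenate([top, bottom], axis=1))

# stack joins along a NEW axis, so the result is three-dimensional
stacked = np.stack([top, bottom], axis=0)

print(same_as_vstack, same_as_hstack)  # True True
print(stacked.shape)                   # (2, 2, 2)
```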

---

#### Reshaping

For their internal AI algorithms, they need the dataset reshaped in a way that reduces the number of columns.   
They asked us to deliver the whole dataset in the following shapes:
- reshaped into a one-dimensional list with all values
- reshaped into a matrix with only 2 columns
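
There are several ways to flatten an array, and they differ in the shape and ownership of the result. The following sketch uses a made-up 2×3 array (`sample` is hypothetical) to compare `flatten`, `ravel`, and `reshape(1, -1)`:

```python
import numpy as np

# hypothetical toy data
sample = np.array([[1, 2, 3],
                   [4, 5, 6]])

flat_copy = sample.flatten()        # 1D array, always a copy
flat_view = sample.ravel()          # 1D array, a view when possible
single_row = sample.reshape(1, -1)  # still 2D: one row, six columns

print(flat_copy.shape)   # (6,)
print(single_row.shape)  # (1, 6)

# modifying the copy leaves the original untouched
flat_copy[0] = 99
print(sample[0, 0])      # 1

# for this contiguous array, ravel shares memory with the original
print(np.shares_memory(flat_view, sample))  # True
```

If a truly one-dimensional result is required, `flatten` or `ravel` fits better than `reshape(1, -1)`, which keeps two dimensions.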

In [82]:
# reshaping to a list of values

dataset.flatten()

array([ 99.14931546, 104.03852715, 107.43534677,  97.85230675,
        98.74986914,  98.80833412,  96.81964892,  98.56783189,
       101.34745901,  92.02628776,  97.10439252,  99.32066924,
        97.24584816,  92.9267508 ,  92.65657752, 105.7197853 ,
       101.23162942,  93.87155456,  95.66253664,  95.17750125,
        90.93318132, 110.18889465,  98.80084371, 105.95297652,
        98.37481387, 106.54654286, 107.22482426,  91.37294597,
       100.96781394, 100.40118279, 113.42090475, 105.48508838,
        91.6604946 , 106.1472841 ,  95.08715803, 103.40412146,
       101.20862522, 103.5730309 , 100.28690912, 105.85269352,
        93.37126331, 108.57980357, 100.79478953,  94.20019732,
        96.10020311, 102.80387079,  98.29687616,  93.24376389,
        97.24130034,  89.03452725,  96.2832753 , 104.60344836,
       101.13442416,  97.62787811, 106.71751618, 102.97585605,
        98.45723272, 100.72418901, 106.39798503,  95.46493436,
        94.35373179, 106.83273763, 100.07721494,  96.02

In [105]:
# alternative: a two-dimensional array with a single row, shape (1, n)
np.reshape(dataset, (1, -1))

array([[ 99.14931546, 104.03852715, 107.43534677,  97.85230675,
         98.74986914,  98.80833412,  96.81964892,  98.56783189,
        101.34745901,  92.02628776,  97.10439252,  99.32066924,
         97.24584816,  92.9267508 ,  92.65657752, 105.7197853 ,
        101.23162942,  93.87155456,  95.66253664,  95.17750125,
         90.93318132, 110.18889465,  98.80084371, 105.95297652,
         98.37481387, 106.54654286, 107.22482426,  91.37294597,
        100.96781394, 100.40118279, 113.42090475, 105.48508838,
         91.6604946 , 106.1472841 ,  95.08715803, 103.40412146,
        101.20862522, 103.5730309 , 100.28690912, 105.85269352,
         93.37126331, 108.57980357, 100.79478953,  94.20019732,
         96.10020311, 102.80387079,  98.29687616,  93.24376389,
         97.24130034,  89.03452725,  96.2832753 , 104.60344836,
        101.13442416,  97.62787811, 106.71751618, 102.97585605,
         98.45723272, 100.72418901, 106.39798503,  95.46493436,
         94.35373179, 106.83273763, 100.

In [83]:
# reshaping to a matrix with two columns
dataset.reshape(-1,2)

array([[ 99.14931546, 104.03852715],
       [107.43534677,  97.85230675],
       [ 98.74986914,  98.80833412],
       [ 96.81964892,  98.56783189],
       [101.34745901,  92.02628776],
       [ 97.10439252,  99.32066924],
       [ 97.24584816,  92.9267508 ],
       [ 92.65657752, 105.7197853 ],
       [101.23162942,  93.87155456],
       [ 95.66253664,  95.17750125],
       [ 90.93318132, 110.18889465],
       [ 98.80084371, 105.95297652],
       [ 98.37481387, 106.54654286],
       [107.22482426,  91.37294597],
       [100.96781394, 100.40118279],
       [113.42090475, 105.48508838],
       [ 91.6604946 , 106.1472841 ],
       [ 95.08715803, 103.40412146],
       [101.20862522, 103.5730309 ],
       [100.28690912, 105.85269352],
       [ 93.37126331, 108.57980357],
       [100.79478953,  94.20019732],
       [ 96.10020311, 102.80387079],
       [ 98.29687616,  93.24376389],
       [ 97.24130034,  89.03452725],
       [ 96.2832753 , 104.60344836],
       [101.13442416,  97.62787811],
 

> **Note:**   
A `-1` in the shape tells NumPy to infer that dimension from the total number of elements and the other given dimensions. Only one dimension can be `-1`.
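
As a small illustration of this inference, the sketch below reshapes a made-up array of 12 elements (`sample` is hypothetical) and also shows what happens when the sizes do not divide evenly:

```python
import numpy as np

# hypothetical toy data: 12 elements
sample = np.arange(12)

# 12 elements / 2 columns -> NumPy infers 6 rows
two_cols = sample.reshape(-1, 2)
print(two_cols.shape)    # (6, 2)

# 12 elements / 3 rows -> NumPy infers 4 columns
three_rows = sample.reshape(3, -1)
print(three_rows.shape)  # (3, 4)

# 12 is not divisible by 5, so no valid shape can be inferred
try:
    sample.reshape(-1, 5)
    reshape_failed = False
except ValueError:
    reshape_failed = True
print(reshape_failed)    # True
```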