## Activity 02: Indexing, Slicing, and Iterating

Our client wants to verify that our dataset is distributed around a mean value of 100.   
They asked us to run some tests on several subsections of it to make sure they won't get an unrepresentative section of our data.

For each subtask, look at the mean value of the selected data.

#### Loading the dataset

In [1]:
# importing the necessary dependencies
import numpy as np

In [2]:
# loading the Dataset
dataset = np.genfromtxt('./data/normal_distribution.csv', delimiter=',')

In [3]:
dataset

array([[ 99.14931546, 104.03852715, 107.43534677,  97.85230675,
         98.74986914,  98.80833412,  96.81964892,  98.56783189,
        101.34745901],
       [ 92.02628776,  97.10439252,  99.32066924,  97.24584816,
         92.9267508 ,  92.65657752, 105.7197853 , 101.23162942,
         93.87155456],
       [ 95.66253664,  95.17750125,  90.93318132, 110.18889465,
         98.80084371, 105.95297652,  98.37481387, 106.54654286,
        107.22482426],
       [ 91.37294597, 100.96781394, 100.40118279, 113.42090475,
        105.48508838,  91.6604946 , 106.1472841 ,  95.08715803,
        103.40412146],
       [101.20862522, 103.5730309 , 100.28690912, 105.85269352,
         93.37126331, 108.57980357, 100.79478953,  94.20019732,
         96.10020311],
       [102.80387079,  98.29687616,  93.24376389,  97.24130034,
         89.03452725,  96.2832753 , 104.60344836, 101.13442416,
         97.62787811],
       [106.71751618, 102.97585605,  98.45723272, 100.72418901,
        106.39798503,  95.4649

---

#### Indexing

Since we need several rows and single values of our dataset to complete the given task, we have to use indexing to select them.   
To recap, we need:
- the second row
- the last row
- the first value of the first row
- the last value of the second-to-last row
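
The four lookups listed above can be sketched on a small stand-in array. This is a minimal sketch: `demo` and the variable names are hypothetical and are not part of the client's dataset. It also shows how `np.mean` could be used to check the mean of a selected row.

```python
import numpy as np

# Hypothetical 4x3 stand-in for the real dataset
demo = np.arange(12, dtype=float).reshape(4, 3)

second_row = demo[1]                   # integer index selects one row
last_row = demo[-1]                    # negative indices count from the end
first_value = demo[0, 0]               # first row, first value
last_of_second_to_last = demo[-2, -1]  # second-to-last row, last value

# Mean of each selected row
print(second_row, np.mean(second_row))
print(last_row, np.mean(last_row))
print(first_value)
print(last_of_second_to_last)
```

The combined form `demo[-2, -1]` selects the row and the column in a single indexing step, whereas `demo[-2][-1]` first creates the row and then indexes into it.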

In [4]:
# indexing the second row of the dataset (2nd row)

second_row = dataset[1]
print(second_row)

[ 92.02628776  97.10439252  99.32066924  97.24584816  92.9267508
  92.65657752 105.7197853  101.23162942  93.87155456]


In [5]:
# indexing the last row of the dataset (the slice -1: keeps the result two-dimensional)

last_element = dataset[-1:]
print(last_element)

[[ 94.11176915  99.62387832 104.51786419  97.62787811  93.97853495
   98.75108352 106.05042487 100.07721494 106.89005002]]
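
The double brackets in the output above come from using a slice rather than an integer index. A minimal sketch of the difference, using a hypothetical stand-in array `demo`:

```python
import numpy as np

# Hypothetical 3x2 stand-in for the real dataset
demo = np.arange(6).reshape(3, 2)

as_row = demo[-1]     # integer index drops the row axis
as_block = demo[-1:]  # slice keeps the row axis

print(as_row.shape)
print(as_block.shape)
```

Both selections contain the same values. Which form is preferable depends on whether later code expects a one-dimensional row or a two-dimensional array with a single row.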


In [18]:
# indexing the first value of the first row (1st row, 1st value)

print(dataset[0][0])

99.14931546


In [7]:
# indexing the last value of the second to last row (we want to use the combined access syntax here) 

print(dataset[-2,-1])

101.2226037


---

#### Slicing

Besides single rows and values, we also need some subsets of the dataset.   
Here we want the following slices:
- a 2x2 slice starting at the second element of the second row (second and third rows, second and third columns)
- every other element of the 5th row
- the content of the last row in reversed order
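
The three slices can be sketched on a small stand-in array. This is a minimal sketch: `demo` is hypothetical, and the 2x2 request is read here as rows 2 to 3 and columns 2 to 3, which is an assumption about what the client means.

```python
import numpy as np

# Hypothetical 5x6 stand-in for the real dataset
demo = np.arange(30).reshape(5, 6)

block_2x2 = demo[1:3, 1:3]      # rows 2-3 and columns 2-3 (end index is exclusive)
every_other = demo[4, ::2]      # fifth row, step of 2
last_reversed = demo[-1, ::-1]  # last row, negative step reverses the order

print(block_2x2)
print(every_other)
print(last_reversed)
```

Basic slices like these return views into the original array, so writing to `block_2x2` would also change `demo`.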

In [8]:
# slicing a 2x2 block of 4 elements: second and third rows, second and third columns

print(dataset[1:3,1:3])

[[97.10439252 99.32066924]
 [95.17750125 90.93318132]]


##### Why is it not a problem if the mean of such a small subsection deviates more from 100?

A few low (or high) values can cluster in such a small subsection and pull its mean well away from 100.   
If we make our subsection larger, we have a higher chance of getting a more representative view of our data.
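
The effect of subsection size can be made concrete with the standard error of the mean, `sigma / sqrt(n)`. The sketch below assumes independent values and a per-value standard deviation of 5. That figure is a hypothetical choice for illustration and is not measured from the dataset.

```python
import math

# Hypothetical per-value standard deviation (an assumption, not measured)
SIGMA = 5.0


def standard_error(n, sigma=SIGMA):
    """Expected spread of the mean of n independent values."""
    return sigma / math.sqrt(n)


se_small = standard_error(4)      # a 2x2 subsection
se_full = standard_error(24 * 9)  # the whole 24x9 dataset

print(se_small)
print(round(se_full, 3))
```

Under these assumptions, the mean of a 2x2 subsection is expected to scatter several times more widely around 100 than the mean of the full dataset.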

In [22]:
# selecting every second element of the fifth row 

print(dataset[4,::2])

[101.20862522 100.28690912  93.37126331 100.79478953  96.10020311]


In [10]:
# reversing the entry order: selecting the last row in reversed order

print('rev:',dataset[-1,::-1])

rev: [106.89005002 100.07721494 106.05042487  98.75108352  93.97853495
  97.62787811 104.51786419  99.62387832  94.11176915]


---

#### Splitting

Our client's team only wants to use a small subset of the given dataset.   
Therefore, we first need to split it into 3 equal pieces and then give them the first half of the first piece.   
They sent us this drawing to show us what they need:
```
1, 2, 3, 4, 5, 6          1, 2     3, 4    5, 6          1, 2  
3, 2, 1, 5, 4, 6    =>    3, 2     1, 5    4, 6    =>    3, 2    =>    1, 2
5, 3, 1, 2, 4, 3          5, 3     1, 2    4, 3                        3, 2
1, 2, 2, 4, 1, 5          1, 2     2, 4    1, 5          5, 3
                                                         1, 2
```

> **Note:**   
We are using a very small dataset here, but imagine you have a huge amount of data and only want to look at a small subset of it to tweak your visualizations.
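
The two steps in the drawing can be sketched with `np.hsplit` and `np.vsplit`, using the 4x6 example from the drawing itself. The variable names are hypothetical.

```python
import numpy as np

# The 4x6 example from the client's drawing
drawing = np.array([
    [1, 2, 3, 4, 5, 6],
    [3, 2, 1, 5, 4, 6],
    [5, 3, 1, 2, 4, 3],
    [1, 2, 2, 4, 1, 5],
])

thirds = np.hsplit(drawing, 3)    # three blocks of 2 columns each
halves = np.vsplit(thirds[0], 2)  # first block split into two blocks of 2 rows
result = halves[0]                # first half of the first split

print(result)
```

Both functions raise an error if the array cannot be divided into equally sized pieces, so the number of columns must be divisible by 3 and the number of rows by 2.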

In [11]:
# splitting the rows of our dataset at one third of its length


# manual version using slices
n_rows = dataset.shape[0]
first_split = n_rows // 3
second_split = 2 * n_rows // 3
print(first_split, second_split)

# first third of the rows
first_split_arr = dataset[:first_split]
print('1st:', first_split_arr.shape)

# remaining two thirds of the rows
second_split_arr = dataset[first_split:]
print('2nd:', second_split_arr.shape)


8 16
1st: (8, 9)
2nd: (16, 9)


In [28]:
# class approach: np.hsplit divides the 9 columns into 3 equal blocks

hsplits = np.hsplit(dataset, 3)
first_split = hsplits[0]
second_split = [hsplits[1], hsplits[2]]

print('1/3:', first_split)
print('...................')
print('2/3:', second_split)

# vsplits = np.vsplit(hsplits[0], 2)
# print(vsplits[0])

1/3: [[ 99.14931546 104.03852715 107.43534677]
 [ 92.02628776  97.10439252  99.32066924]
 [ 95.66253664  95.17750125  90.93318132]
 [ 91.37294597 100.96781394 100.40118279]
 [101.20862522 103.5730309  100.28690912]
 [102.80387079  98.29687616  93.24376389]
 [106.71751618 102.97585605  98.45723272]
 [ 96.02548256 102.82360856 106.47551845]
 [105.30350449  92.87730812 103.19258339]
 [110.44484313  93.87155456 101.5363647 ]
 [101.3514185  100.37372248 106.6471081 ]
 [ 97.21315663 107.02874163 102.17642112]
 [ 95.65982034 107.22482426 107.19119932]
 [100.39303522  92.0108226   97.75887636]
 [103.1521596  109.40523174  93.83969256]
 [106.11454989  88.80221141  94.5081787 ]
 [ 96.78266211  99.84251605 104.03478031]
 [101.86186193 103.61720152  99.57859892]
 [ 97.49594839  96.59385486 104.63817694]
 [ 96.76814836  91.6779221  101.79132774]
 [106.89005002 106.57364584 102.26648279]
 [ 99.80873105 101.63973121 106.46476468]
 [ 96.10020311  94.57421727 100.80409326]
 [ 94.11176915  99.62387832 1

In [23]:
dataset.shape

(24, 9)

In [14]:
# splitting the columns of our dataset at index 2 (first two columns, then the remaining seven)

first_split = dataset[:,:2]
print(first_split.shape)

second_split = dataset[:,2:]
print(second_split.shape)

(24, 2)
(24, 7)


---

#### Iterating

Once you have sent over the dataset, they tell you that they also need a way to iterate over the whole dataset element by element, as if it were a one-dimensional list.   
However, they also want to know the position of each element in the dataset itself.

They send you this piece of code and tell you that it's not working as intended.   
Come up with the right solution for their needs.

In [15]:
# iterating over whole dataset (each value in each row)
curr_index = 0

for x in np.nditer(dataset):
    print(x, curr_index)
    curr_index += 1

99.14931546 0
104.03852715 1
107.43534677 2
97.85230675 3
98.74986914 4
98.80833412 5
96.81964892 6
98.56783189 7
101.34745901 8
92.02628776 9
97.10439252 10
99.32066924 11
97.24584816 12
92.9267508 13
92.65657752 14
105.7197853 15
101.23162942 16
93.87155456 17
95.66253664 18
95.17750125 19
90.93318132 20
110.18889465 21
98.80084371 22
105.95297652 23
98.37481387 24
106.54654286 25
107.22482426 26
91.37294597 27
100.96781394 28
100.40118279 29
113.42090475 30
105.48508838 31
91.6604946 32
106.1472841 33
95.08715803 34
103.40412146 35
101.20862522 36
103.5730309 37
100.28690912 38
105.85269352 39
93.37126331 40
108.57980357 41
100.79478953 42
94.20019732 43
96.10020311 44
102.80387079 45
98.29687616 46
93.24376389 47
97.24130034 48
89.03452725 49
96.2832753 50
104.60344836 51
101.13442416 52
97.62787811 53
106.71751618 54
102.97585605 55
98.45723272 56
100.72418901 57
106.39798503 58
95.46493436 59
94.35373179 60
106.83273763 61
100.07721494 62
96.02548256 63
102.82360856 64
106.475518

In [16]:
# iterating over the whole dataset with indices matching the position in the dataset
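
The client's loop only counts elements with a running number, which does not give the row and column of each value. One possible approach is `np.ndenumerate`, which yields the index tuple together with each value. This is a minimal sketch on a hypothetical stand-in array `demo`; the same loop should work on `dataset`.

```python
import numpy as np

# Hypothetical 2x2 stand-in for the real dataset
demo = np.array([[10.0, 20.0], [30.0, 40.0]])

positions = []
for index, value in np.ndenumerate(demo):
    # index is a (row, column) tuple, value is the element at that position
    positions.append((index, float(value)))
    print(index, value)
```

An alternative would be `np.nditer` with `flags=['multi_index']`, reading `it.multi_index` inside the loop. `np.ndenumerate` is used here because it needs less code.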

