## Assignment #1.1: Use NumPy to compute the Mean, Median, and Variance

Taken from https://www.packtpub.com/big-data-and-business-intelligence/data-visualisation-python<br /><br />
Last modified by Alexander Ogay on 21/9/2022

In this activity, you will consolidate the skills you've acquired in the last exercise and use NumPy to do some very basic mathematical calculations on our `normal_distribution` dataset.   
NumPy has a consistent API, so it should be rather easy to transfer your knowledge of the mean method to median and variance.

#### Complete the code in all sections with the <code>numpy</code> module.
####  <span style="color:red">If other modules, such as <code>statistics</code> and <code>pandas</code>, are used to complete this assignment with the exception of the last section, it will not be graded. In addition, keying the data directly into this notebook will result in a considerably lower score.</span> 

### Loading the dataset
1. Import the necessary dependencies

In [1]:
import numpy as np

2. Load the dataset in the folder.

In [2]:
dataset = np.genfromtxt('normal_distribution.csv', delimiter=',')

3. Look at the first two rows of the dataset.

In [3]:
dataset[0], dataset[1]

(array([ 99.14931546, 104.03852715, 107.43534677,  97.85230675,
         98.74986914,  98.80833412,  96.81964892,  98.56783189]),
 array([ 92.02628776,  97.10439252,  99.32066924,  97.24584816,
         92.9267508 ,  92.65657752, 105.7197853 , 101.23162942]))
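
After loading, a quick look at the array's shape and dtype can confirm that the file was parsed as expected. The following is a minimal sketch: the CSV text is made up and stands in for `normal_distribution.csv`, so the snippet runs without the file.

```python
import io
import numpy as np

# Hypothetical two-row CSV standing in for normal_distribution.csv
csv_text = "1.0,2.0,3.0\n4.0,5.0,6.0\n"

# genfromtxt accepts a file path or a file-like object
sample = np.genfromtxt(io.StringIO(csv_text), delimiter=',')

print(sample.shape)  # (number of rows, number of columns)
print(sample.dtype)  # floating point by default
```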

---

## Section 2: Indexing, Slicing, and Iterating

Our client wants to prove that our dataset is nicely distributed around the mean value of 100.   
They asked us to run some tests on several subsections of it to make sure they won't get a non-descriptive section of our data.

Look at the mean value of each subtask.

#### Indexing

Since we need several rows of our dataset to complete the given task, we have to use indexing to get the right rows.   
To recap, we need: 
- the second row 
- the last row
- the first value of the first row
- the last value of the second to the last row

4. Index the second row of the dataset

In [4]:
dataset[1]

array([ 92.02628776,  97.10439252,  99.32066924,  97.24584816,
        92.9267508 ,  92.65657752, 105.7197853 , 101.23162942])

5. Index the last element of the dataset (last row)

In [5]:
dataset[-1]

array([ 94.11176915,  99.62387832, 104.51786419,  97.62787811,
        93.97853495,  98.75108352, 106.05042487, 100.07721494])

6. Index the first value of the first row (row 0, cell 0)

In [6]:
dataset[0][0]

99.14931546

7. Index the last value of the second to last row (*not the last two rows*)

In [7]:
dataset[-2][-1]

103.83852459
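
The indexing patterns used above can be tried on a small array where the results are easy to verify by eye. This is a sketch on a made-up 3x4 array, not on the assignment dataset. It also shows that `demo[i, j]` selects the same element as `demo[i][j]`.

```python
import numpy as np

# Made-up 3x4 array used only for illustration
demo = np.arange(12).reshape(3, 4)

second_row = demo[1]                    # [4 5 6 7]
last_row = demo[-1]                     # [8 9 10 11]
first_value = demo[0, 0]                # 0, same element as demo[0][0]
last_of_second_to_last = demo[-2, -1]   # 7

print(second_row, last_row, first_value, last_of_second_to_last)
```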

---

#### Slicing

Other than the single rows and values we also need to get some subsets of the dataset.   
Here we want slices:
- a 2x2 slice covering the first two rows and the first two columns
- every other element of the 5th row
- the content of the last row in reversed order

8. Slice an intersection of 4 elements (2x2) of the first two rows and first two columns

In [8]:
dataset[:2, :2]

array([[ 99.14931546, 104.03852715],
       [ 92.02628776,  97.10439252]])

##### Why is it not a problem if such a small subsection deviates more strongly from 100?

In such a small subsection, a few unusually small (or large) values can cluster together and pull the mean well away from 100.   
If we make our subsection larger, we have a higher chance of getting a representative view of our data.

9. Select every second element of the fifth row

In [9]:
dataset[4, ::2]

array([101.20862522, 100.28690912,  93.37126331, 100.79478953])

10. Reverse the entry order of the last row

In [10]:
dataset[-1, ::-1]

array([100.07721494, 106.05042487,  98.75108352,  93.97853495,
        97.62787811, 104.51786419,  99.62387832,  94.11176915])
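
Slices follow the `start:stop:step` pattern along each axis. The sketch below uses a made-up 4x5 array (not the assignment data) to show the three kinds of slices from this section. It also shows that a stop of `-1` excludes the last element, which matters when a row has an odd number of entries.

```python
import numpy as np

# Made-up 4x5 array used only for illustration
demo = np.arange(20).reshape(4, 5)

top_left = demo[:2, :2]         # first two rows, first two columns
every_other = demo[2, ::2]      # [10 12 14]
drops_last = demo[2, 0:-1:2]    # [10 12], the stop of -1 excludes 14
reversed_row = demo[-1, ::-1]   # [19 18 17 16 15]

print(top_left)
print(every_other, drops_last, reversed_row)
```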

---

## Section 3: Summary Statistics

### Mean

11. Calculate the mean of the third row

In [11]:
np.mean(dataset[2])

100.20466135250001

12. Calculate the mean of the last column

In [12]:
np.mean(dataset[:, -1:])

100.4404927375

13. Calculate the mean of the intersection of the first 3 rows and first 3 columns

In [13]:
np.mean(dataset[:3, :3])

97.87197312333333
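
`np.mean` also accepts an `axis` argument, which avoids slicing out each row or column by hand. A small sketch on made-up values:

```python
import numpy as np

# Made-up 2x3 array used only for illustration
demo = np.array([[1.0, 2.0, 3.0],
                 [4.0, 5.0, 6.0]])

overall = np.mean(demo)             # mean of all six values: 3.5
column_means = np.mean(demo, axis=0)  # one mean per column: [2.5 3.5 4.5]
row_means = np.mean(demo, axis=1)     # one mean per row: [2. 5.]
last_column = np.mean(demo[:, -1])    # mean of the last column: 4.5

print(overall, column_means, row_means, last_column)
```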

### Median

14. Calculate the median of the last row

In [14]:
np.median(dataset[-1])

99.18748092

15. Calculate the median of the last 3 columns

In [15]:
np.median(dataset[:, -3:])

99.47332349999999

16. Calculate the median of each row

In [16]:
for i in dataset:
    print(np.median(i))

98.77910163
97.17512034
98.58782879
100.684498365
101.001707375
97.76908825
101.85002253
100.04756696999999
102.242925555
99.59514997
100.49557530499999
99.886071405
99.006479935
98.672761775
102.44376222
96.61933565499999
104.0968893
100.72023042500001
98.70877396
99.75008653500001
104.89344427500001
101.006349425
98.305438015
99.18748092


17. Calculate the median of each column

In [17]:
np.median(dataset[:, :1]), np.median(dataset[:, 1:2]), np.median(dataset[:, 2:3]), np.median(dataset[:, 3:4]), np.median(dataset[:, 4:5]), np.median(dataset[:, 5:6]), np.median(dataset[:, 6:7]), np.median(dataset[:, 7:8])

(99.479023255,
 100.108119265,
 101.66384622,
 100.96596128,
 100.19629221,
 99.08416696500001,
 98.79890832000001,
 100.60581955)
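
The per-row and per-column medians can also be computed in a single call with the `axis` argument of `np.median`. The sketch below uses a made-up 3x3 array and checks that the result agrees with an explicit loop over the rows.

```python
import numpy as np

# Made-up 3x3 array used only for illustration
demo = np.array([[1.0, 9.0, 3.0],
                 [4.0, 5.0, 6.0],
                 [7.0, 2.0, 8.0]])

row_medians = np.median(demo, axis=1)     # [3. 5. 7.]
column_medians = np.median(demo, axis=0)  # [4. 5. 6.]

# Same per-row result as looping over the rows one by one
looped = np.array([np.median(row) for row in demo])

print(row_medians, column_medians)
```

The same `axis` argument works for `np.var` and `np.std`.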

---

### Variance
18. Calculate the variance of each column

In [18]:
np.var(dataset[:, :1]), np.var(dataset[:, 1:2]), np.var(dataset[:, 2:3]), np.var(dataset[:, 3:4]), np.var(dataset[:, 4:5]), np.var(dataset[:, 5:6]), np.var(dataset[:, 6:7]), np.var(dataset[:, 7:8])

(23.647574647546676,
 29.78886108974666,
 20.50542010670524,
 26.03204443386493,
 28.388531753037643,
 19.099608170015305,
 17.672911740506233,
 16.179232042544072)

19. Calculate the variance of the intersection of the last 2 rows and first 2 columns

In [20]:
np.var(dataset[-2:, :2])

4.674691991769191

The values of the variance might seem a little bit strange at first.   
You can always go back to the topic that gives you a quick statistical overview to recap what you've learned so far.   

> **Note:**   
Just remember, the variance is not the standard deviation.   

Try calculating the standard deviation with NumPy. It is expressed in the same units as the data, so it is easier to compare against our dataset.

20. Calculate the standard deviation for each column

In [21]:
np.std(dataset[:, :1]), np.std(dataset[:, 1:2]), np.std(dataset[:, 2:3]), np.std(dataset[:, 3:4]), np.std(dataset[:, 4:5]), np.std(dataset[:, 5:6]), np.std(dataset[:, 6:7]), np.std(dataset[:, 7:8])

(4.8628771984851396,
 5.457917284985791,
 4.528291080165369,
 5.102160761272123,
 5.328088940045731,
 4.370309848284822,
 4.203916238521676,
 4.022341611865416)

21. Calculate the standard deviation for the dataset

In [22]:
np.std(dataset)

4.838197554269257
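
The standard deviation is the square root of the variance, which is why it is easier to relate to the data. The sketch below verifies this on made-up values. By default, `np.var` and `np.std` compute the population statistics (`ddof=0`); passing `ddof=1` gives the sample versions.

```python
import numpy as np

# Made-up values with mean 5, chosen so the results are round numbers
demo = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

variance = np.var(demo)                # population variance: 4.0
std_dev = np.std(demo)                 # population standard deviation: 2.0
sample_variance = np.var(demo, ddof=1) # divides by n - 1 instead of n

print(variance, std_dev, sample_variance)
```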

---

## A Bit Extra

#### Splitting

Our client's team only wants to use a small subset of the given dataset.   
Therefore we need to first split it into 3 equal pieces and then give them the first half of the first split.   
They sent us this drawing to show us what they need:
```
1, 2, 3, 4, 5, 6          1, 2     3, 4    5, 6          1, 2  
3, 2, 1, 5, 4, 6    =>    3, 2     1, 5    4, 6    =>    3, 2    =>    1, 2
5, 3, 1, 2, 4, 3          5, 3     1, 2    4, 3                        3, 2
1, 2, 2, 4, 1, 5          1, 2     2, 4    1, 5          5, 3
                                                         1, 2
```
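
One way to read the drawing above is as a split into three column blocks, followed by a split of the first block into a top and a bottom half. The sketch below reproduces it on the 4x6 grid from the drawing, using `np.hsplit` and `np.vsplit`. Both the interpretation and the variable names are assumptions made for illustration.

```python
import numpy as np

# The 4x6 grid from the client's drawing
grid = np.array([[1, 2, 3, 4, 5, 6],
                 [3, 2, 1, 5, 4, 6],
                 [5, 3, 1, 2, 4, 3],
                 [1, 2, 2, 4, 1, 5]])

# Split into three equal blocks of columns, each of shape (4, 2)
column_blocks = np.hsplit(grid, 3)
first_block = column_blocks[0]

# Split the first block into a top and a bottom half, each of shape (2, 2)
top_half, bottom_half = np.vsplit(first_block, 2)

print(top_half)
```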

> **Note:**   
We are using a very small dataset here, but imagine you have a huge amount of data and only want to look at a small subset of it to tweak your visualizations.

22. Split up our dataset horizontally on indices one third and two thirds

In [33]:
np.array_split(dataset, 3)

[array([[ 99.14931546, 104.03852715, 107.43534677,  97.85230675,
          98.74986914,  98.80833412,  96.81964892,  98.56783189],
        [ 92.02628776,  97.10439252,  99.32066924,  97.24584816,
          92.9267508 ,  92.65657752, 105.7197853 , 101.23162942],
        [ 95.66253664,  95.17750125,  90.93318132, 110.18889465,
          98.80084371, 105.95297652,  98.37481387, 106.54654286],
        [ 91.37294597, 100.96781394, 100.40118279, 113.42090475,
         105.48508838,  91.6604946 , 106.1472841 ,  95.08715803],
        [101.20862522, 103.5730309 , 100.28690912, 105.85269352,
          93.37126331, 108.57980357, 100.79478953,  94.20019732],
        [102.80387079,  98.29687616,  93.24376389,  97.24130034,
          89.03452725,  96.2832753 , 104.60344836, 101.13442416],
        [106.71751618, 102.97585605,  98.45723272, 100.72418901,
         106.39798503,  95.46493436,  94.35373179, 106.83273763],
        [ 96.02548256, 102.82360856, 106.47551845, 101.34745901,
         102.45651

23. Split up our dataset vertically on index 2

In [50]:
np.array_split(np.array_split(dataset, 3), 1, axis=2)

[array([[[ 99.14931546, 104.03852715, 107.43534677],
         [ 92.02628776,  97.10439252,  99.32066924],
         [ 95.66253664,  95.17750125,  90.93318132],
         [ 91.37294597, 100.96781394, 100.40118279],
         [101.20862522, 103.5730309 , 100.28690912],
         [102.80387079,  98.29687616,  93.24376389],
         [106.71751618, 102.97585605,  98.45723272],
         [ 96.02548256, 102.82360856, 106.47551845]],
 
        [[105.30350449,  92.87730812, 103.19258339],
         [110.44484313,  93.87155456, 101.5363647 ],
         [101.3514185 , 100.37372248, 106.6471081 ],
         [ 97.21315663, 107.02874163, 102.17642112],
         [ 95.65982034, 107.22482426, 107.19119932],
         [100.39303522,  92.0108226 ,  97.75887636],
         [103.1521596 , 109.40523174,  93.83969256],
         [106.11454989,  88.80221141,  94.5081787 ]],
 
        [[ 96.78266211,  99.84251605, 104.03478031],
         [101.86186193, 103.61720152,  99.57859892],
         [ 97.49594839,  96.59385486, 10

---

#### Iterating

Once you have sent over the dataset, they tell you that they also need a way to iterate over the whole dataset element by element, as if it were a one-dimensional list.   
However, they also want to know the position of each element in the dataset itself.

They send you this piece of code and tell you that it's not working as intended.   
Come up with the right solution for their needs.

24. Iterate over whole dataset (each value in each row)

In [57]:
for i in np.array_split(np.array_split(dataset, 3), 1, axis=2):
    for j in i:
        for k in j:
            for l in k:
                print(l)

99.14931546
104.03852715
107.43534677
92.02628776
97.10439252
99.32066924
95.66253664
95.17750125
90.93318132
91.37294597
100.96781394
100.40118279
101.20862522
103.5730309
100.28690912
102.80387079
98.29687616
93.24376389
106.71751618
102.97585605
98.45723272
96.02548256
102.82360856
106.47551845
105.30350449
92.87730812
103.19258339
110.44484313
93.87155456
101.5363647
101.3514185
100.37372248
106.6471081
97.21315663
107.02874163
102.17642112
95.65982034
107.22482426
107.19119932
100.39303522
92.0108226
97.75887636
103.1521596
109.40523174
93.83969256
106.11454989
88.80221141
94.5081787
96.78266211
99.84251605
104.03478031
101.86186193
103.61720152
99.57859892
97.49594839
96.59385486
104.63817694
96.76814836
91.6779221
101.79132774
106.89005002
106.57364584
102.26648279
99.80873105
101.63973121
106.46476468
96.10020311
94.57421727
100.80409326
94.11176915
99.62387832
104.51786419


25. Iterate over the whole dataset with indices matching the position in the dataset

In [106]:
# Take the single block returned by the nested split
subset = np.array_split(np.array_split(dataset, 3), 1, axis=2)[0]
# Flatten it to one dimension in row-major order
con = subset.ravel()
for i in range(con.size):
    print(con[i])

99.14931546
104.03852715
107.43534677
92.02628776
97.10439252
99.32066924
95.66253664
95.17750125
90.93318132
91.37294597
100.96781394
100.40118279
101.20862522
103.5730309
100.28690912
102.80387079
98.29687616
93.24376389
106.71751618
102.97585605
98.45723272
96.02548256
102.82360856
106.47551845
105.30350449
92.87730812
103.19258339
110.44484313
93.87155456
101.5363647
101.3514185
100.37372248
106.6471081
97.21315663
107.02874163
102.17642112
95.65982034
107.22482426
107.19119932
100.39303522
92.0108226
97.75887636
103.1521596
109.40523174
93.83969256
106.11454989
88.80221141
94.5081787
96.78266211
99.84251605
104.03478031
101.86186193
103.61720152
99.57859892
97.49594839
96.59385486
104.63817694
96.76814836
91.6779221
101.79132774
106.89005002
106.57364584
102.26648279
99.80873105
101.63973121
106.46476468
96.10020311
94.57421727
100.80409326
94.11176915
99.62387832
104.51786419
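
To get each value together with its position, NumPy provides `np.ndenumerate`, which yields `(index, value)` pairs where the index is a tuple with one entry per axis. A minimal sketch on a made-up 2x2 array:

```python
import numpy as np

# Made-up 2x2 array used only for illustration
demo = np.array([[10.0, 20.0],
                 [30.0, 40.0]])

# Each step yields the index tuple and the value stored at that position
for index, value in np.ndenumerate(demo):
    print(index, value)

# The same pairs collected into a list
pairs = list(np.ndenumerate(demo))
```

The index tuples can be used directly for lookups, for example `demo[index]`.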
