# Mean Normalization

In machine learning we use large amounts of data to train our models. Some machine learning algorithms may require that the data is *normalized* in order to work correctly. The idea of normalization, also known as *feature scaling*, is to ensure that all the data is on a similar scale, *i.e.* that all the data takes on a similar range of values. For example, we might have a dataset that has values between 0 and 5,000. By normalizing the data we can make the range of values be between 0 and 1.
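
As a rough illustration of the 0-to-1 scaling described above, the sketch below applies min-max scaling to a small made-up array. The array `data` and the name `data_scaled` are assumptions chosen for this example; they are not part of the lab.

```python
import numpy as np

# Hypothetical sample values in the range 0 to 5,000 (illustration only)
data = np.array([0.0, 1250.0, 2500.0, 5000.0])

# Min-max scaling: shift by the minimum, then divide by the range
data_scaled = (data - data.min()) / (data.max() - data.min())

print(data_scaled)
```

After scaling, the smallest value maps to 0 and the largest maps to 1.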

In this lab, you will perform a different kind of feature scaling known as *mean normalization*. Mean normalization scales the data, but instead of mapping the values to the interval between 0 and 1, it centers them in a small interval around zero. For example, if we have a dataset with values between 0 and 5,000, after mean normalization the values will lie in some small range around 0, for example between -3 and 3. Because the mean of each column is subtracted from its values, the average (mean) of all elements will be zero, up to floating-point rounding. Therefore, when you perform *mean normalization* your data will not only be scaled but will also have an average of zero.

# To Do:

You will start by importing NumPy and creating a rank 2 ndarray of random integers between 0 and 5,000 (inclusive) with 1,000 rows and 20 columns. This array will simulate a dataset with a wide range of values. Fill in the code below:

In [2]:
# import NumPy into Python
import numpy as np

# Create a 1000 x 20 ndarray with random integers in the half-open interval [0, 5001).
X = np.random.randint(0, 5001, (1000, 20))

# Print X and its shape
print(X)
print(X.shape)


[[2966 3214 4527 ..., 3728 1078 4880]
 [4194 1322 3140 ..., 4127 2185 4399]
 [3276 3382 3741 ...,   25 3570 4896]
 ..., 
 [ 659 2093 3901 ..., 2732 1696 1927]
 [ 795  386 1141 ..., 1063 2414 3161]
 [2655 2243 1526 ..., 2098 1878 1392]]
(1000, 20)
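
`np.random.randint` excludes its upper bound, which is why the cell above passes 5001 to include 5,000. As an alternative sketch, NumPy's newer `Generator` API can make the upper bound inclusive with `endpoint=True`. The seed and the name `X_alt` are assumptions made for this example only.

```python
import numpy as np

# Seeded generator so the sketch is reproducible (the seed is arbitrary)
rng = np.random.default_rng(0)

# endpoint=True makes the upper bound inclusive, so 5000 itself can be drawn
X_alt = rng.integers(0, 5000, size=(1000, 20), endpoint=True)

print(X_alt.shape)
print(X_alt.min() >= 0, X_alt.max() <= 5000)
```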


Now that you have created the array, we will mean normalize it. We will perform mean normalization using the following equation:

$\mbox{Norm_Col}_i = \frac{\mbox{Col}_i - \mu_i}{\sigma_i}$

where $\mbox{Col}_i$ is the $i$th column of $X$, $\mu_i$ is the average of the values in the $i$th column of $X$, and $\sigma_i$ is the standard deviation of the values in the $i$th column of $X$. In other words, mean normalization is performed by subtracting from each column of $X$ the average of its values, and then dividing by the standard deviation of its values. In the space below, you will first calculate the average and standard deviation of each column of $X$.

In [3]:
# Average of the values in each column of X
ave_cols = np.mean(X, 0)
print(ave_cols)


# Standard Deviation of the values in each column of X
std_cols = np.std(X, 0)
print(std_cols)

# Calculating std of each column without using the built-in functions
test_var_cols = (X - ave_cols)**2
print(test_var_cols.shape)
test_std_cols = np.sqrt(test_var_cols.sum(axis = 0)/np.size(X, 0))
print(test_std_cols.shape)


print(std_cols)
print(test_std_cols)


[ 2598.032  2434.8    2506.782  2508.648  2534.379  2502.676  2464.7
  2475.877  2462.315  2527.378  2503.342  2522.908  2535.202  2492.342
  2526.258  2586.509  2564.443  2516.672  2459.702  2543.8  ]
[ 1410.43493823  1404.48729008  1455.58162137  1445.65279099  1473.17773923
  1403.29053906  1433.2068469   1459.43735044  1421.88968763  1480.24260347
  1440.42310001  1464.31678661  1414.31948343  1420.93793567  1460.2216008
  1465.4320523   1453.88281741  1433.11013199  1471.94522017  1429.0728225 ]
(1000, 20)
(20,)
[ 1410.43493823  1404.48729008  1455.58162137  1445.65279099  1473.17773923
  1403.29053906  1433.2068469   1459.43735044  1421.88968763  1480.24260347
  1440.42310001  1464.31678661  1414.31948343  1420.93793567  1460.2216008
  1465.4320523   1453.88281741  1433.11013199  1471.94522017  1429.0728225 ]
[ 1410.43493823  1404.48729008  1455.58162137  1445.65279099  1473.17773923
  1403.29053906  1433.2068469   1459.43735044  1421.88968763  1480.24260347
  1440.42310001  1464

If you have done the above calculations correctly, then `ave_cols` and `std_cols` should both be vectors with shape `(20,)` since $X$ has 20 columns. You can verify this by filling in the code below:

In [4]:
# Print the shape of ave_cols
print(ave_cols.shape)

# Print the shape of std_cols
print(std_cols.shape)

(20,)
(20,)
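
By default `np.std` computes the population standard deviation (it divides by $N$, that is, `ddof=0`), which is why it matches the manual calculation in the earlier cell that divides by `np.size(X, 0)`. The sketch below checks this on a stand-in array; `X_demo`, the seed, and the other names are assumptions introduced only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(1)                    # arbitrary seed
X_demo = rng.integers(0, 5001, size=(1000, 20))   # stand-in for X

mean_demo = X_demo.mean(axis=0)
n_rows = X_demo.shape[0]

# Manual population standard deviation: divide the summed squares by N
manual_std = np.sqrt(((X_demo - mean_demo) ** 2).sum(axis=0) / n_rows)

# NumPy's default (ddof=0) should agree with the manual result
print(np.allclose(manual_std, np.std(X_demo, axis=0)))

# The sample standard deviation (ddof=1) divides by N - 1, so it is slightly larger
sample_std = np.std(X_demo, axis=0, ddof=1)
print(np.all(sample_std > manual_std))
```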


You can now take advantage of Broadcasting to calculate the mean normalized version of $X$ in just one line of code using the equation above. Fill in the code below:

In [11]:
# Mean normalize X
X_norm = (X - ave_cols) / std_cols
print(X_norm)
print(X_norm.shape)

[[ 0.26088974  0.5547932   1.38791118 ...,  0.84524418 -0.93869118
   1.63476624]
 [ 1.13154316 -0.7923176   0.43502748 ...,  1.12365963 -0.18662515
   1.29818437]
 [ 0.48068009  0.67440981  0.84792085 ..., -1.73864656  0.7543066
   1.64596231]
 ..., 
 [-1.37477593 -0.24336283  0.95784254 ...,  0.15025223 -0.5188386
  -0.43160852]
 [-1.27835177 -1.45875297 -0.93830671 ..., -1.01434772 -0.03104871
   0.43188842]
 [ 0.04039038 -0.13656229 -0.67380763 ..., -0.29214224 -0.3951927
  -0.80597712]]
(1000, 20)
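
To make the broadcasting step more concrete, here is a minimal sketch on a small made-up array. It compares the one-line broadcast expression with an explicit version that tiles the column vectors to the full shape. The array `A` and all other names are hypothetical and used only for illustration.

```python
import numpy as np

# Small hypothetical array: 3 rows, 2 columns
A = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0]])

col_mean = A.mean(axis=0)   # shape (2,)
col_std = A.std(axis=0)     # shape (2,)

# Broadcasting: the (2,) vectors are stretched across all 3 rows
A_norm = (A - col_mean) / col_std

# Equivalent explicit version that tiles the vectors to shape (3, 2)
A_norm_explicit = (A - np.tile(col_mean, (3, 1))) / np.tile(col_std, (3, 1))

print(A_norm.shape)
print(np.allclose(A_norm, A_norm_explicit))
```

Broadcasting gives the same result without materializing the tiled copies.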


If you have performed the mean normalization correctly, then the average of all the elements in $X_{\tiny{\mbox{norm}}}$ should be close to zero, and the values should be evenly distributed in some small interval around zero. You can verify this by filling in the code below:

In [12]:
# Print the average of all the values of X_norm
print("Average of all values of X_norm is: ", np.mean(X_norm))

# Print the minimum value in each column of X_norm
print("\nMinimum value in each column of X_norm is:\n\n", np.min(X_norm, axis = 0))

# Print the maximum value in each column of X_norm
print("\nMaximum value in each column of X_norm is:\n\n", np.max(X_norm, axis = 0))


Average of all values of X_norm is:  -4.19220214098e-17

Minimum value in each column of X_norm is:

 [-1.83704468 -1.73287435 -1.71256765 -1.73461291 -1.71966962 -1.77844568
 -1.71901216 -1.69097838 -1.72539053 -1.70605683 -1.73653283 -1.722925
 -1.79252427 -1.75190058 -1.71703939 -1.7575083  -1.76248248 -1.75609114
 -1.66969665 -1.77723623]

Maximum value in each column of X_norm is:

 [ 1.70228908  1.82571962  1.71218018  1.72126531  1.66824473  1.77819484
  1.75920175  1.72198073  1.77769416  1.66906559  1.73050404  1.69163669
  1.74133075  1.74578913  1.69203222  1.64558363  1.67245735  1.73212647
  1.72581015  1.70963996]


You should note that since $X$ was created using random integers, the above values will vary. 
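
A complementary check is to look at each column separately: after mean normalization every column should have a mean of about 0 and a standard deviation of about 1. For integers drawn uniformly from 0 to 5,000 the standard deviation is roughly $5000/\sqrt{12} \approx 1443$, so the normalized values are expected to stay within about $\pm\sqrt{3} \approx \pm 1.73$, which roughly agrees with the minima and maxima printed above. The sketch below uses a stand-in array; `X_demo`, `X_demo_norm`, and the seed are assumptions for this example.

```python
import numpy as np

rng = np.random.default_rng(2)                    # arbitrary seed
X_demo = rng.integers(0, 5001, size=(1000, 20))   # stand-in for X
X_demo_norm = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

# Each column should have mean ~0 and standard deviation ~1
print(np.allclose(X_demo_norm.mean(axis=0), 0.0))
print(np.allclose(X_demo_norm.std(axis=0), 1.0))
```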

# Data Separation

After the data has been mean normalized, it is customary in machine learning to split our dataset into three sets:

1. A Training Set
2. A Cross Validation Set
3. A Test Set

The dataset is usually divided such that the Training Set contains 60% of the data, the Cross Validation Set contains 20% of the data, and the Test Set contains 20% of the data. 

In this part of the lab you will separate `X_norm` into a Training Set, Cross Validation Set, and a Test Set. Each data set will contain rows of `X_norm` chosen at random, making sure that we don't pick the same row twice. This will guarantee that all the rows of `X_norm` are chosen and randomly distributed among the three new sets.

You will start by creating a rank 1 ndarray that contains a random permutation of the row indices of `X_norm`. You can do this by using the `np.random.permutation()` function. The `np.random.permutation(N)` function creates a random permutation of integers from 0 to `N - 1`. Let's see an example:

In [20]:
# We create a random permutation of integers 0 to 3
print(np.random.permutation(4))

[2 1 3 0]
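
The sketch below illustrates the property that matters for the split: a permutation contains every integer from 0 to `N - 1` exactly once, so no row index can be picked twice. It uses NumPy's `Generator` API with an arbitrary seed so the result is reproducible; the names `rng` and `perm` are assumptions for this example.

```python
import numpy as np

rng = np.random.default_rng(3)  # arbitrary seed
perm = rng.permutation(4)

print(perm)

# Sorting a permutation of 0..3 recovers [0 1 2 3], so every index appears once
print(np.array_equal(np.sort(perm), np.arange(4)))
```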


# To Do

In the space below, create a rank 1 ndarray that contains a random permutation of the row indices of `X_norm`. You can do this in one line of code by extracting the number of rows of `X_norm` using the `shape` attribute and then passing it to the `np.random.permutation()` function. Remember that the `shape` attribute returns a tuple with two numbers in the form `(rows,columns)`.

In [24]:
# Create a rank 1 ndarray that contains a random permutation of the row indices of `X_norm`
row_indices = np.random.permutation(X_norm.shape[0])
print(row_indices)

[221 795 125 660 226 393  18 831 765 729 561 932 473 413 688 726 818 495
 725 627 958 617 779 528  40   0  59  14 346 500 423 131 638 946 123 680
 338 107 984 166 153 991 168 364  97 515 333 944 672 649 564 199  76 821
  56 728 462 334 629 710 784 863 523 896 456 219 650 195 707 611 349 792
 269 622 912 143 183  23 937 492 922 845 738 297 190 664 979 982 897 458
 115 363 978 435 708 196  24  19 488 776  55 111 949 552 625 573 527 141
 400 409 130 628 626 245 762 754 534 231 449  89 700 877 158 933 309 197
 960 351 833 235 544 401 116 467 211 732 670 129 229 684 493 137 160 685
 271  93 641 696 117 653 540 669 526 175 580 993  58 433  25 279 159 683
 882 579 938 862 519 760 308 714 963 213  22 427  52 555 899 794 135 667
 915 328 292 781 832 743 498 973 652 212 172 698 706 624  83  82   8 266
 136   3 890 596 908 891 246 441 412 100 919 876 799 395 479 797 414 695
 787 502 945 847 640 553 609 578 494 391 264 284 234 850 359 769 403 969
 619 620 804 282 868 202 737 167 970 113 312 201  7

Now you can create the three datasets using the `row_indices` ndarray to select the rows that will go into each dataset. Remember that the Training Set contains 60% of the data, the Cross Validation Set contains 20% of the data, and the Test Set contains 20% of the data. Each set requires just one line of code to create. Fill in the code below:

In [51]:
# Make any necessary calculations.
# You can save your calculations into variables to use later.


# Create a Training Set (60% of data)
X_train = X_norm[row_indices[:600]]

# Create a Cross Validation Set (20% of data)
X_crossVal = X_norm[row_indices[600:800]]

# Create a Test Set (20% of data)
X_test = X_norm[row_indices[800:]]

print(X[row_indices[0:2]], "\n")
print(X[[221, 795]])

[[4425 1762   25 3317 4899 1070 2323 4640 2866 1896 4117 1520 2621 2541
  2186 4104 2591   79 3067 2827]
 [1363  580 2406 4628 4279 1481   21 4216  380  361  540  801 4939 1582
  2485 4906  281 1861 1356 3849]] 

[[4425 1762   25 3317 4899 1070 2323 4640 2866 1896 4117 1520 2621 2541
  2186 4104 2591   79 3067 2827]
 [1363  580 2406 4628 4279 1481   21 4216  380  361  540  801 4939 1582
  2485 4906  281 1861 1356 3849]]


If you performed the above calculations correctly, then `X_train` should have 600 rows and 20 columns, `X_crossVal` should have 200 rows and 20 columns, and `X_test` should have 200 rows and 20 columns. You can verify this by filling in the code below:

In [27]:
# Print the shape of X_train
print(X_train.shape)

# Print the shape of X_crossVal
print(X_crossVal.shape)

# Print the shape of X_test
print(X_test.shape)

(600, 20)
(200, 20)
(200, 20)
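
Instead of hard-coding 600 and 800, the split boundaries can be derived from the number of rows. The following is a minimal sketch under that assumption; `data` stands in for `X_norm`, and `train_end`, `cv_end`, and the other names are hypothetical. It also checks that the three index ranges are disjoint and together cover every row.

```python
import numpy as np

rng = np.random.default_rng(4)             # arbitrary seed
data = rng.standard_normal((1000, 20))     # stand-in for X_norm

n_rows = data.shape[0]
idx = rng.permutation(n_rows)

# Integer arithmetic avoids floating-point rounding in the boundaries
train_end = n_rows * 60 // 100   # first 60% of the shuffled indices
cv_end = n_rows * 80 // 100      # next 20%

train_set = data[idx[:train_end]]
cv_set = data[idx[train_end:cv_end]]
test_set = data[idx[cv_end:]]

print(train_set.shape, cv_set.shape, test_set.shape)

# Every row index is used exactly once across the three sets
used = np.concatenate([idx[:train_end], idx[train_end:cv_end], idx[cv_end:]])
print(np.array_equal(np.sort(used), np.arange(n_rows)))
```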


# Playground

In [33]:
test = np.arange(1, 21).reshape((4,5))
print(test)

mean = np.mean(test, axis = 0)
print(mean)
print(np.size(test))

print(np.sqrt((mean - test)**2 / np.size(test)))

[[ 1  2  3  4  5]
 [ 6  7  8  9 10]
 [11 12 13 14 15]
 [16 17 18 19 20]]
[  8.5   9.5  10.5  11.5  12.5]
20
[[ 1.67705098  1.67705098  1.67705098  1.67705098  1.67705098]
 [ 0.55901699  0.55901699  0.55901699  0.55901699  0.55901699]
 [ 0.55901699  0.55901699  0.55901699  0.55901699  0.55901699]
 [ 1.67705098  1.67705098  1.67705098  1.67705098  1.67705098]]
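
The last expression in the cell above divides each squared deviation by the total number of elements (20) and never sums over the rows, so it returns a 4 x 5 array rather than one value per column. If the goal of this experiment was the per-column standard deviation (an assumption, since the playground does not say), a sketch of that calculation looks like this; `col_mean` and `col_std` are names introduced for this example.

```python
import numpy as np

test = np.arange(1, 21).reshape((4, 5))
col_mean = np.mean(test, axis=0)

# Sum the squared deviations down each column, divide by the number of rows,
# then take the square root
col_std = np.sqrt(((test - col_mean) ** 2).sum(axis=0) / test.shape[0])

print(col_std)
print(np.allclose(col_std, np.std(test, axis=0)))
```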


In [48]:
test_matrix = np.arange(1, 21).reshape((4,5))
print(test_matrix)

[[ 1  2  3  4  5]
 [ 6  7  8  9 10]
 [11 12 13 14 15]
 [16 17 18 19 20]]


In [50]:
add_matrix = np.arange(1,6)
print(add_matrix)

[1 2 3 4 5]


In [52]:
# Broadcasting an array by addition alongside the 2 Dimensional array
test_result = test_matrix + add_matrix
print(test_result)

[[ 2  4  6  8 10]
 [ 7  9 11 13 15]
 [12 14 16 18 20]
 [17 19 21 23 25]]


In [54]:
# Inserting an array in between the rows of another array
test_insert = np.insert(test_result, 2 , add_matrix, axis = 0)
print(test_insert)

[[ 2  4  6  8 10]
 [ 7  9 11 13 15]
 [ 1  2  3  4  5]
 [12 14 16 18 20]
 [17 19 21 23 25]]


In [79]:
# finding the maximum value of each column
test_insert_max_col = np.max(test_insert, axis =0)
print(test_insert_max_col)

[17 19 21 23 25]


In [80]:
# finding the minimum value of each column
test_insert_min_col = np.min(test_insert, axis =0)
print(test_insert_min_col)

[1 2 3 4 5]


In [61]:
# A Matrix with 3 Dimensions
test_matrx = np.array([[[1,2],[3,4],[5,6]],[[1,2],[3,4],[5,6]]])
print(test_matrx)

[[[1 2]
  [3 4]
  [5 6]]

 [[1 2]
  [3 4]
  [5 6]]]


In [62]:
# Broadcasting an array alongside the 3 dimensional matrix
test_matrx + [1,2]

array([[[2, 4],
        [4, 6],
        [6, 8]],

       [[2, 4],
        [4, 6],
        [6, 8]]])
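
The result above follows from how broadcasting aligns shapes from the trailing axis: the length-2 list is treated as shape `(1, 1, 2)` and repeated along the first two axes of the `(2, 3, 2)` array. The sketch below makes that expansion explicit with `np.broadcast_to`; the names `cube`, `vec`, and `expanded` are assumptions used only here.

```python
import numpy as np

cube = np.array([[[1, 2], [3, 4], [5, 6]],
                 [[1, 2], [3, 4], [5, 6]]])   # shape (2, 3, 2)
vec = np.array([1, 2])                         # shape (2,)

# Implicit broadcasting
result = cube + vec

# Explicit read-only view of vec expanded to the full shape (2, 3, 2)
expanded = np.broadcast_to(vec, cube.shape)

print(result.shape)
print(np.array_equal(result, cube + expanded))
```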