# Mean Normalization

In machine learning we use large amounts of data to train our models. Some machine learning algorithms require the data to be *normalized* in order to work correctly. The idea of normalization, also known as *feature scaling*, is to ensure that all the data is on a similar scale, *i.e.* that all the data takes on a similar range of values. For example, we might have a dataset that has values between 0 and 5,000. By normalizing the data we can rescale the values to lie between 0 and 1.
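
As a rough illustration of the 0-to-1 rescaling mentioned above, the sketch below applies min-max scaling to a small made-up array. The array `data` and the name `scaled` are hypothetical and are not part of the lab.

```python
import numpy as np

# Hypothetical sample values spanning the 0 to 5,000 range (illustrative only)
data = np.array([0, 1250, 2500, 5000], dtype=float)

# Min-max scaling: subtract the minimum, then divide by the range (max - min)
scaled = (data - data.min()) / (data.max() - data.min())

print(scaled)
```

With these inputs the smallest value maps to 0, the largest to 1, and the two values in between map to 0.25 and 0.5.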

In this lab, you will perform a different kind of feature scaling known as *mean normalization*. Mean normalization also scales the data, but instead of mapping the values to the range 0 to 1, it centers them in some small interval around zero. For example, if we have a dataset that has values between 0 and 5,000, after mean normalization the values will lie in some small range around 0, for example between -3 and 3. Because the mean of each column is subtracted from its values, the average (mean) of all elements will be zero. Therefore, when you perform *mean normalization*, your data will not only be scaled but will also have an average of zero.
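
To see why the average comes out as zero, write $x_1, \dots, x_n$ for the values in one column, $\mu$ for their mean, and $\sigma$ for their standard deviation (this notation is introduced here only for illustration). Subtracting the mean and dividing by the standard deviation, as this lab does, gives values whose average is:

$\frac{1}{n}\sum_{j=1}^{n}\frac{x_j - \mu}{\sigma} = \frac{1}{\sigma}\left(\frac{1}{n}\sum_{j=1}^{n} x_j - \mu\right) = \frac{1}{\sigma}\left(\mu - \mu\right) = 0$

In practice the computed average is usually a tiny nonzero number rather than exactly zero, because of floating-point rounding.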

# To Do

You will start by importing NumPy and creating a rank 2 ndarray of random integers between 0 and 5,000 (inclusive) with 1000 rows and 20 columns. This array will simulate a dataset with a wide range of values. Fill in the code below:

In [2]:
# import NumPy into Python
import numpy as np

# Create a 1000 x 20 ndarray with random integers in the half-open interval [0, 5001).
X = np.random.randint(0,5001,size=(1000,20))

# Print X and its shape
print(X)
print(X.shape)

[[4915 1504 1226 ..., 2448 3974 4716]
 [2481 2141 2944 ...,  779  448 1465]
 [2380 2646 3635 ..., 4414  170 4952]
 ..., 
 [2620 4439  574 ..., 2267 1682 3192]
 [4170 3252 1586 ..., 4648 2710 3398]
 [2489 2960 2583 ..., 4137 4579 4360]]
(1000, 20)
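
Because the array is random, the printed values differ from run to run. If reproducible values are wanted, one option (an assumption here, not something the lab asks for) is to use a seeded generator from `np.random.default_rng`. The seed value `0` and the name `X_demo` are arbitrary choices made for this sketch.

```python
import numpy as np

# Seeded generator, so the same "random" integers are produced on every run
rng = np.random.default_rng(0)

# endpoint=True makes the upper bound inclusive, so values lie in [0, 5000]
X_demo = rng.integers(0, 5000, size=(1000, 20), endpoint=True)

print(X_demo.shape)
print(X_demo.min() >= 0, X_demo.max() <= 5000)
```

Re-running this cell with the same seed should reproduce the same array, which makes later results easier to compare.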


Now that you have created the array, we will mean normalize it. We will perform mean normalization using the following equation:

$\mbox{Norm_Col}_i = \frac{\mbox{Col}_i - \mu_i}{\sigma_i}$

where $\mbox{Col}_i$ is the $i$th column of $X$, $\mu_i$ is the average of the values in the $i$th column of $X$, and $\sigma_i$ is the standard deviation of the values in the $i$th column of $X$. In other words, mean normalization is performed by subtracting from each column of $X$ the average of its values, and then dividing by the standard deviation of its values. In the space below, you will first calculate the average and standard deviation of each column of $X$.

In [5]:
# Average of the values in each column of X
ave_cols = np.mean(X, axis=0)

# Standard Deviation of the values in each column of X
std_cols = np.std(X, axis=0)
print(ave_cols, std_cols)

[ 2502.33   2492.366  2530.654  2514.398  2472.911  2518.507  2575.853
  2495.425  2443.828  2483.813  2472.913  2529.937  2596.869  2394.429
  2504.687  2463.162  2516.863  2498.975  2493.712  2474.165] [ 1473.71098561  1463.31645793  1417.70371738  1448.48140533  1489.15771733
  1425.51605952  1441.57013891  1445.8612535   1438.49391393  1438.1654446
  1423.55849315  1441.16761587  1439.7371329   1390.43443749  1408.20041082
  1468.19307441  1438.24147911  1447.27326044  1436.08467057  1431.36219797]


If you have done the above calculations correctly, then `ave_cols` and `std_cols` should both be vectors with shape `(20,)`, since $X$ has 20 columns. You can verify this by filling in the code below:

In [6]:
# Print the shape of ave_cols
print(ave_cols.shape)

# Print the shape of std_cols
print(std_cols.shape)

(20,)
(20,)
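
The shape `(20,)` follows from the `axis=0` argument, which reduces over the rows and leaves one value per column. The sketch below shows the difference between `axis=0` and `axis=1` on a tiny array; the array `A` and the names `col_means` and `row_means` are hypothetical and chosen only for illustration.

```python
import numpy as np

# Hypothetical 2 x 3 array (2 rows, 3 columns), for illustration only
A = np.array([[1, 2, 3],
              [5, 6, 7]])

# axis=0 collapses the rows, leaving one mean per column: 3, 4, 5
col_means = np.mean(A, axis=0)
print(col_means, col_means.shape)

# axis=1 collapses the columns, leaving one mean per row: 2, 6
row_means = np.mean(A, axis=1)
print(row_means, row_means.shape)
```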


You can now take advantage of Broadcasting to calculate the mean normalized version of $X$ in just one line of code using the equation above. Fill in the code below:

In [7]:
# Mean normalize X
X_norm = (X-ave_cols)/std_cols
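
The one-liner works because NumPy broadcasts a vector with one entry per column across every row of a two-dimensional array. The sketch below, on a small hypothetical array `A` (not part of the lab), compares the broadcast expression with an explicit loop over columns.

```python
import numpy as np

# Hypothetical 3 x 2 array: 3 rows (samples), 2 columns (features)
A = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0]])

col_mean = A.mean(axis=0)   # shape (2,)
col_std = A.std(axis=0)     # shape (2,)

# Broadcasting stretches the (2,) vectors across all 3 rows of the (3, 2) array
A_norm = (A - col_mean) / col_std

# Equivalent explicit loop over columns, shown only for comparison
A_loop = np.empty_like(A)
for i in range(A.shape[1]):
    A_loop[:, i] = (A[:, i] - col_mean[i]) / col_std[i]

print(np.allclose(A_norm, A_loop))
```

Both approaches should agree; the broadcast version is shorter and avoids the Python-level loop.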

If you have performed the mean normalization correctly, then the average of all the elements in $X_{\tiny{\mbox{norm}}}$ should be close to zero, and the values should be evenly distributed in some small interval around zero. You can verify this by filling in the code below:

In [11]:
# Print the average of all the values of X_norm
print(np.mean(X_norm))

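# Print the average of the minimum value in each row of X_norm
# (axis=1 reduces across the columns of each row; use axis=0 for per-column minima)
print(np.mean(np.min(X_norm, axis=1)))

# Print the average of the maximum value in each row of X_norm
# (axis=1 reduces across the columns of each row; use axis=0 for per-column maxima)
print(np.mean(np.max(X_norm, axis=1)))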

-2.13162820728e-18
-1.56984657834
1.57582744683


Note that since $X$ was created using random integers, the above values will vary from run to run.
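
A per-column check can be sketched as follows: after subtracting each column's mean and dividing by its standard deviation, every column should have a mean near 0 and a standard deviation near 1. The seeded generator and the names `X_demo` and `X_demo_norm` are assumptions made for this example, so it does not reuse the lab's variables.

```python
import numpy as np

# Hypothetical stand-in for the lab's dataset, seeded for repeatability
rng = np.random.default_rng(1)
X_demo = rng.integers(0, 5001, size=(1000, 20))
X_demo_norm = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

# axis=0 reduces over the rows, so each result has one entry per column
col_means = X_demo_norm.mean(axis=0)
col_stds = X_demo_norm.std(axis=0)
col_mins = X_demo_norm.min(axis=0)
col_maxs = X_demo_norm.max(axis=0)

print(np.allclose(col_means, 0.0))   # every column mean is approximately 0
print(np.allclose(col_stds, 1.0))    # every column standard deviation is approximately 1
print(col_mins.mean(), col_maxs.mean())
```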

# Data Separation

After the data has been mean normalized, it is customary in machine learning to split our dataset into three sets:

1. A Training Set
2. A Cross Validation Set
3. A Test Set

The dataset is usually divided such that the Training Set contains 60% of the data, the Cross Validation Set contains 20% of the data, and the Test Set contains 20% of the data. 

In this part of the lab you will separate `X_norm` into a Training Set, Cross Validation Set, and a Test Set. Each data set will contain rows of `X_norm` chosen at random, making sure that we don't pick the same row twice. This will guarantee that all the rows of `X_norm` are chosen and randomly distributed among the three new sets.

You will start by creating a rank 1 ndarray that contains a random permutation of the row indices of `X_norm`. You can do this by using the `np.random.permutation()` function. The `np.random.permutation(N)` function creates a random permutation of integers from 0 to `N - 1`. Let's see an example:

In [12]:
# We create a random permutation of integers 0 to 4
np.random.permutation(5)

array([1, 3, 0, 2, 4])
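
The property that matters for splitting data is that a permutation contains every index exactly once. A minimal check, using the hypothetical names `N` and `perm`:

```python
import numpy as np

N = 5
perm = np.random.permutation(N)

# Sorting the permutation recovers 0, 1, ..., N - 1,
# so no index is repeated and none is missing
print(np.array_equal(np.sort(perm), np.arange(N)))
```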

# To Do

In the space below, create a rank 1 ndarray that contains a random permutation of the row indices of `X_norm`. You can do this in one line of code by extracting the number of rows of `X_norm` using the `shape` attribute and then passing it to the `np.random.permutation()` function. Remember that the `shape` attribute returns a tuple with two numbers in the form `(rows, columns)`.

In [15]:
# Create a rank 1 ndarray that contains a random permutation of the row indices of `X_norm`
row_indices = np.random.permutation(X_norm.shape[0])
print(row_indices)
print(row_indices.shape)

[623 754 115  17 134 348  79  37  31 718 443 322 602 403  53 885 350 280
 515 914 812  64 958 877 408 619 462 820 492 494 924 845  10 665 158   4
 226 228 823 640 670 526 575 345 156 880 934 316 457 356 706 534 154 868
 284 487 988 601 647 883 184 180 465 618 762  67 748 516 903 625  72 986
 712 260 904 249 755 702 816 306 925 663  97 181 198 668 622 414 634 689
 204 704 827  19 957 997 496 477 691 522 841 927 278   6 248 344  48   8
 331  34 206 520 337 342 858 654 855 785 295 757 639 155  36  65 556 804
 167 774 666 401 797 255 191 759 693 294  58 847 286 469 418 731 937 696
 882 859 392 976 865 452 326 621 945 233 850 400 483 570  32 728 786 690
 864 323 735 662 783 935 840 542 547 519 119 289 320 441 787 973 843 614
 521 814 455 984  40 493 900  15 312 775  94  27 135 948 829 788 523 643
 235  63 215 671 821 268 738 415 389 830 272  20 197 103 178 908 800 657
 174 202 677 231 377 959 834 846 806 980  39 692 540 798 475 624   5 114
 338 913 529 486  11 187 444 343 713 767 559 381 70

Now you can create the three datasets using the `row_indices` ndarray to select the rows that will go into each dataset. Remember that the Training Set contains 60% of the data, the Cross Validation Set contains 20% of the data, and the Test Set contains 20% of the data. Each set requires just one line of code to create. Fill in the code below:

In [16]:
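# Make any necessary calculations.
# You can save your calculations into variables to use later.
n_rows = X_norm.shape[0]
train_end = (6 * n_rows) // 10   # the first 60% of the shuffled rows end here
val_end = (8 * n_rows) // 10     # the next 20% of the shuffled rows end here

# Create a Training Set
X_train = X_norm[row_indices[:train_end]]

# Create a Cross Validation Set
X_crossVal = X_norm[row_indices[train_end:val_end]]

# Create a Test Set
X_test = X_norm[row_indices[val_end:]]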

If you performed the above calculations correctly, then `X_train` should have 600 rows and 20 columns, `X_crossVal` should have 200 rows and 20 columns, and `X_test` should have 200 rows and 20 columns. You can verify this by filling in the code below:

In [17]:
# Print the shape of X_train
print(X_train.shape)

# Print the shape of X_crossVal
print(X_crossVal.shape)

# Print the shape of X_test
print(X_test.shape)

(600, 20)
(200, 20)
(200, 20)
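
The same 60/20/20 split can be wrapped in a small helper that also confirms no row is used twice. This is only a sketch under stated assumptions: the function `split_rows`, its parameters, and the stand-in array `demo` are hypothetical and not part of the lab.

```python
import numpy as np

def split_rows(data, train_frac=0.6, val_frac=0.2, seed=None):
    """Return shuffled row indices for train / cross validation / test sets."""
    rng = np.random.default_rng(seed)
    n_rows = data.shape[0]
    indices = rng.permutation(n_rows)

    # Convert the fractions into slice boundaries
    train_end = int(round(train_frac * n_rows))
    val_end = train_end + int(round(val_frac * n_rows))

    return indices[:train_end], indices[train_end:val_end], indices[val_end:]

# Hypothetical stand-in for a normalized dataset: 50 rows, 4 columns
demo = np.arange(50 * 4).reshape(50, 4)
train_idx, val_idx, test_idx = split_rows(demo, seed=0)

# Sizes of the three index sets
print(len(train_idx), len(val_idx), len(test_idx))

# Every row index appears exactly once across the three sets
all_idx = np.concatenate([train_idx, val_idx, test_idx])
print(np.array_equal(np.sort(all_idx), np.arange(demo.shape[0])))
```

Indexing the data with each index array, for example `demo[train_idx]`, then yields the corresponding subset of rows.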
