# Mean Normalization

In machine learning we use large amounts of data to train our models. Some machine learning algorithms may require that the data is *normalized* in order to work correctly. The idea of normalization, also known as *feature scaling*, is to ensure that all the data is on a similar scale, *i.e.* that all the data takes on a similar range of values. For example, we might have a dataset that has values between 0 and 5,000. By normalizing the data we can make the range of values be between 0 and 1.
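
As a rough illustration of the 0-to-1 scaling described above, the following sketch applies min-max scaling to a small toy array. The array `data` and its values are hypothetical and chosen only for this example; they are not part of the lab.

```python
import numpy as np

# Hypothetical toy data with values between 0 and 5,000
data = np.array([0, 1250, 2500, 5000], dtype=float)

# Min-max scaling: subtract the minimum, then divide by the range
scaled = (data - data.min()) / (data.max() - data.min())

# The smallest value maps to 0 and the largest maps to 1
print(scaled)
```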

In this lab, you will perform a different kind of feature scaling known as *mean normalization*. Instead of mapping the values into the range 0 to 1, mean normalization centers each column at zero and rescales it by its spread, so the values end up in a small interval around zero. For example, if we have a dataset with values between 0 and 5,000, after mean normalization the values will lie in a small range around 0, for example between -3 and 3. Because the mean of each column is subtracted from that column, the average (mean) of the normalized values is zero, up to floating-point rounding. Therefore, when you perform *mean normalization* your data will not only be scaled but will also have an average of zero.

# To Do:

You will start by importing NumPy and creating a rank 2 ndarray of random integers between 0 and 5,000 (inclusive) with 1,000 rows and 20 columns. This array will simulate a dataset with a wide range of values. Fill in the code below.

In [1]:
# Import NumPy into Python
import numpy as np

# Create a 1000 x 20 ndarray with random integers in the half-open interval [0, 5001)
X = np.random.randint(0, 5001, size=(1000, 20))

# Print X and its shape
print("--> X =\n", X)
print()
print("--> X size = ", X.shape)

--> X =
 [[ 614  158 1958 ... 3918 1845 3335]
 [3125 3500 4437 ... 4458 2264 2139]
 [2685 4390 2462 ... 3362  827 1106]
 ...
 [ 270 2599 3352 ... 1658 1845 3297]
 [3172 3753 4784 ... 2839  266 3548]
 [4786 1961 2273 ...  530 2344 3839]]

--> X size =  (1000, 20)


Now that you have created the array, we will mean normalize it using the following equation:

$\mbox{Norm_Col}_i = \frac{\mbox{Col}_i - \mu_i}{\sigma_i}$

where $\mbox{Col}_i$ is the $i$th column of $X$, $\mu_i$ is the average of the values in the $i$th column of $X$, and $\sigma_i$ is the standard deviation of the values in the $i$th column of $X$. In other words, mean normalization is performed by subtracting from each column of $X$ the average of its values, and then dividing by the standard deviation of its values. In the space below, you will first calculate the average and standard deviation of each column of $X$.

In [2]:
# Average of the values in each column of X
ave_cols = X.mean(axis = 0)
# Standard Deviation of the values in each column of X
std_cols = X.std(axis = 0)

If you have done the above calculations correctly, then `ave_cols` and `std_cols` should both be vectors with shape `(20,)`, since $X$ has 20 columns. You can verify this by filling in the code below:

In [10]:
# Print the shape of ave_cols
print("--> Average Of Columns = \n",ave_cols)
print()
print("--> Average Of Columns Size = ",ave_cols.shape)
print()
# Print the shape of std_cols
print("--> Standard Deviation Of Columns = \n",std_cols)
print()
print("--> Standard Deviation Of Columns Size = ",std_cols.shape)

--> Average Of Columns = 
 [2529.534 2502.067 2407.71  2462.735 2457.845 2513.301 2500.081 2499.189
 2526.756 2529.45  2528.428 2520.346 2521.432 2490.985 2525.904 2535.439
 2561.252 2522.209 2435.77  2511.083]

--> Average Of Columns Size =  (20,)

--> Standard Deviation Of Columns = 
 [1419.21916871 1450.04430295 1447.50003589 1432.63361428 1399.78505813
 1445.12478022 1455.0272968  1441.81360074 1475.00944691 1457.94690421
 1430.89778489 1470.4970698  1449.66524666 1482.53677957 1464.8819764
 1453.30979983 1427.80034406 1439.68382408 1441.27156397 1444.19034691]

--> Standard Deviation Of Columns Size =  (20,)
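
To see why `axis = 0` yields one value per column, here is a minimal sketch on a small array. The array `A` is a hypothetical example introduced only for illustration.

```python
import numpy as np

# Hypothetical 2 x 3 array: 2 rows, 3 columns
A = np.array([[1, 2, 3],
              [5, 6, 7]])

# axis=0 collapses the rows, leaving one mean per column
col_means = A.mean(axis=0)

# axis=1 collapses the columns, leaving one mean per row
row_means = A.mean(axis=1)

print(col_means, col_means.shape)
print(row_means, row_means.shape)
```

With 3 columns, `col_means` has shape `(3,)`, in the same way that `ave_cols` has shape `(20,)` for the 20 columns of `X`.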


You can now take advantage of broadcasting to calculate the mean normalized version of $X$ in just one line of code using the equation above. Fill in the code below.

In [11]:
# Mean normalize X
X_norm = (X - ave_cols) / std_cols
print("--> X normalized = \n", X_norm)

--> X normalized = 
 [[-1.34970979 -1.61654854 -0.31068048 ...  0.96951218 -0.40989499
   0.57050444]
 [ 0.41957297  0.68820863  1.40192743 ...  1.34459453 -0.11917948
  -0.25764125]
 [ 0.10954333  1.30198298  0.03750604 ...  0.58331627 -1.11621574
  -0.97292092]
 ...
 [-1.59209659  0.0668483   0.65235922 ... -0.60027694 -0.40989499
   0.54419212]
 [ 0.45268977  0.86268606  1.64165108 ...  0.22004206 -1.50545536
   0.71799192]
 [ 1.58993484 -0.37313825 -0.0930639  ... -1.3837823  -0.06367294
   0.9194889 ]]
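
The one-line normalization works because NumPy broadcasts a vector of shape `(columns,)` across every row of a `(rows, columns)` array. The sketch below shows the same pattern on a small array; the names `A`, `mu`, `sigma`, and `A_norm` are hypothetical and used only for this example.

```python
import numpy as np

# Hypothetical 3 x 2 array whose columns are on very different scales
A = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0]])

mu = A.mean(axis=0)      # shape (2,): one mean per column
sigma = A.std(axis=0)    # shape (2,): one standard deviation per column

# mu and sigma are broadcast across the 3 rows of A
A_norm = (A - mu) / sigma

print(A_norm.shape)
print(A_norm)
```

Although the two columns start on different scales, they end up with identical normalized values, since the second column is just ten times the first.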


If you have performed the mean normalization correctly, then the average of all the elements in $X_{\tiny{\mbox{norm}}}$ should be close to zero, and they should be evenly distributed in some small interval around zero. You can verify this by filling in the code below:

In [5]:
# Print the average of all the values of X_norm
print("--> Average of all elements of X_normalized = ",X_norm.mean())
print()
# Print the average of the minimum value in each column of X_norm
print("--> Average of MINIMUM elements of X_normalized = ",X_norm.min(axis = 0).mean())
print()
# Print the average of the maximum value in each column of X_norm
print("--> Average of MAXIMUM elements of X_normalized = ",X_norm.max(axis = 0).mean())
print()

--> Average of all elements of X_normalized =  3.0908609005564356e-17

--> Average of MINIMUM elements of X_normalized =  -1.728702714344412

--> Average of MAXIMUM elements of X_normalized =  1.722622934326203



You should note that since $X$ was created using random integers, the above values will vary. 
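
A stricter check is possible column by column: subtracting $\mu_i$ shifts each column so that its mean is zero, and dividing by $\sigma_i$ rescales it to a standard deviation of one. The sketch below verifies this on freshly generated data. The names `X_demo` and `X_demo_norm` are hypothetical stand-ins, so the snippet runs on its own without relying on the cells above.

```python
import numpy as np

# Hypothetical stand-in for X: random integers in [0, 5001)
X_demo = np.random.randint(0, 5001, size=(1000, 20))

# Same normalization as in the lab, applied per column
X_demo_norm = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

# Each column should have mean ~0 and standard deviation ~1,
# up to floating-point rounding
print(np.allclose(X_demo_norm.mean(axis=0), 0.0))
print(np.allclose(X_demo_norm.std(axis=0), 1.0))
```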

# Data Separation

After the data has been mean normalized, it is customary in machine learning to split our dataset into three sets:

1. A Training Set
2. A Cross Validation Set
3. A Test Set

The dataset is usually divided such that the Training Set contains 60% of the data, the Cross Validation Set contains 20% of the data, and the Test Set contains 20% of the data. 

In this part of the lab you will separate `X_norm` into a Training Set, Cross Validation Set, and a Test Set. Each data set will contain rows of `X_norm` chosen at random, making sure that we don't pick the same row twice. This will guarantee that all the rows of `X_norm` are chosen and randomly distributed among the three new sets.

You will start by creating a rank 1 ndarray that contains a random permutation of the row indices of `X_norm`. You can do this by using the `np.random.permutation()` function. The `np.random.permutation(N)` function creates a random permutation of integers from 0 to `N - 1`. Let's see an example:

In [6]:
# We create a random permutation of integers 0 to 4
np.random.permutation(5)

array([0, 2, 4, 1, 3])
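
A permutation like this can be used directly as a row index to shuffle an array without repeating or dropping any row. The following is a minimal sketch on a small array; `M`, `perm`, and `shuffled` are hypothetical names used only here.

```python
import numpy as np

# Hypothetical 5 x 2 array; row k is [2k, 2k + 1]
M = np.arange(10).reshape(5, 2)

# Random ordering of the row indices 0..4
perm = np.random.permutation(M.shape[0])

# Indexing with perm reorders the rows; each row appears exactly once
shuffled = M[perm, :]

print(perm)
print(shuffled)
```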

# To Do

In the space below, create a rank 1 ndarray that contains a random permutation of the row indices of `X_norm`. You can do this in one line of code by extracting the number of rows of `X_norm` using the `shape` attribute and then passing it to the `np.random.permutation()` function. Remember that the `shape` attribute returns a tuple with two numbers in the form `(rows,columns)`.

In [7]:
# Create a rank 1 ndarray that contains a random permutation of the row indices of `X_norm`
row_indices = np.random.permutation(X_norm.shape[0])
print("--> Row indices = \n",row_indices)

--> Row indices = 
 [415 124 926  76  97 325   9 663 224 259 187 437 353 268 851 820 121  16
 833 834 831 370 225 945 198 261 916   1  85 176 183  92 732 202 231 730
 166 921 796 647  50 664 356  78  94 913 982  56 645 238 173 413  87  64
 512 378 962 835 382 668 573 161 675  33 594  53 264 701 958 707 934 632
 885  61 721 507 659 470 394 392 444 876 970 464 652 818 949 678 175 185
 143 672 305 355 145 112 989 528 798 282 702 380 761 696  48 447 584 727
 323 640 928 336 217 933 357 822 792 442  20 657 152 953 236 399 618 211
 465 116 735 854 178 521 114  54 248 590 318 552 170 517 125 156 631 765
 432 979 390 593 602 153 541 462 244 354  99 313 362 374 670 551 821 243
 644 439 709 866 219 518 424 887 978 364 287 103 539 420 363  46 599 973
 827 216 453 772 230 639 882 342 427 621 167 500 148 858 623  43 638 904
 286 596 484 491 381 720 544 141 361 923 252 265 266 791  27 304 917 498
 345 234 671 903 557 940 641 168 785 239 193 188 637 773 690  47 482 489
 290 457 877 412 561 991 794 39

Now you can create the three datasets using the `row_indices` ndarray to select the rows that will go into each dataset. Remember that the Training Set contains 60% of the data, the Cross Validation Set contains 20% of the data, and the Test Set contains 20% of the data. Each set requires just one line of code to create. Fill in the code below.

In [8]:
# Make any necessary calculations.
# You can save your calculations into variables to use later.


# Create a Training Set
X_train = X_norm[row_indices[:600],:]

# Create a Cross Validation Set
X_crossVal = X_norm[row_indices[600:800],:]

# Create a Test Set
X_test = X_norm[row_indices[800:1000],:]
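
The boundaries 600 and 800 used above can also be derived from the number of rows, which keeps the 60/20/20 split correct if the dataset size changes. This is only a sketch under that assumption. The names `demo`, `idx`, `train_end`, and `val_end` are hypothetical, and `demo` stands in for `X_norm` so the snippet runs on its own.

```python
import numpy as np

# Hypothetical stand-in for X_norm
demo = np.random.rand(1000, 20)
n_rows = demo.shape[0]

# Shuffled row indices, as in the lab
idx = np.random.permutation(n_rows)

# Integer arithmetic avoids floating-point rounding in the boundaries
train_end = n_rows * 60 // 100   # end of the first 60% of the rows
val_end = n_rows * 80 // 100     # end of the next 20% of the rows

demo_train = demo[idx[:train_end], :]
demo_val = demo[idx[train_end:val_end], :]
demo_test = demo[idx[val_end:], :]

print(demo_train.shape, demo_val.shape, demo_test.shape)
```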

If you performed the above calculations correctly, then `X_train` should have 600 rows and 20 columns, `X_crossVal` should have 200 rows and 20 columns, and `X_test` should have 200 rows and 20 columns. You can verify this by filling in the code below:

In [9]:
# Print the shape of X_train
print("--> X Train = \n",X_train)
print()
print("--> X Train Size = ",X_train.shape)
print()
# Print the shape of X_crossVal
print("--> X Validation = \n",X_crossVal)
print()
print("--> X Validation Size = ",X_crossVal.shape)
print()

# Print the shape of X_test
print("--> X Test = \n",X_test)
print() 
print("--> X Test Size = ",X_test.shape)
print()

--> X Train = 
 [[-0.05322222  0.85785862  0.83336095 ... -1.25180888 -0.92055518
   0.81562448]
 [-0.2336031   1.71024637 -0.41085318 ...  1.44947867 -0.1719107
   1.58560608]
 [ 1.49481211 -1.45655343  1.54907768 ...  0.15405535  1.1317992
   0.02556242]
 ...
 [ 0.36531778  1.07026592  1.09380999 ... -0.14531593  0.0639921
  -0.76103749]
 [-0.47458068 -1.09518516  1.7584041  ...  0.29644773 -0.42724079
   1.3723378 ]
 [ 0.07008502 -0.66692238  0.38431087 ...  0.72293026 -0.76999368
  -0.23825322]]

--> X Train Size =  (600, 20)

--> X Validation = 
 [[-1.40466958 -0.58002848 -0.3797651  ... -1.03440004 -1.24665611
   1.18538183]
 [ 0.78103934 -1.54758513  0.89346457 ... -1.2941793   1.18036742
  -0.43836535]
 [-0.16455105  0.94475251 -0.14556822 ... -1.48936104  1.65078536
   0.32607682]
 ...
 [-1.36098359 -1.11035711 -0.98356474 ... -0.14601053 -0.04216416
  -0.21332576]
 [ 1.54061194 -1.03173882  1.31902587 ...  0.38396695 -1.02671144
  -0.01598335]
 [ 0.64998136  0.88337508  1.772