# Mini-Project (Numpy)
NumPy (Numerical Python) is an extensive math library for Python. It speeds up numerical computation by providing multidimensional array data structures that can represent vectors and matrices. In machine learning we often need feature scaling; the variant used here is called mean normalization. In this exercise, we take random integer values between 0 and 10,000 and rescale each column by subtracting its mean and dividing by its standard deviation. After normalization the values lie in a small interval around zero (roughly -1.7 to 1.7 in this run), and their average is close to zero.
This project also shows one way to split data for a machine learning algorithm: 60% of the rows are chosen as the training set, 20% as the cross validation set, and 20% as the test set. The rows are selected so that no row appears in more than one set.

In [31]:
# importing NumPy into Python
import numpy as np

# Creating a 1000 x 20 ndarray with random integers in the half-open interval [0, 10000).
X = np.random.randint(0,10000,(1000,20))

# printing the shape of X
print(X.shape)

(1000, 20)


Performing mean normalization using the following equation:

$\mbox{Norm_Col}_i = \frac{\mbox{Col}_i - \mu_i}{\sigma_i}$

where $\mbox{Col}_i$ is the $i$th column of $X$, $\mu_i$ is the average of the values in the $i$th column of $X$, and $\sigma_i$ is the standard deviation of the values in the $i$th column of $X$. In other words, mean normalization subtracts from each column of $X$ the average of its values and then divides the result by the standard deviation of its values.
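As a small illustration of the formula above, the following sketch applies column-wise mean normalization to a tiny hand-made array. The array `A` and the names `mu`, `sigma`, and `A_norm` are made up for this example and are not part of the exercise. `np.std` is used with its default settings, which compute the population standard deviation.

```python
import numpy as np

# Hypothetical 3 x 2 array, chosen only for illustration
A = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0]])

# Column-wise mean and standard deviation, each of shape (2,)
mu = A.mean(axis=0)
sigma = A.std(axis=0)

# Broadcasting stretches the (2,) arrays across the 3 rows of A
A_norm = (A - mu) / sigma

print(A_norm.mean(axis=0))  # each column mean is (numerically) zero
print(A_norm.std(axis=0))   # each column standard deviation is one
```

Both columns end up with the same normalized values even though their original scales differ by a factor of ten, which is the point of feature scaling.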

In [40]:
# Average of the values in each column of X
ave_cols = np.mean(X,axis = 0)

# Standard Deviation of the values in each column of X
std_cols = np.std(X,axis = 0)

In [41]:
# Print the shape of ave_cols
print(ave_cols.shape)

# Print the shape of std_cols
print(std_cols.shape)


(20,)
(20,)


In [34]:
# Mean normalize X
X_norm = (X-ave_cols)/std_cols
print(X_norm.shape)

(1000, 20)


The average of all the elements in $X_{\tiny{\mbox{norm}}}$ should be close to zero, and the values should be spread fairly evenly over a small interval around zero. We can verify this by computing the overall average and the averages of the per-column minimum and maximum values.

In [35]:
# Print the average of all the values of X_norm
print(np.mean(X_norm))

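# Print the average of the minimum value in each column of X_norm
print(np.min(X_norm,axis=0).mean())
# Same quantity, computed by sorting each column and taking the first row
print(np.mean(np.sort(X_norm, axis=0)[0]))

# Print the average of the maximum value in each column of X_norm
print(np.max(X_norm,axis=0).mean())
# Same quantity, computed by sorting each column and taking the last row
print(np.mean(np.sort(X_norm, axis=0)[-1]))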

-1.9184653865522706e-17
-1.731747558009267
-1.731747558009267
1.7468173779239489
1.7468173779239489
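The minimum and maximum averages of about -1.73 and 1.75 are what one would expect for uniformly distributed data: a uniform distribution on an interval of width $w$ has standard deviation $w/\sqrt{12}$, so its endpoints lie $\sqrt{3} \approx 1.732$ standard deviations from the mean. The sketch below checks this on freshly generated data. It is only an illustration: the names `rng`, `X_demo`, `X_demo_norm`, `avg_min`, and `avg_max` and the seed are assumptions of this example, and it uses NumPy's `default_rng` generator rather than the `np.random.randint` call used above.

```python
import numpy as np

# Seeded generator so the sketch is reproducible (the seed is arbitrary)
rng = np.random.default_rng(0)

# Same shape and value range as X, drawn here with the Generator API
X_demo = rng.integers(0, 10000, size=(1000, 20))
X_demo_norm = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

# Average of the per-column minima and maxima
avg_min = X_demo_norm.min(axis=0).mean()
avg_max = X_demo_norm.max(axis=0).mean()

# For uniform data the bounds sit about sqrt(3) standard deviations from the mean
print(avg_min, avg_max, np.sqrt(3))
```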


# Data Distribution for Machine Learning

We can split our dataset into three sets:

1. A Training Set
2. A Cross Validation Set
3. A Test Set

The dataset is usually divided such that the Training Set contains 60% of the data, the Cross Validation Set contains 20% of the data, and the Test Set contains 20% of the data. 

To make sure that the sets do not overlap, we can use a random permutation of the row indices of X_norm.

In [37]:
# Creating a rank 1 ndarray that contains a random permutation of the row indices of `X_norm`
row_indices = np.random.permutation(X_norm.shape[0])
print(row_indices)
print(row_indices.shape)

[238 460 750  58 724 658 434 571  47 702 986 936 360 114 136 703 768 753
 892 513  11  69 458 171 190 497 848 339 250 295 849 754 701 307 519 258
 689 706 829 891 800 440 100 962 193 725 522 412 142 468  38 831 695 799
 527 737 786 467 227 430 402 330 712  99 597 911 163 894 442  39 935 922
 191 292 746 588 132  13 550 419 472 429 788 413 568 716 232 129 930 355
 347 133 311 372 839 920 204 755 647 739 270 543 523 778 108 200 106 150
 981 408 107 457 479 542 791 381 509 234 826 877 425 508 651 758  91 869
 275 805 428 220 260 907 662 700 692 131 561 762 252 996 732 896 231   2
 906 492 334 531 358 384 552  30 621 876 253 349  55 494 437 641 789 216
 882 315 493 226 688 241 604 213 633 269 369 395 628 654 151 949 331 720
 874 686 904 661 999 217 557 318 461 120 665 296 971 953 452 377 183 322
 153   6  26 281 255 731 488 113 105 600 777 157 696 590 450 370 714 164
 594 995 858 361 465   8  66 218  12 678 155   3 374 924 276 422 289 810
 379 392 759 840  48 708 156 592 961 946  33 646 44

Now we can distribute the data. The first 60% of the permuted row indices go to the Training Set, the next 20% to the Cross Validation Set, and the last 20% to the Test Set.

In [42]:
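# Number of rows in the Training Set (60% of the data)
train_row = int(row_indices.shape[0]*0.6)
Row_train = row_indices[0:train_row]

# The next 20% of the permuted indices form the Cross Validation Set
cross_start = train_row
cross_end = train_row + int(row_indices.shape[0]*0.2)
Row_crossVal = row_indices[cross_start:cross_end]

# The last 20% of the permuted indices form the Test Set
test_start = cross_end
test_end = cross_end + int(row_indices.shape[0]*0.2)
Row_test = row_indices[test_start:test_end]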

# Creating a Training Set from the normalized data
X_train = X_norm[Row_train,:]

# Creating a Cross Validation Set from the normalized data
X_crossVal = X_norm[Row_crossVal,:]

# Creating a Test Set from the normalized data
X_test = X_norm[Row_test,:]

In [39]:
# Checking the distribution
# Print the shape of X_train
print(X_train.shape)

# Print the shape of X_crossVal
print(X_crossVal.shape)

# Print the shape of X_test
print(X_test.shape)

(600, 20)
(200, 20)
(200, 20)
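The shapes confirm the 60/20/20 proportions, but not that the three sets are disjoint. A minimal sketch of such a check follows, under the assumption that the split is done on a permutation of 1000 row indices as above. The names `perm`, `train_idx`, `cross_idx`, and `test_idx` are hypothetical, and `np.split` is used here as a compact alternative to the explicit slicing in the notebook.

```python
import numpy as np

# Arbitrary seed, used only to make the sketch reproducible
rng = np.random.default_rng(0)
n_rows = 1000

# Random permutation of the row indices 0 .. n_rows - 1
perm = rng.permutation(n_rows)

# Cut the permutation at 60% and 80% to get the three index sets
n_train_end = n_rows * 6 // 10
n_cross_end = n_rows * 8 // 10
train_idx, cross_idx, test_idx = np.split(perm, [n_train_end, n_cross_end])

# No index is shared between any two sets
assert np.intersect1d(train_idx, cross_idx).size == 0
assert np.intersect1d(train_idx, test_idx).size == 0
assert np.intersect1d(cross_idx, test_idx).size == 0

# Together the three sets cover every row exactly once
all_idx = np.concatenate([train_idx, cross_idx, test_idx])
assert np.array_equal(np.sort(all_idx), np.arange(n_rows))

print(train_idx.shape, cross_idx.shape, test_idx.shape)  # (600,) (200,) (200,)
```

Because a permutation contains each index exactly once, slicing it into consecutive pieces cannot produce overlapping sets; the assertions simply make that property explicit.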
