### Math of Intelligence : Vectors

In this week's coding challenge we will focus on how vectors are used when solving different machine learning problems.
Vectors are used as:
1. A way of encoding our data, whether it's images, text, signal processing data, audio, etc. Each kind of data is encoded into a set of vectors using some feature space.
2. A way to describe our model. The final trained ML model is simply a set of weights and biases, which are vectors learned during some optimization process.
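
As a minimal sketch of both uses, the snippet below flattens a tiny made-up "image" into a feature vector and scores it with a linear model. The pixel values, weights and bias are invented for illustration and are not taken from the dataset used later.

```python
import numpy as np

# Hypothetical 2x3 grayscale "image"; the pixel intensities are made up.
image = np.array([[0.0, 0.5, 1.0],
                  [0.2, 0.4, 0.6]])

# Encoding: flatten the pixel grid into a single feature vector.
x = image.reshape(-1)          # shape (6,)

# Model: a linear model is just a weight vector plus a bias.
w = np.full(x.shape, 0.1)      # hypothetical weights
b = 0.5                        # hypothetical bias

# The prediction is a dot product between the two vectors, plus the bias.
y_pred = np.dot(w, x) + b
print(x.shape, y_pred)
```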

A common term in the data science community for vectors and their higher-dimensional generalizations is **tensors**. You can see this in major ML libraries like **Tensor**Flow or PyTorch, which refer to each piece of vectorized data as a tensor of some rank (rank = the number of axes of the array: 0 for a scalar, 1 for a vector, 2 for a matrix, and so on).
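
To make the idea of rank concrete, here is a small NumPy sketch (NumPy calls the rank `ndim`). The arrays and their shapes are arbitrary examples.

```python
import numpy as np

scalar = np.array(3.0)                 # rank 0: a single number
vector = np.array([1.0, 2.0, 3.0])     # rank 1: one axis with 3 components
matrix = np.ones((2, 3))               # rank 2: rows and columns
batch = np.zeros((4, 28, 28))          # rank 3: e.g. a stack of 4 grayscale images

for name, t in [("scalar", scalar), ("vector", vector),
                ("matrix", matrix), ("batch", batch)]:
    print(name, "rank:", t.ndim, "shape:", t.shape)
```

Note that the vector has rank 1 even though it has 3 components: the rank counts axes, not the number of entries along an axis.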

In this notebook I will go through a dataset that provides several metrics of Facebook posts in order to predict the total number of interactions resulting from each post (the dataset is small on purpose; the real goal here is to show how heavily vectorized data is used in the learning process, not the dataset itself).

In [60]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
%matplotlib notebook

In [97]:
#Let's look at our dataset.
df = pd.read_csv('train.csv',sep=',')
df.head()

Unnamed: 0,x,y
0,24.0,21.549452
1,50.0,47.464463
2,15.0,17.218656
3,38.0,36.586398
4,87.0,87.288984


Looking at these features, we can see that several of them are **categorical**. Because I want to simplify the problem as much as possible and focus on the regularization, we will leave out all the categorical features (if we were really trying to attack this problem, we could use embedding vectors for them).
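
One possible way to keep only the numerical columns is `select_dtypes` in pandas, sketched below. The frame `posts`, its column names and its values are hypothetical and only stand in for a dataset with mixed column types.

```python
import pandas as pd

# Hypothetical frame: column names and values are invented for illustration.
posts = pd.DataFrame({
    "type": ["Photo", "Status", "Link"],   # categorical
    "likes": [79.0, 130.0, None],          # numerical, with one missing value
    "shares": [17.0, 29.0, 14.0],          # numerical
})

numeric = posts.select_dtypes(include="number")  # keep numeric columns only
numeric = numeric.dropna(axis=0, how="any")      # drop rows containing NaNs
print(numeric)
```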

In [98]:
# #Remove all categorical features.
# df.drop(df.columns[[0,1,2,3,4,5,6,7,8,9,10,11]],axis=1,inplace=True)
# #Drop all rows that contain NaNs
# df.dropna(axis=0,how='any',inplace=True)
df.head(120)

Unnamed: 0,x,y
0,24.0,21.549452
1,50.0,47.464463
2,15.0,17.218656
3,38.0,36.586398
4,87.0,87.288984
5,36.0,32.463875
6,12.0,10.780897
7,81.0,80.763399
8,25.0,24.612151
9,5.0,6.963319


Now that we are left with only the non-categorical features, let's get straight to the code.

In [103]:
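# DataFrame.as_matrix() was removed in pandas 1.0; to_numpy() is its replacement.
train = df.to_numpy()
train.shape

# Keep only the rows that contain no NaN values.
train = train[~np.isnan(train).any(axis=1)]
train.shape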

(699, 2)
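
The NaN filter above relies on a boolean row mask. A small sketch on a made-up array (the values are arbitrary) shows what each step produces:

```python
import numpy as np

data = np.array([[1.0, 2.0],
                 [3.0, np.nan],
                 [5.0, 6.0]])

has_nan = np.isnan(data).any(axis=1)   # True for rows with at least one NaN
clean = data[~has_nan]                 # keep only the complete rows
print(has_nan, clean.shape)
```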

In [104]:
# #Dimensionality reduction.
# pca = PCA(n_components=2)
# reduced_data = pca.fit(np_data.T).components_.T
# reduced_data.shape

In [105]:
plt.figure()
plt.scatter(train[:,0],train[:,1])

<matplotlib.collections.PathCollection at 0x7f51a6624c88>

Because this notebook is focused on the use of **regularizations**, I will not explain much about the training process (there is a great explanation of SGD in the YouTube series by Siraj).
We will use the MSE loss function (the standard choice for simple linear regression problems):
$$ loss(y') = \frac {1}{N}\sum_{i=1} ^ {N} (y_i - y'_i)^2 $$

The equation above means that, for a given batch of N training examples, the loss is computed by passing through each training example, taking its prediction $y'_i$ and its real label $y_i$, computing the squared error, and then taking the mean of all of those errors.
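
As a quick worked example of the formula, here is the MSE of three made-up predictions computed step by step (the numbers are arbitrary and not from the dataset):

```python
import numpy as np

y_true = np.array([3.0, 5.0, 7.0])   # real labels y_i
y_pred = np.array([2.0, 5.0, 9.0])   # predictions y'_i

squared_errors = (y_true - y_pred) ** 2   # per-example squared errors: 1, 0, 4
mse = squared_errors.mean()               # (1 + 0 + 4) / 3
print(mse)
```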

In [118]:
#Save number of training examples.
N = train.shape[0]

#Save the batch size.
bs = 64

X_train = train[:,0]
Y_train = train[:,1]

def squared_error(y,labels):
    # Per-example squared error scaled by 1/N; summing it over all N examples gives the MSE.
    return (1 / N) * ((y - labels) ** 2)

def predict(x,w,b):
    return w * x + b

def compute_grads(x,y,w,b):
    # Save the number of samples for x and y.
    num_samples = x.shape[0]

    # Residuals between the predictions and the labels.
    res = w * x + b - y
    # Analytical gradients of the MSE loss with respect to w and b.
    gW = (2 / num_samples) * np.sum(x * res)
    gb = (2 / num_samples) * np.sum(res)
    return [gW,gb]

def grad_step(w,b,lr = 1e-2,epochs = 10):
    # Full-batch gradient descent: every step uses all N training examples.
    # (A mini-batch variant would instead slice X_train / Y_train into chunks of size bs.)
    for j in range(epochs):
        grads = compute_grads(X_train,Y_train,w,b)
        print("Bef w: ",w)
        print("gW: ",grads[0])
        w -= grads[0] * lr
        b -= grads[1] * lr
        print("After w: ",w)
        # Compute the MSE loss over the whole training set.
        loss = (1 / N) * np.sum(np.power(predict(X_train,w,b) - Y_train,2))
        print("Current loss: ",loss)

    return w,b

w = b = 1.0
w,b = grad_step(w,b,lr=1e-4,epochs=3)
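
With $y'_i = w x_i + b$, differentiating the MSE loss gives $\frac{\partial loss}{\partial w} = \frac{2}{N}\sum_{i=1}^{N} x_i (w x_i + b - y_i)$ and $\frac{\partial loss}{\partial b} = \frac{2}{N}\sum_{i=1}^{N} (w x_i + b - y_i)$, which is what `compute_grads` evaluates. One way to sanity-check such formulas is to compare them against central finite differences. The sketch below is self-contained: the helper names `mse`, `analytic_grads` and `numeric_grads` and the synthetic data are assumptions made for illustration, not part of the notebook.

```python
import numpy as np

def mse(x, y, w, b):
    # Mean squared error of the linear model w * x + b.
    return np.mean((w * x + b - y) ** 2)

def analytic_grads(x, y, w, b):
    # Same formulas as in compute_grads, written with np.mean.
    res = w * x + b - y
    return 2 * np.mean(x * res), 2 * np.mean(res)

def numeric_grads(x, y, w, b, eps=1e-6):
    # Central finite differences in w and b.
    gw = (mse(x, y, w + eps, b) - mse(x, y, w - eps, b)) / (2 * eps)
    gb = (mse(x, y, w, b + eps) - mse(x, y, w, b - eps)) / (2 * eps)
    return gw, gb

# Synthetic data (hypothetical): a noisy line y = 2x + 1.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 2.0 * x + 1.0 + rng.normal(0, 0.1, size=50)

print("analytic:", analytic_grads(x, y, 1.0, 1.0))
print("numeric: ", numeric_grads(x, y, 1.0, 1.0))
```

If the two pairs of numbers agree to several decimal places, the analytical gradients are very likely correct.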