In [533]:
%matplotlib inline
import numpy as np
import pandas as pd
from sklearn import preprocessing
import matplotlib.pyplot as plt

## RSS

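* Residual sum of squares: the sum of the squared differences between the observed values and the model's predictions
* the `loss function` that is minimized when selecting coefficients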
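
As a minimal sketch of the definition above, RSS can be computed directly from the residuals. The helper name `rss` and the toy numbers are illustrative assumptions and do not come from the notebook.

```python
import numpy as np

def rss(y, y_hat):
    """Residual sum of squares: sum over samples of (y_i - y_hat_i)^2."""
    residuals = np.asarray(y, dtype=float) - np.asarray(y_hat, dtype=float)
    return float(np.sum(residuals ** 2))

# Toy values chosen only for illustration
y_true = [2.0, 4.0, 16.0, 32.0]
y_pred = [1.0, 5.0, 14.0, 30.0]
print(rss(y_true, y_pred))  # squared residuals are 1, 1, 4, 4, so the sum is 10.0
```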

## Regularization

* shrinks the coefficients of the polynomial toward 0 to make the polynomial simpler
* helps to prevent overfitting
* common variants: Ridge Regression (L2 penalty), Lasso (L1 penalty)

## Ridge (L2) Regression

* `discourages` learning a more `complex` model, so as to `avoid` the risk of `overfitting`
* good for polynomials, continuous functions, and normal distributions

### λ
* shrinkage parameter
* controls the size of the coefficients
* controls amount of regularization
* as λ ↓ 0, we obtain the least squares solution
* as λ ↑ ∞, the coefficients shrink to 0 and we get an intercept-only model

![image](https://user-images.githubusercontent.com/7232635/44366487-d62d1e00-a49a-11e8-8502-c5ceea47d925.png)
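
The two limits of λ can be checked numerically. The following is a sketch only: it assumes the standard closed-form ridge solution with an unpenalized intercept, and `ridge_fit` and the toy data are hypothetical names and values, not part of the notebook.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge fit with an unpenalized intercept.

    Centering X and y removes the intercept from the penalized problem;
    the intercept is recovered afterwards from the column means.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    n_features = X.shape[1]
    # Solve (Xc^T Xc + lam * I) w = Xc^T yc
    w = np.linalg.solve(Xc.T @ Xc + lam * np.eye(n_features), Xc.T @ yc)
    intercept = y_mean - x_mean @ w
    return intercept, w

# Toy data for illustration only
X_toy = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 5.0], [4.0, 3.0], [5.0, 6.0]])
y_toy = np.array([3.0, 4.0, 9.0, 8.0, 12.0])

for lam in (0.0, 1.0, 100.0, 1e6):
    intercept, w = ridge_fit(X_toy, y_toy, lam)
    print(lam, intercept, w)
```

With `lam=0.0` the result coincides with ordinary least squares; as `lam` grows, the coefficients shrink toward 0 and the intercept approaches the mean of `y_toy`.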

## Stochastic Gradient Descent
* used here to fit a line to data points
* an `optimization` algorithm that finds a good set of model `parameters` given a training dataset
* an `iterative` algorithm that evaluates and updates the `coefficients` on every iteration to `minimize` the `error` of the model on its training data

### Required Parameters for SGD
* **Learning Rate** - limits how much each coefficient is corrected on each update
* **Epochs** - number of passes through the training data while updating the coefficients

### Update to Coefficients

![image](https://user-images.githubusercontent.com/7232635/44360541-2307f900-a489-11e8-99a3-88e6e014df8f.png)

![image](https://user-images.githubusercontent.com/7232635/44360561-387d2300-a489-11e8-8a40-c0ab31bc77fc.png)
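
A single coefficient update can be sketched as follows. This assumes the usual per-sample rule for linear regression with squared error (coefficient minus learning rate times error times input); the function name `sgd_step` and the numbers are illustrative assumptions.

```python
import numpy as np

def sgd_step(theta, x_i, y_i, learning_rate):
    """One stochastic update using a single training example (x_i, y_i)."""
    error = np.dot(x_i, theta) - y_i  # prediction minus target
    return theta - learning_rate * error * x_i

theta = np.zeros(2)
x_i = np.array([1.0, 3.0])  # the leading 1.0 plays the role of the intercept term
y_i = 5.0
theta = sgd_step(theta, x_i, y_i, learning_rate=0.1)
print(theta)  # error = 0 - 5 = -5, so theta moves to (0.5, 1.5)
```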


## Gradient Descent vs Stochastic Gradient Descent

| Gradient Descent | Stochastic Gradient Descent |
|:-----------------|:----------------------------|
| Each update uses the whole dataset, so it is `expensive` when the dataset is large. Consider a dataset of `1 billion sample points`: gradient descent goes through all 1 billion sample points on each epoch to compute one parameter update. | Each update uses a single sample (or a small batch), so one update stays cheap even when the dataset is very large. |
| Only guaranteed to reach the global minimum for a `well defined convex optimization` problem | |
| Suffers with `saddle points` and multiple local minima | Suffers with `saddle points` |
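
The cost difference in the table can be illustrated with a small sketch. It assumes a linear model with squared-error loss and uses synthetic data; `full_gradient` and `sample_gradient` are hypothetical helper names. The full gradient is the average of the per-sample gradients, so each stochastic step uses a cheap, noisy estimate of it.

```python
import numpy as np

rng = np.random.default_rng(0)
X_syn = rng.normal(size=(1000, 3))  # synthetic data, for illustration only
y_syn = X_syn @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=1000)
theta0 = np.zeros(3)

def full_gradient(theta, X, y):
    """Gradient of the half mean squared error over all m rows: O(m) work per step."""
    return X.T @ (X @ theta - y) / len(y)

def sample_gradient(theta, X, y, i):
    """Gradient contribution of the single row i: work independent of m."""
    return (X[i] @ theta - y[i]) * X[i]

per_sample = np.array(
    [sample_gradient(theta0, X_syn, y_syn, i) for i in range(len(y_syn))]
)
# Averaging all per-sample gradients recovers the full-batch gradient
print(np.allclose(per_sample.mean(axis=0), full_gradient(theta0, X_syn, y_syn)))
```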

### Well-Defined Convex Optimization Problem

![image](https://user-images.githubusercontent.com/7232635/44543527-c6514c00-a6dd-11e8-894d-5c5ec6396c4a.png)

### Non-Convex Optimization Problem

![image](https://user-images.githubusercontent.com/7232635/44543469-9bff8e80-a6dd-11e8-86e9-72f46df21ee6.png)

### Non-Strongly Convex Optimization Problem

![image](https://user-images.githubusercontent.com/7232635/44543891-d4ec3300-a6de-11e8-931e-49b39ddbf2de.png)

### Saddle Point

![image](https://user-images.githubusercontent.com/7232635/44543643-1d572100-a6de-11e8-9826-28bcca8ca6b5.png)


In [9]:
from IPython.display import Image
Image(url='https://www.jeremyjordan.me/content/images/2018/01/opt1.gif')

## Gradient Descent on Multiple Variables

Use `transcoding_measurement.tsv` to predict the `video transcoding time` of `MPEG`-encoded videos.

To follow along, read the [associated paper](https://ieeexplore.ieee.org/abstract/document/6890256/), which walks through the authors' steps for this problem.

* [Dataset here](http://archive.ics.uci.edu/ml/datasets/Online+Video+Characteristics+and+Transcoding+Time+Dataset)

In [40]:
%matplotlib inline
import pandas as pd
from matplotlib import pyplot as plt

In [41]:
input_video_df = pd.read_csv('./transcoding_measurement.tsv', sep='\t')
output_video_df = pd.read_csv('./youtube_videos.tsv', sep='\t')

## Input Video (MPEG)

In [42]:
input_video_df.head()

Unnamed: 0,id,duration,codec,width,height,bitrate,framerate,i,p,b,...,p_size,b_size,size,o_codec,o_bitrate,o_framerate,o_width,o_height,umem,utime
0,04t6-jw9czg,130.35667,mpeg4,176,144,54590,12.0,27,1537,0,...,825054,0,889537,mpeg4,56000,12.0,176,144,22508,0.612
1,04t6-jw9czg,130.35667,mpeg4,176,144,54590,12.0,27,1537,0,...,825054,0,889537,mpeg4,56000,12.0,320,240,25164,0.98
2,04t6-jw9czg,130.35667,mpeg4,176,144,54590,12.0,27,1537,0,...,825054,0,889537,mpeg4,56000,12.0,480,360,29228,1.216
3,04t6-jw9czg,130.35667,mpeg4,176,144,54590,12.0,27,1537,0,...,825054,0,889537,mpeg4,56000,12.0,640,480,34316,1.692
4,04t6-jw9czg,130.35667,mpeg4,176,144,54590,12.0,27,1537,0,...,825054,0,889537,mpeg4,56000,12.0,1280,720,58528,3.456


In [43]:
input_video_df.columns

Index(['id', 'duration', 'codec', 'width', 'height', 'bitrate', 'framerate',
       'i', 'p', 'b', 'frames', 'i_size', 'p_size', 'b_size', 'size',
       'o_codec', 'o_bitrate', 'o_framerate', 'o_width', 'o_height', 'umem',
       'utime'],
      dtype='object')

## Output Decoded Video

In [44]:
output_video_df.head()

Unnamed: 0,id,duration,bitrate,bitrate(video),height,width,frame rate,frame rate(est.),codec,category,url
0,uDNj-_5ty48,267,373,274,568,320,29.97,0.0,h264,Music,http://r2---sn-ovgq0oxu-5goe.c.youtube.com/vid...
1,uDNj-_5ty48,267,512,396,480,270,29.97,29.97,h264,Music,http://r2---sn-ovgq0oxu-5goe.c.youtube.com/vid...
2,uDNj-_5ty48,267,324,263,400,226,29.97,29.97,flv1,Music,http://r2---sn-ovgq0oxu-5goe.c.youtube.com/vid...
3,uDNj-_5ty48,267,85,55,176,144,12.0,12.0,mpeg4,Music,http://r2---sn-ovgq0oxu-5goe.c.youtube.com/vid...
4,WCgt-AactyY,31,1261,1183,640,480,24.0,0.0,h264,People & Blogs,http://r1---sn-ovgq0oxu-5goe.c.youtube.com/vid...


## Select Features for Gradient Descent on Multiple Variables

In the paper, they selected these features:

### From input video

* bitrate
* framerate
* resolution
* codec
* number of i frames
* number of p frames
* number of b frames
* size of i frames
* size of p frames
* size of b frames 

### From output video

* desired bitrate
* framerate
* resolution
* codec

### Predict

* transcoding rate - measured in frames per second (fps) and defined as the number of frames transcoded per unit time (sec). A higher transcoding rate indicates better throughput, reduced delay, and a lower total system cost.
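
As a rough sketch of this definition, the rate can be derived from the frame counts and `utime`. The two rows are copied from the `input_video_df.head()` output shown earlier; treating `i + p + b` as the total frame count and `utime` as seconds are assumptions, and the new column names are hypothetical.

```python
import pandas as pd

# Two rows taken from the head of transcoding_measurement.tsv shown earlier
rate_df = pd.DataFrame({
    "i": [27, 27],
    "p": [1537, 1537],
    "b": [0, 0],
    "utime": [0.612, 0.98],
})

# Assumption: total frames = i + p + b, and rate = frames / transcoding time
rate_df["frames_total"] = rate_df["i"] + rate_df["p"] + rate_df["b"]
rate_df["transcoding_rate_fps"] = rate_df["frames_total"] / rate_df["utime"]
print(rate_df)
```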

### Description of Features

* id = YouTube video id
* duration = duration of video
* bitrate, bitrate(video) = video bitrate
* height = height of video in pixels
* width = width of video in pixels
* frame rate = actual video frame rate
* frame rate(est.) = estimated video frame rate
* codec = coding standard used for the video
* category = YouTube video category
* url = direct link to video (has expiration date)
* i = number of i frames in the video
* p = number of p frames in the video
* b = number of b frames in the video
* frames = number of frames in video
* i_size = total size in bytes of the i frames
* p_size = total size in bytes of the p frames
* b_size = total size in bytes of the b frames
* size = total size of video
* o_codec = output codec used for transcoding
* o_bitrate = output bitrate used for transcoding
* o_framerate = output framerate used for transcoding
* o_width = output width in pixels used for transcoding
* o_height = output height in pixels used for transcoding
* umem = total memory allocated by the codec for transcoding
* utime = total transcoding time

## Extract Features

In [45]:
# use height as a stand-in for resolution

input_video_df_filtered = input_video_df[['bitrate', 'framerate', 'height', 'codec', 'i', 'p', 'b', 'i_size', 'p_size', 
                                          'b_size']]
output_video_df_filtered = output_video_df[['bitrate', 'frame rate', 'height', 'codec']]

In [46]:
input_video_df_filtered.head()

Unnamed: 0,bitrate,framerate,height,codec,i,p,b,i_size,p_size,b_size
0,54590,12.0,144,mpeg4,27,1537,0,64483,825054,0
1,54590,12.0,144,mpeg4,27,1537,0,64483,825054,0
2,54590,12.0,144,mpeg4,27,1537,0,64483,825054,0
3,54590,12.0,144,mpeg4,27,1537,0,64483,825054,0
4,54590,12.0,144,mpeg4,27,1537,0,64483,825054,0


In [47]:
output_video_df_filtered.head()

Unnamed: 0,bitrate,frame rate,height,codec
0,373,29.97,568,h264
1,512,29.97,480,h264
2,324,29.97,400,flv1
3,85,12.0,176,mpeg4
4,1261,24.0,640,h264


## Give Features Friendlier Names

In [48]:
input_video_df_filtered = input_video_df_filtered.rename(index=str, columns={"bitrate": "input_bitrate", 
                                                                             "framerate": "input_framerate", 
                                                                             "height": "input_resolution", "codec": "input_codec"})
output_video_df_filtered = output_video_df_filtered.rename(index=str, columns={"bitrate": "output_bitrate", "frame rate": "output_framerate", "height": "output_resolution", "codec": "output_codec"})

In [49]:
input_video_df_filtered.head()

Unnamed: 0,input_bitrate,input_framerate,input_resolution,input_codec,i,p,b,i_size,p_size,b_size
0,54590,12.0,144,mpeg4,27,1537,0,64483,825054,0
1,54590,12.0,144,mpeg4,27,1537,0,64483,825054,0
2,54590,12.0,144,mpeg4,27,1537,0,64483,825054,0
3,54590,12.0,144,mpeg4,27,1537,0,64483,825054,0
4,54590,12.0,144,mpeg4,27,1537,0,64483,825054,0


In [50]:
output_video_df_filtered.head()

Unnamed: 0,output_bitrate,output_framerate,output_resolution,output_codec
0,373,29.97,568,h264
1,512,29.97,480,h264
2,324,29.97,400,flv1
3,85,12.0,176,mpeg4
4,1261,24.0,640,h264


## Create X and y DataFrames

In [54]:
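# left join on the row index: row k of the input frame is paired with row k of the output frame
X = input_video_df_filtered.join(output_video_df_filtered)
# target: total transcoding time
y = input_video_df[['utime']]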

In [55]:
X.head()

Unnamed: 0,input_bitrate,input_framerate,input_resolution,input_codec,i,p,b,i_size,p_size,b_size,output_bitrate,output_framerate,output_resolution,output_codec
0,54590,12.0,144,mpeg4,27,1537,0,64483,825054,0,373,29.97,568,h264
1,54590,12.0,144,mpeg4,27,1537,0,64483,825054,0,512,29.97,480,h264
2,54590,12.0,144,mpeg4,27,1537,0,64483,825054,0,324,29.97,400,flv1
3,54590,12.0,144,mpeg4,27,1537,0,64483,825054,0,85,12.0,176,mpeg4
4,54590,12.0,144,mpeg4,27,1537,0,64483,825054,0,1261,24.0,640,h264
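
`X` still contains the string columns `input_codec` and `output_codec`, which a dot product cannot consume. One common option is one-hot encoding with `pandas.get_dummies`; this is a sketch of that option, not a step taken from the paper or the notebook, and `X_demo` is a hypothetical stand-in built from a few of the values shown above.

```python
import pandas as pd

# Small stand-in for X with the same codec columns (values from the output above)
X_demo = pd.DataFrame({
    "input_bitrate": [54590, 54590, 54590],
    "input_codec": ["mpeg4", "mpeg4", "mpeg4"],
    "output_bitrate": [373, 324, 85],
    "output_codec": ["h264", "flv1", "mpeg4"],
})

# Replace each string column with one 0/1 indicator column per category
X_encoded = pd.get_dummies(
    X_demo, columns=["input_codec", "output_codec"], dtype=float
)
print(X_encoded.columns.tolist())
```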


In [52]:
import numpy as np

X = np.array([[1, 1.5, 1], [3, 4, 2.5], [5, 6, 6.5], [8, 9.5, 10]])
print(X)

In [53]:
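import numpy as np
from sklearn import preprocessing

# toy data: 4 training examples with 3 features each
X = np.array([[1, 1.5, 1], [3, 4, 2.5], [5, 6, 6.5], [8, 9.5, 10]])
y = np.array([2, 4, 16, 32])

# scale each row (sample) to unit L2 norm
X = preprocessing.normalize(X)

learning_rate = 0.01
num_epochs = 10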

In this case, `X` has 3 features and 4 training examples:

In [527]:
print(X)

[[ 0.48507125  0.72760688  0.48507125]
 [ 0.53665631  0.71554175  0.4472136 ]
 [ 0.49206783  0.5904814   0.63968818]
 [ 0.5017178   0.59578988  0.62714725]]
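
The rows printed above are consistent with L2 row normalization, in which each sample is divided by its own Euclidean length. The sketch below reproduces them with NumPy only; `X_raw`, `X_manual`, and `printed` are illustrative names.

```python
import numpy as np

X_raw = np.array([[1, 1.5, 1], [3, 4, 2.5], [5, 6, 6.5], [8, 9.5, 10]])

# Divide every row by its Euclidean (L2) length
X_manual = X_raw / np.linalg.norm(X_raw, axis=1, keepdims=True)

# Values copied from the printed output above
printed = np.array([
    [0.48507125, 0.72760688, 0.48507125],
    [0.53665631, 0.71554175, 0.4472136],
    [0.49206783, 0.5904814, 0.63968818],
    [0.5017178, 0.59578988, 0.62714725],
])
print(np.allclose(X_manual, printed, atol=1e-7))
print(np.linalg.norm(X_manual, axis=1))  # every row has length 1
```

Note that this rescales samples, not features, so it differs from column-wise standardization.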


In [528]:
print(y)

[ 2  4 16 32]


In [529]:
def init_theta(X):
    return np.zeros(X.shape[1])

In [530]:
def hypothesis(theta, X):
    return np.dot(X, theta)

In [498]:
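def calc_cost(y_hat, y, X):
    # Despite its name, this returns the gradient of the cost
    # J = 1/(2m) * sum((y_hat - y)^2) with respect to theta, not the cost value.
    m = len(y) # training examples
    return (1.0/m) * np.dot(X.T, np.subtract(y_hat, y)) # costly - would have to calculate each cost 3 million times!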
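
`calc_cost` returns the vector (1/m) Xᵀ(ŷ − y), which is the gradient of the half mean squared error rather than the cost value. The sketch below, with hypothetical helper names and toy data, writes the cost out explicitly and checks the gradient expression against finite differences.

```python
import numpy as np

def mse_cost(theta, X, y):
    """Half mean squared error: J = 1/(2m) * sum((X @ theta - y)^2)."""
    residuals = X @ theta - y
    return float(residuals @ residuals) / (2 * len(y))

def mse_gradient(theta, X, y):
    """Analytic gradient of J: the same expression that calc_cost evaluates."""
    return X.T @ (X @ theta - y) / len(y)

def numerical_gradient(f, theta, eps=1e-6):
    """Central finite differences, used only to check the analytic gradient."""
    grad = np.zeros_like(theta)
    for j in range(len(theta)):
        step = np.zeros_like(theta)
        step[j] = eps
        grad[j] = (f(theta + step) - f(theta - step)) / (2 * eps)
    return grad

X_chk = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 7.0]])  # toy data
y_chk = np.array([1.0, 2.0, 4.0])
theta_chk = np.array([0.1, -0.2])

analytic = mse_gradient(theta_chk, X_chk, y_chk)
numeric = numerical_gradient(lambda t: mse_cost(t, X_chk, y_chk), theta_chk)
print(np.allclose(analytic, numeric, atol=1e-6))
```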

In [499]:
def update_theta(theta, learning_rate, cost):
    return theta - learning_rate * cost

### Coefficients SGD

In [500]:
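# Gradient descent loop: runs a fixed number of epochs.
# As written, every update uses all rows of X (full-batch gradient descent).
def coefficients_sgd(X, y, learning_rate, num_epochs):
    theta = init_theta(X)
    for _ in range(num_epochs):
        y_hat = hypothesis(theta, X)
        cost = calc_cost(y_hat, y, X)  # gradient of the cost w.r.t. theta
        theta = update_theta(theta, learning_rate, cost)
    return theta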

In [501]:
theta = coefficients_sgd(X, y, learning_rate, num_epochs)
print(theta)

[ 0.64664752  0.78233495  0.79435881]
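
`coefficients_sgd` computes each update from all rows at once. For contrast, a per-sample stochastic variant might look like the sketch below. The function name `coefficients_sgd_per_sample`, the `seed` parameter, and the per-epoch shuffling are assumptions added for illustration; the toy data is the same as above, row-normalized with NumPy only.

```python
import numpy as np

def coefficients_sgd_per_sample(X, y, learning_rate, num_epochs, seed=0):
    """Stochastic variant: one update per training example, rows shuffled each epoch."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    for _ in range(num_epochs):
        for i in rng.permutation(len(y)):
            error = X[i] @ theta - y[i]  # prediction minus target for row i
            theta = theta - learning_rate * error * X[i]
    return theta

X_raw = np.array([[1, 1.5, 1], [3, 4, 2.5], [5, 6, 6.5], [8, 9.5, 10]])
X_demo = X_raw / np.linalg.norm(X_raw, axis=1, keepdims=True)
y_demo = np.array([2, 4, 16, 32])

theta_sgd = coefficients_sgd_per_sample(
    X_demo, y_demo, learning_rate=0.01, num_epochs=10
)
print(theta_sgd)
```

Because the rows are visited in random order, the result depends on `seed` and will generally differ from the full-batch coefficients printed above.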
