# Document Analysis Assignment 2: YouTube Video Engagement Prediction

---
## Your Information
Please fill in the following information:

- **Name**:    [You Li]

- **Uni id**:  [u6430173]

---

## Overview
The engagement of a YouTube video is a number between 0 and 1, defined as the fraction of the video that its audience watches on average.
For more information please refer to this recent paper and its presentation:
```
S. Wu, M.-A. Rizoiu, & L. Xie, "Beyond Views: Measuring and Predicting Engagement in Online Videos," in Proc. International AAAI Conference on Web and Social Media (ICWSM ’18), Stanford, CA, USA, 2018.
```
[link to paper](https://arxiv.org/pdf/1709.02541.pdf)
[paper presentation](http://www.rizoiu.eu/documents/research/presentations/WU_ICWSM-2018_slides.pdf)

Your task in this assignment is to predict a YouTube video's engagement using only the textual information related to it (i.e. its description, its author name, also called *channelTitle*, and the *videoTitle*).
Predicting a continuous variable is called a **regression** task, very similar to the classification tasks presented in the lectures.
Techniques such as Linear Regression, SVM, kNN and tree-based methods have been adapted for regression and are readily implemented in packages such as [`scikit-learn`](http://scikit-learn.org/). 

To complete this assignment, you will need to complete the following steps:

1. Read and preprocess the training and testing dataset in order to extract text data.
2. Train and apply regression models and compare their performance.
3. **(optional, but recommended)** tune your models' hyper-parameters to improve their performance.
4. Generate your result file in the required format.
5. After solving the coding part, you need to answer three questions in the written part.

(**hint: you can reuse code given in the tutorial**)

## Dataset

The dataset provided on Wattle contains two folders: **train** and **test**. 
Each folder contains the videos in the training set and the testing set respectively.
Each video is represented as a document and it is described as a json file, entitled `[id].json`. 
See an example below: 
```
{
   "channelTitle":"K KIRK PRODUCTIONS",
   "description":"New video from Reggie Records Starring FOE Lil Reggie & PESO.SOUNDCLOUD: FOE LIL REGGIE INSTAGRAM: true_stvr",
   "engagementScore":0.792,
   "videoTitle":"FOE Lil Reggie ft. PESO-Keep It Going (OFFICIAL VIDEO)"
}
```
In the training set, the response variable is in the field called `engagementScore` -- i.e. the engagement score of the video.
Your task is to predict the `engagementScore` for each document in the testing set. 

## Evaluation

The performance of your prediction will be evaluated automatically on Kaggle using the Root Mean Square Error (RMSE) performance measure.
RMSE measures the deviation between your predictions and the ground truth, and [it is defined as](http://statweb.stanford.edu/~susan/courses/s60/split/node60.html):

$$
    \operatorname {RMSE} ={\sqrt {\frac {\sum _{i=1}^{n}({\hat {y}}_{i}-y_{i})^{2}}{n}}}.
$$

where $n$ is the number of data points, $y_i$ is the true engagement score for video $i$ and $\hat{y}_i$ is the predicted engagement score for video $i$.
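
As a rough illustration of the formula above, RMSE can be computed in a few lines of plain Python. This is only a minimal sketch: the function name `rmse` and the sample values are made up for illustration and are not part of the assignment code.

```python
import math

def rmse(y_true, y_pred):
    """Root mean square error between two equal-length sequences."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(p - t) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# hypothetical true and predicted engagement scores
example_rmse = rmse([0.0, 1.0], [0.5, 0.5])
print(example_rmse)
```

Both predictions in this toy example are off by 0.5, so the error is the same for every data point and the RMSE equals that common deviation.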

Your score will be computed using a lower bound and an upper bound, which will be shown on the Kaggle leader board. 
Achieving an RMSE equal to the lower bound earns a grade of zero, while matching the upper bound earns full points (7 points; see the score distribution below).
Consequently, your score for this competition task will be calculated based on:

$$
    \text{Your Score} = \frac{\text{Lower Bound} - \text{Your Performance}}{\text{Lower Bound} - \text{Upper Bound}} \times 7
$$

Notes about the lower-bound and upper-bound predictors:

* The **lower bound** is the performance obtained by a naive baseline regressor (i.e. one that always predicts the mean engagement score in the training set).
* The **upper bound** "in-house" regressor was built in a couple of hours by one of your fellow students, on the same dataset that you were given.
No exotic tricks were used for this regressor, and it is possible to obtain better results than this.
If you obtain a better performance than the upper bound, then you will have a grade higher than 7 points for the coding part. This can be useful to compensate for any lost points for the written questions.
Note however, that the total grade of this assignment is capped at 10 marks.

Note that "lower" and "upper" refer to performance, not to the RMSE value itself: RMSE needs to be minimized, so the upper-bound RMSE is numerically smaller than the lower-bound RMSE.
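
The scoring rule above can be sketched as a small helper. The function name `competition_score` is hypothetical, and the two bound values are assumed from the evaluation cell later in this notebook; they may not match the values currently shown on the leaderboard.

```python
def competition_score(rmse, lower_bound, upper_bound, max_points=7.0):
    """Linear interpolation: lower bound -> 0 points, upper bound -> max_points."""
    return (lower_bound - rmse) / (lower_bound - upper_bound) * max_points

# bound values assumed from the evaluation cell of this notebook
LOWER_BOUND = 0.28341
UPPER_BOUND = 0.22695

at_lower = competition_score(LOWER_BOUND, LOWER_BOUND, UPPER_BOUND)
at_upper = competition_score(UPPER_BOUND, LOWER_BOUND, UPPER_BOUND)
beyond_upper = competition_score(0.21, LOWER_BOUND, UPPER_BOUND)

print(at_lower, at_upper, beyond_upper)
```

Because the rule is linear in the RMSE, any RMSE below the upper bound yields more than 7 points, which is how a strong regressor can compensate for points lost in the written part.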

## For the Kaggle competition

- Join the competition [here](https://www.kaggle.com/t/f31b296a987c46dca752753eb6fde2b3)
- Before submitting the result, first go to the **team** menu and change your **team name** to **your university id**.
- You need to upload the generated result file to Kaggle. The result file should be in the following format:
```
id,engagementScore
abcdefgh,0.01
hijk1234,0.02
lmno5678,0.3
...
```
- Note that you are only allowed to upload **5 copies** of your results to Kaggle per day. Make every upload count, and don't waste your opportunities!
- You should use cross-validation instead of relying on the public set - this is what the daily limit is for!
- For detailed submission instructions, check the end of this notebook.
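
The cross-validation advice above can be sketched with `scikit-learn` as follows. This is an illustrative sketch under stated assumptions, not a prescribed solution: the toy texts and scores are made up, and `Ridge` on TF-IDF features merely stands in for whichever regressor you choose.

```python
import math

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline

# toy stand-ins for the training texts and engagement scores (made-up data)
texts = [
    "official music video", "live concert full show", "funny cat compilation",
    "cooking pasta tutorial", "game walkthrough part one", "music video remix",
    "cat video funny moments", "full concert live stream", "easy cooking recipe",
    "game review and walkthrough",
]
scores = np.array([0.8, 0.3, 0.7, 0.5, 0.4, 0.75, 0.65, 0.35, 0.55, 0.45])

model = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("ridge", Ridge(alpha=1.0)),
])

# 5-fold cross-validation; scikit-learn reports negated MSE so that higher is better
folds = KFold(n_splits=5, shuffle=True, random_state=0)
neg_mse = cross_val_score(model, texts, scores,
                          scoring="neg_mean_squared_error", cv=folds)
fold_rmse = [math.sqrt(-value) for value in neg_mse]

print("RMSE per fold:", fold_rmse)
print("mean RMSE:", sum(fold_rmse) / len(fold_rmse))
```

Averaging the per-fold RMSE gives an estimate of test performance without spending one of the limited daily Kaggle submissions.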

Score distribution (total 10 points):

- Kaggle competition: 7 points
- Written part Q1: 1 point
- Written part Q2: 1 point
- Written part Q3: 1 point

After completion, please rename this notebook to **`your_uid.ipynb` (e.g. `u6000001.ipynb`)** and submit this file to Wattle. Do not upload any other files to Wattle except this notebook file.

**Note:** you need to fill in the cells below with your code. Failure to provide your code nullifies your Kaggle grade (meaning you get zero for the coding part).

## 1. Loading dataset and preprocessing

In [None]:
%matplotlib inline
%load_ext autoreload
%autoreload 2
import pandas as pd
import numpy as np
np.set_printoptions(suppress=True)
# import matplotlib.pyplot as plt
!ls

In [None]:
import glob
import pandas as pd
from collections import namedtuple
from io import open
# import os, io
import json
## A document record with the following attributes
## id: document id (file name without extension)
## channelTitle, description, videoTitle: text fields of the video
## engagementScore: target value (placeholder 0 for test documents)
Doc = namedtuple('Doc', 'id channelTitle description engagementScore videoTitle')

def read_doc(doc_path, encoding):
    '''
        reads a document from path
        input:
            - doc_path : path of document
            - encoding: encoding
        output: =>
            - doc: instance of Doc namedtuple
    '''
    category, _id = tuple(doc_path.split('/')[-2:])
    _id = _id.split('.')[0]
    
    # the context manager closes the file even if json.load raises
    with open(doc_path, 'r', encoding=encoding) as fp:
        json_data = json.load(fp)
    ct = json_data['channelTitle']
    ds = json_data['description']
    # test documents carry no engagementScore, so 0 is used as a placeholder
    if category == "train":
        es = json_data['engagementScore']
    else:
        es = 0
    vt = json_data['videoTitle']
    return Doc(id=_id, channelTitle=ct, description=ds, engagementScore=es, videoTitle=vt)

def read_dataset(path, encoding = "ISO-8859-1"):
    '''
        reads multiple documents from path
        input:
            - path : path of the folder containing the documents
            - encoding: encoding
        output: =>
            - docs: instances of Doc namedtuple returned as generator
    '''
    for doc_path in glob.glob(path + '/*'):
        yield read_doc(doc_path, encoding = encoding)

def read_as_df(path, encoding = "ISO-8859-1"):
    '''
        reads multiple documents from path
        input:
            - path : path of the folder containing the documents
            - encoding: encoding
        output: =>
            - docs: dataframe equivalent of doc Namedtuple
    '''
    dataset = list(read_dataset(path, encoding))
    return pd.DataFrame(dataset, columns = Doc._fields)


In [None]:
## specify the paths to the train and test folders
path_to_train_dataset = './dataset/train'
path_to_test_dataset = './dataset/test'
train_dataset = read_as_df(path_to_train_dataset)
print("Number of rows and columns of the training dataset: {}".format(train_dataset.shape))
# print("The first five documents:")
# train_dataset.head()
test_dataset = read_as_df(path_to_test_dataset)
print("Number of rows and columns of the testing dataset: {}".format(test_dataset.shape))
# print("The first five documents:")
test_dataset.head()

In [None]:
import re
from nltk.stem.snowball import SnowballStemmer
import nltk
from nltk.corpus import stopwords

## Stemmer
stemmer = SnowballStemmer("english")
stopwords_en = set(stopwords.words('english'))
stopwords_en.update(["n", "www", "http", "https", "nhttps", "com", "r", "v", "ft"])
# the extra entries above are URL fragments and other uninformative tokens
## now build a custom tokenizer based on these
__tokenization_pattern = r'''(?x)          # set flag to allow verbose regexps
        \$?\d+(?:\.\d+)?%?  # currency and percentages, e.g. $12.40, 82%
      | (?:[A-Z]\.)+        # abbreviations, e.g. U.S.A.
      | (?:\\x\w\w)+        # ignore unicode
      | \w+(?:-\w+)*        # words with optional internal hyphens
      | \.\.\.              # ellipsis
      | [][.,;"'?():_`-]    # these are separate tokens; includes ], [
    '''

tokenizer = nltk.tokenize.regexp.RegexpTokenizer(__tokenization_pattern)

def preprocessor(text):
    '''
        turns text into tokens after tokenization, stemming, stop words removal
        input:
            - text: document to process
        output: =>
            - tokens: list of tokens after tokenization, stemming, stop words removal
    '''
    stems = []
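    # lowercase the text directly: on Python 3, text.encode('utf-8') returns bytes,
    # which the str-based tokenization pattern cannot match
    tokens = tokenizer.tokenize(text.lower())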

    for token in tokens:
        if token.isalpha() and token not in stopwords_en:
            stems.append(str(stemmer.stem(token)))
    return stems


In [None]:
train_dataset['cttk'] = train_dataset['channelTitle'].apply(preprocessor)
train_dataset['dstk'] = train_dataset['description'].apply(preprocessor)
train_dataset['vttk'] = train_dataset['videoTitle'].apply(preprocessor)
# train_dataset.head()

test_dataset['cttk'] = test_dataset['channelTitle'].apply(preprocessor)
test_dataset['dstk'] = test_dataset['description'].apply(preprocessor)
test_dataset['vttk'] = test_dataset['videoTitle'].apply(preprocessor)
# test_dataset.head()

In [None]:
from itertools import chain
import nltk
import matplotlib.pyplot as plt
train_dist = nltk.FreqDist(chain(*train_dataset['dstk']))
print(train_dist.most_common(10))

plt.figure(figsize=(15, 6))  # the size you want
train_dist.plot(50,cumulative=False)

test_dist = nltk.FreqDist(chain(*test_dataset['dstk']))
print(test_dist.most_common(10))

plt.figure(figsize=(15, 6))  # the size you want
test_dist.plot(50,cumulative=False)

In [None]:
all_train_tokens = set(chain(*train_dataset['cttk'] + train_dataset['dstk'] + train_dataset['vttk']))
print("number of tokens in the train dataset: {}".format(len(all_train_tokens)))
all_test_tokens = set(chain(*test_dataset['cttk'] + test_dataset['dstk'] + test_dataset['vttk']))
print("number of tokens in the test dataset: {}".format(len(all_test_tokens)))

train_dataset['features'] = (train_dataset['cttk'] + train_dataset['dstk'] + train_dataset['vttk']).apply(set)
test_dataset['features'] = (test_dataset['cttk'] + test_dataset['dstk'] + test_dataset['vttk']).apply(set)
# test_dataset.head()
test_dataset['tokens'] = test_dataset['cttk'] + test_dataset['dstk'] + test_dataset['vttk']
test_dataset.head()

train_dataset['tokens'] = train_dataset['cttk'] + train_dataset['dstk'] + train_dataset['vttk']
train_dataset.head()

## 2. Apply regression models and select models 

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn import datasets, linear_model, decomposition, svm, neighbors
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
import math
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

# train_vec = bow_vectorizer.fit_transform(train_dataset.tokens)
# train_ct_vec = bow_vectorizer.fit_transform(train_dataset.cttk)
# train_ds_vec = bow_vectorizer.fit_transform(train_dataset.dstk)
# train_vt_vec = bow_vectorizer.fit_transform(train_dataset.vttk)

# test_vec = bow_vectorizer.fit_transform(test_dataset.tokens)
# test_ct_vec = bow_vectorizer.fit_transform(test_dataset.cttk)
# test_ds_vec = bow_vectorizer.fit_transform(test_dataset.dstk)
# test_vt_vec = bow_vectorizer.fit_transform(test_dataset.vttk)

In [None]:
bow_vectorizer = CountVectorizer(lowercase = False, 
                                     tokenizer = lambda x: x, # because we already have tokens available
                                     stop_words = None, ## stop words removal already done from NLTK
                                     max_features = 8000, ## keep the 8000 most frequent n-grams
                                     ngram_range = (1, 3), ## unigrams, bigrams and trigrams
                                     binary = False) ## use term counts rather than binary/boolean features

In [None]:
# lnrreg.fit(train_ct_vec, train_ds_vec, train_vt_vec, train_dataset['engagementScore'],[0.5, 0.3, 0.2])
# lnrreg.fit(train_vec, train_dataset['engagementScore'])

train_X, test_X, train_y, test_y = train_test_split(train_dataset['tokens'], train_dataset['engagementScore'], test_size=0.2)
# test_dataset['engagementScore'] = lnrreg.predict(test_vec)
# breg = linear_model.BayesianRidge()
# breg.fit(train_X, train_y)

pipeline = Pipeline([
    ('train_vec',  bow_vectorizer),
    ('tfidf',  TfidfTransformer(sublinear_tf=True)),
#     ('linear-regression',  linear_model.LinearRegression()) ])
#     ('svm', svm.SVR()) ])
#     ('lnrsvm', svm.LinearSVR()) ])
#     ('en', linear_model.ElasticNet()) ])
    ('knn', neighbors.KNeighborsRegressor(n_neighbors=10, weights='distance'))])

pipeline.fit(train_X, train_y)
test_pred = pipeline.predict(test_X)
test_dataset['engagementScore'] = pipeline.predict(test_dataset['tokens'])

In [None]:
np.set_printoptions(suppress=True)
# test_pred = 1/(1+np.exp(-test_pred))
print(test_pred)

rmse = math.sqrt(mean_squared_error(test_y, test_pred))
print(rmse)
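# 0.28341 and 0.22695 are the lower-bound and upper-bound RMSE values plugged into the score formula
score = (0.28341-rmse)/(0.28341-0.22695)*7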
print(score)

In [None]:
# np.set_printoptions(suppress=True)
# test_dataset['engagementScore'] = test_dataset['engagementScore'].apply(lambda x: 1/(1+math.exp(-x)))
# test_dataset.head()

## 3. Generate your result file for Kaggle

In [None]:
def write_output(output_file):
    with open(output_file, 'w') as output:
        output.write(u'id,engagementScore\n')  
#       write one row per test document
        for i in range(0, len(test_dataset['id'])):
            index = test_dataset['id'][i]
            egscr = test_dataset['engagementScore'][i]
            output.write(u'{},{}\n'.format(index,egscr))
#     print("finished writing in output.csv")
           
output_file = 'output.csv'
write_output(output_file)

## 4. Written part

Answer briefly and concisely the following questions, based on your implementation from parts 1, 2 and 3.
Provide answers as a bullet list with 2-3 items.
Check [this](https://sourceforge.net/p/jupiter/wiki/markdown_syntax/#md_ex_lists) if you are not familiar with markdown syntax.
Each question is worth 1 mark (10%) of the total grade for the ML assignment.

### Question 1 (1pt)
 
List the methods that you have tested and why. 
Why did you end up not using them and what was the reason they did not provide high performances?

* Methods tested on the training dataset with a 0.2 hold-out split (results may differ from the Kaggle score):
    * Linear regression: RMSE around 0.26
        * This is the basic regression model. Its performance is not as good as the others, possibly because the model underfits.
    * ElasticNet: RMSE around 0.28
        * This linear regression model is trained with both L1 and L2 priors as regularizers, so it combines the sparsity of Lasso with the regularization properties of Ridge. It might be useful if the channel title, description and video title were trained separately. Since the tokens of the three fields were concatenated, it could not deliver the best performance.
    * SVR: RMSE around 0.24
        * This model was chosen for further experiments. Its slightly weaker performance might result from overfitting.
    * KNeighborsRegressor: RMSE around 0.23
        * This method was chosen for further experiments.
    * BayesianRidge: RMSE around 0.26
        * This method estimates a probabilistic model of the regression problem, with a spherical Gaussian prior over the coefficients. Hence it is not suitable for this question.

### Question 2 (1pt)

How did you select the best performing method? How did you tune model hyper-parameters?

* The best performing method was selected by RMSE: a lower error means the predictions are closer to the real engagementScore. SVR and KNN regression had the lowest error and the most reliable results on both the Kaggle score and the held-out part of the training dataset, so the experiments were mainly run on these two regression methods.

* Hyper-parameters were tuned in ``bow_vectorizer``. 
    * Tokenization, lowercasing and stop-word removal were done in the preprocessing step (Unicode escapes and English stop words were removed), so ``lowercase`` was set to ``False``, ``stop_words`` to ``None``, and ``tokenizer`` to the identity function.
    * For ``max_features``, a high value may lead to overfitting and a low value may result in underfitting. The number of commonly used word stems is around 300, which could explain why the performance with ``max_features = 300`` is pretty good. In the experiments, ``ngram_range`` was fixed at (1, 5) or (1, 3) and ``max_features`` was varied from 1 to 100,000 using bisection; a suitable value for ``max_features`` appears to be less than 500.
    * Increasing the value of ``ngram_range`` provides more combinations of English words. Similarly, holding ``max_features`` fixed and varying ``ngram_range``, the suitable value is (1, 3).

* Taking `SVR` as an example, changing the values of `kernel`, `degree` and `max_iter` from the defaults listed in the sklearn documentation (signature below) helped improve the performance. Similar experiments were also run for `KNeighborsRegressor`. Different settings for `TfidfTransformer` were also tested.


```python 
class sklearn.svm.SVR(kernel='rbf', degree=3, gamma='auto', coef0=0.0, tol=0.001, C=1.0, epsilon=0.1, shrinking=True, cache_size=200, verbose=False, max_iter=-1)
```
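
The manual tuning described above could also be automated with a grid search. The sketch below uses `GridSearchCV`, which is a swapped-in technique and not the procedure actually followed for this submission; the toy texts, scores and parameter grid are made-up assumptions chosen only to keep the example small.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline

# toy stand-ins for the training texts and engagement scores (made-up data)
texts = [
    "official music video", "live concert full show", "funny cat compilation",
    "cooking pasta tutorial", "game walkthrough part one", "music video remix",
    "cat video funny moments", "full concert live stream", "easy cooking recipe",
    "game review and walkthrough", "acoustic music session", "cat and dog playing",
]
scores = np.array([0.8, 0.3, 0.7, 0.5, 0.4, 0.75, 0.65, 0.35, 0.55, 0.45, 0.7, 0.6])

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("knn", KNeighborsRegressor()),
])

# parameter names follow the "<step name>__<parameter>" convention of Pipeline
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "knn__n_neighbors": [1, 2, 3],
    "knn__weights": ["uniform", "distance"],
}

# every combination is scored by 3-fold cross-validation on negated MSE
search = GridSearchCV(pipeline, param_grid,
                      scoring="neg_mean_squared_error", cv=3)
search.fit(texts, scores)

print("best parameters:", search.best_params_)
print("best cross-validated RMSE:", (-search.best_score_) ** 0.5)
```

A grid search evaluates every listed combination, so the grid should stay small when the model is slow to train; the bisection over `max_features` described above explores a single parameter with far fewer fits.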

### Question 3 (1pt)

The target variable was in the range [0, 1], however most regressors output values in $(-\infty, +\infty)$.
How did you solve this issue?

* When predicting with the `LinearRegression()` method, the regressor's raw output ranges over $(-\infty, +\infty)$. This could be solved by using the standard logistic function.
$$ z = \log \left(\frac{p}{1-p}\right) $$
$$ \frac{p}{1-p} = e^z$$
$$ p = \frac{1}{1 + e^{-z}}$$
* In Python, this could be calculated as follows:
```python
test_dataset['engagementScore'] = test_dataset['engagementScore'].apply(lambda x: 1/(1+math.exp(-x)))
```
* Overall, the logistic function maps a real-valued output $z$ to a value in the range $(0, 1)$. Simply using $z = p$ or $z = \log p$ is invalid, because the inverse mappings $p = z$ and $p = e^z \in (0, \infty)$ are not confined to $(0, 1)$.
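
One way to apply these equations consistently is to transform the training targets with the logit before fitting, and to map predictions back with the logistic function. The sketch below is an assumption about how this could be done, not the code used for the submission; `logit`, `sigmoid` and the clipping constant `eps` are illustrative names.

```python
import math

def logit(p, eps=1e-6):
    """Map a score in [0, 1] to the real line; eps keeps 0 and 1 finite."""
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))

def sigmoid(z):
    """Numerically stable logistic function, the inverse of logit."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

# train on z = logit(score), then convert a predicted z back to a score
z = logit(0.792)
recovered = sigmoid(z)
print(z, recovered)

# any real-valued prediction in z-space maps back into the unit interval
low, high = sigmoid(-3.5), sigmoid(4.2)
print(low, high)
```

A simpler alternative is to clip raw predictions to $[0, 1]$. Note that applying the logistic function to a regressor trained directly on the raw scores would compress predictions towards $0.5$, so the transform is only meaningful when the model is fitted in $z$-space.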

# Upload output file to Kaggle competition site

Once you generate the `output.csv` file, you can upload your result to the Kaggle competition site. To upload and evaluate your result:

1. Go to Kaggle competition site: [Click here](https://www.kaggle.com/t/f31b296a987c46dca752753eb6fde2b3).
1. Sign up for Kaggle if you do not have an account. Go back to the [original kaggle page](https://www.kaggle.com/t/f31b296a987c46dca752753eb6fde2b3).
1. Before submitting the result, first go to `team` menu and change **your team name as your university id**.
![ChangeUID](images/changeuid.png)
1. Time to submit your own result. Click `submit predictions` in the menu; you may need to agree to the competition rules before submitting your result.
1. Upload your output csv file. You can write an additional description of your submission in the description box.
    Note that you are only allowed to submit **3 results per day**. Do not upload arbitrary results; think about which algorithm or parser will perform best before submitting.
1. If your output format is correct, the system will generate your score automatically.
1. Go to `Leaderboard` menu. The leaderboard will show the current score of the other students.
![Leaderboard](images/leaderboard.png)


Note that you can check all of your submissions from the `my submission` menu. Please select your best-performing submission before the assignment is due. The selected submission will be used to measure the performance on the *hidden* test case (see below for details).
![Check](images/check.png)