# Machine Learning Engineer Nanodegree
## Capstone Proposal
Henry Maguire 
1st March 2017

## Proposal
I would like to create a machine learning algorithm which is capable of generating coherent news headlines from the main body of a news article.

## Domain Background

Deep neural networks are able to approximate arbitrary functions, which makes them useful in supervised learning applications. The backpropagation algorithm allows correlations to be learned from data by updating the weights of a neural network iteratively, based on how far from the ideal scenario an approximation/prediction was. With vanilla neural networks, a fixed number of inputs and outputs would be determined based on domain knowledge of a problem (given this person's traits, which of these 12 items are they most likely to buy?) and a set of training and validation data would need to be acquired (the scenarios where we know which traits have led to which actual purchases). Depending on the complexity of a task, many parameters would need to be learned from the training set.

Deep neural networks can also take advantage of the symmetry of a problem. In image recognition, every time we see a cat we shouldn't have to learn all of the parameters which give a 'cat' signal, but a cat could appear anywhere in our field of view, so we should be able to share the parameters across the field of view. This is the realm of convolutional neural networks.



#### Word Embeddings

#### Sequence-to-sequence models
I am going to use Recurrent Neural Networks to handle the sequences of data, where long-term dependencies are to be dealt with by using LSTM units. 

Let's say we want to use a machine learning algorithm to generate news headlines. One idea could be:
- Find a whole bunch of news headlines.
- Next, train your model over the dataset so it learns correlations between words.
- Then use your model to find conditional probability distributions over various word permutations.
- Starting from some word (perhaps a key, non-trivial word with high probability), randomly generate the most likely sequences of words.
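
The unsupervised idea in the list above can be sketched with a simple bigram model. This is only an illustration under stated assumptions: the toy headlines are made up, the function names `train_bigram_model` and `generate_headline` are hypothetical, and a bigram count table is a stand-in for whatever model would actually learn the word correlations.

```python
import random
from collections import Counter, defaultdict

def train_bigram_model(headlines):
    """Count how often each word follows another across all headlines."""
    transitions = defaultdict(Counter)
    for headline in headlines:
        # <s> and </s> mark the start and end of a headline.
        tokens = ["<s>"] + headline.lower().split() + ["</s>"]
        for prev, nxt in zip(tokens, tokens[1:]):
            transitions[prev][nxt] += 1
    return transitions

def generate_headline(transitions, seed_word="<s>", max_words=12, rng=None):
    """Sample words from P(next | previous), starting from seed_word."""
    rng = rng or random.Random()
    word = seed_word
    output = [] if seed_word == "<s>" else [seed_word]
    for _ in range(max_words):  # at most max_words words after the seed
        followers = transitions.get(word)
        if not followers:
            break
        words, counts = zip(*followers.items())
        # Draw the next word in proportion to how often it followed `word`.
        word = rng.choices(words, weights=counts, k=1)[0]
        if word == "</s>":
            break
        output.append(word)
    return " ".join(output)

# Made-up headlines, purely for illustration.
toy_headlines = [
    "senate passes new budget bill",
    "senate rejects new tax bill",
    "governor signs budget bill",
]
model = train_bigram_model(toy_headlines)
sample = generate_headline(model, seed_word="senate", rng=random.Random(0))
```

A model like this only looks at headlines, so it has no way of conditioning on the article body, which is the limitation the supervised approach is meant to address.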

However, what we want to do is train an algorithm that can generate headlines in a supervised manner: given that these X articles have these X headlines, how should it write a headline for a new article? This is a difficult problem in machine learning, since the model needs to learn how the sequential word data in the main body of the text relates to the sequential headline text, with variable-sized input and output sequences (a good job for RNNs). This problem is known as [sequence-to-sequence modelling](https://arxiv.org/pdf/1409.3215.pdf).

In [this research paper](https://arxiv.org/pdf/1512.01712.pdf), Konstantin Lopyrev uses LSTMs and an encoder-decoder architecture to do the job. This is likely to be the best option, although it is highly computationally expensive. Also, from a novice's point of view, the methods in the paper are hard to follow. However, in [this paper/tutorial](https://arxiv.org/pdf/1703.01619.pdf), Graham Neubig outlines several different approaches to machine translation and sequence-to-sequence models more generally. The technical details of appropriate methods are discussed in detail. The bleeding edge of sequence-to-sequence modelling uses a combination of [encoder-decoder models](https://arxiv.org/pdf/1506.06726v1.pdf) and [attention-based approaches](https://arxiv.org/pdf/1508.04025.pdf).

More generally, there are ways of representing the meaning of sentences (perhaps using [skip-thought vectors](https://arxiv.org/abs/1506.06726)) and [paragraphs](https://arxiv.org/pdf/1405.4053.pdf) in moderately-sized bodies of text in some machine-readable way.

Other methods for headline generation which do not use encoder-decoder models are:
- [This paper](https://www.aclweb.org/anthology/D/D15/D15-1044.pdf) from a team at Facebook AI, which uses a similar NN architecture but with a slightly more phenomenological framework.
- [This paper](http://www.umiacs.umd.edu/~dmzajic/papers/DUC2002.pdf), which uses Hidden Markov Models to do the job. This would be simpler and less computationally expensive, but the approach does not seem as widely applicable as neural networks.


## Datasets and Input

News outlets do not make it easy to access articles and headlines in a simple format, so I have to come up with a novel way of getting this data. I have chosen to build a scraper to gather the data from one specific article source called "Breitbart" - a controversial, far-right propaganda website. Inspecting the HTML of the Breitbart news website, it is clear that I can easily build a crawler, as detailed below. From what I understand from pages such as [this](https://www.bna.com/legal-issues-raised-by-the-use-of-web-crawling-and-scraping-tools-for-analytics-purposes), it is legal to scrape data from publicly accessible websites provided you do it at a non-disruptive rate, so I will be extra careful about the amount of bandwidth I use.

### Scraping the news data
The crawler will work as follows, starting from the [Big Government](http://www.breitbart.com/big-government/) feed on the Breitbart homepage:

1. Search for all links to full-length articles by taking the text following `<h2 class="title"><a href="` and preceding `" title=` in the [page source](view-source:http://www.breitbart.com/big-government/) (on Chrome).
2. Retrieve the article headline using `<title>` and `</title>` as delimiters and the unique post ID by using `postid-`.
3. Retrieve the main body of the article by using `</div></form></div></div><h2>` and `<h3>Read More Stories About:</h3>` as delimiters.
4. Back on the main page, cycle through all of the articles until you get to `<div class="divider"></div>`, at which point search for the link to the next page of articles, which follows `<div class="alignleft"><a href="` and precedes `" >older posts</a></div>`.
5. Repeat steps 1-4 until you have stored as many headlines and as much raw article HTML as required.

*Note: I'm sure there are more efficient delimiters than the ones I have chosen above but I think these will work and that the `BeautifulSoup` Python package will be able to clean the data up sufficiently.*
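
The delimiter-based steps above could be sketched roughly as follows. This is a minimal illustration rather than the final crawler: the helper names are hypothetical, the sample HTML strings are made up (the live pages may be laid out differently), the post ID is assumed to be a run of digits after `postid-`, and downloading pages at a polite rate is left out entirely.

```python
import re

def extract_between(text, start, end, begin_at=0):
    """Return (text between `start` and `end`, index just past `end`), or (None, -1)."""
    i = text.find(start, begin_at)
    if i == -1:
        return None, -1
    i += len(start)
    j = text.find(end, i)
    if j == -1:
        return None, -1
    return text[i:j], j + len(end)

def find_article_links(page_html):
    """Step 1: collect every link sitting between the title delimiters."""
    links, pos = [], 0
    while True:
        link, pos = extract_between(
            page_html, '<h2 class="title"><a href="', '" title=', pos)
        if link is None:
            return links
        links.append(link)

def find_next_page(page_html):
    """Step 4: link to the next (older) page of articles, or None."""
    link, _ = extract_between(
        page_html, '<div class="alignleft"><a href="', '" >older posts</a></div>')
    return link

def parse_article(article_html):
    """Steps 2-3: headline, post ID and raw body HTML of one article."""
    headline, _ = extract_between(article_html, "<title>", "</title>")
    # Assumption: the post ID is the run of digits right after "postid-".
    match = re.search(r"postid-(\d+)", article_html)
    body, _ = extract_between(
        article_html,
        "</div></form></div></div><h2>",
        "<h3>Read More Stories About:</h3>")
    return {
        "post_id": match.group(1) if match else None,
        "headline": headline,
        "body_html": body,
    }

# Made-up HTML that mimics the delimiters listed in the steps.
sample_feed = (
    '<h2 class="title"><a href="http://example.com/a1" title="First">First</a></h2>'
    '<h2 class="title"><a href="http://example.com/a2" title="Second">Second</a></h2>'
    '<div class="divider"></div>'
    '<div class="alignleft"><a href="http://example.com/page/2" >older posts</a></div>'
)
sample_article = (
    '<html><head><title>Example Headline</title></head>'
    '<body class="post postid-12345 single">'
    '</div></form></div></div><h2>Lead sentence.</h2><p>Body text.</p>'
    '<h3>Read More Stories About:</h3></body></html>'
)

links = find_article_links(sample_feed)
next_page = find_next_page(sample_feed)
article = parse_article(sample_article)
```

Keeping the extraction functions separate from any download code means they can be checked against saved copies of pages before the crawler touches the live site.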

Within the raw HTML of the main body of the article there are many pieces of unnecessary media content, hyperlinks etc. which we will not need. I am using [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/), a web-scraping package for Python, to strip the HTML down, as well as some custom code to deal with non-ASCII characters.

### Preprocessing the data
Data is stored in a dictionary, with the unique post ID as the key and [headline, body, link] as the value. The following preprocessing steps were applied:
- Cleaning up HTML and Unicode data using Beautiful Soup and some custom code.
- Splitting the text up into 'tokens', which are words (including suffixes such as "'s" and "n't") and punctuation.
- Replacing full stops, exclamation marks and question marks with `<EOS>` tokens.
- Removing the word "Breitbart", because it basically never performs any function.
- Removing single-character words and all remaining punctuation apart from:
    - the word 'I' when included at the start of a quote, because this is common in headlines
    - colons, because they give a clear indication of quotes throughout the articles
- Storing only a limited number of words in the vocabulary. Words outside it can be swapped out using a word2vec lookup.
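
A rough sketch of the token-level steps above is given here. It is an approximation under stated assumptions, not the project's actual code: the regular expression only handles the "'s" and "n't" suffixes, a quote is assumed to open with a straight or curly double quotation mark, and `build_vocab` assumes that "a limited number of words" means the most frequent ones. The names `preprocess` and `build_vocab` are hypothetical.

```python
import re
from collections import Counter

# Words, the suffixes "n't" and "'s", and single punctuation characters.
TOKEN_PATTERN = re.compile(r"\w+(?=n't)|n't|'s\b|\w+|[^\w\s]")
SENTENCE_END = {".", "!", "?"}
QUOTE_MARKS = {'"', "\u201c"}  # assumption: quotes open with " or a curly quote

def preprocess(text):
    """Tokenise text and apply the filtering rules from the list above."""
    raw = TOKEN_PATTERN.findall(text)
    tokens = []
    for i, tok in enumerate(raw):
        prev = raw[i - 1] if i > 0 else ""
        if tok in SENTENCE_END:
            tokens.append("<EOS>")
        elif tok.lower() == "breitbart":
            continue
        elif tok == ":":
            tokens.append(tok)
        elif len(tok) == 1:
            # Single-character words and leftover punctuation are dropped,
            # except 'I' directly after an opening quotation mark.
            if tok == "I" and prev in QUOTE_MARKS:
                tokens.append(tok)
        else:
            tokens.append(tok)
    return tokens

def build_vocab(token_lists, max_size):
    """Keep the max_size most frequent tokens (an assumed selection rule)."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    return {tok for tok, _ in counts.most_common(max_size)}

tokens = preprocess('Trump: "I won\'t back down!" Breitbart reports a win.')
vocab = build_vocab([tokens], max_size=3)
```

For the example sentence, `tokens` comes out as `['Trump', ':', 'I', 'wo', "n't", 'back', 'down', '<EOS>', 'reports', 'win', '<EOS>']`.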

This preprocessing could make the algorithm less robust if used on datasets outside of Breitbart news but for these purposes it is fine.

*Note: If I wanted the code to be universally applicable to any news source/paragraph I would probably rewrite it to use a char-RNN instead of word-RNN, at the expense of a need for even bigger datasets and longer training times.*


## Benchmark Model

Two ideas for benchmarking:
- Statistical analysis of all articles perhaps with some Bayesian alteration of probability distribution given the article in question.
- Simple Word-RNN model. Generate a sentence seeded by the most non-trivial keywords found in an article.

## Evaluation Metrics


The aim of this project is to produce a generative model, whose output accuracy could be difficult to quantify.

However, in this model we essentially have a supervised learning scenario in the second part of the neural network model. The target and predicted outputs are headlines with some vectorial representation, so it may be possible to quantify their closeness by looking at the relationship between the vectors. However, since the word representations are going to be learned from the data, the predicted and target variables may already be close in the shared vector space. Another drawback is that this would not account for, and hence would miss, more conceptual headlines containing puns etc. The output of the algorithm may therefore be a great headline that just happens not to be based on the same play on words as the actual headline **give example**.
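
One possible way to make "closeness between the vectors" concrete is the cosine similarity between averaged word vectors of the predicted and target headlines. The sketch below assumes that choice; the proposal does not fix a specific measure, the toy embeddings are invented for illustration, and the function names are hypothetical.

```python
import math

def mean_vector(tokens, embeddings):
    """Average the embeddings of tokens that have one; None if none do."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(v[k] for v in vectors) / len(vectors) for k in range(dim)]

def cosine_similarity(u, v):
    """cos(u, v) = (u . v) / (|u| |v|), or 0.0 if either vector is zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def headline_similarity(predicted, target, embeddings):
    """Score in [-1, 1]; higher means the headlines point the same way."""
    p = mean_vector(predicted, embeddings)
    t = mean_vector(target, embeddings)
    if p is None or t is None:
        return 0.0
    return cosine_similarity(p, t)

# Invented 3-dimensional embeddings, purely for illustration.
toy_embeddings = {
    "senate": [1.0, 0.0, 0.0],
    "congress": [0.8, 0.2, 0.0],
    "passes": [0.0, 1.0, 0.0],
    "approves": [0.1, 0.9, 0.0],
    "bill": [0.0, 0.0, 1.0],
    "storm": [-1.0, 0.0, 0.2],
    "hits": [0.0, -1.0, 0.1],
    "coast": [-0.5, -0.5, 0.0],
}
target = ["senate", "passes", "bill"]
close = headline_similarity(["congress", "approves", "bill"], target, toy_embeddings)
far = headline_similarity(["storm", "hits", "coast"], target, toy_embeddings)
```

Averaging discards word order, so a score like this would reward a headline with the right words in a nonsensical order, and it shares the drawbacks described above for puns and other conceptual headlines.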

Another way of evaluating the model would be to set a threshold statement along the lines of "I want the algorithm to produce at least 7 feasible headlines for 10 different articles from the Breitbart news website."

## Project Design
### Initial analysis

### Encoder-Decoder Architecture

### Attention mechanisms
Assume we only use the first paragraph of an article. It is likely that the beginning of the first paragraph holds much of the content/meaning that is captured in the headline. The encoder has to determine the meaning of an entire paragraph and store this in a fixed-length vector, while somehow weighting the early timesteps very highly. From the output of the final hidden layer, the decoder then estimates a headline. In theory, LSTM units should make these long-term dependencies possible to capture, but in reality the model will need to have many layers and a **huge** amount of training data to do so. The solution is something called [attention](https://arxiv.org/abs/1409.0473), which allows the hidden layers of the decoder to have access to the input sentence as well as the final hidden layer of the encoder. There are many different ways of modelling attention and it is a very active area of cutting-edge research in sequence-to-sequence modelling. For this reason we will 
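
As a rough illustration of the idea, a single attention step can be written in a few lines. Note the assumptions: this sketch uses plain dot-product scoring for simplicity, whereas the linked paper uses a small learned network to score each encoder state, and the vectors below are toy values rather than real hidden states. The function names are hypothetical.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    highest = max(scores)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(decoder_state, encoder_states):
    """One attention step: return (weights, context vector)."""
    # Score each encoder timestep by its dot product with the decoder state.
    scores = [
        sum(d * e for d, e in zip(decoder_state, state))
        for state in encoder_states
    ]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    # The context vector is the weighted average of the encoder states.
    context = [
        sum(w * state[k] for w, state in zip(weights, encoder_states))
        for k in range(dim)
    ]
    return weights, context

# Toy hidden states for a three-token input and one decoder step.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
decoder_state = [2.0, 0.0]
weights, context = attend(decoder_state, encoder_states)
```

Because the decoder receives a fresh context vector at every step, it no longer has to rely on one fixed-length summary of the whole paragraph, and timesteps that matter (such as the opening words) can receive large weights directly.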

### Building the model

### Training the model

### Testing the model

### Extensions: 
#### Vocabulary expansion
Training an RNN language model can be quite resource intensive, which means that I will have to limit myself to a relatively small vocabulary size. Should this simple model be too restricted by the small vocabulary, I can also use vocabulary expansion following the Skip-Thought vectors [paper](https://arxiv.org/pdf/1506.06726v1.pdf). This means that we can take the vector representations of words from a less intensive model with a larger vocabulary, such as Word2Vec, and map them into the word vector space learned by the RNN. If a word appears during training that is not in the (relatively small) RNN vocabulary, a close representation can be found by looking for similar vectors in the much larger Word2Vec vocabulary, based on their overlap in the shared vector space. If we use publicly available pre-trained Word2Vec vectors such as [CBOW](https://code.google.com/archive/p/word2vec/), then we can encode 930,911 words rather than around 10,000 (depending on RNN training resources), making the algorithm far more robust.
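
A much simpler stand-in for this idea is to replace an out-of-vocabulary word with its nearest in-vocabulary neighbour in Word2Vec space. The sketch below shows only that lookup; it is not the learned linear mapping between vector spaces described in the Skip-Thought paper, and the vectors, vocabulary and function names are invented for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def nearest_known_word(word, word2vec, rnn_vocab, unknown="<UNK>"):
    """Map `word` to the RNN-vocabulary word closest to it in word2vec space."""
    if word in rnn_vocab:
        return word
    if word not in word2vec:
        return unknown
    # Only words present in both vocabularies can act as replacements.
    candidates = [w for w in rnn_vocab if w in word2vec]
    if not candidates:
        return unknown
    return max(candidates, key=lambda w: cosine(word2vec[word], word2vec[w]))

# Invented 2-dimensional vectors standing in for pre-trained embeddings.
toy_word2vec = {
    "president": [0.9, 0.1],
    "premier": [0.85, 0.2],
    "storm": [0.1, 0.9],
    "hurricane": [0.15, 0.95],
}
toy_rnn_vocab = {"president", "storm"}

replacement_a = nearest_known_word("premier", toy_word2vec, toy_rnn_vocab)
replacement_b = nearest_known_word("hurricane", toy_word2vec, toy_rnn_vocab)
```

A lookup like this loses the distinction between the original word and its replacement, which is one reason the mapping approach from the paper may be preferable if resources allow.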

#### Generating headlines for non-Breitbart articles
One possible extension is to make it read non-Breitbart articles and see if it does either of the following:
- generates feasible Breitbart headlines for the subject matter/content involved in the story, perhaps vilifying/commending characters in the generated headlines who are actually referred to in the opposite terms in the non-Breitbart story.
- generates feasible generic news headlines with similar sentiment to the original non-Breitbart article but in the writing style of Breitbart headlines.
