# Machine Learning Engineer Nanodegree
## Capstone Proposal
Henry Maguire 
1st March 2017

## Proposal
I would like to build a machine learning model that generates coherent news headlines from the main body of a news article.

### Domain Background

I plan to use recurrent neural networks (RNNs) to handle the sequential data, with long short-term memory (LSTM) units to capture long-term dependencies.

Suppose we want to use a machine learning algorithm to generate news headlines. One simple approach would be:
- Collect a large set of news headlines.
- Train a model on this dataset so that it learns correlations between words.
- Use the model to estimate conditional probability distributions over word sequences.
- Starting from some word (perhaps a key, non-trivial word with high probability), sample likely sequences of words.
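
The simple approach above can be sketched with a bigram model, which is only one possible reading of those steps. In this sketch the function names, the `<s>`/`</s>` boundary markers and the toy headlines are all illustrative assumptions, not part of the proposed system.

```python
import random
from collections import Counter, defaultdict

START, END = "<s>", "</s>"  # hypothetical sentence-boundary markers


def train_bigram_model(headlines):
    """Count word-to-next-word transitions over a list of headline strings."""
    counts = defaultdict(Counter)
    for headline in headlines:
        tokens = [START] + headline.lower().split() + [END]
        for current, following in zip(tokens, tokens[1:]):
            counts[current][following] += 1
    return counts


def next_word_distribution(counts, word):
    """Return the estimated P(next word | word) as a dict."""
    total = sum(counts[word].values())
    return {w: c / total for w, c in counts[word].items()}


def generate_headline(counts, rng, max_words=12):
    """Sample a headline word by word from the conditional distributions."""
    word, output = START, []
    while len(output) < max_words:
        dist = next_word_distribution(counts, word)
        if not dist:  # nothing was ever observed after this word
            break
        words = list(dist)
        word = rng.choices(words, weights=[dist[w] for w in words])[0]
        if word == END:
            break
        output.append(word)
    return " ".join(output)


# Toy data, made up purely for illustration.
toy_headlines = [
    "senate passes budget bill",
    "senate rejects budget bill",
    "house passes tax bill",
]
model = train_bigram_model(toy_headlines)
print(next_word_distribution(model, "senate"))
print(generate_headline(model, random.Random(0)))
```

A model like this only ever sees headlines, so it cannot condition on the content of a particular article, which is the limitation the supervised approach is meant to address.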

However, what I want is an algorithm that generates headlines in a supervised manner: given X articles and their X headlines, it should learn to write a headline for a new article it is given. This is a difficult problem in machine learning, since it involves both sequential data with variable-sized input and output vectors (a good job for RNNs) as well as

- https://arxiv.org/pdf/1512.01712.pdf
- https://www.aclweb.org/anthology/D/D15/D15-1044.pdf
- http://www.umiacs.umd.edu/~dmzajic/papers/DUC2002.pdf

## Datasets and Inputs

News outlets do not make it easy to access articles and headlines in a simple format, so I need to gather this data myself. I have chosen to build a scraper that collects the data from one specific "news" source, Breitbart - a controversial, far-right propaganda website. Inspecting the HTML of the Breitbart website shows that a crawler can be built fairly easily, as detailed below. From what I understand from pages such as [this](https://www.bna.com/legal-issues-raised-by-the-use-of-web-crawling-and-scraping-tools-for-analytics-purposes), it is legal to scrape data from publicly accessible websites provided it is done at a non-disruptive rate, so I will take extra care to keep the request rate low.
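
One way to keep the request rate non-disruptive is to enforce a minimum delay between page fetches. The sketch below is a minimal illustration of that idea; the `Throttle` class, its parameters and the delay value are assumptions of mine rather than a settled design.

```python
import time


class Throttle:
    """Enforce a minimum delay between consecutive requests (hypothetical helper)."""

    def __init__(self, min_delay_seconds, clock=time.monotonic, sleep=time.sleep):
        self.min_delay = min_delay_seconds
        self._clock = clock    # injectable so the class can be tested without waiting
        self._sleep = sleep
        self._last_request = None

    def wait(self):
        """Block until at least min_delay has passed since the previous call."""
        now = self._clock()
        if self._last_request is not None:
            remaining = self.min_delay - (now - self._last_request)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last_request = now


# Demonstration with a tiny delay; a real crawl would use a delay of several seconds.
throttle = Throttle(min_delay_seconds=0.01)
for page_number in range(3):
    throttle.wait()  # call this immediately before each page request
    print("would fetch page", page_number)
```

Calling `wait()` before every request means the crawler never exceeds one request per `min_delay_seconds`, however quickly the pages are parsed.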

### Scraping the news data
The crawler will work as follows, starting from the [Big Government](http://www.breitbart.com/big-government/) feed on the Breitbart homepage:
1. Find all links to full-length articles by extracting the URLs that follow `<h2 class="title"><a href="` and precede `" title=` in the [page source](view-source:http://www.breitbart.com/big-government/) (on Chrome).
2. For each article, retrieve the headline using `<title>` and `</title>` as delimiters, and the unique post ID using `postid-`.
3. Retrieve the main body of the article using `</div></form></div></div><h2>` and `<h3>Read More Stories About:</h3>` as delimiters.
4. Back on the main page, cycle through all of the articles until reaching `<div class="divider"></div>`, then find the link to the next page of articles, which follows `<div class="alignleft"><a href="` and precedes `" >older posts</a></div>`.
5. Repeat steps 1-4 until enough headlines and raw article HTML have been stored.
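
The delimiter-based extraction in the steps above could look roughly like the following sketch. It works on HTML strings only and does no network access. The helper names and the sample HTML fragments are made up for illustration, and the assumption that `postid-` is followed by a run of digits is mine.

```python
import re

LINK_START = '<h2 class="title"><a href="'
LINK_END = '" title='
DIVIDER = '<div class="divider"></div>'
NEXT_START = '<div class="alignleft"><a href="'
NEXT_END = '" >older posts</a></div>'
BODY_START = "</div></form></div></div><h2>"
BODY_END = "<h3>Read More Stories About:</h3>"


def between(text, start, end, pos=0):
    """Return (text between start and end, index just past end), or (None, -1)."""
    i = text.find(start, pos)
    if i == -1:
        return None, -1
    i += len(start)
    j = text.find(end, i)
    if j == -1:
        return None, -1
    return text[i:j], j + len(end)


def article_links(listing_html):
    """Collect the article URLs that appear before the divider (steps 1 and 4)."""
    cut = listing_html.find(DIVIDER)
    section = listing_html if cut == -1 else listing_html[:cut]
    links, pos = [], 0
    while True:
        link, pos = between(section, LINK_START, LINK_END, pos)
        if link is None:
            return links
        links.append(link)


def next_page_link(listing_html):
    """Find the 'older posts' URL that follows the divider (step 4)."""
    cut = listing_html.find(DIVIDER)
    link, _ = between(listing_html, NEXT_START, NEXT_END, max(cut, 0))
    return link


def parse_article(article_html):
    """Extract the headline, post ID and raw body HTML (steps 2 and 3)."""
    headline, _ = between(article_html, "<title>", "</title>")
    body, _ = between(article_html, BODY_START, BODY_END)
    match = re.search(r"postid-(\d+)", article_html)  # assumes a numeric ID
    return headline, (match.group(1) if match else None), body


# Made-up fragments that mimic the delimiters described above.
sample_listing = (
    '<h2 class="title"><a href="http://example.com/a1" title="A1">A1</a></h2>'
    '<h2 class="title"><a href="http://example.com/a2" title="A2">A2</a></h2>'
    '<div class="divider"></div>'
    '<div class="alignleft"><a href="http://example.com/page/2" >older posts</a></div>'
)
sample_article = (
    "<html><head><title>Some Headline</title></head>"
    '<body class="post postid-12345"><div><div><form><div>'
    "</div></form></div></div><h2>Sub</h2><p>Body text.</p>"
    "<h3>Read More Stories About:</h3></body></html>"
)
print(article_links(sample_listing))
print(next_page_link(sample_listing))
print(parse_article(sample_article))
```

Plain string search keeps the sketch close to the delimiters listed in the steps. A parser-based approach would be less sensitive to small changes in the page markup.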

*Note: I'm sure there are more efficient delimiters than the ones I have chosen above but I think these will work and that the `BeautifulSoup` Python package will be able to clean the data up sufficiently.*

The raw HTML of the article body contains many pieces of unnecessary media content, hyperlinks, etc., which are not needed. I am using [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/), a web-scraping package for Python, to strip the HTML down, along with some custom code to deal with non-ASCII characters.
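
As a rough illustration of this clean-up step, the sketch below strips markup and normalises non-ASCII characters using only the Python standard library. It is a dependency-free stand-in built on `html.parser`, not the BeautifulSoup-based code itself. The set of skipped tags and the punctuation replacement table are assumptions chosen for the example.

```python
import unicodedata
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of non-text elements."""

    SKIPPED = {"script", "style", "iframe", "figure"}            # assumed list
    BLOCK = {"p", "div", "br", "li", "h1", "h2", "h3", "blockquote"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1
        elif tag in self.BLOCK:
            self._chunks.append(" ")  # keep words in adjacent blocks apart

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        return "".join(self._chunks)


def to_ascii(text):
    """Map common typographic punctuation to ASCII, then drop what remains."""
    replacements = {
        "\u2018": "'", "\u2019": "'", "\u201c": '"', "\u201d": '"',
        "\u2013": "-", "\u2014": "-", "\u00a0": " ",
    }
    for source, target in replacements.items():
        text = text.replace(source, target)
    decomposed = unicodedata.normalize("NFKD", text)  # splits accents from letters
    return decomposed.encode("ascii", "ignore").decode("ascii")


def clean_article(raw_html):
    """Return the article text as a single whitespace-normalised ASCII string."""
    parser = TextExtractor()
    parser.feed(raw_html)
    parser.close()
    return " ".join(to_ascii(parser.text()).split())


sample = (
    "<p>The caf\u00e9 said \u201chello\u201d.</p><script>var x = 1;</script>"
    '<p>Read <a href="http://example.com">more</a> here.</p>'
)
print(clean_article(sample))
```

Link text is kept while the link targets are dropped, which matches the goal of removing hyperlinks without losing the words of the article.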