# Capstone Project
*14-12-2017*

At General Assembly, the last month of my course will be spent on my 'Capstone' project. It should bring together everything I have learned and, I hope, demonstrate my abilities.

The idea I have decided to pursue should encompass many of my favourite data tools and principles. I wanted a 'Big Data' angle as well as a Natural Language Processing element.

I settled on a project in three parts, with a fourth if time permits.

1. A web scraper that would grab the date, location, headline and content of news articles
2. A script to run NLP analysis and cluster the articles
3. A graph data structure of the information and clusters
4. (Time permitting) A model that uses these clusters to predict future events
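
As a rough illustration of how steps 2 and 3 could fit together, the sketch below links articles whose word counts are similar and treats each connected group as a cluster. It is only one possible approach, not necessarily the one the project will end up using. The function names, the `threshold` value and the sample texts are all made up for the example.

```python
import math
import re
from collections import Counter


def term_counts(text):
    """Lower-case the text and count its alphabetic words."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[word] * b[word] for word in a if word in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def similarity_graph(texts, threshold=0.3):
    """Adjacency dict: article index -> set of indices of similar articles."""
    vectors = [term_counts(t) for t in texts]
    graph = {i: set() for i in range(len(texts))}
    for i in range(len(texts)):
        for j in range(i + 1, len(texts)):
            if cosine(vectors[i], vectors[j]) >= threshold:
                graph[i].add(j)
                graph[j].add(i)
    return graph


def clusters(graph):
    """Connected components of the graph, each returned as a sorted list."""
    seen, found = set(), []
    for start in graph:
        if start in seen:
            continue
        stack, component = [start], []
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            component.append(node)
            stack.extend(graph[node] - seen)
        found.append(sorted(component))
    return found


# Made-up example texts, not scraped data
sample = [
    "Brexit deal agreed in Brussels",
    "Brussels agrees Brexit deal",
    "Airline failure strands passengers",
]
print(clusters(similarity_graph(sample)))  # [[0, 1], [2]]
```

Comparing every pair of articles is fine for a handful of texts but grows quadratically, which is one reason a larger dataset would need something more scalable.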

The quantity of data this produces is potentially enormous. I hope to scrape most BBC and Reuters articles to build as large a dataset as possible. A dataset that large is unlikely to be processable on my laptop, so I will have to combine Spark and AWS (I need to research this further; perhaps Hadoop will be useful?).

I have already created a basic test dataset of 10 articles, which will be useful for running proof-of-concept tests before Christmas and, I hope, for validating the principles I am working on. The current code is laid out below with some basic comments added.

In [12]:
# Basic imports that have grown as the project went on
# Useful for keeping track of dependencies
import requests
import bs4
import feedparser
import pandas as pd
from bs4 import BeautifulSoup
from scrapy.selector import Selector
from pprint import pprint

In [2]:
# Test URL initially - if we can scrape one webpage, we can follow the same formula repeatedly
URL = 'http://www.bbc.co.uk/news/uk-politics-42277040'

In [3]:
# Basic web scraping flow
r = requests.get(URL)
soup = BeautifulSoup(r.text, 'html.parser')
results = soup.find_all('div', class_='story-body')
test_result = results[0]

In [4]:
# Extracting the Headline
headline = test_result.find('h1', class_='story-body__h1').text

In [5]:
# Extracting the date posted
date_posted = test_result.find('div', class_= 'date date--v2').text
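
The scraped date is still a plain string (for this article, '8 December 2017'). A small sketch of turning it into a real date object, which would make sorting and filtering easier later on, is shown below. The helper name `parse_bbc_date` is hypothetical, and the format string assumes every article uses the same day, month name, year layout and that month names are parsed in an English locale.

```python
from datetime import datetime


def parse_bbc_date(raw):
    """Turn a string such as '8 December 2017' into a date object."""
    # Assumed format: day number, full English month name, four-digit year
    return datetime.strptime(raw.strip(), "%d %B %Y").date()


print(parse_bbc_date("8 December 2017").isoformat())  # 2017-12-08
```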

In [6]:
# Extracting the article text
article_text = test_result.find('div', class_= 'story-body__inner').text
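
Taking `.text` of the whole body division also picks up caption text and long runs of blank lines. One possible clean-up step is sketched below. The function name and the `BOILERPLATE` list are assumptions for illustration, the sample string is a shortened stand-in rather than real scraper output, and the list of phrases to drop would need extending after looking at more articles.

```python
# Hypothetical list of lines to drop; extend after inspecting more articles
BOILERPLATE = ("Media playback is unsupported on your device",)


def clean_article_text(raw, boilerplate=BOILERPLATE):
    """Strip each line, then drop empty lines and known boilerplate lines."""
    lines = [line.strip() for line in raw.splitlines()]
    kept = [line for line in lines if line and line not in boilerplate]
    return "\n".join(kept)


# Shortened stand-in for scraped text
sample_raw = (
    "\n\n\n\nMedia playback is unsupported on your device\n\n\n"
    " PM Theresa May has struck a last-minute deal with the EU. \n\n"
    "Brexit deal as it happened\n"
)
print(clean_article_text(sample_raw))
```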

In [7]:
# The BBC has tags of related articles at the bottom of the article; this will extract them
tags_division = soup.find_all('div', class_='tags-container')

In [26]:
# Create a list of tags, taking the first link in each tags division
tags = []
for division in tags_division:
    anchors = division.find_all('a')
    tags.append(anchors[0].text)

In [27]:
# Creating a data dictionary of the extracted data
# allows the extracted data to be formed into a dataframe
data_dict = {
    'headline' : headline,
    'date' : date_posted,
    'tags' : [tags],
    'text' : article_text
}

In [28]:
# List of the column headings, reused later when building the full dataframe
column_heads = ['headline', 'date', 'tags', 'text']

In [29]:
# The final test dataframe
test_df = pd.DataFrame(data=data_dict)

In [31]:
# The BBC has an RSS feed which I can use to grab some articles
# This cell simply uses it for headlines initially
feed = feedparser.parse('http://feeds.bbci.co.uk/news/world/rss.xml')
headlines = [entry.title for entry in feed.entries[:10]]
print(headlines)

[u"Walt Disney buys Murdoch's Fox for $52.4bn", u"Putin: Trump opponents harm US with 'invented' Russia scandal", u"Joe Biden comforts John McCain's daughter over cancer", u'Niki Austrian airline failure strands many passengers', u"Roy Moore says Alabama election 'tainted' by outside groups", u'EU leaders wrestle with migrant quotas at summit', u"Zimbabwe's Mnangagwa seeks end to Western sanctions", u'Somalia suicide bomber kills police at Mogadishu academy', u'India happiness minister sought as murder suspect in Madhya Pradesh', u'Bollywood star faces eviction from Paris apartment']


In [65]:
# Using the methods above we can create a simple function which 
# will allow the scraping to become easier down the line
def scrape(URL):
    r = requests.get(URL)
    soup = BeautifulSoup(r.text, 'html.parser')
    results = soup.find_all('div', class_='story-body')
    test_result = results[0]
    headline = test_result.find('h1', class_='story-body__h1').text
    date_posted = test_result.find('div', class_= 'date date--v2').text
    article_text = test_result.find('div', class_= 'story-body__inner').text
    
    tags = scrape_tags(soup)
    
    return headline, date_posted, tags, article_text
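
One fragile point in a function like this is that BeautifulSoup's `find` returns `None` when nothing matches, so calling `.text` on the result raises an `AttributeError` on any page that lacks one of the expected elements. A small guard, sketched below, is one way to soften that. The names `text_or_none` and `FakeNode` are hypothetical, and `FakeNode` only stands in for a parsed element so the example runs without fetching a page.

```python
def text_or_none(node):
    """Return the stripped text of a parsed element, or None if it is missing."""
    if node is None:
        return None
    return node.text.strip()


class FakeNode(object):
    """Stand-in for a BeautifulSoup element, used only for this demo."""

    def __init__(self, text):
        self.text = text


print(text_or_none(FakeNode("  8 December 2017 ")))  # 8 December 2017
print(text_or_none(None))  # None
```

Inside `scrape`, a call such as `text_or_none(test_result.find('h1', class_='story-body__h1'))` would then yield `None` for a missing headline instead of stopping the run.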

In [33]:
# The slightly more complex tag scraping deserves its own function
# This function is called within the larger scraping function
def scrape_tags(s):
    tags_division = s.find_all('div', class_='tags-container')
    tags = []
    # Take the text of the first link in each tags division
    for division in tags_division:
        anchors = division.find_all('a')
        tags.append(anchors[0].text)
    return tags

In [17]:
# Creating the dataframe is the final step - passing the dictionary directly did not work,
# so it is wrapped in a list, which makes pandas treat it as a single row
def dataframify(dictionary):
    
    column_heads = list(dictionary.keys())
    
    return pd.DataFrame(data=[dictionary], columns=column_heads)

In [34]:
# Testing that the function works
scrape(URL)

(u"Brexit: 'Breakthrough' deal paves way for future trade talks",
 u'8 December 2017',
 [[u'Brexit', u'DUP (Democratic Unionist Party)']],
 u'\n\n\n\nMedia playback is unsupported on your device\n\n\n\n\n\n Media captionThe prime minister said the deal will allow more to be invested in "priorities at home"\nPM Theresa May has struck a last-minute deal with the EU in a bid to move Brexit talks on to the next phase.There will be no "hard border" with Ireland; and the rights of EU citizens in the UK and UK citizens in the EU will be protected.The so-called "divorce bill" will amount to between \xa335bn and \xa339bn, Downing Street sources say.The European Commission president said it was a "breakthrough" and he was confident EU leaders will approve it.  \nBrexit deal as it happened\nFull text of the EU-UK statement\nKuenssberg: Theresa May buys breathing space\nSo, did \'soft Brexit\' just win?\nBrexitcast - EU deal special\nReality Check: What next for talks?\nThey are due to meet next T

In [35]:
# The scraping function needs links to work on, so this helper reads an RSS feed
# and returns the link of every entry in it
RSS_URL = 'http://feeds.bbci.co.uk/news/world/rss.xml'

def RSS_reader(URL):
    feed = feedparser.parse(URL)
    return [entry.link for entry in feed.entries]
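
Since `scrape` expects the standard article layout, it may be worth filtering the feed links before scraping them. The sketch below keeps only links shaped like the article URLs seen in this project (for example `http://www.bbc.co.uk/news/uk-politics-42277040`) and drops duplicates. The pattern is an assumption inferred from those URLs, `article_links` is a hypothetical helper, and the second and third sample links are invented to show what gets filtered out.

```python
import re

# Assumed pattern: /news/<section-slug>-<numeric id>, as in the URLs seen so far
ARTICLE_URL = re.compile(r"^https?://www\.bbc\.co\.uk/news/[a-z-]+-\d+$")


def article_links(links):
    """Keep links that look like standard articles, dropping duplicates."""
    seen = set()
    kept = []
    for link in links:
        if ARTICLE_URL.match(link) and link not in seen:
            seen.add(link)
            kept.append(link)
    return kept


sample_links = [
    "http://www.bbc.co.uk/news/uk-politics-42277040",
    "http://www.bbc.co.uk/news/uk-politics-42277040",  # invented duplicate
    "http://www.bbc.co.uk/news/av/some-video-page",    # invented non-article link
]
print(article_links(sample_links))
```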

In [63]:
# Collect the feed links and convert them from unicode to plain ASCII strings
links = RSS_reader(RSS_URL)
new_links = [x.encode('ascii', 'ignore') for x in links]
print(len(new_links))

30


In [66]:
# Start from an empty dataframe with the column headings defined earlier
df = pd.DataFrame(columns=column_heads)

# Scrape the first 10 links and add each article as a new row
for link in range(10):
    print(link)
    print(new_links[link])
    temp_headline, temp_date_posted, temp_tags, temp_article_text = scrape(new_links[link])

    temp_list = [temp_headline, temp_date_posted, temp_tags, temp_article_text]

    df.loc[link] = temp_list

# df.set_value(temp_headline, temp_date_posted, temp_tags, temp_article_text)

df.head()

0
http://www.bbc.co.uk/news/business-42353545
1
http://www.bbc.co.uk/news/world-europe-42349906
2
http://www.bbc.co.uk/news/world-us-canada-42351576
3
http://www.bbc.co.uk/news/world-europe-42355776
4
http://www.bbc.co.uk/news/world-us-canada-42346121
5
http://www.bbc.co.uk/news/world-europe-42352876
6
http://www.bbc.co.uk/news/world-africa-42358058
7
http://www.bbc.co.uk/news/world-asia-42356797
8
http://www.bbc.co.uk/news/world-africa-42350072
9
http://www.bbc.co.uk/news/world-asia-india-42349711


| | headline | date | tags | text |
|---|---|---|---|---|
| 0 | Walt Disney buys Murdoch's Fox for $52.4bn | 14 December 2017 | [Disney] | \n\n\n\nImage copyright\nDisney\n\n\nImage cap... |
| 1 | Putin: Trump opponents harm US with 'invented'... | 14 December 2017 | [Russia-Trump inquiry, Russia] | \n\n\n\nImage copyright\nReuters\n\n\nImage ca... |
| 2 | Joe Biden comforts John McCain's daughter over... | 14 December 2017 | [Cancer] | \n\n\n\nImage copyright\nReuters\n\n\nImage ca... |
| 3 | Niki Austrian airline failure strands many pas... | 14 December 2017 | [Austria] | \n\n\n\nImage copyright\nAFP\n\n\nImage captio... |
| 4 | Roy Moore says Alabama election 'tainted' by o... | 14 December 2017 | [Roy Moore] | \n\n\n\nImage copyright\nReuters\n\n\nImage ca... |
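
When this loop is scaled up to many more links, a single page with an unexpected layout would stop the whole run. The sketch below shows one way to keep going, by collecting rows in a list and recording failures separately. `collect_rows` is a hypothetical helper, and `fake_scrape` is an invented stand-in for `scrape` so the example runs offline with made-up URLs.

```python
def collect_rows(links, scrape_func):
    """Scrape each link, returning (rows, failures) without stopping on errors."""
    rows, failures = [], []
    for link in links:
        try:
            headline, date_posted, tags, text = scrape_func(link)
        except (IndexError, AttributeError) as error:
            # Typical symptoms of a page missing the expected elements
            failures.append((link, repr(error)))
            continue
        rows.append({"headline": headline, "date": date_posted,
                     "tags": tags, "text": text})
    return rows, failures


def fake_scrape(url):
    """Invented stand-in for scrape(), so the sketch runs offline."""
    if url.endswith("broken"):
        raise IndexError("no story-body div found")
    return "A headline", "14 December 2017", ["Tag"], "Body text"


rows, failures = collect_rows(
    ["http://example.com/ok", "http://example.com/broken"], fake_scrape
)
print(len(rows))      # 1
print(len(failures))  # 1
```

The collected rows could then be passed to `pd.DataFrame(rows, columns=column_heads)` in one go, which avoids growing the dataframe one `df.loc` assignment at a time. Network errors raised by `requests` are not caught here and would need their own `except` clause.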
