# Analyzing Trends in News Headlines
---

## Collecting Data
---

* Where I am getting the data
* Talk about running a server on DigitalOcean
* Show part of the Python `scraper.py` script
* Talk about the identifiers N, H, and J and what they mean
* Talk about the Watson API

### Data Sources

For my sources I used the homepage of [RT](https://www.rt.com/), the politics page of [The Washington Times](http://www.washingtontimes.com/news/politics/), and the politics page of [CBC News](http://www.cbc.ca/news/politics). I chose these three news networks because they were the ones I knew best.

I wrote a Python script, `scraper.py`, to scrape these pages and collect news headlines three times a day: in the morning at around 11:00, in the afternoon at around 14:00, and in the evening at around 18:00.

I used three different news networks and scraped at different times of day to get as much variation as possible, which should make it easier to find patterns in the data.

### Automating Work

To save myself the time and headache of remembering to scrape the websites at certain times, I spun up a simple Ubuntu droplet on DigitalOcean and used the [Python Schedule package](https://pypi.python.org/pypi/schedule) to run the script at specific times throughout the day.

To run the script as a background process, I used the following command: `nohup python3 scraper.py > scraper.out 2>scraper.err &`.
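The scheduling described above could be set up along these lines with the Schedule package. This is a minimal sketch, not the code from `scraper.py`: the names `scrape_all`, `register_jobs`, and `run_forever`, as well as the 30-second polling interval, are assumptions made for illustration. Only the three run times come from the text.

```python
import time

# Run times taken from the text (24-hour clock, server local time).
SCRAPE_TIMES = ("11:00", "14:00", "18:00")


def scrape_all():
    """Placeholder for the real scraping routine (hypothetical name)."""
    print("Scraping RT, The Washington Times, and CBC News")


def register_jobs(scheduler, job, times=SCRAPE_TIMES):
    """Register `job` to run once a day at each time in `times`."""
    for at in times:
        scheduler.every().day.at(at).do(job)


def run_forever(poll_seconds=30):
    """Register the jobs, then keep checking for pending runs."""
    import schedule  # third-party package: pip install schedule

    register_jobs(schedule, scrape_all)
    while True:
        schedule.run_pending()
        time.sleep(poll_seconds)
```

Calling `run_forever()` at the bottom of the script keeps the process alive, which is why it pairs naturally with `nohup` and `&`.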


### Watson API

To analyze the news headlines I used IBM's Watson [Natural Language Understanding](https://natural-language-understanding-demo.mybluemix.net/) API. Natural Language Understanding (NLU) is a collection of different APIs that analyze text to extract its concepts, entities, keywords, sentiment, and more.

For every news headline I asked the NLU API to return the sentiment (a score and a label: negative, positive, or neutral), emotions (scores for joy, anger, disgust, sadness, and fear), and entities (people, companies, countries, etc.).

### Scraper Script

The following excerpt from `scraper.py` scrapes the RT homepage, analyzes each headline, and writes the headlines and their analysis to a `.txt` file.

```python

def RT():

    ...

    with open(filename, mode) as fp:

        # Start of Russia Today news
        fp.write('N RT\n')

        # Scrapes the website of Russia Today news network and writes news headlines to a file
        for ul_tag in soup.find_all('ul', {'class': 'main-promobox__list'}):
            for li_tag in ul_tag.find_all('li', {'class': 'main-promobox__item'}):
                for headline in li_tag.find_all('a', {'class': 'main-promobox__link'}):

                    news_headline = headline.text.lstrip().replace('\n', '')

                    fp.write('H ')
                    fp.write(news_headline)
                    fp.write('\n')
                    fp.write('J ')
                    json.dump(NLU.analyze(text=news_headline, features=[features.Sentiment(), 
                            features.Emotion(), features.Entities()]), fp)
                    fp.write('\n')

    ...
    
```

### Storing Data

The following excerpt comes from one of the `.txt` files, which contain the headlines and their analysis from all three news networks. It shows the first two headlines from the RT homepage.

* **N** indicates the start of a new news network.
* **H** indicates a new headline.
* **J** indicates the NLU analysis of the above headline.

```text

N RT
H ‘It takes 2 to tango’: Germany threatens Turkey with major policy overhaul                                                                     
J {"sentiment": {"document": {"score": -0.358041, "label": "negative"}}, "entities": [{"text": "Germany", "count": 1, "disambiguation": {"subtype": ["Country"]}, "type": "Location", "relevance": 0.33}, {"text": "Turkey", "count": 1, "disambiguation": {"dbpedia_resource": "http://dbpedia.org/resource/Turkey", "subtype": ["Brand", "GovernmentalJurisdiction", "ProjectParticipant", "Country"], "name": "Turkey"}, "type": "Location", "relevance": 0.33}], "language": "en", "emotion": {"document": {"emotion": {"disgust": 0.054634, "joy": 0.018692, "fear": 0.267773, "anger": 0.132166, "sadness": 0.068436}}}}
H Trump to end lavish CIA support for ‘moderate’ anti-Assad forces in Syria – reports                                                                    
J {"sentiment": {"document": {"score": 0.0, "label": "neutral"}}, "entities": [{"text": "Syria", "count": 1, "disambiguation": {"subtype": ["Country"]}, "type": "Location", "relevance": 0.33}, {"text": "CIA", "count": 1, "type": "Organization", "relevance": 0.33}], "language": "en", "emotion": {"document": {"emotion": {"disgust": 0.393306, "joy": 0.010359, "fear": 0.123981, "anger": 0.200277, "sadness": 0.42641}}}}

```
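A file in this format can be read back with a few lines of Python. The sketch below is one possible reader and is not part of `scraper.py`; the function name `parse_headlines` and the record keys (`network`, `headline`, `analysis`) are assumptions made for illustration.

```python
import json


def parse_headlines(lines):
    """Turn N/H/J-prefixed lines into a list of dicts.

    Each record holds the network (from the most recent N line), the
    headline (from the H line), and the decoded NLU analysis (from the
    J line that follows it).
    """
    records = []
    network = None
    headline = None
    for raw in lines:
        line = raw.strip()  # also drops the trailing padding after headlines
        if not line:
            continue
        marker, _, rest = line.partition(' ')
        if marker == 'N':
            network = rest
        elif marker == 'H':
            headline = rest
        elif marker == 'J' and headline is not None:
            records.append({
                'network': network,
                'headline': headline,
                'analysis': json.loads(rest),
            })
            headline = None  # a J line closes the current headline
    return records
```

Because the function only iterates over lines, it accepts an open file object as well as a list of strings.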

## Cleaning Data
---

* Show the txt file, show how the data was initially being stored
* Convert JSON objects into CSV
* Remove irrelevant things from the Entities item, parse for `Trump` and `Putin`
* Figure something out with morning, day, and evening news
    * Different starting dates
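The JSON-to-CSV step could look roughly like the sketch below. It flattens one NLU response, using the field layout visible in the stored `.txt` sample, into a single row and writes rows with the standard `csv` module. The function names `flatten` and `write_csv` and the column names are assumptions for illustration, not the project's actual code.

```python
import csv

EMOTIONS = ('joy', 'anger', 'disgust', 'sadness', 'fear')
COLUMNS = (
    ('network', 'headline', 'sentiment_label', 'sentiment_score')
    + EMOTIONS
    + ('entities',)
)


def flatten(network, headline, analysis):
    """Flatten one NLU response into a dict with one value per CSV column."""
    sentiment = analysis.get('sentiment', {}).get('document', {})
    emotion = analysis.get('emotion', {}).get('document', {}).get('emotion', {})
    row = {
        'network': network,
        'headline': headline,
        'sentiment_label': sentiment.get('label', ''),
        'sentiment_score': sentiment.get('score', ''),
        # Keep only the entity text; drop disambiguation, relevance, etc.
        'entities': ';'.join(e['text'] for e in analysis.get('entities', [])),
    }
    for name in EMOTIONS:
        row[name] = emotion.get(name, '')
    return row


def write_csv(rows, fp):
    """Write flattened rows to an open text file as CSV with a header row."""
    writer = csv.DictWriter(fp, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerows(rows)
```

Joining entity names with `;` keeps each headline on one row; a separate entities table would be an alternative if per-entity details turn out to matter.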

## Analyzing Data
---

* Compare Negative and Positive news
    * How many Negative and Positive news in total - Pie Chart
    * How does time of day affect the number of Negative and Positive news - Line Chart with specific time interval
    * How did big celebrations (Stampede) affect the number of Negative news, did they increase or decrease over the course of that week - Line Chart only showing CBC News
* Compare the Emotions (Joy, Anger, Sadness, etc.)
    * See which emotion dominates - Bubble Chart with bubbles being the emojis corresponding to the feeling
    * Talk about why the dominant emotion dominates
        * Talk about the Article (link in GDrive)
* Draw a Line Chart with all the news, one chart for all networks in the morning, day, and evening
    * Trying to see patterns, whether the number of negative news repeats every week
    * See if there are more negative news in the morning, day, or evening
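Two of the comparisons above, the negative/positive totals and the dominant emotion, reduce to small counting helpers. The sketch below assumes each analysis is a decoded NLU response with the `sentiment` and `emotion` layout shown under Storing Data; the function names are hypothetical.

```python
from collections import Counter


def sentiment_counts(analyses):
    """Count how many headlines carry each sentiment label."""
    return Counter(a['sentiment']['document']['label'] for a in analyses)


def dominant_emotion(analysis):
    """Return the name of the highest-scoring emotion for one headline."""
    scores = analysis['emotion']['document']['emotion']
    return max(scores, key=scores.get)


def dominant_emotion_counts(analyses):
    """Count how often each emotion is the dominant one across headlines."""
    return Counter(dominant_emotion(a) for a in analyses)
```

For the two RT headlines shown under Storing Data, `dominant_emotion` picks `fear` for the first and `sadness` for the second. The resulting counters can feed the pie and bubble charts directly.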

## Represent Data
---

* Who is the most mentioned leader
    * Trump, Putin, Trudeau - Pie Chart
* Which country is the most mentioned
    * Russia, America, Canada, etc.

```python
from iplotter import GCPlotter

plotter = GCPlotter()

data = [
    ['Year', 'Sales', 'Expenses'],
    ['2004', 0, 400],
    ['2005', 1170, 460],
    ['2006', 660, 1120],
    ['2007', 1030, 540]
]

options = {
    "width": 600,
    "height": 400,
    "curveType": 'function',
}

plotter.plot(data, chart_type="LineChart", chart_package='corechart', options=options)
```