# Analyzing Subjectivity Sentiment in News Articles

This notebook outlines my process for getting articles from various news sources, cleaning up the text, and calculating the subjectivity of each article using TextBlob. Some things to keep in mind:

- When calculating sentiment, TextBlob looks at the adjectives used in each sentence and uses scores from WordNet to determine subjectivity.
- The subjectivity of each article is the mean of the subjectivity scores of its sentences.
- The final ranking seems to reflect the proportion of reporting versus analysis/editorial more than it measures actual objectivity.
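
The per-article score described above is an average of per-sentence scores. A minimal sketch of that averaging, assuming a hypothetical `score` callable that maps a sentence to a number between 0 and 1 (with TextBlob this could be `lambda s: TextBlob(s).sentiment.subjectivity`). The `toy_score` function and `OPINION_WORDS` set are stand-ins invented for illustration so the example runs without any dependencies:

```python
def mean_subjectivity(sentences, score):
    """Average the per-sentence scores; return None for an empty article."""
    scores = [score(s) for s in sentences]
    if not scores:
        return None
    return sum(scores) / float(len(scores))


# Hypothetical stand-in scorer: fraction of words found in a tiny
# hand-picked list of opinion words. It is NOT how TextBlob scores text.
OPINION_WORDS = {"terrible", "wonderful", "awful", "great"}


def toy_score(sentence):
    words = sentence.lower().strip(".").split()
    if not words:
        return 0.0
    return sum(w in OPINION_WORDS for w in words) / float(len(words))


article = [
    "The council voted on the budget.",
    "It was a terrible and awful decision.",
]
article_score = mean_subjectivity(article, toy_score)
print(article_score)
```

Because every sentence carries equal weight, an article that mixes plain reporting with a few opinionated sentences ends up with a low score, which fits the observation that the ranking mostly tracks the share of reporting versus editorial content.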

In [5]:
import newspaper
import pandas as pd
import numpy as np
from nltk.tokenize import sent_tokenize, word_tokenize
import string
from textblob import TextBlob
import json
import re
import glob
import pickle
from gensim.models.ldamodel import LdaModel
from gensim.corpora import Dictionary

# Clean, Tokenize and Calculate Subjectivity for Text

The text we get from scraping the articles contains unwanted characters such as non-ASCII (Unicode) characters, punctuation and line breaks. We keep the periods so that the sentences can still be tokenized properly.

In [2]:
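def clean_text(text):
    # Collapse paragraph breaks (double newlines) into a single space.
    removed = text.replace("\n\n", " ")
    # Drop anything outside string.printable, i.e. non-ASCII characters.
    clean = filter(lambda x: x in string.printable, removed)
    # Strip all punctuation except periods, which mark sentence boundaries.
    return "".join(l for l in clean if l not in string.punctuation.replace(".",""))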

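def tokenize_text(text):
    sents = sent_tokenize(text)
    # Keep only sentences with at least five words.
    return [sent for sent in sents if len(sent.split()) >= 5]

def objectivity_sentiment(text):
    # `text` is a list of sentences; average their TextBlob subjectivity scores.
    sentiments = [TextBlob(sent).sentiment.subjectivity for sent in text]
    if not sentiments:
        # Nothing to score, so there is no meaningful average.
        return None
    return sum(sentiments) / float(len(sentiments))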

def process_text(text):
    clean = clean_text(text)
    tokens = tokenize_text(clean)
    return tokens
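
To make the cleaning step concrete, here is a small worked example. It repeats the logic of `clean_text` (written with list comprehensions so it runs on its own) and applies it to a made-up `sample` string; the sample text is an assumption for illustration, not data from the scrape:

```python
import string


def clean_text(text):
    # Same logic as the notebook's clean_text, repeated so the example runs alone.
    removed = text.replace("\n\n", " ")
    printable = [ch for ch in removed if ch in string.printable]
    drop = string.punctuation.replace(".", "")
    return "".join(ch for ch in printable if ch not in drop)


# Made-up input with a colon, a percent sign, an exclamation mark,
# a paragraph break and a curly apostrophe (U+2019).
sample = u"Breaking: markets fell 3% today!\n\nAnalysts weren\u2019t surprised."
cleaned = clean_text(sample)
print(cleaned)
# prints: Breaking markets fell 3 today Analysts werent surprised.
```

Note that the exclamation mark is removed along with the other punctuation, so the two sentences run together. As written, only periods survive as sentence boundaries, which suggests that sentences ending in `!` or `?` may be merged with the following one when tokenized.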


# Get the Articles for a News Source

Using the Python library Newspaper, we build a list of articles for a site, then download and parse each one and extract the information we want into a dictionary. The list of dictionaries for the site is then dumped to a JSON file.

In [None]:
def get_articles(name, url):
    # memoize_articles=False makes Newspaper return every article, not just unseen ones.
    paper = newspaper.build(url, memoize_articles=False)
    article_list = []
    for article in paper.articles:
        art = {}
        article.download()
        article.parse()
        # process_text returns a list of sentences; skip articles with none left.
        text = process_text(article.text)
        if text:
            art["title"] = article.title
            art["authors"] = article.authors
            art["text"] = text
            art["sentiment"] = objectivity_sentiment(text)
            article_list.append(art)

    # One JSON file per outlet, written to the working directory.
    filename = name + '_articles.json'
    with open(filename, 'w') as fp:
        json.dump(article_list, fp, indent=4, sort_keys=True)

Here we read the list of news sources we want to scrape. The list is a CSV file with two columns: `outlet`, the name of the news outlet, and `url`, the address of its website.

In [None]:
news_df = pd.read_csv("news_sources.csv")
news_df.columns

In [None]:
names = news_df['outlet'].tolist()
urls = news_df['url'].tolist()
for n, u in zip(names,urls):
    get_articles(n,u)
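
For reference, the loop above reads the columns `outlet` and `url`, so the CSV needs a header row with those names. The sketch below shows what such a file might look like. The two rows are hypothetical (the real `news_sources.csv` is not shown here), and it uses the standard-library `csv` module instead of pandas so that it runs without extra dependencies:

```python
import csv
import io

# Hypothetical file contents, for illustration only.
sample_csv = u"outlet,url\nReuters,https://www.reuters.com\nBBC,https://www.bbc.com\n"

rows = list(csv.DictReader(io.StringIO(sample_csv)))
names = [row["outlet"] for row in rows]
urls = [row["url"] for row in rows]

# Same pairing as the scraping loop, but printing instead of scraping.
for n, u in zip(names, urls):
    print(n, u)
```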

In [18]:
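# get_articles writes to the working directory; this assumes the JSON files were moved into temp/.
files = glob.glob('temp/*.json')
print(len(files))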

22


# Cleaning up Articles
I decided to focus on longer articles (more than 10 sentences). After reading several, it seemed that many of the short articles were just summaries of news stories or promotions for movies, which is not what I am interested in.

In [7]:
def filter_short(data):
    d_list = []
    for d in data:
        if len(d["text"]) > 10:
            d_list.append(d)
    return d_list
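
Since `text` holds a list of sentences rather than a single string, the threshold counts sentences, not characters or words. A quick check with made-up records (the titles and sentences are hypothetical), using an equivalent list-comprehension form of the filter:

```python
def filter_short(data):
    # Keep articles with more than 10 sentences.
    return [d for d in data if len(d["text"]) > 10]


# Hypothetical records: "text" is a list of sentences, as produced by process_text.
short = {"title": "Brief", "text": ["One short sentence here today."] * 10}
long_ = {"title": "Feature", "text": ["One short sentence here today."] * 11}

kept = filter_short([short, long_])
print([d["title"] for d in kept])
# prints: ['Feature']
```

An article with exactly 10 sentences is dropped, because the comparison is strictly greater than 10.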

In [6]:
# Overwrite each file in place, keeping only the longer articles.
for fname in files:
    with open(fname) as data_file:
        data = json.load(data_file)
        d_list = filter_short(data)
    with open(fname, 'w') as fp:
        json.dump(d_list, fp, indent=4, sort_keys=True)

# Calculating the Ranking
The final goal of this part of the project was to rank the outlets in ascending order, starting with the lowest subjectivity score. To do this I calculated the average of the article scores for each news outlet and then sorted the outlets. It is interesting to note that RT comes out as the least subjective even though it is relatively infamous for being biased; this reflects a limitation of my model, which does not account for bias.

In [13]:
def sum_sentiments(data):
    # Despite the name, this returns the mean article score for one outlet.
    sent = 0.0
    for d in data:
        sent += d["sentiment"]
    return sent / len(data)

In [75]:
sent_list = []
for fname in files:
    with open(fname) as data_file:    
        data = json.load(data_file)
        sent = sum_sentiments(data)
        sent_list.append((fname, sent))
sents = sorted(sent_list, key=lambda x: x[1])
ranking = ""
for i, x in enumerate(sents):
    ranking += str(i+1)+', ' + x[0] + '\n'
print(ranking)

1, temp/RT_articles.json
2, temp/Reuters_articles.json
3, temp/CBSNews_articles.json
4, temp/WashingtonPost_articles.json
5, temp/ABC_articles.json
6, temp/Bloomberg_articles.json
7, temp/fox_articles.json
8, temp/NBC_articles.json
9, temp/USAToday_articles.json
10, temp/CNN_articles.json
11, temp/Economist_articles.json
12, temp/NYPost_articles.json
13, temp/BBC_articles.json
14, temp/Guardian_articles.json
15, temp/AlJazeera_articles.json
16, temp/NPR_articles.json
17, temp/NYT_articles.json
18, temp/ArsTechnica_articles.json
19, temp/Buzzfeed_articles.json
20, temp/HuffPost_articles.json
21, temp/Wired_articles.json
22, temp/Verge_articles.json
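
The printed ranking shows only the order, not how far apart the outlets are. A minimal sketch of the same ranking step that also prints each outlet's mean score. The outlet names and scores in `outlets` are hypothetical and held in memory rather than read from the JSON files:

```python
def mean_sentiment(articles):
    """Mean of the per-article scores for one outlet."""
    return sum(a["sentiment"] for a in articles) / float(len(articles))


# Hypothetical outlets and scores, for illustration only.
outlets = {
    "outlet_a": [{"sentiment": 0.50}, {"sentiment": 0.40}],
    "outlet_b": [{"sentiment": 0.30}, {"sentiment": 0.35}],
    "outlet_c": [{"sentiment": 0.60}],
}

# Sort ascending, so the least subjective outlet is ranked first.
ranked = sorted(
    ((name, mean_sentiment(arts)) for name, arts in outlets.items()),
    key=lambda pair: pair[1],
)
for rank, (name, score) in enumerate(ranked, start=1):
    print("{}, {} ({:.3f})".format(rank, name, score))
```

Seeing the scores next to the ranks makes it easier to judge whether neighbouring outlets differ meaningfully or are effectively tied.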



# Building a Corpus
I wanted to do some topic modeling so that I could compare news outlets topic by topic, to see which topics are potentially controversial and whether certain outlets are more interested in certain topics than others. To do this I built a corpus of documents from all of the outlets and dumped the resulting list of documents to a pickle file.

In [19]:
corpus = []
for fname in files:
    with open(fname) as data_file:
        data = json.load(data_file)
        for d in data:
            # Rejoin each article's sentence list into one document string.
            corpus.append(' '.join(d['text']))

with open('corpus.pkl', 'wb') as output:
    pickle.dump(corpus, output)

In [20]:
with open('corpus.pkl', 'rb') as pkl_file:
    corpus = pickle.load(pkl_file)
print(len(corpus))

5036
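
The corpus built above is a flat list of documents, so it no longer records which outlet each document came from. For comparing outlets topic by topic, one possible variant is to keep a parallel list of outlet labels. This is only a sketch under assumptions: `outlet_from_path` and `build_labeled_corpus` are hypothetical helpers, and `data_by_file` is a made-up in-memory stand-in for the loaded JSON files:

```python
import os


def outlet_from_path(path):
    # "temp/RT_articles.json" -> "RT"; assumes the "<name>_articles.json" naming.
    base = os.path.basename(path)
    return base[: -len("_articles.json")]


def build_labeled_corpus(data_by_file):
    """Return parallel lists: one document string and one outlet label per article."""
    docs, labels = [], []
    for path, articles in data_by_file.items():
        for article in articles:
            docs.append(" ".join(article["text"]))
            labels.append(outlet_from_path(path))
    return docs, labels


# Hypothetical stand-in for the loaded JSON files.
data_by_file = {
    "temp/RT_articles.json": [{"text": ["First sentence.", "Second sentence."]}],
    "temp/BBC_articles.json": [{"text": ["Only sentence."]}],
}
docs, labels = build_labeled_corpus(data_by_file)
print(list(zip(labels, docs)))
```

Keeping `labels` aligned with `docs` means that any per-document result computed later can be grouped back by outlet.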
