## Analyzing News Articles With Python (Part 2)

*By Andre Sealy (Updated 6/21/2020)*

This is the second part of a multi-part series on analyzing news articles using Python. Part One used Python and the All Sides website to collect the most important articles for the stories that interest us. In this part, we will extract the most important information for analyzing those articles, namely the authors and the article body. To do this, we will use a library called **News-Please**.

### News Please

[News-Please](https://github.com/fhamborg/news-please) is an open-source news crawler that extracts structured information from almost any news website. It can recursively follow internal hyperlinks and read RSS feeds to fetch both the most recent and older, archived articles.

#### Importing the Modules

Let's begin by loading the necessary modules for the project, which include the following:

* pandas
* csv
* pickle
* matplotlib
* difflib
* itertools
* newsplease

We used `pandas` and `csv` in the first part, so if you've completed it, you're already familiar with how they're used. This tutorial also uses some new modules that will help us extract articles.

We will be using the `pickle` module, which lets us take a complex object structure and transform it into a stream of bytes that can be saved to disk, a process known as *serialization*. We won't go into the details here. If you want to learn more, [here is a good resource](https://docs.python-guide.org/scenarios/serialization/).
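As a minimal sketch of what serialization looks like in practice, the snippet below writes a small dictionary to disk with `pickle.dump` and reads it back with `pickle.load`. The file name `example.p` and the sample dictionary are placeholders for illustration and are not part of the project.

```python
import os
import pickle
import tempfile

# hypothetical sample data standing in for a crawl result
article = {"title": "Example headline", "authors": ["Jane Doe"], "length": 1234}

# work in a temporary directory so the sketch leaves nothing behind
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example.p")

    # serialize the object into a stream of bytes on disk
    with open(path, "wb") as f:
        pickle.dump(article, f)

    # deserialize the bytes back into an equal (but separate) Python object
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == article)
```

Keep in mind that unpickling can execute arbitrary code, so only load pickle files that you created yourself or otherwise trust.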

We will specifically be using the `difflib.SequenceMatcher` class to compare pairs of input sequences. It looks for the longest matching subsequences between two inputs while ignoring the "junk" elements ([here](https://towardsdatascience.com/sequencematcher-in-python-6b1e6f3915fc) is a resource on how it all works, for further learning).
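To give a feel for how `SequenceMatcher` scores two strings, here is a small sketch. The helper name `similarity` and the candidate names are assumptions made up for this example. The `ratio()` method returns a value between 0.0 (nothing in common) and 1.0 (identical).

```python
import difflib

def similarity(a, b):
    # the first argument (None) means no characters are treated as junk
    return difflib.SequenceMatcher(None, a, b).ratio()

candidates = ["Fox News", "Reuters", "The Hill"]  # hypothetical source names
query = "foxnews"

# score the query against each candidate, ignoring case
scores = {name: similarity(query, name.lower()) for name in candidates}
best = max(scores, key=scores.get)

print(scores)
print(best)
```

Lowercasing both sides before comparing matters here, because `SequenceMatcher` treats `F` and `f` as different characters.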

Finally, we will be using the `newsplease` module, which will allow us to extract information from the article links that we have collected.

In [1]:
import csv
import pandas as pd
import pickle
from itertools import accumulate, chain, repeat, tee
import difflib
import matplotlib
from newsplease import NewsPlease

Now we are going to reintroduce the `link_file.txt` that we created in Part One. We're going to use this to create a list containing strings.

In [2]:
filepath = r'Insight/news-lens'

with open(r'{}/link_file.txt'.format(filepath)) as f:
    news_links = [line.replace("\n", "") for line in f]
    
news_links[5:10]

['https://www.washingtonexaminer.com/opinion/just-imagine-if-trump-had-stopped-immigration-in-early-february',
 'https://www.washingtonpost.com/immigration/coronavirus-trump-immigration/2020/04/21/a2a465aa-837a-11ea-9728-c74380d9d410_story.html',
 'https://www.reuters.com/article/us-health-coronavirus-usa/u-s-coronavirus-response-deepens-divide-as-trump-suspends-immigration-idUSKCN2231TU',
 'https://www.foxnews.com/politics/trump-suspend-immigration-executive-order-coronavirus',
 'https://www.foxnews.com/politics/court-hands-trump-win-in-sanctuary-city-grant']

Next, we are going to split the list of URLs into chunks. To do this, we're going to create a function that accepts two arguments: the list of URLs and the number of chunks that we want to create.

After checking that the number of chunks is greater than zero, the function measures the size of the list and divides it by the number of chunks. Any remainder is spread across the first chunks, one extra item each, so the chunk sizes differ by at most one.

In [3]:
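def chuck(xs, n):
    # split xs into n contiguous chunks whose sizes differ by at most one
    assert n > 0
    L = len(xs)
    s, r = divmod(L, n)  # base chunk size and number of leftover items
    # the first r chunks get one extra item each
    widths = chain(repeat(s+1, r), repeat(s, n-r))
    # running totals of the widths give the chunk boundaries
    offsets = accumulate(chain((0,), widths))
    b, e = tee(offsets)
    next(e)  # shift e one step ahead so (b, e) pairs up as (start, end)
    return [xs[s] for s in map(slice, b, e)]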


batch = chuck(news_links, 251)

batch[3:7]

[['https://www.washingtonpost.com/immigration/coronavirus-trump-immigration/2020/04/21/a2a465aa-837a-11ea-9728-c74380d9d410_story.html',
  'https://www.reuters.com/article/us-health-coronavirus-usa/u-s-coronavirus-response-deepens-divide-as-trump-suspends-immigration-idUSKCN2231TU'],
 ['https://www.foxnews.com/politics/trump-suspend-immigration-executive-order-coronavirus',
  'https://www.foxnews.com/politics/court-hands-trump-win-in-sanctuary-city-grant'],
 ['https://www.axios.com/court-trump-immigration-sanctuary-cities-565ee05c-46ea-4894-8e1b-52f6a0fde7c8.html',
  'https://www.thedailybeast.com/trump-administration-can-withhold-grants-from-sanctuary-cities-court-rules'],
 ['https://www.nytimes.com/2020/02/13/us/politics/border-wall-funds-pentagon.html',
  'https://www.washingtonexaminer.com/policy/defense-national-security/trumps-plan-to-strip-planes-ships-and-vehicles-from-pentagon-budget-to-fund-border-wall-draws-bipartisan-howls-of-protest-from-congress']]
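To see how the chunk sizes work out, here is a simplified sketch of the same idea using plain lists. The names `split_sizes` and `split_list` are hypothetical helpers written for this illustration. They are not the function used above, but they follow the same `divmod` logic.

```python
def split_sizes(total, n):
    # base size for every chunk, plus how many chunks get one extra item
    size, extra = divmod(total, n)
    return [size + 1] * extra + [size] * (n - extra)

def split_list(items, n):
    # walk through the list, cutting off one chunk per computed size
    chunks, start = [], 0
    for width in split_sizes(len(items), n):
        chunks.append(items[start:start + width])
        start += width
    return chunks

print(split_sizes(10, 3))              # [4, 3, 3]
print(split_list(list(range(10)), 3))  # [[0, 1, 2, 3], [4, 5, 6], [7, 8, 9]]
```

Ten items split three ways gives a base size of 3 with 1 left over, so the first chunk receives the extra item.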

Splitting the URLs into small chunks limits the damage when News-Please runs into an error: a failure only affects the batch it occurs in rather than the whole crawl. Now that we have created the chunks, we are going to use the `NewsPlease` class to extract information from the URLs.

We will get this information using the `NewsPlease.from_urls` method, which extracts all of the important information from each article (source, description, authors, and so on).

Once we extract the information, we will serialize it with `pickle` and save it to disk rather than keeping everything in memory.

In [4]:
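def article_crawler():
    # crawl each batch and pickle its results to a separate file
    n = 0  # number of batches saved successfully
    for i in range(0, len(batch)):
        try:
            url_slice = batch[i]
            slice_name = str(i) + '-NewsPlease-articleCrawl.p'
            article_information = NewsPlease.from_urls(url_slice)
            with open(slice_name, 'wb') as f:
                pickle.dump(article_information, f)
            n += 1
        except Exception:
            # skip a failing batch so it doesn't stop the whole crawl
            continue

article_crawler()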
    

Exception in thread Thread-59:
Traceback (most recent call last):
  File "C:\Users\andre\Anaconda3\lib\threading.py", line 926, in _bootstrap_inner
    self.run()
  File "C:\Users\andre\Anaconda3\lib\threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "C:\Users\andre\Anaconda3\lib\site-packages\newsplease\crawler\simple_crawler.py", line 31, in _fetch_url
    html = urllib.request.urlopen(req, data=None, timeout=timeout).read()
  File "C:\Users\andre\Anaconda3\lib\urllib\request.py", line 222, in urlopen
    return opener.open(url, data, timeout)
  File "C:\Users\andre\Anaconda3\lib\urllib\request.py", line 531, in open
    response = meth(req, response)
  File "C:\Users\andre\Anaconda3\lib\urllib\request.py", line 641, in http_response
    'http', request, response, code, msg, hdrs)
  File "C:\Users\andre\Anaconda3\lib\urllib\request.py", line 569, in error
    return self._call_chain(*args)
  File "C:\Users\andre\Anaconda3\lib\urllib\request.py", line

You may have noticed that the program has run into errors. This is perfectly normal, as some slugs for the URLs have changed, or some pages may have been deleted. We want an idea of how many articles our program has managed to extract. 
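If you would rather keep a record of which batches failed instead of skipping them silently, something along these lines could work. This is only a sketch: `crawl_batches`, `fetch`, and `out_dir` are names invented for this example, and `fake_fetch` is a stand-in for `NewsPlease.from_urls` so that the snippet runs without network access.

```python
import os
import pickle
import tempfile

def crawl_batches(batches, fetch, out_dir):
    """Fetch each batch, pickle the successes, and return the failures."""
    os.makedirs(out_dir, exist_ok=True)
    failed = []
    for i, urls in enumerate(batches):
        try:
            result = fetch(urls)
        except Exception as err:
            # remember the batch index and the error instead of hiding it
            failed.append((i, repr(err)))
            continue
        path = os.path.join(out_dir, "{}-NewsPlease-articleCrawl.p".format(i))
        with open(path, "wb") as f:
            pickle.dump(result, f)
    return failed

# stand-in for NewsPlease.from_urls (assumption: it takes a list of URLs)
def fake_fetch(urls):
    if any("broken" in url for url in urls):
        raise ValueError("could not download batch")
    return {url: "article for " + url for url in urls}

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_batches = [["https://example.com/a"], ["https://example.com/broken"]]
    failed = crawl_batches(demo_batches, fake_fetch, tmp_dir)
    saved = sorted(os.listdir(tmp_dir))

print(failed)
print(saved)
```

In the real crawl, the call would look something like `crawl_batches(batch, NewsPlease.from_urls, "crawl")`. Note that the traceback above was raised inside a worker thread, and errors of that kind are printed rather than raised to the caller, so a wrapper like this only records errors that actually reach it.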

The following code loads each saved batch and checks whether the main text of every article was extracted. Once it's finished, it calculates the success rate. A high success rate means that we were able to extract most of the articles.

In [5]:
# helper function
def make_unique(url_list):
   # Not order preserving    
   unique = set(url_list)
   return list(unique)

def check_data(filepath):
    scraped = []
    not_scraped = []

    for i in range(0, 251):
        try:
            file_path = filepath+"/crawl/"
            # use a context manager so each pickle file is closed after loading
            with open(file_path + str(i)
                      + "-NewsPlease-articleCrawl.p", "rb") as f:
                open_crawl = pickle.load(f)
            for url in open_crawl:
                text = open_crawl[str(url)].maintext
                if text is None:
                    not_scraped.append(url)
                else:
                    scraped.append(url)
        except FileNotFoundError:
            continue

    scraped = make_unique(scraped)
    return scraped, not_scraped
            
success, fail = check_data(filepath)

# analyze data collection success
def percentage(part, whole):
    # avoid shadowing the built-in format() with a local variable
    percent = 100 * float(part)/float(whole)
    return "{0:.2f}%".format(percent)

print("The extraction process yielded " 
      + str(len(success)) + " articles, or "
      + percentage(len(success),len(news_links)) 
      + " of the total.")

The extraction process yielded 291 articles, or 97.32% of the total.


We were able to extract 97.32% of the articles, which is more than enough for our analysis. Now we want to get additional information that the All Sides website could not provide directly: the title, authors, and source.

The process is simple. We will loop through each pickle file in the directory. The `pickle.load` function will load a dictionary of News-Please article objects, which will be used to extract the title, authors, and source.

The source is given as the source domain (with its `www.` prefix and its `.com`, `.gov`, `.edu`, etc. suffix), which is not exactly what we want. We created a loop that strips those parts of the domain by replacing them with empty strings.

Once everything is done, the results will be returned in the form of a dictionary.

In [6]:
def get_data(filepath):
    news_dict = {}

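    # Note: Python joins adjacent string literals that have no comma between
    # them, so 'www1.' '.com' below is the single entry 'www1..com' and
    # 'finance.' 'www1.' is 'finance.www1.'. As a result '.com' is never
    # stripped, which is why it survives in the output further down.
    remove_list = ['www.', 'www1.' '.com', '.gov', '.org', 'beta.', ',eu',
                   '.co.uk', 'europe', 'gma', 'blogs', 'in.', 'm.',
                   'eclipse2017.', 'money', 'insider', 'news.', 'finance.'
                   'www1.']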

    for i in range(0, 220):
        try:
            file_path = filepath + "/crawl/"
            with open(file_path+str(i)
                      + "-NewsPlease-articleCrawl.p", "rb") as f:
                open_crawl = pickle.load(f)

            for url in open_crawl:
                text = open_crawl[str(url)].maintext
                if text is not None:
                    title = open_crawl[str(url)].title
                    authors = ', '.join(open_crawl[str(url)].authors)
                    source = open_crawl[str(url)].source_domain

                    for seq in remove_list:
                        if seq in source:
                            source = source.replace(seq, "")

                    date = open_crawl[str(url)].date_publish

                    news_dict[str(url)] = [source, title, authors, date, text]

        except FileNotFoundError:
            continue

    return news_dict


all_news = get_data(filepath)

The function returns a dictionary, which we store in `all_news`. Each key is an article URL, and each value is a list holding the source domain, title, authors, publication date, and text. Now we want an idea of how many articles we've extracted from each source. The following function counts the number of articles for each source and returns the results in the form of a dictionary.

In [7]:
def get_sources(all_news):
    news_sources = {}
    for article in all_news:
        source = all_news[article][0]
        if source not in news_sources:
            news_sources[source] = 1
        else:
            news_sources[source] += 1
    return news_sources

check_sources = get_sources(all_news)

check_sources

{'buzzfeedcom': 3,
 'reuters.com': 20,
 'dailycaller.com': 4,
 'af.reuters.com': 1,
 'cnn.com': 12,
 'washingtonexaminer.com': 10,
 'washingtonpost.com': 11,
 'foxcom': 25,
 'thedailybeast.com': 2,
 'axios.com': 3,
 'nytimes.com': 13,
 'apcom': 7,
 'bbc.com': 6,
 'aljazeera.com': 2,
 'washingtontimes.com': 17,
 'npr': 5,
 'nationalreview.com': 9,
 'vox.com': 7,
 'usatoday.com': 12,
 'nbccom': 4,
 'bloomberg.com': 3,
 'wsj.com': 17,
 'jacobinmag.com': 1,
 'cbscom': 3,
 'thehill.com': 15,
 'dailymail': 3,
 'slate.com': 1,
 'thefederalist.com': 1,
 'msnbc.com': 1,
 'huffpost.com': 1,
 'breitbart.com': 3,
 'theblaze.com': 2,
 'newsmax.com': 1,
 'www1.cbn.com': 2,
 'rollcall.com': 1,
 'townhall.com': 8,
 'politico.com': 7,
 'nypost.com': 3,
 'splintercom': 1,
 'huffingtonpost.com': 8,
 'latimes.com': 3,
 'vanityfair.com': 1,
 'salon.com': 1,
 'theguardian.com': 2}
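The same tally can also be sketched with `collections.Counter` from the standard library. The `sample_news` dictionary below is a made-up stand-in for `all_news`, following the same layout (URL mapped to a list whose first element is the source).

```python
from collections import Counter

# hypothetical stand-in for all_news: URL -> [source, title, authors, date, text]
sample_news = {
    "https://example.com/a": ["reuters.com", "Title A", "", None, "text"],
    "https://example.com/b": ["reuters.com", "Title B", "", None, "text"],
    "https://example.com/c": ["cnn.com", "Title C", "", None, "text"],
}

# count how often each source appears in the first slot of every entry
source_counts = Counter(entry[0] for entry in sample_news.values())

print(source_counts.most_common())  # [('reuters.com', 2), ('cnn.com', 1)]
```

`most_common()` returns the sources sorted by count, which saves a separate sorting step.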

We have extracted the articles, but the source names are not exactly presented in the way that we would like. Unfortunately, we need to clean them up manually. (This is not necessary for those who are fine with how the data turned out.)
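Part of the messiness comes from `str.replace` removing a sequence wherever it occurs in the domain: stripping `news.` is what turns `foxnews.com` into `foxcom` in the output above. One possible way to reduce the manual cleanup is to strip only known prefixes from the start and known suffixes from the end. The helper `clean_domain` and its two tuples are assumptions for this sketch, not the lists used in this tutorial.

```python
# hypothetical prefix and suffix lists; extend them to fit your own sources
PREFIXES = ("www.", "www1.", "beta.", "m.")
SUFFIXES = (".co.uk", ".com", ".org", ".gov", ".eu")

def clean_domain(domain):
    name = domain.lower()
    # remove at most one known prefix, and only at the very start
    for prefix in PREFIXES:
        if name.startswith(prefix):
            name = name[len(prefix):]
            break
    # remove at most one known suffix, and only at the very end
    for suffix in SUFFIXES:
        if name.endswith(suffix):
            name = name[:-len(suffix)]
            break
    return name

for domain in ["www.foxnews.com", "www1.cbn.com", "dailymail.co.uk", "npr.org"]:
    print(domain, "->", clean_domain(domain))
```

Subdomains that are not in the prefix list, such as the `af.` in `af.reuters.com`, are left in place, so some manual review would still be needed.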

Next, we want to assign a bias rating to each of the sources we have extracted. For that, we will import the All Sides Media Bias ratings file, which is simply a compilation of all the ratings on their website.

In [8]:
check_sources = {'Buzzfeed': 3, 'Reuters': 20, 'daily caller': 4, 'af.reuters': 1, 'CNN (Web News)': 12, 
 'The Washington Examiner': 10, 'The Washington Post': 11, 'Fox News': 25, 'the daily beast': 2,
 'axios': 3, 'New York Times': 13, 'AP': 7, 'bbc': 6, 'Al Jazeera': 2, 'Washington Times': 17,
 'NPR News': 5, 'National Review': 9, 'Vox': 7, 'USA TODAY': 12, 'NBCNews.com': 4, 'Bloomberg': 3,
 'The Wall Street Journal': 17, 'Jacobin Magazine': 1, 'CBS News': 3, 'The Hill': 15, 'daily mail': 3,
 'slate': 1, 'the Federalist': 1, 'MSNBC': 1, 'HuffPost': 1, 'breitbart': 3, 'the blaze': 2,
 'Newsmax': 1, 'cbn': 2, 'rollcall': 1, 'Townhall': 8, 'politico': 7, 'New York Post': 3, 'splinter news': 1,
 'The Huffington Post': 8, 'Los Angeles Times': 3, 'Vanity Fair': 1, 'Salon': 1, 'The Guardian': 2}

bias_ratings = r"{}/allsides-media-bias-ratings.csv".format(filepath)

bias_dict = {}

with open(bias_ratings, mode='r') as infile:
    reader = csv.reader(infile)
    bias_dict = {rows[0]: rows[1] for rows in reader}

Now we are going to merge our dictionary of sources with the bias alignment from All Sides. For each source in our dictionary, the function computes its similarity to every source name in the media bias file and keeps the closest match, then assigns that match's alignment to the source. Note that the closest match is always accepted, however weak it is, so a source that is missing from the All Sides file will be paired with whichever name happens to be most similar.

The end result will be a data frame of all the sources we've extracted, the number of articles extracted, and the alignment of each source.

In [9]:
def string_sim(a, b):
    seq = difflib.SequenceMatcher(None, a, b)
    sim = seq.ratio() * 100
    return sim


def replace_names(check_sources, bias_dict):
    real_source = {}

    for entry in check_sources:
        for source in bias_dict:

            sim = string_sim(entry, source)
            count = check_sources[entry]

            if entry not in real_source:
                real_source[entry] = [source, sim, count]

            else:
                if sim > real_source[entry][1]:
                    real_source[entry] = [source, sim, count]

    raw_data = {'News_Source': [], 'Bias': [], 'Article_Count': []}

    for key in real_source:
        new_key = real_source[key][0]
        new_count = real_source[key][2]

        bias = bias_dict[new_key]

        raw_data['News_Source'].append(new_key)
        raw_data['Article_Count'].append(new_count)
        raw_data['Bias'].append(bias)

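    # relabel 'Mixed' ratings as 'Center'; rebuild the list so the change sticks
    # (reassigning the loop variable would leave the list untouched)
    raw_data['Bias'] = ['Center' if bias_rating == 'Mixed' else bias_rating
                        for bias_rating in raw_data['Bias']]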

    source_bias = pd.DataFrame(
        raw_data, columns=['News_Source', 'Bias', 'Article_Count'])

    return source_bias


updated_info = replace_names(check_sources, bias_dict)
updated_info.to_csv('news-corpus-info.csv', index=False)
updated_info.sort_values('Article_Count', ascending=False)[:10]

|    | News_Source | Bias | Article_Count |
|----|-------------|------|---------------|
| 7 | Fox News | Lean Right | 25 |
| 1 | Reuters | Center | 20 |
| 21 | Wall Street Journal- News | Center | 17 |
| 14 | Washington Times | Lean Right | 17 |
| 24 | The Hill | Center | 15 |
| 10 | New York Times | Lean Left | 13 |
| 18 | USA TODAY | Center | 12 |
| 4 | CNN (Web News) | Lean Left | 12 |
| 6 | Washington Post | Lean Left | 11 |
| 5 | Washington Examiner | Right | 10 |
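Because the matching step always accepts the closest name, a source that is missing from the ratings file still receives a rating. If you would prefer to leave such sources unmatched, one option is `difflib.get_close_matches`, which takes a similarity cutoff. The function `match_source`, the default cutoff of 0.6, and the short `known` list below are assumptions for this sketch.

```python
import difflib

def match_source(name, known_sources, cutoff=0.6):
    # compare case-insensitively, then map the hit back to its original spelling
    lowered = {source.lower(): source for source in known_sources}
    hits = difflib.get_close_matches(name.lower(), list(lowered),
                                     n=1, cutoff=cutoff)
    # an empty list means nothing was similar enough to count as a match
    return lowered[hits[0]] if hits else None

known = ["Fox News", "Reuters", "The Hill"]  # hypothetical subset of the ratings

print(match_source("fox news", known))
print(match_source("zzzz", known))
```

A return value of `None` can then be recorded as an unknown alignment instead of borrowing the rating of an unrelated outlet. The right cutoff depends on how noisy the source names are, so it is worth checking a few borderline cases by hand.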


As we can see, Fox News has the largest number of articles in our dataset, with 25 pieces. Reuters is right behind Fox News with 20 articles, followed by the Wall Street Journal (News Section), Washington Times, and The Hill with 17, 17, and 15 articles, respectively.

I can only speak for myself, but from what I see so far, I believe this alignment is correct for the most part. Fox News tends to lean right, while CNN and the New York Times lean more left.

I think some people would disagree with the alignment of the Wall Street Journal, as its publisher is owned by News Corp, which, like Fox News's parent company Fox Corporation, is controlled by the Murdoch family. Judging by the community feedback on All Sides, I believe most people would be inclined to agree.

![](https://kidquant.com/post/images/Analyzing-News/topics-allsides5.PNG)

Still, whether or not a news outlet is owned or operated by a particular person has little to do with its overall objectivity and bias. All Sides has conducted an [in-depth analysis](https://www.allsides.com/news-source/wall-street-journal-media-bias) of major news publications such as The Wall Street Journal and has found that the outlet is more closely aligned with the center than its peers. (Keep in mind, a Center alignment doesn't mean better!)

Now that we have a better idea of the types of articles we've extracted, it's time to bring everything together. We will be importing the `allsides-content.csv` file that we created during Part One.

When opening the file, we're going to include all of the contents inside the CSV in a dictionary. However, we will also be including the text of the article, as well as the name of the authors (if available).

Once the extraction process is done, we're going to transform it into a data frame.

In [10]:
def open_sesame(filepath):
    # read all rows of the Part One CSV once, closing the file afterwards
    with open(r'{}/allsides-content.csv'.format(filepath),
              'r', encoding='utf-8') as datafile:
        return list(csv.reader(datafile))

# return dict of stored values


def fill_dict():
    read_in = {'date': [], 'main_headline': [], 'description': [], 'source': [],
               'bias': [], 'headline': [], 'link': []}

    # the keys above are listed in the same order as the CSV columns
    rows = open_sesame(filepath)[1:]  # skip the header row
    for i, key in enumerate(read_in):
        read_in[key] = [row[i] for row in rows]

        # add info from NewsPlease crawl
    read_in['text'] = []
    read_in['authors'] = []
    for url in read_in['link']:
        try:
            read_in['text'].append(all_news[str(url)][4])
            read_in['authors'].append(all_news[str(url)][2])
        except KeyError:
            read_in['text'].append('None')
            read_in['authors'].append('None')

    return read_in


formatted = fill_dict()

df = pd.DataFrame(formatted, columns=['date', 'main_headline', 'authors', 'description',
                                      'source', 'bias', 'headline', 'link', 'text'])

df.head(5)

|   | date | main_headline | authors | description | source | bias | headline | link | text |
|---|------|---------------|---------|-------------|--------|------|----------|------|------|
| 0 | June 22nd, 2020 | b'Trump Admin Suspends Certain Visas Through E... | Hamed Aleaziz, Jeremy Singer-Vine, Adolfo Flor... | The Trump administration announced Monday that... | BuzzFeed News | Left | b'Trump Is Suspending Certain Visas For Foreig... | https://www.buzzfeednews.com/article/adolfoflo... | BuzzFeed News has reporters around the world b... |
| 1 | June 22nd, 2020 | b'Trump Admin Suspends Certain Visas Through E... | Ted Hesson, Min Read | The Trump administration announced Monday that... | Reuters | Center | b'Trump to suspend entry of certain foreign wo... | https://www.reuters.com/article/us-usa-immigra... | WASHINGTON (Reuters) - U.S. President Donald T... |
| 2 | June 22nd, 2020 | b'Trump Admin Suspends Certain Visas Through E... |  | The Trump administration announced Monday that... | The Daily Caller | Right | b'Trump To Suspend Visas Through End Of The Ye... | https://dailycaller.com/2020/06/22/exclusive-t... | President Donald Trump will sign an executive ... |
| 3 | April 24th, 2020 | b'Perspectives Trump s Immigration Executive O... | Ted Hesson, Min Read | President Donald Trump's executive order suspe... | Reuters | Center | b"Inside Trump's proposal to suspend some lega... | https://af.reuters.com/article/worldNews/idAFK... | WASHINGTON (Reuters) - President Donald Trump ... |
| 4 | April 24th, 2020 | b'Perspectives Trump s Immigration Executive O... | Opinion Rafia Zakaria | President Donald Trump's executive order suspe... | CNN - Editorial | Left | b"Trump's moves on immigration reveal his true... | https://www.cnn.com/2020/04/22/opinions/trump-... | Rafia Zakaria is the author of " The Upstairs ... |


Now we have more complete information. However, we still want to clean it up a bit. We're going to start by dropping all of the rows whose text is missing (`'None'`) or marked `'Mixed'`. After that, we're going to drop the stories that are shorter than 250 characters or longer than 20,000 characters.

Afterward, we'll save the results to a CSV for further analysis later on.

In [11]:
df = df.drop(df[df.text == 'None'].index)
df = df.drop(df[df.text == 'Mixed'].index)
df.date = pd.to_datetime(df.date)

df['text_len'] = df.apply(lambda row: len(row.text), axis=1)

df = df.drop(df[df.text_len > 20000].index)
df = df.drop(df[df.text_len < 250].index)

# write to csv
df.to_csv('news-corpus-df.csv', index=False)
df.head()


|   | date | main_headline | authors | description | source | bias | headline | link | text | text_len |
|---|------|---------------|---------|-------------|--------|------|----------|------|------|----------|
| 0 | 2020-06-22 | b'Trump Admin Suspends Certain Visas Through E... | Hamed Aleaziz, Jeremy Singer-Vine, Adolfo Flor... | The Trump administration announced Monday that... | BuzzFeed News | Left | b'Trump Is Suspending Certain Visas For Foreig... | https://www.buzzfeednews.com/article/adolfoflo... | BuzzFeed News has reporters around the world b... | 3358 |
| 1 | 2020-06-22 | b'Trump Admin Suspends Certain Visas Through E... | Ted Hesson, Min Read | The Trump administration announced Monday that... | Reuters | Center | b'Trump to suspend entry of certain foreign wo... | https://www.reuters.com/article/us-usa-immigra... | WASHINGTON (Reuters) - U.S. President Donald T... | 6015 |
| 2 | 2020-06-22 | b'Trump Admin Suspends Certain Visas Through E... |  | The Trump administration announced Monday that... | The Daily Caller | Right | b'Trump To Suspend Visas Through End Of The Ye... | https://dailycaller.com/2020/06/22/exclusive-t... | President Donald Trump will sign an executive ... | 1954 |
| 3 | 2020-04-24 | b'Perspectives Trump s Immigration Executive O... | Ted Hesson, Min Read | President Donald Trump's executive order suspe... | Reuters | Center | b"Inside Trump's proposal to suspend some lega... | https://af.reuters.com/article/worldNews/idAFK... | WASHINGTON (Reuters) - President Donald Trump ... | 5310 |
| 4 | 2020-04-24 | b'Perspectives Trump s Immigration Executive O... | Opinion Rafia Zakaria | President Donald Trump's executive order suspe... | CNN - Editorial | Left | b"Trump's moves on immigration reveal his true... | https://www.cnn.com/2020/04/22/opinions/trump-... | Rafia Zakaria is the author of " The Upstairs ... | 8191 |
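The length filter can also be written as a single boolean mask. The sketch below uses a made-up four-row frame named `sample` so the effect at the boundaries is easy to see. It assumes `pandas` is installed, as elsewhere in this tutorial.

```python
import pandas as pd

# hypothetical miniature corpus with texts of 100, 250, 20000, and 20001 characters
sample = pd.DataFrame({"text": ["a" * 100, "b" * 250, "c" * 20000, "d" * 20001]})

# vectorized string length instead of a row-by-row apply
sample["text_len"] = sample["text"].str.len()

# between() includes both endpoints by default, which matches dropping
# only the rows with text_len < 250 or text_len > 20000
kept = sample[sample["text_len"].between(250, 20000)]

print(kept["text_len"].tolist())  # [250, 20000]
```

Texts of exactly 250 or 20,000 characters are kept, just as in the two `drop` calls above.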


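Finally, we group the articles by bias rating and summarize the distribution of article lengths in each group.

In [12]: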
bybias = df.groupby('bias')
bybias['text_len'].describe()

| bias | count | mean | std | min | 25% | 50% | 75% | max |
|------|-------|------|-----|-----|-----|-----|-----|-----|
| Center | 89.0 | 3435.662921 | 2214.848517 | 348.0 | 1968.0 | 3283.0 | 5025.0 | 11675.0 |
| Left | 82.0 | 5431.207317 | 3648.040576 | 301.0 | 2493.75 | 4915.0 | 7931.25 | 19492.0 |
| Right | 89.0 | 3831.337079 | 2549.135803 | 531.0 | 2153.0 | 3242.0 | 4757.0 | 17786.0 |
