## Analyzing News Articles With Python (Part 1)

*By Andre Sealy (Updated 6/21/2020)*

This is the first part of what I expect to be a multi-part series on analyzing news articles with Python.

America has arguably never been more polarized than it is today, and the news we consume is a function of that polarization. Right-leaning people tend to get their information from Fox News, The Wall Street Journal, National Review, etc. People who lean left tend to get their information from MSNBC, The New York Times, The Huffington Post, etc. There are a handful of neutral publications (Reuters, Politico, Associated Press, etc.). Still, even their alignment is called into question, based on how they report the news.

This project will attempt to analyze how polarizing mainstream articles are for stories typical of the U.S. news cycle, and to measure the sentiment of each item. To guide the project, we will use the website All Sides.

### All Sides News Aggregator

All Sides is a bipartisan organization that takes a more balanced approach to news coverage by collecting the top headlines of the day and showcasing the reporting of news outlets on the left, right, and center. The platform also allows readers to rate the lean of each publication for further analysis.

#### Importing the Modules

Let's begin by loading the necessary modules for the project, which include the following:

* **requests**
* **BeautifulSoup**
* **csv**
* **re**
* **socket**

The `requests` module allows you to send HTTP requests using Python, which will enable you to inspect the structure of many different websites. From there, you will be able to scrape the relevant information from the website with the `BeautifulSoup` module. Regular expressions, via the `re` module, allow Python to look for text that follows a particular pattern. (After doing this a few times, you'll soon realize that certain outlets follow a pattern when publishing articles.)

The `csv` module will allow us to save our information to a CSV file for further analysis later on. There is nothing really important about the `socket` module; it's used just in case we run into a connection error during the extraction process.

In [7]:
import requests
from bs4 import BeautifulSoup
import csv
import re
import socket
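To make the point about URL patterns concrete, here is a minimal sketch of what `re` can pull out of an article link. Some outlets embed the publication date in the path as `/YYYY/MM/DD/`. The helper name `extract_url_date` is hypothetical and not part of this project's code; the two URLs are taken from the article links harvested later in this post.

```python
import re
from datetime import date

# Matches a /YYYY/MM/DD/ path segment, a layout some outlets use in article URLs.
DATE_IN_URL = re.compile(r'/(\d{4})/(\d{2})/(\d{2})/')


def extract_url_date(url):
    """Return the date embedded in the URL path, or None if there isn't one."""
    match = DATE_IN_URL.search(url)
    if match is None:
        return None
    year, month, day = (int(part) for part in match.groups())
    try:
        return date(year, month, day)
    except ValueError:
        # digits that look like a date but are not a real calendar day
        return None


urls = [
    'https://www.washingtonpost.com/immigration/coronavirus-trump-immigration/2020/04/21/a2a465aa-837a-11ea-9728-c74380d9d410_story.html',
    'https://www.foxnews.com/politics/trump-suspend-immigration-executive-order-coronavirus',
]

for url in urls:
    print(extract_url_date(url))
```

The first URL carries a date in its path and the second does not, so the helper returns a `date` for one and `None` for the other. Each outlet would need its own pattern, which is why the regular expression is kept separate from the function.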

The first mode of attack is figuring out how we can go about extracting these stories for our analysis. We can extract all stories from all publications, but what if we want a more targeted focus? One interesting feature of the All Sides website is that we can look for articles based on the topic.

![](https://kidquant.com/post/images/Analyzing-News/topics-allsides.PNG)

We can get articles on pretty much any topic: **Criminal Justice**, **Education**, and the **Economy**. All Sides even collects the news perspectives on the most important topic in the world right now, the Coronavirus. (Yes, our society is so divided right now that we've even managed to politicize a deadly virus.)

I'm going to use immigration as our topic; I feel it's pretty easy to understand where both sides (the political left and right) stand on this issue. 

The following code chunk builds the URLs of the All Sides listing pages for immigration stories, which reach back roughly two years.

In [8]:
# creates an empty list to store the story pages from AllSides.com
pages = []

# We only want to extract stories about immigration
story = 'Immigration'


def get_seed(n):
    """
    n defines the number of pages back to pull
    n=1 steps back to April 2018 (as of April 2020)
    """
    for i in range(0, n+1):
        url = 'https://www.allsides.com/story/admin?tid=&field_story_topic_tid=' + \
            str(story) + '&page=' + str(i)
        pages.append(url)

get_seed(1)

pages

['https://www.allsides.com/story/admin?tid=&field_story_topic_tid=Immigration&page=0',
 'https://www.allsides.com/story/admin?tid=&field_story_topic_tid=Immigration&page=1']

The output should be a list of page links, each pointing to a page of stories on the All Sides website.

![](https://kidquant.com/post/images/Analyzing-News/topics-allsides2.PNG)
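As an alternative to string concatenation, the same listing URLs can be assembled with `urllib.parse.urlencode`, which encodes spaces and other special characters in a topic name. This is only a sketch: `build_seed_urls` and `BASE_URL` are names made up for illustration, and the query parameters are copied from the URLs shown above.

```python
from urllib.parse import urlencode

BASE_URL = 'https://www.allsides.com/story/admin'


def build_seed_urls(topic, last_page):
    """Return listing-page URLs for pages 0..last_page of the given topic."""
    urls = []
    for page in range(last_page + 1):
        # dict order is preserved, so the query string matches the layout above
        query = urlencode({'tid': '', 'field_story_topic_tid': topic, 'page': page})
        urls.append(BASE_URL + '?' + query)
    return urls


seed_pages = build_seed_urls('Immigration', 1)
print(seed_pages)
```

Returning a new list instead of appending to a global `pages` list makes the function safe to call more than once without collecting duplicate URLs.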

We want to create functions that will allow us to parse through the HTML format of the [AllSides](https://www.allsides.com/unbiased-balanced-news) website so we can extract the URLs of these stories. After the extraction process is done, we're going to store all of these links into a list.

In [9]:
# set up BeautifulSoup to run over All Sides Media
link_harvest = []

# helper function to harvest and parse pages
def soup_basics(item):
    page = requests.get(item)
    soup = BeautifulSoup(page.text, 'html.parser')
    return soup

def harvest_links(pages):
    '''
    runs the parser over submitted pages
    identifies headline link content in the extracted page
    appends relevant links to a list
    '''
    for item in pages:
        soup = soup_basics(item)

        # Pull all headlines from the featured stories under class 'view-content'
        story_headline_list = soup.find(class_='view-content')
        # Pull headline/link text from all instances of <a> tag
        story_list_items = story_headline_list.find_all('a')

        # harvest the headline and link information
        for story_headline in story_list_items:
            #headline = story_headline.contents[0]
            #headline = headline.encode("utf8").strip()
            link = 'https://www.allsides.com'+story_headline.get('href')
            if '/story/' in link:
                link_harvest.append(link)

harvest_links(pages)

link_harvest[5:15]

['https://www.allsides.com/story/trump-restricts-travel-six-new-countries',
 'https://www.allsides.com/story/officials-uncertain-2021-goal-border-wall-completion',
 'https://www.allsides.com/story/ny-nj-enact-laws-let-undocumented-immigrants-apply-drivers-licenses-lices',
 'https://www.allsides.com/story/pentagon-watchdog-review-400m-border-contract',
 'https://www.allsides.com/story/trump-gets-jeers-after-colorado-wall-comments',
 'https://www.allsides.com/story/trump-administration-require-visa-applicants-be-able-afford-health-care',
 'https://www.allsides.com/story/honduras-will-accept-more-asylum-seekers',
 'https://www.allsides.com/story/judge-reinstates-injunction-against-asylum-ban',
 'https://www.allsides.com/story/incoming-palestinian-harvard-student-deemed-inadmissible-cbp',
 'https://www.allsides.com/story/statue-liberty-poem-revised']

The output above shows 10 of the links to AllSides stories on immigration. Next, we want to extract the URLs of the articles linked from each of these stories. The links can be found as follows:

![](https://kidquant.com/post/images/Analyzing-News/topics-allsides3.PNG)

We want to be able to locate the `<a>` tag whose `href` attribute contains the link to the article. We can do this with the `BeautifulSoup` module I mentioned earlier.

![](https://kidquant.com/post/images/Analyzing-News/topics-allsides4.PNG)

If we inspect the elements of the AllSides webpage, we can review its HTML and CSS structure. The link to the Reuters article sits in the `href` attribute of an `<a>` tag. We use the `soup.find_all` method to locate the title and the URL containers.

Once we have located the containers, we can extract the information inside, namely the title and the URL.
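Before running this against the live site, the idea can be tried on a small piece of HTML. The snippet below is a sketch under stated assumptions: `sample_html` is hand-written to imitate the structure described above (a container with class `news-title` wrapping an `<a>` tag), and the `example.com` links are placeholders rather than real articles.

```python
from bs4 import BeautifulSoup

# Hand-written HTML that imitates the described structure; not copied from the site.
sample_html = """
<div class="news-title">
  <a href="https://example.com/left-article">Left headline</a>
</div>
<div class="news-title">
  <a href="https://example.com/center-article">Center headline</a>
</div>
"""

soup = BeautifulSoup(sample_html, 'html.parser')

articles = []
for container in soup.find_all(class_='news-title'):
    for anchor in container.find_all('a'):
        title = anchor.get_text(strip=True)  # visible link text
        url = anchor.get('href')             # value of the href attribute
        articles.append((title, url))

print(articles)
```

Each entry pairs a title with its URL, which is the same pair of fields we want from every story page.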

In [10]:
# get all news article links
all_articles = []


def extract_articles(link_harvest):
    for link_content in link_harvest:
        soup = soup_basics(link_content)

        # locate relevant information within the extracted page
        substory_list = soup.find_all(class_='news-title')

        # loop through the different news sources within each major news story
        for i in range(0, len(substory_list)):
            substory_items = substory_list[i].find_all('a')
            for substory_headline in substory_items:
                link = substory_headline.get('href')
                all_articles.append(link)

extract_articles(link_harvest)

all_articles[5:10]

['https://www.washingtonexaminer.com/opinion/just-imagine-if-trump-had-stopped-immigration-in-early-february',
 'https://www.washingtonpost.com/immigration/coronavirus-trump-immigration/2020/04/21/a2a465aa-837a-11ea-9728-c74380d9d410_story.html',
 'https://www.reuters.com/article/us-health-coronavirus-usa/u-s-coronavirus-response-deepens-divide-as-trump-suspends-immigration-idUSKCN2231TU',
 'https://www.foxnews.com/politics/trump-suspend-immigration-executive-order-coronavirus',
 'https://www.foxnews.com/politics/court-hands-trump-win-in-sanctuary-city-grant']

Now that we've extracted all of the necessary article links, we want to save them for later research. The following chunk writes all of the links to a single text file.

In [11]:
# save news article links to file, one URL per line
# the with-block closes the file so every line is flushed to disk
with open('link_file.txt', 'w') as link_file:
    for item in all_articles:
        link_file.write("%s\n" % item)
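When coming back to the file later, the links can be read in again with a few lines. The sketch below assumes the one-URL-per-line layout written above; `load_links` is a hypothetical helper, and the demo uses a temporary file with placeholder URLs so it does not touch the real `link_file.txt`.

```python
import os
import tempfile


def load_links(path):
    """Read one URL per line, skipping blanks and repeats while keeping order."""
    seen = set()
    links = []
    with open(path, encoding='utf-8') as handle:
        for line in handle:
            url = line.strip()
            if url and url not in seen:
                seen.add(url)
                links.append(url)
    return links


# Demo on a throwaway file so nothing in the working directory is touched.
demo_path = os.path.join(tempfile.mkdtemp(), 'link_file.txt')
with open(demo_path, 'w', encoding='utf-8') as handle:
    handle.write('https://example.com/a\nhttps://example.com/b\n\nhttps://example.com/a\n')

loaded = load_links(demo_path)
print(loaded)
```

Dropping repeats is a design choice for this sketch: the same article can be linked from more than one story, and downloading it once is enough.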

Finally, we want to create a function that allows us to extract the important information for the articles we're planning on extracting. This information includes the following: **Date**, **Headline**, **Story Description**, **Source/Publication**, **Bias/Lean**, and **Link**.

We're going to use the same technique to extract this information as we did with the article URLs. The only difference is that this information is located through HTML class attributes. As long as we know the exact attribute names, we can extract the relevant information.

Afterward, we will review the results of the function by using `pandas` to open the CSV file.
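One cleaning step in the function below deserves a closer look: `re.sub` with the pattern `\W+` replaces every run of non-word characters with a single space. The sketch below applies it to a made-up headline (not taken from the site) and compares it with a gentler alternative that only normalizes whitespace.

```python
import re

raw_headline = "\n  Trump's Travel-Ban Expansion: What's Next?  \n"  # made-up sample

# Approach used in the function below: collapse every run of non-word
# characters into one space, then drop the first and last character.
regex_cleaned = re.sub(r'\W+', ' ', raw_headline)[1:][:-1]

# Alternative: only normalize whitespace, leaving punctuation intact.
whitespace_cleaned = ' '.join(raw_headline.split())

print(regex_cleaned)
print(whitespace_cleaned)
```

The regex version also strips apostrophes, hyphens, and colons, so "Trump's" becomes "Trump s". That may be fine for a quick look at the data, but the whitespace-only version is likely the better fit if the headlines are going to be fed into sentiment analysis later.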

In [12]:
import pandas as pd


def csv_encoder(text_string):
    coded = text_string.encode("utf-8").strip()
    return coded

# extract all content
def extract_content(link_harvest):
    '''
    for each story, pulls the shared news headline, date, and summary description
    for each news source, identifies the source bias (liberal, conservative, center) & outgoing link
    uses re and .contents to clean harvested text
    writes collected, cleaned data to csv
    '''

    # open csv file to store info
    file = open('allsides-content.csv', 'w', newline="", encoding='utf-8')
    fileWriter = csv.writer(file)
    fileWriter.writerow(['Date', 'AllMedia_Headline', 'Description',
                         'Source_Name', 'Source_Bias', 'Source_Headline', 'Source_Link'])

    try:
        for link_content in link_harvest:
            soup = soup_basics(link_content)

            # locate relevant information within the extracted page
            story_headline = soup.find(class_='taxonomy-heading')
            story_date = soup.find(property='dc:date')
            story_description = soup.find(class_='story-id-page-description')
            substory_source = soup.find_all(class_='news-source')
            substory_bias = soup.find_all(class_='global-bias')
            substory_list = soup.find_all(class_='news-title')

            # loop through the different news sources within each major news story
            n = 0
            for i in range(0, len(substory_list)):
                substory_items = substory_list[i].find_all('a')
                for substory_headline in substory_items:

                    # extracting the date
                    clean_date = story_date.contents[0]

                    # extracting the headline
                    clean_headline = re.sub(
                        r'\W+', ' ', story_headline.contents[0])[1:][:-1]
                    clean_headline = csv_encoder(clean_headline)

                    # extracting the description
                    try:
                        clean_description = str(
                            story_description.contents[1].text)
                    except IndexError:
                        clean_description = 'Null'
                        
                    # clean_authors = 'None'

                    # extracting the story source
                    try:
                        clean_source = substory_source[n].contents[1]
                    except (AttributeError, IndexError):
                        clean_source = 'Unknown'

                    # extracting the lean and the bias
                    clean_bias = substory_bias[n]
                    clean_bias = re.sub(
                        r'\W+', ' ', clean_bias.contents[0])[10:]
                    n = n+1

                    # extracting the source's own headline
                    headline = substory_headline.contents[0]
                    headline = csv_encoder(headline)

                    # extracting the link
                    link = substory_headline.get('href')

                    fileWriter.writerow(
                        [clean_date, clean_headline, clean_description, clean_source, clean_bias, headline, link])

    except socket.error as err:
        # stop early on a connection problem; rows written so far are kept
        print('Socket connection error, stopping early:', err)

    file.close()

# running the function
extract_content(link_harvest)

# showing the results of the csv
pd.read_csv('allsides-content.csv').head(5)


| | Date | AllMedia_Headline | Description | Source_Name | Source_Bias | Source_Headline | Source_Link |
|---|---|---|---|---|---|---|---|
| 0 | June 22nd, 2020 | b'Trump Admin Suspends Certain Visas Through E... | The Trump administration announced Monday that... | BuzzFeed News | Left | b'Trump Is Suspending Certain Visas For Foreig... | https://www.buzzfeednews.com/article/adolfoflo... |
| 1 | June 22nd, 2020 | b'Trump Admin Suspends Certain Visas Through E... | The Trump administration announced Monday that... | Reuters | Center | b'Trump to suspend entry of certain foreign wo... | https://www.reuters.com/article/us-usa-immigra... |
| 2 | June 22nd, 2020 | b'Trump Admin Suspends Certain Visas Through E... | The Trump administration announced Monday that... | The Daily Caller | Right | b'Trump To Suspend Visas Through End Of The Ye... | https://dailycaller.com/2020/06/22/exclusive-t... |
| 3 | April 24th, 2020 | b'Perspectives Trump s Immigration Executive O... | President Donald Trump's executive order suspe... | Reuters | Center | b"Inside Trump's proposal to suspend some lega... | https://af.reuters.com/article/worldNews/idAFK... |
| 4 | April 24th, 2020 | b'Perspectives Trump s Immigration Executive O... | President Donald Trump's executive order suspe... | CNN - Editorial | Left | b"Trump's moves on immigration reveal his true... | https://www.cnn.com/2020/04/22/opinions/trump-... |


We're showing the first five rows of the CSV we created. So far, everything looks satisfactory.
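One detail worth noticing is the `b'...'` prefix in the two headline columns. It appears because `csv_encoder` returns `bytes`, and `csv.writer` turns any non-string value into text with `str()`. The sketch below reproduces the effect in memory with a made-up headline; it illustrates the behavior and is not a change to the function above.

```python
import csv
import io

headline = '  Sample headline  '  # made-up text, padded like scraped content

buffer = io.StringIO()
writer = csv.writer(buffer)

# bytes, as csv_encoder returns: the writer stores str(value), prefix included
writer.writerow([headline.encode('utf-8').strip()])

# plain str: the text is written as-is
writer.writerow([headline.strip()])

rows = buffer.getvalue().splitlines()
print(rows)
```

Since the CSV file is already opened with `encoding='utf-8'`, passing plain strings to the writer should be enough to keep the text clean for the later analysis.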

Keep in mind that this DataFrame still lacks some of the information needed to analyze news articles, such as the actual body text of each article (and perhaps its authors). In Part 2, we will use the same CSV to extract that information.