## Analyzing News Articles With Python (Part 1)

*By Andre Sealy (Updated 6/21/2020)*

This is the first part of what I expect to be a multi-part series on analyzing news articles with Python.

America has arguably never been more polarized than it is today, and the news we consume is a function of that polarization. Right-leaning people tend to get their information from Fox News, The Wall Street Journal, National Review, etc. People who lean left tend to get their information from MSNBC, The New York Times, The Huffington Post, etc. There are a handful of neutral publications (Reuters, Politico, Associated Press, etc.), but even their alignment comes into question based on how they report the news.

This project will attempt to analyze how polarizing mainstream articles are for stories typical of the U.S. news cycle, and to measure the sentiment of each item. To help guide us in this project, we will be using the website All Sides.

### All Sides News Aggregator

All Sides is a bipartisan organization that takes a more balanced approach to news coverage by collecting the top headlines of the day and showcasing reporting from news outlets on the left, right, and center. The platform also allows readers to rate the lean of each publication for further analysis.

#### Importing the Modules

Let's begin by loading the necessary modules for the project, which include the following:

* **requests**
* **BeautifulSoup**
* **csv**
* **re**
* **socket**

The `requests` module allows you to send HTTP requests using Python, which lets you retrieve the HTML of many different websites. From there, you can scrape the relevant information with the `BeautifulSoup` module. Regular expressions, provided by the `re` module, let Python look for text that follows a particular pattern. (After doing this several times, you'll realize that certain outlets follow a pattern when publishing articles.)

The `csv` module allows us to save our information to a CSV file for further analysis later on. There is nothing really important about the `socket` module; it's used just in case we run into a connection error during the extraction process.

In [None]:
import requests
from bs4 import BeautifulSoup
import csv
import re
import socket
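
As a small illustration of the pattern matching that `re` makes possible, the sketch below pulls the date stamp out of an AllSides news URL of the kind harvested later in this post. The helper name `parse_news_stamp` is hypothetical, and reading the trailing four digits as a time of day is an assumption based on the URLs shown here, not something AllSides documents.

```python
import re

# AllSides news links look like /news/2019-12-26-0602/<slug>
NEWS_STAMP = re.compile(r'/news/(\d{4})-(\d{2})-(\d{2})-(\d{4})/')


def parse_news_stamp(url):
    """Return (year, month, day, hhmm) strings from a news URL, or None."""
    match = NEWS_STAMP.search(url)
    return match.groups() if match else None


example = ('https://www.allsides.com/news/2019-12-26-0602/'
           'house-democrats-are-contemplating-adding-new-articles-impeachment')
print(parse_news_stamp(example))
```

URLs that do not follow the pattern, such as the `/blog/` story pages, simply return `None`.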

The first mode of attack is figuring out how we can go about extracting these stories for our analysis. We can extract all stories from all publications, but what if we want a more targeted focus? One interesting feature of the All Sides website is that we can look for articles based on the topic.

![](https://kidquant.com/post/images/Analyzing-News/topics-allsides.PNG)

We can get articles on pretty much any topic, such as **Criminal Justice**, **Education**, and the **Economy**. All Sides even collects the news perspectives on the most important topic in the world right now, the Coronavirus. (Yes, our society is so divided right now that we've even managed to politicize a deadly virus.)

I'm going to use immigration as our topic; I feel it's pretty easy to understand where both sides (the political left and right) stand on this issue. 

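The following code chunk collects the AllSides pages that list the most important news stories for a topic tag over roughly the last two years. Note that the run shown below uses the `donald-trump` tag, so its output covers Trump-related stories.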

In [None]:
# creates an empty list to store the story pages from AllSides.com
pages = []

# tag of the stories to extract; this run uses the 'donald-trump' tag
story = 'donald-trump'


def get_seed(n):
    """
    n defines the number of pages back to pull
    n=1 steps back to April 2018 (as of April 2020)
    """

    for i in range(0, n+1):
        url = 'https://www.allsides.com/blog/tags/' + \
            str(story) + '?page=' + str(i)
        pages.append(url)

get_seed(9)

pages

['https://www.allsides.com/blog/tags/donald-trump?page=0',
 'https://www.allsides.com/blog/tags/donald-trump?page=1',
 'https://www.allsides.com/blog/tags/donald-trump?page=2',
 'https://www.allsides.com/blog/tags/donald-trump?page=3',
 'https://www.allsides.com/blog/tags/donald-trump?page=4',
 'https://www.allsides.com/blog/tags/donald-trump?page=5',
 'https://www.allsides.com/blog/tags/donald-trump?page=6',
 'https://www.allsides.com/blog/tags/donald-trump?page=7',
 'https://www.allsides.com/blog/tags/donald-trump?page=8',
 'https://www.allsides.com/blog/tags/donald-trump?page=9']

The output is a list of page URLs, each of which lists stories on the All Sides website.
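
If you would rather not rely on the global `pages` list, the same URLs can be built by a function that returns them. This is a minimal sketch of that alternative; `build_page_urls` is a hypothetical name, and the URL layout is taken from the output above.

```python
def build_page_urls(tag, n):
    """Return listing URLs for pages 0..n of an AllSides blog tag."""
    base = 'https://www.allsides.com/blog/tags/'
    return [f'{base}{tag}?page={i}' for i in range(n + 1)]


pages = build_page_urls('donald-trump', 9)
print(len(pages))   # pages are numbered 0 through n, so n=9 gives 10 URLs
print(pages[-1])
```

Returning the list makes it easy to build pages for several tags without one call appending to another's results.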

![](https://kidquant.com/post/images/Analyzing-News/topics-allsides2.PNG)

We want to create functions that parse the HTML of the [AllSides](https://www.allsides.com/unbiased-balanced-news) website so we can extract the URLs of these stories. After the extraction process is done, we're going to store all of these links in a list.

In [None]:
def soup_basic(item):
    # Send a request to the URL and get the response
    response = requests.get(item)

    # Parse the HTML content of the response using BeautifulSoup
    soup = BeautifulSoup(response.text, 'html.parser')
    return soup

In [None]:
# set up BeautifulSoup to run over All Sides Media
link_harvest = []

# helper function to harvest and parse pages
def soup_basics(item):
    # Send a request to the URL and get the response
    response = requests.get(item)

    # Parse the HTML content of the response using BeautifulSoup
    soup = BeautifulSoup(response.content, 'html.parser')

    return soup


def harvest_links(pages):
    '''
    runs the parser over submitted pages
    identifies headline link content in the extracted page
    appends relevant links to a list
    '''
    for item in pages:
        soup=soup_basics(item)

        # Find all the links on the page that point to blog posts
        blog_links = []
        for link in soup.find_all('a'):
            href = link.get('href')
            if href and '/blog/' in href:
                blog_links.append(href)

        # Print the list of blog links
        # print(blog_links)
        for i in blog_links:
            # prefix site-relative links with the domain
            if 'https://www.allsides.com/' in i:
                link = i
            else:
                link = 'https://www.allsides.com/' + i
            # skip tag index pages; keep only story links
            if '/tags/' not in link:
                link_harvest.append(link)

harvest_links(pages)

# story_headline_list
unique_link_harvest= list(set(link_harvest))
unique_link_harvest
# link_harvest

['https://www.allsides.com//../blog/rnc-2020-speakers-try-inspire-base-engage-undecided-voters',
 'https://www.allsides.com//blog/when-donald-trump-gets-back-twitter',
 'https://www.allsides.com//blog/pro-trump-rally-BLM-protest-odd-couples',
 'https://www.allsides.com//../blog/coronavirus-reaches-president-trump-and-white-house',
 'https://www.allsides.com//blog/both-sides-urge-transparency-mar-lago-raid',
 'https://www.allsides.com//blog/trump%E2%80%99s-economic-plan-other-media-contrasts-week',
 'https://www.allsides.com//../blog/trump-vs-biden-gun-rights-and-gun-control-explained-two-minutes',
 'https://www.allsides.com//../blog/trump-vs-biden-education-policy-explained-two-minutes',
 'https://www.allsides.com//../blog/house-launches-trump-impeachment-inquiry',
 'https://www.allsides.com//blog/media-bias-alert-what-left-and-right-are-omitting-about-mar-lago-fbi-raid',
 'https://www.allsides.com//../blog/fact-check-what-does-scotus-travel-ban-ruling-actually-mean',
 'https://www.all
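
The doubled slashes and `..` segments in the harvested links come from joining the domain and the `href` with plain string concatenation. The links still point at the right pages, but a cleaner form is easy to get. Here is a minimal sketch using the standard library's `urljoin`; the relative `href` values are inferred from the output above, and `normalize_link` is a hypothetical helper name.

```python
from urllib.parse import urljoin

BASE = 'https://www.allsides.com/'


def normalize_link(href, base=BASE):
    """Resolve href against base the way a browser would."""
    return urljoin(base, href)


for href in ['/blog/when-donald-trump-gets-back-twitter',
             '/../blog/house-launches-trump-impeachment-inquiry',
             'https://www.allsides.com/blog/both-sides-urge-transparency-mar-lago-raid']:
    print(normalize_link(href))
```

Normalizing before calling `set()` also helps deduplication, since `/blog/x` and `/../blog/x` then collapse to the same URL.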

The harvested list contains links to AllSides stories. Now we want to extract the URLs of the articles covered in each story. The links can be found as follows:

![](https://kidquant.com/post/images/Analyzing-News/topics-allsides3.PNG)

We want to locate the `<a>` tag whose `href` attribute contains the link to the article. We can do this using the `BeautifulSoup` module I mentioned earlier.

![](https://kidquant.com/post/images/Analyzing-News/topics-allsides4.PNG)

If we inspect the elements of the AllSides webpage, we can review its HTML and CSS structure. The link to the Reuters article is stored in the `href` attribute of an `<a>` tag. We use the `soup.find_all` method to locate the title and the URL containers.

Once we have located the containers, we can extract the information inside, namely the title and the URL.
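
To make the idea concrete without depending on the live site, here is a dependency-free sketch that uses the standard library's `html.parser` instead of BeautifulSoup to pull the title and URL out of each `<a>` tag. The markup is made up for illustration, and `LinkCollector` is a hypothetical class name.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (title, url) pairs from every <a href="..."> in a page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        # remember the href while we are inside an <a> element
        if tag == 'a':
            self._href = dict(attrs).get('href')
            self._text = []

    def handle_data(self, data):
        # text inside the current <a> element is its title
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == 'a' and self._href is not None:
            self.links.append((''.join(self._text).strip(), self._href))
            self._href = None


# made-up markup standing in for a fragment of an AllSides page
html = '<div><a href="https://example.com/story">Example headline</a></div>'
collector = LinkCollector()
collector.feed(html)
print(collector.links)
```

BeautifulSoup does the same work with far less code, which is why the rest of this post relies on it.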

Now that we've extracted all of the necessary articles, we want to save them for research later on. A later chunk writes all of the links to a single text file.

Finally, we want to create a function that allows us to extract the important information for the articles we're planning on extracting. This information includes the following: **Date**, **Headline**, **Story Description**, **Source/Publication**, **Bias/Lean**, and **Link**.

We're going to use the same technique to extract this information as we did with the article URLs. The only difference is that this information is located in specific HTML attributes, such as `class` and `property`. As long as we know the exact attributes, we can extract the relevant information.

Afterward, we will review the results of the function by using `pandas` to open the CSV file.
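
Several of the fields used later (`og:title`, `og:url`, `og:updated_time`) live in `<meta property="...">` tags. The sketch below shows how such tags map to values, again using the standard library's `html.parser` rather than BeautifulSoup so that it runs without a network connection. The HTML fragment is made up, and `MetaCollector` is a hypothetical name.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Map each <meta property="..."> to its content attribute."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'meta' and 'property' in attrs:
            self.meta[attrs['property']] = attrs.get('content')


# made-up <head> fragment; real pages carry many more tags
html = '''
<meta property="og:title" content="Example headline" />
<meta property="og:url" content="https://example.com/story" />
'''
parser = MetaCollector()
parser.feed(html)
print(parser.meta.get('og:title'))
print(parser.meta.get('og:updated_time'))  # missing tags come back as None
```

With BeautifulSoup, the equivalent is to find the tag by its `property` and read its `content` attribute with `tag.get('content')`, which avoids splitting the tag's string form on quote characters.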

In [None]:
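def find_bias(st):
    """Return the bias label found in st, or None if no label is present."""
    substory_bias = None
    # the 'Lean ...' labels come last so they override the plain
    # 'Left'/'Right' match they also contain
    lst = ['Left', 'Right', 'Center', 'Lean Left', 'Lean Right']
    for i in lst:
        if i in str(st):
            substory_bias = i
    return substory_bias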



In [None]:
# get all news article links
all_articles = []


def extract_articles(link):
    for link_content in link:
        soup = soup_basics(link_content)

        # locate relevant information within the extracted page
        substory_list = soup.find_all(class_='blog-content-wrapper')

        # only process pages whose content mentions 'Snippets '
        if 'Snippets ' in str(substory_list):
            # collect every link that points to an AllSides news page
            for substory in substory_list:
                for substory_headline in substory.find_all('a'):
                    href = substory_headline.get('href')
                    if href and '/news/20' in href:
                        all_articles.append(href)

extract_articles(unique_link_harvest)
unique_all_articles= list(set(all_articles))
unique_all_articles
unique_all_articles[0:99]

['https://www.allsides.com/news/2019-12-26-0602/house-democrats-are-contemplating-adding-new-articles-impeachment',
 'https://www.allsides.com/news/2019-11-21-0734/trump-says-its-all-over-impeachment-inquiry-after-sondland-testimony',
 'https://www.allsides.com/news/2019-02-14-1633/trumps-emergency-declaration-would-trigger-drawn-out-legal-fight?utm_source=newsletter&utm_medium=Story&utm_campaign=USATODAYTrumpsEmergencyDeclarationWouldTriggerDrawnOutLegalFight&utm_source=AllSides+Mailing+List&utm_campaign=f8edc1488d-National_Emergency_Border_Wall_02142019&utm_medium=email&utm_term=0_0b086ce741-f8edc1488d-',
 'http://www.allsides.com/news/2016-10-12-0731/politics-us-climate-change?utm_source=AllSides+Mailing+List&utm_campaign=d946eaa764-Allegations_Against_Trump_1013016&utm_medium=email&utm_term=0_0b086ce741-d946eaa764-',
 'https://www.allsides.com/news/2019-09-11-1423/why-donald-trumps-plan-host-taliban-prestigious-camp-david-stirred-bipartisan',
 'https://www.allsides.com/news/2020-01
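
Some of the harvested links carry `utm_*` tracking parameters, so the same article can end up in the set more than once under different URLs. One way to handle this, sketched here with the standard library, is to strip those parameters before deduplicating. `strip_tracking` is a hypothetical helper, and the example URL is a shortened version of one from the output above.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode


def strip_tracking(url):
    """Drop utm_* query parameters and return the cleaned URL."""
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
            if not k.startswith('utm_')]
    return urlunsplit(parts._replace(query=urlencode(kept)))


url = ('http://www.allsides.com/news/2016-10-12-0731/politics-us-climate-change'
       '?utm_source=AllSides+Mailing+List&utm_medium=email')
print(strip_tracking(url))
```

Query parameters that are not tracking-related are kept, so links that genuinely depend on their query string are unaffected.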

In [None]:
import pandas as pd

socket.socket

def csv_encoder(text_string):
    coded = text_string.encode("utf-8").strip()
    return coded

# extract all content
def extract_content(link_harvest):
    '''
    for each story, pulls the shared news headline, date, and summary description
    for each news source, identifies the source bias (liberal, conservative, center) & outgoing link
    uses re and .contents to clean harvested text
    writes collected, cleaned data to csv
    '''

    # open csv file to store info
    file = open('allsides-content-f.csv', 'w+', newline="", encoding='utf-8')
    fileWriter = csv.writer(file)
    fileWriter.writerow(['Date', 'Description',
                         'Source_Name', 'Source_Bias', 'Source_Headline', 'Source_Link'])

    try:
        for i in range(len(unique_all_articles)):
            soup = soup_basics(unique_all_articles[i])


            story_date = soup.find(property="og:updated_time")
            parts = str(story_date).split('"')
            if len(parts)>1 :
              clean_date = parts[1]
            else:
              clean_date = None
            clean_date


            lst_line=[]
            lines = str(soup).split("\n")
            for line in lines:
              if 'description' in line : 
                lst_line.append(line)
            try:
                    if len(lst_line[0].split('"'))>2:
                        if 'og:description' in lst_line[0].split('"') : clean_description=lst_line[0].split('"')[0]
                        else :  clean_description=lst_line[0].split('"')[1]
            except: continue

            substory_source = soup.find(property='og:url')
            parts = str(substory_source).split('"')
            if len(parts) >1 :
              clean_source = parts[1]
            else:
              clean_source = None
            clean_source

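            # re-fetch the same article (not all_articles[i], whose order differs)
            # so bias and title match the date and description above
            soup = soup_basic(unique_all_articles[i])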
            lst_line=[]
            lines = str(soup).split("\n")
            for line in lines:
              if 'field-content">From The' in line : 
                lst_line.append(line)
            clean_bias = find_bias(lst_line)
            clean_bias


            substory_list = soup.find(property='og:title')
            parts = str(substory_list).split('"')
            if len(parts) >1 :
              clean_list = parts[1]
            else:
              clean_list = None
            
            
            fileWriter.writerow(
                [clean_date, clean_description,clean_list , clean_bias, clean_source, unique_all_articles[i]])

    except socket.error as err:
        # no retry loop exists here, so report the error and stop;
        # rows written before the failure are kept in the CSV
        print('Socket connection error:', err)

    file.close()

# running the function
extract_content(unique_all_articles)

# showing the results of the csv
pd.read_csv('allsides-content-f.csv').head(45)


| | Date | Description | Source_Name | Source_Bias | Source_Headline | Source_Link |
|---|---|---|---|---|---|---|
| 0 | 2019-12-26T06:02:18-08:00 | The Democrats' partisan impeachment push is ge... | RNC 2020: Melania Trump, Mike Pompeo Buck Trad... | Center | https://www.allsides.com/news/2019-12-26-0602/... | https://www.allsides.com/news/2019-12-26-0602/... |
| 1 | 2019-11-21T07:34:08-08:00 | President Trump on Wednesday said that Ambassa... | Trump’s big RNC challenge: Reframing pandemic ... | Center | https://www.allsides.com/news/2019-11-21-0734/... | https://www.allsides.com/news/2019-11-21-0734/... |
| 2 | 2019-02-14T16:37:59-08:00 | President Donald Trump will declare a national... | OPINION: The RNC’s aspirational message couldn... | Right | https://www.allsides.com/news/2019-02-14-1633/... | https://www.allsides.com/news/2019-02-14-1633/... |
| 3 | 2016-10-12T07:31:05-07:00 | Sixth in a 10-part weekly series. The Politics... | ANALYSIS: Like Joe Rogan, The RNC Will Thrive ... | Right | https://www.allsides.com/news/2016-10-12-0731/... | http://www.allsides.com/news/2016-10-12-0731/p... |
| 4 | 2019-09-11T14:23:57-07:00 | WASHINGTON – President Donald Trump faced blow... | ANALYSIS: The RNC portrays Trump as he wants t... | Left | https://www.allsides.com/news/2019-09-11-1423/... | https://www.allsides.com/news/2019-09-11-1423/... |
| 5 | 2020-01-16T07:22:14-08:00 | Senate Republicans vulnerable to a Democratic ... | OPINION: What country does Mike Pence live in? | Left | https://www.allsides.com/news/2020-01-16-0722/... | https://www.allsides.com/news/2020-01-16-0722/... |
| 6 | 2019-08-21T08:12:02-07:00 | New allegations of Donald Trump groping women ... | Trump Continues Virus Fight: 'Maybe I'm Immune' | Right | https://www.allsides.com/news/2016-10-13-0833/... | http://www.allsides.com/news/2016-10-13-0833/t... |
| 7 | 2019-03-21T13:16:59-07:00 | DAYS AFTER a gunman carried out a horrifying a... | OPINION: Cal Thomas: Why Trump's COVID experie... | Right | https://www.allsides.com/news/2019-03-21-1316/... | https://www.allsides.com/news/2019-03-21-1316/... |
| 8 | 2018-07-19T15:11:57-07:00 | He again helped Putin cover up Russia’s attack... | President Trump has tested positive for COVID-... | Lean Left | https://www.allsides.com/news/2018-07-19-1511/... | https://www.allsides.com/news/2018-07-19-1511/... |
| 9 | 2019-05-10T07:17:07-07:00 | A president who talks and acts tough on trade ... | President Trump has Covid-19: How global media... | Center | https://www.allsides.com/news/2019-05-10-0717/... | https://www.allsides.com/news/2019-05-10-0717/... |
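
`extract_content` stops at the first connection error, which can waste a long scrape. A retry wrapper is one way to keep going. The sketch below is a generic version under stated assumptions: `fetch_with_retry` is a hypothetical helper, it takes any fetching function (for example `soup_basics`), and the stand-in `flaky_fetch` only simulates a dropped connection.

```python
import time


def fetch_with_retry(fetch, url, retries=3, wait=10, sleep=time.sleep):
    """Call fetch(url); on OSError wait and retry, up to `retries` attempts."""
    for attempt in range(1, retries + 1):
        try:
            return fetch(url)
        except OSError as err:  # socket.error is an alias of OSError in Python 3
            if attempt == retries:
                raise
            print(f'Connection error ({err}); retrying in {wait} seconds.')
            sleep(wait)


# usage sketch with a stand-in fetcher that fails once, then succeeds
calls = []


def flaky_fetch(url):
    calls.append(url)
    if len(calls) == 1:
        raise OSError('simulated dropped connection')
    return 'page for ' + url


result = fetch_with_retry(flaky_fetch, 'https://example.com', wait=0)
print(result)
```

Inside the scraping loop, a call such as `fetch_with_retry(soup_basics, url)` would replace the direct `soup_basics(url)` call.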


In [None]:
d=pd.read_csv('allsides-content-f.csv')

In [None]:
# get all news article links

def extract_articles(link):
    for i in range(len(d['Source_Headline'])):
        # Source_Headline holds the AllSides page URL for each article
        soup = soup_basics(d['Source_Headline'][i])
        # the read-more-story element is expected to hold the outgoing link
        substory_list = soup.find_all(class_='read-more-story')

        try:
            link = re.search(r'(?P<url>https?://[^\s]+)', str(substory_list)).group('url')
        except AttributeError:
            # no outgoing URL found on this page
            continue
        # drop the trailing character (normally the closing quote)
        link = link[:-1]
        # .loc writes into d itself and avoids chained assignment
        d.loc[i, 'Source_Link'] = link
extract_articles(d.Source_Headline)


In [None]:
# get the lead image for each article link
# 0 marks "no image found"; object dtype lets the column also hold URL strings
d['Image_link'] = 0
d['Image_link'] = d['Image_link'].astype(object)

def extract_articles(link):
    for i in range(len(d['Source_Link'])):
        try:
            soup = soup_basics(d['Source_Link'][i])
            # the og:image meta tag carries the article's lead image
            story_image = soup.find(property="og:image")
            link = re.search(r'(?P<url>https?://[^\s]+)', str(story_image)).group('url')
        except Exception:
            # skip rows whose page cannot be fetched or has no og:image URL
            continue
        # drop the trailing character (normally the closing quote)
        link = link[:-1]
        # .loc writes into d itself and avoids chained assignment
        d.loc[i, 'Image_link'] = link
extract_articles(d['Source_Link'])


We're showing the first rows of the CSV we created. So far, everything looks satisfactory.

Keep in mind that this DataFrame still lacks information needed for analyzing news articles, such as the body text of each article (and perhaps its authors). In Part 2, we will use the same CSV to extract that information.

In [None]:
d

In [None]:
file_l = list(d.Source_Link)

In [None]:
with open('link_file.txt', 'w') as file:
    # iterate over the list and write each string to a new line in the file
    for string in file_l:
        file.write(string + '\n')
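
Because `d.dropna()` runs only after the file is written, `file_l` could still contain a missing value, which pandas represents as a float `NaN`; `string + '\n'` would then raise a `TypeError`. A defensive version of the writer, sketched here with a hypothetical `write_links` helper and made-up example links, skips anything that is not a non-empty string.

```python
import os
import tempfile


def write_links(path, links):
    """Write one link per line, skipping missing (non-string) values."""
    written = 0
    with open(path, 'w', encoding='utf-8') as file:
        for link in links:
            if isinstance(link, str) and link:
                file.write(link + '\n')
                written += 1
    return written


# write to a temporary folder so the example leaves no file behind in the project
path = os.path.join(tempfile.mkdtemp(), 'link_file.txt')
count = write_links(path, ['https://example.com/a', float('nan'), 'https://example.com/b'])
print(count)
```

Calling `write_links('link_file.txt', file_l)` would produce the same file as the cell above when every entry is a string.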

In [None]:
d=d.dropna()

In [None]:
d.to_csv("allsides-content-f.csv", index=False)