# Web Scraping with Python requests, urllib and BeautifulSoup

A web scraping tutorial using Python's requests library and BeautifulSoup. We scrape the world news section of Reuters and retrieve four things for each article: the title, the first line of the article, the URL of the article, and the time the article was written.

### Method 1 : Using python requests and BeautifulSoup

In [1]:
import requests
from bs4 import BeautifulSoup

In [2]:
url = 'http://www.reuters.com/news/archive/worldNews?view=page'

In [3]:
result = requests.get(url)

In [4]:
result.status_code

# Check the HTTP status code; 200 means the request succeeded

200

In [5]:
result.raise_for_status()

# Returns None (so the cell shows no output) if the request succeeded. Otherwise, it raises an HTTPError

In [6]:
result.status_code == requests.codes.ok

# Evaluates to True if the status code is 200 (OK)

True
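
The three checks above can be folded into one helper. The following is a minimal sketch, not part of the original notebook: the name `fetch_html` and the 10-second timeout are assumptions chosen for illustration.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the response body as bytes, or None if the request fails.

    fetch_html is a hypothetical helper name; the timeout value is an assumption.
    """
    try:
        response = requests.get(url, timeout=timeout)
        # Raises requests.HTTPError for 4xx/5xx responses
        response.raise_for_status()
    except requests.RequestException as error:
        # Covers connection errors, timeouts, invalid URLs and HTTP errors
        print(f'Request failed: {error}')
        return None
    return response.content


# Usage (performs a real network request):
# content = fetch_html('http://www.reuters.com/news/archive/worldNews?view=page')
```

Returning `None` on failure keeps the calling cell simple, at the cost of hiding the exception type; re-raising may be preferable in a longer script.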

In [7]:
content = result.content

In [8]:
# Decode the bytes, trim leading and trailing whitespace, remove tabs, then split on newlines
# This lets us view the structure of the HTML and decide which tags to extract
content.decode().strip().replace('\t','').split('\n')

['<!--[if !IE]> This has been served from cache <![endif]-->',
 '<!--[if !IE]> Request served from apache server: produs--i-ede3bd7e <![endif]-->',
 '<!--[if !IE]> Cached on Sat, 17 Jun 2017 13:37:13 GMT and will expire on Sat, 17 Jun 2017 13:42:20 GMT <![endif]-->',
 '<!--[if !IE]> token: 85f27e91-1f6a-4c48-af2c-7f02ce75a87f <![endif]-->',
 '<!--[if !IE]> Tomcat Server /produs--i-04d0ea63c73100166/ <![endif]-->',
 '',
 '<!--[if !IE]> App Server /produs--i-04d0ea63c73100166/ <![endif]-->',
 '',
 '<!doctype html><html lang="en"><head>',
 '<title>World News Headlines  | Reuters</title>',
 '    <meta http-equiv="X-UA-Compatible" content="IE=edge"><meta charset="utf-8"><meta http-equiv="x-dns-prefetch-control" content="on"><link rel="dns-prefetch" href="//s1.reutersmedia.net"/><link rel="dns-prefetch" href="//s2.reutersmedia.net"/><link rel="dns-prefetch" href="//s3.reutersmedia.net"/><link rel="dns-prefetch" href="//s4.reutersmedia.net"/><link rel="dns-prefetch" href="//static.reuters.com
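
Another way to inspect the structure is BeautifulSoup's `prettify()`, which re-indents the parsed tree one tag per line. The sketch below runs on a small inline document (`sample_html` is made up for illustration) so that it works without a network request.

```python
from bs4 import BeautifulSoup

# sample_html is a made-up stand-in for the downloaded page
sample_html = (
    '<html><head><title>World News</title></head>'
    '<body><h3 class="story-title">Example headline</h3></body></html>'
)
sample_soup = BeautifulSoup(sample_html, 'html.parser')

# prettify() returns a string with one tag per line, indented by nesting depth
pretty = sample_soup.prettify()
print(pretty)
```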

###### Extracting Titles

In [9]:
soup = BeautifulSoup(content, 'html.parser')
soup.find_all('h3', {'class':'story-title'})

[<h3 class="story-title">
 								Seven sailors missing after U.S. Navy destroyer collides with container ship off Japan</h3>,
 <h3 class="story-title">
 								Putin: more U.S. sanctions would be harmful, talk of retaliation premature</h3>,
 <h3 class="story-title">
 								Four U.S. soldiers killed in apparent insider attack in Afghanistan: official</h3>,
 <h3 class="story-title">
 								Yemen government agrees to U.N. Hodeidah plan, Houthis skeptical</h3>,
 <h3 class="story-title">
 								Pope tells Merkel to keep pressing for international cooperation</h3>,
 <h3 class="story-title">
 								Duterte resurfaces after rest, says battle with militants winding down</h3>,
 <h3 class="story-title">
 								U.N. mediator targets fresh Syria talks for July 10</h3>,
 <h3 class="story-title">
 								Iraqi forces remove Islamic State fighters from vicinity of U.S.base in Syria</h3>,
 <h3 class="story-title">
 								Egyptian court recommends death penalty for 30 over assassination 

In [10]:
titles = []
for title in soup.find_all('h3', {'class':'story-title'}):
    titles.append(title.string.strip())

In [11]:
titles

['Seven sailors missing after U.S. Navy destroyer collides with container ship off Japan',
 'Putin: more U.S. sanctions would be harmful, talk of retaliation premature',
 'Four U.S. soldiers killed in apparent insider attack in Afghanistan: official',
 'Yemen government agrees to U.N. Hodeidah plan, Houthis skeptical',
 'Pope tells Merkel to keep pressing for international cooperation',
 'Duterte resurfaces after rest, says battle with militants winding down',
 'U.N. mediator targets fresh Syria talks for July 10',
 'Iraqi forces remove Islamic State fighters from vicinity of U.S.base in Syria',
 'Egyptian court recommends death penalty for 30 over assassination of prosecutor',
 'British PM May tries to quell public anger after deadly London fire']
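
Note that `.string` is `None` when a tag has more than one child, for example a headline containing nested markup, and calling `.strip()` on it would then fail. A hedged alternative is `get_text()`, sketched here on made-up markup (`sample_html` is an assumption, not the Reuters page).

```python
from bs4 import BeautifulSoup

# Made-up markup: the second headline contains a nested <em> tag
sample_html = '''
<h3 class="story-title">
    Plain headline</h3>
<h3 class="story-title">
    Headline with <em>nested</em> markup</h3>
'''
sample_soup = BeautifulSoup(sample_html, 'html.parser')
headings = sample_soup.find_all('h3', class_='story-title')

# .string is None for the second heading because it has several children
strings = [heading.string for heading in headings]

# get_text() joins all nested strings; strip=True trims each piece
sample_titles = [heading.get_text(' ', strip=True) for heading in headings]
print(sample_titles)
```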

###### Extracting first line of article 

In [12]:
# Some <p> tags hold content we don't need, and their .string is None, so calling .strip() raises AttributeError.
# The try block wraps the whole loop, so collection stops at the first such tag.
# This works here because the headline summaries appear before any of those tags.

first_lines = []
try:
    for first_line in soup.find_all('p'):
        first_lines.append(first_line.string.strip())
except AttributeError:
    pass
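
If the unwanted `<p>` tags were mixed in between the summaries, stopping at the first failure would lose data. A sketch of a per-tag check that skips only the problematic paragraphs follows; the inline `sample_html` is made up for illustration.

```python
from bs4 import BeautifulSoup

# Made-up markup: the middle paragraph has nested markup, so its .string is None
sample_html = '''
<p>First summary.</p>
<p>Text with <a href="#">a link</a> inside.</p>
<p>Second summary.</p>
'''
sample_soup = BeautifulSoup(sample_html, 'html.parser')

summaries = []
for paragraph in sample_soup.find_all('p'):
    text = paragraph.string  # None when the tag has more than one child
    if text is None:
        continue  # skip this paragraph but keep scanning the rest
    summaries.append(text.strip())

print(summaries)
```

On a real page this variant may collect more paragraphs than there are headlines, so the list lengths should be compared before combining them.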

In [13]:
first_lines

['Search and rescue efforts went on after dark for seven U.S. sailors missing after the U.S. Navy destroyer USS Fitzgerald collided with a Philippine-flagged container ship more than three times its size off eastern Japan early on Saturday.',
 'Russian President Vladimir Putin said new sanctions under consideration by the United States would damage relations between the two countries, but it was too early to talk about retaliation, state news agency RIA reported on Saturday.',
 'At least four American soldiers were shot and killed in an apparent insider attack by an Afghan soldier at a base in northern Afghanistan on Saturday, a military official said.',
 "Yemen's Saudi-backed government said on Saturday it agreed to a two-point plan advanced by the United Nations to ease suffering in the country's civil war, but the Iran-aligned Houthi movement remained skeptical.",
 "Pope Francis has asked Germany to keep fighting for the Paris climate change deal and to 'tear down walls' that inhibi

###### Extracting when the article was written

In [14]:
written_times = []
for written_time in soup.find_all('span', {'class':'timestamp'}):
    written_times.append(written_time.string.strip())

In [15]:
written_times

['9:27am EDT',
 '7:14am EDT',
 '9:10am EDT',
 '8:38am EDT',
 '9:25am EDT',
 '7:55am EDT',
 '8:51am EDT',
 '9:05am EDT',
 '7:20am EDT',
 '9:35am EDT']
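
The timestamps are plain strings. If they need to be sorted or compared, one option is to parse the clock part with `datetime.strptime`. This is a sketch under assumptions: `parse_timestamp` is a hypothetical helper, the format is assumed to always be `H:MMam/pm ZONE` as in the output above, and `%p` parsing assumes an English locale.

```python
from datetime import datetime


def parse_timestamp(stamp):
    """Split a string like '9:27am EDT' into (datetime.time, zone abbreviation)."""
    clock, _, zone = stamp.partition(' ')
    # %I is the 12-hour clock, %p matches am/pm (case-insensitive)
    return datetime.strptime(clock, '%I:%M%p').time(), zone


sample_times = ['9:27am EDT', '7:14am EDT', '9:10am EDT']
parsed = [parse_timestamp(stamp) for stamp in sample_times]

# Sort the original strings by their parsed clock time
ordered = sorted(sample_times, key=lambda stamp: parse_timestamp(stamp)[0])
print(ordered)
```

A string in any other format (for example a date instead of a time) would raise `ValueError`, so real input may need a fallback.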

###### Extracting url link of article

In [16]:
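import re

url_links = []
# href=True skips <a> tags that have no href attribute, which would otherwise raise KeyError
for tag in soup.find_all('a', href=True):
    if re.search('article', tag['href']):
        url_links.append(tag['href'])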

In [17]:
url_links

['/article/us-usa-navy-asia-idUSKBN1972SW',
 '/article/us-usa-navy-asia-idUSKBN1972SW',
 '/article/us-russia-usa-sanctions-idUSKBN1980CF',
 '/article/us-russia-usa-sanctions-idUSKBN1980CF',
 '/article/us-afghanistan-attack-idUSKBN1980HJ',
 '/article/us-afghanistan-attack-idUSKBN1980HJ',
 '/article/us-yemen-security-hodeidah-idUSKBN1980G0',
 '/article/us-yemen-security-hodeidah-idUSKBN1980G0',
 '/article/us-pope-merkel-idUSKBN1980HW',
 '/article/us-pope-merkel-idUSKBN1980HW',
 '/article/us-philippines-militants-idUSKBN1980EE',
 '/article/us-philippines-militants-idUSKBN1980EE',
 '/article/us-mideast-crisis-syria-un-idUSKBN1980DP',
 '/article/us-mideast-crisis-syria-un-idUSKBN1980DP',
 '/article/us-mideast-crisis-iraq-syria-idUSKBN19807Y',
 '/article/us-mideast-crisis-iraq-syria-idUSKBN19807Y',
 '/article/us-egypt-court-idUSKBN198096',
 '/article/us-egypt-court-idUSKBN198096',
 '/article/us-britain-fire-idUSKBN1980CY',
 '/article/us-britain-fire-idUSKBN1980CY',
 '/article/us-usa-cuba-idU

The last 5 URLs are sidebar article links rather than headlines, so we drop them.

In [18]:
url_links = url_links[:-5]

In [19]:
url_links

['/article/us-usa-navy-asia-idUSKBN1972SW',
 '/article/us-usa-navy-asia-idUSKBN1972SW',
 '/article/us-russia-usa-sanctions-idUSKBN1980CF',
 '/article/us-russia-usa-sanctions-idUSKBN1980CF',
 '/article/us-afghanistan-attack-idUSKBN1980HJ',
 '/article/us-afghanistan-attack-idUSKBN1980HJ',
 '/article/us-yemen-security-hodeidah-idUSKBN1980G0',
 '/article/us-yemen-security-hodeidah-idUSKBN1980G0',
 '/article/us-pope-merkel-idUSKBN1980HW',
 '/article/us-pope-merkel-idUSKBN1980HW',
 '/article/us-philippines-militants-idUSKBN1980EE',
 '/article/us-philippines-militants-idUSKBN1980EE',
 '/article/us-mideast-crisis-syria-un-idUSKBN1980DP',
 '/article/us-mideast-crisis-syria-un-idUSKBN1980DP',
 '/article/us-mideast-crisis-iraq-syria-idUSKBN19807Y',
 '/article/us-mideast-crisis-iraq-syria-idUSKBN19807Y',
 '/article/us-egypt-court-idUSKBN198096',
 '/article/us-egypt-court-idUSKBN198096',
 '/article/us-britain-fire-idUSKBN1980CY',
 '/article/us-britain-fire-idUSKBN1980CY']

The URL list contains duplicates, which we want to remove. We could use set(), but a set does not preserve the order of the links, so we filter them in a loop instead.

In [20]:
final_urls = []
# Use `link` as the loop variable so the page `url` defined earlier is not overwritten
for link in url_links:
    if link not in final_urls:
        final_urls.append(link)

In [21]:
final_urls

['/article/us-usa-navy-asia-idUSKBN1972SW',
 '/article/us-russia-usa-sanctions-idUSKBN1980CF',
 '/article/us-afghanistan-attack-idUSKBN1980HJ',
 '/article/us-yemen-security-hodeidah-idUSKBN1980G0',
 '/article/us-pope-merkel-idUSKBN1980HW',
 '/article/us-philippines-militants-idUSKBN1980EE',
 '/article/us-mideast-crisis-syria-un-idUSKBN1980DP',
 '/article/us-mideast-crisis-iraq-syria-idUSKBN19807Y',
 '/article/us-egypt-court-idUSKBN198096',
 '/article/us-britain-fire-idUSKBN1980CY']
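
The same order-preserving de-duplication can be written in one line with `dict.fromkeys`, since dictionaries keep insertion order in Python 3.7 and later. A short sketch on a shortened copy of the list above:

```python
# A shortened copy of the duplicated link list shown earlier
sample_links = [
    '/article/us-usa-navy-asia-idUSKBN1972SW',
    '/article/us-usa-navy-asia-idUSKBN1972SW',
    '/article/us-russia-usa-sanctions-idUSKBN1980CF',
    '/article/us-russia-usa-sanctions-idUSKBN1980CF',
]

# Dict keys are unique and keep first-seen order, so duplicates drop out
unique_links = list(dict.fromkeys(sample_links))
print(unique_links)
```

This also avoids the repeated list scan done by `if link not in final_urls`, which matters only for much longer lists.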

In [22]:
import pandas as pd

reuters_word_headlines = pd.DataFrame({'title':titles, 'headline content':first_lines, 'time':written_times,
                                     'link url':final_urls})
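
Building the frame from four separately collected lists only works when the lists have the same length and the same order. A sketch of an alternative is to walk one article container at a time and read all four fields from it. The `article` tag with class `story` and the whole of `sample_html` are assumptions for illustration, not the confirmed Reuters markup.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one container per article holding all four fields
sample_html = '''
<article class="story">
  <a href="/article/example-one"><h3 class="story-title">First headline</h3></a>
  <p>First summary.</p>
  <span class="timestamp">9:27am EDT</span>
</article>
<article class="story">
  <a href="/article/example-two"><h3 class="story-title">Second headline</h3></a>
  <p>Second summary.</p>
  <span class="timestamp">7:14am EDT</span>
</article>
'''
sample_soup = BeautifulSoup(sample_html, 'html.parser')

rows = []
for story in sample_soup.find_all('article', class_='story'):
    # Each field is looked up inside the same container, so rows cannot misalign
    rows.append({
        'time': story.find('span', class_='timestamp').get_text(strip=True),
        'title': story.find('h3', class_='story-title').get_text(strip=True),
        'headline content': story.find('p').get_text(strip=True),
        'link url': story.find('a', href=True)['href'],
    })

print(rows)
```

A list of dictionaries like `rows` can be passed directly to `pd.DataFrame(rows)`.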

In [23]:
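# Reorder the columns; pass the labels as a list, since a tuple is treated as a single (MultiIndex) key
reuters_word_headlines = \
reuters_word_headlines.loc[:, ['time', 'title', 'headline content', 'link url']]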

In [25]:
# Prepend www.reuters.com to each relative URL

reuters_word_headlines['link url'] =\
'www.reuters.com' + reuters_word_headlines['link url']
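
The string prefix above produces links without a scheme. If fully qualified links are wanted, `urllib.parse.urljoin` can resolve each relative path against a base URL. In this sketch `BASE_URL` is an assumption taken from the page address used earlier.

```python
from urllib.parse import urljoin

# Assumed base, taken from the scheme and host of the scraped page
BASE_URL = 'http://www.reuters.com'

relative_links = [
    '/article/us-usa-navy-asia-idUSKBN1972SW',
    '/article/us-russia-usa-sanctions-idUSKBN1980CF',
]

# urljoin resolves each relative path against the base URL
absolute_links = [urljoin(BASE_URL, link) for link in relative_links]
print(absolute_links)
```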

In [26]:
reuters_word_headlines

| | time | title | headline content | link url |
|---|---|---|---|---|
| 0 | 9:27am EDT | Seven sailors missing after U.S. Navy destroye... | Search and rescue efforts went on after dark f... | www.reuters.com/article/us-usa-navy-asia-idUSK... |
| 1 | 7:14am EDT | Putin: more U.S. sanctions would be harmful, t... | Russian President Vladimir Putin said new sanc... | www.reuters.com/article/us-russia-usa-sanction... |
| 2 | 9:10am EDT | Four U.S. soldiers killed in apparent insider ... | At least four American soldiers were shot and ... | www.reuters.com/article/us-afghanistan-attack-... |
| 3 | 8:38am EDT | Yemen government agrees to U.N. Hodeidah plan,... | Yemen's Saudi-backed government said on Saturd... | www.reuters.com/article/us-yemen-security-hode... |
| 4 | 9:25am EDT | Pope tells Merkel to keep pressing for interna... | Pope Francis has asked Germany to keep fightin... | www.reuters.com/article/us-pope-merkel-idUSKBN... |
| 5 | 7:55am EDT | Duterte resurfaces after rest, says battle wit... | President Rodrigo Duterte of the Philippines, ... | www.reuters.com/article/us-philippines-militan... |
| 6 | 8:51am EDT | U.N. mediator targets fresh Syria talks for Ju... | The United Nations special mediator for Syria,... | www.reuters.com/article/us-mideast-crisis-syri... |
| 7 | 9:05am EDT | Iraqi forces remove Islamic State fighters fro... | The Iraqi army and Sunni tribal fighters have ... | www.reuters.com/article/us-mideast-crisis-iraq... |
| 8 | 7:20am EDT | Egyptian court recommends death penalty for 30... | A Cairo criminal court on Saturday recommended... | www.reuters.com/article/us-egypt-court-idUSKBN... |
| 9 | 9:35am EDT | British PM May tries to quell public anger aft... | British Prime Minister Theresa May's governmen... | www.reuters.com/article/us-britain-fire-idUSKB... |


### Method 2 : Using python urllib and BeautifulSoup

In [27]:
from urllib import request

url = 'http://www.reuters.com/news/archive/worldNews?view=page'

fhand = request.urlopen(url)
for line in fhand:
    print(line.strip().decode())

<!--[if !IE]> This has been served from cache <![endif]-->
<!--[if !IE]> Request served from apache server: produs--i-070ae7cf4c5cf96cf <![endif]-->
<!--[if !IE]> Cached on Sat, 17 Jun 2017 13:37:20 GMT and will expire on Sat, 17 Jun 2017 13:42:20 GMT <![endif]-->
<!--[if !IE]> token: c45cf780-a819-4ecd-8663-3aef8ed92eb9 <![endif]-->
<!--[if !IE]> Tomcat Server /produs--i-04d0ea63c73100166/ <![endif]-->

<!--[if !IE]> App Server /produs--i-04d0ea63c73100166/ <![endif]-->

<!doctype html><html lang="en"><head>
<title>World News Headlines  | Reuters</title>
<meta http-equiv="X-UA-Compatible" content="IE=edge"><meta charset="utf-8"><meta http-equiv="x-dns-prefetch-control" content="on"><link rel="dns-prefetch" href="//s1.reutersmedia.net"/><link rel="dns-prefetch" href="//s2.reutersmedia.net"/><link rel="dns-prefetch" href="//s3.reutersmedia.net"/><link rel="dns-prefetch" href="//s4.reutersmedia.net"/><link rel="dns-prefetch" href="//static.reuters.com"/><link rel="dns-prefetch" href="//w

In [28]:
content2 = request.urlopen(url).read()
soup2 = BeautifulSoup(content2, 'html.parser')

From here on, extracting the specific data with BeautifulSoup works exactly as in Method 1 above.
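
For reference, the urllib download can also be wrapped so that the connection is closed automatically and the bytes are decoded with the charset the server declares. This is a sketch only: `read_page` is a hypothetical helper name, and the timeout and the UTF-8 fallback are assumptions.

```python
from urllib import request


def read_page(url, timeout=10):
    """Download a page and return it as text (read_page is a hypothetical helper)."""
    # The with-block closes the connection even if reading fails
    with request.urlopen(url, timeout=timeout) as response:
        # Use the charset from the Content-Type header, falling back to UTF-8
        charset = response.headers.get_content_charset() or 'utf-8'
        return response.read().decode(charset)


# Usage (performs a real network request):
# html = read_page('http://www.reuters.com/news/archive/worldNews?view=page')
```

The returned string can be passed to `BeautifulSoup(html, 'html.parser')` in the same way as `content2`.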