# Web scraping with Python


This is a simple demo of how to use Python for web scraping and text mining. Python libraries commonly used for web scraping include Beautiful Soup, lxml (for XPath queries), and Scrapy. This demo mainly uses Beautiful Soup.

Disclaimer: this simple demo uses Web of Science (WoS) as an example. Since WoS provides fee-based services, you may not be able to reproduce the results, depending on whether you have access to those services. You may also need to modify the "http" address. You may copy or reproduce the code, but solely on your own computer and for your personal, non-commercial use.

## A demo from Beautiful Soup

#### Let's first get some basic ideas of Beautiful Soup. This example is from the official documentation.

In [1]:
import requests
from bs4 import BeautifulSoup

In [2]:
url = 'http://computational-class.github.io/bigdata/data/test.html'
content = requests.get(url)
content = content.text
soup = BeautifulSoup(content, 'html.parser') 
soup

<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p></body></html>

In [3]:
print(soup.prettify())

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>
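If the test page cannot be reached, the same document can be parsed from a string. The following is a minimal sketch, assuming the markup shown above; `html_doc`, `page_title`, `first_link`, and `n_paragraphs` are names introduced here for illustration only.

```python
from bs4 import BeautifulSoup

# The same markup as the test page, kept inline so no network access is needed.
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p></body></html>"""

soup = BeautifulSoup(html_doc, 'html.parser')

# Tag names can be used as attributes to reach the first matching tag.
page_title = soup.title.string          # text inside <title>
first_link = soup.a                     # the first <a> tag in the document
n_paragraphs = len(soup.find_all('p'))  # number of <p> tags
print(page_title, first_link['href'], n_paragraphs)
```

Working from a saved string like this is also a convenient way to test selectors before sending requests to a live site.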


#### Several ways to extract information from the source code

In [82]:
soup.select('body > p.title > b')[0].text

"The Dormouse's story"

In [83]:
soup.select('a')

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

In [84]:
soup.find_all('a', {'class': 'sister'})[0]

<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

In [19]:
soup.select('.sister')[0]

<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

In [22]:
soup.select('#link1')[0]['id']

'link1'
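Attribute values are read with dictionary-style indexing, as in the `['id']` lookup above. The sketch below collects the text and `href` of every link in a small inline document so that it runs on its own; `snippet`, `soup_links`, `links`, `missing`, and `second` are illustrative names, not part of the original notebook.

```python
from bs4 import BeautifulSoup

snippet = """
<p class="story">
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
</p>
"""
soup_links = BeautifulSoup(snippet, 'html.parser')

# Map each link's text to its href attribute.
links = {a.text: a['href'] for a in soup_links.select('a.sister')}

# .get() returns None instead of raising KeyError when an attribute is missing.
missing = soup_links.select('a.sister')[0].get('title')

# select_one() returns the first match (or None) instead of a list.
second = soup_links.select_one('#link2').text
print(links, missing, second)
```

Using `.get()` is the safer choice when some tags on a page may lack the attribute you are after.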

In [23]:
for i in soup('p'):
    print(i.text)

The Dormouse's story
Once upon a time there were three little sisters; and their names were
Elsie,
Lacie and
Tillie;
and they lived at the bottom of a well.
...


In [24]:
print(soup.get_text())

The Dormouse's story

The Dormouse's story
Once upon a time there were three little sisters; and their names were
Elsie,
Lacie and
Tillie;
and they lived at the bottom of a well.
...
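`get_text()` keeps the whitespace and line breaks of the source, which is why blank lines appear in the output above. As a hedged sketch, two ways to obtain cleaner text are shown below on a small made-up fragment; `fragment` and `text_soup` are names introduced here.

```python
from bs4 import BeautifulSoup

# A made-up fragment with stray whitespace around the text.
fragment = "<div><p>  First paragraph </p>\n<p>Second <b>bold</b> paragraph</p></div>"
text_soup = BeautifulSoup(fragment, 'html.parser')

# strip=True trims each piece of text; the separator is placed between pieces.
joined = text_soup.get_text(' ', strip=True)

# stripped_strings yields the same trimmed pieces one at a time.
pieces = list(text_soup.stripped_strings)
print(joined)
print(pieces)
```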


#### Using XPath: get an article title from PubMed

In [56]:
from lxml import html
import requests
page = requests.get('https://www.ncbi.nlm.nih.gov/pubmed/15986988')
tree = html.fromstring(page.content)

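You need to find and copy the path to the title: highlight the title in Chrome, right-click and choose Inspect, locate the corresponding element in the source code, then right-click it and choose Copy > Copy XPath. The path used below is `//*[@id="maincontent"]/div/div[5]/div/h1/text()`, where the trailing `text()` selects the text of the heading.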

In [57]:
title = tree.xpath('//*[@id="maincontent"]/div/div[5]/div/h1/text()')

In [66]:
print(title)
type(title)

['Famine, social disruption, and involuntary fetal loss: evidence from Chinese survey data.']


list
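Because `xpath()` returns a list, the title still has to be taken out of it, and a path that matches nothing yields an empty list. The sketch below illustrates both points on a hypothetical, simplified page (`page_source` is made up and is not the real PubMed markup). It also uses a shorter path ending in `//h1`, on the assumption that positional steps such as `div[5]` are more likely to break when a page layout changes.

```python
from lxml import html

# Hypothetical, simplified markup standing in for an article page.
page_source = """
<html><body>
<div id="maincontent"><div><h1>An example article title.</h1></div></div>
</body></html>
"""
sample_tree = html.fromstring(page_source)

# xpath() returns a list, even when only one node matches.
matches = sample_tree.xpath('//*[@id="maincontent"]//h1/text()')

# Take the first match if there is one; fall back to None otherwise.
article_title = matches[0].strip() if matches else None

# A path that matches nothing yields an empty list rather than an error.
no_match = sample_tree.xpath('//*[@id="does-not-exist"]/h1/text()')
print(article_title, no_match)
```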

### Articles published in American Journal of Sociology 

#### Here is an example: scraping the titles of all articles published in the AJS from 2010 to 2019, 1,800 in total

In [2]:
from datetime import datetime
from urllib.request import urlopen
from bs4 import BeautifulSoup
import numpy as np
import pandas as pd
pd.set_option('display.max_rows', 3200)


The following workflow crawls the titles of articles published in AJS. Here I only obtained articles published in the last 10 years.

In [466]:
t1 = []  # article titles
j1 = []  # journal names
for page in range(1, 37):
    # as noted in the disclaimer, you may need to modify this address
    url = 'http://apps.webofknowledge.com/summary.do?product=WOS&colName=WOS&qid=1&SID=7DvsabUuPjyfkHd9SzC&search_mode=AdvancedSearch&formValue(summary_mode)=AdvancedSearch&update_back2search_link_param=yes&page=%d' % page
    response = urlopen(url)
    bs = BeautifulSoup(response.read(), 'html.parser')
    results = bs.find_all('div', {'class': 'search-results-content'})
    for i in results:
        # the hitHilite span holds the journal name; the first link holds the title
        journal = i.find('span', {'class': 'hitHilite'}).text
        j1.append(journal)
        t = i.find('a').text.strip()
        t1.append(t)
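The page addresses in the loop above differ only in their `page` parameter, so they can also be assembled from a dictionary of query parameters. This is a sketch under stated assumptions: `YOUR_SESSION_ID` is a placeholder, `page_url` is a hypothetical helper, and the `formValue(summary_mode)` parameter is left out for brevity.

```python
from urllib.parse import urlencode

# Base address and query parameters taken from the URL used in the loop above.
base = 'http://apps.webofknowledge.com/summary.do'
params = {
    'product': 'WOS',
    'colName': 'WOS',
    'qid': 1,
    'SID': 'YOUR_SESSION_ID',  # placeholder: replace with the value from your own search
    'search_mode': 'AdvancedSearch',
    'update_back2search_link_param': 'yes',
}

def page_url(page):
    """Return the address of one result page (hypothetical helper)."""
    return base + '?' + urlencode({**params, 'page': page})

# 36 pages, matching range(1, 37) in the loop above.
urls = [page_url(p) for p in range(1, 37)]
print(len(urls))
print(urls[0])
```

Keeping the parameters in one place makes it easier to swap in a new address when the old one stops working. When requesting many pages in a row, a short pause between requests (for example with `time.sleep`) is a common courtesy to the server.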

In [495]:
title = t1[0:1800]
len(title)


1800

In [578]:
## this is the source code of the last record on the last page
test = bs.find_all('div', {'class': 'search-results-content'})
print(test[-1])

<div class="search-results-content"><div>
<div>
<span class="smallV110"></span><a class="smallV110 snowplow-full-record" href="/full_record.do?product=WOS&amp;search_mode=AdvancedSearch&amp;qid=1&amp;SID=7DvsabUuPjyfkHd9SzC&amp;page=36&amp;doc=1800">
<value lang_id="">Operating Room: Relational Spaces and Microinstitutional Change in Surgery</value>
</a>
</div>
</div>
<div>
<span class="label">By: </span><a alt="Find more records by this author" href="/OutboundService.do?SID=7DvsabUuPjyfkHd9SzC&amp;mode=rrcAuthorRecordService&amp;action=go&amp;product=WOS&amp;daisIds=3166608" title="Find more records by this author">Kellogg, Katherine C.</a>
</div>
<div>
<span id="fetch_wos_subject_Span_1800" style="display: none" url="http://apps.webofknowledge.com/FetchESIField.do?product=WOS&amp;search_mode=CitedFullRecord&amp;SID=7DvsabUuPjyfkHd9SzC&amp;isickref=WOS:000274365700001&amp;doc=1800"></span><span id="show_journal_overlay_link_1800" name="show_journal_overlay_link_1800" style="display: n
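The record above shows where each field sits in the markup: the title is inside the first link, and the author is inside a link whose `title` attribute reads "Find more records by this author". The sketch below pulls both fields out of an abbreviated copy of that record. The `href` values are shortened here for readability, and `record_html`, `record`, `record_title`, and `authors` are illustrative names.

```python
from bs4 import BeautifulSoup

# Abbreviated copy of the record printed above (href values shortened).
record_html = """
<div class="search-results-content"><div>
<div>
<span class="smallV110"></span><a class="smallV110 snowplow-full-record" href="/full_record.do?doc=1800">
<value lang_id="">Operating Room: Relational Spaces and Microinstitutional Change in Surgery</value>
</a>
</div>
</div>
<div>
<span class="label">By: </span><a href="/OutboundService.do?daisIds=3166608" title="Find more records by this author">Kellogg, Katherine C.</a>
</div>
</div>
"""

record = BeautifulSoup(record_html, 'html.parser').find(
    'div', {'class': 'search-results-content'})

# The first <a> holds the title; strip() removes the surrounding line breaks.
record_title = record.find('a').text.strip()

# Author links carry a descriptive title attribute in this markup.
authors = [a.text for a in
           record.find_all('a', {'title': 'Find more records by this author'})]
print(record_title)
print(authors)
```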

In [515]:
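#j3=j1+j2
j3 = j1[0:1800]  # journal names; assumes j1 from the loop above holds all of them (j2 is not defined in this demo)
## create an id variable
id = list(range(1, 1801))

#### Let's take a look at the data and save it to a CSV file

In [None]:
title10_19 = {'id': id, 'title': title, 'journal': j3}
df = pd.DataFrame(title10_19)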

In [12]:
df.head(10) # the first 10 titles we got

| Unnamed: 0 | id | title | journal |
|---|---|---|---|
| 0 | 1 | Decoupling: Marital Violence and the Struggle ... | AMERICAN JOURNAL OF SOCIOLOGY |
| 1 | 2 | How Organizational Minorities Form and Use Soc... | AMERICAN JOURNAL OF SOCIOLOGY |
| 2 | 3 | More Than a Sorting Machine: Ethnic Boundary M... | AMERICAN JOURNAL OF SOCIOLOGY |
| 3 | 4 | How Do Criminal Courts Respond in Times of Cri... | AMERICAN JOURNAL OF SOCIOLOGY |
| 4 | 5 | Gender Pay Gaps in US Federal Science Agencies... | AMERICAN JOURNAL OF SOCIOLOGY |
| 5 | 6 | For-Profit Democracy: Why the Government Is Lo... | AMERICAN JOURNAL OF SOCIOLOGY |
| 6 | 7 | War, Women, and Power: From Violence to Mobili... | AMERICAN JOURNAL OF SOCIOLOGY |
| 7 | 8 | The Medicalisation of Incest and Abuse: Biomed... | AMERICAN JOURNAL OF SOCIOLOGY |
| 8 | 9 | When Police Use Force: Context, Methods, and O... | AMERICAN JOURNAL OF SOCIOLOGY |
| 9 | 10 | Pathways of Desire: The Sexual Migration of Me... | AMERICAN JOURNAL OF SOCIOLOGY |


In [None]:
df.to_csv("title10_19.csv") # save data to a csv file
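By default `to_csv` also writes the DataFrame index, which comes back as an `Unnamed: 0` column when the file is read again. The sketch below shows the difference on two illustrative rows, using an in-memory buffer in place of a file; `sample`, `cols_default`, and `cols_clean` are names introduced here.

```python
import io
import pandas as pd

# Two illustrative rows; the titles are taken from the table above.
sample = pd.DataFrame({
    'id': [1, 2],
    'title': ['Decoupling: Marital Violence and the Struggle ...',
              'How Organizational Minorities Form and Use Soc...'],
    'journal': ['AMERICAN JOURNAL OF SOCIOLOGY'] * 2,
})

# Default behaviour: the index is written as an unnamed first column.
with_index = io.StringIO()
sample.to_csv(with_index)
with_index.seek(0)
cols_default = list(pd.read_csv(with_index).columns)

# index=False leaves the index out, so the columns round-trip unchanged.
without_index = io.StringIO()
sample.to_csv(without_index, index=False)
without_index.seek(0)
cols_clean = list(pd.read_csv(without_index).columns)
print(cols_default)
print(cols_clean)
```

Since the data already has its own `id` column, passing `index=False` when saving is one way to avoid the extra column.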

#### In most cases, Beautiful Soup is good enough for small-scale scraping.