# Retrieving Data from Wikipedia with Python
Shivam Panchal

In [1]:
# Importing the necessary libraries
import bs4 as bs
import lxml
from urllib import request

- **urllib (urllib2 in Python 2):**
It is a Python module that can be used for fetching URLs. It defines functions and classes to help with URL actions (basic and digest authentication, redirections, cookies, etc.). In Python 3 this functionality lives in `urllib.request`, which is what the import above uses.

- **BeautifulSoup:**
It is an incredible tool for pulling information out of a webpage. You can use it to extract tables, lists and paragraphs, and you can also apply filters to extract specific information from web pages. In this article, we will use the latest version, BeautifulSoup 4.
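
As a minimal sketch of what "extracting lists and paragraphs with filters" looks like, the block below parses a small hand-written HTML string rather than a real page. The HTML content and the name `demo_soup` are made up for illustration, and the built-in `html.parser` is used here only so the sketch needs no extra parser.

```python
from bs4 import BeautifulSoup

# A tiny hand-written page (hypothetical content, only for illustration).
html = """
<html><body>
  <p class="intro">Capitals of India</p>
  <ul>
    <li>Port Blair</li>
    <li>Itanagar</li>
  </ul>
  <p>Source: a sample page</p>
</body></html>
"""

# "html.parser" ships with Python, so no extra parser is required here.
demo_soup = BeautifulSoup(html, "html.parser")

# Extract every list item as plain text.
items = [li.get_text(strip=True) for li in demo_soup.find_all("li")]

# Filter paragraphs by their class attribute.
intro = demo_soup.find("p", class_="intro").get_text(strip=True)

print(items)
print(intro)
```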

Python has several other options for HTML scraping in addition to BeautifulSoup. Here are some of them:

- mechanize
- scrapemark
- scrapy

**Scraping a Web Page Using BeautifulSoup**

Here, I am scraping data from a Wikipedia page. Our final goal is to extract the list of state and union territory capitals in India, along with some basic details such as the year of establishment and the former capital, from this Wikipedia page. Let’s work through the project step by step:

In [2]:
# Specify the url

wiki = "https://en.wikipedia.org/wiki/List_of_state_and_union_territory_capitals_in_India"

In [4]:
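# Query the website and return the HTML to the variable 'page'

from urllib import request  # import the submodule explicitly; a bare "import urllib" does not guarantee urllib.request is loaded
page = request.urlopen(wiki)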
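
Some sites respond differently to requests without a User-Agent header, and a request can hang without a timeout. The following is a hedged sketch of how the fetch could be wrapped; the helper names `build_request` and `fetch_html` and the user-agent string are assumptions introduced here, not part of the original notebook.

```python
from urllib import request

WIKI_URL = "https://en.wikipedia.org/wiki/List_of_state_and_union_territory_capitals_in_India"


def build_request(url, user_agent="wiki-capitals-tutorial/0.1"):
    """Return a Request that carries an explicit User-Agent header."""
    # The user-agent value is a made-up placeholder for this sketch.
    return request.Request(url, headers={"User-Agent": user_agent})


def fetch_html(url, timeout=10):
    """Download the page and return its raw bytes (needs network access)."""
    with request.urlopen(build_request(url), timeout=timeout) as response:
        return response.read()


# Building the request does not touch the network, so it is safe to inspect.
req = build_request(WIKI_URL)
print(req.full_url)
print(req.get_header("User-agent"))
```

Calling `fetch_html(WIKI_URL)` would return bytes that can be passed to BeautifulSoup in place of the `page` object.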

In [6]:
#Parse the html in the 'page' variable, and store it in Beautiful Soup format
soup = bs.BeautifulSoup(page,'lxml')

In [7]:
# Use the prettify() method to look at the nested structure of the HTML page
print(soup.prettify())

<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   List of state and union territory capitals in India - Wikipedia
  </title>
  <script>
   document.documentElement.className = document.documentElement.className.replace( /(^|\s)client-nojs(\s|$)/, "$1client-js$2" );
  </script>
  <script>
   (window.RLQ=window.RLQ||[]).push(function(){mw.config.set({"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"List_of_state_and_union_territory_capitals_in_India","wgTitle":"List of state and union territory capitals in India","wgCurRevisionId":756165382,"wgRevisionId":756165382,"wgArticleId":2371868,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Featured lists","States and territories of India-related lists","Indian capital cities","Lists of cities in India","Lists of capitals of country subdivisions"],"wgBreakFrames":false,"wgP

**Above, you can see the structure of the HTML tags. This helps you learn which tags are available and how you can work with them to extract information.**

In [8]:
soup.title

<title>List of state and union territory capitals in India - Wikipedia</title>

In [9]:
soup.title.string

'List of state and union territory capitals in India - Wikipedia'
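
Beyond `.string`, a tag exposes a few other attributes that are handy when exploring a page. This is a small sketch on a hand-written HTML string (the markup and the names `demo_html` and `demo_soup` are invented for illustration), not on the Wikipedia page itself.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, only to demonstrate tag navigation.
demo_html = (
    "<html><head><title>List of capitals - Demo</title></head>"
    "<body><p class='lead'>Hello</p></body></html>"
)
demo_soup = BeautifulSoup(demo_html, "html.parser")

title_tag = demo_soup.title
print(title_tag.name)         # name of the tag itself
print(title_tag.string)       # text inside the tag
print(title_tag.parent.name)  # name of the enclosing tag
print(demo_soup.p["class"])   # class attribute, returned as a list
```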

In [11]:
# Find all the links within page’s <a> tags

soup.find_all('a')

[<a id="top"></a>,
 <a href="/wiki/Wikipedia:Featured_lists" title="This is a featured list. Click here for more information."><img alt="This is a featured list. Click here for more information." data-file-height="438" data-file-width="462" height="19" src="//upload.wikimedia.org/wikipedia/en/thumb/e/e7/Cscr-featured.svg/20px-Cscr-featured.svg.png" srcset="//upload.wikimedia.org/wikipedia/en/thumb/e/e7/Cscr-featured.svg/30px-Cscr-featured.svg.png 1.5x, //upload.wikimedia.org/wikipedia/en/thumb/e/e7/Cscr-featured.svg/40px-Cscr-featured.svg.png 2x" width="20"/></a>,
 <a href="#mw-head">navigation</a>,
 <a href="#p-search">search</a>,
 <a href="/wiki/States_and_union_territories_of_India" title="States and union territories of India">States and union<br/>
 territories of India</a>,
 <a class="image" href="/wiki/File:Flag_of_India.svg"><img alt="Flag of India.svg" data-file-height="900" data-file-width="1350" height="47" src="//upload.wikimedia.org/wikipedia/en/thumb/4/41/Flag_of_India.svg

**The output above shows every link together with its title and other attributes. To show only the link targets, we iterate over each `<a>` tag and read its “href” attribute with `get`. Tags without an href, such as `<a id="top">`, return None.**

In [12]:
all_links = soup.find_all('a')
for link in all_links:
    print(link.get("href"))

None
/wiki/Wikipedia:Featured_lists
#mw-head
#p-search
/wiki/States_and_union_territories_of_India
/wiki/File:Flag_of_India.svg
/wiki/List_of_states_and_territories_of_India_by_area
/wiki/List_of_states_and_union_territories_of_India_by_population
/wiki/ISO_3166-2:IN
/wiki/List_of_Indian_states_by_Child_Nutrition
/wiki/Indian_states_and_territories_ranking_by_crime_rate
/wiki/Indian_states_ranked_by_economic_freedom
/wiki/Indian_states_ranking_by_households_having_electricity
/wiki/Indian_states_ranking_by_fertility_rate
/wiki/Forest_cover_by_state_in_India
/wiki/List_of_Indian_states_and_union_territories_by_GDP
/wiki/List_of_Indian_states_by_GDP_per_capita
/wiki/List_of_Indian_states_and_territories_by_highest_point
/wiki/Indian_states_ranked_by_HIV_awareness
/wiki/List_of_Indian_states_and_territories_by_Human_Development_Index
/wiki/Indian_states_ranking_by_families_owning_house
/wiki/Indian_states_ranking_by_household_size
/wiki/Indian_states_and_territories_ranked_by_incidents_of
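
The printed values include `None`, in-page anchors such as `#mw-head`, and site-relative paths. One possible way to clean them up, sketched here with the standard library's `urljoin`, is to drop the first two kinds and resolve the rest against the page URL. The list `hrefs` holds a few values copied from the output above; the names `BASE_URL` and `absolute_links` are illustrative.

```python
from urllib.parse import urljoin

BASE_URL = "https://en.wikipedia.org/wiki/List_of_state_and_union_territory_capitals_in_India"

# A few values copied from the output above, including the None for <a id="top">.
hrefs = [
    None,
    "/wiki/Wikipedia:Featured_lists",
    "#mw-head",
    "#p-search",
    "/wiki/States_and_union_territories_of_India",
]

# Skip missing hrefs and in-page anchors, then make the rest absolute.
absolute_links = [
    urljoin(BASE_URL, href)
    for href in hrefs
    if href and not href.startswith("#")
]

for link in absolute_links:
    print(link)
```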

**Find the right table: Since we want a table with information about state capitals, we first need to identify the right one. Let’s write the command to extract everything inside the `<table>` tags.**

In [13]:
all_tables = soup.find_all('table')

In [14]:
print(all_tables)

[<table class="vertical-navbox nowraplinks" style="float:right;clear:right;width:22.0em;margin:0 0 1.0em 1.0em;background:#f9f9f9;border:1px solid #aaa;padding:0.2em;border-spacing:0.4em 0;text-align:center;line-height:1.4em;font-size:88%">
<tr>
<th style="padding:0.2em 0.4em 0.2em;font-size:145%;line-height:1.2em"><a href="/wiki/States_and_union_territories_of_India" title="States and union territories of India">States and union<br/>
territories of India</a><br/>
ordered by</th>
</tr>
<tr>
<td style="padding:0.2em 0 0.4em">
<div class="center">
<div class="floatnone"><a class="image" href="/wiki/File:Flag_of_India.svg"><img alt="Flag of India.svg" data-file-height="900" data-file-width="1350" height="47" src="//upload.wikimedia.org/wikipedia/en/thumb/4/41/Flag_of_India.svg/70px-Flag_of_India.svg.png" srcset="//upload.wikimedia.org/wikipedia/en/thumb/4/41/Flag_of_India.svg/105px-Flag_of_India.svg.png 1.5x, //upload.wikimedia.org/wikipedia/en/thumb/4/41/Flag_of_India.svg/140px-Flag_of_I
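
Printing every table in full is hard to read. A lighter way to spot the right one, sketched below on a two-table sample, is to print only each table's `class` attribute. The sample markup is a cut-down stand-in for the real page, and `sample_soup` and `table_classes` are names chosen for this sketch.

```python
from bs4 import BeautifulSoup

# Two stripped-down tables that reuse the class names seen on the page.
sample_html = """
<table class="vertical-navbox nowraplinks"><tr><td>navigation box</td></tr></table>
<table class="wikitable sortable plainrowheaders"><tr><th>No.</th></tr></table>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# get("class") returns the list of class names, or None if the attribute is absent.
table_classes = [table.get("class") for table in sample_soup.find_all("table")]

for index, classes in enumerate(table_classes):
    print(index, classes)
```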

In [16]:
# Select the tables whose class matches the capitals table (find_all returns a list)
right_table = soup.find_all('table', {"class" : 'wikitable sortable plainrowheaders'})
right_table

[<table class="wikitable sortable plainrowheaders">
 <tr>
 <th scope="col">No.</th>
 <th scope="col">State or<br/>
 union territory</th>
 <th scope="col">Administrative capitals</th>
 <th scope="col">Legislative capitals</th>
 <th scope="col">Judiciary capitals</th>
 <th scope="col">Year capital was established</th>
 <th scope="col">The Former capital</th>
 </tr>
 <tr>
 <td>1</td>
 <th scope="row"><a href="/wiki/Andaman_and_Nicobar_Islands" title="Andaman and Nicobar Islands">Andaman and Nicobar Islands</a> <img alt="union territory" data-file-height="14" data-file-width="9" height="14" src="//upload.wikimedia.org/wikipedia/commons/3/37/Dagger-14-plain.png" width="9"/></th>
 <td><b><a href="/wiki/Port_Blair" title="Port Blair">Port Blair</a></b></td>
 <td>Port Blair</td>
 <td><a href="/wiki/Kolkata" title="Kolkata">Kolkata</a></td>
 <td>1955</td>
 <td>Calcutta (1945–1956)</td>
 </tr>
 <tr>
 <td>2</td>
 <th scope="row"><a href="/wiki/Andhra_Pradesh" title="Andhra Pradesh">Andhra Pradesh

In [19]:
right_table=soup.find('table', class_='wikitable sortable plainrowheaders')
right_table

<table class="wikitable sortable plainrowheaders">
<tr>
<th scope="col">No.</th>
<th scope="col">State or<br/>
union territory</th>
<th scope="col">Administrative capitals</th>
<th scope="col">Legislative capitals</th>
<th scope="col">Judiciary capitals</th>
<th scope="col">Year capital was established</th>
<th scope="col">The Former capital</th>
</tr>
<tr>
<td>1</td>
<th scope="row"><a href="/wiki/Andaman_and_Nicobar_Islands" title="Andaman and Nicobar Islands">Andaman and Nicobar Islands</a> <img alt="union territory" data-file-height="14" data-file-width="9" height="14" src="//upload.wikimedia.org/wikipedia/commons/3/37/Dagger-14-plain.png" width="9"/></th>
<td><b><a href="/wiki/Port_Blair" title="Port Blair">Port Blair</a></b></td>
<td>Port Blair</td>
<td><a href="/wiki/Kolkata" title="Kolkata">Kolkata</a></td>
<td>1955</td>
<td>Calcutta (1945–1956)</td>
</tr>
<tr>
<td>2</td>
<th scope="row"><a href="/wiki/Andhra_Pradesh" title="Andhra Pradesh">Andhra Pradesh</a></th>
<td><b><a hre
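
The two cells above differ in what they return: `find_all` gives a list of matching tags, while `find` gives the first match or `None`. The sketch below shows this on a small sample, and also shows a CSS selector via `select_one` as an alternative that does not depend on the order of the class names. The sample markup and variable names are assumptions for illustration.

```python
from bs4 import BeautifulSoup

sample_html = """
<table class="vertical-navbox nowraplinks"><tr><td>navigation box</td></tr></table>
<table class="wikitable sortable plainrowheaders"><tr><th>No.</th></tr></table>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# find_all returns a list-like ResultSet, even when only one table matches.
matches = sample_soup.find_all("table", class_="wikitable sortable plainrowheaders")

# find returns the first matching Tag directly, or None when nothing matches.
first = sample_soup.find("table", class_="wikitable sortable plainrowheaders")
missing = sample_soup.find("table", class_="no-such-class")

# A CSS selector matches the three classes regardless of their order.
by_css = sample_soup.select_one("table.wikitable.sortable.plainrowheaders")

print(len(matches), first["class"], missing, by_css["class"])
```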

**Above, notice that the second element of each <tr> is inside a <th> tag, not a <td>, so we need to handle it separately. To access the value of each element, we will use the “find(text=True)” option on each element. Let’s look at the code:**

In [20]:
# Generate one list per column
A=[]
B=[]
C=[]
D=[]
E=[]
F=[]
G=[]
for row in right_table.find_all("tr"):
    cells = row.find_all('td')
    states = row.find_all('th')  # The second column (state name) is a <th>, not a <td>
    if len(cells) == 6:  # Only body rows have six <td> cells; the heading row has none
        A.append(cells[0].find(text=True))
        B.append(states[0].find(text=True))
        C.append(cells[1].find(text=True))
        D.append(cells[2].find(text=True))
        E.append(cells[3].find(text=True))
        F.append(cells[4].find(text=True))
        G.append(cells[5].find(text=True))
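
An alternative to keeping `<th>` and `<td>` cells in separate lists is to collect both kinds in document order and use `get_text`, which joins all text inside a cell. This is only a sketch on a shortened copy of the table markup (three columns, one data row); the names `table_html`, `demo_table`, `header` and `body` are introduced here for illustration.

```python
from bs4 import BeautifulSoup

# A shortened copy of the table markup shown above (three columns only).
table_html = """
<table class="wikitable sortable plainrowheaders">
<tr><th scope="col">No.</th><th scope="col">State or<br/>union territory</th><th scope="col">Administrative capitals</th></tr>
<tr><td>1</td><th scope="row"><a href="/wiki/Andaman_and_Nicobar_Islands">Andaman and Nicobar Islands</a> <img alt="union territory"/></th><td><b><a href="/wiki/Port_Blair">Port Blair</a></b></td></tr>
</table>
"""
demo_table = BeautifulSoup(table_html, "html.parser").find("table")

header, body = [], []
for row in demo_table.find_all("tr"):
    # Take <th> and <td> together so the state name stays in column order.
    values = [cell.get_text(" ", strip=True) for cell in row.find_all(["th", "td"])]
    if row.find("td") is None:
        header = values      # the heading row has only <th> cells
    else:
        body.append(values)  # data rows mix <td> and <th scope="row">

print(header)
print(body)
```

Compared with `find(text=True)`, which returns only the first text node, `get_text` also picks up text split across nested tags or line breaks, as in the "State or union territory" heading.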

In [21]:
#import pandas to convert list to data frame
import pandas as pd
df=pd.DataFrame(A,columns=['Number'])
df['State/UT']=B
df['Admin_Capital']=C
df['Legislative_Capital']=D
df['Judiciary_Capital']=E
df['Year_Capital']=F
df['Former_Capital']=G
df

| | Number | State/UT | Admin_Capital | Legislative_Capital | Judiciary_Capital | Year_Capital | Former_Capital |
|---|---|---|---|---|---|---|---|
| 0 | 1 | Andaman and Nicobar Islands | Port Blair | Port Blair | Kolkata | 1955 | Calcutta (1945–1956) |
| 1 | 2 | Andhra Pradesh | Hyderabad | Hyderabad | Hyderabad | 1959 | Kurnool |
| 2 | 3 | Arunachal Pradesh | Itanagar | Itanagar | Guwahati | 1986 | |
| 3 | 4 | Assam | Dispur | Guwahati | Guwahati | 1975 | Shillong |
| 4 | 5 | Bihar | Patna | Patna | Patna | 1912 | |
| 5 | 6 | Chandigarh | Chandigarh | — | Chandigarh | 1966 | — |
| 6 | 7 | Chhattisgarh | Naya Raipur | Raipur | Bilaspur | 2000 | — |
| 7 | 8 | Dadra and Nagar Haveli | Silvassa | — | Mumbai | 1945 | Mumbai (1954–1961) |
| 8 | 9 | Daman and Diu | Daman | — | Delhi | 1987 | Ahmedabad |
| 9 | 10 | National Capital Territory of Delhi | New Delhi | New Delhi | New Delhi | 1931 | — |