## Scraping: [List of states and union territories of India by area](https://en.wikipedia.org/wiki/List_of_states_and_union_territories_of_India_by_area)


- In this notebook, we scrape Wikipedia's "List of states and union territories of India by area", which helps us analyze the states of India by area.


- The list of states and union territories of the Republic of India by area is ordered from largest to smallest according to the census of 2011. India consists of 28 states and 8 union territories, including the National Capital Territory of Delhi.


- In August 2019, the Indian Parliament passed a resolution to divide the state of Jammu and Kashmir into two union territories, Jammu & Kashmir (J&K) and Ladakh, which took effect on 31 October 2019.


- Source: Wikipedia

### Import required libraries for web scraping

In [1]:
from bs4 import BeautifulSoup
import urllib.request
import pandas as pd
import requests

### Requesting the URL

In [2]:
url = 'https://en.wikipedia.org/wiki/List_of_states_and_union_territories_of_India_by_area'

In [3]:
source = urllib.request.urlopen(url)
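
The bare `urlopen(url)` call above has no timeout and leaves the connection open. As a minimal sketch (not what this notebook originally ran), the request could be wrapped as follows. The names `USER_AGENT`, `build_request` and `fetch_html` are hypothetical helpers introduced only for this illustration.

```python
import urllib.request

# Hypothetical identifier for this sketch; replace it with your own details.
USER_AGENT = "india-area-notebook/0.1 (example sketch)"


def build_request(url: str) -> urllib.request.Request:
    """Wrap the URL in a Request that carries an explicit User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})


def fetch_html(url: str, timeout: float = 10.0) -> bytes:
    """Download the page body and close the connection when done."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
        return response.read()
```

The bytes returned by `fetch_html(url)` can be passed to `BeautifulSoup` in place of `source`.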

### Creating a BeautifulSoup object

In [4]:
soup = BeautifulSoup(source,'lxml')

In [5]:
soup

<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
<head>
<meta charset="utf-8"/>
<title>List of states and union territories of India by area - Wikipedia</title>
<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"X-BwSQpAMMQAASowJf8AAAEQ","wgCSPNonce":!1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"List_of_states_and_union_territories_of_India_by_area","wgTitle":"List of states and union territories of India by area","wgCurRevisionId":997834183,"wgRevisionId":997834183,"wgArticleId":1884577,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Articles with short description","Short description is differ

In [7]:
# Calling soup(...) is shorthand for soup.find_all(...); take the first matching table.
main_table = soup.find_all('table', class_="wikitable sortable")[0]
main_table

<table class="wikitable sortable">
<tbody><tr>
<th style="background:#9cf;">Rank
</th>
<th style="background:#9cf;">State (S) / Union territory (UT)
</th>
<th class="data-sort-type:number" style="background:#9cf;">Area (km<sup>2</sup>)
</th>
<th style="background:#9cf;">Region
</th>
<th style="background:#9cf;">National Share (%)
</th>
<th style="background:#9cf;">Political entities of comparable size (land mass)
</th>
<th style="background:#9cf;">Ref
</th></tr>
<tr>
<td>1 (S1)
</td>
<td><a href="/wiki/Rajasthan" title="Rajasthan">Rajasthan</a>
</td>
<td>342,239
</td>
<td><a href="/wiki/North_India" title="North India">Northern</a>
</td>
<td>10.42
</td>
<td><span class="flagicon"><img alt="" class="thumbborder" data-file-height="600" data-file-width="1000" decoding="async" height="14" src="//upload.wikimedia.org/wikipedia/en/thumb/b/ba/Flag_of_Germany.svg/23px-Flag_of_Germany.svg.png" srcset="//upload.wikimedia.org/wikipedia/en/thumb/b/ba/Flag_of_Germany.svg/35px-Flag_of_Germany.svg.pn
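
Indexing with `[0]` raises a bare `IndexError` if the page layout changes, and `class_="wikitable sortable"` matches only when the `class` attribute is exactly that string. A hedged alternative is a CSS selector with an explicit check. `SAMPLE_HTML` below is a tiny stand-in for the real page, and `first_wikitable` is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Tiny stand-in for the downloaded page (not the real Wikipedia markup).
SAMPLE_HTML = """
<table class="infobox"><tr><td>ignored</td></tr></table>
<table class="wikitable sortable">
  <tr><th>Rank</th><th>State</th></tr>
  <tr><td>1</td><td>Rajasthan</td></tr>
</table>
"""


def first_wikitable(soup):
    """Return the first table carrying both CSS classes, or raise a clear error."""
    table = soup.select_one("table.wikitable.sortable")
    if table is None:
        raise ValueError("no table with classes 'wikitable sortable' found")
    return table


sample_soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
sample_table = first_wikitable(sample_soup)
print(sample_table.find("th").get_text())
```

The selector matches the two classes in any order and alongside extra classes, which makes it a little more tolerant of markup changes.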

### Scraping head

In [175]:
# Collect the text of every header cell (<th>) in the table.
head = []
for th in main_table.find_all('th'):
    head.append(th.get_text())

In [176]:
head

['Rank\n',
 'State (S) / Union territory (UT)\n',
 'Area (km2)\n',
 'Region\n',
 'National Share (%)\n',
 'Country of comparable size (land mass)\n',
 'Ref\n']

In [177]:
df_head=pd.DataFrame(columns=head)

### Removing '\n' from head

In [178]:
head = df_head.columns.str.replace('\n','')

In [179]:
head

Index(['Rank', 'State (S) / Union territory (UT)', 'Area (km2)', 'Region',
       'National Share (%)', 'Country of comparable size (land mass)', 'Ref'],
      dtype='object')
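
The trailing newlines can also be dropped while the headers are collected, which avoids the temporary DataFrame. The sketch below assumes a small hand-written `HEADER_HTML` fragment rather than the real table. `clean_head` is an illustrative name.

```python
from bs4 import BeautifulSoup

# Hand-written fragment that mimics the trailing newlines in the real header cells.
HEADER_HTML = """
<table>
  <tr>
    <th>Rank
</th>
    <th>Area (km<sup>2</sup>)
</th>
  </tr>
</table>
"""

header_soup = BeautifulSoup(HEADER_HTML, "html.parser")

# get_text() joins the text of nested tags; strip() removes the surrounding whitespace.
clean_head = [th.get_text().strip() for th in header_soup.find_all("th")]
print(clean_head)
```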

### Scraping tabular data

In [181]:
main_table.find_all('td')

[<td>1 (S1)
 </td>,
 <td><a href="/wiki/Rajasthan" title="Rajasthan">Rajasthan</a>
 </td>,
 <td>342,239
 </td>,
 <td><a href="/wiki/North_India" title="North India">Northern</a>
 </td>,
 <td>10.42
 </td>,
 <td><span class="flagicon"><img alt="" class="thumbborder" data-file-height="600" data-file-width="1000" decoding="async" height="14" src="//upload.wikimedia.org/wikipedia/en/thumb/b/ba/Flag_of_Germany.svg/23px-Flag_of_Germany.svg.png" srcset="//upload.wikimedia.org/wikipedia/en/thumb/b/ba/Flag_of_Germany.svg/35px-Flag_of_Germany.svg.png 1.5x, //upload.wikimedia.org/wikipedia/en/thumb/b/ba/Flag_of_Germany.svg/46px-Flag_of_Germany.svg.png 2x" width="23"/> </span><a href="/wiki/Germany" title="Germany">Germany</a>
 </td>,
 <td>
 </td>,
 <td>2 (S2)
 </td>,
 <td><a href="/wiki/Madhya_Pradesh" title="Madhya Pradesh">Madhya Pradesh</a>
 </td>,
 <td>308,252
 </td>,
 <td><a href="/wiki/Central_India" title="Central India">Central</a>
 </td>,
 <td>9.37
 </td>,
 <td><span class="flagicon"><img

In [182]:
# Collect the text of every data cell (<td>) as one flat list.
data = []
for td in main_table.find_all('td'):
    data.append(td.get_text())

In [183]:
# Split the flat list into rows of n cells, where n is the number of columns.
n = len(main_table.find_all('th'))
data = [data[i:i+n] for i in range(0, len(data), n)]
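
The slicing above assumes that every row has exactly `n` cells. If a row is short or uses merged cells, the remaining rows shift silently. A minimal sketch of the same chunking with a sanity check is shown here. `chunk_rows` is a hypothetical helper, and the sample list reuses a few values from the scraped output.

```python
def chunk_rows(cells, n_columns):
    """Split a flat list of cell texts into rows of n_columns items each."""
    if n_columns <= 0:
        raise ValueError("n_columns must be positive")
    if len(cells) % n_columns != 0:
        raise ValueError(
            f"{len(cells)} cells do not divide evenly into rows of {n_columns}"
        )
    return [cells[i:i + n_columns] for i in range(0, len(cells), n_columns)]


flat = ['1 (S1)\n', 'Rajasthan\n', '342,239\n',
        '2 (S2)\n', 'Madhya Pradesh\n', '308,252\n']
rows = chunk_rows(flat, 3)
print(rows)
```

Raising early makes a misaligned table visible before it reaches the DataFrame.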

In [184]:
data

[['1 (S1)\n',
  'Rajasthan\n',
  '342,239\n',
  'Northern\n',
  '10.42\n',
  '\xa0Germany\n',
  '\n'],
 ['2 (S2)\n',
  'Madhya Pradesh\n',
  '308,252\n',
  'Central\n',
  '9.37\n',
  '\xa0Oman\n',
  '[note 1]\n'],
 ['3 (S3)\n',
  'Maharashtra\n',
  '307,713\n',
  'Western\n',
  '9.36\n',
  '\xa0Italy\n',
  '\n'],
 ['4 (S4)\n',
  'Uttar Pradesh\n',
  '240,928\n',
  'Northern\n',
  '7.33\n',
  '\xa0United Kingdom\n',
  '\n'],
 ['5 (S5)\n',
  'Gujarat\n',
  '196,024\n',
  'Western\n',
  '5.96\n',
  '\xa0Senegal\n',
  '\n'],
 ['6 (S6)\n',
  'Karnataka\n',
  '191,791\n',
  'Southern\n',
  '5.83\n',
  '\xa0Syria\n',
  '\n'],
 ['8 (S7)\n',
  'Andhra Pradesh\n',
  '160,205\n',
  'Southern\n',
  '4.87\n',
  '\xa0Tunisia\n',
  '[4][note 2]\n'],
 ['9 (S8)\n',
  'Odisha\n',
  '155,707\n',
  'Eastern\n',
  '4.73\n',
  '\xa0Bangladesh\n',
  '\n'],
 ['10 (S9)\n',
  'Chhattisgarh\n',
  '135,191\n',
  'Central\n',
  '4.11\n',
  '\xa0Greece\n',
  '[note 3]\n'],
 ['11 (S10)\n',
  'Tamil Nadu\n',
  '130,0

### Creating DataFrame using head and data

In [185]:
# Build the DataFrame from the scraped rows (data) and the cleaned header names (head).
dataset = pd.DataFrame(data, columns=head)

In [186]:
dataset

| | Rank | State (S) / Union territory (UT) | Area (km2) | Region | National Share (%) | Country of comparable size (land mass) | Ref |
|---|---|---|---|---|---|---|---|
| 0 | 1 (S1)\n | Rajasthan\n | 342,239\n | Northern\n | 10.42\n | Germany\n | \n |
| 1 | 2 (S2)\n | Madhya Pradesh\n | 308,252\n | Central\n | 9.37\n | Oman\n | [note 1]\n |
| 2 | 3 (S3)\n | Maharashtra\n | 307,713\n | Western\n | 9.36\n | Italy\n | \n |
| 3 | 4 (S4)\n | Uttar Pradesh\n | 240,928\n | Northern\n | 7.33\n | United Kingdom\n | \n |
| 4 | 5 (S5)\n | Gujarat\n | 196,024\n | Western\n | 5.96\n | Senegal\n | \n |
| 5 | 6 (S6)\n | Karnataka\n | 191,791\n | Southern\n | 5.83\n | Syria\n | \n |
| 6 | 8 (S7)\n | Andhra Pradesh\n | 160,205\n | Southern\n | 4.87\n | Tunisia\n | [4][note 2]\n |
| 7 | 9 (S8)\n | Odisha\n | 155,707\n | Eastern\n | 4.73\n | Bangladesh\n | \n |
| 8 | 10 (S9)\n | Chhattisgarh\n | 135,191\n | Central\n | 4.11\n | Greece\n | [note 3]\n |
| 9 | 11 (S10)\n | Tamil Nadu\n | 130,058\n | Southern\n | 3.95\n | Greece\n | \n |


### Removing '\n' from the data

In [187]:
for col in dataset.columns:
    dataset[col]=dataset[col].str.replace('\n','')
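
The loop above removes only `'\n'`. The scraped cells also carry a leading non-breaking space (`'\xa0'`), and the numeric columns are still strings. A hedged sketch of a fuller clean-up follows, using two rows from the scraped output and shortened column names (`State`, `Comparable country`) chosen only for this example.

```python
import pandas as pd

# Two rows copied from the scraped output, including the raw '\n' and '\xa0'.
raw = pd.DataFrame(
    [
        ['1 (S1)\n', 'Rajasthan\n', '342,239\n', '10.42\n', '\xa0Germany\n'],
        ['2 (S2)\n', 'Madhya Pradesh\n', '308,252\n', '9.37\n', '\xa0Oman\n'],
    ],
    columns=['Rank', 'State', 'Area (km2)', 'National Share (%)', 'Comparable country'],
)

# str.strip() removes leading and trailing whitespace, including '\n' and '\xa0'.
cleaned = raw.apply(lambda col: col.str.strip())

# Drop the thousands separator, then convert the numeric columns.
cleaned['Area (km2)'] = pd.to_numeric(
    cleaned['Area (km2)'].str.replace(',', '', regex=False)
)
cleaned['National Share (%)'] = pd.to_numeric(cleaned['National Share (%)'])

print(cleaned.dtypes)
```

With numeric dtypes, the area and share columns can be sorted and summed directly.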

In [190]:
dataset.drop('Ref',axis=1,inplace=True)

In [191]:
dataset

| | Rank | State (S) / Union territory (UT) | Area (km2) | Region | National Share (%) | Country of comparable size (land mass) |
|---|---|---|---|---|---|---|
| 0 | 1 (S1) | Rajasthan | 342239 | Northern | 10.42 | Germany |
| 1 | 2 (S2) | Madhya Pradesh | 308252 | Central | 9.37 | Oman |
| 2 | 3 (S3) | Maharashtra | 307713 | Western | 9.36 | Italy |
| 3 | 4 (S4) | Uttar Pradesh | 240928 | Northern | 7.33 | United Kingdom |
| 4 | 5 (S5) | Gujarat | 196024 | Western | 5.96 | Senegal |
| 5 | 6 (S6) | Karnataka | 191791 | Southern | 5.83 | Syria |
| 6 | 8 (S7) | Andhra Pradesh | 160205 | Southern | 4.87 | Tunisia |
| 7 | 9 (S8) | Odisha | 155707 | Eastern | 4.73 | Bangladesh |
| 8 | 10 (S9) | Chhattisgarh | 135191 | Central | 4.11 | Greece |
| 9 | 11 (S10) | Tamil Nadu | 130058 | Southern | 3.95 | Greece |


## Now our dataset is created

### Saving the dataset to CSV format

In [196]:
dataset.to_csv("List of states and union territories of India by area.csv",index=False)

In [197]:
pd.read_csv('List of states and union territories of India by area.csv')

| | Rank | State (S) / Union territory (UT) | Area (km2) | Region | National Share (%) | Country of comparable size (land mass) |
|---|---|---|---|---|---|---|
| 0 | 1 (S1) | Rajasthan | 342239 | Northern | 10.42 | Germany |
| 1 | 2 (S2) | Madhya Pradesh | 308252 | Central | 9.37 | Oman |
| 2 | 3 (S3) | Maharashtra | 307713 | Western | 9.36 | Italy |
| 3 | 4 (S4) | Uttar Pradesh | 240928 | Northern | 7.33 | United Kingdom |
| 4 | 5 (S5) | Gujarat | 196024 | Western | 5.96 | Senegal |
| 5 | 6 (S6) | Karnataka | 191791 | Southern | 5.83 | Syria |
| 6 | 8 (S7) | Andhra Pradesh | 160205 | Southern | 4.87 | Tunisia |
| 7 | 9 (S8) | Odisha | 155707 | Eastern | 4.73 | Bangladesh |
| 8 | 10 (S9) | Chhattisgarh | 135191 | Central | 4.11 | Greece |
| 9 | 11 (S10) | Tamil Nadu | 130058 | Southern | 3.95 | Greece |
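
Passing `index=False` to `to_csv` keeps the row index out of the file. Without it, reading the file back adds an extra `Unnamed: 0` column. The round trip can be checked without touching the disk, as in this small sketch. It uses an in-memory buffer and a two-row `frame` made up for the example.

```python
import io

import pandas as pd

frame = pd.DataFrame(
    {'State': ['Rajasthan', 'Madhya Pradesh'], 'Area (km2)': [342239, 308252]}
)

# Write to an in-memory text buffer instead of a file, then read it back.
buffer = io.StringIO()
frame.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)

print(list(restored.columns))
```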


### I hope this dataset is useful to you. Please upvote!

# Thank you!