# SCRAPE DATA FROM THE UNIVERSITY OF KERALA WIKIPEDIA PAGE

In [27]:
import bs4 as bs
import urllib.request


In [28]:
source=urllib.request.urlopen("https://en.wikipedia.org/wiki/University_of_Kerala").read()
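A minimal sketch of a slightly more defensive fetch, assuming you want to identify the client and close the connection explicitly. The helper names `build_request` and `fetch` and the `example-scraper/0.1` User-Agent string are hypothetical and not part of the original notebook; only `urllib.request.Request` and `urllib.request.urlopen` come from the standard library.

```python
import urllib.request

URL = "https://en.wikipedia.org/wiki/University_of_Kerala"


def build_request(url, user_agent="example-scraper/0.1"):
    """Return a Request that identifies the client via a User-Agent header.

    The default user_agent value is a placeholder chosen for illustration.
    """
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch(url, timeout=10):
    """Download the page body as bytes; the with-block closes the connection."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
        return response.read()


# source = fetch(URL)  # uncomment to perform the actual network request
```

The network call is left commented out so the block can be run offline; `fetch(URL)` would return the same kind of bytes object as `source` above.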

In [29]:
#Then we create the "soup", which is a BeautifulSoup object:

In [30]:
soup=bs.BeautifulSoup(source,'lxml')

In [31]:
#print(soup) and print(source) look almost the same, but source is just the raw response bytes, while soup is a parsed object that we can navigate by tag, like so:

In [32]:
#title of the page

print(soup.title)

<title>University of Kerala - Wikipedia</title>


In [33]:
#get the tag's name

print(soup.title.name)


title


In [34]:
#get the text inside the tag

print(soup.title.string)

University of Kerala - Wikipedia


In [35]:
#navigate to the parent tag

print(soup.title.parent.name)

head
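The same four lookups can be tried offline on a tiny inline document. This is only a sketch: the HTML string and the variable name `demo` are made up for illustration, and it uses the built-in `html.parser` instead of `lxml` so that no extra parser is needed.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature page, standing in for the downloaded Wikipedia source.
html = "<html><head><title>Demo page</title></head><body><p>Hello</p></body></html>"
demo = BeautifulSoup(html, "html.parser")

print(demo.title)              # <title>Demo page</title>
print(demo.title.name)         # title
print(demo.title.string)       # Demo page
print(demo.title.parent.name)  # head
```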


In [36]:
#get the first <p> tag

print(soup.p)

<p class="mw-empty-elt">
</p>
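As the output shows, the first `<p>` on the page is an empty placeholder. One possible way to skip such placeholders is to take the first paragraph that has visible text. The helper name `first_nonempty_paragraph` and the inline HTML are hypothetical, written only to illustrate the idea on a small sample.

```python
from bs4 import BeautifulSoup

# Hypothetical sample mimicking the page: an empty placeholder, then real text.
html = """
<body>
<p class="mw-empty-elt">
</p>
<p><b>University of Kerala</b> is a public university.</p>
</body>
"""
demo = BeautifulSoup(html, "html.parser")


def first_nonempty_paragraph(soup):
    """Return the first <p> whose text is not just whitespace, or None."""
    for p in soup.find_all("p"):
        if p.get_text(strip=True):
            return p
    return None


first = first_nonempty_paragraph(demo)
print(first.get_text())  # University of Kerala is a public university.
```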


In [37]:
#Finding paragraph tags <p> is a fairly common task. In the case above, we're just finding the first one. What if we wanted to find them all?

In [38]:
print(soup.find_all('p'))

[<p class="mw-empty-elt">
</p>, <p><b>University of Kerala</b>, formerly the <b>University of Travancore</b>, is a <a href="/wiki/State_university_(India)" title="State university (India)">state-run</a> <a href="/wiki/Public_university" title="Public university">public university</a> located in <a href="/wiki/Thiruvananthapuram" title="Thiruvananthapuram">Thiruvananthapuram</a>, the state capital of <a href="/wiki/Kerala" title="Kerala">Kerala</a>, India. It was established in 1937 by a promulgation of the <a class="mw-redirect" href="/wiki/Maharajah_of_Travancore" title="Maharajah of Travancore">Maharajah of Travancore</a>, <a href="/wiki/Chithira_Thirunal_Balarama_Varma" title="Chithira Thirunal Balarama Varma">Chithira Thirunal Balarama Varma</a> who was also the first Chancellor of the university. <a class="mw-redirect" href="/wiki/C._P._Ramaswamy_Iyer" title="C. P. Ramaswamy Iyer">C. P. Ramaswamy Iyer</a>, the then Diwan (Prime Minister) of the State was the first Vice-Chancellor.

In [39]:
#We can also iterate through them:

In [40]:
for paragraph in soup.find_all('p'):
    print(paragraph.string)
    print(paragraph.text)





None
University of Kerala, formerly the University of Travancore, is a state-run public university located in Thiruvananthapuram, the state capital of Kerala, India. It was established in 1937 by a promulgation of the Maharajah of Travancore, Chithira Thirunal Balarama Varma who was also the first Chancellor of the university. C. P. Ramaswamy Iyer, the then Diwan (Prime Minister) of the State was the first Vice-Chancellor. It was the first university in Kerala, and among the first in the country.

None
The university has over 150 affiliated colleges and has sixteen faculties and 43 Departments of teaching and research. The Governor of Kerala serves as the Chancellor of university.

None
It was established in 1937 by a promulgation of the Maharajah of Travancore, Chithira Thirunal Balarama Varma who was also the first Chancellor of the university. C. P. Ramaswamy Iyer, the then Diwan (Prime Minister) of the State was the first Vice-Chancellor. It was the first university in Kerala, 

In [41]:
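#The difference between .string and .text: .string returns a NavigableString only when the tag has a single child (its text); if the paragraph contains child tags such as <a> or <b>, .string returns None. .text returns an ordinary Python string with the text of the tag and all of its descendants joined together. That is why the loop above prints None followed by the full paragraph text.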
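A small sketch of the `.string` versus `.text` behaviour on two hand-written paragraphs. The HTML and the names `simple` and `mixed` are illustrative only.

```python
from bs4 import BeautifulSoup

# Two hypothetical paragraphs: one with text only, one with a nested tag.
html = "<div><p>plain text only</p><p>text with a <b>child</b> tag</p></div>"
demo = BeautifulSoup(html, "html.parser")
simple, mixed = demo.find_all("p")

print(simple.string)                 # plain text only
print(type(simple.string).__name__)  # NavigableString
print(mixed.string)                  # None
print(mixed.text)                    # text with a child tag
```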

#Another common task is to grab links. For example:

In [42]:
# Print all the href links present in the university webpage.

In [43]:
for url in soup.find_all('a'):
    print(url.get('href'))

None
#mw-head
#searchInput
/wiki/File:Question_book-new.svg
/wiki/Wikipedia:Verifiability
https://en.wikipedia.org/w/index.php?title=University_of_Kerala&action=edit
/wiki/Help:Referencing_for_beginners
//www.google.com/search?as_eq=wikipedia&q=%22University+of+Kerala%22
//www.google.com/search?tbm=nws&q=%22University+of+Kerala%22+-wikipedia&tbs=ar:1
//www.google.com/search?&q=%22University+of+Kerala%22&tbs=bkt:s&tbm=bks
//www.google.com/search?tbs=bks:1&q=%22University+of+Kerala%22+-wikipedia
//scholar.google.com/scholar?q=%22University+of+Kerala%22
https://www.jstor.org/action/doBasicSearch?Query=%22University+of+Kerala%22&acc=on&wc=on
/wiki/Help:Maintenance_template_removal
/wiki/File:Kerala_University.jpg
/wiki/Sanskrit_language
/wiki/State_university_(India)
/wiki/Chithira_Thirunal_Balarama_Varma
/wiki/University_Grants_Commission_(India)
/wiki/National_Assessment_and_Accreditation_Council
/wiki/Association_of_Indian_Universities
/wiki/Association_of_Commonwealth_Universities
/wik

In [44]:
#In this case, if we just grabbed the .text from each <a> tag, we would get the anchor text, but we want the link target itself. That's why we use .get('href'), which returns the value of the href attribute, or None when the tag has no href (as in the first line of the output).
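Many of the printed links are relative (`/wiki/...`), protocol-relative (`//www.google.com/...`), in-page fragments (`#mw-head`), or missing. A possible way to turn them into absolute URLs is `urllib.parse.urljoin`. This is a sketch on a hand-written sample; `BASE_URL`, `absolute_links` and the HTML snippet are assumptions introduced for illustration.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "https://en.wikipedia.org/wiki/University_of_Kerala"

# Hypothetical sample covering the link shapes seen in the output above.
html = """
<a>no href here</a>
<a href="#mw-head">Jump</a>
<a href="/wiki/Kerala">Kerala</a>
<a href="//scholar.google.com/scholar?q=Kerala">Scholar</a>
"""
demo = BeautifulSoup(html, "html.parser")

absolute_links = []
for a in demo.find_all("a", href=True):  # href=True skips tags without an href
    href = a["href"]
    if href.startswith("#"):  # skip in-page fragment links
        continue
    absolute_links.append(urljoin(BASE_URL, href))

for link in absolute_links:
    print(link)
```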

#Finally, you may just want to grab text. You can use .get_text() on a Beautiful Soup object, including the full soup:

In [45]:
print(soup.get_text())




University of Kerala - Wikipedia








































University of Kerala

From Wikipedia, the free encyclopedia



Jump to navigation
Jump to search
University in India


This article needs additional citations for verification. Please help improve this article by adding citations to reliable sources. Unsourced material may be challenged and removed.Find sources: "University of Kerala" – news · newspapers · books · scholar · JSTOR (September 2015) (Learn how and when to remove this template message)
University of KeralaUniversity of Kerala HeadquartersFormer nameUniversity of TravancoreMottoकर्मणि व्यज्यते प्रज्ञाKarmaṇi Vyajyate Prajñā  (Sanskrit)Motto in EnglishWisdom manifests itself in actionTypeState UniversityEstablished1937; 85 years ago (1937)FounderChithira Thirunal Balarama VarmaAffiliationUGC, NAAC, AIU, ACUChancellorGovernor Of KeralaVice-ChancellorDr V. P. Mahadevan Pillai[1]LocationThiruvananthapuram, Kerala, India8°30′12″N 76°56′50″E﻿ / ﻿8.50333°N
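In the output above, text from adjacent table cells runs together (for example `TypeState UniversityEstablished1937`). `get_text()` accepts `separator` and `strip` arguments that can make the result easier to read. The table snippet below is a hypothetical reduction of the infobox, used only to show the effect.

```python
from bs4 import BeautifulSoup

# Hypothetical two-row table resembling the infobox on the page.
html = (
    "<table>"
    "<tr><th>Type</th><td>State University</td></tr>"
    "<tr><th>Established</th><td>1937</td></tr>"
    "</table>"
)
demo = BeautifulSoup(html, "html.parser")

# Default: text nodes are concatenated with nothing between them.
print(demo.get_text())                           # TypeState UniversityEstablished1937

# separator inserts a string between text nodes; strip trims each node
# and drops the ones that are empty.
print(demo.get_text(separator=" ", strip=True))  # Type State University Established 1937
```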

In [46]:
#Next, we can grab the links from just the nav bar:

In [47]:
nav = soup.nav

In [48]:
for url in nav.find_all('a'):
    print(url.get('href'))

/wiki/Special:MyTalk
/wiki/Special:MyContributions
/w/index.php?title=Special:CreateAccount&returnto=University+of+Kerala
/w/index.php?title=Special:UserLogin&returnto=University+of+Kerala
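Note that `soup.nav` is `None` when a page has no `<nav>` element, in which case `nav.find_all('a')` would raise an `AttributeError`. A minimal guard might look like the following; the function name `nav_links` and both sample documents are hypothetical.

```python
from bs4 import BeautifulSoup


def nav_links(soup):
    """Return the href values inside the first <nav>, or [] if there is none."""
    nav = soup.nav  # None when the document has no <nav> tag
    if nav is None:
        return []
    return [a["href"] for a in nav.find_all("a", href=True)]


# Hypothetical samples: one page with a nav bar, one without.
with_nav = BeautifulSoup('<nav><a href="/wiki/Special:MyTalk">Talk</a></nav>', "html.parser")
without_nav = BeautifulSoup('<div><a href="/x">x</a></div>', "html.parser")

print(nav_links(with_nav))     # ['/wiki/Special:MyTalk']
print(nav_links(without_nav))  # []
```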


In [49]:
#In this case, soup.nav returns the first <nav> tag in the document (the navigation bar). You could also use soup.body to get the body section, then grab the .text of each paragraph inside it:

In [50]:
body = soup.body
for paragraph in body.find_all('p'):
    print(paragraph.text)



University of Kerala, formerly the University of Travancore, is a state-run public university located in Thiruvananthapuram, the state capital of Kerala, India. It was established in 1937 by a promulgation of the Maharajah of Travancore, Chithira Thirunal Balarama Varma who was also the first Chancellor of the university. C. P. Ramaswamy Iyer, the then Diwan (Prime Minister) of the State was the first Vice-Chancellor. It was the first university in Kerala, and among the first in the country.

The university has over 150 affiliated colleges and has sixteen faculties and 43 Departments of teaching and research. The Governor of Kerala serves as the Chancellor of university.

It was established in 1937 by a promulgation of the Maharajah of Travancore, Chithira Thirunal Balarama Varma who was also the first Chancellor of the university. C. P. Ramaswamy Iyer, the then Diwan (Prime Minister) of the State was the first Vice-Chancellor. It was the first university in Kerala, and among the fir