# BeautifulSoup 1: scraping my page

BeautifulSoup is a powerful Python library used for pulling data out of HTML documents. In this notebook we will use the **requests** library to get the HTML document from my website and then use **BeautifulSoup** to get data from that document. Unlike **regular expressions**, which treat the whole HTML document as one string of text, **BeautifulSoup** distinguishes between plain text and HTML tags/attributes, which is very helpful for scraping.
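
To make that contrast concrete, here is a minimal sketch on a made-up HTML snippet. It assumes Python 3 and the newer **bs4** package (BeautifulSoup 4) rather than the BeautifulSoup 3 import used in this notebook; the snippet and the variable names are illustrative only.

```python
import re
from bs4 import BeautifulSoup  # assumption: BeautifulSoup 4 (bs4) is installed

html = '<p class="intro">Hello <b>world</b></p>'  # made-up snippet

# A regular expression only sees characters, so the inner <b> tag leaks through.
regex_result = re.findall(r"<p[^>]*>(.*?)</p>", html)
print(regex_result)

# BeautifulSoup parses the markup, so tags and text are kept apart.
soup = BeautifulSoup(html, "html.parser")
parsed_text = soup.find("p").get_text()
print(parsed_text)
```

The regular expression returns the paragraph content with the nested tag still inside it, while the parsed version returns only the text.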

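If you do not have BeautifulSoup installed, open a new command prompt (black window) and type the command below. This notebook uses BeautifulSoup 3 on Python 2, which matches the **from BeautifulSoup import \*** line used later. On Python 3 the maintained package is **beautifulsoup4**, which is imported as **bs4** instead.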

```
pip install beautifulsoup
```

Okay, let's start by importing the libraries mentioned above and selecting the url to scrape.

In [1]:
import requests
# import everything from BeautifulSoup
from BeautifulSoup import *

In [2]:
url = "https://hrantdavtyan.github.io/"

Once the libraries are imported and the url is selected, we use the **get()** function from the **requests** library to fetch the website content as a response, and then convert it to text through the response's **text** attribute.

In [4]:
response = requests.get(url)
my_page = response.text
print(response)
type(my_page)

<Response [200]>


unicode

To use the functions that the BeautifulSoup library provides, we need to pass **my_page** as an argument to the **BeautifulSoup()** function. The content remains the same, but the object type changes, which lets us use some nice methods.

In [5]:
soup = BeautifulSoup(my_page)

In [6]:
type(soup)

BeautifulSoup.BeautifulSoup

In [7]:
print(soup)

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
<head>
<title>Hrant Davtyan</title>
<meta http-equiv="content-type" content="text/html; charset=utf-8" />
<meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1" />
<meta name="viewport" content="width=device-width, initial-scale=1, maximum-scale=1" />
<link rel="stylesheet" type="text/css" href="css/reset.css" />
<link rel="stylesheet" type="text/css" href="css/style.css" />
<link rel="stylesheet" type="text/css" href="css/fancybox.css" />
<link rel="stylesheet" type="text/css" href="http://fonts.googleapis.com/css?family=Open+Sans:400,600,300,800,700,400italic|PT+Serif:400,400italic" />
<script type="text/javascript" src="js/jquery.min.js"></script>
<script type="text/javascript" src="js/jquery.easytabs.min.js"></script>
<script type="text/javascript" src="js/respond.min.js"></script>
<script type="text/javascript" src="js/j

Fine. Let's now try to find all the a tags from my page.

In [8]:
a_tags = soup.findAll('a')

In [9]:
type(a_tags)

list

In [10]:
len(a_tags)

29

As you can see above, the output is a list with 29 elements: the 29 a tags from my website. We can print the list to see them.

In [11]:
print(a_tags)

[<a href="#" class="social-text">SOCIAL PROFILES</a>, <a href="https://www.facebook.com/HrantDavtyan" class="social-facebook"></a>, <a href="https://www.linkedin.com/in/hrantdavtyan" class="social-in"></a>, <a href="#profile" class="tab-profile">Profile</a>, <a href="#resume" class="tab-resume">Resume</a>, <a href="#portfolio" class="tab-portfolio">Portfolio</a>, <a href="#contact" class="tab-contact">Contact</a>, <a href="http://aua.am/">American University of Armenia (AUA)</a>, <a href="http://isifa.am/en/ifa-dfi/">French-German University</a>, <a href="https://www.cerge-ei.cz/">Cerge-EI</a>, <a href="http://www.fao.org/armenia/en/">United Nations Food and Agriculture Organization (UN FAO)</a>, <a href="http://www.metric.am/">METRIC</a>, <a href="http://ahpc.am/?lang=en">Armenian Harvest Promotion Center</a>, <a href="http://www.ucl.ac.uk/">University College London (UCL)</a>, <a href="http://iset.tsu.ge/">International School of Economics (ISET)</a>, <a href="" class="current" data-

If you are interested in only the very first a tag, the **find()** function is more useful than **findAll()**. It returns the first matching a tag together with its content as a single element, rather than a list.

In [12]:
a_tag = soup.find('a')
type(a_tag)

BeautifulSoup.Tag

In [13]:
print(a_tag)

<a href="#" class="social-text">SOCIAL PROFILES</a>


As you can see above, although the result prints like a string, its type is **BeautifulSoup.Tag**, which lets us use some other methods on it. For example, we can get the link inside the a tag by using the **get()** function. As links are stored in the **href** attribute, we get the value of **href** as follows:

In [14]:
print(a_tag.get('href'))

#
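
Both **find()** and **get()** return `None` when there is nothing to return, which is worth knowing before looping over many tags. The sketch below shows this on a made-up snippet. It assumes Python 3 with the **bs4** package, where `find_all` is the current spelling of **findAll**; the second a tag without a link is invented for illustration.

```python
from bs4 import BeautifulSoup  # assumption: BeautifulSoup 4 (bs4) is installed

# First tag copied from the output above; the second one is made up.
html = '<a href="#" class="social-text">SOCIAL PROFILES</a><a>no link here</a>'
soup = BeautifulSoup(html, "html.parser")

first_a = soup.find("a")
print(first_a.get("href"))            # the href value of the first a tag

second_a = soup.find_all("a")[1]
missing = second_a.get("href")        # attribute absent, so get() gives None
fallback = second_a.get("href", "")   # a default can be supplied instead

no_table = soup.find("table")         # no such tag in the snippet, so None
print(missing, repr(fallback), no_table)
```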


If we want the links from all of **a_tags** (which is a list), we iterate over the list and get the **href** value from each element as follows:

In [15]:
for i in a_tags:
    print(i.get("href"))

#
https://www.facebook.com/HrantDavtyan
https://www.linkedin.com/in/hrantdavtyan
#profile
#resume
#portfolio
#contact
http://aua.am/
http://isifa.am/en/ifa-dfi/
https://www.cerge-ei.cz/
http://www.fao.org/armenia/en/
http://www.metric.am/
http://ahpc.am/?lang=en
http://www.ucl.ac.uk/
http://iset.tsu.ge/






https://github.com/HrantDavtyan/HrantDavtyan.github.io\teaching\jdocs\Business Analytics\html\index.html
http://clearwaterbeachassoc.com/wp-content/uploads/2016/01/Under-construction.png
http://www.armstat.am/en/?nid=661
http://clearwaterbeachassoc.com/wp-content/uploads/2016/01/Under-construction.png
http://clearwaterbeachassoc.com/wp-content/uploads/2016/01/Under-construction.png
http://clearwaterbeachassoc.com/wp-content/uploads/2016/01/Under-construction.png
http://clearwaterbeachassoc.com/wp-content/uploads/2016/01/Under-construction.png
http://clearwaterbeachassoc.com/wp-content/uploads/2016/01/Under-construction.png
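
The output above mixes full links with empty values and in-page anchors such as `#profile`. A minimal sketch of one way to clean such a list follows, assuming Python 3 (for `urllib.parse`). The helper name `clean_links` is hypothetical, and the relative path `css/style.css` is added to the sample only to show how `urljoin` resolves it against the page url.

```python
from urllib.parse import urljoin  # Python 3 standard library

base_url = "https://hrantdavtyan.github.io/"

# A few href values from the output above, plus one relative path for illustration.
hrefs = ["#", "#profile", "", None, "http://aua.am/", "css/style.css"]

def clean_links(hrefs, base_url):
    """Drop empty values and in-page anchors, and make the rest absolute."""
    links = []
    for href in hrefs:
        if not href or href.startswith("#"):
            continue  # skip None, empty strings and anchors like "#profile"
        links.append(urljoin(base_url, href))
    return links

external_links = clean_links(hrefs, base_url)
print(external_links)
```

Absolute links pass through `urljoin` unchanged, so the same helper can be applied to the whole list without checking each value first.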


Similarly, one can get all the p tags from my page by searching for all p-s as follows:

In [16]:
p_tags = soup.findAll('p')
print(p_tags)

[<p>I am a Data Enthusiast, teaching Business Analytics and providing consultancy on Statistics, Economics and IT. Feel free to take a look around my webpage.</p>, <p align="justify">I achieved an average feedback score of 4.7 (out of 5) teaching Business Analytics,
									Business Mathematics and Macroeconomics courses to overall 111 students.</p>, <p align="justify">Students with no prior knowledge of programming successfully learned Python, R, Stata and IBM SPSS
									and applied their skills by participating to a Kaggle competition as a final course project</p>, <p align="justify">I was awarded a one year fellowship to teach modern economics at local universities.
									During the fellowship tenure I taught around 230 students at 3 different universities.</p>, <p align="justify">Developed a software (using R, HTML, CSS) for Post Disaster Needs Assessment.
									Consulted contracted companies regarding Ministry staff appraisal and training needs assessment.
									

If you are interested only in the text of the paragraphs (without tags), then, as with **a_tags** above, iterate over the list and get the text out of each element as follows:

In [17]:
for i in p_tags:
    print(i.text)

I am a Data Enthusiast, teaching Business Analytics and providing consultancy on Statistics, Economics and IT. Feel free to take a look around my webpage.
I achieved an average feedback score of 4.7 (out of 5) teaching Business Analytics,
									Business Mathematics and Macroeconomics courses to overall 111 students.
Students with no prior knowledge of programming successfully learned Python, R, Stata and IBM SPSS
									and applied their skills by participating to a Kaggle competition as a final course project
I was awarded a one year fellowship to teach modern economics at local universities.
									During the fellowship tenure I taught around 230 students at 3 different universities.
Developed a software (using R, HTML, CSS) for Post Disaster Needs Assessment.
									Consulted contracted companies regarding Ministry staff appraisal and training needs assessment.
									Consulted National Statistical Service regarding Census (and post census) activities
Led several maj
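
The printed paragraphs above still carry the line breaks and tabs that come from the indentation of the page source. The sketch below is one simple way to collapse them, using only built-in string methods; the helper name `normalize_whitespace` is hypothetical, and the sample text is one paragraph copied from the output above.

```python
# One paragraph from the output above, with its embedded newline and tabs.
raw_text = (
    "I achieved an average feedback score of 4.7 (out of 5) teaching Business Analytics,\n"
    "\t\t\t\t\t\t\t\t\tBusiness Mathematics and Macroeconomics courses to overall 111 students."
)

def normalize_whitespace(text):
    """Collapse every run of spaces, tabs and newlines into a single space."""
    # split() with no arguments splits on any whitespace run and drops empty pieces
    return " ".join(text.split())

clean_text = normalize_whitespace(raw_text)
print(clean_text)
```

Inside the loop over **p_tags**, the same helper could be applied to each element's text before printing it.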