In [1]:
from urllib.request import urlopen


In [2]:
from bs4 import BeautifulSoup

In [3]:
html = urlopen('https://www.w3schools.com/css/default.asp')
# html.read() returns the HTML content of the page as bytes.
bs = BeautifulSoup(html.read(), 'html.parser')

In [4]:
print(bs.h1)

<h1>CSS <span class="color_h1">Tutorial</span></h1>


In [5]:
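html = urlopen('https://www.w3schools.com/css/default.asp')  # reopen: html.read() above consumed the response
bs = BeautifulSoup(html, 'html.parser')  # the response object can also be passed directly, without .read()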

In [6]:
"""
This HTML content is then transformed into a BeautifulSoup object, 
with the following structure:

- html → <html><head>...</head><body>...</body></html>

- head → <head><title>A Useful Page</title></head>

- title →  <title>A Useful Page</title>

- body → <body><h1>An Int...</h1><div>Lorem ip...</div></body>
  
- h1 → <h1>An Interesting Title</h1>

- div → <div>Lorem Ipsum dolor...</div>

"""



'http://www.pythonscraping.com/pages/page1.html'

## An Interesting Title
Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.

# Parser
html.parser is a parser that is included with Python 3 and requires no extra installations in order to use.

lxml has some advantages over html.parser in that it is generally better at parsing “messy” or malformed HTML code. It is forgiving and fixes problems like unclosed tags, tags that are improperly nested, and missing head or body tags. It is also somewhat faster than html.parser.
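A minimal sketch of choosing between the two parsers, assuming lxml may or may not be installed (it is a separate package, installed with `pip install lxml`). The `make_soup` helper and the sample markup are hypothetical names made up for this illustration, not part of Beautiful Soup.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Hypothetical markup: no <html>, <head> or <body> tags, and an unclosed <div>.
messy_html = '<title>A Useful Page</title><h1>An Interesting Title</h1><div>Lorem ipsum'


def make_soup(markup):
    """Prefer lxml when it is installed; otherwise fall back to html.parser."""
    try:
        return BeautifulSoup(markup, 'lxml')
    except FeatureNotFound:
        # Raised by Beautiful Soup when the requested parser is not available.
        return BeautifulSoup(markup, 'html.parser')


soup = make_soup(messy_html)
print(soup.h1.get_text())
print(soup.div.get_text())
```

Either parser should recover the heading and the text of the unclosed div here; the trees they build around them can differ, since lxml adds the missing html, head, and body tags.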

In [7]:
from urllib.request import urlopen
from urllib.error import URLError
from urllib.error import HTTPError
from bs4 import BeautifulSoup

In [8]:
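try:
    html = urlopen('http://www.pythonscraping.com/pages/page1.html')
except HTTPError as e:
    # The server was reached but returned an error status (e.g. 404 or 500).
    print("HTTP error:", e)
except URLError as e:
    # The server could not be reached at all.
    print("The server could not be found!!")
except OSError as e:
    # Any other network-level failure.
    print("ERROR:", e)
else:
    bs = BeautifulSoup(html, 'html.parser')
    print(bs.h1)
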

<h1>An Interesting Title</h1>
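The order of the except clauses in the cell above matters because these exception classes form a hierarchy. The short check below, using only the standard library, shows why the most specific class has to be listed first; the variable names are made up for this sketch.

```python
from urllib.error import HTTPError, URLError

# HTTPError is a special case of URLError, which is a special case of OSError.
http_is_url_error = issubclass(HTTPError, URLError)
url_error_is_os_error = issubclass(URLError, OSError)

print(http_is_url_error)      # True
print(url_error_is_os_error)  # True
```

If `except OSError` came first, it would also catch every HTTPError and URLError, and the more specific handlers would never run.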


If a tag does not exist, BeautifulSoup returns None when you try to access it.
Attempting to access a tag on that None object then raises an AttributeError:

AttributeError: 'NoneType' object has no attribute 'someTag'
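A small sketch of that failure, using an inline HTML string (made up for this example) instead of a live page so that it runs without a network connection:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<html><body><h1>An Interesting Title</h1></body></html>', 'html.parser'
)

# The document has no such tag, so BeautifulSoup returns None.
missing = soup.nonExistentTag
print(missing)

try:
    # Accessing a tag on None raises AttributeError.
    missing.someTag
except AttributeError as e:
    error_message = str(e)
    print('Tag was not found:', error_message)
```

Checking for None before going one level deeper, or catching AttributeError as done here, keeps the scraper from crashing when a page is missing the expected structure.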

In [9]:
from urllib.request import urlopen
from urllib.error import HTTPError
from urllib.error import URLError
from bs4 import BeautifulSoup 


In [10]:
def getContent(URL):
    try:
        html = urlopen(URL)
    except OSError:
        # HTTPError and URLError are both subclasses of OSError.
        return None
    try:
        bs = BeautifulSoup(html, 'html.parser')
        content = bs.body.h1
    except AttributeError:
        # bs.body was None, i.e. the page has no <body> tag.
        return None
    return content

content = getContent("https://www.w3schools.com/css/default.asp")
if content is None:
    print("Found nothing")
else:
    print(content)


<h1>CSS <span class="color_h1">Tutorial</span></h1>


## Scraping using class & id attributes of tags
### Example
find_all function: use find_all to extract a Python list of the proper nouns
on the page by selecting only the text within a given tag that has a particular attribute.

In [11]:
from urllib.request import urlopen
from bs4 import BeautifulSoup

In [12]:
html = urlopen("https://www.pythonscraping.com/pages/warandpeace.html")
bs=BeautifulSoup(html,'html.parser')

In [13]:
namelist=bs.find_all('span',{'class':'green'})
#selecting only the text within <span class="green"></span> tags
for name in namelist:
    print(name.get_text())

Anna
Pavlovna Scherer
Empress Marya
Fedorovna
Prince Vasili Kuragin
Anna Pavlovna
St. Petersburg
the prince
Anna Pavlovna
Anna Pavlovna
the prince
the prince
the prince
Prince Vasili
Anna Pavlovna
Anna Pavlovna
the prince
Wintzingerode
King of Prussia
le Vicomte de Mortemart
Montmorencys
Rohans
Abbe Morio
the Emperor
the prince
Prince Vasili
Dowager Empress Marya Fedorovna
the baron
Anna Pavlovna
the Empress
the Empress
Anna Pavlovna's
Her Majesty
Baron
Funke
The prince
Anna
Pavlovna
the Empress
The prince
Anatole
the prince
The prince
Anna
Pavlovna
Anna Pavlovna


### get_text() function: 
.get_text() strips all tags from the document you are working
with and returns a Unicode string containing the text only. For
example, if you are working with a large block of text that contains
many hyperlinks, paragraphs, and other tags, all those will be stripped
away, and you’ll be left with a tagless block of text.
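As an illustration, here is a minimal sketch on a made-up snippet (the HTML string and the example.com URL are placeholders, not taken from the pages scraped above):

```python
from bs4 import BeautifulSoup

snippet = (
    '<p>Read the <a href="https://example.com"><b>full</b> text</a> '
    'of <i>War and Peace</i>.</p>'
)
soup = BeautifulSoup(snippet, 'html.parser')

# Tag-based lookups still work here because the tags are intact.
link_target = soup.find('a')['href']

# get_text() drops every tag and keeps only the text.
text = soup.get_text()
print(text)  # Read the full text of War and Peace.
```

Because get_text() throws away the tag structure, it is usually best called last, after any searching by tag or attribute is done.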

# find & find_all

- find_all(tag, attributes, recursive, text, limit, keywords)

- find(tag, attributes, recursive, text, keywords)

Both functions serve the same purpose; the only difference is the 'limit' argument,
which find does not take. Most of the time you will use only the tag and attributes arguments. Each argument is discussed below with an example.
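The relationship between the two functions can be sketched on a small inline document (the HTML string is made up for this example):

```python
from bs4 import BeautifulSoup

html_doc = (
    '<p><span class="green">Anna Pavlovna</span> spoke to '
    '<span class="green">Prince Vasili</span>.</p>'
)
soup = BeautifulSoup(html_doc, 'html.parser')

# find returns a single tag: the first match.
first = soup.find('span', {'class': 'green'})
# find_all returns a list, here capped at one item.
limited = soup.find_all('span', {'class': 'green'}, limit=1)

print(first.get_text())     # Anna Pavlovna
print(first == limited[0])  # True

# When nothing matches, find returns None and find_all returns an empty list.
no_match_find = soup.find('span', {'class': 'red'})
no_match_all = soup.find_all('span', {'class': 'red'})
print(no_match_find, len(no_match_all))
```

The practical consequence is that a find result should be checked for None before use, while a find_all result can always be looped over safely.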

In [14]:
#1.  tag, or list of tags
bs.find_all(['h1','h2','h3'])
#returns a list of all the header tags in a document

[<h1>War and Peace</h1>, <h2>Chapter 1</h2>]

In [15]:
#2. The attributes argument takes a Python dictionary of attributes and matches tags
#that contain any one of those attributes.
#For example, the following call returns both the green and red span
#tags in the HTML document.
namelist=bs.find_all('span',{'class':{'green','red'}})
for name in namelist:
    print(name.get_text())

Well, Prince, so Genoa and Lucca are now just family estates of the
Buonapartes. But I warn you, if you don't tell me that this means war,
if you still try to defend the infamies and horrors perpetrated by
that Antichrist- I really believe he is Antichrist- I will have
nothing more to do with you and you are no longer my friend, no longer
my 'faithful slave,' as you call yourself! But how do you do? I see
I have frightened you- sit down and tell me all the news.
Anna
Pavlovna Scherer
Empress Marya
Fedorovna
Prince Vasili Kuragin
Anna Pavlovna
St. Petersburg
If you have nothing better to do, Count [or Prince], and if the
prospect of spending an evening with a poor invalid is not too
terrible, I shall be very charmed to see you tonight between 7 and 10-
Annette Scherer.
Heavens! what a virulent attack!
the prince
Anna Pavlovna
First of all, dear friend, tell me how you are. Set your friend's
mind at rest,
Can one be well while suffering morally? Can one be calm in times
like these if one

In [16]:
#3. The recursive argument is a boolean. How deeply into the document do you
#want to go? If recursive is set to True, the find_all function looks into
#children, and children's children, for tags that match your parameters. If
#it is False, it looks only at the direct children of the object it is called on.
#By default, find_all works recursively (recursive is set to True).
#Here the spans are nested deeper than the top level of bs, so nothing is printed.
namelist=bs.find_all('span',{'class':'red'},recursive=False)
for name in namelist:
    print(name.get_text())
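The cell above prints nothing. A minimal sketch on a made-up nested document shows what recursive=False does and does not reach:

```python
from bs4 import BeautifulSoup

html_doc = '<div id="top"><span>direct</span><p><span>nested</span></p></div>'
soup = BeautifulSoup(html_doc, 'html.parser')
top = soup.find('div', {'id': 'top'})

# Default (recursive=True): searches children, grandchildren, and so on.
all_spans = top.find_all('span')
# recursive=False: only direct children of the tag the search starts from.
direct_spans = top.find_all('span', recursive=False)
# Starting from the whole document, the only direct child is the <div>.
top_level_spans = soup.find_all('span', recursive=False)

print([s.get_text() for s in all_spans])     # ['direct', 'nested']
print([s.get_text() for s in direct_spans])  # ['direct']
print(len(top_level_spans))                  # 0
```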

In [17]:
#4. The text argument is unusual in that it matches based on the text content of the tags,
#rather than properties of the tags themselves. For instance, to count the
#number of times "the prince" is surrounded by tags on the example page:
#(Beautiful Soup 4.4.0 and later call this argument string; text is the older name.)

namelist=bs.find_all(text="the prince")
print(len(namelist))

7
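A short sketch of text matching on a made-up snippet, assuming Beautiful Soup 4.4.0 or later, where the argument is spelled `string`. It shows that a plain string must match the tag's text exactly, including case:

```python
from bs4 import BeautifulSoup

html_doc = (
    '<p><span>the prince</span> greeted <span>The prince</span> '
    'and <b>the prince</b>.</p>'
)
soup = BeautifulSoup(html_doc, 'html.parser')

# Only text nodes equal to "the prince" match; "The prince" does not.
matches = soup.find_all(string='the prince')
print(len(matches))  # 2

# The results are the text nodes themselves; .parent gives the enclosing tag.
parent_names = [m.parent.name for m in matches]
print(parent_names)  # ['span', 'b']
```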


In [18]:
#5. The limit argument is used only in the find_all method; find is
#equivalent to the same find_all call with a limit of 1.
#Use limit to retrieve only the first x items from the page.

namelist=bs.find_all('span',{'class':'green'},limit=5)
print(namelist)
for name in namelist:
    print(name.get_text())  #two of the names contain a newline, so the 5 items print as 7 lines

[<span class="green">Anna
Pavlovna Scherer</span>, <span class="green">Empress Marya
Fedorovna</span>, <span class="green">Prince Vasili Kuragin</span>, <span class="green">Anna Pavlovna</span>, <span class="green">St. Petersburg</span>]
Anna
Pavlovna Scherer
Empress Marya
Fedorovna
Prince Vasili Kuragin
Anna Pavlovna
St. Petersburg


In [33]:
#6. Keyword arguments allow you to select tags that contain a particular
#attribute or set of attributes. Because class is a reserved word in Python,
#Beautiful Soup uses class_ for the class attribute.

namelist=bs.find_all(class_='red')
for name in namelist:
    print(name.get_text())

#This returns every tag with "red" in its class attribute. Keywords can be
#combined, e.g. bs.find_all(id='title', class_='text'). Note that, by convention,
#each value for an id should be used only once on the page.

Well, Prince, so Genoa and Lucca are now just family estates of the
Buonapartes. But I warn you, if you don't tell me that this means war,
if you still try to defend the infamies and horrors perpetrated by
that Antichrist- I really believe he is Antichrist- I will have
nothing more to do with you and you are no longer my friend, no longer
my 'faithful slave,' as you call yourself! But how do you do? I see
I have frightened you- sit down and tell me all the news.
If you have nothing better to do, Count [or Prince], and if the
prospect of spending an evening with a poor invalid is not too
terrible, I shall be very charmed to see you tonight between 7 and 10-
Annette Scherer.
Heavens! what a virulent attack!
First of all, dear friend, tell me how you are. Set your friend's
mind at rest,
Can one be well while suffering morally? Can one be calm in times
like these if one has any feeling?
You are
staying the whole evening, I hope?
And the fete at the English ambassador's? Today is Wednesday.
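Keyword arguments and the attributes dictionary are two ways of writing the same filter. A minimal sketch on a made-up document (the id and class values are chosen only for illustration):

```python
from bs4 import BeautifulSoup

html_doc = (
    '<div id="title" class="text">War and Peace</div>'
    '<div class="text">Chapter 1</div>'
    '<span class="red">quote</span>'
)
soup = BeautifulSoup(html_doc, 'html.parser')

# Keyword form: a tag must satisfy every keyword given.
by_keyword = soup.find_all(id='title', class_='text')
# Dictionary form of the same filter.
by_dict = soup.find_all(attrs={'id': 'title', 'class': 'text'})
# A single keyword on its own matches more tags.
text_class_only = soup.find_all(class_='text')

print([t.get_text() for t in by_keyword])  # ['War and Peace']
print(by_keyword == by_dict)               # True
print(len(text_class_only))                # 2
```

The dictionary form is handy when an attribute name is not a valid Python identifier; the keyword form is shorter for simple cases.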