In [1]:
# Install the library

!pip install requests --upgrade --quiet

In [2]:
# Import the library

import requests

We can download a web page using the `requests.get` function.


In [4]:
topic_url = 'https://github.com/topics/machine-learning'

In [5]:
response = requests.get(topic_url)

In [6]:
type(response)

requests.models.Response

`requests.get` returns a response object with the page contents and some information indicating whether the request was successful, using a status code. Learn more about HTTP status codes here: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status. 

 If the request was successful, `response.status_code` is set to a value between 200 and 299. 
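
As a minimal sketch of the 200 to 299 rule, the check can be wrapped in a small helper. The name `is_success` and the stand-in objects are assumptions made for illustration; they are not part of `requests`, which offers `response.raise_for_status()` to raise an exception for 4xx and 5xx codes.

```python
from types import SimpleNamespace

def is_success(response):
    """Return True when the response carries a 2xx (successful) status code."""
    return 200 <= response.status_code <= 299

# Stand-in objects (hypothetical) so the check can be tried without a network call.
ok_response = SimpleNamespace(status_code=200)
missing_response = SimpleNamespace(status_code=404)

print(is_success(ok_response))       # True
print(is_success(missing_response))  # False
```

The same helper accepts a real `requests` response, because it only reads the `status_code` attribute.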

In [7]:
response.status_code

200

The contents of the web page can be accessed using the `.text` property of the `response`. 

In [10]:
page_contents = response.text

In [11]:
len(page_contents)

605364

The page contains over 600,000 characters! Let's view the first 50 characters of the web page.

In [41]:
page_contents[:50]

'\n\n<!DOCTYPE html>\n<html lang="en" >\n  <head>\n    <'

What you see above is the *source code* of the web page. It is written in a language called [HTML](https://developer.mozilla.org/en-US/docs/Web/HTML), which defines the content and structure of the web page.

Let's save the contents to a file with the `.html` extension.

In [24]:
# Raw string (r'...') keeps the backslashes in the Windows path literal
with open(r'C:\JIgsaw FSDS\Python\machine-learning-topics.html', 'w', encoding='utf-8') as file:
    file.write(page_contents)
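
In a regular Python string a backslash starts an escape sequence, so Windows paths are usually safer as raw strings or `pathlib.Path` objects. The sketch below shows the same save-and-read-back round trip with `pathlib`. It assumes a temporary directory and a short sample string in place of the notebook's folder and the downloaded page.

```python
import tempfile
from pathlib import Path

# A short sample string stands in for the downloaded page (an assumption).
sample_contents = '<!DOCTYPE html>\n<html lang="en"><head></head></html>'

# A temporary directory stands in for the notebook's own folder (an assumption).
with tempfile.TemporaryDirectory() as folder:
    output_path = Path(folder) / 'machine-learning-topics.html'

    # write_text opens, writes, and closes the file in one call.
    output_path.write_text(sample_contents, encoding='utf-8')

    # Read the file back to confirm the round trip preserved the text.
    saved_contents = output_path.read_text(encoding='utf-8')

print(saved_contents == sample_contents)
```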

In [25]:
topics_da_url = 'https://github.com/topics/data-analysis'

In [26]:
da_response = requests.get(topics_da_url)

In [28]:
type(da_response)

requests.models.Response

In [29]:
da_response.status_code

200

In [30]:
page_contents_da = da_response.text

In [40]:
page_contents_da[:50]

'\n\n<!DOCTYPE html>\n<html lang="en" >\n  <head>\n    <'

In [32]:
# Raw string (r'...') keeps the backslashes in the Windows path literal
with open(r'C:\JIgsaw FSDS\Python\data-analysis.html', 'w', encoding='utf-8') as file:
    file.write(page_contents_da)
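
The download-and-save steps were repeated for a second topic, so the file name could be derived from the URL instead of typed by hand. This is a sketch only: `html_filename_for` is a hypothetical helper, and its naming differs slightly from the `machine-learning-topics.html` file used in this notebook.

```python
from urllib.parse import urlparse

def html_filename_for(topic_url):
    """Build a file name such as 'data-analysis.html' from a topic page URL."""
    # The last path segment of the URL is the topic name.
    topic_name = urlparse(topic_url).path.rstrip('/').split('/')[-1]
    return topic_name + '.html'

print(html_filename_for('https://github.com/topics/data-analysis'))
print(html_filename_for('https://github.com/topics/machine-learning'))
```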

## Extracting information from HTML using Beautiful Soup

To extract information from the HTML source code of a webpage programmatically, we can use the [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) library. Let's install the library and import the `BeautifulSoup` class from the `bs4` module.

In [34]:
!pip install beautifulsoup4 --upgrade --quiet

In [37]:
# Import the library
from bs4 import BeautifulSoup

In [39]:
?BeautifulSoup

Next, let's read the contents of the file `machine-learning-topics.html` and create a `BeautifulSoup` object to parse the content.

In [45]:
# Raw string (r'...') keeps the backslashes in the Windows path literal
with open(r'C:\JIgsaw FSDS\Python\machine-learning-topics.html', 'r', encoding='utf-8') as f:
    html_source = f.read()

In [46]:
html_source[:1000]

'\n\n<!DOCTYPE html>\n<html lang="en" >\n  <head>\n    <meta charset="utf-8">\n  <link rel="dns-prefetch" href="https://github.githubassets.com">\n  <link rel="dns-prefetch" href="https://avatars.githubusercontent.com">\n  <link rel="dns-prefetch" href="https://github-cloud.s3.amazonaws.com">\n  <link rel="dns-prefetch" href="https://user-images.githubusercontent.com/">\n\n\n\n  <link crossorigin="anonymous" media="all" integrity="sha512-uGiH6wbEDXS0vWuvN3hZbENUuT1jRMWy2XVfJIgd3mEESUBtD/hnFdIiujVyRcPJ5dofwZ6e196xmCczSkgz9g==" rel="stylesheet" href="https://github.githubassets.com/assets/frameworks-b86887eb06c40d74b4bd6baf3778596c.css" />\n  <link crossorigin="anonymous" media="all" integrity="sha512-gEUpuli94xYShC0AAbAVQoQqxAoVyNDUWuD3x6Hsvwm8f1L7gbiu4bEM1HDLEkRz4ofHAvdAdmeqaUtzBCy6xg==" rel="stylesheet" href="https://github.githubassets.com/assets/site-804529ba58bde31612842d0001b01542.css" />\n    <link crossorigin="anonymous" media="all" integrity="sha512-8rXKu7ZOFdS3H7Rk0wJ38WQFoEp6

In [47]:
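# Name the parser explicitly so the result does not depend on which parsers are installed
doc = BeautifulSoup(html_source, 'html.parser')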

In [49]:
type(doc)

bs4.BeautifulSoup

In [50]:
doc.title

<title>machine-learning · GitHub Topics · GitHub</title>

In [52]:
doc.name

'[document]'

In [53]:
title_tag=doc.title

In [54]:
title_tag

<title>machine-learning · GitHub Topics · GitHub</title>

In [56]:
title_tag.text

'machine-learning · GitHub Topics · GitHub'

In [57]:
title_tag.name

'title'

In [69]:
title_tag_html = doc.html
title_tag_img = doc.img
title_tag_div = doc.div
title_tag_span = doc.span
title_tag_p=doc.p

In [72]:
print(title_tag_html.name)
print(title_tag_img.name)
print(title_tag_div.name)
print(title_tag_span.name)
print(title_tag_p.name)

html
img
div
span
p


In [73]:
title_tag_img

<img alt="" class="mr-2 header-search-key-slash" src="https://github.githubassets.com/images/search-key-slash.svg"/>

In [74]:
title_tag_span

<span class="progress-pjax-loader width-full js-pjax-loader-bar Progress position-fixed">
<span class="Progress-item progress-pjax-loader-bar" style="background-color: #79b8ff;width: 0%;"></span>
</span>

In [75]:
title_tag_p

<p>Machine learning is the practice of teaching a computer to learn. The concept uses pattern recognition, as well as other forms of predictive algorithms, to make judgments on incoming data. This field is closely related to artificial intelligence and computational statistics.</p>

In [96]:
first_link =doc.a

In [97]:
first_link

<a class="px-2 py-4 color-bg-info-inverse color-text-white show-on-focus js-skip-to-content" href="#start-of-content">Skip to content</a>

In [78]:
first_link.text

'Skip to content'

In [86]:
first_paragraph = doc.p

In [87]:
first_paragraph.text

'Machine learning is the practice of teaching a computer to learn. The concept uses pattern recognition, as well as other forms of predictive algorithms, to make judgments on incoming data. This field is closely related to artificial intelligence and computational statistics.'

### Finding all tags of the same type

To find all the occurrences of a tag, use the `find_all` method.

> **QUESTION**: Find all the link tags on the page. How many links does the page contain?

In [88]:
all_link_tags = doc.find_all('a')

In [90]:
len(all_link_tags)

599

In [92]:
all_link_tags[1:2]

[<a aria-label="Homepage" class="mr-4" data-ga-click="(Logged out) Header, go to homepage, icon:logo-wordmark" href="https://github.com/">
 <svg aria-hidden="true" class="octicon octicon-mark-github color-text-white" height="32" version="1.1" viewbox="0 0 16 16" width="32"><path d="M8 0C3.58 0 0 3.58 0 8c0 3.54 2.29 6.53 5.47 7.59.4.07.55-.17.55-.38 0-.19-.01-.82-.01-1.49-2.01.37-2.53-.49-2.69-.94-.09-.23-.48-.94-.82-1.13-.28-.15-.68-.52-.01-.53.63-.01 1.08.58 1.23.82.72 1.21 1.87.87 2.33.66.07-.52.28-.87.51-1.07-1.78-.2-3.64-.89-3.64-3.95 0-.87.31-1.59.82-2.15-.08-.2-.36-1.02.08-2.12 0 0 .67-.21 2.2.82.64-.18 1.32-.27 2-.27.68 0 1.36.09 2 .27 1.53-1.04 2.2-.82 2.2-.82.44 1.1.16 1.92.08 2.12.51.56.82 1.27.82 2.15 0 3.07-1.87 3.75-3.65 3.95.29.25.54.73.54 1.48 0 1.07-.01 1.93-.01 2.2 0 .21.15.46.55.38A8.013 8.013 0 0016 8c0-4.42-3.58-8-8-8z" fill-rule="evenodd"></path></svg>
 </a>]

In [163]:
all_img_tags = doc.find_all('img')

In [94]:
len(all_img_tags)

18

In [95]:
all_img_tags[0:4]

[<img alt="" class="mr-2 header-search-key-slash" src="https://github.githubassets.com/images/search-key-slash.svg"/>,
 <img alt="" aria-label="Team" class="avatar mr-2 flex-shrink-0 js-jump-to-suggestion-avatar d-none" height="28" src="" width="28"/>,
 <img alt="" aria-label="Team" class="avatar mr-2 flex-shrink-0 js-jump-to-suggestion-avatar d-none" height="28" src="" width="28"/>,
 <img alt="" aria-label="Team" class="avatar mr-2 flex-shrink-0 js-jump-to-suggestion-avatar d-none" height="28" src="" width="28"/>]
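
To try `find_all` without the full GitHub page, here is a minimal sketch on a tiny, made-up document. The HTML and the variable names are assumptions for illustration only.

```python
from bs4 import BeautifulSoup

# A small, invented document used only for this example.
small_html = """
<html><body>
  <a href="/one">One</a>
  <a href="/two">Two</a>
  <img alt="logo" src="/logo.png"/>
</body></html>
"""

small_doc = BeautifulSoup(small_html, 'html.parser')

# find_all returns a list-like result, so len() and indexing both work.
links = small_doc.find_all('a')
link_texts = [link.text for link in links]
print(len(links))       # 2
print(link_texts)       # ['One', 'Two']

# Asking for a tag that never appears gives an empty result, not an error.
tables = small_doc.find_all('table')
print(len(tables))      # 0
```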

### Accessing attributes

The attributes of a tag can be accessed using the indexing notation, e.g., `first_link['href']`.

In [99]:
first_link['href']

'#start-of-content'

In [100]:
first_link['class']

['px-2',
 'py-4',
 'color-bg-info-inverse',
 'color-text-white',
 'show-on-focus',
 'js-skip-to-content']

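Note that the `class` attribute is automatically split into a list of classes. Beautiful Soup does this only for a small set of multi-valued attributes (`class`, `rel` and a few others); every other attribute is returned as a plain string. Splitting `class` is useful because it's common practice to check for a specific class within a tag.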

You can use the `.attrs` property to view all the attributes as a dictionary.

In [101]:
first_link.attrs

{'href': '#start-of-content',
 'class': ['px-2',
  'py-4',
  'color-bg-info-inverse',
  'color-text-white',
  'show-on-focus',
  'js-skip-to-content']}

In [103]:
all_img_tags[4]

<img alt="" aria-label="Team" class="avatar mr-2 flex-shrink-0 js-jump-to-suggestion-avatar d-none" height="28" src="" width="28"/>

In [104]:
all_img_tags[4].attrs

{'class': ['avatar',
  'mr-2',
  'flex-shrink-0',
  'js-jump-to-suggestion-avatar',
  'd-none'],
 'alt': '',
 'aria-label': 'Team',
 'src': '',
 'width': '28',
 'height': '28'}

In [105]:
all_img_tags[4]['src']

''

In [106]:
all_img_tags[4]['alt']

''
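
Indexing a tag with an attribute it does not have raises a `KeyError`. The sketch below, on a made-up tag, shows the `.get` method as a gentler alternative that returns `None` or a default value instead.

```python
from bs4 import BeautifulSoup

# An invented tag used only for this example.
tag = BeautifulSoup('<a class="btn primary" href="/home">Home</a>', 'html.parser').a

print(tag['href'])      # /home
print(tag['class'])     # ['btn', 'primary']

# .get returns None (or the given default) when the attribute is missing.
missing_title = tag.get('title')
fallback_title = tag.get('title', 'no title')
print(missing_title)    # None
print(fallback_title)   # no title
```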

### Searching by Attribute Value

> **QUESTION**: Find the `img` tag(s) on the page with the `alt` attribute set to `tsbertalan`.

We can provide a dictionary of attributes as the second argument to `find_all`:

In [108]:
doc.find_all('img',{'alt':'tsbertalan'})

[<img alt="tsbertalan" class="avatar avatar-user avatar-small" height="32" src="https://avatars.githubusercontent.com/u/306137?v=4" width="32"/>]

If we're just interested in the first element, we can use the `find` method. Keep in mind that `find` returns `None` if no matching tag is found.

In [109]:
doc.find('img',{'alt':'tsbertalan'})

<img alt="tsbertalan" class="avatar avatar-user avatar-small" height="32" src="https://avatars.githubusercontent.com/u/306137?v=4" width="32"/>

> **EXERCISE**: Find the `src` attribute of the first `img` tag with the `alt` attribute set to `julia`. Visit the link and check what the image represents.

In [117]:
doc.find('img',{'alt':'julia'})['src']

'https://repository-images.githubusercontent.com/1644196/ddfc1e00-6638-11e9-9b80-0fe7b9aedd72'
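
Because `find` returns `None` when nothing matches, indexing its result directly fails with a `TypeError` for a missing image. One way to guard against that is sketched below. `image_src` is a hypothetical helper, and the one-line document is a trimmed-down stand-in for the real page.

```python
from bs4 import BeautifulSoup

# A trimmed-down stand-in for the real page, containing a single avatar.
avatars_html = '<div><img alt="tsbertalan" src="https://avatars.githubusercontent.com/u/306137?v=4"/></div>'
avatars_doc = BeautifulSoup(avatars_html, 'html.parser')

def image_src(doc, alt_text):
    """Return the src of the first img with the given alt text, or None."""
    img_tag = doc.find('img', {'alt': alt_text})
    if img_tag is None:
        return None
    return img_tag.get('src')

print(image_src(avatars_doc, 'tsbertalan'))
print(image_src(avatars_doc, 'julia'))   # None, since this sample has no such image
```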

### Searching by Class

The `class` attribute is one of the most frequently used attributes on HTML tags (used for layout and styling). We can search for tags containing a class using the `class_` argument in `find_all` (note that `class` is a reserved keyword in Python, hence the underscore in the argument name).

> **QUESTION**: Find all the tags containing the class `HeaderMenu-link`. 

In [124]:
matching_tags=doc.find_all(class_='HeaderMenu-link')

We can also search for a specific type of tag, e.g. `<a>`, matching the given class.

In [126]:
header_link_tags=doc.find_all('a', class_='HeaderMenu-link')

In [None]:
header_link_tags

In [157]:
doc.find('img', alt="simeonschaub")

<img alt="simeonschaub" class="avatar avatar-user avatar-small" height="32" src="https://avatars.githubusercontent.com/u/5220528?v=4" width="32"/>
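
The difference between searching by class alone and by tag name plus class is easier to see on a small example. The markup below is invented for illustration and only loosely mirrors GitHub's header.

```python
from bs4 import BeautifulSoup

# Invented markup: two tags share a class, one of them is not a link.
menu_html = """
<nav>
  <a class="HeaderMenu-link no-underline" href="/team">Team</a>
  <span class="HeaderMenu-link">Explore</span>
  <a class="footer-link" href="/about">About</a>
</nav>
"""
menu_doc = BeautifulSoup(menu_html, 'html.parser')

# class_ matches any tag whose class list contains the given class.
any_tag = menu_doc.find_all(class_='HeaderMenu-link')
tag_names = [tag.name for tag in any_tag]
print(tag_names)       # ['a', 'span']

# Adding a tag name narrows the search to that type of tag.
only_links = menu_doc.find_all('a', class_='HeaderMenu-link')
link_hrefs = [tag['href'] for tag in only_links]
print(link_hrefs)      # ['/team']
```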

## Parsing Information from Tags
Once we have a list of tags matching some criteria, it's easy to extract information and convert it to a more convenient format.

> **QUESTION**: Find the link text and URL of all the links within the page header on https://github.com/topics/machine-learning.

We'll create a list of dictionaries containing the required information. We'll add the base URL `https://github.com` as a prefix because the `href` attribute only contains the relative path, e.g. `/explore`.

In [165]:
all_img_tags[0]

<img alt="" class="mr-2 header-search-key-slash" src="https://github.githubassets.com/images/search-key-slash.svg"/>

In [168]:
all_img_tags[0]['class']

['mr-2', 'header-search-key-slash']

In [169]:
all_img_tags[0]['src']

'https://github.githubassets.com/images/search-key-slash.svg'

In [177]:
header_link_tags[0]['href']

'/team'

In [179]:
header_links=[]
base_url = 'https://github.com'

for tag in header_link_tags:
    header_links.append({'title':tag.text.strip(),'url':base_url + tag['href']})

In [182]:
header_links

[{'title': 'Team', 'url': 'https://github.com/team'},
 {'title': 'Enterprise', 'url': 'https://github.com/enterprise'},
 {'title': 'Marketplace', 'url': 'https://github.com/marketplace'},
 {'title': 'Sign in',
  'url': 'https://github.com/login?return_to=%2Ftopics%2Fmachine-learning'},
 {'title': 'Sign up',
  'url': 'https://github.com/join?ref_cta=Sign+up&ref_loc=header+logged+out&ref_page=%2Ftopics%2Fmachine-learning&source=header'},
 {'title': 'Sign up',
  'url': 'https://github.com/join_next?ref_cta=Sign+up&ref_loc=header+logged+out&ref_page=%2Ftopics%2Fmachine-learning&source=header'}]
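
Prefixing the base URL by string concatenation works when every `href` is a relative path. As a hedged alternative, `urljoin` from the standard library also leaves absolute URLs untouched. The `href` values below are illustrative samples, not the full list from the page.

```python
from urllib.parse import urljoin

base_url = 'https://github.com'

# Sample href values (illustrative): two relative paths and one absolute URL.
hrefs = [
    '/team',
    '/login?return_to=%2Ftopics%2Fmachine-learning',
    'https://example.com/docs',
]

# urljoin resolves relative paths against the base and keeps absolute URLs as they are.
full_urls = [urljoin(base_url, href) for href in hrefs]

for url in full_urls:
    print(url)
```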

In [187]:
all_img_tags[0].attrs

{'src': 'https://github.githubassets.com/images/search-key-slash.svg',
 'alt': '',
 'class': ['mr-2', 'header-search-key-slash']}

In [205]:
# List of all the images matching the class avatar-user
images_all=[]
img_list=doc.find_all('img', class_='avatar-user')
#base_url = 'https://github.com'

for tag in img_list:
    images_all.append({'username':tag['alt'] ,'url': tag['src']})

In [206]:
images_all

[{'username': 'tsbertalan',
  'url': 'https://avatars.githubusercontent.com/u/306137?v=4'},
 {'username': 'gabrieldemarmiesse',
  'url': 'https://avatars.githubusercontent.com/u/12891691?v=4'},
 {'username': 'akamaus',
  'url': 'https://avatars.githubusercontent.com/u/58955?v=4'},
 {'username': 'sh-biswas',
  'url': 'https://avatars.githubusercontent.com/u/51776663?v=4'},
 {'username': 'simeonschaub',
  'url': 'https://avatars.githubusercontent.com/u/5220528?v=4'},
 {'username': 'trivialfis',
  'url': 'https://avatars.githubusercontent.com/u/16746409?v=4'}]
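
A list of dictionaries like this one converts readily to other formats. As one possible sketch, the standard `csv` module can write it out as rows. Two of the records are copied here so the block runs on its own, and the in-memory buffer and the `avatars.csv` name in the comment are assumptions.

```python
import csv
import io

# Two of the records collected above, copied here so the block is self-contained.
images_all = [
    {'username': 'tsbertalan', 'url': 'https://avatars.githubusercontent.com/u/306137?v=4'},
    {'username': 'akamaus', 'url': 'https://avatars.githubusercontent.com/u/58955?v=4'},
]

# Write to an in-memory buffer; a file opened with
# open('avatars.csv', 'w', newline='') (hypothetical name) could be used instead.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=['username', 'url'])
writer.writeheader()
writer.writerows(images_all)

csv_text = buffer.getvalue()
print(csv_text)
```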

### Elements inside a tag

> **QUESTION**: Find the `li` tags that are direct children of `ul` tag with the class `top-list` in the sample HTML document below.


In [208]:
sample_html = """
<html>
    <body>
        <ul class="top-list">
            <li>Item 1</li>
            <li>Item 2</li>
            <li>
                <ul>
                    <li>Item 3.1</li>
                    <li>Item 3.2</li>
                    <li>Item 3.3</li>
                </ul> 
            </li>
        </ul>
    </body>
</html>"""

In [209]:
sample_doc=BeautifulSoup(sample_html)

In [211]:
list_tag=sample_doc.find('ul',class_='top-list')

We can use the `find_all` method on the tag, and set `recursive=False` to find just the direct children.

In [213]:
list_tag

<ul class="top-list">
<li>Item 1</li>
<li>Item 2</li>
<li>
<ul>
<li>Item 3.1</li>
<li>Item 3.2</li>
<li>Item 3.3</li>
</ul>
</li>
</ul>

In [227]:
list_item_tags = list_tag.find_all('li',recursive=False)

In [228]:
list_item_tags

[<li>Item 1</li>,
 <li>Item 2</li>,
 <li>
 <ul>
 <li>Item 3.1</li>
 <li>Item 3.2</li>
 <li>Item 3.3</li>
 </ul>
 </li>]

Without `recursive=False`, the inner list items are also included in the result.

In [231]:
list_item_tags = list_tag.find_all('li')

In [232]:
list_item_tags

[<li>Item 1</li>,
 <li>Item 2</li>,
 <li>
 <ul>
 <li>Item 3.1</li>
 <li>Item 3.2</li>
 <li>Item 3.3</li>
 </ul>
 </li>,
 <li>Item 3.1</li>,
 <li>Item 3.2</li>,
 <li>Item 3.3</li>]
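
The effect of `recursive=False` can be confirmed by counting the matches. The sketch below uses a shortened version of the sample list (an assumption, with two inner items instead of three). It also shows a CSS selector with the child combinator `>` as an alternative way to ask for direct children only.

```python
from bs4 import BeautifulSoup

# A shortened version of the sample list, with two inner items.
nested_html = """
<ul class="top-list">
  <li>Item 1</li>
  <li>Item 2</li>
  <li>
    <ul>
      <li>Item 3.1</li>
      <li>Item 3.2</li>
    </ul>
  </li>
</ul>
"""
nested_doc = BeautifulSoup(nested_html, 'html.parser')
top_list = nested_doc.find('ul', class_='top-list')

# Direct children only versus every li nested anywhere inside the list.
direct_children = top_list.find_all('li', recursive=False)
all_descendants = top_list.find_all('li')
print(len(direct_children), len(all_descendants))   # 3 5

# The CSS child combinator (>) expresses the same "direct children only" rule.
via_selector = nested_doc.select('ul.top-list > li')
print(len(via_selector))                             # 3
```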

Keep in mind that you don't need to remember all (or any) of the methods or properties offered by Beautiful Soup documents and tags. You should be able to figure out what you need to do, when you need to do it. Here's how:

* Look up the documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
* Google what you're trying to do: https://www.google.co.in/search?q=beautiful+soup+get+href
* Ask a question on StackOverflow: https://stackoverflow.com/questions/tagged/beautifulsoup

