# Introduction to Web Scraping and REST APIs 

This tutorial is a part of the [Zero to Data Analyst Bootcamp by Jovian](https://www.jovian.ai/data-analyst-bootcamp)

![](https://i.imgur.com/6zM7JBq.png)


Web scraping is the process of extracting and parsing data from websites in an automated fashion using a computer program. It's a useful technique for creating datasets for research and learning. While web scraping often involves parsing and processing [HTML documents](https://developer.mozilla.org/en-US/docs/Web/HTML), some platforms offer REST APIs to retrieve information in a machine-readable format like JSON. In this tutorial, we'll use web scraping and REST APIs to create a real-world dataset.


The following topics are covered in this tutorial:

* Downloading web pages using the requests library
* Inspecting the HTML source code of a web page
* Parsing parts of a website using Beautiful Soup
* Writing parsed information into CSV files
* Using a REST API to retrieve data as JSON
* Combining data from multiple sources
* Using links on a page to crawl a website


### How to Run the Code

The best way to learn the material is to execute the code and experiment with it yourself. This tutorial is an executable [Jupyter notebook](https://jupyter.org). You can _run_ this tutorial and experiment with the code examples in a couple of ways: *using free online resources* (recommended) or *on your computer*.

#### Option 1: Running using free online resources (1-click, recommended)

The easiest way to start executing the code is to click the **Run** button at the top of this page and select **Run on Binder**. You can also select "Run on Colab" or "Run on Kaggle", but you'll need to create an account on [Google Colab](https://colab.research.google.com) or [Kaggle](https://kaggle.com) to use these platforms.


#### Option 2: Running on your computer locally

To run the code on your computer locally, you'll need to set up [Python](https://www.python.org), download the notebook and install the required libraries. We recommend using the [Conda](https://docs.conda.io/projects/conda/en/latest/user-guide/install/) distribution of Python. Click the **Run** button at the top of this page, select the **Run Locally** option, and follow the instructions.





## Problem 

Over the course of this tutorial, we'll solve the following problem to learn the tools and techniques used for web scraping:


> **QUESTION**: Write a Python function that creates a CSV file (comma-separated values) containing details about the 25 top GitHub repositories for any given topic. The top repositories for the topic `machine-learning` can be found on this page: [https://github.com/topics/machine-learning](https://github.com/topics/machine-learning). The output CSV should contain these details: repository name, owner's username, no. of stars, repository URL. 


 <a href="https://github.com/topics/machine-learning"><img src="https://i.imgur.com/5V1HGLs.png" width="480" style="box-shadow:rgba(52, 64, 77, 0.2) 0px 1px 5px 0px;border-radius:4px;"></a>
 
 
How would you go about solving this problem in Python? Explore the web page and take a couple of minutes to come up with an approach before proceeding further. How many lines of code do you think the solution will require?

## Downloading a web page using `requests`

When you access a URL like https://github.com/topics/machine-learning using a web browser, it downloads the contents of the web page the URL points to and displays them on the screen. Before we can extract information from a web page using Python, we need to download the page.

We'll use a library called [`requests`](https://docs.python-requests.org/en/master/) to download web pages from the internet. Let's begin by installing and importing the library.

In [1]:
# Install the library
!pip install requests --upgrade --quiet

In [2]:
# Import the library
import requests

We can download a web page using the `requests.get` function.

In [3]:
topic_url = 'https://github.com/topics/machine-learning'

In [4]:
response = requests.get(topic_url)

In [5]:
type(response)

requests.models.Response

`requests.get` returns a response object, which contains the contents of the web page and some information indicating whether the request was successful, using a status code. Learn more about HTTP status codes here: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status. 

 If the request was successful, `response.status_code` is set to a value between 200 and 299. 

In [6]:
response.status_code

200
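
To make the 200 to 299 rule concrete, here's a minimal sketch that classifies a few status codes using only the standard library. The helper name `is_success` is hypothetical (it is not part of `requests`), and the codes in the loop are just sample values.

```python
from http import HTTPStatus

def is_success(status_code):
    """Return True for status codes in the 2xx (success) range."""
    return 200 <= status_code <= 299

for code in (200, 204, 404, 500):
    # HTTPStatus(code).phrase is the standard reason phrase, e.g. "Not Found"
    print(code, HTTPStatus(code).phrase, is_success(code))
```

With a real response, you could call `is_success(response.status_code)`. `requests` also provides `response.raise_for_status()`, which raises an exception for 4xx and 5xx responses.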

The contents of the web page can be accessed using the `.text` property of the `response`. 

In [7]:
page_contents = response.text

Let's view the first 1000 characters of the web page.

In [8]:
page_contents[:1000]

'\n\n<!DOCTYPE html>\n<html lang="en" >\n  <head>\n    <meta charset="utf-8">\n  <link rel="dns-prefetch" href="https://github.githubassets.com">\n  <link rel="dns-prefetch" href="https://avatars.githubusercontent.com">\n  <link rel="dns-prefetch" href="https://github-cloud.s3.amazonaws.com">\n  <link rel="dns-prefetch" href="https://user-images.githubusercontent.com/">\n\n\n\n  <link crossorigin="anonymous" media="all" integrity="sha512-uGiH6wbEDXS0vWuvN3hZbENUuT1jRMWy2XVfJIgd3mEESUBtD/hnFdIiujVyRcPJ5dofwZ6e196xmCczSkgz9g==" rel="stylesheet" href="https://github.githubassets.com/assets/frameworks-b86887eb06c40d74b4bd6baf3778596c.css" />\n  <link crossorigin="anonymous" media="all" integrity="sha512-gEUpuli94xYShC0AAbAVQoQqxAoVyNDUWuD3x6Hsvwm8f1L7gbiu4bEM1HDLEkRz4ofHAvdAdmeqaUtzBCy6xg==" rel="stylesheet" href="https://github.githubassets.com/assets/site-804529ba58bde31612842d0001b01542.css" />\n    <link crossorigin="anonymous" media="all" integrity="sha512-8rXKu7ZOFdS3H7Rk0wJ38WQFoEp6

What you're seeing above is the *source code* of the web page, written in a language called [HTML](https://developer.mozilla.org/en-US/docs/Web/HTML). It defines the content and structure of the web page. 

Let's save the contents to a file with the `.html` extension.

In [9]:
with open('machine-learning.html', 'w') as file:
    file.write(page_contents)

You can now view the file using the "File > Open" menu option within Jupyter and clicking on *machine-learning.html* in the list of files displayed. Here's what you'll see when you open the file:

<img src="https://i.imgur.com/8gEbT1P.png" width="480" style="box-shadow:rgba(52, 64, 77, 0.2) 0px 1px 5px 0px;border-radius:4px;">

While this looks similar to the original web page, note that it's simply a copy. You will notice that none of the links or buttons actually work. To view or edit the source code of the file, click "File > Open" within Jupyter, then select the file *machine-learning.html* from the list and click the "Edit" button.

<img src="https://i.imgur.com/JG7Q8CK.png" width="480" style="box-shadow:rgba(52, 64, 77, 0.2) 0px 1px 5px 0px;border-radius:4px;">

As you might expect, the source code looks something like this:

<img src="https://i.imgur.com/6ynXNdz.png" width="480" style="box-shadow:rgba(52, 64, 77, 0.2) 0px 1px 5px 0px;border-radius:4px;">

Try scrolling through the source code. Can you make sense of it? Can you see how the information on the page is organized within the file? We'll learn more about it in the next section.

> **EXERCISE**: Download the web page for a different topic e.g. https://github.com/topics/data-analysis using `requests` and save it to a file e.g. `data-analysis.html`. View the page and compare it with the previously downloaded page. How are the two different? Can you spot the differences in the source code?

Let's save our work using `jovian` before continuing.

In [10]:
!pip install jovian --upgrade --quiet

In [11]:
import jovian

In [12]:
jovian.commit(project='python-web-scraping-and-rest-api')

<IPython.core.display.Javascript object>

[jovian] Attempting to save notebook..
[jovian] Updating notebook "aakashns/python-web-scraping-and-rest-api" on https://jovian.ai/
[jovian] Uploading notebook..
[jovian] Capturing environment..
[jovian] Committed successfully! https://jovian.ai/aakashns/python-web-scraping-and-rest-api


'https://jovian.ai/aakashns/python-web-scraping-and-rest-api'

## Inspecting the HTML source code of a web page

![](https://i.imgur.com/mvBpQIP.png)

As mentioned earlier, web pages are written in a language called HTML (HyperText Markup Language). HTML is a fairly simple language composed of *tags* (also called *nodes* or *elements*) e.g. `<a href="https://jovian.ai" target="_blank">Go to Jovian</a>`. An HTML tag has three parts:

1. **Name**: E.g. `html`, `head`, `body`, `div` etc. The name indicates what the tag represents and how a browser should interpret the information inside it.
2. **Attributes**: E.g. `href`, `target`, `class`, `id` etc. These are used to set properties for a tag, and the browser uses them to customize how a tag is displayed and what happens when a user interacts with it.
3. **Children**: A tag can contain some text or other tags or both between the opening and closing segments e.g. `<div>Some content</div>`.
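
To see these three parts programmatically, here's a minimal sketch that uses Python's built-in `html.parser` module (not the Beautiful Soup library used later in this tutorial) to pull apart the example link tag. The class name `TagPartsCollector` is hypothetical, and attaching text to the most recently opened tag is a simplification that only suits small, flat snippets like this one.

```python
from html.parser import HTMLParser

class TagPartsCollector(HTMLParser):
    """Collects the name, attributes and text of each tag it encounters."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        self.parts.append({'name': tag, 'attributes': dict(attrs), 'text': ''})

    def handle_data(self, data):
        # Simplification: attach text to the most recently opened tag
        if self.parts:
            self.parts[-1]['text'] += data

parser = TagPartsCollector()
parser.feed('<a href="https://jovian.ai" target="_blank">Go to Jovian</a>')
parser.close()

print(parser.parts)
```

The printed dictionary has the tag's name (`a`), its attributes (`href` and `target`) and its child text (`Go to Jovian`).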


### Inside an HTML Document

Here's a simple HTML document that uses many commonly used tags:

```html
<html>
  <head>
    <title>All About Python</title>
  </head>
  <body>
    <div style="width: 640px; margin: 40px auto">
      <h1 style="text-align:center;">Python - A Programming Language</h1>
      <img src="https://www.python.org/static/community_logos/python-logo-master-v3-TM.png" alt="python-logo" style="width:240px;margin:0 auto;display:block;">
      <div>
        <h2>About Python</h2>
        <p>
          Python is an <span style="font-style: italic">interpreted, high-level and general-purpose</span> programming language. Python's design philosophy emphasizes code readability with its notable use of significant indentation. Its language constructs and object-oriented approach aim to help programmers write clear, logical code for small and large-scale projects. Visit the <a href="https://docs.python.org/3/">official documentation</a> to learn more.
        </p>
      </div>
      <div>
        <h2>Some Python Libraries</h2>
        <ul id="libraries">
          <li>Numpy</li>
          <li>Pandas</li>
          <li>PyTorch</li>
          <li>Scikit Learn</li>
        </ul>
      </div>
      <div>
        <h2>Recent Python Versions</h2>
        <table id="versions-table">
          <tr>
            <th class="bordered-table">Version</th>
            <th class="bordered-table">Released on</th>
          </tr>
          <tr>
            <td class="bordered-table">Python 3.8</td>
            <td class="bordered-table">October 2019</td>
          </tr>
          <tr>
            <td class="bordered-table">Python 3.7</td>
            <td class="bordered-table">June 2018</td>
          </tr>
        </table>
          <style>
              .bordered-table { 
                  border: 1px solid black; padding: 8px;
              }
          </style>
      </div>
    </div>
  </body>
</html>

```

> **EXERCISE**: Copy the above HTML code and paste it into a new file called `webpage.html`. To create a new file, first select "File > Open" from the menu bar, then select "New > Text" file. View the saved file. Can you see how the different tags are displayed in different ways by the browser?


<img src="https://i.imgur.com/lcSHz5V.png" width="480" style="box-shadow:rgba(52, 64, 77, 0.2) 0px 1px 5px 0px;border-radius:4px;">

> **EXERCISE**: Make some changes to the code inside `webpage.html`. Save the file and view it again. Do you see your changes reflected? Play with the structure of the file. Try to break things and fix them!

### Common Tags and Attributes

Following are some of the most commonly used HTML tags:

* `html`
* `head`
* `title`
* `body`
* `div`
* `span`
* `h1` to `h6`
* `p`
* `img`
* `ul`, `ol` and `li`
* `table`, `tr`, `th` and `td`
* `style`
* ...

Each tag supports several attributes. Following are some common attributes used to modify the behavior of tags:

* `id`
* `style`
* `class`
* `href` (used with `<a>`)
* `src` (used with `<img>`)




> **EXERCISE**: Complete this tutorial on HTML: https://www.htmldog.com/guides/html/ . Once done, describe what the above tags and attributes are used for. Try creating a new HTML page using the tags you find most interesting.



### Inspecting HTML in the Browser

You can view the source code of any web page right within your browser by right-clicking anywhere on the page and selecting the "Inspect" option. It opens the "Developer Tools" pane, where you can see the source code as a tree. You can expand and collapse various nodes and find the source code for a specific portion of the page.

Here's what it looks like on the Chrome browser:


<img src="https://i.imgur.com/jCA1T6Z.png" width="640" style="box-shadow:rgba(52, 64, 77, 0.2) 0px 1px 5px 0px;border-radius:4px;">


> **EXERCISE**: Explore the source code of the web page https://github.com/topics/machine-learning . Try to find the portions in the source code corresponding to the username, repository name and number of stars for each repository listed on the page.

Let's save our work before continuing.

In [13]:
jovian.commit()

<IPython.core.display.Javascript object>

[jovian] Attempting to save notebook..
[jovian] Updating notebook "aakashns/python-web-scraping-and-rest-api" on https://jovian.ai/
[jovian] Uploading notebook..
[jovian] Capturing environment..
[jovian] Committed successfully! https://jovian.ai/aakashns/python-web-scraping-and-rest-api


'https://jovian.ai/aakashns/python-web-scraping-and-rest-api'

## Extracting information from HTML using Beautiful Soup

To extract information from the HTML source code of a web page programmatically, we can use the [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) library. Let's install the library and import the `BeautifulSoup` class from the `bs4` module.

In [14]:
# Install the library
!pip install beautifulsoup4 --upgrade --quiet

In [15]:
# Import the library
from bs4 import BeautifulSoup

In [16]:
?BeautifulSoup

Next, let's read the contents of the file `machine-learning.html` and create a `BeautifulSoup` object to parse the content.

In [17]:
with open('machine-learning.html', 'r') as f:
    html_source = f.read()

In [18]:
html_source[:1000]

'\n\n<!DOCTYPE html>\n<html lang="en" >\n  <head>\n    <meta charset="utf-8">\n  <link rel="dns-prefetch" href="https://github.githubassets.com">\n  <link rel="dns-prefetch" href="https://avatars.githubusercontent.com">\n  <link rel="dns-prefetch" href="https://github-cloud.s3.amazonaws.com">\n  <link rel="dns-prefetch" href="https://user-images.githubusercontent.com/">\n\n\n\n  <link crossorigin="anonymous" media="all" integrity="sha512-uGiH6wbEDXS0vWuvN3hZbENUuT1jRMWy2XVfJIgd3mEESUBtD/hnFdIiujVyRcPJ5dofwZ6e196xmCczSkgz9g==" rel="stylesheet" href="https://github.githubassets.com/assets/frameworks-b86887eb06c40d74b4bd6baf3778596c.css" />\n  <link crossorigin="anonymous" media="all" integrity="sha512-gEUpuli94xYShC0AAbAVQoQqxAoVyNDUWuD3x6Hsvwm8f1L7gbiu4bEM1HDLEkRz4ofHAvdAdmeqaUtzBCy6xg==" rel="stylesheet" href="https://github.githubassets.com/assets/site-804529ba58bde31612842d0001b01542.css" />\n    <link crossorigin="anonymous" media="all" integrity="sha512-8rXKu7ZOFdS3H7Rk0wJ38WQFoEp6

In [19]:
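# Name the parser explicitly so the result doesn't depend on which parsers are installed
doc = BeautifulSoup(html_source, 'html.parser')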

The `doc` object contains several useful properties and methods to extract information from the HTML document. Let's look at a few examples below.

**NOTE**: You don't need to remember all (or any) of the properties/methods, you can look up [the documentation of BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) or [just search online](https://www.google.co.in/search?q=beautifulsoup+how+to+get+href+of+link) to find exactly what you need.

### Accessing a tag

> **QUESTION**: Find the title of the page represented by `doc`.

The title of the page is contained within the `<title>` tag. We can access the title tag using `doc.title`.

In [20]:
title_tag = doc.title

In [21]:
title_tag

<title>machine-learning · GitHub Topics · GitHub</title>

In [22]:
type(title_tag)

bs4.element.Tag

We can access a tag's name using the `.name` property.

In [23]:
title_tag.name

'title'

The text within a tag can be accessed using `.text`.

In [24]:
title_tag.text

'machine-learning · GitHub Topics · GitHub'

> **EXERCISE**: Explore the `html`, `body` and `head` tags of `doc`. Do you see what you expect to see?

If a tag occurs more than once in a document e.g. `<a>` (which represents links), then `doc.a` finds the first `<a>` tag.

In [25]:
first_link = doc.a

In [26]:
first_link

<a class="px-2 py-4 color-bg-info-inverse color-text-white show-on-focus js-skip-to-content" href="#start-of-content">Skip to content</a>

In [27]:
first_link.text

'Skip to content'

> **EXERCISE**: Find the first occurrence of each of these tags in `doc`: `div`, `img`, `span`, `p` etc.

### Finding all tags of the same type

To find all the occurrences of a tag, use the `find_all` method.

> **QUESTION**: Find all the link tags on the page. How many links does the page contain?

In [28]:
all_link_tags = doc.find_all('a')

In [29]:
len(all_link_tags)

599

In [30]:
all_link_tags[:3]

[<a class="px-2 py-4 color-bg-info-inverse color-text-white show-on-focus js-skip-to-content" href="#start-of-content">Skip to content</a>,
 <a aria-label="Homepage" class="mr-4" data-ga-click="(Logged out) Header, go to homepage, icon:logo-wordmark" href="https://github.com/">
 <svg aria-hidden="true" class="octicon octicon-mark-github color-text-white" height="32" version="1.1" viewbox="0 0 16 16" width="32"><path d="M8 0C3.58 0 0 3.58 0 8c0 3.54 2.29 6.53 5.47 7.59.4.07.55-.17.55-.38 0-.19-.01-.82-.01-1.49-2.01.37-2.53-.49-2.69-.94-.09-.23-.48-.94-.82-1.13-.28-.15-.68-.52-.01-.53.63-.01 1.08.58 1.23.82.72 1.21 1.87.87 2.33.66.07-.52.28-.87.51-1.07-1.78-.2-3.64-.89-3.64-3.95 0-.87.31-1.59.82-2.15-.08-.2-.36-1.02.08-2.12 0 0 .67-.21 2.2.82.64-.18 1.32-.27 2-.27.68 0 1.36.09 2 .27 1.53-1.04 2.2-.82 2.2-.82.44 1.1.16 1.92.08 2.12.51.56.82 1.27.82 2.15 0 3.07-1.87 3.75-3.65 3.95.29.25.54.73.54 1.48 0 1.07-.01 1.93-.01 2.2 0 .21.15.46.55.38A8.013 8.013 0 0016 8c0-4.42-3.58-8-8-8z" fill-ru

> **EXERCISE**: Get a list of all the `img` tags on the page. How many images does the page contain?

### Accessing attributes

The attributes of a tag can be accessed using the indexing notation e.g. `first_link['href']`

In [31]:
first_link

<a class="px-2 py-4 color-bg-info-inverse color-text-white show-on-focus js-skip-to-content" href="#start-of-content">Skip to content</a>

In [32]:
first_link['href']

'#start-of-content'

In [33]:
first_link['class']

['px-2',
 'py-4',
 'color-bg-info-inverse',
 'color-text-white',
 'show-on-focus',
 'js-skip-to-content']

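Note that the `class` attribute is automatically split into a list of classes. Beautiful Soup does this only for a small set of multi-valued attributes such as `class` and `rel`; most attributes, like `href`, are returned as plain strings. The list form is convenient because it's common practice to check for a specific class within a tag.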

You can use the `.attrs` property to view all the attributes as a dictionary.

In [34]:
first_link.attrs

{'href': '#start-of-content',
 'class': ['px-2',
  'py-4',
  'color-bg-info-inverse',
  'color-text-white',
  'show-on-focus',
  'js-skip-to-content']}
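
Here's a minimal sketch of two common patterns when working with attributes: checking for a class, and reading an attribute that may be missing. The HTML snippet is a made-up stand-in (not taken from the GitHub page), and `'html.parser'` selects Python's built-in parser.

```python
from bs4 import BeautifulSoup

# A tiny stand-in document (hypothetical, not from the GitHub page)
snippet = '<a class="btn btn-primary" href="/login">Sign in</a>'
link = BeautifulSoup(snippet, 'html.parser').a

# `class` is a list, so a membership check is straightforward
print('btn-primary' in link['class'])

# Indexing a missing attribute raises KeyError; .get returns None instead
print(link.get('target'))
```

Using `.get` is handy when you loop over many tags and only some of them carry the attribute you're after.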

> **EXERCISE**: Find the 5th image tag on the page (counting from 0). Which attributes does the tag contain? Find the values of the `src` and `alt` attributes of the tag.

### Searching by Attribute Value

> **QUESTION**: Find the `img` tag(s) on the page with the `alt` attribute set to `tsbertalan`.

We can provide a dictionary of attributes as the second argument to `find_all`.

In [35]:
doc.find_all('img', { 'alt': 'tsbertalan'})

[<img alt="tsbertalan" class="avatar avatar-user avatar-small" height="32" src="https://avatars.githubusercontent.com/u/306137?v=4" width="32"/>]

If we're just interested in the first element, we can use the `find` method. Keep in mind that `find` returns `None` if no matching tag was found.

In [36]:
doc.find('img', { 'alt': 'tsbertalan'})

<img alt="tsbertalan" class="avatar avatar-user avatar-small" height="32" src="https://avatars.githubusercontent.com/u/306137?v=4" width="32"/>
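
Since `find` returns `None` when nothing matches, it's worth checking the result before reading attributes from it. The sketch below shows one way to do that; the helper name `image_src` and the sample HTML are hypothetical.

```python
from bs4 import BeautifulSoup

# A tiny stand-in document (hypothetical)
snippet = '<div><img alt="octocat" src="/octocat.png"></div>'
small_doc = BeautifulSoup(snippet, 'html.parser')

def image_src(doc, alt_text):
    """Return the src of the first img with the given alt text, or None."""
    img_tag = doc.find('img', {'alt': alt_text})
    if img_tag is None:
        # No matching tag: return None rather than failing on img_tag['src']
        return None
    return img_tag.get('src')

print(image_src(small_doc, 'octocat'))
print(image_src(small_doc, 'nobody'))
```

Without the `None` check, the second call would fail with a `TypeError`, because `None` can't be indexed.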

> **EXERCISE**: Find the `src` attribute of the first `img` tag with the `alt` attribute set to `julia`. Visit the link and check what the image represents.

### Searching by Class

The `class` attribute is one of the most frequently used attributes on HTML tags (because it is used for layout and styling). We can search for tags containing a class using the `class_` argument in `find_all` (note that `class` is a reserved keyword in Python, hence the underscore in the argument name).

> **QUESTION**: Find all the tags containing the class `HeaderMenu-link`. 

In [37]:
matching_tags = doc.find_all(class_='HeaderMenu-link')

In [38]:
matching_tags

[<summary class="HeaderMenu-summary HeaderMenu-link px-0 py-3 border-0 no-wrap d-block d-lg-inline-block">
                     Why GitHub?
                     <svg class="icon-chevon-down-mktg position-absolute position-lg-relative" fill="none" viewbox="0 0 14 8" x="0px" xml:space="preserve" y="0px">
 <path d="M1,1l6.2,6L13,1"></path>
 </svg>
 </summary>,
 <a class="HeaderMenu-link no-underline py-3 d-block d-lg-inline-block" data-ga-click="(Logged out) Header, go to Team" href="/team">Team</a>,
 <a class="HeaderMenu-link no-underline py-3 d-block d-lg-inline-block" data-ga-click="(Logged out) Header, go to Enterprise" href="/enterprise">Enterprise</a>,
 <summary class="HeaderMenu-summary HeaderMenu-link px-0 py-3 border-0 no-wrap d-block d-lg-inline-block">
                     Explore
                     <svg class="icon-chevon-down-mktg position-absolute position-lg-relative" fill="none" viewbox="0 0 14 8" x="0px" xml:space="preserve" y="0px">
 <path d="M1,1l6.2,6L13,1"></path>
 

We can also search for a specific type of tag e.g. `<a>` matching the given class.

In [39]:
header_link_tags = doc.find_all('a', class_='HeaderMenu-link')

In [40]:
header_link_tags

[<a class="HeaderMenu-link no-underline py-3 d-block d-lg-inline-block" data-ga-click="(Logged out) Header, go to Team" href="/team">Team</a>,
 <a class="HeaderMenu-link no-underline py-3 d-block d-lg-inline-block" data-ga-click="(Logged out) Header, go to Enterprise" href="/enterprise">Enterprise</a>,
 <a class="HeaderMenu-link no-underline py-3 d-block d-lg-inline-block" data-ga-click="(Logged out) Header, go to Marketplace" href="/marketplace">Marketplace</a>,
 <a class="HeaderMenu-link flex-shrink-0 no-underline mr-3" data-ga-click="(Logged out) Header, clicked Sign in, text:sign-in" data-hydro-click='{"event_type":"authentication.click","payload":{"location_in_page":"site header menu","repository_id":null,"auth_type":"SIGN_UP","originating_url":"https://github.com/topics/machine-learning","user_id":null}}' data-hydro-click-hmac="77fa0805c2ee5e083e5dbe5571a2e37d8661eca3a115806e97d54914b8332ea9" href="/login?return_to=%2Ftopics%2Fmachine-learning">
           Sign in
         </a>,


### Parsing Information from Tags

Once we have a list of tags matching the given criteria, it's easy to extract information and convert it to a more convenient format.

> **QUESTION**: Find the link text and URL of all the links within the header of the page contained in `doc`.

We'll create a list of dictionaries containing the required information.

In [41]:
header_links = []
base_url = 'https://github.com'

for tag in header_link_tags:
    header_links.append({ 'title': tag.text.strip(), 'url': base_url + tag['href']})
    
header_links

[{'title': 'Team', 'url': 'https://github.com/team'},
 {'title': 'Enterprise', 'url': 'https://github.com/enterprise'},
 {'title': 'Marketplace', 'url': 'https://github.com/marketplace'},
 {'title': 'Sign in',
  'url': 'https://github.com/login?return_to=%2Ftopics%2Fmachine-learning'},
 {'title': 'Sign up',
  'url': 'https://github.com/join?ref_cta=Sign+up&ref_loc=header+logged+out&ref_page=%2Ftopics%2Fmachine-learning&source=header'},
 {'title': 'Sign up',
  'url': 'https://github.com/join_next?ref_cta=Sign+up&ref_loc=header+logged+out&ref_page=%2Ftopics%2Fmachine-learning&source=header'}]

Thus, we have successfully extracted some links from the page. This is exactly what scraping is: downloading a web page, parsing the HTML and extracting some useful information.
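
Joining the base URL and `href` with `+` works here because every header link is a relative path starting with `/`. If a page mixes relative and absolute links, the standard library's `urllib.parse.urljoin` handles both. Here's a minimal sketch; the `hrefs` list contains made-up sample values.

```python
from urllib.parse import urljoin

base_url = 'https://github.com'

# Hypothetical href values: one relative path and one absolute URL
hrefs = ['/team', 'https://docs.github.com/en']

for href in hrefs:
    # Relative paths are resolved against base_url; absolute URLs are kept as-is
    print(urljoin(base_url, href))
```

Plain concatenation would turn the second value into `https://github.comhttps://docs.github.com/en`, which is not a valid link.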

> **EXERCISE**: Find the list of all the images matching the class `avatar-user`. Each element of the list should be a dictionary containing two keys, `"username"` and `"url"`. You can obtain the username using the `alt` attribute of a tag and the URL using the `src` attribute of a tag.

### Elements inside a tag

> **QUESTION**: Find the tags contained within the `ul` tag in the sample HTML document below.


In [42]:
sample_html = """
<html>
    <body>
        <ul>
            <li>Item 1</li>
            <li>Item 2</li>
            <li>Item 3</li>
        </ul>
    </body>
</html>"""

In [43]:
sample_doc = BeautifulSoup(sample_html, 'html.parser')

In [44]:
list_tag = sample_doc.find('ul')

We can use the `find_all` method on the tag, and set `recursive=False` to find just the direct children.

In [45]:
list_item_tags = list_tag.find_all('li', recursive=False)

In [46]:
list_item_tags

[<li>Item 1</li>, <li>Item 2</li>, <li>Item 3</li>]
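
The effect of `recursive=False` is easier to see with a nested list, where the default search also returns the inner items. Here's a minimal sketch using a made-up document:

```python
from bs4 import BeautifulSoup

# Hypothetical sample: a list with another list nested inside its second item
nested_html = """
<ul id="outer">
    <li>Item 1</li>
    <li>Item 2
        <ul>
            <li>Item 2a</li>
        </ul>
    </li>
</ul>"""

outer_list = BeautifulSoup(nested_html, 'html.parser').find('ul')

all_items = outer_list.find_all('li')                      # searches all descendants
direct_items = outer_list.find_all('li', recursive=False)  # direct children only

print(len(all_items), len(direct_items))
```

The default search finds three `li` tags (including the nested `Item 2a`), while the non-recursive search finds only the two direct children of the outer list.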

Keep in mind that you don't need to remember all (or any) of the methods or properties offered by Beautiful Soup documents and tags. You can look up the documentation, search online for what you're trying to do, or ask a question on StackOverflow.

You should be able to figure out what you need to do, when you need to do it.

Let's save our work before continuing.

In [47]:
jovian.commit()

<IPython.core.display.Javascript object>

[jovian] Attempting to save notebook..
[jovian] Updating notebook "aakashns/python-web-scraping-and-rest-api" on https://jovian.ai/
[jovian] Uploading notebook..
[jovian] Capturing environment..
[jovian] Committed successfully! https://jovian.ai/aakashns/python-web-scraping-and-rest-api


'https://jovian.ai/aakashns/python-web-scraping-and-rest-api'

### Top Repositories for a Topic

Let's return to our original problem statement of finding the top repositories for a given topic. Before we parse a page and find the top repositories, let's define a helper function to get the contents of the web page for any topic.

> **QUESTION**: Define a function `get_topic_page` which downloads the GitHub web page for a given topic and returns a beautiful soup document representing the page.

In [48]:
def get_topic_page(topic):
    # Construct the URL
    topic_repos_url = 'https://github.com/topics/' + topic
    
    # Get the HTML page content using requests
    response = requests.get(topic_repos_url)
    
    # Ensure that the response is valid
    if response.status_code != 200:
        print('Status code:', response.status_code)
        raise Exception('Failed to fetch web page ' + topic_repos_url)
    
    # Construct a beautiful soup document
    doc = BeautifulSoup(response.text, 'html.parser')
    
    return doc

In [49]:
doc = get_topic_page('machine-learning')

In [50]:
doc.title.text

'machine-learning · GitHub Topics · GitHub'

Getting the topic page for another topic is now as simple as invoking the function with a different argument.

In [51]:
doc2 = get_topic_page('data-analysis')

In [52]:
doc2.title.text

'data-analysis · GitHub Topics · GitHub'

> **QUESTION**: Come up with a strategy to find the repository name, owner's username, no. of stars and repository link for the repositories listed on a topic page.

<img src="https://i.imgur.com/szL76cU.png" width="640" style="box-shadow:rgba(52, 64, 77, 0.2) 0px 1px 5px 0px;border-radius:4px;">

Upon inspecting the box containing the information for a repository, you will find that each repository is represented by an `article` tag with class attribute `border rounded color-shadow-small color-bg-secondary my-4`.

Let's find all the `article` tags matching this class.


In [53]:
article_tags = doc.find_all('article', class_='border rounded color-shadow-small color-bg-secondary my-4')

In [54]:
len(article_tags)

30

There are 30 repositories listed on the page, and our query resulted in 30 article tags. Looks like we've found the enclosing tag for each repository. We can verify this by looking inside one of the tags.

In [55]:
article_tag = article_tags[4]

In [56]:
# Uncomment to view
# article_tag
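
One thing to keep in mind about the class-based lookup above: when `class_` is given a string containing several classes, Beautiful Soup compares it with the exact value of the `class` attribute, so the order of the classes matters. A CSS selector via `select` matches tags carrying all the listed classes in any order. Here's a minimal sketch on made-up markup (the shortened class list and the `Repo A`/`Repo B` text are assumptions for illustration):

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the same classes, listed in a different order
cards_html = """
<article class="border rounded my-4">Repo A</article>
<article class="my-4 rounded border">Repo B</article>
"""
cards_doc = BeautifulSoup(cards_html, 'html.parser')

# A multi-class string is matched against the exact attribute value
exact = cards_doc.find_all('article', class_='border rounded my-4')

# A CSS selector matches tags that carry all the classes, in any order
by_selector = cards_doc.select('article.border.rounded.my-4')

print(len(exact), len(by_selector))
```

The exact-string search finds only the first `article`, while the selector finds both. If a site reorders its classes, the selector form may therefore be the more robust choice.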

We need to extract the following information from each tag:

1. Repository name
2. Owner's username
3. Number of stars
4. Repository link

If you look at the source of an article tag, you will notice that the repository name, owner's username and the repository link are all part of an `h1` tag.

Let's retrieve the first `h1` inside an article.

In [57]:
h1_tag = article_tag.find('h1')
h1_tag

<h1 class="f3 color-text-secondary text-normal lh-condensed">
<a data-ga-click="Explore, go to repository owner, location:explore feed" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"OWNER","click_visual_representation":"REPOSITORY_OWNER_HEADING","actor_id":null,"record_id":10386605,"originating_url":"https://github.com/topics/machine-learning","user_id":null}}' data-hydro-click-hmac="1bae99446056a085802d3401cc23faad491e91eb64e36a9fbba9ac02ba2bb434" href="/aymericdamien">
            aymericdamien
</a>          /
          <a class="text-bold" data-ga-click="Explore, go to repository, location:explore feed" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"REPOSITORY","click_visual_representation":"REPOSITORY_NAME_HEADING","actor_id":null,"record_id":45986162,"originating_url":"https://github.com/topics/machine-learning","user_id":null}}' data-hydro-click-hmac="fef3f

The `h1` has two `a` tags inside it, the first containing the owner's username and the second containing the repository name. The `href` of the second tag also contains the relative path of the repository. Let's extract this information from the `a` tags.

In [58]:
a_tags = h1_tag.find_all('a', recursive=False)

In [59]:
username = a_tags[0].text
username

'\n            aymericdamien\n'

Looks like the username contains some leading and trailing whitespace. We can get rid of it using `strip`.

In [60]:
username = a_tags[0].text.strip()
username

'aymericdamien'

We can get the repository name and repository path in the same fashion.

In [61]:
repo_name = a_tags[1].text.strip()
repo_name

'TensorFlow-Examples'

In [62]:
repo_path = a_tags[1]['href'].strip()
repo_path

'/aymericdamien/TensorFlow-Examples'

To get the full URL to the repository, we can append the base URL `https://github.com` at the beginning of the path.

In [63]:
base_url = 'https://github.com'
repo_url = base_url + repo_path 
repo_url

'https://github.com/aymericdamien/TensorFlow-Examples'


Next, to get the number of stars, we notice that it is contained within an `a` tag which has the class `social-count float-none`.


In [64]:
a_star_tag = article_tags[4].find('a', class_='social-count float-none')

In [65]:
a_star_tag

<a class="social-count float-none" data-ga-click="Explore, go to repository stargazers, location:explore feed" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"STARGAZERS","click_visual_representation":"STARGAZERS_NUMBER","actor_id":null,"record_id":45986162,"originating_url":"https://github.com/topics/machine-learning","user_id":null}}' data-hydro-click-hmac="a46079c0b03997f1f8a4f276ba3457dbf1123db76023ef8d1a4d701ec64b17da" href="/aymericdamien/TensorFlow-Examples/stargazers">
          40.3k
</a>

Let's extract the star count from the `a` tag.

In [66]:
a_star_tag.text.strip()

'40.3k'

The `k` at the end indicates `1000`. Let's write a helper function which can convert strings like `40.3k` into the number `40,300`.

In [67]:
def parse_star_count(stars_str):
    stars_str = stars_str.strip()
    if stars_str[-1] == 'k':
        return int(float(stars_str[:-1]) * 1000)
    else:
        return int(stars_str)

In [68]:
parse_star_count('40.3k')

40300

In [69]:
parse_star_count('991')

991
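
The helper above covers the two formats seen on this page. As a minimal sketch, assuming counts might also show up with an `m` suffix for millions or with thousands separators (neither appears in the output above), a slightly more defensive variant could look like this. The name `parse_count` is hypothetical and not used elsewhere in the tutorial.

```python
def parse_count(count_str):
    """Convert strings such as '40.3k', '1.2m' or '1,234' to an int."""
    # Hypothetical helper: the 'm' suffix and comma handling are assumptions
    multipliers = {'k': 1_000, 'm': 1_000_000}
    text = count_str.strip().lower().replace(',', '')
    if text and text[-1] in multipliers:
        # round() avoids truncating results that land just below a whole number
        return round(float(text[:-1]) * multipliers[text[-1]])
    return int(text)

print(parse_count('40.3k'))
print(parse_count('1.2m'))
print(parse_count('1,234'))
```

Using `round` instead of `int` matters because multiplying a float can produce a value marginally below the intended whole number, which `int` would truncate.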

We can now determine the star count as a number.

In [70]:
star_count = parse_star_count(a_star_tag.text.strip())

In [71]:
star_count

40300

Perfect, we've extracted all the information we were interested in.

In [72]:
print('Repository name:', repo_name)
print("Owner's username:", username)
print('Stars:', star_count)
print('Repository URL:', repo_url)

Repository name: TensorFlow-Examples
Owner's username: aymericdamien
Stars: 40300
Repository URL: https://github.com/aymericdamien/TensorFlow-Examples


Let's extract the logic for parsing the required information from an article tag into a function.

> **QUESTION**: Write a function `parse_repository` which returns a dictionary containing the repository name, owner's username, number of stars and repository URL by parsing a given `article` tag representing a repository.

In [73]:
def parse_repository(article_tag):
    # <a> tags containing username, repository name and URL
    a_tags = article_tag.h1.find_all('a')
    # Owner's username
    username = a_tags[0].text.strip()
    # Repository name
    repo_name = a_tags[1].text.strip()
    # Repository URL
    repo_url = base_url + a_tags[1]['href'].strip()
    # Star count
    stars_tag = article_tag.find('a', class_='social-count float-none')
    star_count = parse_star_count(stars_tag.text.strip())
    # Return a dictionary
    return {
        'repository_name': repo_name,
        'owner_username': username,        
        'stars': star_count,
        'repository_url': repo_url
    }

We can now use the function to parse any `article` tag.

In [74]:
parse_repository(article_tags[0])

{'repository_name': 'tensorflow',
 'owner_username': 'tensorflow',
 'stars': 155000,
 'repository_url': 'https://github.com/tensorflow/tensorflow'}

In [75]:
parse_repository(article_tags[10])

{'repository_name': 'caffe',
 'owner_username': 'BVLC',
 'stars': 31500,
 'repository_url': 'https://github.com/BVLC/caffe'}

We can even use a list comprehension to parse all the `article` tags in one go.

In [76]:
top_repositories = [parse_repository(tag) for tag in article_tags]

In [77]:
len(top_repositories)

30

In [78]:
top_repositories[:5]

[{'repository_name': 'tensorflow',
  'owner_username': 'tensorflow',
  'stars': 155000,
  'repository_url': 'https://github.com/tensorflow/tensorflow'},
 {'repository_name': 'keras',
  'owner_username': 'keras-team',
  'stars': 51000,
  'repository_url': 'https://github.com/keras-team/keras'},
 {'repository_name': 'pytorch',
  'owner_username': 'pytorch',
  'stars': 47300,
  'repository_url': 'https://github.com/pytorch/pytorch'},
 {'repository_name': 'scikit-learn',
  'owner_username': 'scikit-learn',
  'stars': 45100,
  'repository_url': 'https://github.com/scikit-learn/scikit-learn'},
 {'repository_name': 'TensorFlow-Examples',
  'owner_username': 'aymericdamien',
  'stars': 40300,
  'repository_url': 'https://github.com/aymericdamien/TensorFlow-Examples'}]



> **QUESTION**: Write a function that takes a `BeautifulSoup` object representing a topic page and returns a list of dictionaries containing information about the top repositories for the topic.


In [79]:
def get_top_repositories(doc):
    article_tags = doc.find_all('article', class_='border rounded color-shadow-small color-bg-secondary my-4')
    topic_repos = [parse_repository(tag) for tag in article_tags]
    return topic_repos

We can now use the functions we've defined to get the top repositories for any topic.

In [80]:
topic_page_ml = get_topic_page('machine-learning')
top_repos_ml = get_top_repositories(topic_page_ml)
top_repos_ml[:5]

[{'repository_name': 'tensorflow',
  'owner_username': 'tensorflow',
  'stars': 155000,
  'repository_url': 'https://github.com/tensorflow/tensorflow'},
 {'repository_name': 'keras',
  'owner_username': 'keras-team',
  'stars': 51000,
  'repository_url': 'https://github.com/keras-team/keras'},
 {'repository_name': 'pytorch',
  'owner_username': 'pytorch',
  'stars': 47300,
  'repository_url': 'https://github.com/pytorch/pytorch'},
 {'repository_name': 'scikit-learn',
  'owner_username': 'scikit-learn',
  'stars': 45100,
  'repository_url': 'https://github.com/scikit-learn/scikit-learn'},
 {'repository_name': 'TensorFlow-Examples',
  'owner_username': 'aymericdamien',
  'stars': 40300,
  'repository_url': 'https://github.com/aymericdamien/TensorFlow-Examples'}]

Here are the top repositories for the keyword `data-analysis`.

In [81]:
topic_page_da = get_topic_page('data-analysis')
top_repos_da = get_top_repositories(topic_page_da)
top_repos_da[:5]

[{'repository_name': 'scikit-learn',
  'owner_username': 'scikit-learn',
  'stars': 45100,
  'repository_url': 'https://github.com/scikit-learn/scikit-learn'},
 {'repository_name': 'superset',
  'owner_username': 'apache',
  'stars': 36300,
  'repository_url': 'https://github.com/apache/superset'},
 {'repository_name': 'pandas',
  'owner_username': 'pandas-dev',
  'stars': 29200,
  'repository_url': 'https://github.com/pandas-dev/pandas'},
 {'repository_name': 'metabase',
  'owner_username': 'metabase',
  'stars': 24400,
  'repository_url': 'https://github.com/metabase/metabase'},
 {'repository_name': 'streamlit',
  'owner_username': 'streamlit',
  'stars': 14000,
  'repository_url': 'https://github.com/streamlit/streamlit'}]

Here are the top repositories for the keyword `python`.

In [82]:
get_top_repositories(get_topic_page('python'))[:5]

[{'repository_name': 'tensorflow',
  'owner_username': 'tensorflow',
  'stars': 155000,
  'repository_url': 'https://github.com/tensorflow/tensorflow'},
 {'repository_name': 'system-design-primer',
  'owner_username': 'donnemartin',
  'stars': 125000,
  'repository_url': 'https://github.com/donnemartin/system-design-primer'},
 {'repository_name': 'CS-Notes',
  'owner_username': 'CyC2018',
  'stars': 125000,
  'repository_url': 'https://github.com/CyC2018/CS-Notes'},
 {'repository_name': 'Python',
  'owner_username': 'TheAlgorithms',
  'stars': 102000,
  'repository_url': 'https://github.com/TheAlgorithms/Python'},
 {'repository_name': 'awesome-python',
  'owner_username': 'vinta',
  'stars': 95400,
  'repository_url': 'https://github.com/vinta/awesome-python'}]

Do you see the power of defining functions and using libraries? With just one line of code, we can scrape GitHub and find the top repositories for any topic.
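
Some repositories appear under more than one topic (for example, `tensorflow` shows up for both `machine-learning` and `python` above). The following is a minimal sketch of how lists from several topics could be combined without duplicates. The function name `merge_repositories` is hypothetical, and the sample lists copy a few rows from the outputs above so that the code runs without network access.

```python
def merge_repositories(*repo_lists):
    """Combine lists of repository dictionaries, dropping duplicate URLs."""
    unique = {}
    for repos in repo_lists:
        for repo in repos:
            # Keep only the first occurrence of each repository URL
            unique.setdefault(repo['repository_url'], repo)
    # Sort by star count, highest first
    return sorted(unique.values(), key=lambda repo: repo['stars'], reverse=True)

# A few rows copied from the outputs above
ml_sample = [
    {'repository_name': 'tensorflow', 'owner_username': 'tensorflow', 'stars': 155000,
     'repository_url': 'https://github.com/tensorflow/tensorflow'},
    {'repository_name': 'keras', 'owner_username': 'keras-team', 'stars': 51000,
     'repository_url': 'https://github.com/keras-team/keras'},
]
python_sample = [
    {'repository_name': 'tensorflow', 'owner_username': 'tensorflow', 'stars': 155000,
     'repository_url': 'https://github.com/tensorflow/tensorflow'},
    {'repository_name': 'system-design-primer', 'owner_username': 'donnemartin', 'stars': 125000,
     'repository_url': 'https://github.com/donnemartin/system-design-primer'},
]

combined = merge_repositories(ml_sample, python_sample)
for repo in combined:
    print(repo['repository_name'], repo['stars'])
```

In the notebook, the same function could be called with `top_repos_ml` and `top_repos_da` instead of the sample lists.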

Let's save our work before continuing.

In [83]:
jovian.commit()

<IPython.core.display.Javascript object>

[jovian] Attempting to save notebook..
[jovian] Updating notebook "aakashns/python-web-scraping-and-rest-api" on https://jovian.ai/
[jovian] Uploading notebook..
[jovian] Capturing environment..
[jovian] Committed successfully! https://jovian.ai/aakashns/python-web-scraping-and-rest-api


'https://jovian.ai/aakashns/python-web-scraping-and-rest-api'

## Writing information to CSV files

Let's create a helper function which takes a list of dictionaries and writes them to a CSV file.

The input to our function will be a list of dictionaries of the form:

```
[
  {'key1': 'abc', 'key2': 'def', 'key3': 'ghi'},
  {'key1': 'jkl', 'key2': 'mno', 'key3': 'pqr'},
  {'key1': 'stu', 'key2': 'vwx', 'key3': 'yza'}
  ...
]
```

The function will create a file with a given name containing the following data:

```
key1,key2,key3
abc,def,ghi
jkl,mno,pqr
stu,vwx,yza

```

In [84]:
def write_csv(items, path):
    # Open the file in write mode
    with open(path, 'w') as f:
        # Return if there's nothing to write
        if len(items) == 0:
            return
        
        # Write the headers in the first line
        headers = list(items[0].keys())
        f.write(','.join(headers) + '\n')
        
        # Write one item per line
        for item in items:
            values = []
            for header in headers:
                values.append(str(item.get(header, "")))
            f.write(','.join(values) + "\n")
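
The helper above joins values with commas directly, so a value that itself contains a comma or a line break would spill into extra columns or rows. As a hedged alternative, here is a sketch using `csv.DictWriter` from the standard library, which quotes such values automatically. The name `write_csv_with_module`, the sample row and the file name are made up for illustration. Unlike `write_csv`, this version does not create a file at all when the list is empty.

```python
import csv
import os
import tempfile

def write_csv_with_module(items, path):
    """Write a list of dictionaries to a CSV file using csv.DictWriter."""
    # Return early so that no file is created for an empty list
    if not items:
        return
    headers = list(items[0].keys())
    # newline='' lets the csv module control line endings itself
    with open(path, 'w', newline='', encoding='utf-8') as f:
        # restval fills in missing keys, extrasaction ignores unexpected ones
        writer = csv.DictWriter(f, fieldnames=headers, restval='', extrasaction='ignore')
        writer.writeheader()
        writer.writerows(items)

# Made-up sample row whose description contains a comma
sample = [{'name': 'demo', 'description': 'fast, simple and small', 'stars': 12}]
path = os.path.join(tempfile.gettempdir(), 'sample-repos.csv')
write_csv_with_module(sample, path)

with open(path, newline='', encoding='utf-8') as f:
    print(f.read())
```

The description is written inside double quotes, so a CSV reader still sees three columns. The repository data in this tutorial contains no commas, which is why the simpler helper is sufficient here.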

Let's write the data stored in `top_repos_ml` into a CSV file.

In [85]:
len(top_repos_ml)

30

In [86]:
top_repos_ml[:3]

[{'repository_name': 'tensorflow',
  'owner_username': 'tensorflow',
  'stars': 155000,
  'repository_url': 'https://github.com/tensorflow/tensorflow'},
 {'repository_name': 'keras',
  'owner_username': 'keras-team',
  'stars': 51000,
  'repository_url': 'https://github.com/keras-team/keras'},
 {'repository_name': 'pytorch',
  'owner_username': 'pytorch',
  'stars': 47300,
  'repository_url': 'https://github.com/pytorch/pytorch'}]

In [87]:
write_csv(top_repos_ml, 'machine-learning.csv')

We can now read the file and inspect its contents. The contents of the file can also be inspected using the "File > Open" menu option within Jupyter.

In [88]:
with open('machine-learning.csv', 'r') as f:
    print(f.read())

repository_name,owner_username,stars,repository_url
tensorflow,tensorflow,155000,https://github.com/tensorflow/tensorflow
keras,keras-team,51000,https://github.com/keras-team/keras
pytorch,pytorch,47300,https://github.com/pytorch/pytorch
scikit-learn,scikit-learn,45100,https://github.com/scikit-learn/scikit-learn
TensorFlow-Examples,aymericdamien,40300,https://github.com/aymericdamien/TensorFlow-Examples
tesseract,tesseract-ocr,39400,https://github.com/tesseract-ocr/tesseract
face_recognition,ageitgey,39200,https://github.com/ageitgey/face_recognition
faceswap,deepfakes,34800,https://github.com/deepfakes/faceswap
julia,JuliaLang,33100,https://github.com/JuliaLang/julia
100-Days-Of-ML-Code,Avik-Jain,31800,https://github.com/Avik-Jain/100-Days-Of-ML-Code
caffe,BVLC,31500,https://github.com/BVLC/caffe
awesome-scalability,binhnguyennus,29700,https://github.com/binhnguyennus/awesome-scalability
madewithml,GokuMohandas,25600,https://github.com/GokuMohandas/madewithml
machine-learning-for-sof

Perfect! We've created a CSV containing the information about the top GitHub repositories for the topic `machine-learning`. We can now put together everything we've done so far to solve the original problem.

> **QUESTION**: Write a Python function that creates a CSV file (comma-separated values) containing details about the 25 top GitHub repositories for any given topic. The top repositories for the topic `machine-learning` can be found on this page: [https://github.com/topics/machine-learning](https://github.com/topics/machine-learning). The output CSV should contain these details: repository name, owner's username, no. of stars, repository URL. 



In [97]:
import requests
from bs4 import BeautifulSoup
base_url = 'https://github.com'

def scrape_topic_repositories(topic, path=None):
    if path is None:
        path = topic + '.csv'
    topic_page_doc = get_topic_page(topic)
    topic_repositories = get_top_repositories(topic_page_doc)
    write_csv(topic_repositories, path)
    print('Top repositories for topic "{}" written to file "{}"'.format(topic, path))
    return path

def get_top_repositories(doc):
    article_tags = doc.find_all('article', class_='border rounded color-shadow-small color-bg-secondary my-4')
    topic_repos = [parse_repository(tag) for tag in article_tags]
    return topic_repos

def get_topic_page(topic):
    topic_repos_url = 'https://github.com/topics/' + topic
    response = requests.get(topic_repos_url)
    if response.status_code != 200:
        print('Status code:', response.status_code)
        raise Exception('Failed to fetch web page ' + topic_repos_url)
    return BeautifulSoup(response.text, 'html.parser')

def parse_repository(article_tag):
    a_tags = article_tag.h1.find_all('a')
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    repo_url = base_url + a_tags[1]['href'].strip()
    stars_tag = article_tag.find('a', class_='social-count float-none')
    star_count = parse_star_count(stars_tag.text.strip())
    return {'repository_name': repo_name, 'owner_username': username, 'stars': star_count, 'repository_url': repo_url}

def parse_star_count(stars_str):
    stars_str = stars_str.strip()
    return int(float(stars_str[:-1]) * 1000) if stars_str[-1] == 'k' else int(stars_str)

def write_csv(items, path):
    with open(path, 'w') as f:
        if len(items) == 0:
            return
        headers = list(items[0].keys())
        f.write(','.join(headers) + '\n')
        for item in items:
            values = []
            for header in headers:
                values.append(str(item.get(header, "")))
            f.write(','.join(values) + "\n")

The entire code for this problem is only about 50 lines long. Isn't that neat?

In [98]:
scrape_topic_repositories('machine-learning')

Top repositories for topic "machine-learning" written to file "machine-learning.csv"


'machine-learning.csv'

Now that we have a CSV file, we can use the `pandas` library to view its contents.

In [99]:
import pandas as pd

In [100]:
pd.read_csv('machine-learning.csv')

|  | repository_name | owner_username | stars | repository_url |
|---|---|---|---|---|
| 0 | tensorflow | tensorflow | 155000 | https://github.com/tensorflow/tensorflow |
| 1 | keras | keras-team | 51000 | https://github.com/keras-team/keras |
| 2 | pytorch | pytorch | 47300 | https://github.com/pytorch/pytorch |
| 3 | scikit-learn | scikit-learn | 45100 | https://github.com/scikit-learn/scikit-learn |
| 4 | TensorFlow-Examples | aymericdamien | 40300 | https://github.com/aymericdamien/TensorFlow-Exa... |
| 5 | tesseract | tesseract-ocr | 39400 | https://github.com/tesseract-ocr/tesseract |
| 6 | face_recognition | ageitgey | 39200 | https://github.com/ageitgey/face_recognition |
| 7 | faceswap | deepfakes | 34800 | https://github.com/deepfakes/faceswap |
| 8 | julia | JuliaLang | 33100 | https://github.com/JuliaLang/julia |
| 9 | 100-Days-Of-ML-Code | Avik-Jain | 31800 | https://github.com/Avik-Jain/100-Days-Of-ML-Code |


In [101]:
scrape_topic_repositories('data-analysis')

Top repositories for topic "data-analysis" written to file "data-analysis.csv"


'data-analysis.csv'

In [102]:
pd.read_csv('data-analysis.csv')

|  | repository_name | owner_username | stars | repository_url |
|---|---|---|---|---|
| 0 | scikit-learn | scikit-learn | 45100 | https://github.com/scikit-learn/scikit-learn |
| 1 | superset | apache | 36300 | https://github.com/apache/superset |
| 2 | pandas | pandas-dev | 29200 | https://github.com/pandas-dev/pandas |
| 3 | metabase | metabase | 24400 | https://github.com/metabase/metabase |
| 4 | streamlit | streamlit | 14000 | https://github.com/streamlit/streamlit |
| 5 | goaccess | allinurl | 13000 | https://github.com/allinurl/goaccess |
| 6 | CyberChef | gchq | 11600 | https://github.com/gchq/CyberChef |
| 7 | AI-Expert-Roadmap | AMAI-GmbH | 9700 | https://github.com/AMAI-GmbH/AI-Expert-Roadmap |
| 8 | OpenRefine | OpenRefine | 8000 | https://github.com/OpenRefine/OpenRefine |
| 9 | mlcourse.ai | Yorko | 7500 | https://github.com/Yorko/mlcourse.ai |


In [103]:
scrape_topic_repositories('python')

Top repositories for topic "python" written to file "python.csv"


'python.csv'

In [104]:
pd.read_csv('python.csv')

|  | repository_name | owner_username | stars | repository_url |
|---|---|---|---|---|
| 0 | tensorflow | tensorflow | 155000 | https://github.com/tensorflow/tensorflow |
| 1 | system-design-primer | donnemartin | 125000 | https://github.com/donnemartin/system-design-pr... |
| 2 | CS-Notes | CyC2018 | 125000 | https://github.com/CyC2018/CS-Notes |
| 3 | Python | TheAlgorithms | 102000 | https://github.com/TheAlgorithms/Python |
| 4 | awesome-python | vinta | 95400 | https://github.com/vinta/awesome-python |
| 5 | free-programming-books-zh_CN | justjavac | 78200 | https://github.com/justjavac/free-programming-b... |
| 6 | thefuck | nvbn | 59600 | https://github.com/nvbn/thefuck |
| 7 | django | django | 56500 | https://github.com/django/django |
| 8 | flask | pallets | 54400 | https://github.com/pallets/flask |
| 9 | keras | keras-team | 51000 | https://github.com/keras-team/keras |


Of course, we can go even further and write a function that scrapes the top repositories for several topics at once.

> **EXERCISE**: Write a function `scrape_topics` which takes a list of topics and creates CSV files containing top repositories for all the topics. Test it out using the empty cells below.
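
If you'd like a starting point, here is one possible shape for such a function, offered as a sketch rather than a reference solution. It takes the per-topic scraping function as a parameter (in the notebook, you would pass `scrape_topic_repositories`), pauses between topics, and keeps going if one topic fails. The `scrape_fn` and `delay_seconds` parameters and the `fake_scrape` stand-in are assumptions made so the example runs offline.

```python
import time

def scrape_topics(topics, scrape_fn, delay_seconds=1.0):
    """Call scrape_fn for each topic and return a dict of topic -> CSV path."""
    paths = {}
    for index, topic in enumerate(topics):
        if index > 0:
            # Pause between topics to avoid sending requests in a burst
            time.sleep(delay_seconds)
        try:
            paths[topic] = scrape_fn(topic)
        except Exception as error:
            # Record the failure and carry on with the remaining topics
            print('Skipping topic "{}": {}'.format(topic, error))
            paths[topic] = None
    return paths

# Stand-in for scrape_topic_repositories so the sketch runs without a network
def fake_scrape(topic):
    if not topic:
        raise ValueError('empty topic')
    return topic + '.csv'

print(scrape_topics(['machine-learning', '', 'python'], fake_scrape, delay_seconds=0))
```

Returning a dictionary makes it easy to see afterwards which topics produced a file and which were skipped.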

Let's save our work before continuing.

In [None]:
jovian.commit()

<IPython.core.display.Javascript object>

[jovian] Attempting to save notebook..


## Using a REST API to retrieve data as JSON

Not all URLs point to HTML pages. In fact, many websites offer a REST API to access data in the JSON format. Some examples:

* GitHub
* Twitter
* Facebook
* LinkedIn
* Slack

In case you're wondering what the acronyms mean, here are the expansions (you needn't remember or care about any of them):
- REST - "Representational State Transfer"
- API - "Application Programming Interface"
- JSON - "JavaScript Object Notation"
- URL - "Uniform Resource Locator"
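
To make the JSON format concrete, here is a small sketch using Python's built-in `json` module. The response text is made up for illustration (it is not actual output from any API); it only shows how JSON objects, arrays, numbers and booleans map to Python dictionaries, lists, numbers and `True`/`False`.

```python
import json

# Made-up response body, shaped like the JSON a REST API might return
response_text = '''
{
  "name": "TensorFlow-Examples",
  "stars": 40300,
  "topics": ["machine-learning", "python"],
  "archived": false
}
'''

# json.loads converts JSON text into Python objects
data = json.loads(response_text)
print(type(data))
print(data['name'], data['stars'])
print(data['topics'][0], data['archived'])

# json.dumps converts Python objects back into JSON text
print(json.dumps(data, indent=2))
```

Because the parsed result is an ordinary dictionary, no HTML parsing is needed to pick out individual values.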




> **QUESTION**: For the above problem, augment the repository information with some more details: description, watcher count, fork count, open issues count, created at time and updated at time.
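
As a hedged starting point for this question, the sketch below builds a repository URL for GitHub's REST API and merges selected fields from the response into a dictionary produced by `parse_repository`. The endpoint layout (`/repos/{owner}/{repo}`) and the response keys are assumptions based on GitHub's REST API and should be checked against its documentation. The payload is made up so the code runs offline, and the names `repo_api_url`, `DETAIL_FIELDS` and `add_repo_details` are hypothetical.

```python
import json

API_BASE_URL = 'https://api.github.com'

def repo_api_url(username, repo_name):
    """Build the REST API URL for a repository (assumed endpoint layout)."""
    return '{}/repos/{}/{}'.format(API_BASE_URL, username, repo_name)

# Output key -> assumed key in the API response
DETAIL_FIELDS = {
    'description': 'description',
    'watchers': 'subscribers_count',
    'forks': 'forks_count',
    'open_issues': 'open_issues_count',
    'created_at': 'created_at',
    'updated_at': 'updated_at',
}

def add_repo_details(repo, api_data):
    """Return a copy of repo extended with selected fields from api_data."""
    # Missing keys become None instead of raising an error
    details = {key: api_data.get(api_key) for key, api_key in DETAIL_FIELDS.items()}
    return {**repo, **details}

repo = {
    'repository_name': 'TensorFlow-Examples',
    'owner_username': 'aymericdamien',
    'stars': 40300,
    'repository_url': 'https://github.com/aymericdamien/TensorFlow-Examples',
}

# Made-up payload; a real one would come from requests.get(url).json()
sample_json = '''
{
  "description": "Made-up description for illustration",
  "subscribers_count": 5,
  "forks_count": 10,
  "open_issues_count": 2,
  "created_at": "2020-01-01T00:00:00Z",
  "updated_at": "2021-01-01T00:00:00Z"
}
'''

print(repo_api_url(repo['owner_username'], repo['repository_name']))
augmented = add_repo_details(repo, json.loads(sample_json))
print(augmented)
```

Since the result is still a flat dictionary, a list of such dictionaries can be passed to `write_csv` unchanged, provided the description contains no commas.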

## Crawling Websites by Parsing Links on a Page

> **EXERCISE**: Get the top 100 repositories for all the featured topics on GitHub. Additional repositories for a topic are on numbered pages (e.g., the second page is https://github.com/topics/machine-learning?page=2), and the list of featured topics is paginated the same way (e.g., https://github.com/topics/?page=8). Create a different CSV file for each topic.
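
A first step for this exercise is generating the page URLs to download. The sketch below builds them with `urllib.parse.urlencode`, assuming that every page follows the `?page=N` pattern shown in the exercise and that each page lists 30 repositories, as in the run above. Both assumptions should be checked against the live site, and the function names are hypothetical.

```python
import math
from urllib.parse import urlencode

TOPICS_URL = 'https://github.com/topics'

def topic_page_url(topic, page=1):
    """URL of one page of repositories for a topic (assumed ?page=N pattern)."""
    return '{}/{}?{}'.format(TOPICS_URL, topic, urlencode({'page': page}))

def topic_page_urls(topic, num_pages):
    """URLs of the first num_pages pages for a topic."""
    return [topic_page_url(topic, page) for page in range(1, num_pages + 1)]

# Pages needed for 100 repositories, assuming 30 repositories per page
pages_needed = math.ceil(100 / 30)
print(pages_needed)

for url in topic_page_urls('machine-learning', pages_needed):
    print(url)
```

Each URL can then be downloaded and parsed with the functions defined earlier, keeping only the first 100 results per topic.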

## Summary and Further Reading


Some things to keep in mind with web scraping:

* Most websites disallow web scraping for commercial purposes
* Use web scraping only for learning and research purposes
* Review the terms and conditions of a website before scraping
* Prefer removing personally identifiable information before publishing a dataset online
* Use official REST APIs wherever available, with proper keys
* Scraping data that you see after logging in is harder to do (it requires special cookies and headers)
* Websites may change their structure, in which case your code may no longer work


Here are some more examples of scraping:

* https://medium.com/@msalmon00/web-scraping-job-postings-from-indeed-96bd588dcb4b
* https://medium.com/the-innovation/scraping-medium-with-python-beautiful-soup-3314f898bbf5
* https://medium.com/brainstation23/how-to-become-a-pro-with-scraping-youtube-videos-in-3-minutes-a6ac56021961
* https://www.freecodecamp.org/news/web-scraping-python-tutorial-how-to-scrape-data-from-a-website/
* https://www.freecodecamp.org/news/scraping-wikipedia-articles-with-python/
* https://towardsdatascience.com/web-scraping-yahoo-finance-477fe3daa852

In [None]:
jovian.commit()