# The Requests Library

Now that we know how to use BeautifulSoup to get data from HTML files, let's see how we can scrape data from a real website. Unfortunately, BeautifulSoup can't access websites directly. Therefore, to access websites, we will use the `requests` library for Python. The `requests` library allows us to send web requests and get a website's HTML data. Once the `requests` library gets us the HTML data, we can use BeautifulSoup, just as we did before, to extract the data we want. So let's see an example.

In the code below we will use the `requests` library and BeautifulSoup to get the `Primary facilities operated by Tesla` data from an HTML table on the following Wikipedia [webpage](https://en.wikipedia.org/wiki/Tesla,_Inc.). This table lists Tesla's main facilities, including the year each one opened, its location, its number of employees, and its products. We will start by importing the `requests` library by using:

```python
import requests
```

We will then use the `requests.get(url)` function to get the source code of our Wikipedia page. The `requests.get()` function returns a `Response` object that we will save in the variable `r`. We can get the HTML data we need from this object through its `.text` attribute, as shown below. Finally, we'll convert the extracted HTML table into a Pandas DataFrame and display it.

In [1]:
from bs4 import BeautifulSoup
import requests

import pandas as pd
import numpy as np

In [4]:
# Create a Response object
r = requests.get('https://en.wikipedia.org/wiki/Tesla,_Inc.')

# Get HTML data
html_data = r.text



The `.text` attribute holds a string; therefore, `html_data` is a string containing the HTML data from our website. Notice that, since `html_data` is a string, it can be passed to the BeautifulSoup constructor. We will do this next, but for now, let's print the `html_data` string to see what it looks like:

In [8]:
# Print the HTML data
print(html_data)

<!DOCTYPE html>
<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-enabled vector-feature-main-menu-pinned-disabled vector-feature-limited-width-enabled vector-feature-limited-width-content-enabled vector-feature-zebra-design-disabled" lang="en" dir="ltr">
<head>
<meta charset="UTF-8">
<title>Tesla, Inc. - Wikipedia</title>
<script>document.documentElement.className="client-js vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-enabled vector-feature-main-menu-pinned-disabled vector-feature-limited-width-enabled vector-feature-limited-width-content-enabled vector-feature-zebra-design-disabled";(function(){var cookie=document.cookie.match(/(?:^|; )enwikimwclientprefs=(

As we can see, `html_data` indeed contains the HTML data of our website. Notice that, since we are dealing with a real website this time, the HTML is very long, so only the beginning is shown above.
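
Before parsing, it can be worth confirming that the request actually succeeded. The sketch below shows one possible way to wrap the download step. The helper name `fetch_html`, the `User-Agent` string, and the `get` parameter (present only so the function can be tried without a network connection) are assumptions made for illustration and are not part of the lesson's code.

```python
import requests


def fetch_html(url, timeout=10, get=requests.get):
    """Download a page and return its HTML as a string.

    `get` defaults to requests.get; it is a parameter only so the helper
    can be exercised with a stand-in function (an assumption of this sketch).
    """
    # Hypothetical User-Agent value that identifies the script to the server
    headers = {"User-Agent": "requests-tutorial-example/0.1"}
    # timeout keeps the call from waiting forever on an unresponsive server
    response = get(url, headers=headers, timeout=timeout)
    # Raises requests.HTTPError if the server answered with a 4xx or 5xx status
    response.raise_for_status()
    return response.text
```

Calling `fetch_html('https://en.wikipedia.org/wiki/Tesla,_Inc.')` should return the same kind of string as `r.text`, or raise an exception if the server reports an error.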

Now that we have the HTML data from our website, we are ready to use BeautifulSoup just as we did before. The only difference is that this time, instead of passing an open filehandle to the BeautifulSoup constructor, we will pass the `html_data` string. So let's pass `html_data` to the BeautifulSoup constructor to get a BeautifulSoup object:

In [9]:
# Create a BeautifulSoup Object
page_content = BeautifulSoup(html_data, 'html.parser')

# Print the BeautifulSoup Object
print(page_content.prettify())

<!DOCTYPE html>
<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-enabled vector-feature-main-menu-pinned-disabled vector-feature-limited-width-enabled vector-feature-limited-width-content-enabled vector-feature-zebra-design-disabled" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   Tesla, Inc. - Wikipedia
  </title>
  <script>
   document.documentElement.className="client-js vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-enabled vector-feature-main-menu-pinned-disabled vector-feature-limited-width-enabled vector-feature-limited-width-content-enabled vector-feature-zebra-design-disabled";(function(){var cookie=document.cookie.match(/(?:^|; )en

Now that we have a BeautifulSoup object, we can get our facilities data. To do this, we need to know which tags contain our table data. To figure this out, we need to inspect our webpage using a web browser. For this example we will use the Chrome web browser, but other major web browsers offer the same kind of functionality. We begin by going to our Wikipedia page: https://en.wikipedia.org/wiki/Tesla,_Inc.

Next, we hover our mouse over the target HTML table. As an example, we will hover over the table header row. We then right-click on one of the header titles (for example, `Opened`) and a dropdown menu will appear. From this menu we choose **Inspect**. Once we click on **Inspect**, we will see the HTML source code of our webpage appear on the right, as shown in the figure below:

<br>
<figure>
  <img src = "./wikitable.png" width = "100%" style = "border: thin silver solid; padding: 10px">
</figure> 
<br>

If we inspect the HTML source code in the right panel, we will see that the table header and rows are all contained within the same `<table>` tag. Therefore, we can use this information to extract the column titles from the table header and the detailed data from the rows. In the code below we use BeautifulSoup's `find()` method to find the first `<table>` tag that has the class `wikitable`.

A complete HTML table can be structured in the following way:
```
<table>
  <thead>
    <tr>
      <th>Month</th>
      <th>Sales</th>
    </tr>
  </thead>
  
  <tbody>
    <tr>
      <td>January</td>
      <td>$100</td>
    </tr>
    <tr>
      <td>February</td>
      <td>$80</td>
    </tr>
  </tbody>
  
  <tfoot>
    <tr>
      <td>Sum</td>
      <td>$180</td>
    </tr>
  </tfoot>
</table>
```
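
As a minimal sketch, the sample table above can be parsed with BeautifulSoup to see how the three sections are reached. The variable names (`sample_html`, `header`, `body`, `footer`) are chosen here for illustration only.

```python
from bs4 import BeautifulSoup

sample_html = """
<table>
  <thead>
    <tr><th>Month</th><th>Sales</th></tr>
  </thead>
  <tbody>
    <tr><td>January</td><td>$100</td></tr>
    <tr><td>February</td><td>$80</td></tr>
  </tbody>
  <tfoot>
    <tr><td>Sum</td><td>$180</td></tr>
  </tfoot>
</table>
"""

soup = BeautifulSoup(sample_html, "html.parser")
table = soup.find("table")

# Header cells live in <thead>, data rows in <tbody>, totals in <tfoot>
header = [th.get_text(strip=True) for th in table.thead.find_all("th")]
body = [
    [td.get_text(strip=True) for td in tr.find_all("td")]
    for tr in table.tbody.find_all("tr")
]
footer = [td.get_text(strip=True) for td in table.tfoot.find_all("td")]

print(header)  # ['Month', 'Sales']
print(body)    # [['January', '$100'], ['February', '$80']]
print(footer)  # ['Sum', '$180']
```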

But notice that the HTML table on our Wikipedia page does not have `<thead>` and `<tfoot>` tags. Let's find the table and store the resulting BeautifulSoup object in a variable named `wikitable`.

In [10]:
wikitable = page_content.find('table', {'class': 'wikitable'})
wikitable

<table class="wikitable sortable">
<caption>Primary facilities operated by Tesla
</caption>
<tbody><tr>
<th>Opened
</th>
<th>Name
</th>
<th>City
</th>
<th>Country
</th>
<th>Employees
</th>
<th>Products
</th>
<th><abbr title="References">Ref.</abbr>
</th></tr>
<tr>
<td>2010
</td>
<td><a href="/wiki/Tesla_Fremont_Factory" title="Tesla Fremont Factory">Tesla Fremont Factory</a>
</td>
<td><a href="/wiki/Fremont,_California" title="Fremont, California">Fremont, California</a>
</td>
<td>United States
</td>
<td>10,000
</td>
<td><a href="/wiki/Tesla_Model_S" title="Tesla Model S">Model S</a>, <a href="/wiki/Tesla_Model_X" title="Tesla Model X">Model X</a>, <a href="/wiki/Tesla_Model_3" title="Tesla Model 3">Model 3</a>, <a href="/wiki/Tesla_Model_Y" title="Tesla Model Y">Model Y</a>
</td>
<td><sup class="reference" id="cite_ref-Future_35-1"><a href="#cite_note-Future-35">[34]</a></sup><sup class="reference" id="cite_ref-TC_staff_2020_267-0"><a href="#cite_note-TC_staff_2020-267">[266]</a></sup

Now that we have extracted the `<table>` tag that has `class="wikitable"` as an object, we can use BeautifulSoup's built-in functions to extract the human-readable text inside this table. Let's grab the information inside the `<tbody>` tag.

In [11]:
wikitable.tbody

<tbody><tr>
<th>Opened
</th>
<th>Name
</th>
<th>City
</th>
<th>Country
</th>
<th>Employees
</th>
<th>Products
</th>
<th><abbr title="References">Ref.</abbr>
</th></tr>
<tr>
<td>2010
</td>
<td><a href="/wiki/Tesla_Fremont_Factory" title="Tesla Fremont Factory">Tesla Fremont Factory</a>
</td>
<td><a href="/wiki/Fremont,_California" title="Fremont, California">Fremont, California</a>
</td>
<td>United States
</td>
<td>10,000
</td>
<td><a href="/wiki/Tesla_Model_S" title="Tesla Model S">Model S</a>, <a href="/wiki/Tesla_Model_X" title="Tesla Model X">Model X</a>, <a href="/wiki/Tesla_Model_3" title="Tesla Model 3">Model 3</a>, <a href="/wiki/Tesla_Model_Y" title="Tesla Model Y">Model Y</a>
</td>
<td><sup class="reference" id="cite_ref-Future_35-1"><a href="#cite_note-Future-35">[34]</a></sup><sup class="reference" id="cite_ref-TC_staff_2020_267-0"><a href="#cite_note-TC_staff_2020-267">[266]</a></sup><sup class="reference" id="cite_ref-268"><a href="#cite_note-268">[267]</a></sup>
</td></tr

In the earlier cell, we used the `find()` method to get the HTML table; `find()` returns only the first match. Now we are going to use the `findAll()` method (an older alias of `find_all()`) to grab every matching tag inside a BeautifulSoup object. In this case, we want to extract every row in the `<tbody>` tag, and we can do so with `findAll()`.

In [12]:
wikitable.tbody.findAll('tr')

[<tr>
 <th>Opened
 </th>
 <th>Name
 </th>
 <th>City
 </th>
 <th>Country
 </th>
 <th>Employees
 </th>
 <th>Products
 </th>
 <th><abbr title="References">Ref.</abbr>
 </th></tr>, <tr>
 <td>2010
 </td>
 <td><a href="/wiki/Tesla_Fremont_Factory" title="Tesla Fremont Factory">Tesla Fremont Factory</a>
 </td>
 <td><a href="/wiki/Fremont,_California" title="Fremont, California">Fremont, California</a>
 </td>
 <td>United States
 </td>
 <td>10,000
 </td>
 <td><a href="/wiki/Tesla_Model_S" title="Tesla Model S">Model S</a>, <a href="/wiki/Tesla_Model_X" title="Tesla Model X">Model X</a>, <a href="/wiki/Tesla_Model_3" title="Tesla Model 3">Model 3</a>, <a href="/wiki/Tesla_Model_Y" title="Tesla Model Y">Model Y</a>
 </td>
 <td><sup class="reference" id="cite_ref-Future_35-1"><a href="#cite_note-Future-35">[34]</a></sup><sup class="reference" id="cite_ref-TC_staff_2020_267-0"><a href="#cite_note-TC_staff_2020-267">[266]</a></sup><sup class="reference" id="cite_ref-268"><a href="#cite_note-268">[26

As we can see from the HTML source code above, the first row contains the column names, and the subsequent rows contain the actual data. Let's grab the column names by taking the first row and collecting all the `<th>` tags inside it.

Notice the index `[0]` after `findAll('tr')`, which selects the first row. We can then chain a second `findAll()` call to get all the `<th>` tags inside this row.
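
To see the difference between the two methods in isolation, here is a small sketch on a made-up two-column table (the HTML string is an illustration, not the real page). It uses `find_all()`, the current spelling of `findAll()`.

```python
from bs4 import BeautifulSoup

html = """
<table class="wikitable">
  <tr><th>Opened</th><th>Name</th></tr>
  <tr><td>2010</td><td>Tesla Fremont Factory</td></tr>
  <tr><td>2016</td><td>Gigafactory Nevada</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

first_row = soup.find("tr")      # a single Tag (or None if nothing matches)
all_rows = soup.find_all("tr")   # a list-like ResultSet with every match

# Chaining: take the first row, then collect the <th> tags inside it
header_cells = all_rows[0].find_all("th")

print(len(all_rows))                           # 3
print(first_row == all_rows[0])                # True
print([th.get_text() for th in header_cells])  # ['Opened', 'Name']
```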

In [13]:
wikicolumns = wikitable.tbody.findAll('tr')[0].findAll('th')
wikicolumns

[<th>Opened
 </th>, <th>Name
 </th>, <th>City
 </th>, <th>Country
 </th>, <th>Employees
 </th>, <th>Products
 </th>, <th><abbr title="References">Ref.</abbr>
 </th>]

Let's store the column names in a Python list, called `df_columns`, so we can use it to build a Pandas DataFrame.

In [14]:
df_columns = []

for column in wikicolumns:
    # remove <br/> inside <th> text, such as `<th>Total<br/>production</th>`
    text = column.get_text(strip=True, separator=" ")
    # append the text into df_columns
    df_columns.append(text)
print(np.array(df_columns))

['Opened' 'Name' 'City' 'Country' 'Employees' 'Products' 'Ref.']
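
The effect of the `strip` and `separator` arguments is easiest to see on a header cell that contains a line break. The sketch below uses the `<th>Total<br/>production</th>` example mentioned in the code comment; the trailing newline is added here to mimic the cells on the real page.

```python
from bs4 import BeautifulSoup

cell = BeautifulSoup("<th>Total<br/>production\n</th>", "html.parser").th

# Default: text pieces are glued together and whitespace is kept
print(repr(cell.get_text()))                           # 'Totalproduction\n'
# strip=True trims each piece but still joins them with nothing in between
print(repr(cell.get_text(strip=True)))                 # 'Totalproduction'
# separator=" " puts a space between the pieces
print(repr(cell.get_text(strip=True, separator=" ")))  # 'Total production'
```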


Now that we have stored the column names, we want to iterate over the remaining rows, which contain the actual data in this table. We can use a Python `for` loop to iterate from the second row onward. To do so, we slice the list of rows starting at index 1: `wikitable.tbody.findAll('tr')[1:]`. Let's store our dataset in a Python list called `df_data`.

In [23]:
df_data = []

for row in wikitable.tbody.findAll('tr')[1:]:
    row_data = []
    for td in row.findAll('td'):
        text = td.get_text(strip=True, separator=" ")
        row_data.append(text)
    df_data.append(np.array(row_data))

# print the first 10 data rows
print(df_data[:10])


Great! We have grabbed the column names and data. But we want to present the data in a human-readable structure. For this we can use the [Pandas DataFrame](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html) class. A DataFrame is similar to a spreadsheet in Excel or Google Sheets, and it provides a convenient data structure for manipulating and presenting data in Python.

Let's create a Pandas DataFrame object, called `dataframe`, so we can present our data in a spreadsheet-like structure.

In [19]:
dataframe = pd.DataFrame(data=df_data, columns=df_columns)
dataframe

|   | Opened | Name | City | Country | Employees | Products | Ref. |
|---|--------|------|------|---------|-----------|----------|------|
| 0 | 2010 | Tesla Fremont Factory | Fremont, California | United States | 10000 | Model S , Model X , Model 3 , Model Y | [34] [266] [267] |
| 1 | 2016 | Gigafactory Nevada | Storey County, Nevada | United States | 7000 | Batteries, Powerwall , Powerpack , Megapack , ... | [268] [269] [270] |
| 2 | 2017 | Gigafactory New York | Buffalo, New York | United States | 1500 | Solar Roof , Superchargers | [271] [272] |
| 3 | 2019 | Gigafactory Shanghai | Shanghai | China | 15000 | Model 3, Model Y | [273] [274] |
| 4 | 2022 | Gigafactory Berlin-Brandenburg | Grünheide | Germany | 10000 | Model Y (planned: batteries, Model 3) | [275] [276] [277] |
| 5 | 2022 | Gigafactory Texas | Austin, Texas | United States | 12000 | Model Y, batteries (planned: Cybertruck , Mode... | [278] [279] [280] |
| 6 | 2022 | Megafactory | Lathrop, California | United States | 1000 | Megapack | [281] [282] |
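
Every value scraped this way is a string, so numeric columns usually need converting before they can be used in calculations. The following is a minimal sketch on two rows taken from the table; the exact text of each scraped cell (for example `'1,500'` with a thousands separator) is an assumption based on the `10,000` seen in the HTML.

```python
import pandas as pd

# Two example rows; 'Employees' is assumed to hold strings as scraped
df = pd.DataFrame(
    [["2010", "Tesla Fremont Factory", "10,000"],
     ["2017", "Gigafactory New York", "1,500"]],
    columns=["Opened", "Name", "Employees"],
)

# Remove the thousands separators, then convert the text to integers
df["Employees"] = df["Employees"].str.replace(",", "", regex=False).astype(int)
df["Opened"] = df["Opened"].astype(int)

print(df["Employees"].sum())  # 11500
```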


## Great job!

We have extracted the data on Tesla's facilities from an HTML table on a Wikipedia page and converted it into Python lists and a Pandas DataFrame. Now it's your turn to practice with the `requests` and `BeautifulSoup` libraries.

# TODO: Get Amazon financial data from Wikipage

URL links:
- Wikipage: https://en.wikipedia.org/wiki/Amazon_(company) <br/>
- Financial data: https://en.wikipedia.org/wiki/Amazon_(company)#Finances

Tasks:

1. Import the `BeautifulSoup` and `requests` libraries.
2. Use the `requests.get()` function with the page URL to get the website's HTML data.
3. Create a BeautifulSoup object named `page_content` using the website's HTML data and the `html.parser` parser.
4. Use the `find()` or `find_all()` method to locate the `<table>` tag that holds the financial data.
5. Get the table's column names.
6. Loop over the table rows to collect the financial data for each year, then display it as a DataFrame with `Year` as the index.

In [52]:
# Import libraries
from bs4 import BeautifulSoup
import requests
import pandas as pd
import numpy as np

# Create a Response object
r = requests.get('https://en.wikipedia.org/wiki/Amazon_(company)')

# Get HTML data
html_data = r.text

# Create a BeautifulSoup Object
page_content = BeautifulSoup(html_data, 'html.parser')

# Find the financial table.
# There are two tables with class "wikitable" on the page, so we use
# find_all(), which returns a list of every match, and take the second one.
# find() would return only the first match.
wikitables = page_content.find_all('table', class_='wikitable')
wikitable = wikitables[1]

# Find all column titles:
# take the first <tr> tag, then collect all of its <th> tags
th_list = wikitable.find_all('tr')[0].find_all('th')

# Loop over th_list and extract the text of each <th> tag
# to get the column names of the DataFrame
wikicolumns = []
for tag in th_list:
    text = tag.get_text(strip=True, separator=" ")
    wikicolumns.append(text)

# Loop through the data rows, starting from the second <tr> tag.
# For each row, collect the text of every <td> tag.
df_data = []
for tr in wikitable.find_all('tr')[1:]:
    td_list = []
    for tag in tr.find_all('td'):
        td_text = tag.get_text(strip=True, separator=" ")
        td_list.append(td_text)
    df_data.append(td_list)


# Print financial data in DataFrame format and set `Year` as index
dataframe = pd.DataFrame(df_data, columns = wikicolumns)
dataframe.set_index('Year', inplace = True)
dataframe

| Year | Revenue [154] in mil. US\$ | Net income in mil. US\$ | Total Assets in mil. US\$ | Employees |
|------|------|------|------|------|
| 1995 [155] | 0.5 | −0.3 | 1.1 | |
| 1996 [155] | 16.0 | −6 | 8.0 | |
| 1997 [155] | 148.0 | −28 | 149.0 | 614.0 |
| 1998 [156] | 610.0 | −124 | 648.0 | 2100.0 |
| 1999 [156] | 1639.0 | −720 | 2466.0 | 7600.0 |
| 2000 [156] | 2761.0 | −1,411 | 2135.0 | 9000.0 |
| 2001 [156] | 3122.0 | −567 | 1638.0 | 7800.0 |
| 2002 [156] | 3932.0 | −149 | 1990.0 | 7500.0 |
| 2003 [157] | 5263.0 | 35 | 2162.0 | 7800.0 |
| 2004 [157] | 6921.0 | 588 | 3248.0 | 9000.0 |
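
The scraped values still carry reference markers such as `[155]` and appear to use the Unicode minus sign, so they cannot be used as numbers directly. The helpers below are a hedged sketch of one way to clean them; the function names `clean_year` and `clean_number` are hypothetical and not part of the solution above.

```python
import re


def clean_year(text):
    """Drop a reference marker such as ' [155]' and return the year as an int."""
    return int(re.sub(r"\s*\[\d+\]", "", text))


def clean_number(text):
    """Convert scraped text such as '−1,411' to a float; '' becomes None."""
    if text == "":
        return None
    # float() rejects the Unicode minus sign (U+2212), so swap in an ASCII '-'
    normalized = text.replace("\u2212", "-").replace(",", "")
    return float(normalized)


print(clean_year("2000 [156]"))       # 2000
print(clean_number("\u22121,411"))    # -1411.0
print(clean_number("0.5"))            # 0.5
print(clean_number(""))               # None
```

These functions could then be applied to each cell before the DataFrame is built, or afterwards to a whole column with `Series.map()`.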


# XML

Throughout these lessons we have used HTML files and BeautifulSoup's `html.parser` to show you how to scrape data. We should note that the exact same techniques can be applied to XML files. The only difference is that you will have to use an XML parser in the BeautifulSoup constructor. For example, in order to parse a document as XML, you can use `lxml`’s XML parser by passing in `xml` as the second argument to the BeautifulSoup constructor:

```python
page_content = BeautifulSoup(xml_file, 'xml')
```

The above statement will parse the given `xml_file` as XML using the `xml` parser.
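
As a small illustration, the sketch below parses a made-up XML snippet (the element and attribute names are invented for this example). It assumes the `lxml` package is installed, since BeautifulSoup relies on it for the `xml` parser.

```python
from bs4 import BeautifulSoup

# Made-up XML document, used only for illustration
xml_data = """
<facilities>
  <facility opened="2010"><name>Tesla Fremont Factory</name></facility>
  <facility opened="2016"><name>Gigafactory Nevada</name></facility>
</facilities>
"""

# The 'xml' parser is provided by lxml, which must be installed separately
soup = BeautifulSoup(xml_data, "xml")

# find_all() and get_text() work the same way as with HTML
names = [f.find("name").get_text() for f in soup.find_all("facility")]
years = [f["opened"] for f in soup.find_all("facility")]

print(names)  # ['Tesla Fremont Factory', 'Gigafactory Nevada']
print(years)  # ['2010', '2016']
```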

# Final Remarks

So now you should know how to scrape data from websites using the `requests` and `BeautifulSoup` libraries. Note that you should be careful when scraping websites not to overwhelm a website's server. This can happen if you write programs that send out a lot of requests very quickly, which can slow the server down or even make it unresponsive. So avoid making a large number of web requests in a short amount of time. In fact, some servers monitor the request rate and will block you if you send too many requests. Keep this in mind when you are writing your own scrapers.
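
One simple way to follow this advice is to pause between consecutive requests. The helper below is a minimal sketch under that assumption; the name `fetch_politely`, its parameters, and the one-second default delay are illustrative choices, not a recommendation from any particular website.

```python
import time


def fetch_politely(urls, fetch, delay_seconds=1.0):
    """Call fetch(url) for each URL, pausing between consecutive requests."""
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            # Wait before every request after the first one
            time.sleep(delay_seconds)
        results.append(fetch(url))
    return results
```

Here `fetch` could be, for example, `lambda url: requests.get(url, timeout=10).text`, so that each page is downloaded only after the pause has elapsed.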

# Solution

[Solution notebook](requests_library_solution.ipynb)