#### You will work with three Python modules that are helpful for scraping data from the web, parsing it into the table(s) you want, and loading the table(s) into a dataframe. Our goal is to create a dataframe of [soups](https://en.wikipedia.org/wiki/List_of_soups)

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

- It's a good idea to view the page source first to see how it's structured
- The devtools may also be helpful
    - in Chrome right-click and choose `inspect` or just use `F12` to bring up the devtools

### Make a request using the `requests` [library](https://requests.readthedocs.io/en/master/user/quickstart/)
- `requests.get()` uses HTTP GET to retrieve a webpage
- `requests.post()` uses HTTP POST, for example when submitting a form
- checking the [`status_code`](https://www.restapitutorial.com/httpstatuscodes.html) on the result lets you know whether your request was successful (`200` means OK)
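
As a minimal sketch of the status check described above, the helper below wraps `requests.get()` with a timeout and `raise_for_status()`, so a failed request raises an exception instead of passing silently. The names `is_success` and `fetch_html` and the 10-second timeout are assumptions for illustration, not part of the original notebook.

```python
import requests


def is_success(status_code):
    """Return True for any 2xx HTTP status code."""
    return 200 <= status_code < 300


def fetch_html(url, timeout=10):
    """Hypothetical helper: GET a page and return its HTML as text."""
    response = requests.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx/5xx responses
    response.raise_for_status()
    return response.text
```

Calling `fetch_html('https://en.wikipedia.org/wiki/List_of_soups')` would return the page HTML as a string, or raise an exception if the server reports an error.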


In [13]:
website_url = 'https://en.wikipedia.org/wiki/List_of_soups'
response = requests.get(website_url)

response.status_code

200

### Next look at the content of the result
- `response` is a `Response` object
- its `content` attribute holds the raw bytes of the HTML document

In [25]:
print(type(response))
response.content

<class 'requests.models.Response'>


b'\n<!DOCTYPE html>\n<html class="client-nojs" lang="en" dir="ltr">\n<head>\n<meta charset="UTF-8"/>\n<title>List of soups - Wikipedia</title>\n<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"XsGS3ApAAEEAAEvarJ0AAADM","wgCSPNonce":!1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"List_of_soups","wgTitle":"List of soups","wgCurRevisionId":952868488,"wgRevisionId":952868488,"wgArticleId":308412,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Articles with short description","Dynamic lists","Articles containing Hungarian-language text","Wikipedia articles needing clarification from July 2015","Articles containing

### The [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) package is handy for extracting data from html docs (and xml docs)

In [26]:
soup = BeautifulSoup(response.content, 'lxml')
print(soup.title)

<title>List of soups - Wikipedia</title>
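
To see what the parsed object offers without hitting the network, here is a small sketch on a made-up HTML snippet (the `sample_html` string and variable names are assumptions for illustration). It uses Python's built-in `html.parser`, so it runs even where `lxml` is not installed.

```python
from bs4 import BeautifulSoup

# Made-up snippet for illustration; not the real Wikipedia page
sample_html = """
<html>
  <head><title>List of soups - Wikipedia</title></head>
  <body>
    <h1 id="firstHeading">List of soups</h1>
    <a href="/wiki/Aguadito" title="Aguadito">Aguadito</a>
    <a href="/wiki/Peru" title="Peru">Peru</a>
  </body>
</html>
"""

sample_soup = BeautifulSoup(sample_html, 'html.parser')

title_text = sample_soup.title.string             # text inside <title>
heading_text = sample_soup.find('h1').get_text()  # text inside the first <h1>
# collect the href attribute of every <a> tag
links = [a['href'] for a in sample_soup.find_all('a')]

print(title_text)
print(heading_text)
print(links)
```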


### You can get the table that contains the data from the page using beautiful soup

In [15]:
tables = soup.find_all('table', attrs = {'class': 'wikitable sortable'})
tables

[<table class="wikitable sortable">
 <tbody><tr>
 <th>Name
 </th>
 <th>Image
 </th>
 <th>Origin
 </th>
 <th>Type
 </th>
 <th>Distinctive ingredients and description
 </th></tr>
 <tr>
 <td><a href="/wiki/Aguadito" title="Aguadito">Aguadito</a>
 </td>
 <td>
 </td>
 <td><a href="/wiki/Peru" title="Peru">Peru</a>
 </td>
 <td>Chunky
 </td>
 <td>Peruvian green soup usually made with <a class="mw-redirect" href="/wiki/Cilantro" title="Cilantro">cilantro</a>, <a href="/wiki/Carrot" title="Carrot">carrot</a>, <a href="/wiki/Pea" title="Pea">peas</a>, <a href="/wiki/Potato" title="Potato">potatoes</a> and can have chicken, hen, mussels or fish. It also contains <a class="mw-redirect" href="/wiki/Aj%C3%AD_amarillo" title="Ají amarillo">ají amarillo</a> (yellow chili pepper) and various other vegetables and spices. The green color is due to cilantro. It is known for having a potential for easing or alleviating symptoms associated with the hangover.<sup class="reference" id="cite_ref-Barrell_2017_1
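
One caveat, sketched below on a made-up snippet: passing `'wikitable sortable'` as the class compares against the literal attribute string, so a table whose classes appear in a different order or alongside extra classes may be missed. A CSS selector via `select()` checks each class independently. The `demo_html` content and variable names are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Made-up snippet for illustration; not the real Wikipedia page
demo_html = """
<table class="wikitable sortable"><tr><th>Name</th></tr></table>
<table class="sortable wikitable plainrowheaders"><tr><th>Name</th></tr></table>
<table class="infobox"><tr><th>Other</th></tr></table>
"""

demo_soup = BeautifulSoup(demo_html, 'html.parser')

# compares against the class attribute as one string
exact_matches = demo_soup.find_all('table', attrs={'class': 'wikitable sortable'})

# requires both classes, in any order, and allows extra classes
css_matches = demo_soup.select('table.wikitable.sortable')

print(len(exact_matches), len(css_matches))
```

On the soups page the original `find_all()` call works as shown; the selector form is simply more forgiving if the page markup changes.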

### It is a good idea to check how many tables you scraped
- then use `pd.read_html()` to get a list of dataframes extracted from the soup tables
- you'll need to convert the table (a Beautiful Soup `Tag`, not a string) to a string before pandas can read it
- to load the table you want into a dataframe, grab it from the list of dataframes

In [20]:
len(tables)

In [23]:
from io import StringIO  # recent pandas versions warn when given a raw HTML string
result_list = pd.read_html(StringIO(str(tables[0]))) # a list of dataframes
len(result_list)

1

In [24]:
world_soups = result_list[0]  # get the first df from the list
world_soups.head()

| | Name | Image | Origin | Type | Distinctive ingredients and description |
|---|---|---|---|---|---|
| 0 | Aguadito | | Peru | Chunky | Peruvian green soup usually made with cilantro... |
| 1 | Ajiaco | | Colombia | Chunky | In the Colombian capital of Bogotá, ajiaco is ... |
| 2 | Acquacotta | | Italy (Kalona) | Chunky | Originally a peasant food,[4] historically, it... |
| 3 | Analı kızlı soup | | Turkey | Chunky | Bulgur meatballs and chickpeas in gravy with y... |
| 4 | Ash-e doogh | | Iran | Yogurt soup | Consists of yogurt and leafy vegetables. Serve... |
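
Putting the last steps together, here is a self-contained sketch on a tiny made-up table (the `mini_html` string and the variable names are assumptions for illustration). It wraps the table's HTML in `StringIO`, since recent pandas versions warn when `pd.read_html()` receives a raw HTML string, and it assumes an HTML parser that pandas supports, such as `lxml`, is installed.

```python
from io import StringIO

import pandas as pd
from bs4 import BeautifulSoup

# Made-up table for illustration; the rows echo the first two soups above
mini_html = """
<table class="wikitable sortable">
  <thead><tr><th>Name</th><th>Origin</th><th>Type</th></tr></thead>
  <tbody>
    <tr><td>Aguadito</td><td>Peru</td><td>Chunky</td></tr>
    <tr><td>Ajiaco</td><td>Colombia</td><td>Chunky</td></tr>
  </tbody>
</table>
"""

mini_soup = BeautifulSoup(mini_html, 'html.parser')
mini_tables = mini_soup.select('table.wikitable')

# read_html always returns a list, even for a single table
frames = pd.read_html(StringIO(str(mini_tables[0])))
mini_df = frames[0]

print(len(mini_tables), len(frames))
print(mini_df)
```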
