# Python Notes: Webscraping  
<hr>

Webscraping is the automatic extraction of information from a website.  
> In Python this can be done with the modules Requests and Beautiful Soup

**from bs4 import BeautifulSoup**  
imports Beautiful Soup  
  
**_myHTML_ = _HTMLDoc_**  
**_mySoup_ = BeautifulSoup(_myHTML_, "html.parser")**  
parses the html doc and stores it as a nested data structure  
> the html is represented as a tree of nested objects  
>> the object's methods are used to navigate and search that tree  

**print(_mySoup_.prettify())**  
prints the object as an indented string, with each tag on its own line, so the nesting of the html doc is easy to read  
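
A minimal sketch of the steps so far, assuming a small inline HTML string (`sample_html` and `sample_soup` are made-up names for illustration, not part of the course material):

```python
from bs4 import BeautifulSoup

# Hypothetical one-line HTML document used only for illustration
sample_html = "<html><head><title>Page Title</title></head><body><p>Hello</p></body></html>"

# Parse the string into a tree of nested objects
sample_soup = BeautifulSoup(sample_html, "html.parser")

# prettify() returns a string with the tags spread over indented lines
pretty = sample_soup.prettify()
print(pretty)

# The first <title> tag is reachable as an attribute of the soup
print(sample_soup.title)
```

The one-line input comes back spread over many indented lines, which makes the nesting visible before any searching is done.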
  
**_myTitle_ = _mySoup_.title**  
returns the title element: the content together with the surrounding title tags  
> if the tag occurs more than once, only the first instance is returned  
  
**_myH3_ = _mySoup_.h3**  
**_myH3sB_ = _myH3_._b_**  
returns the \<b\> tag nested inside the returned \<h3\> tag; navigates to the child tag  
> **_backtoH3_ = _myH3sB_.parent**  
> the reverse is also possible: .parent travels up to the parent tag  
> **_mySib_ = _myH3_.next_sibling**  
> returns the next node on the same level as the current tag (this can be a whitespace string rather than a tag)  
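
The three directions of navigation can be tried on a tiny snippet. This is only a sketch; `nav_html` is a hypothetical string with no whitespace between the tags, so the sibling is a tag and not a text node:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: an <h3> with a <b> child, followed directly by a <p>
nav_html = "<body><h3><b>Lebron James</b></h3><p>Salary</p></body>"
nav_soup = BeautifulSoup(nav_html, "html.parser")

h3_tag = nav_soup.h3
b_tag = h3_tag.b                # down to the child tag
back_to_h3 = b_tag.parent       # back up to the parent tag
sibling = h3_tag.next_sibling   # sideways to the next node on the same level

print(b_tag.name, back_to_h3.name, sibling.name)
```

If the HTML is indented, `next_sibling` usually returns the whitespace between the tags first; `find_next_sibling()` skips text nodes and returns the next tag.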
  
**_myAtt_ = _myH3_.attrs**  
returns the attributes of the tag as a dictionary, with the attribute names as keys and the attribute values as values  
> **_myH3_["attr1"]**  
> a single attribute can be read directly from the tag like a dictionary entry, without going through .attrs  
>> **_myH3_.get("attr1")**  
>> does the same, but returns None instead of raising a KeyError when the attribute is missing  
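
A short sketch of the difference between the two ways of reading an attribute, assuming a single made-up `<b>` tag (`attr_soup` is a hypothetical name):

```python
from bs4 import BeautifulSoup

attr_soup = BeautifulSoup("<b id='boldest'>Lebron James</b>", "html.parser")
b_tag = attr_soup.b

print(b_tag.attrs)         # all attributes as a dictionary
print(b_tag["id"])         # dictionary-style access
print(b_tag.get("id"))     # same value through get()
print(b_tag.get("href"))   # missing attribute: get() returns None

try:
    b_tag["href"]          # missing attribute: [] raises KeyError
except KeyError:
    print("no href attribute")
```

`get()` is the safer choice when looping over many tags where the attribute may be absent.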
  
**_myString_ = _myH3_.string**  
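returns the text between the tags  
> only works when the tag holds a single string (or a single child tag that does); otherwise it returns None  
> the text is returned as a NavigableString object  
>> a NavigableString is similar to a Python string, except that it also supports Beautiful Soup features such as navigating the tree  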
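
A small sketch of how `.string` behaves, assuming a made-up snippet where one tag holds a single string and another holds mixed content (`string_soup` is a hypothetical name):

```python
from bs4 import BeautifulSoup

string_soup = BeautifulSoup(
    "<h3><b>Lebron James</b></h3><p>Salary: <i>92</i> million</p>",
    "html.parser",
)

name = string_soup.b.string
print(type(name).__name__)     # NavigableString
print(isinstance(name, str))   # True, NavigableString is a subclass of str
plain = str(name)              # convert to a plain Python string

# <p> has several children, so .string is None; get_text() joins all the text instead
p_string = string_soup.p.string
p_text = string_soup.p.get_text()
print(p_string)
print(p_text)
```

Converting with `str()` is useful when the text should be stored without keeping a reference to the whole parse tree.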

<hr>  
Below is the sample html:  

<!DOCTYPE html>
<html>
<head>
<title>Page Title</title>
</head>
<body>
<h3><b id='boldest'>Lebron James</b></h3>
<p> Salary: $ 92,000,000 </p>
<h3> Stephen Curry</h3>
<p> Salary: $85,000, 000 </p>
<h3> Kevin Durant </h3>
<p> Salary: $73,200, 000</p>
</body>
</html>

In [2]:
from bs4 import BeautifulSoup

html = "<!DOCTYPE html><html><head><title>Page Title</title></head><body><h3><b id='boldest'>Lebron James</b></h3><p> Salary: $ 92,000,000 </p><h3> Stephen Curry</h3><p> Salary: $85,000, 000 </p><h3> Kevin Durant </h3><p> Salary: $73,200, 000</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

tag_title = soup.title
print("This is the first title tag:", tag_title)

tag_h3 = soup.h3
print("This is the first h3 tag:", tag_h3)

tag_h3_b = tag_h3.b
print("This is the bold tag in the h3 tag originally printed:", tag_h3_b)

tag_h3_b_parent = tag_h3_b.parent
print("This is the parent of the h3 b tag, which is just h3:", tag_h3_b_parent)

# note: .attrs is called on the <b> tag here, since the <h3> tag itself has no attributes
tag_h3_att = tag_h3_b.attrs
print("This is the attributes in the h3 tag:", tag_h3_att)

tag_h3_content = tag_h3.string
print("This is the content between the tags:", tag_h3_content)

print("This is the value of the id attribute in the bold tag:", tag_h3_b["id"])
print("This is the value of the id attribute in the bold tag:", tag_h3_b.get("id"))


This is the first title tag: <title>Page Title</title>
This is the first h3 tag: <h3><b id="boldest">Lebron James</b></h3>
This is the bold tag in the h3 tag originally printed: <b id="boldest">Lebron James</b>
This is the parent of the h3 b tag, which is just h3: <h3><b id="boldest">Lebron James</b></h3>
This is the attributes in the h3 tag: {'id': 'boldest'}
This is the content between the tags: Lebron James
This is the value of the id attribute in the bold tag: boldest
This is the value of the id attribute in the bold tag: boldest


<hr>

**_myHTML_ = "_HTMLDoc_"**  
**_myList_ = _mySoup_.find_all(name = "_tag_")**  
returns a list of all tags matching the given filter, each instance being one element in the list  
> Method Signature: find_all(name, attrs, recursive, string, limit, \*\*kwargs)  
> A list can be passed as the name parameter to return all tags whose name is in the list  
> keyword arguments filter on attributes, finding either tags that have the attribute at all (_attrName_ = True) or tags where it has a specific value (_attrName_ = "_value_")   
>> when filtering on the "class" attribute, an underscore needs to be added after it, since class is a Python keyword  
>>> _Example_: _mySoup_.find_all(class_ = "_value_")    
  
> string can be used to filter on the text content instead of the tag (string = "_string_")  
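
The filters listed above can be compared side by side. This is a sketch on a made-up table loosely modelled on the Coursera sample; `filter_html` and the `header` class are assumptions added for illustration:

```python
from bs4 import BeautifulSoup

# Hypothetical table; the class attributes are not in the original sample
filter_html = (
    "<table>"
    "<tr><td id='flight' class='header'>Flight No</td><td class='header'>Launch site</td></tr>"
    "<tr><td>1</td><td><a href='https://en.wikipedia.org/wiki/Florida'>Florida</a></td></tr>"
    "<tr><td>2</td><td><a href='https://en.wikipedia.org/wiki/Texas'>Texas</a></td></tr>"
    "</table>"
)
filter_soup = BeautifulSoup(filter_html, "html.parser")

rows = filter_soup.find_all("tr")                    # by tag name
rows_and_links = filter_soup.find_all(["tr", "a"])   # by a list of tag names
with_id = filter_soup.find_all(id=True)              # attribute exists
flight = filter_soup.find_all(id="flight")           # attribute has a specific value
headers = filter_soup.find_all(class_="header")      # class needs the trailing underscore
first_cell = filter_soup.find_all("td", limit=1)     # stop after the first match
texas = filter_soup.find_all(string="Texas")         # returns the matching text, not the tag

print(len(rows), len(rows_and_links), len(with_id), len(headers), len(first_cell), texas)
```

Note that the `string` filter returns the text nodes themselves; the enclosing tag is one step up via `.parent`.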

**_mySoup_.find()**  
returns only the first element in the document that matches the filter, or None if nothing matches  
> takes the same filters as **find_all()**  
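
A quick sketch of how the two methods differ when nothing matches, assuming a made-up table row (`find_soup` is a hypothetical name):

```python
from bs4 import BeautifulSoup

find_soup = BeautifulSoup("<tr><td>1</td><td>Florida</td></tr>", "html.parser")

first_td = find_soup.find("td")   # first match only, as a single tag
missing = find_soup.find("a")     # no match: None
empty = find_soup.find_all("a")   # no match: empty list

print(first_td, missing, empty)

# Guard against None before chaining further calls on the result of find()
link_text = missing.get_text() if missing is not None else ""
```

Because `find()` can return None, chaining directly (for example `soup.find("a").get("href")`) fails with an AttributeError on pages where the tag is absent.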

The following html, provided by Coursera, will be used:

    <table>
      <tr>
        <td id='flight' >Flight No</td>
        <td>Launch site</td> 
        <td>Payload mass</td>
      </tr>
      <tr> 
        <td>1</td>
        <td><a href='https://en.wikipedia.org/wiki/Florida'>Florida</a></td>
        <td>300 kg</td>
      </tr>
      <tr>
        <td>2</td>
        <td><a href='https://en.wikipedia.org/wiki/Texas'>Texas</a></td>
        <td>94 kg</td>
      </tr>
      <tr>
        <td>3</td>
        <td><a href='https://en.wikipedia.org/wiki/Florida'>Florida<a> </td>
        <td>80 kg</td>
      </tr>
    </table>


In [3]:
# Coursera Provided Code

table="<table><tr><td id='flight'>Flight No</td><td>Launch site</td> <td>Payload mass</td></tr><tr> <td>1</td><td><a href='https://en.wikipedia.org/wiki/Florida'>Florida<a></td><td>300 kg</td></tr><tr><td>2</td><td><a href='https://en.wikipedia.org/wiki/Texas'>Texas</a></td><td>94 kg</td></tr><tr><td>3</td><td><a href='https://en.wikipedia.org/wiki/Florida'>Florida<a> </td><td>80 kg</td></tr></table>"
table_bs = BeautifulSoup(table, "html.parser")

table_rows = table_bs.find_all('tr')
print("All instances of tr: ", table_rows)

list_input = table_bs.find_all(name=["tr", "td"])
print("\n \n All instances of tr and td", list_input)

href_list = table_bs.find_all(href=True)
print("\n \n All instances where there is an attribute href", href_list)

wiki_link_list = table_bs.find_all(href="https://en.wikipedia.org/wiki/Florida")
print("\n \n All instances where the href attribute is the Florida wiki link", wiki_link_list)

florida_list = table_bs.find_all(string="Florida")
print("\n \n All instances of the string 'Florida': ", florida_list)

All instances of tr:  [<tr><td id="flight">Flight No</td><td>Launch site</td> <td>Payload mass</td></tr>, <tr> <td>1</td><td><a href="https://en.wikipedia.org/wiki/Florida">Florida<a></a></a></td><td>300 kg</td></tr>, <tr><td>2</td><td><a href="https://en.wikipedia.org/wiki/Texas">Texas</a></td><td>94 kg</td></tr>, <tr><td>3</td><td><a href="https://en.wikipedia.org/wiki/Florida">Florida<a> </a></a></td><td>80 kg</td></tr>]

 
 All instances of tr and td [<tr><td id="flight">Flight No</td><td>Launch site</td> <td>Payload mass</td></tr>, <td id="flight">Flight No</td>, <td>Launch site</td>, <td>Payload mass</td>, <tr> <td>1</td><td><a href="https://en.wikipedia.org/wiki/Florida">Florida<a></a></a></td><td>300 kg</td></tr>, <td>1</td>, <td><a href="https://en.wikipedia.org/wiki/Florida">Florida<a></a></a></td>, <td>300 kg</td>, <tr><td>2</td><td><a href="https://en.wikipedia.org/wiki/Texas">Texas</a></td><td>94 kg</td></tr>, <td>2</td>, <td><a href="https://en.wikipedia.org/wiki/Texas">T

<hr>

### Scraping the Web

**_page_ = requests.get("_URL_").text**  
**_soup_ = BeautifulSoup(_page_, "html.parser")**  
saves the html of a website and then converts it into a soup object  
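
A slightly more defensive version of the same two steps could look like the sketch below. `fetch_soup` is a hypothetical helper name, and the timeout of 10 seconds is an arbitrary assumption:

```python
import requests
from bs4 import BeautifulSoup


def fetch_soup(url, timeout=10):
    """Download a page and return it as a BeautifulSoup object (hypothetical helper)."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()   # raise an HTTPError for 4xx/5xx responses
    return BeautifulSoup(response.text, "html.parser")


# Usage needs network access, so it is left commented out here:
# soup = fetch_soup("http://www.ibm.com")
# print(soup.title)
```

Checking the status first avoids silently parsing an error page as if it were the content of interest.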
  
**for i in _soup_.find_all(_item of interest type_):**  
---- **print(_filtered for item of interest_)**  
loop to return items of interest from a certain tag group  
  
**targetList = _soup_.find_all(_item of interest type_)**  
**for x, i in enumerate(targetList):**  
---- **if ("keywords of string" in str(i)):**  
-------- **desiredIndex = x**  
**print(targetList[desiredIndex])**  
prints the desired target object, using a loop to check each element in the list of objects for the keyword string and remembering the index of the match  
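
The keyword search can be tried without a network connection. This is a sketch on a made-up page with two tables; `two_tables` and its contents are assumptions for illustration:

```python
from bs4 import BeautifulSoup

# Hypothetical page with two tables; only the second mentions the keyword
two_tables = (
    "<table><caption>Largest cities</caption><tr><td>Tokyo</td></tr></table>"
    "<table><caption>10 most densely populated countries</caption>"
    "<tr><td>Singapore</td></tr></table>"
)
search_soup = BeautifulSoup(two_tables, "html.parser")
target_list = search_soup.find_all("table")

desired_index = None
for x, item in enumerate(target_list):
    if "densely populated" in str(item):   # test each element, not the whole list
        desired_index = x
        break                              # keep the first match

if desired_index is not None:
    print(target_list[desired_index].caption.string)
```

Starting from `desired_index = None` makes the "not found" case explicit instead of failing later with an undefined variable.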
  
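**pd.read_html("_url_", match = "_string criteria_", flavor = 'bs4') [0]**  
reads every \<table\> on the page whose text matches the string criteria and returns them as a list of DataFrames; [0] selects the first one  
> does the same job as manually finding a table and running a loop to assign each \<td\> to the appropriate cell  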
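
A small sketch of what `pd.read_html` returns, using an inline HTML string instead of a URL. The table contents are made up. The `flavor` argument is left at its default here, which assumes that either lxml or bs4 with html5lib is installed:

```python
from io import StringIO

import pandas as pd

# Hypothetical page with two tables; only the first mentions "Payload mass"
table_html = """
<table>
  <tr><th>Flight No</th><th>Launch site</th><th>Payload mass</th></tr>
  <tr><td>1</td><td>Florida</td><td>300 kg</td></tr>
  <tr><td>2</td><td>Texas</td><td>94 kg</td></tr>
</table>
<table>
  <tr><th>Unrelated</th></tr>
  <tr><td>x</td></tr>
</table>
"""

# read_html returns a list of DataFrames, one per table whose text matches `match`
frames = pd.read_html(StringIO(table_html), match="Payload mass")
flights = frames[0]
print(flights)
```

The header row and the cells are filled in automatically, so no explicit loop over the rows is needed; the values still arrive as scraped (for example "300 kg" stays a string).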


In [28]:
# Coursera Provided Code

# prints out all the hyperlinks on the ibm website
from bs4 import BeautifulSoup
import requests


url = "http://www.ibm.com"
data  = requests.get(url).text 
soup = BeautifulSoup(data,"html.parser")
for link in soup.find_all("a",href=True):  # in html anchor/link is represented by the tag <a>
    print(link.get('href'))


https://www.ibm.com/sports/fantasy/?lnk=ushpv18l1
#ibm-hp--tech-section
https://www.ibm.com/consulting/?lnk=ushpv18intro2
https://www.ibm.com/products/turbonomic/sustainability?lnk=ushpv18f1
https://www.ibm.com/products/process-mining?lnk=ushpv18f2
https://skillsbuild.org/adult-learners/explore-learning/cybersecurity-analyst?lnk=ushpv18f3
https://www.ibm.com/thought-leadership/institute-business-value/en-us/report/quantum-decade?lnk=ushpv18f4
https://research.ibm.com/blog/ibm-arc-africa-climate?lnk=ushpv18r1
https://research.ibm.com/blog/ibm-arc-africa-climate?lnk=ushpv18r1
https://research.ibm.com/blog/ibm-pytorch-cloud-ai-ethernet?lnk=ushpv18r2
https://research.ibm.com/blog/2023-quantum-internships?lnk=ushpv18r3
https://research.ibm.com/blog/endpoint-security-reaqta-sysflow?lnk=ushpv18r4
https://www.ibm.com/case-studies/search?lnk=ushpv18mAll
https://www.ibm.com/case-studies/new-jersey-department-of-community-affairs/?lnk=ushpv18m1
https://www.ibm.com/case-studies/bluecross-blueshiel

In [48]:
# find a keyword in a table and return the html for that table

url = "https://en.wikipedia.org/wiki/World_population"
data  = requests.get(url).text
soup = BeautifulSoup(data,"html.parser")
table = soup.find_all('table')

for x, i in enumerate(table):
    if ("10 most densely populated countries" in str(i)):
        table_index = x
print(table[table_index].prettify())

<table class="wikitable sortable" style="text-align:right">
 <caption>
  10 most densely populated countries
  <small>
   (with population above 5 million)
  </small>
  <sup class="reference" id="cite_ref-:10_107-0">
   <a href="#cite_note-:10-107">
    [102]
   </a>
  </sup>
 </caption>
 <tbody>
  <tr>
   <th scope="col">
    Rank
   </th>
   <th scope="col">
    Country
   </th>
   <th scope="col">
    Population
   </th>
   <th scope="col">
    Area
    <br/>
    <small>
     (km
     <sup>
      2
     </sup>
     )
    </small>
   </th>
   <th scope="col">
    Density
    <br/>
    <small>
     (pop/km
     <sup>
      2
     </sup>
     )
    </small>
   </th>
  </tr>
  <tr>
   <td>
    1
   </td>
   <td align="left">
    <span class="flagicon">
     <img alt="" class="thumbborder" data-file-height="600" data-file-width="900" decoding="async" height="15" src="//upload.wikimedia.org/wikipedia/commons/thumb/4/48/Flag_of_Singapore.svg/23px-Flag_of_Singapore.svg.png" srcset="//upload

In [53]:
import pandas as pd

rows = []   # collect one dict per table row; DataFrame.append() was removed in pandas 2.0

for i in table[table_index].tbody.find_all("tr"):   # iterating through a list with all instances of <tr> in the <tbody> of the table found from the previous cell
    col = i.find_all("td")   # assigning the list of all <td> for the current iteration of <tr> to var "col"
    if (col != []):   # header rows only hold <th>, so col is empty for them and they are skipped
        rank = col[0].text
        country = col[1].text.strip()
        population = col[2].text.strip() # text.strip() removes the trailing and leading whitespace
        area = col[3].text.strip()
        density = col[4].text.strip()
        rows.append({"Rank":rank, "Country":country, "Population":population, "Area":area, "Density":density})

# build the DataFrame once from the collected rows
population_data = pd.DataFrame(rows, columns=["Rank", "Country", "Population", "Area", "Density"])
population_data

| | Rank | Country | Population | Area | Density |
|---|---|---|---|---|---|
| 0 | 1 | Singapore | 5921231 | 719 | 8235 |
| 1 | 2 | Bangladesh | 165650475 | 148460 | 1116 |
| 2 | 3 | Palestine[103] | 5223000 | 6025 | 867 |
| 3 | 4 | Lebanon | 5296814 | 10400 | 509 |
| 4 | 5 | Taiwan | 23580712 | 35980 | 655 |
| 5 | 6 | South Korea | 51844834 | 99720 | 520 |
| 6 | 7 | Rwanda | 13173730 | 26338 | 500 |
| 7 | 8 | Israel | 8914885 | 21937 | 406 |
| 8 | 9 | Haiti | 11334637 | 27750 | 408 |
| 9 | 10 | Netherlands | 17400824 | 41543 | 419 |
