# Review of Web Scraping

## Objectives


After completing this lab you will be able to:

* Download a webpage using the requests module
* Scrape all links from a web page
* Scrape all image URLs from a web page
* Scrape data from HTML tables


## Scrape www.ibm.com


In [23]:
# import required libraries
import requests                  # helps download a web page
from bs4 import BeautifulSoup    # helps with web scraping

Download the content of the webpage

In [19]:
url = "http://www.ibm.com"

In [20]:
# get the content of the webpage in text format and store it in a variable called data
data = requests.get(url).text
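
`requests.get(url).text` returns whatever the server sends back, including the body of an error page. A minimal sketch of a slightly more defensive download follows; the helper name `fetch_html` and the 10-second timeout are illustrative assumptions, not part of the original lab.

```python
import requests


def fetch_html(url, timeout=10):
    """Download a page and return its HTML as text.

    fetch_html and the 10-second default are illustrative choices,
    not part of the original lab.
    """
    response = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError for 4xx/5xx responses instead of
    # silently returning the body of an error page.
    response.raise_for_status()
    return response.text
```

It could be used in place of the cell above as `data = fetch_html(url)`.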

In [22]:
!pip install bs4


Collecting bs4
  Obtaining dependency information for bs4 from https://files.pythonhosted.org/packages/51/bb/bf7aab772a159614954d84aa832c129624ba6c32faa559dfb200a534e50b/bs4-0.0.2-py2.py3-none-any.whl.metadata
  Downloading bs4-0.0.2-py2.py3-none-any.whl.metadata (411 bytes)
Downloading bs4-0.0.2-py2.py3-none-any.whl (1.2 kB)
Installing collected packages: bs4
Successfully installed bs4-0.0.2


In [25]:
# create a soup object using the BeautifulSoup class
soup = BeautifulSoup(data,'html.parser')

In [26]:
# scrape all the links
for link in soup.find_all('a'): # in HTML, an anchor/link is represented by the <a> tag
    print(link.get('href'))

https://www.ibm.com/de-de/cloud?lnk=intro
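
`link.get('href')` returns `None` for anchors that have no `href` attribute, and the same URL often appears more than once on a page. A small sketch of filtering out both cases, run against a made-up HTML snippet rather than the live site:

```python
from bs4 import BeautifulSoup

# Made-up HTML for illustration; it is not taken from www.ibm.com.
sample_html = """
<a href="https://example.com/a">A</a>
<a name="top">No href here</a>
<a href="https://example.com/b">B</a>
<a href="https://example.com/a">A again</a>
"""

soup = BeautifulSoup(sample_html, "html.parser")

unique_links = []
# href=True matches only <a> tags that actually carry an href attribute.
for link in soup.find_all("a", href=True):
    href = link["href"]
    if href not in unique_links:  # keep the first occurrence, preserve order
        unique_links.append(href)

print(unique_links)
```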


In [27]:
# scrape all images
for img in soup.find_all('img'): # in HTML, an image is represented by the <img> tag
    print(img.get('src'))

/content/dam/adobe-cms/default-images/home-consultants.component.crop-16by9-xl.ts=1695221931390.jpg/content/adobe-cms/de/de/homepage/_jcr_content/root/table_of_contents/simple_image
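
The `src` value above is a relative path, so it cannot be downloaded as-is. One common way to turn it into a full URL is `urllib.parse.urljoin`. The sketch below uses made-up markup and assumes `https://www.ibm.com` as the base URL:

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

base_url = "https://www.ibm.com"  # assumed base for resolving relative paths
# Made-up markup for illustration: one relative and one absolute src.
sample_html = """
<img src="/images/logo.png">
<img src="https://example.com/banner.jpg">
"""

soup = BeautifulSoup(sample_html, "html.parser")

image_urls = [
    # urljoin leaves absolute URLs untouched and resolves relative ones
    # against base_url.
    urljoin(base_url, img["src"])
    for img in soup.find_all("img", src=True)
]
print(image_urls)
```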


## Scrape data from html tables


The URL below contains an HTML table with data about colors and color codes.

In [28]:
url = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-DA0321EN-SkillsNetwork/labs/datasets/HTMLColorCodes.html"

Before scraping a web site, examine its contents and how the data is organized. Open the above URL in your browser and check how many rows and columns the color table has.
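
Once a page has been downloaded and parsed, the row and column counts seen in the browser can also be confirmed in code. A minimal sketch on a shortened, simplified excerpt of the color table's markup (three rows only, header cells trimmed):

```python
from bs4 import BeautifulSoup

# A shortened, simplified copy of the table layout:
# one header row plus the first two data rows.
sample_html = """
<table>
  <tr><td>Number</td><td>Color</td><td>Color Name</td><td>Hex Code</td><td>Decimal Code</td></tr>
  <tr><td>1</td><td>&nbsp;</td><td>lightsalmon</td><td>#FFA07A</td><td>rgb(255,160,122)</td></tr>
  <tr><td>2</td><td>&nbsp;</td><td>salmon</td><td>#FA8072</td><td>rgb(250,128,114)</td></tr>
</table>
"""

table = BeautifulSoup(sample_html, "html.parser").find("table")
rows = table.find_all("tr")

n_rows = len(rows)                     # includes the header row
n_cols = len(rows[0].find_all("td"))   # cells in the first row
print(f"{n_rows} rows x {n_cols} columns")
```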


In [30]:
# get the contents of the webpage in text format
data = requests.get(url).text

In [31]:
data

'<html>\n   <body>\n      <h1>Partital List  of HTML5 Supported Colors</h1>\n<table border ="1" class="main-table">\n   <tr>\n      <td>Number </td>\n      <td>Color</td>\n      <td>Color Name</td>\n      <td>Hex Code<br>#RRGGBB</td>\n      <td>Decimal Code<br>(R,G,B)</td>\n   </tr>\n   <tr>\n      <td>1</td>\n      <td style="background:lightsalmon;">&nbsp;</td>\n      <td>lightsalmon</td>\n      <td>#FFA07A</td>\n      <td>rgb(255,160,122)</td>\n   </tr>\n   <tr>\n      <td>2</td>\n      <td style="background:salmon;">&nbsp;</td>\n      <td>salmon</td>\n      <td>#FA8072</td>\n      <td>rgb(250,128,114)</td>\n   </tr>\n   <tr>\n      <td>3</td>\n      <td style="background:darksalmon;">&nbsp;</td>\n      <td>darksalmon</td>\n      <td>#E9967A</td>\n      <td>rgb(233,150,122)</td>\n   </tr>\n   <tr>\n      <td>4</td>\n      <td style="background:lightcoral;">&nbsp;</td>\n      <td>lightcoral</td>\n      <td>#F08080</td>\n      <td>rgb(240,128,128)</td>\n   </tr>\n   <tr>\n      <td>5<

In [32]:
soup = BeautifulSoup(data, 'html.parser')

In [33]:
# find a table in the web page
table = soup.find('table') # in HTML, a table is represented by the <table> tag

In [36]:
# get all the rows from the table
for row in table.find_all('tr'): # in HTML, a table row is represented by the <tr> tag
    # get all the cells in each row
    cols = row.find_all('td') # in HTML, each cell in a row is represented by the <td> tag
    color_name = cols[2].getText() # store the value in column 3 as color_name
    color_code = cols[3].getText() # store the value in column 4 as color_code
    print(f"{color_name}---->{color_code}")

Color Name---->Hex Code#RRGGBB
lightsalmon---->#FFA07A
salmon---->#FA8072
darksalmon---->#E9967A
lightcoral---->#F08080
coral---->#FF7F50
tomato---->#FF6347
orangered---->#FF4500
gold---->#FFD700
orange---->#FFA500
darkorange---->#FF8C00
lightyellow---->#FFFFE0
lemonchiffon---->#FFFACD
papayawhip---->#FFEFD5
moccasin---->#FFE4B5
peachpuff---->#FFDAB9
palegoldenrod---->#EEE8AA
khaki---->#F0E68C
darkkhaki---->#BDB76B
yellow---->#FFFF00
lawngreen---->#7CFC00
chartreuse---->#7FFF00
limegreen---->#32CD32
lime---->#00FF00
forestgreen---->#228B22
green---->#008000
powderblue---->#B0E0E6
lightblue---->#ADD8E6
lightskyblue---->#87CEFA
skyblue---->#87CEEB
deepskyblue---->#00BFFF
lightsteelblue---->#B0C4DE
dodgerblue---->#1E90FF
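
The loop above prints the header row along with the data. To look colors up by name, the rows can instead be collected into a dictionary while skipping the header. A sketch on a shortened, simplified excerpt of the same table:

```python
from bs4 import BeautifulSoup

# Shortened, simplified excerpt of the color table (header + two data rows).
sample_html = """
<table>
  <tr><td>Number</td><td>Color</td><td>Color Name</td><td>Hex Code</td><td>Decimal Code</td></tr>
  <tr><td>1</td><td>&nbsp;</td><td>lightsalmon</td><td>#FFA07A</td><td>rgb(255,160,122)</td></tr>
  <tr><td>2</td><td>&nbsp;</td><td>salmon</td><td>#FA8072</td><td>rgb(250,128,114)</td></tr>
</table>
"""

table = BeautifulSoup(sample_html, "html.parser").find("table")

colors = {}
for row in table.find_all("tr")[1:]:  # [1:] skips the header row
    cols = row.find_all("td")
    name = cols[2].get_text(strip=True)  # strip=True trims surrounding whitespace
    code = cols[3].get_text(strip=True)
    colors[name] = code

print(colors["salmon"])
```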


# Collecting Data Using Web Scraping

## Objectives


* Extract information from a given web site
* Write the scraped data into a CSV file


## Extract information from the given web site
You will extract the data from the web site below:


In [42]:
# data
url = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-DA0321EN-SkillsNetwork/labs/datasets/Programming_Languages.html"

The data you need to scrape is the **name of the programming language** and the **average annual salary**. It is a good idea to open the URL in your web browser and study the contents of the web page before you start to scrape.

In [41]:
import requests
from bs4 import BeautifulSoup

In [43]:
#download the webpage at the url
data = requests.get(url).text

In [44]:
# create a BeautifulSoup object
soup = BeautifulSoup(data, 'html.parser')

In [46]:
#scrape the Language name and annual average Salary
table = soup.find('table')
for row in table.find_all('tr'):
    cols = row.find_all('td')
    language_name = cols[1].getText()
    annual_average_salary = cols[3].getText()
    print(f"The Programming Language {language_name} commands an \
    annual salary of:{annual_average_salary}")

The Programming Language Language commands an     annual salary of:Average Annual Salary
The Programming Language Python commands an     annual salary of:$114,383
The Programming Language Java commands an     annual salary of:$101,013
The Programming Language R commands an     annual salary of:$92,037
The Programming Language Javascript commands an     annual salary of:$110,981
The Programming Language Swift commands an     annual salary of:$130,801
The Programming Language C++ commands an     annual salary of:$113,865
The Programming Language C# commands an     annual salary of:$88,726
The Programming Language PHP commands an     annual salary of:$84,727
The Programming Language SQL commands an     annual salary of:$84,793
The Programming Language Go commands an     annual salary of:$94,082
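
The indexes `1` and `3` are hard-coded and break if the column order changes. One alternative is to look the positions up from the header row. In the sketch below, the headers `A` and `B` are placeholders for the columns this lab does not use; only the two named headers and the values match the output above.

```python
from bs4 import BeautifulSoup

# Columns "A" and "B" are placeholders for columns this lab does not use;
# the two named headers and the values match the output above.
sample_html = """
<table>
  <tr><td>A</td><td>Language</td><td>B</td><td>Average Annual Salary</td></tr>
  <tr><td>-</td><td>Python</td><td>-</td><td>$114,383</td></tr>
  <tr><td>-</td><td>Java</td><td>-</td><td>$101,013</td></tr>
</table>
"""

table = BeautifulSoup(sample_html, "html.parser").find("table")
rows = table.find_all("tr")

# Read the header row once and find the position of each wanted column.
headers = [cell.get_text(strip=True) for cell in rows[0].find_all("td")]
name_idx = headers.index("Language")
salary_idx = headers.index("Average Annual Salary")

records = []
for row in rows[1:]:  # data rows only
    cols = row.find_all("td")
    records.append(
        (cols[name_idx].get_text(strip=True), cols[salary_idx].get_text(strip=True))
    )

print(records)
```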


In [53]:
#without the header
skip_first_row = True

for row in table.find_all('tr'):
    if skip_first_row:
        skip_first_row = False
        continue #skip the first row
        
    cols = row.find_all('td')
    language_name = cols[1].getText()
    annual_average_salary = cols[3].getText()
    print(f"The Programming Language {language_name} commands an annual salary of:{annual_average_salary}")

The Programming Language Python commands an annual salary of:$114,383
The Programming Language Java commands an annual salary of:$101,013
The Programming Language R commands an annual salary of:$92,037
The Programming Language Javascript commands an annual salary of:$110,981
The Programming Language Swift commands an annual salary of:$130,801
The Programming Language C++ commands an annual salary of:$113,865
The Programming Language C# commands an annual salary of:$88,726
The Programming Language PHP commands an annual salary of:$84,727
The Programming Language SQL commands an annual salary of:$84,793
The Programming Language Go commands an annual salary of:$94,082
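
The salaries are scraped as strings such as `$114,383`, which cannot be compared or summed directly. A small sketch of converting them to integers; `parse_salary` is a hypothetical helper name, and the three values are copied from the output above.

```python
def parse_salary(text):
    """Convert a string such as '$114,383' to the integer 114383."""
    return int(text.strip().replace("$", "").replace(",", ""))


# Three values copied from the scraped output.
salaries = {
    "Python": "$114,383",
    "Swift": "$130,801",
    "PHP": "$84,727",
}

numeric = {language: parse_salary(value) for language, value in salaries.items()}
highest = max(numeric, key=numeric.get)  # language with the largest salary
print(highest, numeric[highest])
```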


## Save the scraped data into a file named popular-languages.csv

Using Pandas

In [56]:
import pandas as pd
content = []
for row in table.find_all('tr'):
    cols = row.find_all('td')
    language_name = cols[1].getText()
    annual_average_salary = cols[3].getText()
    content.append([language_name, annual_average_salary])

# the first scraped row is the table's own header row, so use it for the column names
df = pd.DataFrame(content[1:], columns=content[0])
df.to_csv('popular-languages.csv', index = False)

print("Data has been saved to 'popular-languages.csv'.")

Data has been saved to 'popular-languages.csv'.


Alternatively, using the csv module

In [59]:
import csv

with open('popular-languages1.csv', 'w', newline='') as csvfile:
    #create a csv writer object
    csv_writer = csv.writer(csvfile)
    
    #write the header row
    csv_writer.writerow(['Language Name', 'Annual Average Salary'])
    
    #Find the table in the HTML
    table = soup.find('table')
    
    for row in table.find_all('tr')[1:]: #skip the table's own header row; a header was already written above
        cols = row.find_all('td')
        language_name = cols[1].getText()
        annual_average_salary = cols[3].getText()
    
        csv_writer.writerow([language_name, annual_average_salary])
        
print("Data has been saved to 'popular-languages1.csv'.")

Data has been saved to 'popular-languages1.csv'.
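
After writing a CSV file, it is worth reading it back to confirm the header and the rows. A minimal sketch using only the standard library; it writes two rows copied from the output above to a temporary file (the file name `check.csv` is arbitrary) so that it does not touch the lab's own files.

```python
import csv
import os
import tempfile

rows = [
    ["Python", "$114,383"],
    ["Java", "$101,013"],
]

# Write to a temporary directory so the sketch does not touch the lab's files.
path = os.path.join(tempfile.mkdtemp(), "check.csv")
with open(path, "w", newline="") as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(["Language Name", "Annual Average Salary"])
    writer.writerows(rows)

# Read the file back; DictReader uses the first line as the field names.
with open(path, newline="") as csvfile:
    loaded = list(csv.DictReader(csvfile))

print(len(loaded), loaded[0]["Language Name"])
```

Because the salary values contain a comma, `csv.writer` quotes them automatically, and `csv.DictReader` restores the original strings when reading.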
