<p style="text-align:center">
    <a href="https://skills.network/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDA0321ENSkillsNetwork928-2022-01-01" target="_blank">
    <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/assets/logos/SN_web_lightmode.png" width="200" alt="Skills Network Logo"  />
    </a>
</p>


# **Hands-on Lab: Web Scraping**


Estimated time needed: **30 to 45** minutes


## Objectives


In this lab you will perform the following:


* Extract information from a given web site.
* Write the scraped data into a CSV file.


## Extract information from the given web site
You will extract the data from the following web site: <br>


In [1]:
#this url contains the data you need to scrape
url = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-DA0321EN-SkillsNetwork/labs/datasets/Programming_Languages.html"

The data you need to scrape is the **name of the programming language** and the **average annual salary**.<br> It is a good idea to open the URL in your web browser and study the contents of the web page before you start to scrape.


Import the required libraries


In [2]:
# Your code here
import requests
from bs4 import BeautifulSoup


Download the webpage at the url


In [3]:
#your code goes here

print(f"Attempting to download webpage from: {url}")

try:
    # Send a GET request to the URL; without a timeout the Timeout handler below is never reached
    response = requests.get(url, timeout=30)

    # Check if the request was successful (status code 200)
    response.raise_for_status() # Raises an HTTPError for bad responses (4xx or 5xx)

    # Get the content of the webpage
    html_content = response.text

    print(f"Successfully downloaded webpage (Status Code: {response.status_code}).")
    print("\n--- First 500 characters of HTML content ---")
    print(html_content[:500])
    print("-------------------------------------------\n")

    # You can now use BeautifulSoup to parse this html_content
    # soup = BeautifulSoup(html_content, 'html.parser')
    # print("Webpage content parsed successfully with BeautifulSoup.")

except requests.exceptions.HTTPError as errh:
    print(f"HTTP Error occurred: {errh}")
except requests.exceptions.ConnectionError as errc:
    print(f"Error Connecting: {errc}")
except requests.exceptions.Timeout as errt:
    print(f"Timeout Error: {errt}")
except requests.exceptions.RequestException as err:
    print(f"An unexpected error occurred: {err}")



Attempting to download webpage from: https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-DA0321EN-SkillsNetwork/labs/datasets/Programming_Languages.html
Successfully downloaded webpage (Status Code: 200).

--- First 500 characters of HTML content ---
<!doctype html>
<html lang="en">
<head>
<title>
Salary survey results of programming languages
</title>
<style>
table, th, td {
  border: 1px solid black;
}
</style>
</head>

<body>
<hr />
<h2>Popular Programming Languages</h2>
<hr />
<p>Finding out which is the best language is a tough task. A programming language is created to solve a specific problem. A language which is good for task A may not be able to properly handle task B. Comparing programming language is never easy. What we can do, ho
-------------------------------------------



Create a soup object


In [5]:
#your code goes here
soup = BeautifulSoup(html_content, 'html.parser')
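
Before extracting values, it can help to check how the table is laid out. The sketch below parses a small, made-up HTML snippet (the `sample_html` string and its values are illustrative assumptions, not the lab page) and prints the cell texts of every row, which makes the column positions easy to read off.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for illustration.
sample_html = """
<table>
  <tr><td>No.</td><td>Language</td><td>Salary</td></tr>
  <tr><td>1</td><td>LangA</td><td>$100,000</td></tr>
  <tr><td>2</td><td>LangB</td><td>$90,000</td></tr>
</table>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# Collect the stripped text of every cell, row by row.
sample_rows = [
    [cell.get_text(strip=True) for cell in tr.find_all(["td", "th"])]
    for tr in sample_soup.find_all("tr")
]

for index, cells in enumerate(sample_rows):
    print(index, cells)
```

Running the same loop on `soup` instead of `sample_soup` should show the column order of the lab page, so the indices used for the language and salary cells can be checked rather than guessed.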
    

Scrape the `Language name` and `annual average salary`.


In [131]:
import re  # needed for cleaning the salary text

# Find the table containing the programming language salaries.
# The column titles may be marked up as <th> or as plain <td> cells,
# so read them from the first row of each table.
salary_table = None
language_idx = salary_idx = None
for table in soup.find_all('table'):
    first_row = table.find('tr')
    if first_row is None:
        continue
    headers = [cell.get_text(strip=True) for cell in first_row.find_all(['th', 'td'])]
    # Locate the columns by title instead of hard-coding their positions
    language_cols = [i for i, h in enumerate(headers) if h.lower() == 'language']
    salary_cols = [i for i, h in enumerate(headers) if 'salary' in h.lower()]
    if language_cols and salary_cols:
        salary_table = table
        language_idx, salary_idx = language_cols[0], salary_cols[0]
        break

programming_language_salaries = []

if salary_table:
    print("\n--- Found Salary Table ---")

    # find_all('tr') also returns rows nested inside <tbody>; skip the header row
    rows = salary_table.find_all('tr')[1:]

    for row in rows:
        cols = row.find_all(['td', 'th'])  # Get all data cells (td) or header cells (th)

        # Skip rows that are too short to hold both columns
        if len(cols) <= max(language_idx, salary_idx):
            continue

        language_name = cols[language_idx].get_text(strip=True)
        average_salary_text = cols[salary_idx].get_text(strip=True)

        # Clean the salary text: remove '$' and ','
        cleaned_salary = re.sub(r'[$,]', '', average_salary_text)
        # Extract the first number, with an optional 'k' suffix (e.g. '167k')
        salary_match = re.search(r'(\d+(?:\.\d+)?)(k)?\b', cleaned_salary, re.IGNORECASE)

        if salary_match:
            salary_value = float(salary_match.group(1))
            if salary_match.group(2):
                salary_value *= 1000  # Convert '167k' to 167000
            average_annual_salary = int(salary_value)  # Store as integer
        else:
            average_annual_salary = "N/A (Could not extract salary)"

        programming_language_salaries.append({
            "language": language_name,
            "average_annual_salary": average_annual_salary,
        })

# Print the extracted data
if programming_language_salaries:
    print("Scraped Programming Language Salaries:")
    for item in programming_language_salaries:
        print(f"  Language: {item['language']}, Average Annual Salary: {item['average_annual_salary']}")
else:
    print("No programming language salary data found after parsing.")
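
The salary clean-up step can be tried on its own, without downloading anything. The following is a minimal sketch: the helper name `parse_salary` and the sample strings are assumptions made for illustration and are not taken from the lab page.

```python
import re


def parse_salary(text):
    """Return the salary found in `text` as an int, or None if there is no number."""
    # Remove the currency symbol and thousands separators.
    cleaned = re.sub(r"[$,]", "", text)
    # First number in the text, optionally followed by a 'k' suffix.
    match = re.search(r"(\d+(?:\.\d+)?)(k)?\b", cleaned, re.IGNORECASE)
    if match is None:
        return None
    value = float(match.group(1))
    if match.group(2):
        value *= 1000  # '167k' means 167000
    return int(value)


print(parse_salary("$114,383"))
print(parse_salary("167k"))
print(parse_salary("n/a"))
```

Returning `None` for unparseable text, rather than a message string, keeps the salary column numeric, which may be easier to work with once the data is in a CSV file.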


Save the scraped data into a file named *popular-languages.csv*.


In [140]:
# your code goes here
import csv # Import the csv module for writing to CSV
csv_file_name = "popular-languages.csv"
try:
    with open(csv_file_name, mode='w', newline='', encoding='utf-8') as file:
        fieldnames = ["language", "average_annual_salary"]
        writer = csv.DictWriter(file, fieldnames=fieldnames)
        # Write while the file is still open; writing after the with block
        # raises "ValueError: I/O operation on closed file."
        writer.writeheader() # Write the header row
        writer.writerows(programming_language_salaries) # Write all data rows
    print(f"\nScraped data successfully saved to '{csv_file_name}'")
except OSError as e:
    print(f"Error saving data to CSV file: {e}")
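
One way to check the CSV step without touching the disk is to write to an in-memory buffer and read the result back. This is a sketch under stated assumptions: `sample_records` holds made-up rows in the same shape as the scraped records, and `io.StringIO` stands in for the real file.

```python
import csv
import io

# Hypothetical rows in the same shape as the scraped records.
sample_records = [
    {"language": "LangA", "average_annual_salary": 100000},
    {"language": "LangB", "average_annual_salary": 90000},
]

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["language", "average_annual_salary"])
writer.writeheader()
writer.writerows(sample_records)

# Rewind and read the text back to confirm the header and the rows.
buffer.seek(0)
read_back = list(csv.DictReader(buffer))
print(read_back)
```

Note that `csv.DictReader` returns every value as a string, so the salaries need converting with `int()` if they are read back for further analysis.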

## Authors


Ramesh Sannareddy


### Other Contributors


Rav Ahuja


## Change Log


|  Date (YYYY-MM-DD) |  Version | Changed By  |  Change Description |
|---|---|---|---|
| 2020-10-17  | 0.1  | Ramesh Sannareddy  |  Created initial version of the lab |


 Copyright &copy; 2020 IBM Corporation. This notebook and its source code are released under the terms of the [MIT License](https://cognitiveclass.ai/mit-license/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDA0321ENSkillsNetwork928-2022-01-01).
