<p style="text-align:center">
    <a href="https://skills.network" target="_blank">
    <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/assets/logos/SN_web_lightmode.png" width="350" alt="Skills Network Logo">
    </a>
</p>


# **Hands-on Lab: Web Scraping**


Estimated time needed: **30 to 45** minutes


## Objectives


In this lab you will:


-   Extract information from a given website.
-   Write the scraped data to a CSV file.


## Extract information from the given website

You will extract the data from the website below:<br>


In [7]:
#this url contains the data you need to scrape
url = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-DA0321EN-SkillsNetwork/labs/datasets/Programming_Languages.html"

The data you need to scrape is the **name of the programming language** and the **average annual salary**.<br> It is a good idea to open the URL in your web browser and study the contents of the web page before you start scraping.


Import the required libraries


In [19]:
# Your code here
# Install the required packages first (in a notebook, "!" runs a shell command)
!pip install requests beautifulsoup4 pandas

import requests
from bs4 import BeautifulSoup




Download the webpage at the url


In [13]:
response = requests.get(url)

# Check if the request was successful (status code 200)
if response.status_code == 200:
    webpage_content = response.text
    print("Webpage downloaded successfully!")
else:
    print("Failed to retrieve the webpage.")

Webpage downloaded successfully!


Create a soup object


In [20]:
#your code goes here
soup = BeautifulSoup(webpage_content, 'html.parser')
    
print("BeautifulSoup object created successfully!")

BeautifulSoup object created successfully!


Scrape the `Language name`, `Created By`, `annual average salary`, and `Learning Difficulty`.


In [22]:
#your code goes here
from bs4 import BeautifulSoup
import requests

# URL to the webpage
url = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-DA0321EN-SkillsNetwork/labs/datasets/Programming_Languages.html"

# Send a request to the webpage and get the HTML content
response = requests.get(url)

# Check if the request was successful (status code 200)
if response.status_code == 200:
    webpage_content = response.text

    # Create a BeautifulSoup object
    soup = BeautifulSoup(webpage_content, 'html.parser')

    # Find the table containing the data
    table = soup.find('table')

    # Check if the table was found
    if table:
        # Initialize lists to store the scraped data
        languages = []
        created_by = []
        avg_salary = []
        learning_difficulty = []

        # Loop through each row in the table (skipping the header)
        for row in table.find_all('tr')[1:]:
            columns = row.find_all('td')
            
            if len(columns) >= 5:  # Skip rows that lack the five expected cells
                languages.append(columns[1].text.strip())  # Language name
                created_by.append(columns[2].text.strip())  # Created By
                avg_salary.append(columns[3].text.strip())  # Average Annual Salary
                learning_difficulty.append(columns[4].text.strip())  # Learning Difficulty

        # Print out the first few results to confirm
        for i in range(min(5, len(languages))):  # Preview up to the first 5 rows
            print(f"Language: {languages[i]}, Created By: {created_by[i]}, Avg Salary: {avg_salary[i]}, Difficulty: {learning_difficulty[i]}")
    else:
        print("No table found on the webpage.")
else:
    print("Failed to retrieve the webpage.")


Language: Python, Created By: Guido van Rossum, Avg Salary: $114,383, Difficulty: Easy
Language: Java, Created By: James Gosling, Avg Salary: $101,013, Difficulty: Easy
Language: R, Created By: Robert Gentleman, Ross Ihaka, Avg Salary: $92,037, Difficulty: Hard
Language: Javascript, Created By: Netscape, Avg Salary: $110,981, Difficulty: Easy
Language: Swift, Created By: Apple, Avg Salary: $130,801, Difficulty: Easy
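
The row-and-column logic in the cell above can be tried offline on a small inline table. The following is a minimal sketch, not the lab's reference solution: `SAMPLE_HTML`, its header labels, and the helper `extract_rows` are hypothetical names introduced for illustration, and the sample only imitates the layout the lab page appears to use (a header row followed by data rows of `<td>` cells, with the language name in the second column).

```python
from bs4 import BeautifulSoup

# Hypothetical sample that imitates the assumed layout of the lab page.
SAMPLE_HTML = """
<table>
  <tr><td>No.</td><td>Language</td><td>Created By</td><td>Average Annual Salary</td><td>Learning Difficulty</td></tr>
  <tr><td>1</td><td>Python</td><td>Guido van Rossum</td><td>$114,383</td><td>Easy</td></tr>
  <tr><td>2</td><td>Java</td><td>James Gosling</td><td>$101,013</td><td>Easy</td></tr>
  <tr><td>3</td><td>incomplete row</td></tr>
</table>
"""

def extract_rows(html):
    """Return one dict per complete data row of the first table in html."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")
    if table is None:
        return []  # no table on the page

    records = []
    for row in table.find_all("tr")[1:]:  # skip the header row
        cells = [td.get_text(strip=True) for td in row.find_all("td")]
        if len(cells) < 5:
            continue  # ignore rows that lack the five expected cells
        records.append({
            "Language": cells[1],
            "Created By": cells[2],
            "Average Annual Salary": cells[3],
            "Learning Difficulty": cells[4],
        })
    return records

rows = extract_rows(SAMPLE_HTML)
print(len(rows))            # 2
print(rows[0]["Language"])  # Python
```

Collecting each row as a dictionary keeps the four values together, so a short or malformed row cannot leave the columns with different lengths.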


Create a _dataframe_ for the scraped data


In [23]:
#your code goes here
import pandas as pd

# After scraping the data (as shown in the previous code)

# Initialize the data dictionary for the DataFrame
data = {
    'Language': languages,
    'Created By': created_by,
    'Average Annual Salary': avg_salary,
    'Learning Difficulty': learning_difficulty
}

# Create a pandas DataFrame from the dictionary
df = pd.DataFrame(data)

# Display the first few rows of the DataFrame to confirm
print(df.head())

# You can also save it to a CSV file if needed
df.to_csv('popular-languages.csv', index=False)
print("Data saved to 'popular-languages.csv'.")


     Language                    Created By Average Annual Salary  \
0      Python              Guido van Rossum              $114,383   
1        Java                 James Gosling              $101,013   
2           R  Robert Gentleman, Ross Ihaka               $92,037   
3  Javascript                      Netscape              $110,981   
4       Swift                         Apple              $130,801   

  Learning Difficulty  
0                Easy  
1                Easy  
2                Hard  
3                Easy  
4                Easy  
Data saved to 'popular-languages.csv'.
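
The salary column in the output above holds strings such as `$114,383`, so it cannot be sorted or averaged numerically as it stands. The sketch below shows one possible way to convert such strings to integers. The helper name `parse_salary` is hypothetical, and the sketch assumes every value follows the `$` plus thousands-separator pattern seen in the preview.

```python
def parse_salary(text):
    """Convert a string such as '$114,383' to the integer 114383."""
    return int(text.replace("$", "").replace(",", "").strip())

# Values copied from the preview output above.
salaries = ["$114,383", "$101,013", "$92,037"]
numeric = [parse_salary(s) for s in salaries]

print(numeric)       # [114383, 101013, 92037]
print(max(numeric))  # 114383
```

The lab only asks for the scraped text, so this conversion is an optional extra step.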


Save the scraped data into a file named _popular-languages.csv_


In [24]:
# your code goes here
# Assuming you already have the DataFrame (as shown in the previous steps)

# Save the DataFrame to a CSV file
df.to_csv('popular-languages.csv', index=False)

print("Data saved to 'popular-languages.csv'.")


Data saved to 'popular-languages.csv'.
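
Some scraped values contain commas (for example `Robert Gentleman, Ross Ihaka` and `$92,037`), so those fields must be quoted for the CSV to read back correctly. The sketch below illustrates this with Python's standard `csv` module and an in-memory buffer instead of pandas and a real file. It is a stand-in for illustration, not the method the lab uses. pandas' `to_csv` applies comparable minimal quoting by default.

```python
import csv
import io

fieldnames = ["Language", "Created By", "Average Annual Salary", "Learning Difficulty"]

# Two rows copied from the preview output shown earlier in the lab.
rows = [
    {"Language": "Python", "Created By": "Guido van Rossum",
     "Average Annual Salary": "$114,383", "Learning Difficulty": "Easy"},
    {"Language": "R", "Created By": "Robert Gentleman, Ross Ihaka",
     "Average Annual Salary": "$92,037", "Learning Difficulty": "Hard"},
]

# Write to an in-memory buffer that stands in for popular-languages.csv.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=fieldnames)
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)

# Read the text back to confirm that the quoted fields survive the round trip.
read_back = list(csv.DictReader(io.StringIO(csv_text)))
print(read_back[1]["Created By"])  # Robert Gentleman, Ross Ihaka
```

A round-trip check of this kind is a quick way to confirm that a saved file has the expected columns before it is used elsewhere.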


## Authors


Ramesh Sannareddy


### Other Contributors


Rav Ahuja


Copyright © 2020 IBM Corporation. This notebook and its source code are released under the terms of the [MIT License](https://cognitiveclass.ai/mit-license).


<!--
## Change Log

| Date (YYYY-MM-DD) | Version | Changed By        | Change Description                 |
| ----------------- | ------- | ----------------- | ---------------------------------- |
| 2020-10-17        | 0.1     | Ramesh Sannareddy | Created initial version of the lab |
-->
