Web Scraping:

Use this website: GitHub Topics (https://github.com/topics)
Write a Python script using the requests library to fetch the HTML content of the chosen website.
Print the status code of the response using .status_code to confirm the request succeeded; it should be 200.
Print the first 100 characters of the HTML content to verify the response.
Save the HTML content to a file named webpage.html. Ensure you handle the text encoding correctly.
Use BeautifulSoup to parse the saved HTML content.
Identify two distinct pieces of information on the webpage to extract (e.g., titles of the topics and their descriptions).
Write code to extract these pieces of information. Ensure you identify the correct HTML tags and classes used for these elements on the webpage.
Print the length and content of each extracted list to verify the extraction process.
Create a Python dictionary to structure the extracted data, with keys representing the type of information (e.g., ‘title’ and ‘description’).
Convert this dictionary into a pandas DataFrame.
Print the DataFrame to confirm its structure and contents.
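
The fetch, status-check, preview, and save tasks above can be sketched as three small helpers. This is a minimal sketch rather than a reference solution: the function names `fetch_html`, `preview`, and `save_html` and the 10-second timeout are assumptions made for illustration. Only `requests.get`, `raise_for_status`, and `Path.write_text` are real library calls.

```python
from pathlib import Path

import requests


def fetch_html(url, timeout=10):
    """Fetch `url`, print the status code, and return the decoded HTML."""
    response = requests.get(url, timeout=timeout)
    print("Status Code:", response.status_code)  # 200 means success
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx
    return response.text


def preview(html, n=100):
    """Return the first `n` characters of the HTML for a quick sanity check."""
    return html[:n]


def save_html(html, path="webpage.html"):
    """Write the HTML as UTF-8 so non-ASCII characters survive the round trip."""
    Path(path).write_text(html, encoding="utf-8")
    return path


# Usage (needs network access, so it is left commented out here):
# html = fetch_html("https://github.com/topics")
# print(preview(html))
# save_html(html)
```

Passing an explicit `encoding="utf-8"` when writing, and again when reading the file back, keeps the saved copy independent of the platform's default encoding.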

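For the extraction tasks, one way to keep each title matched with its own description is to locate the title tag first and then read the description from the `<p>` that follows it, instead of collecting the two lists with independent searches. The sketch below runs on a small inline sample, not the live page. The class names are the ones used in this notebook, but the nesting (description as the next sibling `<p>` of the title), the placeholder description text, and the helper name `extract_topics` are assumptions for illustration and should be checked against the real page source.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical markup: the class names are copied from the notebook cell,
# but the nesting and the description text are made up for illustration.
SAMPLE_HTML = """
<p class="color-fg-muted">Browse popular topics on GitHub.</p>
<a href="/topics/3d">
  <p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
  <p class="f5 color-fg-muted mb-0 mt-1">Placeholder description for 3D.</p>
</a>
<a href="/topics/ajax">
  <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ajax</p>
  <p class="f5 color-fg-muted mb-0 mt-1">Placeholder description for Ajax.</p>
</a>
"""

TITLE_CLASS = "f3 lh-condensed mb-0 mt-1 Link--primary"


def extract_topics(html):
    """Return a dict of titles and descriptions, paired per topic."""
    soup = BeautifulSoup(html, "html.parser")
    titles, descriptions = [], []
    for title_tag in soup.find_all("p", class_=TITLE_CLASS):
        # Assumption: the description is the next <p> sibling of the title.
        desc_tag = title_tag.find_next_sibling("p")
        titles.append(title_tag.get_text(strip=True))
        descriptions.append(desc_tag.get_text(strip=True) if desc_tag else "")
    return {"Title": titles, "Description": descriptions}


# Preview the first 100 characters, then extract and tabulate.
print(SAMPLE_HTML.strip()[:100])
data = extract_topics(SAMPLE_HTML)
print(len(data["Title"]), len(data["Description"]))
df = pd.DataFrame(data)
print(df)
```

Because each description is looked up relative to its title, the two lists always have the same length, and a paragraph such as "Browse popular topics on GitHub." that merely shares the `color-fg-muted` class is never picked up.
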
In [5]:
# Step 1: Import necessary libraries
import requests
from bs4 import BeautifulSoup
import pandas as pd

# Step 2: Fetch the webpage using requests
url = 'https://github.com/topics'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
}
response = requests.get(url, headers=headers, timeout=10)  # timeout avoids hanging on a stalled connection

# Step 3: Print the status code to verify the request was successful
print("Status Code:", response.status_code)  # Should be 200 if successful

# Step 4: Save the HTML content to a file named 'webpage.html'
with open('webpage.html', 'w', encoding='utf-8') as file:
    file.write(response.text)

# Step 5: Parse the saved HTML content using BeautifulSoup
# Re-open the saved HTML file to ensure the parsing is done from the saved file
with open('webpage.html', 'r', encoding='utf-8') as file:
    soup = BeautifulSoup(file, 'html.parser')

# Step 6: Extract the topic titles from <p> tags with class 'f3 lh-condensed mb-0 mt-1 Link--primary'
titles = [title.text.strip() for title in soup.find_all('p', class_='f3 lh-condensed mb-0 mt-1 Link--primary')]

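# Extract topic descriptions from <p> tags with class 'color-fg-muted'.
# Caution: a single class name matches every <p> that carries it, including
# paragraphs outside the topic list, so this list can be longer than `titles`
# and its items need not line up with them.
descriptions = [desc.text.strip() for desc in soup.find_all('p', class_='color-fg-muted')]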

# Step 7: Print the length and content of each extracted list
print(f"Number of Titles: {len(titles)}")
print(f"Number of Descriptions: {len(descriptions)}")

# Check the first few items of both lists
print("Sample Titles:", titles[:5])
print("Sample Descriptions:", descriptions[:5])

# Step 8: Create a Python dictionary to store the extracted data
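# Truncate both lists to the shorter length so the DataFrame can be built.
# This only equalizes the lengths; it does not realign titles with descriptions.
min_length = min(len(titles), len(descriptions))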

data = {
    'Title': titles[:min_length],
    'Description': descriptions[:min_length]
}

# Step 9: Convert the dictionary into a pandas DataFrame
df = pd.DataFrame(data)

# Step 10: Print the DataFrame to confirm its structure and contents
print(df)


Status Code: 200
Number of Titles: 30
Number of Descriptions: 35
Sample Titles: ['3D', 'Ajax', 'Algorithm', 'Amp', 'Android']
Sample Descriptions: ['To see all available qualifiers, see our documentation.', 'Browse popular topics on GitHub.', 'A database is a structured set of data held in a computer, usually a server.', 'Maven is a build automation tool used primarily for Java projects.', 'JSON (JavaScript Object Notation) allows for easy interchange of data, often between a program and a database.']
                     Title                                        Description
0                       3D  To see all available qualifiers, see our docum...
1                     Ajax                   Browse popular topics on GitHub.
2                Algorithm  A database is a structured set of data held in...
3                      Amp  Maven is a build automation tool used primaril...
4                  Android  JSON (JavaScript Object Notation) allows for e...
5                  Angula