- Use this website: https://github.com/topics
- Write a Python script using the requests library to fetch the HTML content of the chosen website.
- Print the status code of the response using .status_code to confirm the request was successful; it should be 200.
- Print the first 100 characters of the HTML content to verify the response.
- Save the HTML content to a file named webpage.html. Ensure you handle the text encoding correctly.
- Use BeautifulSoup to parse the saved HTML content.
- Identify two distinct pieces of information on the webpage to extract (e.g., titles of the topics and their descriptions).
- Write code to extract these pieces of information. Ensure you identify the correct HTML tags and classes used for these elements on the webpage.
- Print the length and content of each extracted list to verify the extraction process.
- Create a Python dictionary to structure the extracted data, with keys representing the type of information (e.g., ‘title’ and ‘description’).
- Convert this dictionary into a pandas DataFrame.
- Print the DataFrame to confirm its structure and contents.


In [2]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

In [3]:
# Fetch the HTML content
url = "https://github.com/topics"
response = requests.get(url, timeout=10)  # time out instead of waiting indefinitely

In [4]:
# Print Status Code
print("StatusCode:", response.status_code)

StatusCode: 200
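
Printing the status code only confirms success when someone reads the output. The sketch below is one possible way to make a failed request stop the script instead. The helper name `fetch_html` and the 10-second default timeout are assumptions for illustration, not part of the original notebook.

```python
import requests


def fetch_html(url: str, timeout: float = 10.0) -> requests.Response:
    """Fetch a page and raise an exception instead of returning an error response."""
    response = requests.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx and 5xx responses.
    response.raise_for_status()
    return response


if __name__ == "__main__":
    try:
        page = fetch_html("https://github.com/topics")
        print("StatusCode:", page.status_code)
    except requests.RequestException as exc:
        # Covers connection errors, timeouts, and HTTP error statuses.
        print("Request failed:", exc)
```

With this approach the later cells never run against an error page, because an unsuccessful response raises before any parsing starts.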


In [5]:
# Print the first 100 characters of the HTML
print("\nFirst 100 characters of HTML:\n", response.text[:100])


First 100 characters of HTML:
 

<!DOCTYPE html>
<html
  lang="en"
  
  data-color-mode="auto" data-light-theme="light" data-dark-t


In [6]:
# Save the HTML content to a file
with open("webpage.html", "w", encoding="utf-8") as file:
    file.write(response.text)
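
Passing `encoding="utf-8"` matters because the page contains non-ASCII characters (one description uses a non-breaking hyphen in "test‑coverage"), and the default encoding of `open()` depends on the platform. The following is a minimal sketch of a round-trip check. The helper names `save_html` and `load_html` and the sample string are hypothetical, and a temporary directory is used so the real webpage.html is not touched.

```python
import tempfile
from pathlib import Path


def save_html(text, path):
    """Write text to path, always encoded as UTF-8."""
    Path(path).write_text(text, encoding="utf-8")


def load_html(path):
    """Read text back from path, always decoded as UTF-8."""
    return Path(path).read_text(encoding="utf-8")


# Sample containing U+2011 (non-breaking hyphen), written as an escape for clarity.
sample = "<p>style, quality, security, and test\u2011coverage checks</p>"

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "webpage.html"
    save_html(sample, target)
    restored = load_html(target)
    size_on_disk = target.stat().st_size

print("Round trip preserved the text:", restored == sample)
print("Characters:", len(sample), "Bytes on disk:", size_on_disk)
```

The byte count is larger than the character count because the non-breaking hyphen takes three bytes in UTF-8, which is why reading and writing must agree on the encoding.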

In [7]:
# Parse the saved HTML with BeautifulSoup
with open("webpage.html", "r", encoding="utf-8") as file:
    soup = BeautifulSoup(file, "html.parser")

In [8]:
# Identify and extract topics (titles and descriptions)
titles = [tag.text.strip() for tag in soup.find_all("p", class_="f3 lh-condensed mb-0 mt-1 Link--primary")]
descriptions = [tag.text.strip() for tag in soup.find_all("p", class_="f5 color-fg-muted mb-0 mt-1")]
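
When `class_` is given a string with several class names, BeautifulSoup compares it against the exact value of the `class` attribute, so the match depends on the classes appearing in that precise order. A CSS selector through `soup.select` is one alternative that ignores order. The sketch below uses a small hypothetical HTML sample, not GitHub's real markup, and the selectors `p.f3.Link--primary` and `p.f5.color-fg-muted` are assumptions derived from the class names used above.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the first title lists its classes in a different order.
SAMPLE_HTML = """
<div>
  <p class="mt-1 f3 lh-condensed mb-0 Link--primary">Python</p>
  <p class="f5 color-fg-muted mb-0 mt-1">Placeholder description A</p>
  <p class="f3 lh-condensed mb-0 mt-1 Link--primary">React</p>
  <p class="f5 color-fg-muted mb-0 mt-1">Placeholder description B</p>
</div>
"""

sample_soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Exact-string match: only finds the tag whose class attribute is identical.
exact = sample_soup.find_all("p", class_="f3 lh-condensed mb-0 mt-1 Link--primary")

# CSS selector: matches any <p> carrying both classes, in any order.
by_selector = sample_soup.select("p.f3.Link--primary")
sample_titles = [tag.get_text(strip=True) for tag in by_selector]
sample_descriptions = [
    tag.get_text(strip=True) for tag in sample_soup.select("p.f5.color-fg-muted")
]

print("Exact-string matches:", len(exact))
print("Selector matches:", sample_titles)
print("Descriptions:", sample_descriptions)
```

Selecting on fewer, more meaningful classes tends to survive small markup changes better, although any selector still has to be checked against the live page.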

In [12]:
# Print lengths and contents
print("\nNumber of titles found:", len(titles))
print("Titles:", titles)

print("\nNumber of descriptions found:", len(descriptions))
print("Descriptions:", descriptions)


Number of titles found: 16
Titles: ['Awesome Lists', 'Chrome', 'Code quality', 'Compiler', 'CSS', 'Database', 'Front end', 'JavaScript', 'Node.js', 'npm', 'Project management', 'Python', 'React', 'React Native', 'Scala', 'TypeScript']

Number of descriptions found: 16
Descriptions: ['An awesome list is a list of awesome things curated by the community.', 'Chrome is a web browser from the tech company Google.', 'Automate your code review with style, quality, security, and test‑coverage checks when you need them.', 'Compilers are software that translate higher-level programming languages to lower-level languages (e.g. machine code).', 'Cascading Style Sheets (CSS) is a language used most often to style and improve upon the appearance of views.', 'A database is a structured set of data held in a computer, usually a server.', 'Front end is the programming and layout that people see and interact with.', 'JavaScript (JS) is a lightweight interpreted programming language with first-class fun

In [14]:
# Create Dictionary
data = {
    "Title": titles,
    "Description": descriptions
}

In [16]:
# Convert to DataFrame
df = pd.DataFrame(data)
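
`pd.DataFrame` raises a `ValueError` when the lists in the dictionary have different lengths, which can happen if one selector matches more elements than the other. The sketch below checks the lengths first so the error message points at the scraping step. The function name `build_topics_frame` and the sample values are hypothetical.

```python
import pandas as pd


def build_topics_frame(titles, descriptions):
    """Build the DataFrame only if every title has a matching description."""
    if len(titles) != len(descriptions):
        raise ValueError(
            f"Found {len(titles)} titles but {len(descriptions)} descriptions; "
            "check the selectors used for extraction."
        )
    return pd.DataFrame({"Title": titles, "Description": descriptions})


# Small made-up input to show the happy path.
frame = build_topics_frame(["CSS", "Python"], ["Description A", "Description B"])
print(frame.shape)

# Mismatched input triggers the explicit error.
try:
    build_topics_frame(["CSS", "Python"], ["Description A"])
except ValueError as exc:
    print("Rejected:", exc)
```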

In [17]:
# Print the DataFrame
print("\nExtracted DataFrame:\n")
print(df)


Extracted DataFrame:

                 Title                                        Description
0        Awesome Lists  An awesome list is a list of awesome things cu...
1               Chrome  Chrome is a web browser from the tech company ...
2         Code quality  Automate your code review with style, quality,...
3             Compiler  Compilers are software that translate higher-l...
4                  CSS  Cascading Style Sheets (CSS) is a language use...
5             Database  A database is a structured set of data held in...
6            Front end  Front end is the programming and layout that p...
7           JavaScript  JavaScript (JS) is a lightweight interpreted p...
8              Node.js  Node.js is a tool for executing JavaScript in ...
9                  npm  npm is a package manager for JavaScript includ...
10  Project management  Project management is about building scope and...
11              Python  Python is a dynamically typed programming lang...
12             