# Daily Challenge: End-To-End Web Scraping In Python

## Web Scraping:

- Use this website: https://github.com/topics
- Write a Python script using the requests library to fetch the HTML content of the chosen website.
- Print the status code of the response using .status_code to confirm the request was successful; it should be 200.
- Print the first 100 characters of the HTML content to verify the response.
- Save the HTML content to a file named webpage.html. Ensure you handle the text encoding correctly.
- Use BeautifulSoup to parse the saved HTML content.
- Identify two distinct pieces of information on the webpage to extract (e.g., titles of the topics and their descriptions).
- Write code to extract these pieces of information. Ensure you identify the correct HTML tags and classes used for these elements on the webpage.
- Print the length and content of each extracted list to verify the extraction process.
- Create a Python dictionary to structure the extracted data, with keys representing the type of information (e.g., ‘title’ and ‘description’).
- Convert this dictionary into a pandas DataFrame.
- Print the DataFrame to confirm its structure and contents.
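
The save, parse, and extract steps above can be tried offline before running them against the live site. The sketch below is only an illustration under stated assumptions: `sample_html` is a hand-written stand-in (not GitHub's real markup), `save_and_parse` is a hypothetical helper, and the class names are taken from the solution cell further down. It uses CSS selectors via `select`, which match regardless of the order of the classes in the attribute.

```python
import tempfile
from pathlib import Path

from bs4 import BeautifulSoup

# Hand-written stand-in for the fetched page (assumption: not GitHub's real markup).
# The second title lists its classes in a different order and contains a non-ASCII character.
sample_html = """
<html><body>
  <div>
    <p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
    <p class="f5 color-fg-muted mb-0 mt-1">3D refers to three-dimensional graphics.</p>
  </div>
  <div>
    <p class="mt-1 f3 Link--primary lh-condensed mb-0">Café</p>
    <p class="f5 color-fg-muted mb-0 mt-1">A topic with a non-ASCII title.</p>
  </div>
</body></html>
"""


def save_and_parse(html: str, path: Path) -> BeautifulSoup:
    """Hypothetical helper: write the HTML as UTF-8, read it back, and parse it."""
    path.write_text(html, encoding="utf-8")
    saved = path.read_text(encoding="utf-8")
    return BeautifulSoup(saved, "html.parser")


# A temporary directory keeps the example from leaving files behind.
with tempfile.TemporaryDirectory() as tmp:
    soup = save_and_parse(sample_html, Path(tmp) / "webpage.html")

# CSS selectors match on individual classes, independent of their order.
titles = [p.get_text(strip=True) for p in soup.select("p.f3.lh-condensed")]
descriptions = [p.get_text(strip=True) for p in soup.select("p.f5.color-fg-muted")]

print(len(titles), titles)
print(len(descriptions), descriptions)
```

Writing and reading with the same explicit encoding is what keeps characters such as "é" intact across the round trip. Whether these selectors still match the live page depends on GitHub's current markup, which can change.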

In [1]:
import pandas as pd
import requests
from bs4 import BeautifulSoup

In [3]:
url = 'https://github.com/topics'

try:
    # Send a GET request to fetch the HTML content (the timeout avoids hanging indefinitely)
    response = requests.get(url, timeout=10)
   
    print(f"Status Code: {response.status_code}")

    # Check if the request was successful (status code 200)
    if response.status_code == 200:        
        print("First 100 characters of HTML content:")
        print(response.text[:100])

        # Save the HTML content to a file named 'webpage.html'
        with open('webpage.html', 'w', encoding='utf-8') as file:
            file.write(response.text)
        print("HTML content saved to 'webpage.html'")

        # Read the saved file back and parse it with BeautifulSoup
        with open('webpage.html', 'r', encoding='utf-8') as file:
            soup = BeautifulSoup(file.read(), 'html.parser')
        
        # Extract topic titles and descriptions by their exact class strings
        title_class = "f3 lh-condensed mb-0 mt-1 Link--primary"
        topic_title_tags = soup.find_all('p', {'class': title_class})
        titles = [tag.text.strip() for tag in topic_title_tags]
        description_class = "f5 color-fg-muted mb-0 mt-1"
        topic_description_tags = soup.find_all('p', {'class': description_class})
        description = [tag.text.strip() for tag in topic_description_tags]
        
        print("\nTitles:")
        print("Length:", len(titles))
        print(titles)

        print("\nDescriptions:")
        print("Length:", len(description))
        print(description)
       
        data = {'Title': titles, 'Description': description}  
        df = pd.DataFrame(data)
       
        print("\nDataFrame:")
        display(df)

    else:
        print(f"Failed to retrieve HTML content. Status code: {response.status_code}")

except Exception as e:
    print(f"An error occurred: {e}")

Status Code: 200
First 100 characters of HTML content:


<!DOCTYPE html>
<html
  lang="en"
  
  data-color-mode="auto" data-light-theme="light" data-dark-t
HTML content saved to 'webpage.html'

Titles:
Length: 30
['3D', 'Ajax', 'Algorithm', 'Amp', 'Android', 'Angular', 'Ansible', 'API', 'Arduino', 'ASP.NET', 'Atom', 'Awesome Lists', 'Amazon Web Services', 'Azure', 'Babel', 'Bash', 'Bitcoin', 'Bootstrap', 'Bot', 'C', 'Chrome', 'Chrome extension', 'Command line interface', 'Clojure', 'Code quality', 'Code review', 'Compiler', 'Continuous integration', 'COVID-19', 'C++']

Descriptions:
Length: 30
['3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.', 'Ajax is a technique for creating interactive web applications.', 'Algorithms are self-contained sequences that carry out a variety of tasks.', 'Amp is a non-blocking concurrency library for PHP.', 'Android is an operating system built by Google designed for mobile devices.', 'Angular is an open

| | Title | Description |
|---|---|---|
| 0 | 3D | 3D refers to the use of three-dimensional grap... |
| 1 | Ajax | Ajax is a technique for creating interactive w... |
| 2 | Algorithm | Algorithms are self-contained sequences that c... |
| 3 | Amp | Amp is a non-blocking concurrency library for ... |
| 4 | Android | Android is an operating system built by Google... |
| 5 | Angular | Angular is an open source web application plat... |
| 6 | Ansible | Ansible is a simple and powerful automation en... |
| 7 | API | An API (Application Programming Interface) is ... |
| 8 | Arduino | Arduino is an open source platform for buildin... |
| 9 | ASP.NET | ASP.NET is a web framework for building modern... |