# Lab: Web Scraping and Data Extraction with Python

You are tasked with building a web scraper to extract structured data from the Wikipedia page for **Samsung** (`https://en.wikipedia.org/wiki/Samsung`). Follow the steps below to complete the task.


### 1. Import Relevant Libraries
Import all the necessary libraries for web scraping and data manipulation:

- `requests` for making HTTP requests.
- `BeautifulSoup` from `bs4` for parsing HTML.
- `pandas` for tabular data manipulation.

In [None]:
import requests
from bs4 import BeautifulSoup
import pandas as pd


### 2. Perform HTTP Request
- Send an HTTP GET request to the URL: `https://en.wikipedia.org/wiki/Samsung`.
- Save the response object for further processing.

In [None]:
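url = "https://en.wikipedia.org/wiki/Samsung"
# Wikipedia may reject requests without a descriptive User-Agent; the value below is a placeholder.
headers = {"User-Agent": "samsung-scraper-lab/0.1 (educational use)"}
response = requests.get(url, headers=headers, timeout=10)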


### 3. Check the Request Status
- Ensure the HTTP request is successful by checking the status code of the response.
- Print a message:
  - **Success:** If the status code is `200`.
  - **Error:** If any other status code is returned.

In [None]:
# A 200 status code means the page was fetched successfully.
if response.status_code == 200:
    print("Request successful!")
else:
    print("Error:", response.status_code)


Request successful!
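As an alternative to comparing the status code by hand, `requests` offers `Response.raise_for_status()`, which raises `requests.HTTPError` for 4xx and 5xx responses. The sketch below wraps it in a helper named `check_response` (a name chosen here for illustration, not part of the lab) and demonstrates it on hand-built `Response` objects so that it runs without a network connection. Note that it treats any non-error status as success, not only `200`.

```python
import requests


def check_response(response: requests.Response) -> bool:
    """Return True for a non-error response; print the HTTP error otherwise."""
    try:
        response.raise_for_status()  # raises requests.HTTPError for 4xx/5xx
    except requests.HTTPError as exc:
        print("Error:", exc)
        return False
    print("Success")
    return True


# Offline demonstration: build Response objects by hand (no request is sent).
ok_response = requests.Response()
ok_response.status_code = 200

missing_response = requests.Response()
missing_response.status_code = 404

ok_result = check_response(ok_response)
missing_result = check_response(missing_response)
```

In the lab itself, the same helper could be called as `check_response(response)` right after the GET request.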


### 4. Build the Extraction Model
- Parse the HTML content using `BeautifulSoup`.
- Use `"html.parser"` as the parser.
- Save the parsed object for further extraction tasks.

In [None]:
soup = BeautifulSoup(response.text, "html.parser")


### 5. Extract Headings
- Use `BeautifulSoup` to extract all headings (`<h1>`, `<h2>`, `<h3>`).
- Save the extracted text into a structured format, such as a Python dictionary or a list.

In [None]:
headings = {
    "h1": [h.text.strip() for h in soup.find_all("h1")],
    "h2": [h.text.strip() for h in soup.find_all("h2")],
    "h3": [h.text.strip() for h in soup.find_all("h3")]
}

headings
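Grouping headings by level loses the order in which they appear on the page. If the document outline matters, one option is to pass a list of tag names to `find_all`, which returns the matches in document order. The sketch below runs on a small made-up HTML snippet (`sample_html` is illustrative, not taken from the Samsung page).

```python
from bs4 import BeautifulSoup

# Made-up snippet standing in for a real page.
sample_html = """
<h1>Samsung</h1>
<h2>History</h2>
<h3>Early years</h3>
<h2>Operations</h2>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# find_all accepts a list of tag names and keeps document order.
ordered_headings = [
    (h.name, h.get_text(strip=True))
    for h in sample_soup.find_all(["h1", "h2", "h3"])
]

print(ordered_headings)
```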


### 6. Extract All Paragraphs
- Extract all the text content within `<p>` tags.

In [None]:
paragraphs = [p.text.strip() for p in soup.find_all("p")]

paragraphs
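Real pages often contain empty `<p>` tags, which show up as empty strings in the list. A minimal sketch of filtering them out is shown below, again on a made-up snippet. It uses `get_text(" ", strip=True)` so that text inside nested tags is joined with single spaces.

```python
from bs4 import BeautifulSoup

# Made-up snippet: the middle paragraph contains only whitespace.
sample_html = (
    "<p>First paragraph.</p>"
    "<p>   </p>"
    "<p>Second <b>bold</b> paragraph.</p>"
)

sample_soup = BeautifulSoup(sample_html, "html.parser")

# Keep only paragraphs that still contain text after stripping whitespace.
non_empty_paragraphs = [
    text
    for p in sample_soup.find_all("p")
    if (text := p.get_text(" ", strip=True))
]

print(non_empty_paragraphs)
```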


### 7. Extract All Links
- Extract all hyperlinks (links within `<a>` tags).
- Collect:
  - The **link text**.
  - The **URL** (from the `href` attribute).

In [None]:
links = []

for a in soup.find_all("a", href=True):
    links.append({
        "text": a.text.strip(),
        "url": a["href"]
    })

links
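Many `href` values on Wikipedia are relative (for example `/wiki/...`) or in-page anchors (`#...`), so they are not usable as standalone URLs. One way to make them absolute is `urllib.parse.urljoin` from the standard library. The sketch below assumes the page URL from Step 2 as the base and uses a made-up snippet of links.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

base_url = "https://en.wikipedia.org/wiki/Samsung"

# Made-up snippet with a relative link, an in-page anchor, and an absolute link.
sample_html = (
    '<a href="/wiki/Seoul">Seoul</a>'
    '<a href="#History">History</a>'
    '<a href="https://www.samsung.com/">Official site</a>'
)

sample_soup = BeautifulSoup(sample_html, "html.parser")

absolute_links = [
    {
        "text": a.get_text(strip=True),
        # urljoin resolves relative hrefs against the base and leaves absolute ones unchanged.
        "url": urljoin(base_url, a["href"]),
    }
    for a in sample_soup.find_all("a", href=True)
]

for link in absolute_links:
    print(link)
```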


### 8. Extract Table
- Locate the infobox, the summary table near the top of the article, which Wikipedia marks with the `infobox` class.
- Extract the table structure and its data.


In [None]:
infobox = soup.find("table", class_="infobox")

table_data = []

if infobox:
    rows = infobox.find_all("tr")
    for row in rows:
        header = row.find("th")
        value = row.find("td")
        if header and value:
            table_data.append([header.text.strip(), value.text.strip()])
else:
    print("Infobox table not found")
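Infobox cells often contain nested lists or links, and `.text` joins those pieces without any separator. The sketch below shows the same row loop on a made-up infobox (`Example Corp` and its values are invented for illustration), using `get_text(" ", strip=True)` to keep the pieces apart. Rows that lack either a `<th>` or a `<td>`, such as the title row, are skipped.

```python
from bs4 import BeautifulSoup

# Made-up infobox; the company and its values are invented.
sample_html = """
<table class="infobox">
  <tr><th colspan="2">Example Corp</th></tr>
  <tr><th>Founded</th><td>1 January 2000</td></tr>
  <tr><th>Products</th><td><ul><li>Phones</li><li>Chips</li></ul></td></tr>
</table>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")
sample_infobox = sample_soup.find("table", class_="infobox")

sample_rows = []
for row in sample_infobox.find_all("tr"):
    header = row.find("th")
    value = row.find("td")
    # Keep only rows that have both a label cell and a value cell.
    if header is not None and value is not None:
        sample_rows.append(
            [header.get_text(" ", strip=True), value.get_text(" ", strip=True)]
        )

print(sample_rows)
```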


### 9. Convert Table into a DataFrame
- Use `pandas` to convert the table into a DataFrame.
- Ensure the table headers and rows are correctly assigned.

In [None]:
df = pd.DataFrame(table_data, columns=["Attribute", "Value"])

df
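Once the rows are in a DataFrame, individual attributes can be looked up by name. A minimal sketch, assuming the two-column `Attribute`/`Value` layout used above and made-up rows:

```python
import pandas as pd

# Made-up rows in the same [attribute, value] shape as table_data.
sample_rows = [
    ["Founded", "1 January 2000"],
    ["Products", "Phones Chips"],
]

sample_df = pd.DataFrame(sample_rows, columns=["Attribute", "Value"])

# Index by attribute name to look up a single value.
lookup = sample_df.set_index("Attribute")["Value"]
founded = lookup["Founded"]

print(sample_df.shape)
print(founded)
```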


### 10. Export the Table
- Export the DataFrame to an Excel file named `samsung_summary_table.xlsx` (pandas needs an Excel writer engine such as `openpyxl` installed for this).
- Save the file in the working directory.

In [None]:
df.to_excel("samsung_summary_table.xlsx", index=False)
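If no Excel writer engine is available, CSV is a dependency-free alternative. This swaps the output format and is not what the step asks for, so treat it as a fallback sketch. The example writes made-up rows to an in-memory buffer and reads them back to confirm that the data survives the round trip. In the lab, a file name could be passed to `to_csv` instead of the buffer.

```python
import io

import pandas as pd

# Made-up rows standing in for the scraped table.
sample_df = pd.DataFrame(
    [["Founded", "1 January 2000"], ["Products", "Phones Chips"]],
    columns=["Attribute", "Value"],
)

# Write CSV text into an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
sample_df.to_csv(buffer, index=False)

# Read the CSV text back to check that nothing was lost.
round_trip = pd.read_csv(io.StringIO(buffer.getvalue()))

print(round_trip)
```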


#### Conclusion

This lab covered web scraping with the Python libraries `requests`, `BeautifulSoup`, and `pandas`. The extracted headings, paragraphs, links, and infobox table were stored in structured formats, and the table was exported to an Excel file for further use.

#### Thank You!

Thank you for participating in this lab session! Keep exploring web scraping techniques, and keep ethical considerations in mind when extracting data from websites.