# Lab: Web Scraping and Data Extraction with Python

You are tasked with building a web scraper to extract structured data from the Wikipedia page for **"Samsung"** (`https://en.wikipedia.org/wiki/Samsung`). Follow the steps below to complete the task.


### 1. Import Relevant Libraries
Import all the necessary libraries for web scraping and data manipulation:

- `requests` for making HTTP requests.
- `BeautifulSoup` from `bs4` for parsing HTML.
- `pandas` for tabular data manipulation.

In [3]:
!pip install requests
!pip install beautifulsoup4
# pandas needs openpyxl to write .xlsx files in step 10
!pip install pandas openpyxl
import requests
from bs4 import BeautifulSoup
import pandas as pd



### 2. Perform HTTP Request
- Send an HTTP GET request to the URL: `https://en.wikipedia.org/wiki/Samsung`.
- Save the response object for further processing.

In [5]:
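url = "https://en.wikipedia.org/wiki/Samsung"
# A timeout keeps the cell from hanging indefinitely if the server does not answer
response = requests.get(url, timeout=10)
html_content = response.text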
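
A more defensive variant of this request is sketched below. It assumes a reusable `requests.Session` that sends an identifying `User-Agent` header and uses an explicit timeout. The header value, the helper names `build_session` and `fetch_html`, and the 10-second timeout are illustrative choices, not part of the lab.

```python
import requests

URL = "https://en.wikipedia.org/wiki/Samsung"

# Hypothetical identifying string; replace it with your own project details.
USER_AGENT = "samsung-scraping-lab/0.1 (educational exercise)"


def build_session() -> requests.Session:
    """Return a session that sends an identifying User-Agent on every request."""
    session = requests.Session()
    session.headers.update({"User-Agent": USER_AGENT})
    return session


def fetch_html(session: requests.Session, url: str, timeout: float = 10.0) -> str:
    """Fetch a page and return its HTML; raises on network errors or 4xx/5xx."""
    response = session.get(url, timeout=timeout)
    response.raise_for_status()
    return response.text


session = build_session()
# html_content = fetch_html(session, URL)  # uncomment to perform the real request
```

Reusing one session keeps the headers in a single place if the lab is later extended to fetch more than one page.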

### 3. Check the Request Status
- Ensure the HTTP request is successful by checking the status code of the response.
- Print a message:
  - **Success:** If the status code is `200`.
  - **Error:** If any other status code is returned.

In [9]:
if response.status_code == 200:
    print("Request successful")
else:
    print(f"Error: request failed with status code {response.status_code}")

Request successful
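
As an alternative to comparing against `200` directly, the check can be sketched with `Response.raise_for_status()`, which raises `requests.HTTPError` for 4xx and 5xx codes. Note that this is slightly looser than the task wording, since other 2xx and 3xx codes are not treated as errors. The helper name `describe_status` and the hand-built responses are assumptions made so the example runs offline.

```python
import requests


def describe_status(response: requests.Response) -> str:
    """Return a short success/error message for a response."""
    try:
        response.raise_for_status()  # raises HTTPError for 4xx and 5xx codes
    except requests.HTTPError as exc:
        return f"Error: {exc}"
    return f"Success: status code {response.status_code}"


# Stand-in responses built by hand so the example runs without network access.
ok = requests.Response()
ok.status_code = 200

missing = requests.Response()
missing.status_code = 404

print(describe_status(ok))
print(describe_status(missing))
```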


### 4. Build the Extraction Model
- Parse the HTML content using `BeautifulSoup`.
- Use the `"html.parser"` as the parser.
- Save the parsed object for further extraction tasks.

In [10]:
soup=BeautifulSoup(html_content,'html.parser')

### 5. Extract Headings
- Use `BeautifulSoup` to extract all headings (`<h1>`, `<h2>`, `<h3>`).
- Save the extracted text into a structured format, such as a Python dictionary or a list.

In [12]:
headings=[heading.text.strip() for heading in soup.find_all(['h1','h2','h3'])]
print(headings)

['Contents', 'Samsung', 'Etymology', 'History', '1938–1970', '1970–1990', '1990–2000', '2000–present', 'Influence in South Korea', 'Operations', 'Leadership', 'Affiliates', 'Divested', 'Defunct', 'Joint ventures', 'Partially owned companies', 'Acquisitions and attempted acquisitions', 'Major clients', 'Shell plc', 'United Arab Emirates government', 'Ontario government', 'Corporate image', 'Audio logo', 'Font', 'Sponsorships', 'In Vietnam', 'Controversies', 'Labor abuses', 'Union-busting activity', '2007 slush fund scandal', "Lee Kun-hee's prostitution scandal", '2017 bribery scandal', 'Supporting far-right groups', 'Price fixing', 'Misleading claims', 'References', 'External links']
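
The task also allows a dictionary as the structured format. One possible shape, sketched here on a small stand-in document rather than the live page, groups the heading text by tag name. The names `sample_html`, `sample_soup`, and `headings_by_level` are illustrative.

```python
from bs4 import BeautifulSoup

# Small stand-in document; in the lab, `soup` comes from the Samsung page.
sample_html = """
<h1>Samsung</h1>
<h2>History</h2>
<h3>1938–1970</h3>
<h3>1970–1990</h3>
<h2>Operations</h2>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

headings_by_level = {"h1": [], "h2": [], "h3": []}
for tag in sample_soup.find_all(["h1", "h2", "h3"]):
    # tag.name is the element name, e.g. "h2"
    headings_by_level[tag.name].append(tag.get_text(strip=True))

print(headings_by_level)
```

Grouping by level keeps the document order within each level but loses the nesting between levels; a flat list of `(level, text)` pairs would preserve it.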


### 6. Extract All Paragraphs
- Extract all the text content within `<p>` tags.

In [17]:
paragraphs=[p.text.strip() for p in soup.find_all(['p'])]
print(paragraphs)

['', "Samsung Group[1] (Korean: 삼성; [samsʌŋ]; stylised as SΛMSUNG) is a South Korean multinational manufacturing conglomerate headquartered in the Samsung Town office complex in Seoul. The group consists of numerous affiliated businesses,[2] most of which operate under the Samsung brand, and is the largest chaebol (business conglomerate) in South Korea. As of 2024,[update] Samsung has the world's fifth-highest brand value.[3]", 'Founded in 1938 by Lee Byung-chul as a trading company, Samsung diversified into various sectors, including food processing, textiles, insurance, securities, and retail, over the next three decades. In the late 1960s, Samsung entered the electronics industry, followed by the construction and shipbuilding sectors in the mid-1970s—areas that would fuel its future growth. After Lee died in 1987, Samsung was divided into five business groups: Samsung Group, Shinsegae Group, CJ Group, Hansol Group, and JoongAng Group.', "Key affiliates of Samsung include Samsung Ele
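
The raw output contains empty strings and numeric citation markers such as `[1]`. A minimal cleanup pass, assuming that only purely numeric markers should be removed, could look like the sketch below. The two non-empty sample strings are sentences taken from the output above; the names `raw_paragraphs`, `CITATION`, and `clean_paragraphs` are illustrative.

```python
import re

# Sample strings taken from the paragraph output above.
raw_paragraphs = [
    "",
    "The group consists of numerous affiliated businesses,[2] most of which "
    "operate under the Samsung brand, and is the largest chaebol (business "
    "conglomerate) in South Korea.",
    "As of 2024,[update] Samsung has the world's fifth-highest brand value.[3]",
]

# Matches numeric citation markers such as [1] or [23].
CITATION = re.compile(r"\[\d+\]")

clean_paragraphs = [
    CITATION.sub("", text).strip()
    for text in raw_paragraphs
    if text.strip()  # skip empty paragraphs
]

for paragraph in clean_paragraphs:
    print(paragraph)
```

Non-numeric markers such as `[update]` are left in place by this pattern; whether to strip them as well is a separate decision.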

### 7. Extract All Links
- Extract all hyperlinks (links within `<a>` tags).
- Collect:
  - The **link text**.
  - The **URL** (from the `href` attribute).

In [18]:
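# Collect (link text, URL) pairs, as the task asks for both
links = [(a.get_text(strip=True), a['href']) for a in soup.find_all('a', href=True)]
# Print only the URLs to keep the output compact
print([href for _, href in links])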

['#bodyContent', '/wiki/Main_Page', '/wiki/Wikipedia:Contents', '/wiki/Portal:Current_events', '/wiki/Special:Random', '/wiki/Wikipedia:About', '//en.wikipedia.org/wiki/Wikipedia:Contact_us', '/wiki/Help:Contents', '/wiki/Help:Introduction', '/wiki/Wikipedia:Community_portal', '/wiki/Special:RecentChanges', '/wiki/Wikipedia:File_upload_wizard', '/wiki/Special:SpecialPages', '/wiki/Main_Page', '/wiki/Special:Search', 'https://donate.wikimedia.org/?wmf_source=donate&wmf_medium=sidebar&wmf_campaign=en.wikipedia.org&uselang=en', '/w/index.php?title=Special:CreateAccount&returnto=Samsung', '/w/index.php?title=Special:UserLogin&returnto=Samsung', 'https://donate.wikimedia.org/?wmf_source=donate&wmf_medium=sidebar&wmf_campaign=en.wikipedia.org&uselang=en', '/w/index.php?title=Special:CreateAccount&returnto=Samsung', '/w/index.php?title=Special:UserLogin&returnto=Samsung', '/wiki/Help:Introduction', '/wiki/Special:MyContributions', '/wiki/Special:MyTalk', '#', '#Etymology', '#History', '#1938–
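
Many of the `href` values above are relative (`/wiki/...`), fragment-only (`#...`), or protocol-relative (`//en.wikipedia.org/...`). The sketch below pairs each link's text with an absolute URL by resolving the `href` against the page URL with `urllib.parse.urljoin`. The sample markup and its link texts are made up for illustration; only the `href` shapes come from the output above.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "https://en.wikipedia.org/wiki/Samsung"

# Stand-in markup using link shapes seen in the output above.
sample_html = """
<a href="/wiki/Main_Page">Main page</a>
<a href="#History">History</a>
<a href="//en.wikipedia.org/wiki/Wikipedia:Contact_us">Contact us</a>
<a>No href here</a>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

link_records = [
    {
        "text": a.get_text(strip=True),
        # urljoin resolves relative, fragment-only and protocol-relative hrefs
        "url": urljoin(BASE_URL, a["href"]),
    }
    for a in sample_soup.find_all("a", href=True)  # skips <a> tags without href
]

for record in link_records:
    print(record)
```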

### 8. Extract Table
- Locate the first table on the page (typically the infobox or summary table in Wikipedia articles).
- Extract the table structure and its data.


In [20]:
infobox = soup.find('table', class_='infobox')
data = {}
if infobox:
    for row in infobox.find_all('tr'):
        header = row.find('th')
        value = row.find('td')
        # Keep only rows that have both a header cell and a value cell
        if header and value:
            data[header.text.strip()] = value.text.strip()
    # Print the collected fields once, after the loop has finished
    print(data)
else:
    print("No infobox found")

{'Native name': '삼성그룹', 'Company type': 'Private', 'Industry': 'Conglomerate', 'Founded': '1\xa0March 1938 (87 years ago)\xa0(1938-03-01) in Taikyu, Korea, Empire of Japan', 'Founder': 'Lee Byung-chul', 'Headquarters': 'Samsung Town, Seoul, South\xa0Korea',
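
The extracted values contain non-breaking spaces (`\xa0`), and cells that hold lists run their items together when read with `.text`. One way to handle both, sketched on a stand-in infobox, is `get_text()` with a separator plus a replacement of `\xa0`. The helper name `cell_text`, the `", "` separator, and the sample markup are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Stand-in infobox; the class name matches the one used in the lab code.
sample_html = """
<table class="infobox">
  <tr><th colspan="2">Samsung Group</th></tr>
  <tr><th>Headquarters</th><td>Samsung Town, Seoul, South&nbsp;Korea</td></tr>
  <tr><th>Subsidiaries</th>
      <td><ul><li>Cheil Worldwide</li><li>Samsung Asset Management</li></ul></td></tr>
</table>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")


def cell_text(cell) -> str:
    """Join nested text nodes with ', ' and normalise non-breaking spaces."""
    return cell.get_text(separator=", ", strip=True).replace("\xa0", " ")


infobox_data = {}
for row in sample_soup.find("table", class_="infobox").find_all("tr"):
    header, value = row.find("th"), row.find("td")
    if header and value:  # skip title or image rows that lack a th/td pair
        infobox_data[cell_text(header)] = cell_text(value)

print(infobox_data)
```

The separator is inserted between every pair of text nodes, so a cell that mixes a link with plain text is also split at the link boundary. That trade-off may or may not be acceptable for a given field.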

### 9. Convert Table into a DataFrame
- Use `pandas` to convert the table into a DataFrame.
- Ensure the table headers and rows are correctly assigned.

In [22]:
table=pd.DataFrame(list(data.items()),columns=['key','value'])
table

|   | key | value |
|---|-----|-------|
| 0 | Native name | 삼성그룹 |
| 1 | Company type | Private |
| 2 | Industry | Conglomerate |
| 3 | Founded | 1 March 1938 (87 years ago) (1938-03-01) in Ta... |
| 4 | Founder | Lee Byung-chul |
| 5 | Headquarters | Samsung Town, Seoul, South Korea |
| 6 | Area served | Worldwide |
| 7 | Key people | Lee Jae-yong (chairman) |
| 8 | Subsidiaries | Cheil WorldwideSamsung Asset ManagementSamsung... |
| 9 | Hangul | 삼성 |
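
To confirm that headers and rows are assigned as intended, a few quick checks can be run on the DataFrame. The sketch below uses three entries copied from the infobox output and also shows that indexing by `key` makes single-field lookups straightforward. The names `sample_data`, `sample_table`, and `lookup` are illustrative.

```python
import pandas as pd

# Three entries copied from the infobox output above.
sample_data = {
    "Company type": "Private",
    "Industry": "Conglomerate",
    "Founder": "Lee Byung-chul",
}

sample_table = pd.DataFrame(list(sample_data.items()), columns=["key", "value"])

# Basic sanity checks on headers and row count
assert list(sample_table.columns) == ["key", "value"]
assert len(sample_table) == len(sample_data)

# Indexing by key turns the two-column table into a simple lookup
lookup = sample_table.set_index("key")["value"]
print(lookup["Founder"])
```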


### 10. Export the Table
- Export the DataFrame as a summary table to an Excel file named `samsung_summary_table.xlsx`.
- Save the file in the working directory.

In [23]:
# index=False leaves the row-number column out of the spreadsheet
table.to_excel("samsung_summary_table.xlsx", index=False)
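
A quick way to check the export is to read the file back and compare it with the DataFrame. The sketch below assumes an Excel engine such as `openpyxl` is installed. The helper name `export_summary` and the two-row sample table are illustrative, and `index=False` is used here so the row numbers are not written as an extra column.

```python
import pandas as pd


def export_summary(df: pd.DataFrame, path: str) -> pd.DataFrame:
    """Write df to an Excel file without the index column, then read it back."""
    df.to_excel(path, index=False)  # needs an Excel engine such as openpyxl
    return pd.read_excel(path)


sample_table = pd.DataFrame(
    {"key": ["Industry", "Founder"], "value": ["Conglomerate", "Lee Byung-chul"]}
)

# Example call (writes a file next to the notebook):
# round_trip = export_summary(sample_table, "samsung_summary_table.xlsx")
# print(round_trip.values.tolist() == sample_table.values.tolist())
```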

#### Conclusion

This lab covered web scraping with the Python libraries `requests`, `BeautifulSoup`, and `pandas`: fetching a page, parsing its HTML, extracting headings, paragraphs, links, and the infobox table, and exporting the result to an Excel file.

#### Thank You!

Thank you for participating in this lab session! Keep exploring web scraping techniques, and keep ethical considerations in mind when extracting data from websites.