<a href="https://colab.research.google.com/github/Putusutha/Liputan6_News_Scraper/blob/main/Liputan6_News_Scraper_.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

📌 Cell 1: Fetching the Website Page

📜 Explanation:
In this first step, we use the requests library to fetch the main page of the Liputan6 website.
After that, we print the HTTP status code to verify that the page was successfully retrieved.

In [1]:
import requests

# Website URL to scrape
url = "https://www.liputan6.com/"

# Fetch the page using requests
laman = requests.get(url)

# Print the HTTP status code (200 means success)
print(f'Status GET laman: {laman.status_code}')

Status GET laman: 200
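The comment above says that 200 means success. As a minimal sketch of how other codes could be read, the helper below maps any status code to its standard phrase and broad class using only the standard library. The name describe_status is hypothetical and is not part of requests or of the original notebook.

```python
from http import HTTPStatus


def describe_status(code: int) -> str:
    """Return a short human-readable label for an HTTP status code."""
    try:
        phrase = HTTPStatus(code).phrase  # e.g. 'OK' for 200
    except ValueError:
        phrase = "Unknown"  # code is not a registered status

    # Status codes are grouped into classes by their first digit
    if 200 <= code < 300:
        kind = "success"
    elif 300 <= code < 400:
        kind = "redirect"
    elif 400 <= code < 500:
        kind = "client error"
    elif 500 <= code < 600:
        kind = "server error"
    else:
        kind = "other"
    return f"{code} {phrase} ({kind})"


print(describe_status(200))
print(describe_status(404))
```

In the notebook this could be called as describe_status(laman.status_code) before moving on to parsing.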


📌 Cell 2: Understanding print(laman.content)

📜 Explanation:
The call print(laman.content) prints the raw body of the HTTP response, which for this URL is the page's HTML source.

laman is the Response object returned by requests.get(url).

.content returns the response body as bytes, so the printed value appears as a bytes literal (b'...'). It contains the HTML markup along with any CSS or JavaScript embedded in the page.

This helps us inspect the page source before extracting specific elements.

In [2]:
print(laman.content)
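Because .content is a bytes object, its printed form differs from ordinary text. The sketch below illustrates the difference with a made-up snippet standing in for laman.content, so it runs without a network request. In requests, the decoded string counterpart of .content is available as laman.text.

```python
# Stand-in for laman.content: a small HTML snippet encoded as UTF-8 bytes
raw = "<title>Liputan6 • Berita</title>".encode("utf-8")

# bytes print as a literal with a b'' prefix and escaped non-ASCII bytes
print(raw)

# Decoding turns the bytes into a str that prints as readable text
text = raw.decode("utf-8")
print(text)

# The bullet character occupies three bytes but counts as one character
print(len(raw), len(text))
```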



📌 Cell 3: Parsing HTML with BeautifulSoup

📜 Explanation:
After fetching the web page, we need to parse its content so we can extract relevant data more easily.
We use BeautifulSoup to read the retrieved HTML and format it for better readability.

In [3]:
from bs4 import BeautifulSoup

# Parse the HTML using BeautifulSoup
soup = BeautifulSoup(laman.content, 'html.parser')

# Print the formatted HTML
print(soup.prettify())

<!DOCTYPE html>
<html itemscope="itemscope" itemtype="https://schema.org/WebPage" lang="id">
 <head>
  <title>
   Berita Terkini, Kabar Terbaru Hari Ini Indonesia dan Dunia - Liputan6.com
  </title>
  <meta charset="utf-8"/>
  <meta content="width=1024" name="viewport"/>
  <meta content="Home" name="adx:sections"/>
  <meta content="Berita Terpopuler, Foto dan Video - Liputan6.com" name="title"/>
  <meta content="kabar berita terpopuler, foto, dan video hari ini terbaru Indonesia dan dunia seputar politik, peristiwa, regional, bisnis, bola, tekno dan gosip artis" name="description"/>
  <meta content="berita hari ini, berita harian, berita terkini, berita terbaru, berita indonesia, berita terpopuler, berita, berita foto, berita dunia, berita video" name="keywords"/>
  <meta content="index,follow" name="googlebot-news"/>
  <meta content="index,follow" name="googlebot"/>
  <meta content="index, follow" name="robots"/>
  <meta content="max-image-preview:large" name="robots"/>
  <meta conten
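Once the page is parsed, individual pieces of the document can be read from the soup object instead of scanning the printed HTML by eye. The following is a small sketch on a shortened, hand-written sample modeled on the output above. The names sample_html and sample_soup are illustrative only, and the sample is not the real page.

```python
from bs4 import BeautifulSoup

# Hand-written sample modeled on the <head> section shown above
sample_html = """
<html lang="id">
 <head>
  <title>Berita Terkini - Liputan6.com</title>
  <meta content="index, follow" name="robots"/>
 </head>
</html>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# Text of the <title> element
page_title = sample_soup.title.get_text(strip=True)

# Attribute lookup on a tag found by one of its attributes
robots = sample_soup.find("meta", attrs={"name": "robots"})["content"]

# Attribute of the root <html> element
lang = sample_soup.html["lang"]

print(page_title)
print(robots)
print(lang)
```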

📌 Cell 4: Extracting News Elements

📜 Explanation:
In this step, we search for news elements on the webpage.
The Liputan6 website uses the <h4> tag with a specific class to display news headlines.
We will extract all <h4> elements with the class 'article-snippet--numbered__title'.

In [4]:
# Extract all news elements with the <h4> tag and specific class
kontainer_berita = soup.find_all('h4', class_='article-snippet--numbered__title')

# Print the list of extracted elements
print(kontainer_berita)

[<h4 class="article-snippet--numbered__title"><a class="article-snippet__link" data-template-var="title" href="https://www.liputan6.com/news/read/5983461/buntut-wali-kota-depok-izinkan-asn-mudik-pakai-mobil-dinas-gubernur-jabar-dedi-mulyadi-bakal-panggil-kepala-daerah" title="Buntut Wali Kota Depok Izinkan ASN Mudik Pakai Mobil Dinas, Gubernur Jabar Dedi Mulyadi Bakal Panggil Kepala Daerah">Buntut Wali Kota Depok Izinkan ASN Mudik Pakai Mobil Dinas, Gubernur Jabar Dedi Mulyadi Bakal Panggil Kepala Daerah</a></h4>, <h4 class="article-snippet--numbered__title"><a class="article-snippet__link" data-template-var="title" href="https://www.liputan6.com/bisnis/read/5983667/warga-jakarta-selatan-dan-depok-keluhkan-mati-listrik-imbas-trafo-pln-meledak-di-gandul-cinere" title="Warga Jakarta Selatan dan Depok Keluhkan Mati Listrik Imbas Trafo PLN Meledak di Gandul Cinere?">Warga Jakarta Selatan dan Depok Keluhkan Mati Listrik Imbas Trafo PLN Meledak di Gandul Cinere?</a></h4>, <h4 class="article-
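To see how the class filter behaves, here is a sketch on a hand-written sample with made-up headlines and example.com links. It shows that class_ matches an element when the given name is one of its classes, and that a CSS selector passed to select() is an equivalent way to write the same query.

```python
from bs4 import BeautifulSoup

# Made-up markup: two matching headlines and one unrelated <h4>
sample_html = """
<h4 class="article-snippet--numbered__title"><a class="article-snippet__link" href="https://example.com/a">First headline</a></h4>
<h4 class="other-title"><a href="https://example.com/b">Unrelated heading</a></h4>
<h4 class="article-snippet--numbered__title highlighted"><a class="article-snippet__link" href="https://example.com/c">Second headline</a></h4>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# Same query as in the cell above: tag name plus class filter
by_find_all = sample_soup.find_all("h4", class_="article-snippet--numbered__title")

# Equivalent query written as a CSS selector
by_select = sample_soup.select("h4.article-snippet--numbered__title")

titles = [h4.get_text(strip=True) for h4 in by_find_all]
titles_from_select = [h4.get_text(strip=True) for h4 in by_select]

print(titles)
```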

📌 Cell 5: Extracting News Titles and Links into a Table

📜 Explanation:
Now, we extract the news titles and their links into a structured format:

Extract news titles from <a> tags.

Extract news links from the href attribute.

Display the extracted data in tabular format using the tabulate library.

In [5]:
from tabulate import tabulate

# Lists to store news titles and links
judul_list = []
link_list = []

# Search inside the <h4> elements collected in the previous cell;
# they are already parsed, so no second BeautifulSoup pass is needed
for h4 in kontainer_berita:
    # Find all <a> elements with the class 'article-snippet__link'
    for link in h4.find_all('a', class_='article-snippet__link'):
        judul = link.get_text(strip=True)  # Extract text from the <a> tag
        judul_list.append(judul)

        href = link['href']  # Extract the href attribute (news link)
        link_list.append(href)

# Combine titles and links into a structured 2D list
data = [[judul, href] for judul, href in zip(judul_list, link_list)]

# Display the data in a tabular format
print(tabulate(data, headers=["Title", "Link"], tablefmt="grid"))

+---------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Title                                                                                                               | Link                                                                                                                                                           |
| Buntut Wali Kota Depok Izinkan ASN Mudik Pakai Mobil Dinas, Gubernur Jabar Dedi Mulyadi Bakal Panggil Kepala Daerah | https://www.liputan6.com/news/read/5983461/buntut-wali-kota-depok-izinkan-asn-mudik-pakai-mobil-dinas-gubernur-jabar-dedi-mulyadi-bakal-panggil-kepala-daerah  |
+---------------------------------------------------------------------------------------------------------------------+--------------------------------------
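One detail worth noting: indexing a tag with link['href'] raises a KeyError if an <a> element has no href attribute. The sketch below shows a more defensive variant using .get(), which returns None in that case. The sample markup is made up, and skipping such entries is an assumption about the desired behavior rather than something the original notebook does.

```python
from bs4 import BeautifulSoup

# Made-up markup: one normal headline and one <a> without an href
sample_html = """
<h4><a class="article-snippet__link" href="https://example.com/news/1">  Headline one </a></h4>
<h4><a class="article-snippet__link">Headline without a link</a></h4>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

rows = []
for a in sample_soup.find_all("a", class_="article-snippet__link"):
    href = a.get("href")  # None instead of KeyError when href is absent
    if href is None:
        continue  # assumption: entries without a link are skipped
    rows.append([a.get_text(strip=True), href])

print(rows)
```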