We can scrape any HTML website and extract its details with the help of Python's **BeautifulSoup** library.

Read the documentation here: https://beautiful-soup-4.readthedocs.io/en/latest/

The modules below are used for scraping.

* *requests:* A widely used third-party Python library for making HTTP requests to a specified URL. Whether you work with REST APIs or web scraping, it is worth learning before going further with these technologies. A request to a URI returns a response object.

* *html5lib:* A pure-Python library for parsing HTML. It is designed to conform to the WHATWG HTML specification, which all major web browsers implement. It is an optional alternative parser; the code below uses Python's built-in ```html.parser```.

* *bs4:* The package that provides the ```BeautifulSoup``` object. Beautiful Soup is a Python library for pulling data out of HTML and XML documents. Web scraping is the process of extracting data from websites with automated tools rather than by hand.

##STEP 1. Import the libraries

In [1]:
from bs4 import BeautifulSoup
import requests
import re
import pandas as pd

##STEP 2. Access the HTML content from the webpage by assigning the URL and creating a soup object.

In [4]:
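# Download the Kompas front page
headers = {'Accept-Language': 'en-US,en;q=0.8'}
url = 'http://www.kompas.com'
response = requests.get(url, headers=headers, timeout=30)  # timeout avoids waiting forever
response.raise_for_status()  # stop early on HTTP errors (4xx/5xx)
# Parse the HTML with Python's built-in parser
soup = BeautifulSoup(response.text, "html.parser")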

##STEP 3. Extract the title

* Select the ```<h3>``` tags with the class ```article__title```. Note that each result is still an HTML element, not plain text.

In [6]:
newstitle = soup.select('h3.article__title')
print(newstitle[0])
print(" ")
print("how many titles are available?", len(newstitle))


<h3 class="article__title article__title--medium">
<a class="article__link" href="https://www.kgnow.com/watch/1701127/sindikat-perampok-bobol-toko-kopi-di-pulogadung-jaktim-korban-rugi-rp-31-juta?source=KOMPASCOM&amp;position=wp_terkini__player_1" target="_blank">Sindikat Perampok Bobol Toko Kopi di Pulogadung Jaktim, Korban Rugi Rp 31 Juta</a>
</h3>
 
how many titles are available? 3
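
The difference between a selected tag and its text can be checked offline. The following is a minimal sketch, assuming a small hand-written HTML fragment (```sample_html```, with made-up ```example.com``` links) that imitates the structure shown above; it is not the live Kompas page.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment imitating the structure scraped above.
sample_html = """
<h3 class="article__title article__title--medium">
<a class="article__link" href="https://example.com/a">First headline</a>
</h3>
<h3 class="article__title">
<a class="article__link" href="https://example.com/b">Second headline</a>
</h3>
<h3 class="other">Not an article</h3>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# The class selector matches any <h3> whose class list contains article__title.
titles = sample_soup.select("h3.article__title")

# Each match is a Tag object; printing it would show the raw HTML.
first_tag = titles[0]

# get_text(strip=True) keeps only the text and drops surrounding whitespace.
first_text = first_tag.get_text(strip=True)

print(len(titles))
print(first_text)
```

The third ```<h3>``` is skipped because it does not carry the ```article__title``` class.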


* To get the link, note that it sits in the ```<a>``` tag nested inside the ```<h3>``` tag. In a CSS selector, a space between two tag names (```h3 a```) selects every ```<a>``` that is a descendant of an ```<h3>```.

In [7]:
text = soup.select('h3 a')
print(text)



[<a class="article__link" href="https://www.kgnow.com/watch/1701127/sindikat-perampok-bobol-toko-kopi-di-pulogadung-jaktim-korban-rugi-rp-31-juta?source=KOMPASCOM&amp;position=wp_terkini__player_1" target="_blank">Sindikat Perampok Bobol Toko Kopi di Pulogadung Jaktim, Korban Rugi Rp 31 Juta</a>, <a class="article__link" href="https://www.kgnow.com/watch/1701015/kemenag-merasa-layanan-haji-2024-lebih-baik-dari-2023?source=KOMPASCOM&amp;position=wp_terkini__player_2">Kemenag Merasa Layanan Haji 2024 Lebih Baik dari 2023</a>, <a class="article__link" href="https://www.kgnow.com/watch/1701000/kronologi-residivis-rudapaksa-dan-kasus-nia-penjual-gorengan-di-padang-pariaman?source=KOMPASCOM&amp;position=wp_terkini__player_2">Kronologi Residivis Rudapaksa dan Kasus Nia Penjual Gorengan di Padang Pariaman</a>]
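
To illustrate what the space in ```h3 a``` means, here is a small sketch comparing the descendant selector with the stricter child selector ```h3 > a```. The HTML fragment and its links are invented for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: one <a> directly inside <h3>, one nested deeper,
# and one outside any <h3>.
sample_html = """
<h3><a href="/direct">Direct child</a></h3>
<h3><span><a href="/nested">Nested deeper</a></span></h3>
<p><a href="/outside">Outside any h3</a></p>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# "h3 a": any <a> at any depth inside an <h3>.
descendants = [a.get_text() for a in sample_soup.select("h3 a")]

# "h3 > a": only <a> tags that are direct children of an <h3>.
children = [a.get_text() for a in sample_soup.select("h3 > a")]

print(descendants)
print(children)
```

The descendant form is the more forgiving choice when a site wraps its links in extra ```<span>``` or ```<div>``` tags.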




*   Get an attribute of the tag, here ```href```, which holds the link.



In [None]:
links = []
for a in soup.select('h3 a'):
    links.append(a.attrs.get('href'))

print(links[0])

https://nasional.kompas.com/read/2023/09/18/17150041/lemhannas-prediksi-pemilu-2024-di-indonesia-akan-dijadikan-eksperimen
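
Some ```<a>``` tags have no ```href```, and some links are relative. The sketch below shows one possible way to handle both cases, assuming an invented HTML fragment and the base URL from STEP 2; ```urljoin``` comes from the standard library.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

base_url = "http://www.kompas.com"  # same base URL as in STEP 2

# Hypothetical fragment with an absolute link, a relative link, and no link.
sample_html = """
<h3><a href="https://example.com/full">Absolute link</a></h3>
<h3><a href="/read/relative">Relative link</a></h3>
<h3><a>No href at all</a></h3>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

sample_links = []
for a in sample_soup.select("h3 a"):
    href = a.get("href")  # returns None instead of raising KeyError
    if href is None:
        continue  # skip anchors without a link
    # urljoin leaves absolute URLs alone and resolves relative ones.
    sample_links.append(urljoin(base_url, href))

print(sample_links)
```

Using ```a["href"]``` instead would raise a ```KeyError``` on the third tag, which is why ```get``` is used here.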


##STEP 4. Loop over the results and store the data.
You can also clean the extracted titles or links with regular expressions at this point.

In [8]:
article_title = []
links = []

# Collect the text and the link of every <a> inside a matching <h3>
for t in soup.select('h3.article__title a'):
    article_title.append(t.get_text())
    links.append(t.attrs.get('href'))

print(article_title)
print(links)




['Sindikat Perampok Bobol Toko Kopi di Pulogadung Jaktim, Korban Rugi Rp 31 Juta', 'Kemenag Merasa Layanan Haji 2024 Lebih Baik dari 2023', 'Kronologi Residivis Rudapaksa dan Kasus Nia Penjual Gorengan di Padang Pariaman']
['https://www.kgnow.com/watch/1701127/sindikat-perampok-bobol-toko-kopi-di-pulogadung-jaktim-korban-rugi-rp-31-juta?source=KOMPASCOM&position=wp_terkini__player_1', 'https://www.kgnow.com/watch/1701015/kemenag-merasa-layanan-haji-2024-lebih-baik-dari-2023?source=KOMPASCOM&position=wp_terkini__player_2', 'https://www.kgnow.com/watch/1701000/kronologi-residivis-rudapaksa-dan-kasus-nia-penjual-gorengan-di-padang-pariaman?source=KOMPASCOM&position=wp_terkini__player_2']
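
As one illustration of applying regex here, the sketch below takes two of the links printed above, pulls out the number after ```/watch/```, and strips the query string. Treating that number as an article ID is an assumption made for the example, as are the names ```id_pattern```, ```article_ids```, and ```clean_links```.

```python
import re

# Two of the links from the output above.
sample_links = [
    "https://www.kgnow.com/watch/1701127/sindikat-perampok-bobol-toko-kopi-di-pulogadung-jaktim-korban-rugi-rp-31-juta?source=KOMPASCOM&position=wp_terkini__player_1",
    "https://www.kgnow.com/watch/1701015/kemenag-merasa-layanan-haji-2024-lebih-baik-dari-2023?source=KOMPASCOM&position=wp_terkini__player_2",
]

# Assumption: the digits after /watch/ identify the article.
id_pattern = re.compile(r"/watch/(\d+)/")

article_ids = []
clean_links = []
for link in sample_links:
    match = id_pattern.search(link)
    # Keep None when the pattern is absent rather than failing.
    article_ids.append(match.group(1) if match else None)
    # Drop everything from the first "?" to the end (the tracking parameters).
    clean_links.append(re.sub(r"\?.*$", "", link))

print(article_ids)
print(clean_links[1])
```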


##STEP 5. Save the data as a DataFrame and store it as CSV for further analysis.
You can also store it in a SQL database if you prefer.

In [10]:
df = pd.DataFrame(
    {'article_title': article_title,
     'link': links}
    )

print(df.head())

df.to_csv('kompasarticle.csv', index=False)

                                       article_title  \
0  Sindikat Perampok Bobol Toko Kopi di Pulogadun...   
1  Kemenag Merasa Layanan Haji 2024 Lebih Baik da...   
2  Kronologi Residivis Rudapaksa dan Kasus Nia Pe...   

                                                link  
0  https://www.kgnow.com/watch/1701127/sindikat-p...  
1  https://www.kgnow.com/watch/1701015/kemenag-me...  
2  https://www.kgnow.com/watch/1701000/kronologi-...  
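
For the SQL option, a minimal sketch using the standard-library ```sqlite3``` module is shown below. The table name ```articles```, the in-memory database, and the placeholder rows are assumptions for illustration; in the notebook the rows could be built with ```list(zip(article_title, links))```.

```python
import sqlite3

# Placeholder rows standing in for list(zip(article_title, links)).
rows = [
    ("First headline", "https://example.com/a"),
    ("Second headline", "https://example.com/b"),
]

# ":memory:" keeps the example self-contained; pass a file name to persist the data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE IF NOT EXISTS articles (article_title TEXT, link TEXT)")

# Parameterized inserts avoid quoting problems in titles and links.
conn.executemany("INSERT INTO articles (article_title, link) VALUES (?, ?)", rows)
conn.commit()

# Read the rows back to confirm they were stored.
stored = conn.execute(
    "SELECT article_title, link FROM articles ORDER BY rowid"
).fetchall()
conn.close()

print(stored)
```

pandas users can alternatively call ```df.to_sql``` with a database connection, which creates the table from the DataFrame columns.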


In [None]:
!pip install bs4
!pip install --upgrade beautifulsoup4

Collecting bs4
  Downloading bs4-0.0.1.tar.gz (1.1 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: bs4
  Building wheel for bs4 (setup.py) ... done
  Created wheel for bs4: filename=bs4-0.0.1-py3-none-any.whl size=1256 sha256=81da2695061b89bdd70cd4b898c54bf649a7fb6ad742f3d4eebce79e00b1b1cb
  Stored in directory: /root/.cache/pip/wheels/25/42/45/b773edc52acb16cd2db4cf1a0b47117e2f69bb4eb300ed0e70
Successfully built bs4
Installing collected packages: bs4
Successfully installed bs4-0.0.1
Collecting beautifulsoup4
  Downloading beautifulsoup4-4.12.2-py3-none-any.whl (142 kB)
     143.0/143.0 kB 4.6 MB/s eta 0:00:00
Installing collected packages: beautifulsoup4
  Attempting uninstall: beautifulsoup4
    Found existing installation: beautifulsoup4 4.11.2
    Uninstalling beautifulsoup4-4.11.2:
      Successfully uninstalled beautifulsoup4-4.11.2
Successf