# Web Scraping Tirto.id
## Eman Sukmana
In this notebook I walk through how to scrape data from the Tirto.id website. Note that this approach may stop working, because the structure of the Tirto.id website can change at any time at its owner's discretion.

This scraping walkthrough is published purely for educational purposes. The material comes from the Big Data training of the Digital Talent Scholarship 2018.

Let's start digging!

## Step 1: Import the Required Libraries

In [1]:
# Start importing library to scrape
import requests # for requesting the website
from bs4 import BeautifulSoup # for cleaning the website text (html), so it's more readable for further analysis
import json # for reading & writing a json format variable
import pandas as pd # for managing data frames
import time # for timer

## Step 2: Write the Scraping Functions
Tirto.id has a news index at https://tirto.id/indeks/ , and this tutorial reads its data. As of the most recent test, however, the index is delivered through lazy loading, so the scraping steps differ slightly from those for a website without lazy loading.

The Tirto.id index data can be found inside one of the page's "script" elements, so we can read it from there.
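
Before touching the live site, the idea can be tried offline. The sketch below is not the notebook's method: it uses a regular expression from the standard library instead of BeautifulSoup, only to stay dependency-free. It assumes, as the notebook does, that the payload is assigned to `window.__NUXT__` and parses as JSON. The helper name `extract_nuxt_payload` and the sample HTML are made up for illustration.

```python
import json
import re

# Matches an inline script whose body is "window.__NUXT__=<payload>;"
NUXT_PATTERN = re.compile(
    r"<script[^>]*>\s*window\.__NUXT__=(.*?);?\s*</script>", re.DOTALL
)


def extract_nuxt_payload(html):
    """Return the parsed window.__NUXT__ payload, or None if it is absent."""
    match = NUXT_PATTERN.search(html)
    if match is None:
        return None
    return json.loads(match.group(1))


# Hypothetical page: the structure mimics what the notebook reads,
# but the content is invented.
sample_html = """
<html><body>
<script>var unrelated = 1;</script>
<script>window.__NUXT__={"data":[{"listarticle":[{"judul":"Contoh A"},{"judul":"Contoh B"}]}]};</script>
</body></html>
"""

payload = extract_nuxt_payload(sample_html)
titles = [item["judul"] for item in payload["data"][0]["listarticle"]]
print(titles)
```

Finding the script by its content rather than by its position means the lookup does not break when the site adds or removes other `script` tags, although it still depends on the `window.__NUXT__` name staying the same.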

In [2]:
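# Function to scrape Tirto.id articles
def crawl_tirto_article(i, dataArticle):
    print("Start Scraping Tirto.id Page " + str(i))

    # Get page content; fail early on HTTP errors or a hung connection
    page = requests.get("https://tirto.id/indeks/" + str(i), timeout=30)
    page.raise_for_status()
    soup = BeautifulSoup(page.content, "lxml")

    # Tirto.id lazy-loads its index, so the article data is not in the visible HTML.
    # It sits in an inline "script" that assigns it to window.__NUXT__.
    # Find that script by its content instead of a fixed position (it used to be index 4).
    prefix = "window.__NUXT__="
    scriptContent = None
    for script in soup.find_all("script"):
        text = (script.string or "").strip()
        if text.startswith(prefix):
            # Drop the prefix and the trailing semicolon
            scriptContent = text[len(prefix):].rstrip(";")
            break
    if scriptContent is None:
        raise ValueError("window.__NUXT__ script not found on page " + str(i))

    # The remaining text is JSON-compatible, so parse it with the json module
    article = json.loads(scriptContent)

    # Extract the article list and append it to the results collected so far
    listArticle = article["data"][0]["listarticle"]
    dataArticle = dataArticle + listArticle

    # Sleep to prevent your scraper from being too spammy
    time.sleep(5)

    return dataArticle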
    
# Function to save crawled data into CSV files
def save_tirto_articles(dataArticle):
    dataFrame = pd.DataFrame(dataArticle)
    dataFrame.to_csv("Data Tirto.ID.csv", sep=";")
    print("Done saving!")
    return dataFrame

## Step 3: Run the Scraper

In [3]:
# Let's start scraping
firstPage = int(input("Insert first page you want to scrape (minimum is 1): "))
lastPage = int(input("Insert last page you want to scrape (minimum is 1): "))

# Reject page numbers below 1 and ranges where the last page precedes the first
if firstPage < 1 or lastPage < firstPage:
    print("Please input valid page!")

else:
    # Initialize list first to store scrape result
    data = []
    for i in range(firstPage, lastPage+1):
        data = crawl_tirto_article(i, data)

    # Finally, let's save it
    dataFr = save_tirto_articles(data)

Insert first page you want to scrape (minimum is 1): 1
Insert last page you want to scrape (minimum is 1): 5
Start Scraping Tirto.id Page 1
Start Scraping Tirto.id Page 2
Start Scraping Tirto.id Page 3
Start Scraping Tirto.id Page 4
Start Scraping Tirto.id Page 5
Done saving!


## Bonus Step 4: Show the First 5 Rows of the Scraped Data

In [4]:
# Display it, just for fun
dataFr.head()

Unnamed: 0,articleUrl,articleUrlNew,date_news,flag_tvr,foto,id_topic_pialadunia,image,image_infografik,judul,label_kanal,label_navbar,match_id,player_id,ringkasan,team_id,video
0,/cerita-peserta-tes-cpns-di-yogya-yang-gagal-u...,cerita-peserta-tes-cpns-di-yogya-yang-gagal-uj...,2018-10-27 06:30:00,0,,,[{'url': '2018/10/26/tes-cpns-yogyakarta-tirto...,False,Cerita Peserta Tes CPNS di Yogya yang Gagal Uj...,Current Issue,Sosial Budaya,,,Tes SKD CPNS di Yogyakarta batal digelar Jumat...,,
1,/bnpb-jumlah-korban-bencana-di-palu-dan-dongga...,bnpb-jumlah-korban-bencana-di-palu-dan-donggal...,2018-10-26 19:42:13,0,,,[{'url': '2018/10/11/memetakan-struktur-tanah-...,False,BNPB: Jumlah Korban Bencana di Palu dan Dongga...,Hard News,Sosial Budaya,,,Korban jiwa terbanyak dalam bencana tersebut t...,,
2,/kewenangan-anies-dalam-menentukan-wagub-dki-t...,kewenangan-anies-dalam-menentukan-wagub-dki-te...,2018-10-26 19:38:00,0,,,[{'url': '2017/11/28/Balai-Kota-Jakarta-1--tir...,False,Kewenangan Anies dalam Menentukan Wagub DKI Te...,Hard News,Politik,,,Penentuan sosok calon wakil gubernur harus men...,,
3,/sempat-kisruh-massa-aksi-bela-tauhid-dan-pbnu...,sempat-kisruh-massa-aksi-bela-tauhid-dan-pbnu-...,2018-10-26 19:24:06,0,,,[{'url': '2018/08/09/kantor-pbnu_ratio-16x9.jp...,False,"Sempat Kisruh, Massa Aksi Bela Tauhid dan PBNU...",Hard News,Sosial Budaya,,,Massa aksi bela tauhid yang berjumlah ratusan ...,,
4,/ketua-dpp-demokrat-sanksi-bawaslu-terlalu-rin...,ketua-dpp-demokrat-sanksi-bawaslu-terlalu-ring...,2018-10-26 19:00:11,0,,,[{'url': '2018/10/26/sidang-bawaslu-dki-tirto....,False,Ketua DPP Demokrat: Sanksi Bawaslu Terlalu Rin...,Hard News,Politik,,,Ketua DPP Demokrat minta agar Bawaslu&amp;nbsp...,,
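
In the output above, the `image` column holds a list of dictionaries with a `url` key rather than a plain string. A minimal sketch of pulling out the first image path from each scraped record might look like this. The helper `first_image_path` and the sample records are hypothetical, and the sketch assumes only the `url` key visible in the output.

```python
def first_image_path(record):
    """Return the 'url' of the first image in a record, or None if there is none."""
    images = record.get("image") or []
    if isinstance(images, list) and images and isinstance(images[0], dict):
        return images[0].get("url")
    return None


# Invented records shaped like the scraped ones (only the fields needed here)
sample_records = [
    {"judul": "Artikel A", "image": [{"url": "2018/10/26/contoh-a.jpg"}]},
    {"judul": "Artikel B", "image": []},
    {"judul": "Artikel C"},
]

paths = [first_image_path(record) for record in sample_records]
print(paths)
```

The paths shown in the output are relative. The base URL they should be joined to is not part of the scraped data shown here, so it is left out of the sketch.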


## Future Improvements
1. Modify the save function so it can update the existing CSV file when later scraping runs are performed
2. Write a function to download images
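
The first item could be sketched roughly as follows. This is one possible approach, not part of the original notebook: the function name `update_articles_csv` is hypothetical, and it assumes `articleUrl` identifies an article uniquely, so that a re-scraped article replaces its older row.

```python
import os
import tempfile

import pandas as pd


def update_articles_csv(new_records, csv_path, key="articleUrl"):
    """Merge newly scraped records into csv_path, keeping one row per key."""
    new_frame = pd.DataFrame(new_records)
    if os.path.exists(csv_path):
        old_frame = pd.read_csv(csv_path, sep=";")
        combined = pd.concat([old_frame, new_frame], ignore_index=True)
    else:
        combined = new_frame
    # When the same article appears twice, keep the most recently scraped row
    combined = combined.drop_duplicates(subset=key, keep="last")
    combined = combined.reset_index(drop=True)
    # index=False avoids an extra "Unnamed: 0" column on the next read
    combined.to_csv(csv_path, sep=";", index=False)
    return combined


# Usage with invented records and a temporary file
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "articles.csv")
    update_articles_csv([{"articleUrl": "/a", "judul": "A"}], path)
    merged = update_articles_csv(
        [
            {"articleUrl": "/a", "judul": "A (updated)"},
            {"articleUrl": "/b", "judul": "B"},
        ],
        path,
    )
    print(merged["articleUrl"].tolist())
```

Note that this sketch writes the file without the index column, unlike `save_tirto_articles`, so a file produced by the original function would carry one extra column into the merge.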