<a href="https://colab.research.google.com/github/BetaUliansyah/Sandbox/blob/main/Sinta_Hunter.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Sinta Hunter

Author: Beta Uliansyah, <beta.uliansyah@pknstan.ac.id>

This notebook contains the Python scripts I use to try to collect data on every Sinta-accredited journal listed at sinta.ristekbrin.go.id, using web scraping.

The Sinta Hunter script is the first and main part of [sinta-tools](https://github.com/BetaUliansyah/sinta-tools). Sinta Tools is planned to consist of:
- Sinta Hunter
- CFP Hunter (possibly a dashboard built with [datastudio.google.com](https://datastudio.google.com))
- Reviewers Hunter

The Sinta Hunter script has 4 variants:
1. [output to the screen](#scrollTo=hPs3_V7rNsWN&line=3&uniqifier=1)
2. [output as a return value in JSON format](#scrollTo=BV2p2nNBfkER) (not working well yet)
3. [output as a CSV file](#scrollTo=dJy2ei3N2n7A) (the best version so far)
4. [output as a CSV file saved to Google Drive](#scrollTo=YaazQkWbIXnv), so it can be run in Colab
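
The JSON and CSV variants differ mainly in how the same rows are serialized. A minimal sketch of both forms using only the standard library; the two sample rows are copied from the JSON output shown further down, and the in-memory buffer stands in for the file that the CSV variant actually writes.

```python
import csv
import io
import json

header = ['journal_no', 'journal_name', 'journal_accreditation', 'journal_url']
rows = [
    [19, 'Tropical Animal Science Journal', 'S1', 'http://journal.ipb.ac.id/index.php/mediapeternakan'],
    [26, 'Atom Indonesia', 'S1', 'http://aij.batan.go.id'],
]

# JSON style: one string holding the header row followed by the data rows.
as_json = json.dumps([header] + rows)

# CSV style: written to an in-memory buffer here instead of a file on disk.
buffer = io.StringIO()
writer = csv.writer(buffer, quoting=csv.QUOTE_MINIMAL)
writer.writerow(header)
writer.writerows(rows)
as_csv = buffer.getvalue()

print(as_json)
print(as_csv)
```

The JSON string keeps `journal_no` as a number, while every CSV field comes back as text when the file is read again.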

The scraped data in CSV format is available at [https://github.com/BetaUliansyah/sinta-tools/blob/master/sinta-data-2020-03-17.csv](https://github.com/BetaUliansyah/sinta-tools/blob/master/sinta-data-2020-03-17.csv)
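
Once a CSV like this has been downloaded, it can be summarized with the standard library. A minimal sketch, assuming the four-column header that the CSV variant of the script writes; the inline sample rows are copied from that variant's printed output, and `count_by_accreditation` is a hypothetical helper name, not part of sinta-tools.

```python
import csv
import io
from collections import Counter

# Sample in the layout the CSV variant writes; rows taken from its printed output.
SAMPLE_CSV = '''journal_no,journal_name,journal_accreditation,journal_url
"55","Electronic Journal of Graph Theory and Applications","S1","http://www.ejgta.org/index.php/ejgta"
"56","Kukila","S1","http://kukila.org/index.php/KKL"
"59","Forum Penelitian Agro Ekonomi","S2","http://ejurnal.litbang.pertanian.go.id/index.php/fae"
'''

def count_by_accreditation(csv_file):
    """Count journals per accreditation level (hypothetical helper)."""
    reader = csv.DictReader(csv_file)
    return Counter(row['journal_accreditation'] for row in reader)

counts = count_by_accreditation(io.StringIO(SAMPLE_CSV))
print(dict(counts))
```

To run it on a real export, pass `open('sinta-data-2020-03-17.csv', newline='')` instead of the `StringIO` object.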

# Sinta Hunter Beta Release


In [None]:
#@Sinta Hunter by beta.uliansyah@pknstan.ac.id
import requests
from bs4 import BeautifulSoup
from urllib.parse import urlparse 
import json
import re

def sinta_readpages(startpage, lastpage):
    # loop over the Sinta listing pages from startpage to lastpage
    return_value = {}
    currentpage = startpage
    journal_no = (startpage-1) * 9 + 1
    i = 0
    while currentpage < int(lastpage) + 1:
        # development purpose, uncomment the two lines below to limit the number of pages scraped
        #if currentpage == 620:
        #    break
        
        s = requests.Session()
        r = s.get('http://sinta.ristekbrin.go.id/journals?page='+ str(currentpage) +'&sort=impact')
        bsoup = BeautifulSoup(r.text, 'html.parser')
        journals_in_this_page = bsoup.find_all("dl", {"class":"uk-description-list-line"})
        for journal in journals_in_this_page:
            # journal name (first link in the entry)
            journal_link = journal.find('a')

            # accreditation level
            journal_info = journal.find('span', attrs={'class' :'index-val-small', 'style': re.compile('color')}) # span with any color style
            #print(str(journal_no) + ". " + journal_link.text)
            #print("   Akreditasi: " + journal_info.text)

            # affiliation
            #journal_affiliation = journal.find('dd')
            #if journal_affiliation is None:
            #    journal_affiliation = ''
            #print(journal_affiliation.text)

            # journal subject area
            #journal_topic = journal.find('a', attrs={'class': 'area-item-small'})
            #if journal_topic is None:
            #    str_journal_topic = ''
            #else:
            #    str_journal_topic = journal_topic.text
            
            # journal website URL, read from the journal's Sinta detail page
            journal_url = ''  # stays empty if the page fails to load or has no "Website" link
            r = s.get('http://sinta.ristekbrin.go.id'+journal_link['href'])
            if r.status_code==200:
                jsoup = BeautifulSoup(r.text, 'html.parser')
                # target: <a href="http://journal.unhas.ac.id/index.php/fs/index" ><i class="uk-icon-globe uk-text-primary"></i> Website</a> | 
                for link in jsoup.find_all('a', href=True):
                    if link.find(string=re.compile("Website")):
                        journal_url = link['href']
                        break
                #print("   URL: " + journal_url)
            return_value[i] = {
                'journal_no': journal_no,
                'journal_name': journal_link.text, 
                'journal_accreditation': journal_info.text, 
                #'journal_affiliation': journal_affiliation, 
                #'journal_topic': str_journal_topic, 
                'journal_url': journal_url
                }
            journal_no = journal_no +1
            i += 1
            
        currentpage = currentpage + 1
    return json.dumps(return_value)


def sinta_lastpage():
    """
    Find last page of Sinta list
    """
    s = requests.Session()
    r = s.get('http://sinta.ristekbrin.go.id/journals')
    bsoup = BeautifulSoup(r.text, 'html.parser')
    lastpage = 1

    if r.status_code==200:
        # find the last page number in the pagination links
        for ultag in bsoup.find_all('ul', {'class': 'top-paging'}):
            for litag in ultag.find_all('li'):
                for links in litag.find_all('a', href=True):
                    # parse_qs did not work well for me in Python 3 when I wrote this code, so I use a regex
                    # .query or [4], see https://docs.python.org/3/library/urllib.parse.html
                    query = urlparse(links['href']).query
                    pages = re.findall(r'page=(\d+)', query)
                    if pages:
                        lastpage = int(pages[0])  # the last link carrying page= wins
    return lastpage

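def main(startpage=1, lastpage=None):
    """
    Run sinta-hunter with 2 arguments: start and last page
    """
    if lastpage is None:
        # resolve the default when called; sinta_lastpage() as a default value would run at definition time
        lastpage = sinta_lastpage()
    return sinta_readpages(startpage, lastpage)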
  
main(startpage=3, lastpage=4)
# main(lastpage=4)
# main(startpage=3, lastpage=4)


# Credits:
# https://stackoverflow.com/questions/11716380/beautifulsoup-extract-text-from-anchor-tag
# https://stackoverflow.com/questions/5815747/beautifulsoup-getting-href
# https://stackoverflow.com/questions/50338108/using-beautifulsoup-in-order-to-find-all-ul-and-li-elements
# https://stackoverflow.com/questions/17246963/how-to-find-all-lis-within-a-specific-ul-class
# https://stackoverflow.com/questions/31958637/beautifulsoup-search-by-text-inside-a-tag


'{"0": {"journal_no": 19, "journal_name": "Tropical Animal Science Journal", "journal_accreditation": "S1", "journal_url": "http://journal.ipb.ac.id/index.php/mediapeternakan"}, "1": {"journal_no": 20, "journal_name": "International Journal of Renewable Energy Development : JRED", "journal_accreditation": "S1", "journal_url": "http://ejournal.undip.ac.id/index.php/ijred/index"}, "2": {"journal_no": 21, "journal_name": "Indonesian Journal of Electrical Engineering and Computer Science", "journal_accreditation": "S1", "journal_url": "http://www.iaesjournal.com/online/index.php/IJEECS"}, "3": {"journal_no": 22, "journal_name": "Indonesian Journal of Electrical Engineering and Informatics", "journal_accreditation": "S1", "journal_url": "http://section.iaesonline.com/index.php/IJEEI/index"}, "4": {"journal_no": 23, "journal_name": "TELKOMNIKA (Telecommunication Computing Electronics and Control)", "journal_accreditation": "S1", "journal_url": "http://journal.uad.ac.id/index.php/telkomnika"}

# Sinta Hunter CSV

In [None]:
#@Sinta Hunter by beta.uliansyah@pknstan.ac.id
import requests
from bs4 import BeautifulSoup
from urllib.parse import urlparse 
import re
import csv
from datetime import date

def sinta_lastpage():
    """
    Find last page of Sinta list
    """
    s = requests.Session()
    r = s.get('https://sinta.ristekbrin.go.id/journals')
    bsoup = BeautifulSoup(r.text, 'html.parser')
    lastpage = 1

    if r.status_code==200:
        # find the last page number in the pagination links
        for ultag in bsoup.find_all('ul', {'class': 'top-paging'}):
            for litag in ultag.find_all('li'):
                for links in litag.find_all('a', href=True):
                    # parse_qs did not work well for me in Python 3 when I wrote this code, so I use a regex
                    # .query or [4], see https://docs.python.org/3/library/urllib.parse.html
                    query = urlparse(links['href']).query
                    pages = re.findall(r'page=(\d+)', query)
                    if pages:
                        lastpage = int(pages[0])  # the last link carrying page= wins
    return lastpage

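def sinta_readpages(startpage=1, lastpage=None):
    if lastpage is None:
        lastpage = sinta_lastpage()  # resolve the default when called, not when the function is defined
    # prepare file name with date and write header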
    filename = "sinta-data-" + str(date.today()) + ".csv"
    header_row = ['journal_no', 'journal_name', 'journal_accreditation', 'journal_url']
    with open(filename, mode='w') as sintacsv_file:
        sintacsv_writer = csv.writer(sintacsv_file, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
        sintacsv_writer.writerow(header_row)
    print(header_row)

    # loop over the Sinta listing pages from startpage to lastpage
    return_value = []
    currentpage = startpage
    journal_no = (startpage-1) * 9 + 1
    i = 0
    while currentpage < int(lastpage) + 1:
        # development purpose, uncomment the two lines below to limit the number of pages scraped
        #if currentpage == 620:
        #    break

        # session
        s = requests.Session()
        r = s.get('http://sinta.ristekbrin.go.id/journals?page='+ str(currentpage) +'&sort=impact')
        bsoup = BeautifulSoup(r.text, 'html.parser')
        journals_in_this_page = bsoup.find_all("dl", {"class":"uk-description-list-line"})
        for journal in journals_in_this_page:
            # journal name (first link in the entry)
            journal_link = journal.find('a')

            # accreditation level
            journal_info = journal.find('span', attrs={'class' :'index-val-small', 'style': re.compile('color')}) # span with any color style
            
            # journal website URL, read from the journal's Sinta detail page
            journal_url = ''  # stays empty if the page fails to load or has no "Website" link
            r = s.get('http://sinta.ristekbrin.go.id'+journal_link['href'])
            if r.status_code==200:
                jsoup = BeautifulSoup(r.text, 'html.parser')
                # target: <a href="http://journal.unhas.ac.id/index.php/fs/index" ><i class="uk-icon-globe uk-text-primary"></i> Website</a> | 
                for link in jsoup.find_all('a', href=True):
                    if link.find(string=re.compile("Website")):
                        journal_url = link['href']
                        break
                #print("   URL: " + journal_url)
                
            # prepare result_data
            result_data = [journal_no, journal_link.text, journal_info.text, journal_url]
            with open(filename, mode='a+') as sintacsv_file:
                sintacsv_writer = csv.writer(sintacsv_file, delimiter=',', quotechar='"', quoting=csv.QUOTE_ALL)
                sintacsv_writer.writerow(result_data)
            return_value.append(result_data)
            print(result_data)
            journal_no = journal_no +1
            i += 1
            
        currentpage = currentpage + 1
    return return_value

if __name__ == "__main__":
    sinta_readpages()
    # sinta_readpages(startpage=3, lastpage=4)
    # sinta_readpages(7, 8)






['journal_no', 'journal_name', 'journal_accreditation', 'journal_url']
[55, 'Electronic Journal of Graph Theory and Applications', 'S1', 'http://www.ejgta.org/index.php/ejgta']
[56, 'Kukila', 'S1', 'http://kukila.org/index.php/KKL']
[57, 'International Journal of Technology', 'S1', 'http://www.ijtech.eng.ui.ac.id/old/index.php/journal']
[58, 'Critical Care and Shock', 'S1', 'http://criticalcareshock.org/']
[59, 'Forum Penelitian Agro Ekonomi', 'S2', 'http://ejurnal.litbang.pertanian.go.id/index.php/fae']
[60, 'Jurnal Aplikasi Manajemen', 'S2', 'http://jurnaljam.ub.ac.id/index.php/jam/index']
[61, 'Jurnal Penelitian Pascapanen Pertanian', 'S2', 'http://ejurnal.litbang.pertanian.go.id/index.php/jpasca']
[62, 'Jurnal Agronomi Indonesia (Indonesian Journal of Agronomy)', 'S2', 'http://journal.ipb.ac.id/index.php/jurnalagronomi']
[63, 'Berita Biologi', 'S2', 'http://e-journal.biologi.lipi.go.id/index.php/berita_biologi']
[64, 'Jurnal Ekologi Kesehatan', 'S2', 'http://ejournal.litbang.depkes

# Sinta Hunter json


In [None]:
#@Sinta Hunter by beta.uliansyah@pknstan.ac.id
import requests
from bs4 import BeautifulSoup
from urllib.parse import urlparse 
import json
import re

def sinta_readpages(startpage, lastpage):
    # loop over the Sinta listing pages from startpage to lastpage
    return_value = []
    return_value.append(['journal_no', 'journal_name', 'journal_accreditation', 'journal_url'])
    currentpage = startpage
    journal_no = (startpage-1) * 9 + 1
    while currentpage < int(lastpage) + 1:
        # development purpose, uncomment the two lines below to limit the number of pages scraped
        #if currentpage == 620:
        #    break
        
        s = requests.Session()
        r = s.get('http://sinta.ristekbrin.go.id/journals?page='+ str(currentpage) +'&sort=impact')
        bsoup = BeautifulSoup(r.text, 'html.parser')
        journals_in_this_page = bsoup.find_all("dl", {"class":"uk-description-list-line"})
        for journal in journals_in_this_page:
            # journal name (first link in the entry)
            journal_link = journal.find('a')

            # accreditation level
            journal_info = journal.find('span', attrs={'class' :'index-val-small', 'style': re.compile('color')}) # span with any color style
            #print(str(journal_no) + ". " + journal_link.text)
            #print("   Akreditasi: " + journal_info.text)

            # affiliation
            #journal_affiliation = journal.find('dd')
            #if journal_affiliation is None:
            #    journal_affiliation = ''
            #print(journal_affiliation.text)

            # journal subject area
            #journal_topic = journal.find('a', attrs={'class': 'area-item-small'})
            #if journal_topic is None:
            #    str_journal_topic = ''
            #else:
            #    str_journal_topic = journal_topic.text
            
            # journal website URL, read from the journal's Sinta detail page
            journal_url = ''  # stays empty if the page fails to load or has no "Website" link
            r = s.get('http://sinta.ristekbrin.go.id'+journal_link['href'])
            if r.status_code==200:
                jsoup = BeautifulSoup(r.text, 'html.parser')
                # target: <a href="http://journal.unhas.ac.id/index.php/fs/index" ><i class="uk-icon-globe uk-text-primary"></i> Website</a> | 
                for link in jsoup.find_all('a', href=True):
                    if link.find(string=re.compile("Website")):
                        journal_url = link['href']
                        break
                #print("   URL: " + journal_url)
            return_value.append([journal_no, 
                journal_link.text, 
                journal_info.text, 
                journal_url
            ])
            journal_no = journal_no +1
            
        currentpage = currentpage + 1
    return json.dumps(return_value, indent=None)


def sinta_lastpage():
    """
    Find last page of Sinta list
    """
    s = requests.Session()
    r = s.get('http://sinta.ristekbrin.go.id/journals')
    bsoup = BeautifulSoup(r.text, 'html.parser')
    lastpage = 1

    if r.status_code==200:
        # find the last page number in the pagination links
        for ultag in bsoup.find_all('ul', {'class': 'top-paging'}):
            for litag in ultag.find_all('li'):
                for links in litag.find_all('a', href=True):
                    # parse_qs did not work well for me in Python 3 when I wrote this code, so I use a regex
                    # .query or [4], see https://docs.python.org/3/library/urllib.parse.html
                    query = urlparse(links['href']).query
                    pages = re.findall(r'page=(\d+)', query)
                    if pages:
                        lastpage = int(pages[0])  # the last link carrying page= wins
    return lastpage

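def main(startpage=1, lastpage=None):
    """
    Run sinta-hunter with 2 arguments: start and last page
    """
    if lastpage is None:
        # resolve the default when called; sinta_lastpage() as a default value would run at definition time
        lastpage = sinta_lastpage()
    return sinta_readpages(startpage, lastpage)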
  
main(startpage=3, lastpage=4)
# Example arguments
# main(lastpage=4)
# main(startpage=3, lastpage=4)




'[["journal_no", "journal_name", "journal_accreditation", "journal_url"], [19, "Tropical Animal Science Journal", "S1", "http://journal.ipb.ac.id/index.php/mediapeternakan"], [20, "International Journal of Renewable Energy Development : JRED", "S1", "http://ejournal.undip.ac.id/index.php/ijred/index"], [21, "Indonesian Journal of Electrical Engineering and Computer Science", "S1", "http://www.iaesjournal.com/online/index.php/IJEECS"], [22, "Indonesian Journal of Electrical Engineering and Informatics", "S1", "http://section.iaesonline.com/index.php/IJEEI/index"], [23, "TELKOMNIKA (Telecommunication Computing Electronics and Control)", "S1", "http://journal.uad.ac.id/index.php/telkomnika"], [24, "IJOG : Indonesian Journal on Geoscience", "S1", "https://ijog.geologi.esdm.go.id/index.php/IJOG"], [25, "QIJIS (Qudus International Journal Of Islamic Studies)", "S1", "http://journal.stainkudus.ac.id/index.php/QIJIS"], [26, "Atom Indonesia", "S1", "http://aij.batan.go.id"], [27, "AGRIVITA, Jou

# Sinta Hunter version 1.1 (text to stdout)

In [None]:
#@Sinta Hunter by beta.uliansyah@pknstan.ac.id
import requests
from bs4 import BeautifulSoup
from urllib.parse import urlparse 
import json
import re


s = requests.Session()
r = s.get('http://sinta.ristekbrin.go.id/journals')
bsoup = BeautifulSoup(r.text, 'html.parser')
lastpage = 1

if r.status_code==200:
    # find the last page number in the pagination links
    for ultag in bsoup.find_all('ul', {'class': 'top-paging'}):
        for litag in ultag.find_all('li'):
            for links in litag.find_all('a', href=True):
                # parse_qs did not work well for me in Python 3 when I wrote this code, so I use a regex
                # .query or [4], see https://docs.python.org/3/library/urllib.parse.html
                query = urlparse(links['href']).query
                pages = re.findall(r'page=(\d+)', query)
                if pages:
                    lastpage = pages[0]  # the last link carrying page= wins
    lastpage = str(lastpage)
    print("Bismillah, let's start scraping " + lastpage + " pages of Sinta journals")

    # loop over the Sinta listing pages up to lastpage
    page = 1
    journal_no = 1
    while page < int(lastpage) + 1:
        # development purpose, uncomment the two lines below to limit the number of pages scraped
        #if page == 2:
        #    break
        
        r = s.get('http://sinta.ristekbrin.go.id/journals?page='+ str(page) +'&sort=impact')
        bsoup = BeautifulSoup(r.text, 'html.parser')
        journals_in_this_page = bsoup.find_all("dl", {"class":"uk-description-list-line"})
        for journal in journals_in_this_page:
            journal_link = journal.find('a')
            journal_info = journal.find('span', attrs={'class' :'index-val-small', 'style': re.compile('color')}) # span with any color style
            print(str(journal_no) + ". " + journal_link.text)
            print("   Akreditasi: " + journal_info.text)
            
            journal_url = ''  # stays empty if the detail page has no "Website" link
            r = s.get('http://sinta.ristekbrin.go.id'+journal_link['href'])
            if r.status_code==200:
                jsoup = BeautifulSoup(r.text, 'html.parser')
                # target: <a href="http://journal.unhas.ac.id/index.php/fs/index" ><i class="uk-icon-globe uk-text-primary"></i> Website</a> | 
                for link in jsoup.find_all('a', href=True):
                    if link.find(string=re.compile("Website")):
                        journal_url = link['href']
                        break
                print("   URL: " + journal_url)
            journal_no = journal_no +1
            
        page = page + 1





Streaming output truncated to the last 5000 lines.
   Akreditasi: S2
   URL: http://journal.walisongo.ac.id/index.php/ahkam/index
617. Jurnal Pendidikan Islam
   Akreditasi: S2
   URL: http://ejournal.uin-suka.ac.id/tarbiyah/JPI
618. Etnosia : Jurnal Etnografi Indonesia
   Akreditasi: S2
   URL: http://journal.unhas.ac.id/index.php/etnosia
619. Jurnal Sumberdaya Lahan
   Akreditasi: S2
   URL: http://ejurnal.litbang.pertanian.go.id/index.php/jsl
620. Jurnal Media Hukum
   Akreditasi: S2
   URL: http://journal.umy.ac.id/index.php/jmh
621. Jurnal Adabiyah
   Akreditasi: S2
   URL: http://journal.uin-alauddin.ac.id/index.php/adabiyah
622. Majalah Ilmiah Pengkajian Industri
   Akreditasi: S2
   URL: http://ejurnal.bppt.go.id/index.php/MIPI/index
623. Kontekstualita : Jurnal Penelitian Sosial Keagamaan
   Akreditasi: S2
   URL: http://e-journal.lp2m.uinjambi.ac.id/ojp/index.php/Kontekstualita/index
624. Majalah Kedokteran Gigi Indonesia
   Akreditasi: S2
   URL: https://jurnal

ConnectionError: ignored

## Resuming after a disconnect
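
The full run above stopped with a `ConnectionError`, so the cell below restarts the loop partway through. A minimal sketch of two pieces that could make resuming less manual; neither is part of the original scripts. `first_journal_no` mirrors the `(page-1) * 9 + 1` numbering formula used in the cells of this notebook, and `fetch_with_retry` is a hypothetical wrapper that retries a callable when it raises one of the given exceptions. The delay values, the example URL, and the `fake_get` stand-in are assumptions for illustration only.

```python
import time

def first_journal_no(page, per_page=9):
    """Running number of the first journal on `page`, using the scripts' (page-1)*9+1 formula."""
    return (page - 1) * per_page + 1

def fetch_with_retry(fetch, url, attempts=3, delay=1.0,
                     retry_on=(ConnectionError,), sleep=time.sleep):
    """Call fetch(url); on a listed exception wait and retry, re-raising after the last attempt."""
    for attempt in range(1, attempts + 1):
        try:
            return fetch(url)
        except retry_on:
            if attempt == attempts:
                raise
            sleep(delay * attempt)  # wait a little longer after each failure

# Stand-in for a real HTTP call: fails twice, then succeeds.
calls = {'n': 0}

def fake_get(url):
    calls['n'] += 1
    if calls['n'] < 3:
        raise ConnectionError('simulated disconnect')
    return 'ok: ' + url

waits = []  # record the requested delays instead of actually sleeping
result = fetch_with_retry(fake_get, 'http://example.invalid/journals?page=11', sleep=waits.append)
print(first_journal_no(11), result, waits)
```

With `requests`, pass `retry_on=(requests.exceptions.ConnectionError,)`, since that exception is not a subclass of the built-in `ConnectionError`.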

In [None]:
#@Sinta Hunter by beta.uliansyah@pknstan.ac.id
import requests
from bs4 import BeautifulSoup
from urllib.parse import urlparse 
import json
import re


s = requests.Session()
r = s.get('http://sinta.ristekbrin.go.id/journals')
bsoup = BeautifulSoup(r.text, 'html.parser')
lastpage = 1

if r.status_code==200:
    # find the last page number in the pagination links
    for ultag in bsoup.find_all('ul', {'class': 'top-paging'}):
        for litag in ultag.find_all('li'):
            for links in litag.find_all('a', href=True):
                # parse_qs did not work well for me in Python 3 when I wrote this code, so I use a regex
                # .query or [4], see https://docs.python.org/3/library/urllib.parse.html
                query = urlparse(links['href']).query
                pages = re.findall(r'page=(\d+)', query)
                if pages:
                    lastpage = pages[0]  # the last link carrying page= wins
    lastpage = str(lastpage)
    print("Bismillah, lets start scrapping " + lastpage + " pages of Sinta journals")

    # loop over the Sinta listing pages up to lastpage
    # set page to the page to resume from; the output below starts at journal 91,
    # which the formula on the next line gives for page = 11
    page = 1
    journal_no = (page-1) * 9 + 1
    while page < int(lastpage) + 1:
        # development purpose, uncomment the two lines below to limit the number of pages scraped
        #if page == 620:
        #    break
        
        r = s.get('http://sinta.ristekbrin.go.id/journals?page='+ str(page) +'&sort=impact')
        bsoup = BeautifulSoup(r.text, 'html.parser')
        journals_in_this_page = bsoup.find_all("dl", {"class":"uk-description-list-line"})
        for journal in journals_in_this_page:
            journal_link = journal.find('a')
            journal_info = journal.find('span', attrs={'class' :'index-val-small', 'style': re.compile('color')}) # span with any color style
            print(str(journal_no) + ". " + journal_link.text)
            print("   Akreditasi: " + journal_info.text)
            
            journal_url = ''  # stays empty if the detail page has no "Website" link
            r = s.get('http://sinta.ristekbrin.go.id'+journal_link['href'])
            if r.status_code==200:
                jsoup = BeautifulSoup(r.text, 'html.parser')
                # target: <a href="http://journal.unhas.ac.id/index.php/fs/index" ><i class="uk-icon-globe uk-text-primary"></i> Website</a> | 
                for link in jsoup.find_all('a', href=True):
                    if link.find(string=re.compile("Website")):
                        journal_url = link['href']
                        break
                print("   URL: " + journal_url)
            journal_no = journal_no +1
            
        page = page + 1





Bismillah, lets start scrapping 461 pages of Sinta journals
91. Jurnal Teknologi Lingkungan
   Akreditasi: S2
   URL: http://ejurnal.bppt.go.id/index.php/JTL/issue/archive
92. Majalah Geografi Indonesia
   Akreditasi: S2
   URL: https://jurnal.ugm.ac.id/mgi
93. Al-Iqtishad : Jurnal Ilmu Ekonomi Syariah (Journal of Islamic Economics)
   Akreditasi: S2
   URL: http://journal.uinjkt.ac.id/index.php/iqtishad
94. Buletin Psikologi
   Akreditasi: S2
   URL: https://jurnal.ugm.ac.id/buletinpsikologi
95. Jurnal Ilmiah Perikanan dan Kelautan
   Akreditasi: S2
   URL: https://e-journal.unair.ac.id/JIPK
96. Jurnal ASET (Akuntansi Riset)
   Akreditasi: S2
   URL: http://ejournal.upi.edu/index.php/aset
97. Jurnal Legislasi Indonesia
   Akreditasi: S2
   URL: http://e-jurnal.peraturan.go.id/index.php/jli/index
98. Jurnal Pendidikan dan Kebudayaan
   Akreditasi: S2
   URL: http://jurnaldikbud.kemdikbud.go.id/index.php/jpnk
99. SCIENTIFIC DENTAL JOURNAL
   Akreditasi: S2


KeyboardInterrupt: ignored

# Sinta Hunter v1

The first script I wrote, just to confirm that scraping would work.
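
The script below finds the last page by pulling the `page=` value out of the pagination links with a regular expression. A minimal sketch of the same step using `urllib.parse.parse_qs` from the standard library; the sample hrefs and the helper name `last_page_from_hrefs` are assumptions for illustration, not the site's actual markup. Unlike the script, it takes the largest page number seen rather than the value from the last link.

```python
from urllib.parse import urlparse, parse_qs

def last_page_from_hrefs(hrefs, default=1):
    """Return the largest page= value found in the given links, or `default` if none has one."""
    pages = []
    for href in hrefs:
        values = parse_qs(urlparse(href).query).get('page', [])
        pages.extend(int(v) for v in values if v.isdigit())
    return max(pages, default=default)

# Hypothetical pagination links; only the query-string shape matters here.
sample_hrefs = [
    '/journals?page=2&sort=impact',
    '/journals?page=461&sort=impact',
    '/journals',
]
print(last_page_from_hrefs(sample_hrefs))
print(last_page_from_hrefs(['/journals']))
```

Links without a `page` parameter are skipped, so a page with no usable pagination falls back to the default of 1.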


In [None]:
#@Sinta Hunter by beta.uliansyah@pknstan.ac.id
import requests
from bs4 import BeautifulSoup
from urllib.parse import urlparse 
import json
import re


s = requests.Session()
r = s.get('http://sinta.ristekbrin.go.id/journals')
bsoup = BeautifulSoup(r.text, 'html.parser')
lastpage = 1

if r.status_code==200:
    # find the last page number in the pagination links
    for ultag in bsoup.find_all('ul', {'class': 'top-paging'}):
        for litag in ultag.find_all('li'):
            for links in litag.find_all('a', href=True):
                # parse_qs did not work well for me in Python 3 when I wrote this code, so I use a regex
                # .query or [4], see https://docs.python.org/3/library/urllib.parse.html
                query = urlparse(links['href']).query
                pages = re.findall(r'page=(\d+)', query)
                if pages:
                    lastpage = pages[0]  # the last link carrying page= wins
    lastpage = str(lastpage)
    print("Bismillah, lets start sracpping " + lastpage + " pages of Sinta journals")

    # loop over the Sinta listing pages up to lastpage
    page = 1
    journal_no = 1
    while page < int(lastpage) + 1:
        # development purpose: stop after the first four pages
        if page == 5:
            break

        r = s.get('http://sinta.ristekbrin.go.id/journals?page='+ str(page) +'&sort=impact')
        bsoup = BeautifulSoup(r.text, 'html.parser')
        journals_in_this_page = bsoup.find_all("a", {"class":"text-blue"})
        for journal in journals_in_this_page:
            print(str(journal_no)+". "+journal.text) # print journal name
            print(journal['href']) 
            #r = s.get('http://sinta.ristekbrin.go.id'+journal['href'])
            #journal_soup = BeautifulSoup(r.text, 'html.parser')
            #journal_url = journal_soup.find('a', {"class": ""})
            #print(r.text)
            journal_no = journal_no +1
            
        page = page + 1





Bismillah, lets start sracpping 461 pages of Sinta journals
1. Jurnal Cakrawala Pendidikan
/journals/detail/?id=703
2. Operations and Supply Chain Management: An International Journal
/journals/detail/?id=3203
3. IJAL (Indonesian Journal of Applied Linguistics)
/journals/detail/?id=1099
4. Journal on Mathematics Education
/journals/detail/?id=2113
5. Buletin Ekonomi Moneter dan Perbankan
/journals/detail/?id=1103
6. Jurnal Pendidikan IPA Indonesia (Indonesian Journal of Science Education)
/journals/detail/?id=671
7. International Journal on Advanced Science, Engineering and Information Technology (IJASEIT)
/journals/detail/?id=696
8. Forest and Society
/journals/detail/?id=1266
9. International Journal of Electrical and Computer Engineering
/journals/detail/?id=693
10. Bulletin of Electrical Engineering and Informatics
/journals/detail/?id=681
11. Journal of the Indonesian Tropical Animal Agriculture
/journals/detail/?id=678
12. Journal of Engineering and Technological Sciences
/journa

# Save-to-Google-Drive version (runs in Colab)

[Back to Home](#scrollTo=YaazQkWbIXnv)


In [None]:
#@Sinta Hunter by beta.uliansyah@pknstan.ac.id
import requests
from bs4 import BeautifulSoup
from urllib.parse import urlparse 
import re
import csv
from datetime import date
import time
from google.colab import drive
drive.mount('/content/drive')

def sinta_lastpage():
    """
    Find last page of Sinta list
    """
    s = requests.Session()
    r = s.get('http://sinta.ristekbrin.go.id/journals')
    bsoup = BeautifulSoup(r.text, 'html.parser')
    lastpage = 1

    if r.status_code==200:
        # find the last page number in the pagination links
        for ultag in bsoup.find_all('ul', {'class': 'top-paging'}):
            for litag in ultag.find_all('li'):
                for links in litag.find_all('a', href=True):
                    # parse_qs did not work well for me in Python 3 when I wrote this code, so I use a regex
                    # .query or [4], see https://docs.python.org/3/library/urllib.parse.html
                    query = urlparse(links['href']).query
                    pages = re.findall(r'page=(\d+)', query)
                    if pages:
                        lastpage = int(pages[0])  # the last link carrying page= wins
    return lastpage

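def sinta_readpages(startpage=1, lastpage=None):
    if lastpage is None:
        lastpage = sinta_lastpage()  # resolve the default when called, not when the function is defined
    # prepare file name with date and write header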
    filename = "sinta-data-" + str(date.today()) + ".csv"
    header_row = ['journal_no', 'journal_id', 'journal_name', 'journal_accreditation', 'journal_url']
    with open("/content/drive/My Drive/sinta-tools/"+filename, mode='w') as sintacsv_file:
        sintacsv_writer = csv.writer(sintacsv_file, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
        sintacsv_writer.writerow(header_row)
    print(header_row)

    # loop over the Sinta listing pages from startpage to lastpage
    return_value = []
    currentpage = startpage
    journal_no = (startpage-1) * 9 + 1
    i = 0
    while currentpage < int(lastpage) + 1:
        # development purpose, uncomment the two lines below to limit the number of pages scraped
        #if currentpage == 620:
        #    break

        # session
        s = requests.Session()
        r = s.get('http://sinta.ristekbrin.go.id/journals?page='+ str(currentpage) +'&sort=impact')
        bsoup = BeautifulSoup(r.text, 'html.parser')
        journals_in_this_page = bsoup.find_all("dl", {"class":"uk-description-list-line"})
        for journal in journals_in_this_page:
            # journal name (first link in the entry)
            journal_link = journal.find('a')

            # accreditation level
            journal_info = journal.find('span', attrs={'class' :'index-val-small', 'style': re.compile('color')}) # span with any color style
            
            # Sinta journal ID: for now the full detail path, e.g. /journals/detail/?id=681
            journal_id = journal_link['href']
            #journal_id = re.sub(r"^.*=", "", journal_id)  # uncomment to keep only the number after "="
            
            # journal website URL, read from the journal's Sinta detail page
            # on a connection error, wait a few seconds and try again (up to 3 attempts)
            r = None
            for attempt in range(3):
                try:
                    r = s.get('http://sinta.ristekbrin.go.id'+journal_link['href'])
                    break
                except requests.exceptions.RequestException as e:
                    print('ERROR - attempt %d failed: %s' % (attempt + 1, e))
                    time.sleep(2 * (attempt + 1))  # pause before retrying or moving on
                    
            journal_url = ''  # stays empty if the page fails to load or has no "Website" link
            if r is not None and r.status_code==200:
                jsoup = BeautifulSoup(r.text, 'html.parser')
                # target: <a href="http://journal.unhas.ac.id/index.php/fs/index" ><i class="uk-icon-globe uk-text-primary"></i> Website</a> | 
                for link in jsoup.find_all('a', href=True):
                    if link.find(string=re.compile("Website")):
                        journal_url = link['href']
                        break
                #print("   URL: " + journal_url)
            # prepare result_data
            result_data = [journal_no, journal_id, journal_link.text, journal_info.text, journal_url]
            with open("/content/drive/My Drive/sinta-tools/"+filename, mode='a+') as sintacsv_file:
                sintacsv_writer = csv.writer(sintacsv_file, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
                sintacsv_writer.writerow(result_data)
            return_value.append(result_data)
            print(result_data)
            journal_no = journal_no +1
            i += 1
            
        currentpage = currentpage + 1
    return return_value

if __name__ == "__main__":
    start_time = time.time()
    sinta_readpages()
    duration = time.time() - start_time
    print(f"Executed in {duration} seconds")
    # sinta_readpages(startpage=3, lastpage=4)
    # sinta_readpages(7, 8)






Streaming output truncated to the last 5000 lines.
[150, '/journals/detail/?id=3050', 'Jurnal Tumbuhan Obat Indonesia', 'S2', 'http://www.b2p2toot.litbang.kemkes.go.id']
[151, '/journals/detail/?id=941', 'Jurnal Riset Pendidikan Matematika', 'S2', 'http://journal.uny.ac.id/index.php/jrpm']
[152, '/journals/detail/?id=638', 'Jurnal Obsesi: Jurnal Pendidikan Anak Usia Dini', 'S2', 'http://journal.universitaspahlawan.ac.id/index.php/obsesi']
[153, '/journals/detail/?id=2783', 'Health Science Journal of Indonesia', 'S2', 'http://ejournal.litbang.depkes.go.id/index.php/HSJI']
[154, '/journals/detail/?id=955', 'Islamica: Jurnal Studi Keislaman', 'S2', 'http://islamica.uinsby.ac.id/index.php/islamica/index']
[155, '/journals/detail/?id=3522', 'JIPF (Jurnal Ilmu Pendidikan Fisika)', 'S2', 'http://journal.stkipsingkawang.ac.id/index.php/JIPF']
[156, '/journals/detail/?id=746', 'Jurnal Penelitian dan Evaluasi Pendidikan', 'S2', 'http://journal.uny.ac.id/index.php/jpep/index']
[157,

Scratch notes typed during the run (clock time, followed where given by what appears to be the running row count):
12.38
12.44
12.53: 786
1.05: 1454
1.20: 2215
2.23: finished


