# Web Scraping

## Beautiful Soup

<p>The first step is to collect the links to the user pages on the AniList website.</p>

<p>First, we open the AniList website.</p>

<img src="asset\Anilist_HomePage.png" width=50%></img>

<p>Next, we look for the URL format used for each user's page. After some searching, the format turns out to be <b>"https://anilist.co/user/"+userId</b>.</p>

<p>Then we look at the layout of a user page to find out what data we can obtain.</p>

<img src="asset\Anilist_UserPage.png" width=50%></img>

<p>Next, we go to the 'Anime List' section to see which anime the user has watched.</p>

<img src="asset\Anilist_User_AnimeList.png" width=50%></img>

<p>Because the Anime List URL for each user has the format <b>"https://anilist.co/user/"+userId+'/animelist'</b>, we store a link in that format for every user. The code is as follows:</p>

In [1]:
## Store the user links

# Create an empty list to hold the links for all users.
linkAllUser = []

# Set how many users to scrape.
userCount = 10

# Loop over the user IDs and build one link per user.
for i in range(1, userCount+1):
    linkAllUser.append("https://anilist.co/user/" + str(i) + "/animelist")

linkAllUser

['https://anilist.co/user/1/animelist',
 'https://anilist.co/user/2/animelist',
 'https://anilist.co/user/3/animelist',
 'https://anilist.co/user/4/animelist',
 'https://anilist.co/user/5/animelist',
 'https://anilist.co/user/6/animelist',
 'https://anilist.co/user/7/animelist',
 'https://anilist.co/user/8/animelist',
 'https://anilist.co/user/9/animelist',
 'https://anilist.co/user/10/animelist']

<p>Next comes the scraping itself. To keep things simple, we work with the first user's link only; at the end, the steps are wrapped in a loop so that they apply to every user.</p>

<p>First we import the modules we need: requests to fetch a website's HTML, bs4 to parse it and extract data, and pandas to store the data in a DataFrame.</p>

In [2]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

<p>Next, we use the requests module to fetch the page's HTML.</p>

In [3]:
userPage = requests.get(linkAllUser[0])

<p>We parse the HTML with bs4.</p>

In [4]:
userSoup = BeautifulSoup(userPage.content, "html.parser")

<p>Then we use the browser's Inspect Element tool on the page to find the elements we want to extract.</p>

<p>We inspect the titles of the anime the user has watched.</p>

<img src="asset\Anilist_TitleScrap.png" width=50%></img>

<p>Every anime title is stored in a div element whose class attribute has the value "title".</p>

<p>On closer inspection, however, some div elements with the class "title" do not contain an anime title. We therefore select the parent element instead: the div whose class attribute has the value "entry-card row". If no such element is found, the code falls back to the class "entry row".</p>

In [5]:
titleAnime = userSoup.find_all("div", attrs = {"class": "entry-card row"})

if titleAnime == []:
    titleAnime = userSoup.find_all("div", attrs = {"class": "entry row"})
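The reason for selecting the parent element first can be seen on a small, offline example. This is a minimal sketch on hypothetical markup: the `list-head` class and the header text are assumptions made for illustration, and only the `entry-card row` and `title` classes come from the page inspected above.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one "title" div that is not an anime title (a column
# header), plus one list entry shaped like the entries on the Anime List page.
html = """
<div class="list-head"><div class="title">Title</div></div>
<div class="entry-card row">
  <div class="title"><a href="/anime/2251/Baccano/"> Baccano! </a></div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Too broad: every div with class "title", including the column header.
broad = [div.get_text(strip=True)
         for div in soup.find_all("div", attrs={"class": "title"})]

# Narrow: find the entry containers first, then the title inside each one.
entries = soup.find_all("div", attrs={"class": "entry-card row"})
narrow = [entry.find("div", attrs={"class": "title"}).get_text(strip=True)
          for entry in entries]

print(broad)   # header text and anime title
print(narrow)  # anime title only
```

Scoping the search to each entry container is what keeps unrelated "title" elements out of the result.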

In [6]:
titleAnime

[<div class="entry-card row"><div class="cover"><div class="edit"><svg aria-hidden="true" class="svg-inline--fa fa-ellipsis-h fa-w-16 fa-lg" data-icon="ellipsis-h" data-prefix="fas" focusable="false" role="img" viewbox="0 0 512 512" xmlns="http://www.w3.org/2000/svg"><path d="M328 256c0 39.8-32.2 72-72 72s-72-32.2-72-72 32.2-72 72-72 72 32.2 72 72zm104-72c-39.8 0-72 32.2-72 72s32.2 72 72 72 72-32.2 72-72-32.2-72-72-72zm-352 0c-39.8 0-72 32.2-72 72s32.2 72 72 72 72-32.2 72-72-32.2-72-72-72z" fill="currentColor"></path></svg></div> <div class="image" style="background-image:url(https://s4.anilist.co/file/anilistcdn/media/anime/cover/medium/bx2251-Wa30L0Abk50O.jpg);"></div></div> <div class="title"><a href="/anime/2251/Baccano/">
 Baccano!
 </a></div> <div class="score" label="Score" score="3"><svg aria-hidden="true" class="svg-inline--fa fa-smile fa-w-16 fa-lg" data-icon="smile" data-prefix="far" focusable="false" role="img" viewbox="0 0 496 512" xmlns="http://www.w3.org/2000/svg"><path 

<p>Next, we extract the text content of each element and append it to an initially empty list that will hold all the anime titles.</p>

In [7]:
mergeTitleAnime = []

for title in titleAnime:
    mergeTitleAnime.append(title.find("div", attrs = {"class" : "title"}).find("a").text.strip())

In [8]:
mergeTitleAnime

['Baccano!',
 'Bakemonogatari',
 'BANANA FISH',
 'Boku no Hero Academia 3',
 'Bokutachi wa Benkyou ga Dekinai!',
 'Danshi Koukousei no Nichijou',
 'Dungeon ni Deai wo Motomeru no wa Machigatteiru Darou ka',
 'Durarara!!',
 'Fate/Zero',
 'Fate/Zero 2nd Season',
 'Full Metal Panic!',
 'Gakuen Mokushiroku: HIGHSCHOOL OF THE DEAD',
 'Gekkan Shoujo Nozaki-kun',
 'Hagane no Renkinjutsushi: FULLMETAL ALCHEMIST',
 'Hanasaku Iroha',
 'Hibike! Euphonium',
 'Initial D',
 'Kaguya-sama wa Kokurasetai: Tensaitachi no Renai Zunousen',
 'Kakegurui',
 'Kakegurui ××',
 'Katanagatari',
 'Kill la Kill',
 'Kobayashi-san Chi no Maidragon',
 'Kono Subarashii Sekai ni Shukufuku wo!',
 'Kono Subarashii Sekai ni Shukufuku wo! 2',
 'Kyousougiga (TV)',
 'Little Witch Academia (TV)',
 'Mahou Shoujo Madoka☆Magica',
 'Monogatari Series: Second Season',
 'Nekomonogatari (Kuro)',
 'Byousoku 5 Centimeter',
 'Cencoroll',
 'Choujikuu Yousai Macross: Ai Oboete Imasu ka',
 'Death Billiards',
 'Evangelion Shin Movie: Ha',
 

The following script repeats these steps for every user:

In [9]:
mergeTitle = []

for item in linkAllUser:
    userPage = requests.get(item)
    userSoup = BeautifulSoup(userPage.content, "html.parser")
    titleAnime = userSoup.find_all("div", attrs = {"class": "entry-card row"})
    if titleAnime == []:
        titleAnime = userSoup.find_all("div", attrs = {"class": "entry row"})
    mergeTitleAnime = []
    for title in titleAnime:
        mergeTitleAnime.append(title.find("div", attrs = {"class" : "title"}).find("a").text.strip())
    mergeTitle.append(mergeTitleAnime)
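The per-page logic in the loop could also be factored into a function that takes an HTML string, which makes it possible to try it without a network request. The sketch below is one possible arrangement, not the notebook's own code: the function name `extract_titles` and the sample markup are hypothetical, and the check for a missing link is an added safeguard.

```python
from bs4 import BeautifulSoup


def extract_titles(html):
    """Return the anime titles found in the HTML of one Anime List page."""
    soup = BeautifulSoup(html, "html.parser")
    entries = soup.find_all("div", attrs={"class": "entry-card row"})
    if not entries:
        # Same fallback as in the loop above.
        entries = soup.find_all("div", attrs={"class": "entry row"})

    titles = []
    for entry in entries:
        title_div = entry.find("div", attrs={"class": "title"})
        link = title_div.find("a") if title_div is not None else None
        # Skip entries without a title link instead of raising AttributeError.
        if link is not None:
            titles.append(link.text.strip())
    return titles


# Hypothetical sample markup for an offline check.
sample = ('<div class="entry row"><div class="title">'
          '<a href="/anime/1/"> Kakegurui </a></div></div>')
no_link = '<div class="entry row"><div class="title"></div></div>'

print(extract_titles(sample))
print(extract_titles(no_link))
print(extract_titles("<html></html>"))
```

Inside the loop, the call would then be `extract_titles(userPage.content)` for each link in `linkAllUser`.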

In [11]:
pd.DataFrame({"Anime" : mergeTitle})

| | Anime |
|---|---|
| 0 | [Baccano!, Bakemonogatari, BANANA FISH, Boku n... |
| 1 | [[Oshi no Ko], Boku no Hero Academia 4, Chains... |
| 2 | [] |
| 3 | [] |
| 4 | [Goblin Slayer, Seishun Buta Yarou wa Bunny Gi... |
| 5 | [] |
| 6 | [Ansatsu Kyoushitsu, Arslan Senki (TV), Fate/s... |
| 7 | [Blood Lad, Choujigen Game Neptune THE ANIMATI... |
| 8 | [Aikatsu!, Ashita no Nadja, Cardcaptor Sakura,... |
| 9 | [Azumanga Daiou THE ANIMATION, Appleseed XIII,... |
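The result holds one list of titles per user. For later analysis, a long format with one row per user and title is often easier to work with. A minimal sketch in plain Python, assuming the user IDs start at 1 in the same order as `linkAllUser`; the sample data is a shortened stand-in for `mergeTitle`, using titles from the output above.

```python
# Shortened stand-in for mergeTitle: one list of titles per user.
merge_title_sample = [
    ["Baccano!", "Bakemonogatari"],
    ["[Oshi no Ko]", "Boku no Hero Academia 4"],
    [],  # a user for whom no titles were found
]

# One (user_id, title) pair per title; users with empty lists drop out.
pairs = [
    (user_id, title)
    for user_id, titles in enumerate(merge_title_sample, start=1)
    for title in titles
]

print(pairs)
```

The pairs can be passed to `pd.DataFrame(pairs, columns=["user_id", "anime"])` to get a flat table.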


## API

In [14]:
'''
curl "https://inside.fifa.com/api/ranking-overview?locale=en&dateId=id14289" ^
  -H "accept: application/json, text/plain, */*" ^
  -H "accept-language: id-ID,id;q=0.9,en-US;q=0.8,en;q=0.7" ^
  -H ^"cookie: OptanonAlertBoxClosed=2024-03-21T21:09:15.378Z; eupubconsent-v2=CP7ztygP7ztygAcABBENAsEsAP_gAEPgAChQg1NX_H__bW9r8Xr3aft0eY1P99j77sQxBhfJE-4FzLvW_JwXx2ExNA36tqIKmRIEu3bBIQFlHJDUTVigaogVryDMakWcgTNKJ6BkiFMRM2dYCF5vmwtj-QKY5vp9d3dx2D-t_dv83dzyz4VHn3e5_2e0eJCdA58tDfv9bROb-9IPd_58v4v0_F_rk2_eT1l_tevp7B8uft87_XU-9_fff79KAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAEQagCzDQqIA-yJCQi0DCKBACoKwgIoEAAAAJA0QEAJgwKdgYBLrCRACAFAAMEAIAAUZAAgAAEgAQiACQAoEAAEAgUAAIAAAgEADAwABgAtBAIAAQHQIUwIIFAsAEjMiIUwIQoEggJbKBBIAgQVwhCLPAggERMFAAACQAVgACAsFgMSSAlYkECXEG0AABAAgEEIFQik6MAQwJmy1U4om0ZWkBaAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAIAAACAA.f_wACHwAAAAA; __e_inc=1; 
OptanonConsent=isGpcEnabled=0&datestamp=Mon+Mar+25+2024+23^%^3A52^%^3A57+GMT^%^2B0700+(Western+Indonesia+Time)&version=202311.1.0&isIABGlobal=false&consentId=f5161ab5-198a-44e1-8de7-9e7dcc34f643&interactionCount=2&landingPath=NotLandingPage&groups=2^%^3A1^%^2C3^%^3A1^%^2C1^%^3A1^%^2C4^%^3A1^%^2CV2STACK42^%^3A1&hosts=H68^%^3A1^%^2CH39^%^3A1^%^2CH3^%^3A1^%^2CH98^%^3A1^%^2CH113^%^3A1^%^2CH96^%^3A1^%^2CH99^%^3A1^%^2CH1^%^3A1^%^2CH51^%^3A1^%^2CH36^%^3A1^%^2CH81^%^3A1^%^2CH94^%^3A1^%^2CH84^%^3A1^%^2CH87^%^3A1^%^2CH88^%^3A1^%^2CH70^%^3A1^%^2CH37^%^3A1^%^2CH89^%^3A1^%^2CH90^%^3A1^%^2CH48^%^3A1^%^2CH91^%^3A1^%^2CH71^%^3A1^%^2CH49^%^3A1^%^2CH69^%^3A1^%^2CH52^%^3A1^%^2CH43^%^3A1^%^2CH127^%^3A1^%^2CH5^%^3A1^%^2CH9^%^3A1&genVendors=&geolocation=ID^%^3BYO&AwaitingReconsent=false&browserGpcFlag=0^" ^
  -H "referer: https://inside.fifa.com/fifa-world-ranking/men" ^
  -H ^"sec-ch-ua: ^\^"Google Chrome^\^";v=^\^"123^\^", ^\^"Not:A-Brand^\^";v=^\^"8^\^", ^\^"Chromium^\^";v=^\^"123^\^"^" ^
  -H "sec-ch-ua-mobile: ?1" ^
  -H ^"sec-ch-ua-platform: ^\^"Android^\^"^" ^
  -H "sec-fetch-dest: empty" ^
  -H "sec-fetch-mode: cors" ^
  -H "sec-fetch-site: same-origin" ^
  -H "user-agent: Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Mobile Safari/537.36"
'''
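The cURL command above appears to be written for the Windows command prompt, where `^` is the escape character and the line-continuation marker. Sequences such as `^%^3A` are therefore escaped forms of `%3A`, and the carets should not be carried over into Python strings. A small sketch using a fragment of the `OptanonConsent` cookie; the plain `replace` is an assumption that only covers this `^%` case, not the escaped quotes in the `sec-ch-ua` headers.

```python
from urllib.parse import unquote_plus

# Fragment of the OptanonConsent cookie as it appears in the cURL command.
raw = "datestamp=Mon+Mar+25+2024+23^%^3A52^%^3A57+GMT^%^2B0700"

# Dropping the cmd escape character restores the value the browser sent.
unescaped = raw.replace("^", "")
print(unescaped)

# Percent-decoding is only for reading the value; a header should carry
# the encoded form.
decoded = unquote_plus(unescaped)
print(decoded)
```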


In [15]:
import requests
import json
import pandas as pd

In [16]:
url = "https://inside.fifa.com/api/ranking-overview?locale=en&dateId=id14289"

headers = {
    "accept": "application/json, text/plain, */*",
    "accept-language": "id-ID,id;q=0.9,en-US;q=0.8,en;q=0.7",
    "cookie": "OptanonAlertBoxClosed=2024-03-21T21:09:15.378Z; eupubconsent-v2=CP7ztygP7ztygAcABBENAsEsAP_gAEPgAChQg1NX_H__bW9r8Xr3aft0eY1P99j77sQxBhfJE-4FzLvW_JwXx2ExNA36tqIKmRIEu3bBIQFlHJDUTVigaogVryDMakWcgTNKJ6BkiFMRM2dYCF5vmwtj-QKY5vp9d3dx2D-t_dv83dzyz4VHn3e5_2e0eJCdA58tDfv9bROb-9IPd_58v4v0_F_rk2_eT1l_tevp7B8uft87_XU-9_fff79KAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAEQagCzDQqIA-yJCQi0DCKBACoKwgIoEAAAAJA0QEAJgwKdgYBLrCRACAFAAMEAIAAUZAAgAAEgAQiACQAoEAAEAgUAAIAAAgEADAwABgAtBAIAAQHQIUwIIFAsAEjMiIUwIQoEggJbKBBIAgQVwhCLPAggERMFAAACQAVgACAsFgMSSAlYkECXEG0AABAAgEEIFQik6MAQwJmy1U4om0ZWkBaAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAIAAACAA.f_wACHwAAAAA; 
OptanonConsent=isGpcEnabled=0&datestamp=Mon+Mar+25+2024+21^%^3A52^%^3A22+GMT^%^2B0700+(Western+Indonesia+Time)&version=202311.1.0&isIABGlobal=false&consentId=f5161ab5-198a-44e1-8de7-9e7dcc34f643&interactionCount=2&landingPath=NotLandingPage&groups=2^%^3A1^%^2C3^%^3A1^%^2C1^%^3A1^%^2C4^%^3A1^%^2CV2STACK42^%^3A1&hosts=H68^%^3A1^%^2CH39^%^3A1^%^2CH3^%^3A1^%^2CH98^%^3A1^%^2CH113^%^3A1^%^2CH96^%^3A1^%^2CH99^%^3A1^%^2CH1^%^3A1^%^2CH51^%^3A1^%^2CH36^%^3A1^%^2CH81^%^3A1^%^2CH94^%^3A1^%^2CH84^%^3A1^%^2CH87^%^3A1^%^2CH88^%^3A1^%^2CH70^%^3A1^%^2CH37^%^3A1^%^2CH89^%^3A1^%^2CH90^%^3A1^%^2CH48^%^3A1^%^2CH91^%^3A1^%^2CH71^%^3A1^%^2CH49^%^3A1^%^2CH69^%^3A1^%^2CH52^%^3A1^%^2CH43^%^3A1^%^2CH127^%^3A1^%^2CH5^%^3A1^%^2CH9^%^3A1&genVendors=&geolocation=ID^%^3BYO&AwaitingReconsent=false&browserGpcFlag=0",
    "referer": "https://inside.fifa.com/fifa-world-ranking/men",
    "sec-ch-ua": '"Google Chrome";v="123", "Not:A-Brand";v="8", "Chromium";v="123"',
    "sec-ch-ua-mobile": "?1",
    "sec-ch-ua-platform": '"Android"',
    "sec-fetch-dest": "empty",
    "sec-fetch-mode": "cors",
    "sec-fetch-site": "same-origin",
    "user-agent": "Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Mobile Safari/537.36"
}

In [19]:
response = requests.get(url = url, headers = headers).text

In [21]:
dicti = json.loads(response)

In [32]:
rank = []
name = []
totalPoints = []

for item in dicti["rankings"]:
    rank.append(item["rankingItem"]["rank"])
    name.append(item["rankingItem"]["name"])
    totalPoints.append(item["rankingItem"]["totalPoints"])

In [33]:
pd.DataFrame({"rank" : rank, "name": name, "totalPoints": totalPoints})

| | rank | name | totalPoints |
|---|---|---|---|
| 0 | 1.0 | Argentina | 1855.20 |
| 1 | 2.0 | France | 1845.44 |
| 2 | 3.0 | England | 1800.05 |
| 3 | 4.0 | Belgium | 1798.46 |
| 4 | 5.0 | Brazil | 1784.09 |
| ... | ... | ... | ... |
| 206 | 207.0 | British Virgin Islands | 807.57 |
| 207 | 208.0 | US Virgin Islands | 796.78 |
| 208 | 209.0 | Anguilla | 785.69 |
| 209 | 210.0 | San Marino | 741.61 |
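The three parallel lists can also be replaced by a single list of row dictionaries. The sketch below runs on a trimmed sample instead of the live response: the sample keeps only the keys used in the loop above (`rankings`, `rankingItem`, `rank`, `name`, `totalPoints`), and its values are taken from the first two rows of the table. The real response may well contain further fields.

```python
import json

# Trimmed, hand-written sample with the same nesting the loop above relies on.
sample = """
{"rankings": [
  {"rankingItem": {"rank": 1, "name": "Argentina", "totalPoints": 1855.2}},
  {"rankingItem": {"rank": 2, "name": "France", "totalPoints": 1845.44}}
]}
"""
data = json.loads(sample)

# One dictionary per team; .get() yields None if a field is absent.
fields = ("rank", "name", "totalPoints")
rows = [
    {field: entry["rankingItem"].get(field) for field in fields}
    for entry in data["rankings"]
]

print(rows)
```

`pd.DataFrame(rows)` then produces the same three columns as the DataFrame built from the separate lists.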


## Selenium