In [1]:
# from google.colab import drive
# drive.mount('/content/drive')

# Web Scraper

In [2]:
from bs4 import BeautifulSoup
import requests

The only required input is the URL.

In [3]:
url = 'https://www.britannica.com/topic/list-of-state-capitals-in-the-United-States-2119210'
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html')
print(soup)

<!DOCTYPE html>

<html class="topic-desktop ui-unknown0 ui-unknown" lang="en">
<head prefix="og: https://ogp.me/ns# fb: https://ogp.me/ns/fb#">
<meta charset="utf-8"/>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type">
<meta content="width=device-width, initial-scale=1.0" name="viewport"/>
<link href="https://cdn.britannica.com/mendel-resources/3-124" rel="dns-prefetch"/>
<link href="https://cdn.britannica.com/mendel-resources/3-124" rel="preconnect"/>
<link as="script" href="https://www.googletagservices.com/tag/js/gpt.js" rel="preload">
<link href="/favicon.png" rel="icon">
<meta content="This is a list of the cities that are state capitals in the United States, ordered alphabetically by state. The list also provides the most recent U.S. census population for each city as well as an estimated population. (This list does not include the capital of the United States, Washington, D.C.)" name="description"/>
<meta content="list of state capitals in the United States, enc

Find all tables on the page. This page contains only one, so the result is easy to work with.

In [4]:
soup.find_all('table') 

[<table> <thead> <tr> <th>state</th> <th>capital</th> <th>population of capital: census</th> <th>population of capital: estimated</th> </tr> </thead> <tbody> <tr> <td><a class="md-crosslink" data-show-preview="true" href="https://www.britannica.com/place/Alabama-state">Alabama</a></td> <td><a class="md-crosslink" data-show-preview="true" href="https://www.britannica.com/place/Montgomery-Alabama">Montgomery</a></td> <td>(2020) 200,603</td> <td>(2021 est.) 198,665</td> </tr> <tr> <td><a class="md-crosslink" data-show-preview="true" href="https://www.britannica.com/place/Alaska">Alaska</a></td> <td><a class="md-crosslink" data-show-preview="true" href="https://www.britannica.com/place/Juneau">Juneau</a></td> <td>(2020) 32,255</td> <td>(2021 est.) 31,973</td> </tr> <tr> <td><a class="md-crosslink" data-show-preview="true" href="https://www.britannica.com/place/Arizona-state">Arizona</a></td> <td><a class="md-crosslink" data-show-preview="true" href="https://www.britannica.com/place/Phoenix
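
When a page holds more than one table, it can help to pick the table first and then search for rows only inside it. The sketch below runs on a small inline HTML string; `sample_html` and the variable names are made up for illustration and are not taken from the Britannica page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with two tables; only the second one holds data.
sample_html = """
<table><tr><td>layout</td></tr></table>
<table>
  <thead><tr><th>state</th><th>capital</th></tr></thead>
  <tbody><tr><td>Alabama</td><td>Montgomery</td></tr></tbody>
</table>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

tables = sample_soup.find_all("table")   # every <table> in the markup
data_table = tables[1]                   # choose one by position
scoped_rows = data_table.find_all("tr")  # rows of that table only

print(len(tables), len(scoped_rows))
```

Searching from `data_table` instead of from the whole soup keeps rows of unrelated tables out of the result.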

Let's first get just the column headers of the table.

*Note: the `<th>` tag defines a header cell in an HTML table.*

In [5]:
titles = soup.find_all('th')
titles

[<th>state</th>,
 <th>capital</th>,
 <th>population of capital: census</th>,
 <th>population of capital: estimated</th>]

Since we do not need the tags, let's clean up the data.

In [6]:
titles_list = [title.text for title in titles]
titles_list

['state',
 'capital',
 'population of capital: census',
 'population of capital: estimated']

If the output still contains newlines or other unwanted characters, you can clean it further with, for example, `.strip()`.
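
As a small illustration of that kind of cleanup, the sketch below strips whitespace from a few header strings. The `raw_titles` values are invented to show the problem; the Britannica headers above were already clean.

```python
# Hypothetical header strings from a page that pads its cells with whitespace.
raw_titles = ["\n state ", "capital\n", "\tpopulation of capital: census\n"]

# str.strip() removes leading and trailing whitespace, including newlines and tabs.
clean_titles = [title.strip() for title in raw_titles]
print(clean_titles)
```

With BeautifulSoup tags, `tag.get_text(strip=True)` gives a similar result in one call.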

Next, create an empty DataFrame that uses these headers as its columns.

In [7]:
import pandas as pd

df = pd.DataFrame(columns = titles_list)
df

| | state | capital | population of capital: census | population of capital: estimated |
|---|---|---|---|---|


Let's scrape the remaining data and fill this table!

In [8]:
rows = soup.find_all('tr')
len(rows)

51

The data we are interested in sits inside the `<td>` tags.

*Note: the `<td>` tag defines a standard data cell in an HTML table.*

In [9]:
# -- Long version --
# row_data = []
# for row in rows:
#   row_data.append(row.find_all('td'))

# -- Short version --
row_data = [row.find_all('td') for row in rows]
row_data

[[],
 [<td><a class="md-crosslink" data-show-preview="true" href="https://www.britannica.com/place/Alabama-state">Alabama</a></td>,
  <td><a class="md-crosslink" data-show-preview="true" href="https://www.britannica.com/place/Montgomery-Alabama">Montgomery</a></td>,
  <td>(2020) 200,603</td>,
  <td>(2021 est.) 198,665</td>],
 [<td><a class="md-crosslink" data-show-preview="true" href="https://www.britannica.com/place/Alaska">Alaska</a></td>,
  <td><a class="md-crosslink" data-show-preview="true" href="https://www.britannica.com/place/Juneau">Juneau</a></td>,
  <td>(2020) 32,255</td>,
  <td>(2021 est.) 31,973</td>],
 [<td><a class="md-crosslink" data-show-preview="true" href="https://www.britannica.com/place/Arizona-state">Arizona</a></td>,
  <td><a class="md-crosslink" data-show-preview="true" href="https://www.britannica.com/place/Phoenix-Arizona">Phoenix</a></td>,
  <td>(2020) 1,608,139</td>,
  <td>(2021 est.) 1,624,569</td>],
 [<td><a class="md-crosslink" data-show-preview="true" hr

The first row collected is the header row. It contains `<th>` cells rather than `<td>` cells, so its list is empty.

In [10]:
row_data[0]

[]
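
The effect can be reproduced on a tiny table. This is a minimal sketch on invented markup (`sample_html` is not the real page): the header row yields an empty `<td>` list, and filtering on truthiness drops it.

```python
from bs4 import BeautifulSoup

# Hypothetical three-row table: one header row and two data rows.
sample_html = """
<table>
  <tr><th>state</th><th>capital</th></tr>
  <tr><td>Alabama</td><td>Montgomery</td></tr>
  <tr><td>Alaska</td><td>Juneau</td></tr>
</table>
"""
sample_rows = BeautifulSoup(sample_html, "html.parser").find_all("tr")

cells_per_row = [row.find_all("td") for row in sample_rows]

# The header row has only <th> cells, so its <td> list is empty and falsy.
data_rows = [cells for cells in cells_per_row if cells]

print(len(cells_per_row), len(data_rows))
```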

The first data row is actually at index 1:

In [11]:
row_data[1]

[<td><a class="md-crosslink" data-show-preview="true" href="https://www.britannica.com/place/Alabama-state">Alabama</a></td>,
 <td><a class="md-crosslink" data-show-preview="true" href="https://www.britannica.com/place/Montgomery-Alabama">Montgomery</a></td>,
 <td>(2020) 200,603</td>,
 <td>(2021 est.) 198,665</td>]

However, we only want the text portion.

In [12]:
print(row_data[1][0].text)
print(row_data[1][1].text)
print(row_data[1][2].text)
print(row_data[1][3].text)

Alabama
Montgomery
(2020) 200,603
(2021 est.) 198,665


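The population cells still carry a year label such as `(2020)` and thousands separators, so more cleaning is needed before they can be used as numbers.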
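
One possible way to do that cleaning is a regular expression that separates the year from the count. This is only a sketch: the pattern and the helper name `parse_population` are assumptions based on the two formats printed above, and other rows may need extra cases.

```python
import re

# Matches strings shaped like "(2020) 200,603" or "(2021 est.) 198,665".
POPULATION_PATTERN = re.compile(r"\((\d{4})(?: est\.)?\)\s*([\d,]+)")

def parse_population(cell_text):
    """Return (year, population) as integers, or None if the text does not match."""
    match = POPULATION_PATTERN.search(cell_text)
    if match is None:
        return None
    year = int(match.group(1))
    population = int(match.group(2).replace(",", ""))  # drop thousands separators
    return year, population

print(parse_population("(2020) 200,603"))
print(parse_population("(2021 est.) 198,665"))
```

Returning `None` for unmatched text makes unexpected cell formats easy to spot later instead of raising in the middle of a loop.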

Now add the rows to the DataFrame.

But does this code work?

In [13]:
# Drop rows without any <td> cells (here, only the header row).
# Unlike removing index 0, this is safe to re-run.
row_data = [cells for cells in row_data if cells]

In [14]:
for cells in row_data:
    # Keep only the text of each <td> cell in this row.
    values = [cell.text for cell in cells]

    # Append the row at the next free integer index.
    df.loc[len(df)] = values


In [15]:
df

| | state | capital | population of capital: census | population of capital: estimated |
|---|---|---|---|---|
| 0 | Alabama | Montgomery | (2020) 200,603 | (2021 est.) 198,665 |
| 1 | Alaska | Juneau | (2020) 32,255 | (2021 est.) 31,973 |
| 2 | Arizona | Phoenix | (2020) 1,608,139 | (2021 est.) 1,624,569 |
| 3 | Arkansas | Little Rock | (2020) 202,591 | (2021 est.) 201,998 |
| 4 | California | Sacramento | (2020) 524,943 | (2021 est.) 525,041 |
| 5 | Colorado | Denver | (2020) 715,522 | (2021 est.) 711,463 |
| 6 | Connecticut | Hartford | (2020) 121,054 | (2021 est.) 120,576 |
| 7 | Delaware | Dover | (2020) 39,403 | (2021 est.) 38,992 |
| 8 | Florida | Tallahassee | (2020) 196,068 | (2021 est.) 197,102 |
| 9 | Georgia | Atlanta | (2020) 498,715 | (2021 est.) 496,461 |
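
Appending with `df.loc[len(df)]` works, but it grows the DataFrame one row at a time. An alternative is to collect the row values in a plain list and build the DataFrame in a single call. In this sketch the two sample rows are copied from the output above, and `records` and `capitals` are illustrative names.

```python
import pandas as pd

columns = [
    "state",
    "capital",
    "population of capital: census",
    "population of capital: estimated",
]

# Stand-in for the text extracted from each row's <td> cells.
records = [
    ["Alabama", "Montgomery", "(2020) 200,603", "(2021 est.) 198,665"],
    ["Alaska", "Juneau", "(2020) 32,255", "(2021 est.) 31,973"],
]

# Build the whole table at once instead of appending row by row.
capitals = pd.DataFrame(records, columns=columns)
print(capitals.shape)
```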


TODO:

- Scrape another table from Wikipedia
- Generate one or more new tables as DataFrames
- Feel free to use other html tags
- Clean & preprocess

---

Other websites to try, for instance:
- https://www.timesjobs.com/
- https://www.tripadvisor.com/

---

## Ragnarok Online City Web Scraping

In [16]:
from bs4 import BeautifulSoup
import requests

url = "https://ratemyserver.net/index.php?page=areainfo&area=5999"

page = requests.get(url)
soup = BeautifulSoup(page.text, 'html')
print(soup)


<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">

<html><head>
<!-- Google tag (gtag.js) -->
<script async="" src="https://www.googletagmanager.com/gtag/js?id=G-25VQYCGE3Q"></script>
<script>
	  window.dataLayer = window.dataLayer || [];
	  function gtag(){dataLayer.push(arguments);}
	  gtag('js', new Date());

	  gtag('config', 'G-25VQYCGE3Q');
	</script>
<title>All Towns - Ragnarok Map Database</title>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<meta content="Detailed Information on All Towns  Ragnarok Online, include map images, monsters on each map, monster amount on each map, links to each monster's details." name="description"/>
<meta content=" All Towns, where is All Towns, monster at All Towns, monster spawn in All Towns, mmorpg, ragnarok, Ragnarok Online, ro, RO Field, RO Dungeon, ro world map, ro dungeon map, new world" name="keywords"/>
<meta content="width=device-width" name="viewport"/>
<meta content="RateMyServer.Net 2005-2024" na

In [17]:
soup.find_all('table') 

[<table border="1" cellpadding="3" class="content_box_m" width="520px">
 <tr class="filled_header_db">
 <td align="left">
 <img src="images/circle.gif"/><b>Map: alberta</b>
 </td>
 <td align="right"><b>Area: Alberta</b></td>
 </tr>
 <tr>
 <td align="left" valign="top"><img src="https://file5s.ratemyserver.net/maps/alberta.gif"/><br/><img src="images/bu2.gif"/> <a class="tips_mm" href="index.php?page=npc_shop_warp&amp;map=alberta&amp;re_mob=0" target="_blank">Detailed View of This Map</a> <br/> </td>
 <td align="left" valign="top" width="400"><span style="color:orange"><b>Click on a monster below to view its detail: </b></span><br/><br/><div class="area_mob_li"><img src="images/bu2.gif"/> <a class="nbu_m" href="index.php?page=mob_db&amp;mob_id=1261" onclick="return popMob(1261,1,1)" onmouseout="hideddrivetip_image()" onmouseover="ddrivetip_image('&lt;img src =\'https://file5s.ratemyserver.net/mobs/1261.gif\'&gt;')">Wild Rose <b>(</b>1 / 120~180 min<b>)</b></a></div> </td>
 </tr>
 </tabl

In [18]:
titles = soup.find_all('th')
titles

[]

In [19]:
rows = soup.find_all('tr')
len(rows)

# -- Long version --
# row_data = []
# for row in rows:
#   row_data.append(row.find_all('td'))

# -- Short version --
row_data = [row.find_all('td') for row in rows]
row_data

[[<td align="left">
  <img src="images/circle.gif"/><b>Map: alberta</b>
  </td>,
  <td align="right"><b>Area: Alberta</b></td>],
 [<td align="left" valign="top"><img src="https://file5s.ratemyserver.net/maps/alberta.gif"/><br/><img src="images/bu2.gif"/> <a class="tips_mm" href="index.php?page=npc_shop_warp&amp;map=alberta&amp;re_mob=0" target="_blank">Detailed View of This Map</a> <br/> </td>,
  <td align="left" valign="top" width="400"><span style="color:orange"><b>Click on a monster below to view its detail: </b></span><br/><br/><div class="area_mob_li"><img src="images/bu2.gif"/> <a class="nbu_m" href="index.php?page=mob_db&amp;mob_id=1261" onclick="return popMob(1261,1,1)" onmouseout="hideddrivetip_image()" onmouseover="ddrivetip_image('&lt;img src =\'https://file5s.ratemyserver.net/mobs/1261.gif\'&gt;')">Wild Rose <b>(</b>1 / 120~180 min<b>)</b></a></div> </td>],
 [<td align="left">
  <img src="images/circle.gif"/><b>Map: aldebaran</b>
  </td>,
  <td align="right"><b>Area: Al De 

In [21]:
import pandas as pd

titles_list = ['City', 'Description', 'Image']

df = pd.DataFrame(columns = titles_list)
df

| | City | Description | Image |
|---|---|---|---|


In [22]:
row_data[0]

[<td align="left">
 <img src="images/circle.gif"/><b>Map: alberta</b>
 </td>,
 <td align="right"><b>Area: Alberta</b></td>]

In [23]:
row_data[1]

[<td align="left" valign="top"><img src="https://file5s.ratemyserver.net/maps/alberta.gif"/><br/><img src="images/bu2.gif"/> <a class="tips_mm" href="index.php?page=npc_shop_warp&amp;map=alberta&amp;re_mob=0" target="_blank">Detailed View of This Map</a> <br/> </td>,
 <td align="left" valign="top" width="400"><span style="color:orange"><b>Click on a monster below to view its detail: </b></span><br/><br/><div class="area_mob_li"><img src="images/bu2.gif"/> <a class="nbu_m" href="index.php?page=mob_db&amp;mob_id=1261" onclick="return popMob(1261,1,1)" onmouseout="hideddrivetip_image()" onmouseover="ddrivetip_image('&lt;img src =\'https://file5s.ratemyserver.net/mobs/1261.gif\'&gt;')">Wild Rose <b>(</b>1 / 120~180 min<b>)</b></a></div> </td>]

In [24]:
row_data[2]

[<td align="left">
 <img src="images/circle.gif"/><b>Map: aldebaran</b>
 </td>,
 <td align="right"><b>Area: Al De Baran</b></td>]

Looking at `row_data`, the entries at even indices hold the name of each map and its city, \
while the entries at odd indices hold the image of that city. \
So the code has to collect the data by pairing up every two consecutive indices.
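
The pairing idea can be sketched with plain lists before applying it to the real tags. The `sample_rows` values below are simplified stand-ins for `row_data`, not the actual scraped cells.

```python
# Stand-in for row_data: name rows and image rows alternate.
sample_rows = [
    ["Map: alberta", "Area: Alberta"],        # index 0: names
    ["alberta.gif", "monster list"],          # index 1: image row
    ["Map: aldebaran", "Area: Al De Baran"],  # index 2: names
    ["aldebaran.gif", "monster list"],        # index 3: image row
]

# Slice out the even and odd positions, then walk them side by side.
pairs = list(zip(sample_rows[0::2], sample_rows[1::2]))

for name_row, image_row in pairs:
    print(name_row[0], "->", image_row[0])
```

Because `zip` stops at the shorter sequence, a trailing name row without a matching image row is skipped rather than raising an `IndexError`.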

In [25]:
len(row_data)

54

In [27]:
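for i in range(0, len(row_data), 2):
    # Even index: the row with the map name and the area name.
    city = row_data[i][0].text
    description = row_data[i][1].text
    # Odd index: the row whose first cell holds the map image.
    image = row_data[i+1][0].find('img')['src']
    df.loc[len(df)] = [city, description, image]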

In [28]:
df

| | City | Description | Image |
|---|---|---|---|
| 0 | \nMap: alberta\n | Area: Alberta | https://file5s.ratemyserver.net/maps/alberta.gif |
| 1 | \nMap: aldebaran\n | Area: Al De Baran | https://file5s.ratemyserver.net/maps/aldebaran... |
| 2 | \nMap: amatsu\n | Area: Amatsu, the Land of Destiny | https://file5s.ratemyserver.net/maps/amatsu.gif |
| 3 | \nMap: ayothaya\n | Area: Ayothaya | https://file5s.ratemyserver.net/maps/ayothaya.gif |
| 4 | \nMap: brasilis\n | Area: Brasilis | https://file5s.ratemyserver.net/maps/brasilis.gif |
| 5 | \nMap: comodo\n | Area: Beach town, Comodo | https://file5s.ratemyserver.net/maps/comodo.gif |
| 6 | \nMap: dicastes01\n | Area: El Dicastes - El Dicastes, the Sapha Cap... | https://file5s.ratemyserver.net/maps/dicastes0... |
| 7 | \nMap: einbroch\n | Area: Einbroch, the city of steel | https://file5s.ratemyserver.net/maps/einbroch.gif |
| 8 | \nMap: geffen\n | Area: Geffen | https://file5s.ratemyserver.net/maps/geffen.gif |
| 9 | \nMap: gonryun\n | Area: Gonryun, the Hermit Land | https://file5s.ratemyserver.net/maps/gonryun.gif |


Clean the text in the City column by removing the `\n` characters.

In [29]:
df['City'] = df['City'].str.replace('\n', '')

In [30]:
df

| | City | Description | Image |
|---|---|---|---|
| 0 | Map: alberta | Area: Alberta | https://file5s.ratemyserver.net/maps/alberta.gif |
| 1 | Map: aldebaran | Area: Al De Baran | https://file5s.ratemyserver.net/maps/aldebaran... |
| 2 | Map: amatsu | Area: Amatsu, the Land of Destiny | https://file5s.ratemyserver.net/maps/amatsu.gif |
| 3 | Map: ayothaya | Area: Ayothaya | https://file5s.ratemyserver.net/maps/ayothaya.gif |
| 4 | Map: brasilis | Area: Brasilis | https://file5s.ratemyserver.net/maps/brasilis.gif |
| 5 | Map: comodo | Area: Beach town, Comodo | https://file5s.ratemyserver.net/maps/comodo.gif |
| 6 | Map: dicastes01 | Area: El Dicastes - El Dicastes, the Sapha Cap... | https://file5s.ratemyserver.net/maps/dicastes0... |
| 7 | Map: einbroch | Area: Einbroch, the city of steel | https://file5s.ratemyserver.net/maps/einbroch.gif |
| 8 | Map: geffen | Area: Geffen | https://file5s.ratemyserver.net/maps/geffen.gif |
| 9 | Map: gonryun | Area: Gonryun, the Hermit Land | https://file5s.ratemyserver.net/maps/gonryun.gif |
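
A possible next cleaning step is to drop the `Map: ` and `Area: ` labels so that only the names remain. The helper `strip_label` below is a hypothetical name introduced for this sketch, not part of the notebook above.

```python
def strip_label(text, label):
    """Remove surrounding whitespace and a leading label such as 'Map: '."""
    text = text.strip()
    if text.startswith(label):
        text = text[len(label):]
    return text

print(strip_label("\nMap: alberta\n", "Map: "))
print(strip_label("Area: Amatsu, the Land of Destiny", "Area: "))
```

On the DataFrame this could be applied per column, for example with `df['City'].map(lambda s: strip_label(s, 'Map: '))`.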


---

ศวิษฐ์ โกสียอัมพร 65070507238