# Scraping KamonDB for Kamon crest images + descriptions

KamonDB (https://kamondb.com/) is a Japanese website (and custom Kamon designer) that hosts a large collection of uniformly sized Kamon images. The site sorts these images into 7 groups, including:
- Plant pattern (the biggest group)
- Animal crest
- Vessel crest
- Architectural pattern
- Geometric pattern
- Character pattern

As far as I can tell, there's no consistent order to the naming of each category and subcategory, so these might need to be gathered by hand. However, each page that contains crests and their descriptions is consistently structured:
 - an `<article>` tag (or, more specifically, a `<table>` tag) contains all elements with crests and descriptions
    - all crests are stored in `<td>` tags
    - each crest is an `<img>` plus a string in a `<font>` tag


### Try first: Scraping a page

In [1]:
#Import requests and beautifulsoup4, plus time (request delays) and pandas (CSV export)
import requests
from bs4 import BeautifulSoup
import time
import pandas as pd

In [2]:
#First page url:
url = "https://kamondb.com/object/274/"
raw_html = requests.get(url).content
soup_doc = BeautifulSoup(raw_html, 'html.parser')
print(soup_doc.prettify())
#this scrapes the untranslated (Japanese) page


<!DOCTYPE html>
<html lang="ja">
 <head>
  <meta charset="utf-8"/>
  <meta content="IE=edge" http-equiv="X-UA-Compatible"/>
  <meta content="width=device-width, initial-scale=1.0, viewport-fit=cover" name="viewport">
   <title>
    団扇｜うちわ  |  日本の家紋
   </title>
   <!-- OGP -->
   <meta content="article" property="og:type"/>
   <meta content="団扇は、紙を貼った一般の団扇、天狗が持つとされる羽団扇、戦場の采配に用いる軍配団扇の３つに大別されます。形状の面白さに加え、その実用性や天狗伝説、采配の道具として武家にも縁が深いところから、様々なデザインバリエーションがあります。源平盛衰記" property="og:description"/>
   <meta content="団扇｜うちわ" property="og:title"/>
   <meta content="https://kamondb.com/object/274/" property="og:url"/>
   <meta content="https://kamondb.com/wp/wp-content/uploads/2019/10/1-97.gif" property="og:image"/>
   <meta content="日本の家紋" property="og:site_name"/>
   <meta content="ja_JP" property="og:locale"/>
   <meta content="2019-10-03T15:15:18+09:00" property="article:published_time">
    <meta content="2019-11-09T15:52:36+09:00" property="article:modified_time">
     <meta content="器物紋" prop

In [3]:
## get to the table level
table = soup_doc.table
fans = table.find_all('td')
#each td has the link and the text
fans[0].img['src']  #link
fans[0].get_text()  #text

'\n房付き団扇'

In [6]:
#save links and text to list of objects
fan_object = []
for item in fans:
    fan_object.append({
        'image_link': item.img['src'],
        'description': item.get_text().replace('\n', '').strip()
    })

In [7]:
fan_object

[{'image_link': 'https://kamondb.com/wp/wp-content/uploads/2019/10/1-97.gif',
  'description': '房付き団扇'},
 {'image_link': 'https://kamondb.com/wp/wp-content/uploads/2019/10/2-92.gif',
  'description': '丸に一つ団扇'},
 {'image_link': 'https://kamondb.com/wp/wp-content/uploads/2019/10/3-85.gif',
  'description': '三つ団扇'},
 {'image_link': 'https://kamondb.com/wp/wp-content/uploads/2019/10/4-79.gif',
  'description': '羽団扇'},
 {'image_link': 'https://kamondb.com/wp/wp-content/uploads/2019/10/16-39.gif',
  'description': '変わり羽団扇'},
 {'image_link': 'https://kamondb.com/wp/wp-content/uploads/2019/10/鷹の羽団扇-1.jpg',
  'description': '鷹の羽団扇'},
 {'image_link': 'https://kamondb.com/wp/wp-content/uploads/2019/10/5-72.gif',
  'description': '唐団扇'},
 {'image_link': 'https://kamondb.com/wp/wp-content/uploads/2019/10/10-56.gif',
  'description': '中陰唐団扇'},
 {'image_link': 'https://kamondb.com/wp/wp-content/uploads/2019/10/6-68.gif',
  'description': '軍配唐団扇'},
 {'image_link': 'https://kamondb.com/wp/wp-content/up

## Next Steps:
Getting a list of objects containing the image link and description from each page is straightforward. Next, I need a list of all the KamonDB pages I want to scrape, which means collecting a list of links. This list can be scraped from each category's directory page: each page contains a `<div id="list">` that in turn contains `<a>` elements with hrefs.
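Category directories appear to be paginated: the two URLs used in this notebook are `https://kamondb.com/category/object/` and `https://kamondb.com/category/object/page/2/`. A minimal sketch for generating such page URLs follows. `category_page_urls` is a hypothetical helper, and the assumption that other categories follow the same `/page/N/` pattern has not been checked against the site.

```python
BASE_URL = "https://kamondb.com/category"


def category_page_urls(slug, n_pages):
    """Return directory page URLs for one category (hypothetical helper).

    Assumes page 1 lives at /category/<slug>/ and later pages at
    /category/<slug>/page/<n>/, as seen for the 'object' category.
    """
    if n_pages < 1:
        raise ValueError("n_pages must be at least 1")
    urls = [f"{BASE_URL}/{slug}/"]
    urls.extend(f"{BASE_URL}/{slug}/page/{n}/" for n in range(2, n_pages + 1))
    return urls


# The 'object' category is scraped as two directory pages in this notebook.
object_directory_urls = category_page_urls("object", 2)
```

The number of pages per category still has to be found by looking at the site; the helper only saves typing each URL out by hand.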

In [30]:
url = "https://kamondb.com/category/object/"
raw_html = requests.get(url).content
object_page = BeautifulSoup(raw_html, 'html.parser')
print(object_page.prettify())

<!DOCTYPE html>
<html lang="ja">
 <head>
  <meta charset="utf-8"/>
  <meta content="IE=edge" http-equiv="X-UA-Compatible"/>
  <meta content="width=device-width, initial-scale=1.0, viewport-fit=cover" name="viewport">
   <title>
    器物紋  |  日本の家紋
   </title>
   <!-- OGP -->
   <meta content="website" property="og:type"/>
   <meta content="「器物紋」の記事一覧です。" property="og:description"/>
   <meta content="器物紋" property="og:title"/>
   <meta content="https://kamondb.com/category/object/" property="og:url"/>
   <meta content="http://kamondb.com/wp/wp-content/themes/cocoon-master/screenshot.jpg" property="og:image"/>
   <meta content="日本の家紋" property="og:site_name"/>
   <meta content="ja_JP" property="og:locale"/>
   <meta content="2019-09-20T06:19:15+09:00" property="article:published_time">
    <meta content="2019-11-09T15:51:20+09:00" property="article:modified_time">
     <meta content="器物紋" property="article:section"/>
     <meta content="い" property="article:tag"/>
     <!-- /OGP -->
     <

In [31]:
#find all links in main section
list_a = object_page.find('div', id='list').find_all('a')
list_a



[<a class="entry-card-wrap a-wrap border-element cf" href="https://kamondb.com/object/37/" title="錨｜いかり">
 <article class="post-37 entry-card e-card cf post type-post status-publish format-standard has-post-thumbnail hentry category-object-post tag-40-post" id="post-37">
 <figure class="entry-card-thumb card-thumb e-card-thumb">
 <img alt="" src="https://kamondb.com/wp/wp-content/uploads/2019/09/1-8.gif"> <span class="cat-label cat-label-34">器物紋</span> </img></figure><!-- /.entry-card-thumb -->
 <div class="entry-card-content card-content e-card-content">
 <h2 class="entry-card-title card-title e-card-title" itemprop="headline">錨｜いかり</h2>
 <div class="entry-card-snippet card-snippet e-card-snippet">
         浅い海で船体をつなぎ止めるための重しが錨の役目です。元々は石や木を使っており「碇」の字を当てていたのですが、時代が下がって鉄製となりました。猫の爪のように引っ掻く形状ですので「錨」という字が生まれたそうです。面白いですね。船をつなぎ止める威力、形の力強さから紋章になったのでしょう。珍しい紋ですが、明治以降近代になってから、形が制定されたものもあり、けっこう形のバリエーションは豊富です。      </div>
 <div class="entry-card-meta card-meta e-card-meta">
 <div class="entry-car

In [32]:
#get list of links
object_page_links_p1 = []
for link in list_a:
    object_page_links_p1.append(link['href'])
object_page_links_p1

['https://kamondb.com/object/37/',
 'https://kamondb.com/object/278/',
 'https://kamondb.com/object/276/',
 'https://kamondb.com/object/274/',
 'https://kamondb.com/object/272/',
 'https://kamondb.com/object/270/',
 'https://kamondb.com/object/268/',
 'https://kamondb.com/object/266/',
 'https://kamondb.com/object/264/',
 'https://kamondb.com/object/262/',
 'https://kamondb.com/object/260/',
 'https://kamondb.com/object/258/',
 'https://kamondb.com/object/8231/',
 'https://kamondb.com/object/8218/',
 'https://kamondb.com/object/256/',
 'https://kamondb.com/object/254/',
 'https://kamondb.com/object/252/',
 'https://kamondb.com/object/250/',
 'https://kamondb.com/object/248/',
 'https://kamondb.com/object/246/',
 'https://kamondb.com/object/244/',
 'https://kamondb.com/object/242/',
 'https://kamondb.com/object/240/',
 'https://kamondb.com/object/238/',
 'https://kamondb.com/object/236/',
 'https://kamondb.com/object/234/',
 'https://kamondb.com/object/8241/',
 'https://kamondb.com/obje

In [33]:
#this is only the first page of object links, getting the second now
url = "https://kamondb.com/category/object/page/2/"
raw_html = requests.get(url).content
object_page_2 = BeautifulSoup(raw_html, 'html.parser')
list_a = object_page_2.find('div', id='list').find_all('a')
object_page_links_p2 = []
for link in list_a:
    object_page_links_p2.append(link['href'])
object_page_links_p2


['https://kamondb.com/object/207/',
 'https://kamondb.com/object/205/',
 'https://kamondb.com/object/203/',
 'https://kamondb.com/object/201/',
 'https://kamondb.com/object/198/',
 'https://kamondb.com/object/195/',
 'https://kamondb.com/object/193/',
 'https://kamondb.com/object/191/',
 'https://kamondb.com/object/187/',
 'https://kamondb.com/object/8303/',
 'https://kamondb.com/object/185/',
 'https://kamondb.com/object/183/',
 'https://kamondb.com/object/181/',
 'https://kamondb.com/object/178/',
 'https://kamondb.com/object/176/',
 'https://kamondb.com/object/174/',
 'https://kamondb.com/object/172/',
 'https://kamondb.com/object/170/',
 'https://kamondb.com/object/167/',
 'https://kamondb.com/object/164/',
 'https://kamondb.com/object/162/',
 'https://kamondb.com/object/160/',
 'https://kamondb.com/object/158/',
 'https://kamondb.com/object/155/',
 'https://kamondb.com/object/153/',
 'https://kamondb.com/object/151/',
 'https://kamondb.com/object/149/',
 'https://kamondb.com/objec

In [34]:
#now add the two lists together
all_object_links = object_page_links_p1 + object_page_links_p2
all_object_links

['https://kamondb.com/object/37/',
 'https://kamondb.com/object/278/',
 'https://kamondb.com/object/276/',
 'https://kamondb.com/object/274/',
 'https://kamondb.com/object/272/',
 'https://kamondb.com/object/270/',
 'https://kamondb.com/object/268/',
 'https://kamondb.com/object/266/',
 'https://kamondb.com/object/264/',
 'https://kamondb.com/object/262/',
 'https://kamondb.com/object/260/',
 'https://kamondb.com/object/258/',
 'https://kamondb.com/object/8231/',
 'https://kamondb.com/object/8218/',
 'https://kamondb.com/object/256/',
 'https://kamondb.com/object/254/',
 'https://kamondb.com/object/252/',
 'https://kamondb.com/object/250/',
 'https://kamondb.com/object/248/',
 'https://kamondb.com/object/246/',
 'https://kamondb.com/object/244/',
 'https://kamondb.com/object/242/',
 'https://kamondb.com/object/240/',
 'https://kamondb.com/object/238/',
 'https://kamondb.com/object/236/',
 'https://kamondb.com/object/234/',
 'https://kamondb.com/object/8241/',
 'https://kamondb.com/obje
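Plain concatenation keeps any link that happens to show up on more than one directory page. A small sketch for merging the lists while dropping repeats and keeping first-seen order is below. `merge_unique` is a hypothetical helper, and the duplicate in the sample data is made up for illustration.

```python
def merge_unique(*link_lists):
    """Merge lists of URLs, keeping first-seen order and dropping repeats."""
    merged = {}
    for links in link_lists:
        for link in links:
            merged.setdefault(link, None)  # dicts preserve insertion order
    return list(merged)


page_1 = ["https://kamondb.com/object/37/", "https://kamondb.com/object/278/"]
# The repeated /object/37/ entry is an invented duplicate, not real site data.
page_2 = ["https://kamondb.com/object/207/", "https://kamondb.com/object/37/"]
unique_links = merge_unique(page_1, page_2)
```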

In [None]:
#now, scrape this list of links. Add a delay so the server doesn't get overwhelmed.
object_kamons_all = []
failed_links = []

for link in all_object_links:
    time.sleep(1)  #delay for 1 second between requests
    try:
        raw_html = requests.get(link).content
        soup_doc = BeautifulSoup(raw_html, 'html.parser')
        #process the soup_doc
        table_elements = soup_doc.table.find_all('td')

        for item in table_elements:
            try:
                #named 'kamon' rather than 'object', which would shadow the Python builtin
                kamon = {
                    'image_link': item.img['src'],
                    'description': item.get_text()
                }
                object_kamons_all.append(kamon)
            except (TypeError, KeyError):
                #skip empty td elements (no img, or an img without src)
                continue
        print(f"Successfully retrieved {link}")
    except (requests.RequestException, AttributeError):
        #the request failed or the page has no table: print link
        print(f"Could not retrieve {link}")
        #add to failed links
        failed_links.append(link)

    

Successfully retrieved https://kamondb.com/object/37/
Successfully retrieved https://kamondb.com/object/278/
Successfully retrieved https://kamondb.com/object/276/
Successfully retrieved https://kamondb.com/object/274/
Successfully retrieved https://kamondb.com/object/272/
Successfully retrieved https://kamondb.com/object/270/
Successfully retrieved https://kamondb.com/object/268/
Successfully retrieved https://kamondb.com/object/266/
Successfully retrieved https://kamondb.com/object/264/
Successfully retrieved https://kamondb.com/object/262/
Successfully retrieved https://kamondb.com/object/260/
Successfully retrieved https://kamondb.com/object/258/
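The loop above sleeps between requests and records any link that fails. One possible extension is to retry a failed link a few times before giving up. The sketch below is not the notebook's code: `fetch_with_retry` is a hypothetical helper, and the `fetch` and `sleep` callables are injected so the logic can be tried without touching the network. In the notebook, `fetch` could be something like `lambda u: requests.get(u, timeout=10).content`.

```python
import time


def fetch_with_retry(url, fetch, retries=3, delay=1.0, sleep=time.sleep):
    """Call fetch(url), retrying on any exception up to `retries` attempts.

    Waits delay, 2 * delay, ... seconds between attempts and re-raises the
    last error if every attempt fails.
    """
    if retries < 1:
        raise ValueError("retries must be at least 1")
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            return fetch(url)
        except Exception as error:  # broad on purpose: fetch is caller-supplied
            last_error = error
            if attempt < retries:
                sleep(delay * attempt)
    raise last_error
```

Links that still fail after the last attempt can be appended to `failed_links` by catching the re-raised exception in the calling loop.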


In [None]:
object_kamons_all

[{'image_link': 'https://kamondb.com/wp/wp-content/uploads/2019/09/1-8.gif',
  'description': '\n錨'},
 {'image_link': 'https://kamondb.com/wp/wp-content/uploads/2019/09/2-8.gif',
  'description': '\n海軍錨'},
 {'image_link': 'https://kamondb.com/wp/wp-content/uploads/2019/09/3-4.gif',
  'description': '\n汽船錨'},
 {'image_link': 'https://kamondb.com/wp/wp-content/uploads/2019/09/4-4.gif',
  'description': '\n錨片喰'},
 {'image_link': 'https://kamondb.com/wp/wp-content/uploads/2019/09/5-4.gif',
  'description': '\n丸に汽船錨'},
 {'image_link': 'https://kamondb.com/wp/wp-content/uploads/2019/09/6-3.gif',
  'description': '\n細輪に錨'},
 {'image_link': 'https://kamondb.com/wp/wp-content/uploads/2019/09/7-3.gif',
  'description': '\n四つ錨'},
 {'image_link': 'https://kamondb.com/wp/wp-content/uploads/2019/09/8-3.gif',
  'description': '\n錨桐'},
 {'image_link': 'https://kamondb.com/wp/wp-content/uploads/2019/09/9-4.gif',
  'description': '\n綱付き錨'},
 {'image_link': 'https://kamondb.com/wp/wp-content/uploads/2019
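The descriptions in this output still carry the leading `\n` that the single-page test removed with `.replace('\n', '').strip()`. A sketch for cleaning the collected records after the fact is below. `clean_records` is a hypothetical helper, and dropping records whose description ends up empty is an assumption about what is wanted.

```python
def clean_records(records):
    """Return copies of the records with whitespace-trimmed descriptions.

    Records whose description is empty after trimming are dropped.
    """
    cleaned = []
    for record in records:
        description = record["description"].strip()
        if description:
            cleaned.append({**record, "description": description})
    return cleaned


# Two records copied from the output above.
sample = [
    {"image_link": "https://kamondb.com/wp/wp-content/uploads/2019/09/1-8.gif",
     "description": "\n錨"},
    {"image_link": "https://kamondb.com/wp/wp-content/uploads/2019/09/2-8.gif",
     "description": "\n海軍錨"},
]
cleaned_sample = clean_records(sample)
```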

In [None]:
#save output to a csv
df = pd.DataFrame(object_kamons_all)
df.to_csv('Data/kamon_objects.csv', index=False)
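Writing to `Data/kamon_objects.csv` only works if the `Data/` folder already exists. A sketch that creates the folder first is below. It swaps pandas for the standard library's `csv` module purely to stay dependency-free. `save_records` is a hypothetical helper, and the `utf-8-sig` encoding is an assumption aimed at making the Japanese text open cleanly in Excel.

```python
import csv
from pathlib import Path


def save_records(records, path):
    """Write a list of dicts to a UTF-8 CSV, creating parent folders first."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    # utf-8-sig adds a byte-order mark, which helps Excel detect UTF-8 text
    with path.open("w", newline="", encoding="utf-8-sig") as handle:
        writer = csv.DictWriter(handle, fieldnames=["image_link", "description"])
        writer.writeheader()
        writer.writerows(records)
    return path
```

With pandas, the equivalent would be to create the folder with `Path('Data').mkdir(exist_ok=True)` before calling `df.to_csv`.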

Troubleshooting some failed pages below. The scraping loop above has since been fixed, which is why `failed_links` now comes back empty.

In [None]:
#some links failed
failed_links

[]

In [None]:
#is this an issue with the page structure?
url = failed_links[0]
raw_html = requests.get(url).content
soup_doc = BeautifulSoup(raw_html, 'html.parser')




In [None]:
#process the soup_doc
table_elements = soup_doc.table.find_all('td')
failed_elements = []
counter = 0
for item in table_elements:
    #named 'kamon' rather than 'object', which would shadow the Python builtin
    kamon = {
        'image_link': item.img['src'],
        'description': item.get_text()
    }
    counter += 1
    print(counter)
    print(kamon['description'])
    failed_elements.append(kamon)
#fails because some pages contain empty td elements (item.img is None)

1

丸に三本骨扇
2

丸に五本骨扇
3

丸に七本骨扇
4

五本骨扇
5

陰五本骨扇
6

七本骨扇
7

房扇
8

檜扇
9

陰檜扇
10

丸に日の丸扇
11

丸に房扇
12

丸に二階扇
13

丸に尻合わせ二つ扇
14

三つ扇
15

三つ反り扇
16

並び扇
17

丸に並び扇
18

丸に三つ扇
19

丸に七本骨三つ扇
20

九本骨三つ雁木扇
21

五つ雁木扇車
22

佐竹扇
23

扇に八の字
24

重ね扇
25

九つ矢扇
26

横重ね扇
27

三つ盛り扇
28

石持ち地抜き扇
29

三つ日の丸扇
30

日の丸三つ反り扇
31

三つ日の丸雁木扇
32

渡辺扇
33

六つ扇
34

違い扇
35

五本重ね扇
36

扇井桁
37

三本重ね扇
38

三本組み扇
39

八本扇車
40

丸に違い扇
41

丸に三本重ね扇
42

丸に三本組み扇
43

中輪に三本扇
44

糸輪に尻合わせ三本扇
45

中輪に七本扇の骨
46

中輪に橘違い扇
47

丸に檜扇
48

総陰丸に扇
49

五本扇の骨
50

骨扇
51

四本扇菱
52

反り扇
53

雁木反り扇
54

五つ矢扇
55

七本骨雁木扇
56

折り目雁木扇
57

扇菱
58

扇に桜
59

扇に地抜き丸に釘抜き
60

扇に地抜き釘抜き
61

日の丸雁木扇
62

入れ違い扇
63

扇片喰
64

反り亀甲に扇
65

半開き違い扇
66

浮線扇
67

扇揚羽蝶
68

変わり扇蝶
69

変わり浮線扇
70

扇胡蝶
71

扇蝶花形
72

中開き三本扇
73

五つ追い重ね末広扇
74

三つ追い雁木扇
75

五つ捻じ扇
76

高崎扇
77

重ね合わせ三つ雁木扇
78

浅野扇
79

下がり藤に五本骨扇
80

扇輪
81

二つ雁木扇の丸


TypeError: 'NoneType' object is not subscriptable
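The `TypeError` comes from cells where `item.img` is `None`, so `item.img['src']` has nothing to subscript. A sketch of an explicit check, as an alternative to wrapping the lookup in `try`/`except`, is below. The HTML string is made up to mimic the page structure described at the top of this notebook, and `extract_crests` is a hypothetical helper.

```python
from bs4 import BeautifulSoup

# Invented sample: one populated cell and one empty cell.
SAMPLE_HTML = """
<table><tr>
  <td><img src="https://kamondb.com/wp/wp-content/uploads/2019/10/1-97.gif"><br><font>房付き団扇</font></td>
  <td></td>
</tr></table>
"""


def extract_crests(html):
    """Return image link / description pairs, skipping cells without an image."""
    soup = BeautifulSoup(html, "html.parser")
    if soup.table is None:  # page without a crest table
        return []
    crests = []
    for cell in soup.table.find_all("td"):
        image = cell.find("img")
        if image is None or not image.get("src"):
            continue  # empty td, or img without a src attribute
        crests.append({
            "image_link": image["src"],
            "description": cell.get_text(strip=True),
        })
    return crests


crests = extract_crests(SAMPLE_HTML)
```

Checking for `None` up front keeps genuinely unexpected errors visible instead of silently skipping them.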