# Scraping KamonDB for Kamon crest images + descriptions

KamonDB (https://kamondb.com/) is a Japanese website (and custom Kamon designer) that hosts a large collection of uniformly sized Kamon images. The images are sorted into 7 groups on the site:
- Plant pattern (the biggest group)
- Animal crest
- Vessel crest
- Architectural pattern
- Nature pattern
- Geometric pattern
- Character pattern

As far as I can tell, there's no consistent pattern to how each category and subcategory is named, so these might need to be gathered by hand. However, each page that contains crests and their descriptions is consistently structured:
- An `<article>` tag (or, more specifically, a `<table>` tag) contains all elements with crests and descriptions
    - each crest is stored in its own `<td>` tag
    - each crest is an `<img>` plus a text label inside a `<font>` tag
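The structure described above can be illustrated without touching the site. This is a minimal sketch only: the sample markup is hypothetical (modeled on the description, not copied from KamonDB), and it uses the standard-library `html.parser` instead of BeautifulSoup so that it runs with no extra dependencies. `CrestParser` is an illustrative name.

```python
from html.parser import HTMLParser

# Hypothetical sample markup, modeled on the structure described above.
SAMPLE_HTML = """
<article>
  <table>
    <tr>
      <td><img src="https://example.com/1.gif"><br><font size="2">徳川葵</font></td>
      <td><img src="https://example.com/2.gif"><br><font size="2">剣三つ葵</font></td>
      <td></td>
    </tr>
  </table>
</article>
"""


class CrestParser(HTMLParser):
    """Collect one {'image_link', 'description'} dict per <td> cell."""

    def __init__(self):
        super().__init__()
        self.crests = []
        self._cell = None  # dict for the <td> currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._cell = {"image_link": None, "description": ""}
        elif tag == "img" and self._cell is not None:
            self._cell["image_link"] = dict(attrs).get("src")

    def handle_data(self, data):
        # Text inside a cell is the crest's label.
        if self._cell is not None:
            self._cell["description"] += data.strip()

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            # Keep only cells that actually contain an image.
            if self._cell["image_link"]:
                self.crests.append(self._cell)
            self._cell = None


parser = CrestParser()
parser.feed(SAMPLE_HTML)
crests = parser.crests
```

Cells without an `<img>` are skipped here; on the real pages such cells would presumably be layout padding, but that is an assumption.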

The plan is to scrape each subcategory link from the category directory pages. Most categories have a single directory page; plants and objects have two.
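Since the directory URLs follow a regular pattern, the list of pages to scrape could be generated rather than typed out. The following is a rough sketch, assuming the `/page/<n>/` suffix seen in this notebook's links holds for every category. `directory_pages` and `PAGE_COUNTS` are hypothetical names, and the page counts are simply those observed here and may change on the site.

```python
BASE_URL = "https://kamondb.com/category"


def directory_pages(slug, n_pages=1):
    """Return the directory URLs for one category.

    Page 1 has no /page/ suffix; later pages use /page/<n>/.
    """
    urls = [f"{BASE_URL}/{slug}/"]
    urls += [f"{BASE_URL}/{slug}/page/{n}/" for n in range(2, n_pages + 1)]
    return urls


# Page counts as observed in this notebook (an assumption, not a site guarantee).
PAGE_COUNTS = {
    "plant": 2,
    "animal": 1,
    "object": 2,
    "architecture": 1,
    "nature": 1,
    "geometry": 1,
    "letter": 1,
}

generated_links = [
    url
    for slug, n_pages in PAGE_COUNTS.items()
    for url in directory_pages(slug, n_pages)
]
```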


## Scraping all pages

In [29]:
#Import requests and beautifulsoup4
import requests
from bs4 import BeautifulSoup
import time
import pandas as pd
import re

In [None]:
#master list of links to scrape
category_links = [
    "https://kamondb.com/category/plant/", #plants 1 
    "https://kamondb.com/category/plant/page/2/", #plants 2
    "https://kamondb.com/category/animal/", #animals 
    "https://kamondb.com/category/object/", #objects 1
    "https://kamondb.com/category/object/page/2/", #objects 2
    "https://kamondb.com/category/architecture/", #architecture
    "https://kamondb.com/category/nature/", #nature
    "https://kamondb.com/category/geometry/", #geometry
    "https://kamondb.com/category/letter/", #characters
]

In [35]:


#keeping track of all subcategory hrefs and titles
all_subcategory = []


for url in category_links:

    #extract the category slug from the URL (e.g. "plant")
    category = re.search(r"/category/([^/]+)(?:/|$)", url).group(1)

    #get raw html and parse
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # stop early on HTTP errors
    soup = BeautifulSoup(response.content, 'html.parser')
    time.sleep(1)  # polite delay between requests

    #get all a tags in div with id=list
    item_list = soup.find('div', id='list').find_all('a')

    #add each link to the master list of subcategories
    for a in item_list:
        subcat_object = {
            'href': a['href'],
            'title': a['title'],
            'category': category
        }
        all_subcategory.append(subcat_object)
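
The category slug is pulled out of each directory URL with a regular expression. As a small standalone check of that pattern, here is a sketch that wraps it in a helper (`category_slug` is a hypothetical name) and returns `None` instead of raising when the URL is not a category URL.

```python
import re

# Same pattern as in the scraping loop: the path segment right after /category/.
CATEGORY_RE = re.compile(r"/category/([^/]+)(?:/|$)")


def category_slug(url):
    """Return the category slug from a directory URL, or None if absent."""
    match = CATEGORY_RE.search(url)
    return match.group(1) if match else None
```

For example, `category_slug("https://kamondb.com/category/plant/page/2/")` gives `"plant"`, while a subcategory URL such as `https://kamondb.com/plant/12/` gives `None`.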





In [36]:
all_subcategory

[{'href': 'https://kamondb.com/plant/12/',
  'title': '葵 | あおい',
  'category': 'plant'},
 {'href': 'https://kamondb.com/plant/23/',
  'title': '麻｜あさ',
  'category': 'plant'},
 {'href': 'https://kamondb.com/plant/25/',
  'title': '朝顔｜あさがお',
  'category': 'plant'},
 {'href': 'https://kamondb.com/plant/434/',
  'title': '葦｜あし',
  'category': 'plant'},
 {'href': 'https://kamondb.com/plant/27/',
  'title': '菖蒲｜あやめ',
  'category': 'plant'},
 {'href': 'https://kamondb.com/plant/432/',
  'title': '虎杖｜いたどり',
  'category': 'plant'},
 {'href': 'https://kamondb.com/plant/430/',
  'title': '銀杏｜いちょう',
  'category': 'plant'},
 {'href': 'https://kamondb.com/plant/428/',
  'title': '稲｜いね',
  'category': 'plant'},
 {'href': 'https://kamondb.com/plant/426/',
  'title': '梅｜うめ',
  'category': 'plant'},
 {'href': 'https://kamondb.com/plant/424/',
  'title': '瓜｜うり',
  'category': 'plant'},
 {'href': 'https://kamondb.com/plant/422/',
  'title': '沢瀉｜おもだか',
  'category': 'plant'},
 {'href': 'https://kamondb.com
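
The titles in this output pair a kanji name with its kana reading, separated by either an ASCII `|` (with spaces) or a full-width `｜`. If the two parts are wanted as separate columns later, something like the sketch below could split them. `split_title` is a hypothetical helper, and it assumes each title contains at most one separator.

```python
import re


def split_title(title):
    """Split 'kanji｜reading' on an ASCII or full-width vertical bar.

    Returns (name, reading); reading is None when there is no separator.
    """
    parts = re.split(r"\s*[|｜]\s*", title.strip(), maxsplit=1)
    name = parts[0]
    reading = parts[1] if len(parts) > 1 else None
    return name, reading
```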

In [40]:
#for each subcategory link, scrape data and add to a master list
# master list of all items contains:
    # link to image
    # image title
    # subcategory title
    # category title
all_kamon = []

for item in all_subcategory:
    url = item['href']
    subcategory_title = item['title']
    category_title = item['category']

    time.sleep(1)  # polite delay between requests
    try:
        #get raw html and parse
        response = requests.get(url, timeout=10)
        response.raise_for_status()
        soup = BeautifulSoup(response.content, 'html.parser')

        #find every cell of the first table on the page
        table_elements = soup.table.find_all('td')

        for kamon in table_elements:
            try:
                kamon_object = {
                    'image_link': kamon.img['src'],
                    'description': kamon.get_text().replace('\n', '').strip(),
                    'subcategory_title': subcategory_title,
                    'category_title': category_title
                }
                all_kamon.append(kamon_object)
                print(f"Added kamon: {kamon_object['description']}")
            except (TypeError, KeyError) as exc:
                #the cell has no <img>, or the <img> has no src
                print(f"Failed to process an item in the table: {exc}")
        print(f"Successfully retrieved {url}")
    except Exception as exc:
        print(f"Failed to process {url}: {exc}")



Added kamon: 徳川葵
Added kamon: 剣三つ葵
Added kamon: 立ち葵
Added kamon: 丸に立ち葵
Added kamon: 水に立ち葵
Added kamon: 剣に二つ葵
Added kamon: 二葉葵
Added kamon: 丸に左離れ立ち葵
Added kamon: 尻合わせ三つ葵
Added kamon: 左離れ立ち葵
Added kamon: 蔓三つ葵
Added kamon: 三つ割り葵
Added kamon: 二つ蔓葵の丸
Added kamon: 三つ蔓葵の丸
Added kamon: 変わり蔓三つ葵
Added kamon: 総陰丸に三つ葵
Added kamon: 中陰丸に三つ葵
Added kamon: 花付き三つ割り葵
Added kamon: 蔓葵片喰
Added kamon: 陰尻合わせ三つ葵
Added kamon: 花付き割り葵
Added kamon: 中陰尻合わせ三つ葵
Added kamon: 入れ違い割り葵
Successfully retrieved https://kamondb.com/plant/12/
Added kamon: 麻の葉
Added kamon: 丸に麻の葉
Added kamon: 陰麻の葉
Added kamon: 三つ割り麻の葉
Added kamon: 外三つ割り麻の葉
Added kamon: 麻の葉車
Added kamon: 真麻崩し
Added kamon: 陰陽麻の葉
Added kamon: 丸に麻の葉桐
Added kamon: 向こう真麻の葉
Added kamon: 細麻の葉
Added kamon: 丸に真麻の葉
Added kamon: 麻の花
Added kamon: 三つ麻の葉
Added kamon: 麻の葉桔梗
Added kamon: 糸輪に豆麻の葉
Added kamon: 雪輪に麻の葉
Successfully retrieved https://kamondb.com/plant/23/
Added kamon: 中輪に一つ朝顔
Added kamon: 五つ朝顔
Added kamon: 細輪に六つ朝顔
Added kamon: 朝顔枝丸
Successfully retrieved https://kam

In [38]:
all_kamon

[{'image_link': 'https://kamondb.com/wp/wp-content/uploads/2019/09/9-1.gif',
  'description': '\n徳川葵',
  'subcategory_title': '葵 | あおい',
  'category_title': 'plant'},
 {'image_link': 'https://kamondb.com/wp/wp-content/uploads/2019/09/2-1.gif',
  'description': '\n剣三つ葵',
  'subcategory_title': '葵 | あおい',
  'category_title': 'plant'},
 {'image_link': 'https://kamondb.com/wp/wp-content/uploads/2019/09/3.gif',
  'description': '\n立ち葵',
  'subcategory_title': '葵 | あおい',
  'category_title': 'plant'},
 {'image_link': 'https://kamondb.com/wp/wp-content/uploads/2019/09/4.gif',
  'description': '\n丸に立ち葵',
  'subcategory_title': '葵 | あおい',
  'category_title': 'plant'},
 {'image_link': 'https://kamondb.com/wp/wp-content/uploads/2019/09/5.gif',
  'description': '\n水に立ち葵',
  'subcategory_title': '葵 | あおい',
  'category_title': 'plant'},
 {'image_link': 'https://kamondb.com/wp/wp-content/uploads/2019/09/6.gif',
  'description': '\n剣に二つ葵',
  'subcategory_title': '葵 | あおい',
  'category_title': 'plant'},

In [41]:
#save kamon data to csv
import os
os.makedirs('data', exist_ok=True)  # to_csv does not create missing directories
df = pd.DataFrame(all_kamon)
df.to_csv('data/kamon_data.csv', index=False)
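
A quick sanity check on the scraped records may be worthwhile around this step. The sketch below looks for repeated image links and blank descriptions in a list shaped like `all_kamon`. `check_records` is a hypothetical helper, and the sample data is made up for illustration, not taken from the scrape.

```python
from collections import Counter


def check_records(records):
    """Return (duplicate image links, records with an empty description)."""
    link_counts = Counter(r["image_link"] for r in records)
    duplicates = sorted(link for link, n in link_counts.items() if n > 1)
    empty = [r for r in records if not r["description"].strip()]
    return duplicates, empty


# Hypothetical sample in the same shape as all_kamon.
sample = [
    {"image_link": "https://example.com/1.gif", "description": "徳川葵"},
    {"image_link": "https://example.com/1.gif", "description": "徳川葵"},
    {"image_link": "https://example.com/2.gif", "description": "  "},
]

duplicates, empty = check_records(sample)
```

On the real data, `check_records(all_kamon)` would show whether any crest was collected twice or lost its label.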

## Next Steps:


- Translate the descriptions and subcategory titles (e.g. with Google Translate)
- Download and parse the images?
    - Make dimensions equal
- Machine learning
    - Kamon descriptions: text ML? Do after class on 11/17
    - Images by radius from center: visual patterns, image clustering?