# Scraping KamonDB for Kamon crest images + descriptions

KamonDB (https://kamondb.com/) is a Japanese website (and custom Kamon designer) that hosts a large collection of uniformly sized Kamon images. The images are sorted into 7 groups on the site:
- Plant pattern (the biggest group)
- Animal crest
- Vessel crest
- Architectural pattern
- Nature pattern (`/category/nature/`)
- Geometric pattern
- Character pattern

As far as I can tell, there's no consistent naming scheme for the categories and subcategories, so these might need to be gathered by hand. However, each page that contains crests and their descriptions is consistently structured:
 - an `<article>` tag (or, more specifically, a `<table>` tag) contains all elements with crests and descriptions
    - all crests are stored in `<td>` tags
    - each crest is an `<img>` plus a text string inside a `<font>` tag
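To make that structure concrete, here is a minimal sketch that walks a hand-written snippet laid out the same way. The markup, file names, and captions are placeholders I made up, not real KamonDB content, and the class name `CrestCellParser` is hypothetical. It uses the standard library's `html.parser` as a stand-in for BeautifulSoup so it runs offline.

```python
from html.parser import HTMLParser

# Made-up markup that imitates the layout described above; the file names
# and captions are placeholders, not real KamonDB content.
SAMPLE_HTML = """
<article>
  <table>
    <tr>
      <td><img src="crest_a.jpg"><font>Crest A</font></td>
      <td><img src="crest_b.jpg"><font>Crest B</font></td>
    </tr>
  </table>
</article>
"""


class CrestCellParser(HTMLParser):
    """Collect one {'image_link', 'description'} dict per <td> cell."""

    def __init__(self):
        super().__init__()
        self.crests = []
        self._cell = None  # dict for the <td> currently being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._cell = {"image_link": None, "description": ""}
        elif tag == "img" and self._cell is not None:
            self._cell["image_link"] = dict(attrs).get("src")

    def handle_data(self, data):
        # Text inside the cell (here, the <font> caption) becomes the description
        if self._cell is not None:
            self._cell["description"] += data.strip()

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            # Keep only cells that actually contained an image
            if self._cell["image_link"]:
                self.crests.append(self._cell)
            self._cell = None


parser = CrestCellParser()
parser.feed(SAMPLE_HTML)
crests = parser.crests
print(crests)
```

Each `<td>` yields one record with the image source and its caption, which is the same shape of record the scraping loop later in this notebook builds.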

Scrape each subpage link from the category directories. Most directories have just 1 page; some have 2.
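Rather than typing every directory URL by hand, the list can be generated from the category slug and its page count. This is only a sketch: the slugs and page counts are the ones that appear in the `category_links` list in this notebook, and the names `PAGES_PER_CATEGORY` and `directory_urls` are hypothetical.

```python
BASE_URL = "https://kamondb.com/category"

# Page counts as observed when this notebook was written (assumption: they
# may change if the site adds more subcategories).
PAGES_PER_CATEGORY = {
    "plant": 2,
    "animal": 1,
    "object": 2,
    "architecture": 1,
    "nature": 1,
    "geometry": 1,
    "letter": 1,
}


def directory_urls(slug, n_pages):
    """Return the directory URLs for one category: page 1 has no /page/ suffix."""
    urls = [f"{BASE_URL}/{slug}/"]
    urls += [f"{BASE_URL}/{slug}/page/{page}/" for page in range(2, n_pages + 1)]
    return urls


generated_links = [
    url
    for slug, n_pages in PAGES_PER_CATEGORY.items()
    for url in directory_urls(slug, n_pages)
]
print(generated_links)
```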


### Scraping all pages

In [1]:
#Import requests, beautifulsoup4, and the other libraries used below
import requests
from bs4 import BeautifulSoup
import time
import pandas as pd
import re

In [None]:
#master list of links to scrape
category_links = [
    "https://kamondb.com/category/plant/", #plants 1 
    "https://kamondb.com/category/plant/page/2/", #plants 2
    "https://kamondb.com/category/animal/", #animals 
    "https://kamondb.com/category/object/", #objects 1
    "https://kamondb.com/category/object/page/2/", #objects 2
    "https://kamondb.com/category/architecture/", #architecture
    "https://kamondb.com/category/nature/", #nature
    "https://kamondb.com/category/geometry/", #geometry
    "https://kamondb.com/category/letter/", #characters
]

In [None]:


#keeping track of all subcategory hrefs and titles
all_subcategory = []


for url in category_links:

    #get raw html and parse
    raw_html = requests.get(url).content
    soup = BeautifulSoup(raw_html, 'html.parser')
    time.sleep(1)  # polite delay between requests

    #extract the category slug (e.g. "plant") from the URL, once per page
    category = re.search(r"/category/([^/]+)(?:/|$)", url).group(1)

    #get all a tags in div with id=list
    item_list = soup.find('div', id='list').find_all('a')

    #add tags to a master list of hrefs
    for a in item_list:
        subcat_object = {
            'href': a['href'],
            'title': a['title'],
            'category': category
        }
        all_subcategory.append(subcat_object)





In [None]:
all_subcategory
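The `category` field comes from a regular expression applied to the directory URL. A small sketch of how that pattern behaves on its own, wrapped in a hypothetical helper `category_from_url` that returns `None` instead of raising when the URL has no `/category/` segment:

```python
import re

# Same pattern as in the scraping loop: capture the path segment after /category/
CATEGORY_PATTERN = re.compile(r"/category/([^/]+)(?:/|$)")


def category_from_url(url):
    """Return the category slug from a directory URL, or None if absent."""
    match = CATEGORY_PATTERN.search(url)
    return match.group(1) if match else None


print(category_from_url("https://kamondb.com/category/plant/page/2/"))
print(category_from_url("https://kamondb.com/category/letter"))
print(category_from_url("https://kamondb.com/"))
```

The paginated URL and the first-page URL give the same slug, so both pages of a category end up under one `category` value.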

In [None]:
#for each subcategory link, scrape data and add to a master list
# master list of all items contains:
    # link to image
    # image title
    # subcategory title
    # category title
all_kamon = []

for item in all_subcategory:
    url = item['href']
    subcategory_title = item['title']
    category_title = item['category']

    time.sleep(1)  # polite delay between requests
    try:
        #get raw html and parse
        raw_html = requests.get(url).content
        soup = BeautifulSoup(raw_html, 'html.parser')

        #find every cell of the first table on the page
        table_elements = soup.table.find_all('td')

        for kamon in table_elements:
            try:
                kamon_object = {
                    'image_link': kamon.img['src'],
                    'description': kamon.get_text().replace('\n', '').strip(),
                    'subcategory_title': subcategory_title,
                    'category_title': category_title
                }
                all_kamon.append(kamon_object)
                print(f"Added kamon: {kamon_object['description']}")
            except Exception:
                #typically a cell without an <img> (kamon.img is None)
                print("Failed to process an item in the table.")
        print(f"Successfully retrieved {url}")
    except Exception as e:
        #network error, or a page without a <table>
        print(f"Failed to process {url}: {e}")



In [None]:
all_kamon

In [None]:
#save kamon data to csv
df = pd.DataFrame(all_kamon)
df.to_csv('data/kamon_data.csv', index=False)

## Next Steps:

- Translate descriptions and subcategory titles (Google Translate)
- Download and parse images?
    - make dimensions equal
- Machine Learning
    - Kamon descriptions? Text ML? Do after class on 11/17
    - Images by radius from center: visual patterns, image clustering?

In [3]:
#download images to local folder

#download images from link in kamon_engdesc_only.csv
    #naming scheme:
    #use category_title_(first word in subcategory_title)_(index).jpg
kamon_df = pd.read_csv('data/kamon_engdesc_only.csv')
kamon_df.head()

#save images to images_kamon


Unnamed: 0,image_link,description,subcategory_title,category_title,English_Description,English_Subcategory_Title,English_Subcategory
0,https://kamondb.com/wp/wp-content/uploads/2019...,徳川葵,葵 | あおい,plant,Aoi Tokugawa,Aoi | Aoi,Aoi | Aoi
1,https://kamondb.com/wp/wp-content/uploads/2019...,剣三つ葵,葵 | あおい,plant,sword three Aoi,Aoi | Aoi,Aoi | Aoi
2,https://kamondb.com/wp/wp-content/uploads/2019...,立ち葵,葵 | あおい,plant,standing hollyhock,Aoi | Aoi,Aoi | Aoi
3,https://kamondb.com/wp/wp-content/uploads/2019...,丸に立ち葵,葵 | あおい,plant,Aoi standing in a circle,Aoi | Aoi,Aoi | Aoi
4,https://kamondb.com/wp/wp-content/uploads/2019...,水に立ち葵,葵 | あおい,plant,Aoi standing in the water,Aoi | Aoi,Aoi | Aoi


In [4]:
#drop English_Subcategory_Title column (it duplicates English_Subcategory)
kamon_df = kamon_df.drop(columns=['English_Subcategory_Title'])

In [5]:
#change column names to lowercase
kamon_df.columns = [col.lower() for col in kamon_df.columns]
kamon_df.head()

Unnamed: 0,image_link,description,subcategory_title,category_title,english_description,english_subcategory
0,https://kamondb.com/wp/wp-content/uploads/2019...,徳川葵,葵 | あおい,plant,Aoi Tokugawa,Aoi | Aoi
1,https://kamondb.com/wp/wp-content/uploads/2019...,剣三つ葵,葵 | あおい,plant,sword three Aoi,Aoi | Aoi
2,https://kamondb.com/wp/wp-content/uploads/2019...,立ち葵,葵 | あおい,plant,standing hollyhock,Aoi | Aoi
3,https://kamondb.com/wp/wp-content/uploads/2019...,丸に立ち葵,葵 | あおい,plant,Aoi standing in a circle,Aoi | Aoi
4,https://kamondb.com/wp/wp-content/uploads/2019...,水に立ち葵,葵 | あおい,plant,Aoi standing in the water,Aoi | Aoi


There are a lot of stray characters I don't want in the filenames, so I clean them up here. Annoyingly, both the ASCII vertical bar `|` and the Japanese full-width bar `｜` are present, so both have to be handled: each is replaced with an underscore, as are slashes and spaces. Commas and quotation marks are removed outright.
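The same rules can be tried on single strings before touching the dataframe. This is a sketch that mirrors the pandas steps with plain `re`; the helper name `clean_subcategory` is hypothetical, and apart from `"Aoi | Aoi"` the inputs are made up for illustration.

```python
import re


def clean_subcategory(name):
    """Apply the clean-up rules described above to one string."""
    name = re.sub(r"\s*\|\s*", "_", name)   # ASCII vertical bar -> underscore
    name = re.sub(r"\s*｜\s*", "_", name)    # full-width bar (U+FF5C) -> underscore
    name = re.sub(r"[\s/]+", "_", name)     # spaces and slashes -> underscore
    name = name.replace(",", "").replace('"', "")  # drop commas and quotes
    name = name.lower()
    # collapse a value that is one name repeated, e.g. aoi_aoi -> aoi
    return re.sub(r"\b(\w+)(_\1)+\b", r"\1", name)


examples = ["Aoi | Aoi", "Arrow｜Ya", 'a/b, "c"']
cleaned = [clean_subcategory(s) for s in examples]
print(cleaned)
```

Note that other punctuation, such as colons, is not covered by these rules and passes through unchanged.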

In [8]:
#for english_subcategory, | replaced with underscores
kamon_df['english_subcategory'] = kamon_df['english_subcategory'].str.replace(r'\s*\|\s*', '_', regex=True)
#also, replace the japanese full-width vertical bar (U+FF5C) with underscores
kamon_df['english_subcategory'] = kamon_df['english_subcategory'].str.replace(r'\s*｜\s*', '_', regex=True)

#turn all / or spaces into underscores
kamon_df['english_subcategory'] = kamon_df['english_subcategory'].str.replace(r'[\s/]+', '_', regex=True)
#remove all commas from english subcategory
kamon_df['english_subcategory'] = kamon_df['english_subcategory'].str.replace(',', '', regex=False)
#remove all quotation marks from english subcategory
kamon_df['english_subcategory'] = kamon_df['english_subcategory'].str.replace('"', '', regex=False)

#make everything lowercase
kamon_df['english_subcategory'] = kamon_df['english_subcategory'].str.lower()
#collapse values that are one name repeated (aoi_aoi -> aoi); underscores count as \w,
#so a repeat inside a longer name (gourd_gourd_hyo_...) is left as-is
kamon_df['english_subcategory'] = kamon_df['english_subcategory'].apply(
    lambda x: re.sub(r'\b(\w+)(_\1)+\b', r'\1', x)
)

kamon_df.head()


Unnamed: 0,image_link,description,subcategory_title,category_title,english_description,english_subcategory
0,https://kamondb.com/wp/wp-content/uploads/2019...,徳川葵,葵 | あおい,plant,Aoi Tokugawa,aoi
1,https://kamondb.com/wp/wp-content/uploads/2019...,剣三つ葵,葵 | あおい,plant,sword three Aoi,aoi
2,https://kamondb.com/wp/wp-content/uploads/2019...,立ち葵,葵 | あおい,plant,standing hollyhock,aoi
3,https://kamondb.com/wp/wp-content/uploads/2019...,丸に立ち葵,葵 | あおい,plant,Aoi standing in a circle,aoi
4,https://kamondb.com/wp/wp-content/uploads/2019...,水に立ち葵,葵 | あおい,plant,Aoi standing in the water,aoi


In [None]:
kamon_df.iloc[1908]
kamon_df.iloc[3568]
kamon_df.iloc[3565]

image_link             https://kamondb.com/wp/wp-content/uploads/2019...
description                                                        一つ瓢の丸
subcategory_title                                     瓢・瓢箪｜ひょう・ひょうたん・ひさご
category_title                                                     plant
english_description                                     One gourd circle
english_subcategory                         gourd_gourd_hyo_gourd_hisago
Name: 1908, dtype: object
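The `english_subcategory` shown here still starts with `gourd_gourd`, even though the clean-up step tries to remove repetitions. The reason is that `\w` also matches underscores, so there is no word boundary between `gourd_gourd` and `_hyo` for the final `\b` to land on. The sketch below reproduces that and shows one possible alternative; `drop_adjacent_repeats` is a hypothetical helper, not something this notebook uses.

```python
import re
from itertools import groupby

value = "gourd_gourd_hyo_gourd_hisago"

# The notebook's pattern: the trailing \b cannot match between "d" and "_",
# because both are word characters, so nothing is replaced.
unchanged = re.sub(r"\b(\w+)(_\1)+\b", r"\1", value)


def drop_adjacent_repeats(name):
    """Hypothetical alternative: split on '_' and keep one part per adjacent run."""
    return "_".join(part for part, _ in groupby(name.split("_")))


collapsed = drop_adjacent_repeats(value)
print(unchanged)   # gourd_gourd_hyo_gourd_hisago
print(collapsed)   # gourd_hyo_gourd_hisago
```

Switching to the alternative would change the generated filenames, so it only makes sense before the images are downloaded.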

In [12]:
#test to make sure splitting is happening correctly
kamon_df.iloc[372]["english_subcategory"]
kamon_df.iloc[3568]["english_subcategory"]

'arrow_ya'

In [13]:
#download every image and record the filename it was saved under
kamon_df['image_filename'] = ''

# Group by subcategory and process each group separately
for subcategory, group in kamon_df.groupby('english_subcategory'):
    for idx, row in group.iterrows():
        image_url = row['image_link']
        category_title = row['category_title']
        # spaces were already replaced with underscores, so split()[0] is the whole cleaned subcategory
        subcategory_first_word = row['english_subcategory'].split()[0]
        # idx % len(group) is unique within a group as long as its rows are contiguous in the default index
        image_name = f"{category_title}_{subcategory_first_word}_{idx % len(group)}.jpg"

        try:
            response = requests.get(image_url)
            response.raise_for_status()  # treat HTTP errors (e.g. 404) as failed downloads
            with open(f'Data/images_kamon/{image_name}', 'wb') as handler:
                handler.write(response.content)
            print(f"Downloaded {image_name}")
            kamon_df.at[idx, 'image_filename'] = image_name
        except Exception as e:
            print(f"Failed to download image from {image_url}: {e}")

Downloaded object_ace_masakari_4.jpg
Downloaded object_ace_masakari_5.jpg
Downloaded object_ace_masakari_6.jpg
Downloaded object_ace_masakari_7.jpg
Downloaded object_ace_masakari_8.jpg
Downloaded object_ace_masakari_0.jpg
Downloaded object_ace_masakari_1.jpg
Downloaded object_ace_masakari_2.jpg
Downloaded object_ace_masakari_3.jpg
Downloaded object_anchor_7.jpg
Downloaded object_anchor_8.jpg
Downloaded object_anchor_9.jpg
Downloaded object_anchor_10.jpg
Downloaded object_anchor_0.jpg
Downloaded object_anchor_1.jpg
Downloaded object_anchor_2.jpg
Downloaded object_anchor_3.jpg
Downloaded object_anchor_4.jpg
Downloaded object_anchor_5.jpg
Downloaded object_anchor_6.jpg
Downloaded animal_antlers_deer_antlers_tsuno_shika_kazuno_18.jpg
Downloaded animal_antlers_deer_antlers_tsuno_shika_kazuno_0.jpg
Downloaded animal_antlers_deer_antlers_tsuno_shika_kazuno_1.jpg
Downloaded animal_antlers_deer_antlers_tsuno_shika_kazuno_2.jpg
Downloaded animal_antlers_deer_antlers_tsuno_shika_kazuno_3.jpg
Down
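The numbers in these filenames come from `idx % len(group)`, which is why each group starts partway through (4, 5, ... then 0, 1, ...). That stays unique only while a group's rows sit next to each other in the dataframe: two rows at positions 0 and 2 in a group of length 2 would both get index 0. A running counter per group avoids that assumption. This is a sketch with made-up rows and a hypothetical helper `numbered_filenames`; in pandas, `groupby(...).cumcount()` gives the same per-group numbering.

```python
from collections import defaultdict


def numbered_filenames(rows):
    """Return one filename per (category, subcategory) row, numbering rows
    0, 1, 2, ... within each pair in order of appearance."""
    next_index = defaultdict(int)
    names = []
    for category, subcategory in rows:
        key = (category, subcategory)
        names.append(f"{category}_{subcategory}_{next_index[key]}.jpg")
        next_index[key] += 1
    return names


# Made-up rows; the two groups are deliberately interleaved
rows = [
    ("object", "anchor"),
    ("plant", "aoi"),
    ("object", "anchor"),
    ("plant", "aoi"),
    ("plant", "aoi"),
]
names = numbered_filenames(rows)
print(names)
```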

In [None]:
#check which images failed to download - see if image_filename is empty
failed_downloads = kamon_df[kamon_df['image_filename'] == '']
failed_downloads
#empty, so all worked

Unnamed: 0,image_link,description,subcategory_title,category_title,english_description,english_subcategory,image_filename


In [None]:
#check filenames in images_kamon folder against kamon_df image_filename column
import os
image_files = set(os.listdir('Data/images_kamon'))
kamon_column = set(kamon_df['image_filename'])
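#filenames recorded in the dataframe but absent from the folder
missing_in_folder = kamon_column - image_files
missing_in_folder
#files in the folder that no dataframe row refers to (only this last value is displayed)
missing_in_dataframe = image_files - kamon_column
missing_in_dataframe
#all ok?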


{'.DS_Store'}
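The only stray entry is macOS's hidden `.DS_Store` file. A sketch of the same check that skips hidden files and returns both differences at once, so neither result is lost; the helper name `compare_folder_to_names` and the filenames in the demo are made up, and the demo runs in a temporary folder rather than `Data/images_kamon`.

```python
import os
import tempfile


def compare_folder_to_names(folder, expected_names):
    """Return (missing_in_folder, missing_in_dataframe), ignoring dotfiles."""
    on_disk = {name for name in os.listdir(folder) if not name.startswith(".")}
    expected = set(expected_names)
    return expected - on_disk, on_disk - expected


# Self-contained demo in a temporary folder with made-up filenames
with tempfile.TemporaryDirectory() as folder:
    for name in ["plant_aoi_0.jpg", "plant_aoi_1.jpg", ".DS_Store"]:
        open(os.path.join(folder, name), "wb").close()
    missing_in_folder, missing_in_dataframe = compare_folder_to_names(
        folder, ["plant_aoi_0.jpg", "plant_aoi_1.jpg", "plant_aoi_2.jpg"]
    )

print(missing_in_folder)
print(missing_in_dataframe)
```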

I'm not fully convinced, but this should be all the images:
5359 total images, 5359 total rows.

In [17]:
#export dataframe to new csv
kamon_df.to_csv('data/kamon.csv', index=False)

In [2]:
#import the new kamon.csv and display head
kamon_df = pd.read_csv('data/kamon.csv')
kamon_df.head()

Unnamed: 0,image_link,description,subcategory_title,category_title,english_description,english_subcategory,image_filename
0,https://kamondb.com/wp/wp-content/uploads/2019...,徳川葵,葵 | あおい,plant,Aoi Tokugawa,aoi,plant_aoi_0.jpg
1,https://kamondb.com/wp/wp-content/uploads/2019...,剣三つ葵,葵 | あおい,plant,sword three Aoi,aoi,plant_aoi_1.jpg
2,https://kamondb.com/wp/wp-content/uploads/2019...,立ち葵,葵 | あおい,plant,standing hollyhock,aoi,plant_aoi_2.jpg
3,https://kamondb.com/wp/wp-content/uploads/2019...,丸に立ち葵,葵 | あおい,plant,Aoi standing in a circle,aoi,plant_aoi_3.jpg
4,https://kamondb.com/wp/wp-content/uploads/2019...,水に立ち葵,葵 | あおい,plant,Aoi standing in the water,aoi,plant_aoi_4.jpg


In [7]:
#get a random sample of roughly 10% of image_filenames (540 of 5359 rows)
sample_df = kamon_df.sample(n=540, random_state=42)
sample_df["image_filename"]

2833              object_fan_45.jpg
3967    nature_wave_wami_nami_7.jpg
401     plant_sawata_omodaka_73.jpg
2638      animal_crane_tsuru_38.jpg
2005             plant_peony_10.jpg
                   ...             
3125         object_sword_ken_5.jpg
1580               plant_ivy_92.jpg
1666         plant_pear_none_10.jpg
1730          plant_hanakaku_28.jpg
3407            object_ladder_3.jpg
Name: image_filename, Length: 540, dtype: object

In [8]:
#copy random sample to random_subset folder
import os
import shutil
for filename in sample_df["image_filename"]:
    src_path = os.path.join('Data/images_kamon', filename)
    dest_path = os.path.join('Data/random_subset', filename)
    try:
        #copy the file contents byte-for-byte
        shutil.copyfile(src_path, dest_path)
        print(f"Copied {filename} to random_subset.")
    except FileNotFoundError:
        print(f"File {filename} not found in images_kamon.")

Copied object_fan_45.jpg to random_subset.
Copied nature_wave_wami_nami_7.jpg to random_subset.
Copied plant_sawata_omodaka_73.jpg to random_subset.
Copied animal_crane_tsuru_38.jpg to random_subset.
Copied plant_peony_10.jpg to random_subset.
Copied plant_hemp_asa_16.jpg to random_subset.
Copied plant_melon_uri_94.jpg to random_subset.
Copied geometry_meyui_61.jpg to random_subset.
Copied plant_chouji_32.jpg to random_subset.
Copied architecture_izutsu_46.jpg to random_subset.
Copied geometry_corner_write_62.jpg to random_subset.
Copied plant_katabami_vinegar_grass_katabami_51.jpg to random_subset.
Copied object_car_5.jpg to random_subset.
Copied object_mari_scissors_2.jpg to random_subset.
Copied geometry_ring:_other_wa_50.jpg to random_subset.
Copied plant_chrysanthemum_hear_2.jpg to random_subset.
Copied geometry_ring:_middle_ring_nakawa_15.jpg to random_subset.
Copied plant_kikyo_30.jpg to random_subset.
Copied animal_crane_tsuru_44.jpg to random_subset.
Copied object_apricot_leav