# 01. Image Crawling: Crawling Idol Photos

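You could implement the crawler yourself with selenium, searching Google for images and saving the image files, but a published library already provides the same functionality. This notebook therefore uses google_images_download to crawl Google image search results for idol singers.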

Reference: [google images download github repository](https://github.com/hardikvasa/google-images-download)

In [45]:
!pip install tqdm google_images_download
from tqdm import tqdm
from google_images_download import google_images_download



## 1. Building lists of idol member names by agency

Get the list of male or female idol member names for each agency.

In [46]:
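import pandas as pd

def get_idol_list(company, gender):
    """Return the member names in <company>.csv whose gender column equals `gender`."""
    # Column names are Korean: "성별" = gender ('남' = male, '여' = female), "멤버" = member(s).
    df = pd.read_csv("{}.csv".format(company), encoding='utf-8')
    names = df[df["성별"] == gender]["멤버"].tolist()
    # A cell may hold several comma-separated members: split, flatten, then trim whitespace.
    names = [part.strip() for cell in names for part in cell.split(',')]
    # Prints e.g. "SM 소속 남자 가수: 3명" ("SM male singers: 3").
    print("{} 소속 {}자 가수: {}명".format(company.upper(), gender, len(names)))
    return names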

In [49]:
sm = pd.read_csv("sm.csv", encoding='utf-8')
sm


| Unnamed: 0 | 멤버 | 성별 |
|---|---|---|
| 0 | BTS V | 남 |
| 1 | BTS Jeongguk | 남 |
| 2 | Bigbang TOP | 남 |

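In the table above each 멤버 cell holds one name, but `get_idol_list` also handles cells that list several members separated by commas. The sketch below reproduces that split, flatten, and strip logic with only the standard library. The CSV text and every group and member name in it are made up for illustration, and `names_by_gender` is a hypothetical helper, not part of the notebook.

```python
import csv
import io

# Hypothetical CSV text in the same layout as sm.csv (columns: 멤버, 성별).
SAMPLE_CSV = '''멤버,성별
"GroupA Minho, GroupA Jisoo ",남
GroupB Yuna,여
" GroupC Taeil",남
'''

def names_by_gender(csv_text, gender):
    """Return individual member names whose 성별 (gender) column equals `gender`."""
    rows = csv.DictReader(io.StringIO(csv_text))
    cells = [row["멤버"] for row in rows if row["성별"] == gender]
    # Split each cell on commas, flatten the pieces, and trim stray whitespace.
    return [part.strip() for cell in cells for part in cell.split(",")]

male_names = names_by_gender(SAMPLE_CSV, "남")
print(male_names)
```

Without the final `strip()`, names such as `" GroupA Jisoo "` would keep their surrounding spaces and end up in the search keyword.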

### 1.1 SM

In [50]:
sm_m_names = get_idol_list('sm', '남')

SM 소속 남자 가수: 3명


In [None]:
sm_w_names = get_idol_list('sm', '여')

### 1.2 JYP

In [None]:
jyp = pd.read_csv("jyp.csv", encoding='euc_kr')
jyp

In [None]:
jyp_m_names = get_idol_list('jyp', '남')

In [None]:
jyp_w_names = get_idol_list('jyp', '여')

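`get_idol_list` opens every file as UTF-8, while `jyp.csv` is read with `encoding='euc_kr'` earlier in this section, so the helper may raise a `UnicodeDecodeError` if that file really is EUC-KR. One way to cope, sketched below, is to detect the encoding first by trying candidates in order. `detect_encoding` and its candidate list are hypothetical additions; the returned name could then be passed on as the `encoding` argument of `pd.read_csv`.

```python
import os
import tempfile

def detect_encoding(path, candidates=("utf-8", "euc_kr")):
    """Return the first candidate encoding that decodes the whole file without error."""
    with open(path, "rb") as f:
        raw = f.read()
    for encoding in candidates:
        try:
            raw.decode(encoding)
            return encoding
        except UnicodeDecodeError:
            continue
    raise ValueError("none of {} can decode {}".format(candidates, path))

# Demonstration on two temporary files holding the same made-up header row.
text = "이름,성별\n"
detected = {}
with tempfile.TemporaryDirectory() as tmp:
    for encoding in ("utf-8", "euc_kr"):
        path = os.path.join(tmp, "sample_{}.csv".format(encoding))
        with open(path, "wb") as f:
            f.write(text.encode(encoding))
        detected[encoding] = detect_encoding(path)

print(detected)
```

UTF-8 is tried first because its decoder is strict: Korean text saved as EUC-KR is very unlikely to also be valid UTF-8, so a successful UTF-8 decode is a reasonably strong signal.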
### 1.3 YG

In [None]:
yg = pd.read_csv("yg.csv", encoding='euc_kr')
yg

In [None]:
yg_m_names = get_idol_list('yg', '남')

In [None]:
yg_w_names = get_idol_list('yg', '여')

## 2. Crawling idol member photos by agency

Download up to 100 Google image search results per member.

In [53]:
#-*- coding:utf-8 -*-
#!export PYTHONIOENCODING=UTF-8

#!pip install unidecode

def transliterate(string):
    """Transliterate a string into its closest ASCII representation.
    Ex: 1. àé => ae,
        2. สวัสดีครับ => swasdiikhrab.
    :param string: string
    :return: closest string.
    """
    from unidecode import unidecode

    if not isinstance(string, bytes):
        string = u''.join(string)

    return unidecode(string)

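def get_images(names):
    """Download up to 100 Google image search results for each name."""
    for name in tqdm(names):
        response = google_images_download.googleimagesdownload()
        # Search with the ASCII transliteration of the name.
        real_name = transliterate(name)
        arguments = {"keywords": real_name, "limit": 100, "print_urls": False}
        paths = response.download(arguments)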

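`get_images` passes a small dictionary of options to `download`. If you want each agency's images kept apart, one option is the library's `output_directory` argument, which its README describes as the main download folder. The sketch below only builds the dictionaries and does not call the library; `build_arguments` and the `downloads/sm` path are hypothetical names chosen for illustration, so check the argument names against your installed version before relying on them.

```python
def build_arguments(keyword, limit=100, output_directory="downloads"):
    """Build an options dictionary like the one get_images passes to download()."""
    if not 1 <= limit <= 100:
        # Assumption: larger limits need extra setup (a browser driver), so reject them here.
        raise ValueError("limit must be between 1 and 100")
    return {
        "keywords": keyword,
        "limit": limit,
        "print_urls": False,
        # Assumed option name, taken from the library README.
        "output_directory": output_directory,
    }

args = build_arguments("BTS V", output_directory="downloads/sm")
print(args)
```

Keeping the dictionary construction in one place also makes it easy to change the limit or the folder for every agency at once.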
### 2.1 Crawling photos of SM idols

In [54]:
get_images(sm_m_names)


  0%|          | 0/3 [00:00<?, ?it/s]


Item no.: 1 --> Item name = BTS V
Evaluating...
Starting Download...
Completed Image ====> 1.MV5BMjUyZGRlZGUtMjU1OC00NjJjLTk0MWQtN2RjZTYzYzgxZjVhXkEyXkFqcGdeQXVyMzM4MjM0Nzg@._V1_.jpg
Completed Image ====> 2.190417_V_at_the_Map_of_the_Soul_Persona_press_conference.jpg
Completed Image ====> 3.maxresdefault.jpg
Invalid or missing image format. Skipping...
Completed Image ====> 4.71IvZBWxh-L._SY606_.jpg
Completed Image ====> 5.70613ea3a3e8d5a71241540fb9ba10ea.jpg
Invalid or missing image format. Skipping...
Invalid or missing image format. Skipping...
Completed Image ====> 6.bts-v.jpg
Completed Image ====> 7.13443470_f520.jpg
Invalid or missing image format. Skipping...
Invalid or missing image format. Skipping...
Completed Image ====> 8.1567511126-edhvuf9uuaajhc3.jpg
Completed Image ====> 9.people-are-defending-v-of-bts-after-the-idol-pretended-to-take-videos-of-his-fans-at-the-airport-photo-by-kpop-news-youtube.jpg
Completed Image ====> 10.bts-v-pants-cover.jpg
Invalid or missing image 


 33%|███▎      | 1/3 [00:51<01:42, 51.22s/it]

Completed Image ====> 87.k-pop-boy-band-bts-member-v-whose-real-name-is-kim-tae-hyung-plays-hong-han-sung-in-kbs-2tvs-hwarang-the-poet-warrior-youth.jpg


Unfortunately all 100 could not be downloaded because some images were not downloadable. 87 is all we got for this search filter!

Errors: 13


Item no.: 1 --> Item name = BTS Jeongguk
Evaluating...
Starting Download...
Invalid or missing image format. Skipping...
Completed Image ====> 1.1b4dd56b15ed17864e769e354a3d887a.jpg
Completed Image ====> 2.e2c84cbec827a2d6cf05bec4f91b1cbe.jpg
Completed Image ====> 3.2ybp8yy.jpg
Completed Image ====> 4.bts-jungkook---jeon-jeongguk--1537799622-view-0.png
Completed Image ====> 5.bbf2b0ab78123ecb8d84eecf8c27f17b.jpg
Completed Image ====> 6.BTS%20Jungkook%20The%20Fact%20Music%20Awards.jpg
Completed Image ====> 7.37ae8004d737015fb05a8bd7a35300d9.jpg
Completed Image ====> 8.large.png
Completed Image ====> 9.Black-Jungkook-640x430.jpg
Invalid or missing image format. Skipping...
Completed Image ====>


 67%|██████▋   | 2/3 [01:57<00:55, 55.63s/it]

Completed Image ====> 87.BTS%20Jungkook%20The%20Fact%20Music%20Awards.jpg


Unfortunately all 100 could not be downloaded because some images were not downloadable. 87 is all we got for this search filter!

Errors: 13


Item no.: 1 --> Item name = Bigbang TOP
Evaluating...
Starting Download...
Completed Image ====> 1.t.o.p-of-bigbang-sept-2016-billboard-1548.jpg
Completed Image ====> 2.20190707-big-bang.jpg
Completed Image ====> 3.rs_600x600-190707200610-e-asia-big-bang-top-thumbnail-GettyImages-610388818.jpg
Completed Image ====> 4.soompi.com_-2.jpg
Completed Image ====> 5.T.O.P_-_0.TO.10_in_Seoul_-_2.jpg
Completed Image ====> 6.rs_600x600-190707200610-e-asia-big-bang-top-thumbnail-GettyImages-610388818.jpg
Completed Image ====> 7.2ee1fe9e3a7d96eefed620f67f2f945b.png
Completed Image ====> 8.img_2991.jpg
Completed Image ====> 9.BIGBANG-T.O.P-2.jpg
Completed Image ====> 10.top-cover.png
Completed Image ====> 11.Screen%2BShot%2B2019-07-05%2Bat%2B6.11.31%2BPM.png
Completed Image ====> 12.


100%|██████████| 3/3 [03:19<00:00, 66.35s/it]

Completed Image ====> 94.TOP-top-from-big-bang-32702291-500-733.jpg


Unfortunately all 100 could not be downloaded because some images were not downloadable. 94 is all we got for this search filter!

Errors: 6



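The logs above report 87, 87, and 94 saved images against a limit of 100. To confirm what actually landed on disk, you can count the files in each keyword folder. The sketch assumes the library's default layout of `downloads/<keyword>/`, which may differ if `output_directory` or `image_directory` was set. `count_images` is a hypothetical helper, and the demonstration uses a temporary directory with empty placeholder files rather than real downloads.

```python
import tempfile
from pathlib import Path

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".gif", ".bmp", ".webp"}

def count_images(root):
    """Map each subfolder of `root` to the number of image files directly inside it."""
    counts = {}
    for folder in sorted(Path(root).iterdir()):
        if folder.is_dir():
            counts[folder.name] = sum(
                1 for f in folder.iterdir()
                if f.is_file() and f.suffix.lower() in IMAGE_SUFFIXES
            )
    return counts

# Demonstration with placeholder files; in the notebook you would call count_images("downloads").
with tempfile.TemporaryDirectory() as tmp:
    for keyword, n in [("BTS V", 3), ("Bigbang TOP", 2)]:
        folder = Path(tmp) / keyword
        folder.mkdir()
        for i in range(n):
            (folder / "{}.sample.jpg".format(i + 1)).touch()
        (folder / "notes.txt").touch()  # non-image file, should be ignored
    counts = count_images(tmp)

print(counts)
```

Names whose folders hold noticeably fewer images than the rest are natural candidates for the supplementary crawl in section 2.4.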



In [None]:
get_images(sm_w_names)

### 2.2 Crawling photos of JYP idols

In [None]:
get_images(jyp_m_names)

In [None]:
get_images(jyp_w_names)

### 2.3 Crawling photos of YG idols

In [None]:
get_images(yg_m_names)

In [None]:
get_images(yg_w_names)

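### 2.4 Supplementary crawling

When a stage name is a common noun, is shared with another person, or otherwise returns poor search results, replace it with a form that searches well and run a supplementary crawl.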

In [None]:
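# Korean search keywords (stage names, some prefixed with the group name) that need a supplementary crawl.
ls = ['비', 'ses 바다', '태양', '탑', '마크',
      '위너 이승훈', '아이콘 정찬우',
      '박봄', '로제', '리사', '씨엘',
      '세븐', '승리', '지드래곤', '지디',
      '루나', 'ses 슈', '미쓰에이 페이']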

In [None]:
for name in tqdm(ls):
    response = google_images_download.googleimagesdownload()
    # Append "얼굴" ("face") to each keyword to bias the results toward face photos.
    arguments = {"keywords": "{} 얼굴".format(name), "limit": 100, "print_urls": False}
    paths = response.download(arguments)