## 1) Installing the libraries and what they do

- beautifulsoup4 : BeautifulSoup is a library for parsing HTML and XML files.
- requests : requests is a library that makes sending HTTP requests simple. It is used to send GET, POST, and other requests to a web server and fetch data such as HTML, images, or JSON.
- lxml : lxml is a fast, powerful library for parsing XML and HTML. It is often used as one of BeautifulSoup's parsers.

In [None]:
!pip -q install requests beautifulsoup4 lxml

**Note**: With the `-q` (quiet) option, `pip install` suppresses most messages unless an error or warning occurs. Once the installation finishes, it prints either nothing or only minimal information.

## 2) Imports

In [None]:

import requests
from bs4 import BeautifulSoup
import lxml
import pprint as pp


In [None]:
# 3) Target URL
URL = "https://anilife.app/content/6750?tab=info"

headers = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) StepByStepCrawler/0.1",
    "Accept-Language": "ko-KR,ko;q=0.9,en-US;q=0.8"
}


resp = requests.get(URL, headers=headers, timeout=20)
print("Status code:", resp.status_code)   # 200 means success
print("Response headers:")
pp.pprint(dict(resp.headers))  # pprint was imported as pp
print("Content length:", len(resp.text))
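
The request above is sent exactly once. A minimal sketch of retrying with exponential backoff when the server answers 429 or 503 is shown here. The names `fetch_with_retry`, `RETRYABLE`, and the `fetch` callable are hypothetical and not part of this notebook; `fetch` is assumed to return a `(status_code, body)` pair, so a real version could wrap `requests.get` in a small function that returns `(resp.status_code, resp.text)`.

```python
import time

# Assumption: only these two codes are treated as worth retrying.
RETRYABLE = {429, 503}

def fetch_with_retry(fetch, url, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url) until the status is not retryable or attempts run out."""
    status, body = fetch(url)
    for attempt in range(1, max_attempts):
        if status not in RETRYABLE:
            break
        # Wait 1x, 2x, 4x ... base_delay before the next attempt.
        sleep(base_delay * 2 ** (attempt - 1))
        status, body = fetch(url)
    return status, body

# Demo with a fake fetcher so nothing touches the network or really sleeps.
responses = iter([(503, ""), (429, ""), (200, "<html></html>")])
waits = []
result = fetch_with_retry(lambda url: next(responses), "https://example.com/",
                          sleep=waits.append)
print(result, waits)
```

Injecting `fetch` and `sleep` keeps the retry logic testable without a live server.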


---
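A response (status) code is a three-digit number with which the web server tells the client (a browser, for example) what happened to its request. The codes are grouped by their first digit.

### **1xx: Informational**
The request was received and is still being processed.

* **100 Continue**: The server received the first part of the request, and the client should send the rest.

### **2xx: Success**
The request was processed successfully.

* **200 OK**: **The most common success code.** The request completed successfully and the server returned the data.
* **201 Created**: The request succeeded and a new resource was created as a result.
* **204 No Content**: The request succeeded, but there is no data to return. (Example: after a delete request)

### **3xx: Redirection**
Further action is needed to complete the request. Mostly used when a resource has moved to another address.

* **301 Moved Permanently**: The requested page has moved **permanently** to another address.
* **304 Not Modified**: The requested file has not changed, so the cached copy can be used.

### **4xx: Client Error**
The request cannot be processed because of a mistake on the client (the side that sent the request).

* **400 Bad Request**: The request is malformed.
* **401 Unauthorized**: The user is not authenticated. A login or similar step is required.
* **403 Forbidden**: Access is denied. The client does not have permission.
* **404 Not Found**: **The most common error.** The requested page cannot be found. (For example, when the address is wrong)
* **429 Too Many Requests**: Too many requests were sent within a given time window.

### **5xx: Server Error**
The request cannot be processed because of a problem on the server.

* **500 Internal Server Error**: An unexpected error occurred on the server.
* **503 Service Unavailable**: The server is temporarily unable to handle the request. (Maintenance, overload, and so on)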
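
The first-digit grouping described above can be sketched with the standard library's `http.HTTPStatus`. The helper name `describe_status` and the `CLASSES` table are illustrative assumptions, not part of the notebook.

```python
from http import HTTPStatus

# First digit of the status code -> category name.
CLASSES = {
    1: "Informational",
    2: "Success",
    3: "Redirection",
    4: "Client Error",
    5: "Server Error",
}

def describe_status(code: int) -> str:
    """Return 'code phrase (category)' for an HTTP status code."""
    category = CLASSES.get(code // 100, "Unknown")
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        # Codes the standard library does not know about.
        phrase = "Unregistered code"
    return f"{code} {phrase} ({category})"

print(describe_status(200))  # 200 OK (Success)
print(describe_status(404))  # 404 Not Found (Client Error)
```

With a real response, `describe_status(resp.status_code)` would give a readable one-line summary.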

In [None]:
# 4) Parse with BeautifulSoup
soup = BeautifulSoup(resp.text, "lxml")
# In short: the lxml parser turns the HTML text into a navigable object tree

In [None]:
print("\n====== HTML =======")
pp.pprint(resp.text)  # pprint was imported as pp

In [None]:
korean_title_tag = soup.find('h1', class_='fpUXWby')

In [None]:
pp.pprint(korean_title_tag.text)

In [None]:
japanese_title_section = soup.find('h2', class_='visually-hidden')

In [None]:
# 1. Find every <span> tag inside the <h2> tag.
span_tags = japanese_title_section.find_all('span')

# 2. Extract just the text from each <span> with .text.
#    A list comprehension keeps the code concise.
titles = [span.text for span in span_tags]

In [None]:
pp.pprint(titles)

In [None]:
japanese_title = titles[0]
english_title = titles[1]

In [None]:
print("Japanese title:", japanese_title)
print("English title:", english_title)
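
Indexing `titles[0]` and `titles[1]` directly raises `IndexError` when the page has fewer than two `<span>` tags. A small sketch of a tolerant alternative follows; `split_titles` is a hypothetical helper, and the sample strings are placeholders rather than scraped values.

```python
from typing import List, Optional, Tuple

def split_titles(titles: List[str]) -> Tuple[Optional[str], Optional[str]]:
    """Return (japanese, english), using None for any missing entry."""
    padded = list(titles) + [None, None]  # guarantee at least two items
    return padded[0], padded[1]

print(split_titles(["日本語タイトル", "English Title"]))
print(split_titles(["日本語タイトル"]))
print(split_titles([]))
```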

In [None]:
quarter_info = soup.find('div', class_='nBnfiIh')

In [None]:
pp.pprint(quarter_info.text)

In [None]:
year = None
quarter = None
broadcast_format = None
full_format = quarter_info.text.strip()
print(full_format)

# Split off the broadcast format first
parts = full_format.split(' · ')
print(parts)
if len(parts) >= 2:
    broadcast_format = parts[1]

# Then split the season into year and quarter
season_info = parts[0].split(' ')
year = season_info[0]
print(year)

if len(season_info) >= 2:
    quarter = season_info[1]
print(quarter)
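
The splitting steps above can be wrapped in one function that tolerates missing pieces. This is only a sketch: `parse_season_line` is a hypothetical name, and the sample string assumes the "year quarter · format" layout that the split logic implies, not a value taken from the site.

```python
from typing import Dict, Optional

def parse_season_line(text: str) -> Dict[str, Optional[str]]:
    """Split 'YEAR QUARTER · FORMAT' into its parts; missing parts stay None."""
    result = {"year": None, "quarter": None, "format": None}
    parts = [part.strip() for part in text.split("·")]
    season = parts[0].split()  # whitespace split, e.g. ['1999', '4분기']
    if len(season) >= 1:
        result["year"] = season[0]
    if len(season) >= 2:
        result["quarter"] = season[1]
    if len(parts) >= 2 and parts[1]:
        result["format"] = parts[1]
    return result

print(parse_season_line("1999 4분기 · TV"))
print(parse_season_line("1999"))
```

Splitting on the bare `·` and stripping each part avoids depending on the exact spaces around the separator.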



In [None]:
genre_tags = soup.select('a[rel="genre"]')


In [None]:
genre_list = [tag.text.strip() for tag in genre_tags]

In [None]:
pp.pprint(genre_list)

In [3]:
# Install and import the libraries
# !pip -q install requests beautifulsoup4 lxml

import requests
from bs4 import BeautifulSoup
import re
import json
from typing import Dict, List, Optional
import pprint as pp

class AnilifeScraper:
    def __init__(self):
        self.headers = {
            "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) StepByStepCrawler/0.1",
            "Accept-Language": "ko-KR,ko;q=0.9,en-US;q=0.8"
        }

    def scrape_anime_info(self, url: str) -> Dict:
        """Main entry point: scrape the information for one anime."""
        try:
            # Normalize the URL so it points at the info tab
            if "tab=info" not in url:
                if "?" in url:
                    url = url.split("?")[0] + "?tab=info"
                else:
                    url = url + "?tab=info"

            # Request the page
            resp = requests.get(url, headers=self.headers, timeout=20)
            print(f"응답 코드: {resp.status_code}")
            print(f"최종 URL: {url}")

            if resp.status_code != 200:
                return {"error": f"HTTP {resp.status_code} 에러"}

            # Parse with BeautifulSoup
            soup = BeautifulSoup(resp.text, "lxml")

            # Extract the Nuxt.js data
            nuxt_data = self.extract_nuxt_data(resp.text)

            # Extract the anime information
            anime_info = {
                "title": self.extract_titles(soup, nuxt_data),
                "basic_info": self.extract_basic_info(soup, nuxt_data),
                "genres": self.extract_genres(soup, nuxt_data),
                "tags": self.extract_tags(soup, nuxt_data),
                "synopsis": self.extract_synopsis(soup, nuxt_data),
                "characters_voice_actors": self.extract_characters_and_voice_actors(soup, nuxt_data),
                "production_info": self.extract_production_info(soup, nuxt_data)
            }

            return anime_info

        except requests.RequestException as e:
            return {"error": f"요청 에러: {str(e)}"}
        except Exception as e:
            return {"error": f"파싱 에러: {str(e)}"}

    def extract_nuxt_data(self, html_content: str) -> Dict:
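        """Extract the Nuxt.js __NUXT__ data from the HTML."""
        try:
            # Locate the __NUXT__ payload
            pattern = r'window\.__NUXT__=\(function\([^)]*\)\{return (.+?)\}\)\([^)]+\)'
            match = re.search(pattern, html_content, re.DOTALL)

            if match:
                json_str = match.group(1)

                # Substitute the JavaScript function parameters with real values.
                # These values are hardcoded; verify them against the arguments
                # actually used in the page's HTML.
                # Caveat: each \b-bounded pattern also matches that letter when it
                # stands alone inside a string value.
                replacements = {
                    r'\ba\b': 'false',
                    r'\bb\b': '1',
                    r'\bc\b': 'true',
                    r'\bd\b': 'null',
                    r'\be\b': '"system"',
                    r'\bf\b': '"https://anilife.app"',
                    r'\bg\b': '"N/A"'
                }

                # Separate loop variable so the search pattern above is not shadowed
                for param_pattern, value in replacements.items():
                    json_str = re.sub(param_pattern, value, json_str)

                # Parse as JSON
                data = json.loads(json_str)
                return data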

            return {}

        except Exception as e:
            print(f"Nuxt 데이터 추출 에러: {e}")
            return {}

    def extract_titles(self, soup: BeautifulSoup, nuxt_data: Dict) -> Dict[str, str]:
        """Extract the titles (Korean, Japanese, English)."""
        titles = {"korean": "", "japanese": "", "english": ""}

        # Prefer the Nuxt data (more accurate)
        try:
            content_detail = nuxt_data.get('pinia', {}).get('content', {}).get('contentDetail', {})
            name_data = content_detail.get('name', {})

            if name_data.get('kr'):
                titles["korean"] = name_data['kr']
            if name_data.get('jp'):
                titles["japanese"] = name_data['jp']
            if name_data.get('en'):
                titles["english"] = name_data['en']
        except Exception:
            pass

        # Fall back to the HTML
        if not titles["korean"]:
            korean_title_tag = soup.find('h1', class_='fpUXWby')
            if korean_title_tag:
                titles["korean"] = korean_title_tag.get_text(strip=True).replace(" 에피소드", "").replace("정보", "")

        if not titles["japanese"] or not titles["english"]:
            japanese_title_section = soup.find('h2', class_='visually-hidden')
            if japanese_title_section:
                span_tags = japanese_title_section.find_all('span')
                if len(span_tags) >= 2:
                    if not titles["japanese"]:
                        titles["japanese"] = span_tags[0].get_text(strip=True)
                    if not titles["english"]:
                        titles["english"] = span_tags[1].get_text(strip=True)

        return titles

    def extract_basic_info(self, soup: BeautifulSoup, nuxt_data: Dict) -> Dict[str, str]:
        """Extract basic info (airing season, format, and so on)."""
        basic_info = {}

        # Prefer the Nuxt data
        try:
            content_detail = nuxt_data.get('pinia', {}).get('content', {}).get('contentDetail', {})

            if content_detail.get('format'):
                basic_info["format"] = content_detail['format']

            if content_detail.get('status'):
                basic_info["status"] = content_detail['status']

            season_data = content_detail.get('season', {})
            if season_data:
                basic_info["year"] = str(season_data.get('year', ''))
                basic_info["quarter"] = f"{season_data.get('quarter', '')}분기"

            if content_detail.get('startDate'):
                basic_info["start_date"] = content_detail['startDate']

            if content_detail.get('endDate') and content_detail['endDate'] != "null":
                basic_info["end_date"] = content_detail['endDate']

            if content_detail.get('totalEpisode') and content_detail['totalEpisode'] != "N/A":
                basic_info["total_episodes"] = str(content_detail['totalEpisode'])

            if content_detail.get('duration') and content_detail['duration'] != "N/A":
                basic_info["duration"] = str(content_detail['duration'])

        except Exception as e:
            print(f"Nuxt 기본 정보 추출 에러: {e}")

        # Fall back to the HTML for the airing season (when Nuxt had none)
        if not basic_info.get('year') or not basic_info.get('quarter'):
            quarter_info = soup.find('div', class_='nBnfiIh')
            if quarter_info:
                full_format = quarter_info.get_text(strip=True)
                parts = full_format.split(' · ')

                if len(parts) >= 2:
                    if not basic_info.get('format'):
                        basic_info["format"] = parts[1]

                    season_info = parts[0].split(' ')
                    if len(season_info) >= 2:
                        if not basic_info.get('year'):
                            basic_info["year"] = season_info[0]
                        if not basic_info.get('quarter'):
                            basic_info["quarter"] = season_info[1]

        return basic_info

    def extract_genres(self, soup: BeautifulSoup, nuxt_data: Dict) -> List[str]:
        """Extract the genres."""
        genres = []

        # Prefer the Nuxt data
        try:
            content_detail = nuxt_data.get('pinia', {}).get('content', {}).get('contentDetail', {})
            nuxt_genres = content_detail.get('genre', [])
            if nuxt_genres:
                genres = nuxt_genres
        except Exception:
            pass

        # Fall back to the HTML (when Nuxt had none)
        if not genres:
            genre_tags = soup.select('a[rel="genre"]')
            genres = [tag.get_text(strip=True) for tag in genre_tags]

        return genres

    def extract_tags(self, soup: BeautifulSoup, nuxt_data: Dict) -> List[str]:
        """Extract the tags."""
        tags = []

        # Prefer the Nuxt data
        try:
            content_detail = nuxt_data.get('pinia', {}).get('content', {}).get('contentDetail', {})
            tag_data = content_detail.get('tag', [])

            for tag_item in tag_data:
                if isinstance(tag_item, dict) and tag_item.get('name'):
                    tag_name = tag_item['name']
                    if tag_item.get('spoiler'):
                        tag_name += " (스포일러)"
                    tags.append(tag_name)
                elif isinstance(tag_item, str):
                    tags.append(tag_item)

        except Exception as e:
            print(f"Nuxt 태그 추출 에러: {e}")

        # Fall back to the HTML, matched to the page's actual structure
        if not tags:
            # Find the "작품 태그" (work tags) section
            tag_section = None

            # Method 1: locate the section by its h2 text
            for h2 in soup.find_all('h2', class_='wXeFmvm'):
                if '작품 태그' in h2.get_text():
                    tag_section = h2.find_parent('section')
                    break

            if tag_section:
                # Find the tag container
                tag_container = tag_section.find('div', class_='-mMZ9fV')
                if tag_container:
                    # Find the <a> tags
                    tag_links = tag_container.find_all('a', class_='MbHceQh')
                    for link in tag_links:
                        span = link.find('span')
                        if span:
                            tag_text = span.get_text(strip=True).replace('#', '')
                            # Spoiler check: the class iYz6NWc marks a spoiler tag
                            if 'iYz6NWc' in span.get('class', []):
                                tag_text += " (스포일러)"
                            tags.append(tag_text)

        return tags

    def extract_synopsis(self, soup: BeautifulSoup, nuxt_data: Dict) -> str:
        """Extract the synopsis."""
        # Prefer the Nuxt data
        try:
            content_detail = nuxt_data.get('pinia', {}).get('content', {}).get('contentDetail', {})
            description = content_detail.get('description', '')
            if description and description != "등록된 줄거리가 없습니다.":
                return description
        except Exception:
            pass

        # Fall back to the synopsis in the HTML
        description_div = soup.find('div', class_='bnHDzeE')
        if description_div:
            synopsis = description_div.get_text(strip=True)
            if synopsis and synopsis != "등록된 줄거리가 없습니다.":
                return synopsis

        return "등록된 줄거리가 없습니다."

    def extract_characters_and_voice_actors(self, soup: BeautifulSoup, nuxt_data: Dict) -> List[Dict]:
        """Extract the characters and their voice actors."""
        characters = []

        # Find the character cards in the HTML
        character_cards = soup.find_all('div', class_='otjBFjd')

        for card in character_cards:
            # Character info (left side)
            character_div = card.find('div', class_='OuXf8uf')
            # Voice actor info (right side)
            voice_actor_link = card.find('a')

            character_info = {}

            # Extract the character's name and role
            if character_div:
                name_elem = character_div.find('div', class_='iO6bs1d')
                role_elem = character_div.find('div', class_='_99DZmqJ')

                if name_elem:
                    character_info['character_name'] = name_elem.get_text(strip=True)
                if role_elem:
                    character_info['character_role'] = role_elem.get_text(strip=True)

                # Fall back to the data-original-title attribute
                if character_div.get('data-original-title'):
                    if not character_info.get('character_name'):
                        character_info['character_name'] = character_div['data-original-title']

            # Extract the voice actor's name
            if voice_actor_link:
                voice_actor_div = voice_actor_link.find('div', class_='_0fu6hck')
                if voice_actor_div:
                    voice_name_elem = voice_actor_div.find('div', class_='iO6bs1d')
                    if voice_name_elem:
                        character_info['voice_actor'] = voice_name_elem.get_text(strip=True)

                    # Fall back to the title attribute
                    if voice_actor_div.get('title'):
                        if not character_info.get('voice_actor'):
                            character_info['voice_actor'] = voice_actor_div['title']

            if character_info:
                characters.append(character_info)

        return characters

    def extract_production_info(self, soup: BeautifulSoup, nuxt_data: Dict) -> Dict[str, str]:
        """Extract the production info."""
        production_info = {}

        # Extract the production info from the HTML
        production_section = soup.find('div', class_='_1coMKET -HW4ChD')

        if production_section:
            # Find the staff links
            production_links = production_section.find_all('a', class_='_2hRLd-G')

            for link in production_links:
                staff_div = link.find('div', class_='OuXf8uf')

                if staff_div:
                    name_elem = staff_div.find('div', class_='iO6bs1d')
                    role_elem = staff_div.find('div', class_='_99DZmqJ')

                    if name_elem and role_elem:
                        name = name_elem.get_text(strip=True)
                        role = role_elem.get_text(strip=True)

                        # Group names by role
                        if role not in production_info:
                            production_info[role] = []

                        if isinstance(production_info[role], list):
                            production_info[role].append(name)
                        else:
                            production_info[role] = [production_info[role], name]

                    # Fall back to the title attribute
                    if staff_div.get('title'):
                        if not name_elem:
                            name = staff_div['title']
                            if role_elem:
                                role = role_elem.get_text(strip=True)
                                if role not in production_info:
                                    production_info[role] = name

        # Join each list of names into a single string
        for role, names in production_info.items():
            if isinstance(names, list):
                production_info[role] = ', '.join(names)

        return production_info

    def print_results(self, anime_info: Dict):
        """Print the results in a readable layout."""
        print("=" * 80)
        print("🎬 애니메이션 정보 크롤링 결과")
        print("=" * 80)

        # Titles
        titles = anime_info.get('title', {})
        print(f"\n📺 제목:")
        if titles.get('korean'):
            print(f"  • 한국어: {titles['korean']}")
        if titles.get('japanese'):
            print(f"  • 일본어: {titles['japanese']}")
        if titles.get('english'):
            print(f"  • 영어: {titles['english']}")

        # Basic info
        basic_info = anime_info.get('basic_info', {})
        if basic_info:
            print(f"\n📋 기본 정보:")
            for key, value in basic_info.items():
                print(f"  • {key}: {value}")

        # Genres
        genres = anime_info.get('genres', [])
        if genres:
            print(f"\n🎭 장르: {', '.join(genres)}")

        # Tags
        tags = anime_info.get('tags', [])
        if tags:
            print(f"\n🏷️ 태그:")
            for tag in tags[:10]:  # show only the first 10
                print(f"  • {tag}")
            if len(tags) > 10:
                print(f"  • ... 총 {len(tags)}개 태그")

        # Synopsis
        synopsis = anime_info.get('synopsis', '')
        if synopsis:
            print(f"\n📖 줄거리:\n{synopsis}")

        # Characters and voice actors
        characters = anime_info.get('characters_voice_actors', [])
        if characters:
            print(f"\n🎭 캐릭터 & 성우:")
            for char in characters:
                char_name = char.get('character_name', 'N/A')
                char_role = char.get('character_role', 'N/A')
                voice_actor = char.get('voice_actor', '')

                if voice_actor:
                    print(f"  • {char_name} ({char_role}) - 성우: {voice_actor}")
                else:
                    print(f"  • {char_name} - {char_role}")

        # Production info
        production = anime_info.get('production_info', {})
        if production:
            print(f"\n🏭 제작 정보:")
            for key, value in production.items():
                print(f"  • {key}: {value}")

        print("\n" + "=" * 80)

# Usage example
if __name__ == "__main__":
    scraper = AnilifeScraper()

    # Target URLs (pointing explicitly at the info tab)
    urls = [
        "https://anilife.app/content/101?tab=info"
    ]

    for url in urls:
        print(f"\n크롤링 시작: {url}")
        anime_data = scraper.scrape_anime_info(url)

        if "error" in anime_data:
            print(f"에러 발생: {anime_data['error']}")
        else:
            scraper.print_results(anime_data)

            # Also pretty-print the raw result dict (a Python repr, not strict JSON)
            print(f"\nJSON 데이터:")
            pp.pprint(anime_data)

        print("\n" + "=" * 50 + "\n")


크롤링 시작: https://anilife.app/content/101?tab=info
응답 코드: 200
최종 URL: https://anilife.app/content/101?tab=info
🎬 애니메이션 정보 크롤링 결과

📺 제목:
  • 한국어: 원피스
  • 일본어: ONE PIECE
  • 영어: ONE PIECE

🎭 장르: 액션, 모험, 코미디, 드라마, 판타지

🏷️ 태그:
  • 해적
  • 여행
  • 앙상블 캐스트
  • 소년 만화
  • 슈퍼 파워
  • 찾은 가족
  • 남성 주인공
  • 때림 개그
  • 비극
  • 음모
  • ... 총 71개 태그

📖 줄거리:
부-명성-힘⋯. 한때 이 세상의 모든 것을 손에 넣은 사나이. 「해적왕 골드 로저」그가 죽음을 앞두고 남긴 한마디는⋯ 전세계 사람들을 바다로 향하게 만들었다."내 보물 말이냐? 원한다면 주도록 하지. 잘 찾아봐. 이 세상의 모든 것을 거기에 두고 왔으니까."세상은 대해적시대를 맞는다.

🎭 캐릭터 & 성우:
  • 몽키 D. 루피 (주연) - 성우: 타나카 마유미
  • 니코 로빈 (주연) - 성우: 야마구치 유리코
  • 롤로노아 조로 (주연) - 성우: 나카이 카즈야
  • 롤로노아 조로 (주연) - 성우: 우라와 메구미
  • 프랑키 (주연) - 성우: 야오 이치키
  • 상디 (주연) - 성우: 히라타 히로아키

🏭 제작 정보:
  • 원작자: 오다 에이치로
  • 애니메이션 제작: 토에이 애니메이션, TAP, 매직 버스, 무시 프로덕션, 스튜디오 거츠, 아사히 프로덕션, 퍼니메이션, 후지 TV, 4키즈 엔터테인먼트, 아사츠 DK, 에이벡스 픽처스
  • 각본가: 나카야마 토모히로, 무카미 준키, 타나카 히토시, 야마구치 료우타, 스가 요시유키, 시마다 미츠루


JSON 데이터:
{'basic_info': {},
 'characters_voice_actors': [{'character_name': '몽키 D. 루피',
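
`extract_nuxt_data` replaces single-letter parameters with patterns such as `\ba\b`, which also rewrite a lone `a` inside a quoted string. One possible way around that, sketched under assumptions, is to tokenize the text into string literals and bare identifiers and substitute only the identifiers. `TOKEN` and `substitute_params` are hypothetical names, and the sketch assumes the payload uses double-quoted strings; it does not deal with unquoted object keys or other JavaScript syntax that `json.loads` rejects.

```python
import json
import re

# Either a double-quoted string literal (with escapes) or a bare identifier.
TOKEN = re.compile(r'"(?:\\.|[^"\\])*"|\b[A-Za-z_]\w*\b')

def substitute_params(js_text: str, values: dict) -> str:
    """Replace bare identifiers found in `values`; leave string literals alone."""
    def repl(match):
        token = match.group(0)
        if token.startswith('"'):
            return token                 # quoted string: keep verbatim
        return values.get(token, token)  # identifier: substitute if known
    return TOKEN.sub(repl, js_text)

sample = '{"name":"a b","ok":a,"n":b}'
fixed = substitute_params(sample, {"a": "false", "b": "1"})
print(fixed)              # {"name":"a b","ok":false,"n":1}
print(json.loads(fixed))  # {'name': 'a b', 'ok': False, 'n': 1}
```

Applying the plain `\ba\b` and `\bb\b` substitutions to the same sample would turn the value `"a b"` into `"false 1"`.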
                      

In [4]:
import requests
from bs4 import BeautifulSoup
import re
import json
import csv
import time
from typing import Dict, List, Optional
from concurrent.futures import ThreadPoolExecutor, as_completed
from threading import Lock
import logging
from datetime import datetime
import os

# Logging configuration
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s',
    handlers=[
        logging.FileHandler('anilife_scraping.log'),
        logging.StreamHandler()
    ]
)

class AnilifeScraper:
    def __init__(self):
        self.headers = {
            "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) StepByStepCrawler/0.1",
            "Accept-Language": "ko-KR,ko;q=0.9,en-US;q=0.8"
        }
        self.session = requests.Session()
        self.session.headers.update(self.headers)

    def scrape_anime_info(self, url: str) -> Dict:
        """Main entry point: scrape the information for one anime."""
        try:
            # Normalize the URL so it points at the info tab
            if "tab=info" not in url:
                if "?" in url:
                    url = url.split("?")[0] + "?tab=info"
                else:
                    url = url + "?tab=info"

            # Request the page
            resp = self.session.get(url, timeout=20)

            if resp.status_code != 200:
                return {"error": f"HTTP {resp.status_code} 에러", "url": url}

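            # Parse with BeautifulSoup
            soup = BeautifulSoup(resp.text, "lxml")

            # Extract the Nuxt.js data
            nuxt_data = self.extract_nuxt_data(resp.text)

            # The content id is optional: URLs without /content/<number> yield None
            id_match = re.search(r'/content/(\d+)', url)

            # Extract the anime information
            anime_info = {
                "url": url,
                "id": int(id_match.group(1)) if id_match else None,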
                "title": self.extract_titles(soup, nuxt_data),
                "basic_info": self.extract_basic_info(soup, nuxt_data),
                "genres": self.extract_genres(soup, nuxt_data),
                "tags": self.extract_tags(soup, nuxt_data),
                "synopsis": self.extract_synopsis(soup, nuxt_data),
                "characters_voice_actors": self.extract_characters_and_voice_actors(soup, nuxt_data),
                "production_info": self.extract_production_info(soup, nuxt_data)
            }

            return anime_info

        except requests.RequestException as e:
            return {"error": f"요청 에러: {str(e)}", "url": url}
        except Exception as e:
            return {"error": f"파싱 에러: {str(e)}", "url": url}

    def extract_nuxt_data(self, html_content: str) -> Dict:
        """Extract the Nuxt.js __NUXT__ data from the HTML."""
        try:
            pattern = r'window\.__NUXT__=\(function\([^)]*\)\{return (.+?)\}\)\([^)]+\)'
            match = re.search(pattern, html_content, re.DOTALL)

            if match:
                json_str = match.group(1)

                replacements = {
                    r'\ba\b': 'false',
                    r'\bb\b': '1',
                    r'\bc\b': 'true',
                    r'\bd\b': 'null',
                    r'\be\b': '"system"',
                    r'\bf\b': '"https://anilife.app"',
                    r'\bg\b': '"N/A"'
                }

                # Separate loop variable so the search pattern above is not shadowed
                for param_pattern, value in replacements.items():
                    json_str = re.sub(param_pattern, value, json_str)

                data = json.loads(json_str)
                return data

            return {}

        except Exception:
            return {}

    def extract_titles(self, soup: BeautifulSoup, nuxt_data: Dict) -> Dict[str, str]:
        """Extract the titles (Korean, Japanese, English)."""
        titles = {"korean": "", "japanese": "", "english": ""}

        try:
            content_detail = nuxt_data.get('pinia', {}).get('content', {}).get('contentDetail', {})
            name_data = content_detail.get('name', {})

            if name_data.get('kr'):
                titles["korean"] = name_data['kr']
            if name_data.get('jp'):
                titles["japanese"] = name_data['jp']
            if name_data.get('en'):
                titles["english"] = name_data['en']
        except Exception:
            pass

        if not titles["korean"]:
            korean_title_tag = soup.find('h1', class_='fpUXWby')
            if korean_title_tag:
                titles["korean"] = korean_title_tag.get_text(strip=True).replace(" 에피소드", "").replace("정보", "")

        if not titles["japanese"] or not titles["english"]:
            japanese_title_section = soup.find('h2', class_='visually-hidden')
            if japanese_title_section:
                span_tags = japanese_title_section.find_all('span')
                if len(span_tags) >= 2:
                    if not titles["japanese"]:
                        titles["japanese"] = span_tags[0].get_text(strip=True)
                    if not titles["english"]:
                        titles["english"] = span_tags[1].get_text(strip=True)

        return titles

    def extract_basic_info(self, soup: BeautifulSoup, nuxt_data: Dict) -> Dict[str, str]:
        """Extract basic info."""
        basic_info = {}

        try:
            content_detail = nuxt_data.get('pinia', {}).get('content', {}).get('contentDetail', {})

            if content_detail.get('format'):
                basic_info["format"] = content_detail['format']

            if content_detail.get('status'):
                basic_info["status"] = content_detail['status']

            season_data = content_detail.get('season', {})
            if season_data:
                basic_info["year"] = str(season_data.get('year', ''))
                basic_info["quarter"] = f"{season_data.get('quarter', '')}분기"

            if content_detail.get('startDate'):
                basic_info["start_date"] = content_detail['startDate']

            if content_detail.get('endDate') and content_detail['endDate'] != "null":
                basic_info["end_date"] = content_detail['endDate']

            if content_detail.get('totalEpisode') and content_detail['totalEpisode'] != "N/A":
                basic_info["total_episodes"] = str(content_detail['totalEpisode'])

            if content_detail.get('duration') and content_detail['duration'] != "N/A":
                basic_info["duration"] = str(content_detail['duration'])

        except Exception:
            pass

        if not basic_info.get('year') or not basic_info.get('quarter'):
            quarter_info = soup.find('div', class_='nBnfiIh')
            if quarter_info:
                full_format = quarter_info.get_text(strip=True)
                parts = full_format.split(' · ')

                if len(parts) >= 2:
                    if not basic_info.get('format'):
                        basic_info["format"] = parts[1]

                    season_info = parts[0].split(' ')
                    if len(season_info) >= 2:
                        if not basic_info.get('year'):
                            basic_info["year"] = season_info[0]
                        if not basic_info.get('quarter'):
                            basic_info["quarter"] = season_info[1]

        return basic_info

    def extract_genres(self, soup: BeautifulSoup, nuxt_data: Dict) -> List[str]:
        """Extract the genres."""
        genres = []

        try:
            content_detail = nuxt_data.get('pinia', {}).get('content', {}).get('contentDetail', {})
            nuxt_genres = content_detail.get('genre', [])
            if nuxt_genres:
                genres = nuxt_genres
        except Exception:
            pass

        if not genres:
            genre_tags = soup.select('a[rel="genre"]')
            genres = [tag.get_text(strip=True) for tag in genre_tags]

        return genres

    def extract_tags(self, soup: BeautifulSoup, nuxt_data: Dict) -> List[str]:
        """Extract the tags."""
        tags = []

        try:
            content_detail = nuxt_data.get('pinia', {}).get('content', {}).get('contentDetail', {})
            tag_data = content_detail.get('tag', [])

            for tag_item in tag_data:
                if isinstance(tag_item, dict) and tag_item.get('name'):
                    tag_name = tag_item['name']
                    if tag_item.get('spoiler'):
                        tag_name += " (스포일러)"
                    tags.append(tag_name)
                elif isinstance(tag_item, str):
                    tags.append(tag_item)

        except Exception:
            pass

        if not tags:
            tag_section = None
            for h2 in soup.find_all('h2', class_='wXeFmvm'):
                if '작품 태그' in h2.get_text():
                    tag_section = h2.find_parent('section')
                    break

            if tag_section:
                tag_container = tag_section.find('div', class_='-mMZ9fV')
                if tag_container:
                    tag_links = tag_container.find_all('a', class_='MbHceQh')
                    for link in tag_links:
                        span = link.find('span')
                        if span:
                            tag_text = span.get_text(strip=True).replace('#', '')
                            if 'iYz6NWc' in span.get('class', []):
                                tag_text += " (스포일러)"
                            tags.append(tag_text)

        return tags

    def extract_synopsis(self, soup: BeautifulSoup, nuxt_data: Dict) -> str:
        """Extract the synopsis."""
        try:
            content_detail = nuxt_data.get('pinia', {}).get('content', {}).get('contentDetail', {})
            description = content_detail.get('description', '')
            if description and description != "등록된 줄거리가 없습니다.":
                return description
        except Exception:
            pass

        description_div = soup.find('div', class_='bnHDzeE')
        if description_div:
            synopsis = description_div.get_text(strip=True)
            if synopsis and synopsis != "등록된 줄거리가 없습니다.":
                return synopsis

        return "등록된 줄거리가 없습니다."

    def extract_characters_and_voice_actors(self, soup: BeautifulSoup, nuxt_data: Dict) -> List[Dict]:
        """Extract the characters and their voice actors."""
        characters = []
        character_cards = soup.find_all('div', class_='otjBFjd')

        for card in character_cards:
            character_div = card.find('div', class_='OuXf8uf')
            voice_actor_link = card.find('a')
            character_info = {}

            if character_div:
                name_elem = character_div.find('div', class_='iO6bs1d')
                role_elem = character_div.find('div', class_='_99DZmqJ')

                if name_elem:
                    character_info['character_name'] = name_elem.get_text(strip=True)
                if role_elem:
                    character_info['character_role'] = role_elem.get_text(strip=True)

                if character_div.get('data-original-title'):
                    if not character_info.get('character_name'):
                        character_info['character_name'] = character_div['data-original-title']

            if voice_actor_link:
                voice_actor_div = voice_actor_link.find('div', class_='_0fu6hck')
                if voice_actor_div:
                    voice_name_elem = voice_actor_div.find('div', class_='iO6bs1d')
                    if voice_name_elem:
                        character_info['voice_actor'] = voice_name_elem.get_text(strip=True)

                    if voice_actor_div.get('title'):
                        if not character_info.get('voice_actor'):
                            character_info['voice_actor'] = voice_actor_div['title']

            if character_info:
                characters.append(character_info)

        return characters

    def extract_production_info(self, soup: BeautifulSoup, nuxt_data: Dict) -> Dict[str, str]:
        """제작 정보 추출"""
        production_info = {}
        production_section = soup.find('div', class_='_1coMKET -HW4ChD')

        if production_section:
            production_links = production_section.find_all('a', class_='_2hRLd-G')

            for link in production_links:
                staff_div = link.find('div', class_='OuXf8uf')

                if staff_div:
                    name_elem = staff_div.find('div', class_='iO6bs1d')
                    role_elem = staff_div.find('div', class_='_99DZmqJ')

                    if name_elem and role_elem:
                        name = name_elem.get_text(strip=True)
                        role = role_elem.get_text(strip=True)

                        if role not in production_info:
                            production_info[role] = []

                        if isinstance(production_info[role], list):
                            production_info[role].append(name)
                        else:
                            production_info[role] = [production_info[role], name]

                    if staff_div.get('title'):
                        if not name_elem:
                            name = staff_div['title']
                            if role_elem:
                                role = role_elem.get_text(strip=True)
                                if role not in production_info:
                                    production_info[role] = name

        for role, names in production_info.items():
            if isinstance(names, list):
                production_info[role] = ', '.join(names)

        return production_info


class ParallelAnilifeScraper:
    def __init__(self, max_workers=10):
        self.max_workers = max_workers
        self.results = []
        self.errors = []
        self.lock = Lock()
        self.progress_lock = Lock()
        self.completed_count = 0
        self.total_count = 0

    def scrape_single(self, anime_id: int) -> Dict:
        """단일 애니메이션 크롤링"""
        url = f"https://anilife.app/content/{anime_id}?tab=info"
        scraper = AnilifeScraper()

        try:
            result = scraper.scrape_anime_info(url)

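            with self.progress_lock:
                self.completed_count += 1
                # Log every 10th page; skip while total_count is unset to avoid dividing by zero
                if self.completed_count % 10 == 0 and self.total_count:
                    logging.info(f"진행률: {self.completed_count}/{self.total_count} ({self.completed_count/self.total_count*100:.1f}%)")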

            return result
        except Exception as e:
            logging.error(f"ID {anime_id} 크롤링 실패: {str(e)}")
            return {"error": str(e), "id": anime_id, "url": url}

    def process_result(self, anime_data: Dict) -> Dict:
        """크롤링 결과를 CSV용 플랫 딕셔너리로 변환"""
        if "error" in anime_data:
            return {"id": anime_data.get("id", ""), "error": anime_data["error"]}

        flat_data = {
            "id": anime_data.get("id", ""),
            "url": anime_data.get("url", ""),
            "title_korean": anime_data.get("title", {}).get("korean", ""),
            "title_japanese": anime_data.get("title", {}).get("japanese", ""),
            "title_english": anime_data.get("title", {}).get("english", ""),
            "format": anime_data.get("basic_info", {}).get("format", ""),
            "status": anime_data.get("basic_info", {}).get("status", ""),
            "year": anime_data.get("basic_info", {}).get("year", ""),
            "quarter": anime_data.get("basic_info", {}).get("quarter", ""),
            "start_date": anime_data.get("basic_info", {}).get("start_date", ""),
            "end_date": anime_data.get("basic_info", {}).get("end_date", ""),
            "total_episodes": anime_data.get("basic_info", {}).get("total_episodes", ""),
            "duration": anime_data.get("basic_info", {}).get("duration", ""),
            "genres": "|".join(anime_data.get("genres", [])),
            "tags": "|".join(anime_data.get("tags", [])),
            "synopsis": anime_data.get("synopsis", ""),
            "num_characters": len(anime_data.get("characters_voice_actors", [])),
            "main_characters": "|".join([
                f"{c.get('character_name', '')}({c.get('voice_actor', '')})"
                for c in anime_data.get("characters_voice_actors", [])[:5]
            ]),
            "director": anime_data.get("production_info", {}).get("감독", ""),
            "studio": anime_data.get("production_info", {}).get("제작사", ""),
            "original_work": anime_data.get("production_info", {}).get("원작", ""),
            "error": ""
        }

        return flat_data

    def scrape_range(self, start_id: int, end_id: int, batch_size: int = 100):
        """지정된 범위의 애니메이션 병렬 크롤링"""
        self.total_count = end_id - start_id + 1
        self.completed_count = 0

        logging.info(f"크롤링 시작: ID {start_id}부터 {end_id}까지 (총 {self.total_count}개)")

        # 배치 단위로 처리
        for batch_start in range(start_id, end_id + 1, batch_size):
            batch_end = min(batch_start + batch_size - 1, end_id)
            batch_ids = list(range(batch_start, batch_end + 1))

            logging.info(f"배치 처리 중: ID {batch_start} ~ {batch_end}")

            with ThreadPoolExecutor(max_workers=self.max_workers) as executor:
                futures = {executor.submit(self.scrape_single, anime_id): anime_id
                          for anime_id in batch_ids}

                for future in as_completed(futures):
                    anime_id = futures[future]
                    try:
                        # as_completed() only yields finished futures, so a timeout has no effect here
                        result = future.result()
                        processed_result = self.process_result(result)

                        with self.lock:
                            if processed_result.get("error"):
                                self.errors.append(processed_result)
                            else:
                                self.results.append(processed_result)

                    except Exception as e:
                        logging.error(f"ID {anime_id} 처리 실패: {str(e)}")
                        with self.lock:
                            self.errors.append({"id": anime_id, "error": str(e)})

            # 배치 간 대기 시간 (서버 부하 방지)
            time.sleep(2)

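            # Intermediate save roughly every 500 new rows. The row count after a batch
            # rarely lands on an exact multiple of 500, so compare with the last saved count.
            last_saved = getattr(self, "_last_saved_count", 0)
            if len(self.results) - last_saved >= 500:
                self.save_intermediate_results()
                self._last_saved_count = len(self.results)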

    def save_intermediate_results(self):
        """중간 결과 저장"""
        timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
        filename = f"anilife_intermediate_{timestamp}.csv"

        with self.lock:
            if self.results:
                self.save_to_csv(filename, self.results)
                logging.info(f"중간 결과 저장: {filename} ({len(self.results)}개 항목)")

    def save_to_csv(self, filename: str, data: List[Dict]):
        """결과를 CSV 파일로 저장"""
        if not data:
            logging.warning("저장할 데이터가 없습니다.")
            return

        fieldnames = [
            "id", "url", "title_korean", "title_japanese", "title_english",
            "format", "status", "year", "quarter", "start_date", "end_date",
            "total_episodes", "duration", "genres", "tags", "synopsis",
            "num_characters", "main_characters", "director", "studio",
            "original_work", "error"
        ]

        with open(filename, 'w', newline='', encoding='utf-8-sig') as csvfile:
            writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(data)

        logging.info(f"CSV 파일 저장 완료: {filename}")

    def save_all_results(self):
        """모든 결과 저장"""
        timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')

        # 성공 데이터 저장
        if self.results:
            success_filename = f"anilife_data_{timestamp}.csv"
            self.save_to_csv(success_filename, self.results)
            logging.info(f"성공 데이터: {len(self.results)}개 항목")

        # 에러 데이터 저장
        if self.errors:
            error_filename = f"anilife_errors_{timestamp}.csv"
            self.save_to_csv(error_filename, self.errors)
            logging.info(f"에러 데이터: {len(self.errors)}개 항목")

        # 통계 출력
        total = len(self.results) + len(self.errors)
        success_rate = (len(self.results) / total * 100) if total > 0 else 0

        logging.info(f"\n크롤링 완료 통계:")
        logging.info(f"- 전체: {total}개")
        logging.info(f"- 성공: {len(self.results)}개")
        logging.info(f"- 실패: {len(self.errors)}개")
        logging.info(f"- 성공률: {success_rate:.1f}%")


def main():
    """메인 실행 함수"""
    # 설정
    START_ID = 101
    END_ID = 7000
    MAX_WORKERS = 20  # 동시 실행 스레드 수 (서버 부하 고려하여 조정)
    BATCH_SIZE = 100  # 한 번에 처리할 항목 수

    # 스크래퍼 초기화
    scraper = ParallelAnilifeScraper(max_workers=MAX_WORKERS)

    # 시작 시간 기록
    start_time = time.time()

    try:
        # 크롤링 실행
        scraper.scrape_range(START_ID, END_ID, batch_size=BATCH_SIZE)

        # 결과 저장
        scraper.save_all_results()

    except KeyboardInterrupt:
        logging.info("\n크롤링이 사용자에 의해 중단되었습니다.")
        scraper.save_all_results()

    except Exception as e:
        logging.error(f"크롤링 중 오류 발생: {str(e)}")
        scraper.save_all_results()

    finally:
        # 소요 시간 출력
        elapsed_time = time.time() - start_time
        hours = int(elapsed_time // 3600)
        minutes = int((elapsed_time % 3600) // 60)
        seconds = int(elapsed_time % 60)

        logging.info(f"\n총 소요 시간: {hours}시간 {minutes}분 {seconds}초")


if __name__ == "__main__":
    main()

2025-09-05 16:36:26,094 - INFO - 크롤링 시작: ID 101부터 7000까지 (총 6900개)
2025-09-05 16:36:26,096 - INFO - 배치 처리 중: ID 101 ~ 200
2025-09-05 16:36:26,842 - INFO - 진행률: 10/6900 (0.1%)
2025-09-05 16:36:27,198 - INFO - 진행률: 20/6900 (0.3%)
2025-09-05 16:36:27,383 - INFO - 진행률: 30/6900 (0.4%)
2025-09-05 16:36:27,630 - INFO - 진행률: 40/6900 (0.6%)
2025-09-05 16:36:27,880 - INFO - 진행률: 50/6900 (0.7%)
2025-09-05 16:36:28,080 - INFO - 진행률: 60/6900 (0.9%)
2025-09-05 16:36:28,345 - INFO - 진행률: 70/6900 (1.0%)
2025-09-05 16:36:28,644 - INFO - 진행률: 80/6900 (1.2%)
2025-09-05 16:36:28,939 - INFO - 진행률: 90/6900 (1.3%)
2025-09-05 16:36:29,224 - INFO - 진행률: 100/6900 (1.4%)
2025-09-05 16:36:31,239 - INFO - 배치 처리 중: ID 201 ~ 300
2025-09-05 16:36:31,771 - INFO - 진행률: 110/6900 (1.6%)
2025-09-05 16:36:31,983 - INFO - 진행률: 120/6900 (1.7%)
2025-09-05 16:36:32,134 - INFO - 진행률: 130/6900 (1.9%)
2025-09-05 16:36:32,354 - INFO - 진행률: 140/6900 (2.0%)
2025-09-05 16:36:32,742 - INFO - 진행률: 150/6900 (2.2%)
2025-09-05 16:36:32,85
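
The log above shows roughly 100 pages fetched in about three seconds with 20 workers, so the 2-second pause between batches does not limit the burst inside a batch. If a cap on the request rate is wanted (for example to avoid 429 responses), one option is a small thread-safe limiter that each worker calls before `requests.get`. This is a minimal sketch, not part of the original scraper: `RateLimiter`, `min_interval`, and the injectable `clock`/`sleep` parameters are hypothetical names introduced for illustration.

```python
import threading
import time


class RateLimiter:
    """Allow at most one call per `min_interval` seconds across all threads."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock          # injectable so the limiter can be tested without waiting
        self._sleep = sleep
        self._lock = threading.Lock()
        self._next_allowed = 0.0     # earliest time the next call may proceed

    def wait(self):
        """Block until this caller's reserved slot; return the delay that was applied."""
        # Reserve the next free slot under the lock, then sleep outside it
        # so other threads can reserve their own slots in the meantime.
        with self._lock:
            now = self._clock()
            slot = max(now, self._next_allowed)
            self._next_allowed = slot + self.min_interval
        delay = slot - now
        if delay > 0:
            self._sleep(delay)
        return delay
```

Under these assumptions a single shared instance, e.g. `limiter = RateLimiter(0.2)`, would be created once and `limiter.wait()` called at the top of `scrape_single`, giving about five requests per second regardless of the worker count.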

In [None]:
#!/usr/bin/env python3
# -*- coding: utf-8 -*-

"""
Anilife anime bulk crawling script
Crawls IDs 101 through 7000 in parallel and saves the results as CSV
"""

import csv
import json
import time
import logging
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Dict, List, Optional
from datetime import datetime
import os
from tqdm import tqdm

# 이전에 작성한 AnilifeScraper 클래스를 import
# from anilife_scraper import AnilifeScraper

import requests
from bs4 import BeautifulSoup
import re

# 로깅 설정
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s',
    handlers=[
        logging.FileHandler('crawling.log', encoding='utf-8'),
        logging.StreamHandler()
    ]
)

class AnilifeScraper:
    """기존 스크래퍼 클래스 (간소화 버전)"""
    def __init__(self):
        self.headers = {
            "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) StepByStepCrawler/0.1",
            "Accept-Language": "ko-KR,ko;q=0.9,en-US;q=0.8"
        }

    def scrape_anime_info(self, url: str) -> Dict:
        """애니메이션 정보를 크롤링하는 메인 함수"""
        try:
            if "tab=info" not in url:
                if "?" in url:
                    url = url.split("?")[0] + "?tab=info"
                else:
                    url = url + "?tab=info"

            resp = requests.get(url, headers=self.headers, timeout=10)

            if resp.status_code != 200:
                return {"error": f"HTTP {resp.status_code}"}

            soup = BeautifulSoup(resp.text, "lxml")
            nuxt_data = self.extract_nuxt_data(resp.text)

            # 간소화된 데이터 추출
            content_detail = nuxt_data.get('pinia', {}).get('content', {}).get('contentDetail', {})

            anime_info = {
                "id": content_detail.get('id', ''),
                "title_kr": content_detail.get('name', {}).get('kr', ''),
                "title_en": content_detail.get('name', {}).get('en', ''),
                "title_jp": content_detail.get('name', {}).get('jp', ''),
                "format": content_detail.get('format', ''),
                "status": content_detail.get('status', ''),
                "year": content_detail.get('season', {}).get('year', ''),
                "quarter": content_detail.get('season', {}).get('quarter', ''),
                "start_date": content_detail.get('startDate', ''),
                "end_date": content_detail.get('endDate', ''),
                "episodes": content_detail.get('totalEpisode', ''),
                "duration": content_detail.get('duration', ''),
                "genres": '|'.join(content_detail.get('genre', [])),
                "tags": '|'.join([tag.get('name', '') if isinstance(tag, dict) else tag
                                 for tag in content_detail.get('tag', [])]),
                "description": content_detail.get('description', ''),
                "url": url
            }

            return anime_info

        except Exception as e:
            return {"error": str(e)}

    def extract_nuxt_data(self, html_content: str) -> Dict:
        """Nuxt 데이터 추출"""
        try:
            pattern = r'window\.__NUXT__=\(function\([^)]*\)\{return (.+?)\}\)\([^)]+\)'
            match = re.search(pattern, html_content, re.DOTALL)

            if match:
                json_str = match.group(1)

                replacements = {
                    r'\ba\b': 'false',
                    r'\bb\b': '1',
                    r'\bc\b': 'true',
                    r'\bd\b': 'null',
                    r'\be\b': '"system"',
                    r'\bf\b': '"https://anilife.app"',
                    r'\bg\b': '"N/A"'
                }

                for pattern, value in replacements.items():
                    json_str = re.sub(pattern, value, json_str)

                return json.loads(json_str)

            return {}
        except Exception:
            # Malformed or unexpected payload: treat it as "no data"
            return {}


class BulkCrawler:
    """대량 크롤링 관리 클래스"""

    def __init__(self, start_id: int = 101, end_id: int = 7000, max_workers: int = 10):
        self.start_id = start_id
        self.end_id = end_id
        self.max_workers = max_workers
        self.scraper = AnilifeScraper()
        self.results = []
        self.failed_ids = []

    def crawl_single_anime(self, anime_id: int) -> Optional[Dict]:
        """단일 애니메이션 크롤링"""
        url = f"https://anilife.app/content/{anime_id}"

        try:
            result = self.scraper.scrape_anime_info(url)

            if "error" in result:
                logging.warning(f"ID {anime_id} 크롤링 실패: {result['error']}")
                self.failed_ids.append(anime_id)
                return None

            # ID가 없으면 수동으로 추가
            if not result.get('id'):
                result['id'] = anime_id

            # 데이터 유효성 검사
            if not any([result.get('title_kr'), result.get('title_en'), result.get('title_jp')]):
                logging.warning(f"ID {anime_id}: 제목이 없음")
                self.failed_ids.append(anime_id)
                return None

            logging.debug(f"ID {anime_id} 성공: {result.get('title_kr', 'No title')}")
            return result

        except Exception as e:
            logging.error(f"ID {anime_id} 처리 중 에러: {str(e)}")
            self.failed_ids.append(anime_id)
            return None

    def save_to_csv(self, data: List[Dict], filename: str):
        """결과를 CSV로 저장"""
        if not data:
            logging.warning("저장할 데이터가 없습니다.")
            return

        # CSV 필드명
        fieldnames = [
            'id', 'title_kr', 'title_en', 'title_jp',
            'format', 'status', 'year', 'quarter',
            'start_date', 'end_date', 'episodes', 'duration',
            'genres', 'tags', 'description', 'url'
        ]

        with open(filename, 'w', newline='', encoding='utf-8-sig') as csvfile:
            writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
            writer.writeheader()

            for item in data:
                # 누락된 필드를 빈 문자열로 채우기
                row = {field: item.get(field, '') for field in fieldnames}
                writer.writerow(row)

        logging.info(f"데이터가 {filename}에 저장되었습니다.")

    def save_checkpoint(self, checkpoint_id: int):
        """진행 상황 체크포인트 저장"""
        checkpoint_data = {
            'last_id': checkpoint_id,
            'timestamp': datetime.now().isoformat(),
            'total_collected': len(self.results),
            'failed_ids': self.failed_ids
        }

        with open('checkpoint.json', 'w', encoding='utf-8') as f:
            json.dump(checkpoint_data, f, ensure_ascii=False, indent=2)

    def load_checkpoint(self) -> Optional[int]:
        """체크포인트에서 재시작 위치 로드"""
        if os.path.exists('checkpoint.json'):
            with open('checkpoint.json', 'r', encoding='utf-8') as f:
                data = json.load(f)
                return data.get('last_id', self.start_id)
        return self.start_id

    def run_parallel(self):
        """병렬 크롤링 실행"""
        # 체크포인트 확인
        resume_id = self.load_checkpoint()
        if resume_id > self.start_id:
            logging.info(f"체크포인트에서 재시작: ID {resume_id}")
            self.start_id = resume_id

        anime_ids = list(range(self.start_id, self.end_id + 1))
        total = len(anime_ids)

        logging.info(f"크롤링 시작: ID {self.start_id} ~ {self.end_id} (총 {total}개)")
        logging.info(f"워커 수: {self.max_workers}")

        # 프로그레스 바 설정
        with tqdm(total=total, desc="크롤링 진행") as pbar:
            with ThreadPoolExecutor(max_workers=self.max_workers) as executor:
                # 작업 제출
                futures = {
                    executor.submit(self.crawl_single_anime, anime_id): anime_id
                    for anime_id in anime_ids
                }

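                # Collect results as they finish
                batch_count = 0
                pending_ids = set(anime_ids)
                for future in as_completed(futures):
                    anime_id = futures[future]
                    pending_ids.discard(anime_id)

                    try:
                        result = future.result()
                        if result:
                            self.results.append(result)

                    except Exception as e:
                        logging.error(f"ID {anime_id} 처리 실패: {str(e)}")
                        self.failed_ids.append(anime_id)

                    pbar.update(1)
                    batch_count += 1

                    # Every 100 completions, save a checkpoint and the results so far.
                    # Futures finish out of order, so checkpoint the lowest ID still
                    # pending rather than the ID that happened to finish last.
                    if batch_count % 100 == 0:
                        resume_from = min(pending_ids, default=self.end_id + 1)
                        self.save_checkpoint(resume_from)
                        self.save_intermediate_results(batch_count)

                    # This pause only slows the collection loop; the worker threads
                    # keep sending requests at their own pace.
                    time.sleep(0.1)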

        logging.info(f"크롤링 완료! 성공: {len(self.results)}개, 실패: {len(self.failed_ids)}개")

    def save_intermediate_results(self, batch_num: int):
        """중간 결과 저장"""
        if self.results:
            timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
            filename = f'anilife_data_batch_{batch_num}_{timestamp}.csv'
            self.save_to_csv(self.results, filename)

    def save_failed_ids(self):
        """실패한 ID 목록 저장"""
        if self.failed_ids:
            with open('failed_ids.txt', 'w', encoding='utf-8') as f:
                # Avoid shadowing the built-in id()
                for failed_id in self.failed_ids:
                    f.write(f"{failed_id}\n")
            logging.info("실패한 ID 목록이 failed_ids.txt에 저장되었습니다.")


def main():
    """메인 실행 함수"""
    # 설정
    START_ID = 101
    END_ID = 7000
    MAX_WORKERS = 10  # 동시 실행 스레드 수 (서버 부하 고려하여 조절)

    # 크롤러 초기화
    crawler = BulkCrawler(
        start_id=START_ID,
        end_id=END_ID,
        max_workers=MAX_WORKERS
    )

    try:
        # 병렬 크롤링 실행
        start_time = time.time()
        crawler.run_parallel()

        # 최종 결과 저장
        timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
        final_filename = f'anilife_complete_{timestamp}.csv'
        crawler.save_to_csv(crawler.results, final_filename)

        # 실패한 ID 저장
        crawler.save_failed_ids()

        # 실행 시간 출력
        elapsed_time = time.time() - start_time
        hours = int(elapsed_time // 3600)
        minutes = int((elapsed_time % 3600) // 60)
        seconds = int(elapsed_time % 60)

        print(f"\n크롤링 완료!")
        print(f"총 실행 시간: {hours}시간 {minutes}분 {seconds}초")
        print(f"수집된 데이터: {len(crawler.results)}개")
        print(f"실패한 ID: {len(crawler.failed_ids)}개")
        print(f"최종 파일: {final_filename}")

    except KeyboardInterrupt:
        print("\n크롤링이 사용자에 의해 중단되었습니다.")
        print("현재까지의 결과를 저장합니다...")

        # 중단 시점까지의 결과 저장
        if crawler.results:
            interrupt_filename = f'anilife_interrupted_{datetime.now().strftime("%Y%m%d_%H%M%S")}.csv'
            crawler.save_to_csv(crawler.results, interrupt_filename)
            print(f"중간 결과가 {interrupt_filename}에 저장되었습니다.")

    except Exception as e:
        logging.error(f"예상치 못한 에러 발생: {str(e)}")
        raise


if __name__ == "__main__":
    main()
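
One pitfall in `extract_nuxt_data` above: `re.sub(r'\ba\b', 'false', ...)` also rewrites a standalone `a` that appears inside a quoted string, such as an English title or description. The sketch below shows one way to substitute only bare identifiers by matching double-quoted strings first and leaving them untouched. It assumes the same single-letter parameter mapping as the cell above, which is itself a guess about the page's payload. `PARAM_VALUES`, `substitute_params`, and the sample `raw` string are hypothetical names and data, and unquoted keys or single-quoted strings are not handled.

```python
import json
import re

# Assumed mapping, copied from the cell above: the wrapper function's
# single-letter parameters and the literals they are taken to stand for.
PARAM_VALUES = {
    "a": "false", "b": "1", "c": "true", "d": "null",
    "e": '"system"', "f": '"https://anilife.app"', "g": '"N/A"',
}

# Match a complete double-quoted string first, otherwise a bare identifier a-g.
_TOKEN = re.compile(r'"(?:\\.|[^"\\])*"|\b([a-g])\b')


def substitute_params(js_object: str) -> str:
    """Replace bare parameter names while leaving quoted strings unchanged."""
    def repl(match):
        name = match.group(1)
        # group(1) is None when the match was a quoted string
        return PARAM_VALUES[name] if name else match.group(0)
    return _TOKEN.sub(repl, js_object)


# Made-up sample payload for illustration only
raw = '{"title":"a b c","adult":a,"score":d}'
naive = re.sub(r'\ba\b', 'false', raw)    # also rewrites the "a" inside the title
print(naive)
print(json.loads(substitute_params(raw)))
```

The string-aware version keeps the title intact and still yields valid JSON for this sample; whether it is sufficient for the real payload depends on how the site serializes its state.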

In [None]:
#!/usr/bin/env python3
# -*- coding: utf-8 -*-

"""
Anilife test crawler - crawls only IDs 101-110 to diagnose problems
"""

import requests
from bs4 import BeautifulSoup
import re
import json
import csv
import time

class TestCrawler:
    def __init__(self):
        self.headers = {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
            "Accept-Language": "ko-KR,ko;q=0.9,en-US;q=0.8"
        }

    def test_single_page(self, anime_id: int):
        """단일 페이지 테스트 크롤링"""
        url = f"https://anilife.app/content/{anime_id}"
        print(f"\n{'='*60}")
        print(f"테스트 ID: {anime_id}")
        print(f"URL: {url}")
        print(f"{'='*60}")

        try:
            # 페이지 요청
            resp = requests.get(url, headers=self.headers, timeout=10)
            print(f"상태 코드: {resp.status_code}")

            if resp.status_code != 200:
                print(f"❌ HTTP 에러: {resp.status_code}")
                return None

            # HTML 파싱
            soup = BeautifulSoup(resp.text, 'lxml')

            # 1. 페이지 제목 확인
            page_title = soup.find('title')
            if page_title:
                print(f"페이지 타이틀: {page_title.text[:50]}...")

            # 2. Nuxt 데이터 확인
            nuxt_match = re.search(r'window\.__NUXT__', resp.text)
            if nuxt_match:
                print("✓ Nuxt 데이터 발견")
                # Nuxt 데이터 추출 시도
                self.extract_nuxt_data(resp.text)
            else:
                print("✗ Nuxt 데이터 없음")

            # 3. HTML 구조 확인
            print("\nHTML 구조 확인:")

            # h1 태그들
            h1_tags = soup.find_all('h1')
            print(f"  h1 태그 개수: {len(h1_tags)}")
            for i, h1 in enumerate(h1_tags[:3]):
                print(f"    h1[{i}]: {h1.get_text(strip=True)[:50]}")
                if h1.get('class'):
                    print(f"      class: {h1.get('class')}")

            # h2 태그들
            h2_tags = soup.find_all('h2')
            print(f"  h2 태그 개수: {len(h2_tags)}")
            for i, h2 in enumerate(h2_tags[:3]):
                text = h2.get_text(strip=True)[:50]
                if text:
                    print(f"    h2[{i}]: {text}")

            # 장르 링크
            genre_links = soup.select('a[rel="genre"]')
            print(f"  장르 링크 개수: {len(genre_links)}")
            if genre_links:
                genres = [g.get_text(strip=True) for g in genre_links[:5]]
                print(f"    장르: {', '.join(genres)}")

            # 404 체크
            if '404' in soup.text[:1000] or 'Not Found' in soup.text[:1000]:
                print("⚠️ 404 페이지일 가능성")

            # 데이터 추출 시도
            result = self.extract_data(soup, url, resp.text)

            print("\n추출된 데이터:")
            for key, value in result.items():
                if value:
                    print(f"  {key}: {str(value)[:50]}")

            return result

        except Exception as e:
            print(f"❌ 에러 발생: {str(e)}")
            import traceback
            traceback.print_exc()
            return None

    def extract_nuxt_data(self, html_content: str):
        """Nuxt 데이터 추출 테스트"""
        try:
            pattern = r'window\.__NUXT__=\(function\([^)]*\)\{return (.+?)\}\)\([^)]+\)'
            match = re.search(pattern, html_content, re.DOTALL)

            if match:
                json_str = match.group(1)
                print(f"  Nuxt 데이터 길이: {len(json_str)} 문자")

                # 변수 치환
                replacements = {
                    r'\ba\b': 'false',
                    r'\bb\b': '1',
                    r'\bc\b': 'true',
                    r'\bd\b': 'null',
                    r'\be\b': '"system"',
                    r'\bf\b': '"https://anilife.app"',
                    r'\bg\b': '"N/A"'
                }

                for pattern, value in replacements.items():
                    json_str = re.sub(pattern, value, json_str)

                data = json.loads(json_str)

                # 데이터 구조 확인
                print("  Nuxt 데이터 키:")
                for key in list(data.keys())[:10]:
                    print(f"    - {key}")

                # pinia 확인
                if 'pinia' in data:
                    print("  ✓ pinia 발견")
                    if 'content' in data['pinia']:
                        print("    ✓ content 발견")
                        if 'contentDetail' in data['pinia']['content']:
                            print("      ✓ contentDetail 발견")
                            detail = data['pinia']['content']['contentDetail']
                            if 'name' in detail:
                                print(f"        name: {detail['name']}")

                return data

            # Pattern not found: return an empty dict so callers always get a dict
            return {}
        except Exception as e:
            print(f"  Nuxt 파싱 에러: {str(e)}")
            return {}

    def extract_data(self, soup: BeautifulSoup, url: str, html_text: str) -> dict:
        """데이터 추출"""
        result = {'id': re.search(r'/content/(\d+)', url).group(1), 'url': url}

        # 1. 다양한 h1 클래스 시도
        h1_classes = ['fpUXWby', 'title', 'content-title', 'anime-title']
        for cls in h1_classes:
            h1 = soup.find('h1', class_=cls)
            if h1:
                result['title_kr'] = h1.get_text(strip=True)
                break

        # 2. h1 클래스 없이 시도
        if not result.get('title_kr'):
            h1_all = soup.find_all('h1')
            for h1 in h1_all:
                text = h1.get_text(strip=True)
                if text and len(text) > 2 and '404' not in text:
                    result['title_kr'] = text
                    break

        # 3. Nuxt에서 시도
        if not result.get('title_kr'):
            nuxt_data = self.extract_nuxt_data(html_text)
            if nuxt_data:
                try:
                    detail = nuxt_data.get('pinia', {}).get('content', {}).get('contentDetail', {})
                    if detail and 'name' in detail:
                        if isinstance(detail['name'], dict):
                            result['title_kr'] = detail['name'].get('kr', '')
                            result['title_en'] = detail['name'].get('en', '')
                            result['title_jp'] = detail['name'].get('jp', '')
                except Exception:
                    # Unexpected payload shape: keep whatever was extracted so far
                    pass

        return result

    def run_test(self):
        """테스트 실행"""
        print("Anilife 테스트 크롤링 시작 (ID 101-110)")
        print("="*60)

        results = []

        for anime_id in range(101, 111):
            result = self.test_single_page(anime_id)
            if result:
                results.append(result)
            time.sleep(1)  # 서버 부하 방지

        # 결과 저장
        print("\n" + "="*60)
        print("크롤링 완료!")
        print(f"성공: {len(results)}/10")

        # CSV 저장
        if results:
            with open('test_results.csv', 'w', newline='', encoding='utf-8-sig') as f:
                fieldnames = ['id', 'url', 'title_kr', 'title_en', 'title_jp']
                writer = csv.DictWriter(f, fieldnames=fieldnames)
                writer.writeheader()

                for result in results:
                    row = {field: result.get(field, '') for field in fieldnames}
                    writer.writerow(row)

            print("결과가 test_results.csv에 저장되었습니다.")

        # 요약
        print("\n요약:")
        for result in results:
            if result.get('title_kr'):
                print(f"  ID {result['id']}: {result['title_kr']}")
            else:
                print(f"  ID {result['id']}: 제목 없음")

if __name__ == "__main__":
    crawler = TestCrawler()
    crawler.run_test()
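
The test crawler above requests `/content/{anime_id}` with no query string, while the scraper classes earlier request the `?tab=info` view and patch the URL by string splitting, which discards any other query parameters. A minimal sketch of doing the same normalization with `urllib.parse` follows. `with_info_tab` is a hypothetical helper name, and the `tab=episode&page=2` query in the second example is invented purely to show that other parameters survive.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_info_tab(url: str) -> str:
    """Return `url` with its `tab` query parameter set to `info`."""
    parts = urlsplit(url)
    # Keep other query parameters, drop any existing `tab`, then append tab=info
    query = [(key, value)
             for key, value in parse_qsl(parts.query, keep_blank_values=True)
             if key != "tab"]
    query.append(("tab", "info"))
    return urlunsplit(parts._replace(query=urlencode(query)))


print(with_info_tab("https://anilife.app/content/101"))
print(with_info_tab("https://anilife.app/content/101?tab=episode&page=2"))
```

Parsing the URL instead of searching for the substring `tab=info` also avoids false positives when that text appears elsewhere in the address.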

In [None]:
#!/usr/bin/env python3
# -*- coding: utf-8 -*-

"""
Anilife test crawler - crawls only IDs 101-110 to diagnose problems
"""

import requests
from bs4 import BeautifulSoup
import re
import json
import csv
import time

class TestCrawler:
    def __init__(self):
        self.headers = {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
            "Accept-Language": "ko-KR,ko;q=0.9,en-US;q=0.8"
        }

    def test_single_page(self, anime_id: int):
        """단일 페이지 테스트 크롤링"""
        # URL에 tab=info 추가하여 작품 정보 페이지로 이동
        url = f"https://anilife.app/content/{anime_id}?tab=info"
        print(f"\n{'='*60}")
        print(f"테스트 ID: {anime_id}")
        print(f"URL: {url}")
        print(f"{'='*60}")

        try:
            # 페이지 요청
            resp = requests.get(url, headers=self.headers, timeout=10)
            print(f"상태 코드: {resp.status_code}")

            if resp.status_code != 200:
                print(f"❌ HTTP 에러: {resp.status_code}")
                return None

            # HTML 파싱
            soup = BeautifulSoup(resp.text, 'lxml')

            # 1. 페이지 제목 확인
            page_title = soup.find('title')
            if page_title:
                print(f"페이지 타이틀: {page_title.text[:50]}...")

            # 2. Nuxt 데이터 확인
            nuxt_match = re.search(r'window\.__NUXT__', resp.text)
            if nuxt_match:
                print("✓ Nuxt 데이터 발견")
                # Nuxt 데이터 추출 시도
                self.extract_nuxt_data(resp.text)
            else:
                print("✗ Nuxt 데이터 없음")

            # 3. HTML 구조 확인
            print("\nHTML 구조 확인:")

            # h1 태그들
            h1_tags = soup.find_all('h1')
            print(f"  h1 태그 개수: {len(h1_tags)}")
            for i, h1 in enumerate(h1_tags[:3]):
                print(f"    h1[{i}]: {h1.get_text(strip=True)[:50]}")
                if h1.get('class'):
                    print(f"      class: {h1.get('class')}")

            # h2 태그들
            h2_tags = soup.find_all('h2')
            print(f"  h2 태그 개수: {len(h2_tags)}")
            for i, h2 in enumerate(h2_tags[:3]):
                text = h2.get_text(strip=True)[:50]
                if text:
                    print(f"    h2[{i}]: {text}")

            # 장르 링크
            genre_links = soup.select('a[rel="genre"]')
            print(f"  장르 링크 개수: {len(genre_links)}")
            if genre_links:
                genres = [g.get_text(strip=True) for g in genre_links[:5]]
                print(f"    장르: {', '.join(genres)}")

            # 404 체크
            if '404' in soup.text[:1000] or 'Not Found' in soup.text[:1000]:
                print("⚠️ 404 페이지일 가능성")

            # 데이터 추출 시도
            result = self.extract_data(soup, url, resp.text)

            print("\n추출된 데이터:")
            for key, value in result.items():
                if value:
                    print(f"  {key}: {str(value)[:50]}")

            return result

        except Exception as e:
            print(f"❌ 에러 발생: {str(e)}")
            import traceback
            traceback.print_exc()
            return None

    def extract_nuxt_data(self, html_content: str):
        """Nuxt 데이터 추출 테스트"""
        try:
            pattern = r'window\.__NUXT__=\(function\([^)]*\)\{return (.+?)\}\)\([^)]+\)'
            match = re.search(pattern, html_content, re.DOTALL)

            if match:
                json_str = match.group(1)
                print(f"  Nuxt 데이터 길이: {len(json_str)} 문자")

                # 변수 치환
                replacements = {
                    r'\ba\b': 'false',
                    r'\bb\b': '1',
                    r'\bc\b': 'true',
                    r'\bd\b': 'null',
                    r'\be\b': '"system"',
                    r'\bf\b': '"https://anilife.app"',
                    r'\bg\b': '"N/A"'
                }

                for pattern, value in replacements.items():
                    json_str = re.sub(pattern, value, json_str)

                data = json.loads(json_str)

                # 데이터 구조 확인
                print("  Nuxt 데이터 키:")
                for key in list(data.keys())[:10]:
                    print(f"    - {key}")

                # pinia 확인
                if 'pinia' in data:
                    print("  ✓ pinia 발견")
                    if 'content' in data['pinia']:
                        print("    ✓ content 발견")
                        if 'contentDetail' in data['pinia']['content']:
                            print("      ✓ contentDetail 발견")
                            detail = data['pinia']['content']['contentDetail']
                            if 'name' in detail:
                                print(f"        name: {detail['name']}")

                return data

            # No match: return an empty dict so callers always receive a dict
            return {}
        except Exception as e:
            print(f"  Nuxt 파싱 에러: {str(e)}")
            return {}

    def extract_data(self, soup: BeautifulSoup, url: str, html_text: str) -> dict:
        """데이터 추출"""
        result = {'id': re.search(r'/content/(\d+)', url).group(1), 'url': url}

        # 1. 다양한 h1 클래스 시도
        h1_classes = ['fpUXWby', 'title', 'content-title', 'anime-title']
        for cls in h1_classes:
            h1 = soup.find('h1', class_=cls)
            if h1:
                result['title_kr'] = h1.get_text(strip=True)
                break

        # 2. h1 클래스 없이 시도
        if not result.get('title_kr'):
            h1_all = soup.find_all('h1')
            for h1 in h1_all:
                text = h1.get_text(strip=True)
                if text and len(text) > 2 and '404' not in text:
                    result['title_kr'] = text
                    break

        # 3. Nuxt에서 시도
        if not result.get('title_kr'):
            nuxt_data = self.extract_nuxt_data(html_text)
            if nuxt_data:
                try:
                    detail = nuxt_data.get('pinia', {}).get('content', {}).get('contentDetail', {})
                    if detail and 'name' in detail:
                        if isinstance(detail['name'], dict):
                            result['title_kr'] = detail['name'].get('kr', '')
                            result['title_en'] = detail['name'].get('en', '')
                            result['title_jp'] = detail['name'].get('jp', '')
                except Exception:
                    # Best-effort fallback: leave the titles unset if the Nuxt data is malformed
                    pass

        return result

    def run_test(self):
        """테스트 실행"""
        print("Anilife 테스트 크롤링 시작 (ID 101-110)")
        print("="*60)

        results = []

        for anime_id in range(101, 111):
            result = self.test_single_page(anime_id)
            if result:
                results.append(result)
            time.sleep(1)  # 서버 부하 방지

        # 결과 저장
        print("\n" + "="*60)
        print("크롤링 완료!")
        print(f"성공: {len(results)}/10")

        # CSV 저장
        if results:
            with open('test_results.csv', 'w', newline='', encoding='utf-8-sig') as f:
                fieldnames = ['id', 'url', 'title_kr', 'title_en', 'title_jp']
                writer = csv.DictWriter(f, fieldnames=fieldnames)
                writer.writeheader()

                for result in results:
                    row = {field: result.get(field, '') for field in fieldnames}
                    writer.writerow(row)

            print("결과가 test_results.csv에 저장되었습니다.")

        # 요약
        print("\n요약:")
        for result in results:
            if result.get('title_kr'):
                print(f"  ID {result['id']}: {result['title_kr']}")
            else:
                print(f"  ID {result['id']}: 제목 없음")

if __name__ == "__main__":
    crawler = TestCrawler()
    crawler.run_test()
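
The `extract_nuxt_data` method above substitutes the function parameters with calls such as `re.sub(r'\ba\b', 'false', json_str)`. A word-boundary substitution like this also rewrites a standalone `a` or `b` that appears inside a string value, such as a title. The following is a minimal sketch of a string-aware alternative, not the notebook's own method. `substitute_params` is a hypothetical helper, and it assumes the payload is already valid JSON apart from the bare parameter names (quoted keys, double-quoted strings).

```python
import json
import re

# Matches either a complete double-quoted string (with escapes) or a bare identifier.
# Strings are matched first, so identifiers inside them are never seen on their own.
_TOKEN = re.compile(r'"(?:\\.|[^"\\])*"|\b[A-Za-z_]\w*\b')


def substitute_params(payload: str, params: dict) -> str:
    """Replace bare identifiers listed in `params`, leaving string literals untouched.

    Hypothetical helper for illustration; `params` maps a parameter name to the
    JSON text that should replace it (for example {"a": "false"}).
    """
    def repl(match: re.Match) -> str:
        token = match.group(0)
        if token.startswith('"'):
            return token  # string literal: keep as-is
        # Unknown identifiers (true, false, null, ...) are returned unchanged
        return params.get(token, token)

    return _TOKEN.sub(repl, payload)


sample = '{"title": "a b c", "flag": a, "n": b}'
fixed = substitute_params(sample, {"a": "false", "b": "1"})
print(json.loads(fixed))
```

The parameter-to-value mapping still has to come from the actual argument list of the page's `__NUXT__` function. The hard-coded mapping in the cell above is specific to whatever page it was derived from.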

In [5]:
import requests
from bs4 import BeautifulSoup
import re
import json
import pprint as pp

def debug_scrape(url):
    """디버깅을 위한 상세 크롤링"""
    headers = {
        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) StepByStepCrawler/0.1",
        "Accept-Language": "ko-KR,ko;q=0.9,en-US;q=0.8"
    }

    # URL 정규화
    if "tab=info" not in url:
        if "?" in url:
            url = url.split("?")[0] + "?tab=info"
        else:
            url = url + "?tab=info"

    print(f"크롤링 URL: {url}")

    resp = requests.get(url, headers=headers, timeout=20)
    print(f"응답 코드: {resp.status_code}")

    soup = BeautifulSoup(resp.text, "lxml")

    print("\n" + "="*80)
    print("1. NUXT 데이터 추출 시도")
    print("="*80)

    # Nuxt 데이터 추출
    pattern = r'window\.__NUXT__=\(function\([^)]*\)\{return (.+?)\}\)\([^)]+\)'
    match = re.search(pattern, resp.text, re.DOTALL)

    nuxt_data = {}
    if match:
        json_str = match.group(1)
        print(f"Nuxt 데이터 찾음! (길이: {len(json_str)})")

        # 매개변수 치환
        replacements = {
            r'\ba\b': 'false',
            r'\bb\b': '1',
            r'\bc\b': 'true',
            r'\bd\b': 'null',
            r'\be\b': '"system"',
            r'\bf\b': '"https://anilife.app"',
            r'\bg\b': '"N/A"'
        }

        for pattern, value in replacements.items():
            json_str = re.sub(pattern, value, json_str)

        try:
            nuxt_data = json.loads(json_str)
            print("Nuxt 데이터 파싱 성공!")

            # 구조 탐색
            if 'pinia' in nuxt_data:
                print(f"\nPinia 키들: {list(nuxt_data['pinia'].keys())}")

                # content 관련 키 찾기
                for key in nuxt_data['pinia'].keys():
                    if 'content' in key.lower():
                        print(f"\n'{key}' 발견!")
                        content = nuxt_data['pinia'][key]

                        if 'contentDetail' in content:
                            detail = content['contentDetail']
                            print(f"contentDetail 키들: {list(detail.keys())[:20]}")

                            # 시즌 정보
                            if 'season' in detail:
                                print(f"\nseason 데이터: {detail['season']}")

                            # 스태프/제작 정보
                            for staff_key in ['staff', 'staffs', 'production', 'studio']:
                                if staff_key in detail:
                                    print(f"\n{staff_key} 데이터: {detail[staff_key][:3] if isinstance(detail[staff_key], list) else detail[staff_key]}")

                            # 원작 정보
                            for source_key in ['source', 'original', 'originalWork']:
                                if source_key in detail:
                                    print(f"\n{source_key}: {detail[source_key]}")

        except Exception as e:
            print(f"Nuxt 데이터 파싱 실패: {e}")
    else:
        print("Nuxt 데이터를 찾을 수 없음")

    print("\n" + "="*80)
    print("2. HTML 직접 파싱")
    print("="*80)

    # 방영 정보 찾기
    print("\n[방영 정보]")
    for class_name in ['nBnfiIh', 'season-info', 'anime-info', 'broadcast-info']:
        elem = soup.find('div', class_=class_name)
        if elem:
            text = elem.get_text(strip=True)
            print(f"클래스 '{class_name}': {text}")

            # 연도와 분기 추출
            year_match = re.search(r'(19\d{2}|20\d{2})', text)
            quarter_match = re.search(r'(\d)분기|(\d)쿨|Q(\d)|봄|여름|가을|겨울', text)

            if year_match:
                print(f"  → 연도: {year_match.group(1)}")
            if quarter_match:
                print(f"  → 분기: {quarter_match.group(0)}")

    # 모든 div의 텍스트에서 연도 찾기
    print("\n[연도가 포함된 모든 요소]")
    for div in soup.find_all('div'):
        text = div.get_text(strip=True)
        if re.search(r'20\d{2}년|20\d{2}\s', text) and len(text) < 100:
            print(f"  - {text[:80]}")

    print("\n[제작 정보 섹션]")
    # 제작 정보 섹션 찾기 - 여러 방법 시도

    # 방법 1: h2로 섹션 찾기
    for h2 in soup.find_all('h2'):
        h2_text = h2.get_text(strip=True)
        if any(keyword in h2_text for keyword in ['제작', '스태프', 'Staff', '제작진', '스튜디오']):
            print(f"제작 관련 h2 발견: {h2_text}")

            # 부모 섹션 찾기
            section = h2.find_parent('section')
            if section:
                # 섹션 내의 모든 링크 확인
                links = section.find_all('a')
                print(f"  섹션 내 링크 수: {len(links)}")

                for i, link in enumerate(links[:5]):  # 처음 5개만
                    link_text = link.get_text(strip=True)
                    print(f"    링크 {i+1}: {link_text}")

                    # div 구조 확인
                    divs = link.find_all('div')
                    for div in divs:
                        div_class = div.get('class', [])
                        div_text = div.get_text(strip=True)
                        print(f"      div (class={div_class}): {div_text}")

    # 방법 2: 특정 클래스로 찾기
    for class_combo in [['_1coMKET', '-HW4ChD'], ['production-info'], ['staff-info']]:
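        # Match divs that carry every class in class_combo, regardless of attribute order
        # (a space-joined class_ string only matches the exact attribute value)
        elem = soup.find(
            lambda tag: tag.name == 'div'
            and set(class_combo) <= set(tag.get('class') or [])
        )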

        if elem:
            print(f"\n클래스 {class_combo}로 요소 발견")
            links = elem.find_all('a')
            print(f"  링크 수: {len(links)}")

            for link in links[:3]:
                print(f"  - {link.get_text(strip=True)}")

    print("\n" + "="*80)
    print("3. 메타 데이터 확인")
    print("="*80)

    # meta 태그 확인
    for meta in soup.find_all('meta'):
        if meta.get('property') and 'og:' in meta.get('property', ''):
            print(f"{meta.get('property')}: {meta.get('content', '')[:100]}")

    # script 태그에서 JSON-LD 찾기
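    for script in soup.find_all('script', type='application/ld+json'):
        try:
            data = json.loads(script.string or "")
            # JSON-LD may be a single object or a list of objects
            keys = list(data.keys()) if isinstance(data, dict) else f"list of {len(data)} items"
            print(f"\nJSON-LD 데이터 발견: {keys}")
        except (ValueError, TypeError):
            # Skip script tags whose content is not valid JSON
            pass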

    return nuxt_data, soup


# 테스트 실행
if __name__ == "__main__":
    # 몇 개의 URL로 테스트
    test_urls = [
        "https://anilife.app/content/101",  # 원피스
        "https://anilife.app/content/110",  # 꿈속의 뮤
        "https://anilife.app/content/119",  # 일하는 세포 BLACK
    ]

    for url in test_urls:
        print("\n" + "="*100)
        print(f"테스트: {url}")
        print("="*100)

        nuxt_data, soup = debug_scrape(url)

        print("\n완료!\n")


테스트: https://anilife.app/content/101
크롤링 URL: https://anilife.app/content/101?tab=info
응답 코드: 200

1. NUXT 데이터 추출 시도
Nuxt 데이터를 찾을 수 없음

2. HTML 직접 파싱

[방영 정보]
클래스 'nBnfiIh': 1999년도 4분기·TV
  → 연도: 1999
  → 분기: 4분기

[연도가 포함된 모든 요소]

[제작 정보 섹션]
제작 관련 h2 발견: 작품 제작
  섹션 내 링크 수: 18
    링크 1: 오다 에이치로원작자
      div (class=['OuXf8uf', 'z4xkYZ9']): 오다 에이치로원작자
      div (class=['ygvbJ2N']): 
      div (class=['whFyH-k']): 
      div (class=['H3oaiWl']): 
      div (class=['nshcU0W']): 오다 에이치로원작자
      div (class=['C9a9MX4']): 오다 에이치로원작자
      div (class=['iO6bs1d']): 오다 에이치로
      div (class=['_99DZmqJ']): 원작자
    링크 2: 토에이 애니메이션애니메이션 제작
      div (class=['OuXf8uf', 'z4xkYZ9']): 토에이 애니메이션애니메이션 제작
      div (class=['ygvbJ2N']): 
      div (class=['whFyH-k']): 
      div (class=['H3oaiWl']): 
      div (class=['nshcU0W']): 토에이 애니메이션애니메이션 제작
      div (class=['C9a9MX4']): 토에이 애니메이션애니메이션 제작
      div (class=['iO6bs1d']): 토에이 애니메이션
      div (class=['_99DZmqJ']): 애니메이션 제작
    링크 3: TAP애니메이션 제작
      d
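
The output above shows the text of the `nBnfiIh` element as `1999년도 4분기·TV`, with no spaces around the `·` separator. The following is a minimal sketch of parsing that string into a year, a quarter and a format. `parse_season_text` is a hypothetical helper, and the `<year>년도 <n>분기·<format>` layout is an assumption based only on the single sample shown here; other pages may use a different layout.

```python
import re
from typing import Optional, Tuple


def parse_season_text(text: str) -> Tuple[Optional[int], Optional[int], Optional[str]]:
    """Split a string like '1999년도 4분기·TV' into (year, quarter, format).

    Hypothetical helper for illustration; any part that cannot be found is None.
    """
    # Everything before the first "·" is the season part, the rest is the format
    season_part, _, format_part = text.partition("·")

    year_match = re.search(r"(?:19|20)\d{2}", season_part)
    quarter_match = re.search(r"([1-4])\s*분기", season_part)

    year = int(year_match.group(0)) if year_match else None
    quarter = int(quarter_match.group(1)) if quarter_match else None
    media_format = format_part.strip() or None
    return year, quarter, media_format


print(parse_season_text("1999년도 4분기·TV"))
```

Returning `None` for missing parts, rather than an empty string, lets the caller distinguish "not present on the page" from a parsed value.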

In [6]:
import requests
from bs4 import BeautifulSoup
import re
import json
import csv
import time
from typing import Dict, List, Optional
from concurrent.futures import ThreadPoolExecutor, as_completed
from threading import Lock
import logging
from datetime import datetime
import os

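# Logging setup. With level=INFO the logging.debug() calls below are suppressed;
# switch to logging.DEBUG to see them.
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s',
    handlers=[
        # Explicit UTF-8 so the Korean log messages are written correctly on any platform
        logging.FileHandler('anilife_scraping.log', encoding='utf-8'),
        logging.StreamHandler()
    ]
)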

class AnilifeScraper:
    def __init__(self):
        self.headers = {
            "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) StepByStepCrawler/0.1",
            "Accept-Language": "ko-KR,ko;q=0.9,en-US;q=0.8"
        }
        self.session = requests.Session()
        self.session.headers.update(self.headers)

    def scrape_anime_info(self, url: str) -> Dict:
        """애니메이션 정보를 크롤링하는 메인 함수"""
        try:
            # URL 정규화 - info 탭으로 변경
            if "tab=info" not in url:
                if "?" in url:
                    url = url.split("?")[0] + "?tab=info"
                else:
                    url = url + "?tab=info"

            # 웹페이지 요청
            resp = self.session.get(url, timeout=20)

            if resp.status_code != 200:
                return {"error": f"HTTP {resp.status_code} 에러", "url": url}

            # BeautifulSoup으로 파싱
            soup = BeautifulSoup(resp.text, "lxml")

            # Nuxt.js 데이터 추출
            nuxt_data = self.extract_nuxt_data(resp.text)

            # 디버깅: Nuxt 데이터 구조 확인 (첫 몇 개만)
            anime_id = int(re.search(r'/content/(\d+)', url).group(1))
            if anime_id <= 105:  # 처음 몇 개만 디버깅
                logging.debug(f"ID {anime_id} - Nuxt data keys: {nuxt_data.keys() if nuxt_data else 'No data'}")
                if nuxt_data and 'pinia' in nuxt_data:
                    logging.debug(f"ID {anime_id} - Pinia keys: {nuxt_data['pinia'].keys()}")

            # 애니메이션 정보 추출
            anime_info = {
                "url": url,
                "id": anime_id,
                "title": self.extract_titles(soup, nuxt_data),
                "basic_info": self.extract_basic_info(soup, nuxt_data),
                "genres": self.extract_genres(soup, nuxt_data),
                "tags": self.extract_tags(soup, nuxt_data),
                "synopsis": self.extract_synopsis(soup, nuxt_data),
                "characters_voice_actors": self.extract_characters_and_voice_actors(soup, nuxt_data),
                "production_info": self.extract_production_info(soup, nuxt_data)
            }

            return anime_info

        except requests.RequestException as e:
            return {"error": f"요청 에러: {str(e)}", "url": url}
        except Exception as e:
            return {"error": f"파싱 에러: {str(e)}", "url": url}

    def extract_nuxt_data(self, html_content: str) -> Dict:
        """HTML에서 Nuxt.js __NUXT__ 데이터 추출"""
        try:
            pattern = r'window\.__NUXT__=\(function\([^)]*\)\{return (.+?)\}\)\([^)]+\)'
            match = re.search(pattern, html_content, re.DOTALL)

            if match:
                json_str = match.group(1)

                replacements = {
                    r'\ba\b': 'false',
                    r'\bb\b': '1',
                    r'\bc\b': 'true',
                    r'\bd\b': 'null',
                    r'\be\b': '"system"',
                    r'\bf\b': '"https://anilife.app"',
                    r'\bg\b': '"N/A"'
                }

                for pattern, value in replacements.items():
                    json_str = re.sub(pattern, value, json_str)

                data = json.loads(json_str)
                return data

            return {}

        except Exception:
            return {}

    def extract_titles(self, soup: BeautifulSoup, nuxt_data: Dict) -> Dict[str, str]:
        """제목들 추출 (한국어, 일본어, 영어)"""
        titles = {"korean": "", "japanese": "", "english": ""}

        try:
            content_detail = nuxt_data.get('pinia', {}).get('content', {}).get('contentDetail', {})
            name_data = content_detail.get('name', {})

            if name_data.get('kr'):
                titles["korean"] = name_data['kr']
            if name_data.get('jp'):
                titles["japanese"] = name_data['jp']
            if name_data.get('en'):
                titles["english"] = name_data['en']
        except Exception:
            pass

        if not titles["korean"]:
            korean_title_tag = soup.find('h1', class_='fpUXWby')
            if korean_title_tag:
                titles["korean"] = korean_title_tag.get_text(strip=True).replace(" 에피소드", "").replace("정보", "")

        if not titles["japanese"] or not titles["english"]:
            japanese_title_section = soup.find('h2', class_='visually-hidden')
            if japanese_title_section:
                span_tags = japanese_title_section.find_all('span')
                if len(span_tags) >= 2:
                    if not titles["japanese"]:
                        titles["japanese"] = span_tags[0].get_text(strip=True)
                    if not titles["english"]:
                        titles["english"] = span_tags[1].get_text(strip=True)

        return titles

    def extract_basic_info(self, soup: BeautifulSoup, nuxt_data: Dict) -> Dict[str, str]:
        """기본 정보 추출"""
        basic_info = {}

        try:
            content_detail = nuxt_data.get('pinia', {}).get('content', {}).get('contentDetail', {})

            if content_detail.get('format'):
                basic_info["format"] = content_detail['format']

            if content_detail.get('status'):
                basic_info["status"] = content_detail['status']

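            season_data = content_detail.get('season') or {}
            # Only set fields that are present, so the HTML fallback below can fill the gaps
            if season_data.get('year'):
                basic_info["year"] = str(season_data['year'])
            if season_data.get('quarter'):
                basic_info["quarter"] = f"{season_data['quarter']}분기"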

            if content_detail.get('startDate'):
                basic_info["start_date"] = content_detail['startDate']

            if content_detail.get('endDate') and content_detail['endDate'] != "null":
                basic_info["end_date"] = content_detail['endDate']

            if content_detail.get('totalEpisode') and content_detail['totalEpisode'] != "N/A":
                basic_info["total_episodes"] = str(content_detail['totalEpisode'])

            if content_detail.get('duration') and content_detail['duration'] != "N/A":
                basic_info["duration"] = str(content_detail['duration'])

        except Exception:
            pass

        if not basic_info.get('year') or not basic_info.get('quarter'):
            quarter_info = soup.find('div', class_='nBnfiIh')
            if quarter_info:
                full_format = quarter_info.get_text(strip=True)
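                # get_text(strip=True) can drop the spaces around the separator
                # (e.g. "1999년도 4분기·TV"), so split on "·" with optional whitespace
                parts = re.split(r'\s*·\s*', full_format)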

                if len(parts) >= 2:
                    if not basic_info.get('format'):
                        basic_info["format"] = parts[1]

                    season_info = parts[0].split(' ')
                    if len(season_info) >= 2:
                        if not basic_info.get('year'):
                            basic_info["year"] = season_info[0]
                        if not basic_info.get('quarter'):
                            basic_info["quarter"] = season_info[1]

        return basic_info

    def extract_genres(self, soup: BeautifulSoup, nuxt_data: Dict) -> List[str]:
        """장르 정보 추출"""
        genres = []

        try:
            content_detail = nuxt_data.get('pinia', {}).get('content', {}).get('contentDetail', {})
            nuxt_genres = content_detail.get('genre', [])
            if nuxt_genres:
                genres = nuxt_genres
        except Exception:
            pass

        if not genres:
            genre_tags = soup.select('a[rel="genre"]')
            genres = [tag.get_text(strip=True) for tag in genre_tags]

        return genres

    def extract_tags(self, soup: BeautifulSoup, nuxt_data: Dict) -> List[str]:
        """태그 정보 추출"""
        tags = []

        try:
            content_detail = nuxt_data.get('pinia', {}).get('content', {}).get('contentDetail', {})
            tag_data = content_detail.get('tag', [])

            for tag_item in tag_data:
                if isinstance(tag_item, dict) and tag_item.get('name'):
                    tag_name = tag_item['name']
                    if tag_item.get('spoiler'):
                        tag_name += " (스포일러)"
                    tags.append(tag_name)
                elif isinstance(tag_item, str):
                    tags.append(tag_item)

        except Exception:
            pass

        if not tags:
            tag_section = None
            for h2 in soup.find_all('h2', class_='wXeFmvm'):
                if '작품 태그' in h2.get_text():
                    tag_section = h2.find_parent('section')
                    break

            if tag_section:
                tag_container = tag_section.find('div', class_='-mMZ9fV')
                if tag_container:
                    tag_links = tag_container.find_all('a', class_='MbHceQh')
                    for link in tag_links:
                        span = link.find('span')
                        if span:
                            tag_text = span.get_text(strip=True).replace('#', '')
                            if 'iYz6NWc' in span.get('class', []):
                                tag_text += " (스포일러)"
                            tags.append(tag_text)

        return tags

    def extract_synopsis(self, soup: BeautifulSoup, nuxt_data: Dict) -> str:
        """줄거리 추출"""
        try:
            content_detail = nuxt_data.get('pinia', {}).get('content', {}).get('contentDetail', {})
            description = content_detail.get('description', '')
            if description and description != "등록된 줄거리가 없습니다.":
                return description
        except Exception:
            pass

        description_div = soup.find('div', class_='bnHDzeE')
        if description_div:
            synopsis = description_div.get_text(strip=True)
            if synopsis and synopsis != "등록된 줄거리가 없습니다.":
                return synopsis

        return "등록된 줄거리가 없습니다."

    def extract_characters_and_voice_actors(self, soup: BeautifulSoup, nuxt_data: Dict) -> List[Dict]:
        """캐릭터 및 성우 정보 추출"""
        characters = []
        character_cards = soup.find_all('div', class_='otjBFjd')

        for card in character_cards:
            character_div = card.find('div', class_='OuXf8uf')
            voice_actor_link = card.find('a')
            character_info = {}

            if character_div:
                name_elem = character_div.find('div', class_='iO6bs1d')
                role_elem = character_div.find('div', class_='_99DZmqJ')

                if name_elem:
                    character_info['character_name'] = name_elem.get_text(strip=True)
                if role_elem:
                    character_info['character_role'] = role_elem.get_text(strip=True)

                if character_div.get('data-original-title'):
                    if not character_info.get('character_name'):
                        character_info['character_name'] = character_div['data-original-title']

            if voice_actor_link:
                voice_actor_div = voice_actor_link.find('div', class_='_0fu6hck')
                if voice_actor_div:
                    voice_name_elem = voice_actor_div.find('div', class_='iO6bs1d')
                    if voice_name_elem:
                        character_info['voice_actor'] = voice_name_elem.get_text(strip=True)

                    if voice_actor_div.get('title'):
                        if not character_info.get('voice_actor'):
                            character_info['voice_actor'] = voice_actor_div['title']

            if character_info:
                characters.append(character_info)

        return characters

    def extract_production_info(self, soup: BeautifulSoup, nuxt_data: Dict) -> Dict[str, str]:
        """제작 정보 추출"""
        production_info = {}
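        # Match on both classes regardless of their order in the class attribute
        production_section = soup.find(
            lambda tag: tag.name == 'div'
            and {'_1coMKET', '-HW4ChD'} <= set(tag.get('class') or [])
        )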

        if production_section:
            production_links = production_section.find_all('a', class_='_2hRLd-G')

            for link in production_links:
                staff_div = link.find('div', class_='OuXf8uf')

                if staff_div:
                    name_elem = staff_div.find('div', class_='iO6bs1d')
                    role_elem = staff_div.find('div', class_='_99DZmqJ')

                    if name_elem and role_elem:
                        name = name_elem.get_text(strip=True)
                        role = role_elem.get_text(strip=True)

                        if role not in production_info:
                            production_info[role] = []

                        if isinstance(production_info[role], list):
                            production_info[role].append(name)
                        else:
                            production_info[role] = [production_info[role], name]

                    if staff_div.get('title'):
                        if not name_elem:
                            name = staff_div['title']
                            if role_elem:
                                role = role_elem.get_text(strip=True)
                                if role not in production_info:
                                    production_info[role] = name

        for role, names in production_info.items():
            if isinstance(names, list):
                production_info[role] = ', '.join(names)

        return production_info


class ParallelAnilifeScraper:
    def __init__(self, max_workers=10):
        self.max_workers = max_workers
        self.results = []
        self.errors = []
        self.lock = Lock()
        self.progress_lock = Lock()
        self.completed_count = 0
        self.total_count = 0

    def scrape_single(self, anime_id: int) -> Dict:
        """단일 애니메이션 크롤링"""
        url = f"https://anilife.app/content/{anime_id}?tab=info"
        scraper = AnilifeScraper()

        try:
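            result = scraper.scrape_anime_info(url)
            # Error results from scrape_anime_info carry no "id"; add it so that
            # failed IDs can be identified in the error CSV
            result.setdefault("id", anime_id)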

            with self.progress_lock:
                self.completed_count += 1
                if self.completed_count % 10 == 0:
                    logging.info(f"진행률: {self.completed_count}/{self.total_count} ({self.completed_count/self.total_count*100:.1f}%)")

            return result
        except Exception as e:
            logging.error(f"ID {anime_id} 크롤링 실패: {str(e)}")
            return {"error": str(e), "id": anime_id, "url": url}

    def process_result(self, anime_data: Dict) -> Dict:
        """크롤링 결과를 CSV용 플랫 딕셔너리로 변환"""
        if "error" in anime_data:
            return {"id": anime_data.get("id", ""), "error": anime_data["error"]}

        flat_data = {
            "id": anime_data.get("id", ""),
            "url": anime_data.get("url", ""),
            "title_korean": anime_data.get("title", {}).get("korean", ""),
            "title_japanese": anime_data.get("title", {}).get("japanese", ""),
            "title_english": anime_data.get("title", {}).get("english", ""),
            "format": anime_data.get("basic_info", {}).get("format", ""),
            "status": anime_data.get("basic_info", {}).get("status", ""),
            "year": anime_data.get("basic_info", {}).get("year", ""),
            "quarter": anime_data.get("basic_info", {}).get("quarter", ""),
            "start_date": anime_data.get("basic_info", {}).get("start_date", ""),
            "end_date": anime_data.get("basic_info", {}).get("end_date", ""),
            "total_episodes": anime_data.get("basic_info", {}).get("total_episodes", ""),
            "duration": anime_data.get("basic_info", {}).get("duration", ""),
            "genres": "|".join(anime_data.get("genres", [])),
            "tags": "|".join(anime_data.get("tags", [])),
            "synopsis": anime_data.get("synopsis", ""),
            "num_characters": len(anime_data.get("characters_voice_actors", [])),
            "main_characters": "|".join([
                f"{c.get('character_name', '')}({c.get('voice_actor', '')})"
                for c in anime_data.get("characters_voice_actors", [])[:5]
            ]),
            "director": anime_data.get("production_info", {}).get("감독", ""),
            "studio": anime_data.get("production_info", {}).get("애니메이션 제작", ""),
            "original_work": anime_data.get("production_info", {}).get("원작자", "") or anime_data.get("production_info", {}).get("원작", ""),
            "error": ""
        }

        return flat_data

    def scrape_range(self, start_id: int, end_id: int, batch_size: int = 100):
        """지정된 범위의 애니메이션 병렬 크롤링"""
        self.total_count = end_id - start_id + 1
        self.completed_count = 0

        logging.info(f"크롤링 시작: ID {start_id}부터 {end_id}까지 (총 {self.total_count}개)")

        # 배치 단위로 처리
        for batch_start in range(start_id, end_id + 1, batch_size):
            batch_end = min(batch_start + batch_size - 1, end_id)
            batch_ids = list(range(batch_start, batch_end + 1))

            logging.info(f"배치 처리 중: ID {batch_start} ~ {batch_end}")

            with ThreadPoolExecutor(max_workers=self.max_workers) as executor:
                futures = {executor.submit(self.scrape_single, anime_id): anime_id
                          for anime_id in batch_ids}

                for future in as_completed(futures):
                    anime_id = futures[future]
                    try:
                        result = future.result(timeout=30)
                        processed_result = self.process_result(result)

                        with self.lock:
                            if processed_result.get("error"):
                                self.errors.append(processed_result)
                            else:
                                self.results.append(processed_result)

                    except Exception as e:
                        logging.error(f"ID {anime_id} 처리 실패: {str(e)}")
                        with self.lock:
                            self.errors.append({"id": anime_id, "error": str(e)})

            # 배치 간 대기 시간 (서버 부하 방지)
            time.sleep(2)

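            # Intermediate save each time at least 500 new results have accumulated
            # (an exact `% 500 == 0` check would almost never be true after a batch)
            if len(self.results) - getattr(self, "_last_saved_count", 0) >= 500:
                self.save_intermediate_results()
                self._last_saved_count = len(self.results)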

    def save_intermediate_results(self):
        """중간 결과 저장"""
        timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
        filename = f"anilife_intermediate_{timestamp}.csv"

        with self.lock:
            if self.results:
                self.save_to_csv(filename, self.results)
                logging.info(f"중간 결과 저장: {filename} ({len(self.results)}개 항목)")

    def save_to_csv(self, filename: str, data: List[Dict]):
        """결과를 CSV 파일로 저장"""
        if not data:
            logging.warning("저장할 데이터가 없습니다.")
            return

        fieldnames = [
            "id", "url", "title_korean", "title_japanese", "title_english",
            "format", "status", "year", "quarter", "start_date", "end_date",
            "total_episodes", "duration", "genres", "tags", "synopsis",
            "num_characters", "main_characters", "director", "studio",
            "original_work", "error"
        ]

        with open(filename, 'w', newline='', encoding='utf-8-sig') as csvfile:
            writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(data)

        logging.info(f"CSV 파일 저장 완료: {filename}")

    def save_all_results(self):
        """모든 결과 저장"""
        timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')

        # 성공 데이터 저장
        if self.results:
            success_filename = f"anilife_data_{timestamp}.csv"
            self.save_to_csv(success_filename, self.results)
            logging.info(f"성공 데이터: {len(self.results)}개 항목")

        # 에러 데이터 저장
        if self.errors:
            error_filename = f"anilife_errors_{timestamp}.csv"
            self.save_to_csv(error_filename, self.errors)
            logging.info(f"에러 데이터: {len(self.errors)}개 항목")

        # 통계 출력
        total = len(self.results) + len(self.errors)
        success_rate = (len(self.results) / total * 100) if total > 0 else 0

        logging.info(f"\n크롤링 완료 통계:")
        logging.info(f"- 전체: {total}개")
        logging.info(f"- 성공: {len(self.results)}개")
        logging.info(f"- 실패: {len(self.errors)}개")
        logging.info(f"- 성공률: {success_rate:.1f}%")


def main():
    """메인 실행 함수"""
    # 설정
    START_ID = 101
    END_ID = 7000
    MAX_WORKERS = 30  # 동시 실행 스레드 수 (서버 부하 고려하여 조정)
    BATCH_SIZE = 100  # 한 번에 처리할 항목 수
    USE_BATCH = True  # True: 배치 처리, False: 전체 동시 처리

    # 스크래퍼 초기화
    scraper = ParallelAnilifeScraper(max_workers=MAX_WORKERS)

    # Record the start time
    start_time = time.time()

    try:
        # Run the crawl
        if USE_BATCH:
            # Process batch by batch (more stable, recommended)
            scraper.scrape_range(START_ID, END_ID, batch_size=BATCH_SIZE)
        else:
            # Submit every ID at once (faster, but heavier on the server)
            scraper.scrape_range_all_at_once(START_ID, END_ID)

        # Save the results
        scraper.save_all_results()

    except KeyboardInterrupt:
        logging.info("\n크롤링이 사용자에 의해 중단되었습니다.")
        scraper.save_all_results()

    except Exception as e:
        logging.error(f"크롤링 중 오류 발생: {str(e)}")
        scraper.save_all_results()

    finally:
        # Log the elapsed time
        elapsed_time = time.time() - start_time
        hours = int(elapsed_time // 3600)
        minutes = int((elapsed_time % 3600) // 60)
        seconds = int(elapsed_time % 60)

        logging.info(f"\n총 소요 시간: {hours}시간 {minutes}분 {seconds}초")


if __name__ == "__main__":
    main()

2025-09-05 16:50:19,863 - INFO - 크롤링 시작: ID 101부터 7000까지 (총 6900개)
2025-09-05 16:50:19,864 - INFO - 배치 처리 중: ID 101 ~ 200
2025-09-05 16:50:20,649 - INFO - 진행률: 10/6900 (0.1%)
2025-09-05 16:50:20,889 - INFO - 진행률: 20/6900 (0.3%)
2025-09-05 16:50:21,035 - INFO - 진행률: 30/6900 (0.4%)
2025-09-05 16:50:21,194 - INFO - 진행률: 40/6900 (0.6%)
2025-09-05 16:50:21,510 - INFO - 진행률: 50/6900 (0.7%)
2025-09-05 16:50:21,773 - INFO - 진행률: 60/6900 (0.9%)
2025-09-05 16:50:21,932 - INFO - 진행률: 70/6900 (1.0%)
2025-09-05 16:50:22,246 - INFO - 진행률: 80/6900 (1.2%)
2025-09-05 16:50:22,496 - INFO - 진행률: 90/6900 (1.3%)
2025-09-05 16:50:22,618 - INFO - 진행률: 100/6900 (1.4%)
2025-09-05 16:50:24,620 - INFO - 배치 처리 중: ID 201 ~ 300
2025-09-05 16:50:25,243 - INFO - 진행률: 110/6900 (1.6%)
2025-09-05 16:50:25,424 - INFO - 진행률: 120/6900 (1.7%)
2025-09-05 16:50:25,576 - INFO - 진행률: 130/6900 (1.9%)
2025-09-05 16:50:25,791 - INFO - 진행률: 140/6900 (2.0%)
2025-09-05 16:50:26,050 - INFO - 진행률: 150/6900 (2.2%)
2025-09-05 16:50:26,48
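
The batch loop in the cell above is meant to write an intermediate CSV roughly every 500 results. Because failed IDs go to a separate list, the number of successes after a batch is rarely an exact multiple of 500, so a modulo test can silently skip every checkpoint. The following is a minimal sketch of a threshold-based check, not the notebook's own code; `CheckpointTracker` and the simulated counts are hypothetical names and values used only for illustration.

```python
class CheckpointTracker:
    """Tracks when enough new results have accumulated to justify a save."""

    def __init__(self, every: int = 500):
        self.every = every
        self.last_saved = 0  # result count at the most recent save

    def should_save(self, result_count: int) -> bool:
        """Return True once at least `every` new results exist since the last save."""
        if result_count - self.last_saved >= self.every:
            self.last_saved = result_count
            return True
        return False


tracker = CheckpointTracker(every=500)
# Simulated success counts after each batch of 100 IDs (a few IDs fail per batch).
for count in (97, 193, 288, 386, 481, 577, 672):
    if tracker.should_save(count):
        print(f"checkpoint at {count} results")  # prints once: checkpoint at 577 results
```

None of the simulated counts is a multiple of 500, so a `% 500 == 0` test would never trigger here, while the threshold check saves as soon as the count passes 500.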

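`save_to_csv` above writes both full records and error records with one shared column list. That works because `csv.DictWriter` fills columns missing from a row with its `restval` default, an empty string. The sketch below shows this with an in-memory buffer instead of a file; the shortened column list, the sample rows, and the `example.invalid` URL are made up for the example.

```python
import csv
import io

FIELDNAMES = ["id", "url", "title_korean", "error"]  # shortened, illustrative column list

rows = [
    {"id": 101, "url": "https://example.invalid/content/101",
     "title_korean": "예시 제목", "error": ""},
    {"id": 102, "error": "HTTP 404 에러"},  # error record: most columns are missing
]

buffer = io.StringIO(newline="")
writer = csv.DictWriter(buffer, fieldnames=FIELDNAMES)  # restval defaults to ""
writer.writeheader()
writer.writerows(rows)

# Read the text back to confirm that the missing columns became empty strings.
parsed = list(csv.DictReader(io.StringIO(buffer.getvalue(), newline="")))
print(parsed[1])
# {'id': '102', 'url': '', 'title_korean': '', 'error': 'HTTP 404 에러'}
```

The reverse case is stricter: with the default `extrasaction='raise'`, a row containing a key that is not in `fieldnames` raises `ValueError`, so any new field has to be added to the column list as well.
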
In [1]:
import requests
from bs4 import BeautifulSoup
import re
import json
import csv
import time
from typing import Dict, List, Optional
from concurrent.futures import ThreadPoolExecutor, as_completed
from threading import Lock
import logging
from datetime import datetime
import os

# Logging configuration
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s',
    handlers=[
        logging.FileHandler('anilife_scraping.log'),
        logging.StreamHandler()
    ]
)

class AnilifeScraper:
    def __init__(self):
        self.headers = {
            "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) StepByStepCrawler/0.1",
            "Accept-Language": "ko-KR,ko;q=0.9,en-US;q=0.8"
        }
        self.session = requests.Session()
        self.session.headers.update(self.headers)

    def scrape_anime_info(self, url: str) -> Dict:
        """애니메이션 정보를 크롤링하는 메인 함수"""
        try:
            # URL 정규화 - info 탭으로 변경
            if "tab=info" not in url:
                if "?" in url:
                    url = url.split("?")[0] + "?tab=info"
                else:
                    url = url + "?tab=info"

            # Request the page
            resp = self.session.get(url, timeout=20)

            if resp.status_code != 200:
                return {"error": f"HTTP {resp.status_code} 에러", "url": url}

            # Parse with BeautifulSoup
            soup = BeautifulSoup(resp.text, "lxml")

            # Extract the Nuxt.js data
            nuxt_data = self.extract_nuxt_data(resp.text)

            # Extract the anime ID
            anime_id = int(re.search(r'/content/(\d+)', url).group(1))

            # Debugging: inspect the Nuxt data structure (only for IDs up to 200)
            if anime_id <= 200:
                logging.debug(f"ID {anime_id} - Nuxt data keys: {nuxt_data.keys() if nuxt_data else 'No data'}")
                if nuxt_data and 'pinia' in nuxt_data:
                    logging.debug(f"ID {anime_id} - Pinia keys: {list(nuxt_data['pinia'].keys())}")

            # Extract the anime information
            anime_info = {
                "url": url,
                "id": anime_id,
                "title": self.extract_titles(soup, nuxt_data),
                "basic_info": self.extract_basic_info(soup, nuxt_data),
                "genres": self.extract_genres(soup, nuxt_data),
                "tags": self.extract_tags(soup, nuxt_data),
                "synopsis": self.extract_synopsis(soup, nuxt_data),
                "characters_voice_actors": self.extract_characters_and_voice_actors(soup, nuxt_data),
                "production_info": self.extract_production_info(soup, nuxt_data)
            }

            return anime_info

        except requests.RequestException as e:
            return {"error": f"요청 에러: {str(e)}", "url": url}
        except Exception as e:
            return {"error": f"파싱 에러: {str(e)}", "url": url}

    def extract_nuxt_data(self, html_content: str) -> Dict:
        """HTML에서 Nuxt.js __NUXT__ 데이터 추출"""
        try:
            pattern = r'window\.__NUXT__=\(function\([^)]*\)\{return (.+?)\}\)\([^)]+\)'
            match = re.search(pattern, html_content, re.DOTALL)

            if match:
                json_str = match.group(1)

                # 변수 치환 패턴 (확장된 버전)
                replacements = {
                    r'\ba\b': 'false',
                    r'\bb\b': '1',
                    r'\bc\b': 'true',
                    r'\bd\b': 'null',
                    r'\be\b': '"system"',
                    r'\bf\b': '"https://anilife.app"',
                    r'\bg\b': '"N/A"',
                    # 추가 패턴들
                    r'\bh\b': '2',
                    r'\bi\b': '3',
                    r'\bj\b': '4',
                    r'\bk\b': '"TV"',
                    r'\bl\b': '"OVA"',
                    r'\bm\b': '"Movie"',
                    r'\bn\b': '"Web"',
                }

                for pattern, value in replacements.items():
                    json_str = re.sub(pattern, value, json_str)

                # JSON 파싱 시도
                data = json.loads(json_str)
                return data

            return {}

        except json.JSONDecodeError as e:
            logging.debug(f"JSON parsing error: {e}")
            return {}
        except Exception as e:
            logging.debug(f"Unexpected error in extract_nuxt_data: {e}")
            return {}

    def extract_titles(self, soup: BeautifulSoup, nuxt_data: Dict) -> Dict[str, str]:
        """제목들 추출 (한국어, 일본어, 영어)"""
        titles = {"korean": "", "japanese": "", "english": ""}

        try:
            # pinia에서 content 관련 키 찾기
            content_detail = {}
            if 'pinia' in nuxt_data:
                for key in nuxt_data['pinia'].keys():
                    if 'content' in key.lower():
                        content_data = nuxt_data['pinia'][key]
                        if isinstance(content_data, dict) and 'contentDetail' in content_data:
                            content_detail = content_data['contentDetail']
                            break

            if not content_detail:
                content_detail = nuxt_data.get('pinia', {}).get('content', {}).get('contentDetail', {})

            name_data = content_detail.get('name', {})
            if name_data.get('kr'):
                titles["korean"] = name_data['kr']
            if name_data.get('jp'):
                titles["japanese"] = name_data['jp']
            if name_data.get('en'):
                titles["english"] = name_data['en']
        except Exception:
            # The Nuxt data is optional; fall through to the HTML fallback below
            pass

        # Fall back to HTML parsing
        if not titles["korean"]:
            korean_title_tag = soup.find('h1', class_='fpUXWby')
            if korean_title_tag:
                titles["korean"] = korean_title_tag.get_text(strip=True).replace(" 에피소드", "").replace("정보", "")

        if not titles["japanese"] or not titles["english"]:
            japanese_title_section = soup.find('h2', class_='visually-hidden')
            if japanese_title_section:
                span_tags = japanese_title_section.find_all('span')
                if len(span_tags) >= 2:
                    if not titles["japanese"]:
                        titles["japanese"] = span_tags[0].get_text(strip=True)
                    if not titles["english"]:
                        titles["english"] = span_tags[1].get_text(strip=True)

        return titles

    def extract_basic_info(self, soup: BeautifulSoup, nuxt_data: Dict) -> Dict[str, str]:
        """기본 정보 추출 (수정된 버전)"""
        basic_info = {}

        try:
            # pinia에서 content 관련 키 찾기 (정확한 키 이름이 'content'가 아닐 수 있음)
            content_detail = {}
            if 'pinia' in nuxt_data:
                # content 관련 키를 동적으로 찾기
                for key in nuxt_data['pinia'].keys():
                    if 'content' in key.lower():
                        content_data = nuxt_data['pinia'][key]
                        if isinstance(content_data, dict) and 'contentDetail' in content_data:
                            content_detail = content_data['contentDetail']
                            break

                # If nothing was found, try direct access
                if not content_detail:
                    content_detail = nuxt_data.get('pinia', {}).get('content', {}).get('contentDetail', {})

            # Debug logging
            if content_detail:
                logging.debug(f"Content detail keys: {list(content_detail.keys())[:20]}")

                if content_detail.get('format'):
                    basic_info["format"] = content_detail['format']

                if content_detail.get('status'):
                    basic_info["status"] = content_detail['status']

                # Extract the season data
                season_data = content_detail.get('season', {})
                if season_data:
                    logging.debug(f"Season data found: {season_data}")
                    if season_data.get('year'):
                        basic_info["year"] = str(season_data['year'])
                    if season_data.get('quarter'):
                        # If the quarter is numeric, convert it to the "N분기" form
                        quarter_val = season_data['quarter']
                        if isinstance(quarter_val, (int, str)) and str(quarter_val).isdigit():
                            basic_info["quarter"] = f"{quarter_val}분기"
                        else:
                            basic_info["quarter"] = str(quarter_val)

                # Date information
                if content_detail.get('startDate'):
                    basic_info["start_date"] = content_detail['startDate']

                if content_detail.get('endDate') and content_detail['endDate'] != "null":
                    basic_info["end_date"] = content_detail['endDate']

                # Episode count
                if content_detail.get('totalEpisode') and str(content_detail['totalEpisode']) != "N/A":
                    basic_info["total_episodes"] = str(content_detail['totalEpisode'])

                # Running time
                if content_detail.get('duration') and str(content_detail['duration']) != "N/A":
                    basic_info["duration"] = str(content_detail['duration'])

        except Exception as e:
            logging.debug(f"Error extracting from Nuxt data: {e}")

        # If the Nuxt data did not provide these fields, fall back to HTML parsing
        if not basic_info.get('year') or not basic_info.get('quarter'):
            # Try several candidate class names
            for class_name in ['nBnfiIh', 'season-info', 'anime-info', 'broadcast-info']:
                quarter_info = soup.find('div', class_=class_name)
                if quarter_info:
                    full_format = quarter_info.get_text(strip=True)
                    logging.debug(f"Found season info in class '{class_name}': {full_format}")

                    # "2024년 1분기 · TV" 같은 형태 파싱
                    parts = full_format.split(' · ')

                    if len(parts) >= 1:
                        season_text = parts[0]

                        # Extract the year (try several patterns)
                        year_patterns = [
                            r'(20\d{2})년',  # "2024년"
                            r'(20\d{2})\s',  # "2024 "
                            r'(19\d{2}|20\d{2})',  # any four-digit year from 1900 to 2099
                        ]

                        for pattern in year_patterns:
                            year_match = re.search(pattern, season_text)
                            if year_match and not basic_info.get('year'):
                                basic_info["year"] = year_match.group(1)
                                break

                        # Extract the quarter (try several patterns)
                        quarter_patterns = [
                            r'(\d)분기',  # "1분기"
                            r'(\d)쿨',    # "1쿨"
                            r'Q(\d)',     # "Q1"
                            r'(봄|여름|가을|겨울)',  # season name
                        ]

                        for pattern in quarter_patterns:
                            quarter_match = re.search(pattern, season_text)
                            if quarter_match and not basic_info.get('quarter'):
                                if pattern == r'(봄|여름|가을|겨울)':
                                    # Convert the season name to a quarter
                                    season_to_quarter = {'봄': '2분기', '여름': '3분기', '가을': '4분기', '겨울': '1분기'}
                                    basic_info["quarter"] = season_to_quarter.get(quarter_match.group(1), quarter_match.group(1))
                                else:
                                    basic_info["quarter"] = f"{quarter_match.group(1)}분기"
                                break

                        # Extract the format
                        if len(parts) >= 2 and not basic_info.get('format'):
                            basic_info["format"] = parts[1].strip()

                    if basic_info.get('year') or basic_info.get('quarter'):
                        break  # Stop once the information has been found

        # Final check log
        logging.debug(f"Final basic_info: year={basic_info.get('year')}, quarter={basic_info.get('quarter')}")

        return basic_info

    def extract_genres(self, soup: BeautifulSoup, nuxt_data: Dict) -> List[str]:
        """장르 정보 추출"""
        genres = []

        try:
            # pinia에서 content 관련 키 찾기
            content_detail = {}
            if 'pinia' in nuxt_data:
                for key in nuxt_data['pinia'].keys():
                    if 'content' in key.lower():
                        content_data = nuxt_data['pinia'][key]
                        if isinstance(content_data, dict) and 'contentDetail' in content_data:
                            content_detail = content_data['contentDetail']
                            break

            if not content_detail:
                content_detail = nuxt_data.get('pinia', {}).get('content', {}).get('contentDetail', {})

            nuxt_genres = content_detail.get('genre', [])
            if nuxt_genres:
                genres = nuxt_genres
        except Exception:
            # The Nuxt data is optional; fall through to the HTML fallback below
            pass

        if not genres:
            genre_tags = soup.select('a[rel="genre"]')
            genres = [tag.get_text(strip=True) for tag in genre_tags]

        return genres

    def extract_tags(self, soup: BeautifulSoup, nuxt_data: Dict) -> List[str]:
        """태그 정보 추출"""
        tags = []

        try:
            # pinia에서 content 관련 키 찾기
            content_detail = {}
            if 'pinia' in nuxt_data:
                for key in nuxt_data['pinia'].keys():
                    if 'content' in key.lower():
                        content_data = nuxt_data['pinia'][key]
                        if isinstance(content_data, dict) and 'contentDetail' in content_data:
                            content_detail = content_data['contentDetail']
                            break

            if not content_detail:
                content_detail = nuxt_data.get('pinia', {}).get('content', {}).get('contentDetail', {})

            tag_data = content_detail.get('tag', [])
            for tag_item in tag_data:
                if isinstance(tag_item, dict) and tag_item.get('name'):
                    tag_name = tag_item['name']
                    if tag_item.get('spoiler'):
                        tag_name += " (스포일러)"
                    tags.append(tag_name)
                elif isinstance(tag_item, str):
                    tags.append(tag_item)

        except Exception:
            pass

        if not tags:
            tag_section = None
            for h2 in soup.find_all('h2', class_='wXeFmvm'):
                if '작품 태그' in h2.get_text():
                    tag_section = h2.find_parent('section')
                    break

            if tag_section:
                tag_container = tag_section.find('div', class_='-mMZ9fV')
                if tag_container:
                    tag_links = tag_container.find_all('a', class_='MbHceQh')
                    for link in tag_links:
                        span = link.find('span')
                        if span:
                            tag_text = span.get_text(strip=True).replace('#', '')
                            if 'iYz6NWc' in span.get('class', []):
                                tag_text += " (스포일러)"
                            tags.append(tag_text)

        return tags

    def extract_synopsis(self, soup: BeautifulSoup, nuxt_data: Dict) -> str:
        """줄거리 추출"""
        try:
            # pinia에서 content 관련 키 찾기
            content_detail = {}
            if 'pinia' in nuxt_data:
                for key in nuxt_data['pinia'].keys():
                    if 'content' in key.lower():
                        content_data = nuxt_data['pinia'][key]
                        if isinstance(content_data, dict) and 'contentDetail' in content_data:
                            content_detail = content_data['contentDetail']
                            break

            if not content_detail:
                content_detail = nuxt_data.get('pinia', {}).get('content', {}).get('contentDetail', {})

            description = content_detail.get('description', '')
            if description and description != "등록된 줄거리가 없습니다.":
                return description
        except Exception:
            # The Nuxt data is optional; fall through to the HTML fallback below
            pass

        description_div = soup.find('div', class_='bnHDzeE')
        if description_div:
            synopsis = description_div.get_text(strip=True)
            if synopsis and synopsis != "등록된 줄거리가 없습니다.":
                return synopsis

        return "등록된 줄거리가 없습니다."

    def extract_characters_and_voice_actors(self, soup: BeautifulSoup, nuxt_data: Dict) -> List[Dict]:
        """캐릭터 및 성우 정보 추출"""
        characters = []
        character_cards = soup.find_all('div', class_='otjBFjd')

        for card in character_cards:
            character_div = card.find('div', class_='OuXf8uf')
            voice_actor_link = card.find('a')
            character_info = {}

            if character_div:
                name_elem = character_div.find('div', class_='iO6bs1d')
                role_elem = character_div.find('div', class_='_99DZmqJ')

                if name_elem:
                    character_info['character_name'] = name_elem.get_text(strip=True)
                if role_elem:
                    character_info['character_role'] = role_elem.get_text(strip=True)

                if character_div.get('data-original-title'):
                    if not character_info.get('character_name'):
                        character_info['character_name'] = character_div['data-original-title']

            if voice_actor_link:
                voice_actor_div = voice_actor_link.find('div', class_='_0fu6hck')
                if voice_actor_div:
                    voice_name_elem = voice_actor_div.find('div', class_='iO6bs1d')
                    if voice_name_elem:
                        character_info['voice_actor'] = voice_name_elem.get_text(strip=True)

                    if voice_actor_div.get('title'):
                        if not character_info.get('voice_actor'):
                            character_info['voice_actor'] = voice_actor_div['title']

            if character_info:
                characters.append(character_info)

        return characters

    def extract_production_info(self, soup: BeautifulSoup, nuxt_data: Dict) -> Dict[str, str]:
        """제작 정보 추출"""
        production_info = {}

        # 먼저 Nuxt 데이터에서 시도
        try:
            # pinia에서 content 관련 키 찾기
            content_detail = {}
            if 'pinia' in nuxt_data:
                for key in nuxt_data['pinia'].keys():
                    if 'content' in key.lower():
                        content_data = nuxt_data['pinia'][key]
                        if isinstance(content_data, dict) and 'contentDetail' in content_data:
                            content_detail = content_data['contentDetail']
                            break

            if not content_detail:
                content_detail = nuxt_data.get('pinia', {}).get('content', {}).get('contentDetail', {})

            # Look for staff or production information
            for staff_key in ['staff', 'staffs', 'production', 'studio']:
                if staff_key in content_detail:
                    staff_data = content_detail[staff_key]
                    if isinstance(staff_data, list):
                        for staff in staff_data:
                            if isinstance(staff, dict):
                                role = staff.get('role', '')
                                name = staff.get('name', '')
                                if role and name:
                                    if role not in production_info:
                                        production_info[role] = []
                                    production_info[role].append(name)
        except Exception:
            # The Nuxt data is optional; continue with the HTML below
            pass

        # Extract additional information from the HTML
        production_section = soup.find('div', class_='_1coMKET -HW4ChD')
        if production_section:
            production_links = production_section.find_all('a', class_='_2hRLd-G')

            for link in production_links:
                staff_div = link.find('div', class_='OuXf8uf')

                if staff_div:
                    name_elem = staff_div.find('div', class_='iO6bs1d')
                    role_elem = staff_div.find('div', class_='_99DZmqJ')

                    if name_elem and role_elem:
                        name = name_elem.get_text(strip=True)
                        role = role_elem.get_text(strip=True)

                        if role not in production_info:
                            production_info[role] = []

                        if isinstance(production_info[role], list):
                            if name not in production_info[role]:  # avoid duplicates
                                production_info[role].append(name)
                        else:
                            production_info[role] = [production_info[role], name]

                    if staff_div.get('title'):
                        if not name_elem:
                            name = staff_div['title']
                            if role_elem:
                                role = role_elem.get_text(strip=True)
                                if role not in production_info:
                                    production_info[role] = name

        # Convert the lists to comma-separated strings
        for role, names in production_info.items():
            if isinstance(names, list):
                production_info[role] = ', '.join(names)

        return production_info


class ParallelAnilifeScraper:
    def __init__(self, max_workers=10):
        self.max_workers = max_workers
        self.results = []
        self.errors = []
        self.lock = Lock()
        self.progress_lock = Lock()
        self.completed_count = 0
        self.total_count = 0

    def scrape_single(self, anime_id: int) -> Dict:
        """단일 애니메이션 크롤링"""
        url = f"https://anilife.app/content/{anime_id}?tab=info"
        scraper = AnilifeScraper()

        try:
            result = scraper.scrape_anime_info(url)

            with self.progress_lock:
                self.completed_count += 1
                if self.completed_count % 10 == 0:
                    logging.info(f"진행률: {self.completed_count}/{self.total_count} ({self.completed_count/self.total_count*100:.1f}%)")

            return result
        except Exception as e:
            logging.error(f"ID {anime_id} 크롤링 실패: {str(e)}")
            return {"error": str(e), "id": anime_id, "url": url}

    def process_result(self, anime_data: Dict) -> Dict:
        """크롤링 결과를 CSV용 플랫 딕셔너리로 변환"""
        if "error" in anime_data:
            return {"id": anime_data.get("id", ""), "error": anime_data["error"]}

        flat_data = {
            "id": anime_data.get("id", ""),
            "url": anime_data.get("url", ""),
            "title_korean": anime_data.get("title", {}).get("korean", ""),
            "title_japanese": anime_data.get("title", {}).get("japanese", ""),
            "title_english": anime_data.get("title", {}).get("english", ""),
            "format": anime_data.get("basic_info", {}).get("format", ""),
            "status": anime_data.get("basic_info", {}).get("status", ""),
            "year": anime_data.get("basic_info", {}).get("year", ""),
            "quarter": anime_data.get("basic_info", {}).get("quarter", ""),
            "start_date": anime_data.get("basic_info", {}).get("start_date", ""),
            "end_date": anime_data.get("basic_info", {}).get("end_date", ""),
            "total_episodes": anime_data.get("basic_info", {}).get("total_episodes", ""),
            "duration": anime_data.get("basic_info", {}).get("duration", ""),
            "genres": "|".join(anime_data.get("genres", [])),
            "tags": "|".join(anime_data.get("tags", [])),
            "synopsis": anime_data.get("synopsis", ""),
            "num_characters": len(anime_data.get("characters_voice_actors", [])),
            "main_characters": "|".join([
                f"{c.get('character_name', '')}({c.get('voice_actor', '')})"
                for c in anime_data.get("characters_voice_actors", [])[:5]
            ]),
            "director": anime_data.get("production_info", {}).get("감독", ""),
            "studio": anime_data.get("production_info", {}).get("애니메이션 제작", ""),
            "original_work": anime_data.get("production_info", {}).get("원작자", "") or anime_data.get("production_info", {}).get("원작", ""),
            "error": ""
        }

        return flat_data

    def scrape_range(self, start_id: int, end_id: int, batch_size: int = 100):
        """지정된 범위의 애니메이션 병렬 크롤링"""
        self.total_count = end_id - start_id + 1
        self.completed_count = 0

        logging.info(f"크롤링 시작: ID {start_id}부터 {end_id}까지 (총 {self.total_count}개)")

        # Process the range batch by batch
        for batch_start in range(start_id, end_id + 1, batch_size):
            batch_end = min(batch_start + batch_size - 1, end_id)
            batch_ids = list(range(batch_start, batch_end + 1))

            logging.info(f"배치 처리 중: ID {batch_start} ~ {batch_end}")

            with ThreadPoolExecutor(max_workers=self.max_workers) as executor:
                futures = {executor.submit(self.scrape_single, anime_id): anime_id
                          for anime_id in batch_ids}

                for future in as_completed(futures):
                    anime_id = futures[future]
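                    try:
                        # as_completed() only yields finished futures, so this returns immediately
                        result = future.result(timeout=30)
                        processed_result = self.process_result(result)
                        # HTTP and parsing errors come back without an ID, so restore it here
                        if not processed_result.get("id"):
                            processed_result["id"] = anime_id

                        with self.lock:
                            if processed_result.get("error"):
                                self.errors.append(processed_result)
                            else:
                                self.results.append(processed_result)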

                    except Exception as e:
                        logging.error(f"ID {anime_id} 처리 실패: {str(e)}")
                        with self.lock:
                            self.errors.append({"id": anime_id, "error": str(e)})

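            # Pause between batches (to limit server load)
            time.sleep(2)

            # Intermediate save once 500 or more new results have accumulated
            # (an exact `% 500 == 0` check almost never fires when some IDs fail)
            if len(self.results) - getattr(self, "_last_saved_count", 0) >= 500:
                self._last_saved_count = len(self.results)
                self.save_intermediate_results()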

    def save_intermediate_results(self):
        """중간 결과 저장"""
        timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
        filename = f"anilife_intermediate_{timestamp}.csv"

        with self.lock:
            if self.results:
                self.save_to_csv(filename, self.results)
                logging.info(f"중간 결과 저장: {filename} ({len(self.results)}개 항목)")

    def save_to_csv(self, filename: str, data: List[Dict]):
        """결과를 CSV 파일로 저장"""
        if not data:
            logging.warning("저장할 데이터가 없습니다.")
            return

        fieldnames = [
            "id", "url", "image_url", "title_korean", "title_japanese", "title_english",
            "format", "status", "year", "quarter", "start_date", "end_date",
            "total_episodes", "duration", "genres", "tags", "synopsis",
            "num_characters", "main_characters", "director", "studio",
            "original_work", "error"
        ]

        with open(filename, 'w', newline='', encoding='utf-8-sig') as csvfile:
            writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(data)

        logging.info(f"CSV 파일 저장 완료: {filename}")

    def save_all_results(self):
        """모든 결과 저장"""
        timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')

        # 성공 데이터 저장
        if self.results:
            success_filename = f"anilife_data_{timestamp}.csv"
            self.save_to_csv(success_filename, self.results)
            logging.info(f"성공 데이터: {len(self.results)}개 항목")

        # 에러 데이터 저장
        if self.errors:
            error_filename = f"anilife_errors_{timestamp}.csv"
            self.save_to_csv(error_filename, self.errors)
            logging.info(f"에러 데이터: {len(self.errors)}개 항목")

        # 통계 출력
        total = len(self.results) + len(self.errors)
        success_rate = (len(self.results) / total * 100) if total > 0 else 0

        logging.info(f"\n크롤링 완료 통계:")
        logging.info(f"- 전체: {total}개")
        logging.info(f"- 성공: {len(self.results)}개")
        logging.info(f"- 실패: {len(self.errors)}개")
        logging.info(f"- 성공률: {success_rate:.1f}%")


def main():
    """메인 실행 함수"""
    # 설정
    START_ID = 101
    END_ID = 7000
    MAX_WORKERS = 30  # 동시 실행 스레드 수 (서버 부하 고려하여 조정)
    BATCH_SIZE = 100  # 한 번에 처리할 항목 수

    # 스크래퍼 초기화
    scraper = ParallelAnilifeScraper(max_workers=MAX_WORKERS)

    # Record the start time
    start_time = time.time()

    try:
        # Run the crawl
        scraper.scrape_range(START_ID, END_ID, batch_size=BATCH_SIZE)

        # Save the results
        scraper.save_all_results()

    except KeyboardInterrupt:
        logging.info("\n크롤링이 사용자에 의해 중단되었습니다.")
        scraper.save_all_results()

    except Exception as e:
        logging.error(f"크롤링 중 오류 발생: {str(e)}")
        scraper.save_all_results()

    finally:
        # 소요 시간 출력
        elapsed_time = time.time() - start_time
        hours = int(elapsed_time // 3600)
        minutes = int((elapsed_time % 3600) // 60)
        seconds = int(elapsed_time % 60)

        logging.info(f"\n총 소요 시간: {hours}시간 {minutes}분 {seconds}초")


if __name__ == "__main__":
    main()

2025-09-08 09:30:54,600 - INFO - 크롤링 시작: ID 101부터 7000까지 (총 6900개)
2025-09-08 09:30:54,601 - INFO - 배치 처리 중: ID 101 ~ 200
2025-09-08 09:30:55,368 - INFO - 진행률: 10/6900 (0.1%)
2025-09-08 09:30:55,788 - INFO - 진행률: 20/6900 (0.3%)
2025-09-08 09:30:55,964 - INFO - 진행률: 30/6900 (0.4%)
2025-09-08 09:30:56,204 - INFO - 진행률: 40/6900 (0.6%)
2025-09-08 09:30:56,491 - INFO - 진행률: 50/6900 (0.7%)
2025-09-08 09:30:56,712 - INFO - 진행률: 60/6900 (0.9%)
2025-09-08 09:30:56,851 - INFO - 진행률: 70/6900 (1.0%)
2025-09-08 09:30:57,078 - INFO - 진행률: 80/6900 (1.2%)
2025-09-08 09:30:57,450 - INFO - 진행률: 90/6900 (1.3%)
2025-09-08 09:30:57,594 - INFO - 진행률: 100/6900 (1.4%)
2025-09-08 09:30:59,599 - INFO - 배치 처리 중: ID 201 ~ 300
2025-09-08 09:31:00,288 - INFO - 진행률: 110/6900 (1.6%)
2025-09-08 09:31:00,417 - INFO - 진행률: 120/6900 (1.7%)
2025-09-08 09:31:00,760 - INFO - 진행률: 130/6900 (1.9%)
2025-09-08 09:31:01,363 - INFO - 진행률: 140/6900 (2.0%)
2025-09-08 09:31:01,590 - INFO - 진행률: 150/6900 (2.2%)
2025-09-08 09:31:01,79
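The two "배치 처리 중" lines in the log above are about 5 seconds apart, which already includes the 2-second pause between batches. A rough run-time estimate can be extrapolated from that gap. This is only a sketch: the function name `estimate_total_seconds` is made up for illustration, and it assumes every batch takes about as long as the first one, which a real crawl will not guarantee.

```python
from datetime import datetime

# Timestamp layout used by the logging format '%(asctime)s - ...'
LOG_TIME_FORMAT = "%Y-%m-%d %H:%M:%S,%f"


def estimate_total_seconds(first_batch_start: str, second_batch_start: str,
                           total_ids: int, batch_size: int) -> float:
    """Extrapolate the total run time from the gap between two batch start lines."""
    t0 = datetime.strptime(first_batch_start, LOG_TIME_FORMAT)
    t1 = datetime.strptime(second_batch_start, LOG_TIME_FORMAT)
    seconds_per_batch = (t1 - t0).total_seconds()
    num_batches = -(-total_ids // batch_size)  # ceiling division
    return seconds_per_batch * num_batches


# Timestamps copied from the two batch start lines in the log above
estimate = estimate_total_seconds(
    "2025-09-08 09:30:54,601",
    "2025-09-08 09:30:59,599",
    total_ids=6900,
    batch_size=100,
)
print(f"{estimate:.0f} s (~{estimate / 60:.1f} min)")
```

With 69 batches at roughly 5 seconds each, the estimate comes to a little under six minutes for the whole ID range.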

In [2]:
import requests
from bs4 import BeautifulSoup
import re
import json
import csv
import time
from typing import Dict, List, Optional
from concurrent.futures import ThreadPoolExecutor, as_completed
from threading import Lock
import logging
from datetime import datetime
import os

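# Logging setup: write to a file and echo to the notebook output
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s',
    handlers=[
        # Set the encoding explicitly so Korean log messages are stored as UTF-8 on every platform
        logging.FileHandler('anilife_scraping.log', encoding='utf-8'),
        logging.StreamHandler()
    ]
)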

class AnilifeScraper:
    def __init__(self):
        self.headers = {
            "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) StepByStepCrawler/0.1",
            "Accept-Language": "ko-KR,ko;q=0.9,en-US;q=0.8"
        }
        self.session = requests.Session()
        self.session.headers.update(self.headers)

    def scrape_anime_info(self, url: str) -> Dict:
        """애니메이션 정보를 크롤링하는 메인 함수"""
        try:
            # URL 정규화 - info 탭으로 변경
            if "tab=info" not in url:
                if "?" in url:
                    url = url.split("?")[0] + "?tab=info"
                else:
                    url = url + "?tab=info"

            # 웹페이지 요청
            resp = self.session.get(url, timeout=20)

            if resp.status_code != 200:
                return {"error": f"HTTP {resp.status_code} 에러", "url": url}

            # BeautifulSoup으로 파싱
            soup = BeautifulSoup(resp.text, "lxml")

            # Nuxt.js 데이터 추출
            nuxt_data = self.extract_nuxt_data(resp.text)

            # 애니메이션 ID 추출
            anime_id = int(re.search(r'/content/(\d+)', url).group(1))

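            # Debugging: inspect the Nuxt data structure (only for IDs up to 200).
            # These messages appear only when the log level is set to DEBUG.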
            if anime_id <= 200:
                logging.debug(f"ID {anime_id} - Nuxt data keys: {nuxt_data.keys() if nuxt_data else 'No data'}")
                if nuxt_data and 'pinia' in nuxt_data:
                    logging.debug(f"ID {anime_id} - Pinia keys: {list(nuxt_data['pinia'].keys())}")

            # 애니메이션 정보 추출
            anime_info = {
                "url": url,
                "id": anime_id,
                "title": self.extract_titles(soup, nuxt_data),
                "basic_info": self.extract_basic_info(soup, nuxt_data),
                "genres": self.extract_genres(soup, nuxt_data),
                "tags": self.extract_tags(soup, nuxt_data),
                "synopsis": self.extract_synopsis(soup, nuxt_data),
                "characters_voice_actors": self.extract_characters_and_voice_actors(soup, nuxt_data),
                "production_info": self.extract_production_info(soup, nuxt_data)
            }

            return anime_info

        except requests.RequestException as e:
            return {"error": f"요청 에러: {str(e)}", "url": url}
        except Exception as e:
            return {"error": f"파싱 에러: {str(e)}", "url": url}

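    def extract_nuxt_data(self, html_content: str) -> Dict:
        """Extract the Nuxt.js __NUXT__ payload embedded in the HTML."""
        try:
            nuxt_pattern = r'window\.__NUXT__=\(function\([^)]*\)\{return (.+?)\}\)\([^)]+\)'
            match = re.search(nuxt_pattern, html_content, re.DOTALL)

            if match:
                json_str = match.group(1)

                # Substitute the minified function parameters with literal values.
                # NOTE: this mapping is a heuristic. The actual values are the arguments
                # passed to the wrapping function and may differ from page to page, and
                # \b matching also rewrites a standalone letter inside string values,
                # so parsing can fail or produce wrong values.
                replacements = {
                    r'\ba\b': 'false',
                    r'\bb\b': '1',
                    r'\bc\b': 'true',
                    r'\bd\b': 'null',
                    r'\be\b': '"system"',
                    r'\bf\b': '"https://anilife.app"',
                    r'\bg\b': '"N/A"',
                    # Additional patterns
                    r'\bh\b': '2',
                    r'\bi\b': '3',
                    r'\bj\b': '4',
                    r'\bk\b': '"TV"',
                    r'\bl\b': '"OVA"',
                    r'\bm\b': '"Movie"',
                    r'\bn\b': '"Web"',
                }

                # A separate loop variable avoids shadowing the payload pattern above
                for var_pattern, value in replacements.items():
                    json_str = re.sub(var_pattern, value, json_str)

                # Try to parse the result as JSON; anything that is not valid JSON
                # (for example unquoted object keys) raises and yields {} below
                data = json.loads(json_str)
                return data

            return {}

        except json.JSONDecodeError as e:
            logging.debug(f"JSON parsing error: {e}")
            return {}
        except Exception as e:
            logging.debug(f"Unexpected error in extract_nuxt_data: {e}")
            return {}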

    def extract_titles(self, soup: BeautifulSoup, nuxt_data: Dict) -> Dict[str, str]:
        """제목들 추출 (한국어, 일본어, 영어)"""
        titles = {"korean": "", "japanese": "", "english": ""}

        try:
            # pinia에서 content 관련 키 찾기
            content_detail = {}
            if 'pinia' in nuxt_data:
                for key in nuxt_data['pinia'].keys():
                    if 'content' in key.lower():
                        content_data = nuxt_data['pinia'][key]
                        if isinstance(content_data, dict) and 'contentDetail' in content_data:
                            content_detail = content_data['contentDetail']
                            break

            if not content_detail:
                content_detail = nuxt_data.get('pinia', {}).get('content', {}).get('contentDetail', {})

            name_data = content_detail.get('name', {})
            if name_data.get('kr'):
                titles["korean"] = name_data['kr']
            if name_data.get('jp'):
                titles["japanese"] = name_data['jp']
            if name_data.get('en'):
                titles["english"] = name_data['en']
        except Exception:
            # Fall back to HTML parsing below if the Nuxt data is missing or malformed
            pass

        # HTML 파싱 대체
        if not titles["korean"]:
            korean_title_tag = soup.find('h1', class_='fpUXWby')
            if korean_title_tag:
                titles["korean"] = korean_title_tag.get_text(strip=True).replace(" 에피소드", "").replace("정보", "")

        if not titles["japanese"] or not titles["english"]:
            japanese_title_section = soup.find('h2', class_='visually-hidden')
            if japanese_title_section:
                span_tags = japanese_title_section.find_all('span')
                if len(span_tags) >= 2:
                    if not titles["japanese"]:
                        titles["japanese"] = span_tags[0].get_text(strip=True)
                    if not titles["english"]:
                        titles["english"] = span_tags[1].get_text(strip=True)

        return titles

    def extract_basic_info(self, soup: BeautifulSoup, nuxt_data: Dict) -> Dict[str, str]:
        """기본 정보 추출 (수정된 버전)"""
        basic_info = {}

        try:
            # pinia에서 content 관련 키 찾기 (정확한 키 이름이 'content'가 아닐 수 있음)
            content_detail = {}
            if 'pinia' in nuxt_data:
                # content 관련 키를 동적으로 찾기
                for key in nuxt_data['pinia'].keys():
                    if 'content' in key.lower():
                        content_data = nuxt_data['pinia'][key]
                        if isinstance(content_data, dict) and 'contentDetail' in content_data:
                            content_detail = content_data['contentDetail']
                            break

                # 만약 못 찾았으면 직접 접근 시도
                if not content_detail:
                    content_detail = nuxt_data.get('pinia', {}).get('content', {}).get('contentDetail', {})

            # 디버깅 로그
            if content_detail:
                logging.debug(f"Content detail keys: {list(content_detail.keys())[:20]}")

                if content_detail.get('format'):
                    basic_info["format"] = content_detail['format']

                if content_detail.get('status'):
                    basic_info["status"] = content_detail['status']

                # season 데이터 추출
                season_data = content_detail.get('season', {})
                if season_data:
                    logging.debug(f"Season data found: {season_data}")
                    if season_data.get('year'):
                        basic_info["year"] = str(season_data['year'])
                    if season_data.get('quarter'):
                        # quarter가 숫자로 오면 "N분기" 형태로 변환
                        quarter_val = season_data['quarter']
                        if isinstance(quarter_val, (int, str)) and str(quarter_val).isdigit():
                            basic_info["quarter"] = f"{quarter_val}분기"
                        else:
                            basic_info["quarter"] = str(quarter_val)

                # 날짜 정보
                if content_detail.get('startDate'):
                    basic_info["start_date"] = content_detail['startDate']

                if content_detail.get('endDate') and content_detail['endDate'] != "null":
                    basic_info["end_date"] = content_detail['endDate']

                # 에피소드 정보
                if content_detail.get('totalEpisode') and str(content_detail['totalEpisode']) != "N/A":
                    basic_info["total_episodes"] = str(content_detail['totalEpisode'])

                # 방영 시간
                if content_detail.get('duration') and str(content_detail['duration']) != "N/A":
                    basic_info["duration"] = str(content_detail['duration'])

        except Exception as e:
            logging.debug(f"Error extracting from Nuxt data: {e}")

        # Nuxt 데이터에서 못 찾은 경우 HTML 파싱으로 대체
        if not basic_info.get('year') or not basic_info.get('quarter'):
            # 여러 클래스명 시도
            for class_name in ['nBnfiIh', 'season-info', 'anime-info', 'broadcast-info']:
                quarter_info = soup.find('div', class_=class_name)
                if quarter_info:
                    full_format = quarter_info.get_text(strip=True)
                    logging.debug(f"Found season info in class '{class_name}': {full_format}")

                    # "2024년 1분기 · TV" 같은 형태 파싱
                    parts = full_format.split(' · ')

                    if len(parts) >= 1:
                        season_text = parts[0]

                        # 연도 추출 (여러 패턴 시도)
                        year_patterns = [
                            r'(20\d{2})년',  # "2024년"
                            r'(20\d{2})\s',  # "2024 "
                            r'(19\d{2}|20\d{2})',  # 일반적인 연도
                        ]

                        for pattern in year_patterns:
                            year_match = re.search(pattern, season_text)
                            if year_match and not basic_info.get('year'):
                                basic_info["year"] = year_match.group(1)
                                break

                        # 분기 추출 (여러 패턴 시도)
                        quarter_patterns = [
                            r'(\d)분기',  # "1분기"
                            r'(\d)쿨',    # "1쿨"
                            r'Q(\d)',     # "Q1"
                            r'(봄|여름|가을|겨울)',  # 계절
                        ]

                        for pattern in quarter_patterns:
                            quarter_match = re.search(pattern, season_text)
                            if quarter_match and not basic_info.get('quarter'):
                                if pattern == r'(봄|여름|가을|겨울)':
                                    # 계절을 분기로 변환
                                    season_to_quarter = {'봄': '2분기', '여름': '3분기', '가을': '4분기', '겨울': '1분기'}
                                    basic_info["quarter"] = season_to_quarter.get(quarter_match.group(1), quarter_match.group(1))
                                else:
                                    basic_info["quarter"] = f"{quarter_match.group(1)}분기"
                                break

                        # format 정보 추출
                        if len(parts) >= 2 and not basic_info.get('format'):
                            basic_info["format"] = parts[1].strip()

                    if basic_info.get('year') or basic_info.get('quarter'):
                        break  # 정보를 찾았으면 루프 종료

        # 최종 확인 로그
        logging.debug(f"Final basic_info: year={basic_info.get('year')}, quarter={basic_info.get('quarter')}")

        return basic_info

    def extract_genres(self, soup: BeautifulSoup, nuxt_data: Dict) -> List[str]:
        """장르 정보 추출"""
        genres = []

        try:
            # pinia에서 content 관련 키 찾기
            content_detail = {}
            if 'pinia' in nuxt_data:
                for key in nuxt_data['pinia'].keys():
                    if 'content' in key.lower():
                        content_data = nuxt_data['pinia'][key]
                        if isinstance(content_data, dict) and 'contentDetail' in content_data:
                            content_detail = content_data['contentDetail']
                            break

            if not content_detail:
                content_detail = nuxt_data.get('pinia', {}).get('content', {}).get('contentDetail', {})

            nuxt_genres = content_detail.get('genre', [])
            if nuxt_genres:
                genres = nuxt_genres
        except Exception:
            # Fall back to HTML parsing below if the Nuxt data is missing or malformed
            pass

        if not genres:
            genre_tags = soup.select('a[rel="genre"]')
            genres = [tag.get_text(strip=True) for tag in genre_tags]

        return genres

    def extract_tags(self, soup: BeautifulSoup, nuxt_data: Dict) -> List[str]:
        """태그 정보 추출"""
        tags = []

        try:
            # pinia에서 content 관련 키 찾기
            content_detail = {}
            if 'pinia' in nuxt_data:
                for key in nuxt_data['pinia'].keys():
                    if 'content' in key.lower():
                        content_data = nuxt_data['pinia'][key]
                        if isinstance(content_data, dict) and 'contentDetail' in content_data:
                            content_detail = content_data['contentDetail']
                            break

            if not content_detail:
                content_detail = nuxt_data.get('pinia', {}).get('content', {}).get('contentDetail', {})

            tag_data = content_detail.get('tag', [])
            for tag_item in tag_data:
                if isinstance(tag_item, dict) and tag_item.get('name'):
                    tag_name = tag_item['name']
                    if tag_item.get('spoiler'):
                        tag_name += " (스포일러)"
                    tags.append(tag_name)
                elif isinstance(tag_item, str):
                    tags.append(tag_item)

        except Exception:
            pass

        if not tags:
            tag_section = None
            for h2 in soup.find_all('h2', class_='wXeFmvm'):
                if '작품 태그' in h2.get_text():
                    tag_section = h2.find_parent('section')
                    break

            if tag_section:
                tag_container = tag_section.find('div', class_='-mMZ9fV')
                if tag_container:
                    tag_links = tag_container.find_all('a', class_='MbHceQh')
                    for link in tag_links:
                        span = link.find('span')
                        if span:
                            tag_text = span.get_text(strip=True).replace('#', '')
                            if 'iYz6NWc' in span.get('class', []):
                                tag_text += " (스포일러)"
                            tags.append(tag_text)

        return tags

    def extract_synopsis(self, soup: BeautifulSoup, nuxt_data: Dict) -> str:
        """줄거리 추출"""
        try:
            # pinia에서 content 관련 키 찾기
            content_detail = {}
            if 'pinia' in nuxt_data:
                for key in nuxt_data['pinia'].keys():
                    if 'content' in key.lower():
                        content_data = nuxt_data['pinia'][key]
                        if isinstance(content_data, dict) and 'contentDetail' in content_data:
                            content_detail = content_data['contentDetail']
                            break

            if not content_detail:
                content_detail = nuxt_data.get('pinia', {}).get('content', {}).get('contentDetail', {})

            description = content_detail.get('description', '')
            if description and description != "등록된 줄거리가 없습니다.":
                return description
        except Exception:
            # Fall back to HTML parsing below if the Nuxt data is missing or malformed
            pass

        description_div = soup.find('div', class_='bnHDzeE')
        if description_div:
            synopsis = description_div.get_text(strip=True)
            if synopsis and synopsis != "등록된 줄거리가 없습니다.":
                return synopsis

        return "등록된 줄거리가 없습니다."

    def extract_characters_and_voice_actors(self, soup: BeautifulSoup, nuxt_data: Dict) -> List[Dict]:
        """캐릭터 및 성우 정보 추출"""
        characters = []
        character_cards = soup.find_all('div', class_='otjBFjd')

        for card in character_cards:
            character_div = card.find('div', class_='OuXf8uf')
            voice_actor_link = card.find('a')
            character_info = {}

            if character_div:
                name_elem = character_div.find('div', class_='iO6bs1d')
                role_elem = character_div.find('div', class_='_99DZmqJ')

                if name_elem:
                    character_info['character_name'] = name_elem.get_text(strip=True)
                if role_elem:
                    character_info['character_role'] = role_elem.get_text(strip=True)

                if character_div.get('data-original-title'):
                    if not character_info.get('character_name'):
                        character_info['character_name'] = character_div['data-original-title']

            if voice_actor_link:
                voice_actor_div = voice_actor_link.find('div', class_='_0fu6hck')
                if voice_actor_div:
                    voice_name_elem = voice_actor_div.find('div', class_='iO6bs1d')
                    if voice_name_elem:
                        character_info['voice_actor'] = voice_name_elem.get_text(strip=True)

                    if voice_actor_div.get('title'):
                        if not character_info.get('voice_actor'):
                            character_info['voice_actor'] = voice_actor_div['title']

            if character_info:
                characters.append(character_info)

        return characters

    def extract_production_info(self, soup: BeautifulSoup, nuxt_data: Dict) -> Dict[str, str]:
        """제작 정보 추출"""
        production_info = {}

        # 먼저 Nuxt 데이터에서 시도
        try:
            # pinia에서 content 관련 키 찾기
            content_detail = {}
            if 'pinia' in nuxt_data:
                for key in nuxt_data['pinia'].keys():
                    if 'content' in key.lower():
                        content_data = nuxt_data['pinia'][key]
                        if isinstance(content_data, dict) and 'contentDetail' in content_data:
                            content_detail = content_data['contentDetail']
                            break

            if not content_detail:
                content_detail = nuxt_data.get('pinia', {}).get('content', {}).get('contentDetail', {})

            # staff 또는 production 정보 찾기
            for staff_key in ['staff', 'staffs', 'production', 'studio']:
                if staff_key in content_detail:
                    staff_data = content_detail[staff_key]
                    if isinstance(staff_data, list):
                        for staff in staff_data:
                            if isinstance(staff, dict):
                                role = staff.get('role', '')
                                name = staff.get('name', '')
                                if role and name:
                                    if role not in production_info:
                                        production_info[role] = []
                                    production_info[role].append(name)
        except Exception:
            # Continue with the HTML-based extraction below if the Nuxt data is unusable
            pass

        # HTML에서 추가 정보 추출
        production_section = soup.find('div', class_='_1coMKET -HW4ChD')
        if production_section:
            production_links = production_section.find_all('a', class_='_2hRLd-G')

            for link in production_links:
                staff_div = link.find('div', class_='OuXf8uf')

                if staff_div:
                    name_elem = staff_div.find('div', class_='iO6bs1d')
                    role_elem = staff_div.find('div', class_='_99DZmqJ')

                    if name_elem and role_elem:
                        name = name_elem.get_text(strip=True)
                        role = role_elem.get_text(strip=True)

                        if role not in production_info:
                            production_info[role] = []

                        if isinstance(production_info[role], list):
                            if name not in production_info[role]:  # 중복 방지
                                production_info[role].append(name)
                        else:
                            production_info[role] = [production_info[role], name]

                    if staff_div.get('title'):
                        if not name_elem:
                            name = staff_div['title']
                            if role_elem:
                                role = role_elem.get_text(strip=True)
                                if role not in production_info:
                                    production_info[role] = name

        # 리스트를 문자열로 변환
        for role, names in production_info.items():
            if isinstance(names, list):
                production_info[role] = ', '.join(names)

        return production_info


class ParallelAnilifeScraper:
    def __init__(self, max_workers=10):
        self.max_workers = max_workers
        self.results = []
        self.errors = []
        self.lock = Lock()
        self.progress_lock = Lock()
        self.completed_count = 0
        self.total_count = 0

    def scrape_single(self, anime_id: int) -> Dict:
        """Scrape a single anime page by ID."""
        url = f"https://anilife.app/content/{anime_id}?tab=info"
        scraper = AnilifeScraper()

        try:
            result = scraper.scrape_anime_info(url)

            # Error dicts returned by scrape_anime_info contain only "error" and "url",
            # so attach the ID here to keep failed items identifiable in the error CSV.
            if "error" in result:
                result.setdefault("id", anime_id)

            with self.progress_lock:
                self.completed_count += 1
                if self.completed_count % 10 == 0:
                    logging.info(f"진행률: {self.completed_count}/{self.total_count} ({self.completed_count/self.total_count*100:.1f}%)")

            return result
        except Exception as e:
            logging.error(f"ID {anime_id} 크롤링 실패: {str(e)}")
            return {"error": str(e), "id": anime_id, "url": url}

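    def process_result(self, anime_data: Dict) -> Dict:
        """Convert a scrape result into a flat dict for CSV output."""
        if "error" in anime_data:
            # Keep the URL as well so a failed row can be retried later
            return {
                "id": anime_data.get("id", ""),
                "url": anime_data.get("url", ""),
                "error": anime_data["error"],
            }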

        flat_data = {
            "id": anime_data.get("id", ""),
            "url": anime_data.get("url", ""),
            "title_korean": anime_data.get("title", {}).get("korean", ""),
            "title_japanese": anime_data.get("title", {}).get("japanese", ""),
            "title_english": anime_data.get("title", {}).get("english", ""),
            "format": anime_data.get("basic_info", {}).get("format", ""),
            "status": anime_data.get("basic_info", {}).get("status", ""),
            "year": anime_data.get("basic_info", {}).get("year", ""),
            "quarter": anime_data.get("basic_info", {}).get("quarter", ""),
            "start_date": anime_data.get("basic_info", {}).get("start_date", ""),
            "end_date": anime_data.get("basic_info", {}).get("end_date", ""),
            "total_episodes": anime_data.get("basic_info", {}).get("total_episodes", ""),
            "duration": anime_data.get("basic_info", {}).get("duration", ""),
            "genres": "|".join(anime_data.get("genres", [])),
            "tags": "|".join(anime_data.get("tags", [])),
            "synopsis": anime_data.get("synopsis", ""),
            "num_characters": len(anime_data.get("characters_voice_actors", [])),
            "main_characters": "|".join([
                f"{c.get('character_name', '')}({c.get('voice_actor', '')})"
                for c in anime_data.get("characters_voice_actors", [])[:5]
            ]),
            "director": anime_data.get("production_info", {}).get("감독", ""),
            "studio": anime_data.get("production_info", {}).get("애니메이션 제작", ""),
            "original_work": anime_data.get("production_info", {}).get("원작자", "") or anime_data.get("production_info", {}).get("원작", ""),
            "error": ""
        }

        return flat_data

    def scrape_range(self, start_id: int, end_id: int, batch_size: int = 100):
        """지정된 범위의 애니메이션 병렬 크롤링"""
        self.total_count = end_id - start_id + 1
        self.completed_count = 0

        logging.info(f"크롤링 시작: ID {start_id}부터 {end_id}까지 (총 {self.total_count}개)")

        # 배치 단위로 처리
        for batch_start in range(start_id, end_id + 1, batch_size):
            batch_end = min(batch_start + batch_size - 1, end_id)
            batch_ids = list(range(batch_start, batch_end + 1))

            logging.info(f"배치 처리 중: ID {batch_start} ~ {batch_end}")

            with ThreadPoolExecutor(max_workers=self.max_workers) as executor:
                futures = {executor.submit(self.scrape_single, anime_id): anime_id
                          for anime_id in batch_ids}

                for future in as_completed(futures):
                    anime_id = futures[future]
                    try:
                        result = future.result(timeout=30)
                        processed_result = self.process_result(result)

                        with self.lock:
                            if processed_result.get("error"):
                                self.errors.append(processed_result)
                            else:
                                self.results.append(processed_result)

                    except Exception as e:
                        logging.error(f"ID {anime_id} 처리 실패: {str(e)}")
                        with self.lock:
                            self.errors.append({"id": anime_id, "error": str(e)})

            # 배치 간 대기 시간 (서버 부하 방지)
            time.sleep(2)

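            # Intermediate save once at least 500 new successful results have accumulated.
            # (Checking len(self.results) % 500 == 0 almost never fires, because the count
            # after a batch is rarely an exact multiple of 500.)
            last_saved = getattr(self, "_last_saved_count", 0)
            if len(self.results) - last_saved >= 500:
                self.save_intermediate_results()
                self._last_saved_count = len(self.results)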

    def save_intermediate_results(self):
        """중간 결과 저장"""
        timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
        filename = f"anilife_intermediate_{timestamp}.csv"

        with self.lock:
            if self.results:
                self.save_to_csv(filename, self.results)
                logging.info(f"중간 결과 저장: {filename} ({len(self.results)}개 항목)")

    def save_to_csv(self, filename: str, data: List[Dict]):
        """결과를 CSV 파일로 저장"""
        if not data:
            logging.warning("저장할 데이터가 없습니다.")
            return

        # Note: "image_url" is not filled in by process_result, so that column is written empty
        fieldnames = [
            "id", "url", "image_url", "title_korean", "title_japanese", "title_english",
            "format", "status", "year", "quarter", "start_date", "end_date",
            "total_episodes", "duration", "genres", "tags", "synopsis",
            "num_characters", "main_characters", "director", "studio",
            "original_work", "error"
        ]

        with open(filename, 'w', newline='', encoding='utf-8-sig') as csvfile:
            writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(data)

        logging.info(f"CSV 파일 저장 완료: {filename}")

    def save_all_results(self):
        """모든 결과 저장"""
        timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')

        # 성공 데이터 저장
        if self.results:
            success_filename = f"anilife_data_{timestamp}.csv"
            self.save_to_csv(success_filename, self.results)
            logging.info(f"성공 데이터: {len(self.results)}개 항목")

        # 에러 데이터 저장
        if self.errors:
            error_filename = f"anilife_errors_{timestamp}.csv"
            self.save_to_csv(error_filename, self.errors)
            logging.info(f"에러 데이터: {len(self.errors)}개 항목")

        # 통계 출력
        total = len(self.results) + len(self.errors)
        success_rate = (len(self.results) / total * 100) if total > 0 else 0

        logging.info("\n크롤링 완료 통계:")
        logging.info(f"- 전체: {total}개")
        logging.info(f"- 성공: {len(self.results)}개")
        logging.info(f"- 실패: {len(self.errors)}개")
        logging.info(f"- 성공률: {success_rate:.1f}%")


def main():
    """메인 실행 함수"""
    # Settings
    START_ID = 101
    END_ID = 7000
    MAX_WORKERS = 30  # number of concurrent threads (lower this to reduce load on the server)
    BATCH_SIZE = 100  # number of IDs processed per batch

    # 스크래퍼 초기화
    scraper = ParallelAnilifeScraper(max_workers=MAX_WORKERS)

    # 시작 시간 기록
    start_time = time.time()

    try:
        # 크롤링 실행
        scraper.scrape_range(START_ID, END_ID, batch_size=BATCH_SIZE)

        # 결과 저장
        scraper.save_all_results()

    except KeyboardInterrupt:
        logging.info("\n크롤링이 사용자에 의해 중단되었습니다.")
        scraper.save_all_results()

    except Exception as e:
        logging.error(f"크롤링 중 오류 발생: {str(e)}")
        scraper.save_all_results()

    finally:
        # 소요 시간 출력
        elapsed_time = time.time() - start_time
        hours = int(elapsed_time // 3600)
        minutes = int((elapsed_time % 3600) // 60)
        seconds = int(elapsed_time % 60)

        logging.info(f"\n총 소요 시간: {hours}시간 {minutes}분 {seconds}초")


if __name__ == "__main__":
    main()

2025-09-08 09:39:31,975 - INFO - 크롤링 시작: ID 101부터 7000까지 (총 6900개)
2025-09-08 09:39:31,976 - INFO - 배치 처리 중: ID 101 ~ 200
2025-09-08 09:39:34,329 - INFO - 진행률: 10/6900 (0.1%)
2025-09-08 09:39:35,088 - INFO - 진행률: 20/6900 (0.3%)
2025-09-08 09:39:35,278 - INFO - 진행률: 30/6900 (0.4%)
2025-09-08 09:39:35,608 - INFO - 진행률: 40/6900 (0.6%)
2025-09-08 09:39:36,104 - INFO - 진행률: 50/6900 (0.7%)
2025-09-08 09:39:36,286 - INFO - 진행률: 60/6900 (0.9%)
2025-09-08 09:39:36,477 - INFO - 진행률: 70/6900 (1.0%)
2025-09-08 09:39:36,751 - INFO - 진행률: 80/6900 (1.2%)
2025-09-08 09:39:36,954 - INFO - 진행률: 90/6900 (1.3%)
2025-09-08 09:39:37,042 - INFO - 진행률: 100/6900 (1.4%)
2025-09-08 09:39:39,049 - INFO - 배치 처리 중: ID 201 ~ 300
2025-09-08 09:39:40,429 - INFO - 진행률: 110/6900 (1.6%)
2025-09-08 09:39:40,768 - INFO - 진행률: 120/6900 (1.7%)
2025-09-08 09:39:40,914 - INFO - 진행률: 130/6900 (1.9%)
2025-09-08 09:39:41,472 - INFO - 진행률: 140/6900 (2.0%)
2025-09-08 09:39:41,727 - INFO - 진행률: 150/6900 (2.2%)
2025-09-08 09:39:41,87
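Each `extract_*` method in the cell above repeats the same search for `contentDetail` inside the `pinia` store. One way to remove that repetition is a single helper, sketched below. The name `find_content_detail` is hypothetical, the sample payload is made up for illustration, and the payload shape (`pinia` → a store whose key contains "content" → `contentDetail`) is taken from the scraper code above rather than from any confirmed site specification.

```python
from typing import Any, Dict


def find_content_detail(nuxt_data: Dict[str, Any]) -> Dict[str, Any]:
    """Return the first non-empty `contentDetail` dict found under `pinia`.

    Hypothetical helper; returns {} when nothing usable is found.
    """
    pinia = nuxt_data.get("pinia") if isinstance(nuxt_data, dict) else None
    if not isinstance(pinia, dict):
        return {}
    for key, store in pinia.items():
        # Only look at stores whose key mentions "content", as the scraper does
        if "content" in key.lower() and isinstance(store, dict):
            detail = store.get("contentDetail")
            if isinstance(detail, dict) and detail:
                return detail
    return {}


# Made-up payload for illustration only
sample = {"pinia": {"user": {}, "contentStore": {"contentDetail": {"name": {"kr": "예시"}}}}}
print(find_content_detail(sample))
print(find_content_detail({}))
```

With such a helper, each extractor could start from `content_detail = find_content_detail(nuxt_data)` and keep only its own field-specific logic.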

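The `fieldnames` list in `save_to_csv` includes columns that some rows do not carry (for example `image_url`, or most columns in an error row). The sketch below shows how `csv.DictWriter` treats such rows: missing keys are filled with `restval`, while keys that are not in `fieldnames` raise `ValueError` under the default `extrasaction='raise'`. The column list and rows here are shortened, made-up examples, and `extra_key_error` is a name used only in this sketch.

```python
import csv
import io

fieldnames = ["id", "image_url", "error"]
rows = [
    {"id": 101, "error": ""},                # no "image_url" key
    {"id": 102, "error": "HTTP 404 에러"},    # error row with most columns missing
]

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=fieldnames, restval="")
writer.writeheader()
writer.writerows(rows)  # missing keys are written as empty cells
print(buffer.getvalue())

# A key that is not listed in fieldnames is rejected by default
extra_key_error = ""
try:
    writer.writerow({"id": 103, "url": "https://example.invalid"})
except ValueError as exc:
    extra_key_error = str(exc)
print(extra_key_error)
```

This is why every dict appended to `self.results` or `self.errors` should only use keys that appear in `fieldnames`.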
In [4]:
import requests
from bs4 import BeautifulSoup
import re
import json

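def debug_image_extraction(url):
    """Debug helper: print every place on the page where an image URL can be found."""
    headers = {
        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) StepByStepCrawler/0.1",
        "Accept-Language": "ko-KR,ko;q=0.9,en-US;q=0.8"
    }

    # Force the info tab: drop any existing query string, then append ?tab=info.
    # split("?")[0] returns the whole URL when there is no "?", so one branch covers both cases.
    if "tab=info" not in url:
        url = url.split("?")[0] + "?tab=info"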

    print(f"크롤링 URL: {url}")

    resp = requests.get(url, headers=headers, timeout=20)
    print(f"응답 코드: {resp.status_code}")

    soup = BeautifulSoup(resp.text, "lxml")

    print("\n" + "="*80)
    print("1. 모든 img 태그 찾기")
    print("="*80)

    all_imgs = soup.find_all('img')
    print(f"총 {len(all_imgs)}개의 img 태그 발견")

    for i, img in enumerate(all_imgs[:10], 1):  # 처음 10개만
        src = img.get('src', '')
        alt = img.get('alt', '')
        srcset = img.get('srcset', '')

        print(f"\n이미지 {i}:")
        print(f"  alt: {alt[:50]}")
        print(f"  src: {src}")
        if srcset:
            print(f"  srcset: {srcset[:100]}...")

        # Flag likely cover images: their URLs contain the /cv/ path segment
        if '/cv/' in src or '/cv/' in srcset:
            print("  *** 커버 이미지 후보! ***")

    print("\n" + "="*80)
    print("2. Nuxt 데이터에서 이미지 찾기")
    print("="*80)

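    # Extract the Nuxt payload embedded in the page
    nuxt_data = None  # stays None when the payload is missing or fails to parse
    nuxt_pattern = r'window\.__NUXT__=\(function\([^)]*\)\{return (.+?)\}\)\([^)]+\)'
    match = re.search(nuxt_pattern, resp.text, re.DOTALL)

    if match:
        json_str = match.group(1)
        print("Nuxt 데이터 발견!")

        # Substitute the minified one-letter parameters with literal values.
        # Caveat: each regex also rewrites a matching one-letter word inside string values.
        replacements = {
            r'\ba\b': 'false',
            r'\bb\b': '1',
            r'\bc\b': 'true',
            r'\bd\b': 'null',
            r'\be\b': '"system"',
            r'\bf\b': '"https://anilife.app"',
            r'\bg\b': '"N/A"'
        }

        # Use a separate loop variable so the Nuxt regex above is not shadowed
        for var_pattern, value in replacements.items():
            json_str = re.sub(var_pattern, value, json_str)

        try:
            nuxt_data = json.loads(json_str)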

            # pinia 데이터 확인
            if 'pinia' in nuxt_data:
                print("Pinia 데이터 구조:")

                # content 관련 키 찾기
                for key in nuxt_data['pinia'].keys():
                    print(f"  - {key}")

                    if 'content' in key.lower():
                        content_data = nuxt_data['pinia'][key]

                        if isinstance(content_data, dict) and 'contentDetail' in content_data:
                            detail = content_data['contentDetail']

                            # images 필드 확인
                            if 'images' in detail:
                                print(f"\n  images 필드 발견:")
                                print(f"    {detail['images']}")

                            # 다른 이미지 필드 확인
                            for img_key in ['image', 'cover', 'poster', 'thumbnail']:
                                if img_key in detail:
                                    print(f"\n  {img_key} 필드 발견:")
                                    print(f"    {detail[img_key]}")

                            # name 필드 (제목)
                            if 'name' in detail:
                                print(f"\n  작품 제목:")
                                print(f"    {detail['name']}")

        except Exception as e:
            print(f"Nuxt 데이터 파싱 실패: {e}")
    else:
        print("Nuxt 데이터를 찾을 수 없음")

    print("\n" + "="*80)
    print("3. 특정 선택자로 이미지 찾기")
    print("="*80)

    selectors = [
        'img[data-nuxt-img]',
        'div.pLEPMwQ img',
        'div.aYso5bw img',
        'header img',
        'main img'
    ]

    for selector in selectors:
        imgs = soup.select(selector)
        if imgs:
            print(f"\n'{selector}' 선택자: {len(imgs)}개 발견")
            for img in imgs[:2]:  # 처음 2개만
                src = img.get('src', '')
                print(f"  - {src}")

    print("\n" + "="*80)
    print("4. 메타 태그 확인")
    print("="*80)

    meta_image = soup.find('meta', property='og:image')
    if meta_image:
        content = meta_image.get('content', '')
        print(f"og:image: {content}")

    return soup, nuxt_data if 'nuxt_data' in locals() else None


# 테스트
if __name__ == "__main__":
    test_urls = [
        "https://anilife.app/content/6840",  # 닥터 스톤
        "https://anilife.app/content/101",   # 원피스
        "https://anilife.app/content/110",   # 다른 예시
    ]

    for url in test_urls[:1]:  # 첫 번째 URL만 테스트
        print("\n" + "="*100)
        print(f"테스트 URL: {url}")
        print("="*100)

        soup, nuxt_data = debug_image_extraction(url)


테스트 URL: https://anilife.app/content/6840
크롤링 URL: https://anilife.app/content/6840?tab=info
응답 코드: 200

1. 모든 img 태그 찾기
총 12개의 img 태그 발견

이미지 1:
  alt: 애니라이프 로고
  src: /_ipx/f_webp&s_145x36/imgs/logo.png
  srcset: /_ipx/f_webp&s_145x36/imgs/logo.png 1x, /_ipx/f_webp&s_290x72/imgs/logo.png 2x...

이미지 2:
  alt: 닥터 스톤 SCIENCE FUTURE 파트 2
  src: https://cdn.anilife.live/assets/img/cv/a8ed6fd6956b13c980082b951df5a699.jpg
  srcset: https://cdn.anilife.live/assets/img/cv/a8ed6fd6956b13c980082b951df5a699.jpg 1x, https://cdn.anilife....
  *** 커버 이미지 후보! ***

이미지 3:
  alt: 닥터 스톤 SCIENCE FUTURE 파트 2
  src: https://cdn.anilife.live/assets/img/cv/a8ed6fd6956b13c980082b951df5a699.jpg
  srcset: https://cdn.anilife.live/assets/img/cv/a8ed6fd6956b13c980082b951df5a699.jpg 1x, https://cdn.anilife....
  *** 커버 이미지 후보! ***

이미지 4:
  alt: OUR Dr.STONE
  src: /imgs/loading-thumbnail.svg

이미지 5:
  alt: 한때 없애려 했던 존재는
  src: /imgs/loading-thumbnail.svg

이미지 6:
  alt: STONE SANCTUARY
  src: /imgs/loading-thumb
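
The debug cell above replaces the one-letter Nuxt parameters with `re.sub(r'\ba\b', ...)` over the whole payload, which also rewrites a standalone `a` that happens to sit inside a string value. One possible way around this, sketched below, is to match complete double-quoted string literals in the same regex and leave them untouched. The names `TOKEN` and `substitute_params` and the sample payload are hypothetical; the sketch assumes double-quoted strings only and does not handle single quotes. If the real payload is a JavaScript object literal with unquoted keys, `json.loads` would still reject it, so this only addresses the string-corruption issue.

```python
import json
import re

# Either a complete double-quoted string literal (with backslash escapes)
# or a lone lowercase one-letter identifier.
TOKEN = re.compile(r'"(?:\\.|[^"\\])*"|\b[a-z]\b')


def substitute_params(js_text: str, values: dict) -> str:
    """Replace one-letter identifiers outside string literals with their values."""
    def repl(match: re.Match) -> str:
        token = match.group(0)
        if token.startswith('"'):
            return token                  # string literal: keep as-is
        return values.get(token, token)   # identifier: substitute if known

    return TOKEN.sub(repl, js_text)


# Hypothetical payload fragment, used only for illustration
sample = '{"flag":a,"name":"a b c","count":b}'
values = {"a": "false", "b": "1"}

safe = substitute_params(sample, values)
print(safe)
print(json.loads(safe))

# For comparison, the plain re.sub approach also changes the string value
naive = re.sub(r'\bb\b', '1', re.sub(r'\ba\b', 'false', sample))
print(naive)
```

The string literal alternative comes first in the pattern, so a quoted value is consumed as a whole before the identifier branch can match anything inside it.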

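The `srcset` values printed in the debug output above list the same image at several pixel densities (`1x`, `2x`). A minimal sketch of picking the highest-density candidate follows. The helper names `parse_srcset` and `best_candidate` are hypothetical; the sketch assumes candidates are separated by a comma followed by whitespace and only understands density descriptors, so width descriptors such as `640w` are treated as `1x`.

```python
import re
from typing import List, Tuple


def parse_srcset(srcset: str) -> List[Tuple[str, float]]:
    """Parse 'url 1x, url 2x' into (url, density) pairs; bare URLs count as 1x."""
    candidates = []
    for entry in re.split(r",\s+", srcset.strip()):
        if not entry:
            continue
        parts = entry.split()
        url = parts[0]
        density = 1.0
        if len(parts) > 1 and parts[1].endswith("x"):
            try:
                density = float(parts[1][:-1])
            except ValueError:
                density = 1.0  # unparseable descriptor: fall back to 1x
        candidates.append((url, density))
    return candidates


def best_candidate(srcset: str) -> str:
    """Return the URL with the highest density, or '' for an empty srcset."""
    candidates = parse_srcset(srcset)
    if not candidates:
        return ""
    return max(candidates, key=lambda c: c[1])[0]


# The logo srcset shown in the output above
logo = "/_ipx/f_webp&s_145x36/imgs/logo.png 1x, /_ipx/f_webp&s_290x72/imgs/logo.png 2x"
print(best_candidate(logo))
```
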
In [5]:
import requests
from bs4 import BeautifulSoup
import re
import json
import csv
import time
from typing import Dict, List, Optional
from concurrent.futures import ThreadPoolExecutor, as_completed
from threading import Lock
import logging
from datetime import datetime
import os

# 로깅 설정
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s',
    handlers=[
        logging.FileHandler('anilife_scraping.log'),
        logging.StreamHandler()
    ]
)

class AnilifeScraper:
    def __init__(self):
        self.headers = {
            "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) StepByStepCrawler/0.1",
            "Accept-Language": "ko-KR,ko;q=0.9,en-US;q=0.8"
        }
        self.session = requests.Session()
        self.session.headers.update(self.headers)

    def scrape_anime_info(self, url: str) -> Dict:
        """애니메이션 정보를 크롤링하는 메인 함수"""
        try:
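            # Normalize the URL to the info tab: drop any existing query string, then append ?tab=info.
            # split("?")[0] returns the whole URL when there is no "?", so one branch covers both cases.
            if "tab=info" not in url:
                url = url.split("?")[0] + "?tab=info"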

            # 웹페이지 요청
            resp = self.session.get(url, timeout=20)

            if resp.status_code != 200:
                return {"error": f"HTTP {resp.status_code} 에러", "url": url}

            # BeautifulSoup으로 파싱
            soup = BeautifulSoup(resp.text, "lxml")

            # Nuxt.js 데이터 추출
            nuxt_data = self.extract_nuxt_data(resp.text)

            # 애니메이션 ID 추출
            anime_id = int(re.search(r'/content/(\d+)', url).group(1))

            # Debugging: inspect the Nuxt data structure for IDs up to 200.
            # These messages only appear if the logging level is DEBUG (basicConfig above sets INFO).
            if anime_id <= 200:
                logging.debug(f"ID {anime_id} - Nuxt data keys: {nuxt_data.keys() if nuxt_data else 'No data'}")
                if nuxt_data and 'pinia' in nuxt_data:
                    logging.debug(f"ID {anime_id} - Pinia keys: {list(nuxt_data['pinia'].keys())}")

            # 애니메이션 정보 추출
            anime_info = {
                "url": url,
                "id": anime_id,
                "title": self.extract_titles(soup, nuxt_data),
                "basic_info": self.extract_basic_info(soup, nuxt_data),
                "genres": self.extract_genres(soup, nuxt_data),
                "tags": self.extract_tags(soup, nuxt_data),
                "synopsis": self.extract_synopsis(soup, nuxt_data),
                "characters_voice_actors": self.extract_characters_and_voice_actors(soup, nuxt_data),
                "production_info": self.extract_production_info(soup, nuxt_data)
            }

            return anime_info

        except requests.RequestException as e:
            return {"error": f"요청 에러: {str(e)}", "url": url}
        except Exception as e:
            return {"error": f"파싱 에러: {str(e)}", "url": url}

    def extract_nuxt_data(self, html_content: str) -> Dict:
        """HTML에서 Nuxt.js __NUXT__ 데이터 추출"""
        try:
            pattern = r'window\.__NUXT__=\(function\([^)]*\)\{return (.+?)\}\)\([^)]+\)'
            match = re.search(pattern, html_content, re.DOTALL)

            if match:
                json_str = match.group(1)

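                # Substitute the minified one-letter parameters with literal values (extended set).
                # Caveat: each regex also rewrites a matching one-letter word inside string values.
                replacements = {
                    r'\ba\b': 'false',
                    r'\bb\b': '1',
                    r'\bc\b': 'true',
                    r'\bd\b': 'null',
                    r'\be\b': '"system"',
                    r'\bf\b': '"https://anilife.app"',
                    r'\bg\b': '"N/A"',
                    # Additional parameters
                    r'\bh\b': '2',
                    r'\bi\b': '3',
                    r'\bj\b': '4',
                    r'\bk\b': '"TV"',
                    r'\bl\b': '"OVA"',
                    r'\bm\b': '"Movie"',
                    r'\bn\b': '"Web"',
                }

                # Use a separate loop variable so the Nuxt regex above is not shadowed
                for var_pattern, value in replacements.items():
                    json_str = re.sub(var_pattern, value, json_str)

                # Parse the substituted text as JSON
                data = json.loads(json_str)
                return data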

            return {}

        except json.JSONDecodeError as e:
            logging.debug(f"JSON parsing error: {e}")
            return {}
        except Exception as e:
            logging.debug(f"Unexpected error in extract_nuxt_data: {e}")
            return {}

    def extract_titles(self, soup: BeautifulSoup, nuxt_data: Dict) -> Dict[str, str]:
        """제목들 추출 (한국어, 일본어, 영어)"""
        titles = {"korean": "", "japanese": "", "english": ""}

        try:
            # pinia에서 content 관련 키 찾기
            content_detail = {}
            if 'pinia' in nuxt_data:
                for key in nuxt_data['pinia'].keys():
                    if 'content' in key.lower():
                        content_data = nuxt_data['pinia'][key]
                        if isinstance(content_data, dict) and 'contentDetail' in content_data:
                            content_detail = content_data['contentDetail']
                            break

            if not content_detail:
                content_detail = nuxt_data.get('pinia', {}).get('content', {}).get('contentDetail', {})

            name_data = content_detail.get('name', {})
            if name_data.get('kr'):
                titles["korean"] = name_data['kr']
            if name_data.get('jp'):
                titles["japanese"] = name_data['jp']
            if name_data.get('en'):
                titles["english"] = name_data['en']
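        except Exception:
            # Nuxt payload missing or malformed: fall through to the HTML fallback.
            # (A bare except would also swallow KeyboardInterrupt.)
            pass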

        # HTML 파싱 대체
        if not titles["korean"]:
            korean_title_tag = soup.find('h1', class_='fpUXWby')
            if korean_title_tag:
                titles["korean"] = korean_title_tag.get_text(strip=True).replace(" 에피소드", "").replace("정보", "")

        if not titles["japanese"] or not titles["english"]:
            japanese_title_section = soup.find('h2', class_='visually-hidden')
            if japanese_title_section:
                span_tags = japanese_title_section.find_all('span')
                if len(span_tags) >= 2:
                    if not titles["japanese"]:
                        titles["japanese"] = span_tags[0].get_text(strip=True)
                    if not titles["english"]:
                        titles["english"] = span_tags[1].get_text(strip=True)

        return titles

    def extract_basic_info(self, soup: BeautifulSoup, nuxt_data: Dict) -> Dict[str, str]:
        """기본 정보 추출 (수정된 버전)"""
        basic_info = {}

        try:
            # pinia에서 content 관련 키 찾기 (정확한 키 이름이 'content'가 아닐 수 있음)
            content_detail = {}
            if 'pinia' in nuxt_data:
                # content 관련 키를 동적으로 찾기
                for key in nuxt_data['pinia'].keys():
                    if 'content' in key.lower():
                        content_data = nuxt_data['pinia'][key]
                        if isinstance(content_data, dict) and 'contentDetail' in content_data:
                            content_detail = content_data['contentDetail']
                            break

                # 만약 못 찾았으면 직접 접근 시도
                if not content_detail:
                    content_detail = nuxt_data.get('pinia', {}).get('content', {}).get('contentDetail', {})

            # 디버깅 로그
            if content_detail:
                logging.debug(f"Content detail keys: {list(content_detail.keys())[:20]}")

                if content_detail.get('format'):
                    basic_info["format"] = content_detail['format']

                if content_detail.get('status'):
                    basic_info["status"] = content_detail['status']

                # season 데이터 추출
                season_data = content_detail.get('season', {})
                if season_data:
                    logging.debug(f"Season data found: {season_data}")
                    if season_data.get('year'):
                        basic_info["year"] = str(season_data['year'])
                    if season_data.get('quarter'):
                        # quarter가 숫자로 오면 "N분기" 형태로 변환
                        quarter_val = season_data['quarter']
                        if isinstance(quarter_val, (int, str)) and str(quarter_val).isdigit():
                            basic_info["quarter"] = f"{quarter_val}분기"
                        else:
                            basic_info["quarter"] = str(quarter_val)

                # 날짜 정보
                if content_detail.get('startDate'):
                    basic_info["start_date"] = content_detail['startDate']

                if content_detail.get('endDate') and content_detail['endDate'] != "null":
                    basic_info["end_date"] = content_detail['endDate']

                # 에피소드 정보
                if content_detail.get('totalEpisode') and str(content_detail['totalEpisode']) != "N/A":
                    basic_info["total_episodes"] = str(content_detail['totalEpisode'])

                # 방영 시간
                if content_detail.get('duration') and str(content_detail['duration']) != "N/A":
                    basic_info["duration"] = str(content_detail['duration'])

        except Exception as e:
            logging.debug(f"Error extracting from Nuxt data: {e}")

        # Nuxt 데이터에서 못 찾은 경우 HTML 파싱으로 대체
        if not basic_info.get('year') or not basic_info.get('quarter'):
            # 여러 클래스명 시도
            for class_name in ['nBnfiIh', 'season-info', 'anime-info', 'broadcast-info']:
                quarter_info = soup.find('div', class_=class_name)
                if quarter_info:
                    full_format = quarter_info.get_text(strip=True)
                    logging.debug(f"Found season info in class '{class_name}': {full_format}")

                    # "2024년 1분기 · TV" 같은 형태 파싱
                    parts = full_format.split(' · ')

                    if len(parts) >= 1:
                        season_text = parts[0]

                        # 연도 추출 (여러 패턴 시도)
                        year_patterns = [
                            r'(20\d{2})년',  # "2024년"
                            r'(20\d{2})\s',  # "2024 "
                            r'(19\d{2}|20\d{2})',  # 일반적인 연도
                        ]

                        for pattern in year_patterns:
                            year_match = re.search(pattern, season_text)
                            if year_match and not basic_info.get('year'):
                                basic_info["year"] = year_match.group(1)
                                break

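                        # Extract the quarter (try several patterns)
                        quarter_patterns = [
                            r'(\d)분기',  # "1분기" (Nth quarter)
                            r'(\d)쿨',    # "1쿨"; note: 쿨 (cour) usually counts broadcast length, not a calendar quarter
                            r'Q(\d)',     # "Q1"
                            r'(봄|여름|가을|겨울)',  # season name (spring, summer, fall, winter)
                        ]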

                        for pattern in quarter_patterns:
                            quarter_match = re.search(pattern, season_text)
                            if quarter_match and not basic_info.get('quarter'):
                                if pattern == r'(봄|여름|가을|겨울)':
                                    # 계절을 분기로 변환
                                    season_to_quarter = {'봄': '2분기', '여름': '3분기', '가을': '4분기', '겨울': '1분기'}
                                    basic_info["quarter"] = season_to_quarter.get(quarter_match.group(1), quarter_match.group(1))
                                else:
                                    basic_info["quarter"] = f"{quarter_match.group(1)}분기"
                                break

                        # format 정보 추출
                        if len(parts) >= 2 and not basic_info.get('format'):
                            basic_info["format"] = parts[1].strip()

                    if basic_info.get('year') or basic_info.get('quarter'):
                        break  # 정보를 찾았으면 루프 종료

        # 최종 확인 로그
        logging.debug(f"Final basic_info: year={basic_info.get('year')}, quarter={basic_info.get('quarter')}")

        return basic_info

    def extract_genres(self, soup: BeautifulSoup, nuxt_data: Dict) -> List[str]:
        """장르 정보 추출"""
        genres = []

        try:
            # pinia에서 content 관련 키 찾기
            content_detail = {}
            if 'pinia' in nuxt_data:
                for key in nuxt_data['pinia'].keys():
                    if 'content' in key.lower():
                        content_data = nuxt_data['pinia'][key]
                        if isinstance(content_data, dict) and 'contentDetail' in content_data:
                            content_detail = content_data['contentDetail']
                            break

            if not content_detail:
                content_detail = nuxt_data.get('pinia', {}).get('content', {}).get('contentDetail', {})

            nuxt_genres = content_detail.get('genre', [])
            if nuxt_genres:
                genres = nuxt_genres
        except Exception:
            # Nuxt payload missing or malformed: fall through to the HTML fallback
            pass

        if not genres:
            genre_tags = soup.select('a[rel="genre"]')
            genres = [tag.get_text(strip=True) for tag in genre_tags]

        return genres

    def extract_tags(self, soup: BeautifulSoup, nuxt_data: Dict) -> List[str]:
        """태그 정보 추출"""
        tags = []

        try:
            # pinia에서 content 관련 키 찾기
            content_detail = {}
            if 'pinia' in nuxt_data:
                for key in nuxt_data['pinia'].keys():
                    if 'content' in key.lower():
                        content_data = nuxt_data['pinia'][key]
                        if isinstance(content_data, dict) and 'contentDetail' in content_data:
                            content_detail = content_data['contentDetail']
                            break

            if not content_detail:
                content_detail = nuxt_data.get('pinia', {}).get('content', {}).get('contentDetail', {})

            tag_data = content_detail.get('tag', [])
            for tag_item in tag_data:
                if isinstance(tag_item, dict) and tag_item.get('name'):
                    tag_name = tag_item['name']
                    if tag_item.get('spoiler'):
                        tag_name += " (스포일러)"
                    tags.append(tag_name)
                elif isinstance(tag_item, str):
                    tags.append(tag_item)

        except Exception:
            pass

        if not tags:
            tag_section = None
            for h2 in soup.find_all('h2', class_='wXeFmvm'):
                if '작품 태그' in h2.get_text():
                    tag_section = h2.find_parent('section')
                    break

            if tag_section:
                tag_container = tag_section.find('div', class_='-mMZ9fV')
                if tag_container:
                    tag_links = tag_container.find_all('a', class_='MbHceQh')
                    for link in tag_links:
                        span = link.find('span')
                        if span:
                            tag_text = span.get_text(strip=True).replace('#', '')
                            if 'iYz6NWc' in span.get('class', []):
                                tag_text += " (스포일러)"
                            tags.append(tag_text)

        return tags

    def extract_synopsis(self, soup: BeautifulSoup, nuxt_data: Dict) -> str:
        """줄거리 추출"""
        try:
            # pinia에서 content 관련 키 찾기
            content_detail = {}
            if 'pinia' in nuxt_data:
                for key in nuxt_data['pinia'].keys():
                    if 'content' in key.lower():
                        content_data = nuxt_data['pinia'][key]
                        if isinstance(content_data, dict) and 'contentDetail' in content_data:
                            content_detail = content_data['contentDetail']
                            break

            if not content_detail:
                content_detail = nuxt_data.get('pinia', {}).get('content', {}).get('contentDetail', {})

            description = content_detail.get('description', '')
            if description and description != "등록된 줄거리가 없습니다.":
                return description
        except Exception:
            # Nuxt payload missing or malformed: fall through to the HTML fallback
            pass

        description_div = soup.find('div', class_='bnHDzeE')
        if description_div:
            synopsis = description_div.get_text(strip=True)
            if synopsis and synopsis != "등록된 줄거리가 없습니다.":
                return synopsis

        return "등록된 줄거리가 없습니다."

    def extract_characters_and_voice_actors(self, soup: BeautifulSoup, nuxt_data: Dict) -> List[Dict]:
        """캐릭터 및 성우 정보 추출"""
        characters = []
        character_cards = soup.find_all('div', class_='otjBFjd')

        for card in character_cards:
            character_div = card.find('div', class_='OuXf8uf')
            voice_actor_link = card.find('a')
            character_info = {}

            if character_div:
                name_elem = character_div.find('div', class_='iO6bs1d')
                role_elem = character_div.find('div', class_='_99DZmqJ')

                if name_elem:
                    character_info['character_name'] = name_elem.get_text(strip=True)
                if role_elem:
                    character_info['character_role'] = role_elem.get_text(strip=True)

                if character_div.get('data-original-title'):
                    if not character_info.get('character_name'):
                        character_info['character_name'] = character_div['data-original-title']

            if voice_actor_link:
                voice_actor_div = voice_actor_link.find('div', class_='_0fu6hck')
                if voice_actor_div:
                    voice_name_elem = voice_actor_div.find('div', class_='iO6bs1d')
                    if voice_name_elem:
                        character_info['voice_actor'] = voice_name_elem.get_text(strip=True)

                    if voice_actor_div.get('title'):
                        if not character_info.get('voice_actor'):
                            character_info['voice_actor'] = voice_actor_div['title']

            if character_info:
                characters.append(character_info)

        return characters

    def extract_production_info(self, soup: BeautifulSoup, nuxt_data: Dict) -> Dict[str, str]:
        """제작 정보 추출"""
        production_info = {}

        # 먼저 Nuxt 데이터에서 시도
        try:
            # pinia에서 content 관련 키 찾기
            content_detail = {}
            if 'pinia' in nuxt_data:
                for key in nuxt_data['pinia'].keys():
                    if 'content' in key.lower():
                        content_data = nuxt_data['pinia'][key]
                        if isinstance(content_data, dict) and 'contentDetail' in content_data:
                            content_detail = content_data['contentDetail']
                            break

            if not content_detail:
                content_detail = nuxt_data.get('pinia', {}).get('content', {}).get('contentDetail', {})

            # staff 또는 production 정보 찾기
            for staff_key in ['staff', 'staffs', 'production', 'studio']:
                if staff_key in content_detail:
                    staff_data = content_detail[staff_key]
                    if isinstance(staff_data, list):
                        for staff in staff_data:
                            if isinstance(staff, dict):
                                role = staff.get('role', '')
                                name = staff.get('name', '')
                                if role and name:
                                    if role not in production_info:
                                        production_info[role] = []
                                    production_info[role].append(name)
        except Exception:
            # Nuxt payload missing or malformed: rely on the HTML extraction that follows
            pass

        # HTML에서 추가 정보 추출
        production_section = soup.find('div', class_='_1coMKET -HW4ChD')
        if production_section:
            production_links = production_section.find_all('a', class_='_2hRLd-G')

            for link in production_links:
                staff_div = link.find('div', class_='OuXf8uf')

                if staff_div:
                    name_elem = staff_div.find('div', class_='iO6bs1d')
                    role_elem = staff_div.find('div', class_='_99DZmqJ')

                    if name_elem and role_elem:
                        name = name_elem.get_text(strip=True)
                        role = role_elem.get_text(strip=True)

                        if role not in production_info:
                            production_info[role] = []

                        if isinstance(production_info[role], list):
                            if name not in production_info[role]:  # 중복 방지
                                production_info[role].append(name)
                        else:
                            production_info[role] = [production_info[role], name]

                    if staff_div.get('title'):
                        if not name_elem:
                            name = staff_div['title']
                            if role_elem:
                                role = role_elem.get_text(strip=True)
                                if role not in production_info:
                                    production_info[role] = name

        # 리스트를 문자열로 변환
        for role, names in production_info.items():
            if isinstance(names, list):
                production_info[role] = ', '.join(names)

        return production_info


class ParallelAnilifeScraper:
    def __init__(self, max_workers=10):
        self.max_workers = max_workers
        self.results = []
        self.errors = []
        self.lock = Lock()
        self.progress_lock = Lock()
        self.completed_count = 0
        self.total_count = 0

    def scrape_single(self, anime_id: int) -> Dict:
        """단일 애니메이션 크롤링"""
        url = f"https://anilife.app/content/{anime_id}?tab=info"
        scraper = AnilifeScraper()

        try:
            result = scraper.scrape_anime_info(url)

            with self.progress_lock:
                self.completed_count += 1
                if self.completed_count % 10 == 0:
                    logging.info(f"진행률: {self.completed_count}/{self.total_count} ({self.completed_count/self.total_count*100:.1f}%)")

            return result
        except Exception as e:
            logging.error(f"ID {anime_id} 크롤링 실패: {str(e)}")
            return {"error": str(e), "id": anime_id, "url": url}

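    def process_result(self, anime_data: Dict) -> Dict:
        """Flatten a scrape result into a single-level dict for CSV output."""
        if "error" in anime_data:
            # Keep the URL as well: error dicts from scrape_anime_info carry "url" but no "id"
            return {
                "id": anime_data.get("id", ""),
                "url": anime_data.get("url", ""),
                "error": anime_data["error"],
            }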

        flat_data = {
            "id": anime_data.get("id", ""),
            "url": anime_data.get("url", ""),
            "title_korean": anime_data.get("title", {}).get("korean", ""),
            "title_japanese": anime_data.get("title", {}).get("japanese", ""),
            "title_english": anime_data.get("title", {}).get("english", ""),
            "format": anime_data.get("basic_info", {}).get("format", ""),
            "status": anime_data.get("basic_info", {}).get("status", ""),
            "year": anime_data.get("basic_info", {}).get("year", ""),
            "quarter": anime_data.get("basic_info", {}).get("quarter", ""),
            "start_date": anime_data.get("basic_info", {}).get("start_date", ""),
            "end_date": anime_data.get("basic_info", {}).get("end_date", ""),
            "total_episodes": anime_data.get("basic_info", {}).get("total_episodes", ""),
            "duration": anime_data.get("basic_info", {}).get("duration", ""),
            "genres": "|".join(anime_data.get("genres", [])),
            "tags": "|".join(anime_data.get("tags", [])),
            "synopsis": anime_data.get("synopsis", ""),
            "num_characters": len(anime_data.get("characters_voice_actors", [])),
            "main_characters": "|".join([
                f"{c.get('character_name', '')}({c.get('voice_actor', '')})"
                for c in anime_data.get("characters_voice_actors", [])[:5]
            ]),
            "director": anime_data.get("production_info", {}).get("감독", ""),
            "studio": anime_data.get("production_info", {}).get("애니메이션 제작", ""),
            "original_work": anime_data.get("production_info", {}).get("원작자", "") or anime_data.get("production_info", {}).get("원작", ""),
            "error": ""
        }

        return flat_data

    def scrape_range(self, start_id: int, end_id: int, batch_size: int = 100):
        """지정된 범위의 애니메이션 병렬 크롤링"""
        self.total_count = end_id - start_id + 1
        self.completed_count = 0

        logging.info(f"크롤링 시작: ID {start_id}부터 {end_id}까지 (총 {self.total_count}개)")

        # 배치 단위로 처리
        for batch_start in range(start_id, end_id + 1, batch_size):
            batch_end = min(batch_start + batch_size - 1, end_id)
            batch_ids = list(range(batch_start, batch_end + 1))

            logging.info(f"배치 처리 중: ID {batch_start} ~ {batch_end}")

            with ThreadPoolExecutor(max_workers=self.max_workers) as executor:
                futures = {executor.submit(self.scrape_single, anime_id): anime_id
                          for anime_id in batch_ids}

                for future in as_completed(futures):
                    anime_id = futures[future]
                    try:
                        # as_completed only yields finished futures, so result() returns immediately
                        result = future.result()
                        processed_result = self.process_result(result)

                        with self.lock:
                            if processed_result.get("error"):
                                self.errors.append(processed_result)
                            else:
                                self.results.append(processed_result)

                    except Exception as e:
                        logging.error(f"ID {anime_id} 처리 실패: {str(e)}")
                        with self.lock:
                            self.errors.append({"id": anime_id, "error": str(e)})

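            # Pause between batches to limit load on the server
            time.sleep(2)

            # Save intermediate results whenever this batch crossed a multiple of 500 completed IDs.
            # (Testing len(self.results) % 500 == 0 only fires if the success count happens to be
            # an exact multiple of 500 at a batch boundary, because failed IDs are not in self.results.)
            done = self.completed_count
            if self.results and done // 500 > (done - len(batch_ids)) // 500:
                self.save_intermediate_results()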

    def save_intermediate_results(self):
        """중간 결과 저장"""
        timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
        filename = f"anilife_intermediate_{timestamp}.csv"

        with self.lock:
            if self.results:
                self.save_to_csv(filename, self.results)
                logging.info(f"중간 결과 저장: {filename} ({len(self.results)}개 항목)")

    def save_to_csv(self, filename: str, data: List[Dict]):
        """결과를 CSV 파일로 저장"""
        if not data:
            logging.warning("저장할 데이터가 없습니다.")
            return

        # Column order for the CSV. "image_url" is not filled in by process_result,
        # so csv.DictWriter writes it as an empty string (its default restval).
        fieldnames = [
            "id", "url", "image_url", "title_korean", "title_japanese", "title_english",
            "format", "status", "year", "quarter", "start_date", "end_date",
            "total_episodes", "duration", "genres", "tags", "synopsis",
            "num_characters", "main_characters", "director", "studio",
            "original_work", "error"
        ]

        with open(filename, 'w', newline='', encoding='utf-8-sig') as csvfile:
            writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(data)

        logging.info(f"CSV 파일 저장 완료: {filename}")

    def save_all_results(self):
        """Save all results."""
        timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')

        # Save the successfully scraped records
        if self.results:
            success_filename = f"anilife_data_{timestamp}.csv"
            self.save_to_csv(success_filename, self.results)
            logging.info(f"성공 데이터: {len(self.results)}개 항목")

        # Save the failed records
        if self.errors:
            error_filename = f"anilife_errors_{timestamp}.csv"
            self.save_to_csv(error_filename, self.errors)
            logging.info(f"에러 데이터: {len(self.errors)}개 항목")

        # Log summary statistics
        total = len(self.results) + len(self.errors)
        success_rate = (len(self.results) / total * 100) if total > 0 else 0

        logging.info(f"\n크롤링 완료 통계:")
        logging.info(f"- 전체: {total}개")
        logging.info(f"- 성공: {len(self.results)}개")
        logging.info(f"- 실패: {len(self.errors)}개")
        logging.info(f"- 성공률: {success_rate:.1f}%")


def main():
    """Main entry point."""
    # Settings
    START_ID = 101
    END_ID = 7000
    MAX_WORKERS = 30  # Number of concurrent threads (tune with server load in mind)
    BATCH_SIZE = 100  # Number of items processed per batch

    # Initialize the scraper
    scraper = ParallelAnilifeScraper(max_workers=MAX_WORKERS)

    # Record the start time
    start_time = time.time()

    try:
        # Run the crawl
        scraper.scrape_range(START_ID, END_ID, batch_size=BATCH_SIZE)

        # Save the results
        scraper.save_all_results()

    except KeyboardInterrupt:
        logging.info("\n크롤링이 사용자에 의해 중단되었습니다.")
        scraper.save_all_results()

    except Exception as e:
        logging.error(f"크롤링 중 오류 발생: {str(e)}")
        scraper.save_all_results()

    finally:
        # Log the elapsed time
        elapsed_time = time.time() - start_time
        hours = int(elapsed_time // 3600)
        minutes = int((elapsed_time % 3600) // 60)
        seconds = int(elapsed_time % 60)

        logging.info(f"\n총 소요 시간: {hours}시간 {minutes}분 {seconds}초")


if __name__ == "__main__":
    main()

2025-09-08 09:48:24,597 - INFO - 크롤링 시작: ID 101부터 7000까지 (총 6900개)
2025-09-08 09:48:24,598 - INFO - 배치 처리 중: ID 101 ~ 200
2025-09-08 09:48:27,399 - INFO - 진행률: 10/6900 (0.1%)
2025-09-08 09:48:27,716 - INFO - 진행률: 20/6900 (0.3%)
2025-09-08 09:48:27,904 - INFO - 진행률: 30/6900 (0.4%)
2025-09-08 09:48:28,524 - INFO - 진행률: 40/6900 (0.6%)
2025-09-08 09:48:28,772 - INFO - 진행률: 50/6900 (0.7%)
2025-09-08 09:48:28,983 - INFO - 진행률: 60/6900 (0.9%)
2025-09-08 09:48:29,148 - INFO - 진행률: 70/6900 (1.0%)
2025-09-08 09:48:29,375 - INFO - 진행률: 80/6900 (1.2%)
2025-09-08 09:48:29,605 - INFO - 진행률: 90/6900 (1.3%)
2025-09-08 09:48:29,731 - INFO - 진행률: 100/6900 (1.4%)
2025-09-08 09:48:31,769 - INFO - 배치 처리 중: ID 201 ~ 300
2025-09-08 09:48:32,562 - INFO - 진행률: 110/6900 (1.6%)
2025-09-08 09:48:33,202 - INFO - 진행률: 120/6900 (1.7%)
2025-09-08 09:48:33,396 - INFO - 진행률: 130/6900 (1.9%)
2025-09-08 09:48:33,526 - INFO - 진행률: 140/6900 (2.0%)
2025-09-08 09:48:33,706 - INFO - 진행률: 150/6900 (2.2%)
2025-09-08 09:48:34,09
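
The batch loop above submits one task per ID and collects the results as they finish. A minimal, offline sketch of that submit/`as_completed` pattern follows. `fake_scrape` and `run_batch` are hypothetical names introduced for illustration, and the worker simulates failures instead of sending HTTP requests.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Dict, List, Tuple


def fake_scrape(item_id: int) -> Dict:
    """Stand-in for a network call (assumption): odd IDs succeed, even IDs raise."""
    if item_id % 2 == 0:
        raise ValueError(f"no page for {item_id}")
    return {"id": item_id, "title": f"item-{item_id}"}


def run_batch(ids: List[int], max_workers: int = 4) -> Tuple[List[Dict], List[Dict]]:
    """Run fake_scrape for every ID and split the outcomes into results and errors."""
    results: List[Dict] = []
    errors: List[Dict] = []
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Map each future back to the ID it was submitted with
        futures = {executor.submit(fake_scrape, i): i for i in ids}
        for future in as_completed(futures):
            item_id = futures[future]
            try:
                # result() re-raises any exception raised inside the worker
                results.append(future.result())
            except Exception as exc:
                errors.append({"id": item_id, "error": str(exc)})
    return results, errors


ok, failed = run_batch([1, 2, 3, 4, 5])
print(sorted(r["id"] for r in ok))      # IDs that succeeded
print(sorted(e["id"] for e in failed))  # IDs that failed
```

Because both lists are appended to only in the thread that iterates over `as_completed`, this sketch needs no lock. A lock becomes necessary once worker threads write to shared state themselves, as the progress counter does.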

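The intermediate save in the batch loop is decided once per batch, after a variable number of items has succeeded. As a rough illustration of how the choice of rule matters, the sketch below counts how often two rules would fire: an exact-multiple test on the running total, and a threshold on the number of items added since the last save. `simulate_checkpoints` and the per-batch success counts are hypothetical and are not taken from the run logged above.

```python
from typing import List, Tuple


def simulate_checkpoints(batch_successes: List[int], every: int = 500) -> Tuple[int, int]:
    """Count how often each checkpoint rule fires over a sequence of batches.

    batch_successes holds the number of successful items in each batch.
    Returns (saves_with_modulo_rule, saves_with_threshold_rule).
    """
    total = 0
    last_saved = 0
    modulo_saves = 0
    threshold_saves = 0
    for successes in batch_successes:
        total += successes
        # Rule 1: save only when the running total is an exact multiple of `every`
        if total and total % every == 0:
            modulo_saves += 1
        # Rule 2: save once `every` new items have accumulated since the last save
        if total - last_saved >= every:
            threshold_saves += 1
            last_saved = total
    return modulo_saves, threshold_saves


# Hypothetical run: 20 batches in which 97 of 100 IDs succeed each time
print(simulate_checkpoints([97] * 20))
```

With these made-up numbers the running total never lands on a multiple of 500, so the first rule never saves, while the threshold rule saves three times.
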
In [6]:
import requests
from bs4 import BeautifulSoup
import logging

logging.basicConfig(level=logging.INFO)

def simple_image_extract(url):
    """Simplified image extraction."""

    headers = {
        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) StepByStepCrawler/0.1",
        "Accept-Language": "ko-KR,ko;q=0.9,en-US;q=0.8"
    }

    # Normalize the URL so that it points at the info tab
    if "tab=info" not in url:
        if "?" in url:
            url = url.split("?")[0] + "?tab=info"
        else:
            url = url + "?tab=info"

    print(f"크롤링 URL: {url}")

    # Send the request
    resp = requests.get(url, headers=headers, timeout=20)
    soup = BeautifulSoup(resp.text, "lxml")

    # Find the image (simplest approach)
    image_url = ""

    # Method 1: look through every img tag for the cv folder
    all_imgs = soup.find_all('img')
    print(f"\n총 {len(all_imgs)}개의 이미지 발견")

    for img in all_imgs:
        src = img.get('src', '')
        alt = img.get('alt', '')

        # The cv folder holds cover images
        if '/assets/img/cv/' in src:
            image_url = src
            if not image_url.startswith('http'):
                image_url = f"https://cdn.anilife.live{image_url}"
            print(f"✓ 커버 이미지 찾음: {image_url}")
            print(f"  (alt: {alt})")
            return image_url

    # Method 2: look inside a specific div class
    poster_div = soup.find('div', class_='pLEPMwQ')
    if poster_div:
        img = poster_div.find('img')
        if img:
            src = img.get('src', '')
            if src:
                image_url = src
                if not image_url.startswith('http'):
                    image_url = f"https://cdn.anilife.live{image_url}"
                print(f"✓ 포스터 div에서 이미지 찾음: {image_url}")
                return image_url

    print("✗ 이미지를 찾을 수 없음")
    return ""


# Test with a full scraper class
from typing import Dict, List
import csv
import time

class SimpleAnilifeScraper:
    def __init__(self):
        self.headers = {
            "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) StepByStepCrawler/0.1",
            "Accept-Language": "ko-KR,ko;q=0.9,en-US;q=0.8"
        }
        self.session = requests.Session()
        self.session.headers.update(self.headers)

    def scrape_anime_info(self, url: str) -> Dict:
        """Scrape basic anime information (simplified)."""
        try:
            # Normalize the URL so that it points at the info tab
            if "tab=info" not in url:
                if "?" in url:
                    url = url.split("?")[0] + "?tab=info"
                else:
                    url = url + "?tab=info"

            resp = self.session.get(url, timeout=20)
            if resp.status_code != 200:
                return {"error": f"HTTP {resp.status_code}", "url": url}

            soup = BeautifulSoup(resp.text, "lxml")

            # Extract the basic information
            anime_info = {
                "url": url,
                "title": "",
                "image_url": ""
            }

            # Extract the title
            title_tag = soup.find('h1', class_='fpUXWby')
            if title_tag:
                anime_info["title"] = title_tag.get_text(strip=True).replace(" 에피소드", "").replace("정보", "")

            # Extract the image (simplest approach)
            all_imgs = soup.find_all('img')
            for img in all_imgs:
                src = img.get('src', '')
                if '/assets/img/cv/' in src:  # cv = cover
                    anime_info["image_url"] = src
                    if not anime_info["image_url"].startswith('http'):
                        anime_info["image_url"] = f"https://cdn.anilife.live{anime_info['image_url']}"
                    break

            print(f"제목: {anime_info['title']}")
            print(f"이미지: {anime_info['image_url']}")

            return anime_info

        except Exception as e:
            return {"error": str(e), "url": url}


# Run the tests
if __name__ == "__main__":
    print("=" * 80)
    print("1. 단순 함수 테스트")
    print("=" * 80)

    test_urls = [
        "https://anilife.app/content/6840",  # 닥터 스톤 (Dr. Stone)
        "https://anilife.app/content/101",   # 원피스 (One Piece)
        "https://anilife.app/content/110",   # another example
    ]

    for url in test_urls[:1]:
        image_url = simple_image_extract(url)
        print()

    print("\n" + "=" * 80)
    print("2. 클래스 테스트")
    print("=" * 80)

    scraper = SimpleAnilifeScraper()
    for url in test_urls[:1]:
        result = scraper.scrape_anime_info(url)
        print(f"결과: {result}")

    print("\n" + "=" * 80)
    print("3. CSV 저장 테스트")
    print("=" * 80)

    # Scrape several pages, then save them to CSV
    results = []
    for anime_id in [6840, 101, 110]:
        url = f"https://anilife.app/content/{anime_id}"
        print(f"\n크롤링 중: {url}")
        result = scraper.scrape_anime_info(url)
        if "error" not in result:
            results.append({
                "id": anime_id,
                "url": result["url"],
                "title": result["title"],
                "image_url": result["image_url"]
            })
        time.sleep(1)  # Limit load on the server

    # Save to CSV
    if results:
        filename = "test_anime_images.csv"
        with open(filename, 'w', newline='', encoding='utf-8-sig') as csvfile:
            fieldnames = ["id", "url", "title", "image_url"]
            writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(results)
        print(f"\nCSV 파일 저장: {filename}")

        # Check what was saved
        print("\n저장된 데이터:")
        for row in results:
            print(f"  ID {row['id']}: {row['title'][:20]}... -> {row['image_url'][-30:] if row['image_url'] else 'NO IMAGE'}")

1. 단순 함수 테스트
크롤링 URL: https://anilife.app/content/6840?tab=info

총 12개의 이미지 발견
✓ 커버 이미지 찾음: https://cdn.anilife.live/assets/img/cv/a8ed6fd6956b13c980082b951df5a699.jpg
  (alt: 닥터 스톤 SCIENCE FUTURE 파트 2)


2. 클래스 테스트
제목: 닥터 스톤 SCIENCE FUTURE 파트 2
이미지: https://cdn.anilife.live/assets/img/cv/a8ed6fd6956b13c980082b951df5a699.jpg
결과: {'url': 'https://anilife.app/content/6840?tab=info', 'title': '닥터 스톤 SCIENCE FUTURE 파트 2', 'image_url': 'https://cdn.anilife.live/assets/img/cv/a8ed6fd6956b13c980082b951df5a699.jpg'}

3. CSV 저장 테스트

크롤링 중: https://anilife.app/content/6840
제목: 닥터 스톤 SCIENCE FUTURE 파트 2
이미지: https://cdn.anilife.live/assets/img/cv/a8ed6fd6956b13c980082b951df5a699.jpg

크롤링 중: https://anilife.app/content/101
제목: 원피스
이미지: https://cdn.anilife.live/assets/img/cv/95cc7d964d03e510cd705bc723a32f79.jpg

크롤링 중: https://anilife.app/content/110
제목: 꿈속의 뮤
이미지: https://cdn.anilife.live/assets/img/cv/820cd9a4176cc266d95ab6a1f43f5e47.jpg

CSV 파일 저장: test_anime_images.csv

저장된 데이터:
  ID 6840: 닥터 스
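
Both the function and the class above normalize the page URL by splitting the string on `?`, and build an absolute image URL by prefixing the CDN host. The same two steps can be sketched with `urllib.parse`. `to_info_tab` and `absolute_image_url` are hypothetical helper names, `tab=episode` and `example.jpg` are made-up inputs, and the CDN host is the one used in the code above.

```python
from urllib.parse import urljoin, urlsplit, urlunsplit

CDN_BASE = "https://cdn.anilife.live"  # host taken from the code above


def to_info_tab(url: str) -> str:
    """Return url with its query string replaced by tab=info."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "tab=info", ""))


def absolute_image_url(src: str, base: str = CDN_BASE) -> str:
    """Resolve src against base; URLs that are already absolute pass through."""
    return urljoin(base, src) if src else ""


print(to_info_tab("https://anilife.app/content/6840"))
print(to_info_tab("https://anilife.app/content/6840?tab=episode"))
print(absolute_image_url("/assets/img/cv/example.jpg"))
print(absolute_image_url("//cdn.anilife.live/assets/img/cv/example.jpg"))
```

Unlike a plain string prefix, `urljoin` also handles protocol-relative sources that start with `//`. One difference from the original: `to_info_tab` always replaces the whole query string, whereas the original leaves a URL untouched when it already contains `tab=info`.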

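The extractor methods of `AnilifeScraper` in the cell below each repeat the same lookup: scan the `pinia` store for a key containing "content" and read its `contentDetail` entry. One possible way to factor that out is sketched here. `find_content_detail` is a hypothetical helper that is not part of the notebook, and the `contentStore` key in the sample data is invented for the demonstration.

```python
from typing import Any, Dict


def find_content_detail(nuxt_data: Dict[str, Any]) -> Dict[str, Any]:
    """Return the first non-empty contentDetail dict in the pinia store, else {}."""
    pinia = nuxt_data.get("pinia", {}) if isinstance(nuxt_data, dict) else {}
    if not isinstance(pinia, dict):
        return {}
    for key, store in pinia.items():
        # Accept any store whose key mentions "content" (including 'content' itself)
        if "content" in str(key).lower() and isinstance(store, dict):
            detail = store.get("contentDetail")
            if isinstance(detail, dict) and detail:
                return detail
    return {}


# Made-up sample shaped like the structure the extractors expect
sample = {
    "pinia": {
        "user": {},
        "contentStore": {"contentDetail": {"name": {"kr": "원피스"}}},
    }
}
print(find_content_detail(sample).get("name", {}).get("kr"))
print(find_content_detail({}))
```

Each extractor could then start from `find_content_detail(nuxt_data)` and keep its HTML fallback unchanged.
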
In [7]:
import requests
from bs4 import BeautifulSoup
import re
import json
import csv
import time
from typing import Dict, List, Optional
from concurrent.futures import ThreadPoolExecutor, as_completed
from threading import Lock
import logging
from datetime import datetime
import os

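# Logging setup
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s',
    handlers=[
        # Explicit UTF-8 so that Korean log messages are written correctly on any platform
        logging.FileHandler('anilife_scraping.log', encoding='utf-8'),
        logging.StreamHandler()
    ],
    force=True  # Replace handlers left by an earlier basicConfig call (Python 3.8+)
)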

class AnilifeScraper:
    def __init__(self):
        self.headers = {
            "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) StepByStepCrawler/0.1",
            "Accept-Language": "ko-KR,ko;q=0.9,en-US;q=0.8"
        }
        self.session = requests.Session()
        self.session.headers.update(self.headers)

    def scrape_anime_info(self, url: str) -> Dict:
        """Main function that scrapes the information for one anime."""
        try:
            # Normalize the URL so that it points at the info tab
            if "tab=info" not in url:
                if "?" in url:
                    url = url.split("?")[0] + "?tab=info"
                else:
                    url = url + "?tab=info"

            # Request the web page
            resp = self.session.get(url, timeout=20)

            if resp.status_code != 200:
                return {"error": f"HTTP {resp.status_code} 에러", "url": url}

            # Parse with BeautifulSoup
            soup = BeautifulSoup(resp.text, "lxml")

            # Extract the Nuxt.js data
            nuxt_data = self.extract_nuxt_data(resp.text)

            # Extract the anime ID
            anime_id = int(re.search(r'/content/(\d+)', url).group(1))

            # Debugging: inspect the Nuxt data structure (IDs up to 200 only)
            if anime_id <= 200:
                logging.debug(f"ID {anime_id} - Nuxt data keys: {nuxt_data.keys() if nuxt_data else 'No data'}")
                if nuxt_data and 'pinia' in nuxt_data:
                    logging.debug(f"ID {anime_id} - Pinia keys: {list(nuxt_data['pinia'].keys())}")

            # Extract the anime information
            anime_info = {
                "url": url,
                "id": anime_id,
                "title": self.extract_titles(soup, nuxt_data),
                "basic_info": self.extract_basic_info(soup, nuxt_data),
                "genres": self.extract_genres(soup, nuxt_data),
                "tags": self.extract_tags(soup, nuxt_data),
                "synopsis": self.extract_synopsis(soup, nuxt_data),
                "characters_voice_actors": self.extract_characters_and_voice_actors(soup, nuxt_data),
                "production_info": self.extract_production_info(soup, nuxt_data)
            }

            return anime_info

        except requests.RequestException as e:
            return {"error": f"요청 에러: {str(e)}", "url": url}
        except Exception as e:
            return {"error": f"파싱 에러: {str(e)}", "url": url}

    def extract_nuxt_data(self, html_content: str) -> Dict:
        """Extract the Nuxt.js __NUXT__ data from the HTML."""
        try:
            pattern = r'window\.__NUXT__=\(function\([^)]*\)\{return (.+?)\}\)\([^)]+\)'
            match = re.search(pattern, html_content, re.DOTALL)

            if match:
                json_str = match.group(1)

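                # Substitution patterns for the short variable names (extended version).
                # This mapping is a heuristic: the actual values are passed as arguments
                # to the wrapping function and may differ between pages, and the
                # substitution also rewrites matching single letters inside string
                # values. When the result is not valid JSON, the except branches below
                # return {}.
                replacements = {
                    r'\ba\b': 'false',
                    r'\bb\b': '1',
                    r'\bc\b': 'true',
                    r'\bd\b': 'null',
                    r'\be\b': '"system"',
                    r'\bf\b': '"https://anilife.app"',
                    r'\bg\b': '"N/A"',
                    # Additional patterns
                    r'\bh\b': '2',
                    r'\bi\b': '3',
                    r'\bj\b': '4',
                    r'\bk\b': '"TV"',
                    r'\bl\b': '"OVA"',
                    r'\bm\b': '"Movie"',
                    r'\bn\b': '"Web"',
                }

                # A distinct loop name avoids shadowing the outer `pattern`
                for var_pattern, value in replacements.items():
                    json_str = re.sub(var_pattern, value, json_str)

                # Try to parse the result as JSON
                data = json.loads(json_str)
                return data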

            return {}

        except json.JSONDecodeError as e:
            logging.debug(f"JSON parsing error: {e}")
            return {}
        except Exception as e:
            logging.debug(f"Unexpected error in extract_nuxt_data: {e}")
            return {}

    def extract_titles(self, soup: BeautifulSoup, nuxt_data: Dict) -> Dict[str, str]:
        """Extract the titles (Korean, Japanese, English)."""
        titles = {"korean": "", "japanese": "", "english": ""}

        try:
            # Look for a content-related key in the pinia store
            content_detail = {}
            if 'pinia' in nuxt_data:
                for key in nuxt_data['pinia'].keys():
                    if 'content' in key.lower():
                        content_data = nuxt_data['pinia'][key]
                        if isinstance(content_data, dict) and 'contentDetail' in content_data:
                            content_detail = content_data['contentDetail']
                            break

            if not content_detail:
                content_detail = nuxt_data.get('pinia', {}).get('content', {}).get('contentDetail', {})

            name_data = content_detail.get('name', {})
            if name_data.get('kr'):
                titles["korean"] = name_data['kr']
            if name_data.get('jp'):
                titles["japanese"] = name_data['jp']
            if name_data.get('en'):
                titles["english"] = name_data['en']
        except Exception:
            pass

        # Fall back to HTML parsing
        if not titles["korean"]:
            korean_title_tag = soup.find('h1', class_='fpUXWby')
            if korean_title_tag:
                titles["korean"] = korean_title_tag.get_text(strip=True).replace(" 에피소드", "").replace("정보", "")

        if not titles["japanese"] or not titles["english"]:
            japanese_title_section = soup.find('h2', class_='visually-hidden')
            if japanese_title_section:
                span_tags = japanese_title_section.find_all('span')
                if len(span_tags) >= 2:
                    if not titles["japanese"]:
                        titles["japanese"] = span_tags[0].get_text(strip=True)
                    if not titles["english"]:
                        titles["english"] = span_tags[1].get_text(strip=True)

        return titles

    def extract_basic_info(self, soup: BeautifulSoup, nuxt_data: Dict) -> Dict[str, str]:
        """Extract the basic information (revised version)."""
        basic_info = {}

        try:
            # Look for a content-related key in the pinia store (the exact key may not be 'content')
            content_detail = {}
            if 'pinia' in nuxt_data:
                # Find the content-related key dynamically
                for key in nuxt_data['pinia'].keys():
                    if 'content' in key.lower():
                        content_data = nuxt_data['pinia'][key]
                        if isinstance(content_data, dict) and 'contentDetail' in content_data:
                            content_detail = content_data['contentDetail']
                            break

                # If nothing was found, try direct access
                if not content_detail:
                    content_detail = nuxt_data.get('pinia', {}).get('content', {}).get('contentDetail', {})

            # Debug log
            if content_detail:
                logging.debug(f"Content detail keys: {list(content_detail.keys())[:20]}")

                if content_detail.get('format'):
                    basic_info["format"] = content_detail['format']

                if content_detail.get('status'):
                    basic_info["status"] = content_detail['status']

                # Extract the season data
                season_data = content_detail.get('season', {})
                if season_data:
                    logging.debug(f"Season data found: {season_data}")
                    if season_data.get('year'):
                        basic_info["year"] = str(season_data['year'])
                    if season_data.get('quarter'):
                        # A numeric quarter is converted to the "N분기" (Nth quarter) form
                        quarter_val = season_data['quarter']
                        if isinstance(quarter_val, (int, str)) and str(quarter_val).isdigit():
                            basic_info["quarter"] = f"{quarter_val}분기"
                        else:
                            basic_info["quarter"] = str(quarter_val)

                # Date information
                if content_detail.get('startDate'):
                    basic_info["start_date"] = content_detail['startDate']

                if content_detail.get('endDate') and content_detail['endDate'] != "null":
                    basic_info["end_date"] = content_detail['endDate']

                # Episode information
                if content_detail.get('totalEpisode') and str(content_detail['totalEpisode']) != "N/A":
                    basic_info["total_episodes"] = str(content_detail['totalEpisode'])

                # Running time
                if content_detail.get('duration') and str(content_detail['duration']) != "N/A":
                    basic_info["duration"] = str(content_detail['duration'])

        except Exception as e:
            logging.debug(f"Error extracting from Nuxt data: {e}")

        # Fall back to HTML parsing when the Nuxt data did not provide the values
        if not basic_info.get('year') or not basic_info.get('quarter'):
            # Try several class names
            for class_name in ['nBnfiIh', 'season-info', 'anime-info', 'broadcast-info']:
                quarter_info = soup.find('div', class_=class_name)
                if quarter_info:
                    full_format = quarter_info.get_text(strip=True)
                    logging.debug(f"Found season info in class '{class_name}': {full_format}")

                    # Parse text of the form "2024년 1분기 · TV" (year, quarter, format)
                    parts = full_format.split(' · ')

                    if len(parts) >= 1:
                        season_text = parts[0]

                        # Extract the year (try several patterns)
                        year_patterns = [
                            r'(20\d{2})년',  # "2024년"
                            r'(20\d{2})\s',  # "2024 "
                            r'(19\d{2}|20\d{2})',  # any four-digit year from 1900 to 2099
                        ]

                        for pattern in year_patterns:
                            year_match = re.search(pattern, season_text)
                            if year_match and not basic_info.get('year'):
                                basic_info["year"] = year_match.group(1)
                                break

                        # Extract the quarter (try several patterns)
                        quarter_patterns = [
                            r'(\d)분기',  # "1분기" (1st quarter)
                            r'(\d)쿨',    # "1쿨" (1st cour)
                            r'Q(\d)',     # "Q1"
                            r'(봄|여름|가을|겨울)',  # season name (spring, summer, fall, winter)
                        ]

                        for pattern in quarter_patterns:
                            quarter_match = re.search(pattern, season_text)
                            if quarter_match and not basic_info.get('quarter'):
                                if pattern == r'(봄|여름|가을|겨울)':
                                    # Map the season name to a quarter
                                    season_to_quarter = {'봄': '2분기', '여름': '3분기', '가을': '4분기', '겨울': '1분기'}
                                    basic_info["quarter"] = season_to_quarter.get(quarter_match.group(1), quarter_match.group(1))
                                else:
                                    basic_info["quarter"] = f"{quarter_match.group(1)}분기"
                                break

                        # Extract the format
                        if len(parts) >= 2 and not basic_info.get('format'):
                            basic_info["format"] = parts[1].strip()

                    if basic_info.get('year') or basic_info.get('quarter'):
                        break  # Stop once the information has been found

        # Final check log
        logging.debug(f"Final basic_info: year={basic_info.get('year')}, quarter={basic_info.get('quarter')}")

        return basic_info

    def extract_genres(self, soup: BeautifulSoup, nuxt_data: Dict) -> List[str]:
        """Extract the genre information."""
        genres = []

        try:
            # Look for a content-related key in the pinia store
            content_detail = {}
            if 'pinia' in nuxt_data:
                for key in nuxt_data['pinia'].keys():
                    if 'content' in key.lower():
                        content_data = nuxt_data['pinia'][key]
                        if isinstance(content_data, dict) and 'contentDetail' in content_data:
                            content_detail = content_data['contentDetail']
                            break

            if not content_detail:
                content_detail = nuxt_data.get('pinia', {}).get('content', {}).get('contentDetail', {})

            nuxt_genres = content_detail.get('genre', [])
            if nuxt_genres:
                genres = nuxt_genres
        except Exception:
            pass

        if not genres:
            genre_tags = soup.select('a[rel="genre"]')
            genres = [tag.get_text(strip=True) for tag in genre_tags]

        return genres

    def extract_tags(self, soup: BeautifulSoup, nuxt_data: Dict) -> List[str]:
        """Extract the tag information."""
        tags = []

        try:
            # Look for a content-related key in the pinia store
            content_detail = {}
            if 'pinia' in nuxt_data:
                for key in nuxt_data['pinia'].keys():
                    if 'content' in key.lower():
                        content_data = nuxt_data['pinia'][key]
                        if isinstance(content_data, dict) and 'contentDetail' in content_data:
                            content_detail = content_data['contentDetail']
                            break

            if not content_detail:
                content_detail = nuxt_data.get('pinia', {}).get('content', {}).get('contentDetail', {})

            tag_data = content_detail.get('tag', [])
            for tag_item in tag_data:
                if isinstance(tag_item, dict) and tag_item.get('name'):
                    tag_name = tag_item['name']
                    if tag_item.get('spoiler'):
                        tag_name += " (스포일러)"
                    tags.append(tag_name)
                elif isinstance(tag_item, str):
                    tags.append(tag_item)

        except Exception:
            pass

        if not tags:
            tag_section = None
            for h2 in soup.find_all('h2', class_='wXeFmvm'):
                if '작품 태그' in h2.get_text():
                    tag_section = h2.find_parent('section')
                    break

            if tag_section:
                tag_container = tag_section.find('div', class_='-mMZ9fV')
                if tag_container:
                    tag_links = tag_container.find_all('a', class_='MbHceQh')
                    for link in tag_links:
                        span = link.find('span')
                        if span:
                            tag_text = span.get_text(strip=True).replace('#', '')
                            if 'iYz6NWc' in span.get('class', []):
                                tag_text += " (스포일러)"
                            tags.append(tag_text)

        return tags

    def extract_synopsis(self, soup: BeautifulSoup, nuxt_data: Dict) -> str:
        """Extract the synopsis."""
        try:
            # Look for a content-related key in the pinia store
            content_detail = {}
            if 'pinia' in nuxt_data:
                for key in nuxt_data['pinia'].keys():
                    if 'content' in key.lower():
                        content_data = nuxt_data['pinia'][key]
                        if isinstance(content_data, dict) and 'contentDetail' in content_data:
                            content_detail = content_data['contentDetail']
                            break

            if not content_detail:
                content_detail = nuxt_data.get('pinia', {}).get('content', {}).get('contentDetail', {})

            description = content_detail.get('description', '')
            if description and description != "등록된 줄거리가 없습니다.":
                return description
        except Exception:
            pass

        description_div = soup.find('div', class_='bnHDzeE')
        if description_div:
            synopsis = description_div.get_text(strip=True)
            if synopsis and synopsis != "등록된 줄거리가 없습니다.":
                return synopsis

        return "등록된 줄거리가 없습니다."

    def extract_characters_and_voice_actors(self, soup: BeautifulSoup, nuxt_data: Dict) -> List[Dict]:
        """Extract character and voice actor information."""
        characters = []
        character_cards = soup.find_all('div', class_='otjBFjd')

        for card in character_cards:
            character_div = card.find('div', class_='OuXf8uf')
            voice_actor_link = card.find('a')
            character_info = {}

            if character_div:
                name_elem = character_div.find('div', class_='iO6bs1d')
                role_elem = character_div.find('div', class_='_99DZmqJ')

                if name_elem:
                    character_info['character_name'] = name_elem.get_text(strip=True)
                if role_elem:
                    character_info['character_role'] = role_elem.get_text(strip=True)

                if character_div.get('data-original-title'):
                    if not character_info.get('character_name'):
                        character_info['character_name'] = character_div['data-original-title']

            if voice_actor_link:
                voice_actor_div = voice_actor_link.find('div', class_='_0fu6hck')
                if voice_actor_div:
                    voice_name_elem = voice_actor_div.find('div', class_='iO6bs1d')
                    if voice_name_elem:
                        character_info['voice_actor'] = voice_name_elem.get_text(strip=True)

                    if voice_actor_div.get('title'):
                        if not character_info.get('voice_actor'):
                            character_info['voice_actor'] = voice_actor_div['title']

            if character_info:
                characters.append(character_info)

        return characters

    def extract_production_info(self, soup: BeautifulSoup, nuxt_data: Dict) -> Dict[str, str]:
        """Extract the production information."""
        production_info = {}

        # Try the Nuxt data first
        try:
            # Look for a content-related key in the pinia store
            content_detail = {}
            if 'pinia' in nuxt_data:
                for key in nuxt_data['pinia'].keys():
                    if 'content' in key.lower():
                        content_data = nuxt_data['pinia'][key]
                        if isinstance(content_data, dict) and 'contentDetail' in content_data:
                            content_detail = content_data['contentDetail']
                            break

            if not content_detail:
                content_detail = nuxt_data.get('pinia', {}).get('content', {}).get('contentDetail', {})

            # Look for staff or production information
            for staff_key in ['staff', 'staffs', 'production', 'studio']:
                if staff_key in content_detail:
                    staff_data = content_detail[staff_key]
                    if isinstance(staff_data, list):
                        for staff in staff_data:
                            if isinstance(staff, dict):
                                role = staff.get('role', '')
                                name = staff.get('name', '')
                                if role and name:
                                    if role not in production_info:
                                        production_info[role] = []
                                    production_info[role].append(name)
        except Exception:
            pass

        # Extract additional information from the HTML
        production_section = soup.find('div', class_='_1coMKET -HW4ChD')
        if production_section:
            production_links = production_section.find_all('a', class_='_2hRLd-G')

            for link in production_links:
                staff_div = link.find('div', class_='OuXf8uf')

                if staff_div:
                    name_elem = staff_div.find('div', class_='iO6bs1d')
                    role_elem = staff_div.find('div', class_='_99DZmqJ')

                    if name_elem and role_elem:
                        name = name_elem.get_text(strip=True)
                        role = role_elem.get_text(strip=True)

                        if role not in production_info:
                            production_info[role] = []

                        if isinstance(production_info[role], list):
                            if name not in production_info[role]:  # avoid duplicates
                                production_info[role].append(name)
                        else:
                            production_info[role] = [production_info[role], name]

                    if staff_div.get('title'):
                        if not name_elem:
                            name = staff_div['title']
                            if role_elem:
                                role = role_elem.get_text(strip=True)
                                if role not in production_info:
                                    production_info[role] = name

        # Convert lists to strings
        for role, names in production_info.items():
            if isinstance(names, list):
                production_info[role] = ', '.join(names)

        return production_info


class ParallelAnilifeScraper:
    def __init__(self, max_workers=10):
        self.max_workers = max_workers
        self.results = []
        self.errors = []
        self.lock = Lock()
        self.progress_lock = Lock()
        self.completed_count = 0
        self.total_count = 0

    def scrape_single(self, anime_id: int) -> Dict:
        """Scrape a single anime."""
        url = f"https://anilife.app/content/{anime_id}?tab=info"
        scraper = AnilifeScraper()

        try:
            result = scraper.scrape_anime_info(url)

            with self.progress_lock:
                self.completed_count += 1
                if self.completed_count % 10 == 0:
                    logging.info(f"진행률: {self.completed_count}/{self.total_count} ({self.completed_count/self.total_count*100:.1f}%)")

            return result
        except Exception as e:
            logging.error(f"ID {anime_id} 크롤링 실패: {str(e)}")
            return {"error": str(e), "id": anime_id, "url": url}

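    def process_result(self, anime_data: Dict) -> Dict:
        """Flatten a crawl result into a flat dict for CSV output."""
        if "error" in anime_data:
            return {
                "id": anime_data.get("id", ""),
                "url": anime_data.get("url", ""),
                "error": anime_data["error"],
            }

        flat_data = {
            "id": anime_data.get("id", ""),
            "url": anime_data.get("url", ""),
            # save_to_csv lists an "image_url" column; stays empty if the scraper does not collect it
            "image_url": anime_data.get("image_url", ""),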
            "title_korean": anime_data.get("title", {}).get("korean", ""),
            "title_japanese": anime_data.get("title", {}).get("japanese", ""),
            "title_english": anime_data.get("title", {}).get("english", ""),
            "format": anime_data.get("basic_info", {}).get("format", ""),
            "status": anime_data.get("basic_info", {}).get("status", ""),
            "year": anime_data.get("basic_info", {}).get("year", ""),
            "quarter": anime_data.get("basic_info", {}).get("quarter", ""),
            "start_date": anime_data.get("basic_info", {}).get("start_date", ""),
            "end_date": anime_data.get("basic_info", {}).get("end_date", ""),
            "total_episodes": anime_data.get("basic_info", {}).get("total_episodes", ""),
            "duration": anime_data.get("basic_info", {}).get("duration", ""),
            "genres": "|".join(anime_data.get("genres", [])),
            "tags": "|".join(anime_data.get("tags", [])),
            "synopsis": anime_data.get("synopsis", ""),
            "num_characters": len(anime_data.get("characters_voice_actors", [])),
            "main_characters": "|".join([
                f"{c.get('character_name', '')}({c.get('voice_actor', '')})"
                for c in anime_data.get("characters_voice_actors", [])[:5]
            ]),
            "director": anime_data.get("production_info", {}).get("감독", ""),
            "studio": anime_data.get("production_info", {}).get("애니메이션 제작", ""),
            "original_work": anime_data.get("production_info", {}).get("원작자", "") or anime_data.get("production_info", {}).get("원작", ""),
            "error": ""
        }

        return flat_data

    def scrape_range(self, start_id: int, end_id: int, batch_size: int = 100):
        """Crawl the given ID range in parallel."""
        self.total_count = end_id - start_id + 1
        self.completed_count = 0

        logging.info(f"크롤링 시작: ID {start_id}부터 {end_id}까지 (총 {self.total_count}개)")

        # Process in batches
        for batch_start in range(start_id, end_id + 1, batch_size):
            batch_end = min(batch_start + batch_size - 1, end_id)
            batch_ids = list(range(batch_start, batch_end + 1))

            logging.info(f"배치 처리 중: ID {batch_start} ~ {batch_end}")

            with ThreadPoolExecutor(max_workers=self.max_workers) as executor:
                futures = {executor.submit(self.scrape_single, anime_id): anime_id
                          for anime_id in batch_ids}

                for future in as_completed(futures):
                    anime_id = futures[future]
                    try:
                        result = future.result(timeout=30)
                        processed_result = self.process_result(result)

                        with self.lock:
                            if processed_result.get("error"):
                                self.errors.append(processed_result)
                            else:
                                self.results.append(processed_result)

                    except Exception as e:
                        logging.error(f"ID {anime_id} 처리 실패: {str(e)}")
                        with self.lock:
                            self.errors.append({"id": anime_id, "error": str(e)})

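            # Pause between batches to limit server load
            time.sleep(2)

            # Intermediate save once 500 or more new results have accumulated.
            # (A plain `len(self.results) % 500 == 0` test fires only when the
            # count lands exactly on a multiple of 500 at a batch boundary.)
            saved_so_far = getattr(self, "_last_saved_count", 0)
            if len(self.results) - saved_so_far >= 500:
                self.save_intermediate_results()
                self._last_saved_count = len(self.results)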

    def save_intermediate_results(self):
        """Save intermediate results."""
        timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
        filename = f"anilife_intermediate_{timestamp}.csv"

        with self.lock:
            if self.results:
                self.save_to_csv(filename, self.results)
                logging.info(f"중간 결과 저장: {filename} ({len(self.results)}개 항목)")

    def save_to_csv(self, filename: str, data: List[Dict]):
        """Save results to a CSV file."""
        if not data:
            logging.warning("저장할 데이터가 없습니다.")
            return

        fieldnames = [
            "id", "url", "image_url", "title_korean", "title_japanese", "title_english",
            "format", "status", "year", "quarter", "start_date", "end_date",
            "total_episodes", "duration", "genres", "tags", "synopsis",
            "num_characters", "main_characters", "director", "studio",
            "original_work", "error"
        ]

        with open(filename, 'w', newline='', encoding='utf-8-sig') as csvfile:
            writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(data)

        logging.info(f"CSV 파일 저장 완료: {filename}")

    def save_all_results(self):
        """Save all results."""
        timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')

        # Save successful records
        if self.results:
            success_filename = f"anilife_data_{timestamp}.csv"
            self.save_to_csv(success_filename, self.results)
            logging.info(f"성공 데이터: {len(self.results)}개 항목")

        # Save error records
        if self.errors:
            error_filename = f"anilife_errors_{timestamp}.csv"
            self.save_to_csv(error_filename, self.errors)
            logging.info(f"에러 데이터: {len(self.errors)}개 항목")

        # Log summary statistics
        total = len(self.results) + len(self.errors)
        success_rate = (len(self.results) / total * 100) if total > 0 else 0

        logging.info(f"\n크롤링 완료 통계:")
        logging.info(f"- 전체: {total}개")
        logging.info(f"- 성공: {len(self.results)}개")
        logging.info(f"- 실패: {len(self.errors)}개")
        logging.info(f"- 성공률: {success_rate:.1f}%")


def main():
    """Main entry point."""
    # Settings
    START_ID = 101
    END_ID = 7000
    MAX_WORKERS = 30  # number of concurrent threads (tune with server load in mind)
    BATCH_SIZE = 100  # number of items processed per batch

    # Initialize the scraper
    scraper = ParallelAnilifeScraper(max_workers=MAX_WORKERS)

    # Record the start time
    start_time = time.time()

    try:
        # Run the crawl
        scraper.scrape_range(START_ID, END_ID, batch_size=BATCH_SIZE)

        # Save the results
        scraper.save_all_results()

    except KeyboardInterrupt:
        logging.info("\n크롤링이 사용자에 의해 중단되었습니다.")
        scraper.save_all_results()

    except Exception as e:
        logging.error(f"크롤링 중 오류 발생: {str(e)}")
        scraper.save_all_results()

    finally:
        # Report the elapsed time
        elapsed_time = time.time() - start_time
        hours = int(elapsed_time // 3600)
        minutes = int((elapsed_time % 3600) // 60)
        seconds = int(elapsed_time % 60)

        logging.info(f"\n총 소요 시간: {hours}시간 {minutes}분 {seconds}초")


if __name__ == "__main__":
    main()

2025-09-08 10:00:51,436 - INFO - 크롤링 시작: ID 101부터 7000까지 (총 6900개)
2025-09-08 10:00:51,437 - INFO - 배치 처리 중: ID 101 ~ 200
2025-09-08 10:00:53,232 - INFO - 진행률: 10/6900 (0.1%)
2025-09-08 10:00:54,261 - INFO - 진행률: 20/6900 (0.3%)
2025-09-08 10:00:54,618 - INFO - 진행률: 30/6900 (0.4%)
2025-09-08 10:00:55,161 - INFO - 진행률: 40/6900 (0.6%)
2025-09-08 10:00:55,348 - INFO - 진행률: 50/6900 (0.7%)
2025-09-08 10:00:55,533 - INFO - 진행률: 60/6900 (0.9%)
2025-09-08 10:00:55,717 - INFO - 진행률: 70/6900 (1.0%)
2025-09-08 10:00:56,294 - INFO - 진행률: 80/6900 (1.2%)
2025-09-08 10:00:56,472 - INFO - 진행률: 90/6900 (1.3%)
2025-09-08 10:00:56,687 - INFO - 진행률: 100/6900 (1.4%)
2025-09-08 10:00:58,713 - INFO - 배치 처리 중: ID 201 ~ 300
2025-09-08 10:00:59,598 - INFO - 진행률: 110/6900 (1.6%)
2025-09-08 10:00:59,878 - INFO - 진행률: 120/6900 (1.7%)
2025-09-08 10:01:00,002 - INFO - 진행률: 130/6900 (1.9%)
2025-09-08 10:01:00,160 - INFO - 진행률: 140/6900 (2.0%)
2025-09-08 10:01:00,727 - INFO - 진행률: 150/6900 (2.2%)
2025-09-08 10:01:00,98
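
The batch loop above can be tried without any network access. The following is a minimal sketch, not the notebook's actual scraper: `crawl_in_batches`, `fake_fetch`, and the `checkpoints` list are hypothetical names introduced here for illustration. It keeps the same shape (one `ThreadPoolExecutor` per batch, `as_completed` to collect results) and shows two details: an error record keeps its ID even when the fetch function omits it, and a checkpoint fires once enough new results have accumulated since the last one.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, List, Tuple


def crawl_in_batches(
    ids: List[int],
    fetch: Callable[[int], Dict],
    max_workers: int = 4,
    batch_size: int = 10,
    checkpoint_every: int = 5,
) -> Tuple[List[Dict], List[Dict], List[int]]:
    """Run fetch(id) in parallel batches; return (results, errors, checkpoints)."""
    results: List[Dict] = []
    errors: List[Dict] = []
    checkpoints: List[int] = []  # result counts at which a checkpoint fired
    last_saved = 0

    for start in range(0, len(ids), batch_size):
        batch = ids[start:start + batch_size]
        with ThreadPoolExecutor(max_workers=max_workers) as executor:
            futures = {executor.submit(fetch, i): i for i in batch}
            for future in as_completed(futures):
                item_id = futures[future]
                try:
                    record = future.result()
                except Exception as exc:
                    errors.append({"id": item_id, "error": str(exc)})
                    continue
                if record.get("error"):
                    # Keep the ID even if the fetch function left it out.
                    record.setdefault("id", item_id)
                    errors.append(record)
                else:
                    results.append(record)

        # Checkpoint on "N new results since the last save",
        # not on the count being an exact multiple of N.
        if len(results) - last_saved >= checkpoint_every:
            checkpoints.append(len(results))
            last_saved = len(results)

    return results, errors, checkpoints


def fake_fetch(item_id: int) -> Dict:
    """Hypothetical stand-in for a network call: every third ID fails."""
    if item_id % 3 == 0:
        return {"error": "HTTP 404"}  # no "id" key, like a failed page fetch
    return {"id": item_id, "title": f"item-{item_id}"}


results, errors, checkpoints = crawl_in_batches(list(range(1, 21)), fake_fetch)
print(len(results), len(errors), checkpoints)
```

With 20 IDs, batches of 10, and every third ID failing, each batch yields 7 successes, so checkpoints fire at 7 and 14 results. Neither count is a multiple of 5, so an exact-multiple test would have saved nothing. Results are appended only in the thread that iterates `as_completed`, so this sketch needs no lock around the lists.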

In [8]:
import requests
from bs4 import BeautifulSoup
import logging

logging.basicConfig(level=logging.INFO)

def simple_image_extract(url):
    """Simplified cover image extraction."""

    headers = {
        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) StepByStepCrawler/0.1",
        "Accept-Language": "ko-KR,ko;q=0.9,en-US;q=0.8"
    }

    # Normalize the URL so it points at the info tab
    if "tab=info" not in url:
        if "?" in url:
            url = url.split("?")[0] + "?tab=info"
        else:
            url = url + "?tab=info"

    print(f"크롤링 URL: {url}")

    # Send the request
    resp = requests.get(url, headers=headers, timeout=20)
    soup = BeautifulSoup(resp.text, "lxml")

    # Find the image using the simplest approach
    image_url = ""

    # Method 1: look through every img tag for the cv folder
    all_imgs = soup.find_all('img')
    print(f"\n총 {len(all_imgs)}개의 이미지 발견")

    for img in all_imgs:
        src = img.get('src', '')
        alt = img.get('alt', '')

        # cv folder = cover image
        if '/assets/img/cv/' in src:
            image_url = src
            if not image_url.startswith('http'):
                image_url = f"https://cdn.anilife.live{image_url}"
            print(f"✓ 커버 이미지 찾음: {image_url}")
            print(f"  (alt: {alt})")
            return image_url

    # Method 2: look inside a specific div class
    poster_div = soup.find('div', class_='pLEPMwQ')
    if poster_div:
        img = poster_div.find('img')
        if img:
            src = img.get('src', '')
            if src:
                image_url = src
                if not image_url.startswith('http'):
                    image_url = f"https://cdn.anilife.live{image_url}"
                print(f"✓ 포스터 div에서 이미지 찾음: {image_url}")
                return image_url

    print("✗ 이미지를 찾을 수 없음")
    return ""


# Test of the full scraper class
from typing import Dict, List
import csv
import time

class SimpleAnilifeScraper:
    def __init__(self):
        self.headers = {
            "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) StepByStepCrawler/0.1",
            "Accept-Language": "ko-KR,ko;q=0.9,en-US;q=0.8"
        }
        self.session = requests.Session()
        self.session.headers.update(self.headers)

    def scrape_anime_info(self, url: str) -> Dict:
        """Crawl basic anime info (simplified)."""
        try:
            # Normalize the URL
            if "tab=info" not in url:
                if "?" in url:
                    url = url.split("?")[0] + "?tab=info"
                else:
                    url = url + "?tab=info"

            resp = self.session.get(url, timeout=20)
            if resp.status_code != 200:
                return {"error": f"HTTP {resp.status_code}", "url": url}

            soup = BeautifulSoup(resp.text, "lxml")

            # Extract basic info
            anime_info = {
                "url": url,
                "title": "",
                "image_url": ""
            }

            # Extract the title
            title_tag = soup.find('h1', class_='fpUXWby')
            if title_tag:
                anime_info["title"] = title_tag.get_text(strip=True).replace(" 에피소드", "").replace("정보", "")

            # Extract the image (simplest approach)
            all_imgs = soup.find_all('img')
            for img in all_imgs:
                src = img.get('src', '')
                if '/assets/img/cv/' in src:  # cv = cover
                    anime_info["image_url"] = src
                    if not anime_info["image_url"].startswith('http'):
                        anime_info["image_url"] = f"https://cdn.anilife.live{anime_info['image_url']}"
                    break

            print(f"제목: {anime_info['title']}")
            print(f"이미지: {anime_info['image_url']}")

            return anime_info

        except Exception as e:
            return {"error": str(e), "url": url}


# Test run
if __name__ == "__main__":
    print("=" * 80)
    print("1. 단순 함수 테스트")
    print("=" * 80)

    test_urls = [
        "https://anilife.app/content/6840",  # Dr. Stone
        "https://anilife.app/content/101",   # One Piece
        "https://anilife.app/content/110",   # another example
    ]

    for url in test_urls[:1]:
        image_url = simple_image_extract(url)
        print()

    print("\n" + "=" * 80)
    print("2. 클래스 테스트")
    print("=" * 80)

    scraper = SimpleAnilifeScraper()
    for url in test_urls[:1]:
        result = scraper.scrape_anime_info(url)
        print(f"결과: {result}")

    print("\n" + "=" * 80)
    print("3. CSV 저장 테스트")
    print("=" * 80)

    # Crawl several pages, then save to CSV
    results = []
    for anime_id in [6840, 101, 110]:
        url = f"https://anilife.app/content/{anime_id}"
        print(f"\n크롤링 중: {url}")
        result = scraper.scrape_anime_info(url)
        if "error" not in result:
            results.append({
                "id": anime_id,
                "url": result["url"],
                "title": result["title"],
                "image_url": result["image_url"]
            })
        time.sleep(1)  # limit server load

    # Save to CSV
    if results:
        filename = "test_anime_images.csv"
        with open(filename, 'w', newline='', encoding='utf-8-sig') as csvfile:
            fieldnames = ["id", "url", "title", "image_url"]
            writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(results)
        print(f"\nCSV 파일 저장: {filename}")

        # Check the saved content
        print("\n저장된 데이터:")
        for row in results:
            print(f"  ID {row['id']}: {row['title'][:20]}... -> {row['image_url'][-30:] if row['image_url'] else 'NO IMAGE'}")

1. 단순 함수 테스트
크롤링 URL: https://anilife.app/content/6840?tab=info

총 12개의 이미지 발견
✓ 커버 이미지 찾음: https://cdn.anilife.live/assets/img/cv/a8ed6fd6956b13c980082b951df5a699.jpg
  (alt: 닥터 스톤 SCIENCE FUTURE 파트 2)


2. 클래스 테스트
제목: 닥터 스톤 SCIENCE FUTURE 파트 2
이미지: https://cdn.anilife.live/assets/img/cv/a8ed6fd6956b13c980082b951df5a699.jpg
결과: {'url': 'https://anilife.app/content/6840?tab=info', 'title': '닥터 스톤 SCIENCE FUTURE 파트 2', 'image_url': 'https://cdn.anilife.live/assets/img/cv/a8ed6fd6956b13c980082b951df5a699.jpg'}

3. CSV 저장 테스트

크롤링 중: https://anilife.app/content/6840
제목: 닥터 스톤 SCIENCE FUTURE 파트 2
이미지: https://cdn.anilife.live/assets/img/cv/a8ed6fd6956b13c980082b951df5a699.jpg

크롤링 중: https://anilife.app/content/101
제목: 원피스
이미지: https://cdn.anilife.live/assets/img/cv/95cc7d964d03e510cd705bc723a32f79.jpg

크롤링 중: https://anilife.app/content/110
제목: 꿈속의 뮤
이미지: https://cdn.anilife.live/assets/img/cv/820cd9a4176cc266d95ab6a1f43f5e47.jpg

CSV 파일 저장: test_anime_images.csv

저장된 데이터:
  ID 6840: 닥터 스
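
The URL handling in the cell above (forcing `tab=info`, prefixing relative image paths with the CDN host) can also be written with `urllib.parse`. This is a minimal sketch under stated assumptions: `with_info_tab` and `absolute_image_url` are hypothetical helper names, and `tab=other` in the usage lines is a made-up query value used only to show replacement. Unlike the cell above, `with_info_tab` always replaces the whole query string and drops any fragment.

```python
from urllib.parse import urljoin, urlsplit, urlunsplit

CDN_BASE = "https://cdn.anilife.live"  # host used for relative image paths in the cell above


def with_info_tab(url: str) -> str:
    """Return url with its query string replaced by tab=info."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "tab=info", ""))


def absolute_image_url(src: str, base: str = CDN_BASE) -> str:
    """Resolve an img src against base; absolute URLs pass through unchanged."""
    if not src:
        return ""
    return urljoin(base + "/", src)


print(with_info_tab("https://anilife.app/content/6840"))
print(with_info_tab("https://anilife.app/content/6840?tab=other"))
print(absolute_image_url("/assets/img/cv/a8ed6fd6956b13c980082b951df5a699.jpg"))
```

One difference from the `startswith('http')` check: `urljoin` also resolves protocol-relative sources such as `//host/path` against the base scheme instead of appending them to the CDN host.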

In [9]:
import requests
from bs4 import BeautifulSoup
import re
import json
import csv
import time
from typing import Dict, List, Optional
from concurrent.futures import ThreadPoolExecutor, as_completed
from threading import Lock
import logging
from datetime import datetime
import os

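# Logging setup
# force=True (Python 3.8+) replaces handlers left by an earlier basicConfig
# call in the same kernel; without it, this call does nothing when the root
# logger already has handlers.
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s',
    handlers=[
        logging.FileHandler('anilife_scraping.log', encoding='utf-8'),
        logging.StreamHandler()
    ],
    force=True
)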

class AnilifeScraper:
    def __init__(self):
        self.headers = {
            "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) StepByStepCrawler/0.1",
            "Accept-Language": "ko-KR,ko;q=0.9,en-US;q=0.8"
        }
        self.session = requests.Session()
        self.session.headers.update(self.headers)

    def scrape_anime_info(self, url: str) -> Dict:
        """Main function that crawls the info for one anime."""
        try:
            # Normalize the URL so it points at the info tab
            if "tab=info" not in url:
                if "?" in url:
                    url = url.split("?")[0] + "?tab=info"
                else:
                    url = url + "?tab=info"

            # Request the web page
            resp = self.session.get(url, timeout=20)

            if resp.status_code != 200:
                return {"error": f"HTTP {resp.status_code} 에러", "url": url}

            # Parse with BeautifulSoup
            soup = BeautifulSoup(resp.text, "lxml")

            # Extract the Nuxt.js data
            nuxt_data = self.extract_nuxt_data(resp.text)

            # Extract the anime ID
            anime_id = int(re.search(r'/content/(\d+)', url).group(1))

            # Debugging: inspect the Nuxt data structure (IDs up to 200 only)
            if anime_id <= 200:
                logging.debug(f"ID {anime_id} - Nuxt data keys: {nuxt_data.keys() if nuxt_data else 'No data'}")
                if nuxt_data and 'pinia' in nuxt_data:
                    logging.debug(f"ID {anime_id} - Pinia keys: {list(nuxt_data['pinia'].keys())}")

            # Extract the anime info
            anime_info = {
                "url": url,
                "id": anime_id,
                "title": self.extract_titles(soup, nuxt_data),
                "basic_info": self.extract_basic_info(soup, nuxt_data),
                "genres": self.extract_genres(soup, nuxt_data),
                "tags": self.extract_tags(soup, nuxt_data),
                "synopsis": self.extract_synopsis(soup, nuxt_data),
                "characters_voice_actors": self.extract_characters_and_voice_actors(soup, nuxt_data),
                "production_info": self.extract_production_info(soup, nuxt_data)
            }

            return anime_info

        except requests.RequestException as e:
            return {"error": f"요청 에러: {str(e)}", "url": url}
        except Exception as e:
            return {"error": f"파싱 에러: {str(e)}", "url": url}

    def extract_nuxt_data(self, html_content: str) -> Dict:
        """Extract the Nuxt.js __NUXT__ payload from the HTML."""
        try:
            pattern = r'window\.__NUXT__=\(function\([^)]*\)\{return (.+?)\}\)\([^)]+\)'
            match = re.search(pattern, html_content, re.DOTALL)

            if match:
                json_str = match.group(1)

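                # Substitute the minified function parameters with literal values
                # (extended set). Caveat: \b matching also rewrites these letters
                # when they appear as standalone words inside string values.
                replacements = {
                    r'\ba\b': 'false',
                    r'\bb\b': '1',
                    r'\bc\b': 'true',
                    r'\bd\b': 'null',
                    r'\be\b': '"system"',
                    r'\bf\b': '"https://anilife.app"',
                    r'\bg\b': '"N/A"',
                    # Additional patterns
                    r'\bh\b': '2',
                    r'\bi\b': '3',
                    r'\bj\b': '4',
                    r'\bk\b': '"TV"',
                    r'\bl\b': '"OVA"',
                    r'\bm\b': '"Movie"',
                    r'\bn\b': '"Web"',
                }

                # Use a distinct loop name so the outer `pattern` is not shadowed
                for var_pattern, value in replacements.items():
                    json_str = re.sub(var_pattern, value, json_str)

                # Try to parse the result as JSON
                data = json.loads(json_str)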
                return data

            return {}

        except json.JSONDecodeError as e:
            logging.debug(f"JSON parsing error: {e}")
            return {}
        except Exception as e:
            logging.debug(f"Unexpected error in extract_nuxt_data: {e}")
            return {}

    def extract_titles(self, soup: BeautifulSoup, nuxt_data: Dict) -> Dict[str, str]:
        """Extract the titles (Korean, Japanese, English)."""
        titles = {"korean": "", "japanese": "", "english": ""}

        try:
            # Look for a content-related key in pinia
            content_detail = {}
            if 'pinia' in nuxt_data:
                for key in nuxt_data['pinia'].keys():
                    if 'content' in key.lower():
                        content_data = nuxt_data['pinia'][key]
                        if isinstance(content_data, dict) and 'contentDetail' in content_data:
                            content_detail = content_data['contentDetail']
                            break

            if not content_detail:
                content_detail = nuxt_data.get('pinia', {}).get('content', {}).get('contentDetail', {})

            name_data = content_detail.get('name', {})
            if name_data.get('kr'):
                titles["korean"] = name_data['kr']
            if name_data.get('jp'):
                titles["japanese"] = name_data['jp']
            if name_data.get('en'):
                titles["english"] = name_data['en']
        except Exception:  # a bare except would also swallow KeyboardInterrupt
            pass

        # Fall back to HTML parsing
        if not titles["korean"]:
            korean_title_tag = soup.find('h1', class_='fpUXWby')
            if korean_title_tag:
                titles["korean"] = korean_title_tag.get_text(strip=True).replace(" 에피소드", "").replace("정보", "")

        if not titles["japanese"] or not titles["english"]:
            japanese_title_section = soup.find('h2', class_='visually-hidden')
            if japanese_title_section:
                span_tags = japanese_title_section.find_all('span')
                if len(span_tags) >= 2:
                    if not titles["japanese"]:
                        titles["japanese"] = span_tags[0].get_text(strip=True)
                    if not titles["english"]:
                        titles["english"] = span_tags[1].get_text(strip=True)

        return titles

    def extract_basic_info(self, soup: BeautifulSoup, nuxt_data: Dict) -> Dict[str, str]:
        """Extract basic info (revised version)."""
        basic_info = {}

        try:
            # Look for a content-related key in pinia (the exact key may not be 'content')
            content_detail = {}
            if 'pinia' in nuxt_data:
                # Find the content-related key dynamically
                for key in nuxt_data['pinia'].keys():
                    if 'content' in key.lower():
                        content_data = nuxt_data['pinia'][key]
                        if isinstance(content_data, dict) and 'contentDetail' in content_data:
                            content_detail = content_data['contentDetail']
                            break

                # If nothing was found, try direct access
                if not content_detail:
                    content_detail = nuxt_data.get('pinia', {}).get('content', {}).get('contentDetail', {})

            # Debug log
            if content_detail:
                logging.debug(f"Content detail keys: {list(content_detail.keys())[:20]}")

                if content_detail.get('format'):
                    basic_info["format"] = content_detail['format']

                if content_detail.get('status'):
                    basic_info["status"] = content_detail['status']

                # Extract season data
                season_data = content_detail.get('season', {})
                if season_data:
                    logging.debug(f"Season data found: {season_data}")
                    if season_data.get('year'):
                        basic_info["year"] = str(season_data['year'])
                    if season_data.get('quarter'):
                        # A numeric quarter is converted to the "N분기" form
                        quarter_val = season_data['quarter']
                        if isinstance(quarter_val, (int, str)) and str(quarter_val).isdigit():
                            basic_info["quarter"] = f"{quarter_val}분기"
                        else:
                            basic_info["quarter"] = str(quarter_val)

                # Date info
                if content_detail.get('startDate'):
                    basic_info["start_date"] = content_detail['startDate']

                if content_detail.get('endDate') and content_detail['endDate'] != "null":
                    basic_info["end_date"] = content_detail['endDate']

                # Episode info
                if content_detail.get('totalEpisode') and str(content_detail['totalEpisode']) != "N/A":
                    basic_info["total_episodes"] = str(content_detail['totalEpisode'])

                # Running time
                if content_detail.get('duration') and str(content_detail['duration']) != "N/A":
                    basic_info["duration"] = str(content_detail['duration'])

        except Exception as e:
            logging.debug(f"Error extracting from Nuxt data: {e}")

        # Fall back to HTML parsing when the Nuxt data lacks these fields
        if not basic_info.get('year') or not basic_info.get('quarter'):
            # Try several class names
            for class_name in ['nBnfiIh', 'season-info', 'anime-info', 'broadcast-info']:
                quarter_info = soup.find('div', class_=class_name)
                if quarter_info:
                    full_format = quarter_info.get_text(strip=True)
                    logging.debug(f"Found season info in class '{class_name}': {full_format}")

                    # Parse text such as "2024년 1분기 · TV"
                    parts = full_format.split(' · ')

                    if len(parts) >= 1:
                        season_text = parts[0]

                        # Extract the year (try several patterns)
                        year_patterns = [
                            r'(20\d{2})년',  # "2024년"
                            r'(20\d{2})\s',  # "2024 "
                            r'(19\d{2}|20\d{2})',  # any four-digit year
                        ]

                        for pattern in year_patterns:
                            year_match = re.search(pattern, season_text)
                            if year_match and not basic_info.get('year'):
                                basic_info["year"] = year_match.group(1)
                                break

                        # Extract the quarter (try several patterns)
                        quarter_patterns = [
                            r'(\d)분기',  # "1분기" (quarter)
                            r'(\d)쿨',    # "1쿨" (cour)
                            r'Q(\d)',     # "Q1"
                            r'(봄|여름|가을|겨울)',  # season: spring, summer, fall, winter
                        ]

                        for pattern in quarter_patterns:
                            quarter_match = re.search(pattern, season_text)
                            if quarter_match and not basic_info.get('quarter'):
                                if pattern == r'(봄|여름|가을|겨울)':
                                    # Convert the season to a quarter
                                    season_to_quarter = {'봄': '2분기', '여름': '3분기', '가을': '4분기', '겨울': '1분기'}
                                    basic_info["quarter"] = season_to_quarter.get(quarter_match.group(1), quarter_match.group(1))
                                else:
                                    basic_info["quarter"] = f"{quarter_match.group(1)}분기"
                                break

                        # Extract the format
                        if len(parts) >= 2 and not basic_info.get('format'):
                            basic_info["format"] = parts[1].strip()

                    if basic_info.get('year') or basic_info.get('quarter'):
                        break  # stop once the info is found

        # Final check log
        logging.debug(f"Final basic_info: year={basic_info.get('year')}, quarter={basic_info.get('quarter')}")

        return basic_info

    def extract_genres(self, soup: BeautifulSoup, nuxt_data: Dict) -> List[str]:
        """Extract genre info."""
        genres = []

        try:
            # Look for a content-related key in pinia
            content_detail = {}
            if 'pinia' in nuxt_data:
                for key in nuxt_data['pinia'].keys():
                    if 'content' in key.lower():
                        content_data = nuxt_data['pinia'][key]
                        if isinstance(content_data, dict) and 'contentDetail' in content_data:
                            content_detail = content_data['contentDetail']
                            break

            if not content_detail:
                content_detail = nuxt_data.get('pinia', {}).get('content', {}).get('contentDetail', {})

            nuxt_genres = content_detail.get('genre', [])
            if nuxt_genres:
                genres = nuxt_genres
        except Exception:  # a bare except would also swallow KeyboardInterrupt
            pass

        if not genres:
            genre_tags = soup.select('a[rel="genre"]')
            genres = [tag.get_text(strip=True) for tag in genre_tags]

        return genres

    def extract_tags(self, soup: BeautifulSoup, nuxt_data: Dict) -> List[str]:
        """Extract tag info."""
        tags = []

        try:
            # Look for a content-related key in pinia
            content_detail = {}
            if 'pinia' in nuxt_data:
                for key in nuxt_data['pinia'].keys():
                    if 'content' in key.lower():
                        content_data = nuxt_data['pinia'][key]
                        if isinstance(content_data, dict) and 'contentDetail' in content_data:
                            content_detail = content_data['contentDetail']
                            break

            if not content_detail:
                content_detail = nuxt_data.get('pinia', {}).get('content', {}).get('contentDetail', {})

            tag_data = content_detail.get('tag', [])
            for tag_item in tag_data:
                if isinstance(tag_item, dict) and tag_item.get('name'):
                    tag_name = tag_item['name']
                    if tag_item.get('spoiler'):
                        tag_name += " (스포일러)"
                    tags.append(tag_name)
                elif isinstance(tag_item, str):
                    tags.append(tag_item)

        except Exception:
            pass

        if not tags:
            tag_section = None
            for h2 in soup.find_all('h2', class_='wXeFmvm'):
                if '작품 태그' in h2.get_text():
                    tag_section = h2.find_parent('section')
                    break

            if tag_section:
                tag_container = tag_section.find('div', class_='-mMZ9fV')
                if tag_container:
                    tag_links = tag_container.find_all('a', class_='MbHceQh')
                    for link in tag_links:
                        span = link.find('span')
                        if span:
                            tag_text = span.get_text(strip=True).replace('#', '')
                            if 'iYz6NWc' in span.get('class', []):
                                tag_text += " (스포일러)"
                            tags.append(tag_text)

        return tags

    def extract_synopsis(self, soup: BeautifulSoup, nuxt_data: Dict) -> str:
        """Extract the synopsis."""
        try:
            # Look for a content-related key in pinia
            content_detail = {}
            if 'pinia' in nuxt_data:
                for key in nuxt_data['pinia'].keys():
                    if 'content' in key.lower():
                        content_data = nuxt_data['pinia'][key]
                        if isinstance(content_data, dict) and 'contentDetail' in content_data:
                            content_detail = content_data['contentDetail']
                            break

            if not content_detail:
                content_detail = nuxt_data.get('pinia', {}).get('content', {}).get('contentDetail', {})

            description = content_detail.get('description', '')
            if description and description != "등록된 줄거리가 없습니다.":
                return description
        except Exception:  # a bare except would also swallow KeyboardInterrupt
            pass

        description_div = soup.find('div', class_='bnHDzeE')
        if description_div:
            synopsis = description_div.get_text(strip=True)
            if synopsis and synopsis != "등록된 줄거리가 없습니다.":
                return synopsis

        return "등록된 줄거리가 없습니다."

    def extract_characters_and_voice_actors(self, soup: BeautifulSoup, nuxt_data: Dict) -> List[Dict]:
        """Extract character and voice actor info."""
        characters = []
        character_cards = soup.find_all('div', class_='otjBFjd')

        for card in character_cards:
            character_div = card.find('div', class_='OuXf8uf')
            voice_actor_link = card.find('a')
            character_info = {}

            if character_div:
                name_elem = character_div.find('div', class_='iO6bs1d')
                role_elem = character_div.find('div', class_='_99DZmqJ')

                if name_elem:
                    character_info['character_name'] = name_elem.get_text(strip=True)
                if role_elem:
                    character_info['character_role'] = role_elem.get_text(strip=True)

                if character_div.get('data-original-title'):
                    if not character_info.get('character_name'):
                        character_info['character_name'] = character_div['data-original-title']

            if voice_actor_link:
                voice_actor_div = voice_actor_link.find('div', class_='_0fu6hck')
                if voice_actor_div:
                    voice_name_elem = voice_actor_div.find('div', class_='iO6bs1d')
                    if voice_name_elem:
                        character_info['voice_actor'] = voice_name_elem.get_text(strip=True)

                    if voice_actor_div.get('title'):
                        if not character_info.get('voice_actor'):
                            character_info['voice_actor'] = voice_actor_div['title']

            if character_info:
                characters.append(character_info)

        return characters

    def extract_production_info(self, soup: BeautifulSoup, nuxt_data: Dict) -> Dict[str, str]:
        """Extract production (staff) information."""
        production_info = {}

        # Try the Nuxt data first
        try:
            # Look for a content-related key in pinia
            content_detail = {}
            if 'pinia' in nuxt_data:
                for key in nuxt_data['pinia'].keys():
                    if 'content' in key.lower():
                        content_data = nuxt_data['pinia'][key]
                        if isinstance(content_data, dict) and 'contentDetail' in content_data:
                            content_detail = content_data['contentDetail']
                            break

            if not content_detail:
                content_detail = nuxt_data.get('pinia', {}).get('content', {}).get('contentDetail', {})

            # Look for staff or production entries
            for staff_key in ['staff', 'staffs', 'production', 'studio']:
                if staff_key in content_detail:
                    staff_data = content_detail[staff_key]
                    if isinstance(staff_data, list):
                        for staff in staff_data:
                            if isinstance(staff, dict):
                                role = staff.get('role', '')
                                name = staff.get('name', '')
                                if role and name:
                                    if role not in production_info:
                                        production_info[role] = []
                                    production_info[role].append(name)
        except Exception:
            # Nuxt payload missing or malformed; rely on the HTML below
            pass

        # Pull additional staff entries from the HTML
        production_section = soup.find('div', class_='_1coMKET -HW4ChD')
        if production_section:
            production_links = production_section.find_all('a', class_='_2hRLd-G')

            for link in production_links:
                staff_div = link.find('div', class_='OuXf8uf')

                if staff_div:
                    name_elem = staff_div.find('div', class_='iO6bs1d')
                    role_elem = staff_div.find('div', class_='_99DZmqJ')

                    if name_elem and role_elem:
                        name = name_elem.get_text(strip=True)
                        role = role_elem.get_text(strip=True)

                        if role not in production_info:
                            production_info[role] = []

                        if isinstance(production_info[role], list):
                            if name not in production_info[role]:  # skip duplicates
                                production_info[role].append(name)
                        else:
                            production_info[role] = [production_info[role], name]

                    if staff_div.get('title'):
                        if not name_elem:
                            name = staff_div['title']
                            if role_elem:
                                role = role_elem.get_text(strip=True)
                                if role not in production_info:
                                    production_info[role] = name

        # Join each role's list of names into a comma-separated string
        for role, names in production_info.items():
            if isinstance(names, list):
                production_info[role] = ', '.join(names)

        return production_info


class ParallelAnilifeScraper:
    def __init__(self, max_workers=10):
        self.max_workers = max_workers
        self.results = []
        self.errors = []
        self.lock = Lock()
        self.progress_lock = Lock()
        self.completed_count = 0
        self.total_count = 0

    def scrape_single(self, anime_id: int) -> Dict:
        """Scrape a single anime by ID."""
        url = f"https://anilife.app/content/{anime_id}?tab=info"
        scraper = AnilifeScraper()

        try:
            result = scraper.scrape_anime_info(url)

            with self.progress_lock:
                self.completed_count += 1
                if self.completed_count % 10 == 0:
                    logging.info(f"진행률: {self.completed_count}/{self.total_count} ({self.completed_count/self.total_count*100:.1f}%)")

            return result
        except Exception as e:
            logging.error(f"ID {anime_id} 크롤링 실패: {str(e)}")
            return {"error": str(e), "id": anime_id, "url": url}

    def process_result(self, anime_data: Dict) -> Dict:
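        """Flatten a scrape result into a dict suitable for one CSV row."""
        if "error" in anime_data:
            # Keep the URL so a failed row can still be traced back to its page
            return {"id": anime_data.get("id", ""), "url": anime_data.get("url", ""), "error": anime_data["error"]}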

        flat_data = {
            "id": anime_data.get("id", ""),
            "url": anime_data.get("url", ""),
            "title_korean": anime_data.get("title", {}).get("korean", ""),
            "title_japanese": anime_data.get("title", {}).get("japanese", ""),
            "title_english": anime_data.get("title", {}).get("english", ""),
            "format": anime_data.get("basic_info", {}).get("format", ""),
            "status": anime_data.get("basic_info", {}).get("status", ""),
            "year": anime_data.get("basic_info", {}).get("year", ""),
            "quarter": anime_data.get("basic_info", {}).get("quarter", ""),
            "start_date": anime_data.get("basic_info", {}).get("start_date", ""),
            "end_date": anime_data.get("basic_info", {}).get("end_date", ""),
            "total_episodes": anime_data.get("basic_info", {}).get("total_episodes", ""),
            "duration": anime_data.get("basic_info", {}).get("duration", ""),
            "genres": "|".join(anime_data.get("genres", [])),
            "tags": "|".join(anime_data.get("tags", [])),
            "synopsis": anime_data.get("synopsis", ""),
            "num_characters": len(anime_data.get("characters_voice_actors", [])),
            "main_characters": "|".join([
                f"{c.get('character_name', '')}({c.get('voice_actor', '')})"
                for c in anime_data.get("characters_voice_actors", [])[:5]
            ]),
            "director": anime_data.get("production_info", {}).get("감독", ""),
            "studio": anime_data.get("production_info", {}).get("애니메이션 제작", ""),
            "original_work": anime_data.get("production_info", {}).get("원작자", "") or anime_data.get("production_info", {}).get("원작", ""),
            "error": ""
        }

        return flat_data

    def scrape_range(self, start_id: int, end_id: int, batch_size: int = 100):
        """Scrape every anime ID in the given range in parallel."""
        self.total_count = end_id - start_id + 1
        self.completed_count = 0

        logging.info(f"크롤링 시작: ID {start_id}부터 {end_id}까지 (총 {self.total_count}개)")

        # Process the range one batch at a time
        for batch_start in range(start_id, end_id + 1, batch_size):
            batch_end = min(batch_start + batch_size - 1, end_id)
            batch_ids = list(range(batch_start, batch_end + 1))

            logging.info(f"배치 처리 중: ID {batch_start} ~ {batch_end}")

            with ThreadPoolExecutor(max_workers=self.max_workers) as executor:
                futures = {executor.submit(self.scrape_single, anime_id): anime_id
                          for anime_id in batch_ids}

                for future in as_completed(futures):
                    anime_id = futures[future]
                    try:
                        result = future.result(timeout=30)
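                        processed_result = self.process_result(result)
                        # Error dicts from scrape_anime_info carry no "id"; restore it here
                        if not processed_result.get("id"):
                            processed_result["id"] = anime_id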

                        with self.lock:
                            if processed_result.get("error"):
                                self.errors.append(processed_result)
                            else:
                                self.results.append(processed_result)

                    except Exception as e:
                        logging.error(f"ID {anime_id} 처리 실패: {str(e)}")
                        with self.lock:
                            self.errors.append({"id": anime_id, "error": str(e)})

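            # Pause between batches to limit load on the server
            time.sleep(2)

            # Checkpoint once at least 500 new results have accumulated.
            # (An exact `% 500 == 0` test misses most counts, because failed
            # IDs make the number of successes per batch uneven.)
            if len(self.results) - getattr(self, "_last_checkpoint", 0) >= 500:
                self.save_intermediate_results()
                self._last_checkpoint = len(self.results)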

    def save_intermediate_results(self):
        """Save an intermediate snapshot of the results."""
        timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
        filename = f"anilife_intermediate_{timestamp}.csv"

        with self.lock:
            if self.results:
                self.save_to_csv(filename, self.results)
                logging.info(f"중간 결과 저장: {filename} ({len(self.results)}개 항목)")

    def save_to_csv(self, filename: str, data: List[Dict]):
        """Write the given rows to a CSV file."""
        if not data:
            logging.warning("저장할 데이터가 없습니다.")
            return

        fieldnames = [
            "id", "url", "image_url", "title_korean", "title_japanese", "title_english",
            "format", "status", "year", "quarter", "start_date", "end_date",
            "total_episodes", "duration", "genres", "tags", "synopsis",
            "num_characters", "main_characters", "director", "studio",
            "original_work", "error"
        ]

        with open(filename, 'w', newline='', encoding='utf-8-sig') as csvfile:
            writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(data)

        logging.info(f"CSV 파일 저장 완료: {filename}")

    def save_all_results(self):
        """Save all results (successes and errors) and log summary statistics."""
        timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')

        # Save the successful rows
        if self.results:
            success_filename = f"anilife_data_{timestamp}.csv"
            self.save_to_csv(success_filename, self.results)
            logging.info(f"성공 데이터: {len(self.results)}개 항목")

        # Save the error rows
        if self.errors:
            error_filename = f"anilife_errors_{timestamp}.csv"
            self.save_to_csv(error_filename, self.errors)
            logging.info(f"에러 데이터: {len(self.errors)}개 항목")

        # Log summary statistics
        total = len(self.results) + len(self.errors)
        success_rate = (len(self.results) / total * 100) if total > 0 else 0

        logging.info(f"\n크롤링 완료 통계:")
        logging.info(f"- 전체: {total}개")
        logging.info(f"- 성공: {len(self.results)}개")
        logging.info(f"- 실패: {len(self.errors)}개")
        logging.info(f"- 성공률: {success_rate:.1f}%")


def main():
    """Entry point."""
    # Configuration
    START_ID = 101
    END_ID = 7000
    MAX_WORKERS = 30  # number of concurrent threads (tune with server load in mind)
    BATCH_SIZE = 100  # number of IDs processed per batch

    # Initialize the scraper
    scraper = ParallelAnilifeScraper(max_workers=MAX_WORKERS)

    # Record the start time
    start_time = time.time()

    try:
        # Run the crawl
        scraper.scrape_range(START_ID, END_ID, batch_size=BATCH_SIZE)

        # Save the results
        scraper.save_all_results()

    except KeyboardInterrupt:
        logging.info("\n크롤링이 사용자에 의해 중단되었습니다.")
        scraper.save_all_results()

    except Exception as e:
        logging.error(f"크롤링 중 오류 발생: {str(e)}")
        scraper.save_all_results()

    finally:
        # Log the elapsed time
        elapsed_time = time.time() - start_time
        hours = int(elapsed_time // 3600)
        minutes = int((elapsed_time % 3600) // 60)
        seconds = int(elapsed_time % 60)

        logging.info(f"\n총 소요 시간: {hours}시간 {minutes}분 {seconds}초")


if __name__ == "__main__":
    main()

2025-09-08 10:10:06,547 - INFO - 크롤링 시작: ID 101부터 7000까지 (총 6900개)
2025-09-08 10:10:06,548 - INFO - 배치 처리 중: ID 101 ~ 200
2025-09-08 10:10:08,250 - INFO - 진행률: 10/6900 (0.1%)
2025-09-08 10:10:09,662 - INFO - 진행률: 20/6900 (0.3%)
2025-09-08 10:10:09,886 - INFO - 진행률: 30/6900 (0.4%)
2025-09-08 10:10:10,169 - INFO - 진행률: 40/6900 (0.6%)
2025-09-08 10:10:10,407 - INFO - 진행률: 50/6900 (0.7%)
2025-09-08 10:10:10,603 - INFO - 진행률: 60/6900 (0.9%)
2025-09-08 10:10:10,789 - INFO - 진행률: 70/6900 (1.0%)
2025-09-08 10:10:10,975 - INFO - 진행률: 80/6900 (1.2%)
2025-09-08 10:10:11,481 - INFO - 진행률: 90/6900 (1.3%)
2025-09-08 10:10:11,583 - INFO - 진행률: 100/6900 (1.4%)
2025-09-08 10:10:13,590 - INFO - 배치 처리 중: ID 201 ~ 300
2025-09-08 10:10:14,393 - INFO - 진행률: 110/6900 (1.6%)
2025-09-08 10:10:14,681 - INFO - 진행률: 120/6900 (1.7%)
2025-09-08 10:10:14,892 - INFO - 진행률: 130/6900 (1.9%)
2025-09-08 10:10:15,130 - INFO - 진행률: 140/6900 (2.0%)
2025-09-08 10:10:15,342 - INFO - 진행률: 150/6900 (2.2%)
2025-09-08 10:10:15,54
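
The scraper class in the next cell repeats the same Pinia lookup at the top of every `extract_*` method. The block below is a minimal sketch of how that lookup could be factored into one function. The name `find_content_detail` and the sample store key `contentStore` are hypothetical and do not appear in the notebook, and the behavior is close to the repeated code but not identical: it returns the first non-empty `contentDetail` found in any store whose key contains "content".

```python
from typing import Any, Dict


def find_content_detail(nuxt_data: Dict[str, Any]) -> Dict[str, Any]:
    """Return the first non-empty contentDetail dict in the payload, or {}."""
    pinia = nuxt_data.get("pinia")
    if not isinstance(pinia, dict):
        return {}

    # Scan every store whose key mentions "content"; this also covers the
    # plain "content" key that the repeated code tries as a direct fallback.
    for key, store in pinia.items():
        if "content" in key.lower() and isinstance(store, dict):
            detail = store.get("contentDetail")
            if isinstance(detail, dict) and detail:
                return detail
    return {}


# Hypothetical payloads, for illustration only
sample = {"pinia": {"contentStore": {"contentDetail": {"format": "TV"}}}}
print(find_content_detail(sample))  # {'format': 'TV'}
print(find_content_detail({}))      # {}
```

With a helper like this, each `extract_*` method could start from `content_detail = find_content_detail(nuxt_data)` instead of repeating the loop, and the broad `try`/`except` around the lookup would no longer be needed for that part.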

In [10]:
import requests
from bs4 import BeautifulSoup
import re
import json
import csv
import time
from typing import Dict, List, Optional
from concurrent.futures import ThreadPoolExecutor, as_completed
from threading import Lock
import logging
from datetime import datetime
import os

# Logging configuration
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s',
    handlers=[
        logging.FileHandler('anilife_scraping.log'),
        logging.StreamHandler()
    ]
)

class AnilifeScraper:
    def __init__(self):
        self.headers = {
            "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) StepByStepCrawler/0.1",
            "Accept-Language": "ko-KR,ko;q=0.9,en-US;q=0.8"
        }
        self.session = requests.Session()
        self.session.headers.update(self.headers)

    def scrape_anime_info(self, url: str) -> Dict:
        """Scrape one anime page and return its information as a dict."""
        try:
            # Normalize the URL so it points at the info tab
            if "tab=info" not in url:
                if "?" in url:
                    url = url.split("?")[0] + "?tab=info"
                else:
                    url = url + "?tab=info"

            # Request the page
            resp = self.session.get(url, timeout=20)

            if resp.status_code != 200:
                return {"error": f"HTTP {resp.status_code} 에러", "url": url}

            # Parse with BeautifulSoup
            soup = BeautifulSoup(resp.text, "lxml")

            # Extract the Nuxt.js data
            nuxt_data = self.extract_nuxt_data(resp.text)

            # Extract the anime ID from the URL
            anime_id = int(re.search(r'/content/(\d+)', url).group(1))

            # Debugging: inspect the Nuxt data structure (IDs up to 200 only;
            # emitted only when the log level is DEBUG)
            if anime_id <= 200:
                logging.debug(f"ID {anime_id} - Nuxt data keys: {nuxt_data.keys() if nuxt_data else 'No data'}")
                if nuxt_data and 'pinia' in nuxt_data:
                    logging.debug(f"ID {anime_id} - Pinia keys: {list(nuxt_data['pinia'].keys())}")

            # Assemble the anime information
            anime_info = {
                "url": url,
                "id": anime_id,
                "title": self.extract_titles(soup, nuxt_data),
                "basic_info": self.extract_basic_info(soup, nuxt_data),
                "genres": self.extract_genres(soup, nuxt_data),
                "tags": self.extract_tags(soup, nuxt_data),
                "synopsis": self.extract_synopsis(soup, nuxt_data),
                "characters_voice_actors": self.extract_characters_and_voice_actors(soup, nuxt_data),
                "production_info": self.extract_production_info(soup, nuxt_data)
            }

            return anime_info

        except requests.RequestException as e:
            return {"error": f"요청 에러: {str(e)}", "url": url}
        except Exception as e:
            return {"error": f"파싱 에러: {str(e)}", "url": url}

    def extract_nuxt_data(self, html_content: str) -> Dict:
        """Extract the Nuxt.js __NUXT__ data from the HTML."""
        try:
            pattern = r'window\.__NUXT__=\(function\([^)]*\)\{return (.+?)\}\)\([^)]+\)'
            match = re.search(pattern, html_content, re.DOTALL)

            if match:
                json_str = match.group(1)

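                # Substitute the minified function parameters (a, b, c, ...) with literal values.
                # The mapping is hard-coded rather than read from the call's argument list, and
                # \b-anchored substitution also rewrites matching single letters inside string values.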
                replacements = {
                    r'\ba\b': 'false',
                    r'\bb\b': '1',
                    r'\bc\b': 'true',
                    r'\bd\b': 'null',
                    r'\be\b': '"system"',
                    r'\bf\b': '"https://anilife.app"',
                    r'\bg\b': '"N/A"',
                    # Additional patterns
                    r'\bh\b': '2',
                    r'\bi\b': '3',
                    r'\bj\b': '4',
                    r'\bk\b': '"TV"',
                    r'\bl\b': '"OVA"',
                    r'\bm\b': '"Movie"',
                    r'\bn\b': '"Web"',
                }

                for var_pattern, value in replacements.items():
                    json_str = re.sub(var_pattern, value, json_str)

                # Try to parse the result as JSON
                data = json.loads(json_str)
                return data

            return {}

        except json.JSONDecodeError as e:
            logging.debug(f"JSON parsing error: {e}")
            return {}
        except Exception as e:
            logging.debug(f"Unexpected error in extract_nuxt_data: {e}")
            return {}

    def extract_titles(self, soup: BeautifulSoup, nuxt_data: Dict) -> Dict[str, str]:
        """Extract the titles (Korean, Japanese, English)."""
        titles = {"korean": "", "japanese": "", "english": ""}

        try:
            # Look for a content-related key in pinia
            content_detail = {}
            if 'pinia' in nuxt_data:
                for key in nuxt_data['pinia'].keys():
                    if 'content' in key.lower():
                        content_data = nuxt_data['pinia'][key]
                        if isinstance(content_data, dict) and 'contentDetail' in content_data:
                            content_detail = content_data['contentDetail']
                            break

            if not content_detail:
                content_detail = nuxt_data.get('pinia', {}).get('content', {}).get('contentDetail', {})

            name_data = content_detail.get('name', {})
            if name_data.get('kr'):
                titles["korean"] = name_data['kr']
            if name_data.get('jp'):
                titles["japanese"] = name_data['jp']
            if name_data.get('en'):
                titles["english"] = name_data['en']
        except Exception:
            # Nuxt payload missing or malformed; fall back to the HTML below
            pass

        # Fall back to parsing the HTML
        if not titles["korean"]:
            korean_title_tag = soup.find('h1', class_='fpUXWby')
            if korean_title_tag:
                titles["korean"] = korean_title_tag.get_text(strip=True).replace(" 에피소드", "").replace("정보", "")

        if not titles["japanese"] or not titles["english"]:
            japanese_title_section = soup.find('h2', class_='visually-hidden')
            if japanese_title_section:
                span_tags = japanese_title_section.find_all('span')
                if len(span_tags) >= 2:
                    if not titles["japanese"]:
                        titles["japanese"] = span_tags[0].get_text(strip=True)
                    if not titles["english"]:
                        titles["english"] = span_tags[1].get_text(strip=True)

        return titles

    def extract_basic_info(self, soup: BeautifulSoup, nuxt_data: Dict) -> Dict[str, str]:
        """Extract basic information (revised version)."""
        basic_info = {}

        try:
            # Look for a content-related key in pinia (the exact key name may not be 'content')
            content_detail = {}
            if 'pinia' in nuxt_data:
                # Find the content-related key dynamically
                for key in nuxt_data['pinia'].keys():
                    if 'content' in key.lower():
                        content_data = nuxt_data['pinia'][key]
                        if isinstance(content_data, dict) and 'contentDetail' in content_data:
                            content_detail = content_data['contentDetail']
                            break

                # If nothing was found, try direct access
                if not content_detail:
                    content_detail = nuxt_data.get('pinia', {}).get('content', {}).get('contentDetail', {})

            # Debug log
            if content_detail:
                logging.debug(f"Content detail keys: {list(content_detail.keys())[:20]}")

                if content_detail.get('format'):
                    basic_info["format"] = content_detail['format']

                if content_detail.get('status'):
                    basic_info["status"] = content_detail['status']

                # Extract the season data
                season_data = content_detail.get('season', {})
                if season_data:
                    logging.debug(f"Season data found: {season_data}")
                    if season_data.get('year'):
                        basic_info["year"] = str(season_data['year'])
                    if season_data.get('quarter'):
                        # If quarter arrives as a number, format it as "N분기"
                        quarter_val = season_data['quarter']
                        if isinstance(quarter_val, (int, str)) and str(quarter_val).isdigit():
                            basic_info["quarter"] = f"{quarter_val}분기"
                        else:
                            basic_info["quarter"] = str(quarter_val)

                # Start and end dates
                if content_detail.get('startDate'):
                    basic_info["start_date"] = content_detail['startDate']

                if content_detail.get('endDate') and content_detail['endDate'] != "null":
                    basic_info["end_date"] = content_detail['endDate']

                # Episode count
                if content_detail.get('totalEpisode') and str(content_detail['totalEpisode']) != "N/A":
                    basic_info["total_episodes"] = str(content_detail['totalEpisode'])

                # Running time
                if content_detail.get('duration') and str(content_detail['duration']) != "N/A":
                    basic_info["duration"] = str(content_detail['duration'])

        except Exception as e:
            logging.debug(f"Error extracting from Nuxt data: {e}")

        # Fall back to HTML parsing when the Nuxt data lacked year or quarter
        if not basic_info.get('year') or not basic_info.get('quarter'):
            # Try several candidate class names
            for class_name in ['nBnfiIh', 'season-info', 'anime-info', 'broadcast-info']:
                quarter_info = soup.find('div', class_=class_name)
                if quarter_info:
                    full_format = quarter_info.get_text(strip=True)
                    logging.debug(f"Found season info in class '{class_name}': {full_format}")

                    # Parse text shaped like "2024년 1분기 · TV"
                    parts = full_format.split(' · ')

                    if len(parts) >= 1:
                        season_text = parts[0]

                        # Extract the year (try several patterns)
                        year_patterns = [
                            r'(20\d{2})년',  # "2024년"
                            r'(20\d{2})\s',  # "2024 "
                            r'(19\d{2}|20\d{2})',  # any 19xx or 20xx year
                        ]

                        for pattern in year_patterns:
                            year_match = re.search(pattern, season_text)
                            if year_match and not basic_info.get('year'):
                                basic_info["year"] = year_match.group(1)
                                break

                        # Extract the quarter (try several patterns)
                        quarter_patterns = [
                            r'(\d)분기',  # "1분기"
                            r'(\d)쿨',    # "1쿨"
                            r'Q(\d)',     # "Q1"
                            r'(봄|여름|가을|겨울)',  # season name (spring, summer, fall, winter)
                        ]

                        for pattern in quarter_patterns:
                            quarter_match = re.search(pattern, season_text)
                            if quarter_match and not basic_info.get('quarter'):
                                if pattern == r'(봄|여름|가을|겨울)':
                                    # Map the season name to a quarter
                                    season_to_quarter = {'봄': '2분기', '여름': '3분기', '가을': '4분기', '겨울': '1분기'}
                                    basic_info["quarter"] = season_to_quarter.get(quarter_match.group(1), quarter_match.group(1))
                                else:
                                    basic_info["quarter"] = f"{quarter_match.group(1)}분기"
                                break

                        # Extract the format
                        if len(parts) >= 2 and not basic_info.get('format'):
                            basic_info["format"] = parts[1].strip()

                    if basic_info.get('year') or basic_info.get('quarter'):
                        break  # stop once year or quarter has been found

        # Final check log
        logging.debug(f"Final basic_info: year={basic_info.get('year')}, quarter={basic_info.get('quarter')}")

        return basic_info

    def extract_genres(self, soup: BeautifulSoup, nuxt_data: Dict) -> List[str]:
        """Extract genre information."""
        genres = []

        try:
            # Look for a content-related key in pinia
            content_detail = {}
            if 'pinia' in nuxt_data:
                for key in nuxt_data['pinia'].keys():
                    if 'content' in key.lower():
                        content_data = nuxt_data['pinia'][key]
                        if isinstance(content_data, dict) and 'contentDetail' in content_data:
                            content_detail = content_data['contentDetail']
                            break

            if not content_detail:
                content_detail = nuxt_data.get('pinia', {}).get('content', {}).get('contentDetail', {})

            nuxt_genres = content_detail.get('genre', [])
            if nuxt_genres:
                genres = nuxt_genres
        except Exception:
            # Nuxt payload missing or malformed; fall back to the HTML below
            pass

        if not genres:
            genre_tags = soup.select('a[rel="genre"]')
            genres = [tag.get_text(strip=True) for tag in genre_tags]

        return genres

    def extract_tags(self, soup: BeautifulSoup, nuxt_data: Dict) -> List[str]:
        """Extract tag information."""
        tags = []

        try:
            # Look for a content-related key in pinia
            content_detail = {}
            if 'pinia' in nuxt_data:
                for key in nuxt_data['pinia'].keys():
                    if 'content' in key.lower():
                        content_data = nuxt_data['pinia'][key]
                        if isinstance(content_data, dict) and 'contentDetail' in content_data:
                            content_detail = content_data['contentDetail']
                            break

            if not content_detail:
                content_detail = nuxt_data.get('pinia', {}).get('content', {}).get('contentDetail', {})

            tag_data = content_detail.get('tag', [])
            for tag_item in tag_data:
                if isinstance(tag_item, dict) and tag_item.get('name'):
                    tag_name = tag_item['name']
                    if tag_item.get('spoiler'):
                        tag_name += " (스포일러)"
                    tags.append(tag_name)
                elif isinstance(tag_item, str):
                    tags.append(tag_item)

        except Exception:
            pass

        if not tags:
            tag_section = None
            for h2 in soup.find_all('h2', class_='wXeFmvm'):
                if '작품 태그' in h2.get_text():
                    tag_section = h2.find_parent('section')
                    break

            if tag_section:
                tag_container = tag_section.find('div', class_='-mMZ9fV')
                if tag_container:
                    tag_links = tag_container.find_all('a', class_='MbHceQh')
                    for link in tag_links:
                        span = link.find('span')
                        if span:
                            tag_text = span.get_text(strip=True).replace('#', '')
                            if 'iYz6NWc' in span.get('class', []):
                                tag_text += " (스포일러)"
                            tags.append(tag_text)

        return tags

    def extract_synopsis(self, soup: BeautifulSoup, nuxt_data: Dict) -> str:
        """Extract the synopsis."""
        try:
            # Look for a content-related key in pinia
            content_detail = {}
            if 'pinia' in nuxt_data:
                for key in nuxt_data['pinia'].keys():
                    if 'content' in key.lower():
                        content_data = nuxt_data['pinia'][key]
                        if isinstance(content_data, dict) and 'contentDetail' in content_data:
                            content_detail = content_data['contentDetail']
                            break

            if not content_detail:
                content_detail = nuxt_data.get('pinia', {}).get('content', {}).get('contentDetail', {})

            description = content_detail.get('description', '')
            if description and description != "등록된 줄거리가 없습니다.":
                return description
        except Exception:
            # Nuxt payload missing or malformed; fall back to the HTML below
            pass

        description_div = soup.find('div', class_='bnHDzeE')
        if description_div:
            synopsis = description_div.get_text(strip=True)
            if synopsis and synopsis != "등록된 줄거리가 없습니다.":
                return synopsis

        return "등록된 줄거리가 없습니다."

    def extract_characters_and_voice_actors(self, soup: BeautifulSoup, nuxt_data: Dict) -> List[Dict]:
        """Extract character and voice-actor information."""
        characters = []
        character_cards = soup.find_all('div', class_='otjBFjd')

        for card in character_cards:
            character_div = card.find('div', class_='OuXf8uf')
            voice_actor_link = card.find('a')
            character_info = {}

            if character_div:
                name_elem = character_div.find('div', class_='iO6bs1d')
                role_elem = character_div.find('div', class_='_99DZmqJ')

                if name_elem:
                    character_info['character_name'] = name_elem.get_text(strip=True)
                if role_elem:
                    character_info['character_role'] = role_elem.get_text(strip=True)

                if character_div.get('data-original-title'):
                    if not character_info.get('character_name'):
                        character_info['character_name'] = character_div['data-original-title']

            if voice_actor_link:
                voice_actor_div = voice_actor_link.find('div', class_='_0fu6hck')
                if voice_actor_div:
                    voice_name_elem = voice_actor_div.find('div', class_='iO6bs1d')
                    if voice_name_elem:
                        character_info['voice_actor'] = voice_name_elem.get_text(strip=True)

                    if voice_actor_div.get('title'):
                        if not character_info.get('voice_actor'):
                            character_info['voice_actor'] = voice_actor_div['title']

            if character_info:
                characters.append(character_info)

        return characters

    def extract_production_info(self, soup: BeautifulSoup, nuxt_data: Dict) -> Dict[str, str]:
        """Extract production (staff) information."""
        production_info = {}

        # Try the Nuxt data first
        try:
            # Look for a content-related key in pinia
            content_detail = {}
            if 'pinia' in nuxt_data:
                for key in nuxt_data['pinia'].keys():
                    if 'content' in key.lower():
                        content_data = nuxt_data['pinia'][key]
                        if isinstance(content_data, dict) and 'contentDetail' in content_data:
                            content_detail = content_data['contentDetail']
                            break

            if not content_detail:
                content_detail = nuxt_data.get('pinia', {}).get('content', {}).get('contentDetail', {})

            # Look for staff or production entries
            for staff_key in ['staff', 'staffs', 'production', 'studio']:
                if staff_key in content_detail:
                    staff_data = content_detail[staff_key]
                    if isinstance(staff_data, list):
                        for staff in staff_data:
                            if isinstance(staff, dict):
                                role = staff.get('role', '')
                                name = staff.get('name', '')
                                if role and name:
                                    if role not in production_info:
                                        production_info[role] = []
                                    production_info[role].append(name)
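        except Exception as e:
            # The Nuxt payload is optional: fall back to the HTML below, but leave a trace.
            # A bare "except:" would also swallow KeyboardInterrupt, which main() relies on.
            logging.debug(f"Could not read production info from Nuxt data: {e}")

        # Extract additional information from the HTML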
        production_section = soup.find('div', class_='_1coMKET -HW4ChD')
        if production_section:
            production_links = production_section.find_all('a', class_='_2hRLd-G')

            for link in production_links:
                staff_div = link.find('div', class_='OuXf8uf')

                if staff_div:
                    name_elem = staff_div.find('div', class_='iO6bs1d')
                    role_elem = staff_div.find('div', class_='_99DZmqJ')

                    if name_elem and role_elem:
                        name = name_elem.get_text(strip=True)
                        role = role_elem.get_text(strip=True)

                        if role not in production_info:
                            production_info[role] = []

                        if isinstance(production_info[role], list):
                            if name not in production_info[role]:  # avoid duplicates
                                production_info[role].append(name)
                        else:
                            production_info[role] = [production_info[role], name]

                    if staff_div.get('title'):
                        if not name_elem:
                            name = staff_div['title']
                            if role_elem:
                                role = role_elem.get_text(strip=True)
                                if role not in production_info:
                                    production_info[role] = name

        # Convert name lists to comma-separated strings
        for role, names in production_info.items():
            if isinstance(names, list):
                production_info[role] = ', '.join(names)

        return production_info


class ParallelAnilifeScraper:
    def __init__(self, max_workers=10):
        self.max_workers = max_workers
        self.results = []
        self.errors = []
        self.lock = Lock()
        self.progress_lock = Lock()
        self.completed_count = 0
        self.total_count = 0

    def scrape_single(self, anime_id: int) -> Dict:
        """Crawl a single anime page."""
        url = f"https://anilife.app/content/{anime_id}?tab=info"
        scraper = AnilifeScraper()

        try:
            result = scraper.scrape_anime_info(url)

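            with self.progress_lock:
                self.completed_count += 1
                # Skip the percentage when total_count is 0 (e.g. scrape_single called directly)
                if self.completed_count % 10 == 0 and self.total_count:
                    logging.info(f"진행률: {self.completed_count}/{self.total_count} ({self.completed_count/self.total_count*100:.1f}%)")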

            return result
        except Exception as e:
            logging.error(f"ID {anime_id} 크롤링 실패: {str(e)}")
            return {"error": str(e), "id": anime_id, "url": url}

    def process_result(self, anime_data: Dict) -> Dict:
        """Convert a crawl result into a flat dictionary suitable for CSV output."""
        if "error" in anime_data:
            return {"id": anime_data.get("id", ""), "error": anime_data["error"]}

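        flat_data = {
            "id": anime_data.get("id", ""),
            "url": anime_data.get("url", ""),
            # "image_url" is a CSV column in save_to_csv; this assumes the scraper
            # result exposes a key of the same name and falls back to "" otherwise
            "image_url": anime_data.get("image_url", ""),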
            "title_korean": anime_data.get("title", {}).get("korean", ""),
            "title_japanese": anime_data.get("title", {}).get("japanese", ""),
            "title_english": anime_data.get("title", {}).get("english", ""),
            "format": anime_data.get("basic_info", {}).get("format", ""),
            "status": anime_data.get("basic_info", {}).get("status", ""),
            "year": anime_data.get("basic_info", {}).get("year", ""),
            "quarter": anime_data.get("basic_info", {}).get("quarter", ""),
            "start_date": anime_data.get("basic_info", {}).get("start_date", ""),
            "end_date": anime_data.get("basic_info", {}).get("end_date", ""),
            "total_episodes": anime_data.get("basic_info", {}).get("total_episodes", ""),
            "duration": anime_data.get("basic_info", {}).get("duration", ""),
            "genres": "|".join(anime_data.get("genres", [])),
            "tags": "|".join(anime_data.get("tags", [])),
            "synopsis": anime_data.get("synopsis", ""),
            "num_characters": len(anime_data.get("characters_voice_actors", [])),
            "main_characters": "|".join([
                f"{c.get('character_name', '')}({c.get('voice_actor', '')})"
                for c in anime_data.get("characters_voice_actors", [])[:5]
            ]),
            "director": anime_data.get("production_info", {}).get("감독", ""),
            "studio": anime_data.get("production_info", {}).get("애니메이션 제작", ""),
            "original_work": anime_data.get("production_info", {}).get("원작자", "") or anime_data.get("production_info", {}).get("원작", ""),
            "error": ""
        }

        return flat_data

    def scrape_range(self, start_id: int, end_id: int, batch_size: int = 100):
        """Crawl the given ID range in parallel."""
        self.total_count = end_id - start_id + 1
        self.completed_count = 0

        logging.info(f"크롤링 시작: ID {start_id}부터 {end_id}까지 (총 {self.total_count}개)")

        # Process the range one batch at a time
        for batch_start in range(start_id, end_id + 1, batch_size):
            batch_end = min(batch_start + batch_size - 1, end_id)
            batch_ids = list(range(batch_start, batch_end + 1))

            logging.info(f"배치 처리 중: ID {batch_start} ~ {batch_end}")

            with ThreadPoolExecutor(max_workers=self.max_workers) as executor:
                futures = {executor.submit(self.scrape_single, anime_id): anime_id
                          for anime_id in batch_ids}

                for future in as_completed(futures):
                    anime_id = futures[future]
                    try:
                        # Futures yielded by as_completed are already finished, so no timeout is needed
                        result = future.result()
                        processed_result = self.process_result(result)

                        with self.lock:
                            if processed_result.get("error"):
                                self.errors.append(processed_result)
                            else:
                                self.results.append(processed_result)

                    except Exception as e:
                        logging.error(f"ID {anime_id} 처리 실패: {str(e)}")
                        with self.lock:
                            self.errors.append({"id": anime_id, "error": str(e)})

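            # Pause between batches to limit load on the server
            time.sleep(2)

            # Intermediate save each time the number of processed IDs crosses a multiple of 500.
            # (Checking len(self.results) % 500 == 0 after a batch rarely fires once some IDs fail.)
            processed_before = batch_start - start_id
            processed_after = batch_end - start_id + 1
            if self.results and processed_after // 500 > processed_before // 500:
                self.save_intermediate_results()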

    def save_intermediate_results(self):
        """Save intermediate results."""
        timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
        filename = f"anilife_intermediate_{timestamp}.csv"

        with self.lock:
            if self.results:
                self.save_to_csv(filename, self.results)
                logging.info(f"중간 결과 저장: {filename} ({len(self.results)}개 항목)")

    def save_to_csv(self, filename: str, data: List[Dict]):
        """Save results to a CSV file."""
        if not data:
            logging.warning("저장할 데이터가 없습니다.")
            return

        fieldnames = [
            "id", "url", "image_url", "title_korean", "title_japanese", "title_english",
            "format", "status", "year", "quarter", "start_date", "end_date",
            "total_episodes", "duration", "genres", "tags", "synopsis",
            "num_characters", "main_characters", "director", "studio",
            "original_work", "error"
        ]

        with open(filename, 'w', newline='', encoding='utf-8-sig') as csvfile:
            writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(data)

        logging.info(f"CSV 파일 저장 완료: {filename}")

    def save_all_results(self):
        """Save all results."""
        timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')

        # Save successful records
        if self.results:
            success_filename = f"anilife_data_{timestamp}.csv"
            self.save_to_csv(success_filename, self.results)
            logging.info(f"성공 데이터: {len(self.results)}개 항목")

        # Save failed records
        if self.errors:
            error_filename = f"anilife_errors_{timestamp}.csv"
            self.save_to_csv(error_filename, self.errors)
            logging.info(f"에러 데이터: {len(self.errors)}개 항목")

        # Log summary statistics
        total = len(self.results) + len(self.errors)
        success_rate = (len(self.results) / total * 100) if total > 0 else 0

        logging.info(f"\n크롤링 완료 통계:")
        logging.info(f"- 전체: {total}개")
        logging.info(f"- 성공: {len(self.results)}개")
        logging.info(f"- 실패: {len(self.errors)}개")
        logging.info(f"- 성공률: {success_rate:.1f}%")


def main():
    """Main entry point."""
    # Settings
    START_ID = 101
    END_ID = 7000
    MAX_WORKERS = 30  # number of concurrent threads (tune with server load in mind)
    BATCH_SIZE = 100  # number of IDs processed per batch

    # Initialize the scraper
    scraper = ParallelAnilifeScraper(max_workers=MAX_WORKERS)

    # Record the start time
    start_time = time.time()

    try:
        # Run the crawl
        scraper.scrape_range(START_ID, END_ID, batch_size=BATCH_SIZE)

        # Save the results
        scraper.save_all_results()

    except KeyboardInterrupt:
        logging.info("\n크롤링이 사용자에 의해 중단되었습니다.")
        scraper.save_all_results()

    except Exception as e:
        logging.error(f"크롤링 중 오류 발생: {str(e)}")
        scraper.save_all_results()

    finally:
        # Log the elapsed time
        elapsed_time = time.time() - start_time
        hours = int(elapsed_time // 3600)
        minutes = int((elapsed_time % 3600) // 60)
        seconds = int(elapsed_time % 60)

        logging.info(f"\n총 소요 시간: {hours}시간 {minutes}분 {seconds}초")


if __name__ == "__main__":
    main()

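The batching pattern used by `scrape_range` (submit one batch to a `ThreadPoolExecutor`, collect with `as_completed`, sort each outcome into results or errors) can be tried without touching the network. The sketch below is a minimal, hypothetical stand-in: `fake_scrape` and `run_batches` are illustrative names that are not part of the notebook, and the failure rule (every ID divisible by 7 fails) is invented purely for the demo. The log output that follows belongs to the original `main()` run, not to this sketch.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fake_scrape(anime_id: int) -> dict:
    """Hypothetical stand-in for scrape_single: no network access."""
    if anime_id % 7 == 0:  # invented failure rule, for illustration only
        raise ValueError(f"simulated failure for ID {anime_id}")
    return {"id": anime_id, "title_korean": f"title-{anime_id}"}


def run_batches(start_id: int, end_id: int, batch_size: int = 10, max_workers: int = 4):
    """Process IDs batch by batch and split outcomes into results and errors."""
    results, errors = [], []
    for batch_start in range(start_id, end_id + 1, batch_size):
        batch_end = min(batch_start + batch_size - 1, end_id)
        with ThreadPoolExecutor(max_workers=max_workers) as executor:
            # Map each future back to the ID it was submitted for
            futures = {executor.submit(fake_scrape, i): i
                       for i in range(batch_start, batch_end + 1)}
            # as_completed yields futures in completion order, not submission order
            for future in as_completed(futures):
                anime_id = futures[future]
                try:
                    results.append(future.result())
                except Exception as exc:
                    errors.append({"id": anime_id, "error": str(exc)})
    return results, errors


results, errors = run_batches(1, 25)
print(len(results), "succeeded;", len(errors), "failed")  # 22 succeeded; 3 failed
print(sorted(e["id"] for e in errors))                    # [7, 14, 21]
```

Because the `as_completed` loop runs in the calling thread, the two lists in this sketch are only ever touched by one thread, so no lock is needed here.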
2025-09-08 10:18:29,868 - INFO - 크롤링 시작: ID 101부터 7000까지 (총 6900개)
2025-09-08 10:18:29,869 - INFO - 배치 처리 중: ID 101 ~ 200
2025-09-08 10:18:30,453 - INFO - 진행률: 10/6900 (0.1%)
2025-09-08 10:18:31,117 - INFO - 진행률: 20/6900 (0.3%)
2025-09-08 10:18:31,360 - INFO - 진행률: 30/6900 (0.4%)
2025-09-08 10:18:31,618 - INFO - 진행률: 40/6900 (0.6%)
2025-09-08 10:18:31,933 - INFO - 진행률: 50/6900 (0.7%)
2025-09-08 10:18:32,248 - INFO - 진행률: 60/6900 (0.9%)
2025-09-08 10:18:32,514 - INFO - 진행률: 70/6900 (1.0%)
2025-09-08 10:18:32,760 - INFO - 진행률: 80/6900 (1.2%)
2025-09-08 10:18:32,990 - INFO - 진행률: 90/6900 (1.3%)
2025-09-08 10:18:33,278 - INFO - 진행률: 100/6900 (1.4%)
2025-09-08 10:18:35,707 - INFO - 배치 처리 중: ID 201 ~ 300
2025-09-08 10:18:38,263 - INFO - 진행률: 110/6900 (1.6%)
2025-09-08 10:18:38,503 - INFO - 진행률: 120/6900 (1.7%)
2025-09-08 10:18:38,954 - INFO - 진행률: 130/6900 (1.9%)
2025-09-08 10:18:39,224 - INFO - 진행률: 140/6900 (2.0%)
2025-09-08 10:18:39,355 - INFO - 진행률: 150/6900 (2.2%)
2025-09-08 10:18:39,59