Merged
12 changes: 8 additions & 4 deletions packages/crawling/src/cron/taggingSongs.ts
@@ -1,15 +1,19 @@
import { getSongTagSongIdsDB, getSongsAllDB } from '@/supabase/getDB';
import { postSongTagsDB } from '@/supabase/postDB';
import { autoTagSong } from '@/utils/getSongTag';
import { autoTagSong, getTagsForPrompt } from '@/utils/getSongTag';

const resultsLog = {
success: 0,
failed: 0,
skipped: 0,
};

// 1. Fetch all songs + load the IDs of already-tagged songs
const [allSongs, taggedSongIds] = await Promise.all([getSongsAllDB(), getSongTagSongIdsDB()]);
// 1. Fetch all songs + load the IDs of already-tagged songs + the tag prompt
const [allSongs, taggedSongIds, tagsPrompt] = await Promise.all([
getSongsAllDB(),
getSongTagSongIdsDB(),
getTagsForPrompt(),
]);

console.log('전체 곡 수:', allSongs.length);
console.log('이미 태그된 곡 수:', taggedSongIds.size);
@@ -23,7 +27,7 @@ for (const song of allSongs) {
}

try {
const tagIds = await autoTagSong(song.title, song.artist);
const tagIds = await autoTagSong(song.title, song.artist, tagsPrompt);

if (tagIds.length === 0) {
resultsLog.failed++;
59 changes: 35 additions & 24 deletions packages/crawling/src/utils/getSongTag.ts
@@ -16,14 +16,10 @@ interface Tag {
category: string;
}

let cachedTagsPrompt: string | null = null;

/**
* Reads the full tag list from the DB and converts it into text for the AI prompt.
*/
const getTagsForPrompt = async (): Promise<string> => {
if (cachedTagsPrompt) return cachedTagsPrompt;

export const getTagsForPrompt = async (): Promise<string> => {
const supabase = getClient();
const { data: tags, error } = await supabase
.from('tags')
@@ -36,34 +32,51 @@ const getTagsForPrompt = async (): Promise<string> => {
}

// Convert to "ID: name (category)" format so the AI can read it easily
cachedTagsPrompt = tags.map((tag: Tag) => `${tag.id}: ${tag.name} (${tag.category})`).join('\n');
return cachedTagsPrompt;
return tags.map((tag: Tag) => `${tag.id}: ${tag.name} (${tag.category})`).join('\n');
};

/**
* Uses AI to extract the appropriate tag IDs for a song.
*/
export const autoTagSong = async (title: string, artist: string): Promise<number[]> => {
export const autoTagSong = async (
title: string,
artist: string,
tagsPrompt: string,
): Promise<number[]> => {
try {
// Step 1: prepare the tag list for the prompt
const tagsPrompt = await getTagsForPrompt();
if (!tagsPrompt) return [];

// Step 1: pre-analyze the strings with regular expressions (Harness)
const hasHangul = /[ㄱ-ㅎ|ㅏ-ㅣ|가-힣]/.test(title + artist);
const hasKana = /[ぁ-んァ-ヶ]/.test(title + artist);

// Build a strong hint to give the LLM
const languageHints = `
- [Detected Script] Hangul Present: ${hasHangul}, Japanese Kana Present: ${hasKana}
`.trim();

// Step 2: call the OpenAI API
const response = await client.chat.completions.create({
model: 'gpt-4o-mini', // use a cost-effective model
model: 'gpt-4o-mini',
messages: [
{
role: 'system',
content: `
You are a music database expert. Based on the song title and artist, categorize the song by selecting appropriate tag IDs from the provided list.
You are a music database expert specializing in global artist categorization.

[Language Selection Strategy]
- **Do NOT** assume a song is 102 (팝송) solely based on English/Latin characters.
- If title/artist are in English, research the **artist's origin and primary market**.
- Priority Logic:
1. If Hangul is detected OR the artist is a K-Pop artist: Select 100 (한국노래).
2. If Kana is detected OR the artist is a J-Pop/Japanese artist: Select 101 (일본노래).
3. Select 102 (팝송) ONLY if the artist is primarily from Western/English-speaking regions.
4. For all other cases or truly global/mixed origins, use 103 (글로벌).

[Selection Rules]
- Language Slot (100-199): EXACTLY 1 tag.
- Genre Slot (200-299): EXACTLY 1 tag.
- Origin Slot (300-399): 1 to 2 tags, sorted by relevance.

Guidelines:
1. Select at least one tag, but no more than 4.
2. Prioritize Language (100s), then Genre (200s), then Origin (300s).
3. If it's Japanese music, ALWAYS include 101 (J-POP).
4. Be precise. If it's from an Anime, use 302 (애니메이션).
5. Return only JSON: {"tag_ids": [number, number, ...]}
[Contextual Hints]
${languageHints}

Allowed Tags List:
${tagsPrompt}
Comment on lines 61 to 82
Action required

1. LLM JSON schema mismatch 🐞 Bug ≡ Correctness

autoTagSong parses the response as {tag_ids:number[]} and returns result.tag_ids as-is, but the new system prompt has no
output-schema instruction requiring the tag_ids key, so if the model responds under a different key in json_object mode, tagIds becomes undefined. As a result,
accessing tagIds.length in taggingSongs throws a TypeError and tagging fails for that song.
Agent Prompt
### Issue description
`autoTagSong()` assumes the LLM response always provides `tag_ids` and returns `result.tag_ids` as-is. However, the current prompt only enforces `json_object` and never names the `tag_ids` field, so if the model returns a different key, a runtime error can occur at `tagIds.length`.

### Issue Context
- `response_format: { type: 'json_object' }` guarantees only a **valid JSON object**; it does not guarantee the object's **field names/schema**.
- Downstream code (`taggingSongs.ts`) assumes that `tagIds` is an array.

### Fix Focus Areas
- packages/crawling/src/utils/getSongTag.ts[56-102]
- packages/crawling/src/cron/taggingSongs.ts[27-37]

### What to change
1) Restore/add an explicit output schema in the system prompt. For example:
- "Return ONLY valid JSON with EXACTLY this shape: {\"tag_ids\": number[]}".

2) Add runtime validation after parsing.
- `const parsed = JSON.parse(content)`
- `const tagIds = Array.isArray(parsed.tag_ids) ? parsed.tag_ids : []`
- If the value is invalid, log it and return `[]`.

3) (Optional) In `taggingSongs.ts`, also defensively check `Array.isArray(tagIds)` before using it (a second safety net).
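
The optional call-site guard in item 3 could look roughly like the sketch below. `classifyTagResult` and `recordOutcome` are hypothetical helpers that do not appear in the PR, and `resultsLog` here is a local stand-in for the counter object in `taggingSongs.ts`. The point is only that a non-array value is counted as a failure instead of throwing on `.length`.

```typescript
type TagOutcome = 'success' | 'failed';

// Hypothetical helper (not in the PR): classify whatever autoTagSong returned
// without assuming it is an array.
const classifyTagResult = (tagIds: unknown): TagOutcome =>
  Array.isArray(tagIds) && tagIds.length > 0 ? 'success' : 'failed';

// Local stand-in for the counters kept in taggingSongs.ts.
const resultsLog = { success: 0, failed: 0, skipped: 0 };

// Record the outcome for one song; undefined or a non-array counts as failed.
const recordOutcome = (tagIds: unknown): void => {
  resultsLog[classifyTagResult(tagIds)]++;
};
```

With this shape, a response that lacks `tag_ids` lands in the `failed` counter through the normal path rather than through the `catch` branch.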


Comment on lines 60 to 82
Action required

1. Ambiguous LLM JSON contract 🐞 Bug ≡ Correctness

autoTagSong() still parses the response as {"tag_ids": number[]} but the updated system prompt no
longer instructs the model to return a tag_ids field, so valid JSON responses without that key will
make the function return undefined and cause repeated per-song failures.
Agent Prompt
## Issue description
`autoTagSong()` parses the OpenAI response as `{ tag_ids: number[] }`, but the updated prompt no longer requires the model to return a `tag_ids` field. This makes the parser contract ambiguous and can yield `undefined`/non-array values.

## Issue Context
The call uses `response_format: { type: 'json_object' }`, which enforces JSON validity but does not guarantee a particular key name or schema.

## Fix Focus Areas
- packages/crawling/src/utils/getSongTag.ts[55-99]

## Suggested fix
1. Re-add an explicit output instruction in the system prompt, e.g.:
   - `Return JSON with this exact shape: {"tag_ids": [number, ...]}`
2. Add runtime validation after parsing:
   - If `result.tag_ids` is not an array of numbers, return `[]` (and optionally log the raw content for debugging).
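
A minimal sketch of the runtime validation described in step 2, assuming the raw message content is passed in as a string. `parseTagIds` and `isIntegerArray` are hypothetical names, not functions from the PR, and the integer check is stricter than the comment strictly asks for (it rejects `1.5` as well as `"100"`).

```typescript
// Hypothetical type guard (not in the PR): true only for arrays of integers.
const isIntegerArray = (value: unknown): value is number[] =>
  Array.isArray(value) && value.every((v) => typeof v === 'number' && Number.isInteger(v));

// Hypothetical parser (not in the PR): returns [] for anything that is not
// a JSON object of the shape {"tag_ids": number[]}.
const parseTagIds = (content: string | null | undefined): number[] => {
  if (!content) return [];
  try {
    const parsed: unknown = JSON.parse(content);
    if (typeof parsed === 'object' && parsed !== null && 'tag_ids' in parsed) {
      const tagIds = (parsed as { tag_ids: unknown }).tag_ids;
      if (isIntegerArray(tagIds)) return tagIds;
    }
    // Valid JSON, but not the expected contract: keep the raw text for debugging.
    console.warn('Unexpected LLM payload:', content);
    return [];
  } catch {
    // The content was not valid JSON at all.
    console.warn('LLM payload is not valid JSON:', content);
    return [];
  }
};
```

Inside `autoTagSong`, the last two lines of the `try` block could then become `return parseTagIds(content);`, so the function keeps its `Promise<number[]>` signature even when the model answers under a different key.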


@@ -75,14 +88,12 @@ export const autoTagSong = async (title: string, artist: string): Promise<number
},
],
response_format: { type: 'json_object' },
temperature: 0, // set to 0 for consistent results
max_tokens: 50, // the result is short, so cap the token count
temperature: 0,
});

const content = response.choices[0].message.content;
if (!content) return [];

// Step 3: parse and return the result
const result: { tag_ids: number[] } = JSON.parse(content);
return result.tag_ids;
} catch (error) {