Each part is run as a separate `.py` script so that CUDA memory is automatically released when that part's process exits.

This notebook organizes the code for each part of the pipeline.
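
The idea of running each part in its own process can be sketched with the standard library. The helper name `run_stage` is hypothetical and not part of this project; it only wraps `subprocess.run`, and the cells below use `os.system` instead.

```python
import subprocess
import sys


def run_stage(script, *args):
    """Run a stage script in a child Python process and return its exit code.

    `run_stage` is a hypothetical helper, not part of the project.
    """
    cmd = [sys.executable, script, *map(str, args)]
    # The child process owns its own CUDA context, so the GPU memory it
    # allocated is given back when the process exits.
    completed = subprocess.run(cmd)
    return completed.returncode
```

Passing the arguments as a list rather than one shell string also avoids quoting problems when a path contains spaces.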

## Setup

### What you need to do

- Put the audio files into the `data/` directory

- Set `input_list` to the names of those audio files

- Create a `.env` file in the `process/` directory if using `glm-4`
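
Before running any stage, it may help to confirm that every name in `input_list` really exists on disk. The following is a minimal sketch; `missing_inputs` is a hypothetical helper, and the directory to check is passed in explicitly because it depends on where the audio files were placed.

```python
import os


def missing_inputs(data_dir, names):
    """Return the entries of `names` that are not files inside `data_dir`.

    `missing_inputs` is a hypothetical helper, not part of the project.
    """
    return [n for n in names if not os.path.isfile(os.path.join(data_dir, n))]
```

An empty result means every listed file was found.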

In [1]:
import os

In [2]:
DATA_PATH = os.getcwd() + '/data/'
MODEL_PATH = '/ssdshare/LLMs/'
MUSIC_PATH = os.getcwd() + '/data/music/'
LLM_MODEL = "glm-4"
GENRATE_MODEL = "playground-v2.5-1024px-aesthetic"
CONTENT_PATH = DATA_PATH + '.tmp/generate/'
STYLE_PATH = DATA_PATH + 'style/'

# Create one working directory per stage under data/.tmp/.
# exist_ok=True makes this safe to re-run and also creates .tmp/ itself.
folders = ['extract/', 'generate/', 'process/', 'inprompt/', 'style_transfer/']

for folder in folders:
  os.makedirs(DATA_PATH + '.tmp/' + folder, exist_ok=True)


In [3]:
input_list = [
  'FULi AUTO SHOOTER.mp3',
]
prompts = [r'''
  The name of this song is 'FULi AUTO SHOOTER'. 
''',
]
# Pick the style images in the style library
style_list = [
  # 'opia.png'
]
# Make sure both input_list and prompts have been updated (one prompt per input file).
with open(DATA_PATH + 'input_list.txt', 'w') as f:
  for item in input_list:
    f.write("%s\n" % item)

with open(DATA_PATH + 'style_list.txt', 'w') as f:
  for item in style_list:
    f.write("%s\n" % item)

# Strip the file extension (e.g. '.mp3') from each input name.
input_list = [os.path.splitext(item)[0] for item in input_list]

# if not os.path.exists(DATA_PATH + '.tmp/inprompt/'):
#   os.makedirs(DATA_PATH + '.tmp/inprompt/')
for (prompt, name) in zip(prompts, input_list):
  with open(DATA_PATH + '.tmp/inprompt/' + name + '.prompt', 'w') as f:
    f.write(prompt)
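
Note that `zip` stops at the shorter of its arguments, so a missing prompt would silently skip the corresponding song. A minimal sketch of an explicit check follows; `pair_prompts` is a hypothetical helper, not part of this project.

```python
def pair_prompts(prompts, names):
    """Pair each prompt with its song name, failing loudly on a length mismatch.

    `pair_prompts` is a hypothetical helper, not part of the project.
    """
    if len(prompts) != len(names):
        raise ValueError(
            f"{len(prompts)} prompts given for {len(names)} input files"
        )
    return list(zip(prompts, names))
```

On Python 3.10 or later, `zip(prompts, names, strict=True)` performs a similar check.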

## Extract

In [7]:
os.system(f'python extract/extract.py --model_path {MODEL_PATH} --data_path {DATA_PATH} --music_path {MUSIC_PATH} --output_path {DATA_PATH}.tmp/extract/ --device_num 2')

FULi AUTO SHOOTER.mp3
['FULi AUTO SHOOTER.wav']
audio_start_id: 155163, audio_end_id: 155164, audio_pad_id: 151851.


The model is automatically converting to bf16 for faster inference. If you want to disable the automatic precision, please manually add bf16/fp16/fp32=True to "AutoModelForCausalLM.from_pretrained".
Try importing flash-attention for faster inference...
Loading checkpoint shards: 100%|██████████| 9/9 [00:03<00:00,  2.56it/s]
The model is automatically converting to bf16 for faster inference. If you want to disable the automatic precision, please manually add bf16/fp16/fp32=True to "AutoModelForCausalLM.from_pretrained".
Try importing flash-attention for faster inference...
Loading checkpoint shards: 100%|██████████| 9/9 [00:03<00:00,  2.64it/s]
Traceback (most recent call last):
  File "/root/LLM_project/codes/extract/extract.py", line 173, in <module>
    tmp, device_start = make_prompt(file_name[:-4], device_start = device_start)
  File "/root/LLM_project/codes/extract/extract.py", line 152, in make_prompt
    description, lyrics, device_start = partition_extract(file_name, device_sta

using device 0


256
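
The `256` above is the value returned by `os.system`. On Unix this is a wait status rather than the exit code itself, and it can be decoded with `os.waitstatus_to_exitcode` (available since Python 3.9). A minimal sketch, where `describe_status` is a hypothetical helper name:

```python
import os


def describe_status(status):
    """Turn the integer returned by os.system into a short readable message.

    `describe_status` is a hypothetical helper, not part of the project.
    """
    code = os.waitstatus_to_exitcode(status)
    if code == 0:
        return "ok"
    if code < 0:
        # Negative values mean the process was killed by a signal (Unix only).
        return f"killed by signal {-code}"
    return f"failed with exit code {code}"
```

A status of `256` decodes to exit code 1, which matches the traceback printed by the script.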

In [6]:
for file_name in input_list:
  with open(DATA_PATH + '.tmp/extract/' + file_name + '.prompt', 'r') as f:
    print(f.read())

This music is cut into 8 pieces. Each piece has a length of 30 seconds and an overlap of 5 seconds. The description of each piece is as follows:
Description piece 1: This is a song whose genre is Pop, and the lyrics are "求生之命 适合作为 破坏者 邀请一坛酒请 我保护你 你系谁的眷属 请风流一流 永远不息的血泪 湖光山色 情义江湖".
Description piece 2: The genre of the music is electronic. The tempo is fast with a strong beat. The music is upbeat and energetic. The lyrics are in Chinese. The lyrics are about a person who is deeply in love with someone and is willing to do anything for them. The music is perfect for a dance party or a club.
Description piece 3: This is a high-energy electronic dance music piece. It features a catchy melody, a strong beat, and synthesizer arrangements. The overall emotion of the piece is energetic and upbeat. It would be suitable for use in a variety of settings, including workout videos, dance clubs, and video games. The gender of the piece is difficult to determine as it does not contain any lyrics.
Descr

## Process

In [11]:
os.system(f'python process/process.py --model_path {MODEL_PATH} --data_path {DATA_PATH} --model {LLM_MODEL} --prompt_path {DATA_PATH}.tmp/extract/ --output_path {DATA_PATH}.tmp/process/')

['茶鸣拾贰律 - Feast远东之宴']
/root/LLM_project/codes/data/.tmp/inprompt/茶鸣拾贰律 - Feast远东之宴.prompt_total
Loading model
User return code: 0
User output: 
User error: [?25l⠙ [?25h[?25l[2K[1G⠹ [?25h[?25l[2K[1G⠸ [?25h[?25l[2K[1G⠼ [?25h[?25l[2K[1G⠴ [?25h[?25l[2K[1G⠦ [?25h[?25l[2K[1G⠧ [?25h[?25l[2K[1G⠇ [?25h[?25l[2K[1G⠏ [?25h[?25l[2K[1G⠋ [?25h[?25l[2K[1G⠙ [?25h[?25l[2K[1G⠹ [?25h[?25l[2K[1G⠸ [?25h[?25l[2K[1G⠼ [?25h[?25l[2K[1G⠴ [?25h[?25l[2K[1G⠦ [?25h[?25l[2K[1G⠧ [?25h[?25l[2K[1G⠇ [?25h[?25l[2K[1G⠏ [?25h[?25l[2K[1G⠋ [?25h[?25l[2K[1G⠙ [?25h[?25l[2K[1G⠹ [?25h[?25l[2K[1G⠸ [?25h[?25l[2K[1G⠼ [?25h[?25l[2K[1G⠴ [?25h[?25l[2K[1G⠦ [?25h[?25l[2K[1G⠧ [?25h[?25l[2K[1G⠇ [?25h[?25l[2K[1G⠏ [?25h[?25l[2K[1G⠋ [?25h[?25l[2K[1G⠙ [?25h[?25l[2K[1G⠹ [?25h[?25l[2K[1G⠸ [?25h[?25l[2K[1G⠼ [?25h[?25l[2K[1G⠴ [?25h[?25l[2K[1G⠦ [?25h[?25l[2K[1G⠧ [?25h[?25l[2K[1G⠇ [?25h[?25l[2K[1G⠏ 

0

In [12]:
for file_name in input_list:
  with open(DATA_PATH + '.tmp/process/' + file_name + '.prompt', 'r') as f:
    print(f.read())

Here are the prompts for image generation based on each piece:

**Piece 1**
Prompt: Vibrant nightclub scene with flashing lights, energetic dancers, and a DJ spinning tracks. Incorporate Chinese characters and symbols to reflect the song's lyrics.

**Piece 2**
Prompt: Futuristic cityscape at night with neon lights, fast cars, and a sense of excitement and energy. A couple embracing in the foreground, surrounded by dynamic lines and shapes.

**Piece 3**
Prompt: High-tech laboratory or futuristic gaming setup with sleek lines, neon accents, and a sense of intensity. Incorporate synthesizer-like patterns and shapes to reflect the music's electronic elements.

**Piece 4**
Prompt: Energetic dance party scene with strobing lights, confetti, and dancers in motion. Electric guitars and synthesizers visualized as vibrant, dynamic shapes.

**Piece 5**
Prompt: Dreamlike landscape with a river flowing through it, symbolizing hope and pursuit of happiness. Incorporate Chinese characters and waterco
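
If the per-piece prompts need to be handled one at a time, the text above could be split as sketched below. This assumes the `**Piece N**` / `Prompt: ...` layout shown in this output, which the language model is not guaranteed to reproduce; `split_piece_prompts` is a hypothetical helper and is not claimed to be what `generate.py` does.

```python
import re

# Matches a "**Piece N**" header followed by a single "Prompt: ..." line.
PIECE_RE = re.compile(r"\*\*Piece (\d+)\*\*\s*\nPrompt:\s*(.+)")


def split_piece_prompts(text):
    """Map each piece number to its prompt text (hypothetical helper)."""
    return {int(num): prompt.strip() for num, prompt in PIECE_RE.findall(text)}
```

Pieces whose prompt does not follow this layout are simply left out of the result, so the caller should compare the number of entries with the expected number of pieces.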

## Generate

In [17]:
os.system(f'python generate/generate.py --model_path {MODEL_PATH} --data_path {DATA_PATH} --model {GENRATE_MODEL} --output_path {DATA_PATH}.tmp/generate/ --prompt_path {DATA_PATH}.tmp/process/ --image_num 3')

Loading prompt from file
茶鸣拾贰律 - Feast远东之宴.prompt
Prompt loaded
Loading model


Loading pipeline components...: 100%|██████████| 7/7 [00:01<00:00,  4.26it/s]
NVIDIA GeForce RTX 4090 with CUDA capability sm_89 is not compatible with the current PyTorch installation.
The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_70.
If you want to use the NVIDIA GeForce RTX 4090 GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/



Model loaded
Generating for 茶鸣拾贰律 - Feast远东之宴.prompt


Token indices sequence length is longer than the specified maximum sequence length for this model (322 > 77). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (322 > 77). Running this sequence through the model will result in indexing errors
Traceback (most recent call last):
  File "/root/LLM_project/codes/generate/generate.py", line 102, in <module>
    output = pipe(prompt[audio], 
  File "/opt/conda/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/root/.cache/huggingface/modules/diffusers_modules/local/lpw_stable_diffusion_xl.py", line 1628, in __call__
    ) = get_weighted_text_embeddings_sdxl(
  File "/root/.cache/huggingface/modules/diffusers_modules/local/lpw_stable_diffusion_xl.py", line 361, in get_weighted_text_embeddings_sdxl
    prompt_embeds_1 = pipe.text_encoder(token_tenso

256

## Style transfer

In [6]:
os.system(f'python style_transfer/style_transfer.py --data_path {DATA_PATH} --output_path {DATA_PATH}.tmp/style_transfer/ --style_path {STYLE_PATH} --content_path {CONTENT_PATH} -c_p')

Building the style transfer model..


Style Loss : 0.222865 Content Loss: 1.303105: 100%|██████████| 300/300 [00:33<00:00,  9.00it/s]   


Building the style transfer model..


Style Loss : 0.000000 Content Loss: 0.000000:  94%|█████████▎| 281/300 [00:28<00:01,  9.70it/s]


0