Correctly trained weights, clustering optimization
KakaruHayate committed Jan 13, 2024
1 parent 70e6453 commit 4c702b0
Showing 13 changed files with 522 additions and 255 deletions.
51 changes: 39 additions & 12 deletions README.md
@@ -10,11 +10,40 @@ The encoder model is trained on a total of 303 speakers and 52 hours of data

# Introduction

ColorSplitter is a command-line tool designed to classify the timbre styles of single-speaker data in the early stages of vocal data processing.
ColorSplitter is a command-line tool for classifying the timbre styles of single-speaker data in the preprocessing stage of singing-voice datasets.

**Please note**, this project is based on speaker identification technology, and it is currently uncertain whether the timbre changes in singing are completely related to the differences in voiceprints, just for fun :)
For scenarios that do not require style classification, this tool can also be used to filter data, which reduces unstable timbre performance in the trained model.

The research in this field is still lacking, and this is just a start. Thanks to the community users:洛泠羽
**Please note** that this project is based on Speaker Verification technology; it is not yet clear whether the timbre changes in singing are fully explained by voiceprint differences. Just for fun :)

Research in this field is still scarce; we hope this inspires more ideas.

Thanks to the community user: 洛泠羽

# New version features

Clustering results are now optimized automatically, so users no longer need to judge the optimal clustering result themselves.

`splitter.py` drops the `--nmax` parameter and adds `--nmin` (minimum number of timbre types; ignored when `--cluster` is 2), `--cluster` (clustering method: 1 = SpectralCluster, 2 = UmapHdbscan), and `--mer_cosine` (merge clusters that are too similar).
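For intuition, here is a minimal sketch of what such automatic selection can look like, assuming a silhouette-score criterion (the score the old manual workflow exposed) and a centroid cosine-similarity merge; the function, thresholds, and selection rule are illustrative assumptions, not ColorSplitter's actual implementation:

```
# Illustrative sketch only, not ColorSplitter's real code.
# Try several cluster counts, keep the count with the best silhouette
# score, then merge clusters whose centroids are nearly identical.
import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.metrics import silhouette_score
from sklearn.metrics.pairwise import cosine_similarity

def auto_cluster(embeds, n_min=1, n_max=14, merge_threshold=0.9):
    best_labels, best_score = np.zeros(len(embeds), dtype=int), -1.0
    for n in range(max(n_min, 2), n_max + 1):
        labels = SpectralClustering(n_clusters=n, affinity='nearest_neighbors',
                                    random_state=0).fit_predict(embeds)
        score = silhouette_score(embeds, labels)
        if score > best_score:
            best_labels, best_score = labels, score
    # Merge clusters whose centroids point in almost the same direction.
    ids = list(np.unique(best_labels))
    centroids = np.stack([embeds[best_labels == i].mean(axis=0) for i in ids])
    sim = cosine_similarity(centroids)
    for a in range(len(ids)):
        for b in range(a + 1, len(ids)):
            if sim[a, b] > merge_threshold:
                best_labels[best_labels == ids[b]] = ids[a]
    return best_labels
```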

**New version tips**

1. Run `splitter.py` with the default parameters, specifying only the speaker.

2. If the result has only one cluster, check the distribution plot, set `--nmin` to a number you consider reasonable, and run `splitter.py` again.

3. In practice, the optimal value of `--nmin` may be smaller than you expect.

4. The new clustering algorithm is fast, so trying several runs is recommended.

# Progress

- [x] **Correctly trained weights**
- [x] Clustering algorithm optimization
- [ ] CAM++
- [ ] ERes2Net
- [ ] emotional encoder
- [ ] embed mix

# Environment Configuration

@@ -33,10 +62,10 @@ Tip: this tool runs much faster on CPU than on GPU
**1. Move your prepared Diffsinger dataset into the `.\input` folder and run the following command**

```
python splitter.py --spk <speaker_name> --nmax <'N'_max_num>
python splitter.py --spk <speaker_name> --nmin <'N'_min_num>
```

Enter the speaker name after `--spk`, and enter the maximum number of timbre types after `--nmax` (minimum 2, maximum 14)
Enter the speaker name after `--spk`, and the minimum number of timbre types after `--nmin` (minimum 1, maximum 14, default 1)
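For example (the speaker name `atri` is hypothetical, and `--mer_cosine` is assumed to be a bare on/off switch):

```
python splitter.py --spk atri
python splitter.py --spk atri --nmin 3
python splitter.py --spk atri --cluster 2 --mer_cosine
```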

Tip: this project does not read the Diffsinger dataset's annotation file (transcriptions.csv), so it works as long as the file structure matches the layout shown below
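The file-tree example itself is collapsed in this diff view; a plausible reconstruction, assuming the layout implied by step 1, is:

```
input
└── <speaker_name>
    ├── xxx1.wav
    ├── xxx2.wav
    └── ...
```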
@@ -56,19 +85,15 @@ The wav files should ideally already be split

As shown, cluster 3 is clearly a small outlier group; you can use the following command to separate it from the dataset
```
python kick.py --spk <speaker_name> --n <n_num> --clust <clust_num>
python kick.py --spk <speaker_name> --clust <clust_num>
```
The separated data will be saved in `.\input\<speaker_name>_<clust_num>`

Please note that running this step will not necessarily improve the results

**3. Find the optimal result through the silhouette score. The higher the silhouette score, the better the result, but the optimal result may not be at the highest score, it may be on the adjacent result**

![scores](IMG/{6BDE2B2B-3C7A-4de5-90E8-C55DB1FC18C0}.png)

After you select the optimal result you think, run the following command to classify the wav files in the dataset
**3. After selecting the result you consider optimal, run the following command to classify the wav files in the dataset**
```
python move_files.py --spk <speaker_name> --n <n_num>
python move_files.py --spk <speaker_name>
```
The classified results will be saved in `.\output\<speaker_name>\<clust_num>`
After that, you still need to manually merge clusters that are too small, so the data meets the training requirements
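That merging is left to the user; below is a hypothetical helper (not part of this repo) that folds any cluster folder holding fewer than `min_files` wavs into the largest cluster. It is a crude heuristic; merging into the acoustically nearest cluster would be more principled.

```
# Hypothetical helper, not part of ColorSplitter: fold undersized
# cluster folders under .\output\<speaker> into the largest cluster.
import os
import shutil

def merge_small_clusters(speaker, min_files=20):
    root = os.path.join('output', speaker)
    clusters = {d: os.listdir(os.path.join(root, d))
                for d in os.listdir(root)
                if os.path.isdir(os.path.join(root, d))}
    biggest = max(clusters, key=lambda d: len(clusters[d]))
    for name, files in clusters.items():
        if name != biggest and len(files) < min_files:
            for f in files:
                shutil.move(os.path.join(root, name, f),
                            os.path.join(root, biggest, f))
            os.rmdir(os.path.join(root, name))  # now empty
```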
@@ -80,3 +105,5 @@
# Based on Projects

[Resemblyzer](https://github.com/resemble-ai/Resemblyzer/)

[3D-Speaker](https://github.com/alibaba-damo-academy/3D-Speaker/)
47 changes: 36 additions & 11 deletions README_CN.md
@@ -10,12 +10,39 @@

ColorSplitter is a command-line tool for classifying the timbre styles of single-speaker data in the early stage of singing-voice data processing

**Please note** that this project is based on voiceprint recognition (speaker identification) technology; it is currently uncertain whether the timbre changes in singing are fully related to voiceprint differences. Just for fun :)
For scenarios that do not require style classification, this tool can also be used to filter data, which reduces unstable timbre performance in the model

**Please note** that this project is based on Speaker Verification technology; it is currently uncertain whether the timbre changes in singing are fully related to voiceprint differences. Just for fun :)

Research in this field is still scarce; this is offered in the hope of sparking further ideas

Thanks to the community user: 洛泠羽

# New version features

Automatic optimization of clustering results is now implemented, so users no longer need to judge the optimal clustering result themselves

`splitter.py` drops the `--nmax` parameter and adds `--nmin` (minimum number of timbre types; ignored when `--cluster` is 2), `--cluster` (clustering method: 1 = SpectralCluster, 2 = UmapHdbscan), and `--mer_cosine` (merge clusters that are too similar)

**New version tips**

1. Run `splitter.py` with the default parameters, specifying only the speaker.

2. If the result has only one cluster, check the distribution plot, set `--nmin` to a number you consider reasonable, and run `splitter.py` again.

3. In practice, the optimal value of `--nmin` may be smaller than you expect.

4. The new clustering algorithm is fast, so trying several runs is recommended.

# Progress

- [x] **Correctly trained weights**
- [x] Clustering algorithm optimization
- [ ] CAM++
- [ ] ERes2Net
- [ ] emotional encoder
- [ ] embed mix

# Environment Configuration

Works fine under `python3.8`; please install [Microsoft C++ Build Tools](https://visualstudio.microsoft.com/visual-cpp-build-tools/) first
@@ -32,10 +59,10 @@ pip install -r requirements.txt
**1. Move your prepared Diffsinger dataset into the `.\input` folder and run the following command**

```
python splitter.py --spk <speaker_name> --nmax <'N'_max_num>
python splitter.py --spk <speaker_name> --nmin <'N'_min_num>
```

Enter the speaker name after `--spk`, and the maximum number of timbre types after `--nmax` (minimum 2, maximum 14)
Enter the speaker name after `--spk`, and the minimum number of timbre types after `--nmin` (minimum 1, maximum 14, default 1)

Tip: this project does not read the Diffsinger dataset's annotation file (transcriptions.csv), so it works as long as the file structure matches the expected layout (the file-tree example is collapsed in this diff view)

@@ -55,19 +82,15 @@

As shown, cluster 3 is clearly a small outlier group; you can use the following command to separate it from the dataset
```
python kick.py --spk <speaker_name> --n <n_num> --clust <clust_num>
python kick.py --spk <speaker_name> --clust <clust_num>
```
The separated data will be saved in `.\input\<speaker_name>_<n_num>_<clust_num>`
The separated data will be saved in `.\input\<speaker_name>_<clust_num>`

Please note that running this step will not necessarily improve the results

**3. Find the optimal result by the silhouette score; a higher silhouette score means a better result, but the optimum is not necessarily at the highest score and may lie on an adjacent result**

![scores](IMG/{6BDE2B2B-3C7A-4de5-90E8-C55DB1FC18C0}.png)

选定你认为的最优结果后,运行以下命令将数据集中的wav文件分类
**3. After selecting the result you consider optimal, run the following command to classify the wav files in the dataset**
```
python move_files.py --spk <speaker_name> --n <n_num>
python move_files.py --spk <speaker_name>
```
The classified results will be saved in `.\output\<speaker_name>\<clust_num>`
After that, you still need to manually merge clusters that are too small, so the data meets the training requirements
@@ -77,3 +100,5 @@
# Based on Projects

[Resemblyzer](https://github.com/resemble-ai/Resemblyzer/)

[3D-Speaker](https://github.com/alibaba-damo-academy/3D-Speaker/)
6 changes: 2 additions & 4 deletions kick.py
@@ -5,23 +5,21 @@

parser = argparse.ArgumentParser()
parser.add_argument('--spk', type=str, help='Speaker name')
parser.add_argument('--n', type=str, help='N num')
parser.add_argument('--clust', type=int, help='Cluster value')

args = parser.parse_args()

Speaker_name = args.spk #Speaker name
Nnum = args.n
clust_value = args.clust # Cluster value

data = pd.read_csv(os.path.join('output', Speaker_name, f'clustered_files_{Nnum}.csv'))
data = pd.read_csv(os.path.join('output', Speaker_name, 'clustered_files.csv'))  # per-file cluster assignments: columns 'filename' and 'clust'

for index, row in data.iterrows():
file_path = row['filename']
clust = row['clust']

if clust == clust_value:
clust_dir = os.path.join('input', f'{Speaker_name}_{Nnum}_{clust_value}')
clust_dir = os.path.join('input', f'{Speaker_name}_{clust_value}')
if not os.path.exists(clust_dir):
os.makedirs(clust_dir)

6 changes: 3 additions & 3 deletions modules/Resemblyzer/visualizations.py
@@ -96,8 +96,8 @@ def plot_projections(embeds, speakers, ax=None, colors=None, markers=None, legen

# Compute the 2D projections. You could also project to another number of dimensions (e.g.
# for a 3D plot) or use a different dimensionality reduction like PCA or TSNE.
#reducer = UMAP(**kwargs)
reducer = TSNE(init='pca', **kwargs)
reducer = UMAP(**kwargs)
#reducer = TSNE(init='pca', **kwargs)
projs = reducer.fit_transform(embeds)

# Draw the projections
@@ -107,7 +107,7 @@ def plot_projections(embeds, speakers, ax=None, colors=None, markers=None, legen
speaker_projs = projs[speakers == speaker]
marker = "o" if markers is None else markers[i]
label = speaker if legend else None
ax.scatter(*speaker_projs.T, s=100, c=[colors[i]], marker=marker, label=label, edgecolors='k')
ax.scatter(*speaker_projs.T, s=60, c=[colors[i]], marker=marker, label=label, edgecolors='k')
center = speaker_projs.mean(axis=0)
ax.scatter(*center, s=200, c=[colors[i]], marker="X", edgecolors='k')

183 changes: 0 additions & 183 deletions modules/Resemblyzer/voice_encoder.py

This file was deleted.

