# WeNet Model Evaluation On Stutterance

1. Turn original transcripts into dataframes
2. Read generated transcript dataframes
3. Get Error & Stutterance count
4. Get Error & Stutterance Type ==> show which stutterance type is more prone to error
5. Calculate Correlation Score
6. Generate Heatmap
7. Summarize Trend

*** 

### Error Metrics
1. Net / Total Word Error Rate
2. Word Error Rate Specific After Cleaning other Stutterance Type Annotations

***

## Using Custom Kernel on SCC

SCC sometimes has the problem with installed library not importable [`module not found` error], this is an alternative.

Assuming you have a conda environment created, you would do the following:
1. `conda install -c anaconda ipykernel` 
2. `python -m ipykernel install --user --name=<env name>`
3. If the new kernel cannot be found, relaunch a new SCC instance

**Remember to switch to the conda env kernel**

In [35]:
ds_transcript_path = "/projectnb/ds549/projects/AImpower/datasets/updated_annotation_deid_full"

In [36]:
!pip install pandas numpy scipy tqdm

Defaulting to user installation because normal site-packages is not writeable

[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m23.2.1[0m[39;49m -> [0m[32;49m24.3.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


---

## Word Error Rate

Objectives:
* split sequence into characters
* count:
    * deletion: missing words
    * substitutions: wrongly recognized words
    * insertions: extra words

In [37]:
def wer(candidate, reference):
    """
    Parameter(s)
    ------------
    candidate ==> generated transcript
    reference ==> dataset transcript
    """
    
    candidate_tokens = list(candidate)
    reference_tokens = list(reference)
    
    cand_len = len(candidate_tokens)
    ref_len = len(reference_tokens)
    
    dist_mat = np.zeros((ref_len, cand_len), dtype=int)
    
    for i in range(ref_len):
        dist_mat[i][0] = i
    for j in range(cand_len):
        dist_mat[0][j] = j
        
    for i in range(1, ref_len):
        for j in range(1, cand_len):
            if (candidate_tokens[j - 1] == reference_tokens[i - 1]):
                cost = 0
            else:
                cost = 1
                
            dist_mat[i][j] = min(
                dist_mat[i-1][j] + 1,
                dist_mat[i][j-1] + 1,
                dist_mat[i-1][j-1] + cost
            )
            # print(dist_mat)
            
    wer = dist_mat[-1][-1] / len(reference_tokens)
    return wer

***

## Imports and Data Ingestion 

In [38]:
import pandas as pd
import numpy as np
import scipy
import os
from tqdm import tqdm
import re
import sys

In [39]:
net_data = pd.DataFrame(columns=["Filename", "Start_time", "End_time", "Transcript"]) 
net_aigenerated_data_wenet = pd.read_csv("/projectnb/ds549/projects/AImpower/datasets/generated-transcripts/WeNet.csv", delimiter=",")

del net_aigenerated_data_wenet[net_aigenerated_data_wenet.columns[0]]

In [40]:
for folder in os.listdir(ds_transcript_path):
    if folder == "command_stats.xlsx" or folder == "command_stats.csv":
        continue
    for audio_sample in os.listdir(os.path.join(ds_transcript_path, f"{folder}")):
        if ("_A.txt" in audio_sample):
            net_data = pd.concat([net_data, pd.read_csv(os.path.join(ds_transcript_path, f"{folder}/{audio_sample}"), sep="\t", names=["Start_time", "End_time", "Transcript"]).assign(Filename=f"D{folder}_A")])
        if ("_B.txt" in audio_sample):
            net_data = pd.concat([net_data, pd.read_csv(os.path.join(ds_transcript_path, f"{folder}/{audio_sample}"), sep="\t", names=["Start_time", "End_time", "Transcript"]).assign(Filename=f"D{folder}_B")])
        if ("P" in audio_sample):
            net_data = pd.concat([net_data, pd.read_csv(os.path.join(ds_transcript_path, f"{folder}/{audio_sample}"), sep="\t", names=["Start_time", "End_time", "Transcript"]).assign(Filename=f"P{folder}")])

In [41]:
mask_pattern = r"\<.*?\>"
repetition_pattern = r"\[.*?\]"
annotation_pattern = r"/\w"


net_data = net_data.assign(Cleaned_Transcript=net_data['Transcript'].apply(lambda x: re.sub(annotation_pattern, "", re.sub(repetition_pattern, "", re.sub(mask_pattern, "", x)))))
net_data = net_data.assign(Stutterance_Count=net_data['Transcript'].apply(lambda x: len(re.findall(mask_pattern, x)) + len(re.findall(repetition_pattern, x)) + len(re.findall(annotation_pattern, x))))

In [42]:
net_data

Unnamed: 0,Filename,Start_time,End_time,Transcript,Cleaned_Transcript,Stutterance_Count
0,D0045_A,48.330,58.020,嗯，我[我/b我]现在已经工作了，我是/p一八年毕业的，然后我的专业是/p国际经济与贸易。,嗯，我现在已经工作了，我是一八年毕业的，然后我的专业是国际经济与贸易。,4
1,D0045_A,58.900,72.140,然后我现在嗯/i/p我现在的工作是/p[是]在/p[是在]<人民银行>，但是这个工作就是/p...,然后我现在嗯我现在的工作是在，但是这个工作就是可能快要就不干了。,8
2,D0045_A,74.100,81.260,嗯/i/p然后我平常的爱好是/p比较喜欢看电影，然后还喜欢打网球。,嗯然后我平常的爱好是比较喜欢看电影，然后还喜欢打网球。,3
3,D0045_A,84.930,86.320,嗯/i<overlap>结束了。,嗯结束了。,2
4,D0045_A,100.950,113.700,嗯，这个就是它[它]不是银行系统，然后它然/r后那个之所以不干，是因为就是这个，嗯/i就是现...,嗯，这个就是它不是银行系统，然后它然后那个之所以不干，是因为就是这个，嗯就是现在人民银行的县...,4
...,...,...,...,...,...,...
135,P0030,3202.732,3213.472,单曲循环歌曲最[最最]爱的为何没结果。,单曲循环歌曲最爱的为何没结果。,1
136,P0030,3216.222,3221.042,来[来]一首朴/p树的歌。,来一首朴树的歌。,2
137,P0030,3225.732,3236.142,搜[搜/r/b搜]索小灿的多/p年/p/r/b以后。,搜索小灿的多年以后。,7
138,P0030,3240.652,3253.172,播放半吨/b/r/p孙弟歌孙兄弟的歌。,播放半吨孙弟歌孙兄弟的歌。,3


In [43]:
net_aigenerated_data_wenet

Unnamed: 0,Filename,Start_time,End_time,WeNet
0,D0001_A,2081.540000,2109.650000,我说出来就比较地需要时间然后那个识别的他的不就是他等你一会你那个话还没有说完的还没有说出来的...
1,D0001_A,790.130000,796.580000,第四句有我我说的话
2,D0001_A,1562.083518,1586.220000,这部剧是不怎么评分是不怎么好的就因为评评论区的那那些人他们都在说拿到号做好惨的那个就那么勤奋...
3,D0001_A,2016.780000,2035.673559,很精准对讯飞语音还是讯飞助手来着就我记得他是叫讯飞我之前就是他
4,D0001_A,1682.670000,1709.110000,一个个是叫啥来的我忘了就是出现了这一个人然后呢他他就射了一把剑然后就就把那个拿二号给长杀了就...
...,...,...,...,...
37248,P0070,2586.616000,2589.346000,单曲循环歌曲这样而已
37249,P0070,2782.496000,2783.706000,杨幂的电影
37250,P0070,2995.296000,2998.116000,你好米娅今天柴油价怎么样
37251,P0070,2604.066000,2606.656000,单曲循环歌曲琉璃光之歌


**Now we have raw data of all audio transcriptions from datasets [updated_annotation_deid_full] in `net_data` and AI predicted transcriptions in `net_aigenerated_data_wenet`**

***

## WeNet Large WER Analysis

In [46]:
na_count_large = 0
na_count_cleaned = 0
for index, row in tqdm(net_aigenerated_data_wenet.iterrows(), total=len(net_aigenerated_data_wenet)):
    
    mask_large = (
        (net_aigenerated_data_wenet["Filename"] == row["Filename"]) &
        (net_aigenerated_data_wenet["Start_time"] == row["Start_time"])
    )

    mask_net = (
        (net_data["Filename"] == row["Filename"]) &
        (net_data["Start_time"] == row["Start_time"])
    )

    
    large_row = net_aigenerated_data_wenet.loc[mask_large]
    net_row = net_data.loc[mask_net]

    # print(large_row)
    # print('\n\n\n\n')
    # print(net_row)
    
    if large_row.empty or net_row.empty:
        print("Skipping: One of the rows is empty.")
        continue
        
    wenet = large_row["WeNet"].values[0]
    cleaned_transcript = net_row["Cleaned_Transcript"].values[0]
    
    if pd.isna(wenet) or not isinstance(wenet, str):
        print("Skipping due to missing or non-string wenet.")
        na_count_large = na_count_large + 1
        continue
    if pd.isna(cleaned_transcript) or not isinstance(cleaned_transcript, str):
        print("Skipping due to missing or non-string Cleaned_Transcript.")
        na_count_cleaned = na_count_cleaned + 1
        continue

    try:
        
        wer_value = wer(wenet, cleaned_transcript)
        
        net_aigenerated_data_wenet.loc[mask_large, "WER"] = wer_value
        
        stutterance_count = net_row["Stutterance_Count"].values[0]
        net_aigenerated_data_wenet.loc[mask_large, "Stutterance_Count"] = stutterance_count

        # Verify assignment
        # print(f'Assigned Stutterance_Count: {stutterance_count}')
        # print(net_aigenerated_data_wenet.loc[mask_large, "Stutterance_Count"])

    except Exception as e:
        print(f'ERROR: {e}')
        print('Occurred with the following data:')
        print(large_row)
        print(net_row)
        
net_aigenerated_data_wenet = net_aigenerated_data_wenet.assign(NA_Count=na_count_large)
net_aigenerated_data_wenet = net_aigenerated_data_wenet.assign(NA_Cleaned_Count=na_count_cleaned)

  2%|▏         | 711/37253 [00:04<03:51, 157.84it/s]

Skipping due to missing or non-string wenet.


  4%|▎         | 1347/37253 [00:08<03:59, 149.77it/s]

Skipping due to missing or non-string wenet.


  4%|▍         | 1455/37253 [00:09<04:07, 144.92it/s]

Skipping due to missing or non-string wenet.


  5%|▍         | 1853/37253 [00:11<03:41, 160.16it/s]

Skipping due to missing or non-string wenet.
Skipping: One of the rows is empty.


  6%|▌         | 2163/37253 [00:13<03:40, 159.02it/s]

Skipping due to missing or non-string wenet.


  6%|▌         | 2214/37253 [00:14<03:39, 159.40it/s]

Skipping due to missing or non-string wenet.


  7%|▋         | 2736/37253 [00:17<03:30, 164.01it/s]

Skipping due to missing or non-string wenet.
Skipping due to missing or non-string wenet.


 10%|█         | 3838/37253 [00:25<03:17, 168.84it/s]

Skipping due to missing or non-string wenet.


 11%|█         | 4051/37253 [00:26<03:12, 172.76it/s]

Skipping due to missing or non-string wenet.
Skipping due to missing or non-string wenet.


 11%|█         | 4123/37253 [00:27<03:12, 171.90it/s]

Skipping due to missing or non-string wenet.


 12%|█▏        | 4450/37253 [00:29<03:08, 173.98it/s]

Skipping due to missing or non-string wenet.
Skipping due to missing or non-string wenet.
Skipping due to missing or non-string wenet.


 12%|█▏        | 4593/37253 [00:30<03:20, 163.16it/s]

Skipping due to missing or non-string wenet.


 13%|█▎        | 4764/37253 [00:31<03:14, 166.72it/s]

Skipping due to missing or non-string wenet.


 14%|█▎        | 5114/37253 [00:33<03:01, 176.98it/s]

Skipping due to missing or non-string wenet.


 14%|█▍        | 5239/37253 [00:33<03:18, 161.13it/s]

Skipping due to missing or non-string wenet.


 14%|█▍        | 5273/37253 [00:34<03:20, 159.13it/s]

Skipping due to missing or non-string wenet.


 14%|█▍        | 5339/37253 [00:34<03:24, 156.06it/s]

Skipping due to missing or non-string wenet.


 16%|█▌        | 5808/37253 [00:37<03:42, 141.46it/s]

Skipping due to missing or non-string wenet.


 16%|█▌        | 5921/37253 [00:38<03:12, 162.62it/s]

Skipping due to missing or non-string wenet.


 17%|█▋        | 6254/37253 [00:40<03:07, 165.31it/s]

Skipping due to missing or non-string wenet.


 17%|█▋        | 6291/37253 [00:40<02:59, 172.57it/s]

Skipping due to missing or non-string wenet.


 18%|█▊        | 6753/37253 [00:43<03:17, 154.50it/s]

Skipping due to missing or non-string wenet.


 19%|█▉        | 7165/37253 [00:46<02:51, 175.74it/s]

Skipping due to missing or non-string wenet.


 19%|█▉        | 7236/37253 [00:46<03:04, 162.82it/s]

Skipping due to missing or non-string wenet.


 20%|██        | 7498/37253 [00:48<03:06, 159.57it/s]

Skipping due to missing or non-string wenet.


 20%|██        | 7546/37253 [00:48<03:09, 156.80it/s]

Skipping due to missing or non-string wenet.


 21%|██        | 7897/37253 [00:51<03:02, 160.51it/s]

Skipping due to missing or non-string wenet.


 22%|██▏       | 8124/37253 [00:52<02:48, 172.63it/s]

Skipping due to missing or non-string wenet.


 22%|██▏       | 8211/37253 [00:52<02:54, 166.12it/s]

Skipping due to missing or non-string wenet.


 22%|██▏       | 8246/37253 [00:53<02:56, 164.31it/s]

Skipping due to missing or non-string wenet.
Skipping due to missing or non-string wenet.


 22%|██▏       | 8297/37253 [00:53<02:59, 161.61it/s]

Skipping due to missing or non-string wenet.


 22%|██▏       | 8367/37253 [00:53<02:50, 169.03it/s]

Skipping due to missing or non-string wenet.
Skipping due to missing or non-string wenet.
Skipping due to missing or non-string wenet.
Skipping due to missing or non-string wenet.


 23%|██▎       | 8418/37253 [00:54<02:54, 165.22it/s]

Skipping due to missing or non-string wenet.


 23%|██▎       | 8526/37253 [00:54<02:44, 174.38it/s]

Skipping due to missing or non-string wenet.
Skipping due to missing or non-string wenet.


 23%|██▎       | 8628/37253 [00:55<02:59, 159.81it/s]

Skipping due to missing or non-string wenet.
Skipping due to missing or non-string wenet.


 24%|██▎       | 8790/37253 [00:56<03:13, 147.32it/s]

Skipping due to missing or non-string wenet.


 24%|██▍       | 8852/37253 [00:56<03:14, 146.12it/s]

Skipping due to missing or non-string wenet.


 24%|██▍       | 9037/37253 [00:58<03:09, 148.95it/s]

Skipping due to missing or non-string wenet.


 25%|██▌       | 9381/37253 [01:00<03:13, 143.68it/s]

Skipping due to missing or non-string wenet.


 28%|██▊       | 10325/37253 [01:06<02:38, 170.34it/s]

Skipping due to missing or non-string wenet.


 28%|██▊       | 10497/37253 [01:07<02:58, 149.90it/s]

Skipping: One of the rows is empty.


 29%|██▉       | 10865/37253 [01:09<02:35, 169.43it/s]

Skipping due to missing or non-string wenet.


 29%|██▉       | 10986/37253 [01:10<02:45, 158.67it/s]

Skipping due to missing or non-string wenet.
Skipping due to missing or non-string wenet.


 30%|███       | 11278/37253 [01:12<02:44, 157.93it/s]

Skipping due to missing or non-string wenet.


 31%|███       | 11595/37253 [01:14<02:41, 158.94it/s]

Skipping due to missing or non-string wenet.


 33%|███▎      | 12154/37253 [01:17<02:37, 158.88it/s]

Skipping due to missing or non-string wenet.


 33%|███▎      | 12222/37253 [01:18<02:32, 163.61it/s]

Skipping due to missing or non-string wenet.
Skipping due to missing or non-string wenet.


 33%|███▎      | 12308/37253 [01:18<02:36, 159.27it/s]

Skipping due to missing or non-string wenet.


 34%|███▍      | 12735/37253 [01:21<02:33, 159.42it/s]

Skipping due to missing or non-string wenet.


 35%|███▍      | 12964/37253 [01:23<02:39, 152.33it/s]

Skipping due to missing or non-string wenet.


 35%|███▌      | 13108/37253 [01:23<02:39, 151.53it/s]

Skipping due to missing or non-string wenet.


 35%|███▌      | 13157/37253 [01:24<02:36, 153.83it/s]

Skipping due to missing or non-string wenet.
Skipping due to missing or non-string wenet.


 36%|███▌      | 13288/37253 [01:25<02:30, 159.11it/s]

Skipping due to missing or non-string wenet.


 38%|███▊      | 13993/37253 [01:29<02:37, 147.89it/s]

Skipping due to missing or non-string wenet.


 38%|███▊      | 14100/37253 [01:30<02:16, 170.04it/s]

Skipping due to missing or non-string wenet.


 38%|███▊      | 14207/37253 [01:30<02:16, 169.40it/s]

Skipping due to missing or non-string wenet.


 39%|███▊      | 14396/37253 [01:32<02:27, 155.08it/s]

Skipping due to missing or non-string wenet.
Skipping due to missing or non-string wenet.


 40%|███▉      | 14738/37253 [01:34<02:12, 169.76it/s]

Skipping due to missing or non-string wenet.


 42%|████▏     | 15674/37253 [01:39<01:54, 188.47it/s]

Skipping due to missing or non-string wenet.


 45%|████▍     | 16624/37253 [01:44<01:49, 188.08it/s]

Skipping due to missing or non-string wenet.


 47%|████▋     | 17499/37253 [01:49<01:44, 189.22it/s]

Skipping due to missing or non-string wenet.


 48%|████▊     | 17882/37253 [01:51<01:42, 189.68it/s]

Skipping due to missing or non-string wenet.
Skipping due to missing or non-string wenet.
Skipping due to missing or non-string wenet.


 48%|████▊     | 18038/37253 [01:52<01:41, 189.96it/s]

Skipping due to missing or non-string wenet.


 51%|█████     | 19090/37253 [01:57<01:36, 188.62it/s]

Skipping due to missing or non-string wenet.
Skipping due to missing or non-string wenet.


 52%|█████▏    | 19262/37253 [01:58<01:35, 189.07it/s]

Skipping due to missing or non-string wenet.


 54%|█████▎    | 20010/37253 [02:02<01:31, 189.15it/s]

Skipping due to missing or non-string wenet.


 57%|█████▋    | 21271/37253 [02:09<01:24, 189.67it/s]

Skipping due to missing or non-string wenet.
Skipping due to missing or non-string wenet.


 57%|█████▋    | 21310/37253 [02:09<01:24, 189.70it/s]

Skipping due to missing or non-string wenet.


 58%|█████▊    | 21558/37253 [02:10<01:23, 188.69it/s]

Skipping due to missing or non-string wenet.


 58%|█████▊    | 21615/37253 [02:11<01:23, 187.87it/s]

Skipping due to missing or non-string wenet.


 58%|█████▊    | 21710/37253 [02:11<01:22, 188.34it/s]

Skipping due to missing or non-string wenet.


 58%|█████▊    | 21748/37253 [02:11<01:22, 188.31it/s]

Skipping due to missing or non-string wenet.


 63%|██████▎   | 23424/37253 [02:20<01:13, 188.40it/s]

Skipping due to missing or non-string wenet.


 66%|██████▋   | 24720/37253 [02:27<01:06, 188.43it/s]

Skipping due to missing or non-string wenet.


 67%|██████▋   | 24853/37253 [02:28<01:06, 187.55it/s]

Skipping due to missing or non-string wenet.


 67%|██████▋   | 25081/37253 [02:29<01:04, 188.06it/s]

Skipping due to missing or non-string wenet.
Skipping due to missing or non-string wenet.


 81%|████████  | 30116/37253 [02:56<00:37, 188.09it/s]

Skipping due to missing or non-string wenet.


 81%|████████  | 30249/37253 [02:57<00:37, 188.34it/s]

Skipping due to missing or non-string wenet.
Skipping due to missing or non-string wenet.


 82%|████████▏ | 30363/37253 [02:57<00:36, 187.99it/s]

Skipping due to missing or non-string wenet.


 82%|████████▏ | 30705/37253 [02:59<00:34, 188.10it/s]

Skipping due to missing or non-string wenet.


 83%|████████▎ | 30762/37253 [02:59<00:34, 188.50it/s]

Skipping due to missing or non-string wenet.


 91%|█████████▏| 34068/37253 [03:17<00:16, 188.52it/s]

Skipping: One of the rows is empty.
Skipping: One of the rows is empty.


 93%|█████████▎| 34600/37253 [03:20<00:14, 187.96it/s]

Skipping: One of the rows is empty.


100%|██████████| 37253/37253 [03:34<00:00, 173.77it/s]


In [47]:
net_aigenerated_data_wenet

Unnamed: 0,Filename,Start_time,End_time,WeNet,WER,Stutterance_Count,NA_Count,NA_Cleaned_Count
0,D0001_A,2081.540000,2109.650000,我说出来就比较地需要时间然后那个识别的他的不就是他等你一会你那个话还没有说完的还没有说出来的...,0.226804,1.0,100,0
1,D0001_A,790.130000,796.580000,第四句有我我说的话,0.500000,1.0,100,0
2,D0001_A,1562.083518,1586.220000,这部剧是不怎么评分是不怎么好的就因为评评论区的那那些人他们都在说拿到号做好惨的那个就那么勤奋...,0.203125,10.0,100,0
3,D0001_A,2016.780000,2035.673559,很精准对讯飞语音还是讯飞助手来着就我记得他是叫讯飞我之前就是他,0.210526,7.0,100,0
4,D0001_A,1682.670000,1709.110000,一个个是叫啥来的我忘了就是出现了这一个人然后呢他他就射了一把剑然后就就把那个拿二号给长杀了就...,0.250000,8.0,100,0
...,...,...,...,...,...,...,...,...
37248,P0070,2586.616000,2589.346000,单曲循环歌曲这样而已,0.090909,0.0,100,0
37249,P0070,2782.496000,2783.706000,杨幂的电影,0.166667,0.0,100,0
37250,P0070,2995.296000,2998.116000,你好米娅今天柴油价怎么样,0.266667,0.0,100,0
37251,P0070,2604.066000,2606.656000,单曲循环歌曲琉璃光之歌,0.083333,0.0,100,0


In [48]:
# net_aigenerated_data_wenet.to_csv('net_aigenerated_data_wenet_performance_data.csv', sep=',')

In [49]:
# Check available columns and identify any issues
print("Available columns in large_row:", large_row.columns)

# Optionally, try stripping whitespace
large_row.columns = large_row.columns.str.strip()

# Check if "wenet" exists before accessing it
if "WeNet" in large_row.columns:
    wenet = large_row["WeNet"].values[0]
else:
    print("Column 'wenet' not found in large_row.")


Available columns in large_row: Index(['Filename', 'Start_time', 'End_time', 'WeNet', 'WER',
       'Stutterance_Count'],
      dtype='object')


***

## Visualization of Relationship between Stutterance Count and Word Error Rate 

In [50]:
## Load data from csv if starting here

net_aigenerated_data_wenet = pd.read_csv('/projectnb/ds549/projects/AImpower/evaluation/net_aigenerated_data_wenet_performance_data.csv', delimiter=',')

FileNotFoundError: [Errno 2] No such file or directory: '/projectnb/ds549/projects/AImpower/evaluation/net_aigenerated_data_wenet_performance_data.csv'

In [None]:
import matplotlib.pyplot as plt

In [None]:
## Null value plots

nonnull_count_large = net_aigenerated_data_wenet["NA_Count"].count() - net_aigenerated_data_wenet.iloc[0]["NA_Count"]
null_count_large = net_aigenerated_data_wenet.iloc[0]["NA_Count"]

nonnull_count_cleaned = net_aigenerated_data_wenet["NA_Cleaned_Count"].count() - net_aigenerated_data_wenet.iloc[0]["NA_Cleaned_Count"]
null_count_cleaned = net_aigenerated_data_wenet.iloc[0]["NA_Cleaned_Count"]


data = {
    "NA Values": [null_count_large, null_count_cleaned],
    "Non NA Values": [nonnull_count_large, nonnull_count_cleaned],
}

species = (
    "Whisper Large",
    "Cleaned Ground Truth"
)

width = 0.5

fig, ax = plt.subplots()
bottom = np.zeros(2)

for na, count in data.items():
    p = ax.bar(species, count, width, label=na, bottom=bottom)
    bottom += count


ax.set_title("NA value counting", fontsize=16)
ax.set_xlabel("Source", fontsize=14)
ax.set_ylabel("Count", fontsize=14)
ax.legend()
ax.grid(True)

plt.show()

In [None]:
plt.figure(figsize=(6, 4))
plt.scatter(
    net_aigenerated_data_wenet["Stutterance_Count"], 
    net_aigenerated_data_wenet["WER"], 
    alpha=0.7  # Handle overlapping points
)

plt.title("WER vs Stutterance Count", fontsize=16)
plt.xlabel("Stutterance Count", fontsize=14)
plt.ylabel("WER", fontsize=14)
plt.grid(True)
plt.show()

In [None]:
import seaborn as sns
from matplotlib.colors import LinearSegmentedColormap

net_aigenerated_data_wenet['WER_Binned'] = np.round(net_aigenerated_data_wenet['WER'], 2)

grouped_data = net_aigenerated_data_wenet.groupby(
    ['Stutterance_Count', 'WER_Binned']
).size().reset_index(name='Count')
heatmap_data = grouped_data.pivot(index='WER_Binned', columns='Stutterance_Count', values='Count').fillna(0)

plt.figure(figsize=(6, 4))
sns.heatmap(
    heatmap_data, cmap='cool', annot=False, fmt='g', cbar=True
)

plt.title("Stutterance Count vs WER (Color = Number of Cases)", fontsize=16)
plt.xlabel("Stutterance Count", fontsize=14)
plt.ylabel("WER (Binned)", fontsize=14)

plt.show()

In [None]:
from scipy.stats import spearmanr
rho, p = spearmanr(net_aigenerated_data_wenet.dropna()['Stutterance_Count'], net_aigenerated_data_wenet.dropna()['WER'])
print(f"p-value = {p}")
print(f"rho = {rho}")

***

## ROUGE-N/L Scores (Semantic Evaluation)

In [None]:
from rouge_chinese import Rouge
import jieba

In [None]:
rouge = Rouge()

In [None]:
print_ = True

for index, row in tqdm(net_aigenerated_data_wenet.iterrows(), total=len(net_aigenerated_data_wenet)):
    
    mask_large = (
        (net_aigenerated_data_wenet["Filename"] == row["Filename"]) &
        (net_aigenerated_data_wenet["Start_time"] == row["Start_time"])
    )

    mask_net = (
        (net_data["Filename"] == row["Filename"]) &
        (net_data["Start_time"] == row["Start_time"])
    )

    
    large_row = net_aigenerated_data_wenet.loc[mask_large]
    net_row = net_data.loc[mask_net]

    # print(large_row)
    # print('\n\n\n\n')
    # print(net_row)
    
    if large_row.empty or net_row.empty:
        print("Skipping: One of the rows is empty.")
        continue
        
    wenet = large_row["WeNet"].values[0]
    cleaned_transcript = net_row["Cleaned_Transcript"].values[0]
    
    if pd.isna(wenet) or not isinstance(wenet, str):
        print("Skipping due to missing or non-string wenet.")
        continue
    if pd.isna(cleaned_transcript) or not isinstance(cleaned_transcript, str):
        print("Skipping due to missing or non-string Cleaned_Transcript.")
        continue

    try:
        
        scores = rouge.get_scores(' '.join(jieba.cut(wenet)), ' '.join(jieba.cut(cleaned_transcript)))
        
        net_aigenerated_data_wenet.loc[mask_large, "rouge1-precision"] = scores[0]["rouge-1"]["p"]
        net_aigenerated_data_wenet.loc[mask_large, "rouge1-recall"] = scores[0]["rouge-1"]["r"]
        net_aigenerated_data_wenet.loc[mask_large, "rouge1-f1"] = scores[0]["rouge-1"]["f"]
        
        
        net_aigenerated_data_wenet.loc[mask_large, "rouge2-precision"] = scores[0]["rouge-2"]["p"]
        net_aigenerated_data_wenet.loc[mask_large, "rouge2-recall"] = scores[0]["rouge-2"]["r"]
        net_aigenerated_data_wenet.loc[mask_large, "rouge2-f1"] = scores[0]["rouge-2"]["f"]
        
        
        net_aigenerated_data_wenet.loc[mask_large, "rougel-precision"] = scores[0]["rouge-l"]["p"]
        net_aigenerated_data_wenet.loc[mask_large, "rougel-recall"] = scores[0]["rouge-l"]["r"]
        net_aigenerated_data_wenet.loc[mask_large, "rougel-f1"] = scores[0]["rouge-l"]["f"]
        
        stutterance_count = net_row["Stutterance_Count"].values[0]
        net_aigenerated_data_wenet.loc[mask_large, "Stutterance_Count"] = stutterance_count

        if (print_):
            print(net_aigenerated_data_wenet)
            print_ = False
        
        # Verify assignment
        # print(f'Assigned Stutterance_Count: {stutterance_count}')
        # print(net_aigenerated_data_wenet.loc[mask_large, "Stutterance_Count"])

    except Exception as e:
        print(f'ERROR: {e}')
        print('Occurred with the following data:')
        print(large_row)
        print(net_row)

In [None]:
net_aigenerated_data_wenet

In [None]:
# net_aigenerated_data_wenet.to_csv('net_aigenerated_data_wenet_performance_data.csv', sep=',')

***

## Visualization of Relationship between Stutterance Count and Rouge Scores

In [None]:
## Load data from csv if starting here

net_aigenerated_data_wenet = pd.read_csv('/projectnb/ds549/projects/AImpower/evaluation/net_aigenerated_data_wenet_performance_data.csv', delimiter=',')

In [None]:
plt.figure(figsize=(6, 4))
plt.scatter(
    net_aigenerated_data_wenet["Stutterance_Count"], 
    net_aigenerated_data_wenet["rouge1-precision"], 
    facecolors="none", edgecolors='r',
    marker="8",
    alpha=0.7  # Handle overlapping points
)

plt.scatter(
    net_aigenerated_data_wenet["Stutterance_Count"], 
    net_aigenerated_data_wenet["rouge1-recall"], 
    facecolors="none", edgecolors='g',
    marker="^",
    alpha=0.7  # Handle overlapping points
)

plt.scatter(
    net_aigenerated_data_wenet["Stutterance_Count"], 
    net_aigenerated_data_wenet["rouge1-f1"], 
    facecolors="none", edgecolors='b',
    marker=".",
    alpha=0.7  # Handle overlapping points
)

plt.title("Rouge-1 vs Stutterance Count", fontsize=16)
plt.xlabel("Stutterance Count", fontsize=14)
plt.ylabel("Rouge Score", fontsize=14)
plt.grid(True)
plt.show()

In [None]:
plt.figure(figsize=(6, 4))
plt.scatter(
    net_aigenerated_data_wenet["Stutterance_Count"], 
    net_aigenerated_data_wenet["rouge2-precision"], 
    facecolors="none", edgecolors='r',
    marker="8",
    alpha=0.7  # Handle overlapping points
)

plt.scatter(
    net_aigenerated_data_wenet["Stutterance_Count"], 
    net_aigenerated_data_wenet["rouge2-recall"], 
    facecolors="none", edgecolors='g',
    marker="^",
    alpha=0.7  # Handle overlapping points
)

plt.scatter(
    net_aigenerated_data_wenet["Stutterance_Count"], 
    net_aigenerated_data_wenet["rouge2-f1"], 
    facecolors="none", edgecolors='b',
    marker=".",
    alpha=0.7  # Handle overlapping points
)

plt.title("Rouge-2 vs Stutterance Count", fontsize=16)
plt.xlabel("Stutterance Count", fontsize=14)
plt.ylabel("Rouge Score", fontsize=14)
plt.grid(True)
plt.show()

In [None]:
plt.figure(figsize=(6, 4))
plt.scatter(
    net_aigenerated_data_wenet["Stutterance_Count"], 
    net_aigenerated_data_wenet["rougel-precision"], 
    facecolors="none", edgecolors='r',
    marker="8",
    alpha=0.7  # Handle overlapping points
)

plt.scatter(
    net_aigenerated_data_wenet["Stutterance_Count"], 
    net_aigenerated_data_wenet["rougel-recall"], 
    facecolors="none", edgecolors='g',
    marker="^",
    alpha=0.7  # Handle overlapping points
)

plt.scatter(
    net_aigenerated_data_wenet["Stutterance_Count"], 
    net_aigenerated_data_wenet["rougel-f1"], 
    facecolors="none", edgecolors='b',
    marker=".",
    alpha=0.7  # Handle overlapping points
)

plt.title("Rouge-L vs Stutterance Count", fontsize=16)
plt.xlabel("Stutterance Count", fontsize=14)
plt.ylabel("Rouge Score", fontsize=14)
plt.grid(True)
plt.show()

## Correlations between Stuttering and Rouge Scores 

In [None]:
from scipy.stats import spearmanr
rho, p = spearmanr(net_aigenerated_data_wenet.dropna()['Stutterance_Count'], net_aigenerated_data_wenet.dropna()['rouge1-precision'])
print(f"p-value [stuttering count & rouge-1 precision] = {p}")
print(f"rho [stuttering count & rouge-1 precision] = {rho}")

rho, p = spearmanr(net_aigenerated_data_wenet.dropna()['Stutterance_Count'], net_aigenerated_data_wenet.dropna()['rouge1-recall'])
print(f"p-value [stuttering count & rouge-1 recall] = {p}")
print(f"rho [stuttering count & rouge-1 recall] = {rho}")

rho, p = spearmanr(net_aigenerated_data_wenet.dropna()['Stutterance_Count'], net_aigenerated_data_wenet.dropna()['rouge1-f1'])
print(f"p-value [stuttering count & rouge-1 f1] = {p}")
print(f"rho [stuttering count & rouge-1 f1] = {rho}")

***