<a href="https://colab.research.google.com/github/MK316/workspace/blob/main/ASR02/ASR06_EPAprocess.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 🔎 Recognition: getting results from audio files

1. Original text file: the Rainbow Passage
2. Speech files: one sentence per audio file in WAV format, uploaded as a zip file (or via a Google Drive link)
3. ASR (Automatic Speech Recognition) process
4. Recognition rate: (number of recognized words / number of words in the original text) * 100
5. Saving the results as a dataframe

# [1] Original text file 📗

In [1]:
import pandas as pd

url = "https://raw.githubusercontent.com/MK316/workspace/main/ASR01/data/rainbow_bysentence.csv"
df = pd.read_csv(url)
df.head()

Unnamed: 0,SN,Sentence
0,S1,When the sunlight strikes raindrops in the air...
1,S2,The rainbow is a division of white light into ...
2,S3,"These take the shape of a long round arch, wit..."
3,S4,"There is, according to legend, a boiling pot o..."
4,S5,"People look, but no one ever finds it."


# [2] Speech files

* Mount Google Drive and check the files in it. 

e.g., the audio zip files listed in your Google Drive folder: 

**_Note_**: allow Google Drive authentication when asked.

In [2]:
#@markdown 🎯Mount Google drive and list files (e.g., "asrdata" folder in my case)
from google.colab import drive 
import os

drive.mount('/content/drive')

mydir = input("Type the file directory in your google drive: e.g., asrdata  ")
!ls "/content/drive/MyDrive/{mydir}"
!pwd

Mounted at /content/drive
Type the file directory in your google drive: e.g., asrdata  asrdata
Esent.wav  HE.zip  Ksent.wav  SE.zip  SK.zip
/content


* Make a new directory for the audio files and place the unzipped files there.

In [3]:
#@markdown 🎯 To do  

#@markdown   [1] Make a new folder: type a new folder name (e.g., myaudio)
import os

folder_name = input("Type a name for the new folder.")

if not os.path.exists(folder_name):
  os.makedirs(folder_name)
  print(f"A new folder name '{folder_name}' has been created.")
else:
  print(f"{folder_name} already exists.")

#@markdown [2] Unzip files: type a zip file name (e.g., SE.zip)

zipname = input("Type your zip file name (e.g., SE.zip) to unzip into the new folder: ")
# mydir is the Google Drive folder name typed in the previous cell
!unzip "/content/drive/MyDrive/{mydir}/{zipname}" -d "/content/{folder_name}/"

print(f"Your {zipname} is unzipped under '{folder_name}' folder")

Type a name for the new folder.myaudio
A new folder name 'myaudio' has been created.
Type your zip file name (e.g., SE.zip) to unzip into the new folder: HE
Archive:  /content/drive/MyDrive/asrdata/HE.zip
  inflating: /content/myaudio/HE01.wav  
  inflating: /content/myaudio/HE02.wav  
  inflating: /content/myaudio/HE03.wav  
  inflating: /content/myaudio/HE04.wav  
  inflating: /content/myaudio/HE05.wav  
  inflating: /content/myaudio/HE06.wav  
  inflating: /content/myaudio/HE07.wav  
  inflating: /content/myaudio/HE08.wav  
  inflating: /content/myaudio/HE09.wav  
  inflating: /content/myaudio/HE10.wav  
  inflating: /content/myaudio/HE11.wav  
  inflating: /content/myaudio/HE12.wav  
  inflating: /content/myaudio/HE13.wav  
  inflating: /content/myaudio/HE14.wav  
  inflating: /content/myaudio/HE15.wav  
  inflating: /content/myaudio/HE16.wav  
  inflating: /content/myaudio/HE17.wav  
  inflating: /content/myaudio/HE18.wav  
  inflating: /content/myaudio/HE19.
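
The unzip step above relies on the shell `unzip` command. As a rough alternative, the same extraction could be sketched with Python's standard `zipfile` module. The helper name `unzip_wavs` is hypothetical (it is not part of the notebook), and the sketch assumes that only the `.wav` members of the archive are wanted.

```python
import zipfile
from pathlib import Path

def unzip_wavs(zip_path, dest_dir):
    """Extract the .wav members of zip_path into dest_dir and return their names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)  # create the target folder if needed
    with zipfile.ZipFile(zip_path) as zf:
        # Keep only audio members, sorted by name
        wavs = sorted(n for n in zf.namelist() if n.lower().endswith(".wav"))
        zf.extractall(dest, members=wavs)
    return wavs

# Example call (paths as used in this notebook):
# unzip_wavs("/content/drive/MyDrive/asrdata/HE.zip", "/content/myaudio")
```

Returning the extracted names makes it easy to confirm that the expected number of sentence files arrived before running ASR.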

In [None]:
#@markdown 🎯 Unmount Google drive (clearing): optional

from google.colab import drive
drive.flush_and_unmount()

# [3] ASR process

model = whisper.load_model('base.en')  
**model2 = whisper.load_model('medium.en') for accented speech**

In [4]:
#@markdown 🎯Install SR tool
%%capture
!pip install git+https://github.com/openai/whisper.git 

import whisper
model = whisper.load_model('base.en') 
model2 = whisper.load_model('medium.en')

In [6]:
#@markdown 🎯 Create a file list as a dataframe (df1)
import os
import pandas as pd

dir_path = input("Type the full file path to locate audio files. (e.g., /content/HE) ")

# Keep only the .wav files, sorted by name so that IDs follow the sentence order
flist = sorted(f for f in os.listdir(dir_path) if f.lower().endswith('.wav'))

# One row per audio file, with IDs starting at 1
df1 = pd.DataFrame({'ID': range(1, len(flist) + 1), 'Filename': flist})

# print(df1.to_string(index=False))
df1.head()

Type the full file path to locate audio files. (e.g., /content/HE) myaudio


Unnamed: 0,ID,Filename
0,1,HE01.wav
1,2,HE02.wav
2,3,HE03.wav
3,4,HE04.wav
4,5,HE05.wav
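
The file names in this data set are zero-padded (HE01 to HE19), so plain alphabetical sorting keeps them in sentence order. If a speaker's files were named without padding (a hypothetical case such as `HE2.wav` and `HE10.wav`), a natural sort key along these lines might be needed. The helper name `natural_key` is an assumption for illustration.

```python
import re

def natural_key(name):
    """Split a name into text and integer chunks so that 'HE2' sorts before 'HE10'."""
    return [int(part) if part.isdigit() else part.lower()
            for part in re.split(r"(\d+)", name)]

files = ["HE10.wav", "HE2.wav", "HE1.wav"]  # hypothetical unpadded names
print(sorted(files))                   # plain string order
print(sorted(files, key=natural_key))  # numeric order
```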


In [7]:
#@markdown 🎯Install {autotime} to measure runtime (From here, runtime appears automatically.)
%%capture
!pip install ipython-autotime
%load_ext autotime

time: 480 µs (started: 2023-05-18 15:02:35 +00:00)


In [8]:
df1.tail()

Unnamed: 0,ID,Filename
14,15,HE15.wav
15,16,HE16.wav
16,17,HE17.wav
17,18,HE18.wav
18,19,HE19.wav


time: 29.1 ms (started: 2023-05-18 15:02:39 +00:00)


In [9]:
#@markdown 🎯Change directory to the audio file folder
import os
a = !pwd
print(a)

filepath = input("Change current directory to... /content/destination/")
os.chdir(filepath)
b = !pwd
print("Current directory changed to: %s"%b)

['/content']
Change current directory to... /content/destination//content/myaudio
Current directory changed to: ['/content/myaudio']
time: 9.9 s (started: 2023-05-18 15:02:47 +00:00)


In [10]:
#@markdown 🎯 Testing ASR (single file): Type a number between 1~19
rname = input("Type ID: ")
ind = int(rname) - 1
myf = df1['Filename'][ind]
result = model.transcribe(myf, language="en",fp16=False)
print('Filename: %s'%myf)
print('='*30)

print("Speech-to-text (recognized): %s"%result["text"])  

Type ID: 1
Filename: HE01.wav
Speech-to-text (recognized):  When the sunlight strikes raindrops in the air, they act as a prism and form a rainbow.
time: 9.46 s (started: 2023-05-18 15:03:07 +00:00)


Model2 for accented speech

In [None]:
#@markdown 🎯 Testing ASR (single file): Type a number between 1~19
rname = input("Type ID: ")
ind = int(rname) - 1
myf = df1['Filename'][ind]
result = model2.transcribe(myf, language="en",fp16=False)
print('Filename: %s'%myf)
print('='*30)

print("Speech-to-text (recognized): %s"%result["text"])  

Type ID: 1
Filename: SK01.wav
Speech-to-text (recognized):  When the sunlight strikes raindrops in the air, please liked and subscribe my channel.
time: 7.29 s (started: 2023-01-10 16:31:18 +00:00)


* Processing ASR for multiple files

In [11]:
#@markdown =======================================================================
#@markdown ## 🎯Processing multiple files (19) and saving the results as **df2**

#@markdown ## **Note:** The current directory should be the folder containing the audio files (e.g., spk01)
#@markdown =======================================================================

import os
a = !pwd
print(a)

checkdir = input("Need to change current directory? (y for 'yes', n for 'no'")

if checkdir == 'y':

  filepath = input("Change current directory to... /content/destination/")
  os.chdir(filepath)
  b = !pwd
  print("Current directory changed to: %s"%b)
else:
  print('Okay, proceed ASR.')

import time
import pandas as pd

fname = []
rt = []
rectext = []
df2 = pd.DataFrame()

def measure_time(function):
  start = time.time()
  function()
  end = time.time()
  return end - start

nfiles = len(df1['Filename']) #19

for i in range(0, nfiles):
  myf = df1['Filename'][i]
  fname.append(myf)

  def code_to_measure():
    # Transcribe one file, print the recognized text, and collect it
    result = model.transcribe(myf, language="en",fp16=False)
    print('='*30)
    print("Speech-to-text (recognized): %s"%result['text']) 
    rectext.append(result['text'])

  runtime = round(measure_time(code_to_measure),3)
  rt.append(runtime)  # keep the runtime numeric so it can be averaged later

  print(f"Runtime: {runtime} seconds")
  print('Filename: %s'%myf)

df2['Filename'] = fname
df2['Runtime'] = rt
df2['Recognized'] = rectext

df2.head()

['/content/myaudio']
Need to change current directory? (y for 'yes', n for 'no'no
Okay, proceed ASR.
Speech-to-text (recognized):  When the sunlight strikes raindrops in the air, they act as a prism and form a rainbow.
Runtime: 0.304 seconds
Filename: HE01.wav
Speech-to-text (recognized):  The rainbow is a division of white light into many beautiful colors.
Runtime: 0.244 seconds
Filename: HE02.wav
Speech-to-text (recognized):  These take the shape of a long round arch with its path high above, and its two ends apparently beyond the horizon.
Runtime: 0.319 seconds
Filename: HE03.wav
Speech-to-text (recognized):  There is, according to a legend, a boiling pot of gold at one end.
Runtime: 0.269 seconds
Filename: HE04.wav
Speech-to-text (recognized):  People look, but no one ever finds it.
Runtime: 0.224 seconds
Filename: HE05.wav
Speech-to-text (recognized):  When a man looks for something beyond his reach, his friends say he is looking for the pot of gold at the end of the rainbow.
Runt

Unnamed: 0,Filename,Runtime,Recognized
0,HE01.wav,0.304,When the sunlight strikes raindrops in the ai...
1,HE02.wav,0.244,The rainbow is a division of white light into...
2,HE03.wav,0.319,These take the shape of a long round arch wit...
3,HE04.wav,0.269,"There is, according to a legend, a boiling po..."
4,HE05.wav,0.224,"People look, but no one ever finds it."


time: 10.9 s (started: 2023-05-18 15:03:33 +00:00)
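
The cell above times each transcription by wrapping it in a nested function. A possibly simpler pattern is a helper that returns both the result and the elapsed time, so no closure is needed. The sketch below is an illustration only: `timed` and `fake_transcribe` are hypothetical names, and `fake_transcribe` stands in for `model.transcribe` so that the example runs without Whisper installed.

```python
import time

def timed(func, *args, **kwargs):
    """Call func(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = round(time.perf_counter() - start, 3)
    return result, elapsed

# Stand-in for model.transcribe (hypothetical; returns a dict with a "text" key)
def fake_transcribe(path, language="en"):
    return {"text": f"(transcript of {path})"}

result, runtime = timed(fake_transcribe, "HE01.wav", language="en")
print(result["text"], runtime)
```

With the real model, the call would look like `timed(model.transcribe, myf, language="en", fp16=False)`.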


In [None]:
#@markdown ===📕READ: This cell runs model2 (medium.en) for accented speech (for English speech, skip this part)============
#@markdown ## 🎯Processing multiple files (19) and saving the results as **df3**

#@markdown ## **Note:** The current directory should be the folder containing the audio files (e.g., spk01)
#@markdown =======================================================================

import os
a = !pwd
print(a)

import time
import pandas as pd

fname = []
rt = []
rectext = []
df3 = pd.DataFrame()

def measure_time(function):
  start = time.time()
  function()
  end = time.time()
  return end - start

nfiles = len(df1['Filename']) #19

for i in range(0, nfiles):
  myf = df1['Filename'][i]
  fname.append(myf)

  def code_to_measure():
    # Transcribe one file with model2, print the recognized text, and collect it
    result = model2.transcribe(myf, language="en",fp16=False)
    print('='*30)
    print("Speech-to-text (recognized with medium): %s"%result['text']) 
    rectext.append(result['text'])

  runtime = round(measure_time(code_to_measure),3)
  rt.append(runtime)  # keep the runtime numeric so it can be averaged later

  print(f"Runtime: {runtime} seconds")
  print('Filename: %s'%myf)

df3['Filename'] = fname
df3['Runtime'] = rt
df3['Recognized_medium'] = rectext

df3.head()

In [None]:
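# 📕 Run this if model2 was used. (if not, skip this code cell.)
print(df3.head())

# Copy the model2 results into df2 under the column name used by the cells below.
# df3 has only three columns; 'Sentence' and 'SN' are added in the next cell.
df2 = df3.rename(columns={'Recognized_medium': 'Recognized'})
df2.head()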

In [12]:
# Add original text to the result file

df2['Sentence'] = df['Sentence']
df2['SN'] = df['SN']

df2 = df2[['SN', 'Filename', 'Runtime', 'Sentence', 'Recognized']]
cols = list(df2.columns.values)

df2.head()

Unnamed: 0,SN,Filename,Runtime,Sentence,Recognized
0,S1,HE01.wav,0.304,When the sunlight strikes raindrops in the air...,When the sunlight strikes raindrops in the ai...
1,S2,HE02.wav,0.244,The rainbow is a division of white light into ...,The rainbow is a division of white light into...
2,S3,HE03.wav,0.319,"These take the shape of a long round arch, wit...",These take the shape of a long round arch wit...
3,S4,HE04.wav,0.269,"There is, according to legend, a boiling pot o...","There is, according to a legend, a boiling po..."
4,S5,HE05.wav,0.224,"People look, but no one ever finds it.","People look, but no one ever finds it."


time: 15.5 ms (started: 2023-05-18 15:04:03 +00:00)
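
The cell above pairs each audio file with its original sentence by row position, which is correct only when the sorted file list has the same order and length as the sentence table. A more defensive pairing could derive the sentence number from the file name. The sketch below assumes that the digits before `.wav` are the sentence number (consistent with HE01 to HE19 and S1 to S19, but not stated in the data); `sentence_id` is a hypothetical helper.

```python
import re

def sentence_id(filename):
    """Map e.g. 'HE01.wav' to 'S1' using the digits right before the extension."""
    match = re.search(r"(\d+)\.wav$", filename, flags=re.IGNORECASE)
    if match is None:
        raise ValueError(f"No sentence number found in {filename!r}")
    return f"S{int(match.group(1))}"

print(sentence_id("HE01.wav"))
print(sentence_id("SK19.wav"))
```

With such a key, the pairing could be written as `df2['SN'] = df2['Filename'].map(sentence_id)` followed by `df2.merge(df[['SN', 'Sentence']], on='SN')`, which would leave a missing or extra file visible as an unmatched row instead of shifting every sentence.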


In [13]:
# Important: assign the current dataframe to the new variable "se"
se = df2

time: 413 µs (started: 2023-05-18 15:04:10 +00:00)


In [14]:
#@markdown Calculate recognition rate, record missing words => dataframe (df2) 
from nltk.tokenize import RegexpTokenizer

tokenizer = RegexpTokenizer(r'\w+')

nws = []      # number of words in the original sentence
nwr = []      # number of words in the recognized text
nr = []       # number of recognized words that also occur in the original
rr = []       # recognition rate (%)
recword = []  # recognized words found in the original sentence
mw = []       # recognized words NOT found in the original (misrecognized)

for i in range(0,len(se['SN'])):
  # Original text: lowercase and tokenize (punctuation is dropped)
  wlist = tokenizer.tokenize(se['Sentence'][i].lower())
  nws.append(len(wlist))

  # Recognized text: same normalization
  wlist1 = tokenizer.tokenize(se['Recognized'][i].lower())
  nwr.append(len(wlist1))

  # Split the recognized words by whether they occur in the original
  rword = [w for w in wlist1 if w in wlist]
  mword = [w for w in wlist1 if w not in wlist]
  score = len(rword)

  # Note: the count runs over the recognized words, so an inserted word that
  # also exists in the original (e.g., an extra "a") still scores, and the
  # rate can exceed 100.
  recrate = round((score/len(wlist))*100,2)

  nr.append(score)
  rr.append(recrate)
  recword.append(', '.join(rword))
  mw.append(', '.join(mword))

df2['LenS'] = nws
df2['LenR'] = nwr
df2['N_RecW'] = nr
df2['RecRate'] = rr
df2['Recognized_words'] = recword
df2['MissRec_words'] = mw

df2.head(); df2.tail()


Unnamed: 0,SN,Filename,Runtime,Sentence,Recognized,LenS,LenR,N_RecW,RecRate,Recognized_words,MissRec_words
14,S15,HE15.wav,0.217,Many complicated ideas about the rainbow have ...,Many complicated ideas about the rainbow have...,9,9,9,100.0,"many, complicated, ideas, about, the, rainbow,...",
15,S16,HE16.wav,0.365,The difference in the rainbow depends consider...,The difference in the rainbow depends conside...,28,28,28,100.0,"the, difference, in, the, rainbow, depends, co...",
16,S17,HE17.wav,0.308,The actual primary rainbow observed is said to...,The actual primary rainbow observed is said t...,18,18,18,100.0,"the, actual, primary, rainbow, observed, is, s...",
17,S18,HE18.wav,0.456,If the red of the second bow falls upon the gr...,If the red of the second bow falls upon the g...,36,36,36,100.0,"if, the, red, of, the, second, bow, falls, upo...",
18,S19,HE19.wav,0.447,"This is a very common type of bow, one showing...","This is a very common type of bow, one showin...",21,21,21,100.0,"this, is, a, very, common, type, of, bow, one,...",


time: 1.47 s (started: 2023-05-18 15:04:21 +00:00)


💾 Column ordering & save it as csv (e.g., 'spk01_R_base_0110.csv')

In [15]:
#@markdown Reorder the columns and save the updated result as a csv file (optional)
df2 = df2[['SN', 'Filename', 'Runtime', 'Recognized','LenR','Sentence','LenS','N_RecW','RecRate','Recognized_words','MissRec_words']]
cols = list(df2.columns.values)

ask =  input('Save a csv file now? (y or n)')
if ask == 'y':
  filename = input('Type a file name to save df2. ')
  df2.to_csv(f'/content/{filename}')
  print(f"{filename} is saved. ")
else:
  print("Okay, let's move on. ")

df2.head()

Save a csv file now? (y or n)y
Type a file name to save df2. HE_result.csv
HE_result.csv is saved. 


Unnamed: 0,SN,Filename,Runtime,Recognized,LenR,Sentence,LenS,N_RecW,RecRate,Recognized_words,MissRec_words
0,S1,HE01.wav,0.304,When the sunlight strikes raindrops in the ai...,17,When the sunlight strikes raindrops in the air...,17,17,100.0,"when, the, sunlight, strikes, raindrops, in, t...",
1,S2,HE02.wav,0.244,The rainbow is a division of white light into...,12,The rainbow is a division of white light into ...,12,12,100.0,"the, rainbow, is, a, division, of, white, ligh...",
2,S3,HE03.wav,0.319,These take the shape of a long round arch wit...,22,"These take the shape of a long round arch, wit...",22,22,100.0,"these, take, the, shape, of, a, long, round, a...",
3,S4,HE04.wav,0.269,"There is, according to a legend, a boiling po...",14,"There is, according to legend, a boiling pot o...",13,14,107.69,"there, is, according, to, a, legend, a, boilin...",
4,S5,HE05.wav,0.224,"People look, but no one ever finds it.",8,"People look, but no one ever finds it.",8,8,100.0,"people, look, but, no, one, ever, finds, it",


time: 32.6 s (started: 2023-05-18 15:04:31 +00:00)
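
In the table above, S4 has a RecRate of 107.69 because the count runs over the recognized words: all 14 recognized tokens occur somewhere in the 13-word original, so the inserted "a" is counted as a hit. One way to keep the rate at or below 100 is to match words as multisets, so that each occurrence in the original can be matched at most once. The sketch below is an assumption about how that could be done, not the notebook's method; `tokens` and `capped_rate` are hypothetical names, and the regular expression mirrors the `\w+` pattern used with `RegexpTokenizer`.

```python
import re
from collections import Counter

def tokens(text):
    """Lowercase the text and return its word tokens."""
    return re.findall(r"\w+", text.lower())

def capped_rate(original, recognized):
    """Percent of original words matched, counting each occurrence only once."""
    ref = Counter(tokens(original))
    hyp = Counter(tokens(recognized))
    matched = sum((ref & hyp).values())  # multiset intersection
    return round(matched / sum(ref.values()) * 100, 2)

# Toy pair (hypothetical): the recognizer inserts an extra "a"
print(capped_rate("a pot of gold", "a a pot of gold"))
```

The function assumes a non-empty original sentence; an empty one would raise `ZeroDivisionError`.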


# [4] Recognition rate

- Sample text: "**The rainbow** is a division of white light into many beautiful colors." (12 words)

- Recognized text: "**The lane bowl** is a division of white light into many beautiful colors." (11 of the 12 original words are matched; "rainbow" is not)

Recognition rate = (11/12)*100 ≈ 91.67%  
Error rate = 100 - 91.67 ≈ 8.33%
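
The arithmetic in the sample pair above can be checked with a word-level alignment. The sketch below uses Python's standard `difflib.SequenceMatcher` instead of NLTK, which is a substitution made here for a dependency-free illustration; `tokens` and `align` are hypothetical helper names.

```python
import re
from difflib import SequenceMatcher

def tokens(text):
    """Lowercase the text and return its word tokens."""
    return re.findall(r"\w+", text.lower())

def align(original, recognized):
    """Return (matched_count, mismatches) from a word-level alignment."""
    ref, hyp = tokens(original), tokens(recognized)
    matcher = SequenceMatcher(None, ref, hyp, autojunk=False)
    matched, mismatches = 0, []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            matched += i2 - i1
        else:
            # Original words and the recognized words aligned with them
            mismatches.append((ref[i1:i2], hyp[j1:j2]))
    return matched, mismatches

original = "The rainbow is a division of white light into many beautiful colors."
recognized = "The lane bowl is a division of white light into many beautiful colors."

matched, mismatches = align(original, recognized)
rate = round(matched / len(tokens(original)) * 100, 2)
print(matched, len(tokens(original)), rate)  # 11 12 91.67
print(mismatches)                            # [(['rainbow'], ['lane', 'bowl'])]
```

Unlike a bag-of-words check, the alignment respects word order, so it also reports which original word each misrecognized stretch replaced.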

## Recognition rate and finding mismatching words using NLTK

In [16]:
se = df2

time: 387 µs (started: 2023-05-18 15:05:17 +00:00)


# 🌀Recognition Rate without stopwords (23.01.10~)

🌱 Variables: 

* df2 = combined results
* se (reduced for NLTK processing) = SN, sentence, recognized text

In [17]:
se = df2[['SN','Sentence','Recognized']]
se.head()

Unnamed: 0,SN,Sentence,Recognized
0,S1,When the sunlight strikes raindrops in the air...,When the sunlight strikes raindrops in the ai...
1,S2,The rainbow is a division of white light into ...,The rainbow is a division of white light into...
2,S3,"These take the shape of a long round arch, wit...",These take the shape of a long round arch wit...
3,S4,"There is, according to legend, a boiling pot o...","There is, according to a legend, a boiling po..."
4,S5,"People look, but no one ever finds it.","People look, but no one ever finds it."


time: 9.23 ms (started: 2023-05-18 15:05:21 +00:00)


**Variable: sw = nltk stopword list** (179 words)

💾 stopword list as csv

In [18]:
#@markdown Install {nltk} and get ready to process stopwords
import nltk
nltk.download('stopwords')
nltk.download('punkt')
from nltk.corpus import stopwords

sw = stopwords.words('english')
print(sw)

# Size the ID column from the list itself; the number of stopwords may differ across NLTK versions
dfs = pd.DataFrame({'ID': range(1, len(sw) + 1), 'Stopwords': sw})
dfs.to_csv('nltkstopwords.csv', index=False)
dfs.head()

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

Unnamed: 0,ID,Stopwords
0,1,i
1,2,me
2,3,my
3,4,myself
4,5,we


time: 505 ms (started: 2023-05-18 15:05:26 +00:00)


Variable: sw (stopword list)

In [19]:
#@markdown Checking (single file processing)
from nltk.tokenize import word_tokenize
from nltk.tokenize import RegexpTokenizer

tokenizer = RegexpTokenizer(r'\w+')

trial = input('Type a row index (0-based): e.g., 1 for the second sentence. ')
n = int(trial)
example_sent = df2['Sentence'][n]
word_tokens = tokenizer.tokenize(example_sent)
stop_words = set(stopwords.words('english'))

# Case-sensitive filtering (no lowercase conversion), so a capitalized
# stopword such as 'The' is kept. Use `w.lower() not in stop_words` to drop it.
filtered_sentence = []
 
for w in word_tokens:
    if w not in stop_words:
        filtered_sentence.append(w)
 
print('Number of word tokens: %d'%len(word_tokens))
print(word_tokens)
print('='*50)
print('Number of words after stopwords: %d'%len(filtered_sentence))
print(filtered_sentence)

Type a row index (0-based): e.g., 1 for the second sentence. 1
Number of word tokens: 12
['The', 'rainbow', 'is', 'a', 'division', 'of', 'white', 'light', 'into', 'many', 'beautiful', 'colors']
Number of words after stopwords: 8
['The', 'rainbow', 'division', 'white', 'light', 'many', 'beautiful', 'colors']
time: 4.08 s (started: 2023-05-18 15:05:31 +00:00)
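
The output above keeps 'The' because the stopword list is all lowercase and the check is case-sensitive. The sketch below contrasts the two checks on the same sentence. To stay runnable without NLTK it uses only a small subset of the stopword list printed earlier, which is an assumption made for illustration.

```python
import re

# Small subset of the NLTK English stopword list (not the full list)
stop_words = {"the", "is", "a", "of", "into"}

sentence = "The rainbow is a division of white light into many beautiful colors."
word_tokens = re.findall(r"\w+", sentence)

# Case-sensitive: 'The' does not equal 'the', so it survives
case_sensitive = [w for w in word_tokens if w not in stop_words]
# Case-insensitive: lowercase each token before the lookup
case_insensitive = [w for w in word_tokens if w.lower() not in stop_words]

print(len(case_sensitive), case_sensitive)
print(len(case_insensitive), case_insensitive)
```

The bulk-processing cell later in the notebook uses the case-insensitive form, so its word counts can be one lower than the single-file check for sentences that start with a stopword.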


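Variable: se3 = a copy of df2 for further processing

In [20]:
# Work on a copy so that adding columns later does not trigger a pandas copy warning
se3 = df2.copy()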
se3.head()

Unnamed: 0,SN,Filename,Runtime,Recognized,LenR,Sentence,LenS,N_RecW,RecRate,Recognized_words,MissRec_words
0,S1,HE01.wav,0.304,When the sunlight strikes raindrops in the ai...,17,When the sunlight strikes raindrops in the air...,17,17,100.0,"when, the, sunlight, strikes, raindrops, in, t...",
1,S2,HE02.wav,0.244,The rainbow is a division of white light into...,12,The rainbow is a division of white light into ...,12,12,100.0,"the, rainbow, is, a, division, of, white, ligh...",
2,S3,HE03.wav,0.319,These take the shape of a long round arch wit...,22,"These take the shape of a long round arch, wit...",22,22,100.0,"these, take, the, shape, of, a, long, round, a...",
3,S4,HE04.wav,0.269,"There is, according to a legend, a boiling po...",14,"There is, according to legend, a boiling pot o...",13,14,107.69,"there, is, according, to, a, legend, a, boilin...",
4,S5,HE05.wav,0.224,"People look, but no one ever finds it.",8,"People look, but no one ever finds it.",8,8,100.0,"people, look, but, no, one, ever, finds, it",


time: 13.8 ms (started: 2023-05-18 15:05:45 +00:00)


## Getting all results 📗

[Refer to the result file list](https://github.com/MK316/workspace/tree/main/ASR01/analysis)

In [21]:
#@markdown ============================
#@markdown ## 📌 Multiple file processing: getting Recognition Rate without stopwords (RecRate_ws)
#@markdown ============================

#@markdown filename: e.g., SK_base_resultall_0110.csv

# Bulk process

from nltk.tokenize import RegexpTokenizer

tokenizer = RegexpTokenizer(r'\w+')
stop_words = set(stopwords.words('english'))

# list variables to collect each result
sent_wsls = []   # original sentences without stopwords
rec_wsls = []    # recognized texts without stopwords
rr_ws = []       # number of correctly recognized words (stopwords excluded)
corecwords = []  # correctly recognized words (stopwords excluded)
newrate = []     # recognition rate without stopwords (%)
lenS_sw = []     # sentence length without stopwords
lenR_sw = []     # recognized-text length without stopwords
sn = len(se3['Sentence'])

for i in range(0,sn):
  # 1. Remove stopwords from the original and recognized texts
  word_tokens = tokenizer.tokenize(se3['Sentence'][i])
  rword_tokens = tokenizer.tokenize(se3['Recognized'][i])
  filtered_sentence = [w for w in word_tokens if w.lower() not in stop_words]
  rfiltered_sentence = [w for w in rword_tokens if w.lower() not in stop_words]
  sent_wsls.append(' '.join(filtered_sentence))
  rec_wsls.append(' '.join(rfiltered_sentence))

  # 2. Count the original words that also occur in the recognized text
  #    (this comparison is case-sensitive)
  correctwords = [w for w in filtered_sentence if w in rfiltered_sentence]
  score = len(correctwords)
  n1 = len(filtered_sentence)
  n2 = len(rfiltered_sentence)

  newrate.append(round((score/n1)*100,2))
  rr_ws.append(score)
  corecwords.append(', '.join(correctwords))
  lenS_sw.append(n1)
  lenR_sw.append(n2)

# 3. Write the results to the dataframe
se3['Sent_ws'] = sent_wsls
se3['LenS_ws'] = lenS_sw
se3['Rec_ws'] = rec_wsls
se3['LenR_ws'] = lenR_sw
se3['Correctwords_ws'] = corecwords
se3['Recwords_ws'] = rr_ws
se3['RecRate_ws'] = newrate

# 4. Save the result as a csv file
filename = input('Type a file name to save the result: e.g., /content/spk01.csv ')
se3.to_csv(filename,index=False)

se3.head()

Type a file name to save the result: e.g., /content/spk01.csv HE_result_0519.csv


Unnamed: 0,SN,Filename,Runtime,Recognized,LenR,Sentence,LenS,N_RecW,RecRate,Recognized_words,MissRec_words,Sent_ws,LenS_ws,Rec_ws,LenR_ws,Correctwords_ws,Recwords_ws,RecRate_ws
0,S1,HE01.wav,0.304,When the sunlight strikes raindrops in the ai...,17,When the sunlight strikes raindrops in the air...,17,17,100.0,"when, the, sunlight, strikes, raindrops, in, t...",,sunlight strikes raindrops air act prism form ...,8,sunlight strikes raindrops air act prism form ...,8,"sunlight, strikes, raindrops, air, act, prism,...",8,100.0
1,S2,HE02.wav,0.244,The rainbow is a division of white light into...,12,The rainbow is a division of white light into ...,12,12,100.0,"the, rainbow, is, a, division, of, white, ligh...",,rainbow division white light many beautiful co...,7,rainbow division white light many beautiful co...,7,"rainbow, division, white, light, many, beautif...",7,100.0
2,S3,HE03.wav,0.319,These take the shape of a long round arch wit...,22,"These take the shape of a long round arch, wit...",22,22,100.0,"these, take, the, shape, of, a, long, round, a...",,take shape long round arch path high two ends ...,12,take shape long round arch path high two ends ...,12,"take, shape, long, round, arch, path, high, tw...",12,100.0
3,S4,HE04.wav,0.269,"There is, according to a legend, a boiling po...",14,"There is, according to legend, a boiling pot o...",13,14,107.69,"there, is, according, to, a, legend, a, boilin...",,according legend boiling pot gold one end,7,according legend boiling pot gold one end,7,"according, legend, boiling, pot, gold, one, end",7,100.0
4,S5,HE05.wav,0.224,"People look, but no one ever finds it.",8,"People look, but no one ever finds it.",8,8,100.0,"people, look, but, no, one, ever, finds, it",,People look one ever finds,5,People look one ever finds,5,"People, look, one, ever, finds",5,100.0


time: 16 s (started: 2023-05-18 15:05:59 +00:00)
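
Once the per-sentence results are saved, a speaker-level summary can be computed in two ways: the mean of the per-sentence rates, or a pooled rate over all words. The two differ when sentence lengths differ. The sketch below assumes rows shaped like the saved csv (columns `LenS_ws` and `Recwords_ws`, possibly as strings); the function name `summarize` and the toy values are hypothetical and are not results from this notebook.

```python
def summarize(rows):
    """Return (mean of per-sentence rates, pooled rate), both in percent."""
    lengths = [int(r["LenS_ws"]) for r in rows]
    hits = [int(r["Recwords_ws"]) for r in rows]
    per_sentence = [h / n * 100 for h, n in zip(hits, lengths)]
    mean_rate = round(sum(per_sentence) / len(per_sentence), 2)
    pooled_rate = round(sum(hits) / sum(lengths) * 100, 2)
    return mean_rate, pooled_rate

# Toy rows (hypothetical values), as csv.DictReader would yield them
rows = [
    {"LenS_ws": "8", "Recwords_ws": "8"},
    {"LenS_ws": "7", "Recwords_ws": "6"},
    {"LenS_ws": "12", "Recwords_ws": "12"},
]
mean_rate, pooled_rate = summarize(rows)
print(mean_rate, pooled_rate)
```

For a real file, the rows could come from `csv.DictReader` opened on the saved result. Which summary to report is a design choice: the pooled rate weights long sentences more heavily, while the mean treats every sentence equally.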
