# Project Summary

- I will build a system that can automatically recognize speech and summarize it, such as transcribing and summarizing lecture recordings, podcasts, or videos.


- **Project Steps**:

  - Build a voice recognition system using vosk package
  - Add punctuation back to the text transcript using recasepunc package
  - Summarize the text using a huggingface summarization pipeline


- **Files Overview**:

  - shortwave_60s.mp3 - a 60 second English audio clip to train the model
  - shortwave.mp3 - a 25 minute English audio clip to test the model
  - wuxiaobu_50s.mp3 - a 50 second Chinese audio clip to train the model
  - wuxiaobu.mp3 - a 29 minute Chinese audio clip to test the model
  
  
  https://www.npr.org/2022/06/03/1102930066/pride-week-the-importance-of-inclusion-in-sex-education
  


# Step 1: Build a voice recognition system

## Install the vosk package

In [1]:
!pip install vosk



In [2]:
from vosk import Model, KaldiRecognizer

## Download and initiate the vosk-model-en-us for voice recognition

In [3]:
FRAME_RATE = 16000 # The higher the number, the higher the voice quality
CHANNELS = 1

# model = Model(model_name = 'vosk-model-en-us-0.22')
# For a smaller download size
# model is a pretrained model
model = Model(model_name="vosk-model-small-en-us-0.15")

rec = KaldiRecognizer(model, FRAME_RATE)
rec.SetWords(True) # show a complete text transcript and individual words with model confidence in those words

LOG (VoskAPI:ReadDataFiles():model.cc:213) Decoding params beam=10 max-active=3000 lattice-beam=2
LOG (VoskAPI:ReadDataFiles():model.cc:216) Silence phones 1:2:3:4:5:6:7:8:9:10
LOG (VoskAPI:RemoveOrphanNodes():nnet-nnet.cc:948) Removed 0 orphan nodes.
LOG (VoskAPI:RemoveOrphanComponents():nnet-nnet.cc:847) Removing 0 orphan components.
LOG (VoskAPI:ReadDataFiles():model.cc:248) Loading i-vector extractor from /Users/moonqj/.cache/vosk/vosk-model-small-en-us-0.15/ivector/final.ie
LOG (VoskAPI:ComputeDerivedVars():ivector-extractor.cc:183) Computing derived variables for iVector extractor
LOG (VoskAPI:ComputeDerivedVars():ivector-extractor.cc:204) Done.
LOG (VoskAPI:ReadDataFiles():model.cc:282) Loading HCL and G from /Users/moonqj/.cache/vosk/vosk-model-small-en-us-0.15/graph/HCLr.fst /Users/moonqj/.cache/vosk/vosk-model-small-en-us-0.15/graph/Gr.fst
LOG (VoskAPI:ReadDataFiles():model.cc:303) Loading winfo /Users/moonqj/.cache/vosk/vosk-model-small-en-us-0.15/graph/phones/word_boundary.

## Install the pydub package

In [4]:
!pip install pydub



In [5]:
from pydub import AudioSegment

## Load mp3.file using AudioSegment from pydub

In [6]:
mp3 = AudioSegment.from_mp3("shortwave_60s.mp3")
mp3 = mp3.set_channels(CHANNELS)
mp3 = mp3.set_frame_rate(FRAME_RATE)

In [None]:
# binary representation of the actual data in mp3
mp3.raw_data 

In [8]:
# rec
rec.AcceptWaveform(mp3.raw_data)
result = rec.Result()

In [9]:
import json

text = json.loads(result)['text']

text # result with no punctuations 

"from npr emily pay separate so few and i'll remember getting caught sex ed in high school kind of adult to be honest i did a lot of my own research like my mom got me some bugs and them when knock got to be like a bit much i got my books and taught myself yes yes me to sympathize i actually with it sneak into the little local bookstore and find our bodies ourselves in like be human development section and surreptitiously read it in one corner and hope that nobody i knew walked in and ask what i was reading little erin little little awkward questioning errant and what is maybe not surprising about this given we had to do our own research is that there is actually no national mandate for sex ed in the us really that kind of surprises me like none yeah and not only that but most sex ed the does exist"

In [29]:
json.loads(result)

{'result': [{'conf': 1.0, 'end': 1.29, 'start': 1.08, 'word': 'from'},
  {'conf': 1.0, 'end': 1.98, 'start': 1.29, 'word': 'npr'},
  {'conf': 1.0, 'end': 8.79, 'start': 8.4, 'word': 'emily'},
  {'conf': 0.607331, 'end': 9.142041, 'start': 8.79, 'word': 'pay'},
  {'conf': 0.347527, 'end': 9.63, 'start': 9.142041, 'word': 'separate'},
  {'conf': 0.965751, 'end': 10.05, 'start': 9.69, 'word': 'so'},
  {'conf': 0.494293, 'end': 10.53, 'start': 10.23, 'word': 'few'},
  {'conf': 0.494293, 'end': 10.62, 'start': 10.53, 'word': 'and'},
  {'conf': 0.464595, 'end': 10.77, 'start': 10.62, 'word': "i'll"},
  {'conf': 1.0, 'end': 11.13, 'start': 10.77, 'word': 'remember'},
  {'conf': 1.0, 'end': 11.37, 'start': 11.13, 'word': 'getting'},
  {'conf': 0.862953, 'end': 11.64, 'start': 11.37, 'word': 'caught'},
  {'conf': 0.976158, 'end': 11.957246, 'start': 11.64, 'word': 'sex'},
  {'conf': 0.311309, 'end': 12.089476, 'start': 11.97, 'word': 'ed'},
  {'conf': 0.776959, 'end': 12.18, 'start': 12.089476,

# Step 2: Adding punctuations back in the text with recasepunc

`By default, vosk will output text with no punctuation. To add in punctuation, we'll need a different model. `

`Download the punctuation model from vosk.
Extract the zip file into the same directory as your code.`

In [10]:
!pip install transformers
!pip install torch -f https://download.pytorch.org/whl/torch_stable.html

Looking in links: https://download.pytorch.org/whl/torch_stable.html


In [14]:
!unzip /Users/moonqj/Desktop/Voice_Recognition/vosk-recasepunc-en-0.22.zip -d /Users/moonqj/Desktop/Voice_Recognition/

Archive:  /Users/moonqj/Desktop/Voice_Recognition/vosk-recasepunc-en-0.22.zip
replace /Users/moonqj/Desktop/Voice_Recognition/vosk-recasepunc-en-0.22/recasepunc.py? [y]es, [n]o, [A]ll, [N]one, [r]ename: ^C


In [11]:
!pip install regex



In [12]:
import regex as re
import subprocess # use python to run a terminal commmand

# checkpoint is the pretrained model for punctuation
cased = subprocess.check_output('python vosk-recasepunc-en-0.22/recasepunc.py predict vosk-recasepunc-en-0.22/checkpoint', shell=True, text=True, input=text)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [13]:
cased

"From NPR, Emily pay separate so few, and I ' ll remember getting caught sex ed in high school kind of adult to be honest. I did a lot of my own research, like my mom got me some bugs and them when knock got to be like a bit much. I got my books and taught myself, yes, yes, me, to sympathize I actually with it. Sneak into the little local bookstore and find our bodies ourselves in like be Human Development section and surreptitiously read it in one corner and hope that nobody I knew walked in and ask what I was reading. Little Erin, little little awkward, questioning, errant And what is maybe not surprising about this, given we had to do our own research, is that there is actually no national mandate for sex ed in the US. Really ? That kind of surprises me Like none. Yeah, and not only that, but most sex ed the does exist.\n"

# Step 1+2 : Define a voice_recognition function (long audio file)

In [14]:
def voice_recognition(filename, language):
    # English
    if language == "English":
        model = Model(model_name="vosk-model-small-en-us-0.15")
        
    # Chinese
    if language == "Chinese":
        model = Model(model_name="vosk-model-small-cn-0.22")
    
    rec = KaldiRecognizer(model, FRAME_RATE)
    rec.SetWords(True)
    
    mp3 = AudioSegment.from_mp3(filename)
    mp3 = mp3.set_channels(1)
    mp3 = mp3.set_frame_rate(FRAME_RATE)
    
    # batch long audtio file into pieces and stick them back  
    step = 45000
    transcript = ""
    for i in range(0, len(mp3), step):
      print(f"Progress: {i/len(mp3)}")
      segment = mp3[i : i+step]

      rec.AcceptWaveform(segment.raw_data)
      result = rec.Result()

      text = json.loads(result)["text"]
      transcript += text
    
    if language == "English":
        cased = subprocess.check_output('python vosk-recasepunc-en-0.22/recasepunc.py predict vosk-recasepunc-en-0.22/checkpoint', shell=True, text=True, input=transcript)
    if language == "Chinese":
        cased = subprocess.check_output('python vosk-recasepunc-en-0.22/recasepunc.py predict vosk-recasepunc-en-0.22/zh.24000', shell=True, text=True, input=transcript)
    return cased

In [15]:
transcript_chn = voice_recognition("wuxiaobo.mp3", 'Chinese')

LOG (VoskAPI:ReadDataFiles():model.cc:213) Decoding params beam=12 max-active=5000 lattice-beam=4
LOG (VoskAPI:ReadDataFiles():model.cc:216) Silence phones 1:2:3:4:5:6:7:8:9:10
LOG (VoskAPI:RemoveOrphanNodes():nnet-nnet.cc:948) Removed 0 orphan nodes.
LOG (VoskAPI:RemoveOrphanComponents():nnet-nnet.cc:847) Removing 0 orphan components.
LOG (VoskAPI:ReadDataFiles():model.cc:248) Loading i-vector extractor from /Users/moonqj/.cache/vosk/vosk-model-small-cn-0.22/ivector/final.ie
LOG (VoskAPI:ComputeDerivedVars():ivector-extractor.cc:183) Computing derived variables for iVector extractor
LOG (VoskAPI:ComputeDerivedVars():ivector-extractor.cc:204) Done.
LOG (VoskAPI:ReadDataFiles():model.cc:282) Loading HCL and G from /Users/moonqj/.cache/vosk/vosk-model-small-cn-0.22/graph/HCLr.fst /Users/moonqj/.cache/vosk/vosk-model-small-cn-0.22/graph/Gr.fst
LOG (VoskAPI:ReadDataFiles():model.cc:303) Loading winfo /Users/moonqj/.cache/vosk/vosk-model-small-cn-0.22/graph/phones/word_boundary.int


Progress: 0.0
Progress: 0.02578068201934926
Progress: 0.05156136403869852
Progress: 0.07734204605804779
Progress: 0.10312272807739704
Progress: 0.1289034100967463
Progress: 0.15468409211609557
Progress: 0.18046477413544482
Progress: 0.2062454561547941
Progress: 0.23202613817414336
Progress: 0.2578068201934926
Progress: 0.28358750221284185
Progress: 0.30936818423219115
Progress: 0.3351488662515404
Progress: 0.36092954827088963
Progress: 0.38671023029023893
Progress: 0.4124909123095882
Progress: 0.4382715943289374
Progress: 0.4640522763482867
Progress: 0.48983295836763596
Progress: 0.5156136403869852
Progress: 0.5413943224063344
Progress: 0.5671750044256837
Progress: 0.592955686445033
Progress: 0.6187363684643823
Progress: 0.6445170504837315
Progress: 0.6702977325030808
Progress: 0.69607841452243
Progress: 0.7218590965417793
Progress: 0.7476397785611286
Progress: 0.7734204605804779
Progress: 0.7992011425998271
Progress: 0.8249818246191764
Progress: 0.8507625066385256
Progress: 0.87654318

Some weights of the model checkpoint at ckiplab/bert-base-chinese were not used when initializing BertModel: ['cls.predictions.decoder.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertModel were not initialized from the model checkpoint at ckiplab/bert-base-chinese and are newly initialized: ['bert.pooler.dense.bias', 'bert.pooler.dense.

In [16]:
transcript_chn

'今 天 去 北 京 六 环 之 外 的 一 方 放 刘 强 东, 那 你 的 风 很 大, 前 几 天 刚 刚 去 了 一 次 大 的 老 家, 江 苏 宿 迁, 最 早 下 楼, 长 达 十 年 玩 乐 力 长 的 是 我 给 他 看 拍 个 照 片, 最 大 的 他 应 该 好 友 不 用 没 有 看 坎 坷 想 的 不 断 加 强. 刘 强 东 是 小 镇 青 年 创 业 的 典 范, 在 他 的 身 上 真 人 秀 的 把 那 股 逆 袭 的 浓 浓 气 味. 刚 刚 过 去 的 十 年, 在 他 的 身 上 到 底 发 生 了 什 么 ? 十 年 京 东 的 营 业 额 增 加 了 的 前 辈, 十 年 京 东 的 员 工 人 数 激 增 到 了 十 六 万 人. 十 年, 他 谈 了 一 场 众 人 皆 知 的 恋 爱, 还 当 上 了 父 亲, 把 人 引 向 未 来 的 不 是 上 帝, 而 是 他 本 人 的 做. 刘 强 东 是 怎 么 看 她 刚 刚 度 过 的 十 年, 他 的 骄 傲 和 恐 惧 是 什 么 ? 还 有 什 么 遗 憾 ? 还 有 我 得 问 问 他, 他 是 一 个 保 守 的 人, 还 是 激 进 的 哥 哥, 妹 妹 信 心 了, 非 得 叫 他 从 这 讲 起 来, 京 东 越 多 越 好, 大 新 婚 快 乐 小 工 艺 没 之 后 东 哥 抱 得 美 人 归, 我 就 不 了 你 了, 哈 哈, 你 的 系 统 啊, 谢 谢 我 就 看 完 了, 给 您 带 着, 看 完 了 得 着 老 子 的 节 目 也 是 也 是 这 个 梳 理 开 始 的 啊, 十 年 吗 我 我 下 旋 写 人 聊 的 四 年 变 化 等 会 说 呢 的 数 为 尽 量 图 的 就 拿 过 得 好 的 谢 谢 啊, 我 们 说 这 个 很 赞 啊, 都 是 你 结 婚 的 时 候 在 的, 东 哥 小 天 是 早 生 贵 子 的, 是 这 段 时 间 里 加 大 的 山 上 得 我 的 呀 个 人 那 可 能 是 了 的 奶 猫, 成 家 立 业 成 家 怎 么 的 前 面 你 评 价 送 往 吗 对 小 孩 谁 吗 ? 就 是 更 显 得 特 别 重 要, 为 你 读 大 学 读 的 是 社 会 戏 码 对 你 做 的 社 会 学 

In [17]:
transcript = voice_recognition("shortwave.mp3", 'English')

LOG (VoskAPI:ReadDataFiles():model.cc:213) Decoding params beam=10 max-active=3000 lattice-beam=2
LOG (VoskAPI:ReadDataFiles():model.cc:216) Silence phones 1:2:3:4:5:6:7:8:9:10
LOG (VoskAPI:RemoveOrphanNodes():nnet-nnet.cc:948) Removed 0 orphan nodes.
LOG (VoskAPI:RemoveOrphanComponents():nnet-nnet.cc:847) Removing 0 orphan components.
LOG (VoskAPI:ReadDataFiles():model.cc:248) Loading i-vector extractor from /Users/moonqj/.cache/vosk/vosk-model-small-en-us-0.15/ivector/final.ie
LOG (VoskAPI:ComputeDerivedVars():ivector-extractor.cc:183) Computing derived variables for iVector extractor
LOG (VoskAPI:ComputeDerivedVars():ivector-extractor.cc:204) Done.
LOG (VoskAPI:ReadDataFiles():model.cc:282) Loading HCL and G from /Users/moonqj/.cache/vosk/vosk-model-small-en-us-0.15/graph/HCLr.fst /Users/moonqj/.cache/vosk/vosk-model-small-en-us-0.15/graph/Gr.fst
LOG (VoskAPI:ReadDataFiles():model.cc:303) Loading winfo /Users/moonqj/.cache/vosk/vosk-model-small-en-us-0.15/graph/phones/word_boundary.

Progress: 0.0
Progress: 0.029423476403352706
Progress: 0.05884695280670541
Progress: 0.08827042921005812
Progress: 0.11769390561341082
Progress: 0.14711738201676353
Progress: 0.17654085842011624
Progress: 0.20596433482346896
Progress: 0.23538781122682165
Progress: 0.26481128763017436
Progress: 0.29423476403352705
Progress: 0.3236582404368798
Progress: 0.3530817168402325
Progress: 0.3825051932435852
Progress: 0.4119286696469379
Progress: 0.4413521460502906
Progress: 0.4707756224536433
Progress: 0.500199098856996
Progress: 0.5296225752603487
Progress: 0.5590460516637015
Progress: 0.5884695280670541
Progress: 0.6178930044704068
Progress: 0.6473164808737596
Progress: 0.6767399572771122
Progress: 0.706163433680465
Progress: 0.7355869100838177
Progress: 0.7650103864871703
Progress: 0.7944338628905231
Progress: 0.8238573392938758
Progress: 0.8532808156972285
Progress: 0.8827042921005812
Progress: 0.912127768503934
Progress: 0.9415512449072866
Progress: 0.9709747213106393


Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [18]:
len(transcript.split(' '))

4551

In [19]:
transcript

"You ' re listening to shortwave from NPR heavily a separate so do and I ' ll remember getting touch sex that in high school. Ah, kind of adult to be honest. I did a lot of my own research, like my mom got me some bugs and them when not got to be like on bit much. I got my books and taught myself, yes, yes me, to sympathize I actually with it. Sneak into the little local bookstore and find our bodies ourselves in like be human development section and surreptitiously read it in the corner and hope that nobody I knew walked in and asked what I was reading Little and little littleawkward questioning Aaron, And what is maybe not surprising about this, given we had to do our own research, is that there is actually no national mandate for sex ed in the U. S. Really, That kind of surprises me, like none. Yeah, and not only that, but most sex ed that does exist leaves out LGBTQ topics are just barely touches on it. And then there are states where they even require educators to betray topics li

# Step 3: Summarize the text using huggingface transformer

`Download a summarization model from huggingface.`

In [20]:
from transformers import pipeline

# For a smaller model
#summarizer = pipeline("summarization", model="t5-small") # for English only
summarizer = pipeline('summarization')

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 (https://huggingface.co/sshleifer/distilbart-cnn-12-6)


Metal device set to: Apple M1


2022-06-20 17:06:27.996935: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:305] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2022-06-20 17:06:27.997725: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:271] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 0 MB memory) -> physical PluggableDevice (device: 0, name: METAL, pci bus id: <undefined>)


In [28]:
# with open('/Users/moonqj/Desktop/Voice_Recognition/transcript.txt') as f:
#   transcript = f.read()

In [21]:
# split our transcript into smaller pieces since summerization model in huggingface has lenght limit
# 1024 tokens limit 
split_tokens = transcript.split(" ")
print(len(split_tokens))
docs = []

for i in range(0, len(split_tokens), 850):
    selection = " ".join(split_tokens[i:(i+850)])
    print(selection)
    print(len(selection.split(' ')),'\n')
    docs.append(selection)

4551
You ' re listening to shortwave from NPR heavily a separate so do and I ' ll remember getting touch sex that in high school. Ah, kind of adult to be honest. I did a lot of my own research, like my mom got me some bugs and them when not got to be like on bit much. I got my books and taught myself, yes, yes me, to sympathize I actually with it. Sneak into the little local bookstore and find our bodies ourselves in like be human development section and surreptitiously read it in the corner and hope that nobody I knew walked in and asked what I was reading Little and little littleawkward questioning Aaron, And what is maybe not surprising about this, given we had to do our own research, is that there is actually no national mandate for sex ed in the U. S. Really, That kind of surprises me, like none. Yeah, and not only that, but most sex ed that does exist leaves out LGBTQ topics are just barely touches on it. And then there are states where they even require educators to betray topic

In [22]:
summaries = summarizer(docs)

In [23]:
summaries

[{'summary_text': ' There is actually no national mandate for sex ed in the U.S. Most sex ed that does exist leaves out LGBTQ topics are just barely touches on it . There are states where they even require educators to betray topics like homosexuality in a negative one . This episode is about sex and may not be for everyone .'},
 {'summary_text': ' There are nine states that explicitly require that teachers do not speak about homosexuality or LGBTQ individuals in a positive fashion . Over twenty five states and the District of Columbia do mandate sex education, and eleven states have policies that requires sex Ed to be inclusive of sexual orientation . Erica Heart says people should receive sex positive messages instead of shaming people for wanting to embrace her sexuality .'},
 {'summary_text': " Erica Heart says there isn't a singular or right way to have sex . Masturbation is a great tool for figuring out what you do or don't like . Heart says sex can happen with who ever you want 

In [24]:
summary = "\n\n".join([d["summary_text"] for d in summaries])

In [25]:
print(summary)

 There is actually no national mandate for sex ed in the U.S. Most sex ed that does exist leaves out LGBTQ topics are just barely touches on it . There are states where they even require educators to betray topics like homosexuality in a negative one . This episode is about sex and may not be for everyone .

 There are nine states that explicitly require that teachers do not speak about homosexuality or LGBTQ individuals in a positive fashion . Over twenty five states and the District of Columbia do mandate sex education, and eleven states have policies that requires sex Ed to be inclusive of sexual orientation . Erica Heart says people should receive sex positive messages instead of shaming people for wanting to embrace her sexuality .

 Erica Heart says there isn't a singular or right way to have sex . Masturbation is a great tool for figuring out what you do or don't like . Heart says sex can happen with who ever you want to have it . Erica Heart: Human sexuality is very complex and

In [2]:
!brew install portaudio

To reinstall 19.7.0, run:
  brew reinstall portaudio


In [21]:
# !python3 -m pip install pyaudio --global-option="build_ext" --global-option="-I/opt/homebrew/include" --global-option="-L/opt/homebrew/lib"

!python -m pip install --global-option='build_ext' --global-option='-I/opt/homebrew/Cellar/portaudio/19.7.0/include' --global-option='-L/opt/homebrew/Cellar/portaudio/19.7.0/lib' pyaudio


