# Converting Audio Speech into Text



**Input Dataset**<br>https://drive.google.com/drive/folders/18Tf6drhGF1tcW-Bax1n1PjJ38_5a0TxW?usp=sharing

# Packages

In [1]:
!pip install SpeechRecognition

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting SpeechRecognition
  Downloading SpeechRecognition-3.8.1-py2.py3-none-any.whl (32.8 MB)
     |████████████████████████████████| 32.8 MB 51.0 MB/s 
Installing collected packages: SpeechRecognition
Successfully installed SpeechRecognition-3.8.1


In [2]:
!pip install pydub

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pydub
  Downloading pydub-0.25.1-py2.py3-none-any.whl (32 kB)
Installing collected packages: pydub
Successfully installed pydub-0.25.1


# Libraries

In [3]:
import speech_recognition as sr 
import os 
from pydub import AudioSegment
from pydub.silence import split_on_silence

In [4]:
# create a speech recognition object
r = sr.Recognizer()

# a function that splits the audio file into chunks
# and applies speech recognition
def get_large_audio_transcription(path):
    """
    Splitting the large audio file into chunks
    and apply speech recognition on each of these chunks
    """
    # open the audio file using pydub; from_file lets ffmpeg detect the
    # format (the input used below is an MP3, not a WAV)
    sound = AudioSegment.from_file(path)
    # split the audio wherever silence lasts 500 milliseconds or more and get chunks
    chunks = split_on_silence(sound,
        # experiment with this value for your target audio file
        min_silence_len = 500,
        # adjust this per requirement
        silence_thresh = sound.dBFS-14,
        # keep up to 500 milliseconds of silence around each chunk, adjustable as well
        keep_silence=500,
    )
    folder_name = "audio-chunks"
    # create a directory to store the audio chunks
    if not os.path.isdir(folder_name):
        os.mkdir(folder_name)
    whole_text = ""
    # process each chunk 
    for i, audio_chunk in enumerate(chunks, start=1):
        # export audio chunk and save it in
        # the `folder_name` directory.
        chunk_filename = os.path.join(folder_name, f"chunk{i}.wav")
        audio_chunk.export(chunk_filename, format="wav")
        # recognize the chunk
        with sr.AudioFile(chunk_filename) as source:
            audio_listened = r.record(source)
            # try converting it to text
            try:
                text = r.recognize_google(audio_listened)
            except sr.UnknownValueError as e:
                print("Error:", str(e))
            else:
                text = f"{text.capitalize()}. "
                print(chunk_filename, ":", text)
                whole_text += text
    # return the text for all chunks detected
    return whole_text
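
The effect of `min_silence_len` and `silence_thresh` can be illustrated without any audio library. The sketch below is a simplified stand-in, not pydub's implementation: the function name `split_on_quiet` and the per-millisecond loudness list are assumptions made for illustration, and `keep_silence` is not modeled.

```python
def split_on_quiet(levels_db, min_silence_len, silence_thresh):
    """Return (start, end) index pairs of the non-silent spans.

    levels_db holds one loudness value per millisecond (hypothetical input).
    A span ends only when at least min_silence_len consecutive values
    fall below silence_thresh.
    """
    spans = []
    start = None      # start index of the current non-silent span
    quiet_run = 0     # length of the current run of quiet values
    for i, level in enumerate(levels_db):
        if level < silence_thresh:
            quiet_run += 1
            # close the span once the quiet run is long enough
            if start is not None and quiet_run >= min_silence_len:
                spans.append((start, i - quiet_run + 1))
                start = None
        else:
            if start is None:
                start = i
            quiet_run = 0
    # close a span that is still open at the end of the input
    if start is not None:
        spans.append((start, len(levels_db) - quiet_run))
    return spans


# 3 loud, 4 quiet, 2 loud, 1 quiet (too short to split), 2 loud
levels = [-10] * 3 + [-50] * 4 + [-12] * 2 + [-50] + [-11] * 2
spans = split_on_quiet(levels, min_silence_len=3, silence_thresh=-30)
print(spans)
```

A larger `min_silence_len` merges short pauses into one chunk, while a higher (less negative) `silence_thresh` treats more of the signal as silence.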

In [5]:
Voice_Mp3='/content/kunal01 (1).mp3'
text_voice=get_large_audio_transcription(Voice_Mp3)

Error: 
audio-chunks/chunk2.wav : Yoyomax12. 
Error: 
Error: 
audio-chunks/chunk5.wav : I'm calling you from universal hub. 
audio-chunks/chunk6.wav : Jordan education consultancy. 
Error: 
Error: 
Error: 
Error: 
audio-chunks/chunk11.wav : Okay see you soon. 
audio-chunks/chunk12.wav : No i'm doing research. 
audio-chunks/chunk13.wav : Department. 
Error: 
Error: 
Error: 
audio-chunks/chunk17.wav : Jessie's girl. 
Error: 
Error: 
audio-chunks/chunk20.wav : Play climb 34 commercial.. 
audio-chunks/chunk21.wav : I'd like to know the score.. 
Error: 
audio-chunks/chunk23.wav : Right right. 
Error: 
audio-chunks/chunk25.wav : Basically i can understand that you're looking for a fully-funded pnc core maybe in the usa. 
Error: 
audio-chunks/chunk27.wav : Okay so unfortunately. 
audio-chunks/chunk28.wav : Let me give you the positive first we have a partnership with more than 60 universities. 
audio-chunks/chunk29.wav : Across u.s. uk and canada. 
audio-chunks/chunk30.wav : You do not offer 

# Data Preprocessing

In [6]:
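# The "Error:" lines above were only printed, never appended to the transcript,
# so this replacement is a safeguard rather than a required step
text_voice=text_voice.replace('Error' , '')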
print(text_voice)

Yoyomax12. I'm calling you from universal hub. Jordan education consultancy. Okay see you soon. No i'm doing research. Department. Jessie's girl. Play climb 34 commercial.. I'd like to know the score.. Right right. Basically i can understand that you're looking for a fully-funded pnc core maybe in the usa. Okay so unfortunately. Let me give you the positive first we have a partnership with more than 60 universities. Across u.s. uk and canada. You do not offer none of these universities offer free pnp code. And. You know there are scholarships but that totally depends on the months that you will see in your master and the aisle so do you know very much that you would get. Would you be open to options if you get a scholarship of about twenty 30% or are you only looking for free. Gopi sundar. Bhg.com. 94 trans am. That's my plan fully funded and my financial condition is not that much. I completely understand the situation, 22. You're not wrong in your car to apply for a fully-funded you'

**Identify Grammar Mistakes and Correct Them**

In [7]:
!pip install language_tool_python

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting language_tool_python
  Downloading language_tool_python-2.7.1-py3-none-any.whl (34 kB)
Installing collected packages: language-tool-python
Successfully installed language-tool-python-2.7.1


In [8]:
import language_tool_python
tool = language_tool_python.LanguageTool('en-US')
text_voice=tool.correct(text_voice)

Downloading LanguageTool 5.7: 100%|██████████| 225M/225M [00:13<00:00, 16.9MB/s]
Unzipping /tmp/tmppob2wcxh.zip to /root/.cache/language_tool_python.
Downloaded https://www.languagetool.org/download/LanguageTool-5.7.zip to /root/.cache/language_tool_python.


In [9]:
print(text_voice)

Yoyomax12. I'm calling you from universal hub. Jordan education consultancy. Okay see you soon. No I'm doing research. Department. Jessie's girl. Play climb 34 commercial. I'd like to know the score. Right. Basically I can understand that you're looking for a fully-funded PNC core maybe in the USA. Okay so unfortunately. Let me give you the positive first we have a partnership with more than 60 universities. Across u.s. UK and Canada. You do not offer none of these universities offer free PNP code. And. You know there are scholarships, but that totally depends on the months that you will see in your master and the aisle so do you know very much that you would get. Would you be open to options if you get a scholarship of about twenty 30% or are you only looking for free? GOP Sunday. Bhg.com. 94 trans am. That's my plan fully funded, and my financial condition is not that much. I completely understand the situation, 22. You're not wrong in your car to apply for a fully-funded you're abso

In [10]:
#Remove the ',' and '.' from the text
text_voice=text_voice.replace('.' , '')
text_voice=text_voice.replace(',' , '')
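
The two `replace` calls can also be written as a single pass with `str.translate`. This is a minimal sketch using fragments of the transcript above as sample input; the variable names are illustrative only.

```python
# Sample fragments taken from the corrected transcript
text = ("Across u.s. UK and Canada. Bhg.com. 94 trans am. "
        "I completely understand the situation, 22.")

# The third argument of str.maketrans lists characters to delete
drop_punct = str.maketrans("", "", ".,")
cleaned = text.translate(drop_punct)
print(cleaned)
```

Deleting every period also collapses abbreviations and domain names (`u.s.` becomes `us`, `Bhg.com` becomes `Bhgcom`), which is visible in the output of the next cell.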

In [11]:
print(text_voice)

Yoyomax12 I'm calling you from universal hub Jordan education consultancy Okay see you soon No I'm doing research Department Jessie's girl Play climb 34 commercial I'd like to know the score Right Basically I can understand that you're looking for a fully-funded PNC core maybe in the USA Okay so unfortunately Let me give you the positive first we have a partnership with more than 60 universities Across us UK and Canada You do not offer none of these universities offer free PNP code And You know there are scholarships but that totally depends on the months that you will see in your master and the aisle so do you know very much that you would get Would you be open to options if you get a scholarship of about twenty 30% or are you only looking for free? GOP Sunday Bhgcom 94 trans am That's my plan fully funded and my financial condition is not that much I completely understand the situation 22 You're not wrong in your car to apply for a fully-funded you're absolutely right and there are u

**Identify the Stop Words and Remove Them**

In [12]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [13]:
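# Remove the stop words
from nltk.corpus import stopwords
nltk.download('stopwords')
from nltk.tokenize import word_tokenize
text_tokens = word_tokenize(text_voice)
# print(text_tokens)
# Build the stop-word set once (all languages, as in the original call)
# instead of reloading the full list for every token
stop_words = set(stopwords.words())
tokens_without_st = [word for word in text_tokens if word not in stop_words]
print(tokens_without_st)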

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.

['Yoyomax12', 'I', "'m", 'calling', 'universal', 'hub', 'Jordan', 'education', 'consultancy', 'Okay', 'see', 'soon', 'No', 'I', "'m", 'research', 'Department', 'Jessie', "'s", 'girl', 'Play', 'climb', '34', 'commercial', 'I', "'d", 'like', 'know', 'score', 'Right', 'Basically', 'I', 'understand', "'re", 'looking', 'fully-funded', 'PNC', 'core', 'maybe', 'USA', 'Okay', 'unfortunately', 'Let', 'give', 'positive', 'first', 'partnership', '60', 'universities', 'Across', 'us', 'UK', 'Canada', 'You', 'offer', 'none', 'universities', 'offer', 'free', 'PNP', 'code', 'And', 'You', 'know', 'scholarships', 'totally', 'depends', 'months', 'see', 'master', 'aisle', 'know', 'much', 'would', 'get', 'Would', 'open', 'options', 'get', 'scholarship', 'twenty', '30', '%', 'looking', 'free', '?', 'GOP', 'Sunday', 'Bhgcom', '94', 'trans', 'That', "'s", 'plan', 'fully', 'funded', 'financial', 'condition', 'much', 'I', 'completely', 'understand', 'situation', '22', 'You', "'re", 'wrong', 'car', 'apply', 'ful
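
Capitalized tokens such as `'You'`, `'And'` and `'That'` survive in the output above because the membership test is case-sensitive, while the stop-word entries are lowercase. A minimal sketch of a case-insensitive filter follows; the small `stop_words` set is a hypothetical stand-in for the NLTK list so the example runs without downloads.

```python
# Hypothetical stand-in for the NLTK stop-word list (lowercase entries)
stop_words = {"you", "and", "that", "no", "i"}

# Tokens taken from the output above
tokens = ["You", "offer", "none", "And", "You", "know", "That", "plan"]

# Lowercase each token only for the lookup; keep the original casing
filtered = [tok for tok in tokens if tok.lower() not in stop_words]
print(filtered)
```

With the real list, the same pattern would be `tok.lower() not in stop_words`, where `stop_words` is built from `stopwords.words(...)`.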




In [14]:
# # Tokenization Text word by word
# from nltk.tokenize import word_tokenize
# text_voice0=word_tokenize(tokens_without_sw)

In [15]:
tokens_without_st

['Yoyomax12',
 'I',
 "'m",
 'calling',
 'universal',
 'hub',
 'Jordan',
 'education',
 'consultancy',
 'Okay',
 'see',
 'soon',
 'No',
 'I',
 "'m",
 'research',
 'Department',
 'Jessie',
 "'s",
 'girl',
 'Play',
 'climb',
 '34',
 'commercial',
 'I',
 "'d",
 'like',
 'know',
 'score',
 'Right',
 'Basically',
 'I',
 'understand',
 "'re",
 'looking',
 'fully-funded',
 'PNC',
 'core',
 'maybe',
 'USA',
 'Okay',
 'unfortunately',
 'Let',
 'give',
 'positive',
 'first',
 'partnership',
 '60',
 'universities',
 'Across',
 'us',
 'UK',
 'Canada',
 'You',
 'offer',
 'none',
 'universities',
 'offer',
 'free',
 'PNP',
 'code',
 'And',
 'You',
 'know',
 'scholarships',
 'totally',
 'depends',
 'months',
 'see',
 'master',
 'aisle',
 'know',
 'much',
 'would',
 'get',
 'Would',
 'open',
 'options',
 'get',
 'scholarship',
 'twenty',
 '30',
 '%',
 'looking',
 'free',
 '?',
 'GOP',
 'Sunday',
 'Bhgcom',
 '94',
 'trans',
 'That',
 "'s",
 'plan',
 'fully',
 'funded',
 'financial',
 'condition',
 'much

In [16]:
# Strip leading and trailing whitespace from each token
# (str.strip with no arguments does not remove special characters)
res = list(map(str.strip, tokens_without_st))
print(res)

['Yoyomax12', 'I', "'m", 'calling', 'universal', 'hub', 'Jordan', 'education', 'consultancy', 'Okay', 'see', 'soon', 'No', 'I', "'m", 'research', 'Department', 'Jessie', "'s", 'girl', 'Play', 'climb', '34', 'commercial', 'I', "'d", 'like', 'know', 'score', 'Right', 'Basically', 'I', 'understand', "'re", 'looking', 'fully-funded', 'PNC', 'core', 'maybe', 'USA', 'Okay', 'unfortunately', 'Let', 'give', 'positive', 'first', 'partnership', '60', 'universities', 'Across', 'us', 'UK', 'Canada', 'You', 'offer', 'none', 'universities', 'offer', 'free', 'PNP', 'code', 'And', 'You', 'know', 'scholarships', 'totally', 'depends', 'months', 'see', 'master', 'aisle', 'know', 'much', 'would', 'get', 'Would', 'open', 'options', 'get', 'scholarship', 'twenty', '30', '%', 'looking', 'free', '?', 'GOP', 'Sunday', 'Bhgcom', '94', 'trans', 'That', "'s", 'plan', 'fully', 'funded', 'financial', 'condition', 'much', 'I', 'completely', 'understand', 'situation', '22', 'You', "'re", 'wrong', 'car', 'apply', 'ful
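
The output above still contains `'%'`, `'?'` and contraction fragments such as `"'m"`, since `str.strip` only trims whitespace. One possible way to drop them is sketched below, assuming that only plain words, numbers and hyphenated compounds should be kept; that rule is a design choice for illustration, not part of the original notebook.

```python
import re

# Tokens taken from the output above
tokens = ["I", "'m", "calling", "fully-funded", "30", "%",
          "free", "?", "Jessie", "'s"]

# Keep words, numbers and hyphenated compounds; reject everything else
word_pattern = re.compile(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*")
cleaned_tokens = [tok for tok in tokens if word_pattern.fullmatch(tok)]
print(cleaned_tokens)
```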

In [17]:
# #Stemming
# list_stem=['']
# from nltk.stem import PorterStemmer
# porter = PorterStemmer()
# for x in res:
#   z=porter.stem(x)
#   list_stem.append(z)

In [18]:
# list_stem

# Modeling Techniques 

**Topic Modeling**

In [19]:
!pip install bertopic

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting bertopic
  Downloading bertopic-0.10.0-py2.py3-none-any.whl (58 kB)
     |████████████████████████████████| 58 kB 4.1 MB/s 
Collecting umap-learn>=0.5.0
  Downloading umap-learn-0.5.3.tar.gz (88 kB)
     |████████████████████████████████| 88 kB 6.4 MB/s 
Collecting sentence-transformers>=0.4.1
  Downloading sentence-transformers-2.2.0.tar.gz (79 kB)
     |████████████████████████████████| 79 kB 9.9 MB/s 
Collecting hdbscan>=0.8.28
  Downloading hdbscan-0.8.28.tar.gz (5.2 MB)
     |████████████████████████████████| 5.2 MB 55.2 MB/s 
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
    Preparing wheel metadata ... done
Collecting transformers<5.0.0,>=4.6.0
  Downloading transformers-4.20.0-py3-none-any.whl (4.4 MB)
[K     |████████████████████████████████| 4.4 MB 38.1 MB/s 
Collecting se

# BERTopic

In [20]:
# create model 
from bertopic import BERTopic

model = BERTopic(verbose=True)
 
#convert to list 
# docs = text_voice.text.to_list()
 
topics, probabilities = model.fit_transform(res)

2022-06-19 06:15:04,551 - BERTopic - Transformed documents to Embeddings
2022-06-19 06:15:14,601 - BERTopic - Reduced dimensionality
2022-06-19 06:15:14,628 - BERTopic - Clustered reduced embeddings
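
The cell above passes `res`, a list of single-word tokens, to `fit_transform`, so each word is embedded and clustered as its own document. BERTopic is generally fed sentence- or paragraph-level documents instead. A minimal sketch of building such a list from the transcript is shown below; it would have to run before the cell that strips periods, and the sample text and the name `docs` are illustrative.

```python
import re

# Sample sentences taken from the transcript above
transcript = ("I'm calling you from universal hub. Jordan education consultancy. "
              "Okay see you soon. No I'm doing research.")

# Split after sentence-ending punctuation that is followed by whitespace
docs = [s.strip() for s in re.split(r"(?<=[.?!])\s+", transcript) if s.strip()]
print(docs)
```

A list like `docs` could then be passed to `model.fit_transform`, although clustering usually needs far more documents than a short transcript provides.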


In [21]:
print(topics)
print(probabilities)

[0, 3, 4, 0, 0, 0, 0, 0, 0, 2, 5, 1, 2, 3, 4, 0, 0, 0, 4, 0, 0, 0, 0, 0, 3, 0, 2, 5, 0, -1, 0, 3, 5, 4, 6, 0, 0, 0, 1, 0, 2, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 2, 3, 5, 0, 1, 1, 1, 5, 4, 0, 5, 0, 1, 0, 1, 6, 0, 0, 0, 0, 0, 0, 6, 0, 2, 0, 1, 0, 0, 0, 2, 4, 0, 0, 0, 0, 0, 0, 3, 0, 5, 0, 0, 3, 4, -1, 0, 0, 0, 4, 1, -1, 0, 3, 0, 6, 1, 4, 0, 0, 0, 0, 0, 0, 0, 0, 5, 0, 1, 0, 0, 0, 0, 0, 0, 4, 6, 2, 2, 2, 0, 0, 0, 0, 2, 3, 0, 2, 0, 0, 0, 2, 0, 0, 0, 6, 6, 6, 0, 2, 3, 0, 0, 0, 0, 2, 3, 2, 5, 3, 0, 1, 0, 0, 1, 0, 1, 0, 5, 0, 0, 2, 2, 1, 0, 0, 3, 6, 5, 0, 0, 2, 4, -1, 4, -1, 2, 4, 0, 0, 0, 1, 0, 1, 0, 0, 0, 3, 4, 0, 0, 0, 0, 3, 0, 0, 0, 5, 0, 0, 6, 0, 0, 0, 0, 3, 0, 0, 3, 4, 0, 0, 0, 6, 1, 5, 4, 0, 0, 5, 1, 0, 0, 0, 3, 0, 0, 0, 0, 2, 3, 4, 0, 3, 4, 0, 0, 0, 0, 2, 0, 0, 0, 0, -1, 1, 0, 0, 0, 3, 0, 0, 1, 0, 0, 1]
[1.         0.76305457 0.85182281 1.         0.83939677 1.
 1.         1.         1.         0.94456437 1.         0.63400716
 0.70608307 0.83371559 0.77898105 1.

**Select the Top Topic**

In [22]:
model.get_topic_freq().head(15)

| | Topic | Count |
|---|---|---|
| 0 | 0 | 161 |
| 1 | 1 | 23 |
| 2 | 2 | 22 |
| 3 | 3 | 22 |
| 4 | 4 | 18 |
| 5 | 5 | 14 |
| 6 | 6 | 11 |
| 7 | -1 | 6 |
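
The same kind of frequency count can be derived directly from the `topics` list with `collections.Counter`. The sketch below uses only the first 16 assignments printed earlier, so its counts differ from the full table; topic `-1` is the label BERTopic gives to outliers.

```python
from collections import Counter

# First 16 topic assignments from the printed `topics` list
sample_topics = [0, 3, 4, 0, 0, 0, 0, 0, 0, 2, 5, 1, 2, 3, 4, 0]

counts = Counter(sample_topics)
# most_common() orders the topics by descending count
for topic_id, count in counts.most_common():
    label = "outliers" if topic_id == -1 else f"topic {topic_id}"
    print(label, count)
```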


In [23]:
model.get_topic_freq(5)

14

In [24]:
model.get_topic(4)

[('re', 1.027669402627873),
 ('nt', 0.7921682063542231),
 ('master', 0.4905696006407352),
 ('are', 0.4905696006407352),
 ('', 1e-05),
 ('', 1e-05),
 ('', 1e-05),
 ('', 1e-05),
 ('', 1e-05),
 ('', 1e-05)]
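
The empty strings with a score of `1e-05` appear to be padding for a topic that has fewer distinct words than the requested number of top words. A minimal sketch that keeps only the non-empty entries, using a shortened copy of the output above as input:

```python
# Shortened copy of the model.get_topic(4) output shown above
topic_words = [
    ("re", 1.027669402627873),
    ("nt", 0.7921682063542231),
    ("master", 0.4905696006407352),
    ("are", 0.4905696006407352),
    ("", 1e-05),
    ("", 1e-05),
]

# Drop entries whose word is an empty string and round scores for display
meaningful = [(word, round(score, 3)) for word, score in topic_words if word]
print(meaningful)
```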

**Model Topic Visualization**

In [25]:
# Using a heatmap
model.visualize_heatmap()

In [26]:
# Using a bar chart
model.visualize_barchart()