<a href="https://colab.research.google.com/github/PranshuSwaroop15/Projects/blob/master/Pranshu_Swaroop_Task_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **University Living AI Intern Assignment**

Title: Voice-to-Text System Development

Objective:
The primary objective of this project is to design and implement a robust voice-to-text (speech recognition) system capable of accurately transcribing spoken language into textual form. The system should be able to process audio input from various sources and environments, converting it into written text with high accuracy.

Scope:
1. Develop a speech recognition system capable of processing audio input in real time or near
real time.
2. Select and implement appropriate speech recognition algorithms and techniques, considering
factors such as noise robustness, language support, and latency.
3. Integrate pre-trained speech recognition models or develop custom models using machine
learning approaches, such as deep learning-based architectures (e.g., convolutional neural
networks, recurrent neural networks, transformer models).
4. Optimize the system for performance across different accents and speaking styles.
5. Implement features for handling audio pre-processing, including noise reduction, voice activity
detection, and audio normalization.
6. Support streaming audio input for continuous transcription in real-time applications.

Deliverable:
1. Functional speech recognition system capable of converting audio input into text with specified
accuracy and performance metrics.
2. Deploy the model on Git
3. Integration with audio input sources and interfaces for real-time transcription.
4. Documentation summarizing system architecture, algorithms, and implementation details.
5. Evaluation results demonstrating the performance of the speech recognition system across
different test scenarios and data sets.

# **Steps**

1. Import the required libraries.
2. Load the audio file.
3. Load the pretrained Wav2Vec2 model.
4. Make predictions.



In [119]:
!pip install SpeechRecognition




In [80]:
!pip install noisereduce

Collecting noisereduce
  Downloading noisereduce-3.0.2-py3-none-any.whl (22 kB)
Installing collected packages: noisereduce
Successfully installed noisereduce-3.0.2


In [113]:
!pip install -q transformers

# **1. Importing Important Libraries**

In [106]:
import speech_recognition as sr

In [107]:
from IPython.display import Audio
from scipy.io import wavfile
import numpy as np

In [114]:
import soundfile as sf
import librosa
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Tokenizer

# **2. Loading the Audio File**

Audio Link: https://drive.google.com/file/d/1k2TzKRz72pxodHcKjqiF5RPzi_-jBSwM/view?usp=sharing

In [120]:
file_name = 'FarToLoud.wav'

In [109]:

Audio(file_name)

In [111]:
# wavfile.read returns a (sample_rate, samples) tuple
framerate, sounddata = wavfile.read(file_name)
# Time stamp (in seconds) of each sample
time = np.arange(len(sounddata)) / framerate
print('Sample rate:', framerate, 'Hz')
print('Total time:', len(sounddata) / framerate, 's')

Sample rate: 16000 Hz
Total time: 1.9626875 s
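
`wavfile.read` returns integer samples for integer PCM files (for example `int16` for 16-bit PCM), while speech models generally work with floating-point audio. The audio normalization step listed in the scope could be sketched as below. This is a minimal illustration, assuming mono integer input; the helper name `to_normalized_float`, the `target_peak` value, and the synthetic tone are assumptions and not part of the original notebook.

```python
import numpy as np

def to_normalized_float(samples, target_peak=0.95):
    """Convert PCM samples to float32 and scale the loudest sample to target_peak."""
    audio = np.asarray(samples, dtype=np.float32)
    peak = float(np.max(np.abs(audio))) if audio.size else 0.0
    if peak == 0.0:
        return audio  # silent or empty clip: nothing to scale
    return (audio * (target_peak / peak)).astype(np.float32)

# Synthetic stand-in for `sounddata`: a quiet 440 Hz tone, 1 second at 16 kHz
sr_hz = 16000
t = np.arange(sr_hz) / sr_hz
tone = (1000 * np.sin(2 * np.pi * 440 * t)).astype(np.int16)

normalized = to_normalized_float(tone)
print(normalized.dtype, float(np.max(np.abs(normalized))))
```

Peak normalization only rescales amplitude; it does not remove noise, so it would typically be combined with a separate noise reduction step.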



# **3. Importing Wav2Vec2 Model**

In [115]:

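# Load the pretrained tokenizer and CTC model from the Hugging Face Hub.
# The warning below appears because the checkpoint stores a Wav2Vec2CTCTokenizer.
tokenizer = Wav2Vec2Tokenizer.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")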

The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'Wav2Vec2CTCTokenizer'. 
The class this function is called from is 'Wav2Vec2Tokenizer'.
Some weights of the model checkpoint at facebook/wav2vec2-base-960h were not used when initializing Wav2Vec2ForCTC: ['wav2vec2.encoder.pos_conv_embed.conv.weight_g', 'wav2vec2.encoder.pos_conv_embed.conv.weight_v']
- This IS expected if you are initializing Wav2Vec2ForCTC from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing Wav2Vec2ForCTC from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of Wav2Vec2

In [116]:

# Load the clip as mono float32 audio, resampled to the 16 kHz rate the model expects
input_audio, _ = librosa.load(file_name, sr=16000)
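
The scope also lists voice activity detection as a pre-processing step. One simple way to approximate it is to flag frames whose energy is close to that of the loudest frame. The sketch below is an illustrative energy-threshold detector, not a production VAD; the function name `frame_energy_vad`, the 20 ms frame size, and the -35 dB threshold are assumptions chosen for the example, and the input is a synthetic clip rather than the notebook's audio.

```python
import numpy as np

def frame_energy_vad(audio, sample_rate=16000, frame_ms=20, threshold_db=-35.0):
    """Return one boolean per frame: True where the frame's RMS level,
    measured relative to the loudest frame, exceeds threshold_db."""
    frame_len = int(sample_rate * frame_ms / 1000)
    n_frames = len(audio) // frame_len
    if n_frames == 0:
        return np.zeros(0, dtype=bool)
    # Drop the trailing partial frame and view the signal as (frames, samples)
    frames = np.reshape(audio[: n_frames * frame_len], (n_frames, frame_len))
    rms = np.sqrt(np.mean(frames.astype(np.float64) ** 2, axis=1))
    ref = rms.max()
    if ref == 0.0:
        return np.zeros(n_frames, dtype=bool)  # completely silent clip
    level_db = 20.0 * np.log10(np.maximum(rms, 1e-12) / ref)
    return level_db > threshold_db

# Synthetic example: 0.5 s silence, 1 s tone, 0.5 s silence
sr_hz = 16000
silence = np.zeros(sr_hz // 2, dtype=np.float32)
t = np.arange(sr_hz) / sr_hz
tone = 0.5 * np.sin(2 * np.pi * 220 * t).astype(np.float32)
clip = np.concatenate([silence, tone, silence])

speech_mask = frame_energy_vad(clip)
print(speech_mask.sum(), "of", speech_mask.size, "frames flagged as active")
```

A relative threshold like this is sensitive to background noise, so real recordings would likely need tuning or a dedicated VAD library.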

# **4. Make Predictions**

In [117]:

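# Convert the waveform into a batch of model-ready tensors
input_values = tokenizer(input_audio, return_tensors="pt").input_values
with torch.no_grad():  # inference only, so skip gradient tracking
    logits = model(input_values).logits
# Greedy decoding: most likely token per time step, then map IDs to text
predicted_ids = torch.argmax(logits, dim=-1)
transcription = tokenizer.batch_decode(predicted_ids)[0]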
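
The `argmax` followed by `batch_decode` amounts to greedy CTC decoding: take the most likely token for each audio frame, collapse consecutive repeats, and drop the blank token. The toy sketch below illustrates that idea with NumPy only. The vocabulary, token IDs, and the function `ctc_greedy_decode` are made up for illustration; the checkpoint's real vocabulary differs.

```python
import numpy as np

# Toy vocabulary (hypothetical): index 0 is the CTC blank, "|" separates words
VOCAB = ["<blank>", "|", "F", "A", "R", "T", "O"]
BLANK_ID = 0

def ctc_greedy_decode(logits):
    """Greedy CTC decoding for a (time, vocab) array of scores."""
    ids = np.argmax(logits, axis=-1)
    chars = []
    prev = None
    for i in ids:
        # Keep a token only when it differs from the previous frame and is not blank
        if i != prev and i != BLANK_ID:
            chars.append(VOCAB[i])
        prev = i
    return "".join(chars).replace("|", " ").strip()

# Frame-level IDs spelling "FAR TOO": repeats collapse, and a blank
# between the two O frames keeps the double letter
frame_ids = [2, 2, 0, 3, 3, 4, 0, 1, 5, 6, 6, 0, 6, 0]
toy_logits = np.eye(len(VOCAB))[frame_ids]  # one-hot scores per frame

print(ctc_greedy_decode(toy_logits))
```

Greedy decoding is fast but ignores language context; beam search with a language model is a common alternative when accuracy matters more than latency.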

In [118]:
transcription

'FAR TOO LOUD'
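
The deliverables ask for evaluation results. A common metric for speech recognition is word error rate (WER): the word-level edit distance between a reference transcript and the model output, divided by the number of reference words. The sketch below is a minimal pure-Python version; the function name `word_error_rate` and the reference string are assumptions for illustration, and the true transcript of the clip should be substituted.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.upper().split()
    hyp = hypothesis.upper().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the processed ref prefix and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution or match
        prev = curr
    return prev[-1] / len(ref)

# Hypothetical reference transcript; replace with the clip's true transcript
reference_text = "far too loud"
print(word_error_rate(reference_text, "FAR TOO LOUD"))
print(word_error_rate(reference_text, "FAR TO LOUD"))
```

Averaging this value over a labeled test set, split by accent or noise condition, would give the kind of per-scenario results the deliverables describe.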