<a href="https://colab.research.google.com/github/MK316/OpenAI/blob/main/Whisper_1st.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# OpenAI Whisper - 1st practice (22.12.20)

[Video tutorial](https://www.youtube.com/watch?v=wrSelk44_Js)

[Whisper license](https://github.com/openai/whisper/blob/main/LICENSE) - Copyright (c) 2022 OpenAI

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"),...

Video tutorial (Play if you want)

In [None]:
from IPython.display import YouTubeVideo, display

# Embed the tutorial video. No %%capture here: it would suppress the player output.
video = YouTubeVideo("wrSelk44_Js", width=500)
display(video)

Install Whisper from the official repository: [github.com/openai/whisper](https://github.com/openai/whisper)

In [None]:
%%capture
!pip install git+https://github.com/openai/whisper.git 

## [1] Base model access

Upload a speech file from your computer.

[sample speech short - rainbow passage](https://github.com/MK316/OpenAI/blob/main/audiodata/sample02.wav)

In [None]:
from google.colab import files
uploaded = files.upload()

In [None]:
import whisper

# Load the English-only base model; fp16=False avoids the half-precision warning on CPU.
model = whisper.load_model('base.en')
result = model.transcribe("sample02.wav", language="en", fp16=False)
print(result["text"])
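Besides `"text"`, the dictionary returned by `model.transcribe` also carries a `"segments"` list whose entries include `start`, `end`, and `text`. The following is a minimal sketch of printing timestamped segments. The helper names `format_timestamp` and `format_segments` are hypothetical, and `demo_result` is a made-up stand-in for a real transcription result, so the snippet runs without loading a model.

```python
def format_timestamp(seconds):
    """Convert a time in seconds to an MM:SS.mmm string."""
    minutes, secs = divmod(seconds, 60)
    return f"{int(minutes):02d}:{secs:06.3f}"


def format_segments(result):
    """Return one '[start --> end] text' line per segment of a transcribe() result."""
    lines = []
    for seg in result.get("segments", []):
        start = format_timestamp(seg["start"])
        end = format_timestamp(seg["end"])
        lines.append(f"[{start} --> {end}] {seg['text'].strip()}")
    return lines


# Hypothetical stand-in for the dictionary returned by model.transcribe(...)
demo_result = {
    "text": " First sentence. Second sentence.",
    "segments": [
        {"start": 0.0, "end": 4.5, "text": " First sentence."},
        {"start": 4.5, "end": 65.25, "text": " Second sentence."},
    ],
    "language": "en",
}

for line in format_segments(demo_result):
    print(line)
```

With a real model, passing `result` from the cell above in place of `demo_result` should work the same way.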

## [2] Transcribing Korean audio

In [None]:
# Creating Korean speech file using gTTS

In [None]:
%%capture
!pip install gTTS

In [None]:
#@markdown 🚩 Define a function { tts ( _text_to_say_ ) }:
from gtts import gTTS
from IPython.display import Audio

def tts(text):
  """Synthesize `text` with gTTS and return an audio player for the result."""

  # Step ① Choose the language (accent) of the synthesized voice
  language_to_choose = "ko" #@param ["en", "fr", "ko", "es"]
  print("Play language accent: %s" % language_to_choose)

  # Step ② Build the gTTS request
  gtts_object = gTTS(text=text, lang=language_to_choose, slow=False)

  # Step ③ Save the audio; gTTS writes MP3 data, so use an .mp3 extension
  gtts_object.save("sample.mp3")

  # Output: audio player for the saved file
  return Audio("sample.mp3")
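gTTS produces MP3-encoded audio, so a file's extension alone does not tell you its actual format. As a rough sketch, the check below inspects the first bytes of a file for the RIFF/WAVE header that real WAV files start with. The function name `looks_like_wav` and the two demo file names are hypothetical. Both files are created locally, so no TTS request is needed.

```python
import wave


def looks_like_wav(path):
    """Return True if the file starts with a RIFF/WAVE header."""
    with open(path, "rb") as f:
        header = f.read(12)
    return len(header) == 12 and header[:4] == b"RIFF" and header[8:12] == b"WAVE"


# A genuine WAV file written with the standard-library wave module
with wave.open("demo_real.wav", "wb") as w:
    w.setnchannels(1)                  # mono
    w.setsampwidth(2)                  # 16-bit samples
    w.setframerate(16000)              # 16 kHz
    w.writeframes(b"\x00\x00" * 1600)  # 0.1 s of silence

# A file that only has a .wav name; ID3 tag bytes appear at the start of many MP3 files
with open("demo_fake.wav", "wb") as f:
    f.write(b"ID3" + b"\x00" * 32)

print(looks_like_wav("demo_real.wav"))
print(looks_like_wav("demo_fake.wav"))
```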



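* text1: "밤이 늦었습니다. 오늘도 재미있는 코딩을 배우는 중입니다. 한국어라서 잘 구현될지 모르겠네요."  
  (English: "It is late at night. Today, too, I am learning fun coding. Since this is Korean, I am not sure whether it will work well.")

* text2: "착하지만 가난한 동생과 욕심 많은 형의 이야기인 흥부놀부전. 내용을 간단히 소개하자면 흥부는 온갖 어려움 속에서도 착한 마음을 잃지 않았지만, 놀부는 끝까지 욕심을 부리다가 벌을 받게 된다. 착한 흥부는 벌을 받은 형을 버리지 않고 도와주고, 뒷날 흥부는 착한 사람을 대표하는 이름이, 놀부는 욕심 많은 사람을 대표하는 이름이 되었다. 나쁜 마음씨를 갖고 살면 벌을 받고, 착하게 살면 복을 받는다는 전형적인 전래동화의 교훈을 담고 있다."  
  (English: "Heungbu and Nolbu is the story of a kind but poor younger brother and a greedy older brother. In brief, Heungbu never loses his kind heart despite every hardship, while Nolbu stays greedy to the end and is punished. Kind Heungbu does not abandon his punished brother but helps him, and later Heungbu became a name that stands for kind people and Nolbu a name that stands for greedy people. It carries the typical folk-tale moral that a wicked heart brings punishment and a kind life brings blessings.")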

In [None]:
#@markdown 🚩 Type text to say
print('Type text to create audio:')
txt = input()
tts(txt)

Note: The recording was made manually in Praat by playing back the generated gTTS audio (ko).

[sample audio](https://github.com/MK316/OpenAI/blob/main/audiodata/sample_ko_01.wav)

[text2 audio](https://github.com/MK316/OpenAI/blob/main/audiodata/sample_k_shortstory_ttsvoice.wav)  

* _Note_: the sentence was copied to see whether the error comes from the boundary of audio files.

In [None]:
model = whisper.load_model('base')
result = model.transcribe('sample_ko_01.wav', language="ko", fp16=False)
print(result['text'])

In [None]:
model = whisper.load_model("small")
result = model.transcribe('sample_ko_01.wav', language="ko", task='translate', fp16=False)
print(result["text"])

Sample 2: short story

* Text: "착하지만 가난한 동생과 욕심 많은 형의 이야기인 흥부놀부전. 내용을 간단히 소개하자면 흥부는 온갖 어려움 속에서도 착한 마음을 잃지 않았지만, 놀부는 끝까지 욕심을 부리다가 벌을 받게 된다. 착한 흥부는 벌을 받은 형을 버리지 않고 도와주고, 뒷날 흥부는 착한 사람을 대표하는 이름이, 놀부는 욕심 많은 사람을 대표하는 이름이 되었다. 나쁜 마음씨를 갖고 살면 벌을 받고, 착하게 살면 복을 받는다는 전형적인 전래동화의 교훈을 담고 있다."  
[online source](https://m.post.naver.com/viewer/postView.naver?volumeNo=9154443&memberNo=15460571)  
[audiofile info](https://raw.githubusercontent.com/MK316/OpenAI/main/audiodata/audiofile_info.md)

text2 audio (by human, female)

In [None]:
model = whisper.load_model('base')
result = model.transcribe('sample_k_shortstory.wav', language="ko", fp16=False)
print(result['text'])

translate task

In [None]:
model = whisper.load_model("small")
result = model.transcribe('sample_k_shortstory.wav', language="ko", task='translate', fp16=False)
print(result["text"])

text2 audio (by gTTS, male)

In [None]:
model = whisper.load_model('base')
result = model.transcribe('sample_k_shortstory_ttsvoice.wav', language="ko", fp16=False)
print(result['text'])
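To compare the transcriptions above with the reference text more systematically than by eye, one option is a character error rate (CER): the edit distance between reference and hypothesis divided by the reference length. The sketch below is one simple way to compute it, not a method used in this notebook. The function names are hypothetical, whitespace is ignored as a simplifying assumption, and the `hypothesis` string is made up rather than real Whisper output.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (each edit costs 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate, ignoring whitespace differences."""
    ref = "".join(reference.split())
    hyp = "".join(hypothesis.split())
    if not ref:
        raise ValueError("reference must contain at least one character")
    return edit_distance(ref, hyp) / len(ref)


reference = "밤이 늦었습니다."
hypothesis = "밤이 늦었읍니다"  # made-up transcription for illustration only
print(f"CER: {cer(reference, hypothesis):.2%}")
```

With real data, `reference` would be text1 or text2 and `hypothesis` would be `result['text']` from the corresponding cell.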

## [3] Low level model access

In [None]:
model = whisper.load_model('small')

# Load the audio (resampled to 16 kHz mono), then pad or trim it to exactly 30 seconds
audio = whisper.load_audio('sample_ko_01.wav')
audio = whisper.pad_or_trim(audio)

# Compute the log-Mel spectrogram and move it to the same device as the model
mel = whisper.log_mel_spectrogram(audio).to(model.device)
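`whisper.pad_or_trim` forces the audio to a fixed 30-second window (480,000 samples at 16 kHz): shorter clips are padded with zeros and longer clips are cut. The sketch below imitates that behavior on plain Python lists to make the effect visible. `pad_or_trim_sketch` is a hypothetical, simplified stand-in; the real function operates on NumPy arrays or tensors.

```python
SAMPLE_RATE = 16000                      # Whisper works on 16 kHz audio
CHUNK_SECONDS = 30                       # length of the window the model sees at once
N_SAMPLES = SAMPLE_RATE * CHUNK_SECONDS  # 480000 samples


def pad_or_trim_sketch(samples, length=N_SAMPLES):
    """Simplified stand-in for whisper.pad_or_trim on a plain list of samples."""
    if len(samples) > length:
        return samples[:length]                       # drop everything after 30 s
    return samples + [0.0] * (length - len(samples))  # append silence


short_clip = [0.1] * (SAMPLE_RATE * 5)   # 5 s of dummy audio
long_clip = [0.1] * (SAMPLE_RATE * 45)   # 45 s of dummy audio

print(len(pad_or_trim_sketch(short_clip)) / SAMPLE_RATE)
print(len(pad_or_trim_sketch(long_clip)) / SAMPLE_RATE)
```

One consequence is that low-level decoding of a single `mel` only covers the first 30 seconds of a recording, whereas `model.transcribe` steps through the whole file in 30-second windows.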


[1] Language detection:

[Training data by language](https://raw.githubusercontent.com/openai/whisper/main/language-breakdown.svg)

In [None]:
# Detect the spoken language
_, probs = model.detect_language(mel)
lang = max(probs, key=probs.get)
prob = "{0:.0%}".format(probs[lang])

# Print the language that scored the highest likelihood
print(f'Detected language (and probability): {lang} ({prob})')
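The `probs` dictionary maps language codes to probabilities, so it can also show the runner-up languages, which helps when detection is uncertain. A minimal sketch follows. `top_languages` is a hypothetical helper, and `demo_probs` contains invented values rather than real model output.

```python
def top_languages(probs, k=3):
    """Return the k most likely (language, probability) pairs, best first."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]


# Hypothetical probabilities for illustration only
demo_probs = {"ko": 0.97, "ja": 0.02, "zh": 0.006, "en": 0.004}

for lang, p in top_languages(demo_probs):
    print(f"{lang}: {p:.1%}")
```

In the notebook, calling `top_languages(probs)` on the dictionary returned by `model.detect_language(mel)` should work the same way.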

[2] Korean-to-English translation (the result for Text1 is excellent, but the result for Text2 is very poor).

In [None]:
# Decode the audio, translating Korean speech into English text
options = whisper.DecodingOptions(language="ko", task='translate')
result = whisper.decode(model, mel, options)

# print the recognized text
print(result.text)