<a href="https://colab.research.google.com/github/MA-Barracas/insanely-fast-whisper-tutorial/blob/main/insanely_fast_whisper_tutorial.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [6]:
# !pip install --upgrade pip
# !pip install --upgrade transformers accelerate
# !pip install --upgrade pytubefix pydub

In [3]:
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

In [9]:
from pytubefix import YouTube  
from pydub import AudioSegment
import os
import traceback 

def descargar_audio_mp3(url):
    """Download the audio track of a YouTube video and convert it to MP3."""
    try:
        # Download the first audio-only stream
        yt = YouTube(url)
        video = yt.streams.filter(only_audio=True).first()
        output_file = video.download()

        # Convert to MP3 with pydub
        mp3_filename = os.path.splitext(output_file)[0] + ".mp3"
        audio = AudioSegment.from_file(output_file)
        audio.export(mp3_filename, format="mp3")

        # Delete the original download
        os.remove(output_file)

        print(f"MP3 file saved as: {mp3_filename}")
        return mp3_filename
    except Exception as e:
        print("An error occurred: ", e)
        print(traceback.format_exc())
        # mp3_filename may be undefined if the failure happened early
        return None

# Example video -> 
# Prof. Geoffrey Hinton - "Will digital intelligence replace biological intelligence?" 
# Romanes Lecture

url = "https://www.youtube.com/watch?v=N1TEjTeQeg0"
audio_filename = descargar_audio_mp3(url)

MP3 file saved as: c:\Users\Ort\Desktop\medium\insanely-fast-whisper-tutorial\Prof Geoffrey Hinton - Will digital intelligence replace biological intelligence Romanes Lecture.mp3


In [10]:
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "openai/whisper-large-v3"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)


model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=device,
    batch_size=24
)

In [11]:
result = pipe(audio_filename)

Due to a bug fix in https://github.com/huggingface/transformers/pull/28687 transcription using a multilingual Whisper will default to language detection followed by transcription instead of translation to English.This might be a breaking change for your use case. If you want to instead always translate your audio to English, make sure to pass `language='en'`.
  attn_output = torch.nn.functional.scaled_dot_product_attention(
Passing a tuple of `past_key_values` is deprecated and will be removed in Transformers v4.43.0. You should pass an instance of `EncoderDecoderCache` instead, e.g. `past_key_values=EncoderDecoderCache.from_legacy_cache(past_key_values)`.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


We can roughly estimate the number of tokens a text will produce for a typical English-language LLM. Here's a general guideline:

## Token Estimation Rule of Thumb

For English text, a common approximation is that **1 token is roughly equivalent to 4 characters**. This means you can use the following formula to estimate tokens:

$$\text{Estimated Tokens} \approx \frac{\text{Number of Characters}}{4}$$

## Factors to Consider

While this approximation is useful, keep in mind:

1. **Variability**: The actual token count can vary depending on the specific text content and the model's tokenization algorithm.

2. **Precision**: This estimate is not exact but provides a reasonable ballpark figure for planning purposes.

3. **Special Characters**: Punctuation, spaces, and special characters are also counted and can affect the token count.

4. **Word Complexity**: Uncommon or complex words might be split into more tokens than simple, frequent words.

## Example

Let's say you have a text with 1000 characters:

$$\text{Estimated Tokens} \approx \frac{1000}{4} = 250 \text{ tokens}$$

This rough estimate suggests the text would produce around 250 tokens.
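
The rule of thumb above can be sketched as a small helper. This is only an illustration: `estimate_tokens` and its `chars_per_token` parameter are hypothetical names introduced here, and the result is a rough planning figure, not the count a real tokenizer would give.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate: character count divided by chars_per_token."""
    # Assumption: about 4 characters per token for English text
    return int(len(text) / chars_per_token)


# Worked example from the text: 1000 characters
sample = "a" * 1000
print(estimate_tokens(sample))  # 250
```

Keeping the ratio as a parameter makes it easy to try a different value for other languages or tokenizers, where 4 characters per token may not hold.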

In [13]:
print("the number of characters is", len(result["text"]))
print("the approximate number of tokens is", int(len(result["text"])/4))

the number of characters is 34808
the approximate number of tokens is 8702


In [51]:
# !pip install cohere

In [17]:
import os

# Check if running in Colab
try:
    import google.colab
    IN_COLAB = True
except ImportError:
    IN_COLAB = False

# In Colab, secrets such as API keys are read through google.colab.userdata
if IN_COLAB:
    from google.colab import userdata


In [20]:

import cohere

# Get the API key based on the environment
if IN_COLAB:
    api_key = userdata.get('COHERE_API_KEY')
else:
    api_key = os.environ.get('COHERE_API_KEY')

# Initialize the Cohere client
co = cohere.Client(api_key=api_key)

# Build the summarization prompt and send it to the Cohere chat endpoint
query = f"""
Write a thorough summary for this text: '''
{result["text"]}
'''
Give extensive info about the content.
Discuss all important points raised throughout.
At the end, create a section of "Conclusions" and another for "Summary"
"""

cohere_query = co.chat(
  model="command-r-plus",
  message=query
)
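
The same prompt template is used again later in this notebook for the second transcript. One possible way to avoid repeating it is a small helper like the sketch below. `build_summary_prompt` is a hypothetical name introduced here, not part of the `cohere` library, and it only builds the string without calling the API.

```python
def build_summary_prompt(text: str) -> str:
    """Wrap a transcript in the summarization instructions used in this notebook."""
    return f"""
Write a thorough summary for this text: '''
{text}
'''
Give extensive info about the content.
Discuss all important points raised throughout.
At the end, create a section of "Conclusions" and another for "Summary"
"""


# Placeholder transcript, only to show the shape of the resulting prompt
prompt = build_summary_prompt("Example transcript.")
print(prompt)
```

The returned string could then be passed as the `message` argument of the chat call, in place of the inline `query`.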


In [22]:
from IPython.display import Markdown
display(Markdown(cohere_query.text))

The lecturer, a prominent figure in the field of artificial intelligence, begins by introducing the two paradigms of intelligence that have existed since the 1950s: the logic-inspired approach and the biologically-inspired approach. The former believes that the essence of intelligence is reasoning, while the latter focuses on learning as the key to intelligence. This sets the tone for the discussion on artificial neural networks and their potential threats.

## Neural Networks and Language Models:
The lecturer explains the basic structure of a neural network, including input and output neurons, with intermediate layers learning to detect relevant features. He highlights the use of backpropagation, a method that computes how changing a weight affects the network's performance, as a more efficient way to train neural networks compared to the mutation method.

Moving to language models, the lecturer presents the two competing theories of meaning: the structuralist theory and the feature-based theory. He then describes a language model he developed in 1985, which unifies these theories by learning a set of semantic features for each word and predicting the features of the next word through feature interactions. This model is seen as an ancestor of modern large language models like GPT-4.

The lecturer addresses the criticism that language models are just autocomplete systems. He argues that they are fundamentally different as they turn words into features and use feature interactions to predict the next word, which he considers a form of understanding. He also mentions the ability of language models to reason and generate plausible responses, even when incorrect, similar to human memory and confabulation.

## Risks and Threats of AI:
The lecturer identifies several risks associated with powerful AI, including fake images, voices, and videos; job losses; massive surveillance; lethal autonomous weapons; cybercrime; deliberate pandemics; discrimination, and bias. While some of these issues may be manageable, he emphasizes the long-term existential threat posed by superintelligent AI. He believes that such systems could be used by bad actors and may develop a sub-goal of gaining more control, leading to potential takeover. Additionally, competition between superintelligent AI systems could result in aggressive behavior and pose a threat to humanity.

## Analog vs Digital Neural Networks:
The lecturer discusses the advantages of digital computation, including immortality and the ability to run the same program on different hardware. However, he introduces the concept of mortal computation, where hardware and software are inseparable, using low-power analog computation. This approach could utilize the rich analog properties of hardware to achieve more energy-efficient computations. He acknowledges the challenges of using backpropagation with analog systems and the potential for inferior learning algorithms compared to digital models.

The lecturer concludes by reflecting on the trade-offs between digital and analog computation. Digital computation, despite its high energy requirements, enables efficient knowledge sharing between multiple copies of the same model, contributing to the vast knowledge of systems like GPT-4. On the other hand, biological computation, though energy-efficient, falls short in knowledge sharing and communication. He assigns a probability of 0.5 that within the next 20 years, digital computation will lead to the development of systems smarter than humans, and a high probability that this will occur within the next 100 years. 

## Conclusions: 
The lecturer presents a comprehensive overview of neural networks, language models, and the potential risks associated with AI. He highlights the advantages and disadvantages of digital and analog computation, offering insights into the future of AI and the potential threats it may pose. 

## Summary: 
The lecturer, an expert in AI, discusses the evolution of intelligence paradigms and the development of neural networks and language models. He addresses criticisms and highlights the understanding capabilities of language models. He identifies various risks associated with powerful AI, with a focus on the long-term existential threat. The lecture concludes with a discussion on the trade-offs between digital and analog computation, predicting the likelihood of superintelligent systems surpassing human intelligence within the next 20 to 100 years.

In [23]:
if IN_COLAB:
    from IPython.display import display, Javascript
    display(Javascript('google.colab.kernel.restart()'))
else:
    import IPython

    IPython.Application.instance().kernel.do_shutdown(True)

In [62]:
# !pip install insanely-fast-whisper

In [63]:
# !pip install -q pipx && apt install python3.10-venv

In [1]:
!pipx run insanely-fast-whisper --file-name "Cursor AI tutorial for beginners.mp3"

⚠️  insanely-fast-whisper is already on your PATH and installed at
    /usr/local/bin/insanely-fast-whisper. Downloading and running anyway.
🤗 Transcribing... 0:00:07
You have passed task=transcribe, but also have set `forced_decoder_ids` to [[1, None], [2, 50360]] which creates a conflict. `forced_decoder_ids` will be ignored in favor of task=transcribe.

In [9]:
import json
# Load the transcript JSON written by the CLI run above
with open("output.json", "r") as f:
    text_ifw = json.load(f)

In [16]:
len(text_ifw["text"])

35661

In [18]:
import cohere
from google.colab import userdata
co = cohere.Client(api_key=userdata.get('COHERE_API_KEY'))
query = f"""
Write a thorough summary for this text: '''
{text_ifw["text"]}
'''
Give extensive info about the content.
Discuss all important points raised throughout.
At the end, create a section of "Conclusions" and another for "Summary"
"""

cohere_query = co.chat(
  model="command-r-plus",
  message=query
)

from IPython.display import Markdown
display(Markdown(cohere_query.text))

## Text Summary

In this video interview, the host invites Mike, a front-end developer, to share his best practices and strategies for using Cursor AI effectively. Mike begins by emphasizing the importance of planning and having a developer mindset. He suggests using tools like Figma or even simple sketches to visualize the desired outcome before prompting AI models. He also introduces the concept of "rubber ducking," where one explains their thoughts to a fictional duck, which helps with realizations and perspectives. 

Mike then introduces the use of V0, a platform that assists in visualizing the minimum viable product (MVP) of an app or website. He demonstrates how V0 can be used to create a clean-looking marketplace website, emphasizing its ability to provide a nice-looking UI with the Shatsian UI library. He suggests spending a good amount of time on V0, making at least 10-15 prompts to get the desired outcome before moving on to Cursor AI.

The discussion turns to cursor.directory, a website that provides prompts which can be copied and pasted into the Cursor codebase. These prompts ensure that Cursor has the necessary context and information about the technologies being used, such as Next.js. Mike demonstrates how to create a .cursor rules file in the root of a project and the benefits of providing this additional context to Cursor. He also mentions that if a specific technology is not listed on cursor.directory, one can prompt AI models like Cloud or ChatGVT to write similar prompts.

Mike emphasizes the importance of tagging documentation (docs) for the technologies being used. He demonstrates how to add the Next.js and Supabase docs to Cursor and explains that having access to the latest and most accurate information helps Cursor provide better solutions. He suggests that users should treat the AI models as new employees and provide them with thorough onboarding, including relevant documentation.

The host and Mike discuss the benefit of asking other AI models for help when Cursor gets stuck. Mike shares his strategy of providing the bug, the attempted solutions, and the expected outcome to another AI model, resulting in improved results. They also touch on the value of explaining code and teaching concepts using AI, as well as adding comments to code for better understanding.

Mike highlights the importance of duplicating existing functionality when making similar changes and providing context to AI models. He suggests using templates or starter kits that include boilerplate code for common features like authentication and database integration, emphasizing that Cursor and other AI platforms will likely provide more templates in the future. He offers his own free starter kit as an example.

The interview concludes with a discussion about building social media apps using Cursor and the host's previous video on the topic. They encourage viewers to use the comment section for questions, discussions, and sharing their own experiences with Cursor and AI development.

## Conclusions

The interview offers valuable insights and strategies for effectively using Cursor AI and similar tools. Planning, visualizing, and providing context are emphasized as key factors for successful outcomes. The host and Mike discuss the benefits of using V0 for initial planning and visualization, cursor.directory for providing technology-specific prompts, and tagging documentation for up-to-date information. Asking other AI models for help when Cursor gets stuck, explaining code, and using templates or starter kits are also highlighted as useful strategies.

## Summary

The host interviews Mike, a front-end developer, to gain insights into best practices for using Cursor AI. Mike emphasizes planning, visualization, and providing context to AI models. He introduces V0 for initial planning and cursor.directory for technology-specific prompts. Tagging documentation and asking other AI models for help when stuck are recommended strategies. Explaining code, adding comments, and using templates are also discussed. Mike offers his own free starter kit as an example. The interview concludes with a discussion about building social media apps and engaging viewers through the comment section.