# Chat with Audio Locally: A Guide to RAG with Whisper, Ollama, and ChromaDB (FAISS also works)
Features
1. Timestamped transcription, so every retrieved speech slice can be traced back to its position in the audio
2. Manual cosine-similarity search over the transcribed audio segments
3. Vector-store similarity search that fetches documents for question answering

Inspired by: 
* https://medium.com/@ingridwickstevens/chat-with-your-audio-locally-a-guide-to-rag-with-whisper-ollama-and-faiss-6656b0b40a68
* https://www.youtube.com/watch?v=TdMkKvzPe3E

### 1. Transcribe audio to text

In [1]:
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from progress_bar_decorator import progress_bar
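
`progress_bar_decorator` is a local helper module that is not shown in this notebook. For readers who want to run the cells, here is a minimal stand-in, assuming the decorator only needs to show a time-based estimate while the wrapped function runs. The `steps` parameter and the plain-text percentage output are assumptions made for illustration, not the original implementation.

```python
import functools
import sys
import threading
import time


def progress_bar(expected_time, steps=100):
    """Hypothetical stand-in: print a time-based progress estimate while a function runs."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            done = threading.Event()

            def report():
                start = time.monotonic()
                # Wake up every expected_time / steps seconds until the wrapped call finishes.
                while not done.wait(expected_time / steps):
                    elapsed = time.monotonic() - start
                    # Cap the estimate at 99% because the real duration is unknown.
                    pct = min(99, int(100 * elapsed / expected_time))
                    sys.stderr.write(f"\r{pct:3d}%")
                    sys.stderr.flush()

            reporter = threading.Thread(target=report, daemon=True)
            reporter.start()
            try:
                return func(*args, **kwargs)
            finally:
                done.set()
                reporter.join()
                sys.stderr.write("\r100%\n")
        return wrapper
    return decorator


@progress_bar(expected_time=0.2)
def demo_task():
    time.sleep(0.05)
    return "done"


print(demo_task())
```

The estimate is purely time-based, so `expected_time` only affects how fast the percentage advances, not when the function returns.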

In [3]:
has_mps = torch.backends.mps.is_available()
has_cuda = torch.cuda.is_available()
device = "mps" if has_mps else "cuda" if has_cuda else "cpu"
torch_dtype = torch.float16 if has_mps else torch.float32
device, torch_dtype

('mps', torch.float16)

In [4]:
model_id = "openai/whisper-large-v3"
# model_id = "openai/whisper-medium"

hf_model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, 
    torch_dtype=torch_dtype, 
    low_cpu_mem_usage=True, 
    use_safetensors=True,
    cache_dir='/Users/leon/Documents/03.LLM/whisper/models/'
).to(device)

processor = AutoProcessor.from_pretrained(model_id)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [5]:
pipe = pipeline(
    task="automatic-speech-recognition",
    model=hf_model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,  # 128
    chunk_length_s=64,   # 30 
    batch_size=24,       # 16  
    return_timestamps=True,
    torch_dtype=torch_dtype,
    device=device,
    ignore_warning=True,
)

In [7]:
%%time
audio_file = './whisper/audio/如何通过Vision Pro理解ChatGPT.mp3'

@progress_bar(expected_time=180)
def transcribe():
    result = pipe(
        audio_file, 
        generate_kwargs={"language": "Mandarin",},
        return_timestamps=True,
    )
    return result

result = transcribe()

# free the model memory once transcription is done
del hf_model
del processor
del pipe
if has_mps:
    torch.mps.empty_cache()
elif has_cuda:
    torch.cuda.empty_cache()

100%|██████████| 100/100 [01:28<00:00,  1.13it/s]


CPU times: user 33.4 s, sys: 8.17 s, total: 41.5 s
Wall time: 1min 29s


In [8]:
# parse file name
file_name = audio_file.rpartition("/")[-1].rpartition(".")[0]
print(file_name)

如何通过Vision Pro理解ChatGPT


In [9]:
result['text']

'其实在GPT的大模型里面的知识是非常大的我们在现在只不过是activate它的其中一点点然后是通过这种instruct或者说是chat的方式去让这个模型的输出变得更palatable to human就是我们更容易理解它的思维模式可能远远在我们所能理解之上其实有一个问题就是你的prompt你要去让ChatGPT的五个问题我发现我其实写了一个slides然后才去年三月份今天我就会结合这个slides里的内容然后和我们这一年的观察给大家重温一下这五个问题然后带来一些新的见解像大家看到的呢我现在其实是在一个�可以去给大家去讲我的这个PPTLights upOh nice time第一个问题好像这样还是不太好玩我们还是回到我们原来的那个模式吧它这批地到底是什么它是一个范式突破吗我们在这个Vision Pro出来的时候在iPhone出来的时候在一个新科技出来的时候这都是一个我们最需要问的问题为什么这个问题重要因为它只有范式突破才能带来一个十倍百倍的机会如果你在原有的基础之上做一个就是小范围的突破而不是一个范式突破的话那它我们如何使用大圆模型第五个是我们人类和大圆模型有什么不一样这四个问题都是顺接的就是我们了解了它到底能产生什么样的影响尤其是站在这个技术刚刚出现的时候我们要去想象它五年后十年后的影响这样的话我们才可以知道它给我们带来的机会那最后呢其实是一个究极问题了就是我们站到这个技术已经发展到完全非常成熟的时候那人类和这个技术还有什么不一样就是说我们还应该去做什么才是不会被这个技术所改变的就是为什么这五个问题是这样的顺序和为什么这五个问题这么重要物理网上很多东西都是noise嘛再远一点吗再来一好像也就这么远了就是ChaiGBD它是一个叫做Generative Auto-Regressive Large Language Model它是一个代言模型然后这个代言模型它的本质是一个生成式的然后它是一个Auto-Regressive就是自回归式的那这个具体怎么理解就是它是这个是一个就是Stefan Wolfram的一个图生成下一个次然后你这样自然而然就可以把人类已知的文本和它去进行一个匹配看它生成的对不对这样的话你就有海量的标签好的数据去帮助你这个模型去学习之后就是这个他们非常重要的scaling law的这个observation我在这儿稍微再多解释就在这里就多解释一下吧为什么g

In [10]:
import pandas as pd
df_transcribe = pd.DataFrame(result['chunks'])
df_transcribe

| | timestamp | text |
|---|---|---|
| 0 | (0.0, 4.48) | 其实在GPT的大模型里面的知识是非常大的 |
| 1 | (4.48, 7.76) | 我们在现在只不过是activate它的其中一点点 |
| 2 | (7.76, 11.84) | 然后是通过这种instruct或者说是chat的方式 |
| 3 | (11.84, 16.16) | 去让这个模型的输出变得更palatable to human |
| 4 | (16.16, 17.68) | 就是我们更容易理解 |
| ... | ... | ... |
| 217 | (1719.36, 1720.06) | 一个 |
| 218 | (1722.36, 1726.86) | 看看刚才东西有没有录上 |
| 219 | (1728.96, 1729.46) | 希望大家喜欢 |
| 220 | (1731.76, 1733.36) | 或者说觉得这是一个有用的东西我的视频到那边去了 |


In [11]:
# parse timestamp function
def parse_audio_slice_timestamp(time_tuple):
    time_list = list(time_tuple)
    return time_list[0], time_list[1]

In [12]:
transcribe_filename = f'./whisper/transcribe/{file_name}.csv'

df_transcribe.loc[:, 'start'] = df_transcribe['timestamp'].apply(lambda x: list(x)[0])
df_transcribe.loc[:, 'end'] = df_transcribe['timestamp'].apply(lambda x: list(x)[1])
df_transcribe.to_csv(transcribe_filename, index=False)
df_transcribe.head()

| | timestamp | text | start | end |
|---|---|---|---|---|
| 0 | (0.0, 4.48) | 其实在GPT的大模型里面的知识是非常大的 | 0.0 | 4.48 |
| 1 | (4.48, 7.76) | 我们在现在只不过是activate它的其中一点点 | 4.48 | 7.76 |
| 2 | (7.76, 11.84) | 然后是通过这种instruct或者说是chat的方式 | 7.76 | 11.84 |
| 3 | (11.84, 16.16) | 去让这个模型的输出变得更palatable to human | 11.84 | 16.16 |
| 4 | (16.16, 17.68) | 就是我们更容易理解 | 16.16 | 17.68 |


In [13]:
transcribe_text_filename = f'./whisper/transcribe/{file_name}.txt'

with open(transcribe_text_filename, 'w', encoding='utf-8') as f:
    f.write(result['text'])

In [14]:
from pydub import AudioSegment
from pydub.playback import play

sound = AudioSegment.from_file(audio_file)
print(f'Length of this audio file {round(len(sound)/1000/60, 2)} minutes')

row = df_transcribe.iloc[int(len(df_transcribe)/2), :]
print('Text:', row['text'])
print('Playing audio slice start from {}m to {}m'.format(row['start']/60, row['end']/60))

# pydub slices by milliseconds, so convert seconds to ms
play(sound[row['start']*1000: row['end']*1000])

Length of this audio file 28.99 minutes
Text: 它是在数据里面寻找一些规律
Playing audio slice start from 14.4185m to 14.4695m


In [15]:
# play(sound[-1000:])

### 2. Tokenize and embed text

In [16]:
from langchain.text_splitter import RecursiveCharacterTextSplitter, CharacterTextSplitter
from langchain.vectorstores import Chroma, FAISS
from langchain_core.output_parsers import StrOutputParser
import pandas as pd

#### 2.1 Direct embedding against audio

In [17]:
# transcribe_filename = f'./whisper/transcribe/{file_name}.csv'

df_embed = pd.read_csv(transcribe_filename)
df_embed.head()

| | timestamp | text | start | end |
|---|---|---|---|---|
| 0 | (0.0, 4.48) | 其实在GPT的大模型里面的知识是非常大的 | 0.0 | 4.48 |
| 1 | (4.48, 7.76) | 我们在现在只不过是activate它的其中一点点 | 4.48 | 7.76 |
| 2 | (7.76, 11.84) | 然后是通过这种instruct或者说是chat的方式 | 7.76 | 11.84 |
| 3 | (11.84, 16.16) | 去让这个模型的输出变得更palatable to human | 11.84 | 16.16 |
| 4 | (16.16, 17.68) | 就是我们更容易理解 | 16.16 | 17.68 |
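
After the round trip through CSV, the `timestamp` column comes back as a string such as `"(0.0, 4.48)"`, while `start` and `end` survive as floats. If the tuple itself is needed again, it can be parsed back safely. The sketch below assumes the cell format shown in the table; the helper name `parse_timestamp_cell` is hypothetical, and treating a missing end time (`None`) as an open-ended slice is also an assumption.

```python
import ast


def parse_timestamp_cell(cell):
    """Turn a CSV cell such as "(0.0, 4.48)" back into a (start, end) tuple."""
    # literal_eval only evaluates Python literals, so it does not execute arbitrary code.
    start, end = ast.literal_eval(cell)
    # Assumption: an end value of None means the slice has no known end time.
    return float(start), (float(end) if end is not None else None)


print(parse_timestamp_cell("(0.0, 4.48)"))
print(parse_timestamp_cell("(1731.76, None)"))
```

With pandas this could be applied as `df_embed['timestamp'].apply(parse_timestamp_cell)`.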


##### Choose embedding model

In [18]:
from langchain.embeddings import OllamaEmbeddings, SentenceTransformerEmbeddings
# embeddings = OllamaEmbeddings(model='llama2-chinese:latest')
# embeddings = OllamaEmbeddings(model='mxbai-embed-large:latest')
# embeddings = OllamaEmbeddings(model='nomic-embed-text:latest')

embeddings = SentenceTransformerEmbeddings(
    model_name='BAAI/bge-large-zh-v1.5', 
    cache_folder='/Users/leon/Documents/03.LLM/embedding_models'
)

In [19]:
# embed the text of one transcript row
def add_embed(row):
    return embeddings.embed_query(row['text'])

In [20]:
# cosine similarity between two embedding vectors
import numpy as np
def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
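
To make the score easier to interpret, here is a small dependency-free sketch of the same formula on toy two-dimensional vectors. The function name `cosine_similarity_py` and the example vectors are illustrative only; the notebook itself keeps using the NumPy version above.

```python
import math


def cosine_similarity_py(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Same direction, orthogonal, and opposite vectors.
same_direction = cosine_similarity_py([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_similarity_py([1.0, 0.0], [0.0, 1.0])
opposite = cosine_similarity_py([1.0, 0.0], [-1.0, 0.0])
print(same_direction, orthogonal, opposite)
```

The three values come out at roughly 1, 0, and -1. Vector length does not matter, only direction, which is why the score suits embeddings of texts with very different lengths.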

In [21]:
%%time
df_embed.loc[:, 'text_embed'] = df_embed.apply(add_embed, axis=1)
df_embed.head()

CPU times: user 5.87 s, sys: 734 ms, total: 6.6 s
Wall time: 15.5 s


| | timestamp | text | start | end | text_embed |
|---|---|---|---|---|---|
| 0 | (0.0, 4.48) | 其实在GPT的大模型里面的知识是非常大的 | 0.0 | 4.48 | [0.020624252036213875, 0.033191632479429245, -... |
| 1 | (4.48, 7.76) | 我们在现在只不过是activate它的其中一点点 | 4.48 | 7.76 | [0.026757562533020973, 0.008384518325328827, -... |
| 2 | (7.76, 11.84) | 然后是通过这种instruct或者说是chat的方式 | 7.76 | 11.84 | [0.05034959688782692, 0.0021004120353609324, 0... |
| 3 | (11.84, 16.16) | 去让这个模型的输出变得更palatable to human | 11.84 | 16.16 | [0.022767867892980576, -0.001781404484063387, ... |
| 4 | (16.16, 17.68) | 就是我们更容易理解 | 16.16 | 17.68 | [0.014743323437869549, -0.043621234595775604, ... |


In [22]:
# check the embedding vector length
len(df_embed['text_embed'].iloc[0])

1024

In [39]:
# give your search query
search_term = 'In-Context Learning'
search_term_embed = embeddings.embed_query(search_term)
# len(search_term_embed)

In [40]:
# score every segment against the query, then sort by similarity
df_embed.loc[:, 'cosine_similarity'] = df_embed['text_embed'].apply(lambda x: cosine_similarity(x, search_term_embed))
df_sorted = df_embed.sort_values(by='cosine_similarity', ascending=False)
df_sorted.head(5)

| | timestamp | text | start | end | text_embed | cosine_similarity |
|---|---|---|---|---|---|---|
| 59 | (444.11, 445.87) | 但是 in context learning 呢 | 444.11 | 445.87 | [0.044630419462919235, -0.01232170220464468, 0... | 0.794197 |
| 87 | (651.6, 654.84) | 因为GPT有了in context learning的这样的一个 | 651.6 | 654.84 | [0.02766631357371807, -0.006514569744467735, -... | 0.629797 |
| 54 | (401.86, 437.63) | 就是接下来一个技术很重要的点就是In-Context Learning然后这一点其实从一定程... | 401.86 | 437.63 | [0.024835968390107155, 0.012752030044794083, 0... | 0.587457 |
| 73 | (533.0, 535.8) | 然后reinforcement learning with human feedback | 533.0 | 535.8 | [0.02373799867928028, 0.01962607353925705, 0.0... | 0.561551 |
| 81 | (576.19, 608.05) | reinforcement learning with human feedback其实如果... | 576.19 | 608.05 | [0.04316745698451996, 0.01569114439189434, 0.0... | 0.531998 |


In [25]:
# play the audio slices for the top 5 matches
for index, row in df_sorted.iloc[:5].iterrows():
    play(sound[row.start*1000: row.end*1000])

In [54]:
# playback function
def playback_by_query(query, k=3, show_text=False):
    """
    Play the audio slices that best match a query.

    Parameters:
    query: str, a user query for similarity search
    k: int, number of top results to play
    show_text: bool, whether to display the matched transcript text first
    """
    search_term_embed = embeddings.embed_query(query)

    # score every segment against the query, then sort by similarity
    df_embed.loc[:, 'cosine_similarity'] = df_embed['text_embed'].apply(lambda x: cosine_similarity(x, search_term_embed))
    df_sorted = df_embed.sort_values(by='cosine_similarity', ascending=False)

    if show_text:
        display(df_sorted.iloc[:k]['text'])

    # pydub slices by milliseconds, hence the factor of 1000
    for index, row in df_sorted.iloc[:k].iterrows():
        play(sound[row.start*1000: row.end*1000])
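
The ranking and the seconds-to-milliseconds conversion inside `playback_by_query` can also be kept separate from playback, which makes them easy to check without audio hardware. The sketch below is one way to do that under stated assumptions: the helper name `top_k_slices` and the list-of-dicts layout (`start`, `end`, `text`, `score`) are hypothetical and simply mirror the DataFrame columns used above.

```python
import heapq


def top_k_slices(rows, k=3):
    """Return (start_ms, end_ms, text) for the k rows with the highest score."""
    best = heapq.nlargest(k, rows, key=lambda r: r["score"])
    # Audio slicing works in milliseconds, so convert seconds to whole milliseconds.
    return [(round(r["start"] * 1000), round(r["end"] * 1000), r["text"]) for r in best]


rows = [
    {"start": 533.0, "end": 535.8, "text": "segment C", "score": 0.56},
    {"start": 444.11, "end": 445.87, "text": "segment A", "score": 0.79},
    {"start": 651.6, "end": 654.84, "text": "segment B", "score": 0.63},
]
for start_ms, end_ms, text in top_k_slices(rows, k=2):
    print(start_ms, end_ms, text)
```

Each returned pair can then be used as `sound[start_ms:end_ms]`.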

In [46]:
playback_by_query("model inference", 3)

#### 2.2 Embedding for LLM-based RAG

In [27]:
# load the transcript text saved above
# transcribe_text_filename = f'./whisper/transcribe/{file_name}.txt'
with open(transcribe_text_filename, 'r', encoding='utf-8') as f:
    transcribe_text = f.read()

# split the text into chunks
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
texts = splitter.split_text(transcribe_text)
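
To illustrate what `chunk_size` and `chunk_overlap` control, here is a simplified sliding-window splitter. It is only a sketch of the idea: `RecursiveCharacterTextSplitter` first tries to split on separators and only then falls back to smaller pieces, so its chunk boundaries will generally differ. The function name `split_with_overlap` is hypothetical.

```python
def split_with_overlap(text, chunk_size=500, chunk_overlap=50):
    """Fixed-size character chunks where each chunk repeats the tail of the previous one."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this chunk already reaches the end of the text
    return chunks


sample = "".join(str(i % 10) for i in range(1200))
chunks = split_with_overlap(sample, chunk_size=500, chunk_overlap=50)
print(len(chunks), [len(c) for c in chunks])
```

The overlap means a sentence cut at a chunk boundary still appears intact in at least one chunk, at the cost of some duplicated text in the vector store.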

In [28]:
len(texts), texts[-2]

(14,
 'with build simple things that people really wantInstagram其实之前就是一个filter app它不是一个social media或者怎么样现在的这些东西都是它做了一个人们真正需要的东西之后才出来的你没有人们需要的那个东西的话其他都是白谈你说那么多概念没有用的然后这就是我的一些opinion了就是它到底是一个2B2C的机会呢它是ScanIt吗我觉得不是但是按下private search就是retrieve这个环节非常重要的其实到现在ChaiGBT都没有做得很好它也是一个未来非常重要的方向那个Big')

In [29]:
# create an in-memory Chroma vector store (no persist directory is set)
speech_vector = Chroma.from_texts(
    texts, 
    embedding=embeddings, 
    metadatas=[{'source': str(i)} for i in range(len(texts))],
    collection_name='speech-rag',
)

### 3. Set up the LLM and prompt

In [30]:
!ollama list

NAME                                          	ID          	SIZE  	MODIFIED     
brxce/stable-diffusion-prompt-generator:latest	474a09318a2e	4.1 GB	6 days ago  	
codellama:7b-python-fp16                      	c586d7593fc9	13 GB 	7 days ago  	
codellama:latest                              	8fdf8f752f6e	3.8 GB	7 days ago  	
command-r:35b-v0.1-q6_K                       	c46e949ec735	28 GB 	2 weeks ago 	
dolphin-llama3:latest                         	613f068e29f8	4.7 GB	7 days ago  	
llama2:13b-f16                                	18051f2e82e3	26 GB 	4 weeks ago 	
llama2:7b-f32                                 	4901050728fc	26 GB 	4 weeks ago 	
llama2-chinese:13b-chat-fp16                  	3d4c5a00962c	26 GB 	4 weeks ago 	
llama2-chinese:7b-chat-fp16                   	b73150f2949c	13 GB 	4 weeks ago 	
llama3:70b-instruct-q4_0                      	bcfb190ca3a7	39 GB 	13 days ago 	
llama3:8b-instruct-fp16                       	c1d0ea97005c	16 GB 	13 days ago 	
llava:34b-v1.6-q6_K         

In [31]:
from langchain.llms import Ollama

# setup llm
# local_llm = 'llama3:8b-instruct-fp16'
# local_llm = 'command-r:35b-v0.1-q6_K'
# local_llm = 'wizardlm2:7b-fp16'
# local_llm = 'mistral:7b-instruct-v0.2-fp16'
local_llm = 'qwen:14b'

llm = Ollama(model=local_llm)

In [32]:
# from langchain_community.llms.chatglm3 import ChatGLM3

# llm = ChatGLM3(
#     model='chatglm3-6b',
#     endpoint_url='http://127.0.0.1:8000/v1/chat/completions',
#     verbose=True
# )
# llm.invoke('你好')

In [33]:
# setup prompt
from langchain.prompts import PromptTemplate
from langchain.prompts.chat import (
    ChatPromptTemplate,
    SystemMessagePromptTemplate,
    AIMessagePromptTemplate,
    HumanMessagePromptTemplate,
)

In [34]:
# create RAG prompt
rag_prompt = ChatPromptTemplate(
    input_variables=['context', 'question'],
    messages=[
        HumanMessagePromptTemplate(
            prompt=PromptTemplate(
                input_variables=['context', 'question'],
                # template="""You answer questions about the contents of a transcribed audio file.
                # Use only the provided audio file transcription as context to answer the question. 
                # Do not use any additional information.
                # If you don't know the answer, just say that you don't know. Do not use external knowledge. 
                # Use three sentences maximum and keep the answer concise. 
                # Make sure to reference your sources with quotes of the provided context as citations.
                # \nQuestion: {question} \nContext: {context} \nAnswer:
                # """,
                template="""你针对会议录音转的文字内容回答问题。
                只利用录音转的文字内容作为上下文来回答问题。
                不要使用任何其它额外信息。
                如果你不知道答案，就回答不知道，不要使用外部知识。
                用最多五句话来回答，并确保答案准确。
                确保在答案中对上下文的源信息进行引用。
                \nQuestion: {question} \nContext: {context} \nAnswer:
                """
            )
        )
    ]
)

In [35]:
# load qa chain
from langchain.chains.question_answering import load_qa_chain
chain = load_qa_chain(llm=llm, chain_type='stuff', prompt=rag_prompt, verbose=False)

### 4. Query and answering

In [36]:
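# set up a query: "How can the reasoning ability of large models be improved?"
query = '如何提升大模型的推理能力？'
# query = '监管政策的解读'  # "Interpretation of regulatory policy"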

In [64]:
# similarity search (returns the top 4 matches by default)
# docs = speech_vector.max_marginal_relevance_search(query, k=5, fetch_k=28, lambda_mult=0.5)
docs = speech_vector.similarity_search(query)
docs

[Document(page_content='learning呢可以让你做到new tasksame modeldifferent alignment然后第三个重要的点就是emergence就是涌现大模型的涌现是一个非常非常重要的点就是它不是线性的一点一点变好的而是过了一个节点突这个图像很可怕但是我觉得其实讲的还是有点道理的就是我们在背后呢是整个就是GPT的这个所谓的unsupervised learning当然这个词也不是完全准我觉得更准的是把这个unsupervised learning变成那个GPT就是他的base model然后supervised fine tuning我们把它叫做就是alignment然后reinforcement learning with human feedback这个可以是alignment中间的一环或者说是chat的方式去让这个模型的输出变得更palatable to human就是我们更容易理解或者说我们更容易appreciate其实它的思维模式可能远远在我们所能理解之上只不过我们没有办法理解模型在想什么我们需要用chat的方式去理解模型在想什么最后就是这个reinforcement', metadata={'source': '4'}),
 Document(page_content='Model是什么过去的Motion Learning Model它其实就是Find Correspondence它是在数据里面寻找一些规律你可以告诉它你这个规律寻找的对不对最简单的就是你给它一起来你就可以让它做很多的事情这是过去的machine learning但是啊但是machine learning它有一个问题就是它只会英武学舍它不能理解当然这两个词都非常的重所以说我们接下来就要去讲这个什么是理解在这里边有一个window grade发现车是可以在红绿灯前面停下来的车可以压碎它的坚果然后它就会把坚果drop到这个红绿灯前面让车去把它压碎然后等到红灯的时候再去把这个坚果给pick up起来在这里边它只有一次任务就是它如果被车撞了就被撞死了所以说它所有过去的MOS', metadata={'source': '7'}),
 Document(page_content='Auto-Regressive Large Language Model
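
The commented-out `max_marginal_relevance_search` call trades relevance against diversity through `lambda_mult`. The sketch below shows the general greedy MMR idea on toy vectors. It is not LangChain's implementation, which may differ in details such as the `fetch_k` pre-selection; the names `mmr_select` and `_cos` are hypothetical.

```python
import math


def _cos(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mmr_select(query_vec, doc_vecs, k=2, lambda_mult=0.5):
    """Greedy maximal marginal relevance: return the indices of k documents."""
    candidates = list(range(len(doc_vecs)))
    selected = []
    while candidates and len(selected) < k:
        def score(i):
            relevance = _cos(query_vec, doc_vecs[i])
            # Penalise similarity to anything already picked.
            redundancy = max((_cos(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0)
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


query_vec = [1.0, 0.0]
# Documents 0 and 1 are near-duplicates; document 2 is less relevant but different.
doc_vecs = [[1.0, 0.1], [1.0, 0.12], [0.6, -0.8]]
print(mmr_select(query_vec, doc_vecs, k=2, lambda_mult=0.5))
print(mmr_select(query_vec, doc_vecs, k=2, lambda_mult=1.0))
```

With `lambda_mult=0.5` the near-duplicate is skipped in favour of the more diverse document, while `lambda_mult=1.0` reduces to plain relevance ranking.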

In [38]:
# using chain for the query
response = chain.invoke(
    input={'input_documents': docs, 'question': query}, 
    # return_only_outputs=True,
)

print(response["output_text"])

提升大模型推理能力的关键在于几个技术点：

1. In-Context Learning：这种学习方式允许模型在上下文中激活不同基础模型的方式，从而提高了处理新任务的能力。

2. Generative Task Importance：生成式任务对于训练高质量的文本和提升模型性能至关重要。通过生成样本并评估其质量，模型可以持续优化。

3. Scaling Law Observation and Leap of Faith：研究者观察到的关于大模型的规模定律以及对模型能产出高质量结果的信心（Leap of Faith），也是推动模型推理能力提升的重要因素。

综上所述，在提升大模型推理能力的过程中，技术手段、任务重要性认识以及信心的树立等方面都发挥着关键作用。



In [55]:
playback_by_query("Scaling Law", 3, True)

44     之后就是这个他们非常重要的scaling law的这个observation
144            就是GUI加Morse law for everything
199                                 它是ScanIt吗
Name: text, dtype: object

In [50]:
print(chain.invoke({'input_documents': docs, 'question': query},)['output_text'])

提升大模型推理能力的关键在于几个方面：

1. **生成式学习**：通过训练模型生成文本，这种自回归式的任务有助于提高理解力。

2. **上下文学习（In-Context Learning）**：通过在给定的上下文中展示特定的任务或例子，模型可以在无需额外训练的情况下学会新的任务和模式。

3. **强化学习与反馈**：通过与用户交互，并根据用户的反馈调整其输出，这有助于提升模型在理解人类意图方面的表现。

这些方法和技术相互结合，共同推动大模型推理能力的提升。



In [56]:
# query: "reinforcement learning and feedback"
playback_by_query("强化学习与反馈", 3, True)

73         然后reinforcement learning with human feedback
81    reinforcement learning with human feedback其实如果...
47    为什么generative这个任务如此之重要标注都是所谓的labeled包括你的这个就是un...
Name: text, dtype: object

In [58]:
from langchain_community.chat_models import ChatOllama

# setup llm
# local_llm = 'wizardlm2:7b-fp16'
# local_llm = 'llama3:8b-instruct-fp16'
llm = ChatOllama(model=local_llm)

In [66]:
# get a retriever, equivalent to the vector search above
retriever = speech_vector.as_retriever(
    search_type='similarity',  # similarity, mmr, similarity_score_threshold
    search_kwargs={'k': 4},    # k, score_threshold
)

# check that the retriever returns the same documents as a direct similarity search
docs = retriever.invoke(query)
assert docs == speech_vector.similarity_search(query)

In [67]:
from langchain_core.runnables import RunnableLambda, RunnablePassthrough

# RAG chain: retrieve context, fill the prompt, call the LLM, parse the reply to a string
chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | rag_prompt
    | llm
    | StrOutputParser()
)
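
In the chain above, the retriever's output (a list of document objects) is handed to the prompt as-is, so `{context}` most likely receives the list's string representation, metadata included. A common pattern is to join the chunks into plain text first. The sketch below uses a hypothetical `Doc` dataclass as a stand-in for the retrieved documents so that it runs without LangChain; `format_docs` is likewise an illustrative name.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a retrieved document (text plus metadata)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def format_docs(docs):
    """Join retrieved chunks into one context string, tagging each with its source id."""
    return "\n\n".join(
        f"[source {d.metadata.get('source', '?')}] {d.page_content}" for d in docs
    )


retrieved = [
    Doc("first retrieved chunk", {"source": "4"}),
    Doc("second retrieved chunk", {"source": "7"}),
]
print(format_docs(retrieved))
```

In the chain, this would be wired in as `{"context": retriever | format_docs, "question": RunnablePassthrough()}`. Keeping the source id in the text also gives the model something concrete to cite, which the prompt asks for.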

In [68]:
print(chain.invoke(query))

提升大模型的推理能力可以通过多种方式进行。首先，涌现（emergence）是一个关键点，它指的是模型在某个节点上突然改善或学习新能力的过程。这并不是一个线性的变化，而是模型在训练过程中达到一个阶段性的转折点后发生的。GPT模型通常使用无监督学习（unsupervised learning）作为基础模型，然后通过监督细化（supervised fine-tuning）来进行对齐（alignment）。这个对齐过程中，可以引入强化学习（reinforcement learning）与人类反馈来改善模型的输出，使其更符合人类的理解和期望。

其次，理解（understanding）是机器学习中一个重要的概念。模型能否找到数据中的规律并理解这些规律的关键在于。例如，模型可以学习通过观察数据来识别和执行任务，如识别车辆是否可以在红绿灯前停留等。

第三，作为一个自回归式的生成模型，GPT可以根据先前生成的内容生成后续内容，这种自然语言处理能力使得模型能够匹配和生成高质量的文本。此外，模型的学习效率受到所谓“缩放定律”（scaling law）的限制，这意味着模型的大小与数据量的关系在一定程度上决定了其性能。

最后，信仰（Leap of Faith）是实现高质量结果的关键。OpenAI团队对此有着强烈的信念，并通过技术细节来实现这一点。在这个过程中，新兴的技术如内上下文学习（In-Context Learning）也显得尤为重要，它允许模型在不改变基础模型的情况下针对新任务进行适应和优化。这种方法与传统的机器学习模式有所不同，后者通常需要为每个新任务创建一个新的模型。


In [70]:
print(query)
# query below: "the key to whether the model can find patterns in the data and understand them"
playback_by_query("模型能否找到数据中的规律并理解这些规律的关键在于", 3, True)

如何提升大模型的推理能力？


79                                             去理解模型在想什么
111                                        它是在数据里面寻找一些规律
23     而不是一个范式突破的话那它我们如何使用大圆模型第五个是我们人类和大圆模型有什么不一样这四个问...
Name: text, dtype: object

### 5. Meeting minutes summary

In [71]:
# type(transcribe_text)

In [72]:
# create summary prompt (alternative template, not used below)
summary_prompt_template = """Your goal is to summarize the meeting transcription given to you below:
                "{text}"
                Limit the summary of the meeting minutes to 2500 words.
                Only output the summary without any additional text.
                Focus on providing a structured summary of the subjects reviewed and the resulting action items.
                """

prompt_test = """
You are a commentator. Your task is to write a report on a meeting transcription. 
When presented with the meeting minutes, come up with interesting questions to ask,
and answer each question. 
Afterward, combine all the information and write a report in the markdown
format. 
Focus on providing a structured summary of the overall performance rating and the resulting action items.

# Meeting Keynotes: 
"{text}"

# Instructions: 
## Summarize:
In clear and concise language, use only the context information, to summarize the key points of:
- Overall Performance ratings by any of the [Green, Amber, Red]
- Financial status
- Projects status
- Action items

## Interesting Questions: 
Generate three distinct and thought-provoking questions that can be 
asked about the content of the meeting. For each question:
- After "Q: ", describe the problem 
- After "A: ", provide a detailed explanation of the problem addressed 
in the question.
- Enclose the ultimate answer in <>.

## Write an analysis report
Using the summary and the answers to the interesting questions, 
create a comprehensive report in Markdown format. 
"""

summary_prompt = PromptTemplate.from_template(prompt_test)


In [79]:
from langchain.chains.summarize import load_summarize_chain
from langchain.docstore.document import Document
from langchain.chains.llm import LLMChain
from langchain.chains.combine_documents.stuff import StuffDocumentsChain

docs = [Document(page_content=transcribe_text, metadata={"source": "local"})]

llm_chain = LLMChain(llm=llm, prompt=summary_prompt)

# Define StuffDocumentsChain
stuff_chain = StuffDocumentsChain(llm_chain=llm_chain, document_variable_name="text")

print(stuff_chain.invoke(docs)['output_text'])


### Summary:
The speaker discusses the evolution of technology, particularly focusing on the advancements in AI and machine learning, exemplified by GPT (Generative Pre-trained Transformer) models like ChatGPT. They highlight the shift from specific computer solutions to more general AI that can understand and interact with users naturally. The speaker emphasizes the importance of using GPT not just as a tool but also as a partner that can autonomously perform tasks such as data analysis, given proper training and alignment.

They suggest that in the past year, the focus should have been on teaching GPT how to perform tasks (engineering ability) and moving forward, it will be more about telling GPT what to do (alignment). The speaker also touches upon historical examples like XGBT and Google, illustrating how new technology can create vast opportunities by doing what previous technologies couldn't. They mention Instagram as an example of a simple tool that became a social media platfor

In [82]:
playback_by_query("Instagram and Google", 3, True)

192                                           就比如说Google
193    Google在一开始的互联网上是没有用的因为那个时候互联网上都没有什么信息就是start w...
170                                             而是应该去做网页
Name: text, dtype: object
