# Hugging Face: first steps

## What to install
- pip install transformers datasets
- pip install torch
- pip install tensorflow
- pip install librosa soundfile
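Before running the cells below, it can help to confirm that the packages listed above are importable. The following is a minimal sketch using only the standard library; the helper name `missing_packages` is hypothetical, and the import names are assumed to match the pip package names.

```python
from importlib.util import find_spec

# Import names assumed to match the pip package names listed above.
REQUIRED = ["transformers", "datasets", "torch", "tensorflow", "librosa", "soundfile"]

def missing_packages(names):
    """Return the names that cannot be found in the current environment."""
    # find_spec only locates a top-level package; it does not import it.
    return [name for name in names if find_spec(name) is None]

missing = missing_packages(REQUIRED)
print("missing packages:", missing if missing else "none")
```

Anything reported as missing can be installed with the matching `pip install` line.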

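## Key points
- A large catalog of models and datasets is available; each one is downloaded the first time it is used.
- `transformers` and `datasets` are both Hugging Face libraries.
- Once downloaded, everything runs locally.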

In [1]:
import env
import utils

## Sentiment analysis - pipeline
- The whole model is downloaded first and then runs locally.
- https://huggingface.co/docs/transformers/v4.27.2/zh/quicktour

In [2]:
from transformers import pipeline
classifier = pipeline("sentiment-analysis")

2024-08-03 11:55:59.572201: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
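The warning in the log recommends naming the model and revision explicitly. A minimal sketch of that, assuming the model name and revision printed in the log line are the ones you want to pin; `build_classifier` is a hypothetical helper name, and the import sits inside it so that nothing is downloaded until it is called.

```python
# Values copied from the log output; adjust them if your log differs.
MODEL_NAME = "distilbert-base-uncased-finetuned-sst-2-english"
REVISION = "af0f99b"

def build_classifier():
    """Create the sentiment pipeline with the model and revision pinned."""
    # Imported lazily: the first call downloads the model if it is not cached.
    from transformers import pipeline

    return pipeline("sentiment-analysis", model=MODEL_NAME, revision=REVISION)

# Usage (triggers a download on first run):
# classifier = build_classifier()
```

Pinning both values means a later change to the library's default model does not silently change the results.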


In [3]:
results = classifier([
    "We are very happy to show you the 🤗 Transformers library.", 
    "We hope you don't hate it.",
    "I love you."
    ])
for result in results:
    print(f"label: {result['label']}, with score: {round(result['score'], 4)}")

label: POSITIVE, with score: 0.9998
label: NEGATIVE, with score: 0.5309
label: POSITIVE, with score: 0.9999
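The second score (0.5309) is close to 0.5, which suggests the model is nearly undecided about that sentence. To illustrate where such a score can come from, here is a sketch that assumes the classifier produces one raw logit per label and turns them into probabilities with a softmax. The logit values and the label order are made up for illustration and are not taken from the model.

```python
import math

# Assumed label order, for illustration only.
LABELS = ["NEGATIVE", "POSITIVE"]

def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def to_prediction(logits):
    """Pick the most probable label and report its probability as the score."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": LABELS[best], "score": probs[best]}

# Hypothetical logits: a clear-cut case and a nearly undecided one.
print(to_prediction([-4.0, 4.0]))
print(to_prediction([0.06, -0.06]))
```

When the two logits are far apart, the score approaches 1; when they are almost equal, it stays just above 0.5.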


## Speech recognition
- Understanding `pipeline`
- The pipeline (the task) and the model are specified separately.

In [4]:
# step 1: pipeline, model
import torch
from transformers import pipeline

speech_recognizer = pipeline(
    "automatic-speech-recognition", # task
    model="facebook/wav2vec2-base-960h" # model
)

Some weights of the model checkpoint at facebook/wav2vec2-base-960h were not used when initializing Wav2Vec2ForCTC: ['wav2vec2.encoder.pos_conv_embed.conv.weight_g', 'wav2vec2.encoder.pos_conv_embed.conv.weight_v']
- This IS expected if you are initializing Wav2Vec2ForCTC from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing Wav2Vec2ForCTC from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of Wav2Vec2ForCTC were not initialized from the model checkpoint at facebook/wav2vec2-base-960h and are newly initialized: ['wav2vec2.encoder.pos_conv_embed.conv.parametrizations.weight.original0', 'wav2vec2.encoder.pos_conv_embed.conv.parametrizations.weight.original1', 'wav2vec2.masked_spec_embed']
You sho

In [5]:
# step 2: load a dataset from the Hub - it is downloaded on first use
# https://huggingface.co/datasets/PolyAI/minds14

from datasets import load_dataset, Audio

dataset = load_dataset(
    "PolyAI/minds14", 
    name="en-US", 
    split="train",
    trust_remote_code=True
    )


In [6]:
# step 3: convert the dataset - resample the audio column to the sampling rate the model expects
dataset = dataset.cast_column(
    "audio", 
    Audio(sampling_rate=speech_recognizer.feature_extractor.sampling_rate)
    )
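Step 3 is needed because the recordings and the model may use different sampling rates. The sketch below illustrates what resampling means using plain linear interpolation. It is not the algorithm that `datasets` uses internally, and the 8000 Hz and 16000 Hz rates in the usage line are assumptions chosen for illustration.

```python
def resample_linear(samples, orig_rate, target_rate):
    """Resample a list of amplitudes by linear interpolation (illustration only)."""
    if not samples:
        return []
    n_out = int(round(len(samples) * target_rate / orig_rate))
    last = len(samples) - 1
    out = []
    for i in range(n_out):
        pos = i * orig_rate / target_rate  # position in the original signal
        left = min(int(pos), last)
        right = min(left + 1, last)
        frac = pos - left
        out.append(samples[left] * (1 - frac) + samples[right] * frac)
    return out

# Doubling the rate (assumed 8000 Hz to 16000 Hz) doubles the number of samples.
upsampled = resample_linear([0.0, 1.0, 0.0, -1.0], 8000, 16000)
print(len(upsampled), upsampled)
```

The clip keeps the same duration; only the number of samples per second changes, which is what the model's feature extractor relies on.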

In [7]:
# step 4: predict - transcribe the first six audio clips
result = speech_recognizer(dataset[:6]["audio"])
print([d["text"] for d in result])

['I WOULD LIKE TO SET UP A JOINT ACCOUNT WITH MY PARTNER HOW DO I PROCEED WITH DOING THAT', "FONDERING HOW I'D SET UP A JOIN TO HELL T WITH MY WIFE AND WHERE THE AP MIGHT BE", "I I'D LIKE TOY SET UP A JOINT ACCOUNT WITH MY PARTNER I'M NOT SEEING THE OPTION TO DO IT ON THE APSO I CALLED IN TO GET SOME HELP CAN I JUST DO IT OVER THE PHONE WITH YOU AND GIVE YOU THE INFORMATION OR SHOULD I DO IT IN THE AP AN I'M MISSING SOMETHING UQUETTE HAD PREFERRED TO JUST DO IT OVER THE PHONE OF POSSIBLE THINGS", 'HOW DO I FURN A JOINA COUT', 'CAN NOW YOU HELP ME SET UP AN JOINT LEAKACCOUNT', 'HOW TO FET UP A JOINA COWT']


## AutoModel and AutoTokenizer

In [8]:
# load a model and its tokenizer by checkpoint name
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "nlptown/bert-base-multilingual-uncased-sentiment"

model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
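Once loaded, the `model` and `tokenizer` objects can be passed to `pipeline` in place of a model name. The sketch below assumes this checkpoint reports its labels as star ratings such as "1 star" to "5 stars"; check the model card to confirm. Both helper names are hypothetical.

```python
def stars_from_label(label):
    """Convert a label such as '4 stars' into the integer 4 (assumed label format)."""
    return int(label.split()[0])

def build_star_classifier(model, tokenizer):
    """Wrap an already loaded model and tokenizer in a sentiment pipeline."""
    # Imported lazily so that defining this helper has no side effects.
    from transformers import pipeline

    return pipeline("sentiment-analysis", model=model, tokenizer=tokenizer)

# Usage with the objects loaded above:
# classifier = build_star_classifier(model, tokenizer)
# result = classifier("We are very happy to show you the 🤗 Transformers library.")[0]
# print(result["label"], stars_from_label(result["label"]))
```

Loading the model and tokenizer yourself is useful when you want to inspect or reuse them outside the pipeline.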