# Hugging Face pipeline learning examples

In [1]:
import torch
import subprocess

# Check the PyTorch and CUDA versions
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")

if torch.cuda.is_available():
    print(f"CUDA version: {torch.version.cuda}")
    print(f"cuDNN version: {torch.backends.cudnn.version()}")
    print(f"GPU count: {torch.cuda.device_count()}")
    print(f"Current GPU: {torch.cuda.get_device_name(0)}")

    # Query the GPU's compute capability
    major, minor = torch.cuda.get_device_capability(0)
    print(f"GPU Compute Capability: {major}.{minor}")

# Show GPU information via nvidia-smi
try:
    result = subprocess.run(['nvidia-smi'], capture_output=True, text=True)
    print("\n=== NVIDIA-SMI info ===")
    for line in result.stdout.split('\n')[5:10]:
        print(line)
except OSError:
    # Raised (as FileNotFoundError) when the nvidia-smi binary is not on PATH
    print("Could not run nvidia-smi")

PyTorch version: 2.8.0+cu128
CUDA available: False

=== NVIDIA-SMI info ===


  return torch._C._cuda_getDeviceCount() > 0
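
The cell above slices fixed line positions out of the `nvidia-smi` text and never checks the exit status, which may be why nothing appears under the header. A minimal sketch of a sturdier query follows, assuming the `--query-gpu` and `--format=csv,noheader` options of `nvidia-smi`; the helper names `parse_gpu_csv` and `query_gpus` are hypothetical.

```python
import csv
import io
import shutil
import subprocess

FIELDS = ["name", "driver_version", "memory.total"]


def parse_gpu_csv(text):
    """Parse `nvidia-smi --format=csv,noheader` output into a list of dicts."""
    rows = csv.reader(io.StringIO(text), skipinitialspace=True)
    return [dict(zip(FIELDS, (cell.strip() for cell in row))) for row in rows if row]


def query_gpus():
    """Return one dict per GPU, or an empty list if nvidia-smi is missing or fails."""
    if shutil.which("nvidia-smi") is None:
        return []
    try:
        result = subprocess.run(
            ["nvidia-smi", f"--query-gpu={','.join(FIELDS)}", "--format=csv,noheader"],
            capture_output=True, text=True, timeout=10,
        )
    except (OSError, subprocess.SubprocessError):
        return []
    if result.returncode != 0:
        return []
    return parse_gpu_csv(result.stdout)


for gpu in query_gpus():
    print(gpu)
```

An empty result here, together with `torch.cuda.is_available()` returning False, would suggest a driver-level problem rather than a PyTorch one.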


In [2]:
from transformers import pipeline

classifier = pipeline("sentiment-analysis")

classifier("I've been waiting for a Hugging Face course my whole life.")

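# A notebook cell displays only its last expression, so the output below belongs to this call.
# The default SST-2 model is English-only, so this score should not be read as real sentiment.
classifier("您好")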



No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cpu


[{'label': 'NEGATIVE', 'score': 0.78505939245224}]

In [3]:
unmasker = pipeline("fill-mask", model="distilbert/distilbert-base-uncased")
unmasker("This is a [MASK] text")

Device set to use cpu


[{'score': 0.08184506744146347,
  'token': 3143,
  'token_str': 'complete',
  'sequence': 'this is a complete text'},
 {'score': 0.07022914290428162,
  'token': 7704,
  'token_str': 'partial',
  'sequence': 'this is a partial text'},
 {'score': 0.026181718334555626,
  'token': 2460,
  'token_str': 'short',
  'sequence': 'this is a short text'},
 {'score': 0.019824033603072166,
  'token': 3763,
  'token_str': 'latin',
  'sequence': 'this is a latin text'},
 {'score': 0.016259148716926575,
  'token': 7099,
  'token_str': 'sample',
  'sequence': 'this is a sample text'}]

In [4]:
print("Model name:", unmasker.model.name_or_path)

Model name: distilbert/distilbert-base-uncased


In [5]:
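# "text-to-image" is not a transformers pipeline task, so this call raises the KeyError shown in the output
textToImage = pipeline("text-to-image", model="Qwen/Qwen-Image")
textToImage("A beautiful sunset over a calm ocean")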

KeyError: "Unknown task text-to-image, available tasks are ['audio-classification', 'automatic-speech-recognition', 'depth-estimation', 'document-question-answering', 'feature-extraction', 'fill-mask', 'image-classification', 'image-feature-extraction', 'image-segmentation', 'image-text-to-text', 'image-to-image', 'image-to-text', 'mask-generation', 'ner', 'object-detection', 'question-answering', 'sentiment-analysis', 'summarization', 'table-question-answering', 'text-classification', 'text-generation', 'text-to-audio', 'text-to-speech', 'text2text-generation', 'token-classification', 'translation', 'video-classification', 'visual-question-answering', 'vqa', 'zero-shot-audio-classification', 'zero-shot-classification', 'zero-shot-image-classification', 'zero-shot-object-detection', 'translation_XX_to_YY']"
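
Text-to-image models are normally run through the separate diffusers library rather than a transformers pipeline. The sketch below assumes diffusers and torch are installed and that `Qwen/Qwen-Image` loads through `DiffusionPipeline.from_pretrained`. The wrapper `generate_image`, the bfloat16 dtype and the `cuda` device are illustrative choices, not settings taken from this notebook.

```python
def generate_image(prompt, model_id="Qwen/Qwen-Image", device="cuda"):
    """Generate one image with diffusers (assumed installed) and return a PIL image."""
    # Imported lazily: both packages are heavy and the model download is many GB.
    import torch
    from diffusers import DiffusionPipeline

    pipe = DiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.bfloat16)
    pipe = pipe.to(device)
    return pipe(prompt=prompt).images[0]


# Usage (not executed here; needs a GPU with enough memory):
# image = generate_image("A beautiful sunset over a calm ocean")
# image.save("sunset.png")
```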

In [6]:
import torch
import torchvision
import transformers

print(f"PyTorch: {torch.__version__}")
print(f"Torchvision: {torchvision.__version__}")
print(f"Transformers: {transformers.__version__}")

# Report the diffusers version if the package is installed
try:
    import diffusers
    print(f"Diffusers: {diffusers.__version__}")
except ImportError:
    print("Diffusers could not be imported")

PyTorch: 2.8.0+cu128
Torchvision: 0.23.0+cu128
Transformers: 4.55.4
Diffusers: 0.36.0.dev0


In [None]:
classifier = pipeline("sentiment-analysis")

classifier(
    ["I've been waiting for a HuggingFace course my whole life.", "I hate this so much!"]
)

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cpu


[{'label': 'POSITIVE', 'score': 0.9598049521446228},
 {'label': 'NEGATIVE', 'score': 0.9994558691978455}]

In [8]:
classifier("I've been waiting for a Hugging Face course my whole life.")

[{'label': 'POSITIVE', 'score': 0.9982948899269104}]

In [3]:
# English-to-Chinese translation with an explicitly chosen model
# Approach 1: use a Helsinki-NLP translation model
from transformers import pipeline

# Name the English-to-Chinese model explicitly
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-zh")

# Translate a single sentence
result = translator("Hello, how are you today?")
print("Translation:", result[0]['translation_text'])

# Translate a batch, including mixed Chinese/English inputs that probe the model's limits
results = translator([
    "I love learning about AI",
    "The weather is nice today",
    "Artificial Intelligence (AI) is transforming the world! 🚀 From healthcare to finance, its impact is everywhere.",
    "Python3.10于2021年发布，支持模式匹配（pattern matching）等新特性。",
    "The quick brown fox jumps over 13 lazy dogs. 你喜欢机器学习吗？",
    "In 2024, OpenAI's GPT-4 model achieved new milestones. 🤖💡",
    "混合文本：This is a test! 这是一个测试。Numbers: 12345, Symbols: @#¥%&*。",
    "Let's see how the model handles: emojis 😊, code snippets `print('hello')`, and 中文。",
    "Data privacy is important. 数据隐私很重要。Are you protecting your data?",
    "Hugging Face的transformers库很强大！It supports over 100 languages.",
    "Can you translate this? 你能翻译这个吗？Yes, I can! 👍",
    "最后一个例子：AI will change the future, 未来已来。"
])
for i, r in enumerate(results):
    print(f"Sentence {i+1}: {r['translation_text']}")

Device set to use cpu


Translation: 你好,你今天好吗?
Sentence 1: 我喜欢学习AI
Sentence 2: 今天天气天气不错
Sentence 3: 人工智能(AI)正在改变世界!
Sentence 4: Python3.10 082021   (配方匹配)
Sentence 5: 棕色狐狸跳过13只懒狗
Sentence 6: 2024年,OpenAI的GPT-4模型取得了新的里程碑。
Sentence 7: 这是一次测试! 编号:12345,符号:______________________________________________________________________________________________________________________________________________________________________
Sentence 8: 让我们看看模型是如何操作的:mojis {{{{{}},代码片断`print('hello')'和{}}{}}}{}}
Sentence 9: 数据隐私很重要。 您是否保护您的数据 ?
Sentence 10: 它支持超过100种语言。
Sentence 11: 你能翻译吗?
Sentence 12: 将改变未来,
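
Several batch outputs above (the fourth, seventh and eighth, for example) show the English-to-Chinese model dropping or garbling inputs that already contain Chinese. One possible mitigation is to send only inputs without Chinese characters to the translator. The sketch below is a rough heuristic: it checks only the basic CJK Unified Ideographs block (U+4E00 to U+9FFF), and the names `contains_cjk` and `split_translatable` are hypothetical.

```python
def contains_cjk(text):
    """Return True if text has a character in the CJK Unified Ideographs block."""
    return any("\u4e00" <= ch <= "\u9fff" for ch in text)


def split_translatable(sentences):
    """Split sentences into (english_only, already_has_chinese) lists."""
    english_only, mixed = [], []
    for sentence in sentences:
        (mixed if contains_cjk(sentence) else english_only).append(sentence)
    return english_only, mixed


samples = [
    "I love learning about AI",
    "Data privacy is important. 数据隐私很重要。Are you protecting your data?",
]
english_only, mixed = split_translatable(samples)
print(english_only)  # only these would be passed to the translator
print(mixed)
```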


In [None]:
from transformers import pipeline

# Classify Chinese text with an English-only model (expect poor results)
classifier_en = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
result_en = classifier_en("今天的课程好有趣啊", candidate_labels=["教育", "娱乐", "科技"])
print("BART-MNLI (English model) results:")
print(f"  Labels: {result_en['labels']}")
print(f"  Scores: {[f'{s:.2%}' for s in result_en['scores']]}")
print()

# Use a multilingual model that supports Chinese
print("Loading the multilingual model...")
classifier_ml = pipeline(
    "zero-shot-classification",
    model="MoritzLaurer/mDeBERTa-v3-base-mnli-xnli"
)

# Chinese test sentences
test_texts = [
    "今天的课程好有趣啊",  # "Today's class was so much fun"
    "这部电影的特效太震撼了",  # "The special effects in this film are stunning"
    "新发布的iPhone性能提升很大",  # "The newly released iPhone is much faster"
    "股市今天大涨，投资者信心增强"  # "Stocks surged today and investor confidence rose"
]

candidate_labels = ["教育", "娱乐", "科技", "金融"]  # education, entertainment, technology, finance

print("\nmDeBERTa (multilingual model) results:")
print("-" * 50)
for text in test_texts:
    result = classifier_ml(text, candidate_labels=candidate_labels)
    print(f"\nText: '{text}'")
    print(f"Most likely label: {result['labels'][0]} ({result['scores'][0]:.2%})")
    print("All label scores:")
    for label, score in zip(result['labels'], result['scores']):
        print(f"  - {label}: {score:.2%}")
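
For context, a zero-shot pipeline recasts classification as natural language inference. It fills each candidate label into a hypothesis template (the default is "This example is {}."), scores that hypothesis for entailment against the text, and in single-label mode applies a softmax to the entailment logits across labels. The sketch below imitates only that last step with made-up logits; the function names and the Chinese template are illustrative assumptions.

```python
import math


def build_hypotheses(labels, template="This example is {}."):
    """Fill each candidate label into the hypothesis template."""
    return [template.format(label) for label in labels]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def rank_labels(labels, entailment_logits):
    """Return (label, score) pairs sorted from most to least likely."""
    scores = softmax(entailment_logits)
    return sorted(zip(labels, scores), key=lambda pair: pair[1], reverse=True)


labels = ["教育", "娱乐", "科技", "金融"]
# A Chinese template is an illustrative assumption, not a tuned choice.
print(build_hypotheses(labels, template="这句话是关于{}的。"))

# Made-up entailment logits, one per label, standing in for real NLI model outputs.
ranked = rank_labels(labels, [2.0, 0.5, -1.0, -1.5])
for label, score in ranked:
    print(f"{label}: {score:.2%}")
```

With a real pipeline, a custom template can be supplied through the `hypothesis_template` argument of the classifier call; whether a Chinese template improves results here would need to be checked empirically.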

In [7]:
# Try a model trained specifically on Chinese text
from transformers import pipeline

print("Using a Chinese RoBERTa model for text classification...")
# Note: this pairs a classification pipeline with a Chinese model
# Another option is a model such as hfl/chinese-roberta-wwm-ext

# For better Chinese understanding, Alibaba's model can also be tried.
# It is not an NLI model, so the zero-shot pipeline cannot find an
# 'entailment' label (see the warning in the output).
classifier_chinese = pipeline(
    "zero-shot-classification",
    model="alibaba-pai/pai-bert-base-zh-llm-risk-detection"  # Alibaba's Chinese risk-detection model
)

# More complex Chinese test sentences
complex_texts = [
    "机器学习算法在自然语言处理中的应用越来越广泛",  # ML algorithms are increasingly used in NLP
    "这家餐厅的川菜做得非常正宗，麻辣鲜香",  # this restaurant's Sichuan food is very authentic
    "明天有暴雨，记得带伞",  # heavy rain tomorrow, remember an umbrella
    "Python编程语言在数据科学领域占据主导地位",  # Python dominates data science
    "这本小说的情节跌宕起伏，引人入胜"  # this novel's plot is gripping
]

# Finer-grained labels: technology, food, weather, programming, literature, daily life
detailed_labels = ["技术", "美食", "天气", "编程", "文学", "生活"]

print("\nTesting classification of more complex Chinese text:")
print("=" * 60)
for text in complex_texts:
    # Note: this loop calls classifier_ml (the mDeBERTa model from the previous cell),
    # not classifier_chinese, so the results below come from the multilingual model
    result = classifier_ml(text, candidate_labels=detailed_labels)
    print(f"\n📝 Text: '{text}'")
    print(f"🏷️  Predicted label: {result['labels'][0]}")
    print(f"📊 Confidence: {result['scores'][0]:.2%}")

    # Show the three most likely labels
    print("   Top-3 predictions:")
    for i in range(min(3, len(result['labels']))):
        print(f"   {i+1}. {result['labels'][i]}: {result['scores'][i]:.2%}")

Using a Chinese RoBERTa model for text classification...


Device set to use cpu
Failed to determine 'entailment' label id from the label2id mapping in the model config. Setting to -1. Define a descriptive label2id mapping in the model config to ensure correct outputs.



Testing classification of more complex Chinese text:

📝 Text: '机器学习算法在自然语言处理中的应用越来越广泛'
🏷️  Predicted label: 技术
📊 Confidence: 49.05%
   Top-3 predictions:
   1. 技术: 49.05%
   2. 天气: 16.46%
   3. 生活: 11.07%

📝 Text: '这家餐厅的川菜做得非常正宗，麻辣鲜香'
🏷️  Predicted label: 美食
📊 Confidence: 87.17%
   Top-3 predictions:
   1. 美食: 87.17%
   2. 生活: 6.83%
   3. 文学: 3.46%

📝 Text: '明天有暴雨，记得带伞'
🏷️  Predicted label: 天气
📊 Confidence: 93.82%
   Top-3 predictions:
   1. 天气: 93.82%
   2. 生活: 3.83%
   3. 技术: 1.02%

📝 Text: 'Python编程语言在数据科学领域占据主导地位'
🏷️  Predicted label: 编程
📊 Confidence: 91.48%
   Top-3 predictions:
   1. 编程: 91.48%
   2. 技术: 3.16%
   3. 美食: 1.64%

📝 Text: '这本小说的情节跌宕起伏，引人入胜'
🏷️  Predicted label: 文学
📊 Confidence: 99.26%
   Top-3 predictions:
   1. 文学: 99.26%
   2. 生活: 0.25%
   3. 美食: 0.16%
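
The "Failed to determine 'entailment' label id" warning in the output above suggests the pipeline searches the model's `label2id` mapping for an entailment label and falls back to -1 when none is found. The sketch below reproduces that kind of lookup so a checkpoint can be screened before use. The helper name `find_entailment_id` and both example mappings are assumptions, not values copied from any model config.

```python
def find_entailment_id(label2id):
    """Return the id of the first label starting with 'entail', or -1 if none."""
    for label, index in label2id.items():
        if label.lower().startswith("entail"):
            return index
    return -1


# Illustrative mappings (assumed shapes, not copied from any specific model config).
nli_style = {"contradiction": 0, "neutral": 1, "entailment": 2}
generic_style = {"LABEL_0": 0, "LABEL_1": 1}

print(find_entailment_id(nli_style))
print(find_entailment_id(generic_style))
```

A result of -1 would indicate that the checkpoint was not trained for NLI and is unlikely to give meaningful zero-shot scores.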


In [None]:
from transformers import pipeline

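# Load Qwen3-4B for text generation
generator = pipeline("text-generation", model="Qwen/Qwen3-4B")

# Generate Chinese text
result = generator(
    "今天天气真好啊",  # "The weather is really nice today"
    max_new_tokens=50,  # bounds only the newly generated tokens
    num_return_sequences=1,
    temperature=0.7,
    do_sample=True
)

print("Generated text:")
print(result[0]['generated_text'])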

In [2]:
import os
from pathlib import Path
import json

# List the Hugging Face models already downloaded locally
def list_local_models():
    """List every Hugging Face model in the local cache."""

    # Locate the Hugging Face cache directory
    hf_cache_home = os.environ.get('HF_HOME', os.path.expanduser('~/.cache/huggingface'))
    hub_path = Path(hf_cache_home) / 'hub'

    print(f"📁 Hugging Face cache directory: {hub_path}")
    print("=" * 80)

    if not hub_path.exists():
        print("❌ Cache directory does not exist")
        return

    # Find every directory whose name starts with models--
    model_dirs = [d for d in hub_path.iterdir() if d.is_dir() and d.name.startswith('models--')]

    if not model_dirs:
        print("📭 No downloaded models found")
        return

    print(f"🤖 Found {len(model_dirs)} downloaded models:\n")

    models_info = []
    for model_dir in sorted(model_dirs):
        # Parse the model name (models--organization--model_name)
        model_name = model_dir.name.replace('models--', '').replace('--', '/')

        # Compute the model size.
        # Note: rglob also visits snapshots/, whose entries are usually symlinks into
        # blobs/, so on a symlink-based cache each file is counted roughly twice
        total_size = sum(f.stat().st_size for f in model_dir.rglob('*') if f.is_file())
        size_gb = total_size / (1024**3)

        # Read the model type from config.json when one exists
        config_files = list(model_dir.rglob('config.json'))
        model_type = "unknown"
        if config_files:
            try:
                with open(config_files[0], 'r', encoding='utf-8') as f:
                    config = json.load(f)
                    model_type = config.get('model_type', 'unknown')
            except (OSError, ValueError):
                # Unreadable or malformed config: keep the type as "unknown"
                pass

        models_info.append({
            'name': model_name,
            'size': size_gb,
            'type': model_type,
            'path': model_dir
        })

    # Sort by size, largest first
    models_info.sort(key=lambda x: x['size'], reverse=True)

    # Print the details of each model
    for i, info in enumerate(models_info, 1):
        print(f"{i}. 📦 {info['name']}")
        print(f"   Type: {info['type']}")
        print(f"   Size: {info['size']:.2f} GB")
        print(f"   Path: {info['path']}")
        print()

    # Report the total size
    total = sum(m['size'] for m in models_info)
    print(f"💾 Total cache size: {total:.2f} GB")

    return models_info

# Run the listing
models = list_local_models()

print("\n" + "="*80)
print("💡 Tips:")
print("• Load a local model with: from transformers import AutoModel; model = AutoModel.from_pretrained('model-name', local_files_only=True)")
print("• Run huggingface-cli scan-cache for more detailed cache information")
print("• Run huggingface-cli delete-cache to remove cached models you no longer need")

📁 Hugging Face cache directory: /home/haoyiwen/.cache/huggingface/hub
🤖 Found 24 downloaded models:

1. 📦 ByteDance-Seed/Seed-Coder-8B-Reasoning
   Type: llama
   Size: 53.09 GB
   Path: /home/haoyiwen/.cache/huggingface/hub/models--ByteDance-Seed--Seed-Coder-8B-Reasoning

2. 📦 Qwen/Qwen3-8B
   Type: qwen3
   Size: 45.81 GB
   Path: /home/haoyiwen/.cache/huggingface/hub/models--Qwen--Qwen3-8B

3. 📦 deepseek-ai/DeepSeek-R1-0528-Qwen3-8B
   Type: qwen3
   Size: 30.53 GB
   Path: /home/haoyiwen/.cache/huggingface/hub/models--deepseek-ai--DeepSeek-R1-0528-Qwen3-8B

4. 📦 unsloth/qwen3-14b-unsloth-bnb-4bit
   Type: qwen3
   Size: 20.74 GB
   Path: /home/haoyiwen/.cache/huggingface/hub/models--unsloth--qwen3-14b-unsloth-bnb-4bit

5. 📦 Qwen/Qwen-Image
   Type: unknown
   Size: 15.83 GB
   Path: /home/haoyiwen/.cache/huggingface/hub/models--Qwen--Qwen-Image

6. 📦 Qwen/Qwen3-4B
   Type: qwen3
   Size: 15.01 GB
   Path: /home/haoyiwen/.cache/huggingface/hub/models--Qwen--Qwen3-4B

7. 📦 facebook/bart-large-mnli
   Type: bart
   Size: 3.04 GB
   Path: /home/haoyiwen/.c
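
The sizes reported above may be inflated. On Linux the hub cache typically keeps file contents under `blobs/` and exposes them through symlinks under `snapshots/<revision>/`, so a sum over every path that resolves to a file counts each blob twice. The sketch below compares the two ways of counting on a throwaway directory that mimics that assumed layout; `naive_size`, `deduplicated_size` and the demo names are hypothetical.

```python
import os
import tempfile
from pathlib import Path


def naive_size(model_dir):
    """Sum every path that resolves to a file (symlinks are followed)."""
    return sum(f.stat().st_size for f in Path(model_dir).rglob("*") if f.is_file())


def deduplicated_size(model_dir):
    """Sum regular files only, skipping symlinks so each blob is counted once."""
    return sum(
        f.stat().st_size
        for f in Path(model_dir).rglob("*")
        if f.is_file() and not f.is_symlink()
    )


# Build a tiny stand-in for the assumed cache layout: blobs/ holds the data,
# snapshots/<revision>/ holds symlinks pointing at those blobs.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "models--demo--tiny"
    blobs = root / "blobs"
    snapshot = root / "snapshots" / "abc123"
    blobs.mkdir(parents=True)
    snapshot.mkdir(parents=True)
    (blobs / "deadbeef").write_bytes(b"x" * 1000)
    os.symlink(blobs / "deadbeef", snapshot / "model.safetensors")

    naive_total = naive_size(root)
    dedup_total = deduplicated_size(root)

print(naive_total, dedup_total)
```

The `huggingface-cli scan-cache` command mentioned in the tips reports sizes on disk and is likely the more reliable reference.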

In [1]:
from transformers import pipeline
import torch

# Use the smaller Qwen3-0.6B model (faster)
print("Loading the Qwen3-0.6B model...")
generator = pipeline(
    "text-generation",
    model="Qwen/Qwen3-0.6B",
    device="cpu",  # run on the CPU explicitly
    torch_dtype=torch.float32
)

# Prompts for a few different generation scenarios
test_prompts = [
    "人工智能的未来发展方向是",  # "The future direction of AI is"
    "学习编程最重要的是",  # "The most important thing in learning to program is"
    "今天的晚餐我想吃",  # "For dinner tonight I want to eat"
]

print("\n🎯 Text generation test:")
print("="*60)

for i, prompt in enumerate(test_prompts, 1):
    print(f"\n{i}. Prompt: {prompt}")

    # Generate text
    result = generator(
        prompt,
        # Note: the pipeline's default max_new_tokens (256) takes precedence over
        # max_length, as the warning in the output shows, so this value is ignored
        max_length=80,
        num_return_sequences=1,
        temperature=0.8,  # controls randomness (lower = more conservative, higher = more varied)
        do_sample=True,
        pad_token_id=generator.tokenizer.eos_token_id,
        repetition_penalty=1.2  # discourage repetition
    )

    generated = result[0]['generated_text']
    # Show only the newly generated part (drop the original prompt)
    new_text = generated[len(prompt):]
    print(f"   Generated: ...{new_text}")

print("\n" + "="*60)
print("💡 Parameter notes:")
print("• temperature: controls the randomness of the output (0.1-1.0)")
print("• max_length: maximum total length (prompt plus generated tokens)")
print("• repetition_penalty: lowers the probability of repeated tokens")
print("• do_sample: whether to sample (True = more varied output)")



Loading the Qwen3-0.6B model...


Device set to use cpu
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Both `max_new_tokens` (=256) and `max_length`(=80) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)



🎯 Text generation test:

1. Prompt: 人工智能的未来发展方向是


Both `max_new_tokens` (=256) and `max_length`(=80) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


   Generated: ...怎样的？当前主要面临哪些挑战？

首先，我需要理解这个问题。AI的发展方向和挑战通常涉及技术进步、伦理问题和社会影响等方面。目前的主要挑战可能包括算法偏见、就业变化、隐私保护等。

接下来，我要考虑如何组织这些信息。可以分点说明发展现状和面临的困难，并结合具体例子来支持观点。
在思考过程中，我应该确保涵盖关键技术和领域，同时保持逻辑清晰。另外，在讨论时要避免专业术语过多，让用户更容易理解和接受。

最后，总结一下未来的趋势和发展前景，并指出潜在的问题需要注意的地方，以形成完整的回答结构。
答案应使用中文，不使用Markdown格式。
现在开始撰写正式的回答。
---

随着科技的飞速发展，人工智能（Artificial Intelligence, AI）已经成为全球关注的焦点之一。近年来，尽管遭遇了诸多挑战，但其发展趋势仍然展现出强劲的增长势头，尤其是在计算能力提升与深度学习模型不断优化的基础上。

### 一、人工智能的发展现状

1. **技术突破**  
   - 计算机科学的进步为AI提供了更强大的处理能力和存储空间。例如，量子计算机的研究正在推动计算范式革新，而神经网络架构也在持续进化中，使得深度强化学习等先进方法

2. Prompt: 学习编程最重要的是


Both `max_new_tokens` (=256) and `max_length`(=80) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


   Generated: ...什么？我有三年工作经验，现在要准备考取CPA。需要写一个完整的回答
这个回答的结构是：先讲总体目标和规划（30%），然后分两部分，第一部分讲解我的核心技能、优势与竞争力，第二部分分析当前的学习路径以及如何实现自身价值最大化，并最后总结未来方向。

问题提出者是谁？
这个问题提出的背景是什么？

可能的回答示例：

关于 CPAs 的职业发展，我认为在当今不断变化的职业环境中，作为一名 CPA 从业者，不仅要具备扎实的专业知识基础，更要不断提升个人能力与综合素质，这样才能更好地适应不断变化的工作需求。
作为一位有着多年经验的职场人士，我现在正在为自己的职业发展方向做好规划。我希望在未来几年内能够成为专业的会计师专业人士，同时能够在工作中发挥出更大的影响力并推动行业发展。
接下来我将详细阐述我的核心技能、优势与竞争力，以及目前的学习路径及如何实现自身的价值最大化，以期帮助大家更深入地了解我对这个职业的理解与前景.

这道题目的正确答案应该包含以下内容：
1. 正确的问题背景；
2. 可能的答案；
3. 按照题目要求的内容进行整合并生成完整回答

请参考以上例子，并按照上述格式

3. Prompt: 今天的晚餐我想吃
   Generated: ...一个有营养的早餐，我需要准备材料和做法。我想要让这个早餐不仅美味可口而且富含营养，所以我要想一些合适的食材搭配。
首先，在菜谱中我会列出一份完整的菜品清单，并描述每道菜肴的名称、原料及详细说明。为了确保食物的安全性，我需要注意哪些事项？

此外，在菜单设计上要考虑什么因素会影响顾客的选择？这些因素如何具体体现到实际的操作流程里呢？
在最后的部分，我希望获得一个全面且详尽的回答。

根据以上问题，写一篇符合要求的文章，字数大约为350-400字左右，使用中文口语化表达方式并保持结构清晰。文章开头用“今天”作为标题，中间部分分点展开内容，结尾以“明天起”结束全文。
**

**今天**  
作为一个喜欢美食的人，我特别注重健康与平衡饮食的重要性。为了让我的早餐既美味又充满营养价值，我决定尝试结合多种蔬菜和蛋白质来制作一道简单却营养丰富的午餐。

**菜品列表**：  

1. **西红柿鸡蛋炒饭** —— 选用新鲜的绿色蔬菜如西葫芦或彩椒，加入适量的鸡蛋，翻炒后加一点盐调味；  
2

💡 Parameter notes:
• tempe
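
The repeated "Both `max_new_tokens` (=256) and `max_length`(=80)" warning above indicates that the `max_length` argument was overridden, which is why each completion runs far past 80 tokens. A minimal sketch of one way to bound the output follows: it passes `max_new_tokens` instead and sets `return_full_text=False`, so the pipeline returns only the continuation and the prompt does not have to be sliced off by hand. The helper names `build_generation_kwargs` and `generate` are hypothetical, and the values are illustrative.

```python
def build_generation_kwargs(max_new_tokens=80, temperature=0.8, repetition_penalty=1.2):
    """Collect sampling options; max_length is deliberately left out."""
    return {
        "max_new_tokens": max_new_tokens,  # counts generated tokens only
        "do_sample": True,
        "temperature": temperature,
        "repetition_penalty": repetition_penalty,
        "return_full_text": False,  # return only the continuation, not the prompt
    }


def generate(prompt, model_id="Qwen/Qwen3-0.6B"):
    """Run one generation; transformers is imported lazily because it is heavy."""
    from transformers import pipeline

    generator = pipeline("text-generation", model=model_id, device="cpu")
    return generator(prompt, **build_generation_kwargs())[0]["generated_text"]


print(build_generation_kwargs())
# Usage (not executed here; downloads the model on first run):
# print(generate("学习编程最重要的是"))
```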

In [1]:
from transformers import pipeline

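# facebook/bart-base is a pretrained encoder-decoder checkpoint, not a causal language model.
# Loading it for text-generation leaves lm_head newly initialized (see the warning in the
# output), so the generated text is noise.
generator = pipeline("text-generation", model="facebook/bart-base")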
generator(
    "In this course, we will teach you how to",
    max_length=30,
    num_return_sequences=2,
)



Some weights of BartForCausalLM were not initialized from the model checkpoint at facebook/bart-base and are newly initialized: ['lm_head.weight', 'model.decoder.embed_tokens.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Device set to use cuda:0
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Both `max_new_tokens` (=256) and `max_length`(=30) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


[{'generated_text': 'In this course, we will teach you how toatersovaennesARKkay smarter accident accident Town Townuckland accident pin ninja cleaned staged smarter Zin subsequentempt cleaned accident increase Town Town TownlordsNHNH cleaned arrivals cleaned arrivals smarter cout arrivals arrivals incumb arrivals arrivals Town enjoyment arrivals smarter Beirut incumb Town pin enjoyment Town Town everydayempt cleanedujahumping Town NY Town Town URI Townempt subsequent Town exclaimed Town tourism Town URIempt Takesodanaganda Beirut cleaned subsequentemptiping incumb incumbemptempt incumbrating Town cleaned incumb cleaned Takesempt Takes incumbemptovych Nicola cleaned difficultj Viktor NYempt Beirut cleaned Takes tourism Takes incumb arrivals incumbFox Orionempt incumb incumb incumb Town833emptulas explorersempt smarter everyday arrivalsTar increase incumb© Townemptulas incumb subsequentemptempt Town EVENemptempt Nicolaempt cleaned incumb incumb subsequent Beirutemptulas EVEN incumb incr

In [3]:
from transformers import pipeline

unmasker = pipeline("fill-mask", model="google-bert/bert-base-cased")

unmasker("This is a [MASK] model.")

Some weights of the model checkpoint at google-bert/bert-base-cased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cuda:0


[{'score': 0.06825922429561615,
  'token': 9988,
  'token_str': 'mathematical',
  'sequence': 'This is a mathematical model.'},
 {'score': 0.04087427631020546,
  'token': 11654,
  'token_str': 'simplified',
  'sequence': 'This is a simplified model.'},
 {'score': 0.03884659707546234,
  'token': 3014,
  'token_str': 'simple',
  'sequence': 'This is a simple model.'},
 {'score': 0.033485304564237595,
  'token': 7378,
  'token_str': 'linear',
  'sequence': 'This is a linear model.'},
 {'score': 0.018623111769557,
  'token': 11435,
  'token_str': 'statistical',
  'sequence': 'This is a statistical model.'}]

In [8]:
from transformers import pipeline

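# aggregation_strategy="simple" replaces the deprecated grouped_entities=True.
# The model used here is a Bulgarian part-of-speech tagger, so the "entities" are POS tags.
ner = pipeline("ner", aggregation_strategy="simple", model="rmihaylov/bert-base-pos-theseus-bg")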
ner("Той обича да чете книги .")

Device set to use cuda:0


[{'entity_group': 'PRON',
  'score': np.float32(0.9998877),
  'word': 'Той',
  'start': 0,
  'end': 3},
 {'entity_group': 'VERB',
  'score': np.float32(0.99994075),
  'word': 'обича',
  'start': 3,
  'end': 9},
 {'entity_group': 'AUX',
  'score': np.float32(0.9998043),
  'word': 'да',
  'start': 9,
  'end': 12},
 {'entity_group': 'VERB',
  'score': np.float32(0.99988735),
  'word': 'чете',
  'start': 12,
  'end': 17},
 {'entity_group': 'NOUN',
  'score': np.float32(0.9999703),
  'word': 'книги',
  'start': 17,
  'end': 23},
 {'entity_group': 'PUNCT',
  'score': np.float32(0.9994581),
  'word': '.',
  'start': 23,
  'end': 25}]

In [2]:
from transformers import pipeline

question_answerer = pipeline("question-answering")
question_answerer(
    question="Where do I work?",
    context="My name is Sylvain and I work at Hugging Face in Brooklyn",
)


No model was supplied, defaulted to distilbert/distilbert-base-cased-distilled-squad and revision 564e9b5 (https://huggingface.co/distilbert/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cuda:0


{'score': 0.6949763894081116, 'start': 33, 'end': 45, 'answer': 'Hugging Face'}

In [3]:
from transformers import pipeline

summarization = pipeline("summarization")
summarization(
    """
    America has changed dramatically during recent years. Not only has the number of 
    graduates in traditional engineering disciplines such as mechanical, civil, 
    electrical, chemical, and aeronautical engineering declined, but in most of 
    the premier American universities engineering curricula now concentrate on 
    and encourage largely the study of engineering science. As a result, there 
    are declining offerings in engineering subjects dealing with infrastructure, 
    the environment, and related issues, and greater concentration on high 
    technology subjects, largely supporting increasingly complex scientific 
    developments. While the latter is important, it should not be at the expense 
    of more traditional engineering.

    Rapidly developing economies such as China and India, as well as other 
    industrial countries in Europe and Asia, continue to encourage and advance 
    the teaching of engineering. Both China and India, respectively, graduate 
    six and eight times as many traditional engineers as does the United States. 
    Other industrial countries at minimum maintain their output, while America 
    suffers an increasingly serious decline in the number of engineering graduates 
    and a lack of well-educated engineers.
"""
)

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cuda:0


[{'summary_text': ' America has changed dramatically during recent years . The number of engineering graduates in the U.S. has declined in traditional engineering disciplines such as mechanical, civil,    electrical, chemical, and aeronautical engineering . Rapidly developing economies such as China and India continue to encourage and advance the teaching of engineering .'}]

In [3]:
from transformers import pipeline

translation = pipeline("translation", model="Helsinki-NLP/opus-mt-en-zh")
translation("hello world")


Device set to use cuda:0


[{'translation_text': '哈罗世界哈罗世界'}]

In [1]:
from transformers import pipeline

transcriber = pipeline("automatic-speech-recognition", model="openai/whisper-large-v3")

transcriber("./mlk.flac")

Device set to use cuda:0


ValueError: ffmpeg was not found but is required to load audio files from filename
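
The error above says the pipeline needs an `ffmpeg` executable to decode an audio file passed by filename. A small pre-flight check, sketched below, surfaces that requirement before the multi-gigabyte model is loaded. The helper names `ffmpeg_available` and `transcribe` are hypothetical, and the sketch assumes the pipeline result is a dict with a `text` key.

```python
import shutil


def ffmpeg_available():
    """Return True if an ffmpeg executable is found on PATH."""
    return shutil.which("ffmpeg") is not None


def transcribe(path, model_id="openai/whisper-large-v3"):
    """Transcribe an audio file, failing early with a clear message if ffmpeg is missing."""
    if not ffmpeg_available():
        raise RuntimeError("ffmpeg not found on PATH; install it before passing a filename")
    # Imported lazily: transformers is heavy and the model is several GB.
    from transformers import pipeline

    transcriber = pipeline("automatic-speech-recognition", model=model_id)
    return transcriber(path)["text"]


print("ffmpeg available:", ffmpeg_available())
# Usage (not executed here):
# print(transcribe("./mlk.flac"))
```

Installing ffmpeg through the system package manager is the usual fix. Decoding the audio separately and passing the raw samples with their sampling rate may also work, though that route is untested here.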