<a href="https://colab.research.google.com/github/Bhuvansai-16/ExcelR-assignemts/blob/main/text_classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [4]:
#text classification
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline

# Load pre-trained model and tokenizer
model_name = "cardiffnlp/tweet-topic-21-multi"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Define topic labels - **Added the missing labels to match the model's output**
labels = [
    "arts_&_culture", "business_&_enterpreneurs", "celebrity_&_pop_culture",
    "diaries_&_daily_life", "family", "fashion_&_style", "film_tv_&_video",
    "fitness_&_health", "food_&_dining", "gaming", "learning_&_educational",
    "music", "news_&_social_concern", "other_hobbies", "relationships",
    "science_&_technology", "sports", "travel_&_tourism", "weather",  #add missing labels
    "polititcs", "animal_&_nature"

]

# Input texts
texts = [
    "The latest iphone was just released with an incredible new camera!",
    "Manchester United just won the premier league!",
    "I just finished my first marathon!",
    "Nasa has just discovered a new planet!",
]

# Tokenize input texts
inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

# Get model predictions
with torch.no_grad():
    outputs = model(**inputs)
    probabilities = torch.nn.functional.softmax(outputs.logits, dim=-1)
    predictions = torch.argmax(probabilities, dim=-1)

# Print results with improved formatting
for i in range(len(texts)):
    print(f"Text: {texts[i]}")
    print(f"Predicted Topic: {labels[predictions[i].item()]}")
    print(f"Confidence: {probabilities[i][predictions[i].item()].item():.4f}\n")

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


Text: The latest iphone was just released with an incredible new camera!
Predicted Topic: science_&_technology
Confidence: 0.9298

Text: Manchester United just won the premier league!
Predicted Topic: sports
Confidence: 0.9989

Text: I just finished my first marathon!
Predicted Topic: sports
Confidence: 0.9833

Text: Nasa has just discovered a new planet!
Predicted Topic: science_&_technology
Confidence: 0.8000



In [6]:
#textsummarization
from transformers import pipeline

summarizer = pipeline("summarization")
text = """
Hugging face is a company that specializes in natureal language processing (NLP).
It has developed the Transformers library,which provides state-of-the-art models for NLP tasks for a wide range of NLP tasks such as text classification,information extraction,question answering,summarization and mroe.
The library is widely used in both academia and industry due to its ease of use and fleaxibility.
"""
summary = summarizer(text, max_length=50, min_length=10, do_sample=False)
print("Summary:",summary[0]['summary_text'])

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cpu


Summary:  Hugging face is a company that specializes in natureal language processing (NLP) It has developed the Transformers library, which provides state-of-the-art models for NLP tasks . The library is widely used in both academia


In [25]:
#Text Generation
from transformers import AutoTokenizer, AutoModelForCausalLM
# Load model directly

tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2")
model = AutoModelForCausalLM.from_pretrained("openai-community/gpt2")
prompt = "once upon a time in distinct galaxy."
input_ids = tokenizer(prompt, return_tensors='pt')
output = model.generate(**input_ids,max_length=50,num_return_sequences=1,temperature=1.0,top_k=50)
generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
print(generated_text)



tokenizer_config.json:   0%|          | 0.00/26.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/665 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/1.04M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/548M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


once upon a time in distinct galaxy.

The first of these was the first known instance of a galaxy-wide supernova explosion. The first such event occurred in the early 20th century, when a supernova exploded in the constellation of Vir


In [28]:
#chatbot
import os
import atexit
import shutil
from transformers import BlenderbotTokenizer,BlenderbotForConditionalGeneration

model_name = "facebook/blenderbot-400M-distill"
tokenizer = BlenderbotTokenizer.from_pretrained(model_name)
model = BlenderbotForConditionalGeneration.from_pretrained(model_name)

def interact_with_chatbot(user_input,conversation_history):
  conversation_history.append(user_input)
  input_ids = tokenizer.encode("\n".join(conversation_history)+tokenizer.eos_token,return_tensors='pt')
  output_ids = model.generate(input_ids,max_length=1000,pad_token_id=tokenizer.eos_token_id)
  response = tokenizer.decode(output_ids[:,input_ids.shape[-1]:][0],skip_special_tokens=True)
  conversation_history.append(response)
  return response,conversation_history

user_input = input()
conversation_history=[]
interact_with_chatbot(user_input,conversation_history)

what is java?


(' electronic language that is used in many fields of business.',
 ['what is java?',
  ' electronic language that is used in many fields of business.'])