<a href="https://colab.research.google.com/github/Tharungowdapr/aiml-basics/blob/main/nlp.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Simple NLP Model**

---



In [None]:
!pip install transformers datasets torch --quiet


In [None]:
from transformers import pipeline

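# Load the sentiment-analysis pipeline. No model name is passed, so
# transformers falls back to its default checkpoint (named in the log output).
sentiment_model = pipeline("sentiment-analysis")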


No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.



Device set to use cuda:0


In [None]:
sentiment_model("I really love this movie! It was amazing.")


[{'label': 'POSITIVE', 'score': 0.9998825788497925}]
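The `score` in this output is a probability. For a single-label classifier such as this one, the pipeline applies a softmax to the model's raw logits and reports the largest value. The sketch below reproduces that step in plain Python. The logits are made-up numbers chosen for illustration, not values read from the model, and the label order (index 0 = NEGATIVE, index 1 = POSITIVE) is an assumption based on the SST-2 checkpoint's usual configuration.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for (NEGATIVE, POSITIVE); not taken from the real model.
logits = [-4.3, 4.7]
labels = ["NEGATIVE", "POSITIVE"]  # assumed label order

probs = softmax(logits)
best = max(range(len(probs)), key=probs.__getitem__)

# Same shape as one element of the pipeline's output list.
result = {"label": labels[best], "score": probs[best]}
print(result)
```

A score near 1.0 therefore means the two logits are far apart, not that the prediction is guaranteed to be correct.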

In [None]:
texts = [
    "I hate this product.",
    "The service was fantastic!",
    "It was okay, nothing special."
]

results = sentiment_model(texts)
for text, res in zip(texts, results):
    print(f"{text} → {res['label']} ({res['score']:.2f})")


I hate this product. → NEGATIVE (1.00)
The service was fantastic! → POSITIVE (1.00)
It was okay, nothing special. → NEGATIVE (0.98)


In [None]:
while True:
    sentence = input("Enter a sentence (or 'quit'): ").strip()
    if sentence.lower() == 'quit':
        break
    if not sentence:
        continue  # ignore empty input instead of classifying it
    print(sentiment_model(sentence)[0])


Enter a sentence (or 'quit'): quit


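Next, fine-tune `distilbert-base-uncased` on the IMDB movie-review dataset instead of relying on the ready-made pipeline.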

In [None]:
from datasets import load_dataset

dataset = load_dataset("imdb")
dataset



DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})

In [None]:
from transformers import AutoTokenizer

model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)



In [None]:
def tokenize(batch):
    return tokenizer(batch["text"], padding="max_length", truncation=True, max_length=256)

tokenized_data = dataset.map(tokenize, batched=True)
tokenized_data = tokenized_data.rename_column("label", "labels")
tokenized_data.set_format(type="torch", columns=["input_ids", "attention_mask", "labels"])


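The `padding="max_length"`, `truncation=True`, `max_length=256` settings give every example exactly 256 token ids plus a matching attention mask. The sketch below imitates that behaviour on a plain list so the effect is easy to see. `pad_or_truncate` is a hypothetical helper, not part of the tokenizer API, and the token ids are placeholders. It assumes a pad id of 0, and unlike the real tokenizer it cuts the sequence without re-appending the closing special token.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Hypothetical helper: force a list of token ids to a fixed length."""
    ids = list(token_ids[:max_length])   # truncation: drop anything past max_length
    n_pad = max_length - len(ids)        # how many pad tokens are needed
    return {
        "input_ids": ids + [pad_id] * n_pad,
        # 1 marks real tokens, 0 marks padding the model should ignore
        "attention_mask": [1] * len(ids) + [0] * n_pad,
    }

short = pad_or_truncate([101, 7, 8, 102], max_length=6)      # placeholder ids
long = pad_or_truncate(list(range(1, 11)), max_length=6)

print(short)
print(long)
```

Padding every review to 256 tokens is simple but spends compute on pad positions. Padding each batch only to its longest example is a common alternative.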

In [None]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)



Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir="./sentiment_model",
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=2,
    weight_decay=0.01
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_data["train"],
    eval_dataset=tokenized_data["test"]
)
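When a logging integration such as Weights & Biases is installed, `Trainer` may try to report to it and ask for an API key at the start of training, depending on the transformers version. A minimal sketch of the same configuration with reporting switched off follows. `report_to="none"` is a documented `TrainingArguments` option. The `training_kwargs` dictionary is only an illustrative way to show the settings without constructing the object here.

```python
# Same settings as the cell above, plus report_to="none" to disable
# external experiment trackers (assumption: no run logging is wanted).
training_kwargs = dict(
    output_dir="./sentiment_model",
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=2,
    weight_decay=0.01,
    report_to="none",
)

# In the notebook this would be used as:
# training_args = TrainingArguments(**training_kwargs)
print(training_kwargs["report_to"])
```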

In [None]:
trainer.train()




| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1 | 0.2567 | 0.255029 |
| 2 | 0.1686 | 0.269359 |
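Training loss falls from epoch 1 to epoch 2 while validation loss rises slightly, which can be an early sign of overfitting. The sketch below picks the epoch with the lowest validation loss from the logged numbers. The `history` list is a hand-copied version of the log, not something `Trainer` returns in this form.

```python
# Loss values copied from the training log.
history = [
    {"epoch": 1, "train_loss": 0.2567, "eval_loss": 0.255029},
    {"epoch": 2, "train_loss": 0.1686, "eval_loss": 0.269359},
]

# The best checkpoint is the one with the lowest validation loss.
best = min(history, key=lambda row: row["eval_loss"])

# Flag the pattern "training loss still improving, validation loss getting worse".
last, first = history[-1], history[0]
possible_overfitting = (
    last["train_loss"] < first["train_loss"]
    and last["eval_loss"] > best["eval_loss"]
)

print(f"best epoch: {best['epoch']}, possible overfitting: {possible_overfitting}")
```

`TrainingArguments` has a `load_best_model_at_end=True` option that restores the best saved checkpoint when training finishes. Whether that is worthwhile for a gap this small is a judgment call.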


TrainOutput(global_step=3126, training_loss=0.22764022153535868, metrics={'train_runtime': 1436.8385, 'train_samples_per_second': 34.799, 'train_steps_per_second': 2.176, 'total_flos': 3311684966400000.0, 'train_loss': 0.22764022153535868, 'epoch': 2.0})
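The `global_step` and throughput figures in `TrainOutput` can be cross-checked from the configuration. The arithmetic below assumes a single device and no gradient accumulation, so that one optimizer step consumes one batch of 16 examples.

```python
import math

train_examples = 25_000   # rows in the IMDB train split
batch_size = 16           # per_device_train_batch_size
epochs = 2                # num_train_epochs
runtime_s = 1436.8385     # train_runtime from the log

# The last batch of each epoch is partial, so round up.
steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs

samples_per_second = train_examples * epochs / runtime_s
steps_per_second = total_steps / runtime_s

print(steps_per_epoch, total_steps)
print(round(samples_per_second, 3), round(steps_per_second, 3))
```

Under these assumptions the total comes to 3126 steps, matching `global_step=3126` in the log.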

In [None]:
trainer.evaluate()


{'eval_loss': 0.26935917139053345,
 'eval_runtime': 160.9448,
 'eval_samples_per_second': 155.333,
 'eval_steps_per_second': 9.711,
 'epoch': 2.0}
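The evaluation output reports only loss and timing because no metric function was given to `Trainer`. A minimal sketch of an accuracy metric follows. It is written in plain Python so it runs on lists as well as the arrays `Trainer` supplies, and the logits and labels used to exercise it are invented for illustration.

```python
def compute_metrics(eval_pred):
    """Return accuracy from (logits, labels), the pair Trainer passes in."""
    logits, labels = eval_pred
    # Predicted class = index of the largest logit in each row.
    predictions = [max(range(len(row)), key=lambda i: row[i]) for row in logits]
    correct = sum(int(p == int(y)) for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}

# Invented example: 4 reviews, 2 classes (0 = negative, 1 = positive).
fake_logits = [[2.0, -1.0], [0.1, 0.9], [1.5, 0.5], [-0.3, 0.2]]
fake_labels = [0, 1, 1, 1]

metrics = compute_metrics((fake_logits, fake_labels))
print(metrics)
```

Passing `compute_metrics=compute_metrics` when constructing `Trainer` would make `trainer.evaluate()` include an `eval_accuracy` entry alongside `eval_loss`.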

In [None]:
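import torch

def predict_sentiment(text):
    """Return "POSITIVE" or "NEGATIVE" for one string using the fine-tuned model."""
    tokens = tokenizer(text, return_tensors="pt", padding=True, truncation=True)
    # Move input tensors to the same device as the model
    tokens = {k: v.to(model.device) for k, v in tokens.items()}
    model.eval()  # disable dropout so predictions are deterministic
    with torch.no_grad():  # gradients are not needed for inference
        output = model(**tokens)
    # IMDB labels: 0 = negative, 1 = positive
    label = output.logits.argmax(dim=1).item()
    return "POSITIVE" if label == 1 else "NEGATIVE"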

print(predict_sentiment("I love this movie, it was amazing!"))
print(predict_sentiment("This product is terrible."))

POSITIVE
NEGATIVE
