# Getting Started with Sentiment Analysis using Python

Original Article: https://huggingface.co/blog/sentiment-analysis-python

## Install Hugging Face libraries

In [None]:
!pip install -q transformers emoji xformers datasets accelerate evaluate

## How to Use Pre-trained Sentiment Analysis Models with Python

On the [Hugging Face Hub](https://huggingface.co/models), we are building the largest collection of publicly available models and datasets in order to democratize machine learning 🚀. On the Hub, you can find more than 27,000 models shared by the AI community with state-of-the-art performance on tasks such as sentiment analysis, object detection, text generation, speech recognition, and more. The Hub is free to use, and most models have a widget that lets you test them directly in your browser!

There are more than [215 sentiment analysis models](https://huggingface.co/models?pipeline_tag=text-classification&sort=downloads&search=sentiment) publicly available on the Hub and integrating them with Python just takes 5 lines of code:

In [None]:
from transformers import pipeline

In [None]:
sentiment_pipeline = pipeline("sentiment-analysis")

In [None]:
data = ["I love you", "I hate you"]

In [None]:
sentiment_pipeline(data)
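The pipeline returns one dictionary per input text, each holding a `label` and a `score`. As a minimal sketch of how that structure might be consumed, the snippet below pairs texts with predictions and keeps only the confident ones. The `results` list is a hand-written placeholder standing in for real model output, and `confident_predictions` and its 0.9 threshold are hypothetical choices, not part of the library.

```python
# Placeholder output in the shape a sentiment pipeline returns:
# one {"label", "score"} dict per input. The scores are made up for illustration.
texts = ["I love you", "I hate you", "It is a day"]
results = [
    {"label": "POSITIVE", "score": 0.99},
    {"label": "NEGATIVE", "score": 0.98},
    {"label": "POSITIVE", "score": 0.55},
]


def confident_predictions(texts, results, threshold=0.9):
    """Pair each text with its label, dropping predictions below the threshold."""
    return [
        (text, result["label"])
        for text, result in zip(texts, results)
        if result["score"] >= threshold
    ]


confident = confident_predictions(texts, results)
print(confident)
```

In practice you would pass the list returned by `sentiment_pipeline(data)` in place of `results`.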

You can use a specific sentiment analysis model that is better suited to your language or use case by providing the name of the model. For example, if you want a sentiment analysis model for tweets, you can specify the [model id](https://huggingface.co/finiteautomata/bertweet-base-sentiment-analysis):

In [None]:
specific_model = pipeline(model="finiteautomata/bertweet-base-sentiment-analysis")

In [None]:
specific_model(data)
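When you run a model like this over many tweets, a common next step is to tally how often each label occurs. The sketch below does that with `collections.Counter`. It assumes the model emits three labels named `POS`, `NEG`, and `NEU`; the `predictions` list and its scores are made-up placeholders, not real model output.

```python
from collections import Counter

# Placeholder predictions in pipeline output format. The label names are an
# assumption about this model, and the scores are made up for illustration.
predictions = [
    {"label": "POS", "score": 0.97},
    {"label": "NEG", "score": 0.91},
    {"label": "NEU", "score": 0.62},
    {"label": "POS", "score": 0.88},
]

# Count each label, then convert counts to shares of the total.
label_counts = Counter(p["label"] for p in predictions)
label_shares = {label: count / len(predictions) for label, count in label_counts.items()}

print(label_counts)
print(label_shares)
```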

## Building Your Own Sentiment Analysis Model

## Activate GPU and Install Dependencies

Activate a GPU for faster training by clicking `Runtime` > `Change runtime type` and selecting `GPU` as the hardware accelerator.
Then check whether a GPU is available:

In [None]:
import torch
torch.cuda.is_available()

## Preprocess data

### Load data

In [None]:
from datasets import load_dataset

imdb = load_dataset("imdb") # Change to your desired dataset

### Create a smaller training dataset for faster training times

In [None]:
small_train_dataset = imdb["train"].shuffle(seed=42).select(range(3000))
small_test_dataset = imdb["test"].shuffle(seed=42).select(range(300))
print(small_train_dataset[0])
print(small_test_dataset[0])
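Shuffling with a fixed seed and then selecting the first rows gives a random subset that is identical on every run. The pure-Python sketch below illustrates the idea on a toy list. `shuffled_subset` is a hypothetical helper, and it uses Python's `random` module rather than the generator `datasets` uses internally, so it will not reproduce the same rows as the cell above.

```python
import random


def shuffled_subset(examples, n, seed=42):
    """Return the first n examples of a seeded shuffle, leaving the input untouched."""
    indices = list(range(len(examples)))
    random.Random(seed).shuffle(indices)  # same seed -> same order on every run
    return [examples[i] for i in indices[:n]]


reviews = [f"review {i}" for i in range(10)]  # stand-in for a real dataset
subset_a = shuffled_subset(reviews, n=3)
subset_b = shuffled_subset(reviews, n=3)

print(subset_a)
print(subset_a == subset_b)  # the two subsets match because the seed is fixed
```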

### Set DistilBERT tokenizer

In [None]:
from transformers import AutoTokenizer

In [None]:
MODEL_NAME = "distilbert-base-uncased" # change to your desired model

In [None]:
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

### Prepare the text inputs for the model

In [None]:
def preprocess_function(examples):
    return tokenizer(examples["text"], truncation=True)

In [None]:
tokenized_train = small_train_dataset.map(preprocess_function, batched=True)
tokenized_test = small_test_dataset.map(preprocess_function, batched=True)
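With `batched=True`, `map` passes the function a dictionary of columns, so `examples["text"]` is a list of strings rather than a single string. The sketch below mimics that calling convention in plain Python. Both `batched_map` and `toy_preprocess` are hypothetical stand-ins: the toy function splits on whitespace and truncates, which is far simpler than DistilBERT's real tokenizer.

```python
def toy_preprocess(examples, max_length=5):
    """Toy stand-in for the tokenizer: split on whitespace and truncate."""
    return {"tokens": [text.split()[:max_length] for text in examples["text"]]}


def batched_map(rows, function, batch_size=2):
    """Apply `function` to column-wise batches and merge the new columns into each row."""
    mapped = []
    for start in range(0, len(rows), batch_size):
        batch_rows = rows[start:start + batch_size]
        # Turn a list of row dicts into a dict of column lists.
        batch = {key: [row[key] for row in batch_rows] for key in batch_rows[0]}
        new_columns = function(batch)
        for i, row in enumerate(batch_rows):
            extra = {key: values[i] for key, values in new_columns.items()}
            mapped.append({**row, **extra})
    return mapped


rows = [
    {"text": "a great and moving film that I would watch again", "label": 1},
    {"text": "dull", "label": 0},
    {"text": "not bad at all", "label": 1},
]
mapped_rows = batched_map(rows, toy_preprocess)
print(mapped_rows[0]["tokens"])
```

The long first review is cut to `max_length` tokens, which is the role `truncation=True` plays for the real tokenizer.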

### Use data_collator to convert our samples to PyTorch tensors and concatenate them with the correct amount of *padding*

In [None]:
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
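Dynamic padding means each batch is padded only up to its own longest sequence, instead of padding the whole dataset to one global length. Here is a minimal pure-Python sketch of the idea. `pad_batch` is a hypothetical helper that returns lists rather than tensors, and the padding id of 0 and the token ids are assumptions made for illustration.

```python
def pad_batch(sequences, pad_token_id=0):
    """Pad token-id lists to the longest sequence in this batch and build attention masks."""
    longest = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_token_id] * (longest - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Made-up token ids, chosen only to show the shapes.
batch = pad_batch([[101, 7, 8, 102], [101, 9, 102]])
print(batch["input_ids"])
print(batch["attention_mask"])
```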

## Training the model

### Define DistilBERT as our base model:

In [None]:
from transformers import AutoModelForSequenceClassification

In [None]:
num_labels = 2 # IMDb has two classes (negative and positive); change this for your dataset

In [None]:
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=num_labels)

### Define the evaluation metrics 

In [None]:
import numpy as np
import evaluate # requires `pip install evaluate`; replaces `datasets.load_metric`, which was removed in datasets 3.0

load_accuracy = evaluate.load("accuracy")
load_f1 = evaluate.load("f1")

def compute_metrics(eval_pred):    
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    accuracy = load_accuracy.compute(predictions=predictions, references=labels)["accuracy"]
    f1 = load_f1.compute(predictions=predictions, references=labels)["f1"]
    return {"accuracy": accuracy, "f1": f1}
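To make the two metrics concrete, the sketch below computes them by hand on a tiny made-up batch, using only the standard library. `argmax` and `accuracy_and_f1` are hypothetical helpers, and the F1 here is the binary variant that treats class 1 as the positive class.

```python
def argmax(values):
    """Index of the largest value in a list."""
    return max(range(len(values)), key=lambda i: values[i])


def accuracy_and_f1(logits, labels, positive=1):
    """Compute accuracy and binary F1 from raw logits and reference labels."""
    predictions = [argmax(row) for row in logits]
    pairs = list(zip(predictions, labels))
    correct = sum(p == y for p, y in pairs)
    tp = sum(p == positive and y == positive for p, y in pairs)
    fp = sum(p == positive and y != positive for p, y in pairs)
    fn = sum(p != positive and y == positive for p, y in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": correct / len(labels), "f1": f1}


# Made-up logits for four reviews; the predicted classes are 0, 1, 0, 1.
toy_logits = [[2.0, -1.0], [0.1, 0.9], [0.3, 0.2], [-1.0, 3.0]]
toy_labels = [0, 1, 1, 1]
metrics = accuracy_and_f1(toy_logits, toy_labels)
print(metrics)
```

Three of the four predictions are correct, so accuracy is 0.75. Precision is 1 and recall is 2/3, which gives an F1 of 0.8.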

### Define a new Trainer with all the objects we constructed so far

In [None]:
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir="runs",
    learning_rate=2e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    num_train_epochs=2,
    weight_decay=0.01,
    save_strategy="epoch",
    logging_steps=45
)
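These settings determine how many optimization steps the run takes and how often it logs. The arithmetic below is a rough sketch that assumes a single device, no gradient accumulation, and the 3,000-example training subset created earlier.

```python
import math

num_examples = 3000   # size of the small training subset
batch_size = 32       # per_device_train_batch_size
num_epochs = 2        # num_train_epochs
logging_steps = 45

# Assumes one device and no gradient accumulation.
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

# Steps at which the training loss would be logged.
logged_at = list(range(logging_steps, total_steps + 1, logging_steps))

print(steps_per_epoch, total_steps, logged_at)
```

Under these assumptions there are 94 steps per epoch and 188 in total, so with `save_strategy="epoch"` checkpoints would be written after steps 94 and 188.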

In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_test,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

### Train the model

In [None]:
trainer.train()

### Compute the evaluation metrics

In [None]:
trainer.evaluate()

## Analyzing new data with the model

Run inference with your new model using `pipeline`:

In [None]:
YOUR_LOCAL_MODEL = ####

In [None]:
sentiment_model = pipeline(task="sentiment-analysis", model=YOUR_LOCAL_MODEL)

In [None]:
sentiment_model(["I love this movie", "This movie sucks!"])
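A model fine-tuned without custom label names is likely to report generic labels such as `LABEL_0` and `LABEL_1`, and the `score` is the softmax probability of the winning class. The sketch below shows that post-processing step in plain Python. `softmax` and `to_prediction` are hypothetical helpers, the logits are made up, and the `id2label` mapping assumes the IMDb convention of 0 for negative and 1 for positive.

```python
import math


def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


# Assumed IMDb convention: class 0 is negative, class 1 is positive.
id2label = {0: "negative", 1: "positive"}


def to_prediction(logits):
    """Pick the most probable class and report it with a readable label."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=lambda i: probs[i])
    return {"label": id2label[best], "score": probs[best]}


prediction = to_prediction([-1.2, 2.3])  # made-up logits for one review
print(prediction)
```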

## Future Tasks

1. Try to remove the warnings in the notebook, if any
2. Load your own dataset
3. And if you are **really** interested, try writing your own training loop.