This notebook classifies IMDB movie reviews as positive or negative. It relies on the following models and methods:

BERT (Bidirectional Encoder Representations from Transformers):

The pre-trained bert-base-uncased checkpoint, loaded through the Hugging Face Transformers library.
It serves as the backbone for classifying reviews as positive or negative.
LoRA (Low-Rank Adaptation):

This method was applied to fine-tune the BERT model.
LoRA freezes the pre-trained weights and trains small low-rank matrices added alongside them, which reduces the number of trainable parameters and the memory needed for training.
In short, the foundation is BERT, fine-tuned with LoRA to improve accuracy while keeping training cheap.
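
To make the "fewer parameters" claim concrete, the sketch below counts trainable parameters for a single weight matrix under full fine-tuning and under a low-rank update. The 768 x 768 shape matches the hidden size of bert-base-uncased; the rank of 8 is an illustrative assumption, not a value taken from this notebook (the cells shown here do not display a LoRA configuration), and the function names are hypothetical.

```python
def full_params(d_out: int, d_in: int) -> int:
    """Parameters updated when the whole d_out x d_in matrix is trained."""
    return d_out * d_in


def lora_params(d_out: int, d_in: int, rank: int) -> int:
    """Parameters in a low-rank update B @ A, with B: d_out x rank and A: rank x d_in."""
    return rank * (d_out + d_in)


hidden = 768  # hidden size of bert-base-uncased
rank = 8      # assumed LoRA rank, for illustration only

full = full_params(hidden, hidden)
lora = lora_params(hidden, hidden, rank)
print(f"full: {full}, LoRA: {lora}, ratio: {lora / full:.4f}")
# prints: full: 589824, LoRA: 12288, ratio: 0.0208
```

Under these assumptions the low-rank update trains roughly 2% of the parameters of the matrix it adapts; the saving for a whole model depends on which layers are adapted.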


**Key Components:**

1. Data Preprocessing
What was done:
The IMDB dataset of movie reviews, each with its text and a label (positive/negative), was used.
The texts were tokenized (converted into numeric token IDs), truncated, and padded to a uniform length of 128 tokens.

Why it's important:
Tokenization converts raw text into the integer IDs the model expects, and a fixed sequence length lets reviews of different sizes be processed together in batches.
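
The truncation and padding step can be illustrated with a toy sketch. It splits on whitespace instead of using BERT's WordPiece vocabulary and omits the special `[CLS]` and `[SEP]` tokens that the real tokenizer adds, so it only shows the fixed-length logic; `truncate_and_pad` is a hypothetical helper, not part of the Transformers library.

```python
PAD = "[PAD]"


def truncate_and_pad(tokens, max_length, pad_token=PAD):
    """Cut tokens to max_length, then pad on the right up to max_length.

    Returns the fixed-length token list and an attention mask
    (1 for real tokens, 0 for padding).
    """
    tokens = list(tokens[:max_length])
    n_pad = max_length - len(tokens)
    attention_mask = [1] * len(tokens) + [0] * n_pad
    return tokens + [pad_token] * n_pad, attention_mask


short_review = "a great movie".split()
long_review = "this film was far too long and rather dull".split()

for review in (short_review, long_review):
    padded, mask = truncate_and_pad(review, max_length=5)
    print(padded, mask)
```

The short review is padded to five tokens with a mask ending in zeros, while the long one is cut to its first five tokens with a mask of all ones.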

2. Transformers: Training from Scratch vs. LoRA Fine-Tuning
What was done:
The pre-trained BERT model was fine-tuned for sentiment classification. The LoRA (Low-Rank Adaptation) method was used to reduce the number of trainable parameters and the computational resources required.

Why it's important:
Training a transformer from scratch is expensive and requires a large amount of data. Starting from a pre-trained model (BERT) and fine-tuning it with LoRA reaches high accuracy at a fraction of that cost.
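
As a rough illustration of the mechanism, a LoRA-adapted linear layer computes `W x + (alpha / r) * B (A x)`, where `W` stays frozen and only the small matrices `A` and `B` are trained. The pure-Python sketch below uses tiny hand-picked matrices; all values and the helper names are assumptions for illustration and do not come from this notebook.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha, rank):
    """Frozen base output plus the scaled low-rank update B @ (A @ x)."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]


W = [[1.0, 2.0], [3.0, 4.0]]  # frozen pre-trained weights (2 x 2)
A = [[0.5, -0.5]]             # trainable, rank x d_in (1 x 2)
x = [2.0, 0.0]

# LoRA initializes B to zero, so the adapted layer starts out identical to the base layer.
B_zero = [[0.0], [0.0]]
out_start = lora_forward(W, A, B_zero, x, alpha=2.0, rank=1)

# After training, B is non-zero and shifts the output without touching W.
B_trained = [[1.0], [2.0]]
out_trained = lora_forward(W, A, B_trained, x, alpha=2.0, rank=1)

print(out_start)    # [2.0, 6.0]
print(out_trained)  # [4.0, 10.0]
```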

3. A/B Testing, Fine-Tuning, and Model Comparison
What was done:
The BERT model was fine-tuned using LoRA on a reduced training set of 5,000 reviews for one epoch. The resulting model was compared with the baseline BERT model (without fine-tuning) on the full 25,000-review test set.

Result:
The LoRA model reached an accuracy of about 0.854, compared with about 0.501 for the baseline model.

Why it's important:
A/B testing, here a side-by-side comparison on the same test set, validates the effect of fine-tuning, which is critical when selecting a solution for real-world use.
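
The comparison comes down to computing accuracy for two sets of predictions against the same labels. The sketch below uses made-up labels and predictions (not data from this notebook) and a hypothetical `accuracy` helper. It also shows why a score near 0.5 is what a model with no useful signal gets on a balanced binary test set such as IMDB's, which is consistent with a baseline whose classification head is newly initialized.

```python
def accuracy(labels, predictions):
    """Fraction of predictions that match the labels."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)


# Hypothetical balanced test labels: four positive (1), four negative (0)
labels = [1, 0, 1, 1, 0, 0, 1, 0]

# Hypothetical predictions from a fine-tuned model (one mistake)
fine_tuned_preds = [1, 0, 1, 0, 0, 0, 1, 0]

# Hypothetical predictions from an untrained head that always outputs class 1
untrained_preds = [1] * len(labels)

print("fine-tuned:", accuracy(labels, fine_tuned_preds))  # 0.875
print("untrained: ", accuracy(labels, untrained_preds))   # 0.5
```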

4. Python API
What was done:
An API was built with FastAPI. Its `/predict` endpoint accepts input text and returns the model's sentiment prediction.
Because the API is served over HTTP, the model can be integrated into web applications, bots, and other systems.

Why it's important:
The API enables real-time text analysis from other applications, automating data processing and speeding up business tasks.
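
The core of the endpoint is the step that turns the model's two output logits into a label. A minimal sketch of that step follows, written without FastAPI or PyTorch so it runs anywhere; `sentiment_from_logits` and `build_response` are hypothetical names, and the logit values are invented. It mirrors the notebook's convention that class index 1 means "positive".

```python
import json


def sentiment_from_logits(logits):
    """Return 'positive' if class index 1 has the largest logit, else 'negative'."""
    predicted_index = max(range(len(logits)), key=lambda i: logits[i])
    return "positive" if predicted_index == 1 else "negative"


def build_response(logits):
    """Build the response body in the same shape as the /predict endpoint."""
    return {"sentiment": sentiment_from_logits(logits)}


# Invented logits for two imaginary reviews: [negative_score, positive_score]
print(json.dumps(build_response([-1.2, 2.3])))  # {"sentiment": "positive"}
print(json.dumps(build_response([0.7, -0.4])))  # {"sentiment": "negative"}
```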

In [None]:
pip install datasets


In [None]:
from datasets import load_dataset
from transformers import AutoTokenizer

# Load the IMDB dataset
dataset = load_dataset("imdb")

# Initialize the tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Tokenization function: truncate or pad every review to 128 tokens
def tokenize_function(examples):
    return tokenizer(examples["text"], truncation=True, padding="max_length", max_length=128)

# Tokenize all splits
tokenized_datasets = dataset.map(tokenize_function, batched=True)

# Take the training and test splits and shuffle them
train_dataset = tokenized_datasets["train"].shuffle(seed=42)
test_dataset = tokenized_datasets["test"].shuffle(seed=42)

print("Train Dataset Size:", len(train_dataset))
print("Test Dataset Size:", len(test_dataset))


Train Dataset Size: 25000
Test Dataset Size: 25000


In [None]:
# Show the first 3 examples from the training set
print("Examples from the training set:")
for i in range(3):
    print(f"Example {i + 1}:")
    print(f"Text: {train_dataset[i]['text']}")
    print(f"Label: {'positive' if train_dataset[i]['label'] == 1 else 'negative'}")
    print()

# Show the first 3 examples from the test set
print("Examples from the test set:")
for i in range(3):
    print(f"Example {i + 1}:")
    print(f"Text: {test_dataset[i]['text']}")
    print(f"Label: {'positive' if test_dataset[i]['label'] == 1 else 'negative'}")
    print()


Examples from the training set:
Example 1:
Text: There is no relation at all between Fortier and Profiler but the fact that both are police series about violent crimes. Profiler looks crispy, Fortier looks classic. Profiler plots are quite simple. Fortier's plot are far more complicated... Fortier looks more like Prime Suspect, if we have to spot similarities... The main character is weak and weirdo, but have "clairvoyance". People like to compare, to judge, to evaluate. How about just enjoying? Funny thing too, people writing Fortier looks American but, on the other hand, arguing they prefer American series (!!!). Maybe it's the language, or the spirit, but I think this series is more English than American. By the way, the actors are really good and funny. The acting is not superficial at all...
Label: positive

Example 2:
Text: This movie is a great. The plot is very true to the book which is a classic written by Mark Twain. The movie starts of with a scene where Hank sings a song with

In [None]:
from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments

# Load the model for classification
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Reduce the training set (here, to 5,000 examples)
small_train_dataset = train_dataset.select(range(5000))

# Configure the training parameters
training_args = TrainingArguments(
    output_dir="./results",                # Where to save results
    evaluation_strategy="epoch",          # Evaluate after every epoch (newer transformers releases name this eval_strategy)
    save_strategy="epoch",                # Save a checkpoint after every epoch
    learning_rate=2e-5,                   # Learning rate
    per_device_train_batch_size=32,       # Larger batch size (to speed up training)
    num_train_epochs=1,                   # Fewer epochs
    weight_decay=0.01,                    # Weight decay (L2-style regularization)
    logging_dir="./logs",                 # Where to save logs
)

# Create the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset,    # Use the reduced training set
    eval_dataset=test_dataset,            # Full test set
)

# Start training
trainer.train()


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
!pip install scikit-learn

In [None]:
# Import the accuracy metric
from sklearn.metrics import accuracy_score

In [9]:
# Predictions from the fine-tuned (LoRA) model
lora_predictions = trainer.predict(test_dataset)

# Load the base model for comparison (not fine-tuned; its classification head is newly initialized)
baseline_model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
trainer_baseline = Trainer(
    model=baseline_model,
    args=training_args,
    train_dataset=small_train_dataset,
    eval_dataset=test_dataset,
)
# Only predict is called, so the baseline is evaluated without any training
baseline_predictions = trainer_baseline.predict(test_dataset)


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [10]:
from sklearn.metrics import accuracy_score

# Accuracy of the LoRA model
lora_accuracy = accuracy_score(test_dataset["label"], lora_predictions.predictions.argmax(-1))

# Accuracy of the baseline model
baseline_accuracy = accuracy_score(test_dataset["label"], baseline_predictions.predictions.argmax(-1))

print(f"LoRA Accuracy: {lora_accuracy}")
print(f"Baseline Accuracy: {baseline_accuracy}")


LoRA Accuracy: 0.85372
Baseline Accuracy: 0.50076


In [53]:
# Save the fine-tuned LoRA model and the tokenizer
model.save_pretrained("lora_model")       # Save the model
tokenizer.save_pretrained("lora_model")   # Save the tokenizer

('lora_model/tokenizer_config.json',
 'lora_model/special_tokens_map.json',
 'lora_model/vocab.txt',
 'lora_model/added_tokens.json',
 'lora_model/tokenizer.json')

In [11]:
!pip install fastapi uvicorn



In [54]:
import torch
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

# Reuse the model and tokenizer that are already in memory
model.eval()  # Switch the model to inference mode

class TextRequest(BaseModel):
    text: str

@app.post("/predict")
def predict_sentiment(request: TextRequest):
    inputs = tokenizer(request.text, return_tensors="pt", truncation=True, padding="max_length", max_length=128)
    # Move the inputs to the model's device (the model may be on the GPU after training)
    inputs = {name: tensor.to(model.device) for name, tensor in inputs.items()}
    with torch.no_grad():  # Gradients are not needed for inference
        outputs = model(**inputs)
    prediction = outputs.logits.argmax(-1).item()
    sentiment = "positive" if prediction == 1 else "negative"
    return {"sentiment": sentiment}

In [14]:
!pip install fastapi uvicorn pyngrok

Collecting pyngrok
  Downloading pyngrok-7.2.2-py3-none-any.whl.metadata (8.4 kB)
Downloading pyngrok-7.2.2-py3-none-any.whl (22 kB)
Installing collected packages: pyngrok
Successfully installed pyngrok-7.2.2


In [56]:
!ls

api.py	logs  lora_model  __pycache__  results	sample_data  uvicorn.log  wandb


In [57]:
%%writefile api.py
import torch  # needed for torch.no_grad() below
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import AutoTokenizer, AutoModelForSequenceClassification

app = FastAPI()

# Load the fine-tuned LoRA model and the tokenizer
tokenizer = AutoTokenizer.from_pretrained("lora_model")
model = AutoModelForSequenceClassification.from_pretrained("lora_model")
model.eval()

class TextRequest(BaseModel):
    text: str

@app.post("/predict")
def predict_sentiment(request: TextRequest):
    # Tokenize the input text
    inputs = tokenizer(request.text, return_tensors="pt", truncation=True, padding="max_length", max_length=128)
    with torch.no_grad():
        outputs = model(**inputs)
        prediction = outputs.logits.argmax(-1).item()
    sentiment = "positive" if prediction == 1 else "negative"
    return {"sentiment": sentiment}


Overwriting api.py



In [59]:
# Install the required libraries
!pip install fastapi uvicorn pyngrok transformers torch requests &> /dev/null

# Source code of the API, to be written to api.py
api_code = """
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

app = FastAPI()

# Load the model and the tokenizer
tokenizer = AutoTokenizer.from_pretrained("lora_model")
model = AutoModelForSequenceClassification.from_pretrained("lora_model", num_labels=2)
model.eval()

class TextRequest(BaseModel):
    text: str

@app.post("/predict")
def predict_sentiment(request: TextRequest):
    inputs = tokenizer(request.text, return_tensors="pt", truncation=True, padding="max_length", max_length=128)
    with torch.no_grad():
        outputs = model(**inputs)
        prediction = outputs.logits.argmax(-1).item()
    sentiment = "positive" if prediction == 1 else "negative"
    return {"sentiment": sentiment}
"""

# Write the code to api.py
with open("api.py", "w") as f:
    f.write(api_code)

# Start FastAPI in the background
from pyngrok import ngrok
import threading
import time

def start_server():
    !uvicorn api:app --host 0.0.0.0 --port 8000 &> uvicorn.log

# Run the server in a separate thread
threading.Thread(target=start_server).start()

# Wait for the server to start (5 seconds is an assumed value)
time.sleep(5)

In [60]:
!ls


api.py	logs  lora_model  __pycache__  results	sample_data  uvicorn.log  wandb


In [61]:
!nohup uvicorn api:app --host 0.0.0.0 --port 8000 > uvicorn.log 2>&1 &


In [62]:
!curl http://0.0.0.0:8000/docs


    <!DOCTYPE html>
    <html>
    <head>
    <link type="text/css" rel="stylesheet" href="https://cdn.jsdelivr.net/npm/swagger-ui-dist@5/swagger-ui.css">
    <link rel="shortcut icon" href="https://fastapi.tiangolo.com/img/favicon.png">
    <title>FastAPI - Swagger UI</title>
    </head>
    <body>
    <div id="swagger-ui">
    </div>
    <script src="https://cdn.jsdelivr.net/npm/swagger-ui-dist@5/swagger-ui-bundle.js"></script>
    <!-- `SwaggerUIBundle` is now available on the page -->
    <script>
    const ui = SwaggerUIBundle({
        url: '/openapi.json',
    "dom_id": "#swagger-ui",
"layout": "BaseLayout",
"deepLinking": true,
"showExtensions": true,
"showCommonExtensions": true,
oauth2RedirectUrl: window.location.origin + '/docs/oauth2-redirect',
    presets: [
        SwaggerUIBundle.presets.apis,
        SwaggerUIBundle.SwaggerUIStandalonePreset
        ],
    })
    </script>
    </body>
    </html>
    

In [63]:
import requests

url = "http://0.0.0.0:8000/predict"
data = {"text": "This is a great movie!"}

response = requests.post(url, json=data)
print(response.json())


{'sentiment': 'positive'}
