## Text classification

In computational linguistics tasks, it is often impossible to collect enough data, or collecting it is too labor-intensive. This may stem from the nature of the data, the specifics of the task, the difficulty of processing the data, and so on. This task asks for the best possible text classification model when only the possible classes are known in advance:

entertainment,
science/technology,
geography,
politics,
health,
sports,
travel.

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import torch
from sklearn.cluster import DBSCAN, KMeans
from sklearn.decomposition import PCA
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import StandardScaler
from transformers import (
    AutoModel,
    AutoTokenizer,
    BertForSequenceClassification,
    BertModel,
    BertTokenizer,
    Trainer,
    TrainingArguments,
    pipeline,
)

In [None]:
classes = [
    "health",
    "travel",
    "geography",
    "science/technology",
    "politics",
    "sports",
    "entertainment",
]



In [None]:
data = pd.read_csv("/content/drive/MyDrive/Datasets and study/data_For_camp/dataset.csv", header=None)

# First approach

In [None]:

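# Zero-shot classification with an NLI model; no labeled examples are needed
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
for i in range(5):
    text = data.iloc[i, 0]  # the texts are in column 0; pass a string, not a Series
    result = classifier(text, classes, multi_label=False)
    print(text)
    print(result["labels"][0])  # most probable class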
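
The zero-shot pipeline returns, for each input text, a dictionary with the keys `sequence`, `labels`, and `scores`. A minimal sketch of picking the top label and rejecting low-confidence predictions follows. The helper `top_label`, the `min_score` threshold, and the sample dictionary are illustrative assumptions and not part of the original notebook.

```python
from typing import Optional


def top_label(result: dict, min_score: float = 0.0) -> Optional[str]:
    """Return the best-scoring label, or None if its score is below min_score."""
    # Pair labels with scores explicitly instead of relying on their order.
    best_label, best_score = max(
        zip(result["labels"], result["scores"]), key=lambda pair: pair[1]
    )
    return best_label if best_score >= min_score else None


# Hand-written stand-in for a pipeline result (the values are made up).
sample = {
    "sequence": "The team won the final match.",
    "labels": ["sports", "entertainment", "politics"],
    "scores": [0.91, 0.06, 0.03],
}

print(top_label(sample))
print(top_label(sample, min_score=0.95))
```

A threshold like this may help flag texts that fit none of the seven classes well, although a suitable value would have to be chosen empirically.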

# Second model: BERT

In [None]:
# Collect all texts from column 0 into a plain Python list
texts = data[0].tolist()

In [None]:
# Load the pretrained multilingual BERT model and its tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModel.from_pretrained("bert-base-multilingual-cased")

In [None]:
def get_bert_embeddings(texts):
    """Return an array of shape (len(texts), hidden_size) with one vector per text."""
    embeddings = []
    for text in texts:
        inputs = tokenizer(text, return_tensors='pt', truncation=True, padding=True, max_length=512)
        with torch.no_grad():  # inference only, no gradients needed
            outputs = model(**inputs)
        # Use the [CLS] token embedding as the text representation
        cls_embedding = outputs.last_hidden_state[:, 0, :].numpy()
        embeddings.append(cls_embedding)
    return np.vstack(embeddings)
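
An alternative to the [CLS] vector is to average the token vectors while ignoring padding positions. Below is a minimal NumPy sketch of such masked mean pooling. It assumes `hidden` has the shape (batch, tokens, dim), like `last_hidden_state`, and that `mask` is the tokenizer's `attention_mask`. The function name `masked_mean_pool` and the toy arrays are hypothetical.

```python
import numpy as np


def masked_mean_pool(hidden: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Average token vectors over the positions where mask == 1."""
    mask = mask[:, :, None].astype(hidden.dtype)     # (batch, tokens, 1)
    summed = (hidden * mask).sum(axis=1)             # (batch, dim)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)   # avoid division by zero
    return summed / counts


# Toy input: one text, three token positions, the last one is padding.
hidden = np.array([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
mask = np.array([[1, 1, 0]])

pooled = masked_mean_pool(hidden, mask)
print(pooled)
```

The mask only matters when several texts are tokenized together with padding. For a single text, as in `get_bert_embeddings`, a plain mean over tokens gives the same result.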

In [None]:
# Compute the embeddings
embeddings = get_bert_embeddings(texts)

# K-means

In [None]:
# K-means clustering
n_clusters = 7  # number of clusters, one per class
kmeans = KMeans(n_clusters=n_clusters, random_state=42)
clusters = kmeans.fit_predict(embeddings)

In [None]:
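# Caution: K-means cluster IDs are arbitrary, so classes[cluster] is only a
# placeholder name and does not reflect the actual topic of the cluster.
with open("BERT_K-means.csv", "w") as f:
    for text, cluster in zip(texts, clusters):
        f.write(classes[cluster])
        f.write("\n")
        if cluster == 5:
            print(f"{text}... | {classes[cluster]}")
            print()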

Carriers like Emirates, Etihad Airways, Qatar Airways, and Turkish Airlines have significantly enhanced their African networks, providing flights to numerous major cities at more competitive pricing compared to their European counterparts.... | sports

The statement indicated that Turkey would also assume responsibility for guarding detained ISIS fighters, whom European countries have declined to repatriate.... | sports

The ongoing diplomatic disputes regarding the region persistently impair the relationship between Armenia and Azerbaijan.... | sports

In the Super-G event for visually impaired male skiers, Maciej Krezel of Poland, along with his guide Anna Ogarzynska, secured the thirteenth position. Meanwhile, Jong Seork Park from South Korea placed twenty-fourth in the men's sitting category of the same event.... | sports

Diplomats stated that they had identified sufficient ambiguity within the Afghan constitution to conclude that a runoff election was unnecessary.... | sports
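
K-means cluster IDs carry no meaning, so indexing `classes` with a cluster ID labels airline and diplomacy texts as `sports` in the output above. One possible way to name the clusters is a majority vote over per-text labels obtained from another method, such as a zero-shot classifier. The sketch below uses toy inputs; `name_clusters_by_vote` is a hypothetical helper that the original notebook does not contain.

```python
from collections import Counter, defaultdict


def name_clusters_by_vote(cluster_ids, text_labels):
    """Map each cluster ID to the most common label among its members."""
    votes = defaultdict(Counter)
    for cluster_id, label in zip(cluster_ids, text_labels):
        votes[cluster_id][label] += 1
    return {cid: counter.most_common(1)[0][0] for cid, counter in votes.items()}


# Toy data: cluster assignments and labels predicted by some other method.
cluster_ids = [0, 0, 1, 1, 1, 0]
text_labels = ["travel", "travel", "sports", "politics", "sports", "health"]

mapping = name_clusters_by_vote(cluster_ids, text_labels)
print(mapping)
```

Several clusters may end up with the same name, which would indicate that the clustering does not separate the classes cleanly.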


# Cosine similarity

In [None]:
# # Function for getting an embedding (mean over tokens)
# def get_embedding(text):
#     inputs = tokenizer(text, return_tensors='pt', truncation=True, padding=True, max_length=512)
#     with torch.no_grad():
#         outputs = model(**inputs)
#     # Average the embeddings of all tokens (can be replaced with [CLS])
#     return torch.mean(outputs.last_hidden_state, dim=1).squeeze().numpy()

# Embed each class name; the names serve as keyword descriptions of the classes
emb_classes = get_bert_embeddings(classes)

# Classify each text by its most similar class embedding
with open("BERT_cosine_similarity.csv", "w") as f:
    for emb in embeddings:
        similarities = {
            class_name: cosine_similarity([emb], [class_embed])[0][0]
            for class_name, class_embed in zip(classes, emb_classes)
        }
        predicted_class = max(similarities, key=similarities.get)
        f.write(predicted_class)
        f.write("\n")
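
The per-pair loop above can also be expressed as a single matrix product over L2-normalized vectors. The following NumPy sketch illustrates this under the assumption that texts and classes are embedded in the same space. The function `classify_by_cosine` and the toy 2-D vectors are hypothetical.

```python
import numpy as np


def classify_by_cosine(text_emb, class_emb, class_names):
    """Return the most similar class name per text and the similarity matrix."""
    # Normalize rows so that the dot product equals cosine similarity.
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    c = class_emb / np.linalg.norm(class_emb, axis=1, keepdims=True)
    sims = t @ c.T  # shape: (n_texts, n_classes)
    best = sims.argmax(axis=1)
    return [class_names[i] for i in best], sims


# Toy embeddings: two texts and two classes in a 2-D space.
text_emb = np.array([[1.0, 0.2], [0.1, 2.0]])
class_emb = np.array([[0.0, 1.0], [1.0, 0.0]])

labels, sims = classify_by_cosine(text_emb, class_emb, ["health", "travel"])
print(labels)
```

With the notebook's variables, the call would presumably be `classify_by_cosine(embeddings, emb_classes, classes)`. Keeping the full similarity matrix also makes it possible to inspect how close the second-best class is.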



[('entertainment', array([-1.81229085e-01, -5.38250566e-01,  6.12158895e-01, ...
(output truncated: class names paired with their raw embedding vectors)

# Off-the-shelf model


In [None]:
classifier = pipeline("zero-shot-classification",
                      model="facebook/bart-large-mnli")


Device set to use cuda:0


In [None]:
# Zero-shot (when there are no labeled examples)

# Note: this label list differs from the task's seven classes: "geography" is
# missing, and "science/technology" is split into "science" and "technology".
classes = ["politics", "science", "health", "sports", "entertainment", "travel", "technology"]
with open("Download_model.csv", "w") as f:
    for text in texts:
        result = classifier(text, candidate_labels=classes)["labels"][0]
        f.write(result)
        f.write("\n")
        print(f"{text}... | {result}")
        print()


This conflict spanned four decades, marked by actual combat conducted through proxy forces across diverse regions from Africa to Asia, including Afghanistan, Cuba, and numerous other locations. ... | politics

Les données neurobiologiques offrent des bases concrètes à l'approche théorique dans l'étude de la cognition. Ainsi, elles permettent de circonscrire le champ de recherche et d'en améliorer la précision.  ... | science

D'après les informations fournies par la carte sismique internationale de l'Institut géologique américain (USGS), aucun séisme n'a été enregistré en Islande durant la semaine passée.... | science

Pendant son allocution qui a duré deux heures, il a déclaré que « Aujourd'hui, Apple est en train de révolutionner le téléphone ; nous sommes sur le point de faire une entrée marquante dans l'histoire aujourd'hui. »... | technology

This enables gamers to maneuver characters and execute commands in video games by physically gesturing with the device in mid-air.... | tech

You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset


While families with toddlers might have to make extra arrangements, going out for a day is still very much doable even when you have babies and pre-school kids in tow. ... | entertainment

Casey Fenton, a computer programmer, established Couchsurfing in January 2004 after he found a reasonably priced flight to Iceland but had no place to spend the night. ... | travel

L'Institut Haitien pour la Justice et la Démocratie a cité des recherches indépendantes indiquant que le contingent de paix népalais de l'ONU a pu introduire la maladie en Haïti de manière non intentionnelle.... | travel

Les familles avec de jeunes enfants pourraient nécessiter une préparation plus approfondie, cependant, il reste envisageable de profiter d'une journée à l'extérieur, même en compagnie de bébés et d'enfants en bas âge.... | travel

San Francisco a construit une importante infrastructure touristique, comprenant de nombreux hôtels, restaurants et installations de congrès de haut niveau.... | travel

Typical
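
The pipeline warning in the output above points out that calling the classifier one text at a time on a GPU is inefficient. A minimal sketch of processing texts in batches follows. It assumes the classifier accepts a list of texts and returns one result dictionary per text; the helper `classify_in_batches` and the stub `fake_classify` are hypothetical and only stand in for the real model.

```python
from typing import Callable, List


def classify_in_batches(
    classify: Callable[[List[str]], List[dict]],
    texts: List[str],
    batch_size: int = 16,
) -> List[str]:
    """Run `classify` on consecutive slices of `texts` and collect the top labels."""
    labels = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        for result in classify(batch):
            labels.append(result["labels"][0])
    return labels


def fake_classify(batch: List[str]) -> List[dict]:
    """Stub with the assumed output shape; it does not run any model."""
    return [{"labels": ["travel" if "flight" in text else "sports"]} for text in batch]


sample_texts = ["cheap flight to Iceland", "the skier finished thirteenth", "flight delays"]
predicted = classify_in_batches(fake_classify, sample_texts, batch_size=2)
print(predicted)
```

With the real pipeline, the callable could be something like `lambda batch: classifier(batch, candidate_labels=classes)`, though the exact return shape for a single-element list should be checked against the installed `transformers` version.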