# Прогнозирование уровня потенциального происшествия на производстве

<b>Цель.</b> <b>Оперативное принятие мер для повышения безопасности работников</b> на предприятиях, деятельность которых связана с вредными и опасными условиями труда, за счёт реализации модели машинного обучения, которая будет <b>предсказывать возможные уровни происшествий для конкретных работников</b>.

# Краткое описание проекта

<b>В ходе выполнения проекта:</b>
 * Провёл анализ датасета: проанализировал объём данных и его качество.
 * Выполнил EDA.
 * Выполнил очистку датасета (поиск и исключение имён работников) с помощью  <b>SpaCy</b>.
 * Применил <b>One-Hot Encoding</b> для категориальных признаков и <b>TF-IDF</b> для векторизации описания происшествия.
 * Обучил модель с помощью <b>Gradient Boosting</b> с подбором гиперпараметров <b>GridSearchCV</b>.
 * Обучил модель с помощью <b>MLP (Multilayer Perceptron)</b> с тремя слоями для обучения модели, функцией активации <b>ReLU</b> и <b>Dropout</b> для регуляризации и предотвращения переобучения.
 * Проанализировал и сравнил полученные результаты.
 * Поработал над датасетом, для улучшения результатов обучения (3 разных варианта датасета для Gradient Boosting и MLP).
 * Применил более продвинутые модели обучения <b>CatBoost</b> и <b>Word2Vec</b> (векторизация) для улучшения результатов обучения (метрик).

# Выполнение проекта

Импортируем основные библиотеки.

In [1]:
import joblib
import matplotlib.pyplot as plt
import nltk
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
import numpy as np
import pandas as pd
import re
import seaborn as sns
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
import string
import torch
import torch.nn as nn
import torch.optim as optim
from collections import Counter, defaultdict
from collections.abc import Mapping
from IPython.display import display, HTML
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from torch.utils.data import DataLoader, TensorDataset

In [None]:
import warnings

# Скрыть все предупреждения
warnings.filterwarnings('ignore')

<b>Приведём описание признаков:</b>
* <b>Data:</b> временная метка или информация о времени/дате
* <b>Countries:</b> в какой стране произошла авария (анонимно)
* <b>Local:</b> город, в котором находится завод-изготовитель (анонимно)
* <b>Industry sector:</b> к какому сектору относится завод
* <b>Accident level:</b> от I до VI показывает, насколько серьезным было происшествие (I означает "лёгкое", а VI - "очень тяжёлое")
* <b>Potential Accident Level:</b> насколько серьезным могло бы быть происшествие (из-за других факторов, связанных с происшествием)
* <b>Genre:</b> пол (мужчина/женщина)
* <b>Employee or Third Party:</b> является сотрудник штатным или третьей стороной (подрядчик)
* <b>Critical Risk:</b> краткое описание риска, связанного с происшествием
* <b>Description:</b> подробное описание того, как произошло происшествие.

Считываем данные из CSV-файла. Выводим первые 5 строк, чтобы убедиться, что данные считались с локального файла.

In [None]:
df = pd.read_csv('database_with_accidents_description.csv')
pd.set_option('display.max_colwidth', None) # Настраиваем отображение, чтобы текст выводился полностью
pd.set_option('display.max_columns', None) # Настраиваем отображение, чтобы отображалдись все колонки
df.head(2)

Посмотрим на количество записей в файле.

In [None]:
df.shape

Посмотрим, какого типа эти данные:

In [None]:
df.info()

Видим, что пропущенные значения отсутствуют. Есть категориальные признаки.

In [None]:
# Просмотр количества уникальных записей для каждого признака
df.apply(lambda x: x.nunique())

# EDA

Выберем целевую переменную.

In [None]:
fig, ax = plt.subplots()

# Данные для первого графика
accident_level_counts = df['Accident Level'].value_counts().sort_index()
# Данные для второго графика
potential_accident_level_counts = df['Potential Accident Level'].value_counts().sort_index()

# Первый график (Зелёный прозрачный)
ax.bar(accident_level_counts.index, accident_level_counts, color='blue', alpha=0.5, label='Accident Level')

# Второй график (Красный прозрачный)
ax.bar(potential_accident_level_counts.index, potential_accident_level_counts, color='red', alpha=0.5, label='Potential Accident Level')

plt.title('Соотношение значений целевой переменной')
plt.xlabel('Значение целевой переменной')
plt.ylabel('Количество')

# Подписи для первого графика
for i, value in enumerate(accident_level_counts):
    plt.text(i, value, str(value), ha='center', va='bottom', color='blue')

# Подписи для второго графика
for i, value in enumerate(potential_accident_level_counts):
    plt.text(i, value, str(value), ha='center', va='bottom', color='red')

# Отображение легенды
plt.legend()

plt.show()

<b>В данном случае будем использовать потенциальный уровень происшествия (Potential Accident Level), как целевую переменную (метку), т.к. для бизнеса прогнозировать это значение более важно, чем фактический уровень происшествия.</b>

In [None]:
# Группировка данных и подсчет количества записей по признаку "Critical Risk" и уровню "Potential Accident Level"
grouped_data = df.groupby(['Critical Risk', 'Potential Accident Level']).size().unstack(fill_value=0)

# Сортировка по убыванию суммы значений в строках
grouped_data = grouped_data.loc[grouped_data.sum(axis=1).sort_values(ascending=False).index]

# Построение графика
grouped_data.plot(kind='bar', stacked=True, figsize=(10, 6))

# Настройка заголовков и меток осей
plt.title('Количество записей "Potential Accident Level" по признаку "Critical Risk"')
plt.xlabel('Critical Risk')
plt.ylabel('Количество записей')
plt.legend(title='Potential Accident Level')
plt.xticks(rotation=90)
plt.tight_layout()

# Отображение графика
plt.show()

In [None]:
# Фильтрация данных, исключая значение 'Others' в столбце 'Critical Risk'
filtered_df = df[df['Critical Risk'] != 'Others']

# Группировка данных и подсчет количества записей по признаку "Critical Risk" и уровню "Potential Accident Level"
grouped_data = filtered_df.groupby(['Critical Risk', 'Potential Accident Level']).size().unstack(fill_value=0)

# Сортировка по убыванию суммы значений в строках
grouped_data = grouped_data.loc[grouped_data.sum(axis=1).sort_values(ascending=False).index]

# Построение графика
grouped_data.plot(kind='bar', stacked=True, figsize=(10, 6))

# Настройка заголовков и меток осей
plt.title('Количество записей "Potential Accident Level" по признаку "Critical Risk"')
plt.xlabel('Critical Risk')
plt.ylabel('Количество записей')
plt.legend(title='Potential Accident Level')
plt.xticks(rotation=90)
plt.tight_layout()

# Отображение графика
plt.show()

Посмотрим распределение происшествий по месяцам.

In [None]:
# Преобразование столбца Data в тип datetime
df['Data'] = pd.to_datetime(df['Data'], format='%Y-%m-%d %H:%M:%S')

# Извлечение названия месяца
df['Month'] = df['Data'].dt.strftime('%B')

Удаляем ненужные колонки.

In [None]:
columns_to_drop = ["Unnamed: 0", "Data", "Countries", "Local", "Industry Sector", "Accident Level",]
df = df.drop(columns=columns_to_drop)

In [None]:
df.head(2)

In [None]:
# Создаем словарь с порядком сортировки месяцев
month_order = {
    'January': 1, 'February': 2, 'March': 3, 'April': 4,
    'May': 5, 'June': 6, 'July': 7, 'August': 8,
    'September': 9, 'October': 10, 'November': 11, 'December': 12
}

# Выполняем подсчет и сортировку
month_counts = df['Month'].value_counts().sort_index(
    key=lambda x: x.map(month_order)
)

# Создание графика
plt.figure(figsize=(10, 6))  # Размер графика

# Построение столбчатой диаграммы
month_counts.plot(kind='bar', color='skyblue')

# Настройка заголовка и меток осей
plt.title('Количество записей по месяцам')
plt.xlabel('Месяц')
plt.ylabel('Количество записей')

# Отображение графика
plt.show()

# Нормализуем и почистим текстовое описание происшествий "Description" от имён работников.

In [None]:
# !python -m spacy download en_core_web_sm

In [None]:
# Загружаем модель spaCy для NER и лемматизации
nlp = spacy.load('en_core_web_sm')

# Функция для очистки текста
def clean_description(text):
    doc = nlp(text)
    names = [ent.text for ent in doc.ents if ent.label_ == 'PERSON']
    clean_text = text
    for name in names:
        clean_text = re.sub(r'\b' + re.escape(name) + r'\b', '', clean_text)
    
    # Преобразование текста в нижний регистр
    clean_text = clean_text.lower()
    
    # Удаление стоп-слов
    clean_text = ' '.join([word for word in clean_text.split() if word not in STOP_WORDS])
    
    # Удаление коротких слов
    clean_text = ' '.join([word for word in clean_text.split() if len(word) > 2])
    
    # Удаление чисел
    clean_text = re.sub(r'\b\d+\b', '', clean_text)
    
    # Удаление пунктуации
    clean_text = clean_text.translate(str.maketrans('', '', string.punctuation))
    
    return clean_text, names

# Функция для лемматизации текста
def lemmatize_text(text):
    doc = nlp(text)
    lemmatized_text = ' '.join([token.lemma_ if token.lemma_ != '-PRON-' else token.text for token in doc])
    return lemmatized_text

# Функция для удаления лишних пробелов
def remove_extra_spaces(text):
    return ' '.join(text.split())

# Применяем функцию ко всему столбцу Description и сохраняем удаленные имена
df['Cleaned_Description'], df['Removed_Names'] = zip(*df['Description'].apply(clean_description))

# Применяем лемматизацию к столбцу с очищенным описанием
df['Cleaned_Description'] = df['Cleaned_Description'].apply(lemmatize_text)

# Удаляем лишние пробелы
df['Cleaned_Description'] = df['Cleaned_Description'].apply(remove_extra_spaces)

# Выводим имена, которые были удалены
for idx, names in enumerate(df['Removed_Names']):
    if names:
        print(f"Row {idx}: Removed names - {names}")

# Перекодируем уровень происшествия в числовые значения и сохраним в новый датасет.

In [None]:
# Создаем объект LabelEncoder
label_encoder = LabelEncoder()

# Применяем LabelEncoder к столбцу 'Potential Accident Level'
df['Potential Accident Level Encoded'] = label_encoder.fit_transform(df['Potential Accident Level']) + 1

# Маппинг значений после кодирования
level_mapping = dict(zip(label_encoder.classes_, label_encoder.transform(label_encoder.classes_) + 1))
print("Mapping for Potential Accident Level:")
print(level_mapping)

In [None]:
columns_to_drop = ["Removed_Names", "Description", "Potential Accident Level",]
df = df.drop(columns=columns_to_drop)

In [None]:
# Сохраняем очищенный и нормализованный датасет
df.to_csv('descr_cleaned_dataset.csv', index=False)

# Предобработка датасета для обучения. Вариант 1.

# Выполним One-Hot Encoding для категориальных признаков и векторизацию описания происшествия с использованием TF-IDF.

In [None]:
df = pd.read_csv('descr_cleaned_dataset.csv')

In [None]:
df_copy = df.copy()
df_copy['Cleaned_Description'] = df_copy['Cleaned_Description'].apply(lambda x: (x[:197] + '...') if len(x) > 200 else x)
df_copy.head(2)

Применяем One-Hot Encoding с помощью pd.get_dummies

In [None]:
df = pd.get_dummies(df, columns=['Genre', 'Employee or Third Party', 'Critical Risk', 'Month'])

In [None]:
df_copy = df.copy()
df_copy['Cleaned_Description'] = df_copy['Cleaned_Description'].apply(lambda x: (x[:47] + '...') if len(x) > 50 else x)
df_copy.head(2)

Выполняем векторизацию текста с использованием TF-IDF

In [None]:
# Векторизация текста с использованием TF-IDF
tfidf = TfidfVectorizer(max_features=10000)
X_tfidf = tfidf.fit_transform(df['Cleaned_Description'])

# Преобразование в DataFrame для объединения с исходным датасетом
X_tfidf_df = pd.DataFrame(X_tfidf.toarray(), columns=tfidf.get_feature_names_out())

In [None]:
# Объединение признаков
df = pd.concat([df.drop(columns=['Cleaned_Description']), X_tfidf_df], axis=1)

In [None]:
df.head()

In [None]:
df.info()

In [None]:
# Удалим происшествия 6й категрории, т.к. их всего 1 штука.
df = df[df['Potential Accident Level Encoded'] != 6]
df.shape

In [None]:
# Выделяем признаки и метки
X = df.drop(columns=['Potential Accident Level Encoded'])
y = df['Potential Accident Level Encoded']

# Разбиение на тренировочную и тестовую выборки
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify = y)

# Обучение. Градиентный бустинг. Вариант 1.

In [None]:
# Определение гиперпараметров для подбора
param_grid = {
    'n_estimators': [100, 200],
    'learning_rate': [0.01, 0.1],
    'max_depth': [3, 5],
    'subsample': [0.8, 1.0],
    'min_samples_split': [2, 5]
}

# Настройка GridSearchCV
gb_clf = GradientBoostingClassifier(random_state=42)
grid_search = GridSearchCV(estimator=gb_clf, param_grid=param_grid, cv=3, n_jobs=-1, scoring='accuracy')

# Обучение модели с подбором гиперпараметров
grid_search.fit(X_train, y_train)

# Лучшие параметры
best_params = grid_search.best_params_
print(f"Best parameters: {best_params}")

# Оценка модели с лучшими параметрами
best_gb_clf = grid_search.best_estimator_
y_pred_gb = best_gb_clf.predict(X_test)

# Получение меток классов из y_test
class_labels = sorted(set(y_test))

In [None]:
# Оценка модели Gradient Boosting
gb_report = classification_report(y_test, y_pred_gb, target_names=[f"{i+1}" for i in class_labels])
print("Gradient Boosting Classification Report:")
print(gb_report)

# Обучение. Нейронная сеть. Вариант 1.

In [None]:
# Преобразование всех столбцов в числовой формат
def convert_to_numeric(df):
    for col in df.columns:
        if df[col].dtype == 'bool':
            df[col] = df[col].astype(int)
        elif df[col].dtype == 'object':
            try:
                df[col] = df[col].astype(float)
            except ValueError:
                df[col] = pd.factorize(df[col])[0]
    return df

# Выделяем признаки и метки
X = df.drop(columns=['Potential Accident Level Encoded'])
y = df['Potential Accident Level Encoded']

# Преобразование всех данных в числовой формат
X = convert_to_numeric(X)

# Разбиение на тренировочную и тестовую выборки
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Преобразование меток классов в диапазон от 0 до num_classes-1
y_train = y_train - 1
y_test = y_test - 1

# Преобразование данных в тензоры
X_train_tensor = torch.tensor(X_train.values, dtype=torch.float32)
X_test_tensor = torch.tensor(X_test.values, dtype=torch.float32)
y_train_tensor = torch.tensor(y_train.values, dtype=torch.long)
y_test_tensor = torch.tensor(y_test.values, dtype=torch.long)

# Создание DataLoader для PyTorch
train_dataset = TensorDataset(X_train_tensor, y_train_tensor)
test_dataset = TensorDataset(X_test_tensor, y_test_tensor)

train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=32, shuffle=False)

# Определение модели нейронной сети
class MLP(nn.Module):
    def __init__(self, input_size, hidden_size, num_classes):
        super(MLP, self).__init__()
        self.fc1 = nn.Linear(input_size, hidden_size)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(hidden_size, hidden_size // 2)  # уменьшение размера скрытого слоя
        self.fc3 = nn.Linear(hidden_size // 2, num_classes)
        self.dropout = nn.Dropout(0.6)  # увеличение dropout
        self.batch_norm1 = nn.BatchNorm1d(hidden_size)
        self.batch_norm2 = nn.BatchNorm1d(hidden_size // 2)

    def forward(self, x):
        out = self.fc1(x)
        out = self.batch_norm1(out)
        out = self.relu(out)
        out = self.dropout(out)
        out = self.fc2(out)
        out = self.batch_norm2(out)
        out = self.relu(out)
        out = self.dropout(out)
        out = self.fc3(out)
        return out

input_size = X_train.shape[1]
hidden_size = 128  # уменьшение скрытого слоя
num_classes = len(y.unique())

model = MLP(input_size, hidden_size, num_classes)

# Определение функции потерь и оптимизатора с L2-регуляризацией
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.0005, weight_decay=0.01)  # уменьшение learning rate и увеличение weight decay

# Обучение модели с ранним прекращением
num_epochs = 50
train_losses = []
test_losses = []
best_test_loss = float('inf')
patience = 10  # Число эпох без улучшений перед остановкой
early_stopping_counter = 0

for epoch in range(num_epochs):
    model.train()
    train_loss = 0
    for X_batch, y_batch in train_loader:
        optimizer.zero_grad()
        outputs = model(X_batch)
        loss = criterion(outputs, y_batch)
        loss.backward()
        optimizer.step()
        train_loss += loss.item()
    
    train_losses.append(train_loss / len(train_loader))

    # Проверка на тестовом наборе
    model.eval()
    test_loss = 0
    with torch.no_grad():
        for X_batch, y_batch in test_loader:
            outputs = model(X_batch)
            loss = criterion(outputs, y_batch)
            test_loss += loss.item()
    
    test_losses.append(test_loss / len(test_loader))

    print(f"Epoch {epoch+1}/{num_epochs}, Train Loss: {train_losses[-1]:.4f}, Test Loss: {test_losses[-1]:.4f}")

    # Раннее прекращение
    if test_losses[-1] < best_test_loss:
        best_test_loss = test_losses[-1]
        early_stopping_counter = 0
    else:
        early_stopping_counter += 1

    if early_stopping_counter >= patience:
        print("Early stopping triggered.")
        break

# Оценка модели
model.eval()
y_pred = []
y_true = []

with torch.no_grad():
    for X_batch, y_batch in test_loader:
        outputs = model(X_batch)
        _, predicted = torch.max(outputs, 1)
        y_pred.extend((predicted + 1).numpy())  # Сдвиг предсказанных значений на +1
        y_true.extend((y_batch + 1).numpy())    # Сдвиг истинных значений на +1 для правильного сравнения

# Проверка на переобучение и недообучение
plt.plot(range(len(train_losses)), train_losses, label='Train Loss')
plt.plot(range(len(test_losses)), test_losses, label='Test Loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.title('Train and Test Loss')
plt.show()

In [None]:
# Оценка модели Gradient Boosting
# gb_report = classification_report(y_test, y_pred_gb)
print("Gradient Boosting Classification Report:")
print(gb_report)

# Отчет о классификации
report = classification_report(y_true, y_pred)
print("Neural Network Classification Report:")
print(report)

# Предобработка датасета для обучения. Вариант 2.

<h3>Для каждого класса собирём все описания происшествий в единый текст, затем определим из этого текста:</h3> 
<br>1. Должность.
<br>2. Действия, которые совершал работник.
<br>3. Инструменты и техника.

Добавим эти данные для каждой строки соответствующего класса в датасете (создадим новые поля).

In [None]:
df = pd.read_csv('descr_cleaned_dataset.csv')

In [None]:
# Группировка описаний по категориям
grouped = df.groupby('Potential Accident Level Encoded')['Cleaned_Description'].apply(lambda x: ' '.join(x)).reset_index()
grouped.to_excel('c:/Users/AADementev/Desktop/Projects/python/MachineLearning/graduation_project_pro_final/grouped_descriptions.xlsx', index=False)

In [None]:
grouped.head(1)

In [None]:
grouped_descriptions = dict(zip(grouped['Potential Accident Level Encoded'], grouped['Cleaned_Description']))

<h3>Попробуем сделать автоматическую выборку должностей, действий и минструментов с помощью библиотеки spaCy для обработки естественного языка (NLP).</h3>

In [None]:
# Загружаем модель для обработки текста на английском языке
nlp = spacy.load("en_core_web_sm")

# Текст
text = """
be approximately nv cx ob7 personnel begin task unlock soquet bolt bhb machine penultimate bolt identify hexagonal head wear proceed mr auxiliary assistant climb platform exert pressure hand dado key prevent come bolt moment collaborator rotate lever anticlockwise direction leave key bolt hit palm leave hand cause injury collaborator report work ustulación realize cyclone duct obstruct open door try unclog material detach project employee cause small burn right heel end lunch enable place winche control room get short walk slip sit floor make contact leave knee take importance rest guard guard finish go safety communicate fact reason derive natclar attention trip vehicle end work collaborator step object identify come pierce sole boot cause small hole sole left foot collaborator perforation possibly stump wood area cover collaborator pasture graze recently near residence perform geological mapping activity necessary hammer rock analysis moment clerk hold it point fragment slip quirodactyl right hand cause superficial cut circumstance collaborator perform washing tabolas pot washing area suffer feel dizziness faintness cause fall level produce slight concussion head ground team coordinate prospector assistant wila pm prong opening access collect soil sample come try divert meter right place moment diversion come marimbondo house he give time action thug agitate time ste head neck sting face allergy test verify allergic reaction wash affect return normal activity geological reconnaissance activity farm mr team compose felipe normal activity encounter ciliary forest need enter forest verify rock outcrop front divine realize open access machete moment take bite neck attack allergic reaction continue work normally work complete leave forest access divine assistant attack snake suffer sting forehead moment move away area verify type allergic reaction return normal activity geological reconnaissance activity farm mr team compose felipe normal activity encounter ciliary forest need enter forest verify rock outcrop front divine realize open access machete moment take bite neck attack allergic reaction continue work normally work complete leave forest access divine assistant attack snake suffer sting forehead moment move away area verify type allergic reaction return normal activity employee work electrician management electrometallurgy suffer contusion right leg suffer slip height step staircase code ele abb furnace cat ladder immediately refer collaborator medical service treat circumstance employee connection electric cable no jumbo operator feel discomfort face clean hand rubber glove generate superficial laceration small wound leave cheekbone project vazante carry sediment collection current south mata target drainage serra garrote team compose member wca company move collection point another inside shallow drainage see bee carton reaction away box quickly possible avoid sting run meter look safe area exit radius attack bee ss breno attack consequently suffer sting belly hand verify type allergic reaction return normal activity project vazante carry sediment collection current south mata target drainage serra garrote team compose member wca company move collection point another inside shallow drainage see bee carton reaction away box quickly possible avoid sting run meter look safe area exit radius attack bee ss breno attack consequently suffer sting belly hand verify type allergic reaction return normal activity geologo auxiliary travel evaluate geological point follow gps near drainage follow state highway give access area stop get vehicle point identify gps distance seven meter vehicle follow road surprised bite thorn face neck quickly hurried vehicle move away place clerk wear girdle goggle wear glove enter forest area allergic reaction geologist auxiliary travel field evaluate geological point follow gps near drainage follow state highway give access area stop get vehicle point identify gps mário distancing meter vehicle follow road surprised bite thorn face quickly hurried vehicle move away place clerk wear girdle goggle wear glove enter forest area allergic reaction safety technical move field inspection activity way field pause team order know drainage point check safety get vehicle strike sting weed neck quickly return vehicle radio communication team distance place clerk wear legging glass allergic reaction auxiliary travel field evaluate geological point follow gps near drainage follow state highway give access area stop get vehicle point identify gps distancing meter vehicle accompany geologist surprised bite blow neck quickly hurried vehicle move away place clerk wear girdle goggle wear glove enter forest area allergic reaction travel field order geological mapping geologist accompany prospector stoop deviate vegetation time receive whistling sting they face neck allergic reaction activity follow normally event move field geological mapping prospector accompany geologist stoop deviate vegetation moment receive whistle sting ring finger right hand allergic reaction activity follow normally event mince team carry activity city juína coordinate mining technician felipe time mining technician line away team bite blackjack leave face allergic manifestation team continue work afternoon lunch employee seek medical care medicate release continue activity day level unicon plant collaborator shuttering work concrete water sedimentation basin moment nail wood supply inch strip feel metallic hammer loosen wooden handle fix it grab hammer head hit handle vertically wood generating injury time accident employee use safety glove cut vegetation open bite sickle assistant strike vine twice liana ruptured branch project face auxiliary cause cut upper lip collaborator be clean leave return borehole brapdd slip canva edge well hit right metal structure mudswathe box cause slight excoriation employee refer local hospital medicate release activity be approximately mechanic remove bolt nipple pump lime feeder reactive area mechanic position slightly flex leg perform upward force hand moment feel pain spume right thigh mechanic evacuate help colleague medical post region povoado vista martinópole ce employee perform soil collection activity field auxiliary diassis nascimento be cross fence glove attach wire body project forward cause slight twist leave wrist team travel city granja employee refer hospital consultation doctor diagnose fracture prescribing remedy local pain ice pack medical evaluation employee carry activity normally level dining room collaborator finish wash tabolas food container dimension cm proceed order pink thumb right hand corner aluminum tabola generating lesion employee time accident safety glove preparation scaffold activity employee loading piece designate place finger press metal piece move employee engage removal material excavation level shovel placing bucket day material fall pipe employee boot friction boot calf cause superficial injury leg employee engage removal material excavation level shovel placing bucket day material fall pipe employee boot friction boot calf cause superficial injury leg be activity collect soil collaborator run branch attack maribondo bite twice head pain swell allergic symptom continue activity activity package cylindrical piece easel employee carry piece designate place finger press metal piece move perform carpentry work collaborator hit second finger leave hand hammer hold right hand cause bruise height nail evaluation carry medical center unit final diagnosis contusion finger pm perform magnetometric gps collaborator bump field hat branch attack maribondo bitten ear shoulder continued activity feel pain swelling pm assist gps magnetometric collaborator bump field hat branch attack maribondo moth go eye use sunglass attack region prevent insect move face getting catch ear field hat make helper bite ear allergic marimbondo bite soon activity immediately paralyze drove car accident take medicine antiallergic situation work indicate doctor avoid swell responsible project field mapping activity call radio immediately assistant feel good take emergency hospital lavra sul consult doctor take antiallergic release carry refractory brick chop activity order place support bus bar section particle detach hit assistant right arm meter away work area provoke wound arm treat medical center return usual duty soil sample region employee danillo silva attack bee test rush away place employee take bite chin chest neck hand glove employee take bite hand glove head employee danillo take bite leave arm uniform sketch allergy swell ste site activity stop evaluate site verifying test remain line leave site soil sample region employee danillo silva attack bee test rush away place employee take bite chin chest neck hand glove employee take bite hand glove head employee danillo take bite leave arm uniform sketch allergy swell ste site activity stop evaluate site verifying test remain line leave site soil sample region employee danillo silva attack bee test rush away place employee take bite chin chest neck hand glove employee take bite hand glove head employee danillo take bite leave arm uniform sketch allergy swell ste site activity stop evaluate site verifying test remain line leave site pm perform mag activity employee move acquisition line come small drainage approximately 40 cm wide small gap traverse drainage employee rest right foot ravine come rest cause right ankle twist soon twist activity paralyze employee take local hospital xray take examination physician injury find small swelling release normal activity team vms project perform soil collection xixás target members team move collection point another mr ahead team sting near collection point surprised swarm bee inside play near ground visibility wood hiss noise pass stump attack bee ste left arm uniform prick lip screen rip tangle branch escape team vms project perform soil collection xixás target members team move collection point another mr ahead team sting near collection point surprised swarm bee inside play near ground visibility wood hiss noise pass stump attack bee ste left arm uniform prick lip screen rip tangle branch escape technician magnetometric survey step thorn reaction immediately retreat lose balance magnetometer antenna break 30hs current sediment activity collaborator take bee ste neck screen bee enter screen sting team decide leave workplace presence bee collaborator reaction continue work normally execution soil sample task potion area pm open machete bite wasp right hand time incident epi needed activity employee evaluate technician find mild localize swelling wound employee report feel pain continue activity execution service opening prick future work ip employee line equipment sting wasp right portion neck beetle small size see employee bite cause employee shock insect manifest employee ppe require activity develop bite occur collar shirt face shield technician responsible performing work evaluate ste and injure employee find localized swell allergy need paralyze activity follow normally execution soil sample task potion area pablo move bite bite right elbow wasp sleeve uniform time incident ppe need activity employee evaluate team find mild injury localize swell employee report feel pain continue activity employee pass corner door see virdro slight swell frontal region closing glass door activity maintenance scaller breaker arm extension cylinder local underground level removal cylinder scaller arm releasing fix pin cylinder come down bump tool press hand tool structure equipment field activity amg project target reconnaissance team boarding car park window close enter mr put seat belt inside vehicle press wasp shoulder neck cause sting believe that possibly bee nail clothe car properly close
"""

# Функция для извлечения должностей
def extract_positions(text):
    positions = set()
    doc = nlp(text)
    for ent in doc.ents:
        if ent.label_ in ["PERSON", "ORG"]:
            positions.add(ent.text)
    return list(positions)

# Функция для извлечения оборудования и инструментов
def extract_equipment(text):
    equipment = set()
    doc = nlp(text)
    for token in doc:
        if token.pos_ == "NOUN":
            equipment.add(token.text)
    return list(equipment)

# Функция для извлечения действий
def extract_actions(text):
    actions = set()
    doc = nlp(text)
    for token in doc:
        if token.pos_ == "VERB":
            actions.add(token.lemma_)
    return list(actions)

# Извлечение данных
positions = extract_positions(text)
equipment = extract_equipment(text)
actions = extract_actions(text)

# Вывод результатов
print("Должности:")
print(positions)
print("\nДействия:")
print(actions)
print("\nОборудование и инструменты:")
print(equipment)

Как видим результаты не очень точные.

<h3>Выполним сбор данных вручную.</h3>

In [None]:
df = pd.read_csv('descr_cleaned_dataset.csv')

In [None]:
df.head(2)

In [None]:
# Создаем словарь с данными для таблицы
data = {
    "Должности": [
        "Auxiliary assistant, Collaborator, Clerk, Geologist, Safety technical, Mining technician, Mechanical, Danilo Silva, Technician.", 
        "Forklift operator, Collaborator, Operator, Chief guard, Assistant, Engineer trainee, Teacher, Store attendant, Welder, Topographic surveyor, Operator bolter, Technician, Mixkret operator, Geologist, Maintenance team, Mechanic, Surveying worker, Industrial cleaning worker, Food preparer, Mobile equipment maintenance team, Comedor worker.", 
        "Collaborator, Mechanic technician, Electrician supervisor, Master loader, Equipment assistant, Official operator, Assistant loader, Master shipper, Truck crane operator, Mechanic, Technician, Civil operator, Bolted assistant, Welder, Helper, Boltec technician, Sampler, Electrician, Operator, Mechanic operator.", 
        "Maintenance Supervisor, Mechanic, Driller assistant, Operator, Mason Assistant, Bolter Operator, Mill Operator, Assistant Mechanic, Welder, Technician, Truck Driver, Plant Worker, Blaster, Maintenance worker.", 
        "Mixkret operator, Resident engineer, Filter operator, Autoclave operator, Mechanic, Pilot, Mixer operator, Wheel loader operator, Operator of the concrete plant, Scalar operator."
    ],
    "Должности (пример на русском)": [
        "Геолог, Горный техник-технолог.", 
        "Оператор вилочного погрузчика, Оператор горно-шахтной самоходной машины.", 
        "Техник-механик, Оператор автокрана.",
        "Помощник по бурению, Взрыватель.", 
        "Пилот, Оператор фронтального погрузчика, Оператор бетономешалки, Оператор горно-шахтной самоходной машины, Механик."
    ],
    "Действия": [
        "Unlock, Identify hexagonal head, Climb platform, Exert pressure hand, Try unclog material, Detach project, Enable place winch control room, Get short walk slip, Sit floor, Make contact, Carry sediment collection, Run meter, Evaluate geological point, Give access area, Know drainage point, Strike sting weed neck, Perform soil collection activity, Attach wire, Perform magnetometric GPS, Use sunglass, Carry refractory brick, Chop activity, Place support bus bar section, Execute soil sample task, Open machete, Break magnetometer Antenna, Move acquisition line, Traverse drainage.", 
        "Cleaning metal structures, Operating a crane and forklift, Using a hammer and chisel, Releasing a blade and manually displacing sheets, Operating a truck and adjusting bolts, Cutting electro welded mesh, Opening access areas with tools, Welding and inspecting mining cars, Handling chemicals, Performing maintenance on various equipment, Using a torch for cutting activities, Loading and unloading materials, Preparing geological maps and conducting surveys, Painting floors, Carrying out inspections, Preparing food, Cleaning industrial areas, Handling ventilation equipment.", 
        "Excavation work, Unload operation, Unclog discharge mouth, Maneuver to unhook hose, Turn pulley manually, Grab transmission belt, Verify belt tension, Install segment of polyurethane pulley, Clean shutter with air lance, Perform truck unload operation, Remove rope tie, Perform disconnection of power cable, Detach upper support point, Verify remaining position, Identify rock mesh, Change drill bit, Release coupling, Replace telescopic expansion joint, Position portable ladder, Hold base of loader, Clean spatula spear window boiler, Perform radial drilling, Activate hydraulic pump inspection cover, Install support mesh cloth, Handle water supply hose, Verify lock failure, Dismantle scaffold, Perform carbon steel pipe mark activity, Perform maintenance on motor support, Place protective plate on fuel tank, Perform supply operation for zinc powder container, Perform maintenance activity on transmission belt, Test soft starter engine belt, Perform mechanical support activity, Conduct inspection of sulfuric acid spill line, Carry out sand electrolysis piece, Perform brushcutter operation, Handle pneumatic conveyor, Transport dust zinc container, Lower metal sheet, Assemble activity for polypropylene pipe, Clean area near conveyor, Move locomotive personnel, Perform drilling activity with LM17 probe, Remove bucket of pulp sample, Supervise ustulation activity, Carry inspection cut block level OB6A.", 
        "Loosening support, Facilitating removal, Tightening, Activating pump, Designing area, Applying aid, Positioning pot, Operating drill, Cleaning position, Securing pipe, Lifting platform, Aligning cathode press, Manually moving steel cabinet, Drilling hole, Removing flange, Applying shotcrete, Preparing oil cylinder, Mounting rail platform, Feeding bag into furnace, Supporting stabilizer, Conducting maintenance, Checking work front, Verifying ventilation, Placing mesh, Pulling support mesh, Unloading ore, Reshaping hand, Unlocking rod, Tightening bolt, Evaluating acid leakage, Unloading residual water, Operating equipment, Driving truck, Inspecting equipment, Removing suction pipe, Cleaning low floor, Wearing safety equipment, Entering filter belt, Preparing construction, Perform rock untie, Lift brace, Hoisting and setting up equipment, Changing fuses, Soil collection, Cleaning area, Installation of ventilation plug, Welding steel plate, Unloading material, Operating overflow system, Manipulating hose, Testing equipment.", 
        "Open the electric board, Proceed with the installation, Remove the lock, Use the thermomagnetic key, Make phase contact, Check voltage, Plug socket, Cut wire, Transfer pump, Clean pump, Manipulate motor pump transmission, Change cable, Start equipment, Stop equipment, Perform sanitation, Clean compressor, Lubricate equipment, Load explosive equipment, Putty work, Clean material, Change conveyor belt."
    ],
    "Действия (пример на русском)": [
        "Крепление провода, Установка опорной шины, Отбор пробы почвы, Включение пункта управления лебёдкой, Прокладка дренажа.", 
        "Очистка металлических конструкций, Управление краном и вилочным погрузчиком, Сварка и осмотр карьерных машин, Погрузка и разгрузка материалов, Использование горелки для резки материалов.", 
        "Разгрузка грузовика, Радиальное бурение, Отсоединение кабеля питания, Разъединение муфты, Замена телескопического компенсатора.", 
        "Установка резервуара, Монтаж рельсовой платформы, Подготовка масляного баллона, Подача мешка в печь, Установка стабилизатора.", 
        "Проверка напряжения, Очистка насоса, Замена кабеля, Очистка компрессора, Загрузка взрывоопасного оборудования, Замена ленты конвейера."
    ],
    "Инструменты и техника": [
        "Metal piece, Shovel, Bucket, Canvas, Wooden handle, Metallic hammer, Sickle, Electric cable, Magnetometric GPS, Antenna, Hammer, Rock, Machete, Refractor, Shovel, Tabola, Scaller breaker arm, Hand tool, Structure equipment, Magnetometer, Screen.", 
        "Forklift, Hammer and chisel, Ladder, Manual displacement tools, Truck (A30), Electro welded mesh cutter, Machete, Welding equipment, Mixing equipment, Loaders (e.g., J005A), Sludge lever, Shears, Scissor bolter, Mona car, Laboratory sampling tools, Chemical containers, Pressure hoses, Pumps and blowers, Mining and inspection tools, High-pressure pump gun, Shovels and cleaning tools, Power cables and sockets, Cathode cranes, Zinc sheets, Wooden stumps, Rail tracks and corridors, Industrial cleaning equipment, Soil activity tools (e.g., pickaxe).", 
        "Ustilago powder, Silo truck, Transmission belt HM pump, Polyurethane pulley, Electro welded mesh, Hydraulic cylinder, Winch pulley, Air lance, Telescopic ladder, Metal bar hammer, Telescopic expansion joint HDPE pipe, Spatula spear window boiler, Simba M4C ITH equipment, Hydraulic pump, Volumetric balloon, Radial drilling machine, Scissor bolter, HDPE pipe storm drainage system, POM D071 return thickener, LM17 probe, Iron bundle truck, Breaker tip, Jumbo drilling rig, Stilson key, Tire lever, Sledgehammer, Doosan RB equipment, Combination wrench, Hydraulic load maintenance equipment, Mechanized support scissor, Shotcrete gun, Three-way pear pipette, Nitrogen hose, Hydraulic fill pipe.", 
        "Drill rod, Jumbo, Sodium sulphide pump, Hand bar, Pulley motor, Oil cylinder, Scaffold, Truck, Bolt, Drilling machine, Stilson key, Steel wire rope, Winch, Jack, Suction valve, Cable pump, Locomotive, Electrical system, Shotcrete equipment, Ingot rotary table, Air lance, Concrete throwing team, Geho pump, Metal rake, Hydraulic hammer, Chisel, Lubricant, Strip set, Mining car, Diamond drill, Fisherman winch cable, Calibrator, Welding equipment, Peristaltic pump, Tirfor, Autoclave, Conveyor belt, Ventilation plug, Cleaning mechanism, Water hose, Manipulator, Stone cutting machine, Scoop lip, Hoist, Jackleg.", 
        "Split set intersection, Electric board 440V 400A, Thermomagnetic key, Panel shell, Mixkret, Autoclave, Anfo loader, Dumper, Pump, Automatic sampler, Platform, Wheel loader, Manual tick, Hydraulic fill, Intermediate cardan protector, Lamp, Motor transmission belt, Compressor, Cross cutter, Conveyor belt, Suction spool, Scrubber, Hydraulic cylinder, Vibrator, Scoop, Electric cable, PVC pipe, Fan belt, Key, Hose, Shotcrete."
    ],
    "Инструменты и техника (пример на русском)": [
        "Молоток, Лопата, Скобозабивной станок, Магнитометр.", 
        "Сварочное оборудование, Автомобиль, Насосы и воздуходувки, Инструменты для горных работ и инспекции, Силовые кабели и розетки.", 
        "Гидравлический цилиндр, Отбойный молоток, Радиально-сверлильный станок, Гидравлический насос, Тележка для перевозки рулонов железа.", 
        "Насос для перекачки сульфида натрия, Локомотив, Шкив двигателя, Поворотный стол для обработки слитков, Оборудование для торкретирования бетона, Камнерезный станок.", 
        "Электрическая панель 440V 400A, Горно-шахтная самоходная машина, Конвейерная лента, Ремень вентилятора, Самосвал-погрузчик."
    ]
}

# Создаем DataFrame
df_descr = pd.DataFrame(data, index=["1 уровень", "2 уровень", "3 уровень", "4 уровень", "5 уровень"])

# Настраиваем стили для выравнивания текста и заголовков
styled_df = df_descr.style.set_properties(**{
    'text-align': 'left',
    'vertical-align': 'top'
}).set_table_styles([
    {'selector': 'th', 'props': [('text-align', 'center'), ('vertical-align', 'top')]},
    {'selector': 'td', 'props': [('text-align', 'left'), ('vertical-align', 'top')]}
])

# Преобразуем стилизованный DataFrame в HTML и выводим в Jupyter Notebook
html = styled_df.to_html()

# Отображаем HTML
display(HTML(html))

In [None]:
# Проверка наличия строк с Potential Accident Level Encoded = 6
count_6 = df[df['Potential Accident Level Encoded'] == 6].shape[0]
print(f'Количество строк с Potential Accident Level Encoded = 6: {count_6}')

# Удаление всех строк с Potential Accident Level Encoded = 6
df = df[df['Potential Accident Level Encoded'] != 6]

# Проверка количества строк после удаления
print(f'Количество строк после удаления: {df.shape[0]}')

# Проверка информации о DataFrame
print(df.info())

<h3>Создадим новые колонки и сохраним датасет.</h3>

In [None]:
# Создание новых колонок на основе значения "Potential Accident Level Encoded"
def map_data(level):
    return {
        "Positions": data["Должности"][level-1],
        "Actions": data["Действия"][level-1],
        "Tools_and_equipment": data["Инструменты и техника"][level-1]
    }

# Применяем функцию к каждому значению в колонке "Potential Accident Level Encoded"
mapped_data = df['Potential Accident Level Encoded'].apply(map_data)

# Преобразуем Series of dicts в DataFrame
mapped_df = pd.DataFrame(mapped_data.tolist())

# Объединяем оригинальный DataFrame с новым DataFrame
result_df = pd.concat([df.reset_index(drop=True), mapped_df], axis=1)

# Сохранение нового датасета в файл
result_df.to_csv('df_pos_act_tools.csv', index=False)

In [None]:
df = pd.read_csv('df_pos_act_tools.csv')
df.info()

In [None]:
df.shape

In [None]:
df_copy = df.copy()
df_copy['Cleaned_Description'] = df_copy['Cleaned_Description'].apply(lambda x: (x[:47] + '...') if len(x) > 50 else x)
df_copy['Positions'] = df_copy['Positions'].apply(lambda x: (x[:47] + '...') if len(x) > 50 else x)
df_copy['Actions'] = df_copy['Actions'].apply(lambda x: (x[:47] + '...') if len(x) > 50 else x)
df_copy['Tools_and_equipment'] = df_copy['Tools_and_equipment'].apply(lambda x: (x[:47] + '...') if len(x) > 50 else x)
df_copy.head()

In [None]:
# Применение One-Hot Encoding к категориальным столбцам
df = pd.get_dummies(df, columns=['Genre', 'Employee or Third Party', 'Critical Risk', 'Month'])

# Векторизация текста с использованием TF-IDF для Cleaned_Description, Positions, Actions, Tools_and_equipment
tfidf = TfidfVectorizer(max_features=10000)

# Векторизация для каждого текстового столбца и объединение результатов
for column in ['Cleaned_Description', 'Positions', 'Actions', 'Tools_and_equipment']:
    X_tfidf = tfidf.fit_transform(df[column])
    X_tfidf_df = pd.DataFrame(X_tfidf.toarray(), columns=[f"{column}_{feature}" for feature in tfidf.get_feature_names_out()])
    df = pd.concat([df.drop(columns=[column]), X_tfidf_df], axis=1)

In [None]:
# Создание копии DataFrame и сокращение текста
df_copy = df.copy()

# Пример сокращения текстов в копии DataFrame
for column in ['Cleaned_Description', 'Positions', 'Actions', 'Tools_and_equipment']:
    if column in df_copy.columns:
        df_copy[column] = df_copy[column].apply(lambda x: (x[:47] + '...') if isinstance(x, str) and len(x) > 50 else x)

df_copy.head(2)

In [None]:
df_copy.info()

In [None]:
# Выделяем признаки и метки
X = df.drop(columns=['Potential Accident Level Encoded'])
y = df['Potential Accident Level Encoded']

# Разбиение на тренировочную и тестовую выборки
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify = y)

# Обучение. Градиентный бустинг. Вариант 2.

In [None]:
# Определение гиперпараметров для подбора
param_grid = {
    'n_estimators': [100, 200],
    'learning_rate': [0.01, 0.1],
    'max_depth': [3, 5],
    'subsample': [0.8, 1.0],
    'min_samples_split': [2, 5]
}

# Настройка GridSearchCV
gb_clf = GradientBoostingClassifier(random_state=42)
grid_search = GridSearchCV(estimator=gb_clf, param_grid=param_grid, cv=3, n_jobs=-1, scoring='accuracy')

# Обучение модели с подбором гиперпараметров
grid_search.fit(X_train, y_train)

# Лучшие параметры
best_params = grid_search.best_params_
print(f"Best parameters: {best_params}")

# Оценка модели с лучшими параметрами
best_gb_clf = grid_search.best_estimator_
y_pred_gb = best_gb_clf.predict(X_test)

gb_report = classification_report(y_test, y_pred_gb, target_names=[f"{i+1}" for i in class_labels])

# Обучение. Нейронная сеть. Вариант 2.

In [None]:
# Преобразование всех столбцов в числовой формат
def convert_to_numeric(df):
    for col in df.columns:
        if df[col].dtype == 'bool':
            df[col] = df[col].astype(int)
        elif df[col].dtype == 'object':
            try:
                df[col] = df[col].astype(float)
            except ValueError:
                df[col] = pd.factorize(df[col])[0]
    return df

# Выделяем признаки и метки
X = df.drop(columns=['Potential Accident Level Encoded'])
y = df['Potential Accident Level Encoded']

# Преобразование всех данных в числовой формат
X = convert_to_numeric(X)

# Разбиение на тренировочную и тестовую выборки
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Преобразование меток классов в диапазон от 0 до num_classes-1
y_train = y_train - 1
y_test = y_test - 1

# Преобразование данных в тензоры
X_train_tensor = torch.tensor(X_train.values, dtype=torch.float32)
X_test_tensor = torch.tensor(X_test.values, dtype=torch.float32)
y_train_tensor = torch.tensor(y_train.values, dtype=torch.long)
y_test_tensor = torch.tensor(y_test.values, dtype=torch.long)

# Создание DataLoader для PyTorch
train_dataset = TensorDataset(X_train_tensor, y_train_tensor)
test_dataset = TensorDataset(X_test_tensor, y_test_tensor)

train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=32, shuffle=False)

# Определение модели нейронной сети
class MLP(nn.Module):
    def __init__(self, input_size, hidden_size, num_classes):
        super(MLP, self).__init__()
        self.fc1 = nn.Linear(input_size, hidden_size)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(hidden_size, hidden_size // 2)  # уменьшение размера скрытого слоя
        self.fc3 = nn.Linear(hidden_size // 2, num_classes)
        self.dropout = nn.Dropout(0.6)  # увеличение dropout
        self.batch_norm1 = nn.BatchNorm1d(hidden_size)
        self.batch_norm2 = nn.BatchNorm1d(hidden_size // 2)

    def forward(self, x):
        out = self.fc1(x)
        out = self.batch_norm1(out)
        out = self.relu(out)
        out = self.dropout(out)
        out = self.fc2(out)
        out = self.batch_norm2(out)
        out = self.relu(out)
        out = self.dropout(out)
        out = self.fc3(out)
        return out

input_size = X_train.shape[1]
hidden_size = 128  # уменьшение скрытого слоя
num_classes = len(y.unique())

model = MLP(input_size, hidden_size, num_classes)

# Определение функции потерь и оптимизатора с L2-регуляризацией
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.0005, weight_decay=0.01)  # уменьшение learning rate и увеличение weight decay

# Обучение модели с ранним прекращением
num_epochs = 50
train_losses = []
test_losses = []
best_test_loss = float('inf')
patience = 10  # Число эпох без улучшений перед остановкой
early_stopping_counter = 0

for epoch in range(num_epochs):
    model.train()
    train_loss = 0
    for X_batch, y_batch in train_loader:
        optimizer.zero_grad()
        outputs = model(X_batch)
        loss = criterion(outputs, y_batch)
        loss.backward()
        optimizer.step()
        train_loss += loss.item()
    
    train_losses.append(train_loss / len(train_loader))

    # Проверка на тестовом наборе
    model.eval()
    test_loss = 0
    with torch.no_grad():
        for X_batch, y_batch in test_loader:
            outputs = model(X_batch)
            loss = criterion(outputs, y_batch)
            test_loss += loss.item()
    
    test_losses.append(test_loss / len(test_loader))

    print(f"Epoch {epoch+1}/{num_epochs}, Train Loss: {train_losses[-1]:.4f}, Test Loss: {test_losses[-1]:.4f}")

    # Раннее прекращение
    if test_losses[-1] < best_test_loss:
        best_test_loss = test_losses[-1]
        early_stopping_counter = 0
    else:
        early_stopping_counter += 1

    if early_stopping_counter >= patience:
        print("Early stopping triggered.")
        break

# Оценка модели
model.eval()
y_pred = []
y_true = []

with torch.no_grad():
    for X_batch, y_batch in test_loader:
        outputs = model(X_batch)
        _, predicted = torch.max(outputs, 1)
        y_pred.extend((predicted + 1).numpy())  # Сдвиг предсказанных значений на +1
        y_true.extend((y_batch + 1).numpy())    # Сдвиг истинных значений на +1 для правильного сравнения

# Проверка на переобучение и недообучение
plt.plot(range(len(train_losses)), train_losses, label='Train Loss')
plt.plot(range(len(test_losses)), test_losses, label='Test Loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.title('Train and Test Loss')
plt.show()

Сравним результаты.

In [None]:
# Оценка модели Gradient Boosting
# gb_report = classification_report(y_test, y_pred_gb, target_names=[f"{i+1}" for i in class_labels])
print("Gradient Boosting Classification Report:")
print(gb_report)

# Отчет о классификации
report = classification_report(y_true, y_pred)
print("Neural Network Classification Report:")
print(report)

Видим, что модели подстроились под данные, что соответственно нам не подходит.

# Предобработка датасета для обучения. Вариант 3.

<h3>Для каждой строчки с описанием происшествия определим:</h3> 
<br>1. Должность.
<br>2. Действия, которые совершал работник.
<br>3. Инструменты и техника.

Добавим эти данные для каждой обработанной строки (создадим новые поля).

In [None]:
df = pd.read_csv('descr_cleaned_dataset.csv')
df.head(2)

In [None]:
# Создаем словарь с данными для таблицы
data = {
    "Должности": [
        "Auxiliary assistant, Collaborator, Clerk, Geologist, Safety technical, Mining technician, Mechanical, Danilo Silva, Technician.", 
        "Forklift operator, Collaborator, Operator, Chief guard, Assistant, Engineer trainee, Teacher, Store attendant, Welder, Topographic surveyor, Operator bolter, Technician, Mixkret operator, Geologist, Maintenance team, Mechanic, Surveying worker, Industrial cleaning worker, Food preparer, Mobile equipment maintenance team, Comedor worker.", 
        "Collaborator, Mechanic technician, Electrician supervisor, Master loader, Equipment assistant, Official operator, Assistant loader, Master shipper, Truck crane operator, Mechanic, Technician, Civil operator, Bolted assistant, Welder, Helper, Boltec technician, Sampler, Electrician, Operator, Mechanic operator.", 
        "Maintenance Supervisor, Mechanic, Driller assistant, Operator, Mason Assistant, Bolter Operator, Mill Operator, Assistant Mechanic, Welder, Technician, Truck Driver, Plant Worker, Blaster, Maintenance worker.", 
        "Mixkret operator, Resident engineer, Filter operator, Autoclave operator, Mechanic, Pilot, Mixer operator, Wheel loader operator, Operator of the concrete plant, Scalar operator."
    ],
    "Действия": [
        "Unlock, Identify hexagonal head, Climb platform, Exert pressure hand, Try unclog material, Detach project, Enable place winch control room, Get short walk slip, Sit floor, Make contact, Carry sediment collection, Run meter, Evaluate geological point, Give access area, Know drainage point, Strike sting weed neck, Perform soil collection activity, Attach wire, Perform magnetometric GPS, Use sunglass, Carry refractory brick, Chop activity, Place support bus bar section, Execute soil sample task, Open machete, Break magnetometer Antenna, Move acquisition line, Traverse drainage.", 
        "Cleaning metal structures, Operating a crane and forklift, Using a hammer and chisel, Releasing a blade and manually displacing sheets, Operating a truck and adjusting bolts, Cutting electro welded mesh, Opening access areas with tools, Welding and inspecting mining cars, Handling chemicals, Performing maintenance on various equipment, Using a torch for cutting activities, Loading and unloading materials, Preparing geological maps and conducting surveys, Painting floors, Carrying out inspections, Preparing food, Cleaning industrial areas, Handling ventilation equipment.", 
        "Excavation work, Unload operation, Unclog discharge mouth, Maneuver to unhook hose, Turn pulley manually, Grab transmission belt, Verify belt tension, Install segment of polyurethane pulley, Clean shutter with air lance, Perform truck unload operation, Remove rope tie, Perform disconnection of power cable, Detach upper support point, Verify remaining position, Identify rock mesh, Change drill bit, Release coupling, Replace telescopic expansion joint, Position portable ladder, Hold base of loader, Clean spatula spear window boiler, Perform radial drilling, Activate hydraulic pump inspection cover, Install support mesh cloth, Handle water supply hose, Verify lock failure, Dismantle scaffold, Perform carbon steel pipe mark activity, Perform maintenance on motor support, Place protective plate on fuel tank, Perform supply operation for zinc powder container, Perform maintenance activity on transmission belt, Test soft starter engine belt, Perform mechanical support activity, Conduct inspection of sulfuric acid spill line, Carry out sand electrolysis piece, Perform brushcutter operation, Handle pneumatic conveyor, Transport dust zinc container, Lower metal sheet, Assemble activity for polypropylene pipe, Clean area near conveyor, Move locomotive personnel, Perform drilling activity with LM17 probe, Remove bucket of pulp sample, Supervise ustulation activity, Carry inspection cut block level OB6A.", 
        "Loosening support, Facilitating removal, Tightening, Activating pump, Designing area, Applying aid, Positioning pot, Operating drill, Cleaning position, Securing pipe, Lifting platform, Aligning cathode press, Manually moving steel cabinet, Drilling hole, Removing flange, Applying shotcrete, Preparing oil cylinder, Mounting rail platform, Feeding bag into furnace, Supporting stabilizer, Conducting maintenance, Checking work front, Verifying ventilation, Placing mesh, Pulling support mesh, Unloading ore, Reshaping hand, Unlocking rod, Tightening bolt, Evaluating acid leakage, Unloading residual water, Operating equipment, Driving truck, Inspecting equipment, Removing suction pipe, Cleaning low floor, Wearing safety equipment, Entering filter belt, Preparing construction, Perform rock untie, Lift brace, Hoisting and setting up equipment, Changing fuses, Soil collection, Cleaning area, Installation of ventilation plug, Welding steel plate, Unloading material, Operating overflow system, Manipulating hose, Testing equipment.", 
        "Open the electric board, Proceed with the installation, Remove the lock, Use the thermomagnetic key, Make phase contact, Check voltage, Plug socket, Cut wire, Transfer pump, Clean pump, Manipulate motor pump transmission, Change cable, Start equipment, Stop equipment, Perform sanitation, Clean compressor, Lubricate equipment, Load explosive equipment, Putty work, Clean material, Change conveyor belt."
    ],
    "Инструменты и техника": [
        "Metal piece, Shovel, Bucket, Canvas, Wooden handle, Metallic hammer, Sickle, Electric cable, Magnetometric GPS, Antenna, Hammer, Rock, Machete, Refractor, Shovel, Tabola, Scaller breaker arm, Hand tool, Structure equipment, Magnetometer, Screen.", 
        "Forklift, Hammer and chisel, Ladder, Manual displacement tools, Truck (A30), Electro welded mesh cutter, Machete, Welding equipment, Mixing equipment, Loaders (e.g., J005A), Sludge lever, Shears, Scissor bolter, Mona car, Laboratory sampling tools, Chemical containers, Pressure hoses, Pumps and blowers, Mining and inspection tools, High-pressure pump gun, Shovels and cleaning tools, Power cables and sockets, Cathode cranes, Zinc sheets, Wooden stumps, Rail tracks and corridors, Industrial cleaning equipment, Soil activity tools (e.g., pickaxe).", 
        "Ustilago powder, Silo truck, Transmission belt HM pump, Polyurethane pulley, Electro welded mesh, Hydraulic cylinder, Winch pulley, Air lance, Telescopic ladder, Metal bar hammer, Telescopic expansion joint HDPE pipe, Spatula spear window boiler, Simba M4C ITH equipment, Hydraulic pump, Volumetric balloon, Radial drilling machine, Scissor bolter, HDPE pipe storm drainage system, POM D071 return thickener, LM17 probe, Iron bundle truck, Breaker tip, Jumbo drilling rig, Stilson key, Tire lever, Sledgehammer, Doosan RB equipment, Combination wrench, Hydraulic load maintenance equipment, Mechanized support scissor, Shotcrete gun, Three-way pear pipette, Nitrogen hose, Hydraulic fill pipe.", 
        "Drill rod, Jumbo, Sodium sulphide pump, Hand bar, Pulley motor, Oil cylinder, Scaffold, Truck, Bolt, Drilling machine, Stilson key, Steel wire rope, Winch, Jack, Suction valve, Cable pump, Locomotive, Electrical system, Shotcrete equipment, Ingot rotary table, Air lance, Concrete throwing team, Geho pump, Metal rake, Hydraulic hammer, Chisel, Lubricant, Strip set, Mining car, Diamond drill, Fisherman winch cable, Calibrator, Welding equipment, Peristaltic pump, Tirfor, Autoclave, Conveyor belt, Ventilation plug, Cleaning mechanism, Water hose, Manipulator, Stone cutting machine, Scoop lip, Hoist, Jackleg.", 
        "Split set intersection, Electric board 440V 400A, Thermomagnetic key, Panel shell, Mixkret, Autoclave, Anfo loader, Dumper, Pump, Automatic sampler, Platform, Wheel loader, Manual tick, Hydraulic fill, Intermediate cardan protector, Lamp, Motor transmission belt, Compressor, Cross cutter, Conveyor belt, Suction spool, Scrubber, Hydraulic cylinder, Vibrator, Scoop, Electric cable, PVC pipe, Fan belt, Key, Hose, Shotcrete."
    ],
}

# Добавление новых колонок
df['Positions'] = ""
df['Actions'] = ""
df['Tools_and_equipment'] = ""

In [None]:
# Проверка наличия строк с Potential Accident Level Encoded = 6
count_6 = df[df['Potential Accident Level Encoded'] == 6].shape[0]
print(f'Количество строк с Potential Accident Level Encoded = 6: {count_6}')

# Удаление всех строк с Potential Accident Level Encoded = 6
df = df[df['Potential Accident Level Encoded'] != 6]

# Проверка количества строк после удаления
print(f'Количество строк после удаления: {df.shape[0]}')

# Проверка информации о DataFrame
print(df.info())

In [None]:
# Функция для поиска и заполнения данных из словаря по уровню происшествия без учета регистра
def fill_columns_by_level(description, level, data):
    positions = data["Должности"][level-1]
    actions = data["Действия"][level-1]
    tools_and_equipment = data["Инструменты и техника"][level-1]
    
    description = description.lower()
    
    found_positions = [term for term in positions.split(", ") if re.search(r'\b' + re.escape(term.lower()) + r'\b', description)]
    found_actions = [term for term in actions.split(", ") if re.search(r'\b' + re.escape(term.lower()) + r'\b', description)]
    found_tools_and_equipment = [term for term in tools_and_equipment.split(", ") if re.search(r'\b' + re.escape(term.lower()) + r'\b', description)]

    return ", ".join(found_positions), ", ".join(found_actions), ", ".join(found_tools_and_equipment)

# Применение функции к каждой строке
for index, row in df.iterrows():
    level = row['Potential Accident Level Encoded']
    if level in [1, 2, 3, 4, 5]:
        positions, actions, tools_and_equipment = fill_columns_by_level(row['Cleaned_Description'], level, data)
        df.loc[index, 'Positions'] = positions
        df.loc[index, 'Actions'] = actions
        df.loc[index, 'Tools_and_equipment'] = tools_and_equipment

# Сохранение нового датасета в файл
df.to_csv('df_pos_act_tools.csv', index=False)

print("Новый датасет сохранен в файл 'df_pos_act_tools.csv'.")

In [None]:
df = pd.read_csv('df_pos_act_tools.csv')
df.info()

Видим, что не все данные заполнились.

<h3>Дозаполним пустые данные вручную, проанализировав каждое описание происшествия. Соохраним полученные данные в новый датасет df_pos_act_tools_manual_processing.csv</h3>

In [None]:
df = pd.read_csv('df_pos_act_tools_manual_processing_2.csv', sep=';')
df.info()

In [None]:
# # Удаление колонки "Cleaned_Description"
# df = df.drop(columns=['Cleaned_Description'])

In [None]:
df.head(2)

In [None]:
# Функция для приведения текста к нижнему регистру и удаления пунктуации
def preprocess_text(text):
    # Приведение к нижнему регистру
    text = text.lower()
    # Удаление пунктуации
    text = re.sub(r'[^\w\s]', '', text)
    return text

# Применение функции к нужным колонкам
columns_to_preprocess = ['Positions', 'Actions', 'Tools_and_equipment']
for column in columns_to_preprocess:
    df[column] = df[column].apply(preprocess_text)

In [None]:
# Проверка результата
df.head()

In [None]:
# Применение One-Hot Encoding к категориальным столбцам
df = pd.get_dummies(df, columns=['Genre', 'Employee or Third Party', 'Critical Risk', 'Month'])

# Векторизация текста с использованием TF-IDF для Cleaned_Description, Positions, Actions, Tools_and_equipment
tfidf = TfidfVectorizer(max_features=10000)

# Векторизация для каждого текстового столбца и объединение результатов
# for column in ['Cleaned_Description', 'Positions', 'Actions', 'Tools_and_equipment']:
for column in ['Cleaned_Description', 'Positions', 'Actions', 'Tools_and_equipment']:
    X_tfidf = tfidf.fit_transform(df[column])
    X_tfidf_df = pd.DataFrame(X_tfidf.toarray(), columns=[f"{column}_{feature}" for feature in tfidf.get_feature_names_out()])
    df = pd.concat([df.drop(columns=[column]), X_tfidf_df], axis=1)

In [None]:
# Создание копии DataFrame
df_copy = df.copy()
df_copy.head(2)

In [None]:
# Выделяем признаки и метки
X = df.drop(columns=['Potential Accident Level Encoded'])
y = df['Potential Accident Level Encoded']

# Разбиение на тренировочную и тестовую выборки
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify = y)

# Обучение. Градиентный бустинг. Вариант 3.

In [None]:
# Определение гиперпараметров для подбора
param_grid = {
    'n_estimators': [100, 200],
    'learning_rate': [0.01, 0.1],
    'max_depth': [3, 5],
    'subsample': [0.8, 1.0],
    'min_samples_split': [2, 5]
}

# Настройка GridSearchCV
gb_clf = GradientBoostingClassifier(random_state=42)
grid_search = GridSearchCV(estimator=gb_clf, param_grid=param_grid, cv=3, n_jobs=-1, scoring='accuracy')

# Обучение модели с подбором гиперпараметров
grid_search.fit(X_train, y_train)

# Лучшие параметры
best_params = grid_search.best_params_
print(f"Best parameters: {best_params}")

# Оценка модели с лучшими параметрами
best_gb_clf = grid_search.best_estimator_
y_pred_gb = best_gb_clf.predict(X_test)

gb_report = classification_report(y_test, y_pred_gb, target_names=[f"{i+1}" for i in class_labels])

In [None]:
# Преобразование всех столбцов в числовой формат
def convert_to_numeric(df):
    for col in df.columns:
        if df[col].dtype == 'bool':
            df[col] = df[col].astype(int)
        elif df[col].dtype == 'object':
            try:
                df[col] = df[col].astype(float)
            except ValueError:
                df[col] = pd.factorize(df[col])[0]
    return df

# Выделяем признаки и метки
X = df.drop(columns=['Potential Accident Level Encoded'])
y = df['Potential Accident Level Encoded']

# Преобразование всех данных в числовой формат
X = convert_to_numeric(X)

# Разбиение на тренировочную и тестовую выборки
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Преобразование меток классов в диапазон от 0 до num_classes-1
y_train = y_train - 1
y_test = y_test - 1

# Преобразование данных в тензоры
X_train_tensor = torch.tensor(X_train.values, dtype=torch.float32)
X_test_tensor = torch.tensor(X_test.values, dtype=torch.float32)
y_train_tensor = torch.tensor(y_train.values, dtype=torch.long)
y_test_tensor = torch.tensor(y_test.values, dtype=torch.long)

# Создание DataLoader для PyTorch
train_dataset = TensorDataset(X_train_tensor, y_train_tensor)
test_dataset = TensorDataset(X_test_tensor, y_test_tensor)

train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=32, shuffle=False)

# Определение модели нейронной сети
class MLP(nn.Module):
    def __init__(self, input_size, hidden_size, num_classes):
        super(MLP, self).__init__()
        self.fc1 = nn.Linear(input_size, hidden_size)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(hidden_size, hidden_size // 2)  # уменьшение размера скрытого слоя
        self.fc3 = nn.Linear(hidden_size // 2, num_classes)
        self.dropout = nn.Dropout(0.6)  # увеличение dropout
        self.batch_norm1 = nn.BatchNorm1d(hidden_size)
        self.batch_norm2 = nn.BatchNorm1d(hidden_size // 2)

    def forward(self, x):
        out = self.fc1(x)
        out = self.batch_norm1(out)
        out = self.relu(out)
        out = self.dropout(out)
        out = self.fc2(out)
        out = self.batch_norm2(out)
        out = self.relu(out)
        out = self.dropout(out)
        out = self.fc3(out)
        return out

input_size = X_train.shape[1]
hidden_size = 128  # уменьшение скрытого слоя
num_classes = len(y.unique())

model = MLP(input_size, hidden_size, num_classes)

# Определение функции потерь и оптимизатора с L2-регуляризацией
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.0005, weight_decay=0.01)  # уменьшение learning rate и увеличение weight decay

# Обучение модели с ранним прекращением
num_epochs = 50
train_losses = []
test_losses = []
best_test_loss = float('inf')
patience = 10  # Число эпох без улучшений перед остановкой
early_stopping_counter = 0

for epoch in range(num_epochs):
    model.train()
    train_loss = 0
    for X_batch, y_batch in train_loader:
        optimizer.zero_grad()
        outputs = model(X_batch)
        loss = criterion(outputs, y_batch)
        loss.backward()
        optimizer.step()
        train_loss += loss.item()
    
    train_losses.append(train_loss / len(train_loader))

    # Проверка на тестовом наборе
    model.eval()
    test_loss = 0
    with torch.no_grad():
        for X_batch, y_batch in test_loader:
            outputs = model(X_batch)
            loss = criterion(outputs, y_batch)
            test_loss += loss.item()
    
    test_losses.append(test_loss / len(test_loader))

    print(f"Epoch {epoch+1}/{num_epochs}, Train Loss: {train_losses[-1]:.4f}, Test Loss: {test_losses[-1]:.4f}")

    # Раннее прекращение
    if test_losses[-1] < best_test_loss:
        best_test_loss = test_losses[-1]
        early_stopping_counter = 0
    else:
        early_stopping_counter += 1

    if early_stopping_counter >= patience:
        print("Early stopping triggered.")
        break

# Оценка модели
model.eval()
y_pred = []
y_true = []

with torch.no_grad():
    for X_batch, y_batch in test_loader:
        outputs = model(X_batch)
        _, predicted = torch.max(outputs, 1)
        y_pred.extend((predicted + 1).numpy())  # Сдвиг предсказанных значений на +1
        y_true.extend((y_batch + 1).numpy())    # Сдвиг истинных значений на +1 для правильного сравнения

# Проверка на переобучение и недообучение
plt.plot(range(len(train_losses)), train_losses, label='Train Loss')
plt.plot(range(len(test_losses)), test_losses, label='Test Loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.title('Train and Test Loss')
plt.show()

Сравним результаты.

In [None]:
# Оценка модели Gradient Boosting
# gb_report = classification_report(y_test, y_pred_gb, target_names=[f"{i+1}" for i in class_labels])
print("Gradient Boosting Classification Report:")
print(gb_report)

# Отчет о классификации
report = classification_report(y_true, y_pred)
print("Neural Network Classification Report:")
print(report)

# Сравним итоговые результаты.

![SNOWFALL](comparison.png)

<b>Вывод:</b>

<b>Анализ Gradient Boosting:</b> Вариант 3 демонстрирует улучшенные результаты по всем метрикам по сравнению с Вариантом 1. Увеличение точности, F1-оценки и значения Recall указывает на то, что Вариант 3 более эффективен в классификации и лучше справляется с задачей.

<b>Анализ Neural Network:</b> Вариант 3 также демонстрирует улучшение по сравнению с Вариантом 1, однако разница в результатах между двумя вариантами меньшая, чем у Gradient Boosting. Тем не менее, Вариант 3 лучше в точности и Recall по сравнению с Вариантом 1.

Видим, что итоговые результаты в обоих случах довольно слабые. Попробуем использовтаь более продвинутые модели, такие как <b>LightGBM</b>.

Отдельно стоит отметить, что описание происшествий очень сильно разрозненные и практически не похожи друг на друга за некоторыми исключениями.

# Обучение модели LightGBM.

In [None]:
df = pd.read_csv('df_pos_act_tools_manual_processing_2.csv', sep=';')

In [None]:
columns_to_preprocess = ['Positions', 'Actions', 'Tools_and_equipment']
for column in columns_to_preprocess:
    df[column] = df[column].apply(preprocess_text)

In [None]:
# Применение One-Hot Encoding к категориальным столбцам
df = pd.get_dummies(df, columns=['Genre', 'Employee or Third Party', 'Critical Risk', 'Month'])

In [None]:
# Выделяем признаки и метки
X = df.drop(columns=['Potential Accident Level Encoded'])
y = df['Potential Accident Level Encoded']

# Разбиение на тренировочную и тестовую выборки
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

In [None]:
# import pandas as pd
# from sklearn.feature_extraction.text import TfidfVectorizer
# from sklearn.model_selection import train_test_split

# # Повторное заполнение NaN значений
# X_train = X_train.fillna("")
# X_test = X_test.fillna("")

# # Проверка на наличие NaN перед векторизацией
# print("Проверка на NaN в X_train перед векторизацией:\n", X_train[['Positions', 'Actions', 'Tools_and_equipment']].isna().sum())
# print("Проверка на NaN в X_test перед векторизацией:\n", X_test[['Positions', 'Actions', 'Tools_and_equipment']].isna().sum())

# print(f"Размеры X_train до векторизации: {X_train.shape}")
# print(f"Размеры y_train: {y_train.shape}")

# # Векторизация текста с использованием TF-IDF для текстовых столбцов
# tfidf_vectorizers = {}
# for column in ['Cleaned_Description', 'Positions', 'Actions', 'Tools_and_equipment']:
#     tfidf = TfidfVectorizer(max_features=10000)
#     X_tfidf_train = tfidf.fit_transform(X_train[column].astype(str))
    
#     # Преобразуем обучающие данные и проверяем размеры
#     X_tfidf_train_df = pd.DataFrame(X_tfidf_train.toarray(), columns=[f"{column}_{feature}" for feature in tfidf.get_feature_names_out()])
#     X_tfidf_train_df.index = X_train.index  # Присвоение правильных индексов
    
#     print(f"Размеры X_tfidf_train для {column}: {X_tfidf_train_df.shape}")
    
#     # Объединение с сохранением индексов
#     X_train = pd.concat([X_train.drop(columns=[column]), X_tfidf_train_df], axis=1)
    
#     # Применяем ту же трансформацию к тестовым данным
#     X_tfidf_test = tfidf.transform(X_test[column].astype(str))
#     X_tfidf_test_df = pd.DataFrame(X_tfidf_test.toarray(), columns=[f"{column}_{feature}" for feature in tfidf.get_feature_names_out()])
#     X_tfidf_test_df.index = X_test.index  # Присвоение правильных индексов
    
#     X_test = pd.concat([X_test.drop(columns=[column]), X_tfidf_test_df], axis=1)
    
#     # Сохраняем трансформер для дальнейшего использования
#     tfidf_vectorizers[column] = tfidf

# # Проверка на совпадение размеров после векторизации
# print(f"Размеры X_train после векторизации: {X_train.shape}")
# print(f"Размеры y_train после векторизации: {y_train.shape}")

# # Синхронизация индексов, если размеры не совпадают
# if len(X_train) != len(y_train):
#     X_train, y_train = X_train.align(y_train, axis=0)
#     print(f"Размеры после синхронизации: X_train={X_train.shape}, y_train={y_train.shape}")

In [None]:
# Теперь X_train и X_test готовы для обучения модели

In [None]:
# from lightgbm import LGBMClassifier
# from sklearn.model_selection import GridSearchCV
# from sklearn.metrics import classification_report

# # Определение гиперпараметров для подбора
# param_grid = {
#     'n_estimators': [100, 200],
#     'learning_rate': [0.01, 0.1],
#     'max_depth': [3, 5],
#     'subsample': [0.8, 1.0],
#     'min_child_samples': [20, 50]
# }

# # Настройка GridSearchCV
# lgb_clf = LGBMClassifier(random_state=42)
# grid_search = GridSearchCV(estimator=lgb_clf, param_grid=param_grid, cv=3, n_jobs=-1, scoring='accuracy')

# # Обучение модели с подбором гиперпараметров
# grid_search.fit(X_train, y_train)

# # Лучшие параметры
# best_params = grid_search.best_params_
# print(f"Best parameters: {best_params}")

# # Оценка модели с лучшими параметрами
# best_lgb_clf = grid_search.best_estimator_
# y_pred_lgb = best_lgb_clf.predict(X_test)

# # Оценка модели LightGBM
# class_labels = sorted(y.unique())
# lgb_report = classification_report(y_test, y_pred_lgb, target_names=[f"{i+1}" for i in class_labels])

# print("LightGBM Classification Report:")
# print(lgb_report)

In [None]:
# import pandas as pd
# from gensim.models import Word2Vec
# from sklearn.model_selection import train_test_split
# import numpy as np

In [11]:
def vectorize_text(model, text, size):
    """
    Функция для векторизации текста на основе модели Word2Vec.
    text: текст в виде списка токенов
    model: обученная модель Word2Vec
    size: размер выходного вектора
    """
    vector = np.zeros(size)
    count = 0
    for word in text:
        if word in model.wv:
            vector += model.wv[word]
            count += 1
    if count > 0:
        vector /= count
    return vector

In [None]:
# # Разбиение на тренировочную и тестовую выборки
# # X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# # Заполнение NaN значений
# X_train = X_train.fillna("")
# X_test = X_test.fillna("")

# # Токенизация текста
# X_train_tokens = X_train[['Cleaned_Description', 'Positions', 'Actions', 'Tools_and_equipment']].applymap(lambda x: x.split())
# X_test_tokens = X_test[['Cleaned_Description', 'Positions', 'Actions', 'Tools_and_equipment']].applymap(lambda x: x.split())

# # Обучение модели Word2Vec на токенизированных данных
# word2vec_model = Word2Vec(sentences=pd.concat([X_train_tokens[col] for col in X_train_tokens.columns]).values.flatten(),
#                           vector_size=100,  # Размер вектора
#                           window=5,
#                           min_count=1,
#                           workers=4)

# print(f"Размеры X_train до векторизации: {X_train.shape}")
# print(f"Размеры y_train до векторизации: {y_train.shape}")

# # Векторизация текста
# for column in ['Cleaned_Description', 'Positions', 'Actions', 'Tools_and_equipment']:
#     size = word2vec_model.vector_size
#     X_train_vectors = np.array([vectorize_text(word2vec_model, tokens, size) for tokens in X_train_tokens[column]])
#     X_test_vectors = np.array([vectorize_text(word2vec_model, tokens, size) for tokens in X_test_tokens[column]])
    
#     # Преобразуем в датафреймы
#     X_train_vectors_df = pd.DataFrame(X_train_vectors, columns=[f"{column}_w2v_{i}" for i in range(size)])
#     X_test_vectors_df = pd.DataFrame(X_test_vectors, columns=[f"{column}_w2v_{i}" for i in range(size)])
    
#     # Синхронизация индексов
#     X_train_vectors_df.index = X_train.index
#     X_test_vectors_df.index = X_test.index
    
#     # Объединение с основными данными
#     X_train = pd.concat([X_train.drop(columns=[column]), X_train_vectors_df], axis=1)
#     X_test = pd.concat([X_test.drop(columns=[column]), X_test_vectors_df], axis=1)

# # Проверка на совпадение размеров после векторизации
# print(f"Размеры X_train после векторизации: {X_train.shape}")
# print(f"Размеры y_train после векторизации: {y_train.shape}")

# # Синхронизация индексов, если размеры не совпадают
# if len(X_train) != len(y_train):
#     X_train, y_train = X_train.align(y_train, axis=0)
#     print(f"Размеры после синхронизации: X_train={X_train.shape}, y_train={y_train.shape}")

In [None]:
# # Теперь X_train и X_test готовы для обучения модели

# from lightgbm import LGBMClassifier
# from sklearn.model_selection import GridSearchCV
# from sklearn.metrics import classification_report

# # Определение гиперпараметров для подбора
# param_grid = {
#     'n_estimators': [100, 200],
#     'learning_rate': [0.01, 0.1],
#     'max_depth': [3, 5],
#     'subsample': [0.8, 1.0],
#     'min_child_samples': [20, 50]
# }

# # Настройка GridSearchCV
# lgb_clf = LGBMClassifier(random_state=42)
# grid_search = GridSearchCV(estimator=lgb_clf, param_grid=param_grid, cv=3, n_jobs=-1, scoring='accuracy')

# # Обучение модели с подбором гиперпараметров
# grid_search.fit(X_train, y_train)

# # Лучшие параметры
# best_params = grid_search.best_params_
# print(f"Best parameters: {best_params}")

# # Оценка модели с лучшими параметрами
# best_lgb_clf = grid_search.best_estimator()
# y_pred_lgb = best_lgb_clf.predict(X_test)

# # Оценка модели LightGBM
# class_labels = sorted(y.unique())
# lgb_report = classification_report(y_test, y_pred_lgb, target_names=[f"{i+1}" for i in class_labels])

# print("LightGBM Classification Report:")
# print(lgb_report)

In [None]:
# print(X_train.dtypes)
# print(X_test.dtypes)

# CatBoost Classification

In [8]:
df = pd.read_csv('df_pos_act_tools_manual_processing_2.csv', sep=';')

In [9]:
# Функция для приведения текста к нижнему регистру и удаления пунктуации
def preprocess_text(text):
    # Приведение к нижнему регистру
    text = text.lower()
    # Удаление пунктуации
    text = re.sub(r'[^\w\s]', '', text)
    return text

columns_to_preprocess = ['Positions', 'Actions', 'Tools_and_equipment']
for column in columns_to_preprocess:
    df[column] = df[column].apply(preprocess_text)

In [None]:
df.info()

In [12]:
import pandas as pd
from gensim.models import Word2Vec
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import classification_report
import numpy as np
from catboost import CatBoostClassifier

# Загрузка данных
df = pd.read_csv('df_pos_act_tools_manual_processing_2.csv', sep=';')

# Предобработка текста
columns_to_preprocess = ['Positions', 'Actions', 'Tools_and_equipment']
for column in columns_to_preprocess:
    df[column] = df[column].apply(preprocess_text)

# Определение категориальных признаков
categorical_features = ['Genre', 'Employee or Third Party', 'Critical Risk', 'Month']

# Выделяем признаки и метки
X = df.drop(columns=['Potential Accident Level Encoded'])
y = df['Potential Accident Level Encoded']

# Разбиение на тренировочную и тестовую выборки
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Заполнение NaN значений
X_train = X_train.fillna("")
X_test = X_test.fillna("")

# Токенизация текста
X_train_tokens = X_train[['Cleaned_Description', 'Positions', 'Actions', 'Tools_and_equipment']].applymap(lambda x: x.split())
X_test_tokens = X_test[['Cleaned_Description', 'Positions', 'Actions', 'Tools_and_equipment']].applymap(lambda x: x.split())

# Обучение модели Word2Vec на токенизированных данных
word2vec_model = Word2Vec(sentences=pd.concat([X_train_tokens[col] for col in X_train_tokens.columns]).values.flatten(),
                          vector_size=100,  # Размер вектора
                          window=5,
                          min_count=1,
                          workers=4)

print(f"Размеры X_train до векторизации: {X_train.shape}")
print(f"Размеры y_train до векторизации: {y_train.shape}")

# Векторизация текста
for column in ['Cleaned_Description', 'Positions', 'Actions', 'Tools_and_equipment']:
    size = word2vec_model.vector_size
    X_train_vectors = np.array([vectorize_text(word2vec_model, tokens, size) for tokens in X_train_tokens[column]])
    X_test_vectors = np.array([vectorize_text(word2vec_model, tokens, size) for tokens in X_test_tokens[column]])
    
    # Преобразуем в датафреймы
    X_train_vectors_df = pd.DataFrame(X_train_vectors, columns=[f"{column}_w2v_{i}" for i in range(size)])
    X_test_vectors_df = pd.DataFrame(X_test_vectors, columns=[f"{column}_w2v_{i}" for i in range(size)])
    
    # Синхронизация индексов
    X_train_vectors_df.index = X_train.index
    X_test_vectors_df.index = X_test.index
    
    # Объединение с основными данными
    X_train = pd.concat([X_train.drop(columns=[column]), X_train_vectors_df], axis=1)
    X_test = pd.concat([X_test.drop(columns=[column]), X_test_vectors_df], axis=1)

# Проверка на совпадение размеров после векторизации
print(f"Размеры X_train после векторизации: {X_train.shape}")
print(f"Размеры y_train после векторизации: {y_train.shape}")

# Синхронизация индексов, если размеры не совпадают
if len(X_train) != len(y_train):
    X_train, y_train = X_train.align(y_train, axis=0)
    print(f"Размеры после синхронизации: X_train={X_train.shape}, y_train={y_train.shape}")

# Теперь X_train и X_test готовы для обучения модели

# Определение гиперпараметров для подбора
# param_grid = {
#     'iterations': [100, 200],
#     'learning_rate': [0.01, 0.1],
#     'depth': [3, 5],
#     'min_data_in_leaf': [20, 50]
# }

# param_grid = {
#     'min_data_in_leaf': [10, 20, 50],
#     'bootstrap_type': ['Bernoulli', 'Poisson']
# }

param_grid = {
    'iterations': [300],
    'depth': [4],
    'learning_rate': [0.1],
}

# Настройка GridSearchCV для CatBoostClassifier
cat_clf = CatBoostClassifier(cat_features=categorical_features, random_state=42, verbose=0)
grid_search = GridSearchCV(estimator=cat_clf, param_grid=param_grid, cv=3, n_jobs=-1, scoring='accuracy')

# Обучение модели с подбором гиперпараметров
grid_search.fit(X_train, y_train)

# Лучшие параметры
best_params = grid_search.best_params_
print(f"Best parameters: {best_params}")

# Оценка модели с лучшими параметрами
best_cat_clf = grid_search.best_estimator_
y_pred_cat = best_cat_clf.predict(X_test)

# Оценка модели CatBoost
class_labels = sorted(y.unique())
cat_report = classification_report(y_test, y_pred_cat, target_names=[f"{i+1}" for i in class_labels])

print("CatBoost Classification Report:")
print(cat_report)

  X_train_tokens = X_train[['Cleaned_Description', 'Positions', 'Actions', 'Tools_and_equipment']].applymap(lambda x: x.split())
  X_test_tokens = X_test[['Cleaned_Description', 'Positions', 'Actions', 'Tools_and_equipment']].applymap(lambda x: x.split())


Размеры X_train до векторизации: (339, 8)
Размеры y_train до векторизации: (339,)
Размеры X_train после векторизации: (339, 404)
Размеры y_train после векторизации: (339,)
Best parameters: {'depth': 4, 'iterations': 300, 'learning_rate': 0.1}
CatBoost Classification Report:
              precision    recall  f1-score   support

           2       0.86      0.60      0.71        10
           3       0.36      0.47      0.41        19
           4       0.19      0.14      0.16        21
           5       0.36      0.45      0.40        29
           6       1.00      0.17      0.29         6

    accuracy                           0.38        85
   macro avg       0.55      0.37      0.39        85
weighted avg       0.42      0.38      0.37        85



In [6]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import OneHotEncoder
from gensim.models import Word2Vec
from catboost import CatBoostClassifier
from sklearn.metrics import classification_report

# 1. Загрузка данных
# df = pd.read_csv('df_pos_act_tools_manual_processing_2.csv', sep=';')

# 2. Преобразование категориальных признаков с помощью One-Hot Encoding
categorical_columns = ['Genre', 'Employee or Third Party', 'Critical Risk', 'Month']
encoder = OneHotEncoder(sparse_output=False, drop='first')
categorical_encoded = encoder.fit_transform(df[categorical_columns])

# Преобразование в DataFrame
categorical_encoded_df = pd.DataFrame(categorical_encoded, columns=encoder.get_feature_names_out(categorical_columns))

# 3. Преобразование текстовых признаков с помощью Word2Vec
text_columns = ['Positions', 'Actions', 'Tools_and_equipment', 'Cleaned_Description']

# Обучение модели Word2Vec на всех текстовых данных
word2vec_models = {}
text_vectors = []

for col in text_columns:
    text_data = df[col].apply(lambda x: x.split())
    model = Word2Vec(sentences=text_data, vector_size=100, window=5, min_count=1, workers=4, sg=0)
    word2vec_models[col] = model

    # Среднее по вектору для каждого текста
    vectors = text_data.apply(lambda words: np.mean([model.wv[word] for word in words if word in model.wv] or [np.zeros(100)], axis=0))
    text_vectors.append(np.array(vectors.tolist()))

# Преобразование в DataFrame
text_encoded_df = pd.DataFrame(np.hstack(text_vectors))

# 4. Объединение всех признаков
X = pd.concat([categorical_encoded_df, text_encoded_df], axis=1)
y = df['Potential Accident Level Encoded']

# 5. Разделение на тренировочную и тестовую выборки
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 6. Обучение модели CatBoostClassifier с подбором гиперпараметров
catboost_model = CatBoostClassifier(verbose=0, random_state=42)

# Грид поиска гиперпараметров
# param_grid = {
#     'iterations': [100, 200, 300],
#     'depth': [4, 6, 8],
#     'learning_rate': [0.01, 0.05, 0.1],
#     'l2_leaf_reg': [1, 3, 5]
# }

# Грид поиска гиперпараметров
param_grid = {
    'iterations': [300],
    'depth': [4],
    'learning_rate': [0.1],
    'l2_leaf_reg': [5]
}

grid_search = GridSearchCV(estimator=catboost_model, param_grid=param_grid, cv=3, scoring='accuracy', n_jobs=-1)

# Обучение модели
grid_search.fit(X_train, y_train)

# Лучшие параметры
best_params = grid_search.best_params_
print(f"Best parameters: {best_params}")

# Оценка модели с лучшими параметрами
best_cat_clf = grid_search.best_estimator_
y_pred_cat = best_cat_clf.predict(X_test)

# Оценка модели CatBoost
class_labels = sorted(y.unique())
cat_report = classification_report(y_test, y_pred_cat, target_names=[f"{i}" for i in class_labels])

print("CatBoost Classification Report:")
print(cat_report)

Best parameters: {'depth': 4, 'iterations': 300, 'l2_leaf_reg': 5, 'learning_rate': 0.1}
CatBoost Classification Report:
              precision    recall  f1-score   support

           1       0.67      1.00      0.80         4
           2       0.75      0.21      0.32        29
           3       0.30      0.36      0.33        22
           4       0.32      0.61      0.42        23
           5       0.00      0.00      0.00         7

    accuracy                           0.38        85
   macro avg       0.41      0.44      0.37        85
weighted avg       0.45      0.38      0.35        85



  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


<b>P.S. Тестовый запуск.</b>

In [None]:
data_employee = {
    # 'Name': ['Ivanov Ivan Ivanovich'],
    'Genre': ['Male'],
    'Employee or Third Party': ['Employee'],
    'Critical Risk': ['Power lock'], # 5 уровень
    'Positions': ['Assistant'], # 1,2 уровень
    'Actions': ['Identify hexagonal head.'], # 1 уровень
    'Tools_and_equipment': ['Hydraulic pump.'], # 3 уровень
    'Cleaned_Description': ['Make sure that the equipment is fully ready to install the 4 detachable kits. Make sure that the operator is ready to supply power to your equipment. Remove the lock and open the electrical panel designed for 220 V and 200 A. Check that all tools and protective equipment are in good condition and ready for use. Lift the thermomagnetic wrench with extreme care. Please note that when lifting the thermomagnetic key, phase contact with the ground on the panel may occur. In case of contact of the phase with the ground on the panel, a flash may occur that can reach the operator. Take all necessary safety measures to prevent injury and damage. Use appropriate personal protective equipment (PPE).']  # 5 уровень
}

df_employee = pd.DataFrame(data_employee)

# Применение One-Hot Encoding к категориальным столбцам
df_employee = pd.get_dummies(df_employee, columns=['Genre', 'Employee or Third Party', 'Critical Risk'])

# Применение TF-IDF к текстовым столбцам
tfidf_dfs = []
for column in ['Cleaned_Description', 'Positions', 'Actions', 'Tools_and_equipment']:
    X_tfidf = tfidf.transform(df_employee[column])
    tfidf_df = pd.DataFrame(X_tfidf.toarray(), columns=[f"{column}_{feature}" for feature in tfidf.get_feature_names_out()])
    tfidf_dfs.append(tfidf_df)

# Объединение TF-IDF столбцов и удаление исходных текстовых столбцов
df_employee = pd.concat([df_employee.drop(columns=['Cleaned_Description', 'Positions', 'Actions', 'Tools_and_equipment'])] + tfidf_dfs, axis=1)

# Преобразование всех данных в числовой формат
df_employee = convert_to_numeric(df_employee)

# Заполнение отсутствующих столбцов нулями и упорядочивание столбцов согласно обучающему набору данных
missing_columns = list(set(X.columns) - set(df_employee.columns))
missing_df = pd.DataFrame(0, index=df_employee.index, columns=missing_columns)
df_employee = pd.concat([df_employee, missing_df], axis=1)

# Упорядочивание столбцов
df_employee = df_employee[X.columns]

# Преобразование данных в тензоры
X_employee_tensor = torch.tensor(df_employee.values, dtype=torch.float32)

# Прогноз
with torch.no_grad():
    output = model(X_employee_tensor)
    _, predicted = torch.max(output, 1)
    predicted_class = predicted.item() + 1  # Сдвиг на +1, чтобы вернуть исходные классы

print(f"Predicted class: {predicted_class}")