# Задача 3. Catalog
**Задача от индустриального партнера «ТМК».**

Есть справочник различных позиций, используемых компанией ТМК. Каждая позиция в справочнике содержит всего два атрибута: "Название" и "Группа".

Задача: предсказывать атрибут "Группа" по атрибуту "Название".

Метрикой качества является `accuracy` — доля верных предсказаний.

**Формат ввода**

- train.txt — файл с обучающей выборкой: каждая строка представляет собой одну позицию и состоит из названия позиции и группы, разделенных символом табуляции.
- test.txt — файл с тестовой выборкой: файл состоит из 2346 строк, каждая строка полностью состоит из названия позиции, для которого нужно определить группу.

**Формат вывода**

Ответ требуется в следующем формате: файл из 2346 строк, i-ая строка должно представлять собой предсказанную группу для i-ой строки из файла test.txt.

Соревнование на kaggle: https://www.kaggle.com/c/catalog

## Создание данных

In [4]:
import re
import pandas as pd
import os

In [6]:
files = os.listdir()[2:]
files

['test.txt', 'train.txt']

In [7]:
# Создадим датафрейм тренировочных данных
train_text = [] # название продукта, до табуляции
target_text = [] # группа, после табуляции
with open('train.txt', "r", encoding='utf-8') as f:
    for line in f.readlines():
        train_text.append(line.split('\t')[0])
        target_text.append(line.split('\t')[1].split('\n')[0])

train = pd.DataFrame({'Название': train_text, 'Группа': target_text})
train.head()

Unnamed: 0,Название,Группа
0,Валок ф108 5ФВ ч.В-241178-14,Инструменты
1,Державка 30531402 Mapal,Резцы
2,"Кабель КПСВВнг-LS 1х2х0,75",Кабельная продукция
3,"Трубка электроизоляционная ТКР ф16,0мм",Изделия электроустан
4,"Лента конвейер 2,1-1000-ТК-200-2-5/2",ИзделияРезино-технич


In [8]:
# Создадим тестовый набор данных в виде датафрейма
test_text = []
with open('test.txt', "r", encoding='utf-8') as f:
    for line in f.readlines():
        test_text.append(line.split('\n')[0])

# test_text[0:5]
test = pd.DataFrame({'Название': test_text})
test.head()

Unnamed: 0,Название
0,Подшипник 3630 (22330)
1,Винт 24х110 ГОСТ11738-84(DIN 912)
2,Пускатель ПМ ГОСТО 12-025-150 220В
3,Образец станд Ш13 концентрат плавико
4,Насос A4VG180EP2DT2/32R-PZD02F691LH-S


## Исследование и обработка данных

In [9]:
print(f"Train shape: {train.shape}")
print(f"Test shape: {test.shape}")

Train shape: (23973, 2)
Test shape: (2346, 1)


In [10]:
train['Группа'].value_counts()

Запчасти                5757
Метизы                  1378
З/Ч АвтомобПромышл      1257
З/Ч по чертежам          985
Инструменты              827
                        ... 
ПродукцЦеллюлозБумаж      39
Пилы                      37
ЗаготовкаИнстр и з/ч      34
Цепи и звенья             34
Теплоизоляционные         13
Name: Группа, Length: 96, dtype: int64

In [11]:
train['Группа'].unique()

array(['Инструменты', 'Резцы', 'Кабельная продукция',
       'Изделия электроустан', 'ИзделияРезино-технич', 'Запчасти',
       'З/Ч АвтомобПромышл', 'Подшипники', 'Фрезы',
       'ЗЧ АвтоматПускКонтак', 'РеактивыХимич.', 'З/Ч по чертежам',
       'Метизы', 'МодулПлатыСистАвтом', 'Огнеупоры', 'Редукторы',
       'ИздИзПолимеровСтанд', 'Химпродукция', 'Инструмент слесарный',
       'СветотехнИсточнСвета', 'Спецогнеупоры', 'Инструм. мерительный',
       'ВыключатАвтоматич', 'Комплектующие электр', 'Арматура к трубам',
       'Хоз.товары', 'ИздИзПолимерПоЧертеж', 'Стропы',
       'Инструмент режущий', 'ПрокатСортовойОбНазн', 'Смазки',
       'Конденсаторы', 'Кабельно-проводников', 'Металлопрокат',
       'Расходные материалы', 'Мебель', 'З/Ч Пневмооборудов',
       'Материалы лаб.', 'Сплав твердый', 'Фильтры, фильтроэлем',
       'Сверла', 'МатерСтроительные', 'Стройматериалы',
       'ВычОргТехн и З/Ч', 'Инстр. электрический', 'З/Ч к НасосВентилят',
       'ПриборыСистАвтоматик', 'ИздДля

In [12]:
train[train['Группа'] == 'СветотехнИсточнСвета']

Unnamed: 0,Название,Группа
29,Лампа сигнальная зеленая AD-22DS/230V,СветотехнИсточнСвета
45,Лампа накал ЛОН 220в 100вт,СветотехнИсточнСвета
102,Лампа светодиодн коммут СКЛ-К-2-360,СветотехнИсточнСвета
173,Светильник светодиодный PWP-С2 1200 ДСП,СветотехнИсточнСвета
294,Прожектор ЖО 04-400-001,СветотехнИсточнСвета
...,...,...
23456,Лампа КИПМ 42-22-Б-2-36 белая,СветотехнИсточнСвета
23517,Лампа накал миниат СМН 10в 55ма спец,СветотехнИсточнСвета
23715,Лампа ртутная ДРЛ-1000 Е40,СветотехнИсточнСвета
23725,Лампа ртутная ДРЛ-400вт Е40,СветотехнИсточнСвета


In [13]:
train[train['Группа'] == 'Конденсаторы']

Unnamed: 0,Название,Группа
62,Конденсатор КВИ3 16кв 470пФ 20%,Конденсаторы
342,Конденсатор К50-35 160в 470мкф,Конденсаторы
676,Конденсатор К50-35 25в 470мкф 105С,Конденсаторы
716,"Конденсатор К50-35 50в 4,7мкф 105С",Конденсаторы
754,Конденсатор К50-35 6800мкф 35В,Конденсаторы
767,Конденсатор К50-35 16в 220мкф 105С,Конденсаторы
847,"Конденсатор К50-35 50в 6,8мкф",Конденсаторы
1756,Конденсатор К50-35 100в 47мкф 105С,Конденсаторы
3236,Конденсатор К50-35 16в 1000мкф,Конденсаторы
3339,"Конденсатор К73-17 1500в 0,1мкф",Конденсаторы


In [14]:
train[train['Группа'] == 'МодулПлатыСистАвтом']
# feature engenering

Unnamed: 0,Название,Группа
15,Модуль 6GK7343-1СX10-0XE0 Siemens,МодулПлатыСистАвтом
55,Модуль 6ES7138-4CA01-0AA0,МодулПлатыСистАвтом
220,Соединитель 6ES7972-0BB12-0XA0,МодулПлатыСистАвтом
376,Кабель соед двойной разъемы Lemo 0+CP50,МодулПлатыСистАвтом
404,Модуль вывода сигнала 6ES7322-5GH00-0AB0,МодулПлатыСистАвтом
...,...,...
23300,Разъем DB-9F,МодулПлатыСистАвтом
23363,Карта памяти 6ES7952-1AK00-0AA0,МодулПлатыСистАвтом
23559,Коммутатор NIS-3200-204PSG,МодулПлатыСистАвтом
23660,Индикатор MG3100/IP54/TROP RED TYPE R,МодулПлатыСистАвтом


In [15]:
%%time
from string import punctuation

def remove_punct(text):
    # удаление пунктуации в тексте
    table = {33: ' ', 34: ' ', 35: ' ', 36: ' ', 37: ' ', 38: ' ', 39: ' ', 40: ' ', 41: ' ', 42: ' ',
             43: ' ', 44: ' ', 45: ' ', 46: ' ', 47: ' ', 58: ' ', 59: ' ', 60: ' ', 61: ' ', 62: ' ',
             63: ' ', 64: ' ', 91: ' ', 92: ' ', 93: ' ', 94: ' ', 95: ' ', 96: ' ', 123: ' ', 124: ' ', 125: ' ', 126: ' '}
    return text.translate(table)

def txt_prep(df):
    # функция приводит весь текст к нижнему регистру
    # удаляет пунктуацию
    df['Название начальный вид'] = df['Название']
    df['Название'] = df['Название'].str.lower() # Hello - hello
    df['Название'] = df['Название'].map(lambda x: remove_punct(x)) # удаляем пунктуацию
    df['Название'] = df['Название'].str.replace(r"\d+", "", flags=re.UNICODE) # удаляем цифры
    df['Название'] = df['Название'].str.replace(r"\b\w{1,2}\b", "") # удаляет слова из 1 или 2 символов
    # df['Название'] = df['Название'].str.replace(r"[a-zA-Z]", "")

    return df

Wall time: 0 ns


In [16]:
re.findall(r'кг/м3', 'Карта памяти  кг/м3 6ES7952-1AK00-0AA0')

['кг/м3']

In [17]:
train

Unnamed: 0,Название,Группа
0,Валок ф108 5ФВ ч.В-241178-14,Инструменты
1,Державка 30531402 Mapal,Резцы
2,"Кабель КПСВВнг-LS 1х2х0,75",Кабельная продукция
3,"Трубка электроизоляционная ТКР ф16,0мм",Изделия электроустан
4,"Лента конвейер 2,1-1000-ТК-200-2-5/2",ИзделияРезино-технич
...,...,...
23968,"Фреза шпоночная ц/х 8,0",Фрезы
23969,Кирпич керам полнотел одинарный М200,МатерСтроительные
23970,"Клеймо тв спл 122""Ф"" ВК15",Инструменты
23971,Элемент питания Saft LS 14250/STD 1/2AA,Запчасти


In [18]:
%%time
def feature_generation(df):

    df['кг/м3'] = ''
    df['мм2'] = ''
    df['куллон'] = ''
    df['м2/см3'] = ''
    df['вт'] = ''
    df['в'] = ''
    df['кгс/см2'] = ''
    df['кг'] = ''
    df['Gb'] = ''
    df['ед'] = ''
    df['амп'] = ''
    df['л/мин'] = ''
    df['мм'] = ''

    for index, row in df.iterrows():
        if len(re.findall(r'кг/м3', row['Название'])) != 0:
            df.loc[index, 'кг/м3'] = 1
        else:
            df.loc[index, 'кг/м3'] = 0

        if len(re.findall(r'мм2', row['Название'])) != 0:
            df.loc[index, 'мм2'] = 1
        else:
            df.loc[index, 'мм2'] = 0

        if len(re.findall(r'(кл[0-9])', row['Название'])) != 0:
            df.loc[index, 'куллон'] = 1
        else:
            df.loc[index, 'куллон'] = 0

        if len(re.findall(r'м2/см3', row['Название'])) != 0:
            df.loc[index, 'м2/см3'] = 1
        else:
            df.loc[index, 'м2/см3'] = 0

        if len(re.findall(r'[\d]вт', row['Название'])) != 0:
            df.loc[index, 'вт'] = 1
        else:
            df.loc[index, 'вт'] = 0
        
        if len(re.findall(r'[\d]{1,}в', row['Название'])) != 0:
            df.loc[index, 'в'] = 1
        else:
            df.loc[index, 'в'] = 0
        
        if len(re.findall(r'[\d]мкф', row['Название'])) != 0:
            df.loc[index, 'мкф'] = 1
        else:
            df.loc[index, 'мкф'] = 0
        
        if len(re.findall(r'кгс/см2', row['Название'])) != 0:
            df.loc[index, 'кгс/см2'] = 1
        else:
            df.loc[index, 'кгс/см2'] = 0
        
        if len(re.findall(r'[\d]{1,}кг ', row['Название'])) != 0:
            df.loc[index, 'кг'] = 1
        else:
            df.loc[index, 'кг'] = 0
        
        if len(re.findall(r'[\d]{1,}Gb ', row['Название'])) != 0:
            df.loc[index, 'Gb'] = 1
        else:
            df.loc[index, 'Gb'] = 0
        
        if len(re.findall(r'[\d]{1,}ед', row['Название'])) != 0:
            df.loc[index, 'ед'] = 1
        else:
            df.loc[index, 'ед'] = 0

        if len(re.findall(r'[\d]{1,}амп ', row['Название'])) != 0:
            df.loc[index, 'амп'] = 1
        else:
            df.loc[index, 'амп'] = 0
        
        if len(re.findall(r'[\d]{1,}л/мин', row['Название'])) != 0:
            df.loc[index, 'л/мин'] = 1
        else:
            df.loc[index, 'л/мин'] = 0
        
        if len(re.findall(r'[\d]{1,}mm|[\d]{1,}мм', row['Название'])) != 0:
            df.loc[index, 'мм'] = 1
        else:
            df.loc[index, 'мм'] = 0

        if len(re.findall(r'[\d]{1,}л', row['Название'])) != 0:
            df.loc[index, 'л'] = 1
        else:
            df.loc[index, 'л'] = 0

    return df

train = feature_generation(train)
train.head()

Wall time: 3min 29s


Unnamed: 0,Название,Группа,кг/м3,мм2,куллон,м2/см3,вт,в,кгс/см2,кг,Gb,ед,амп,л/мин,мм,мкф,л
0,Валок ф108 5ФВ ч.В-241178-14,Инструменты,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0
1,Державка 30531402 Mapal,Резцы,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0
2,"Кабель КПСВВнг-LS 1х2х0,75",Кабельная продукция,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0
3,"Трубка электроизоляционная ТКР ф16,0мм",Изделия электроустан,0,0,0,0,0,0,0,0,0,0,0,0,1,0.0,0.0
4,"Лента конвейер 2,1-1000-ТК-200-2-5/2",ИзделияРезино-технич,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0


In [19]:
train = txt_prep(train)
train



Unnamed: 0,Название,Группа,кг/м3,мм2,куллон,м2/см3,вт,в,кгс/см2,кг,Gb,ед,амп,л/мин,мм,мкф,л,Название начальный вид
0,валок,Инструменты,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,Валок ф108 5ФВ ч.В-241178-14
1,державка mapal,Резцы,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,Державка 30531402 Mapal
2,кабель кпсввнг,Кабельная продукция,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,"Кабель КПСВВнг-LS 1х2х0,75"
3,трубка электроизоляционная ткр,Изделия электроустан,0,0,0,0,0,0,0,0,0,0,0,0,1,0.0,0.0,"Трубка электроизоляционная ТКР ф16,0мм"
4,лента конвейер,ИзделияРезино-технич,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,"Лента конвейер 2,1-1000-ТК-200-2-5/2"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
23968,фреза шпоночная,Фрезы,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,"Фреза шпоночная ц/х 8,0"
23969,кирпич керам полнотел одинарный,МатерСтроительные,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,Кирпич керам полнотел одинарный М200
23970,клеймо спл,Инструменты,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,"Клеймо тв спл 122""Ф"" ВК15"
23971,элемент питания saft std,Запчасти,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,Элемент питания Saft LS 14250/STD 1/2AA


In [20]:
train

Unnamed: 0,Название,Группа,кг/м3,мм2,куллон,м2/см3,вт,в,кгс/см2,кг,Gb,ед,амп,л/мин,мм,мкф,л,Название начальный вид
0,валок,Инструменты,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,Валок ф108 5ФВ ч.В-241178-14
1,державка mapal,Резцы,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,Державка 30531402 Mapal
2,кабель кпсввнг,Кабельная продукция,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,"Кабель КПСВВнг-LS 1х2х0,75"
3,трубка электроизоляционная ткр,Изделия электроустан,0,0,0,0,0,0,0,0,0,0,0,0,1,0.0,0.0,"Трубка электроизоляционная ТКР ф16,0мм"
4,лента конвейер,ИзделияРезино-технич,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,"Лента конвейер 2,1-1000-ТК-200-2-5/2"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
23968,фреза шпоночная,Фрезы,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,"Фреза шпоночная ц/х 8,0"
23969,кирпич керам полнотел одинарный,МатерСтроительные,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,Кирпич керам полнотел одинарный М200
23970,клеймо спл,Инструменты,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,"Клеймо тв спл 122""Ф"" ВК15"
23971,элемент питания saft std,Запчасти,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,Элемент питания Saft LS 14250/STD 1/2AA


### Посмотрим на данные по регуляркам и единицам измерения

In [21]:
# %%time
tmp = []
tmp2 = []
for index, row in train.iterrows():
    if len(re.findall(r'кг/м3', row['Название'])) != 0:
        tmp.append(row['Название'])
        tmp2.append(row['Группа'])

tmp3 = pd.DataFrame({'Название': tmp, 'Группа': tmp2})
print(tmp3['Группа'].value_counts())
tmp3

Series([], Name: Группа, dtype: int64)


Unnamed: 0,Название,Группа


In [22]:
# %%time
tmp = []
tmp2 = []
for index, row in train.iterrows():
    if len(re.findall(r'[ч][\.][0-9А-Я]{1,}[-][0-9А-Я]{1,}', row['Название'])) != 0:
        tmp.append(row['Название'])
        tmp2.append(row['Группа'])

tmp3 = pd.DataFrame({'Название': tmp, 'Группа': tmp2})
print(tmp3['Группа'].value_counts())
tmp3

Series([], Name: Группа, dtype: int64)


Unnamed: 0,Название,Группа


In [23]:
# %%time
tmp = []
tmp2 = []
for index, row in train.iterrows():
    if len(re.findall(r'мм2', row['Название'])) != 0:
        # print(row)
        tmp.append(row['Название'])
        tmp2.append(row['Группа'])

tmp3 = pd.DataFrame({'Название': tmp, 'Группа': tmp2})
print(tmp3['Группа'].value_counts())
tmp3

Series([], Name: Группа, dtype: int64)


Unnamed: 0,Название,Группа


In [24]:
# %%time
tmp = []
tmp2 = []
for index, row in train.iterrows():
    if len(re.findall(r'(кл[0-9])', row['Название'])) != 0:
        # print(row)
        tmp.append(row['Название'])
        tmp2.append(row['Группа'])

tmp3 = pd.DataFrame({'Название': tmp, 'Группа': tmp2})
print(tmp3['Группа'].value_counts())
tmp3

Series([], Name: Группа, dtype: int64)


Unnamed: 0,Название,Группа


In [25]:
# %%time
tmp = []
tmp2 = []
for index, row in train.iterrows():
    if len(re.findall(r'м2/см3', row['Название'])) != 0:
        # print(row)
        tmp.append(row['Название'])
        tmp2.append(row['Группа'])

tmp3 = pd.DataFrame({'Название': tmp, 'Группа': tmp2})
print(tmp3['Группа'].value_counts())
tmp3

Series([], Name: Группа, dtype: int64)


Unnamed: 0,Название,Группа


In [26]:
# %%time
tmp = []
tmp2 = []
for index, row in train.iterrows():
    if len(re.findall(r'[\d]вт', row['Название'])) != 0:
        # print(row)
        tmp.append(row['Название'])
        tmp2.append(row['Группа'])

tmp3 = pd.DataFrame({'Название': tmp, 'Группа': tmp2})
print(tmp3['Группа'].value_counts())
tmp3

Series([], Name: Группа, dtype: int64)


Unnamed: 0,Название,Группа


In [27]:
# %%time
tmp = []
tmp2 = []
for index, row in train.iterrows():
    if len(re.findall(r'[\d]{1,}в', row['Название'])) != 0:
        # print(row)
        tmp.append(row['Название'])
        tmp2.append(row['Группа'])

tmp3 = pd.DataFrame({'Название': tmp, 'Группа': tmp2})
print(tmp3['Группа'].value_counts())
tmp3

Series([], Name: Группа, dtype: int64)


Unnamed: 0,Название,Группа


In [28]:
# %%time
tmp = []
tmp2 = []
for index, row in train.iterrows():
    if len(re.findall(r'[\d]мкф', row['Название'])) != 0:
        # print(row)
        tmp.append(row['Название'])
        tmp2.append(row['Группа'])

tmp3 = pd.DataFrame({'Название': tmp, 'Группа': tmp2})
print(tmp3['Группа'].value_counts())
tmp3

Series([], Name: Группа, dtype: int64)


Unnamed: 0,Название,Группа


In [29]:
# %%time
tmp = []
tmp2 = []
for index, row in train.iterrows():
    if len(re.findall(r'кгс/см2', row['Название'])) != 0:
        # print(row)
        tmp.append(row['Название'])
        tmp2.append(row['Группа'])

tmp3 = pd.DataFrame({'Название': tmp, 'Группа': tmp2})
print(tmp3['Группа'].value_counts())
tmp3

Series([], Name: Группа, dtype: int64)


Unnamed: 0,Название,Группа


In [30]:
# %%time
tmp = []
tmp2 = []
for index, row in train.iterrows():
    if len(re.findall(r'[\d]{1,}кг ', row['Название'])) != 0:
        # print(row)
        tmp.append(row['Название'])
        tmp2.append(row['Группа'])

tmp3 = pd.DataFrame({'Название': tmp, 'Группа': tmp2})
print(tmp3['Группа'].value_counts())
tmp3

Series([], Name: Группа, dtype: int64)


Unnamed: 0,Название,Группа


In [31]:
# %%time
tmp = []
tmp2 = []
for index, row in train.iterrows():
    if len(re.findall(r'[\d]{1,}Gb ', row['Название'])) != 0:
        # print(row)
        tmp.append(row['Название'])
        tmp2.append(row['Группа'])

tmp3 = pd.DataFrame({'Название': tmp, 'Группа': tmp2})
print(tmp3['Группа'].value_counts())
tmp3

Series([], Name: Группа, dtype: int64)


Unnamed: 0,Название,Группа


In [32]:
# %%time
tmp = []
tmp2 = []
for index, row in train.iterrows():
    if len(re.findall(r'[\d]{1,}ед', row['Название'])) != 0:
        # print(row)
        tmp.append(row['Название'])
        tmp2.append(row['Группа'])

tmp3 = pd.DataFrame({'Название': tmp, 'Группа': tmp2})
print(tmp3['Группа'].value_counts())
tmp3

Series([], Name: Группа, dtype: int64)


Unnamed: 0,Название,Группа


In [33]:
# %%time
tmp = []
tmp2 = []
for index, row in train.iterrows():
    if len(re.findall(r'[\d]{1,}амп', row['Название'])) != 0:
        # print(row)
        tmp.append(row['Название'])
        tmp2.append(row['Группа'])

tmp3 = pd.DataFrame({'Название': tmp, 'Группа': tmp2})
print(tmp3['Группа'].value_counts())
tmp3

Series([], Name: Группа, dtype: int64)


Unnamed: 0,Название,Группа


In [34]:
# %%time
tmp = []
tmp2 = []
for index, row in train.iterrows():
    if len(re.findall(r'[\d]{1,}л', row['Название'])) != 0:
        # print(row)
        tmp.append(row['Название'])
        tmp2.append(row['Группа'])

tmp3 = pd.DataFrame({'Название': tmp, 'Группа': tmp2})
print(tmp3['Группа'].value_counts())
tmp3

Series([], Name: Группа, dtype: int64)


Unnamed: 0,Название,Группа


In [35]:
# %%time
tmp = []
tmp2 = []
for index, row in train.iterrows():
    if len(re.findall(r'[\d]{1,}л/мин', row['Название'])) != 0:
        # print(row)
        tmp.append(row['Название'])
        tmp2.append(row['Группа'])

tmp3 = pd.DataFrame({'Название': tmp, 'Группа': tmp2})
print(tmp3['Группа'].value_counts())
tmp3

Series([], Name: Группа, dtype: int64)


Unnamed: 0,Название,Группа


In [36]:
# %%time
tmp = []
tmp2 = []
for index, row in train.iterrows():
    if len(re.findall(r'[\d]{1,}mm|[\d]{1,}мм', row['Название'])) != 0:
        # print(row)
        tmp.append(row['Название'])
        tmp2.append(row['Группа'])

tmp3 = pd.DataFrame({'Название': tmp, 'Группа': tmp2})
print(tmp3['Группа'].value_counts())
tmp3

Series([], Name: Группа, dtype: int64)


Unnamed: 0,Название,Группа


## Построение модели

In [37]:
%%time
from sklearn.feature_extraction.text import TfidfVectorizer

# радуга (2-gramm символьный): ра  ад  ду  уг  га
# мама мыла раму рано утром (2-gramm словные):  (мамы мыла)    (мыла раму)   (раму утром)   (рано утром)
ngram_range = (1,3)
# словные
# униграммы, биграммы

min_df = 10
max_df = 1.
max_features = 1000

tfidf = TfidfVectorizer(encoding='utf-8',
                        ngram_range=ngram_range,
                        stop_words=None,
                        lowercase=False,
                        max_df=max_df,
                        min_df=min_df,
                        max_features=max_features,
                        norm='l2',
                        sublinear_tf=True)
                        
features_train = tfidf.fit_transform(train['Название']).toarray() # fit()-обучение transfrom()-применение
labels_train = train['Группа']
print(features_train.shape)

(23973, 1000)
Wall time: 1.61 s


In [38]:
tf_idf_df = pd.DataFrame(features_train, columns = tfidf.get_feature_names())
tf_idf_df.head()



Unnamed: 0,ancarbon,aol,aos,aos aos,art,asc,bcsg,cgnk,classic,din,...,элемент,элемент питания,элемент фильтр,эмаль,энкодер,эскиз,эскиз тпц,эспц,ямз,ящик
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [39]:
train_full = pd.concat([train, tf_idf_df], axis=1)
train_full = train_full.drop(columns=['Название начальный вид'])
train_full

Unnamed: 0,Название,Группа,кг/м3,мм2,куллон,м2/см3,вт,в,кгс/см2,кг,...,элемент,элемент питания,элемент фильтр,эмаль,энкодер,эскиз,эскиз тпц,эспц,ямз,ящик
0,валок,Инструменты,0,0,0,0,0,0,0,0,...,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,державка mapal,Резцы,0,0,0,0,0,0,0,0,...,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,кабель кпсввнг,Кабельная продукция,0,0,0,0,0,0,0,0,...,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,трубка электроизоляционная ткр,Изделия электроустан,0,0,0,0,0,0,0,0,...,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,лента конвейер,ИзделияРезино-технич,0,0,0,0,0,0,0,0,...,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
23968,фреза шпоночная,Фрезы,0,0,0,0,0,0,0,0,...,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
23969,кирпич керам полнотел одинарный,МатерСтроительные,0,0,0,0,0,0,0,0,...,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
23970,клеймо спл,Инструменты,0,0,0,0,0,0,0,0,...,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
23971,элемент питания saft std,Запчасти,0,0,0,0,0,0,0,0,...,0.553659,0.643542,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [40]:
train_full

Unnamed: 0,Название,Группа,кг/м3,мм2,куллон,м2/см3,вт,в,кгс/см2,кг,...,элемент,элемент питания,элемент фильтр,эмаль,энкодер,эскиз,эскиз тпц,эспц,ямз,ящик
0,валок,Инструменты,0,0,0,0,0,0,0,0,...,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,державка mapal,Резцы,0,0,0,0,0,0,0,0,...,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,кабель кпсввнг,Кабельная продукция,0,0,0,0,0,0,0,0,...,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,трубка электроизоляционная ткр,Изделия электроустан,0,0,0,0,0,0,0,0,...,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,лента конвейер,ИзделияРезино-технич,0,0,0,0,0,0,0,0,...,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
23968,фреза шпоночная,Фрезы,0,0,0,0,0,0,0,0,...,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
23969,кирпич керам полнотел одинарный,МатерСтроительные,0,0,0,0,0,0,0,0,...,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
23970,клеймо спл,Инструменты,0,0,0,0,0,0,0,0,...,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
23971,элемент питания saft std,Запчасти,0,0,0,0,0,0,0,0,...,0.553659,0.643542,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [41]:
# Закодируем целевую переменную
from sklearn.preprocessing import LabelEncoder

labelencoder = LabelEncoder()
train_full['Группа'] = labelencoder.fit_transform(train_full['Группа'])

mapping = dict(zip(labelencoder.classes_, range(len(labelencoder.classes_))))

train_full.head()

Unnamed: 0,Название,Группа,кг/м3,мм2,куллон,м2/см3,вт,в,кгс/см2,кг,...,элемент,элемент питания,элемент фильтр,эмаль,энкодер,эскиз,эскиз тпц,эспц,ямз,ящик
0,валок,33,0,0,0,0,0,0,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,державка mapal,73,0,0,0,0,0,0,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,кабель кпсввнг,34,0,0,0,0,0,0,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,трубка электроизоляционная ткр,24,0,0,0,0,0,0,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,лента конвейер,25,0,0,0,0,0,0,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [42]:
 from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(train_full.drop(columns=['Название','Группа']), 
                                                    train_full['Группа'], 
                                                    test_size=0.3, 
                                                    random_state=8)

print(f'All train shape: {train_full.shape}')
print(f'X train shape: {X_train.shape}')
print(f'X test shape: {X_test.shape}')

All train shape: (23973, 1017)
X train shape: (16781, 1015)
X test shape: (7192, 1015)


In [43]:
%%time
import numpy as np
from sklearn import svm
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

svc = svm.SVC()

svc.fit(X_train, y_train)
accuracy_score(y_test, svc.predict(X_test))

Wall time: 4min 41s


0.6779755283648499

In [44]:
accuracy_score(y_test, svc.predict(X_test))

0.6779755283648499

In [45]:
test

Unnamed: 0,Название
0,Подшипник 3630 (22330)
1,Винт 24х110 ГОСТ11738-84(DIN 912)
2,Пускатель ПМ ГОСТО 12-025-150 220В
3,Образец станд Ш13 концентрат плавико
4,Насос A4VG180EP2DT2/32R-PZD02F691LH-S
...,...
2341,Втулка ч.0301435-30.148
2342,Фильтроэлемент 2600R005BN4HC
2343,Пила цепная электр руч UC 4010А Makita
2344,Картридж Canon PFI-107C голубой 130 мл


In [46]:
tmp = feature_generation(test)
tmp

Unnamed: 0,Название,кг/м3,мм2,куллон,м2/см3,вт,в,кгс/см2,кг,Gb,ед,амп,л/мин,мм,мкф,л
0,Подшипник 3630 (22330),0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0
1,Винт 24х110 ГОСТ11738-84(DIN 912),0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0
2,Пускатель ПМ ГОСТО 12-025-150 220В,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0
3,Образец станд Ш13 концентрат плавико,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0
4,Насос A4VG180EP2DT2/32R-PZD02F691LH-S,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2341,Втулка ч.0301435-30.148,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0
2342,Фильтроэлемент 2600R005BN4HC,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0
2343,Пила цепная электр руч UC 4010А Makita,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0
2344,Картридж Canon PFI-107C голубой 130 мл,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0


In [47]:
features_test = tfidf.transform(tmp['Название']).toarray()
print(features_test.shape)

tf_idf_df_test = pd.DataFrame(features_test, columns = tfidf.get_feature_names())
tf_idf_df_test.head()

(2346, 1000)




Unnamed: 0,ancarbon,aol,aos,aos aos,art,asc,bcsg,cgnk,classic,din,...,элемент,элемент питания,элемент фильтр,эмаль,энкодер,эскиз,эскиз тпц,эспц,ямз,ящик
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [57]:
test_full = pd.concat([test, tf_idf_df_test], axis=1)
test_full = test_full.drop(columns=['Название'])
# test_full.head()

In [58]:
answer = svc.predict(test_full)

answer_df = pd.DataFrame(answer, columns=['Группа'])
answer_df

Unnamed: 0,Группа
0,19
1,19
2,19
3,89
4,19
...,...
2341,19
2342,19
2343,28
2344,19


In [59]:
inverse_dict = dict([val,key] for key,val in mapping.items())
inverse_dict

{0: 'АккумБатареи и Элем',
 1: 'Арматура к трубам',
 2: 'АрматураТрубопровод',
 3: 'Блоки систем автомат',
 4: 'ВыключатАвтоматич',
 5: 'ВычОргТехн и З/Ч',
 6: 'ДатчСистемАвтоматики',
 7: 'ДиодТранзисторТирист',
 8: 'З/Ч АвтомобПромышл',
 9: 'З/Ч Гидрооборудован',
 10: 'З/Ч ГрузПодъемОборуд',
 11: 'З/Ч Пневмооборудов',
 12: 'З/Ч ТракСтроитТехн',
 13: 'З/Ч к НасосВентилят',
 14: 'З/Ч к компрессорам',
 15: 'З/Ч кМеталлообОборуд',
 16: 'З/Ч по чертежам',
 17: 'ЗЧ АвтоматПускКонтак',
 18: 'ЗаготовкаИнстр и з/ч',
 19: 'Запчасти',
 20: 'ИздДляТрубПредохран',
 21: 'ИздИзПолимерПоЧертеж',
 22: 'ИздИзПолимеровСтанд',
 23: 'Издел.Асбесто-технич',
 24: 'Изделия электроустан',
 25: 'ИзделияРезино-технич',
 26: 'Измерительные прибор',
 27: 'ИнстОснастТехнПоЧерт',
 28: 'Инстр. электрический',
 29: 'Инструм. абразивный',
 30: 'Инструм. мерительный',
 31: 'Инструмент режущий',
 32: 'Инструмент слесарный',
 33: 'Инструменты',
 34: 'Кабельная продукция',
 35: 'Кабельно-проводников',
 36: 'Канцтовары',
 

In [60]:
answer_df.reset_index(inplace=True)

answer_df['Группа'] = answer_df['Группа'].map(inverse_dict).fillna(answer_df['Группа'])
answer_df

Unnamed: 0,index,Группа
0,0,Запчасти
1,1,Запчасти
2,2,Запчасти
3,3,ХимПродОбщехимНазн
4,4,Запчасти
...,...,...
2341,2341,Запчасти
2342,2342,Запчасти
2343,2343,Инстр. электрический
2344,2344,Запчасти


In [61]:
answer_df.to_csv("test_submit.csv", index=False)