<h1>Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Подготовка-данных" data-toc-modified-id="Подготовка-данных-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Data preparation</a></span></li></ul></div>

Гальцев Борис Сергеевич, @Borisgaltsev

Streaming service "МиФаСоль".  
The service is expanding its work with new artists and musicians, which creates the need to classify new music tracks correctly in order to improve the recommendation system.  

Project goal:  
- Develop a model that classifies music tracks by genre.  

Source data:  
- train.csv - information on ~20,000 music tracks used as training data.  
- test.csv - information on ~5,000 music tracks used as test data. The task is to predict the 'music_genre' value for each track in this dataset.  
- sample_submit.csv - a sample predictions file in the required format:  
  - instance_id - track identifier in the test set;  
  - music_genre - target feature; for each track, predict the categorical value corresponding to its genre.  

Dataset fields:  
- instance_id - unique track identifier;  
- track_name - track title;  
- acousticness - acousticness;  
- danceability - danceability;  
- duration_ms - duration in milliseconds;  
- energy - energy;  
- instrumentalness - instrumentalness;  
- key - musical key;  
- liveness - likelihood that the track was recorded live;  
- loudness - loudness;  
- mode - modality (major or minor);  
- speechiness - presence of spoken words;  
- tempo - tempo;  
- obtained_date - date the track was loaded into the service;  
- valence - musical positiveness of the track;  
- music_genre - music genre.  

Work plan:
1. Load and inspect the data.  
2. Preprocess the data.  
3. Run a full exploratory analysis.  
4. Engineer new synthetic features.  
5. Check for multicollinearity.  
6. Select the final set of training features.  
7. Select and train models.  
8. Evaluate the prediction quality of the best model.  
9. Analyze its feature importances.  
10. Prepare the research report.  

Project outcomes:  
1. A developed and trained model.  
2. Hands-on experience with the Kaggle platform.  

## Data preparation <a id="Подготовка-данных"></a>

In [1]:
!pip install pandas-profiling
!pip install -U imbalanced-learn



In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import time
import pandas_profiling

from tqdm import tqdm
from sklearn.utils import shuffle
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.metrics import accuracy_score, f1_score, fbeta_score, recall_score, precision_score, roc_auc_score, roc_curve, confusion_matrix, plot_confusion_matrix

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler, OneHotEncoder, MinMaxScaler, LabelEncoder
from sklearn.tree import DecisionTreeClassifier
from sklearn.dummy import DummyClassifier
from imblearn.over_sampling import SMOTE


import warnings
warnings.filterwarnings("ignore")



In [3]:
try:
    df_train = pd.read_csv('kaggle_music_genre_train.csv')
except FileNotFoundError:
    df_train = pd.read_csv('/datasets/kaggle_music_genre_train.csv')

In [4]:
try:
    df_test = pd.read_csv('kaggle_music_genre_test.csv')
except FileNotFoundError:
    df_test = pd.read_csv('/datasets/kaggle_music_genre_test.csv')

<div class="alert alert-block alert-warning">
<b>Dataset:</b> Train
</div>

In [5]:
# print the first 5 rows of the kaggle_music_genre_train dataframe
df_train.head()

Unnamed: 0,instance_id,track_name,acousticness,danceability,duration_ms,energy,instrumentalness,key,liveness,loudness,mode,speechiness,tempo,obtained_date,valence,music_genre
0,25143.0,Highwayman,0.48,0.67,182653.0,0.351,0.0176,D,0.115,-16.842,Major,0.0463,101.384,4-Apr,0.45,Country
1,26091.0,Toes Across The Floor,0.243,0.452,187133.0,0.67,5.1e-05,A,0.108,-8.392,Minor,0.0352,113.071,4-Apr,0.539,Rock
2,87888.0,First Person on Earth,0.228,0.454,173448.0,0.804,0.0,E,0.181,-5.225,Minor,0.371,80.98,4-Apr,0.344,Alternative
3,77021.0,No Te Veo - Digital Single,0.0558,0.847,255987.0,0.873,3e-06,G#,0.325,-4.805,Minor,0.0804,116.007,4-Apr,0.966,Hip-Hop
4,20852.0,Chasing Shadows,0.227,0.742,195333.0,0.575,2e-06,C,0.176,-5.55,Major,0.0487,76.494,4-Apr,0.583,Alternative


In [6]:
# print general information about the kaggle_music_genre_train dataframe
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20394 entries, 0 to 20393
Data columns (total 16 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   instance_id       20394 non-null  float64
 1   track_name        20394 non-null  object 
 2   acousticness      20394 non-null  float64
 3   danceability      20394 non-null  float64
 4   duration_ms       20394 non-null  float64
 5   energy            20394 non-null  float64
 6   instrumentalness  20394 non-null  float64
 7   key               19659 non-null  object 
 8   liveness          20394 non-null  float64
 9   loudness          20394 non-null  float64
 10  mode              19888 non-null  object 
 11  speechiness       20394 non-null  float64
 12  tempo             19952 non-null  float64
 13  obtained_date     20394 non-null  object 
 14  valence           20394 non-null  float64
 15  music_genre       20394 non-null  object 
dtypes: float64(11), object(5)
memory usage: 

In [7]:
# check the training dataset for exact duplicates
df_train.duplicated().sum()

0

In [8]:
# count missing values in each field of the training dataset
df_train.isna().sum()

instance_id           0
track_name            0
acousticness          0
danceability          0
duration_ms           0
energy                0
instrumentalness      0
key                 735
liveness              0
loudness              0
mode                506
speechiness           0
tempo               442
obtained_date         0
valence               0
music_genre           0
dtype: int64

<div class="alert alert-block alert-warning">
<b>Student's comment:</b> the key, mode, and tempo fields have missing values (735, 506, and 442 respectively); we examine them below.
</div>

<div class="alert alert-block alert-warning">
<b>Dataset:</b> Test
</div>

In [9]:
# print the first 5 rows of the kaggle_music_genre_test dataframe
df_test.head()

Unnamed: 0,instance_id,track_name,acousticness,danceability,duration_ms,energy,instrumentalness,key,liveness,loudness,mode,speechiness,tempo,obtained_date,valence
0,48564,Low Class Conspiracy,0.301,0.757,146213.0,0.679,0.0,A#,0.303,-7.136,Minor,0.356,90.361,4-Apr,0.895
1,72394,The Hunter,0.538,0.256,240360.0,0.523,0.00832,G#,0.0849,-5.175,Major,0.0294,78.385,4-Apr,0.318
2,88081,Hate Me Now,0.00583,0.678,284000.0,0.77,0.0,A,0.109,-4.399,Minor,0.222,90.0,4-Apr,0.412
3,78331,Somebody Ain't You,0.0203,0.592,177354.0,0.749,0.0,B,0.122,-4.604,Major,0.0483,160.046,4-Apr,0.614
4,72636,Sour Mango,0.000335,0.421,-1.0,0.447,0.0148,D,0.0374,-8.833,Major,0.202,73.83,4-Apr,0.121


In [10]:
# print general information about the kaggle_music_genre_test dataframe
df_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5099 entries, 0 to 5098
Data columns (total 15 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   instance_id       5099 non-null   int64  
 1   track_name        5099 non-null   object 
 2   acousticness      5099 non-null   float64
 3   danceability      5099 non-null   float64
 4   duration_ms       5099 non-null   float64
 5   energy            5099 non-null   float64
 6   instrumentalness  5099 non-null   float64
 7   key               4941 non-null   object 
 8   liveness          5099 non-null   float64
 9   loudness          5099 non-null   float64
 10  mode              4950 non-null   object 
 11  speechiness       5099 non-null   float64
 12  tempo             4978 non-null   float64
 13  obtained_date     5099 non-null   object 
 14  valence           5099 non-null   float64
dtypes: float64(10), int64(1), object(4)
memory usage: 597.7+ KB


In [11]:
# check the test dataset for exact duplicates
df_test.duplicated().sum()

0

In [12]:
# count missing values in each field of the test dataset
df_test.isna().sum()

instance_id           0
track_name            0
acousticness          0
danceability          0
duration_ms           0
energy                0
instrumentalness      0
key                 158
liveness              0
loudness              0
mode                149
speechiness           0
tempo               121
obtained_date         0
valence               0
dtype: int64

<div class="alert alert-block alert-warning">
<b>Student's comment:</b> Save the test-set identifiers in the variable instance_id_test (used later when producing the output).
</div>

In [13]:
instance_id_test = df_test['instance_id']

<div class="alert alert-block alert-warning">
<b>Student's comment:</b> the key, mode, and tempo fields have missing values (158, 149, and 121 respectively); we examine them below.
</div>

As we can see, both datasets have missing values in the key, mode, and tempo fields. The instance_id field has different types in the two datasets (float64 in train, int64 in test), so it should be cast to an integer type in train. The duration_ms field should also be cast to an integer type, since time units smaller than a millisecond are of no interest in this task.

In [14]:
df_train['instance_id'] = df_train['instance_id'].astype(int)

In [15]:
df_train['duration_ms'] = df_train['duration_ms'].astype(int)

In [16]:
df_test['duration_ms'] = df_test['duration_ms'].astype(int)
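The casts above use `astype(int)`, which maps to the platform's default integer type; the later `info()` output shows int32 for instance_id. A minimal sketch of an explicit int64 cast that first checks that no value would lose a fractional part. The `toy` frame and the helper `to_int64` are illustrative assumptions, not part of the original notebook:

```python
import pandas as pd

# Toy frame standing in for df_train; the values are made up for illustration.
toy = pd.DataFrame({
    "instance_id": [25143.0, 26091.0],
    "duration_ms": [182653.0, -1.0],
})

def to_int64(series: pd.Series) -> pd.Series:
    """Cast a float column to int64 only if the cast is lossless (hypothetical helper)."""
    if series.isna().any():
        raise ValueError(f"{series.name}: contains missing values")
    if not (series == series.round()).all():
        raise ValueError(f"{series.name}: contains fractional values")
    return series.astype("int64")

for column in ["instance_id", "duration_ms"]:
    toy[column] = to_int64(toy[column])

print(toy.dtypes)
```

Naming the width explicitly keeps the train and test dtypes identical regardless of the operating system.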

In [17]:
df_train.head()

Unnamed: 0,instance_id,track_name,acousticness,danceability,duration_ms,energy,instrumentalness,key,liveness,loudness,mode,speechiness,tempo,obtained_date,valence,music_genre
0,25143,Highwayman,0.48,0.67,182653,0.351,0.0176,D,0.115,-16.842,Major,0.0463,101.384,4-Apr,0.45,Country
1,26091,Toes Across The Floor,0.243,0.452,187133,0.67,5.1e-05,A,0.108,-8.392,Minor,0.0352,113.071,4-Apr,0.539,Rock
2,87888,First Person on Earth,0.228,0.454,173448,0.804,0.0,E,0.181,-5.225,Minor,0.371,80.98,4-Apr,0.344,Alternative
3,77021,No Te Veo - Digital Single,0.0558,0.847,255987,0.873,3e-06,G#,0.325,-4.805,Minor,0.0804,116.007,4-Apr,0.966,Hip-Hop
4,20852,Chasing Shadows,0.227,0.742,195333,0.575,2e-06,C,0.176,-5.55,Major,0.0487,76.494,4-Apr,0.583,Alternative


In [18]:
df_test.head()

Unnamed: 0,instance_id,track_name,acousticness,danceability,duration_ms,energy,instrumentalness,key,liveness,loudness,mode,speechiness,tempo,obtained_date,valence
0,48564,Low Class Conspiracy,0.301,0.757,146213,0.679,0.0,A#,0.303,-7.136,Minor,0.356,90.361,4-Apr,0.895
1,72394,The Hunter,0.538,0.256,240360,0.523,0.00832,G#,0.0849,-5.175,Major,0.0294,78.385,4-Apr,0.318
2,88081,Hate Me Now,0.00583,0.678,284000,0.77,0.0,A,0.109,-4.399,Minor,0.222,90.0,4-Apr,0.412
3,78331,Somebody Ain't You,0.0203,0.592,177354,0.749,0.0,B,0.122,-4.604,Major,0.0483,160.046,4-Apr,0.614
4,72636,Sour Mango,0.000335,0.421,-1,0.447,0.0148,D,0.0374,-8.833,Major,0.202,73.83,4-Apr,0.121


<div class="alert alert-block alert-warning">
<b>Student's comment:</b> Use pandas_profiling to print a summary of the train dataset.
</div>

In [19]:
pandas_profiling.ProfileReport(df_train)




<div class="alert alert-block alert-warning">
<b>Student's comment:</b> pandas_profiling results:
</div>

Dataset overview:  
- Number of variables: 16  
- Number of observations: 20394  
- Missing cells: 1683  
- Missing cells (%): 0.5%  
- Duplicate rows: 0  
- Numeric: 11  
- Categorical: 5  

There are no exact duplicates, missing values make up less than 1% of all cells, and there are 5 categorical features.
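The "Missing cells (%)" figure is the number of missing cells divided by the total number of cells (rows times columns). A minimal sketch of that calculation on a made-up `toy` frame, followed by the same arithmetic applied to the counts reported above:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for df_train; the values are made up for illustration.
toy = pd.DataFrame({
    "key": ["D", None, "E", "C"],
    "tempo": [101.4, 113.1, np.nan, 76.5],
    "energy": [0.35, 0.67, 0.80, 0.58],
})

missing_cells = int(toy.isna().sum().sum())  # total number of NaN cells
missing_share = missing_cells / toy.size     # toy.size == rows * columns
print(missing_cells, f"{missing_share:.1%}")

# The same ratio for the numbers in the report: 1683 missing cells, 20394 x 16 cells.
report_share = 1683 / (20394 * 16)
print(f"{report_share:.1%}")
```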

Alerts by feature:  
- track_name has a high cardinality: 18643 distinct values - there are many distinct values, but titles can repeat across genres, so this is of no interest to us;  
- track_name is uniformly distributed - of little interest to us.  
- acousticness is highly overall correlated with energy and 1 other fields - acousticness is highly correlated with energy.  
- energy is highly overall correlated with acousticness and 1 other fields - and vice versa.  
- loudness is highly overall correlated with acousticness and 1 other fields - loudness is highly correlated with acousticness.  

Three fields are involved in these correlations:  
- acousticness;  
- energy;  
- loudness.  

By their nature these three parameters are closely tied: a track with high energy is rarely quiet. There are exceptions, but they are exactly that, hence the high correlation between these features.  

- obtained_date is highly imbalanced (72.5%) - the date the track was loaded into the service; taken on its own, this information tells us little.  
- key has 735 (3.6%) missing values - identified earlier and to be fixed; the initial plan is to drop these rows, since the share of missing values is small.  
- mode has 506 (2.5%) missing values - identified earlier and to be fixed; the initial plan is to drop these rows, since the share of missing values is small.  
- tempo has 442 (2.2%) missing values - identified earlier and to be fixed; the initial plan is to drop these rows, since the share of missing values is small.  
- instance_id has unique values - good, all values are unique. In principle a matching id would identify a track and its genre, but if we treat the project as a service that determines the genre of new tracks, ids are of little use, even unique ones for already known tracks.  
- instrumentalness has 5978 (29.3%) zeros - not all tracks have this property; this may be useful, so we need to see which values it takes for which genres.  
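The correlation alerts for acousticness, energy, and loudness can be checked by hand. A minimal sketch, assuming synthetic data in place of df_train, Spearman correlation, and an illustrative threshold of 0.5; none of these choices come from the report itself:

```python
import numpy as np
import pandas as pd

# Synthetic data that mimics the described relationship (assumption, not real tracks):
# loud tracks are energetic, acoustic tracks are not.
rng = np.random.default_rng(0)
energy = rng.uniform(0, 1, 500)
toy = pd.DataFrame({
    "energy": energy,
    "loudness": -40 + 38 * energy + rng.normal(0, 2, 500),
    "acousticness": np.clip(1 - energy + rng.normal(0, 0.1, 500), 0, 1),
    "tempo": rng.uniform(60, 180, 500),  # independent of the other columns
})

corr = toy.corr(method="spearman")
threshold = 0.5  # illustrative cut-off for "highly correlated"

# Collect each unordered pair once, keeping only pairs above the threshold.
pairs = [
    (a, b, round(float(corr.loc[a, b]), 2))
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
    if abs(corr.loc[a, b]) >= threshold
]
for a, b, value in pairs:
    print(f"{a} ~ {b}: {value}")
```

On the real data the same loop over `df_train.corr(...)` would list the pairs to consider when checking for multicollinearity.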

By variable:  
- instance_id - no zeros, missing, or negative values; the values are unordered, but that does not matter.  
- track_name - Unique 17252 and Min length 1 - some titles repeat, which is normal, but the one-character titles need checking.  
- acousticness - Missing 0, Minimum 0, Maximum 0.996, Zeros 1, Negative 0 - nothing notable.  
- danceability - Missing 0, Minimum 0.06, Maximum 0.978, Zeros 0, Negative 0 - nothing notable.  
- duration_ms - Missing 0, Minimum -1 (interesting!), Maximum 4497994, Zeros 0, Negative 2009 - quite a lot of negative values.  
- energy - Distinct 1521 (not many given the dataset size), Missing 0, Minimum 0.00101, Maximum 0.999, Zeros 0, Negative 0 - there are repeated values; check them against genres.  
- instrumentalness - Distinct 4360 (!), Missing 0, Minimum 0, Maximum 0.996, Zeros 5978, Negative 0 - discussed above.  
- key - Distinct 12, Missing 735 - discussed above.  
- liveness - Distinct 1521, Missing 0, Minimum 0.0136, Maximum 1, Zeros 0, Negative 0 - quite a lot of identical values; check them against genres.  
- loudness - Distinct 10844, Missing 0, Minimum -44.406, Maximum 3.744, Zeros 1, Negative 20376 - there are identical values; worth a closer look.  
- mode - Distinct 2, Missing 506 - discussed above; there are missing values (2.5%).  
- speechiness - Distinct 1243 (not many given the dataset size), Missing 0, Minimum 0.0223, Maximum 0.942, Zeros 0, Negative 0 - there are repeated values; check them against genres.  
- tempo - Distinct 15762, Missing 442, Minimum 34.765, Maximum 220.041, Zeros 0, Negative 0 - there are missing values to handle. There are many repeated values; check them against genres.  
- obtained_date - Distinct 4 (interesting!), Missing 0 - only 4 categories; needs checking.  
- valence - Distinct 1454 (not many given the dataset size), Missing 0, Minimum 0, Maximum 0.992, Zeros 1, Negative 0 - there are repeated values; check them against genres.  
- music_genre - Distinct 10, Missing 0 - check which genres have missing values in key, mode, and tempo.  

Of the numeric features, all lie in the interval [0, 1] except four: instance_id, duration_ms, loudness, and tempo.
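Which numeric columns fall outside [0, 1] can be read off the per-column minimum and maximum. A minimal sketch on a small made-up `toy` frame (the column subset and values are assumptions for illustration):

```python
import pandas as pd

# Toy frame with a few of the numeric columns; values are illustrative.
toy = pd.DataFrame({
    "energy": [0.351, 0.67, 0.804],
    "valence": [0.45, 0.539, 0.344],
    "loudness": [-16.842, -8.392, -5.225],
    "tempo": [101.384, 113.071, 80.98],
})

numeric = toy.select_dtypes("number")
# True for columns whose values all lie in [0, 1].
in_unit_interval = (numeric.min() >= 0) & (numeric.max() <= 1)
outside = in_unit_interval[~in_unit_interval].index.tolist()
print(outside)
```

Columns listed in `outside` are the ones that would benefit most from scaling before distance- or gradient-based models.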

The music_genre feature correlates most strongly with:  
- acousticness;  
- danceability;  
- energy;  
- instrumentalness;  
- loudness.  

<div class="alert alert-block alert-warning">
<b>Student's comment:</b> Based on the pandas_profiling results, let us look for relationships between genres and the other features:
</div>

In [20]:
df_train.obtained_date.unique()

array(['4-Apr', '3-Apr', '5-Apr', '1-Apr'], dtype=object)

In [21]:
df_train.query('obtained_date == "4-Apr"').music_genre.unique()

array(['Country', 'Rock', 'Alternative', 'Hip-Hop', 'Blues', 'Jazz',
       'Electronic', 'Anime', 'Rap', 'Classical'], dtype=object)

In [22]:
df_train.query('obtained_date == "3-Apr"').music_genre.unique()

array(['Blues', 'Alternative', 'Hip-Hop', 'Classical', 'Rap', 'Anime',
       'Jazz', 'Country', 'Electronic', 'Rock'], dtype=object)

In [23]:
df_train.query('obtained_date == "5-Apr"').music_genre.unique()

array(['Rap', 'Blues', 'Hip-Hop', 'Electronic', 'Classical', 'Rock',
       'Anime', 'Jazz', 'Alternative', 'Country'], dtype=object)

In [24]:
df_train.query('obtained_date == "1-Apr"').music_genre.unique()

array(['Electronic', 'Classical', 'Jazz', 'Anime', 'Alternative', 'Blues',
       'Rap', 'Hip-Hop', 'Rock', 'Country'], dtype=object)

No relationship is visible: every genre occurs on each of the four dates. It is odd that the load-date field has only 4 values, but this field does not matter for our task.
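Listing unique genres per date only shows that no date is exclusive to a genre. A slightly stronger check compares the genre shares within each date. A minimal sketch with `pd.crosstab` on made-up rows (the `toy` frame is an assumption):

```python
import pandas as pd

# Toy rows standing in for df_train[['obtained_date', 'music_genre']].
toy = pd.DataFrame({
    "obtained_date": ["4-Apr", "4-Apr", "4-Apr", "3-Apr", "3-Apr", "5-Apr"],
    "music_genre": ["Rock", "Rock", "Jazz", "Rock", "Jazz", "Jazz"],
})

# Rows are dates, columns are genres; normalize="index" makes each row sum to 1.
shares = pd.crosstab(toy["obtained_date"], toy["music_genre"], normalize="index")
print(shares.round(2))
```

If the rows of the real table look alike, the date carries little information about the genre.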

In [25]:
df_train[df_train['key'].isna()].music_genre.unique()

array(['Classical', 'Hip-Hop', 'Rock', 'Electronic', 'Country',
       'Alternative', 'Rap', 'Jazz', 'Blues', 'Anime'], dtype=object)

In [26]:
df_train[df_train['mode'].isna()].music_genre.unique()

array(['Classical', 'Country', 'Electronic', 'Blues', 'Rap', 'Rock',
       'Alternative', 'Anime', 'Hip-Hop', 'Jazz'], dtype=object)

In [27]:
df_train[df_train['tempo'].isna()].music_genre.unique()

array(['Rock', 'Anime', 'Blues', 'Electronic', 'Alternative', 'Country',
       'Classical', 'Rap', 'Jazz', 'Hip-Hop'], dtype=object)

Checking whether music_genre depends on missing values in key, mode, and tempo revealed nothing: the gaps occur in every genre.
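The share of missing values per genre gives a more quantitative view than the list of affected genres. A minimal sketch on a made-up `toy` frame (values are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for df_train; values are made up.
toy = pd.DataFrame({
    "music_genre": ["Rock", "Rock", "Jazz", "Jazz", "Jazz", "Rap"],
    "key": ["A", None, "C", None, None, "D"],
    "mode": ["Major", "Minor", None, "Minor", "Major", "Major"],
    "tempo": [120.0, np.nan, 95.0, 101.0, np.nan, 128.0],
})

# isna() gives booleans; their mean within a genre is the share of missing values.
missing_rate = (
    toy[["key", "mode", "tempo"]]
    .isna()
    .groupby(toy["music_genre"])
    .mean()
)
print(missing_rate.round(2))
```

Roughly equal rates across genres on the real data would support the conclusion that the gaps are unrelated to the genre.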

In [28]:
df_train[df_train['instrumentalness'] == 0].music_genre.unique()

array(['Alternative', 'Hip-Hop', 'Anime', 'Electronic', 'Rap', 'Country',
       'Rock', 'Blues', 'Classical', 'Jazz'], dtype=object)

A zero instrumentalness value occurs in every known genre, so this did not help much.

In [29]:
df_train[df_train['track_name'].str.len() == 1].count()

instance_id         18
track_name          18
acousticness        18
danceability        18
duration_ms         18
energy              18
instrumentalness    18
key                 18
liveness            18
loudness            18
mode                18
speechiness         18
tempo               18
obtained_date       18
valence             18
music_genre         18
dtype: int64

There are only 18 such tracks; they can be removed if we keep the track_name column for further analysis.

In [30]:
df_train[df_train['duration_ms'] <= 0].music_genre.count()

2009

In [31]:
df_train[df_train['duration_ms'] <= 0].music_genre.unique()

array(['Anime', 'Classical', 'Rap', 'Alternative', 'Country', 'Rock',
       'Blues', 'Electronic', 'Jazz', 'Hip-Hop'], dtype=object)

This error (of data entry or export) occurs in every genre, so we will replace these values with per-genre medians.

In [32]:
df_test[df_test['duration_ms'] <= 0].instance_id.count()

509

The test dataframe df_test also contains non-positive values of 'duration_ms' (509 rows). Since we cannot compute per-genre medians for this dataset, we fill them with the median over the test dataframe, excluding values <= 0.

<div class="alert alert-block alert-warning">
<b>Student's comment:</b> Replace duration_ms values less than or equal to 0 and missing tempo values with per-genre medians in the main dataset df_train_copy:
</div>

In [33]:
genre_list = df_train.music_genre.unique()
genre_list

array(['Country', 'Rock', 'Alternative', 'Hip-Hop', 'Blues', 'Jazz',
       'Electronic', 'Anime', 'Rap', 'Classical'], dtype=object)

In [34]:
df_train_copy = df_train.copy()

In [35]:
for genre in genre_list:
    duration_ms_median = df_train_copy.loc[(df_train_copy['music_genre'] == genre) & (df_train_copy['duration_ms'] > 0)].duration_ms.median()
    print(duration_ms_median)
    tempo_median = df_train_copy.loc[(df_train_copy['music_genre'] == genre) & (df_train_copy['tempo'].notna())].tempo.median()
    print(tempo_median)
    df_train_copy.loc[(df_train_copy['music_genre'] == genre) & (df_train_copy['duration_ms'] <= 0), 'duration_ms'] = duration_ms_median
    df_train_copy.loc[(df_train_copy['music_genre'] == genre) & (df_train_copy['tempo'].isna()), 'tempo'] = tempo_median

211705.5
121.97
226360.0
120.84
225725.5
120.048
212854.5
120.1
228827.0
119.118
247400.0
105.3485
243629.0
126.0
239523.0
128.002
214307.0
120.02
263373.0
95.5235
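The per-genre loop above can also be expressed with `groupby(...).transform("median")`. A minimal sketch on a made-up `toy` frame; marking non-positive durations as NaN first is an assumption of this sketch that lets both columns share one code path:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for df_train_copy; values are made up.
toy = pd.DataFrame({
    "music_genre": ["Rock", "Rock", "Rock", "Jazz", "Jazz", "Jazz"],
    "duration_ms": [200000.0, -1.0, 240000.0, 300000.0, 320000.0, -1.0],
    "tempo": [120.0, 124.0, np.nan, 90.0, np.nan, 100.0],
})

# Treat non-positive durations as missing so they are excluded from the median.
toy["duration_ms"] = toy["duration_ms"].where(toy["duration_ms"] > 0)

for column in ["duration_ms", "tempo"]:
    # Median of the valid values within each genre, aligned to the original rows.
    genre_median = toy.groupby("music_genre")[column].transform("median")
    toy[column] = toy[column].fillna(genre_median)

print(toy)
```

The result matches the loop: each gap receives the median of the valid values from the same genre.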


In [36]:
df_train_copy.head()

Unnamed: 0,instance_id,track_name,acousticness,danceability,duration_ms,energy,instrumentalness,key,liveness,loudness,mode,speechiness,tempo,obtained_date,valence,music_genre
0,25143,Highwayman,0.48,0.67,182653.0,0.351,0.0176,D,0.115,-16.842,Major,0.0463,101.384,4-Apr,0.45,Country
1,26091,Toes Across The Floor,0.243,0.452,187133.0,0.67,5.1e-05,A,0.108,-8.392,Minor,0.0352,113.071,4-Apr,0.539,Rock
2,87888,First Person on Earth,0.228,0.454,173448.0,0.804,0.0,E,0.181,-5.225,Minor,0.371,80.98,4-Apr,0.344,Alternative
3,77021,No Te Veo - Digital Single,0.0558,0.847,255987.0,0.873,3e-06,G#,0.325,-4.805,Minor,0.0804,116.007,4-Apr,0.966,Hip-Hop
4,20852,Chasing Shadows,0.227,0.742,195333.0,0.575,2e-06,C,0.176,-5.55,Major,0.0487,76.494,4-Apr,0.583,Alternative


In [37]:
df_train_copy[df_train_copy['duration_ms'] <= 0].duration_ms.count()

0

In [38]:
df_train_copy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20394 entries, 0 to 20393
Data columns (total 16 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   instance_id       20394 non-null  int32  
 1   track_name        20394 non-null  object 
 2   acousticness      20394 non-null  float64
 3   danceability      20394 non-null  float64
 4   duration_ms       20394 non-null  float64
 5   energy            20394 non-null  float64
 6   instrumentalness  20394 non-null  float64
 7   key               19659 non-null  object 
 8   liveness          20394 non-null  float64
 9   loudness          20394 non-null  float64
 10  mode              19888 non-null  object 
 11  speechiness       20394 non-null  float64
 12  tempo             20394 non-null  float64
 13  obtained_date     20394 non-null  object 
 14  valence           20394 non-null  float64
 15  music_genre       20394 non-null  object 
dtypes: float64(10), int32(1), object(5)
memo

In [39]:
df_train_copy.tempo.isna().sum()

0

<div class="alert alert-block alert-warning">
<b>Student's comment:</b> Replace duration_ms values less than or equal to 0 and missing tempo values with medians in the test dataset df_test_copy. Strictly speaking, this is not ideal for the test set: to treat both sets consistently, the train values should also be replaced with medians over the whole train set rather than per-genre medians. BUT if we are building a model for real use rather than fitting to the current dataset, per-genre medians are the more correct choice.
</div>
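A third option, not used in this notebook, is to compute the medians once on the training set and reuse them for the test set, so that both sets are filled with the same values. A minimal sketch under that assumption; the `train` and `test` frames and the helper `with_invalid_as_nan` are hypothetical:

```python
import numpy as np
import pandas as pd

# Toy frames standing in for df_train and df_test; values are made up.
train = pd.DataFrame({"duration_ms": [200000.0, -1.0, 240000.0],
                      "tempo": [120.0, np.nan, 100.0]})
test = pd.DataFrame({"duration_ms": [-1.0, 180000.0],
                     "tempo": [np.nan, 90.0]})
columns = ["duration_ms", "tempo"]

def with_invalid_as_nan(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy where non-positive durations are marked as missing."""
    out = frame.copy()
    out["duration_ms"] = out["duration_ms"].where(out["duration_ms"] > 0)
    return out

train_clean = with_invalid_as_nan(train)
test_clean = with_invalid_as_nan(test)

# Medians are learned from the training data only, then applied to the test data.
train_medians = train_clean[columns].median()
test_filled = test_clean.fillna(train_medians)
print(test_filled)
```

This keeps the number of test rows unchanged and avoids using test-set statistics during preprocessing.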

<div class="alert alert-block alert-warning">
<b>Student's comment:</b> The final test dataframe must keep its original number of rows and have the same columns as the train dataset at the time the model is trained!
</div>

In [40]:
df_test_copy = df_test.copy()

In [41]:
duration_ms_median_test = df_test_copy.loc[df_test_copy['duration_ms'] > 0].duration_ms.median()
print(duration_ms_median_test)
tempo_median_test = df_test_copy.loc[df_test_copy['tempo'].notna()].tempo.median()
print(tempo_median_test)

226108.5
120.0535


In [42]:
df_test_copy.loc[df_test_copy['duration_ms'] <= 0, 'duration_ms'] = duration_ms_median_test
df_test_copy.loc[df_test_copy['tempo'].isna(), 'tempo'] = tempo_median_test

In [43]:
df_test_copy.head()

Unnamed: 0,instance_id,track_name,acousticness,danceability,duration_ms,energy,instrumentalness,key,liveness,loudness,mode,speechiness,tempo,obtained_date,valence
0,48564,Low Class Conspiracy,0.301,0.757,146213.0,0.679,0.0,A#,0.303,-7.136,Minor,0.356,90.361,4-Apr,0.895
1,72394,The Hunter,0.538,0.256,240360.0,0.523,0.00832,G#,0.0849,-5.175,Major,0.0294,78.385,4-Apr,0.318
2,88081,Hate Me Now,0.00583,0.678,284000.0,0.77,0.0,A,0.109,-4.399,Minor,0.222,90.0,4-Apr,0.412
3,78331,Somebody Ain't You,0.0203,0.592,177354.0,0.749,0.0,B,0.122,-4.604,Major,0.0483,160.046,4-Apr,0.614
4,72636,Sour Mango,0.000335,0.421,226108.5,0.447,0.0148,D,0.0374,-8.833,Major,0.202,73.83,4-Apr,0.121


In [44]:
df_test_copy[df_test_copy['duration_ms'] <= 0].duration_ms.count()

0

In [45]:
df_test_copy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5099 entries, 0 to 5098
Data columns (total 15 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   instance_id       5099 non-null   int64  
 1   track_name        5099 non-null   object 
 2   acousticness      5099 non-null   float64
 3   danceability      5099 non-null   float64
 4   duration_ms       5099 non-null   float64
 5   energy            5099 non-null   float64
 6   instrumentalness  5099 non-null   float64
 7   key               4941 non-null   object 
 8   liveness          5099 non-null   float64
 9   loudness          5099 non-null   float64
 10  mode              4950 non-null   object 
 11  speechiness       5099 non-null   float64
 12  tempo             5099 non-null   float64
 13  obtained_date     5099 non-null   object 
 14  valence           5099 non-null   float64
dtypes: float64(10), int64(1), object(4)
memory usage: 597.7+ KB


In [46]:
df_test_copy.tempo.isna().sum()

0

Let us look at the values of the key and mode fields by genre.

In [47]:
for genre in genre_list:
    key_unique = df_train_copy.loc[df_train_copy['music_genre'] == genre, 'key'].unique()
    mode_unique = df_train_copy.loc[df_train_copy['music_genre'] == genre, 'mode'].unique()
    print(f'Genre {genre}: values of the key feature: {key_unique}.')
    print(f'Genre {genre}: values of the mode feature: {mode_unique}.')

Genre Country: values of the key feature: ['D' 'A' 'G#' 'G' 'C#' 'F#' 'F' 'C' nan 'B' 'D#' 'E' 'A#'].
Genre Country: values of the mode feature: ['Major' nan 'Minor'].
Genre Rock: values of the key feature: ['A' 'D' 'E' 'D#' 'B' nan 'G#' 'F' 'C' 'G' 'F#' 'C#' 'A#'].
Genre Rock: values of the mode feature: ['Minor' 'Major' nan].
Genre Alternative: values of the key feature: ['E' 'C' 'G#' 'G' 'F#' 'D' 'A' 'C#' nan 'D#' 'A#' 'F' 'B'].
Genre Alternative: values of the mode feature: ['Minor' 'Major' nan].
Genre Hip-Hop: values of the key feature: ['G#' 'A#' 'D' 'F' 'C' nan 'B' 'A' 'E' 'C#' 'D#' 'F#' 'G'].
Genre Hip-Hop: values of the mode feature: ['Minor' 'Major' nan].
Genre Blues: values of the key feature: ['D' 'F' 'G' 'A' 'G#' 'D#' 'B' 'A#' 'C' 'E' 'C#' nan 'F#'].
Genre Blues: values of the mode feature: ['Major' 'Minor' nan].
Genre Jazz: values of the key feature: ['D#' 'A#' 'C' 'E' nan 'G' 'F' 'A' 'B' 'G#' 'D' 'C#' 'F#'

Every genre contains all possible categorical values of key and mode. Since these fields are missing in only 3.6% and 2.5% of rows respectively (missing cells make up less than 1% of the whole dataset), we drop the affected rows from df_train_copy for further analysis.

In [48]:
df_train_copy = df_train_copy.loc[(df_train_copy['key'].notna()) & (df_train_copy['mode'].notna())]
df_train_copy.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 19175 entries, 0 to 20393
Data columns (total 16 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   instance_id       19175 non-null  int32  
 1   track_name        19175 non-null  object 
 2   acousticness      19175 non-null  float64
 3   danceability      19175 non-null  float64
 4   duration_ms       19175 non-null  float64
 5   energy            19175 non-null  float64
 6   instrumentalness  19175 non-null  float64
 7   key               19175 non-null  object 
 8   liveness          19175 non-null  float64
 9   loudness          19175 non-null  float64
 10  mode              19175 non-null  object 
 11  speechiness       19175 non-null  float64
 12  tempo             19175 non-null  float64
 13  obtained_date     19175 non-null  object 
 14  valence           19175 non-null  float64
 15  music_genre       19175 non-null  object 
dtypes: float64(10), int32(1), object(5)
memo

In [49]:
df_train_copy.isna().sum()

instance_id         0
track_name          0
acousticness        0
danceability        0
duration_ms         0
energy              0
instrumentalness    0
key                 0
liveness            0
loudness            0
mode                0
speechiness         0
tempo               0
obtained_date       0
valence             0
music_genre         0
dtype: int64
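The row filter above can also be written with `dropna(subset=...)`, and it is worth measuring how many rows it removes. A minimal sketch on a made-up `toy` frame, followed by the same arithmetic for the row counts shown in the `info()` outputs (20394 before, 19175 after):

```python
import pandas as pd

# Toy frame standing in for df_train_copy; values are made up.
toy = pd.DataFrame({
    "key": ["D", None, "E", "C", "G"],
    "mode": ["Major", "Minor", None, "Minor", "Major"],
    "tempo": [101.4, 113.1, 81.0, 116.0, 76.5],
})

# Drop rows where either key or mode is missing.
cleaned = toy.dropna(subset=["key", "mode"])
dropped_share = 1 - len(cleaned) / len(toy)
print(len(cleaned), f"{dropped_share:.0%}")

# Share of rows removed in the notebook itself, from the reported row counts.
notebook_share = 1 - 19175 / 20394
print(f"{notebook_share:.1%}")
```

Because a row is dropped when either field is missing, the share of removed rows is larger than either per-column missing rate.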

<div class="alert alert-block alert-warning">
<b>Student's comment:</b> The datasets df_train_copy and df_test_copy are ready for model building. These are the ones to use for CatBoost, because the subsequent steps scale the numeric features and encode the categorical ones. I have no experience with CatBoost, but as I understand from its description, it is a library developed by Yandex for datasets that include both categorical and numeric data. I will try it at the end if time permits; as the mentors said, go from simple to complex.
</div>

Note the fields 'instance_id', 'track_name', and 'obtained_date'. They hold categorical or identifier values that, in my view, carry no useful information for predicting a track's genre and could hurt the final result if used in training. We drop these fields from df_train_copy; the new dataset is df_result.

In [51]:
df_result = df_train_copy.drop(['instance_id', 'track_name', 'obtained_date'], axis=1)

In [52]:
df_result.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 19175 entries, 0 to 20393
Data columns (total 13 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   acousticness      19175 non-null  float64
 1   danceability      19175 non-null  float64
 2   duration_ms       19175 non-null  float64
 3   energy            19175 non-null  float64
 4   instrumentalness  19175 non-null  float64
 5   key               19175 non-null  object 
 6   liveness          19175 non-null  float64
 7   loudness          19175 non-null  float64
 8   mode              19175 non-null  object 
 9   speechiness       19175 non-null  float64
 10  tempo             19175 non-null  float64
 11  valence           19175 non-null  float64
 12  music_genre       19175 non-null  object 
dtypes: float64(10), object(3)
memory usage: 2.0+ MB


<div class="alert alert-block alert-warning">
<b>Student's comment:</b> Do the same for the test dataset so that the feature columns match.
</div>

In [53]:
df_result_test = df_test_copy.drop(['instance_id', 'track_name', 'obtained_date'], axis=1)

In [54]:
df_result_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5099 entries, 0 to 5098
Data columns (total 12 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   acousticness      5099 non-null   float64
 1   danceability      5099 non-null   float64
 2   duration_ms       5099 non-null   float64
 3   energy            5099 non-null   float64
 4   instrumentalness  5099 non-null   float64
 5   key               4941 non-null   object 
 6   liveness          5099 non-null   float64
 7   loudness          5099 non-null   float64
 8   mode              4950 non-null   object 
 9   speechiness       5099 non-null   float64
 10  tempo             5099 non-null   float64
 11  valence           5099 non-null   float64
dtypes: float64(10), object(2)
memory usage: 478.2+ KB


Check the resulting df_result dataset for exact duplicates.

In [55]:
df_result.duplicated().sum()

6

In [56]:
df_result_test.duplicated().sum()

30

Remove the duplicates so that they do not contribute to overfitting. The test set does not affect training, so we leave it as is.

In [57]:
df_result.drop_duplicates(inplace=True)

In [58]:
df_result.duplicated().sum()

0
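
The duplicate check and removal above can be sketched on a toy frame. The column values below are hypothetical and only illustrate the calls; `reset_index` is an optional extra step that the notebook performs later, after the split.

```python
import pandas as pd

# Toy frame with hypothetical values: rows 0 and 2 are identical.
toy = pd.DataFrame({
    "tempo": [120.0, 98.5, 120.0],
    "key": ["C", "F", "C"],
    "music_genre": ["Rock", "Jazz", "Rock"],
})

# duplicated() marks every row that repeats an earlier row.
n_dupes = int(toy.duplicated().sum())

# drop_duplicates() keeps the first occurrence; reset_index tidies the index.
deduped = toy.drop_duplicates().reset_index(drop=True)

print(n_dupes, len(deduped))
```

Returning a new frame instead of using `inplace=True` keeps the original available for comparison.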

<div class="alert alert-block alert-warning">
<b>Student comment:</b> Before encoding the categorical features, replace the categorical feature mode (only two values, 'Major' and 'Minor') with the binary feature is_Major (1 for Major, 0 for Minor).
</div>

In [59]:
# Work on copies so that df_result and df_result_test are not modified
df_result_nmode = df_result.copy()
df_result_nmode.loc[(df_result_nmode['mode'] == "Major"), 'is_Major'] = 1
df_result_nmode.loc[(df_result_nmode['mode'] == "Minor"), 'is_Major'] = 0

df_result_test_nmode = df_result_test.copy()
df_result_test_nmode.loc[(df_result_test_nmode['mode'] == "Major"), 'is_Major'] = 1
df_result_test_nmode.loc[(df_result_test_nmode['mode'] == "Minor"), 'is_Major'] = 0

In [60]:
df_result_nmode.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 19169 entries, 0 to 20393
Data columns (total 14 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   acousticness      19169 non-null  float64
 1   danceability      19169 non-null  float64
 2   duration_ms       19169 non-null  float64
 3   energy            19169 non-null  float64
 4   instrumentalness  19169 non-null  float64
 5   key               19169 non-null  object 
 6   liveness          19169 non-null  float64
 7   loudness          19169 non-null  float64
 8   mode              19169 non-null  object 
 9   speechiness       19169 non-null  float64
 10  tempo             19169 non-null  float64
 11  valence           19169 non-null  float64
 12  music_genre       19169 non-null  object 
 13  is_Major          19169 non-null  float64
dtypes: float64(11), object(3)
memory usage: 2.2+ MB
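
The same Major/Minor conversion can be written as a single mapping. This is a minimal sketch on a toy column, not the notebook's code: values missing from the mapping, including NaN, stay NaN, which matches the gaps that remain in the test set.

```python
import numpy as np
import pandas as pd

# Toy column with one missing value (hypothetical data).
toy = pd.DataFrame({"mode": ["Major", "Minor", np.nan, "Major"]})

# Map the two known labels to 1.0 / 0.0; anything else becomes NaN.
toy["is_Major"] = toy["mode"].map({"Major": 1.0, "Minor": 0.0})

print(toy)
```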


<div class="alert alert-block alert-warning">
<b>Student comment:</b> Fill the NaN values of is_Major in the test dataset with the column mean.
</div>

In [61]:
key_test_mean = df_result_test_nmode.loc[df_result_test_nmode['is_Major'].notna(), 'is_Major'].mean()

In [62]:
key_test_mean

0.6418181818181818

In [63]:
df_result_test_nmode.loc[df_result_test_nmode['is_Major'].isna(), 'is_Major'] = key_test_mean

In [64]:
df_result_test_nmode.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5099 entries, 0 to 5098
Data columns (total 13 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   acousticness      5099 non-null   float64
 1   danceability      5099 non-null   float64
 2   duration_ms       5099 non-null   float64
 3   energy            5099 non-null   float64
 4   instrumentalness  5099 non-null   float64
 5   key               4941 non-null   object 
 6   liveness          5099 non-null   float64
 7   loudness          5099 non-null   float64
 8   mode              4950 non-null   object 
 9   speechiness       5099 non-null   float64
 10  tempo             5099 non-null   float64
 11  valence           5099 non-null   float64
 12  is_Major          5099 non-null   float64
dtypes: float64(11), object(2)
memory usage: 518.0+ KB


In [65]:
df_result_nmode.is_Major.unique()

array([1., 0.])

In [66]:
df_result_test_nmode.is_Major.unique()

array([0.        , 1.        , 0.64181818])
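
As the unique values above show, mean imputation leaves a third, non-binary value in is_Major for the test set. One possible alternative, shown here only as a sketch on toy data, is to fill the gaps with the most frequent value taken from the training set, which keeps the feature binary. Whether this helps the models is an assumption that would need checking.

```python
import numpy as np
import pandas as pd

# Toy flags (hypothetical): train is complete, test has one gap.
train_flag = pd.Series([1.0, 1.0, 0.0, 1.0], name="is_Major")
test_flag = pd.Series([0.0, np.nan, 1.0], name="is_Major")

# Most frequent value in the training data.
fill_value = train_flag.mode().iloc[0]

# Fill the test gaps with it; the column stays binary.
test_filled = test_flag.fillna(fill_value)

print(sorted(test_filled.unique()))
```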

In [67]:
df_result_nmode = df_result_nmode.drop(columns='mode', axis=1)

In [68]:
df_result_test_nmode = df_result_test_nmode.drop(columns='mode', axis=1)

In [69]:
df_result_nmode.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 19169 entries, 0 to 20393
Data columns (total 13 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   acousticness      19169 non-null  float64
 1   danceability      19169 non-null  float64
 2   duration_ms       19169 non-null  float64
 3   energy            19169 non-null  float64
 4   instrumentalness  19169 non-null  float64
 5   key               19169 non-null  object 
 6   liveness          19169 non-null  float64
 7   loudness          19169 non-null  float64
 8   speechiness       19169 non-null  float64
 9   tempo             19169 non-null  float64
 10  valence           19169 non-null  float64
 11  music_genre       19169 non-null  object 
 12  is_Major          19169 non-null  float64
dtypes: float64(11), object(2)
memory usage: 2.0+ MB


In [70]:
df_result_test_nmode.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5099 entries, 0 to 5098
Data columns (total 12 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   acousticness      5099 non-null   float64
 1   danceability      5099 non-null   float64
 2   duration_ms       5099 non-null   float64
 3   energy            5099 non-null   float64
 4   instrumentalness  5099 non-null   float64
 5   key               4941 non-null   object 
 6   liveness          5099 non-null   float64
 7   loudness          5099 non-null   float64
 8   speechiness       5099 non-null   float64
 9   tempo             5099 non-null   float64
 10  valence           5099 non-null   float64
 11  is_Major          5099 non-null   float64
dtypes: float64(11), object(1)
memory usage: 478.2+ KB


<div class="alert alert-block alert-warning">
<b>Student comment:</b> Moving on to the models.
</div>

In [71]:
df_result_nmode.shape

(19169, 13)

In [72]:
df_result_test_nmode.shape

(5099, 12)

<div class="alert alert-block alert-warning">
<b>Student comment:</b> df_result_nmode and df_result_test_nmode are the processed datasets. It is also worth running CatBoost on them and comparing with the results obtained earlier by applying CatBoost to df_train_copy and df_test_copy.
</div>

<div class="alert alert-block alert-warning">
<b>Student comment:</b> Split the training dataset prepared for model training into a training part and a validation part.
</div>

In [75]:
# Split a frame such as df_result_nmode into train and validation parts,
# separating the target 'music_genre' from the features
def test_split(df):
    features = df.drop(['music_genre'], axis=1)
    target = df['music_genre']

    features_train, features_test, target_train, target_test = train_test_split(
        features, target, test_size=0.25, random_state=12345)

    return features_train, features_test, target_train, target_test

In [76]:
features_train, features_valid, target_train, target_valid = test_split(df_result_nmode)

In [77]:
features_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 14376 entries, 14860 to 12524
Data columns (total 12 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   acousticness      14376 non-null  float64
 1   danceability      14376 non-null  float64
 2   duration_ms       14376 non-null  float64
 3   energy            14376 non-null  float64
 4   instrumentalness  14376 non-null  float64
 5   key               14376 non-null  object 
 6   liveness          14376 non-null  float64
 7   loudness          14376 non-null  float64
 8   speechiness       14376 non-null  float64
 9   tempo             14376 non-null  float64
 10  valence           14376 non-null  float64
 11  is_Major          14376 non-null  float64
dtypes: float64(11), object(1)
memory usage: 1.4+ MB


In [78]:
features_valid.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4793 entries, 8009 to 15534
Data columns (total 12 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   acousticness      4793 non-null   float64
 1   danceability      4793 non-null   float64
 2   duration_ms       4793 non-null   float64
 3   energy            4793 non-null   float64
 4   instrumentalness  4793 non-null   float64
 5   key               4793 non-null   object 
 6   liveness          4793 non-null   float64
 7   loudness          4793 non-null   float64
 8   speechiness       4793 non-null   float64
 9   tempo             4793 non-null   float64
 10  valence           4793 non-null   float64
 11  is_Major          4793 non-null   float64
dtypes: float64(11), object(1)
memory usage: 486.8+ KB


In [79]:
target_train.info()

<class 'pandas.core.series.Series'>
Int64Index: 14376 entries, 14860 to 12524
Series name: music_genre
Non-Null Count  Dtype 
--------------  ----- 
14376 non-null  object
dtypes: object(1)
memory usage: 224.6+ KB


In [80]:
target_valid.info()

<class 'pandas.core.series.Series'>
Int64Index: 4793 entries, 8009 to 15534
Series name: music_genre
Non-Null Count  Dtype 
--------------  ----- 
4793 non-null   object
dtypes: object(1)
memory usage: 74.9+ KB


In [81]:
features_train.reset_index(drop=True, inplace=True)
features_valid.reset_index(drop=True, inplace=True)
target_train.reset_index(drop=True, inplace=True)
target_valid.reset_index(drop=True, inplace=True)

In [82]:
features_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14376 entries, 0 to 14375
Data columns (total 12 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   acousticness      14376 non-null  float64
 1   danceability      14376 non-null  float64
 2   duration_ms       14376 non-null  float64
 3   energy            14376 non-null  float64
 4   instrumentalness  14376 non-null  float64
 5   key               14376 non-null  object 
 6   liveness          14376 non-null  float64
 7   loudness          14376 non-null  float64
 8   speechiness       14376 non-null  float64
 9   tempo             14376 non-null  float64
 10  valence           14376 non-null  float64
 11  is_Major          14376 non-null  float64
dtypes: float64(11), object(1)
memory usage: 1.3+ MB


In [83]:
target_train.info()

<class 'pandas.core.series.Series'>
RangeIndex: 14376 entries, 0 to 14375
Series name: music_genre
Non-Null Count  Dtype 
--------------  ----- 
14376 non-null  object
dtypes: object(1)
memory usage: 112.4+ KB


In [84]:
features_valid.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4793 entries, 0 to 4792
Data columns (total 12 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   acousticness      4793 non-null   float64
 1   danceability      4793 non-null   float64
 2   duration_ms       4793 non-null   float64
 3   energy            4793 non-null   float64
 4   instrumentalness  4793 non-null   float64
 5   key               4793 non-null   object 
 6   liveness          4793 non-null   float64
 7   loudness          4793 non-null   float64
 8   speechiness       4793 non-null   float64
 9   tempo             4793 non-null   float64
 10  valence           4793 non-null   float64
 11  is_Major          4793 non-null   float64
dtypes: float64(11), object(1)
memory usage: 449.5+ KB


In [85]:
target_valid.info()

<class 'pandas.core.series.Series'>
RangeIndex: 4793 entries, 0 to 4792
Series name: music_genre
Non-Null Count  Dtype 
--------------  ----- 
4793 non-null   object
dtypes: object(1)
memory usage: 37.6+ KB
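
The split above is random. Because the classes are imbalanced, a stratified split is one option worth considering: it keeps the genre proportions the same in both parts. The sketch below uses toy data and the `stratify` argument of `train_test_split`; it is a variation on the notebook's helper, not the code that produced the results shown here.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 12 "Rock" tracks and 8 "Jazz" tracks.
toy = pd.DataFrame({
    "energy": [i / 20 for i in range(20)],
    "music_genre": ["Rock"] * 12 + ["Jazz"] * 8,
})
features = toy.drop(columns="music_genre")
target = toy["music_genre"]

# stratify=target preserves the 60/40 class ratio in both parts.
f_train, f_valid, t_train, t_valid = train_test_split(
    features, target, test_size=0.25, random_state=12345, stratify=target
)

print(t_train.value_counts().to_dict())
print(t_valid.value_counts().to_dict())
```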


<div class="alert alert-block alert-warning">
<b>Student comment:</b> Scale the numeric features.
</div> 

In [86]:
# Collect the numeric features to be scaled
numeric_ex = [*features_train.select_dtypes(exclude=['object']).columns]
numeric_ex

['acousticness',
 'danceability',
 'duration_ms',
 'energy',
 'instrumentalness',
 'liveness',
 'loudness',
 'speechiness',
 'tempo',
 'valence',
 'is_Major']

In [87]:
scaler = StandardScaler()
scaler.fit(features_train[numeric_ex])

StandardScaler()

In [88]:
# Scale the numeric features of the training set
features_train[numeric_ex] = scaler.transform(features_train[numeric_ex])
features_train.head()

Unnamed: 0,acousticness,danceability,duration_ms,energy,instrumentalness,key,liveness,loudness,speechiness,tempo,valence,is_Major
0,1.154382,-0.29612,-0.107419,-0.495088,-0.522563,C,0.213676,0.094983,-0.654878,-0.89577,0.158635,0.742391
1,-0.583958,0.472977,-0.164328,0.738501,-0.526453,F,-0.511166,0.813316,-0.313374,-0.954332,0.605802,-1.347
2,-0.320573,0.554548,-0.374913,1.235916,-0.526453,F,0.728853,0.630211,2.130326,-0.839437,1.003741,-1.347
3,-0.404237,0.974056,-0.410246,1.088681,-0.526453,A,-0.241597,0.754232,-0.051622,0.243857,1.176044,0.742391
4,0.469581,-0.610751,-0.404195,0.364445,-0.526453,F,-0.662724,0.906422,-0.512755,1.077398,-0.124435,0.742391


In [89]:
features_train.shape

(14376, 12)

In [90]:
# Scale the numeric features of the validation set
features_valid[numeric_ex] = scaler.transform(features_valid[numeric_ex])
features_valid.head()

Unnamed: 0,acousticness,danceability,duration_ms,energy,instrumentalness,key,liveness,loudness,speechiness,tempo,valence,is_Major
0,-0.673509,0.723517,0.331617,0.877777,2.089441,E,-0.181692,0.334793,-0.220329,-0.386348,-0.13264,-1.347
1,2.198626,-1.333237,3.269027,-2.317218,2.54001,E,-0.127778,-3.37229,-0.416643,-1.810931,-1.693215,0.742391
2,-0.342264,-0.360212,-0.379611,0.181396,-0.526217,D,0.279571,0.300221,-0.656923,-1.320942,-0.337763,0.742391
3,-0.835506,2.209274,0.146723,0.253024,-0.5264,G#,0.704891,-0.880353,-0.410508,-0.077967,1.660134,0.742391
4,0.420003,-0.698149,-0.49155,0.997156,-0.526453,A#,0.96248,0.724782,-0.426868,1.128673,0.983228,0.742391


In [91]:
features_valid.shape

(4793, 12)

In [92]:
# Scale the numeric features of the test dataset
df_result_test_nmode[numeric_ex] = scaler.transform(df_result_test_nmode[numeric_ex])
df_result_test_nmode.head()

Unnamed: 0,acousticness,danceability,duration_ms,energy,instrumentalness,key,liveness,loudness,speechiness,tempo,valence,is_Major
0,0.079152,1.131372,-0.901127,0.213231,-0.526453,A#,0.621025,0.258698,2.702909,-1.008401,1.758593,-1.347
1,0.813531,-1.787704,-0.028719,-0.407543,-0.499483,G#,-0.685488,0.617407,-0.636474,-1.406888,-0.608525,0.742391
2,-0.835475,0.671078,0.37567,0.575349,-0.526453,A,-0.541118,0.759354,1.332801,-1.020413,-0.222894,-1.347
3,-0.790638,0.17,-0.612561,0.491783,-0.526453,B,-0.463243,0.721855,-0.443227,1.310281,0.605802,0.742391
4,-0.852502,-0.826332,-0.160779,-0.709971,-0.478478,D,-0.970033,-0.051719,1.128307,-1.55845,-1.416709,0.742391


In [93]:
df_result_test_nmode.shape

(5099, 12)
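
The cells above fit the scaler on the training split only and reuse it for the validation and test data, so no statistics leak from held-out data. A minimal sketch of that pattern on toy arrays (the numbers are hypothetical):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy single-feature data (hypothetical values).
train = np.array([[1.0], [2.0], [3.0]])
valid = np.array([[2.0], [4.0]])

# Learn mean and standard deviation from the training data only.
scaler = StandardScaler().fit(train)

# Apply the same transformation to both parts.
train_scaled = scaler.transform(train)
valid_scaled = scaler.transform(valid)

print(scaler.mean_, train_scaled.ravel(), valid_scaled.ravel())
```

The validation values are expressed in units of the training distribution, so they need not have zero mean themselves.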

<div class="alert alert-block alert-warning">
<b>Student comment:</b> Encode the categorical features.
</div> 

We use the one-hot encoder (OHE) from sklearn.

In [94]:
# Collect the categorical features to be one-hot encoded
numeric_in = [*features_train.select_dtypes(include=['object']).columns]
numeric_in

['key']

In [95]:
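# Note: in scikit-learn >= 1.2 the argument is named sparse_output instead of sparse
ohe_encoder = OneHotEncoder(handle_unknown='ignore', sparse = False)
ohe_encoder.fit(features_train[numeric_in])
features_train_ohe_array = ohe_encoder.transform(features_train[numeric_in])
# Note: get_feature_names was removed in scikit-learn 1.2; newer versions provide get_feature_names_out
features_train_ohe_title = ohe_encoder.get_feature_names(numeric_in)
features_train_ohe = pd.DataFrame(features_train_ohe_array, columns=features_train_ohe_title)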

In [96]:
features_valid_ohe_array = ohe_encoder.transform(features_valid[numeric_in])
features_valid_ohe = pd.DataFrame(features_valid_ohe_array, columns=features_train_ohe_title)

In [97]:
features_test_ohe_array = ohe_encoder.transform(df_result_test_nmode[numeric_in])
features_test_ohe = pd.DataFrame(features_test_ohe_array, columns=features_train_ohe_title)

In [98]:
print('Training set after OHE encoding:', features_train_ohe.shape)
print('Validation set after OHE encoding:', features_valid_ohe.shape)
print('Test set after OHE encoding:', features_test_ohe.shape)

Training set after OHE encoding: (14376, 12)
Validation set after OHE encoding: (4793, 12)
Test set after OHE encoding: (5099, 12)


In [99]:
features_train_ohe.isna().sum().sum()

0

In [100]:
features_valid_ohe.isna().sum().sum()

0

In [101]:
features_test_ohe.isna().sum().sum()

0

In [102]:
features_train_ohe.head()

Unnamed: 0,key_A,key_A#,key_B,key_C,key_C#,key_D,key_D#,key_E,key_F,key_F#,key_G,key_G#
0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
3,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0


In [103]:
features_valid_ohe.head()

Unnamed: 0,key_A,key_A#,key_B,key_C,key_C#,key_D,key_D#,key_E,key_F,key_F#,key_G,key_G#
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
4,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [104]:
features_test_ohe.head()

Unnamed: 0,key_A,key_A#,key_B,key_C,key_C#,key_D,key_D#,key_E,key_F,key_F#,key_G,key_G#
0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
2,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0


The ohe_encoder was fitted on the training split and then applied to the validation and test sets.
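
With `handle_unknown='ignore'`, a category that the encoder did not see during fitting is encoded as a row of zeros instead of raising an error. The sketch below illustrates this on toy keys; the value "H" is a hypothetical unseen category, and the default sparse output is converted with `.toarray()` so the snippet does not depend on the `sparse` / `sparse_output` argument name.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy data: the encoder learns only the keys A, C and F.
train_keys = pd.DataFrame({"key": ["A", "C", "F", "C"]})
# "H" never appears in train (hypothetical unseen value).
test_keys = pd.DataFrame({"key": ["C", "H"]})

encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train_keys)

# Columns follow encoder.categories_[0], i.e. A, C, F.
encoded = encoder.transform(test_keys).toarray()

print(encoder.categories_[0])
print(encoded)
```

The first test row gets a 1 in the "C" column; the unseen key produces all zeros.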

<div class="alert alert-block alert-warning">
<b>Student comment:</b> Join the encoded data to the main datasets and drop the unencoded categorical columns.
</div>

In [105]:
features_train_final = pd.merge(features_train, features_train_ohe, left_index=True, right_index=True)

In [106]:
features_train_final.isna().sum().sum()

0

In [107]:
features_train_final.shape

(14376, 24)

In [108]:
features_train_final.head()

Unnamed: 0,acousticness,danceability,duration_ms,energy,instrumentalness,key,liveness,loudness,speechiness,tempo,...,key_B,key_C,key_C#,key_D,key_D#,key_E,key_F,key_F#,key_G,key_G#
0,1.154382,-0.29612,-0.107419,-0.495088,-0.522563,C,0.213676,0.094983,-0.654878,-0.89577,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,-0.583958,0.472977,-0.164328,0.738501,-0.526453,F,-0.511166,0.813316,-0.313374,-0.954332,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
2,-0.320573,0.554548,-0.374913,1.235916,-0.526453,F,0.728853,0.630211,2.130326,-0.839437,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
3,-0.404237,0.974056,-0.410246,1.088681,-0.526453,A,-0.241597,0.754232,-0.051622,0.243857,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.469581,-0.610751,-0.404195,0.364445,-0.526453,F,-0.662724,0.906422,-0.512755,1.077398,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0


In [109]:
features_train_final.drop(numeric_in, axis=1, inplace=True)
features_train_final.head()

Unnamed: 0,acousticness,danceability,duration_ms,energy,instrumentalness,liveness,loudness,speechiness,tempo,valence,...,key_B,key_C,key_C#,key_D,key_D#,key_E,key_F,key_F#,key_G,key_G#
0,1.154382,-0.29612,-0.107419,-0.495088,-0.522563,0.213676,0.094983,-0.654878,-0.89577,0.158635,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,-0.583958,0.472977,-0.164328,0.738501,-0.526453,-0.511166,0.813316,-0.313374,-0.954332,0.605802,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
2,-0.320573,0.554548,-0.374913,1.235916,-0.526453,0.728853,0.630211,2.130326,-0.839437,1.003741,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
3,-0.404237,0.974056,-0.410246,1.088681,-0.526453,-0.241597,0.754232,-0.051622,0.243857,1.176044,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.469581,-0.610751,-0.404195,0.364445,-0.526453,-0.662724,0.906422,-0.512755,1.077398,-0.124435,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0


Validation part

In [110]:
features_valid_final = pd.merge(features_valid, features_valid_ohe, left_index=True, right_index=True)
features_valid_final.head()

Unnamed: 0,acousticness,danceability,duration_ms,energy,instrumentalness,key,liveness,loudness,speechiness,tempo,...,key_B,key_C,key_C#,key_D,key_D#,key_E,key_F,key_F#,key_G,key_G#
0,-0.673509,0.723517,0.331617,0.877777,2.089441,E,-0.181692,0.334793,-0.220329,-0.386348,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
1,2.198626,-1.333237,3.269027,-2.317218,2.54001,E,-0.127778,-3.37229,-0.416643,-1.810931,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
2,-0.342264,-0.360212,-0.379611,0.181396,-0.526217,D,0.279571,0.300221,-0.656923,-1.320942,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
3,-0.835506,2.209274,0.146723,0.253024,-0.5264,G#,0.704891,-0.880353,-0.410508,-0.077967,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
4,0.420003,-0.698149,-0.49155,0.997156,-0.526453,A#,0.96248,0.724782,-0.426868,1.128673,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [111]:
features_valid_final.drop(numeric_in, axis=1, inplace=True)
features_valid_final.head()

Unnamed: 0,acousticness,danceability,duration_ms,energy,instrumentalness,liveness,loudness,speechiness,tempo,valence,...,key_B,key_C,key_C#,key_D,key_D#,key_E,key_F,key_F#,key_G,key_G#
0,-0.673509,0.723517,0.331617,0.877777,2.089441,-0.181692,0.334793,-0.220329,-0.386348,-0.13264,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
1,2.198626,-1.333237,3.269027,-2.317218,2.54001,-0.127778,-3.37229,-0.416643,-1.810931,-1.693215,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
2,-0.342264,-0.360212,-0.379611,0.181396,-0.526217,0.279571,0.300221,-0.656923,-1.320942,-0.337763,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
3,-0.835506,2.209274,0.146723,0.253024,-0.5264,0.704891,-0.880353,-0.410508,-0.077967,1.660134,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
4,0.420003,-0.698149,-0.49155,0.997156,-0.526453,0.96248,0.724782,-0.426868,1.128673,0.983228,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Test dataset

In [112]:
features_test_final = pd.merge(df_result_test_nmode, features_test_ohe, left_index=True, right_index=True)
features_test_final.head()

Unnamed: 0,acousticness,danceability,duration_ms,energy,instrumentalness,key,liveness,loudness,speechiness,tempo,...,key_B,key_C,key_C#,key_D,key_D#,key_E,key_F,key_F#,key_G,key_G#
0,0.079152,1.131372,-0.901127,0.213231,-0.526453,A#,0.621025,0.258698,2.702909,-1.008401,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.813531,-1.787704,-0.028719,-0.407543,-0.499483,G#,-0.685488,0.617407,-0.636474,-1.406888,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
2,-0.835475,0.671078,0.37567,0.575349,-0.526453,A,-0.541118,0.759354,1.332801,-1.020413,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,-0.790638,0.17,-0.612561,0.491783,-0.526453,B,-0.463243,0.721855,-0.443227,1.310281,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,-0.852502,-0.826332,-0.160779,-0.709971,-0.478478,D,-0.970033,-0.051719,1.128307,-1.55845,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0


<div class="alert alert-block alert-warning">
<b>Student comment:</b> Check the rows where key is missing.
</div>

In [113]:
features_test_final[features_test_final['key'].isna()].head()

Unnamed: 0,acousticness,danceability,duration_ms,energy,instrumentalness,key,liveness,loudness,speechiness,tempo,...,key_B,key_C,key_C#,key_D,key_D#,key_E,key_F,key_F#,key_G,key_G#
34,-0.850748,1.20129,-0.260992,0.662894,-0.526453,,1.112241,0.816608,0.197861,0.510879,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
39,1.786506,0.274877,1.464413,-1.891829,2.257999,,-0.55909,-2.196655,-0.57717,-0.527462,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
70,-0.850665,0.257397,-0.402713,1.140412,-0.499386,,-0.253578,0.856668,0.402354,0.97771,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
103,0.698881,0.647772,-0.288624,0.201293,-0.522076,,1.040356,0.159738,1.445272,-0.55468,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
165,0.082251,1.08476,-0.160779,0.046099,-0.512935,,-0.77175,0.017425,-0.241801,0.642876,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [114]:
features_test_final.drop(numeric_in, axis=1, inplace=True)
features_test_final.head()

Unnamed: 0,acousticness,danceability,duration_ms,energy,instrumentalness,liveness,loudness,speechiness,tempo,valence,...,key_B,key_C,key_C#,key_D,key_D#,key_E,key_F,key_F#,key_G,key_G#
0,0.079152,1.131372,-0.901127,0.213231,-0.526453,0.621025,0.258698,2.702909,-1.008401,1.758593,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.813531,-1.787704,-0.028719,-0.407543,-0.499483,-0.685488,0.617407,-0.636474,-1.406888,-0.608525,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
2,-0.835475,0.671078,0.37567,0.575349,-0.526453,-0.541118,0.759354,1.332801,-1.020413,-0.222894,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,-0.790638,0.17,-0.612561,0.491783,-0.526453,-0.463243,0.721855,-0.443227,1.310281,0.605802,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,-0.852502,-0.826332,-0.160779,-0.709971,-0.478478,-0.970033,-0.051719,1.128307,-1.55845,-1.416709,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0


In [115]:
print('Shape of the training set after scaling and encoding:', features_train_final.shape)
print('Shape of the validation set after scaling and encoding:', features_valid_final.shape)
print('Shape of the test set after scaling and encoding:', features_test_final.shape)

Shape of the training set after scaling and encoding: (14376, 23)
Shape of the validation set after scaling and encoding: (4793, 23)
Shape of the test set after scaling and encoding: (5099, 23)
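
Joining on the index, as done above, works only when both frames share the same index, which is why the indices were reset after the split. The sketch below uses `pd.concat` as an equivalent way to join column-wise and shows what happens when the indices do not line up. All values are hypothetical.

```python
import pandas as pd

# Scaled features and their one-hot columns with matching indices.
numeric = pd.DataFrame({"tempo": [0.1, -0.4], "key": ["C", "F"]}, index=[0, 1])
encoded = pd.DataFrame({"key_C": [1.0, 0.0], "key_F": [0.0, 1.0]}, index=[0, 1])

# Drop the raw categorical column and join column-wise on the index.
combined = pd.concat([numeric.drop(columns="key"), encoded], axis=1)

# Same features but with the original, non-reset index: rows no longer align.
shifted = pd.DataFrame({"tempo": [0.1, -0.4], "key": ["C", "F"]}, index=[5, 9])
misaligned = pd.concat([shifted, encoded], axis=1)

print(combined.shape, misaligned.shape)
print(int(misaligned.isna().sum().sum()))
```

Checking `isna().sum().sum()` after the join, as the notebook does, catches this kind of misalignment.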


<div class="alert alert-block alert-warning">
<b>Student comment:</b> Apply LabelEncoder to encode the target (target_train and target_valid).
</div>

<div class="alert alert-block alert-warning">
<b>Student comment:</b> I think the target does not need to be encoded. Let's check whether the models accept it as is.
</div>

In [116]:
#target_train = LabelEncoder().fit_transform(target_train)

In [117]:
#target_train

In [118]:
#target_valid = LabelEncoder().fit_transform(target_valid)

In [119]:
#target_valid

Since our training sample is imbalanced, we apply SMOTE.

In [120]:
#smoter = SMOTE(random_state=42)
#X_train_upsample, y_train_upsample = smoter.fit_resample(features_train_final, target_train)
#y_train_upsample.mean()

In [121]:
#X_train_upsample.shape

In [122]:
#y_train_upsample.shape

<div class="alert alert-block alert-warning">
<b>Check on the resampled values: commented out for now</b> 
</div>

In [123]:
#features_train_final = X_train_upsample
#target_train = y_train_upsample

In [124]:
#features_train_final.shape

In [125]:
#target_train.shape

The data is ready: the features are split and transformed, with scaling and encoding applied.

In [126]:
# A function to examine recall, precision, F1, AUC-ROC and F-beta
# average_count: macro, micro, samples, weighted, binary, None
# beta_count = 0.5
#def rec_prec_f1_auc_roc(target_valid, prediction, features_valid, model):
#    print("Recall:" , recall_score(target_valid, prediction))
#    print("Precision:", precision_score(target_valid, prediction))
#    print("F1 score:", f1_score(target_valid, prediction))
#    print("Auc_roc:", roc_auc_score(target_valid, model.predict_proba(features_valid)[:, 1]))
#    print("F-beta:", fbeta_score(target_valid, prediction, average='weighted', beta=0.5))

In [127]:
def f_beta_score(target_valid, prediction, average_measure, beta_measure):
    print("F-beta:", fbeta_score(target_valid, prediction, average=average_measure, beta=beta_measure))

In [128]:
# A function to plot the ROC curve
def plot_roc_curve(fper, tper):
    plt.plot(fper, tper, color='red', label='ROC')
    plt.plot([0, 1], [0, 1], color='green', linestyle='--')
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.0])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('ROC curve')
    plt.legend()
    plt.show()

Choose the metric. According to the task, it is F-beta.

<div class="alert alert-block alert-warning">
<b>Hyperparameters:</b> 
</div>

In [129]:
measure_scoring = 'f1_micro' # roc_auc, f1, recall, f1_macro, f1_micro, f1_weighted, f1_binary, f1_samples
measure_average = 'micro' # macro, micro, samples, weighted, binary, None
measure_beta = 0.5
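
One property of the settings above is worth keeping in mind: for single-label multiclass predictions, micro-averaged precision and recall both equal accuracy, so micro-averaged F-beta equals accuracy for any beta. The sketch below checks this on toy labels (hypothetical values); with `average='macro'` or `'weighted'` the choice of beta would matter.

```python
from sklearn.metrics import accuracy_score, fbeta_score

# Toy labels (hypothetical): 4 of 6 predictions are correct.
y_true = ["Rock", "Jazz", "Rock", "Rap", "Jazz", "Rock"]
y_pred = ["Rock", "Rock", "Rock", "Rap", "Jazz", "Jazz"]

acc = accuracy_score(y_true, y_pred)

# Micro averaging pools all decisions, so beta has no effect here.
f_half_micro = fbeta_score(y_true, y_pred, average="micro", beta=0.5)
f_two_micro = fbeta_score(y_true, y_pred, average="micro", beta=2.0)

# Macro averaging scores each class separately before averaging.
f_half_macro = fbeta_score(y_true, y_pred, average="macro", beta=0.5)

print(acc, f_half_micro, f_two_micro, f_half_macro)
```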

<div class="alert alert-block alert-warning">
<b>Logistic regression model on imbalanced data:</b> 
</div>

In [130]:
# Train the logistic regression model
model_LgR = LogisticRegression(random_state=12345, solver='liblinear')
model_LgR.fit(features_train_final, target_train)
scores = cross_val_score(model_LgR, features_train_final, target_train, cv=5, scoring=measure_scoring)
final_score = pd.Series(scores).mean()
print(f'Mean {measure_scoring} score:', final_score)

Mean f1_micro score: 0.4113796698312874


In [131]:
# Check the logistic regression model on train
LgR_prediction_train = model_LgR.predict(features_train_final)

<div class="alert alert-block alert-warning">
<b>Predicting on train is not the goal; kept out of curiosity, as a check of the sample and the model</b> 
</div>

In [132]:
# Print the metrics on the train set for the logistic regression model
#rec_prec_f1_auc_roc(target_train, LgR_prediction_train, features_train_final, model_LgR)

In [133]:
f_beta_score(target_train, LgR_prediction_train, measure_average, measure_beta)

F-beta: 0.4167362270450751


<div class="alert alert-block alert-warning">
<b>On the validation set:</b> 
</div>

In [134]:
# Check the logistic regression model on the validation set
LgR_prediction_valid = model_LgR.predict(features_valid_final)

f_beta_score(target_valid, LgR_prediction_valid, measure_average, measure_beta)

F-beta: 0.3895263926559566


In [135]:
# Confusion matrix
#plot_confusion_matrix(estimator=model_LgR, X=features_train_final, y_true=target_train, normalize='true', cmap='Blues')
#None

In [136]:
# ROC curve for the logistic regression model
#fper, tper, thresholds = roc_curve(target_train, model_LgR.predict_proba(features_train_final)[:, 1])
#plot_roc_curve(fper, tper)

<div class="alert alert-block alert-warning">
<b>Decision tree model on imbalanced data:</b> 
</div>

In [137]:
# Decision tree: search for the best hyperparameters
best_DTC_model = None
best_DTC_result = 0
final_score = 0
best_depth = 0

for depth in tqdm(range(2, 15, 3)):
    model_DTC = DecisionTreeClassifier(random_state=12345, max_depth=depth)
    scores = cross_val_score(model_DTC, features_train_final, target_train, cv=5, scoring=measure_scoring)
    final_score = pd.Series(scores).mean()

    if final_score > best_DTC_result:
        best_DTC_model = model_DTC
        best_depth = depth
        best_DTC_result = final_score

print('Optimal tree depth:', best_depth)
print(f'Mean {measure_scoring} score:', best_DTC_result)

100%|████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:03<00:00,  1.51it/s]

Optimal tree depth: 8
Mean f1_micro score: 0.40317169982463563

In [138]:
# Train the model with the best hyperparameter values
model_DTC = DecisionTreeClassifier(random_state=12345, max_depth=best_depth)
model_DTC.fit(features_train_final, target_train)

DecisionTreeClassifier(max_depth=8, random_state=12345)
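
The manual loop followed by a refit can also be expressed with `GridSearchCV`, which runs the same cross-validated search and refits the best configuration on the full training data. The sketch below is an alternative formulation on synthetic data from `make_classification`, not the notebook's code; the grid mirrors `range(2, 15, 3)` and the `f1_micro` scoring used above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic 3-class data standing in for the prepared features (assumption).
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=5, n_classes=3, random_state=12345
)

search = GridSearchCV(
    DecisionTreeClassifier(random_state=12345),
    param_grid={"max_depth": list(range(2, 15, 3))},
    cv=5,
    scoring="f1_micro",
)
search.fit(X, y)

best_depth = search.best_params_["max_depth"]
best_score = search.best_score_
best_model = search.best_estimator_  # already refitted on all of X, y

print(best_depth, round(best_score, 3))
```

For the random forest, adding `"n_estimators"` to `param_grid` would replace the nested loop in the same way.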

In [139]:
# Check the decision tree model on train
DTC_prediction_train = model_DTC.predict(features_train_final)

In [140]:
f_beta_score(target_train, DTC_prediction_train, measure_average, measure_beta)

F-beta: 0.46835002782415136


<div class="alert alert-block alert-warning">
<b>On the validation set:</b> 
</div>

In [141]:
# Check the decision tree model on the validation set
DTC_prediction_valid = model_DTC.predict(features_valid_final)

In [142]:
f_beta_score(target_valid, DTC_prediction_valid, measure_average, measure_beta)

F-beta: 0.3884832046734822


<div class="alert alert-block alert-warning">
<b>Random forest model on imbalanced data:</b> 
</div>

In [143]:
# Random forest: search for the best hyperparameters
best_RFC_model = None
best_RFC_result = 0
best_est = 0
best_depth = 0

for est in tqdm(range(50, 201, 10)):
    for depth in range(2, 15, 3):
        model_RFC = RandomForestClassifier(random_state=12345, n_estimators=est, max_depth=depth)
        scores = cross_val_score(model_RFC, features_train_final, target_train, cv=5, scoring=measure_scoring)
        final_score = pd.Series(scores).mean()
        
        if final_score > best_RFC_result:
            best_RFC_model = model_RFC
            best_est = est
            best_depth = depth
            best_RFC_result = final_score
            
print('Optimal number of estimators:', best_est)
print('Optimal tree depth:', best_depth)
print(f'Mean {measure_scoring} score:', best_RFC_result)

100%|██████████████████████████████████████████████████████████████████████████████████| 16/16 [10:06<00:00, 37.92s/it]

Optimal number of estimators: 170
Optimal tree depth: 14
Mean f1_micro score: 0.47892314204511094
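
The selected depth (14) is the largest value in `range(2, 15, 3)`, so the optimum may lie outside the searched range, and deeper trees also tend to overfit. One way to see both effects during the search is to record the train score next to the cross-validation score. The sketch below does this with `cross_validate(..., return_train_score=True)` on synthetic data. The dataset, the small `n_estimators` value and the names `results` and `best_depth_sketch` are illustrative assumptions, not part of the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

# Synthetic stand-in for features_train_final / target_train.
X, y = make_classification(
    n_samples=600, n_features=12, n_informative=6,
    n_classes=4, random_state=12345,
)

depth_grid = list(range(2, 15, 3))
results = []
for depth in depth_grid:
    model = RandomForestClassifier(
        n_estimators=30, max_depth=depth, random_state=12345
    )
    cv = cross_validate(
        model, X, y, cv=5, scoring="f1_micro", return_train_score=True
    )
    results.append((depth, cv["train_score"].mean(), cv["test_score"].mean()))

for depth, train_score, cv_score in results:
    gap = train_score - cv_score
    print(f"max_depth={depth:2d}  train={train_score:.3f}  cv={cv_score:.3f}  gap={gap:.3f}")

# Pick the depth with the best cross-validation score and check
# whether it sits on the boundary of the searched range.
best_depth_sketch = max(results, key=lambda r: r[2])[0]
at_grid_edge = best_depth_sketch in (depth_grid[0], depth_grid[-1])
print("best depth:", best_depth_sketch, "| on grid edge:", at_grid_edge)
```

When the best value lands on the edge of the grid, extending the range in that direction is usually worth a try before fixing the hyperparameter.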





In [144]:
# Train the model with the best hyperparameter values
model_RFC = RandomForestClassifier(random_state=12345, n_estimators=best_est, max_depth=best_depth)
model_RFC.fit(features_train_final, target_train)

RandomForestClassifier(max_depth=14, n_estimators=170, random_state=12345)

In [145]:
# Evaluate the random forest on train
RFC_prediction_train = model_RFC.predict(features_train_final)

In [146]:
f_beta_score(target_train, RFC_prediction_train, measure_average, measure_beta)

F-beta: 0.8740261547022816


<div class="alert alert-block alert-warning">
<b>On the validation set:</b>
</div>

In [147]:
# Evaluate the random forest on valid
RFC_prediction_valid = model_RFC.predict(features_valid_final)

f_beta_score(target_valid, RFC_prediction_valid, measure_average, measure_beta)

F-beta: 0.4667223033590653


Before SMOTE, on valid:  
LgR: 0.39955156950672643  
DTC: 0.3884832046734822  
RFC: 0.4667223033590653  

On train:  
LgR: 0.4167362270450751  
DTC: 0.46835002782415136  
RFC: 0.8740261547022816  

After SMOTE, on valid:  
LgR: 0.36741080742749843  
DTC: 0.37742541205925306  
RFC: 0.44794491967452543  

On train:  
LgR: 0.4146825396825397  
DTC: 0.7380456349206349  
RFC: 0.9015376984126985  

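On the validation set, all three models score higher before SMOTE than after it. This is somewhat unexpected, but we accept it. For the tree-based models, SMOTE raises the train score while lowering the validation score, which suggests that the oversampled data mostly adds overfitting.

RFC gives the best validation result (about 0.467 without SMOTE), so we choose it for predictions on the test set. Its train score (about 0.874) is far above the validation score, so the model is noticeably overfit.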
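
To make the comparison explicit, the gap between the train and validation scores can be computed directly. The sketch below uses the RFC scores reported above. The dictionary layout and the names `rfc_scores` and `gaps` are only illustrative.

```python
# F-beta scores reported in this notebook for the random forest (RFC).
rfc_scores = {
    "before SMOTE": {"train": 0.8740261547022816, "valid": 0.4667223033590653},
    "after SMOTE": {"train": 0.9015376984126985, "valid": 0.44794491967452543},
}

# Train-minus-validation gap: the larger it is, the stronger the overfitting.
gaps = {}
for setup, s in rfc_scores.items():
    gaps[setup] = s["train"] - s["valid"]
    print(f"{setup}: train={s['train']:.3f} valid={s['valid']:.3f} gap={gaps[setup]:.3f}")
```

The gap is roughly 0.41 before SMOTE and 0.45 after it, which is consistent with SMOTE increasing overfitting rather than improving generalization here.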

In [148]:
RFC_prediction_test = model_RFC.predict(features_test_final)

In [149]:
y_pred_test = RFC_prediction_test.ravel().tolist()
# Build a Series with the predictions
y_pred_series = pd.Series(y_pred_test, name='music_genre')

# Combine the ids and the predictions into one DataFrame
submit = pd.concat([instance_id_test, y_pred_series], axis=1)

# Save the DataFrame with the predictions to a file
submit.to_csv('rfc_pred.csv', index=False)

In [150]:
y_pred_series

0               Rap
1             Blues
2               Rap
3           Country
4       Alternative
           ...     
5094        Country
5095            Rap
5096            Rap
5097     Electronic
5098          Blues
Name: music_genre, Length: 5099, dtype: object
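
One thing worth checking when building the submission: `pd.concat(..., axis=1)` aligns rows by index label, not by position. If `instance_id_test` kept a non-default index (for example, after rows were filtered), the ids and predictions would be paired incorrectly and `NaN` rows would appear. Whether this applies here depends on how `instance_id_test` was created earlier in the notebook. The sketch below illustrates the effect on hypothetical toy data and shows a position-based alternative.

```python
import pandas as pd

# Hypothetical ids whose index is not 0..n-1, e.g. after filtering rows.
instance_id_demo = pd.Series([101, 102, 103], index=[5, 7, 9], name="instance_id")
pred_demo = pd.Series(["Rap", "Blues", "Rock"], name="music_genre")

# concat(axis=1) aligns on index labels, so mismatched indexes produce NaN rows.
misaligned = pd.concat([instance_id_demo, pred_demo], axis=1)

# Resetting the index first pairs the values by position.
aligned = pd.concat([instance_id_demo.reset_index(drop=True), pred_demo], axis=1)

# Cheap sanity checks before writing the file.
has_gaps = bool(aligned.isna().any().any())
same_length = len(aligned) == len(pred_demo)
print(misaligned.shape, aligned.shape, has_gaps, same_length)
```

If the index of `instance_id_test` is already the default `RangeIndex`, the original code gives the same result as the position-based version.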

<div class="alert alert-block alert-warning">
<b>Student's comment:</b> What else could be done:
</div>

- study and apply CatBoostClassifier  
- tune hyperparameters with GridSearchCV  
- choose the best model (including CatBoostClassifier)  
- wrap everything in functions  
- implement everything with pipelines (Pipeline)  
- train the final model on the full train set (without splitting it into training and validation sets)  
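
Several of these points could be combined in a single step. The sketch below is an assumption-laden illustration, not the notebook's implementation. It wraps preprocessing and the model in a `Pipeline`, tunes it with `GridSearchCV`, and relies on `refit=True` so that the best configuration is retrained on all the data passed to `fit`. The synthetic dataset, the `StandardScaler` step and the small parameter grid are placeholders for the project's real features and ranges.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the full (unsplit) training data.
X, y = make_classification(
    n_samples=400, n_features=10, n_informative=5,
    n_classes=3, random_state=12345,
)

# Preprocessing and model in one object, so scaling is fit inside each CV fold.
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", RandomForestClassifier(random_state=12345)),
])

# Step-name prefixes ("model__") route parameters to the right pipeline step.
param_grid = {
    "model__n_estimators": [20, 40],
    "model__max_depth": [2, 5, 8],
}

search = GridSearchCV(pipeline, param_grid, cv=5, scoring="f1_micro", refit=True)
search.fit(X, y)

print("best params:", search.best_params_)
print("best CV score:", search.best_score_)

# With refit=True the best pipeline is already retrained on all of X, y.
predictions = search.predict(X[:5])
```

This replaces the nested loops with one search object, and `search.best_estimator_` can be used directly for the test-set predictions.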