Disclaimer: this is almost the original notebook from the final stage of the Olympiad (apart from, at least, the data loading), so it makes no claim to being pretty or elegant, or to running cleanly from start to finish.

# Task 2. VK Messages

This is a classification task: anonymized messages from the VKontakte social network must be assigned to given classes (topics). The dataset covers 50 topics. The texts and images in the messages are provided in encoded form, as vectors (embeddings) of size 300 and 1024 respectively.

### Input format

Detailed description of the data:

The training set train.csv is a CSV table with four columns:

- "id": unique message id;

- "txt": text vector of the message;

- "img": image vector of the message;

- "topic_id": topic of the message (the target variable), an integer from 0 to 49.

The test set test.csv is a CSV table with three columns: "id", "txt" and "img". Your task is to predict the topic, an integer from 0 to 49, for every message in the test set.

Note that some messages have no image or no text. In that case the corresponding text/image vector is all zeros.
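
The zero-vector convention makes a missing modality easy to detect. A minimal sketch on made-up toy arrays (the names `txt`, `img`, `has_txt` and `has_img` are illustrative and do not come from the task statement):

```python
import numpy as np

# Toy stand-ins for the real embeddings: 3 messages, 4-dim text and 5-dim image vectors.
txt = np.array([[0.1, -0.2, 0.0, 0.3],
                [0.0, 0.0, 0.0, 0.0],        # message without text
                [0.5, 0.1, -0.4, 0.2]])
img = np.array([[0.0, 0.0, 0.0, 0.0, 0.0],   # message without image
                [0.2, 0.9, 0.0, 0.4, 1.1],
                [0.3, 0.0, 0.7, 0.1, 0.6]])

# A modality counts as present if at least one component of its vector is non-zero.
has_txt = np.any(txt != 0, axis=1)
has_img = np.any(img != 0, axis=1)

print(has_txt.tolist())  # [True, False, True]
print(has_img.tolist())  # [False, True, True]
```

Flags like these could be appended as two extra features, although whether that helps on this dataset is not something the notebook tests.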

[Download the data](https://disk.yandex.ru/d/uKVk2hXJGRSNOg) (275 MB)

### Output format

The file sample_submission.csv shows the format of the answer file to upload to the system. Because of how the metric-scoring script works, the rows of the output file submission.csv must keep the same order.
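
One way to guard against a shuffled row order is to re-align predictions to the ids of the sample file before saving. The sketch below uses small in-memory frames in place of the real files; the frame names and values are hypothetical:

```python
import pandas as pd

# Toy frames standing in for sample_submission.csv and a set of predictions.
sample = pd.DataFrame({"id": [16000, 16001, 16002], "topic_id": [0, 0, 0]})
preds = pd.DataFrame({"id": [16002, 16000, 16001], "topic_id": [7, 21, 26]})

# A left merge keeps the row order of the left frame, i.e. of the sample file.
submission = sample[["id"]].merge(preds, on="id", how="left")

assert submission["id"].tolist() == sample["id"].tolist()
print(submission["topic_id"].tolist())  # [21, 26, 7]
```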

### Notes

The solution quality metric is:

$$score = 19 \times \frac{submissionscore - baselinescore}{maxscore - baselinescore} + 1,$$

where:

- baselinescore = 0.4;

- submissionscore = the accuracy of your solution;

- maxscore = the maximum score over all submissions of all participants. During the contest this value is 1.0; once the contest ends, its actual value is computed and the points for all submissions are recalculated.
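
To get a feel for the scale, the formula can be evaluated directly. This is a small sketch; the function name `contest_score` is made up here, and 0.55 is just an example accuracy:

```python
def contest_score(accuracy, max_score=1.0, baseline_score=0.4):
    """Map accuracy to contest points using the formula above."""
    ratio = (accuracy - baseline_score) / (max_score - baseline_score)
    return 19 * ratio + 1

print(contest_score(0.4))             # baseline accuracy -> 1.0
print(contest_score(1.0))             # accuracy equal to max_score -> 20.0
print(round(contest_score(0.55), 2))  # example accuracy -> 5.75
```

So the score is 1 at the baseline and 20 when a submission matches the best one; anything below 0.4 accuracy drops under 1 point.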

In [1]:
import os
import requests
import numpy as np
import pandas as pd
import gc # garbage collector
import math
import matplotlib.pyplot as plt
import matplotlib.image as img
import seaborn as sns
%matplotlib inline

In [2]:
url_api = 'https://cloud-api.yandex.net/v1/disk/public/resources/download'
public_key = 'https://disk.yandex.ru/d/uKVk2hXJGRSNOg'
response = requests.get(url_api, params={'public_key': public_key})
url_to_download = response.json()['href']

In [3]:
download_response = requests.get(url_to_download)
with open('./2-vk-messages.zip', 'wb') as f:
    f.write(download_response.content)

In [4]:
!unzip ./2-vk-messages.zip -d ./

Archive:  ./2-vk-messages.zip
  inflating: ./2-vk-messages/sample_submission.csv  
  inflating: ./2-vk-messages/vk_test.csv  
  inflating: ./2-vk-messages/vk_train.csv  


In [5]:
os.chdir('./2-vk-messages')
os.getcwd()

'/content/2-vk-messages'

# Reading Data

### Train

In [6]:
train = pd.read_csv("vk_train.csv")
train.head()

Unnamed: 0,id,txt,img,topic_id
0,0,0.10971508 0.0061990465 -0.023294065 0.0773873...,0.22798721 0.0156203555 0.117552206 0.15211636...,26
1,1,0.029578514 0.041357093 -0.022272278 -0.063698...,0.29370075 0.49406192 0.6068035 0.41217968 0.0...,6
2,2,0.1352666 -0.0067131063 0.008000205 0.06661426...,0.48092997 0.94154936 0.24896349 1.2379668 0.6...,28
3,3,-0.0105937 0.020617455 -0.023282487 0.05311264...,0.014712747 0.6803736 0.7064364 0.108088344 0....,48
4,4,0.051762328 0.019998964 0.008127754 0.03517841...,0.25555038 0.3410854 0.3353518 0.5239908 0.580...,48


In [7]:
train = train.drop(['id'], axis=1)
train.head()

Unnamed: 0,txt,img,topic_id
0,0.10971508 0.0061990465 -0.023294065 0.0773873...,0.22798721 0.0156203555 0.117552206 0.15211636...,26
1,0.029578514 0.041357093 -0.022272278 -0.063698...,0.29370075 0.49406192 0.6068035 0.41217968 0.0...,6
2,0.1352666 -0.0067131063 0.008000205 0.06661426...,0.48092997 0.94154936 0.24896349 1.2379668 0.6...,28
3,-0.0105937 0.020617455 -0.023282487 0.05311264...,0.014712747 0.6803736 0.7064364 0.108088344 0....,48
4,0.051762328 0.019998964 0.008127754 0.03517841...,0.25555038 0.3410854 0.3353518 0.5239908 0.580...,48


In [8]:
train['txt'].to_csv(r'vk_train_txt.csv', index=None)
train['img'].to_csv(r'vk_train_img.csv', index=None)

### Test

In [9]:
test = pd.read_csv("vk_test.csv")
test.head()

Unnamed: 0,id,txt,img
0,16000,0.04256578 -0.0021561377 -0.010370443 -0.00041...,0.023715377 0.032524068 0.045921654 0.01448024...
1,16001,0.009802822 -0.014528753 0.05718356 0.02774015...,0.10767715 0.13333236 0.037194956 0.35785222 1...
2,16002,0.09662005 0.027566181 0.014350991 0.075628966...,0.29120436 2.0880947 0.3744759 0.13094811 0.25...
3,16003,0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0....,0.20930451 0.9066564 0.0059753037 0.4879898 0....
4,16004,0.031518105 -0.012642406 -0.0037361283 -0.0084...,0.09812577 0.6540897 0.20748469 0.6635425 0.23...


In [10]:
test['txt'].to_csv(r'vk_test_txt.csv', index=None)
test['img'].to_csv(r'vk_test_img.csv', index=None)

# Data transformation

### Train

In [11]:
import csv

# Expand the space-separated text embeddings into 300 numeric columns.
with open('vk_train_txt.csv', newline='') as src, \
     open('vk_train_txt_new.csv', 'w', newline='') as dst:
  reader = csv.reader(src, delimiter=',')
  writer = csv.writer(dst, delimiter=',')
  next(reader)  # skip the header row written by to_csv
  writer.writerow(['txt' + str(i) for i in range(300)])
  for row in reader:
    writer.writerow([float(value) for value in row[0].split()])

In [12]:
import csv

# Expand the space-separated image embeddings into 1024 numeric columns.
with open('vk_train_img.csv', newline='') as src, \
     open('vk_train_img_new.csv', 'w', newline='') as dst:
  reader = csv.reader(src, delimiter=',')
  writer = csv.writer(dst, delimiter=',')
  next(reader)  # skip the header row written by to_csv
  writer.writerow(['img' + str(i) for i in range(1024)])
  for row in reader:
    writer.writerow([float(value) for value in row[0].split()])

### Test

In [13]:
import csv

# Same expansion for the test text embeddings.
with open('vk_test_txt.csv', newline='') as src, \
     open('vk_test_txt_new.csv', 'w', newline='') as dst:
  reader = csv.reader(src, delimiter=',')
  writer = csv.writer(dst, delimiter=',')
  next(reader)  # skip the header row written by to_csv
  writer.writerow(['txt' + str(i) for i in range(300)])
  for row in reader:
    writer.writerow([float(value) for value in row[0].split()])

In [14]:
import csv

# Same expansion for the test image embeddings.
with open('vk_test_img.csv', newline='') as src, \
     open('vk_test_img_new.csv', 'w', newline='') as dst:
  reader = csv.reader(src, delimiter=',')
  writer = csv.writer(dst, delimiter=',')
  next(reader)  # skip the header row written by to_csv
  writer.writerow(['img' + str(i) for i in range(1024)])
  for row in reader:
    writer.writerow([float(value) for value in row[0].split()])
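
The four cells above round-trip through intermediate CSV files. The same expansion can also be done in memory. The following is a sketch under the assumption that every string in a column holds the same number of space-separated floats; the helper name `expand_embeddings` is hypothetical, and unlike the cells above it stores values as float32:

```python
import numpy as np
import pandas as pd

def expand_embeddings(column, prefix):
    """Turn a Series of space-separated floats into one numeric column per component."""
    matrix = np.array([[float(v) for v in s.split()] for s in column], dtype=np.float32)
    names = [f"{prefix}{i}" for i in range(matrix.shape[1])]
    return pd.DataFrame(matrix, columns=names, index=column.index)

# Toy example with 3-dimensional "embeddings"
toy = pd.Series(["0.1 0.2 0.3", "0.0 0.0 0.0"])
expanded = expand_embeddings(toy, "txt")
print(expanded.shape)           # (2, 3)
print(list(expanded.columns))   # ['txt0', 'txt1', 'txt2']
```

On the real data this would be called once per column, for example on `train['txt']` with prefix `'txt'` and on `train['img']` with prefix `'img'`.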

# Reading new data

In [15]:
train_txt = pd.read_csv("vk_train_txt_new.csv")
train_img = pd.read_csv("vk_train_img_new.csv")

In [16]:
X = pd.concat([train_txt, train_img], axis=1)
X.head()

Unnamed: 0,txt0,txt1,txt2,txt3,txt4,txt5,txt6,txt7,txt8,txt9,...,img1014,img1015,img1016,img1017,img1018,img1019,img1020,img1021,img1022,img1023
0,0.109715,0.006199,-0.023294,0.077387,0.071365,-0.103661,0.013797,0.020722,0.02089,0.001507,...,0.023197,0.078137,0.135272,0.055538,3.706026,0.761436,1.169066,0.128768,0.064016,0.033315
1,0.029579,0.041357,-0.022272,-0.063699,0.150385,0.033826,-0.019963,0.041946,-0.051351,-0.019872,...,0.153276,1.695928,0.039144,0.208006,2.041775,0.052166,0.050494,0.097997,0.771985,0.160381
2,0.135267,-0.006713,0.008,0.066614,0.083543,-0.000314,0.037447,0.06739,-0.067446,-0.0703,...,0.254442,0.331281,0.017118,1.122694,1.12985,0.482373,0.000908,1.294992,0.342763,1.397375
3,-0.010594,0.020617,-0.023282,0.053113,0.087416,-0.046398,-0.032336,0.053689,-0.067166,-0.101538,...,0.721198,0.17314,0.859713,0.080659,1.876263,0.579082,0.473624,0.280743,0.184486,0.302793
4,0.051762,0.019999,0.008128,0.035178,0.106786,-0.009019,-0.037549,0.01912,-0.090112,-0.04822,...,0.004694,0.271862,1.271133,1.215401,0.576827,0.315861,0.584772,0.38171,0.915396,0.949022


In [17]:
X.shape

(16000, 1324)

In [18]:
y = train['topic_id']

In [19]:
test_txt = pd.read_csv("vk_test_txt_new.csv")
test_img = pd.read_csv("vk_test_img_new.csv")

In [20]:
test_n = pd.concat([test_txt, test_img], axis=1)
test_n.head()

Unnamed: 0,txt0,txt1,txt2,txt3,txt4,txt5,txt6,txt7,txt8,txt9,...,img1014,img1015,img1016,img1017,img1018,img1019,img1020,img1021,img1022,img1023
0,0.042566,-0.002156,-0.01037,-0.000418,0.10597,-0.054699,0.060066,-0.037288,-0.160018,-0.09289,...,0.164414,0.152944,0.025623,0.338002,1.160025,0.210946,0.278895,0.884254,0.063972,1.641685
1,0.009803,-0.014529,0.057184,0.02774,0.093558,0.049662,0.015115,0.043358,-0.109056,-0.029175,...,0.042775,0.394536,0.058756,0.405004,3.42798,0.719489,1.350089,0.155992,0.50453,0.050563
2,0.09662,0.027566,0.014351,0.075629,0.102081,0.064648,-0.007737,0.026077,-0.071799,0.027189,...,0.018297,0.563718,0.045452,0.546598,0.545203,0.637313,1.243791,0.090779,0.352679,0.528879
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.110823,0.062481,0.041697,0.095695,0.237628,0.412328,0.06532,0.262349,0.201556,0.944138
4,0.031518,-0.012642,-0.003736,-0.008473,0.031408,-0.032983,-0.03352,-0.019718,0.025152,-0.013452,...,0.692442,0.908432,0.70857,0.352414,1.043778,1.456602,1.575651,0.209127,1.130022,0.659817


In [21]:
test_n.shape

(4000, 1324)

In [22]:
# for i in range(1024):
#   name = 'img' + str(i)
#   train[name] = train['img']

# Removing trash

In [22]:
def reduce_mem_usage(df):
    """ iterate through all the columns of a dataframe and modify the data type
        to reduce memory usage.        
    """
    start_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage of dataframe is {:.2f} MB'.format(start_mem))
    
    for col in df.columns:
        col_type = df[col].dtype
        
        if col_type != object:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)  
            else:
#                 if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
#                     df[col] = df[col].astype(np.float16)
#                 elif

                if c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
        else:
            df[col] = df[col].astype('category')

    end_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))
    print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))
    
    return df
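
For the expanded feature matrix, the float branch of this function is the one that matters: it stores float64 columns as float32. A rough back-of-the-envelope estimate of the effect, using the (16000, 1324) shape of `X` reported above and ignoring index overhead:

```python
import numpy as np

rows, cols = 16000, 1324  # shape of X reported above

# Bytes per value come from the dtype itself.
mb_float64 = rows * cols * np.dtype(np.float64).itemsize / 1024**2
mb_float32 = rows * cols * np.dtype(np.float32).itemsize / 1024**2

print(f"float64: {mb_float64:.1f} MB, float32: {mb_float32:.1f} MB")
```

The downcast halves the footprint at the cost of keeping roughly seven significant decimal digits per value.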

In [27]:
# train = reduce_mem_usage(train)

Memory usage of dataframe is 0.37 MB
Memory usage after optimization is: 1.30 MB
Decreased by -254.9%


In [28]:
# test = reduce_mem_usage(test)

Memory usage of dataframe is 0.09 MB
Memory usage after optimization is: 0.33 MB
Decreased by -260.2%


### Data dimension

In [23]:
train.shape

(16000, 3)

In [24]:
test.shape

(4000, 3)

### Data gaps

In [25]:
train.isna().sum().sum()

0

In [26]:
test.isna().sum().sum()

0

### Balancing classes

In [27]:
train['topic_id'].value_counts()

48    904
32    796
28    770
16    708
3     692
0     644
1     558
26    514
10    486
23    479
12    457
24    440
11    420
31    406
20    402
13    402
6     376
27    370
43    365
25    360
9     354
14    340
18    315
29    291
45    282
17    276
19    268
30    262
5     254
34    233
41    224
21    223
22    221
8     212
7     210
15    186
33    169
49    140
36    140
35    126
4     117
2      99
37     97
44     87
46     83
39     76
38     67
40     45
47     44
42     10
Name: topic_id, dtype: int64

### Removing the most unrepresentative class

In [None]:
# train.loc[train['Cover_Type'] == 5, 'Cover_Type']

In [None]:
# train.drop(labels = [3403875], axis = 0, inplace = True)

### Data types

In [None]:
# for i in X.dtypes[1320:]:
      # print(i)

### Correlations

In [None]:
plt.figure(figsize = (15,10))

sns.set(font_scale=1.4)

corr_matrix = train.corr()
#print(X.corr())
corr_matrix = np.round(corr_matrix, 2)
corr_matrix[np.abs(corr_matrix) < 0.1] = 0  # Check what happens when small correlations are zeroed out

sns.heatmap(corr_matrix, annot=True, linewidths=.5, cmap='coolwarm')

plt.title('Correlation matrix')
plt.show()

In [None]:
corr_with_target = train.corr().iloc[:-1, -1].sort_values(ascending=False)

plt.figure(figsize=(10, 8))

sns.barplot(x=corr_with_target.values, y=corr_with_target.index)

plt.title('Correlation with target variable')
plt.show()

# Training and model selection

In [None]:
# X = train.drop(['topic_id'], axis=1)
# y = train['topic_id']

### Normalization and standardization

In [22]:
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler

scaler = StandardScaler()
# Fit on the training data
scaler.fit(X)
# Transform both the training and testing data
X_standar = scaler.transform(X)
test_standar = scaler.transform(test_n)
print(X_standar)

[[ 1.31126604 -0.43466494 -0.93480838 ... -0.75264692 -0.87184224
  -0.87342014]
 [-0.17652881  0.33473891 -0.91090049 ... -0.81654107  0.82901187
  -0.61331421]
 [ 1.78564899 -0.71723639 -0.20258133 ...  1.66901358 -0.20216877
   1.91882826]
 ...
 [ 0.9287709   1.75162685  0.66199262 ...  0.17020399  0.71337245
   0.28831513]
 [-0.41900913  1.74462392 -1.12058767 ... -0.91198467 -0.76252019
  -0.87303639]
 [-0.64809792  1.77988289  1.61192429 ... -1.02003246 -0.58819033
  -0.75712352]]


In [23]:
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler

# Create the scaler object with a range of 0-1
scaler = MinMaxScaler(feature_range=(0, 1))
# Fit on the training data
scaler.fit(X_standar)
# Transform both the training and testing data
X_norm = scaler.transform(X_standar)
test_norm = scaler.transform(test_standar)
print(X_norm)

[[0.65216375 0.4246169  0.36024603 ... 0.03188899 0.01439903 0.00673107]
 [0.45357018 0.52252847 0.3633343  ... 0.02426883 0.1736415  0.03240427]
 [0.71548525 0.38865787 0.45483046 ... 0.32070155 0.07709721 0.28233406]
 ...
 [0.6011076  0.70283654 0.56651062 ... 0.14195022 0.16281476 0.1213977 ]
 [0.42120347 0.70194537 0.33624825 ... 0.01288602 0.0246343  0.00676894]
 [0.39062428 0.7064323  0.68921674 ... 0.         0.04095594 0.01820988]]
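
Both scalers apply a per-feature affine map with a positive slope, so min-max scaling the standardized data should give the same result as min-max scaling the raw data (as long as no feature is constant). A small NumPy sketch of that observation on random toy data; the helper functions are hand-written stand-ins, not the scikit-learn classes:

```python
import numpy as np

rng = np.random.default_rng(0)
X_fit = rng.normal(loc=5.0, scale=3.0, size=(100, 4))  # toy "training" features

def standardize(X, ref):
    # z-score using statistics of the reference (training) data only
    return (X - ref.mean(axis=0)) / ref.std(axis=0)

def min_max(X, ref):
    # rescale so the reference data spans [0, 1] in every column
    lo, hi = ref.min(axis=0), ref.max(axis=0)
    return (X - lo) / (hi - lo)

standardized = standardize(X_fit, X_fit)
chained = min_max(standardized, standardized)
direct = min_max(X_fit, X_fit)

print(np.allclose(chained, direct))  # True
```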


### Balancing data

In [37]:
from imblearn.over_sampling import SMOTE, ADASYN, KMeansSMOTE

# Named `smote` so the `os` module imported earlier is not shadowed
smote = SMOTE(random_state=0)
# Alternative oversamplers, defined but not used below
adasyn = ADASYN(random_state=0, n_neighbors=1)
kmeans_smote = KMeansSMOTE(random_state=0, k_neighbors=1, sampling_strategy='minority', kmeans_estimator=1)
# feature vector
#X_train_full = train.drop(['topic_id'], axis=1) 
# target variable vector
#y_train_full = train['topic_id']
X_train_full = X.copy()
column = X_train_full.columns
y_train_full = y.copy()
print("Before balancing")
print(X_train_full.shape)
print(y_train_full.value_counts())

# Let's apply the balancing algorithm
os_data_X, os_data_y = smote.fit_resample(X_train_full, y_train_full)
os_data_X = pd.DataFrame(data=os_data_X, columns=column)
os_data_y = pd.DataFrame(data=os_data_y, columns=['topic_id'])

print('_'*100)
print("After balancing")
print(os_data_X.shape)
print(os_data_y.value_counts())

Before balancing
(16000, 1324)
48    904
32    796
28    770
16    708
3     692
0     644
1     558
26    514
10    486
23    479
12    457
24    440
11    420
31    406
20    402
13    402
6     376
27    370
43    365
25    360
9     354
14    340
18    315
29    291
45    282
17    276
19    268
30    262
5     254
34    233
41    224
21    223
22    221
8     212
7     210
15    186
33    169
49    140
36    140
35    126
4     117
2      99
37     97
44     87
46     83
39     76
38     67
40     45
47     44
42     10
Name: topic_id, dtype: int64
____________________________________________________________________________________________________
After balancing
(45200, 1324)
topic_id
0           904
37          904
27          904
28          904
29          904
30          904
31          904
32          904
33          904
34          904
35          904
36          904
38          904
1           904
39          904
40          904
41          904
42          904
43        
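
The balanced row count matches 50 classes at the size of the largest one (904 rows each). SMOTE fills the gap by interpolating between a minority sample and one of its nearest same-class neighbours. The sketch below illustrates a single interpolation step on toy points; it is a simplified illustration with a hypothetical helper name, not the imbalanced-learn implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def smote_like_sample(minority, k=2):
    """Create one synthetic point between a random minority sample and one of its k nearest neighbours."""
    i = rng.integers(len(minority))
    dists = np.linalg.norm(minority - minority[i], axis=1)
    neighbours = np.argsort(dists)[1:k + 1]   # skip position 0, the point itself
    j = rng.choice(neighbours)
    lam = rng.random()                        # interpolation weight in [0, 1)
    return minority[i] + lam * (minority[j] - minority[i])

minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
point = smote_like_sample(minority)
print(point)

print(50 * 904)  # 45200, the row count after balancing
```

Because synthetic rows are built from real ones, oversampling is usually applied to the training part only, after the train/validation split, so that validation rows do not influence the synthetic samples.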

### Choosing a model

In [22]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42) #X_norm
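
With classes as small as topic 42 (10 rows), a purely random 70/30 split can leave very few examples of a class on one side. scikit-learn's `train_test_split` accepts a `stratify=y` argument for this. The idea behind stratification can be sketched by hand with NumPy; the function below is a hypothetical illustration, not the scikit-learn implementation:

```python
import numpy as np

def stratified_split(y, test_size=0.3, seed=42):
    """Return train/test index arrays that keep each class's share roughly equal."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    train_idx, test_idx = [], []
    for label in np.unique(y):
        idx = rng.permutation(np.flatnonzero(y == label))
        n_test = max(1, int(round(test_size * len(idx))))
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return np.array(train_idx), np.array(test_idx)

# Toy labels: a common class and a rare one with 10 samples, like topic 42
y_toy = np.array([0] * 90 + [1] * 10)
train_idx, test_idx = stratified_split(y_toy)
print(np.bincount(y_toy[test_idx]).tolist())  # [27, 3]
```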

### Support Vector Machines

In [None]:
from sklearn.svm import SVC

svc = SVC()
svc.fit(X_train, y_train)
y_pred_test = svc.predict(X_test)

print(accuracy_score(y_test, y_pred_test))

In [None]:
svc_f = SVC()
svc_f.fit(X, y)
y_pred = svc_f.predict(test_n)  # test_n holds the expanded numeric features

### Random Forest

In [None]:
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=1000)
clf.fit(X_train, y_train)
y_pred_test = clf.predict(X_test)

print(accuracy_score(y_test, y_pred_test))

In [None]:
clf_f = RandomForestClassifier(n_estimators=1000)
clf_f.fit(X, y)
y_pred = clf_f.predict(test_n)  # test_n holds the expanded numeric features

### Extreme Gradient Boosting

In [None]:
import xgboost as xgb

xgb1 = xgb.XGBClassifier(max_depth = 15, min_child_weight = 0.07, learning_rate = 0.01, tree_method='gpu_hist')
xgb1.fit(X_train, y_train)
y_pred_test = xgb1.predict(X_test)

print(accuracy_score(y_test, y_pred_test))

In [None]:
xgb_f = xgb.XGBClassifier(max_depth = 15, min_child_weight = 0.07, learning_rate = 0.01, tree_method='gpu_hist')
xgb_f.fit(X, y)
y_pred = xgb_f.predict(test_n)  # test_n holds the expanded numeric features

### Catboost

In [23]:
!pip install catboost

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting catboost
  Downloading catboost-1.1.1-cp39-none-manylinux1_x86_64.whl (76.6 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m76.6/76.6 MB[0m [31m13.5 MB/s[0m eta [36m0:00:00[0m
Installing collected packages: catboost
Successfully installed catboost-1.1.1


In [24]:
from catboost import CatBoostClassifier

catboost = CatBoostClassifier(n_estimators=1000, learning_rate=0.03,
                              depth=8, max_ctr_complexity=1,
                              max_bin=64, task_type='GPU')  

In [25]:
catboost.fit(X_train, y_train, verbose=True)

0:	learn: 3.8500589	total: 588ms	remaining: 9m 47s
1:	learn: 3.7966724	total: 1.05s	remaining: 8m 46s
2:	learn: 3.7439732	total: 1.52s	remaining: 8m 27s
3:	learn: 3.7008430	total: 2s	remaining: 8m 18s
4:	learn: 3.6581044	total: 2.51s	remaining: 8m 18s
5:	learn: 3.6203191	total: 3s	remaining: 8m 16s
6:	learn: 3.5841207	total: 3.45s	remaining: 8m 8s
7:	learn: 3.5478062	total: 3.93s	remaining: 8m 7s
8:	learn: 3.5135013	total: 4.49s	remaining: 8m 14s
9:	learn: 3.4820281	total: 5.07s	remaining: 8m 21s
10:	learn: 3.4502769	total: 5.53s	remaining: 8m 17s
11:	learn: 3.4185889	total: 6.04s	remaining: 8m 17s
12:	learn: 3.3874808	total: 6.56s	remaining: 8m 17s
13:	learn: 3.3582966	total: 7.08s	remaining: 8m 18s
14:	learn: 3.3319441	total: 7.65s	remaining: 8m 22s
15:	learn: 3.3046955	total: 8.26s	remaining: 8m 28s
16:	learn: 3.2844388	total: 8.88s	remaining: 8m 33s
17:	learn: 3.2599979	total: 9.52s	remaining: 8m 39s
18:	learn: 3.2369629	total: 10.2s	remaining: 8m 46s
19:	learn: 3.2140426	total: 10

<catboost.core.CatBoostClassifier at 0x7fb49e16ee20>

In [26]:
y_pred_test = catboost.predict(X_test)

print(accuracy_score(y_test, y_pred_test))

0.55


In [27]:
y_pred = catboost.predict(test_n)

### LightGBM

In [None]:
from lightgbm import LGBMClassifier

logreg = LGBMClassifier(#max_depth=2,
                      max_depth=11,#11 
                      n_estimators=326,#326
                      random_state=53,
                      #objective = 'gamma',#gamma
                      # min_data_in_leaf = 27)#27)
)
logreg.fit(X_train, y_train)
y_pred_test = logreg.predict(X_test)

print(accuracy_score(y_test, y_pred_test))

In [None]:
logreg_f = LGBMClassifier(#max_depth=2,
                      max_depth=11,#11 
                      n_estimators=326,#326
                      random_state=53,
                      #objective = 'gamma',#gamma
                      # min_data_in_leaf = 27)#27)
)
logreg_f.fit(X, y)
y_pred = logreg_f.predict(test_n)  # test_n holds the expanded numeric features

### Gradient boosting

In [None]:
from sklearn.ensemble import GradientBoostingClassifier

gbc = GradientBoostingClassifier(n_estimators=100, learning_rate=1.0, max_depth=1, random_state=0)
gbc.fit(X_train, y_train)
y_pred_test = gbc.predict(X_test)

print(accuracy_score(y_test, y_pred_test))

In [None]:
gbc_f = GradientBoostingClassifier(n_estimators=100, learning_rate=1.0, max_depth=1, random_state=0)
gbc_f.fit(X, y)
y_pred = gbc_f.predict(test_n)  # test_n holds the expanded numeric features

### Neural network

In [None]:
from keras import models
from keras import layers
from keras.wrappers.scikit_learn import KerasClassifier

In [None]:
y_train

In [None]:
# topic_id already starts at 0, so the labels need no shift
y_train_n = y_train.copy()
y_test_n = y_test.copy()
y_train_n

In [None]:
y_train_n.unique()

In [None]:
from sklearn.preprocessing import OneHotEncoder

In [None]:
y_train_n = np.array(y_train_n).reshape((-1, 1))

encoder = OneHotEncoder(categories='auto')
y_train_n = encoder.fit_transform(y_train_n).toarray()

y_train_n

In [None]:
y_test_n = np.array(y_test_n).reshape((-1, 1))

# Reuse the encoder fitted on the training labels so the one-hot columns line up
y_test_n = encoder.transform(y_test_n).toarray()

y_test_n

In [None]:
network = models.Sequential()
network.add(layers.Dense(units=20, activation="relu", input_shape=(1324,)))  # 300 text + 1024 image features
network.add(layers.Dense(units=256, activation="relu"))
#network.add(layers.Dense(units=10, activation="relu"))
network.add(layers.Dense(units=1024, activation="relu"))
#network.add(layers.Dense(units=1, activation="tanh"))
network.add(layers.Dense(units=128, activation="relu"))
network.add(layers.Dense(units=50, activation="softmax"))  # one probability per topic

In [None]:
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau
early_stopping = EarlyStopping(
    monitor='accuracy', 
    min_delta=0, 
    patience=100, 
    verbose=0,
    mode='max',  # accuracy is maximized, so higher is better
    baseline=None, 
    restore_best_weights=True
)

reduce_lr = ReduceLROnPlateau(
    monitor='accuracy', 
    factor=0.2,
    patience=200,
    mode='max'  # accuracy is maximized, so higher is better
)
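
The `mode` argument decides which direction counts as an improvement of the monitored metric. A deliberately simplified, pure-Python sketch of patience-based stopping (it ignores `min_delta`, `baseline` and weight restoring, and the function name is hypothetical) shows why the direction matters for a metric such as accuracy:

```python
def stop_epoch(history, patience, mode="max"):
    """Return the epoch at which training would stop, or None if it never stops."""
    best, best_epoch = None, 0
    for epoch, value in enumerate(history):
        improved = best is None or (value > best if mode == "max" else value < best)
        if improved:
            best, best_epoch = value, epoch
        elif epoch - best_epoch >= patience:
            return epoch
    return None

accuracy = [0.30, 0.40, 0.45, 0.44, 0.43, 0.42]
print(stop_epoch(accuracy, patience=2, mode="max"))  # 4: stops after accuracy stalls
print(stop_epoch(accuracy, patience=2, mode="min"))  # 2: stops while accuracy is still rising
```

Note also that with `patience=100` and `patience=200` neither callback can trigger within the 30 epochs used in the training cell.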

In [None]:
network.compile(
    loss="categorical_crossentropy",
    optimizer="adam",
    metrics=["accuracy"]
)

In [None]:
history = network.fit(
    X_train,
    y_train_n,
    epochs=30,
    verbose=1,
    batch_size=100,
    validation_data=(X_test, y_test_n),
        callbacks=[
            early_stopping,
            reduce_lr,
        ])

In [None]:
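from keras.utils import to_categorical

history = network.fit(
    X,
    to_categorical(y, num_classes=50),  # one-hot targets for categorical_crossentropy
    epochs=18,
    verbose=1,
    batch_size=100,
        callbacks=[
            early_stopping,
            reduce_lr,
        ])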

In [None]:
y_pred = network.predict(test_n)  # test_n holds the expanded numeric features
y_pred

# One more neural network

In [None]:
import torch
import random
import numpy as np

random.seed(0)
np.random.seed(0)
torch.manual_seed(0)
torch.cuda.manual_seed(0)
torch.backends.cudnn.deterministic = True

In [None]:
X_norm = X_train_full
y = y_train_full

In [None]:
X_train = X_norm.to_numpy()
X_train = torch.from_numpy(X_train).float()

In [None]:
train['topic_id'].unique()

array([26,  6, 28, 48, 12, 19, 29, 10, 43, 32,  1, 30, 27, 31, 15, 46, 20,
        3,  0, 21, 13, 23, 18, 39, 49,  5, 45, 16, 33,  4, 25, 17, 11, 35,
       14, 34, 37, 38, 24, 44,  9, 36,  7, 41,  8, 22,  2, 47, 40, 42])

In [None]:
y_train = y.to_numpy()
y_train = torch.from_numpy(y_train)

In [None]:
class ClassifierNet(torch.nn.Module):

    def __init__(self, n_hidden_neurons, p):
        super(ClassifierNet, self).__init__()

        self.fc1 = torch.nn.Linear(1324, n_hidden_neurons)
        self.ac1 = torch.nn.ReLU()
        self.dr1 = torch.nn.Dropout(p)
        self.batch_norm1 = torch.nn.BatchNorm1d(n_hidden_neurons)


        self.fc2 = torch.nn.Linear(n_hidden_neurons, n_hidden_neurons)
        self.ac2 = torch.nn.ReLU()
        self.dr2 = torch.nn.Dropout(p)
        self.batch_norm2 = torch.nn.BatchNorm1d(n_hidden_neurons)

        self.fc3 = torch.nn.Linear(n_hidden_neurons, n_hidden_neurons)
        self.ac3 = torch.nn.ReLU()
        self.dr3 = torch.nn.Dropout(p)
        self.batch_norm3 = torch.nn.BatchNorm1d(n_hidden_neurons)
        

        self.fc4 = torch.nn.Linear(n_hidden_neurons, n_hidden_neurons // 2)
        self.ac4 = torch.nn.ReLU()
        self.dr4 = torch.nn.Dropout(p)
        self.batch_norm4 = torch.nn.BatchNorm1d(n_hidden_neurons // 2)
        

        self.fc5 = torch.nn.Linear(n_hidden_neurons // 2, n_hidden_neurons // 4)
        
        self.ac5 = torch.nn.ReLU()
        self.dr5 = torch.nn.Dropout(p)
        self.batch_norm5 = torch.nn.BatchNorm1d(n_hidden_neurons // 4)

        self.fc6 = torch.nn.Linear(n_hidden_neurons // 4, 50)
    

    def forward(self, x):
        x = self.fc1(x)
        x = self.ac1(x)
        x = self.dr1(x) 
        x = self.batch_norm1(x)
        x = self.fc2(x)
        x = self.ac2(x)
        x = self.dr2(x) 
        x = self.batch_norm2(x)
        x = self.fc3(x)
        x = self.ac3(x)
        x = self.dr3(x) 
        x = self.batch_norm3(x)
        x = self.fc4(x)
        x = self.ac4(x)
        x = self.dr4(x) 
        x = self.batch_norm4(x)
        x = self.fc5(x)
        x = self.ac5(x)
        x = self.dr5(x)
        x = self.batch_norm5(x)
        x = self.fc6(x)
        
        return x

net = ClassifierNet(1024, 0.4)
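
For a sense of scale, the number of trainable parameters of `ClassifierNet(1024, 0.4)` can be worked out from the layer widths alone. This is plain arithmetic over the sizes defined in the class above, not a call into PyTorch:

```python
# Layer widths of ClassifierNet(1024, 0.4): input, five hidden layers, output
sizes = [1324, 1024, 1024, 1024, 512, 256, 50]

# Each Linear layer has in*out weights plus out biases.
linear_params = sum(n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:]))

# Each BatchNorm1d layer has a learnable scale and shift per feature.
batchnorm_params = sum(2 * n for n in sizes[1:-1])

print(linear_params, batchnorm_params, linear_params + batchnorm_params)
# 4124978 7680 4132658
```

That is roughly 258 parameters per training row for the 16000-row training set, which gives some context for the fairly strong dropout of 0.4.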

In [None]:
torch.cuda.is_available()

True

In [None]:
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
net = net.to(device) 

In [None]:
loss = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(net.parameters(), lr=1.0e-3)

In [None]:
batch_size = 250

for epoch in range(100):
    order = np.random.permutation(len(X_train))
    for start_index in range(0, len(X_train), batch_size):
        optimizer.zero_grad()
        
        batch_indexes = order[start_index:start_index+batch_size]
        
        X_batch = X_train[batch_indexes].to(device)
        y_batch = y_train[batch_indexes].to(device)
        
        preds = net.forward(X_batch)
        
        loss_value = loss(preds, y_batch)
        loss_value.backward()
        
        optimizer.step()
    if epoch % 5 == 0:
        print(epoch, loss_value, (preds.argmax(dim=1) == y_batch).float().mean())

0 tensor(2.7918, device='cuda:0', grad_fn=<NllLossBackward0>) tensor(0.2760, device='cuda:0')
5 tensor(2.0090, device='cuda:0', grad_fn=<NllLossBackward0>) tensor(0.4320, device='cuda:0')
10 tensor(1.9914, device='cuda:0', grad_fn=<NllLossBackward0>) tensor(0.4800, device='cuda:0')
15 tensor(1.6348, device='cuda:0', grad_fn=<NllLossBackward0>) tensor(0.5480, device='cuda:0')
20 tensor(1.5505, device='cuda:0', grad_fn=<NllLossBackward0>) tensor(0.5920, device='cuda:0')
25 tensor(1.4542, device='cuda:0', grad_fn=<NllLossBackward0>) tensor(0.5680, device='cuda:0')
30 tensor(1.3985, device='cuda:0', grad_fn=<NllLossBackward0>) tensor(0.6120, device='cuda:0')
35 tensor(1.2346, device='cuda:0', grad_fn=<NllLossBackward0>) tensor(0.6400, device='cuda:0')
40 tensor(1.1280, device='cuda:0', grad_fn=<NllLossBackward0>) tensor(0.6440, device='cuda:0')
45 tensor(1.3020, device='cuda:0', grad_fn=<NllLossBackward0>) tensor(0.6400, device='cuda:0')
50 tensor(1.3298, device='cuda:0', grad_fn=<NllLossB

In [None]:
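net.eval()  # switch off dropout and use the running batch-norm statistics
with torch.no_grad():
    y_pred = net(torch.from_numpy(test_n.to_numpy()).float().to(device))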

In [None]:
y_pred = y_pred.argmax(dim=1)

In [None]:
y_pred = y_pred.cpu()

In [None]:
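# Labels here already run from 0 to 49, so no shift is needed;
# the adjustment below appears to be left over from a different task.
# y_pred += 1
# y_test[y_test > 4] += 1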

### Forming an answer

In [28]:
indexs = [i for i in range(16000, 20000)]

In [29]:
y_pred = y_pred.tolist()

In [30]:
for i in range(len(y_pred)):
    y_pred[i] = y_pred[i][0]  

In [31]:
d = {'id': indexs, 'topic_id': y_pred}
test_pred = pd.DataFrame(data=d)
test_pred

Unnamed: 0,id,topic_id
0,16000,21
1,16001,26
2,16002,16
3,16003,45
4,16004,32
...,...,...
3995,19995,34
3996,19996,10
3997,19997,26
3998,19998,16


In [32]:
test_pred.to_csv(r'submission.csv', index=None)