## Фреймворк PyTorch для разработки искусственных нейронных сетей.

### Урок 3. Dataset, Dataloader, BatchNorm, Dropout, Оптимизация.

**Будем практиковаться на датасете: https://www.kaggle.com/c/avito-demand-prediction**

**Ваша задача:**

**1. Создать Dataset для загрузки данных (используем только числовые данные)**

**2. Обернуть его в Dataloader**

**3. Написать архитектуру сети, которая предсказывает число показов на основании числовых данных (вы всегда можете нагенерить дополнительных факторов). Сеть должна включать BatchNorm слои и Dropout (или НЕ включать, но нужно обосновать)**

**4. Учить будем на функцию потерь с кагла (log RMSE) - нужно её реализовать**

**5. Сравните сходимость Adam, RMSProp и SGD, сделайте вывод по качеству работы модели**

**train-test разделение нужно сделать с помощью sklearn random_state=13, test_size = 0.25**

In [1]:
import numpy as np

import pandas as pd

import torch
import torchvision
from torch import nn
import torch.nn.functional as F
from torch import optim
import torchvision.transforms as transforms

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

from tqdm import tqdm_notebook

import seaborn as sns

import matplotlib.pyplot as plt
%matplotlib inline

import warnings
warnings.filterwarnings('ignore')

In [2]:
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

device

device(type='cuda', index=0)

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [4]:
train = pd.read_csv('/content/drive/MyDrive/PyTorch/train.csv')
train.head()

Unnamed: 0,item_id,user_id,region,city,parent_category_name,category_name,param_1,param_2,param_3,title,description,price,item_seq_number,activation_date,user_type,image,image_top_1,deal_probability
0,b912c3c6a6ad,e00f8ff2eaf9,Свердловская область,Екатеринбург,Личные вещи,Товары для детей и игрушки,Постельные принадлежности,,,Кокоби(кокон для сна),"Кокон для сна малыша,пользовались меньше месяц...",400.0,2,2017-03-28,Private,d10c7e016e03247a3bf2d13348fe959fe6f436c1caf64c...,1008.0,0.12789
1,2dac0150717d,39aeb48f0017,Самарская область,Самара,Для дома и дачи,Мебель и интерьер,Другое,,,Стойка для Одежды,"Стойка для одежды, под вешалки. С бутика.",3000.0,19,2017-03-26,Private,79c9392cc51a9c81c6eb91eceb8e552171db39d7142700...,692.0,0.0
2,ba83aefab5dc,91e2f88dd6e3,Ростовская область,Ростов-на-Дону,Бытовая электроника,Аудио и видео,"Видео, DVD и Blu-ray плееры",,,Philips bluray,"В хорошем состоянии, домашний кинотеатр с blu ...",4000.0,9,2017-03-20,Private,b7f250ee3f39e1fedd77c141f273703f4a9be59db4b48a...,3032.0,0.43177
3,02996f1dd2ea,bf5cccea572d,Татарстан,Набережные Челны,Личные вещи,Товары для детей и игрушки,Автомобильные кресла,,,Автокресло,Продам кресло от0-25кг,2200.0,286,2017-03-25,Company,e6ef97e0725637ea84e3d203e82dadb43ed3cc0a1c8413...,796.0,0.80323
4,7c90be56d2ab,ef50846afc0b,Волгоградская область,Волгоград,Транспорт,Автомобили,С пробегом,ВАЗ (LADA),2110.0,"ВАЗ 2110, 2003",Все вопросы по телефону.,40000.0,3,2017-03-16,Private,54a687a3a0fc1d68aed99bdaaf551c5c70b761b16fd0a2...,2264.0,0.20797


In [5]:
train.activation_date = pd.to_datetime(train.activation_date)

train['day_of_month'] = train.activation_date.apply(lambda x: x.day)
train['day_of_week'] = train.activation_date.apply(lambda x: x.weekday())

In [6]:
agg_cols = ['region', 'category_name',
            'city', 'user_type']
            
for col in tqdm_notebook(agg_cols):
    gp = train.groupby(col)['deal_probability']
    mean = gp.mean()
    train[col + '_deal_probability_avg'] = train[col].map(mean)

  0%|          | 0/4 [00:00<?, ?it/s]

In [7]:
train = train.drop(['city', 'category_name', 'user_id', 'description', 'image', 'parent_category_name', 'region',
                    'item_id', 'param_1', 'param_2', 'param_3', 'title', 'user_type', 'activation_date'], axis=1)

for col in train.columns:
    if train[col].isna().sum() > 0:
        # train.drop(columns=col, inplace=True)
        train[col].fillna(train[col].median(), inplace=True)

In [8]:
display(train.head(), train.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1503424 entries, 0 to 1503423
Data columns (total 10 columns):
 #   Column                              Non-Null Count    Dtype  
---  ------                              --------------    -----  
 0   price                               1503424 non-null  float64
 1   item_seq_number                     1503424 non-null  int64  
 2   image_top_1                         1503424 non-null  float64
 3   deal_probability                    1503424 non-null  float64
 4   day_of_month                        1503424 non-null  int64  
 5   day_of_week                         1503424 non-null  int64  
 6   region_deal_probability_avg         1503424 non-null  float64
 7   category_name_deal_probability_avg  1503424 non-null  float64
 8   city_deal_probability_avg           1503424 non-null  float64
 9   user_type_deal_probability_avg      1503424 non-null  float64
dtypes: float64(7), int64(3)
memory usage: 114.7 MB


Unnamed: 0,price,item_seq_number,image_top_1,deal_probability,day_of_month,day_of_week,region_deal_probability_avg,category_name_deal_probability_avg,city_deal_probability_avg,user_type_deal_probability_avg
0,400.0,2,1008.0,0.12789,28,1,0.122004,0.198445,0.123397,0.149557
1,3000.0,19,692.0,0.0,26,6,0.136721,0.191848,0.1394,0.149557
2,4000.0,9,3032.0,0.43177,20,0,0.135944,0.171572,0.124881,0.149557
3,2200.0,286,796.0,0.80323,25,5,0.142602,0.198445,0.135031,0.124513
4,40000.0,3,2264.0,0.20797,16,3,0.145908,0.278427,0.137275,0.149557


None

In [9]:
train, test = train_test_split(train, test_size=0.25, random_state=13)
train.head()

Unnamed: 0,price,item_seq_number,image_top_1,deal_probability,day_of_month,day_of_week,region_deal_probability_avg,category_name_deal_probability_avg,city_deal_probability_avg,user_type_deal_probability_avg
31723,1300.0,6,1056.0,0.0,23,3,0.147066,0.17848,0.189069,0.149557
929954,1300.0,16,399.0,0.0,28,1,0.122004,0.046447,0.123397,0.149557
1143227,140000.0,1,1055.0,0.7376,24,4,0.122004,0.278427,0.123397,0.149557
484798,10000.0,1,2918.0,0.33973,15,2,0.141007,0.186176,0.133636,0.149557
898764,53000.0,1,2264.0,0.47553,21,1,0.135944,0.278427,0.124881,0.149557


In [10]:
train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1127568 entries, 31723 to 1015882
Data columns (total 10 columns):
 #   Column                              Non-Null Count    Dtype  
---  ------                              --------------    -----  
 0   price                               1127568 non-null  float64
 1   item_seq_number                     1127568 non-null  int64  
 2   image_top_1                         1127568 non-null  float64
 3   deal_probability                    1127568 non-null  float64
 4   day_of_month                        1127568 non-null  int64  
 5   day_of_week                         1127568 non-null  int64  
 6   region_deal_probability_avg         1127568 non-null  float64
 7   category_name_deal_probability_avg  1127568 non-null  float64
 8   city_deal_probability_avg           1127568 non-null  float64
 9   user_type_deal_probability_avg      1127568 non-null  float64
dtypes: float64(7), int64(3)
memory usage: 94.6 MB


In [11]:
class DFDataset(torch.utils.data.Dataset):
    def __init__(self, df, normalize=False, fit_scaler=False):
        self.df = df.copy()
        self.normalize = normalize
        self.scaler = MinMaxScaler()
        self.fit_scaler = fit_scaler
        self.sc_fl = 0

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        label = self.df.iloc[idx, -1:]
        if self.fit_scaler:
            self.scaler.fit(df.iloc[:, :-1])
            self.sc_fl = 1

        if self.normalize and sc_fl:
            df = scaler.transform(df)

        tensor = torch.FloatTensor(self.df.iloc[idx, 1:-1].values)
        label = torch.FloatTensor(label.values)

        return tensor, label

In [12]:
train_dataset = DFDataset(train)

test_dataset = DFDataset(test)

In [13]:
train_loader = torch.utils.data.DataLoader(train_dataset,
                                           batch_size=2048,
                                           shuffle=True,
                                           num_workers=4)

test_loader = torch.utils.data.DataLoader(test_dataset,
                                          batch_size=2048,
                                          shuffle=True,
                                          num_workers=4)

In [14]:
class FFNet(nn.Module):
    def __init__(self, input_dim, hidden_dim):
        super(FFNet, self).__init__()
        self.bn1 = nn.BatchNorm1d(input_dim)
        self.fc1 = nn.Linear(input_dim, 5*hidden_dim)
        self.dp1 = nn.Dropout(0.40)

        self.bn2 = nn.BatchNorm1d(5*hidden_dim)
        self.fc2 = nn.Linear(5*hidden_dim, 2*hidden_dim)
        self.dp2 = nn.Dropout(0.15)

        self.bn4 = nn.BatchNorm1d(2*hidden_dim)
        self.fc4 = nn.Linear(2*hidden_dim, 1)

    def forward(self, x):
        x = self.bn1(x)
        x = self.fc1(x)
        x = F.tanh(x)
        x = self.dp1(x)

        x = self.bn2(x)
        x = self.fc2(x)
        x = F.tanh(x)
        x = self.dp2(x)

        x = self.bn4(x)
        x = self.fc4(x)
        x = F.sigmoid(x)

        return x

In [15]:
def rmsle_loss(y_pred, y_true):
    loss = torch.sqrt(torch.mean((torch.log(y_pred+1)-torch.log(y_true+1))**2))
    
    return loss

In [16]:
EPOCHES = 1

LR = 0.001

In [17]:
def train_loop(tr_dataloader, ev_dataloader, model, optimizer, history):

    size = len(tr_dataloader.dataset)
    for batch, (X, y) in enumerate(tr_dataloader):
        # Compute prediction and loss
        pred = model(X.to(device))
        loss = rmsle_loss(pred, y.to(device))

        # Backpropagation
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        if batch % 50 == 0:
            current = batch * len(X)
            history['train'].append(loss)
            test_loss = eval_loop(ev_dataloader, model)
            history['eval'].append(test_loss)
            print(f"loss: {loss:>7f}, Avg test loss: {test_loss:>8f}  [{current:>5d}/{size:>5d}]")

    return history

def eval_loop(ev_dataloader, model):
    
    size = len(ev_dataloader.dataset)
    num_batches = len(ev_dataloader)
    test_loss = 0

    with torch.no_grad():
        for X, y in ev_dataloader:
            pred = model(X.to(device))
            test_loss += rmsle_loss(pred, y.to(device))

    test_loss /= num_batches
    
    return test_loss

## ADAM

In [18]:
model = FFNet(8, 5).to(device)

In [19]:
optimizer = torch.optim.Adam(model.parameters(), lr=LR)

In [20]:
history_3 = {'train': [], 'eval': []}

for t in tqdm_notebook(range(EPOCHES)):
    print(f"Epoch {t+1}\n-------------------------------")
    history_3 = train_loop(train_loader,
                           test_loader,
                           model,
                           optimizer,
                           history_3
                          )
print("Finish")

  0%|          | 0/1 [00:00<?, ?it/s]

Epoch 1
-------------------------------
loss: 0.315760, Avg test loss: 0.315081  [    0/1127568]
loss: 0.289158, Avg test loss: 0.288608  [102400/1127568]
loss: 0.253682, Avg test loss: 0.253143  [204800/1127568]
loss: 0.199260, Avg test loss: 0.197519  [307200/1127568]
loss: 0.128776, Avg test loss: 0.127945  [409600/1127568]
loss: 0.066238, Avg test loss: 0.066138  [512000/1127568]
loss: 0.040465, Avg test loss: 0.040890  [614400/1127568]
loss: 0.035930, Avg test loss: 0.035533  [716800/1127568]
loss: 0.030600, Avg test loss: 0.031396  [819200/1127568]
loss: 0.026285, Avg test loss: 0.027789  [921600/1127568]
loss: 0.023880, Avg test loss: 0.024666  [1024000/1127568]
loss: 0.020774, Avg test loss: 0.021978  [642400/1127568]
Finish


## RMSProp

In [21]:
model = FFNet(8, 5).to(device)

In [22]:
optimizer = torch.optim.RMSprop(model.parameters(), lr=LR)

In [23]:
history_2 = {'train': [], 'eval': []}

for t in tqdm_notebook(range(EPOCHES)):
    print(f"Epoch {t+1}\n-------------------------------")
    history_2 = train_loop(train_loader,
                           test_loader,
                           model,
                           optimizer,
                           history_2
                          )
print("Finish")

  0%|          | 0/1 [00:00<?, ?it/s]

Epoch 1
-------------------------------
loss: 0.329811, Avg test loss: 0.323226  [    0/1127568]
loss: 0.227593, Avg test loss: 0.225665  [102400/1127568]
loss: 0.141174, Avg test loss: 0.140713  [204800/1127568]
loss: 0.076485, Avg test loss: 0.075706  [307200/1127568]
loss: 0.043789, Avg test loss: 0.045486  [409600/1127568]
loss: 0.038746, Avg test loss: 0.038727  [512000/1127568]
loss: 0.034669, Avg test loss: 0.035046  [614400/1127568]
loss: 0.031119, Avg test loss: 0.031385  [716800/1127568]
loss: 0.027427, Avg test loss: 0.027888  [819200/1127568]
loss: 0.023650, Avg test loss: 0.024686  [921600/1127568]
loss: 0.020923, Avg test loss: 0.021781  [1024000/1127568]
loss: 0.019486, Avg test loss: 0.019625  [642400/1127568]
Finish


## SGD

In [24]:
model = FFNet(8, 5).to(device)

In [25]:
optimizer = torch.optim.SGD(model.parameters(), lr=LR)

In [27]:
history_1 = {'train': [], 'eval': []}

for t in tqdm_notebook(range(EPOCHES)):
    print(f"Epoch {t+1}\n-------------------------------")
    history_1 = train_loop(train_loader,
                           test_loader,
                           model,
                           optimizer,
                           history_1
                          )
print("Finish")

  0%|          | 0/1 [00:00<?, ?it/s]

Epoch 1
-------------------------------
loss: 0.276220, Avg test loss: 0.276005  [    0/1127568]
loss: 0.275157, Avg test loss: 0.274431  [102400/1127568]
loss: 0.272902, Avg test loss: 0.272877  [204800/1127568]
loss: 0.271040, Avg test loss: 0.271303  [307200/1127568]
loss: 0.270211, Avg test loss: 0.269759  [409600/1127568]
loss: 0.267379, Avg test loss: 0.268146  [512000/1127568]
loss: 0.265628, Avg test loss: 0.266563  [614400/1127568]
loss: 0.265749, Avg test loss: 0.265002  [716800/1127568]
loss: 0.263455, Avg test loss: 0.263379  [819200/1127568]
loss: 0.261439, Avg test loss: 0.261777  [921600/1127568]
loss: 0.259536, Avg test loss: 0.260157  [1024000/1127568]
loss: 0.260174, Avg test loss: 0.258540  [642400/1127568]
Finish


**Лучший показатель сходимости показал оптимизационный алгоритм RMSProp.**