### Preface

Hello. This kernel is basically cut and pasted from the amazing kernels of this competition. Please notify me if I have not attributed something correctly.

* https://www.kaggle.com/gmhost/gru-capsule
* How to: Preprocessing when using embeddings
https://www.kaggle.com/christofhenkel/how-to-preprocessing-when-using-embeddings
* Improve your Score with some Text Preprocessing https://www.kaggle.com/theoviel/improve-your-score-with-some-text-preprocessing
* Simple attention layer taken from https://github.com/mttk/rnn-classifier/blob/master/model.py
* https://www.kaggle.com/ziliwang/baseline-pytorch-bilstm
* https://www.kaggle.com/hengzheng/pytorch-starter

**UPDATE**: It seems that shuffling the data does not keep the extra features in the correct order. To address this, I added a custom dataset class that also returns each sample's index, so that during training each feature row can be matched to its corresponding sample. Training time increases, though, so you might need to make the model lighter in order to submit results.
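
A minimal sketch of the idea in the update above, under stated assumptions: the class name `IndexedDataset` and the toy data are hypothetical, and plain Python lists stand in for tensors. In the actual kernel such a class would presumably subclass `torch.utils.data.Dataset` so that a `DataLoader` can shuffle it. The point here is only that returning the index lets each shuffled sample find its own feature row.

```python
import random


class IndexedDataset:
    """Wraps samples and labels; each item also carries its original index.

    Hypothetical illustration: in PyTorch this would subclass
    torch.utils.data.Dataset and be consumed by a DataLoader.
    """

    def __init__(self, samples, labels):
        assert len(samples) == len(labels)
        self.samples = samples
        self.labels = labels

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, index):
        # Return the index alongside the data so callers can look up
        # per-sample side information (here: the extra features).
        return self.samples[index], self.labels[index], index


# Toy data: three "questions", their targets, and one feature row each.
samples = ["q0", "q1", "q2"]
labels = [0, 1, 0]
features = [[0.1, 0.9], [0.5, 0.5], [0.7, 0.2]]

dataset = IndexedDataset(samples, labels)

# Shuffle the visiting order, as a shuffling data loader would.
order = list(range(len(dataset)))
random.Random(1029).shuffle(order)

paired = []
for position in order:
    sample, label, index = dataset[position]
    # Index into the feature table with the returned index, not with
    # the position inside the shuffled batch.
    paired.append((sample, features[index]))
```

Whatever order the shuffle produces, every sample ends up next to its own feature row.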

## IMPORTS 

In [1]:
import time
import random
import pandas as pd
import numpy as np
import gc
import re
import torch
from torchtext import data
import spacy
from tqdm import tqdm_notebook, tnrange
from tqdm.auto import tqdm

tqdm.pandas(desc='Progress')
from collections import Counter
from textblob import TextBlob
from nltk import word_tokenize

import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence
from torch.autograd import Variable
from torchtext.data import Example
from sklearn.metrics import f1_score
import torchtext
import os 

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

# cross validation and metrics
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import f1_score
from torch.optim.optimizer import Optimizer
from unidecode import unidecode

Using TensorFlow backend.


### Basic Parameters

In [2]:
embed_size = 300 # how big is each word vector
max_features = 120000 # how many unique words to use (i.e num rows in embedding vector)
maxlen = 70 # max number of words in a question to use
batch_size = 512 # how many samples to process at once
#n_epochs = 4 # how many times to iterate over all samples
#n_splits = 3 # Number of K-fold Splits

SEED = 1029

### Ensure determinism in the results

A common headache in this competition is the lack of determinism in the results due to cuDNN. The following kernel has a solution in PyTorch.

See https://www.kaggle.com/hengzheng/pytorch-starter. 

In [3]:
def seed_everything(seed=1029):
    random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.backends.cudnn.deterministic = True
seed_everything()

### Code for Loading Embeddings

Functions taken from the kernel: https://www.kaggle.com/gmhost/gru-capsule


In [4]:
## FUNCTIONS TAKEN FROM https://www.kaggle.com/gmhost/gru-capsule

def load_glove(word_index):
    EMBEDDING_FILE = '../input/embeddings/glove.840B.300d/glove.840B.300d.txt'
    def get_coefs(word,*arr): return word, np.asarray(arr, dtype='float32')[:300]
    embeddings_index = dict(get_coefs(*o.split(" ")) for o in open(EMBEDDING_FILE))
    
    all_embs = np.stack(list(embeddings_index.values()))  # np.stack needs a sequence, not a dict view
    emb_mean,emb_std = -0.005838499,0.48782197
    embed_size = all_embs.shape[1]

    # word_index = tokenizer.word_index
    nb_words = min(max_features, len(word_index))
    embedding_matrix = np.random.normal(emb_mean, emb_std, (nb_words, embed_size))
    for word, i in word_index.items():
        if i >= max_features: continue
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None: embedding_matrix[i] = embedding_vector
            
    return embedding_matrix 
    
def load_fasttext(word_index):    
    EMBEDDING_FILE = '../input/embeddings/wiki-news-300d-1M/wiki-news-300d-1M.vec'
    def get_coefs(word,*arr): return word, np.asarray(arr, dtype='float32')
    embeddings_index = dict(get_coefs(*o.split(" ")) for o in open(EMBEDDING_FILE) if len(o)>100)

    all_embs = np.stack(list(embeddings_index.values()))  # np.stack needs a sequence, not a dict view
    emb_mean,emb_std = all_embs.mean(), all_embs.std()
    embed_size = all_embs.shape[1]

    # word_index = tokenizer.word_index
    nb_words = min(max_features, len(word_index))
    embedding_matrix = np.random.normal(emb_mean, emb_std, (nb_words, embed_size))
    for word, i in word_index.items():
        if i >= max_features: continue
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None: embedding_matrix[i] = embedding_vector

    return embedding_matrix

def load_para(word_index):
    EMBEDDING_FILE = '../input/embeddings/paragram_300_sl999/paragram_300_sl999.txt'
    def get_coefs(word,*arr): return word, np.asarray(arr, dtype='float32')
    embeddings_index = dict(get_coefs(*o.split(" ")) for o in open(EMBEDDING_FILE, encoding="utf8", errors='ignore') if len(o)>100)

    all_embs = np.stack(list(embeddings_index.values()))  # np.stack needs a sequence, not a dict view
    emb_mean,emb_std = -0.0053247833,0.49346462
    embed_size = all_embs.shape[1]

    # word_index = tokenizer.word_index
    nb_words = min(max_features, len(word_index))
    embedding_matrix = np.random.normal(emb_mean, emb_std, (nb_words, embed_size))
    for word, i in word_index.items():
        if i >= max_features: continue
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None: embedding_matrix[i] = embedding_vector
    
    return embedding_matrix
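
The three loaders above follow the same pattern: parse each line of the embedding file into a word and a vector, then fill one matrix row per word index, leaving a random normal draw for words that have no pretrained vector. Below is a minimal pure-Python sketch of that pattern, assuming a tiny in-memory "file" of 3-dimensional vectors. The helper names `parse_line` and `build_matrix` and the toy data are hypothetical, and lists stand in for NumPy arrays.

```python
import random


def parse_line(line):
    """Split 'word v1 v2 ...' into (word, [float, ...])."""
    word, *values = line.rstrip().split(" ")
    return word, [float(v) for v in values]


def build_matrix(word_index, embeddings_index, max_features, dim,
                 mean=0.0, std=0.5, seed=1029):
    """One row per word index; rows without a pretrained vector stay random."""
    rng = random.Random(seed)
    nb_words = min(max_features, len(word_index))
    matrix = [[rng.gauss(mean, std) for _ in range(dim)]
              for _ in range(nb_words)]
    for word, i in word_index.items():
        # The kernel compares against max_features; that is equivalent
        # whenever the vocabulary is larger than max_features.
        if i >= nb_words:
            continue
        vector = embeddings_index.get(word)
        if vector is not None:
            matrix[i] = vector
    return matrix


lines = ["the 0.1 0.2 0.3", "cat 0.4 0.5 0.6"]
embeddings_index = dict(parse_line(line) for line in lines)

# Hypothetical tokenizer output; Keras-style indices start at 1.
word_index = {"the": 1, "cat": 2, "zyx": 3}
matrix = build_matrix(word_index, embeddings_index, max_features=10, dim=3)
```

Row 0 is never assigned a word and keeps its random initialisation, as does any row whose word is missing from the embedding file.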

## LOAD PROCESSED TRAINING DATA FROM DISK

In [5]:
df_train = pd.read_csv("../input/train.csv")
df_test = pd.read_csv("../input/test.csv")
df = pd.concat([df_train ,df_test],sort=True)

In [6]:
def build_vocab(texts):
    sentences = texts.apply(lambda x: x.split()).values
    vocab = {}
    for sentence in sentences:
        for word in sentence:
            try:
                vocab[word] += 1
            except KeyError:
                vocab[word] = 1
    return vocab
vocab = build_vocab(df['question_text'])

In [7]:
sin = len(df_train[df_train["target"]==0])
insin = len(df_train[df_train["target"]==1])
persin = (sin/(sin+insin))*100
perinsin = (insin/(sin+insin))*100            
print("# Sincere questions: {:,}({:.2f}%) and # Insincere questions: {:,}({:.2f}%)".format(sin,persin,insin,perinsin))
# print("Sincere: {}% Insincere: {}%".format(round(persin,2),round(perinsin,2)))
print("# Test samples: {:,}({:.3f}% of train samples)".format(len(df_test),100*len(df_test)/len(df_train)))

# Sincere questions: 1,225,312(93.81%) and # Insincere questions: 80,810(6.19%)
# Test samples: 56,370(4.316% of train samples)



## Normalization

Borrowed from:
* How to: Preprocessing when using embeddings
https://www.kaggle.com/christofhenkel/how-to-preprocessing-when-using-embeddings
* Improve your Score with some Text Preprocessing https://www.kaggle.com/theoviel/improve-your-score-with-some-text-preprocessing

In [8]:
def known_contractions(embed):
    known = []
    for contract in contraction_mapping:
        if contract in embed:
            known.append(contract)
    return known
def clean_contractions(text, mapping):
    specials = ["’", "‘", "´", "`"]
    for s in specials:
        text = text.replace(s, "'")
    text = ' '.join([mapping[t] if t in mapping else t for t in text.split(" ")])
    return text
def correct_spelling(x, dic):
    for word in dic.keys():
        x = x.replace(word, dic[word])
    return x
def unknown_punct(embed, punct):
    unknown = ''
    for p in punct:
        if p not in embed:
            unknown += p
            unknown += ' '
    return unknown

def clean_numbers(x):
    x = re.sub('[0-9]{5,}', '#####', x)
    x = re.sub('[0-9]{4}', '####', x)
    x = re.sub('[0-9]{3}', '###', x)
    x = re.sub('[0-9]{2}', '##', x)
    return x

def clean_special_chars(text, punct, mapping):
    for p in mapping:
        text = text.replace(p, mapping[p])
    
    for p in punct:
        text = text.replace(p, f' {p} ')
    
    specials = {'\u200b': ' ', '…': ' ... ', '\ufeff': '', 'करना': '', 'है': ''}  # Other special characters that have to be handled last
    for s in specials:
        text = text.replace(s, specials[s])
    
    return text
def add_lower(embedding, vocab):
    count = 0
    for word in vocab:
        if word in embedding and word.lower() not in embedding:  
            embedding[word.lower()] = embedding[word]
            count += 1
    print(f"Added {count} words to embedding")    
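
To make the effect of `clean_numbers` concrete, here is a small self-contained check. It repeats the function from the cell above; the sample sentences are made up for illustration. Each run of two or more digits is replaced by as many `#` characters as it has digits, capped at five, and single digits are left alone.

```python
import re


def clean_numbers(x):
    # Longest patterns first, so a long run is not split into shorter ones.
    x = re.sub('[0-9]{5,}', '#####', x)
    x = re.sub('[0-9]{4}', '####', x)
    x = re.sub('[0-9]{3}', '###', x)
    x = re.sub('[0-9]{2}', '##', x)
    return x


examples = {
    "born in 1990": clean_numbers("born in 1990"),
    "call 5551234567 now": clean_numbers("call 5551234567 now"),
    "i am 25 and have 3 cats": clean_numbers("i am 25 and have 3 cats"),
}
for original, cleaned in examples.items():
    print(f"{original!r} -> {cleaned!r}")
```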

In [9]:
puncts = [',', '.', '"', ':', ')', '(', '-', '!', '?', '|', ';', "'", '$', '&', '/', '[', ']', '>', '%', '=', '#', '*', '+', '\\', '•',  '~', '@', '£', 
 '·', '_', '{', '}', '©', '^', '®', '`',  '<', '→', '°', '€', '™', '›',  '♥', '←', '×', '§', '″', '′', 'Â', '█', '½', 'à', '…', 
 '“', '★', '”', '–', '●', 'â', '►', '−', '¢', '²', '¬', '░', '¶', '↑', '±', '¿', '▾', '═', '¦', '║', '―', '¥', '▓', '—', '‹', '─', 
 '▒', '：', '¼', '⊕', '▼', '▪', '†', '■', '’', '▀', '¨', '▄', '♫', '☆', 'é', '¯', '♦', '¤', '▲', 'è', '¸', '¾', 'Ã', '⋅', '‘', '∞', 
 '∙', '）', '↓', '、', '│', '（', '»', '，', '♪', '╩', '╚', '³', '・', '╦', '╣', '╔', '╗', '▬', '❤', 'ï', 'Ø', '¹', '≤', '‡', '√', ]

def clean_text(x):
    x = str(x)
    for punct in puncts:
        x = x.replace(punct, f' {punct} ')
    return x

mispell_dict = {"ain't": "is not", "aren't": "are not","can't": "cannot", "'cause": "because", "could've": "could have", "couldn't": "could not", "didn't": "did not",  "doesn't": "does not", "don't": "do not", "hadn't": "had not", "hasn't": "has not", "haven't": "have not", "he'd": "he would","he'll": "he will", "he's": "he is", "how'd": "how did", "how'd'y": "how do you", "how'll": "how will", "how's": "how is",  "I'd": "I would", "I'd've": "I would have", "I'll": "I will", "I'll've": "I will have","I'm": "I am", "I've": "I have", "i'd": "i would", "i'd've": "i would have", "i'll": "i will",  "i'll've": "i will have","i'm": "i am", "i've": "i have", "isn't": "is not", "it'd": "it would", "it'd've": "it would have", "it'll": "it will", "it'll've": "it will have","it's": "it is", "let's": "let us", "ma'am": "madam", "mayn't": "may not", "might've": "might have","mightn't": "might not","mightn't've": "might not have", "must've": "must have", "mustn't": "must not", "mustn't've": "must not have", "needn't": "need not", "needn't've": "need not have","o'clock": "of the clock", "oughtn't": "ought not", "oughtn't've": "ought not have", "shan't": "shall not", "sha'n't": "shall not", "shan't've": "shall not have", "she'd": "she would", "she'd've": "she would have", "she'll": "she will", "she'll've": "she will have", "she's": "she is", "should've": "should have", "shouldn't": "should not", "shouldn't've": "should not have", "so've": "so have","so's": "so as", "this's": "this is","that'd": "that would", "that'd've": "that would have", "that's": "that is", "there'd": "there would", "there'd've": "there would have", "there's": "there is", "here's": "here is","they'd": "they would", "they'd've": "they would have", "they'll": "they will", "they'll've": "they will have", "they're": "they are", "they've": "they have", "to've": "to have", "wasn't": "was not", "we'd": "we would", "we'd've": "we would have", "we'll": "we will", "we'll've": "we will have", "we're": "we are", "we've": 
"we have", "weren't": "were not", "what'll": "what will", "what'll've": "what will have", "what're": "what are",  "what's": "what is", "what've": "what have", "when's": "when is", "when've": "when have", "where'd": "where did", "where's": "where is", "where've": "where have", "who'll": "who will", "who'll've": "who will have", "who's": "who is", "who've": "who have", "why's": "why is", "why've": "why have", "will've": "will have", "won't": "will not", "won't've": "will not have", "would've": "would have", "wouldn't": "would not", "wouldn't've": "would not have", "y'all": "you all", "y'all'd": "you all would","y'all'd've": "you all would have","y'all're": "you all are","y'all've": "you all have","you'd": "you would", "you'd've": "you would have", "you'll": "you will", "you'll've": "you will have", "you're": "you are", "you've": "you have", 'colour': 'color', 'centre': 'center', 'favourite': 'favorite', 'travelling': 'traveling', 'counselling': 'counseling', 'theatre': 'theater', 'cancelled': 'canceled', 'labour': 'labor', 'organisation': 'organization', 'wwii': 'world war 2', 'citicise': 'criticize', 'youtu ': 'youtube ', 'Qoura': 'Quora', 'sallary': 'salary', 'Whta': 'What', 'narcisist': 'narcissist', 'howdo': 'how do', 'whatare': 'what are', 'howcan': 'how can', 'howmuch': 'how much', 'howmany': 'how many', 'whydo': 'why do', 'doI': 'do I', 'theBest': 'the best', 'howdoes': 'how does', 'mastrubation': 'masturbation', 'mastrubate': 'masturbate', "mastrubating": 'masturbating', 'pennis': 'penis', 'Etherium': 'Ethereum', 'narcissit': 'narcissist', 'bigdata': 'big data', '2k17': '2017', '2k18': '2018', 'qouta': 'quota', 'exboyfriend': 'ex boyfriend', 'airhostess': 'air hostess', "whst": 'what', 'watsapp': 'whatsapp', 'demonitisation': 'demonetization', 'demonitization': 'demonetization', 'demonetisation': 'demonetization'}

def _get_mispell(mispell_dict):
    mispell_re = re.compile('(%s)' % '|'.join(mispell_dict.keys()))
    return mispell_dict, mispell_re

mispellings, mispellings_re = _get_mispell(mispell_dict)
def replace_typical_misspell(text):
    def replace(match):
        return mispellings[match.group(0)]
    return mispellings_re.sub(replace, text)
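
`replace_typical_misspell` compiles every dictionary key into one alternation pattern and substitutes all matches in a single pass. The sketch below reproduces that mechanism on a three-entry dictionary; the entries are taken from `mispell_dict` and the sample sentence is made up. Two things are my own assumptions rather than the kernel's code: the keys are passed through `re.escape`, and the function name `replace_misspellings` is hypothetical. Note that the pattern has no word boundaries, so a key also matches inside longer words.

```python
import re

small_dict = {"colour": "color", "Qoura": "Quora", "howdo": "how do"}

# Escape keys so that any regex metacharacters in them are matched literally
# (an addition for safety; the kernel joins the raw keys).
pattern = re.compile("(%s)" % "|".join(re.escape(k) for k in small_dict))


def replace_misspellings(text):
    # Look up each matched key and substitute its replacement.
    return pattern.sub(lambda match: small_dict[match.group(0)], text)


fixed = replace_misspellings("howdo I pick a colour on Qoura ?")
inside_word = replace_misspellings("colourful")
```

Because `load_and_prec` lower-cases the text before this step, dictionary keys that contain capital letters can no longer match there.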

Extra feature part taken from https://github.com/wongchunghang/toxic-comment-challenge-lstm/blob/master/toxic_comment_9872_model.ipynb

In [10]:
from sklearn.preprocessing import StandardScaler


def add_features(df):
    
    df['question_text'] = df['question_text'].progress_apply(lambda x:str(x))
    df['total_length'] = df['question_text'].progress_apply(len)
    df['capitals'] = df['question_text'].progress_apply(lambda comment: sum(1 for c in comment if c.isupper()))
    df['caps_vs_length'] = df.progress_apply(lambda row: float(row['capitals'])/float(row['total_length']),
                                axis=1)
    df['num_words'] = df.question_text.str.count(r'\S+')
    df['num_unique_words'] = df['question_text'].progress_apply(lambda comment: len(set(w for w in comment.split())))
    df['words_vs_unique'] = df['num_unique_words'] / df['num_words']  

    return df

def load_and_prec():
    train_df = pd.read_csv("../input/train.csv")
    test_df = pd.read_csv("../input/test.csv")
    print("Train shape : ",train_df.shape)
    print("Test shape : ",test_df.shape)
    
    # lower
    train_df["question_text"] = train_df["question_text"].apply(lambda x: x.lower())
    test_df["question_text"] = test_df["question_text"].apply(lambda x: x.lower())

    # Clean the text
    train_df["question_text"] = train_df["question_text"].progress_apply(lambda x: clean_text(x))
    test_df["question_text"] = test_df["question_text"].apply(lambda x: clean_text(x))
    
    # Clean numbers
    train_df["question_text"] = train_df["question_text"].progress_apply(lambda x: clean_numbers(x))
    test_df["question_text"] = test_df["question_text"].apply(lambda x: clean_numbers(x))
    
    # Clean spellings
    train_df["question_text"] = train_df["question_text"].progress_apply(lambda x: replace_typical_misspell(x))
    test_df["question_text"] = test_df["question_text"].apply(lambda x: replace_typical_misspell(x))
    
    ## fill up the missing values
    train_X = train_df["question_text"].fillna("_##_").values
    test_X = test_df["question_text"].fillna("_##_").values


    
    ###################### Add Features ###############################
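    #  https://github.com/wongchunghang/toxic-comment-challenge-lstm/blob/master/toxic_comment_9872_model.ipynb
    #  NOTE: question_text was lower-cased above, so 'capitals' and 'caps_vs_length' are (nearly) always 0 here.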
    train = add_features(train_df)
    test = add_features(test_df)

    features = train[['caps_vs_length', 'words_vs_unique']].fillna(0)
    test_features = test[['caps_vs_length', 'words_vs_unique']].fillna(0)

    ss = StandardScaler()
    ss.fit(np.vstack((features, test_features)))
    features = ss.transform(features)
    test_features = ss.transform(test_features)
    ###########################################################################

    ## Tokenize the sentences
    tokenizer = Tokenizer(num_words=max_features)
    tokenizer.fit_on_texts(list(train_X))
    train_X = tokenizer.texts_to_sequences(train_X)
    test_X = tokenizer.texts_to_sequences(test_X)

    ## Pad the sentences 
    train_X = pad_sequences(train_X, maxlen=maxlen)
    test_X = pad_sequences(test_X, maxlen=maxlen)

    ## Get the target values
    train_y = train_df['target'].values
    
#     # Splitting to training and a final test set    
#     train_X, x_test_f, train_y, y_test_f = train_test_split(list(zip(train_X,features)), train_y, test_size=0.2, random_state=SEED)    
#     train_X, features = zip(*train_X)
#     x_test_f, features_t = zip(*x_test_f)    
    
    #shuffling the data
    np.random.seed(SEED)
    trn_idx = np.random.permutation(len(train_X))

    train_X = train_X[trn_idx]
    train_y = train_y[trn_idx]
    features = features[trn_idx]
    
    return train_X, test_X, train_y, features, test_features, tokenizer.word_index
#     return train_X, test_X, train_y, x_test_f,y_test_f,features, test_features, features_t, tokenizer.word_index
#     return train_X, test_X, train_y, tokenizer.word_index
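
For reference, the two extra features that are kept (`caps_vs_length` and `words_vs_unique`) can be computed for a single string as below. This is a pure-Python sketch of what `add_features` does row by row; the helper name `question_features` and the sample question are hypothetical.

```python
def question_features(text):
    """Return (caps_vs_length, words_vs_unique) for one question."""
    total_length = len(text)
    capitals = sum(1 for c in text if c.isupper())
    words = text.split()
    num_words = len(words)
    num_unique_words = len(set(words))
    # Guard against empty input (an addition in this sketch).
    caps_vs_length = capitals / total_length if total_length else 0.0
    words_vs_unique = num_unique_words / num_words if num_words else 0.0
    return caps_vs_length, words_vs_unique


# 15 characters, 4 capitals, 4 words of which 3 are distinct.
caps_ratio, unique_ratio = question_features("Why why WHY why")
```

In the kernel both columns are then standardised with a `StandardScaler` fitted on the train and test rows together.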

In [11]:

# load and preprocess the data
# x_train, x_test, y_train, word_index = load_and_prec()
x_train, x_test, y_train, features, test_features, word_index = load_and_prec() 
# x_train, x_test, y_train, x_test_f,y_test_f,features, test_features,features_t, word_index = load_and_prec() 


Train shape :  (1306122, 3)
Test shape :  (56370, 2)
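
`pad_sequences` is called with its default settings in `load_and_prec`, which to my understanding means zeros are added at the front of short sequences and long sequences lose their earliest tokens. Here is a pure-Python sketch of that behaviour, with a hypothetical helper `pad_pre` and `maxlen=5` instead of 70 for readability:

```python
def pad_pre(sequence, maxlen, value=0):
    """Left-pad with `value`, or keep only the last `maxlen` items."""
    truncated = sequence[-maxlen:]
    return [value] * (maxlen - len(truncated)) + truncated


short = pad_pre([7, 8, 9], maxlen=5)
long = pad_pre([1, 2, 3, 4, 5, 6, 7], maxlen=5)
```

Every row therefore has exactly `maxlen` token ids, which is what the fixed `step_dim` of the attention layers relies on.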



### SAVE DATASET TO DISK

In [12]:
np.save("x_train",x_train)
np.save("x_test",x_test)
np.save("y_train",y_train)

np.save("features",features)
np.save("test_features",test_features)
np.save("word_index.npy",word_index)

### LOAD DATASET FROM DISK

In [13]:
x_train = np.load("x_train.npy")
x_test = np.load("x_test.npy")
y_train = np.load("y_train.npy")
features = np.load("features.npy")
test_features = np.load("test_features.npy")
word_index = np.load("word_index.npy", allow_pickle=True).item()  # the dict is stored as a pickled object array

In [14]:
features.shape

(1306122, 2)

### Load Embeddings

Two embedding matrices are used: GloVe and Paragram. Their element-wise mean is used as the final embedding matrix.

In [15]:
# missing entries in the embedding are set using np.random.normal so we have to seed here too
seed_everything()

glove_embeddings = load_glove(word_index)
paragram_embeddings = load_para(word_index)

embedding_matrix = np.mean([glove_embeddings, paragram_embeddings], axis=0)

# vocab = build_vocab(df['question_text'])
# add_lower(embedding_matrix, vocab)
del glove_embeddings, paragram_embeddings
gc.collect()

np.shape(embedding_matrix)

  


(120000, 300)

In [16]:
np.shape(embedding_matrix)

(120000, 300)

### Cyclical Learning Rate (CLR)
Code taken from https://www.kaggle.com/dannykliu/lstm-with-attention-clr-in-pytorch

In [17]:
# code inspired from: https://github.com/anandsaha/pytorch.cyclic.learning.rate/blob/master/cls.py
class CyclicLR(object):
    def __init__(self, optimizer, base_lr=1e-3, max_lr=6e-3,
                 step_size=2000, mode='triangular', gamma=1.,
                 scale_fn=None, scale_mode='cycle', last_batch_iteration=-1):

        if not isinstance(optimizer, Optimizer):
            raise TypeError('{} is not an Optimizer'.format(
                type(optimizer).__name__))
        self.optimizer = optimizer

        if isinstance(base_lr, list) or isinstance(base_lr, tuple):
            if len(base_lr) != len(optimizer.param_groups):
                raise ValueError("expected {} base_lr, got {}".format(
                    len(optimizer.param_groups), len(base_lr)))
            self.base_lrs = list(base_lr)
        else:
            self.base_lrs = [base_lr] * len(optimizer.param_groups)

        if isinstance(max_lr, list) or isinstance(max_lr, tuple):
            if len(max_lr) != len(optimizer.param_groups):
                raise ValueError("expected {} max_lr, got {}".format(
                    len(optimizer.param_groups), len(max_lr)))
            self.max_lrs = list(max_lr)
        else:
            self.max_lrs = [max_lr] * len(optimizer.param_groups)

        self.step_size = step_size

        if mode not in ['triangular', 'triangular2', 'exp_range'] \
                and scale_fn is None:
            raise ValueError('mode is invalid and scale_fn is None')

        self.mode = mode
        self.gamma = gamma

        if scale_fn is None:
            if self.mode == 'triangular':
                self.scale_fn = self._triangular_scale_fn
                self.scale_mode = 'cycle'
            elif self.mode == 'triangular2':
                self.scale_fn = self._triangular2_scale_fn
                self.scale_mode = 'cycle'
            elif self.mode == 'exp_range':
                self.scale_fn = self._exp_range_scale_fn
                self.scale_mode = 'iterations'
        else:
            self.scale_fn = scale_fn
            self.scale_mode = scale_mode

        self.batch_step(last_batch_iteration + 1)
        self.last_batch_iteration = last_batch_iteration

    def batch_step(self, batch_iteration=None):
        if batch_iteration is None:
            batch_iteration = self.last_batch_iteration + 1
        self.last_batch_iteration = batch_iteration
        for param_group, lr in zip(self.optimizer.param_groups, self.get_lr()):
            param_group['lr'] = lr

    def _triangular_scale_fn(self, x):
        return 1.

    def _triangular2_scale_fn(self, x):
        return 1 / (2. ** (x - 1))

    def _exp_range_scale_fn(self, x):
        return self.gamma**(x)

    def get_lr(self):
        step_size = float(self.step_size)
        cycle = np.floor(1 + self.last_batch_iteration / (2 * step_size))
        x = np.abs(self.last_batch_iteration / step_size - 2 * cycle + 1)

        lrs = []
        param_lrs = zip(self.optimizer.param_groups, self.base_lrs, self.max_lrs)
        for param_group, base_lr, max_lr in param_lrs:
            base_height = (max_lr - base_lr) * np.maximum(0, (1 - x))
            if self.scale_mode == 'cycle':
                lr = base_lr + base_height * self.scale_fn(cycle)
            else:
                lr = base_lr + base_height * self.scale_fn(self.last_batch_iteration)
            lrs.append(lr)
        return lrs
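
In the default `triangular` mode the schedule above rises linearly from `base_lr` to `max_lr` over `step_size` batches and falls back over the next `step_size`. The standalone function below mirrors the arithmetic in `get_lr` for a single parameter group; the function name `triangular_lr` and the small `step_size=4` are chosen only for illustration.

```python
import math


def triangular_lr(iteration, base_lr=1e-3, max_lr=6e-3, step_size=4):
    """Learning rate at a given batch iteration (triangular mode, scale 1)."""
    cycle = math.floor(1 + iteration / (2 * step_size))
    x = abs(iteration / step_size - 2 * cycle + 1)
    return base_lr + (max_lr - base_lr) * max(0.0, 1 - x)


schedule = [triangular_lr(i) for i in range(9)]
for i, lr in enumerate(schedule):
    print(f"iteration {i}: lr = {lr:.5f}")
```

With these settings the rate starts at `base_lr`, peaks at `max_lr` at iteration 4, and is back at `base_lr` at iteration 8. In the class above, the same progression is driven by the `batch_step()` calls.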


### model1

Bidirectional LSTM (BiLSTM) with an attention layer and an additional fully connected layer. It also uses extra features taken from a winning kernel of the toxic comments competition, together with CLR and a capsule layer. The outputs of these components are concatenated before the final layers.

Initial idea borrowed from: https://www.kaggle.com/ziliwang/baseline-pytorch-bilstm

In [18]:
import torch as t
import torch.nn as nn
import torch.nn.functional as F

embedding_dim = 300
embedding_path = '../save/embedding_matrix.npy'  # or False, not use pre-trained-matrix
use_pretrained_embedding = True

hidden_size = 60
gru_len = hidden_size

Routings = 4 #5
Num_capsule = 5
Dim_capsule = 5#16
dropout_p = 0.25
rate_drop_dense = 0.28
LR = 0.001
T_epsilon = 1e-7
num_classes = 30


class Embed_Layer(nn.Module):
    def __init__(self, embedding_matrix=None, vocab_size=None, embedding_dim=300):
        super(Embed_Layer, self).__init__()
        self.encoder = nn.Embedding(vocab_size + 1, embedding_dim)
        if use_pretrained_embedding:
            # self.encoder.weight.data.copy_(t.from_numpy(np.load(embedding_path))) # Method 1: load the .npy file written by np.save
            self.encoder.weight.data.copy_(t.from_numpy(embedding_matrix))  # Method 2

    def forward(self, x, dropout_p=0.25):
        return nn.Dropout(p=dropout_p)(self.encoder(x))


class GRU_Layer(nn.Module):
    def __init__(self):
        super(GRU_Layer, self).__init__()
        self.gru = nn.GRU(input_size=300,
                          hidden_size=gru_len,
                          bidirectional=True)
        '''
        Custom GRU variant with a modified activation function plus dropout and recurrent_dropout.
        To use it, import rnn_revised; note that it seems to run on the CPU and is rather slow.
        '''
        # # if you uncomment /*from rnn_revised import * */, uncomment the following code as well
        # self.gru = RNNHardSigmoid('GRU', input_size=300,
        #                           hidden_size=gru_len,
        #                           bidirectional=True)

    # This step is critical: initialize the parameters with glorot_uniform and orthogonal, as Keras does
    def init_weights(self):
        ih = (param.data for name, param in self.named_parameters() if 'weight_ih' in name)
        hh = (param.data for name, param in self.named_parameters() if 'weight_hh' in name)
        b = (param.data for name, param in self.named_parameters() if 'bias' in name)
        for k in ih:
            nn.init.xavier_uniform_(k)
        for k in hh:
            nn.init.orthogonal_(k)
        for k in b:
            nn.init.constant_(k, 0)

    def forward(self, x):
        return self.gru(x)


# core caps_layer with squash func
class Caps_Layer(nn.Module):
    def __init__(self, input_dim_capsule=gru_len * 2, num_capsule=Num_capsule, dim_capsule=Dim_capsule, \
                 routings=Routings, kernel_size=(9, 1), share_weights=True,
                 activation='default', **kwargs):
        super(Caps_Layer, self).__init__(**kwargs)

        self.num_capsule = num_capsule
        self.dim_capsule = dim_capsule
        self.routings = routings
        self.kernel_size = kernel_size  # not used for now
        self.share_weights = share_weights
        if activation == 'default':
            self.activation = self.squash
        else:
            self.activation = nn.ReLU(inplace=True)

        if self.share_weights:
            self.W = nn.Parameter(
                nn.init.xavier_normal_(t.empty(1, input_dim_capsule, self.num_capsule * self.dim_capsule)))
        else:
            self.W = nn.Parameter(
                t.randn(batch_size, input_dim_capsule, self.num_capsule * self.dim_capsule))  # first dimension is batch_size

    def forward(self, x):

        if self.share_weights:
            u_hat_vecs = t.matmul(x, self.W)
        else:
            print('add later')

        batch_size = x.size(0)
        input_num_capsule = x.size(1)
        u_hat_vecs = u_hat_vecs.view((batch_size, input_num_capsule,
                                      self.num_capsule, self.dim_capsule))
        u_hat_vecs = u_hat_vecs.permute(0, 2, 1, 3)  # permute to (batch_size, num_capsule, input_num_capsule, dim_capsule)
        b = t.zeros_like(u_hat_vecs[:, :, :, 0])  # (batch_size,num_capsule,input_num_capsule)

        for i in range(self.routings):
            b = b.permute(0, 2, 1)
            c = F.softmax(b, dim=2)
            c = c.permute(0, 2, 1)
            b = b.permute(0, 2, 1)
            outputs = self.activation(t.einsum('bij,bijk->bik', (c, u_hat_vecs)))  # batch matrix multiplication
            # outputs shape (batch_size, num_capsule, dim_capsule)
            if i < self.routings - 1:
                b = t.einsum('bik,bijk->bij', (outputs, u_hat_vecs))  # batch matrix multiplication
        return outputs  # (batch_size, num_capsule, dim_capsule)

    # text version of squash, slightly different from the original one
    def squash(self, x, axis=-1):
        s_squared_norm = (x ** 2).sum(axis, keepdim=True)
        scale = t.sqrt(s_squared_norm + T_epsilon)
        return x / scale
    
class Capsule_Main(nn.Module):
    def __init__(self, embedding_matrix=None, vocab_size=None):
        super(Capsule_Main, self).__init__()
        self.embed_layer = Embed_Layer(embedding_matrix, vocab_size)
        self.gru_layer = GRU_Layer()
        # [Important] Initialize the GRU weights. This step is critical: accuracy rises to 0.98, whereas with the default uniform initialization it stays around 0.5
        self.gru_layer.init_weights()
        self.caps_layer = Caps_Layer()
        self.dense_layer = Dense_Layer()

    def forward(self, content):
        content1 = self.embed_layer(content)
        content2, _ = self.gru_layer(
            content1)  # the output is a tuple: output (seq_len, batch_size, num_directions * hidden_size) and hn
        content3 = self.caps_layer(content2)
        output = self.dense_layer(content3)
        return output
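
The `squash` method above normalises each capsule vector to (almost) unit length, `x / sqrt(||x||^2 + eps)`, instead of the length-dependent scaling used in the original capsule-network formulation. Here is a pure-Python sketch for a single vector, with a hypothetical function name `squash_vector`:

```python
import math

T_EPSILON = 1e-7  # same value as T_epsilon in the cell above


def squash_vector(x, eps=T_EPSILON):
    """Scale x by 1 / sqrt(sum of squares + eps)."""
    squared_norm = sum(v * v for v in x)
    scale = math.sqrt(squared_norm + eps)
    return [v / scale for v in x]


unit = squash_vector([3.0, 4.0])
length = math.sqrt(sum(v * v for v in unit))
```

The direction of the vector is preserved, and the epsilon only matters for vectors that are close to zero.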
    


In [19]:
class Attention(nn.Module):
    def __init__(self, feature_dim, step_dim, bias=True, **kwargs):
        super(Attention, self).__init__(**kwargs)
        
        self.supports_masking = True

        self.bias = bias
        self.feature_dim = feature_dim
        self.step_dim = step_dim
        self.features_dim = 0
        
        weight = torch.zeros(feature_dim, 1)
        nn.init.xavier_uniform_(weight)
        self.weight = nn.Parameter(weight)
        
        if bias:
            self.b = nn.Parameter(torch.zeros(step_dim))
        
    def forward(self, x, mask=None):
        feature_dim = self.feature_dim
        step_dim = self.step_dim

        eij = torch.mm(
            x.contiguous().view(-1, feature_dim), 
            self.weight
        ).view(-1, step_dim)
        
        if self.bias:
            eij = eij + self.b
            
        eij = torch.tanh(eij)
        a = torch.exp(eij)
        
        if mask is not None:
            a = a * mask

        a = a / (torch.sum(a, 1, keepdim=True) + 1e-10)  # epsilon belongs in the denominator to avoid division by zero

        weighted_input = x * torch.unsqueeze(a, -1)
        return torch.sum(weighted_input, 1)
    
class NeuralNet(nn.Module):
    def __init__(self):
        super(NeuralNet, self).__init__()
        
        fc_layer = 16
        fc_layer1 = 16

        self.embedding = nn.Embedding(max_features, embed_size)
        self.embedding.weight = nn.Parameter(torch.tensor(embedding_matrix, dtype=torch.float32))
        self.embedding.weight.requires_grad = False
        
        self.embedding_dropout = nn.Dropout2d(0.1)
        self.lstm = nn.LSTM(embed_size, hidden_size, bidirectional=True, batch_first=True)
        self.gru = nn.GRU(hidden_size * 2, hidden_size, bidirectional=True, batch_first=True)

        self.lstm2 = nn.LSTM(hidden_size * 2, hidden_size, bidirectional=True, batch_first=True)

        self.lstm_attention = Attention(hidden_size * 2, maxlen)
        self.gru_attention = Attention(hidden_size * 2, maxlen)
        self.bn = nn.BatchNorm1d(16, momentum=0.5)
        self.linear = nn.Linear(hidden_size*8+3, fc_layer1) #643:80 - 483:60 - 323:40
        self.relu = nn.ReLU()
        self.dropout = nn.Dropout(0.1)
        self.fc = nn.Linear(fc_layer**2,fc_layer)
        self.out = nn.Linear(fc_layer, 1)
        self.lincaps = nn.Linear(Num_capsule * Dim_capsule, 1)
        self.caps_layer = Caps_Layer()
    
    def forward(self, x):
        
#         Capsule(num_capsule=10, dim_capsule=10, routings=4, share_weights=True)(x)

        h_embedding = self.embedding(x[0])
        h_embedding = torch.squeeze(
            self.embedding_dropout(torch.unsqueeze(h_embedding, 0)))
        
        h_lstm, _ = self.lstm(h_embedding)
        h_gru, _ = self.gru(h_lstm)

        ##Capsule Layer        
        content3 = self.caps_layer(h_gru)
        content3 = self.dropout(content3)
        batch_size = content3.size(0)
        content3 = content3.view(batch_size, -1)
        content3 = self.relu(self.lincaps(content3))

        ##Attention Layer
        h_lstm_atten = self.lstm_attention(h_lstm)
        h_gru_atten = self.gru_attention(h_gru)
        
        # global average pooling
        avg_pool = torch.mean(h_gru, 1)
        # global max pooling
        max_pool, _ = torch.max(h_gru, 1)
        
        f = torch.tensor(x[1], dtype=torch.float).cuda()

                #[512,160]
        conc = torch.cat((h_lstm_atten, h_gru_atten,content3, avg_pool, max_pool,f), 1)
        conc = self.relu(self.linear(conc))
        conc = self.bn(conc)
        conc = self.dropout(conc)

        out = self.out(conc)
        
        return out

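The `Attention` module above scores every time step with a learned vector, applies `tanh` and `exp`, optionally masks padded steps, and normalizes the result into weights that sum to one before taking a weighted sum over time. The following NumPy sketch mirrors that pooling step outside of PyTorch. `attention_pool` is a hypothetical helper written for illustration, and placing the epsilon inside the denominator is an assumption about the intended guard against division by zero.

```python
import numpy as np

def attention_pool(x, weight, bias=None, mask=None, eps=1e-10):
    """x: (batch, steps, features), weight: (features, 1).

    Returns the pooled tensor (batch, features) and the weights (batch, steps).
    """
    batch, steps, feat = x.shape
    # one scalar score per time step
    eij = (x.reshape(-1, feat) @ weight).reshape(batch, steps)
    if bias is not None:
        eij = eij + bias
    a = np.exp(np.tanh(eij))
    if mask is not None:
        a = a * mask  # masked steps get exactly zero weight
    # epsilon in the denominator (assumption about the intended behaviour)
    a = a / (a.sum(axis=1, keepdims=True) + eps)
    pooled = (x * a[..., None]).sum(axis=1)
    return pooled, a

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 5, 4))   # 2 sequences, 5 steps, 4 features
w = rng.normal(size=(4, 1))
mask = np.array([[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]], dtype=float)
pooled, a = attention_pool(x, w, bias=np.zeros(5), mask=mask)
print(pooled.shape, a.sum(axis=1))
```

Each row of `a` sums to (almost exactly) one, and the masked steps of the first sequence contribute nothing to `pooled`.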
### model2

In [20]:
class Alex_NeuralNet_Meta(nn.Module):
    def __init__(self,hidden_size,lin_size,embedding_matrix=embedding_matrix):
        super(Alex_NeuralNet_Meta, self).__init__()
        self.hidden_size = hidden_size
        drp = 0.1
        self.embedding = nn.Embedding(max_features, embed_size)
        self.embedding.weight = nn.Parameter(torch.tensor(embedding_matrix, dtype=torch.float32))
        self.embedding.weight.requires_grad = False

        self.embedding_dropout = nn.Dropout2d(0.1)
        self.lstm = nn.LSTM(embed_size, hidden_size, bidirectional=True, batch_first=True)

        for name, param in self.lstm.named_parameters():
            if 'bias' in name:
                 nn.init.constant_(param, 0.0)
            elif 'weight_ih' in name:
                 nn.init.kaiming_normal_(param)
            elif 'weight_hh' in name:
                 nn.init.orthogonal_(param)

        self.gru = nn.GRU(hidden_size*2, hidden_size, bidirectional=True, batch_first=True)

        for name, param in self.gru.named_parameters():
            if 'bias' in name:
                 nn.init.constant_(param, 0.0)
            elif 'weight_ih' in name:
                 nn.init.kaiming_normal_(param)
            elif 'weight_hh' in name:
                 nn.init.orthogonal_(param)

        self.linear = nn.Linear(hidden_size*6 + features.shape[1], lin_size)
        self.relu = nn.ReLU()
        self.dropout = nn.Dropout(drp)
        self.out = nn.Linear(lin_size, 1)

    def forward(self, x):
        h_embedding = self.embedding(x[0])
        h_embedding = torch.squeeze(self.embedding_dropout(torch.unsqueeze(h_embedding, 0)))
        #print("emb", h_embedding.size())
        h_lstm, _ = self.lstm(h_embedding)
        #print("lst",h_lstm.size())
        h_gru, hh_gru = self.gru(h_lstm)
        hh_gru = hh_gru.view(-1, 2*self.hidden_size )
        #print("gru", h_gru.size())
        #print("h_gru", hh_gru.size())
        avg_pool = torch.mean(h_gru, 1)
        max_pool, _ = torch.max(h_gru, 1)
        #print("avg_pool", avg_pool.size())
        #print("max_pool", max_pool.size())
        f = torch.tensor(x[1], dtype=torch.float).cuda()
        #print("f", f.size())
        conc = torch.cat(( hh_gru, avg_pool, max_pool,f), 1)
        #print("conc", conc.size())
        conc = self.relu(self.linear(conc))
        conc = self.dropout(conc)
        out = self.out(conc)
        return out
    

### model3

In [21]:
class NeuralNet1(nn.Module):
    def __init__(self):
        super(NeuralNet1, self).__init__()
        
        hidden_size = 40
        
        self.embedding = nn.Embedding(max_features, embed_size)
        self.embedding.weight = nn.Parameter(torch.tensor(embedding_matrix, dtype=torch.float32))
        self.embedding.weight.requires_grad = False
        
        self.embedding_dropout = nn.Dropout2d(0.1)
        self.gru_1 = nn.GRU(embed_size, hidden_size, bidirectional=True, batch_first=True)
        self.gru_2 = nn.GRU(hidden_size*2, hidden_size, bidirectional=True, batch_first=True)
        
        self.attention = Attention(hidden_size*2, maxlen)
        
        self.linear = nn.Linear(hidden_size*2, 16)
        self.relu = nn.ReLU()
        self.dropout = nn.Dropout(0.1)
        self.out = nn.Linear(16, 1)
        
    def forward(self, x):
        h_embedding = self.embedding(x)
        h_embedding = torch.squeeze(self.embedding_dropout(torch.unsqueeze(h_embedding, 0)))
        
        h_gru_1, _ = self.gru_1(h_embedding)
        h_gru_2, _ = self.gru_2(h_gru_1)
        
        h_atten = self.attention(h_gru_2)
        
        fc = self.relu(self.linear(h_atten))
        fc = self.dropout(fc)
        out = self.out(fc)
        
        return out

### model4

In [22]:
class NeuralNet2(nn.Module):
    def __init__(self):
        super(NeuralNet2, self).__init__()
        
        hidden_size = 40
        
        self.embedding = nn.Embedding(max_features, embed_size)
        self.embedding.weight = nn.Parameter(torch.tensor(embedding_matrix, dtype=torch.float32))
        self.embedding.weight.requires_grad = False
        
        self.embedding_dropout = nn.Dropout2d(0.1)
        self.lstm = nn.LSTM(embed_size, hidden_size, bidirectional=True, batch_first=True)
        
        self.attention = Attention(hidden_size*2, maxlen)
        
        self.linear = nn.Linear(hidden_size*6, 16)
        self.relu = nn.ReLU()
        self.dropout = nn.Dropout(0.1)
        self.out = nn.Linear(16, 1)
        
    def forward(self, x):
        h_embedding = self.embedding(x)
        h_embedding = torch.squeeze(self.embedding_dropout(torch.unsqueeze(h_embedding, 0)))
        
        h_lstm, _ = self.lstm(h_embedding)
        h_atten = self.attention(h_lstm)
        
        avg_pool = torch.mean(h_lstm, 1)
        max_pool, _ = torch.max(h_lstm, 1)
        conc = torch.cat((h_atten, avg_pool, max_pool), 1)
        
        fc = self.relu(self.linear(conc))
        fc = self.dropout(fc)
        out = self.out(fc)
        
        return out

### model5

In [23]:
class NeuralNet3(nn.Module):
    def __init__(self):
        super(NeuralNet3, self).__init__()
        
        hidden_size = 40
        
        self.embedding = nn.Embedding(max_features, embed_size)
        self.embedding.weight = nn.Parameter(torch.tensor(embedding_matrix, dtype=torch.float32))
        self.embedding.weight.requires_grad = False
        
        self.embedding_dropout = nn.Dropout2d(0.1)
        self.lstm = nn.LSTM(embed_size, hidden_size, bidirectional=True, batch_first=True)
        self.gru = nn.GRU(hidden_size*2, hidden_size, bidirectional=True, batch_first=True)
        
        self.lstm_attention = Attention(hidden_size*2, maxlen)
        self.gru_attention = Attention(hidden_size*2, maxlen)
        
        self.linear = nn.Linear(hidden_size*8, 16)
        self.relu = nn.ReLU()
        self.dropout = nn.Dropout(0.1)
        self.out = nn.Linear(16, 1)
        
    def forward(self, x):
        h_embedding = self.embedding(x)
        h_embedding = torch.squeeze(self.embedding_dropout(torch.unsqueeze(h_embedding, 0)))
        
        h_lstm, _ = self.lstm(h_embedding)
        h_gru, _ = self.gru(h_lstm)
        
        h_lstm_atten = self.lstm_attention(h_lstm)
        h_gru_atten = self.gru_attention(h_gru)
        
        avg_pool = torch.mean(h_gru, 1)
        max_pool, _ = torch.max(h_gru, 1)
        
        conc = torch.cat((h_lstm_atten, h_gru_atten, avg_pool, max_pool), 1)
        conc = self.relu(self.linear(conc))
        conc = self.dropout(conc)
        out = self.out(conc)
        
        return out

In [24]:
model1 = NeuralNet()
model2 = Alex_NeuralNet_Meta(70,16, embedding_matrix=embedding_matrix)
model3 = NeuralNet1()
model4 = NeuralNet2()
model5 = NeuralNet3()

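The input width of each model's first `nn.Linear` follows from what gets concatenated in `forward`: every attention or pooling output taken from a bidirectional layer is `2 * hidden_size` wide. The small sketch below redoes that arithmetic. `concat_width` is a hypothetical helper, and the split of the `+3` in `NeuralNet` into one capsule output plus two handcrafted features is an assumption, since `features` is built outside this section.

```python
def concat_width(hidden_size, n_blocks, n_extra=0):
    """Width of the concatenated vector fed to the first linear layer.

    Each block taken from a bidirectional layer is 2 * hidden_size wide.
    """
    return 2 * hidden_size * n_blocks + n_extra

# NeuralNet: LSTM attention, GRU attention, average pool, max pool (4 blocks),
# plus 1 capsule output and (by assumption) 2 handcrafted features.
for h in (80, 60, 40):
    print(h, concat_width(h, 4, n_extra=3))  # 80 643, 60 483, 40 323

# NeuralNet1: attention only             -> hidden_size * 2
# NeuralNet2: attention, avg, max        -> hidden_size * 6
# NeuralNet3: two attentions, avg, max   -> hidden_size * 8
print(concat_width(40, 1), concat_width(40, 3), concat_width(40, 4))  # 80 240 320
```

The three values printed for `NeuralNet` match the `#643:80 - 483:60 - 323:40` comment next to `self.linear`.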
### Training

The method for training is borrowed from https://www.kaggle.com/hengzheng/pytorch-starter

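The training function further down creates a `CyclicLR` scheduler in `exp_range` mode with `base_lr=0.001`, `max_lr=0.003`, `step_size=300` and `gamma=0.99994`, and steps it once per batch. The `CyclicLR` class is defined outside this section, so what follows is only a sketch of the usual cyclical learning rate formula, assuming that class follows it. `cyclic_lr` is a hypothetical helper and not part of the notebook.

```python
import math

def cyclic_lr(iteration, base_lr=0.001, max_lr=0.003, step_size=300, gamma=0.99994):
    """Learning rate at a given batch iteration for a triangular 'exp_range' cycle."""
    # which cycle we are in (one cycle = 2 * step_size iterations)
    cycle = math.floor(1 + iteration / (2 * step_size))
    # distance from the peak of the current cycle, in [0, 1]
    x = abs(iteration / step_size - 2 * cycle + 1)
    # exp_range: the amplitude decays by gamma ** iteration
    scale = gamma ** iteration
    return base_lr + (max_lr - base_lr) * max(0.0, 1 - x) * scale

for it in (0, 150, 300, 450, 600):
    print(it, round(cyclic_lr(it), 6))
```

The rate starts at `base_lr`, climbs to a peak slightly below `max_lr` after `step_size` batches, and is back at `base_lr` after `2 * step_size` batches, with each later peak a little lower than the previous one.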
In [25]:
def bestThresshold(y_train,train_preds):
    tmp = [0,0,0] # [current threshold, current F1, best F1]
    delta = 0
    for tmp[0] in tqdm(np.arange(0.1, 0.501, 0.01)):
        tmp[1] = f1_score(y_train, np.array(train_preds)>tmp[0])
        if tmp[1] > tmp[2]:
            delta = tmp[0]
            tmp[2] = tmp[1]
    print('best threshold is {:.4f} with F1 score: {:.4f}'.format(delta, tmp[2]))
    return delta

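`bestThresshold` scans thresholds from 0.1 to 0.5 in steps of 0.01 and keeps the first one that gives the highest F1 score. The same search can be sketched with NumPy alone, which makes the logic easy to check on a toy example. `f1_binary` and `scan_thresholds` are hypothetical names, and the hand-written F1 stands in for `sklearn.metrics.f1_score`.

```python
import numpy as np

def f1_binary(y_true, y_pred):
    """F1 score for 0/1 labels: 2*TP / (2*TP + FP + FN)."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom > 0 else 0.0

def scan_thresholds(y_true, probs, grid=None):
    """Return (best threshold, best F1) over the grid, first maximum wins."""
    grid = np.arange(0.1, 0.501, 0.01) if grid is None else grid
    scores = np.array([f1_binary(y_true, (probs > t).astype(int)) for t in grid])
    best = int(np.argmax(scores))
    return float(grid[best]), float(scores[best])

# toy out-of-fold probabilities (made-up numbers, for illustration only)
y_true = np.array([0, 0, 0, 1, 1, 0, 1, 0])
probs = np.array([0.05, 0.22, 0.355, 0.455, 0.305, 0.125, 0.80, 0.02])
best_t, best_f1 = scan_thresholds(y_true, probs)
print(best_t, best_f1)
```

On this toy data the best threshold lies between 0.22 and 0.305, where all three positives are caught at the cost of one false positive.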

In [26]:

class MyDataset(Dataset):
    def __init__(self,dataset):
        self.dataset = dataset

    def __getitem__(self, index):
        data, target = self.dataset[index]

        return data, target, index
    def __len__(self):
        return len(self.dataset)

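`MyDataset` returns the sample index together with the data and the target, so that the handcrafted features can be looked up per batch even when the loader shuffles. The framework-free sketch below shows why that keeps features and samples aligned. `IndexedDataset` mirrors `MyDataset`, while `shuffled_batches` is a hypothetical stand-in for `DataLoader(..., shuffle=True)` and the data is made up.

```python
import numpy as np

class IndexedDataset:
    """Wraps an indexable dataset of (data, target) pairs and also returns the index."""
    def __init__(self, dataset):
        self.dataset = dataset

    def __getitem__(self, index):
        data, target = self.dataset[index]
        return data, target, index

    def __len__(self):
        return len(self.dataset)

def shuffled_batches(dataset, batch_size, rng):
    """Hypothetical stand-in for a shuffling DataLoader: yields stacked batches."""
    order = rng.permutation(len(dataset))
    for start in range(0, len(order), batch_size):
        items = [dataset[int(j)] for j in order[start:start + batch_size]]
        xs, ys, idx = zip(*items)
        yield np.stack(xs), np.array(ys), np.array(idx)

rng = np.random.default_rng(0)
tokens = np.arange(10 * 4).reshape(10, 4)              # sample i starts with token 4 * i
labels = np.arange(10) % 2
features = np.arange(10, dtype=float)[:, None] * 100   # sample i has feature 100 * i

ds = IndexedDataset(list(zip(tokens, labels)))
aligned = True
for x_batch, y_batch, index in shuffled_batches(ds, batch_size=3, rng=rng):
    f = features[index]  # same lookup as kfold_X_features[index] in the training loop
    # row k of f must belong to the same sample as row k of x_batch
    aligned &= bool(np.all(f[:, 0] == x_batch[:, 0] / 4 * 100))
print(aligned)
```

Slicing `features` by batch position instead of by `index` would only be correct without shuffling, which is the situation in the test loop.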
In [27]:
def pytorch_model_run_cv(n_splits,n_epochs,x_train,y_train,features ,x_test, model_obj, clip = True):
    seed_everything()
    avg_losses_f = []
    avg_val_losses_f = []
    # matrix for the out-of-fold predictions
    train_preds = np.zeros((len(x_train)))
    # matrix for the predictions on the test set
    test_preds = np.zeros((len(x_test)))
    x_test_cuda = torch.tensor(x_test, dtype=torch.long).cuda()
    
    test = torch.utils.data.TensorDataset(x_test_cuda)
    test_loader = torch.utils.data.DataLoader(test, batch_size=batch_size, shuffle=False)
    
    
    splits = list(StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=SEED).split(x_train, y_train))
    for i, (train_idx, valid_idx) in enumerate(splits):
        seed_everything(i*1000+i)
        x_train = np.array(x_train)
        y_train = np.array(y_train)
        
        features = np.array(features)
        x_train_fold = torch.tensor(x_train[train_idx.astype(int)], dtype=torch.long).cuda()
        y_train_fold = torch.tensor(y_train[train_idx.astype(int), np.newaxis], dtype=torch.float32).cuda()
        
        kfold_X_features = features[train_idx.astype(int)]
        kfold_X_valid_features = features[valid_idx.astype(int)]
        x_val_fold = torch.tensor(x_train[valid_idx.astype(int)], dtype=torch.long).cuda()
        y_val_fold = torch.tensor(y_train[valid_idx.astype(int), np.newaxis], dtype=torch.float32).cuda()
        
        model = copy.deepcopy(model_obj)

        model.cuda()

        loss_fn = torch.nn.BCEWithLogitsLoss(reduction='sum')

        step_size = 300
        base_lr, max_lr = 0.001, 0.003   
        optimizer = torch.optim.Adam(filter(lambda p: p.requires_grad, model.parameters()), 
                                 lr=max_lr)
        
        ################################################################################################
        scheduler = CyclicLR(optimizer, base_lr=base_lr, max_lr=max_lr,
                   step_size=step_size, mode='exp_range',
                   gamma=0.99994)
        ###############################################################################################

        train = MyDataset(torch.utils.data.TensorDataset(x_train_fold, y_train_fold))
        valid = MyDataset(torch.utils.data.TensorDataset(x_val_fold, y_val_fold))
        
        train_loader = torch.utils.data.DataLoader(train, batch_size=batch_size, shuffle=True)
        valid_loader = torch.utils.data.DataLoader(valid, batch_size=batch_size, shuffle=False)

        print(f'Fold {i + 1}')
        for epoch in range(n_epochs):
            start_time = time.time()
            model.train()

            avg_loss = 0.  
            for i, (x_batch, y_batch, index) in enumerate(train_loader):      
                f = kfold_X_features[index]
                y_pred = model([x_batch,f])
                

                if scheduler:
                    scheduler.batch_step()

                # Compute the loss.
                loss = loss_fn(y_pred, y_batch)
                optimizer.zero_grad()
                loss.backward()
                if clip:
                    nn.utils.clip_grad_norm_(model.parameters(),1)
                optimizer.step()
                avg_loss += loss.item() / len(train_loader)
                
            model.eval()
            
            valid_preds_fold = np.zeros((x_val_fold.size(0)))
            test_preds_fold = np.zeros((len(x_test)))
            
            avg_val_loss = 0.
            for i, (x_batch, y_batch,index) in enumerate(valid_loader):
                f = kfold_X_valid_features[index]            
                y_pred = model([x_batch,f]).detach()
                
                avg_val_loss += loss_fn(y_pred, y_batch).item() / len(valid_loader)
                valid_preds_fold[index] = sigmoid(y_pred.cpu().numpy())[:, 0]
            
            elapsed_time = time.time() - start_time 
            print('Epoch {}/{} \t loss={:.4f} \t val_loss={:.4f} \t time={:.2f}s'.format(
                epoch + 1, n_epochs, avg_loss, avg_val_loss, elapsed_time))
        avg_losses_f.append(avg_loss)
        avg_val_losses_f.append(avg_val_loss) 
        # predict all samples in the test set batch per batch
        for i, (x_batch,) in enumerate(test_loader):
            f = test_features[i * batch_size:(i+1) * batch_size]
            y_pred = model([x_batch,f]).detach()
            test_preds_fold[i * batch_size:(i+1) * batch_size] = sigmoid(y_pred.cpu().numpy())[:, 0]
            
        train_preds[valid_idx] = valid_preds_fold
        test_preds += test_preds_fold / len(splits)

    print('All \t loss={:.4f} \t val_loss={:.4f} \t '.format(np.average(avg_losses_f),np.average(avg_val_losses_f)))
    return train_preds, test_preds

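`pytorch_model_run_cv` returns two things: out-of-fold predictions for the training set (each sample is predicted by the one fold model that never saw it) and test predictions averaged over the folds. The bookkeeping can be sketched without PyTorch as follows. Everything here is illustrative: `oof_predict` and `centroid_model` are hypothetical, the folds are plain shuffled splits rather than `StratifiedKFold`, and the toy model replaces the neural networks.

```python
import numpy as np

def oof_predict(x_train, y_train, x_test, fit_predict, n_splits=4, seed=0):
    """Out-of-fold train predictions plus fold-averaged test predictions."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(x_train)), n_splits)
    train_preds = np.zeros(len(x_train))
    test_preds = np.zeros(len(x_test))
    for k, valid_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != k])
        valid_p, test_p = fit_predict(
            x_train[train_idx], y_train[train_idx], x_train[valid_idx], x_test)
        train_preds[valid_idx] = valid_p   # each sample is filled exactly once
        test_preds += test_p / n_splits    # average over the fold models
    return train_preds, test_preds

def centroid_model(x_tr, y_tr, x_val, x_te):
    """Toy model: sigmoid of the projection onto the difference of class means."""
    mu1, mu0 = x_tr[y_tr == 1].mean(axis=0), x_tr[y_tr == 0].mean(axis=0)
    w = mu1 - mu0
    b = -0.5 * (mu1 + mu0) @ w
    score = lambda x: 1 / (1 + np.exp(-(x @ w + b)))
    return score(x_val), score(x_te)

rng = np.random.default_rng(1)
y = rng.integers(0, 2, size=200)
x = rng.normal(size=(200, 3)) + y[:, None] * 2.0   # two shifted Gaussian blobs
x_test = rng.normal(size=(50, 3))
train_preds, test_preds = oof_predict(x, y, x_test, centroid_model)
print(train_preds.shape, test_preds.shape)
```

Because `train_preds` only contains predictions from models that did not train on the corresponding rows, it can be used both for threshold search and as input to a second-level model.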


In [28]:
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

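The `sigmoid` helper above is fine for typical logits. For very negative inputs, `np.exp(-x)` overflows in floating point and NumPy emits a warning even though the result still tends to the right limit. A common alternative, sketched here under the assumption that the input is a NumPy array, evaluates each sign with the form that cannot overflow. `stable_sigmoid` is a hypothetical name.

```python
import numpy as np

def stable_sigmoid(x):
    """Sigmoid that never exponentiates a large positive number."""
    x = np.asarray(x, dtype=np.float64)
    out = np.empty_like(x)
    pos = x >= 0
    # x >= 0: exp(-x) is at most 1
    out[pos] = 1.0 / (1.0 + np.exp(-x[pos]))
    # x < 0: rewrite as exp(x) / (1 + exp(x)), exp(x) is below 1
    ex = np.exp(x[~pos])
    out[~pos] = ex / (1.0 + ex)
    return out

logits = np.array([-1000.0, -2.0, 0.0, 2.0, 1000.0])
probs = stable_sigmoid(logits)
print(probs)
```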
In [29]:
def stacking_level1(x_train, y_train, test):
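    """ stacking
    input:  train_x, train_y, test
    output: predictions for test
    clfs:   the 5 first-level classifiers
    dataset_blend_train: predictions of the first-level classifiers, used as train_x for the second-level classifier
    dataset_blend_test: test input for the second-level classifier
    """
    # the 5 first-level classifiers
    #models = [model1, model2, model3, model4, model5]
    models = [model1, model2]
    # train_x and test for the second-level classifier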
    dataset_blend_train = np.zeros((x_train.shape[0], len(models)))
    dataset_blend_test = np.zeros((x_test.shape[0],len(models)))
    # x_test_cuda_f = torch.tensor(x_test_f, dtype=torch.long).cuda()
    # test_f = torch.utils.data.TensorDataset(x_test_cuda_f)
    # test_loader_f = torch.utils.data.DataLoader(test_f, batch_size=batch_size, shuffle=False)
    x_test_cuda = torch.tensor(x_test, dtype=torch.long).cuda()
    test = torch.utils.data.TensorDataset(x_test_cuda)
    test_loader = torch.utils.data.DataLoader(test, batch_size=batch_size, shuffle=False)
    
    
    # run 4-fold prediction with the 5 classifiers
    
    # splits = list(StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=SEED).split(x_train, y_train))
    #for i, model in enumerate(models):
        # dataset_blend_test = np.zeros((x_test.shape[0],n_splits)) # predictions of each classifier for a single fold
        # always call this before training for deterministic results
    print(f'Model 1')
    start_time = time.time()
    seed_everything()
    train_preds, test_preds = pytorch_model_run_cv(3,4,x_train,y_train,features,x_test,models[0], clip = True)
        #print(train_preds[:10])
        #print(test_preds[:10])
    dataset_blend_train[:,0] = train_preds 
    dataset_blend_test[:,0]= test_preds
        
    delta = bestThresshold(y_train,train_preds)
    elapsed_time = time.time() - start_time 
    print('Model 1 \t time={:.2f}s'.format( elapsed_time))
        
        
    print(f'Model 2')
    start_time = time.time()
    seed_everything()
    train_preds, test_preds = pytorch_model_run_cv(4,4,x_train,y_train,features,x_test,models[1], clip = True)
        #print(train_preds[:10])
        #print(test_preds[:10])
    dataset_blend_train[:,1] = train_preds 
    dataset_blend_test[:,1]= test_preds
        
    delta = bestThresshold(y_train,train_preds)
    elapsed_time = time.time() - start_time 
    print('Model 2\t time={:.2f}s'.format(elapsed_time))
        
    print(dataset_blend_train.shape)
    print(dataset_blend_train[:10])
    print(dataset_blend_test.shape)
    print(dataset_blend_test[:10])
    
    return dataset_blend_train, dataset_blend_test


    

In [30]:
import copy
dataset_blend_train, dataset_blend_test = stacking_level1(x_train, y_train, x_test)

Model 1
Fold 1
Epoch 1/4 	 loss=82.6344 	 val_loss=78.8607 	 time=227.93s
Epoch 2/4 	 loss=57.8573 	 val_loss=63.4486 	 time=227.77s
Epoch 3/4 	 loss=54.3791 	 val_loss=58.1555 	 time=228.22s
Epoch 4/4 	 loss=51.4005 	 val_loss=58.8835 	 time=227.67s
Fold 2
Epoch 1/4 	 loss=83.3871 	 val_loss=52.9682 	 time=227.71s
Epoch 2/4 	 loss=58.5807 	 val_loss=51.9227 	 time=227.80s
Epoch 3/4 	 loss=55.1009 	 val_loss=51.7660 	 time=228.28s
Epoch 4/4 	 loss=52.0815 	 val_loss=52.2676 	 time=227.76s
Fold 3
Epoch 1/4 	 loss=82.8507 	 val_loss=58.7376 	 time=228.15s
Epoch 2/4 	 loss=57.8250 	 val_loss=52.9793 	 time=227.83s
Epoch 3/4 	 loss=54.5298 	 val_loss=50.7844 	 time=228.18s
Epoch 4/4 	 loss=51.0762 	 val_loss=51.6702 	 time=227.60s
All 	 loss=51.5194 	 val_loss=54.2738 	 



best threshold is 0.2800 with F1 score: 0.6785
Model 1 	 time=2761.32s
Model 2
Fold 1
Epoch 1/4 	 loss=66.8371 	 val_loss=53.3386 	 time=192.97s
Epoch 2/4 	 loss=58.8461 	 val_loss=51.3629 	 time=193.84s
Epoch 3/4 	 loss=55.6100 	 val_loss=50.6911 	 time=194.19s
Epoch 4/4 	 loss=52.5372 	 val_loss=52.1161 	 time=194.06s
Fold 2
Epoch 1/4 	 loss=66.1756 	 val_loss=55.6075 	 time=195.14s
Epoch 2/4 	 loss=58.7467 	 val_loss=52.9427 	 time=193.98s
Epoch 3/4 	 loss=55.4291 	 val_loss=52.8714 	 time=193.77s
Epoch 4/4 	 loss=52.3028 	 val_loss=52.1701 	 time=193.72s
Fold 3
Epoch 1/4 	 loss=66.4468 	 val_loss=53.3490 	 time=194.19s
Epoch 2/4 	 loss=58.7417 	 val_loss=52.1803 	 time=193.77s
Epoch 3/4 	 loss=55.2498 	 val_loss=52.2232 	 time=193.95s
Epoch 4/4 	 loss=51.8789 	 val_loss=54.1351 	 time=193.67s
Fold 4
Epoch 1/4 	 loss=66.8734 	 val_loss=53.3266 	 time=193.76s
Epoch 2/4 	 loss=58.5140 	 val_loss=52.7945 	 time=193.76s
Epoch 3/4 	 loss=55.1485 	 val_loss=50.5539 	 time=195.17s
Epoch 4


best threshold is 0.2700 with F1 score: 0.6810
Model 2	 time=3130.97s
(1306122, 2)
[[1.04836293e-03 2.00492963e-02]
 [3.81487198e-02 8.15159380e-02]
 [6.13783021e-04 1.46030937e-03]
 [2.02142261e-02 1.31544024e-01]
 [5.87348267e-03 1.05027203e-02]
 [1.64671801e-04 2.01822477e-04]
 [8.16344400e-04 8.70656077e-05]
 [2.35359228e-04 1.71699852e-04]
 [7.64139812e-04 9.73958056e-04]
 [1.78086339e-03 1.96978147e-03]]
(56370, 2)
[[4.36094003e-04 1.65193414e-03]
 [7.70563808e-05 1.21547061e-04]
 [1.29842322e-03 3.04735679e-04]
 [3.99529695e-03 3.91631904e-04]
 [8.44137404e-04 2.56653271e-04]
 [3.83549044e-03 9.00024688e-03]
 [7.20759931e-04 1.04038330e-03]
 [2.46791897e-04 1.38276759e-04]
 [7.86058930e-04 2.33426443e-04]
 [2.00263197e-03 2.41104067e-03]]


In [31]:
print(dataset_blend_train.shape)
print(dataset_blend_test.shape)

(1306122, 2)
(56370, 2)


In [32]:
def stacking_level2(x_train,x_test,y_train):
    # the second-level classifier makes the predictions
    from sklearn.metrics import f1_score
    from sklearn.model_selection import StratifiedKFold
    import lightgbm as lgb
    
    start_time = time.time()
    
    splits = list(StratifiedKFold(n_splits=n_splits1, shuffle=True, random_state=SEED).split(x_train, y_train))
    
    for i,[train_idx, valid_idx] in enumerate (splits):
        print(f'2nd level Fold {i + 1}')
        x_train_tr = x_train[train_idx]
        y_train_tr = y_train[train_idx]
        
        x_train_val = x_train[valid_idx]
        y_train_val = y_train[valid_idx]
    
        d_train = lgb.Dataset(x_train_tr,y_train_tr)
        d_val = lgb.Dataset(x_train_val,y_train_val)
        params = {}
        params['learning_rate'] = 0.02
        params['boosting_type'] = 'gbdt'
        params['objective'] = 'binary'
        params['metric'] = 'binary_logloss'
        params['sub_feature'] = 0.5
        params['num_leaves'] = 80
        params['min_data'] = 500
        params['max_depth'] = 10
        params['lambda_l1']= 0.1
        
        clf = lgb.train(params, d_train, 500, valid_sets = d_val, early_stopping_rounds=20)
        
        y_pred_val = clf.predict(x_train_val)
        
    delta = bestThresshold(y_train_val,y_pred_val)
        
        # f1_score= f1_score(y_train_val,y_pred_val) 
        # print('f1_score',f1_score)
        
    y_pred_test = clf.predict(x_test)
    prediction = (y_pred_test>delta).astype(int)
         
    elapsed_time = time.time() - start_time 
    print('2nd prediction \t time={:.2f}s'.format(elapsed_time))
    
    return prediction
    

In [33]:
n_splits1 = 5
prediction = stacking_level2(dataset_blend_train, dataset_blend_test,y_train)

2nd level Fold 1
[1]	valid_0's binary_logloss: 0.22158
Training until validation scores don't improve for 20 rounds.
[2]	valid_0's binary_logloss: 0.213347
[3]	valid_0's binary_logloss: 0.206232
[4]	valid_0's binary_logloss: 0.200185
[5]	valid_0's binary_logloss: 0.194736
[6]	valid_0's binary_logloss: 0.189932
[7]	valid_0's binary_logloss: 0.185496
[8]	valid_0's binary_logloss: 0.18151
[9]	valid_0's binary_logloss: 0.177782
[10]	valid_0's binary_logloss: 0.174381
[11]	valid_0's binary_logloss: 0.171162
[12]	valid_0's binary_logloss: 0.168204
[13]	valid_0's binary_logloss: 0.165382
[14]	valid_0's binary_logloss: 0.162771
[15]	valid_0's binary_logloss: 0.160267
[16]	valid_0's binary_logloss: 0.157938
[17]	valid_0's binary_logloss: 0.155693
[18]	valid_0's binary_logloss: 0.153598
[19]	valid_0's binary_logloss: 0.151573
[20]	valid_0's binary_logloss: 0.149676
[21]	valid_0's binary_logloss: 0.147832
[22]	valid_0's binary_logloss: 0.146104
[23]	valid_0's binary_logloss: 0.14442
[24]	valid_0'


best threshold is 0.3900 with F1 score: 0.6954
2nd prediction 	 time=206.52s


### Find final Threshold

Borrowed from: https://www.kaggle.com/ziliwang/baseline-pytorch-bilstm

In [34]:
submission = df_test[['qid']].copy()
submission['prediction'] = prediction
submission.to_csv('submission.csv', index=False)

In [35]:
!head submission.csv

qid,prediction
00014894849d00ba98a9,0
000156468431f09b3cae,0
000227734433360e1aae,0
0005e06fbe3045bd2a92,0
00068a0f7f41f50fc399,0
000a2d30e3ffd70c070d,0
000b67672ec9622ff761,0
000b7fb1146d712c1105,0
000d665a8ddc426a1907,0
