## **Kaggle Project**

---


#### Applications of Deep Learning(WUSTL, Fall 2019): Natural Language Understanding: Are Two Sentences of the same Topic
#### **Team**: Kattle
#### **Team member**: Gao Yaming, Lan Gangqi, Li Hao, Lu Wei

---



We devide this project into 4 parts: pre-processing, feature engineering, feature selection and modeling. Following description shows how we work on this project.



### 1.   Pre-processing <br>
&nbsp;&nbsp;&nbsp;&nbsp;We use three kinds of pre-processing methods for the following work:
<br>&nbsp;&nbsp;&nbsp;&nbsp;**The first method** is to turn all the words into lowercase and remove the 
stop words and lemma the words(for counting the same words between sent1 and sent2). 
<br>&nbsp;&nbsp;&nbsp;&nbsp;**The second method** is only to turn all the words into lowercase (for word2vec and GloVe). 
<br>&nbsp;&nbsp;&nbsp;&nbsp;**The third method** is to remove the stop words, and then use PoS to remove some ADP, DET, CCONJ, etc, and then do the lemmatization (for BoW).


### 2.   Feature Engineering <br>
&nbsp;&nbsp;&nbsp;&nbsp;Our features are derived from two parts: some basic statistical counting based on sent1 and sent2, and socre the similarity between two sentences by using embedding skills and algorithms.<br>
**Statistical counting:**
*   tf-idf/ same words (lemma version)
*   dummy: We select 100 words occurred with the highest frequency in the training set and turn them into 100 dummies. Besides, we also manually choose 18 categories and choose some common words in these categories. If sent1 and sent2 both contains the words from one category, then they can be put into the same category. Last, we also create dummies of if two sentences both contains some special characters ($, %, 0-9, etc).
*   kaggle_help_2: We adopt all the varibles you provided in help_2 <br>

**Similarity:** <br>
Similarity can be counted by firstly embedding the words into vectors by using techniques like word2vec and GloVe, and then using algorithms such as cosine similarity, word mover distance to count the similarity between two sentences. We use permutations in embedding methods and algorithms to generate about 400 features in this step.
*   embedding: word2vec models: different parameters(min_count, size, window); GloVe models:(different vector length)
*   algorithms: bow_cosine, bow_cityblock, bow_jaccard, bow_canberra, bow_euclidean, bow_minkowski, bow_braycurtis, word mover distance.

### 3. Feature Selection <br>
&nbsp;&nbsp;&nbsp;&nbsp;We firstly drop the features with high Spearman correlation, then we use Xgboost to select features with the highest importance score and DNN input perturbation ranking to select the most predictive features. 

### 4. Modelling <br>
&nbsp;&nbsp;&nbsp;&nbsp;We tried three models: Xgboost, LightGbm, DNN. When using Xgboost and LightGbm, we use grid search to find the best parameters. When use DNN, we use bayesian optimization to find the best sturcture. In addition, we try model ensembling. We finally find that Xgboost has the lowest log loss.



In [0]:
import warnings 
warnings.filterwarnings(action='ignore') 
from itertools import combinations
import numpy as np 
import pandas as pd 
import csv

from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'[A-Za-z]+')
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from spacy.lang.en.stop_words import STOP_WORDS
import spacy
nlp = spacy.load("en")
from collections import Counter
from tqdm import tqdm_notebook as tqdm
import pickle
import gensim
from gensim.models import Word2Vec 

from scipy.spatial.distance import cosine, cityblock, jaccard, canberra, euclidean, minkowski, braycurtis
from scipy.stats import zscore

import xgboost as xgb
import lightgbm as lgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, log_loss
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation, Dropout, BatchNormalization
from tensorflow.keras.callbacks import EarlyStopping
from keras import optimizers

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


Using TensorFlow backend.


In [0]:
from google.colab import drive
drive.mount('/content/drive')

Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3Aietf%3Awg%3Aoauth%3A2.0%3Aoob&scope=email%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdocs.test%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdrive%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdrive.photos.readonly%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fpeopleapi.readonly&response_type=code

Enter your authorization code:
··········
Mounted at /content/drive


In [0]:
df_train = pd.read_csv('/content/drive//My Drive/data/train.csv')
df_test = pd.read_csv('/content/drive//My Drive/data/test.csv')

## **Pre-Processing**


In [0]:
TRAIN_PKL = '/content/drive//My Drive/data/train.pkl'
TRAIN_CSV = '/content/drive//My Drive/data/train.csv'
TEST_PKL = '/content/drive//My Drive/data/test.pkl'
TEST_CSV = '/content/drive//My Drive/data/test.csv'

def token_pickle(file_name, pic_file_name):
    result = []
    with open(file_name, encoding="utf8") as csvfile:
        readCSV = csv.reader(csvfile)
        i = 0
        next(readCSV) # Skip header

        for row in tqdm(readCSV):
            
            # row: ['1', 'June – Moctezuma II, Aztec ruler of Tenochtitlan,...', 'The Swedish regent Sten Sture ...', '1']
            id = row[0]
            s1 = list(nlp(row[1]))
            s2 = list(nlp(row[2]))
            
            i += 1
            #if i>50: break
            
            s1 = [word.lemma_ for word in s1 if word.text.lower() not in STOP_WORDS and (len(word.text)>1 or word.text=='$') and word.pos_ not in ('ADP','DET','CCONJ','PRON','PART','INTJ','AUX')]
            s2 = [word.lemma_ for word in s2 if word.text.lower() not in STOP_WORDS and (len(word.text)>1 or word.text=='$') and word.pos_ not in ('ADP','DET','CCONJ','PRON','PART','INTJ','AUX')]
            
            
            if len(row)==4:
                result.append([id, s1, s2, int(row[3])])
            else:
                result.append([id, s1, s2])

    with open(pic_file_name, 'wb') as handle:
        pickle.dump(result, handle, protocol=pickle.HIGHEST_PROTOCOL)

#token_pickle(TRAIN_CSV, TRAIN_PKL)
#token_pickle(TEST_CSV, TEST_PKL)

In [0]:
with (open(TRAIN_PKL, "rb")) as openfile:
  train_data = pickle.load(openfile)

td = pd.DataFrame(train_data)
td.columns = ['id', 'sent1_tradition', 'sent2_tradition', 'same_source']

In [0]:
td = pd.concat([df_train[['sent1','sent2']], td], axis=1)

In [0]:
#  turn all the words into lowercase 
tokenizer = RegexpTokenizer(r'[A-Za-z]+')
def pre_trained_embedding_preprocess(s):
    word = tokenizer.tokenize(s.lower())
    return word

sent1_pre_trained = []
sent2_pre_trained = []
for i in range(td.shape[0]):
    sent1_pre_trained.append(pre_trained_embedding_preprocess(td.iat[i,0]))
    sent2_pre_trained.append(pre_trained_embedding_preprocess(td.iat[i,1]))

td.insert(6,'sent1_pre_trained',sent1_pre_trained)
td.insert(7,'sent2_pre_trained',sent2_pre_trained)

## **Feature Engineering**

#### **kaggle_help_2**
These features are from Kaggle_help_2



In [0]:
def get_weight(count, eps=10000, min_count=2):
    return 0 if count < min_count else 1 / (count + eps)

train_qs = pd.Series(df_train['sent1'].tolist() + df_train['sent2'].tolist()).astype(str)
words = (" ".join(train_qs)).lower().split()
counts = Counter(words)
weights = {word: get_weight(count) for word, count in counts.items()}

stops = set(stopwords.words("english"))
def word_shares(row):
    q1 = set(str(row['sent1']).lower().split())
    q1words = q1.difference(stops)
    if len(q1words) == 0:
        return '0:0:0:0:0'

    q2 = set(str(row['sent2']).lower().split())
    q2words = q2.difference(stops)
    if len(q2words) == 0:
        return '0:0:0:0:0'

    q1stops = q1.intersection(stops)
    q2stops = q2.intersection(stops)

    shared_words = q1words.intersection(q2words)
    shared_weights = [weights.get(w, 0) for w in shared_words] ## if no w, returns 0
    total_weights = [weights.get(w, 0) for w in q1words] + [weights.get(w, 0) for w in q2words]
    
    R1 = np.sum(shared_weights) / np.sum(total_weights) #tfidf share
    R2 = len(shared_words) / (len(q1words) + len(q2words)) #count share
    R31 = len(q1stops) / len(q1words) #stops in q1
    R32 = len(q2stops) / len(q2words) #stops in q2
    return '{}:{}:{}:{}:{}'.format(R1, R2, len(shared_words), R31, R32)

df = pd.concat([df_train, df_test])
df['word_shares'] = df.apply(word_shares, axis=1, raw=True)

x = pd.DataFrame()

x['word_match']       = df['word_shares'].apply(lambda x: float(x.split(':')[0]))
x['tfidf_word_match'] = df['word_shares'].apply(lambda x: float(x.split(':')[1]))
x['shared_count']     = df['word_shares'].apply(lambda x: float(x.split(':')[2]))

x['stops1_ratio']     = df['word_shares'].apply(lambda x: float(x.split(':')[3]))
x['stops2_ratio']     = df['word_shares'].apply(lambda x: float(x.split(':')[4]))
x['diff_stops_r']     = x['stops1_ratio'] - x['stops2_ratio']

x['len_q1'] = df['sent1'].apply(lambda x: len(str(x)))
x['len_q2'] = df['sent2'].apply(lambda x: len(str(x)))
x['diff_len'] = x['len_q1'] - x['len_q2']

x['len_char_q1'] = df['sent1'].apply(lambda x: len(str(x).replace(' ', '')))
x['len_char_q2'] = df['sent2'].apply(lambda x: len(str(x).replace(' ', '')))
x['diff_len_char'] = x['len_char_q1'] - x['len_char_q2']

x['len_word_q1'] = df['sent1'].apply(lambda x: len(str(x).split()))
x['len_word_q2'] = df['sent2'].apply(lambda x: len(str(x).split()))
x['diff_len_word'] = x['len_word_q1'] - x['len_word_q2']

x['avg_world_len1'] = x['len_char_q1'] / x['len_word_q1']
x['avg_world_len2'] = x['len_char_q2'] / x['len_word_q2']
x['diff_avg_word'] = x['avg_world_len1'] - x['avg_world_len2']

x['exactly_same'] = (df['sent1'] == df['sent2']).astype(int)
x['duplicated'] = df.duplicated(['sent1','sent2']).astype(int)

#### same words (lemma version)
Use lemma to preprocess all the words and then count the number of same words from two sentences.

In [0]:
from nltk.stem.porter import PorterStemmer
tokenizer = RegexpTokenizer(r'[A-Za-z]+')

def stem(row):
    ps = PorterStemmer()
    s1 = tokenizer.tokenize(td['sent1'][row].lower())
    s2 = tokenizer.tokenize(td['sent2'][row].lower())

    s1 = [ps.stem(w) for w in s1 if (w not in STOP_WORDS) and len(w)>1]
    s2 = [ps.stem(w) for w in s2 if (w not in STOP_WORDS) and len(w)>1]
    return len(set(s1).intersection(set(s2)))

stemming = pd.DataFrame()
same_word_stem = []
for i in range(td.shape[0]):
    same_word_stem.append(stem(i))
stemming['same_words_stem'] = pd.Series(same_word_stem)

In [0]:
pd.concat([stemming['same_words_stem'], td['same_source']], axis=1).groupby('same_source').mean()

Unnamed: 0_level_0,same_words_stem
same_source,Unnamed: 1_level_1
0,0.093188
1,0.57445


#### dummy 


##### (1) 100 common words
We select the most common 100 words and then turn them into dummy.
Eg: "age" is the most common words, if sent1 contains "age", then 1, else 0.

In [0]:
train_qs = pd.Series(df_train['sent1'].tolist() + df_train['sent2'].tolist()).astype(str)
words = (" ".join(train_qs)).lower().split()
counts = Counter(words)

In [0]:
for stop_words in STOP_WORDS:
    del counts[stop_words]
most_common = counts.most_common(100)

In [0]:
dummy = pd.DataFrame(data=td.index).drop(columns=0)

In [0]:
for i in range(len(most_common)):
    dummy[most_common[i][0]] = 0

In [0]:
for row in range(td.shape[0]):
    s1 = tokenizer.tokenize(td['sent1'][row])
    s2 = tokenizer.tokenize(td['sent2'][row])
    flag1, flag2 = 0, 0
    for i in range(len(most_common)):
        for word in s1:
            if word.lower() == most_common[i][0]:
                flag1 = 1
        for word in s2:
            if word.lower() == most_common[i][0]:
                flag2 = 1
        if (flag1 == 1) and (flag2 == 1):
            dummy.iat[row,i] = 1
        else:
            dummy.iat[row,i] = 0

In [0]:
grp = pd.concat([td['same_source'], dummy], axis=1).groupby('same_source').mean()
grp

Unnamed: 0_level_0,age,population,income,18,average,median,living,city,united,size,family,census,density,household,county,families,100,township,new,males.,town,area,square,65,mile,"american,",states,years,including,people,according,children,"present,",married,makeup,householder,",",located,units,"them,",...,known,2010,capita,males,"18,",older.,45,"24,",females.,"over,",village,versus,spread,"64,",poverty,"44,","line,",african,population.,race,north,land,"2000,","bureau,",hispanic,latino,"females,",states.,south,national,called,early,named,over.,time,"races,",census.,high,"asian,",year
same_source,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1,Unnamed: 23_level_1,Unnamed: 24_level_1,Unnamed: 25_level_1,Unnamed: 26_level_1,Unnamed: 27_level_1,Unnamed: 28_level_1,Unnamed: 29_level_1,Unnamed: 30_level_1,Unnamed: 31_level_1,Unnamed: 32_level_1,Unnamed: 33_level_1,Unnamed: 34_level_1,Unnamed: 35_level_1,Unnamed: 36_level_1,Unnamed: 37_level_1,Unnamed: 38_level_1,Unnamed: 39_level_1,Unnamed: 40_level_1,Unnamed: 41_level_1,Unnamed: 42_level_1,Unnamed: 43_level_1,Unnamed: 44_level_1,Unnamed: 45_level_1,Unnamed: 46_level_1,Unnamed: 47_level_1,Unnamed: 48_level_1,Unnamed: 49_level_1,Unnamed: 50_level_1,Unnamed: 51_level_1,Unnamed: 52_level_1,Unnamed: 53_level_1,Unnamed: 54_level_1,Unnamed: 55_level_1,Unnamed: 56_level_1,Unnamed: 57_level_1,Unnamed: 58_level_1,Unnamed: 59_level_1,Unnamed: 60_level_1,Unnamed: 61_level_1,Unnamed: 62_level_1,Unnamed: 63_level_1,Unnamed: 64_level_1,Unnamed: 65_level_1,Unnamed: 66_level_1,Unnamed: 67_level_1,Unnamed: 68_level_1,Unnamed: 69_level_1,Unnamed: 70_level_1,Unnamed: 71_level_1,Unnamed: 72_level_1,Unnamed: 73_level_1,Unnamed: 74_level_1,Unnamed: 75_level_1,Unnamed: 76_level_1,Unnamed: 77_level_1,Unnamed: 78_level_1,Unnamed: 79_level_1,Unnamed: 80_level_1,Unnamed: 81_level_1
0,0.003627,0.013113,0.025467,0.025467,0.037573,0.037588,0.038456,0.055088,0.0691,0.069999,0.073084,0.08181,0.082074,0.082182,0.098814,0.099682,0.099682,0.105929,0.120375,0.120375,0.131659,0.141502,0.142354,0.142354,0.142959,0.142959,0.145129,0.154119,0.160645,0.16655,0.171014,0.173014,0.173014,0.174626,0.179385,0.1794,0.1794,0.18977,0.191025,0.191025,...,0.235573,0.235573,0.235666,0.235883,0.235883,0.235883,0.235883,0.235883,0.235883,0.235883,0.238813,0.238906,0.239603,0.239603,0.239743,0.239743,0.239743,0.24089,0.24089,0.242192,0.24864,0.25174,0.25174,0.25174,0.251802,0.251802,0.251802,0.251802,0.256219,0.263055,0.272526,0.279439,0.284957,0.284957,0.297915,0.297915,0.297915,0.302798,0.302798,0.311943
1,0.008478,0.029177,0.058647,0.058647,0.091954,0.091985,0.092279,0.122941,0.147368,0.147832,0.149116,0.167572,0.167664,0.167757,0.196872,0.197584,0.197584,0.212388,0.22133,0.22133,0.239801,0.251156,0.251775,0.251775,0.252425,0.252425,0.253616,0.257948,0.261135,0.264151,0.266503,0.2674,0.2674,0.26805,0.277146,0.277146,0.277146,0.291363,0.291951,0.291951,...,0.3454,0.3454,0.345431,0.345508,0.345508,0.345508,0.345508,0.345508,0.345508,0.345508,0.349453,0.349484,0.349855,0.349855,0.35001,0.35001,0.35001,0.351,0.351,0.351882,0.357884,0.361195,0.361195,0.361195,0.361226,0.361257,0.361257,0.361257,0.365527,0.37173,0.377052,0.381801,0.387169,0.387169,0.394749,0.394749,0.394749,0.397317,0.397317,0.403738


In [0]:
for i in range(grp.shape[1]):
    print( abs(grp.iat[0,i]-grp.iat[1,i])/max(grp.iat[0,i],grp.iat[0,i]), grp.columns[i], grp.iat[1,i]/grp.iat[0,i] )

1.3373154848534252 age 2.337315484853425
1.2249689803399701 population 2.22496898033997
1.3028669990365722 income 2.3028669990365724
1.3028669990365722 18 2.3028669990365724
1.4473654261977922 average 2.447365426197792
1.4471793379995492 median 2.4471793379995495
1.3995859973118763 living 2.3995859973118763
1.231713418930895 city 2.231713418930895
1.1326674878914529 united 2.1326674878914527
1.1119072595664254 size 2.111907259566425
1.0403417599869595 family 2.0403417599869593
1.0482921504745488 census 2.048292150474549
1.0428468885441067 density 2.0428468885441067
1.0412792492413092 household 2.041279249241309
0.9923441778437834 county 1.9923441778437834
0.9821341112268187 families 1.9821341112268187
0.9821341112268187 100 1.9821341112268187
1.0050101051802298 township 2.00501010518023
0.8386701526447685 new 1.8386701526447684
0.8386701526447685 males. 1.8386701526447684
0.8213780348573291 town 1.8213780348573292
0.7749320073678959 area 1.774932007367896
0.7686494257590044 square 1.76

In [0]:
counts.most_common(100)

[('age', 22533),
 ('population', 16108),
 ('income', 14905),
 ('18', 14700),
 ('average', 14041),
 ('median', 11793),
 ('living', 10268),
 ('city', 10229),
 ('united', 9453),
 ('size', 9427),
 ('family', 9337),
 ('census', 8420),
 ('density', 8355),
 ('household', 8312),
 ('county', 8261),
 ('families', 8016),
 ('100', 7767),
 ('township', 7698),
 ('new', 7538),
 ('males.', 7296),
 ('town', 7194),
 ('area', 6648),
 ('square', 6400),
 ('65', 6386),
 ('mile', 6048),
 ('american,', 5994),
 ('states', 5924),
 ('years', 5867),
 ('including', 5848),
 ('people', 5581),
 ('according', 5265),
 ('children', 5179),
 ('present,', 5179),
 ('married', 5143),
 ('makeup', 5126),
 ('householder', 5110),
 (',', 5072),
 ('located', 5027),
 ('units', 4878),
 ('them,', 4813),
 ('housing', 4794),
 ('female', 4737),
 ('households,', 4685),
 ('together,', 4666),
 ('couples', 4622),
 ('husband', 4600),
 ('people,', 4450),
 ('non-families.', 4424),
 ('racial', 4388),
 ('households', 4384),
 ('total', 4355),
 ('

In [0]:
#td[td['sent1'].str.contains('population')][['sent1','sent2','same_source']].to_excel('/content/drive//My Drive/data/median_income.xlsx', index=False)

##### (2) % $ 0-9
if contains "%" or "$" or 0-9, then turn it to dummy.

In [0]:
def token_pickle(file_name, pic_file_name):
    result = []
    with open(file_name, encoding="utf8") as csvfile:
        readCSV = csv.reader(csvfile)
        i = 0
        next(readCSV) # Skip header

        for row in tqdm(readCSV):
            
            id = row[0]
            s1 = list(nlp(row[1]))
            s2 = list(nlp(row[2]))
            
            i += 1
            
            #if word.text=='$':s1='$'
            #if word.text=='$':s2='$'
            #s1 = ['$' if word.text=='$']
            #s2 = ['$' if word.text=='$']
            #s1 = len(['%' for word in s1 if word.text=='%' ])
            #s2 = len(['%' for word in s2 if word.text=='%' ])
            s1 = len(['%' for word in s1 if is_number(replace_all_blank(word.text))==1 ])
            s2 = len(['%' for word in s2 if is_number(replace_all_blank(word.text))==1 ])

            
            if len(row)==4:
                result.append([id, s1, s2, int(row[3])])
            else:
                result.append([id, s1, s2])

    with open(pic_file_name, 'wb') as handle:
        pickle.dump(result, handle, protocol=pickle.HIGHEST_PROTOCOL)


##### categorical words
We define 20 categories, and add words to it. If sentences contains catorical words, then 1, else 0. (a manual process)

In [0]:
discipline = ['census','median','income','household','population','per','capita','racial','males','females','housing','community']
people = ['racial','makeup','gender','male','female','every','age','children','marry','family','couple','house','writer']
industry = ['energy','manufacture','market','power','electric','weapon','infrastructure','sustainable','transport','vehicle','product','production','factory']
business = ['finance','commercial','income','stock','CDP']
culture = ['philosophy','religion','nations','language','humanities','literature','ritual','linguistic','rhetoric']
economy = ['economies','economics','consumer','labor','money','trade','wealth','capital','GDP']
education = ['academic','curricula','literacy','literature','research','student','school','university','tutor','professor','teach','train','knowledge','learn','information','wisdom','research','scholar','understand','intellectual','logic','cognitive']
entertainment = ['recreation','amuse','movie','comedy','theater','media','film','game','humor','magic','art','television','toy','food','drink','cuisine','dairy','theme','visual','club','bar','sport','player','gamble','card']
art = ['opera','genre','paint','library','comedy','tragedy','drama','dance','artistic','artist','sculptures']
society = ['location','accident','conflict','event','emergency','news','individual','behavior','human','communication','personal','affair','violence','gun','crime','kill','shoot','injure','information','sociology','people','community','facility','rail','station','housing','violent','place']
geography = ['environmental','ecosystem','land','region','zone','decay','plate','cluster','curve','rock','lithic','map','mountain','soil','landscape','island','equator','earth','contour','local','phenomena','geological','border','ocean','bay','area','rocky','peninsula','hemisphere','locate','island','regional','cloud','rain','rainfall','lake','park','Asia','Africa','English','resort']
health = ['patient','treat','medical','transcript','healthcare','alcohol','diet','disability','disease','nutrition','fitness','heal','injury','death','life']
history = ['period','era','epoch','historian','chronology']
law = ['legislation','judicial','just','bill','act','constitution','legal','right','state','abrogation','legislative','judge']
mathematics = ['formula','equation','theorem','mathematical','statistics','distribution','mean','medium','median','max','min','probability','possibility','sample','data','analysis','rate','more','average']
statistics = ['distribution','mean','medium','median','max','min','probability','possibility','sample','data','analysis','rate','per capita','more','average']
military = ['weapon','missile','force','army','corp','kill','armed','assault','attack','force','war','ink']
nature = ['animal','ecosystem','earth','natural','life','phenomena','river','forest','village','mountain','hiking','cableway','hamlet','goose','wild']
government  = ['citizen','politic','citizenship','agency','civil','service','election','vote','debt','tax','public','policy','presidency','administration','office','state','enforce','compulsory','constitution','hierarchical','parliament','congress','council','policy','agenda','regime','senator','democracy','field','committee','scandal','reform','political','society','republic','people','leader','population','county','portion','commonwealth','census','international','racial','race','electoral','congress','communist','capital','sign','bureau','legislative','authority','consolidate','execution','federal','enforce','diplomatic','judicial','senate','treaty','reform','county','president','holy','revolutionaries','metropolitan','hotel','cabins','Norway','religion','religious','baptism','monasticism','monk','church']
technology = ['science','develop','technique','information','fiction','computer','scientific','satellite','site','technologies','ray','radiate','radiation','unit','refract','frequency','charge','ion','atom','thermal','voltage','particle','chip','optical','electronic','properties','analysis','amplifier']

In [0]:
# numbers?
dummy = pd.DataFrame()

def is_number(s):
    try:
        float(s)
        return True
    except ValueError:
        pass
    try:
        import unicodedata
        unicodedata.numeric(s)
        return True
    except (TypeError, ValueError):
        pass
    return False

def dummys(row):
    sent1, sent2, a, b = 0, 0, 0, 0
    sent11, sent22, c, d = 0, 0, 0, 0

    dis, ind, peo, bus, cul, eco, edu, ent, artt, soc, geo = 0,0,0,0,0,0,0,0,0,0,0
    gov, hea, his, laww, mat, sta, mil, nat, tec = 0,0,0,0,0,0,0,0,0
    dis1, ind1, peo1, bus1, cul1, eco1, edu1, ent1, art1, soc1, geo1 = 0,0,0,0,0,0,0,0,0,0,0
    gov1, hea1, his1, law1, mat1, sta1, mil1, nat1, tec1 = 0,0,0,0,0,0,0,0,0
    dis2, ind2, peo2, bus2, cul2, eco2, edu2, ent2, art2, soc2, geo2 = 0,0,0,0,0,0,0,0,0,0,0
    gov2, hea2, his2, law2, mat2, sta2, mil2, nat2, tec2 = 0,0,0,0,0,0,0,0,0

    for word in row['sent1_tradition']:
        if is_number(word): sent1=1
        if word=='$': sent11=1 

        if word in (discipline): dis1=1
        if word in (industry): ind1=1
        if word in (people): peo1=1
        if word in (business): bus1=1
        if word in (culture): cul1=1
        if word in (economy): eco1=1
        if word in (education): edu1=1
        if word in (entertainment): ent1=1
        if word in (art): art1=1
        if word in (society): soc1=1
        if word in (geography): geo1=1
        if word in (government): gov1=1
        if word in (history): his1=1
        if word in (health): hea1=1
        if word in (law): law1=1
        if word in (mathematics): mat1=1
        if word in (statistics): sta1=1
        if word in (military): mil1=1
        if word in (nature): nat1=1
        if word in (technology): tec1=1


    for word in row['sent2_tradition']:
        if is_number(word): sent2=1 
        if word=='$': sent22=1 

        if word in (discipline): dis2=1
        if word in (industry): ind2=1
        if word in (people): peo2=1
        if word in (business): bus2=1
        if word in (culture): cul2=1
        if word in (economy): eco2=1
        if word in (education): edu2=1
        if word in (entertainment): ent2=1
        if word in (art): art2=1
        if word in (society): soc2=1
        if word in (geography): geo2=1
        if word in (government): gov2=1
        if word in (health): hea2=1
        if word in (history): his2=1
        if word in (law): law2=1
        if word in (mathematics): mat2=1
        if word in (statistics): sta2=1
        if word in (military): mil2=1
        if word in (nature): nat2=1
        if word in (technology): tec2=1

    # both contain number?
    a=0 if (sent1 == 1 and sent2 == 1) or (sent1 == 0 and sent2 == 0) else 1
    b=0 if (sent1 == 1 and sent2 == 1) else 1

    # contain $?
    c=0 if (sent11 == 1 and sent22 == 1) or (sent11 == 0 and sent22 == 0) else 1

    # category words?
    dis=1 if (dis1 == 1 and dis2 == 1) else 0
    ind=1 if (ind1 == 1 and ind2 == 1) else 0
    peo=1 if (peo1 == 1 and peo2 == 1) else 0
    bus=1 if (bus1 == 1 and bus2 == 1) else 0
    cul=1 if (cul1 == 1 and cul2 == 1) else 0
    eco=1 if (eco1 == 1 and eco2 == 1) else 0
    edu=1 if (edu1 == 1 and edu2 == 1) else 0
    ent=1 if (ent1 == 1 and ent2 == 1) else 0
    artt=1 if (art1 == 1 and art2 == 1) else 0
    soc=1 if (soc1 == 1 and soc2 == 1) else 0
    geo=1 if (geo1 == 1 and geo2 == 1) else 0
    gov=1 if (gov1 == 1 and gov2 == 1) else 0
    hea=1 if (hea1 == 1 and hea2 == 1) else 0
    his=1 if (his1 == 1 and his2 == 1) else 0
    laww=1 if (law1 == 1 and law2 == 1) else 0
    mat=1 if (mat1 == 1 and mat2 == 1) else 0
    sta=1 if (sta1 == 1 and sta2 == 1) else 0
    mil=1 if (mil1 == 1 and mil2 == 1) else 0
    nat=1 if (nat1 == 1 and nat2 == 1) else 0
    tec=1 if (tec1 == 1 and tec2 == 1) else 0

    return '{}:{}:{}:{}:{}:{}:{}:{}:{}:{}:{}:{}:{}:{}:{}:{}:{}:{}:{}:{}:{}:{}:{}'.format(a, b, c, dis, ind, peo, bus, cul, eco, edu, ent, artt, soc, geo, gov, hea, his, laww, mat, sta, mil, nat, tec)

dummy['dummys'] = td.apply(dummys, axis=1)
dummy['both_contain_number'] = dummy['dummys'].apply(lambda x: float(x.split(':')[0]))
dummy['contain_number'] = dummy['dummys'].apply(lambda x: float(x.split(':')[1]))
dummy['contain_$'] = dummy['dummys'].apply(lambda x: float(x.split(':')[2]))

dummy['discipline'] = dummy['dummys'].apply(lambda x: float(x.split(':')[3]))
dummy['industry'] = dummy['dummys'].apply(lambda x: float(x.split(':')[4]))
dummy['people'] = dummy['dummys'].apply(lambda x: float(x.split(':')[5]))
dummy['business'] = dummy['dummys'].apply(lambda x: float(x.split(':')[6]))
dummy['culture'] = dummy['dummys'].apply(lambda x: float(x.split(':')[7]))
dummy['economy'] = dummy['dummys'].apply(lambda x: float(x.split(':')[8]))
dummy['education'] = dummy['dummys'].apply(lambda x: float(x.split(':')[9]))
dummy['entertainment'] = dummy['dummys'].apply(lambda x: float(x.split(':')[10]))
dummy['art'] = dummy['dummys'].apply(lambda x: float(x.split(':')[11]))
dummy['society'] = dummy['dummys'].apply(lambda x: float(x.split(':')[12]))
dummy['geography'] = dummy['dummys'].apply(lambda x: float(x.split(':')[13]))
dummy['government'] = dummy['dummys'].apply(lambda x: float(x.split(':')[14]))
dummy['health'] = dummy['dummys'].apply(lambda x: float(x.split(':')[15]))
dummy['law'] = dummy['dummys'].apply(lambda x: float(x.split(':')[16]))
dummy['history'] = dummy['dummys'].apply(lambda x: float(x.split(':')[17]))
dummy['mathematics'] = dummy['dummys'].apply(lambda x: float(x.split(':')[18]))
dummy['statistics'] = dummy['dummys'].apply(lambda x: float(x.split(':')[19]))
dummy['military'] = dummy['dummys'].apply(lambda x: float(x.split(':')[20]))
dummy['nature'] = dummy['dummys'].apply(lambda x: float(x.split(':')[21]))
dummy['technology'] = dummy['dummys'].apply(lambda x: float(x.split(':')[22]))

In [0]:
dummy.drop(columns='dummys', inplace=True)

In [0]:
pd.concat([dummy, td['same_source']], axis=1).groupby('same_source').mean()

Unnamed: 0_level_0,both_contain_number,contain_number,contain_$,discipline,industry,people,business,culture,economy,education,entertainment,art,society,geography,government,health,law,history,mathematics,statistics,military,nature,technology
same_source,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1,Unnamed: 23_level_1
0,0.484647,0.83176,0.087887,0.0,0.000589,0.012292,0.011796,9.3e-05,1.6e-05,0.000775,0.001395,4.7e-05,0.005379,0.002139,0.034519,0.000233,0.000202,0.001132,1.6e-05,0.006929,0.000806,0.000124,0.000295
1,0.392909,0.781609,0.084018,0.0,0.002305,0.03029,0.031219,0.001361,0.000356,0.003264,0.006064,0.00065,0.005971,0.004038,0.050773,0.001129,0.00065,0.002816,0.000263,0.012546,0.003991,0.000774,0.001841


In [0]:
grp = pd.concat([td['same_source'], dummy], axis=1).groupby('same_source').mean()
for i in range(grp.shape[1]):
    print( abs(grp.iat[0,i]-grp.iat[1,i])/max(grp.iat[0,i],grp.iat[0,i]), grp.columns[i], grp.iat[1,i]/grp.iat[0,i] )

0.1892892338318083 both_contain_number 0.8107107661681917
0.06029486514639477 contain_number 0.9397051348536052
0.044018739560268895 contain_$ 0.9559812604397311
nan discipline nan
2.913409608859947 industry 3.9134096088599466
1.46429181967734 people 2.46429181967734
1.646605071442698 business 2.6466050714426985
13.63807800003094 culture 14.63807800003094
21.955167772775795 economy 22.955167772775795
3.2117742609179936 education 4.211774260917993
3.3470655878879763 entertainment 4.347065587887976
12.972710818211352 art 13.972710818211352
0.11022362614853472 society 1.1102236261485348
0.8876177658142664 geography 1.8876177658142665
0.47085883973858683 government 1.4708588397385869
3.8571804272829935 health 4.8571804272829935
2.2244717272795427 law 3.224471727279543
1.4882909676266791 history 2.488290967626679
15.9668631363995 mathematics 16.9668631363995
0.8107811558915637 statistics 1.8107811558915639
3.951867295465012 military 4.951867295465012
5.23781732955864 nature 6.23781732955864

#### word_match_share

In [0]:
stops = STOP_WORDS
def word_match_share(row):
    q1words = {}
    q2words = {}
    for word in str(row['sent1']).lower().split():
        if word not in stops:
            q1words[word] = 1
    for word in str(row['sent2']).lower().split():
        if word not in stops:
            q2words[word] = 1
    if len(q1words) == 0 or len(q2words) == 0:
        # The computer-generated chaff includes a few questions that are nothing but stopwords
        return 0
    shared_words_in_q1 = [w for w in q1words.keys() if w in q2words]
    # print(shared_words_in_q1)
    shared_words_in_q2 = [w for w in q2words.keys() if w in q1words]
    R = (len(shared_words_in_q1) + len(shared_words_in_q2))/(len(q1words) + len(q2words))
    return R
    
td['train_word_match'] = td.apply(word_match_share, axis=1, raw=True)

####  TF-IDF

In [0]:
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.linalg import norm
def tfidf_similarity(s1, s2):
    def add_space(s):
        return ' '.join(list(s))

    # 将字中间加入空格
    s1, s2 = add_space(s1), add_space(s2)
    # 转化为TF矩阵
    cv = TfidfVectorizer(tokenizer=lambda s: s.split())
    corpus = [s1, s2]
    vectors = cv.fit_transform(corpus).toarray()
    # 计算TF系数
    return np.dot(vectors[0], vectors[1]) / (norm(vectors[0]) * norm(vectors[1]))

tfidf_sim = []
for i in range(df_train.shape[0]):
    tfidf_sim.append(tfidf_similarity(df_train.iat[i,1], df_train.iat[i,2]))
    
td['tfidf_sim'] = tfidf_sim

#### same words

In [0]:
td['same_words'] = 0

for i in range(td.shape[0]):
    count = 0
    for word1 in td.loc[i,'sent1_tradition']:
        for word2 in td.loc[i,'sent2_tradition']:
            if word1==word2:
                count += 1
    
    td.loc[i,'same_words'] = count

#### BoW + basic distances
Count bag of words of two sentences in the same row, then use basic distances functions (cosine, cityblock, jaccard, canberra, euclidean, minkowski, braycurtis distances) to count the distance between sent1 and sent2.

In [0]:
bow_basic_distance = pd.DataFrame()

In [0]:
# cosine, cityblock, jaccard, canberra, euclidean, minkowski, braycurtis
def bow_sent(sent1, sent2):
    # turn list to string by using " ".join()
    vect = CountVectorizer(stop_words="english").fit([" ".join(sent1+sent2)]) #Convert a collection of text documents to a matrix of token counts
    v_1, v_2 = vect.transform([" ".join(sent1), " ".join(sent2)]) #Learn the vocabulary dictionary and return term-document matrixv
    v_1 = v_1.toarray().ravel()
    v_2 = v_2.toarray().ravel()
    bow_cosine = cosine(v_1, v_2)
    bow_cityblock = cityblock(v_1, v_2)
    bow_jaccard = jaccard(v_1, v_2)
    bow_canberra = canberra(v_1, v_2)
    bow_euclidean = euclidean(v_1, v_2)
    bow_minkowski = minkowski(v_1, v_2, 3)
    bow_braycurtis = braycurtis(v_1, v_2)
    #return bow_cosine, bow_cityblock, bow_jaccard, bow_canberra, bow_euclidean, bow_minkowski, bow_braycurtis
    return '{}:{}:{}:{}:{}:{}:{}'.format(bow_cosine, bow_cityblock, bow_jaccard, bow_canberra, bow_euclidean, bow_minkowski, bow_braycurtis)


In [0]:
bow_basic_distance['temp'] = td[['sent1_tradition', 'sent2_tradition']].apply(lambda x: bow_sent(x[0], x[1]), axis=1)
bow_basic_distance['bow_cosine'] = bow_basic_distance['temp'].apply(lambda x: float(x.split(':')[0]))
bow_basic_distance['bow_cityblock'] = bow_basic_distance['temp'].apply(lambda x: float(x.split(':')[1]))
bow_basic_distance['bow_jaccard'] = bow_basic_distance['temp'].apply(lambda x: float(x.split(':')[2]))
bow_basic_distance['bow_canberra'] = bow_basic_distance['temp'].apply(lambda x: float(x.split(':')[3]))
bow_basic_distance['bow_euclidean'] = bow_basic_distance['temp'].apply(lambda x: float(x.split(':')[4]))
bow_basic_distance['bow_minkowski'] = bow_basic_distance['temp'].apply(lambda x: float(x.split(':')[5]))
bow_basic_distance['bow_braycurtis'] = bow_basic_distance['temp'].apply(lambda x: float(x.split(':')[6]))

bow_basic_distance.drop(columns='temp', inplace=True)

#### Word2Vec + basic distances
We use word2vec model to count every words' vector. Then we count each sentence's vector by using sent2vec function. We can get every sentence's vector in this step. Then we use basic distances to count the distance between sent1 and sent2.


*word2vec function has several hyperparameters, so we adjust min_count, size, window parameters to create several word2vec models in this step.</br>
*We hope to create as much features as possible and use xgboost or DNN models to select the most important features for us.

In [0]:
from nltk.tokenize import RegexpTokenizer

## creating corpus by using training set and testing set
train_qs = pd.Series(df_train['sent1'].tolist() + df_train['sent2'].tolist()).astype(str)
test_qs = pd.Series(df_test['sent1'].tolist() + df_test['sent2'].tolist()).astype(str)

train_string = ''
for i in range(len(train_qs)):
    train_string += train_qs[i]
    i = i+1

test_string = ''
for i in range(len(test_qs)):
    train_string += train_qs[i]
    i = i+1

data = train_string + test_string
tokenizer = RegexpTokenizer(r'\w+')
words = tokenizer.tokenize(data)

## creating word2vec model by using different paramerters (min_count, size, window)
for min_count in range(1,3):
    for size in range(40,210,30):
        for window in range(3,8,2):
            print('model_w2v_' + str(min_count) + '_' + str(size) + '_' + str(window))
            locals()['model_w2v_' + str(min_count) + '_' + str(size) + '_' + str(window)] = gensim.models.Word2Vec([words], min_count=min_count, size=size, window=window)


In [0]:
model_w2v_2_250_5 = gensim.models.Word2Vec([words], min_count=2, size=250, window=5)
model_w2v_2_300_5 = gensim.models.Word2Vec([words], min_count=2, size=300, window=5)

In [0]:
## counting every sent's score
def sent2vec(s, model):
    words = tokenizer.tokenize(s.lower())
    words = [w for w in words if not w in STOP_WORDS]
    words = [w for w in words if w.isalpha()]
    M = []
    for w in words:
        try:
            M.append(model[w])
        except:
            continue
    M = np.array(M)
    v = M.sum(axis=0)
    return v / np.sqrt((v ** 2).sum())

In [0]:
from scipy.spatial.distance import cosine, cityblock, jaccard, canberra, euclidean, minkowski, braycurtis

data = pd.DataFrame()
def sent1_score(sent1, sent2, model, model_name):
    
    sent1_vectors = np.zeros((td.shape[0], model.vector_size))
    count = 0
    for i in td[sent1].values:
        sent1_vectors[count,:] = sent2vec(i, model)
        count += 1

    sent2_vectors = np.zeros((td.shape[0], model.vector_size))
    count = 0
    for i in td[sent2].values:
        sent2_vectors[count,:] = sent2vec(i, model)
        count += 1
    
    data['cosine_distance_' + model_name] = [cosine(x, y) for (x, y) in zip(np.nan_to_num(sent1_vectors),np.nan_to_num(sent2_vectors))]
    data['cityblock_distance_' + model_name] = [cityblock(x, y) for (x, y) in zip(np.nan_to_num(sent1_vectors),np.nan_to_num(sent2_vectors))]
    data['jaccard_distance_' + model_name] = [jaccard(x, y) for (x, y) in zip(np.nan_to_num(sent1_vectors),np.nan_to_num(sent2_vectors))]
    data['canberra_distance_' + model_name] = [canberra(x, y) for (x, y) in zip(np.nan_to_num(sent1_vectors),np.nan_to_num(sent2_vectors))]
    data['euclidean_distance_' + model_name] = [euclidean(x, y) for (x, y) in zip(np.nan_to_num(sent1_vectors),np.nan_to_num(sent2_vectors))]
    data['minkowski_distance_' + model_name] = [minkowski(x, y, 3) for (x, y) in zip(np.nan_to_num(sent1_vectors),np.nan_to_num(sent2_vectors))]
    data['braycurtis_distance_' + model_name] = [braycurtis(x, y) for (x, y) in zip(np.nan_to_num(sent1_vectors),np.nan_to_num(sent2_vectors))]
    print('-')
    return data

In [0]:
sent1_score('sent1', 'sent2', model_w2v_1_40_3, 'model_w2v_1_40_3') 
sent1_score('sent1', 'sent2', model_w2v_1_40_5, 'model_w2v_1_40_5')
sent1_score('sent1', 'sent2', model_w2v_1_40_7, 'model_w2v_1_40_7') 
sent1_score('sent1', 'sent2', model_w2v_1_70_3, 'model_w2v_1_70_3') 
sent1_score('sent1', 'sent2', model_w2v_1_70_5, 'model_w2v_1_70_5') 
sent1_score('sent1', 'sent2', model_w2v_1_70_7, 'model_w2v_1_70_7')
sent1_score('sent1', 'sent2', model_w2v_1_100_3, 'model_w2v_1_100_3') 
sent1_score('sent1', 'sent2', model_w2v_1_100_5, 'model_w2v_1_100_5')
sent1_score('sent1', 'sent2', model_w2v_1_100_7, 'model_w2v_1_100_7')
sent1_score('sent1', 'sent2', model_w2v_1_130_3, 'model_w2v_1_130_3')
sent1_score('sent1', 'sent2', model_w2v_1_130_5, 'model_w2v_1_130_5')
sent1_score('sent1', 'sent2', model_w2v_1_130_7, 'model_w2v_1_130_7')
sent1_score('sent1', 'sent2', model_w2v_1_160_3, 'model_w2v_1_160_3')
sent1_score('sent1', 'sent2', model_w2v_1_160_5, 'model_w2v_1_160_5')
sent1_score('sent1', 'sent2', model_w2v_1_160_7, 'model_w2v_1_160_7')
sent1_score('sent1', 'sent2', model_w2v_1_190_3, 'model_w2v_1_190_3')
sent1_score('sent1', 'sent2', model_w2v_1_190_5, 'model_w2v_1_190_5')
sent1_score('sent1', 'sent2', model_w2v_1_190_7, 'model_w2v_1_190_7')
sent1_score('sent1', 'sent2', model_w2v_2_40_3, 'model_w2v_2_40_3') 
sent1_score('sent1', 'sent2', model_w2v_2_40_5, 'model_w2v_2_40_5')
sent1_score('sent1', 'sent2', model_w2v_2_40_7, 'model_w2v_2_40_7') 
sent1_score('sent1', 'sent2', model_w2v_2_70_3, 'model_w2v_2_70_3') 
sent1_score('sent1', 'sent2', model_w2v_2_70_5, 'model_w2v_2_70_5') 
sent1_score('sent1', 'sent2', model_w2v_2_70_7, 'model_w2v_2_70_7')
sent1_score('sent1', 'sent2', model_w2v_2_100_3, 'model_w2v_2_100_3') 
sent1_score('sent1', 'sent2', model_w2v_2_100_5, 'model_w2v_2_100_5')
sent1_score('sent1', 'sent2', model_w2v_2_100_7, 'model_w2v_2_100_7')
sent1_score('sent1', 'sent2', model_w2v_2_130_3, 'model_w2v_2_130_3')
sent1_score('sent1', 'sent2', model_w2v_2_130_5, 'model_w2v_2_130_5')
sent1_score('sent1', 'sent2', model_w2v_2_130_7, 'model_w2v_2_130_7')
sent1_score('sent1', 'sent2', model_w2v_2_160_3, 'model_w2v_2_160_3')
sent1_score('sent1', 'sent2', model_w2v_2_160_5, 'model_w2v_2_160_5')
sent1_score('sent1', 'sent2', model_w2v_2_160_7, 'model_w2v_2_160_7')
sent1_score('sent1', 'sent2', model_w2v_2_190_3, 'model_w2v_2_190_3')
sent1_score('sent1', 'sent2', model_w2v_2_190_5, 'model_w2v_2_190_5')
sent1_score('sent1', 'sent2', model_w2v_2_190_7, 'model_w2v_2_190_7')
sent1_score('sent1', 'sent2', model_w2v_2_250_5, 'model_w2v_2_250_5')
sent1_score('sent1', 'sent2', model_w2v_2_300_5, 'model_w2v_2_300_5')

#### Word2Vec + WMD
Use word move distance to count distance between two sentences.

In [0]:
data_w2v_wmd = pd.DataFrame()

def w2v_wmd_score(model, model_name):
    data_w2v_wmd['w2v_wmd' + model_name] = td[['sent1_tradition', 'sent2_tradition']].apply(lambda x: model.wmdistance(x[0], x[1]), axis=1)
    print('-')
    return data_w2v_wmd

In [0]:
w2v_wmd_score(model_w2v_1_40_3, 'model_w2v_1_40_3') 
w2v_wmd_score(model_w2v_1_40_5, 'model_w2v_1_40_5')
w2v_wmd_score(model_w2v_1_40_7, 'model_w2v_1_40_7') 
w2v_wmd_score(model_w2v_1_70_3, 'model_w2v_1_70_3') 
w2v_wmd_score(model_w2v_1_70_5, 'model_w2v_1_70_5') 
w2v_wmd_score(model_w2v_1_70_7, 'model_w2v_1_70_7')
w2v_wmd_score(model_w2v_1_100_3, 'model_w2v_1_100_3') 
w2v_wmd_score(model_w2v_1_100_5, 'model_w2v_1_100_5')
w2v_wmd_score(model_w2v_1_100_7, 'model_w2v_1_100_7')
w2v_wmd_score(model_w2v_1_130_3, 'model_w2v_1_130_3')
w2v_wmd_score(model_w2v_1_130_5, 'model_w2v_1_130_5')
w2v_wmd_score(model_w2v_1_130_7, 'model_w2v_1_130_7')
w2v_wmd_score(model_w2v_1_160_3, 'model_w2v_1_160_3')
w2v_wmd_score(model_w2v_1_160_5, 'model_w2v_1_160_5')
w2v_wmd_score(model_w2v_1_160_7, 'model_w2v_1_160_7')
w2v_wmd_score(model_w2v_1_190_3, 'model_w2v_1_190_3')
w2v_wmd_score(model_w2v_1_190_5, 'model_w2v_1_190_5')
w2v_wmd_score(model_w2v_1_190_7, 'model_w2v_1_190_7')
w2v_wmd_score(model_w2v_2_40_3, 'model_w2v_2_40_3') 
w2v_wmd_score(model_w2v_2_40_5, 'model_w2v_2_40_5')
w2v_wmd_score(model_w2v_2_40_7, 'model_w2v_2_40_7') 
w2v_wmd_score(model_w2v_2_70_3, 'model_w2v_2_70_3') 
w2v_wmd_score(model_w2v_2_70_5, 'model_w2v_2_70_5') 
w2v_wmd_score(model_w2v_2_70_7, 'model_w2v_2_70_7')
w2v_wmd_score(model_w2v_2_100_3, 'model_w2v_2_100_3') 
w2v_wmd_score(model_w2v_2_100_5, 'model_w2v_2_100_5')
w2v_wmd_score(model_w2v_2_100_7, 'model_w2v_2_100_7')
w2v_wmd_score(model_w2v_2_130_3, 'model_w2v_2_130_3')
w2v_wmd_score(model_w2v_2_130_5, 'model_w2v_2_130_5')
w2v_wmd_score(model_w2v_2_130_7, 'model_w2v_2_130_7')
w2v_wmd_score(model_w2v_2_160_3, 'model_w2v_2_160_3')
w2v_wmd_score(model_w2v_2_160_5, 'model_w2v_2_160_5')
w2v_wmd_score(model_w2v_2_160_7, 'model_w2v_2_160_7')
w2v_wmd_score(model_w2v_2_190_3, 'model_w2v_2_190_3')
w2v_wmd_score(model_w2v_2_190_5, 'model_w2v_2_190_5')
w2v_wmd_score(model_w2v_2_190_7, 'model_w2v_2_190_7')
w2v_wmd_score(model_w2v_2_250_5, 'model_w2v_2_250_5')
w2v_wmd_score(model_w2v_2_300_5, 'model_w2v_2_300_5')

#### GloVe
GloVe is similar to word2vec, it can also turn words to vectors. We use GloVe to count word vectors in this step.

In [0]:
#200d
from gensim.scripts.glove2word2vec import glove2word2vec
glove_input_file='/content/drive//My Drive/data/glove.6B.200d.txt'
word2vec_output_file='/content/drive//My Drive/data/glove.6B.200d.txt.word2vec'
glove2word2vec(glove_input_file,word2vec_output_file)
from gensim.models import KeyedVectors
filename=word2vec_output_file
model2=KeyedVectors.load_word2vec_format(filename,binary=False)

In [0]:
def sent2vec(s, model):
    words = tokenizer.tokenize(s.lower())
    words = [w for w in words if not w in STOP_WORDS]
    words = [w for w in words if w.isalpha()]
    M = []
    for w in words:
        try:
            M.append(model[w])
        except:
            continue
    M = np.array(M)
    v = M.sum(axis=0)
    return v / np.sqrt((v ** 2).sum())

In [0]:
sent1_vectors = np.zeros((td.shape[0], model2.vector_size))
count = 0
for i in td.sent1.values:
    sent1_vectors[count,:] = sent2vec(i, model2)
    count += 1

sent2_vectors = np.zeros((td.shape[0], model2.vector_size))
count = 0
for i in td.sent2.values:
    sent2_vectors[count,:] = sent2vec(i, model2)
    count += 1

#### Glove + Cosine Distance

In [0]:
data = pd.DataFrame()
data['cosine_distance'] = [cosine(x, y) for (x, y) in zip(sent1_vectors, sent2_vectors)]
data['cityblock_distance'] = [cityblock(x, y) for (x, y) in zip(sent1_vectors, sent2_vectors)]
data['jaccard_distance'] = [jaccard(x, y) for (x, y) in zip(sent1_vectors, sent2_vectors)]
data['canberra_distance'] = [canberra(x, y) for (x, y) in zip(sent1_vectors, sent2_vectors)]
data['euclidean_distance'] = [euclidean(x, y) for (x, y) in zip(sent1_vectors, sent2_vectors)]
data['minkowski_distance'] = [minkowski(x, y, 3) for (x, y) in zip(sent1_vectors, sent2_vectors)]
data['braycurtis_distance'] = [braycurtis(x, y) for (x, y) in zip(sent1_vectors, sent2_vectors)]

In [0]:
td.insert(8,'glove_cosine',data['cosine_distance'])
td.insert(9,'glove_cityblock',data['cityblock_distance'])
td.insert(10,'glove_jaccard',data['jaccard_distance'])
td.insert(11,'glove_canberra',data['canberra_distance'])
td.insert(12,'glove_braycurtis',data['braycurtis_distance'])
td.insert(13,'glove_euclidean',data['euclidean_distance'])
td.insert(14,'glove_minkowski',data['minkowski_distance'])

#### GloVe + wmdistance

In [0]:
td['glove_wmd_1'] = td[['sent1_tradition', 'sent2_tradition']].apply(lambda x: model2.wmdistance(x[0], x[1]), axis=1)

## Modelling

#### Xgboost

In [0]:
all_data_train = pd.read_csv('/content/drive//My Drive/data/train_features.csv')
all_data_clean = all_data_train.replace([np.inf, -np.inf], np.nan).dropna(how="any")

x_train = all_data_clean.drop(columns=['same_source','id'])
y_train = all_data_clean['same_source']

In [0]:
all_data_test = pd.read_csv('/content/drive//My Drive/data/test_features.csv')

In [0]:
def train_xgb(X, y, params):
	x, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=RS)

	xg_train = xgb.DMatrix(x, label=y_train)
	xg_val = xgb.DMatrix(X_val, label=y_val)

	watchlist  = [(xg_train,'train'), (xg_val,'eval')]
	return xgb.train(params, xg_train, ROUNDS, watchlist)

def train_xgb_silent(X, y, params):
	x, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=RS)

	xg_train = xgb.DMatrix(x, label=y_train)
	xg_val = xgb.DMatrix(X_val, label=y_val)

	#watchlist  = [(xg_train,'train'), (xg_val,'eval')]
	return xgb.train(params, xg_train, ROUNDS)
 
def predict_xgb(clr, X_test):
	return clr.predict(xgb.DMatrix(X_test))

In [0]:
RS = 47
ROUNDS = 500
params = {}
params['objective'] = 'binary:logistic'
params['eval_metric'] = 'logloss'
params['eta'] = 0.1
params['max_depth'] = 5
params['silent'] = 1
params['seed'] = RS

In [0]:
# feature selection - spearman
## drop the high correlated features by using spearman. If two features' correlation score is above threshold(0.98), then drop one feature.
X_train_corr = all_data[['w2v_wmdmodel_w2v_1_40_3', 'w2v_wmdmodel_w2v_1_40_5',
       'w2v_wmdmodel_w2v_1_40_7', 'w2v_wmdmodel_w2v_1_70_3',
       'w2v_wmdmodel_w2v_1_70_5', 'w2v_wmdmodel_w2v_1_70_7',
       'w2v_wmdmodel_w2v_1_100_3', 'w2v_wmdmodel_w2v_1_100_5',
       'w2v_wmdmodel_w2v_1_100_7', 'w2v_wmdmodel_w2v_1_130_3',
       'w2v_wmdmodel_w2v_1_130_5', 'w2v_wmdmodel_w2v_1_130_7',
       'w2v_wmdmodel_w2v_1_160_3', 'w2v_wmdmodel_w2v_1_160_5',
       'w2v_wmdmodel_w2v_1_160_7', 'w2v_wmdmodel_w2v_1_190_3',
       'w2v_wmdmodel_w2v_1_190_5', 'w2v_wmdmodel_w2v_1_190_7',
       'w2v_wmdmodel_w2v_2_40_3', 'w2v_wmdmodel_w2v_2_40_5',
       'w2v_wmdmodel_w2v_2_40_7', 'w2v_wmdmodel_w2v_2_70_3',
       'w2v_wmdmodel_w2v_2_70_5', 'w2v_wmdmodel_w2v_2_70_7',
       'w2v_wmdmodel_w2v_2_100_3', 'w2v_wmdmodel_w2v_2_100_5',
       'w2v_wmdmodel_w2v_2_100_7', 'w2v_wmdmodel_w2v_2_130_3',
       'w2v_wmdmodel_w2v_2_130_5', 'w2v_wmdmodel_w2v_2_130_7',
       'w2v_wmdmodel_w2v_2_160_3', 'w2v_wmdmodel_w2v_2_160_5',
       'w2v_wmdmodel_w2v_2_160_7', 'w2v_wmdmodel_w2v_2_190_3',
       'w2v_wmdmodel_w2v_2_190_5', 'w2v_wmdmodel_w2v_2_190_7',
       'w2v_wmdmodel_w2v_2_250_5', 'w2v_wmdmodel_w2v_2_300_5']].corr(method='spearman')

# 将对角线变为0
mask = np.ones(X_train_corr.columns.size) - np.eye(X_train_corr.columns.size)
X_train_corr = mask * X_train_corr

drops = []
for col in X_train_corr.columns.values:
    # if we've already determined to drop the current variable, continue
    if np.in1d([col],drops):
        continue

    # 找出高相关的变量
    corr = X_train_corr[abs(X_train_corr[col]) > 0.98].index
    drops = np.union1d(drops, corr)

print("nDropping", drops.shape[0], "highly correlated features...n", drops)

nDropping 27 highly correlated features...n ['w2v_wmdmodel_w2v_1_100_5' 'w2v_wmdmodel_w2v_1_100_7'
 'w2v_wmdmodel_w2v_1_130_3' 'w2v_wmdmodel_w2v_1_130_7'
 'w2v_wmdmodel_w2v_1_160_5' 'w2v_wmdmodel_w2v_1_190_3'
 'w2v_wmdmodel_w2v_1_190_5' 'w2v_wmdmodel_w2v_1_190_7'
 'w2v_wmdmodel_w2v_2_100_3' 'w2v_wmdmodel_w2v_2_100_5'
 'w2v_wmdmodel_w2v_2_100_7' 'w2v_wmdmodel_w2v_2_130_3'
 'w2v_wmdmodel_w2v_2_130_5' 'w2v_wmdmodel_w2v_2_130_7'
 'w2v_wmdmodel_w2v_2_160_3' 'w2v_wmdmodel_w2v_2_160_5'
 'w2v_wmdmodel_w2v_2_160_7' 'w2v_wmdmodel_w2v_2_190_3'
 'w2v_wmdmodel_w2v_2_190_7' 'w2v_wmdmodel_w2v_2_250_5'
 'w2v_wmdmodel_w2v_2_300_5' 'w2v_wmdmodel_w2v_2_40_3'
 'w2v_wmdmodel_w2v_2_40_5' 'w2v_wmdmodel_w2v_2_40_7'
 'w2v_wmdmodel_w2v_2_70_3' 'w2v_wmdmodel_w2v_2_70_5'
 'w2v_wmdmodel_w2v_2_70_7']


In [0]:
drops=['same_source','braycurtis_distance_model_w2v_1_100_7','braycurtis_distance_model_w2v_1_130_7','braycurtis_distance_model_w2v_1_160_7','braycurtis_distance_model_w2v_1_190_5','braycurtis_distance_model_w2v_1_40_7','braycurtis_distance_model_w2v_1_70_7','braycurtis_distance_model_w2v_2_100_7','braycurtis_distance_model_w2v_2_130_5','braycurtis_distance_model_w2v_2_130_7','braycurtis_distance_model_w2v_2_160_5','braycurtis_distance_model_w2v_2_190_5','braycurtis_distance_model_w2v_2_190_7','braycurtis_distance_model_w2v_2_250_5','braycurtis_distance_model_w2v_2_40_7','braycurtis_distance_model_w2v_2_70_7','cityblock_distance_model_w2v_1_190_7','cityblock_distance_model_w2v_2_160_7','cityblock_distance_model_w2v_2_190_7','cityblock_distance_model_w2v_2_300_5','cosine_distance_model_w2v_1_190_7','cosine_distance_model_w2v_2_100_3','cosine_distance_model_w2v_2_100_7','cosine_distance_model_w2v_2_130_7','cosine_distance_model_w2v_2_160_7','cosine_distance_model_w2v_2_300_5','cosine_distance_model_w2v_2_40_7','cosine_distance_model_w2v_2_70_7','diff_len_char','euclidean_distance_model_w2v_1_100_3','euclidean_distance_model_w2v_1_100_5','euclidean_distance_model_w2v_1_100_7','euclidean_distance_model_w2v_1_130_3','euclidean_distance_model_w2v_1_130_5','euclidean_distance_model_w2v_1_130_7','euclidean_distance_model_w2v_1_160_3','euclidean_distance_model_w2v_1_160_5','euclidean_distance_model_w2v_1_160_7','euclidean_distance_model_w2v_1_190_3','euclidean_distance_model_w2v_1_190_5','euclidean_distance_model_w2v_1_190_7','euclidean_distance_model_w2v_1_40_3','euclidean_distance_model_w2v_1_40_5','euclidean_distance_model_w2v_1_40_7','euclidean_distance_model_w2v_1_70_3','euclidean_distance_model_w2v_1_70_5','euclidean_distance_model_w2v_1_70_7','euclidean_distance_model_w2v_2_100_5','euclidean_distance_model_w2v_2_100_7','euclidean_distance_model_w2v_2_130_3','euclidean_distance_model_w2v_2_130_5','euclidean_distance_model_w2v_2_130_7','euclidean_distance_model_w2v_2_160_3','euclidean_distance_model_w2v_2_160_5','euclidean_distance_model_w2v_2_160_7','euclidean_distance_model_w2v_2_190_3','euclidean_distance_model_w2v_2_190_5','euclidean_distance_model_w2v_2_190_7','euclidean_distance_model_w2v_2_250_5','euclidean_distance_model_w2v_2_300_5','euclidean_distance_model_w2v_2_40_3','euclidean_distance_model_w2v_2_40_5','euclidean_distance_model_w2v_2_40_7','euclidean_distance_model_w2v_2_70_3','euclidean_distance_model_w2v_2_70_5','jaccard_distance_model_w2v_1_100_3','jaccard_distance_model_w2v_1_100_5','jaccard_distance_model_w2v_1_100_7','jaccard_distance_model_w2v_1_130_3','jaccard_distance_model_w2v_1_130_5','jaccard_distance_model_w2v_1_130_7','jaccard_distance_model_w2v_1_160_3','jaccard_distance_model_w2v_1_160_5','jaccard_distance_model_w2v_1_160_7','jaccard_distance_model_w2v_1_190_3','jaccard_distance_model_w2v_1_190_5','jaccard_distance_model_w2v_1_190_7','jaccard_distance_model_w2v_1_40_5','jaccard_distance_model_w2v_1_40_7','jaccard_distance_model_w2v_1_70_3','jaccard_distance_model_w2v_1_70_5','jaccard_distance_model_w2v_1_70_7','jaccard_distance_model_w2v_2_100_3','jaccard_distance_model_w2v_2_100_5','jaccard_distance_model_w2v_2_100_7','jaccard_distance_model_w2v_2_130_3','jaccard_distance_model_w2v_2_130_5','jaccard_distance_model_w2v_2_130_7','jaccard_distance_model_w2v_2_160_3','jaccard_distance_model_w2v_2_160_5','jaccard_distance_model_w2v_2_160_7','jaccard_distance_model_w2v_2_190_3','jaccard_distance_model_w2v_2_190_5','jaccard_distance_model_w2v_2_190_7','jaccard_distance_model_w2v_2_250_5','jaccard_distance_model_w2v_2_300_5','jaccard_distance_model_w2v_2_40_3','jaccard_distance_model_w2v_2_40_5','jaccard_distance_model_w2v_2_40_7','jaccard_distance_model_w2v_2_70_3','jaccard_distance_model_w2v_2_70_5','jaccard_distance_model_w2v_2_70_7','len_char_q1','len_char_q2','minkowski_distance_model_w2v_1_100_7','minkowski_distance_model_w2v_1_130_7','minkowski_distance_model_w2v_1_160_7','minkowski_distance_model_w2v_1_40_7','minkowski_distance_model_w2v_1_70_7','minkowski_distance_model_w2v_2_130_5','minkowski_distance_model_w2v_2_160_5','minkowski_distance_model_w2v_2_160_7','minkowski_distance_model_w2v_2_190_5','minkowski_distance_model_w2v_2_190_7','minkowski_distance_model_w2v_2_250_5','minkowski_distance_model_w2v_2_70_7','w2v_wmdmodel_w2v_1_100_5','w2v_wmdmodel_w2v_1_100_7','w2v_wmdmodel_w2v_1_130_3','w2v_wmdmodel_w2v_1_130_7','w2v_wmdmodel_w2v_1_160_5','w2v_wmdmodel_w2v_1_190_3','w2v_wmdmodel_w2v_1_190_5','w2v_wmdmodel_w2v_1_190_7','w2v_wmdmodel_w2v_2_100_3','w2v_wmdmodel_w2v_2_100_5','w2v_wmdmodel_w2v_2_100_7','w2v_wmdmodel_w2v_2_130_3','w2v_wmdmodel_w2v_2_130_5','w2v_wmdmodel_w2v_2_130_7','w2v_wmdmodel_w2v_2_160_3','w2v_wmdmodel_w2v_2_160_5','w2v_wmdmodel_w2v_2_160_7','w2v_wmdmodel_w2v_2_190_3','w2v_wmdmodel_w2v_2_190_7','w2v_wmdmodel_w2v_2_250_5','w2v_wmdmodel_w2v_2_300_5','w2v_wmdmodel_w2v_2_40_3','w2v_wmdmodel_w2v_2_40_5','w2v_wmdmodel_w2v_2_40_7','w2v_wmdmodel_w2v_2_70_3','w2v_wmdmodel_w2v_2_70_5','w2v_wmdmodel_w2v_2_70_7']
x_train = all_data_clean.drop(columns = drops)

In [0]:
xgbt = train_xgb(x_train, y_train, params)

[0]	train-logloss:0.663879	eval-logloss:0.666232
[1]	train-logloss:0.639898	eval-logloss:0.644155
[2]	train-logloss:0.619991	eval-logloss:0.626366
[3]	train-logloss:0.60318	eval-logloss:0.611524
[4]	train-logloss:0.588639	eval-logloss:0.598561
[5]	train-logloss:0.576322	eval-logloss:0.587783
[6]	train-logloss:0.565615	eval-logloss:0.578249
[7]	train-logloss:0.556325	eval-logloss:0.57021
[8]	train-logloss:0.548316	eval-logloss:0.563134
[9]	train-logloss:0.541181	eval-logloss:0.557022
[10]	train-logloss:0.535049	eval-logloss:0.55165
[11]	train-logloss:0.529769	eval-logloss:0.547277
[12]	train-logloss:0.525046	eval-logloss:0.543281
[13]	train-logloss:0.520818	eval-logloss:0.53974
[14]	train-logloss:0.517312	eval-logloss:0.536973
[15]	train-logloss:0.51382	eval-logloss:0.534279
[16]	train-logloss:0.510715	eval-logloss:0.531857
[17]	train-logloss:0.507802	eval-logloss:0.52962
[18]	train-logloss:0.505537	eval-logloss:0.528189
[19]	train-logloss:0.503328	eval-logloss:0.526824
[20]	train-loglo

In [0]:
# import collections
# print([item for item, count in collections.Counter(all_data.columns).items() if count > 1])

['both_contain_number', 'contain_number', 'contain_$', 'people']


In [0]:
preds = predict_xgb(xgbt, all_data_test)

In [0]:
# We try to use the most important 40 features in the model and get a better performance in Xgboost. 
# But we find that the performance of using 40 features is worse than using all the five hundred features.
# We feel confused here because we suppose with less while strong features, the model should perform better.
import operator
importance = clr.get_fscore()
importance = sorted(importance.items(), key=operator.itemgetter(1))

list = []
for i in range(40):
    list.append(importance[-i][0])

clr = train_xgb(all_data_clean[list], y_train, params)

In [0]:
cols = ['state',
 'word_match',
 '300d_glove_braycurtis',
 'w2v_wmdmodel_w2v_1_160_7',
 '50d_glove_braycurtis',
 'tfidf_sim',
 '200d_glove_canberra',
 'w2v_wmdmodel_w2v_2_190_5',
 '200d_glove_wmd_1',
 'stops2_ratio',
 '50d_glove_canberra',
 '100d_glove_minkowski',
 '300d_glove_canberra',
 'stops1_ratio',
 '100d_glove_wmd_1',
 'w2v_wmdmodel_w2v_1_40_7',
 'tfidf_word_match',
 'diff_avg_word',
 'avg_world_len1',
 '50d_glove_cosine',
 'same_words.1',
 'len_q2',
 'len_q1',
 '50d_glove_wmd_1',
 'diff_len',
 'avg_world_len2',
 'w2v_wmdmodel_w2v_1_70_7',
 '300d_glove_cityblock',
 'diff_stops_r',
 'w2v_wmdmodel_w2v_1_40_3']

for var in combinations(cols, 25):
    li = list(var)

    model = train_xgb_silent(x[li], y_train, params)
    pred = predict_xgb(model, X_val[li])
    logloss = log_loss(y_val, pred)
    print(logloss)
    if logloss<0.48:
        print(logloss, li)

    del li

#### LightGBM

In [0]:
x, X_val, y, y_val = train_test_split(x_train, y_train, test_size=0.2, random_state=47)

In [0]:
parameters = {
              'max_depth': [6, 7, 8],
              'learning_rate': [0.09, 0.11, 0.12],
              #'feature_fraction': [0.6, 0.7, 0.8, 0.9, 0.95],
              #'bagging_fraction': [0.6, 0.7, 0.8, 0.9, 0.95],
              #'bagging_freq': [2, 4, 5, 6, 8],
              'lambda_l1': [0.1, 0.4, 0.6],
              #'lambda_l2': [0, 10, 15, 35, 40],
              #'cat_smooth': [1, 10, 15, 20, 35]
}

gbm = lgb.LGBMClassifier(boosting_type='gbdt',
                         objective = 'binary',
                         metric = 'logloss',
                         verbose = 0,
                         max_depth = 5,
                         learning_rate = 0.11,
                         feature_fraction = 0.9)
                         #num_leaves = 35,
                         #lambda_l1= 0.6,
                         #lambda_l2= 0)


gsearch = GridSearchCV(gbm, param_grid=parameters, scoring='accuracy', cv=3) #neg_log_loss
gsearch.fit(x, y)

print("Best score: %0.3f" % gsearch.best_score_)
print("Best parameters set:")
best_parameters = gsearch.best_estimator_.get_params()
for param_name in sorted(parameters.keys()):
    print("\t%s: %r" % (param_name, best_parameters[param_name]))


Best score: 0.720
Best parameters set:
	lambda_l1: 0.1
	learning_rate: 0.12
	max_depth: 7


In [0]:
lgb_model = lgb.LGBMClassifier(max_depth=7, learning_rate=0.12, objective='binary', metric='binary_logloss', lambda_l1=0.1)
x, X_val, y, y_val = train_test_split(x_train, y_train, test_size=0.2, random_state=47)
lgb_model.fit(x, y)

LGBMClassifier(boosting_type='gbdt', class_weight=None, colsample_bytree=1.0,
               importance_type='split', lambda_l1=0.1, learning_rate=0.12,
               max_depth=7, metric='binary_logloss', min_child_samples=20,
               min_child_weight=0.001, min_split_gain=0.0, n_estimators=100,
               n_jobs=-1, num_leaves=31, objective='binary', random_state=None,
               reg_alpha=0.0, reg_lambda=0.0, silent=True, subsample=1.0,
               subsample_for_bin=200000, subsample_freq=0)

In [0]:
pred = lgb_model.predict_proba(X_val)
log_loss(y_val, pred[:,1])

0.5446318431163707

In [0]:
importances = model.feature_importances_
indices = np.argsort(importances)[::-1]

feat_labels = x_train.columns
for f in range(x_train.shape[1]):
    print("%2d) %-*s %f" % (f + 1, 30, feat_labels[indices[f]], importances[indices[f]]))

In [0]:
indices_ = indices[0:50]
X_train_model = x.iloc[:,indices_]
X_test_model = X_val.iloc[:,indices_]
X_train_model.columns

Index(['word_match', 'w2v_wmdmodel_w2v_1_160_7', 'tfidf_sim',
       'tfidf_word_match', 'w2v_wmdmodel_w2v_2_190_5', 'stops2_ratio',
       'w2v_wmdmodel_w2v_1_40_7', 'len_q1', 'len_q2',
       'cosine_distance_model_w2v_2_190_7', 'avg_world_len2', 'stops1_ratio',
       'same_words.1', 'bow_minkowski', 'avg_world_len1', 'bow_cosine',
       'w2v_wmdmodel_w2v_1_70_7', 'diff_avg_word', 'bow_canberra',
       'len_word_q1', 'w2v_wmdmodel_w2v_1_100_3', 'w2v_wmdmodel_w2v_1_70_3',
       'minkowski_distance_model_w2v_2_40_7', 'contain_number',
       'bow_cityblock', 'minkowski_distance_model_w2v_2_130_7',
       'minkowski_distance_model_w2v_2_100_7',
       'minkowski_distance_model_w2v_2_300_5', 'len_word_q2',
       'w2v_wmdmodel_w2v_1_160_3', 'diff_len', 'bow_euclidean', 'diff_stops_r',
       'w2v_wmdmodel_w2v_1_40_3', 'canberra_distance_model_w2v_1_40_7',
       'w2v_wmdmodel_w2v_1_70_5', 'w2v_wmdmodel_w2v_1_130_5', 'bow_jaccard',
       'canberra_distance_model_w2v_1_100_7',
       

In [0]:
cols = ['word_match', 'w2v_wmdmodel_w2v_1_160_7', 'tfidf_sim',
       'tfidf_word_match', 'w2v_wmdmodel_w2v_2_190_5', 'stops2_ratio',
       'w2v_wmdmodel_w2v_1_40_7', 'len_q1', 'len_q2',
       'cosine_distance_model_w2v_2_190_7', 'avg_world_len2', 'stops1_ratio',
       'same_words.1', 'bow_minkowski', 'avg_world_len1', 'bow_cosine',
       'w2v_wmdmodel_w2v_1_70_7', 'diff_avg_word', 'bow_canberra',
       'len_word_q1', 'w2v_wmdmodel_w2v_1_100_3', 'w2v_wmdmodel_w2v_1_70_3',
       'minkowski_distance_model_w2v_2_40_7', 'contain_number',
       'bow_cityblock', 'minkowski_distance_model_w2v_2_130_7',
       'minkowski_distance_model_w2v_2_100_7',
       'minkowski_distance_model_w2v_2_300_5', 'len_word_q2',
       'w2v_wmdmodel_w2v_1_160_3', 'diff_len', 'bow_euclidean', 'diff_stops_r',
       'w2v_wmdmodel_w2v_1_40_3', 'canberra_distance_model_w2v_1_40_7',
       'w2v_wmdmodel_w2v_1_70_5', 'w2v_wmdmodel_w2v_1_130_5', 'bow_jaccard',
       'canberra_distance_model_w2v_1_100_7',
       'minkowski_distance_model_w2v_2_40_3',
       'braycurtis_distance_model_w2v_2_300_5',
       'canberra_distance_model_w2v_1_100_3',
       'canberra_distance_model_w2v_2_40_3', 'w2v_wmdmodel_w2v_1_40_5',
       'canberra_distance_model_w2v_2_70_3',
       'braycurtis_distance_model_w2v_2_130_3',
       'canberra_distance_model_w2v_1_160_3',
       'canberra_distance_model_w2v_2_70_7', 'diff_len_word',
       'canberra_distance_model_w2v_1_130_3']

for var in combinations(cols, 45):
    li = list(var)
    X_train_model_3 = X_train_model[li]
    X_test_model_3 = X_test_model[li]

    model = lgb.LGBMClassifier(max_depth=7, learning_rate=0.12, objective='binary', metric='binary_logloss', lambda_l1=0.1)
    model.fit(X_train_model_3, y)
    pred = model.predict_proba(X_val[li])
    logloss = log_loss(y_val, pred[:,1])
    print(logloss)
    if logloss<0.544:
        print(logloss, li)

    del li

#### DNN

In [0]:
x = all_data_clean.iloc[:,1:].values#[impt_features]
y = all_data_clean.iloc[:,0].values
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

In [0]:
# Build the neural network
model = Sequential()

model.add(Dense(150, input_dim=x.shape[1], activation='relu'))
model.add(BatchNormalization())
#model.add(activation='relu')
model.add(Dense(50, activation='relu'))
model.add(Dropout(0.2))

model.add(BatchNormalization())
model.add(Dense(20, activation='relu'))
model.add(Dropout(0.2))

model.add(Dense(1))
Adam = optimizers.Adam(lr=0.00003, beta_1=0.9, beta_2=0.999, amsgrad=False)
#SGD = optimizers.SGD(learning_rate=0.01, momentum=0.5, nesterov=True)
model.compile(loss='binary_crossentropy', optimizer='Adam', metrics=['acc'])
# optimizer: Nadam/ Adam

monitor = EarlyStopping(monitor='val_loss', min_delta=1e-3, patience=100, verbose=1, mode='auto', restore_best_weights=True)
history = model.fit(x_train,y_train,validation_data=(x_test,y_test),callbacks=[monitor],verbose=2,epochs=200)

pred = model.predict(x_test)

Train on 103324 samples, validate on 25832 samples
Epoch 1/200
103324/103324 - 22s - loss: 0.7392 - acc: 0.5016 - val_loss: 0.6947 - val_acc: 0.4999
Epoch 2/200
103324/103324 - 22s - loss: 0.6976 - acc: 0.4987 - val_loss: 0.6932 - val_acc: 0.4999
Epoch 3/200
103324/103324 - 22s - loss: 0.6945 - acc: 0.4998 - val_loss: 0.6932 - val_acc: 0.4999
Epoch 4/200
103324/103324 - 21s - loss: 0.6935 - acc: 0.5005 - val_loss: 0.6932 - val_acc: 0.5001
Epoch 5/200
103324/103324 - 21s - loss: 0.6933 - acc: 0.5022 - val_loss: 0.6933 - val_acc: 0.5001
Epoch 6/200
103324/103324 - 21s - loss: 0.6933 - acc: 0.4986 - val_loss: 0.6933 - val_acc: 0.4999
Epoch 7/200
103324/103324 - 21s - loss: 0.6932 - acc: 0.4999 - val_loss: 0.6932 - val_acc: 0.4999
Epoch 8/200
103324/103324 - 21s - loss: 0.6932 - acc: 0.5002 - val_loss: 0.6932 - val_acc: 0.5001
Epoch 9/200
103324/103324 - 21s - loss: 0.6933 - acc: 0.4983 - val_loss: 0.6932 - val_acc: 0.4999
Epoch 10/200
103324/103324 - 21s - loss: 0.6932 - acc: 0.5003 - val

In [0]:
import time
import tensorflow.keras.initializers
import statistics
import tensorflow.keras
from sklearn import metrics
from sklearn.model_selection import StratifiedKFold
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation, Dropout
from tensorflow.keras import regularizers
from tensorflow.keras.callbacks import EarlyStopping
from sklearn.model_selection import StratifiedShuffleSplit
from tensorflow.keras.layers import LeakyReLU,PReLU
from tensorflow.keras.optimizers import Adam

In [0]:
def hms_string(sec_elapsed):
    h = int(sec_elapsed / (60 * 60))
    m = int((sec_elapsed % (60 * 60)) / 60)
    s = sec_elapsed % 60
    return "{}:{:>02}:{:>05.2f}".format(h, m, s)

In [0]:
!pip install bayesian-optimization

Collecting bayesian-optimization
  Downloading https://files.pythonhosted.org/packages/72/0c/173ac467d0a53e33e41b521e4ceba74a8ac7c7873d7b857a8fbdca88302d/bayesian-optimization-1.0.1.tar.gz
Building wheels for collected packages: bayesian-optimization
  Building wheel for bayesian-optimization (setup.py) ... [?25l[?25hdone
  Created wheel for bayesian-optimization: filename=bayesian_optimization-1.0.1-cp36-none-any.whl size=10032 sha256=c2c7bc69db8d97891d9d2bcb2bd53fcbbd2d83279e1a5033586fff86fabbdf55
  Stored in directory: /root/.cache/pip/wheels/1d/0d/3b/6b9d4477a34b3905f246ff4e7acf6aafd4cc9b77d473629b77
Successfully built bayesian-optimization
Installing collected packages: bayesian-optimization
Successfully installed bayesian-optimization-1.0.1


##### Input Perturbation Ranking

In [0]:
def perturbation_rank(model, x, y, names, regression):
    errors = []

    for i in range(x.shape[1]):
        hold = np.array(x[:, i])
        np.random.shuffle(x[:, i])
        
        if regression:
            pred = model.predict(x)
            error = metrics.mean_squared_error(y, pred)
        else:
            pred = model.predict(x)
            error = metrics.log_loss(y, pred)
            
        errors.append(error)
        x[:, i] = hold
        
    max_error = np.max(errors)
    importance = [e/max_error for e in errors]

    data = {'name':names,'error':errors,'importance':importance}
    result = pd.DataFrame(data, columns = ['name','error','importance'])
    result.sort_values(by=['importance'], ascending=[0], inplace=True)
    result.reset_index(inplace=True, drop=True)
    return result

In [0]:
from IPython.display import display, HTML

names = list(all_data_clean.drop(columns = drops).columns)
rank = perturbation_rank(model, x_test, y_test, names, False)
display(rank)

Unnamed: 0,name,error,importance
0,300d_glove_canberra,0.564802,1.000000
1,diff_len,0.553036,0.979169
2,len_q1,0.551476,0.976407
3,200d_glove_canberra,0.545663,0.966115
4,len_q2,0.545373,0.965601
...,...,...,...
254,cityblock_distance_model_w2v_2_100_3,0.521215,0.922828
255,canberra_distance_model_w2v_1_160_5,0.521142,0.922700
256,cityblock_distance_model_w2v_2_160_3,0.521120,0.922660
257,canberra_distance_model_w2v_1_100_3,0.520916,0.922300


In [0]:
list(rank['name'])[:20]

['300d_glove_canberra',
 'diff_len',
 'len_q1',
 '200d_glove_canberra',
 'len_q2',
 'same_words.1',
 'same_words',
 'canberra_distance_model_w2v_2_300_5',
 '50d_glove_canberra',
 'shared_count',
 'canberra_distance_model_w2v_2_190_7',
 'canberra_distance_model_w2v_2_160_7',
 'canberra_distance_model_w2v_1_190_7',
 'feature_percent_sent2',
 'canberra_distance_model_w2v_2_190_5',
 'len_word_q2',
 'canberra_distance_model_w2v_2_250_5',
 'canberra_distance_model_w2v_2_100_7',
 'bow_cityblock',
 'canberra_distance_model_w2v_1_130_7']

##### Bayesian Optimization

In [0]:
def evaluate_network(dropout,lr,neuronPct,neuronShrink):
    SPLITS = 1

    # Bootstrap
    boot = StratifiedShuffleSplit(n_splits=SPLITS, test_size=0.2)

    # Track progress
    mean_benchmark = []
    epochs_needed = []
    num = 0
    neuronCount = int(neuronPct * 5000)

    # Loop through samples
    for train, test in boot.split(x,y):
        start_time = time.time()
        num+=1

        # Split train and test
        x_train = x[train]
        y_train = y[train]
        x_test = x[test]
        y_test = y[test]

        # Construct neural network
        # kernel_initializer = tensorflow.keras.initializers.he_uniform(seed=None)
        model = Sequential()
        
        layer = 0
        while neuronCount>25 and layer<10:
            #print(neuronCount)
            if layer==0:
                model.add(Dense(neuronCount, 
                    input_dim=x.shape[1], 
                    activation='relu')) 
            else:
                model.add(Dense(neuronCount, activation='relu')) 
            model.add(Dropout(dropout))
        
            neuronCount = neuronCount * neuronShrink
        
        model.add(Dense(1,activation='sigmoid')) # Output
        model.compile(loss='binary_crossentropy', optimizer=Adam(lr=lr))
        monitor = EarlyStopping(monitor='val_loss', min_delta=1e-3, 
            patience=10, verbose=0, mode='auto', restore_best_weights=True)

        # Train on the bootstrap sample
        model.fit(x_train,y_train,validation_data=(x_test,y_test),callbacks=[monitor],verbose=0,epochs=50)
        epochs = monitor.stopped_epoch
        epochs_needed.append(epochs)

        # Predict on the out of boot (validation)
        pred = model.predict(x_test)

        # Measure this bootstrap's log loss
        #y_compare = np.argmax(y_test,axis=1) # For log loss calculation
        score = metrics.log_loss(y_test, pred)
        mean_benchmark.append(score)
        m1 = statistics.mean(mean_benchmark)
        #m1 = statistics.mean(mean_benchmark[~np.isnan(mean_benchmark)])
        m2 = statistics.mean(epochs_needed)
        mdev = statistics.pstdev(mean_benchmark)

        # Record this iteration
        time_took = time.time() - start_time
        #print(f"#{num}: score={score:.6f}, mean score={m1:.6f}, stdev={mdev:.6f}, epochs={epochs}, mean epochs={int(m2)}, time={hms_string(time_took)}")

    tensorflow.keras.backend.clear_session()
    return (m1)


In [0]:
for dropout in np.arange(0.0,0.49,0.1):
    for lr in np.arange(0.0,0.1,0.02):
        for neuronPct in np.arange(0.01,0.07,0.02):
            for neuronShrink in np.arange(0.01,1,0.2):
                evaluate_network(dropout,lr,neuronPct,neuronShrink)
                print(dropout, lr, neuronPct, neuronShrink, m1)

0.0 0.0 0.01 0.01 0.6476401327683433
0.0 0.0 0.01 0.21000000000000002 0.6476401327683433
0.0 0.0 0.01 0.41000000000000003 0.6476401327683433
0.0 0.0 0.01 0.6100000000000001 0.6476401327683433
0.0 0.0 0.01 0.81 0.6476401327683433
0.0 0.0 0.03 0.01 0.6476401327683433
0.0 0.0 0.03 0.21000000000000002 0.6476401327683433
0.0 0.0 0.03 0.41000000000000003 0.6476401327683433
0.0 0.0 0.03 0.6100000000000001 0.6476401327683433
0.0 0.0 0.03 0.81 0.6476401327683433
0.0 0.0 0.049999999999999996 0.01 0.6476401327683433


KeyboardInterrupt: ignored

In [0]:
from bayes_opt import BayesianOptimization
import time

# Bounded region of parameter space
pbounds = {'dropout': (0.0, 0.499),
           'lr': (0.0, 0.1),
           'neuronPct': (0.01, 0.07),
           'neuronShrink': (0.01, 1)
          }

optimizer = BayesianOptimization(
    f=evaluate_network,
    pbounds=pbounds,
    verbose=2,  # verbose = 1 prints only when a maximum is observed, verbose = 0 is silent
    random_state=1,
)

start_time = time.time()
optimizer.maximize(init_points=10, n_iter=100,)
time_took = time.time() - start_time


|   iter    |  target   |  dropout  |    lr     | neuronPct | neuron... |
-------------------------------------------------------------------------
| [0m 1       [0m | [0m 0.6932  [0m | [0m 0.2081  [0m | [0m 0.07203 [0m | [0m 0.01001 [0m | [0m 0.3093  [0m |
| [0m 2       [0m | [0m 0.6931  [0m | [0m 0.07323 [0m | [0m 0.009234[0m | [0m 0.02118 [0m | [0m 0.3521  [0m |
| [95m 3       [0m | [95m 0.6938  [0m | [95m 0.198   [0m | [95m 0.05388 [0m | [95m 0.03515 [0m | [95m 0.6884  [0m |
| [0m 4       [0m | [0m 0.6932  [0m | [0m 0.102   [0m | [0m 0.08781 [0m | [0m 0.01164 [0m | [0m 0.6738  [0m |


KeyboardInterrupt: ignored

#### Ensemble models

In [0]:
import os
import math
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier

x, X_val, y, y_val = train_test_split(x_train.values, y_train.values, test_size=0.2, random_state=47)

SHUFFLE = False
FOLDS = 10

def build_ann():
    model = Sequential()
    model.add(Dense(100, input_dim=x.shape[1], activation='relu'))
    model.add(Dropout(0.2))
    model.add(Dense(25, activation='relu'))
    model.add(Dense(1, activation='sigmoid'))
    Adam = optimizers.Adam(lr=0.001)
    model.compile(optimizer='Adam', loss='binary_crossentropy', metrics=['accuracy'])
    return model


def stretch(y):
    return (y - y.min()) / (y.max() - y.min())


def blend_ensemble(x, y, x_submit):
    kf = StratifiedKFold(FOLDS)
    folds = list(kf.split(x,y))

    models = [  build_ann(),
            xgb.XGBClassifier(learning_rate=0.1, max_depth=5, objective='binary:logistic', random_state=47, epoch=200),\
            lgb.LGBMClassifier(max_depth=6, learning_rate=0.12, objective='binary', metric='binary_logloss', lambda_l1=0.1),\
            LogisticRegression(C=100, random_state=47, penalty='l1',)]#,\
            #RandomForestClassifier(n_estimators=10, max_features=3, random_state=47)]

    dataset_blend_train = np.zeros((x.shape[0], len(models)))
    dataset_blend_test = np.zeros((x_submit.shape[0], len(models)))

    for j, model in enumerate(models):
        print("Model: {} : {}".format(j, model) )
        fold_sums = np.zeros((x_submit.shape[0], len(folds)))
        total_loss = 0
        for i, (train, test) in enumerate(folds):
            x_train = x[train]
            y_train = y[train]
            x_test = x[test]
            y_test = y[test]
            model.fit(x_train, y_train)
            pred = np.array(model.predict_proba(x_test))
            # pred = model.predict_proba(x_test)
            if j==0:
                dataset_blend_train[test, j] = pred.flatten()
            else:
                dataset_blend_train[test, j] = pred[:, 1]
            
            pred2 = np.array(model.predict_proba(x_submit))
            #fold_sums[:, i] = model.predict_proba(x_submit)[:, 1]

            if j==0:
                fold_sums[:, i] = pred2.flatten()
            else:
                fold_sums[:, i] = pred2[:, 1]

            loss = log_loss(y_test, pred)
            total_loss+=loss
            print("Fold #{}: loss={}".format(i,loss))
        print("{}: Mean loss={}".format(model.__class__.__name__,total_loss/len(folds)))
        dataset_blend_test[:, j] = fold_sums.mean(1)

    print()
    print("Blending models.")
    blend = LogisticRegression(solver='lbfgs')
    blend.fit(dataset_blend_train, y)
    return blend.predict_proba(dataset_blend_test)

if __name__ == '__main__':

    np.random.seed(42)  # seed to shuffle the train set

    print("Loading data...")

    if SHUFFLE:
        idx = np.random.permutation(y.size)
        x = x[idx]
        y = y[idx]

    submit_data = blend_ensemble(x, y, X_val)
    submit_data = stretch(submit_data)

    ####################
    # Build submit file
    ####################
    ids = [id+1 for id in range(submit_data.shape[0])]
    submit_df = pd.DataFrame({'MoleculeId': ids, 'PredictedProbability': submit_data[:, 1]},
                             columns=['MoleculeId','PredictedProbability'])
    
    print(log_loss(y_val, submit_data[:, 1]))



Loading data...
Model: 0 : <tensorflow.python.keras.engine.sequential.Sequential object at 0x7f9050e56748>
Train on 90261 samples
Fold #0: loss=0.5511684605189048
Train on 90261 samples
Fold #1: loss=0.550884449179821
Train on 90262 samples
Fold #2: loss=0.5400628521962254
Train on 90262 samples
Fold #3: loss=0.5447344950541149
Train on 90262 samples
Fold #4: loss=0.5382207332909016
Train on 90262 samples
Fold #5: loss=0.5324909045083807
Train on 90262 samples
Fold #6: loss=0.5433720567166088
Train on 90262 samples
Fold #7: loss=0.5495700576338789
Train on 90262 samples
Fold #8: loss=0.5389333589173114
Train on 90263 samples
Fold #9: loss=0.527313591164354
Sequential: Mean loss=0.5416750959180502
Model: 1 : XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, epoch=200, gamma=0,
              learning_rate=0.1, max_delta_step=0, max_depth=5,
              min_child_weight=1, missing=None, n_estimators=100, n_jobs=1,


#### Evaluating the features

In [0]:
report = all_data_clean.copy()
grp = report.groupby(by=['same_source']).mean()
grp

Unnamed: 0_level_0,cosine_distance_model_w2v_1_40_3,cityblock_distance_model_w2v_1_40_3,jaccard_distance_model_w2v_1_40_3,canberra_distance_model_w2v_1_40_3,euclidean_distance_model_w2v_1_40_3,minkowski_distance_model_w2v_1_40_3,braycurtis_distance_model_w2v_1_40_3,cosine_distance_model_w2v_1_40_5,cityblock_distance_model_w2v_1_40_5,jaccard_distance_model_w2v_1_40_5,canberra_distance_model_w2v_1_40_5,euclidean_distance_model_w2v_1_40_5,minkowski_distance_model_w2v_1_40_5,braycurtis_distance_model_w2v_1_40_5,cosine_distance_model_w2v_1_40_7,cityblock_distance_model_w2v_1_40_7,jaccard_distance_model_w2v_1_40_7,canberra_distance_model_w2v_1_40_7,euclidean_distance_model_w2v_1_40_7,minkowski_distance_model_w2v_1_40_7,braycurtis_distance_model_w2v_1_40_7,cosine_distance_model_w2v_1_70_3,cityblock_distance_model_w2v_1_70_3,jaccard_distance_model_w2v_1_70_3,canberra_distance_model_w2v_1_70_3,euclidean_distance_model_w2v_1_70_3,minkowski_distance_model_w2v_1_70_3,braycurtis_distance_model_w2v_1_70_3,cosine_distance_model_w2v_1_70_5,cityblock_distance_model_w2v_1_70_5,jaccard_distance_model_w2v_1_70_5,canberra_distance_model_w2v_1_70_5,euclidean_distance_model_w2v_1_70_5,minkowski_distance_model_w2v_1_70_5,braycurtis_distance_model_w2v_1_70_5,cosine_distance_model_w2v_1_70_7,cityblock_distance_model_w2v_1_70_7,jaccard_distance_model_w2v_1_70_7,canberra_distance_model_w2v_1_70_7,euclidean_distance_model_w2v_1_70_7,...,w2v_wmdmodel_w2v_1_100_5,w2v_wmdmodel_w2v_1_100_7,w2v_wmdmodel_w2v_1_130_3,w2v_wmdmodel_w2v_1_130_5,w2v_wmdmodel_w2v_1_130_7,w2v_wmdmodel_w2v_1_160_3,w2v_wmdmodel_w2v_1_160_5,w2v_wmdmodel_w2v_1_160_7,w2v_wmdmodel_w2v_1_190_3,w2v_wmdmodel_w2v_1_190_5,w2v_wmdmodel_w2v_1_190_7,w2v_wmdmodel_w2v_2_40_3,w2v_wmdmodel_w2v_2_40_5,w2v_wmdmodel_w2v_2_40_7,w2v_wmdmodel_w2v_2_70_3,w2v_wmdmodel_w2v_2_70_5,w2v_wmdmodel_w2v_2_70_7,w2v_wmdmodel_w2v_2_100_3,w2v_wmdmodel_w2v_2_100_5,w2v_wmdmodel_w2v_2_100_7,w2v_wmdmodel_w2v_2_130_3,w2v_wmdmodel_w2v_2_130_5,w2v_wmdmodel_w2v_2_130_7,w2v_wmdmodel_w2v_2_160_3,w2v_wmdmodel_w2v_2_160_5,w2v_wmdmodel_w2v_2_160_7,w2v_wmdmodel_w2v_2_190_3,w2v_wmdmodel_w2v_2_190_5,w2v_wmdmodel_w2v_2_190_7,w2v_wmdmodel_w2v_2_250_5,w2v_wmdmodel_w2v_2_300_5,bow_cosine,bow_cityblock,bow_jaccard,bow_canberra,bow_euclidean,bow_minkowski,bow_braycurtis,tfidf_sim,same_words
same_source,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1,Unnamed: 23_level_1,Unnamed: 24_level_1,Unnamed: 25_level_1,Unnamed: 26_level_1,Unnamed: 27_level_1,Unnamed: 28_level_1,Unnamed: 29_level_1,Unnamed: 30_level_1,Unnamed: 31_level_1,Unnamed: 32_level_1,Unnamed: 33_level_1,Unnamed: 34_level_1,Unnamed: 35_level_1,Unnamed: 36_level_1,Unnamed: 37_level_1,Unnamed: 38_level_1,Unnamed: 39_level_1,Unnamed: 40_level_1,Unnamed: 41_level_1,Unnamed: 42_level_1,Unnamed: 43_level_1,Unnamed: 44_level_1,Unnamed: 45_level_1,Unnamed: 46_level_1,Unnamed: 47_level_1,Unnamed: 48_level_1,Unnamed: 49_level_1,Unnamed: 50_level_1,Unnamed: 51_level_1,Unnamed: 52_level_1,Unnamed: 53_level_1,Unnamed: 54_level_1,Unnamed: 55_level_1,Unnamed: 56_level_1,Unnamed: 57_level_1,Unnamed: 58_level_1,Unnamed: 59_level_1,Unnamed: 60_level_1,Unnamed: 61_level_1,Unnamed: 62_level_1,Unnamed: 63_level_1,Unnamed: 64_level_1,Unnamed: 65_level_1,Unnamed: 66_level_1,Unnamed: 67_level_1,Unnamed: 68_level_1,Unnamed: 69_level_1,Unnamed: 70_level_1,Unnamed: 71_level_1,Unnamed: 72_level_1,Unnamed: 73_level_1,Unnamed: 74_level_1,Unnamed: 75_level_1,Unnamed: 76_level_1,Unnamed: 77_level_1,Unnamed: 78_level_1,Unnamed: 79_level_1,Unnamed: 80_level_1,Unnamed: 81_level_1
0,0.939994,6.942299,0.998341,28.085681,1.360873,0.850351,0.957495,0.885093,6.709271,0.998341,27.39359,1.314488,0.821005,0.907533,0.802964,6.3206,0.998341,26.336231,1.238289,0.773397,0.833983,0.945128,9.200179,0.998341,49.369512,1.366862,0.780232,0.957488,0.888232,8.878867,0.998341,48.121596,1.319056,0.752946,0.906936,0.810079,8.395026,0.998341,46.476837,1.247076,...,0.040796,0.044364,0.034922,0.036022,0.039141,0.031621,0.032597,0.03548,0.02914,0.030083,0.03288,0.060436,0.063122,0.071613,0.047009,0.049153,0.055428,0.039808,0.041505,0.046579,0.035182,0.036706,0.041129,0.031862,0.033238,0.037202,0.02936,0.030747,0.034605,0.026908,0.024615,0.989788,21.245633,0.994659,19.940481,4.799265,3.020495,0.990502,0.769364,0.105355
1,0.869206,6.651821,0.999247,27.33206,1.304625,0.815429,0.893123,0.794914,6.294989,0.999247,26.273414,1.234752,0.771767,0.829662,0.704269,5.7861,0.999247,24.801337,1.134846,0.709321,0.750512,0.875287,8.831502,0.999246,48.133598,1.311737,0.748508,0.895499,0.798325,8.344901,0.999246,46.242916,1.239801,0.707594,0.83075,0.710832,7.702942,0.999246,43.882598,1.14452,...,0.038686,0.0407,0.033535,0.034171,0.035929,0.030358,0.030917,0.032536,0.027965,0.028497,0.030053,0.05788,0.059398,0.064198,0.045038,0.04622,0.04973,0.038149,0.039075,0.041911,0.033714,0.03454,0.037013,0.030519,0.031262,0.033475,0.028106,0.028849,0.030988,0.025249,0.023105,0.940125,20.303832,0.975169,19.046835,4.675032,2.958895,0.948362,0.789402,0.654956


In [0]:
def evaluate_feature(threshold):
    for i in range(grp.shape[1]):
        if abs(grp.iat[0,i]-grp.iat[1,i])/max(grp.iat[0,i],grp.iat[0,i]) > threshold:
             impt_features.append(grp.columns[i])
    return impt_features

In [0]:
impt_features = []
evaluate_feature(0.09)

['cosine_distance_model_w2v_1_40_5',
 'cosine_distance_model_w2v_1_40_7',
 'braycurtis_distance_model_w2v_1_40_7',
 'cosine_distance_model_w2v_1_70_5',
 'cosine_distance_model_w2v_1_70_7',
 'braycurtis_distance_model_w2v_1_70_7',
 'cosine_distance_model_w2v_1_100_5',
 'cosine_distance_model_w2v_1_100_7',
 'braycurtis_distance_model_w2v_1_100_7',
 'cosine_distance_model_w2v_1_130_5',
 'cosine_distance_model_w2v_1_130_7',
 'braycurtis_distance_model_w2v_1_130_7',
 'cosine_distance_model_w2v_1_160_5',
 'cosine_distance_model_w2v_1_160_7',
 'braycurtis_distance_model_w2v_1_160_7',
 'cosine_distance_model_w2v_1_190_5',
 'cosine_distance_model_w2v_1_190_7',
 'braycurtis_distance_model_w2v_1_190_7',
 'cosine_distance_model_w2v_2_40_5',
 'braycurtis_distance_model_w2v_2_40_5',
 'cosine_distance_model_w2v_2_40_7',
 'cityblock_distance_model_w2v_2_40_7',
 'braycurtis_distance_model_w2v_2_40_7',
 'cosine_distance_model_w2v_2_70_5',
 'cosine_distance_model_w2v_2_70_7',
 'braycurtis_distance_model_

#### Ensemble models

In [0]:
import os
import math
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier

x, X_val, y, y_val = train_test_split(x_train.values, y_train.values, test_size=0.2, random_state=47)

SHUFFLE = False
FOLDS = 10

def build_ann():
    model = Sequential()
    model.add(Dense(100, input_dim=x.shape[1], activation='relu'))
    model.add(Dropout(0.2))
    model.add(Dense(25, activation='relu'))
    model.add(Dense(1, activation='sigmoid'))
    Adam = optimizers.Adam(lr=0.001)
    model.compile(optimizer='Adam', loss='binary_crossentropy', metrics=['accuracy'])
    return model


def stretch(y):
    return (y - y.min()) / (y.max() - y.min())


def blend_ensemble(x, y, x_submit):
    kf = StratifiedKFold(FOLDS)
    folds = list(kf.split(x,y))

    models = [  build_ann(),
            xgb.XGBClassifier(learning_rate=0.1, max_depth=5, objective='binary:logistic', random_state=47, epoch=200),\
            lgb.LGBMClassifier(max_depth=6, learning_rate=0.12, objective='binary', metric='binary_logloss', lambda_l1=0.1),\
            LogisticRegression(C=100, random_state=47, penalty='l1',)]#,\
            #RandomForestClassifier(n_estimators=10, max_features=3, random_state=47)]

    dataset_blend_train = np.zeros((x.shape[0], len(models)))
    dataset_blend_test = np.zeros((x_submit.shape[0], len(models)))

    for j, model in enumerate(models):
        print("Model: {} : {}".format(j, model) )
        fold_sums = np.zeros((x_submit.shape[0], len(folds)))
        total_loss = 0
        for i, (train, test) in enumerate(folds):
            x_train = x[train]
            y_train = y[train]
            x_test = x[test]
            y_test = y[test]
            model.fit(x_train, y_train)
            pred = np.array(model.predict_proba(x_test))
            # pred = model.predict_proba(x_test)
            if j==0:
                dataset_blend_train[test, j] = pred.flatten()
            else:
                dataset_blend_train[test, j] = pred[:, 1]
            
            pred2 = np.array(model.predict_proba(x_submit))
            #fold_sums[:, i] = model.predict_proba(x_submit)[:, 1]

            if j==0:
                fold_sums[:, i] = pred2.flatten()
            else:
                fold_sums[:, i] = pred2[:, 1]

            loss = log_loss(y_test, pred)
            total_loss+=loss
            print("Fold #{}: loss={}".format(i,loss))
        print("{}: Mean loss={}".format(model.__class__.__name__,total_loss/len(folds)))
        dataset_blend_test[:, j] = fold_sums.mean(1)

    print()
    print("Blending models.")
    blend = LogisticRegression(solver='lbfgs')
    blend.fit(dataset_blend_train, y)
    return blend.predict_proba(dataset_blend_test)

if __name__ == '__main__':

    np.random.seed(42)  # seed to shuffle the train set

    print("Loading data...")

    if SHUFFLE:
        idx = np.random.permutation(y.size)
        x = x[idx]
        y = y[idx]

    submit_data = blend_ensemble(x, y, X_val)
    submit_data = stretch(submit_data)

    ####################
    # Build submit file
    ####################
    ids = [id+1 for id in range(submit_data.shape[0])]
    submit_df = pd.DataFrame({'MoleculeId': ids, 'PredictedProbability': submit_data[:, 1]},
                             columns=['MoleculeId','PredictedProbability'])
    
    print(log_loss(y_val, submit_data[:, 1]))



Loading data...
Model: 0 : <tensorflow.python.keras.engine.sequential.Sequential object at 0x7f9050e56748>
Train on 90261 samples
Fold #0: loss=0.5511684605189048
Train on 90261 samples
Fold #1: loss=0.550884449179821
Train on 90262 samples
Fold #2: loss=0.5400628521962254
Train on 90262 samples
Fold #3: loss=0.5447344950541149
Train on 90262 samples
Fold #4: loss=0.5382207332909016
Train on 90262 samples
Fold #5: loss=0.5324909045083807
Train on 90262 samples
Fold #6: loss=0.5433720567166088
Train on 90262 samples
Fold #7: loss=0.5495700576338789
Train on 90262 samples
Fold #8: loss=0.5389333589173114
Train on 90263 samples
Fold #9: loss=0.527313591164354
Sequential: Mean loss=0.5416750959180502
Model: 1 : XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, epoch=200, gamma=0,
              learning_rate=0.1, max_delta_step=0, max_depth=5,
              min_child_weight=1, missing=None, n_estimators=100, n_jobs=1,


In [0]:
submit_df

Unnamed: 0,MoleculeId,PredictedProbability
0,1,0.928677
1,2,0.830183
2,3,0.963911
3,4,0.362951
4,5,0.449426
...,...,...
25827,25828,0.581315
25828,25829,0.135866
25829,25830,0.540500
25830,25831,0.475617
