# Sentence Similarity Measures II: KB + Syn-Sem

## 0. Contents

* I. Corpora (spaCy Preprocessing for Lemmatization)
    * MSR Paraphrase Corpus (for evaluation)
    * Brown Corpus (for computing info content of words)
    * WordNet (for computing word similarity)
* II. Word Similarity
* III. Sentence Similarity
* IV. Word-Order Similarity
* V. Overall Sentence Similarity (Linear Combination of Sent & Word Similarities)
* VI. Evaluation (Not Using MSR for now, computationally expensive)

* VII. Extension (SRL)

## I. Corpora

* MSR Paraphrase Corpus
* NLTK WordNet

### A. MSR Preprocessing

##### Load

In [3]:
import pandas as pd

In [10]:
train_path = "/Users/jacobsw/Desktop/WORK/OJO_CODE/SENTENCE_SIMILARITIES/CORPORA/paraphrase/msr_paraphrase_train.txt"
test_path = "/Users/jacobsw/Desktop/WORK/OJO_CODE/SENTENCE_SIMILARITIES/CORPORA/paraphrase/msr_paraphrase_test.txt"

In [11]:
df_train = pd.read_table(train_path, encoding='utf-8-sig')
df_test = pd.read_table(test_path, encoding='utf-8-sig')
df_train.head()

Unnamed: 0,Quality,#1 ID,#2 ID,#1 String,#2 String
0,1,702876,702977,"Amrozi accused his brother, whom he called the...","Referring to him as only the witness, Amrozi a..."
1,0,2108705,2108831,Yucaipa owned Dominick's before selling the ch...,Yucaipa bought Dominick's in 1995 for $693 mil...
2,1,1330381,1330521,They had published an advertisement on the Int...,"On June 10, the ship's owners had published an..."
3,0,3344667,3344648,"Around 0335 GMT, Tab shares were up 19 cents, ...","Tab shares jumped 20 cents, or 4.6%, to set a ..."
4,1,1236820,1236712,"The stock rose $2.11, or about 11 percent, to ...",PG&E Corp. shares jumped $1.63 or 8 percent to...


In [12]:
print df_train.shape
print df_test.shape

(4076, 5)
(1725, 5)


In [33]:
df_train.ix[0] # NB: the raw header of the first column carries a BOM (u'\ufeffQuality'); reading with encoding='utf-8-sig' strips it.

Quality                                                      1
#1 ID                                                   702876
#2 ID                                                   702977
#1 String    Amrozi accused his brother, whom he called the...
#2 String    Referring to him as only the witness, Amrozi a...
Name: 0, dtype: object

In [34]:
df_train.ix[0]['#1 String']

u'Amrozi accused his brother, whom he called the witness, of deliberately distorting his evidence.'

##### To Lemmas

In [8]:
from spacy.en import English

In [40]:
from collections import defaultdict

In [9]:
parser = English()

In [41]:
def parse_msr(df, indexer):
    
    # default factories must take no arguments; entries are assigned explicitly below.
    X_dic, Y_dic = defaultdict(lambda: defaultdict(list)), \
                   defaultdict(lambda: defaultdict(list))
    
    for i in indexer:
        
        entry_dic = defaultdict(list)
        s1, s2 = df.ix[i]['#1 String'][:-1], \
                 df.ix[i]['#2 String'][:-1]
        # NB: [:-1] drops the last character (normally the sentence-final period),
        # which otherwise gets in the way of matching identical tokens.
        
        parsed_s1, parsed_s2 = parser(unicode(s1)), parser(unicode(s2))
        
        entry_dic['s1'] = [token.orth_ for token in parsed_s1]
        entry_dic['s2'] = [token.orth_ for token in parsed_s2]
        entry_dic['s1_lm'] = [token.lemma_ for token in parsed_s1]
        entry_dic['s2_lm'] = [token.lemma_ for token in parsed_s2] 
#         parsed_lm_s1, parsed_lm_s2 = parser(' '.join(entry_dic['s1_lm'])), \
#                                     parser(' '.join(entry_dic['s2_lm'])) # parse on lemmas.
        
#         entry_dic['s1_dep_lm'] = dep_lemmas(parsed_lm_s1) # for dep lemma features.
#         entry_dic['s2_dep_lm'] = dep_lemmas(parsed_lm_s2)
#         entry_dic['s1_dep_tk'] = dep_tokens(parsed_s1) # for dep token features.
#         entry_dic['s2_dep_tk'] = dep_tokens(parsed_s2) 
#         entry_dic['s1_root_lm'] = get_root(parsed_lm_s1)
#         entry_dic['s2_root_lm'] = get_root(parsed_lm_s2)
#         entry_dic['s1_root_tk'] = get_root(parsed_s1)
#         entry_dic['s2_root_tk'] = get_root(parsed_s2)

        entry_dic['s1_id'] = df.ix[i]['#1 ID'] # for error analysis later.
        entry_dic['s2_id'] = df.ix[i]['#2 ID']
        X_dic[i] = entry_dic
        Y_dic[i] = df.ix[i]['Quality']
    
    return X_dic, Y_dic


In [42]:
%%time
X_train, Y_train = parse_msr(df_train, df_train.index)

CPU times: user 18.3 s, sys: 161 ms, total: 18.5 s
Wall time: 18.6 s


In [43]:
%%time
X_test, Y_test = parse_msr(df_test, df_test.index)

CPU times: user 7.37 s, sys: 57.3 ms, total: 7.42 s
Wall time: 7.47 s


In [36]:
# %%time
# msr_sents, msr_words = parse_msr()

In [37]:
# print msr_sents[0]
# print
# print msr_words[:10]

### B. Brown Preprocessing

In [52]:
from nltk.corpus import brown

In [53]:
def parse_brown():
    
    sents = brown.sents()
    parsed_sents = [parser(' '.join(sent)) for sent in sents]
    lemma_words = [token.lemma_ for parsed_sent in parsed_sents for token in parsed_sent]
    
    return lemma_words

In [54]:
%%time
brown_words = parse_brown()

CPU times: user 1min 44s, sys: 830 ms, total: 1min 44s
Wall time: 1min 45s


In [55]:
N = len(brown_words)
N

1188973

### C. WordNet

In [14]:
from nltk.corpus import wordnet as wn

## II. Word Similarity with Knowledge Base

**Math**

* **Li et al. (2006)'s WordNet Word Similarity**
    * Equation: $SIM(w_1,w_2) = e^{-\alpha l}\cdot \frac{e^{\beta h}-e^{-\beta h}}{e^{\beta h}+e^{-\beta h}}$ (cf. ibid.:14,(5)).
    * Breakdown: The similarity between $w_1$ and $w_2$ is the product of the following functions:
        * Path Length Function: $f(l) = e^{-\alpha l}$
        * Subsumer Depth Function: $g(h) = \frac{e^{\beta h}-e^{-\beta h}}{e^{\beta h}+e^{-\beta h}}$
    * Measures:
        * Path Length: (cf. ibid.:13)
            * $0$ if $w_1$ and $w_2$ are in the same synset.
            * $1$ if $w_1$ and $w_2$ are not in the same synset but their synsets share one or more words.
            * *shortest path length* according to WordNet if neither of the above is true.
        * Subsumer Depth: (cf. ibid.:14)
            * "Words at upper layers of hierarchical semantic nets have more general concepts and less semantic similarity between words than words at lower layers. Therefore $g(h)$ should increase monotonically with respect to the subsumer depth".
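
A minimal sketch of the formula above, written for Python 3 (the cells in this notebook are Python 2). It takes a path length $l$ and a subsumer depth $h$ directly, so it needs no WordNet access. The function name `li_word_sim` is made up for illustration; the defaults $\alpha=0.2$ and $\beta=0.45$ are the ones used by `word_sim` in this notebook. Note that $g(h)$ is just $\tanh(\beta h)$.

```python
import math

def li_word_sim(path_length, subsumer_depth, alpha=0.2, beta=0.45):
    """Word similarity from a path length l and a subsumer depth h."""
    f = math.exp(-alpha * path_length)      # decays as the path gets longer
    g = math.tanh(beta * subsumer_depth)    # (e^{bh} - e^{-bh}) / (e^{bh} + e^{-bh})
    return f * g

# l = 4 and h = 11 are the values this notebook reports for ('dog', 'cat').
print(li_word_sim(4, 11))

# A shallower subsumer or a longer path both lower the score.
print(li_word_sim(4, 2) < li_word_sim(4, 11))
print(li_word_sim(8, 11) < li_word_sim(4, 11))
```

With $l=4$, $h=11$ the result should agree with the value printed for `word_sim('dog','cat')` in this section, up to floating-point noise.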

In [15]:
import numpy as np

In [16]:
lemmas = lambda synset: frozenset(str(lemma.name()) for lemma in synset.lemmas()
                         if '_' not in str(lemma.name())) # there are lemmas like 'domestic_dog'.
div = lambda x,y: x/y if y!=0 else 0

In [17]:
PATH_LEN_CACHE = {}
def path_len(w1, w2):
    
    if (w1,w2) in PATH_LEN_CACHE: 
        return PATH_LEN_CACHE[(w1,w2)]
        
    w1synsets, w2synsets = wn.synsets(w1), wn.synsets(w2)
    w1syns = {lemmas(syn) for syn in w1synsets}
    w2syns = {lemmas(syn) for syn in w2synsets}
    
    for syn in w1syns.union(w2syns):
        if w1 in syn and w2 in syn:
            return 0
    for w1syn in w1syns:
        for w2syn in w2syns:
            if w1syn.intersection(w2syn):
                return 1
    pls = []
    for w1syn in w1synsets:
        for w2syn in w2synsets:
            pl = w1syn.shortest_path_distance(w2syn)
            if pl is not None: pls.append(pl)
    
    # no connecting path at all: use a large length (50) to penalize non-related words.
    PATH_LEN_CACHE[(w1,w2)] = 50 if len(pls)==0 else min(pls)
    
    return PATH_LEN_CACHE[(w1,w2)]
          

In [19]:
%%time
path_len('dog','cat')

CPU times: user 5 µs, sys: 2 µs, total: 7 µs
Wall time: 10 µs


4
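
`path_len` returns 0 or 1 before anything is written to `PATH_LEN_CACHE`, and `(w1, w2)` and `(w2, w1)` are cached separately. One possible alternative, sketched here in Python 3 under the assumption that the measure is symmetric, is a small decorator that caches every result under an order-independent key. `memoize_symmetric` and `toy_path_len` are hypothetical names, and `toy_path_len` only stands in for the WordNet lookup.

```python
def memoize_symmetric(fn):
    """Cache fn(w1, w2) under an order-independent key (assumes fn is symmetric)."""
    cache = {}

    def wrapper(w1, w2):
        key = (w1, w2) if w1 <= w2 else (w2, w1)
        if key not in cache:
            cache[key] = fn(*key)
        return cache[key]

    wrapper.cache = cache
    return wrapper

calls = []

@memoize_symmetric
def toy_path_len(w1, w2):
    # Hypothetical stand-in for an expensive WordNet lookup.
    calls.append((w1, w2))
    return abs(len(w1) - len(w2))

toy_path_len('dog', 'canine')
toy_path_len('canine', 'dog')   # served from the cache
print(len(calls))
```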

In [21]:
SUBSUMER_CACHE = {}
def subsumer_depth(w1, w2):
    
    if (w1,w2) in SUBSUMER_CACHE: 
        return SUBSUMER_CACHE[(w1,w2)]    
    
    w1synsets, w2synsets = wn.synsets(w1), wn.synsets(w2)
    subsumers = []
    for w1syn in w1synsets:
        for w2syn in w2synsets:
            subsumers += w1syn.common_hypernyms(w2syn)
    subsumers = list(set(subsumers))
    
    depths = [subsumer.min_depth() for subsumer in subsumers] 
    
    SUBSUMER_CACHE[(w1,w2)] = 0 if len(depths)==0 else max(depths) # penalizes no-subsumer case.
    
    return SUBSUMER_CACHE[(w1,w2)]


In [23]:
%%time
subsumer_depth('dog','cat')

CPU times: user 5 µs, sys: 1e+03 ns, total: 6 µs
Wall time: 9.06 µs


11

In [58]:
WORD_SIM_CACHE = {}
def word_sim(w1, w2, alpha=.2, beta=.45):
    
    if (w1,w2) in WORD_SIM_CACHE:
        return WORD_SIM_CACHE[(w1,w2)]
    
    l, h = path_len(w1,w2), subsumer_depth(w1,w2)
    
    WORD_SIM_CACHE[(w1,w2)] = np.exp(-alpha*l) * \
                              div(np.exp(beta*h)-np.exp(-beta*h), \
                              np.exp(beta*h)+np.exp(-beta*h))
        
    return WORD_SIM_CACHE[(w1,w2)]
    

In [59]:
%%time
print word_sim('dog','cat')
print word_sim('dog','canine')

0.449283876504
0.818697350358
CPU times: user 124 µs, sys: 26 µs, total: 150 µs
Wall time: 138 µs


## III. Sentence Similarity

**Math**

* **Sentence Vector $\check{s}$**:
    * Build a vector template $\check{s}$ whose cells correspond to the set of distinct words in the two sentences $s_1$, $s_2$, i.e. $\{w|w\in s_1\cup s_2\}$.
    * For $s_1$ and $s_2$, build their vectors $\check{s}_1$ and $\check{s}_2$ as follows: for each $w_i$ in $\check{s}$,
        * If $w_i$ appears in the sentence $s_{1/2}$, set $\check{s}_{1/2,i} = 1$.
        * Otherwise, compute $w_i$'s similarity to every word in $s_{1/2}$, and set $\check{s}_{1/2,i}$ to the highest resulting similarity value.
    * Each cell $\check{s}_{1/2,i}$ is weighted by the *Information Content* of the corresponding words, computed as $I(w) = -\frac{\log p(w)}{\log(N+1)} = 1 - \frac{\log(n+1)}{\log(N+1)}$, where $n$ is the frequency of $w$ in a corpus (Brown, in this case), $N$ is the size of the corpus, and $p(w) = \frac{n+1}{N+1}$. The weighting: $\check{s}_i = \check{s}_i\cdot I(w_i)\cdot I(\tilde{w}_i)$, where $\tilde{w}_i$ is the word in the sentence that is associated with $w_i$ (i.e. $w_i$ itself when it is found in the sentence, and its most similar word otherwise).


* **Sentence Similarity**:
    * Equation: $SIM(s_1,s_2) = \frac{\check{s}_1\cdot\check{s}_2}{||\check{s}_1||\cdot||\check{s}_2||}$.
    * I.e. Cosine Similarity
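
A toy walk-through of the construction above, in Python 3 with only the standard library. The similarity table `TOY_SIM` and the uniform weight `toy_info` are assumptions made up for the example; they stand in for the WordNet similarity and the Brown-based information content. This sketch weights every cell, as the description says, whereas the notebook's `vec` leaves exact matches at 1.

```python
import math

def semantic_vector(template, sentence, sim, info):
    """One weighted cell per template word."""
    vec = []
    for w in template:
        if w in sentence:
            best, score = w, 1.0
        else:
            # Fall back to the most similar word in the sentence.
            best = max(sentence, key=lambda w_j: sim(w, w_j))
            score = sim(w, best)
        vec.append(score * info(w) * info(best))
    return vec

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Toy stand-ins (assumptions): a hand-written similarity table, uniform weights.
TOY_SIM = {frozenset(['dog', 'cat']): 0.45}

def toy_sim(a, b):
    return 1.0 if a == b else TOY_SIM.get(frozenset([a, b]), 0.0)

def toy_info(w):
    return 1.0

s1 = 'the dog ate an apple'.split()
s2 = 'the cat ate an apple'.split()
template = list(dict.fromkeys(s1 + s2))   # distinct words, first-seen order
v1 = semantic_vector(template, s1, toy_sim, toy_info)
v2 = semantic_vector(template, s2, toy_sim, toy_info)
print(template)
print(v1)
print(v2)
print(cosine(v1, v2))
```

Using an ordered template instead of a `set` keeps the cell order reproducible between runs, which makes the vectors easier to inspect.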

In [27]:
log = lambda x: np.log(x) if x>0 else 0

In [29]:
I_CACHE = {}
def I(w):
    if w in I_CACHE:
        return I_CACHE[w]
    else:
        I_CACHE[w] = 1 - div(log(brown_words.count(w)+1),log(N+1))
    return I_CACHE[w]
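
`brown_words.count(w)` scans the whole corpus list on every cache miss. A possible alternative, sketched in Python 3, is to count all words once with `collections.Counter` and look frequencies up afterwards. `make_info_content` and `toy_corpus` are hypothetical names for illustration, and the sketch assumes a non-empty corpus.

```python
import math
from collections import Counter

def make_info_content(corpus_words):
    """Return I(w) = 1 - log(n + 1) / log(N + 1), backed by a one-off frequency table."""
    counts = Counter(corpus_words)                 # single pass over the corpus
    log_total = math.log(len(corpus_words) + 1)    # assumes a non-empty corpus

    def info(w):
        # Counter returns 0 for unseen words, so they get the maximum weight 1.0.
        return 1.0 - math.log(counts[w] + 1) / log_total

    return info

toy_corpus = 'the cat sat on the mat and the dog sat too'.split()
info = make_info_content(toy_corpus)
print(info('the'), info('dog'), info('zebra'))
```

Frequent words get a lower weight than rare ones, and unseen words get a weight of 1.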

In [60]:
def vec(s1, s2): # assuming s1,s2 are lists of words.
    
    s_check = list(set(s1).union(set(s2)))
    l_check = len(s_check)
    s1_check, s2_check = np.zeros(l_check), np.zeros(l_check)
    for i,w in enumerate(s_check):
        if w in s1: s1_check[i] = 1
        else: 
            idx,most_sim = max(enumerate(s1), key=lambda (j,w_j):word_sim(w,w_j)) # idx: that of w's most sim.
            s1_check[i] = word_sim(w,most_sim) * I(w) * I(s1[idx]) # weight by info content
        if w in s2: s2_check[i] = 1
        else: 
            idx,most_sim = max(enumerate(s2), key=lambda (j,w_j):word_sim(w,w_j)) 
            s2_check[i] = word_sim(w,most_sim) * I(w) * I(s2[idx])
    
    return s1_check, s2_check


In [61]:
def sent_sim(s1, s2):
    
    s1_vec, s2_vec = vec(s1, s2)
    
    return div(np.dot(s1_vec,s2_vec),
               np.sqrt(np.dot(s1_vec,s1_vec)) * \
               np.sqrt(np.dot(s2_vec,s2_vec)))


In [62]:
q = X_train[0]['s1']
r1 = X_train[0]['s2'] # labeled as a paraphrase of q (Quality = 1).
r2 = X_train[1]['s1'] # taken from a different pair, so not a paraphrase of q.

In [63]:
%%time
print sent_sim(q, r1)
print sent_sim(q, r2)

0.805565926878
0.150437483504
CPU times: user 5.22 ms, sys: 1.34 ms, total: 6.56 ms
Wall time: 5.66 ms


In [175]:
print q

[u'Amrozi', u'accused', u'his', u'brother', u',', u'whom', u'he', u'called', u'the', u'witness', u',', u'of', u'deliberately', u'distorting', u'his', u'evidence']


## IV. Word-Order Similarity

**Math**

* **Order Similarity**:
    * Equation: $SIM(s_1,s_2) = 1 - \frac{||r_1 - r_2||}{||r_1 + r_2||}$ (cf. Li et al. (2006):18,(8)).
    * Breakdown: Word order vectors $r_1$ and $r_2$ are computed as follows:
        * Build vector template $\check{s}$ as in section III.
        * For $s_1$ and $s_2$, build word order vectors. For each $w$ in $\check{s}$,
            * If $w$ is found in $s_{1/2}$, set $r_{1/2,i}$ to be the index of $w$ in $s_{1/2}$.
            * Otherwise, set $r_{1/2,i}$ to be the index of the $w$'s most similar word in $s_{1/2}$.
    * Idea: "... normalized difference of word order" (cf. ibid.)
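
A small Python 3 sketch of the word-order measure, using 0-based positions as `list.index` returns them. The `exact` similarity is a toy stand-in (an assumption) for the WordNet similarity, which is enough here because both example sentences contain the same words.

```python
import math

def order_vector(template, sentence, sim):
    """Position of each template word, or of its closest word, in `sentence`."""
    r = []
    for w in template:
        if w in sentence:
            r.append(sentence.index(w))
        else:
            closest = max(sentence, key=lambda w_j: sim(w, w_j))
            r.append(sentence.index(closest))
    return r

def order_sim(s1, s2, sim):
    template = list(dict.fromkeys(s1 + s2))   # distinct words, first-seen order
    r1 = order_vector(template, s1, sim)
    r2 = order_vector(template, s2, sim)
    diff = math.sqrt(sum((a - b) ** 2 for a, b in zip(r1, r2)))
    total = math.sqrt(sum((a + b) ** 2 for a, b in zip(r1, r2)))
    return 1.0 - diff / total if total else 1.0

def exact(a, b):
    # Toy similarity (assumption): 1 for identical words, 0 otherwise.
    return 1.0 if a == b else 0.0

a = 'a quick brown dog jumps over the lazy fox'.split()
b = 'a quick brown fox jumps over the lazy dog'.split()
print(order_sim(a, a, exact))   # identical order
print(order_sim(a, b, exact))   # 'dog' and 'fox' swapped
```

Swapping two words leaves the bag of words unchanged, so only this component can tell the two sentences apart.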

In [64]:
def order_vec(s1, s2):
    
    s_check = list(set(s1).union(set(s2)))
    l_check = len(s_check)
    r1, r2 = np.zeros(l_check), np.zeros(l_check)    
    for i,w in enumerate(s_check):
        if w in s1:
            r1[i] = s1.index(w)
        else:
            most_sim = max(s1, key=lambda w_j:word_sim(w,w_j)) 
            r1[i] = s1.index(most_sim)
        if w in s2:
            r2[i] = s2.index(w)
        else:
            most_sim = max(s2, key=lambda w_j:word_sim(w,w_j)) 
            r2[i] = s2.index(most_sim)   
            
    return r1, r2


In [65]:
def order_sim(s1, s2):
    
    r1, r2 = order_vec(s1, s2)
    
    diff = r1 - r2
    norm = r1 + r2
    
    return 1 - div(np.sqrt(np.dot(diff,diff)),np.sqrt(np.dot(norm,norm)))


In [66]:
%%time
print order_sim(q,r1)
print order_sim(q,r2)

0.671823693459
0.406778405642
CPU times: user 1.08 ms, sys: 452 µs, total: 1.53 ms
Wall time: 1.19 ms


## V. Overall Sentence Similarity

**Math**

* $SIM(s_1,s_2) = \delta\cdot SIM_{sent}(s_1,s_2) + (1-\delta)\cdot SIM_{order}(s_1,s_2)$.
* $\delta \in (0.5,1]$, considering word order's "... subordinate role in semantic processing".  
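
A quick arithmetic check of the combination in Python 3, using the semantic and word-order scores that sections III and IV of this notebook print for `(q, r1)` and `(q, r2)`. The helper name `combine` is made up for illustration; $\delta=0.85$ is the default used below.

```python
def combine(sem_sim, word_order_sim, delta=0.85):
    """Weighted sum of the semantic and word-order similarities."""
    if not 0.5 < delta <= 1.0:
        raise ValueError('delta is expected to lie in (0.5, 1]')
    return delta * sem_sim + (1.0 - delta) * word_order_sim

# Scores printed earlier in this notebook for (q, r1) and (q, r2).
print(combine(0.805565926878, 0.671823693459))
print(combine(0.150437483504, 0.406778405642))
```

With $\delta$ this close to 1 the overall score stays near the semantic score, and word order only nudges it.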

In [67]:
def overall_sent_sim(s1, s2, delta=.85): # delta lies in (.5,1]. (cf. Li et al. (2006):20,24)
    
    return delta*sent_sim(s1,s2) + (1-delta)*order_sim(s1,s2)

In [68]:
%%time
print overall_sent_sim(q,r1)
print overall_sent_sim(q,r2)

0.785504591865
0.188888621825
CPU times: user 2.09 ms, sys: 1.17 ms, total: 3.26 ms
Wall time: 2.55 ms


## VI. Evaluation

### A. Li et al. (2006) + Wan et al. (2006)

##### Evaluation Function

In [94]:
def evaluate(X_test_fts, Y_test_fts, model):
    y_true = Y_test_fts
    y_pred = model.predict(X_test_fts)
    print 'Accuracy: %.6f' % accuracy_score(y_true,y_pred)
    print
    print classification_report(y_true,y_pred)

##### Load Wan et al. (2006): Featurized Data

In [84]:
print X_train[0]['s1']; print
print Y_train[0]

[u'Amrozi', u'accused', u'his', u'brother', u',', u'whom', u'he', u'called', u'the', u'witness', u',', u'of', u'deliberately', u'distorting', u'his', u'evidence']

1


In [85]:
import cPickle

In [86]:
data_path = "/Users/jacobsw/Desktop/WORK/OJO_CODE/SENTENCE_SIMILARITIES/DATA/"

In [87]:
# LOAD W06 FEATURES
# with open(data_path+'train1.p','rb') as f_train:
#     X_train_w06fts, Y_train_w06fts = cPickle.load(f_train)
# with open(data_path+'test1.p','rb') as f_test:
#     X_test_w06fts, Y_test_w06fts = cPickle.load(f_test)

In [89]:
print X_train_w06fts[0]

[36.129483428692055, 34.004219697592525, 35.034650597519565, 50.932376695342796, 47.936354536793218, 49.388971340938468, 0.5, 0.4924790605054523, 0.49621103366618263, 0.5, 0.4924790605054523, 0.49621103366618263, 0.5714285714285714, 0.5, 0.5333333333333333, 0.5714285714285714, 0.5, 0.5333333333333333, 13, 13, -1, 1]


##### How does Li et al. (2006) do on MSR on its own?

In [92]:
def featurize(X_train, Y_train, X_test, Y_test, sim):
    
    X_train_fts, Y_train_fts = [], []
    X_test_fts, Y_test_fts = [], []
    
    print "... processing train"
    for i,x in X_train.iteritems():
        if i!=0 and i%100==0:
            print "    ... processed %d train sentences" % i
        X_train_fts.append(sim(x['s1_lm'],x['s2_lm']))
        Y_train_fts.append(Y_train[i])
    print "... processing test"
    for i,x in X_test.iteritems():
        if i!=0 and i%100==0:
            print "    ... processed %d test sentences" % i
        X_test_fts.append(sim(x['s1_lm'],x['s2_lm']))
        Y_test_fts.append(Y_test[i])  
        
    return X_train_fts, Y_train_fts, X_test_fts, Y_test_fts


In [506]:
%%time
X_train_fts, Y_train_fts, X_test_fts, Y_test_fts = featurize(X_train, Y_train, X_test, Y_test, overall_sent_sim)

In [101]:
# CONVERT TO LISTS OF LISTS
X_train_fts = [[ft] for ft in X_train_fts]
X_test_fts = [[ft] for ft in X_test_fts]

In [480]:
# SAVE
# with open(data_path+'li2006_fts.p','wb') as f:
#     cPickle.dump((X_train_fts,Y_train_fts,X_test_fts,Y_test_fts), f)
# LOAD
# with open(data_path+'li2006_fts.p','rb') as f:
#     X_train_fts,Y_train_fts,X_test_fts,Y_test_fts = cPickle.load(f)

In [233]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

In [481]:
lr_li = LogisticRegression()

In [482]:
lr_li.fit(X_train_fts, Y_train_fts)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [483]:
evaluate(X_train_fts, Y_train_fts, lr_li)

Accuracy: 0.731845

             precision    recall  f1-score   support

          0       0.65      0.38      0.48      1323
          1       0.75      0.90      0.82      2753

avg / total       0.72      0.73      0.71      4076



In [237]:
evaluate(X_test_fts, Y_test_fts, lr_li)

Accuracy: 0.732174

             precision    recall  f1-score   support

          0       0.66      0.41      0.51       578
          1       0.75      0.90      0.82      1147

avg / total       0.72      0.73      0.71      1725
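
The accuracies are easier to read against a majority-class baseline, i.e. always predicting the more frequent label. A Python 3 sketch using only the class supports from the reports above (`majority_baseline` is a made-up helper name):

```python
def majority_baseline(support):
    """Accuracy of always predicting the most frequent class."""
    return max(support.values()) / sum(support.values())

# Class supports copied from the classification reports above.
train_support = {0: 1323, 1: 2753}
test_support = {0: 578, 1: 1147}

print('train baseline: %.4f' % majority_baseline(train_support))
print('test baseline:  %.4f' % majority_baseline(test_support))
```

On the test split the baseline is 1147/1725, roughly 0.665, so the 0.732 accuracy above is about 6.7 points over always answering "paraphrase".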



##### How does Wan et al. (2006) do on MSR on its own?

In [238]:
lr_wan = LogisticRegression()

In [239]:
lr_wan.fit(X_train_w06fts, Y_train_w06fts)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [240]:
evaluate(X_train_w06fts, Y_train_w06fts, lr_wan)

Accuracy: 0.736997

             precision    recall  f1-score   support

          0       0.62      0.48      0.54      1323
          1       0.78      0.86      0.82      2753

avg / total       0.73      0.74      0.73      4076



In [241]:
evaluate(X_test_w06fts, Y_test_w06fts, lr_wan)

Accuracy: 0.732754

             precision    recall  f1-score   support

          0       0.62      0.51      0.56       578
          1       0.77      0.85      0.81      1147

avg / total       0.72      0.73      0.73      1725



##### Wan + Li

In [135]:
def featurize_plus(wan_fts, li_fts):
    
    X_train_wan, Y_train_wan, X_test_wan, Y_test_wan = wan_fts
    X_train_li, Y_train_li, X_test_li, Y_test_li = li_fts
    
    X_train_fts, Y_train_fts = [], []
    X_test_fts, Y_test_fts = [], []
    
    for i,(x_wan,x_li) in enumerate(zip(X_train_wan,X_train_li)):
        X_train_fts.append(x_wan+x_li) 
        Y_train_fts.append(Y_train_li[i])
    for i,(x_wan,x_li) in enumerate(zip(X_test_wan,X_test_li)):
        X_test_fts.append(x_wan+x_li)
        Y_test_fts.append(Y_test_li[i])
        
    return X_train_fts, Y_train_fts, X_test_fts, Y_test_fts


In [242]:
wan_fts = (X_train_w06fts,Y_train_w06fts,X_test_w06fts,Y_test_w06fts)
li_fts = (X_train_fts,Y_train_fts,X_test_fts,Y_test_fts)

In [243]:
%%time
X_train_fts, Y_train_fts, X_test_fts, Y_test_fts = featurize_plus(wan_fts,li_fts)

CPU times: user 8.84 ms, sys: 1.66 ms, total: 10.5 ms
Wall time: 9.4 ms


In [244]:
lr_wan_li = LogisticRegression()

In [245]:
lr_wan_li.fit(X_train_fts, Y_train_fts)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [246]:
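evaluate(X_train_w06fts, Y_train_w06fts, lr_wan) # NB: this scores lr_wan on the Wan-only features (same output as In [240]), not lr_wan_li on the combined features.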

Accuracy: 0.736997

             precision    recall  f1-score   support

          0       0.62      0.48      0.54      1323
          1       0.78      0.86      0.82      2753

avg / total       0.73      0.74      0.73      4076



In [247]:
evaluate(X_test_fts, Y_test_fts, lr) # NB: `lr` is not defined in the cells shown here; lr_wan_li was presumably intended.

Accuracy: 0.731594

             precision    recall  f1-score   support

          0       0.62      0.50      0.55       578
          1       0.77      0.85      0.81      1147

avg / total       0.72      0.73      0.72      1725



In [141]:
from sklearn import svm

In [248]:
svm_linear = svm.SVC(kernel='linear',verbose=3)
svm_rbf = svm.SVC(kernel='rbf',verbose=3)

In [254]:
svm_linsvc = svm.LinearSVC()

In [249]:
%%time
svm_linear.fit(X_train_fts, Y_train_fts)

[LibSVM]CPU times: user 12.5 s, sys: 67.9 ms, total: 12.6 s
Wall time: 12.7 s


SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='linear',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=3)

In [258]:
evaluate(X_train_fts, Y_train_fts, svm_linear)

Accuracy: 0.746075

             precision    recall  f1-score   support

          0       0.65      0.46      0.54      1323
          1       0.77      0.88      0.82      2753

avg / total       0.73      0.75      0.73      4076



In [251]:
evaluate(X_test_fts, Y_test_fts, svm_linear)

Accuracy: 0.734493

             precision    recall  f1-score   support

          0       0.64      0.47      0.54       578
          1       0.77      0.87      0.81      1147

avg / total       0.72      0.73      0.72      1725



In [252]:
%%time
svm_rbf.fit(X_train_fts, Y_train_fts)

[LibSVM]CPU times: user 1.15 s, sys: 7.2 ms, total: 1.15 s
Wall time: 1.16 s


SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=3)

In [259]:
evaluate(X_train_fts, Y_train_fts, svm_rbf)

Accuracy: 0.940137

             precision    recall  f1-score   support

          0       0.95      0.86      0.90      1323
          1       0.94      0.98      0.96      2753

avg / total       0.94      0.94      0.94      4076



In [260]:
evaluate(X_test_fts, Y_test_fts, svm_rbf)

Accuracy: 0.692754

             precision    recall  f1-score   support

          0       0.58      0.30      0.39       578
          1       0.72      0.89      0.79      1147

avg / total       0.67      0.69      0.66      1725



In [255]:
%%time
svm_linsvc.fit(X_train_fts, Y_train_fts)

CPU times: user 398 ms, sys: 2.02 ms, total: 400 ms
Wall time: 401 ms


LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='squared_hinge', max_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=0)

In [261]:
evaluate(X_train_fts, Y_train_fts, svm_linsvc)

Accuracy: 0.525025

             precision    recall  f1-score   support

          0       0.40      0.96      0.57      1323
          1       0.94      0.32      0.47      2753

avg / total       0.77      0.53      0.50      4076



In [257]:
evaluate(X_test_fts, Y_test_fts, svm_linsvc)

Accuracy: 0.509565

             precision    recall  f1-score   support

          0       0.40      0.94      0.56       578
          1       0.91      0.29      0.44      1147

avg / total       0.74      0.51      0.48      1725



### B. OJO Sents

In [209]:
ojo_sents = '''
What are the quality of schools in this neighborhood?
What areas have the best schools?
What are the crime statistics in this neighborhood?
What are the number of registered sex offenders in this neighborhood?
What is the walkability score in this neighborhood?
Which neighborhoods have homes that are over 2500 sq ft. 
What neighborhoods have new construction?
Show me pictures of the neighborhood
Show me pictures of homes in the neighborhood
How bicycle friendly is this neighborhood?
What is the median income of this neighborhood?
What is the average demographics of this neighborhood? 
What is the poverty score of this neighborhood?
What is the best day of the week to list my home?
What is the best month to list a home like mine for the most money and shortest time?
How much has my home appreciated?
How has appreciation been in my neighborhood vs other neighborhoods?
What has the average appreciation in my neighborhood been over the last x years?
What has the average appreciation in my school district been over the last x years?
What has the average appreciation on my street been over the last x years?
Which neighborhoods are best for kids under 10
Show me the nearest parks
Show me the nearest pools
Show me the nearest dog parks
Show me the nearest urgent care / emergency room?
Show me the nearest fire / police station?
Show me the impact of railroad/trains
How has appreciation been in this neighborhood vs other neighborhoods?
What has the average appreciation in this neighborhood over the last x years?
What has the average appreciation in this school district over the last x years?
Where can I find a house that is a better fit for me for less money?
What is the commute time for this neighborhood?
Which neighborhoods have a commute time of less than 30min from [address]
I want to live in a low traffic spot
Show me diversity of neighborhood
Show me historic natural disaster trends for this area
Show me historic weather trends for this area
Where can I find a house that is a better fit for me for less money?
What confidence level does OJO have that I should list my home now?
What confidence level does OJO have that I should buy a home right now?
How much is my home worth?
What confidence level does OJO have that I should buy a home right now?
Show me district city government information
Which street(s) in this neighborhood have the highest appreciation over x years?
What is the expected appreciation for my home over the next x years?
Which neighborhood in Austin is expected to appreciate the most over the next x years that have homes similar to what I'm interested in?
What areas have mature trees?
What areas have the most greenspace?
What is the expected appreciation for homes in this area over the next x years?
How is this neighborhood impacted by traffic congestion and which time(s) of day?
How fast will my home sell?
Is this a pet friendly neighborhood?
What are the utility costs in this neighborhood?
I want to live in a tidy area
I want a area where the homes are setback from the streeet
Are there complete streets in this neighborhood (connecting sidewalks)?
Green building score?
Air quality of city/neighborhood?
Air quality of home (VOCs, materials)
Curbside waster services?
Curbside recycling services?
Curbside composting services?
Average heating/cooling costs?
Is sustainable energy availalbe?
Show me the impact of flight patterns
What are the zoning breakdowns of this neighborhood? (section 8, residtential, mixed used, commercial, etc)?
What is the estimated time to sell my home right now?
How long does it take to sell a home in my neighborhood right now?
How long does it take to sell a home on my street right now?
'''

In [210]:
def drop_mark(s):
    return s[:-1] if s.endswith('?') or s.endswith('.') else s

In [211]:
ojo_sents = ojo_sents.split('\n') # split into list of sent strings.
ojo_sents = ojo_sents[1:-1] # get rid of ''s in front and end.
ojo_sents = list({drop_mark(sent) for sent in ojo_sents}) # get rid of question mark and duplicates.
ojo_sents = [to_lemmas(sent) for sent in ojo_sents]

In [196]:
q1 = 'are the schools in the neighborhood good?'
q2 = 'i care the most about the commute time between home and work.'
q3 = 'is this a safe neighborhood?'

In [213]:
from heapq import nlargest

In [214]:
parser = English()

In [215]:
def to_lemmas(s):
    parsed_s = parser(unicode(s))
    return [token.lemma_ for token in parsed_s]

In [216]:
def most_sim(q, k=5):
    
    q = to_lemmas(drop_mark(q))
    sents = nlargest(k, ojo_sents, key=lambda s: overall_sent_sim(q,s))
    
    for i,sent in enumerate(sents):
        print "Sim Rank: %d | Sent: %s" % (i+1,' '.join(sent))
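
A variant of the retrieval step that returns the scores together with the sentences, so it is visible how close each match is. It is a Python 3 sketch; `top_k` and `jaccard` are hypothetical names, and `jaccard` (token-set overlap) only stands in for `overall_sent_sim` to keep the example self-contained.

```python
from heapq import nlargest

def top_k(query, candidates, sim, k=5):
    """Return the k (score, candidate) pairs with the highest similarity to `query`."""
    scored = ((sim(query, cand), cand) for cand in candidates)
    # key= keeps ties from falling back to comparing the token lists themselves.
    return nlargest(k, scored, key=lambda pair: pair[0])

def jaccard(a, b):
    # Hypothetical stand-in for overall_sent_sim: token-set overlap.
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

pool = [s.split() for s in ['show me the nearest parks',
                            'show me the nearest pools',
                            'how fast will my home sell']]

results = top_k('show me nearby parks'.split(), pool, jaccard, k=2)
for score, sent in results:
    print('%.2f  %s' % (score, ' '.join(sent)))
```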

In [217]:
most_sim(q1)

Sim Rank: 1 | Sent: what be the quality of school in this neighborhood
Sim Rank: 2 | Sent: what be the utility cost in this neighborhood
Sim Rank: 3 | Sent: what be the crime statistic in this neighborhood
Sim Rank: 4 | Sent: what be the walkability score in this neighborhood
Sim Rank: 5 | Sent: what be the number of register sex offender in this neighborhood


In [218]:
most_sim(q2)

Sim Rank: 1 | Sent: what be the best month to list a home like mine for the most money and short time
Sim Rank: 2 | Sent: what be the estimate time to sell my home right now
Sim Rank: 3 | Sent: i want a area where the home be setback from the streeet
Sim Rank: 4 | Sent: what be the commute time for this neighborhood
Sim Rank: 5 | Sent: show me the impact of flight pattern


In [219]:
most_sim(q3)

Sim Rank: 1 | Sent: be this a pet friendly neighborhood
Sim Rank: 2 | Sent: how bicycle friendly be this neighborhood
Sim Rank: 3 | Sent: what be the utility cost in this neighborhood
Sim Rank: 4 | Sent: what be the poverty score of this neighborhood
Sim Rank: 5 | Sent: what be the number of register sex offender in this neighborhood


In [220]:
most_sim('is this neighborhood dangerous?')

Sim Rank: 1 | Sent: be this a pet friendly neighborhood
Sim Rank: 2 | Sent: how bicycle friendly be this neighborhood
Sim Rank: 3 | Sent: what be the utility cost in this neighborhood
Sim Rank: 4 | Sent: what be the poverty score of this neighborhood
Sim Rank: 5 | Sent: what be the crime statistic in this neighborhood


In [224]:
qq1 = 'the dog ate an apple'.split()
qq2 = 'the apple ate a dog'.split()
qq3 = 'the cat ate an apple'.split()

In [225]:
overall_sent_sim(qq1,qq2)

0.78328571189411678

In [226]:
overall_sent_sim(qq1,qq3)

0.89501360904995508

In [227]:
most_sim('are there murders here?')

Sim Rank: 1 | Sent: what be the poverty score of this neighborhood
Sim Rank: 2 | Sent: what be the walkability score in this neighborhood
Sim Rank: 3 | Sent: be this a pet friendly neighborhood
Sim Rank: 4 | Sent: be there complete street in this neighborhood ( connect sidewalk )
Sim Rank: 5 | Sent: green build score


## VII. Extension: SRL

In [271]:
from __future__ import division
from practnlptools.tools import Annotator
from spacy.en import English

In [266]:
antr = Annotator()
parser = English()

In [342]:
def get_root(ph):
    
    parsed_ph = parser(unicode(ph))
    
    return filter(lambda tk: tk.dep_=='ROOT', [tk for tk in parsed_ph])[0].lemma_


In [343]:
%%time
get_root('his brother , whom he call the witness ,')

CPU times: user 1.34 ms, sys: 403 µs, total: 1.74 ms
Wall time: 1.12 ms


u'brother'

In [344]:
def get_argstruct(s_lm):
    
    srl = antr.getAnnotations(s_lm)['srl']
    argstruct = defaultdict(dict)
    for entry in srl:
        v = entry['V']
        for arg_lb,arg in entry.iteritems():
            if arg_lb=='V': continue
            if len(arg)>1: arg = get_root(arg)
            if arg_lb not in set(['A0','A1','A2']):
                argstruct[v]['OTHER'] = arg
            else:
                argstruct[v][arg_lb] = arg
    
    return argstruct



In [345]:
%%time
get_argstruct('amrozi accuse his brother , whom he call the witness , of deliberately distort his evidence .')

CPU times: user 3.63 ms, sys: 13 ms, total: 16.6 ms
Wall time: 299 ms


defaultdict(dict,
            {'accuse': {'A0': u'amrozi', 'A1': u'brother', 'A2': u'distort'},
             'call': {'A0': u'he', 'A1': u'witness', 'OTHER': u'whom'},
             'distort': {'A0': u'brother',
              'A1': u'evidence',
              'OTHER': u'deliberately'}})

In [346]:
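def args_sim(arg_dic1, arg_dic2):
    
    # Normalize by the number of distinct role labels across both verbs,
    # so a label present on only one side lowers the score.
    nargs = len(set(arg_dic1.keys()+arg_dic2.keys()))
    if nargs==0: return 0 # neither verb has arguments; avoid dividing by zero
    score = 0
    for arg_lb in arg_dic1:
        if arg_lb in arg_dic2:
            score += word_sim(arg_dic1[arg_lb],arg_dic2[arg_lb])
    
    return score / nargs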
        

In [347]:
arg_dic1 = {'A0': 'amrozi', 'A1': 'brother', 'A2': 'distort'} # arguments of 'accuse'
arg_dic2 = {'A0': 'amrozi', 'A1': 'brother', 'A2': 'kill'} # same verb; only A2 differs

In [348]:
print args_sim(arg_dic1,arg_dic1)
print args_sim(arg_dic1,arg_dic2)

0.62346247
0.332111374038
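
To see the normalization in `args_sim` in isolation, the sketch below swaps in a stand-in word similarity. `toy_word_sim` is a hypothetical exact-match placeholder, not the WordNet-based `word_sim` defined earlier in this notebook, so its numbers will not match the output above.

```python
def toy_word_sim(w1, w2):
    # Hypothetical stand-in: 1.0 for identical strings, 0.0 otherwise.
    return 1.0 if w1 == w2 else 0.0

def toy_args_sim(arg_dic1, arg_dic2, sim=toy_word_sim):
    # Same shape as args_sim: sum similarities over shared labels,
    # then divide by the number of distinct labels on either side.
    labels = set(arg_dic1) | set(arg_dic2)
    if not labels:
        return 0.0
    shared = set(arg_dic1) & set(arg_dic2)
    total = sum(sim(arg_dic1[lb], arg_dic2[lb]) for lb in shared)
    return total / float(len(labels))

accuse_1 = {'A0': 'amrozi', 'A1': 'brother', 'A2': 'distort'}
accuse_2 = {'A0': 'amrozi', 'A1': 'brother', 'A2': 'kill'}
call = {'A0': 'he', 'A1': 'witness', 'OTHER': 'whom'}

print(toy_args_sim(accuse_1, accuse_1))  # all three labels match
print(toy_args_sim(accuse_1, accuse_2))  # A2 differs
print(toy_args_sim(accuse_1, call))      # four distinct labels, no matching words
```

With an exact-match similarity, identical argument sets score 1. The real run above gives about 0.62 for `args_sim(arg_dic1,arg_dic1)`, which suggests that `word_sim` does not return 1 for every identical pair.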


In [404]:
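def argstruct_sim(as1, as2):
    
    # If either sentence has no verb frames, there is nothing to align.
    if len(as1)==0 or len(as2)==0: return 0
    nvs = len(set(as1.keys()+as2.keys()))
    score = 0
    for v1 in as1.keys():
        if v1 in as2.keys():
            # Same verb in both sentences: compare the arguments directly.
            score += args_sim(as1[v1],as2[v1])
        else:
            # Otherwise align v1 with its most similar verb in as2 and
            # discount the argument similarity by the verb similarity.
            v2 = nlargest(1, as2.keys(), key=lambda v2: word_sim(v1,v2))[0]
            score += word_sim(v1,v2)*args_sim(as1[v1],as2[v2])
    
    # Normalize by the number of distinct verbs. Only the verbs of as1 are
    # iterated, so the score can differ when the arguments are swapped.
    return score / nvs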
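
The verb-alignment logic of `argstruct_sim` can be exercised on toy frames. This is a minimal sketch, not the notebook's pipeline: `toy_word_sim` is a hypothetical placeholder (exact match scores 1.0, one hand-picked verb pair scores 0.5), and the frames are invented for illustration.

```python
from heapq import nlargest

def toy_word_sim(w1, w2):
    # Hypothetical stand-in for word_sim; the 0.5 value is illustrative only.
    if w1 == w2:
        return 1.0
    if set([w1, w2]) == set(['buy', 'purchase']):
        return 0.5
    return 0.0

def toy_args_sim(args1, args2):
    labels = set(args1) | set(args2)
    shared = set(args1) & set(args2)
    total = sum(toy_word_sim(args1[lb], args2[lb]) for lb in shared)
    return total / float(len(labels)) if labels else 0.0

def toy_argstruct_sim(as1, as2):
    if not as1 or not as2:
        return 0.0
    n_verbs = len(set(as1) | set(as2))
    score = 0.0
    for v1 in as1:
        if v1 in as2:
            # Shared verb: compare arguments directly.
            score += toy_args_sim(as1[v1], as2[v1])
        else:
            # Fall back to the most similar verb on the other side,
            # weighted by how similar the two verbs are.
            v2 = nlargest(1, as2, key=lambda v: toy_word_sim(v1, v))[0]
            score += toy_word_sim(v1, v2) * toy_args_sim(as1[v1], as2[v2])
    return score / float(n_verbs)

frames_a = {'buy': {'A0': 'yucaipa', 'A1': 'chain'}}
frames_b = {'purchase': {'A0': 'yucaipa', 'A1': 'chain'}}
frames_c = {'buy': {'A0': 'yucaipa', 'A1': 'chain'},
            'sell': {'A0': 'yucaipa', 'A1': 'chain'}}

print(toy_argstruct_sim(frames_a, frames_a))  # same verb, same arguments
print(toy_argstruct_sim(frames_a, frames_b))  # near-synonym verb, same arguments
print(toy_argstruct_sim(frames_a, frames_c))  # an extra verb on one side
```

Note that a substituted verb is penalized twice: once by the verb similarity and once because it adds to the count of distinct verbs in the denominator.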
    

In [394]:
s1, s2 = df_train.ix[0]['#1 String'], df_train.ix[0]['#2 String']
s3 = df_train.ix[1]['#1 String']
parsed_s1_lm = ' '.join([token.lemma_ for token in parser(s1)])
parsed_s2_lm = ' '.join([token.lemma_ for token in parser(s2)])
parsed_s3_lm = ' '.join([token.lemma_ for token in parser(s3)])

In [395]:
argstruct_s1 = get_argstruct(parsed_s1_lm)
argstruct_s2 = get_argstruct(parsed_s2_lm)
argstruct_s3 = get_argstruct(parsed_s3_lm)

In [396]:
print argstruct_s1; print
print argstruct_s2

defaultdict(<type 'dict'>, {'call': {'A1': u'witness', 'A0': u'he', 'OTHER': u'whom'}, 'distort': {'A1': u'evidence', 'A0': u'brother', 'OTHER': u'deliberately'}, 'accuse': {'A1': u'brother', 'A0': u'amrozi', 'A2': u'distort'}})

defaultdict(<type 'dict'>, {'distort': {'A1': u'evidence', 'A0': u'brother', 'OTHER': u'deliberately'}, 'accuse': {'A1': u'brother', 'A0': u'amrozi', 'OTHER': u'refer', 'A2': u'distort'}, 'refer': {'A1': u'to', 'A0': u'amrozi', 'A2': u'as'}})


In [397]:
argstruct_sim(argstruct_s1,argstruct_s2)

0.28142923286288285

In [398]:
argstruct_sim(argstruct_s1,argstruct_s3)

0.0078325380278391928

In [421]:
def argsim_test1(idx):
    as1 = get_argstruct(' '.join(X_train[idx]['s1_lm']).encode('utf8'))
    as2 = get_argstruct(' '.join(X_train[idx]['s2_lm']).encode('utf8'))
    sim = argstruct_sim(as1,as2)
    label = Y_train[idx]
    print 'idx = %d | argsim = %.6f | label = %s' % (idx,sim,label)
    return sim, label
    

In [443]:
def argsim_test2(X,Y):
    sims_0, sims_1 = [], []
    sims, labels = [], []
    for i,x in X.iteritems():
        as1 = get_argstruct(' '.join(x['s1_lm']).encode('utf8'))
        as2 = get_argstruct(' '.join(x['s2_lm']).encode('utf8'))
        sim = argstruct_sim(as1, as2)
        label = Y[i]
        sims.append(sim)
        labels.append(label)
        if label==0:
            sims_0.append(sim)
        else:
            sims_1.append(sim)
        if i!=0 and i%100==0:
            print "[processed: %d] avg. 0 sim: %.6f | avg. 1 sim: %.6f" % (i,np.mean(sims_0),np.mean(sims_1))
    return sims, labels

In [428]:
# sims, labels = [], []
# for i in xrange(100):
#     sim, label = argsim_test1(i)
#     sims.append(sim)
#     labels.append(label) 

In [444]:
%%time
sims, labels = argsim_test2(X_train, Y_train)

[processed: 100] avg. 0 sim: 0.087554 | avg. 1 sim: 0.194288
[processed: 200] avg. 0 sim: 0.090766 | avg. 1 sim: 0.213298
[processed: 300] avg. 0 sim: 0.102342 | avg. 1 sim: 0.222098
[processed: 400] avg. 0 sim: 0.111044 | avg. 1 sim: 0.225719
[processed: 500] avg. 0 sim: 0.110492 | avg. 1 sim: 0.235329
[processed: 600] avg. 0 sim: 0.107436 | avg. 1 sim: 0.237669
[processed: 700] avg. 0 sim: 0.105167 | avg. 1 sim: 0.237473
[processed: 800] avg. 0 sim: 0.104041 | avg. 1 sim: 0.236566
[processed: 900] avg. 0 sim: 0.103174 | avg. 1 sim: 0.233549
[processed: 1000] avg. 0 sim: 0.105833 | avg. 1 sim: 0.235653
[processed: 1100] avg. 0 sim: 0.116416 | avg. 1 sim: 0.236362
[processed: 1200] avg. 0 sim: 0.112117 | avg. 1 sim: 0.233840
[processed: 1300] avg. 0 sim: 0.119859 | avg. 1 sim: 0.230562
[processed: 1400] avg. 0 sim: 0.121389 | avg. 1 sim: 0.234079
[processed: 1500] avg. 0 sim: 0.120173 | avg. 1 sim: 0.233276
[processed: 1600] avg. 0 sim: 0.118393 | avg. 1 sim: 0.234271
[processed: 1700]

In [445]:
%%time
sims_test, labels_test = argsim_test2(X_test, Y_test)

[processed: 100] avg. 0 sim: 0.094453 | avg. 1 sim: 0.184534
[processed: 200] avg. 0 sim: 0.109066 | avg. 1 sim: 0.212809
[processed: 300] avg. 0 sim: 0.106324 | avg. 1 sim: 0.215218
[processed: 400] avg. 0 sim: 0.104338 | avg. 1 sim: 0.224117
[processed: 500] avg. 0 sim: 0.109962 | avg. 1 sim: 0.221442
[processed: 600] avg. 0 sim: 0.117865 | avg. 1 sim: 0.224665
[processed: 700] avg. 0 sim: 0.126886 | avg. 1 sim: 0.232550
[processed: 800] avg. 0 sim: 0.125812 | avg. 1 sim: 0.235065
[processed: 900] avg. 0 sim: 0.126797 | avg. 1 sim: 0.235669
[processed: 1000] avg. 0 sim: 0.122362 | avg. 1 sim: 0.236606
[processed: 1100] avg. 0 sim: 0.123071 | avg. 1 sim: 0.238742
[processed: 1200] avg. 0 sim: 0.126195 | avg. 1 sim: 0.239168
[processed: 1300] avg. 0 sim: 0.126372 | avg. 1 sim: 0.237842
[processed: 1400] avg. 0 sim: 0.127697 | avg. 1 sim: 0.235775
[processed: 1500] avg. 0 sim: 0.123811 | avg. 1 sim: 0.237493
[processed: 1600] avg. 0 sim: 0.124741 | avg. 1 sim: 0.235304
[processed: 1700]

In [446]:
# Wrap each score in its own list: the classifiers below expect a 2-D feature matrix (n_samples x n_features)
sims = [[sim] for sim in sims]
sims_test = [[sim] for sim in sims_test]

In [452]:
import cPickle

In [453]:
data_path = "/Users/jacobsw/Desktop/WORK/OJO_CODE/SENTENCE_SIMILARITIES/DATA/"

In [454]:
# SAVE
# with open(data_path+'srl_sim.p','wb') as f:
#     cPickle.dump((sims,labels,sims_test,labels_test), f)
# LOAD 
# with open(data_path+'srl_sim.p','rb') as f:
#     sims,labels,sims_test,labels_test = cPickle.load(f)

##### Eval 1: Pearson's Correlation

In [447]:
from scipy.stats import pearsonr

In [451]:
print "train pearson: %.6f (p = %.6f)" % pearsonr([sim for [sim] in sims],labels) 
print "test pearson: %.6f (p = %.6f)" % pearsonr([sim for [sim] in sims_test],labels_test) 

train pearson: 0.227632 (p = 0.000000)
test pearson: 0.234784 (p = 0.000000)
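
Since the labels are 0/1, Pearson's r here is the point-biserial correlation between the similarity score and the paraphrase label, and a p-value printed as `0.000000` only means it is below the six printed decimals. The sketch below computes r by hand on invented data to make the quantity concrete; `toy_sims` and `toy_labels` are made up and are not drawn from the MSR corpus.

```python
import math

def pearson_r(xs, ys):
    # Plain Pearson correlation; undefined (division by zero) if either
    # input has zero variance.
    n = float(len(xs))
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Invented scores and binary labels, for illustration only.
toy_sims = [0.05, 0.10, 0.30, 0.20, 0.25, 0.15]
toy_labels = [0, 0, 1, 1, 1, 0]

r = pearson_r(toy_sims, toy_labels)
print("toy pearson: %.6f" % r)
```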


##### Eval 2: Classification

In [455]:
lr_srl = LogisticRegression()

In [456]:
lr_srl.fit(sims,labels)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [458]:
evaluate(sims, labels, lr_srl)

Accuracy: 0.675417

             precision    recall  f1-score   support

          0       0.00      0.00      0.00      1323
          1       0.68      1.00      0.81      2753

avg / total       0.46      0.68      0.54      4076



In [459]:
evaluate(sims_test, labels_test, lr_srl)

Accuracy: 0.664928

             precision    recall  f1-score   support

          0       0.00      0.00      0.00       578
          1       0.66      1.00      0.80      1147

avg / total       0.44      0.66      0.53      1725



In [460]:
svc_srl = svm.SVC(kernel='rbf')

In [461]:
svc_srl.fit(sims, labels)

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

In [462]:
evaluate(sims, labels, svc_srl)

Accuracy: 0.675417

             precision    recall  f1-score   support

          0       0.00      0.00      0.00      1323
          1       0.68      1.00      0.81      2753

avg / total       0.46      0.68      0.54      4076



In [463]:
evaluate(sims_test, labels_test, svc_srl)

Accuracy: 0.664928

             precision    recall  f1-score   support

          0       0.00      0.00      0.00       578
          1       0.66      1.00      0.80      1147

avg / total       0.44      0.66      0.53      1725
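
In all four reports above, class 0 has precision and recall of 0.00 and class 1 has recall of 1.00, so both classifiers predict "paraphrase" for every pair. The sketch below checks this against the accuracy of always predicting the most frequent label, using the support counts printed in those reports; `majority_baseline` is a helper introduced here for illustration.

```python
def majority_baseline(support):
    # support maps label -> number of examples; returns the accuracy of
    # always predicting the most frequent label.
    total = float(sum(support.values()))
    return max(support.values()) / total

train_support = {0: 1323, 1: 2753}  # support column, train reports
test_support = {0: 578, 1: 1147}    # support column, test reports

train_acc = majority_baseline(train_support)
test_acc = majority_baseline(test_support)
print("train majority baseline: %.6f" % train_acc)  # 0.675417
print("test majority baseline: %.6f" % test_acc)    # 0.664928
```

These match the reported accuracies exactly, so the single SRL feature on its own does not move either classifier off the majority class, despite the weak positive correlation measured in Eval 1.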



##### Eval 3: OJO Sents

In [476]:
def most_sim_srl(q, k=5):
    
    q = ' '.join(to_lemmas(drop_mark(q)))
    q_as = get_argstruct(q) # annotate the query once rather than once per candidate
    sents = nlargest(k, ojo_sents, key=lambda s: argstruct_sim(q_as,get_argstruct(' '.join(s))))
    
    for i,sent in enumerate(sents):
        print "Sim Rank: %d | Sent: %s" % (i+1,sent)
        

In [477]:
%%time
most_sim_srl(q1)

Sim Rank: 1 | Sent: [u'show', u'me', u'the', u'impact', u'of', u'flight', u'pattern']
Sim Rank: 2 | Sent: [u'what', u'be', u'the', u'average', u'demographic', u'of', u'this', u'neighborhood', u'?']
Sim Rank: 3 | Sent: [u'what', u'be', u'the', u'estimate', u'time', u'to', u'sell', u'my', u'home', u'right', u'now']
Sim Rank: 4 | Sent: [u'what', u'confidence', u'level', u'do', u'ojo', u'have', u'that', u'i', u'should', u'list', u'my', u'home', u'now']
Sim Rank: 5 | Sent: [u'how', u'fast', u'will', u'my', u'home', u'sell']
CPU times: user 118 ms, sys: 716 ms, total: 834 ms
Wall time: 34.1 s


##### Eval 4: Wan + Li + ArgSim

In [484]:
argsims = (sims, labels, sims_test, labels_test)

In [491]:
def featurize_plusplus(wan_fts, li_fts, argsims):
    """Concatenate the Wan, Li and ArgSim feature lists per example; labels are taken from li_fts."""
    X_train_wan, Y_train_wan, X_test_wan, Y_test_wan = wan_fts
    X_train_li, Y_train_li, X_test_li, Y_test_li = li_fts
    X_train_argsim, Y_train_argsim, X_test_argsim, Y_test_argsim = argsims
    
    X_train_fts, Y_train_fts = [], []
    X_test_fts, Y_test_fts = [], []
    
    for i,(x_wan,x_li,x_argsim) in enumerate(zip(X_train_wan,X_train_li,X_train_argsim)):
        X_train_fts.append(x_wan+x_li+x_argsim) 
        Y_train_fts.append(Y_train_li[i])
    for i,(x_wan,x_li,x_argsim) in enumerate(zip(X_test_wan,X_test_li,X_test_argsim)):
        X_test_fts.append(x_wan+x_li+x_argsim)
        Y_test_fts.append(Y_test_li[i])
        
    return X_train_fts, Y_train_fts, X_test_fts, Y_test_fts
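
The core of `featurize_plusplus` is a per-example concatenation of feature lists. The sketch below shows that step on toy data; `concat_features` is a hypothetical helper, the feature values and dimensions are invented, and the length check is an addition, since `zip` silently truncates to the shortest input.

```python
def concat_features(*feature_sets):
    # Each feature set is a list of per-example feature lists,
    # all in the same example order.
    lengths = set(len(fs) for fs in feature_sets)
    if len(lengths) != 1:
        raise ValueError("feature sets cover different numbers of examples")
    # sum(rows, []) joins the per-example lists end to end.
    return [sum(rows, []) for rows in zip(*feature_sets)]

# Toy inputs: two examples, with 2 + 1 + 1 features each (dimensions invented).
toy_wan = [[0.1, 0.2], [0.3, 0.4]]
toy_li = [[0.5], [0.6]]
toy_argsim = [[0.7], [0.8]]

toy_X = concat_features(toy_wan, toy_li, toy_argsim)
print(toy_X)
```

This is also why the SRL scores were wrapped into single-element lists earlier: list concatenation with `+` needs every per-example entry to be a list.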


In [492]:
X_train_fts, Y_train_fts, X_test_fts, Y_test_fts = featurize_plusplus(wan_fts, li_fts, argsims)

In [494]:
lr_all = LogisticRegression()
lr_all.fit(X_train_fts, Y_train_fts)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [496]:
evaluate(X_train_fts, Y_train_fts, lr_all)

Accuracy: 0.751227

             precision    recall  f1-score   support

          0       0.65      0.50      0.56      1323
          1       0.78      0.87      0.83      2753

avg / total       0.74      0.75      0.74      4076



In [497]:
evaluate(X_test_fts, Y_test_fts, lr_all)

Accuracy: 0.729275

             precision    recall  f1-score   support

          0       0.62      0.50      0.55       578
          1       0.77      0.84      0.81      1147

avg / total       0.72      0.73      0.72      1725



In [501]:
svm_linear_all = svm.SVC(kernel='linear')
svm_rbf_all = svm.SVC(kernel='rbf')

In [502]:
%%time
svm_linear_all.fit(X_train_fts, Y_train_fts)

CPU times: user 12.5 s, sys: 19.5 ms, total: 12.5 s
Wall time: 12.5 s


SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='linear',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

In [504]:
evaluate(X_test_fts, Y_test_fts, svm_linear_all)

Accuracy: 0.735072

             precision    recall  f1-score   support

          0       0.64      0.47      0.55       578
          1       0.77      0.87      0.81      1147

avg / total       0.72      0.74      0.72      1725



In [503]:
%%time
svm_rbf_all.fit(X_train_fts, Y_train_fts)

CPU times: user 1.05 s, sys: 13.8 ms, total: 1.07 s
Wall time: 1.06 s


SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

In [505]:
evaluate(X_test_fts, Y_test_fts, svm_rbf_all)

Accuracy: 0.695072

             precision    recall  f1-score   support

          0       0.59      0.31      0.40       578
          1       0.72      0.89      0.80      1147

avg / total       0.67      0.70      0.66      1725

