### 1. Combine effects calculated by KNN, VT, CTF, CSF into same file
- 1.1 Check if KNN, VT, CTF, CSF are in same order for each dataset. If not in same order, then the result is wrong. If in same order, record index for each word pair. <br>
- 1.2 Map four effects from four files into one file. Process by tuple (one wordpair, one sentence, knn_effect, vt_effect, ctf_effect, csf_effect) <br>

### 2. Limit treatment effect file according to different criteria
- 2.1 Limit treatment effect file to a given set of word pairs
> 2.1.1 Limit data to the vocabulary obtained from POS-tag intersection > 0 (word pairs that have at least one common POS tag)<br>
> 2.1.2 Limit data to the vocabulary obtained from most_common(1) POS-tag matching (word pairs that share the same most common POS tag)<br>
- 2.2 Limit treatment effect file to bigram match strategy <br>
> 2.2.1 Check if (left_word target_word) or (target_word right_word) is in the bigram file <br>
> 2.2.2 Further check that n_bigram > a given value <br>
> 2.2.3 Check whether a stricter threshold produces more reasonable sentences <br>
- 2.3 Limit treatment effect file by case sensitivity and sentence length<br>
- 2.4 Extract word pairs from a treatment effect file. <br>

### 3. Select topn word pairs and sentences
- 3.1 Get top-n word pairs <br>
- 3.2 For each word pair, select 3 sentences with max, min and median treatment effects according to each method.<br>
- 3.3 For each word pair, sort all sentences, select 1+10 sentences. <br>

### 4. Generate paraphrase substitution sentences for selected word pairs and sentences

### 5. Different evaluation methods
- 5.1 Rank all word pairs according to different methods. <br>
- 5.2 Get average treatment effect for each word pair. <br>
- 5.3 Fetch information for word pairs with coef change and frequency in opposite class. <br>
- 5.4 Get top-10 and bottom-10 word pairs. <br>
- 5.5 Find treatment words for a sentence. <br>
- 5.6 Spearman rank correlation for sentences and word pairs. <br>
- 5.7 Percentage of negative instances among topn high treatment sentences. <br>

### 6. Revise previous results
- 6.1 Assign new treatment effects to existing AMT-labeled Airbnb files (post-processing)


In [2]:
import pickle, re, random, math
import numpy as np
import pandas as pd
import ast,time
from collections import Counter
import warnings
from nltk import pos_tag
from nltk.tokenize import sent_tokenize, word_tokenize
from sklearn.feature_extraction.text import CountVectorizer
from scipy.stats import spearmanr
warnings.filterwarnings("ignore")
import matplotlib.pyplot as plt
%matplotlib inline

In [3]:
tw_path = '/data/2/zwang/2018_S_WordTreatment/V2_twitter/'
yp_path = '/data/2/zwang/2018_S_WordTreatment/V2_yelp/'
airbnb_path = '/data/2/zwang/2018_S_WordTreatment/V2_airbnb/'

### 1. Combine effects calculated by KNN, VT, CTF, CSF into same file
- 1.1 Check if KNN, VT, CTF, CSF are in same order for each dataset
> If not in same order, then the result is wrong. <br>
> If in same order, record index for each word pair. <br>
- 1.2 Map four effects from four files into one file
> Process by tuple (one wordpair, one sentence, knn_effect, vt_effect, ctf_effect, csf_effect)
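
A minimal sketch of the order check in 1.1, assuming each effect file has been loaded into a DataFrame with `source` and `target` columns. The helper name `pairs_in_same_order` and the toy frames are hypothetical; the sketch compares only the word-pair order, not the sentence order inside each pair.

```python
import pandas as pd

def pairs_in_same_order(frames):
    """Return True if every frame lists the same (source, target) pairs in the same row order."""
    reference = list(zip(frames[0]["source"], frames[0]["target"]))
    return all(list(zip(f["source"], f["target"])) == reference for f in frames[1:])

# Toy stand-ins for two of the four effect files (hypothetical data).
knn_demo = pd.DataFrame({"source": ["nice", "stores"], "target": ["gorgeous", "shops"]})
vt_demo = pd.DataFrame({"source": ["nice", "stores"], "target": ["gorgeous", "shops"]})
vt_shuffled = vt_demo.iloc[::-1].reset_index(drop=True)

same = pairs_in_same_order([knn_demo, vt_demo])           # identical order
different = pairs_in_same_order([knn_demo, vt_shuffled])  # reversed order
```

If the check fails, positional alignment of effects is unsafe and the files would need to be matched by word pair instead.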

#### 1.2 Map four effects for airbnb, yelp, twitter separately

In [3]:
this_path = airbnb_path
prefix = 'airbnb'

In [4]:
knn_file = this_path+'3_KNN/'+prefix+'_knn_30_treatment_limitvocab.csv'
vt_file = this_path+'3_VirtualTwins/'+prefix+'_vt_200tree_treatment_limitvocab.csv'
ctf_file = this_path+'3_CounterFactual/'+prefix+'_ctf_200tree_treatment_limitvocab.csv'
csf_file = this_path+'3_CausalForest/'+prefix+'_csf_200tree_treatment_limitvocab.csv'

In [5]:
def get_effect_forpair(row):
    """
    Parameters: 
        one word pair with all sentences.
    Function:
        Extract treatment effect values for all sentences of one word pair.
        Sentence order: src_pos, src_neg, tar_pos, tar_neg. 
        A sub-function called by combine_4effects.
    Return: 
        a list of effect values for all sentences.
    """
    effect = []
    src_pos_effects = ast.literal_eval(row['source_pos_sents_treatment'].values[0])[0]
    for i in range(len(src_pos_effects)):
        effect.append(src_pos_effects[i][1])

    src_neg_effects = ast.literal_eval(row['source_neg_sents_treatment'].values[0])[0]
    for j in range(len(src_neg_effects)):
        effect.append(src_neg_effects[j][1])
        
    tar_pos_effects = ast.literal_eval(row['target_pos_sents_treatment'].values[0])[0]
    for m in range(len(tar_pos_effects)):
        effect.append(tar_pos_effects[m][1])
        
    tar_neg_effects = ast.literal_eval(row['target_neg_sents_treatment'].values[0])[0]
    for n in range(len(tar_neg_effects)):
        effect.append(tar_neg_effects[n][1])
        
    return effect
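
To illustrate the parsing step, here is a small sketch of how `ast.literal_eval` recovers effect values from one stringified cell. The cell layout (an outer sequence whose first element is a list of `(sentence, effect)` tuples) is an assumption inferred from the `[0]` and `[i][1]` indexing above; the sample string is made up.

```python
import ast

# Hypothetical cell value, shaped the way the indexing in get_effect_forpair implies.
cell = "[[('Great place to stay.', 0.25), ('The room was nice.', -0.1)]]"

# First element of the parsed literal: list of (sentence, effect) tuples.
pairs = ast.literal_eval(cell)[0]

# Keep only the effect values, in sentence order.
effects = [effect for _sentence, effect in pairs]
```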

In [7]:
def combine_4effects(knn_file,vt_file,ctf_file,csf_file,all_effect_file):
    """
    Parameters: 
        4 treatment effect files and one result file to store all treatment effects
    Function:
        Iterate over all word pairs:
            Combine 4 effects for every sentence and make a dict for every sentence
        Concatenate all dicts and pickle to file        
    Return: 
        A pickle file with following fields:
        ['id','source','target','sentence','knn_effect','vt_effect','ctf_effect','csf_effect','true_y']
    """
    knn_pd = pd.read_csv(knn_file)
    vt_pd = pd.read_csv(vt_file)
    ctf_pd = pd.read_csv(ctf_file)
    csf_pd = pd.read_csv(csf_file)
    all_sents_effect = []
    g_idx = -1
    for idx,row in knn_pd.iterrows():
        if(idx % 100 == 0):
            print(idx)
        vt_row = vt_pd[(vt_pd.source==row.source) & (vt_pd.target==row.target)]
        vt_effect = get_effect_forpair(vt_row)
        ctf_row = ctf_pd[(ctf_pd.source==row.source) & (ctf_pd.target==row.target)]
        ctf_effect = get_effect_forpair(ctf_row)
        csf_row = csf_pd[(csf_pd.source==row.source) & (csf_pd.target==row.target)]
        csf_effect = get_effect_forpair(csf_row)
        
        local_idx = -1
        knn_src_pos_effects = ast.literal_eval(row['source_pos_sents_treatment'])[0]
        for i in range(len(knn_src_pos_effects)):
            g_idx += 1
            local_idx += 1
            all_sents_effect.append({'id':g_idx,'source':row.source,'target':row.target,'sentence':knn_src_pos_effects[i][0],
                                     'knn_effect':knn_src_pos_effects[i][1],'vt_effect':vt_effect[local_idx],
                                     'ctf_effect':ctf_effect[local_idx],'csf_effect':csf_effect[local_idx],'true_y':1})
        
        knn_src_neg_effects = ast.literal_eval(row['source_neg_sents_treatment'])[0]
        for j in range(len(knn_src_neg_effects)):
            g_idx += 1
            local_idx += 1
            all_sents_effect.append({'id':g_idx,'source':row.source,'target':row.target,'sentence':knn_src_neg_effects[j][0],
                                     'knn_effect':knn_src_neg_effects[j][1],'vt_effect':vt_effect[local_idx],
                                     'ctf_effect':ctf_effect[local_idx],'csf_effect':csf_effect[local_idx],'true_y':0})
        
        knn_tar_pos_effects = ast.literal_eval(row['target_pos_sents_treatment'])[0]
        for m in range(len(knn_tar_pos_effects)):
            g_idx += 1
            local_idx += 1
            all_sents_effect.append({'id':g_idx,'source':row.target,'target':row.source,'sentence':knn_tar_pos_effects[m][0],
                                     'knn_effect':knn_tar_pos_effects[m][1],'vt_effect':vt_effect[local_idx],
                                     'ctf_effect':ctf_effect[local_idx],'csf_effect':csf_effect[local_idx],'true_y':1})
        
        knn_tar_neg_effects = ast.literal_eval(row['target_neg_sents_treatment'])[0]
        for n in range(len(knn_tar_neg_effects)):
            g_idx += 1
            local_idx += 1
            all_sents_effect.append({'id':g_idx,'source':row.target,'target':row.source,'sentence':knn_tar_neg_effects[n][0],
                                     'knn_effect':knn_tar_neg_effects[n][1],'vt_effect':vt_effect[local_idx],
                                     'ctf_effect':ctf_effect[local_idx],'csf_effect':csf_effect[local_idx],'true_y':0})
    
    pickle.dump(pd.DataFrame(all_sents_effect), open(all_effect_file,'wb'))     

In [8]:
all_effect_file = this_path+'5_Select/'+prefix+'_wdpair_sents_4effects_limitvocab.pickle'
combine_4effects(knn_file,vt_file,ctf_file,csf_file,all_effect_file)

0
100
200
300
400
500
600
700
800


#### 1.1 Check if the combination is correct
- Double-check that the function matched all 4 effects to sentences correctly
- Randomly pick a sentence, check the effects in combined file and 4 separate effect files.
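
One way to make this spot check repeatable is a small lookup helper, sketched below on a toy frame. The helper name `effects_for_sentence` and the demo data are hypothetical; the column names follow the fields written by `combine_4effects`.

```python
import pandas as pd

def effects_for_sentence(combined, source, target, sentence):
    """Return the four effect values stored for one (source, target, sentence) row."""
    mask = (
        (combined["source"] == source)
        & (combined["target"] == target)
        & (combined["sentence"] == sentence)
    )
    cols = ["knn_effect", "vt_effect", "ctf_effect", "csf_effect"]
    return combined.loc[mask, cols].iloc[0].to_dict()

# Toy combined frame with a single sentence (made-up values).
combined_demo = pd.DataFrame([
    {"id": 0, "source": "nice", "target": "gorgeous", "sentence": "A nice flat.",
     "knn_effect": 0.1, "vt_effect": 0.2, "ctf_effect": 0.3, "csf_effect": 0.4, "true_y": 1},
])
found = effects_for_sentence(combined_demo, "nice", "gorgeous", "A nice flat.")
```

The returned values can then be compared by hand against the same sentence in each of the four separate effect files.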

In [9]:
test_pd = pickle.load(open(airbnb_path+'5_Select/airbnb_wdpair_sents_4effects_limitvocab.pickle','rb'))
test_pd.shape

(1562060, 9)

In [43]:
mys = "As a predominantly Catholic port city, New Orleans provided community and job security for Irish immigrants fleeing famine in the early to mid 1800s."
test_pd[(test_pd.source == 'predominantly') & (test_pd.target == 'mostly')].loc[152163]

csf_effect                                0.18227
ctf_effect                                0.20484
id                                         152163
knn_effect                                   -0.1
sentence      As a predominantly Catholic port ci
source                              predominantly
target                                     mostly
true_y                                          1
vt_effect                                 0.02126
Name: 152163, dtype: object

### 2. Limit treatment effect file according to different criteria
- 2.1 Limit treatment effect file to a given set of word pairs <br>
> 2.1.1 Limit data to the vocabulary obtained from POS-tag intersection > 0 (word pairs that have at least one common POS tag)<br>
> 2.1.2 Limit data to the vocabulary obtained from most_common(1) POS-tag matching (word pairs that share the same most common POS tag)<br>

- 2.2 Limit treatment effect file to bigram match strategy <br>
> 2.2.1 Check if (left_word target_word) or (target_word right_word) is in the bigram file <br>
> 2.2.2 Further check that n_bigram > a given value <br>
> 2.2.3 Check whether a stricter threshold produces more reasonable sentences <br>

- 2.3 Limit treatment effect file by case sensitivity and sentence length<br>
- 2.4 Extract word pairs from a treatment effect file. <br>

#### 2.1 Limit treatment effect file to a given set of word pairs
- Used when we only care about treatment effects for a specific set of words instead of all word pairs

In [45]:
def limit_by_wdpair(wdpair_file,effect_file,result_file):
    """
    Parameters:
        File for specific word pairs, the whole effect file, effect file for limited word pairs
    Function:
        For the set of specific word pairs:
            Map to find all sentences with treatment effects, bidirectional (source,target), (target,source)
    Return:
        Pickle file for specific word pairs with following fields:
        ['source','target','src_sentence','tar_sentence','knn_effect','vt_effect','ctf_effect','csf_effect',
        'true_y','id','src_ratings','src_ratings_avg','tar_ratings','tar_ratings_avg','amt_effect']
        
    """
    effect_pd = pd.DataFrame(pickle.load(open(effect_file,'rb')))
    wdpair_pd = pd.read_csv(wdpair_file)
    remain_id = []
    for idx,row in wdpair_pd.iterrows():
        if(idx % 100 == 0):
            print(idx)
        remain_id.extend(effect_pd[(effect_pd.source == row.source) & (effect_pd.target == row.target)].id.values)
        remain_id.extend(effect_pd[(effect_pd.source == row.target) & (effect_pd.target == row.source)].id.values)
        
    remain_effect_pd = effect_pd[effect_pd['id'].isin(remain_id)]
    pickle.dump(remain_effect_pd,open(result_file,'wb'))
    print("%d word pairs with %d sentences remained" % (wdpair_pd.shape[0],remain_effect_pd.shape[0]))
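
The row-by-row loop above can be slow on large effect files. A possible set-based alternative is sketched below, assuming the effect data and word pairs are already loaded as DataFrames; `limit_by_wdpair_vectorized` and the demo frames are hypothetical names, not part of the original pipeline.

```python
import pandas as pd

def limit_by_wdpair_vectorized(effect_pd, wdpair_pd):
    """Keep rows whose (source, target) matches a listed pair in either direction."""
    forward = set(zip(wdpair_pd["source"], wdpair_pd["target"]))
    wanted = forward | {(t, s) for s, t in forward}  # add the reversed direction
    mask = [key in wanted for key in zip(effect_pd["source"], effect_pd["target"])]
    return effect_pd[mask]

# Toy data: two directions of one listed pair, plus one unlisted pair.
effect_demo = pd.DataFrame({
    "id": [0, 1, 2],
    "source": ["nice", "gorgeous", "end"],
    "target": ["gorgeous", "nice", "rest"],
})
pairs_demo = pd.DataFrame({"source": ["nice"], "target": ["gorgeous"]})
kept = limit_by_wdpair_vectorized(effect_demo, pairs_demo)
```

Set membership avoids rescanning the full effect frame once per word pair.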

> 2.1.1 Limit data to the vocabulary obtained from POS-tag intersection > 0 (word pairs that have at least one common POS tag)

In [47]:
wdpair_file = airbnb_path+'1_Process/airbnb_treat_pairs_posinters.csv'
effect_file = airbnb_path+'5_Select/airbnb_wdpair_sents_4effects_limitvocab.pickle'
result_file = airbnb_path+'5_Select/airbnb_wdpair_sents_4effects_posinters_limitvocab.pickle'
limit_by_wdpair(wdpair_file,effect_file,result_file)

0
100
200
300
400
500
600
700
750 word pairs with 1377527 sentences remained


> 2.1.2 Limit data to the vocabulary obtained from most_common(1) POS-tag matching (word pairs that share the same most common POS tag)

In [48]:
wdpair_file = airbnb_path+'1_Process/airbnb_treat_pairs_posinters_poscom1.csv'
effect_file = airbnb_path+'5_Select/airbnb_wdpair_sents_4effects_posinters_limitvocab.pickle'
result_file = airbnb_path+'5_Select/airbnb_wdpair_sents_4effects_posinters_poscom1_limitvocab.pickle'
limit_by_wdpair(wdpair_file,effect_file,result_file)

0
100
200
300
400
500
561 word pairs with 1000502 sentences remained


#### 2.2 Limit treatment effect file to bigram match strategy

- 2.2.1 Check if (left_word target_word) or (target_word right_word) is in the bigram file <br>

In [558]:
wdpair_sentence_file = airbnb_path+'5_Select/airbnb_wdpair_sents_4effects_posinters_poscom1.pickle'
wdpair_effect_pd = pd.DataFrame(pickle.load(open(wdpair_sentence_file,'rb')))
result_file = yp_path+'5_Select/yp_wdpair_sents_4effects_posinters_poscom1_bigramcheck.pickle'
big_effect_pd = pd.DataFrame(pickle.load(open(result_file,'rb')))

In [559]:
st_pd = wdpair_effect_pd.sort_values(by=['knn_effect'],ascending=False)

In [44]:
#st_pd.iloc[:10].sentence.values

In [172]:
mystr = "Long story short , my boyfriend loved and was impressed with the chocolate ."
#wdpair_effect_pd[(wdpair_effect_pd.source=='boyfriend')]
wdpair_effect_pd[(wdpair_effect_pd.source=='boyfriend') & (wdpair_effect_pd.target=='buddy') & ((wdpair_effect_pd.sentence==mystr))]

|  | csf_effect | ctf_effect | id | knn_effect | sentence | source | target | true_y | vt_effect |
|---|---|---|---|---|---|---|---|---|---|
| 1650086 | 0.70315 | 0.68375 | 1650086 | 0.96667 | Long story short , my boyfriend loved and was ... | boyfriend | buddy | 1 | 0.33178 |


In [176]:
big_effect_pd[(big_effect_pd.source=='boyfriend') & (big_effect_pd.target=='buddy')]

|  | csf_effect | ctf_effect | id | knn_effect | sentence | source | target | true_y | vt_effect |
|---|---|---|---|---|---|---|---|---|---|


In [49]:
def check_bigram_match(effect_file,wdpair_file,vocab_bigram_file,result_file):
    """
    Parameters:
        Whole effect file, a specific set of word pairs, bigram vocabulary file, result file for word pairs and sentences that pass the bigram match
    Function:
        Iterate over each word pair and each sentence:
            Check if the bigram exists after target word substitution:
            left_word control_word right_word --> left_word treatment_word right_word
            Keep only sentences where either (left_word treatment_word) or (treatment_word right_word) exists in the treatment word's observed bigrams.
        Print the number of remaining word pairs and sentences, and write the remaining sentences to the result file.
    Return:
        Pickle file that stores the word pairs and sentences that passed the bigram match
    """
    
    wdpair_effect_pd = pd.DataFrame(pickle.load(open(effect_file,'rb')))
    wdpair_pd = pd.read_csv(wdpair_file)
    vocab_bigram_dfdict = pickle.load(open(vocab_bigram_file,'rb'))
    remain_id = []
    
    for idx,row in wdpair_effect_pd.iterrows():
        if(idx % 100000 ==0):
            print(idx)
        # check if this word pair still exists in the limited vocabulary, this is optional
        src_tar_pd = wdpair_pd[(wdpair_pd.source==row.source) & (wdpair_pd.target==row.target)]
        tar_src_pd = wdpair_pd[(wdpair_pd.source==row.target) & (wdpair_pd.target==row.source)]
        if((src_tar_pd.shape[0]>0) or (tar_src_pd.shape[0]>0)):
            word_list = re.findall('[a-z]+',row.sentence.lower())
            if(row.source in word_list):
                srci = word_list.index(row.source)
                tar_bigram_l = ''
                tar_bigram_r = ''
                if(srci > 0):
                    tar_bigram_l = word_list[srci-1]+' '+row.target
                if(srci < len(word_list)-1):
                    tar_bigram_r = row.target+' '+word_list[srci+1]

                flag = False
                if(tar_bigram_l in vocab_bigram_dfdict[row.target]):
                    flag = True
                elif(tar_bigram_r in vocab_bigram_dfdict[row.target]):
                    flag = True

                if(flag):
                    remain_id.append(row.id)

    remain_effect_pd = wdpair_effect_pd[wdpair_effect_pd['id'].isin(remain_id)]
    pickle.dump(remain_effect_pd,open(result_file,'wb'))
    print("%d word pairs remained out of %d in total" % (len(remain_effect_pd.groupby(["source", "target"]).size()),wdpair_pd.shape[0]))
    print("%d sentences remained out of %d in total" % (len(remain_id),wdpair_effect_pd.shape[0]))
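
The core substitution test can be shown on a single sentence. This is a minimal sketch, assuming the bigram vocabulary for one target word behaves like a `Counter` keyed by `"left right"` strings; `passes_bigram_check` and the counts are hypothetical.

```python
import re
from collections import Counter

def passes_bigram_check(sentence, source, target, target_bigrams):
    """True if swapping source for target yields a left or right bigram seen for target."""
    words = re.findall(r"[a-z]+", sentence.lower())
    if source not in words:
        return False
    i = words.index(source)
    # Build the bigrams that would exist after substitution, where neighbours exist.
    left = words[i - 1] + " " + target if i > 0 else None
    right = target + " " + words[i + 1] if i < len(words) - 1 else None
    return any(b is not None and b in target_bigrams for b in (left, right))

# Made-up bigram counts for the target word "buddy".
buddy_bigrams = Counter({"my buddy": 12, "buddy and": 3})
ok = passes_bigram_check("Long story short , my boyfriend loved the chocolate .",
                         "boyfriend", "buddy", buddy_bigrams)
rejected = passes_bigram_check("His boyfriend loved it .", "boyfriend", "buddy", buddy_bigrams)
```

Here the first sentence passes because "my buddy" is a known bigram, while the second fails because neither "his buddy" nor "buddy loved" is.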

In [50]:
this_path = airbnb_path
prefix = "airbnb"

In [51]:
vocab_bigram_file = '/data/2/zwang/2018_S_WordTreatment/V2_airbnb/vocab_bigram_dfdict.pickle'
wdpair_file = this_path + '1_Process/'+prefix+'_treat_pairs_posinters_poscom1.csv'

In [52]:
wdpair_sentence_file = this_path+'5_Select/'+prefix+'_wdpair_sents_4effects_posinters_poscom1_limitvocab.pickle'
result_file = this_path+'5_Select/'+prefix+'_wdpair_sents_4effects_posinters_poscom1_bigramcheck_limitvocab.pickle'
check_bigram_match(wdpair_sentence_file,wdpair_file,vocab_bigram_file,result_file)

300000
500000
600000
700000
900000
1500000
1108 word pairs remained out of 561 in total
885786 sentences remained out of 1000502 in total


- 2.2.2 Further check n_bigram > a given value 
> Check whether a stricter threshold produces more reasonable sentences

In [126]:
def check_nbigram(effect_file,vocab_bigram_file,result_file,n_min):
    """
    Parameters:
        Whole effect file, bigram vocabulary file, result file for word pairs and sentences that pass the bigram match, threshold for n_bigram
    Function:
        Iterate over each word pair and each sentence:
            Check if the bigram exists after target word substitution:
            left_word control_word right_word --> left_word treatment_word right_word
            Keep only sentences where either n_(left_word treatment_word) or n_(treatment_word right_word) > n_min in the treatment word's bigrams.
        Print the number of remaining word pairs and sentences, and write the remaining sentences to the result file.
    Return:
        Pickle file that stores the word pairs and sentences that passed the bigram match with the given threshold
    """
    
    wdpair_effect_pd = pd.DataFrame(pickle.load(open(effect_file,'rb')))
    vocab_bigram_dfdict = pickle.load(open(vocab_bigram_file,'rb'))
    remain_id = []
    
    for idx,row in wdpair_effect_pd.iterrows():
        if(idx % 100000 ==0):
            print(idx)

        word_list = re.findall('[a-z]+',row.sentence.lower())
        if(row.source in word_list):
            srci = word_list.index(row.source)
            tar_bigram_l = ''
            tar_bigram_r = ''
            if(srci > 0):
                tar_bigram_l = word_list[srci-1]+' '+row.target
            if(srci < len(word_list)-1):
                tar_bigram_r = row.target+' '+word_list[srci+1]

            flag = False
            if(vocab_bigram_dfdict[row.target][tar_bigram_l] > n_min):
                flag = True
            elif(vocab_bigram_dfdict[row.target][tar_bigram_r] > n_min):
                flag = True

            if(flag):
                remain_id.append(row.id)

    remain_effect_pd = wdpair_effect_pd[wdpair_effect_pd['id'].isin(remain_id)]
    pickle.dump(remain_effect_pd,open(result_file,'wb'))
    print("%d word pairs remained in total" % (len(remain_effect_pd.groupby(["source", "target"]).size())))
    print("%d sentences remained out of %d in total" % (len(remain_id),wdpair_effect_pd.shape[0]))

In [127]:
vocab_bigram_file = '/data/2/zwang/2018_S_WordTreatment/V2_airbnb/vocab_bigram_dfdict.pickle'
wdpair_sentence_file = tw_path+'5_Select/tw_wdpair_sents_4effects_posinters_poscom1_bigramcheck_limit.pickle'
result_file = tw_path+'5_Select/tw_wdpair_sents_4effects_posinters_poscom1_bigramcheck_limit_bicheck3.pickle'
check_nbigram(wdpair_sentence_file,vocab_bigram_file,result_file,n_min=2)

100000
200000
700000
900000
1100000
1400000
1500000
1600000
1800000
1900000
2000000
1020 word pairs remained in total
799363 sentences remained out of 907939 in total


- 2.2.3 Check whether a stricter threshold produces more reasonable sentences
> Check if both n_left_bigram > n_thresh and n_right_bigram > n_thresh

In [138]:
def check_nbigram_lr(effect_file,vocab_bigram_file,result_file,n_min):
    """
    Parameters:
        Whole effect file, bigram vocabulary file, result file for word pairs and sentences that pass the bigram match, threshold for n_bigram
    Function:
        Iterate over each word pair and each sentence:
            Check if n_bigram > n_min after target word substitution:
            left_word control_word right_word --> left_word treatment_word right_word
            Keep only sentences where both n_(left_word treatment_word) and n_(treatment_word right_word) > n_min.
        Print the number of remaining word pairs and sentences, and write the remaining sentences to the result file.
    Return:
        Pickle file that stores the word pairs and sentences that passed the bigram match with the given threshold.
    """
    
    wdpair_effect_pd = pd.DataFrame(pickle.load(open(effect_file,'rb')))
    vocab_bigram_dfdict = pickle.load(open(vocab_bigram_file,'rb'))
    remain_id = []
    
    for idx,row in wdpair_effect_pd.iterrows():
        if(idx % 100000 ==0):
            print(idx)

        word_list = re.findall('[a-z]+',row.sentence.lower())
        if(row.source in word_list):
            srci = word_list.index(row.source)
            tar_bigram_l = ''
            tar_bigram_r = ''
            if(srci > 0):
                tar_bigram_l = word_list[srci-1]+' '+row.target
            if(srci < len(word_list)-1):
                tar_bigram_r = row.target+' '+word_list[srci+1]

            flag = False
            if((vocab_bigram_dfdict[row.target][tar_bigram_l] > n_min) and (vocab_bigram_dfdict[row.target][tar_bigram_r] > n_min)):
                flag = True
            
            if(flag):
                remain_id.append(row.id)

    remain_effect_pd = wdpair_effect_pd[wdpair_effect_pd['id'].isin(remain_id)]
    pickle.dump(remain_effect_pd,open(result_file,'wb'))
    print("%d word pairs remained in total" % (len(remain_effect_pd.groupby(["source", "target"]).size())))
    print("%d sentences remained out of %d in total" % (len(remain_id),wdpair_effect_pd.shape[0]))
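
The difference between the "either side" rule of 2.2.2 and the "both sides" rule of 2.2.3 can be sketched with one helper. The function `passes_count_threshold`, its `require_both` flag, and the counts are hypothetical; `.get(..., 0)` is used so that a missing or empty bigram counts as zero even if the vocabulary is a plain dict.

```python
from collections import Counter

def passes_count_threshold(left_bigram, right_bigram, counts, n_min, require_both=False):
    """Compare left/right bigram counts to n_min, requiring either side or both sides."""
    left_ok = counts.get(left_bigram, 0) > n_min
    right_ok = counts.get(right_bigram, 0) > n_min
    return (left_ok and right_ok) if require_both else (left_ok or right_ok)

# Made-up counts: the left bigram is frequent, the right one is rare.
counts = Counter({"my buddy": 12, "buddy loved": 1})
either = passes_count_threshold("my buddy", "buddy loved", counts, n_min=2)
both = passes_count_threshold("my buddy", "buddy loved", counts, n_min=2, require_both=True)
```

With these counts the sentence survives the "either" rule but not the stricter "both" rule, which is consistent with the stricter filter keeping far fewer sentences.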

In [140]:
vocab_bigram_file = '/data/2/zwang/2018_S_WordTreatment/V2_airbnb/vocab_bigram_dfdict.pickle'
wdpair_sentence_file = tw_path+'5_Select/tw_wdpair_sents_4effects_posinters_poscom1_bigramcheck_limit.pickle'
result_file = tw_path+'5_Select/tw_wdpair_sents_4effects_posinters_poscom1_bigramcheck_limit_bicheck3_lr.pickle'
#check_nbigram_lr(wdpair_sentence_file,vocab_bigram_file,result_file,n_min=2)

In [141]:
res_pd = pd.DataFrame(pickle.load(open(result_file,'rb')))
res_pd.shape

(293956, 9)

#### 2.3 Limit treatment effect file by case sensitivity and sentence length
> Remove a sentence if the lowercase source word does not appear in it. (Cases where the treatment word is a proper noun are not considered.) <br>
> Limit sentence length to between 3 and 50 words to remove very long sentences (possibly caused by faulty pre-processing). The code uses strict inequalities, so lengths of exactly 3 or 50 are excluded.

In [100]:
'unique' in 'We love going to all the Unique restaurants and bars in the area.'

False

In [103]:
len('We love going to all the Unique restaurants and bars in the area.'.split())

13

In [72]:
def limit_by_condition(effect_file,result_file,min_len,max_len):
    """
    Parameters:
        Whole effect file, file limited by conditions, limitations on sentence length.
    Function:
        Filter by sentence length and by a case-sensitive match of the source word.
    Return:
        A pickle file with the instances that passed the filters.
    """
    effect_pd = pd.DataFrame(pickle.load(open(effect_file,'rb')))
    remain_id = []
    for idx,row in effect_pd.iterrows():
        sent_len = len(row.sentence.split())
        if((sent_len>min_len) and (sent_len<max_len)): # limitation on sentence length
            if((' '+row.source in row.sentence) or (row.source+' ' in row.sentence)): # case-sensitive match of the source word
                remain_id.append(row.id)
    
    remain_effect_pd = effect_pd[effect_pd['id'].isin(remain_id)]
    pickle.dump(remain_effect_pd,open(result_file,'wb'))
    print("%d sentences out of %d in total remained." % (len(remain_id),effect_pd.shape[0]))
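
The two conditions can be tried on the example sentence from the cells above. This is a small sketch of the same tests applied to one sentence; `keep_sentence` is a hypothetical helper name.

```python
def keep_sentence(sentence, source, min_len, max_len):
    """Apply the length bounds (strict) and the case-sensitive source-word test."""
    n_words = len(sentence.split())
    if not (min_len < n_words < max_len):
        return False
    # Substring test with a space on one side, as in limit_by_condition.
    return (" " + source in sentence) or (source + " " in sentence)

example = "We love going to all the Unique restaurants and bars in the area."
dropped = keep_sentence(example, "unique", 3, 50)      # only "Unique" (capitalized) occurs
kept = keep_sentence(example, "restaurants", 3, 50)    # lowercase match exists
```

Because the test is a substring match, a source word embedded in a longer word (for example "art" in "start ") would also match.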

In [73]:
effect_file = airbnb_path+'5_Select/airbnb_wdpair_sents_4effects_posinters_poscom1_bigramcheck_limitvocab.pickle'
result_file = airbnb_path+'5_Select/airbnb_wdpair_sents_4effects_posinters_poscom1_bigramcheck_limit_limitvocab.pickle'
limit_by_condition(effect_file,result_file,min_len=3,max_len=50)
#723297 sentences out of 885786 in total remained.

723297 sentences out of 885786 in total remained.


In [97]:
effect_file = yp_path+'5_Select/yp_wdpair_sents_4effects_posinters_poscom1_bigramcheck.pickle'
result_file = yp_path+'5_Select/yp_wdpair_sents_4effects_posinters_poscom1_bigramcheck_limit.pickle'
limit_by_condition(effect_file,result_file,min_len=3,max_len=50)

1490969 sentences out of 1659198 in total remained.


In [98]:
effect_file = tw_path+'5_Select/tw_wdpair_sents_4effects_posinters_poscom1_bigramcheck.pickle'
result_file = tw_path+'5_Select/tw_wdpair_sents_4effects_posinters_poscom1_bigramcheck_limit.pickle'
limit_by_condition(effect_file,result_file,min_len=3,max_len=50)

907939 sentences out of 1069312 in total remained.


#### 2.4 Extract word pairs from a treatment effect file

In [None]:
def write_vocab(effect_file,vocab_file):
    """
    Parameters:
        A given effect file, file for extracted word pairs
    Function:
        Extract word pairs from a given effect file
    Return:
        A csv file with fields ['source', 'target']
    """
    
    res_pd = pd.DataFrame(pickle.load(open(effect_file,'rb')))
    new_vocab = []
    for idx,row in res_pd.iterrows():
        src_tar_dict = {}
        src_tar_dict['source'] = row['source']
        src_tar_dict['target'] = row['target']
        if(src_tar_dict not in new_vocab):
            new_vocab.append(src_tar_dict)
    
    pd.DataFrame(new_vocab).to_csv(vocab_file,columns=['source','target'],index=False)
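
For large effect files, the membership test against a growing list can become slow. A possible alternative using `drop_duplicates` is sketched below on toy data; it keeps the first occurrence of each pair in row order, which matches the loop above.

```python
import pandas as pd

# Toy effect frame with a repeated word pair (made-up rows).
effect_demo = pd.DataFrame({
    "source": ["nice", "nice", "end"],
    "target": ["gorgeous", "gorgeous", "rest"],
    "sentence": ["A nice flat.", "Nice view.", "The end of the street."],
})

# Unique (source, target) pairs, first occurrence kept, original order preserved.
vocab_demo = effect_demo[["source", "target"]].drop_duplicates().reset_index(drop=True)
```

The result could then be written with `vocab_demo.to_csv(vocab_file, index=False)`.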

In [None]:
write_vocab(effect_file = airbnb_path + '5_Select/airbnb_wdpair_sents_4effects_posinters_poscom1_bigramcheck.pickle',
            vocab_file = airbnb_path + '1_Process/airbnb_treat_pairs_posinters_poscom1_bigramcheck.csv')
write_vocab(effect_file = yp_path + '5_Select/yp_wdpair_sents_4effects_posinters_poscom1_bigramcheck.pickle',
            vocab_file = yp_path + '1_Process/yp_treat_pairs_posinters_poscom1_bigramcheck.csv')
write_vocab(effect_file = tw_path + '5_Select/tw_wdpair_sents_4effects_posinters_poscom1_bigramcheck.pickle',
            vocab_file = tw_path + '1_Process/tw_treat_pairs_posinters_poscom1_bigramcheck.csv')

### 3. Select topn word pairs and sentences
- 3.1 Get top-n word pairs<br>
- 3.2 For each word pair, select 3 sentences with max, min and median treatment effects according to each method.<br>
- 3.3 For each word pair, sort all sentences, select 1+10 sentences. <br>
> top1 sentence <br>
> Divide others into 10 effect score levels, for each level, select one sentence <br> 

In [374]:
effect_file = airbnb_path+'5_Select/airbnb_wdpair_sents_4effects_posinters_poscom1_bigramcheck_limit.pickle'
st_effect_pd = pd.DataFrame(pickle.load(open(effect_file,'rb'))).sort_values(by=['vt_effect'],ascending=False)

In [443]:
st_effect_pd[(st_effect_pd.source == 'amazing') & (st_effect_pd.target == 'fabulous')].sort_values(by=['ctf_effect'],ascending=False).iloc[9]

csf_effect                                              0.06497
ctf_effect                                              0.27129
id                                                       295558
knn_effect                                              0.06667
sentence      The neighborhood is amazing as righ next to th...
source                                                  amazing
target                                                 fabulous
true_y                                                        0
vt_effect                                                 0.024
Name: 295558, dtype: object

- 3.1 Get top-n word pairs

In [74]:
def select_topn_wdpairs(effect_file,n_pairs,min_sent_length,method):
    """
    Parameters:
        Treatment effect file, number of word pairs, condition on sentence length, according to which treatment method
    Function:
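        Sort treatment effect file according to a specific method in descending order, and collect the
        first n_pairs distinct word pairs whose sentences have more than min_sent_length tokens.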
    Return:
        A list of word pairs: source,target
    """
    # Sort by one effect
    st_effect_pd = pd.DataFrame(pickle.load(open(effect_file,'rb'))).sort_values(by=[method],ascending=False)
    top_pairs = []
    for idx,row in st_effect_pd.iterrows():
        if(len(top_pairs) < n_pairs):
            if(len(re.findall(r'\w+',row.sentence)) > min_sent_length):
                if(row.source+','+row.target not in top_pairs):
                    top_pairs.append(row.source+','+row.target)
        else:
            break
    return top_pairs
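
The same ranking idea can be sketched without an explicit loop. This version assumes the effect data is already a DataFrame and omits the `min_sent_length` filter for brevity; `top_pairs_by_effect` and the demo values are hypothetical.

```python
import pandas as pd

def top_pairs_by_effect(effect_pd, n_pairs, method):
    """Return the first n_pairs distinct 'source,target' strings, ranked by one effect column."""
    ranked = effect_pd.sort_values(by=method, ascending=False)
    unique_pairs = ranked[["source", "target"]].drop_duplicates().head(n_pairs)
    return [s + "," + t for s, t in zip(unique_pairs["source"], unique_pairs["target"])]

# Toy effect frame (made-up values).
effect_demo = pd.DataFrame({
    "source": ["nice", "end", "nice", "stores"],
    "target": ["gorgeous", "rest", "gorgeous", "shops"],
    "vt_effect": [0.9, 0.5, 0.7, 0.1],
})
top2 = top_pairs_by_effect(effect_demo, n_pairs=2, method="vt_effect")
```

A pair is ranked by its single highest-effect sentence, so one outlier sentence is enough to place a pair in the top n.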

In [78]:
effect_file = airbnb_path+'5_Select/airbnb_wdpair_sents_4effects_posinters_poscom1_bigramcheck_limit_limitvocab.pickle'
select_topn_wdpairs(effect_file,n_pairs=20,min_sent_length=0,method='vt_effect')

['nice,gorgeous',
 'diverse,many',
 'famous,wonderful',
 'apartment,condo',
 'see,enjoy',
 'arts,artists',
 'chinese,famous',
 'stores,shops',
 'end,rest',
 'district,home',
 'stores,boutiques',
 'great,gorgeous',
 'diverse,several',
 'famous,old',
 'stores,restaurants',
 'chinese,traditional',
 'good,gorgeous',
 'plaza,place',
 'plaza,square',
 'boulevard,avenue']

- 3.2 For each word pair, select 3 sentences with max, min and median treatment effects according to each method.

In [88]:
def select_median_for_wdpairs(effect_file,wd_pairs,wd_pairs_file,topn_file,method):
    """
    Parameters:
        Treatment effect file, a list of topn word pairs, file to record word pair info, 
        file to record selected sentences for word pairs, the method which selection is based on.
    Function:
        For each given word pair, sort all sentences by treatment effect from a method, and select 3 with:
        max, min and median treatment effect.
    Return:
        A csv file with selected sentences for each word pair.
    """
    st_effect_pd = pd.DataFrame(pickle.load(open(effect_file,'rb'))).sort_values(by=[method],ascending=False)
    wdpair_sents = []
    wdpair_info = []
    for pair in wd_pairs:
        wdpair_pd = st_effect_pd[(st_effect_pd.source == pair.split(',')[0]) & (st_effect_pd.target == pair.split(',')[1])].sort_values(by=[method],ascending=False)
        effect_list = wdpair_pd[method].values
        wdpair_info.append({'source':pair.split(',')[0],'target':pair.split(',')[1],
                            'min':min(effect_list),'max':max(effect_list),'n_sents':wdpair_pd.shape[0]})
        
        for i in [0,-1,int(wdpair_pd.shape[0]/2)]:
            row = wdpair_pd.iloc[i]
            wdpair_sents.append({'source':row.source,'target':row.target,
                             'sentence':row.sentence,'knn_effect':row.knn_effect,'vt_effect':row.vt_effect,
                             'ctf_effect':row.ctf_effect,'csf_effect':row.csf_effect,
                             'true_y':row.true_y,'id':row.id})
    
    pd.DataFrame(wdpair_info).to_csv(wd_pairs_file,columns=['source','target','n_sents','min','max'],index=False)
    pd.DataFrame(wdpair_sents).to_csv(topn_file,columns=['source','target','sentence','knn_effect','vt_effect','ctf_effect','csf_effect','true_y','id'],index=False)
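
The positional selection of the max, min, and median sentences can be shown in isolation. A minimal sketch on toy data, with `pick_max_min_median` as a hypothetical helper name:

```python
import pandas as pd

def pick_max_min_median(pair_pd, method):
    """Sort one word pair's sentences by effect and return the max, min, and middle rows."""
    ranked = pair_pd.sort_values(by=method, ascending=False)
    positions = [0, -1, len(ranked) // 2]  # first, last, and middle position after sorting
    return ranked.iloc[positions]

# Toy sentences for a single word pair (made-up effects).
pair_demo = pd.DataFrame({
    "sentence": ["s1", "s2", "s3", "s4", "s5"],
    "knn_effect": [0.1, 0.9, 0.5, 0.3, 0.7],
})
picked = pick_max_min_median(pair_demo, "knn_effect")
```

For an even number of sentences, `len(ranked) // 2` picks the lower of the two middle values in the descending order, not their average.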
        

In [95]:
effect_file = airbnb_path+'5_Select/airbnb_wdpair_sents_4effects_posinters_poscom1_bigramcheck_limit_limitvocab.pickle'
top_pairs = select_topn_wdpairs(effect_file,n_pairs=20,min_sent_length=0,method='knn_effect')

In [96]:
file_path = airbnb_path+'5_Select/5_tolabel/'
select_median_for_wdpairs(effect_file,top_pairs,
                          wd_pairs_file = file_path + 'airbnb_knn_20wdpairs_limitvocab.csv',
                          topn_file = file_path + 'airbnb_knn_20wdpairs_60sents_limitvocab.csv',
                          method = 'knn_effect')
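
The max / min / median selection used in `select_median_for_wdpairs` can be illustrated on a toy frame. This is a minimal sketch, assuming a pandas version that provides `sort_values`; the helper name `pick_max_min_median` and the toy data are hypothetical and not part of the notebook.

```python
import pandas as pd

def pick_max_min_median(df, method):
    """Return the rows holding the max, min and middle value of `method`."""
    ordered = df.sort_values(by=method, ascending=False)
    # Positions in descending order: 0 = max, -1 = min, n // 2 = middle row
    positions = [0, -1, len(ordered) // 2]
    return ordered.iloc[positions]

toy = pd.DataFrame({
    'sentence': ['s1', 's2', 's3', 's4', 's5'],
    'knn_effect': [0.10, -0.30, 0.25, 0.00, 0.40],
})
picked = pick_max_min_median(toy, 'knn_effect')
print(picked)
```

For a word pair with fewer than three sentences, the three positions overlap, so the same row is selected more than once.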

In [213]:
wdpair_pd.tail(1)

Unnamed: 0,csf_effect,ctf_effect,id,knn_effect,sentence,source,target,true_y,vt_effect
1477321,0.07856,-0.0129,1477321,-0.36667,The Mansion is situated mere steps from the hu...,famous,wonderful,1,0.00494


In [216]:
wdpair_pd.iloc[-1]

csf_effect                                              0.07856
ctf_effect                                              -0.0129
id                                                      1477321
knn_effect                                             -0.36667
sentence      The Mansion is situated mere steps from the hu...
source                                                   famous
target                                                wonderful
true_y                                                        1
vt_effect                                               0.00494
Name: 1477321, dtype: object

- 3.3 For each selected word pair, sort all sentences, select 10+1 sentences. <br>
> Top1 sentence <br>
> Divide others into 10 effect score levels, for each level, select one sentence <br>

In [None]:
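# The function defined below takes n_sents and min_sent_length rather than wd_pairs_file;
# the values here follow the executed call further down.
select_nsents_for_wdpairs(effect_file,top_pairs,n_sents=10,min_sent_length=0,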
                          topn_file = airbnb_path+file_path+'airbnb_csf_top_20pair_11sents_poscom1_bigram_limit_limitvocab.csv',
                          method = 'csf_effect')

In [83]:
def select_nsents_for_wdpairs(effect_file,wd_pairs,n_sents,min_sent_length,topn_file,method):
    """
    Parameters:
        Treatment effect file, a list of topn word pairs, number of sentences to select for each word pair,
        limitation on sentence length, file to record selected sentences for word pairs, the method on which selection is based.
    Function:
        For each given word pair, sort all sentences by treatment effect from a method, and select up to n_sents+1 with:
        1 max treatment effect + 1 random sentence from each of n_sents equal-width intervals of treatment effect.
    Return:
        A csv file with word pairs and 11 sentences for each pair.
    """
    st_effect_pd = pd.DataFrame(pickle.load(open(effect_file,'rb'))).sort_values(by=[method],ascending=False)
    wdpair_sents = []
    tmp_rdidx = {}
    for pair in wd_pairs:
        wdpair_pd = st_effect_pd[(st_effect_pd.source == pair.split(',')[0]) & (st_effect_pd.target == pair.split(',')[1])].sort_values(by=[method],ascending=False)
        first_row = wdpair_pd.iloc[0]
        wdpair_sents.append({'source':first_row.source,'target':first_row.target,
                             'sentence':first_row.sentence,'knn_effect':first_row.knn_effect,'vt_effect':first_row.vt_effect,
                             'ctf_effect':first_row.ctf_effect,'csf_effect':first_row.csf_effect,
                             'true_y':first_row.true_y,'id':first_row.id})
        
        effect_list = wdpair_pd[method].values
        interv = (max(effect_list)-min(effect_list))/(n_sents-0.1)
        intervid_list = {}
        
        for idx,row in wdpair_pd.iloc[1:].iterrows():
            interv_id = math.floor((row[method]-min(effect_list))/interv)
            if(interv_id not in intervid_list):
                intervid_list[interv_id]=[]
            intervid_list[interv_id].append(idx)
        
        tmp_rdidx['rand_idx'] = []
        for i in range(min(n_sents,wdpair_pd.shape[0])):
            if(i in intervid_list):
                rd_i = random.sample(intervid_list[i],1)
                #return wdpair_pd,intervid_list,i,rd_i
                rd_row = wdpair_pd.loc[rd_i]
                if(len(re.findall(r'\w+',rd_row.sentence.values[0])) > min_sent_length):
                    tmp_rdidx['rand_idx'].append(rd_i)
                    wdpair_sents.append({'source':rd_row.source.values[0],'target':rd_row.target.values[0],
                                         'sentence':rd_row.sentence.values[0],'knn_effect':rd_row.knn_effect.values[0],
                                         'vt_effect':rd_row.vt_effect.values[0],'ctf_effect':rd_row.ctf_effect.values[0],
                                         'csf_effect':rd_row.csf_effect.values[0],
                                         'true_y':rd_row.true_y.values[0],'id':rd_row.id.values[0]})
            else:
                print(pair,i)
        
        #wdpair_rdidx.append(tmp_rdidx)
        
    #pd.DataFrame(wdpair_rdidx).to_csv()
    pd.DataFrame(wdpair_sents).to_csv(topn_file,columns=['source','target','sentence','knn_effect','vt_effect','ctf_effect','csf_effect','true_y','id'],index=False)

In [87]:
effect_file = airbnb_path+'5_Select/airbnb_wdpair_sents_4effects_posinters_poscom1_bigramcheck_limit_limitvocab.pickle'
top_pairs = select_topn_wdpairs(effect_file,n_pairs=30,min_sent_length=3,method='knn_effect')
select_nsents_for_wdpairs(effect_file,top_pairs,n_sents=10,min_sent_length=0,
                          topn_file = airbnb_path+'5_Select/4_posinters_poscom1_bigram_limit/airbnb_knn_top_30pair_11sents_poscom1_bigram_limit_limitvocab.csv',
                          method = 'knn_effect')

plaza,square 6
entertainment,recreation 9
plaza,place 7
chinese,traditional 9
hard,excellent 9
store,boutique 9
beautiful,wonderful 8
beautiful,wonderful 9
beautiful,picturesque 8
exciting,unique 9
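
Each line printed above is a `(pair, i)` combination whose interval `i` contains no sentence. A minimal sketch of the binning, assuming the same equal-width scheme as `select_nsents_for_wdpairs` (the range between the min and max effect is cut into `n_sents` intervals of equal width, which differs from quantile-based deciles); `bin_effects` and the toy values are hypothetical, and unlike the notebook function the sketch bins every value, including the maximum.

```python
import math
import random

def bin_effects(effects, n_bins):
    """Group the indices of `effects` into n_bins equal-width intervals."""
    lo, hi = min(effects), max(effects)
    # Dividing by (n_bins - 0.1) keeps the maximum value inside the last bin
    width = (hi - lo) / (n_bins - 0.1)
    bins = {}
    for idx, value in enumerate(effects):
        bins.setdefault(math.floor((value - lo) / width), []).append(idx)
    return bins

effects = [0.0, 0.05, 0.5, 0.52, 1.0]
bins = bin_effects(effects, n_bins=10)
# Intervals without any sentence, analogous to the printed (pair, i) lines
empty = [i for i in range(10) if i not in bins]

random.seed(0)  # fixed seed only so the sketch is repeatable
sampled = {i: random.choice(members) for i, members in bins.items()}
print(bins, empty, sampled)
```

When effects cluster near a few values, most intervals stay empty and fewer than `n_sents` sentences are selected. A pair whose sentences all share one effect value gives a zero width and would need a separate guard.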


In [102]:
print("airbnb,csf:")
top_pairs

airbnb,csf:


['yummy,good',
 'annual,general',
 'trips,tours',
 'wide,multiple',
 'wide,grand',
 'various,several',
 'wide,whole',
 'predominantly,mostly',
 'general,grand',
 'predominantly,especially',
 'yummy,delicious',
 'suggestions,views',
 'help,sit',
 'wide,several',
 'wide,numerous',
 'comfortable,happy',
 'apartments,homes',
 'wide,open',
 'closest,highest',
 'wide,much']

In [142]:
test_pd = pd.DataFrame(pickle.load(open(effect_file,'rb'))).sort_values(by=['ctf_effect'],ascending=False)
test_pd.shape

(387213, 9)

In [143]:
test_pd.head()

Unnamed: 0,csf_effect,ctf_effect,id,knn_effect,sentence,source,target,true_y,vt_effect
1167909,0.0853,0.67888,1167909,0.23333,"Little Tokyo also has amazing food, we recomme...",amazing,outstanding,0,0.02356
1167563,0.06561,0.67222,1167563,0.23333,"Little Tokyo also has amazing food, we recomme...",amazing,outstanding,0,0.0157
1167519,0.11755,0.66128,1167519,0.23333,"Little Tokyo also has amazing food, we recomme...",amazing,outstanding,0,0.01968
1167687,0.17155,0.65167,1167687,0.23333,"Little Tokyo also has amazing food, we recomme...",amazing,outstanding,0,0.02165
1167593,0.07898,0.60272,1167593,0.2,"Little Tokyo also has amazing food, we recomme...",amazing,outstanding,0,0.01551


In [139]:
pos_tag("You will be able to walk to award winning restaurants, as well as great bars, casual dining and cozy coffee shops.".split())

[('You', 'PRP'),
 ('will', 'MD'),
 ('be', 'VB'),
 ('able', 'JJ'),
 ('to', 'TO'),
 ('walk', 'VB'),
 ('to', 'TO'),
 ('award', 'VB'),
 ('winning', 'VBG'),
 ('restaurants,', 'NN'),
 ('as', 'RB'),
 ('well', 'RB'),
 ('as', 'IN'),
 ('great', 'JJ'),
 ('bars,', 'JJ'),
 ('casual', 'JJ'),
 ('dining', 'NN'),
 ('and', 'CC'),
 ('cozy', 'JJ'),
 ('coffee', 'NN'),
 ('shops.', 'NN')]

In [140]:
pos_tag("You will be able to walk to award won restaurants, as well as great bars, casual dining and cozy coffee shops.".split())

[('You', 'PRP'),
 ('will', 'MD'),
 ('be', 'VB'),
 ('able', 'JJ'),
 ('to', 'TO'),
 ('walk', 'VB'),
 ('to', 'TO'),
 ('award', 'VB'),
 ('won', 'NN'),
 ('restaurants,', 'NN'),
 ('as', 'RB'),
 ('well', 'RB'),
 ('as', 'IN'),
 ('great', 'JJ'),
 ('bars,', 'JJ'),
 ('casual', 'JJ'),
 ('dining', 'NN'),
 ('and', 'CC'),
 ('cozy', 'JJ'),
 ('coffee', 'NN'),
 ('shops.', 'NN')]

### 4. Generate paraphrase substitution sentences for selected word pairs and sentences

In [174]:
mystr = "My boyfriend and I came here for the boyfriend first time on a recent trip to Vegas and could not have been more pleased with the quality of food and service ."

In [177]:
re.sub('boyfriend','buddy',mystr,flags=re.IGNORECASE)

'My buddy and I came here for the buddy first time on a recent trip to Vegas and could not have been more pleased with the quality of food and service .'
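
A detail worth keeping in mind when calling `re.sub` with flags: its fourth positional parameter is `count`, so a flag passed positionally is read as a replacement limit and matching stays case-sensitive. The sketch below contrasts the two call forms on a made-up string.

```python
import re

text = "Boyfriend or boyfriend, my boyfriend said"

# Positional: re.IGNORECASE (integer value 2) is taken as count=2,
# and the match remains case-sensitive
positional = re.sub('boyfriend', 'buddy', text, re.IGNORECASE)

# Keyword: the flag is applied and every match is replaced
keyword = re.sub('boyfriend', 'buddy', text, flags=re.IGNORECASE)

print(positional)
print(keyword)
```

The two forms only agree when the text has at most two matches and all of them are already lowercase.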

In [161]:
def substitute_sentence(wd_sents_file,subs_file):
    """
    Parameters:
        File for word pairs with selected sentences
    Function:
        Generate new sentences by substituting source words with target words.
    Return:
        A csv file with original sentences and substituted sentences.
    """
    wd_sents_pd = pd.read_csv(wd_sents_file)
    all_info = []
    for idx,row in wd_sents_pd.iterrows():
        tar_sent = re.sub(row.source,row.target,row.sentence)#flags = re.IGNORECASE
        all_info.append({'source':row.source,'target':row.target,'src_sentence':row.sentence,'tar_sentence':tar_sent,
                         'knn_effect':row.knn_effect,'vt_effect':row.vt_effect,'ctf_effect':row.ctf_effect,
                         'csf_effect':row.csf_effect,
                         'true_y':row.true_y,'id':row.id})

    pd.DataFrame(all_info).to_csv(subs_file,columns=['source','target','src_sentence','tar_sentence',
                                                    'knn_effect','vt_effect','ctf_effect','csf_effect','true_y','id'],index=False)
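
`substitute_sentence` passes the source word to `re.sub` as a bare pattern, which also matches inside longer words. If whole-word replacement is the intent (an assumption, the notebook does not say), a variant could look like the sketch below; `substitute_word` and the example sentence are hypothetical.

```python
import re

def substitute_word(sentence, source, target):
    """Replace whole-word occurrences of `source` with `target`."""
    # re.escape guards against regex metacharacters in the source word
    pattern = r'\b' + re.escape(source) + r'\b'
    return re.sub(pattern, target, sentence)

sent = "The shop next to the coffee shops is a gift shop ."
bare = re.sub('shop', 'store', sent)            # also rewrites "shops"
whole = substitute_word(sent, 'shop', 'store')  # only the standalone word changes
print(bare)
print(whole)
```

Whether plural or inflected forms should be rewritten as well depends on how the word pairs were matched to sentences upstream.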

In [454]:
this_path = tw_path
prefix = 'tw'

In [455]:
wd_sents_file = this_path+'5_Select/5_tolabel/'+prefix+'_knn_10wdpairs_30sents.csv'
subs_file = this_path+'5_Select/5_tolabel/'+prefix+'_knn_10wdpairs_60sents.csv'
substitute_sentence(wd_sents_file,subs_file)

In [456]:
wd_sents_file = this_path+'5_Select/5_tolabel/'+prefix+'_vt_10wdpairs_30sents.csv'
subs_file = this_path+'5_Select/5_tolabel/'+prefix+'_vt_10wdpairs_60sents.csv'
substitute_sentence(wd_sents_file,subs_file)

In [457]:
wd_sents_file = this_path+'5_Select/5_tolabel/'+prefix+'_ctf_10wdpairs_30sents.csv'
subs_file = this_path+'5_Select/5_tolabel/'+prefix+'_ctf_10wdpairs_60sents.csv'
substitute_sentence(wd_sents_file,subs_file)

In [458]:
wd_sents_file = this_path+'5_Select/5_tolabel/'+prefix+'_csf_10wdpairs_30sents.csv'
subs_file = this_path+'5_Select/5_tolabel/'+prefix+'_csf_10wdpairs_60sents.csv'
substitute_sentence(wd_sents_file,subs_file)

In [202]:
this_path = yp_path
prefix = 'yp'
subs_file = this_path+'5_Select/4_posinters_poscom1_bigram_limit/'+prefix+'_knn_top_30pair_11sents_limit_lb_bidirection.csv'

In [203]:
csv_pd = pd.read_csv(subs_file)
pair_ct = Counter()
for idx,row in csv_pd.iterrows():
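    # Wrap the key in a list: Counter.update() on a bare string counts its characters, not the pair
    # (the 24 recorded below came from the bare-string form)
    pair_ct.update([row.source+','+row.target])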
len(pair_ct)

24

### 5. Different evaluation methods
- 5.1 Rank all word pairs according to different methods. <br>
- 5.2 Get average treatment effect for each word pair. <br>
- 5.3 Fetch information for word pairs with coef change and frequency in opposite class. <br>
- 5.4 Get top-10 and bottom-10 word pair. <br>
- 5.5 Find treatment words for a sentence. <br>
- 5.6 Spearman rank correlation for sentences and word pairs. <br>
- 5.7 Percentage of negative instances among topn high treatment sentences. <br>

#### 5.1 Rank all word pairs according to different methods. 

In [97]:
def sort_allpairs(effect_file,method,flag):
    """
    Parameters:
        Treatment effect file, the method whose treatment effect is used for sorting, flag for ascending (True) or descending (False) order
    Function:
        Sort word pairs according to treatment effect calculated by a specific method.
    Return:
        DataFrame of de-duplicated word pairs in sorted order, with fields ['source','target'] plus the original row index kept by reset_index()
    """
    
    effect_pd = pd.DataFrame(pickle.load(open(effect_file,'rb'))).sort_values(by=[method],ascending=flag)
    method_pd = effect_pd[['source','target']].drop_duplicates(keep='first').reset_index()
    return method_pd

In [98]:
def rank_wdpairs(effect_file,rank_file,increase):
    """
    Parameters:
        Treatment effect file, file to store the ranks of word pairs according to 4 methods, whether to sort in increasing order
    Function:
        Get the rank of each word pair according to 4 methods
    Return:
        A csv file with word pairs and ranks according to 4 methods.
    """
    # increase = True sorts ascending (output kept under 1_0); increase = False sorts descending (output kept under 0_1)
    knn_pd = sort_allpairs(effect_file,method='knn_effect',flag=increase)
    vt_pd = sort_allpairs(effect_file,method='vt_effect',flag=increase)
    ctf_pd = sort_allpairs(effect_file,method='ctf_effect',flag=increase)
    csf_pd = sort_allpairs(effect_file,method='csf_effect',flag=increase)
    
    wdpair_info = []
    
    for idx,row in knn_pd.iterrows():
        tmp_pair = {}
        tmp_pair['source'] = row.source
        tmp_pair['target'] = row.target
        tmp_pair['knn_rank'] = idx
        tmp_pair['vt_rank'] = vt_pd[(vt_pd.source == row.source) & (vt_pd.target == row.target)].index[0]
        tmp_pair['ctf_rank'] = ctf_pd[(ctf_pd.source == row.source) & (ctf_pd.target == row.target)].index[0]
        tmp_pair['csf_rank'] = csf_pd[(csf_pd.source == row.source) & (csf_pd.target == row.target)].index[0]
        
        wdpair_info.append(tmp_pair)
    pd.DataFrame(wdpair_info).to_csv(rank_file,columns=['source','target','knn_rank','vt_rank','ctf_rank','csf_rank'],index=True)
    

In [103]:
effect_file = airbnb_path+'5_Select/airbnb_wdpair_sents_4effects_posinters_poscom1_bigramcheck_limit_limitvocab.pickle'
rank_file =  airbnb_path+'5_Select/6_PairRank/0_1/airbnb_pairrank_limitvocab.csv'
rank_wdpairs(effect_file,rank_file,increase = False)#increase = True, 1_0, increase = False, 0_1 
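
`rank_wdpairs` filters each sorted frame once per word pair. Since sorting and then keeping the first occurrence ranks a pair by its single best sentence, comparable ranks can be sketched with a `groupby`. This is an illustrative alternative rather than the notebook's method; `rank_pairs` and the toy data are hypothetical, and pairs with tied effects may be ordered differently from the original.

```python
import pandas as pd

def rank_pairs(df, methods, ascending=False):
    """Rank (source, target) pairs by their best sentence-level effect per method."""
    grouped = df.groupby(['source', 'target'])
    ranks = {}
    for m in methods:
        # Best sentence per pair: max when sorting descending, min when ascending
        best = grouped[m].min() if ascending else grouped[m].max()
        # method='first' yields distinct ranks; subtract 1 to make them 0-based
        rank = best.rank(ascending=ascending, method='first').astype(int) - 1
        ranks[m.replace('_effect', '_rank')] = rank
    return pd.DataFrame(ranks).reset_index()

toy = pd.DataFrame({
    'source': ['a', 'a', 'c', 'c', 'e'],
    'target': ['b', 'b', 'd', 'd', 'f'],
    'knn_effect': [0.1, 0.9, 0.5, 0.2, 0.7],
    'vt_effect': [0.3, 0.1, 0.8, 0.0, 0.2],
})
ranked = rank_pairs(toy, ['knn_effect', 'vt_effect'])
print(ranked)
```

The result has one row per pair and one rank column per method, the same layout as the csv written by `rank_wdpairs`.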

#### 5.2 Get average treatment effect for each word pair 


In [100]:
def get_avg_effect(effect_file,wdpair_file,avg_effect_file):
    """
    Parameters:
        Treatment effect file, ranked word pairs file, file to store average treatment effect for each word pair
    Function:
        Calculate average treatment effect for each word pair according to 4 methods
    Return:
        A csv file that records the average treatment effect and rank for each word pair.
    """
    effect_pd = pd.DataFrame(pickle.load(open(effect_file,'rb')))
    wdpair_pd = pd.read_csv(wdpair_file)
    avg_info = []
    for idx, row in wdpair_pd.iterrows():
        pair_df = effect_pd[(effect_pd.source == row.source) & (effect_pd.target == row.target)]
        row_info = {}
        row_info['source'] = row.source
        row_info['target'] = row.target
        row_info['avg_knn'] = np.mean(pair_df.knn_effect.values)
        row_info['avg_vt'] = np.mean(pair_df.vt_effect.values)
        row_info['avg_ctf'] = np.mean(pair_df.ctf_effect.values)
        row_info['avg_csf'] = np.mean(pair_df.csf_effect.values)
    
        avg_info.append(row_info)
    
    mg_df = wdpair_pd.merge(pd.DataFrame(avg_info),left_on=['source','target'],right_on=['source','target'],how='inner')
    mg_df.to_csv(avg_effect_file,columns=['source','target','knn_rank','vt_rank','ctf_rank','csf_rank',
                                          'avg_knn','avg_vt','avg_ctf','avg_csf'],index=True)

In [57]:
effect_file = yp_path+'5_Select/yp_wdpair_sents_4effects_posinters_poscom1_bigramcheck_limit.pickle'
wdpair_file =  yp_path+'5_Select/6_PairRank/1_0/yp_pairrank.csv'
avg_effect_file = yp_path+'5_Select/6_PairRank/1_0/yp_pairrank_avgeffect.csv'
get_avg_effect(effect_file,wdpair_file,avg_effect_file)

In [58]:
effect_file = tw_path+'5_Select/tw_wdpair_sents_4effects_posinters_poscom1_bigramcheck_limit.pickle'
wdpair_file =  tw_path+'5_Select/6_PairRank/1_0/tw_pairrank.csv'
avg_effect_file = tw_path+'5_Select/6_PairRank/1_0/tw_pairrank_avgeffect.csv'
get_avg_effect(effect_file,wdpair_file,avg_effect_file)

In [104]:
effect_file = airbnb_path+'5_Select/airbnb_wdpair_sents_4effects_posinters_poscom1_bigramcheck_limit_limitvocab.pickle'
wdpair_file =  airbnb_path+'5_Select/6_PairRank/1_0/airbnb_pairrank_limitvocab.csv'
avg_effect_file = airbnb_path+'5_Select/6_PairRank/1_0/airbnb_pairrank_avgeffect_limitvocab.csv'
get_avg_effect(effect_file,wdpair_file,avg_effect_file)
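
The per-pair averages computed in `get_avg_effect` can also be obtained in one pass with a `groupby`. A minimal sketch, assuming the four effect columns used throughout the notebook; `average_effects` and the toy data are hypothetical.

```python
import pandas as pd

EFFECTS = ['knn_effect', 'vt_effect', 'ctf_effect', 'csf_effect']

def average_effects(df):
    """Mean of each effect column per (source, target) pair."""
    avg = df.groupby(['source', 'target'], as_index=False)[EFFECTS].mean()
    # Rename to the avg_* column names used in the csv output
    return avg.rename(columns={c: 'avg_' + c.replace('_effect', '') for c in EFFECTS})

toy = pd.DataFrame({
    'source': ['a', 'a', 'c'],
    'target': ['b', 'b', 'd'],
    'knn_effect': [0.2, 0.4, 1.0],
    'vt_effect': [0.0, 0.2, 0.5],
    'ctf_effect': [0.1, 0.3, -0.5],
    'csf_effect': [-0.2, 0.2, 0.25],
})
avg = average_effects(toy)
print(avg)
```

The resulting frame can be merged with the rank file on `['source','target']`, as the notebook function does.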

#### 5.3 Fetch information for word pairs with coef change and frequency in opposite class. 

In [174]:
def merge_coef(avg_effect_file,coef_file,res_file):
    """
    Parameters:
        Treatment effect file with average treatment effect, file with coefficient information for each word pair, 
        result file to store all information for each word pair.
    Function:
        Get coefficient change of each word pair and frequency in opposite class.
    Return:
        A csv file with rank info, avg effect, coefficient change, and frequency in opposite class.
    """
    effect_pd = pd.read_csv(avg_effect_file)
    coef_pd = pd.read_csv(coef_file)
    
    all_info = []
    for idx,row in effect_pd.iterrows():
        tmp_info = {}
        tmp_info['source'] = row.source
        tmp_info['target'] = row.target
        
        src_tar_coef = coef_pd[(coef_pd.source == row.source) & (coef_pd.target == row.target)]
        tar_src_coef = coef_pd[(coef_pd.target == row.source) & (coef_pd.source == row.target)]
        if(src_tar_coef.shape[0]>0):
            tmp_info['coef_change'] = float(src_tar_coef.tar_coef.values[0]) - float(src_tar_coef.src_coef.values[0])
            tmp_info['tar_n_neg'] = src_tar_coef.tar_n_neg.values[0]
            #tmp_info['tar_n_pos'] = src_tar_coef.tar_n_pos.values[0]
        elif(tar_src_coef.shape[0]>0):
            tmp_info['coef_change'] = float(tar_src_coef.src_coef.values[0]) - float(tar_src_coef.tar_coef.values[0])
            tmp_info['tar_n_neg'] = tar_src_coef.src_n_neg.values[0]
            #tmp_info['tar_n_pos'] = tar_src_coef.src_n_pos.values[0]
            
        else:
            print("Error:",row.source,row.target)
        
        all_info.append(tmp_info)
        
    mg_df = effect_pd.merge(pd.DataFrame(all_info),left_on=['source','target'],right_on=['source','target'],how='inner')
    mg_df.to_csv(res_file,columns=['source','target','knn_rank','vt_rank','ctf_rank','csf_rank',
                                          'avg_knn','avg_vt','avg_ctf','avg_csf','coef_change','tar_n_neg'],index=True)

In [175]:
avg_effect_file = airbnb_path+'5_Select/6_PairRank/0_1/airbnb_pairrank_avgeffect_limitvocab.csv'
coef_file = airbnb_path+'1_Process/airbnb_treat_pairs.csv'
res_file = airbnb_path+'5_Select/6_PairRank/0_1/airbnb_pairrank_avgeffect_coef_limitvocab.csv'
merge_coef(avg_effect_file,coef_file,res_file)
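
`merge_coef` looks each pair up in both directions, because the coefficient file may store a pair as (target, source). The sketch below isolates that lookup, assuming the `src_coef` and `tar_coef` columns used above; `coef_change` and the toy values are hypothetical.

```python
import pandas as pd

def coef_change(coef_pd, source, target):
    """Coefficient of `target` minus coefficient of `source`, or None if the pair is absent."""
    fwd = coef_pd[(coef_pd.source == source) & (coef_pd.target == target)]
    if len(fwd) > 0:
        return float(fwd.tar_coef.iloc[0]) - float(fwd.src_coef.iloc[0])
    # Pair stored in the opposite direction: the roles of the two columns swap
    rev = coef_pd[(coef_pd.source == target) & (coef_pd.target == source)]
    if len(rev) > 0:
        return float(rev.src_coef.iloc[0]) - float(rev.tar_coef.iloc[0])
    return None

coef_pd = pd.DataFrame({
    'source': ['good'], 'target': ['great'],
    'src_coef': [0.5], 'tar_coef': [2.0],
})
forward = coef_change(coef_pd, 'good', 'great')   # stored direction
backward = coef_change(coef_pd, 'great', 'good')  # reversed direction
missing = coef_change(coef_pd, 'good', 'fine')    # not in the table
print(forward, backward, missing)
```

Reversing the direction of a pair flips the sign of the coefficient change.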

#### 5.4 Get top-10 and bottom-10 word pair 

In [110]:
def get_top10(effect_file,res01_file,res10_file):
    """
    Parameters:
        Treatment effect file, file for top 10 word pairs and bottom 10 word pairs.
    Function:
        Get top 10 and bottom 10 word pairs.
        Top word pairs are positive treatment effect, bottom word pairs are negative treatment effect.
    Return:
        A csv file with top 10 word pairs and a csv file with bottom 10 word pairs.
    """
    effect_pd = pd.DataFrame(pickle.load(open(effect_file,'rb')))
    method = ['knn_effect','vt_effect','ctf_effect','csf_effect']
    df01_list = []
    df10_list = []
    for i in range(len(method)):
        st_df = effect_pd.sort_values(by = [method[i]],ascending=False)
        df01_list.append(st_df.iloc[:10])
        df10_list.append(st_df.iloc[-10:])
    
    pd.concat(df01_list).to_csv(res01_file,columns=['source','target','true_y','sentence',
                                                    'knn_effect','vt_effect','ctf_effect','csf_effect'], index=False)
    
    pd.concat(df10_list).to_csv(res10_file,columns=['source','target','true_y','sentence',
                                                    'knn_effect','vt_effect','ctf_effect','csf_effect'], index=False)

In [111]:
effect_file = airbnb_path+'5_Select/airbnb_wdpair_sents_4effects_posinters_poscom1_bigramcheck_limit_limitvocab.pickle'
res01_file = airbnb_path+'5_Select/7_PairInSents/airbnb_top10_limitvocab.csv'
res10_file = airbnb_path+'5_Select/7_PairInSents/airbnb_end10_limitvocab.csv'
get_top10(effect_file,res01_file,res10_file)
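
`get_top10` selects the ten highest and ten lowest rows for each method. The rows are sentence-level, so one word pair can occupy several of the ten slots. A minimal sketch of the same selection with `nlargest` and `nsmallest`; `top_and_bottom` and the toy data are hypothetical.

```python
import pandas as pd

def top_and_bottom(df, method, n=10):
    """Rows with the n largest and the n smallest values of `method`."""
    return df.nlargest(n, method), df.nsmallest(n, method)

toy = pd.DataFrame({
    'source': ['a', 'a', 'c', 'e'],
    'target': ['b', 'b', 'd', 'f'],
    'knn_effect': [0.9, 0.8, -0.4, 0.1],
})
top, bottom = top_and_bottom(toy, 'knn_effect', n=2)
print(top)
print(bottom)
```

Note that `nsmallest` returns the bottom rows in ascending order, whereas slicing the tail of a descending sort keeps them in descending order.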

#### 5.5 Find treatment words for a sentence. 

In [377]:
def fetch_words_forsents(effect_file,wdpair_file,res_file):
    """
    Parameters:
        Treatment effect file, treatment word pairs, result file with all treatment word pairs for each sentence.
    Function:
        Search for all treatment word pairs for every sentence.
    Return:
        A csv file that records all treatment word pairs for every sentence.
    """
    effect_pd = pd.DataFrame(pickle.load(open(effect_file,'rb')))
    wdpair_pd = pd.read_csv(wdpair_file)
    src_wds = list(set(wdpair_pd.source.values))
    ct_vec = CountVectorizer(binary=True,vocabulary=src_wds)
    sents = effect_pd[effect_pd.true_y == 0].drop_duplicates(['sentence'],keep='first').sentence.values
    X_sents = ct_vec.fit_transform(sents)
    vocab = ct_vec.get_feature_names_out()
    sent_pd_list = []
    sent_id_list = []
    n_one = 0
    for i in range(10):#X_sents.shape[0]
        row_wdidx = X_sents[i].nonzero()[1]
        if(len(row_wdidx)>1):
            all_para_info = []
            for srci in row_wdidx:
                src_wd = vocab[srci]
                tar_wd_list = wdpair_pd[(wdpair_pd.source == src_wd)].target.values
                for tarj in range(len(tar_wd_list)):
                    para_info = {}
                    tar_wd = tar_wd_list[tarj]
                    select_pd=effect_pd[(effect_pd.source==src_wd)&(effect_pd.target==tar_wd)&(effect_pd.sentence==sents[i])]
                    if(select_pd.shape[0]>0):
                        para_info['source']=src_wd
                        para_info['target']=tar_wd
                        para_info['knn_effect']=select_pd.knn_effect.values[0]
                        para_info['vt_effect']=select_pd.vt_effect.values[0]
                        para_info['ctf_effect']=select_pd.ctf_effect.values[0]
                        para_info['csf_effect']=select_pd.csf_effect.values[0]
                        all_para_info.append(para_info)
            sent_pd_list.append(pd.DataFrame(all_para_info))
            sent_id_list.append(i)
    #return sent_pd_list
    all_info = []
    for j in range(len(sent_pd_list)):#X_sents.shape[0]
        sent_info = {}
        sent_info['sentence']=sents[sent_id_list[j]]
        sent_info['treat_words']=sent_pd_list[j]
        all_info.append(sent_info)
                
    #print("%d, %.5f has more than one positive treatment word." % ((X_sents.shape[0]-n_one),((X_sents.shape[0]-n_one)/X_sents.shape[0])))
    pd.DataFrame(all_info).reset_index().to_csv(res_file,columns=['sentence','treat_words'],index=False)
        

In [378]:
effect_file = yp_path+'5_Select/yp_wdpair_sents_4effects_posinters_poscom1_bigramcheck_limit.pickle'
wdpair_file = yp_path+'5_Select/6_PairRank/0_1/yp_pairrank_avgeffect_coef.csv'
res_file = yp_path+'5_Select/7_PairInSents/yp_pairinsents_pd.csv'
fetch_words_forsents(effect_file,wdpair_file,res_file)
#139708, 0.51890 has more than one treatment word.

In [379]:
test_pd = pd.read_csv(yp_path+'5_Select/7_PairInSents/yp_pairinsents_pd.csv')

In [383]:
test_pd.iloc[0].treat_words

'   csf_effect  ctf_effect  knn_effect source    target  vt_effect\n0    -0.05456     0.00950     0.06667   shop  boutique   -0.01361\n1     0.02197     0.02569    -0.03333   shop      mall   -0.00058\n2    -0.05700    -0.04470     0.03333   shop     store   -0.01193'

#### Find treatment words for a sentence
> Used to select example sentences to present in the paper

In [128]:
def get_treat_words(effect_file,wdpair_file,my_sents):
    """
    Parameters:
        Treatment effect file, treatment word pairs, a given sentence.
    Function:
        Search for all treatment word pairs for a given sentence.
    Return:
        A DataFrame with all treatment word pairs and corresponding treatment effects.
    """
    effect_pd = pd.DataFrame(pickle.load(open(effect_file,'rb')))
    wdpair_pd = pd.read_csv(wdpair_file)
    src_wds = list(set(wdpair_pd.source.values))
    ct_vec = CountVectorizer(binary=True,vocabulary=src_wds)
    #sents = effect_pd[effect_pd.true_y == 0].drop_duplicates(['sentence'],keep='first').sentence.values
    X_sents = ct_vec.fit_transform([my_sents])
    vocab = ct_vec.get_feature_names_out()

    treat_info = []
    row_wdidx = X_sents[0].nonzero()[1]
    if(len(row_wdidx)>0):
        for srci in row_wdidx:
            src_wd = vocab[srci]
            tar_wd_list = wdpair_pd[(wdpair_pd.source == src_wd)].target.values
            for tarj in range(len(tar_wd_list)):
                para_info = {}
                tar_wd = tar_wd_list[tarj]
                select_pd=effect_pd[(effect_pd.source==src_wd)&(effect_pd.target==tar_wd)&(effect_pd.sentence==my_sents)]
                if(select_pd.shape[0]>0):
                    para_info['source']=src_wd
                    para_info['target']=tar_wd
                    para_info['knn_effect']=select_pd.knn_effect.values[0]
                    para_info['vt_effect']=select_pd.vt_effect.values[0]
                    para_info['ctf_effect']=select_pd.ctf_effect.values[0]
                    para_info['csf_effect']=select_pd.csf_effect.values[0]
                    treat_info.append(para_info)
    return pd.DataFrame(treat_info,columns=['source','target','knn_effect','vt_effect','ctf_effect','csf_effect'])
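
Because the effect file already holds one row per (source, target, sentence), the lookup for a single sentence can also be sketched as a filter followed by a merge, without a vectorizer. This is a swapped-in approach for illustration and not the notebook's implementation: it relies on exact string equality for the sentence, as `get_treat_words` does, and returns every matching row rather than only the first one per pair. `treat_words_for` and the toy data are hypothetical.

```python
import pandas as pd

def treat_words_for(effect_pd, wdpair_pd, sentence):
    """Rows of effect_pd for `sentence` whose (source, target) pair is listed in wdpair_pd."""
    rows = effect_pd[effect_pd.sentence == sentence]
    pairs = wdpair_pd[['source', 'target']].drop_duplicates()
    # Inner merge keeps only the pairs present in the word pair file
    return rows.merge(pairs, on=['source', 'target'], how='inner')

effect_pd = pd.DataFrame({
    'source': ['tough', 'tough', 'dude', 'nice'],
    'target': ['hard', 'strict', 'man', 'good'],
    'sentence': ['s1', 's1', 's1', 's2'],
    'knn_effect': [-0.1, -0.4, 0.03, 0.2],
})
wdpair_pd = pd.DataFrame({'source': ['tough', 'dude'], 'target': ['hard', 'man']})
found = treat_words_for(effect_pd, wdpair_pd, 's1')
print(found)
```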

In [129]:
effect_file = airbnb_path+'5_Select/airbnb_wdpair_sents_4effects_posinters_poscom1_bigramcheck_limit_limitvocab.pickle'
wdpair_file = airbnb_path+'5_Select/6_PairRank/0_1/airbnb_pairrank_avgeffect_coef_limitvocab.csv'
#my_sents = 'It\'s predominantly hipsterville, making it very affordable and good for people watching.'
treat_pd = get_treat_words(effect_file,wdpair_file,airbnb_sents[0])

In [134]:
treat_pd.sort_values(by=['csf_effect'],ascending=False)

Unnamed: 0,source,target,knn_effect,vt_effect,ctf_effect,csf_effect
8,exciting,spectacular,0.13333,0.01976,0.17731,0.20004
17,exciting,stunning,0.23333,0.01883,0.16231,0.16514
10,exciting,fantastic,0.23333,0.00894,0.08822,0.1596
9,exciting,wonderful,0.16667,0.02686,0.1216,0.15543
11,exciting,fabulous,0.33333,0.01996,0.15407,0.15069
7,exciting,incredible,0.2,0.0172,0.08051,0.14791
20,exciting,awesome,0.2,0.00926,0.1034,0.14167
21,exciting,excellent,0.23333,0.02928,0.10999,0.13894
16,exciting,great,0.03333,0.01958,0.0301,0.12944
18,exciting,cool,0.16667,0.01985,0.10167,0.11462


In [127]:
airbnb_sents = [
    'I don\'t suggest long walks after dark, but I would definitely not let this neighborhood discourage your stay, it\'s vibrant, fun and exciting. ',
    'If you prefer to enjoy night life then you\' ll like the location even more as every famous club and bar in Hollywood is located in walking distance from the apartment.',
    'The neighborhood features notable historical landmarks that include the 1836 Clarke House, one of Chicago’s oldest residences; a diverse dining scene; blues clubs and other nightlife options; and the convenience of the Museum Campus and Loop and McCormick Place a short stroll away.	The neighborhood features unique historical landmarks that include the 1836 Clarke House, one of Chicago’s oldest residences; a diverse dining scene; blues clubs and other nightlife options; and the convenience of the Museum Campus and Loop and McCormick Place a short stroll away.',
    'We\'re a mile or so to great bars (Bacchanal, Vaughan\'s, BJ\'s), coffee shops (Solo, Satsuma), and cheap eats (Sneaky Pickle, Pizza Delicious, The Joint); as well as the art and music venues of St. Claude Avenue.',
    'If you want great (and cheap) tacos, just walk about three minutes towards Oxnard St. to San Marcos.',
    'Quiet but also in the middle of Hollywood, one side accessible to the hub of Hollywood movie premieres, Jimmy Kimmel, Hotel Roosevelt and renowned locations, one side accessible to tranquility of Runyon Canyon.'
]

In [136]:
test_pd = pd.DataFrame(pickle.load(open(effect_file,'rb')))
test_pd.shape

(723297, 9)

In [158]:
test_pd[(test_pd.source =='store') & (test_pd.target == 'boutique')].sort_values(by=['csf_effect'],ascending=True).iloc[1].sentence

'There is also a Whole Foods grocery store 5 min (0.3 mile) walk northeast, or a Jewell-Osco grocery and pharmacy 3 min - 2 red line stops south on Berwyn.'

In [None]:
#knn: exciting -> interesting, vibrant -> dynamic, definitely -> truely
#vt: exciting -> stunning, vibrant -> dynamic, definitely -> really
#ctf: exciting -> spectacular, vibrant -> dynamic, definitely -> absolutely
#csf: exciting -> spectacular, vibrant -> dynamic, definitely -> absolutely

In [None]:
#knn: cheap -> inexpensive, shops -> stores, venues -> locations
#vt: cheap -> inexpensive, shops -> stores, venues -> locations
#ctf: cheap -> inexpensive, shops -> boutiques, venues -> locations
#csf: cheap -> inexpensive, shops -> boutiques, venues -> sites

In [528]:
effect_file = airbnb_path+'5_Select/airbnb_wdpair_sents_4effects_posinters_poscom1_bigramcheck_limit.pickle'
wdpair_file = airbnb_path+'5_Select/6_PairRank/0_1/airbnb_pairrank_avgeffect_coef.csv'
#treat_pd = get_treat_words(effect_file,wdpair_file,airbnb_sents[-1])

In [539]:
#treat_pd.sort(columns=['knn_effect'],ascending=True)

In [495]:
tw_sents = [
    'Every girl I know is with it and makes nice dinners for their boyfriends while I just order pizza and drink beer with mine #sorrybabe',
    'Hung out with this cutie and my sweetheart at lunch today . #butterfly #flowers #nature http://url',
    'I think I like Luke Hemmings more than I\'ve liked any of my real boyfriends ... Lol',
    'I got burn holes in my hoodies all my homies think its dank .',
    'Salty af I got no classes with my dude @user this shit gonna be tough lol'
]

In [None]:
#knn: boyfriends -> buddies, nice -> good
#vt: boyfriends -> buddies, beer -> brew
#ctf: boyfriends -> buddies, beer -> brew, nice -> good
#csf: boyfriends -> buddies, beer -> brew

In [542]:
effect_file = tw_path+'5_Select/tw_wdpair_sents_4effects_posinters_poscom1_bigramcheck_limit.pickle'
wdpair_file = tw_path+'5_Select/6_PairRank/0_1/tw_pairrank_avgeffect_coef.csv'
treat_pd = get_treat_words(effect_file,wdpair_file,tw_sents[-1])

In [500]:
treat_pd.sort_values(by=['knn_effect'],ascending=True)

Unnamed: 0,source,target,knn_effect,vt_effect,ctf_effect,csf_effect
1,tough,strict,-0.37143,-0.09055,-0.34185,-0.65278
0,tough,hard,-0.13333,-0.02964,-0.08774,-0.2072
2,dude,man,0.03333,-0.01117,-0.04092,-0.04794


In [520]:
yp_sents = [
    'There are different bar areas , the waterfall is gorgeous !',
    'I came for my birthday on a 3 night comp ... my room was gorgeous , the bathroom was fit for a queen , and the service was excellent .',
    'It \'s gorgeous inside so I usually end up taking a bunch of pictures .',
    #'I ordered the vegetarian breakfast and was not disappointed -- great portion and fabulous taste .',
    'Dr. Newland was so thorough in his workup and care .',
    #'The salesperson shows me the item but walks away and answers me while he is still walking away .',
    'one morning I came here and wanted this crazy alcoholic coffee concoction my buddy told me about , its basically 10 % coffee and 90 % booze .',
    'My buddy had the Original `` G '' spicy Po boy , and it was also tasty .',
    'Tasty tasty tasty .',
    'Very fresh , and tasty herbs and spring rolls as well !'
]

In [None]:
#knn: different -> alternative, gorgeous -> terrific
#vt: different -> alternative, gorgeous -> terrific
#ctf: different -> typical, gorgeous -> terrific
#csf: different -> alternative, gorgeous -> outstanding

In [159]:
effect_file = yp_path+'5_Select/yp_wdpair_sents_4effects_posinters_poscom1_bigramcheck_limit.pickle'
effect_df = pd.DataFrame(pickle.load(open(effect_file,'rb')))
wdpair_file = yp_path+'5_Select/6_PairRank/0_1/yp_pairrank_avgeffect_coef.csv'

In [169]:
effect_df = pd.DataFrame(pickle.load(open(effect_file,'rb')))
st_df = effect_df[(effect_df.source=='cute') & (effect_df.target=='attractive')].sort_values(by=['vt_effect'],ascending=True)
#st_df

Unnamed: 0,csf_effect,ctf_effect,id,knn_effect,sentence,source,target,true_y,vt_effect
1543638,0.31043,0.27833,1543638,0.23333,"The server , who was very cute and attractive ...",cute,attractive,1,0.05278
1543529,0.40095,0.26853,1543529,0.30000,"Go and get your drink on , they got a cute vie...",cute,attractive,1,0.06801
1543427,0.42040,0.26442,1543427,0.26667,"Our cute Long Island native , Mary suggested t...",cute,attractive,1,0.07196
1544210,0.35726,0.27552,1544210,0.20000,Not very busy the night we were there ; lots o...,cute,attractive,0,0.07201
1544221,0.32231,0.31974,1544221,0.36667,The veggies were great as well as the tofu and...,cute,attractive,0,0.07230
1543837,0.30839,0.37658,1543837,0.43333,"Very tasty food , fine -LRB- not amazing -RRB-...",cute,attractive,0,0.07883
1543548,0.39580,0.26169,1543548,0.30000,"The wait staff are friendly and cute , our fav...",cute,attractive,1,0.08000
1544436,0.25297,0.28698,1544436,0.36667,Smoking inside and there is a cute bar outside...,cute,attractive,0,0.08115
1543694,0.34551,0.29225,1543694,0.33333,I will be coming back to try out a few other m...,cute,attractive,0,0.08176
1543417,0.40486,0.33363,1543417,0.46667,This location is great as far as customer serv...,cute,attractive,1,0.08429


In [172]:
st_df.iloc[2].sentence

'Our cute Long Island native , Mary suggested the best things on the menu - even telling us what was off and on from the specials board that would work or not .'

In [519]:
effect_df[(effect_df.source=='tasty')].sort_values(by=['csf_effect']).iloc[1].sentence

'Very fresh , and tasty herbs and spring rolls as well !'

In [509]:
effect_df.sort_values(by=['knn_effect'],ascending=True).head()

Unnamed: 0,csf_effect,ctf_effect,id,knn_effect,sentence,source,target,true_y,vt_effect
905001,-0.26992,-0.27337,905001,-1.0,"Great , great , great !",great,fabulous,0,-0.20525
481044,-0.28591,-0.28453,481044,-1.0,Tasty tasty tasty .,tasty,yummy,1,-0.20831
1499699,-0.28051,-0.28531,1499699,-1.0,"Great , great , great !",great,gorgeous,0,-0.21915
2234703,-0.19621,-0.17188,2234703,-0.96667,"overall , delightful .",delightful,adorable,0,-0.07908
1267190,-0.29497,-0.33908,1267190,-0.96667,A superb recommendation .,superb,gorgeous,0,-0.20656


In [526]:
treat_pd = get_treat_words(effect_file,wdpair_file,yp_sents[-1])
treat_pd.sort_values(by=['csf_effect'],ascending=True)
#treat_pd

Unnamed: 0,source,target,knn_effect,vt_effect,ctf_effect,csf_effect
5,tasty,yummy,-0.33333,-0.15257,-0.39367,-0.30843
0,well,beautifully,-0.2,-0.11394,-0.23093,-0.18963
4,tasty,delicious,-0.23333,-0.06241,-0.19966,-0.12515
3,tasty,super,-0.13333,-0.05382,-0.16118,-0.11619
1,tasty,delightful,-0.23333,-0.06429,-0.20855,-0.09204
2,tasty,excellent,0.06667,0.0371,0.07486,0.09396


- Previous results of the function fetch_words_forsents

In [146]:
effect_file = yp_path+'5_Select/yp_wdpair_sents_4effects_posinters_poscom1_bigramcheck_limit.pickle'
wdpair_file = yp_path+'5_Select/6_PairRank/0_1/yp_pairrank_avgeffect_coef.csv'
res_file = yp_path+'5_Select/7_PairInSents/yp_pairinsents.csv'
fetch_words_forsents(effect_file,wdpair_file,res_file)
#139708, 0.51890 has more than one treatment word.

76587, 0.51485 has more than one positive treatment word.


In [147]:
effect_file = tw_path+'5_Select/tw_wdpair_sents_4effects_posinters_poscom1_bigramcheck_limit.pickle'
wdpair_file = tw_path+'5_Select/6_PairRank/0_1/tw_pairrank_avgeffect_coef.csv'
res_file = tw_path+'5_Select/7_PairInSents/tw_pairinsents.csv'
fetch_words_forsents(effect_file,wdpair_file,res_file)
#123895, 0.52578 has more than one treatment word.

71849, 0.52592 has more than one positive treatment word.


In [148]:
effect_file = airbnb_path+'5_Select/airbnb_wdpair_sents_4effects_posinters_poscom1_bigramcheck_limit.pickle'
wdpair_file = airbnb_path+'5_Select/6_PairRank/0_1/airbnb_pairrank_avgeffect_coef.csv'
res_file = airbnb_path+'5_Select/7_PairInSents/airbnb_pairinsents.csv'
fetch_words_forsents(effect_file,wdpair_file,res_file)
#82920, 0.80884 has more than one treatment word.

14585, 0.79944 has more than one positive treatment word.


In [353]:
test_pd = pd.read_csv(yp_path+'5_Select/7_PairInSents/yp_pairinsents.csv')
test_pd.shape

(76587, 3)

In [354]:
test_pd.head()

Unnamed: 0,n_src,src_words,sentence
0,10,"['little', 'entire', 'favorite', 'portions', '...",What I ate : Alaskan crab legs -LRB- LOTS -RRB...
1,9,"['well', 'fabulous', 'everyone', 'place', 'atm...","Yummy food , fabulous coffee , great atmospher..."
2,9,"['whole', 'getting', 'rude', 'birthday', 'awfu...",Once we made it through the pick-up issue and ...
3,8,"['tasty', 'okay', 'such', 'great', 'sure', 'fe...",I had room service there once a few times and ...
4,8,"['trendy', 'fine', 'cute', 'tasty', 'little', ...","Very tasty food , fine -LRB- not amazing -RRB-..."


#### 5.6 Spearman rank correlation for sentences and word pairs. 

In [122]:
def cal_spearmanr(effect_file):
    """
    Parameters:
        effect_file: pickled treatment effect file
    Function:
        Calculate the Spearman rank correlation between the sentence treatment effects computed by every pair of the four methods.
    Return:
        None; the correlation for each pair of methods is printed.
    """
    effect_pd = pd.DataFrame(pickle.load(open(effect_file,'rb')))
    #effect_pd = pd.read_csv(effect_file)
    method = ['knn','vt','ctf','csf']
    for i in range(len(method)):
        for j in range(i+1,len(method)):
            print(method[i]+','+method[j],spearmanr(effect_pd[method[i]+'_effect'].values,effect_pd[method[j]+'_effect'].values))
        print()

In [124]:
effect_file = airbnb_path+'5_Select/airbnb_wdpair_sents_4effects_posinters_poscom1_bigramcheck_limit_limitvocab.pickle'
#effect_pd = pd.DataFrame(pickle.load(open(effect_file,'rb')))
#effect_file = airbnb_path+'5_Select/6_PairRank/0_1/airbnb_pairrank_limitvocab.csv'
#effect_pd = pd.read_csv(effect_file)
effect_pd.shape

(1108, 7)

In [125]:
cal_spearmanr(effect_file)

knn,vt SpearmanrResult(correlation=0.46903437477259624, pvalue=0.0)
knn,ctf SpearmanrResult(correlation=0.56073354989311586, pvalue=0.0)
knn,csf SpearmanrResult(correlation=0.45515039238535887, pvalue=0.0)

vt,ctf SpearmanrResult(correlation=0.82225065729983104, pvalue=0.0)
vt,csf SpearmanrResult(correlation=0.77319946905611348, pvalue=0.0)

ctf,csf SpearmanrResult(correlation=0.73341005244381385, pvalue=0.0)
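
The same pairwise comparison can also be collected into a single correlation matrix. The sketch below is only an illustration on made-up toy values, not the notebook's data; the helper name `spearman_matrix` is hypothetical. It relies on `DataFrame.corr(method="spearman")`, which returns the coefficients but, unlike `spearmanr`, no p-values.

```python
import pandas as pd

def spearman_matrix(effect_df, methods=("knn", "vt", "ctf", "csf")):
    """Return the pairwise Spearman correlations of the '<method>_effect' columns."""
    cols = [m + "_effect" for m in methods]
    return effect_df[cols].corr(method="spearman")

# Toy effects (invented for illustration only)
toy = pd.DataFrame({
    "knn_effect": [0.1, 0.2, 0.3, 0.4],
    "vt_effect":  [0.01, 0.02, 0.03, 0.04],  # same ranking as knn
    "ctf_effect": [0.4, 0.3, 0.2, 0.1],      # reversed ranking
    "csf_effect": [0.1, 0.3, 0.2, 0.4],      # one adjacent swap
})
corr = spearman_matrix(toy)
print(corr.round(3))
```

A matrix makes it easier to see at a glance which methods agree, for example the comparatively high vt/ctf agreement in the output above.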




#### 5.7 Percentage of negative instances among topn high treatment sentences according to each method
> Sort all sentences in descending order of treatment effect and select the top n (according to each method) <br>
> Calculate the percentage of negative instances among the top n <br>

In [None]:
def neg_perct_intopn(effect_file):
    effect_df = pd.DataFrame(pickle.load(open(effect_file,'rb')))
    perct = {}
    for method in ['knn_effect','vt_effect','ctf_effect','csf_effect']:
        # Rank sentences from highest to lowest treatment effect for this method
        st_effect_df = effect_df.sort_values(by=[method],ascending=False)
        topn = {}  # maps n -> fraction of negative instances among the top n
        for n in [100,1000,10000]:
            topn_df = st_effect_df.iloc[:n]
            topn[n] = len(topn_df[topn_df.true_y == 0])/n
        perct[method] = topn
    # Rows are the values of n, columns are the methods
    return pd.DataFrame(perct,columns=['knn_effect','vt_effect','ctf_effect','csf_effect'])

In [None]:
this_path = airbnb_path
prefix = "airbnb"
effect_file = this_path+'5_Select/'+prefix+'_wdpair_sents_4effects_posinters_poscom1_bigramcheck_limitvocab.pickle'
effect_df = pd.DataFrame(pickle.load(open(effect_file,'rb'))).sort_values(by=['knn_effect'],ascending=False)
print(effect_df.shape)

In [None]:
effect_df.iloc[0].sentence

In [None]:
n=100
topn_df = effect_df.iloc[:n]
len(topn_df[topn_df.true_y == 0])/n
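
As a quick sanity check of the top-n computation, the following sketch applies the same idea to a small in-memory frame instead of a pickle file. The helper name `neg_fraction_topn` and the toy values are assumptions made for illustration. Note that it takes the mean over the rows actually available, so it differs from dividing by n when the frame has fewer than n rows.

```python
import pandas as pd

def neg_fraction_topn(effect_df, method, ns=(2, 4)):
    """Fraction of negative instances (true_y == 0) among the top-n rows by `method`."""
    ranked = effect_df.sort_values(by=method, ascending=False)
    return {n: float((ranked["true_y"].iloc[:n] == 0).mean()) for n in ns}

# Toy data (invented): the two highest-effect sentences are negative instances
toy = pd.DataFrame({
    "knn_effect": [0.5, 0.9, 0.1, 0.7],
    "true_y":     [1,   0,   1,   0],
})
result = neg_fraction_topn(toy, "knn_effect")
print(result)
```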

### 6. Revise previous results

#### 6.1 Assign new treatment effects to existing AMT-labeled Airbnb files
- For cases where the treatment effects of existing word pairs and sentences have been recalculated

In [4]:
def assign_new_effects(airbnb_effect_file,airbnb_join_file,airbnb_neweffect):
    """
    Parameters: 
        new effect file, existing word pair and sentence file, output file for the existing word pairs and sentences with new effects assigned
    Function:
        For each existing word pair and sentence, look up the matching row in the new effect file and assign its new effects.
    Return: 
        csv file with following fields:
        ['source','target','src_sentence','tar_sentence','knn_effect','vt_effect','ctf_effect','csf_effect',
        'true_y','id','src_ratings','src_ratings_avg','tar_ratings','tar_ratings_avg','amt_effect']
    """
    airbnb_pd = pickle.load(open(airbnb_effect_file,'rb'))
    join_pd = pd.read_csv(airbnb_join_file)
    
    new_info = []
    for idx,row in join_pd.iterrows():
        new_row = row.to_dict()
        new_pd = airbnb_pd[(airbnb_pd.source == row.source) & (airbnb_pd.target == row.target) & (airbnb_pd.sentence == row.src_sentence)]
        # Overwrite the effects only when exactly one row matches
        if new_pd.shape[0] == 1:
            new_row['knn_effect'] = new_pd.knn_effect.values[0]
            new_row['vt_effect'] = new_pd.vt_effect.values[0]
            new_row['ctf_effect'] = new_pd.ctf_effect.values[0]
            new_row['csf_effect'] = new_pd.csf_effect.values[0]
        else:
            # Zero or multiple matches: keep the old effects and report the row
            print(idx,row.source,row.target,new_pd.shape)
        
        new_info.append(new_row)
        
    pd.DataFrame(new_info).to_csv(airbnb_neweffect,columns=['source','target','src_sentence','tar_sentence',
                                                            'knn_effect','vt_effect','ctf_effect','csf_effect','true_y','id'],
                                                            index=False)

In [7]:
airbnb_effect_file = airbnb_path+'5_Select/airbnb_wdpair_sents_4effects_limitvocab.pickle'
airbnb_join_file = '/data/2/zwang/2018_S_WordTreatment/V2_AMT/6_amt/hits/airbnb_join_old.csv'
airbnb_neweffect = '/data/2/zwang/2018_S_WordTreatment/V2_AMT/6_amt/hits/airbnb_join.csv'
assign_new_effects(airbnb_effect_file,airbnb_join_file,airbnb_neweffect)

48 exciting stunning (2, 9)
60 amazing incredible (0, 9)
91 yummy good (2, 9)
101 predominantly mostly (0, 9)
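
The rows printed above have either no match (shape `(0, 9)`) or more than one match (shape `(2, 9)`) in the new effect file. One way to list such cases without a row-by-row loop might be to count matches per key and left-join the counts. This is a hedged sketch on invented toy frames; `match_counts` is a hypothetical helper, and the column names are assumed to follow those used in `assign_new_effects`.

```python
import pandas as pd

def match_counts(join_df, effect_df):
    """Count how many rows of effect_df match each (source, target, sentence) in join_df."""
    counts = (effect_df.groupby(["source", "target", "sentence"])
                       .size().rename("n_match").reset_index())
    merged = join_df.merge(counts, how="left",
                           left_on=["source", "target", "src_sentence"],
                           right_on=["source", "target", "sentence"])
    # Keys absent from effect_df come back as NaN after the left join
    merged["n_match"] = merged["n_match"].fillna(0).astype(int)
    return merged[["source", "target", "src_sentence", "n_match"]]

# Toy frames (invented): one missing key, one duplicated key, one unique key
join_df = pd.DataFrame({
    "source": ["amazing", "yummy", "cozy"],
    "target": ["incredible", "good", "snug"],
    "src_sentence": ["s1", "s2", "s3"],
})
effect_df = pd.DataFrame({
    "source": ["yummy", "yummy", "cozy"],
    "target": ["good", "good", "snug"],
    "sentence": ["s2", "s2", "s3"],
    "knn_effect": [0.1, 0.2, 0.3],
})
counts = match_counts(join_df, effect_df)
print(counts[counts.n_match != 1])
```

Rows with `n_match` different from 1 are the ones that keep their old effects in `assign_new_effects`.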


In [9]:
test_pd = pickle.load(open(airbnb_path+'5_Select/airbnb_wdpair_sents_4effects_posinters_poscom1_bigramcheck_limit_limitvocab.pickle','rb'))
test_pd.shape

(723297, 9)

In [12]:
my_str = 'From amazing restaurants to the Walk of Fame, Dolby Theater, Chinese Theater, Hollywood and Highland Shopping mall, cafes, Runyon Canyon, Hollywood Sign, Griffith Park and more!'
my_pd = test_pd[(test_pd.source == 'amazing') & (test_pd.target == 'spectacular') & (test_pd.sentence == my_str)]

In [13]:
my_pd

Unnamed: 0,csf_effect,ctf_effect,id,knn_effect,sentence,source,target,true_y,vt_effect
1444493,0.08962,0.12736,1444493,0.06667,"From amazing restaurants to the Walk of Fame, ...",amazing,spectacular,0,0.01797


### test

In [64]:
vocab_bigram_dfdict = pickle.load(open(vocab_bigram_file,'rb'))

In [177]:
vocab_bigram_dfdict['drill']['a drill']

21

In [164]:
vocab_bigram_dfdict['buddy']['my buddy']

354
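
For reference, the bigram check from step 2.2 could be expressed against this nested dictionary roughly as follows. This is a sketch under assumptions, not the notebook's own implementation: `bigram_ok` is a hypothetical helper, the tokens are assumed to be lower-cased, and the dictionary is assumed to map a word to the counts of the bigrams containing it, as the two lookups above suggest.

```python
def bigram_ok(tokens, idx, new_word, bigram_counts, min_count=0):
    """Check whether new_word forms a known bigram with its left or right neighbour.

    tokens: tokenized sentence; idx: position of the word being replaced.
    A bigram counts as known if its count exceeds min_count.
    """
    counts = bigram_counts.get(new_word, {})
    candidates = []
    if idx > 0:
        candidates.append(tokens[idx - 1] + " " + new_word)   # (left_word target_word)
    if idx + 1 < len(tokens):
        candidates.append(new_word + " " + tokens[idx + 1])   # (target_word right_word)
    return any(counts.get(b, 0) > min_count for b in candidates)

# Counts taken from the two lookups shown above; the sentence is a toy example
toy_counts = {"buddy": {"my buddy": 354}, "drill": {"a drill": 21}}
tokens = "my dude had the sandwich".split()
print(bigram_ok(tokens, 1, "buddy", toy_counts))                 # loose threshold
print(bigram_ok(tokens, 1, "buddy", toy_counts, min_count=500))  # stricter threshold
```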

In [112]:
wd_tags_file = '/data/2/zwang/2018_S_WordTreatment/V2_airbnb/1_Process/wd_tags.pickle'

In [113]:
wd_tags_dfdict = pickle.load(open(wd_tags_file,'rb'))

In [121]:
wd_tags_dfdict['predominantly']

Counter({'RB': 202})

In [122]:
wd_tags_dfdict['especially']

Counter({'RB': 7180})
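
These per-word tag counters are what the two POS criteria in step 2.1 operate on. A minimal sketch of both checks, assuming each word maps to a `Counter` of POS tags as shown above; the function names and the entry `toyword` are made up for illustration.

```python
from collections import Counter

def share_any_tag(tags_a, tags_b):
    """2.1.1-style check: the two words have at least one POS tag in common."""
    return len(set(tags_a) & set(tags_b)) > 0

def same_top_tag(tags_a, tags_b):
    """2.1.2-style check: the most common POS tag of both words is the same."""
    if not tags_a or not tags_b:
        return False
    return tags_a.most_common(1)[0][0] == tags_b.most_common(1)[0][0]

# The first two counters are copied from the outputs above; 'toyword' is invented
wd_tags = {
    "predominantly": Counter({"RB": 202}),
    "especially": Counter({"RB": 7180}),
    "toyword": Counter({"JJ": 5, "RB": 1}),
}
print(same_top_tag(wd_tags["predominantly"], wd_tags["especially"]))
print(share_any_tag(wd_tags["predominantly"], wd_tags["toyword"]))
print(same_top_tag(wd_tags["predominantly"], wd_tags["toyword"]))
```

The second criterion is stricter: a word pair can share a tag without sharing its most common tag.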

In [94]:
mystr = "The South is the largest collection of Victorian Brownstones in the world and it is a now a trendy, gentrified neighborhood with amazing restaurants and easy access to all that Boston has to offer."
'end' in mystr

True

In [87]:
'good' in "hi Good morning"

False
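
The `in` operator on strings is a case-sensitive substring test, so it treats 'Good' and 'good' as different and also matches word fragments (for example 'end' inside 'trendy'). If whole-word, case-insensitive matching is what is wanted, a regular expression with word boundaries is one option. The helper `contains_word` below is a hypothetical sketch, not part of the notebook.

```python
import re

def contains_word(sentence, word):
    """True if `word` occurs in `sentence` as a whole word, ignoring case."""
    pattern = r"\b" + re.escape(word) + r"\b"
    return re.search(pattern, sentence, flags=re.IGNORECASE) is not None

print(contains_word("hi Good morning", "good"))
print(contains_word("a trendy, gentrified neighborhood", "end"))
print("end" in "a trendy, gentrified neighborhood")
```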