# IHLT. Semantic Textual Similarity Project

**Authors**: Bernhard Bockenhoff and Lucía Urcelay

**Abstract**: According to the task description paper, "Semantic Textual Similarity (STS) measures the degree of semantic equivalence between two texts". This project aims to compute the semantic textual similarity between pairs of sentences from the SMTeuroparl, MSRvid and MSRpar datasets so that it resembles as closely as possible the gold standard score provided by linguistic experts.

To do this, we first analysed several submissions to the SemEval 2012 workshop to get a general idea of the possible approaches. We then chose the submission "UKP: Computing Semantic Textual Similarity by Combining Multiple Content Similarity Measures" by Daniel Bär, Chris Biemann, Iryna Gurevych and Torsten Zesch as our reference.

The pipeline of our approach can be summarised in the following steps:


1.   *Resource import*: nltk, pandas, numpy, sklearn...
2.   *Train and test data loading*: six datasets, three for training and three for testing (SMTeuroparl, MSRvid, MSRpar), are loaded and concatenated into one training set and one test set
3.   *Preprocessing*: convert everything to lowercase, filter out non-alphanumeric characters, tokenize and lemmatize
4.   *Feature engineering*: definition of the feature functions (similarity measures, sentence processing and manipulation, string similarities, semantic similarities...)
5.   *Feature extraction*: computation of the similarity of the preprocessed sentence pairs with the functions defined in the previous step
6.   *Feature combination*: all the similarity scores obtained are combined using Support Vector Regression (sklearn's `svm.SVR`), and the Pearson correlation between the predictions and the gold standard is calculated on the testing dataset

**Outcome**: Performance is measured by the **Pearson Correlation** between the predicted scores and the gold standard; the best value obtained is **0.7773**.

# Import resources

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
nltk.download('wordnet_ic')

from nltk.corpus import wordnet as wn
from nltk.corpus import wordnet_ic
from nltk.corpus import stopwords
nltk.download('stopwords')
from nltk.stem import WordNetLemmatizer
wnl = WordNetLemmatizer()
brown_ic = wordnet_ic.ic('ic-brown.dat')
import spacy
en = spacy.load('en_core_web_sm')
sw_spacy = en.Defaults.stop_words
from difflib import SequenceMatcher

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.metrics import jaccard_score
from scipy.stats import pearsonr

from sklearn import linear_model
from sklearn import svm

import re


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package wordnet_ic to /root/nltk_data...
[nltk_data]   Package wordnet_ic is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


# Load training and testing data

In [None]:
# Function for reading the MSRpar dataset, because pd.read_csv fails to parse it

def load_dataframe(input_filepath):
  # Read a tab-separated file line by line into a two-column DataFrame
  try:
    data = []
    with open(input_filepath, 'r') as f:
      for line in f.read().splitlines():
        data.append(line.split("\t"))
    df = pd.DataFrame(data, columns = [0, 1])

  except Exception as e:
    raise Exception(f"ERROR while reading {input_filepath}:\n\t{e}")

  return df
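
As a hedged alternative to splitting each line by hand, the standard-library `csv` module can read tab-separated text with quoting disabled. The sketch below assumes (the notebook does not say so) that the `pd.read_csv` parsing error is caused by unbalanced double quotes inside the sentences. `read_pairs` and the sample text are hypothetical and serve only as an illustration.

```python
import csv
import io

def read_pairs(handle):
    """Read tab-separated sentence pairs, treating quote characters as plain text."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [row for row in reader]

# Hypothetical sample: the first sentence contains an unbalanced double quote.
sample = 'He said "stop.\tHe told them to stop.\nA dog runs.\tA dog is running.\n'
pairs = read_pairs(io.StringIO(sample))
print(pairs)
```

`pd.read_csv` accepts a `quoting` argument as well, so passing `quoting=csv.QUOTE_NONE` there may be enough if quotes are indeed the cause.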

In [None]:
##########################################################################################################
                                  # Load training data
##########################################################################################################

dfSMT = pd.read_csv('/content/drive/My Drive/IHLT/finalproject/Data/Train/SMTeuroparl/STS.input.SMTeuroparl.txt',sep='\t',header=None)
dfSMT.columns = ['Sen1', 'Sen2']
dfSMT['gs'] = pd.read_csv('/content/drive/My Drive/IHLT/finalproject/Data/Train/SMTeuroparl/STS.gs.SMTeuroparl.txt',sep='\t',header=None)

dfMSRv = pd.read_csv('/content/drive/My Drive/IHLT/finalproject/Data/Train/MSRvid/STS.input.MSRvid.txt',sep='\t',header=None)
dfMSRv.columns = ['Sen1', 'Sen2']
dfMSRv['gs'] = pd.read_csv('/content/drive/My Drive/IHLT/finalproject/Data/Train/MSRvid/STS.gs.MSRvid.txt',sep='\t',header=None)

# MSRpar training split, read from Data/Train like the other training sets
dfMSRp = load_dataframe('/content/drive/My Drive/IHLT/finalproject/Data/Train/MSRpar/STS.input.MSRpar.txt')
dfMSRp.columns = ['Sen1', 'Sen2']
dfMSRp['gs'] = pd.read_csv('/content/drive/My Drive/IHLT/finalproject/Data/Train/MSRpar/STS.gs.MSRpar.txt',sep='\t',header=None)

# Concatenate the three training datasets
data_train = pd.concat([dfSMT, dfMSRv, dfMSRp])

In [None]:
# Training data

data_train.head(5)

Unnamed: 0,Sen1,Sen2,gs
0,"In Nigeria, Chevron has been accused by the Al...","In Nigeria, the whole ijaw indigenous showed C...",4.2
1,I know that in France they have had whole herd...,"I know that in France, the principle of slaugh...",4.25
2,"Unfortunately, the ultimate objective of a Eur...",Unfortunately the final objective of a Europea...,4.8
3,The right of a government arbitrarily to set a...,The right for a government to draw aside its c...,4.8
4,"The House had also fought, however, for the re...",This Parliament has also fought for this reduc...,4.0


In [None]:
##########################################################################################################
                                  # Load testing data
##########################################################################################################

dfSMT_test = pd.read_csv('/content/drive/My Drive/IHLT/finalproject/Data/Test/SMTeuroparl/STS.input.SMTeuroparl.txt',sep='\t',header=None)
dfSMT_test.columns = ['Sen1', 'Sen2']
dfSMT_test['gs'] = pd.read_csv('/content/drive/My Drive/IHLT/finalproject/Data/Test/SMTeuroparl/STS.gs.SMTeuroparl.txt',sep='\t',header=None)

dfMSRv_test = pd.read_csv('/content/drive/My Drive/IHLT/finalproject/Data/Test/MSRvid/STS.input.MSRvid.txt',sep='\t',header=None)
dfMSRv_test.columns = ['Sen1', 'Sen2']
dfMSRv_test['gs'] = pd.read_csv('/content/drive/My Drive/IHLT/finalproject/Data/Test/MSRvid/STS.gs.MSRvid.txt',sep='\t',header=None)

dfMSRp_test = load_dataframe('/content/drive/My Drive/IHLT/finalproject/Data/Test/MSRpar/STS.input.MSRpar.txt')
dfMSRp_test.columns = ['Sen1', 'Sen2']
dfMSRp_test['gs'] = pd.read_csv('/content/drive/My Drive/IHLT/finalproject/Data/Test/MSRpar/STS.gs.MSRpar.txt',sep='\t',header=None)

data_test = pd.concat([dfSMT_test, dfMSRv_test, dfMSRp_test])

In [None]:
# Testing data

data_test.head(5)

Unnamed: 0,Sen1,Sen2,gs
0,The leaders have now been given a new chance a...,The leaders benefit aujourd' hui of a new luck...,4.5
1,Amendment No 7 proposes certain changes in the...,Amendment No 7 is proposing certain changes in...,5.0
2,Let me remind you that our allies include ferv...,I would like to remind you that among our alli...,4.25
3,The vote will take place today at 5.30 p.m.,The vote will take place at 5.30pm,4.5
4,"The fishermen are inactive, tired and disappoi...","The fishermen are inactive, tired and disappoi...",5.0


# Preprocessing

In [None]:
def lemmas(sen):
  t_list = nltk.word_tokenize(sen) 
  t_POS_list = nltk.pos_tag(t_list)
  li = [lemmatize(pair) for pair in t_POS_list]
  st = ' '.join([str(item) for item in li])
  return (st)

def lemmatize(p):
  translate = {'N': 'n', 'V': 'v','J': 'a','R': 'r'}
  if p[1][0] in {'N','V','J','R'}:
      return wnl.lemmatize(p[0].lower(), pos= translate[p[1][0]])  
  return p[0]
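
The tag translation in `lemmatize` only looks at the first letter of the Penn Treebank tag. A minimal sketch of that mapping in isolation, without any NLTK dependency (`penn_to_wordnet` is a hypothetical helper name, not part of the notebook):

```python
# First letter of a Penn Treebank tag -> WordNet part-of-speech code,
# mirroring the `translate` dictionary used in `lemmatize`.
PENN_TO_WORDNET = {'N': 'n', 'V': 'v', 'J': 'a', 'R': 'r'}

def penn_to_wordnet(tag):
    """Return the WordNet POS for a Penn tag, or None for tags that are left unlemmatized."""
    return PENN_TO_WORDNET.get(tag[:1])

for tag in ['NNS', 'VBD', 'JJR', 'RB', 'DT']:
    print(tag, '->', penn_to_wordnet(tag))
```

Under this scheme every tag starting with `R` is treated as an adverb, which includes the particle tag `RP`.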

In [None]:
def preprocessing(sen):
  sen = sen.lower()
  sen = re.sub('[^0-9a-zA-Z ]+', '', sen)
  sen = lemmas(sen)
  return(sen)

data_train['Sen1'] = data_train.apply(lambda row: (preprocessing(row['Sen1'])), axis=1)
data_train['Sen2'] = data_train.apply(lambda row: (preprocessing(row['Sen2'])), axis=1)

data_test['Sen1'] = data_test.apply(lambda row: (preprocessing(row['Sen1'])), axis=1)
data_test['Sen2'] = data_test.apply(lambda row: (preprocessing(row['Sen2'])), axis=1)
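
To see the effect of the first two preprocessing steps without the lemmatizer, the sketch below applies only the lowercasing and the character filter to one sentence from the test data. `clean` is a hypothetical name used for this illustration.

```python
import re

def clean(sen):
    """Lowercase the sentence and drop every character that is not a letter, digit or space."""
    sen = sen.lower()
    return re.sub('[^0-9a-z ]+', '', sen)

print(clean('The vote will take place today at 5.30 p.m.'))
# the vote will take place today at 530 pm
```

Because punctuation is deleted rather than replaced by a space, tokens such as `5.30` and `p.m.` collapse into `530` and `pm`.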


In [None]:
data_train.head(5)

Unnamed: 0,Sen1,Sen2,gs
0,in nigeria chevron have be accuse by the allij...,in nigeria the whole ijaw indigenous show chev...,4.2
1,i know that in france they have have whole her...,i know that in france the principle of slaught...,4.25
2,unfortunately the ultimate objective of a euro...,unfortunately the final objective of a europea...,4.8
3,the right of a government arbitrarily to set a...,the right for a government to draw aside its c...,4.8
4,the house have also fight however for the redu...,this parliament have also fight for this reduc...,4.0


# Feature engineering

In [None]:
################################  Similarity Measures        ################################

def soft_sim(x,y, alpha, bias):
  setx = set(x)
  sety = set(y)
  return (len( setx.intersection(sety))+ bias)/(alpha* max(len(setx),len(sety))+(1-alpha)*min(len(setx),len(sety)))

def jac_sim (x, y ):
  setx = set(x)
  sety = set(y)
  try:
    jac_sim = len( setx.intersection(sety)) / len(setx.union(sety)) 
  except ZeroDivisionError:  # both sets are empty
    jac_sim = 0
  return jac_sim

def containment(x,y):
  setx = set(x)
  sety = set(y)
  try:
    containment = len(setx.intersection(sety)) / len(setx) 
  except ZeroDivisionError:  # x is empty
    containment = 0
  return containment
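
The three set-based measures can be compared on a small worked example. The following self-contained sketch restates them on token lists; the sentences are shortened versions of a test pair and are used only as an illustration.

```python
def jaccard(x, y):
    """Size of the intersection divided by the size of the union."""
    a, b = set(x), set(y)
    return len(a & b) / len(a | b) if a | b else 0.0

def containment(x, y):
    """Share of the elements of x that also occur in y (not symmetric)."""
    a, b = set(x), set(y)
    return len(a & b) / len(a) if a else 0.0

def soft_sim(x, y, alpha, bias):
    """Biased overlap, normalised by a weighted mix of the larger and smaller set size."""
    a, b = set(x), set(y)
    longer, shorter = max(len(a), len(b)), min(len(a), len(b))
    return (len(a & b) + bias) / (alpha * longer + (1 - alpha) * shorter)

s1 = 'the vote will take place today'.split()
s2 = 'the vote will take place'.split()
print(jaccard(s1, s2))       # 5 shared tokens / 6 tokens in the union
print(containment(s1, s2))   # 5 shared tokens / 6 tokens in s1
print(containment(s2, s1))   # 5 shared tokens / 5 tokens in s2
print(soft_sim(s1, s2, 0.74, -0.06))
```

Containment depends on the argument order, whereas Jaccard and the soft similarity give the same value in both directions.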

In [None]:
################################    Sentence Processing    ################################
def stopword_removal(sen):
  t_list = nltk.word_tokenize(sen) 
  return [word for word in t_list if word.lower() not in sw_spacy]


def ml_syns(sen):
  t_list = nltk.word_tokenize(sen) 
  t_POS_list = nltk.pos_tag(t_list)
  result= [ml_syn(pair, t_list) for pair in t_POS_list]
  return result

def ml_syn(p, context):
  translate = {'N': 'n', 'V': 'v','J': 'a','R': 'r'}
  if p[1][0] in {'N','V','J','R'}:
    if nltk.wsd.lesk(context, p[0].lower(), translate[p[1][0]]) is None:
      return lemmatize(p)
    else:
      return nltk.wsd.lesk(context, p[0].lower(), translate[p[1][0]]).name()
  return p[0]

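def numberOfSynset(word):
  # word is a (token, POS tag) pair; returns the synset with the highest lemma count, or None
  # Assumption: the undefined `translation` lookup is the same POS map used in `lemmatize`
  translate = {'N': 'n', 'V': 'v','J': 'a','R': 'r'}
  name = None
  maximum = -1
  if word[1][0] in translate:
    for synset in wn.synsets(word[0], translate[word[1][0]]):
      count = sum([l.count() for l in synset.lemmas()])
      if maximum < count:
        maximum = count
        name = synset
  return name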

In [None]:
################################    Word similarity    ################################
def pairwise_word_similarity(sen1, sen2, type='res', onlydifference=False):
  syns1 = ml_syns1(sen1)
  syns2 = ml_syns1(sen2)

  if onlydifference:
    # keep only the synsets that are not shared; both differences use the original lists
    syns1, syns2 = list(set(syns1) - set(syns2)), list(set(syns2) - set(syns1))

  if type == 'res' or type == 'lin':
    syns1 = [x for x in syns1 if x.pos() != 'a' and x.pos() != 'r']
    syns2 = [x for x in syns2 if x.pos() != 'a' and x.pos() != 'r']
    if (len(syns1) == 0)or (len(syns2) == 0):
      return 10

  res12 = np.zeros([len(syns1),len(syns2)], dtype=np.float64)
  res21 = np.zeros([len(syns2),len(syns1)], dtype=np.float64)

  for x, synset1 in enumerate(syns1):   
    for y, synset2 in enumerate(syns2):
      if type == 'path':
        res12[x][y] = synset1.path_similarity(synset2)
        res21[y][x] = synset2.path_similarity(synset1)

      if type == 'wup':
        res12[x][y] = synset1.wup_similarity(synset2)
        res21[y][x] = synset2.wup_similarity(synset1)

      if type == 'lch':
        res12[x][y] = synset1.lch_similarity(synset2)
        res21[y][x] = synset2.lch_similarity(synset1)

      if synset1.pos() == synset2.pos():  
        try:  
          if type == 'lin':
            res12[x][y] = synset1.lin_similarity(synset2, brown_ic)
            res21[y][x] = synset2.lin_similarity(synset1, brown_ic)

          if type == 'res':
            res12[x][y] =synset1.res_similarity(synset2, brown_ic)
            res21[y][x] =synset2.res_similarity(synset1, brown_ic)

        except:
          print(synset1.pos())
          res12[x][y] = 0
          res21[y][x] = 0

  non12 = np.nan_to_num(res12)
  non21 =np.nan_to_num(res21)
  try:
    return (np.mean(np.amax(non12, axis=1)) + np.mean(np.amax(non21, axis=1)))/2
  except:
    return 1


def ml_syns1(sen):
  t_list = nltk.word_tokenize(sen) 
  t_POS_list = nltk.pos_tag(t_list)
  result= [ml_syn1(pair, t_list) for pair in t_POS_list]
  result = [x for x in result if x is not None]
  return result

def ml_syn1(p, context):
  translate = {'N': 'n', 'V': 'v','J': 'a','R': 'r'} #look into translations
  if p[1][0] in {'N','V','J','R'}:
    if nltk.wsd.lesk(context, p[0].lower(), translate[p[1][0]]) is None:
      return 
    else:
      return nltk.wsd.lesk(context, p[0].lower(), translate[p[1][0]])
  return
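
The aggregation at the end of `pairwise_word_similarity` takes, for every word of one sentence, its best match in the other sentence, averages these maxima, and then averages the two directions. A minimal sketch on a toy score matrix, in plain Python and with hypothetical names and values:

```python
def best_match_average(scores):
    """Mean over the rows of the row-wise maximum of a non-empty score matrix."""
    return sum(max(row) for row in scores) / len(scores)

def symmetric_similarity(scores_12):
    """Average the best-match score of both directions.

    scores_12[i][j] is a hypothetical similarity between word i of sentence 1
    and word j of sentence 2. The measure is assumed to be symmetric here, so
    the reverse direction is simply the transposed matrix.
    """
    scores_21 = [list(col) for col in zip(*scores_12)]
    return (best_match_average(scores_12) + best_match_average(scores_21)) / 2

toy = [[1.0, 0.2, 0.1],
       [0.3, 0.5, 0.4]]
print(symmetric_similarity(toy))
```

Averaging both directions penalises a sentence that contains extra words without a good counterpart in the other sentence, which a one-directional best-match score would not register.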

In [None]:
def longestCommonSubstringMeasure(string1, string2):
  a_offset, b_offset, size = SequenceMatcher(None, string1, string2).find_longest_match(0, len(string1), 0, len(string2))
  # normalise the match length by each string's length and average the two ratios
  return (size / len(string1) + size / len(string2))/2
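
To make the role of `find_longest_match` concrete, the sketch below extracts the longest contiguous block shared by two preprocessed test sentences. `longest_common_substring` is a hypothetical helper written for this illustration.

```python
from difflib import SequenceMatcher

def longest_common_substring(a, b):
    """Return the longest contiguous block of characters shared by a and b."""
    match = SequenceMatcher(None, a, b).find_longest_match(0, len(a), 0, len(b))
    return a[match.a:match.a + match.size]

s1 = 'the vote will take place today at 530 pm'
s2 = 'the vote will take place at 530pm'
common = longest_common_substring(s1, s2)
print(repr(common), len(common))
```

Here the shared block is the common prefix including its trailing space, so the measure operates on characters and not on tokens.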

In [None]:
def generate_N_grams(text, ngram=1, stw=False):
  words = text.split(' ')
  if stw:
    # build the stopword set once instead of once per word
    stop = set(stopwords.words('english'))
    words = [word for word in words if word not in stop]
  # zip over `ngram` shifted copies of the word list to form the n-grams
  temp = zip(*[words[i:] for i in range(0,ngram)])
  ans = [' '.join(gram) for gram in temp]
  return ans
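
The `zip` idiom used to build the n-grams may be easier to follow on a short input. A self-contained sketch without stopword filtering (`word_ngrams` is a hypothetical name):

```python
def word_ngrams(text, n):
    """Return the list of word n-grams of a space-separated string."""
    words = text.split(' ')
    # zip over n shifted copies of the word list; zip stops at the shortest copy
    return [' '.join(gram) for gram in zip(*[words[i:] for i in range(n)])]

print(word_ngrams('the vote will take place', 2))
# ['the vote', 'vote will', 'will take', 'take place']
print(word_ngrams('the vote', 3))
# []
```

A sentence with fewer than `n` words yields an empty list; when both sentences do, `jac_sim` falls back to 0 through its exception handler.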

In [None]:
def char_N_grams(text, ngram):
  letters = [letter for letter in list(text) if letter not in set(' ')]
  temp=zip(*[letters[i:] for i in range(0,ngram)])
  ans=[' '.join(ngram) for ngram in temp]
  return ans
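
Character n-grams are built the same way after the spaces have been removed, so they can span word boundaries. A minimal sketch with a hypothetical helper `char_ngrams`; unlike the notebook function it joins the letters without a separator, which does not affect the set comparison:

```python
def char_ngrams(text, n):
    """Return the character n-grams of a string, ignoring spaces."""
    letters = [c for c in text if c != ' ']
    return [''.join(gram) for gram in zip(*[letters[i:] for i in range(n)])]

print(char_ngrams('a dog', 3))
# ['ado', 'dog']
```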


In [None]:
def pos_tag_ngram(sen, ngram):
  t_list = nltk.word_tokenize(sen) 
  t_POS_list = nltk.pos_tag(t_list)
  ans = [x[1] for x in t_POS_list]
  temp=zip(*[ans[i:] for i in range(0,ngram)])
  ans=[' '.join(ngram) for ngram in temp]
  return ans

# Feature extraction

In [None]:
##########################################################################################################
                                  # Function for extracting features
##########################################################################################################

def extract_features(data):
  
  # Longest Common Substring
  data['LCStr'] =  data.apply(lambda row: longestCommonSubstringMeasure(row['Sen1'], row['Sen2']), axis=1)

  # Raw preprocessed sentences (the strings are compared as sets of characters)
  data['soft preprocessed'] =  data.apply(lambda row: soft_sim(row['Sen1'], row['Sen2'], 0.74, -0.06), axis=1)
  data['JS preprocessed'] =  data.apply(lambda row: jac_sim(row['Sen1'], row['Sen2']), axis=1)

  # Synsets
  data['soft synsets'] =  data.apply(lambda row: soft_sim(ml_syns(row['Sen1']),ml_syns(row['Sen2']), 0.74, -0.06), axis=1)
  data['JS Synsets'] =  data.apply(lambda row: jac_sim(ml_syns(row['Sen1']), ml_syns(row['Sen2'])), axis=1)

  # Character 2- to 7-grams (Jaccard and containment)
  data['char 2 gram jac'] =  data.apply(lambda row: jac_sim(char_N_grams(row['Sen1'],2), char_N_grams(row['Sen2'],2)), axis=1)
  data['char 3 gram jac'] =  data.apply(lambda row: jac_sim(char_N_grams(row['Sen1'],3), char_N_grams(row['Sen2'],3)), axis=1)
  data['char 4 gram jac'] =  data.apply(lambda row: jac_sim(char_N_grams(row['Sen1'],4), char_N_grams(row['Sen2'],4)), axis=1)
  data['char 2 gram cont'] =  data.apply(lambda row: containment(char_N_grams(row['Sen1'],2), char_N_grams(row['Sen2'],2)), axis=1)
  data['char 3 gram cont'] =  data.apply(lambda row: containment(char_N_grams(row['Sen1'],3), char_N_grams(row['Sen2'],3)), axis=1)
  data['char 4 gram cont'] =  data.apply(lambda row: containment(char_N_grams(row['Sen1'],4), char_N_grams(row['Sen2'],4)), axis=1)
  data['char 5 gram jac'] =  data.apply(lambda row: jac_sim(char_N_grams(row['Sen1'],5), char_N_grams(row['Sen2'],5)), axis=1)
  data['char 6 gram jac'] =  data.apply(lambda row: jac_sim(char_N_grams(row['Sen1'],6), char_N_grams(row['Sen2'],6)), axis=1)
  data['char 7 gram jac'] =  data.apply(lambda row: jac_sim(char_N_grams(row['Sen1'],7), char_N_grams(row['Sen2'],7)), axis=1)
  data['char 5 gram cont'] =  data.apply(lambda row: containment(char_N_grams(row['Sen1'],5), char_N_grams(row['Sen2'],5)), axis=1)
  data['char 6 gram cont'] =  data.apply(lambda row: containment(char_N_grams(row['Sen1'],6), char_N_grams(row['Sen2'],6)), axis=1)
  data['char 7 gram cont'] =  data.apply(lambda row: containment(char_N_grams(row['Sen1'],7), char_N_grams(row['Sen2'],7)), axis=1)

  # Word 1- to 4-grams (Containment, w/o stopwords)
  data['Con 1gram w/o stw'] =  data.apply(lambda row: containment(generate_N_grams(row['Sen1'],1,True), generate_N_grams(row['Sen2'],1,True)), axis=1)
  data['Con 2gram w/o stw'] =  data.apply(lambda row: containment(generate_N_grams(row['Sen1'],2,True), generate_N_grams(row['Sen2'],2,True)), axis=1)
  data['Con 3gram w/o stw'] =  data.apply(lambda row: containment(generate_N_grams(row['Sen1'],3,True), generate_N_grams(row['Sen2'],3,True)), axis=1)
  data['Con 4gram w/o stw'] =  data.apply(lambda row: containment(generate_N_grams(row['Sen1'],4,True), generate_N_grams(row['Sen2'],4,True)), axis=1)

  # Word 1-,3-, and 4-grams (Jaccard)
  data['JS 1gram'] =  data.apply(lambda row: jac_sim(generate_N_grams(row['Sen1'],1), generate_N_grams(row['Sen2'],1)), axis=1)
  data['JS 3gram'] =  data.apply(lambda row: jac_sim(generate_N_grams(row['Sen1'],3), generate_N_grams(row['Sen2'],3)), axis=1)
  data['JS 4gram'] =  data.apply(lambda row: jac_sim(generate_N_grams(row['Sen1'],4), generate_N_grams(row['Sen2'],4)), axis=1)
  
  # Word 1-,3-, and 4-grams (Jaccard, w/o stopwords)
  data['JS 1gram w/o stw'] =  data.apply(lambda row: jac_sim(generate_N_grams(row['Sen1'],1,True), generate_N_grams(row['Sen2'],1,True)), axis=1)
  data['JS 3gram w/o stw'] =  data.apply(lambda row: jac_sim(generate_N_grams(row['Sen1'],3,True), generate_N_grams(row['Sen2'],3,True)), axis=1)
  data['JS 4gram w/o stw'] =  data.apply(lambda row: jac_sim(generate_N_grams(row['Sen1'],4,True), generate_N_grams(row['Sen2'],4,True)), axis=1)

  # Pos Tag Ngram
  data['POS 1gram jac'] =  data.apply(lambda row: jac_sim(pos_tag_ngram(row['Sen1'],1), pos_tag_ngram(row['Sen2'],1)), axis=1)
  data['POS 2gram jac'] =  data.apply(lambda row: jac_sim(pos_tag_ngram(row['Sen1'],2), pos_tag_ngram(row['Sen2'],2)), axis=1)
  data['POS 3gram jac'] =  data.apply(lambda row: jac_sim(pos_tag_ngram(row['Sen1'],3), pos_tag_ngram(row['Sen2'],3)), axis=1)
  
  data['POS 1gram cont'] =  data.apply(lambda row: containment(pos_tag_ngram(row['Sen1'],1), pos_tag_ngram(row['Sen2'],1)), axis=1)
  data['POS 2gram cont'] =  data.apply(lambda row: containment(pos_tag_ngram(row['Sen1'],2), pos_tag_ngram(row['Sen2'],2)), axis=1)
  data['POS 3gram cont'] =  data.apply(lambda row: containment(pos_tag_ngram(row['Sen1'],3), pos_tag_ngram(row['Sen2'],3)), axis=1)

  # Pairwise Word Similarity
  data['res dif'] =  data.apply(lambda row: pairwise_word_similarity(row['Sen1'],row['Sen2'],'res', True), axis=1)
  data['lin'] =  data.apply(lambda row: pairwise_word_similarity(row['Sen1'],row['Sen2'],'lin'), axis=1)
  data['wup'] =  data.apply(lambda row: pairwise_word_similarity(row['Sen1'],row['Sen2'],'wup'), axis=1)

  # Stopword Soft Similarity 
  data['soft stopword'] =  data.apply(lambda row: soft_sim(stopword_removal(row['Sen1']),stopword_removal(row['Sen2']), 0.74, -0.06), axis=1)

  return data

In [None]:
##########################################################################################################
                                  # Extract features from data
##########################################################################################################

data_train = extract_features(data_train)
data_test = extract_features(data_test)



In [None]:
data_train.head(5)

Unnamed: 0,Sen1,Sen2,gs,LCStr,soft preprocessed,JS preprocessed,soft synsets,JS Synsets,char 2 gram jac,char 3 gram jac,char 4 gram jac,char 2 gram cont,char 3 gram cont,char 4 gram cont,char 5 gram jac,char 6 gram jac,char 7 gram jac,char 5 gram cont,char 6 gram cont,char 7 gram cont,Con 1gram w/o stw,Con 2gram w/o stw,Con 3gram w/o stw,Con 4gram w/o stw,JS 1gram,JS 3gram,JS 4gram,JS 1gram w/o stw,JS 3gram w/o stw,JS 4gram w/o stw,POS 1gram jac,POS 2gram jac,POS 3gram jac,POS 1gram cont,POS 2gram cont,POS 3gram cont,res dif,lin,wup,soft stopword
0,in nigeria chevron have be accuse by the allij...,in nigeria the whole ijaw indigenous show chev...,4.2,0.175141,0.997391,1.0,0.725619,0.588235,0.578512,0.398936,0.334928,0.714286,0.551471,0.496454,0.286364,0.246696,0.214592,0.440559,0.391608,0.34965,0.647059,0.25,0.133333,0.071429,0.542857,0.098039,0.058824,0.5,0.074074,0.038462,0.642857,0.375,0.214286,0.692308,0.521739,0.346154,0.647161,0.70139,0.77242,0.674476
1,i know that in france they have have whole her...,i know that in france the principle of slaught...,4.25,0.178571,0.96482,0.956522,0.687213,0.535714,0.521277,0.364341,0.273973,0.753846,0.61039,0.506329,0.212903,0.176101,0.148148,0.4125,0.35443,0.307692,0.777778,0.375,0.0,0.0,0.653846,0.097561,0.04878,0.538462,0.0,0.0,0.75,0.235294,0.125,0.818182,0.4,0.25,0.631904,0.708967,0.805906,0.644252
2,unfortunately the ultimate objective of a euro...,unfortunately the final objective of a europea...,4.8,0.373016,0.951818,0.913043,0.729492,0.6,0.625,0.485507,0.451389,0.779221,0.644231,0.607477,0.421769,0.393333,0.359477,0.579439,0.551402,0.514019,0.6,0.333333,0.25,0.142857,0.6,0.21875,0.121212,0.428571,0.142857,0.076923,1.0,0.545455,0.44,1.0,0.666667,0.578947,1.037632,0.625,0.703875,0.548889
3,the right of a government arbitrarily to set a...,the right for a government to draw aside its c...,4.8,0.243478,0.997143,1.0,0.74625,0.6,0.73494,0.59292,0.512195,0.859155,0.752809,0.684783,0.434109,0.365672,0.302158,0.615385,0.544444,0.47191,0.777778,0.375,0.0,0.0,0.684211,0.064516,0.033333,0.636364,0.0,0.0,0.888889,0.470588,0.208333,0.888889,0.615385,0.333333,1.643078,0.863771,0.788657,0.771111
4,the house have also fight however for the redu...,this parliament have also fight for this reduc...,4.0,0.079812,0.966302,0.958333,0.475191,0.333333,0.604027,0.338521,0.254967,0.714286,0.450777,0.361502,0.223602,0.189759,0.169643,0.325792,0.285068,0.259091,0.428571,0.25,0.105263,0.055556,0.428571,0.088235,0.043478,0.321429,0.064516,0.033333,0.727273,0.576923,0.358974,0.8,0.652174,0.466667,1.673213,0.493735,0.687988,0.442094


In [None]:
data_test.head(5)

Unnamed: 0,Sen1,Sen2,gs,LCStr,soft preprocessed,JS preprocessed,soft synsets,JS Synsets,char 2 gram jac,char 3 gram jac,char 4 gram jac,char 2 gram cont,char 3 gram cont,char 4 gram cont,char 5 gram jac,char 6 gram jac,char 7 gram jac,char 5 gram cont,char 6 gram cont,char 7 gram cont,Con 1gram w/o stw,Con 2gram w/o stw,Con 3gram w/o stw,Con 4gram w/o stw,JS 1gram,JS 3gram,JS 4gram,JS 1gram w/o stw,JS 3gram w/o stw,JS 4gram w/o stw,POS 1gram jac,POS 2gram jac,POS 3gram jac,POS 1gram cont,POS 2gram cont,POS 3gram cont,res dif,lin,wup,soft stopword
0,the leader have now be give a new chance and l...,the leader benefit aujourd hui of a new luck a...,4.5,0.1375,0.806667,0.68,0.377382,0.24,0.291667,0.193548,0.14,0.488372,0.352941,0.269231,0.098039,0.067961,0.038462,0.196078,0.14,0.081633,0.5,0.0,0.0,0.0,0.347826,0.0,0.0,0.307692,0.0,0.0,0.7,0.368421,0.130435,0.875,0.583333,0.25,0.470321,0.424578,0.478894,0.526738
1,amendment no 7 proposes certain change in the ...,amendment no 7 be propose certain change in th...,5.0,0.642857,0.954904,0.944444,0.846678,0.769231,0.857143,0.816667,0.761905,0.933333,0.907407,0.872727,0.703125,0.646154,0.590909,0.833333,0.792453,0.75,0.857143,0.666667,0.4,0.25,0.769231,0.461538,0.307692,0.75,0.25,0.142857,0.875,0.615385,0.461538,0.875,0.8,0.666667,10.0,0.9,0.958333,0.848571
2,let me remind you that our ally include ferven...,i would like to remind you that among our ally...,4.25,0.22973,0.816779,0.708333,0.521682,0.380952,0.414286,0.27907,0.224719,0.568627,0.436364,0.37037,0.186813,0.152174,0.11828,0.320755,0.269231,0.215686,0.428571,0.0,0.0,0.0,0.380952,0.086957,0.0,0.272727,0.0,0.0,0.636364,0.3,0.136364,0.875,0.545455,0.272727,0.583386,0.533131,0.630757,0.453704
3,the vote will take place today at 530 pm,the vote will take place at 530pm,4.5,0.757576,0.916667,0.894737,0.700472,0.6,0.78125,0.71875,0.65625,0.806452,0.766667,0.724138,0.59375,0.53125,0.46875,0.678571,0.62963,0.576923,0.5,0.4,0.25,0.0,0.6,0.333333,0.25,0.428571,0.2,0.0,1.0,0.75,0.5,1.0,0.75,0.571429,0.496858,0.812678,0.847222,0.433036
4,the fisherman be inactive tired and disappointed,the fisherman be inactive tired and disappointed,5.0,1.0,0.996471,1.0,0.991429,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,10.0,1.0,1.0,0.985


# Feature combination



In [None]:
##########################################################################################################
                                  # Prepare data for model
##########################################################################################################

X_train = data_train.drop(['Sen1', 'Sen2'], axis=1)
corr_train = X_train.corr()
X_train = X_train.drop('gs', axis=1)
Y_train = data_train['gs']

X_test = data_test.drop(['Sen1', 'Sen2'], axis=1)
corr_test = X_test.corr()
X_test = X_test.drop('gs', axis=1)
Y_test = data_test['gs'] 

In [None]:
##########################################################################################################
                          # Algorithm selection, model fitting and prediction
##########################################################################################################


classifier = svm.SVR()

classifier.fit(X_train, Y_train)
predictions = classifier.predict(X_test)

corr = pearsonr(Y_test,predictions)[0]
print(classifier , ':',corr)


SVR() : 0.7773784720051551
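
Pearson correlation measures how well the predictions follow the gold standard linearly, not how close they are in absolute value. The sketch below is an illustration and not part of the original pipeline: it reimplements the coefficient in plain Python (`pearson` is a hypothetical helper, the notebook itself uses `scipy.stats.pearsonr`) and applies it to the five gold scores shown in the test preview and to a linearly rescaled copy of them.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

gold = [4.5, 5.0, 4.25, 4.5, 5.0]           # gold-standard scores from the test preview
rescaled = [g * 0.5 + 1.0 for g in gold]    # hypothetical predictions on a different scale
print(pearson(gold, rescaled))
```

The rescaled scores correlate perfectly with the gold standard (up to floating-point error) although none of them matches a gold value, so a high correlation does not by itself imply a small prediction error.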
