## 4 Summary Evaluation 

## 4.1 Building Non-Redundancy Scorers
## 4.1.1 
Train your classifier with the following three features on training data, with summary as
input and “non-redundancy” as gold-standard scores, and report evaluation performance on
test data. For evaluation, please use metrics of Mean Squared Error (MSE) and Pearson
correlation, both calculated between your classifier’s predictions and gold-standard scores of
samples in the test data.
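
As a point of reference for the two metrics, the following is a minimal pure-Python sketch of how MSE and Pearson correlation are defined. The function names `mse` and `pearson_r` and the toy scores are illustrative assumptions; the notebook itself relies on `sklearn.metrics.mean_squared_error` and `scipy.stats.pearsonr`.

```python
import math

def mse(y_true, y_pred):
    """Average of the squared differences between gold scores and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def pearson_r(x, y):
    """Covariance of x and y divided by the product of their standard deviations."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Toy gold scores and predictions, made up for illustration only.
gold = [1.0, 2.0, 3.0, 4.0]
pred = [1.5, 2.0, 2.5, 4.0]
print(mse(gold, pred))        # lower is better
print(pearson_r(gold, pred))  # closer to 1 is better
```

Lower MSE means predictions sit closer to the gold scores, while a higher Pearson value means they rank and scale with the gold scores more consistently, so the two metrics can disagree.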

Each feature implementation is worth 3 points, and building the classifier is worth 3 points.

1. Maximum repetition of unigrams: calculate the frequencies of all unigrams (remove
stop words), and use the maximum value as the feature.

2. Maximum repetition of bigrams: calculate the frequencies of all bigrams, and use the
maximum value as the feature.

3. Maximum sentence similarity: represent each sentence as the average of its word embeddings,
then compute the cosine similarity between every pair of sentences and use the maximum
similarity as the feature. Use the GoogleNews-vectors-negative300.bin.gz word embeddings
from Word2vec (https://code.google.com/archive/p/word2vec/) as input for each
word. Words in a summary that are not covered by Word2vec should be discarded.
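
The first two features can be sketched with a single helper. This is a minimal illustration under stated assumptions: `max_ngram_repetition`, the tiny `STOP_WORDS` set, and whitespace tokenization are hypothetical stand-ins for the NLTK-based preprocessing used in the notebook.

```python
from collections import Counter

def max_ngram_repetition(tokens, n):
    """Return the highest frequency of any n-gram in `tokens` (0 if there are none)."""
    ngrams = zip(*(tokens[i:] for i in range(n)))
    return max(Counter(ngrams).values(), default=0)

# Tiny hypothetical stop-word list; the notebook uses NLTK's English list instead.
STOP_WORDS = {"the", "a", "of", "is"}

text = "the cat sat on the mat and the cat slept"
tokens = text.split()
content_tokens = [t for t in tokens if t not in STOP_WORDS]

print(max_ngram_repetition(content_tokens, 1))  # feature 1: stop words removed
print(max_ngram_repetition(tokens, 2))          # feature 2: all bigrams
```

Passing `default=0` to `max` keeps the helper from raising on a summary that is too short to contain any n-gram.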


In [315]:
import os
import re
import time
import nltk
import gensim
import string
import textstat
import numpy as np
import pandas as pd
from sklearn import svm
from collections import Counter
from scipy.stats import pearsonr
from sklearn.metrics import mean_squared_error
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.linear_model import LinearRegression

In [316]:
## read the data
train_data = pd.read_csv("q4_train_set.csv")
test_data = pd.read_csv("q4_test_set.csv")

In [317]:
def preprocess(data):
    ## get stop words and punc
    stop_words = nltk.corpus.stopwords.words("english")
    punc = string.punctuation
    
    ## replace the NA values with 0
    data.fillna(0,inplace=True)
    data.replace("null", 0, inplace=True)

    ## read the data in sentences format with and without stopwords
    word_list_summary = {}
    word_list_summary_with_stopwords = {}
    pattern = r"[\w\-]+"  ## raw string, so \w and \- reach the regex engine unchanged

    for i, summary in enumerate(data["Summary"]):
        list_of_sentences = nltk.tokenize.sent_tokenize(summary)
        word_list_summary[i] = []
        word_list_summary_with_stopwords[i] = []
        for sentence in list_of_sentences:
            sentence = sentence.replace("\t"," ").replace("\n", " ")
            sentence = sentence.replace("  "," ")
            ## tokenize once and keep word-like tokens only
            tokens = [word for word in nltk.tokenize.word_tokenize(sentence) if re.match(pattern,word) and word not in punc]
            ## the stop-word check is case-sensitive, so capitalized stop words such as "The" are kept
            words_in_sentences = [word for word in tokens if word not in stop_words]
            words_in_sentences_with_stopwords = [word.lower() for word in tokens]
            if words_in_sentences != []:
                word_list_summary[i].append(words_in_sentences)
                word_list_summary_with_stopwords[i].append(words_in_sentences_with_stopwords)
    return (word_list_summary, word_list_summary_with_stopwords)

In [318]:
## get the preprocessed data
train_word_list_summary, train_word_list_summary_with_stopwords = preprocess(train_data)
test_word_list_summary, test_word_list_summary_with_stopwords = preprocess(test_data)

In [306]:
## load the pretrained Word2vec vectors; the module-level name is visible to the functions below
model = gensim.models.KeyedVectors.load_word2vec_format(r'C:\Users\mm199\NLP\HW2\GoogleNews-vectors-negative300.bin', binary=True)

In [319]:
## max pairwise cosine similarity between averaged sentence embeddings
def word_embedding(list_of_sentences):
    dict_word2vec = {}
    for index, word_list in enumerate(list_of_sentences):
        arr = np.zeros(300)
        for word in word_list:
            try:
                arr += model.get_vector(word)
            except KeyError:
                continue  ## word not covered by Word2vec: discard it
        ## the divisor also counts discarded words; cosine similarity is unaffected by this scaling
        dict_word2vec[index] = arr / len(word_list)
    try:
        ## each sorted row normally ends with the self-similarity, so column -2 is the best match with another sentence
        max_val_cosine = max(np.sort(cosine_similarity(list(dict_word2vec.values())))[:,-2])
    except IndexError:
        max_val_cosine = 0.0  ## only one sentence in the summary, so there is no pair to compare
    return max_val_cosine
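
The same feature can also be written as an explicit loop over sentence pairs, which avoids relying on the position of the self-similarity in a sorted row. This is only a sketch: `TOY_VECTORS` is a hypothetical 3-dimensional stand-in for the 300-dimensional GoogleNews vectors, and the helper names are not part of the notebook.

```python
import math
from itertools import combinations

# Hypothetical 3-dimensional embeddings; the real vectors have 300 dimensions.
TOY_VECTORS = {
    "cat": [1.0, 0.0, 0.0],
    "dog": [0.9, 0.1, 0.0],
    "stock": [0.0, 1.0, 0.0],
    "market": [0.0, 0.9, 0.1],
}

def sentence_vector(tokens, vectors):
    """Average the vectors of in-vocabulary tokens; None if no token is covered."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return None
    return [sum(dim) / len(known) for dim in zip(*known)]

def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def max_sentence_similarity(sentences, vectors):
    """Largest cosine similarity over all pairs of distinct sentences (0.0 if no pair)."""
    vecs = [sentence_vector(s, vectors) for s in sentences]
    vecs = [v for v in vecs if v is not None]
    return max((cosine(u, v) for u, v in combinations(vecs, 2)), default=0.0)

sentences = [["cat", "dog"], ["stock", "market"], ["dog", "unknown"]]
print(max_sentence_similarity(sentences, TOY_VECTORS))
```

In this sketch the average is taken over covered words only, and sentences with no covered word are dropped before pairing; both are design choices of the sketch rather than behavior of the cell above.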

In [320]:
## prepare the dataset for part 1 with all features
def create_df_non_red(word_list_summary):
    features = {}
    new_features = {}
    for num in word_list_summary:
        list_of_words = word_list_summary[num]
        ## flatten the sentences into one token list (stop words already removed)
        unigram_words = [i for word in list_of_words for i in word]
        max_uni = max(Counter(unigram_words).values(), default=0)
        ## bigrams are built over the flattened list, so they can span sentence boundaries
        bi_gram = [first + " " + second for first, second in zip(unigram_words, unigram_words[1:])]
        max_bi = max(Counter(bi_gram).values(), default=0)
        max_val_cosine = word_embedding(list_of_words)
        features[num] = {"max_uni" : max_uni, "max_bi" : max_bi, "max_val_cosine" : max_val_cosine}
        ## new features for 4.1.2: number of distinct unigrams and distinct bigrams
        num_unigrams = len(set(unigram_words))
        num_bigrams = len(set(bi_gram))
        new_features[num] = {"num_unigrams" : num_unigrams, "num_bigrams" : num_bigrams}
    df_non_red_features = pd.DataFrame(features).T
    df_new_non_red_features = pd.concat([df_non_red_features, pd.DataFrame(new_features).T], axis = 1)
    return df_new_non_red_features

In [321]:
## create df with all the features for non redundancy
df_new_non_red_features_train = create_df_non_red(train_word_list_summary) 
df_new_non_red_features_test = create_df_non_red(test_word_list_summary) 


In [346]:
## train the model on training dataset
def train_model (X, y):
    clf = LinearRegression()
    clf.fit(X, y)
    return clf

In [323]:
## print the MSE and Pearson's correlation (with its p-value) between predictions and gold scores
def get_accuracy(clf, X, y):
    pred = clf.predict(X)
    mse = mean_squared_error(y, pred)
    r_coeff = pearsonr(pred, y)  ## (correlation, p-value)
    print (" MSE : ", mse, " Pearson's coeff : ", r_coeff)

In [347]:
## get y label for training
y_label_train = train_data["Non-Redundancy"]
y_label_train = [float(i) for i in y_label_train]

## get results
y_label_test = test_data["Non-Redundancy"]
y_label_test = [float(i) for i in y_label_test]

clf = train_model (df_new_non_red_features_train[["max_bi", "max_uni", "max_val_cosine"]], y_label_train)

print ("MSE and pearson's coeff with three features: ")
get_accuracy (clf, df_new_non_red_features_test[["max_bi", "max_uni", "max_val_cosine"]], y_label_test)


MSE and pearson's coeff with three features: 
 MSE :  0.24594064969574916  Pearson's coeff :  (0.6770281804152865, 3.603437070070039e-28)



## 4.1.2 
Design two new features for this task. Add each feature to the classifier built in 4.1.1, and
report MSE and Pearson correlation. You will get 2 bonus points if either of your proposed
features gets better MSE AND Pearson. You will get 4 bonus points if both features improve
the previous classifier’s performance.

In [348]:
## run with new features
clf = train_model (df_new_non_red_features_train[["max_bi", "max_uni", "max_val_cosine","num_unigrams"]], y_label_train)

print ("\nMSE and pearson's coeff with one new feature, the number of distinct unigrams in each summary: ")
get_accuracy (clf, df_new_non_red_features_test[["max_bi", "max_uni", "max_val_cosine","num_unigrams"]], y_label_test)

clf = train_model (df_new_non_red_features_train[["max_bi", "max_uni", "max_val_cosine","num_bigrams"]], y_label_train)

print ("\nMSE and pearson's coeff with another new feature, the number of distinct bigrams in each summary: ")
get_accuracy (clf, df_new_non_red_features_test[["max_bi", "max_uni", "max_val_cosine","num_bigrams"]], y_label_test)

clf = train_model (df_new_non_red_features_train, y_label_train)

print ("\nMSE and pearson's coeff with both additional features: ")
get_accuracy (clf, df_new_non_red_features_test, y_label_test)

print ("\nMSE decreased and Pearson's coeff increased with the new features (individually and combined)")


MSE and pearson's coeff with one new feature, the number of distinct unigrams in each summary: 
 MSE :  0.23378108051889313  Pearson's coeff :  (0.6949437690469922, 3.538334592158559e-30)

MSE and pearson's coeff with another new feature, the number of distinct bigrams in each summary: 
 MSE :  0.23521397491151624  Pearson's coeff :  (0.6925751437462656, 6.647741971187499e-30)

MSE and pearson's coeff with both additional features: 
 MSE :  0.2333698620964361  Pearson's coeff :  (0.6956291624461577, 2.9447800488451285e-30)
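
The bonus criterion (better MSE and better Pearson) can be checked mechanically against the three-feature baseline from 4.1.1. The sketch below uses the scores reported in this section, rounded to five decimals; the helper `improves` is a hypothetical name introduced for illustration.

```python
def improves(baseline, candidate):
    """True if the candidate has strictly lower MSE and strictly higher Pearson r."""
    base_mse, base_r = baseline
    cand_mse, cand_r = candidate
    return cand_mse < base_mse and cand_r > base_r

# (MSE, Pearson r) pairs copied from the outputs in this section, rounded.
baseline = (0.24594, 0.67703)
runs = {
    "num_unigrams": (0.23378, 0.69494),
    "num_bigrams": (0.23521, 0.69258),
    "both": (0.23337, 0.69563),
}

for name, scores in runs.items():
    print(name, improves(baseline, scores))
```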



## 4.2 Building Fluency Scorers


## 4.2.1 
Train your classifier with the following three features on training data, with summary as input
and “fluency” as gold-standard scores, and report evaluation performance on test data. For
evaluation, please use metrics of Mean Squared Error (MSE) and Pearson correlation, both
calculated between your classifier’s predictions and gold-standard scores of samples in the
test data.

Each feature implementation is worth 3 points, and building the classifier is worth 3 points.

1. Total number of repetitive unigrams: count how many unigrams are the same as the
previous unigrams. For example, for a summary “The the article talks talks about
language understanding”, the value should be 2.

2. Total number of repetitive bigrams: count how many bigrams are the same as the
previous bigrams. For example, for a summary “The article the article talks about
about language understanding”, the value should be 1.

3. Minimum Flesch reading-ease score: use tool from https://pypi.org/project/readability/
to get readability score for each sentence, and use the minimum value as the feature.
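
One reading of the two repetition features that reproduces both worked examples (2 and 1) compares each n-gram with the non-overlapping n-gram that ends immediately before it. This is a sketch under that assumption; `count_repeated_ngrams` and the lowercase whitespace tokenization are illustrative choices, not part of the task statement.

```python
def count_repeated_ngrams(tokens, n):
    """Count n-grams identical to the n-gram that ends right before them."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    # The previous non-overlapping n-gram starts n positions earlier.
    return sum(1 for i in range(n, len(ngrams)) if ngrams[i] == ngrams[i - n])

example_1 = "The the article talks talks about language understanding"
example_2 = "The article the article talks about about language understanding"

print(count_repeated_ngrams(example_1.lower().split(), 1))
print(count_repeated_ngrams(example_2.lower().split(), 2))
```

For unigrams this reduces to comparing each token with its immediate predecessor; for bigrams it compares positions two tokens apart, which is what lets "the article the article" count as one repetition.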



In [349]:
## prepare the dataset for part 2
def create_df_fluency(dataset, word_list_summary_with_stopwords):
    fluency_features = {}
    flesch_score_dict = {}
    new_features = {}
    ## Flesch reading ease is computed on the whole summary here, not as a per-sentence minimum
    for index, summary in enumerate(dataset["Summary"]):
        flesch_score_dict[index] = textstat.flesch_reading_ease(summary)
    for num in word_list_summary_with_stopwords:
        list_of_words = word_list_summary_with_stopwords[num]
        unigram_words = [i for word in list_of_words for i in word]
        ## tag once per summary instead of once per token
        pos_tagged = nltk.pos_tag(unigram_words) if unigram_words else []
        uni_rep = 0
        pos_uni_rep = 0
        bi_rep = 0
        pos_bi_rep = 0
        bi_gram = []
        bi_gram_pos = []
        for index in range(len(unigram_words) - 1):
            ## repeated word / repeated POS tag in adjacent positions
            if unigram_words[index] == unigram_words[index+1]:
                uni_rep += 1
            if pos_tagged[index][1] == pos_tagged[index+1][1]:
                pos_uni_rep += 1
            bi_gram.append(unigram_words[index] + " " + unigram_words[index+1])
            bi_gram_pos.append(pos_tagged[index][1] + " " + pos_tagged[index+1][1])
        ## each bigram is compared with the overlapping bigram that starts one token later
        for index in range(len(bi_gram) - 1):
            if bi_gram[index] == bi_gram[index+1]:
                bi_rep += 1
            if bi_gram_pos[index] == bi_gram_pos[index+1]:
                pos_bi_rep += 1
        fluency_features[num] = {"uni_rep" : uni_rep, "bi_rep" : bi_rep, "flesch_score_dict" : flesch_score_dict[num]}
        new_features[num] = {"pos_uni_rep" : pos_uni_rep, "pos_bi_rep" : pos_bi_rep}

    df_fluency_features = pd.DataFrame(fluency_features).T
    df_new_fluency_features = pd.concat([df_fluency_features, pd.DataFrame(new_features).T], axis = 1)
    return df_new_fluency_features
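
The task statement asks for the minimum Flesch reading-ease score over sentences, whereas the cell above scores each summary as a whole. A per-sentence minimum could look like the sketch below. It uses the standard formula 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words); the syllable counter and the regex sentence splitter are crude assumptions made to keep the example self-contained, so its scores will not match those of `textstat`.

```python
import re

def count_syllables(word):
    """Crude heuristic: one syllable per run of vowels, at least one per word."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(sentence):
    """Flesch reading ease of a single sentence (words per sentence = word count)."""
    words = re.findall(r"[A-Za-z]+", sentence)
    if not words:
        return None
    syllables = sum(count_syllables(w) for w in words)
    return 206.835 - 1.015 * len(words) - 84.6 * (syllables / len(words))

def min_flesch(summary):
    """Minimum per-sentence score; sentences are split naively on ., ! and ?."""
    scores = [flesch_reading_ease(s) for s in re.split(r"[.!?]+", summary)]
    scores = [s for s in scores if s is not None]
    return min(scores) if scores else None

summary = "The cat sat. Computational linguistics investigates probabilistic representations."
print(min_flesch(summary))
```

Taking the minimum means one hard-to-read sentence determines the feature value, which is the stated intent of the feature.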

In [351]:
## create df with all the features for fluency
df_new_fluency_features_train = create_df_fluency(train_data, train_word_list_summary_with_stopwords) 
df_new_fluency_features_test = create_df_fluency(test_data, test_word_list_summary_with_stopwords) 

In [352]:
## get y label for training
y_label_train = train_data["Fluency"]
y_label_train = [float(i) for i in y_label_train]

## get results
y_label_test = test_data["Fluency"]
y_label_test = [float(i) for i in y_label_test]

clf = train_model (df_new_fluency_features_train[["uni_rep", "bi_rep", "flesch_score_dict"]], y_label_train)

print ("MSE and pearson's coeff with three features: ")
get_accuracy (clf, df_new_fluency_features_test[["uni_rep", "bi_rep", "flesch_score_dict"]], y_label_test)


MSE and pearson's coeff with three features: 
 MSE :  0.25075492433424956  Pearson's coeff :  (0.2212366148686577, 0.0016425027188746615)


## 4.2.2
Design two new features for this task. Add each feature to the classifier built in 4.2.1, and
report MSE and Pearson correlation. You will get 2 bonus points if either of your proposed
features gets better MSE AND Pearson. You will get 4 bonus points if both features improve
the previous classifier’s performance.

In [353]:
## run with new features
clf = train_model (df_new_fluency_features_train[["uni_rep", "bi_rep", "flesch_score_dict","pos_uni_rep"]], y_label_train)

print ("\nMSE and pearson's coeff with one new feature, the number of adjacent tokens sharing a POS tag in each summary: ")
get_accuracy (clf, df_new_fluency_features_test[["uni_rep", "bi_rep", "flesch_score_dict","pos_uni_rep"]], y_label_test)

clf = train_model (df_new_fluency_features_train[["uni_rep", "bi_rep", "flesch_score_dict","pos_bi_rep"]], y_label_train)

print ("\nMSE and pearson's coeff with another new feature, the number of adjacent identical POS-tag bigrams in each summary: ")
get_accuracy (clf, df_new_fluency_features_test[["uni_rep", "bi_rep", "flesch_score_dict","pos_bi_rep"]], y_label_test)

clf = train_model (df_new_fluency_features_train, y_label_train)

print ("\nMSE and pearson's coeff with both additional features: ")
get_accuracy (clf, df_new_fluency_features_test, y_label_test)

print ("\nMSE decreased and Pearson's coeff increased with the new features (individually and combined)")


MSE and pearson's coeff with one new feature, the number of adjacent tokens sharing a POS tag in each summary: 
 MSE :  0.24683437001413627  Pearson's coeff :  (0.2559373168944643, 0.00025426318721765963)

MSE and pearson's coeff with another new feature, the number of adjacent identical POS-tag bigrams in each summary: 
 MSE :  0.24963642729618507  Pearson's coeff :  (0.23070142556224768, 0.0010140147198971592)

MSE and pearson's coeff with both additional features: 
 MSE :  0.24664295252884258  Pearson's coeff :  (0.25886747745060773, 0.00021451118912109213)

MSE decreased and Pearson's coeff increased with the new features (individually and combined)
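
For reference, the `pos_uni_rep` feature used above counts positions where a token carries the same POS tag as the token before it. The sketch below isolates that count on a hand-tagged toy sentence; the tags are written by hand as a stand-in for `nltk.pos_tag` output, and `count_adjacent_tag_repeats` is a hypothetical helper name.

```python
def count_adjacent_tag_repeats(tagged_tokens):
    """Count positions where a token has the same POS tag as the token before it."""
    tags = [tag for _, tag in tagged_tokens]
    return sum(1 for prev, cur in zip(tags, tags[1:]) if prev == cur)

# Hand-tagged toy sentence (Penn Treebank tags), for illustration only.
tagged = [
    ("the", "DT"), ("big", "JJ"), ("red", "JJ"),
    ("dog", "NN"), ("house", "NN"), ("barks", "VBZ"),
]
print(count_adjacent_tag_repeats(tagged))
```

Here the pairs JJ JJ and NN NN each contribute one repetition. Such runs also occur in fluent text (stacked adjectives, noun compounds), which may explain why the feature gives only a modest gain.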
