# Preprocessing Stages

This project went through several rounds of preprocessing. Rather than being scattered across the individual attempts, they are all documented here.

The main section headings denote different preprocessing attempts. This notebook is not cohesive across sections; it only serves to document the various attempts made throughout the analysis.

## V0 - Initial Preprocessing

This section shows the first preprocessing methods. It uses a subset of the dataset containing only posts that include the word "data" in `post_content`.

In [1]:
import re
import json
import warnings
import string
from pathlib import Path
import collections
from collections import Counter
import pickle

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from bs4 import BeautifulSoup

import gensim
from gensim import corpora, models
# Ignore smart open warning until update pushed to conda: https://github.com/RaRe-Technologies/gensim/pull/2530
warnings.filterwarnings('ignore',message='.+smart_open.+')

import nltk
from nltk.corpus import cmudict, stopwords

import spacy
from spacy.tokens import Doc, Token

from tqdm import tqdm_notebook, tnrange

from FaiText import Tokenizer, Vocab

In [2]:
nlp = spacy.load("en_core_web_md")
#nlp.disable_pipes('parser','ner')

In [3]:
dpath = Path('grad_scrape/gradsop/data')
sop_data = pd.read_csv(dpath/'data_posts_all.csv')
sop_data.head(3)

Unnamed: 0,name,post_content,post_date,seqnum,slug,thread_title,url_id,user_id,user_likes,user_posts,user_threads,username
0,guiaria,"<div class=""pTx""><h2>data field - the area tha...","May 15, 2019",0,statement-applying-german-university-master-83160,Statement of Purpose for applying to German Un...,83160,113287.0,,-,1,guiaria
1,"Maria, EF Contrib","<div class=""pTx"">@guiaria<br/>Hi there! Let's ...","May 15, 2019",1,statement-applying-german-university-master-83160,Statement of Purpose for applying to German Un...,83160,112562.0,248.0,630,-,Maria
2,İlkay Albayrak,"<div class=""pTx"">Hello everyone, I am currentl...","Apr 30, 2019",0,great-investment-future-motivation-msc-data-83059,'great investment for my future' - Motivation ...,83059,113040.0,,1,1,dreiframe


Although I created this dataset myself, it still requires some additional processing before use. I opted not to strip the HTML from the text while scraping, to avoid additional processing time.

In [4]:
# Fill missing values
# np.int was removed in NumPy 1.24; the builtin int is the equivalent dtype
sop_data.loc[:, ['user_posts','user_threads']] = sop_data.loc[:, ['user_posts','user_threads']].apply(lambda x: x.str.replace(',','').replace('-','0')).fillna(0).astype(int)
sop_data['user_likes'] = sop_data.user_likes.fillna(0)
sop_data['user_id'] = sop_data.user_id.fillna(0).astype(int)

In [5]:
df_text = sop_data.copy()
# Clean html
df_text['post_content'] = df_text['post_content'].str.replace('<br/>','\n').apply(lambda x: BeautifulSoup(x, 'html.parser').get_text())
# assign unique id to each document
df_text['doc_id'] = df_text['url_id'].astype(str) + '_' + df_text['seqnum'].astype(str)
df_text = df_text[['doc_id','url_id','user_id','seqnum','post_content']]
df_text.head()

Unnamed: 0,doc_id,url_id,user_id,seqnum,post_content
0,83160_0,83160,113287,0,data field - the area that attracts me the mos...
1,83160_1,83160,112562,1,@guiaria\nHi there! Let's work through your es...
2,83059_0,83059,113040,0,"Hello everyone, I am currently going through a..."
3,83059_1,83059,112562,1,While the formality of your introductory parag...
4,83059_2,83059,113040,2,Thank you so much Maria!\nThis is really helpf...


In [6]:
# seqnum = 0 is the original post, subsequent numbers are the responses
df_text.pivot_table(['post_content','user_id'],['url_id','seqnum'],aggfunc=lambda x: x).head(10)

url_id,seqnum,post_content,user_id
67923,0,Could any one help me with the structure?\n\nA...,96760
67923,1,"Hi Serena, as I read and understand your essay...",91533
67923,2,"Hi, thank you so much for your advice, you kno...",96760
67923,3,"Hi Serena, indeed, it is hard to write let alo...",91533
67973,0,"Hi, everyone.\n\nI am currently applying to Ba...",96803
67973,1,"Hi Akmal, as I read and reviewed your essay, I...",91533
67973,2,"Hi, @justivy03! Thank you for your thoughtful ...",96803
68018,0,"Hi everyone, I have written a statement of pur...",96869
68018,1,"Hi Tommaso, I have read a few paragraphs of yo...",91533
68018,2,"Hi Justivy03, thank you for your feedback! How...",96869


Find all instances where OP added a reply to their post. 

```
OP is when seqnum == 0;
get user_id for OP;
find where user_id = OP.user_id and seqnum != 0;
if any, check if url_id is same;
when true, OP has replied to own post
```
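The steps above can be sketched on a toy frame. This is a minimal illustration rather than the notebook's implementation: the `posts` frame and the `op_user_id` column are made up for the example, and the sketch compares each reply against the starter of its own thread instead of building a pivot table.

```python
import pandas as pd

# Toy stand-in for df_text: two threads, only the first has an OP self-reply
posts = pd.DataFrame({
    "url_id":  [1, 1, 1, 2, 2, 2],
    "seqnum":  [0, 1, 2, 0, 1, 2],
    "user_id": [10, 20, 10, 30, 10, 20],
})

# user_id of each thread starter (seqnum == 0), broadcast to every row of that thread
op_by_thread = posts.loc[posts.seqnum == 0].set_index("url_id")["user_id"]
posts["op_user_id"] = posts["url_id"].map(op_by_thread)

is_op_post = posts.seqnum == 0
# a reply counts as an OP reply only if its author started *this* thread
is_op_reply = (posts.seqnum != 0) & (posts.user_id == posts.op_user_id)
is_critique = (posts.seqnum != 0) & ~is_op_reply

print(posts[is_op_reply])
```

Comparing within a thread keeps user 10's reply in thread 2 among the critiques, even though user 10 started thread 1. A membership test against the ids of all thread starters would not make that distinction.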

In [7]:
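# keyword arguments are required by pandas >= 2.0 and also work on older versions
user_piv = df_text.pivot(index='url_id', columns='seqnum', values='user_id')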
op_multi = user_piv[user_piv.apply(lambda x: x[0] in x[1:].values,axis=1)]
op_ids = op_multi.loc[:,0]
op_multi.head(5)

url_id / seqnum,0,1,2,3,4,5,6,7,8,9,...,19,20,21,22,23,24,25,26,27,28
67923,96760.0,91533.0,96760.0,91533.0,,,,,,,...,,,,,,,,,,
67973,96803.0,91533.0,96803.0,,,,,,,,...,,,,,,,,,,
68018,96869.0,91533.0,96869.0,91533.0,,,,,,,...,,,,,,,,,,
70237,97454.0,98377.0,97454.0,91533.0,97454.0,91533.0,,,,,...,,,,,,,,,,
70641,98742.0,98707.0,98742.0,,,,,,,,...,,,,,,,,,,


In [None]:
# Approx. 1.5min runtime
docs = df_text.post_content.apply(nlp)
df_text['tokens'] = docs

In [None]:
df_text.to_pickle('proc_data/data_all_df_text.pkl')

The four primary dataframes used throughout the analysis:
- all_text (`df_text`): Full texts (original posts + all responses to the posts)
- op_post (`df_origpost`): The initial thread post
- op_resp (`df_op_reply`): Instances where the creator of a thread responds to their own thread
- critiques (`df_critiques`): All non-creator responses to a thread

In [9]:
df_text = pd.read_pickle('proc_data/data_all_df_text.pkl')
#df_text = pickle.load(open('saves/data_all_df_text.pkl','rb'))
df_origpost = df_text.query('seqnum == 0')
df_op_reply = df_text.query('url_id.isin(@op_multi.index) & user_id.isin(@op_ids) & seqnum != 0')
df_critiques = df_text.drop(df_op_reply.index).query('seqnum > 0')

In [10]:
df_op_reply[df_op_reply['post_content'].apply(len) > 1000].head(10)

Unnamed: 0,doc_id,url_id,user_id,seqnum,post_content,tokens
50,73925_3,73925,100360,3,Hi please evaluate these two sections:\nIntrod...,"(Hi, please, evaluate, these, two, sections, :..."
55,75109_2,75109,101299,2,I have chosen to apply to the University of Am...,"(I, have, chosen, to, apply, to, the, Universi..."
97,75090_21,75090,101268,21,@Holt for thesis course plzz review this\nDuri...,"(@Holt, for, thesis, course, plzz, review, thi..."
154,72896_2,72896,99871,2,"@Holt\n\nHi Holt, I really appreciate your fee...","(@Holt, \n\n, Hi, Holt, ,, I, really, apprecia..."
158,72896_4,72896,99871,4,"@Holt\n\nHi Holt, I appreciate your suggestion...","(@Holt, \n\n, Hi, Holt, ,, I, appreciate, your..."
218,79397_2,79397,106917,2,STATEMENT OF PURPOSE\n\nI have been working in...,"(STATEMENT, OF, PURPOSE, \n\n, I, have, been, ..."
247,81045_3,81045,109897,3,Thanks for the inpust komziee and Holt\n\n@Hol...,"(Thanks, for, the, inpust, komziee, and, Holt,..."
278,81613_2,81613,110661,2,"Dear @Holt , thank you so much for your feedba...","(Dear, @Holt, ,, thank, you, so, much, for, yo..."
282,81613_5,81613,110661,5,Dear @Holt this is my revision. As much as I c...,"(Dear, @Holt, this, is, my, revision, ., As, m..."
352,82046_2,82046,111372,2,"Great thanks for your advice, Mary! I also wor...","(Great, thanks, for, your, advice, ,, Mary, !,..."


There are a few instances where the original poster revised something and asked again for advice. To simplify things, we will disregard these posts for the time being and limit replies purely to those written by other individuals.

## V1 - Custom HTML tokenizer

In this stage, I attempted to create a custom tokenizer that adds new tokens based on HTML markup. The premise is similar to the approach that Fast.ai employs, so its tokenizer seemed like a reasonable class to extend.

This version of the corpus was used for building a language model.

Much of the interesting code for this section is in separate files:
* `FaiText.py` - A standalone version of Fast.ai's text module.
* `HTMLutils.py` - The custom HTML tokenizing functions, written in a style similar to fast.ai's

In [None]:
import warnings
import string
from pathlib import Path
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import unidecode
# Ignore smart open warning until update pushed to conda: https://github.com/RaRe-Technologies/gensim/pull/2530
warnings.filterwarnings('ignore',message='.+smart_open.+')
import spacy

from FaiText import Tokenizer, defaults
import HTMLutils as HU

In [None]:
nlp = spacy.load("en_core_web_md")

In [18]:
dpath = Path('grad_scrape/gradsop/data')

sop_data = pd.read_csv(dpath/'posts_all.csv')

sop_data.head()

Unnamed: 0,name,post_content,post_date,seqnum,slug,url_id,user_id,user_likes,user_posts,user_threads,username
0,Yeu Deck Ngui,"<div class=""pTx"">Hi guys, I would need help re...","Mar 16, 2019",0,study-plan-canada-permit-meng-civil-env-82750,82750,112438.0,,-,1,decimo1993
1,Mary Rose,"<div class=""pTx"">Yeu, I am afraid that this is...","Mar 17, 2019",1,study-plan-canada-permit-meng-civil-env-82750,82750,99412.0,2001.0,7529,-,Holt
2,snow,"<div class=""pTx""><span class=""b"">Please confin...","Mar 17, 2019",0,application-statement-msc-electronic-82755,82755,112507.0,,-,1,Airydisc
3,nkitha,"<div class=""pTx"">I really need help in this pa...","Mar 21, 2019",0,goal-study-plan-korean-university-82784,82784,112553.0,,-,1,itha20
4,"Constance, EF Contrib","<div class=""pTx"">Since this is an academic pap...","Mar 22, 2019",1,application-statement-msc-electronic-82755,82755,112560.0,9.0,19,-,Constance


In [19]:
# Fill missing values
sop_data.loc[:, ['user_posts','user_threads']] = sop_data.loc[:, ['user_posts','user_threads']].apply(lambda x: x.str.replace(',','').replace('-','0')).fillna(0).astype(int)
sop_data['user_likes'] = sop_data.user_likes.fillna(0)
sop_data['user_id'] = sop_data.user_id.fillna(0).astype(int)

```
<span class="q"> -> xxquo
<del> -> xxdel
<span class="r">,<span class="b">,<span class="g"> -> xxhl (or xxred, xxblu,xxgrn)
<b>,<strong> -> xxbld
<em>,<i> -> xxitl
```
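The mapping above can be sketched with the standard library's `HTMLParser`. This is not the code in `HTMLutils.py`, which is not shown here: the class name, the two lookup tables, and the `b`/`e` (begin/end) suffixes are assumptions, with the suffixes modelled on the `xxhlb` and `xxhle` pair that appears in the tokenizer output later in this section.

```python
from html.parser import HTMLParser

# Hypothetical lookup tables built from the mapping listed above
SPAN_CLASS_STEMS = {"q": "xxquo", "r": "xxhl", "b": "xxhl", "g": "xxhl"}
TAG_STEMS = {"del": "xxdel", "b": "xxbld", "strong": "xxbld",
             "em": "xxitl", "i": "xxitl"}


class MarkupToTokens(HTMLParser):
    """Replace selected HTML markup with begin/end marker tokens."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []    # output text fragments
        self.stack = []  # (tag, stem or None) for every open tag

    def handle_starttag(self, tag, attrs):
        if tag == "br":  # line breaks become newlines, nothing to close
            self.out.append("\n")
            return
        if tag == "span":
            stem = SPAN_CLASS_STEMS.get(dict(attrs).get("class"))
        else:
            stem = TAG_STEMS.get(tag)
        self.stack.append((tag, stem))
        if stem:
            self.out.append(f" {stem}b ")  # assumed begin marker

    def handle_startendtag(self, tag, attrs):
        if tag == "br":  # self-closing <br/>
            self.out.append("\n")

    def handle_endtag(self, tag):
        # close the most recent matching open tag; ignore stray end tags
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                stem = self.stack[i][1]
                del self.stack[i:]
                if stem:
                    self.out.append(f" {stem}e ")  # assumed end marker
                return

    def handle_data(self, data):
        self.out.append(data)


def html_to_marked_text(html):
    parser = MarkupToTokens()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


sample = '<div class="pTx"><span class="b">Please confine.<br/>More</span><br/><br/>Application</div>'
print(repr(html_to_marked_text(sample)))
```

Tags without an entry in either table, such as the wrapping `<div>`, are dropped and only their text is kept.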

In [14]:
defaults.text_pre_rules

[<function FaiText.fix_html(x: str) -> str>,
 <function FaiText.replace_rep(t: str) -> str>,
 <function FaiText.replace_wrep(t: str) -> str>,
 <function FaiText.spec_add_spaces(t: str) -> str>,
 <function FaiText.rm_useless_spaces(t: str) -> str>]

In [20]:
tknzr = Tokenizer(
    pre_rules=[HU.HTMLTokenizer.tokenize, fix_html, replace_rep, replace_wrep, spec_add_spaces, rm_useless_spaces],
    post_rules = [replace_all_caps, deal_caps],
    special_cases=defaults.text_spec_tok+HU.CSPEC_CASE,
    n_cpus=1
)

In [39]:
sop_data.post_content.iloc[2]

'<div class="pTx"><span class="b">Please confine your statement to no more than 300 words. The online form can accept English characters only.<br/>The content of your statement should explain why you wish to study the programme and how the qualification is relevant to your career aspirations, as well as your expectation of the programme. If applicable, provide other information (e.g. work experience, non-academic achievements, community services) that you think is relevant to the assessment of your application.</span><br/><br/>Application Statement<br/><br/>Much of my motivation for pursuing MSc Electronic Information Engineering provided by CityU originates from an intense desire to take my academic attainment to the next level and to prepare me for a successful career. As a matter of fact, I have been keen to study signal processing and wireless communication since childhood. My proudest accomplishment was to assemble a radio receiver with components that I bought online. As my passi

In [40]:
HU.HTMLTokenizer.tokenize(sop_data.post_content.iloc[2])

' xxhlb Please confine your statement to no more than 300 words. The online form can accept English characters only.\nThe content of your statement should explain why you wish to study the programme and how the qualification is relevant to your career aspirations, as well as your expectation of the programme. If applicable, provide other information (e.g. work experience, non-academic achievements, community services) that you think is relevant to the assessment of your application. xxhle \n\nApplication Statement\n\nMuch of my motivation for pursuing MSc Electronic Information Engineering provided by CityU originates from an intense desire to take my academic attainment to the next level and to prepare me for a successful career. As a matter of fact, I have been keen to study signal processing and wireless communication since childhood. My proudest accomplishment was to assemble a radio receiver with components that I bought online. As my passion grew, I was enrolled in the undergradu

In [41]:
" ".join(tknzr.process_all(sop_data.post_content.iloc[:3])[2])

'  xxhlb xxmaj please confine your statement to no more than 300 words . xxmaj the online form can accept xxmaj english characters only . \n  xxmaj the content of your statement should explain why you wish to study the programme and how the qualification is relevant to your career aspirations , as well as your expectation of the programme . xxmaj if applicable , provide other information ( e.g. work experience , non - academic achievements , community services ) that you think is relevant to the assessment of your application . xxhle \n \n  xxmaj application xxmaj statement \n \n  xxmaj much of my motivation for pursuing msc xxmaj electronic xxmaj information xxmaj engineering provided by cityu originates from an intense desire to take my academic attainment to the next level and to prepare me for a successful career . xxmaj as a matter of fact , i have been keen to study signal processing and wireless communication since childhood . xxmaj my proudest accomplishment was to assemble a r

## V2 - Documents on Disk

The primary purpose of this section was to put the documents in a format that could easily be transferred to Colab. From there, they were used to train several language models with the intention of creating a sentiment classifier. The entries ended up being prohibitively long for transformer models but did train with ULMFiT.

The documents were also used for summarizing and key-phrase extraction models. 

In [49]:
import string
from pathlib import Path
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from bs4 import BeautifulSoup
import unidecode

import spacy

In [42]:
dpath = Path('grad_scrape/gradsop/data')

sop_data = pd.read_csv(dpath/'posts_all.csv')

sop_data.head()

Unnamed: 0,name,post_content,post_date,seqnum,slug,url_id,user_id,user_likes,user_posts,user_threads,username
0,Yeu Deck Ngui,"<div class=""pTx"">Hi guys, I would need help re...","Mar 16, 2019",0,study-plan-canada-permit-meng-civil-env-82750,82750,112438.0,,-,1,decimo1993
1,Mary Rose,"<div class=""pTx"">Yeu, I am afraid that this is...","Mar 17, 2019",1,study-plan-canada-permit-meng-civil-env-82750,82750,99412.0,2001.0,7529,-,Holt
2,snow,"<div class=""pTx""><span class=""b"">Please confin...","Mar 17, 2019",0,application-statement-msc-electronic-82755,82755,112507.0,,-,1,Airydisc
3,nkitha,"<div class=""pTx"">I really need help in this pa...","Mar 21, 2019",0,goal-study-plan-korean-university-82784,82784,112553.0,,-,1,itha20
4,"Constance, EF Contrib","<div class=""pTx"">Since this is an academic pap...","Mar 22, 2019",1,application-statement-msc-electronic-82755,82755,112560.0,9.0,19,-,Constance


In [43]:
sop_data.isna().any()

name             True
post_content    False
post_date       False
seqnum          False
slug            False
url_id          False
user_id          True
user_likes       True
user_posts       True
user_threads     True
username         True
dtype: bool

In [44]:
# Fill missing values
sop_data.loc[:, ['user_posts','user_threads']] = sop_data.loc[:, ['user_posts','user_threads']].apply(lambda x: x.str.replace(',','').replace('-','0')).fillna(0).astype(int)
sop_data['user_likes'] = sop_data.user_likes.fillna(0)
sop_data['user_id'] = sop_data.user_id.fillna(0).astype(int)

In [45]:
df_text = sop_data.copy()

In [46]:
# Clean html
df_text['post_content'] = df_text['post_content'].str.replace('<br/>','\n').apply(lambda x: BeautifulSoup(x, 'html.parser').get_text())
# assign unique id to each document
df_text['doc_id'] = df_text['url_id'].astype(str) + '_' + df_text['seqnum'].astype(str)
df_text = df_text[['doc_id','url_id','user_id','seqnum','post_content']]
df_text.head()

Unnamed: 0,doc_id,url_id,user_id,seqnum,post_content
0,82750_0,82750,112438,0,"Hi guys, I would need help reviewing this stud..."
1,82750_1,82750,99412,1,"Yeu, I am afraid that this is not a valid stud..."
2,82755_0,82755,112507,0,Please confine your statement to no more than ...
3,82784_0,82784,112553,0,I really need help in this particular essay. T...
4,82755_1,82755,112560,1,"Since this is an academic paper, please remove..."


In [47]:
df_text[df_text.post_content.str.contains("😘")].post_content.values[0]#.replace("👍",":thumbsup:")

'Thanks for your advice!!! It does help!\n\nI\'ll take your advice and adjust my essay. As you mentioned, "ineffective sentence" is quite a problem. I will try my best to revise my sentence in a concise, clear, and effective way.\n\nThanks again for your feedback!\n\n谢谢你😘'

In [55]:
unq_chars = np.array([*set(" ".join(df_text.post_content))]); unq_chars

array(['f', 'E', '_', ':', '|', 'I', '"', 'A', '=', '6', '\\', ';', 'S',
       'h', 'L', 'F', 'Y', '0', '?', '`', '(', 'x', 'B', '1', '2', 'k',
       'p', 'w', '$', 'D', '<', 'o', ']', 'u', 'a', '&', 'C', 'U', 't',
       'j', ',', '>', 'H', ')', "'", 'R', '{', 'J', '%', 'M', 'z', 'Z',
       'v', '[', '~', '^', '9', '3', 'b', '8', 'W', 'X', '@', '}', 'i',
       'm', '#', 'l', '+', 'G', 'N', '4', 'g', ' ', 'K', 'P', 'r', 'O',
       'q', '!', '.', '7', '-', 'n', '/', '*', 'Q', 'c', 's', 'V', 'd',
       'T', 'e', '5', '\n', 'y'], dtype='<U1')

In [50]:
df_text['post_content'] = df_text.post_content.apply(lambda x: unidecode.unidecode(x.replace("👍"," :thumbsup: ").replace("😘"," :kissing_heart: ")))

In [53]:
user_piv = df_text.pivot(index='url_id', columns='seqnum', values='user_id')
op_multi = user_piv[user_piv.apply(lambda x: x[0] in x[1:].values,axis=1)]
op_ids = op_multi.loc[:,0]

In [54]:
df_origpost = df_text.query('seqnum == 0')
df_op_reply = df_text.query('url_id.isin(@op_multi.index) & user_id.isin(@op_ids) & seqnum != 0')
df_critiques = df_text.drop(df_op_reply.index).query('seqnum > 0')

In [57]:
dfo = df_origpost[['url_id','doc_id','post_content']]
dfc = df_critiques[['url_id','doc_id','post_content']]

In [58]:
sop_data['doc_id'] = sop_data.url_id.astype(str) + '_' + sop_data.seqnum.astype(str)

df_users = sop_data.iloc[:, np.r_[-7:0, 3]]
dfu_o = df_users.query('seqnum==0')
dfu_c = df_users[df_users.doc_id.isin(df_critiques.doc_id)]
dfu_oc = dfu_o.merge(dfu_c, on='url_id')

dfu_c.user_likes.quantile([.3])

0.3    1.0
Name: user_likes, dtype: float64

In [59]:
dfu_oc.query('user_likes_y > 0').url_id.nunique()

2574

In [60]:
dfu_oc.url_id.nunique()

3107

In [62]:
dfu_oc.head()

Unnamed: 0,url_id,user_id_x,user_likes_x,user_posts_x,user_threads_x,username_x,doc_id_x,seqnum_x,user_id_y,user_likes_y,user_posts_y,user_threads_y,username_y,doc_id_y,seqnum_y
0,82750,112438,0.0,0,1,decimo1993,82750_0,0,99412,2001.0,7529,0,Holt,82750_1,1
1,82755,112507,0.0,0,1,Airydisc,82755_0,0,112560,9.0,19,0,Constance,82755_1,1
2,82784,112553,0.0,0,1,itha20,82784_0,0,112560,9.0,19,0,Constance,82784_1,1
3,82784,112553,0.0,0,1,itha20,82784_0,0,112562,241.0,602,0,Maria,82784_2,2
4,82784,112553,0.0,0,1,itha20,82784_0,0,112095,0.0,2,1,IamIana,82784_3,3


In [66]:
dfu_c.merge(df_critiques).head()

Unnamed: 0,url_id,user_id,user_likes,user_posts,user_threads,username,doc_id,seqnum,post_content
0,82750,99412,2001.0,7529,0,Holt,82750_1,1,"Yeu, I am afraid that this is not a valid stud..."
1,82755,112560,9.0,19,0,Constance,82755_1,1,"Since this is an academic paper, please remove..."
2,82784,112560,9.0,19,0,Constance,82784_1,1,This comments are just some suggestions that m...
3,82784,112562,241.0,602,0,Maria,82784_2,2,"Itha20,\n\nLooking through your essay, I think..."
4,82784,112095,0.0,2,1,IamIana,82784_3,3,I think your essay is a little messy. I would ...


In [64]:
dfu_c.merge(df_critiques).user_posts.quantile(.15)

5.0

In [None]:
dfo.merge(dfc, on='url_id').to_pickle('proc_data/post_pair_df.pkl')

In [None]:
path_pd = Path('proc_data')
path_al = path_pd/'all_texts'
path_cr = path_pd/'critiques'
path_op = path_pd/'op_replies'
path_og = path_pd/'orig_posts'

In [None]:
def to_file_corpus(df, datapath, idcol='doc_id', postcol='post_content'):
    print(f'Writing {df.shape[0]} files to: {datapath} ...')
    for doc_id, content in df[[idcol,postcol]].values:
        Path(datapath/doc_id).with_suffix('.txt').write_text(content)
    print('Done.')

In [None]:
to_file_corpus(df_text, path_al)
to_file_corpus(df_critiques, path_cr)
to_file_corpus(df_op_reply, path_op)
to_file_corpus(df_origpost, path_og)

## V3 - Strip Quotes

This section was done with the intention of improving phrase extraction by removing quoted text from replies.

Various references used:

https://stackoverflow.com/questions/30285706/detecting-similar-paragraphs-in-two-documents

https://theory.stanford.edu/~aiken/publications/papers/sigmod03.pdf

https://en.wikipedia.org/wiki/W-shingling

https://stackoverflow.com/questions/17740833/checking-fuzzy-approximate-substring-existing-in-a-longer-string-in-python

In [88]:
from tqdm import tqdm_notebook, tnrange
from unidecode import unidecode

import Levenshtein
from fuzzywuzzy import fuzz, process
import itertools

**Objective:** compare every sentence in the original post against every sentence in its review posts, determine a suitable similarity threshold, and strip the matching sentences from the reviews.

Quotation marks are unreliable for this task because people often quote small phrases. Long passages enclosed in quotes could be stripped out, but this is likely to be less effective.

[Key Phrase Extraction](https://en.wikipedia.org/wiki/Automatic_summarization#Keyphrase_extraction) is less effective when there are large quoted passages. Stripping these out will almost certainly improve the extraction process and will likely help with summarization as well.

[Sent2Vec](https://rare-technologies.com/sent2vec-an-unsupervised-approach-towards-learning-sentence-embeddings/) may or may not be useful in this task. Something as simple as a BLEU score could be equally, or even more, effective.

The ideal metric would disregard sentence length, or at least discount it heavily. Weight should be given to the number of matching consecutive terms, where order is important. However, this also raises several problems: reviewers tend to inject words, strip words, and ellipsize passages.

Examples:

Should have high similarity:
```
OP: I landed a job with TCS, Asia's largest IT service provider.
REV: I landed a --job-- position with TCS, Asia's largest ... provider.
```

Should have low similarity:
```
OP: I have been given many suggestions throughout my career
REV: I have many suggestions for improvements throughout your statement
```
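One way to make "matching consecutive terms where order is important" concrete is to count only the tokens that fall in sufficiently long in-order runs. The sketch below uses `difflib.SequenceMatcher` from the standard library. It is not one of the metrics evaluated in this notebook; the function name `ordered_overlap` and the `min_run` parameter are hypothetical.

```python
from difflib import SequenceMatcher


def ordered_overlap(op_sent, rev_sent, min_run=3):
    """Share of OP tokens covered by in-order runs of >= min_run matching tokens."""
    op_toks = op_sent.lower().split()
    rev_toks = rev_sent.lower().split()
    matcher = SequenceMatcher(None, op_toks, rev_toks, autojunk=False)
    # get_matching_blocks returns non-overlapping, order-preserving runs
    matched = sum(block.size for block in matcher.get_matching_blocks()
                  if block.size >= min_run)
    return matched / len(op_toks) if op_toks else 0.0


quoted = ordered_overlap(
    "I landed a job with TCS, Asia's largest IT service provider.",
    "I landed a --job-- position with TCS, Asia's largest ... provider.",
)
unrelated = ordered_overlap(
    "I have been given many suggestions throughout my career",
    "I have many suggestions for improvements throughout your statement",
)
print(f"quoted: {quoted:.2f}, unrelated: {unrelated:.2f}")
```

With `min_run=3`, the first pair keeps two long runs ("I landed a" and "with TCS, Asia's largest") despite the injected and ellipsized words, while the second pair shares no run of three or more tokens and scores zero.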


In [None]:
nlp = spacy.load("en_core_web_md")
#nlp.disable_pipes('tagger','ner')

In [68]:
df_text = pd.read_pickle('proc_data/minparse/full_text_minproc.pkl')
op_idx = np.load('proc_data/minparse/op_idx.npy')
op_rep_idx = np.load('proc_data/minparse/op_rep_idx.npy')
crit_idx = np.load('proc_data/minparse/crit_idx.npy')

df_op = df_text.loc[op_idx]
df_crit = df_text.loc[crit_idx]

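# the positional axis argument was removed in pandas 2.0; columns= works on all versions
df_op = df_op.drop(columns=['user_id','seqnum','tokens'])
df_crit = df_crit.drop(columns=['user_id','seqnum','tokens'])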

df_merged = df_op.merge(df_crit, on='url_id', suffixes=('_op','_rv'))[['doc_id_rv','post_content_op','post_content_rv']]

Simple base cases: suggestions that contain the word "change", and suggestions that contain a quotation mark.

In [70]:
df_merged[df_merged.post_content_rv.str.contains('change',case=False)&~df_merged.post_content_op.str.contains('change',case=False)].head()

Unnamed: 0,doc_id_rv,post_content_op,post_content_rv
2,82784_1,I really need help in this particular essay. T...,This comments are just some suggestions that m...
8,79311_1,Question 2: The Ivey MSc. Business Analytics r...,"Funmi, your response to the prompt is too acad..."
11,78374_2,learning journey in humanitarian and developme...,"Asli, this is not exactly a proper statement o..."
15,79386_1,application essay to the financial sector\n\nH...,"Xue, your discussion is misdirected. The more ..."
31,75532_1,"Hello guys\nHere is my essay, please tell me w...","Robin, you have too much going on in your firs..."


In [71]:
quodf = df_merged[df_merged.post_content_rv.str.contains('"')]
quoted_texts = quodf.post_content_rv.str.findall(r'("[^"]*")') # store quoted strings

In [72]:
# strip out quoted texts from reviews
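# regex=True is required on pandas >= 2.0, where str.replace treats the pattern literally by default
df_merged.loc[quodf.index, 'post_content_rv'] = quodf.post_content_rv.str.replace(r'("[^"]*")','',regex=True)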

This approach is by no means foolproof, but it's a start, and we shouldn't lose much valuable information since the goal is to filter down to editing terms.

In [73]:
df_merged[df_merged.post_content_rv.str.findall(r"(\B'[^']+')").apply(len) > 0].post_content_rv.str.findall(r"(\B'[^']+')")

99      ['Internet of Things (IoT)', '&', 'and', 'a St...
165                                                 ['d']
274     ['I', 'I specifically want to...', 'Solving ch...
286                                                ['Hi']
291             ['but', 'and', 'However', 'Nevertheless']
                              ...                        
5440                                  ['but', 'although']
5468    ['s business successful. Personally, I believe...
5469                   ['curiosity about the balance...']
5515                                            ['ímply']
5524                                    ['strategically']
Name: post_content_rv, Length: 218, dtype: object

Single quotes are a bit trickier. Semantically, they can have several different meanings, which may or may not be quoted text from the author, not to mention their use as apostrophes. For these reasons, we will leave them in the text.

In [74]:
def strip_pipe(s):
    return gensim.parsing.strip_multiple_whitespaces(gensim.parsing.strip_non_alphanum(unidecode(s))).lower()

In [75]:
vord = np.vectorize(ord,otypes=['uint16'])
def fingerprint(s,ngram=3,tonum=False):
    s_fp = strip_pipe(s).replace(' ','')
    chargrams = np.array([*nltk.ngrams(s_fp,ngram)])
    return vord(chargrams) if tonum else chargrams

In [76]:
samp = df_merged.iloc[:,1:].sample(1).iloc[0].values

In [77]:
"".join(fingerprint(samp[1]).flatten())

'hi1i1b1bebegegigininnnnininingngogofofefeaeacachchshsesenentntetenencnceceseshshohououluldldbdbebeieininanacacacapapipititatalalllleletetttteterer2r2p2plpleleaeasasesegegigivivevebeblblalananknkskspspapacaceceaeafaftftetererereaeacachchchcocomommmmamaqaququeueseststitioiononmnmamararkrkakanandndcdcocomompmplpleletetitioionononofofsfsesenentntetenencnceceseststrtryrytytotoaoavavovoioididtdththehesesesesesmsmamalallllmlmimisiststatakakekesesasanandndtdtrtryrytytotoeoexexpxplplalaiaininynyoyouoururirididedeaeasasisinindndedetetataiaililslsusucuchchahasaswswhwhahatatitisiststeteaeacachchfhfofororirinindndidiaiaiaisistsththihisisasananynymymomovovevemememenentntotofofafacacacadadedememymywywhwhahatatdtdodotoththeheyeywywowororkrksksosotoththahatatrtrereaeadadederercrcacananununundndederersrststatanandndjdjujusuststatasasusuguggggegeseststitioiononbnbebeteththehecechchahanangngegeyeyoyououwuwawanantnttttotososeseeeeieinintnththehewewowororlrldldldlilininenesesbsbybygygaganandndhdhihiiiinins

In [79]:
[*nltk.ngrams(strip_pipe(samp[0]).split(), 3)][:10]

[('be', 'the', 'change'),
 ('the', 'change', 'you'),
 ('change', 'you', 'want'),
 ('you', 'want', 'to'),
 ('want', 'to', 'see'),
 ('to', 'see', 'in'),
 ('see', 'in', 'the'),
 ('in', 'the', 'world'),
 ('the', 'world', 'lines'),
 ('world', 'lines', 'by')]
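The word trigrams above are the shingles used in w-shingling, one of the references listed at the top of this section. A minimal sketch of how two texts could be compared on that basis follows. The helper names and the example strings are illustrative and are not part of the notebook's pipeline.

```python
def shingles(text, w=3):
    """Set of contiguous w-word shingles from lowercased, whitespace-split text."""
    toks = text.lower().split()
    return {tuple(toks[i:i + w]) for i in range(len(toks) - w + 1)}


def resemblance(a, b, w=3):
    """Jaccard resemblance: shared shingles over all distinct shingles."""
    sa, sb = shingles(a, w), shingles(b, w)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def containment(a, b, w=3):
    """Fraction of a's shingles that also occur in b."""
    sa, sb = shingles(a, w), shingles(b, w)
    return len(sa & sb) / len(sa) if sa else 0.0


op = "be the change you want to see in the world"
rev = "you wrote be the change you want to see but never explain it"
print(resemblance(op, rev), containment(op, rev))
```

Containment is arguably the more useful of the two for spotting quoted passages, since a review that quotes one OP sentence and then adds its own commentary will have low resemblance but high containment of that sentence.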

In [80]:
def find_bleusim(op_doc, rv_doc, minsim=0.20, strip_nalpn=True, smoothing=None):
    """Find similar sentences using BLEU scores
    
    Args:
        op_doc, rv_doc : str or spaCy Doc, paragraphs to compare sentence by sentence
        minsim : float, minimum BLEU score to consider a sentence pair a match
        strip_nalpn: bool, if True will strip non-alphanumeric chars and consecutive whitespace
        smoothing : a nltk.translate.bleu_score.SmoothingFunction method to correct for mismatched ngrams
    """
    op_doc,rv_doc = (nlp(op_doc), nlp(rv_doc)) if isinstance(op_doc,str) else (op_doc,rv_doc)
    
    for sent_op,sent_rv in itertools.product(op_doc.sents,rv_doc.sents):
        op_text, rv_text = sent_op.text, sent_rv.text
        if strip_nalpn:
            op_text, rv_text = strip_pipe(op_text), strip_pipe(rv_text)
        op_toks, rv_toks = op_text.split(), rv_text.split()
    
        # OP sentence is the reference, review sentence is the hypothesis
        bleuscore = nltk.translate.bleu([op_toks], rv_toks, smoothing_function=smoothing)
        # brevity penalty is computed on token counts, the same units BLEU uses
        brevpen = nltk.translate.bleu_score.brevity_penalty(len(op_toks), len(rv_toks))
        if bleuscore > minsim:
            print('OP: ', sent_op.text.strip())
            print('REV:', sent_rv.text.strip())
            print('BLEU: {:0.5f}, BrevityPenalty {:0.5f}, NoPenalty: {:0.5f}'.format(bleuscore, brevpen, bleuscore/brevpen))
            print('---------------')

In [81]:
def find_levsim(op_doc, rv_doc, minsim=0.01, strip_nalpn=True):
    """Find similar sentences using Levenshtein similarity
    
    Args:
        op_doc, rv_doc : str, paragraphs to search and compare to similarity measures
        minsim : float, minimum similarity to consider a sentence pair a match
        strip_nalpn: bool, if True will strip non-alphanumeric chars and consecutive whitespace
    """
    op_doc,rv_doc = (nlp(op_doc), nlp(rv_doc)) if isinstance(op_doc,str) else (op_doc,rv_doc)
    for sent_op,sent_rv in itertools.product(op_doc.sents,rv_doc.sents):
        op_text, rv_text = sent_op.text, sent_rv.text
        if strip_nalpn:
            op_text, rv_text = strip_pipe(sent_op.text), strip_pipe(sent_rv.text)
            
        sim = gensim.similarities.levenshtein.levsim(op_text, rv_text, min_similarity=minsim)
        if sim > minsim:
            print(f'SIM: {sim:0.6f}')
            print('OP: ', sent_op.text.strip())
            print('REV:', sent_rv.text.strip())
            print('---------------')

In [82]:
def fuzz_all(s1,s2):
    """Performs multiple fuzz similarity tests and returns array of calculations"""
    pr = fuzz.partial_ratio(s1,s2)
    ptsr = fuzz.partial_token_sort_ratio(s1,s2)
    r = fuzz.ratio(s1,s2)
    tsetr = fuzz.token_set_ratio(s1,s2)
    tsortr = fuzz.token_sort_ratio(s1,s2)
    qr = fuzz.QRatio(s1,s2)
    ratios = np.array([pr,ptsr,r,tsetr,tsortr,qr])
    return ratios 

In [83]:
def find_fuzzsim(op_doc, rv_doc, cos_sim_thresh=0.91, fzmean_thresh=50, fzmax_thresh=60, print_removals=True):
    """Find similar sentences using fuzzywuzzy and spaCy's cosine similarity
    
    Args:
        op_doc, rv_doc : str, paragraphs to search and compare to similarity measures
        cos_sim_thresh : float, minimum cosine similarity between sentences to consider a match
        fzmean_thresh : int, minimum average fuzzy ratio between sentences
        fzmax_thresh : int, minimum maximum fuzzy ratio value returned from `fuzz_all`
        
    yields:
        str, matching sentences based on threshold criteria
    """
    op_doc,rv_doc = (nlp(op_doc), nlp(rv_doc)) if isinstance(op_doc,str) else (op_doc,rv_doc)
    
    for sent_op,sent_rv in itertools.product(op_doc.sents,rv_doc.sents):
        op_text, rv_text = map(gensim.parsing.strip_multiple_whitespaces, (sent_op.text, sent_rv.text))
        
        # check for empty vectors before calculating similarity
        spcsim = sent_op.similarity(sent_rv) if sent_op.has_vector and sent_rv.has_vector else 0
        
        # first threshold check, significantly faster than checking with fuzz first
        if spcsim > cos_sim_thresh:
            ratios = fuzz_all(op_text, rv_text)
            rmean,rmin,rmax = ratios.mean(),ratios.min(),ratios.max()
            # second threshold to exclude content that is contextually similar but lexically different 
            if rmean > fzmean_thresh and rmax > fzmax_thresh:
                if print_removals:
                    print('OP: ', sent_op.text.strip())
                    print('REV:', sent_rv.text.strip())
                    print('Avg: {:0.2f}, Min: {}, Max: {}, CosSim: {:0.4f}'.format(rmean,rmin,rmax,spcsim))
                    print('----------')
                yield sent_rv.text

In [84]:
def rm_dup_sents(df, mincos=0.98):
    """Removes sentences that meet the similarity threshold from a dataframe, in place"""
    for idx, (op_post, rev_post) in df.drop(columns='doc_id_rv').iterrows():
        for sent in find_fuzzsim(op_post,rev_post,cos_sim_thresh=mincos):
            rev_post = rev_post.replace(sent,'')
        df.loc[idx,'post_content_rv'] = rev_post

Drop obvious mirrored sentences, i.e. cos sim > ~0.98

In [None]:
rm_dup_sents(df_merged, mincos=0.98)

In [None]:
df_merged.to_feather('proc_data/merge_deduped.df')

In [85]:
df_merged = pd.read_feather('proc_data/merge_deduped.df')

In [89]:
samp10 = df_merged.iloc[:,1:].sample(10).values
for exop,exrv in samp10:
    *rmsents, = find_fuzzsim(exop,exrv,cos_sim_thresh=0.98)
    print('rmsents:',rmsents)
    print('================')

rmsents: []
rmsents: []
rmsents: []
rmsents: []
rmsents: []
rmsents: []
rmsents: []
rmsents: []
rmsents: []
rmsents: []


In [91]:
smoothing = nltk.translate.bleu_score.SmoothingFunction()

In [92]:
for exop,exrv in samp10:
    find_bleusim(exop,exrv, minsim=0.25, smoothing=smoothing.method1)
    print('================')

OP:  It was on the Christmas Eve of 2010,
REV: It was on the Christmas Eve of 2010, that night, I luckily found out a branch seizedthat piqued my curiosity
BLEU: 0.34670, BrevityPenalty 1.00000, NoPenalty: 0.34670
---------------
OP:  that night, I luckily found out a branch seized my curiosity.
REV: It was on the Christmas Eve of 2010, that night, I luckily found out a branch seizedthat piqued my curiosity
BLEU: 0.37903, BrevityPenalty 1.00000, NoPenalty: 0.37903
---------------


As we can see, our filtering method is not perfect: some sentences were sufficiently different to slip past the threshold. BLEU score can help identify these misses, but choosing the correct threshold is particularly difficult. Additionally, this steps well outside the metric's intended use. It was developed for translation tasks, and alternative uses have quite a few [caveats to consider](https://towardsdatascience.com/evaluating-text-output-in-nlp-bleu-at-your-own-risk-e8609665a213).
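For reference, the brevity penalty reported in the output above can be sketched in a few lines. This follows the standard BLEU definition (1 when the hypothesis is longer than the reference, otherwise exp(1 - r/c)). The function name and the token counts in the example calls are illustrative and are not taken from the notebook.

```python
import math


def brevity_penalty_sketch(ref_len, hyp_len):
    """BLEU brevity penalty for a reference of ref_len tokens and a hypothesis of hyp_len tokens."""
    if hyp_len == 0:
        return 0.0
    if hyp_len > ref_len:
        return 1.0  # hypotheses longer than the reference are not penalized
    return math.exp(1 - ref_len / hyp_len)


# A long review sentence scored against a short OP sentence is never penalized
print(brevity_penalty_sketch(ref_len=8, hyp_len=21))
# A review sentence half the length of the OP sentence is penalized
print(brevity_penalty_sketch(ref_len=12, hyp_len=6))
```

Because reviewers usually quote a sentence inside a longer one of their own, the penalty rarely applies in this setting, which is consistent with the value of 1.0 in both matches above.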