# Excel Reduce

This notebook demonstrates a set of scripts that take a .csv file of sentences as input, process the text, convert each sentence to a vector, reduce that vector to a single normalized value and then compare the values to determine how similar the sentences are.  Similar sentences are dropped whilst the more dissimilar sentences are kept.  The output is another .csv file with significantly fewer rows of data for human verification.

In [29]:
# import libraries
import pandas as pd
import numpy as np
from sklearn.cluster import KMeans

# track progress
from tqdm import tqdm

# spaCy
import spacy
import plac
from spacy.attrs import ORTH
import io
from spacy.lang.en.stop_words import STOP_WORDS

# Doc2Vec imports
import gensim
from gensim.models.doc2vec import TaggedDocument
import collections

## Pre-processing

Some pre-processing is done here to account for differences in how computers see words.  For example, "Python" and "python" are counted as two different words, so everything is lower-cased.  In addition, the text is lemmatized to reduce the words to their most basic form, stop words are removed because they add little value to the text, and punctuation is stripped.
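
To make these steps concrete, here is a minimal sketch that uses only the Python standard library. The tiny stop-word set, the lemma table and the helper name `toy_preprocess` are hypothetical stand-ins for what spaCy provides; they are assumptions for illustration, not the pipeline this notebook actually runs.

```python
import string

# Hypothetical, deliberately tiny stand-ins for spaCy's stop-word list and lemmatizer.
TOY_STOP_WORDS = {"what", "is", "the", "to", "in", "a", "of"}
TOY_LEMMAS = {"guides": "guide", "investing": "invest", "shares": "share"}


def toy_preprocess(text):
    """Lowercase, strip punctuation, drop stop words and map words to a base form."""
    text = text.lower()
    # Remove every ASCII punctuation character.
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = text.split()
    return [TOY_LEMMAS.get(tok, tok) for tok in tokens if tok not in TOY_STOP_WORDS]


# "Python" and "python" end up as the same token after lowercasing.
same_token = toy_preprocess("Python") == toy_preprocess("python")
example = toy_preprocess("What is the guide to investing in shares?")
```

A real lemmatizer relies on vocabulary and part-of-speech information rather than a lookup table, which is why the notebook uses spaCy for this step.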

In [55]:
# import and clean up
questions = pd.read_csv("Quora_Examples_2.csv")
questions.head()

Unnamed: 0,Questions
0,What is the step by step guide to invest in sh...
1,What is the story of Kohinoor (Koh-i-Noor) Dia...
2,How can I increase the speed of my internet co...
3,Why am I mentally very lonely? How can I solve...
4,"Which one dissolve in water quikly sugar, salt..."


In [15]:
# load the SpaCy module
nlp = spacy.load('en_core_web_sm', disable = ["ner"])

In [71]:
# process the text by lower casing everything, tokenizing, lemmatizing and stripping stop words and punctuation
def process_text(text):
    text = text.lower()
    doc = nlp(text)
    # keep the lemma of every token that is neither a stop word nor punctuation
    lemmas = [token.lemma_ for token in doc if not token.is_stop and not token.is_punct]
    return lemmas

In [58]:
tqdm.pandas()
questions["lemmatized_questions"] = questions["Questions"].progress_apply(process_text)
questions.head()

100%|██████████| 200/200 [00:01<00:00, 164.84it/s]


Unnamed: 0,Questions,lemmatized_questions
0,What is the step by step guide to invest in sh...,step step guide invest share market india
1,What is the story of Kohinoor (Koh-i-Noor) Dia...,story kohinoor koh noor diamond
2,How can I increase the speed of my internet co...,increase speed internet connection vpn
3,Why am I mentally very lonely? How can I solve...,mentally lonely solve
4,"Which one dissolve in water quikly sugar, salt...",dissolve water quikly sugar salt methane carbo...


Some additional pre-processing is needed to get the data into a form that Doc2Vec can use: each row of data needs a unique label, and each sentence has to be split into a list of words.

In [59]:
# add a label
questions["label"] = range(1, len(questions) + 1)
questions.head()

Unnamed: 0,Questions,lemmatized_questions,label
0,What is the step by step guide to invest in sh...,step step guide invest share market india,1
1,What is the story of Kohinoor (Koh-i-Noor) Dia...,story kohinoor koh noor diamond,2
2,How can I increase the speed of my internet co...,increase speed internet connection vpn,3
3,Why am I mentally very lonely? How can I solve...,mentally lonely solve,4
4,"Which one dissolve in water quikly sugar, salt...",dissolve water quikly sugar salt methane carbo...,5


In [67]:
# use TaggedDocument from the gensim library to pair each sentence, split into a list of words,
# with its label wrapped in a small list of tags
docs = []
for words, label in zip(questions["lemmatized_questions"], questions["label"]):
    # the column may hold space-separated strings or token lists, depending on which cells were run
    tokens = words.split() if isinstance(words, str) else list(words)
    docs.append(TaggedDocument(words = tokens, tags = [label]))

In [72]:
docs = questions.apply(lambda r: TaggedDocument(words = process_text(r["Questions"]), 
                                               tags = [r.label]), axis = 1)

## Semantic Analysis

A model is built using Doc2Vec.  It learns a vector for each document, alongside vectors for the individual words, by using the words surrounding each word as 'context'.  The model is then used to vectorize each sentence, giving numeric values that can be compared with each other to determine similarity.
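
One common way to compare two document vectors directly is cosine similarity. The sketch below is an illustration under stated assumptions, not the comparison this notebook performs: the helper name `cosine_similarity` and the three-dimensional vectors are made up, standing in for the 50-dimensional vectors the model would produce.

```python
import math


def cosine_similarity(vec_a, vec_b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(vec_a) != len(vec_b):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(vec_a, vec_b))
    norm_a = math.sqrt(sum(a * a for a in vec_a))
    norm_b = math.sqrt(sum(b * b for b in vec_b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


# Made-up 3-dimensional vectors standing in for Doc2Vec output.
doc_a = [1.0, 2.0, 3.0]
doc_b = [2.0, 4.0, 6.0]   # same direction as doc_a
doc_c = [3.0, 0.0, -1.0]  # orthogonal to doc_a

sim_ab = cosine_similarity(doc_a, doc_b)  # close to 1: same direction
sim_ac = cosine_similarity(doc_a, doc_c)  # 0: no shared direction
```

Values near 1 indicate vectors pointing the same way, and values near 0 indicate unrelated directions.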

In [74]:
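# build the model and its vocabulary, then train it so that infer_vector does not
# rely on randomly initialized weights
model = gensim.models.doc2vec.Doc2Vec(vector_size = 50, min_count = 2, epochs = 40)
model.build_vocab([x for x in tqdm(docs)])
model.train(list(docs), total_examples = model.corpus_count, epochs = model.epochs)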

100%|██████████| 200/200 [00:00<00:00, 200109.92it/s]


In [100]:
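# function that takes lemmatized text, infers a document vector and collapses it to one scalar
# (sum of the vector components divided by the number of tokens)
def convert_vector(text):
    # infer_vector expects a list of tokens, so split the text if it is still a single string
    words = text.split() if isinstance(text, str) else list(text)
    vector_list = model.infer_vector(words)
    normalized_vector = sum(vector_list) / max(len(words), 1)
    return normalized_vector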
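
As a worked example of the arithmetic in `convert_vector`, the sketch below applies the same formula (sum of the components divided by the token count) to hand-written vectors. The helper name `collapse_to_scalar` and the vectors are made up for illustration. It shows that different vectors can collapse to the same scalar, so equal values do not guarantee similar sentences.

```python
def collapse_to_scalar(vector, token_count):
    """Same arithmetic as convert_vector: sum of the components over the token count."""
    return sum(vector) / token_count


# Two made-up vectors that point in different directions...
vector_a = [0.5, -0.25, 0.25]
vector_b = [-0.25, 0.25, 0.5]

# ...but both sum to 0.5, so with 2 tokens each they collapse to the same value.
value_a = collapse_to_scalar(vector_a, token_count=2)
value_b = collapse_to_scalar(vector_b, token_count=2)
```

Here both `value_a` and `value_b` come out as 0.25 even though the vectors differ, which is worth keeping in mind when reading the `value` column.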

In [101]:
# apply the function to the lemmatized_questions and record the value
questions["value"] = questions["lemmatized_questions"].progress_apply(convert_vector)

100%|██████████| 200/200 [00:00<00:00, 6891.16it/s]


In [102]:
questions.head()

Unnamed: 0,Questions,lemmatized_questions,label,value
0,What is the step by step guide to invest in sh...,step step guide invest share market india,1,0.002506
1,What is the story of Kohinoor (Koh-i-Noor) Dia...,story kohinoor koh noor diamond,2,0.000178
2,How can I increase the speed of my internet co...,increase speed internet connection vpn,3,0.000298
3,Why am I mentally very lonely? How can I solve...,mentally lonely solve,4,-0.001619
4,"Which one dissolve in water quikly sugar, salt...",dissolve water quikly sugar salt methane carbo...,5,0.000953


In [127]:
# start the reduced set with the first question; rows that differ enough are added to it later
# (DataFrame.append was removed in pandas 2.0, so the first row is selected and copied directly)
final_questions = questions.iloc[[0]].copy()

In [128]:
final_questions.head()

Unnamed: 0,Questions,lemmatized_questions,label,value
0,What is the step by step guide to invest in sh...,step step guide invest share market india,1,0.002506


In [138]:
threshold = 0.0000001

# collect the rows to keep, starting with the first question
kept_rows = [questions.iloc[[0]]]

# compare each row's value with the row directly before it
# note: the difference is signed, so a row is kept only when its value is lower than
# its predecessor's by more than the threshold; wrap it in abs() to catch both directions
for index in range(len(questions) - 1):
    difference = float(questions["value"].iloc[index]) - float(questions["value"].iloc[index + 1])
    if difference > threshold:
        kept_rows.append(questions.iloc[[index + 1]])

final_questions = pd.concat(kept_rows)

So many sentences are kept because the threshold is set to an extremely low value, so even a tiny difference between neighbouring rows counts as dissimilar.  The key will be to pick the correct threshold.
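
To see how the threshold drives the number of rows kept, the sketch below runs a simplified filter over a plain list of values. It is a variant built on assumptions, not the cell above: the hypothetical helper `keep_dissimilar` uses the absolute difference and compares each value against the last kept value. The sample values are the five shown in the `value` column earlier, and the larger threshold of 0.002 is an arbitrary choice for contrast.

```python
def keep_dissimilar(values, threshold):
    """Return the indices of values that differ from the last kept value by more than threshold."""
    if not values:
        return []
    kept = [0]                 # always keep the first row
    reference = values[0]      # value of the most recently kept row
    for index in range(1, len(values)):
        if abs(values[index] - reference) > threshold:
            kept.append(index)
            reference = values[index]
    return kept


# The first five entries of the "value" column shown earlier in the notebook.
sample_values = [0.002506, 0.000178, 0.000298, -0.001619, 0.000953]

kept_tiny = keep_dissimilar(sample_values, threshold=0.0000001)   # keeps all five rows
kept_large = keep_dissimilar(sample_values, threshold=0.002)      # keeps only rows 0 and 1
```

With the tiny threshold nothing is filtered out, while the larger one drops three of the five rows, so the threshold needs tuning against sentences whose similarity is already known.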

In [139]:
final_questions.head()

Unnamed: 0,Questions,lemmatized_questions,label,value
0,What is the step by step guide to invest in sh...,step step guide invest share market india,1,0.002506
1,What is the story of Kohinoor (Koh-i-Noor) Dia...,story kohinoor koh noor diamond,2,0.000178
3,Why am I mentally very lonely? How can I solve...,mentally lonely solve,4,-0.001619
6,Should I buy tiago?,buy tiago,7,0.001038
7,How can I be a good geologist?,good geologist,8,0.000272


The final question is whether this will work on the given file.  To be continued...