In [48]:
#Importing necessary modules for this task
import pandas as pd
import numpy as np
import hashlib
import nltk
from nltk.corpus import stopwords
from collections import defaultdict
import random
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.text_rank import TextRankSummarizer
from sklearn.feature_extraction.text import TfidfVectorizer

# Near Duplicates:
Near duplicates, or fuzzy duplicates, are a common problem in deduplication. They are partial or full duplicates that differ slightly from their corresponding records but carry the same context: they represent the same record, with small differences caused by typos, misspellings, grammatical errors, synonyms, etc.

# Causes for Near Duplicates:
Near duplicates occur frequently in machine learning datasets, and for several reasons:

## Data Entry Errors:
The most common cause in machine learning tasks is data entry error, i.e. the entered value contains typos or misspellings introduced by the person or machine doing the entry. An example is entering the record "Jane" instead of "Jone" in a student table.

## Multiple Entries:
Another cause is a dataset being filled in by multiple people or groups at the same time. Each person or group saves entries according to their own information, which may produce near duplicate entries. An example is two entry operators entering the records "John Smith" and "Jon Smith" in a student table.

## Text Translation:
Near duplicates may also result from translating one language into another. For example, "Good Morning" translates into Japanese as "Ohayō", which differs in syntax but represents the same context.

## Plagiarism:
Another root cause of near duplicates is plagiarism when generating content or data. This is frequent in content writing, critic reviews, news articles, school assignments, etc. For example, one critic may copy another critic's lines and mix them with their own to create a review of a particular book.

# Removal of Near Duplicates:
The process of removing near duplicates and identifying the records that represent the same real-world entity is referred to as "Entity Resolution" or simply "Record Linkage". Because near duplicates affect the efficiency of machine learning models, many techniques have been proposed to remove them from a given dataset. Each technique depends on the requirements of the given problem, and the overall process can be divided into two phases, i.e. preparation and similarity measurement.

## Data Preparation Phase:
In this phase, different techniques are used to prepare the data so that similarity metrics can be computed on the records. The goal of the data preparation phase is to bring the data into the format required by the given problem and to improve the process of record linkage. The data preparation phase can itself be divided into two sub-phases, i.e. preprocessing and extraction.

### Data Preprocessing Phase:
In this phase, the data is preprocessed into a format from which relevant features can be generated efficiently, so that near duplicates can be detected effectively. Many data preprocessing techniques can be used here; some of them are as follows:

* Standardization
* Sorting
* Blocking
* Sampling
* Summarization

### Feature Extraction Phase:
In this phase, the data is passed to a function whose output helps identify near duplicates. The output can be a fingerprint, a set of tokens, a numerical vector, etc. Many feature extraction techniques can be used here; some of them are as follows:

* N-gram Fingerprinting
* Shingling
* MinHash
* Locality-Sensitive Hashing (LSH)
* Tf-Idf Vectorizer
* Siamese Networks

## Similarity Measurement Phase:
Once the required features have been extracted from the given data, they are passed to similarity metrics, which determine whether the records are near duplicates or not. Similarity measures can be character based or context based, but we are going to consider only character based ones in this notebook. These include:

* Levenshtein Distance
* Jaccard Similarity
* Sørensen-Dice Coefficient
* Cosine Similarity

# Data Preparation Phase:
We will now look at the different techniques mentioned in this phase, with the help of examples:

## Data Preprocessing:
The data preprocessing phase comprises several techniques, which can be used singly or in combination depending on the type of data needed for the feature extraction step. The most commonly used preprocessing techniques include standardization, filtering, encoding, etc. Let's now discuss the ones we mentioned, step by step:

### Data Standardization:
Data Standardization is a preprocessing step that converts the records into a desired common format. Standardization may include removing stop words from text, eliminating special characters, etc. The goal is to ensure that all the records in the given dataset are comparable to each other. It is a commonly used preprocessing step because it produces records that can be easily compared during similarity measurement.

#### Example:
Let's consider the example of the two text records "Karachi, Pakistan" and "Karachi in Pakistan". Removing the special characters and prepositions and lowercasing the result, we get the same standardized result for both records: "karachipakistan".

In [6]:
#Loading records as a list
records = ["Karachi, Pakistan", "Karachi in Pakistan"]

#Function to generate the standardized version of a record
def standardize_sp_prep(record):
    '''
    Standardize a record by removing its special characters and prepositions and returning the result in lowercase
    '''
    tokens = nltk.word_tokenize(record)
    tagged = nltk.pos_tag(tokens)
    #Keep alphabetic tokens whose part-of-speech tag is not 'IN' (preposition/subordinating conjunction)
    return ''.join(word.lower() for word, tag in tagged if tag != 'IN' and word.isalpha())

for record in records:
    print('Standardized record for "{}": {}'.format(record, standardize_sp_prep(record)))

Standardized record for "Karachi, Pakistan": karachipakistan
Standardized record for "Karachi in Pakistan": karachipakistan


### Character Sorting:
Character Sorting is another preprocessing step that helps record linkage: it arranges all the characters of a sentence in a fixed order, which helps in cases where the records contain the same words in a different order. Any sorting algorithm can be used to achieve this.

#### Example:
Let's consider the example of two records "He ate an apple yesterday" and "Yesterday, he ate an apple". Applying standardization and then sorting the characters of the result gives "aaaadeeeeehlnpprsttyy" for both records.

In [7]:
#Loading the records
records = ["He ate an apple yesterday", "Yesterday, he ate an apple"]

#Function to generate character sorted versions of standardized records
def sort_standard_records(records, standard_func):
    return [''.join(sorted(standard_func(record))) for record in records]

#Generate sorted versions for the given inputs
sort_standard_records(records, standardize_sp_prep) 

['aaaadeeeeehlnpprsttyy', 'aaaadeeeeehlnpprsttyy']

### Blocking:
Blocking is a preprocessing technique that partitions the records of the dataset according to some criterion, so that comparisons only need to be made between records of the same kind, which reduces the number of comparisons. The criterion depends on the application and can differ between use cases. Once blocking has divided the records into subsets, each subset can undergo the defined techniques to find out whether its records are near duplicates of each other.

#### Example:
Let's consider the example of grouping records by the department of the students, i.e. given sample records of university students from different departments, we use blocking to generate a subset for each department, so that we can check for near duplicates within each block.

In [8]:
#Loading the records
records = [{'name':'John Smith', 'dept':'Mech'}, {'name':'Jane Doe', 'dept':'Chem'}, {'name':'John, Smith', 'dept':'Mech'}]

#Method to define the blocking criteria 
def blocking_criteria(record):
    return record['dept']

#Method to generate blocks based on blocking criteria
def generate_blocks(records, blocking_criteria):
    subsets = defaultdict(list)
    for record in records:
        criteria = blocking_criteria(record)
        subsets[criteria].append(record)
    return subsets

#Generate blocks for the given records
subsets = generate_blocks(records, blocking_criteria)

#Checking the subsets based on the defined criteria
for criteria, block in subsets.items():
    print("Subsets for {}: {}".format(criteria, block))

Subsets for Mech: [{'name': 'John Smith', 'dept': 'Mech'}, {'name': 'John, Smith', 'dept': 'Mech'}]
Subsets for Chem: [{'name': 'Jane Doe', 'dept': 'Chem'}]
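
The saving from blocking can be made concrete by counting comparisons. The following is a minimal sketch, assuming the same three sample records; `candidate_pairs` is a hypothetical helper introduced here for illustration, not part of the notebook's code.

```python
from collections import defaultdict
from itertools import combinations

# Same sample records as in the blocking example
records = [{'name': 'John Smith', 'dept': 'Mech'},
           {'name': 'Jane Doe', 'dept': 'Chem'},
           {'name': 'John, Smith', 'dept': 'Mech'}]

def candidate_pairs(records, key):
    '''Yield only the pairs of records that share the same blocking key.'''
    blocks = defaultdict(list)
    for record in records:
        blocks[record[key]].append(record)
    for block in blocks.values():
        # Pairs are formed inside a block, never across blocks
        yield from combinations(block, 2)

all_pairs = list(combinations(records, 2))
blocked_pairs = list(candidate_pairs(records, 'dept'))
print(len(all_pairs), len(blocked_pairs))  # 3 1
```

Without blocking, all three pairs would be compared; with blocking on `dept`, only the two "Mech" records are compared with each other. The trade-off is that near duplicates placed in different blocks (for example because of a typo in the department itself) are never compared.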


### Sampling:
Sampling is the process of randomly selecting a set of records from the given records and determining near duplicates within the selected sample. The goal of sampling is the same as that of blocking, i.e. to reduce the number of comparisons needed to detect near duplicates. Note that sampling here takes place without replacement, so that no record is selected twice.

#### Example:
Let's consider an example for sampling where we sample two records at a time (without replacement) from the provided records, which can later be used for detection of near duplicates.

In [9]:
#Loading the records
records = ["I live in Karachi, Pakistan", "I lived in Karachi in Pakistan", "I love Karachi", "I live in Pakistan"]

#Method to sample records without replacement
def sample_without_replacement(records, sample_size):
    #Shuffle a copy so that the caller's list is left unchanged
    shuffled = random.sample(records, len(records))
    sampled_records = []
    for i in range(0, len(shuffled), sample_size):
        sampled_records.append(shuffled[i:i+sample_size])
    return sampled_records

#Generating sample subsets of size 2 for the records
sample_without_replacement(records, 2)

[['I love Karachi', 'I live in Karachi, Pakistan'],
 ['I lived in Karachi in Pakistan', 'I live in Pakistan']]

### Text Summarization:
Text Summarization is the process of condensing large pieces of text so as to ease feature extraction on them. Near duplicates become harder to compute as the size of the text grows. Thus, when the context of the passage matters, one can apply text summarization to generate a smaller version of the text. As long as the summary preserves the context of the original, near duplicates can still be detected from the summarized versions.

Since manual summarization is not an option at this scale, automatic summarization is used for this step. Automatic summarization can take place in two ways, i.e.

* __Extractive Summarization__ - This method extracts only the important sentences and phrases from the provided text and uses them to generate the summary for the given input.

* __Abstractive Summarization__ - This method compresses the provided input text by reorganizing it to capture only the salient features of the source text, and may introduce new sentences as well.

#### Example:
Let's take a piece of text about "Artificial Intelligence" and summarize it. Texts of a similar kind in a given dataset can be summarized in the same way and then checked for near duplicates. Here we are going to use extractive summarization to get the results.

In [10]:
#Loading the records
record = "Artificial intelligence (AI) is the simulation of human intelligence in machines that are programmed to think and learn. It has become one of the most rapidly growing and impactful technologies of the 21st century. AI systems can perform tasks such as recognizing speech, making decisions, and translating languages with a high degree of accuracy. The field of AI research was founded on the belief that a machine can be made to think like a human if only we understand how the human mind works. AI has the potential to revolutionize various industries, from healthcare to transportation, and is already being used in a wide range of applications such as image recognition, natural language processing, and self-driving cars. However, there are also concerns about the potential negative impacts of AI, such as job displacement and ethical issues. As AI continues to advance, it is important for society to consider both the benefits and the potential risks of this powerful technology."

#Method to perform text summarization
def get_record_summary(record):
    #Generating a parsed document from the given record
    parser = PlaintextParser.from_string(record, Tokenizer("english"))
    #Creating the TextRank summarizer
    summarizer = TextRankSummarizer()
    #Generate summary using the 2 highest ranking sentences
    summary = summarizer(parser.document, 2)
    #Join the sentences to generate summary paragraph
    return ' '.join(str(x) for x in summary)    
        
#Get summary for record
get_record_summary(record)

'The field of AI research was founded on the belief that a machine can be made to think like a human if only we understand how the human mind works. AI has the potential to revolutionize various industries, from healthcare to transportation, and is already being used in a wide range of applications such as image recognition, natural language processing, and self-driving cars.'

## Feature Extraction:
The feature extraction phase uses the preprocessed data to extract its important content in the form of features. These features can then be used by similarity metrics to estimate how many near duplicates are present in the provided dataset.

### N-Gram Fingerprinting:
N-gram fingerprinting is a technique that generates fingerprints for records or documents by splitting the standardized data into a set of non-overlapping N-grams of words/characters. For large documents, a hash function is typically applied as well, so that the fingerprint is compressed to a fixed-size hash value that is cheap to store and compare. The fingerprints of the records are then compared to check for near duplicates. A variant of this is sorted N-gram fingerprinting, in which the N-grams are sorted before comparison.

#### Example:
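Consider the example of sorted N-gram fingerprinting for the records "Yesterday, Ali went to the historic meuseum at Karachi, Pakistan" and "Ali, yesterday, went to the historic meuseum at Karachi in Pakistan", which give us the same result. Note that non-overlapping N-grams are sensitive to alignment: the fingerprints match here because "yesterday" and "Ali" are 9 and 3 characters long, so swapping them keeps the 3-gram boundaries intact. (Since the text isn't large, we won't use hashing in this case.)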

In [11]:
#Loading records for n-gram fingerprinting
records = ["Yesterday, Ali went to the historic meuseum at Karachi, Pakistan",
           "Ali, yesterday, went to the historic meuseum at Karachi in Pakistan"]

#Method to generate sorted n_gram fingerprints for a record
def generate_sorted_ngram_fingerprint(record, n):
    #Standardize input
    record = standardize_sp_prep(record)
    #Generate ngrams
    ngrams = [record[i:i+n] for i in range(0, len(record), n)]
    #Sort generated ngrams to give a fingerprint
    sorted_ngram = ''.join(sorted(ngrams))
    return sorted_ngram

#Generate fingerprints for both records (using n=3)
fingerprints = [generate_sorted_ngram_fingerprint(record,3) for record in records]
fingerprints

['achalidayeumeushisicmipakarkistanterthetorttowenyes',
 'achalidayeumeushisicmipakarkistanterthetorttowenyes']
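
For larger documents the fingerprint string itself can become long, which is where the hashing step comes in. The following is a minimal sketch, assuming SHA-1 from the standard `hashlib` module as the hash function (any stable hash would do); `hash_fingerprint` is a hypothetical helper, and the fingerprint string is copied from the output above.

```python
import hashlib

def hash_fingerprint(fingerprint):
    '''Map a fingerprint string of any length to a 40-character hex digest.'''
    return hashlib.sha1(fingerprint.encode('utf-8')).hexdigest()

# Fingerprint string copied from the output of the example above
fingerprint = 'achalidayeumeushisicmipakarkistanterthetorttowenyes'
digest = hash_fingerprint(fingerprint)
print(len(digest))  # 40
```

Equal fingerprints always give equal digests, so the digests can be used as dictionary keys to group matching records. The cost is that a hashed fingerprint only supports exact matching: two fingerprints that differ in a single N-gram produce unrelated digests.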

### Shingling:
Shingling is the process of generating shingles, i.e. overlapping sequences of k consecutive words, from the tokenized, standardized text provided to it. The number of words in a shingle is set by the parameter k. Once the shingles have been generated, they are hashed, and the resulting hashes are used for similarity comparisons. The hashing step may be skipped for small records. One thing to remember is that the value of k determines the accuracy of the comparison. If k is low, more shingles are produced and each shingle is short and common, so unrelated records share many of them and similarity tends to be overestimated. If k is high, fewer shingles are produced and each is more specific, so a single differing word breaks up to k shingles and similarity scores drop.

#### Example:
Consider the example of shingling for the records "John Smith studies with his buddies." and "Jon Smith, studies with his buddyes.", which give us similar results (a similarity measure is needed to conclude how similar the produced shingles are). We are using k = 2 for this case.

In [12]:
#Loading the records for shingling
records = ["John Smith studies with his buddies.", "Jon Smith, studies with his buddyes."]

#Method to generate k-shingles from given record
def generate_shingles(record, k):
    #Standardizing the record
    std_record = ''.join(x for x in record if (x.isalpha() or x==' '))
    #Tokenizing the record
    tokens = std_record.split()
    #Generating shingles of k consecutive tokens
    shingles = [''.join(tokens[i:i+k]) for i in range(len(tokens)-k+1)]
    return shingles

#Generating shingles for the provided records
shingles = [generate_shingles(record, 2) for record in records]
shingles #See that 3/5 shingles match while 2/5 don't

[['JohnSmith', 'Smithstudies', 'studieswith', 'withhis', 'hisbuddies'],
 ['JonSmith', 'Smithstudies', 'studieswith', 'withhis', 'hisbuddyes']]

### MinHash:
MinHash is another common technique used to detect near duplicates in a given dataset. The MinHash method consists of a number of steps, which are outlined below:

* Generate shingles for the given records for a suitable value of k
* Choose a hash function h(x) that maps each shingle to exactly one integer in a fixed range of buckets
* Use the hash function to assign every shingle of a record to a bucket
* For each record, take the minimum hash value obtained over its shingles.
* Compare the minimum hash values of two records: the probability that they are equal is the Jaccard index of the two shingle sets, so matching values indicate that the records are likely near duplicates

#### Example:
Consider the example of MinHash for the records "John Smith studies with his buddies." and "Jon Smith, studies with his buddyes.". Using a hash function to map each shingle of these records to an integer value, we get the following result from MinHash. (Here we have used just a single hash function to showcase the process, but in practice k different hash functions are used, and the ratio x/k, where x is the number of hash functions for which the minimum hash values of the two records agree, serves as an estimate of their Jaccard similarity.)

In [13]:
#Loading the records
records = ["John Smith studies with his buddies.", "Jon Smith, studies with his buddyes."]

#Method for hashing function to use
def hash_function(x, p):
    hash_value = 0
    for i, c in enumerate(x):
        hash_value += ord(c) * (31 ** i)
    return hash_value % p

#Method for applying MinHash to the given records
def apply_min_hash(record, k, p):
    shingles = generate_shingles(record, k)
    hashes = [hash_function(shingle, p) for shingle in shingles]
    min_hash = min(hashes)
    return min_hash

#Generating minhash results for given records
minhash_records = [apply_min_hash(record, 2, 11) for record in records]
minhash_records #Equal values suggest near duplicates (with one hash function and only 11 buckets this is weak evidence)

[0, 0]
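
The multi-hash variant mentioned above can be sketched as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the family of hash functions is simulated by salting MD5 with a seed (`seeded_hash` is a hypothetical helper), and `word_shingles` repeats the shingling logic of `generate_shingles` so that the block is self-contained.

```python
import hashlib

def word_shingles(record, k):
    '''Set of k-word shingles, standardized as in generate_shingles above.'''
    cleaned = ''.join(c for c in record if c.isalpha() or c == ' ')
    tokens = cleaned.split()
    return {''.join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}

def seeded_hash(shingle, seed):
    '''One member of a family of hash functions, selected by the seed (assumed construction).'''
    data = '{}:{}'.format(seed, shingle).encode('utf-8')
    return int.from_bytes(hashlib.md5(data).digest()[:8], 'big')

def minhash_signature(shingles, num_hashes):
    '''Signature = minimum hash value of the shingle set under each hash function.'''
    return [min(seeded_hash(s, seed) for s in shingles) for seed in range(num_hashes)]

def estimate_jaccard(sig1, sig2):
    '''Fraction of signature positions on which the two records agree (the ratio x/k).'''
    matches = sum(1 for a, b in zip(sig1, sig2) if a == b)
    return matches / len(sig1)

records = ["John Smith studies with his buddies.", "Jon Smith, studies with his buddyes."]
sets = [word_shingles(record, 2) for record in records]
signatures = [minhash_signature(s, 200) for s in sets]

exact = len(sets[0] & sets[1]) / len(sets[0] | sets[1])
estimate = estimate_jaccard(signatures[0], signatures[1])
print(exact)     # 3 shared shingles out of 7 distinct ones
print(estimate)  # should land near the exact value
```

With more hash functions the estimate concentrates around the exact Jaccard index, at the cost of longer signatures.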

### Locality-Sensitive Hashing (LSH):
Locality-Sensitive Hashing (LSH) is a technique to find near duplicates in a dataset without comparing every pair of records. MinHash is one family of locality-sensitive hash functions, and LSH is commonly built on top of it as follows:

* Generate k-shingles for the given piece of text
* Once the shingles have been generated, use MinHashing to generate a MinHash value for each record
* The MinHashing procedure is carried out several times using multiple hash functions, say k, giving us a k-sized signature.
* Banding is applied to the signature of each record, dividing it into a set of sub-signatures, each of which is assigned to a band.
* For each sub-signature, we use a hash function to generate a hash value and place the record into the corresponding bucket of that band.
* Records whose sub-signatures land in the same bucket for at least one band are considered candidate near duplicates
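
The banding step can be sketched as below. This is a minimal illustration under stated assumptions: the signatures are hypothetical values made up for the example, `lsh_candidate_pairs` is a helper name introduced here, and the tuple of rows in a band is used directly as the bucket key (Python's dictionary hashes it internally) instead of calling a separate hash function.

```python
from collections import defaultdict

def lsh_candidate_pairs(signatures, bands):
    '''Return pairs of record ids whose signatures agree on every row of at least one band.'''
    rows = len(next(iter(signatures.values()))) // bands
    candidates = set()
    for band in range(bands):
        buckets = defaultdict(list)
        for record_id, signature in signatures.items():
            # The tuple of rows in this band serves as the bucket key
            key = tuple(signature[band * rows:(band + 1) * rows])
            buckets[key].append(record_id)
        for ids in buckets.values():
            for i in range(len(ids)):
                for j in range(i + 1, len(ids)):
                    candidates.add((ids[i], ids[j]))
    return candidates

# Hypothetical 6-value MinHash signatures, made up for illustration
signatures = {
    'r1': [3, 7, 1, 9, 4, 2],
    'r2': [3, 7, 5, 9, 8, 2],
    'r3': [6, 0, 5, 1, 8, 3],
}
print(lsh_candidate_pairs(signatures, bands=3))  # {('r1', 'r2')}
```

Here `r1` and `r2` agree on the whole first band `(3, 7)`, so they become a candidate pair, while `r3` never shares a full band with either. Candidate pairs would then be verified with an exact similarity measure. Using more bands with fewer rows each makes the scheme more permissive; fewer, wider bands make it stricter.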

### Tf-Idf Vectorization:
Tf-Idf (Term-Frequency Inverse-Document-Frequency) Vectorization is a technique that is used to map text into numerical vectors and can be utilized to detect near duplicate records. The vectorization process works by representing each record/document as a vector of Tf-Idf values which are computed by taking the product of the term frequency (number of occurrences of a term) and its inverse document frequency (measure of rarity of the term in the records/documents). Thus, this allows us to compare the similarity between the Tf-Idf vectors obtained for given records, which can tell us whether the records are near duplicates or not.

#### Example:
Consider the following two records: "The cat sat on the mat" and "The feline was sitting on the carpet". We will use the machine learning library scikit-learn to perform tf-idf vectorization. The results obtained can then be used by similarity metrics to show whether the records are near duplicates or not.

In [14]:
# Define the records
records = ["The cat sat on the mat", "The feline was sitting on the carpet"]

#Method to compute tf-idf vectors for given records
def compute_tfidf_vectorizers(records):
    # Create an instance of TfidfVectorizer
    vectorizer = TfidfVectorizer()

    # Fit the vectorizer on the records
    tfidf_vectors = vectorizer.fit_transform(records)

    # Get the feature names (replaces the older get_feature_names(), which newer scikit-learn releases no longer provide)
    feature_names = vectorizer.get_feature_names_out()

    # Store tf-idf values in a list of dictionaries
    result = []
    for i, record in enumerate(records):
        temp = {}
        for j, feature in enumerate(feature_names):
            temp[feature] = tfidf_vectors[i,j]
        result.append(temp)
    return result

#Generating resultant tf-idf vectors for records
compute_tfidf_vectorizers(records)



[{'carpet': 0.0,
  'cat': 0.42519636159088015,
  'feline': 0.0,
  'mat': 0.42519636159088015,
  'on': 0.30253071324069974,
  'sat': 0.42519636159088015,
  'sitting': 0.0,
  'the': 0.6050614264813995,
  'was': 0.0},
 {'carpet': 0.39129369358468363,
  'cat': 0.0,
  'feline': 0.39129369358468363,
  'mat': 0.0,
  'on': 0.2784086857278066,
  'sat': 0.0,
  'sitting': 0.39129369358468363,
  'the': 0.5568173714556132,
  'was': 0.39129369358468363}]
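
To see where these numbers come from, the computation can be reproduced by hand. The following is a minimal sketch, assuming the settings that `TfidfVectorizer` uses by default (smoothed idf, `idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization) and a plain lowercase whitespace tokenizer, which is enough for these two records; `tfidf_by_hand` is a hypothetical helper.

```python
import math
from collections import Counter

records = ["The cat sat on the mat", "The feline was sitting on the carpet"]

def tfidf_by_hand(records):
    '''Tf-idf with smoothed idf and L2 normalization (assumed to mirror TfidfVectorizer defaults).'''
    docs = [Counter(record.lower().split()) for record in records]
    n_docs = len(docs)
    vocabulary = sorted(set().union(*docs))
    # Document frequency: number of records that contain the term
    df = {term: sum(1 for doc in docs if term in doc) for term in vocabulary}
    idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocabulary}
    vectors = []
    for doc in docs:
        # Raw weight = term count in the record times the idf of the term
        raw = {term: doc[term] * idf[term] for term in vocabulary}
        norm = math.sqrt(sum(value ** 2 for value in raw.values()))
        vectors.append({term: value / norm for term, value in raw.items()})
    return vectors

vectors = tfidf_by_hand(records)
print(round(vectors[0]['the'], 4), round(vectors[0]['cat'], 4))  # 0.6051 0.4252
```

Terms that occur in both records ("the", "on") get the minimum idf of 1, while terms unique to one record get a higher idf. "the" still has the largest weight in the first record because it occurs twice.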

### Siamese Networks:
Siamese Networks are a type of neural network architecture used to learn a similarity function between pairs of inputs. The main idea is to process both inputs of the pair with the same shared set of weights and then apply a similarity measure to the two outputs to check whether they match. This property makes Siamese networks a good fit for identifying near duplicates in text.

A basic approach to achieve this is to pass both inputs through a shared LSTM layer. The last hidden state of the layer gives a representation of each input, which similarity metrics can then use to decide whether the inputs are near duplicates or not. Other approaches are possible depending on the type of dataset.

# Similarity Measurement Phase:
We will now look at the similarity measurement phase, in which we try to find the degree of similarity between the features obtained from the given records.

### Levenshtein Distance:
Levenshtein Distance is a character based similarity metric that measures the minimum number of single-character edits, i.e. insertions, deletions and substitutions, required to make the given pair of records identical. It is often referred to simply as "Edit Distance", though strictly it is one member of the edit distance family. A variant of Levenshtein Distance (called Damerau–Levenshtein distance) also counts transpositions of adjacent characters as a single edit.

#### Example:
Consider the two records "John Smith studies with his buddies." and "Jon Smith, studies with his buddyes.", for which we compute the Levenshtein distance:

In [15]:
#Loading the records
records = ["John Smith studies with his buddies.", "Jon Smith, studies with his buddyes."]

#Method to compute Levenshtein Distance between pair of records
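def compute_levenshtein_distance(record1, record2):
    #previous[j] holds the distance between the processed prefix of record1 and record2[:j]
    previous = list(range(len(record2) + 1))
    for i, char1 in enumerate(record1, start=1):
        current = [i]
        for j, char2 in enumerate(record2, start=1):
            cost = 0 if char1 == char2 else 1
            current.append(min(previous[j] + 1,         #deletion
                               current[j - 1] + 1,      #insertion
                               previous[j - 1] + cost)) #substitution
        previous = current
    return previous[-1]

#Computing levenshtein distance for the texts (delete 'h', insert ',', substitute 'i' with 'y')
compute_levenshtein_distance(records[0], records[1])

3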
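
The transposition-aware variant mentioned above can be sketched as follows. This is a minimal illustration of the restricted form of Damerau–Levenshtein distance (often called optimal string alignment), which adds one extra case to the usual dynamic programming recurrence; `osa_distance` is a name introduced here for illustration.

```python
def osa_distance(a, b):
    '''Optimal string alignment distance: Levenshtein edits plus adjacent transpositions.'''
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters of a
    for j in range(cols):
        d[0][j] = j  # insert all j characters of b
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[-1][-1]

print(osa_distance("Jhon", "John"))  # 1
```

Plain Levenshtein distance would count the swapped "ho" in "Jhon" as two substitutions, whereas the transposition case treats it as a single edit, which matches how such typos usually arise.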

### Jaccard Similarity:
Jaccard Similarity Coefficient is a metric that measures the overlap between the sets of words/characters occurring in two different records. It is defined as the number of unique words/characters common to the two sets divided by the total number of unique words/characters across the two sets. Thus, if we represent each record as a set of words/characters, the Jaccard similarity coefficient for the two records is given by:

<p><center>$J(A,B) = \frac{|A \cap B|}{|A \cup B|}$</center></p>

#### Example:
Consider the following two records whose similarity is to be computed with Jaccard similarity: "The old man went to the shop" and "The old lady went to the shop". Based on the defined formula:

In [16]:
#Loading the records to handle
records = ["The old man went to the shop", "The old lady went to the shop"]

#Method to compute Jaccard similarity
def compute_jaccard_similarity(record1, record2):
    record1_set = set(record1.split(' '))
    record2_set = set(record2.split(' '))
    score = len(record1_set.intersection(record2_set)) / len(record1_set.union(record2_set))
    return score

#Computing jaccard score for given records
compute_jaccard_similarity(records[0], records[1])

0.75
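
Since the definition allows sets of characters as well as words, the same formula can be applied to character N-grams, which is more forgiving of typos. The following is a minimal sketch comparing the two choices on the names "John Smith" and "Jon Smith"; `char_ngrams` and `jaccard` are hypothetical helpers, and the use of overlapping character bigrams is an assumption made for illustration.

```python
def char_ngrams(text, n=2):
    '''Set of overlapping character n-grams of the lowercased text.'''
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def jaccard(set_a, set_b):
    '''Size of the intersection divided by the size of the union.'''
    return len(set_a & set_b) / len(set_a | set_b)

word_score = jaccard(set("John Smith".split()), set("Jon Smith".split()))
char_score = jaccard(char_ngrams("John Smith"), char_ngrams("Jon Smith"))
print(round(word_score, 2), round(char_score, 2))  # 0.33 0.7
```

At word level the typo makes "John" and "Jon" completely different tokens, so only 1 of 3 distinct words is shared. At character-bigram level, 7 of the 10 distinct bigrams are shared.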

### Sørensen-Dice Coefficient:
Sørensen-Dice Coefficient is yet another metric that can provide insight into the similarity of two records based on their characters/words. It is computed as twice the number of characters/words common to the two records divided by the total number of elements in both records. Thus, if we represent each record as a set of characters/words, the Sørensen-Dice coefficient is given by:

<p><center>$DSC = \frac{2|A \cap B|}{|A|+|B|}$</center></p>

#### Example:
Consider the following two records whose similarity is to be computed with Sørensen-Dice similarity: "The old man went to the shop" and "The old lady went to the shop". Based on the defined formula:

In [28]:
#Loading the records to handle
records = ["The old man went to the shop", "The old lady went to the shop"]

#Method to compute Sørensen-Dice similarity
def compute_dice_similarity(record1, record2):
    record1_set = set(record1.split(' '))
    record2_set = set(record2.split(' '))
    score = (2 * len(record1_set.intersection(record2_set))) / (len(record1_set) + len(record2_set)) 
    return score

#Computing Sørensen-Dice score for given records
compute_dice_similarity(records[0], records[1])

0.8571428571428571
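
This score can be cross-checked against the Jaccard score of 0.75 obtained for the same two records. Since $|A|+|B| = |A \cup B| + |A \cap B|$, dividing the numerator and denominator by $|A \cup B|$ gives a direct relation between the two coefficients (a short worked derivation, not part of the original computation):

$$DSC = \frac{2|A \cap B|}{|A \cup B| + |A \cap B|} = \frac{2\,J(A,B)}{1 + J(A,B)} = \frac{2 \times 0.75}{1 + 0.75} \approx 0.857$$

The two measures therefore always rank pairs of records in the same order; the Sørensen-Dice coefficient simply gives more weight to the shared elements.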

### Cosine Similarity:
Cosine Similarity is a metric for measuring the similarity of text in two records/documents using numeric vector representations of the input records. It takes the non-zero vectors of the records and computes the cosine of the angle between them from their dot product and magnitudes. In general this value lies in the range [-1,1], with 1 indicating an angle of 0 degrees (same direction) and -1 indicating an angle of 180 degrees (completely opposite); for vectors with no negative components, such as Tf-Idf vectors, it lies in [0,1]. The cosine of the angle between two vectors is given by:

<h3><center>$cos({\theta_{(A,B)}}) = \frac{A \cdot B}{\|A\| \|B\|}$</center></h3>

Cosine similarity is usually used to measure similarity after converting the records under consideration into numerical vector representations via word embedding techniques, e.g. GloVe, Word2Vec, etc. It can also be used with other vectorization techniques such as Tf-Idf vectorization to get satisfactory results.

#### Example:
Consider the following two records: "The cat sat on the mat" and "The feline was sitting on the carpet". We will use the machine learning library scikit-learn to perform tf-idf vectorization. The obtained vectors are then passed to the cosine similarity method to measure the degree of similarity between them.

In [51]:
# Define the records
records = ["The cat sat on the mat", "The feline was sitting on the carpet"]

#Method to compute cosine similarity between two vector representations
def compute_cosine_similarity(vec1, vec2):
    return (vec1 @ vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))

#Standardize the records by removing stop words (compared word by word, ignoring case)
stop_words = set(stopwords.words('english'))
for i in range(len(records)):
    records[i] = ' '.join(x for x in records[i].split() if x.lower() not in stop_words)

#Generating resultant tf-idf vectors for records
tfidf_vectors = compute_tfidf_vectorizers(records)

#Extracting vector representations for the two records
vec1, vec2 = [np.array(list(vector.values())) for vector in tfidf_vectors]

#Computing the cosine similarity between the vectors
#(after stop word removal the records share no terms, so the score is 0)
compute_cosine_similarity(vec1, vec2)

0.0

Here I have covered only a glimpse of the techniques and measures used for text/document similarity. There are metrics that identify whether two inputs have the same semantic meaning, there are techniques to improve the efficiency of computing these measures, and there is a great deal of ongoing research targeting the problem of removing near duplicates from datasets. That being said, this notebook just touched on the basics of what can be used to reduce this problem. This is where we end the topic of Duplicate Handling for datasets; we will start looking at the topic of missing values tomorrow!