<center><h1>Text Preprocessing using text mining</h1></center>

<h3><u>Tasks :</u></h3> 
<h4>- Text Cleaning</h4> 
<h4>- Text similarity evaluation</h4> 
<h4>- Extraction of common and uncommon parts</h4> 
<h4>- N gram distribution</h4>  
<h4>- Apply the mask on the uncommon parts</h4> 

## 0) Import the libraries

In [11]:
import numpy as np
import re
import nltk
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder, TrigramAssocMeasures, TrigramCollocationFinder
from nltk.util import ngrams
import spacy

## 1) Text cleaning

In [12]:
# Initialize the two sentences to preprocess

sentence1 = "I love to pay my video games in my free time, especially retro video games."
sentence2 = "I love to play oreo games in my free thyme, especially retro video games."

In [13]:
# Remove punctuation

sentence1 = re.sub(r'[^\w\s]', '', sentence1)
sentence2 = re.sub(r'[^\w\s]', '', sentence2)
sentence1, sentence2

('I love to pay my video games in my free time especially retro video games',
 'I love to play oreo games in my free thyme especially retro video games')
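The pattern `[^\w\s]` deletes every character that is neither a word character nor whitespace. The sketch below wraps it in a hypothetical helper, `clean_sentence` (not part of the notebook), to show two side effects worth knowing about: hyphens and apostrophes are deleted without inserting a space, and underscores survive because `\w` includes them.

```python
import re

def clean_sentence(text):
    """Remove punctuation, then collapse runs of whitespace (hypothetical helper)."""
    # Delete every character that is neither a word character nor whitespace
    without_punctuation = re.sub(r"[^\w\s]", "", text)
    # Collapse any whitespace runs left behind and trim the ends
    return " ".join(without_punctuation.split())

print(clean_sentence("retro, video-games!"))  # the hyphen is dropped, so the two words fuse
print(clean_sentence("snake_case isn't punctuation-free"))
```

If fused words such as `videogames` are a problem, one option is to replace punctuation with a space instead of an empty string before collapsing whitespace.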

## 2) Text similarity analysis

In [None]:
# Low-priority task: to be done later, starting with cosine similarity
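As a possible starting point for this step, here is a minimal sketch of cosine similarity over bag-of-words counts. It relies only on the standard library; the function name `cosine_similarity`, the lowercasing, and the whitespace tokenization are assumptions made for illustration, not choices taken from the notebook.

```python
import math
from collections import Counter

def cosine_similarity(text_a, text_b):
    """Cosine similarity between the word-count vectors of two texts (sketch)."""
    counts_a = Counter(text_a.lower().split())
    counts_b = Counter(text_b.lower().split())
    # Dot product over the words the two texts share
    dot = sum(counts_a[word] * counts_b[word] for word in counts_a.keys() & counts_b.keys())
    norm_a = math.sqrt(sum(count * count for count in counts_a.values()))
    norm_b = math.sqrt(sum(count * count for count in counts_b.values()))
    # An empty text has no direction, so treat its similarity as zero
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

cleaned_1 = "I love to pay my video games in my free time especially retro video games"
cleaned_2 = "I love to play oreo games in my free thyme especially retro video games"
print(round(cosine_similarity(cleaned_1, cleaned_2), 3))
```

Word counts ignore word order, so this score only reflects vocabulary overlap; it cannot tell where in the sentence the differences occur.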

## 3) Extraction of common and uncommon parts

### Using Dynamic programming

In [62]:
def common_and_uncommon_extraction(str1, str2, len1, len2):
    # initialize the L matrix with zeros
    L = [[0] * (len2 + 1) for _ in range(len1 + 1)]

    # fill in the L matrix using dynamic programming:
    # L[i][j] is the length of the longest common subsequence (LCS) of str1[:i] and str2[:j]
    for i in range(len1 + 1):
        for j in range(len2 + 1):
            # if either prefix is empty, the LCS length is zero
            if i == 0 or j == 0:
                L[i][j] = 0
            # if the tokens match, extend the LCS of the two shorter prefixes by one
            elif str1[i - 1] == str2[j - 1]:
                L[i][j] = L[i - 1][j - 1] + 1
            # if the tokens don't match, keep the longer LCS obtained by dropping the last token of either prefix
            else:
                L[i][j] = max(L[i - 1][j], L[i][j - 1])

    # calculate the index based on the length of the longer string
    index = max(len(str1), len(str2))

    # initialize the common list with empty strings
    common = [""] * (index + 1)
    common[index] = ""

    # initialize the uncommon substring lists
    uncommon_str1 = []
    uncommon_str2 = []

    # set i and j to the end of each string
    i = len1
    j = len2

    tracker_str1 = -1
    tracker_str2 = -1
    sub_uncommon_str1 = []
    sub_uncommon_str2 = []

    # loop through the L matrix to find the common and uncommon substrings
    while i > 0 and j > 0:
        # if the characters match, add the character to the common list and move to the previous diagonal cell
        if str1[i - 1] == str2[j - 1]:
            common[index - 1] = str1[i - 1]
            i -= 1
            j -= 1
            index -= 1
        # if dropping str1[i - 1] keeps the longer subsequence, that token is unmatched: record it (str1 tokens are collected in uncommon_str2, with one "#" placeholder per run) and move to the previous row
        elif L[i - 1][j] > L[i][j - 1]:
            if tracker_str2 == -1:
                tracker_str2 = i - 1
                common[index - 1] = "#"
                index -= 1
                sub_uncommon_str2.append(str1[i - 1])
            elif tracker_str2 == i:
                sub_uncommon_str2.append(str1[i - 1])
                tracker_str2 = i - 1
            else:
                sub_uncommon_str2.reverse()
                uncommon_str2.append(sub_uncommon_str2)
                sub_uncommon_str2 = []
                tracker_str2 = i - 1
                common[index - 1] = "#"
                index -= 1
                sub_uncommon_str2.append(str1[i - 1])

            i -= 1
        # otherwise str2[j - 1] is unmatched: record it (str2 tokens are collected in uncommon_str1) and move to the previous column
        else:
            if tracker_str1 == -1:
                tracker_str1 = j - 1
                sub_uncommon_str1.append(str2[j - 1])
            elif tracker_str1 == j:
                sub_uncommon_str1.append(str2[j - 1])
                tracker_str1 = j - 1
            else:
                sub_uncommon_str1.reverse()
                uncommon_str1.append(sub_uncommon_str1)
                sub_uncommon_str1 = []
                tracker_str1 = j - 1
                sub_uncommon_str1.append(str2[j - 1])

            j -= 1

    if len(sub_uncommon_str1) > 0:
        sub_uncommon_str1.reverse()
        uncommon_str1.append(sub_uncommon_str1)
    if len(sub_uncommon_str2) > 0:
        sub_uncommon_str2.reverse()
        uncommon_str2.append(sub_uncommon_str2)

    # join the filled slots of the common list into a sentence, skipping the unused empty slots
    common_sentence = " ".join(token for token in common if token)

    # reverse the order of the uncommon substring lists
    uncommon_str1.reverse()
    uncommon_str2.reverse()

    # return the common sentence and the lists of uncommon substrings
    return common_sentence, uncommon_str1, uncommon_str2
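The table `L` in the function above follows the classic longest common subsequence (LCS) recurrence over word tokens. As a small, separate sketch of that recurrence only, the hypothetical helper `lcs_length` below computes the LCS length while keeping just two rows of the table in memory; it does not reconstruct the common or uncommon parts.

```python
def lcs_length(tokens_a, tokens_b):
    """Length of the longest common subsequence of two token lists (sketch)."""
    # previous[j] is the LCS length of the tokens seen so far in tokens_a and tokens_b[:j]
    previous = [0] * (len(tokens_b) + 1)
    for token_a in tokens_a:
        current = [0]
        for j, token_b in enumerate(tokens_b, start=1):
            if token_a == token_b:
                # A match extends the LCS of the two shorter prefixes
                current.append(previous[j - 1] + 1)
            else:
                # Otherwise keep the better of dropping one token from either side
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]

tokens_1 = "I love to pay my video games in my free time especially retro video games".split()
tokens_2 = "I love to play oreo games in my free thyme especially retro video games".split()
print(lcs_length(tokens_1, tokens_2))
```

A quick sanity check on any extraction result is that the number of non-`#` words in the common sentence equals this length.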

In [63]:
print("Sentence 1 : "+ sentence1)
print("\nSentence 2 : "+ sentence2)

common_sentence, uncommon_str1, uncommon_str2 = common_and_uncommon_extraction(sentence1.split(), 
                                                                            sentence2.split(), 
                                                                            len(sentence1.split()),
                                                                            len(sentence2.split()))

print("\nCommon parts of the 2 sentences with the uncommon parts masked :\n"+common_sentence) 
# uncommon_str1 holds the words found only in sentence 2, uncommon_str2 those found only in sentence 1
print("\nWords found only in sentence 2 : "+ str(uncommon_str1))
print("\nWords found only in sentence 1 : "+ str(uncommon_str2))

Sentence 1 : I love to pay my video games in my free time especially retro video games

Sentence 2 : I love to play oreo games in my free thyme especially retro video games

Common parts of the 2 sentences with the uncommon parts masked :
I love to # games in my free # especially retro video games

Words found only in sentence 2 : [['play', 'oreo'], ['thyme']]

Words found only in sentence 1 : [['pay', 'my', 'video'], ['time']]
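One way to cross-check the extraction is the standard library's `difflib.SequenceMatcher`, which aligns two sequences and reports the differing spans. The sketch below is an alternative technique, not the notebook's dynamic-programming method: `SequenceMatcher` works from longest contiguous matching blocks and does not guarantee a longest common subsequence, so results may differ on other inputs. The helper name `diff_tokens` is hypothetical.

```python
from difflib import SequenceMatcher

def diff_tokens(tokens_a, tokens_b):
    """Return the masked common sentence and the spans unique to each token list (sketch)."""
    matcher = SequenceMatcher(None, tokens_a, tokens_b, autojunk=False)
    masked, only_in_a, only_in_b = [], [], []
    for tag, a_start, a_end, b_start, b_end in matcher.get_opcodes():
        if tag == "equal":
            masked.extend(tokens_a[a_start:a_end])
            continue
        # Any replace, delete, or insert span becomes a single "#" placeholder
        masked.append("#")
        if a_end > a_start:
            only_in_a.append(tokens_a[a_start:a_end])
        if b_end > b_start:
            only_in_b.append(tokens_b[b_start:b_end])
    return " ".join(masked), only_in_a, only_in_b

tokens_1 = "I love to pay my video games in my free time especially retro video games".split()
tokens_2 = "I love to play oreo games in my free thyme especially retro video games".split()
masked_sentence, only_in_1, only_in_2 = diff_tokens(tokens_1, tokens_2)
print(masked_sentence)
print(only_in_1)
print(only_in_2)
```

Here the returned lists are named after the sentence their words come from, which makes it easy to compare them against the lists printed by the cell above.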


## 4) N gram distribution

### 4.1) Finding the most fit value of N

In [None]:
# In progress

### 4.2) Finding the best N-grams

In [101]:
def ngram_distribution(uncommon_str1, uncommon_str2, sentence1, sentence2):
    # Initialize empty lists to store the final best uncommon n-grams for each string
    final_uncommon_str1 = []
    final_uncommon_str2 = []
    
    # Iterate through the uncommon parts for each string
    for i in range(len(uncommon_str1)):
        # Check if the lengths of the current uncommon part for each string are different
        if len(uncommon_str1[i]) != len(uncommon_str2[i]):
            # If the lengths are different, create a list of all potential bigrams from sentence1 and sentence2
            n = 2
            bigram_measures = BigramAssocMeasures()

            # Find the best uncommon bigrams for uncommon_str1 in case the length of the uncommon part is greater than 2
            if len(uncommon_str1[i]) > 2:
                # Make a copy of the current list of the current uncommon part for string 1
                uncommon_str1_i = uncommon_str1[i].copy()

                # Words of the sentence outside the current uncommon part; they are not allowed in the bigrams
                common_words_str1 = list(set(sentence1.split()) - set(uncommon_str1_i))

                # Tokenize the sentence
                tokens_str1 = nltk.word_tokenize(sentence1)

                # Generate a list of all n-grams of size n for the sentence
                n_grams_str1 = list(ngrams(tokens_str1, n))
                
                # Use the bigram collocation finder to get the best bigrams for the sentence
                finder_str1 = BigramCollocationFinder.from_words(tokens_str1)
                best_bigrams_str1 = finder_str1.nbest(bigram_measures.pmi, len(n_grams_str1))

                # Discard bigrams that contain any word outside the current uncommon part
                best_uncommon_ngrams_str1 = [ngram for ngram in best_bigrams_str1 if (not any(p_ngrams in ngram for p_ngrams in common_words_str1))]
                
                # Build the final list for string 1: keep each ranked bigram whose two words are both still unassigned, then append the leftover single words
                uncommon_ngrams_str1 = []
                for bigram in best_uncommon_ngrams_str1:
                    if bigram[0] in uncommon_str1_i and bigram[1] in uncommon_str1_i:
                        uncommon_ngrams_str1.append(" ".join(bigram))
                        uncommon_str1_i.remove(bigram[0])
                        uncommon_str1_i.remove(bigram[1])
                if uncommon_str1_i != []:
                    uncommon_ngrams_str1.extend(uncommon_str1_i)

                final_uncommon_str1.append(uncommon_ngrams_str1) # Append the final uncommon n-grams for string 1 to the list
            else :
                # If the length of the current uncommon words list for string 1 is 2 or less, append the current list to the final list
                final_uncommon_str1.append(uncommon_str1[i])

            if len(uncommon_str2[i]) > 2:
                # Make a copy of the current list of the current uncommon part for string 2
                uncommon_str2_i = uncommon_str2[i].copy()

                # Words of the sentence outside the current uncommon part; they are not allowed in the bigrams
                common_words_str2 = list(set(sentence2.split()) - set(uncommon_str2_i))
                
                # Tokenize the sentence
                tokens_str2 = nltk.word_tokenize(sentence2)
                
                # Generate a list of all n-grams of size n for the sentence
                n_grams_str2 = list(ngrams(tokens_str2, n))
                
                # Use the bigram collocation finder to get the best bigrams for the sentence
                finder_str2 = BigramCollocationFinder.from_words(tokens_str2)
                best_bigrams_str2 = finder_str2.nbest(bigram_measures.pmi, len(n_grams_str2))

                # Discard bigrams that contain any word outside the current uncommon part
                best_uncommon_ngrams_str2 = [ngram for ngram in best_bigrams_str2 if (not any(p_ngrams in ngram for p_ngrams in common_words_str2))]
                
                # Build the final list for string 2: keep each ranked bigram whose two words are both still unassigned, then append the leftover single words
                uncommon_ngrams_str2 = []
                for bigram in best_uncommon_ngrams_str2:
                    if bigram[0] in uncommon_str2_i and bigram[1] in uncommon_str2_i:
                        uncommon_ngrams_str2.append(" ".join(bigram))
                        uncommon_str2_i.remove(bigram[0])
                        uncommon_str2_i.remove(bigram[1])
                if uncommon_str2_i != []:
                    uncommon_ngrams_str2.extend(uncommon_str2_i)

                final_uncommon_str2.append(uncommon_ngrams_str2) # Append the final uncommon n-grams for string 2 to the list
            else :
                # If the length of the current uncommon words list for string 2 is 2 or less, append the current list to the list
                final_uncommon_str2.append(uncommon_str2[i])

        else:
            # If the lengths of the current uncommon parts for each string are the same, append the current lists to the final lists
            final_uncommon_str1.append(uncommon_str1[i])
            final_uncommon_str2.append(uncommon_str2[i])

    return final_uncommon_str1, final_uncommon_str2

# uncommon_str1 holds the words found only in sentence2 and uncommon_str2 those found only in sentence1,
# so each list is passed together with the sentence its words come from
final_uncommon_str1, final_uncommon_str2 = ngram_distribution(uncommon_str1, uncommon_str2, sentence2, sentence1)

In [102]:
final_uncommon_str1

[['play', 'oreo'], ['thyme']]

In [103]:
final_uncommon_str2

[['pay my', 'video'], ['time']]
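The grouping `'pay my'` rather than `'my video'` comes from the pointwise mutual information (PMI) ranking: a bigram scores higher when its two words rarely occur apart. The sketch below recomputes PMI by hand with the standard library. It assumes the score is `log2(count(w1, w2) * N / (count(w1) * count(w2)))` with `N` the number of tokens, which I believe matches NLTK's bigram PMI but should be treated as an assumption; `bigram_pmi` is a hypothetical helper.

```python
import math
from collections import Counter

def bigram_pmi(tokens):
    """PMI score of every adjacent word pair in a token list (sketch)."""
    word_counts = Counter(tokens)
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    total = len(tokens)
    return {
        (first, second): math.log2(count * total / (word_counts[first] * word_counts[second]))
        for (first, second), count in bigram_counts.items()
    }

tokens = "I love to pay my video games in my free time especially retro video games".split()
scores = bigram_pmi(tokens)
# "pay" occurs once while "video" occurs twice, so ("pay", "my") is the more exclusive pairing
print(scores[("pay", "my")] > scores[("my", "video")])
```

Because `'pay my'` is selected first and consumes `'my'`, the lower-ranked `('my', 'video')` can no longer be formed, and `'video'` is kept as a single word.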

## 5) Apply the mask on the uncommon parts

In [104]:
def masking(common_sentence, uncommon_str):
    for i in range(len(uncommon_str)):
        mask = "[MASK] " * len(uncommon_str[i])
        common_sentence = common_sentence.replace("#", mask, 1)
    return " ".join(common_sentence.split())
masked_sentence = masking(common_sentence, final_uncommon_str1)

In [105]:
print("Sentence 1 : "+ sentence1)
print("\nSentence 2 : "+ sentence2)
print("\nCommon parts of the 2 sentences with the uncommon parts masked :\n"+ masked_sentence) 
# final_uncommon_str1 holds the parts found only in sentence 2, final_uncommon_str2 those found only in sentence 1
print("\nUncommon parts found only in sentence 2 : "+ str(final_uncommon_str1))
print("\nUncommon parts found only in sentence 1 : "+ str(final_uncommon_str2))

Sentence 1 : I love to pay my video games in my free time especially retro video games

Sentence 2 : I love to play oreo games in my free thyme especially retro video games

Common parts of the 2 sentences with the uncommon parts masked :
I love to [MASK] [MASK] games in my free [MASK] especially retro video games

Uncommon parts found only in sentence 2 : [['play', 'oreo'], ['thyme']]

Uncommon parts found only in sentence 1 : [['pay my', 'video'], ['time']]
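The masking step expands each `#` placeholder into one `[MASK]` per entry of the matching uncommon group. A possible single-pass variant is sketched below with `re.sub` and a replacement callback; the name `mask_placeholders` and the `mask_token` parameter are hypothetical, and the behavior is intended to mirror `masking` on inputs like the one above rather than replace it.

```python
import re

def mask_placeholders(template, groups, mask_token="[MASK]"):
    """Expand each '#' in template into one mask token per item of the next group (sketch)."""
    remaining_groups = iter(groups)

    def expand(match):
        group = next(remaining_groups, None)
        # Leave the placeholder untouched when there are more '#' than groups
        if group is None:
            return match.group(0)
        return " ".join([mask_token] * len(group))

    # Normalize spacing in case a group is empty and leaves a double space behind
    return " ".join(re.sub(r"#", expand, template).split())

template = "I love to # games in my free # especially retro video games"
print(mask_placeholders(template, [["play", "oreo"], ["thyme"]]))
```

Doing the substitution in one pass means text that has already been inserted is never rescanned for `#`, which matters only if a mask token could itself contain that character.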
