# Hi y'all, this is Team-5, and here we've documented all the deduplication methods and pipelines used in our assignment.

## The common first step: reading the input folders and writing the output folders

We have one parent folder that stores a number of subfolders, one per unique source. Each subfolder in turn contains a set of .txt files. Through our deduplication methodologies, we read all the .txt files across all the folders and remove the duplicate .txt files, that is, those whose similarity score with another file is above a pre-set threshold.

The following code reads and stores all the text files across all the input folders, and writes the unique text files into new output folders laid out in the same manner.

In [22]:
import os
from datasketch import MinHash, MinHashLSH

First of all, we'll load all the text files from our folder structure

In [23]:
import os
import chardet

In [24]:
def detect_file_encoding(file_path):
    # Open the file in binary mode for encoding detection
    with open(file_path, 'rb') as file:
        raw_data = file.read()
        result = chardet.detect(raw_data)
        encoding = result['encoding']
        confidence = result['confidence']
        return encoding, confidence

def load_documents_from_folders(root_folder):
    documents = []
    file_paths = []

    # Recursively walking through all subfolders and files
    for dirpath, _, filenames in os.walk(root_folder):
        for filename in filenames:
            if filename.endswith(".txt"):  # Process only .txt files
                file_path = os.path.join(dirpath, filename)

                # Detect encoding of the file
                encoding, confidence = detect_file_encoding(file_path)
                print(f"Detected encoding for {file_path}: {encoding} (confidence: {confidence})")

                try:
                    # Use the detected encoding to read the file
                    with open(file_path, 'r', encoding=encoding) as file:
                        text = file.read()
                except Exception as e:
                    print(f"Error reading file {file_path}: {e}")
                    continue

                # Append the text and its path together so that
                # documents[i] always corresponds to file_paths[i]
                documents.append(text)
                file_paths.append(file_path)

    return documents, file_paths
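
To see why the document list and the path list need to stay aligned index by index, here is a minimal, stdlib-only sketch of the same walk on a throwaway folder. It assumes every file is UTF-8 (so it skips the `chardet` step), and the helper name `load_txt_files` and the folder and file names are made up for illustration.

```python
import os
import tempfile

def load_txt_files(root_folder):
    """Hypothetical helper: collect the text and path of every .txt under root_folder."""
    documents, file_paths = [], []
    for dirpath, _, filenames in os.walk(root_folder):
        for filename in sorted(filenames):
            if not filename.endswith(".txt"):
                continue
            file_path = os.path.join(dirpath, filename)
            # Assumption: files are UTF-8; the notebook detects the encoding instead
            with open(file_path, "r", encoding="utf-8") as f:
                documents.append(f.read())
            file_paths.append(file_path)
    return documents, file_paths

# Build a tiny parent folder with two source subfolders
with tempfile.TemporaryDirectory() as root:
    samples = [("f1", "a.txt", "hello"), ("f2", "b.txt", "world"), ("f2", "notes.md", "skip me")]
    for sub, name, text in samples:
        os.makedirs(os.path.join(root, sub), exist_ok=True)
        with open(os.path.join(root, sub, name), "w", encoding="utf-8") as f:
            f.write(text)

    docs, paths = load_txt_files(root)
    # Pair each text with its file name; the .md file is ignored
    loaded = {os.path.basename(p): d for d, p in zip(docs, paths)}
    print(loaded)
```

Because the later steps refer to documents by their position (`doc_0`, `doc_1`, ...), the two lists must always have the same length and order.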


Here, we're writing the unique documents back to the output folder, while preserving the original folder structure.

In [25]:
def save_unique_documents(unique_docs, original_file_paths, output_folder):
    # Note: this function also reads the notebook-level variables
    # `input_folder` and `documents`, so both must be defined before calling it
    for doc_id in unique_docs:
        # IDs look like 'doc_<index>'; recover the index into the loaded lists
        idx = int(doc_id.split('_')[1])
        original_file_path = original_file_paths[idx]

        #recreating the original folder structure in the output folder
        relative_path = os.path.relpath(original_file_path, start=input_folder)
        output_file_path = os.path.join(output_folder, relative_path)
        os.makedirs(os.path.dirname(output_file_path), exist_ok=True)

        #copying the content to the new file
        with open(output_file_path, 'w', encoding='utf-8') as out_file:
            out_file.write(documents[idx])
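
The folder structure is preserved by taking each file's path relative to the input root and re-rooting it under the output folder. A small sketch of just that path arithmetic, using hypothetical folder names (nothing is read or written here):

```python
import os

# Hypothetical folders, built with os.path.join so the sketch works on any OS
input_root = os.path.join("data", "parentfolder")
output_root = os.path.join("data", "deduplicated_dataset")
original = os.path.join(input_root, "f1", "22.txt")

# Path of the file relative to the input root (subfolder plus file name)
relative = os.path.relpath(original, start=input_root)

# Same relative path, re-rooted under the output folder
target = os.path.join(output_root, relative)

print(relative)
print(target)
```

Calling `os.makedirs(os.path.dirname(target), exist_ok=True)` before writing then creates any missing subfolders without failing when they already exist.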

Storing the paths to our input and output parent folder, also loading the documents and file paths.

In [26]:
# Path to your input folder and output folder
input_folder = r"C:\Users\jiyad\Downloads\parentfolder"
output_folder = r"C:\Users\jiyad\OneDrive\Desktop\IITGN\NLP\deduplicated_dataset"

In [27]:
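# The loader above imports chardet, so that is the package it needs
# (charset-normalizer is a separate library with its own module name)
%pip install chardet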




In [28]:
documents, file_paths = load_documents_from_folders(input_folder)

Detected encoding for C:\Users\jiyad\Downloads\parentfolder\parentfolder\f1\22.txt: utf-8 (confidence: 0.99)
Detected encoding for C:\Users\jiyad\Downloads\parentfolder\parentfolder\f1\27_copy.txt: utf-8 (confidence: 0.99)
Detected encoding for C:\Users\jiyad\Downloads\parentfolder\parentfolder\f1\32.txt: utf-8 (confidence: 0.99)
Detected encoding for C:\Users\jiyad\Downloads\parentfolder\parentfolder\f2\31.txt: utf-8 (confidence: 0.99)
Detected encoding for C:\Users\jiyad\Downloads\parentfolder\parentfolder\f2\36.txt: utf-8 (confidence: 0.99)
Detected encoding for C:\Users\jiyad\Downloads\parentfolder\parentfolder\f2\37.txt: utf-8 (confidence: 0.99)
Detected encoding for C:\Users\jiyad\Downloads\parentfolder\parentfolder\f3\22+plus some random.txt: utf-8 (confidence: 0.99)
Detected encoding for C:\Users\jiyad\Downloads\parentfolder\parentfolder\f3\34.txt: utf-8 (confidence: 0.99)
Detected encoding for C:\Users\jiyad\Downloads\parentfolder\parentfolder\f3\35.txt: utf-8 (confidence: 0.9

Now, we've successfully managed to read in all the .txt files.

## Deduplication Pipelines

### Technique 1 - MinHash+LSH

First of all, we do some basic text normalization to improve the deduplication: lowercasing and collapsing extra whitespace.

In [29]:
def normalize_text(text):
    return ' '.join(text.lower().split())
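
As a quick illustration (the sample strings are made up), two texts that differ only in letter case and spacing come out identical after this normalization:

```python
def normalize_text(text):
    # Same one-liner as the cell above: lowercase, then collapse whitespace runs
    return ' '.join(text.lower().split())

a = "Hello   World\n\tthis is  A Test "
b = "hello world this is a test"

print(normalize_text(a))                       # hello world this is a test
print(normalize_text(a) == normalize_text(b))  # True
```

Punctuation is left untouched, so "test." and "test" would still count as different words.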

Now comes the major step: creating a MinHash object for each document and an LSH model to group similar documents.
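
To make the idea concrete, here is a toy, stdlib-only sketch of what a MinHash signature estimates: the Jaccard similarity between two word sets. The function names and sample sentences are made up, and the salted-hash scheme is a stand-in for illustration, not how `datasketch` computes its signatures internally.

```python
import hashlib

def jaccard(a, b):
    """Exact Jaccard similarity of two sets."""
    return len(a & b) / len(a | b)

def toy_minhash(words, num_perm=128):
    """Toy signature: for each seed, the minimum salted hash over the word set."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")  # a different salt acts as a different hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(w.encode("utf8"), digest_size=8, salt=salt).digest(), "big"
            )
            for w in words
        ))
    return signature

def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions where the two minima agree."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

doc_a = set("the quick brown fox jumps over the lazy dog".split())
doc_b = set("the quick brown fox leaps over the lazy dog".split())

exact = jaccard(doc_a, doc_b)  # 7 shared words out of 9 distinct words
approx = estimate_jaccard(toy_minhash(doc_a), toy_minhash(doc_b))
print(f"exact={exact:.3f} estimate={approx:.3f}")
```

The estimate gets closer to the exact value as the number of hash functions grows, which is the trade-off behind a setting such as `num_perm=128`.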

In [30]:
def compute_minhash(doc):
    m = MinHash(num_perm=128)
    normalized_doc = normalize_text(doc)  #Applying normalization
    for word in normalized_doc.split():
        m.update(word.encode('utf8'))
    return m

threshold = 0.85  #Setting similarity threshold for near-duplicate detection
lsh = MinHashLSH(threshold=threshold, num_perm=128)

#Inserting MinHashes into the LSH
minhashes = {}
for idx, doc in enumerate(documents):
    m = compute_minhash(doc)
    lsh.insert(f'doc_{idx}', m)
    minhashes[idx] = m
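
The threshold acts as a soft cut-off rather than a hard one. In the standard banding analysis of LSH, a signature is split into `b` bands of `r` rows, and two documents with Jaccard similarity `s` become candidates with probability `1 - (1 - s^r)^b`. The sketch below evaluates that formula for an assumed split of 16 bands of 8 rows; `datasketch` chooses its own split from `threshold` and `num_perm`, so these numbers are only illustrative.

```python
def candidate_probability(s, bands, rows):
    """Chance that two documents with Jaccard similarity s share at least one LSH bucket."""
    return 1.0 - (1.0 - s ** rows) ** bands

# Assumption for illustration: 128 hash values split into 16 bands of 8 rows
BANDS, ROWS = 16, 8

for s in (0.5, 0.7, 0.85, 0.95):
    p = candidate_probability(s, BANDS, ROWS)
    print(f"similarity {s:.2f} -> candidate probability {p:.3f}")
```

Since the results of `lsh.query` are candidates, pairs slightly below the threshold can occasionally be returned; if that matters, each candidate can be re-checked with the `jaccard` method of the stored MinHash objects.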

The next step is pretty straightforward: we'll keep track of unique and duplicate documents and ensure that the copy of the parent folder we're creating as output only stores the unique ones (the first occurrence in each group of duplicates).

In [31]:
unique_docs = set()
duplicates = set()

for idx, doc in enumerate(documents):
    doc_id = f'doc_{idx}'

    # Skip documents already flagged as duplicates of an earlier document;
    # otherwise both members of a duplicate pair would flag each other
    # and neither would be kept
    if doc_id in duplicates:
        continue

    #Querying similar documents from LSH
    similar_docs = lsh.query(minhashes[idx])
    for sim_doc in similar_docs:
        if sim_doc != doc_id:
            duplicates.add(sim_doc)
    unique_docs.add(doc_id)

#Filtering out duplicates
filtered_docs = [f'doc_{idx}' for idx in range(len(documents)) if f'doc_{idx}' not in duplicates]
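
The keep-the-first-occurrence rule can be illustrated on a toy example, with a hand-written neighbour map standing in for the LSH query results. The IDs and the map are hypothetical; here each list includes the document itself plus any near-duplicates.

```python
# Hypothetical neighbour map: for each document, the IDs a similarity query returns
neighbours = {
    "doc_0": ["doc_0", "doc_2"],
    "doc_1": ["doc_1"],
    "doc_2": ["doc_2", "doc_0"],
    "doc_3": ["doc_3"],
}

duplicates = set()
for doc_id in neighbours:  # dicts preserve insertion order, so doc_0 comes first
    if doc_id in duplicates:
        continue  # already covered by an earlier document
    for other in neighbours[doc_id]:
        if other != doc_id:
            duplicates.add(other)

kept = [doc_id for doc_id in neighbours if doc_id not in duplicates]
print(kept)  # ['doc_0', 'doc_1', 'doc_3']
```

`doc_0` and `doc_2` are near-duplicates of each other, and only `doc_0` survives because it is visited first. Without the early `continue`, `doc_2` would in turn flag `doc_0`, and both copies would be dropped.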

Saving the unique documents to the output folder, maintaining folder structure


In [32]:
save_unique_documents(filtered_docs, file_paths, output_folder)

print(f"Deduplicated documents saved to {output_folder}")

Deduplicated documents saved to C:\Users\jiyad\OneDrive\Desktop\IITGN\NLP\deduplicated_dataset


Voila!

### Clustering