# Near-duplicate detection

Near-duplicate detection in texts refers to the process of identifying documents or text segments that are almost, but not exactly, the same. These near-duplicates might differ in terms of a few words, punctuation, formatting, or minor rephrasing, but they convey very similar or identical information.
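
To make the definition concrete, here is a small sketch that compares an example question with a lightly edited copy and with an unrelated question. The three strings and the `surface_similarity` helper are made up for illustration. The helper uses character-level matching from Python's standard `difflib` module, which is not the embedding-based approach used in the rest of this notebook. It only shows what "almost, but not exactly, the same" looks like.

```python
from difflib import SequenceMatcher

# Hypothetical example texts (not taken from the dataset)
original = "How do I interpret a p-value in a t-test?"
near_duplicate = "How do I interpret the p-value in a t-test"
unrelated = "What is the best way to tune a random forest?"


def surface_similarity(a, b):
    """Return a character-level similarity ratio between 0 and 1."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


# The near-duplicate should score much closer to 1 than the unrelated text
print(surface_similarity(original, near_duplicate))
print(surface_similarity(original, unrelated))
```

Character-level matching like this tends to miss rephrasings that use different words for the same idea, which is one motivation for comparing embeddings instead.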

Applications of near-duplicate detection include:
- Search Engines: To improve the diversity of search results by filtering out pages that have substantially the same content.
- Plagiarism Detection: To identify instances where a text might have been copied with minor modifications.
- Data Deduplication: In large data repositories, it's important to avoid storing multiple near-identical versions of the same document. This is especially important in machine learning as duplicate texts may lead to imbalanced datasets.

# Loading OpenAI API key

In [1]:
import os
from dotenv import load_dotenv
from pathlib import Path

# Path to the .env file in the parent directory
env_path = Path("..") / ".env"

# Load environment variables from .env file
load_dotenv(dotenv_path=env_path)

# Get the OpenAI API key from the environment variables
openai_api_key = os.getenv("OPENAI_API_KEY")

# The dataset

The dataset we'll be using comes from Kaggle and can be found [here](https://www.kaggle.com/datasets/stackoverflow/statsquestions). It contains questions and answers from Cross Validated, the Stack Exchange site for statistics and machine learning (roughly the statistics equivalent of Stack Overflow). We'll use near-duplicate detection to find duplicate questions.

# High-level approach
To achieve near-duplicate detection using text embeddings, we'll:
- Split our texts into segments
- Retrieve embeddings for each of our segments
- Find the n most similar segments for each segment
- Record which segment pairs have a similarity higher than some threshold
- Provide an easy way of removing the near-duplicate texts from a pandas DataFrame
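
The steps above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the notebook's actual implementation: the helper names (`split_into_segments`, `find_near_duplicate_pairs`, `indices_to_drop`), the word-count splitting strategy, the threshold of 0.95, and the toy three-dimensional vectors are all hypothetical. In practice the vectors would come from an embedding model, and cosine similarity is assumed here as the similarity measure.

```python
import math


def split_into_segments(text, max_words=50):
    """Split a text into segments of at most `max_words` words (assumed strategy)."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def cosine_similarity(a, b):
    """Cosine similarity between two non-zero vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def find_near_duplicate_pairs(embeddings, n=2, threshold=0.95):
    """Return sorted (i, j) pairs with i < j whose similarity exceeds `threshold`.

    Only the `n` most similar neighbours of each segment are considered.
    """
    pairs = set()
    for i, emb in enumerate(embeddings):
        # Score every other segment against segment i, most similar first
        scored = [
            (cosine_similarity(emb, other), j)
            for j, other in enumerate(embeddings)
            if j != i
        ]
        scored.sort(reverse=True)
        for score, j in scored[:n]:
            if score > threshold:
                pairs.add((min(i, j), max(i, j)))
    return sorted(pairs)


def indices_to_drop(pairs):
    """Keep the first member of each pair and mark the second for removal."""
    return sorted({j for _, j in pairs})


# Toy embeddings standing in for real model output (assumption)
toy_embeddings = [
    [1.0, 0.0, 0.0],
    [0.99, 0.05, 0.0],  # nearly parallel to the first vector
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]

duplicate_pairs = find_near_duplicate_pairs(toy_embeddings, n=2, threshold=0.95)
print(duplicate_pairs)                   # [(0, 1)]
print(indices_to_drop(duplicate_pairs))  # [1]
```

If each segment index corresponds to a row label in a pandas DataFrame `df`, the result of `indices_to_drop` could be passed to `df.drop(index=...)` to remove the near-duplicates. Comparing every segment with every other segment is quadratic in the number of segments, so a large dataset would likely need an approximate nearest-neighbour index instead of this brute-force loop.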