In [19]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

# Download NLTK resources (you only need to do this once)
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')

# Function to preprocess text
def preprocess_text(text):
    # Tokenize the text
    tokens = word_tokenize(text.lower())

    # Remove stop words
    stop_words = set(stopwords.words('english'))
    filtered_tokens = [token for token in tokens if token not in stop_words]

    # Lemmatize the tokens
    lemmatizer = WordNetLemmatizer()
    lemmatized_tokens = [lemmatizer.lemmatize(token) for token in filtered_tokens]

    # Join tokens into a single string
    preprocessed_text = ' '.join(lemmatized_tokens)

    return preprocessed_text

# Function to check if any word matches
def check_word_match(prompt, website_content):
    # Build the stop-word set and the lemmatizer once, not once per token
    stop_words = set(stopwords.words('english'))
    lemmatizer = WordNetLemmatizer()

    def content_words(text):
        # Tokenize, then keep only alphanumeric, non-stop-word tokens so that
        # punctuation such as "." or "," cannot count as a match
        tokens = word_tokenize(text.lower())
        return {lemmatizer.lemmatize(token) for token in tokens
                if token.isalnum() and token not in stop_words}

    preprocessed_prompt = content_words(prompt)
    preprocessed_website = content_words(website_content)

    # Check word similarity
    for word in preprocessed_prompt:
        if word in preprocessed_website:
            return True

    return False

# Get user input
prompt = input("Enter a prompt: ")

# Website content (replace with your web scraping code)
website_content = "With the vector embeddings added to the database and indexed, we’re ready to start finding similar content. When users submit their article text as input, a request is made to an API endpoint that uses Pinecone’s SDK to query the index of vector embeddings. The endpoint returns 10 similar articles that were possibly plagiarized and displays them in the app’s UI. That’s it! Simple enough, right? The UI features a simple textarea input in which the user can paste the text from an article. When the user clicks the Submit button, this input is used to query a database of articles. Results and their match scores are then displayed to the user. To help reduce the amount of noise, the app also includes a slider input in which the user can specify a similarity threshold to only show extremely strong matches. Plagiarism is rampant on the internet and in the classroom. With so much content out there, it’s sometimes hard to know when something has been plagiarized. Authors writing blog posts may want to check if someone has stolen their work and posted it elsewhere. Teachers may want to check students’ papers against other scholarly articles for copied work. News outlets may want to check if a content farm has stolen their news articles and claimed the content as its own."

# Check if any word matches
is_copied = check_word_match(prompt, website_content)

# Output the result
if is_copied:
    print("Some words match. The content may be copied.")
else:
    print("No matching words found. The content seems original.")


[nltk_data] Downloading package stopwords to C:\Users\Samarth-
[nltk_data]     PC\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to C:\Users\Samarth-
[nltk_data]     PC\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to C:\Users\Samarth-
[nltk_data]     PC\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Enter a prompt: When the user clicks the Submit button
Some words match. The content may be copied.


**Walkthrough of the code and the theory behind it:**

The code above is a Python script that preprocesses text and checks for word matches between a user prompt and website content. Here's a breakdown of the important parts:

**Importing necessary NLTK modules:**

nltk is the Natural Language Toolkit library, which provides various tools and resources for working with human language data.
stopwords from nltk.corpus contains a list of common words (e.g., "the," "is," "and") that are often removed from text as they don't contribute much to the overall meaning.
word_tokenize from nltk.tokenize is used to split text into individual words or tokens.
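WordNetLemmatizer from nltk.stem reduces words to their dictionary base form, or lemma (e.g., "articles" becomes "article"). By default it treats every word as a noun, so a verb such as "running" is reduced to "run" only when pos="v" is passed to lemmatize.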

**Downloading NLTK resources:**

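The nltk.download calls fetch the data the script depends on: the stop-word list (stopwords), the tokenizer models used by word_tokenize (punkt), and the WordNet database used by the lemmatizer (wordnet). Each package needs to be downloaded only once per machine; later calls simply report that it is already up to date, as the output above shows. Some newer NLTK releases may also ask for a punkt_tab resource for tokenization.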

**Preprocessing Text:**

The preprocess_text function takes a text input, tokenizes it into words, removes stop words, lemmatizes the remaining tokens, and then joins them back into a single string.
Tokenization is the process of splitting text into individual words or tokens. word_tokenize is used to perform tokenization.
Stop words are commonly occurring words that carry little information about the content of the text. stopwords.words('english') returns a list of English stop words; the function converts it to a set for fast membership tests and drops every token found in it.
Lemmatization reduces words to their dictionary base form. WordNetLemmatizer is used to lemmatize the tokens; because no part of speech is passed, each token is treated as a noun.
Finally, the preprocessed tokens are joined back into a single string using ' '.join(lemmatized_tokens).
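
To make the stages concrete, here is a minimal sketch that prints what each step produces. It is a stand-in, not the notebook's pipeline: a regular expression takes the place of word_tokenize, a tiny hand-picked stop-word set takes the place of stopwords.words('english'), and lemmatization is left out, so it runs without any NLTK data. The names STOP_WORDS, simple_tokenize, and preprocess_stages are hypothetical.

```python
import re

# Hypothetical, deliberately tiny stop-word set; NLTK's English list is much longer.
STOP_WORDS = {"the", "a", "an", "is", "are", "and", "when", "to", "of", "in"}

def simple_tokenize(text):
    """Lowercase the text and return runs of letters, digits, or apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())

def preprocess_stages(text):
    """Return the intermediate result of each preprocessing stage."""
    tokens = simple_tokenize(text)                          # stage 1: tokenize
    filtered = [t for t in tokens if t not in STOP_WORDS]   # stage 2: drop stop words
    joined = " ".join(filtered)                             # stage 3: join back into a string
    return tokens, filtered, joined

tokens, filtered, joined = preprocess_stages("When the user clicks the Submit button")
print(tokens)    # ['when', 'the', 'user', 'clicks', 'the', 'submit', 'button']
print(filtered)  # ['user', 'clicks', 'submit', 'button']
print(joined)    # user clicks submit button
```

With the real lemmatizer in place, "clicks" would additionally be reduced to "click".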

**Checking for Word Matches:**

The check_word_match function takes a user prompt and website content as inputs and checks if any words in the prompt match with the words in the website content.
The prompt and website content are tokenized and preprocessed with essentially the same steps as preprocess_text; the steps are repeated inside check_word_match rather than performed by calling that function.
The function then iterates through each word in the preprocessed prompt and checks if it exists in the preprocessed website content. If a match is found, the function returns True.
If no match is found for any word in the prompt, the function returns False.
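
Because the function returns True on the first shared word, it cannot say which words matched or how much of the prompt overlaps. The following is a minimal sketch of a more informative variant, assuming both inputs are already preprocessed token lists. shared_words and overlap_ratio are hypothetical names, not part of the notebook.

```python
def shared_words(prompt_tokens, website_tokens):
    """Return the set of tokens that appear in both token lists."""
    return set(prompt_tokens) & set(website_tokens)

def overlap_ratio(prompt_tokens, website_tokens):
    """Fraction of distinct prompt tokens that also occur in the website tokens."""
    prompt_set = set(prompt_tokens)
    if not prompt_set:
        return 0.0  # an empty prompt overlaps with nothing
    return len(prompt_set & set(website_tokens)) / len(prompt_set)

# Illustrative, hand-written token lists (assumed to be already preprocessed)
prompt_tokens = ["user", "click", "submit", "button", "quickly"]
website_tokens = ["user", "click", "submit", "button", "input", "query"]

print(sorted(shared_words(prompt_tokens, website_tokens)))  # ['button', 'click', 'submit', 'user']
print(overlap_ratio(prompt_tokens, website_tokens))         # 0.8
```

A ratio makes it possible to apply a threshold instead of flagging content on a single common word.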

**User Input and Website Content:**

The code prompts the user to enter a prompt using the input function and stores it in the prompt variable.
The website_content variable contains a sample text that represents the content of a website (e.g., obtained through web scraping).
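
If the content really does come from a scraped page, the raw HTML has to be reduced to plain text before it is tokenized. One possible sketch uses only the standard library's html.parser, with a hard-coded HTML string standing in for a downloaded page. VisibleTextParser, html_to_text, and sample_html are hypothetical names, and the approach is an assumption rather than what the notebook's author used.

```python
from html.parser import HTMLParser

class VisibleTextParser(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0   # > 0 while inside <script> or <style>
        self._chunks = []      # text fragments in document order

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text that is not inside a skipped element
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        # Join with a space so adjacent paragraphs do not fuse into one token
        return " ".join(self._chunks)

def html_to_text(html):
    """Return the visible text of an HTML string."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return parser.text()

# Hard-coded sample standing in for a scraped page
sample_html = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><p>Plagiarism is rampant.</p><p>Teachers may want to check papers.</p>"
    "<script>console.log('x')</script></body></html>"
)
website_content = html_to_text(sample_html)
print(website_content)  # Plagiarism is rampant. Teachers may want to check papers.
```

Joining the fragments with a space keeps the last word of one paragraph from running into the first word of the next, which would otherwise produce a single unmatched token.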

**Performing the Word Match Check:**

The check_word_match function is called with the prompt and website_content as arguments to check if any words match.
The result is stored in the is_copied variable.

**Outputting the Result:**

Finally, the code checks the value of is_copied and prints an appropriate message based on the result.

The purpose of this code is to give a simple demonstration of text preprocessing and word matching using NLTK. It checks whether a user prompt shares any word with the website content. Because a single shared non-stop word is enough to return True, a match is only a weak hint that content may have been copied, not evidence of plagiarism on its own.