# Introduction to the StringSimilarity Class

## Overview

The `StringSimilarity` class computes similarity scores between text documents for Natural Language Processing (NLP) tasks. It cleans and tokenizes text, manages a collection of documents, and scores a new text against that collection using several metrics.

## Class Structure and Functionality

## Project Folder Structure

The `StringSimilarity` class is part of a small project whose files and folders all live in one main directory named "text_document_similarity". The class locates its inputs relative to the current working directory, so this layout matters for the examples that follow.

### Main Directory: "text_document_similarity"

- The `main.ipynb` Jupyter Notebook, which houses the `StringSimilarity` class, is located at the root of this main directory. This notebook serves as the central script for executing the class's functionality.
- An example text file is also placed in the "text_document_similarity" directory. This file can be used for initial testing and demonstration purposes, providing a practical example of how the class processes and compares text.
- The `Corpus` folder, a critical component of the project, is situated within the "text_document_similarity" directory. It contains a selection of text files, labeled `text1.txt`, `text2.txt`, and `text3.txt`. These files constitute the text corpus against which new documents are compared for similarity.
- Keeping everything in one directory lets the class find its inputs relative to the current working directory: `create_corpus` reads from the `Corpus` folder, and file-based comparisons look for the file in the working directory itself.

With this layout, extending the corpus only requires placing a new `.txt` file in the `Corpus` folder and calling `create_corpus` again; files that are already in the document pool are skipped.
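
The folder layout described above can be checked with a short sketch. The helper name `list_corpus_files` is hypothetical and not part of the class; the demonstration builds a throwaway directory with the same `Corpus` subfolder and file names, so it does not depend on the real project being present.

```python
import tempfile
from pathlib import Path


def list_corpus_files(project_root):
    """Return the sorted names of .txt files in <project_root>/Corpus."""
    corpus_dir = Path(project_root) / "Corpus"
    if not corpus_dir.is_dir():
        raise FileNotFoundError(f"No Corpus folder at {corpus_dir}")
    return sorted(p.name for p in corpus_dir.glob("*.txt"))


# Demonstration on a temporary directory that mimics the layout above.
with tempfile.TemporaryDirectory() as root:
    corpus = Path(root) / "Corpus"
    corpus.mkdir()
    for name in ("text1.txt", "text2.txt", "text3.txt"):
        (corpus / name).write_text("placeholder", encoding="utf-8")
    found = list_corpus_files(root)

print(found)
```

In the real project, passing the path of the "text_document_similarity" directory should list the corpus files that `create_corpus` would pick up.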


### Initialization

- The `__init__` method serves as the class constructor, setting up essential structures for document storage and processing. 
- Key structures initialized include:
  - `document_pool`: A dictionary to hold processed documents.
  - `vector_pool`: A repository for the vector representations of these documents.
  - `dictionary`: A set to store unique words across all documents, crucial for vectorization and textual analysis.

### Text Cleaning and Processing

- Stopwords, common words with limited informational value, are listed for exclusion during text processing. 
- The methods `cleaning_text`, `string_to_list`, `removing_stopwords`, and `main_cleaning` prepare text data: they strip non-word characters and digits, lowercase the text, remove stopwords, and drop words shorter than three characters.
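
As a rough illustration of this cleaning pipeline, the sketch below combines the steps into one function. It is a simplified stand-in, not the class's implementation: the function name `clean_and_tokenize`, the shortened stopword set, and the single regular expression are assumptions made for brevity.

```python
import re

# Hypothetical, shortened stopword set for illustration only.
STOPWORDS = {"this", "is", "a", "the", "and", "of"}


def clean_and_tokenize(text, stopwords=STOPWORDS, min_len=3):
    """Lowercase the text, keep letters only, drop stopwords and short words."""
    text = re.sub(r"[^a-zA-Z]+", " ", text).lower()  # non-letters become spaces
    return [w for w in text.split()
            if w not in stopwords and len(w) >= min_len]


tokens = clean_and_tokenize("This is a sample document, version 2!")
print(tokens)  # ['sample', 'document', 'version']
```

The class itself applies several regular expressions in sequence (hyphens, abbreviations, non-word characters, digits, whitespace), so its output can differ from this sketch on edge cases.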

### Document Management and Corpus Creation

- Methods such as `add_documents`, `load_text`, and `create_corpus` enable efficient document management.
- These functions allow adding individual texts or batches of texts, processing them, and updating the document pool and dictionary accordingly.
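
The bookkeeping behind document management can be sketched with two plain containers. The names `document_pool`, `vocabulary`, and `add_document` below are illustrative and only loosely mirror the class; the point is that a document which introduces new words grows the vocabulary, which is why existing vectors then have to be rebuilt.

```python
document_pool = {}   # document name -> list of cleaned words
vocabulary = set()   # unique words across all documents


def add_document(name, words):
    """Store a document and report whether the vocabulary grew."""
    if name in document_pool:
        raise ValueError(f"{name!r} is already in the pool")
    new_words = set(words) - vocabulary
    document_pool[name] = list(words)
    vocabulary.update(new_words)
    return bool(new_words)  # True means all vectors must be rebuilt


grew_first = add_document("Doc1", ["sample", "document"])
grew_second = add_document("Doc2", ["document", "sample"])
print(grew_first, grew_second)  # True False
```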

### Vector Representation and Similarity Computation

- The `create_vector` method translates documents into binary vectors based on the class's dictionary, capturing the presence or absence of words.
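- The class offers several comparison methods (`dot_product_normal`, `cosine_Similarity`, `Euclidean_distance`, `Jaccard_similarity`). Note that `Euclidean_distance` is a distance, so smaller values indicate more similar documents, whereas for the other three metrics larger values indicate more similar documents.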
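
The four metrics can be illustrated on two tiny binary vectors. This is a minimal sketch with a made-up four-word vocabulary; the helper `to_binary_vector` is hypothetical and uses a sorted word list to fix the vector order, which is an assumption of this example rather than how the class orders its dictionary.

```python
import numpy as np

# Made-up vocabulary; sorting fixes the position of each word in the vector.
vocabulary = sorted({"apple", "banana", "cherry", "date"})


def to_binary_vector(words, vocab):
    """Return a 0/1 vector marking which vocabulary words occur in `words`."""
    present = set(words)
    return np.array([1 if w in present else 0 for w in vocab])


a = to_binary_vector(["apple", "banana", "cherry"], vocabulary)  # [1, 1, 1, 0]
b = to_binary_vector(["banana", "cherry", "date"], vocabulary)   # [0, 1, 1, 1]

dot = int(np.dot(a, b))                                  # shared words: 2
cosine = dot / (np.linalg.norm(a) * np.linalg.norm(b))   # 2 / 3
euclidean = float(np.linalg.norm(a - b))                 # sqrt(2)
jaccard = float(np.sum(a & b) / np.sum(a | b))           # 2 / 4 = 0.5

print(dot, round(cosine, 3), round(euclidean, 3), jaccard)
```

For binary vectors the dot product simply counts shared words, which is why it grows with document length, while cosine and Jaccard normalize for size.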

### User Interaction

- The `user_interaction` method allows for versatile user engagement, enabling users to input a text string or specify a file for comparison.
- Users can choose the similarity metric to apply, and the function returns a DataFrame showing the calculated similarity scores.
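
A result table of this kind can be assembled from per-metric score dictionaries. The sketch below uses invented scores and only two metrics; the document names follow the corpus files mentioned earlier, and the sorting step is an illustrative choice rather than something the class is known to do.

```python
import pandas as pd

# Hypothetical scores keyed by document name, one dictionary per metric.
cosine_scores = {"text1.txt": 0.67, "text2.txt": 0.10}
jaccard_scores = {"text1.txt": 0.50, "text2.txt": 0.05}

# With a dict of dicts, outer keys become columns and inner keys the row index.
scores = pd.DataFrame({"cosine": cosine_scores, "jaccard": jaccard_scores})

# Sort so the document with the highest cosine score comes first.
scores = scores.sort_values("cosine", ascending=False)
print(scores)
```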

## Application

The `StringSimilarity` class can serve as a starting point for academic research, content recommendation systems, or other NLP applications that need to compare text documents.


In [None]:


#!pip install numpy

import os
import numpy as np
import pandas as pd
from numpy.linalg import norm
import re


class StringSimilarity:

    """
    A class for computing string similarity using various metrics.
    This class provides functionality to clean and process text documents,
    calculate similarity scores, and manage a collection of text documents.

    """

    def __init__(self):
        """
        Initializes the StringSimilarity class with empty structures for storing documents.

        """

        # Dictionary to store the processed documents
        self.document_pool = {}

        # Dictionary to store vector representations of the documents
        self.vector_pool = {}

        # Set to store unique words across all documents
        self.dictionary = set()

        # list of stopwords for basic text filtering
        self.stopwords = [

            "i", "me", "my", "myself", "we", "our", "ours", "ourselves", "you", "your", "yours",
            "yourself", "yourselves", "he", "him", "his", "himself", "she", "her", "hers",
            "herself", "it", "its", "itself", "they", "them", "their", "theirs", "themselves",
            "what", "which", "who", "whom", "this", "that", "these", "those", "am", "is", "are",
            "was", "were", "be", "been", "being", "have", "has", "had", "having", "do", "does",
            "did", "doing", "a", "an", "the", "and", "but", "if", "or", "because", "as", "until",
            "while", "of", "at", "by", "for", "with", "about", "against", "between", "into",
            "through", "during", "before", "after", "above", "below", "to", "from", "up", "down",
            "in", "out", "on", "off", "over", "under", "again", "further", "then", "once", "here",
            "there", "when", "where", "why", "how", "all", "any", "both", "each", "few", "more",
            "most", "other", "some", "such", "no", "nor", "not", "only", "own", "same", "so",
            "than", "too", "very", "s", "t", "can", "will", "just", "don", "should", "now"
        ]

    def add_documents(self, name, document):
        """
        Manually adds a document to the document pool after processing it.

        Args:
            name (str): The name or identifier for the document.
            document (str): The text of the document to be added.

        Raises:
            TypeError: If name or document is not a string.
            ValueError: If name or document is empty.
            ValueError: If the processed document is empty, e.g. because it contains only stopwords or non-words.
            ValueError: If the processed document already exists in the document pool.
        """

        if not isinstance(name, str) or not isinstance(document, str):

            raise TypeError("Both name and document must be strings.")

        if not name:
            raise ValueError("Document name is empty.")

        if not document:
            raise ValueError("Document content is empty.")

        processed_document = self.main_cleaning(document)

        if not processed_document:
            raise ValueError(
                "Processed document is empty. It might contain only stopwords or non-words.")

        # Check that the processed text is not already stored in the pool.
        # The pool's values hold the processed word lists; its keys are document names.
        if processed_document not in self.document_pool.values():

            self.document_pool[name] = processed_document

            word_check = False
            for i in processed_document:

                if i not in self.dictionary:
                    word_check = True
                    break
                else:
                    continue

            if word_check:

                self.dictionary.update(set(processed_document))

                # after a new document is added to pool, all vectors have to be updated because dictionary is longer.
                self.update_vectorpool()
            else:
                self.vector_pool[name] = self.create_vector(processed_document)

                print(
                    "Text has been added to the pool but no new vocabularies were added.")

        else:
            raise ValueError(
                f"The text {processed_document} has already been added to pool")

    @staticmethod
    def cleaning_text(text):
        """
        Static method to clean a given text.

        Args:
            text (str): The text to be cleaned.

        Returns:
            str: The cleaned text.

        Raises: 
            TypeError: If the input text is not a string.
            ValueError: If Input text is empty or only contains whitespace

        """

        if not isinstance(text, str):
            raise TypeError("Input text must be a string.")

        if text.strip() == "":

            raise ValueError(
                "Input text is empty or only contains whitespace.")

        text = text.strip()  # removes whitespaces in the beginning and end
        # Removes hyphens or underscores that are surrounded by word characters.
        text = re.sub(r'\b[_-]+|(?<=\w)[_-]+|[_-]+(?=\w)', '', text)
        # Replaces abbreviations or initials and optional trailing commas with a space.
        text = re.sub(r'\b(?:[a-zA-Z]\.)+[a-zA-Z]?[,]*\b', ' ', text)
        text = re.sub(r"\W", " ", text)  # remove non words char
        text = re.sub(r"\d", " ", text)  # remove digits char
        text = re.sub(r"[\s]+", " ", text)  # remove extra white space
        text = text.lower()  # lower char for matching

        return text

    @staticmethod
    def load_text(path):
        """
        Static method to load text from a given file path.

        Args:
            path (str): The file path from which to load the text.

        Returns:
            str: The raw text content of the file.

        Raises: 
            ValueError: If input is not a string
            FileNotFoundError: If the path can not be found within the operating system 
        """
        if not isinstance(path, str):

            raise ValueError("The file path must be a string.")

        if not os.path.exists(path):
            raise FileNotFoundError(
                f"The file does not exist at the path: {path}")

        # The with-statement closes the file automatically after reading
        with open(path, 'r') as file:
            text = file.read()

        return text

    @staticmethod
    def create_doc_list(curr_path):
        """
        Static method to create a list of document names in the 'Corpus' directory.

        Args:
            curr_path (str): The current working directory path.

        Returns:
            list: A list of filenames found in the 'Corpus' subdirectory.

        Raises: 
            FileNotFoundError: If the path can not be found within the operating system 
        """

        # Construct the path to the 'Corpus' directory which contains .txt files
        corpus_path = os.path.join(curr_path, 'Corpus')

        if not os.path.exists(corpus_path):
            raise FileNotFoundError(
                f"The file does not exist at the path: {corpus_path}")

        # List all files in the 'Corpus' directory
        objects = os.listdir(corpus_path)

        return objects

    def create_corpus(self):
        """
        Method to create a corpus by processing and adding text files from the 'Corpus' directory.
        Updates the document pool with new documents and their processed content.

        Returns:
            str: A message indicating the outcome of the corpus creation

        Raises: 
            Exception: If an unexpected error occurs during file processing.

        """

        # Get the current working directory
        path = os.getcwd()

        # Retrieve the list of text files in the 'Corpus' directory

        try:

            text_files = StringSimilarity.create_doc_list(path)

        except Exception as e:

            raise Exception(f'Failed to create document list: {e}')

        # create path to Corpus folder
        corpus_path = os.path.join(path, 'Corpus')

        # count number of documents
        new_count = 0

        for i in text_files:

            # Process only text files and avoid duplicates
            if i.endswith('.txt'):

                # avoid duplicates in document pool
                if i not in self.document_pool.keys():

                    try:
                        # Load and process the text file
                        temp_text = StringSimilarity.load_text(
                            os.path.join(corpus_path, i))

                        temp_text = self.main_cleaning(temp_text)

                        # Update the dictionary and document pool
                        self.dictionary.update(set(temp_text))
                        self.document_pool[i] = list(set(temp_text))
                        new_count += 1
                    except Exception as e:

                        raise Exception(
                            f'Failed to load document {i} because of {e}')

                else:
                    continue
            else:
                continue

        # Update the vector pool with new vectors

        self.update_vectorpool()

        if new_count == 0:

            return "no new documents in folder"
        else:

            return f"There have been {new_count} new documents added to the document pool"

    @staticmethod
    def string_to_list(string1):
        """
        Static method to convert a cleaned string into a list of words.

        Args:
            string1 (str): The string to be converted.

        Returns:
            list: A list of words from the string.

        Raises: 
            TypeError: If the input is not a string.
            ValueError: If string is empty after cleaning 
        """

        if not isinstance(string1, str):
            raise TypeError("Input must be a string.")

        # Convert the cleaned string into a list of words
        clean_text = StringSimilarity.cleaning_text(string1)

        if not clean_text.strip():
            raise ValueError(
                "Input string is empty or contains only whitespace after cleaning.")

        return clean_text.split()

    def removing_stopwords(self, list_words):
        """
        Method to remove stopwords from a list of words.

        Args:
            list_words (list): The list of words from which stopwords are to be removed.

        Returns:
            list: A list of words with stopwords removed.

        Raises: 
            TypeError: If the input is not a list of words.
            ValueError: If the input list is empty.


        """
        if not isinstance(list_words, list):
            raise TypeError("Input must be a list of words.")

        if not list_words:
            raise ValueError("Input list of words is empty.")

        # Filter out stopwords from the list of words
        text_without_stop = [
            word for word in list_words if word not in self.stopwords]

        return text_without_stop

    def main_cleaning(self, text):
        """
        Method to clean the text, convert it into a list of words, and remove stopwords.
        Finally, words shorter than 3 characters are removed.

        Args:
            text (str): The text to be cleaned.

        Returns:
            list: A list of cleaned words from the text.

        Raises: 
            TypeError: If input is not a string 
            ValueError: If the input text is empty.
        """

        if not isinstance(text, str):
            raise TypeError("Input must be a string.")

        if text.strip() == "":
            raise ValueError(
                "Input text is empty or only contains whitespace.")

        # Clean text, convert text to a list of words and remove stopwords
        text_list = StringSimilarity.string_to_list(text)
        text_list = self.removing_stopwords(text_list)

        # Filter out words that are too short (e.g., less than 3 characters)
        text_list = [word for word in text_list if len(word) > 2]

        return text_list

    def create_vector(self, word_list):
        """
        Creates a binary vector representation for a given list of words.

        Args:
            word_list (list): A list of words to be converted into a vector.

        Returns:
            list: A binary vector where 1 represents the presence of a word from the word list in the dictionary.

        Raises: 
            TypeError: If the input is not a list.
            ValueError: If the input list is empty or the dictionary is not initialized.

        """

        if not isinstance(word_list, list):
            raise TypeError("Input must be a list of words.")

        if not word_list:
            raise ValueError("Input word list is empty.")

        if not self.dictionary:
            raise ValueError(
                "Dictionary is not initialized. Add some documents first.")

        # Initialize a vector of zeros with the same length as the dictionary
        vector = [0] * len(self.dictionary)

        # Set elements to 1 in the vector for words present in the word list
        for i, word in enumerate(self.dictionary):

            if word in word_list:
                vector[i] = 1
            else:
                continue

        return vector

    def update_vectorpool(self):
        """
        Updates the vector representations for all documents in the document pool.

        Raises: 
            ValueError: If the document pool is empty.
        """

        if not self.document_pool:
            raise ValueError(
                "Document pool is empty. Add some documents before updating the vector pool.")

        # Check if the dictionary is initialized
        if not self.dictionary:
            raise ValueError(
                "Dictionary is not initialized. Add some documents to create the dictionary.")

        try:
            # Update vector for each document in the document pool
            for i in self.document_pool.keys():
                self.vector_pool[i] = self.create_vector(self.document_pool[i])

            print("All vectors are updated")

        except Exception as e:
            raise Exception(
                f"An error occurred while updating the vector pool: {e}")

    @staticmethod
    def rank_vectors(dict1):
        """
        Ranks vectors based on their values.

        Args:
            dict1 (dict): A dictionary of vectors to be ranked.

        Returns:
            dict: A dictionary with vectors ranked in descending order of their values.

        Raises:
            TypeError: If the input is not a dictionary.
            ValueError: If the input dictionary is empty.
        """

        if not isinstance(dict1, dict):
            raise TypeError("Input must be a dictionary.")

        if not dict1:
            raise ValueError("Input dictionary is empty.")

        # Sort the dictionary in descending order based on values
        return dict(sorted(dict1.items(), key=lambda item: item[1], reverse=True))

    def dot_product_normal(self, new_doc, new_vector):
        """
        Calculates the dot product similarity between a new document and all documents in the document pool.

        Args:
            new_doc (list): The cleaned list of words of the new document.
            new_vector (list): The binary vector representation of the new document.

        Returns:
            dict: A dictionary of dot product similarity scores.

        Raises:
            TypeError: If new_doc is not a list.
            ValueError: If the new document list is empty.
            ValueError: If the document pool is empty.
        """

        if not isinstance(new_doc, list):
            raise TypeError("The new document must be a list.")

        if len(new_doc) == 0:
            raise ValueError("The list of words does not contain any words")

        if not self.document_pool:
            raise ValueError(
                "Document pool is empty. Add some documents before calculating the dot product.")

        final_dict = {}

        # Calculate dot product with each document vector
        for text in self.document_pool.keys():

            final_dict[text] = np.dot(new_vector, self.vector_pool[text])

        return StringSimilarity.rank_vectors(final_dict)

    def cosine_Similarity(self, new_doc, new_vector):
        """
        Calculates the cosine similarity between a new document and all documents in the document pool.

        Args:
            new_doc (list): The cleaned list of words of the new document.
            new_vector (list): The binary vector representation of the new document.

        Returns:
            dict: A dictionary of cosine similarity scores.

        Raises:
            ValueError: If the document pool is empty.
        """

        if not self.document_pool:
            raise ValueError(
                "Document pool is empty. Add some documents before calculating cosine similarity.")

        cosine_values = {}

        # Calculate cosine similarity with each document vector
        for i in self.document_pool.keys():

            temp_vector = self.vector_pool[i]

            if norm(new_vector)*norm(temp_vector) != 0:

                cosine = np.dot(new_vector, temp_vector) / \
                    (norm(new_vector)*norm(temp_vector))

                cosine_values[i] = cosine

            else:
                cosine_values[i] = 'no matches'

        return StringSimilarity.rank_vectors(cosine_values)

    def Euclidean_distance(self, new_doc, new_vector):
        """
        Calculates the Euclidean distance between a new document and all documents in the document pool.

        Args:
            new_doc (list): The cleaned list of words of the new document.
            new_vector (list): The binary vector representation of the new document.

        Returns:
            dict: A dictionary of Euclidean distance scores.

        Raises:
            TypeError: If new_doc is not a list.
            ValueError: If the new document list is empty.
            ValueError: If the document pool is empty.

        """
        if not isinstance(new_doc, list):
            raise TypeError("The new document must be a list.")

        if len(new_doc) == 0:
            raise ValueError("The list of words does not contain any words")

        if not self.document_pool:
            raise ValueError(
                "Document pool is empty. Add some documents before calculating Euclidean distance.")

        euclidean_values = {}

        # Calculate Euclidean distance with each document vector
        for i in self.document_pool.keys():

            temp_vector = self.vector_pool[i]

            dist = np.linalg.norm(np.array(temp_vector) - np.array(new_vector))
            euclidean_values[i] = dist

        return StringSimilarity.rank_vectors(euclidean_values)

    def Jaccard_similarity(self, clean_words):
        """
        Calculates the Jaccard similarity between a new document and all documents in the document pool.

        Args:
            clean_words (list): The cleaned list of words of the new document.

        Returns:
            dict: A dictionary of Jaccard similarity scores.

        Raises:
            TypeError: If the new document is not a list.
            ValueError: If the new document list is empty.
            ValueError: If the document pool is empty.
        """
        if not isinstance(clean_words, list):
            raise TypeError("The new document must be a list.")

        if len(clean_words) == 0:
            raise ValueError("The list of words does not contain any words")

        if not self.document_pool:
            raise ValueError(
                "Document pool is empty. Add some documents before calculating Jaccard similarity.")
        jaccard_values = {}

        clean_words = set(clean_words)

        # Iterate over each document in the document pool
        for name, words in self.document_pool.items():

            set_old_words = set(words)

            # Calculate the intersection and union
            intersection = clean_words.intersection(set_old_words)
            union = clean_words.union(set_old_words)

            # Calculate Jaccard similarity and add to the dictionary
            jaccard_sim = len(intersection) / len(union) if union else 0
            jaccard_values[name] = jaccard_sim

        return StringSimilarity.rank_vectors(jaccard_values)

    @staticmethod
    def create_dataframe(dict1, dict2, dict3, dict4):
        """
        Creates a DataFrame from four dictionaries of similarity scores, one per method.

        Args:
            dict1, dict2, dict3, dict4 (dict): Dictionaries of similarity scores separated by method.

        Returns:
            DataFrame: A DataFrame with the similarity scores from the four dictionaries.

        """

        df = pd.DataFrame([dict1, dict2, dict3, dict4])

        df = df.T  # Transpose to have keys as rows

        df.columns = ["dot_product", "cosine", "Euclidean", "jaccard"]

        return df

    def user_interaction(self, text_type, method="all", export="No"):
        """
        Facilitates user interaction for comparing a new text with the document pool.

        This function allows the user to input a text string or specify a file. It then performs text cleaning,
        computes various similarity scores with the documents in the pool, and optionally exports the results to Excel.

        Args:
            text_type (str): Type of text input, either "string" or "file".
            method (str, optional): The method to use for computing similarity scores, one of
                "all", "dot", "cosine", "euclidean", or "jaccard". Defaults to "all".
            export (str, optional): Option to export the results to an Excel file. Defaults to "No".

        Returns:
            DataFrame: A DataFrame showing the similarity scores of the new text with each document in the pool.

        Raises:
            TypeError: If text_type is not a string.
            ValueError: If export is empty or not 'Yes'/'No'.
            ValueError: If text_type is empty or not 'string'/'file'.
            ValueError: If the input text is too long (more than 500 characters) or empty.
            ValueError: If the specified file is not found in the directory.
            Exception: For errors in loading the document or in similarity score calculations.
            Exception: For errors encountered while creating the Excel file.
        """

        # Check if text_type is a valid string
        if not isinstance(text_type, str):
            raise TypeError("The text_type must be a string.")

        # Check for empty string
        if text_type.strip() == "":
            raise ValueError(
                "text_type is empty or only contains whitespace. Please enter valid text as argument.")

        # Validate the export and text_type values
        if export not in ["Yes", "No"]:
            raise ValueError("The export argument needs to be either 'Yes' or 'No'.")

        if text_type not in ["string", "file"]:
            raise ValueError("The text_type argument needs to be either 'string' or 'file'.")

        # Handling 'string' input
        if text_type == "string":
            # Prompt the user to enter text
            q1 = input('Enter a string under 500 characters')

            # Check for length constraint
            if len(q1) > 500:
                raise ValueError("Your input text was too long!")

            # Check for empty input
            if q1.strip() == "":
                raise ValueError(
                    "Entered text is empty or only contains whitespace. Please enter valid text.")

        # Handling 'file' input
        if text_type == "file":
            # Prompt the user to enter file name
            q1 = input(
                'Enter the name of the document (needs to be in the same directory as the script)')

            # Check for empty input
            if q1.strip() == "":
                raise ValueError(
                    "Entered text is empty or only contains whitespace. Please enter valid text.")

            # Check if file exists in directory
            objects = os.listdir(os.getcwd())
            if not q1 in objects:
                raise ValueError(
                    f"The file {q1} is not in the directory of this Jupyter notebook.")

            # Attempt to load the text from the file
            try:
                q1 = self.load_text(q1)
            except Exception as e:
                raise Exception(f'Failed to load document {q1} because of {e}')

        # Clean the text and create vector
        clean_text = self.main_cleaning(q1)
        new_vector = self.create_vector(clean_text)

        # Compute and store similarity scores
        try:
            # Compute similarity scores
            result1 = self.dot_product_normal(clean_text, new_vector)
            result2 = self.cosine_Similarity(clean_text, new_vector)
            result3 = self.Euclidean_distance(clean_text, new_vector)
            result4 = self.Jaccard_similarity(clean_text)

            # Generate DataFrame based on selected method
            if method == "all":
                final_df = StringSimilarity.create_dataframe(
                    result1, result2, result3, result4)
            elif method == "dot":
                final_df = pd.DataFrame(list(result1.items()), columns=[
                                        'Document', 'Dot Product Similarity'])
            elif method == "cosine":
                final_df = pd.DataFrame(list(result2.items()), columns=[
                                        'Document', 'Cosine Similarity'])
            elif method == "euclidean":
                final_df = pd.DataFrame(list(result3.items()), columns=[
                                        'Document', 'Euclidean Distance'])
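            elif method == "jaccard":
                final_df = pd.DataFrame(list(result4.items()), columns=[
                                        'Document', 'Jaccard Similarity'])
            else:
                # Reject unknown method names instead of failing later on an unbound variable
                raise ValueError(f"Unknown method: {method}")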
        except Exception as e:
            raise Exception(
                f"An error occurred while calculating similarity scores: {e}")

        # Handle export option
        if export == "No":
            return final_df

        if export == "Yes":
            try:

                # export dataframe to results.xlsx
                final_df.to_excel("results.xlsx")
                print("Excel has been created!")
                return final_df
            except Exception as e:
                raise Exception(f"An error occurred while creating Excel: {e}")

### Step 1: Class Initialization and Structure Overview
In the first step, we create an instance of the `StringSimilarity` class. On initialization, the class sets up the structures that the rest of the analysis relies on: the `document_pool`, the `vector_pool`, the `dictionary`, and a predefined list of `stopwords`.

- The `document_pool` is a dictionary meant to store processed documents.
- The `vector_pool` holds vector representations of these documents.
- The `dictionary` is a set containing unique words found across all documents.
- The `stopwords` list includes common words to be filtered out during text processing.

After initializing the class, we print the initial state of each component. This display helps us understand the class's initial setup and confirms that the essential structures are in place, ready to be populated and utilized in subsequent steps of text similarity analysis.


In [None]:
# Step 1: Class Initialization and Structure Overview
similarity = StringSimilarity()

# Display the initial state of the class
print("Initial Document Pool:", similarity.document_pool)
print("Initial Vector Pool:", similarity.vector_pool)
print("Initial Dictionary:", similarity.dictionary)
print("Stopwords List:", similarity.stopwords)

### Step 2: Adding Documents to the Corpus
In this step, we add a document to the corpus, which populates the document pool with initial data for comparison. The `add_documents` method of the `StringSimilarity` class takes two arguments: a unique identifier for the document (such as "Doc1") and the text of the document. After adding a sample document, we print the updated list of document names in the document pool to verify that the document was added and is ready for further processing.


In [None]:
# Step 2: Adding Documents to the Corpus
# Adding a small text manually
similarity.add_documents("Doc1", "This is a sample document.")
print("Updated list of documents in Document Pool:",
      list(similarity.document_pool.keys()))

### Step 3: Loading Documents from Files
In this step, we demonstrate the functionality of the `StringSimilarity` class for loading documents from external files. This capability is essential for processing and analyzing larger text documents that are stored as files, enabling the class to handle real-world data scenarios effectively.

- We utilize the `load_text` method of the class to import text from an external file named "example.txt".
- The `load_text` method is designed to read the content of the file and return it as a string, allowing for further processing.

The result of this operation is printed to the console, displaying the loaded text. This step is crucial for verifying that the class can successfully access and retrieve text data from files, a common requirement in text analysis projects. It ensures that our `StringSimilarity` class is not just limited to handling manually inputted text but is also capable of working with pre-existing text documents.


In [None]:
# Step 3: Loading Documents from Files
# Load text from an external file
loaded_text = similarity.load_text("example.txt")
print("Loaded Text:", loaded_text)
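
To make the idea concrete, here is a minimal sketch of what a file loader of this kind could look like. The function name `load_text_sketch` and the UTF-8 encoding are assumptions made for illustration; the actual `load_text` method may differ.

```python
from pathlib import Path
import tempfile


def load_text_sketch(path):
    """Read a UTF-8 text file and return its content as one string."""
    return Path(path).read_text(encoding="utf-8")


# Demonstrate on a throwaway file so the example runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    sample_path = Path(tmp) / "sample.txt"
    sample_path.write_text("A short sample document.", encoding="utf-8")
    sample_text = load_text_sketch(sample_path)

print("Loaded Text:", sample_text)
```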

### Step 4: Creating and Managing Corpus
This step showcases the `StringSimilarity` class's ability to automate the creation and management of a text corpus. This is a vital feature for handling multiple documents simultaneously and efficiently.

- The `create_corpus` method is called to process and add all documents located within the 'Corpus' folder. This folder contains various text files that make up our text corpus.
- This method reads each file in the 'Corpus' folder, cleans and processes the text using the class's internal methods, and then adds the resulting data to the class's internal structures for document management.

After executing this method, we print the current state of the class to confirm that the corpus has been successfully created and the internal structures (`document_pool`, `vector_pool`, `dictionary`, and `stopwords`) have been updated accordingly.

- We also display the message returned by `create_corpus`, which indicates the outcome of the corpus creation process, such as the number of new documents added.

This functionality exemplifies the class's robustness in handling multiple text files, automating the tedious process of manually adding each document. It's a demonstration of how the `StringSimilarity` class simplifies the management of a text corpus, making it a useful tool for large-scale text analysis.


In [None]:
# Step 4: Creating and Managing Corpus
# Process and add all documents from the 'Corpus' folder
corpus_creation_message = similarity.create_corpus()

# Display state of class after adding the documents
print("\n")
print("State has been updated after Corpus was created")
print("Updated Document Pool:", similarity.document_pool)
print("Updated Vector Pool:", similarity.vector_pool)
print("Updated Dictionary:", similarity.dictionary)
print("Stopwords List:", similarity.stopwords)
print("\n")

print(corpus_creation_message)
print("Updated list of documents in Document Pool:",
      list(similarity.document_pool.keys()))
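
The folder-scanning behavior described above can be sketched as follows. This is a simplified stand-in, not the class's implementation: `create_corpus_sketch` is a hypothetical name, the pool stores raw text instead of cleaned text, and the wording of the returned message is invented.

```python
from pathlib import Path
import tempfile


def create_corpus_sketch(folder, document_pool):
    """Add every .txt file in `folder` that is not yet in `document_pool`."""
    added = 0
    for path in sorted(Path(folder).glob("*.txt")):
        if path.name not in document_pool:
            # The real class would clean the text before storing it.
            document_pool[path.name] = path.read_text(encoding="utf-8")
            added += 1
    return f"{added} new document(s) added to the corpus."


# Build a temporary stand-in for the 'Corpus' folder.
with tempfile.TemporaryDirectory() as tmp:
    for name, text in [("text1.txt", "first text"), ("text2.txt", "second text")]:
        (Path(tmp) / name).write_text(text, encoding="utf-8")
    pool = {}
    first_message = create_corpus_sketch(tmp, pool)
    second_message = create_corpus_sketch(tmp, pool)  # nothing new this time

print(first_message)
print(second_message)
```

Keying the pool by file name means that calling the function a second time adds nothing, which is why the returned message is worth printing.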

### Step 5: Text Preprocessing and Cleaning
In this step, we demonstrate the crucial process of text preprocessing and cleaning, an essential part of any text analysis task. The `StringSimilarity` class provides a method `main_cleaning` that efficiently handles this process.

- We start with a raw text string: `"THIS is @n example string!!!stop-words, et 1, , 4 ,5 will be filtert__ out"`. This string intentionally includes various elements like uppercase letters, special characters, numbers, and potential stop-words to illustrate the effectiveness of the cleaning process.
- The `main_cleaning` method is applied to this raw text, which internally utilizes other class methods like `cleaning_text`, `string_to_list`, and `removing_stopwords`. These methods collectively perform various cleaning actions such as:
    - Converting text to lowercase.
    - Removing special characters and numbers.
    - Stripping unnecessary whitespace.
    - Filtering out stopwords.
    - Excluding short words (fewer than 3 characters).

- After cleaning, we display both the original and the cleaned text. The cleaned text should reflect the removal of unwanted elements and the standardization of the text format.

This step is a vital demonstration of the `StringSimilarity` class's ability to preprocess text, which is a foundational step in preparing text data for further similarity analysis. It shows how the class streamlines the transformation of raw, unstructured text into a cleaner, more analyzable format.


In [None]:
# Step 5: Text Preprocessing and Cleaning
# Demonstrate text cleaning process
raw_text = "THIS is @n example string!!!stop-words, et 1, , 4 ,5 will be filtert__ out"
cleaned_text = similarity.main_cleaning(raw_text)
print("Original Text:", raw_text)
print("Cleaned Text:", cleaned_text)
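
For readers who want to see the cleaning actions in isolation, the following is a minimal sketch under stated assumptions: `clean_text_sketch` and the tiny `STOPWORDS` set are hypothetical, and the sketch returns a list of words, which may not match the return type or exact output of `main_cleaning`.

```python
import re

# Tiny illustrative stopword set; the class ships its own, longer list.
STOPWORDS = {"this", "will", "out", "the", "and"}


def clean_text_sketch(text, stopwords=STOPWORDS, min_length=3):
    """Lowercase, keep letters only, drop stopwords and short words."""
    text = text.lower()
    # Replace every run of non-letters (special characters, digits) with a space.
    text = re.sub(r"[^a-z]+", " ", text)
    # split() without arguments also collapses repeated whitespace.
    words = text.split()
    return [w for w in words if w not in stopwords and len(w) >= min_length]


raw = "THIS is @n example string!!!stop-words, et 1, , 4 ,5 will be filtert__ out"
cleaned = clean_text_sketch(raw)
print(cleaned)
```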

### Step 6: Vector Representation
In this part of our demonstration, we focus on converting cleaned text into a binary vector using the `StringSimilarity` class. The vector representation is a pivotal aspect of text similarity analysis as it quantifies the text in a format that can be easily compared using various similarity metrics.

- Using the `create_vector` method of the `StringSimilarity` class, we convert the previously cleaned text into a binary vector. This method works by mapping each word in the cleaned text against the class's word dictionary. 
- Each element of the vector corresponds to one unique word in the class's dictionary. The output is a binary vector in which `1` means that the dictionary word occurs in the cleaned text and `0` means that it does not.

This step is essential in transforming textual data into a numerical format, a prerequisite for employing mathematical techniques for similarity calculation. By showcasing the `create_vector` function, we highlight how the `StringSimilarity` class facilitates the transition from textual to numerical analysis, setting the stage for the next steps in text similarity computation.


In [None]:
# Step 6: Vector Representation
# Convert cleaned text to binary vector
vector = similarity.create_vector(cleaned_text)
print("Binary Vector Representation:", vector)
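
The mapping from words to a binary vector can be sketched as below. The name `create_vector_sketch` is hypothetical, and sorting the dictionary to fix the word order is an assumption; the class may order its dictionary differently.

```python
def create_vector_sketch(words, dictionary):
    """Return a 0/1 list with one entry per dictionary word."""
    present = set(words)
    # Sort so that every vector uses the same word order.
    return [1 if word in present else 0 for word in sorted(dictionary)]


example_dictionary = {"apple", "banana", "cherry", "date"}
vector_sketch = create_vector_sketch(["banana", "date", "fig"], example_dictionary)
print(vector_sketch)  # [0, 1, 0, 1]
```

Words that are not in the dictionary ("fig" in this example) have no position in the vector and are ignored.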

### Step 7: Calculating Similarity Scores
This step is the culmination of our journey through the `StringSimilarity` class, where we finally calculate similarity scores using various methods. The purpose here is to demonstrate how the class can be utilized to analyze the similarity of a new text document with the existing corpus.

- First, we load an external text file named "example.txt" using the `load_text` method. This showcases the class's ability to process and prepare external documents for analysis.
- We then clean this loaded text using the `main_cleaning` method to ensure it's in the right format for similarity analysis. This step is vital as it ensures consistency in data preparation.
- Next, we convert the cleaned text into a binary vector using the `create_vector` method, which is essential for numerical comparison.
- Now, we calculate similarity scores using four different methods: `dot_product_normal`, `cosine_Similarity`, `Euclidean_distance`, and `Jaccard_similarity`. Each of these methods provides a unique approach to assessing text similarity, highlighting the versatility of the `StringSimilarity` class.
  - The Dot Product method emphasizes direct overlap in words.
  - Cosine Similarity focuses on the orientation of the text in the vector space, making it suitable for texts of varying lengths.
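  - Euclidean Distance measures the straight-line distance between two vectors; unlike the other three metrics, a smaller value means greater similarity.
  - Jaccard Similarity is the ratio of unique words shared by both texts to all unique words found in either text, which makes it a natural fit for presence/absence comparisons.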
- Finally, we display the results for each similarity score, providing a comprehensive view of how the new document compares to the existing corpus across multiple dimensions of similarity.

This step not only illustrates the practical application of the `StringSimilarity` class but also provides insights into how different similarity metrics can be used in real-world scenarios, such as text comparison, recommendation systems, and more.


In [None]:
# Step 7: Calculating Similarity Scores
# Using a new input text for similarity calculation
loaded_text = similarity.load_text("example.txt")
new_text = similarity.main_cleaning(loaded_text)

# Reuse the already cleaned text instead of cleaning it a second time
new_vector = similarity.create_vector(new_text)

# Calculate similarity scores
dot_product_scores = similarity.dot_product_normal(new_text, new_vector)
cosine_similarity_scores = similarity.cosine_Similarity(new_text, new_vector)
euclidean_distance_scores = similarity.Euclidean_distance(new_text, new_vector)
jaccard_similarity_scores = similarity.Jaccard_similarity(new_text)

# Display results
print("Dot Product Scores:", dot_product_scores)
print("Cosine Similarity Scores:", cosine_similarity_scores)
print("Euclidean Distance Scores:", euclidean_distance_scores)
print("Jaccard Similarity Scores:", jaccard_similarity_scores)
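
As a point of reference, the four metrics can be written out in a few lines of plain Python. These are textbook definitions applied to two small example inputs, not the class's own methods; in particular, the returned 0.0 for empty inputs is a convention chosen here, and `dot_product_normal` may apply a normalization that this sketch does not.

```python
import math


def dot_product(a, b):
    """Number of shared words when both vectors are binary."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Dot product divided by the product of the vector lengths."""
    norm_a = math.sqrt(dot_product(a, a))
    norm_b = math.sqrt(dot_product(b, b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for empty vectors
    return dot_product(a, b) / (norm_a * norm_b)


def euclidean_distance(a, b):
    """Straight-line distance; smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def jaccard_similarity(words_a, words_b):
    """Shared unique words divided by all unique words."""
    set_a, set_b = set(words_a), set(words_b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


vec_a = [1, 1, 0, 1]
vec_b = [1, 0, 0, 1]
print("Dot product:", dot_product(vec_a, vec_b))
print("Cosine similarity:", cosine_similarity(vec_a, vec_b))
print("Euclidean distance:", euclidean_distance(vec_a, vec_b))
print("Jaccard similarity:", jaccard_similarity(["cat", "dog", "fish"], ["cat", "fish"]))
```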

### Step 8: User Interaction Demonstration
In this final step, we demonstrate the user interaction capabilities of the `StringSimilarity` class. This step is crucial for showcasing how end users can easily utilize the class to analyze text similarity in practical scenarios.

- Before we begin, we clean up our document pool by deleting "Doc1," which was added in Step 2. This is done to maintain the purity of our corpus and ensure that our results are based on the original, unaltered corpus.
- We then print the keys of the `document_pool` to confirm that "Doc1" has been successfully removed, ensuring our corpus is in its intended state.
- Next, we simulate a user interaction scenario where a user inputs a string directly. We demonstrate this by using the `user_interaction` method with the "string" argument.
- The `user_interaction` method is designed to be versatile, allowing users to either input a string directly or specify a file for text comparison. This flexibility makes the class highly accessible and user-friendly.
- In this demonstration, we prompt the user to enter a string (in this case, "Example user input text"). The class then processes this input, calculates similarity scores using the different methods, and outputs the results.
- This step exemplifies how the `StringSimilarity` class can be interactively used in real-world applications, such as content analysis, document retrieval, and text-based recommendation systems, where end-user interaction is a key component.

Through this step, we not only show the interactive nature of the class but also its practical applicability in scenarios requiring direct user input for text similarity analysis.


In [None]:
# Step 8: User Interaction Demonstration

# delete Doc1 created in step 2
del similarity.document_pool['Doc1']

print(similarity.document_pool.keys())

# Example of user interaction with a string.
# The method prompts for the text itself, so nothing is passed in here;
# when asked, enter for example: Example user input text
similarity.user_interaction("string")

## Comprehensive Analysis of Text Similarity Results Across Four Methods

Our exploration into recommending books based on textual similarity employed four distinct methods: Dot Product, Cosine Similarity, Euclidean Distance, and Jaccard Similarity. Each method offers unique insights, and by analyzing their results, we can form a nuanced understanding of the textual relationships between our reference book and others in our corpus.

### Dissecting the Results Across Methods

#### Dot Product Similarity:
- 'text2.txt' leads the pack with the highest score, indicating a substantial overlap in vocabulary with our reference book. This suggests thematic or stylistic parallels.
- 'text1.txt' and 'text3.txt' follow, with moderate and lower scores respectively, pointing to lesser degrees of similarity.

#### Cosine Similarity:
- Reflecting on the orientation of texts in vector space, this method also places 'text2.txt' at the top. Its high score reinforces the idea of a strong thematic alignment with our reference book.
- The scores for 'text1.txt' and 'text3.txt' are lower, suggesting varying levels of thematic divergence.

#### Euclidean Distance:
- In this metric, lower scores denote closer similarity. 'text2.txt' scores the lowest, aligning with previous findings and underscoring its closeness to the reference book.
- 'text1.txt' and 'text3.txt' have higher scores, indicating more significant differences in word usage and content.

#### Jaccard Similarity:
- Consistent with earlier observations, 'text2.txt' scores highest, reflecting a high degree of shared unique words.
- The scores for 'text1.txt' and 'text3.txt' are lower, indicating less overlap in specific word occurrences.
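
Because Euclidean Distance runs in the opposite direction from the other three metrics, any ranking step has to know which direction counts as "better". The sketch below illustrates this with a hypothetical helper `rank_documents`; the Jaccard values are the ones shown in the notebook output, while the distance values are made up purely to show the reversed sort order.

```python
def rank_documents(scores, higher_is_better=True):
    """Return (document, score) pairs ordered from most to least similar."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=higher_is_better)


# Jaccard scores as reported in the notebook output.
jaccard_scores = {"text1.txt": 0.234624, "text2.txt": 0.253854, "text3.txt": 0.163129}
jaccard_ranking = rank_documents(jaccard_scores)

# Made-up distances, only to show the reversed sort order.
example_distances = {"docA": 4.0, "docB": 2.5, "docC": 6.1}
distance_ranking = rank_documents(example_distances, higher_is_better=False)

print(jaccard_ranking)
print(distance_ranking)
```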

### Interpreting the Results and Conclusions

'text2.txt' ranks first under all four methods (highest Dot Product, Cosine, and Jaccard scores, and lowest Euclidean Distance), which strongly suggests it is the most similar book to our reference text. Because all four metrics are computed from word occurrence, this agreement shows that the two books share a large part of their vocabulary, which in turn points to thematic and stylistic closeness.

'text1.txt' presents as a moderately similar option, making it an appealing recommendation for readers seeking a balance of familiar and new content. 'text3.txt', while scoring lower across the board, could be recommended for readers looking for a more distinct but somewhat related reading experience.

### Most Meaningful Measurement and Book Recommendation

While each method provides valuable insights, Cosine Similarity stands out because it normalizes for differences in text length and compares only the direction of the document vectors. This makes it especially useful for comparing texts of very different sizes, where a raw overlap count would favor the longer text.
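
A small worked example, using made-up binary vectors, illustrates the effect of length normalization. The helper names `dot` and `cosine` are defined here for illustration only.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


query = [1, 1, 1, 0, 0, 0, 0, 0]      # 3 distinct words
short_doc = [1, 1, 0, 0, 0, 0, 0, 0]  # 2 words, both shared with the query
long_doc = [1, 1, 1, 1, 1, 1, 1, 1]   # 8 words, 3 shared with the query

# The raw dot product favors the long document (3 shared words versus 2).
print(dot(query, short_doc), dot(query, long_doc))

# Cosine similarity favors the short document, whose content is mostly shared.
print(round(cosine(query, short_doc), 3), round(cosine(query, long_doc), 3))
```

The long document matches more query words in absolute terms, but most of its vocabulary is unrelated to the query, and the cosine score reflects that.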

Considering all the results, the most compelling book recommendation aligns with the consensus of our analysis: 'text2.txt'. It ranks as the closest match to our reference book under every metric, offering readers a familiar yet potentially enriching literary journey.

In summary, this multi-faceted analysis not only highlights the intricate relationships between texts but also demonstrates the effectiveness of using a blend of similarity metrics for nuanced book recommendations.


In [None]:
# example.txt
similarity.user_interaction("file", "dot")

In [None]:
# example.txt
similarity.user_interaction("file", "cosine")

In [None]:
# example.txt
similarity.user_interaction("file", "euclidean")

In [17]:
# example.txt
similarity.user_interaction("file", "jaccard")

| | Document | Jaccard Similarity |
|---|---|---|
| 0 | text2.txt | 0.253854 |
| 1 | text1.txt | 0.234624 |
| 2 | text3.txt | 0.163129 |


In [None]:
# example.txt
similarity.user_interaction("file", export="Yes")