Individually Graded: ***Due date: April 15***


**Submission Instructions:**
Demo the lab to your TA/instructor and then upload your Jupyter Notebook to Canvas.

**Lab Assignment: Text Processing and Document Analysis**

**Objective:**
The objective of this lab assignment is to practice text processing techniques and document analysis using Python. You will use the file documents.txt, available at https://raw.githubusercontent.com/sumonacalpoly/Datasets/main/documents.txt

**Tasks:**
1. Implement a `TextVector` class that represents a vector of text data. This class should provide methods for adding words to the vector, calculating word frequencies, and retrieving various statistics about the text data.
2. Implement a `DocumentCollection` class that represents a collection of documents. This class should be able to add documents, serialize and deserialize the collection, and provide methods for analyzing the text data within the documents.
3. Create a function to extract data from a text file containing multiple documents. This function should parse the document metadata and text bodies, and populate a `DocumentCollection` object with the extracted data.
4. Utilize the `requests` library to fetch the text data from a URL and process it using the implemented classes and functions.
5. Implement a `main` function to orchestrate the entire process, including fetching the data, processing it, and analyzing it.
6. Serialize the resulting `DocumentCollection` object to a file for future use.

**Lab Tasks:**
1. Implement the `TextVector` class as described above.
2. Implement the `DocumentCollection` class as described above.
3. Write a function to extract data from a text file (`extract_data`). This function should take the text content of the file as input and return a `DocumentCollection` object populated with the extracted data.
4. Implement the `main` function to fetch the text data from a given URL, process it using the `extract_data` function, and perform analysis on the extracted documents.
5. Serialize the resulting `DocumentCollection` object to a file named `document_collection.pkl`.
6. Write test cases to verify the functionality of the implemented classes and functions.

Your implementation should give the following output:






```
Word = is
Frequency = 31
Distinct Number of Words = 103509
Total word count = 142867
Raw vector entry set: dict_items([('hello', 2), ('world', 1), ('python', 1)])
Contains 'hello': True
Frequency of 'hello': 2
Total word count: 4
Distinct word count: 3
Highest raw frequency: 2
Most frequent word: ('hello', 2)
```





**Submission:**
Submit the following files:
- Python script containing the implementation of the `TextVector` and `DocumentCollection` classes, the `extract_data` function, and the `main` function.
Complete all the modules and functions. You can add extra functions or modify the given code as per your understanding.

**Note:** Ensure proper documentation and comments are provided throughout the code for clarity and understanding.



To complete this task, refer to the Jupyter notebook tutorial Tutorial_3_Objects_Classes_Python.ipynb if you need a refresher on, or an introduction to, creating objects and classes.

This task has helper code for some of the components.

Your implementation should have two classes: class TextVector and class DocumentCollection.

The `TextVector` class provides methods for working with a collection of words. It should have the following methods:

1. `__init__(self)`: Initializes a new instance of `TextVector`. It initializes two attributes: `word_freq`, which is a defaultdict that stores the frequency of each word, and `text`, which stores the concatenated text.

2. `add_word(self, word)`: If the word has at least two characters, increments its count in the `word_freq` dictionary (keyed by the lowercased word) and appends the word to the `text` attribute. Shorter words are ignored.

3. `getRawVectorEntrySet(self)`: Returns the items (word, frequency) in the `word_freq` dictionary.

4. `add(self, word)`: A convenience method to add a word using the `add_word` method.

5. `contains(self, word)`: Checks if a word is present in the `word_freq` dictionary. It converts the word to lowercase before checking.

6. `getRawFrequency(self, word)`: Returns the frequency of a word in the `word_freq` dictionary. It converts the word to lowercase before retrieving its frequency.

7. `getTotalWordCount(self)`: Returns the total count of words in the `word_freq` dictionary, which is the sum of all word frequencies.

8. `getDistinctWordCount(self)`: Returns the count of distinct words in the `word_freq` dictionary.

9. `getHighestRawFrequency(self)`: Returns the highest frequency of any word in the `word_freq` dictionary. If the dictionary is empty, it returns 0.

This class provides basic functionality to work with word frequencies and text statistics.
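
One detail worth keeping in mind when writing `getRawFrequency` and `contains`: because `word_freq` is a `defaultdict`, indexing a key that was never added stores that key with a count of 0, which would inflate `getDistinctWordCount`. The sketch below illustrates the difference between indexing and `.get()` on a small stand-in dictionary (the variable names are illustrative, not part of the lab's helper code).

```python
from collections import defaultdict

# Indexing a missing key returns 0 but also stores the key.
by_index = defaultdict(int)
by_index["hello"] += 1
missing_by_index = by_index["absent"]
size_after_index = len(by_index)

# .get() returns the fallback value without storing anything.
by_get = defaultdict(int)
by_get["hello"] += 1
missing_by_get = by_get.get("absent", 0)
size_after_get = len(by_get)

print(missing_by_index, size_after_index)  # 0 2
print(missing_by_get, size_after_get)      # 0 1
```

Using `.get(word, 0)` or an `in` check for lookups keeps read-only methods from changing the vector.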

These are the basic methods that your implementation MUST have, but you can add others as well.

The other class, ***class DocumentCollection***, represents a collection of documents, where each document is represented as a TextVector.

--The add_document method adds a document to the collection. It splits the text into words, adds each word to the corresponding TextVector, and updates the collection.

--The serialize method serializes the collection to a file. (Helper code provided)

--The deserialize method deserializes a collection from a file. (Helper code provided)

The other functions outside the classes are:

***extract_data function:***

This function extracts data from a text and creates a DocumentCollection instance.
It iterates through the text lines, identifies document boundaries, and adds each document to the collection using the add_document method.

***main function***:

This is the entry point of the script.
It retrieves text data from a URL, creates a DocumentCollection using the extract_data function, and performs some analysis on the text data.
It prints the most frequent word, its frequency, the distinct number of words, and the total word count.
Finally, it serializes the DocumentCollection instance to a file.

In [3]:
import requests
import re
import pickle
from collections import defaultdict

class TextVector:
    def __init__(self):
        """
        Initialize a TextVector instance.
        """
        self.word_freq = defaultdict(int)
        self.text = ""

    def add_word(self, word):
        """
        Add a word to the TextVector instance.
        Parameters:
            word (str): The word to add.
        """
        # WRITE YOUR CODE HERE TO COMPLETE THIS MODULE

        if len(word) >= 2:
            self.word_freq[word.lower()] += 1
            self.text += word + " "

    def getRawVectorEntrySet(self):
        """
        Get the raw vector entry set of the TextVector instance.
        Returns:
            dict_items: The items (word, frequency) in the word_freq dictionary.
        """
        # WRITE YOUR CODE HERE TO COMPLETE THIS MODULE

        return self.word_freq.items()

    def add(self, word):
        """
        Alias for add_word method.
        Parameters:
            word (str): The word to add.
        """
        # WRITE YOUR CODE HERE TO COMPLETE THIS MODULE

        self.add_word(word)

    def contains(self, word):
        """
        Check if the word is present in the TextVector instance.
        Parameters:
            word (str): The word to check.
        Returns:
            bool: True if the word is present, False otherwise.
        """
        # WRITE YOUR CODE HERE TO COMPLETE THIS MODULE

        return word.lower() in self.word_freq

    def getRawFrequency(self, word):
        """
        Get the frequency of a word in the TextVector instance.
        Parameters:
            word (str): The word to get frequency for.
        Returns:
            int: The frequency of the word.
        """
        # WRITE YOUR CODE HERE TO COMPLETE THIS MODULE

        # Use .get() so that looking up an unseen word does not insert it into the defaultdict
        return self.word_freq.get(word.lower(), 0)

    def getTotalWordCount(self):
        """
        Get the total word count in the TextVector instance.
        Returns:
            int: The total word count.
        """
        # WRITE YOUR CODE HERE TO COMPLETE THIS MODULE

        return sum(self.word_freq.values())

    def getDistinctWordCount(self):
        """
        Get the count of distinct words in the TextVector instance.
        Returns:
            int: The count of distinct words.
        """
        # WRITE YOUR CODE HERE TO COMPLETE THIS MODULE

        return len(self.word_freq)

    def getHighestRawFrequency(self):
        """
        Get the highest raw frequency of any word in the TextVector instance.
        Returns:
            int: The highest raw frequency, or 0 if no words have been added.

        Note: the steps below describe a related helper that returns the most
        frequent word together with its frequency (the commented example at
        the end of the lab calls it getMostFrequentWord):
          1. If word_freq is empty, return None, since there are no words.
          2. Find the maximum frequency in word_freq using max().
          3. Build a list most_freq_words with a list comprehension that keeps
             the words whose frequency equals that maximum.
          4. Return the first such word with its frequency as a tuple, e.g.
             return most_freq_words[0], max_freq
        """
        # WRITE YOUR CODE HERE TO COMPLETE THIS MODULE

        if not self.word_freq:
            return 0
        return max(self.word_freq.values())

    # You can add more methods here...

# Example usage:

# Initialize a TextVector instance
vector = TextVector()

# Add words to the TextVector instance
vector.add("hello")
vector.add("world")

# Get the total word count
total_word_count = vector.getTotalWordCount()

# Print the total word count
print("Total word count:", total_word_count)





Total word count: 2
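
The expected output includes a `Most frequent word` line, and the commented example at the end of the lab calls `getMostFrequentWord`, which the class above does not define. A minimal sketch of one way to compute it is shown here as a standalone function over a plain dictionary. The function name `most_frequent_word` and the sample data are assumptions for illustration; inside `TextVector` the same logic would operate on `self.word_freq`.

```python
def most_frequent_word(word_freq):
    """Return (word, frequency) for the first word with the highest count, or None if empty."""
    if not word_freq:
        return None
    max_freq = max(word_freq.values())
    # Keep every word tied for the maximum, in insertion order
    most_freq_words = [word for word, freq in word_freq.items() if freq == max_freq]
    return most_freq_words[0], max_freq


sample = {"hello": 2, "world": 1, "python": 1}
result = most_frequent_word(sample)
print("Most frequent word:", result)  # Most frequent word: ('hello', 2)
```

Ties are broken by insertion order here, since Python dictionaries preserve it; a different tie-breaking rule would be equally valid.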


The class description is provided below:

1. **`DocumentCollection` Class**:
   - This class represents a collection of documents.

2. **`__init__` Method**:
   - Initializes an empty `DocumentCollection`. It initializes an attribute `documents` as an empty dictionary to store documents.

3. **`add_document` Method**:
   - This method adds a document to the collection.
   - Parameters:
     - `doc_id` (int): The identifier for the document.
     - `text` (str): The text content of the document.
   - It splits the text into words using a regular expression.
   - It creates a `TextVector` object for the document and adds each word to it using the `add_word` method of `TextVector`.

4. **`serialize` Method**:
   - Serializes the `DocumentCollection` to a file using pickle.
   - Parameters:
     - `filename` (str): The name of the file to which the collection will be serialized.
   - It opens the file in binary write mode and dumps the object into the file using `pickle.dump`.

5. **`deserialize` Method**:
   - Deserializes a `DocumentCollection` from a file using pickle.
   - Parameters:
     - `filename` (str): The name of the file from which the collection will be deserialized.
   - It opens the file in binary read mode and loads the object from the file using `pickle.load`.

Overall, this code provides a basic structure for managing a collection of documents. It allows adding documents, serializing the collection to a file, and deserializing the collection from a file. It uses the `TextVector` class to represent the text content of each document.
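
To see what the regular-expression split produces before the words reach `TextVector`, the short sketch below runs the same pattern on a made-up sentence. The sample text is an assumption; the length filter mirrors the `len(word) >= 2` check in `add_word`.

```python
import re

text = "Hello, world! It's 2024."

# Split on every run of non-letter characters
words = re.split("[^a-zA-Z]+", text)

# Mirror add_word: keep words of length >= 2, lowercased
kept = [word.lower() for word in words if len(word) >= 2]

print(words)  # ['Hello', 'world', 'It', 's', '']
print(kept)   # ['hello', 'world', 'it']
```

Note that the split can yield empty strings and single letters (here the trailing `''` and the `'s'` from "It's"); the length check is what keeps them out of the frequency dictionary.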

Complete the following code to finish implementing the DocumentCollection class.

In [5]:
import pickle

class DocumentCollection:
    def __init__(self):
        self.documents = {}

    def add_document(self, doc_id, text):
        """
        Adds a document to the collection.

        Parameters:
            doc_id (int): The identifier for the document.
            text (str): The text content of the document.
        """
        # Split text into words using regular expression
        # you can modify this code to fit your code and variables
        self.documents[doc_id] = TextVector()
        words = re.split("[^a-zA-Z]+", text)

        # Write a for loop that is responsible for processing each word in the text and adding it to the word
        #frequency dictionary of the `TextVector` object associated with the current document.
        # The for loop in the `add_document` method iterates over each word in the `words` list,
        #which contains the result of splitting the `text` parameter using a regular expression pattern.

        #Here's a breakdown of what happens inside the loop:

        #1. **Iterate over each word:** The loop iterates over each word extracted from the text.

        #2. **Add word to TextVector object:** For each word, it calls the `add_word` method of the `TextVector` object associated with the current document (`doc_id`).
        #This method is responsible for adding the word to the word frequency dictionary within the `TextVector` object.


         #WRITE YOUR CODE TO IMPLEMENT THE FOR LOOP

        for word in words:
            self.documents[doc_id].add_word(word)

    def serialize(self, filename):
        """
        Serializes the DocumentCollection to a file.

        Parameters:
            filename (str): The name of the file to which the collection will be serialized.
        """
        with open(filename, 'wb') as f:
            pickle.dump(self, f)

    @staticmethod
    def deserialize(filename):
        """
        Deserializes a DocumentCollection from a file.

        Parameters:
            filename (str): The name of the file from which the collection will be deserialized.

        Returns:
            DocumentCollection: The deserialized DocumentCollection object.
        """
        with open(filename, 'rb') as f:
            return pickle.load(f)


The function ***extract_data()*** below has already been implemented for you. You can modify it or write your own version that does the same task.

`extract_data` reads text data formatted in a specific way and extracts the documents from it. Each document is expected to have an identifier line (starting with ".I") and content (starting with ".W"). The function creates a `DocumentCollection` object and adds each document to it using the `add_document` method.

Here's a step-by-step explanation:

1. **Define `extract_data` function:** This function takes a string `text` as input and returns a `DocumentCollection` object containing the extracted documents.

2. **Initialize `DocumentCollection`:** Inside the function, a new `DocumentCollection` object named `collection` is created.

3. **Iterate over lines in text:** The function splits the input `text` into lines and iterates over each line.

4. **Identify document start:** If a line starts with ".I" (indicating the start of a new document), the function checks if there is an ongoing document (i.e., if `doc_id` is not None). If so, it adds the current document to the collection using `add_document` method and resets the `text_buffer` variable.

5. **Extract document ID:** The document ID is extracted from the line starting with ".I" and stored in `doc_id`.

6. **Extract document content:** If a line starts with ".W" (indicating the start of document content), the function extracts the content by removing the ".W" and appends it to `text_buffer`.

7. **Continuously append lines to text_buffer:** If the line does not start with ".I" or ".W", it is considered part of the document content. In this case, the line is stripped of leading and trailing whitespaces and added to `text_buffer`.

8. **Add last document:** After processing all lines, if there was an ongoing document (`doc_id` is not None), the function adds the last document to the collection.

9. **Retrieve text data from URL:** The code retrieves the text data from a URL using the requests library.

10. **Process text data:** If the request is successful (status code 200), the text data is extracted from the response.

11. **Extract documents:** The `extract_data` function is called with the extracted text data, resulting in a `DocumentCollection` object containing the extracted documents. If the request fails, an error message is printed.
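
The parsing steps can be tried on a tiny in-memory sample before running them on the full file. The sketch below is a simplified stand-in, not the provided function: `parse_documents` and the sample text are hypothetical, it returns a plain dictionary instead of a `DocumentCollection`, it skips lines that appear before the first ".I", and it joins lines with a space so that words on adjacent lines stay separate. That last choice differs from the provided `extract_data` and could change word counts.

```python
def parse_documents(text):
    """Return {doc_id: body_text} for text laid out with .I and .W markers."""
    docs = {}
    doc_id = None
    buffer = []
    for line in text.splitlines():
        if line.startswith(".I"):
            # A new document starts: store the previous one, if any
            if doc_id is not None:
                docs[doc_id] = " ".join(buffer)
                buffer = []
            doc_id = int(line.split()[-1])
        elif line.startswith(".W"):
            # Keep any content that follows the marker on the same line
            rest = line[2:].strip()
            if rest:
                buffer.append(rest)
        elif doc_id is not None:
            buffer.append(line.strip())
    if doc_id is not None:  # Store the last document
        docs[doc_id] = " ".join(buffer)
    return docs


sample = ".I 1\n.W\nfirst document\nsecond line\n.I 2\n.W\nanother one\n"
parsed = parse_documents(sample)
print(parsed)  # {1: 'first document second line', 2: 'another one'}
```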

In [11]:
def extract_data(text):
    collection = DocumentCollection()
    doc_id = None
    text_buffer = ""
    for line in text.splitlines():
        if line.startswith(".I"):
            if doc_id is not None:
                collection.add_document(doc_id, text_buffer)
                text_buffer = ""
            doc_id = int(line.strip().split()[-1])
        elif line.startswith(".W"):
            text_buffer += line.strip().replace(".W", "")
        else:
            text_buffer += line.strip()
    if doc_id is not None:  # Add the last document
        collection.add_document(doc_id, text_buffer)
    return collection

# URL to the raw text file on GitHub
url = 'https://raw.githubusercontent.com/sumonacalpoly/Datasets/main/documents.txt'

# Get the content of the file
response = requests.get(url)

# Check if the request was successful (status code 200)
if response.status_code == 200:
    # Get the text content of the file
    text = response.text

    # Now you can proceed with processing the text data
    collection = extract_data(text)
else:
    print("Failed to retrieve the file. Status code:", response.status_code)

# collection.serialize("C:/Users/Andrew/Documents/CSC 466/Labs/Test_Serialize.pkl")


Complete the ***main()*** function. It processes the text data stored in the `collection` object, calculates some statistics, and serializes the `DocumentCollection` object to a file named "document_collection.pkl". Here's a breakdown of what the function does:

1. **Define Noise Words Array:** A list named `noise_word_array` holds common noise words (stop words): articles, conjunctions, prepositions, and other words that occur frequently in a language but carry little information about the content of a text. Including them in the analysis adds noise to the results, so they are filtered out during processing to focus on more meaningful content.

2. **Initialize Variables:** Initialize variables `word_freq_sum`, `distinct_words_sum`, and `highest_freq_word`. These variables will be used to calculate statistics about the text data.

3. **Iterate Over Documents:** Iterate over each document in the `collection.documents` dictionary. For each document:

    a. **Iterate Over Words:** Iterate over each word-frequency pair in the `word_freq` dictionary of the `TextVector` object associated with the document.
    
    b. **Check Noise Words:** Check if the word is not in the `noise_word_array`. If it's not a noise word:
    
        - Increment `word_freq_sum` by the frequency of the word.
        - Increment `distinct_words_sum` by 1.
        - Check if the frequency of the word is higher than the frequency of the current `highest_freq_word`. If so, update `highest_freq_word` to this word.
        
4. **Print Statistics:** Print the following statistics:
    
    - The word with the highest frequency (`highest_freq_word`).
    - The frequency of the highest frequency word.
    - The total number of distinct words (`distinct_words_sum`).
    - The total word count (`word_freq_sum`).

5. **Serialize Collection:** Serialize the `DocumentCollection` object `collection` to a file named "document_collection.pkl" using the `serialize` method.

6. **Main Function:** If this script is executed as the main program, the `main` function is called to execute the code.
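
The counting steps above can be checked by hand on a toy collection. In the sketch below, the nested dictionary `docs` is a hypothetical stand-in for the `word_freq` dictionaries of each `TextVector`, and `noise_words` is a one-word stand-in for `noise_word_array`. It also tracks two possible readings of "distinct words": counted once per document in which a word occurs (incrementing by 1 per word-frequency pair, as in step 3b) and counted once across the whole collection.

```python
# Hypothetical per-document word counts, standing in for TextVector.word_freq
docs = {
    1: {"the": 3, "cat": 2, "sat": 1},
    2: {"the": 1, "cat": 1, "ran": 4},
}
noise_words = {"the"}

word_freq_sum = 0          # total frequency of non-noise words
per_document_distinct = 0  # +1 for every non-noise word in every document
global_distinct = set()    # each non-noise word counted once overall
highest = ("", 0)          # (word, frequency) with the highest per-document count

for word_freq in docs.values():
    for word, freq in word_freq.items():
        if word in noise_words:
            continue
        word_freq_sum += freq
        per_document_distinct += 1
        global_distinct.add(word)
        if freq > highest[1]:
            highest = (word, freq)

print(word_freq_sum, per_document_distinct, len(global_distinct), highest)
# 8 4 3 ('ran', 4)
```

The two distinct-word conventions give different totals whenever a word appears in more than one document, so it is worth deciding which one your `main` reports.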

In [16]:
def main():
    noise_word_array = ["a", "about", "above", "all", "along",
        "also", "although", "am", "an", "and", "any", "are", "aren't", "as", "at",
        "be", "because", "been", "but", "by", "can", "cannot", "could", "couldn't",
        "did", "didn't", "do", "does", "doesn't", "e.g.", "either", "etc", "etc.",
        "even", "ever", "enough", "for", "from", "further", "get", "gets", "got", "had", "have",
        "hardly", "has", "hasn't", "having", "he", "hence", "her", "here",
        "hereby", "herein", "hereof", "hereon", "hereto", "herewith", "him",
        "his", "how", "however", "i", "i.e.", "if", "in", "into", "it", "it's", "its",
        "me", "more", "most", "mr", "my", "near", "nor", "now", "no", "not", "or", "on", "of", "onto",
        "other", "our", "out", "over", "really", "said", "same", "she",
        "should", "shouldn't", "since", "so", "some", "such",
        "than", "that", "the", "their", "them", "then", "there", "thereby",
        "therefore", "therefrom", "therein", "thereof", "thereon", "thereto",
        "therewith", "these", "they", "this", "those", "through", "thus", "to",
        "too", "under", "until", "unto", "upon", "us", "very", "was", "wasn't",
        "we", "were", "what", "when", "where", "whereby", "wherein", "whether",
        "which", "while", "who", "whom", "whose", "why", "with", "without",
        "would", "you", "your", "yours", "yes"]

    # Write a block of code that analyzes the text data to find the highest
    # frequency word, along with some additional statistics about the word
    # frequencies and the number of distinct words.
    # Hint: you need a loop that iterates over each document in the collection
    # (collection.documents). For each document, iterate over each word and its
    # frequency in the word_freq dictionary of that document's TextVector object.
    # The guideline is provided here:

    # 1. **Looping through documents and words:**
    #    - The outer loop iterates over each document in the collection.
    #    - The inner loop iterates over each word and its frequency in the current document.

    # 2. **Checking for noise words:**
    #    - Inside the inner loop, a condition checks that the current word is not in `noise_word_array`.
    #    - `noise_word_array` contains common noise words that are typically filtered out in text analysis tasks (e.g., "the", "and", "or").

    # 3. **Updating word frequency and distinct word count:**
    #    - If the current word is not a noise word, its frequency (`freq`) is added to `word_freq_sum`, which accumulates the total frequency of all non-noise words in the collection.
    #    - Similarly, `distinct_words_sum` is incremented by 1 for each non-noise word encountered, giving the count of distinct words in the collection.

    # 4. **Finding the highest frequency word:**
    #    - If the frequency of the current word is higher than that of the current `highest_freq_word`, update `highest_freq_word` to this word.

    # 5. **Printing results:**
    #    - Print the highest frequency word (`highest_freq_word`) and its frequency.
    #    - Also print the distinct number of words (`distinct_words_sum`) and the total word count (`word_freq_sum`).


    # NOW WRITE YOUR CODE HERE

    # Most frequent non-noise word so far, stored as [word, frequency]
    highest_freq_word = ["", 0]

    # Distinct non-noise words seen across the whole collection
    distinct_words_sum = 0
    all_words = set()

    # Total frequency of all non-noise words
    word_freq_sum = 0

    # Set membership checks are faster than scanning the list for every word
    noise_words = set(noise_word_array)

    for document in collection.documents.values():
        for word, freq in document.word_freq.items():
            if word in noise_words:
                continue

            # Add this word's frequency in the current document to the total
            word_freq_sum += freq

            # Count the word once, the first time it appears in any document
            if word not in all_words:
                distinct_words_sum += 1
                all_words.add(word)

            # Track the highest frequency of a word within a single document
            if freq > highest_freq_word[1]:
                highest_freq_word[0] = word
                highest_freq_word[1] = freq

    collection.serialize("document_collection.pkl")

    print(f"Total Word Count: {word_freq_sum}")
    print(f"Distinct word count: {distinct_words_sum}")
    print(f"Highest raw frequency: {highest_freq_word[1]}")
    print(f"Most frequent word: {highest_freq_word[0]}")


if __name__ == "__main__":
    main()

# Example usage:
# vector = TextVector()
# vector.add_word("hello")
# vector.add_word("world")
# vector.add_word("hello")
# vector.add_word("python")
# print("Raw vector entry set:", vector.getRawVectorEntrySet())
# print("Contains 'hello':", vector.contains("hello"))
# print("Frequency of 'hello':", vector.getRawFrequency("hello"))
# print("Total word count:", vector.getTotalWordCount())
# print("Distinct word count:", vector.getDistinctWordCount())
# print("Highest raw frequency:", vector.getHighestRawFrequency())
# print("Most frequent word:", vector.getMostFrequentWord())

Total Word Count: 142867
Distinct word count: 21859
Highest raw frequency: 31
Most frequent word: is


You have just created a non-machine-learning model, ***document_collection***, which can be used for data mining a text corpus.

### Helper code on how to use the pickle model / Advantage of saving your model ###

The document_collection.pkl file contains a serialized version of the DocumentCollection object, which represents a collection of documents along with their text data and associated statistics.

After serializing the DocumentCollection object to the document_collection.pkl file, you can perform various operations with it, such as:

Loading: You can load the serialized DocumentCollection object back into memory in another Python script or session. This allows you to access the document data and perform further analysis or processing without needing to re-parse the original text data.

In [None]:
with open("document_collection.pkl", "rb") as f:
    collection = pickle.load(f)
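
A quick way to convince yourself that nothing is lost in serialization is a round trip: dump an object, load it back, and compare. The sketch below does this with a plain nested dictionary written to a temporary directory; the data and the file name `example.pkl` are placeholders, not the lab's actual collection.

```python
import os
import pickle
import tempfile

# Placeholder data standing in for a collection of word frequencies
data = {1: {"hello": 2, "world": 1}, 2: {"python": 1}}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example.pkl")

    # Write the object in binary mode, then read it back
    with open(path, "wb") as f:
        pickle.dump(data, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == data)  # True
```

When the pickled object is an instance of your own class, such as `DocumentCollection`, the class definitions must be defined or importable in the session that loads the file, because pickle stores a reference to the class rather than its code.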


Once loaded, you can analyze the document data stored in the DocumentCollection object. You can calculate statistics, perform text mining tasks, generate visualizations, or extract insights from the document corpus.

In [None]:
# Calculate total word count of the collection
total_word_count = sum(text_vector.getTotalWordCount() for text_vector in collection.documents.values())

print(total_word_count)