<a href="https://colab.research.google.com/github/R-802/LING-226-Assignments/blob/main/LING226_2023T3_Assignment_One.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Shemaiah Rangitaawa
- 300601546
- Attempting Challenge

## Text Preprocessing Function `preprocess_text`

This function preprocesses text for natural language processing tasks. The steps are:

1. **Remove Punctuation**: The function drops every character that is neither alphanumeric nor whitespace. Punctuation often doesn't contribute to the meaning of text for many NLP tasks. Apostrophes are dropped too, so "preprocessing's" becomes "preprocessings".

2. **Remove Stopwords**: Stopwords (common words that typically carry little meaning, like "the", "is", "at") are meant to be filtered out so the focus stays on more significant words. As implemented below, stopwords are skipped when word frequencies are counted, but they are left in the returned text.

3. **Lowercase All Words**: The text is converted to lowercase, so the same word in different cases (e.g., "Hello" and "hello") is not counted as two different words.

4. **Remove Words Above a Certain Frequency**: Non-stopword words that occur more often than a given threshold are removed. The threshold can be set to suit the task, since very common words might not carry useful information. Removing rare words (which might be typos or irrelevant) is not implemented in the function below.
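
The steps above can be sketched on a toy sentence. This is only an illustration, not the notebook's function: `clean` and `filter_by_frequency` are hypothetical helpers, the stopword set is a hand-picked stand-in, and the `min_count`/`max_count` parameters are assumptions showing how both a lower and an upper frequency bound could be applied.

```python
from collections import Counter

def clean(text, stopwords=frozenset()):
    """Strip punctuation, lowercase, and drop stopwords (illustrative only)."""
    stripped = ''.join(ch.lower() for ch in text if ch.isalnum() or ch.isspace())
    return [w for w in stripped.split() if w not in stopwords]

def filter_by_frequency(words, min_count=1, max_count=None):
    """Keep words whose count lies between min_count and max_count (inclusive)."""
    counts = Counter(words)
    return [
        w for w in words
        if counts[w] >= min_count and (max_count is None or counts[w] <= max_count)
    ]

words = clean("The cat sat. The cat ran! A dog sat, too.", stopwords={"the", "a"})
print(words)                                    # ['cat', 'sat', 'cat', 'ran', 'dog', 'sat', 'too']
print(filter_by_frequency(words, min_count=2))  # ['cat', 'sat', 'cat', 'sat']
print(filter_by_frequency(words, max_count=1))  # ['ran', 'dog', 'too']
```

Counting after stopword removal keeps very common function words from dominating the frequency table.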


In [None]:
def preprocess_text(text, stopwords, random_frequency):
    """
    Preprocesses the given text by removing punctuation, lowercasing it, and
    removing non-stopword words that occur more than random_frequency times.
    Stopwords are skipped when counting frequencies and are left in the output.

    Parameters:
    text (str): The input text to be preprocessed.
    stopwords (set): A set of stopwords to exclude from frequency counting.
    random_frequency (int): Threshold frequency for word removal.

    Returns:
    tuple: (preprocessed text as a str, list of the words removed for
    exceeding the threshold, in order of first occurrence).
    """

    # Remove punctuation and make lowercase
    text = ''.join([char.lower() for char in text if char.isalnum() or char.isspace()])
    words = text.split()

    # Count the frequency of each non-stopword word (stopwords are skipped)
    word_frequency = {}
    for word in words:
        if word not in stopwords:
            word_frequency[word] = word_frequency.get(word, 0) + 1

    # Remove words that occur more than random_frequency times
    processed_words = []
    removed_words = []
    for word in words:
        if word_frequency.get(word, 0) > random_frequency:
            if word not in removed_words:
                removed_words.append(word)
        else:
            processed_words.append(word)

    return ' '.join(processed_words), removed_words

## `print_removed`
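1. Normalizes the original text the same way as `preprocess_text` (punctuation stripped, lowercased) and splits both the original and the preprocessed text into individual words.
2. Compares the word counts of the two texts to find the removed words, then estimates the removal threshold as the rounded average frequency of those words.
3. Collects the characters that appear in the original text but not in the preprocessed text, and returns them as a single string together with the removed words and the inferred frequency.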
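
The word-level comparison relies on `collections.Counter` subtraction, which keeps only positive counts. A minimal illustration on made-up sentences (not the notebook's data):

```python
from collections import Counter

original = Counter("the cat sat on the mat the end".split())
kept = Counter("cat sat on mat end".split())

# Subtraction drops every word whose count falls to zero or below,
# so only words that were actually removed remain.
removed = original - kept
print(removed)        # Counter({'the': 3})
print(list(removed))  # ['the']

# Words that appear only on the right-hand side never yield negative counts.
print(Counter(a=1) - Counter(a=2, b=1))  # Counter()
```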

In [None]:
from collections import Counter

def print_removed(original_text, preprocessed_text):
    """
    Analyzes the differences between the original and preprocessed texts to attempt to infer the word removal frequency.

    Parameters:
    original_text (str): The original text before preprocessing.
    preprocessed_text (str): The text after it has been preprocessed.

    Returns:
    removed_characters (str): The characters removed during preprocessing.
    removed_words (list): List of removed words.
    inferred_frequency (int): Inferred word removal frequency.
    """
    # Process original text
    original_text_processed = ''.join([char.lower() for char in original_text if char.isalnum() or char.isspace()])
    original_words = original_text_processed.split()

    # Split the preprocessed text into words
    preprocessed_words = preprocessed_text.split()

    # Count frequency of words in the original text
    original_word_freq = Counter(original_words)

    # Calculate frequency of removed words
    removed_words_freq = original_word_freq - Counter(preprocessed_words)
    removed_words = list(removed_words_freq.keys())

    # Attempt to infer the frequency
    if removed_words_freq:
        # Calculate the average frequency of removed words
        avg_freq = sum(removed_words_freq.values()) / len(removed_words_freq)
        inferred_frequency = round(avg_freq)
    else:
        inferred_frequency = None

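    # Characters present in the original but absent from the preprocessed text
    # (e.g. punctuation, uppercase letters, newlines); sorted for deterministic output
    removed_characters = ''.join(sorted(set(original_text) - set(preprocessed_text)))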

    return removed_characters, removed_words, inferred_frequency
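
Averaging the counts of removed words tends to overestimate the threshold, because every removed word occurs more often than the threshold. One alternative, sketched here under the assumption that a non-stopword is removed exactly when its count exceeds the threshold (as in `preprocess_text`), is to bound the threshold from both sides. `infer_threshold_bounds` is a hypothetical helper and not part of the notebook.

```python
from collections import Counter

def infer_threshold_bounds(original_words, kept_words, stopwords=frozenset()):
    """Hypothetical helper: bound the removal threshold from observed counts.

    Returns (lowest, highest); highest is None if nothing was removed.
    """
    counts = Counter(original_words)
    kept = set(kept_words)

    # Kept non-stopwords have count <= threshold, so the threshold is at
    # least the largest such count.
    lowest = max(
        (c for w, c in counts.items() if w in kept and w not in stopwords),
        default=0,
    )

    # Removed words have count > threshold, so the threshold is strictly
    # below the smallest such count.
    removed_counts = [c for w, c in counts.items() if w not in kept]
    highest = min(removed_counts) - 1 if removed_counts else None

    return lowest, highest

original_words = "x y y z z z z".split()
kept_words = "x y y".split()
print(infer_threshold_bounds(original_words, kept_words))  # (2, 3)
```

In this toy case a threshold of 2 or 3 would produce the same output, so the exact value cannot be recovered from the two texts alone.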

## Importing NLTK Stopwords as a Set and Usage Example

In [None]:
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')

# Convert NLTK's stopwords to a set
stopwords_set = set(stopwords.words('english'))

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
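
NLTK's English stopword list contains contractions such as "don't" and "it's". Because `preprocess_text` strips apostrophes before comparing words, those entries can never match. A possible workaround, shown on a small hand-written stand-in set rather than the real NLTK list, is to normalize the stopwords the same way as the text; `strip_punctuation` is a hypothetical helper.

```python
def strip_punctuation(token):
    """Lowercase a token and drop non-alphanumeric characters."""
    return ''.join(ch.lower() for ch in token if ch.isalnum())

# Stand-in for a few entries of the stopword set (assumption for illustration)
sample_stopwords = {"the", "don't", "it's"}
normalized = {strip_punctuation(w) for w in sample_stopwords}

print(sorted(normalized))          # ['dont', 'its', 'the']
print("dont" in sample_stopwords)  # False
print("dont" in normalized)        # True
```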


In [None]:
text = """This is an example text demonstrating text preprocessing's importance.
It includes various words, including common stopwords. Preprocessing techniques
like tokenization, removing stopwords, and stemming are essential for converting
raw text into analyzable format in NLP. In NLP, we encounter diverse data, from
social media to research papers, each with unique challenges. Text preprocessing
helps clean and prepare data for extracting insights. Preprocessing methods may
vary based on the NLP task. Understanding and applying these techniques are
fundamental for extracting valuable information from text."""

In [None]:
import random
random_frequency = random.randint(1, 10)  # Random removal threshold between 1 and 10 (inclusive)
preprocessed_text, removed_words_list = preprocess_text(text, stopwords_set, random_frequency)

print("Raw Text:")
print(text)

print("\nProcessed Text:")
print(preprocessed_text)

# Call the function and store the results
removed_characters, removed_words, inferred_frequency = print_removed(text, preprocessed_text)

# Print the removed characters
print("\nCharacters Removed During Preprocessing:")
print(removed_characters)

# Print the inferred frequency
print(f"\nInferred Removal Frequency: {inferred_frequency}")

# Print the actual removal frequency
print(f"Actual Removal Frequency: {random_frequency}")

Raw Text:
This is an example text demonstrating text preprocessing's importance. 
It includes various words, including common stopwords. Preprocessing techniques 
like tokenization, removing stopwords, and stemming are essential for converting 
raw text into analyzable format in NLP. In NLP, we encounter diverse data, from 
social media to research papers, each with unique challenges. Text preprocessing 
helps clean and prepare data for extracting insights. Preprocessing methods may
vary based on the NLP task. Understanding and applying these techniques are
fundamental for extracting valuable information from text.

Processed Text:
this is an example text demonstrating text preprocessings importance it includes various words including common stopwords preprocessing techniques like tokenization removing stopwords and stemming are essential for converting raw text into analyzable format in nlp in nlp we encounter diverse data from social media to research papers each with unique challeng