<a href="https://colab.research.google.com/github/PawelJakubczyk/ml_data_preprocessing_utilities/blob/main/ml_data_preprocessing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Setup: Install Required Libraries

In [1]:
%pip install numpy
%pip install pandas
%pip install language_tool_python
%pip install textaugment


# Import Libraries

In [2]:
# Read and manage data
import pandas as pd
from concurrent.futures import ProcessPoolExecutor

# Data preprocessing
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
from re import sub, findall
import string
import nltk
import spacy
from language_tool_python import LanguageTool
from collections import Counter

# Data augmentation
from textaugment import Wordnet

# Define Functions

1. **normalize_text(input_text: str) -> str**:
   This function normalizes the input text by converting it to lowercase and removing leading and trailing whitespace. Whitespace inside the text is left unchanged.

2. **remove_punctuation(input_string: str) -> str**:
   This function removes punctuation marks from the input string.

3. **replace_tabs_enters_and_spaces(input_string: str) -> str**:
   This function replaces tabs and newlines with spaces, then collapses any run of whitespace into a single space.

4. **remove_stopwords(input_string: str) -> str**:
   This function removes stopwords from the input string. Stopwords are commonly used words (e.g., "the", "is", "in") that do not carry significant meaning.

5. **remove_html_tags(input_text: str) -> str**:
   This function removes HTML tags from the input text.

6. **remove_stopwords(input_string: str, stop_words: Set[str]) -> str**:
   A variant of item 4 that takes the stopword set as an explicit argument. The code cell below defines only the one-argument form, which reads the module-level `STOP_WORDS` set.

7. **lemmatize(text: str) -> str**:
   This function lemmatizes the input text. Lemmatization is similar to stemming but considers the context of the word to determine its base or dictionary form. As implemented here, it also drops stopwords.

8. **stemming(text: str) -> str**:
   This function performs stemming on the input text. Stemming reduces words to their root or base form, often by removing suffixes.

    **Note:** It is recommended to avoid using lemmatization (`lemmatize`) and stemming (`stemming`) simultaneously, as both methods serve to normalize words but in different ways. Using them together may lead to excessive text processing and loss of semantic information.

9. **remove_pos_tags(text: str) -> str**:
   This function keeps only the tokens whose part-of-speech tag is `NN` (singular common nouns) and removes every other word, including plural nouns (`NNS`) and proper nouns (`NNP`).

10. **correct_grammar(text: str) -> str**:
    This function corrects grammar mistakes in the input text.

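11. **remove_common_words(text: str, common_threshold: int) -> str**:
    This function removes words that occur more than `common_threshold` times in the text. The result is lowercased, and punctuation is dropped.

12. **remove_rare_words(text: str, rare_threshold: int) -> str**:
    This function removes words that occur `rare_threshold` times or fewer in the text. The result is lowercased, and punctuation is dropped.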

13. **clean_empty_data(df: pd.DataFrame, columns_to_check: list) -> pd.DataFrame**:
    Cleans the input DataFrame by removing rows in which any of the checked columns holds a missing value or an empty string.



In [3]:
# Prior to running the functions, required resources need to be downloaded
nltk.download('punkt')
nltk.download('stopwords')
nlp = spacy.load('en_core_web_sm')
STOP_WORDS = nlp.Defaults.stop_words
stemmer = PorterStemmer()

def normalize_text(input_text: str) -> str:
    """Normalizes the input text by converting it to lowercase and removing leading/trailing whitespaces."""
    # Convert to lowercase
    normalized_text = input_text.lower()

    # Remove leading and trailing whitespaces
    normalized_text = normalized_text.strip()

    return normalized_text

def remove_punctuation(input_string: str) -> str:
    """Removes punctuation from the input string"""
    return input_string.translate(str.maketrans('', '', string.punctuation))

def replace_tabs_enters_and_spaces(input_string: str) -> str:
    """Replaces tabs and newlines with spaces, then collapses whitespace runs into a single space"""
    replaced_tabs = sub(r'\t', ' ', input_string)
    replace_enters = sub(r'\n', ' ', replaced_tabs)
    replaced_spaces = sub(r'\s+', ' ', replace_enters)
    return replaced_spaces

def remove_html_tags(input_text: str) -> str:
    """Removes HTML tags from the input text."""
    clean_text = sub(r'<.*?>', '', input_text)
    return clean_text

def remove_stopwords(input_string: str) -> str:
    """Removes stopwords from the input string"""
    words = input_string.split()
    filtered_words = [word for word in words if word.lower() not in STOP_WORDS]
    return ' '.join(filtered_words)

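def lemmatize(input_string: str) -> str:
    """Lemmatizes the input text and drops stopwords"""
    doc = nlp(input_string)
    # Compare in lowercase so capitalized stopwords (e.g. "The") are dropped too
    sent = [token.lemma_ for token in doc if token.text.lower() not in STOP_WORDS]
    return ' '.join(sent)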

def stemming(input_string: str) -> str:
    """Performs stemming on the input text"""
    tokens = word_tokenize(input_string)
    stem_words = [stemmer.stem(word) for word in tokens]
    return ' '.join(stem_words)

def remove_pos_tags(input_string: str) -> str:
    """Keeps only tokens tagged NN (singular common nouns) and removes all other words"""
    doc = nlp(input_string)
    sent = [token.text for token in doc if token.tag_ == 'NN']
    return ' '.join(sent)

def correct_grammar(text: str) -> str:
    """Corrects grammar mistakes in the input text."""
    # Creates a LanguageTool object for English
    tool = LanguageTool('en-US')
    corrected_text = tool.correct(text)
    return corrected_text

def remove_common_words(text: str, common_threshold: int) -> str:
    """Removes words that occur more than common_threshold times in the text."""
    words = findall(r'\b\w+\b', text.lower())  # Tokenize the text into lowercase words
    word_counts = Counter(words)
    common_words = {word for word, count in word_counts.items() if count > common_threshold}
    filtered_text = ' '.join(word for word in words if word not in common_words)
    return filtered_text

def remove_rare_words(text: str, rare_threshold: int) -> str:
    """Removes words that occur rare_threshold times or fewer in the text."""
    words = findall(r'\b\w+\b', text.lower())  # Tokenize the text into lowercase words
    word_counts = Counter(words)
    rare_words = {word for word, count in word_counts.items() if count <= rare_threshold}
    filtered_text = ' '.join(word for word in words if word not in rare_words)
    return filtered_text

def clean_empty_data(df: pd.DataFrame, columns_to_check: list) -> pd.DataFrame:
    """Removes rows in which any checked column is missing (NaN/None) or an empty string."""
    # Make a copy to avoid modifying the original DataFrame
    cleaned_df = df.copy()

    for column in columns_to_check:
        cleaned_df = cleaned_df[cleaned_df[column].notna()]
        cleaned_df = cleaned_df[cleaned_df[column] != ""]

    return cleaned_df

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


# Usage Examples

## 1. **normalize_text(input_text: str) -> str**:
   This function normalizes the input text by converting it to lowercase and removing leading and trailing whitespace. Whitespace inside the text is left unchanged.

In [4]:
# Example 1: Simple text
text1 = "   This is a Sample Text   "
normalized_text1 = normalize_text(text1)
print("Original Text 1:", text1)
print("Normalized Text 1:", normalized_text1)

# Example 2: Text with uppercase letters
text2 = "HELLO WORLD"
normalized_text2 = normalize_text(text2)
print("Original Text 2:", text2)
print("Normalized Text 2:", normalized_text2)

# Example 3: Text with multiple white spaces
text3 = "   This     is     another    example    "
normalized_text3 = normalize_text(text3)
print("Original Text 3:", text3)
print("Normalized Text 3:", normalized_text3)

Original Text 1:    This is a Sample Text   
Normalized Text 1: this is a sample text
Original Text 2: HELLO WORLD
Normalized Text 2: hello world
Original Text 3:    This     is     another    example    
Normalized Text 3: this     is     another    example
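
Example 3 shows that `normalize_text` only trims the ends of the string, so the runs of spaces inside it survive. The sketch below chains it with `replace_tabs_enters_and_spaces` to collapse them as well. Both functions are restated so the snippet runs on its own; `normalize_and_collapse` is a hypothetical helper name, not part of this notebook.

```python
from re import sub

def normalize_text(input_text: str) -> str:
    """Lowercase the text and trim leading/trailing whitespace."""
    return input_text.lower().strip()

def replace_tabs_enters_and_spaces(input_string: str) -> str:
    """Replace tabs and newlines with spaces, then collapse whitespace runs."""
    replaced_tabs = sub(r'\t', ' ', input_string)
    replaced_enters = sub(r'\n', ' ', replaced_tabs)
    return sub(r'\s+', ' ', replaced_enters)

def normalize_and_collapse(input_text: str) -> str:
    """Hypothetical helper: normalize first, then collapse inner whitespace."""
    return replace_tabs_enters_and_spaces(normalize_text(input_text))

text3 = "   This     is     another    example    "
print(repr(normalize_text(text3)))          # 'this     is     another    example'
print(repr(normalize_and_collapse(text3)))  # 'this is another example'
```

Trimming before collapsing means no stray leading or trailing space is left behind by the collapse step.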


## 2. **remove_punctuation(input_string: str) -> str**:
   This function removes punctuation marks from the input string.

In [5]:
# Example 1: Text with punctuation
text1 = "Hello, World!"
cleaned_text1 = remove_punctuation(text1)
print("Original Text 1:", text1)
print("Cleaned Text 1:", cleaned_text1)

# Example 2: Text with special characters
text2 = "This is a #test with @special characters!"
cleaned_text2 = remove_punctuation(text2)
print("Original Text 2:", text2)
print("Cleaned Text 2:", cleaned_text2)

# Example 3: Mixed text with punctuation and numbers
text3 = "Don't worry, it's just 9:00 AM!"
cleaned_text3 = remove_punctuation(text3)
print("Original Text 3:", text3)
print("Cleaned Text 3:", cleaned_text3)

Original Text 1: Hello, World!
Cleaned Text 1: Hello World
Original Text 2: This is a #test with @special characters!
Cleaned Text 2: This is a test with special characters
Original Text 3: Don't worry, it's just 9:00 AM!
Cleaned Text 3: Dont worry its just 900 AM
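
`remove_punctuation` relies on `string.punctuation`, which contains ASCII characters only, so typographic quotes or an ellipsis character pass through untouched. The sketch below contrasts it with a hypothetical variant, `remove_unicode_punctuation` (not part of this notebook), that drops every character whose Unicode category starts with `P`. It is one possible approach under that assumption, not a drop-in replacement.

```python
import string
import unicodedata

def remove_punctuation(input_string: str) -> str:
    """ASCII-only version, as defined in the notebook."""
    return input_string.translate(str.maketrans('', '', string.punctuation))

def remove_unicode_punctuation(input_string: str) -> str:
    """Hypothetical variant: drop characters in any Unicode punctuation category (P*)."""
    return ''.join(
        ch for ch in input_string
        if not unicodedata.category(ch).startswith('P')
    )

text = "“Don’t worry,” she said… it’s fine!"
print(remove_punctuation(text))          # “Don’t worry” she said… it’s fine
print(remove_unicode_punctuation(text))  # Dont worry she said its fine
```

The two functions are not interchangeable: symbols such as `$` or `+` belong to the Unicode symbol categories (`S*`), so the variant keeps them even though `string.punctuation` includes them.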


## 3. **replace_tabs_enters_and_spaces(input_string:str) -> str**:
   This function replaces tabs and newlines with spaces, then collapses any run of whitespace into a single space.

In [7]:
input_text = "This\tis\na\tsample\ttext\nwith\ttabs\nand\nnew\tlines."
print("Original Text:")
print(input_text)

processed_text = replace_tabs_enters_and_spaces(input_text)
print("\nProcessed Text:")
print(processed_text)

Original Text:
This	is
a	sample	text
with	tabs
and
new	lines.

Processed Text:
This is a sample text with tabs and new lines.
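
Because the regex class `\s` already matches tabs and newlines, the final substitution alone appears to be enough. The sketch below restates the three-step function and compares it with a one-step version; `collapse_whitespace` is a hypothetical name introduced here for illustration.

```python
from re import sub

def replace_tabs_enters_and_spaces(input_string: str) -> str:
    """Three-step version from the notebook."""
    replaced_tabs = sub(r'\t', ' ', input_string)
    replaced_enters = sub(r'\n', ' ', replaced_tabs)
    return sub(r'\s+', ' ', replaced_enters)

def collapse_whitespace(input_string: str) -> str:
    """Hypothetical one-step version: one substitution over all whitespace runs."""
    return sub(r'\s+', ' ', input_string)

sample = "This\tis\na\tsample\t\ttext \n with   spaces"
print(collapse_whitespace(sample))  # This is a sample text with spaces
print(replace_tabs_enters_and_spaces(sample) == collapse_whitespace(sample))  # True
```

Neither version trims the ends: a leading or trailing run of whitespace is reduced to a single space rather than removed.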


## 4. **remove_stopwords(input_string: str) -> str**:
   This function removes stopwords from the input string. Stopwords are commonly used words (e.g., "the", "is", "in") that do not carry significant meaning.

## 5. **remove_html_tags(input_text: str) -> str**:
   This function removes HTML tags from the input text.

In [8]:
html_text = "<p>This is <b>HTML</b> <i>text</i> with <a href='#'>links</a>.</p>"
cleaned_text = remove_html_tags(html_text)
print("Original HTML Text:")
print(html_text)
print("\nCleaned Text:")
print(cleaned_text)

Original HTML Text:
<p>This is <b>HTML</b> <i>text</i> with <a href='#'>links</a>.</p>

Cleaned Text:
This is HTML text with links.


## 6. **remove_stopwords(input_string: str, stop_words: Set[str]) -> str**:
   This function removes stopwords from the input string. Stopwords are commonly used words (e.g., "the", "is", "in") that do not carry significant meaning.
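
The code cell above defines `remove_stopwords` with a single argument and reads the spaCy-derived `STOP_WORDS` set. A minimal sketch of the two-argument form named in this heading might look as follows. The body mirrors the one-argument function, and `DEMO_STOP_WORDS` is a tiny illustrative set assumed for this example, not the spaCy list.

```python
from typing import Set

def remove_stopwords(input_string: str, stop_words: Set[str]) -> str:
    """Drop every whitespace-separated word whose lowercase form is in stop_words."""
    words = input_string.split()
    filtered_words = [word for word in words if word.lower() not in stop_words]
    return ' '.join(filtered_words)

# Illustrative stopword set (an assumption for this example)
DEMO_STOP_WORDS = {"the", "is", "in", "a", "on"}

text = "The cat is sleeping in the garden"
print(remove_stopwords(text, DEMO_STOP_WORDS))  # cat sleeping garden
```

Matching is done on whole whitespace-separated tokens, so a stopword with punctuation attached (for example `the,`) is not removed. Running `remove_punctuation` first avoids that.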

## 7. **lemmatize(text: str) -> str**:
   This function lemmatizes the input text. Lemmatization is similar to stemming but considers the context of the word to determine its base or dictionary form. As implemented here, it also drops stopwords.

## 8. **stemming(text: str) -> str**:
   This function performs stemming on the input text. Stemming reduces words to their root or base form, often by removing suffixes.

   **Note:** It is recommended to avoid using lemmatization (`lemmatize`) and stemming (`stemming`) simultaneously, as both methods serve to normalize words but in different ways. Using them together may lead to excessive text processing and loss of semantic information.

## 9. **remove_pos_tags(text: str) -> str**:
   This function keeps only the tokens whose part-of-speech tag is `NN` (singular common nouns) and removes every other word, including plural nouns (`NNS`) and proper nouns (`NNP`).


## 10. **correct_grammar(text: str) -> str**:
   This function corrects grammar mistakes in the input text.

## 11. **remove_common_words(text: str, common_threshold: int) -> str**:
   This function removes words that occur more than `common_threshold` times in the text. The result is lowercased, and punctuation is dropped.
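
A short usage sketch, with the function restated from the code cell above so the snippet runs on its own. The sample sentence is made up for illustration.

```python
from collections import Counter
from re import findall

def remove_common_words(text: str, common_threshold: int) -> str:
    """Remove words that occur more than common_threshold times in the text."""
    words = findall(r'\b\w+\b', text.lower())  # Tokenize into lowercase words
    word_counts = Counter(words)
    common_words = {word for word, count in word_counts.items() if count > common_threshold}
    return ' '.join(word for word in words if word not in common_words)

text = "The cat and the dog and the bird"
# Counts: the=3, and=2, cat=1, dog=1, bird=1
print(remove_common_words(text, common_threshold=2))  # cat and dog and bird
print(remove_common_words(text, common_threshold=1))  # cat dog bird
```

The comparison is strict (`count > common_threshold`), so a word that occurs exactly `common_threshold` times is kept. Counts are taken within the single string passed in, not across a corpus.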

## 12. **remove_rare_words(text: str, rare_threshold: int) -> str**:
   This function removes words that occur `rare_threshold` times or fewer in the text. The result is lowercased, and punctuation is dropped.
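
A short usage sketch, with the function restated from the code cell above so the snippet runs on its own. The sample sentence is made up for illustration.

```python
from collections import Counter
from re import findall

def remove_rare_words(text: str, rare_threshold: int) -> str:
    """Remove words that occur rare_threshold times or fewer in the text."""
    words = findall(r'\b\w+\b', text.lower())  # Tokenize into lowercase words
    word_counts = Counter(words)
    rare_words = {word for word, count in word_counts.items() if count <= rare_threshold}
    return ' '.join(word for word in words if word not in rare_words)

text = "The cat and the dog and the bird"
# Counts: the=3, and=2, cat=1, dog=1, bird=1
print(remove_rare_words(text, rare_threshold=1))  # the and the and the
print(remove_rare_words(text, rare_threshold=2))  # the the the
```

Here the comparison is inclusive (`count <= rare_threshold`), so a word that occurs exactly `rare_threshold` times is removed. This is the opposite boundary behavior from `remove_common_words`.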

## 13. **clean_empty_data(df: pd.DataFrame, columns_to_check: list) -> pd.DataFrame**:
   Cleans the input DataFrame by removing rows in which any of the checked columns holds a missing value or an empty string.
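
A short usage sketch, with the function restated from the code cell above so the snippet runs on its own. The column names and rows are made up for illustration.

```python
import pandas as pd

def clean_empty_data(df: pd.DataFrame, columns_to_check: list) -> pd.DataFrame:
    """Remove rows in which any checked column is missing or an empty string."""
    cleaned_df = df.copy()  # Leave the caller's DataFrame untouched
    for column in columns_to_check:
        cleaned_df = cleaned_df[cleaned_df[column].notna()]
        cleaned_df = cleaned_df[cleaned_df[column] != ""]
    return cleaned_df

# Illustrative data: one empty string and one missing value in "text"
df = pd.DataFrame({
    "text": ["good review", "", None, "bad review"],
    "label": ["pos", "neg", "pos", "neg"],
})

cleaned = clean_empty_data(df, ["text"])
print(cleaned)
print(len(df), "rows before,", len(cleaned), "rows after")  # 4 rows before, 2 rows after
```

The original row labels are kept (here 0 and 3); calling `reset_index(drop=True)` on the result renumbers them if needed. Strings that contain only whitespace are not treated as empty by this check.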