<h1>Text Cleanup</h1>
<h4>Methodology:</h4>
<ul>
  <li><strong>NLTK Resources Download:</strong> The code downloads three NLTK resources: 'punkt' (tokenizer models), 'stopwords' (stopword lists), and 'wordnet' (the dictionary used by the lemmatizer).</li>
  <li><strong>Text Cleaning Function:</strong> The <code>clean_text</code> function preprocesses a string by applying the following steps in order:
    <ul>
      <li><strong>Special Character Removal:</strong> Removes punctuation and other special characters, keeping only word characters, whitespace, and the characters <code>@</code>, <code>#</code>, and <code>'</code> (apostrophe).</li>
      <li><strong>Text Normalization:</strong> Converts the text to lowercase for uniformity.</li>
      <li><strong>Tokenization:</strong> Segments the text into individual tokens (words).</li>
      <li><strong>Stopword Removal:</strong> Eliminates common English stopwords to filter out noise from the text.</li>
      <li><strong>Lemmatization:</strong> Reduces each token to its dictionary form (lemma) using WordNet, so that inflected variants such as plurals map to the same token.</li>
      <li><strong>Whitespace Handling:</strong> Collapses runs of whitespace into single spaces and trims leading and trailing whitespace.</li>
    </ul>
  </li>
</ul>
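<p>The regex-based steps in the list (special character removal, lowercasing, and whitespace handling) can be tried in isolation with the standard library alone. The following is a minimal sketch; the function name <code>strip_and_normalize</code> and the sample string are hypothetical and used only for illustration.</p>

```python
import re

def strip_and_normalize(text):
    """Regex-only steps: drop punctuation, lowercase, collapse whitespace."""
    # Keep word characters (\w), whitespace (\s), and the literal #, @ and '.
    kept = re.sub(r"[^\w\s#@']", "", text)
    lowered = kept.lower()
    # Collapse runs of whitespace into one space and trim both ends.
    return re.sub(r"\s+", " ", lowered).strip()

sample = "Hello,   @user! Don't miss #NLP... (really)"
result = strip_and_normalize(sample)
print(result)  # hello @user don't miss #nlp really
```

<p>Because <code>\w</code> matches letters, digits, and the underscore, tokens such as <code>snake_case</code> and numbers pass through this step unchanged.</p>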

<h3>Imports</h3>

In [1]:
import pandas as pd

import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

<h3>Text Cleaning Process Overview</h3>
<p>The cell below downloads the required NLTK resources and defines <code>clean_text</code>, which applies the cleaning steps listed under Methodology.</p>

In [2]:
# Download NLTK resources if you haven't done so already
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

# Function to perform text cleaning
def clean_text(text):
    # Remove special characters and punctuation except @, #, and '
    cleaned_text = re.sub(r"[^\w\s#@']", "", text)

    # Normalize text (convert to lowercase)
    cleaned_text = cleaned_text.lower()

    # Tokenize the text
    tokens = word_tokenize(cleaned_text)

    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    filtered_tokens = [word for word in tokens if word not in stop_words]

    # Initialize lemmatization
    lemmatizer = WordNetLemmatizer()

    # Lemmatize each token (no stemming is performed)
    lemmatized_tokens = [lemmatizer.lemmatize(word) for word in filtered_tokens]

    # Join tokens back into text
    cleaned_text = ' '.join(lemmatized_tokens)

    # Handle whitespaces
    cleaned_text = re.sub(r'\s+', ' ', cleaned_text).strip()

    return cleaned_text

[nltk_data] Downloading package punkt to /Users/andrew/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/andrew/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /Users/andrew/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
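
<p>As written, <code>clean_text</code> rebuilds the stopword set and the lemmatizer on every call, which adds up when the function is applied to every row of a large dataset. One possible variation, sketched below under assumptions, creates those resources once and returns a closure. The factory name <code>make_cleaner</code> and its parameters are hypothetical, and the demo uses toy stand-ins (whitespace splitting, a three-word stopword set, an identity lemmatizer) so that it runs without NLTK data.</p>

```python
import re

def make_cleaner(tokenize, stop_words, lemmatize):
    """Build a cleaning function whose resources are created only once.

    tokenize, stop_words and lemmatize are supplied by the caller; the
    name make_cleaner is hypothetical and not part of NLTK.
    """
    stop_words = frozenset(stop_words)  # set membership tests are O(1)

    def clean(text):
        # Same regex and lowercasing as clean_text in the notebook.
        text = re.sub(r"[^\w\s#@']", "", text).lower()
        tokens = [tok for tok in tokenize(text) if tok not in stop_words]
        return " ".join(lemmatize(tok) for tok in tokens)

    return clean

# Toy stand-ins so the sketch runs without downloading NLTK resources.
toy_clean = make_cleaner(
    tokenize=str.split,             # whitespace split, not word_tokenize
    stop_words={"the", "is", "a"},  # tiny illustrative stopword set
    lemmatize=lambda tok: tok,      # identity: no lemmatization
)
print(toy_clean("The cat is on a MAT!"))  # cat on mat
```

<p>In the notebook, the same factory could be called with <code>word_tokenize</code>, <code>stopwords.words('english')</code>, and <code>WordNetLemmatizer().lemmatize</code> to reproduce the behaviour of <code>clean_text</code> without the per-call setup.</p>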


<h3>Applying Cleanup to the Training Dataset</h3>

In [3]:
# Load train data (assuming you have train.csv for training)
train_dataset = pd.read_csv('kmaml223/train.csv')

# Create a new column 'comment_text_cleaned' and assign the preprocessed text data
train_dataset['comment_text_cleaned'] = train_dataset['comment_text'].apply(clean_text)

# Save the modified DataFrame, overwriting the original CSV file
train_dataset.to_csv('kmaml223/train.csv', index=False)

<h3>Applying Cleanup to the Testing Dataset</h3>

In [4]:
# Load test data (assuming you have test.csv for testing)
test_dataset = pd.read_csv('kmaml223/test.csv')

# Create a new column 'comment_text_cleaned' and assign the preprocessed text data
test_dataset['comment_text_cleaned'] = test_dataset['comment_text'].apply(clean_text)

# Save the modified DataFrame, overwriting the original CSV file
test_dataset.to_csv('kmaml223/test.csv', index=False)
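
<p>The train and test cells repeat the same steps, and <code>clean_text</code> would raise an error on a missing (NaN) comment because it expects a string. The sketch below shows one way to factor the shared step into a helper that also guards against missing values. The helper name <code>add_cleaned_column</code> is hypothetical, and the demo uses a small in-memory DataFrame with a simple stand-in cleaning function rather than the real CSV files.</p>

```python
import pandas as pd

def add_cleaned_column(df, clean, source="comment_text",
                       target="comment_text_cleaned"):
    """Return a copy of df with a cleaned-text column added."""
    out = df.copy()
    # Replace missing comments with empty strings so clean() always gets a str.
    out[target] = out[source].fillna("").astype(str).apply(clean)
    return out

# In-memory demo data; the second row simulates a missing comment.
demo = pd.DataFrame({"comment_text": ["  Hello World  ", None]})
cleaned = add_cleaned_column(demo, clean=lambda s: s.strip().lower())
print(cleaned["comment_text_cleaned"].tolist())  # ['hello world', '']
```

<p>With the notebook's function, the call would be <code>add_cleaned_column(train_dataset, clean_text)</code>. Writing the result to a different file name than the input would keep the raw CSV files intact.</p>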