# NLP Preprocessing: Step-by-Step Guide

This notebook demonstrates various Natural Language Processing (NLP) preprocessing techniques applied to a legal dataset. NLP preprocessing is essential for converting raw text data into a format that can be effectively analyzed by machine learning algorithms.

The notebook covers:
- Text exploration and basic statistics
- Tokenization of text into words
- Removal of stopwords
- Stemming and lemmatization
- Part-of-speech tagging
- Text cleaning and normalization
- Text visualization
- Vectorization techniques

## 1. Import Required Libraries

First, we need to import the necessary Python libraries for NLP preprocessing:

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
import re
import nltk
import matplotlib.pyplot as plt
import seaborn as sns
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk import pos_tag, ne_chunk
from collections import Counter
from wordcloud import WordCloud  # You may need to install this: pip install wordcloud
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Download required NLTK resources
print("Downloading required NLTK resources...")
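nltk.download('punkt', quiet=True)
nltk.download('punkt_tab', quiet=True)  # tokenizer data used by newer NLTK releases
nltk.download('stopwords', quiet=True)
nltk.download('wordnet', quiet=True)
nltk.download('averaged_perceptron_tagger', quiet=True)
nltk.download('averaged_perceptron_tagger_eng', quiet=True)  # tagger data used by newer NLTK releases
nltk.download('maxent_ne_chunker', quiet=True)
nltk.download('words', quiet=True)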
print("Downloads complete.")

## 2. Load and Explore the Dataset

Now we'll load the legal dataset from a CSV file and explore its basic properties:

In [None]:
# Load the dataset
print("Loading the legal dataset...")
df = pd.read_csv('legal_dataset.csv')

# Display basic information
print(f"\nDataset shape: {df.shape}")
print("\nFirst 5 rows:")
display(df.head())

print("\nBasic information about the dataset:")
df.info()

print("\nSummary statistics:")
display(df.describe(include='all'))

In [None]:
# Check for missing values
print("\nChecking for missing values:")
print(df.isnull().sum())

# Choose the text column to preprocess (first column)
text_col = df.columns[0]
print(f"\nWorking with text column: '{text_col}'")

# Display text length statistics
df['text_length'] = df[text_col].astype(str).apply(len)
df['word_count'] = df[text_col].astype(str).apply(lambda x: len(x.split()))
df['sentence_count'] = df[text_col].astype(str).apply(lambda x: len(sent_tokenize(str(x))))

print("\nText statistics:")
print(f"Average text length: {df['text_length'].mean():.2f} characters")
print(f"Average word count: {df['word_count'].mean():.2f} words")
print(f"Average sentence count: {df['sentence_count'].mean():.2f} sentences")

# Visualize text length distribution
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
sns.histplot(df['word_count'], kde=True)
plt.title('Distribution of Word Count')
plt.xlabel('Number of Words')
plt.subplot(1, 2, 2)
sns.histplot(df['sentence_count'], kde=True)
plt.title('Distribution of Sentence Count')
plt.xlabel('Number of Sentences')
plt.tight_layout()
plt.show()
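
One detail worth checking alongside these statistics is how missing values behave once they are cast to strings. The sketch below uses only the standard library; `to_text` is a hypothetical helper, not part of the notebook, and the three sample rows are made up for illustration. It shows that a plain string cast turns a missing value into literal text, which then counts toward length and word statistics.

```python
import math

def to_text(value):
    """Return value as a string, mapping None and NaN to an empty string."""
    if value is None:
        return ""
    if isinstance(value, float) and math.isnan(value):
        return ""
    return str(value)

# Made-up rows: one real text, one NaN, one None
rows = ["The court ruled.", float("nan"), None]

# A plain str() cast yields the literal strings "nan" and "None"
naive_lengths = [len(str(v)) for v in rows]
# The helper treats missing values as empty text instead
safe_lengths = [len(to_text(v)) for v in rows]

print(naive_lengths)  # [16, 3, 4]
print(safe_lengths)   # [16, 0, 0]
```

If the missing-value check above reports gaps in the text column, dropping or blanking those rows before computing statistics avoids counting "nan" as a one-word document.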

## 3. Text Cleaning

Before tokenization, we'll clean the text by:
- Converting to lowercase
- Removing punctuation and digits
- Removing extra whitespace

In [None]:
# Function to clean text
def clean_text(text):
    # Convert to string in case there are non-string entries
    text = str(text)
    # Convert to lowercase
    text = text.lower()
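    # Remove punctuation (anything that is not a word character or whitespace);
    # \w also matches digits and underscores, so those survive this step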
    text = re.sub(r'[^\w\s]', '', text)
    # Remove digits
    text = re.sub(r'\d+', '', text)
    # Remove extra whitespace
    text = re.sub(r'\s+', ' ', text).strip()
    return text

# Apply cleaning
df['cleaned_text'] = df[text_col].apply(clean_text)

# Show cleaning example
print("Text cleaning example:")
example_idx = 0  # Using the first document as an example
original = df[text_col].iloc[example_idx]
cleaned = df['cleaned_text'].iloc[example_idx]

# Display the first 200 characters of each for comparison
print(f"\nOriginal text (first 200 chars):\n{str(original)[:200]}...")
print(f"\nCleaned text (first 200 chars):\n{cleaned[:200]}...")
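
To see what each cleaning step removes, the same four regular-expression steps can be traced on a single string. The sentence below is invented for illustration and is not taken from the dataset.

```python
import re

# Invented example sentence with punctuation, digits, and an underscore
sample = "Section 12(b) of the Act, as amended in 2019, applies to file_no 45."

step1 = sample.lower()
step2 = re.sub(r'[^\w\s]', '', step1)       # drops punctuation; keeps letters, digits, underscores
step3 = re.sub(r'\d+', '', step2)           # drops runs of digits
step4 = re.sub(r'\s+', ' ', step3).strip()  # collapses the gaps left behind

for label, value in [("lowercase", step1), ("no punctuation", step2),
                     ("no digits", step3), ("normalized", step4)]:
    print(f"{label:>15}: {value!r}")
# Final result: 'section b of the act as amended in applies to file_no'
```

The trace shows a trade-off: section numbers and years disappear along with the noise. For legal text, where citations such as "Section 12(b)" may carry meaning, it may be worth keeping digits or extracting citations before cleaning.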

## 4. Tokenization

Tokenization breaks text into smaller units such as sentences or words. We'll use NLTK's word_tokenize and sent_tokenize functions:

In [None]:
# Word tokenization
df['tokenized'] = df['cleaned_text'].apply(word_tokenize)

# Sentence tokenization
df['sentences'] = df[text_col].astype(str).apply(sent_tokenize)

# Show tokenization example
print("Tokenization example:")
example_idx = 0  # Using the first document as an example

# Display first 15 tokens
tokens = df['tokenized'].iloc[example_idx]
print(f"\nFirst 15 word tokens: {tokens[:15]}")

# Display first 2 sentences (if available)
sentences = df['sentences'].iloc[example_idx]
if len(sentences) >= 2:
    print(f"\nFirst 2 sentences:")
    for i, sent in enumerate(sentences[:2]):
        print(f"{i+1}. {sent}")
else:
    print(f"\nSentences: {sentences}")

# Create a histogram of token counts
token_counts = df['tokenized'].apply(len)
plt.figure(figsize=(10, 5))
sns.histplot(token_counts, kde=True, bins=30)
plt.title('Distribution of Token Count per Document')
plt.xlabel('Number of Tokens')
plt.ylabel('Frequency')
plt.show()
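
The cell above tokenizes words from the cleaned text but sentences from the raw text. The sketch below suggests why: sentence splitting relies on punctuation that cleaning removes. `naive_sentences` is a hypothetical regex splitter used only as a stand-in here; it is not NLTK's sentence tokenizer and will mishandle abbreviations.

```python
import re

def naive_sentences(text):
    """Split after ., ! or ? followed by whitespace (a rough stand-in, not NLTK)."""
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    return [p for p in parts if p]

raw = "The appeal was dismissed. Costs were awarded to the respondent."
cleaned = re.sub(r'[^\w\s]', '', raw.lower())  # same punctuation removal as clean_text

print(naive_sentences(raw))      # two sentences
print(naive_sentences(cleaned))  # one run-on string, boundaries are gone
print(cleaned.split())           # word-level tokens are still easy to recover
```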

## 5. Stopword Removal

Stopwords are common words (like "the", "a", "an", "in") that carry little topical information on their own. We'll remove them using NLTK's stopwords list:

In [None]:
# Get English stopwords
stop_words = set(stopwords.words('english'))
print(f"Number of stopwords: {len(stop_words)}")
print(f"Sample stopwords: {list(stop_words)[:10]}")

# Remove stopwords and non-alphanumeric tokens
df['no_stopwords'] = df['tokenized'].apply(
    lambda words: [w for w in words if w.lower() not in stop_words and w.isalnum()]
)

# Show stopwords removal example
example_idx = 0  # Using the first document
original_tokens = df['tokenized'].iloc[example_idx]
filtered_tokens = df['no_stopwords'].iloc[example_idx]

print("\nStopword removal example:")
print(f"Original token count: {len(original_tokens)}")
print(f"After stopword removal: {len(filtered_tokens)}")
print(f"\nFirst 15 original tokens: {original_tokens[:15]}")
print(f"First 15 filtered tokens: {filtered_tokens[:15]}")

# Calculate stopword percentage
original_counts = df['tokenized'].apply(len)
filtered_counts = df['no_stopwords'].apply(len)
percent_removed = ((original_counts - filtered_counts) / original_counts * 100).mean()

print(f"\nOn average, {percent_removed:.2f}% of tokens were removed as stopwords")
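
A general-purpose stopword list is not always a good fit for legal text. NLTK's English list includes negations such as "not" and "no", and dropping them can reverse the apparent meaning of a clause. The sketch below uses a small hand-written stopword set (hypothetical, much shorter than NLTK's) to show one way of keeping selected words.

```python
# Hypothetical mini stopword list for illustration; NLTK's list is longer
sample_stop_words = {"the", "a", "an", "in", "of", "not", "no", "be", "to", "is"}

# Words we want to keep even though the default list would drop them
keep = {"not", "no"}
custom_stop_words = sample_stop_words - keep

tokens = "the tenant shall not sublet the premises".split()

default_filtered = [t for t in tokens if t not in sample_stop_words]
custom_filtered = [t for t in tokens if t not in custom_stop_words]

print(default_filtered)  # ['tenant', 'shall', 'sublet', 'premises']
print(custom_filtered)   # ['tenant', 'shall', 'not', 'sublet', 'premises']
```

The same set subtraction works on the `stop_words` set built above; which words to keep is a judgment call that depends on the downstream task.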

## 6. Stemming

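Stemming reduces words to a root form (stem) by stripping affixes with fixed rules. For example, "running" and "runs" both become "run", while the irregular form "ran" is left unchanged, and some stems, such as "easili" for "easily", are not dictionary words. We'll use the Porter Stemmer algorithm: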

In [None]:
# Initialize Porter Stemmer
stemmer = PorterStemmer()

# Apply stemming to the filtered tokens
df['stemmed'] = df['no_stopwords'].apply(lambda words: [stemmer.stem(w) for w in words])

# Show stemming examples
example_idx = 0
filtered_tokens = df['no_stopwords'].iloc[example_idx]
stemmed_tokens = df['stemmed'].iloc[example_idx]

print("Stemming examples:")
print(f"\nFirst 15 tokens before stemming: {filtered_tokens[:15]}")
print(f"First 15 tokens after stemming: {stemmed_tokens[:15]}")

# Display some interesting examples of stemming
print("\nSpecific stemming examples:")
examples = ["running", "runs", "ran", "easily", "fairly", "wolves", "better", "was", "mice"]
for word in examples:
    print(f"{word} → {stemmer.stem(word)}")

# Show stemming results in a DataFrame for the first few tokens
stem_comparison = pd.DataFrame({
    'Original': filtered_tokens[:10],
    'Stemmed': stemmed_tokens[:10]
})
display(stem_comparison)
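
To make the idea of rule-based suffix stripping concrete, here is a deliberately crude stemmer written from scratch. It is a toy illustration under simple assumptions and is not the Porter algorithm, which applies many more rules and conditions; `toy_stem` is a hypothetical name.

```python
def toy_stem(word):
    """Strip one common suffix if at least three characters remain. NOT Porter."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

words = ["filing", "filed", "files", "ran", "statutes"]
stems = [toy_stem(w) for w in words]

for w, s in zip(words, stems):
    print(f"{w} -> {s}")
# filing -> fil, filed -> fil, files -> fil, ran -> ran, statutes -> statut
```

Even this toy version shows the two typical properties of stemming: related forms collapse to one stem, but the stem need not be a real word, and irregular forms such as "ran" are not connected to "run".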

## 7. Lemmatization

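Lemmatization is similar to stemming but returns actual dictionary words (lemmas). It is more sophisticated than stemming but slower. For example, WordNet maps "better" to "good" when the word is tagged as an adjective; without a part-of-speech hint, the lemmatizer treats every token as a noun and leaves "better" unchanged: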

In [None]:
# Initialize WordNet Lemmatizer
lemmatizer = WordNetLemmatizer()

# Apply lemmatization to the filtered tokens
# (no POS is passed, so every token is lemmatized as a noun)
df['lemmatized'] = df['no_stopwords'].apply(lambda words: [lemmatizer.lemmatize(w) for w in words])

# Show lemmatization examples
example_idx = 0
filtered_tokens = df['no_stopwords'].iloc[example_idx]
lemmatized_tokens = df['lemmatized'].iloc[example_idx]

print("Lemmatization examples:")
print(f"\nFirst 15 tokens before lemmatization: {filtered_tokens[:15]}")
print(f"First 15 tokens after lemmatization: {lemmatized_tokens[:15]}")

# Display some interesting examples of lemmatization
print("\nSpecific lemmatization examples:")
examples = ["better", "worse", "running", "mice", "wolves", "are", "feet", "children", "companies"]
for word in examples:
    print(f"{word} → {lemmatizer.lemmatize(word)}")

# Compare stemming and lemmatization side by side
comparison = []
for word in examples:
    comparison.append({
        'Original': word,
        'Stemmed': stemmer.stem(word),
        'Lemmatized': lemmatizer.lemmatize(word)
    })
    
comparison_df = pd.DataFrame(comparison)
display(comparison_df)

## 8. Part-of-Speech Tagging

POS tagging identifies the grammatical parts of speech of each word (noun, verb, adjective, etc.). This can be helpful for lemmatization and for understanding the structure of sentences:

In [None]:
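# Apply POS tagging to the filtered tokens
# (the tagger relies on context, so tagging full sentences before
# stopword removal is usually more accurate)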
df['pos_tags'] = df['no_stopwords'].apply(pos_tag)

# Show POS tagging examples
example_idx = 0
pos_tagged = df['pos_tags'].iloc[example_idx]

print("Part-of-Speech tagging example:")
print("\nFirst 15 tokens with POS tags:")
for i, (word, tag) in enumerate(pos_tagged[:15]):
    print(f"{i+1}. {word} → {tag}")

# Explain common POS tags
pos_tag_explanations = {
    'CC': 'Coordinating conjunction',
    'CD': 'Cardinal number',
    'DT': 'Determiner',
    'IN': 'Preposition or subordinating conjunction',
    'JJ': 'Adjective',
    'JJR': 'Adjective, comparative',
    'JJS': 'Adjective, superlative',
    'NN': 'Noun, singular or mass',
    'NNP': 'Proper noun, singular',
    'NNPS': 'Proper noun, plural',
    'NNS': 'Noun, plural',
    'RB': 'Adverb',
    'VB': 'Verb, base form',
    'VBD': 'Verb, past tense',
    'VBG': 'Verb, gerund or present participle',
    'VBN': 'Verb, past participle',
    'VBP': 'Verb, non-3rd person singular present',
    'VBZ': 'Verb, 3rd person singular present'
}

print("\nCommon POS tag explanations:")
for tag, explanation in list(pos_tag_explanations.items())[:10]:
    print(f"{tag}: {explanation}")

# Count frequency of each POS tag
all_tags = [tag for doc_tags in df['pos_tags'] for _, tag in doc_tags]
tag_freq = Counter(all_tags)

# Display the most common POS tags
print("\nMost common POS tags in the dataset:")
for tag, count in tag_freq.most_common(10):
    explanation = pos_tag_explanations.get(tag, "Other")
    print(f"{tag} ({explanation}): {count} occurrences")
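
POS tags can also feed the lemmatizer. The WordNet lemmatizer accepts a single-letter part-of-speech code ("n", "v", "a", "r"), whereas the tagger above emits Penn Treebank tags, so a small mapping is needed. The sketch below is one possible mapping; `penn_to_wordnet` is a hypothetical helper, and the tagged pairs are written by hand rather than produced by the tagger.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS code, defaulting to noun."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("RB"):
        return "r"  # adverb
    return "n"      # noun, which is also the lemmatizer's default

# Hand-written (word, tag) pairs for illustration
tagged = [("courts", "NNS"), ("ruled", "VBD"), ("better", "JJR"), ("quickly", "RB")]
pairs = [(word, penn_to_wordnet(tag)) for word, tag in tagged]

print(pairs)
# [('courts', 'n'), ('ruled', 'v'), ('better', 'a'), ('quickly', 'r')]
```

With NLTK loaded, each pair could then be passed on as `lemmatizer.lemmatize(word, pos=code)`, applied to tokens that were tagged before stopword removal.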

## 9. Text Visualization

Visualizing text data can provide insights about the most frequent terms and their relationships. Let's create a word cloud and frequency distribution plot:

In [None]:
# Get all words from the lemmatized tokens
all_lemmas = [word for doc_lemmas in df['lemmatized'] for word in doc_lemmas]
lemma_freq = Counter(all_lemmas)

# Plot frequency distribution of top words
top_n = 20
top_words = dict(lemma_freq.most_common(top_n))

plt.figure(figsize=(12, 6))
plt.bar(top_words.keys(), top_words.values())
plt.xticks(rotation=45, ha='right')
plt.title(f'Top {top_n} Most Frequent Words After Preprocessing')
plt.xlabel('Words')
plt.ylabel('Frequency')
plt.tight_layout()
plt.show()

# Generate a word cloud
try:
    # Create a text string for the word cloud
    lemmatized_text = ' '.join(all_lemmas)
    
    # Generate word cloud
    wordcloud = WordCloud(width=800, height=400, 
                         background_color='white',
                         max_words=200,
                         contour_width=3,
                         contour_color='steelblue')
    
    wordcloud.generate(lemmatized_text)
    
    # Display the word cloud
    plt.figure(figsize=(10, 6))
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.axis('off')
    plt.title('Word Cloud of Lemmatized Words')
    plt.tight_layout()
    plt.show()
except Exception as e:
    print(f"Couldn't generate word cloud: {e}")
    print("You may need to install the wordcloud package: pip install wordcloud")

## 10. Vectorization

To use our preprocessed text with machine learning algorithms, we need to convert the text into numerical vectors. We'll demonstrate two common approaches:

1. Bag of Words (Count Vectorization)
2. TF-IDF (Term Frequency-Inverse Document Frequency)

In [None]:
# Prepare text for vectorization by joining tokens
df['processed_text'] = df['lemmatized'].apply(lambda tokens: ' '.join(tokens))

# Use every document here; for a large dataset, subsample this list first
sample_texts = df['processed_text'].tolist()

print(f"Number of documents to vectorize: {len(sample_texts)}")
print(f"Sample document: {sample_texts[0][:100]}...")

# 1. Bag of Words Vectorization
count_vectorizer = CountVectorizer(max_features=1000)  # Limit to top 1000 features
bow_matrix = count_vectorizer.fit_transform(sample_texts)

print("\nBag of Words Vectorization:")
print(f"Shape of BoW matrix: {bow_matrix.shape}")
print(f"Number of unique words in vocabulary: {len(count_vectorizer.vocabulary_)}")
print(f"First 10 features (alphabetical): {list(count_vectorizer.get_feature_names_out()[:10])}")

# 2. TF-IDF Vectorization
tfidf_vectorizer = TfidfVectorizer(max_features=1000)
tfidf_matrix = tfidf_vectorizer.fit_transform(sample_texts)

print("\nTF-IDF Vectorization:")
print(f"Shape of TF-IDF matrix: {tfidf_matrix.shape}")
print(f"Number of unique words in vocabulary: {len(tfidf_vectorizer.vocabulary_)}")

# Display a sample document as vectors
doc_idx = 0  # First document

# Get the non-zero features for this document
bow_features = bow_matrix[doc_idx].nonzero()[1]
tfidf_features = tfidf_matrix[doc_idx].nonzero()[1]

# Display as a table
print("\nSample of vectorization for the first document:")
print("\nBag of Words representation (first 10 non-zero features):")
bow_names = count_vectorizer.get_feature_names_out()  # look up once, not per iteration
for i in bow_features[:10]:
    feature_name = bow_names[i]
    feature_value = bow_matrix[doc_idx, i]
    print(f"{feature_name}: {feature_value}")

print("\nTF-IDF representation (first 10 non-zero features):")
tfidf_names = tfidf_vectorizer.get_feature_names_out()  # look up once, not per iteration
for i in tfidf_features[:10]:
    feature_name = tfidf_names[i]
    feature_value = tfidf_matrix[doc_idx, i]
    print(f"{feature_name}: {feature_value:.4f}")
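
The numbers printed above can be reproduced by hand on a tiny corpus. The sketch below computes Bag of Words counts and TF-IDF weights in plain Python, using the smoothed IDF and L2 normalization that scikit-learn documents as the TfidfVectorizer defaults. The three documents are invented, and names such as `df_counts` and `tfidf_vector` are illustrative only.

```python
import math
from collections import Counter

# Invented mini corpus, already preprocessed
docs = ["court dismissed appeal", "court granted appeal", "tenant breached lease"]
tokenized = [d.split() for d in docs]
vocab = sorted({t for doc in tokenized for t in doc})
n_docs = len(tokenized)

# Bag of Words: raw term counts per document, one column per vocabulary term
bow = [[Counter(doc)[t] for t in vocab] for doc in tokenized]

# Document frequency: in how many documents each term appears
df_counts = Counter(t for doc in tokenized for t in set(doc))

# Smoothed IDF: ln((1 + n) / (1 + df)) + 1
idf = {t: math.log((1 + n_docs) / (1 + df_counts[t])) + 1 for t in vocab}

def tfidf_vector(doc):
    """Term count times IDF, scaled so the vector has unit L2 norm."""
    counts = Counter(doc)
    raw = [counts[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(x * x for x in raw))
    return [x / norm if norm else 0.0 for x in raw]

tfidf = [tfidf_vector(doc) for doc in tokenized]

print(vocab)
print(bow[0])
print([round(x, 3) for x in tfidf[0]])
```

In the first document, "dismissed" receives a higher weight than "court" or "appeal" because it appears in only one document, which is the behavior TF-IDF is designed to produce.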

## 11. Save Processed Data

Finally, let's save our preprocessed data for future use:

In [None]:
# Save the processed DataFrame to CSV
# (list-valued columns such as 'tokenized' are written as their string representation)
df.to_csv('legal_dataset_preprocessed.csv', index=False)
print("Saved preprocessed data to legal_dataset_preprocessed.csv")

# Save vectorized data (optional)
# Convert sparse matrices to DataFrames; toarray() builds a dense copy,
# which can use a lot of memory for a large corpus
bow_df = pd.DataFrame(bow_matrix.toarray(), columns=count_vectorizer.get_feature_names_out())
bow_df.to_csv('legal_dataset_bow_vectors.csv', index=False)
print("Saved Bag of Words vectors to legal_dataset_bow_vectors.csv")

tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=tfidf_vectorizer.get_feature_names_out())
tfidf_df.to_csv('legal_dataset_tfidf_vectors.csv', index=False)
print("Saved TF-IDF vectors to legal_dataset_tfidf_vectors.csv")

# Display final dataset columns
print("\nFinal preprocessed DataFrame columns:")
for i, col in enumerate(df.columns):
    print(f"{i+1}. {col}")

print("\nPreprocessing pipeline complete!")
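
One caveat when reloading the saved file: CSV has no list type, so a list-valued cell generally comes back as a string. The standard-library sketch below illustrates the effect and two ways to recover the tokens. The column names `tokens_repr` and `tokens_json` are hypothetical and do not appear in the saved files.

```python
import ast
import csv
import io
import json

tokens = ["court", "dismissed", "appeal"]

# Write one row holding the list in two forms: Python repr and JSON
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["tokens_repr", "tokens_json"])
writer.writerow([str(tokens), json.dumps(tokens)])

# Read it back: both cells are now plain strings
buffer.seek(0)
row = next(csv.DictReader(buffer))
print(type(row["tokens_repr"]).__name__)  # str

# Recover the original lists
from_repr = ast.literal_eval(row["tokens_repr"])  # parses the Python repr safely
from_json = json.loads(row["tokens_json"])        # parses the JSON text

print(from_repr == tokens, from_json == tokens)   # True True
```

Serializing token columns as JSON before saving keeps the format explicit; a binary format that preserves list columns is another option when the data does not need to be human-readable.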

## 12. Conclusion

In this notebook, we've covered the essential steps in NLP preprocessing:

1. **Data Exploration**: Understanding the structure and content of our dataset
2. **Text Cleaning**: Removing special characters and normalizing text
3. **Tokenization**: Breaking text into individual words and sentences
4. **Stopword Removal**: Filtering out common words that don't carry much meaning
5. **Stemming**: Reducing words to their word stem
6. **Lemmatization**: Converting words to their base dictionary form
7. **Part-of-Speech Tagging**: Identifying grammatical components
8. **Visualization**: Creating word clouds and frequency plots
9. **Vectorization**: Converting text to numerical vectors for machine learning

These preprocessing steps underpin many classical NLP tasks, including:
- Text classification
- Sentiment analysis
- Document clustering
- Topic modeling
- Information retrieval
- And more!

With our processed data, we're now ready to apply various machine learning algorithms for deeper analysis.